DEV Community

Donald Sebastian Leung
Donald Sebastian Leung

Posted on

Setting up a single-node Hadoop cluster

In this article, we will see Hadoop in action by configuring a single-node cluster on Ubuntu 20.04.

Prerequisites

Before we start, it is assumed that you are:

  • Familiar with basic Linux commands and operation
  • Aware of what Hadoop is and how it works

Initial setup

Set up a virtual machine (VM) or cloud instance (or physical machine, if you wish) with a fresh installation of Ubuntu 20.04. Ensure that the VM or cloud instance is reachable from the host (with ping or otherwise), and that the appropriate ports are open if a firewall is enabled. For the purposes of this tutorial, you may wish to disable the firewall (the default in Ubuntu 20.04) and / or enable all inbound traffic for the security group (or equivalent) associated with the cloud instance if you are using one, though note that you should configure your firewall and / or security group properly in a production environment and only allow the absolute minimum of network traffic required.

You may wish to substitute Ubuntu 20.04 with your Linux distribution (distro) of choice, though note that you may have to adapt the instructions accordingly if you choose to do so, due to differences in initial configuration, package names, etc. between distros.

Installing Hadoop dependencies, setting up the environment

First ensure the repository metadata is up to date:

$ sudo apt update
Enter fullscreen mode Exit fullscreen mode

Hadoop 3.3.1, the latest version at the time of writing, requires Java 8 or 11, and the OpenSSH server to be running. On Ubuntu 20.04 server, OpenSSH is already installed and running by default so we do not need to do anything on that end. As for Java 11, install default-jdk by executing the following command:

$ sudo apt install default-jdk
Enter fullscreen mode Exit fullscreen mode

The default JDK in Ubuntu 20.04 is 11, but you can also install JDK 11 explicitly by installing the openjdk-11-jdk package, as follows:

$ sudo apt install openjdk-11-jdk
Enter fullscreen mode Exit fullscreen mode

Confirm that Java 11 is installed:

$ java -version
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)
Enter fullscreen mode Exit fullscreen mode

Now let's set up some environment variables. Hadoop requires JAVA_HOME to be set, and we will set an extra variable HADOOP_HOME for reasons that will become clear later. Fire up your favorite text editor (mine is Vim) and append the following lines to $HOME/.profile:

export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
export HADOOP_HOME="/opt/hadoop-3.3.1"
Enter fullscreen mode Exit fullscreen mode

Alternatively, running the following commands will do the same thing without firing up a text editor:

$ echo "export JAVA_HOME=\"/usr/lib/jvm/java-11-openjdk-amd64\"" >> $HOME/.profile
$ echo "export HADOOP_HOME=\"/opt/hadoop-3.3.1\"" >> $HOME/.profile
Enter fullscreen mode Exit fullscreen mode

Source the file to apply the changes:

$ source $HOME/.profile
Enter fullscreen mode Exit fullscreen mode

Confirm that the environment variables are correctly set:

$ echo $JAVA_HOME
/usr/lib/jvm/java-11-openjdk-amd64
$ echo $HADOOP_HOME
/opt/hadoop-3.3.1
Enter fullscreen mode Exit fullscreen mode

In case you're wondering how I found the correct value of JAVA_HOME, I found it by listing the files included in the openjdk-11-jdk package using dpkg:

$ dpkg -L openjdk-11-jdk
Enter fullscreen mode Exit fullscreen mode

Now we should be ready to install Hadoop.

Installing Hadoop 3.3.1

You might think that Hadoop would be available in the official Ubuntu repositories or as a Snap in 2021, considering how ubiquitous cloud computing and big data is nowadays, but you would be wrong. See this question on AskUbuntu for details. Actually, Hadoop is available in the Snap store, but it was last packaged in 2017 and only available in the unstable beta / edge channels as of November 2021 so we won't cover it here. The de-facto way of installing Hadoop in 2021 is (still) directly fetching the compressed tarball from upstream.

Fetch the upstream tarball for Hadoop 3.3.1 using wget:

$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Enter fullscreen mode Exit fullscreen mode

Unpack the tarball:

$ tar xvf hadoop-3.3.1.tar.gz
Enter fullscreen mode Exit fullscreen mode

Now copy it to an appropriate location, e.g. $HADOOP_HOME:

$ sudo cp -r hadoop-3.3.1 "$HADOOP_HOME"
Enter fullscreen mode Exit fullscreen mode

This copies and places the Hadoop program files under /opt, which is where manually installed software should be placed as dictated by the FHS.

Optionally clean up the downloaded archive and unpacked tarball:

$ rm -rf hadoop-3.3.1 hadoop-3.3.1.tar.gz
Enter fullscreen mode Exit fullscreen mode

Hadoop binaries are located under $HADOOP_HOME/bin. Add that to our PATH:

$ echo "export PATH=\"\$PATH:\$HADOOP_HOME/bin\"" >> $HOME/.profile
$ source $HOME/.profile
Enter fullscreen mode Exit fullscreen mode

If you did everything correctly, the hadoop command should now be available and print usage instructions:

$ hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class
... (more text)
Enter fullscreen mode Exit fullscreen mode

Hadoop conveniently includes pre-written MapReduce examples so we can run an example right away to confirm that our installation is working as expected. By default, Hadoop runs in a single process which is convenient for testing (but not for production usage).

Let's run WordCount on a text sample. I'll use a copy of the King James Bible (KJB) for my text sample, simply because it almost looks large enough to resemble a production workload, yet small enough to process on a single node with limited storage. Depending on your preferences, you may want to replace it with Don Quixote or whatever.

Fetch our text sample:

$ wget https://raw.githubusercontent.com/ErikSchierboom/sentencegenerator/master/samples/the-king-james-bible.txt
Enter fullscreen mode Exit fullscreen mode

Now create an input directory and place the text sample there:

$ mkdir input
$ mv the-king-james-bible.txt input
Enter fullscreen mode Exit fullscreen mode

My KJB copy is 4.2 megabytes in size:

$ ls -lh input
total 4.2M
-rw-rw-r-- 1 ubuntu ubuntu 4.2M Nov 14 10:20 the-king-james-bible.txt
Enter fullscreen mode Exit fullscreen mode

Recall that HDFS blocks are 128M in size by default. Assuming that the size of our output is roughly in the same order of magnitude of our input size, we would expect the output to fit in a single block.

Finally we invoke the WordCount MapReduce program with input directory input and output directory output:

$ hadoop jar "$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar" wordcount input output
Enter fullscreen mode Exit fullscreen mode

Confirm that the output has been successfully generated:

$ ls -lh output
total 336K
-rw-r--r-- 1 ubuntu ubuntu    0 Nov 14 10:26 _SUCCESS
-rw-r--r-- 1 ubuntu ubuntu 334K Nov 14 10:26 part-r-00000
Enter fullscreen mode Exit fullscreen mode

So our output fits in a single block (output/part-r-00000) as expected.

Finally, let's take the last 10 lines of our output (or the first 10 lines, or whatever):

$ tail output/part-r-00000
youth:  7
youth;  8
youth?  2
youthful    1
youths  1
youths, 1
zeal    13
zeal,   3
zealous 8
zealously   2
Enter fullscreen mode Exit fullscreen mode

Looks good. Hadoop is working as expected. However, Hadoop never runs as a single process in production. Let's change that. But before that, let's remove the output directory:

$ rm -rf output
Enter fullscreen mode Exit fullscreen mode

Configuring a pseudo-distributed cluster

Reference: Pseudo-Distributed Operation

Recall that a Hadoop cluster typically consists of multiple daemons (long-running background processes):

  • A NameNode
  • A SecondaryNameNode
  • One or more DataNodes

In a real production setting, these daemons would each execute on a different node (hence their names), so that the cluster is resistant to the failure of individual DataNodes. However, here we will configure all daemons to run on the same node. While such a configuration negates the reliability guarantees of Hadoop and does not allow us to run production workloads, it is nonetheless one step closer to how a production cluster might be configured and should hopefully give us an idea how one might go about configuring a multi-node cluster.

Before we execute the relevant daemons, we need to edit two configuration files:

  1. $HADOOP_HOME/etc/hadoop/core-site.xml: where NameNode runs in the cluster
  2. $HADOOP_HOME/etc/hadoop/hdfs-site.xml: HDFS daemon configuration: NameNode, SecondaryNameNode, DataNodes

(source)

Now open core-site.xml in your favorite text editor:

$ sudo vim "$HADOOP_HOME/etc/hadoop/core-site.xml"
Enter fullscreen mode Exit fullscreen mode

Find the <configuration></configuration> tags and replace them with the following:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Enter fullscreen mode Exit fullscreen mode

Here, fs.defaultFS specifies the name of the default file system to be used (source).

Next, open hdfs-site.xml in your favorite text editor:

$ sudo vim "$HADOOP_HOME/etc/hadoop/hdfs-site.xml"
Enter fullscreen mode Exit fullscreen mode

Find the <configuration></configuration> tags again, and replace them with the following:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Enter fullscreen mode Exit fullscreen mode

Here, dfs.replication specifies the number of replicas to create per block (source). Usually in production, 3 is a good number (and also the default if not specified), but since we're running the entire cluster in a single node, negating any advantages of block replication, we might as well set it to 1 to save space.

Since each daemon typically runs in a different node in production, they communicate with each other over SSH. By default, authenticating with SSH requires the user to interactively enter the correct password, which is unacceptable here since the cluster should function continuously without user interaction. So we need to set up public key authentication instead. Since, in our case, all daemons run on the same node, it suffices to set up public key authentication for connecting to localhost:

$ ssh-keygen
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ chmod 600 $HOME/.ssh/authorized_keys
Enter fullscreen mode Exit fullscreen mode

For the first command ssh-keygen, feel free to use the defaults and no passphrase by pressing Enter repeatedly.

Confirm that we can SSH to localhost without a password:

$ ssh localhost
$ exit
logout
Connection to localhost closed.
Enter fullscreen mode Exit fullscreen mode

Now we are almost ready to start the HDFS daemons. But first, we need to format the filesystem. However, there is a problem: $HADOOP_HOME is currently owned by root, so any HDFS operations need to be run as root, which is a bad idea. To rectify that, change ownership of $HADOOP_HOME (and everything in it) to the current non-root user:

$ sudo chown -R "$(whoami):" "$HADOOP_HOME"
Enter fullscreen mode Exit fullscreen mode

Now format the filesystem:

$ hdfs namenode -format
Enter fullscreen mode Exit fullscreen mode

It seems we earlier missed $HADOOP_HOME/sbin in our PATH. Fix that:

$ echo "export PATH=\"\$PATH:\$HADOOP_HOME/sbin\"" >> $HOME/.profile
$ source $HOME/.profile
Enter fullscreen mode Exit fullscreen mode

The script for starting the HDFS daemons also cannot see JAVA_HOME unless you add it to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:

$ echo "export JAVA_HOME=\"/usr/lib/jvm/java-11-openjdk-amd64\"" >> "$HADOOP_HOME/etc/hadoop/hadoop-env.sh"
Enter fullscreen mode Exit fullscreen mode

Finally, we can start NameNode and DataNode:

$ start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [ip-172-31-4-232]
Enter fullscreen mode Exit fullscreen mode

If you get output similar to the above (the last line may differ slightly), you should be fine. Confirm this by visiting http://16.162.217.78:9870 , replacing 16.162.217.78 with the IP address of your node.

Now make some HDFS directories required for MapReduce jobs:

$ hdfs dfs -mkdir -p "/user/$(whoami)"
Enter fullscreen mode Exit fullscreen mode

Copy the KJB (or whatever sample text you used) into our distributed filesystem:

$ hdfs dfs -mkdir input
$ hdfs dfs -put input/the-king-james-bible.txt input
Enter fullscreen mode Exit fullscreen mode

Run WordCount:

$ hadoop jar "$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar" wordcount input output
Enter fullscreen mode Exit fullscreen mode

Examine the output:

$ hdfs dfs -ls output
Found 2 items
-rw-r--r--   1 ubuntu supergroup          0 2021-11-14 14:49 output/_SUCCESS
-rw-r--r--   1 ubuntu supergroup     341453 2021-11-14 14:49 output/part-r-00000
$ hdfs dfs -tail output/part-r-00000 | tail
youth:  7
youth;  8
youth?  2
youthful    1
youths  1
youths, 1
zeal    13
zeal,   3
zealous 8
zealously   2
Enter fullscreen mode Exit fullscreen mode

(HDFS tail outputs the last kilobyte of data and does not seem to be configurable so we pipe its output to ordinary tail to get just the last 10 lines)

If you wish, you can also copy the output from our distributed filesystem back to our local filesystem and inspect the output there:

$ hdfs dfs -get output output
$ ls -l output
total 336
-rw-r--r-- 1 ubuntu ubuntu      0 Nov 14 14:58 _SUCCESS
-rw-r--r-- 1 ubuntu ubuntu 341453 Nov 14 14:58 part-r-00000
$ tail output/part-r-00000
youth:  7
youth;  8
youth?  2
youthful    1
youths  1
youths, 1
zeal    13
zeal,   3
zealous 8
zealously   2
Enter fullscreen mode Exit fullscreen mode

Congratulations, you have successfully orchestrated a single-node Hadoop cluster and run WordCount on it!

To cleanup, stop NameNode, DataNode:

$ stop-dfs.sh
Stopping namenodes on [localhost]
Stopping datanodes
Stopping secondary namenodes [ip-172-31-4-232]
Enter fullscreen mode Exit fullscreen mode

Conclusion

We saw how to install Hadoop and configure it to run as a single-node cluster. However, to leverage the true power of Hadoop (multiple map tasks in parallel, fault tolerance, etc.), a multi-node cluster is required. We shall explore this in a subsequent article, so stay tuned ;-)

References

Discussion (0)