In this article, we will see Hadoop in action by configuring a single-node cluster on Ubuntu 20.04.
Before we start, it is assumed that you are:
Set up a virtual machine (VM) or cloud instance (or physical machine, if you wish) with a fresh installation of Ubuntu 20.04. Ensure that the VM or cloud instance is reachable from the host (with
ping or otherwise), and that the appropriate ports are open if a firewall is enabled. For the purposes of this tutorial, you may wish to disable the firewall (the default in Ubuntu 20.04) and / or enable all inbound traffic for the security group (or equivalent) associated with the cloud instance if you are using one, though note that you should configure your firewall and / or security group properly in a production environment and only allow the absolute minimum of network traffic required.
You may wish to substitute Ubuntu 20.04 with your Linux distribution (distro) of choice, though note that you may have to adapt the instructions accordingly if you choose to do so, due to differences in initial configuration, package names, etc. between distros.
First ensure the repository metadata is up to date:
$ sudo apt update
Hadoop 3.3.1, the latest version at the time of writing, requires Java 8 or 11, and the OpenSSH server to be running. On Ubuntu 20.04 server, OpenSSH is already installed and running by default so we do not need to do anything on that end. As for Java 11, install
default-jdk by executing the following command:
$ sudo apt install default-jdk
The default JDK in Ubuntu 20.04 is 11, but you can also install JDK 11 explicitly by installing the
openjdk-11-jdk package, as follows:
$ sudo apt install openjdk-11-jdk
Confirm that Java 11 is installed:
$ java -version openjdk version "11.0.11" 2021-04-20 OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04) OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)
Now let's set up some environment variables. Hadoop requires
JAVA_HOME to be set, and we will set an extra variable
HADOOP_HOME for reasons that will become clear later. Fire up your favorite text editor (mine is Vim) and append the following lines to
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64" export HADOOP_HOME="/opt/hadoop-3.3.1"
Alternatively, running the following commands will do the same thing without firing up a text editor:
$ echo "export JAVA_HOME=\"/usr/lib/jvm/java-11-openjdk-amd64\"" >> $HOME/.profile $ echo "export HADOOP_HOME=\"/opt/hadoop-3.3.1\"" >> $HOME/.profile
Source the file to apply the changes:
$ source $HOME/.profile
Confirm that the environment variables are correctly set:
$ echo $JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64 $ echo $HADOOP_HOME /opt/hadoop-3.3.1
In case you're wondering how I found the correct value of
JAVA_HOME, I found it by listing the files included in the
openjdk-11-jdk package using
$ dpkg -L openjdk-11-jdk
Now we should be ready to install Hadoop.
You might think that Hadoop would be available in the official Ubuntu repositories or as a Snap in 2021, considering how ubiquitous cloud computing and big data is nowadays, but you would be wrong. See this question on AskUbuntu for details. Actually, Hadoop is available in the Snap store, but it was last packaged in 2017 and only available in the unstable beta / edge channels as of November 2021 so we won't cover it here. The de-facto way of installing Hadoop in 2021 is (still) directly fetching the compressed tarball from upstream.
Fetch the upstream tarball for Hadoop 3.3.1 using
$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Unpack the tarball:
$ tar xvf hadoop-3.3.1.tar.gz
Now copy it to an appropriate location, e.g.
$ sudo cp -r hadoop-3.3.1 "$HADOOP_HOME"
This copies and places the Hadoop program files under
/opt, which is where manually installed software should be placed as dictated by the FHS.
Optionally clean up the downloaded archive and unpacked tarball:
$ rm -rf hadoop-3.3.1 hadoop-3.3.1.tar.gz
Hadoop binaries are located under
$HADOOP_HOME/bin. Add that to our
$ echo "export PATH=\"\$PATH:\$HADOOP_HOME/bin\"" >> $HOME/.profile $ source $HOME/.profile
If you did everything correctly, the
hadoop command should now be available and print usage instructions:
$ hadoop Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS] or hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS] where CLASSNAME is a user-provided Java class ... (more text)
Hadoop conveniently includes pre-written MapReduce examples so we can run an example right away to confirm that our installation is working as expected. By default, Hadoop runs in a single process which is convenient for testing (but not for production usage).
Let's run WordCount on a text sample. I'll use a copy of the King James Bible (KJB) for my text sample, simply because it almost looks large enough to resemble a production workload, yet small enough to process on a single node with limited storage. Depending on your preferences, you may want to replace it with Don Quixote or whatever.
Fetch our text sample:
$ wget https://raw.githubusercontent.com/ErikSchierboom/sentencegenerator/master/samples/the-king-james-bible.txt
Now create an input directory and place the text sample there:
$ mkdir input $ mv the-king-james-bible.txt input
My KJB copy is 4.2 megabytes in size:
$ ls -lh input total 4.2M -rw-rw-r-- 1 ubuntu ubuntu 4.2M Nov 14 10:20 the-king-james-bible.txt
Recall that HDFS blocks are 128M in size by default. Assuming that the size of our output is roughly in the same order of magnitude of our input size, we would expect the output to fit in a single block.
Finally we invoke the WordCount MapReduce program with input directory
input and output directory
$ hadoop jar "$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar" wordcount input output
Confirm that the output has been successfully generated:
$ ls -lh output total 336K -rw-r--r-- 1 ubuntu ubuntu 0 Nov 14 10:26 _SUCCESS -rw-r--r-- 1 ubuntu ubuntu 334K Nov 14 10:26 part-r-00000
So our output fits in a single block (
output/part-r-00000) as expected.
Finally, let's take the last 10 lines of our output (or the first 10 lines, or whatever):
$ tail output/part-r-00000 youth: 7 youth; 8 youth? 2 youthful 1 youths 1 youths, 1 zeal 13 zeal, 3 zealous 8 zealously 2
Looks good. Hadoop is working as expected. However, Hadoop never runs as a single process in production. Let's change that. But before that, let's remove the
$ rm -rf output
Reference: Pseudo-Distributed Operation
Recall that a Hadoop cluster typically consists of multiple daemons (long-running background processes):
- One or more
In a real production setting, these daemons would each execute on a different node (hence their names), so that the cluster is resistant to the failure of individual
DataNodes. However, here we will configure all daemons to run on the same node. While such a configuration negates the reliability guarantees of Hadoop and does not allow us to run production workloads, it is nonetheless one step closer to how a production cluster might be configured and should hopefully give us an idea how one might go about configuring a multi-node cluster.
Before we execute the relevant daemons, we need to edit two configuration files:
NameNoderuns in the cluster
$HADOOP_HOME/etc/hadoop/hdfs-site.xml: HDFS daemon configuration:
core-site.xml in your favorite text editor:
$ sudo vim "$HADOOP_HOME/etc/hadoop/core-site.xml"
<configuration></configuration> tags and replace them with the following:
<configuration> <property> <name>fs.defaultFS</name> <value>hdfs://localhost:9000</value> </property> </configuration>
fs.defaultFS specifies the name of the default file system to be used (source).
hdfs-site.xml in your favorite text editor:
$ sudo vim "$HADOOP_HOME/etc/hadoop/hdfs-site.xml"
<configuration></configuration> tags again, and replace them with the following:
<configuration> <property> <name>dfs.replication</name> <value>1</value> </property> </configuration>
dfs.replication specifies the number of replicas to create per block (source). Usually in production, 3 is a good number (and also the default if not specified), but since we're running the entire cluster in a single node, negating any advantages of block replication, we might as well set it to 1 to save space.
Since each daemon typically runs in a different node in production, they communicate with each other over SSH. By default, authenticating with SSH requires the user to interactively enter the correct password, which is unacceptable here since the cluster should function continuously without user interaction. So we need to set up public key authentication instead. Since, in our case, all daemons run on the same node, it suffices to set up public key authentication for connecting to localhost:
$ ssh-keygen $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys $ chmod 600 $HOME/.ssh/authorized_keys
For the first command
ssh-keygen, feel free to use the defaults and no passphrase by pressing Enter repeatedly.
Confirm that we can SSH to localhost without a password:
$ ssh localhost $ exit logout Connection to localhost closed.
Now we are almost ready to start the HDFS daemons. But first, we need to format the filesystem. However, there is a problem:
$HADOOP_HOME is currently owned by
root, so any HDFS operations need to be run as root, which is a bad idea. To rectify that, change ownership of
$HADOOP_HOME (and everything in it) to the current non-root user:
$ sudo chown -R "$(whoami):" "$HADOOP_HOME"
Now format the filesystem:
$ hdfs namenode -format
It seems we earlier missed
$HADOOP_HOME/sbin in our
PATH. Fix that:
$ echo "export PATH=\"\$PATH:\$HADOOP_HOME/sbin\"" >> $HOME/.profile $ source $HOME/.profile
The script for starting the HDFS daemons also cannot see
JAVA_HOME unless you add it to
$ echo "export JAVA_HOME=\"/usr/lib/jvm/java-11-openjdk-amd64\"" >> "$HADOOP_HOME/etc/hadoop/hadoop-env.sh"
Finally, we can start
$ start-dfs.sh Starting namenodes on [localhost] Starting datanodes Starting secondary namenodes [ip-172-31-4-232]
If you get output similar to the above (the last line may differ slightly), you should be fine. Confirm this by visiting http://22.214.171.124:9870 , replacing
126.96.36.199 with the IP address of your node.
Now make some HDFS directories required for MapReduce jobs:
$ hdfs dfs -mkdir -p "/user/$(whoami)"
Copy the KJB (or whatever sample text you used) into our distributed filesystem:
$ hdfs dfs -mkdir input $ hdfs dfs -put input/the-king-james-bible.txt input
$ hadoop jar "$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar" wordcount input output
Examine the output:
$ hdfs dfs -ls output Found 2 items -rw-r--r-- 1 ubuntu supergroup 0 2021-11-14 14:49 output/_SUCCESS -rw-r--r-- 1 ubuntu supergroup 341453 2021-11-14 14:49 output/part-r-00000 $ hdfs dfs -tail output/part-r-00000 | tail youth: 7 youth; 8 youth? 2 youthful 1 youths 1 youths, 1 zeal 13 zeal, 3 zealous 8 zealously 2
tail outputs the last kilobyte of data and does not seem to be configurable so we pipe its output to ordinary
tail to get just the last 10 lines)
If you wish, you can also copy the output from our distributed filesystem back to our local filesystem and inspect the output there:
$ hdfs dfs -get output output $ ls -l output total 336 -rw-r--r-- 1 ubuntu ubuntu 0 Nov 14 14:58 _SUCCESS -rw-r--r-- 1 ubuntu ubuntu 341453 Nov 14 14:58 part-r-00000 $ tail output/part-r-00000 youth: 7 youth; 8 youth? 2 youthful 1 youths 1 youths, 1 zeal 13 zeal, 3 zealous 8 zealously 2
Congratulations, you have successfully orchestrated a single-node Hadoop cluster and run WordCount on it!
To cleanup, stop
$ stop-dfs.sh Stopping namenodes on [localhost] Stopping datanodes Stopping secondary namenodes [ip-172-31-4-232]
We saw how to install Hadoop and configure it to run as a single-node cluster. However, to leverage the true power of Hadoop (multiple map tasks in parallel, fault tolerance, etc.), a multi-node cluster is required. We shall explore this in a subsequent article, so stay tuned ;-)
- Hadoop: http://hadoop.apache.org/
- HDFS architecture: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
- Hadoop single-node cluster: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
- Why are Hadoop and Spark not in the official Ubuntu repositories?: https://askubuntu.com/questions/1375281/why-are-hadoop-and-spark-not-in-the-official-ubuntu-repositories
- FHS 3.0: https://refspecs.linuxfoundation.org/FHS_3.0/fhs-3.0.pdf
- Standalone operation (Hadoop): https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation
- WordCount: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0
- Pseudo-Distributed Operation: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
- Explaining Hadoop Configuration (Edureka): https://www.edureka.co/blog/explaining-hadoop-configuration/
- Hadoop 3.3.1
- Hadoop 3.3.1