In this article, we will see Hadoop in action by configuring a single-node cluster on Ubuntu 20.04.
Prerequisites
Before we start, it is assumed that you are:
- Familiar with basic Linux commands and operation
- Aware of what Hadoop is and how it works
Initial setup
Set up a virtual machine (VM) or cloud instance (or physical machine, if you wish) with a fresh installation of Ubuntu 20.04. Ensure that the VM or cloud instance is reachable from the host (with ping or otherwise), and that the appropriate ports are open if a firewall is enabled. For the purposes of this tutorial, you may wish to leave the firewall disabled (the default in Ubuntu 20.04) and/or allow all inbound traffic on the security group (or equivalent) associated with the cloud instance if you are using one. Note, however, that in a production environment you should configure your firewall and/or security group properly and only allow the absolute minimum of network traffic required.
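If you do have a firewall enabled and would rather not open everything up, a minimal ufw sketch for this tutorial might look like the following. The port numbers are assumptions based on what we use later in this article: 22 for SSH and 9870 for the NameNode web UI (the HDFS RPC port 9000 is only used locally here).
$ sudo ufw allow 22/tcp    # SSH access to the node
$ sudo ufw allow 9870/tcp  # NameNode web UI, visited towards the end of this tutorial
$ sudo ufw status verbose  # confirm the rules took effect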
You may wish to substitute Ubuntu 20.04 with your Linux distribution (distro) of choice, though note that you may have to adapt the instructions accordingly if you choose to do so, due to differences in initial configuration, package names, etc. between distros.
Installing Hadoop dependencies, setting up the environment
First ensure the repository metadata is up to date:
$ sudo apt update
Hadoop 3.3.1, the latest version at the time of writing, requires Java 8 or 11 and a running OpenSSH server. On Ubuntu 20.04 Server, OpenSSH is already installed and running by default, so we do not need to do anything on that end. As for Java 11, install the default-jdk package by executing the following command:
$ sudo apt install default-jdk
The default JDK in Ubuntu 20.04 is 11, but you can also install JDK 11 explicitly by installing the openjdk-11-jdk package, as follows:
$ sudo apt install openjdk-11-jdk
Confirm that Java 11 is installed:
$ java -version
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)
Now let's set up some environment variables. Hadoop requires JAVA_HOME to be set, and we will set an extra variable, HADOOP_HOME, for reasons that will become clear later. Fire up your favorite text editor (mine is Vim) and append the following lines to $HOME/.profile:
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
export HADOOP_HOME="/opt/hadoop-3.3.1"
Alternatively, running the following commands will do the same thing without firing up a text editor:
$ echo "export JAVA_HOME=\"/usr/lib/jvm/java-11-openjdk-amd64\"" >> $HOME/.profile
$ echo "export HADOOP_HOME=\"/opt/hadoop-3.3.1\"" >> $HOME/.profile
Source the file to apply the changes:
$ source $HOME/.profile
Confirm that the environment variables are correctly set:
$ echo $JAVA_HOME
/usr/lib/jvm/java-11-openjdk-amd64
$ echo $HADOOP_HOME
/opt/hadoop-3.3.1
In case you're wondering how I found the correct value of JAVA_HOME: I listed the files included in the openjdk-11-jdk package using dpkg:
$ dpkg -L openjdk-11-jdk
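If you'd rather not scan the package file list, another quick way (assuming the update-alternatives symlinks that Ubuntu sets up by default) is to resolve the java binary's symlink; everything before /bin/java is the directory you want for JAVA_HOME:
$ readlink -f "$(which java)"
/usr/lib/jvm/java-11-openjdk-amd64/bin/java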
Now we should be ready to install Hadoop.
Installing Hadoop 3.3.1
You might think that Hadoop would be available in the official Ubuntu repositories or as a Snap in 2021, considering how ubiquitous cloud computing and big data are nowadays, but you would be wrong. See this question on AskUbuntu for details. Actually, Hadoop is available in the Snap store, but it was last packaged in 2017 and is only available in the unstable beta/edge channels as of November 2021, so we won't cover it here. The de facto way of installing Hadoop in 2021 is (still) to fetch the compressed tarball directly from upstream.
Fetch the upstream tarball for Hadoop 3.3.1 using wget:
$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
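Optionally, verify the integrity of the download. Apache publishes a .sha512 checksum file next to each release tarball; assuming it is still available at the same location (older releases eventually move to the Apache archive), you can fetch and check it like so (depending on the exact format of the checksum file, shasum -a 512 -c may work where sha512sum -c does not):
$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz.sha512
$ sha512sum -c hadoop-3.3.1.tar.gz.sha512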
Unpack the tarball:
$ tar xvf hadoop-3.3.1.tar.gz
Now copy it to an appropriate location, e.g. $HADOOP_HOME:
$ sudo cp -r hadoop-3.3.1 "$HADOOP_HOME"
This places the Hadoop program files under /opt, which is where manually installed software should be placed as dictated by the FHS.
Optionally clean up the downloaded archive and unpacked tarball:
$ rm -rf hadoop-3.3.1 hadoop-3.3.1.tar.gz
Hadoop binaries are located under $HADOOP_HOME/bin. Add that directory to our PATH:
$ echo "export PATH=\"\$PATH:\$HADOOP_HOME/bin\"" >> $HOME/.profile
$ source $HOME/.profile
If you did everything correctly, the hadoop command should now be available and print usage instructions:
$ hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
or hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
where CLASSNAME is a user-provided Java class
... (more text)
Hadoop conveniently includes pre-written MapReduce examples, so we can run one right away to confirm that our installation is working as expected. By default, Hadoop runs in a single process (standalone mode), which is convenient for testing but not for production use.
Let's run WordCount on a text sample. I'll use a copy of the King James Bible (KJB), simply because it is just large enough to vaguely resemble a production workload, yet small enough to process on a single node with limited storage. Depending on your preferences, you may want to replace it with Don Quixote or whatever.
Fetch our text sample:
$ wget https://raw.githubusercontent.com/ErikSchierboom/sentencegenerator/master/samples/the-king-james-bible.txt
Now create an input directory and place the text sample there:
$ mkdir input
$ mv the-king-james-bible.txt input
My KJB copy is 4.2 megabytes in size:
$ ls -lh input
total 4.2M
-rw-rw-r-- 1 ubuntu ubuntu 4.2M Nov 14 10:20 the-king-james-bible.txt
Recall that HDFS blocks are 128M in size by default. Assuming that our output is roughly the same order of magnitude as our input, we would expect the output to fit in a single block.
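If you want to confirm that default on your installation, hdfs getconf prints the effective value of a configuration key (134217728 bytes is 128M); since we haven't changed any HDFS settings yet, it should report the built-in default:
$ hdfs getconf -confKey dfs.blocksize
134217728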
Finally, we invoke the WordCount MapReduce program with input directory input and output directory output:
$ hadoop jar "$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar" wordcount input output
Confirm that the output has been successfully generated:
$ ls -lh output
total 336K
-rw-r--r-- 1 ubuntu ubuntu 0 Nov 14 10:26 _SUCCESS
-rw-r--r-- 1 ubuntu ubuntu 334K Nov 14 10:26 part-r-00000
So our output fits in a single block (output/part-r-00000), as expected.
Finally, let's take the last 10 lines of our output (or the first 10 lines, or whatever):
$ tail output/part-r-00000
youth: 7
youth; 8
youth? 2
youthful 1
youths 1
youths, 1
zeal 13
zeal, 3
zealous 8
zealously 2
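Since the output is just tab-separated word/count pairs, ordinary shell tools can slice it further. For instance, sorting numerically (and in reverse) on the count column surfaces the most frequent tokens:
$ sort -k2,2 -nr output/part-r-00000 | head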
Looks good; Hadoop is working as expected. However, Hadoop never runs as a single process in production, so let's change that. But first, let's remove the output directory:
$ rm -rf output
Configuring a pseudo-distributed cluster
Reference: Pseudo-Distributed Operation
Recall that a Hadoop cluster typically consists of multiple daemons (long-running background processes):
- A NameNode
- A SecondaryNameNode
- One or more DataNodes
In a real production setting, these daemons would each execute on a different node (hence their names), so that the cluster is resistant to the failure of individual DataNodes. However, here we will configure all daemons to run on the same node. While such a configuration negates the reliability guarantees of Hadoop and does not allow us to run production workloads, it is nonetheless one step closer to how a production cluster might be configured and should hopefully give us an idea of how one might go about configuring a multi-node cluster.
Before we execute the relevant daemons, we need to edit two configuration files:
- $HADOOP_HOME/etc/hadoop/core-site.xml: where the NameNode runs in the cluster
- $HADOOP_HOME/etc/hadoop/hdfs-site.xml: HDFS daemon configuration: NameNode, SecondaryNameNode, DataNodes
(source)
Now open core-site.xml in your favorite text editor:
$ sudo vim "$HADOOP_HOME/etc/hadoop/core-site.xml"
Find the <configuration></configuration> tags and replace them with the following:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Here, fs.defaultFS specifies the name of the default file system to be used (source).
Next, open hdfs-site.xml in your favorite text editor:
$ sudo vim "$HADOOP_HOME/etc/hadoop/hdfs-site.xml"
Find the <configuration></configuration> tags again, and replace them with the following:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Here, dfs.replication specifies the number of replicas to create per block (source). Usually in production, 3 is a good number (and also the default if not specified), but since we're running the entire cluster on a single node, negating any advantages of block replication, we might as well set it to 1 to save space.
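If you want to double-check that Hadoop actually picks up these edits (it reads its configuration from $HADOOP_HOME/etc/hadoop by default), hdfs getconf will print the effective value of a given key:
$ hdfs getconf -confKey fs.defaultFS
hdfs://localhost:9000
$ hdfs getconf -confKey dfs.replication
1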
Since each daemon typically runs in a different node in production, they communicate with each other over SSH. By default, authenticating with SSH requires the user to interactively enter the correct password, which is unacceptable here since the cluster should function continuously without user interaction. So we need to set up public key authentication instead. Since, in our case, all daemons run on the same node, it suffices to set up public key authentication for connecting to localhost:
$ ssh-keygen
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ chmod 600 $HOME/.ssh/authorized_keys
For the first command, ssh-keygen, feel free to use the defaults and no passphrase by pressing Enter repeatedly.
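Alternatively, that first step can be done non-interactively in one shot (an RSA key with an empty passphrase written to the default location):
$ ssh-keygen -t rsa -P '' -f $HOME/.ssh/id_rsa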
Confirm that we can SSH to localhost without a password:
$ ssh localhost
$ exit
logout
Connection to localhost closed.
Now we are almost ready to start the HDFS daemons. But first, we need to format the filesystem. However, there is a problem: $HADOOP_HOME is currently owned by root, so any HDFS operations need to be run as root, which is a bad idea. To rectify that, change ownership of $HADOOP_HOME (and everything in it) to the current non-root user:
$ sudo chown -R "$(whoami):" "$HADOOP_HOME"
Now format the filesystem:
$ hdfs namenode -format
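Note that, with the defaults we are using, the formatted NameNode metadata (and, later, the DataNode blocks) live under hadoop.tmp.dir, which defaults to /tmp/hadoop-<username>. Assuming those defaults, you can peek at the freshly written metadata; keep in mind that /tmp may be cleared on reboot, which is fine for a throwaway tutorial cluster but not for anything you care about:
$ ls /tmp/hadoop-$(whoami)/dfs/name/current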
It seems we earlier missed $HADOOP_HOME/sbin (which contains the scripts for starting and stopping the daemons) in our PATH. Fix that:
$ echo "export PATH=\"\$PATH:\$HADOOP_HOME/sbin\"" >> $HOME/.profile
$ source $HOME/.profile
The script for starting the HDFS daemons also cannot see JAVA_HOME unless you add it to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
$ echo "export JAVA_HOME=\"/usr/lib/jvm/java-11-openjdk-amd64\"" >> "$HADOOP_HOME/etc/hadoop/hadoop-env.sh"
Finally, we can start the NameNode, DataNode, and SecondaryNameNode daemons:
$ start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [ip-172-31-4-232]
If you get output similar to the above (the last line may differ slightly), you should be fine. Confirm this by visiting http://16.162.217.78:9870, replacing 16.162.217.78 with the IP address of your node.
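You can also confirm from the shell that all three daemons are running using jps, a small JVM process lister that ships with the JDK; you should see NameNode, DataNode, and SecondaryNameNode in its output (alongside Jps itself):
$ jps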
Now make some HDFS directories required for MapReduce jobs:
$ hdfs dfs -mkdir -p "/user/$(whoami)"
Copy the KJB (or whatever sample text you used) into our distributed filesystem:
$ hdfs dfs -mkdir input
$ hdfs dfs -put input/the-king-james-bible.txt input
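To confirm that the file actually landed in HDFS (the reported size should match our 4.2M local copy):
$ hdfs dfs -ls input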
Run WordCount:
$ hadoop jar "$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar" wordcount input output
Examine the output:
$ hdfs dfs -ls output
Found 2 items
-rw-r--r-- 1 ubuntu supergroup 0 2021-11-14 14:49 output/_SUCCESS
-rw-r--r-- 1 ubuntu supergroup 341453 2021-11-14 14:49 output/part-r-00000
$ hdfs dfs -tail output/part-r-00000 | tail
youth: 7
youth; 8
youth? 2
youthful 1
youths 1
youths, 1
zeal 13
zeal, 3
zealous 8
zealously 2
(HDFS tail outputs the last kilobyte of data and does not seem to be configurable, so we pipe its output to the ordinary tail to get just the last 10 lines.)
If you wish, you can also copy the output from our distributed filesystem back to our local filesystem and inspect the output there:
$ hdfs dfs -get output output
$ ls -l output
total 336
-rw-r--r-- 1 ubuntu ubuntu 0 Nov 14 14:58 _SUCCESS
-rw-r--r-- 1 ubuntu ubuntu 341453 Nov 14 14:58 part-r-00000
$ tail output/part-r-00000
youth: 7
youth; 8
youth? 2
youthful 1
youths 1
youths, 1
zeal 13
zeal, 3
zealous 8
zealously 2
Congratulations, you have successfully orchestrated a single-node Hadoop cluster and run WordCount on it!
To clean up, stop the NameNode, DataNode, and SecondaryNameNode daemons:
$ stop-dfs.sh
Stopping namenodes on [localhost]
Stopping datanodes
Stopping secondary namenodes [ip-172-31-4-232]
Conclusion
We saw how to install Hadoop and configure it to run as a single-node cluster. However, to leverage the true power of Hadoop (multiple map tasks in parallel, fault tolerance, etc.), a multi-node cluster is required. We shall explore this in a subsequent article, so stay tuned ;-)
References
- Hadoop: http://hadoop.apache.org/
- HDFS architecture: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
- Hadoop single-node cluster: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
- Why are Hadoop and Spark not in the official Ubuntu repositories?: https://askubuntu.com/questions/1375281/why-are-hadoop-and-spark-not-in-the-official-ubuntu-repositories
- FHS 3.0: https://refspecs.linuxfoundation.org/FHS_3.0/fhs-3.0.pdf
- Standalone operation (Hadoop): https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation
- WordCount: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0
- Pseudo-Distributed Operation: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
- Explaining Hadoop Configuration (Edureka): https://www.edureka.co/blog/explaining-hadoop-configuration/
- Hadoop 3.3.1 core-default.xml: https://hadoop.apache.org/docs/r3.3.1/hadoop-project-dist/hadoop-common/core-default.xml
- Hadoop 3.3.1 hdfs-default.xml: https://hadoop.apache.org/docs/r3.3.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml