Introduction
Here is what I learned last week about Hadoop installation:
Hadoop sounds like a really big thing: a complex installation, a cluster, hundreds of machines, terabytes (if not petabytes) of data. But actually, you can download a simple JAR and run Hadoop with HDFS on your laptop for practice. It's very easy!
Our plan
- Set up JAVA_HOME (Hadoop is built on Java).
- Download the Hadoop tar.gz.
- Extract the Hadoop tar.gz.
- Set up the Hadoop config.
- Format and start HDFS.
- Upload files to HDFS.
- Run a Hadoop job on these uploaded files.
- Get back and print the results!
Sounds like a plan!
Setup JAVA_HOME
As we said, Hadoop is built on Java, so we need JAVA_HOME set:
➜ hadoop$ ls /Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
➜ hadoop$ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
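By the way, if you don't know where your JDK lives, macOS ships a small helper that prints the default JDK home, so you can set JAVA_HOME without hardcoding the path (macOS-specific; on Linux, check where your package manager put the JDK):
$ export JAVA_HOME=$(/usr/libexec/java_home)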
Download Hadoop tar.gz
Next we download Hadoop, nice :)
➜ hadoop$ curl http://apache.spd.co.il/hadoop/common/hadoop-3.1.0/hadoop-3.1.0.tar.gz --output hadoop.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
1 310M 1 3581k 0 0 484k 0 0:10:57 0:00:07 0:10:50 580k
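Optionally, it's worth verifying the download before extracting it; Apache publishes a .sha512 checksum next to each release (I'm assuming the archive URL below; adjust it to wherever you downloaded from), and shasum computes the local digest to compare against:
$ curl https://archive.apache.org/dist/hadoop/common/hadoop-3.1.0/hadoop-3.1.0.tar.gz.sha512
$ shasum -a 512 hadoop.tar.gz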
Extract hadoop tar.gz
Now that we have the tar.gz on our laptop, let's extract it (we saved it as hadoop.tar.gz above):
➜ hadoop$ tar xvfz hadoop.tar.gz
Setup HDFS
Now let's configure HDFS on our laptop:
➜ hadoop$ cd hadoop-3.1.0
➜ hadoop-3.1.0$
➜ hadoop-3.1.0$ vi etc/hadoop/core-site.xml
Configuration should be:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
So we configured the HDFS port; next let's configure how many replicas we need. We are on a laptop, so we want only one replica for our data:
➜ hadoop-3.1.0$ vi etc/hadoop/hdfs-site.xml
This hdfs-site.xml is the file for the replication configuration; below is what it should contain (hint: 1):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
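To sanity-check that both files are picked up, hdfs getconf prints the effective value of any configuration key; assuming the configs above, it should echo back our URI and our single replica:
➜ hadoop-3.1.0$ bin/hdfs getconf -confKey fs.defaultFS
hdfs://localhost:9000
➜ hadoop-3.1.0$ bin/hdfs getconf -confKey dfs.replication
1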
Enable SSHD
Hadoop connects to nodes with SSH, so let's enable it on our Mac laptop:
You should be able to ssh to localhost with no password:
➜ hadoop-3.1.0$ ssh localhost
Last login: Wed May 9 17:15:28 2018
➜ ~
If you can't, generate a passwordless key:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
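And if ssh localhost is refused outright, sshd itself may be off; on a Mac you can switch on Remote Login from the command line (macOS-specific, same as ticking Remote Login in System Preferences → Sharing):
$ sudo systemsetup -setremotelogin on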
Start HDFS
Next we format and then start HDFS on our laptop:
➜ hadoop-3.1.0$ bin/hdfs namenode -format
WARNING: /Users/tomer.bendavid/tmp/hadoop/hadoop-3.1.0/logs does not exist. Creating.
2018-05-10 22:12:02,493 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = Tomers-MacBook-Pro.local/192.168.1.104
➜ hadoop-3.1.0$ sbin/start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
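To check that everything actually came up, jps (it ships with the JDK) lists the running Java processes; you should see a NameNode and a DataNode (and typically a SecondaryNameNode). The NameNode also serves a web UI, which in Hadoop 3.x moved to port 9870:
➜ hadoop-3.1.0$ jps
➜ hadoop-3.1.0$ open http://localhost:9870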
Create folders on HDFS
Next we create a sample input folder on HDFS on our laptop (the NativeCodeLoader warning that keeps appearing is harmless; it just means Hadoop falls back to its built-in Java classes):
➜ hadoop-3.1.0$ bin/hdfs dfs -mkdir /user
2018-05-10 22:13:16,982 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
➜ hadoop-3.1.0$ bin/hdfs dfs -mkdir /user/tomer
2018-05-10 22:13:22,474 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
➜ hadoop-3.1.0$
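By the way, those two mkdir calls can be collapsed into one; like the shell's mkdir -p, the -p flag creates parent directories as needed:
➜ hadoop-3.1.0$ bin/hdfs dfs -mkdir -p /user/tomer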
Upload testdata to HDFS
Now that we have HDFS up and running on our laptop, let's upload some files:
➜ hadoop-3.1.0$ bin/hdfs dfs -put etc/hadoop input
2018-05-10 22:14:28,802 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
put: `input': No such file or directory: `hdfs://localhost:9000/user/tomer.bendavid/input'
Note the failure: a relative path like input resolves under /user/<username> on HDFS (here /user/tomer.bendavid), which we never created. With the absolute path it works:
➜ hadoop-3.1.0$ bin/hdfs dfs -put etc/hadoop /user/tomer/input
2018-05-10 22:14:37,526 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
➜ hadoop-3.1.0$ bin/hdfs dfs -ls /user/tomer/input
2018-05-10 22:16:09,325 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x - tomer.bendavid supergroup 0 2018-05-10 22:14 /user/tomer/input/hadoop
Run hadoop job
So we have HDFS with files on our laptop; let's run a job on it, what do you think? We'll use the grep example that ships with Hadoop, which finds and counts matches of a regular expression across the input files:
➜ hadoop-3.1.0$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar grep /user/tomer/input/hadoop/*.xml /user/tomer/output1 'dfs[a-z.]+'
➜ hadoop-3.1.0$ bin/hdfs dfs -cat /user/tomer/output1/part-r-00000
2018-05-10 22:22:29,118 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1 dfsadmin
1 dfs.replication
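Besides cat-ing the results straight off HDFS, you can copy the whole output directory back to the local filesystem with -get and read it with plain local tools:
➜ hadoop-3.1.0$ bin/hdfs dfs -get /user/tomer/output1 output1
➜ hadoop-3.1.0$ cat output1/part-r-00000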
We managed to get a local Hadoop installation with HDFS for tests, and to run a test job on it! That is so cool!
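One last tip: when you're done practicing, the HDFS daemons are stopped with the matching stop script:
➜ hadoop-3.1.0$ sbin/stop-dfs.sh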
Summary
We managed to download Hadoop, start up HDFS, upload files to that HDFS, run a Hadoop job, and get the results back from HDFS, all on our laptop in a single directory! That is cool!
There is nothing new here; I just followed the straightforward guidance in the Hadoop installation docs, with a few minor modifications and some updated explanations to myself so it's clearer when I come back to it for reference.
If you want to see more of what I learned last week, I'm always at https://tomer-ben-david.github.io