Introduction
Here is what I learned last week about Hadoop installation:
Hadoop sounds like a really big thing: a complex installation, a cluster, hundreds of machines, terabytes (if not petabytes) of data. But actually, you can download a simple JAR and run Hadoop with HDFS on your laptop for practice. It's very easy!
Our plan
- Set up JAVA_HOME (Hadoop is built on Java).
- Download the Hadoop tar.gz.
- Extract the Hadoop tar.gz.
- Set up the Hadoop config.
- Format and start HDFS.
- Upload files to HDFS.
- Run a Hadoop job on these uploaded files.
- Get back and print the results!
Sounds like a plan!
Setup JAVA_HOME
As we said, Hadoop is built on Java, so we need JAVA_HOME set:
➜ hadoop$ ls /Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
➜ hadoop$ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
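By the way, if you don't know where your JDK lives, macOS ships a small helper that prints the default JDK home, so you can set JAVA_HOME without hardcoding the path (macOS-specific; on Linux, check where your package manager put the JDK):
$ export JAVA_HOME=$(/usr/libexec/java_home)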
Download Hadoop tar.gz
Next we download Hadoop, nice :)
➜ hadoop$ curl http://apache.spd.co.il/hadoop/common/hadoop-3.1.0/hadoop-3.1.0.tar.gz --output hadoop.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
1 310M 1 3581k 0 0 484k 0 0:10:57 0:00:07 0:10:50 580k
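Optionally, it's worth verifying the download before extracting it; Apache publishes a .sha512 checksum next to each release (I'm assuming the archive URL below; adjust it to wherever you downloaded from), and shasum computes the local digest to compare against:
$ curl https://archive.apache.org/dist/hadoop/common/hadoop-3.1.0/hadoop-3.1.0.tar.gz.sha512
$ shasum -a 512 hadoop.tar.gz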
Extract hadoop tar.gz
Now that we have the tar.gz on our laptop, let's extract it (we saved it as hadoop.tar.gz above):
➜ hadoop$ tar xvfz hadoop.tar.gz
Setup HDFS
Now let's configure HDFS on our laptop:
➜ hadoop$ cd hadoop-3.1.0
➜ hadoop-3.1.0$
➜ hadoop-3.1.0$ vi etc/hadoop/core-site.xml
Configuration should be:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
So we configured the HDFS port; next let's configure how many replicas we need. We are on a laptop, so we want only one replica for our data:
➜ hadoop-3.1.0$ vi etc/hadoop/hdfs-site.xml
This hdfs-site.xml is the file for the replication configuration; below is what it should contain (hint: 1):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
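To sanity-check that both files are picked up, hdfs getconf prints the effective value of any configuration key; assuming the configs above, it should echo back our URI and our single replica:
➜ hadoop-3.1.0$ bin/hdfs getconf -confKey fs.defaultFS
hdfs://localhost:9000
➜ hadoop-3.1.0$ bin/hdfs getconf -confKey dfs.replication
1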
Enable SSHD
Hadoop connects to nodes with SSH, so let's enable it on our Mac laptop:
You should be able to ssh to localhost with no password:
➜ hadoop-3.1.0$ ssh localhost
Last login: Wed May 9 17:15:28 2018
➜ ~
If you can't, generate a passwordless key:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
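And if ssh localhost is refused outright, sshd itself may be off; on a Mac you can switch on Remote Login from the command line (macOS-specific, same as ticking Remote Login in System Preferences → Sharing):
$ sudo systemsetup -setremotelogin on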
Start HDFS
Next we format and then start HDFS on our laptop:
➜ hadoop-3.1.0$ bin/hdfs namenode -format
WARNING: /Users/tomer.bendavid/tmp/hadoop/hadoop-3.1.0/logs does not exist. Creating.
2018-05-10 22:12:02,493 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = Tomers-MacBook-Pro.local/192.168.1.104
➜ hadoop-3.1.0$ sbin/start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
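To check that everything actually came up, jps (it ships with the JDK) lists the running Java processes; you should see a NameNode and a DataNode (and typically a SecondaryNameNode). The NameNode also serves a web UI, which in Hadoop 3.x moved to port 9870:
➜ hadoop-3.1.0$ jps
➜ hadoop-3.1.0$ open http://localhost:9870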
Create folders on HDFS
Next we create a sample input folder on HDFS on our laptop (the NativeCodeLoader warning that keeps appearing is harmless; it just means Hadoop falls back to its built-in Java classes):
➜ hadoop-3.1.0$ bin/hdfs dfs -mkdir /user
2018-05-10 22:13:16,982 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
➜ hadoop-3.1.0$ bin/hdfs dfs -mkdir /user/tomer
2018-05-10 22:13:22,474 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
➜ hadoop-3.1.0$
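By the way, those two mkdir calls can be collapsed into one; like the shell's mkdir -p, the -p flag creates parent directories as needed:
➜ hadoop-3.1.0$ bin/hdfs dfs -mkdir -p /user/tomer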
Upload testdata to HDFS
Now that we have HDFS up and running on our laptop, let's upload some files:
➜ hadoop-3.1.0$ bin/hdfs dfs -put etc/hadoop input
2018-05-10 22:14:28,802 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
put: `input': No such file or directory: `hdfs://localhost:9000/user/tomer.bendavid/input'
Note the failure: a relative path like input resolves under /user/<username> on HDFS (here /user/tomer.bendavid), which we never created. With the absolute path it works:
➜ hadoop-3.1.0$ bin/hdfs dfs -put etc/hadoop /user/tomer/input
2018-05-10 22:14:37,526 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
➜ hadoop-3.1.0$ bin/hdfs dfs -ls /user/tomer/input
2018-05-10 22:16:09,325 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x - tomer.bendavid supergroup 0 2018-05-10 22:14 /user/tomer/input/hadoop
Run hadoop job
So we have HDFS with files on our laptop; let's run a job on it, what do you think? We'll use the grep example that ships with Hadoop, which finds and counts matches of a regular expression across the input files:
➜ hadoop-3.1.0$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar grep /user/tomer/input/hadoop/*.xml /user/tomer/output1 'dfs[a-z.]+'
➜ hadoop-3.1.0$ bin/hdfs dfs -cat /user/tomer/output1/part-r-00000
2018-05-10 22:22:29,118 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1 dfsadmin
1 dfs.replication
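Besides cat-ing the results straight off HDFS, you can copy the whole output directory back to the local filesystem with -get and read it with plain local tools:
➜ hadoop-3.1.0$ bin/hdfs dfs -get /user/tomer/output1 output1
➜ hadoop-3.1.0$ cat output1/part-r-00000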
We managed to get a local Hadoop installation with HDFS for tests, and to run a test job on it! That is so cool!
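One last tip: when you're done practicing, the HDFS daemons are stopped with the matching stop script:
➜ hadoop-3.1.0$ sbin/stop-dfs.sh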
Summary
We managed to download Hadoop, start up HDFS, upload files to that HDFS, run a Hadoop job, and get the results back from HDFS, all on our laptop in a single directory! That is cool!
There is nothing new here; I just followed the straightforward guidance in the Hadoop installation docs, with a few minor modifications and some updated explanations to myself so it's clearer when I come back to it for reference.
If you want to see more of what I learned last week, I'm always at https://tomer-ben-david.github.io