Install Hadoop on Mac – Ultimate Step by Step Guide

#tutorial #datascience #hadoop

In this tutorial you will learn how to install Hadoop on your Mac, with a brief reminder of what Hadoop is and how its architecture works.

When you think of Hadoop, the first thing that probably comes to mind is big data. Hadoop emerged alongside big data because of the need to store massive amounts of data, and not only to store it, but also to analyse and access it in a reliable, scalable and affordable manner.

Hadoop solves two key big data problems. The first: what happens if one of the computers fails? Traditionally, if a machine fails, all the information stored on it is lost unless there is a backup. Hadoop has mechanisms to avoid this problem.

The second problem is combining information from different hard drives. When you are analysing a large amount of data spread across many hard drives, accessing and combining that information can be challenging. Fortunately, Hadoop tackles this issue as well.

What is Hadoop?

Hadoop is open source software optimised for reliable and scalable distributed computing. What does that mean? Distributed computing means that instead of a single computer carrying out a processing task, the task is performed by several machines: multiple computers, all connected together, working towards one goal.

Hadoop includes mechanisms that avoid data loss. It is also a scalable system, meaning more computers can be added to the cluster as the data grows.

Hadoop is designed to handle large files. Once you store a file in Hadoop, the file is split into smaller pieces (blocks), and each block is stored on a different machine within the cluster. In addition, each block is replicated on several machines to avoid data loss.
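To make the splitting and replication more concrete, here is a small shell sketch of the arithmetic involved. It assumes a block size of 128 MB and a replication factor of 3, which are common HDFS defaults but are not stated in this guide, and both are configurable. The function names are purely illustrative.

```shell
#!/bin/sh
# Rough arithmetic sketch: how many blocks a file is split into and how many
# block copies the cluster ends up storing.
# Assumed values (not from this guide): 128 MB blocks, replication factor 3.
BLOCK_SIZE_MB=128
REPLICATION=3

# hdfs_block_count FILE_SIZE_MB -> number of blocks, rounded up
hdfs_block_count() {
  echo $(( ($1 + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB ))
}

# hdfs_stored_copies FILE_SIZE_MB -> total block copies across the cluster
hdfs_stored_copies() {
  echo $(( $(hdfs_block_count "$1") * REPLICATION ))
}

hdfs_block_count 1000     # prints 8
hdfs_stored_copies 1000   # prints 24
```

Under these assumptions, a 1000 MB file becomes 8 blocks, and the cluster stores 24 block copies in total.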

The whole system can be scaled up from one server to thousands of servers. The computation and storage capacity of every server is combined, resulting in a really powerful system.

Learn more about Hadoop's architecture

Learn about different installation modes

Standalone installation

Hadoop can be installed on a Mac in three different modes: standalone, pseudo-distributed and fully distributed. This guide covers standalone mode, in which there are no separate daemons and all Hadoop processes run in the same JVM. The steps to install Hadoop in standalone mode are:

Step 1) Check Java is installed

Hadoop is written in Java and runs on it behind the scenes. Therefore, the first thing to do is tell Hadoop where Java is installed. To do so, set the JAVA_HOME environment variable on your machine (adjust the path below to the JDK version you have installed):

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-14.0.2.jdk/Contents/Home/

Then check that Java is installed by running the following command:

java --version
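If you are unsure whether JAVA_HOME points to a real JDK, a quick sanity check can help. The sketch below uses a hypothetical helper, check_java_home, which is not part of Hadoop or Java. It only verifies that the directory contains an executable bin/java.

```shell
#!/bin/sh
# check_java_home DIR: succeeds if DIR looks like a JDK home, i.e. it
# contains an executable bin/java. Hypothetical helper, not part of Hadoop.
check_java_home() {
  if [ -z "$1" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$1/bin/java" ]; then
    echo "no executable found at $1/bin/java" >&2
    return 1
  fi
  echo "JAVA_HOME looks valid: $1"
}

# Usage: check_java_home "$JAVA_HOME"
```

On macOS, running /usr/libexec/java_home prints the home directory of the default JDK, which can be a convenient way to find the right value for JAVA_HOME.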

Step 2) Download Hadoop

You can download Hadoop from the following website: Download Hadoop

Select any of the mirrors, then pick a version and download the file called hadoop-X.Y.Z.tar.gz.
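Once the download finishes, the archive needs to be uncompressed before the next step. A minimal sketch is shown below. The helper name unpack_hadoop and the ~/Downloads location are assumptions for illustration, and X.Y.Z stands for whichever version you picked.

```shell
#!/bin/sh
# unpack_hadoop TARBALL DEST_DIR: extract a downloaded Hadoop archive into
# DEST_DIR, creating the directory first. Hypothetical helper for illustration.
unpack_hadoop() {
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Usage (replace X.Y.Z with the version you downloaded; ~/Downloads is an
# assumed download location):
# unpack_hadoop ~/Downloads/hadoop-X.Y.Z.tar.gz ~/sw
```

Extracting into ~/sw matches the directory used for HADOOP_HOME in the next step, but any directory you have write access to should work.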

Step 3) Set up Hadoop environment variables

The next step is to create the Hadoop environment variables. You will need to create the HADOOP_HOME environment variable, which points to the directory where you uncompressed the downloaded file. You should also add its bin and sbin directories to your PATH variable. Run the following commands:

export HADOOP_HOME=~/sw/hadoop-x.y.z
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
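Variables set with export only last for the current terminal session. To keep them across sessions, one option is to add the same lines to your shell startup file. The sketch below assumes zsh (the default shell on recent macOS versions) and its ~/.zshrc file. The helper append_once is hypothetical and simply avoids writing the same line twice.

```shell
#!/bin/sh
# append_once LINE FILE: add LINE to FILE unless an identical line is already
# there, so re-running the setup does not duplicate entries.
# Hypothetical helper for illustration.
append_once() {
  touch "$2"
  grep -qxF -e "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Usage (assumes zsh and ~/.zshrc; use your real version in place of x.y.z):
# append_once 'export HADOOP_HOME=~/sw/hadoop-x.y.z' ~/.zshrc
# append_once 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' ~/.zshrc
```

After opening a new terminal, running hadoop version should print the installed version if the variables are set correctly.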

Run Hadoop on Pseudo Distributed Mode
