Jaya chandrika reddy

Hello Hadoop | Learn Hadoop in just a few minutes easily!

Originally written at: https://jayachandrika.com

Let's explore Hadoop:

Platform: Hortonworks Sandbox

Hadoop: What's the buzz around it?

Hadoop is an open-source software platform with utilities and tools for distributed storage and distributed processing of very large files and datasets, performed across several computers.

I agree if it doesn't make complete sense yet. Let's break it down.

Example: let's say we have employee records for millions of people working at Google, Amazon and Facebook, and we have to combine all their information and produce a list in decreasing order of salary.

Loading such big data onto one normal computer and making it process every single employee record will either make the computer burst like a SpaceX rocket or take such a lo...ng time that our next generation will have to see the output!

If we distribute this data across several computers, each processing a small piece of it, we get fast output without any single machine being overloaded.

This is what Hadoop does:

Distributed storage: the huge data is split across several computers and stored there.

Distributed processing: instead of one costly computer with high processing power, several normal computers process the big data.

In short, Hadoop has tools to store and process this big data, giving us the expected results.

Let's see how exactly big data is handled in Hadoop!

Hadoop EcoSystem

Hadoop has many components, each with its own purpose and functions (kind of like each hero in Endgame has their own movie).


HDFS, YARN and MapReduce belong to the core Hadoop ecosystem, while the others were added later on to solve specific problems. Let's explore them one by one.

Let's say we have a huge pile of potatoes (big data) and we wish to make french fries, chips and nuggets.

  1. In order to make them, we divide the potatoes into three batches: potatoes for french fries, potatoes for chips and potatoes for nuggets.
     This is what HDFS does: it divides the big chunk of data and distributes it to several computers.

  2. All the resources or ingredients required for each dish are allotted respectively, similar to YARN, which manages the resources needed by each chunk or cluster.

  3. We transform the potatoes into the desired dish, which is what MapReduce does: it transforms the data in an efficient manner and aggregates it, giving us the desired result.

Let's explore each component in more detail.

HDFS: Hadoop Distributed File System

As the name says, it distributes a big file across many computers as follows:

  • The big file is split into blocks (each block at most 128 MB by default).

  • Blocks are stored across many computers.

  • Redundancy: to handle the failure of a computer, each block is stored as multiple copies on different computers, so that if one fails, the other computers still have the failed computer's data.
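To make the numbers concrete, here is a tiny back-of-the-envelope sketch in Python; the 128 MB block size and replication factor of 3 are the stock HDFS defaults, and the function itself is purely illustrative:

```python
# Illustrative only: how much cluster storage a file occupies in HDFS.
import math

BLOCK_SIZE_MB = 128        # default HDFS block size
REPLICATION_FACTOR = 3     # default number of copies of each block

def hdfs_footprint(file_size_mb: float) -> None:
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    copies = blocks * REPLICATION_FACTOR
    print(f"{file_size_mb} MB file -> {blocks} blocks, "
          f"{copies} block copies spread over the cluster")

hdfs_footprint(1000)   # 1000 MB file -> 8 blocks, 24 block copies ...
```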

YARN: Yet Another Resource Negotiator

  • It's the heartbeat of the Hadoop system; all the data processing happens here.
  • It provides the computational resources like CPUs, memory etc. for application execution.
  • It does job scheduling and manages the resources: which nodes are available, which ones should be allowed to run tasks, and so on.
  • Usually it's advised not to code against YARN directly. Instead, use one of the higher-level tools, for example Hive, to make your life easier.

In a way, just as an operating system allots resources on a computer, YARN allocates resources in Hadoop.

MapReduce: Mappers and Reducers

So we have data in HDFS and resources allocated by YARN. Now how do we get our desired results from this? We code using MapReduce.

MapReduce has two kinds of functions: mappers and reducers.
Python or Java is typically used to code these functions. Overall, it's a programming model to process and compute across the data.

  • Mappers split the data and map it according to the desired output.
  • Reducers take the output of the mappers and aggregate or combine the data to give the result.


As input comes in, the mapper transforms the data record by record and organizes it according to the output we desire.

The reducer then reduces the mapper output to the aggregated form we require as the final output.
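To make this concrete, here is the classic word count written as a Hadoop Streaming mapper and reducer in Python. This is a minimal sketch, not the article's own code; the file names and launch command are illustrative.

```python
#!/usr/bin/env python3
# mapper.py -- word-count mapper in the Hadoop Streaming style.
# Hadoop feeds input lines on stdin; we emit "key<TAB>value" pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

The reducer sees the mapper's pairs sorted by key, so identical words arrive on consecutive lines:

```python
#!/usr/bin/env python3
# reducer.py -- aggregates the sorted mapper output into final counts.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.strip().rsplit("\t", 1)
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{count}")   # emit the finished word
        count = 0
    current_word = word
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")       # emit the last word
```

A typical launch looks like `hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py` (paths illustrative).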

Pig: A fast and easy alternative to MapReduce

  • Pig Latin, a scripting language, uses SQL-like syntax for forming relations, selecting data, filtering it, transforming it and so on. It also supports user-defined functions, so we can create our own.
  • Not all business problems can be solved by converting them into mapper-and-reducer form. So Pig, sitting on top of MapReduce, converts our Pig Latin script into MapReduce form for us.
  • A Pig script can be run in any of the three ways below (a small runnable sketch follows the list):

    • Grunt (command-line interpreter)
    • Script (place the script in a file and run it)
    • Ambari (directly from the Pig View)
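As a sketch of the "Script" option, here is the salary-sorting example from the introduction written in Pig Latin and launched from Python via the `pig` command line; the file name and field layout are made up for illustration:

```python
# Illustrative: write a Pig Latin script to a file and run it with the
# `pig` CLI (the "Script" option above). Input path and fields are made up.
import subprocess

PIG_SCRIPT = """
employees = LOAD 'employees.csv' USING PigStorage(',')
            AS (name:chararray, company:chararray, salary:int);
by_salary = ORDER employees BY salary DESC;
STORE by_salary INTO 'employees_by_salary';
"""

with open("by_salary.pig", "w") as f:
    f.write(PIG_SCRIPT)

subprocess.run(["pig", "-f", "by_salary.pig"], check=True)
```

Behind the scenes, Pig compiles these few relations into MapReduce jobs for us.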


TEZ

  • We might wonder whether using Pig on top of MapReduce increases latency. Tez sits in between them to optimize the work and make it up to 10x faster.

  • It makes Hive, Pig and MapReduce jobs faster by optimising the physical data flow and resource usage.

  • Using a DAG (Directed Acyclic Graph), it looks at the workflow of the jobs and finds the optimal path through it, making our job easier.

Suppose we have three paths, 1-2-3, 4-9-8 and 2-7-9, and we need to find the smallest one: we choose 1-2-3. A DAG works in a somewhat similar way.

  • The plan is refined again and again through the DAG, giving us the optimal solution.

Hive

  • It takes SQL queries as input, for example from the command prompt, transforms them into MapReduce jobs and automatically gets the job done.
  • We can store the result of a query in a view, run a query across the entire cluster and define our own functions.
  • Though Pig and Hive might seem similar, Hive is used for completely structured data whereas the Pig component is used for semi-structured data.

  • Pig is relatively faster than Hive.

  • Hive is slow and, unlike Pig and Spark, not suited for real-time transactions.

  • It might give the feel of a relational database, but it can't delete, insert etc. the way one does, as there are no real tables, joins or primary keys; the data is just flat, de-normalized text.


As you can observe, the queries used are the same as SQL queries.
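For instance, here is a hedged sketch using the PyHive library; the HiveServer2 endpoint on localhost:10000 and the `employees` table are both assumptions, not part of the article:

```python
# Illustrative: querying Hive from Python with the PyHive library.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# Hive translates this SQL into MapReduce (or Tez) jobs behind the scenes.
cursor.execute("SELECT company, AVG(salary) FROM employees GROUP BY company")
for company, avg_salary in cursor.fetchall():
    print(company, avg_salary)
```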

Ambari: UI

  • It's pretty much a web interface to graphically interact with the cluster and monitor its usage.
  • It gives us a high-level view of the cluster: what's running, which systems you are using, what resources are available or in use.
  • It shows statistics about the cluster like CPU usage, memory usage, network usage, cluster load etc.
  • All services like HDFS, YARN etc. can be used on a cluster through Ambari.
  • Even Hadoop itself can be installed using Ambari.


Mesos: An Alternative to YARN

  • It's similar to YARN; the difference is that Mesos is more general.
  • In simple words, YARN works only with Hadoop components like MapReduce, whereas Mesos can allocate resources to systems outside Hadoop, like web servers.

  • YARN is for analytical, long-running jobs, whereas Mesos handles both long- and short-running jobs.

Spark: Queen of the Hadoop Chessboard

  • The Spark engine is pretty fast, in fact up to 100x faster than MapReduce, and is used for large-scale data processing.
  • Like Hive, it can handle SQL queries; like Apache Storm, it can handle real-time streaming. Hence we call it the queen.
  • Processing in Spark happens this way:

    • Driver program: contains the script that controls what happens in the job.
    • Cluster manager: can be YARN, Mesos or Spark's own cluster manager.
    • Executors: hold the cache, prioritizing memory-based processing unlike other disk-based solutions.


    Spark has many components:

    • Spark Streaming: handles real-time data.
    • Spark SQL: an SQL interface; takes SQL queries.
    • MLlib: machine learning operations can be performed without worrying about mappers and reducers.
    • GraphX: all graph-related problems and analytics are done using this.
    • Spark Core: provides distributed task dispatching, scheduling and basic I/O functionality.


Spark uses a specialized fundamental data structure known as the RDD (Resilient Distributed Dataset). It's an object representing a dataset, and we can perform operations on that object just like on any object in Java.
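Here is a minimal PySpark sketch of working with an RDD (the input path is illustrative):

```python
# Illustrative: a minimal RDD word count with PySpark (input path made up).
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

lines = sc.textFile("hdfs:///data/input.txt")        # RDD of lines
counts = (lines.flatMap(lambda line: line.split())   # RDD of words
               .map(lambda word: (word, 1))          # (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # aggregated counts

print(counts.take(10))   # first ten (word, count) pairs
sc.stop()
```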

HBase

  • It's basically a scalable NoSQL database.
  • It's a columnar data store, pretty fast, and useful for high data-transaction rates.
  • CRUD (Create, Read, Update and Delete) operations are supported here.
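A hedged CRUD sketch with the happybase Python client; the Thrift server on localhost and the `employees` table with an `info` column family are all assumptions:

```python
# Illustrative: CRUD against HBase via the happybase client.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("employees")

table.put(b"row1", {b"info:name": b"Ada", b"info:salary": b"90000"})  # Create/Update
print(table.row(b"row1"))                                             # Read
table.delete(b"row1")                                                 # Delete
```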

Apache STORM

  • Processes continuous real-time streaming data, such as data from sensors or weblogs.
  • Spark Streaming does this too, but in small batch intervals, whereas Storm handles the data event by event.
  • Spout is the term for the sources of the streaming data, i.e. the producers; bolts are the units that transform or aggregate this streaming data.
  • Usually Kafka (which sends data into the cluster) and Storm are used as a pair.

OOZIE

  • Runs and schedules various jobs/tasks on Hadoop.

  • Instructions are given via an XML file.

ZooKeeper

  • Coordinates everything on the cluster: which node is the master, what task is assigned to each worker node, which nodes are up, which are down, which worker nodes are available, and so on.
  • If a master node dies, it chooses who the next master should be.

Example: this phenomenon can be observed in WhatsApp groups or in video calls, where if the host leaves, someone else is assigned as the next host.
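A sketch of exactly this master-election idea, using the kazoo Python client for ZooKeeper; the connection string and election path are assumptions:

```python
# Illustrative: leader election with the kazoo ZooKeeper client.
from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")
zk.start()

def lead():
    # Runs only while this process holds leadership; if it dies,
    # ZooKeeper lets another contender win the next election.
    print("I am the master node now")

election = zk.Election("/election/master", identifier="worker-1")
election.run(lead)   # blocks until elected, then calls lead()
```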

Outside the Hadoop ecosystem, we can have databases to store the data and query engines to get input data into the cluster. Let's have a look at them!

Data Ingestion: Sqoop + Flume + Kafka


So how do you get data into the cluster? Sqoop, Flume and Kafka help with this via data ingestion.

Sqoop: Sqoop is a command-line tool for transferring data between relational database servers and Hadoop.

  • It's essentially a way of tying the Hadoop world to relational databases.
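A typical import, sketched here from Python for consistency with the other examples; the JDBC URL, username, table and target directory are all illustrative:

```python
# Illustrative: a typical Sqoop import launched from Python.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/payroll",
    "--username", "etl_user",
    "--table", "employees",            # relational table to copy
    "--target-dir", "/data/employees"  # HDFS directory to write into
], check=True)
```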

Flume: another way of sending data into the cluster. It acts as a buffer between the data coming in and storage, so the cluster won't go down when overwhelming amounts of data arrive in real time.

Kafka: this is a more general publisher/subscriber model, where publishers send/produce data as topics which are subscribed to by consumers.

  • It collects any sort of data from PCs, web servers etc. and broadcasts it into the Hadoop cluster.
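A minimal publisher/subscriber sketch with the kafka-python library; the broker address and topic name are assumptions:

```python
# Illustrative: publish/subscribe with the kafka-python library.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("weblogs", b'{"path": "/home", "status": 200}')
producer.flush()

consumer = KafkaConsumer("weblogs",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)   # every subscribed consumer sees the topic's events
    break
```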

External Data Storages: MySQL, Cassandra, MongoDB

MySQL: a typical SQL-based relational database.

Cassandra: a NoSQL columnar database with no single point of failure. In CAP-theorem terms, it favours availability and partition tolerance.

MongoDB: a NoSQL database which uses a document model. It's used when we deal with huge data, and it's pretty flexible.


Query Engines: Drill + Hue + Phoenix + Presto + Zeppelin

All of these are interactive query engines that fetch the required data from the databases under them, like HBase, Cassandra etc.

  • Drill lets us use SQL queries even when the databases under it (HBase, Cassandra, MongoDB etc.) are non-relational.

  • Phoenix is similar to Drill with its SQL queries, but adds ACID properties (Atomicity, Consistency, Isolation and Durability). It's fast but works only on HBase.

  • Presto is similar to Drill, but unlike Drill it also connects to Cassandra, and it executes queries across the cluster.

  • Hue (Hadoop User Experience), an SQL cloud editor, is used for queries and as the UI in Cloudera, just as Ambari is for Hortonworks.

  • Zeppelin has a notebook UI to visualize data, similar to Jupyter notebooks for TensorFlow, and to share and interact with code.

It might be overwhelming to see many components doing the same work. Ultimately, based on the task, the type of data (relational, non-structured etc.) and other factors, we need to choose the right component and use it, and that ability comes with knowledge of the components, practice and experience.

Hope you had fun learning Hadoop; I for one certainly had fun sharing it!

Originally written at: https://jayachandrika.com

Now you can follow us to explore more interesting topics on Instagram: @code_voyager

Great to have you here; let's meet again in the next post and explore more!

Have an amazing day!

Comments

Sergiy Yevtushenko

The CAP theorem illustration provided in the article is incorrect and misleading. Actually, there is no strict "pick two" limitation. Instead, there is some balance between all three properties in each particular setup. For example, Cassandra is shown on the "AP" side, while it can be put on any side by changing its replication factor and consistency level configuration parameters.

Jaya chandrika reddy

Although there is the CAP theorem limitation, DBs can be configured to satisfy all three to a certain extent. For example, Cassandra can be configured to satisfy consistency as well, e.g. by waiting until it gets the same result from 3 nodes.

To sum it up, it supports all three but favours availability over consistency, which is what I tried to convey through the article. Thank you for asking and making it clearer to the others ☺️

Tharun Shiv

Hey, is it necessary to run all the services, or can we install only the specific services that we want?

Jaya chandrika reddy

No need to include all.

But there are dependencies between services: for example, if you include Pig, then you have to include YARN and MapReduce.

Tharun Shiv

Yeah, thought so, okay thank you.

Pacharapol Withayasakpunt

To cut it short, when will I need to use Hadoop, or is it more of an enterprise thing?

Jaya chandrika reddy

Good question. It is used at startups too; wherever there is a need to handle big data, Hadoop will be used.
So if you have an app or website that deals with a huge amount of data, you will need Hadoop. Do comment if you have more doubts 😄

Pacharapol Withayasakpunt
  • In which typical situations is collecting data as Big Data with many V's better than structured data?
  • Shared Hadoop service? Self-hosting? Alternatives?
Jaya chandrika reddy

Big data includes both structured and unstructured data.

It is the volume, variety, velocity, value and veracity that decide whether it's big data or not.

I've seen people use HDP, the Hortonworks Data Platform, to host their Hadoop cluster.

It is a self-hosted cluster. You can set it up on any cloud service provider, like GCP, AWS or Azure.

I haven't come across shared hosting; let me know if you have used it.

Hope others can also weigh in and share their views and perspectives on this topic 😄

Pacharapol Withayasakpunt

Thanks for being very informative (about HDP).

I am starting to feel like it might be meant for BI / Analytics / AI / ML. For getting employed it might be good, but for a small business it might be better to rely on outsourcing or off-the-shelf software.

Also, I came across Cassandra vs HBase vs MongoDB. It seems like HBase / the Hadoop ecosystem might be one of the best.

Jaya chandrika reddy

Based on the cost of software versus storage needs, we have to weigh the tradeoff and decide whether Hadoop can be employed or not.

HBase is part of Hadoop, as displayed in the Hadoop ecosystem image in the article, whereas Cassandra and MongoDB are external storages.

Based on the CAP theorem, we need to decide which storage to use. Thanks for sharing your interesting perspective 😄

venkat anirudh

Good one.. Learned a lot! Keep going!

Jaya chandrika reddy

Thanks for the support 😄

praveenreddy1798

Well written

Jaya chandrika reddy

Thank you ☺️

Tharun Shiv

Amazing article! A huge topic put together in a single post. Well structured, well written. Kudos 🎉

Jaya chandrika reddy

Thank you 🙂

Ankur V

Nice article 👍👍

Jaya chandrika reddy

Thank you 😄

Jaya chandrika reddy

Thank you for the wonderful comment and appreciation for the post; glad you enjoyed the article ☺️ @saloni509

venkat anirudh

Which is the best database to use with Hadoop?
