“Apache Hadoop is an open-source software framework for storage and
large-scale processing of data-sets on clusters of commodity hardware”
The most significant part of this is the power of large-scale data processing on commodity hardware, i.e. your regular servers in a rack.
I like to understand the above concept by drawing an analogy to how an election or a census works in a country.
- Each district or constituency has a polling booth where people come and cast their vote.
- At the end of the election, each ballot box (EVM) is tasked with adding up the votes for each candidate.
- Data from the ballot boxes at different polling stations is added together to find the final tally of votes for each political party.

You can see how we can use this concept to do large-scale computation on our data. We store a single file, in chunks, across multiple servers. Whenever a calculation needs to be done, we can run it on each server and collect the results. The results from each server are then processed again to get the final answer. This is the core concept of distributed computing.
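The election analogy above can be sketched in a few lines of Python. This is a minimal, in-process illustration (the "stations" are just lists, not real servers): each chunk of ballots is tallied locally, and the partial tallies are then combined into one final result.

```python
# Minimal sketch of the distributed-counting idea from the election analogy.
# Each "server" (here just a list of ballots) tallies its own votes locally,
# and the partial tallies are merged into the final result.
from collections import Counter

def local_tally(ballots):
    """Count votes at a single 'server' (polling station)."""
    return Counter(ballots)

def combine(tallies):
    """Merge the partial tallies from all servers into the final count."""
    total = Counter()
    for t in tallies:
        total.update(t)
    return total

# Three polling stations, each holding its own chunk of the data.
stations = [
    ["A", "B", "A"],
    ["B", "B", "C"],
    ["A", "C", "A"],
]

partials = [local_tally(s) for s in stations]  # compute runs where the data lives
result = combine(partials)                     # collect and merge the results
print(dict(result))  # {'A': 4, 'B': 3, 'C': 2}
```

The key design point, which Hadoop takes to an extreme, is that the computation moves to the data rather than the data moving to one central machine.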
Hadoop's key characteristics:

- Open source: Apache License
- Highly scalable: nodes can be added or removed depending on demand
- Runs on commodity hardware
- Reliable: data and computation are replicated across multiple nodes
The Hadoop stack consists of four core modules:

- Hadoop Common
- Hadoop Distributed File System (HDFS)
- Hadoop YARN
- Hadoop MapReduce

Let's study each of these concepts one by one.
HDFS is a file system over distributed nodes. It is analogous to the usual file system on your personal computer, except that the underlying data is distributed across multiple server nodes.
Our files are broken into chunks and stored across multiple servers. To coordinate among the servers in the cluster, HDFS has a master process called the “Name node”.
Data nodes are the servers that store the data and perform the computation.
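To make the name node / data node split concrete, here is a toy sketch of the bookkeeping involved. All names here are made up for illustration, and the block size is tiny; real HDFS defaults to 128 MB blocks and a replication factor of 3.

```python
# Toy sketch of HDFS-style bookkeeping (names and sizes are illustrative only).
# The "name node" stores metadata: which blocks make up a file and which
# data nodes hold each block. The data nodes store the actual bytes.

BLOCK_SIZE = 4   # bytes per block; real HDFS defaults to 128 MB
REPLICATION = 2  # real HDFS defaults to 3

data_nodes = {"dn1": {}, "dn2": {}, "dn3": {}}  # node -> {block_id: bytes}
name_node = {}                                  # file -> [(block_id, [nodes])]

def put(filename, data):
    """Split a file into blocks and replicate each block across data nodes."""
    nodes = list(data_nodes)
    blocks = []
    for i in range(0, len(data), BLOCK_SIZE):
        block_num = i // BLOCK_SIZE
        block_id = f"{filename}_blk{block_num}"
        # Round-robin placement of replicas across the data nodes.
        targets = [nodes[(block_num + r) % len(nodes)] for r in range(REPLICATION)]
        for node in targets:
            data_nodes[node][block_id] = data[i:i + BLOCK_SIZE]
        blocks.append((block_id, targets))
    name_node[filename] = blocks

def get(filename):
    """Read a file back by asking the name node where each block lives."""
    out = b""
    for block_id, targets in name_node[filename]:
        out += data_nodes[targets[0]][block_id]  # read from the first replica
    return out

put("report.txt", b"hello distributed world")
assert get("report.txt") == b"hello distributed world"
```

Because each block is replicated, losing a single data node does not lose any data, which is what makes reliability on cheap hardware possible.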
MapReduce is a programming model for computation in a distributed, parallel computing environment.
The business logic of what the map and reduce tasks should do is supplied by the programmer.
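The classic first MapReduce program is word count. The sketch below simulates the model in-process: the programmer writes only the `mapper` and `reducer`, while the framework (here, a small driver function) handles splitting the input, shuffling intermediate pairs by key, and collecting the output.

```python
# A word-count sketch of the MapReduce model. The programmer supplies only
# the map and reduce functions; the "framework" (simulated here in-process)
# splits input, shuffles by key, and collects the output.
from itertools import groupby

def mapper(line):
    """Map: emit a (word, 1) pair for every word in a line of input."""
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return (word, sum(counts))

def run_mapreduce(lines):
    # Map phase: run the mapper over every input split.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle phase: group intermediate pairs by key.
    pairs.sort(key=lambda kv: kv[0])
    grouped = groupby(pairs, key=lambda kv: kv[0])
    # Reduce phase: run the reducer once per key.
    return dict(reducer(word, (c for _, c in group)) for word, group in grouped)

counts = run_mapreduce(["big data big compute", "big cluster"])
print(counts)  # {'big': 3, 'cluster': 1, 'compute': 1, 'data': 1}
```

In real Hadoop the map tasks run on the data nodes that hold each chunk of the file, and the shuffle moves intermediate pairs across the network to the reduce tasks.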
Task tracker: task trackers are processes that reside on the data nodes. They take care of carrying out the tasks assigned to their respective nodes.
You submit your jobs to the name node (job tracker), which schedules and tracks all the jobs.
Each data node receives tasks from the job tracker, and it is the task tracker's job to run and monitor the assigned task. It reports the status of its task back to the job tracker.
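The job tracker / task tracker relationship described above can be sketched as follows. This is not the real Hadoop API, just an illustrative model: the job tracker hands each task to a task tracker, and each task tracker runs its task and reports a status back.

```python
# Illustrative sketch (not real Hadoop APIs) of the job tracker / task tracker
# relationship: the job tracker assigns tasks to task trackers on the data
# nodes and records the status each one reports back.

class TaskTracker:
    def __init__(self, node):
        self.node = node

    def run(self, task):
        # In real Hadoop, the task would operate on this node's local data chunk.
        result = task()
        return {"node": self.node, "status": "COMPLETED", "result": result}

class JobTracker:
    def __init__(self, trackers):
        self.trackers = trackers
        self.reports = []

    def submit(self, tasks):
        # Schedule one task per task tracker and collect the status reports.
        for tracker, task in zip(self.trackers, tasks):
            self.reports.append(tracker.run(task))
        return self.reports

jt = JobTracker([TaskTracker("dn1"), TaskTracker("dn2")])
reports = jt.submit([lambda: sum([1, 2]), lambda: sum([3, 4])])
print([r["status"] for r in reports])  # ['COMPLETED', 'COMPLETED']
```

The real system adds what this sketch omits: periodic heartbeats from task trackers, rescheduling of failed tasks on other nodes, and data-locality-aware scheduling.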
This was a basic understanding of the Hadoop stack for carrying out big-data operations. As I explore this field further, I will post more content on the subject.
Let me know your thoughts on these mammoth data-processing systems.