DEV Community


Top Hadoop Interview Questions

jay538 ・2 min read

Q1) Explain Big data and its characteristics.

Ans. Big Data refers to data sets whose size, speed of arrival, or variety exceeds the processing capacity of conventional database systems, so they require distributed, parallel processing. The data can be structured, semi-structured, or unstructured.

Characteristics of Big Data:

Volume - It represents the amount of data, which is growing at an exponential rate and is measured in gigabytes, petabytes, exabytes, and so on.

Velocity - It refers to the rate at which data is generated, modified, and processed. At present, social media is a major contributor to the velocity of growing data.

Variety - It refers to data arriving in a variety of formats, such as audio, video, and CSV files. It can be structured, semi-structured, or unstructured.

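Veracity - It refers to the degree of uncertainty in the data. Incomplete, inconsistent, or imprecise data lowers how far the results of an analysis can be trusted.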

Value - This is the most important element of big data. It refers to the ability to turn the data into quality analytics and measurable benefit for the organization, which is what justifies the cost of the technology used.

Q2) What is Hadoop, and what are its components?

Ans. Hadoop is an open-source framework for storing large data sets and running applications across clusters of commodity hardware.

It offers large-scale storage for any type of data and can run a very large number of tasks in parallel.

Core components of Hadoop:

Storage unit– HDFS (DataNode, NameNode)
Resource management and job scheduling– YARN (NodeManager, ResourceManager)
Processing model– MapReduce

Q3) What is YARN, and what are its components?

Ans. Yet Another Resource Negotiator (YARN) is one of the core components of Hadoop. It manages resources for the applications running in a Hadoop cluster and schedules their tasks on the cluster nodes.

YARN components:

Resource Manager - It is the master daemon and controls resource allocation across the cluster.

Node Manager - It is the slave (worker) daemon that runs on every node of the cluster. It launches and monitors the containers on its own node and reports resource usage to the Resource Manager.

Application Master - There is one per application. It maintains the user job lifecycle, negotiates the resources the application needs from the Resource Manager, and works with the Node Managers to execute and monitor the tasks.

Container - It is a bundle of resources, such as CPU, RAM, disk, and network, allocated on a single node.
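To make the division of roles concrete, here is a toy sketch in Python. It is not the YARN API: the class names mirror the components listed above, but the fields (`free_memory_mb`, `free_vcores`), the `run_application` helper, and the first-fit placement rule are all assumptions chosen for illustration. Real YARN schedulers are far more elaborate.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Container:
    """A bundle of resources granted on one node (hypothetical model)."""
    node: str
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    """Per-node worker that tracks the free resources of its own node."""
    name: str
    free_memory_mb: int
    free_vcores: int

    def can_fit(self, memory_mb: int, vcores: int) -> bool:
        return self.free_memory_mb >= memory_mb and self.free_vcores >= vcores

    def launch(self, memory_mb: int, vcores: int) -> Container:
        # Reserve the resources and hand back a container on this node.
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        return Container(self.name, memory_mb, vcores)


@dataclass
class ResourceManager:
    """Cluster-wide master that decides where each container is placed."""
    nodes: List[NodeManager] = field(default_factory=list)

    def allocate(self, memory_mb: int, vcores: int) -> Optional[Container]:
        # First-fit placement: an assumption made to keep the sketch short.
        for node in self.nodes:
            if node.can_fit(memory_mb, vcores):
                return node.launch(memory_mb, vcores)
        return None  # no node has enough free resources


def run_application(rm: ResourceManager, tasks: int,
                    memory_mb: int, vcores: int) -> List[Container]:
    """Plays the Application Master role: request one container per task."""
    granted: List[Container] = []
    for _ in range(tasks):
        container = rm.allocate(memory_mb, vcores)
        if container is None:
            break  # cluster is full; remaining tasks would have to wait
        granted.append(container)
    return granted
```

For example, with a 4096 MB node and a 2048 MB node (two vcores each), an application asking for four containers of 2048 MB and one vcore receives three containers and then runs out of capacity.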

Q4) What are the Hadoop daemons, and what are their roles in a Hadoop cluster?

Ans. A daemon is a process that runs in the background. Hadoop 1.x has five such daemons (in Hadoop 2 and later, YARN's ResourceManager and NodeManager replace the JobTracker and TaskTracker). They are:

NameNode - It is the master node, responsible for storing the metadata for all the directories and files.

DataNode - It is the slave node, responsible for storing the actual data.

Secondary NameNode - Despite its name, it is not a standby or hot backup of the NameNode. It periodically merges the NameNode's edit log into the file system image (fsimage) to create a checkpoint, which keeps the edit log small and speeds up NameNode restarts.

JobTracker - It accepts jobs from clients and schedules them. It runs on a master node and assigns the tasks of each job to the TaskTrackers.

TaskTracker - It runs on the data nodes. It executes the tasks assigned to it and reports their status to the JobTracker.
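The split between NameNode (metadata) and DataNode (actual data) can be sketched with a toy in-memory model. This is only an illustration, not HDFS code: the class layout, the `write_file` and `read_file` helpers, the 4-byte block size, and the round-robin placement without replication are all assumptions made to keep the example small.

```python
from typing import Dict, List, Tuple

# Tiny on purpose; the HDFS default block size is 128 MB in Hadoop 2 and later.
BLOCK_SIZE = 4


class DataNode:
    """Stores the actual block contents."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}


class NameNode:
    """Stores only metadata: which blocks make up a file and where they live."""
    def __init__(self) -> None:
        # path -> ordered list of (block_id, datanode_name)
        self.block_map: Dict[str, List[Tuple[str, str]]] = {}


def write_file(namenode: NameNode, datanodes: List[DataNode],
               path: str, data: bytes) -> None:
    locations: List[Tuple[str, str]] = []
    for index, offset in enumerate(range(0, len(data), BLOCK_SIZE)):
        block_id = f"{path}#blk_{index}"
        # Round-robin placement, no replication (a simplifying assumption).
        node = datanodes[index % len(datanodes)]
        node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
        locations.append((block_id, node.name))
    namenode.block_map[path] = locations


def read_file(namenode: NameNode, datanodes: List[DataNode], path: str) -> bytes:
    by_name = {node.name: node for node in datanodes}
    # Ask the NameNode where the blocks are, then fetch them from the DataNodes.
    return b"".join(by_name[node_name].blocks[block_id]
                    for block_id, node_name in namenode.block_map[path])
```

Note that the NameNode object never holds file contents, only block locations, which is why losing its metadata makes the data on the DataNodes unreachable.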

Q5) What is Avro Serialization in Hadoop?

Ans. Serialization is the process of translating the state of objects or data structures into binary or textual form. Avro is a language-independent data serialization system in which the data is described by a schema written in JSON and is typically encoded in a compact binary format.
For MapReduce programs, it provides the AvroMapper and AvroReducer classes.
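As a rough illustration of a JSON schema driving a binary encoding, the sketch below hand-encodes one record in Python. It does not use the Avro library. The `User` schema and the helper names are made up for this example, and only the `string`, `int`, and `long` types are handled. The encoding rules it follows come from the Avro specification: zig-zag variable-length integers, length-prefixed UTF-8 strings, and record fields written in schema order.

```python
import json

# Hypothetical schema for illustration; Avro schemas are plain JSON.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
"""


def encode_long(value: int) -> bytes:
    """Zig-zag encode, then write 7 bits per byte, least significant group first."""
    zigzag = (value << 1) ^ (value >> 63)
    out = bytearray()
    while True:
        byte = zigzag & 0x7F
        zigzag >>= 7
        if zigzag:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(value: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    raw = value.encode("utf-8")
    return encode_long(len(raw)) + raw


# Only the primitive types used by the example schema are supported here.
ENCODERS = {"string": encode_string, "int": encode_long, "long": encode_long}


def encode_record(schema: dict, record: dict) -> bytes:
    """A record is the concatenation of its fields, in schema order."""
    return b"".join(ENCODERS[f["type"]](record[f["name"]])
                    for f in schema["fields"])


schema = json.loads(SCHEMA_JSON)
payload = encode_record(schema, {"name": "Ada", "age": 36})
```

The payload carries no field names or type tags, so a reader needs the same schema to decode it. In a real project, an Avro library would handle this encoding, along with the container file format and schema resolution.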
