Shivansh Yadav

MapReduce Vs Tez

Apache Hadoop uses MapReduce as its programming model for distributed processing of big data, but instead of writing multiple MapReduce jobs by hand, we can also use the power of Hive or Pig.

Hive: Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

Pig: Pig is a high-level platform for creating programs that run on Apache Hadoop. Its scripting language, Pig Latin, describes data transformations step by step using operators that resemble SQL's.

Both Hive queries and Pig scripts are compiled to MapReduce programs in the background, and then jobs are executed in parallel across the Hadoop cluster.

Instead of MapReduce, however, both Hive and Pig can use Tez as their execution engine.


Figure: Hadoop ecosystem with Tez


Apache Tez

Apache Tez is a framework that creates a complex Directed Acyclic Graph (DAG) of tasks for processing data.

It uses the DAG to analyze the relationships between the different steps and work out an efficient path to the result.

As a result, Tez is typically much faster than MapReduce, especially for queries that would otherwise need several chained MapReduce jobs.

The same DAG-based approach is also used by Apache Spark for large-scale data processing.
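To make the DAG idea concrete, here is a minimal sketch in plain Python. It does not use the Tez API at all; the stage names are hypothetical, and it only shows how a dependency graph lets a scheduler find stages that can run at the same time.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages of a query. Each stage maps to the stages it depends on.
stage_dependencies = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "sort": {"aggregate"},
}

def execution_waves(graph):
    """Group stages into waves; stages in one wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every stage whose inputs are done
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = execution_waves(stage_dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

Here the two scans land in the same wave, so they could run in parallel, while the join has to wait for both. A real engine like Tez also has to decide how data moves between stages, which this sketch leaves out.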

MapReduce Vs Tez

MapReduce accesses the disk/HDFS multiple times during its data flow, i.e. Mapper -> Shuffle & Sort -> Reducer. It writes and reads the data (or the modified data) at each of these steps, resulting in roughly 5-6 disk accesses for a single MapReduce job.
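The three phases are easier to follow with a small example. The sketch below is a word count simulated entirely in memory with plain Python, not Hadoop code; the function names are my own. On a real cluster, the hand-off between these phases is where the disk accesses described above happen.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Shuffle & sort phase: group values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reducer(key, values):
    """Reduce phase: combine all values that share a key."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle_and_sort(mapped))

counts = word_count(["tez runs a dag", "mapreduce runs a job"])
print(counts)
```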

On the other hand, Tez reads the data from disk, performs all the steps, keeps intermediate results in memory where possible, applies vectorization (processing a batch of rows instead of one row at a time), and produces the output.
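The batch-of-rows idea can be shown with a toy filter. This is only an illustration in plain Python, with made-up data and a made-up batch size of 4; real vectorized execution works on columnar batches inside the engine, which is far more involved.

```python
def keep_row(amount, threshold):
    """Row-at-a-time operator: one call per row."""
    return amount > threshold

def keep_batch(amounts, threshold):
    """Batch operator: one call evaluates a whole batch of column values."""
    return [amount > threshold for amount in amounts]

def batches(values, size):
    """Split a column into consecutive batches of at most `size` values."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

amounts = [5, 12, 7, 30, 18, 2, 25, 9, 14, 40]  # made-up column values
threshold = 10

row_calls, row_result = 0, []
for amount in amounts:
    row_calls += 1
    row_result.append(keep_row(amount, threshold))

batch_calls, batch_result = 0, []
for batch in batches(amounts, size=4):
    batch_calls += 1
    batch_result.extend(keep_batch(batch, threshold))

print(f"row-at-a-time calls: {row_calls}, batch calls: {batch_calls}")
```

Both loops produce the same answer, but the batch version invokes the operator far fewer times, and that per-call overhead is what vectorization aims to cut.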

While MapReduce makes multiple reads/writes to HDFS, Tez avoids unneeded access to HDFS.
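A rough way to see the difference is to count HDFS round trips for a multi-stage query. The sketch below is a deliberately simplified toy model, not a measurement: it assumes every chained MapReduce job reads its input from HDFS once and writes its output back once, while a single DAG reads only at the first stage and writes only at the last. It ignores local shuffle spills and queries with several input tables.

```python
def hdfs_accesses_chained_mapreduce(num_jobs):
    """Toy model: each job in the chain does one HDFS read and one HDFS write."""
    if num_jobs < 1:
        raise ValueError("need at least one job")
    return {"reads": num_jobs, "writes": num_jobs}

def hdfs_accesses_single_dag(num_stages):
    """Toy model: only the first stage reads from HDFS and only the last writes."""
    if num_stages < 1:
        raise ValueError("need at least one stage")
    return {"reads": 1, "writes": 1}

stages = 3  # hypothetical query with three processing stages
chained = hdfs_accesses_chained_mapreduce(stages)
single_dag = hdfs_accesses_single_dag(stages)

print("chained MapReduce:", chained)
print("single DAG:", single_dag)
```

Under these assumptions the gap grows with every extra stage, which is why multi-stage Hive and Pig workloads tend to benefit the most.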

Figure: MapReduce vs Tez
