Olga Woschitz

Originally published at Medium

BigData Journey from Hadoop and MapReduce to AWS EMR

The Times Pre-Hadoop

Before Hadoop, businesses struggled with big data: they had plenty of data but no easy way to handle it. They usually relied on costly, custom solutions, which made things complicated and inefficient. Each company had to come up with its own way of dealing with data, making it hard to get into big data analysis. Traditional High-Performance Computing (HPC) systems weren’t just pricey; they also weren’t very scalable or flexible. They were built for specific, complex calculations, not for the varied and growing types of big data.

Handling data before Hadoop

Companies faced several big challenges with old data systems:

  • Scaling Problems: The older database systems couldn’t handle growing data well. As the amount of data increased, companies had to spend a lot on hardware upgrades to keep up with storage and processing demands.
  • Inflexibility: To solve their data issues, businesses often had to buy expensive proprietary solutions. These solutions were usually limited to one vendor, draining budgets and lacking the flexibility needed for different types of data and tasks.
  • Slow Processing: Processing lots of data used to be very slow. It was hard for companies to get quick insights from their data, slowing down decision-making and innovation.

With all these issues, it became clear that a new solution was needed. Major tech companies, especially Google and Yahoo, were focused on solving these data challenges. They were motivated by the growing need for internet crawling and indexing.

Google’s early work in distributed systems and data processing really set the stage for Hadoop and other big data technologies. They published two important papers, the Google File System (GFS) and MapReduce, which offered new ways to manage and process distributed data. In 2004, Google introduced the MapReduce programming model. This was a big deal because it made it much easier to process huge amounts of data using regular hardware. It allowed developers to write code that could run in parallel, which is key for handling distributed data processing efficiently.

Seeing Google’s success, other companies in the industry, like Yahoo, started to look into similar methods to handle their data challenges. They realized the potential of this approach for managing and processing large datasets effectively.

Hadoop Technologies

When the need to manage huge amounts of data became clear, Apache Hadoop emerged as a game-changer in the Big Data world. Think of Hadoop as a helpful, open-source tool that makes Big Data manageable for everyone. Doug Cutting and Mike Cafarella initially created it, and it gained significant momentum when Yahoo began using it. Yahoo recognized Hadoop’s potential and played a crucial role in its development as a powerful Big Data tool.

Yahoo’s contribution to Hadoop was immense. They needed a cost-effective way to handle their vast data repositories, so they invested heavily in Hadoop, enhancing and refining it. Their improvements, especially to the Hadoop Distributed File System (HDFS), made Hadoop more reliable and efficient.

Hadoop made managing big data scalable, efficient, and more affordable. It uses standard hardware, which is cheaper and more accessible than the specialized hardware required for traditional HPC systems. This opened up big data analytics to smaller companies, leveling the playing field. Hadoop’s distributed processing model not only reduced costs but also added flexibility for various business needs, making large datasets a valuable asset in the competitive digital arena.

The main concept of Hadoop’s approach, often described as “bringing compute to data,” is to reduce the amount of data movement and boost efficiency by processing the data where it already exists. In traditional setups, large amounts of data are moved across a network to a central location for processing. Hadoop changes this by sending the computational tasks, usually MapReduce jobs, to the servers where the data is stored. This method allows for simultaneous processing right on the distributed data within the HDFS. As a result, it cuts down on network traffic and speeds up data processing significantly.

HDFS lies at the heart of Hadoop, addressing the challenges of storing and processing vast data volumes. Key features of HDFS include:

  • Immutability: Once data is stored in HDFS, it doesn’t change, ensuring consistency and simplifying data processing.
  • Fault Tolerance: HDFS is designed to be resilient. It replicates data across different machines, so if one fails, the data remains secure.
  • Linear Scalability: HDFS can expand without performance degradation. As more storage is needed, you can add more nodes without slowing down the system.

Data in HDFS is divided into blocks (typically 128 MB or 256 MB) and distributed across computers in a cluster. The NameNode orchestrates this process, tracking where each data block is located, while the DataNodes are responsible for storing the data.
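
To make the NameNode/DataNode split a bit more tangible, here is a minimal sketch of talking to HDFS from Python over WebHDFS. It assumes the third-party `hdfs` (HdfsCLI) package is installed and that a NameNode exposes WebHDFS on port 9870; the host, user, and paths are placeholders, not part of this article’s setup.

```python
# Minimal HDFS interaction sketch using the third-party `hdfs` (HdfsCLI) package.
# The NameNode URL, user, and paths below are hypothetical.
from hdfs import InsecureClient

# The client talks to the NameNode for metadata; actual block I/O goes to DataNodes.
client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Write a small file; HDFS would split large files into blocks behind the scenes.
with client.write("/data/demo/events.txt", encoding="utf-8", overwrite=True) as writer:
    writer.write("first event\nsecond event\n")

# List a directory and inspect file metadata tracked by the NameNode.
print(client.list("/data/demo"))
print(client.status("/data/demo/events.txt"))

# Read the file back; the client is redirected to the DataNodes holding the blocks.
with client.read("/data/demo/events.txt", encoding="utf-8") as reader:
    print(reader.read())
```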

HDFS Architecture

Another critical component of Hadoop is YARN (Yet Another Resource Negotiator). YARN acts like a resource manager, allocating resources and scheduling tasks within a Hadoop cluster. It separates job scheduling and resource management, enhancing flexibility and optimizing the use of the cluster’s resources.

Hadoop’s distributed computing approach provides several key benefits that make it a popular choice for Big Data processing:

  • Resilience: Hadoop is designed to be fault-tolerant. If one computer in the network fails, Hadoop’s data replication and recovery systems ensure that data remains safe and accessible.
  • Parallelization: Hadoop excels in dividing tasks and processing them simultaneously across multiple computers. This parallel processing capability allows for faster completion of tasks, making it highly efficient for handling large datasets.
  • Data Location Awareness: Hadoop is smart about where data is stored within its network. It processes data on the same node where it is stored, minimizing the time and resources spent on data transfer. This local processing leads to quicker and more efficient data handling.
  • Self-Healing: Hadoop has the ability to automatically handle failures within the system. If a node goes down or a task fails, Hadoop automatically reroutes the workload to other nodes and continues processing without manual intervention, ensuring minimal disruption and maintaining system integrity.

Self-Healing process

MapReduce: The Game Changer

Handling huge datasets used to be extremely challenging, like juggling a thousand balls at once. But then MapReduce, a programming model and processing framework developed by Google, changed the game in Big Data.

MapReduce brings simplicity to processing large datasets in distributed computing. It breaks down the complex task of data processing into two basic steps: map and reduce.

  • Map: This step involves splitting the data into chunks and then processing these chunks in parallel. Each chunk is processed by a map function, which transforms the data into key-value pairs.
  • Reduce: After mapping, these key-value pairs are grouped by their keys. A reduce function then takes over, performing operations like aggregation, filtering, or other custom tasks on these grouped data.
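
As a concrete illustration of these two steps, below is the classic word-count example written in the style of Hadoop Streaming, where a mapper and a reducer read from stdin and emit tab-separated key-value pairs on stdout. This is only a sketch: the function names and the idea of keeping them in separate `mapper.py` and `reducer.py` files are conventions of Hadoop Streaming, not something from this article.

```python
# Word count in the MapReduce style, written as Hadoop Streaming scripts.
import sys

# mapper.py -- emits one (word, 1) pair per word, tab-separated.
def mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

# reducer.py -- receives lines already sorted by key and sums the counts per word.
def reducer():
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")
```

On a real cluster these two functions would live in separate scripts handed to the Hadoop Streaming jar, and the framework would take care of the splitting, shuffling, and sorting between them.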

MapReduce in action

MapReduce works within distributed clusters and can be described in several stages:

  • Input Data Splitting: Data is broken into smaller pieces, called “splits.” Each split is assigned to a map task.
  • Map Phase: Map tasks independently process their assigned data, producing key-value pairs. These pairs are then sorted and shuffled based on their keys.
  • Shuffle and Sort: The framework shuffles and sorts these key-value pairs, ensuring that all values for a specific key are grouped together.
  • Reduce Phase: Reduce tasks take the sorted data and apply the reduce function to each group of key-value pairs. The final output is typically written back to HDFS or another external storage system.
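
To see how these stages fit together without a cluster, here is a toy, in-process simulation of splitting, mapping, shuffle-and-sort, and reducing. It is purely illustrative: it collapses what Hadoop distributes across many machines into a few lines of local Python.

```python
# A toy, single-process walk-through of the MapReduce stages for word count.
from itertools import groupby
from operator import itemgetter

documents = ["big data needs big tools", "hadoop brought big data to many"]

# 1. Input splitting: each document plays the role of one input split.
splits = documents

# 2. Map phase: each split is turned into (key, value) pairs independently.
mapped = [(word, 1) for split in splits for word in split.split()]

# 3. Shuffle and sort: group all values that share the same key.
mapped.sort(key=itemgetter(0))
grouped = {key: [v for _, v in pairs] for key, pairs in groupby(mapped, key=itemgetter(0))}

# 4. Reduce phase: aggregate the grouped values per key.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # {'big': 3, 'brought': 1, 'data': 2, ...}
```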

YARN works with MapReduce as a resource manager, efficiently allocating and managing resources in a Hadoop cluster, and enabling multiple applications to run simultaneously.

MapReduce has been applied in various domains, such as web search, log analysis, recommendation systems, natural language processing (NLP), and genomic data analysis, showcasing its versatility and effectiveness in handling diverse big data challenges. It has made a significant impact on Big Data for several reasons:

  • Scalability: It can process large datasets across extensive hardware clusters.
  • Resilience: MapReduce handles failures of cluster nodes without data loss.
  • Simplicity: It makes distributed computing easier for more developers.

Spark: The Rise of In-Memory Computing

Apache Spark marked a significant advancement in the Big Data realm, going a step beyond Hadoop and MapReduce with its state-of-the-art in-memory computing. This innovation greatly sped up data processing, particularly in areas where MapReduce was less efficient, like iterative machine learning algorithms and graph processing.

  • In-Memory Computing: Spark’s core feature is its ability to process data directly in RAM across server clusters. This bypasses the slower disk-based operations typical in Hadoop’s MapReduce, resulting in a substantial speed boost. This makes Spark ideal for complex data tasks that require rapid processing.
  • Lazy Evaluation: Spark optimizes data processing by delaying computations until they are absolutely necessary. This lazy evaluation means that Spark waits until output is required before performing any calculations.
  • DAG Optimization: Spark uses a Directed Acyclic Graph (DAG) to optimize workflows. This approach helps streamline the processing tasks by eliminating redundant operations, which saves time and conserves computing resources.
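
A short PySpark sketch shows lazy evaluation in practice: the transformations below only describe a DAG, and nothing runs until an action such as take is called. It assumes PySpark is installed and running locally; the input path is a placeholder.

```python
# Lazy evaluation in PySpark: transformations build a DAG, actions trigger execution.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").master("local[*]").getOrCreate()

lines = spark.sparkContext.textFile("logs.txt")  # placeholder path; nothing is read yet

# Transformations: each call just extends the DAG; no job runs here.
errors = (
    lines.filter(lambda line: "ERROR" in line)
         .map(lambda line: (line.split()[0], 1))
         .reduceByKey(lambda a, b: a + b)
)

# Action: only now does Spark plan the DAG, read the input, and execute in memory.
print(errors.take(5))

spark.stop()
```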

MapReduce vs Spark

Spark offers APIs in Scala, Python, and Java, allowing developers to leverage their existing programming skills. This flexibility is a significant advantage for Spark, making it accessible to a wider range of users. Additionally, Spark provides a variety of tools for different Big Data operations:

  • SQL: For handling structured data in a SQL-like manner.
  • Streaming Data: For processing real-time data streams.
  • Machine Learning: Offering MLlib for machine learning tasks.
  • Graph Processing: With GraphX for graph algorithms and analysis.

Spark also features an interactive shell (REPL) for Scala and Python. This tool is particularly useful for exploratory data analysis because it provides immediate feedback, and it helps developers learn and debug, enhancing overall productivity.
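
To give a feel for the SQL side of that toolset, here is a small Spark SQL sketch that registers a DataFrame as a temporary view and queries it with plain SQL. The data is made up inline so the snippet stays self-contained; the column and view names are illustrative.

```python
# Structured data with Spark SQL: build a DataFrame, expose it as a view, query it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()

# Inline sample data so the example is self-contained.
df = spark.createDataFrame(
    [("2024-01-01", "search", 120), ("2024-01-01", "checkout", 40), ("2024-01-02", "search", 95)],
    ["day", "event", "hits"],
)

df.createOrReplaceTempView("events")

# Familiar SQL over distributed data.
spark.sql("""
    SELECT day, SUM(hits) AS total_hits
    FROM events
    GROUP BY day
    ORDER BY day
""").show()

spark.stop()
```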

Hive: Simplifying Data Querying

In the ever-evolving world of Big Data, Apache Hive emerged as a key player, bringing the ease and familiarity of SQL to the vast Hadoop ecosystem. It simplified how we interact with large datasets stored in the HDFS.

  • Data Warehouse Powerhouse: Hive acts as a data warehouse layer on top of Hadoop. It translates SQL-like queries (HiveQL) into tasks for Hadoop clusters, and those tasks can be executed by different engines such as MapReduce, Tez, or Spark.
  • Apache Tez Integration: Tez is an advanced framework designed for high-performance batch and interactive data processing. When coordinated with YARN in Apache Hadoop, Tez enhances the MapReduce model by boosting speed while maintaining scalability for massive datasets. Hive’s compatibility with Tez enables faster computations without sacrificing the robustness of MapReduce.
  • Metastore: At the heart of Hive’s functionality is its Metastore, which acts as a repository for data structure metadata within HDFS. This feature allows for structured queries across various data formats, making data management more organized and SQL-friendly.
  • Lazy Reading: Hive employs a ‘lazy reading’ approach, fetching data only when necessary. This strategy conserves computational resources and reduces latency, optimizing the execution of complex queries on large datasets. This approach is particularly beneficial in Big Data analytics where efficiency is key.

Apache Hive has transformed the landscape of data warehousing by integrating the simplicity of SQL with the vast scale of Big Data. Its power lies in abstraction; it allows companies to concentrate on deriving insights and conducting analyses without getting bogged down by the intricacies of data processing mechanisms. This makes it a valuable tool for businesses looking to leverage their data in an efficient and user-friendly manner.
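
As a sketch of what that SQL-friendly abstraction looks like from client code, the snippet below sends HiveQL to a HiveServer2 endpoint. It assumes the third-party PyHive package is installed and that HiveServer2 is reachable on port 10000; the host, database, and table names are placeholders.

```python
# Querying Hive from Python over HiveServer2, assuming the third-party PyHive package.
from pyhive import hive

# Placeholder connection details for a hypothetical HiveServer2 endpoint.
conn = hive.Connection(host="hive-server", port=10000, username="hadoop", database="default")
cursor = conn.cursor()

# Hive translates this HiveQL into jobs for the configured engine (MapReduce, Tez, or Spark);
# the table definition and file locations come from the Metastore.
cursor.execute("""
    SELECT event, COUNT(*) AS total
    FROM web_logs
    GROUP BY event
    ORDER BY total DESC
    LIMIT 10
""")

for event, total in cursor.fetchall():
    print(event, total)

cursor.close()
conn.close()
```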

Hadoop with Hive

The Emergence of AWS Elastic MapReduce (EMR)

Big Data processing has evolved to include cloud-native solutions, and AWS EMR (Elastic MapReduce) is a prime example. EMR is a service that makes it easier to use big data frameworks like Apache Hadoop and Apache Spark. It handles the complicated parts of setting up and managing a Hadoop cluster, including provisioning the hardware, installing software, and taking care of tasks like scaling and tuning the cluster. EMR also works with various big data applications, including Hadoop, Spark, HBase, Presto, and Flink, which allows it to handle different types of data processing, such as batch and stream processing, interactive analytics, and machine learning.
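
As an illustration of how little cluster plumbing is left to the user, here is a hedged boto3 sketch that launches a transient EMR cluster, runs one Spark step, and lets the cluster terminate itself. The release label, S3 paths, and IAM role names are placeholders that would need to match a real AWS account.

```python
# Launching a transient EMR cluster with one Spark step via boto3.
# Release label, S3 URIs, and IAM roles below are placeholders for a real account.
import boto3

emr = boto3.client("emr", region_name="eu-central-1")

response = emr.run_job_flow(
    Name="demo-spark-cluster",
    ReleaseLabel="emr-6.15.0",               # example release; pick a current one
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://my-demo-bucket/emr-logs/",
    Instances={
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once the step finishes
    },
    Steps=[
        {
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-demo-bucket/jobs/etl_job.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Cluster id:", response["JobFlowId"])
```

Because KeepJobFlowAliveWhenNoSteps is set to False, the cluster shuts itself down after the step completes, which is what makes transient, per-job clusters cost-effective.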

The introduction of cloud technologies like Amazon S3 and EMR has led to a change in Hadoop’s “bringing compute to data” approach. These cloud services simplify the management of data location by effectively handling data transfers and computing resources. As a result, the need to process data locally has decreased. These advanced cloud networks and infrastructures reduce the issues associated with moving data. So, we’re seeing a shift back to a more centralized computing model, similar to what existed before Hadoop.

Final thoughts

The story of Big Data has been a tale of constant change and advancement. Initially, dealing with large datasets was a complex challenge. Then, the introduction of Hadoop and MapReduce revolutionized the field, making Big Data operations more accessible to many. The emergence of Apache Spark, with its fast in-memory computing, significantly increased processing speeds and broadened the scope of Big Data analytics. Apache Hive further simplified Big Data by introducing SQL-like functionality, allowing for advanced data warehousing within Hadoop’s scalable system. Now, AWS EMR stands as a pinnacle in this evolution, offering a powerful, adaptable, and cost-effective cloud-based platform that builds on the strengths of its predecessors while enabling new data processing possibilities.

The journey through the Big Data landscape is ongoing. With advancements in machine learning and AI, the spread of IoT devices, and the continual development of cloud technologies, the future of Big Data processing is set to be even more exciting and transformative.
