Introduction to Apache Spark

Saloni Goyal ・3 min read

MapReduce and Spark are both used for large-scale data processing. However, MapReduce has some shortcomings that make Spark more useful in a number of scenarios.

Shortcomings of MapReduce

  1. Every workflow has to go through a map phase and a reduce phase: It does not naturally accommodate a join, a filter, or more complicated workflows like map-reduce-map.
  2. MapReduce relies heavily on reading data from disk: This is a performance bottleneck, and it is especially bad for iterative algorithms, which may cycle through the data several times.
  3. The only native programming interface is Java: Python can also be used, but it makes the implementation more complex and is not very efficient for floating-point data.
  4. Programming is not that easy and requires a lot of hand coding.
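To make the fixed map-then-reduce shape concrete, here is a minimal pure-Python sketch of a word count. It is only a toy model of the pattern, not Hadoop code; the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical and chosen for illustration.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group all values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse the list of values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}

# Toy input; in a real job these lines would come from files on disk.
lines = ["spark and hadoop", "spark on yarn"]
word_counts = reduce_phase(shuffle(map_phase(lines)))
```

Anything that does not fit this one map step followed by one reduce step, such as a further map over `word_counts`, would have to be expressed as a second job in this model.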

Solution: Apache Spark

  • A new framework: Not a complete replacement for the Hadoop stack, just a replacement for Hadoop MapReduce, and more
  • Capable of using the Hadoop ecosystem, e.g., HDFS and YARN

Solutions by Spark

  1. Spark provides over 20 highly efficient distributed operations: They can be used in combination
  2. The user can choose to cache data in memory: This increases performance for iterative algorithms
  3. Polyglot: Native Java, Python, Scala, and R interfaces, along with an interactive shell (to test and explore data interactively)
  4. Easy to program, and does not require that much hand coding
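The benefit of caching for iterative algorithms (point 2) can be sketched with a toy dataset that counts how often it is read from "disk". This is a plain-Python illustration under simplifying assumptions, not Spark's API; `ToyDataset` and `run_iterations` are hypothetical names.

```python
class ToyDataset:
    """Toy dataset that counts how often its data is read from 'disk'."""

    def __init__(self, values):
        self._on_disk = list(values)   # stands in for data stored on disk
        self._in_memory = None         # filled on first read if caching is on
        self._cache_enabled = False
        self.disk_reads = 0

    def cache(self):
        """Request that the data be kept in memory after the first read."""
        self._cache_enabled = True
        return self

    def read(self):
        if self._in_memory is not None:
            return self._in_memory     # served from memory, no disk access
        self.disk_reads += 1
        data = list(self._on_disk)
        if self._cache_enabled:
            self._in_memory = data
        return data

def run_iterations(dataset, iterations):
    """An 'iterative algorithm' that scans the whole dataset every round."""
    total = 0
    for _ in range(iterations):
        total += sum(dataset.read())
    return total

uncached = ToyDataset(range(5))
cached = ToyDataset(range(5)).cache()
uncached_total = run_iterations(uncached, 10)
cached_total = run_iterations(cached, 10)
```

Both runs compute the same total, but in this sketch the uncached dataset goes to "disk" once per iteration while the cached one goes only once.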

[Figure: 100 TB Benchmark]

Spark Architecture

Apache Spark has a well-defined layered architecture in which all the Spark components and layers are loosely coupled.

Worker Node

  • General JVM executor that executes Spark workflows
  • The core on which all computation is done
  • Interface to the rest of the Hadoop ecosystem, e.g., HDFS

Cluster Manager

  • System that manages the provisioning and starting of worker nodes
  • Cluster manager interfaces supported by Spark:
      • YARN (Hadoop cluster manager): The same cluster can be used with both Hadoop MapReduce and Spark.
      • Standalone: A special Spark process that takes care of starting nodes at the beginning of a computation and restarting them on failure.

Driver Program

  • Interface with the cluster
  • Has a JVM that holds a Spark context: The gateway through which we connect to our Spark instance and submit jobs.
  • Jobs can be submitted in:
      • Batch mode: Send the program for execution and wait for the result
      • Interactive mode: Use the Spark shell and interact with the data in real time

Cloudera VM Setup vs Amazon EMR

Cloudera VM Setup

  • Uses Spark in standalone mode
  • Everything runs locally on one machine
  • Worker node (executor JVM), Spark process, and driver program are all on the same machine

Amazon EMR

  • Supports Spark natively
  • Web interface to configure the number and type of instances, memory required, etc.
  • Amazon EMR automatically runs YARN to spawn instances and prepares them to run Spark.
  • Executor JVMs run on EC2 instances
  • Driver program and YARN run on the master node

Resilient Distributed Datasets

Dataset

  • A data collection created from HDFS, S3, HBase, JSON, text, etc.: Once Spark reads the data, it can be referenced through an RDD
  • Or created by transforming another RDD: RDDs are immutable, so a transformation always produces a new RDD

Distributed

  • Distributed across a cluster of machines
  • Data is divided into partitions, and the partitions are divided across machines
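The two-level split described above (data into partitions, partitions onto machines) can be sketched in plain Python. This is a toy model that assumes simple round-robin placement; Spark's actual partitioning and scheduling are more involved, and the names `partition`, `assign_to_machines`, `node-a`, and `node-b` are hypothetical.

```python
def partition(records, num_partitions):
    """Split records into num_partitions chunks, round-robin."""
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions

def assign_to_machines(partitions, machines):
    """Spread partitions over machines, round-robin (toy placement policy)."""
    placement = {machine: [] for machine in machines}
    for index, part in enumerate(partitions):
        placement[machines[index % len(machines)]].append(part)
    return placement

records = list(range(10))
partitions = partition(records, 4)
placement = assign_to_machines(partitions, ["node-a", "node-b"])
```

Each machine ends up holding several partitions, and every record belongs to exactly one partition.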

Resilient

  • Spark tracks the history of each partition
  • Recovers from errors such as node failures and slow processes by using that history to recompute the affected partitions
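As a rough illustration of recovery from tracked history, the sketch below models a partition that remembers its stable source and the steps applied to it, so its data can be rebuilt after being lost. It is a simplified toy, not Spark's internal mechanism; `ToyPartition` and its methods are hypothetical names.

```python
class ToyPartition:
    """A partition that remembers how it was derived (its history)."""

    def __init__(self, source, steps=()):
        self.source = source          # stable input, e.g. a block in HDFS
        self.steps = tuple(steps)     # functions applied so far, in order
        self.data = None              # computed values, may be lost

    def transform(self, func):
        """Return a new partition whose history has one more step."""
        return ToyPartition(self.source, self.steps + (func,))

    def compute(self):
        """Replay the recorded steps on the source to (re)build the data."""
        data = list(self.source)
        for step in self.steps:
            data = [step(x) for x in data]
        self.data = data
        return data

    def lose(self):
        """Simulate a node failure that wipes the computed data."""
        self.data = None

    def get(self):
        return self.data if self.data is not None else self.compute()

base = ToyPartition([1, 2, 3])
derived = base.transform(lambda x: x * 2).transform(lambda x: x + 1)
first = derived.get()
derived.lose()            # the node holding this partition "fails"
recovered = derived.get() # rebuilt by replaying the recorded steps
```

Because only the steps are needed to rebuild the data, nothing has to be replicated up front in this model.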

Spark Transformations

RDDs are immutable, but we can transform one RDD into another RDD. Spark applies transformations lazily: execution does not start until an action is triggered.
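Lazy evaluation can be sketched with a small pure-Python class that only records transformations and runs them when an action is called. This is a toy stand-in rather than Spark's RDD class; `LazyRDD`, `square`, and `calls` are hypothetical names, and `collect` plays the role of the action.

```python
class LazyRDD:
    """Toy immutable dataset: transformations are recorded, not executed."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = tuple(ops)

    def map(self, func):
        # Returns a NEW LazyRDD; the original is left unchanged.
        return LazyRDD(self._data, self._ops + (("map", func),))

    def filter(self, predicate):
        return LazyRDD(self._data, self._ops + (("filter", predicate),))

    def collect(self):
        """Action: only now are the recorded transformations executed."""
        result = list(self._data)
        for kind, func in self._ops:
            if kind == "map":
                result = [func(x) for x in result]
            else:
                result = [x for x in result if func(x)]
        return result

calls = []  # records every time square() actually runs

def square(x):
    calls.append(x)
    return x * x

numbers = LazyRDD([1, 2, 3, 4])
squares = numbers.map(square)
even_squares = squares.filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # still 0: nothing has executed yet
result = even_squares.collect()    # triggers the whole chain
```

Here `square` is not called when `map` is declared, only when `collect` runs, and `numbers` itself is never modified.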

Narrow transformations

  • For example, map and filter
  • Do not require transferring data through the network
  • Performance depends on memory and CPU

Wide transformations

  • For example, groupByKey transfers data with the same key to the same partition, which requires a shuffle operation across the network
  • Performance also depends on the interconnection speed between nodes
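The difference between the two kinds of transformation can be sketched in plain Python by counting how many records have to change partition. This is a toy model that assumes integer keys and a modulo placement rule; `map_partitions` and `group_by_key` are hypothetical helpers, not Spark's implementation.

```python
def map_partitions(partitions, func):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[func(record) for record in part] for part in partitions]

def group_by_key(partitions, num_partitions):
    """Wide: records with the same key are sent to the same output partition."""
    shuffled = [dict() for _ in range(num_partitions)]
    moved = 0  # records that had to leave their original partition
    for source_index, part in enumerate(partitions):
        for key, value in part:
            # Integer keys with modulo keep the toy placement deterministic.
            target_index = key % num_partitions
            if target_index != source_index:
                moved += 1
            shuffled[target_index].setdefault(key, []).append(value)
    return shuffled, moved

pairs = [[(1, "a"), (2, "b")], [(1, "c"), (2, "d")]]

# Narrow step: values are upper-cased in place, partition by partition.
upper = map_partitions(pairs, lambda kv: (kv[0], kv[1].upper()))

# Wide step: values for each key are gathered into one partition.
grouped, moved = group_by_key(upper, 2)
```

In this example the map step moves no records at all, while the grouping step moves half of them, which is why wide transformations are sensitive to network speed.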

Source: Introduction to Apache Spark
