Introduction to Apache Spark

#spark #hadoop #beginners #mapreduce

MapReduce and Spark are both used for large-scale data processing. However, MapReduce has some shortcomings which renders Spark more useful in a number of scenarios.

Shortcomings of MapReduce

Every workflow has to go through a map and reduce phase: Can’t accommodate a join, filter or more complicated workflows like map- reduce-map.
MapReduce relies heavily on reading data from disk: Performance bottleneck, especially bad for iterative algorithms which may cycle through the data several times.
Only native Java programming interface available: Python is also available, but it makes implementation complex and is not very efficient for floating point data.
Not that easy in terms of programming and requires lots of hand coding.

Solution — Apache Spark

A new framework: Not a complete replacement of the Hadoop stack, just a replacement for Hadoop MapReduce and more
Capable to using Hadoop ecosystem, e.g., HDFS, yarn

Solutions by Spark

Spark provides over 20 highly efficient distributed operations: Can be used in combination
User can choose to cache data in memory: Increases performance for iterative algorithms
Polyglot: Native Java, Python, Scala, R interfaces along with interactive shell (test and explore data interactively on shell)
Easy to program and does not require that much hand coding

Spark Architecture

Apache Spark has a well-defined layered architecture where all the spark components and layers are loosely coupled.

Worker Node

General JVM executor to execute spark workflows
Core on which all computation is done
Interface to rest of the Hadoop ecosystem, e.g., HDFS

Cluster Manager

System to manage provisioning and starting of worker nodes
Cluster manager interfaces supported by spark –
YARN (Hadoop cluster manager): Same cluster can be used with both Hadoop MR and spark.
Standalone: Special spark process that takes care of starting nodes at beginning of computation and restarts them on failures.

Driver Program

Interface with cluster
Has a JVM that has a spark context: Gateway for us to connect to our spark instance and submit jobs.
Jobs can be submitted in -
Batch mode : Send program for execution and wait for result
Streaming mode: Use spark shell and interact in real-time with data

Cloudera VM Setup vs Amazon EMR

Cloudera VM Setup

Using spark in standalone mode
Everything running locally on one machine
Worker node (executor JVM), spark process and driver program on the same machine

Amazon EMR

Supports spark natively
Web interface to configure number, type of instances, memory required, etc.
Amazon EMR automatically runs YARN to spawn instances and prepares them to be executed with spark.
Executor JVMs run on EC2 interfaces
Driver program and YARN running on the master node

Resilient Distributed Datasets

Dataset

Data storage created from HDFS, S3, HBase, JSON, text, etc.: Once spark reads the data, it can be referenced with an RDD
Or from transforming another RDD: RDD are immutable

Distributed

Distributes across cluster of machines
Data is divided into partitions and partitions divided across machines

Resilient

Spark tracks history of each partition
Error recovery like node failures, slow processes

Spark Transformations

RDD are immutable but we can transform one RDD to another RDD. Spark does lazy transformations (execution will not start until an action is triggered).

Narrow transformations

Like map and filter
Do not imply transferring data through the network
Depends on memory and CPU

Wide transformations

For example, groupByKey transfers data with same key to same partition Shuffle operation across network
Also depended on interconnection speed between nodes

For further reading: Apache Spark Tutorial: Get Started With Serving ML Models With Spark

Source: Introduction to Apache Spark