Did you just say you need to handle a minimum of 100 TB of data (volume) that is generated at high speed (velocity) from different sources consisting of structured data like CSVs, semi-structured data like log files and unstructured data like video files (variety), that is also trustworthy and representative (veracity), and that can give insights leading to groundbreaking discoveries and reduced costs (value)? 😲 Good Gracious! This is Big Data!
We would need a cluster of machines, not just a single machine, to process big data, and this is where Spark comes into play. With Spark, you can distribute data and its computation among the nodes of a cluster - each node holds a subset of the data, and processing is done in parallel across the nodes. Spark does all of this in memory, which makes it lightning fast!!!⚡
Spark is made up of several components. One of them is Spark Core, the heart of Apache Spark and the basis for the other components. Spark Core is built around the Resilient Distributed Dataset (RDD), the fundamental data structure in Apache Spark.
To develop Spark solutions, we can use Scala, Python, Java or R. Here, I will build an introductory Spark application in Python via PySpark, the Python API for Apache Spark. We will take a look at an introductory example using an RDD - the .csv file used in this example is here. Without further ado, as Spark does it ⚡, let's jump right in below -
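The embedded 1.py gist does not render here, so below is a sketch of the whole pipeline, reconstructed from the walkthrough that follows. The line numbers cited in the walkthrough refer to the original file; in this sketch the driver is wrapped in a main() function, and the helper names to_sum_and_count and add_pairs, the app name, and the commented-out main() call are my own additions, not from the original script.

```python
def get_class_and_subject(line):
    # "s400,c204,10" -> ('c204', 10)
    fields = line.split(",")
    class_id = fields[1]
    number_of_subjects = int(fields[2])  # cast explicitly to an int
    return (class_id, number_of_subjects)


def to_sum_and_count(number_of_subjects):
    # ('c204', 10) -> ('c204', (10, 1)); used with mapValues,
    # which never touches the key
    return (number_of_subjects, 1)


def add_pairs(x, y):
    # used with reduceByKey to sum subject totals and counts per class
    return (x[0] + y[0], x[1] + y[1])


def main():
    # pyspark import kept local so the helpers above can be reused standalone
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("AverageSubjectsByClass")
    sc = SparkContext(conf=conf)

    student_subject_rdd = sc.textFile("student_subject.csv")
    class_subject_rdd = student_subject_rdd.map(get_class_and_subject)
    modified_class_subject_rdd = class_subject_rdd.mapValues(to_sum_and_count)
    total_class_subject_rdd = modified_class_subject_rdd.reduceByKey(add_pairs)
    average_class_subject_rdd = total_class_subject_rdd.mapValues(
        lambda sum_count: sum_count[0] / sum_count[1]
    )

    for class_id, average in average_class_subject_rdd.collect():
        print(class_id, average)


# main()  # uncomment to run, e.g. with spark-submit 1.py
```

Run it with spark-submit (or plain python, with PySpark installed) after uncommenting the main() call.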
From 1.py above -
- we compute the average number of subjects by class. In student_subject.csv, an entry like s400,c204,10 represents a student_id, class_id and number of subjects.
- Lines 1 to 4 - we import the needed PySpark classes, set up the configuration and use it to instantiate the SparkContext class. local creates a local cluster with only 1 core on your local machine.
- Line 14 - we read in the csv file, which now becomes an RDD (student_subject_rdd), where every line entry is a value.
- Line 16 - we transform student_subject_rdd into an RDD of key-value pairs of class_id and number_of_subjects, e.g. ('c204', 10).
- Lines 7 to 11 - the get_class_and_subject function splits each line entry by commas, picks out the needed fields by indexing and returns them as a (class_id, number_of_subjects) pair.
- Note that on Line 10, the number_of_subjects is explicitly cast to an int.
- Lines 18 to 19 - transform the RDD further by pairing each value with a count of 1. One key difference between map and mapValues is that with mapValues the key cannot be modified, so it is not even passed in: for a key-value pair of ('c204', 10), only 10 is passed in. So, modified_class_subject_rdd will contain something like ('c204', (10, 1)), where c204 is the key and (10, 1) is the value.
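The key-preserving behaviour of mapValues can be mimicked in plain Python over a list of pairs - no Spark needed (the c301 row below is made up for illustration, and map_values is my stand-in, not the PySpark method):

```python
pairs = [('c204', 10), ('c301', 8)]

def map_values(f, pairs):
    # like mapValues: f sees only the value; the key passes through untouched
    return [(key, f(value)) for key, value in pairs]

modified = map_values(lambda v: (v, 1), pairs)
print(modified)  # [('c204', (10, 1)), ('c301', (8, 1))]
```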
- Lines 21 to 23 - reduceByKey combines the values for the same key.
- remember that modified_class_subject_rdd can contain multiple items for the same key, e.g. [('c204', (10, 1)), ('c204', (8, 1)), ('c204', (7, 1)), ('c204', (6, 1)), ('c204', (7, 1))]
- with the reduceByKey transformation, it becomes [('c204', (38, 5))], where c204 is the key and (38, 5) is the value, representing the sum total of subjects done and the frequency count respectively for class_id c204.
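A plain-Python sketch of that combining step, using the example list above (reduce_by_key here is my stand-in - the real reduceByKey does this combining in parallel across partitions):

```python
items = [('c204', (10, 1)), ('c204', (8, 1)), ('c204', (7, 1)),
         ('c204', (6, 1)), ('c204', (7, 1))]

def reduce_by_key(func, items):
    # combine the values that share a key, two at a time, like reduceByKey
    combined = {}
    for key, value in items:
        combined[key] = func(combined[key], value) if key in combined else value
    return list(combined.items())

totals = reduce_by_key(lambda x, y: (x[0] + y[0], x[1] + y[1]), items)
print(totals)  # [('c204', (38, 5))]

# dividing the sum by the count gives the per-class average
total, count = totals[0][1]
print(total / count)  # 7.6
```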
- Lines 25 to 26 - compute the average by class, while Lines 28 to 32 collect the results into a list and print them.
Awesome! So from the above, we cooked up an example Spark application using the RDD. Methods called on an RDD are either transformations or actions. A transformation like mapValues just produces another RDD, while an action like collect produces a result. Essentially, the transformations on an RDD only execute when an action is called. This concept of Lazy Evaluation increases speed, since no execution starts until an action is triggered and Spark can plan the whole chain of work up front.
Spark is amazing, and I know you are all Sparked up to become a Big Data bravura. Stay tuned for the next article in this series, Introducing Spark DataFrames - a data structure built on top of the RDD that is much easier to use than the core RDD API. Have an amazing Sparked up week! 😉