PySpark and Parquet - Analysis

#pyspark #parquet #bigdata #analysis

Pandas is often called a data scientist's toolbox because of the breadth of functionality the library provides. But Pandas performs poorly on big data, and on any data that does not fit in memory. That is where Spark comes into the picture. Spark is a cluster computing system built for large-scale data processing.

Why use Spark?

Spark is a framework that supports both batch processing (the kind of work done with Hadoop MapReduce) and real-time processing (the kind done with Apache Storm).

Many high-level APIs are available for Spark. You can code in Scala, Java, Python, and R.

Abstractions and Concepts

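DAG (Directed Acyclic Graph) - Directed graph which controls the flow of execution.
SparkContext - Entry point to the Spark cluster; it orchestrates the work within the cluster.
Resilient Distributed Dataset (RDD) - Immutable; any operation (like a filter or a map) on an RDD creates a new RDD.
Transformations - Operations on RDDs that produce new RDDs. They are lazy: nothing is computed when a transformation is declared.
Actions - Operations that trigger execution of the DAG and return a result. Processing happens only as and when an action needs it. Collecting data and getting counts are both actions.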
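
The split between lazy transformations and actions can be sketched in PySpark as follows. This is a minimal illustration rather than code from the original pipeline: the function name count_even_squares is hypothetical, and sc is assumed to be an existing SparkContext (notebooks such as Zeppelin usually predefine one).

```python
def count_even_squares(sc, n):
    """Count the even squares of 0..n-1 with an RDD pipeline.

    `sc` is assumed to be an existing SparkContext.
    """
    numbers = sc.parallelize(range(n))             # create an RDD
    squares = numbers.map(lambda x: x * x)         # transformation: new RDD, nothing runs yet
    evens = squares.filter(lambda x: x % 2 == 0)   # transformation: another new RDD
    return evens.count()                           # action: Spark now executes the DAG
```

Calling count_even_squares(sc, 10) returns 5, because the even squares are 0, 4, 16, 36 and 64. Until count() is called, Spark has only recorded how each RDD is derived from the previous one.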

What is PySpark?

Pandas gives you the best tools for data analysis, and Spark gives you the capability to work with big data. PySpark gives you the best of both worlds: you can work with big data without having to know the underlying mechanisms of cluster computing, and you get a Pandas-like syntax for working with data frames.

For example, reading a CSV in Pandas:

df = pd.read_csv('sample.csv')

In PySpark, the syntax is very similar:

df = spark.read.csv('sample.csv')

Although there are a couple of differences in syntax between the two libraries, the learning curve from one to the other is gentle, so you can focus on building applications.
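
One of those differences is worth a sketch. By default, pd.read_csv treats the first row as a header and infers column types, while spark.read.csv does neither unless asked. The helper below passes the two reader options that bring the result closer to what Pandas returns. Its name is hypothetical, and spark is assumed to be an existing SparkSession.

```python
def read_csv_like_pandas(spark, path):
    """Read a CSV with a header row and inferred column types.

    `spark` is assumed to be an existing SparkSession.
    """
    # Without these options every column is read as a string and
    # the columns are named _c0, _c1, ...
    return spark.read.csv(path, header=True, inferSchema=True)
```

Note that inferSchema makes Spark take an extra pass over the data, so for large inputs an explicit schema is usually the cheaper choice.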

Working with Parquet

Parquet is a high-performance columnar storage format.
To read more about the benefits of storing data in columnar storage, please visit: https://blog.openbridge.com/how-to-be-a-hero-with-powerful-parquet-google-and-amazon-f2ae0f35ee04.

To read a Parquet file from S3, we can use the following command:

df = spark.read.parquet("s3://path/to/parquet/file.parquet")
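
Because Parquet stores data column by column, a query that needs only a few columns does not have to read the rest of the file. A minimal sketch, assuming an existing SparkSession named spark; the helper name is hypothetical:

```python
def read_parquet_columns(spark, path, columns):
    """Load only the named columns from a Parquet dataset."""
    # select() is lazy, so Spark can push the column list down to the
    # Parquet reader and skip the columns that are not requested.
    return spark.read.parquet(path).select(*columns)
```

For example, read_parquet_columns(spark, "s3://path/to/parquet/file.parquet", ["device_id", "event_time"]) would load two columns, where both column names are made up for illustration.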

Converting compressed CSV files to Parquet with snappy compression takes approximately 40 minutes for each day's worth of POI_EVENTS data. This can be done easily with PySpark using the following code:

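# Convert each day of May 2018 from CSV to snappy-compressed Parquet.
# sqlContext is assumed to be predefined, as it is in a Zeppelin notebook.
for i in range(1, 32):  # May has 31 days
    print("Processing Day %02d" % i)
    csv_path = "s3://test-data/2018/05/%02d/*" % i
    p_path = "s3://locally-sanjay/2018/05/%02d/" % i
    # The source CSV files have no header row.
    df = sqlContext.read.option("header", "false").csv(csv_path)
    df.write.parquet(p_path, mode='overwrite', compression='snappy')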
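
The loop above hard-codes the month and the number of days. A hedged variant that derives the day range from the calendar is sketched below. The function names are hypothetical, the bucket names are the ones used above, and sql_context is assumed to be an existing SQLContext or SparkSession.

```python
import calendar


def day_paths(year, month, csv_root="s3://test-data",
              parquet_root="s3://locally-sanjay"):
    """Yield (day, csv_path, parquet_path) for every day of the month."""
    _, days_in_month = calendar.monthrange(year, month)
    for day in range(1, days_in_month + 1):
        suffix = "%04d/%02d/%02d" % (year, month, day)
        yield day, "%s/%s/*" % (csv_root, suffix), "%s/%s/" % (parquet_root, suffix)


def convert_month(sql_context, year, month):
    """Convert one month of daily CSV folders to Parquet."""
    for day, csv_path, parquet_path in day_paths(year, month):
        print("Processing Day %02d" % day)
        df = sql_context.read.option("header", "false").csv(csv_path)
        df.write.parquet(parquet_path, mode="overwrite")
```

Calling convert_month(sqlContext, 2018, 5) would cover all 31 days of May 2018, and the same call works unchanged for shorter months.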

Downsides of using PySpark

The main downside of using PySpark is that visualisation is not supported yet, so you need to convert the data frame to Pandas to visualise it. Converting a whole data frame is not recommended, because converting a PySpark dataframe to a Pandas dataframe loads all the data into memory. However, you can use the “sample” method to convert part of the PySpark dataframe to Pandas and then visualise it.
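
The sample-then-convert approach can be sketched as follows. The helper names, the default target of 10000 rows, and the seed are all assumptions made for illustration; df is assumed to be a PySpark dataframe.

```python
def sample_fraction(total_rows, target_rows):
    """Fraction of rows to keep so that roughly target_rows remain."""
    if total_rows <= 0:
        return 1.0
    return min(1.0, target_rows / total_rows)


def sample_to_pandas(df, target_rows=10000, seed=42):
    """Convert a random sample of a PySpark dataframe to Pandas."""
    fraction = sample_fraction(df.count(), target_rows)  # count() is an action
    # sample() is approximate: the result holds roughly fraction * count rows.
    sampled = df.sample(withReplacement=False, fraction=fraction, seed=seed)
    return sampled.toPandas()  # only the sample is loaded into memory
```

The Pandas dataframe that comes back can then be passed to any plotting library. Since sample() does not guarantee an exact row count, treat target_rows as a rough upper bound on size rather than a promise.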

Best practices using PySpark

The pyspark.sql.functions library provides built-in functions for most transformation work. You can write your own Python UDFs for transformations, but it is not recommended. The built-ins run inside the JVM that Spark uses, whereas running Python UDFs makes the JVM spin up a Python instance, pipe all the data to that instance, and pipe all the output back to the JVM, which affects performance significantly. Two further rules of thumb:
Do not iterate row by row in dataframes.
Do not convert the entire PySpark dataframe to Pandas; always sample first and convert the sample.
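
The preference for built-ins over Python UDFs can be sketched with a small example. The helper name is hypothetical, df is assumed to be a PySpark dataframe, and the column is assumed to have a simple name that needs no quoting.

```python
def add_uppercase_column(df, column):
    """Return df with an extra `<column>_upper` column."""
    # upper() is a Spark SQL built-in, so this expression is evaluated
    # inside the JVM and no rows are shipped to a Python worker.
    return df.selectExpr("*", "upper(%s) AS %s_upper" % (column, column))
```

The same result can be written with the function API as df.withColumn(column + "_upper", F.upper(F.col(column))), where F is pyspark.sql.functions. A Python UDF that calls str.upper would produce the same column but would pay the serialization cost described above.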

Technology Stack

The following technology stack was used in the testing of the products at LOCALLY:

Amazon Spark cluster with 1 master and 2 worker nodes (standard EC2 instances).
S3 buckets for storing Parquet files.
Zeppelin notebook to run the scripts.
The performance and cost on the Google Cloud Platform still need to be tested.

Ultimately we went with ORC as our main storage format at LOCALLY, but depending on your specific use case Parquet is also a solid choice. ORC is the main driving force behind LOCALLY's Location Intelligence Platform.