Serverless Spark on GCP: How does it compare with Dataflow?

#dataflow #spark #analytics #googlecloud

I am a huge fan of serverless products: they let developers focus on delivering business value through the software they build rather than on the underlying infrastructure. In the end, developers are more autonomous when testing and deploying, and the scaling capabilities are often better than those of an equivalent self-managed service.

When it comes to data processing on GCP, there are not many serverless options, and the choice is often limited to Dataflow. Dataflow is great, but the learning curve is a bit steep, and Beam (the OSS framework behind Dataflow) is not promoted by other providers, which often prioritize Spark. Running Spark workloads on GCP was not lightweight either: you had to provision a Dataproc cluster or run your workload on Kubernetes, which is a whole different level of complexity!
That changed recently: at Next '21, Google unexpectedly announced a new Serverless Spark service!

Spark on GCP: a new era for data processing

According to Google, this new service is the industry's first autoscaling serverless Spark. It requires no infrastructure provisioning or tuning, it is integrated with BigQuery, Vertex AI and Dataplex, and it can be used through a submission service (API), notebooks, or the BigQuery console for almost any use case except streaming analytics: ETL, data exploration, analysis, and ML.

For my part, I was able to test the workload submission service (the most interesting piece to me): it is an API endpoint for submitting custom Spark code (Python, Java, R or SQL). You can see this submission service as an answer to the spark-submit problem.
On the autoscaling side, Google decides on the number of executors needed to run the job optimally, but you can still set it manually.
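To make the submission step more concrete, here is a minimal sketch of what a batch request could look like. It assumes the v1 Dataproc REST resource shape (`pysparkBatch`, `runtimeConfig.properties`) and that a fixed executor count is requested through the standard `spark.executor.instances` Spark property; the project, region, batch ID and bucket path are hypothetical, and the preview API I tested may differ in details. The code only builds the request, it does not send it.

```python
import json


def build_pyspark_batch_request(project, region, batch_id, main_file_uri, executors):
    """Return (url, body) for a Serverless Spark PySpark batch submission.

    Nothing is sent here; a caller would POST `body` as JSON to `url`
    with an OAuth 2.0 bearer token.
    """
    url = (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project}/locations/{region}/batches?batchId={batch_id}"
    )
    body = {
        "pysparkBatch": {"mainPythonFileUri": main_file_uri},
        "runtimeConfig": {
            # Assumption: the executor count is set through a standard Spark
            # property; drop this entry to let the service decide.
            "properties": {"spark.executor.instances": str(executors)},
        },
    }
    return url, body


# Hypothetical project, region, batch ID and script location.
url, body = build_pyspark_batch_request(
    project="my-project",
    region="us-central1",
    batch_id="teragen-count",
    main_file_uri="gs://my-bucket/jobs/count_by_key.py",
    executors=17,
)
print(url)
print(json.dumps(body, indent=2))
```

Leaving out the `properties` entry corresponds to the default behaviour, where the service picks the number of executors itself.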
The service is part of the Dataproc family and is accessible in the console through the Dataproc page. After some tests the service seems to work fine, but how does it compare to Dataflow? Let's check that with a small experiment.

The experiment: Dataflow vs Serverless Spark

I wrote two simple programs: the first in PySpark and the second with Beam (Python SDK). The goal is to read 100GB of ASCII data from a Cloud Storage bucket, parse the content of the files, filter the records with a regex pattern, group them by a key (one of the columns) and count the number of lines sharing the same key. The result is written in Parquet format to another bucket.

For the input data I used a subset of a 100TB dataset publicly available here: gs://dataproc-datasets-us-central1/teragen/100tb/ascii_sort_1GB_input.*

In this dataset, each file is about 1GB and the content looks like this (the exact values are not important):

7ed>@}"UeR  0000000000000000000000024FDFC680  1111555588884444DDDD0000555511113333DDDDFFFF88881111
3AXi 40'NA  0000000000000000000000024FDFC681  888800000000CCCCEEEEDDDD11110000DDDD55553333CCCC6666
PL.Ez`vXmt  0000000000000000000000024FDFC682  111122225555CCCC000000002222FFFFFFFFFFFF88885555FFFF
5^?a=6o0]?  0000000000000000000000024FDFC683  7777FFFF55551111BBBBDDDD44447777DDDD5555BBBB9999CCCC

The regex filter applied is ^.*FFFF.*$ on the 3rd "column", meaning the column content for a given record must contain at least four consecutive F characters (totally useless, but it's for the sake of the experiment). The grouping key is the first column. The filter removes about 50% of the records. I agree the experiment is not something we would normally do in a real project, but that is not important: the point is just to simulate a compute-heavy workload.
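The parse, filter, group and count logic described above can be sketched in plain Python on the four sample records. This is not the PySpark or Beam code used in the experiment, only an illustration of the transformation; it assumes the columns are separated by two spaces and that only the first column can contain spaces.

```python
import re
from collections import Counter

# Same pattern as in the experiment: the value must contain "FFFF".
PATTERN = re.compile(r"^.*FFFF.*$")

# The four sample records shown above.
SAMPLE_LINES = [
    '7ed>@}"UeR  0000000000000000000000024FDFC680  1111555588884444DDDD0000555511113333DDDDFFFF88881111',
    "3AXi 40'NA  0000000000000000000000024FDFC681  888800000000CCCCEEEEDDDD11110000DDDD55553333CCCC6666",
    "PL.Ez`vXmt  0000000000000000000000024FDFC682  111122225555CCCC000000002222FFFFFFFFFFFF88885555FFFF",
    "5^?a=6o0]?  0000000000000000000000024FDFC683  7777FFFF55551111BBBBDDDD44447777DDDD5555BBBB9999CCCC",
]


def parse(line):
    """Split a record into (key, row_id, payload).

    Splitting from the right keeps keys that contain spaces intact,
    since the two hexadecimal columns never contain spaces.
    """
    key, row_id, payload = line.rsplit("  ", 2)
    return key, row_id, payload


def count_matching_keys(lines):
    """Count, per key, the records whose 3rd column matches PATTERN."""
    counts = Counter()
    for line in lines:
        key, _row_id, payload = parse(line)
        if PATTERN.match(payload):
            counts[key] += 1
    return counts


counts = count_matching_keys(SAMPLE_LINES)
for key, n in counts.items():
    print(repr(key), n)
```

On these four records, three pass the filter; the second one has no F at all in its third column and is dropped.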

On the configuration side, for the Dataflow job, I enabled the Dataflow Prime feature and left everything else at its default (Prime is more optimized and makes it simpler to calculate the total cost of the job). For the Spark service, everything was left at its default except that I manually asked for 17 executors (why 17? why not 😅)
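For reference, here is a sketch of how such a Dataflow configuration is typically expressed with the Beam Python SDK's command-line flags. It assumes the standard `--dataflow_service_options=enable_prime` and `--max_num_workers` options; the project, region, bucket and worker cap are placeholders, not the values used in the experiment.

```python
def dataflow_prime_args(project, region, temp_location, max_workers):
    """Build command-line flags for a Beam Python pipeline on Dataflow Prime."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        # Opt in to Dataflow Prime; everything else stays at its default.
        "--dataflow_service_options=enable_prime",
        # Optional cap on horizontal autoscaling (placeholder value below).
        f"--max_num_workers={max_workers}",
    ]


# Hypothetical values, for illustration only.
args = dataflow_prime_args(
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    max_workers=16,
)
print(" ".join(args))
```

In a real pipeline these flags would usually be passed to `PipelineOptions(args)` from `apache_beam.options.pipeline_options`.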

The results:

	Dataflow	Serverless Spark
Total execution time	36 min 34 sec	12 min 36 sec
Number of vCPU	64 (autoscaling)	68 (17 executors * 4 vCPUs)
Total cost of the job	28.88 DPU * $0.071 = $2.05

Both jobs accomplished the desired task and wrote 567 M rows to multiple Parquet files (I checked with BigQuery external tables).

The Serverless Spark service processed the data in about a third of the time Dataflow took! Nice performance 👏.
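The "about a third" figure and the Dataflow cost follow directly from the numbers in the table. A quick check, using only the values reported above:

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds duration to seconds."""
    return minutes * 60 + seconds


dataflow_s = to_seconds(36, 34)  # Dataflow: 36 min 34 sec
spark_s = to_seconds(12, 36)     # Serverless Spark: 12 min 36 sec

speedup = dataflow_s / spark_s
time_fraction = spark_s / dataflow_s

# Dataflow Prime cost: DPUs consumed times the price per DPU from the table.
dataflow_cost = 28.88 * 0.071

print(f"Speedup: {speedup:.2f}x")                           # Speedup: 2.90x
print(f"Spark time / Dataflow time: {time_fraction:.0%}")   # Spark time / Dataflow time: 34%
print(f"Dataflow cost: ${dataflow_cost:.2f}")               # Dataflow cost: $2.05
```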

Currently, however, this serverless service has some limitations:

It only supports batch processing, not streaming (Dataflow would probably be better for that anyway), and job duration is limited to 24 hours.
There is no monitoring dashboard whatsoever and the Spark UI is not accessible, whereas Dataflow has pretty good real-time dashboards.
It only supports Spark 3.2 for now. That might not be a limitation for you, but if you want to migrate existing workloads to the service, they might not work.

Remarks about the experiment:

The Beam/Dataflow pipeline was developed with the Python SDK. I would probably get better results with the Java SDK and with Flex templates (scaling is more efficient because the pipeline is containerized), so the comparison is not totally fair to Dataflow.
Dataflow targeted an ideal 260 vCPUs, but I capped the maximum number of workers to save cost (and also because my CPU quota was at its limit). Without this cap, Dataflow would probably have finished much faster.

To conclude, I am pretty optimistic about this new serverless Spark service. Running Spark on GCP was not really a solution promoted natively by Google (except for lift-and-shift migrations to Dataproc), whereas AWS and Azure built their main data processing products on Spark (Glue and Mapping Data Flows). On the downside, the integration with the GCP ecosystem (Monitoring & Operations) is way behind Dataflow for now, the service does not support Spark Streaming, and the autoscaling feature is still a bit obscure.

In the end, keep in mind that Serverless Spark and Dataflow are two different products. The choice between them is not only a matter of performance and pricing, but also of batch vs streaming ingestion (Dataflow is much better for streaming) and of your team's background knowledge of the two frameworks: Spark or Beam.

Anyway, the service should leave Private Preview by mid-December 2021 and be integrated with other GCP products (BigQuery, Vertex AI, Dataplex) later this year. It's only the beginning, but it's promising.