Spark Journey begins...

Moshe Roth (mzsrtgzr2) ・ 3 min read

As an engineer with several years of experience in backend and frontend projects, big data feels like the natural next step.
In the big data world I expect to find computing, IO and scaling challenges that rarely show up in ordinary, textbook architectures.

I decided that Spark is the best way to get started. Specifically, the Databricks certification, which focuses on Spark programming and architecture. I believe the HDFS/Ops side is not relevant for me today, because managed services on AWS and similar platforms take care of it.

My game plan to pass the Databricks Spark certification is to:

  1. Read the book "Learning Spark: Lightning-Fast Big Data Analysis", work through all the examples, and summarise the important insights and lessons so I can review them later.
  2. Go over the skeletons of the Databricks Developer course that I found on GitHub. They are from 15 months ago, so they should be reasonably up to date - https://github.com/vivek-bombatkar/spark-training + https://github.com/vivek-bombatkar/Spark-with-Python---My-learning-notes-
  3. Go through example questions.

If you can recommend any source of preparation, please write it in the comments. It will help me.

I will update this post as I go, for others (and myself).

Learning Schedule

Theory

Thoroughly reading the book "Learning Spark: Lightning-Fast...".
I think it's reasonable to go through 2 chapters per week.
This means reading, summarizing and running the important code snippets on my own.

Week 1
Chapter 3
Chapter 4

Week 2
Chapter 5
Chapter 6

Week 3
Chapter 7
Chapter 8

Week 4
Chapter 9
Chapter 10

Week 5
Chapter 11 - a quick read, since it's not that important

Hands on coding

Basics (4 notebooks)
https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-pyspark

https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-from-pandas-to-spark

https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-weather-df

https://github.com/vivek-bombatkar/spark-training/blob/master/spark-python/jupyter-weather-df/Weather%20Analysis%20Exercise.ipynb

Advanced topics (10 notebooks)
https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-advanced

Windows (4 notebooks)
https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-windows

https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-advanced-windows

UDF (3 notebooks)
https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-advanced-udf

Spark execution (1 notebook)
https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-advanced-execution

Caching (3 notebooks)
https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-advanced-caching

Pivoting (1 notebook)
https://github.com/vivek-bombatkar/spark-training/tree/master/spark-python/jupyter-advanced-pivoting

Total: 26 notebooks
I hope to do 3-4 notebooks per week (some will be easy and some harder, so I'm taking the average). That works out to about 8 weeks of going through the notebooks, finding out what I'm missing along the way.

Altogether, it should take about 3 months until I'm ready for the exam.
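
As a quick sanity check of the estimate, here is a minimal sketch of the arithmetic in Python. It uses the chapter range and notebook counts listed in this post, and it assumes (my assumption, not something fixed in the plan) that the reading and the notebooks are done one after the other rather than in parallel, and that "3-4 notebooks per week" averages out to 3.5.

```python
import math

# Chapters scheduled in the reading plan (Chapter 3 through Chapter 11).
chapters = list(range(3, 12))
chapters_per_week = 2

# Notebook counts per section, as listed in the hands-on coding plan.
notebooks = {
    "Basics": 4,
    "Advanced topics": 10,
    "Windows": 4,
    "UDF": 3,
    "Spark execution": 1,
    "Caching": 3,
    "Pivoting": 1,
}
notebooks_per_week = 3.5  # assumed average of "3-4 per week"

total_notebooks = sum(notebooks.values())
reading_weeks = math.ceil(len(chapters) / chapters_per_week)
notebook_weeks = math.ceil(total_notebooks / notebooks_per_week)

# Assumption: reading and notebooks are done sequentially.
total_weeks = reading_weeks + notebook_weeks

print(f"Notebooks: {total_notebooks}")
print(f"Reading: {reading_weeks} weeks, notebooks: {notebook_weeks} weeks")
print(f"Total: {total_weeks} weeks")
```

Under these assumptions the plan comes out at 13 weeks, which is where the roughly 3 months figure comes from. Doing the reading and the notebooks in parallel would shorten it.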

Books PDFs

Learning Spark: Lightning-Fast Big Data Analysis
First Edition
https://b-ok.asia/book/2493162/9b8d4f?dsource=recommend
Second Edition
https://laptrinhx.com/learning-spark-lightning-fast-data-analytics-2nd-edition-436517903/

Spark: The Definitive Guide: Big Data Processing Made Simple
https://b-ok.asia/book/3505368/f04c83?regionChanged

Spark in Action
https://b-ok.asia/book/3502170/d3383b
