If there is a framework that super excites me, it's Apache Spark.
If there is a conference that excites me, it's the Spark & AI summit.
This year, with the current COVID-19 pandemic, the North America version of the Spark & AI summit is online and free. No need to travel, buy an expensive flight ticket, or pay for accommodation and conference fees. It's all free and online.
One caveat: it runs in Pacific time zone (PDT) friendly hours. I kind of wish the organizers would take a more global approach.
Having said that, the agenda and content look promising!
To get ready for the conference and learn about Spark 3.0, I decided to spin up a Spark 3.0 cluster with Azure Databricks. You can do the same or use the Databricks Community Edition.
Please note that with the Community Edition, there are no worker nodes.
Pandas UDFs and Python Type Hints
This will probably be used mostly by the data science and Python developer communities. This feature lets us write more readable code and supports static code analysis by IDEs such as PyCharm.
Read about it here.
SQL Join hints
Before this change, we had only broadcast hash join hints.
Meaning, if there is a join operation and one of the tables can fit in memory, Spark can broadcast it to every executor and execute a faster join. The class in charge of it was named
ResolveBroadcastHints. It was replaced with
To learn more, check out the JIRA ticket: SPARK-27225.
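To make the broadcast idea concrete, here is a minimal plain-Python sketch of what a broadcast hash join does conceptually. It is not Spark's implementation; the function name `broadcast_hash_join` and the sample `orders` and `countries` tables are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def broadcast_hash_join(
    large_rows: Iterable[Tuple[int, str]],
    small_rows: Iterable[Tuple[int, str]],
) -> Iterator[Tuple[int, str, str]]:
    """Inner-join two sets of (key, value) rows by hashing the small side."""
    # Build phase: load the small table fully into memory as a hash map.
    # Conceptually, this is the side that would be broadcast to every executor.
    small_by_key: Dict[int, List[str]] = {}
    for key, value in small_rows:
        small_by_key.setdefault(key, []).append(value)

    # Probe phase: stream the large table once and look up each key.
    # The large side is never shuffled or sorted.
    for key, value in large_rows:
        for match in small_by_key.get(key, []):
            yield (key, value, match)


# Hypothetical sample data: a "large" fact table and a small lookup table.
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
countries = [(1, "IL"), (2, "US")]

joined = list(broadcast_hash_join(orders, countries))
print(joined)
```

The row with key 3 has no match in the small table, so an inner join drops it. In Spark SQL itself, you would ask for this strategy with a hint inside the query, for example `/*+ BROADCAST(countries) */`.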
List of available hints:
To better understand how they work, I recommend checking out the Apache Spark open source code, specifically, this file:
If you are interested in learning more about Catalyst, the Spark SQL optimization engine, I wrote a deep dive on it here.
For the last few months, I have been working on various autonomous car scenarios that involve a high load of data. One of the challenges I faced is enabling data scientists to run deep learning at scale. After digging in, I discovered the Horovod framework and the HorovodEstimator. I am excited to attend this session and learn more about it!
Are you curious about it? Read more about it here.
If you have been following me for a while now, you know I'm deep into how to build machine learning pipelines at scale.
Here is a GitHub repo describing an end-to-end platform I built for a Microsoft Build 2020 session. The platform includes MLflow, Azure Databricks, Azure Machine Learning, and social media text classification with scikit-learn. The repository includes data flow, architecture, tutorials, and code.
- Please note that this session is long (~1 hour) and is running multiple times during the online conference.
If you have watched my sessions on Big Data and ML, you know I always mention that:
You are only as Good as your Data
I am referring here to machine learning models, of course. We see many biased machine learning models due to unbalanced data and the misuse or lack of tools for assessing data quality. Many times during the data quality process, we need to filter out some of the data; this is where having a large dataset can help. However, it brings challenges as well.
This is why I am excited to hear from Netflix how they tackle these challenges.
BTW, if you would like to get familiar with data bias challenges, I recommend this short read from the Microsoft Research Blog.
The Veraset software development team is closely involved with open source Spark initiatives such as DataSource V2 and the External Shuffle Service, so it will be interesting to hear from them how using the right file format can improve performance and permit predicate pushdown.
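For readers new to predicate pushdown, here is a small plain-Python sketch of the general idea, not of any specific file format or of the speakers' approach. It assumes a columnar layout where each chunk of rows stores min/max statistics; the `RowGroup` class and `scan_greater_than` function are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of column values plus the min/max statistics stored with it."""
    min_value: int
    max_value: int
    values: List[int]


def scan_greater_than(
    row_groups: List[RowGroup], threshold: int
) -> Tuple[List[int], int]:
    """Return values > threshold and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        # Pushdown: the filter is checked against the statistics first.
        # If even the largest value fails the predicate, skip the whole group.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


# Hypothetical data: three row groups with non-overlapping value ranges.
groups = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]

matches, groups_read = scan_greater_than(groups, threshold=18)
print(matches, groups_read)
```

Only two of the three row groups are read here, because the statistics of the first one already rule it out. A row-oriented format without such statistics would force a scan of every row.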
Always happy to take your thoughts and opinions.