Adi Polak

Posted on Jun 2, 2020 • Edited on Jun 4, 2020

Spark & AI summit and a glimpse of Spark 3.0

#news #apachespark

If there is a framework that super excites me, it's Apache Spark.
If there is a conference that excites me, it's the Spark & AI summit.

This year, with the current COVID-19 pandemic, the North American edition of the Spark & AI summit is online and free. No need to travel, buy an expensive flight ticket, or pay for accommodation and conference fees.

One caveat: the schedule follows Pacific Time (PDT) friendly hours. I kind of wish the organizers would take a more global approach.

Having said that, the agenda and content look promising!

To get ready for the conference and learn about Spark 3.0, I decided to spin up a Spark 3.0 cluster with Azure Databricks. You can do the same or use the Databricks Community Edition.
Please note that with the Community Edition, there are no worker nodes.

The Workspace:

MANY Exciting Features, let's briefly look at 2 of them

Pandas UDFs and Python Type Hints
This will probably be used mostly by the data science and Python developer communities. The feature allows us to write more readable code and supports static code analysis by IDEs such as PyCharm.
Read about it here.
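As a rough illustration of what type hints bring to Pandas UDFs, here is a minimal sketch. The names `plus_one` and `as_spark_udf` are made up for this example, and it assumes pandas is installed (plus pyspark 3.0+ and pyarrow if you call the helper).

```python
import pandas as pd


def plus_one(values: pd.Series) -> pd.Series:
    """Series-to-Series function; the type hints describe the UDF kind."""
    return values + 1


def as_spark_udf():
    """Wrap plus_one as a Spark 3.0 pandas UDF that returns a long column.

    pyspark is imported lazily so the plain function above can be
    tested locally without a Spark installation.
    """
    from pyspark.sql.functions import pandas_udf

    # Equivalent to decorating plus_one with @pandas_udf("long").
    return pandas_udf("long")(plus_one)


# The plain function can be checked without a cluster:
print(plus_one(pd.Series([1, 2, 3])).tolist())  # [2, 3, 4]
```

On a cluster, something like `plus_one_udf = as_spark_udf()` followed by `df.select(plus_one_udf("id"))` would apply it to a hypothetical column `id`. Because the input and output types are spelled out in the signature, an IDE can flag a caller that passes the wrong type.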
SQL Join hints
Before this change, the only join hint we had was the broadcast hash join hint.
It tells Spark that one of the tables in a join is small enough to fit in memory, so Spark broadcasts it to execute a faster join. The class in charge of it was named ResolveBroadcastHints; it was replaced with ResolveJoinStrategyHints.
To learn more, check out the JIRA ticket: SPARK-27225.

List of available hints: BROADCAST, MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL.

To better understand how they work, I recommend checking out the Apache Spark open source code, specifically this file:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala
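To make the join hints above a bit more concrete, here is a small sketch of how a hint appears in a Spark SQL query. The helper `join_with_hint` and the table and column names are hypothetical, invented only for illustration; the hint names are the four join strategy hints in Spark 3.0.

```python
# Join strategy hints accepted by Spark 3.0 (see SPARK-27225).
JOIN_HINTS = ("BROADCAST", "MERGE", "SHUFFLE_HASH", "SHUFFLE_REPLICATE_NL")


def join_with_hint(hint: str, left: str, right: str, key: str) -> str:
    """Build a Spark SQL equi-join that applies the given hint to `right`.

    Hypothetical helper: it only assembles the query text, it does not run it.
    """
    name = hint.upper()
    if name not in JOIN_HINTS:
        raise ValueError(f"unknown join hint: {hint}")
    # Hints go in a /*+ ... */ comment right after SELECT.
    return (
        f"SELECT /*+ {name}({right}) */ * "
        f"FROM {left} JOIN {right} ON {left}.{key} = {right}.{key}"
    )


print(join_with_hint("broadcast", "orders", "customers", "customer_id"))
```

On a cluster, the resulting string could be passed to `spark.sql(...)`. The DataFrame API offers the same thing through `DataFrame.hint`, for example `orders.join(customers.hint("broadcast"), "customer_id")`. A hint is a request to the optimizer, so Spark may still pick another strategy when the hinted one does not apply to the join.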

If you are interested in learning more about Catalyst, the Spark SQL optimization engine, I wrote a deep dive on it here.

Top 4 recommended sessions

-1- End-to-End Deep Learning with Horovod on Apache Spark

For the last few months, I have been working on various autonomous car scenarios that involve heavy data loads. One of the challenges I faced was enabling data scientists to run deep learning at scale. After digging in, I discovered the Horovod framework and the HorovodEstimator. I am excited to attend this session and learn more about it!
Curious about it? Read more about it here.

Session link.

-2- Building Reliable ML Pipelines with MLflow

If you have been following me for a while, you know I'm deep into building machine learning pipelines at scale.
Here is a GitHub repo describing an end-to-end platform I built for my Microsoft Build 2020 session. The platform includes MLflow, Azure Databricks, Azure Machine Learning, and social media text classification with scikit-learn. The repository includes the data flow, architecture, tutorials, and code.

Session link.

Please note that this session is long (~1 hour) and is running multiple times during the online conference.

-3- An Approach to Data Quality for Netflix Personalization Systems

If you have watched my sessions on big data and ML, you know I always mention that:

You are only as Good as your Data

I am referring here to machine learning models, of course. We see many biased machine learning models due to unbalanced data and the misuse or lack of tools for assessing data quality. Many times during the data quality process, we need to filter out data; this is where having a large dataset can help. However, a large dataset brings challenges as well.

This is why I am excited to hear from Netflix how they tackle these challenges.

BTW, if you would like to get familiar with data bias challenges, I recommend this short read from the Microsoft Research Blog.

Session link.

-4- The Apache Spark File Format Ecosystem

The Veraset software development team is closely involved with open source Spark initiatives such as DataSource V2 and the External Shuffle Service, so it will be interesting to hear from them how using the right file format can improve performance and enable predicate pushdown.

Session link.

That's it for now!

Thank you for reading so far.

These are my personal opinions about the summit.
If you enjoy reading, please follow me here on dev.to, Twitter, and LinkedIn.

Always happy to take your thoughts and opinions.

💡 Which session can't you wait to attend? What excites you about Apache Spark?
