
Apache Spark and Databricks 101 pt. I - The Big Picture

Hugo Estrada S.

If you're interested in becoming a Data Scientist, Data Engineer, Data Analyst (so many Data titles) or whatever, chances are you will work with Apache Spark and/or Databricks.

  1. What is Apache Spark?
    Apache Spark is an open-source framework for distributed data processing. Anyone who uses the Spark ecosystem in an application can focus on their domain-specific data processing business case, while trusting that Spark will handle the messy details of parallel computing. Spark is deployed as a cluster consisting of a master server and many worker servers. The master server accepts incoming jobs and breaks them down into smaller tasks that can be handled by workers. Spark is neither a data storage solution nor a Hadoop replacement.
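
    To make the "break a job into smaller tasks" idea concrete, here is a toy sketch in plain Python. It does not use Spark at all: the function names (`split_into_tasks`, `run_task`, `run_job`) are made up for illustration, and a local thread pool stands in for the worker servers. Real Spark handles scheduling, data locality, and failures for you, so treat this only as an analogy.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(records, num_tasks):
    """'Master' side: cut one job's input into roughly equal partitions."""
    if not records:
        return []
    size = -(-len(records) // num_tasks)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def run_task(partition):
    """'Worker' side: process one partition independently of the others."""
    return sum(x * x for x in partition)


def run_job(records, num_workers=4):
    """Split the job, hand each task to a worker, then combine the results."""
    tasks = split_into_tasks(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        partial_results = list(workers.map(run_task, tasks))
    return sum(partial_results)


# Sum of squares of 1..10, computed in independent chunks
print(run_job(list(range(1, 11))))  # prints 385
```

    The point of the sketch is that each task only needs its own partition, which is what lets the work be spread across many machines.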

  2. What is Databricks?
    Databricks is an Apache Spark-based analytics platform optimized for the AWS and Azure cloud platforms. Designed with the founders of Apache Spark, Databricks is integrated into these cloud providers to offer one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. There is also a free community edition available where you can learn the basics!

  3. Some interesting Databricks Features
    Databricks lets you run code in notebooks, much like Jupyter does.


Looks just like Jupyter, doesn't it?
All your code runs on a cluster, though, which you can scale up depending on the workload.

Noticed how this is a "Python Notebook"?
In Databricks, you can seamlessly switch languages, EVEN IN THE SAME NOTEBOOK:


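This is possible with the "%" magic commands: starting a cell with %python, %sql, %scala, or %r tells Databricks which language that cell is written in.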
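
As a rough illustration, this is approximately what a small mixed-language notebook looks like when exported from Databricks as a Python source file. In the notebook UI each cell would simply start with `%sql` or `%scala`; in the exported file those cells appear as `# MAGIC` comments, and `# COMMAND ----------` separates cells. The cell contents and the `message` variable are made up for this example.

```python
# Databricks notebook source
# Cell 1: plain Python, the notebook's default language
message = "Hello from Python"
print(message)

# COMMAND ----------

# Cell 2: a SQL cell (in the notebook UI, the cell starts with %sql)
# MAGIC %sql
# MAGIC SELECT 1 AS answer

# COMMAND ----------

# Cell 3: a Scala cell (in the notebook UI, the cell starts with %scala)
# MAGIC %scala
# MAGIC println("Hello from Scala")
```

Outside Databricks only the Python cell does anything, since the other cells are just comments; inside a notebook each cell runs on the cluster in its own language.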

There are a lot of data structures in the Apache Spark framework, such as RDDs, DataFrames, and Datasets, that you can explore using Databricks.
Tools like this are used by Data Scientists and Data Engineers alike.

