Apache Spark in Bitesize (3 Part Series)
This is a basic cheat sheet and glossary for getting started with Apache Spark. Every time we share a new post with terms or code snippets, they will appear here as well, in a generic form.
If you work with Apache Spark and are looking for a cheat sheet, this is for you as well!
First things first:
We need to create the workspace. We are using a Databricks workspace, and here is a tutorial for creating it.
A DataFrame is a distributed collection of data organized into named columns. It provides operations to filter, group, or compute aggregates. DataFrame data is often distributed across multiple machines, and it can live in memory or on disk.
A Dataset is a strongly typed collection of objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row objects.
Datasets are "lazy", meaning that transformations only describe the computation, and nothing runs until an action is invoked.
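To make the "lazy" idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the `LazyRows` class, its `filter`, `map`, and `collect` methods, and the sample data are all hypothetical names made up for illustration. The point is only to show that transformations record a plan, and that the work happens when an action is called.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class LazyRows:
    """Toy stand-in for a lazily evaluated collection (hypothetical, not a Spark class)."""

    def __init__(self, source: Iterable[Any], steps: Optional[List[Step]] = None):
        self._source = source
        self._steps = list(steps or [])

    # Transformations: record the step and return a new plan. No row is read here.
    def filter(self, predicate: Callable[[Any], bool]) -> "LazyRows":
        return LazyRows(self._source, self._steps + [("filter", predicate)])

    def map(self, fn: Callable[[Any], Any]) -> "LazyRows":
        return LazyRows(self._source, self._steps + [("map", fn)])

    # Action: replay the recorded steps over the data and materialize the result.
    def collect(self) -> List[Any]:
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)


seen = []  # records every row the predicate actually inspects


def is_adult(row):
    seen.append(row["name"])
    return row["age"] >= 18


# Rows with named columns, loosely mirroring how a DataFrame is organized.
people = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": 12},
    {"name": "Cy", "age": 51},
]

plan = LazyRows(people).filter(is_adult).map(lambda r: r["name"])
calls_before_action = len(seen)  # the predicate has not run yet
names = plan.collect()           # the action triggers the whole pipeline
```

In real Spark code the same pattern applies: calls such as `filter` or `select` are transformations that only build up a plan, while calls such as `count` or `collect` are actions that trigger execution.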
This is an evolving page, and more terms, code snippets, and architecture designs will be added.