Hugo Estrada S.

Posted on Jun 4, 2020

Apache Spark and Databricks 101 pt. II - Some DataFrames

#databricks #spark #datascience

One of the core APIs of Apache Spark is the DataFrame.
It represents a table of data with rows and columns.

The list of columns and their respective data types is called the schema.
You can think of a DataFrame as an Excel spreadsheet with named columns.

DataFrames are powerful: their data can be partitioned across thousands of computers at the same time, which is what makes parallel processing of big data possible.

1 Creating a Sample Spark DataFrame:

The 'inferSchema' option means Spark will automatically detect the schema for us, instead of reading every CSV column as a string.

2 Take a Glimpse into the DataFrame:

With the 'display()' function, which is built into Databricks notebooks, we can see our DataFrame in a nice table format.

If you only want to look at the first x rows, use the '.take(x)' method.

Remember, this is Spark and not plain Python: '.take(x)' returns a list of Row objects, not a DataFrame.

3 Final Thoughts and Conclusions:

Spark DataFrames are not the same thing as Pandas DataFrames. That's a topic for another lecture, where I will show you how to convert one into the other, along with the differences and benefits of each.
