Ruthvik Raja M.V

How to read an Excel file from Azure Databricks using PySpark

An Excel file cannot be read directly using PySpark in Databricks, so the necessary library (com.crealytics.spark.excel) has to be installed on the cluster before the Python code will run successfully.

Step 1
Navigate to the Cluster that will be used to run the Python script under Compute in Databricks.

Step 2
Click on the tab Libraries -> Install new.

Step 3
Select Maven as a Library source and click on Search Packages.

Step 4
Type com.crealytics in the search bar and select Maven Central.

Step 5
Select the com.crealytics.spark.excel package version that matches the Scala version of your cluster (Cluster -> Configuration -> Databricks Runtime Version).

Step 6
Click Install.
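
For example, on a cluster whose Databricks Runtime ships Scala 2.12, the selected Maven coordinate would look like the following (the version numbers here are illustrative; pick whichever release is offered for your runtime):

```
com.crealytics:spark-excel_2.12:3.4.1_0.20.3
```

The `_2.12` suffix is the Scala version, which is why it must match the cluster's runtime.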

Use the following PySpark code to load the Excel file:

# Specify the path to the Excel file
excel_file_path = "/mnt/<your-mount-name>/path_to_your_excel_file.xlsx"  # Replace with your actual file path

# Read the Excel file into a Spark DataFrame
df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")  # Set to "true" if your Excel file has a header row
                                 # (older spark-excel releases call this option "useHeader")
      .load(excel_file_path))


Alternatively, using the pandas library, the following Python code can be used:

import pandas as pd

# Note: pandas uses the local file API, so on Databricks a DBFS mount
# is accessed through the /dbfs prefix
excel_file_path = "/dbfs/mnt/<your-mount-name>/path_to_your_excel_file.xlsx"  # Replace with your actual file path

ef = pd.ExcelFile(excel_file_path)  # Load the Excel file as an object

# Pass the required sheet name, or use ef.sheet_names to iterate through each sheet
df = ef.parse(sheet_name)  # Replace sheet_name with the name of the required sheet

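Building on the pandas snippet above, `ef.sheet_names` makes it easy to load every sheet at once. A minimal sketch, assuming the workbook fits in memory (`read_all_sheets` is a hypothetical helper name; it accepts anything `pandas.ExcelFile` accepts, such as a path or a file-like buffer):

```python
import pandas as pd

def read_all_sheets(excel_path_or_buffer):
    """Load every sheet of an Excel workbook into a dict of
    DataFrames, keyed by sheet name."""
    ef = pd.ExcelFile(excel_path_or_buffer)
    return {name: ef.parse(name) for name in ef.sheet_names}
```

The resulting dict can then be iterated to process each sheet individually, instead of hard-coding a single sheet name.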

Done
