Ruthvik Raja M.V

How to read an Excel file from Azure Databricks using PySpark

An Excel file cannot be read directly using PySpark in Databricks, so the necessary library (com.crealytics.spark.excel) has to be installed on the cluster before the Python code will run successfully.

Step 1
Navigate to the Cluster that will be used to run the Python script under Compute in Databricks.

Step 2
Click on the tab Libraries -> Install new.

Step 3
Select Maven as a Library source and click on Search Packages.

Step 4
Type com.crealytics in the search bar and select Maven Central.

Step 5
Select the com.crealytics.spark.excel package version that matches the Scala version of your cluster (Cluster -> Configuration -> Databricks Runtime Version).

Step 6
Click Install.
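
For example, on a cluster whose Databricks Runtime ships Scala 2.12, the selected Maven coordinate would look like the following (the version numbers here are illustrative; pick whichever release is offered for your runtime):

```
com.crealytics:spark-excel_2.12:3.4.1_0.20.3
```

The `_2.12` suffix is the Scala version, which is why it must match the cluster's runtime.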

Use the following PySpark code to load the Excel file:

# Specify the path to the Excel file
excel_file_path = "/mnt/<your-mount-name>/path_to_your_excel_file.xlsx"  # Replace with your actual file path

# Read the Excel file into a Spark DataFrame
df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")  # Set to "true" if your Excel file has a header row
                                 # (older spark-excel releases call this option "useHeader")
      .load(excel_file_path))


Alternatively, using the pandas library, the following Python code can be used:

import pandas as pd

# Note: pandas uses the local file API, so on Databricks a DBFS mount
# is accessed through the /dbfs prefix
excel_file_path = "/dbfs/mnt/<your-mount-name>/path_to_your_excel_file.xlsx"  # Replace with your actual file path

ef = pd.ExcelFile(excel_file_path)  # Load the Excel file as an object

# Pass the required sheet name, or use ef.sheet_names to iterate through each sheet
df = ef.parse(sheet_name)  # Replace sheet_name with the name of the required sheet

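Building on the pandas snippet above, `ef.sheet_names` makes it easy to load every sheet at once. A minimal sketch, assuming the workbook fits in memory (`read_all_sheets` is a hypothetical helper name; it accepts anything `pandas.ExcelFile` accepts, such as a path or a file-like buffer):

```python
import pandas as pd

def read_all_sheets(excel_path_or_buffer):
    """Load every sheet of an Excel workbook into a dict of
    DataFrames, keyed by sheet name."""
    ef = pd.ExcelFile(excel_path_or_buffer)
    return {name: ef.parse(name) for name in ef.sheet_names}
```

The resulting dict can then be iterated to process each sheet individually, instead of hard-coding a single sheet name.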

Done
