Ruthvik Raja M.V
How to read all the sheets from an Excel file and push into Data Lake using Azure Data Factory

Azure Data Factory doesn't have a built-in function to read all the sheets from an Excel file, but it does support reading Excel data in several ways, including Mapping Data Flows. Below are the general steps to read every sheet from an Excel file in Azure Data Factory:

Create an Azure Data Factory Service in the Microsoft Azure Portal
If you haven't already, create an Azure Data Factory instance.

Create Linked Services for the Source and Destination Files
In Azure Data Factory, create two Linked Services, one for your source (the Excel file) and one for your destination (Azure Data Lake). Linked Services hold the connection information needed to read and write the data.
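
For reference, here is a rough sketch of what the Data Lake Linked Service definition could look like in JSON. The name, URL and account key are placeholders, and your setup may well use a different storage type or authentication method (e.g. a service principal or managed identity):

{
    "name": "destination_adls_ls",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<storage-account>.dfs.core.windows.net",
            "accountKey": {
                "type": "SecureString",
                "value": "<account-key>"
            }
        }
    }
}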

Create Datasets for the Source and Destination Files
Next, create two Datasets, one for the source (the Excel file) and one for the destination (Azure Data Lake), and point each Dataset at the corresponding Linked Service created in the previous step.

Let us name these two Datasets as follows:
Source -> source_ds
Destination -> destination_ds

In source_ds, create a new parameter under Parameters and name it source_sheet_names. Then, under Connection, pass the newly created parameter into the Sheet name field by adding dynamic content with the following expression -> @dataset().source_sheet_names.

Source Dataset -> source_ds
Parameters -> New -> source_sheet_names (Type: String)
Connection -> Sheet name -> Dynamic Content -> @dataset().source_sheet_names


This way, the sheet name will be passed in as a parameter from the Pipeline.
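
Behind the UI, the parameterised source Dataset boils down to JSON roughly like the sketch below. The linked service name, container and file name are placeholders for your own setup; note how sheetName is an expression bound to the dataset parameter:

{
    "name": "source_ds",
    "properties": {
        "type": "Excel",
        "linkedServiceName": {
            "referenceName": "source_blob_ls",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "source_sheet_names": { "type": "string" }
        },
        "typeProperties": {
            "sheetName": {
                "value": "@dataset().source_sheet_names",
                "type": "Expression"
            },
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "source",
                "fileName": "input_file.xlsx"
            },
            "firstRowAsHeader": true
        }
    }
}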

In destination_ds, create a new parameter under Parameters and name it destination_file_name. Then, under Connection, pass the newly created parameter into the File path by adding dynamic content with the following expression -> destination/final_file/@{dataset().destination_file_name}/@dataset().destination_file_name. This creates a new folder for each sheet, named after the respective sheet.

Destination Dataset -> destination_ds
Parameters -> New -> destination_file_name (Type: String)
Connection -> File path -> Dynamic Content -> destination/final_file/@{dataset().destination_file_name}/@dataset().destination_file_name 


This way, the file name will be passed in as a parameter from the Pipeline.
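
Similarly, the destination Dataset could look roughly like the following JSON. A delimited-text sink on ADLS Gen2 is assumed here, and the folderPath/fileName expressions mirror the dynamic content above ("destination" being the file system and "final_file" the base folder):

{
    "name": "destination_ds",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "destination_adls_ls",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "destination_file_name": { "type": "string" }
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "destination",
                "folderPath": {
                    "value": "final_file/@{dataset().destination_file_name}",
                    "type": "Expression"
                },
                "fileName": {
                    "value": "@dataset().destination_file_name",
                    "type": "Expression"
                }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}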

Create a Pipeline
Create a new Pipeline in Azure Data Factory.

Let us name the newly created Pipeline as follows:
Pipeline -> adl_pipeline

Create a new variable under Variables and name it sheet_names. Set its type to Array and provide the list of sheet names as the Default value -> ["sheet 1", "sheet 2", "sheet 3", "sheet 4"]...

Pipeline -> adl_pipeline
Variables -> New -> sheet_names -> ["sheet 1", "sheet 2", "sheet 3", "sheet 4"]...


This way, the sheet names are supplied at runtime.
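
In JSON terms, the pipeline with its variable boils down to something like this skeleton (the activities section is filled in below):

{
    "name": "adl_pipeline",
    "properties": {
        "variables": {
            "sheet_names": {
                "type": "Array",
                "defaultValue": ["sheet 1", "sheet 2", "sheet 3", "sheet 4"]
            }
        },
        "activities": []
    }
}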

Add an Activity
In your Pipeline, add a ForEach activity and pass sheet_names as its input. This can be achieved by navigating to Settings -> Items -> Dynamic Content and entering @variables('sheet_names').

ForEach Activity -> Settings -> Items -> Dynamic Content -> @variables('sheet_names')


Add an Activity within the ForEach Activity to copy the data from Source to Destination

In your ForEach Activity, add a Copy data Activity and map the Dataset parameters to the current sheet name. This can be achieved by navigating to Source -> Name -> source_sheet_names -> @item() and Sink -> Name -> destination_file_name -> @item().

ForEach Activity -> Copy data Activity -> 
Source -> Name -> source_sheet_names -> @item() 
Sink -> Name -> destination_file_name -> @item()


This way, the ForEach Activity iterates through each sheet and copies its data to the corresponding destination folder.
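
Putting the two activities together, the pipeline's activities section might look roughly like the sketch below. The activity names are made up, and the ExcelSource/DelimitedTextSink pairing with its store settings is an assumption matching the dataset sketches above:

"activities": [
    {
        "name": "ForEachSheet",
        "type": "ForEach",
        "typeProperties": {
            "items": {
                "value": "@variables('sheet_names')",
                "type": "Expression"
            },
            "activities": [
                {
                    "name": "CopySheetToLake",
                    "type": "Copy",
                    "inputs": [
                        {
                            "referenceName": "source_ds",
                            "type": "DatasetReference",
                            "parameters": { "source_sheet_names": "@item()" }
                        }
                    ],
                    "outputs": [
                        {
                            "referenceName": "destination_ds",
                            "type": "DatasetReference",
                            "parameters": { "destination_file_name": "@item()" }
                        }
                    ],
                    "typeProperties": {
                        "source": {
                            "type": "ExcelSource",
                            "storeSettings": { "type": "AzureBlobStorageReadSettings" }
                        },
                        "sink": {
                            "type": "DelimitedTextSink",
                            "storeSettings": { "type": "AzureBlobFSWriteSettings" }
                        }
                    }
                }
            ]
        }
    }
]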

Save and Execute the Pipeline

The entire workflow in a nutshell:

Source Dataset -> source_ds
Parameters -> New -> source_sheet_names (Type: String)
Connection -> Sheet name -> Dynamic Content -> @dataset().source_sheet_names

Destination Dataset -> destination_ds
Parameters -> New -> destination_file_name (Type: String)
Connection -> File path -> Dynamic Content -> destination/final_file/@{dataset().destination_file_name}/@dataset().destination_file_name

Pipeline -> adl_pipeline
Variables -> New -> sheet_names -> ["sheet 1", "sheet 2", "sheet 3", "sheet 4"]...

ForEach Activity -> Settings -> Items -> Dynamic Content -> @variables('sheet_names')

ForEach Activity -> Copy data Activity -> 
Source -> Name -> source_sheet_names -> @item() 
Sink -> Name -> destination_file_name -> @item()


Done
