DEV Community

Ayush Kumar

Ways to load data into a DW from an external data source

Introduction

After data has been transformed, it needs to be saved in a Data Warehouse (DW) for further analysis. This is part of ETL (Extract, Transform, Load): data is extracted from various sources, transformed as per business needs, and then loaded into the destination.

There are several ways to load data into a DW:

  • PolyBase
  • COPY INTO command
  • ADF Copy activity
  • Spark Pool in Synapse
  • Databricks

In this blog I will explain each of these approaches and discuss some of their limitations and workarounds.

NOTE: Whenever we use other tools to insert data into Synapse, such as ADF, Data Flow, Spark Pool, or Databricks, it is highly recommended to enable staging (in some cases it is applied automatically under the hood). Staging allows for efficient data transfer and handles large datasets more effectively, minimizing the load on both the source and the destination during the transfer. Internally, the data is first written to a staging layer (a temporary storage location), then loaded from there into the DW. After the load completes, the temporary data in the staging layer is deleted.
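The stage-then-load-then-clean-up pattern described above can be illustrated with a small, purely local Python sketch (a conceptual toy only; the real staging is handled by the tools themselves against cloud storage):

```python
import shutil
import tempfile
from pathlib import Path

def staged_load(source: Path, warehouse: Path) -> None:
    """Toy illustration of a staged load: land the data in a temporary
    staging area, promote it to the warehouse, then delete the staging copy."""
    staging = Path(tempfile.mkdtemp(prefix="staging_"))
    try:
        staged_file = staging / source.name
        shutil.copy(str(source), str(staged_file))      # 1. land data in staging
        warehouse.mkdir(parents=True, exist_ok=True)
        shutil.move(str(staged_file), str(warehouse / source.name))  # 2. load into the DW
    finally:
        shutil.rmtree(staging, ignore_errors=True)      # 3. staging data is deleted
```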


POLYBASE

PolyBase is a technology that allows SQL Server and dedicated SQL pools to query and import data from external data sources using T-SQL, based on metadata objects that describe the external data.

Step 1:

Create a Database Scoped Credential. This defines how Synapse connects to the external data source (Azure Data Lake Storage Gen2). Here I am using a Managed Identity, which means Synapse uses its own identity to access the Data Lake.

You also have to assign the Storage Blob Data Contributor role to the Managed Identity of the Synapse workspace.
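If you prefer the command line, the role assignment can be made with the Azure CLI. All IDs below are placeholders to be replaced with your own values:

```shell
# Placeholders: replace with your workspace's managed identity object ID
# and the resource ID of your storage account
az role assignment create \
  --assignee "<synapse-workspace-managed-identity-object-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
```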



-- Note: a database master key (CREATE MASTER KEY) must exist before a credential can be created
CREATE DATABASE SCOPED CREDENTIAL adlsgen2synp
WITH IDENTITY = 'MANAGED IDENTITY'
GO



Step 2:

Create an External Data Source. This is a definition of the external data source location.



CREATE EXTERNAL DATA SOURCE taxi_raw_data
WITH (
    LOCATION = 'abfss://raw@synapselearningadls.dfs.core.windows.net/',
    CREDENTIAL = adlsgen2synp,
    TYPE = HADOOP
);
GO



Step 3:

Create an External File Format. This defines the format of the source file (here, CSV).



CREATE EXTERNAL FILE FORMAT csv_file_format
    WITH(
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (
            FIELD_TERMINATOR = ',',
            STRING_DELIMITER = '"',
            FIRST_ROW = 2,
            USE_TYPE_DEFAULT = FALSE
        )
    );
GO



Step 4:

Create an External Table so that you can query the external data directly from Synapse without importing it.



CREATE EXTERNAL TABLE temp_table.ext_taxi_zone
    (   LocationID SMALLINT,
        Borough VARCHAR(15),
        Zone VARCHAR(50),
        service_zone VARCHAR(15))
    WITH(
        LOCATION = 'taxi_zone.csv',
        DATA_SOURCE = taxi_raw_data,
        FILE_FORMAT = csv_file_format
    );
GO



Step 5:

Use a CTAS (CREATE TABLE AS SELECT) statement to import the data into the Synapse SQL DW.



CREATE TABLE temp_table.taxi_zone
WITH (
    DISTRIBUTION = ROUND_ROBIN
) AS 
SELECT * FROM temp_table.ext_taxi_zone;



Using PolyBase

NOTE:

  • PolyBase does not support the DELTA file format.
  • External tables can't be modified. To alter columns, you must drop the table and recreate it.
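For example, to change a column in the external table from Step 4, you would drop and recreate it (the widened column size here is purely illustrative):

```sql
-- External tables can't be ALTERed: drop and recreate instead
DROP EXTERNAL TABLE temp_table.ext_taxi_zone;
GO

CREATE EXTERNAL TABLE temp_table.ext_taxi_zone
    (   LocationID SMALLINT,
        Borough VARCHAR(15),
        Zone VARCHAR(50),
        service_zone VARCHAR(30))   -- e.g. a widened column
    WITH(
        LOCATION = 'taxi_zone.csv',
        DATA_SOURCE = taxi_raw_data,
        FILE_FORMAT = csv_file_format
    );
GO
```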

COPY INTO command

This is a newer way of loading data into the DW. It doesn't need any extra objects; it copies the data directly into a table in the DW, and it is faster and simpler than PolyBase.



COPY INTO [temp_table].[taxi_zone]
FROM 'https://synapselearningadls.blob.core.windows.net/raw/taxi_zone.csv'
WITH (
    CREDENTIAL = (IDENTITY = 'MANAGED IDENTITY'),
    FILE_TYPE = 'CSV',
    FIRSTROW = 2
)



Using Copy

NOTE: COPY INTO supports only the CSV, PARQUET, and ORC file types.
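For instance, loading a Parquet file looks almost the same (the Parquet path below is hypothetical, following the CSV example above); note that FIRSTROW applies only to CSV:

```sql
COPY INTO [temp_table].[taxi_zone]
FROM 'https://synapselearningadls.blob.core.windows.net/raw/taxi_zone.parquet'
WITH (
    CREDENTIAL = (IDENTITY = 'MANAGED IDENTITY'),
    FILE_TYPE = 'PARQUET'
)
```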


ADF Copy Activity

Step 1:

Grant permissions to the ADF Managed Identity so it can access the Synapse database.

Here EXTERNAL PROVIDER refers to Microsoft Entra ID.



CREATE USER [adf-rnd-learning] FROM EXTERNAL PROVIDER;
GO

EXEC sp_addrolemember db_owner, [adf-rnd-learning];
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [adf-rnd-learning];



Step 2:

  • Create linked services for ADLS Gen 2 (the external data source) and Azure Synapse Analytics.
  • Create datasets for the source file in ADLS and the sink table in the Synapse DB.
  • Create a pipeline and use the Copy activity to copy data from ADLS to the Synapse database.
  • Set the source and sink datasets in the Copy activity and enable Staging.

Using copy

Here, the data is stored temporarily in staging and deleted after it is loaded into the Synapse DB.

Staging data

NOTE: The ADF Copy activity doesn't support the DELTA format. As a workaround, you can use a Data Flow in ADF: when adding the source in the Data Flow, choose Inline as the source type.

Delta Support


Spark Pool in Synapse

In a Spark Pool we can use Synapse DW objects with the help of a Spark connector. The best part is that Synapse manages all the authentication for us, so we don't have to manage credentials ourselves.

The sqlanalytics package contains a Spark connector that is used to connect to a dedicated SQL pool from a Spark Pool.

Spark Pool Connector
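As a rough sketch of what the connector call looks like in PySpark on a Synapse Spark 3 pool (where the connector is exposed as synapsesql; taxi_zone_copy is a hypothetical target table):

```python
# PySpark on a Synapse Spark 3 pool; authentication is handled by Synapse itself
from com.microsoft.spark.sqlanalytics.Constants import Constants

# Read a dedicated SQL pool table into a DataFrame (three-part name)
df = spark.read.synapsesql("prac_dedicated.temp_table.taxi_zone")

# Write the DataFrame back to an internal (managed) table in the dedicated pool
df.write.synapsesql("prac_dedicated.temp_table.taxi_zone_copy", Constants.INTERNAL)
```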

Here, under the hood, the staging data is automatically written to a default location.

Spark Pool Staging


Databricks

In Databricks, to access Synapse DW objects we first have to manage credentials for both Synapse and ADLS.

Step 1:

Give required permissions to Databricks to access ADLS Gen 2.

Create a service principal in Microsoft Entra ID and add a Client Secret to it.

Create Service Principal

Assign the Storage Blob Data Contributor role on the ADLS account to this service principal.

Role assign to ADLS

Then add the client_id, tenant_id, and client secret of the service principal to Azure Key Vault.

Key Vault

Then create a Secret Scope in Databricks and add the required details of your key vault. Databricks keeps the Secret Scope page in a hidden user interface; to reveal it, append "#secrets/createScope" to the end of the Databricks homepage URL.

Databricks Scope

Step 2:

Give the Service Principal permission to access the Synapse database by executing the query below in the Synapse database.



CREATE USER [databricks_learning_appreg] FROM EXTERNAL PROVIDER;
GO

EXEC sp_addrolemember db_owner, [databricks_learning_appreg];
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [databricks_learning_appreg];



Step 3:

Fetch the credentials from Databricks Secrets Scope.



client_id = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-client-id')
tenant_id = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-tenant-id')
client_secret = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-client-secret')



Step 4:

Set the Spark configuration in Databricks for ADLS.



spark.conf.set("fs.azure.account.auth.type.synapselearningadls.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.synapselearningadls.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.synapselearningadls.dfs.core.windows.net", client_id)
spark.conf.set("fs.azure.account.oauth2.client.secret.synapselearningadls.dfs.core.windows.net", client_secret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.synapselearningadls.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")



Step 5:

Set the Spark configuration in Databricks for the Synapse SQL DW.



spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.id", client_id)
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.secret", client_secret)



Step 6:

Now load the data from ADLS into a DataFrame and perform some transformations.

read df from ADLS

Step 7:

Now write the transformed data to Synapse SQL DW.



df.write.format("sqldw") \
.option("url", 'jdbc:sqlserver://synapse-rnd-learning.sql.azuresynapse.net:1433;database=prac_dedicated') \
.option('tempDir', 'abfss://stagging@synapselearningadls.dfs.core.windows.net/stage') \
.option("forwardSparkAzureStorageCredentials", "false") \
.option("dbTable", "temp_table.taxi_zone") \
.option('enableServicePrincipalAuth', 'true') \
.mode('append') \
.save()



Step 8:

Now read the table back to check that the transformed data was inserted into the Synapse SQL DW.



df2 = spark.read.format("sqldw") \
.option("url", 'jdbc:sqlserver://synapse-rnd-learning.sql.azuresynapse.net:1433;database=prac_dedicated') \
.option('tempDir', 'abfss://stagging@synapselearningadls.dfs.core.windows.net/stage') \
.option("forwardSparkAzureStorageCredentials", "false") \
.option("dbTable", "temp_table.taxi_zone") \
.option('enableServicePrincipalAuth', 'true') \
.load()



DW table

NOTE:
Notice that in the PySpark code above, a staging location (tempDir) is defined for both reading and writing.

Staging from Databricks
