Introduction to Partition Switching
Azure Synapse Dedicated SQL pool, SQL Server, and Azure SQL Database all allow you to create partitions on a table. Table partitions divide your data into smaller chunks, which improves query performance through partition elimination: partitions that a query does not need are skipped entirely. In most cases, partitions are built on a date column.
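For example, once the fact table built later in this post is partitioned on [DateKey], a query that filters on that column only has to scan the matching partition (an illustrative query, not part of the load script):

-- Partition elimination: only the partition holding 2009 rows is scanned.
SELECT COUNT(*)
FROM [cso].[FactOnlineSales]
WHERE [DateKey] >= '2009-01-01'
  AND [DateKey] < '2010-01-01';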
Why don't we simply drop the unwanted Partition?
We don't simply drop the unwanted partition for several reasons:
- Clustered Columnstore Index: Dropping a partition directly can degrade performance, especially with a CCI. CCIs are optimized for data locality, and dropping a partition disrupts that organization. The CCI would then need to be rebuilt, which can be time-consuming on a large table.
- Transaction Safety: Directly dropping a partition might not be a transactional operation. If the drop fails midway, the partition could be left in an inconsistent state, causing data integrity issues.
Requirements for Partition Switching
- The source and target tables must have identical definitions (columns, data types, indexes, and distribution).
Steps for Partition Switching in Synapse SQL Pool:
Step 1 (Optional) -> Create a credential
Skip this step if you're loading the Contoso public data.
Don't skip this step if you're loading your own data. To access data through a credential, use the following script to create a database-scoped credential. Then use it when defining the location of the data source.
CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'Managed Identity',
    SECRET = 'https://rnd-learning.vault.azure.net/secrets/synapselearningadls-accesskey/d94c967cb0c5452eafaf5d207afcb86a'
;

CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<blob_container_name>@<azure_storage_account_name>.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
- MASTER KEY is required to encrypt the credential secret in the next statement.
- IDENTITY refers to the type of authentication you're using. Here I am using Managed Identity, which allows the Azure Synapse workspace to securely connect to and authenticate with Azure Key Vault without embedding any credentials directly in the code.
- TYPE is HADOOP because PolyBase uses Hadoop APIs to access data in Azure Blob Storage.
Step 2 -> Create the external data source
Use this command to store the location of the data and the type of data source.
CREATE EXTERNAL DATA SOURCE AzureStorage_west_public
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://contosoretaildw-tables@contosoretaildw.blob.core.windows.net/'
);
Step 3 -> Configure the data format
The data is stored in text files in Azure blob storage, and each field is separated with a delimiter. Use this command to specify the format of the data in the text files. The Contoso data is uncompressed and pipe-delimited.
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        STRING_DELIMITER = '',
        DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff',
        USE_TYPE_DEFAULT = FALSE
    )
);
Step 4 -> Create the schema for the external tables
To create a place to store the Contoso data in your database, create a schema.
CREATE SCHEMA [asb]
GO
Step 5 -> Create the external tables
Run the following script to create the FactOnlineSales external table. All you're doing here is defining column names and data types, and binding them to the location and format of the Azure blob storage files. The definition is stored in the data warehouse, and the data remains in Azure Blob Storage.
CREATE EXTERNAL TABLE [asb].FactOnlineSales (
    [OnlineSalesKey] [int] NOT NULL,
    [DateKey] [datetime] NOT NULL,
    [StoreKey] [int] NOT NULL,
    [ProductKey] [int] NOT NULL,
    [PromotionKey] [int] NOT NULL,
    [CurrencyKey] [int] NOT NULL,
    [CustomerKey] [int] NOT NULL,
    [SalesOrderNumber] [nvarchar](20) NOT NULL,
    [SalesOrderLineNumber] [int] NULL,
    [SalesQuantity] [int] NOT NULL,
    [SalesAmount] [money] NOT NULL,
    [ReturnQuantity] [int] NOT NULL,
    [ReturnAmount] [money] NULL,
    [DiscountQuantity] [int] NULL,
    [DiscountAmount] [money] NULL,
    [TotalCost] [money] NOT NULL,
    [UnitCost] [money] NULL,
    [UnitPrice] [money] NULL,
    [ETLLoadID] [int] NULL,
    [LoadDate] [datetime] NULL,
    [UpdateDate] [datetime] NULL
) WITH (
    LOCATION = '/FactOnlineSales/',
    DATA_SOURCE = AzureStorage_west_public,
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
Step 6 -> Load the data
There are different ways to access external data. You can query data directly from the external tables, load the data into new tables in the data warehouse, or add external data to existing data warehouse tables.
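For example, you can query the external table in place before loading anything (the rows are read straight from blob storage through PolyBase, so this is slower than querying a warehouse table):

-- Explore the external data; the rows never leave Azure Blob Storage.
SELECT TOP 10 * FROM [asb].FactOnlineSales;

In this post the data is loaded into new tables with CTAS, as shown in Step 6.2.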
Step 6.1 -> Create a new schema
CREATE SCHEMA [cso]
GO
Step 6.2 -> Load the data into new tables
To load data from Azure blob storage into the data warehouse table, use the CREATE TABLE AS SELECT (Transact-SQL) statement.
CTAS creates a new table and populates it with the results of a select statement. CTAS defines the new table to have the same columns and data types as the results of the select statement. If you select all the columns from an external table, the new table will be a replica of the columns and data types in the external table.
CREATE TABLE [cso].[FactOnlineSales]
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([ProductKey]),
    PARTITION (
        [DateKey] RANGE RIGHT FOR VALUES (
            '2007-01-01 00:00:00.000','2008-01-01 00:00:00.000',
            '2009-01-01 00:00:00.000','2010-01-01 00:00:00.000'
        )
    )
)
AS
SELECT * FROM [asb].FactOnlineSales;
With this statement I have created 5 partitions in the [cso].[FactOnlineSales] table. Each covers one year, except the first, which contains all rows with DateKey < 2007-01-01, and the last, which contains all rows with DateKey >= 2010-01-01.
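Because the partitions are RANGE RIGHT, each boundary value belongs to the partition on its right, so the four boundary values produce this layout:

-- Partition 1: DateKey <  '2007-01-01'
-- Partition 2: '2007-01-01' <= DateKey < '2008-01-01'
-- Partition 3: '2008-01-01' <= DateKey < '2009-01-01'
-- Partition 4: '2009-01-01' <= DateKey < '2010-01-01'
-- Partition 5: DateKey >= '2010-01-01'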
Step 7 -> Create an empty partition table
Now do the same thing for the target table. Here I deliberately created an empty table to switch with the source table.
CREATE TABLE [cso].[FactOnlineSales_out]
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([ProductKey]),
    PARTITION (
        [DateKey] RANGE RIGHT FOR VALUES (
            '2007-01-01 00:00:00.000'
        )
    )
)
AS SELECT * FROM [cso].[FactOnlineSales] WHERE 1 = 2;
NOTE: When you switch a partition out (that is, remove it from the source table), you can switch the data into any table with a matching definition, whether or not that table is partitioned the same way. Here the target table was created with a single boundary, so its partition 2 (DateKey >= 2007-01-01) can receive the switched data. But when you switch a partition in (that is, replace a partition with new data), the criterion is stricter: the target partition boundaries must match, as shown in the sketch below.
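As a sketch of the switch-in case (using a hypothetical staging table [cso].[FactOnlineSales_stage] and the ALTER TABLE ... SWITCH syntax shown in the next step), reloading the 2007 partition could look like this:

-- The staging table must match the target's definition AND its
-- partition boundaries, and the target partition must be empty.
CREATE TABLE [cso].[FactOnlineSales_stage]
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([ProductKey]),
    PARTITION (
        [DateKey] RANGE RIGHT FOR VALUES (
            '2007-01-01 00:00:00.000','2008-01-01 00:00:00.000',
            '2009-01-01 00:00:00.000','2010-01-01 00:00:00.000'
        )
    )
)
AS SELECT * FROM [asb].FactOnlineSales
WHERE [DateKey] >= '2007-01-01' AND [DateKey] < '2008-01-01';

ALTER TABLE [cso].[FactOnlineSales_stage]
SWITCH PARTITION 2
TO [cso].[FactOnlineSales] PARTITION 2;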
Step 8 -> Switch the Partition
Here I switched out the partition. After the switch, [cso].[FactOnlineSales_out] holds the data from January 1st, 2007 through December 31st, 2007, while partition 2 of [cso].[FactOnlineSales] is empty.
ALTER TABLE [cso].[FactOnlineSales]
SWITCH PARTITION 2
TO [cso].[FactOnlineSales_out] PARTITION 2;
NOTE: The command is very simple, but there is one catch: it requires the partition numbers of the source and target tables to perform the switch.
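If you don't know the partition numbers, the catalog views can map boundary values to them (a sketch following the standard SQL Server catalog layout; with RANGE RIGHT, boundary_id N is the lower edge of partition N + 1):

SELECT t.name AS table_name,
       prv.boundary_id,
       prv.value AS boundary_value
FROM sys.tables t
JOIN sys.indexes i
    ON t.object_id = i.object_id AND i.index_id <= 1
JOIN sys.partition_schemes ps
    ON i.data_space_id = ps.data_space_id
JOIN sys.partition_functions pf
    ON ps.function_id = pf.function_id
JOIN sys.partition_range_values prv
    ON pf.function_id = prv.function_id
WHERE t.name = 'FactOnlineSales'
ORDER BY prv.boundary_id;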
Validate the partition switch on both the source and target tables.
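A quick sanity check is to compare row counts on both sides:

-- The switched-out rows should now live in the _out table...
SELECT COUNT(*) AS switched_rows FROM [cso].[FactOnlineSales_out];

-- ...and the source table should have no rows left for 2007.
SELECT COUNT(*) AS remaining_2007_rows
FROM [cso].[FactOnlineSales]
WHERE [DateKey] >= '2007-01-01' AND [DateKey] < '2008-01-01';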
Step 9 -> Delete the staging table
Based on your requirements, delete this table, or archive its data as cold data first (see the CETAS sketch below).
DROP TABLE [cso].[FactOnlineSales_out];
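If you want to archive the data before dropping the table, one option is CREATE EXTERNAL TABLE AS SELECT (CETAS), which exports the rows back to blob storage (a sketch; the archive location and table name are illustrative, and the data source must be one you can write to, such as the credentialed AzureStorage source from Step 1):

-- Export the switched-out rows to blob storage as cold data,
-- then drop the staging table.
CREATE EXTERNAL TABLE [asb].[FactOnlineSales_2007_archive]
WITH (
    LOCATION = '/archive/FactOnlineSales/2007/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFileFormat
)
AS SELECT * FROM [cso].[FactOnlineSales_out];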
What happens during the Partition Switch?
Before the Switch:
- Imagine the data for FactOnlineSales is physically stored on disk, potentially spread across multiple files.
- Each partition in FactOnlineSales has its own metadata entry that keeps track of the specific location(s) of its data on disk.
During the Switch (using partition X as the example):
- You identify partition X (containing old data) in FactOnlineSales.
- The ALTER TABLE SWITCH statement updates the metadata entries for both tables:
- In FactOnlineSales, the metadata entry for partition X is modified to point to an empty location on disk. This essentially signifies that partition X is now "empty" within FactOnlineSales.
- In FactOnlineSales_out, a new metadata entry is created for partition X. This new entry points to the same physical location on disk where the data for partition X already resides (remember, the data itself doesn't move).
After the Switch:
Both FactOnlineSales and FactOnlineSales_out have metadata entries for partition X. However, these entries point to different things:
- The FactOnlineSales entry points to an empty location, indicating the partition is no longer actively used within that table.
- The FactOnlineSales_out entry points to the actual data location, so FactOnlineSales_out now effectively "owns" that partition.
How to check or verify the number of partitions?
Dedicated SQL pool provides several system views that can be used to query metadata for all the objects in the pool. One of the views that exposes partition-level information, including the number of rows in each partition, is sys.dm_pdw_nodes_db_partition_stats.
Use this script to check the number of partitions.
SELECT pnp.partition_number, SUM(nps.[row_count]) AS Row_Count
FROM
sys.tables t
INNER JOIN sys.indexes i
ON t.[object_id] = i.[object_id]
AND i.[index_id] <= 1 /* HEAP = 0, CLUSTERED or CLUSTERED_COLUMNSTORE =1 */
INNER JOIN sys.pdw_table_mappings tm
ON t.[object_id] = tm.[object_id]
INNER JOIN sys.pdw_nodes_tables nt
ON tm.[physical_name] = nt.[name]
INNER JOIN sys.pdw_nodes_partitions pnp
ON nt.[object_id]=pnp.[object_id]
AND nt.[pdw_node_id]=pnp.[pdw_node_id]
AND nt.[distribution_id] = pnp.[distribution_id]
INNER JOIN sys.dm_pdw_nodes_db_partition_stats nps
ON nt.[object_id] = nps.[object_id]
AND nt.[pdw_node_id] = nps.[pdw_node_id]
AND nt.[distribution_id] = nps.[distribution_id]
AND pnp.[partition_id]=nps.[partition_id]
WHERE t.name='FactOnlineSales'
GROUP BY pnp.partition_number;