Develop batch processing solutions by using Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, and Azure Data Factory
- Services for each layer in a batch processing architecture
- Ingestion: Data Factory
- Storage: Blob Storage, ADLS Gen2, Cosmos DB
- Processing: Databricks, HDInsight, Data Flows
- Serving: Azure SQL, Dedicated SQL, Analysis Services
- Orchestration: Data Factory (or Synapse)
- Azure Synapse Analytics
- Group of multiple, well-integrated services
- Works across all layers of architecture
Use PolyBase to load data to a SQL pool
- Dedicated SQL Pool
- Overview
- Formerly known as Azure SQL Data Warehouse
- Available as standalone service and within Synapse
- Like a SQL Server Database
- Massively parallel processing (MPP) architecture
- Elastically scale compute and storage separately
- Pause or resume service to save cost
- Components
- Distributions
- Basic unit of storage
- Fixed 60 distributions
- Queries executed against each distribution in parallel
- Stored in Azure Storage
- Control node
- SQL Server endpoint
- Queries go to control node
- Only stores metadata
- Coordinates query execution with compute nodes
- Compute nodes
- Execute queries
- Max 60 compute nodes
- Distributions equally divided among compute nodes
- Data Movement Service (DMS)
- Coordinates movement of data between compute nodes
- For some queries (joins, GROUP BY), data needs to be co-located (see the distribution sketch below)
- Data Warehouse Units
- DWU = CPU + memory + I/O
- Represents computational power
- Can be increased or decreased to enable scaling
- Billed per hour (lower the DWU level to reduce costs)
- Features
- Most regular SQL features are supported
- DDL and DML statements and Dynamic SQL
- Dynamic management views
- Triggers and cross-database queries are not supported
- Constraints, identity columns, and relationships work differently than SQL Server
- Can be used in both the processing and serving layers
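To make distributions concrete, here is a minimal T-SQL sketch, assuming hypothetical tables (FactSales, DimCustomer) and a join key (CustomerId) that are not from these notes:

```sql
-- Hash-distributing a large fact table on the join key spreads its rows
-- across the 60 distributions and keeps joined rows co-located,
-- so DMS doesn't need to move data between compute nodes.
CREATE TABLE dbo.FactSales
(
    SaleId     INT NOT NULL,
    CustomerId INT NOT NULL,
    Amount     DECIMAL(10, 2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);

-- Small dimension tables are often replicated to every compute node instead
CREATE TABLE dbo.DimCustomer
(
    CustomerId INT NOT NULL,
    Name       NVARCHAR(100)
)
WITH (DISTRIBUTION = REPLICATE);
```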
- PolyBase
- Overview
- Read and write data in external storage using T-SQL
- Available in SQL Server and Synapse
- Supports delimited text, Parquet, and ORC files, including GZIP- and Snappy-compressed files
- Control node passes storage location to compute nodes, which read the data
- Components
- Database Scoped Credential = access storage account
- External Data Source = define the storage location
- External File Format = format of the file being read
- External Table = metadata of underlying file
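A minimal sketch of the four PolyBase components followed by a CTAS load; the credential, storage account, container, paths, and table names below are placeholders, and the account-key credential is just one of the supported options (managed identity is another):

```sql
-- A master key is required before creating a database scoped credential
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword123!>';

-- 1. Database scoped credential = access to the storage account
CREATE DATABASE SCOPED CREDENTIAL AdlsCredential
WITH IDENTITY = 'user',
     SECRET   = '<storage-account-key>';

-- 2. External data source = the storage location
CREATE EXTERNAL DATA SOURCE RawDataLake
WITH (
    TYPE       = HADOOP,
    LOCATION   = 'abfss://raw@mydatalake.dfs.core.windows.net',
    CREDENTIAL = AdlsCredential
);

-- 3. External file format = format of the files being read
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- 4. External table = metadata over the underlying files
CREATE EXTERNAL TABLE dbo.ExtSales
(
    SaleId INT,
    Amount DECIMAL(10, 2)
)
WITH (
    LOCATION    = '/sales/',
    DATA_SOURCE = RawDataLake,
    FILE_FORMAT = ParquetFormat
);

-- Load into the pool with CTAS; compute nodes read the files in parallel
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM dbo.ExtSales;
```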
Implement Azure Synapse Link and query the replicated data
- Azure Synapse Link
- Cloud-native implementation of HTAP
- Hybrid transactional and analytical processing
- Directly query data in operational stores, no ETL required
- Near real-time querying
- Supports Cosmos DB, Azure SQL, Dataverse
- Cosmos DB
- Fully managed NoSQL platform
- Supports the MongoDB, Table, Cassandra, and Gremlin APIs
- Global distribution - data can be replicated to multiple regions
- Elastic scalability
- Synapse Link for Cosmos DB
- Transactional store is synced to analytical store from which Synapse can read data
- No performance impact on the transactional store
- Analytical store auto-syncs within 2 minutes (max 5 minutes)
- Only accessible from Synapse
- Only charged for storage
- Supports change data capture and time travel
- In the portal
- In Cosmos DB account, see Azure Synapse Link under Integrations on the left-side panel
- Enable Synapse Link
- Create the container, setting Analytical Store to On
- To connect from Synapse, get the primary account key from Keys under Settings in the left-side panel
- In the Synapse workspace, go to Data and set up a linked service and data source for Cosmos DB
- Open a SQL script to query the data in Cosmos DB
- Create a credential with the primary key
- Use OPENROWSET to query
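A hedged sketch of the credential and OPENROWSET query from a Synapse serverless SQL pool; the credential name, Cosmos DB account, database, and container names are placeholders:

```sql
-- The secret is the Cosmos DB primary account key copied from the portal
CREATE CREDENTIAL [MyCosmosDbCredential]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET   = '<cosmos-db-primary-account-key>';

-- Query the analytical store directly, no ETL required
SELECT TOP 100 *
FROM OPENROWSET(
        PROVIDER          = 'CosmosDB',
        CONNECTION        = 'Account=my-cosmos-account;Database=SalesDb',
        OBJECT            = 'Orders',
        SERVER_CREDENTIAL = 'MyCosmosDbCredential'
     ) AS orders;
```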
Create data pipelines
- General steps
- Configure firewall to allow IP address and Azure Services to connect to data sources and sinks
- Create an ADF/Synapse instance
- Create a linked service to the source data
- Create a new dataset from the data in the linked service
- Create a Data Flow
- Select data source
- Choose transformation steps (join, group, conditional split, etc)
- Select sink
- Create a new Pipeline
- Choose a Copy activity and/or the Data Flow
Scale resources
- Types of scaling
- Vertical scaling (up/down) = add more resources to a machine to make it more powerful
- Horizontal scaling (in/out) = add more machines
- Scaling Azure SQL
- Supports both up and out depending on config
- During up/down, the following can be changed
- Service tier
- DTU model: basic, standard, and premium
- vCore model: general purpose, hyperscale, business critical
- Compute tier (vCore): provisioned or serverless
- Resources (CPU, RAM, storage, etc)
- Scaling up/down results in database restart
- To scale out, can only add up to 4 read-only replicas
- In the portal (Azure SQL database)
- Go to Compute + storage
- Select an option in Service tier
- Choose Compute tier
- Use sliders to select vCores, DTUs, Read scale-out, etc
- Scaling Dedicated SQL Pool
- Increase/decrease number of compute nodes and memory on each node
- Defined using DWUs
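Both can also be scaled with T-SQL; the database and pool names and target tiers below are illustrative, not from these notes:

```sql
-- Scale an Azure SQL database to a different service tier / objective
-- (results in a short restart when the change completes)
ALTER DATABASE [SalesDb] MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S3');

-- Scale a dedicated SQL pool by changing its DWU level
ALTER DATABASE [SalesPool] MODIFY (SERVICE_OBJECTIVE = 'DW300c');
```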
Create tests for data pipelines
- Testing pipelines is different from testing applications because we're testing data instead of code
- Automated testing involves automating the process of validating whether the pipeline produces the expected output
- Types of tests
- Unit tests
- Test individual units
- In data pipeline, run each activity individually and validate result
- Hard to do in ADF
- Programmatically enable one activity at a time and disable others
- Generate and use fake data to test edge cases
- Functional tests
- Have pipeline generate actual output and compare to expected output
- Run complete pipeline, not just individual activities
- Used to confirm that pipeline meets business requirements
- Performance and regression tests
- Regression tests ensure that change in one pipeline doesn't impact other pipelines
- Run multiple dependent pipelines together
- Performance test to ensure pipeline meets SLAs
- Data quality tests
- Verify that data meets quality standards
- Typically embedded as part of the pipeline
- Completeness
- Uniqueness
- Timeliness
- Accuracy
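As an example of data quality checks embedded in a pipeline, a few T-SQL probes are sketched below; the table and columns (StageOrders, OrderId, LoadDate) are hypothetical:

```sql
-- Completeness: the business key should never be NULL
SELECT COUNT(*) AS null_order_ids
FROM dbo.StageOrders
WHERE OrderId IS NULL;

-- Uniqueness: no duplicate business keys
SELECT OrderId, COUNT(*) AS duplicate_count
FROM dbo.StageOrders
GROUP BY OrderId
HAVING COUNT(*) > 1;

-- Timeliness: confirm the latest load is recent enough to meet the SLA
SELECT MAX(LoadDate) AS last_load_time
FROM dbo.StageOrders;
```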
Integrate Jupyter or Python notebooks into a data pipeline
- Notebooks are typically used for Spark apps and development
- Notebooks are supported natively in services like Databricks and Synapse
- Basic steps for Synapse
- Create Synapse Spark pool
- Create new notebook and define language
- Attach notebook to Spark pool
- Write code to read and process data
- Add parameters to notebook
- To invoke notebook in ADF
- Create linked service to Synapse (under compute, not storage)
- Make sure ADF has manage permissions for Synapse Spark and access to storage
- Create pipeline and add notebook activity
- Select notebook and parameters
- Run pipeline
Use Mapping Data Flows in Azure Synapse pipelines and Azure Data Factory pipelines
- Mapping Data Flows provides no-code ETL workflow
- Can apply transformations to source data
- Add/remove columns, rename, filter, join, aggregate
- Runs as Spark code under the hood
- Automatically adds optimizations
- Can add user-defined optimizations
- Executes on a Spark cluster
- Called Data Flow Debug
- Can define cluster configuration
- Pros and Cons
- Pros
- Faster development
- UI based drag-and-drop approach
- Fast and scalable processing
- Cons
- Less flexible since code can't be modified
- Can be complex for large workflows
Upsert data
- DML statements
- Select, insert, update, delete
- Upsert is a combination of update and insert: update the record if it exists, insert it if not
- Options to change data in Azure SQL
- Using T-SQL (DML statements, merge command)
- Data Factory/Synapse pipelines (copy, data flow with Alter Row)
- Can upsert on files in Data Lake using Delta Lake
- Options to perform upsert
- T-SQL "merge" command (see the sketch after this list)
- Specify source with "USING"
- Specify join condition
- "WHEN MATCHED" = behavior for existing records
- "WHEN NOT MATCHED BY TARGET" = behavior for records not in target
- "WHEN NOT MATCHED BY SOURCE" = behavior for records not in source
- Copy activity
- Change write behavior in sink to upsert and define key columns
- Data flows
- Use alter row transformation
- Define alter row conditions
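A minimal MERGE sketch against hypothetical Customers tables (a staging source and a dbo target); the column names are placeholders:

```sql
MERGE dbo.Customers AS target
USING staging.Customers AS source
    ON target.CustomerId = source.CustomerId        -- join condition
WHEN MATCHED THEN                                   -- record exists: update it
    UPDATE SET target.Name  = source.Name,
               target.Email = source.Email
WHEN NOT MATCHED BY TARGET THEN                     -- record missing in target: insert it
    INSERT (CustomerId, Name, Email)
    VALUES (source.CustomerId, source.Name, source.Email)
WHEN NOT MATCHED BY SOURCE THEN                     -- record missing in source: delete it
    DELETE;
```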
Revert data to a previous state in Azure storage
- Restorable entities
- Individual file (blob) - can revert to previous version or undelete
- Container - container and files can be reverted or undeleted
- Restoring files
- Use snapshot (read-only version of file from point in time)
- Created manually by user or application
- Used to restore back to prior version
- Enable versioning
- Enabled at storage account level
- Automatically creates a new version when the file is updated
- Select and restore a specific version
- Enable soft delete
- Enabled at storage account level
- Deleted files can be restored for a certain number of days
- Restoring containers
- Enable point-in-time restore
- Restores container to specific point in time
- Enabled at storage account level
- Versioning, change feed, and soft delete must also be enabled
- Enable soft delete
- Enabled at storage account level
- Deleted containers can be restored for a certain number of days
- In the storage account portal, these options are under Data management --> Data protection in the left-side panel
- File versions and snapshots can be viewed in blob properties by clicking on the file
Revert data to a previous state in Azure SQL and Dedicated SQL Pool
- Azure SQL backup
- Automatically creates backups based on SQL Server technology
- Full backups every week
- Differential backups every 12 to 24 hours
- Transaction log backups every 10 mins
- Backups are stored in Azure Storage
- Redundancy is configurable
- Point-in-time restore (auto)
- Auto-created backup
- Kept for limited days (1 to 35, default is 7)
- Long-term retention (not auto)
- Define policy to keep backups longer
- Configure weekly, monthly, yearly backups and keep up to 10 years
- Azure SQL restore
- Restore using PITR or LTR
- For a PITR restore, the service identifies which backups to use
- For LTR, database can be restored in same or different region
- Restore deleted database
- Restore creates a new database
- Use to update or replace existing database
- In the Azure SQL Server portal
- Data management --> Backups to view restore point details and retention policies
- Dedicated SQL backup and restore
- Local backup
- Dedicated SQL automatically creates snapshots used as restore points
- Up to 42 user-defined restore points can be created
- Restore points are retained for 7 days
- Geo backup
- Created every 24 hours and stored in a different region
- Only latest backup is retained
- Restore database in any region using restore points
- Restore creates a new database that updates or replaces existing one
Configure exception handling
- For a single activity
- Try/catch block
- When the first activity fails, a second (catch) activity runs that performs an action based on the failure
- Try/catch/proceed block
- The last (proceed) activity runs whether the first activity succeeds or fails: if the first activity succeeds, the catch activity is skipped and the proceed activity still runs via the skip path
- If/else block
- One path for success, different path for failure
- Pipeline succeeds if the first activity succeeds; otherwise the pipeline fails
- If/skip/else block
- Pipeline succeeds whether the first activity succeeds or fails, because a failure causes the other activities to be skipped instead of failing the pipeline
- For multiple activities
- Sequential run
- Activities are sequential
- One or more activities are configured to run on failure or skip of previous activity
- Pipeline continues regardless of upstream failure
- Parallel run
- Some activities are parallel
- Downstream activity depends on success of all parallel activities
- Further downstream activity can be configured to run after skip so pipeline continues even if parallel activities fail
Read from and write to a delta lake
- Data lake challenges
- Data reliability issues
- Corruption because of failures (no rollback)
- No data validation
- Consistency issues while reading data
- No updates/deletes/merges on files
- Difficult to implement GDPR/CCPA compliance
- Data quality issues
- Schema isn't verified before writing
- Cannot apply checks on data
- Query performance issues
- Difficult to maintain historical versions of data
- Delta Lake
- Open-source storage layer that brings reliability to data lakes
- Can be installed on-prem
- Available by default on many cloud platforms
- Provides database-like features on top of data lake
- Create constraints, enforce schema, run DML statements, etc
- Provides ACID guarantees
- Works by storing a transaction log of all operations performed on the data (e.g. writes via dataframe.write.format("delta"))
- The log entry is only created after the write completes and is not created if the write fails, which is what provides the ACID guarantees
- Delta Lake availability
- Can be downloaded and installed
- On local machine
- On-prem Spark cluster
- Cloud platforms like Azure HDInsight
- Available by default in cloud platforms
- Azure Databricks
- Azure Synapse Spark pools
- In the portal (Databricks)
- Use spark.conf.set to connect to storage
- Use dbutils.fs.ls to list files in storage path
- Define input and output folder paths, then read from the input path (spark.read.option(...).csv(inputPath))
- To write to Delta Lake
- Write in Delta format with output path DF.write.format("delta").save(outputPath + "filename.delta")
- Check output location in storage to confirm write
- Check the _delta_log folder to see metadata about the write
- To read from Delta Lake
- Use Spark SQL to create a database
- Create a table in the database using CREATE TABLE table_name USING DELTA LOCATION "delta_file_path/filename.delta"
- DESCRIBE HISTORY table_name can be used to audit the history of the Delta table
- Read different versions of the data using SELECT * FROM table_name VERSION AS OF [version number], or SELECT * FROM table_name TIMESTAMP AS OF '[timestamp]'
- Can restore previous versions with RESTORE TABLE table_name TO VERSION AS OF [version number]
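The read and time-travel steps above can be expressed in Spark SQL roughly as follows; the database, table, storage path, version number, and timestamp are placeholders:

```sql
-- Expose the Delta output as a table over the files written earlier
CREATE DATABASE IF NOT EXISTS sales_db;

CREATE TABLE IF NOT EXISTS sales_db.orders
USING DELTA
LOCATION 'abfss://data@mydatalake.dfs.core.windows.net/output/orders.delta';

-- Audit the transaction log of the Delta table
DESCRIBE HISTORY sales_db.orders;

-- Time travel: read earlier versions by version number or timestamp
SELECT * FROM sales_db.orders VERSION AS OF 2;
SELECT * FROM sales_db.orders TIMESTAMP AS OF '2024-01-15T00:00:00';

-- Roll the table back to an earlier version
RESTORE TABLE sales_db.orders TO VERSION AS OF 2;
```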