Alec Dutcher

DP-203 Study Guide - Develop a batch processing solution

Study guide

Develop batch processing solutions by using Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, and Azure Data Factory

  • Services for each layer in a batch processing architecture
    • Ingestion: Data Factory
    • Storage: Blob Storage, ADLS Gen2, Cosmos DB
    • Processing: Databricks, HDInsight, Data Flows
    • Serving: Azure SQL, Dedicated SQL, Analysis Services
    • Orchestration: Data Factory (or Synapse)
  • Azure Synapse Analytics
    • Group of multiple, well-integrated services
    • Works across all layers of architecture

Use PolyBase to load data to a SQL pool

  • Dedicated SQL Pool

    • Overview
      • Formerly known as Azure SQL Data Warehouse
      • Available as standalone service and within Synapse
      • Similar to a SQL Server database
      • Massively parallel processing (MPP) architecture
      • Elastically scale compute and storage separately
      • Pause or resume service to save cost
    • Components
      • Distributions
        • Basic unit of storage
        • Fixed at 60 distributions
        • Queries executed against each distribution in parallel
        • Stored in Azure Storage
      • Control node
        • SQL Server endpoint
        • Queries go to control node
        • Only stores metadata
        • Coordinates query execution with compute nodes
      • Compute nodes
        • Execute queries
        • Max 60 compute nodes
        • Distributions equally divided among compute nodes
      • Data Movement Service (DMS)
        • Coordinates movement of data between compute nodes
        • For some queries (joins, GROUP BY), data needs to be co-located, so the DMS moves rows between nodes at query time
    • Data Warehousing Units
      • DWU = CPU + memory + I/O
      • Represents computational power
      • Can be increased or decreased to enable scaling
      • Billed per hour (scale down or pause to reduce costs)
    • Features
      • Most regular SQL features are supported
      • DDL and DML statements and dynamic SQL
      • Dynamic management views (DMVs)
    • Limitations
      • Triggers and cross-database queries are not supported
      • Constraints, identity columns, and relationships work differently than in SQL Server
    • Can be used in both the processing and serving layers
  • PolyBase

    • Overview
      • Read and write data in external storage using T-SQL
      • Available in SQL Server and Synapse
      • Supports delimited text, Parquet, and ORC files, including GZIP- and Snappy-compressed data
      • Control node passes storage location to compute nodes, which read the data
    • Components (see the T-SQL sketch after this list)
      • Database scoped credential = credential used to access the storage account
      • External data source = defines the storage location
      • External file format = defines the format of the files being read
      • External table = table metadata over the underlying files
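
As a rough illustration, here is a minimal T-SQL sketch of those four objects followed by a CTAS load into the pool. All names (credential, data source, file format, tables, storage account, container, columns) are hypothetical, and the authentication details depend on your setup:

```sql
-- Hypothetical names throughout; run against a dedicated SQL pool
CREATE MASTER KEY;  -- required once per database, if not already present

CREATE DATABASE SCOPED CREDENTIAL AdlsCredential
WITH IDENTITY = 'user',                 -- identity is ignored for storage-key auth;
     SECRET   = '<storage-account-key>'; -- managed identity is another option

CREATE EXTERNAL DATA SOURCE AdlsSource
WITH (
    TYPE = HADOOP,                      -- classic PolyBase external data source
    LOCATION = 'abfss://raw@mydatalake.dfs.core.windows.net',
    CREDENTIAL = AdlsCredential
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

CREATE EXTERNAL TABLE dbo.ExternalSales (
    SaleId     INT,
    CustomerId INT,
    Amount     DECIMAL(18, 2)
)
WITH (
    LOCATION = '/sales/',               -- folder under the data source location
    DATA_SOURCE = AdlsSource,
    FILE_FORMAT = CsvFormat
);

-- Load into an internal table with CTAS
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.ExternalSales;
```

The CTAS at the end also shows the HASH distribution option discussed under Distributions above.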

Implement Azure Synapse Link and query the replicated data

  • Azure Synapse Link
    • Cloud-native implementation of HTAP
    • Hybrid transactional and analytical processing
    • Directly query data in operational stores, no ETL required
    • Near real-time querying
    • Supports Cosmos DB, Azure SQL, Dataverse
  • Cosmos DB
    • Fully managed NoSQL platform
    • Supports multiple APIs: NoSQL (core), MongoDB, Table, Cassandra, and Gremlin
    • Global distribution - data can be replicated to multiple regions
    • Elastic scalability
  • Synapse Link for Cosmos DB
    • Transactional store is synced to analytical store from which Synapse can read data
    • No performance impact on the transactional store
    • Analytical store auto-syncs within about 2 minutes (max 5 minutes)
    • Only accessible from Synapse
    • Analytical store is billed for storage only and doesn't consume request units (RUs)
    • Supports change data capture and time travel
  • In the portal
    • In Cosmos DB account, see Azure Synapse Link under Integrations on the left-side panel
    • Enable Synapse Link
    • Create the container, setting Analytical Store to On
    • To connect from Synapse, get the primary account key from Keys under Settings in the left-side panel
    • In the Synapse workspace, go to Data and set up a linked service and data source for Cosmos DB
    • Open a SQL script to query the data in Cosmos DB
    • Create a credential with the primary key
    • Use OPENROWSET to query (see the sketch below)
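
A minimal sketch of those last two steps from a serverless SQL pool, assuming a hypothetical Cosmos DB account (myCosmosAccount), database (RetailDb), container (Orders), and credential name:

```sql
-- Hypothetical account, database, container, and credential names
CREATE CREDENTIAL MyCosmosCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<cosmos-primary-account-key>';

SELECT TOP 10 *
FROM OPENROWSET(
    PROVIDER = 'CosmosDB',
    CONNECTION = 'Account=myCosmosAccount;Database=RetailDb',
    OBJECT = 'Orders',
    SERVER_CREDENTIAL = 'MyCosmosCredential'
) AS orders;
```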

Create data pipelines

  • General steps
    • Configure firewalls to allow your IP address and Azure services to connect to data sources and sinks
    • Create an ADF/Synapse instance
    • Create a linked service to the source data
    • Create a new dataset from the data in the linked service
    • Create a Data Flow
      • Select data source
      • Choose transformation steps (join, group, conditional split, etc)
      • Select sink
    • Create a new Pipeline
    • Choose a Copy activity and/or the Data Flow

Scale resources

  • Types of scaling
    • Vertical scaling (up/down) = add more resources to a machine to make it more powerful
    • Horizontal scaling (in/out) = add more machines
  • Scaling Azure SQL
    • Supports both up and out depending on config
    • During up/down, the following can be changed
      • Service tier
        • DTU model: Basic, Standard, and Premium
        • vCore model: General Purpose, Hyperscale, Business Critical
      • Compute tier (vCore): provisioned or serverless
      • Resources (CPU, RAM, storage, etc)
    • Scaling up/down drops existing connections (effectively a brief restart) at the end of the operation
    • To scale out, can only add up to 4 read-only replicas
    • In the portal (Azure SQL database)
      • Go to Compute + storage
      • Select an option in Service tier
      • Choose Compute tier
      • Use sliders to select vCores, DTUs, Read scale-out, etc
  • Scaling Dedicated SQL Pool
    • Increase/decrease the number of compute nodes and the memory on each node
    • Defined using DWUs (see the T-SQL sketch after this list)
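
A hedged T-SQL sketch of both scaling operations, with made-up database names (the same changes can be made from the portal, CLI, or PowerShell):

```sql
-- Azure SQL Database: change the service tier/size (typically run against the master database)
ALTER DATABASE [SalesDb]
MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S3');

-- Dedicated SQL pool: change the DWU level
ALTER DATABASE [SalesDW]
MODIFY (SERVICE_OBJECTIVE = 'DW300c');
```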

Create tests for data pipelines

  • Testing pipelines is different from testing applications because we're testing data instead of code
  • Automated testing involves automating the process of validating whether the pipeline produces the expected output
  • Types of tests
    • Unit tests
      • Test individual units
      • In data pipeline, run each activity individually and validate result
      • Hard to do in ADF
      • Programmatically enable one activity at a time and disable others
      • Generate and use fake data to test edge cases
    • Functional tests
      • Have pipeline generate actual output and compare to expected output
      • Run complete pipeline, not just individual activities
      • Used to confirm that pipeline meets business requirements
    • Performance and regression tests
      • Regression tests ensure that change in one pipeline doesn't impact other pipelines
      • Run multiple dependent pipelines together
      • Performance test to ensure pipeline meets SLAs
    • Data quality tests
      • Verify that data meets quality standards (see the sketch after this list)
      • Typically embedded as part of the pipeline, checking dimensions such as:
      • Completeness
      • Uniqueness
      • Timeliness
      • Accuracy
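
A minimal sketch of data quality checks expressed as SQL queries, assuming a hypothetical dbo.Sales table with CustomerId, OrderId, and LoadDate columns; each check is written so that a passing result is 0 or an empty result set:

```sql
-- Completeness: no missing business keys
SELECT COUNT(*) AS missing_customer_ids
FROM dbo.Sales
WHERE CustomerId IS NULL;

-- Uniqueness: no duplicate order IDs
SELECT OrderId, COUNT(*) AS occurrences
FROM dbo.Sales
GROUP BY OrderId
HAVING COUNT(*) > 1;

-- Timeliness: the latest load happened within the last 24 hours
SELECT CASE WHEN MAX(LoadDate) < DATEADD(HOUR, -24, GETUTCDATE()) THEN 1 ELSE 0 END AS is_stale
FROM dbo.Sales;
```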

Integrate Jupyter or Python notebooks into a data pipeline

  • Notebooks are typically used for Spark application development
  • Notebooks are supported natively in services like Databricks and Synapse
  • Basic steps for Synapse
    • Create Synapse Spark pool
    • Create new notebook and define language
    • Attach notebook to Spark pool
    • Write code to read and process data
    • Add parameters to notebook
  • To invoke notebook in ADF
    • Create linked service to Synapse (under compute, not storage)
    • Make sure ADF has manage permissions for Synapse Spark and access to storage
    • Create pipeline and add notebook activity
    • Select notebook and parameters
    • Run pipeline

Use Mapping Data Flows in Azure Synapse pipelines and Azure Data Factory pipelines

  • Mapping Data Flows provide a no-code ETL workflow
  • Can apply transformations to source data
    • Add/remove columns, rename, filter, join, aggregate
  • Translated into Spark code at runtime
    • Automatically adds optimizations
    • Can add user-defined optimizations
  • Executes on a Spark cluster
    • The interactive cluster used during development is called Data Flow Debug
    • Can define the cluster configuration
  • Pros and Cons
    • Pros
      • Faster development
      • UI-based drag-and-drop approach
      • Fast and scalable processing
    • Cons
      • Less flexible since the generated code can't be modified directly
      • Can be complex for large workflows

Upsert data

  • DML statements
    • SELECT, INSERT, UPDATE, DELETE
    • Upsert is combo of update and insert - update if exists, insert if not
  • Options to change data in Azure SQL
    • Using T-SQL (DML statements, merge command)
    • Data Factory/Synapse pipelines (copy, data flow with Alter Row)
    • Can upsert on files in Data Lake using Delta Lake
  • Options to perform upsert
    • T-SQL MERGE statement (see the sketch after this list)
      • Specify the source with USING
      • Specify the join condition
      • WHEN MATCHED = behavior for records that exist in both source and target
      • WHEN NOT MATCHED BY TARGET = behavior for source records not in the target
      • WHEN NOT MATCHED BY SOURCE = behavior for target records not in the source
    • Copy activity
      • Change write behavior in sink to upsert and define key columns
    • Data flows
      • Use alter row transformation
      • Define alter row conditions
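
A minimal MERGE sketch with hypothetical target and staging tables:

```sql
-- Hypothetical tables and columns
MERGE dbo.Customers AS target
USING staging.Customers AS source
    ON target.CustomerId = source.CustomerId
WHEN MATCHED THEN
    UPDATE SET target.Name  = source.Name,
               target.Email = source.Email
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerId, Name, Email)
    VALUES (source.CustomerId, source.Name, source.Email)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
```

The WHEN NOT MATCHED BY SOURCE ... DELETE clause is optional; leave it out for a plain upsert (update + insert only).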

Revert data to a previous state in Azure storage

  • Restorable entities
    • Individual file (blob) - can revert to previous version or undelete
    • Container - container and files can be reverted or undeleted
  • Restoring files
    • Use snapshot (read-only version of file from point in time)
      • Created manually by user or application
      • Used to restore back to prior version
    • Enable versioning
      • Enabled at storage account level
      • Automatically creates a new version when a file is updated or deleted
      • Select and restore a specific version
    • Enable soft delete
      • Enabled at storage account level
      • Deleted files can be restored for a certain number of days
  • Restoring containers
    • Enable point-in-time restore
      • Restores container to specific point in time
      • Enabled at storage account level
      • Versioning, change feed, and soft delete must also be enabled
    • Enable soft delete
      • Enabled at storage account level
      • Deleted containers can be restored for a certain number of days
  • In the storage account portal, these options are under Data management --> Data protection in the left-side panel
  • File versions and snapshots can be viewed in blob properties by clicking on the file

Revert data to a previous state in Azure SQL and Dedicated SQL Pool

  • Azure SQL backup
    • Automatically creates backups based on SQL Server technology
      • Full backups every week
      • Differential backups every 12 to 24 hours
      • Transaction log backups every 10 mins
      • Backups are stored in Azure Storage
      • Redundancy is configurable
    • Point-in-time restore (auto)
      • Auto-created backup
      • Kept for limited days (1 to 35, default is 7)
    • Long-term retention (not auto)
      • Define policy to keep backups longer
      • Configure weekly, monthly, yearly backups and keep up to 10 years
  • Azure SQL restore
    • Restore using PITR or LTR
      • For a PITR restore, the service identifies which backups to use
      • For LTR, database can be restored in same or different region
    • Restore deleted database
    • Restore creates a new database
      • Can be used to update or replace the existing database
  • In the Azure SQL Server portal
    • Data management --> Backups to view restore point details and retention policies
  • Dedicated SQL backup and restore
    • Local backup
      • Dedicated SQL automatically creates snapshots used as restore points
      • Up to 42 user-defined restore points can be created
      • Restore points are retained for 7 days
    • Geo backup
      • Created every 24 hours and stored in a different region
      • Only latest backup is retained
    • Restore database in any region using restore points
      • Restore creates a new database that can replace the existing one

Configure exception handling

  • For a single activity
    • Try/catch block
      • When one activity fails, a second activity runs that performs an action based on the failure
    • Try/catch/proceed block
      • The last (proceed) activity runs whether the first activity succeeds or fails (even if the catch activity fails), because it is also wired to a skip dependency
    • If/else block
      • One path for success, different path for failure
      • The pipeline succeeds if the first activity succeeds and fails otherwise
    • If/skip/else block
      • The pipeline succeeds whether the first activity succeeds or fails, because a failure routes execution to the alternate activities via a skip dependency
  • For multiple activities
    • Sequential run
      • Activities are sequential
      • One or more activities are configured to run on failure or skip of previous activity
      • Pipeline continues regardless of upstream failure
    • Parallel run
      • Some activities are parallel
      • Downstream activity depends on success of all parallel activities
      • Further downstream activity can be configured to run after skip so pipeline continues even if parallel activities fail

Read from and write to a delta lake

  • Data lake challenges
    • Data reliability issues
      • Corruption because of failures (no rollback)
      • No data validation
      • Consistency issues while reading data
    • No updates/deletes/merges on files
      • Difficult to implement GDPR/CCPA compliance
    • Data quality issues
      • Schema isn't verified before writing
      • Cannot apply checks on data
    • Query performance issues
    • Difficult to maintain historical versions of data
  • Delta Lake
    • Open-source storage layer that brings reliability to data lakes
    • Can be installed on-prem
    • Available by default on many cloud platforms
    • Provides database-like features on top of data lake
      • Create constraints, enforce schema, run DML statements, etc
    • Provides ACID guarantees
    • Works by storing a transaction log of all operations performed on the data (e.g., writes via dataframe.write.format("delta"))
      • A log entry is committed only after the write completes successfully and is never created for a failed write, which helps provide the ACID guarantees
  • Delta Lake availability
    • Can be downloaded and installed
      • On local machine
      • On-prem Spark cluster
      • Cloud platforms like Azure HDInsight
    • Available by default in cloud platforms
      • Azure Databricks
      • Azure Synapse Spark pools
  • In the portal (Databricks)
    • Use spark.conf.set to connect to storage
    • Use dbutils.fs.ls to list files in storage path
    • Define input and output folder paths, then read from the input path (e.g., spark.read.option(...).csv(inputPath))
    • To write to Delta Lake
      • Write in Delta format with output path DF.write.format("delta").save(outputPath + "filename.delta")
      • Check output location in storage to confirm write
      • Check the _delta_log folder to see metadata about the write
    • To read from Delta Lake
      • Use Spark SQL to create a database
      • Create a table in the database using CREATE TABLE table_name USING DELTA LOCATION 'delta_file_path/filename.delta'
      • DESCRIBE HISTORY table_name can be used to audit the history of the Delta table
      • Read different versions of data using SELECT * FROM table_name VERSION AS OF [version number], or SELECT * FROM table_name TIMESTAMP AS OF '[timestamp]'
    • Can restore previous versions with RESTORE TABLE table_name TO VERSION AS OF [version number] (see the sketch below)
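
A hedged Spark SQL sketch of those read, time-travel, and restore steps (runnable in a Databricks or Synapse Spark notebook); the database, table, storage path, version number, and timestamp are placeholders:

```sql
-- Hypothetical database, table, path, version, and timestamp
CREATE DATABASE IF NOT EXISTS sales_db;

-- Register a table over an existing Delta folder
CREATE TABLE IF NOT EXISTS sales_db.sales
USING DELTA
LOCATION 'abfss://data@mydatalake.dfs.core.windows.net/output/sales.delta';

-- Audit the table's transaction history
DESCRIBE HISTORY sales_db.sales;

-- Time travel by version number or timestamp
SELECT * FROM sales_db.sales VERSION AS OF 1;
SELECT * FROM sales_db.sales TIMESTAMP AS OF '2024-01-15';

-- Roll back the table to an earlier version
RESTORE TABLE sales_db.sales TO VERSION AS OF 1;
```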
