Develop batch processing solutions by using Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, and Azure Data Factory
- Services for each layer in a batch processing architecture
- Ingestion: Data Factory
- Storage: Blob Storage, ADLS Gen2, Cosmos DB
- Processing: Databricks, HDInsight, Data Flows
- Serving: Azure SQL, Dedicated SQL, Analysis Services
- Orchestration: Data Factory (or Synapse)
- Azure Synapse Analytics
- Group of multiple, well-integrated services
- Works across all layers of architecture
Use PolyBase to load data to a SQL pool
- Dedicated SQL Pool
- Overview
- Formerly known as Azure SQL Data Warehouse
- Available as standalone service and within Synapse
- Like a SQL Server Database
- Massively parallel processing (MPP) architecture
- Elastically scale compute and storage separately
- Pause or resume service to save cost
- Components
- Distributions
- Basic unit of storage
- Fixed 60 distributions
- Queries executed against each distribution in parallel
- Stored in Azure Storage
- Control node
- SQL Server endpoint
- Queries go to control node
- Only stores metadata
- Coordinates query execution with compute nodes
- Compute nodes
- Execute queries
- Max 60 compute nodes
- Distributions equally divided among compute nodes
- Data Movement Service (DMS)
- Coordinates movement of data between compute nodes
- For some queries (joins, GROUP BY), data needs to be co-located (see the distribution sketch below)
- Data Warehouse Units
- DWU = CPU + memory + I/O
- Represents computational power
- Can be increased or decreased to enable scaling
- Billed per hour (lower the DWU level to reduce costs)
- Features
- Most regular SQL features are supported
- DDL and DML statements and Dynamic SQL
- Dynamic management views
- Triggers and cross-database queries are not supported
- Constraints, identity columns, and relationships work differently than SQL Server
- Can be used in both the processing and serving layers
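To make distributions concrete, here is a minimal T-SQL sketch, assuming hypothetical tables (FactSales, DimCustomer) and a join key (CustomerId) that are not from these notes:

```sql
-- Hash-distributing a large fact table on the join key spreads its rows
-- across the 60 distributions and keeps joined rows co-located,
-- so DMS doesn't need to move data between compute nodes.
CREATE TABLE dbo.FactSales
(
    SaleId     INT NOT NULL,
    CustomerId INT NOT NULL,
    Amount     DECIMAL(10, 2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);

-- Small dimension tables are often replicated to every compute node instead
CREATE TABLE dbo.DimCustomer
(
    CustomerId INT NOT NULL,
    Name       NVARCHAR(100)
)
WITH (DISTRIBUTION = REPLICATE);
```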
- PolyBase
- Overview
- Read and write data in external storage using T-SQL
- Available in SQL Server and Synapse
- Supports delimited text, Parquet, and ORC files, including GZIP- and Snappy-compressed files
- Control node passes storage location to compute nodes, which read the data
- Components
- Database Scoped Credential = access storage account
- External Data Source = define the storage location
- External File Format = format of the file being read
- External Table = metadata of underlying file
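A minimal sketch of the four PolyBase components followed by a CTAS load; the credential, storage account, container, paths, and table names below are placeholders, and the account-key credential is just one of the supported options (managed identity is another):

```sql
-- A master key is required before creating a database scoped credential
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword123!>';

-- 1. Database scoped credential = access to the storage account
CREATE DATABASE SCOPED CREDENTIAL AdlsCredential
WITH IDENTITY = 'user',
     SECRET   = '<storage-account-key>';

-- 2. External data source = the storage location
CREATE EXTERNAL DATA SOURCE RawDataLake
WITH (
    TYPE       = HADOOP,
    LOCATION   = 'abfss://raw@mydatalake.dfs.core.windows.net',
    CREDENTIAL = AdlsCredential
);

-- 3. External file format = format of the files being read
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- 4. External table = metadata over the underlying files
CREATE EXTERNAL TABLE dbo.ExtSales
(
    SaleId INT,
    Amount DECIMAL(10, 2)
)
WITH (
    LOCATION    = '/sales/',
    DATA_SOURCE = RawDataLake,
    FILE_FORMAT = ParquetFormat
);

-- Load into the pool with CTAS; compute nodes read the files in parallel
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM dbo.ExtSales;
```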
Implement Azure Synapse Link and query the replicated data
- Azure Synapse Link
- Cloud-native implementation of HTAP
- Hybrid transactional and analytical processing
- Directly query data in operational stores, no ETL required
- Near real-time querying
- Supports Cosmos DB, Azure SQL, Dataverse
- Cosmos DB
- Fully managed NoSQL platform
- Supports the MongoDB, Table, Cassandra, and Gremlin APIs
- Global distribution - data can be replicated to multiple regions
- Elastic scalability
- Synapse Link for Cosmos DB
- Transactional store is synced to analytical store from which Synapse can read data
- No performance impact on the transactional store
- Analytical store auto-syncs within 2 minutes (max 5 minutes)
- Only accessible from Synapse
- Only charged for storage
- Supports change data capture and time travel
- In the portal
- In Cosmos DB account, see Azure Synapse Link under Integrations on the left-side panel
- Enable Synapse Link
- Create the container, setting Analytical Store to On
- To connect from Synapse, get the primary account key from Keys under Settings in the left-side panel
- In the Synapse workspace, go to Data and set up a linked service and data source for Cosmos DB
- Open a SQL script to query the data in Cosmos DB
- Create a credential with the primary key
- Use OPENROWSET to query
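A hedged sketch of the credential and OPENROWSET query from a Synapse serverless SQL pool; the credential name, Cosmos DB account, database, and container names are placeholders:

```sql
-- The secret is the Cosmos DB primary account key copied from the portal
CREATE CREDENTIAL [MyCosmosDbCredential]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET   = '<cosmos-db-primary-account-key>';

-- Query the analytical store directly, no ETL required
SELECT TOP 100 *
FROM OPENROWSET(
        PROVIDER          = 'CosmosDB',
        CONNECTION        = 'Account=my-cosmos-account;Database=SalesDb',
        OBJECT            = 'Orders',
        SERVER_CREDENTIAL = 'MyCosmosDbCredential'
     ) AS orders;
```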
Create data pipelines
- General steps
- Configure firewall to allow IP address and Azure Services to connect to data sources and sinks
- Create an ADF/Synapse instance
- Create a linked service to the source data
- Create a new dataset from the data in the linked service
- Create a Data Flow
- Select data source
- Choose transformation steps (join, group, conditional split, etc)
- Select sink
- Create a new Pipeline
- Choose a Copy activity and/or the Data Flow
Scale resources
- Types of scaling
- Vertical scaling (up/down) = add more resources to a machine to make it more powerful
- Horizontal scaling (in/out) = add more machines
- Scaling Azure SQL
- Supports both up and out depending on config
- During up/down, the following can be changed
- Service tier
- DTU model: basic, standard, and premium
- vCore model: general purpose, hyperscale, business critical
- Compute tier (vCore): provisioned or serverless
- Resources (CPU, RAM, storage, etc)
- Scaling up/down results in database restart
- To scale out, can only add up to 4 read-only replicas
- In the portal (Azure SQL database)
- Go to Compute + storage
- Select an option in Service tier
- Choose Compute tier
- Use sliders to select vCores, DTUs, Read scale-out, etc
- Scaling Dedicated SQL Pool
- Increase/decrease number of compute nodes and memory on each node
- Defined using DWUs
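Both can also be scaled with T-SQL; the database and pool names and target tiers below are illustrative, not from these notes:

```sql
-- Scale an Azure SQL database to a different service tier / objective
-- (results in a short restart when the change completes)
ALTER DATABASE [SalesDb] MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S3');

-- Scale a dedicated SQL pool by changing its DWU level
ALTER DATABASE [SalesPool] MODIFY (SERVICE_OBJECTIVE = 'DW300c');
```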
Create tests for data pipelines
- Testing pipelines is different from testing applications because we're testing data instead of code
- Automated testing involves automating the process of validating whether the pipeline produces the expected output
- Types of tests
- Unit tests
- Test individual units
- In data pipeline, run each activity individually and validate result
- Hard to do in ADF
- Programmatically enable one activity at a time and disable others
- Generate and use fake data to test edge cases
- Functional tests
- Have pipeline generate actual output and compare to expected output
- Run complete pipeline, not just individual activities
- Used to confirm that pipeline meets business requirements
- Performance and regression tests
- Regression tests ensure that change in one pipeline doesn't impact other pipelines
- Run multiple dependent pipelines together
- Performance test to ensure pipeline meets SLAs
- Data quality tests
- Verify that data meets quality standards
- Typically embedded as part of the pipeline
- Completeness
- Uniqueness
- Timeliness
- Accuracy
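As an example of data quality checks embedded in a pipeline, a few T-SQL probes are sketched below; the table and columns (StageOrders, OrderId, LoadDate) are hypothetical:

```sql
-- Completeness: the business key should never be NULL
SELECT COUNT(*) AS null_order_ids
FROM dbo.StageOrders
WHERE OrderId IS NULL;

-- Uniqueness: no duplicate business keys
SELECT OrderId, COUNT(*) AS duplicate_count
FROM dbo.StageOrders
GROUP BY OrderId
HAVING COUNT(*) > 1;

-- Timeliness: confirm the latest load is recent enough to meet the SLA
SELECT MAX(LoadDate) AS last_load_time
FROM dbo.StageOrders;
```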
Integrate Jupyter or Python notebooks into a data pipeline
- Notebooks are typically used for Spark apps and development
- Notebooks are supported natively in services like Databricks and Synapse
- Basic steps for Synapse
- Create Synapse Spark pool
- Create new notebook and define language
- Attach notebook to Spark pool
- Write code to read and process data
- Add parameters to notebook
- To invoke notebook in ADF
- Create linked service to Synapse (under compute, not storage)
- Make sure ADF has manage permissions for Synapse Spark and access to storage
- Create pipeline and add notebook activity
- Select notebook and parameters
- Run pipeline
Use Mapping Data Flows in Azure Synapse pipelines and Azure Data Factory pipelines
- Mapping Data Flows provides no-code ETL workflow
- Can apply transformations to source data
- Add/remove columns, rename, filter, join, aggregate
- Runs as Spark code under the hood
- Automatically adds optimizations
- Can add user-defined optimizations
- Executes on a Spark cluster
- Called Data Flow Debug
- Can define cluster configuration
- Pros and Cons
- Pros
- Faster development
- UI based drag-and-drop approach
- Fast and scalable processing
- Cons
- Less flexible since code can't be modified
- Can be complex for large workflows
Upsert data
- DML statements
- Select, insert, update, delete
- Upsert is a combination of update and insert: update the record if it exists, insert it if not
- Options to change data in Azure SQL
- Using T-SQL (DML statements, merge command)
- Data Factory/Synapse pipelines (copy, data flow with Alter Row)
- Can upsert on files in Data Lake using Delta Lake
- Options to perform upsert
- T-SQL "merge" command (see the sketch after this list)
- Specify source with "USING"
- Specify join condition
- "WHEN MATCHED" = behavior for existing records
- "WHEN NOT MATCHED BY TARGET" = behavior for records not in target
- "WHEN NOT MATCHED BY SOURCE" = behavior for records not in source
- Copy activity
- Change write behavior in sink to upsert and define key columns
- Data flows
- Use alter row transformation
- Define alter row conditions
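A minimal MERGE sketch against hypothetical Customers tables (a staging source and a dbo target); the column names are placeholders:

```sql
MERGE dbo.Customers AS target
USING staging.Customers AS source
    ON target.CustomerId = source.CustomerId        -- join condition
WHEN MATCHED THEN                                   -- record exists: update it
    UPDATE SET target.Name  = source.Name,
               target.Email = source.Email
WHEN NOT MATCHED BY TARGET THEN                     -- record missing in target: insert it
    INSERT (CustomerId, Name, Email)
    VALUES (source.CustomerId, source.Name, source.Email)
WHEN NOT MATCHED BY SOURCE THEN                     -- record missing in source: delete it
    DELETE;
```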
Revert data to a previous state in Azure storage
- Restorable entities
- Individual file (blob) - can revert to previous version or undelete
- Container - container and files can be reverted or undeleted
- Restoring files
- Use snapshot (read-only version of file from point in time)
- Created manually by user or application
- Used to restore back to prior version
- Enable versioning
- Enabled at storage account level
- Automatically creates a new version when the file is updated
- Select and restore a specific version
- Enable soft delete
- Enabled at storage account level
- Deleted files can be restored for a certain number of days
- Restoring containers
- Enable point-in-time restore
- Restores container to specific point in time
- Enabled at storage account level
- Versioning, change feed, and soft delete must also be enabled
- Enable soft delete
- Enabled at storage account level
- Deleted containers can be restored for a certain number of days
- In the storage account portal, these options are under Data management --> Data protection in the left-side panel
- File versions and snapshots can be viewed in blob properties by clicking on the file
Revert data to a previous state in Azure SQL and Dedicated SQL Pool
- Azure SQL backup
- Automatically creates backups based on SQL Server technology
- Full backups every week
- Differential backups every 12 to 24 hours
- Transaction log backups every 10 mins
- Backups are stored in Azure Storage
- Redundancy is configurable
- Point-in-time restore (auto)
- Auto-created backup
- Kept for limited days (1 to 35, default is 7)
- Long-term retention (not auto)
- Define policy to keep backups longer
- Configure weekly, monthly, yearly backups and keep up to 10 years
- Azure SQL restore
- Restore using PITR or LTR
- For a PITR restore, the service identifies which backups to use
- For LTR, database can be restored in same or different region
- Restore deleted database
- Restore creates a new database
- Use to update or replace existing database
- In the Azure SQL Server portal
- Data management --> Backups to view restore point details and retention policies
- Dedicated SQL backup and restore
- Local backup
- Dedicated SQL automatically creates snapshots used as restore points
- Up to 42 user-defined restore points can be created
- Restore points are retained for 7 days
- Geo backup
- Created every 24 hours and stored in a different region
- Only latest backup is retained
- Restore database in any region using restore points
- Restore creates a new database that updates or replaces existing one
Configure exception handling
- For a single activity
- Try/catch block
- When the first activity fails, a second (catch) activity runs that performs an action based on the failure
- Try/catch/proceed block
- The last (proceed) activity runs whether the first activity succeeds or fails: if the first activity succeeds, the catch activity is skipped and the proceed activity still runs via the skip path
- If/else block
- One path for success, different path for failure
- Pipeline succeeds if the first activity succeeds; otherwise the pipeline fails
- If/skip/else block
- Pipeline succeeds whether the first activity succeeds or fails, because a failure causes the other activities to be skipped instead of failing the pipeline
- For multiple activities
- Sequential run
- Activities are sequential
- One or more activities are configured to run on failure or skip of previous activity
- Pipeline continues regardless of upstream failure
- Parallel run
- Some activities are parallel
- Downstream activity depends on success of all parallel activities
- Further downstream activity can be configured to run after skip so pipeline continues even if parallel activities fail
Read from and write to a delta lake
- Data lake challenges
- Data reliability issues
- Corruption because of failures (no rollback)
- No data validation
- Consistency issues while reading data
- No updates/deletes/merges on files
- Difficult to implement GDPR/CCPA compliance
- Data quality issues
- Schema isn't verified before writing
- Cannot apply checks on data
- Query performance issues
- Difficult to maintain historical versions of data
- Delta Lake
- Open-source storage layer that brings reliability to data lakes
- Can be installed on-prem
- Available by default on many cloud platforms
- Provides database-like features on top of data lake
- Create constraints, enforce schema, run DML statements, etc
- Provides ACID guarantees
- Works by storing a transaction log of all operations performed on the data (e.g. writes via dataframe.write.format("delta"))
- The log entry is only created after the write completes and is not created if the write fails, which is what provides the ACID guarantees
- Delta Lake availability
- Can be downloaded and installed
- On local machine
- On-prem Spark cluster
- Cloud platforms like Azure HDInsight
- Available by default in cloud platforms
- Azure Databricks
- Azure Synapse Spark pools
- In the portal (Databricks)
- Use spark.conf.set to connect to storage
- Use dbutils.fs.ls to list files in storage path
- Define input and output folder paths, then read from the input path (spark.read.option(...).csv(inputPath))
- To write to Delta Lake
- Write in Delta format with output path DF.write.format("delta").save(outputPath + "filename.delta")
- Check output location in storage to confirm write
- Check the _delta_log folder to see metadata about the write
- To read from Delta Lake
- Use Spark SQL to create a database
- Create a table in the database using CREATE TABLE table_name USING DELTA LOCATION "delta_file_path/filename.delta"
- DESCRIBE HISTORY table_name can be used to audit the history of the Delta table
- Read different versions of the data using SELECT * FROM table_name VERSION AS OF [version number], or SELECT * FROM table_name TIMESTAMP AS OF '[timestamp]'
- Can restore previous versions with RESTORE TABLE table_name TO VERSION AS OF [version number]
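The read and time-travel steps above can be expressed in Spark SQL roughly as follows; the database, table, storage path, version number, and timestamp are placeholders:

```sql
-- Expose the Delta output as a table over the files written earlier
CREATE DATABASE IF NOT EXISTS sales_db;

CREATE TABLE IF NOT EXISTS sales_db.orders
USING DELTA
LOCATION 'abfss://data@mydatalake.dfs.core.windows.net/output/orders.delta';

-- Audit the transaction log of the Delta table
DESCRIBE HISTORY sales_db.orders;

-- Time travel: read earlier versions by version number or timestamp
SELECT * FROM sales_db.orders VERSION AS OF 2;
SELECT * FROM sales_db.orders TIMESTAMP AS OF '2024-01-15T00:00:00';

-- Roll the table back to an earlier version
RESTORE TABLE sales_db.orders TO VERSION AS OF 2;
```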