DP-203 Study Guide - Design and implement the data exploration layer

Study guide

Create and execute queries by using a compute solution that leverages SQL serverless and Spark cluster

Azure SQL Serverless
- Not SQL database - it is SQL compute in Azure Synapse Analytics
- Serverless SQL pool
  - Built in to Synapse
  - Always available
  - Billed based on usage
- Data access
  - No data storage
  - Data accessed through ADL
  - OPENROWSET syntax to access data
- Provisioned resources
  - Dedicated SQL pool
  - Static number of servers
  - User chooses runtime
  - Defined cost per data warehouse unit (DWU)
  - Data is stored in relational tables using columnar storage
- Used mainly for EDA
- In the portal
  - Develop section on left-side panel
  - Click + button and add SQL script
  - Select tables from lake DB or SQL DB
  - Choose SQL pool settings
  - SELECT * FROM OPENROWSET( BULK '', FORMAT = 'parquet') AS [result]
  - SELECT * FROM OPENROWSET( BULK '', FORMAT = 'CSV', Parser_Version = '2.0') AS [result]
  - Can also go to Data in left-side panel and link storage account and containers - this can be used to auto-generate basic SELECT queries
Spark clusters
- Apache Spark in Synapse
- In-memory cluster computing
- Synapse offers ease of use and creation
- Data access is interacting with Spark pools through notebooks (similar to Databricks)
- Databases and tables created in a Spark pool are replicated in a serverless SQL pool as read-only
- In the portal
  - Under Develop in left-side panel
  - Click + button and select or create a notebook
  - Under Manage on left-side panel, create and run an Apache Spark pool
  - Be sure to enable automatic pausing, Spark pools are expensive

Recommend and implement Azure Synapse Analytics database templates

Database Templates
- Speed up design process
- Create more thorough databases
Lake database in Synapse
- Data lakes lack structure
- Databases can be too structured
- Lake database removes these downsides
- Provides structured DB with meta info, stored in a data lake (parquet, delta, CSV formats)
- Powered by serverless Synapse compute
In the portal
- Architecture process
  - Access Synapse Studio instance
  - Create a Lake Database
  - Add a Table
  - Add Template
  - Select relevant features

Push new or updated data lineage to Microsoft Purview

Microsoft Purview
- Unified data governance
- On-prem, multi-cloud, SaaS
- 4 pillars
  - Data quality
  - Data stewardship
  - Data protection and compliance
  - Data management
- Data lifecycle management
  - Data catalog - organized inventory of data assets
  - Data estate insights - infrastructure helps organizations manage data
  - Data sharing - internally or across orgs
  - Data policy - provision access to data at scale
- Primary use cases for Purview
  - Pull data from SQL DB and ADL and provide governance across the org
  - Financial services can show where critical data is stored to evaluate security risk
  - Large, diverse orgs can enable data democratization
Data lineage
- Track data flow over time
- Origination --> Delta (data changes) --> Sink (output)
- Provides confidence in data
- Facilitates governance and impact analysis
- In the portal
  - Lineage tab shows a flow chart with sources, processes, and targets

Browse and search metadata in Microsoft Purview Data Catalog

In the portal
- Open the Microsoft Purview Governance Portal
- Data Catalog --> Browse --> By collection or source type
- Go to Data map in left-side panel to register data sources
  - Data map --> Data sources --> Register
  - Need to do a new scan to establish lineage
  - Requires access control to be configured to allow Purview to scan the data sources