Create and execute queries by using a compute solution that leverages SQL serverless and Spark cluster
-
Azure SQL Serverless
- Not SQL database - it is SQL compute in Azure Synapse Analytics
- Serverless SQL pool
- Built in to Synapse
- Always available
- Billed based on usage
- Data access
- No data storage
- Data accessed through ADL
- OPENROWSET syntax to access data
- Provisioned resources
- Dedicated SQL pool
- Static number of servers
- User chooses runtime
- Defined cost per data warehouse unit (DWU)
- Data is stored in relational tables using columnar storage
- Used mainly for EDA
- In the portal
- Develop section on left-side panel
- Click + button and add SQL script
- Select tables from lake DB or SQL DB
- Choose SQL pool settings
- SELECT * FROM OPENROWSET( BULK '', FORMAT = 'parquet') AS [result]
- SELECT * FROM OPENROWSET( BULK '', FORMAT = 'CSV', Parser_Version = '2.0') AS [result]
- Can also go to Data in left-side panel and link storage account and containers - this can be used to auto-generate basic SELECT queries
- Spark clusters
- Apache Spark in Synapse
- In-memory cluster computing
- Synapse offers ease of use and creation
- Data access is interacting with Spark pools through notebooks (similar to Databricks)
- Databases and tables created in a Spark pool are replicated in a serverless SQL pool as read-only
- In the portal
- Under Develop in left-side panel
- Click + button and select or create a notebook
- Under Manage on left-side panel, create and run an Apache Spark pool
- Be sure to enable automatic pausing, Spark pools are expensive
Recommend and implement Azure Synapse Analytics database templates
-
Database Templates
- Speed up design process
- Create more thorough databases
-
Lake database in Synapse
- Data lakes lack structure
- Databases can be too structured
- Lake database removes these downsides
- Provides structured DB with meta info, stored in a data lake (parquet, delta, CSV formats)
- Powered by serverless Synapse compute
- In the portal
- Architecture process
- Access Synapse Studio instance
- Create a Lake Database
- Add a Table
- Add Template
- Select relevant features
- Architecture process
Push new or updated data lineage to Microsoft Purview
-
Microsoft Purview
- Unified data governance
- On-prem, multi-cloud, SaaS
- 4 pillars
- Data quality
- Data stewardship
- Data protection and compliance
- Data management
-
Data lifecycle management
- Data catalog - organized inventory of data assets
- Data estate insights - infrastructure helps organizations manage data
- Data sharing - internally or across orgs
- Data policy - provision access to data at scale
-
Primary use cases for Purview
- Pull data from SQL DB and ADL and provide governance across the org
- Financial services can show where critical data is stored to evaluate security risk
- Large, diverse orgs can enable data democratization
- Data lineage
- Track data flow over time
- Origination --> Delta (data changes) --> Sink (output)
- Provides confidence in data
- Facilitates governance and impact analysis
- In the portal
- Lineage tab shows a flow chart with sources, processes, and targets
Browse and search metadata in Microsoft Purview Data Catalog
- In the portal
- Open the Microsoft Purview Governance Portal
- Data Catalog --> Browse --> By collection or source type
- Go to Data map in left-side panel to register data sources
- Data map --> Data sources --> Register
- Need to do a new scan to establish lineage
- Requires access control to be configured to allow Purview to scan the data sources
Top comments (0)