DP-203 Study Guide - Optimize and troubleshoot data storage and data processing

Compact small files

What does it meant to compact small files?
- Combine a lot of small files into one file
- Improves speed of read queries
- Can be done from a Copy job in ADF/Synapse or incremental load
- Also available in a Delta Lake feature
Using a Copy job
- Source is the directory with all of the small files
- Select using a wildcard (/directory/*)
- Use the Copy behavior to merge the files
Using Delta Lake
- Use OPTIMIZE feature
- Done via query in Spark SQL:

OPTIMIZE delta.`/data/events`
OPTIMIZE delta.`abfss://container-name@storage-account-name.dfs.core.windows.net/path-to-data

Handle skew in data

Skew
- An uneven distribution of data
- Data skew can unbalance compute nodes, lowering performance
- Avoid by balancing parallel processing with correct table distribution (hash or round-robin)
Detect skew in distributed table (database consistency check)
- DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');
Resolve data skew
- Research
  - Monitor query impact
  - Weigh the cost of minimizing
- Solution
  - Re-create table with a new distribution column set
  - CREATE TABLE AS SELECT (CTAS)

Handle data spill

Data spill is when compute engine is unable to hold data in memory and writes ("spills") data to disk
Impact is expensive disk reads/writes and longer execution times
Occurs when
- Partition size is too big
- Compute resource size is small
- Data size during merges, unions, etc exceeds memory limit of the compute node
Identifying data spill
- Synapse SQL - TempDB runs out of space and throws error (monitor with DMVs)
- Spark - view task summary screen under spill column
Handling the spill
- Increase compute capacity
- Reduce partition size
- Remove skews in data

Optimize resource management

Optimize Synapse SQL Pools
- Pause when not in use
- Use the right compute unit (DWU) for workload
- Leverage Azure Functions to scale out workload
Optimize Spark
- Select autoscale option in cluster setup
- Select auto-terminate
- Use spot instances
- Right-size cluster nodes based on memory, CPU intensive, etc

Tune queries by using indexers

Types of indexes
- Clustered columnstore index
  - Default in SQL pool table
  - Use for tables > 100 million rows
  - Good performance and data compression
- Clustered index
  - Good for specific filter conditions
  - Use for tables between 100 and 100 million rows
- Heap index
  - Use for staging tables
Maintain by rebuilding indexes when seeing performance degradation in existing indexes
Indexes in Spark Pool
- Spark does not have an inbuilt index
- Uses Hyperspace (or Hyperscale) - ability to create indexes on datasets (CSV, JSON, parquet)
- Works via API
- Criteria
  - Contains filter on predicates
  - Contains a join that requires heavy shuffles

Tune queries by using cache

Troubleshoot a failed Spark job

Debug the issue within the environment and within the job
Environment
- Confirm the region the cluster is in is not down (status.azure.com)
- Use HDInsight Ambari Dashboard to view cluster health
- Are clusters using high CPU or memory?
Jobs
- Driver logs
- Task logs
- Executor logs