
Alec Dutcher

DP-203 Study Guide - Ingest and transform data

Study guide

Design and implement incremental loads

  • Watermarks
    • Column in source table with last updated time stamp or incrementing key
    • Marks the most recent update in the table
  • Delta loading
    • Essentially the same as incremental loading
    • Only new or changed data is processed, whether loading, transforming, etc
  • 4 basic design options
    • Delta loading using a watermark
    • Delta loading from SQL DB using change tracking technology
    • Loading new and changed files only using LastModifiedDate
    • Loading new files only using a partitioned folder or file name
  • Considerations
    • Volume and type of data
    • Load on system
  • Steps
    • Query to get old watermark
    • Query to get new watermark
    • Load data between watermarks
    • Update watermark
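
A minimal T-SQL sketch of these steps, assuming a hypothetical watermark table (dbo.WatermarkTable) and a source table (dbo.SalesOrders) with a LastModifiedTime column; in ADF this pattern is typically built from Lookup, Copy, and Stored Procedure activities:

```sql
-- 1. Get the old watermark (hypothetical watermark table)
DECLARE @OldWatermark datetime2 =
    (SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'SalesOrders');

-- 2. Get the new watermark from the source table
DECLARE @NewWatermark datetime2 =
    (SELECT MAX(LastModifiedTime) FROM dbo.SalesOrders);

-- 3. Load only the rows that changed between the two watermarks
SELECT *
FROM dbo.SalesOrders
WHERE LastModifiedTime >  @OldWatermark
  AND LastModifiedTime <= @NewWatermark;

-- 4. Update the stored watermark for the next run
UPDATE dbo.WatermarkTable
SET WatermarkValue = @NewWatermark
WHERE TableName = 'SalesOrders';
```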

Transform data by using Apache Spark

  • Apache Spark
    • Can be used in Synapse, Databricks, and Data Factory
    • Ecosystem
      • Apache Spark Core
        • Basic functionalities (task scheduling, memory management)
        • Can be abstracted through APIs
        • Can be done in R, Python, Scala, and Java
      • Spark SQL - similar to standard SQL but allows queries on data in Spark
      • Spark Streaming
      • MLlib
      • GraphX
  • More about Spark architecture
    • Spark core: RDDs and languages
    • Spark SQL engine: Catalyst optimizer, Tungsten (memory/CPU mgmt)
    • DataFrame/Dataset APIs
    • Spark Graph, Spark ML, Spark Streaming, Spark SQL
  • Azure Synapse notebooks in the portal
    • Develop on the left-side panel
    • Click +, then Notebook
    • Must have Spark pool attached before running a notebook
      • Go to Manage in left-side panel
      • Analytics pools --> Apache Spark pools --> choose name and settings --> Review and create
    • Write and execute code in cells like a typical notebook
    • Click + --> Browse gallery --> Notebooks to see example notebooks
  • For the exam, know the basics of Synapse notebooks; Synapse architecture questions are more likely to focus on keywords than on fine detail
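
For illustration only, a Spark SQL cell in a Synapse notebook (using the %%sql cell magic) can run a transformation against a table or view registered in the Spark pool; the table and column names below are made up:

```sql
%%sql
-- Aggregate a hypothetical sales table by category
SELECT category,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_amount
FROM   sales_raw
GROUP BY category
ORDER BY total_amount DESC
```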

Transform data by using Transact-SQL (T-SQL) in Azure Synapse Analytics

  • Transact-SQL
    • For querying data in a data lake
    • Uses serverless SQL pools
    • Query data without loading it into database storage
    • Standard formats are CSV, JSON, and Parquet
    • Useful for OLAP
  • In the portal
    • Develop --> New --> SQL Script
    • [FROM] OPENROWSET
      • Use instead of defining a table
      • Mimics the properties of a table, but uses data lake object as a source
      • Choose file URL, format, and parser version if CSV
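
A small example of OPENROWSET against Parquet files from a serverless SQL pool; the storage account, container, and path are placeholders (for CSV you would use FORMAT = 'CSV' plus a PARSER_VERSION instead):

```sql
-- Query Parquet files in the data lake without loading them into a database
SELECT TOP 100 *
FROM OPENROWSET(
         BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/*.parquet',
         FORMAT = 'PARQUET'
     ) AS [result];
```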

Ingest and transform data by using Azure Synapse Pipelines or Azure Data Factory

  • Common data ingestion pipelines
    • Azure Functions
      • Low latency
      • Serverless compute
      • Short run processing (only designed to run for short periods of time)
    • Custom component
      • Low-scale parallel computing
      • Heavy algorithms
      • Requires wrapping code into an executable (more complex)
    • Azure Databricks
      • Apache Spark, designed for massive and complex data transformations
      • Expensive and complicated
    • Azure Data Factory
      • Suitable for light transformation
      • Can include above methods as activities
  • Copy performance
    • Performance chart
      • Shows how long a copy will take based on amount of data and bandwidth
      • Can help with assessing costs of running pipelines
  • In the portal
    • Most work is done in Author section in left-side panel
    • Under Factory Resources there are pipelines, datasets, etc
    • Linked services are not shown as a separate node; they are lumped in with datasets
    • Under Datasets, click + to add Dataset
      • Choose Service
      • Name Dataset and select Linked service
      • If you choose New service, input connection details, including subscription, server, database, authentication, etc
      • Select dataset from the linked service (table name, file, etc)
    • Under Datasets you can view and preview the dataset
    • Click + to add a new pipeline
      • Select an activity, i.e. Copy data
      • In the activity settings at the bottom, choose source, sink, copy behavior, and other settings
    • Dataflows allow you to set up transformations within ADF
      • These dataflows can be included as activities in the pipeline
  • Differences between ADF and Synapse
    • ADF has
      • Power Query data wrangling (appears under Author in ADF but not under Develop in Synapse)
    • Synapse has
      • Monitoring Spark jobs
    • Both have
      • Solution templates (ADF template gallery, Synapse knowledge center)
      • Git integration
  • ADF/Synapse portal differences
    • ADF has
      • Home
      • Author - pipelines, datasets, data flows, Power Query, and templates
      • Monitor - dashboards for pipeline/trigger runs, integration runtimes, data flow debug, alerts/metrics
      • Manage
      • Learning center
    • Synapse has
      • Home
      • Data - SQL/Lake database, external datasets, and integration datasets
      • Develop - SQL scripts, notebooks, data flows
      • Integrate - pipelines, Synapse Link connections
      • Monitor - pools, requests, Spark, pipeline/trigger runs, integration runtimes, Link connections
      • Manage

Transform data by using Azure Stream Analytics

  • Azure Stream Analytics
    • Only for streaming solutions, not batch
    • Inputs can be Blob Storage, Event Hubs, or IoT Hub
    • These feed the query layer where transformations happen
    • The query outputs to sinks such as Blob Storage or Power BI
  • Queries
    • SELECT * INTO output FROM input
    • Choose specific columns, where clauses, aggregations, etc
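
As a sketch, a windowed aggregation in the Stream Analytics query language; the input/output aliases and column names are hypothetical:

```sql
-- Average temperature per device over 5-minute tumbling windows (hypothetical names)
SELECT
    DeviceId,
    AVG(Temperature)   AS AvgTemperature,
    System.Timestamp() AS WindowEnd
INTO [powerbi-output]
FROM [iothub-input] TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(minute, 5)
```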

Cleanse data

  • Process overview
    • Investigate the data
    • Perform cleaning steps (unique to data set)
    • Evaluate the results
      • Validity (does it match business rules?)
      • Accuracy
      • Completeness
      • Consistency (is there conflicting data?)
      • Uniformity (are data points using same units of measure?)
  • Common tools
    • ADF, Synapse (almost identical for this purpose)
    • Azure Stream Analytics (cleaning streaming data can be harder)
    • Databricks (more complicated, but versatile and useful for massive data)
  • In the portal (ADF)
    • Create a Data Flow, choose sources
    • Preview data to see which fields can be used to join the sources
    • Consider how columns can be filtered or removed to provide value or remove extraneous data
    • Once cleansing is done, choose sink
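
The portal steps above use a data flow, but the same kind of cleansing can be sketched in T-SQL (hypothetical staging table and columns): trim text, cast types, and drop rows that fail validation:

```sql
-- Hypothetical cleansing pass over a staging table
SELECT
    TRIM(CustomerName)                   AS CustomerName,
    TRY_CAST(OrderDate AS date)          AS OrderDate,
    TRY_CAST(Amount AS decimal(10, 2))   AS Amount
FROM staging.RawOrders
WHERE TRY_CAST(OrderDate AS date) IS NOT NULL;  -- discard rows with invalid dates
```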

Handle duplicate data

  • Dedupe = eliminate unnecessary copies
    • Consider technology knowledge
    • Consider complexity
    • Consider accompanying solutions (SQL queries, ADF data flows, Spark, etc)
  • Basic steps (in ADF)
    • Create data flow
    • Choose source
    • Choose script snippet (scroll symbol in top right of editor, snippets can be found on Microsoft Learn)
    • Choose destination
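
Since SQL queries are listed above as one accompanying solution, here is a minimal T-SQL dedup sketch using ROW_NUMBER (table and key names are hypothetical):

```sql
-- Keep only the most recently updated row for each business key
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY CustomerId
               ORDER BY LastUpdated DESC
           ) AS rn
    FROM staging.Customers
)
SELECT *
FROM ranked
WHERE rn = 1;
```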

Avoiding duplicate data by using Azure Stream Analytics Exactly Once Delivery

Handle missing data

  • Determine impact of missing data, sometimes it won't be a big deal
  • Options of handling missing data
    • Drop rows that have the missing data
    • Imputation = assign an inferred value to the missing element
    • Include the rows that are missing data
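
A hedged T-SQL illustration of the first two options against a hypothetical table: dropping rows versus imputing a value:

```sql
-- Option 1: drop rows where the value is missing
SELECT *
FROM dbo.Sales
WHERE Quantity IS NOT NULL;

-- Option 2: impute the column average for missing values
SELECT
    OrderId,
    COALESCE(Quantity, (SELECT AVG(Quantity) FROM dbo.Sales)) AS Quantity
FROM dbo.Sales;
```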

Handle late-arriving data

  • Definitions
    • Event time = when original event happened (order is given to waiter)
    • Processing time = when event is observed (waiter gives order to kitchen)
    • Watermark = stamp identifying the point up to which events have been ingressed into the system
  • Handle late arriving data by choosing a level of tolerance
  • Consequences of tolerance
    • Tolerance = window considered acceptable for late arrival
    • Critical events can be missed without proper tolerance
    • Delayed outputs can result in broken processes or bad reports

Split data

  • Splitting data allows making paths to multiple sinks from the same source
  • Conditional splits
    • Route data to different outputs
    • Available in ADF and Synapse
    • Steps
      • Create data flow
      • Use conditional split transformation
      • Set split conditions
    • Data flow scripts
      • Can use scripts to do the steps above
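
The data flow script syntax is not reproduced here; as a rough T-SQL analogy of what a conditional split does (hypothetical tables with matching schemas), each condition routes rows to a different sink:

```sql
-- Route US orders to one table and everything else to another
INSERT INTO curated.Orders_US
SELECT * FROM staging.Orders
WHERE Region = 'US';

INSERT INTO curated.Orders_Other
SELECT * FROM staging.Orders
WHERE Region <> 'US' OR Region IS NULL;
```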

Shred JSON

  • Shredding JSON = extracting data from a JSON file and transferring to a table (aka parsing)
  • Done in Synapse or ADF
  • Once data is extracted it is persisted to a data store
  • OPENJSON function
    • Table-valued function that parses JSON text
    • Returns objects and properties as rows and columns
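
A short OPENJSON example with an explicit schema; the JSON payload is made up for illustration:

```sql
DECLARE @json nvarchar(max) = N'[
    {"orderId": 1, "customer": {"name": "Contoso"},  "amount": 125.50},
    {"orderId": 2, "customer": {"name": "Fabrikam"}, "amount": 89.99}
]';

-- Shred the JSON array into rows and columns
SELECT *
FROM OPENJSON(@json)
WITH (
    OrderId      int           '$.orderId',
    CustomerName nvarchar(100) '$.customer.name',
    Amount       decimal(10,2) '$.amount'
);
```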

Encode and decode data

  • UTF-8
    • Unicode Transformation Format, 8-bit
    • The ASCII problem
      • Assigns a numeric code to every character, but a single byte allows only 256 possibilities
      • As computing expanded to more languages and symbols, the available codes ran out
      • UTF-8 provides more character possibilities
  • Program must understand UTF-8 codes in order to decode information
  • There are multiple encoding formats, so the source and sink must use the same encoding
  • Done in ADF and Synapse copy activities
  • In the portal
    • Can choose encoding and compression properties in the Dataset properties

Configure error handling for a transformation

  • Options for error handling
    • Transaction commit: choose whether to write data in a single transaction or in batches
    • Output rejected data: log error rows in a CSV in Azure Storage
    • Success on error: mark the activity as successful even if errors occur
  • In the portal (ADF)
    • In an activity's settings, fault tolerance represents a form of success on error, continuing past incompatible data
    • Enable logging to store files that show rejected rows
    • Enable staging allows for copying in batches
    • On the right side of the activity there are buttons for "on success," "on failure," etc.
      • Connect these to other activities to choose how pipeline errors are handled
    • In a data flow database sink, there is an Errors tab to configure error handling

Normalize and denormalize data

  • Normalizing data = reorganizing data to reduce redundancy and improve structure
  • Denormalizing data = adding redundant data to one or more tables
  • What and why
    • Normalizing
      • More tables
      • Requires multiple joins
      • Improves accuracy and integrity
    • Denormalizing
      • More space
      • More difficult to maintain
      • Improves query performance
  • Star schema is not normalized
  • Snowflake schema is normalized
  • In the portal (Synapse)
    • Inspect the data sources to determine normalization status and identify join columns
    • Use a join transformation to combine data sources for denormalization
    • Use a conditional split or a select transformation to normalize
    • Transformations can also be done in the script editor
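
As a sketch, denormalization in SQL is essentially a join that folds dimension attributes into the fact rows (hypothetical star-schema tables):

```sql
-- Denormalize: flatten a fact table and its product dimension into one result
SELECT
    f.OrderId,
    f.OrderDate,
    f.Amount,
    p.ProductName,
    p.Category
FROM fact.Sales  AS f
JOIN dim.Product AS p
    ON p.ProductKey = f.ProductKey;
```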

Perform data exploratory analysis

  • Use summary statistics and visualizations to investigate patterns and anomalies in data
  • Can be done with SQL, Python, or Kusto queries in Azure Data Explorer
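
For example, a quick T-SQL profile of a hypothetical table with summary statistics:

```sql
-- Basic profile: row count, date range, average amount, and missing values
SELECT
    COUNT(*)                                        AS row_count,
    MIN(OrderDate)                                  AS earliest_order,
    MAX(OrderDate)                                  AS latest_order,
    AVG(Amount)                                     AS avg_amount,
    SUM(CASE WHEN Amount IS NULL THEN 1 ELSE 0 END) AS null_amounts
FROM dbo.Sales;
```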
