- Horizontal (sharding) = each partition is a separate data store, but all partitions have the same schema (partitions have different rows)
- Vertical = each partition holds a subset of the fields according to patterns of use (partitions have different columns)
- Functional = data is aggregated according to how it is used by each bounded context (e.g., invoice data vs. product data)
Implement a partition strategy for files
- Think through the problem, whiteboard it out
- Parquet
- Most common file type for big data
- Column-based storage with nested data structures
- Supports parallel query processing
- Row-group sections can be treated as partitions - multiple row-groups can be sent to different nodes
- Break partitions apart based on column values, e.g. partitioning by a date column that queries filter on
- Best practices
- Make sure to include partition columns in the table's schema definition (see the sketch after this list)
- Group related records together
- Don't partition on more columns than necessary
- Aim for a partition size of 512 MB to 1 GB
- Consider the query and how the data will be used
- Consider the expected growth of the data
- Consider how static the data is
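As a concrete illustration of the practices above, here is a minimal Spark SQL sketch of a date-partitioned Parquet table; the table and column names (`sales`, `sale_date`, and so on) are hypothetical:

```sql
-- Minimal Spark SQL sketch (hypothetical names): a Parquet table
-- partitioned by a date column that queries are expected to filter on.
CREATE TABLE sales (
    order_id  BIGINT,
    customer  STRING,
    amount    DECIMAL(10, 2),
    sale_date DATE              -- partition column is part of the schema
)
USING PARQUET
PARTITIONED BY (sale_date);

-- Filtering on the partition column lets the engine prune partitions:
SELECT SUM(amount) FROM sales WHERE sale_date = DATE'2024-01-15';
```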
Implement a partition strategy for analytical workloads
- Distribution types
- Round-robin (sketched below)
- Characteristics
- Distributed evenly in a random fashion
- Even distribution across all distributions
- Assignment is random
- Fast performance for loads as row assignment can be done quickly
- Slower performance for reads as higher potential for data movement
- Best for:
- No clear distribution key
- No frequent joins
- Uniform distribution is desired
- Temporary staging table
- Simple starting point
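A minimal dedicated SQL pool T-SQL sketch of a round-robin staging table (table and column names are hypothetical):

```sql
-- Hypothetical staging table: ROUND_ROBIN spreads rows evenly
-- across all distributions with no distribution key.
CREATE TABLE dbo.StageSales
(
    OrderId BIGINT NOT NULL,
    Amount  DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP    -- heaps are a common choice for fast staging loads
);
```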
- Hash (sketched below)
- Characteristics
- Distributed deterministically using hash function on a column
- Distribution column can’t be changed later
- Choose a column with many unique values, few or no nulls, and that is not a date column
- Best for
- Large tables (> 2 GB)
- Frequent inserts, updates, and deletes
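A corresponding sketch of a hash-distributed fact table (names hypothetical); the distribution column follows the guidance above:

```sql
-- Hypothetical fact table: rows with the same CustomerKey always land
-- on the same distribution, minimizing data movement during joins.
CREATE TABLE dbo.FactSales
(
    OrderId     BIGINT NOT NULL,
    CustomerKey INT    NOT NULL,  -- many unique values, no nulls, not a date
    Amount      DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);
```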
- Replicated (sketched below)
- Characteristics
- Full copy of table is replicated to every compute node
- Requires extra storage and overhead for writes
- Normally used in conjunction with other methods
- Best for:
- Small lookup or dimension tables joined with larger tables
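And a replicated dimension table sketch (names hypothetical):

```sql
-- Hypothetical small dimension table: a full copy is cached on every
-- compute node, so joins against it never require data movement.
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT NOT NULL,
    ProductName NVARCHAR(100)
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);
```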
Implement a partition strategy for streaming workloads
- Azure Stream Analytics
- Fully managed stream processing engine
- Input layer: data from Blob Storage, Event Hubs, or IoT Hub is ingested into ASA
- Query layer: ASA performs query
- Output layer: Results sent to Blob storage for downstream use
- How transformation works in a stream
- Data in stream is diverted to perform query
- Query transformation results are re-introduced to stream for output
- Transformation is done in near real time
- Partitioning
- Embarrassingly parallel job: equal numbers of input and output partitions, with one instance of the query running per partition
- Must align partition keys between inputs, query logic, and outputs
- Jobs that aren't embarrassingly parallel can still be completed, but not as efficiently
- Often involves windowed queries
- In the Azure Portal
- Query in the left-side options
- Inputs - define query, can test and see results
- Outputs - define and test output query
- Here you can define the partition key with the PARTITION BY clause (compatibility level 1.1 and below; in 1.2, define the partition key on the input instead), as sketched below
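A minimal Stream Analytics query sketch using PARTITION BY (the input, output, and field names are hypothetical):

```sql
-- Hypothetical ASA job (compatibility level 1.1): scoping the query to
-- each input partition keeps the job embarrassingly parallel.
SELECT
    DeviceId,
    AVG(Temperature) AS AvgTemperature
INTO
    [blob-output]
FROM
    [eventhub-input] TIMESTAMP BY EventTime
PARTITION BY PartitionId
GROUP BY DeviceId, PartitionId, TumblingWindow(minute, 5)
```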
Implement a partition strategy for Azure Synapse Analytics
- Table partitions
- Supported on all dedicated SQL pool table types (clustered columnstore, clustered index, heap)
- Supported on all distribution types (hash, round-robin, replicated)
- Why partition
- Query performance
- Load performance - Smaller amounts of data make incremental loading, updating, and deleting faster and easier
- Clustered columnstore indexes
- Standard for storing and querying large data warehouse fact tables
- Rows are organized into row groups of up to 1,048,576 rows each
- Within each row group, values are stored per column as column segments
- The columnstore index is built from these column segments; segment data is compressed
- Deltastore: holds leftover rows (fewer than 1,048,576) in row form until there are enough to compress into a new row group
- Law of 60
- A distribution is a basic unit of storage and processing
- Synapse spreads each table across 60 distributions; a query runs as 60 smaller queries in parallel, one per distribution
- This turns 10 table partitions into 60 × 10 = 600 physical partitions
- Each physical partition needs ~1 million rows for optimal columnstore compression, so such a table should hold roughly 600 million rows before being partitioned this finely
- In the Azure Synapse Analytics portal
- In the CREATE TABLE statement, use a WITH clause specifying CLUSTERED COLUMNSTORE INDEX
- Define the distribution type and key
- Choose the partition key and boundary values (see the sketch below)
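Putting the pieces together, a minimal T-SQL sketch of a partitioned, hash-distributed columnstore fact table (all names and boundary dates are hypothetical):

```sql
-- Hypothetical fact table: columnstore index, hash distribution,
-- and RANGE RIGHT partitions on the date column.
CREATE TABLE dbo.FactOrders
(
    OrderKey   BIGINT NOT NULL,
    ProductKey INT    NOT NULL,
    OrderDate  DATE   NOT NULL,
    Amount     DECIMAL(18, 2)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH(ProductKey),
    PARTITION (OrderDate RANGE RIGHT FOR VALUES
        ('2023-01-01', '2024-01-01', '2025-01-01'))
);
```

Three boundary values create four partitions, which the law of 60 turns into 240 physical partitions, so a layout like this suits a table of roughly 240 million rows or more.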
Identify when partitioning is needed in Azure Data Lake Storage Gen2
- Azure Blob Storage
- General-purpose storage accounts support block, append, and page blobs
- Account --> Container --> Blob
- Partition key identification
- Azure Storage serves data from a single partition faster than data spanning multiple partitions
- Partitioning is used to improve read performance
- Naming blobs correctly is critical
- Blob storage uses a range-based partitioning scheme
- Partition key is combo of Account + Container + Blob
- Blob names are ordered lexically, so names with similar prefixes (such as timestamps) are co-located on the same partition
- Best practices
- Avoid prefixing names with slowly changing values such as timestamps (yyyymmdd), which funnel traffic to a single partition
- Name blobs based upon likely queries
- Avoid latency-causing partitioning (keep blob sizes above 256 KB; add a hash prefix to names to spread load across partitions)