I. Storage Optimization
GBase 8a MPP Cluster is a columnar relational database, distinct from traditional row-based databases. Currently, GBase 8a MPP Cluster supports two storage modes: columnar and row-column mixed storage. This article focuses on storage features that provide optimization value.
1. Columnar Storage
In columnar storage mode, DML (Data Manipulation Language) operations on a column scan only the database page chain (column) corresponding to that column, avoiding full table scans. This effectively reduces the I/O overhead during DML operations.
1.1. Optimization Features:
- Smart Indexing: Columnar databases often use sparse indexing, known as smart indexing. The key idea is to create indexes (or statistical information) for data blocks (pages), enabling coarse-grained filtering. These indexes occupy minimal storage and are maintained automatically without manual intervention.
- Compression: Columnar databases typically achieve higher compression ratios than row-based databases due to the similarity of data in the same column and specialized compression algorithms. Row-based databases typically achieve a 2-3x compression ratio, whereas columnar databases can reach 5-10x (and in some cases, even 20-40x).
- Delayed Materialization: In a columnar database execution plan, data is not decompressed until the final step, whether filtering, projection, joining, or aggregation. This reduces CPU, memory, and network transmission overhead, as well as storage requirements.
1.2. Optimization Limitations:
- Columnar storage offers a performance advantage primarily when scanning single columns. When selecting multiple columns or using
SELECT *
, I/O performance can degrade significantly, negatively impacting execution performance. - Coarse-grained indexes are effective primarily for queries involving statistics on data chunks (DC).
2. Row-Column Hybrid Storage (Row Storage in Columns)
As GBase 8a MPP Cluster is columnar in architecture, when dealing with a large number of columns and sparse data access, significant random I/O can occur, leading to reduced performance.
The row-column hybrid storage feature helps mitigate this by storing certain columns redundantly as rows in a single column (of VARCHAR
type). This column contains concatenated row data, allowing selective page loading without reading the entire DC. This feature improves I/O performance, particularly for queries like SELECT * FROM
, by an order of magnitude.
2.1. Optimization Features:
- Compressed Storage: Row-column data supports compression, reducing redundancy.
- Smaller Data Pages: Data is read in smaller Data Pages, minimizing unnecessary I/O and improving query performance.
-
Configurable Hybrid Storage: Controlled by the
gbase_hybrid_store
parameter, which can be set as:-
0
: Disabled. -
1
: Automatically determined by the server. -
2
: Force enabled.
-
Automatic usage occurs under the following conditions:
- The field is defined as a row-column.
-
The number of rows per DC is less than or equal to
_gbase_hybrid_store_limit
.- Flexible Redundancy: Users can customize which fields are stored as rows, primarily selecting those frequently used in query projections.
2.2. Optimization Limitations:
- Currently, row-column data cannot be used for
SCAN
,JOIN
, orGROUP
operations. It is only applicable for single-table queries withORDER BY
, such as:
SELECT * FROM t [WHERE …] ORDER BY …;
2.3. Optimization Parameters:
-
Hybrid Store Usage: Controlled by the parameter
gbase_hybrid_store
(0
= disabled,1
= auto,2
= forced), default is1
. -
Page Size for Row Storage: Set by
gbase_hybrid_store_page_size=<1k –1G>
, default is32K
. -
Limit on Rows Per DC: Set by
_gbase_hybrid_store_limit=<1-65536>
, default is100
.
2.4. Additional Features and Limitations:
- Supports specifying column storage at the time of table creation.
- Supports adding column storage to existing tables.
- Supports deleting column storage.
- Supports creating multiple column stores on a single table, but column stores cannot have duplicate names.
- Data updates for column storage are automatically maintained by the system and are transparent to the user. When performing
INSERT
, fastUPDATE
,DELETE
, orLOAD
operations, the system will automatically update the redundant data. - Does not support modifying the definition of column storage. To change it, you must first manually delete and then recreate the column storage.
- The original column definition of the column store cannot be deleted or have its data attributes modified (although column names and column order can be changed).
- Bulk
UPDATE
operations (entire column replacement mode) are not allowed on the original column's data. - The total length of all columns in the same group cannot exceed 32KB.
- Column storage supports partitioned tables and allows indexes on the original columns.
3. Principles for Selecting Hash Distribution Columns
When working with large tables and equality queries, consider using hash distribution for better performance. Guidelines for selecting hash distribution columns:
-
Even Data Distribution: Select columns with a high
COUNT(DISTINCT)
to ensure even data distribution. -
Distributed Operations: Prioritize using hash distribution for columns involved in
JOIN
conditions in large tables to enable distributed execution across nodes. -
GROUP BY Columns: Prefer hash-distributed columns in
GROUP BY
clauses for optimized group aggregation.
II. Hash Index Utilization
Hash indexes are particularly useful for exact match queries, such as those commonly found in telecommunications applications involving concurrent record searches.
- GBase 8a MPP Cluster first filters data using smart indexes, then uses the hash index if applicable, or defaults to a full DC scan. This can be verified in the
Trace Log
.
Key Notes:
- Indexes introduce maintenance costs and can affect data load and DML performance. Use them judiciously based on your application needs.
- Hash indexes should be created on columns with low duplicate values to avoid hash collisions.
- Binary columns are not suitable for hash indexes.
III. Ordered Data Insertion
1. Benefits of Ordered Data Insertion:
- Improves smart index hit rate.
- Significantly boosts query performance.
2. Local Range Sorting:
When inserting incremental data, perform local range sorting before loading the data to maintain order within the database and improve read speeds.
For example:
- Sort daily data before inserting it into a monthly table.
- Analyze SQL queries to identify key filtering columns and sort data accordingly.
3. Sorting Methods:
-
External Sorting: Use external tools (like
psort
) to sort data files before loading them into the table. -
In-Database Sorting: Use a temporary table to store unsorted data, then sort it with an
INSERT INTO … SELECT * … ORDER BY …
statement before inserting it into the main table.
Recommended Usage:
- External sorting is better for non-real-time data loading, while in-database sorting is preferred for real-time applications.
This concludes today's discussion on storage optimization in GBase 8a MPP Cluster. Thank you for reading!
Top comments (0)