Efficiency is one of the most important indicators for a database. To improve database performance, the following aspects should be well managed: database design, SQL query optimization, database parameter configuration, appropriate data resources, and the operating system.
Today, I will introduce the GCluster parameter optimization methods for GBase.
1. Query Optimization Parameters
gcluster_hash_redistribute_groupby_optimize
This parameter controls whether to enable the group by
optimization mode. The default value is 1
(0
- Disable, 1
- Enable).
If this option is enabled, the temporary results will be redistributed to each compute node using a hash algorithm before performing group by
operations. Since the data is hashed before being distributed to each node, the final results can be directly aggregated without the need for an additional grouping operation on the summary node.
gcluster_hash_redistribute_join_optimize
This parameter controls whether to enable the hash redistribution for the JOIN
mode. The default value is 2
(0
– Copy table, 1
– Dynamic hash, 2
– Auto evaluate).
If enabled, when performing an equi-join operation on two distributed tables, the data from one of the tables will be dynamically hashed based on the join condition column values. This dynamic hash table will then be joined with the other table on each compute node. Thus, each node's operation result can be directly aggregated to get the final result. This strategy avoids creating a copy table on all compute nodes. If set to 2
, hash redistribution JOIN
is used when the data difference between the two tables is within 20%; otherwise, it is not used.
gcluster_special_correlated_optimize
This parameter controls whether to enable hash redistribution optimization for correlated subqueries. If the parent and child queries are correlated and there is an equi-join relationship, the parent and child queries will be hashed and redistributed based on the join column.
- Parameter =
0
: Disable - Parameter =
1
: Enable - Default value is
1
.
The hash redistribution for correlated subqueries needs to be used together with the parameter gcluster_crossjoin_use_hash_distribution
. The rules are as follows:
- When
gcluster_special_correlated_optimize = 0
, the optimization will not be enabled, regardless of thegcluster_crossjoin_use_hash_distribution
value. - When
gcluster_special_correlated_optimize = 1
andgcluster_crossjoin_use_hash_distribution = 0
, the optimization will not be enabled. - When
gcluster_special_correlated_optimize = 1
andgcluster_crossjoin_use_hash_distribution = 1
, the optimization will be enabled.
Example:
SELECT COUNT(*)
FROM x1
WHERE EXISTS (SELECT 1 FROM x2 WHERE x1.id2 = x2.id2);
For hash-related subquery optimization, x1
and x2
are dynamically redistributed.
gcluster_crossjoin_use_hash_distribution
This parameter controls whether to enforce hash redistribution JOIN
when neither side of the JOIN
is a hash column.
- Parameter =
0
: Disable, hash redistributionJOIN
is not used. - Parameter =
1
: Enable, hash redistributionJOIN
is used. - Default value is
1
.
gcluster_empty_result_set_optimize
This parameter controls whether to enable empty result set optimization. The default value is 0
(0
- OFF, 1
- ON). If the optimizer determines that the result set is empty, it will return directly without requiring the executor to perform the execution.
gcluster_single_hash_node_optimize
Optimizes single-table hash conditions. When the single table contains hash column equality conditions, hash optimization is performed, and the SQL statement is sent to a single node. The default value is 1
(0
- OFF, 1
- ON).
gcluster_hash_join_complex_optimize
Optimizes subqueries and parent queries that meet the hash relationship by executing them as a single unit.
- Parameter =
0
: Disable, hash relationship optimization is not executed as a whole. - Parameter =
1
: Enable, hash relationship optimization is executed as a whole. - Default value is
1
.
gcluster_union_optimize
This parameter controls whether to use union optimization. When enabled, the union statement is executed on the node, avoiding pulling all tables that need to be unioned as copy tables.
- Parameter =
0
: Do not use union optimization. - Parameter =
1
: Use union optimization. - Default value is
1
.
gcluster_starschema_join_estimate_optimize
Sets the method for estimating the join result of two tables.
-
0
: Estimate by multiplying the row counts of both tables. -
1
: Estimate by the larger table's row count. - Default value is
1
.
gcluster_delayed_group_by_optimize
Determines whether group by
should be pushed down to gnode
for execution. If there is no significant reduction in the result set after pushing down group by
to gnode
, set this parameter to 1
(0
- OFF, 1
- ON).
gcluster_count_optimize
Determines if intermediate result tables are generated during count(*)
operations.
-
0
: Execute by generating intermediate tables. -
1
: Directly calculate the count value ingcluster
. Thecount
shortcut optimization sends the SQL to eachgnode
node for direct calculation, without generating a query plan or temporary table. -
2
: Based on1
, if the table is a hash distribution table and the SQL'swhere
conditions meet the single hash optimization condition, the executor calculates the targetgnode
based on the hash value and sends the SQL to the targetgnode
for calculation, rather than to allgnodes
, improving small query concurrency.
Optimization Details: The original strategy for count(*)
was to create a temporary summary table on the initiating node's gnode
, aggregate the results from each node into the temporary table, and then sum the results. This strategy created and deleted temporary summary tables frequently during high concurrency, reducing performance. The new approach avoids creating temporary summary tables, collects execution results from each node into gcluster
, and calculates directly, thus improving performance.
Conditions for gcluster_count_optimize = 1
to optimize count
SQL:
- No
dblink
table is used. - No
union
,minus
,intersect
. - No non-correlated subqueries.
- No scalar subqueries.
- No
from
subqueries. - No correlated subqueries.
- Not
select...into
. - Only one column in the projection.
- No
group by
clause. - No
order by
. - No
having
. - No
limit
oroffset
. - No
distinct
. - No
join
. - No copy table.
- Projection column must be
count
but notcount distinct
.
Conditions for gcluster_count_optimize = 2
to optimize count
SQL:
- Meets all conditions of
gcluster_count_optimize = 1
. - Uses hash distribution table, and
where
conditions meet single hash optimization conditions.
Example:
SELECT count(1)
FROM t
WHERE t.distr_col = constant;
With optimization disabled, throughput at 400 concurrent requests is 3905/s; with optimization enabled, throughput is 6055/s.
2. Concurrency Parameters
gcluster_serial_exec_query
Optimization Point: When gnode
does not have automatic resource management capabilities, high concurrency causes gnode
resources to be contested by concurrent SQL, reducing performance. gcluster
can control the number of SQLs sent to gnode
to indirectly control resource usage, preventing concurrent SQLs from competing for resources and improving performance. Configure gcluster_serial_exec_query = batch submission count (number of CPU cores per node)
in the gcluster
configuration file to control the number of SQLs submitted to gnode
. Default value is 0
(no limit).
gcluster_max_conn_in_pool
Optimization Point: Without a thread pool, the concurrent access to gnode
by gcluster
is uncontrolled. Each access to gnode
starts a new thread, and high concurrency results in many threads consuming system resources and increasing gnode
pressure. With a thread pool, concurrent requests compete for threads in the same pool, so the maximum thread count can strictly control the concurrency level, reducing gnode
pressure and thread resource consumption. Default value is 300
.
gcluster_use_conn_pool
Optimization Point: Without a connection pool, the connection count to gnode
is uncontrolled. Each access to gnode
starts a new connection, increasing connection time and consuming system resources during high concurrency. With a connection pool, concurrent requests compete for connections in the same pool, so the maximum connection count can strictly control concurrency, reducing gnode
pressure and connection resource consumption.
-
0
: OFF -
1
: ON
gcluster_insertselect_use_values_optimize
Optimization Point: Multiple insert into t1 select * from t
operations are serialized on gnode
under high concurrency, affecting execution efficiency. However, insert into t1 values()
allows concurrent execution.
Example:
INSERT INTO TB_SVC_SUBS_HIST_TMP1
SELECT *
FROM TB_SVC_SUBS_HIST
WHERE MSISDN = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';
Applicable Scenario:
This parameter is suitable for high-concurrency insert select
operations, especially when the result set of the select
statement is not too large.
(0-OFF, 1-ON)
gcluster_single_hash_node_optimize
This parameter optimizes single-point hash conditions:
- 0: Disables single-point hash optimization.
- 1: Optimizes single-table equality hash queries. When a single table has equality conditions on hash columns, the SQL statement is only sent to a single node.
-
2: Optimizes single-table and
from
subquery equality hash queries. When a single table or afrom
subquery has equality conditions on hash columns, the SQL statement is only sent to a single node. -
3: Supports single-table optimization and multi-table joins, including
inner join
,left join
, andright join
(not supportingfull join
). The SQL statement is only sent to a single node.
The default value is 1: Enabled.
Example
SELECT *
FROM bas.clcinfdta
WHERE 1=1 AND clt_nbr = '7319022720'
LIMIT 30 OFFSET 0;
If no records are found, remove 1=1
to execute the query correctly. To temporarily disable this optimization, set gcluster_single_hash_node_optimize = 0
.
Performance Comparison for single_hash=3
:
-
20 Concurrent JMeter Threads:
- With optimization enabled: Throughput = 20.2/sec
- With optimization disabled: Throughput = 15.3/sec
-
50 Concurrent JMeter Threads:
- With optimization enabled: Throughput = 27.2/sec
- With optimization disabled: Throughput = 19.2/sec
-
100 Concurrent JMeter Threads:
- With optimization enabled: Throughput = 28.9/sec
- With optimization disabled: Throughput = 20.3/sec
_gbase_enable_hashtree
The default value is 1 (Enabled). This parameter is used when the join columns have high duplication, which can lead to inefficiency.
-
0: Disables
hashtree
usage during joins, switching to a linked list structure for joins.
gcluster_ddl_parallel_execute
This parameter controls whether DDL statements are executed in parallel or sequentially.
- 0: Sequential execution.
- 1: Parallel execution.
The default value is 0.
Top comments (0)