SPL computing performance test series: associate tables and wide table

#sql #performance #bigdata #database

I. Test task

Fact and dimension tables are common in multidimensional analysis. This test is to associate the fact table with multiple and multi-layer dimension tables and then aggregate by dimensions and, aggregate wide table by dimensions.

Based on TPCH100G dataset, we design the following association operations between large fact table and multiple dimension tables:

Associate one fact table with one dimension table, which is a two-table association operation.
Associate the primary-sub fact table with four dimension tables, one of which is utilized twice. This is a seven-table association operation.
Join the above-mentioned seven-table association into a wide table, and count the wide table.

II. Technologies to be compared

In this test, we only test SPL Enterprise Edition (version 20230528), and select the following two products to make a comparison:

Clickhouse 23.3.1, which is said to be the fastest OLAP database in the world
Starrocks 3.0.0, which is claimed to be a faster OLAP database

III. Test environment

One physical server with the following configuration:

2 x Intel3014 CPUs, main frequency 1.7G, 12 cores in total
64G memory
SSD (Solid State Drive)

The size of the largest table in TPCH 100G is only about 70G, and is probably smaller than the physical memory of the machine after simple compression. In order to be able to test the computing ability of these products on external storage as well as their sensitivity to memory, we use virtual machines to limit the number of CPUs and the capacity of memory, and design two test environments based on the relatively common cloud VM specifications in the industry:

VM1: 8 CPUs, 32G memory
VM2: 4 CPUs, 16G memory

For Starrocks, at least two nodes, BE and FE, need to be installed. The BE that undertakes computing task is installed on one VM, and the FE that undertakes management task is installed on the physical machine to avoid affecting the test results.

For SPL and Clickhouse, we only need to install them on VM.

For the preparation of test data, refer to: SPL computing performance test series: TPCH .

IV. Test process

1. Two-table association

SQL query statement:

select
    l_year,
    sum(volume) as revenue
from
    (
        Select
            extract(year from l_shipdate) as l_year,
            ( l_extendedprice * (1 - l_discount) ) as volume
        From
            lineitem,
            part
        where
            p_partkey = l_partkey
            and length(p_type)>2
    ) shipping
group by
    l_year
order by
    l_year;

SPL script:

    A
1   =now()
2   =file(“part.ctx”).open().import@mv(P_TYPE).(len(P_TYPE)>2)
3   =file(“lineitem.ctx”).open().cursor@mv(L_SHIPDATE,L_EXTENDEDPRICE,L_DISCOUNT;A2(L_PARTKEY))
4   =A3.groups(year(L_SHIPDATE):l_year;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)
5   =interval@ms(A1,now())

2. Seven-table association

SQL query statement:

select
    s_nationname,
    sum(volume) as revenue
from
    (
        Select
            n1.n_name as s_nationname,
            ( l_extendedprice * (1 - l_discount) ) as volume
        From
            supplier,
            lineitem,
            orders,
            customer,
            part,
            nation n1,
            nation n2
        where
            s_suppkey = l_suppkey
            and p_partkey = l_partkey
            and o_orderkey = l_orderkey
            and c_custkey = o_custkey
            and s_nationkey = n1.n_nationkey
            and c_nationkey = n2.n_nationkey
            and s_comment not like '%xxx%yyy%'
            and o_totalprice>5
            and length(p_type) > 2
            and n1.n_name is not null
            and n2.n_name is not null
            and c_phone is not null
    ) shipping
group by
    s_nationname
order by
    s_nationname;

SPL script:

    A
1   =now()
2   =file(“nation.btx”).import@bv(N_NAME).(if(N_NAME,N_NAME,null))
3   =file(“customer.ctx”).open().import@mv(C_NATIONKEY,C_PHONE).(A2(C_NATIONKEY) && C_PHONE)
4   =file(“supplier.ctx”).open().import@mv(S_NATIONKEY,S_COMMENT).(if(A2(S_NATIONKEY) && !like(S_COMMENT,“*xxx*yyy*”),S_NATIONKEY,null))
5   =file(“part.ctx”).open().import@mv(P_TYPE).(len(P_TYPE)>2)
6   =file(“orders.ctx”).open().cursor@mv(O_ORDERKEY;A3(O_CUSTKEY) && O_TOTALPRICE>5)
7   =file(“lineitem.ctx”).open().news(A6,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;A5(L_PARTKEY) && A4(L_SUPPKEY))
8   =A7.groups(A2(A4(L_SUPPKEY)):s_nationname;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)
9   =interval@ms(A1,now())

3. Wide table

Create a wide table using the associated data of seven original tables of the previous section, then perform the same query on the new wide table and compare the runtime.

SQL statement for creating wide table:

create table widetable ( l_orderkey    int,
    l_linenumber  int,
    l_partkey     int,
    l_suppkey     int,
    l_shipdate    date ,
    l_extendedprice  decimal(15, 2) NOT NULL,
    l_discount    decimal(15, 2) NOT NULL,
    o_totalprice  decimal(15, 2),
    o_custkey     int,
    c_nationname  varchar(25),
    c_phone       varchar(15),
    s_nationname  varchar(25),
    s_comment     varchar(101),
    p_type        varchar(25) );

SQL statement for inserting data into wide table:

select l_orderkey,l_linenumber,l_partkey,l_suppkey,l_shipdate,
    l_extendedprice,l_discount,o_totalprice,o_custkey,
    n2.n_name as c_nationname,c_phone,
    n1.n_name as s_nationname,s_comment,p_type
into widetable
from supplier,lineitem,orders,customer,part,nation n1,nation n2
where
    s_suppkey = l_suppkey
    and p_partkey = l_partkey
    and o_orderkey = l_orderkey
    and c_custkey = o_custkey
    and s_nationkey = n1.n_nationkey
    and c_nationkey = n2.n_nationkey

SQL query statement on wide table:

select
    s_nationname,
    sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
    s_comment not like '%xxx%yyy%'
    and o_totalprice>5
    and length(p_type) > 2
    and c_nationname is not null
    and s_nationname is not null
    and c_phone is not null
group by
    s_nationname
order by
    s_nationname;

SPL script for creating a composite table (widetable.ctx) using the same data as wide table:

    A
1   =file(“nation.btx”).import@b(N_NATIONKEY,N_NAME).keys@i(N_NATIONKEY)
2   =file(“customer.ctx”).open().import(C_CUSTKEY,C_NATIONKEY,C_PHONE).switch(C_NATIONKEY,A1).keys@im(C_CUSTKEY)
3   =file(“supplier.ctx”).open().import(S_SUPPKEY,S_NATIONKEY,S_COMMENT).switch(S_NATIONKEY,A1).keys@im(S_SUPPKEY)
4   =file(“part.ctx”).open().import(P_PARTKEY,P_TYPE)
5   =file(“orders.ctx”).open().cursor@mv(O_ORDERKEY,O_TOTALPRICE,O_CUSTKEY).switch(O_CUSTKEY,A2)
6   =file(“lineitem.ctx”).open().news(A5,L_ORDERKEY,L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_SHIPDATE,L_EXTENDEDPRICE,L_DISCOUNT,O_TOTALPRICE,O_CUSTKEY)
7   =A6.switch(L_PARTKEY,A4:P_PARTKEY;L_SUPPKEY,A3:S_SUPPKEY)
8   =A7.new(L_ORDERKEY,L_LINENUMBER,L_PARTKEY.P_PARTKEY:L_PARTKEY,L_SUPPKEY.S_SUPPKEY:L_SUPPKEY,L_SHIPDATE,L_EXTENDEDPRICE,L_DISCOUNT,O_TOTALPRICE,O_CUSTKEY.C_CUSTKEY:O_CUSTKEY,O_CUSTKEY.C_NATIONKEY.N_NAME:C_NATIONNAME,O_CUSTKEY.C_PHONE,L_SUPPKEY.S_NATIONKEY.N_NAME:S_NATIONNAME,L_SUPPKEY.S_COMMENT,L_PARTKEY.P_TYPE)
9   =file(“widetable.ctx”).create(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_SHIPDATE,L_EXTENDEDPRICE,L_DISCOUNT,O_TOTALPRICE,O_CUSTKEY,C_NATIONNAME,C_PHONE,S_NATIONNAME,S_COMMENT,P_TYPE)
10  =A9.append@i(A8)

SPL query script:

    A
1   =now()
2   =file(“widetable.ctx”).open().cursor@mv(S_NATIONNAME,L_EXTENDEDPRICE,L_DISCOUNT;O_TOTALPRICE>5 && C_NATIONNAME!=null && C_PHONE!= null && S_NATIONNAME!=null && len(P_TYPE)>2 && !like(S_COMMENT,“*xxx*yyy*”))
3   =A2.groups(S_NATIONNAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):volume)
4   =interval@ms(A1,now())

V. Test results

Unit: seconds

	VM1			VM2
	2-table association	7-table association	Wide table	2-table association	7-table association	Wide table
SPL	11.5	30.6	57.7	21.5	55.6	114.2
Starrocks	35.1	73.3	62.1	78.8	152.5	129.9
Clickhouse	89.3	Out-of-memory (OOM)	33.2	204.1	OOM	74.3

VI. Comments on test results

Association performance: SPL is the best and significantly better than the other two products; ClickHouse is the worst and far worse than the other two, and OOM occurs;
Performance on wide table: ClickHouse is the best and far better than SPL and Starrocks. SPL is slightly better than Starrocks.
When associating more tables, the performance of both Starrocks and Clickhouse drops significantly and lowers than that on wide table. Therefore, for the two products, it is often recommended to create wide table for querying.
The performance of SPL on association operation is better than its computing performance on wide table, and also better than the computing performance of the other two products on wide table. Therefore, there is no need for SPL to create wide table.
From the performance drop degree from VM1 to VM2, we can see that the performance of both Clickhouse and Starrocks decreases by more than 2 times (equivalent to the difference of CPU number), indicating the performance of the two products is very sensitive to memory capacity (the memory capacity of VM2 is halved), in other words, more memory space is occupied when calculating in Clickhouse or Starrocks. In contrast, the performance of SPL decreases by less than 2 times, indicating that less memory space is occupied when calculating in SPL.
Conclusions SPL has good association performance, and can be used for multidimensional analysis without the need to create wide table, which greatly reduces the time to prepare data, reduces hard disk occupation, and improves the real-timeness of analysis. Moreover, SPL takes up less memory space, so the hardware cost is lower.

Clickhouse has better performance on wide table, but it is completely unsuitable for the scenarios involving association. When Clickhouse is used for multidimensional analysis, there is a need to create wide table, which will be cumbersome and take up more hard disk space.

Starrocks does not do a good job on either association or wide table. Although its association performance is better than that of Clickhouse, it is not satisfactory. Therefore, Starrocks is not suitable for both multi-dimensional analysis scenarios.

Origin: http://c.raqsoft.com/article/1690170794600
SPL Source code: https://github.com/SPLWare/esProc