Fact and dimension tables are common in multidimensional analysis. This test is to associate the fact table with multiple and multi-layer dimension tables and then aggregate by dimensions and, aggregate wide table by dimensions.
Based on TPCH100G dataset, we design the following association operations between large fact table and multiple dimension tables:
- Associate one fact table with one dimension table, which is a two-table association operation.
- Associate the primary-sub fact table with four dimension tables, one of which is utilized twice. This is a seven-table association operation.
- Join the above-mentioned seven-table association into a wide table, and count the wide table.
In this test, we only test SPL Enterprise Edition (version 20230528), and select the following two products to make a comparison:
- Clickhouse 23.3.1, which is said to be the fastest OLAP database in the world
- Starrocks 3.0.0, which is claimed to be a faster OLAP database
One physical server with the following configuration:
2 x Intel3014 CPUs, main frequency 1.7G, 12 cores in total
SSD (Solid State Drive)
The size of the largest table in TPCH 100G is only about 70G, and is probably smaller than the physical memory of the machine after simple compression. In order to be able to test the computing ability of these products on external storage as well as their sensitivity to memory, we use virtual machines to limit the number of CPUs and the capacity of memory, and design two test environments based on the relatively common cloud VM specifications in the industry:
VM1: 8 CPUs, 32G memory
VM2: 4 CPUs, 16G memory
For Starrocks, at least two nodes, BE and FE, need to be installed. The BE that undertakes computing task is installed on one VM, and the FE that undertakes management task is installed on the physical machine to avoid affecting the test results.
For SPL and Clickhouse, we only need to install them on VM.
For the preparation of test data, refer to: SPL computing performance test series: TPCH .
SQL query statement:
select l_year, sum(volume) as revenue from ( Select extract(year from l_shipdate) as l_year, ( l_extendedprice * (1 - l_discount) ) as volume From lineitem, part where p_partkey = l_partkey and length(p_type)>2 ) shipping group by l_year order by l_year;
A 1 =now() 2 =file(“part.ctx”).open().import@mv(P_TYPE).(len(P_TYPE)>2) 3 =file(“lineitem.ctx”).open().cursor@mv(L_SHIPDATE,L_EXTENDEDPRICE,L_DISCOUNT;A2(L_PARTKEY)) 4 =A3.groups(year(L_SHIPDATE):l_year;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) 5 =interval@ms(A1,now())
SQL query statement:
select s_nationname, sum(volume) as revenue from ( Select n1.n_name as s_nationname, ( l_extendedprice * (1 - l_discount) ) as volume From supplier, lineitem, orders, customer, part, nation n1, nation n2 where s_suppkey = l_suppkey and p_partkey = l_partkey and o_orderkey = l_orderkey and c_custkey = o_custkey and s_nationkey = n1.n_nationkey and c_nationkey = n2.n_nationkey and s_comment not like '%xxx%yyy%' and o_totalprice>5 and length(p_type) > 2 and n1.n_name is not null and n2.n_name is not null and c_phone is not null ) shipping group by s_nationname order by s_nationname;
A 1 =now() 2 =file(“nation.btx”).import@bv(N_NAME).(if(N_NAME,N_NAME,null)) 3 =file(“customer.ctx”).open().import@mv(C_NATIONKEY,C_PHONE).(A2(C_NATIONKEY) && C_PHONE) 4 =file(“supplier.ctx”).open().import@mv(S_NATIONKEY,S_COMMENT).(if(A2(S_NATIONKEY) && !like(S_COMMENT,“*xxx*yyy*”),S_NATIONKEY,null)) 5 =file(“part.ctx”).open().import@mv(P_TYPE).(len(P_TYPE)>2) 6 =file(“orders.ctx”).open().cursor@mv(O_ORDERKEY;A3(O_CUSTKEY) && O_TOTALPRICE>5) 7 =file(“lineitem.ctx”).open().news(A6,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;A5(L_PARTKEY) && A4(L_SUPPKEY)) 8 =A7.groups(A2(A4(L_SUPPKEY)):s_nationname;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) 9 =interval@ms(A1,now())
Create a wide table using the associated data of seven original tables of the previous section, then perform the same query on the new wide table and compare the runtime.
SQL statement for creating wide table:
create table widetable ( l_orderkey int, l_linenumber int, l_partkey int, l_suppkey int, l_shipdate date , l_extendedprice decimal(15, 2) NOT NULL, l_discount decimal(15, 2) NOT NULL, o_totalprice decimal(15, 2), o_custkey int, c_nationname varchar(25), c_phone varchar(15), s_nationname varchar(25), s_comment varchar(101), p_type varchar(25) );
SQL statement for inserting data into wide table:
select l_orderkey,l_linenumber,l_partkey,l_suppkey,l_shipdate, l_extendedprice,l_discount,o_totalprice,o_custkey, n2.n_name as c_nationname,c_phone, n1.n_name as s_nationname,s_comment,p_type into widetable from supplier,lineitem,orders,customer,part,nation n1,nation n2 where s_suppkey = l_suppkey and p_partkey = l_partkey and o_orderkey = l_orderkey and c_custkey = o_custkey and s_nationkey = n1.n_nationkey and c_nationkey = n2.n_nationkey
SQL query statement on wide table:
select s_nationname, sum( l_extendedprice * (1 - l_discount) ) as volume from widetable where s_comment not like '%xxx%yyy%' and o_totalprice>5 and length(p_type) > 2 and c_nationname is not null and s_nationname is not null and c_phone is not null group by s_nationname order by s_nationname;
SPL script for creating a composite table (widetable.ctx) using the same data as wide table:
A 1 =file(“nation.btx”).import@b(N_NATIONKEY,N_NAME).keys@i(N_NATIONKEY) 2 =file(“customer.ctx”).open().import(C_CUSTKEY,C_NATIONKEY,C_PHONE).switch(C_NATIONKEY,A1).keys@im(C_CUSTKEY) 3 =file(“supplier.ctx”).open().import(S_SUPPKEY,S_NATIONKEY,S_COMMENT).switch(S_NATIONKEY,A1).keys@im(S_SUPPKEY) 4 =file(“part.ctx”).open().import(P_PARTKEY,P_TYPE) 5 =file(“orders.ctx”).open().cursor@mv(O_ORDERKEY,O_TOTALPRICE,O_CUSTKEY).switch(O_CUSTKEY,A2) 6 =file(“lineitem.ctx”).open().news(A5,L_ORDERKEY,L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_SHIPDATE,L_EXTENDEDPRICE,L_DISCOUNT,O_TOTALPRICE,O_CUSTKEY) 7 =A6.switch(L_PARTKEY,A4:P_PARTKEY;L_SUPPKEY,A3:S_SUPPKEY) 8 =A7.new(L_ORDERKEY,L_LINENUMBER,L_PARTKEY.P_PARTKEY:L_PARTKEY,L_SUPPKEY.S_SUPPKEY:L_SUPPKEY,L_SHIPDATE,L_EXTENDEDPRICE,L_DISCOUNT,O_TOTALPRICE,O_CUSTKEY.C_CUSTKEY:O_CUSTKEY,O_CUSTKEY.C_NATIONKEY.N_NAME:C_NATIONNAME,O_CUSTKEY.C_PHONE,L_SUPPKEY.S_NATIONKEY.N_NAME:S_NATIONNAME,L_SUPPKEY.S_COMMENT,L_PARTKEY.P_TYPE) 9 =file(“widetable.ctx”).create(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_SHIPDATE,L_EXTENDEDPRICE,L_DISCOUNT,O_TOTALPRICE,O_CUSTKEY,C_NATIONNAME,C_PHONE,S_NATIONNAME,S_COMMENT,P_TYPE) 10 =A9.append@i(A8)
SPL query script:
A 1 =now() 2 =file(“widetable.ctx”).open().cursor@mv(S_NATIONNAME,L_EXTENDEDPRICE,L_DISCOUNT;O_TOTALPRICE>5 && C_NATIONNAME!=null && C_PHONE!= null && S_NATIONNAME!=null && len(P_TYPE)>2 && !like(S_COMMENT,“*xxx*yyy*”)) 3 =A2.groups(S_NATIONNAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):volume) 4 =interval@ms(A1,now())
|2-table association||7-table association||Wide table||2-table association||7-table association||Wide table|
- Association performance: SPL is the best and significantly better than the other two products; ClickHouse is the worst and far worse than the other two, and OOM occurs;
- Performance on wide table: ClickHouse is the best and far better than SPL and Starrocks. SPL is slightly better than Starrocks.
- When associating more tables, the performance of both Starrocks and Clickhouse drops significantly and lowers than that on wide table. Therefore, for the two products, it is often recommended to create wide table for querying.
- The performance of SPL on association operation is better than its computing performance on wide table, and also better than the computing performance of the other two products on wide table. Therefore, there is no need for SPL to create wide table.
- From the performance drop degree from VM1 to VM2, we can see that the performance of both Clickhouse and Starrocks decreases by more than 2 times (equivalent to the difference of CPU number), indicating the performance of the two products is very sensitive to memory capacity (the memory capacity of VM2 is halved), in other words, more memory space is occupied when calculating in Clickhouse or Starrocks. In contrast, the performance of SPL decreases by less than 2 times, indicating that less memory space is occupied when calculating in SPL.
- Conclusions SPL has good association performance, and can be used for multidimensional analysis without the need to create wide table, which greatly reduces the time to prepare data, reduces hard disk occupation, and improves the real-timeness of analysis. Moreover, SPL takes up less memory space, so the hardware cost is lower.
Clickhouse has better performance on wide table, but it is completely unsuitable for the scenarios involving association. When Clickhouse is used for multidimensional analysis, there is a need to create wide table, which will be cumbersome and take up more hard disk space.
Starrocks does not do a good job on either association or wide table. Although its association performance is better than that of Clickhouse, it is not satisfactory. Therefore, Starrocks is not suitable for both multi-dimensional analysis scenarios.