Do you know there is life beyond SQL? In this post, I'm going to show you an interesting alternative.
An alternative to SQL
Let's first understand what a database does. A database serves two main functions: performing calculations and processing transactions, which are what we usually call OLAP and OLTP. When it comes to databases, we want systems that run fast and make interacting with them (writing queries) easy.
Have you ever thought about SQL's limitations when it comes to these two characteristics?
SQL Limitation 1: Ease of writing/reading queries
Let's start with ease of writing.
When solving real-life problems (not textbook problems), SQL statements can get ugly very quickly. Here's a relatively simple example: calculating the maximum number of consecutive days that a stock price keeps rising:
select max(consecutive_day)
from (select count(*) consecutive_day
      from (select sum(rise_mark) over(order by trade_date) days_no_gain
            from (select trade_date,
                         case when closing_price>lag(closing_price) over(order by trade_date)
                              then 0 else 1 end rise_mark
                  from stock_price))
      group by days_no_gain)
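To see what this query actually computes, here is a minimal Python sketch of the same mark-and-cumulative-sum idea. The column names (`rise_mark`, `days_no_gain`, `consecutive_day`) come from the SQL above; the sample prices are made up for illustration:

```python
# Same idea as the SQL: mark each day 1 when the price did NOT rise,
# running-sum the marks to label each streak, then take the largest group.
# Each group starts at a non-rising day, so its size counts the starting
# day plus the rising days that follow it.
closing_prices = [10, 11, 12, 11, 12, 13, 14, 13]  # illustrative sample data

group_sizes = {}  # days_no_gain label -> consecutive_day count
label = 0         # cumulative sum of rise_mark
for i, price in enumerate(closing_prices):
    rise_mark = 0 if i > 0 and price > closing_prices[i - 1] else 1
    label += rise_mark  # sum(rise_mark) over(order by trade_date)
    group_sizes[label] = group_sizes.get(label, 0) + 1

max_run = max(group_sizes.values())  # max(consecutive_day)
print(max_run)  # -> 4 (the run 11, 12, 13, 14)
```

The sketch makes the trick visible: the running sum stays constant across each rising streak, so it works as a group label, which is exactly what `days_no_gain` does in the SQL.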
Writing this sort of query is not easy, and reading it later (whether you or a colleague wrote it) is not easy either.
SQL Limitation 2: Speed
Let's consider a simple task, often used as an example: take the top 10 items from a dataset containing 100 million records. This task is not complicated to write in SQL:
SELECT TOP 10 x FROM T ORDER BY x DESC
However, the execution logic corresponding to this statement is to sort all the data first, take the top 10, and discard the rest. As we all know, sorting is a slow operation that traverses the data many times. If the data is too large to fit in memory, it also has to be buffered in external storage, degrading performance even further. If the logic embodied in this statement is strictly followed, the operation can never run fast. Fortunately, many programmers know that this task needs neither a full sort nor external buffering: it can be done in a single traversal using only a small amount of memory, which means a higher-performance algorithm exists. Regrettably, such an algorithm cannot be expressed in SQL. We can only hope the database optimizer is smart enough to convert this statement into a high-performance execution plan, but the optimizer may not be reliable when the situation gets complicated.
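The single-traversal algorithm alluded to here can be sketched in Python with a bounded min-heap (the function name and the sample data are illustrative, not from the original):

```python
import heapq
import random

def top_n(values, n=10):
    """Return the n largest values in one traversal with O(n) memory.

    A min-heap of size n holds the current candidates; a new value only
    replaces the heap's minimum when it is larger, so the full dataset
    is never sorted and never needs to fit in memory at once.
    """
    heap = []
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)

# The input can be a stream: here a generator, never materialized in memory.
data = (random.randint(0, 10**9) for _ in range(1_000_000))
print(top_n(data))
```

This is the kind of algorithm a SQL `ORDER BY ... TOP` statement cannot express directly: the statement describes a sort, and only the optimizer can (sometimes) substitute a plan like this one.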
It seems that SQL does not do well in either aspect. Although these two examples are not very complicated, SQL performs poorly on both. In real life, hard-to-write and slow-running SQL code with thousands of lines abounds.
Why these limitations?
To answer this question, we need to analyze what implementing a calculation in program code actually involves.
Essentially, programming is the process of translating a problem-solving idea into a precise formal language executable by the computer.
The core of a formal language used to describe calculations lies in the algebraic system it adopts. Put simply, an algebraic system includes two key elements: data types and the operation rules defined on them. For instance, the key elements of the arithmetic we learned in primary school are integers and the operations of addition, subtraction, multiplication and division. Once we have both elements, we can express the operation we want with the symbols the algebraic system stipulates, i.e., write the code, and the computer can execute it.
At present, the mainstream database is the relational database, so called because its mathematical basis is relational algebra. SQL is exactly a formal language developed from the theory of relational algebra.
Now we can answer why SQL is not competent in either aspect we expect. The problem lies in relational algebra itself: relational algebra is like an arithmetic system that has only addition and no multiplication, so it is inevitable that many things cannot be done well.
Unfortunately, this problem sits at the theoretical level, so no amount of practical optimization will fix it; it can only be mitigated, not eradicated.
Why SPL is competent
SPL's theoretical basis is no longer relational algebra but a new algebra called the discrete dataset. The formal language designed on top of this algebra is named SPL (Structured Process Language).
SPL redefines and extends many operations on structured data. Specifically, it adds discreteness, enhances ordered computation, implements thorough set orientation, supports object references, and advocates stepwise operation.
Recoding the previous problems in SPL will give you a direct feel for it. To calculate the maximum number of consecutive days that a stock keeps rising:
stock_price.sort(trade_date).group@i(closing_price<closing_price[-1]).max(~.len())
Although the calculation idea is the same as in the SQL version, it is easier to express and no longer confusing, thanks to the introduction of ordered computation.
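The order-aware grouping that `group@i` performs can be mimicked in Python to clarify what the one-liner does. This is only a sketch of the idea (the function name and sample prices are illustrative; the real SPL semantics are richer):

```python
# group@i(closing_price < closing_price[-1]) starts a new group whenever
# the condition holds between adjacent rows; here we split the price
# series at every falling day and take the longest run (max(~.len())).
def max_rising_run(prices):
    if not prices:
        return 0
    runs = [[prices[0]]]
    for prev, cur in zip(prices, prices[1:]):
        if cur < prev:          # condition true -> start a new group
            runs.append([cur])
        else:                   # otherwise extend the current group
            runs[-1].append(cur)
    return max(len(run) for run in runs)

print(max_rising_run([10, 11, 12, 11, 12, 13, 14, 13]))  # -> 4
```

Note how close this loop is to the SPL expression: sort, split on a condition between adjacent rows, take the longest group. No cumulative-sum trick is needed because the grouping primitive itself understands order.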
Take the top 10 out of 100 million pieces of data:
T.groups(;top(-10,x))
With its richer set data types, SPL makes it easy to describe this efficient algorithm: a simple aggregation over a single traversal, with no big sort involved.
Conclusion
Beyond the theoretical differences, SPL also has many engineering-level advantages: parallel code is easier to write, large-memory pre-association improves foreign-key join performance, a unique columnar storage mechanism supports arbitrary segmentation and parallel computing, and so on.
For more details, check out their GitHub repo.