DEV Community

Henning
Henning

Posted on

The best SQL function you never heard of

There are a two things in IT that made me feel like my brain turned inside out when learning them. One of them is the SQL concept of row pattern matching.

Think of row pattern matching as doing regex between rows. Instead of finding single rows where some column has some value (or, more often, some combination of values like WHERE col_a>1 AND col_b='puzzle'), we can find sets of rows that together constitute some interesting pattern.

A basic example

The canonical example is analyzing stock prices: Can we find stock that have seen a 5 or more day rally? That have rebounded for at least 5 days after declining for at least 5 days? The longest stretch of continuous price increase? None of these questions can be answered by a WHERE clause alone, and even though the lag() function might get you far, it gets messy quickly even if it can give you an answer.

Row pattern matching introduces some new concepts though, that we need to go through carefully in order to grasp them.

Instead of stock prices we will use currency rates, which have all the same properties. A basic row pattern matching query can look something like this:

SELECT currency, tstamp, value FROM sdr_rates
MATCH_RECOGNIZE(
    PARTITION BY currency
    ORDER BY tstamp
    ALL ROWS PER MATCH
    PATTERN ( up+ )
    DEFINE up AS LAG(value)<value
);
Enter fullscreen mode Exit fullscreen mode

We recognize the first line, but beyond that there are many new keywords.

All the action happens within the MATCH_RECOGNIZE clause, which is the row pattern matching function. The first thing we do is to partition the data by the stock symbol. Partitioning here works similar to partitioning elsewhere, like in an OVER() clause. We want to split the data by stock, we aren't interested in comparing different stocks to each other. Each stock should be seen as a separate time series. ORDER BY is also quite simple, we specify how we should order the data. Here, we order the data by date (ascending, earliest first).

We will come back to the ALL ROWS PER MATCH clause, but for now, I'll note that row pattern matching by default aggregates the data and only shows one line per match. Right now, we want to see all the rows.

The PATTERN clause is the main regex-like clause. We specify we want to select one or more (the + sign is basically regex) of something we call up, which is defined on the next line.

The last line, DEFINE, is where we define up by a set of criteria. Any row that matches these criteria is labeled up, and will match in the pattern we define. This definition uses yet another new keyword, but one we can probably guess PREV. We specify UP to be any row where the price is higher than the previous row.

The output might look something like this:

CURRENCY  TSTAMP VALUE
LKR 1994-01-04 68.0266
LKR 1994-01-05 68.1848
LKR 1994-01-07 68.171
LKR 1994-01-10 68.2884
LKR 1994-01-12 68.1946
LKR 1994-01-17 68.211
LKR 1994-01-18 68.284
LKR 1994-01-19 68.3353
LKR 1994-01-21 68.5181

This might not be what you had expected though. The price is jumping all over the place. But it is exactly what we asked for. The currency LKR had several runs of increasing value, the longest stretch seems to be from the 17th to the 19th of April.

But what about the 12th? That's just one row!?! Well, trust me that the value on the 11th was lower than on the 12th, but we didn't actually ask to get that first row returned. We referenced it in the DEFINE clause, but it isn't in the PATTERN clause and so it isn't returned.

Expanding the example

In order to make the output a little more informative, let's expand the pattern we are looking for and introduce a specialized type of calculated field called MEASURES.

SELECT * FROM sdr_rates
MATCH_RECOGNIZE(
    PARTITION BY currency
    ORDER BY tstamp
    MEASURES CLASSIFIER() AS clf,
    MATCH_NUMBER() AS mnum
    ALL ROWS PER MATCH
    PATTERN (strt up+)
    DEFINE up AS LAG(value)<value
)
Enter fullscreen mode Exit fullscreen mode

We have introduced a new item, strt, in the PATTERN clause, but it isn't defined in the DEFINE clause. This might be unintuitive, but any clause not defined will reluctantly match all rows. So now, we return the "first" row of the pattern returned, and we can see the increase from start to finish. In other words, the 11th should be included in the result.

The MEASURES field defines two extra columns, using special match recognize related functions. CLASSIFIER() actually returns which part of the pattern the row matched - in our case either strt or up. MATCH_NUMBER() enumerates the matches, so that we can easily see which rows relate to each other.

The resulting table is as follows:

CURRENCY TSTAMP VALUE CLF MNUM
LKR 1994-01-03 67.9518 STRT 1
LKR 1994-01-04 68.0266 UP 1
LKR 1994-01-05 68.1848 UP 1
LKR 1994-01-06 68.1103 STRT 2
LKR 1994-01-07 68.171 UP 2
LKR 1994-01-10 68.2884 UP 2
LKR 1994-01-11 68.1027 STRT 3
LKR 1994-01-12 68.1946 UP 3
LKR 1994-01-13 68.1825 STRT 4
LKR 1994-01-17 68.211 UP 4
LKR 1994-01-18 68.284 UP 4
LKR 1994-01-19 68.3353 UP 4
LKR 1994-01-20 68.1867 STRT 5
LKR 1994-01-21 68.5181 UP 5

Going further

Patterns can be a lot more advanced, we could for instance find W-patterns by defining down, and searching for PATTERN ( STRT (DOWN UP){2,}).

SELECT * FROM sdr_rates
MATCH_RECOGNIZE(
    PARTITION BY currency
    ORDER BY tstamp
    MEASURES CLASSIFIER() AS clf,
    MATCH_NUMBER() AS mnum,
    MAX(up.value) AS max_up,
    FIRST(tstamp) AS start_at,
    FINAL LAST(tstamp) AS end_at,
    FINAL COUNT(1) AS n_days
    ALL ROWS PER MATCH
    PATTERN ( strt ( down up ){2,} )
    DEFINE up AS LAG(value)<value,
        down AS LAG(value)>value
);
Enter fullscreen mode Exit fullscreen mode

In addition to the expanded pattern, we now define some more measures: The max value of rows classified as UP, the first timestamp of each match, the (final) last timestamp of the match (measures by default do not look ahead, the FINAL keyword finds the actual last value. Similarly, we take the final row count of each match.

CURRENCY TSTAMP VALUE CLF MNUM max_up start_at end_at n_days
SGD 2010-02-03 2.19037 STRT 243 2010-02-03 2010-02-09 5
SGD 2010-02-04 2.18179 DOWN 243 2010-02-03 2010-02-09 5
SGD 2010-02-05 2.18932 UP 243 2.18932 2010-02-03 2010-02-09 5
SGD 2010-02-08 2.1875 DOWN 243 2.18932 2010-02-03 2010-02-09 5
SGD 2010-02-09 2.18961 UP 243 2.18961 2010-02-03 2010-02-09 5
AUD 2011-08-15 1.53733 STRT 224 2011-08-15 2011-08-19 5
AUD 2011-08-16 1.53521 DOWN 224 2011-08-15 2011-08-19 5
AUD 2011-08-17 1.53681 UP 224 1.53681 2011-08-15 2011-08-19 5
AUD 2011-08-18 1.53167 DOWN 224 1.53681 2011-08-15 2011-08-19 5
AUD 2011-08-19 1.55462 UP 224 1.55462 2011-08-15 2011-08-19 5

References

Row pattern matching was originally introduced by Oracle in 12c. It became part of ANSI SQL 2016, but it is not widely implemented. Snowflake is one of only a handful of other databases that have this feature.

Snowflake reference: https://docs.snowflake.com/en/sql-reference/constructs/match_recognize.html

Oracle reference: https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8956

Top comments (0)