The best SQL function you never heard of

#sql #snowflake #oracle

There are a two things in IT that made me feel like my brain turned inside out when learning them. One of them is the SQL concept of row pattern matching.

Think of row pattern matching as doing regex between rows. Instead of finding single rows where some column has some value (or, more often, some combination of values like WHERE col_a>1 AND col_b='puzzle'), we can find sets of rows that together constitute some interesting pattern.

A basic example

The canonical example is analyzing stock prices: Can we find stock that have seen a 5 or more day rally? That have rebounded for at least 5 days after declining for at least 5 days? The longest stretch of continuous price increase? None of these questions can be answered by a WHERE clause alone, and even though the lag() function might get you far, it gets messy quickly even if it can give you an answer.

Row pattern matching introduces some new concepts though, that we need to go through carefully in order to grasp them.

Instead of stock prices we will use currency rates, which have all the same properties. A basic row pattern matching query can look something like this:

SELECT currency, tstamp, value FROM sdr_rates
MATCH_RECOGNIZE(
    PARTITION BY currency
    ORDER BY tstamp
    ALL ROWS PER MATCH
    PATTERN ( up+ )
    DEFINE up AS LAG(value)<value
);

We recognize the first line, but beyond that there are many new keywords.

All the action happens within the MATCH_RECOGNIZE clause, which is the row pattern matching function. The first thing we do is to partition the data by the stock symbol. Partitioning here works similar to partitioning elsewhere, like in an OVER() clause. We want to split the data by stock, we aren't interested in comparing different stocks to each other. Each stock should be seen as a separate time series. ORDER BY is also quite simple, we specify how we should order the data. Here, we order the data by date (ascending, earliest first).

We will come back to the ALL ROWS PER MATCH clause, but for now, I'll note that row pattern matching by default aggregates the data and only shows one line per match. Right now, we want to see all the rows.

The PATTERN clause is the main regex-like clause. We specify we want to select one or more (the + sign is basically regex) of something we call up, which is defined on the next line.

The last line, DEFINE, is where we define up by a set of criteria. Any row that matches these criteria is labeled up, and will match in the pattern we define. This definition uses yet another new keyword, but one we can probably guess PREV. We specify UP to be any row where the price is higher than the previous row.

The output might look something like this:

CURRENCY	TSTAMP	VALUE
LKR	1994-01-04	68.0266
LKR	1994-01-05	68.1848
LKR	1994-01-07	68.171
LKR	1994-01-10	68.2884
LKR	1994-01-12	68.1946
LKR	1994-01-17	68.211
LKR	1994-01-18	68.284
LKR	1994-01-19	68.3353
LKR	1994-01-21	68.5181

This might not be what you had expected though. The price is jumping all over the place. But it is exactly what we asked for. The currency LKR had several runs of increasing value, the longest stretch seems to be from the 17th to the 19th of April.

But what about the 12th? That's just one row!?! Well, trust me that the value on the 11th was lower than on the 12th, but we didn't actually ask to get that first row returned. We referenced it in the DEFINE clause, but it isn't in the PATTERN clause and so it isn't returned.

Expanding the example

In order to make the output a little more informative, let's expand the pattern we are looking for and introduce a specialized type of calculated field called MEASURES.

SELECT * FROM sdr_rates
MATCH_RECOGNIZE(
    PARTITION BY currency
    ORDER BY tstamp
    MEASURES CLASSIFIER() AS clf,
    MATCH_NUMBER() AS mnum
    ALL ROWS PER MATCH
    PATTERN (strt up+)
    DEFINE up AS LAG(value)<value
)

We have introduced a new item, strt, in the PATTERN clause, but it isn't defined in the DEFINE clause. This might be unintuitive, but any clause not defined will reluctantly match all rows. So now, we return the "first" row of the pattern returned, and we can see the increase from start to finish. In other words, the 11th should be included in the result.

The MEASURES field defines two extra columns, using special match recognize related functions. CLASSIFIER() actually returns which part of the pattern the row matched - in our case either strt or up. MATCH_NUMBER() enumerates the matches, so that we can easily see which rows relate to each other.

The resulting table is as follows:

CURRENCY	TSTAMP	VALUE	CLF	MNUM
LKR	1994-01-03	67.9518	STRT	1
LKR	1994-01-04	68.0266	UP	1
LKR	1994-01-05	68.1848	UP	1
LKR	1994-01-06	68.1103	STRT	2
LKR	1994-01-07	68.171	UP	2
LKR	1994-01-10	68.2884	UP	2
LKR	1994-01-11	68.1027	STRT	3
LKR	1994-01-12	68.1946	UP	3
LKR	1994-01-13	68.1825	STRT	4
LKR	1994-01-17	68.211	UP	4
LKR	1994-01-18	68.284	UP	4
LKR	1994-01-19	68.3353	UP	4
LKR	1994-01-20	68.1867	STRT	5
LKR	1994-01-21	68.5181	UP	5

Going further

Patterns can be a lot more advanced, we could for instance find W-patterns by defining down, and searching for PATTERN ( STRT (DOWN UP){2,}).

SELECT * FROM sdr_rates
MATCH_RECOGNIZE(
    PARTITION BY currency
    ORDER BY tstamp
    MEASURES CLASSIFIER() AS clf,
    MATCH_NUMBER() AS mnum,
    MAX(up.value) AS max_up,
    FIRST(tstamp) AS start_at,
    FINAL LAST(tstamp) AS end_at,
    FINAL COUNT(1) AS n_days
    ALL ROWS PER MATCH
    PATTERN ( strt ( down up ){2,} )
    DEFINE up AS LAG(value)<value,
        down AS LAG(value)>value
);

In addition to the expanded pattern, we now define some more measures: The max value of rows classified as UP, the first timestamp of each match, the (final) last timestamp of the match (measures by default do not look ahead, the FINAL keyword finds the actual last value. Similarly, we take the final row count of each match.

CURRENCY	TSTAMP	VALUE	CLF	MNUM	max_up	start_at	end_at	n_days
SGD	2010-02-03	2.19037	STRT	243		2010-02-03	2010-02-09	5
SGD	2010-02-04	2.18179	DOWN	243		2010-02-03	2010-02-09	5
SGD	2010-02-05	2.18932	UP	243	2.18932	2010-02-03	2010-02-09	5
SGD	2010-02-08	2.1875	DOWN	243	2.18932	2010-02-03	2010-02-09	5
SGD	2010-02-09	2.18961	UP	243	2.18961	2010-02-03	2010-02-09	5
AUD	2011-08-15	1.53733	STRT	224		2011-08-15	2011-08-19	5
AUD	2011-08-16	1.53521	DOWN	224		2011-08-15	2011-08-19	5
AUD	2011-08-17	1.53681	UP	224	1.53681	2011-08-15	2011-08-19	5
AUD	2011-08-18	1.53167	DOWN	224	1.53681	2011-08-15	2011-08-19	5
AUD	2011-08-19	1.55462	UP	224	1.55462	2011-08-15	2011-08-19	5

References

Row pattern matching was originally introduced by Oracle in 12c. It became part of ANSI SQL 2016, but it is not widely implemented. Snowflake is one of only a handful of other databases that have this feature.

Snowflake reference: https://docs.snowflake.com/en/sql-reference/constructs/match_recognize.html

Oracle reference: https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8956