There are a two things in IT that made me feel like my brain turned inside out when learning them. One of them is the SQL concept of row pattern matching.
Think of row pattern matching as doing regex between rows. Instead of finding single rows where some column has some value (or, more often, some combination of values like WHERE col_a>1 AND col_b='puzzle'
), we can find sets of rows that together constitute some interesting pattern.
A basic example
The canonical example is analyzing stock prices: Can we find stock that have seen a 5 or more day rally? That have rebounded for at least 5 days after declining for at least 5 days? The longest stretch of continuous price increase? None of these questions can be answered by a WHERE
clause alone, and even though the lag()
function might get you far, it gets messy quickly even if it can give you an answer.
Row pattern matching introduces some new concepts though, that we need to go through carefully in order to grasp them.
Instead of stock prices we will use currency rates, which have all the same properties. A basic row pattern matching query can look something like this:
SELECT currency, tstamp, value FROM sdr_rates
MATCH_RECOGNIZE(
PARTITION BY currency
ORDER BY tstamp
ALL ROWS PER MATCH
PATTERN ( up+ )
DEFINE up AS LAG(value)<value
);
We recognize the first line, but beyond that there are many new keywords.
All the action happens within the MATCH_RECOGNIZE
clause, which is the row pattern matching function. The first thing we do is to partition the data by the stock symbol. Partitioning here works similar to partitioning elsewhere, like in an OVER()
clause. We want to split the data by stock, we aren't interested in comparing different stocks to each other. Each stock should be seen as a separate time series. ORDER BY
is also quite simple, we specify how we should order the data. Here, we order the data by date (ascending, earliest first).
We will come back to the ALL ROWS PER MATCH
clause, but for now, I'll note that row pattern matching by default aggregates the data and only shows one line per match. Right now, we want to see all the rows.
The PATTERN
clause is the main regex-like clause. We specify we want to select one or more (the +
sign is basically regex) of something we call up
, which is defined on the next line.
The last line, DEFINE
, is where we define up
by a set of criteria. Any row that matches these criteria is labeled up
, and will match in the pattern we define. This definition uses yet another new keyword, but one we can probably guess PREV
. We specify UP
to be any row where the price
is higher than the previous row.
The output might look something like this:
CURRENCY | TSTAMP | VALUE |
---|---|---|
LKR | 1994-01-04 | 68.0266 |
LKR | 1994-01-05 | 68.1848 |
LKR | 1994-01-07 | 68.171 |
LKR | 1994-01-10 | 68.2884 |
LKR | 1994-01-12 | 68.1946 |
LKR | 1994-01-17 | 68.211 |
LKR | 1994-01-18 | 68.284 |
LKR | 1994-01-19 | 68.3353 |
LKR | 1994-01-21 | 68.5181 |
This might not be what you had expected though. The price is jumping all over the place. But it is exactly what we asked for. The currency LKR had several runs of increasing value, the longest stretch seems to be from the 17th to the 19th of April.
But what about the 12th? That's just one row!?! Well, trust me that the value on the 11th was lower than on the 12th, but we didn't actually ask to get that first row returned. We referenced it in the DEFINE
clause, but it isn't in the PATTERN
clause and so it isn't returned.
Expanding the example
In order to make the output a little more informative, let's expand the pattern we are looking for and introduce a specialized type of calculated field called MEASURES
.
SELECT * FROM sdr_rates
MATCH_RECOGNIZE(
PARTITION BY currency
ORDER BY tstamp
MEASURES CLASSIFIER() AS clf,
MATCH_NUMBER() AS mnum
ALL ROWS PER MATCH
PATTERN (strt up+)
DEFINE up AS LAG(value)<value
)
We have introduced a new item, strt
, in the PATTERN
clause, but it isn't defined in the DEFINE
clause. This might be unintuitive, but any clause not defined will reluctantly match all rows. So now, we return the "first" row of the pattern returned, and we can see the increase from start to finish. In other words, the 11th should be included in the result.
The MEASURES
field defines two extra columns, using special match recognize related functions. CLASSIFIER()
actually returns which part of the pattern the row matched - in our case either strt
or up
. MATCH_NUMBER()
enumerates the matches, so that we can easily see which rows relate to each other.
The resulting table is as follows:
CURRENCY | TSTAMP | VALUE | CLF | MNUM |
---|---|---|---|---|
LKR | 1994-01-03 | 67.9518 | STRT | 1 |
LKR | 1994-01-04 | 68.0266 | UP | 1 |
LKR | 1994-01-05 | 68.1848 | UP | 1 |
LKR | 1994-01-06 | 68.1103 | STRT | 2 |
LKR | 1994-01-07 | 68.171 | UP | 2 |
LKR | 1994-01-10 | 68.2884 | UP | 2 |
LKR | 1994-01-11 | 68.1027 | STRT | 3 |
LKR | 1994-01-12 | 68.1946 | UP | 3 |
LKR | 1994-01-13 | 68.1825 | STRT | 4 |
LKR | 1994-01-17 | 68.211 | UP | 4 |
LKR | 1994-01-18 | 68.284 | UP | 4 |
LKR | 1994-01-19 | 68.3353 | UP | 4 |
LKR | 1994-01-20 | 68.1867 | STRT | 5 |
LKR | 1994-01-21 | 68.5181 | UP | 5 |
Going further
Patterns can be a lot more advanced, we could for instance find W-patterns by defining down
, and searching for PATTERN ( STRT (DOWN UP){2,})
.
SELECT * FROM sdr_rates
MATCH_RECOGNIZE(
PARTITION BY currency
ORDER BY tstamp
MEASURES CLASSIFIER() AS clf,
MATCH_NUMBER() AS mnum,
MAX(up.value) AS max_up,
FIRST(tstamp) AS start_at,
FINAL LAST(tstamp) AS end_at,
FINAL COUNT(1) AS n_days
ALL ROWS PER MATCH
PATTERN ( strt ( down up ){2,} )
DEFINE up AS LAG(value)<value,
down AS LAG(value)>value
);
In addition to the expanded pattern, we now define some more measures: The max value of rows classified as UP
, the first timestamp of each match, the (final) last timestamp of the match (measures by default do not look ahead, the FINAL
keyword finds the actual last value. Similarly, we take the final row count of each match.
CURRENCY | TSTAMP | VALUE | CLF | MNUM | max_up | start_at | end_at | n_days |
---|---|---|---|---|---|---|---|---|
SGD | 2010-02-03 | 2.19037 | STRT | 243 | 2010-02-03 | 2010-02-09 | 5 | |
SGD | 2010-02-04 | 2.18179 | DOWN | 243 | 2010-02-03 | 2010-02-09 | 5 | |
SGD | 2010-02-05 | 2.18932 | UP | 243 | 2.18932 | 2010-02-03 | 2010-02-09 | 5 |
SGD | 2010-02-08 | 2.1875 | DOWN | 243 | 2.18932 | 2010-02-03 | 2010-02-09 | 5 |
SGD | 2010-02-09 | 2.18961 | UP | 243 | 2.18961 | 2010-02-03 | 2010-02-09 | 5 |
AUD | 2011-08-15 | 1.53733 | STRT | 224 | 2011-08-15 | 2011-08-19 | 5 | |
AUD | 2011-08-16 | 1.53521 | DOWN | 224 | 2011-08-15 | 2011-08-19 | 5 | |
AUD | 2011-08-17 | 1.53681 | UP | 224 | 1.53681 | 2011-08-15 | 2011-08-19 | 5 |
AUD | 2011-08-18 | 1.53167 | DOWN | 224 | 1.53681 | 2011-08-15 | 2011-08-19 | 5 |
AUD | 2011-08-19 | 1.55462 | UP | 224 | 1.55462 | 2011-08-15 | 2011-08-19 | 5 |
References
Row pattern matching was originally introduced by Oracle in 12c. It became part of ANSI SQL 2016, but it is not widely implemented. Snowflake is one of only a handful of other databases that have this feature.
Snowflake reference: https://docs.snowflake.com/en/sql-reference/constructs/match_recognize.html
Oracle reference: https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8956
Top comments (0)