In this post, we are going to continue our discussion by looking at calculations that involve getting the value from the previous/next row. We would see how using ORDER BY
clause in an SQL window function can make these calculations pretty easy and possible.
Let's dive right in.
Our Big Question
One particularly important insight is the increase is the way the number of cases has been increasing. Suppose we had the following question:
What is the increase in cases recorded in the UK for each day in the month of June?
Assuming we had access to only the total number of confirmed cases one way we could answer this to generate a table where each row gives us the difference between the case recorded today and the case recorded previous day. Soemthing like this:
date | total_confirmed_cases | daily_case_recorded |
---|---|---|
2020-06-01 | v1 | 0 |
2020-06-02 | v2 | v2 - v1 |
2020-06-03 | v3 | v3 - v2 |
2020-06-04 | v4 | v4 - v3 |
. | . | . |
. | . | . |
. | . | . |
n | vn | vn - vn-1 |
Notice how the value in each row in the total_confirmed_cases column uses the value from the previous column to calculate its value.
Note that we assumed that we had access to only the total_confirmed_cases. Google COVID Dataset already gives us the daily_confirmed_cases which is the same as what we want to calculate.
In order to solve this, we would use the LAG aggregate function and ORDER BY in our window function.
The ORDER BY in a window function
Using the ORDER BY as an argument in the window function specifies how you want the window function to sort the rows in the ResultSet while performing calculations.
For example, in order to answer the question we have above, it is important that we sort the result by date so that when getting the value from the previous row, we would be getting the confirmed case from the previous day.
Using the ORDER BY clause makes it possible for us to specify that.
We can use the ORDER BY like so:
OVER(ORDER BY column_1, column_2, ..., column_n [ASC|DESC])
LAG aggregate function
The LAG function is a special aggregate function the works only when used with window functions. It takes a column as an argument and returns the previous value of that column in the ResultSet.
We use the LAG function with the ORDER BY clause like so:
LAG(column_name) OVER(ORDER BY column_1, column_2, ..., column_n {ASC|DESC|)
The SQL QUERY
Combining these together we can answer our Big question using the following query.
SELECT
date,
countries_and_territories as country,
confirmed_cases,
confirmed_cases - LAG(confirmed_cases) OVER(
ORDER BY
date
) as daily_increase
FROM
`bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`
WHERE
date >= "2020-06-01"
AND date <= "2020-06-30"
AND geo_id = "UK";
This filters the resultset and generates the increase in a number of cases using:
confirmed_cases - LAG(confirmed_cases) OVER(
ORDER BY
date
) as daily_increase
The window function orders by date while the LAG(confirmed_cases) returns cases confirmed for the previous row.
This generates the following table.
date | country | confirmed_cases | daily_increase |
---|---|---|---|
2020-06-01 | United_Kingdom | 274762 | NULL |
2020-06-02 | United_Kingdom | 276332 | 1570 |
2020-06-03 | United_Kingdom | 277985 | 1653 |
2020-06-04 | United_Kingdom | 279856 | 1871 |
2020-06-05 | United_Kingdom | 281661 | 1805 |
2020-06-06 | United_Kingdom | 283311 | 1650 |
2020-06-07 | United_Kingdom | 284868 | 1557 |
2020-06-08 | United_Kingdom | 286194 | 1326 |
2020-06-09 | United_Kingdom | 287399 | 1205 |
2020-06-10 | United_Kingdom | 289140 | 1741 |
2020-06-11 | United_Kingdom | 290143 | 1003 |
2020-06-12 | United_Kingdom | 291409 | 1266 |
2020-06-13 | United_Kingdom | 292950 | 1541 |
2020-06-14 | United_Kingdom | 294375 | 1425 |
2020-06-15 | United_Kingdom | 295889 | 1514 |
2020-06-16 | United_Kingdom | 296857 | 968 |
2020-06-17 | United_Kingdom | 298136 | 1279 |
2020-06-18 | United_Kingdom | 299251 | 1115 |
2020-06-19 | United_Kingdom | 300469 | 1218 |
2020-06-20 | United_Kingdom | 301815 | 1346 |
2020-06-21 | United_Kingdom | 303110 | 1295 |
2020-06-22 | United_Kingdom | 304331 | 1221 |
2020-06-23 | United_Kingdom | 305289 | 958 |
2020-06-24 | United_Kingdom | 306210 | 921 |
2020-06-25 | United_Kingdom | 306862 | 652 |
2020-06-26 | United_Kingdom | 307980 | 1118 |
2020-06-27 | United_Kingdom | 309360 | 1380 |
2020-06-28 | United_Kingdom | 310250 | 890 |
2020-06-29 | United_Kingdom | 311151 | 901 |
2020-06-30 | United_Kingdom | 311965 | 814 |
One thing that you may have noticed is that the first value in the table is NULL.
Remember what we said in previous articles that window functions use the ResultSet generated after filtering in the WHERE clause, well, this is one of the consequences.
At the first instance, the LAG function returns the value of the specified column in the previous row but the first row does not have a previous row. Therefore, we get NULL( which means non-existent).
But we know that the confirmed case for the first day of July exists, so how do we get that?
One quick way is to modify the filter in the WHERE clause so that it returns data for a day before July 1st like so:
WHERE
date >= "2020-05-31"
AND date <= "2020-06-30"
AND geo_id = "UK";
Then we omit the first row in our resultset by adding OFFSET 1
. The new query would now become
SELECT
date,
countries_and_territories as country,
confirmed_cases,
confirmed_cases - LAG(confirmed_cases) OVER(
ORDER BY
date
) as daily_increase
FROM
`bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`
where
date >= "2020-05-31"
AND date <= "2020-06-30"
AND geo_id = "UK"
LIMIT
31 OFFSET 1;
This would return the correct data for the first row
date | country | confirmed_cases | daily_increase |
---|---|---|---|
2020-06-01 | United_Kingdom | 274762 | 1936 |
2020-06-02 | United_Kingdom | 276332 | 1570 |
2020-06-03 | United_Kingdom | 277985 | 1653 |
2020-06-04 | United_Kingdom | 279856 | 1871 |
2020-06-05 | United_Kingdom | 281661 | 1805 |
2020-06-06 | United_Kingdom | 283311 | 1650 |
2020-06-07 | United_Kingdom | 284868 | 1557 |
2020-06-08 | United_Kingdom | 286194 | 1326 |
2020-06-09 | United_Kingdom | 287399 | 1205 |
2020-06-10 | United_Kingdom | 289140 | 1741 |
2020-06-11 | United_Kingdom | 290143 | 1003 |
2020-06-12 | United_Kingdom | 291409 | 1266 |
2020-06-13 | United_Kingdom | 292950 | 1541 |
2020-06-14 | United_Kingdom | 294375 | 1425 |
2020-06-15 | United_Kingdom | 295889 | 1514 |
2020-06-16 | United_Kingdom | 296857 | 968 |
2020-06-17 | United_Kingdom | 298136 | 1279 |
2020-06-18 | United_Kingdom | 299251 | 1115 |
2020-06-19 | United_Kingdom | 300469 | 1218 |
2020-06-20 | United_Kingdom | 301815 | 1346 |
2020-06-21 | United_Kingdom | 303110 | 1295 |
2020-06-22 | United_Kingdom | 304331 | 1221 |
2020-06-23 | United_Kingdom | 305289 | 958 |
2020-06-24 | United_Kingdom | 306210 | 921 |
2020-06-25 | United_Kingdom | 306862 | 652 |
2020-06-26 | United_Kingdom | 307980 | 1118 |
2020-06-27 | United_Kingdom | 309360 | 1380 |
2020-06-28 | United_Kingdom | 310250 | 890 |
2020-06-29 | United_Kingdom | 311151 | 901 |
2020-06-30 | United_Kingdom | 311965 | 814 |
Now we can generate our chart:
You can interact with the chart by following this link.
More on LAG, intro to LEAD
The LAG function takes an optional integer as value as its second argument this integer indicates how many rows back we want to get. This number is 1 by default, that's why we get the previous row( 1 row before the current row).
Similar to the LAG function is the LEAD function which takes the value of the next row. It has the same constraints and syntax of the LAG function.
So LEAD(col_name, 2) indicate we want to get a value 2 rows after the current row from col_name while LEAD(col_name) indicate we want the next row value.
Conclusion
In this article, we've learnt how to perform a calculation that involves getting a single value from a row before or after our current row.
That's great but what if we want to perform an aggregation based on more than one row before or after our current row? We would see how to do that using ROWS PRECEDING and ROWS FOLLOWING in the next article of our series.
See you soon.
Latest comments (0)