Welcome to the SQL showdown series!
What is this and how does it work?
I'm committing to publishing a SQL challenge every day for 30 days. In each post I'll walk through my solution to the previous day's challenge, then follow up with a full description of the next day's challenge.
Write your own solution in the comments! Let's see who can come up with the most creative solutions.
I'll add connection details for a PostgreSQL database containing test data for the challenge. Solutions are by no means limited to PostgreSQL, but it's there if you want an easy way to test your query!
Challenge #3: Weather Data
Today I'll dive into a challenge using data from the Global Historical Climatology Network and the US Census Bureau. This challenge will involve a taste of PostGIS functions.
Here's the challenge:
Can you find the average temperature per month for every county in the United States, as well as the coldest and hottest temperatures that county has experienced over the year?
Here's an example to give you a better idea of the output you're after:
There are 3 tables in the day3 schema. The measurement, station, and county tables provide all the information needed to answer the challenge.
Here's a sample from the measurement table (values are in tenths of a degree Celsius):
And a sample from the station table:
And finally, here's the county table. The wkb_geometry column stores the PostGIS shape data of the county.
Hint
As a hint, there are 3 PostGIS functions that will come in handy here. You can find a listing of PostGIS functions here.
The 3 functions that might prove useful (a sketch of how they fit together follows the list):
- ST_SetSRID
- ST_Contains
- ST_MakePoint
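In case the function names alone are still too obscure, here's a rough sketch of one way they could combine to match each station to the county that contains it. The station coordinate column names and the SRID are assumptions on my part; only the wkb_geometry column is described above:
-- Sketch only: the longitude and latitude columns and SRID 4326 are assumptions
SELECT
    s.*,
    c.*
FROM
    day3.station s
    JOIN day3.county c ON ST_Contains(
        c.wkb_geometry,
        ST_SetSRID(ST_MakePoint(s.longitude, s.latitude), 4326)
    );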
Sandbox Connection Details
I have a PostgreSQL database ready for you to play with.
The password is the same as the username! To query the measurement table:
SELECT * FROM day3.measurement;
Solution for Challenge #2
This is the question we were trying to answer with yesterday's challenge:
Can you output the day's closing price, the closing price from yesterday, the variance between the two, as well as the average closing price for the current month?
On first reading the question, I immediately think of joins and group by. Getting the closing price from yesterday could potentially be done as a self-join, and the average closing price could be obtained through a group by.
As we saw with the solution to challenge #1, the group by would have to be done as a subquery if we want to return more than the average price.
SELECT
    s1.*,
    s2.close AS prior_close,
    avg_price.*
FROM
    day2.stockprice s1
    JOIN (
        SELECT
            DATE_TRUNC('month', day) AS month,
            AVG(close)
        FROM
            day2.stockprice
        GROUP BY
            month
    ) avg_price ON DATE_TRUNC('month', s1.day) = avg_price.month
    LEFT JOIN day2.stockprice s2 ON s1.day = s2.day + INTERVAL '1 DAY'
ORDER BY
    s1.day;
This mostly works, but it has a few problems. First of all, the self-join to get the prior day's closing price is flawed. Trading doesn't happen every day, after all, so there are going to be gaps that the query won't handle properly. A Monday row, for example, looks for a row dated Sunday, finds nothing, and ends up with a NULL prior close.
It turns out there's a better way to do this! Window functions have been part of the SQL standard since SQL:2003, and PostgreSQL added support for them in version 8.4, released in 2009.
Window functions are very useful, especially in analytical queries like this one. A short explanation is that they are similar to group by, but you can have several of them in one query, and each can define its own window differently.
Let's take a look at the query, and then I'll step through it.
WITH cte AS (
    SELECT
        close,
        LAG(close, 1) OVER (
            ORDER BY
                day
        ) AS prior_close,
        day,
        DATE_TRUNC('month', day) AS month
    FROM
        day2.stockprice
)
SELECT
    cte.close,
    cte.prior_close,
    cte.close - cte.prior_close AS variance,
    AVG(close) OVER m,
    cte.month,
    cte.day
FROM
    cte
WINDOW m AS (PARTITION BY month);
The query begins with a common table expression which, if you've never seen one before, is like a temporary table that exists only for the lifetime of the query.
Window function calls are followed by the OVER keyword. Inside the OVER clause you can establish a partition and an ordering, although neither is required.
The partition is similar to a group by, and defines what rows belong to each partition.
The ORDER BY part serves a dual purpose here. In addition to ordering the data within each partition, it also establishes the row range. The row range determines what rows are visible relative to the current row.
With an aggregate function like sum, the ORDER BY turns it into a running total that includes every row up to and including the current row. Functions like LAG operate relative to the current row, and looking back a certain number of rows only makes sense with a defined ordering.
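To make that concrete, here's a minimal sketch of a running total against the same day2.stockprice table:
-- Running total: the ORDER BY inside OVER limits each row's window
-- to all rows up to and including the current one
SELECT
    day,
    close,
    SUM(close) OVER (ORDER BY day) AS running_total
FROM
    day2.stockprice;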
To solve this challenge, I establish a window with a default partition including all rows, and look back 1 row (ordered by the day) to get the prior close.
To get the monthly average, I define another window partitioned by month (but without an ORDER BY, so all rows in each partition are part of the average).
The WINDOW keyword is just a convenience: it lets you define a window once, at the end of the query, and reuse it in several expressions without repeating the definition.
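As a quick illustration, a single named window can feed any number of functions; this sketch reuses one monthly partition three times:
-- One named window, reused by three aggregates
SELECT
    day,
    close,
    AVG(close) OVER m,
    MIN(close) OVER m,
    MAX(close) OVER m
FROM
    day2.stockprice
WINDOW m AS (PARTITION BY DATE_TRUNC('month', day));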
More resources
Window functions are a big topic, and I can't possibly do them justice in this post. Here are some great resources to learn more about them:
Good luck!
Have fun, and I can't wait to see what you come up with! I'll be back tomorrow with a solution to this problem and a new challenge.
Top comments (4)
Thanks for the hint, I've not done much with PostGIS for a while!
Your example data seems to have rows repeated for each station, which doesn't appear in your description, so my version below aggregates over these:
I'm also rounding the mean to a sensible number of decimal places; we certainly can't know it to ~20 significant figures!
Looks like a fun one!
Could I have access to the county table, please?
Oops, I was afraid I'd end up forgetting a table. Fixed!
Added a hint to this one, in case I went a little too obscure on this challenge.