Welcome to the SQL showdown series!
What is this and how does it work?
I'm committing to publishing a SQL challenge every day for 30 days. In each post I'll walk through my solution to the previous day's challenge, then follow up with a full description of the next day's challenge.
Write your own solution in the comments! Let's see who can come up with the most creative solutions.
I'll add connection details for a PostgreSQL database containing test data for the challenge. Solutions are by no means limited to PostgreSQL, but it's there if you want an easy way to test your query!
Challenge #3: Weather Data
Today I'll dive into a challenge using data from the Global Historical Climatology Network and the US Census Bureau. This challenge will involve a taste of PostGIS functions.
Here's the challenge:
Can you find the average temperature per month for every county in the United States, as well as the coldest and hottest temperatures that county has experienced over the year?
Here's an example to give you a better idea of the output you're after:
There are 3 tables in the day3 schema. The measurement, station, and county tables provide all the information needed to answer the challenge.
Here's a sample from the measurement table (values are in tenths of a degree Celsius):
And a sample from the station table:
And finally, here's the county table. The wkb_geometry column stores the PostGIS shape data of the county.
Hint
As a hint, there are 3 PostGIS functions that will come in handy here. You can find a listing of PostGIS functions here.
The 3 functions that might prove useful (a sketch of how they fit together follows the list):
- ST_SetSRID
- ST_Contains
- ST_MakePoint
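In case the function names alone are still too obscure, here's a rough sketch of one way they could combine to match each station to the county that contains it. The station coordinate column names and the SRID are assumptions on my part; only the wkb_geometry column is described above:
-- Sketch only: the longitude and latitude columns and SRID 4326 are assumptions
SELECT
    s.*,
    c.*
FROM
    day3.station s
    JOIN day3.county c ON ST_Contains(
        c.wkb_geometry,
        ST_SetSRID(ST_MakePoint(s.longitude, s.latitude), 4326)
    );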
Sandbox Connection Details
I have a PostgreSQL database ready for you to play with.
The password is the same as the username! To query the measurement table:
SELECT * FROM day3.measurement;
Solution for Challenge #2
This is the question we were trying to answer with yesterday's challenge:
Can you output the day's closing price, the closing price from yesterday, the variance between the two, as well as the average closing price for the current month?
On first reading the question, I immediately think of joins and group by. Getting the closing price from yesterday could potentially be done as a self-join, and the average closing price could be obtained through a group by.
As we saw with the solution to challenge #1, the group by would have to be done as a subquery if we want to return more than the average price.
SELECT
    s1.*,
    s2.close AS prior_close,
    avg_price.*
FROM
    day2.stockprice s1
    JOIN (
        SELECT
            DATE_TRUNC('month', day) AS month,
            AVG(close)
        FROM
            day2.stockprice
        GROUP BY
            month
    ) avg_price ON DATE_TRUNC('month', s1.day) = avg_price.month
    LEFT JOIN day2.stockprice s2 ON s1.day = s2.day + INTERVAL '1 DAY'
ORDER BY
    s1.day;
This mostly works, but it has a few problems. First of all, the self-join to get the prior day's closing price is flawed. Trading doesn't happen every day, after all, so there are going to be gaps that the query won't handle properly. A Monday row, for example, looks for a row dated Sunday, finds nothing, and ends up with a NULL prior close.
It turns out there's a better way to do this! Window functions have been part of the SQL standard since SQL:2003, and PostgreSQL added support for them in version 8.4, released in 2009.
Window functions are very useful, especially in analytical queries like this one. A short explanation is that they are similar to group by, but you can have several of them in one query, and each can define its own window differently.
Let's take a look at the query, and then I'll step through it.
WITH cte AS (
    SELECT
        close,
        LAG(close, 1) OVER (
            ORDER BY
                day
        ) AS prior_close,
        day,
        DATE_TRUNC('month', day) AS month
    FROM
        day2.stockprice
)
SELECT
    cte.close,
    cte.prior_close,
    cte.close - cte.prior_close AS variance,
    AVG(close) OVER m,
    cte.month,
    cte.day
FROM
    cte
WINDOW m AS (PARTITION BY month);
The query begins with a common table expression which, if you've never seen one before, is like a temporary table that exists only for the lifetime of the query.
Window function calls are followed by the OVER keyword. Inside the OVER clause you can establish a partition and an ordering, although neither is required.
The partition is similar to a group by, and defines what rows belong to each partition.
The ORDER BY part serves a dual purpose here. In addition to ordering the data within each partition, it also establishes the row range. The row range determines what rows are visible relative to the current row.
With an aggregate function like sum, the ORDER BY turns it into a running total that includes every row up to and including the current row. Functions like LAG operate relative to the current row, and looking back a certain number of rows only makes sense with a defined ordering.
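To make that concrete, here's a minimal sketch of a running total against the same day2.stockprice table:
-- Running total: the ORDER BY inside OVER limits each row's window
-- to all rows up to and including the current one
SELECT
    day,
    close,
    SUM(close) OVER (ORDER BY day) AS running_total
FROM
    day2.stockprice;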
To solve this challenge, I establish a window with a default partition including all rows, and look back 1 row (ordered by the day) to get the prior close.
To get the monthly average, I define another window partitioned by month (but without an ORDER BY, so all rows in each partition are part of the average).
The WINDOW keyword is just a convenience: it lets you define a window once, at the end of the query, and reuse it in several expressions without repeating the definition.
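As a quick illustration, a single named window can feed any number of functions; this sketch reuses one monthly partition three times:
-- One named window, reused by three aggregates
SELECT
    day,
    close,
    AVG(close) OVER m,
    MIN(close) OVER m,
    MAX(close) OVER m
FROM
    day2.stockprice
WINDOW m AS (PARTITION BY DATE_TRUNC('month', day));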
More resources
Window functions are a big topic, and I can't possibly do them justice in this post. Here are some great resources to learn more about them:
Good luck!
Have fun, and I can't wait to see what you come up with! I'll be back tomorrow with a solution to this problem and a new challenge.
Top comments (4)
Thanks for the hint, I've not done much with PostGIS for a while!
Your example data seems to have rows repeated for each station, which doesn't appear in your description, so my version below aggregates over these:
I'm also rounding the mean to a sensible number of decimal places; we certainly can't know it to ~20 significant figures!
Looks like a fun one!
Could I have access to the county table, please?
Oops, I was afraid I'd end up forgetting a table. Fixed!
Added a hint to this one, in case I went a little too obscure on this challenge.