Welcome to the SQL showdown series!
What is this and how does it work?
I'm committing to publishing a SQL challenge every day for 30 days. In each post I'll describe my solution to the previous day's challenge, and follow that up with a full description of the next day's challenge.
Write your own solution in the comments! Let's see who can come up with the most creative solutions.
I'll add connection details for a PostgreSQL database containing test data for the challenge. Solutions are by no means limited to PostgreSQL, but it's there if you want an easy way to test your query!
Challenge #4: Video Game Sales
This challenge uses a dataset of aggregated video game sales dating back to 1980, taken from the Kaggle open datasets archive.
Here's the challenge:
Can you produce a report that displays one year per row, and the aggregated global sales by genre for that year as columns?
There is only one table in the day4 schema. The videogame table contains each game's title, release year, genre, and sales figures.
The videogame table has the following columns:
- name
- platform
- year
- genre
- publisher
- na_sales
- eu_sales
- jp_sales
- global_sales (in millions)
Here's an example to give you a better idea of the output you're after:
Sandbox Connection Details
I have a PostgreSQL database ready for you to play with.
The password is the same as the username! To query the videogame table:
```sql
SELECT * FROM day4.videogame;
```
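If you want a head start on the pivot, one approach in plain SQL is conditional aggregation. Here's a sketch of the idea; the genre names are assumed, and you'd want one `FILTER` clause per genre actually present in the data:

```sql
-- One row per year, one column per genre (sketch; genre names assumed)
SELECT year,
       sum(global_sales) FILTER (WHERE genre = 'Action') AS action,
       sum(global_sales) FILTER (WHERE genre = 'Sports') AS sports,
       sum(global_sales) FILTER (WHERE genre = 'Puzzle') AS puzzle
FROM day4.videogame
GROUP BY year
ORDER BY year;
```

The `FILTER` clause on aggregates is a PostgreSQL feature (9.4+); the more portable spelling is `sum(CASE WHEN genre = 'Action' THEN global_sales END)`.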
Solution for Challenge #3
This is the question we were trying to answer with yesterday's SQL challenge:
Can you find the average temperature per month for every county in the United States, as well as the coldest and hottest temperatures that county has experienced over the year?
Part of this question sounds like a candidate for window functions, but how would we do this by county? The station and measurement tables can be joined through the idx column, but there doesn't appear to be a way to join to county.
Or is there?
```sql
WITH cte AS (
    SELECT c.geo_id,
           s.idx,
           c.name,
           m.value,
           m.taken,
           date_trunc('month', m.taken) AS month
    FROM day3.county c
    -- ST_MakePoint takes (x, y), i.e. (longitude, latitude)
    JOIN day3.station s
      ON ST_Contains(c.wkb_geometry, ST_SetSRID(ST_MakePoint(s.long, s.lat), 4326))
    JOIN day3.measurement m
      ON s.idx = m.idx
)
SELECT cte.*,
       avg(value) OVER geo_m,
       min(value) OVER geo_y,
       max(value) OVER geo_y
FROM cte
WINDOW geo_m AS (PARTITION BY geo_id, month),
       geo_y AS (PARTITION BY geo_id)
ORDER BY geo_id, taken;
```
We saw common table expressions and window functions in the solution to challenge #2, so we're getting a little more practice with those here.
The interesting part is the spatial join that connects weather polling stations in the station table with their respective county. PostGIS provides the ST_Contains function, and I use it here to determine whether the county polygon contains the point defined by the station's longitude and latitude.
The ST_SetSRID function tells PostGIS which coordinate system the point we just created is in. SRID 4326 corresponds to WGS 84, the World Geodetic System, and is one of the most common SRID values you'll see.
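As a minimal, self-contained illustration of the same predicate (note that `ST_MakePoint` takes x = longitude first, then y = latitude):

```sql
-- Does a unit square contain the point (0.5, 0.5)?
SELECT ST_Contains(
         ST_GeomFromText('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))', 4326),
         ST_SetSRID(ST_MakePoint(0.5, 0.5), 4326)
       );  -- returns true
```

Both geometries must share the same SRID, or PostGIS will raise an error rather than silently compare mismatched coordinate systems.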
More about PostGIS
GIS in SQL is a topic that I've barely touched on myself, but there are quite a few great resources out there on the subject.
Good luck!
Have fun, and I can't wait to see what you come up with! I'll be back tomorrow with a solution to this problem and a new challenge.
Top comments (1)
I'm a strong believer in Hadley Wickham's "tidy data" and think this sort of transform doesn't belong in the database! I presume you're going to use the `tablefunc` module, but then columns can appear and disappear and break other downstream code/queries. I'd therefore go with the longer and more explicit version, which also lets me place the `Misc` genre at the end rather than awkwardly in the middle! I found it useful to run a quick exploratory query first to find out the possible values, with a `LIMIT` on in case of crazy data or schema misunderstandings. It also lets me do a quick check of the various `*_sales` columns adding up to the `global_sales` column, again to make sure I understand the question and data.
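A sketch of the kind of exploratory check described above (assumed, not the commenter's exact SQL):

```sql
-- Inspect the distinct genres, and compare regional sales against global
SELECT genre,
       count(*)                                AS n,
       sum(na_sales + eu_sales + jp_sales)     AS regional_total,
       sum(global_sales)                       AS global_total
FROM day4.videogame
GROUP BY genre
ORDER BY n DESC
LIMIT 50;
```

If `regional_total` falls short of `global_total`, sales from regions outside NA/EU/JP are folded into the global figure, which is worth knowing before building the report.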