zchtodd
#SQL30 Day 4: Video Game Sales

Welcome to the SQL showdown series!

What is this and how does it work?

I'm committing to publishing a SQL challenge every day for 30 days. In each post I'll walk through my solution to the previous day's challenge, followed by a full description of the next challenge.

Write your own solution in the comments! Let's see who can come up with the most creative solutions.

I'll add connection details for a PostgreSQL database containing test data for the challenge. Solutions are by no means limited to PostgreSQL, but it's there if you want an easy way to test your query!

Challenge #4: Video Game Sales

This challenge uses a dataset of aggregated video game sales dating back to 1980, which comes from the Kaggle open datasets archive.

Here's the challenge:

Can you produce a report that displays one year per row, and the aggregated global sales by genre for that year as columns?

There is only one table in the day4 schema. The videogame table contains each game's title, platform, release year, genre, publisher, and sales figures.

The videogame table has the following columns:

  • name
  • platform
  • year
  • genre
  • publisher
  • na_sales
  • eu_sales
  • jp_sales
  • other_sales
  • global_sales

All of the _sales columns are expressed in millions of units.

Here's an example to give you a better idea of the output you're after:

[Image: example output showing one row per year, with a column of total global sales for each genre]

Sandbox Connection Details

I have a PostgreSQL database ready for you to play with.

[Image: sandbox connection details]

The password is the same as the username! To query the videogame table:

SELECT * FROM day4.videogame;

Solution for Challenge #3

This is the question we were trying to answer with yesterday's SQL challenge:

Can you find the average temperature per month for every county in the United States, as well as the coldest and hottest temperatures that county has experienced over the year?

Part of this question sounds like a candidate for window functions, but how would we do this by county? The station and measurement tables can be joined through the idx column, but there doesn't appear to be a way to join to county.

Or is there?

WITH cte AS
(
       SELECT geo_id,
              s.idx,
              name,
              value,
              taken,
              date_trunc('month', taken) AS month
       FROM   day3.county c
       -- Spatial join: keep stations whose point falls inside the county polygon.
       -- Note that ST_MakePoint takes (x, y), i.e. (longitude, latitude).
       JOIN   day3.station s
       ON     ST_Contains(c.wkb_geometry, ST_SetSRID(ST_MakePoint(s.long, s.lat), 4326))
       JOIN   day3.measurement m
       ON     s.idx = m.idx
)
SELECT   cte.*,
         avg(value) OVER geo_m AS monthly_avg,
         min(value) OVER geo_y AS yearly_min,
         max(value) OVER geo_y AS yearly_max
FROM     cte
WINDOW   geo_m AS (PARTITION BY geo_id, month),
         geo_y AS (PARTITION BY geo_id)
ORDER BY geo_id,
         taken;

We saw common table expressions and window functions in the solution to challenge #2, so we're getting a little more practice with those here.
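One bit of syntax worth calling out is the WINDOW clause, which names a window definition once so that several aggregates can share it. Here's a minimal sketch, using a hypothetical readings table:

-- Hypothetical readings(city, taken, value) table.
-- The named window "w" is defined once and reused by both aggregates.
SELECT city,
       taken,
       value,
       avg(value) OVER w AS city_avg,
       max(value) OVER w AS city_max
FROM   readings
WINDOW w AS (PARTITION BY city);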

The interesting part is the spatial join that connects each weather station in the station table with its county. PostGIS provides the ST_Contains function, which I use here to determine whether the county polygon contains the point defined by the station's latitude and longitude.
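As a quick illustration of the semantics, ST_Contains(a, b) returns true when geometry b lies entirely inside geometry a:

-- A 10x10 square contains the point (5, 5).
SELECT ST_Contains(
    ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'),
    ST_MakePoint(5, 5)
);  -- returns true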

The ST_SetSRID function tells PostGIS which coordinate system the point we just created is in. SRID 4326 corresponds to WGS 84 (the World Geodetic System) and is one of the most common SRID values you'll see.
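For example, building a WGS 84 point at roughly New York City's coordinates (illustrative values only) looks like this:

-- ST_MakePoint takes (x, y), so longitude comes first.
SELECT ST_SetSRID(ST_MakePoint(-74.0060, 40.7128), 4326);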

More about PostGIS

GIS in SQL is a topic that I've barely touched on myself, but there are quite a few great resources out there on the subject.

Good luck!

Have fun, and I can't wait to see what you come up with! I'll be back tomorrow with a solution to this problem and a new challenge.

Top comments (1)

Sam Mason

I'm a strong believer in Hadley Wickham's "tidy data" and think this sort of transform doesn't belong in the database! I presume you're going to use the tablefunc module, but then columns can appear and disappear and break downstream code/queries. I'd therefore go with the longer and more explicit:

SELECT year,
  SUM(CASE genre WHEN 'Action'       THEN global_sales END) AS "Action",
  SUM(CASE genre WHEN 'Adventure'    THEN global_sales END) AS "Adventure",
  SUM(CASE genre WHEN 'Fighting'     THEN global_sales END) AS "Fighting",
  SUM(CASE genre WHEN 'Platform'     THEN global_sales END) AS "Platform",
  SUM(CASE genre WHEN 'Puzzle'       THEN global_sales END) AS "Puzzle",
  SUM(CASE genre WHEN 'Racing'       THEN global_sales END) AS "Racing",
  SUM(CASE genre WHEN 'Role-Playing' THEN global_sales END) AS "Role-Playing",
  SUM(CASE genre WHEN 'Shooter'      THEN global_sales END) AS "Shooter",
  SUM(CASE genre WHEN 'Simulation'   THEN global_sales END) AS "Simulation",
  SUM(CASE genre WHEN 'Sports'       THEN global_sales END) AS "Sports",
  SUM(CASE genre WHEN 'Strategy'     THEN global_sales END) AS "Strategy",
  SUM(CASE genre WHEN 'Misc'         THEN global_sales END) AS "Misc"
FROM videogame
GROUP BY 1
ORDER BY 1

which also lets me place the Misc genre at the end rather than awkwardly in the middle!
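For reference, the crosstab version might look roughly like this sketch (it needs CREATE EXTENSION tablefunc, and the declared column types here are a guess at the actual schema). Note how the output column list still has to spell out every genre, in the order the category query returns them, which is exactly the fragility I mean:

-- Requires: CREATE EXTENSION tablefunc;
SELECT *
FROM crosstab(
  $$ SELECT year, genre, SUM(global_sales)::numeric
     FROM day4.videogame
     GROUP BY 1, 2
     ORDER BY 1, 2 $$,
  $$ SELECT DISTINCT genre FROM day4.videogame ORDER BY 1 $$
) AS ct(year int,
        "Action" numeric, "Adventure" numeric, "Fighting" numeric,
        "Misc" numeric, "Platform" numeric, "Puzzle" numeric,
        "Racing" numeric, "Role-Playing" numeric, "Shooter" numeric,
        "Simulation" numeric, "Sports" numeric, "Strategy" numeric);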

I found it useful to run this:

SELECT genre, SUM(global_sales), COUNT(*),
  SUM(na_sales + eu_sales + jp_sales + other_sales)
FROM videogame
GROUP BY 1 ORDER BY 1 LIMIT 50;

first, to find out the possible values, with the LIMIT on in case of crazy data or schema misunderstandings. It also lets me do a quick check that the various _sales columns add up to the global_sales column, again to make sure I understand the question and the data.