DEV Community

loading...

Finding the Top N per Group with SQL

noam sauer-utley
∞ autistic dev with buggy wetware 🧬 they ☿
Originally published at Medium on ・7 min read

Note: this post was originally published on Medium in October, 2019

Solving Top N per Group with SQL

Getting to Know SQL

When I first encountered SQL, everything about it felt… strange. Trying to use it felt alien in some way I couldn’t quite put my finger on.

Dr. McCoy inspects an alien with a tricorder.

Trying to understand what was so different about this language, I spent some time reading aboutDeclarative vs. Imperative Programming, as well as Codd’s A Relational Model of Data for Large Shared Data Banks (Warning: PDF download link), and some basic info about Set Theory.

It quickly became obvious how powerful and elegant SQL could be when it came to handling larger volumes of data. I could just tell SQL exactly what data I wanted, and it would utilize efficient sets to select it, instead of iterating through each individual object and getting swamped.

Tribbles fall onto Captain Kirk from an overhead storage locker.

Still, the #1 most helpful thing I did was spend a weekend writing queries and seeing the results, tweaking things and observing the changes.

I was introduced to Five Thirty Eight’s collection of data[¹][²] about The Daily Show with Jon Stewart through a Flatiron School bonus lab[³], and found it a perfect playground to query around, using beginner-friendly SQLite.

After parsing the CSV data into a SQLite database, I started trying to pull out simple bits of data.

Spock pushes some buttons.

First, as a test run, I wrote a simple query selecting all the shows from 1999.

I printed the results to my console, and received a neat list of info about all 166 episodes from 1999. Success!

I then tried a more specific query, and asked for the five guests with the most repeat appearances. A quick google search determined that the results I received were accurate. So far so good!

I tried again, querying the top occupation for guests, and got back “Actor” (unsurprising).

I played around with more simple queries, getting all the appearances from certain celebs, getting the number of guests with the same first name, the year with the most guest appearances overall, etc.

I was getting the hang of this SQL thing!

Spock holds a calm looking dog dressed up as an alien

Then I tried a more advanced query. Instead of asking for the top guest or occupation out of the whole table, I wanted to find the top occupation per year. I figured it would be a little bit more complicated, but not much.

Except, when I started trying to write that query, I could only get a single result. I could get the most popular occupation of the whole database, or the year with the most different occupation, or something else that represented the database as a whole, but I couldn’t get a list of data representing information from each individual year.

I had no idea how to direct SQL to get the data I wanted.

Captain Kirk looks confused and a yeoman shrugs while surrounded by tribbles.

I needed SQL to get me a layered result — to organize the data by year, then order the information from each year and return specific positions from that ordered list as results.

Every approach I could think of involved iterative sorting, which I knew wouldn’t be an effective use of this declarative, set-based tool.

Suddenly, SQL wasn’t feeling as friendly anymore.

The dog alien snarls angrily.

I did some googling, and found a surprising volume of Stack Overflow[⁴] posts devoted to finding ways to communicate this exact sort of layered demand to various SQL database types. I was intrigued.

I did a little research, and learned that SQLite 3 added a handy window function called Rank that allows you to partition the rows of a SQL query’s result according to the criteria of your choice, then order those partitions as needed.

A control panel with flashing lights.

This was exactly the tool I needed!

Using Rank to Solve Top N per Group in SQLite

Let’s start with the innermost subquery, as set selection starts with the innermost subquery and refines outward.

First things first, we want to select occupation and year from our table “shows”, and get a count of the episode guest’s occupations, which we’ll reference as occupation_count. The results are grouped by year, then by occupation, so that we can get that occupation count per year.

        SELECT occupation, year, COUNT(occupation) AS occupation_count
        FROM shows
        GROUP BY year, occupation

This gives us the base data we want to organize and select from — SQL can ignore everything else.

Next, our middle subquery should use Rank to partition this data by year, making each year’s data into its own set. Then, it will create a ranking within each year, ordered by occupation_count, which we’ll call occupation_count_rank.

      SELECT occupation, year, occupation_count,
      RANK () OVER (
        PARTITION BY year
        ORDER BY occupation_count DESC
      ) occupation_count_rank
      FROM (
        SELECT occupation, year, COUNT(occupation) AS occupation_count
        FROM shows
        GROUP BY year, occupation
      )
      GROUP BY year, occupation

Now, we’ve got out occupations ranked in descending order by frequency, per year. This is also grouped by year first, then by occupation, as we want our query result to be ordered within those groups.

We’re already quite close to our desired query result, thanks to Rank, so our outer query can be quite simple. It simply selects the occupations from the result set supplied by the previous subquery, and returns only the occupations where the occupation_count_rank is equal to one. As we ordered our rank in descending order, the occupation with the highest count will be first.

    SELECT occupation
    FROM (
      SELECT occupation, year, occupation_count,
      RANK () OVER (
        PARTITION BY year
        ORDER BY occupation_count DESC
      ) occupation_count_rank
      FROM (
        SELECT occupation, year, COUNT(occupation) AS occupation_count
        FROM shows
        GROUP BY year, occupation
      )
      GROUP BY year, occupation
    )
    WHERE occupation_count_rank = 1;

If we throw this query into a heredoc and plug it into a ruby method, we can easily view the query result in the console.

Our query result? It’s just “actor”, seventeen times in a row. One for each year in our table. Which, again, unsurprising.

Okay, not super exciting. Still, that means that all three layers of our query are doing their jobs, and organizing the data to get exactly the results we want!

Captain Kirk is joined by Dr. McCoy and Spock on the bridge.

What if we didn’t want to just see actor listed out 17 times though?

Let’s shake things up a little.

    SELECT year, occupation
    FROM (
      SELECT occupation, year, occupation_count,
      RANK () OVER (
        PARTITION BY year
        ORDER BY occupation_count DESC
      ) occupation_count_rank
      FROM (
        SELECT occupation, year, COUNT(occupation) AS occupation_count
        FROM shows
        GROUP BY year, occupation
      )
      GROUP BY year, occupation
    )
    WHERE occupation_count_rank <= 5;

Here, we’ve kept both of our subqueries the same, only changing our outermost query. Instead of selecting just the occupation, let’s select the occupation AND the year, so we can keep things nice and organized. Then, instead of selecting only the occupation with occupation_count_rank equal to 1, we’re selecting the top five occupations for each year[⁵].

This gives us a lot more information! SQL returns the 5 most popular occupations for daily show guests, and also notes the year for us, so we don’t lose track halfway down the list.

Let’s make sure this is actually flexible — we can return the top one or top five per group, but let’s change what we’re returning.

Let’s write another similar query, this time querying the “genre” column of our database, which describes the genre of the guest’s Daily Show appearance. Let’s keep it clean this time, and only return the genre (not the year), and ask for the top 3 genres per year.

``` SELECT genre
FROM (
SELECT genre, year, genre_count,
RANK () OVER (
PARTITION BY year
ORDER BY genre_count DESC
) genre_count_rank
FROM (
SELECT genre, year, COUNT(genre) AS genre_count
FROM shows
GROUP BY year, genre
)
GROUP BY year, genre
)
WHERE genre_count_rank <= 3;



This tells us that “Acting” is usually in first place, but as the years go by, the more general “Media” begins to take over. “Comedy” and “Music” usually fill the remaining spot for the first years, but later in the 00’s, “Politician” becomes increasingly common, revealing a shift in the show’s tone.

Thanks, SQL.

![Lt. Uhura happily holds a purring tribble up to her ear.](https://cdn-images-1.medium.com/max/245/1*rAostJnbsZvaDH_Ac9wvwA.gif)

I queried a SQLite database in the above examples, but most popular SQL software include the RANK() function. If your preferred SQL software doesn’t have a rank function, I’d recommend looking into using a temporary table as a workaround, as explained [here](http://allenbrowne.com/ranking.html#temptable).

Notes & Links:

[1]: [Every Guest Jon Stewart Ever Had On 'The Daily Show'](https://fivethirtyeight.com/features/every-guest-jon-stewart-ever-had-on-the-daily-show/)



[3]: If you are a Flatiron Student and would like to talk about this lab, feel free to contact me via [Twitter](https://twitter.com/thevoidbutcozy).

[4]: Support labor unions in tech!

[5]: In the case of a tie, Rank will return both tied elements, and skip the value that would have held the second element were it not tied. If this will cause issues for you, look into using [Dense Rank() ](https://www.sqlitetutorial.net/sqlite-window-functions/sqlite-dense_rank/)instead!

Discussion (0)