Hanukkah of Data 2022 - Puzzle 4

#hanukkahofdata #datasette #python #pandas

Hanukkah of Data is a series of data-themed puzzles in which you work your way through a holiday-themed story using a fictional dataset. See the introductory post for a bit more detail, but the pitch in my head is "Advent of Code meets SQL Murder Mystery". This post walks through my approach to the fourth puzzle.

Thinking
Doing (Pandas)
Doing (Datasette)

Warning: This post unavoidably contains spoilers. If you'd rather do the puzzles on your own first, please close this browser tab and run away :).

Thinking

On the fourth day, we find a woman who likes to eat pastries early in the morning:

“A few weeks later my bike chain broke on the way home, and I needed to get it fixed before work the next day. Thankfully, this woman I met on Tinder came over at 5am with her bike chain repair kit and some pastries from Noah’s. Apparently she liked to get up before dawn and claim the first pastries that came out of the oven.”

Doing (Pandas)

This time I turned things around and started with Pandas. Because I had explored the data a bit before starting the puzzles, I had noticed that the sku column starts with a category/department prefix. Baked goods all seemed to have the bakery prefix BKY, so finding orders with that sku prefix seemed like a good start:

df[(df.sku.str.contains('BKY'))]
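
A side note on the match itself: str.contains finds BKY anywhere in the value, while str.startswith only matches it as a prefix. Here is a minimal sketch of the difference, using made-up SKUs rather than anything from the puzzle dataset:

```python
import pandas as pd

# Hypothetical SKUs for illustration only; not taken from the puzzle dataset.
skus = pd.Series(["BKY1573", "DLI0002", "TOYBKY9", "BKY4004"])

# contains() matches the substring anywhere in the value.
anywhere = skus[skus.str.contains("BKY")]

# startswith() matches only when the value begins with the prefix.
prefix_only = skus[skus.str.startswith("BKY")]

print(anywhere.tolist())
print(prefix_only.tolist())
```

If no other department code happens to contain BKY, both filters return the same rows, so either works for this puzzle.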

That was far too many results (as expected), so the next step was to find bakery orders early in the morning. Let's say between 4am and 9am, which should cover cases where someone got up before dawn and got the first pastries out of the oven:

df[(df.sku.str.contains('BKY')) & (df.ordered.dt.hour.between(4,9))]
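
One detail worth knowing about that filter: Series.between is inclusive on both ends by default, so between(4,9) keeps everything from 4:00 through 9:59. A small sketch with invented timestamps shows where the edges fall:

```python
import pandas as pd

# Hypothetical order timestamps, made up for illustration.
ordered = pd.Series(pd.to_datetime([
    "2022-01-01 03:59:00",
    "2022-01-01 04:00:00",
    "2022-01-01 09:59:00",
    "2022-01-01 10:00:00",
]))

# .dt.hour extracts the hour as an integer (3, 4, 9, 10 here).
# between(4, 9) is inclusive on both ends, so hours 4 through 9 pass.
mask = ordered.dt.hour.between(4, 9)
print(mask.tolist())
```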

That was still a bunch of rows! But the clues make it sound like this is a habit. So how about looking at which customers get those early morning pastries most often?

df[
    (df.sku.str.contains('BKY'))
    & (df.ordered.dt.hour.between(4,9))
].groupby(['name','phone']).size().sort_values().tail()

That suggests Cristina Booker as the clear winner, and narrowing that time range from 4-9am to 4-6am makes it even clearer.
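
To make it easier to compare time windows, the filter and count can be wrapped in a small function. This is only a sketch: the helper name early_bakery_counts is my own invention, and the toy frame below uses invented customers and orders, with the same column names as the real data.

```python
import pandas as pd

# Toy stand-in for the merged orders frame; all rows are invented.
df = pd.DataFrame({
    "name": ["Ann Example", "Ann Example", "Ann Example", "Bob Sample", "Bob Sample"],
    "phone": ["555-0001"] * 3 + ["555-0002"] * 2,
    "sku": ["BKY0001", "BKY0002", "BKY0001", "BKY0001", "DLI0001"],
    "ordered": pd.to_datetime([
        "2022-03-01 04:30", "2022-03-08 05:10", "2022-03-15 04:45",
        "2022-03-02 08:20", "2022-03-03 04:50",
    ]),
})

def early_bakery_counts(frame, start_hour, end_hour):
    """Count bakery orders per customer within an inclusive hour range (hypothetical helper)."""
    mask = (
        frame.sku.str.contains("BKY")
        & frame.ordered.dt.hour.between(start_hour, end_hour)
    )
    # size() counts rows per (name, phone); sort ascending so the top customer is last.
    return frame[mask].groupby(["name", "phone"]).size().sort_values()

wide = early_bakery_counts(df, 4, 9)    # 4:00 through 9:59
narrow = early_bakery_counts(df, 4, 6)  # 4:00 through 6:59
print(wide)
print(narrow)
```

In the toy data, the customer whose only bakery order was at 8am drops out once the window narrows, which is the same effect that makes the real answer stand out.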

Doing (Datasette)

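Adapting the pandas logic to SQL left me with this query. Note that the hour filter here keeps any order placed before 6am, rather than the 4-6am window used above: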

select
  c.name,
  c.phone,
  count(*) as ordercount
from
  customers c
  join orders o on c.customerid = o.customerid
  join orders_items i on o.orderid = i.orderid
  join products p on i.sku = p.sku
where
  p.sku like 'BKY%'
  and cast(strftime('%H', o.ordered) as int) <= 5
group by
  c.name,
  c.phone
order by ordercount desc
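
If you want to try the query outside Datasette, Python's built-in sqlite3 module is enough. The sketch below runs it against a tiny in-memory database. The table and column names come from the query above, but the schema is pared down to just those columns and every row is invented, so treat it as a stand-in for the real dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Minimal schema containing only the columns the query touches (an assumption;
# the real tables have more columns). All rows are invented.
conn.executescript("""
create table customers (customerid integer primary key, name text, phone text);
create table orders (orderid integer primary key, customerid integer, ordered text);
create table orders_items (orderid integer, sku text);
create table products (sku text primary key);

insert into customers values (1, 'Ann Example', '555-0001'), (2, 'Bob Sample', '555-0002');
insert into products values ('BKY0001'), ('DLI0001');
insert into orders values
  (1, 1, '2022-03-01 04:30:00'),
  (2, 1, '2022-03-08 05:10:00'),
  (3, 2, '2022-03-02 08:20:00'),
  (4, 2, '2022-03-03 04:50:00');
insert into orders_items values (1, 'BKY0001'), (2, 'BKY0001'), (3, 'BKY0001'), (4, 'DLI0001');
""")

# Same query as above: bakery items ordered before 6am, counted per customer.
query = """
select c.name, c.phone, count(*) as ordercount
from customers c
  join orders o on c.customerid = o.customerid
  join orders_items i on o.orderid = i.orderid
  join products p on i.sku = p.sku
where p.sku like 'BKY%'
  and cast(strftime('%H', o.ordered) as int) <= 5
group by c.name, c.phone
order by ordercount desc
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Only the customer with early bakery orders comes back; the 8am bakery order and the 4:50am non-bakery order are both filtered out.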
