AJ Kerrigan

Hanukkah of Data 2022 - Puzzle 2

Hanukkah of Data is a series of data-themed puzzles, where you solve puzzles to move your way through a holiday-themed story using a fictional dataset. See the introductory post for a bit more detail, but the pitch in my head is "Advent of Code meets SQL Murder Mystery". This post walks through my approach to the second puzzle.

Warning: This post unavoidably contains spoilers. If you'd rather do the puzzles on your own first, please close this browser tab and run away :).

Thinking

In the second puzzle, we find that the rug in our story was so dirty that the cleaners needed help from a special contractor. And we get a clue about how to track that contractor down:

As they’re right across the street from Noah’s, they usually talked about the project over coffee and bagels at Noah’s before handing off the item to be cleaned. The contractors would pick up the tab and expense it, along with their cleaning supplies.

“So this rug was apparently one of those special projects. The claim ticket said ‘2017 spec JD’. ‘2017’ is the year the item was brought in, and ‘JD’ is the initials of the contractor.”

So extracting some facts...

  • We're looking for a Noah's order from sometime in 2017
  • The order has coffee and bagels
  • The customer has initials JD

Tool Choices

In the pre-work section of the last post, I mentioned poking around at the data before working on any specific puzzles to get a feel for it. Looking at this puzzle's clues, it seemed clear that the approach would be very different from puzzle 1. Rather than focusing on a single table, we'd need to look at the full picture of an order:

  • The customer placing the order
  • The date and time it took place
  • Which items were included

VisiData is great for quick exploration and can perform joins, but considering the number of tables involved I figured this would be more straightforward with SQL. And since SQLite was an available format, Datasette seemed like a great fit.

Doing (Datasette)

Joining and Selecting

Before thinking too hard about filters, I built up the joins one table at a time until I had access to all the fields I'd need:

  • Customer name so I could look for initials
  • Phone number to submit as an answer
  • Ordered date to look for 2017
  • Description to find bagels and coffee

The SELECT list and joins ended up looking like this:

select
  c.name,
  c.phone,
  o.ordered,
  p.desc
from
  customers c
  join orders o on c.customerid = o.customerid
  join orders_items i on o.orderid = i.orderid
  join products p on i.sku = p.sku

Filtering

Order Date

The easiest first slice to whittle down this result set seemed to be filtering by order date. SQLite's strftime() function handles that nicely:

where
  strftime('%Y', o.ordered) = '2017'

Contractor Initials

I'm not a SQLite expert at all, so I hit the docs looking for the friendliest way to match the initials JD. LIKE? Regexes? Something else? I came up with this:

where
  strftime('%Y', o.ordered) = '2017'
  and c.name glob 'J* D*'

Because...

  • like isn't case-sensitive by default (though that can be controlled with the case_sensitive_like pragma)
  • glob is case-sensitive
  • SQLite includes a REGEXP operator, but it errors out unless you define a regexp() application-defined function (or load an extension that does)

This seems like handy stuff to know!
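
For what it's worth, registering that regexp() function from Python takes only a couple of lines with the standard sqlite3 module. Here's a minimal sketch (not something I needed for this puzzle, since glob did the job):

import re
import sqlite3

con = sqlite3.connect('noahs.sqlite')

# REGEXP is just syntax for a regexp(pattern, value) user function,
# so registering one makes the operator usable.
con.create_function(
    'regexp', 2,
    lambda pattern, value: value is not None and re.search(pattern, value) is not None
)

# An alternative to the glob filter above:
rows = con.execute(
    "select name, phone from customers where name regexp 'J.* D.*'"
).fetchall()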

Coffee and Bagels

I wanted orders that had both bagels and coffee, which GROUP BY and HAVING cover nicely:

group by
  c.name,
  c.phone,
  o.ordered
having
  sum(
    case
      when desc like '%coffee%' then 1
      else 0
    end
  ) > 0
  and sum(
    case
      when desc like '%bagel%' then 1
      else 0
    end
  ) > 0

All Together Now

After building that up bit by bit, the final query looks like this:

select
  c.name,
  c.phone,
  o.ordered,
  p.desc
from
  customers c
  join orders o on c.customerid = o.customerid
  join orders_items i on o.orderid = i.orderid
  join products p on i.sku = p.sku
where
  strftime('%Y', o.ordered) = '2017'
  and c.name glob 'J* D*'
group by
  c.name,
  c.phone,
  o.ordered
having
  sum(
    case
      when desc like '%coffee%' then 1
      else 0
    end
  ) > 0
  and sum(
    case
      when desc like '%bagel%' then 1
      else 0
    end
  ) > 0

And it gives a clear single result 👍️.

Doing (Pandas)

It was rewarding to dig into SQLite a bit while solving this puzzle, and it seemed like a good opportunity to learn a bit more about pandas too! So how would I adapt this working SQL solution to pandas?

Preparation

To start, it seemed reasonable to flesh out my initial SELECT query and use that to build a starting DataFrame:

import numpy as np
import pandas as pd
import sqlite3

con = sqlite3.connect('noahs.sqlite')
df = pd.read_sql_query('''
select
  c.*,
  o.orderid,
  o.ordered,
  o.shipped,
  o.total,
  i.qty,
  i.unit_price,
  p.sku,
  p.desc,
  p.wholesale_cost
from
  customers c
  join orders o on c.customerid = o.customerid
  join orders_items i on o.orderid = i.orderid
  join products p on i.sku = p.sku''',
  con,
  parse_dates=['ordered','shipped'])

Looking at counts with df.nunique(), it looked like most of the columns held a small set of values repeated many times. I've heard/read that categorical dtypes can be useful in that situation, so why not give them a shot:

dtypes = {col: 'category' for col, dtype in df.dtypes.to_dict().items() if dtype != np.dtype('datetime64[ns]')}

#df.astype(dtypes).memory_usage(deep=True).sum()
#df.memory_usage(deep=True).sum()
# ^^ ~10% mem usage after type conversion!

df = df.astype(dtypes)

Note: I don't think this was actually a good move... first, it was textbook premature optimization. Second, I hit some errors and performance issues later that seemed to stem from this. But hey, that's what experimentation is for.

I then saved the DataFrame to a Parquet file just so it'd be quicker to reload later:

df.to_parquet('noahs.parquet')

Filtering

Using the .str accessor methods helped find the contractor initials and orders with bagels or coffee, while .dt.year handled the order year check:

df[
    (df.name.str.contains(r"J.* D.*"))
    & (df.ordered.dt.year == 2017)
    & (df.desc.str.contains(r"bagel|coffee", case=False))
]

After some time in the pandas docs, groupby() and filter() looked like the right tools for finding orders that contained both bagels and coffee. This seemed like a reasonable idea:

df[
    (df.name.str.contains(r"J.* D.*"))
    & (df.ordered.dt.year == 2017)
    & (df.desc.str.contains(r"bagel|coffee", case=False))
].groupby(["orderid"]).filter(
    lambda x: x.desc.str.contains("bagel", case=False).any()
    and x.desc.str.contains("coffee", case=False).any()
)[
    ["name", "phone", "desc"]
]

But... BOOM. It just hung and killed my process.

I suspect this isn't the best way to do this sort of thing. But separately, it seems that using groupby() on a column with a category dtype is bad news: changing orderid's dtype to int made this run in a snap. Filtering for bagels and coffee twice feels a little dirty, but weeding out obviously non-matching orders before grouping seemed worthwhile.
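
Concretely, the tweak that unstuck things looked like this (a minimal sketch; the observed=True idea is an untested assumption on my part):

# The tweak that let the groupby()/filter() step finish quickly:
# treat orderid as a plain integer instead of a categorical.
df = df.astype({'orderid': 'int'})

# Untested assumption: keeping the categorical dtype but passing
# observed=True to groupby() might also avoid grouping over every
# possible category.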

So after minor tweaks, I had the right answer but was also developing some suspicions.

Refining / Reusing

Lots of pandas users seem to use notebooks, but I find it much more comfortable to explore in a REPL. So I wrote a quick script that would let ptipython -i hod_resume.py land me in a fresh REPL with my DataFrame loaded:

import pandas as pd

df = pd.read_parquet('noahs.parquet').astype({
    'customerid': 'int',
    'name': 'string',
    'address': 'string',
    'citystatezip': 'string',
    'birthdate': 'datetime64[ns]',
    'phone': 'string',
    'orderid': 'int',
    'ordered': 'datetime64[ns]',
    'shipped': 'datetime64[ns]',
    'total': 'float',
    'qty': 'int',
    'unit_price': 'float',
    'sku': 'string',
    'desc': 'string',
    'wholesale_cost': 'float'
})

I consider that list of data types pretty fluid. Getting rid of object dtypes seemed like a generally good idea, but category was a mixed bag that felt great at times and like a footgun at others. I think I need to understand them better to avoid doing silly things, and in the meantime I experimented and switched dtypes often.
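
To make that category-versus-string tradeoff a bit more concrete, here's a small standalone sketch (not from my actual session) comparing the memory footprint of a repetitive column under a few dtypes:

import pandas as pd

# A repetitive column like desc: few distinct values, many rows.
s = pd.Series(['bagel', 'coffee', 'lox'] * 100_000)

for dtype in ['object', 'string', 'category']:
    print(dtype, s.astype(dtype).memory_usage(deep=True))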
