Shalvah

Originally published at blog.shalvah.me

High and low cardinality

Let's talk about cardinality. Can we talk about cardinality please, Mac? I've been dying to talk about cardinality with you all day.

In maths, cardinality is the number of elements in a set. A = {2, 4, 5, 8} -> cardinality of A is 4. It's quite similar in software engineering—cardinality is a rough idea of how many distinct values are in a set.

For example, timestamp fields (such as created_at) are often high-cardinality, because they will likely have many different values. By contrast, a field like day_of_week is low-cardinality, because it can only have 7 different values. An enum field like status is also probably low-cardinality (active, inactive, and maybe a few others). Boolean fields/flags have very low cardinality—just two values. User-specific data, like user IDs or names, are high-cardinality.
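
If you want a feel for the cardinality of your own columns, a quick way (a sketch, assuming a SQL database and a users table with these fields) is to count the distinct values:

-- Rough cardinality check: distinct values per column
SELECT
    COUNT(*) AS total_rows,
    COUNT(DISTINCT status) AS status_cardinality,         -- probably a handful
    COUNT(DISTINCT created_at) AS created_at_cardinality  -- probably close to total_rows
FROM users;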

Note: there's another common usage of cardinality in software, which refers to relationships in database entities, but that's not what I'm talking about here.

Why does cardinality matter? It has effects on data storage and access patterns.

Data access

A good example of where cardinality affects access is relational database indexes.

You might have heard that you should add an index for any fields you want to query on. So, supposing you have a users table, you might add an index on created_at and status, so your queries can be faster.
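
For illustration, that could look something like this (the index names are just placeholders, and the exact syntax depends on your database):

CREATE INDEX idx_users_created_at ON users (created_at);
CREATE INDEX idx_users_status ON users (status);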

Here's the thing—for most simple queries, the database only uses one index[note 1]. So in a query like this:

SELECT * FROM users 
WHERE status = 'active' 
AND created_at >= '2023-08-01 00:00:00'

the database has to decide which index it will use, and it's fairly likely it will pick the one on created_at.

Why's that? Because the database engine maintains statistics about its tables, and it may have noticed that the status column usually only has a few distinct values, while created_at has a wide variety. This means that if it uses the index on created_at, it can eliminate a large portion of the table's rows up front, and then check each remaining row to see if it matches the status.

Imagine we have 25,000 users. Most (say 17,000) are active. If the database used the status index, it would have 17,000 rows that it needs to then loop through and check the creation timestamp. But if it used the created_at index and a time range, it could end up with maybe 2,000 users created in the last month. That's a lot fewer rows to check.

Of course, this doesn't always work as I've described. The engine doesn't know your application like you do. A different query, a different time range, different data patterns (such as most users from a certain period being inactive), or outdated internal engine statistics might make the chosen index perform worse, or might make the engine pick an entirely different query plan.
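
If you're curious what the planner actually decided for a given query, most databases will show you via EXPLAIN (the output format differs between MySQL, Postgres, and friends):

EXPLAIN
SELECT * FROM users
WHERE status = 'active'
AND created_at >= '2023-08-01 00:00:00';
-- Check which index shows up in the plan
-- ("key" in MySQL, "Index Scan using ..." in Postgres)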

An important point: cardinality is relative! Or more accurately, selectivity is. Selectivity is cardinality (the number of distinct values in a set) relative to the overall size of the set. The status column might have only 6 distinct values, but if there are only 12 rows, that's a high selectivity ratio (6/12 = 0.5): on average, a single status value matches just 2 of those 12 rows, so filtering on it discards most of the table.
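
As a rough sketch, you can estimate a column's selectivity yourself (again assuming a users table; you may need to adjust the division syntax for your database):

-- Selectivity = distinct values / total rows (closer to 1 = more selective)
SELECT
    COUNT(DISTINCT status) * 1.0 / COUNT(*) AS status_selectivity,
    COUNT(DISTINCT created_at) * 1.0 / COUNT(*) AS created_at_selectivity
FROM users;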

Database indexing is an involved topic, and I won't delve further here. I'm a big fan of this article, which talks more about cardinality, selectivity, and many different reasons your index might not be used.

Data storage

It's related to access, but more specifically: if we know that a column only holds a small number of distinct values, we can store it more efficiently (which could also mean we can query it more efficiently). This is what ClickHouse's LowCardinality column type does.

For instance, suppose we have a status column as a String. If we insert three rows with status of "active", we'll be storing a 6-byte string 3 times. But if status is designated as LowCardinality, the database can avoid this by storing active once, and assigning it a small integer ID (like what we did in byte packing). Then every new row gets that ID stored, meaning we'd only store 1 byte per row.
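
In ClickHouse, opting in is just a matter of the column type. A minimal sketch (the table here is made up for illustration):

CREATE TABLE payments
(
    id UInt64,
    status LowCardinality(String),  -- each distinct value stored once; rows store a small ID
    created_at DateTime
)
ENGINE = MergeTree
ORDER BY created_at;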

In a database like ClickHouse, used for storing huge amounts of data, this is quite valuable. Here's a good demonstration of the space savings, and here's an example where it improves query speed.

As an end user, it's also helpful to know when a tool you use expects a value to be low- or high-cardinality, even when they don't use those terms.

An example: in Datadog, metrics can have tags; for a metric such as "number of payments processed", you could attach a tag like status, with values such as status:successful and status:failed. However, you're charged for every metric + tag combination, so you pay for every new value of status that you send. If you introduce a status:pending, you'll pay (literally) for that!

This means your tags should not have unbounded cardinality. A tag like payment_amount is asking for trouble. Instead, if you really need to capture values like that, you could put the payment amount in buckets (0 - 100, 100 - 500, etc).
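
To stay with the SQL theme of this post, the bucketing idea looks something like the CASE expression below; in practice you'd do the equivalent in whatever code emits the metric, and the boundaries here are arbitrary:

SELECT
    CASE
        WHEN payment_amount < 100 THEN '0-100'
        WHEN payment_amount < 500 THEN '100-500'
        WHEN payment_amount < 1000 THEN '500-1000'
        ELSE '1000+'
    END AS payment_amount_bucket  -- a bounded set of tag values instead of raw amounts
FROM payments;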

Notes

1. In theory, it could use more than one index (and it does sometimes), but because of the way indexes work, combining them is more complicated and potentially slower. For more complex queries, like JOINs, multiple indexes do help.
