Shadi AJAM

Posted on Nov 3

Should We Rethink IDs? A Deep Dive into "Snowflake IDs"

#opensource #database #programming #webdev

Let's Start from the Beginning: Why Do We Even Use IDs?

The short answer is "Labeling".

Since ancient times, we've had to label everything: animals, crops, livestock, geographic regions, even military units. As civilization grew, we began recording information on paper, collecting and storing data. Over time, this amassed into our first version of "Big Data." To manage it, we invented methods to sort and index all that paper, driven by one goal: finding information faster. This need for "organization" led us to create paper files, folders, shelves, and storage containers.

As we entered the computer era, we started using computers to store our files, shifting our basic labeling system into the realm of databases. In a database, each row (or record) gets a unique ID, usually an auto-incremented value starting from zero or one. This makes it easy to organize and find information quickly.

Over time, as we began using distributed servers and databases, messaging and communication between devices became even more critical. Each record or message had to be unique, requiring a way to label it individually—without any duplicates—across the entire system, regardless of the device.

End of story. Let's break down "Snowflake IDs"!

Snowflake IDs are "unique" identifiers created to solve the issue of ID duplication across distributed systems.

They were originally created by X (formerly Twitter) for tweet IDs, and the same idea, or a variation of it, is also used by other major tech companies such as Instagram, Uber, GitHub, and LinkedIn.

Snowflake ID Structure:

A Snowflake ID is a fixed-length 64-bit (8-byte) integer, of which 63 bits are usable because the sign bit is left unset.

A Snowflake ID is compact and efficient to store in a database as a single 64-bit integer. This small footprint is ideal for high-performance systems, minimizing storage space while maintaining unique, ordered identifiers.

Structure Breakdown:

Unused sign bit (1 bit): Always 0, so the ID stays positive when stored as a signed 64-bit integer.
Timestamp (41 bits): The time in milliseconds since a custom epoch.
Data Center/Machine ID (10 bits): Identifies the machine or device that generated the ID, allowing up to 1024 distinct values (0 to 1023).
Sequence Number (12 bits): A counter for IDs generated within the same millisecond, allowing up to 4096 values (0 to 4095).
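
To make the layout concrete, here is a minimal Python sketch that packs the three fields into one integer and splits them back out. The function and constant names are my own for illustration, and the field order (timestamp in the high bits, sequence in the low bits) follows the breakdown above.

```python
# Bit widths taken from the breakdown above; all names here are illustrative.
TIMESTAMP_BITS = 41
MACHINE_BITS = 10
SEQUENCE_BITS = 12

MAX_MACHINE = (1 << MACHINE_BITS) - 1    # 1023
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1  # 4095


def pack_snowflake(timestamp_ms: int, machine_id: int, sequence: int) -> int:
    """Combine the three fields into one integer; the top (sign) bit stays 0."""
    if not 0 <= timestamp_ms < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp does not fit in 41 bits")
    if not 0 <= machine_id <= MAX_MACHINE:
        raise ValueError("machine_id does not fit in 10 bits")
    if not 0 <= sequence <= MAX_SEQUENCE:
        raise ValueError("sequence does not fit in 12 bits")
    return (
        (timestamp_ms << (MACHINE_BITS + SEQUENCE_BITS))
        | (machine_id << SEQUENCE_BITS)
        | sequence
    )


def unpack_snowflake(snowflake: int) -> tuple[int, int, int]:
    """Split an ID back into (timestamp_ms, machine_id, sequence)."""
    sequence = snowflake & MAX_SEQUENCE
    machine_id = (snowflake >> SEQUENCE_BITS) & MAX_MACHINE
    timestamp_ms = snowflake >> (MACHINE_BITS + SEQUENCE_BITS)
    return timestamp_ms, machine_id, sequence
```

Note that `timestamp_ms` here is already relative to the custom epoch; converting it to a wall-clock time means adding that epoch back.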

Real world examples:

LinkedIn uses Snowflake IDs in its article editor. Let's take my article as an example.

Snowflake ID: 7256902784527069184

The table above breaks down the Snowflake ID, showing how LinkedIn structures its identifiers. The timestamp aligns exactly with the date and time I started writing this article: "October 29 at 5:40 AM".
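
If you want to check the timestamp yourself, here is a small Python sketch. It assumes the timestamp sits above the lowest 22 bits and counts milliseconds from the Unix epoch. I have not seen either detail confirmed by LinkedIn; the assumption simply reproduces the time quoted above. The helper name `snowflake_time` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def snowflake_time(snowflake: int, epoch_ms: int = 0) -> datetime:
    """Return the UTC time stored in the bits above the lowest 22.

    epoch_ms=0 means "milliseconds since the Unix epoch", which is an
    assumption for LinkedIn, not a documented fact.
    """
    timestamp_ms = (snowflake >> 22) + epoch_ms
    # timedelta keeps millisecond precision exact (no float rounding).
    return UNIX_EPOCH + timedelta(milliseconds=timestamp_ms)


article_id = 7256902784527069184
print(snowflake_time(article_id).isoformat())
# 2024-10-29T05:40:50.565000+00:00
```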

X (Twitter) uses Snowflake IDs as post IDs. Let's take this post by Elon Musk as an example: https://x.com/elonmusk/status/1851515326581916096

Snowflake ID: 1851515326581916096

In this approach, X (Twitter) uses a custom epoch of "1288834974657" milliseconds, which translates to "November 4, 2010, 1:42:54.657 AM" (UTC). Adding the timestamp stored in the Snowflake ID to that epoch gives "October 30, 2024, 6:43:48.005 AM" (UTC), indicating when the tweet was posted.

The datacenter ID identifies the machine that generated the tweet, while the sequence ID ensures each tweet is unique, even if created at the same time.
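
The same arithmetic as a short Python sketch, using the epoch quoted above. Splitting the lower 22 bits into a 10-bit machine field and a 12-bit sequence follows the layout described in this article; treat those two fields as illustrative, since X may interpret them differently today. `decode_snowflake` is a name I made up for the example.

```python
from datetime import datetime, timedelta, timezone

TWITTER_EPOCH_MS = 1288834974657  # custom epoch quoted above
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_snowflake(snowflake: int, epoch_ms: int = TWITTER_EPOCH_MS) -> dict:
    """Split an ID using the 41/10/12 layout described in this article."""
    unix_ms = (snowflake >> 22) + epoch_ms
    return {
        "created_at": UNIX_EPOCH + timedelta(milliseconds=unix_ms),
        "machine_id": (snowflake >> 12) & 0x3FF,  # 10 bits
        "sequence": snowflake & 0xFFF,            # 12 bits
    }


post = decode_snowflake(1851515326581916096)
print(post["created_at"].isoformat())
# 2024-10-30T06:43:48.005000+00:00
```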

The good, the bad and the ugly!

The Good:

Small Footprint: The 64-bit structure of Snowflake IDs makes them compact and efficient for storage.
Sortable: Snowflake IDs include a timestamp component, ensuring that IDs are roughly ordered by time.
Usable Components: Because the components already carry meaningful data (creation time, generating machine), any part of the system can read that data straight from the ID.
Customizable: You can change how the bits are allocated between the data center ID and the sequence number as needed. Basically, you have 22 bits (10 + 12) to divide however your system requires.
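
As a rough illustration of the "Customizable" point, here is a Python sketch of a configurable layout. The `Layout` class and the split into separate data center and worker fields are hypothetical choices of mine; the only constraint taken from the text is that the non-timestamp fields share 22 bits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical split of the 22 bits that sit below the timestamp."""
    datacenter_bits: int
    worker_bits: int
    sequence_bits: int

    def __post_init__(self) -> None:
        if self.datacenter_bits + self.worker_bits + self.sequence_bits != 22:
            raise ValueError("the three fields must add up to 22 bits")

    def pack(self, timestamp_ms: int, datacenter: int, worker: int, sequence: int) -> int:
        value = timestamp_ms
        for field, bits in (
            (datacenter, self.datacenter_bits),
            (worker, self.worker_bits),
            (sequence, self.sequence_bits),
        ):
            if not 0 <= field < (1 << bits):
                raise ValueError(f"value {field} does not fit in {bits} bits")
            value = (value << bits) | field  # append the field below the previous ones
        return value

    def unpack(self, snowflake: int) -> tuple[int, int, int, int]:
        sequence = snowflake & ((1 << self.sequence_bits) - 1)
        snowflake >>= self.sequence_bits
        worker = snowflake & ((1 << self.worker_bits) - 1)
        snowflake >>= self.worker_bits
        datacenter = snowflake & ((1 << self.datacenter_bits) - 1)
        return snowflake >> self.datacenter_bits, datacenter, worker, sequence


# Two possible splits: more machines, or more IDs per millisecond.
many_machines = Layout(datacenter_bits=5, worker_bits=5, sequence_bits=12)
busy_machines = Layout(datacenter_bits=2, worker_bits=4, sequence_bits=16)
```

Whatever split you pick, the timestamp stays in the high bits, so IDs from any layout still sort roughly by creation time.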

The Bad:

Not Globally Unique: Snowflake IDs are unique within a single distributed system but may not be globally unique across different companies/systems.
Limited Numbers for Components: The number of bits allocated to each component caps how many distinct values it can hold. For example, the Data Center/Machine ID can only hold 1024 distinct values.
Complex Configuration: Properly configuring and managing the allocation of bits and unique identifiers for data centers and machines can become complicated, especially in large distributed systems.
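
To put numbers on those limits, here is a quick back-of-the-envelope calculation in Python for the default 41/10/12 split. The variable names are illustrative only.

```python
TIMESTAMP_BITS, MACHINE_BITS, SEQUENCE_BITS = 41, 10, 12

MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25

# How long the 41-bit millisecond counter lasts after the custom epoch.
lifetime_years = (1 << TIMESTAMP_BITS) / MS_PER_YEAR

# How many generators can exist, and how fast each one can issue IDs.
machines = 1 << MACHINE_BITS
ids_per_ms_per_machine = 1 << SEQUENCE_BITS
ids_per_second_per_machine = ids_per_ms_per_machine * 1000

print(f"timestamp range: about {lifetime_years:.1f} years")
print(f"machines: {machines}")
print(f"IDs per second per machine: {ids_per_second_per_machine:,}")
```

So the timestamp field runs out roughly 69.7 years after whatever epoch you choose, and every bit you move from one field to another halves one limit while doubling the other.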

The Ugly:

Clock Drift Issues and Dependency on Accurate Timekeeping: The system relies on precise time synchronization, which can lead to non-sequential IDs or even duplicates if clocks are out of sync.
Potential for ID Collisions: Without careful management and synchronization, Snowflake ID generation can lead to collisions or duplicated IDs, undermining the reliability of the system.
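
Here is a minimal generator sketch in Python that shows where these two problems bite. It is not Twitter's implementation, just one common way to guard against them under the 41/10/12 layout. The class name, the injectable `clock` parameter, and the choice to raise an error when the clock goes backwards are all assumptions of mine; a real system might wait or fail over instead.

```python
import threading
import time


class SnowflakeGenerator:
    """Illustrative 41/10/12 generator with a clock-rollback guard."""

    def __init__(self, machine_id: int, epoch_ms: int = 1288834974657, clock=None):
        if not 0 <= machine_id < 1024:
            raise ValueError("machine_id must fit in 10 bits")
        self._machine_id = machine_id
        self._epoch_ms = epoch_ms
        # clock() returns Unix time in milliseconds; injectable for testing.
        self._clock = clock or (lambda: time.time_ns() // 1_000_000)
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Reusing an old millisecond could repeat an ID, so refuse.
                raise RuntimeError("clock moved backwards; refusing to generate IDs")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # 4096 IDs already issued this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - self._epoch_ms) << 22) | (self._machine_id << 12) | self._sequence


generator = SnowflakeGenerator(machine_id=7)
print(generator.next_id())
```

The guard only protects a single generator. Two machines configured with the same `machine_id` can still produce identical IDs, which is why assigning machine IDs needs care.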

Snowflake IDs vs. GUIDs: A Potential Replacement?

Ahhh, no, definitely not. Comparing "GUIDs" and "Snowflake IDs" is like comparing the "Sea" and a "River". Yes, at the baseline both are "water", but with huge differences.

GUIDs are "GLOBAL IDs": they're great for ensuring that an exact "label" is unique across the globe.

Snowflake IDs are "SYSTEM IDs": they're great when all of your system's resources need to know that "label".

Still here!? You are really interested!!!

Here are some Snowflake ID references!

Snowflake IDs Delphi Generator: https://github.com/shadiajam/SnowFlakeID-Delphi "This one is mine, consider starring it ⭐"
Online Snowflake ID Generator: https://www.onlineappzone.com/snowflake-id-generator
Wikipedia: https://en.wikipedia.org/wiki/Snowflake_ID

DEV Community