Sonu Kumar

Posted on Feb 26, 2023

KUID: Compressed Universally Unique Identifier

#architecture #performance #programming

KUID, a universally unique identifier, is essentially a shorter version of a UUID and is commonly used to identify records in various applications.

What's UUID?

A UUID (Universally Unique Identifier) is a unique identifier that is widely used across various applications and systems to identify records. It is a 128-bit number typically represented as a sequence of 32 hexadecimal digits, grouped in 5 sections, separated by hyphens, as in the following format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.

UUIDs are designed to be globally unique, meaning that it is extremely unlikely for two UUIDs to be the same, even when generated independently by different systems. This makes UUIDs useful for identifying and tracking records in distributed systems, databases, and other contexts where unique identification is necessary.

There are different variants and versions of UUIDs, each with its own way of generating unique identifiers. Some commonly used variants include UUIDv1, UUIDv4, and UUIDv5. The specific variant and version of a UUID can be determined from the format and content of the identifier.

How to generate KUID?

KUID is generated in two steps

Generate UUID (use any algorithm like V1,V2,V3,V4 etc)
Encode 128-bit number

KUID = Generate UUID and encode it in the higher base like base 32, 62 or even higher.

The length of the KUID depends on the encoding scheme employed. If the base62 encoding scheme ([0–9A-Za-z]) is used, KUID can be encoded to 22 bytes. However, if you wish to store KUID in upper case, then encoding it using the base36 encoding scheme, which utilizes characters [0–9A-Z], will result in an encoded representation that can be stored in 26 bytes. For this post, we'll use base 62 encoding.

Why should we use KUID?

KUID generates a smaller string, which results in less storage space and faster processing of many operations.

NOTE: Following statements are only valid when KUID is treated as a string, not as 128-bit number.

Hashing: Using KUID with its shorter length of 22 bytes reduces the amount of data that needs to be hashed. This results in fewer instructions being required to perform the hashing operation compared to UUID, which requires a hash of 36 bytes.
Comparison: Due to the shorter length of KUID, only 22 bytes need to be compared instead of 36 bytes, resulting in fewer instructions needed for comparison. Read/Write: Using KUID, which can be encoded in 22 bytes, results in a reduction in the memory and disk space required for read and write operations. This is in contrast to UUID, which requires 36 bytes for the same operations. With KUID, fewer instructions are needed to read or write data due to its smaller size.
I/O Calls: Sending a UUID over a network consumes 36 bytes of data (without considering compression), while a KUID only requires 22 bytes. Therefore, KUID is a more efficient choice for network transmission due to its smaller size.
Database indexing: Due to the smaller size of KUID, it leads to a decrease in the size of the database index. In addition, combining KUID with another column for secondary indexing can be beneficial as databases have a limited index width (e.g., 767 bytes in MySQL) that is dependent on the encoding used for the column or table. For instance, MySQL's UTF8 encoding may limit the index width to as little as 255 bytes.
URL safe: Like UUID, KUID is URL safe since it only uses [0–9A-Za-z] characters to represent the data. KUID is commonly used as a unique column, which means that URLs containing KUIDs may be used to represent specific resources, such as /api/v1/products/:productId. As the KUID is included in the URL, it must be URL safe to avoid issues such as encoding errors or unintended interpretation by the web server or browser.
Lower Memory/Disk Usage: Using KUID results in lower memory/disk usage compared to UUID. Since KUID only requires 22 bytes, it uses less storage space than UUID, which uses 36 bytes. This can be beneficial for systems that require frequent read/write operations or have limited storage capacity. For instance, if we consider the storage of 100 million records, assuming each KUID is stored as a 22-byte string, the space consumed by KUIDs would be approximately 2.2 GB. In contrast, the space required for UUIDs would be around 3.6 GB. Therefore, using KUIDs can result in a storage saving of approximately 39%.

KUID Usage



public class Main {
    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            KUID kuid = KUID.randomKUID();
            System.out.println(kuid.toString());
        }

    }
}

This will produce something like this

If you found this post helpful, please share and give a thumbs up.

Top comments (4)

John Nyingi • Mar 14 '23

what are the chances of collisions? doesn't it increase?

Steven Kamradt • May 18 '24

Ultimately it will resolve to the same bits. Normally a guid is represented as hex bytes, which only use 16 characters to represent the bits. This requires 2 characters per byte (base 16). Using only Normal integers, you would need 3 characters to represent the same byte (base 10). This is base 62 which uses a much larger range of characters to represent the same bytes. In this case, its easier (for me) to think of things as bits and a guid contains 128 of them. 1 character can represent short of a full byte, (think of the bit positions in a byte as 1,2,4,8,16,32,64,128...62 is just short of 64). Using the last character we get 62 so we will need to borrow two characters from the next character position to get the value of the 64bit, so we end up rotating the array. This works for compression, where you have to encode/decode in a serial manner. This will not work as easily for random access where you need a specific byte as you will need to know the offset from the start to determine how many times to rotate the array. If you are familiar with UUEncoding, its the same principle but adds + and / to the possible characters to represent the 1st through 7th bits fully which ends up with a single bit that needs to be carried into the next character. it is only sightly more efficient but has the extra burden of additional characters that don't always look as nice on a URL or as a parameter. UUEncoding otherwise has many of the same benefits as a KUID (smaller storage requirements, faster string comparing). fwiw, the = character in UUencoding is an "empty" or padding character that is used to specify that the end of the string was reached and there is no further data. This is not used for KUID since the length of the KUID is fixed, if you use the algorithm to encode data OTHER than a GUID you may need to borrow this concept as well. :)

Cesar Adrian Leyva Luna • Mar 18 '24

is the same uuid in other encoding

Basil • Apr 12 '25

The same idea has been already implemented several years ago under the name compact-uuid. It uses different encoding and gives 26 chars length, but has some advantages: case-insensitiveness and avoiding ambiguous characters