While working on my note-taking application Heaplist, when I got to saving data to the database, I started using uuid4 identifiers to identify records, which usually look something like this when represented as a string:

```
32dca18531a1435480461f99837a5b1d
```

For some reasons, I didn't really like using uuid:

- it is a rather long string of 32 characters, and I will need to show it to users sometimes
- 6 bits in uuid4 are not used, they are constants, wasteful

constant bits in uuid4:

```
uuid.uuid4().bytes[6] & 0b11110000 # == 64
uuid.uuid4().bytes[8] & 0b11000000 # == 128
```

I decided to explore other options. Guid implementations like ulid, ksuid have some popularity. But they did not suit me either, mainly due to the fact that they include a timestamp.

My choice is completely random 128-bit identifiers, which I encode to a base62 string. Why base62? They contain only letters and numbers, so they can be selected by double-clicking (and, for example, base64 is not). Identifiers have become shorter, now they are 22-character strings, for example:

```
4xT8QKx8f3BwZP06VKSEMy
```

But there is a problem! Encoding to a number system that is not a power of two is a rather slow process, due to the fact that it is involves the division operation. For decoding (base62 to u128) there is no such problem, since there will be no division, only multiplication and addition.

So, first let's try the naive encoder implementation written in Rust:

```
const BASE62_N: usize = 62;
const BASE62_DIGITS: &[u8; BASE62_N] =
b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
// how many base62 numbers do we need to store u128?
// you can calculate this number like this, python: math.ceil(math.log(2**128, 62))
const U128_BASE62_ENCODED_LEN: usize = 22;
pub fn u128_to_base62_naive(mut n: u128) -> String {
// padding occurs from lower digits to higher ones, unfilled high digits will remain zero
let mut b62_str = vec![b'0'; U128_BASE62_ENCODED_LEN];
let mut i = U128_BASE62_ENCODED_LEN;
while n > 0 {
i -= 1;
let digit_index = (n % BASE62_N as u128) as usize;
b62_str[i] = BASE62_DIGITS[digit_index];
n /= BASE62_N as u128;
}
// the variable is guaranteed to contain a valid utf-8 string, there is no need to additionally check it
unsafe { String::from_utf8_unchecked(b62_str) }
}
```

In fact, this is a conversion from a decimal number system to a number system with base 62. To do this, we divide the original number by a new base into which we translate, the remainder of the division is the next digit in the new number system, we divide the quotient again, and so on in a cycle .

Let's look at the speed of the algorithm, the benchmark result is (i5-4690 3.50GHz): `bench_u128_to_base62_naive 1000000: 32.86Mib/s, 2153250.28/s`

Sad numbers, right? For comparison, base64 encoder works at `309.35Mib/s`

, hex encoder `645.34Mib/s`

.

Why is that? Since 62 is not a power of two, for conversion we have to re-divide the entire 128-bit number on each cycle, each bit of the original number affects each subsequent digit of the result. Let's look at the assembler generated by the Rust compiler godbolt, here you can see that two functions are used for division:

```
__umodti3
__udivti3
```

These are the functions implemented in runtime (llvm) for division of 128-bit numbers (designations "u" - unsigned, "ti" - 128 bit). And of course they will be relatively slow.

We need to simplify work for the processor. As you know, any number in the base `BASE`

can be represented as: `HIGH * BASE^R + LOW`

, for example decimal `1111222 = 1111 * 10^3 + 222`

, what does this give us? We divide both parts by `10^3`

by integer division, we get `1111 = 1111`

, the remainder of the division will be `222 = 222`

, so this representation allows us, when converting between bases, to divide not the whole number but only its part. As you probably already guessed, we will replace the division of 128-bit numbers with the division of 64-bit numbers.

So, first we need to choose the power to which we will raise `62`

, the optimal one will be `R`

, under which the condition `62^R <= 2^64 < 62^(R+1)`

is satisfied, it can be calculated by the formula:

```
math.floor(math.log(2**64, 62)) # == 10
```

Now let's write a new implementation in Rust, it looks like this:

```
const BASE62_N: usize = 62;
const BASE62_DIGITS: &[u8; BASE62_N] =
b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
const U128_BASE62_ENCODED_LEN: usize = 22;
const BASE62_POW_10: u128 = (BASE62_N as u128).pow(10);
pub fn u128_to_base62(mut n: u128) -> String {
let mut b62_str = vec![b'0'; U128_BASE62_ENCODED_LEN];
// `nlow` will generate the lower 10 digits in base62
let mut nlow = (n % BASE62_POW_10) as u64;
let mut i = U128_BASE62_ENCODED_LEN;
// convert number in three steps:
// 0000000000001111111111
// + 0022222222220000000000
// + 3300000000000000000000
// = 3322222222221111111111
for _ in 0..2 {
// here is the same algorithm for converting a number to a new base, but now using 64-bit division
for _ in 0..10 {
i -= 1;
let digit_index = (nlow % BASE62_N as u64) as usize;
b62_str[i] = BASE62_DIGITS[digit_index];
nlow /= BASE62_N as u64;
}
// move on to the next block of digits
n /= BASE62_POW_10;
nlow = (n % BASE62_POW_10) as u64;
}
// finish by converting the 2 remaining digits
for _ in 0..2 {
i -= 1;
let digit_index = (nlow % BASE62_N as u64) as usize;
b62_str[i] = BASE62_DIGITS[digit_index];
nlow /= BASE62_N as u64;
}
unsafe { String::from_utf8_unchecked(b62_str) }
}
```

Let's look at the result assembler godbolt, now, the `div`

instruction is used for 64-bit division. And in optimized build it will be replaced by multiplication and shifts.

As a result it was `__umodti3 x 22, __udivti3 x 22`

, became `__umodti3 x 3, __udivti3 x 2, div x 44`

. This is what the benchmark shows: `bench_u128_to_base62 1000000: 112.18Mib/s, 7351959.26/s`

. Speed up by 3.5 times!

Ready.

*and yes, I've tried to replace the u64 division with the u32 division, there was no significant increase in performance, it was about 2% faster*

## My links

I'm making my perfect todo, note-taking app Heaplist, you can try it here heaplist.app

And I have a twitter @rsk

## Top comments (1)

worth noting, I'm considering using base57 encoding for guids, it still will be 22 characters, but uses shorter alphabet, which will allow to remove ambiguous symbols.