wayofthepie

Posted on Jun 21, 2020

Three bytes to an integer

#rust

This post is a quick overview of how I turned an initially poor solution to a problem into a better one, and the thought process behind it.

The problem

I'm working on pulling information out of the Certificate Transparency Logs. The logs are made up of a JSON list of log entries. A single entry is a JSON object with two fields, leaf_input and extra_data. The leaf_input field contains base64 encoded binary data. I want to pull information out of this binary data.

To outline the structure of the binary data I'll use the following notation, indicating byte numbers:

[n] - indicates byte n, so [0] is the first byte.
[n..=m] - indicates a range of bytes, inclusive, so [2..=4] means bytes 2, 3 and 4.
[n..] - indicates the remaining bytes in the binary data after and including index n.

[0] [1] [2..=9] [10..=11] [12..=14] [15..]
 |   |     |        |         |      |
 |   |     |        |         |      |- (rest) the rest of the bytes
 |   |     |        |         |      
 |   |     |        |         |- (length) the length of the certificate      
 |   |     |        |               
 |   |     |        | - the log entry type, not relevant in this post              
 |   |     |                       
 |   |     | - timestamp, not relevant in this post                       
 |   |                            
 |   | - signature type, not relevant in this post                       
 |                               
 | - version, not relevant in this post

The relevant parts here are the length (bytes [12..=14]) and the rest (indicated by [15..] above). rest contains an X.509 certificate we want, and the three bytes in the length define an integer which tells us how many bytes that certificate makes up in rest. Once we read length bytes from rest, the remainder of rest is data that we do not need.

A solution

So we need to turn the three bytes which make up length into an integer. An integer whose size is 3 bytes would be a 24-bit int. Rust does not support 24-bit ints. The closest is a u32, a 32-bit unsigned int. You can't just turn 3 bytes into a u32 however. The first solution I came up with was something like the following code.

// "bytes" is the full binary data inside the "leaf_input" field
fn get_cert_length(bytes: &[u8]) -> u32 {
    let length_bytes_slice = &bytes[12..=14];
    let mut length_bytes = b.iter().cloned().collect::<Vec<u8>>();
    length_bytes.reverse();
    length_bytes.push(0);
    return u32::from_le_bytes(length_bytes.as_slice().try_into().unwrap());
}

This is not great! We allocate a new Vec and we have to pad it with a zero to make it 4 bytes in length. Because we cannot push a value onto the start of a Vec we first reverse it. Then we convert it into an array of type [u8; 4] - this is what length_bytes.as_slice().try_into().unwrap() does. Finally, we read the bytes in little endian format as a u32. Way too much happening here for what on the surface seems like a simple conversion!

I thought about it some more and came up with a better solution.

A better solution

First, here is an outline of my thought process. Let's say bytes 12 to 14 look as follows, in hex notation:

[00, 05, 4c]

Then the value for the length is 00054c in hex, which is 1356 in decimal. 00054c is 000000 + 000500 + 00004c. If we take the bytes 00, 05 and 4c individually, how do we get 00054c?

Well, each time we move a number in the hex string one space to the left, we multiply the number by a power of 16. For example, moving 000005, which is 5 in decimal, to 000050, which is 80 in decimal, we had to multiply 5 by 16. If we move it two places, we would need to multiply it by 256, which is 16 squared, so 000005 to 000500 (1280 in decimal) is 5 * 256. Concretely, to convert 000500 hex to decimal:

\begin{aligned} 000500 &= (0 * 16 ^5) + (0 * 16 ^4) + (0 * 16 ^3) + (5 * 16 ^2) + (0 * 16 ^1) + (0 * 16 ^0) \\ &= 5 * 16^2 \\ &= 1280 \end{aligned}

With this knowledge we can update get_cert_length as follows:

fn get_cert_length(bytes: &[u8]) -> u32 {
    (bytes[12] as u32 * 65536) + (bytes[13] as u32 * 256) + bytes[14] as u32
}

Much cleaner than the original version! The values we multiply each byte with are always a power of 2 so we can also use bit-shifts instead of the multiplications:

fn get_cert_length(bytes: &[u8]) -> u32 {
    ((bytes[12] as u32) << 16) + ((bytes[13] as u32) << 8) + bytes[14] as u32
}

I ended up using the bit-shift version in the end.

Latest comments (7)

Juarez Rudsatz • Jun 25 '20

The compiler is able to optimize both forms of get_cert_length to same code.

Check this playground

playground::get_cert_length_mul:
    pushq   %rax
    cmpq    $13, %rsi
    jb  .LBB8_4
    je  .LBB8_5
    cmpq    $15, %rsi
    jb  .LBB8_6
    movzbl  12(%rdi), %eax
    movzbl  13(%rdi), %esi
    movzbl  14(%rdi), %edx
    movl    %eax, %edi
    popq    %rax
    jmp playground::get2_cert_length_mul

playground::get_cert_length_shift:
    pushq   %rax
    cmpq    $13, %rsi
    jb  .LBB9_4
    je  .LBB9_5
    cmpq    $15, %rsi
    jb  .LBB9_6
    movzbl  12(%rdi), %eax
    movzbl  13(%rdi), %esi
    movzbl  14(%rdi), %edx
    movl    %eax, %edi
    popq    %rax
    jmp playground::get2_cert_length_mul

playground::get2_cert_length_mul:
    shll    $16, %edi
    shll    $8, %esi
    leal    (%rsi,%rdi), %eax
    addl    %edx, %eax
    retq

wayofthepie • Jun 25 '20

Yip that makes sense. The shift vs multiply versions were more about which one looks less magical 😄

Duncan Freeman • Jun 24 '20

Just FYI, the compiler will try to optimize operations like multiplications into bit shifts if it can (godbolt.org/z/9Pgztq), so type out what's clearest first! (In this case, it's probably the bit shifts anyway).

wayofthepie • Jun 24 '20

Agreed, that's why I went with the bit shifts. Shifting by 16 is a little less magical than multiplying by 65536 😄 neither are great in this case though and I went with what @lonami mentioned.

I need to drill deeper into the standard lib apis for sure!

Lonami • Jun 24 '20 • Edited

There was recently a post in URLO about this, Custom u24 type. Essentially, another option was to do the following:

u32::from_be_bytes([bytes[12], bytes[13], bytes[14], 0])

Or if you don't want to type three different indices:

let mut buffer = [0u8; 4];
buffer[..3].copy_from_slice(&bytes[12..15])
u32::from_be_bytes(buffer)

wayofthepie • Jun 24 '20 • Edited

Wow! That's even better, didn't realize you could do that, copy_from_slice, thanks. Not sure why I didn't even think of the first one 😄

Lonami • Jun 24 '20

After searching around in URLO, I also found this neat post on The byte order fallacy which is mildly relevant. By using from_le_bytes or from_be_bytes, we can see the byte order in a concise way and it would be trivial to change too (as opposed to bit-shifts which require a tiny bit more of work and thought).