Shalvah

Posted on Jul 22, 2023 • Edited on Aug 18, 2023 • Originally published at blog.shalvah.me

Packing and unpacking bytes

#data #encoding

Goal: A brief exploration of what it means to "pack" and "unpack" bytes.

Inspiration

I've come across Ruby's Array#pack and String#unpack methods, but never had the time to dive into them. While researching another article, I came across this question and decided to stop to explore it.

Exploration 1: Packing into two bytes

I can't define "packing" precisely, but I've gathered that it's a term for representing a series of bytes as a string. Depending on how you do it, the result can even take fewer bytes than the original. Unpacking is the reverse: recovering the original information.

Let's try an example based on the Stack Overflow question. I have a bunch of bytes, i.e. values between 0 (00000000) and 255 (11111111). Suppose I take two at random, say 126 and 2.

let [a, b] = [126, 2]

console.log(a.toString(2).padStart(8, '0'))  // 01111110
console.log(b.toString(2).padStart(8, '0'))  // 00000010

I could represent them in a string by using JavaScript's hexadecimal escape sequence (\xNN):

console.log(a.toString(16).padStart(2, '0'))  // 7e
console.log(b.toString(16).padStart(2, '0'))  // 02

console.log('\x7E\x02') // "~" followed by an unprintable control character (U+0002)

However, this isn't what I want, as this string has two characters. JavaScript strings are UTF-16 [note 1], so this string has 4 bytes, which is more than the original.

Buffer.from('\x7E\x02', 'utf16le').byteLength
// 4

This string has two characters of two bytes each: 00 7e and 00 02. I want to pack the bytes so the string has only one character, 7e 02. Here's how:

let char = String.fromCharCode((a << 8) | b)
console.log(char); // "縂"
Buffer.from(char, 'utf16le').byteLength // 2

This is a bit of bit arithmetic (haha).

a << 8 means "shift the bits in a left 8 times"
shifting 126 (01111110) left 8 times gives us 01111110 00000000
| b is a bitwise OR operation
01111110 00000000 ORed with 2 (00000010) gives 01111110 00000010, which is what I want (7E 02)
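
The steps above can be wrapped in a small helper. This is only a sketch: packPair is a hypothetical name (not from the Stack Overflow question), and the range check is my own addition so that values outside 0 to 255 don't silently spill into the other byte.

```javascript
// Hypothetical helper: pack two bytes into a single UTF-16 code unit.
function packPair(hi, lo) {
  for (const byte of [hi, lo]) {
    if (!Number.isInteger(byte) || byte < 0 || byte > 255) {
      throw new RangeError(`Expected a byte (0-255), got ${byte}`);
    }
  }
  // hi occupies the upper 8 bits, lo the lower 8 bits
  return String.fromCharCode((hi << 8) | lo);
}

const packedPair = packPair(126, 2);
console.log(packedPair.length); // 1
console.log(packedPair.charCodeAt(0).toString(16)); // 7e02
```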

So there it is. I started with two bytes, and was able to fit them into a 2-byte character [note 2]. How about unpacking? Some more bitwise magic.

let bytes = char.charCodeAt(0)
let byteA = bytes >> 8 // Shift the bits to the right 8 times to get the first byte
let byteB = bytes & 0xFF // Bitwise AND the bits with 11111111 to keep only the second byte
// Alternative:
// byteB = bytes ^ (byteA << 8)
console.log(byteA, byteB) // 126 2

Cool, cool.
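
The same idea extends to more than two bytes. Here's a rough sketch of a round trip over a whole array; packBytes and unpackBytes are hypothetical names, and I'm assuming the input has an even number of values, each already in the 0 to 255 range.

```javascript
// Hypothetical helpers: pack an even-length byte array into a string,
// two bytes per UTF-16 code unit, and unpack it again.
function packBytes(bytes) {
  if (bytes.length % 2 !== 0) {
    throw new RangeError('Expected an even number of bytes');
  }
  let out = '';
  for (let i = 0; i < bytes.length; i += 2) {
    out += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
  }
  return out;
}

function unpackBytes(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    bytes.push(unit >> 8, unit & 0xff); // first byte, then second byte
  }
  return bytes;
}

const originalBytes = [126, 2, 0, 65, 255, 254];
const packedAll = packBytes(originalBytes);
console.log(packedAll.length); // 3
console.log(unpackBytes(packedAll)); // [ 126, 2, 0, 65, 255, 254 ]
```

One caveat: some byte pairs produce code units in the surrogate range (0xD800 to 0xDFFF). The round trip above still works in memory, but such a string may not survive being encoded as text elsewhere.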

I also found out you can do this packing natively with the TextDecoder API! [note 3]

let byteArray = new Uint8Array([a, b])
let packedStr = new TextDecoder('utf-16be').decode(byteArray)
console.log(packedStr) // "縂"

However, unpacking with TextEncoder gives wrong results for this use case, since it only supports UTF-8:

let unpackedArray = new TextEncoder().encode(packedStr)
console.log(unpackedArray) // Uint8Array [231, 184, 130]
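
Since TextEncoder won't produce UTF-16, one workaround is to write the code units out by hand. The sketch below uses a DataView to store each code unit in big-endian order; stringToUtf16BeBytes is a hypothetical helper name, not a built-in.

```javascript
// Hypothetical helper: get the UTF-16 big-endian bytes of a string.
function stringToUtf16BeBytes(str) {
  const out = new Uint8Array(str.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < str.length; i++) {
    // The third argument (false) selects big-endian byte order
    view.setUint16(i * 2, str.charCodeAt(i), false);
  }
  return out;
}

const packedChar = String.fromCharCode(0x7e02);
console.log(Array.from(stringToUtf16BeBytes(packedChar))); // [ 126, 2 ]
```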

Exploration 2: Packing into one byte

Speaking of UTF-8, it's time to try that. But I'm changing some things:

I won't use JS here, since its strings are UTF-16. I probably can use it, but I don't want that headache. Plus, I love any excuse to work with Ruby.
All the bytes I'll pack are in the range 0 to 15. I've intentionally made it smaller so that I can pack two bytes into one UTF-8 character (one byte). I'll use 13 and 2 as my test bytes.

Packing in Ruby is pretty similar:

a, b = 13, 2

puts a.to_s(2).rjust(8, '0') # 00001101
puts b.to_s(2).rjust(8, '0') # 00000010

# hex
puts a.to_s(16).rjust(2, '0') # 0d
puts b.to_s(16).rjust(2, '0') # 02

char = ((a << 4) | b).chr # Shift by 4 bits, not 8, since I'm now packing in one byte
puts char # => "\xD2"
puts char.length # => 1
puts char.bytes.length # => 1

bytes = char[0].ord
byteA = bytes >> 4
byteB = bytes & 0x0F # AND with 0F, not FF, since I'm splitting up one byte
puts byteA, byteB # 13, 2

The output string here is a single byte "\xD2"...which is simply the original 0D and 02 bytes packed together 😀 Unfortunately, the lone byte D2 isn't a valid UTF-8 sequence, so printing it shows �, but it's there.
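
For comparison, the same 4-bit packing can be sketched in JavaScript if I stay with plain numbers instead of strings, which sidesteps the UTF-16 issue. packNibbles and unpackNibbles are hypothetical names, and the 0 to 15 check is my own addition.

```javascript
// Hypothetical helpers: pack two 4-bit values (0-15) into one byte and back.
function packNibbles(high, low) {
  for (const value of [high, low]) {
    if (!Number.isInteger(value) || value < 0 || value > 15) {
      throw new RangeError(`Expected a value in 0-15, got ${value}`);
    }
  }
  return (high << 4) | low; // high goes in the upper 4 bits
}

function unpackNibbles(byte) {
  return [byte >> 4, byte & 0x0f];
}

const packedByte = packNibbles(13, 2);
console.log(packedByte.toString(16)); // d2
console.log(unpackNibbles(packedByte)); // [ 13, 2 ]
```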

As mentioned earlier, Ruby has built-in pack and unpack methods, but with the 'c*' directive they map each value to a whole byte, so I couldn't use them for this example.

packed = [a, b].pack('c*') # => "\r\x02"
packed.unpack('c*') # => [13, 2]

But they work with the original UTF-16 example:

a, b = 126, 2
packed = [a, b].pack('c*') # => "~\x02"
packed.unpack('c*') # => [126, 2]

It may not look like it, but the packed version here ("~\x02") is exactly the same as my manually packed JavaScript version. It contains the exact two bytes, 7E 02. The difference is the encoding: Ruby doesn't treat this string as UTF-16 (pack returns a binary string here), so it's rendered byte by byte. But I can change the encoding and see for myself!

packed.force_encoding 'utf-16be' # => "\u7E02" 
packed.length # => 1
packed.bytes.length # => 2

Possible uses of packing

Why would you want to pack, though? I'm thinking, perhaps in a constrained environment like gaming over the Internet. If there is a limited number of possible buttons a player can press (say 12), instead of transmitting each button press as one byte, I could:

wait for a few milliseconds, to gather the next few keypresses and send in a batch
pack these keypresses into a byte. 12 possible buttons can fit in 4 bits (2^4 = 16), so two keypresses can go in one byte (8 bits).

In this, packing serves as a form of compression, to send less data over the network and improve the gaming experience (less data to download, so responses can be faster).
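
Here's a rough sketch of how that batching could look. Everything in it is an assumption for illustration: the function names, the idea of numbering buttons from 0, and the one-byte count prefix (which I added so that an odd number of keypresses can be told apart from a trailing button 0).

```javascript
// Hypothetical packet layout: [count, then two button ids per byte].
function packKeypresses(buttons) {
  if (buttons.length > 255) {
    throw new RangeError('At most 255 keypresses fit in one batch');
  }
  const out = new Uint8Array(1 + Math.ceil(buttons.length / 2));
  out[0] = buttons.length;
  buttons.forEach((button, i) => {
    if (!Number.isInteger(button) || button < 0 || button > 15) {
      throw new RangeError(`Expected a button id in 0-15, got ${button}`);
    }
    // Even positions go in the upper 4 bits, odd positions in the lower 4
    out[1 + (i >> 1)] |= i % 2 === 0 ? button << 4 : button;
  });
  return out;
}

function unpackKeypresses(packet) {
  const buttons = [];
  for (let i = 0; i < packet[0]; i++) {
    const byte = packet[1 + (i >> 1)];
    buttons.push(i % 2 === 0 ? byte >> 4 : byte & 0x0f);
  }
  return buttons;
}

const presses = [3, 11, 0, 7, 5];
const packet = packKeypresses(presses);
console.log(packet.length); // 4
console.log(unpackKeypresses(packet)); // [ 3, 11, 0, 7, 5 ]
```

With the count byte included, five keypresses take 4 bytes instead of 5; the saving approaches one half as the batch grows.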

I also found this question, from a user who wanted to send a UUID as binary data. This is a valid use, since UUIDs are often rendered as strings, but they're actually a sequence of 16 bytes. Sending them as a string would take 36 bytes, so packing is useful here. You could also do this for other "binary-but-look-like-strings" data, like SHA-512 hashes for instance.
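
As a quick illustration of the UUID case, here's a minimal sketch that converts the 36-character string form into 16 bytes and back. The helper names are hypothetical and the sample UUID is just an arbitrary example value.

```javascript
// Hypothetical helpers: convert a UUID string to its 16 raw bytes and back.
function uuidToBytes(uuid) {
  const hex = uuid.replace(/-/g, '');
  if (!/^[0-9a-fA-F]{32}$/.test(hex)) {
    throw new TypeError('Expected a UUID with 32 hex digits');
  }
  const bytes = new Uint8Array(16);
  for (let i = 0; i < 16; i++) {
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}

function bytesToUuid(bytes) {
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join('-');
}

const sampleUuid = '123e4567-e89b-12d3-a456-426614174000';
const uuidBytes = uuidToBytes(sampleUuid);
console.log(sampleUuid.length, uuidBytes.length); // 36 16
console.log(bytesToUuid(uuidBytes) === sampleUuid); // true
```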

Let me know if you can think of any other uses.

Notes

1. The ECMAScript spec says:

When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.

So JS strings are UTF-16. However, many modern Web APIs, like Blob and TextEncoder, and even older Node.js ones like Buffer assume (or accept only) UTF-8. My guess is that they expect the string to be from the outside world (reading a file, an API response, etc), in which case, it's most likely UTF-8.

2. The only reliable way I found to get the byte length of a native JS string (UTF-16) is Buffer.from(string, 'utf16le').byteLength. Commonly suggested ways I found include TextEncoder and Blob, but they always assume UTF-8.
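
A small sketch comparing the options, assuming Node.js (Buffer is not available in browsers). Since every UTF-16 code unit is two bytes, multiplying the string's length by two should give the same answer as Buffer; utf16ByteLength is a hypothetical helper name.

```javascript
// Hypothetical helper: each UTF-16 code unit takes exactly two bytes.
function utf16ByteLength(str) {
  return str.length * 2;
}

const sampleChar = String.fromCharCode(0x7e02);
console.log(utf16ByteLength(sampleChar)); // 2
console.log(Buffer.byteLength(sampleChar, 'utf16le')); // 2
console.log(new TextEncoder().encode(sampleChar).length); // 3 (UTF-8)
```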

3. For this to work as expected, I had to specify UTF-16 Big Endian (utf-16be) as the encoding. UTF-16 because I want 2 bytes per character, and big-endian because I want the most significant ("big") byte to come first, like I did in the custom packer.
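
To see the effect of byte order without depending on which encodings TextDecoder supports, here's a small sketch that reads the same two bytes both ways with a DataView. The variable names are my own.

```javascript
// Read the same two bytes as a 16-bit number in both byte orders.
const twoBytes = new Uint8Array([0x7e, 0x02]);
const orderView = new DataView(twoBytes.buffer);

const asBigEndian = orderView.getUint16(0, false); // first byte is most significant
const asLittleEndian = orderView.getUint16(0, true); // first byte is least significant

console.log(asBigEndian.toString(16)); // 7e02
console.log(asLittleEndian.toString(16)); // 27e
```

Only the big-endian reading matches (a << 8) | b from the custom packer.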
