Deep dive into the binary algorithm of Protocol Buffers

#binary #protocol #encoder #decoder

When a human-understandable format is not a priority, it is best to think binary. Protocol Buffers, also know as protobufs or protos, is an open-source interface description language originally developed by Google and a library that allows JSON-like data messages to be transmitted over the wire without unnecessary ballast. Today, it is most relevant in the context of gRPC, where RPC server and client code for arbitrary programming languages is generated based on Protocol Buffers descriptions.

Protocol Buffers were developed primarily with the goal of speeding up the transmission of strongly typed key-value message objects over the network, which in turn means reducing the amount of data that needs to be transmitted over the wire from A to B. In this article, I focus on the process of encoding and decoding data, i.e. at the wire protocol level.

REST and RPC are two concepts that are now considered a kind of de facto way of developing APIs in web development. Communication between the client and the server is mostly about transferring data in JSON format. This is user-friendly but highly suboptimal at the network level. So people have developed compression mechanisms like that and others, but if you really want to optimize something, you have to start from scratch at the network layer and work your way up from there to the user part, not the other way around.

There are a variety of alternatives to sending JSON-like data such as Apache Thrift, Ion, the Bond protocol (Microsoft), Apache Avro, SBE, Bincode, CBir, MsgPack, Cap'n Proto, Flatbuffers and others could be enumerated. All of these are solutions to the same problem, at least at the network level. In terms of strongly-typed messages over the wire, Protocol Buffers are one of the most optimized protocols and are also growing in popularity, especially in the world of Kubernetes and similar communities. In fact, it is a good and popular solution, which makes it an obvious choice.

Protocol Buffers is a simple protocol that can be explained on a plain sheet of paper. The documentation is quite loosely formulated but does not answer all the questions you might encounter during implementation. Protocol Buffers is, fortunately, open source and any doubts can be cleared by looking at the source code written by the main authors.

The encoder and decoder convert messages from input keys to a shrunken binary format and back. Property names are represented in the Protocol Buffers by unique numbers rather than strings. Compared to the raw JSON format, this already has a significant impact on the final size of the message that is then sent over the wire.

+-------------------+------------------+-------------------+
+      1. JSON      +   2. Transform   +     3. Encode     + 
+-------------------+------------------+-------------------+
+ {                 +                  +                   +
+   "name": "John", + 1, John          + 0a 04 4a 6f 68 6e +
+   "age": 35       + 2, 35            + 10 23             +
+ }                 +                  +                   +
+-------------------+------------------+-------------------+
+      6. JSON      +    5. Rebuild    +     4. Decode     + 
+-------------------+------------------+-------------------+

In addition, Protocol Buffers cover 4 wire types, allowing 18 data types to be represented.

Type	Meaning	Used For
0	Varint	int32, int64, uint32, uint64, sint32, sint64, bool, enum
1	64-bit	fixed64, sfixed64, double
2	Length-delimited	string, bytes, embedded messages, packed repeated fields
5	32-bit	fixed32, sfixed32, float

The encoder converts the message into a binary format. The message is then represented on the wire as a kind of flattened sequence of encoded key-value properties. The key and the value are encoded separately. Each wire type has its own rules and therefore its own way of encoding.

[key1][value1][key2][value2] ... [keyN][valueN]

The key is encoded as a uint32 varint type, and in the last 3 bits (T) contains the wire type. The key's field tag can thus be between 1 and 2^29 - 1 = 536,870,911 (0 is not a valid tag number).

tag = 12345 (unsigned 32-bit), type = 1 (Varint)

11001000 10000011 00000110 ... on the wire
CNNNNNNN CNNNNNNN CNNNNTTT ... bits per type

C = Contiunation, X = Number, T = Type

Varints are a method for serializing integers with one or more bytes. The algorithm used here is known as LEB128. All bytes except the last have the most significant bit (MSB) set (C), so that the decoder can determine where the value ends. The other 7 bits (N) of each byte are intended to represent the number.

LEB128 is an algorithm for encoding integers of arbitrary length in which the bytes are arranged in a little-endian sequence. However, the Protocol Buffers limit the size of the numbers to the supported data types.

value = 150 (unsigned 32-bit)

Standard varint encoding:
   XXXXXXXX 10010110 ... Number 150 in bytes.
   X0000001 X0010110 ... Split to 7-bit sequence.
   X0010110 X0000001 ... Revert the array of bytes.
   10010110 00000001 ... Add MSB (1=continuation, 0=last byte).

Standard varint decoding:
   10010110 00000001 ... Encoded number.
   00000001 10010110 ... Revert the array of bytes.
   X0000001 X0010110 ... Remove MSB.
   XXXXXXXX 10010110 ... Merge bits together (number 150 in bytes).

There is a big difference between signed integer types (sint32 and sint64) and the "standard" integer types (int32 and int64). If you use int32 or int64 as the type for a negative number, the result is always ten bytes long, which makes a very large unsigned integer. In case you know that the value will most likely be negative, you can optimize the result and use one of the signed types, where the resulting varint uses ZigZag encoding for efficiency. Essentially, this means that the positive and negative integers are zigzagged through so that -1 is encoded as 1, 1 as 2, -2 as 3, and so on.

value = -12345 (signed 32-bit)

Signed 32-bit varint encoding:
   -12345 ... Unsigned 32-bit integer.
    24689 ... ZigZag value using (value << 1) ^ (value >> 31).
          ... Continue with the standard varint encoding.
Signed 32-bit varint decoding:
          ... Start with the standard varint decoding.
    24689 ... ZigZag value using (value >> 1) ^ -(value & 1).
   -12345 ... Unsigned 32-bit integer.

value = -54321 (signed 64-bit)

Signed 64-bit varint encoding:
   -54321 ... Unsigned 64-bit integer.
   108641 ... ZigZag value using (val << 1) ^ (val >> 63).
          ... Continue with the standard varint encoding.
Signed 64-bit varint decoding:
          ... Start with the standard varint decoding.
   108641 ... ZigZag value using (value >> 1) ^ -(value & 1).
   -54321 ... Unsigned 64-bit integer.

Numbers can also be represented by the wire types 1 or 5. These are 32-bit data types float, fixed32 and sfixed32 and 64-bit data types double, fixed64 and sfixed64. These numbers are simply represented by bytes in little-endian byte order (so in a reversed order).

value = 12345 (signed 32-bit)

Fixed-size encoding:
   00000000 00000000 00110000 00111001 ... Value in (big-endian) bytes.
   00111001 00110000 00000000 00000000 ... Reverse bytes to little-endian order.
Fixed-size decoding:
   00111001 00110000 00000000 00000000 ... Encoded value in (little-endian) order.
   00000000 00000000 00110000 00111001 ... Value in (big-endian) bytes.

What about wire type 2? "Length-delimited" means that the value is a varint encoded length, followed by the specified number of bytes of data. This describes strings, embedded messages (nested objects), and raw bytes data type.

value = "foo"

Length-delimited encoding:
   00000011 XXXXXXXX XXXXXXXX XXXXXXXX ... Encode message size (3 bytes) as standard 32-bit varint.
   00000011 01100110 01101111 01101111 ... Append string (foo) in bytes.

Finally, let us mention repeated fields. These represent arrays of one data type. There are no special encoding rules other than the rule that each element in an array is sent as a separate key-value encoded stream, with all of these fields having the same tag number.

Well, not everything above is completely true. If the array contains elements of numeric types, the array can be shrunk into what is called a "packed" encoding, where the key is sent only once, followed by the encrypted numbers in sequence. Do you remember how numbers are encoded? For numbers, this is possible because the decoder can always determine where a single number ends. With strings and related data types, this is not possible.

Packed repeated fields:
   [key][value1][value2]...[valueN]

Unpacked repeated fields:
   [key][value1][key][value2]...[key][valueN]

As you can see, Protocol Buffers deals in-depth with optimizing the representation of data types on the wire so that as little data as possible is transmitted between client and server.

As with most of my blogs, these are again my personal notes and insights I gained while implementing protocol buffers in Rust. I hope they will be of use to others as well. The library is again released as an open-source package and the source code is available on GitHub.

DEV Community

Deep dive into the binary algorithm of Protocol Buffers

Top comments (0)