Kostas Kalafatis

Posted on Feb 17, 2023 • Originally published at dfordebugging.wordpress.com

gRPC Demystified - Protobuf Encoding

#programming #tutorial #architecture

In this post, we'll go over the protocol buffer wire format, which specifies the exact way your message is transmitted over the wire and how much room it takes up on your computer's hard drive. This isn't something you need to know in order to use protocol buffers in your application, but it could come in handy when trying to optimize things.

Encoding a simple message

The following is a very simplified definition of a message:

message Message1 {
    optional int32 a = 1;
}

For this example, we'll create a Message1 message in the app and set its parameter to 150. The next step is to serialize the message and write the results to a stream. The encoded message consists of three bytes, and here they are:

08 96 01

Clearly, this is a very abstract and numerical concept at this point, but what does it mean? You'd get a value around 1: 150 if you dumped those bytes with the Protoscope program. Where does it get that information that this is what the message is about?

Base128 Varints

The wire format is built around variable-width integers, or varints. They enable the encoding of unsigned 64-bit integers using one to ten bytes, with smaller values requiring fewer bytes.

Each byte in the varint has a continuation bit that indicates whether or not the byte after it is a part of the varint. This is the byte's most significant bit (MSB) (sometimes also called the sign bit). The lower 7 bits are a payload; the resulting integer is constructed by appending the 7-bit payloads of its constituent bytes together.

So, for example, here is the number one encoded as 01 - because it is a single byte, the MSB is not set:

And 150 is 9601 in binary - It gets a little trickier here:

How do you know this equals 150? First, you take the most significant bit (MSB) out of each byte, since it's only there to tell us if we've reached the end of the number (as you can see, it's set in the first byte because the variable has more than one byte). Then we put the 7-bit payloads together and read it as a little-endian, 64-bit unsigned integer:

Message Structure

A message in a protocol buffer is a list of key-value pairs. The key for the binary version of a message is just the field number. The name and declared type of each field can only be found by looking at the message type's definition (i.e. the .proto file). Protoscope doesn't have access to this information, so it can only give the field numbers.

When a message is encoded, each key-value pair is turned into a record with a field number, a wire type, and a payload. The size of the payload is told to the parser by the wire type. This lets old parsers skip over fields they don't know how to read. Tag-Length-Value, or TLV, is sometimes used to describe this kind of plan.

There are six wire types: VARINT, I64, LEN, SGROUP, EGROUP, and I32

ID	Name	Used For
`0`	`VARINT`	`int32`, `int64`, `uint32`, `uint64`, `sint32`, `sint64`, `bool`, `enum`
`1`	`I64`	`fixed64`, `sfixed64`, `double`
`2`	`LEN`	`string`, `bytes`, `embedded messages`, `packed repeated files`
`3`	`SGROUP`	`group start`(deprecated)
`4`	`EGROUP`	`group end`(deprecated)
`5`	`I32`	`fixed32`, `sfixed32`, `float`

Using the formula (field number << 3) | wire type, the "tag" of a record is encoded as a variable made up of the field number and the type of wire. In other words, after decoding the variable that represents a field, the low 3 bits tell us the type of wire, and the rest of the integer tells us the number of the field.

Let's take another look at our simple example. You now know that the first number in a stream is always a variable key. In this case, that number is 08, or (without the MSB):

000 1000

You take the last three bits to get the wire type (0), and then move three bits to the right to get the field number (1). A tag in Protoscope is made up of an integer, a colon, and the type of wire, so we can write the above bytes as 1:VARINT. Since the wire type is 0, or VARINT, we know that to get the payload, we need to decode a varint. As we saw above, the bytes 9601 vary-decode to 150, which gives us our record. We can write it as 1:VARINT 150 in Protoscope.

More Integer Types

Bools and Enums

Both Bools and Enums are encoded like they are int32s. Particularly, Bools always encode as either 00 or 01. false and true are other names for these byte strings in Protoscope.

Signed Integers

As we saw in the last section, all of the protocol buffer types for wire type 0 are encoded as variables. Varints, on the other hand, are not signed, so negative integers are encoded differently in sint32 and sint64 than in int32 and int64.

The intN types store negative numbers as two's complement, which means that their highest bit is set because they are unsigned 64-bit integers. So, this means that each of the ten bytes must be used. For example, when protoscope sees -2, it changes it to

11111110 11111111 11111111 11111111 11111111
11111111 11111111 11111111 11111111 00000001

This is the two's complement of 2. This is written as 0 - 2 + 1 in unsigned math, where 0 is the 64-bit integer with all ones. This is a good way to figure out why it makes so many ones.

Calculating the complement

Every computer I know uses two's complement to represent integers. To get the two's complement negative notation of an integer, write it out in binary. You then invert the digits and add one to the result.

Assume we're working with 8-bit numbers (for simplicity's sake) and want to know how to express -28 in two's complement notation. We begin by writing 28 in binary form:

00011100

The digits are then inverted. 1 becomes 0, and 0 becomes 1.

111000011

Then we add 1.

11100100

That is how -28 would be written in 8 bit binary.

sintN, on the other hand, encodes negative integers using the "ZigZag" encoding rather than two's complement. Positive integers n are represented as 2-n (the even numbers), while negative integers -n are encoded as 2 * n + 1 (the odd numbers). The encoding thus "zig-zags" between positive and negative numbers.

Signed Original	Encoded As
`0`	`0`
`-1`	`1`
`1`	`2`
`-2`	`3`
`...`	`...`
`0x7FFFFFFF`	`0xFFFFFFFE`
`0x80000000`	`0xFFFFFFFF`

Using some bit voodoo, we can convert n into its ZigZag representation:

n + n + (n < 0)

In this case, we assume that the boolean n < 0 is converted to an integer 1 if true and an integer 0 if false. When a sint32 or sint64 value is parsed, it is decoded back to the original, signed version.

In protoscope, appending a z to an integer causes it to encode as ZigZag. For example, -500z is equivalent to the variant 1001.

Non-varint Numbers

Non-variable numeric types are straightforward - double and fixed64 have wire type I64, which instructs the parser to expect a fixed eight-byte chunk of data. A double record is specified by 5: 25.4, and a fixed64 record is specified by 6: 200i64. The absence of an explicit wire type infers the I64 wire type in both cases.

Similarly, float and fixed32 have wire type I32, which indicates that four bytes are expected instead. The syntax for these is as follows: add an i32 prefix. Both 25.4i32 and 200i32 will emit four bytes. Tag types are deduced to be I32.

Length-Delimited Records

Another important concept in the wire format is length prefixes. The LEN wire type has a variable length, which is specified by a variable immediately after the tag and followed by the payload as usual.

Consider the following message schema:

message Message2 {
    optional string b = 2;
}

A string is a record for the field b, and strings are LEN-encoded. If we set b to "testing," we encoded as an LEN record with the ASCII string "testing" in field number 2. The resulting value is 120774657374696e67. By splitting the bytes:

12 07 [74 65 73 74 69 6e 67]

The tag 12 is 00010 010, or 2:LEN, as we can see. The byte that follows is the varint 7, and the next seven bytes are the UTF-8 encoding of "testing".

This is written as 2:LEN 7 "testing" in Protoscope. However, repeating the length of the string can be inconvenient (which, in Protoscope text, is already quote-delimited). Wrapping Protoscope content in braces generates the following length prefix: {"testing"} is an abbreviation for 7 "testing". {} is always inferred by fields to be a LEN record, so we can write this record simply as 2: {"testing"}.

Submessages

Submessage fields also use the LEN wire type. Here’s a message definition with an embedded message of our original example message, Message1:

message Message3 {
    optional Message1 c = 3;
}

If Message1's a field (i.e., Message3's c.a field) is set to 150, we get 1a03089601. Breaking it up we get:

1a 03 [08 96 01]

The last three bytes (in []) are identical to those in our first example. In the same way that strings are encoded, these bytes are preceded by an LEN-typed tag and a length of 3.

In Protoscope, submessages are quite succinct. 1a03089601 can be written as 3: {1: 150}.

Optional and Repeated Elements

Missing optional fields are simple to encode: we simply leave the record out if they are not present. As a result, "huge" protos with only a few fields set are scarce.

repeated fields are a little more complex. Ordinary (unpacked) repeated fields produce one record for each field element. As a result, if we have:

message Message4 {
    optional string d = 4;
    repeated int32 e = 5;
}

and we construct a Message4 message with d set to "hello", and e set to 1, 2, and 3, this could be encoded as 220568656c6c6f280128022803, or written out as Protoscope:

4: {"hello"}
5: 1
5: 2
5: 3

Records for e, on the other hand, do not have to appear consecutively and can be interleaved with other fields; only the order of records for the same field in relation to each other is preserved. As a result, this could also have been encoded as:

5: 1
5: 2
4: {"hello"}
5: 3

There is no special treatment for oneofs in the wire format.

Last One Wins

Normally, a message that has been encoded wouldn't have more than one instance of a field that isn't repeated. But parsers should be able to handle the case where they do. If the same field appears more than once and it is a number or string, the parser takes the last value it sees. For embedded message fields, the parser combines multiple instances of the same field, as if with the Message::MergeFrom method, all of the single-scalar fields in the second instance replace the ones in the first, single-embedded messages are merged, and fields that appear more than once are added together. Because of these rules, decoding two encoded messages that are joined together gives you the same result as if you had decoded each message separately and then joined the objects that were made. In other words:

MyMessage message;
message.ParseFromString(str1 + str2);

is equivalent to this:

MyMessage message, message2;
message.ParseFromString(str1);
message2.ParseFromString(str2);
message.MergeFrom(message2);

Packed Repeated Fields

As of v2.1.0, scalar fields that are repeated can be declared as "packed." In proto2, you use the [packed=true] to do this, but in proto3, it's already done.

Instead of being encoded as a separate record for each entry, they are encoded as a single LEN record with all of the elements joined together. To decode, each part of the LEN record is decoded one at a time until the payload is gone. The length of the previous element, which depends on the type of field, determines where the next element begins.

For example, let's say you have the type of message:

message Message5 {
    repeated int32 f = 6 [packed=true];
}

Now, let's say you make a Message5 and give the repeated field f the values 3, 270, and 86942. When this is turned into code, we get 3206038e029ea705 or, in Protoscope text,

6: {3 270 86942}

"Packed" can only be used to describe repeated fields with simple numeric types. These are types that would usually use the VARINT, I32, or I64 wire types.

Note that there is usually no reason to encode more than one key-value pair for a packed repeated field, but parsers must be ready to accept multiple key-value pairs. In this situation, the payloads should be joined together. There must be an even number of elements in each pair. This is a valid way for parsers to encode the same message as above:

6: {3 270}
6: {86942}

Protocol buffer parsers must be able to read repeated fields that were compiled as packed as if they were not packed, and vice versa. This makes it possible to add [packed=true] to existing fields in a way that works both forward and backward.

Maps

Map fields are just a shorthand for a special kind of repeated field. If we've got:

message Message6 {
    map<string, int32> g = 7;
}

is actually the same as

message Message6 {
    message g_Entry {
        optional string key = 1;
        optional int32 value = 2;
    }
    repeated g_Entry g = 7;
}

Thus, maps are encoded in the same way that repeated message fields are: as a series of LEN-typed records with two fields each.

Field Order

In a.proto file, field numbers can be declared in any order. The order of the messages has no bearing on how they are serialized.

When a message is serialized, there is no guarantee that its known or unknown fields will be written in the same order. Serialization order is an implementation detail, and any particular implementation's details may change in the future. As a result, protocol buffer parsers must be capable of parsing fields in any order.

Implications

Do not assume the byte output of a serialized message is stable. This is especially true for messages with transitive bytes fields representing other serialized protocol buffer messages.
By default, repeated invocations of serialization methods on the same protocol buffer message instance may not produce the same byte output. That is, the default serialization is not deterministic.
- Deterministic serialization only guarantees the same byte output for a particular binary. The byte output may change across different versions of the binary.
The following checks may fail for a protocol buffer message instance foo:
- foo.SerializeAsString() == foo.SerializeAsString()
- Hash(foo.SerializeAsString()) == Hash(foo.SerializeAsString())
- CRC(foo.SerializeAsString()) == CRC(foo.SerializeAsString())
- FingerPrint(foo.SerializeAsString()) == FingerPrint(foo.SerializeAsString())
Here are a few example scenarios where logically equivalent protocol buffer messages foo and bar may serialize to different byte outputs:
- bar is serialized by an old server that treats some fields as unknown.
- bar is serialized by a server that is implemented in a different programming language and serializes fields in different order.
- bar has a field that serializes in a non-deterministic manner.
- bar has a field that stores a serialized byte output of a protocol buffer message which is serialized differently.
- bar is serialized by a new server that serializes fields in a different order due to an implementation change.
- foo and bar are concatenations of the same individual messages in a different order.

Conclusion

In this post we discussed the protocol buffer wire format, which defines the details of how your message is sent on the wire and how much space it consumes on disk. On the next post of this series we are going to start diving on the internal workings of gRPC. Stay tuned!

DEV Community