Kostas Kalafatis

Posted on Feb 10, 2023 • Originally published at dfordebugging.wordpress.com

gRPC Demystified - Protocol Buffers

One of the most serious problems with REST/JSON microservice architectures is their indifference to bandwidth. Every JSON message travels between client and server as plain text, and its payload is repetitive (field names are sent again in every message) and often carries more than the receiver needs.

JSON APIs are popular among developers because they are easy to read and debug, and JSON support is built into many languages and runtimes, including Node.js. A JSON API is a reasonable choice for early projects that need to move quickly and get the first prototypes up and running. However, problems may arise as the project grows. JSON messages are expected to evolve, yet parsing code ends up duplicated between client and server, and every feature that touches the same API must hard-code the same parsing and interpretation logic.

Protocol Buffers are a mechanism for serializing structured data that aims to reduce the size of a given structure's serialized representation. Google created Protocol Buffers in early 2001, but they were not publicly released until 2008.

Protocol buffers provide a language- and platform-independent, extensible mechanism for serializing structured data with both forward and backward compatibility. It's similar to JSON but smaller and faster, and it generates native language bindings.

The definition language (created in proto files), the code produced by the proto compiler to interface with data, language-specific runtime libraries, and the serialization format for data written to a file (or sent across a network connection) are all components of protocol buffers.

Protocol buffers provide a serialization format for packets of typed, structured data up to a few megabytes in size. The format is appropriate for both ephemeral network traffic and long-term data storage. Protocol buffers can be expanded with new information without invalidating existing data or requiring code updates. Protocol buffers are the most commonly used data format at Google. They are widely used in inter-server communications as well as for data archival storage on disk.

Defining a message type

Let's start with a simple example. Let's say you want to define a search request message format in which each search request has a query string, the page of results you want, and the number of results per page. Here is the .proto file that you use to define the type of message:

syntax = "proto2";

message SearchRequest {
    required string query = 1;
    optional int32 page_number = 2;
    optional int32 results_per_page = 3;
}

The SearchRequest message definition specifies three fields (name/value pairs), one for each piece of data that you want to include in this type of message. Each field has a name and a type.

Specifying field rules

You must specify one of the following message field rules:

required (proto2 only): A well-formed message must have exactly one of this field. If the field has no value, the message is considered "uninitialized."
singular (proto3 only): A well-formed message can have zero or one of this field (but not more than one). This is the default field rule in proto3 syntax when no other rule is specified. You cannot determine whether the field was parsed from the wire. It is serialized to the wire unless it holds the default value.
optional: The field may or may not be set. If an optional field has no value, a default value is used. For simple types you can specify your own default; otherwise a system default applies: 0 for numeric types, the empty string for strings, and false for bools. For embedded messages, the default value is always the "default instance" or "prototype" of the message, which has none of its fields set. Calling the accessor for an optional (or required) field that has not been explicitly set always returns that field's default value.
repeated: The field can be repeated any number of times (including zero) in a well-formed message. The order of the repeated values is preserved. Repeated fields behave like dynamically sized arrays.

"required" is forever

Be very careful when marking fields as required. If you ever want to stop writing or sending a required field, changing it to an optional field is problematic: old readers will consider messages without this field incomplete and may reject or drop them. Consider writing application-specific validation routines for your buffers instead. Some Google engineers have concluded that required does more harm than good and prefer to use only optional and repeated. This view is not universal, however.
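As a minimal sketch of such an application-specific validation routine, assume a hypothetical service that receives SearchRequest data as a plain Python dict rather than a generated protobuf class. The function name and the rules it checks are illustrative assumptions, not part of any protobuf API:

```python
def validate_search_request(request: dict) -> list:
    """Return a list of problems; an empty list means the request is acceptable."""
    problems = []
    # Application-level rule standing in for a schema-level "required" on query.
    if not request.get("query"):
        problems.append("query must be set and non-empty")
    # Optional field: fall back to 10 when absent, then check the value.
    if request.get("results_per_page", 10) <= 0:
        problems.append("results_per_page must be positive")
    return problems


print(validate_search_request({"query": "protocol buffers"}))
print(validate_search_request({}))
```

Because the rule lives in application code, it can be relaxed later without old readers rejecting messages at parse time.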

Assigning field numbers

As you can see, each field in the message definition has a unique number. These numbers are used to identify your fields in the message binary format and should not be changed once your message type is in use. Encoding field numbers 1 through 15 requires one byte, which includes the field number and type. Field numbers 16 through 2047 require two bytes. As a result, save field numbers 1 through 15 for frequently used message elements. Remember to leave some room for frequently occurring elements that may be added later.
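The boundaries at 15 and 2047 can be checked with a few lines of Python. This sketch assumes the usual wire layout, in which a field's tag is the field number shifted left by three bits, combined with a three-bit wire type, and written as a variable-length integer with seven payload bits per byte. The helper names are illustrative only:

```python
def varint_size(value: int) -> int:
    """Number of bytes needed to varint-encode a non-negative integer."""
    size = 1
    while value >= 0x80:  # each byte carries 7 bits of payload
        value >>= 7
        size += 1
    return size


def tag_size(field_number: int, wire_type: int = 0) -> int:
    """Size in bytes of a field's tag (field number plus wire type)."""
    return varint_size((field_number << 3) | wire_type)


for number in (1, 15, 16, 2047, 2048):
    print(number, tag_size(number))
```

Running it prints one byte for 1 and 15, two bytes for 16 and 2047, and three bytes for 2048.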

You can specify field numbers from 1 up to 536,870,911 (2²⁹ − 1). The numbers 19000 through 19999 (FieldDescriptor::kFirstReservedNumber through FieldDescriptor::kLastReservedNumber) are reserved for the protocol buffers implementation and cannot be used; the protocol buffer compiler will complain if you use one of them in your .proto. Similarly, you cannot use any field numbers that have previously been marked as reserved.
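These limits are easy to express as a check. The following is only a sketch of the rule as stated above, with a hypothetical helper name; the real check is performed by the protocol buffer compiler:

```python
MAX_FIELD_NUMBER = 2**29 - 1           # 536,870,911
IMPLEMENTATION_RESERVED = range(19000, 20000)  # 19000 through 19999


def is_usable_field_number(number: int) -> bool:
    """True if the number is in range and outside the implementation-reserved block."""
    return 1 <= number <= MAX_FIELD_NUMBER and number not in IMPLEMENTATION_RESERVED


print(is_usable_field_number(18999), is_usable_field_number(19000))
```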

Optional fields and default values

As mentioned above, elements in a message description can be labeled optional. A well-formed message may or may not contain an optional element. When a message is parsed, if it does not contain an optional element, accessing the corresponding field in the parsed object returns the default value for that field. The default value can be specified as part of the message description. For example, to give a SearchRequest's results_per_page a default value of 10:

optional int32 results_per_page = 3 [default = 10];

If the default value for an optional element is not given, the default value for that type is used instead. For strings, this is the empty string. The default value for bytes is an empty byte string. The default value for booleans is false. The default value for number types is 0. For enums, the first value listed in the enum's type definition is its default value. This means that adding a value to the beginning of an enumeration value list must be done carefully.
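The lookup order described above (parsed value, then explicit default, then type default) can be sketched in Python. The dictionaries below are hypothetical stand-ins for what generated code does internally, and only a subset of types is listed:

```python
# Type-level defaults described above, keyed by .proto type name (subset).
TYPE_DEFAULTS = {"string": "", "bytes": b"", "bool": False, "int32": 0}

# Hypothetical description of two SearchRequest fields:
# name -> (type, explicit default or None when no [default = ...] is given).
FIELDS = {
    "page_number": ("int32", None),
    "results_per_page": ("int32", 10),  # [default = 10]
}


def get_field(parsed: dict, name: str):
    """Return the parsed value, else the explicit default, else the type default."""
    if name in parsed:
        return parsed[name]
    field_type, explicit_default = FIELDS[name]
    if explicit_default is not None:
        return explicit_default
    return TYPE_DEFAULTS[field_type]


print(get_field({}, "results_per_page"), get_field({}, "page_number"))
```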

Reserved fields

If you update a message type by entirely removing a field, or by commenting it out, future users can reuse the field number when making their own updates to the type. This can cause severe problems if they later load old versions of the same .proto file, including data corruption, privacy bugs, and so on. One way to prevent this is to declare the field numbers (and/or names, which can also cause problems for JSON serialization) of your deleted fields as reserved, for example with the statements reserved 2, 15, 9 to 11; and reserved "foo", "bar"; inside the message. The protocol buffer compiler will complain if any future user tries to use these field identifiers.

Adding comments

To add comments to your .proto files, you can use the C/C++ style // and /*...*/ syntax.

/* SearchRequest represents a search query, with pagination options to
 * indicate which results to include in the response. */
message SearchRequest {
    required string query = 1;
    optional int32 page_number = 2; // Which page number do we want?
    optional int32 results_per_page = 3; // Number of results to return per page.
}

Scalar Types

One of the following types can be used for a scalar message field. The table shows the type set in the proto file and the type in the automatically generated class that matches it:

.proto Type	Comments	C++ Type	Java Type	Go Type	Python Type^[1]
`double`		`double`	`double`	`*float64`	`float`
`float`		`float`	`float`	`*float32`	`float`
`int32`	Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint32 instead.	`int32`	`int`	`*int32`	`int`
`int64`	Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint64 instead.	`int64`	`long`	`*int64`	`int/long`^[3]
`uint32`	Uses variable-length encoding.	`uint32`	`int`^[2]	`*uint32`	`int/long`^[3]
`uint64`	Uses variable-length encoding.	`uint64`	`long`^[2]	`*uint64`	`int/long`^[3]
`sint32`	Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular `int32`s.	`int32`	`int`	`*int32`	`int`
`sint64`	Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular `int64`s.	`int64`	`long`	`*int64`	`int/long`^[3]
`fixed32`	Always four bytes. More efficient than `uint32` if values are often greater than 2²⁸.	`uint32`	`int`^[2]	`*uint32`	`int/long`^[3]
`fixed64`	Always eight bytes. More efficient than `uint64` if values are often greater than 2⁵⁶.	`uint64`	`long`^[2]	`*uint64`	`int/long`^[3]
`sfixed32`	Always four bytes.	`int32`	`int`	`*int32`	`int`
`sfixed64`	Always eight bytes.	`int64`	`long`	`*int64`	`int/long`^[3]
`bool`		`bool`	`boolean`	`*bool`	`bool`
`string`	A string must always contain `UTF-8` encoded text.	`string`	`String`	`*string`	`unicode (Python 2) / str (Python 3)`
`bytes`	May contain any arbitrary sequence of bytes	`string`	`ByteString`	`[]byte`	`bytes`

^[1] Setting values to a field will always perform type checking to ensure that they are valid.
^[2] In Java, unsigned 32-bit and 64-bit integers are simply their signed counterparts with the top bit stored in the sign bit.
^[3] When decoded, 64-bit or unsigned 32-bit integers are always represented as longs, but can be ints if an int is specified when setting the field. When set, the value must always be of the type represented.
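The efficiency remarks in the Comments column can be reproduced with a small Python sketch. It assumes the standard variable-length encoding (seven payload bits per byte), that negative int32 values are sign-extended to 64 bits before encoding, and that sint32 applies the ZigZag mapping first. The function names are illustrative and not part of any protobuf library:

```python
def varint_size(value: int) -> int:
    """Number of bytes needed to varint-encode a non-negative integer."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def int32_size(value: int) -> int:
    """int32: a negative value is encoded as its 64-bit two's complement."""
    return varint_size(value & 0xFFFFFFFFFFFFFFFF)


def sint32_size(value: int) -> int:
    """sint32: ZigZag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... before encoding."""
    return varint_size((value << 1) ^ (value >> 31))


print(int32_size(-1), sint32_size(-1))             # 10 1
print(varint_size(2**28 - 1), varint_size(2**28))  # 4 5
print(varint_size(2**56 - 1), varint_size(2**56))  # 8 9
```

A value of -1 costs ten bytes as int32 but one byte as sint32, and from 2²⁸ (or 2⁵⁶) upward a varint needs more bytes than the four (or eight) that fixed32 (or fixed64) always uses.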

Enumerations

When creating a message type, you may want one of its fields to have only one of a pre-defined set of values. Assume you want to add a corpus field to each SearchRequest, where the corpus can be UNIVERSAL, WEB, IMAGES, LOCAL, NEWS, PRODUCTS, or VIDEO. You can easily accomplish this by including an enum in your message definition; a field with an enum type can only have one of a predefined set of constants as its value (if you try to provide a different value, the parser will treat it like an unknown field). In the example below, we've added an enum called Corpus with all possible values, as well as a field of type Corpus:

enum Corpus {
    CORPUS_UNSPECIFIED = 0;
    CORPUS_UNIVERSAL = 1;
    CORPUS_WEB = 2;
    CORPUS_IMAGES = 3;
    CORPUS_LOCAL = 4;
    CORPUS_NEWS = 5;
    CORPUS_PRODUCTS = 6;
    CORPUS_VIDEO = 7;
}

message SearchRequest {
    required string query = 1;
    optional int32 page_number = 2;
    optional int32 results_per_page = 3;
    optional Corpus corpus = 4 [default = CORPUS_UNIVERSAL];
}

Aliases are created by assigning the same value to different enum constants. To accomplish this, set the allow_alias option to true. Otherwise, when aliases are discovered, the protocol buffer compiler generates a warning message. While all alias values are valid during deserialization, only the first value is used when serializing.

enum EnumAllowingAlias {
    option allow_alias = true;
    EAA_UNSPECIFIED = 0;
    EAA_UNKNOWN = 1;
    EAA_STARTED = 1;
    EAA_RUNNING = 2;
}

enum EnumNotAllowingAlias {
    ENAA_UNSPECIFIED = 0;
    ENAA_STARTED = 1;
    // ENAA_RUNNING = 1; // Uncommenting this line will cause a warning.
    ENAA_FINISHED = 2;
}
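Python's standard enum module happens to treat duplicate values the same way, which makes it a convenient analogy for alias behavior. This is only an illustration using the standard library, not the code that the protocol buffer compiler generates:

```python
from enum import IntEnum


class EnumAllowingAlias(IntEnum):
    EAA_UNSPECIFIED = 0
    EAA_UNKNOWN = 1
    EAA_STARTED = 1  # alias: same number as EAA_UNKNOWN
    EAA_RUNNING = 2


# Both names are accepted and resolve to the same member...
assert EnumAllowingAlias["EAA_STARTED"] is EnumAllowingAlias.EAA_UNKNOWN
# ...but mapping the number back to a name yields the first one declared.
print(EnumAllowingAlias(1).name)  # EAA_UNKNOWN
```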

Enumerator constants must be in the range of a 32-bit integer. Because enum values use varint encoding on the wire, negative values are inefficient and should be avoided. Enums can be defined within or outside of a message definition, and these enums can be reused in any message definition in your .proto file. You can also use the syntax MessageType.EnumType to use an enum type declared in one message as the type of a field in another message.

Nested Types

Message types can be defined and used within other message types, as shown in the following example, where the Result message is defined within the SearchResponse message:

message SearchResponse {
    message Result {
        required string url = 1;
        optional string title = 2;
        repeated string snippets = 3;
    }
    repeated Result result = 1;
}

If you want to reuse this message type outside its parent message type, you refer to it as Parent.Type:

message SomeMessage {
    optional SearchResponse.Result result = 1;
}

Messages can be nested as deeply as you want. Because they are defined in different messages, the two nested types named Inner in the example below are completely independent:

message Outer {            // Level 0
    message FirstMiddle {  // Level 1
        message Inner {    // Level 2
            optional int64 ival = 1;
            optional bool booly = 2;
        }
    }
    message SecondMiddle {   // Level 1
        message Inner {      // Level 2
            optional string name = 1;
            optional bool flag = 2;
        }
    }
}

Maps

Protocol buffers provide a convenient shortcut syntax for creating an associative map as part of your data definition:

map<key_type, value_type> map_field = N;

where key_type can be any integral or string type (that is, any scalar type except floating-point types and bytes). Note that key_type cannot be an enum. The value_type can be any type except another map.

So, if you wanted to make a map of projects where each Project message is linked to a string key, you could define it like this:

map<string, Project> projects = 3;

Maps Features

Maps do not support extensions.
Maps cannot be repeated, optional or required.
Because wire format ordering and map iteration ordering of map values are undefined, you can't count on your map items being in a specific order.
Maps are sorted by key when generating text format for a .proto. The numerical keys are ordered numerically.
When parsing from the wire or when merging, if there are duplicate map keys the last key seen is used. When parsing a map from text format, parsing may fail if there are duplicate keys.
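The duplicate-key and ordering rules in this list can be sketched with plain Python data structures. The entry list below is a hypothetical stand-in for map entries arriving in wire order, and the function names are illustrative:

```python
def merge_map_entries(entries):
    """Fold (key, value) entries in arrival order into a dict.

    A later duplicate key replaces the earlier one, so the last entry seen wins.
    """
    result = {}
    for key, value in entries:
        result[key] = value
    return result


def text_format_order(mapping):
    """Entries sorted by key, as when text format is generated."""
    return sorted(mapping.items())


projects = merge_map_entries([("beta", 2), ("alpha", 1), ("beta", 3)])
print(projects["beta"])
print(text_format_order(projects))
print(text_format_order({10: "ten", 2: "two"}))
```

Code that consumes a map should therefore sort the entries itself whenever it needs a stable order.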

Backwards compatibility

On the wire, the map syntax is equivalent to the following, so protocol buffers that do not support maps can still handle your data:

message MapFieldEntry {
    optional key_type key = 1;
    optional value_type value = 2;
}

repeated MapFieldEntry map_field = N;

Any protocol buffers implementation that supports maps must both produce and accept data that can be accepted by the above definition.

OneOf

If you have a message with many optional fields and only one field will be set at a time, you can use the oneof feature to enforce this behavior and save memory. Oneof fields are similar to optional fields, except that all fields in a oneof share memory and only one field can be set at a time. Setting any of the oneof members automatically clears all of the other members. Depending on the language, you can use a special case() or WhichOneof() method to determine which value in a oneof is set (if any).

Using OneOf

In your .proto, use the oneof keyword followed by your oneof name, in this case test_oneof:

message SampleMessage {
    oneof test_oneof {
        string name = 4;
        SubMessage sub_message = 9;
    }
}

You then add your oneof fields to the oneof definition. You can add fields of any type, but you cannot use the required, optional, or repeated keywords. If you need a repeated field in a oneof, wrap it in a message and add that message to the oneof. In your generated code, oneof fields have the same getters and setters as regular optional fields. You also get a special method for checking which value (if any) in the oneof is set.
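The "setting one member clears the others" behavior can be sketched with a hand-written Python class. This is a simplified stand-in, not generated code: the method names set, get, and which_oneof are assumptions made for illustration, and get returns None for an unset member where real generated accessors would return a default value.

```python
class SampleMessage:
    """Stand-in for a message with `oneof test_oneof { name; sub_message }`."""

    _ONEOF_MEMBERS = ("name", "sub_message")

    def __init__(self):
        self._set_member = None  # name of the member currently set, if any
        self._value = None       # single slot shared by all members

    def set(self, member, value):
        if member not in self._ONEOF_MEMBERS:
            raise AttributeError(member)
        # Overwriting the shared slot is what clears the previous member.
        self._set_member, self._value = member, value

    def get(self, member):
        return self._value if self._set_member == member else None

    def which_oneof(self):
        return self._set_member


message = SampleMessage()
message.set("name", "example")
message.set("sub_message", {"id": 1})
print(message.which_oneof(), message.get("name"))
```

Storing every member in one shared slot is also where the memory saving mentioned earlier comes from.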

Services

If you want to use your message types with an RPC (Remote Procedure Call) system, define an RPC service interface in a proto file, and the protocol buffer compiler will generate service interface code and stubs in the language of your choice. So, for example, if you want to define an RPC service that takes a SearchRequest and returns a SearchResponse, you can do so in your proto file as follows:

service SearchService {
    rpc Search(SearchRequest) returns (SearchResponse);
}

The protocol compiler will then generate an abstract interface called SearchService and a corresponding "stub" implementation by default. All calls are forwarded by the stub to an RpcChannel, which is an abstract interface that you must define yourself in terms of your own RPC system. For example, you could implement an RpcChannel that serializes the message and sends it via HTTP to a server. In other words, the generated stub provides a type-safe interface for making protocol-buffer-based RPC calls while not tying you to a specific RPC implementation. So, in C++, you might get something like this:

#include <google/protobuf/service.h>
// Also include the header that protoc generated for your .proto file.

namespace protobuf = google::protobuf;

class ExampleSearchService : public SearchService {
    public:
        void Search(
            protobuf::RpcController* controller,
            const SearchRequest* request,
            SearchResponse* response,
            protobuf::Closure* done
            ) {
            // Fill in the response based on the query string.
            if (request->query() == "google") {
                response->add_result()->set_url("http://www.google.com");
            } else if (request->query() == "protocol buffers") {
                response->add_result()->set_url("http://protobuf.google.com");
            }
            // Signal that the call has completed.
            done->Run();
        }
};

int main() {
    // You provide class MyRpcServer. It does not have to implement any
    // particular interface; this is just an example.
    MyRpcServer server;

    protobuf::Service* service = new ExampleSearchService;
    server.ExportOnPort(1234, service);
    server.Run();

    delete service;
    return 0;
}

Options

Individual declarations in a .proto file can be annotated with a number of options. Options do not change the overall meaning of a declaration, but they can affect how it is handled in a particular context. Some options are file-level options, meaning they should be written at the top-level scope rather than inside any message, enum, or service definition. Some options are message-level options, meaning they should be written inside message definitions. Some options are field-level options, meaning they should be written inside field definitions. Options can also be written on enum types, enum values, oneof fields, service types, and service methods; however, no useful options currently exist for any of these.

Here are a few of the most commonly used options:

java_package (file option): The package to use for the generated Java classes. If no explicit java_package option is given in the .proto file, the proto package (specified using the "package" keyword in the .proto file) is used by default. However, proto packages generally do not make good Java packages, since they are not expected to start with reverse domain names. This option has no effect if Java code is not being generated.

option java_package = "com.example.foo";

java_outer_classname (file option): The name of the wrapper Java class (and thus the file name) that you want to generate. If there is no explicit java_outer_classname specification in the .proto file, the class name will be generated by converting the .proto file name to camel-case (so foo_bar.proto becomes FooBar.java). If the java_multiple_files option is disabled, all other classes/enums/etc. generated for the .proto file will be generated as nested classes/enums/etc. within this outer wrapper Java class. This option has no effect if Java code is not being generated.

option java_outer_classname = "Ponycopter";

java_multiple_files (file option): If false, only a single .java file will be generated for this .proto file, and all Java classes/enums/etc. generated for the top-level messages, services, and enumerations will be nested inside an outer class (see java_outer_classname). If true, separate .java files will be generated for each of the Java classes/enums/etc. generated for the top-level messages, services, and enumerations, and the wrapper Java class generated for this .proto file will not contain any nested classes/enums/etc. This is a boolean option that defaults to false. It has no effect if Java code is not being generated.

option java_multiple_files = true;

optimize_for (file option): Set to SPEED, CODE_SIZE, or LITE_RUNTIME. This has the following effects on the C++ and Java code generators (as well as possibly third-party generators):
- SPEED (default): The protocol buffer compiler will generate code for serializing, parsing, and other common message-type operations. This code has been heavily optimized.
- CODE_SIZE: To implement serialization, parsing, and other operations, the protocol buffer compiler will generate minimal classes and rely on shared, reflection-based code. As a result, the generated code will be much smaller than with SPEED, but operations will be much slower. Classes will continue to implement the same public API that they did in SPEED mode. This mode is most useful in apps that have a large number of .proto files but do not require all of them to be lightning fast.
- LITE_RUNTIME: The protocol buffer compiler will generate classes that depend only on the "lite" runtime library (libprotobuf-lite instead of libprotobuf). The lite runtime is much smaller than the full library (around an order of magnitude smaller) but omits certain features such as descriptors and reflection. This is particularly useful for apps running on constrained platforms such as mobile phones. The compiler will still generate fast implementations of all methods, as it does in SPEED mode. Generated classes will only implement the MessageLite interface in each language, which provides a subset of the methods of the full Message interface.

option optimize_for = CODE_SIZE;

cc_generic_services, java_generic_services, py_generic_services (file options): Whether the protocol buffer compiler should generate abstract service code based on service definitions in C++, Java, and Python, respectively. For historical reasons these originally defaulted to true; in current releases they default to false. As of version 2.3.0 (January 2010), it is preferred for RPC implementations to provide code generator plugins that generate code more specific to each system, rather than relying on the "abstract" services.

// This file relies on plugins to generate service code.
option cc_generic_services = false;
option java_generic_services = false;
option py_generic_services = false;

cc_enable_arenas (file option): Enables arena allocation for C++ generated code.
message_set_wire_format (message option): If set to true, the message uses a different binary format intended to be compatible with an old format used inside Google called MessageSet. Users outside Google will probably never need this option. The message must be declared exactly as follows:

message Foo {
    option message_set_wire_format = true;
    extensions 4 to max;
}

packed (field option): If set to true on a repeated field of a basic numeric type, a more compact encoding is used. There is no downside to using this option. However, prior to version 2.3.0, parsers that received packed data when it was not expected would ignore it. It was therefore not possible to change an existing field to packed format without breaking wire compatibility. In 2.3.0 and later this change is safe, because parsers for packable fields always accept both formats, but be careful if you have to deal with old programs that use old protobuf versions.

repeated int32 samples = 4 [packed = true];
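To see where the saving comes from, here is a small Python sketch that hand-encodes a repeated field of non-negative integers both ways. It assumes the standard wire layout (a varint tag per element when unpacked; a single length-delimited tag, a byte count, and the concatenated values when packed). The helper names are illustrative and not part of any protobuf library:

```python
def encode_varint(value: int) -> bytes:
    """Varint-encode a non-negative integer."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_unpacked(field_number: int, values) -> bytes:
    """One tag (wire type 0, varint) in front of every element."""
    tag = encode_varint((field_number << 3) | 0)
    return b"".join(tag + encode_varint(v) for v in values)


def encode_packed(field_number: int, values) -> bytes:
    """One tag (wire type 2), the payload length, then all elements back to back."""
    payload = b"".join(encode_varint(v) for v in values)
    tag = encode_varint((field_number << 3) | 2)
    return tag + encode_varint(len(payload)) + payload


samples = [1, 2, 300]
print(len(encode_unpacked(4, samples)), len(encode_packed(4, samples)))  # 7 6
```

The unpacked form pays for a tag on every element, while the packed form pays for one tag and one length, so the gap widens as the list grows.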

deprecated (field option): If set to true, the field is considered deprecated and should not be used by new code. This has no effect in most languages. This becomes a @Deprecated annotation in Java. When deprecated fields are used in C++, clang-tidy will generate warnings. Other language-specific code generators may generate deprecation annotations on the field's accessors in the future, resulting in a warning when compiling code that attempts to use the field. If the field isn't being used and you don't want new users to use it, consider replacing the declaration with a reserved statement.

optional int32 old_field = 6 [deprecated=true];

Custom Options

Protocol Buffers even allow you to define and use your own options. It should be noted that this is an advanced feature that most people do not require. Because options are defined by the messages defined in google/protobuf/descriptor.proto (such as FileOptions or FieldOptions), defining your own options is simply a matter of extending those messages. For instance:

import "google/protobuf/descriptor.proto";

extend google.protobuf.MessageOptions {
    optional string my_option = 51234;
}

message MyMessage {
    option (my_option) = "Hello world!";
}

By extending MessageOptions, we have defined a new message-level option. When we use the option, we must enclose the option name in parentheses to indicate that it is an extension. In C++, we can now read the value of my_option as follows:

string value = MyMessage::descriptor()->options().GetExtension(my_option);

Here, MyMessage::descriptor()->options() returns the MessageOptions protocol message for MyMessage. Reading custom options from it is just like reading any other extension.

Similarly, in Java we would write:

String value = MyProtoFile.MyMessage
    .getDescriptor()
    .getOptions()
    .getExtension(MyProtoFile.myOption);

In Python, it would be:

value = my_proto_file_pb2.MyMessage.DESCRIPTOR.GetOptions().Extensions[my_proto_file_pb2.my_option]

In the Protocol Buffers language, custom options can be defined for any type of construct. Here's an example of each type of option:

import "google/protobuf/descriptor.proto";

extend google.protobuf.FileOptions {
  optional string my_file_option = 50000;
}
extend google.protobuf.MessageOptions {
  optional int32 my_message_option = 50001;
}
extend google.protobuf.FieldOptions {
  optional float my_field_option = 50002;
}
extend google.protobuf.OneofOptions {
  optional int64 my_oneof_option = 50003;
}
extend google.protobuf.EnumOptions {
  optional bool my_enum_option = 50004;
}
extend google.protobuf.EnumValueOptions {
  optional uint32 my_enum_value_option = 50005;
}
extend google.protobuf.ServiceOptions {
  optional MyEnum my_service_option = 50006;
}
extend google.protobuf.MethodOptions {
  optional MyMessage my_method_option = 50007;
}

option (my_file_option) = "Hello world!";

message MyMessage {
  option (my_message_option) = 1234;

  optional int32 foo = 1 [(my_field_option) = 4.5];
  optional string bar = 2;
  oneof qux {
    option (my_oneof_option) = 42;

    string quux = 3;
  }
}

enum MyEnum {
  option (my_enum_option) = true;

  FOO = 1 [(my_enum_value_option) = 321];
  BAR = 2;
}

message RequestType {}
message ResponseType {}

service MyService {
  option (my_service_option) = FOO;

  rpc MyMethod(RequestType) returns(ResponseType) {
    // Note:  my_method_option has type MyMessage.  We can set each field
    //   within it using a separate "option" line.
    option (my_method_option).foo = 567;
    option (my_method_option).bar = "Some string";
  }
}

If you want to use a custom option in a package other than the one that defined it, you must prefix the option name with the package name, just like you would for type names. As an example:

// foo.proto
import "google/protobuf/descriptor.proto";
package foo;
extend google.protobuf.MessageOptions {
  optional string my_option = 51234;
}

// bar.proto
import "foo.proto";
package bar;
message MyMessage {
  option (foo.my_option) = "Hello world!";
}

Finally, because custom options are extensions, they must be assigned field numbers like any other field or extension. In the examples above, we used field numbers in the range 50000 to 99999. This range is reserved for internal use within individual organizations, so you can use numbers in this range freely for in-house applications. If you intend to use custom options in public applications, however, your field numbers must be globally unique. To obtain globally unique field numbers, send a request to the protobuf global extension registry. Usually you need only one extension number: by putting the options in a sub-message, you can declare multiple options with a single extension number:

message FooOptions {
  optional int32 opt1 = 1;
  optional string opt2 = 2;
}

extend google.protobuf.FieldOptions {
  optional FooOptions foo_options = 1234;
}

// usage:
message Bar {
  optional int32 a = 1 [(foo_options).opt1 = 123, (foo_options).opt2 = "baz"];
  // alternative aggregate syntax (uses TextFormat):
  optional int32 b = 2 [(foo_options) = { opt1: 123 opt2: "baz" }];
}

Also, keep in mind that each option type (file-level, message-level, field-level, etc.) has its own number space, so you could declare FieldOptions and MessageOptions extensions with the same number.

Conclusion

In this post we covered the basics of Protocol Buffers. Protocol buffers serialize structured data in a way that is both forward-compatible and backward-compatible across languages and platforms. The format is similar to JSON, but smaller and faster, and it generates native language bindings. Once you have defined how your data is structured, you can use the generated source code to write and read that data to and from a variety of data streams, in a variety of languages.

Protocol buffers are made up of the definition language (written in .proto files), the code that the proto compiler generates to interact with data, language-specific runtime libraries, and the serialization format for data that is written to a file (or sent across a network connection).

In the next post of this series we will briefly examine how protocol buffers are encoded on the wire, something you don't necessarily need to know in order to use them, but is a great asset if you want to perform optimizations.
