Why Protobuf Should Dominate the Data Format Ecosystem
Daniel Hayes
Full-Stack Engineer · Leapcell
In-depth Understanding of Protobuf
What is Protobuf
Protobuf (Google Protocol Buffers) is, according to the official documentation, a language-independent, platform-independent, and extensible method for serializing structured data. It is widely used for data communication protocols and data storage. Google provides it as a tool library with an efficient data exchange format, offering a flexible, efficient, and automated mechanism for serializing structured data.
Compared with XML, data encoded by Protobuf is smaller and is encoded and decoded faster. Compared with JSON, Protobuf also converts more efficiently: its time and space efficiency are commonly cited as roughly 3 to 5 times that of JSON, although the actual ratio depends on the data and the implementation.
As the official description states: “Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.”
Comparison of Data Formats
Suppose we have a person object represented in XML, JSON, and Protobuf respectively, and let's see how they differ.
XML Format
<person> <name>John</name> <age>24</age> </person>
JSON Format
{ "name":"John", "age":24 }
Protobuf Format
Protobuf represents data directly in binary, which is less intuitive than XML or JSON. For example, assuming name is field 1 and age is field 2, the person above is encoded as the following bytes (shown in decimal):
[10 4 74 111 104 110 16 24]
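To make the size difference concrete, here is a minimal sketch that hand-encodes the person object using only the Go standard library. It assumes a hypothetical schema in which name is field 1 (string) and age is field 2 (int32); the helper names appendVarint and encodePerson are illustrative and are not part of any Protobuf library.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendVarint appends v in base-128 varint form, the same scheme Protobuf uses.
func appendVarint(buf []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(buf, tmp[:n]...)
}

// encodePerson hand-encodes the assumed message
//   message Person { string name = 1; int32 age = 2; }
func encodePerson(name string, age int32) []byte {
	var buf []byte
	buf = appendVarint(buf, 1<<3|2)            // tag: field 1, wire type 2 (length-delimited)
	buf = appendVarint(buf, uint64(len(name))) // length of the UTF-8 bytes
	buf = append(buf, name...)                 // the string bytes themselves
	buf = appendVarint(buf, 2<<3)              // tag: field 2, wire type 0 (varint)
	buf = appendVarint(buf, uint64(age))       // value (sign-extended if negative)
	return buf
}

func main() {
	xmlText := "<person> <name>John</name> <age>24</age> </person>"
	jsonText := `{ "name":"John", "age":24 }`
	pb := encodePerson("John", 24)

	fmt.Println(pb) // [10 4 74 111 104 110 16 24]
	fmt.Println("bytes (XML, JSON, Protobuf):", len(xmlText), len(jsonText), len(pb))
}
```

The binary form carries no field names or punctuation, only a one-byte tag per field plus the data, which is where most of the savings come from.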
Advantages of Protobuf
Good Performance/High Efficiency
- Time Overhead: Serializing XML has an acceptable cost, but parsing (deserializing) XML is relatively expensive. Protobuf is optimized for this and significantly reduces the time spent on both serialization and deserialization.
- Space Overhead: Protobuf also greatly reduces the size of the encoded data.
Code Generation Mechanism
For example, write a structure-like definition such as the following:
message testA { required int32 m_testA = 1; }
Protobuf can automatically generate the corresponding .h and .cpp files, and encapsulates the operations on the structure testA in a class.
Support for Backward Compatibility and Forward Compatibility
When a client and a server share a protocol and one side adds a new field to a message, the other side keeps working normally: old code ignores fields it does not know, and new code falls back to default values for fields that are absent.
Support for Multiple Programming Languages
The source code officially released by Google includes support for multiple programming languages, such as:
- C++
- C#
- Dart
- Go
- Java
- Kotlin
- Python
Disadvantages of Protobuf
Poor Readability Due to Binary Format
To improve performance, Protobuf encodes data in a binary format, which makes the data less readable and can slow down development and testing. Under normal circumstances, however, Protobuf is very reliable, and serious problems generally do not occur.
Lack of Self-description
XML is generally self-descriptive, while the Protobuf format is not. A Protobuf payload is just binary content, and it is difficult to interpret without the matching pre-written message definition.
Poor Universality
Although Protobuf supports serialization and deserialization in multiple languages, it is not a universal transmission standard across platforms and languages. In multi-platform message passing, it does not interoperate well with projects that do not already use it, and adaptation work is often required. Compared with JSON and XML, it is somewhat less universal.
Usage Guide
Defining Message Types
Proto message type files generally end with .proto. In a .proto file, one or more message types can be defined.
The following is an example of defining a message type for a search query. The syntax statement at the beginning of the file declares the version. Currently, there are two versions of proto: proto2 and proto3.
syntax="proto3";
This statement explicitly sets the syntax to proto3. If syntax is not set, it defaults to proto2. The statement syntax = "proto3" must be the first line of the .proto file, excluding comments and blank lines.
In the message below, query represents the content to be queried, page_number represents the page number of the query, and result_per_page represents the number of items per page.
The following message contains 3 fields (query, page_number, result_per_page), and each field has a type, a field name, and a field number. The field type can be a scalar such as string or int32, an enum, or a composite type.
    syntax = "proto3";

    message SearchRequest {
      string query = 1;
      int32 page_number = 2;
      int32 result_per_page = 3;
    }
Field Numbers
Each field in a message type must be given a unique number, which identifies the field in the binary data. Numbers in the range [1, 15] are encoded in one byte, and numbers in the range [16, 2047] in two bytes, so reserving the numbers up to 15 for frequently occurring fields saves space. The minimum number is 1 and the maximum is 2^29 - 1 = 536870911. Numbers in the range [19000, 19999] cannot be used because they are reserved for the Protocol Buffers implementation. Numbers declared as reserved cannot be used either.
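The one-byte and two-byte boundaries follow from how the field number is packed into a varint-encoded tag. The sketch below checks them with the standard library's varint encoder; tagSize is a hypothetical helper written for this illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// tagSize returns how many bytes the varint-encoded tag occupies for a field
// number. The wire type sits in the low 3 bits and does not change the size.
func tagSize(fieldNumber uint32) int {
	var tmp [binary.MaxVarintLen64]byte
	tag := uint64(fieldNumber) << 3
	return binary.PutUvarint(tmp[:], tag)
}

func main() {
	for _, n := range []uint32{1, 15, 16, 2047, 2048, 536870911} {
		fmt.Printf("field %d -> %d byte(s)\n", n, tagSize(n))
	}
}
```

Field 15 shifted left by 3 bits is 120, which still fits in the 7 payload bits of one byte, while field 16 becomes 128 and spills into a second byte.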
Field Rules
Each field is either singular or repeated. In the proto3 syntax, a field with no explicit rule is singular.
- singular: The field appears at most once, that is, 0 or 1 time.
- repeated: The field can appear any number of times, including 0 times. In the proto3 syntax, repeated fields of scalar numeric types use the packed encoding by default.
Comments
You can add comments to the .proto file. The comment syntax is the same as the C/C++ style, using // or /* ... */.
    /* SearchRequest represents a search query, with pagination options to
     * indicate which results to include in the response. */
    message SearchRequest {
      string query = 1;
      int32 page_number = 2;      // Which page number do we want?
      int32 result_per_page = 3;  // Number of results to return per page.
    }
Reserved Fields
When a field in a message is deleted or commented out, other developers may later reuse its field number when updating the message definition. If old versions of the .proto file are still loaded somewhere, this can lead to serious problems such as data corruption. To avoid this, you can declare the deleted field numbers and field names as reserved. If anyone uses them in the future, the proto compiler reports an error.
Note: Field names and field numbers cannot be mixed in the same reserved statement.
    message Foo {
      reserved 2, 15, 9 to 11;
      reserved "foo", "bar";
    }
Mapping between Field Types and Language Types
A .proto file can be turned into Go code by a generator. For example, the Go file generated from a.proto is a.pb.go.
The mapping between basic proto types and Go types is shown in the following table (only Go and C/C++ are listed here; for other languages, refer to https://developers.google.com/protocol-buffers/docs/proto3):
.proto Type | Go Type | C++ Type |
---|---|---|
double | float64 | double |
float | float32 | float |
int32 | int32 | int32 |
int64 | int64 | int64 |
uint32 | uint32 | uint32 |
uint64 | uint64 | uint64 |
sint32 | int32 | int32 |
sint64 | int64 | int64 |
fixed32 | uint32 | uint32 |
fixed64 | uint64 | uint64 |
sfixed32 | int32 | int32 |
sfixed64 | int64 | int64 |
bool | bool | bool |
string | string | string |
bytes | []byte | string |
Default Values
.proto Type | default value |
---|---|
string | "" |
bytes | empty bytes |
bool | false |
numeric types | 0 |
enums | first defined enum value |
Enum Types
When defining a message, if you want the value of a field to be one of a predefined set of values, you can use an enum type.
For example, add a corpus field to SearchRequest whose value can only be one of UNIVERSAL, WEB, IMAGES, LOCAL, NEWS, PRODUCTS, and VIDEO. This is done by adding an enum to the message definition, with a constant for each possible value.
    message SearchRequest {
      string query = 1;
      int32 page_number = 2;
      int32 result_per_page = 3;
      enum Corpus {
        UNIVERSAL = 0;
        WEB = 1;
        IMAGES = 2;
        LOCAL = 3;
        NEWS = 4;
        PRODUCTS = 5;
        VIDEO = 6;
      }
      Corpus corpus = 4;
    }
The first constant of the Corpus enum is mapped to 0. In proto3, every enum definition must include a constant mapped to 0, and it must be the first entry of the definition, because 0 is used as the default value of the enum. In the proto2 syntax, the first listed enum value is always the default, so keeping the zero value first preserves compatibility between the two.
Importing Other Protos
Other .proto files can be imported into a .proto file in order to use the message types they define.
import "myproject/other_protos.proto";
By default, only the message types defined in the directly imported .proto file can be used. Sometimes, however, a .proto file needs to be moved to a new location. Instead of moving the file and updating all call sites at once, you can leave a placeholder .proto file in the old location and use the import public syntax to forward all imports to the new location. Any file that imports a proto containing an import public statement can transitively rely on those public dependencies.
For example, suppose the current folder contains a.proto and b.proto, and a.proto imports b.proto, that is, a.proto contains the following:
import "b.proto";
Suppose we now want to move the messages in b.proto into the common/com.proto file so they can be used elsewhere. We can modify b.proto to import com.proto. Note that it must use import public, because a plain import only exposes the messages defined in b.proto itself, not the message types from the files that b.proto imports.
    // b.proto: the message definitions have moved to common/com.proto,
    // so forward importers to the new location.
    import public "common/com.proto";
When compiling with protoc, use the -I or --proto_path option to tell protoc where to find the imported files. If no search path is specified, protoc looks in the current directory (the directory from which protoc is invoked).
Message types in the proto2 version can be imported into a proto3 file for use, and message types in the proto3 version can also be imported into a proto2 file. But the enum types in proto2 cannot be directly applied to the proto3 syntax.
Nested Messages
Message types can be defined inside another message type, that is, nested. For example, the Result type is defined inside SearchResponse. Multiple levels of nesting are supported.
    message SearchResponse {
      message Result {
        string url = 1;
        string title = 2;
        repeated string snippets = 3;
      }
      repeated Result results = 1;
    }
When a message type outside the parent wants to use the nested message, for example SomeOtherMessage using Result, it refers to it as SearchResponse.Result.
message SomeOtherMessage { SearchResponse.Result result = 1; }
Unknown Fields
Unknown fields are fields that the parser does not recognize. For example, when an old binary parses data sent by a new binary that contains new fields, those new fields become unknown fields in the old binary. In the initial version of proto3, unknown fields were discarded when a message was parsed, but version 3.5 reintroduced their retention: unknown fields are kept during parsing and are included in the serialized output.
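The following is a rough sketch of how a reader can step over fields it does not know while keeping their raw bytes. It is not the real Protobuf runtime: parseKeepingUnknown is a hypothetical function, it only decodes known fields of the varint wire type, and the input bytes assume a message with name as field 1 and age as field 2.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// parseKeepingUnknown walks a serialized message. Known varint fields are
// decoded into a map; every other field is copied untouched into unknown.
func parseKeepingUnknown(data []byte, known map[uint64]bool) (map[uint64]uint64, []byte, error) {
	values := make(map[uint64]uint64)
	var unknown []byte
	for pos := 0; pos < len(data); {
		start := pos
		tag, n := binary.Uvarint(data[pos:])
		if n <= 0 {
			return nil, nil, errors.New("malformed tag")
		}
		pos += n
		field, wireType := tag>>3, tag&7
		var varintValue uint64
		switch wireType {
		case 0: // varint: the value marks its own end
			v, n := binary.Uvarint(data[pos:])
			if n <= 0 {
				return nil, nil, errors.New("malformed varint")
			}
			varintValue = v
			pos += n
		case 1: // fixed 64-bit
			pos += 8
		case 2: // length-delimited: skip length prefix plus payload
			length, n := binary.Uvarint(data[pos:])
			if n <= 0 || length > uint64(len(data)-pos-n) {
				return nil, nil, errors.New("malformed length")
			}
			pos += n + int(length)
		case 5: // fixed 32-bit
			pos += 4
		default:
			return nil, nil, fmt.Errorf("unsupported wire type %d", wireType)
		}
		if pos > len(data) {
			return nil, nil, errors.New("truncated message")
		}
		if known[field] && wireType == 0 {
			values[field] = varintValue
		} else {
			unknown = append(unknown, data[start:pos]...) // retained as-is
		}
	}
	return values, unknown, nil
}

func main() {
	// Assumed message: name = 1 ("John"), age = 2 (24).
	msg := []byte{10, 4, 74, 111, 104, 110, 16, 24}
	// A reader that only knows field 2.
	values, unknown, err := parseKeepingUnknown(msg, map[uint64]bool{2: true})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(values, unknown) // map[2:24] [10 4 74 111 104 110]
}
```

Because the wire type alone tells the reader how long each field is, it can skip or preserve a field without knowing what the field means. Appending the retained bytes when re-serializing is what lets the data survive a round trip through an older binary.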
Encoding Principle
TLV Encoding Format
The key to the high efficiency of Protobuf lies in its TLV (tag-length-value) encoding format. Each field has a unique tag value as an identifier, length represents the length of the value data (a fixed-length value has no length), and value is the content of the data itself.
The tag value is composed of two parts: field_number and wire_type. field_number is the number given to each field in the message earlier, and wire_type represents the encoding type (fixed length or variable length). The wire_type currently has 6 values, from 0 to 5, which can be represented with 3 bits.
The values of wire_type are shown in the following table. Values 3 and 4 are deprecated, so only the remaining 4 types matter. Data encoded with Varint does not need to store the byte length, so the TLV format degenerates into TV. The 64-bit and 32-bit types also need no length, because the wire_type value already indicates whether the value is 8 bytes or 4 bytes long.
wire_type | Encoding Method | Encoding Length | Storage Method | Data Type |
---|---|---|---|---|
0 | Varint | Variable length | T - V | int32 int64 uint32 uint64 bool enum |
0 | Zigzag + Varint | Variable length | T - V | sint32 sint64 |
1 | 64-bit | Fixed 8 bytes | T - V | fixed64 sfixed64 double |
2 | Length-delimited | Variable length | T - L - V | string bytes packed repeated fields embedded messages |
3 | start group | Deprecated | Deprecated | |
4 | end group | Deprecated | Deprecated | |
5 | 32-bit | Fixed 4 bytes | T - V | fixed32 sfixed32 float |
Varint Encoding Principle
Varint is a variable-length encoding for integers. It represents smaller numbers with fewer bytes, compressing data by reducing the number of bytes used per number. An int32 normally takes 4 bytes, but with Varint encoding a non-negative int32 smaller than 128 takes 1 byte. Larger numbers may need 5 bytes, but very large numbers are rare in most messages, so Varint encoding usually saves space overall.
Varint uses the highest bit of each byte as a continuation flag. If the highest bit is 1, the next byte is also part of the number; if it is 0, this is the last byte. The remaining 7 bits of each byte carry the number itself, least significant group first. Although each byte gives up 1 bit (1/8 = 12.5%), a large amount of space is still saved when many numbers do not need a fixed 4 bytes.
For example, the int32 number 65 is 1000001 in binary, which fits in 7 bits, so it is encoded as the single byte 01000001. The 65 that originally occupied 4 bytes occupies only 1 byte after encoding.
The int32 number 128 is 10000000 in binary, which needs 8 bits. Its low 7 bits (0000000) go into the first byte with the highest bit set to 1, giving 10000000, and the remaining bit goes into the second byte, giving 00000001. It therefore occupies 2 bytes after encoding.
Varint decoding is the reverse process of encoding, which is relatively simple, and no example is given here.
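For readers who prefer code, here is a minimal sketch of the Varint scheme, with decoding included for completeness. The function names are illustrative; Go's standard library offers equivalent routines in encoding/binary (PutUvarint and Uvarint).

```go
package main

import "fmt"

// encodeVarint emits 7 bits per byte, least significant group first, and sets
// the highest bit on every byte except the last.
func encodeVarint(v uint64) []byte {
	var out []byte
	for v >= 0x80 {
		out = append(out, byte(v&0x7F)|0x80)
		v >>= 7
	}
	return append(out, byte(v))
}

// decodeVarint reverses encodeVarint. It returns the value and the number of
// bytes consumed, or 0, 0 if the input is truncated or too long.
func decodeVarint(b []byte) (uint64, int) {
	var v uint64
	for i, c := range b {
		if i == 10 {
			return 0, 0 // longer than any 64-bit varint
		}
		v |= uint64(c&0x7F) << (7 * uint(i))
		if c < 0x80 {
			return v, i + 1
		}
	}
	return 0, 0
}

func main() {
	for _, n := range []uint64{65, 128, 300} {
		enc := encodeVarint(n)
		dec, _ := decodeVarint(enc)
		fmt.Printf("%d -> %v -> %d\n", n, enc, dec)
	}
	// 65 -> [65] -> 65
	// 128 -> [128 1] -> 128
	// 300 -> [172 2] -> 300
}
```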
Zigzag Encoding
A negative number has its high bits set, so plain Varint encoding always spends the maximum number of bytes on it. For the sint32 and sint64 types, Protobuf therefore first uses Zigzag encoding to convert signed numbers to unsigned numbers, and then uses Varint encoding to reduce the number of bytes after encoding.
Zigzag uses unsigned numbers to represent signed numbers, so that numbers with small absolute values are represented with few bytes. Before looking at Zigzag encoding, let's review a few concepts:
- Sign-magnitude (original code): The highest bit is the sign bit, and the remaining bits represent the absolute value.
- One's complement: For positive numbers, it is the same as the sign-magnitude form; for negative numbers, invert every bit of the sign-magnitude form except the sign bit.
- Two's complement: For positive numbers, it is the same as the sign-magnitude form; for negative numbers, invert every bit of the sign-magnitude form except the sign bit, and then add 1.
Take the int32 number -2 as an example. Its two's complement is 0xFFFFFFFE. Shifting it left by one bit gives 0xFFFFFFFC, the arithmetic right shift by 31 bits gives 0xFFFFFFFF, and the XOR of the two is 0x00000003, so -2 is encoded as 3.
In summary, Zigzag encoding operates on the two's complement of the number. For a number n of the sint32 type, compute (n<<1) ^ (n>>31); for the sint64 type, compute (n<<1) ^ (n>>63), where >> is an arithmetic shift. This maps signed numbers to unsigned ones (0, -1, 1, -2, 2 become 0, 1, 2, 3, 4). The result is then Varint encoded.
Since Varint and Zigzag encoded values mark their own end, the length item can be omitted, and the TLV storage is simplified to TV storage.
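The Zigzag formula translates almost directly into Go, where >> on a signed integer is an arithmetic shift. This is a small sketch; zigzag32 and unzigzag32 are illustrative names, not library functions.

```go
package main

import "fmt"

// zigzag32 maps signed values to unsigned ones so that small magnitudes stay
// small: 0->0, -1->1, 1->2, -2->3, 2->4, ...
func zigzag32(n int32) uint32 {
	return uint32((n << 1) ^ (n >> 31)) // n >> 31 is all ones for negative n
}

// unzigzag32 is the inverse mapping.
func unzigzag32(u uint32) int32 {
	return int32(u>>1) ^ -int32(u&1)
}

func main() {
	for _, n := range []int32{0, -1, 1, -2, 2, 2147483647, -2147483648} {
		z := zigzag32(n)
		fmt.Printf("%d -> %d -> %d\n", n, z, unzigzag32(z))
	}
}
```

After this mapping, -2 becomes 3 and fits in a single Varint byte.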
Calculation Methods of tag and value Values
tag
The tag stores the identification and data type information of the field, that is, tag = wire_type (field data type) + field_number (identification number). The field number can be recovered from the tag and matched against the defined message field. The calculation formula is tag = (field_number << 3) | wire_type, and the result is then Varint encoded.
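As a quick illustration of the formula, the sketch below composes a tag and splits it back into its two parts. The constant and function names are chosen for this example only.

```go
package main

import "fmt"

// Wire types that are still in use (3 and 4 are deprecated).
const (
	wireVarint  = 0
	wireFixed64 = 1
	wireBytes   = 2
	wireFixed32 = 5
)

// makeTag packs the field number above the 3-bit wire type.
func makeTag(fieldNumber, wireType uint32) uint32 {
	return fieldNumber<<3 | wireType
}

// splitTag recovers the field number and wire type from a tag.
func splitTag(tag uint32) (fieldNumber, wireType uint32) {
	return tag >> 3, tag & 7
}

func main() {
	tag := makeTag(1, wireBytes)
	field, wt := splitTag(tag)
	fmt.Println(tag, field, wt)         // 10 1 2
	fmt.Println(makeTag(2, wireVarint)) // 16
}
```

This is why the byte 10 so often opens an encoded message: it is the tag of a length-delimited field number 1.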
value
The value is the content of the message field. For wire_type 0, it is the field value after Varint encoding, preceded by Zigzag encoding for the sint32 and sint64 types.
string Encoding
When the field type is string, the field value is encoded in UTF-8. For example, there is the following message definition:
message stringEncodeTest { string test = 1; }
In the Go language, the sample code for encoding this message is as follows:
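    // api is assumed to be the Go package generated from the .proto file above;
    // proto is google.golang.org/protobuf/proto.
    func stringEncodeTest() {
        vs := &api.StringEncodeTest{
            Test: "English",
        }
        data, err := proto.Marshal(vs)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("%v\n", data)
    }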
The binary content after encoding is as follows:
[10 7 69 110 103 108 105 115 104]
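The same bytes can be reproduced without the Protobuf runtime, which makes the tag, length, and value parts visible. This is a sketch using only the standard library; encodeStringField is a hypothetical helper, and the second call uses a non-ASCII string to show that the length counts UTF-8 bytes rather than characters.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeStringField writes the tag, the byte length, and then the UTF-8 bytes.
func encodeStringField(fieldNumber uint64, s string) []byte {
	var tmp [binary.MaxVarintLen64]byte
	var out []byte
	n := binary.PutUvarint(tmp[:], fieldNumber<<3|2) // wire type 2: length-delimited
	out = append(out, tmp[:n]...)
	n = binary.PutUvarint(tmp[:], uint64(len(s))) // len counts bytes, not characters
	out = append(out, tmp[:n]...)
	return append(out, s...)
}

func main() {
	fmt.Println(encodeStringField(1, "English")) // [10 7 69 110 103 108 105 115 104]
	fmt.Println(encodeStringField(1, "中"))       // [10 3 228 184 173]
}
```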
Encoding of Nested Types
Nested messages mean that the value is itself another message. The outer message is stored as TLV, and its value is also a TLV structure. The whole encoding can be pictured as a tree, where the outer message is the root node, each nested message is a child node, and every node follows the TLV encoding rule:
- The outermost message has its corresponding tag, length (if any), and value.
- When the value is a nested message, this nested message has its own independent tag, length (if any), and value.
- By analogy, if there are nested messages within the nested message, they continue to be encoded according to the TLV rule.
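The nesting rule can be sketched by encoding the inner message first and then wrapping its bytes as the value of a length-delimited field in the outer message. The example borrows the SearchResponse and Result shapes from earlier (url = 1, title = 2, results = 1); the helper functions are hypothetical and skip the snippets field for brevity.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendVarint appends v in base-128 varint form.
func appendVarint(buf []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(buf, tmp[:n]...)
}

// appendBytesField appends one length-delimited field: tag, length, payload.
func appendBytesField(buf []byte, fieldNumber uint64, payload []byte) []byte {
	buf = appendVarint(buf, fieldNumber<<3|2)
	buf = appendVarint(buf, uint64(len(payload)))
	return append(buf, payload...)
}

// encodeResult encodes the inner message: url = 1, title = 2.
func encodeResult(url, title string) []byte {
	var buf []byte
	buf = appendBytesField(buf, 1, []byte(url))
	buf = appendBytesField(buf, 2, []byte(title))
	return buf
}

// encodeSearchResponse wraps each encoded Result as field 1 of the outer message.
func encodeSearchResponse(results ...[]byte) []byte {
	var buf []byte
	for _, r := range results {
		buf = appendBytesField(buf, 1, r)
	}
	return buf
}

func main() {
	inner := encodeResult("a.io", "A")
	fmt.Println(inner)                       // [10 4 97 46 105 111 18 1 65]
	fmt.Println(encodeSearchResponse(inner)) // [10 9 10 4 97 46 105 111 18 1 65]
}
```

The outer bytes are just a tag (10) and a length (9) followed by the inner message unchanged, which is the tree structure described above.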
repeated Fields with packed
Fields modified by repeated can be encoded with or without packed. All values of the same repeated field share the same tag, since their data type and field number are identical. If each value is stored as a separate TV pair, the tag is stored redundantly.
If packed = true is set, the storage of the repeated field is optimized: the tag is stored only once, followed by the total length of all values of the field, forming a TLVV... structure. This effectively shortens the serialized data and saves transmission overhead. In proto3, where packed is already the default for scalar numeric types, [packed=false] is needed to opt out. For example:
    message repeatedEncodeTest {
      // Method 1, without packed
      repeated int32 cat = 1;
      // Method 2, with packed
      repeated int32 dog = 2 [packed=true];
    }
In the above example, the cat field does not use packed, so each cat value is stored with its own tag and value. The dog field uses packed, so the tag is stored only once, followed by the total length of all dog values and then the values themselves in sequence. When the data volume is large, a packed repeated field can significantly reduce the size of the data and the bandwidth consumed during transmission.
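To see the difference in bytes, the sketch below encodes the same three values both ways, using field 1 for the unpacked layout and field 2 for the packed one as in the cat and dog example. The function names are illustrative and the code relies only on the standard library.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendVarint appends v in base-128 varint form.
func appendVarint(buf []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(buf, tmp[:n]...)
}

// encodeUnpacked stores every value as its own tag-value pair (TV TV TV ...).
func encodeUnpacked(fieldNumber uint64, values []int32) []byte {
	var buf []byte
	for _, v := range values {
		buf = appendVarint(buf, fieldNumber<<3) // wire type 0, repeated per value
		buf = appendVarint(buf, uint64(v))
	}
	return buf
}

// encodePacked stores the tag once, then the total length, then all values (TLVV...).
func encodePacked(fieldNumber uint64, values []int32) []byte {
	var payload []byte
	for _, v := range values {
		payload = appendVarint(payload, uint64(v))
	}
	buf := appendVarint(nil, fieldNumber<<3|2) // wire type 2: length-delimited
	buf = appendVarint(buf, uint64(len(payload)))
	return append(buf, payload...)
}

func main() {
	values := []int32{1, 2, 300}
	fmt.Println(encodeUnpacked(1, values)) // [8 1 8 2 8 172 2]
	fmt.Println(encodePacked(2, values))   // [18 4 1 2 172 2]
}
```

With three values the packed form saves only one byte, but the saving grows by one tag per additional element, so long lists benefit the most.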
Conclusion
With its efficiency (compact size and fast encoding) and its rigor (strongly typed schemas), Protobuf deserves much wider adoption in the data transmission field.