Java Serialization Vs Protocol Buffers Performance

3/25/2020

This tutorial provides a basic C# programmer's introduction to working with protocol buffers, using the proto3 version of the protocol buffers language. By walking through creating a simple example application, it shows you how to

Define message formats in a .proto file.
Use the protocol buffer compiler.
Use the C# protocol buffer API to write and read messages.

This isn't a comprehensive guide to using protocol buffers in C#. For more detailed reference information, see the Protocol Buffer Language Guide, the C# API Reference, the C# Generated Code Guide, and the Encoding Reference.

Besides the benchmark project linked above, a project called. Google's protobuf project beats out most other Java serialization libraries, so it is a good baseline. For some caching I'm planning for an upcoming project, I've been thinking about Java serialization. Namely, should it be used? I've previously written custom serialization.

Why use protocol buffers?

The example we're going to use is a very simple 'address book' application that can read and write people's contact details to and from a file. Each person in the address book has a name, an ID, an email address, and a contact phone number.

How do you serialize and retrieve structured data like this? There are a few ways to solve this problem:

Use .NET binary serialization with System.Runtime.Serialization.Formatters.Binary.BinaryFormatter and associated classes. This ends up being very fragile in the face of changes, and expensive in terms of data size in some cases. It also doesn't work very well if you need to share data with applications written for other platforms.
You can invent an ad-hoc way to encode the data items into a single string, such as encoding 4 ints as '12:3:-23:67'. This is a simple and flexible approach, although it does require writing one-off encoding and parsing code, and the parsing imposes a small run-time cost. This works best for encoding very simple data.
Serialize the data to XML. This approach can be very attractive since XML is (sort of) human readable and there are binding libraries for lots of languages. This can be a good choice if you want to share data with other applications/projects. However, XML is notoriously space intensive, and encoding/decoding it can impose a huge performance penalty on applications. Also, navigating an XML DOM tree is considerably more complicated than navigating simple fields in a class normally would be.
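To make the ad-hoc option above concrete, here is a minimal Python sketch of the '12:3:-23:67' encoding. The helper names encode_ints and decode_ints are made up for illustration; the point is that even this trivial format needs its own one-off encoder and parser.

```python
def encode_ints(values):
    """Join integers into one colon-separated string."""
    return ":".join(str(v) for v in values)


def decode_ints(text):
    """Parse a colon-separated string back into a list of integers."""
    return [int(part) for part in text.split(":")]


encoded = encode_ints([12, 3, -23, 67])
decoded = decode_ints(encoded)
print(encoded)  # 12:3:-23:67
print(decoded)  # [12, 3, -23, 67]
```

The sketch has no schema, no field names and no versioning, which is why this style only works well for very simple data.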

Protocol buffers are the flexible, efficient, automated solution to solve exactly this problem. With protocol buffers, you write a .proto description of the data structure you wish to store. From that, the protocol buffer compiler creates a class that implements automatic encoding and parsing of the protocol buffer data with an efficient binary format. The generated class provides getters and setters for the fields that make up a protocol buffer and takes care of the details of reading and writing the protocol buffer as a unit. Importantly, the protocol buffer format supports the idea of extending the format over time in such a way that the code can still read data encoded with the old format.

Where to find the example code

Our example is a command-line application for managing an address book data file, encoded using protocol buffers. The command AddressBook (see: Program.cs) can add a new entry to the data file or parse the data file and print the data to the console.

You can find the complete example in the examples directory and csharp/src/AddressBook directory of the GitHub repository.

Defining your protocol format

To create your address book application, you'll need to start with a .proto file. The definitions in a .proto file are simple: you add a message for each data structure you want to serialize, then specify a name and a type for each field in the message. In our example, the .proto file that defines the messages is addressbook.proto.

The .proto file starts with a package declaration, which helps to prevent naming conflicts between different projects.

In C#, your generated classes will be placed in a namespace matching the package name if csharp_namespace is not specified. In our example, the csharp_namespace option has been specified to override the default, so the generated code uses a namespace of Google.Protobuf.Examples.AddressBook instead of Tutorial.

Next, you have your message definitions. A message is just an aggregate containing a set of typed fields. Many standard simple data types are available as field types, including bool, int32, float, double, and string. You can also add further structure to your messages by using other message types as field types.

In addressbook.proto, the Person message contains PhoneNumber messages, while the AddressBook message contains Person messages. You can even define message types nested inside other messages: the PhoneNumber type is defined inside Person. You can also define enum types if you want one of your fields to have one of a predefined list of values; here you want to specify that a phone number can be one of MOBILE, HOME, or WORK.

The ' = 1', ' = 2' markers on each element identify the unique 'tag' that field uses in the binary encoding. Tag numbers 1-15 require one less byte to encode than higher numbers, so as an optimization you can decide to use those tags for the commonly used or repeated elements, leaving tags 16 and higher for less-commonly used optional elements. Each element in a repeated field requires re-encoding the tag number, so repeated fields are particularly good candidates for this optimization.
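The one-byte versus two-byte difference can be checked with a short Python sketch of the wire format's field key, which is the varint encoding of (field_number << 3) | wire_type. The function names below are illustrative and not part of any protobuf library.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_tag(field_number, wire_type):
    """Key written before every field value on the wire."""
    return encode_varint((field_number << 3) | wire_type)


WIRE_VARINT = 0  # int32, bool, enum, ...
WIRE_LEN = 2     # string, bytes, embedded messages

for number in (1, 15, 16, 2047):
    print(number, len(encode_tag(number, WIRE_VARINT)))
# 1 1
# 15 1
# 16 2
# 2047 2

# Each element of a repeated string field carries its own key, so a
# 100-element field costs 100 extra bytes with tag 16 compared with tag 4.
extra = 100 * (len(encode_tag(16, WIRE_LEN)) - len(encode_tag(4, WIRE_LEN)))
print(extra)  # 100
```

Three of the eight bits in a one-byte key hold the wire type and one is the continuation flag, which leaves four bits for the field number; that is where the 1-15 limit comes from.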

If a field value isn't set, a default value is used: zero for numeric types, the empty string for strings, false for bools. For embedded messages, the default value is always the 'default instance' or 'prototype' of the message, which has none of its fields set. Calling the accessor to get the value of a field which has not been explicitly set always returns that field's default value.

If a field is repeated, the field may be repeated any number of times (including zero). The order of the repeated values will be preserved in the protocol buffer. Think of repeated fields as dynamically sized arrays.

You'll find a complete guide to writing .proto files, including all the possible field types, in the Protocol Buffer Language Guide. Don't go looking for facilities similar to class inheritance, though: protocol buffers don't do that.

Compiling your protocol buffers

Now that you have a .proto, the next thing you need to do is generate the classes you'll need to read and write AddressBook (and hence Person and PhoneNumber) messages. To do this, you need to run the protocol buffer compiler protoc on your .proto:

If you haven't installed the compiler, download the package and follow the instructions in the README.
Now run the compiler, specifying the source directory (where your application's source code lives; the current directory is used if you don't provide a value), the destination directory (where you want the generated code to go; often the same as $SRC_DIR), and the path to your .proto. In this case, you would invoke:
Because you want C# classes, you use the --csharp_out option; similar options are provided for other supported languages.

This generates Addressbook.cs in your specified destination directory. To compile this code, you'll need a project with a reference to the Google.Protobuf assembly.

The addressbook classes

Generating Addressbook.cs gives you five useful types:

A static Addressbook class that contains metadata about the protocol buffer messages.
An AddressBook class with a read-only People property.
A Person class with properties for Name, Id, Email and Phones.
A PhoneNumber class, nested in a static Person.Types class.
A PhoneType enum, also nested in Person.Types.

You can read more about the details of exactly what's generated in the C# Generated Code guide, but for the most part you can treat these as perfectly ordinary C# types. One point to highlight is that any properties corresponding to repeated fields are read-only. You can add items to the collection or remove items from it, but you can't replace it with an entirely separate collection. The collection type for repeated fields is always RepeatedField<T>. This type is like List<T> but with a few extra convenience methods, such as an Add overload accepting a collection of items, for use in collection initializers.

Here's an example of how you might create an instance of Person:

Note that with C# 6, you can use using static to remove the Person.Types ugliness:

Parsing and serialization

The whole purpose of using protocol buffers is to serialize your data so that it can be parsed elsewhere. Every generated class has a WriteTo(CodedOutputStream) method, where CodedOutputStream is a class in the protocol buffer runtime library. However, usually you'll use one of the extension methods to write to a regular System.IO.Stream or convert the message to a byte array or ByteString. These extension methods are in the Google.Protobuf.MessageExtensions class, so when you want to serialize you'll usually want a using directive for the Google.Protobuf namespace. For example:

Parsing is also simple. Each generated class has a static Parser property which returns a MessageParser<T> for that type. That in turn has methods to parse streams, byte arrays and ByteStrings. So to parse the file we've just created, we can use:

A full example program to maintain an address book (adding new entries and listing existing ones) using these messages is available in the GitHub repository.

Extending a Protocol Buffer

Sooner or later after you release the code that uses your protocol buffer, you will undoubtedly want to 'improve' the protocol buffer's definition. If you want your new buffers to be backwards-compatible, and your old buffers to be forward-compatible (and you almost certainly do want this), then there are some rules you need to follow. In the new version of the protocol buffer:

you must not change the tag numbers of any existing fields.
you may delete fields.
you may add new fields but you must use fresh tag numbers (i.e. tag numbers that were never used in this protocol buffer, not even by deleted fields).

(There are some exceptions to these rules, but they are rarely used.)

If you follow these rules, old code will happily read new messages and simply ignore any new fields. To the old code, singular fields that were deleted will simply have their default value, and deleted repeated fields will be empty. New code will also transparently read old messages.

However, keep in mind that new fields will not be present in old messages, so you will need to do something reasonable with the default value. A type-specific default value is used: for strings, the default value is the empty string. For booleans, the default value is false. For numeric types, the default value is zero.
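The compatibility behavior described above can be imitated with a small, self-contained Python sketch of the wire format. It is not how the generated C# code is implemented. It handles only non-negative varints and UTF-8 strings, and the schema dictionaries, helper names and field numbers (name = 1, id = 2, email = 3) are assumptions made for illustration.

```python
WIRE_VARINT, WIRE_LEN = 0, 2


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data, pos):
    """Return (value, next_pos) for the varint starting at data[pos]."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(number, value):
    """Encode an int as a varint field, or a str as a length-delimited field."""
    if isinstance(value, int):
        return encode_varint(number << 3 | WIRE_VARINT) + encode_varint(value)
    raw = value.encode("utf-8")
    return encode_varint(number << 3 | WIRE_LEN) + encode_varint(len(raw)) + raw


def parse_known(data, known):
    """Parse data using known: {field number: (name, default)}.

    Fields whose number is not in known are skipped; known fields that are
    absent from data keep their default value.
    """
    message = {name: default for name, default in known.values()}
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == WIRE_VARINT:
            value, pos = decode_varint(data, pos)
        elif wire_type == WIRE_LEN:
            length, pos = decode_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        if number in known:
            name, default = known[number]
            message[name] = value.decode("utf-8") if isinstance(default, str) else value
    return message


OLD_SCHEMA = {1: ("name", ""), 2: ("id", 0)}
NEW_SCHEMA = {1: ("name", ""), 2: ("id", 0), 3: ("email", "")}

old_bytes = encode_field(1, "Ada") + encode_field(2, 7)
new_bytes = old_bytes + encode_field(3, "ada@example.com")

print(parse_known(new_bytes, OLD_SCHEMA))  # {'name': 'Ada', 'id': 7}
print(parse_known(old_bytes, NEW_SCHEMA))  # {'name': 'Ada', 'id': 7, 'email': ''}
```

The first call plays the role of old code reading a new message: field 3 is skipped because the wire type alone tells the parser how many bytes to jump over. The second call is new code reading an old message, where the missing email falls back to the empty-string default.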

Reflection

Message descriptors (the information in the .proto file) and instances of messages can be examined programmatically using the reflection API. This can be useful when writing generic code such as a different text format or a smart diff tool. Each generated class has a static Descriptor property, and the descriptor for any instance can be retrieved using the IMessage.Descriptor property. As a quick example of how these can be used, here is a short method to print the top-level fields of any message.

liujisi released this Aug 15, 2017

Planned Future Changes

Preserve unknown fields in proto3: We are going to bring unknown fields back into proto3. In this release, some languages start to support preserving unknown fields in proto3, controlled by flags/options. Some languages also introduce explicit APIs to drop unknown fields for migration. Please read the change log sections by language for details. See the general timeline and plan, and the issues and discussions.
Make C++ implementation C++11 only: we plan to require C++11 to build protobuf code starting from the 3.5.0 or 3.6.0 release, after the unknown fields semantic changes are finished. Please join this GitHub issue to provide your feedback.

General

Extension ranges now accept options and are customizable.
The reserved keyword now supports max in field number ranges, e.g. reserved 1000 to max;

C++

Proto3 messages are now able to preserve unknown fields. The default behavior is still to drop unknowns, which will be flipped in a future release. If you rely on unknown fields being dropped, please use Message::DiscardUnknownFields() explicitly.
Packable proto3 fields are now packed by default in serialization.
Following C++11 features are introduced when C++11 is available:
- move-constructor and move-assignment are introduced to messages
- Repeated fields constructor now takes std::initializer_list
- rvalue setters are introduced for string fields
Experimental table-driven parsing and serialization are available to test. To enable them, pass the table_driven_parsing and table_driven_serialization generator flags to protoc for C++:

$ protoc --cpp_out=table_driven_parsing,table_driven_serialization:./ test.proto
The lite generator parameter is now supported by the generator. Once set, all generated files use the lite runtime regardless of the optimize_for setting in the .proto file.
Various optimizations to make C++ code more performant on PowerPC platform
Fixed maps data corruption when the maps are modified by both reflection API and generated API.
Deterministic serialization on maps reflection now uses stable sort.
file() accessors are introduced to various *Descriptor classes to make writing template function easier.
ByteSize() and SpaceUsed() are deprecated. Use ByteSizeLong() and SpaceUsedLong() instead.
Consistent hash function is used for maps in DEBUG and NDEBUG build.
'using namespace std' is removed from stubs/common.h
Various performance optimizations and bug fixes

Java

Introduced new parser API DiscardUnknownFieldsParser in preparation of proto3 unknown fields preservation change. Users who want to drop unknown fields should migrate to use this new parser API.
For example:
Introduced new TextFormat API printUnicodeFieldValue() that prints field value without escaping unicode characters.
Added Durations.compare(Duration, Duration) and Timestamps.compare(Timestamp, Timestamp).
JsonFormat now accepts base64url encoded bytes fields.
Optimized CodedInputStream to make fewer copies when parsing large bytes fields.
Optimized TextFormat to allocate less memory when printing.

Python

SerializeToString API is changed to SerializeToString(self, **kwargs), deterministic parameter is accepted for deterministic serialization.
Added sort_keys parameter in json format to make the output deterministic.
Added indent parameter in json format.
Added extension support in json format.
Added __repr__ support for repeated field in cpp implementation.
Added file in FieldDescriptor.
Added pretty-print filter to text format.
Services and method descriptors are always printed even if generic_service option is turned off.
Note: AppEngine 2.5 was deprecated in June 2017, so AppEngine 2.5 will never update its protobuf runtime. Users who depend on AppEngine 2.5 should use an old protoc.

PHP

Support PHP generic services. Specify file option php_generic_service=true to enable generating service interface.
Message, repeated and map fields setters take value instead of reference.
Added map iterator in c extension.
Support json encode/decode.
Added more type info in getter/setter phpdoc
Fixed the problem that c extension and php implementation cannot be used together.
Added file option php_namespace to use custom php namespace instead of package.
Added fluent setter.
Added descriptor API in runtime for custom encode/decode.
Various bug fixes.

Objective-C

Fix for GPBExtensionRegistry copying and add tests.
Optimize GPBDictionary.m codegen to reduce size of overall library by 46K per architecture.
Fix some cases of reading of 64bit map values.
Properly error on a tag with field number zero.
Preserve unknown fields in proto3 syntax files.
Document the exceptions on some of the writing apis.

C#

Implemented IReadOnlyDictionary<K,V> in MapField<K,V>
Added TryUnpack method for Any message in addition to Unpack.
Converted C# projects to MSBuild (csproj) format.

Ruby

Several bug fixes.

Javascript

Added support for the field option jstype. You can now specify that the JS type of a 64-bit integer field should be string in the generated code by adding the option [jstype = JS_STRING] to the field.

Assets (23)

protobuf-cpp-3.4.0.zip (5.02 MB)
protobuf-csharp-3.4.0.zip (5.47 MB)
protobuf-java-3.4.0.zip (5.68 MB)
protobuf-js-3.4.0.zip (5.27 MB)
protobuf-objectivec-3.4.0.zip (5.61 MB)
protobuf-php-3.4.0.zip (5.38 MB)
protobuf-python-3.4.0.zip (5.39 MB)
protobuf-ruby-3.4.0.zip (5.33 MB)
protoc-3.4.0-linux-x86_64.zip (1.33 MB)
protoc-3.4.0-osx-x86_64.zip (1.74 MB)