Compressing a year of Reddit with Apache Avro and Google Protobuf

Cristian Toader
8 min read · Mar 24, 2020

Big data projects have always seemed to generate a lot of hype and excitement among developers. And for good reason: persisting, transforming, running analytics over, or doing just about anything with large data sets is quite complex, and a lot of exciting frameworks and architectural patterns have surfaced to tackle that complexity.

From the outside looking in, it seems like the bigger the data, the more interesting the project. What is a bit ironic, though, is that once you actually get to work on a big data project you start to realize that one of the challenges is making it smaller. That is the focus of this article: we will look at Apache Avro and Google Protocol Buffers, two well-established solutions, briefly compare their features and SDKs, and then benchmark the compression achieved by each.

Shared background

As a bit of a disclaimer, my experience prior to researching this article has been predominantly with Google Protobuf. In fact, this whole article is a promised follow-up to another article I’ve written about gRPC APIs, which is the main use case for Protobuf. I had heard about Avro before, and knew it was a competitor from the serialized-compression point of view thanks to industry reference books such as Designing Data-Intensive Applications. However, it was only once I started using it that I realized just how bizarrely similar Avro and Protobuf are.

Need for a change

Back in 2015 Google was rethinking communication between its microservices. It had traditionally relied on an internal remote procedure call (RPC) solution called Stubby, and was looking to replace it with a new standard that was less interconnected with its infrastructure and more open to the outside world.

In this context they created gRPC, and because APIs are not very useful without data structures, the part of gRPC used for data modelling and serialization was decoupled into its own project: Google Protocol Buffers (Protobuf or GPB).

Meanwhile Apache Hadoop had a completely different motivation behind Avro… wait… it’s almost exactly the same! In 2009, while most people and companies were working through the aftermath of the recent economic crisis, the innovative minds behind Apache Hadoop were discussing spinning off Avro as a Hadoop sub-project meant to “replace both Hadoop’s RPC and to be used for most Hadoop data files, e.g., by Pig, Hive, etc”.

Remote Procedure Calls

Beyond providing good data serializers, both Avro and Protobuf are built around the concept of remote procedure call (RPC) APIs. They both rely on Netty as a default web server, which means they both benefit from HTTP/2 and non-blocking API calls which you can (and probably should) use in a reactive way. The RPC features they offer are not in the scope of this article, so I’ll stop here, but at first glance they’re quite similar in features.

Data compression

One of the key features, and the one we will explore in greater detail in the following sections, is the compression of objects on serialization. Unlike other formats that may be used for exports (e.g. CSV, JSON, XML), for persistence in columnar or NoSQL databases, or for serving data via APIs, both Avro and Protobuf offer this out of the box.

The reasoning behind compression is, I think, quite intuitive. Both projects started in the context of high volumes of data, and let’s not forget they were built with RPCs in mind (so data transfers), not with browser compatibility or human readability as goals. Adding up all these factors makes compression on serialization a rather obvious choice.

Benchmarking compression

From prior project experience with Protobuf, I know just how important this data compression is, and I wanted that impact to come across in this article as well. Choosing the right data set was an important part of getting that across in a meaningful way.

The data set we will be using is the 2015 Reddit comments database found on Kaggle. It is roughly 20 GB compressed in zip format, and about 30 GB after decompression. It would have been nice if I had been a more active Reddit user in 2015 so I could vouch for its completeness, but the sizes sound plausible, so let’s go ahead.

All of the findings and code snippets presented from now on are based on the data set mentioned above, as well as a personal GitHub project which you can download and play around with.

So just to be clear going forward: both Avro and Protobuf use code generation based on their own schema definition languages, which can be compiled to multiple programming languages.

As things may change from one release to another, I have used the following dependencies for Avro and Protobuf. I’ve also included the code generation plugins, as the compression is ultimately the result of the Java classes they generate.
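
For reference, the build setup could look roughly like the Maven sketch below; the versions and plugin configuration are illustrative rather than an exact copy of the project’s pom.xml, and protoc resolution (os-maven-plugin / protocArtifact) is omitted for brevity.

    <!-- Runtime libraries; versions shown here are illustrative -->
    <dependencies>
      <dependency>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro</artifactId>
        <version>1.9.2</version>
      </dependency>
      <dependency>
        <groupId>com.google.protobuf</groupId>
        <artifactId>protobuf-java</artifactId>
        <version>3.11.4</version>
      </dependency>
    </dependencies>

    <build>
      <plugins>
        <!-- Generates Java sources from the Avro protocol/schema files -->
        <plugin>
          <groupId>org.apache.avro</groupId>
          <artifactId>avro-maven-plugin</artifactId>
          <version>1.9.2</version>
          <executions>
            <execution>
              <phase>generate-sources</phase>
              <goals>
                <goal>protocol</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
        <!-- Generates and compiles Java classes from the .proto files -->
        <plugin>
          <groupId>org.xolstice.maven.plugins</groupId>
          <artifactId>protobuf-maven-plugin</artifactId>
          <version>0.6.1</version>
          <executions>
            <execution>
              <goals>
                <goal>compile</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>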

Apache Avro experience

After building the project and analyzing the plugin configuration, we can already see the first difference. While Protobuf generates class files which you simply import as part of the project, Avro generates actual Java source files which you need to handle directly in your project, which is quite an interesting approach.

While I was playing around and trying to change the Avro definitions to better fit the Reddit comments model (a breaking change), this led my project to no longer compile. It’s an interesting way of alerting the developer to be a little more mindful of the open-closed principle, and more empathetic towards other programmers who might go through the same process.

Let’s have a look at how these Avro definitions look for the Reddit comments database.
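
As a minimal sketch, a protocol in the ‘.avpr’ format for a Reddit comment could look like this; the namespace and record name are made up for the illustration, and only a handful of the dataset’s columns are shown.

    {
      "protocol": "RedditComments",
      "namespace": "com.example.reddit.avro",
      "types": [
        {
          "type": "record",
          "name": "RedditComment",
          "fields": [
            {"name": "id", "type": "string"},
            {"name": "author", "type": "string"},
            {"name": "subreddit", "type": "string"},
            {"name": "body", "type": "string"},
            {"name": "score", "type": "long"},
            {"name": "created_utc", "type": "long"},
            {"name": "distinguished", "type": ["null", "string"], "default": null}
          ]
        }
      ],
      "messages": {}
    }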

At first glance the Avro definitions are a bit verbose, and upon some reading we find out this is on purpose. In Avro, data is always accompanied by its schema in order to:

  1. Allow each entry (datum) to be written with no per-value overhead (unlike JSON or XML)
  2. Facilitate use with dynamic/scripting languages

As an alternative you can use Avro IDL, which is a more human-friendly definition language, but based on the documentation on their website each ‘.avdl’ file ends up being compiled to the standard Avro ‘.avpr’ format we’ve used in the sample above before being transformed into Java classes. I’ve added a sample IDL file below (unrelated to the current benchmarking exercise) to get a sense of its format.
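
Something along these lines, with every name made up purely to illustrate the syntax:

    // A purely illustrative Avro IDL protocol; all names are made up.
    @namespace("com.example.idl")
    protocol UserService {

      record User {
        string id;
        string name;
        // A nullable field is a union with null, given a null default.
        union { null, string } email = null;
      }

      // RPC messages are defined alongside the data types.
      User getUser(string id);
    }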

While developing this test, a somewhat unpleasant experience was handling the dreaded null values. In order to declare a field as nullable you have to add “null” as a possible type in a union. OK, this I can understand, but while transferring data I realized that by default none of the fields are optional, which is a bit unpleasant from an SDK point of view. You are basically forced to add that “null” type on every field which might not be set. Yes, we shouldn’t have null values, null should be avoided, etc., but why take the optionality away? It’s something that I think could have been handled at the SDK level, for example by making all fields optional with an empty default unless explicitly set.
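
For example, a nullable field ends up being declared as a union with “null” plus a null default, roughly like this:

    {"name": "distinguished", "type": ["null", "string"], "default": null}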

Google Protobuf setup

Looking at the Protobuf definition below for the same Reddit data set, we can see something a bit more familiar and similar to the Avro IDL format.
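
A minimal sketch of what such a definition might look like is below; the package, options, and message name are assumptions for this illustration rather than copies from the project, and only a handful of the dataset’s columns are shown.

    syntax = "proto3";

    package com.example.reddit;

    option java_package = "com.example.reddit.proto";
    option java_multiple_files = true;

    // Sketch of a Reddit comment message; the field numbers (not the names)
    // are what identify fields on the wire.
    message RedditComment {
      string id = 1;
      string author = 2;
      string subreddit = 3;
      string body = 4;
      int64 score = 5;
      int64 created_utc = 6;
      string distinguished = 7;
    }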

To compare its features with Avro: the schema is not dynamic and cannot simply be consumed at runtime using a scripting language. That is quite a nice touch from the Avro developers; in Protobuf not only do you need the “contract”, you also need it compiled in order to process the data. Protobuf relies on field numbering when serializing the data, so details such as field names are lost. You could say that this translates to better backwards compatibility, but I think I like the runtime schema a bit better. It feels like you have more flexibility when consuming the data, and you don’t need to rely on shared dependencies for schema sharing.

Optionality, on the other hand, is not an issue in Protobuf. Starting with the proto3 format all fields are optional by default, and trying to set a field to a null value does not end well at runtime (an exception is thrown). To share a bit of a trick/code snippet that I’ve used in Protobuf (and very similarly in Avro) in order to avoid explicitly setting null values, I used a simple generic method.
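
A minimal sketch of that helper (the class and method names are my own, not part of either SDK):

    import java.util.function.Consumer;

    public final class Setters {

        private Setters() {
        }

        // Invokes the generated builder setter only when the value is non-null,
        // since proto3 setters throw a NullPointerException on null input.
        public static <T> void setIfPresent(T value, Consumer<T> setter) {
            if (value != null) {
                setter.accept(value);
            }
        }
    }

It is then used while mapping a database row to the generated builder, e.g. setIfPresent(row.getAuthor(), builder::setAuthor).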

What could be improved in the proto3 format, however, is the handling of some data types (e.g. double), which become primitive types when compiled to Java. The code generated for primitives with the proto3 syntax loses the ::hasFieldName() methods which are available in proto2 for testing whether a field was set. You can still opt for the wrapper types, but this makes the end result a bit more verbose.
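
For example, swapping a primitive double for the well-known wrapper type restores field presence, at the cost of more verbose generated code; the message below is a made-up illustration:

    syntax = "proto3";

    import "google/protobuf/wrappers.proto";

    message ScoredThing {
      // Message-typed fields keep explicit presence, so the generated Java
      // class exposes hasScore(), unlike a plain `double score = 1;` field.
      google.protobuf.DoubleValue score = 1;
    }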

Benchmarking scenario

I know, you’ve opened this link hoping to see some quick and clear results, and so far you’ve read everything except that. Before presenting them, let me also describe the experiment’s scenario and that’s it. We’re almost there!

In broad terms, I went line by line through the SQLite database of 2015 Reddit comments, converted each comment to Avro/Protobuf, and summed up the total size in bytes of the resulting objects for each. Protobuf makes this conversion of an entry object to bytes very straightforward: you just call one method (::toByteArray) on the entry and that’s it.
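
As a sketch, reusing the RedditComment message from the Protobuf definition above (the values are illustrative):

    // Protobuf: the generated message exposes toByteArray() directly.
    long totalBytes = 0;
    RedditComment comment = RedditComment.newBuilder()
            .setAuthor("someUser")
            .setBody("some comment text")
            .build();
    byte[] bytes = comment.toByteArray();
    totalBytes += bytes.length;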

Avro, however, apparently wasn’t built with this in mind, so you have to jump through a few hoops to achieve it, as can be seen in the snippet below. It probably offers better support at an API/RPC level (?), and judging from the Java samples on their website, the SDK seems much friendlier for exporting data to disk. This is an interesting design choice, which probably relates back to one of the original purposes of Avro: Hadoop structured data files.
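
A rough equivalent of that plumbing, assuming an Avro-generated RedditComment class (the wrapper class and method names are mine):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.specific.SpecificDatumWriter;

    public final class AvroBytes {

        // Serializing one Avro-generated record to a byte array needs a datum
        // writer plus a binary encoder, instead of a single method call.
        public static byte[] toBytes(RedditComment comment) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            SpecificDatumWriter<RedditComment> writer =
                    new SpecificDatumWriter<>(RedditComment.class);
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            writer.write(comment, encoder);
            encoder.flush();
            return out.toByteArray();
        }
    }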

Results and conclusions

And now for the actual comparison of results: both frameworks did remarkably well!

  1. Google Protobuf: total size in bytes after conversion 15,010,423,665 ~ 15.01 GB out of 29.6 GB; this means a compression rate of 49.3%
  2. Apache Avro: total size in bytes after conversion 15,633,574,743 ~ 15.63 GB out of 29.6 GB; this means a compression rate of 47.2%

So, in conclusion, the benchmarking showed no significant differences. There is an actual difference of 2.1% in favor of Protobuf, but this is data dependent. It’s not impossible that a different data set would show slightly better results for Avro.

For both frameworks, however, these results are amazing. Moving away from no-compression storage, we’ve saved close to 50% of disk space (or memory, depending on what you do with the data). To put this in more pragmatic production terms: you would spend almost half of what you currently do on data storage solutions, you would need half the data centers, half the memory (for solutions such as Apache Ignite), you would transfer half the data over networks, and so on.

I am hoping you are not disappointed by the similarity of the two frameworks and their compression results. Making a choice is indeed hard, and it depends a lot on each individual scenario/use case.

From what I’ve experimented with, I would recommend Avro for loosely coupled schemas and contracts, for dynamic schemas, and as an alternative if you are ever thinking of backing up data in CSV or JSON formats, or even as SQLite.

But for everything else I would still go with Protobuf, because the SDK is cleaner for serialization, null handling, and optionality. The API supports somewhat better backwards compatibility by decoupling field names from their identity. You can also use it more naturally for cold storage, for example by keeping only minimal structured metadata while the bulk of the information is compressed as serialized Protobufs.

I hope you’ve found this article useful, and please feel free to share a comment below!


Cristian Toader

I am an Engineering Manager at Bolt. I am passionate about technology as well as leadership, psychology, and coaching.