The layers of abstraction in most technology stacks have gotten incredibly deep over the last decade. At some point, way down there in the depths of most data applications, somebody somewhere has to actually read or write bytes to storage. The flexibility of Apache Parquet has me increasingly convinced that it just might be the last data file format I will need.

In my previous post on the subject I wrote about the file format's knack for semi-random data access inside a .parquet file. I'm certainly wandering off the beaten path with Apache Parquet already. Then this blog post kind of blew my mind: Embedding User-Defined Indexes in Apache Parquet Files.

> However, Parquet is extensible with user-defined indexes: Parquet *tolerates unknown bytes within the file body* and *permits arbitrary key/value pairs in its footer metadata*. These two features enable embedding user-defined indexes directly in the file—no extra files, no format forks, and no compatibility breakage.

Emphasis mine.

This is news to me.

And it is absolutely wild.


The authors’ approach to embedding user-defined indexes in Apache Parquet files is certainly novel and worth a read on its own.
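
The footer key/value half of that claim is easy to poke at yourself. Here is a minimal sketch with pyarrow; the file name and metadata key are placeholders of my own, not anything from the post:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Any key/value pairs attached to the schema end up in the Parquet footer metadata.
table = pa.table({"x": [1, 2, 3]})
table = table.replace_schema_metadata({"my_index_hint": "anything I want to stash here"})
pq.write_table(table, "demo.parquet")

# An aware reader can pull the pairs back out of the footer...
print(pq.read_metadata("demo.parquet").metadata)

# ...while an ordinary reader just sees a normal Parquet file.
print(pq.read_table("demo.parquet"))
```

That only exercises the footer half, of course; stashing bytes in the file body takes lower-level control over the writer, which is what the blog post works through.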

But the fact that you can shove arbitrary blocks of bytes in the middle of the otherwise columnar data format is incredible.

Modifying an Apache Parquet file still requires rewriting the whole object, which means .parquet is not a file format to reach for in heavy data-modification workloads.

Use cases with large amounts of metadata and binary data, however, would fit nicely within this parquet + unknown bytes design. Parquet readers that don't recognize the purpose of these unknown byte blocks will completely ignore them.
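
To make that concrete, here is a rough sketch of the read side, assuming the writer recorded the embedded index's byte offset and length under a footer key of its own choosing. The key name and JSON layout below are my own invention, not the blog post's actual format:

```python
import json
import pyarrow.parquet as pq

def read_embedded_index(path: str) -> bytes:
    # The footer's key/value metadata tells an index-aware reader where to look.
    footer_meta = pq.read_metadata(path).metadata
    location = json.loads(footer_meta[b"my_index_location"])  # e.g. {"offset": 1234, "length": 567}

    # No row group or column chunk offset points at these bytes, so ordinary
    # readers never touch them; an aware reader grabs them with a plain seek.
    with open(path, "rb") as f:
        f.seek(location["offset"])
        return f.read(location["length"])
```

Everything in a Parquet file is located through absolute offsets recorded in the footer, which is exactly why byte ranges that nothing points to are invisible to readers that were never told about them.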

Altogether this is a new superpower, and I am contemplating whether I can use it for good or evil…