<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://brokenco.de//feed/by_tag/buoyantdata.xml" rel="self" type="application/atom+xml" /><link href="https://brokenco.de//" rel="alternate" type="text/html" /><updated>2026-04-12T21:39:52+00:00</updated><id>https://brokenco.de//feed/by_tag/buoyantdata.xml</id><title type="html">rtyler</title><subtitle>a moderately technical blog</subtitle><author><name>R. Tyler Croy</name></author><entry><title type="html">Private Open Source</title><link href="https://brokenco.de//2026/04/01/private-open-source.html" rel="alternate" type="text/html" title="Private Open Source" /><published>2026-04-01T00:00:00+00:00</published><updated>2026-04-01T00:00:00+00:00</updated><id>https://brokenco.de//2026/04/01/private-open-source</id><content type="html" xml:base="https://brokenco.de//2026/04/01/private-open-source.html"><![CDATA[<p>Open source communities depend on a fundamental assumption that is no longer
true: the presumption of good faith actors. The hosts serving free and open
source code are scraped relentlessly, denying service to developers. Once that
code has been assimilated into various models, it is washed of all attribution
and license information, denying developers their rights. Some subset of users
then feel empowered, emboldened, or I’m not sure what exactly, by these models,
and lob massive thousand-line changes back at the developers. Nearly every
technology can be used for positive and negative ends,
but free and open source communities are being harmed from multiple directions
right now.</p>

<p>I am a big believer in <a href="https://openinfra.org/four-opens/">the four opens</a>:</p>

<blockquote>
  <p>The Four Opens are a set of principles that were created by the
OpenStack community as a way to guarantee that the users get all the benefits
associated with open source software, including the ability to engage with the
community and influence future evolution of the software.</p>

  <ul>
    <li>Open Source</li>
    <li>Open Design</li>
    <li>Open Development</li>
    <li>Open Community</li>
  </ul>
</blockquote>

<p>There is an implied “to the public” in each of the four opens, at least as I
have understood it over the past many (<em>many</em>) years. I have repeatedly
advocated for open (to the public) discourse and transparency when working with
companies like <a href="https://cloudbees.com">CloudBees</a> and
<a href="https://databricks.com">Databricks</a> as they have engaged with open source
projects.</p>

<p>The mounting negative pressures and in some cases <a href="https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/">outright
hostility</a>
towards free and open source projects has me reconsidering the implied “to the
public” and how these communities may need to evolve in the future.</p>

<p>I have never been a fan of invite-only Discord or Slack servers, both of
which are used by the <a href="https://datafusion.apache.org/contributor-guide/communication.html">Apache
DataFusion</a>
project for some odd reason. Still, there are good reasons to put a project’s shared
spaces in slightly more private and slightly less AI-accessible systems. A
little bit of privacy can lead to more candid conversations and <em>potentially</em> a
stronger feeling of community and safety.</p>

<p>My first line of thinking led me to the idea of “vouching” which I recall
<a href="https://mitchellh.com/writing">mitchellh</a> posting about in the fediverse, but
I couldn’t find a good linkable reference.</p>

<p>Vouching is what we did as kids when a new friend was suggested to join the
mischief: somebody would vouch for the new kid and say “hey, they’re my
neighbor, they’re cool” and then we would go start new trouble together. In the
context of an open source community vouching can:</p>

<ul>
  <li>Help build a web of trust without every person necessarily knowing each new person</li>
  <li>But <em>also</em> make the community tend toward homogeneity, since it will be
less welcoming to random newcomers.</li>
</ul>

<p>I think vouching could also exacerbate the likelihood of a <a href="https://en.wikipedia.org/wiki/XZ_Utils_backdoor">Jia
Tan</a> scenario, where the web of trust
within the community is compromised by a malicious actor. Getting <em>one</em> member
to vouch for you may lower the guard of all of the other members of the
community, making this style of attack easier to pull off.</p>

<p>Since I started writing this post a whole week has passed by without any new
ideas or patterns popping into mind. I’m curious how others are thinking about
it, so please let me know <a href="https://hacky.town/@rtyler/116329725989266400">on Mastodon</a> or via
email <code class="language-plaintext highlighter-rouge">rtyler@</code>~</p>]]></content><author><name>R. Tyler Croy</name></author><category term="opensource" /><category term="buoyantdata" /><category term="ai" /><summary type="html"><![CDATA[Open source communities depend on a fundamental assumption that is no longer true: the presumption of good faith actors. The hosts serving free and open source code are scraped relentlessly, denying service to developers. Once that code has been assimilated into various models it is washed of all attribution and license information, denying rights of the developers. Some subset of users then feel empowered, emboldened, I’m not sure what exactly by these models and lob massive thousand line changes back at the developers. Nearly every technology has the possibility to be used for positive and negative effects, but free and open source communities are being harmed from multiple directions right now.]]></summary></entry><entry><title type="html">On Data Engineering Central</title><link href="https://brokenco.de//2026/02/04/data-engineering-central.html" rel="alternate" type="text/html" title="On Data Engineering Central" /><published>2026-02-04T00:00:00+00:00</published><updated>2026-02-04T00:00:00+00:00</updated><id>https://brokenco.de//2026/02/04/data-engineering-central</id><content type="html" xml:base="https://brokenco.de//2026/02/04/data-engineering-central.html"><![CDATA[<p>I was lucky enough to <a href="https://dataengineeringcentral.substack.com/p/the-lakehouse-architecture-multimodal">record a podcast
episode</a>
with Daniel Beach of Data Engineering Central. Daniel and I have known each
other for a couple of years, sharing notes and ideas on the state of the ecosystem,
where it falls down, and where things are getting interesting.</p>

<p>In my opinion <a href="https://dataengineeringcentral.substack.com">Data Engineering
Central</a> has been one of the most
useful broad-ranging surveys of the ecosystem, curated by one crazy
Midwesterner: Daniel. He pulls no punches, and while we share criticisms of AI
in the industry and commercial tools, Daniel’s honesty has also put some of my
work on blast, such as <a href="https://dataengineeringcentral.substack.com/p/_internaldeltaprotocolerror">this
post</a>
about some terrible user experience and lopsided Delta Lake support in
<a href="https://github.com/delta-io/delta-rs">delta-rs</a>.</p>

<p>In his post Daniel highlights some of the topics we got into during our time chatting:</p>

<blockquote>
  <ul>
    <li>What the Lakehouse architecture gets right—and where it still falls short</li>
    <li>Why multimodal data (text, images, audio, video, embeddings) changes everything</li>
    <li>How open table formats like Delta Lake fit into the next generation of data platforms</li>
    <li>The growing gap between data tooling hype and day-to-day data engineering reality</li>
    <li>What skills and architectural thinking will matter most for data engineers over the next decade</li>
  </ul>
</blockquote>

<center><iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/WLlko-liHMg?si=9aGp1v-6nm2kbya0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe></center>

<p>I encourage you to <a href="https://dataengineeringcentral.substack.com/">subscribe</a> to
his newsletter or if that’s not your jam, you can <a href="https://dataengineeringcentral.substack.com/feed">subscribe to the RSS
feed</a> too.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="software" /><category term="dataeng" /><category term="buoyantdata" /><category term="databricks" /><category term="podcast" /><summary type="html"><![CDATA[I was lucky enough to record a podcast episode with Daniel Beach of Data Engineering Central. Daniel and I have known each other for a couple years sharing notes and ideas on the state of the ecosystem, where it falls down, and where things are getting interesting.]]></summary></entry><entry><title type="html">The last data file format</title><link href="https://brokenco.de//2025/07/16/no-way-parquet.html" rel="alternate" type="text/html" title="The last data file format" /><published>2025-07-16T00:00:00+00:00</published><updated>2025-07-16T00:00:00+00:00</updated><id>https://brokenco.de//2025/07/16/no-way-parquet</id><content type="html" xml:base="https://brokenco.de//2025/07/16/no-way-parquet.html"><![CDATA[<p>The layers of abstraction in most technology stacks have gotten incredibly deep
over the last decade. At some point way down there in the depths of <em>most</em> data
applications somebody <em>somewhere</em> has to actually read or write bytes to
storage. The flexibility of <a href="https://parquet.apache.org">Apache Parquet</a> has me
increasingly convinced that it just might be the <strong>last data file format I will
need</strong>.</p>

<p>In my <a href="/2025/06/24/low-latency-parquet.html">previous post</a> on the subject I
wrote about the file format’s novelty for semi-random data access <em>inside</em> of a
<code class="language-plaintext highlighter-rouge">.parquet</code> file. I’m certainly wandering off the beaten path with Apache
Parquet already. <em>Then</em> this blog post kind of blew my mind: <a href="https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/">Embedding
User-Defined Indexes in Apache Parquet
Files</a>.</p>

<blockquote>
  <p>However, Parquet is extensible with user-defined indexes: <strong>Parquet tolerates
unknown bytes within the file body</strong> and permits arbitrary key/value pairs in
its footer metadata. These two features enable embedding user-defined indexes
directly in the file—no extra files, no format forks, and no compatibility
breakage.</p>
</blockquote>

<p>Emphasis mine.</p>

<p>This is news to me.</p>

<p>And it is <em>absolutely wild</em>.</p>

<hr />

<p>The authors’ approach for embedding user-defined indexes in Apache Parquet
files is certainly novel, and their post is already worth a read.</p>

<p>But the fact that you can shove arbitrary blocks of bytes in the middle of the
otherwise columnar data format is incredible.</p>
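<p>The footer half of that trick is easy to experiment with from Rust. Here is a
minimal sketch of attaching an application-defined key/value entry to a file’s
footer metadata using the <code class="language-plaintext highlighter-rouge">arrow</code> and <code class="language-plaintext highlighter-rouge">parquet</code> crates; the key name and its
payload are made up for illustration:</p>

<pre><code class="language-rust">use std::fs::File;
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;
use parquet::format::KeyValue;

fn main() {
    // A single tiny column so the file is a valid Parquet file
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let ids = Int64Array::from(vec![1, 2, 3]);
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(ids)]).unwrap();

    // Attach an application-defined entry to the footer's key/value
    // metadata; "example.index.offset" and its payload are made up
    let props = WriterProperties::builder()
        .set_key_value_metadata(Some(vec![KeyValue::new(
            "example.index.offset".to_string(),
            "12345".to_string(),
        )]))
        .build();

    let file = File::create("/tmp/indexed.parquet").unwrap();
    let mut writer = ArrowWriter::try_new(file, schema, Some(props)).unwrap();
    writer.write(&amp;batch).unwrap();
    writer.close().unwrap();
}
</code></pre>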

<p>Modifying an Apache Parquet file still requires a rewrite of the whole
object, which means <code class="language-plaintext highlighter-rouge">.parquet</code> is not a file format to be used for heavy data
modification workloads.</p>

<p>Use-cases with large amounts of metadata and binary data, however, would fit nicely
within this parquet + unknown bytes design. Parquet readers which are ignorant
of the purpose of these unknown byte blocks will completely ignore them.</p>
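<p>A cooperating reader that <em>does</em> know what to look for can pull that footer
entry back out. Another sketch, reusing the made-up key from the writer example
above:</p>

<pre><code class="language-rust">use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() {
    let file = File::open("/tmp/indexed.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();

    // The footer's key/value metadata is just an optional list of
    // string pairs; readers that don't recognize a key skip over it
    if let Some(entries) = reader.metadata().file_metadata().key_value_metadata() {
        for kv in entries {
            println!("{} = {:?}", kv.key, kv.value);
        }
    }
}
</code></pre>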

<p>Altogether this is a new superpower, and I am contemplating whether I can use it for good or evil…</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="parquet" /><category term="buoyantdata" /><category term="dataeng" /><summary type="html"><![CDATA[The layers of abstraction in most technology stacks have gotten incredibly deep over the last decade. At some point way down there in the depths of most data applications somebody somewhere has to actually read or write bytes to storage. The flexibility of Apache Parquet has me increasingly convinced that it just might be the last data file format I will need.]]></summary></entry><entry><title type="html">Busily writing elsewhere</title><link href="https://brokenco.de//2025/05/03/writing-elsewhere.html" rel="alternate" type="text/html" title="Busily writing elsewhere" /><published>2025-05-03T00:00:00+00:00</published><updated>2025-05-03T00:00:00+00:00</updated><id>https://brokenco.de//2025/05/03/writing-elsewhere</id><content type="html" xml:base="https://brokenco.de//2025/05/03/writing-elsewhere.html"><![CDATA[<p>Writing has been a part of my work for a <em>long</em> time; it helps me think and,
more importantly, it helps me share ideas with other developers. Recently a
tremendous amount of my time has been spent writing internal design documents,
blog posts, and other materials. By the time it comes to personal blogging,
my words have all been spent.</p>

<p>On the <a href="https://buoyantdata.com">Buoyant Data</a> blog I have been writing about a
<em>lot</em> of <a href="https://delta.io">Delta Lake</a> related topics such as:</p>

<ul>
  <li><a href="https://www.buoyantdata.com/blog/2024-12-31-high-concurrency-logstore.html">Scaling streaming Delta Lake applications</a></li>
  <li><a href="https://www.buoyantdata.com/blog/2025-02-24-just-keep-buffering.html">Buffering more messages with serverless data ingestion</a></li>
  <li><a href="https://www.buoyantdata.com/blog/2025-03-09-lessons-learned-building-delta-rs.html">Lessons learned in building delta-rs</a></li>
  <li><a href="https://www.buoyantdata.com/blog/2025-04-22-rust-is-good-for-the-climate.html">Build more climate-friendly data applications with Rust</a></li>
</ul>

<p>Some of this work has been in preparation for the two upcoming talks I have at
<a href="https://www.databricks.com/dataaisummit">Data and AI Summit 2025</a>. Some of
these posts have come out of research with clients, or just spelunking on my
own.</p>

<p>You can <a href="https://www.buoyantdata.com/rss.xml">subscribe to the RSS feed</a> for more up to date articles relating to high-efficiency data processing with Rust!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="deltalake" /><category term="buoyantdata" /><summary type="html"><![CDATA[Writing has been a part of my work for a long time, it helps me think and more importantly it helps me share ideas with other developers. Recently a tremendous amount of my time has been spent writing internal design documents, blog posts, and other materials. By the time it has come to personal blogging my words all been spent.]]></summary></entry><entry><title type="html">Fedi-hired! Redesigning the company website</title><link href="https://brokenco.de//2024/12/02/fedihired.html" rel="alternate" type="text/html" title="Fedi-hired! Redesigning the company website" /><published>2024-12-02T00:00:00+00:00</published><updated>2024-12-02T00:00:00+00:00</updated><id>https://brokenco.de//2024/12/02/fedihired</id><content type="html" xml:base="https://brokenco.de//2024/12/02/fedihired.html"><![CDATA[<p>Today I launched a new rework of
<a href="https://www.buoyantdata.com">buoyantdata.com</a> thanks to the work of a designer
I found in the fediverse! The original “design” of the site was something I had
cobbled together with a Jekyll theme I originally ported to
<a href="https://cobalt-org.github.io/">Cobalt</a>, but it was always lacking.</p>

<p>The release of <a href="/2024/11/15/deltalake-the-definitive-guide.html">Delta Lake The Definitive
Guide</a> offered motivation to
update the site to help prospective customers understand what Buoyant Data can
do for them. I asked around on <a href="https://hacky.town/@rtyler">Mastodon</a> for
recommendations for a web designer in the US who would be open to a
short-term contract to perform some renovations.</p>

<p><a href="https://bgsulz.com/">Ben Sulzinsky</a> was one of the talented folks who reached out to offer to help and we quickly turned a <em>lot</em> of ideas around.</p>

<p>I am quite pleased with Ben’s work. He did a fantastic job taking a laundry
list of both highly-specific and rather vague requirements, and turning them
into re-usable components, structure, and styles. I would certainly recommend
you work with him too!</p>

<p>Periodically I’ll see solicitations to be <code class="language-plaintext highlighter-rouge">#fedihired</code> on Mastodon; I’m happy to have been able to fedi-hire somebody! (<em>even if only for a short-term contract</em>)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="buoyantdata" /><summary type="html"><![CDATA[Today I launched a new rework of buoyantdata.com thanks to the work of a designer I found in the fediverse! The original “design” of the site was something I had cobbled together with a Jekyll theme I originally ported to Cobalt, but it was always lacking.]]></summary></entry><entry><title type="html">From the beginning, delta-rs to Delta Lake: The Definitive Guide</title><link href="https://brokenco.de//2024/11/15/deltalake-the-definitive-guide.html" rel="alternate" type="text/html" title="From the beginning, delta-rs to Delta Lake: The Definitive Guide" /><published>2024-11-15T00:00:00+00:00</published><updated>2024-11-15T00:00:00+00:00</updated><id>https://brokenco.de//2024/11/15/deltalake-the-definitive-guide</id><content type="html" xml:base="https://brokenco.de//2024/11/15/deltalake-the-definitive-guide.html"><![CDATA[<p>Nothing quite feels like “I made it!” like being <em>published</em>. Which is why I am
thrilled to share that <a href="https://bookshop.org/p/books/delta-lake-the-definitive-guide-modern-data-lakehouse-architectures-with-data-lakes-denny-lee/21429337?ean=9781098151942">Delta Lake: The Definitive
Guide</a>
is available for purchase, and I kind of helped! I wanted to share a little bit
about how my contributions (Chapter 6!) came about, because my entrance into
the <a href="https://delta.io">Delta Lake</a> ecosystem was about as unplanned as my
authorship of part of this wonderful book.</p>

<p>The <a href="https://github.com/delta-io/delta-rs">delta-rs</a> project started in 2020 and I wish that I could say it is because
I am a brilliant visionary. The project largely started because I have had a
bias against JVM-based technology stacks and I had stepped into a role at
<a href="https://tech.scribd.com">Scribd</a> where we were migrating to AWS, Databricks,
and a new architecture <em>anyways</em> so why not challenge the orthodoxy? My
colleague <a href="https://about.houqp.me/">QP Hou</a> and I were loving Rust and liked
Delta Lake from a design standpoint, but did not love <a href="https://spark.apache.org">Apache
Spark</a> for some of the things we needed to do.</p>

<p>I would consider the official start of the project to be April 11th, 2020 when
I sent our Databricks colleagues the following:</p>

<hr />

<p>Greetings! As I mentioned in our weekly sync up this week, we have an interest
in partnering with Databricks to develop and open source a native client
interface for Delta Lake.</p>

<p>For framing this conversation and the scope of the native interface, I categorize
our compute workloads into three groups:</p>

<ol>
  <li><strong>Big offline data processing</strong>, requiring a cluster of compute resources where Spark makes a big dent.</li>
  <li><strong>Lightweight/small offline data processing</strong>, workloads needing “fractional
compute” resources, basically less than a single machine. (Ruby/Python type
tasks which move data around, or perform small-scale data accesses make up
the majority of these in our current infrastructure, we’ve discussed using
the Databricks Light runtime for these in the past, since the cost to
deploy/run these small tasks on Databricks clusters doesn’t make sense).</li>
  <li><strong>Boundary data-processing</strong>, where the task might involve a little bit of
production “online” data and a little bit of warehouse “offline” data to
complete its work. In our environment we have Ruby scripts whose sole job is
to sync pre-computed (by Spark) offline data into online data stores for the
production Rails application, etc, to access and serve.</li>
</ol>

<p>I don’t want to burn down our current investment in Ruby for many of the 2nd
and 3rd workloads, not to mention retraining a number of developers in-house to
learn how to effectively use Scala or pySpark.</p>

<p>My proposal is that we partner with Databricks and jointly develop an open
source client interface for Delta Lake. One where we would have at least one
developer from Databricks working with at least one developer from Scribd on a
jointly scoped effort to deliver a library capable of <em>initially</em> addressing
our ‘2’ and ‘3’ use-cases.</p>

<p>[..]</p>

<p>Further, I propose that we jointly develop a client interface in Rust, which
will allow us to easily extend it within the Databricks community to support
Golang, Python, Ruby, and Node clients.</p>

<p>The key benefits I imagine for us all:</p>

<ul>
  <li>
    <p>Much broader market share for Delta Lake as a technology. Not only would
companies like Scribd benefit, and continue to invest in Delta Lake, but
other companies would have an easier on-ramp into the Databricks ecosystem.
Basically, if you start using Delta Lake before you use Spark, you will (I
guarantee) reach a point where these lightweight workloads become heavyweight
workloads requiring the full power and glory of the Databricks runtime :D</p>
  </li>
  <li>
    <p>It’s a fantastic developer advocacy story that hits a number of key bullet
marketing points: open source, partner collaboration, Rust (so hot right now) :)</p>
  </li>
  <li>
    <p>Scribd is able to “immediately” take advantage of Delta Lake benefits without
burning up all our existing codebase and investment in Ruby tasks and
tooling. Thereby allowing for an easier onramp into Delta Lake and the
Databricks platform as a whole.</p>
  </li>
</ul>

<p>The scope of the effort I think would be largely around properly dealing with
the transaction log, since the Apache Arrow project has already created a
pretty decent <a href="https://crates.io/crates/parquet">parquet crate</a> in Rust. That
said, there may be some writer improvements we’d want/need to push upstream to
Apache Arrow to make this successful.</p>

<hr />

<p>Looking back, almost all of this has come true! What a brilliant sage! (plz clap)</p>

<p>Like many advancements, there’s a right time, a right place, and a right group
of people. Unfortunately Databricks didn’t join the party until later on, but
they were a strong supporter of our initial work, providing guidance and helping to
make <a href="https://delta.io">Delta Lake</a> an ever-more thriving open source
community. The right people were all converging on the direction that made
this possible: <a href="https://github.com/nevi-me">Neville</a> helped make
<a href="https://github.com/apache/arrow-rs">arrow-rs</a> a much better <a href="https://parquet.apache.org">Apache
Parquet</a> writer. QP wrote the first version of the
protocol parser and created the first Python bindings for the library.
<a href="https://github.com/xianwill">Christian Williams</a> built out
<a href="https://github.com/delta-io/kafka-delta-ingest">kafka-delta-ingest</a> with
<a href="https://github.com/mosyp">Mykhailo Osypov</a> and helped prove that <strong>Rust is
way more efficient for data ingestion workloads</strong>. As time went on Will Jones,
Florian Valeye, and Robert Peck joined the party and helped turn delta-rs from
a small Scribd-motivated open source project into a thriving Rust and Python
project.</p>

<p><a href="https://bookshop.org/p/books/delta-lake-the-definitive-guide-modern-data-lakehouse-architectures-with-data-lakes-denny-lee/21429337?ean=9781098151942" target="_blank"><img src="/images/post-images/2024-deltalake/book-cover.jpg" align="right" width="200" /></a></p>

<p>Scribd had wild success with the data ingestion being in Rust, and the data
processing/query being in Spark. The community grew, Databricks grew, and at
some point some folks started working on a book.</p>

<p>As a long-time maintainer of delta-rs and talking head in the Delta and
Databricks ecosystem I was asked to be a technical reviewer of the book after
Prashanth, Scott, Tristen, and Denny had already gotten more than halfway
through the chapters.</p>

<p>I provided as much feedback as I could on their chapters. I reviewed the
outline and noticed “Chapter 8: TBD”.</p>

<p>What’s supposed to be Chapter 8? “<em>We’re not sure yet.</em>”</p>

<p>My friend <a href="https://kohsuke.org">Kohsuke</a> once marveled at how I was able to
acquire things for the <a href="https://jenkins.io">Jenkins project</a> by the simple act of
asking for them. There’s some skill involved in finding mutually beneficial
opportunities, but being uninhibited by the possibility somebody would say “no”
helps a lot.</p>

<p>“So this outline looks good, but when are you going to talk about Rust and
Python? There are dozens of us! Dozens!”</p>

<p><a href="https://dennyglee.com/">Denny</a> needed another chapter and I asked if I could
write about building native data applications in Rust and Python.</p>

<p>Suddenly I was helping to write a book.</p>

<hr />

<p><a href="https://tech.scribd.com">Scribd</a> is a fun company to work at. Books,
audiobooks, podcasts, articles. We have a deep appreciation for the written
word, telling stories, and learning. All of which I value highly. Before this
experience however I had never seen the <em>other</em> side of books. The creation,
the meetings, the rewrites, the edits, the reviews, going to press. It is
incredibly interesting and the team at O’Reilly are talented, helpful, and professional.</p>

<p>Going through copy-editing I was fielding review comments on the consistency of
tense, the subjects of sentences, and discussions about what is a proper noun and
how to consistently apply terms through <em>hundreds of pages</em> of content. I had
heard about how invaluable editors are; having now seen them in action, I am in
awe.</p>

<p>Over the years I have tried and failed to explain what I do to family members.
For people that don’t work in tech “working on the computer” all looks largely
the same, especially for older generations. Having your work, your name <em>in
print</em> has an intangible “wow” factor. More so than conference talks,
websites, GitHub stars, or branded t-shirts, a printed artifact recognizes the
accomplishments of the innumerable contributors to the Delta Lake ecosystem
over the years.</p>

<p>If you’re data inclined, I recommend picking up a copy; Prashanth, Scott,
Tristen, and Denny have written a very useful guide, and I contributed a
bit too! :)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="deltalake" /><category term="buoyantdata" /><summary type="html"><![CDATA[Nothing quite feels like “I made it!” like being published. Which is why I am thrilled to share that Delta Lake: The Definitive Guide is available for purchase, and I kind of helped! I wanted to share a little bit about how my contributions (Chapter 6!) came about, because my entrance into the Delta Lake ecosystem was about as unplanned as my authorship of part of this wonderful book.]]></summary></entry><entry><title type="html">Data and AI Summit 2024 presentations</title><link href="https://brokenco.de//2024/10/17/data-ai-summit-videos.html" rel="alternate" type="text/html" title="Data and AI Summit 2024 presentations" /><published>2024-10-17T00:00:00+00:00</published><updated>2024-10-17T00:00:00+00:00</updated><id>https://brokenco.de//2024/10/17/data-ai-summit-videos</id><content type="html" xml:base="https://brokenco.de//2024/10/17/data-ai-summit-videos.html"><![CDATA[<p>This year has been so jam packed full of activities that I forgot to share some
videos from <a href="https://www.buoyantdata.com/blog/2024-06-04-data-and-ai-summit.html">Data and AI Summit
2024</a> this
past summer! The annual conference hosted by Databricks has become one of my
favorites to meet with other <a href="https://delta.io">Delta Lake</a> users and
developers to discuss the future of large-scale data ingestion and processing. This year however, I overdid it a little bit.</p>

<p>Using the excuse of promoting my consulting/professional services company
<a href="https://buoyantdata.com">Buoyant Data</a> I had effectively <em>three</em> speaking
engagements:</p>

<ul>
  <li><strong>The road to delta-rs 1.0</strong> at the Open Source Contributor Summit (Monday)</li>
  <li><strong>Fast, cheap, and easy data ingestion with AWS Lambda and Delta Lake</strong>, a
talk highlighting a lot of the successful patterns I have developed for
customers using AWS Lambda with Delta Lake for Rust to create shockingly
cheap data ingestion pipelines. (Thursday)</li>
  <li><strong>Let’s do data engineering in Rust!</strong>, a more fun deep-dive talk to help
people start to get into the world of implementing data systems with Rust. (Thursday)</li>
</ul>

<p>Unfortunately the first talk was not recorded, but it was probably the most
interesting! On Monday morning I was riding my bike from the Ferry Building to
the venue in San Francisco and my chain snapped off while I was sprinting off
from a green light. I went down <strong>hard</strong>, scraped up my knees, and generally
looked a fool lying in the middle of Market St.</p>

<p>The show must go on, so I hobbled to the <a href="https://tech.scribd.com">Scribd</a>
office, deposited my broken bike, and continued to the Open Source Summit.</p>

<p>What I did not know at the time was that I had fractured a bone in my wrist. I
did know however that I needed to go to a clinic, but <em>really</em> wanted to attend
the summit and take advantage of the once-a-year opportunity (literally!) for
some of the brightest minds in the data community to talk about the future of
Delta Lake and more.</p>

<p>So that first talk was given with my swollen wrist pulled to my heart, like a
broken wing, and I’m <em>sure</em> it was a ludicrous sight to see!</p>

<p>By Thursday my arm had been set and was in a sling, which is far less exciting.
Nonetheless, the two talks below are perhaps the only one-handed presentations
thus far in my career! I hope you enjoy!</p>

<center>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/XPoWb9u06xA?si=SNccWEJxorszRGO1" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/Fr5Nx1wuQmQ?si=Svc3GtewzxUyGI4M" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>
</center>

<hr />

<p><em>Note</em>: The presentation software used for this talk is the open source
<a href="https://mfontanini.github.io/presenterm/introduction.html">presenterm</a> tool
which is delightful for creating development-focused presentations like this
one!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="deltalake" /><category term="buoyantdata" /><category term="presentation" /><summary type="html"><![CDATA[This year has been so jam packed full of activities that I forgot to share some videos from Data and AI Summit 2024 this past summer! The annual conference hosted by Databricks has become one of my favorites to meet with other Delta Lake users and developers to discuss the future of large-scale data ingestion and processing. This year however, I overdid it a little bit.]]></summary></entry><entry><title type="html">Improving lock performance for delta-rs</title><link href="https://brokenco.de//2023/11/29/locking-with-deltalake.html" rel="alternate" type="text/html" title="Improving lock performance for delta-rs" /><published>2023-11-29T00:00:00+00:00</published><updated>2023-11-29T00:00:00+00:00</updated><id>https://brokenco.de//2023/11/29/locking-with-deltalake</id><content type="html" xml:base="https://brokenco.de//2023/11/29/locking-with-deltalake.html"><![CDATA[<p>I have had the good fortune this year to help a number of organizations develop
and deploy native data applications in Python and Rust using a project I helped
found: <a href="https://github.com/delta-io/delta-rs">delta-rs</a>. At a high level
delta-rs is a Rust implementation of the <a href="https://github.com/delta-io/delta/blob/master/PROTOCOL.md">Delta Lake
protocol</a> which
offers ACID-like transactions for data lake use-cases. One of the big areas of
my focus has been in evaluating and improving performance in highly concurrent
runtime environments on AWS.</p>

<p>To help others understand the problem domain I spent some time earlier in the
week documenting the challenges in AWS on the Buoyant Data blog: <a href="https://www.buoyantdata.com/blog/2023-11-27-concurrency-limitations-with-deltalake-on-aws.html">Concurrency
limitations for Delta Lake on
AWS</a></p>

<blockquote>
  <p>In the case of AWS S3’s consistency model many operations are strongly
consistent, but concurrent operations on the same key are not. AWS encourages
application-level object locking, which delta-rs implements using AWS
DynamoDB.</p>
</blockquote>
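<p>For context, when using the Rust <code class="language-plaintext highlighter-rouge">deltalake</code> crate that lock is opt-in via
storage options. A minimal sketch, assuming the option names documented by
delta-rs around this time (the bucket, table path, and lock table name here are
hypothetical):</p>

<pre><code class="language-rust">use std::collections::HashMap;

#[tokio::main]
async fn main() {
    // Option names as documented by delta-rs at the time of writing;
    // double-check them against the version you are running
    let mut options = HashMap::new();
    options.insert("AWS_S3_LOCKING_PROVIDER".to_string(), "dynamodb".to_string());
    options.insert("DYNAMO_LOCK_TABLE_NAME".to_string(), "delta_rs_lock_table".to_string());

    let table = deltalake::open_table_with_storage_options("s3://my-bucket/my-table", options)
        .await
        .unwrap();
    println!("loaded table at version {}", table.version());
}
</code></pre>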

<p>AWS S3 is an incredible piece of technology that washes away a myriad of common
storage problems, and has been jokingly referred to as “the 8th wonder of the
world” by <a href="https://www.lastweekinaws.com/">Corey Quinn</a>. The lack of a
“putIfAbsent”-like semantic is however <em>very</em> annoying for the Delta Lake
protocol, adding the need for an application-wide <em>lock</em> for Delta users:</p>

<blockquote>
  <p>The dynamodb-lock approach allows for some sensible cooperation between
concurrent writers but the key limitation is that all concurrent operations
must synchronize on the table itself. There is no smaller division of
concurrency than a table operation</p>
</blockquote>

<p>In the blog post I offer some potential approaches to mitigate the weakness of
needing a table-level lock for concurrent Delta Lake writers on AWS, but the
problem will unfortunately remain, in some form or fashion, until S3
introduces a “putIfAbsent” semantic which allows writers to “put” a file only
if it doesn’t already exist, in an atomic way.</p>
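<p>The DynamoDB side of that workaround is worth seeing on its own: a conditional
put with <code class="language-plaintext highlighter-rouge">attribute_not_exists()</code> is effectively the “putIfAbsent” that S3
lacks. A sketch with the <code class="language-plaintext highlighter-rouge">aws-sdk-dynamodb</code> crate, where the lock table and
attribute names are hypothetical:</p>

<pre><code class="language-rust">use aws_sdk_dynamodb::types::AttributeValue;
use aws_sdk_dynamodb::Client;

// Try to acquire a table-level lock by creating an item only if no
// item with this key exists yet; a ConditionalCheckFailedException
// on conflict is the "somebody else holds the lock" signal
async fn try_acquire_lock(client: &amp;Client, table_uri: &amp;str) -> bool {
    client
        .put_item()
        .table_name("delta_log_lock") // hypothetical lock table name
        .item("tablePath", AttributeValue::S(table_uri.to_string()))
        .condition_expression("attribute_not_exists(tablePath)")
        .send()
        .await
        .is_ok()
}
</code></pre>

<p>Of course lock expiry and safe release add real complexity on top of this
one-shot acquisition, which is what the locking implementation used by delta-rs
layers on.</p>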

<p>For concurrent Delta writers I can offer some advice, but unfortunately
effective cooperative distributed concurrency at scale remains a challenging
problem! :)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="buoyantdata" /><category term="deltalake" /><category term="rust" /><summary type="html"><![CDATA[I have had the good fortune this year to help a number of organizations develop and deploy native data applications in Python and Rust using a project I helped found: delta-rs. At a high level delta-rs is a Rust implementation of the Delta Lake protocol which offers ACID-like transactions for data lake use-cases. One of the big areas of my focus has been in evaluating and improving performance in highly concurrent runtime environments on AWS.]]></summary></entry></feed>