<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://brokenco.de//feed/by_tag/deltalake.xml" rel="self" type="application/atom+xml" /><link href="https://brokenco.de//" rel="alternate" type="text/html" /><updated>2026-04-12T21:39:52+00:00</updated><id>https://brokenco.de//feed/by_tag/deltalake.xml</id><title type="html">rtyler</title><subtitle>a moderately technical blog</subtitle><author><name>R. Tyler Croy</name></author><entry><title type="html">Based Lake, a petabyte-scale low-latency data lake</title><link href="https://brokenco.de//2026/03/10/based-lake.html" rel="alternate" type="text/html" title="Based Lake, a petabyte-scale low-latency data lake" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://brokenco.de//2026/03/10/based-lake</id><content type="html" xml:base="https://brokenco.de//2026/03/10/based-lake.html"><![CDATA[<p>I had a chat today about building large scale low-latency data retrieval
systems around AWS S3. In doing so I got to share a bit of the talk proposal I
submitted to <a href="https://dataaisummit.com">Data and AI Summit</a> this year about
real-life work that has made it into production.</p>

<p>For years the conventional wisdom around <a href="https://delta.io">Delta Lake</a> has
been to <strong>not</strong> connect user-facing/online systems to Delta tables. Basically,
don’t point your Django app at your Delta tables. This continues to be a decent
<em>guideline</em>, but it is definitely <strong>not a rule</strong>, and I have the performance data to
back that up.</p>

<p>My talk abstract:</p>

<blockquote>
  <p>Scribd hosts hundreds of millions of documents and has hundreds of billions of
objects across our buckets. Combining large language models with massive
amounts of text has required investment in our new Content Library
architecture.  We selected Delta Lake as the underlying storage technology but
have pushed it to an extreme. Using the same Delta Lake architecture we offer
both direct data access for data scientists in Databricks Notebooks and online
data retrieval in milliseconds for user-facing web services.</p>

  <p>In this talk we will review principles of performance for each layer of the
stack: web APIs, the Delta Lake tables, Apache Parquet, and AWS S3.</p>
</blockquote>

<p>The work my colleague Eugene and I have done in this area relates heavily
to my previous research around <a href="/2025/06/24/low-latency-parquet.html">Low latency Parquet
reads</a>, which informed a project named <a href="https://tech.scribd.com/blog/2026/content-crush.html">Content
Crush</a> that I have
explored further on the Scribd tech blog and on the <a href="/2026/02/13/screaming-in-the-cloud.html">Screaming in the
Cloud</a> podcast.</p>

<p>I really hope that I am able to share results at Data and AI Summit from this
incredibly challenging work that I am undertaking. But even if I don’t, blog
posts like my musings on <a href="/2026/01/19/multimodal-delta-lake.html">Multimodal with Delta
Lake</a>, <a href="https://www.buoyantdata.com/blog/2024-12-31-high-concurrency-logstore.html">scaling streaming Delta Lake
applications</a>,
and a myriad of other articles I have published can be pieced together to form
the larger mosaic of insane large-scale data work I have been hammering on!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="arrow" /><category term="parquet" /><category term="deltalake" /><category term="databricks" /><category term="scribd" /><summary type="html"><![CDATA[I had a chat today about building large scale low-latency data retrieval systems around AWS S3. In doing so I got to share a bit of the talk proposal I submitted to Data and AI Summit this year about real-live work that has made it into production.]]></summary></entry><entry><title type="html">Multimodal with Delta Lake</title><link href="https://brokenco.de//2026/01/19/multimodal-delta-lake.html" rel="alternate" type="text/html" title="Multimodal with Delta Lake" /><published>2026-01-19T00:00:00+00:00</published><updated>2026-01-19T00:00:00+00:00</updated><id>https://brokenco.de//2026/01/19/multimodal-delta-lake</id><content type="html" xml:base="https://brokenco.de//2026/01/19/multimodal-delta-lake.html"><![CDATA[<p>The rate of change for data storage systems has accelerated to a frenzied pace
and most storage architectures I have seen simply cannot keep up. Much of my
time is spent thinking about large-scale tabular data stored in <a href="https://delta.io">Delta
Lake</a> which is one of the “lakehouse” storage systems along
with <a href="https://iceberg.apache.org">Apache Iceberg</a> and others. These storage
architectures were developed 5-10 years ago to solve the problems many
organizations faced when moving from data warehouses to massive-scale
structured data. The storage changes we need today must support
“multimodal data” which is a dramatic departure in many ways from the
traditional query and usage patterns our existing infrastructure supports.</p>

<blockquote>
  <p>Multimodal learning is a type of deep learning that integrates and processes
multiple types of data, referred to as modalities, such as text, audio, images,
or video. This integration allows for a more holistic understanding of complex
data, improving model performance in tasks like visual question answering,
cross-modal retrieval, text-to-image generation, aesthetic ranking,
and image captioning.</p>

  <p><a href="https://en.wikipedia.org/wiki/Multimodal_learning">From Wikipedia</a></p>
</blockquote>

<p>Honestly, I have been working on this problem for longer than I knew that it
had a name!</p>

<p>Working on <a href="https://tech.scribd.com/blog/2026/content-crush.html">Content
Crush</a> at Scribd I have
had to negotiate an ever-present challenge: how do we make multimodal data
work seamlessly with our classic tabular datasets?</p>

<p>A couple of the ideas that I have been thinking about revolve around one
principle: <strong>re-encoding of existing data is unacceptable.</strong> In the past I have
considered simply encoding binary data such as that from images or PDFs into
<a href="https://parquet.apache.org">Apache Parquet</a>. This approach suffers from a couple major flaws:</p>

<ul>
  <li>Re-encoding requires substantial computation for any non-trivial set of images, PDFs, video, etc.</li>
  <li>Redundant object storage: even with compression, it is unlikely that any
organization with terabytes or petabytes of image data will want to
store a secondary copy of it for its multimodal needs.</li>
  <li>Embedding a 1MB PDF file inside of a Parquet file is <em>not silly</em> but
embedding a 10GB video file inside of a Parquet file is <em>very silly</em>. Any
approach taken should scale in a reasonable fashion for data in the gigabyte
to terabyte range.</li>
</ul>

<p>A secondary objective in my thinking has been to avoid needing substantial
client changes for working with multimodal data. I recently watched <a href="https://www.youtube.com/watch?v=YmY_NwaoxNk">a talk by
Ryan Johnson</a> about adding
transactional semantics to Delta Lake and one of the big takeaways that I
heard from him was about the troublesome nature of ensuring <em>all actors</em> in the
system cooperated with the transaction semantics. In a modern data environment
that could be <em>dozens</em> of different off-the-shelf libraries, Databricks
notebooks, AWS SageMaker transforms, and so on. The less “exposure” to the
client layer the better.</p>

<h2 id="parquet-anchors">Parquet Anchors</h2>

<p>The first idea that I had was “Parquet Anchors” which would be built on <a href="https://parquet.apache.org/docs/file-format/binaryprotocolextensions/">Binary
Protocol
Extensions</a>
in Apache Parquet. In most cases the rich text/image/video data is already
stored in object storage such as AWS S3 and a URL should be sufficient to
retrieve that data.</p>

<p>The extension of the binary protocol, as I understand it, would allow custom
information to be encoded in the Parquet files that are being written as part
of an existing Delta table. The specific mechanism of encoding this data is
somewhat irrelevant so long as it can carry:</p>

<ul>
  <li>Artifact name (e.g. <code class="language-plaintext highlighter-rouge">some.pdf</code>)</li>
  <li>Artifact URL (<code class="language-plaintext highlighter-rouge">s3://bucket/prefix/of/keys/some-10x9u09123.pdf</code>)</li>
  <li>Artifact length (number of bytes)</li>
  <li>Artifact content type (e.g. <code class="language-plaintext highlighter-rouge">application/pdf</code>)</li>
  <li>Checksum</li>
  <li>Checksum Algorithm</li>
</ul>
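<p>As a sketch, such an anchor could be modeled as a small record whose checksum is computed when the artifact is registered. Everything below (the <code>ParquetAnchor</code> name, the field layout, the choice of SHA-256) is my own illustration, not anything defined by the Parquet specification:</p>

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class ParquetAnchor:
    """Hypothetical anchor record pointing at an external artifact."""
    name: str          # e.g. "some.pdf"
    url: str           # e.g. "s3://bucket/prefix/of/keys/some-10x9u09123.pdf"
    length: int        # artifact size in bytes
    content_type: str  # e.g. "application/pdf"
    checksum: str
    checksum_algo: str

def anchor_for(name: str, url: str, content_type: str, data: bytes) -> ParquetAnchor:
    """Build an anchor for raw artifact bytes, checksumming with SHA-256."""
    return ParquetAnchor(
        name=name,
        url=url,
        length=len(data),
        content_type=content_type,
        checksum=hashlib.sha256(data).hexdigest(),
        checksum_algo="sha256",
    )

anchor = anchor_for("some.pdf", "s3://bucket/some.pdf", "application/pdf", b"%PDF-1.7 ...")
```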

<h3 id="pros">Pros</h3>
<p>The most obvious benefit of going down this route is the ease with which one
could update existing data files, <em>and</em> this note from the Binary Protocol
Extensions document:</p>

<blockquote>
  <p><em>Existing readers will ignore the extension bytes with little processing overhead</em></p>
</blockquote>

<p>Logically, Parquet Anchors could be quite simple to implement, and <em>most</em>
users of a Delta table with Parquet Anchors would never know they were there.</p>

<h3 id="cons">Cons</h3>

<p>The natural downside of this feature being hidden from existing readers is that
clients must be updated in order to read the extension data properly. For
something like processing multimodal data where a row of content metadata
might refer to <code class="language-plaintext highlighter-rouge">some.pdf</code> this would mean the reader would have to have some
indication that it must:</p>

<ol>
  <li>Read the extended binary information</li>
  <li><em>Then</em> fetch the necessary artifacts</li>
</ol>

<p>There is another downside to this approach in that a table would need to be
“rewritten” but only <em>partially</em>. If a Parquet file added to the Delta table
references 1000 artifacts, then that <code class="language-plaintext highlighter-rouge">.parquet</code> file would need to be rewritten
to include the Parquet Anchors for those 1000 artifacts alongside that file’s
<code class="language-plaintext highlighter-rouge">add</code> action. In essence I think this approach would require a full-table
rewrite where each <code class="language-plaintext highlighter-rouge">.parquet</code> in the transaction log would be retrieved,
processed, and rewritten with the appropriate Anchors.</p>

<p>Considering ways to address the shortcomings of Parquet Anchors I came up with
my next concept.</p>

<h2 id="virtual-delta-tables-vdt">Virtual Delta Tables (vdt)</h2>

<p>The notion of Parquet Anchors is, I think, worth holding onto: hyperlinks to
existing artifacts are a key part of the multimodal data storage solution, but
perhaps not as a direct encoding into the Parquet data files. Considering the
shortcomings led me to think of how to present a virtual Delta table “view” to
existing clients while hiding the disparate nature of the data behind the
scenes.</p>

<p>One underutilized feature of the Delta Lake protocol is the use of URLs in the
<code class="language-plaintext highlighter-rouge">add</code> actions which enables functionality like <a href="https://delta.io/blog/delta-lake-clone/">shallow
clones</a>. I have long thought of this
as a super power that should really be used more.</p>

<h3 id="vdt0-just-the-artifacts">vdt0: just the artifacts</h3>

<p>The magic of the URL support in the Delta protocol is that the URLs don’t even
have to point to object storage. Nothing about the protocol dictates that the
URLs must point to <code class="language-plaintext highlighter-rouge">s3://</code> or <code class="language-plaintext highlighter-rouge">abfss://</code> URLs; you can just point to <code class="language-plaintext highlighter-rouge">https://</code>
URLs. AWS S3 supports <code class="language-plaintext highlighter-rouge">https://</code> URLs, but so does <em>every other web service</em>.</p>

<p>Imagine a storage architecture which already contains heaps of <code class="language-plaintext highlighter-rouge">.pdf</code>
artifacts. A <code class="language-plaintext highlighter-rouge">vdt</code> web service could provide a read-only URL structure which
maps the existing object storage structure into a Delta Lake URL scheme.</p>

<p>A virtual table with just those PDF artifacts could be configured at
<code class="language-plaintext highlighter-rouge">https://vdt.aws/v1/&lt;catalog&gt;/&lt;schema&gt;/&lt;table&gt;</code>. Using tooling like
<a href="https://github.com/s3s-project/s3s">s3s</a> <code class="language-plaintext highlighter-rouge">vdt</code> can provide S3-like operations
off of this virtual URL, exposing a virtualized JSON transaction log or
checkpoints for the Delta client.</p>
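<p>To make that concrete, here is a rough sketch of how such a service might recognize a transaction-log request against the virtual URL scheme. The route shape and field names are hypothetical; a real implementation built on s3s would hook into its request handling instead:</p>

```python
import re
from typing import Optional

# Hypothetical route for a vdt service exposing a virtual Delta table:
#   /v1/<catalog>/<schema>/<table>/_delta_log/<20-digit version>.json
ROUTE = re.compile(
    r"^/v1/(?P<catalog>[^/]+)/(?P<schema>[^/]+)/(?P<table>[^/]+)"
    r"/_delta_log/(?P<version>\d{20})\.json$"
)

def parse_log_request(path: str) -> Optional[dict]:
    """Map an S3-style GET path onto the (table, version) the service must synthesize."""
    m = ROUTE.match(path)
    if m is None:
        return None
    parts = m.groupdict()
    parts["version"] = int(parts["version"])
    return parts

req = parse_log_request("/v1/main/docs/pdfs/_delta_log/00000000000000000000.json")
```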

<p>Imagine the schema of such a virtual table for PDF artifacts:</p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>filename</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>content_type</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>url</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>filesize</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>data</td>
      <td><code class="language-plaintext highlighter-rouge">binary</code></td>
    </tr>
    <tr>
      <td>checksum</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>checksum_algo</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
  </tbody>
</table>

<p>The virtualized transaction log is where the real fun can begin. If information
about the artifacts can be sourced from an existing database, then the
virtualized transaction log could contain numerous <em>imagined</em> parquet files as
the <code class="language-plaintext highlighter-rouge">add</code> actions:</p>

<pre><code class="language-JSON">{
  "add": {
    "path": "datafiles/some-guid.parquet",
    "size": 841454,
    "modificationTime": 1512909768000,
    "dataChange": true,
    "stats": "{\"numRecords\":1,\"minValues\":{\"val..."
  }
}
</code></pre>

<p>The special path for the <code class="language-plaintext highlighter-rouge">some-guid.parquet</code> would perform <strong>on-demand</strong>
parquet encoding for the underlying artifacts.  The most primitive
implementation could simply represent <em>each</em> PDF file as a <code class="language-plaintext highlighter-rouge">.parquet</code> file with
an <code class="language-plaintext highlighter-rouge">add</code> action. So long as the <code class="language-plaintext highlighter-rouge">add</code> action conveyed the necessary file
statistics to allow the consuming engine to filter out files which are not
necessary, this could be a seamless way to expose structured PDF data to the
consumer. The <code class="language-plaintext highlighter-rouge">path</code> in the action could <em>also</em> refer to an already cached
version of the encoded file in S3 using the existing URL support in the
protocol, in this way clients could progressively cache as need be on the
server-side.</p>
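<p>A sketch of how the service might fabricate one such <code>add</code> action per artifact, using metadata it already has on hand. With one row per file, the min and max statistics for <code>filename</code> collapse to the same value, giving the engine exact information for file skipping. The artifact fields and the <code>datafiles/</code> path convention are illustrative assumptions:</p>

```python
import json

def virtual_add_action(artifact: dict) -> dict:
    """Fabricate an `add` action for an artifact with no real Parquet file behind it.

    The `path` points at a location the vdt service resolves by encoding the
    artifact to Parquet on demand; the `artifact` keys here are hypothetical.
    """
    stats = {
        "numRecords": 1,
        # One row per file means min == max, so the engine can filter exactly.
        "minValues": {"filename": artifact["filename"]},
        "maxValues": {"filename": artifact["filename"]},
        "nullCount": {"filename": 0},
    }
    return {
        "add": {
            "path": f"datafiles/{artifact['guid']}.parquet",
            "size": artifact["filesize"],
            "modificationTime": artifact["modified_ms"],
            "dataChange": True,
            "stats": json.dumps(stats),
        }
    }

action = virtual_add_action(
    {"guid": "some-guid", "filename": "some.pdf",
     "filesize": 841454, "modified_ms": 1512909768000}
)
```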

<hr />

<p><strong>Brief aside</strong>: I have never fully understood why <a href="https://delta.io/sharing/">Delta
sharing</a> exists as a separate entity. In my opinion
the Delta Lake protocol coupled with a clever server-side backend could provide
identical functionality for all existing Delta implementations.</p>

<hr />

<p>Assuming the <code class="language-plaintext highlighter-rouge">vdt</code> service supports the schema defined above and can properly
retrieve the PDF artifacts and encode them as Parquet data on the fly, a query
such as <code class="language-plaintext highlighter-rouge">SELECT filename, raw FROM vdt WHERE filename = $?</code> becomes possible.</p>

<h3 id="pros-1">Pros</h3>

<p>Breaking the pretense of “objects must actually exist” with Delta Lake is very
liberating. Encoding artifacts as Apache Parquet on demand means all
client-side libraries should be able to work seamlessly within their existing
environments.</p>

<p>When I think about potential approaches for implementing <code class="language-plaintext highlighter-rouge">vdt0</code> I can also
imagine many different potential avenues for optimization.</p>

<h3 id="cons-1">Cons</h3>

<p>While I really do like this idea, I’m not sure <em>how much</em> I should like it
considering the potential downsides:</p>

<ul>
  <li>Requires some existing structure behind the scenes to build up a sensible
virtual Delta log. For situations where artifacts are simply in a dumb bucket
somewhere, with no metadata already stored in a relational database,
producing a virtual transaction log would be quite difficult.</li>
  <li>I cannot imagine a sensible path for <strong>write</strong> workloads with <code class="language-plaintext highlighter-rouge">vdt0</code>.</li>
  <li>Without having implemented this (yet!) it is unclear how much compute time would be expended on uncached parquet file encoding.</li>
  <li>Most data scientists want the PDF/image/etc but they don’t <em>typically</em> want
the raw bytes that they then have to parse through.</li>
</ul>

<hr />

<h2 id="uh-what-if-you-just-dont-use-delta-lake">Uh, what if you just don’t use Delta Lake?</h2>

<p>Hey good question. Great interlude opportunity!</p>

<p>As a seller of fine hammers and hammer accessories, everything does in fact
look like a nail.</p>

<p>Delta Lake is kind of a means to an end for me here. I think its protocol has
enough maturity in terms of features and client capabilities to provide
<em>almost</em> everything I need from a multimodal storage system. I just can’t/don’t
want to shove everything into a Delta table per se.</p>

<hr />

<h2 id="vdt1-adding-virtual-legs">vdt1: adding virtual legs</h2>

<p>Since I have already indulged in the heretical idea of “what if we just make
the files up” I went a level further to consider <em>what if we got even more
virtualized</em>. One key characteristic I dislike about the <code class="language-plaintext highlighter-rouge">vdt0</code> approach is that
it is <em>too simple</em>, believe it or not.</p>

<p>When I think about artifacts like PDFs, they have far more structure than just
bytes. There are pages, typically sections, text, images, titles, footnotes,
and so on. For most machine learning use-cases the data scientist may be
interested in raw bytes for some projects but much more often they are
interested in the <em>parsed</em> and <em>structured</em> data of the artifact.</p>

<p>While my expertise is largely around text-based storage and processing, I would
imagine image/audio/video artifacts also have similar structure of interest to
data scientists.</p>

<p>Indulging in even more virtual thinking, I started to consider collections of
data all associated with an artifact. There’s the raw data schema above, but for PDFs I can also envision:</p>

<p><strong>Paragraphs</strong></p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>page</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>offset</td>
      <td><code class="language-plaintext highlighter-rouge">integer</code></td>
    </tr>
    <tr>
      <td>text</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>is_heading</td>
      <td><code class="language-plaintext highlighter-rouge">bool</code></td>
    </tr>
    <tr>
      <td>heading_level</td>
      <td><code class="language-plaintext highlighter-rouge">integer</code></td>
    </tr>
  </tbody>
</table>

<p><strong>Images</strong></p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>content_type</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>page</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>data</td>
      <td><code class="language-plaintext highlighter-rouge">binary</code></td>
    </tr>
    <tr>
      <td>bounds_x</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>bounds_y</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
  </tbody>
</table>

<p><strong>Links</strong></p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>page</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>href</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>label</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
  </tbody>
</table>

<p>Taken all together this represents only <em>around 20 columns</em> of data but could
represent <strong>most</strong> of the information needed for most multimodal workloads. I
mention the low column count because I have seen bug reports from Delta Lake
users talking about issues with tables containing <em>thousands of columns</em>.</p>

<p>A virtualized table schema could take these interior schemas and join them
together such that a single row might have: <code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">raw_filename</code>,
<code class="language-plaintext highlighter-rouge">raw_content_type</code>, <code class="language-plaintext highlighter-rouge">raw_url</code>, <code class="language-plaintext highlighter-rouge">raw_filesize</code>, <code class="language-plaintext highlighter-rouge">raw_data</code>, <code class="language-plaintext highlighter-rouge">raw_checksum</code>,
<code class="language-plaintext highlighter-rouge">raw_checksum_algo</code>, <code class="language-plaintext highlighter-rouge">paragraph_page</code>, <code class="language-plaintext highlighter-rouge">paragraph_text</code>, <code class="language-plaintext highlighter-rouge">paragraph_offset</code>,
<code class="language-plaintext highlighter-rouge">paragraph_is_heading</code>, <code class="language-plaintext highlighter-rouge">paragraph_heading_level</code>, <code class="language-plaintext highlighter-rouge">image_content_type</code>,
<code class="language-plaintext highlighter-rouge">image_page</code>, <code class="language-plaintext highlighter-rouge">image_data</code>, <code class="language-plaintext highlighter-rouge">image_bounds_x</code>, <code class="language-plaintext highlighter-rouge">image_bounds_y</code>, <code class="language-plaintext highlighter-rouge">link_page</code>,
<code class="language-plaintext highlighter-rouge">link_href</code>, <code class="language-plaintext highlighter-rouge">link_label</code>.</p>
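<p>Assembling that wide schema mechanically from the per-modality schemas is straightforward; the sketch below simply prefixes each leg’s columns, mirroring the naming above (the column lists are transcribed from the tables in this post):</p>

```python
# Sub-schemas from the tables above; every leg shares the artifact `id`.
RAW = ["filename", "content_type", "url", "filesize", "data", "checksum", "checksum_algo"]
PARAGRAPH = ["page", "text", "offset", "is_heading", "heading_level"]
IMAGE = ["content_type", "page", "data", "bounds_x", "bounds_y"]
LINK = ["page", "href", "label"]

def union_schema() -> list:
    """Join the legs into one wide schema: `id` plus prefixed, nullable columns."""
    columns = ["id"]
    for prefix, cols in [("raw", RAW), ("paragraph", PARAGRAPH),
                         ("image", IMAGE), ("link", LINK)]:
        columns.extend(f"{prefix}_{c}" for c in cols)
    return columns

schema = union_schema()
```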

<p>So long as the schema allows nullable columns for everything but <code class="language-plaintext highlighter-rouge">id</code>, the
<code class="language-plaintext highlighter-rouge">vdt</code> service can expose the disjointed data behind the scenes in a sensible
way with the <code class="language-plaintext highlighter-rouge">add</code> actions on the virtual Delta table and its file statistics.
For example, an <code class="language-plaintext highlighter-rouge">add</code> action which includes <code class="language-plaintext highlighter-rouge">link</code> data would report all other
columns as entirely null in the file statistics’ <code class="language-plaintext highlighter-rouge">nullCount</code> so that any engine
querying for <code class="language-plaintext highlighter-rouge">raw</code> columns would ignore that file entirely.</p>
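<p>The skipping logic an engine applies is simple enough to sketch: if the statistics show that every column a query needs is entirely null in a file, that file contributes nothing and can be ignored. (Delta file statistics record per-column null counts under <code>nullCount</code>; the action shape below is otherwise illustrative.)</p>

```python
import json

def can_skip(add_action: dict, wanted_columns: list) -> bool:
    """True when stats prove every wanted column is entirely null in this file."""
    stats = json.loads(add_action["add"]["stats"])
    rows = stats["numRecords"]
    null_count = stats.get("nullCount", {})
    # Skip only when *every* requested column is null for *every* row.
    return all(null_count.get(col, 0) == rows for col in wanted_columns)

# A virtual file carrying only link data: the raw_* columns are all null.
link_file = {
    "add": {
        "path": "datafiles/links-0001.parquet",
        "stats": json.dumps({
            "numRecords": 3,
            "nullCount": {"raw_data": 3, "raw_filename": 3, "link_href": 0},
        }),
    }
}

skip_for_raw = can_skip(link_file, ["raw_data", "raw_filename"])
skip_for_links = can_skip(link_file, ["link_href"])
```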

<h3 id="pros-2">Pros</h3>

<p>I think this structure would be possible to build in a traditional Delta Lake
system assuming one wished to re-encode data into new storage. Hiding existing
data behind a virtualized Delta table allows us to avoid data denormalization.</p>

<p>Similar to <code class="language-plaintext highlighter-rouge">vdt0</code> there are optimization and caching approaches that are
readily available with <code class="language-plaintext highlighter-rouge">vdt1</code> but unlike <code class="language-plaintext highlighter-rouge">vdt0</code> the “write path” is more
apparent to me with this approach. By hiding metadata about an artifact inside
the virtualized data structure, writes which add rows with those columns could
sensibly be accepted and inserted into an internal Delta or other table.</p>

<p>Depending on how metadata associated with an artifact is stored, the <code class="language-plaintext highlighter-rouge">vdt</code>
service could simply front a number of other conventional Delta tables and act
as a proxy, pushing predicates and I/O filtering “to the edge” as far as they
will go before collecting results for the query engine.</p>

<h3 id="cons-2">Cons</h3>

<p>This approach is certainly the most complex but could potentially require the least amount of re-encoding of existing data assets. The devil is in the details with how one might map existing data sources together. My sketch above places a tremendous amount of emphasis on an <code class="language-plaintext highlighter-rouge">id</code> which acts as a primary key between all the metadata associated with a singular artifact.</p>

<p>Nothing defined thus far accounts for potential changes in an artifact or its
metadata as time goes on. If a new version of an existing document is uploaded,
the new version should likely be considered “canonical” but be <em>appended</em>
rather than <em>merged</em> with existing records. How one might sensibly model that
in a system like Delta which doesn’t support referential integrity between
datasets leads me back to the “anchors” idea from before.  That said, I’m not
sure if that’s much ado about nothing.</p>

<hr />

<p>From a data storage standpoint one key aspect of multimodal data is that the
different modalities are presented to the end user or system <strong>together</strong>. What
I like about the virtual Delta tables concept is that it doesn’t require
substantial client changes to accomplish but <em>does</em> provide a path to present
various types of data <em>together</em> for a given artifact.</p>

<p>I have various bits and pieces of a potential <code class="language-plaintext highlighter-rouge">vdt</code> system lying around the
workshop floor. If the idea has legs I might take a crack at a prototype
implementation, but first I will need some feedback!</p>

<p>Let me know what you think by emailing me at <code class="language-plaintext highlighter-rouge">rtyler@</code> this domain!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="parquet" /><category term="deltalake" /><category term="ml" /><summary type="html"><![CDATA[The rate of change for data storage systems has accelerated to a frenzied pace and most storage architectures I have seen simply cannot keep up. Much of my time is spent thinking about large-scale tabular data stored in Delta Lake which is one of the “lakehouse” storage systems along with Apache Iceberg and others. These storage architectures were developed 5-10 years ago to solve problems faced moving from data warehouse architectures to massive scale structured data needs faced by many organizations. The storage changes we need today must support “multimodal data” which is a dramatic departure in many ways from the traditional query and usage patterns our existing infrastructure supports.]]></summary></entry><entry><title type="html">The challenges facing Delta Kernel</title><link href="https://brokenco.de//2026/01/12/delta-kernel-challenges.html" rel="alternate" type="text/html" title="The challenges facing Delta Kernel" /><published>2026-01-12T00:00:00+00:00</published><updated>2026-01-12T00:00:00+00:00</updated><id>https://brokenco.de//2026/01/12/delta-kernel-challenges</id><content type="html" xml:base="https://brokenco.de//2026/01/12/delta-kernel-challenges.html"><![CDATA[<p>The Delta Kernel is one of the most technically challenging and ambitious open
source projects I have worked on. Kernel is fundamentally about unifying <em>all</em>
of our needs and wants from a <a href="https://delta.io">Delta Lake</a> implementation
into a single cohesive yet pluggable API surface. Towards the end of 2025
<a href="https://github.com/tdas">TD</a> asked me to jot down some of the issues which
have been frustrating me and/or slowing down the adoption of kernel in projects
like <a href="https://github.com/delta-io/delta-rs">delta-rs</a>. At the outset of the
project we all discussed concerns about what could <em>actually be possible</em> as we
set out into uncharted territory. In many ways we have succeeded, in others we
have failed.</p>

<p>Reviewing the history, I was the second developer, behind
<a href="https://github.com/zachschuermann">Zach</a>, to commit code to the project.
Like all open source projects, Delta Kernel is the work of numerous people who
have all poured their time into making something happen <em>together</em>. I regularly
work with Robert, Zach, Nick, Ryan, and Steve to make delta-rs and
delta-kernel-rs <strong>better</strong>.</p>

<p>While we all have our personal motivations, we also have direction guided by our
employers in some cases. That means the goals for kernel from Databricks may
not align with my employer (<a href="https://tech.scribd.com">Scribd</a>), or others
participating in the project. This complicates trade-off decisions in many open
source projects where personal, professional, and hobby motivations intersect.</p>

<p>My hope is to characterize the weaknesses in kernel so that we can collectively
adjust in 2026 to make improvements in both the technical design of kernel, but
also the <em>community</em> and culture around kernel.</p>

<h2 id="design">Design</h2>

<p>From my perspective the original design trade-offs made in kernel were largely
driven by two key factors:</p>

<ol>
  <li><strong>Portability with non-Rust engines</strong>: this dictated the need for an
<a href="https://en.wikipedia.org/wiki/Foreign_function_interface">FFI</a> abstraction
on day zero. The <a href="https://duckdb.org/docs/stable/core_extensions/delta">Delta extension for
DuckDB</a> had an
outsized influence on this due ostensibly to a desire from Databricks to
make DuckDB and Delta be best friendsies.</li>
  <li><strong>The Java kernel</strong>: the Delta kernel is actually <em>two</em> implementations, one
in Java for unifying JVM-based connectors, and one in Rust for basically
everybody else. Due to the number of folks involved in the Java kernel, the
Rust implementation was <em>strongly</em> encouraged to take design cues from the
Java design.</li>
</ol>

<p>More than anything these two factors have contributed to a number of what I
would consider original load-bearing sins of design for delta-kernel-rs.</p>

<blockquote>
  <p>These trade-offs resulted in a Rust-based project which <strong>abandons most of
the important benefits for using Rust</strong>.</p>
</blockquote>

<h3 id="building-for-the-lowest-common-denominator">Building for the lowest common Denominator</h3>

<p>Supporting cross-language and runtime interoperability is <strong>brutal</strong>. I have
done a lot of cross-language support for Ruby and Python projects in the past,
where at some point <em>somewhere</em> there’s a pointer being passed from one world
into another. It is objectively <strong>awful</strong>.</p>

<p>Over the years of delta-rs people have tried adding FFI hooks into it, despite
us never making <em>any</em> accommodations for it. Seriously, as recently as <a href="https://github.com/delta-io/delta-rs/issues/3973">this
month</a> somebody popped up
with yet-another set of Golang FFI bindings on top of delta-rs.</p>

<h4 id="ffi-is-hell">FFI is hell.</h4>

<p>A hell that we <em>intentionally marched into</em> with Delta kernel. For
the uninitiated, FFI is basically a convention for allowing multiple languages to
meet at a C <a href="https://en.wikipedia.org/wiki/Application_binary_interface">ABI
layer</a> and pass
pointers back and forth. There is more to it around memory layout and other
silliness, but basically, it’s a way for everybody to dumb themselves down to a
C-style interface.</p>

<p>FFI is also stupid, but it is basically how all higher-level languages
such as Python, Ruby, JavaScript, Golang, Rust, etc. work. Somewhere down there
in the stack is a pointer passing into C-based system calls on your machine.
There be monsters.</p>

<p>One of our early design disagreements made to accommodate FFI-based engines was
the adoption of <code class="language-plaintext highlighter-rouge">Iterator</code> based interfaces rather than <code class="language-plaintext highlighter-rouge">Future</code> based
interfaces. Previously I <a href="/2025/12/16/parallelism-is-tricky.html">wrote about our parallelism
challenges</a> which stem from this design
trade-off.</p>

<p>The debate was whether to embed an evented reactor like
<a href="https://tokio.rs">Tokio</a> <em>inside</em> kernel and hide it from the FFI caller, or
make the caller responsible for making things event-driven. The early
influence of DuckDB weighed on the scales here, and the decision was made to
avoid embedding Tokio inside kernel.</p>

<p>In the Rust ecosystem it has taken a <em>long time</em> for us to <a href="https://areweasyncyet.rs/">become
async</a>. If you were curious why there has been such
an explosion of Rust across the systems programming ecosystem in the last five
years it’s because <strong>the Rust ecosystem is async</strong>.</p>

<p>The <em>first</em> Rust application I deployed into production used <code class="language-plaintext highlighter-rouge">async/await</code> from
the beginning, and without <em>any profiling</em> was an order of magnitude faster
than the system it replaced.</p>

<p><code class="language-plaintext highlighter-rouge">async/await</code> is the reason delta-rs was even successful in the first place!</p>

<p>There are ways to hack around the limitations of the <code class="language-plaintext highlighter-rouge">Iterator</code>
based API in Delta kernel, but the hill is <em>very</em> steep and will require
significant investment to make some parts of Delta kernel as fast as parallel
reads/scans would otherwise be.</p>
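<p>One shape of such a workaround, sketched here with std threads and a bounded
channel (hypothetical names, not kernel’s actual code), is to eagerly drive a
blocking <code class="language-plaintext highlighter-rouge">Iterator</code> on a background thread so that producing the next batch
overlaps with the consumer’s own processing:</p>

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Drive a blocking iterator eagerly on a background thread,
/// overlapping production with the consumer's own work.
/// `bound` limits how far ahead the producer may run.
fn prefetch<I>(iter: I, bound: usize) -> Receiver<I::Item>
where
    I: Iterator + Send + 'static,
    I::Item: Send + 'static,
{
    let (tx, rx) = sync_channel(bound);
    thread::spawn(move || {
        for item in iter {
            // Blocks once `bound` items are queued: backpressure.
            if tx.send(item).is_err() {
                break; // consumer hung up
            }
        }
    });
    rx
}

fn main() {
    // Simulate slow batch production, e.g. reading log segments.
    let slow = (0..4).map(|i| {
        thread::sleep(std::time::Duration::from_millis(10));
        i * 10
    });
    let batches: Vec<i32> = prefetch(slow, 2).into_iter().collect();
    assert_eq!(batches, vec![0, 10, 20, 30]);
}
```

<p>This keeps the public API an <code class="language-plaintext highlighter-rouge">Iterator</code>, but it only buys pipelining, not the
fan-out parallelism an async reactor would offer.</p>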

<p><code class="language-plaintext highlighter-rouge">async/await</code> gives incredible performance for free, but Delta kernel’s design choices mean it cannot take advantage and must pay the price.</p>

<h3 id="enginedata"><code class="language-plaintext highlighter-rouge">EngineData</code></h3>

<p>I am not smart enough to work on some parts of Delta kernel because of the
cleverness that is <code class="language-plaintext highlighter-rouge">EngineData</code>. Similar to
<a href="https://github.com/apache/arrow-rs">arrow-rs</a> and its <code class="language-plaintext highlighter-rouge">RecordBatch</code> and
<code class="language-plaintext highlighter-rouge">ArrayData</code> implementations, <code class="language-plaintext highlighter-rouge">EngineData</code> is an opaque type-erased container
for <em>stuff</em> and <em>things</em>.</p>

<p>One of the reasons I struggled to learn Rust, but ultimately came to love
the language, is its strong type system, which helps prevent whole classes of
problems. The strong type system also makes it a lot simpler for me to reason
about the code when I am working with it.</p>

<p>Everything in Delta kernel is
<a href="https://docs.rs/delta_kernel/latest/delta_kernel/engine_data/trait.EngineData.html">EngineData</a>
in one form or another. I was pretty preoccupied when this interface was
originally being hammered out so I’m less familiar with the history of
decisions that went into it, but I find the API of <code class="language-plaintext highlighter-rouge">EngineData</code> and its
counterparts of
<a href="https://docs.rs/delta_kernel/latest/delta_kernel/engine_data/trait.RowVisitor.html">RowVisitor</a>,
<a href="https://docs.rs/delta_kernel/latest/delta_kernel/engine_data/trait.GetData.html">GetData</a>,
and
<a href="https://docs.rs/delta_kernel/latest/delta_kernel/engine_data/trait.TypedGetData.html">TypedGetData</a>
to be <em>very</em> unpleasant to work with.</p>

<p>I <em>also</em> find
<a href="https://docs.rs/arrow/latest/arrow/array/struct.RecordBatch.html">RecordBatch</a>
unpleasant to work with. I really struggle to think of more user-unfriendly
APIs in the Rust data ecosystem. In the case of arrow’s <code class="language-plaintext highlighter-rouge">RecordBatch</code> I have
watched some of my colleagues pull in the <em>entire</em>
<a href="https://crates.io/crates/datafusion">datafusion</a> dependency just so they can
work with <code class="language-plaintext highlighter-rouge">RecordBatch</code> without resorting to the array offset and indices
silliness that permeates Apache Arrow code.</p>

<p>As unpleasant as I find <code class="language-plaintext highlighter-rouge">RecordBatch</code> there are <em>thousands</em> of developers
invested in its APIs and supporting infrastructure. <code class="language-plaintext highlighter-rouge">EngineData</code> does not have
a similar level of tooling, but shares some of the same razor-sharp edges.</p>

<p>The <code class="language-plaintext highlighter-rouge">EngineData</code> design has resulted in a <em>lot</em> of brittle <a href="https://github.com/delta-io/delta-kernel-rs/blob/e019ac3fa18707b633f625418d661ed198c86759/kernel/src/actions/visitors.rs#L114-L120">fixed array
offsets</a>
being littered throughout the Delta kernel codebase. These “getters” and the
visitors APIs result in the Rust type checker being <em>far</em> less useful with
Delta kernel than a more conventionally structured Rust project. This also
results in a much larger likelihood of runtime errors being emitted for
problems rather than compile-time checks.</p>

<p>The type-erased opaque bucket of bytes design of <code class="language-plaintext highlighter-rouge">EngineData</code> means that
working inside of <em>or with</em> Delta kernel sacrifices one of the most important
characteristics of the Rust language: the type checker.</p>
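<p>To illustrate the trade-off (a heavily simplified, hypothetical sketch, not
kernel’s real <code class="language-plaintext highlighter-rouge">EngineData</code> API), compare an offset-addressed, type-erased getter
with a plain typed struct: the former turns mistakes into silent runtime failures,
the latter into compile errors:</p>

```rust
use std::any::Any;

// Hypothetical stand-in for a type-erased container: values are
// addressed by fixed offset and typed only at the moment of access.
struct ErasedRow {
    fields: Vec<Box<dyn Any>>,
}

impl ErasedRow {
    /// A `TypedGetData`-style getter: offset and expected type are
    /// only checked at runtime.
    fn get<T: 'static>(&self, offset: usize) -> Option<&T> {
        self.fields.get(offset)?.downcast_ref::<T>()
    }
}

// The conventional alternative: the compiler checks every access.
struct AddRow {
    path: String,
    size: i64,
}

fn main() {
    let erased = ErasedRow {
        fields: vec![
            Box::new("part-0001.parquet".to_string()),
            Box::new(1024i64),
        ],
    };

    // Works, but only because offset 1 happens to hold an i64 today.
    assert_eq!(erased.get::<i64>(1), Some(&1024));
    // A stale offset or wrong type is a silent runtime `None`:
    assert_eq!(erased.get::<i64>(0), None);

    let typed = AddRow { path: "part-0001.parquet".into(), size: 1024 };
    assert_eq!(typed.size, 1024); // a typo'd field name would not compile
    assert_eq!(typed.path, "part-0001.parquet");
}
```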

<hr />

<p>There are some good pieces of the design which honestly I cannot speak to
because I don’t stub my toes on them. Ryan and I have discussed at length the
importance of deferring work as long as possible in kernel to achieve higher
performance. Some of the Expression and Transform APIs allow for lower memory
footprints and faster log replay when work can be deferred or outright
<em>avoided</em>.</p>

<p>In delta-rs some of the performance deficiencies we have seen since adopting
Delta kernel have more to do with our interop code than with kernel design
decisions. The delta-rs project is <em>massive</em>. As a general purpose Delta Lake
implementation, the surface area of changes that
<a href="https://github.com/roeap">Robert</a> had to touch to even get us to where we
are today is huge; his effort has been nothing short of heroic.</p>

<h2 id="community">Community</h2>

<p>The Delta kernel project is the first one I have worked on with Databricks
where there is <em>some</em> transparency around the week-to-week operations. 
The kernel Rust community has weekly meetings where
developers are talking to developers. 
Many of my early conversations with <a href="https://dennyglee.com/">Denny</a> were around
the propensity for Databricks to dump code into the Delta project as a fait
accompli. In one particularly egregious situation, there were protocol and
Delta/Spark changes which were reviewed, approved, and merged by Databricks
employees the week before being announced at <a href="https://dataandaisummit.com">Data and AI
Summit</a>. Kernel gets this right.</p>

<p>Even though I cannot make every weekly call with the kernel community, I love it when I can.</p>

<p><em>I don’t always attend the kernel weekly call, but when I do, I’m asking when the next release will happen.</em></p>

<p>For reasons I don’t think anybody really understands, Delta kernel moves <em>very</em>
slowly. Patch releases are of particular importance to me because delta-rs has
started to depend on the Delta kernel for its protocol implementation and
therefore <em>many</em> of our new bugs relate to Delta kernel in some way or another.</p>

<p>Releases have averaged around one every three weeks in 2025. Nine of the thirty
versions released to
<a href="https://crates.io/crates/delta_kernel/versions">crates.io</a> were patch fixes;
the other 21 were minor version bumps, which means <strong>70%</strong> of published
releases contained API breaking changes. Some of that is inevitable as developers
are figuring out the appropriate shape of different APIs. As a downstream consumer
of this release cycle it means that I am highly unlikely to ever receive bug fixes
without development effort to adapt to ever-changing APIs.</p>

<p>There is no free lunch.</p>

<p>For the <a href="https://crates.io/crates/deltalake">delta-rs</a> project this means our releases are <em>frequently blocked</em> on:</p>

<ul>
  <li>Delta kernel</li>
  <li><a href="https://crates.io/crates/arrow">Apache Arrow</a></li>
  <li><a href="https://crates.io/crates/datafusion">Apache Datafusion</a></li>
</ul>

<p>Delta kernel ships with a default engine that has a major version dependency on
Apache Arrow, a project which <em>also</em> avoids patch releases. This compounding
effect means that when a new <code class="language-plaintext highlighter-rouge">arrow</code> is released we (delta-rs) must wait for
that to be incorporated into both <code class="language-plaintext highlighter-rouge">datafusion</code> and <code class="language-plaintext highlighter-rouge">delta_kernel</code>, and for both
those crates to be released.</p>

<blockquote>
  <p>Any issue reported to delta-rs which requires a change in Arrow or Delta kernel
will typically take 1-2 months to resolve.</p>
</blockquote>

<h3 id="no-need-to-wait">No need to wait</h3>

<p>Up until yesterday, the latest released
<a href="https://crates.io/crates/deltalake/">deltalake</a> crate was <code class="language-plaintext highlighter-rouge">0.29.4</code> which
depended on Delta kernel <code class="language-plaintext highlighter-rouge">0.16.0</code>. That version is three months old and
unfortunately never saw any patch releases, which is part of the reason all four of the <code class="language-plaintext highlighter-rouge">0.29.x</code> releases of delta-rs depended upon it.</p>

<p>Using the crate downloads statistics as a <em>very</em> unscientific measure, I would
hazard a guess that <code class="language-plaintext highlighter-rouge">delta-rs</code> drives the majority of downloads for Delta
kernel.</p>

<p><img src="/images/post-images/2025-delta-kernel/delta_kernel_downloads.png" alt="delta_kernel downloads showing a lot of &quot;Other&quot;" /></p>

<p>The <code class="language-plaintext highlighter-rouge">0.18.0</code> release went out on November 20th, which has a small uptick, but
then the big spike in early December correlates strongly with
<a href="https://github.com/delta-io/delta-rs/pull/3949">this pull request</a>, which pulled
<code class="language-plaintext highlighter-rouge">0.18.x</code> into the delta-rs repository.</p>

<p>For completeness’ sake, the <code class="language-plaintext highlighter-rouge">deltalake</code> crate’s downloads have a very similar
shape, but due to the longer release cycle of <code class="language-plaintext highlighter-rouge">0.29.x</code> it is difficult to tell
which versions are being heavily downloaded.</p>

<p><img src="/images/post-images/2025-delta-kernel/deltalake_downloads.png" alt="deltalake downloads also showing plenty of &quot;Other&quot;" /></p>

<hr />

<p>Maintaining stable APIs is a pain, but becomes much more important the lower in
the stack any dependency lives.</p>

<p>One approach could be to create release branches which have changes
cherry-picked between them as is needed. This introduces more release
engineering work and can be challenging. For my own purposes I <em>have done this</em>
and backported fixes for both Delta kernel and delta-rs in various shapes to
support customers who cannot boil the ocean with unstable releases every two to
three weeks.</p>

<p>At <a href="https://tech.scribd.com">Scribd</a> a patch release of delta-rs, with <em>zero API changes</em> requires at least:</p>

<ul>
  <li>New Lambdas to be built.</li>
  <li>Those Lambdas to be deployed to a testing environment.</li>
  <li><em>waiting for enough data volume to demonstrate reliability</em></li>
  <li>Promotion of a Lambda to a production environment.</li>
  <li><em>waiting for enough data volume to demonstrate success</em></li>
</ul>

<p>When everything operates smoothly this is about two developer-hours of time
from end to end, but that is with <em>zero API changes</em>.</p>

<p>Every set of API changes in delta-rs, Delta kernel, or Apache Arrow introduces
unknown developer time to perform updates and upgrades. Unless a new release of
<em>any</em> of these dependencies confers significant performance or quality
improvements, the business looks at these upgrades as <strong>unnecessary cost</strong> and
instead prefers to simply <em>not</em> update.</p>

<p>As a consequence bugs can be discovered in production months after a given
Delta kernel release. For example <a href="https://github.com/delta-io/delta-kernel-rs/pull/1561">this performance
bug</a> in Delta kernel had
actually existed for <strong>months</strong> in released crates. It was not until delta-rs
adopted more of Delta kernel that I was able to bring upgrades all the way
to production and discover <a href="https://github.com/buoyant-data/oxbow/commit/2363be8869a025b90bc46c2d7ed1893aca2d37e4">a couple of serious performance issues in delta-rs and Delta kernel</a>.</p>

<p>This timeline is getting a little confusing even for me, so let’s recap:</p>

<ul>
  <li><strong>October 2024</strong>: <a href="https://github.com/delta-io/delta-kernel-rs/pull/373">A JSON parsing workaround introduced</a> into kernel and released in <code class="language-plaintext highlighter-rouge">0.4.0</code>.</li>
  <li><strong>July 2025</strong>: <a href="https://crates.io/crates/deltalake/0.27.0">deltalake 0.27.0</a>
released with first serious adoption of Delta kernel at <code class="language-plaintext highlighter-rouge">0.13.0</code>.</li>
  <li><strong>August 2025</strong>: delta-rs performance <a href="https://github.com/delta-io/delta-rs/pull/3660">issue identified and fixed</a> along with a separate Delta kernel <a href="https://github.com/delta-io/delta-kernel-rs/pull/1171">performance issue with wide tables identified</a>. Both problems were identified after I invested some spare work-cycles in using pre-release code to interact with production data sets at Scribd.</li>
  <li><strong>September 2025</strong>: <a href="https://github.com/buoyant-data/oxbow/commit/d8f7b683d7ff1498d1c2eea96a2642d8f5b490c4">oxbow incorporates 0.28.0</a> and that’s quickly reverted until delta-rs <code class="language-plaintext highlighter-rouge">0.29.x</code> is released with additional improvements, both in the crate and in the newer Delta kernel <code class="language-plaintext highlighter-rouge">0.16.0</code> it incorporates.</li>
</ul>

<p>From my perspective, the amount of time invested in the performance issues
alone has not been “paid back” by improvements delivered from Delta kernel.</p>

<hr />
<p><strong>NOTE:</strong> HR would like to remind me to adopt a growth-mindset.</p>

<p>The improvements from incorporating Delta kernel have not paid back the time-invested <strong><em>yet</em></strong>.</p>

<hr />

<p>For more than a year there were performance issues sitting in <code class="language-plaintext highlighter-rouge">main</code> and
released kernel crates.</p>

<p>The time delay between changes being made in kernel and those changes being
used for real workloads is <strong>long</strong>. Too long to be useful as a constructive
feedback cycle for development.</p>

<p>I believe the only way to improve this is with faster releases and faster
feedback.</p>

<h3 id="have-you-tried-just">Have you tried just</h3>

<p>The very-long user-feedback loop on released changes is only half of the
velocity troubles afflicting Delta kernel. I have personally avoided
contributing too much because the amount of yak-shaving can be pretty wild.</p>

<p>The performance improvement I recently suggested set a new personal TOP SCORE,
garnering a total of <em>84 comments</em> in the back-and-forth with four different
maintainers. That is more pull request comments than lines changed in the patch.</p>

<p>What is sometimes difficult to remember as a
maintainer is that a pull request does not represent the <em>start</em> of time
invested by a contributor. A pull request is usually the <em>end</em> of their
time-investment. In this case I had already invested between 5-8 hours of
profiling and understanding the issue before I could create the change.</p>

<p>Hidden in the yak-shaving <em>was useful feedback</em>, but the process was so frustrating
that I eventually threw in the towel and asked Nick to take it over after
about 12 hours of total time invested.</p>

<p>Of the currently <a href="https://github.com/delta-io/delta-kernel-rs/pulls?q=is%3Apr+is%3Aopen+sort%3Acomments-desc">open pull
requests</a>
the one with the most comments is at 99. Of the <a href="https://github.com/delta-io/delta-kernel-rs/pulls?q=is%3Apr+sort%3Acomments-desc+is%3Aclosed">closed pull
requests</a>
my maddening 84 comment odyssey doesn’t even fit on the <strong>first page</strong> of “most
commented” pull requests. The top spot is claimed by <a href="https://github.com/delta-io/delta-kernel-rs/pull/109">this pull
request</a> which has 369
comments and took over two months from open to merge. That monster is somewhat
of an outlier because it represents a substantial change earlier in the history
of Delta kernel, but a number of other changes are very much in the
hundreds-of-comments range.</p>

<p>The pull request culture in Delta kernel is fundamentally contributor hostile.</p>

<p>The suggestions I made to Nick on how to improve this are:</p>

<ul>
  <li>Assigning one maintainer (e.g. <code class="language-plaintext highlighter-rouge">CODEOWNERS</code>) to review each pull request.
There is relatively little benefit from multiple people offering differing
opinions on a non-maintainer’s pull request.</li>
  <li>Contributors should feel like their goals are shared with maintainers. The
<a href="https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/incorporating-feedback-in-your-pull-request">suggest
change</a>
functionality of GitHub pull requests is fantastic for this. Rather than
leaving a wall of text, suggesting direct code changes helps convey a shared
investment in the pull request.</li>
  <li>Better yet, rather than asking for tests or changes, <strong>make the changes</strong>.
Most contributors allow maintainers to push to their fork’s topic branches. I
regularly use this to add regression tests to contributors’ pull requests,
rather than asking them “please write a test.” Modelling good behavior
is usually more successful than <em>telling</em>.</li>
</ul>

<p>Some other ideas that come to mind:</p>

<ul>
  <li>Any comment with “nit: “ should simply be deleted. I see this at work from
time to time and will privately discuss with the developer how anti-social
that behavior comes across. Any bit of feedback that somebody feels is
nitpicky should be made in a follow up pull request or just <em>not</em>. Nitpicks
are a waste of everybody’s time.</li>
  <li>There is a habit of “stacking PRs” in this project and as I write this, there
are <strong>19</strong> open “stacked” pull requests. Smaller commits and smaller pull
requests should be preferred and would move quicker. I think there are a <em>lot</em> of
comments on pull requests because each pull request ends up being fairly
large and sits in an Open state for a long time.</li>
</ul>

<p>Many developers believe that code “stabilizes” as if some magic happens to code
in <code class="language-plaintext highlighter-rouge">main</code>. All code has a rapidly decaying half-life, especially code which
sits in open pull requests. The only way to demonstrate that anything is good
or bad is for it to be <em>used</em>. Stability comes from <em>use</em>.</p>

<p>I think everybody involved in the Delta kernel project, myself included, wants
a stable and high-performance foundation to build our Delta-based applications.
As Jez Humble and David Farley wrote in the book on <a href="https://en.wikipedia.org/wiki/Continuous_delivery">Continuous
Delivery</a>, a long cycle time
is usually <em>antithetical</em> to stability and reliability.</p>

<h2 id="theyre-good-kernels-brent">They’re good kernels Brent</h2>

<p>Golly this has been a bunch of words. To quote a wise man:</p>

<blockquote>
  <p>The Delta Kernel is one of the most technically challenging and ambitious open source projects</p>
</blockquote>

<p>I believe in the vision of Delta kernel and certainly wouldn’t be here if I
didn’t. The fragmentation that I see in the ecosystem is causing nothing but
trouble. Since starting this essay I have encountered <em>two</em> new and quirky
derivatives of delta-rs code which are trying to coerce it to do things which
Delta kernel is meant to support. In fact, the status quo of Delta kernel
already supports the two use-cases I stumbled into!</p>

<p>Having a stable and high-performance foundation means that features and
improvements added into kernel benefit <em>everybody</em>! How marvelous is that? The
trick is getting <em>everybody</em> to use kernel!</p>

<p>Kernel’s success is important to the Delta Lake ecosystem and numerous others.
For kernel to succeed, however, I believe we need to adjust course in 2026 and
build a stronger technology foundation: introducing more idiomatic Rust code,
leaning more heavily on the strengths of the Rust ecosystem in the interfaces,
and supporting Rust implementations with async/await as a focus rather than FFI.</p>

<p>Building in a more Rust-familiar way will enable more new contributors along
with their fresh perspectives. We will need to improve our release cadence and
change management into something clear and predictable. Making new developers
feel welcomed and their contributions valued will solidify kernel’s place as
the foundation in the ecosystem.</p>

<p>Stronger technology <em>and</em> a stronger community in 2026 will help Delta kernel
overcome the challenges we face today.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="deltalake" /><category term="opinion" /><summary type="html"><![CDATA[The Delta Kernel is one of the most technically challenging and ambitious open source projects I have worked on. Kernel is fundamentally about unifying all of our needs and wants from a Delta Lake implementation into a single cohesive yet-pluggable API surface. Towards the end of 2025 TD asked me to jot down some of the issues which have been frustrating me and/or slowing down the adoption of kernel in projects like delta-rs. At the outset of the project we all discussed concerns about what could actually be possible as we set out into uncharted territory. In many ways we have succeeded, in others we have failed.]]></summary></entry><entry><title type="html">Parallelism is a little tricky</title><link href="https://brokenco.de//2025/12/16/parallelism-is-tricky.html" rel="alternate" type="text/html" title="Parallelism is a little tricky" /><published>2025-12-16T00:00:00+00:00</published><updated>2025-12-16T00:00:00+00:00</updated><id>https://brokenco.de//2025/12/16/parallelism-is-tricky</id><content type="html" xml:base="https://brokenco.de//2025/12/16/parallelism-is-tricky.html"><![CDATA[<p>In theory many developers understand concurrency and parallelism, in practice I
think almost none of us do. At least not all the time. Building a mental model
of highly parallel interdependent software is incredibly time-consuming,
difficult, and error-prone. I have recently been doing a <em>lot</em> of performance
analysis with both <a href="https://github.com/delta-io/delta-rs">delta-rs</a> and
<a href="https://github.com/delta-io/delta-kernel-rs">delta-kernel-rs</a>. In the process
I have had to check some of my own assumptions of how things <em>should</em> work
compared to how they <em>do</em> work.</p>

<hr />
<p>Sidenote: to get an idea of how frequently we all “get it wrong”, subscribe to Aphyr’s <a href="https://jepsen.io/blog">Jepsen blog</a> for distributed systems safety research.</p>

<hr />

<p>The Delta Lake Rust binding has relied on <a href="https://tokio.rs/">Tokio</a> since the
beginning, which as any <code class="language-plaintext highlighter-rouge">/r/rust</code> commenter knows is an easy turbo button to
solve all your performance and parallelism needs!</p>

<p>When we were designing kernel however, there was a strong motivation <em>not</em> to
take a direct dependency on Tokio. Due to some early influences in the project,
there was a pretty strong push to support C/C++ based engines with
delta-kernel-rs. Those engines would need a Foreign-function Interface (FFI)
and pushing something like Tokio or even
<a href="https://docs.rs/futures/latest/futures/">futures</a> over an FFI boundary was
unsavory to say the least.</p>

<p>What may be one of our original performance sins in kernel was designing APIs
around the <a href="https://doc.rust-lang.org/std/iter/trait.Iterator.html">Iterator</a>
trait. I am writing this partially to help form my thoughts, but consider this screenshot from
<a href="https://github.com/KDAB/hotspot">Hotspot</a> showing Tokio tasks doing the work of “log replay” when opening a large complex Delta table:</p>

<p><img src="/images/post-images/2025-12-delta-rs/tokio-thread-switching.png" alt="Context switching in tasks" /></p>

<p>These two tasks are <em>concurrent</em> but they are not parallel. In <code class="language-plaintext highlighter-rouge">Iterator</code>
terms, this is about what I would expect to see. The conceptual model for execution is:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">Iterator</code> created.</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
  <li>“do work”</li>
  <li>return result</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
</ol>

<p>The fact that work is being done on different tasks is irrelevant. <code class="language-plaintext highlighter-rouge">Iterator</code>
is lazy, but is only going to “do work” when it is asked, thus a serial
invocation model.</p>

<p>When parallelism is the design goal, work <strong>must</strong> be able to happen at the same
time, but it does not necessarily need to be requested “lazily” in the
style of the <code class="language-plaintext highlighter-rouge">Iterator</code> trait.</p>

<p>In delta-rs <a href="https://github.com/roeap">Robert</a> pulled in some code from
<a href="https://datafusion.apache.org">Datafusion</a> which relies on Tokio’s
<a href="https://docs.rs/tokio/latest/tokio/task/struct.JoinSet.html">JoinSet</a> API.  The <code class="language-plaintext highlighter-rouge">JoinSet</code> is effectively what we want if we want an Iterator-style parallel work executor:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">JoinSet</code> created, “do work” begins</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
  <li>return result</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
  <li>return result</li>
  <li>“do work”</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
  <li>return result</li>
</ol>
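<p>The difference between the two models can be sketched with std threads and a
channel standing in for Tokio’s <code class="language-plaintext highlighter-rouge">JoinSet</code> (used here so the sketch stays
dependency-free): in the eager model the work is already running before the
consumer ever asks for a result.</p>

```rust
use std::sync::mpsc::channel;
use std::thread;
use std::time::{Duration, Instant};

// Lazy, Iterator-style: work happens inside each next() call, serially.
fn lazy_work(n: u64) -> (Vec<u64>, Duration) {
    let start = Instant::now();
    let out = (0..n)
        .map(|i| {
            thread::sleep(Duration::from_millis(20)); // "do work"
            i
        })
        .collect();
    (out, start.elapsed())
}

// Eager, JoinSet-style: all work is spawned up front; the consumer
// only receives finished results, in completion order.
fn eager_work(n: u64) -> (Vec<u64>, Duration) {
    let start = Instant::now();
    let (tx, rx) = channel();
    for i in 0..n {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(20)); // "do work"
            tx.send(i).unwrap();
        });
    }
    drop(tx); // channel closes once every worker has finished
    let mut out: Vec<u64> = rx.into_iter().collect();
    out.sort();
    (out, start.elapsed())
}

fn main() {
    let (lazy, lazy_t) = lazy_work(4);
    let (eager, eager_t) = eager_work(4);
    assert_eq!(lazy, eager);
    // The four sleeps overlap instead of queueing behind next().
    assert!(eager_t < lazy_t);
}
```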

<p>Currently the use of <code class="language-plaintext highlighter-rouge">JoinSet</code> happens much higher in the stack inside of
delta-rs, but does <em>not</em> happen deeper down in the delta-kernel-rs code.</p>

<p>What the profiling <em>likely</em> indicates is that there are serial <code class="language-plaintext highlighter-rouge">Iterator</code>
executions happening in the kernel layer which lead to a bottleneck for
callers, regardless of how parallel-capable those callers may be.</p>

<hr />

<p>Tokio has received criticism in the past about its suitability for heavy
CPU-bound operations. Its async/await primitives work incredibly well for
anything which has I/O wait involved. The scheduler can switch between tasks
when a socket is awaiting data, making it highly concurrent for I/O-bound
applications. Tokio functions similarly to Goroutines in Golang, greenlets in
Python, etc. As I dug deeper into this problem I wanted to ensure that Tokio
was going to behave as I expected with CPU-bound operations.</p>

<p>I compared performance of a <code class="language-plaintext highlighter-rouge">JoinSet</code> based program which generates
RSA keys, and a <a href="https://crates.io/crates/rayon">rayon</a> based program. Both are
close enough in performance and parallelism. Both effectively used all
available cores when the Tokio runtime was configured with a single worker
thread per core.</p>
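<p>The shape of that sanity check, rewritten here with scoped std threads and a
cheap stand-in computation (the real comparison used Tokio’s <code class="language-plaintext highlighter-rouge">JoinSet</code>, rayon,
and RSA key generation), looks something like:</p>

```rust
use std::thread;

// Deterministic stand-in for a CPU-bound task such as RSA keygen.
fn busy_work(seed: u64) -> u64 {
    (0..200_000).fold(seed, |acc, i| {
        acc.wrapping_mul(6364136223846793005).wrapping_add(i)
    })
}

fn main() {
    let jobs: Vec<u64> = (0..16).collect();
    let workers = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(4);

    // One OS thread per core, each taking a contiguous chunk of jobs:
    // roughly what a sized Tokio runtime or rayon arranges for you.
    let chunk_size = (jobs.len() + workers - 1) / workers;
    let results: Vec<u64> = thread::scope(|s| {
        let handles: Vec<_> = jobs
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk.iter().map(|&j| busy_work(j)).collect::<Vec<_>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    });

    // Joining in spawn order keeps results in job order.
    let expected: Vec<u64> = jobs.iter().map(|&j| busy_work(j)).collect();
    assert_eq!(results, expected);
}
```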

<hr />

<p>Coming back to the Delta Lake ecosystem and our beloved <code class="language-plaintext highlighter-rouge">Iterator</code>. I think
there are two paths ahead:</p>

<ul>
  <li>The Easy Road: taking <code class="language-plaintext highlighter-rouge">JoinSet</code> into the default engine of delta-kernel-rs
will at least alleviate some of the “concurrent but not parallel” problems
that are lurking down there.</li>
  <li>The Hard Road: attempting to put a synchronous <code class="language-plaintext highlighter-rouge">Engine</code> interface in front of
inherently I/O bound operations is going to lead to performance deficiencies
compared to an evented system like Tokio or anything else with a kqueue/epoll
reactor at its core. Putting async/await at the foundation of delta-kernel-rs
would allow for driving more concurrent and parallel behavior depending on
the use-case.</li>
</ul>

<p>The performance of delta-rs is a major focus of my work in the project. In 2026 I look
forward to sharing more analysis and more <a href="https://github.com/delta-io/delta-kernel-rs/pull/1561">pull
requests</a>!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="deltalake" /><summary type="html"><![CDATA[In theory many developers understand concurrency and parallelism, in practice I think almost none of us do. At least not all the time. Building a mental model of highly parallel interdependent software is incredibly time-consuming, difficult, and error-prone. I have recently been doing a lot of performance analysis with both delta-rs and delta-kernel-rs. In the process I have had to check some of my own assumptions of how things should work compared to how they do work.]]></summary></entry><entry><title type="html">The end of the road for kafka-delta-ingest</title><link href="https://brokenco.de//2025/10/30/kafka-delta-ingest-was-fun.html" rel="alternate" type="text/html" title="The end of the road for kafka-delta-ingest" /><published>2025-10-30T00:00:00+00:00</published><updated>2025-10-30T00:00:00+00:00</updated><id>https://brokenco.de//2025/10/30/kafka-delta-ingest-was-fun</id><content type="html" xml:base="https://brokenco.de//2025/10/30/kafka-delta-ingest-was-fun.html"><![CDATA[<p>After five years in production kafka-delta-ingest at Scribd has been shut off
and removed from our infrastructure.
<a href="https://github.com/delta-io/kafka-delta-ingest">kafka-delta-ingest</a> was the
motivation behind my team creating
<a href="https://github.com/delta-io/delta-rs">delta-rs</a>, the most successful open
source project I have started to date. With kafka-delta-ingest we achieved our
original stated goals and reduced streaming data ingestion costs by <strong>95%</strong>. In
the time since however, we have <em>further</em> reduced that cost <a href="https://www.youtube.com/watch?v=h8nCF_OI0O0">with even more
efficient infrastructure</a>.</p>

<p>The original kafka-delta-ingest/delta-rs implementations were created by the
joint efforts of the following talented developers across <em>three continents</em> in
the middle of 2020, an otherwise totally chill time in world history.</p>

<ul>
  <li><a href="https://github.com/houqp">QP Hou</a></li>
  <li><a href="https://github.com/xianwill">Christian Williams</a></li>
  <li><a href="https://github.com/mosyp">Mykhailo Osypov</a></li>
  <li><a href="https://github.com/nevi-me">@nevi-me</a></li>
</ul>

<p>Prior to our creation of delta-rs, the only way to read and write <a href="https://delta.io">Delta
Lake</a> tables was through <a href="https://spark.apache.org">Apache
Spark</a>. While it is an incredibly powerful tool for
reading and transforming data, it is far too slow and heavyweight for the
task of high-throughput data ingestion. QP and I found ourselves loving
<a href="https://rust-lang.org">Rust</a> and I was able to corner the funding to get the
project started on the promise of lower operational costs.</p>

<p>Boy howdy has the investment in Rust delivered. The implementation of kafka-delta-ingest dramatically lowered our operational costs as Christian shares in this video:</p>

<center><iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/do4jsxeKfd4?si=vAgTIsWWn4k7f5qi" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe></center>

<p>Christian also shared some <a href="https://www.youtube.com/watch?v=mLmsZ3qYfB0">architecture and discussion in this
video</a>, which I think are useful
for anybody building streaming systems around Delta Lake.</p>

<p>Here’s a <a href="https://www.youtube.com/watch?v=JvonUisY7vE&amp;t=51s">demo by Christian</a> too!</p>

<hr />

<p>Ultimately, the reason kafka-delta-ingest was decommissioned was that I created an <em>even
cheaper</em> ingestion process. My work on the
<a href="https://github.com/buoyant-data/oxbow">oxbow</a> suite coupled with
<a href="https://www.databricks.com/glossary/medallion-architecture">the medallion
architecture</a>
has made contemporary Delta Lake ingestion less than 10% of the total data
platform cost.</p>

<p>The big argument against kafka-delta-ingest was <a href="https://kafka.apache.org">Apache
Kafka</a>. If an organization has Kafka for other
reasons, then kafka-delta-ingest can be a useful “sidecar” process to persist
data flowing through Kafka. If however the organization is running Kafka <em>just</em>
for ingestion, there are cheaper options available. As the organization
evolved, the other consumers of Kafka drifted away, driving the value
proposition of kafka-delta-ingest lower and lower.</p>

<p>This doesn’t mean kafka-delta-ingest is not <em>useful</em>, it’s just no longer
useful at Scribd.</p>

<hr />

<p><a href="https://github.com/mightyshazam">Kyjah Keyes</a> and I are the maintainers of
kafka-delta-ingest and we are now both in the position of <em>not actually using
it</em> anymore.</p>

<p>I will continue to apply delta-rs upgrades to it, since kafka-delta-ingest
continues to be a useful test bed for API changes and integration testing, but
I don’t have big plans or ideas on how to grow the project further.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="s3" /><category term="deltalake" /><category term="kafka" /><category term="rust" /><summary type="html"><![CDATA[After five years in production kafka-delta-ingest at Scribd has been shut off and removed from our infrastructure. kafka-delta-ingest was the motivation behind my team creating delta-rs, the most successful open source project I have started to date. With kafka-delta-ingest we achieved our original stated goals and reduced streaming data ingestion costs by 95%. In the time since however, we have further reduced that cost with even more efficient infrastructure.]]></summary></entry><entry><title type="html">Delta Lake Live!</title><link href="https://brokenco.de//2025/09/18/delta-lake-live.html" rel="alternate" type="text/html" title="Delta Lake Live!" /><published>2025-09-18T00:00:00+00:00</published><updated>2025-09-18T00:00:00+00:00</updated><id>https://brokenco.de//2025/09/18/delta-lake-live</id><content type="html" xml:base="https://brokenco.de//2025/09/18/delta-lake-live.html"><![CDATA[<p>Every Tuesday morning at 7am I have a date.</p>

<p>For the past few weeks <a href="https://github.com/roeap">Robert</a> and I have been
jumping onto a shared <a href="https://twitch.tv/agentdero">Twitch</a> stream and working
through issues, code reviews, and design discussions for the
<a href="https://github.com/delta-io/delta-rs">delta-rs</a> project.</p>

<p>The idea for the project came up at Data and AI Summit earlier this year.
Robert lives in Europe and I am as west as west coast in the US generally gets.
The timezone spread has been making collaboration difficult on the topics which
require lively synchronous debate.</p>

<p>The Delta Lake project is open source and therefore, in my opinion, the discussions and development of the project should also be open! What better than a big open live stream to work through column mapping, deletion vectors, bugs, performance challenges, and more!</p>

<p>I have livestreamed development <a href="/2012/08/28/pairing-with-the-fourth-wall">in the
past</a> and found it useful, but with
“Delta Lake Live!” we have a much more regular schedule, agenda, and way for
folks in the chat to engage, making it all that much more fun!</p>

<p>The streams are <a href="https://www.youtube.com/watch?v=6EZM0AbLkWU&amp;list=PLzxP01GQMpjdXtIAVxv_ziQHqyhaEhAVh">also being archived on
YouTube</a>
but you’re more than welcome to pop by and hang out <a href="https://www.twitch.tv/agentdero/schedule">every Tuesday at 7am
PDT</a></p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="deltalake" /><summary type="html"><![CDATA[Every Tuesday morning at 7am I have a date.]]></summary></entry><entry><title type="html">Busily writing elsewhere</title><link href="https://brokenco.de//2025/05/03/writing-elsewhere.html" rel="alternate" type="text/html" title="Busily writing elsewhere" /><published>2025-05-03T00:00:00+00:00</published><updated>2025-05-03T00:00:00+00:00</updated><id>https://brokenco.de//2025/05/03/writing-elsewhere</id><content type="html" xml:base="https://brokenco.de//2025/05/03/writing-elsewhere.html"><![CDATA[<p>Writing has been a part of my work for a <em>long</em> time, it helps me think and
more importantly it helps me share ideas with other developers. Recently a
tremendous amount of my time has been spent writing internal design documents,
blog posts, and other materials. By the time it comes to personal blogging,
my words have all been spent.</p>

<p>On the <a href="https://buoyantdata.com">Buoyant Data</a> blog I have been writing about a
<em>lot</em> of <a href="https://delta.io">Delta Lake</a> related topics such as:</p>

<ul>
  <li><a href="https://www.buoyantdata.com/blog/2024-12-31-high-concurrency-logstore.html">Scaling streaming Delta Lake applications</a></li>
  <li><a href="https://www.buoyantdata.com/blog/2025-02-24-just-keep-buffering.html">Buffering more messages with serverless data ingestion</a></li>
  <li><a href="https://www.buoyantdata.com/blog/2025-03-09-lessons-learned-building-delta-rs.html">Lessons learned in building delta-rs</a></li>
  <li><a href="https://www.buoyantdata.com/blog/2025-04-22-rust-is-good-for-the-climate.html">Build more climate-friendly data applications with Rust</a></li>
</ul>

<p>Some of this work has been in preparation for the two upcoming talks I have at
<a href="https://www.databricks.com/dataaisummit">Data and AI Summit 2025</a>. Some of
these posts have been in doing research with clients, or just spelunking on my
own.</p>

<p>You can <a href="https://www.buoyantdata.com/rss.xml">subscribe to the RSS feed</a> for more up to date articles relating to high-efficiency data processing with Rust!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="deltalake" /><category term="buoyantdata" /><summary type="html"><![CDATA[Writing has been a part of my work for a long time, it helps me think and more importantly it helps me share ideas with other developers. Recently a tremendous amount of my time has been spent writing internal design documents, blog posts, and other materials. By the time it has come to personal blogging my words all been spent.]]></summary></entry><entry><title type="html">From the beginning, delta-rs to Delta Lake: The Definitive Guide</title><link href="https://brokenco.de//2024/11/15/deltalake-the-definitive-guide.html" rel="alternate" type="text/html" title="From the beginning, delta-rs to Delta Lake: The Definitive Guide" /><published>2024-11-15T00:00:00+00:00</published><updated>2024-11-15T00:00:00+00:00</updated><id>https://brokenco.de//2024/11/15/deltalake-the-definitive-guide</id><content type="html" xml:base="https://brokenco.de//2024/11/15/deltalake-the-definitive-guide.html"><![CDATA[<p>Nothing quite feels like “I made it!” like being <em>published</em>. Which is why I am
thrilled to share that <a href="https://bookshop.org/p/books/delta-lake-the-definitive-guide-modern-data-lakehouse-architectures-with-data-lakes-denny-lee/21429337?ean=9781098151942">Delta Lake: The Definitive
Guide</a>
is available for purchase, and I kind of helped! I wanted to share a little bit
about how my contributions (Chapter 6!) came about, because my entrance into
the <a href="https://delta.io">Delta Lake</a> ecosystem was about as unplanned as my
authorship of part of this wonderful book.</p>

<p>The <a href="https://github.com/delta-io/delta-rs">delta-rs</a> project started in 2020 and I wish that I could say it is because
I am a brilliant visionary. The project largely started because I have had a
bias against JVM-based technology stacks and I had stepped into a role at
<a href="https://tech.scribd.com">Scribd</a> where we were migrating to AWS, Databricks,
and a new architecture <em>anyways</em> so why not challenge the orthodoxy? My
colleague <a href="https://about.houqp.me/">QP Hou</a> and I were loving Rust and liked
Delta Lake from a design standpoint, but did not love <a href="https://spark.apache.org">Apache
Spark</a> for some of the things we needed to do.</p>

<p>I would consider the official start of the project to be April 11th, 2020 when
I sent our Databricks colleagues the following:</p>

<hr />

<p>Greetings! As I mentioned in our weekly sync up this week, we have an interest
in partnering with Databricks to develop and open source a native client
interface for Delta Lake.</p>

<p>For framing this conversation and scope of the native interface, I categorize
our compute workloads into three groups:</p>

<ol>
  <li><strong>Big offline data processing</strong>, requiring a cluster of compute resources where Spark makes a big dent.</li>
  <li><strong>Lightweight/small offline data processing</strong>, workloads needing “fractional
compute” resources, basically less than a single machine. (Ruby/Python type
tasks which move data around, or perform small-scale data accesses make up
the majority of these in our current infrastructure, we’ve discussed using
the Databricks Light runtime for these in the past, since the cost to
deploy/run these small tasks on Databricks clusters doesn’t make sense).</li>
  <li><strong>Boundary data-processing</strong>, where the task might involve a little bit of
production “online” data and a little bit of warehouse “offline” data to
complete its work. In our environment we have Ruby scripts whose sole job is
to sync pre-computed (by Spark) offline data into online data stores for the
production Rails application, etc, to access and serve.</li>
</ol>

<p>I don’t want to burn down our current investment in Ruby for many of the 2nd
and 3rd workloads, not to mention retraining a number of developers in-house to
learn how to effectively use Scala or pySpark.</p>

<p>My proposal is that we partner with Databricks and jointly develop an open
source client interface for Delta Lake. One where we would have at least one
developer from Databricks working with at least one developer from Scribd on a
jointly scoped effort to deliver a library capable of <em>initially</em> addressing
our ‘2’ and ‘3’ use-cases.</p>

<p>[..]</p>

<p>Further, I propose that we jointly develop a client interface in Rust, which
will allow us to easily extend that within the Databricks community to support
Golang, Python, Ruby, and Node clients.</p>

<p>The key benefits I imagine for us all:</p>

<ul>
  <li>
    <p>Much broader market share for Delta Lake as a technology. Not only would
companies like Scribd benefit, and continue to invest in Delta Lake, but
other companies would have an easier on-ramp into the Databricks ecosystem.
Basically, if you start using Delta Lake before you use Spark, you will (I
guarantee) reach a point where these lightweight workloads become heavyweight
workloads requiring the full power and glory of the Databricks runtime :D</p>
  </li>
  <li>
    <p>It’s a fantastic developer advocacy story that hits a number of key bullet
marketing points: open source, partner collaboration, Rust (so hot right now) :)</p>
  </li>
  <li>
    <p>Scribd is able to “immediately” take advantage of Delta Lake benefits without
burning up all our existing codebase and investment in Ruby tasks and
tooling. Thereby allowing for an easier onramp into Delta Lake and the
Databricks platform as a whole.</p>
  </li>
</ul>

<p>The scope of the effort I think would be largely around properly dealing with
the transaction log, since the Apache Arrow project has already created a
pretty decent <a href="https://crates.io/crates/parquet">parquet crate</a> in Rust. That
said, there may be some writer improvements we’d want/need to push upstream to
Apache Arrow to make this successful.</p>

<hr />

<p>On second thought, almost all of this has come true! What a brilliant sage! (plz clap)</p>

<p>Like many advancements, there’s a right time, a right place, and a right group
of people. Unfortunately Databricks didn’t join the party until later on, but
they were strong supporters of our initial work, providing guidance and helping to
make <a href="https://delta.io">Delta Lake</a> an ever-more thriving open source
community. The right people were all converging on the direction that made
this possible: <a href="https://github.com/nevi-me">Neville</a> helped make
<a href="https://github.com/apache/arrow-rs">arrow-rs</a> a much better <a href="https://parquet.apache.org">Apache
Parquet</a> writer. QP wrote the first version of the
protocol parser and created the first Python bindings for the library.
<a href="https://github.com/xianwill">Christian Williams</a> built out
<a href="https://github.com/delta-io/kafka-delta-ingest">kafka-delta-ingest</a> with
<a href="https://github.com/mosyp">Mykhailo Osypov</a> and helped prove that <strong>Rust is
way more efficient for data ingestion workloads</strong>. As time went on Will Jones,
Florian Valeye, and Robert Peck joined the party and helped turn delta-rs from
a small Scribd-motivated open source project into a thriving Rust and Python
project.</p>

<p><a href="https://bookshop.org/p/books/delta-lake-the-definitive-guide-modern-data-lakehouse-architectures-with-data-lakes-denny-lee/21429337?ean=9781098151942" target="_blank"><img src="/images/post-images/2024-deltalake/book-cover.jpg" align="right" width="200" /></a></p>

<p>Scribd had wild success with the data ingestion being in Rust, and the data
processing/query being in Spark. The community grew, Databricks grew, and at
some point some folks started working on a book.</p>

<p>As a long-time maintainer of delta-rs and talking head in the Delta and
Databricks ecosystem I was asked to be a technical reviewer of the book after
Prashanth, Scott, Tristen, and Denny had already gotten more than halfway
through the chapters.</p>

<p>I provided as much feedback as I could on their chapters. I reviewed the
outline and noticed “Chapter 8: TBD”.</p>

<p>What’s supposed to be Chapter 8? “<em>We’re not sure yet.</em>”</p>

<p>My friend <a href="https://kohsuke.org">Kohsuke</a> once marveled at how I was able to
acquire things for the <a href="https://jenkins.io">Jenkins project</a> by the simple act of
asking for them. There’s some skill involved in finding mutually beneficial
opportunities, but being uninhibited by the possibility somebody would say “no”
helps a lot.</p>

<p>“So this outline looks good, but when are you going to talk about Rust and
Python? There are dozens of us! Dozens!”</p>

<p><a href="https://dennyglee.com/">Denny</a> needed another chapter and I asked if I could
write about building native data applications in Rust and Python.</p>

<p>Suddenly I was helping to write a book.</p>

<hr />

<p><a href="https://tech.scribd.com">Scribd</a> is a fun company to work at. Books,
audiobooks, podcasts, articles. We have a deep appreciation for the written
word, telling stories, and learning. All of which I value highly. Before this
experience however I had never seen the <em>other</em> side of books. The creation,
the meetings, the rewrites, the edits, the reviews, going to press. It is
incredibly interesting and the team at O’Reilly are talented, helpful, and professional.</p>

<p>Going through copy-editing I was fielding review comments on the consistency of
tense, the subject of sentences, discussions about what is a proper noun and
how to consistently apply terms through <em>hundreds of pages</em> of content. I have
heard about how invaluable editors are; I have now seen them in action and am in
awe.</p>

<p>Over the years I have tried and failed to explain what I do to family members.
For people that don’t work in tech “working on the computer” all looks largely
the same, especially for older generations. Having your work, your name <em>in
print</em> has an intangible “wow” factor. More so than conference talks,
websites, GitHub stars, or branded t-shirts, a printed artifact recognizes the
accomplishments of the innumerable contributors to the Delta Lake ecosystem
over the years.</p>

<p>If you’re data inclined, I recommend picking up a copy, Prashanth, Scott,
Tristen, and Denny have written a very useful guide, and also I contributed a
bit too! :)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="deltalake" /><category term="buoyantdata" /><summary type="html"><![CDATA[Nothing quite feels like “I made it!” like being published. Which is why I am thrilled to share that Delta Lake: The Definitive Guide is available for purchase, and I kind of helped! I wanted to share a little bit about how my contributions (Chapter 6!) came about, because my entrance into the Delta Lake ecosystem was about as unplanned as my authorship of part of this wonderful book.]]></summary></entry><entry><title type="html">Data and AI Summit 2024 presentations</title><link href="https://brokenco.de//2024/10/17/data-ai-summit-videos.html" rel="alternate" type="text/html" title="Data and AI Summit 2024 presentations" /><published>2024-10-17T00:00:00+00:00</published><updated>2024-10-17T00:00:00+00:00</updated><id>https://brokenco.de//2024/10/17/data-ai-summit-videos</id><content type="html" xml:base="https://brokenco.de//2024/10/17/data-ai-summit-videos.html"><![CDATA[<p>This year has been so jam packed full of activities that I forgot to share some
videos from <a href="https://www.buoyantdata.com/blog/2024-06-04-data-and-ai-summit.html">Data and AI Summit
2024</a> this
past summer! The annual conference hosted by Databricks has become one of my
favorites to meet with other <a href="https://delta.io">Delta Lake</a> users and
developers to discuss the future of large-scale data ingestion and processing. This year however, I overdid it a little bit.</p>

<p>Using the excuse of promoting my consulting/professional services company
<a href="https://buoyantdata.com">Buoyant Data</a> I had effectively <em>three</em> speaking
engagements:</p>

<ul>
  <li><strong>The road to delta-rs 1.0</strong> at the Open Source Contributor Summit (Monday)</li>
  <li><strong>Fast, cheap, and easy data ingestion with AWS Lambda and Delta Lake</strong>, a
talk highlighting a lot of the successful patterns I have developed for
customers using AWS Lambda with Delta Lake for Rust to create shockingly
cheap data ingestion pipelines. (Thursday)</li>
  <li><strong>Let’s do data engineering in Rust!</strong>, a more fun deep-dive talk to help
people start to get into the world of implementing data systems with Rust. (Thursday)</li>
</ul>

<p>Unfortunately the first talk was not recorded, but it was probably the most
interesting! On Monday morning I was riding my bike from the Ferry Building to
the venue in San Francisco and my chain snapped off while I was sprinting off
from a green light. I went down <strong>hard</strong>, scraped up my knees, and generally
looked a fool lying in the middle of Market St.</p>

<p>The show must go on, so I hobbled to the <a href="https://tech.scribd.com">Scribd</a>
office, deposited my broken bike, and continued to the Open Source Summit.</p>

<p>What I did not know at the time was that I had fractured a bone in my wrist. I
did know however that I needed to go to a clinic, but <em>really</em> wanted to attend
the summit and take advantage of the once-a-year opportunity (literally!) for
some of the brightest minds in the data community to talk about the future of
Delta Lake and more.</p>

<p>So that first talk was given with my swollen wrist pulled to my heart, like a
broken wing, and I’m <em>sure</em> it was a ludicrous sight to see!</p>

<p>By Thursday my arm had been set and was in a sling, which is far less exciting.
Nonetheless, the two talks below are perhaps the only one-handed presentations
thus far in my career! I hope you enjoy!</p>

<center>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/XPoWb9u06xA?si=SNccWEJxorszRGO1" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/Fr5Nx1wuQmQ?si=Svc3GtewzxUyGI4M" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>
</center>

<hr />

<p><em>Note</em>: The presentation software used for this talk is the open source
<a href="https://mfontanini.github.io/presenterm/introduction.html">presenterm</a> tool
which is delightful for creating development-focused presentations like this
one!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="deltalake" /><category term="buoyantdata" /><category term="presentation" /><summary type="html"><![CDATA[This year has been so jam packed full of activities that I forgot to share some videos from Data and AI Summit 2024 this past summer! The annual conference hosted by Databricks has become one of my favorites to meet with other Delta Lake users and developers to discuss the future of large-scale data ingestion and processing. This year however, I overdid it a little bit.]]></summary></entry><entry><title type="html">Improving lock performance for delta-rs</title><link href="https://brokenco.de//2023/11/29/locking-with-deltalake.html" rel="alternate" type="text/html" title="Improving lock performance for delta-rs" /><published>2023-11-29T00:00:00+00:00</published><updated>2023-11-29T00:00:00+00:00</updated><id>https://brokenco.de//2023/11/29/locking-with-deltalake</id><content type="html" xml:base="https://brokenco.de//2023/11/29/locking-with-deltalake.html"><![CDATA[<p>I have had the good fortune this year to help a number of organizations develop
and deploy native data applications in Python and Rust using a project I helped
found: <a href="https://github.com/delta-io/delta-rs">delta-rs</a>. At a high level
delta-rs is a Rust implementation of the <a href="https://github.com/delta-io/delta/blob/master/PROTOCOL.md">Delta Lake
protocol</a> which
offers ACID-like transactions for data lake use-cases. One of the big areas of
my focus has been in evaluating and improving performance in highly concurrent
runtime environments on AWS.</p>

<p>To help others understand the problem domain I spent some time earlier in the
week documenting the challenges in AWS on the Buoyant Data blog: <a href="https://www.buoyantdata.com/blog/2023-11-27-concurrency-limitations-with-deltalake-on-aws.html">Concurrency
limitations for Delta Lake on
AWS</a></p>

<blockquote>
  <p>In the case of AWS S3’s consistency model many operations are strongly
consistent, but concurrent operations on the same key are not. AWS encourages
application-level object locking, which delta-rs implements using AWS
DynamoDB.</p>
</blockquote>

<p>AWS S3 is an incredible piece of technology that washes away a myriad of common
storage problems, and has been jokingly referred to as “the 8th wonder of the
world” by <a href="https://www.lastweekinaws.com/">Corey Quinn</a>. The lack of a
“putIfAbsent” like semantic is however <em>very</em> annoying for the Delta Lake
protocol, adding the need for an application-wide <em>lock</em> for Delta users:</p>

<blockquote>
  <p>The dynamodb-lock approach allows for some sensible cooperation between
concurrent writers but the key limitation is that all concurrent operations
must synchronize on the table itself. There is no smaller division of
concurrency than a table operation.
</blockquote>

<p>In the blog post I offer some potential approaches to mitigate the weakness of
needing a table-level lock for concurrent Delta Lake writers on AWS, but the
problem will unfortunately remain in some form or fashion until S3
introduces a “putIfAbsent” semantic which allows writers to “put” a file,
atomically, only if it doesn’t already exist.</p>
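<p>To illustrate what that missing primitive looks like, here is a toy, in-memory sketch in Rust (not the real S3 or DynamoDB API; the type and method names are mine) of the “putIfAbsent” semantic the Delta commit protocol needs: exactly one of two racing writers may create a given <code class="language-plaintext highlighter-rouge">_delta_log</code> commit file.</p>

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// A toy object store modeling the "putIfAbsent" semantic that the Delta
// protocol needs for commit files. Purely illustrative, not an AWS API.
struct ToyStore {
    objects: Mutex<HashMap<String, Vec<u8>>>,
}

impl ToyStore {
    fn new() -> Self {
        ToyStore {
            objects: Mutex::new(HashMap::new()),
        }
    }

    // Atomically write `key` only if it does not exist yet. Of two writers
    // racing to commit the same table version, exactly one wins; the loser
    // must re-read the log and retry its commit at version N+1.
    fn put_if_absent(&self, key: &str, body: Vec<u8>) -> bool {
        let mut objects = self.objects.lock().unwrap();
        if objects.contains_key(key) {
            false
        } else {
            objects.insert(key.to_string(), body);
            true
        }
    }
}

fn main() {
    let store = ToyStore::new();
    // First writer commits version 5 of the table.
    assert!(store.put_if_absent("_delta_log/00000000000000000005.json", b"commit".to_vec()));
    // A concurrent writer racing for the same version loses and must retry.
    assert!(!store.put_if_absent("_delta_log/00000000000000000005.json", b"commit".to_vec()));
}
```

<p>Lacking this primitive on S3 itself, delta-rs gets the same atomicity from a DynamoDB conditional write, which is precisely why the table-level lock exists.</p>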

<p>For concurrent Delta writers I can offer some advice, but unfortunately
effective cooperative distributed concurrency at scale remains a challenging
problem! :)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="buoyantdata" /><category term="deltalake" /><category term="rust" /><summary type="html"><![CDATA[I have had the good fortune this year to help a number of organizations develop and deploy native data applications in Python and Rust using a project I helped found: delta-rs. At a high level delta-rs is a Rust implementation of the Delta Lake protocol which offers ACID-like transactions for data lake use-cases. One of the big areas of my focus has been in evaluating and improving performance in highly concurrent runtime environments on AWS.]]></summary></entry></feed>