Working in data storage and services, it can seem like everything revolves around capacity and throughput. We don’t think about throughput until it is lacking: a traffic jam, a flipped breaker, an overflowing drain. There are architectural changes we can make to improve throughput, and there are tactical fixes. This post is about the tactical fixes.
Multimodal with Delta Lake
The rate of change for data storage systems has accelerated to a frenzied pace, and most storage architectures I have seen simply cannot keep up. Much of my time is spent thinking about large-scale tabular data stored in Delta Lake, one of the “lakehouse” storage systems alongside Apache Iceberg and others. These storage architectures were developed 5-10 years ago to solve the problems organizations faced moving from data warehouse architectures to massive-scale structured data. The storage changes we need today must support “multimodal data”, which in many ways is a dramatic departure from the traditional query and usage patterns our existing infrastructure supports.
The challenges facing Delta Kernel
The Delta Kernel is one of the most technically challenging and ambitious open source projects I have worked on. Kernel is fundamentally about unifying all of our needs and wants from a Delta Lake implementation into a single cohesive yet pluggable API surface. Towards the end of 2025 TD asked me to jot down some of the issues which have been frustrating me and/or slowing down the adoption of Kernel in projects like delta-rs. At the outset of the project we all discussed concerns about what could actually be possible as we set out into uncharted territory. In many ways we have succeeded; in others we have failed.
Using sccache with not-S3
On a day-to-day basis I build a lot of Rust code. To make my life easier I
use sccache, which I have written about
previously. Periodically
the sccache daemon would exit and then no longer authenticate against my
local network’s not-S3 service.
How to steal my code
All open source code has conditions attached. The majority of the code I have written in my lifetime has been open source, and is therefore usually available for you to build from, distribute, or derive new works from. There are some stipulations, however, and in this post I would like to help you understand how you can take code I have written.
Parallelism is a little tricky
In theory many developers understand concurrency and parallelism; in practice I think almost none of us do, at least not all the time. Building a mental model of highly parallel, interdependent software is incredibly time-consuming, difficult, and error-prone. I have recently been doing a lot of performance analysis with both delta-rs and delta-kernel-rs, and in the process I have had to check some of my own assumptions about how things should work against how they actually work.
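To illustrate the sort of assumption worth double-checking (a minimal sketch of my own, not code from either project): spawned threads do not run or finish in spawn order, so any logic that quietly depends on ordering is already broken.

```rust
use std::thread;

fn main() {
    // Spawn eight threads; the output order varies from run to run because
    // the OS scheduler, not spawn order, decides who runs when.
    let handles: Vec<_> = (0..8)
        .map(|i| thread::spawn(move || println!("thread {i} finished")))
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```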
Things you should know about Url in Rust
I would guess most developers think of URLs as strings with https:// at the beginning. In many cases assumptions are made about these URL-shaped strings which may be confusing, misleading, or flat out incorrect. The url crate is compliant with the RFCs about URLs, and while being technically correct is the best kind of correct, that doesn’t mean it isn’t confusing.
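One concrete example of the confusion, sketched with the url crate (the URLs here are made up for illustration): join() treats a base with a trailing slash differently than one without, exactly as RFC 3986 prescribes.

```rust
use url::Url;

fn main() {
    // Without a trailing slash, the last path segment gets replaced...
    let base = Url::parse("https://example.com/api/v1").unwrap();
    assert_eq!(
        base.join("users").unwrap().as_str(),
        "https://example.com/api/users"
    );

    // ...but with a trailing slash, it is kept.
    let base = Url::parse("https://example.com/api/v1/").unwrap();
    assert_eq!(
        base.join("users").unwrap().as_str(),
        "https://example.com/api/v1/users"
    );
}
```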
Improving performance with the log crate
On a small crate I maintain, a friendly stranger made a suggestion to improve performance by making logging optional.
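The general shape of that idea, as a hedged sketch rather than the crate’s actual patch: put the log dependency behind a cargo feature (say, a hypothetical “logging” feature with log declared as an optional dependency) and route log calls through a macro that compiles to nothing when the feature is off.

```rust
// With the (hypothetical) "logging" feature enabled, forward to log::debug!.
#[cfg(feature = "logging")]
macro_rules! debug {
    ($($arg:tt)*) => { log::debug!($($arg)*) };
}

// Without it, the macro expands to nothing: no formatting, no branching.
#[cfg(not(feature = "logging"))]
macro_rules! debug {
    ($($arg:tt)*) => {};
}

pub fn hot_path(n: u64) -> u64 {
    debug!("processing {n}");
    n.wrapping_mul(2)
}
```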
The end of the road for kafka-delta-ingest
After five years in production, kafka-delta-ingest at Scribd has been shut off and removed from our infrastructure. kafka-delta-ingest was the motivation behind my team creating delta-rs, the most successful open source project I have started to date. With kafka-delta-ingest we achieved our original stated goals and reduced streaming data ingestion costs by 95%. In the time since, however, we have further reduced that cost with even more efficient infrastructure.
R.I.P. S3 Object Lambda
Did you know that AWS S3 is almost 20 years old? The “cloud” as a concept is fairly recent but in the time-distortion that has occurred since the rise of the internet, I think many of us have lost track of how old some of these public cloud providers are, and as a side-effect, how old their technology offerings can become. Periodically you need to clean out the attic, and this week AWS did just that with their “AWS Service Availability Updates.”
Sacrifice to AI
What a wild time to be alive. It’s really quite something. How wonderful it is to have a phrase like “what a wild time to be alive” that could mean a dozen different moderately positive or extremely negative things depending on where in your news or social feed you find this article.
Delta Lake Live!
Every Tuesday morning at 7am I have a date.
Introducing recoil, the highly sophisticated AI honeypot
Abusive traffic from AI-based bots and applications is becoming more prevalent, which is why I’m thrilled to introduce the general availability of recoil. Recoil is a highly sophisticated honeypot which can serve a never-ending stream of data to abusive traffic.
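For flavor only (this is my own minimal sketch, not recoil’s actual implementation), a honeypot of this shape boils down to accepting connections and writing bytes until the misbehaving client gives up:

```rust
use std::io::Write;
use std::net::TcpListener;
use std::thread;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8080")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        thread::spawn(move || {
            let chunk = [b'z'; 1024];
            // Keep feeding the client junk until it disconnects.
            while stream.write_all(&chunk).is_ok() {}
        });
    }
    Ok(())
}
```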
Your cargo workspace has a bug, no it's a feature!
Rust has a useful concept of “features” baked into its packaging tool cargo
which allows developers to optionally toggle functionality on and off. In a
simple project features are simple, as you would expect. In more complex
projects which use cargo
workspaces the
behavior of features becomes much more complicated and in some cases… surprising!
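To sketch the classic surprise with a hypothetical shared dependency exposing a feature named “extra”: features are additive and unified within a single build, so if workspace member a enables “extra” and member b does not, a workspace-wide cargo build compiles the shared crate once with “extra” on, and b silently gets that behavior too.

```rust
// Inside the hypothetical shared dependency: behavior toggled by a feature.
#[cfg(feature = "extra")]
pub fn transform(x: u64) -> u64 {
    x + 1 // the "extra" behavior that member b never asked for
}

#[cfg(not(feature = "extra"))]
pub fn transform(x: u64) -> u64 {
    x
}
```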
The thing about appendable objects in S3
Storing bytes at scale is never as simple as we lead ourselves to believe. The concept of files, or in the cloud “objects”, is a useful metaphor for approximating reality, but it’s not actually reality. As I have fallen deeper and deeper down the rabbit hole, my mental model of what storage really is has been challenged at every turn.
sccache is pretty okay
I have been using sccache to improve feedback loops with large Rust projects
and it has been going okay but it hasn’t been the silver bullet I was hoping
for. sccache can be easily dropped into
any Rust project as a wrapper around rustc, the Rust
compiler, and it will perform caching of intermediate build artifacts. As
dependencies are built, their object files are cached, locally or remotely, and
can be re-used on future compilations. sccache also supports distributed
compilation which can compile those objects on different computers, pulling the
object files back for the final result. I had initially hoped that sccache
would solve all my compile performance problems, but surprising to nobody,
there are some caveats.
Jamming on Google Meet with Pulseaudio
For an upcoming hack week I wanted to have some live jam sessions with colleagues on a video call. Mostly I wanted some background music we could listen to while we hacked together, occasionally discussing our work, etc. I don’t normally use Pulseaudio in anger but it seemed like the closest and potentially simplest solution.
The AI Coding Margin Squeeze
Words cannot express how excited I am for the coming margin squeeze on every “AI company” that isn’t Anthropic, OpenAI, Microsoft, or Google. The entire industry is built on an unethical foundation, having illegitimately acquired massive amounts of content from practically everybody. The companies selling “AI Coding Assistants” I am particularly excited to see implode.
The last data file format
The layers of abstraction in most technology stacks have gotten incredibly deep over the last decade. At some point, way down there in the depths of most data applications, somebody somewhere has to actually read or write bytes to storage. The flexibility of Apache Parquet has me increasingly convinced that it just might be the last data file format I will need.
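To ground that in what reading and writing those bytes looks like, here is a minimal sketch of writing a Parquet file from Rust with the arrow and parquet crates (the file name and one-column schema are mine, for illustration):

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A trivial one-column schema and a tiny batch of rows.
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )?;

    // ArrowWriter handles encoding, compression, and the file footer.
    let file = File::create("example.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```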
Save the world, write more efficient code
Large Language Models have made the relationship between software efficiency and environmentalism click for many people in the technology field. The cost of computing matters.