<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://brokenco.de//feed/by_tag/arrow.xml" rel="self" type="application/atom+xml" /><link href="https://brokenco.de//" rel="alternate" type="text/html" /><updated>2026-04-12T21:39:52+00:00</updated><id>https://brokenco.de//feed/by_tag/arrow.xml</id><title type="html">rtyler</title><subtitle>a moderately technical blog</subtitle><author><name>R. Tyler Croy</name></author><entry><title type="html">2026 March: Recently Studied Stuff</title><link href="https://brokenco.de//2026/03/21/fresh-from-rss.html" rel="alternate" type="text/html" title="2026 March: Recently Studied Stuff" /><published>2026-03-21T00:00:00+00:00</published><updated>2026-03-21T00:00:00+00:00</updated><id>https://brokenco.de//2026/03/21/fresh-from-rss</id><content type="html" xml:base="https://brokenco.de//2026/03/21/fresh-from-rss.html"><![CDATA[<p>Over the past week I have made a more conscious effort to keep track of some
really interesting articles that came through my feed reader. I am a big fan of
the open web and the power of RSS for disseminating interesting information
from actual people. Below are some of the posts I have enjoyed the most recently!</p>

<p><strong><a href="https://felipe.rs/2024/10/23/arrow-over-http/">Compressed Apache Arrow tables over HTTP</a></strong></p>

<p>When discussing transport protocols for sending data between services at work
recently, a colleague asked “why can’t we just yeet Arrow over HTTP?” It turns out you <a href="https://github.com/apache/arrow-experiments/tree/main/http/get_simple/python">absolutely can</a>, and Arrow IPC streams even have a registered MIME type:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Content-Type: application/vnd.apache.arrow.stream
</code></pre></div></div>
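
<p>As a rough sketch of what producing such a response body could look like in Rust
(the HTTP wiring is left out, and treat the <code class="language-plaintext highlighter-rouge">arrow</code> crate usage here as an
illustration rather than a recipe):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

// Serialize a RecordBatch into an Arrow IPC stream suitable for an HTTP
// response body sent with Content-Type: application/vnd.apache.arrow.stream
fn arrow_ipc_body() -&gt; arrow::error::Result&lt;Vec&lt;u8&gt;&gt; {
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;

    let mut body = Vec::new();
    {
        let mut writer = StreamWriter::try_new(&amp;mut body, &amp;schema)?;
        writer.write(&amp;batch)?;
        writer.finish()?;
    }
    Ok(body)
}
</code></pre></div></div>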

<p><strong><a href="https://blog.dataexpert.io/p/parquet-can-shrink-your-data-100x">Understanding Parquet format for beginners</a></strong></p>

<p>A great introduction to the <a href="https://parquet.apache.org">Apache Parquet</a> format
and why it makes so many things better with large data storage systems like
<a href="https://delta.io">Delta Lake</a>. I have written on this
<a href="/tag/parquet.html">topic</a> before and encourage you to take another read
through <a href="https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/">this blog
post</a>
by some maintainers of the <a href="https://crates.io/crates/parquet">parquet</a> crate.</p>

<p><strong><a href="https://apenwarr.ca/log/20260316">Every layer of review makes you 10x slower</a></strong></p>

<blockquote>
  <p>Every layer of approval makes a process 10x slower [..]</p>

  <p>Just to be clear, we’re counting “wall clock time” here rather than effort. Almost all the extra time is spent sitting and waiting.</p>

  <ul>
    <li>Code a simple bug fix: 30 minutes</li>
    <li>Get it code reviewed by the peer next to you: 300 minutes → 5 hours → half a day</li>
    <li>Get a design doc approved by your architects team first: 50 hours → about a week</li>
    <li>Get it on some other team’s calendar to do all that (for example, if a customer requests a feature): 500 hours → 12 weeks → one fiscal quarter</li>
  </ul>
</blockquote>

<p>This inspired these thoughts which I shared with the <a href="https://github.com/delta-io/delta-rs">delta-rs</a> community:</p>

<p>“what if we didn’t require code review for merging into main”</p>

<p>I’m exploring what we might need in place to make that happen.
“Why would you do such a thing, code review is so valuable!” I do find code
review valuable, but we seem to lose a lot of flow time to timezones, differing
work schedules, and a number of other things. For small changes, especially bug
fixes that come with tests, I would be much more comfortable with maintainers
merging once CI goes green.</p>

<p>Some pieces of the puzzle that I think would be needed:</p>

<ul>
  <li>Soft caps on pull requests. I saw this mentioned somewhere else, but implementing a soft cap of &lt;500 lines per pull request helps people avoid massive unreviewable changes and keeps each change simple to integrate.</li>
  <li>Incorporating into CI some of the benchmarking work that has already been explored. If performance of key operations is not affected and the build is green, go for it.</li>
  <li>Stronger semantic version checks: if our APIs have not changed and all tests pass, I’m generally comfortable with maintainers landing stuff.</li>
  <li>Implementing Apache Software Foundation style release candidates and voting: this is where we would put a mandatory bottleneck. Rather than the jokey Slack emojis I tend to use, a true release candidate process would require review and a vote before we push anything to users.</li>
</ul>

<p>All of this is to say that reviews can still be requested, but I would love to
see us land more improvements faster, and our mix of schedules can make pushing
every change through a review queue a lot slower than necessary.</p>

<p><strong><a href="https://www.possiblerust.com/pattern/conditional-impls">Conditional Impls in Rust</a></strong></p>

<blockquote>
  <p>It’s possible in Rust to conditionally implement methods and traits based on
the traits implemented by a type’s own type parameters. While this is used
extensively in Rust’s standard library, it’s not necessarily obvious that
this is possible.</p>
</blockquote>
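
<p>A quick illustration of the pattern described above, using a made-up
<code class="language-plaintext highlighter-rouge">Wrapper</code> type: the extra method and trait impl only exist when the type
parameter satisfies the bound.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::fmt::Display;

struct Wrapper&lt;T&gt; {
    value: T,
}

// Available for every Wrapper&lt;T&gt;, no bounds required.
impl&lt;T&gt; Wrapper&lt;T&gt; {
    fn new(value: T) -&gt; Self {
        Wrapper { value }
    }
}

// This method only exists when T itself implements Display.
impl&lt;T: Display&gt; Wrapper&lt;T&gt; {
    fn describe(&amp;self) -&gt; String {
        format!("wrapping {}", self.value)
    }
}

// Likewise, Wrapper&lt;T&gt; is only Display when T is Display, much like the
// standard library only implements Clone for Option&lt;T&gt; when T: Clone.
impl&lt;T: Display&gt; Display for Wrapper&lt;T&gt; {
    fn fmt(&amp;self, f: &amp;mut std::fmt::Formatter&lt;'_&gt;) -&gt; std::fmt::Result {
        write!(f, "Wrapper({})", self.value)
    }
}

fn main() {
    let w = Wrapper::new(42);
    println!("{}", w.describe()); // i32 is Display, so this compiles
    // Wrapper::new(vec![1, 2, 3]).describe(); // would not compile: Vec&lt;i32&gt; is not Display
}
</code></pre></div></div>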

<p>I have been vaguely aware of this functionality but haven’t really taken the
time to consider it, so I really appreciated this post walking through the
conditional impl functionality in Rust.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rss" /><category term="arrow" /><category term="parquet" /><category term="rust" /><summary type="html"><![CDATA[Over the past week I have made a more conscious effort to keep track of some really interesting articles that came through my feed reader. I am a big fan of the open web and the power of RSS for disseminating interesting information from actual people. Below are some really interesting posts I have read recently!]]></summary></entry><entry><title type="html">Based Lake, a petabyte-scale low-latency data lake</title><link href="https://brokenco.de//2026/03/10/based-lake.html" rel="alternate" type="text/html" title="Based Lake, a petabyte-scale low-latency data lake" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://brokenco.de//2026/03/10/based-lake</id><content type="html" xml:base="https://brokenco.de//2026/03/10/based-lake.html"><![CDATA[<p>I had a chat today about building large scale low-latency data retrieval
systems around AWS S3. In doing so I got to share a bit of the talk proposal I
submitted to <a href="https://dataaisummit.com">Data and AI Summit</a> this year about
real-life work that has made it into production.</p>

<p>For years the conventional wisdom around <a href="https://delta.io">Delta Lake</a> has
been to <strong>not</strong> connect user-facing/online systems to Delta tables. Basically,
don’t point your Django app at your Delta tables. This continues to be a decent
<em>guideline</em> but definitely <strong>not a rule</strong>, and I have the performance data to
back that up.</p>

<p>My talk abstract:</p>

<blockquote>
  <p>Scribd hosts hundreds of millions of documents and has hundreds of billions of
objects across our buckets. Combining large-language models with massive
amounts of text has required investment in our new Content Library
architecture.  We selected Delta Lake as the underlying storage technology but
have pushed it to an extreme. Using the same Delta Lake architecture we offer
both direct data access for data scientists in Databricks Notebooks and online
data retrieval in milliseconds for user-facing web services.</p>

  <p>In this talk we will review principles of performance for each layer of the
stack: web APIs, the Delta Lake tables, Apache Parquet, and AWS S3.</p>
</blockquote>

<p>The work my colleague Eugene and I have done in this area builds heavily on
my previous research around <a href="/2025/06/24/low-latency-parquet.html">Low latency Parquet
reads</a>, which informed the work named <a href="https://tech.scribd.com/blog/2026/content-crush.html">Content
Crush</a> that I have
explored further on the Scribd tech blog and on the <a href="/2026/02/13/screaming-in-the-cloud.html">Screaming in the
Cloud</a> podcast.</p>

<p>I really hope that I am able to share results at Data and AI Summit from this
incredibly challenging work that I am undertaking. But even if I don’t, blog
posts like my musings on <a href="/2026/01/19/multimodal-delta-lake.html">Multimodal with Delta
Lake</a>, <a href="https://www.buoyantdata.com/blog/2024-12-31-high-concurrency-logstore.html">scaling streaming Delta Lake
applications</a>,
and a myriad of other articles I have published can be pieced together to form
the larger mosaic of insane large-scale data work I have been hammering on!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="arrow" /><category term="parquet" /><category term="deltalake" /><category term="databricks" /><category term="scribd" /><summary type="html"><![CDATA[I had a chat today about building large scale low-latency data retrieval systems around AWS S3. In doing so I got to share a bit of the talk proposal I submitted to Data and AI Summit this year about real-live work that has made it into production.]]></summary></entry><entry><title type="html">Low latency Parquet reads</title><link href="https://brokenco.de//2025/06/24/low-latency-parquet.html" rel="alternate" type="text/html" title="Low latency Parquet reads" /><published>2025-06-24T00:00:00+00:00</published><updated>2025-06-24T00:00:00+00:00</updated><id>https://brokenco.de//2025/06/24/low-latency-parquet</id><content type="html" xml:base="https://brokenco.de//2025/06/24/low-latency-parquet.html"><![CDATA[<p>The Apache Parquet file format has become the de facto standard for large data
systems but increasingly I find that most data engineers are not aware of <em>why</em>
it has become so popular. The format is <em>especially</em> interesting when taken
together with most cloud-based object storage systems, where some design
decisions allow for subsecond or millisecond latencies for Parquet readers.</p>

<p>In the cloud computing environment: <strong>efficiency wins</strong>. Hyperscalers make
money from renting you resources by the unit of time; the fewer resources and less
time your workload requires, the lower the cost. A <a href="https://aws.amazon.com/lambda/">Lambda
function</a> which runs in 1 second compared to 5
seconds is going to cost 80% less. At small scales this is often
inconsequential but with sufficient volume it makes a big difference. For
example, at 1 invocation per second the longer function costs ~$431/month
compared to ~$81/month.</p>

<p>I have been working on a project exploring new and novel use-cases for <a href="https://parquet.apache.org">Apache
Parquet</a>, the file format which underpins the
<a href="https://delta.io">Delta Lake</a> storage protocol. 
My work uses <code class="language-plaintext highlighter-rouge">.parquet</code> files smaller than 50MB in size and ultimately
<em>latency</em> is the biggest concern. When retrieving data from any data
service there is always a fixed cost of overhead regardless of the data
transferred. Retrieving a 1MB object or a 1GB object still requires locating
and loading the data from storage, validating authentication
credentials/headers, and then constructing a request stream.</p>

<p>Working in this domain I have discussed challenges with <a href="https://github.com/alamb">Andrew
Lamb</a> who has been doing similarly interesting
explorations at InfluxData. His work builds on what he and Raphael outlined in their 2022 post:
<a href="https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/"><strong>Querying Parquet with Millisecond
Latency</strong></a></p>

<p><em>Meanwhile</em> Databricks also <a href="https://thenewstack.io/lakebase-is-databricks-fully-managed-postgres-database-for-the-ai-era/">released
Lakebase</a>,
which I am confident is also utilizing Apache Parquet for similar retrieval
patterns for their PostgreSQL engine.</p>

<p>Somewhere way down the data stack we are all trying to squeeze as much out of
Parquet and S3 as possible.</p>

<hr />

<p>Because of my work on the
<a href="https://github.com/delta-io/delta-rs">delta-rs</a> project, I am quite familiar
with the <a href="https://github.com/apache/parquet-format?tab=readme-ov-file#file-format">Parquet file
format</a>
and the ways in which it can be read and written. 
I need to read <code class="language-plaintext highlighter-rouge">.parquet</code> files in an extremely low-latency
environment with worst-case performance around the 100ms mark. I picked up two
foundational dependencies of delta-rs: the
<a href="https://crates.io/crates/parquet">parquet</a> and
<a href="https://crates.io/crates/object_store">object_store</a> crates, and dove into the <strong>Parquet file format</strong>:</p>

<p><img src="/images/post-images/2025-05-parquet/parquet-format.gif" alt="Parquet File Format" /></p>

<p>The <code class="language-plaintext highlighter-rouge">.parquet</code> file has a “footer” which contains practically all the useful
metadata for understanding the file, with the last eight bytes holding the
footer length and the <code class="language-plaintext highlighter-rouge">PAR1</code> magic marker. This is largely useless trivia until you learn that most
object stores like AWS S3 allow for <code class="language-plaintext highlighter-rouge">Range</code> headers on the
<a href="https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html#API_GetObject_RequestSyntax">GetObject</a>
call with a <em>negative byte range</em>. For a large <code class="language-plaintext highlighter-rouge">.parquet</code> file you can retrieve
the last eight bytes with <code class="language-plaintext highlighter-rouge">Range: bytes=-8</code>, which tells you the footer length, then
fetch the footer itself with <code class="language-plaintext highlighter-rouge">Range: bytes=-&lt;footer length&gt;</code>, and at that point you
understand practically everything about the file! Those <code class="language-plaintext highlighter-rouge">Range</code> requests
even allow you to fetch individual row groups, a <em>hugely</em> beneficial
performance optimization when working with large <code class="language-plaintext highlighter-rouge">.parquet</code> files.</p>
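
<p>To make that concrete, here is a rough sketch of those two suffix-range requests
using the <code class="language-plaintext highlighter-rouge">object_store</code> crate (assuming a version with <code class="language-plaintext highlighter-rouge">GetRange::Suffix</code>; this is
illustrative rather than production code):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use object_store::{path::Path, GetOptions, GetRange, ObjectStore};

// Hypothetical helper: read the Parquet footer using only suffix Range requests.
async fn read_footer(
    store: &amp;dyn ObjectStore,
    location: &amp;Path,
) -&gt; object_store::Result&lt;Vec&lt;u8&gt;&gt; {
    // Request 1: the final eight bytes, a 4-byte little-endian footer length
    // followed by the 4-byte "PAR1" magic marker.
    let tail = store
        .get_opts(
            location,
            GetOptions { range: Some(GetRange::Suffix(8)), ..Default::default() },
        )
        .await?
        .bytes()
        .await?;
    let footer_len = u32::from_le_bytes(tail[0..4].try_into().unwrap()) as u64;

    // Request 2: the Thrift-encoded footer metadata plus those trailing eight bytes.
    let footer = store
        .get_opts(
            location,
            GetOptions { range: Some(GetRange::Suffix(footer_len + 8)), ..Default::default() },
        )
        .await?
        .bytes()
        .await?;
    Ok(footer.to_vec())
}
</code></pre></div></div>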

<p>Fortunately for everybody, this is <em>exactly</em> what
<a href="https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.ParquetObjectReader.html">ParquetObjectReader</a>
does! From the perspective of the underlying <code class="language-plaintext highlighter-rouge">ObjectStore</code> implementation the call flow is:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">get_opts(Range(-8))</code></li>
  <li><code class="language-plaintext highlighter-rouge">get_opts(Range(-&lt;footerlen&gt;))</code></li>
  <li><code class="language-plaintext highlighter-rouge">get_ranges(*row-groups)</code></li>
</ul>
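
<p>In practice, reading straight out of object storage looks something like the
sketch below (constructor signatures have shifted between <code class="language-plaintext highlighter-rouge">parquet</code> crate versions,
so treat this as illustrative):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::sync::Arc;

use futures::TryStreamExt;
use object_store::{path::Path, ObjectStore};
use parquet::arrow::async_reader::{ParquetObjectReader, ParquetRecordBatchStreamBuilder};

// Stream record batches out of object storage, letting the reader issue the
// suffix/range requests described above.
async fn read_batches(
    store: Arc&lt;dyn ObjectStore&gt;,
    location: Path,
) -&gt; Result&lt;(), Box&lt;dyn std::error::Error&gt;&gt; {
    // Older parquet crate versions want an ObjectMeta up front; newer ones can
    // take the path directly and discover the size with a suffix request.
    let meta = store.head(&amp;location).await?;
    let reader = ParquetObjectReader::new(store, meta);

    let stream = ParquetRecordBatchStreamBuilder::new(reader).await?.build()?;
    let batches: Vec&lt;_&gt; = stream.try_collect().await?;
    println!("read {} batches", batches.len());
    Ok(())
}
</code></pre></div></div>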

<p>For large <code class="language-plaintext highlighter-rouge">.parquet</code> files, hundreds of MBs or GBs, this approach works very
well for most processing engines, where having less data to deserialize and process
means tangible performance gains. In fact, I have it on good authority that
this approach is how the Databricks Photon engine’s <a href="https://docs.databricks.com/aws/en/optimizations/predictive-io">predictive
I/O</a> squeezes
even more query performance out of Apache Parquet.</p>

<p>For me, however, each request to S3 in the list above carries roughly 30ms of
overhead and they <em>must</em> be executed sequentially, which means three requests have
a <em>worst-case</em> latency of 90ms.</p>

<p>Hinting at a rough approximation of the footer size can eliminate one of the two
metadata calls, bringing the worst case down to 60ms. Accessing relevant data in
under 70-80ms is <em>good</em> but not great.</p>
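
<p>If the <code class="language-plaintext highlighter-rouge">parquet</code> crate in use exposes a footer size hint on the reader (which I
believe recent versions do), that hint looks roughly like this:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::sync::Arc;

use object_store::{path::Path, ObjectStore};
use parquet::arrow::async_reader::ParquetObjectReader;

// Sketch: over-estimate the footer size so the length probe and the metadata
// fetch can collapse into a single suffix request.
async fn hinted_reader(
    store: Arc&lt;dyn ObjectStore&gt;,
    location: &amp;Path,
) -&gt; Result&lt;ParquetObjectReader, Box&lt;dyn std::error::Error&gt;&gt; {
    let meta = store.head(location).await?;
    // 64 KiB is a guess; anything comfortably larger than the real footer works.
    Ok(ParquetObjectReader::new(store, meta).with_footer_size_hint(64 * 1024))
}
</code></pre></div></div>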

<p>Andrew and Raphael’s blog post <a href="https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/"><strong>Querying Parquet with Millisecond
Latency</strong></a>
is <em>full</em> of useful approaches for reducing query and processing time. At some
point however you hit the wall of fundamental performance overhead of the
object store itself.</p>

<p>I have hit that wall.</p>

<p>The options available in front of me are:</p>

<ol>
  <li>consider novel data structures <em>inside</em> the Parquet file</li>
  <li>secondary indices outside of the Parquet file</li>
  <li>aggressive caching strategies</li>
</ol>

<p>I’m not thrilled with <em>any</em> of them, though I have already utilized hacks from
#1 with Parquet data layout changes.</p>

<p>As frustrating as a problem that may genuinely be unsolvable can be, it has
been a lot of fun discussing strategies with folks at cloud providers, other
companies, and the open source community on how to squeeze every last bit of
performance out of Apache Parquet and cloud storage.</p>

<p>I might have to make peace with 60ms of latency, but not just yet.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="arrow" /><category term="parquet" /><category term="rust" /><summary type="html"><![CDATA[The Apache Parquet file format has become the de facto standard for large data systems but increasingly I find that most data engineers are not aware of why it has become so popular. The format is interesting especially when taken together with most cloud-based object storage systems, where some design decisions allow for subsecond or millisecond latencies for parquet readers.]]></summary></entry></feed>