<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://brokenco.de//feed/by_tag/databricks.xml" rel="self" type="application/atom+xml" /><link href="https://brokenco.de//" rel="alternate" type="text/html" /><updated>2026-04-12T21:39:52+00:00</updated><id>https://brokenco.de//feed/by_tag/databricks.xml</id><title type="html">rtyler</title><subtitle>a moderately technical blog</subtitle><author><name>R. Tyler Croy</name></author><entry><title type="html">Based Lake, a petabyte-scale low-latency data lake</title><link href="https://brokenco.de//2026/03/10/based-lake.html" rel="alternate" type="text/html" title="Based Lake, a petabyte-scale low-latency data lake" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://brokenco.de//2026/03/10/based-lake</id><content type="html" xml:base="https://brokenco.de//2026/03/10/based-lake.html"><![CDATA[<p>I had a chat today about building large scale low-latency data retrieval
systems around AWS S3. In doing so I got to share a bit of the talk proposal I
submitted to <a href="https://dataaisummit.com">Data and AI Summit</a> this year about
real-life work that has made it into production.</p>

<p>For years the conventional wisdom around <a href="https://delta.io">Delta Lake</a> has
been to <strong>not</strong> connect user-facing/online systems to Delta tables. Basically,
don’t point your Django app at your Delta tables. This continues to be a decent
<em>guideline</em> but definitely <strong>not a rule</strong> and I have the performance data to
back that up.</p>
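
<p>To make this concrete, here is a minimal sketch of what an “online” read
against a Delta table can look like from Python using the
<a href="https://github.com/delta-io/delta-rs">delta-rs</a> bindings. The table
URI, column name, and filter are purely illustrative, and a real service would
cache table metadata between requests:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pyarrow.dataset as ds
from deltalake import DeltaTable

# Hypothetical table URI, for illustration only
table = DeltaTable("s3://example-bucket/content-library")

# Partition and file-statistics pruning keep the number of Parquet
# pages actually fetched from S3 small for a point lookup
rows = (table.to_pyarrow_dataset()
             .to_table(filter=(ds.field("document_id") == 42)))
print(rows.to_pydict())
</code></pre></div></div>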

<p>My talk abstract:</p>

<blockquote>
  <p>Scribd hosts hundreds of millions of documents and has hundreds of billions of
objects across our buckets. Combining large-language models with massive
amounts of text has required investment in our new Content Library
architecture.  We selected Delta Lake as the underlying storage technology but
have pushed it to an extreme. Using the same Delta Lake architecture we offer
both direct data access for data scientists in Databricks Notebooks and online
data retrieval in milliseconds for user-facing web services.</p>

  <p>In this talk we will review principles of performance for each layer of the
stack: web APIs, the Delta Lake tables, Apache Parquet, and AWS S3.</p>
</blockquote>

<p>The work my colleague Eugene and I have done in this area builds heavily
on my previous research around <a href="/2025/06/24/low-latency-parquet.html">Low latency Parquet
reads</a>, which informed the work named <a href="https://tech.scribd.com/blog/2026/content-crush.html">Content
Crush</a>. I have
explored that work further on the Scribd tech blog and on the <a href="/2026/02/13/screaming-in-the-cloud.html">Screaming in the
Cloud</a> podcast.</p>

<p>I really hope that I am able to share results at Data and AI Summit from this
incredibly challenging work that I am undertaking. But even if I don’t, blog
posts like my musings on <a href="/2026/01/19/multimodal-delta-lake.html">Multimodal with Delta
Lake</a>, <a href="https://www.buoyantdata.com/blog/2024-12-31-high-concurrency-logstore.html">scaling streaming Delta Lake
applications</a>,
and a myriad of other articles I have published can be pieced together to form
the larger mosaic of insane large-scale data work I have been hammering on!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="arrow" /><category term="parquet" /><category term="deltalake" /><category term="databricks" /><category term="scribd" /><summary type="html"><![CDATA[I had a chat today about building large scale low-latency data retrieval systems around AWS S3. In doing so I got to share a bit of the talk proposal I submitted to Data and AI Summit this year about real-life work that has made it into production.]]></summary></entry><entry><title type="html">On Data Engineering Central</title><link href="https://brokenco.de//2026/02/04/data-engineering-central.html" rel="alternate" type="text/html" title="On Data Engineering Central" /><published>2026-02-04T00:00:00+00:00</published><updated>2026-02-04T00:00:00+00:00</updated><id>https://brokenco.de//2026/02/04/data-engineering-central</id><content type="html" xml:base="https://brokenco.de//2026/02/04/data-engineering-central.html"><![CDATA[<p>I was lucky enough to <a href="https://dataengineeringcentral.substack.com/p/the-lakehouse-architecture-multimodal">record a podcast
episode</a>
with Daniel Beach of Data Engineering Central. Daniel and I have known each
other for a couple years sharing notes and ideas on the state of the ecosystem,
where it falls down, and where things are getting interesting.</p>

<p>In my opinion <a href="https://dataengineeringcentral.substack.com">Data Engineering
Central</a> has been one of the most
useful wide-ranging surveys of the ecosystem, curated by one crazy
mid-westerner: Daniel. He pulls no punches, and while we share criticisms of AI
in the industry and commercial tools, Daniel’s honesty has also put some of my
work on blast, such as <a href="https://dataengineeringcentral.substack.com/p/_internaldeltaprotocolerror">this
post</a>
about some terrible user-experience and lopsided Delta Lake support in
<a href="https://github.com/delta-io/deltars">delta-rs</a>.</p>

<p>In his post Daniel highlights some of the topics we got into during our time chatting:</p>

<blockquote>
  <ul>
    <li>What the Lakehouse architecture gets right—and where it still falls short</li>
    <li>Why multimodal data (text, images, audio, video, embeddings) changes everything</li>
    <li>How open table formats like Delta Lake fit into the next generation of data platforms</li>
    <li>The growing gap between data tooling hype and day-to-day data engineering reality</li>
    <li>What skills and architectural thinking will matter most for data engineers over the next decade</li>
  </ul>
</blockquote>

<center><iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/WLlko-liHMg?si=9aGp1v-6nm2kbya0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe></center>

<p>I encourage you to <a href="https://dataengineeringcentral.substack.com/">subscribe</a> to
his newsletter or if that’s not your jam, you can <a href="https://dataengineeringcentral.substack.com/feed">subscribe to the RSS
feed</a> too.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="software" /><category term="dataeng" /><category term="buoyantdata" /><category term="databricks" /><category term="podcast" /><summary type="html"><![CDATA[I was lucky enough to record a podcast episode with Daniel Beach of Data Engineering Central. Daniel and I have known each other for a couple years sharing notes and ideas on the state of the ecosystem, where it falls down, and where things are getting interesting.]]></summary></entry><entry><title type="html">From the beginning, delta-rs to Delta Lake: The Definitive Guide</title><link href="https://brokenco.de//2024/11/15/deltalake-the-definitive-guide.html" rel="alternate" type="text/html" title="From the beginning, delta-rs to Delta Lake: The Definitive Guide" /><published>2024-11-15T00:00:00+00:00</published><updated>2024-11-15T00:00:00+00:00</updated><id>https://brokenco.de//2024/11/15/deltalake-the-definitive-guide</id><content type="html" xml:base="https://brokenco.de//2024/11/15/deltalake-the-definitive-guide.html"><![CDATA[<p>Nothing quite feels like “I made it!” like being <em>published</em>. Which is why I am
thrilled to share that <a href="https://bookshop.org/p/books/delta-lake-the-definitive-guide-modern-data-lakehouse-architectures-with-data-lakes-denny-lee/21429337?ean=9781098151942">Delta Lake: The Definitive
Guide</a>
is available for purchase, and I kind of helped! I wanted to share a little bit
about how my contributions (Chapter 6!) came about, because my entrance into
the <a href="https://delta.io">Delta Lake</a> ecosystem was about as unplanned as my
authorship of part of this wonderful book.</p>

<p>The <a href="https://github.com/delta-io/delta-rs">delta-rs</a> project started in 2020, and I wish that I could say it was because
I am a brilliant visionary. The project largely started because I have had a
bias against JVM-based technology stacks and I had stepped into a role at
<a href="https://tech.scribd.com">Scribd</a> where we were migrating to AWS, Databricks,
and a new architecture <em>anyways</em> so why not challenge the orthodoxy? My
colleague <a href="https://about.houqp.me/">QP Hou</a> and I were loving Rust and liked
Delta Lake from a design standpoint, but did not love <a href="https://spark.apache.org">Apache
Spark</a> for some of the things we needed to do.</p>

<p>I would consider the official start of the project to be April 11th, 2020 when
I sent our Databricks colleagues the following:</p>

<hr />

<p>Greetings! As I mentioned in our weekly sync up this week, we have an interest
in partnering with Databricks to develop and open source a native client
interface for Delta Lake.</p>

<p>For framing this conversation and scope of the native interface, I categorize
our compute workloads into three groups:</p>

<ol>
  <li><strong>Big offline data processing</strong>, requiring a cluster of compute resources where Spark makes a big dent.</li>
  <li><strong>Lightweight/small offline data processing</strong>, workloads needing “fractional
compute” resources, basically less than a single machine. (Ruby/Python type
tasks which move data around, or perform small-scale data accesses make up
the majority of these in our current infrastructure, we’ve discussed using
the Databricks Light runtime for these in the past, since the cost to
deploy/run these small tasks on Databricks clusters doesn’t make sense).</li>
  <li><strong>Boundary data-processing</strong>, where the task might involve a little bit of
production “online” data and a little bit of warehouse “offline” data to
complete its work. In our environment we have Ruby scripts whose sole job is
to sync pre-computed (by Spark) offline data into online data stores for the
production Rails application, etc, to access and serve.</li>
</ol>

<p>I don’t want to burn down our current investment in Ruby for many of the 2nd
and 3rd workloads, not to mention retraining a number of developers in-house to
learn how to effectively use Scala or pySpark.</p>

<p>My proposal is that we partner with Databricks and jointly develop an open
source client interface for Delta Lake. One where we would have at least one
developer from Databricks working with at least one developer from Scribd on a
jointly scoped effort to deliver a library capable of <em>initially</em> addressing
our ‘2’ and ‘3’ use-cases.</p>

<p>[..]</p>

<p>Further, I propose that we jointly develop a client interface in Rust, which
will allow us to easily extend that within the Databricks community to support
Golang, Python, Ruby, and Node clients.</p>

<p>The key benefits I imagine for us all:</p>

<ul>
  <li>
    <p>Much broader market share for Delta Lake as a technology. Not only would
companies like Scribd benefit, and continue to invest in Delta Lake, but
other companies would have an easier on-ramp into the Databricks ecosystem.
Basically, if you start using Delta Lake before you use Spark, you will (I
guarantee) reach a point where these lightweight workloads become heavyweight
workloads requiring the full power and glory of the Databricks runtime :D</p>
  </li>
  <li>
    <p>It’s a fantastic developer advocacy story that hits a number of key bullet
marketing points: open source, partner collaboration, Rust (so hot right now) :)</p>
  </li>
  <li>
    <p>Scribd is able to “immediately” take advantage of Delta Lake benefits without
burning up all our existing codebase and investment in Ruby tasks and
tooling. Thereby allowing for an easier onramp into Delta Lake and the
Databricks platform as a whole.</p>
  </li>
</ul>

<p>The scope of the effort I think would be largely around properly dealing with
the transaction log, since the Apache Arrow project has already created a
pretty decent <a href="https://crates.io/crates/parquet">parquet crate</a> in Rust. That
said, there may be some writer improvements we’d want/need to push upstream to
Apache Arrow to make this successful.</p>

<hr />

<p>Looking back, almost all of this has come true! What a brilliant sage! (plz clap)</p>

<p>Like many advancements, there’s a right time, a right place, and a right group
of people. Unfortunately Databricks didn’t join the party until later on, but
they were a strong supporter of our initial work, providing guidance and helping to
make <a href="https://delta.io">Delta Lake</a> an ever-more thriving open source
community. The right people were all converging on the direction that made
this possible: <a href="https://github.com/nevi-me">Neville</a> helped make
<a href="https://github.com/apache/arrow-rs">arrow-rs</a> a much better <a href="https://parquet.apache.org">Apache
Parquet</a> writer. QP wrote the first version of the
protocol parser and created the first Python bindings for the library.
<a href="https://github.com/xianwill">Christian Williams</a> built out
<a href="https://github.com/delta-io/kafka-delta-ingest">kafka-delta-ingest</a> with
<a href="https://github.com/mosyp">Mykhailo Osypov</a> and helped prove that <strong>Rust is
way more efficient for data ingestion workloads</strong>. As time went on Will Jones,
Florian Valeye, and Robert Peck joined the party and helped turn delta-rs from
a small Scribd-motivated open source project into a thriving Rust and Python
project.</p>

<p><a href="https://bookshop.org/p/books/delta-lake-the-definitive-guide-modern-data-lakehouse-architectures-with-data-lakes-denny-lee/21429337?ean=9781098151942" target="_blank"><img src="/images/post-images/2024-deltalake/book-cover.jpg" align="right" width="200" /></a></p>

<p>Scribd had wild success with the data ingestion being in Rust, and the data
processing/query being in Spark. The community grew, Databricks grew, and at
some point some folks started working on a book.</p>

<p>As a long-time maintainer of delta-rs and talking head in the Delta and
Databricks ecosystem I was asked to be a technical reviewer of the book after
Prashanth, Scott, Tristen, and Denny had already gotten more than halfway
through the chapters.</p>

<p>I provided as much feedback as I could on their chapters. I reviewed the
outline and noticed “Chapter 8: TBD”.</p>

<p>What’s supposed to be Chapter 8? “<em>We’re not sure yet.</em>”</p>

<p>My friend <a href="https://kohsuke.org">Kohsuke</a> once marveled at how I was able to
acquire things for the <a href="https://jenkins.io">Jenkins project</a> by the simple act of
asking for them. There’s some skill involved in finding mutually beneficial
opportunities, but being uninhibited by the possibility somebody would say “no”
helps a lot.</p>

<p>“So this outline looks good, but when are you going to talk about Rust and
Python? There are dozens of us! Dozens!”</p>

<p><a href="https://dennyglee.com/">Denny</a> needed another chapter and I asked if I could
write about building native data applications in Rust and Python.</p>

<p>Suddenly I was helping to write a book.</p>

<hr />

<p><a href="https://tech.scribd.com">Scribd</a> is a fun company to work at. Books,
audiobooks, podcasts, articles. We have a deep appreciation for the written
word, telling stories, and learning. All of which I value highly. Before this
experience however I had never seen the <em>other</em> side of books. The creation,
the meetings, the rewrites, the edits, the reviews, going to press. It is
incredibly interesting and the team at O’Reilly are talented, helpful, and professional.</p>

<p>Going through copy-editing I was fielding review comments on the consistency of
tense, the subject of sentences, discussions about what is a proper noun and
how to consistently apply terms through <em>hundreds of pages</em> of content. I have
heard about how invaluable editors are; having now seen them in action, I am in
awe.</p>

<p>Over the years I have tried and failed to explain what I do to family members.
For people that don’t work in tech, “working on the computer” all looks largely
the same, especially for older generations. Having your work, your name <em>in
print</em> has an intangible “wow” factor. More so than conference talks,
websites, GitHub stars, or branded t-shirts, a printed artifact recognizes the
accomplishments of the innumerable contributors to the Delta Lake ecosystem
over the years.</p>

<p>If you’re data inclined, I recommend picking up a copy, Prashanth, Scott,
Tristen, and Denny have written a very useful guide, and also I contributed a
bit too! :)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="deltalake" /><category term="buoyantdata" /><summary type="html"><![CDATA[Nothing quite feels like “I made it!” like being published. Which is why I am thrilled to share that Delta Lake: The Definitive Guide is available for purchase, and I kind of helped! I wanted to share a little bit about how my contributions (Chapter 6!) came about, because my entrance into the Delta Lake ecosystem was about as unplanned as my authorship of part of this wonderful book.]]></summary></entry><entry><title type="html">Data and AI Summit 2024 presentations</title><link href="https://brokenco.de//2024/10/17/data-ai-summit-videos.html" rel="alternate" type="text/html" title="Data and AI Summit 2024 presentations" /><published>2024-10-17T00:00:00+00:00</published><updated>2024-10-17T00:00:00+00:00</updated><id>https://brokenco.de//2024/10/17/data-ai-summit-videos</id><content type="html" xml:base="https://brokenco.de//2024/10/17/data-ai-summit-videos.html"><![CDATA[<p>This year has been so jam packed full of activities that I forgot to share some
videos from <a href="https://www.buoyantdata.com/blog/2024-06-04-data-and-ai-summit.html">Data and AI Summit
2024</a> this
past summer! The annual conference hosted by Databricks has become one of my
favorites to meet with other <a href="https://delta.io">Delta Lake</a> users and
developers to discuss the future of large-scale data ingestion and processing. This year however, I overdid it a little bit.</p>

<p>Using the excuse of promoting my consulting/professional services company
<a href="https://buoyantdata.com">Buoyant Data</a>, I had effectively <em>three</em> speaking
engagements:</p>

<ul>
  <li><strong>The road to delta-rs 1.0</strong> at the Open Source Contributor Summit (Monday)</li>
  <li><strong>Fast, cheap, and easy data ingestion with AWS Lambda and Delta Lake</strong>, a
talk highlighting a lot of the successful patterns I have developed for
customers using AWS Lambda with Delta Lake for Rust to create shockingly
cheap data ingestion pipelines. (Thursday)</li>
  <li><strong>Let’s do data engineering in Rust!</strong>, a more fun deep-dive talk to help
people start to get into the world of implementing data systems with Rust. (Thursday)</li>
</ul>

<p>Unfortunately the first talk was not recorded, but it was probably the most
interesting! On Monday morning I was riding my bike from the Ferry Building to
the venue in San Francisco and my chain snapped off while I was sprinting off
from a green light. I went down <strong>hard</strong>, scraped up my knees, and generally
looked a fool lying in the middle of Market St.</p>

<p>The show must go on, so I hobbled to the <a href="https://tech.scribd.com">Scribd</a>
office, deposited my broken bike, and continued to the Open Source Summit.</p>

<p>What I did not know at the time was that I had fractured a bone in my wrist. I
did know however that I needed to go to a clinic, but <em>really</em> wanted to attend
the summit and take advantage of the once-a-year opportunity (literally!) for
some of the brightest minds in the data community to talk about the future of
Delta Lake and more.</p>

<p>So that first talk was given with my swollen wrist pulled to my heart, like a
broken wing, and I’m <em>sure</em> it was a ludicrous sight to see!</p>

<p>By Thursday my arm had been set and was in a sling, which is far less exciting.
Nonetheless, the two talks below are perhaps the only one-handed presentations
thus far in my career! I hope you enjoy!</p>

<center>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/XPoWb9u06xA?si=SNccWEJxorszRGO1" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/Fr5Nx1wuQmQ?si=Svc3GtewzxUyGI4M" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>
</center>

<hr />

<p><em>Note</em>: The presentation software used for this talk is the open source
<a href="https://mfontanini.github.io/presenterm/introduction.html">presenterm</a> tool
which is delightful for creating development-focused presentations like this
one!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="deltalake" /><category term="buoyantdata" /><category term="presentation" /><summary type="html"><![CDATA[This year has been so jam packed full of activities that I forgot to share some videos from Data and AI Summit 2024 this past summer! The annual conference hosted by Databricks has become one of my favorites to meet with other Delta Lake users and developers to discuss the future of large-scale data ingestion and processing. This year however, I overdid it a little bit.]]></summary></entry><entry><title type="html">The problem with ML</title><link href="https://brokenco.de//2023/01/04/the-problem-with-ml.html" rel="alternate" type="text/html" title="The problem with ML" /><published>2023-01-04T00:00:00+00:00</published><updated>2023-01-04T00:00:00+00:00</updated><id>https://brokenco.de//2023/01/04/the-problem-with-ml</id><content type="html" xml:base="https://brokenco.de//2023/01/04/the-problem-with-ml.html"><![CDATA[<p>The holidays are the time of year when I typically field a lot of questions
from relatives about technology or the tech industry, and this year my favorite
questions were around <strong>AI</strong>. (<em>insert your own scary music</em>) Machine-learning
(ML) or Artificial Intelligence (AI) are being widely deployed and I have some
<strong>Problems™</strong> with that. Machine learning is not necessarily a new
domain, the practices commonly accepted as “ML” have been used for quite a
while to support search and recommendations use-cases. In fact, my day job
includes supporting data scientists and those who are actively creating models
and deploying them to production. <em>However</em>, many of my relatives outside of the tech industry believe that “AI” is going to replace people, their jobs, and/or run the future. I genuinely hope AI/ML comes nowhere close to this future imagined by members of my family.</p>

<p>Like many pieces of technology, ML is not inherently good or bad, but the
problem with ML as it is applied today is that <strong>its application is far
outpacing our understanding of its consequences</strong>.</p>

<p>Brian Kernighan, co-author of <em>The C Programming Language</em> and early UNIX contributor, said:</p>

<blockquote>
  <p>Everyone knows that debugging is twice as hard as writing a program in the
first place. So if you’re as clever as you can be when you write it, how will
you ever debug it?</p>
</blockquote>

<p>Setting aside the <em>mountain</em> of ethical concerns around the application of ML
which have and should continue to be discussed in the technology industry,
there’s a fundamental challenge with ML-based systems: I don’t think their
creators understand how they work, how their conclusions are determined, or how
to consistently improve them over time. Imagine you are a data scientist or ML
developer, how confident are you in what your models will predict between
experiments or evolutions of the model? Would you be willing to testify in a
court of law about the veracity of your model’s output?</p>

<p>Imagine you are a developer working on the models that Tesla’s “full
self-driving” (FSD) mode relies upon. Your model has been implicated in a Tesla
killing the driver and/or pedestrians (which <a href="https://www.reuters.com/business/autos-transportation/us-probing-fatal-tesla-crash-that-killed-pedestrian-2021-09-03/">has
happened</a>).
Do you think it would be possible to convince a judge and jury that your model
is <em>not</em> programmed to mow down pedestrians outside of a crosswalk? How do you
prove what a model is or is not supposed to do given never before seen inputs?</p>

<p>Traditional software <em>does</em> have a variation of this problem but source code
lends itself to scrutiny far better than ML models, many of which have come
from successive evolutions of public training data, proprietary model changes,
and integrations with new data sources.</p>

<p>These problems may be solvable in the ML ecosystem, but the problem is that the
application of ML is outpacing our ability to understand, monitor, and diagnose
models when they do harm.</p>

<p>That model your startup is working on to help accelerate home loan approvals
based on historical mortgages: how do you assert that your models are not
re-introducing racist policies like
<a href="https://en.wikipedia.org/wiki/Redlining">redlining</a>? (Forms of this <a href="https://fortune.com/2020/02/11/a-i-fairness-eye-on-a-i/">have happened</a>.)</p>

<p>How about that fun image generation (AI art!) project you have been tinkering
with? It uses a publicly available model that was trained on millions of images
from the internet, and as a result in some cases unintentionally outputs
explicit images, or even what some jurisdictions might consider bordering on
child pornography. (Forms of this <a href="https://www.wired.com/story/lensa-artificial-intelligence-csem/">have
happened</a>.)</p>

<p>Really, any model you train on data “from the internet” is asking for
racist, pornographic, or otherwise offensive results, as the <a href="https://www.cbsnews.com/news/microsoft-shuts-down-ai-chatbot-after-it-turned-into-racist-nazi/">Microsoft
Tay</a>
example should have taught us.</p>

<p>Can you imagine the human-rights nightmare that could ensue from shoddy ML
models being brought into a healthcare setting? Law-enforcement? Or even
military settings?</p>

<hr />

<p>Machine-learning encompasses a very powerful set of tools and patterns, but our
ability to predict how those models will be used, what they will output, or how
to prevent negative outcomes is <em>dangerously</em> insufficient for use outside
of search and recommendation systems.</p>

<p>I understand how models are developed, how they are utilized, and what I
<em>think</em> they’re supposed to do.</p>

<p>Fundamentally the challenge with AI/ML is that we understand how to “make it
work”, but we don’t understand <em>why</em> it works.</p>

<p>Nonetheless we keep deploying “AI” anywhere there’s funding, consequences be
damned.</p>

<p>And that’s a problem.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="software" /><category term="ml" /><category term="aws" /><category term="databricks" /><summary type="html"><![CDATA[The holidays are the time of year when I typically field a lot of questions from relatives about technology or the tech industry, and this year my favorite questions were around AI. (insert your own scary music) Machine-learning (ML) or Artificial Intelligence (AI) are being widely deployed and I have some Problems™ with that. Machine learning is not necessarily a new domain, the practices commonly accepted as “ML” have been used for quite a while to support search and recommendations use-cases. In fact, my day job includes supporting data scientists and those who are actively creating models and deploying them to production. However, many of my relatives outside of the tech industry believe that “AI” is going to replace people, their jobs, and/or run the future. I genuinely hope AI/ML comes nowhere close to this future imagined by members of my family.]]></summary></entry><entry><title type="html">Meet Buoyant Data, and let me reduce your data platform costs</title><link href="https://brokenco.de//2023/01/02/introducing-buoyant-data.html" rel="alternate" type="text/html" title="Meet Buoyant Data, and let me reduce your data platform costs" /><published>2023-01-02T00:00:00+00:00</published><updated>2023-01-02T00:00:00+00:00</updated><id>https://brokenco.de//2023/01/02/introducing-buoyant-data</id><content type="html" xml:base="https://brokenco.de//2023/01/02/introducing-buoyant-data.html"><![CDATA[<p>One of the many things I learned in 2022 is that I have a particular knack for
understanding, analyzing, and optimizing the costs of data platform
infrastructure. These skills were born out of both curiosity and necessity in
the current economic climate, and have led me to start a small consultancy on
the side: <a href="https://www.buoyantdata.com/">Buoyant Data</a>. Big data infrastructure
can be hugely valuable to lots of businesses, but unfortunately it’s also an
area of the cloud bills that is frequently misunderstood, that’s something that
I can help with!</p>

<p><a href="https://www.duckbillgroup.com/about/">Mike Julian</a> from <a href="https://www.duckbillgroup.com/">The Duckbill
Group</a> once made the proclamation that the way
to <em>actually</em> save money in AWS is to design your infrastructure to be
cost-effective. “Optimization” techniques can only take you so far, and once
you’ve burned through all the optimizations, you may find yourself needing to
further reduce the cost of your infrastructure and have no more “fat” to trim! In the <a href="https://www.buoyantdata.com/blog/2022-12-18-initial-commit.html">first blog post</a> I outline a “reference architecture” for a data platform which I <strong>know</strong> is cost-effective, easy to manage, and lends itself well to growth.</p>

<p>Planning for sensible, cost-conscious growth is <em>very</em> important. As most data
platforms start to prove their value, the organization will bring even
<em>more</em> workloads to them. <a href="https://en.wikipedia.org/wiki/If_You_Give_a_Mouse_a_Cookie">If you give a data scientist a good
platform</a>, they
will find themselves wanting ever more from that data platform, and Buoyant
Data can help make sure that growth is sustainable <strong>and</strong> the value to the
business is easy to identify as well.</p>

<p>Please add the Buoyant Data <a href="https://www.buoyantdata.com/rss.xml">RSS feed</a> to your reader, as I have a number of blog posts queued up already with some gratis tips and tricks for understanding the cost of your data platform! 😄</p>

<hr />

<p>The technology stack for Buoyant Data is something I cannot wait to write more
about. After funding the creation of
<a href="https://github.com/delta-io/delta-rs">delta-rs</a> as part of my day job, I am
utilizing the library in a <strong>big</strong> way to build extremely lightweight and
cost-efficient data ingestion pipelines with Rust and AWS Lambda. There’s still
plenty of space for <a href="https://spark.apache.org">Apache Spark</a> on the querying
and processing side, but as
<a href="https://github.com/apache/arrow-datafusion">DataFusion</a> matures, I’m looking
forward to exploring where that can fit into the picture.</p>
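
<p>To give a flavor of how lightweight these ingestion pipelines can be, below
is a minimal sketch of a Lambda-style handler built on the
<code class="language-plaintext highlighter-rouge">deltalake</code> Python package;
the event shape, table URI, and column names are all hypothetical:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

import pandas as pd
from deltalake import write_deltalake

def handler(event, context):
    # Hypothetical event shape: a small batch of JSON records,
    # e.g. from an SQS-triggered invocation
    records = [json.loads(r["body"]) for r in event["Records"]]
    df = pd.DataFrame.from_records(records)
    # Append the micro-batch straight to a Delta table on S3,
    # no cluster required (the table URI is illustrative)
    write_deltalake("s3://example-bucket/delta/events", df, mode="append")
    return {"written": len(df)}
</code></pre></div></div>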

<p>There’s a lot of evolution happening right now in the data and ML platform
space, I’m really looking forward to growing <a href="https://buoyantdata.com">Buoyant
Data</a> in my spare time!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="software" /><category term="deltalake" /><category term="aws" /><summary type="html"><![CDATA[One of the many things I learned in 2022 is that I have a particular knack for understanding, analyzing, and optimizing the costs of data platform infrastructure. These skills were born out of both curiosity and necessity in the current economic climate, and have led me to start a small consultancy on the side: Buoyant Data. Big data infrastructure can be hugely valuable to lots of businesses, but unfortunately it’s also an area of the cloud bill that is frequently misunderstood, and that’s something that I can help with!]]></summary></entry><entry><title type="html">Local SQL querying in Jupyter Notebooks</title><link href="https://brokenco.de//2022/04/29/local-sql-with-jupyter.html" rel="alternate" type="text/html" title="Local SQL querying in Jupyter Notebooks" /><published>2022-04-29T00:00:00+00:00</published><updated>2022-04-29T00:00:00+00:00</updated><id>https://brokenco.de//2022/04/29/local-sql-with-jupyter</id><content type="html" xml:base="https://brokenco.de//2022/04/29/local-sql-with-jupyter.html"><![CDATA[<p>Designing, working with, or thinking about data consumes the vast majority of
my time these days, but almost all of that has been “in the cloud” rather than
locally. I recently watched <a href="https://www.youtube.com/watch?v=RqubKSF3wig">this talk about SQLite and
Go</a> which served as a good
reminder that I have a pretty powerful computer at my fingertips, and that
perhaps not all my workloads require a big <a href="https://spark.apache.org">Spark</a>
cluster in the sky. Shortly after watching that video I stumbled into a small
(200k rows) data set which I needed to run some queries against, and my first
attempt at auto-ingesting it into a <a href="https://delta.io">Delta table</a> in
Databricks failed, so I decided to launch a local <a href="https://jupyter.org/">Jupyter
notebook</a> and give it a try!</p>

<p>My originating data set was a comma-separated values file (CSV), so my first
intent was to just load it into SQLite using the <code class="language-plaintext highlighter-rouge">.mode csv</code> command in the
CLI, but I found that to be a bit restrictive. Notebooks have incredible
utility for incrementally working on data. Unfortunately Jupyter doesn’t have a
native SQL interface; instead everything has to run through Python. Through my
work with <a href="https://github.com/delta-io/delta-rs">delta-rs</a> I am somewhat
familiar with <a href="https://pandas.pydata.org/">Pandas</a> for processing data in
Python, so my first attempts were using the Pandas data frame API to munge
through my data.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pandas</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data/2021_05-2022_04.csv'</span><span class="p">)</span>
</code></pre></div></div>

<p>I could be dense, but I find SQL to be a pretty understandable tool in
comparison to data frames, so I needed to find some way to get the data into a
SQL interface. The solution that I ended up with was to create an in-memory
SQLite database and use Pandas to query it, which works <em>okay enough</em> to where
I continued working and didn’t bother thinking too much about how to optimize
the approach further:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sqlite3</span>
<span class="kn">import</span> <span class="nn">pandas</span>

<span class="c1"># Loading everything into a SQLite memory database because I hate data frames and SQL is nice
</span><span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">':memory:'</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pandas</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data/2021_05-2022_04.csv'</span><span class="p">)</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">to_sql</span><span class="p">(</span><span class="s">'usage'</span><span class="p">,</span> <span class="n">conn</span><span class="p">,</span> <span class="n">if_exists</span><span class="o">=</span><span class="s">'replace'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="c1"># useful little helper
</span><span class="n">sql</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">pandas</span><span class="p">.</span><span class="n">read_sql_query</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">conn</span><span class="p">)</span>


<span class="c1"># Show some sample data
</span><span class="n">sql</span><span class="p">(</span><span class="s">'SELECT * FROM usage LIMIT 3'</span><span class="p">)</span>
</code></pre></div></div>

<p>The benefit of this approach is that I can create additional tables in the
SQLite database with static data sets, or other CSVs. Since I’m also just doing
some simple ad-hoc analysis, I can skip writing anything to disk and keep
things snappy in memory.</p>
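
<p>For example, joining the usage data against a second CSV is just another
<code class="language-plaintext highlighter-rouge">to_sql</code> call away; this
builds on the snippet above, and the file and column names here are hypothetical:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Load a second, hypothetical CSV into its own in-memory table
owners = pandas.read_csv('data/team_owners.csv')
owners.to_sql('owners', conn, if_exists='replace', index=False)

# Plain SQL joins then work across both tables
sql("""
    SELECT u.*, o.owner
    FROM usage u
    LEFT JOIN owners o ON o.service = u.service
    LIMIT 5
""")
</code></pre></div></div>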

<p>I created the little <code class="language-plaintext highlighter-rouge">sql</code> lambda to make the notebook a bit more
understandable, and to avoid exposing the cursor or database connection to
every single cell, meaning that most of my cells in the notebook are simply
<code class="language-plaintext highlighter-rouge">sql('SELECT * FROM foo')</code> statements with some documentation surrounding
them.</p>

<p>Fairly simple, easy enough to play with data quickly on my local machine
without invoking all the infinite cosmic powers the cloud provides!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="dataeng" /><category term="databricks" /><summary type="html"><![CDATA[Designing, working with, or thinking about data consumes the vast majority of my time these days, but almost all of that has been “in the cloud” rather than locally. I recently watched this talk about SQLite and Go which served as a good reminder that I have a pretty powerful computer at my fingertips, and that perhaps not all my workloads require a big Spark cluster in the sky. Shortly after watching that video I stumbled into a small (200k rows) data set which I needed to run some queries against, and my first attempt at auto-ingesting it into a Delta table in Databricks failed, so I decided to launch a local Jupyter notebook and give it a try!]]></summary></entry><entry><title type="html">I’m a Databricks Beacon</title><link href="https://brokenco.de//2021/10/21/databricks-beacon.html" rel="alternate" type="text/html" title="I’m a Databricks Beacon" /><published>2021-10-21T00:00:00+00:00</published><updated>2021-10-21T00:00:00+00:00</updated><id>https://brokenco.de//2021/10/21/databricks-beacon</id><content type="html" xml:base="https://brokenco.de//2021/10/21/databricks-beacon.html"><![CDATA[<p>A bit of belated news but thanks to all the advocacy work we have been doing at
<a href="https://tech.scribd.com">Scribd</a>, I am now a <a href="https://databricks.com/discover/beacons/tyler-croy">Databricks
Beacon</a>. The Beacon program is similar
to Docker Captains, Microsoft MVPs, or Java Champions, a group of folks who are
considered skilled both with the technology and in communicating/sharing best
practices, tips, and shortcomings with the broader community.</p>

<p><img src="/images/post-images/databricks-beacons/header-image.png" alt="Beacon profile" /></p>

<p>From the <a href="https://databricks.com/discover/beacons/">site</a> itself:</p>

<blockquote>
  <p>The Databricks Beacons program is our way to thank and recognize the community members, data scientists, data engineers, developers and open source enthusiasts who go above and beyond to uplift the data and AI community.</p>

  <p>Whether they are speaking at conferences, leading workshops, teaching, mentoring, blogging, writing books, creating tutorials, offering support in forums or organizing meetups, they inspire others and encourage knowledge sharing – all while helping to solve tough data problems.</p>
</blockquote>

<p>I’m flattered to be included in the inaugural group of Beacons, which includes a
number of much more competent data leaders than myself. Most of what I bring to
the table is a <em>lot</em> of <a href="https://delta.io">Delta Lake</a> experience and advocacy.
Delta Lake is the bedrock of Scribd’s data platform and I have been investing
heavily in the space with our contribution of the
<a href="https://github.com/delta-io/delta-rs">delta-rs</a> Rust bindings, upon which
<a href="https://www.youtube.com/watch?v=mLmsZ3qYfB0">kafka-delta-ingest</a> was built.</p>

<p><a href="https://databricks.com/customers/data-team-effect/scribd">Scribd is a Databricks
customer</a>, and from
that angle I have been quite impressed with the organization and technologies
they have built. As folks who have seen <a href="https://youtu.be/h5bRBuVmhL4?t=1635">my public talks</a> about Databricks know,
I don’t hold back in my honest assessment of the platform’s strengths and
weaknesses, thus my surprise to be included as a Beacon ;)</p>

<p>I’m looking forward to more events where I am able to share some of the
real-world experiences we’re gaining at Scribd in building out massive data
platform systems with Delta Lake and Databricks. And as always, if you want to <a href="https://tech.scribd.com/careers/#open-positions">help us build out more</a>, feel free to email me!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="scribd" /><category term="databricks" /><summary type="html"><![CDATA[A bit of belated news but thanks to all the advocacy work we have been doing at Scribd, I am now a Databricks Beacon. The Beacon program is similar to Docker Captains, Microsoft MVPs, or Java Champions, a group of folks who are considered skilled both with the technology and in communicating/sharing best practices, tips, and shortcomings with the broader community.]]></summary></entry><entry><title type="html">Building a real-time data platform with Apache Spark and Delta Lake</title><link href="https://brokenco.de//2020/07/20/realtime-spark-deltalake.html" rel="alternate" type="text/html" title="Building a real-time data platform with Apache Spark and Delta Lake" /><published>2020-07-20T00:00:00+00:00</published><updated>2020-07-20T00:00:00+00:00</updated><id>https://brokenco.de//2020/07/20/realtime-spark-deltalake</id><content type="html" xml:base="https://brokenco.de//2020/07/20/realtime-spark-deltalake.html"><![CDATA[<p>The <a href="/2019/08/28/real-time-data-platform.html">Real-time Data Platform</a> is one
of the fun things we have been building at Scribd since I joined in 2019. Last
month I was fortunate enough to share some of our approach in a presentation at
Spark and AI Summit titled: “The revolution will be streamed.” At a high level,
what I had branded the “Real-time Data Platform” is really: <a href="https://kafka.apache.org">Apache
Kafka</a>, <a href="https://airflow.apache.org">Apache Airflow</a>,
<a href="https://spark.apache.org">Structured streaming with Apache Spark</a>, and a
smattering of microservices to help shuffle data around. All sitting on top of
<a href="https://delta.io">Delta Lake</a> which acts as an incredibly versatile and useful
storage layer for the platform.</p>
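
<p>As a rough sketch of the shape of those streaming jobs (this assumes a
SparkSession with the Kafka and Delta Lake connectors available; the broker,
topic, and S3 paths are purely illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Read an event stream from Kafka and append it to a Delta table
stream = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "events")
               .load())

(stream.selectExpr("CAST(value AS STRING) AS raw_event")
       .writeStream
       .format("delta")
       .option("checkpointLocation", "s3://bucket/_checkpoints/events")
       .start("s3://bucket/delta/events"))
</code></pre></div></div>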

<p>In my presentation, which is embedded below, I outline how we tie together Kafka, Databricks, and Delta Lake.</p>

<center>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/YmyCOr9Mr9Y" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</center>

<p>The recorded presentation also complements some of our
<a href="https://tech.scribd.com">tech.scribd.com</a> blog posts which I recommend reading as well:</p>

<ul>
  <li><a href="https://tech.scribd.com/blog/2020/streaming-with-delta-lake.html">Streaming data in and out of Delta Lake</a></li>
  <li><a href="https://tech.scribd.com/blog/2020/introducing-kafka-player.html">Streaming development work with Kafka</a></li>
  <li><a href="https://tech.scribd.com/blog/2020/shipping-rust-to-production.html">Ingesting production logs with Rust</a></li>
  <li><a href="https://tech.scribd.com/blog/2019/migrating-kafka-to-aws.html">Migrating Kafka to the cloud</a></li>
</ul>

<p>I am incredibly proud of the work the Platform Engineering organization has
done at Scribd to make real-time data a reality. I also cannot recommend Kafka +
Spark + Delta Lake highly enough for those with similar requirements.</p>

<p>Now that we have the platform in place, I am also excited for our late 2020 and
2021 roadmaps which will start to take advantage of real-time data.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="spark" /><category term="deltalake" /><category term="databricks" /><category term="scribd" /><summary type="html"><![CDATA[The Real-time Data Platform is one of the fun things we have been building at Scribd since I joined in 2019. Last month I was fortunate enough to share some of our approach in a presentation at Spark and AI Summit titled: “The revolution will be streamed.” At a high level, what I had branded the “Real-time Data Platform” is really: Apache Kafka, Apache Airflow, Structured streaming with Apache Spark, and a smattering of microservices to help shuffle data around. All sitting on top of Delta Lake which acts as an incredibly versatile and useful storage layer for the platform.]]></summary></entry></feed>