<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://brokenco.de//feed/by_tag/deltalake.xml" rel="self" type="application/atom+xml" /><link href="https://brokenco.de//" rel="alternate" type="text/html" /><updated>2026-04-12T21:39:52+00:00</updated><id>https://brokenco.de//feed/by_tag/deltalake.xml</id><title type="html">rtyler</title><subtitle>a moderately technical blog</subtitle><author><name>R. Tyler Croy</name></author><entry><title type="html">Based Lake, a petabyte-scale low-latency data lake</title><link href="https://brokenco.de//2026/03/10/based-lake.html" rel="alternate" type="text/html" title="Based Lake, a petabyte-scale low-latency data lake" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://brokenco.de//2026/03/10/based-lake</id><content type="html" xml:base="https://brokenco.de//2026/03/10/based-lake.html"><![CDATA[<p>I had a chat today about building large scale low-latency data retrieval
systems around AWS S3. In doing so I got to share a bit of the talk proposal I
submitted to <a href="https://dataaisummit.com">Data and AI Summit</a> this year about
real-life work that has made it into production.</p>

<p>For years the conventional wisdom around <a href="https://delta.io">Delta Lake</a> has
been to <strong>not</strong> connect user-facing/online systems to Delta tables. Basically,
don’t point your Django app at your Delta tables. This continues to be a decent
<em>guideline</em>, but it is definitely <strong>not a rule</strong>, and I have the performance data to
back that up.</p>

<p>My talk abstract:</p>

<blockquote>
  <p>Scribd hosts hundreds of millions of documents and has hundreds of billions of
objects across our buckets. Combining large language models with massive
amounts of text has required investment in our new Content Library
architecture.  We selected Delta Lake as the underlying storage technology but
have pushed it to an extreme. Using the same Delta Lake architecture we offer
both direct data access for data scientists in Databricks Notebooks and online
data retrieval in milliseconds for user-facing web services.</p>

  <p>In this talk we will review principles of performance for each layer of the
stack: web APIs, the Delta Lake tables, Apache Parquet, and AWS S3.</p>
</blockquote>

<p>The work my colleague Eugene and I have done in this area relates heavily
to my previous research around <a href="/2025/06/24/low-latency-parquet.html">Low latency Parquet
reads</a>, which informed a project named <a href="https://tech.scribd.com/blog/2026/content-crush.html">Content
Crush</a> that I have
explored further on the Scribd tech blog and on the <a href="/2026/02/13/screaming-in-the-cloud.html">Screaming in the
Cloud</a> podcast.</p>

<p>I really hope that I am able to share results at Data and AI Summit from this
incredibly challenging work that I am undertaking. But even if I don’t, blog
posts like my musings on <a href="/2026/01/19/multimodal-delta-lake.html">Multimodal with Delta
Lake</a>, <a href="https://www.buoyantdata.com/blog/2024-12-31-high-concurrency-logstore.html">scaling streaming Delta Lake
applications</a>,
and a myriad of other articles I have published can be pieced together to form
the larger mosaic of insane large-scale data work I have been hammering on!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="arrow" /><category term="parquet" /><category term="deltalake" /><category term="databricks" /><category term="scribd" /><summary type="html"><![CDATA[I had a chat today about building large scale low-latency data retrieval systems around AWS S3. In doing so I got to share a bit of the talk proposal I submitted to Data and AI Summit this year about real-live work that has made it into production.]]></summary></entry><entry><title type="html">Multimodal with Delta Lake</title><link href="https://brokenco.de//2026/01/19/multimodal-delta-lake.html" rel="alternate" type="text/html" title="Multimodal with Delta Lake" /><published>2026-01-19T00:00:00+00:00</published><updated>2026-01-19T00:00:00+00:00</updated><id>https://brokenco.de//2026/01/19/multimodal-delta-lake</id><content type="html" xml:base="https://brokenco.de//2026/01/19/multimodal-delta-lake.html"><![CDATA[<p>The rate of change for data storage systems has accelerated to a frenzied pace
and most storage architectures I have seen simply cannot keep up. Much of my
time is spent thinking about large-scale tabular data stored in <a href="https://delta.io">Delta
Lake</a> which is one of the “lakehouse” storage systems along
with <a href="https://iceberg.apache.org">Apache Iceberg</a> and others. These storage
architectures were developed 5-10 years ago to solve the problems many
organizations faced when moving from data warehouses to massive-scale
structured data. The storage changes we need today must support
“multimodal data” which is a dramatic departure in many ways from the
traditional query and usage patterns our existing infrastructure supports.</p>

<blockquote>
  <p>Multimodal learning is a type of deep learning that integrates and processes
multiple types of data, referred to as modalities, such as text, audio, images,
or video. This integration allows for a more holistic understanding of complex
data, improving model performance in tasks like visual question answering,
cross-modal retrieval, text-to-image generation, aesthetic ranking,
and image captioning.</p>

  <p><a href="https://en.wikipedia.org/wiki/Multimodal_learning">From Wikipedia</a></p>
</blockquote>

<p>Honestly, I have been working on this problem for longer than I knew that it
had a name!</p>

<p>Working on <a href="https://tech.scribd.com/blog/2026/content-crush.html">Content
Crush</a> at Scribd I have
had to negotiate an ever-present challenge: how do we make multimodal data
work seamlessly with our classic tabular datasets?</p>

<p>A couple of the ideas that I have been thinking about revolve around one
principle: <strong>re-encoding of existing data is unacceptable.</strong> In the past I have
considered simply encoding binary data such as that from images or PDFs into
<a href="https://parquet.apache.org">Apache Parquet</a>. This approach suffers from a couple major flaws:</p>

<ul>
  <li>Re-encoding requires substantial computation for any non-trivial set of images, PDFs, video, etc.</li>
  <li>Redundant object storage: even with compression, it is unlikely that any
organization with terabytes or petabytes of image data will want to
store a secondary copy of it for its multimodal needs.</li>
  <li>Embedding a 1MB PDF file inside of a Parquet file is <em>not silly</em> but
embedding a 10GB video file inside of a Parquet file is <em>very silly</em>. Any
approach taken should scale in a reasonable fashion for data in the gigabyte
to terabyte range.</li>
</ul>

<p>A secondary objective in my thinking has been to avoid needing substantial
client changes for working with multimodal data. I recently watched <a href="https://www.youtube.com/watch?v=YmY_NwaoxNk">a talk by
Ryan Johnson</a> about adding
transactional semantics to Delta Lake and one of the big takeaways that I
heard from him was about the troublesome nature of ensuring <em>all actors</em> in the
system cooperated with the transaction semantics. In a modern data environment
that could be <em>dozens</em> of different off-the-shelf libraries, Databricks
notebooks, AWS SageMaker transforms, and so on. The less “exposure” to the
client layer the better.</p>

<h2 id="parquet-anchors">Parquet Anchors</h2>

<p>The first idea that I had was “Parquet Anchors” which would be built on <a href="https://parquet.apache.org/docs/file-format/binaryprotocolextensions/">Binary
Protocol
Extensions</a>
in Apache Parquet. In most cases the rich text/image/video data is already
stored in object storage such as AWS S3 and a URL should be sufficient to
retrieve that data.</p>

<p>The extension of the binary protocol, as I understand it, would allow custom
information to be encoded in the Parquet files that are being written as part
of an existing Delta table. The specific mechanism of encoding this data is
somewhat irrelevant so long as it can carry:</p>

<ul>
  <li>Artifact name (e.g. <code class="language-plaintext highlighter-rouge">some.pdf</code>)</li>
  <li>Artifact URL (<code class="language-plaintext highlighter-rouge">s3://bucket/prefix/of/keys/some-10x9u09123.pdf</code>)</li>
  <li>Artifact length (number of bytes)</li>
  <li>Artifact content type (e.g. <code class="language-plaintext highlighter-rouge">application/pdf</code>)</li>
  <li>Checksum</li>
  <li>Checksum Algorithm</li>
</ul>
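<p>As a sketch, such an anchor could be modeled as a small record whose checksum is computed when the artifact is registered. Everything below (the <code>ParquetAnchor</code> name, the field layout, the choice of SHA-256) is my own illustration, not anything defined by the Parquet specification:</p>

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class ParquetAnchor:
    """Hypothetical anchor record pointing at an external artifact."""
    name: str          # e.g. "some.pdf"
    url: str           # e.g. "s3://bucket/prefix/of/keys/some-10x9u09123.pdf"
    length: int        # artifact size in bytes
    content_type: str  # e.g. "application/pdf"
    checksum: str
    checksum_algo: str

def anchor_for(name: str, url: str, content_type: str, data: bytes) -> ParquetAnchor:
    """Build an anchor for raw artifact bytes, checksumming with SHA-256."""
    return ParquetAnchor(
        name=name,
        url=url,
        length=len(data),
        content_type=content_type,
        checksum=hashlib.sha256(data).hexdigest(),
        checksum_algo="sha256",
    )

anchor = anchor_for("some.pdf", "s3://bucket/some.pdf", "application/pdf", b"%PDF-1.7 ...")
```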

<h3 id="pros">Pros</h3>
<p>The most obvious benefit of going down this route is the ease with which one
could update existing data files, <em>and</em> this note from the Binary Protocol
Extensions document:</p>

<blockquote>
  <p><em>Existing readers will ignore the extension bytes with little processing overhead</em></p>
</blockquote>

<p>Logically, Parquet Anchors could be quite simple to implement, and <em>most</em>
users of a Delta table with Parquet Anchors would never know they were there.</p>

<h3 id="cons">Cons</h3>

<p>The natural downside of this feature being hidden from existing readers is that
clients must be updated in order to read the extension data properly. For
something like processing multimodal data where a row of content metadata
might refer to <code class="language-plaintext highlighter-rouge">some.pdf</code> this would mean the reader would have to have some
indication that it must:</p>

<ol>
  <li>Read the extended binary information</li>
  <li><em>Then</em> fetch the necessary artifacts</li>
</ol>

<p>There is another downside to this approach in that a table would need to be
“rewritten” but only <em>partially</em>. If a Parquet file added to the Delta table
references 1000 artifacts, then that <code class="language-plaintext highlighter-rouge">.parquet</code> file would need to be rewritten
to include the Parquet Anchors for those 1000 artifacts alongside that file’s
<code class="language-plaintext highlighter-rouge">add</code> action. In essence I think this approach would require a full-table
rewrite where each <code class="language-plaintext highlighter-rouge">.parquet</code> in the transaction log would be retrieved,
processed, and rewritten with the appropriate Anchors.</p>

<p>Considering ways to address the shortcomings of Parquet Anchors I came up with
my next concept.</p>

<h2 id="virtual-delta-tables-vdt">Virtual Delta Tables (vdt)</h2>

<p>The notion of Parquet Anchors is, I think, worth holding onto: hyperlinks to
existing artifacts are a key part of the multimodal data storage solution, but
perhaps not as a direct encoding into the Parquet data files. Considering the
shortcomings led me to think of how to present a virtual Delta table “view” to
existing clients while hiding the disparate nature of the data behind the
scenes.</p>

<p>One underutilized feature of the Delta Lake protocol is the use of URLs in the
<code class="language-plaintext highlighter-rouge">add</code> actions which enables functionality like <a href="https://delta.io/blog/delta-lake-clone/">shallow
clones</a>. I have long thought of this
as a super power that should really be used more.</p>

<h3 id="vdt0-just-the-artifacts">vdt0: just the artifacts</h3>

<p>The magic of the URL support in the Delta protocol is that the URLs don’t even
have to point to object storage. Nothing about the protocol dictates that the
URLs must point to <code class="language-plaintext highlighter-rouge">s3://</code> or <code class="language-plaintext highlighter-rouge">abfss://</code> URLs; you can just point to <code class="language-plaintext highlighter-rouge">https://</code>
URLs. AWS S3 supports <code class="language-plaintext highlighter-rouge">https://</code> URLs, but so does <em>every other web service</em>.</p>

<p>Imagine a storage architecture which already contains heaps of <code class="language-plaintext highlighter-rouge">.pdf</code>
artifacts. A <code class="language-plaintext highlighter-rouge">vdt</code> web service could provide a read-only URL structure which
maps the existing object storage structure into a Delta Lake URL scheme.</p>

<p>A virtual table with just those PDF artifacts could be configured at
<code class="language-plaintext highlighter-rouge">https://vdt.aws/v1/&lt;catalog&gt;/&lt;schema&gt;/&lt;table&gt;</code>. Using tooling like
<a href="https://github.com/s3s-project/s3s">s3s</a> <code class="language-plaintext highlighter-rouge">vdt</code> can provide S3-like operations
off of this virtual URL, exposing a virtualized JSON transaction log or
checkpoints for the Delta client.</p>
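<p>To make that concrete, here is a rough sketch of how such a service might recognize a transaction-log request against the virtual URL scheme. The route shape and field names are hypothetical; a real implementation built on s3s would hook into its request handling instead:</p>

```python
import re
from typing import Optional

# Hypothetical route for a vdt service exposing a virtual Delta table:
#   /v1/<catalog>/<schema>/<table>/_delta_log/<20-digit version>.json
ROUTE = re.compile(
    r"^/v1/(?P<catalog>[^/]+)/(?P<schema>[^/]+)/(?P<table>[^/]+)"
    r"/_delta_log/(?P<version>\d{20})\.json$"
)

def parse_log_request(path: str) -> Optional[dict]:
    """Map an S3-style GET path onto the (table, version) the service must synthesize."""
    m = ROUTE.match(path)
    if m is None:
        return None
    parts = m.groupdict()
    parts["version"] = int(parts["version"])
    return parts

req = parse_log_request("/v1/main/docs/pdfs/_delta_log/00000000000000000000.json")
```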

<p>Imagine the schema of such a virtual table for PDF artifacts:</p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>filename</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>content_type</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>url</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>filesize</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>data</td>
      <td><code class="language-plaintext highlighter-rouge">binary</code></td>
    </tr>
    <tr>
      <td>checksum</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>checksum_algo</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
  </tbody>
</table>

<p>The virtualized transaction log is where the real fun can begin. If information
about the artifacts can be sourced from an existing database, then the
virtualized transaction log could contain numerous <em>imagined</em> parquet files as
the <code class="language-plaintext highlighter-rouge">add</code> actions:</p>

<pre><code class="language-JSON">{
  "add": {
    "path": "datafiles/some-guid.parquet",
    "size": 841454,
    "modificationTime": 1512909768000,
    "dataChange": true,
    "stats": "{\"numRecords\":1,\"minValues\":{\"val..."
  }
}
</code></pre>

<p>The special path for the <code class="language-plaintext highlighter-rouge">some-guid.parquet</code> would perform <strong>on-demand</strong>
parquet encoding for the underlying artifacts.  The most primitive
implementation could simply represent <em>each</em> PDF file as a <code class="language-plaintext highlighter-rouge">.parquet</code> file with
an <code class="language-plaintext highlighter-rouge">add</code> action. So long as the <code class="language-plaintext highlighter-rouge">add</code> action conveyed the necessary file
statistics to allow the consuming engine to filter out files which are not
necessary, this could be a seamless way to expose structured PDF data to the
consumer. The <code class="language-plaintext highlighter-rouge">path</code> in the action could <em>also</em> refer to an already cached
version of the encoded file in S3 using the existing URL support in the
protocol, in this way clients could progressively cache as need be on the
server-side.</p>
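<p>A sketch of how the service might fabricate one such <code>add</code> action per artifact, using metadata it already has on hand. With one row per file, the min and max statistics for <code>filename</code> collapse to the same value, giving the engine exact information for file skipping. The artifact fields and the <code>datafiles/</code> path convention are illustrative assumptions:</p>

```python
import json

def virtual_add_action(artifact: dict) -> dict:
    """Fabricate an `add` action for an artifact with no real Parquet file behind it.

    The `path` points at a location the vdt service resolves by encoding the
    artifact to Parquet on demand; the `artifact` keys here are hypothetical.
    """
    stats = {
        "numRecords": 1,
        # One row per file means min == max, so the engine can filter exactly.
        "minValues": {"filename": artifact["filename"]},
        "maxValues": {"filename": artifact["filename"]},
        "nullCount": {"filename": 0},
    }
    return {
        "add": {
            "path": f"datafiles/{artifact['guid']}.parquet",
            "size": artifact["filesize"],
            "modificationTime": artifact["modified_ms"],
            "dataChange": True,
            "stats": json.dumps(stats),
        }
    }

action = virtual_add_action(
    {"guid": "some-guid", "filename": "some.pdf",
     "filesize": 841454, "modified_ms": 1512909768000}
)
```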

<hr />

<p><strong>Brief aside</strong>: I have never fully understood why <a href="https://delta.io/sharing/">Delta
sharing</a> exists as a separate entity. In my opinion
the Delta Lake protocol coupled with a clever server-side backend could provide
identical functionality for all existing Delta implementations.</p>

<hr />

<p>Assuming the <code class="language-plaintext highlighter-rouge">vdt</code> service supports the schema defined above and can properly
retrieve the PDF artifacts and encode them as Parquet data on the fly, a query
such as <code class="language-plaintext highlighter-rouge">SELECT filename, raw FROM vdt WHERE filename = $?</code> becomes possible.</p>

<h3 id="pros-1">Pros</h3>

<p>Breaking the pretense of “objects must actually exist” with Delta Lake is very
liberating. Encoding artifacts as Apache Parquet on demand means all
client-side libraries should be able to work seamlessly within their existing
environments.</p>

<p>When I think about potential approaches for implementing <code class="language-plaintext highlighter-rouge">vdt0</code> I can also
imagine many different potential avenues for optimization.</p>

<h3 id="cons-1">Cons</h3>

<p>While I really do like this idea, I’m not sure <em>how much</em> I should like it
considering the potential downsides:</p>

<ul>
  <li>Requires some existing structure behind the scenes to build up a sensible
virtual Delta log. For situations where artifacts are simply in a dumb bucket
somewhere, with no metadata already stored in a relational database,
producing a virtual transaction log would be quite difficult.</li>
  <li>I cannot imagine a sensible path for <strong>write</strong> workloads with <code class="language-plaintext highlighter-rouge">vdt0</code>.</li>
  <li>Without having implemented this (yet!) it is unclear how much compute time would be expended on uncached parquet file encoding.</li>
  <li>Most data scientists want the PDF/image/etc but they don’t <em>typically</em> want
the raw bytes that they then have to parse through.</li>
</ul>

<hr />

<h2 id="uh-what-if-you-just-dont-use-delta-lake">Uh, what if you just don’t use Delta Lake?</h2>

<p>Hey good question. Great interlude opportunity!</p>

<p>As a seller of fine hammers and hammer accessories, everything does in fact
look like a nail.</p>

<p>Delta Lake is kind of a means to an end for me here. I think its protocol has
enough maturity in terms of features and client capabilities to provide
<em>almost</em> everything I need from a multimodal storage system. I just can’t/don’t
want to shove everything into a Delta table per se.</p>

<hr />

<h2 id="vdt1-adding-virtual-legs">vdt1: adding virtual legs</h2>

<p>Since I have already indulged in the heretical idea of “what if we just make
the files up” I went a level further to consider <em>what if we got even more
virtualized</em>. One key characteristic I dislike about the <code class="language-plaintext highlighter-rouge">vdt0</code> approach is that
it is <em>too simple</em>, believe it or not.</p>

<p>When I think about artifacts like PDFs, they have far more structure than just
bytes. There are pages, typically sections, text, images, titles, footnotes,
and so on. For most machine learning use-cases the data scientist may be
interested in raw bytes for some projects but much more often they are
interested in the <em>parsed</em> and <em>structured</em> data of the artifact.</p>

<p>While my expertise is largely around text-based storage and processing, I would
imagine image/audio/video artifacts also have similar structure of interest to
data scientists.</p>

<p>Indulging in even more virtual thinking, I started to consider collections of
data all associated with an artifact. There’s the raw data schema above, but for PDFs I can also envision:</p>

<p><strong>Paragraphs</strong></p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>page</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>offset</td>
      <td><code class="language-plaintext highlighter-rouge">integer</code></td>
    </tr>
    <tr>
      <td>text</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>is_heading</td>
      <td><code class="language-plaintext highlighter-rouge">bool</code></td>
    </tr>
    <tr>
      <td>heading_level</td>
      <td><code class="language-plaintext highlighter-rouge">integer</code></td>
    </tr>
  </tbody>
</table>

<p><strong>Images</strong></p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>content_type</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>page</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>data</td>
      <td><code class="language-plaintext highlighter-rouge">binary</code></td>
    </tr>
    <tr>
      <td>bounds_x</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>bounds_y</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
  </tbody>
</table>

<p><strong>Links</strong></p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>page</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>href</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>label</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
  </tbody>
</table>

<p>Taken all together this represents only <em>around 20 columns</em> of data but could
represent <strong>most</strong> of the information needed for most multimodal workloads. I
mention the low column count because I have seen bug reports from Delta Lake
users talking about issues with tables containing <em>thousands of columns</em>.</p>

<p>A virtualized table schema could take these interior schemas and join them
together such that a single row might have: <code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">raw_filename</code>,
<code class="language-plaintext highlighter-rouge">raw_content_type</code>, <code class="language-plaintext highlighter-rouge">raw_url</code>, <code class="language-plaintext highlighter-rouge">raw_filesize</code>, <code class="language-plaintext highlighter-rouge">raw_data</code>, <code class="language-plaintext highlighter-rouge">raw_checksum</code>,
<code class="language-plaintext highlighter-rouge">raw_checksum_algo</code>, <code class="language-plaintext highlighter-rouge">paragraph_page</code>, <code class="language-plaintext highlighter-rouge">paragraph_text</code>, <code class="language-plaintext highlighter-rouge">paragraph_offset</code>,
<code class="language-plaintext highlighter-rouge">paragraph_is_heading</code>, <code class="language-plaintext highlighter-rouge">paragraph_heading_level</code>, <code class="language-plaintext highlighter-rouge">image_content_type</code>,
<code class="language-plaintext highlighter-rouge">image_page</code>, <code class="language-plaintext highlighter-rouge">image_data</code>, <code class="language-plaintext highlighter-rouge">image_bounds_x</code>, <code class="language-plaintext highlighter-rouge">image_bounds_y</code>, <code class="language-plaintext highlighter-rouge">link_page</code>,
<code class="language-plaintext highlighter-rouge">link_href</code>, <code class="language-plaintext highlighter-rouge">link_label</code>.</p>
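<p>Assembling that wide schema mechanically from the per-modality schemas is straightforward; the sketch below simply prefixes each leg’s columns, mirroring the naming above (the column lists are transcribed from the tables in this post):</p>

```python
# Sub-schemas from the tables above; every leg shares the artifact `id`.
RAW = ["filename", "content_type", "url", "filesize", "data", "checksum", "checksum_algo"]
PARAGRAPH = ["page", "text", "offset", "is_heading", "heading_level"]
IMAGE = ["content_type", "page", "data", "bounds_x", "bounds_y"]
LINK = ["page", "href", "label"]

def union_schema() -> list:
    """Join the legs into one wide schema: `id` plus prefixed, nullable columns."""
    columns = ["id"]
    for prefix, cols in [("raw", RAW), ("paragraph", PARAGRAPH),
                         ("image", IMAGE), ("link", LINK)]:
        columns.extend(f"{prefix}_{c}" for c in cols)
    return columns

schema = union_schema()
```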

<p>So long as the schema allows nullable columns for everything but <code class="language-plaintext highlighter-rouge">id</code>, the
<code class="language-plaintext highlighter-rouge">vdt</code> service can expose the disjointed data behind the scenes in a sensible
way with the <code class="language-plaintext highlighter-rouge">add</code> actions on the virtual Delta table and its file statistics.
For example, an <code class="language-plaintext highlighter-rouge">add</code> action which includes <code class="language-plaintext highlighter-rouge">link</code> data would report all other
columns as entirely null in the file statistics’ <code class="language-plaintext highlighter-rouge">nullCount</code> so that any engine
querying for <code class="language-plaintext highlighter-rouge">raw</code> columns would ignore that file entirely.</p>
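<p>The skipping logic an engine applies is simple enough to sketch: if the statistics show that every column a query needs is entirely null in a file, that file contributes nothing and can be ignored. (Delta file statistics record per-column null counts under <code>nullCount</code>; the action shape below is otherwise illustrative.)</p>

```python
import json

def can_skip(add_action: dict, wanted_columns: list) -> bool:
    """True when stats prove every wanted column is entirely null in this file."""
    stats = json.loads(add_action["add"]["stats"])
    rows = stats["numRecords"]
    null_count = stats.get("nullCount", {})
    # Skip only when *every* requested column is null for *every* row.
    return all(null_count.get(col, 0) == rows for col in wanted_columns)

# A virtual file carrying only link data: the raw_* columns are all null.
link_file = {
    "add": {
        "path": "datafiles/links-0001.parquet",
        "stats": json.dumps({
            "numRecords": 3,
            "nullCount": {"raw_data": 3, "raw_filename": 3, "link_href": 0},
        }),
    }
}

skip_for_raw = can_skip(link_file, ["raw_data", "raw_filename"])
skip_for_links = can_skip(link_file, ["link_href"])
```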

<h3 id="pros-2">Pros</h3>

<p>I think this structure would be possible to build in a traditional Delta Lake
system assuming one wished to re-encode data into new storage. Hiding existing
data behind a virtualized Delta table allows us to avoid data denormalization.</p>

<p>Similar to <code class="language-plaintext highlighter-rouge">vdt0</code> there are optimization and caching approaches that are
readily available with <code class="language-plaintext highlighter-rouge">vdt1</code> but unlike <code class="language-plaintext highlighter-rouge">vdt0</code> the “write path” is more
apparent to me with this approach. By hiding metadata about an artifact inside
the virtualized data structure, writes which add rows with those columns could
sensibly be accepted and inserted into an internal Delta or other table.</p>

<p>Depending on how metadata associated with an artifact is stored, the <code class="language-plaintext highlighter-rouge">vdt</code>
service could simply front a number of other conventional Delta tables and act
as a proxy, pushing predicates and I/O filtering “to the edge” as far as they
will go before collecting results for the query engine.</p>

<h3 id="cons-2">Cons</h3>

<p>This approach is certainly the most complex but could potentially require the least amount of re-encoding of existing data assets. The devil is in the details with how one might map existing data sources together. My sketch above places a tremendous amount of emphasis on an <code class="language-plaintext highlighter-rouge">id</code> which acts as a primary key between all the metadata associated with a singular artifact.</p>

<p>Nothing defined thus far accounts for potential changes in an artifact or its
metadata as time goes on. If a new version of an existing document is uploaded,
the new version should likely be considered “canonical” but be <em>appended</em>
rather than <em>merged</em> with existing records. How one might sensibly model that
in a system like Delta which doesn’t support referential integrity between
datasets leads me back to the “anchors” idea from before.  That said, I’m not
sure if that’s much ado about nothing.</p>

<hr />

<p>From a data storage standpoint one key aspect of multimodal data is that the
different modalities are presented to the end user or system <strong>together</strong>. What
I like about the virtual Delta tables concept is that it doesn’t require
substantial client changes to accomplish but <em>does</em> provide a path to present
various types of data <em>together</em> for a given artifact.</p>

<p>I have various bits and pieces of a potential <code class="language-plaintext highlighter-rouge">vdt</code> system lying around the
workshop floor. If the idea has legs I might take a crack at a prototype
implementation, but first I will need some feedback!</p>

<p>Let me know what you think by emailing me at <code class="language-plaintext highlighter-rouge">rtyler@</code> this domain!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="parquet" /><category term="deltalake" /><category term="ml" /><summary type="html"><![CDATA[The rate of change for data storage systems has accelerated to a frenzied pace and most storage architectures I have seen simply cannot keep up. Much of my time is spent thinking about large-scale tabular data stored in Delta Lake which is one of the “lakehouse” storage systems along with Apache Iceberg and others. These storage architectures were developed 5-10 years ago to solve problems faced moving from data warehouse architectures to massive scale structured data needs faced by many organizations. The storage changes we need today must support “multimodal data” which is a dramatic departure in many ways from the traditional query and usage patterns our existing infrastructure supports.]]></summary></entry><entry><title type="html">The challenges facing Delta Kernel</title><link href="https://brokenco.de//2026/01/12/delta-kernel-challenges.html" rel="alternate" type="text/html" title="The challenges facing Delta Kernel" /><published>2026-01-12T00:00:00+00:00</published><updated>2026-01-12T00:00:00+00:00</updated><id>https://brokenco.de//2026/01/12/delta-kernel-challenges</id><content type="html" xml:base="https://brokenco.de//2026/01/12/delta-kernel-challenges.html"><![CDATA[<p>The Delta Kernel is one of the most technically challenging and ambitious open
source projects I have worked on. Kernel is fundamentally about unifying <em>all</em>
of our needs and wants from a <a href="https://delta.io">Delta Lake</a> implementation
into a single cohesive yet pluggable API surface. Towards the end of 2025
<a href="https://github.com/tdas">TD</a> asked me to jot down some of the issues which
have been frustrating me and/or slowing down the adoption of kernel in projects
like <a href="https://github.com/delta-io/delta-rs">delta-rs</a>. At the outset of the
project we all discussed concerns about what could <em>actually be possible</em> as we
set out into uncharted territory. In many ways we have succeeded, in others we
have failed.</p>

<p>Reviewing the history, I was the second developer, behind
<a href="https://github.com/zachschuermann">Zach</a>, to commit code to the project.
Like all open source projects, Delta Kernel is the work of numerous people who
have all poured their time into making something happen <em>together</em>. I regularly
work with Robert, Zach, Nick, Ryan, and Steve to make delta-rs and
delta-kernel-rs <strong>better</strong>.</p>

<p>While we all have our personal motivations, we also have direction guided by our
employers in some cases. That means the goals for kernel from Databricks may
not align with my employer (<a href="https://tech.scribd.com">Scribd</a>), or others
participating in the project. This complicates trade-off decisions in many open
source projects where personal, professional, and hobby motivations intersect.</p>

<p>My hope is to characterize the weaknesses in kernel so that we can collectively
adjust in 2026 to make improvements in both the technical design of kernel, but
also the <em>community</em> and culture around kernel.</p>

<h2 id="design">Design</h2>

<p>From my perspective the original design trade-offs made in kernel were largely
driven by two key factors:</p>

<ol>
  <li><strong>Portability with non-Rust engines</strong>: this dictated the need for an
<a href="https://en.wikipedia.org/wiki/Foreign_function_interface">FFI</a> abstraction
on day zero. The <a href="https://duckdb.org/docs/stable/core_extensions/delta">Delta extension for
DuckDB</a> had an
outsized influence on this due ostensibly to a desire from Databricks to
make DuckDB and Delta be best friendsies.</li>
  <li><strong>The Java kernel</strong>: the Delta kernel is actually <em>two</em> implementations, one
in Java for unifying JVM-based connectors, and one in Rust for basically
everybody else. Due to the number of folks involved in the Java kernel, the
Rust implementation was <em>strongly</em> encouraged to take design cues from the
Java design.</li>
</ol>

<p>More than anything these two factors have contributed to a number of what I
would consider original load-bearing sins of design for delta-kernel-rs.</p>

<blockquote>
  <p>These trade-offs resulted in a Rust-based project which <strong>abandons most of
the important benefits for using Rust</strong>.</p>
</blockquote>

<h3 id="building-for-the-lowest-common-denominator">Building for the lowest common Denominator</h3>

<p>Supporting cross-language and runtime interoperability is <strong>brutal</strong>. I have
done a lot of cross-language support for Ruby and Python projects in the past,
where at some point <em>somewhere</em> there’s a pointer being passed from one world
into another. It is objectively <strong>awful</strong>.</p>

<p>Over the years of delta-rs people have tried adding FFI hooks into it, despite
us never making <em>any</em> accommodations for it. Seriously, as recently as <a href="https://github.com/delta-io/delta-rs/issues/3973">this
month</a> somebody popped up
with yet-another set of Golang FFI bindings on top of delta-rs.</p>

<h4 id="ffi-is-hell">FFI is hell.</h4>

<p>A hell that we <em>intentionally marched into</em> with Delta kernel. For
the uninitiated, FFI is basically a convention for allowing multiple languages to
meet at a C <a href="https://en.wikipedia.org/wiki/Application_binary_interface">ABI
layer</a> and pass
pointers back and forth. There is more to it around memory layout and other
silliness, but basically, it’s a way for everybody to dumb themselves down to a
C-style interface.</p>

<p>FFI is also stupid, but it is basically how all higher-level languages
such as Python, Ruby, JavaScript, Golang, Rust, etc. work. Somewhere down there
in the stack is a pointer passing into C-based system calls on your machine.
There be monsters.</p>

<p>One of our early design disagreements made to accommodate FFI-based engines was
the adoption of <code class="language-plaintext highlighter-rouge">Iterator</code> based interfaces rather than <code class="language-plaintext highlighter-rouge">Future</code> based
interfaces. Previously I <a href="/2025/12/16/parallelism-is-tricky.html">wrote about our parallelism
challenges</a> which stem from this design
trade-off.</p>

<p>The debate was whether to embed an evented reactor like
<a href="https://tokio.rs">Tokio</a> <em>inside</em> kernel and hide it from the FFI caller, or
make the caller responsible for making things event-driven. The early
influence of DuckDB weighed on the scales here, and the decision was made to
avoid embedding Tokio inside kernel.</p>

<p>In the Rust ecosystem it has taken a <em>long time</em> for us to <a href="https://areweasyncyet.rs/">become
async</a>. If you were curious why there has been such
an explosion of Rust across the systems programming ecosystem in the last five
years it’s because <strong>the Rust ecosystem is async</strong>.</p>

<p>The <em>first</em> Rust application I deployed into production used <code class="language-plaintext highlighter-rouge">async/await</code> from
the beginning, and without <em>any profiling</em> was an order of magnitude faster
than the system it replaced.</p>

<p><code class="language-plaintext highlighter-rouge">async/await</code> is the reason delta-rs was even successful in the first place!</p>

<p>There are ways to hack around the limitations of the <code class="language-plaintext highlighter-rouge">Iterator</code>
based API in Delta kernel, but the hill is <em>very</em> steep and will require
significant investment to make some parts of Delta kernel as fast as parallel
reads/scans would otherwise be.</p>
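<p>One shape of such a workaround, sketched here with std threads and a bounded
channel (hypothetical names, not kernel’s actual code), is to eagerly drive a
blocking <code class="language-plaintext highlighter-rouge">Iterator</code> on a background thread so that producing the next batch
overlaps with the consumer’s own processing:</p>

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Drive a blocking iterator eagerly on a background thread,
/// overlapping production with the consumer's own work.
/// `bound` limits how far ahead the producer may run.
fn prefetch<I>(iter: I, bound: usize) -> Receiver<I::Item>
where
    I: Iterator + Send + 'static,
    I::Item: Send + 'static,
{
    let (tx, rx) = sync_channel(bound);
    thread::spawn(move || {
        for item in iter {
            // Blocks once `bound` items are queued: backpressure.
            if tx.send(item).is_err() {
                break; // consumer hung up
            }
        }
    });
    rx
}

fn main() {
    // Simulate slow batch production, e.g. reading log segments.
    let slow = (0..4).map(|i| {
        thread::sleep(std::time::Duration::from_millis(10));
        i * 10
    });
    let batches: Vec<i32> = prefetch(slow, 2).into_iter().collect();
    assert_eq!(batches, vec![0, 10, 20, 30]);
}
```

<p>This keeps the public API an <code class="language-plaintext highlighter-rouge">Iterator</code>, but it only buys pipelining, not the
fan-out parallelism an async reactor would offer.</p>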

<p><code class="language-plaintext highlighter-rouge">async/await</code> gives incredible performance for free, but Delta kernel’s design choices mean it cannot take advantage and must pay the price.</p>

<h3 id="enginedata"><code class="language-plaintext highlighter-rouge">EngineData</code></h3>

<p>I am not smart enough to work on some parts of Delta kernel because of the
cleverness that is <code class="language-plaintext highlighter-rouge">EngineData</code>. Similar to
<a href="https://github.com/apache/arrow-rs">arrow-rs</a> and its <code class="language-plaintext highlighter-rouge">RecordBatch</code> and
<code class="language-plaintext highlighter-rouge">ArrayData</code> implementations, <code class="language-plaintext highlighter-rouge">EngineData</code> is an opaque type-erased container
for <em>stuff</em> and <em>things</em>.</p>

<p>One of the reasons I struggled to learn Rust, but ultimately came to love
the language, is its strong type system, which helps prevent whole classes of
problems. The strong type system also makes it a lot simpler for me to reason
about the code when I am working with it.</p>

<p>Everything in Delta kernel is
<a href="https://docs.rs/delta_kernel/latest/delta_kernel/engine_data/trait.EngineData.html">EngineData</a>
in one form or another. I was pretty preoccupied when this interface was
originally being hammered out so I’m less familiar with the history of
decisions that went into it, but I find the API of <code class="language-plaintext highlighter-rouge">EngineData</code> and its
counterparts of
<a href="https://docs.rs/delta_kernel/latest/delta_kernel/engine_data/trait.RowVisitor.html">RowVisitor</a>,
<a href="https://docs.rs/delta_kernel/latest/delta_kernel/engine_data/trait.GetData.html">GetData</a>,
and
<a href="https://docs.rs/delta_kernel/latest/delta_kernel/engine_data/trait.TypedGetData.html">TypedGetData</a>
to be <em>very</em> unpleasant to work with.</p>

<p>I <em>also</em> find
<a href="https://docs.rs/arrow/latest/arrow/array/struct.RecordBatch.html">RecordBatch</a>
unpleasant to work with. I really struggle to think of more user-unfriendly
APIs in the Rust data ecosystem. In the case of arrow’s <code class="language-plaintext highlighter-rouge">RecordBatch</code> I have
watched some of my colleagues pull in the <em>entire</em>
<a href="https://crates.io/crates/datafusion">datafusion</a> dependency just so they can
work with <code class="language-plaintext highlighter-rouge">RecordBatch</code> without resorting to the array offset and indices
silliness that permeates Apache Arrow code.</p>

<p>As unpleasant as I find <code class="language-plaintext highlighter-rouge">RecordBatch</code> there are <em>thousands</em> of developers
invested in its APIs and supporting infrastructure. <code class="language-plaintext highlighter-rouge">EngineData</code> does not have
a similar level of tooling, but shares some of the same razor-sharp edges.</p>

<p>The <code class="language-plaintext highlighter-rouge">EngineData</code> design has resulted in a <em>lot</em> of brittle <a href="https://github.com/delta-io/delta-kernel-rs/blob/e019ac3fa18707b633f625418d661ed198c86759/kernel/src/actions/visitors.rs#L114-L120">fixed array
offsets</a>
being littered throughout the Delta kernel codebase. These “getters” and the
visitors APIs result in the Rust type checker being <em>far</em> less useful with
Delta kernel than a more conventionally structured Rust project. This also
results in a much larger likelihood of runtime errors being emitted for
problems rather than compile-time checks.</p>

<p>The type-erased opaque bucket of bytes design of <code class="language-plaintext highlighter-rouge">EngineData</code> means that
working inside of <em>or with</em> Delta kernel sacrifices one of the most important
characteristics of the Rust language: the type checker.</p>
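<p>To illustrate the trade-off (a heavily simplified, hypothetical sketch, not
kernel’s real <code class="language-plaintext highlighter-rouge">EngineData</code> API), compare an offset-addressed, type-erased getter
with a plain typed struct: the former turns mistakes into silent runtime failures,
the latter into compile errors:</p>

```rust
use std::any::Any;

// Hypothetical stand-in for a type-erased container: values are
// addressed by fixed offset and typed only at the moment of access.
struct ErasedRow {
    fields: Vec<Box<dyn Any>>,
}

impl ErasedRow {
    /// A `TypedGetData`-style getter: offset and expected type are
    /// only checked at runtime.
    fn get<T: 'static>(&self, offset: usize) -> Option<&T> {
        self.fields.get(offset)?.downcast_ref::<T>()
    }
}

// The conventional alternative: the compiler checks every access.
struct AddRow {
    path: String,
    size: i64,
}

fn main() {
    let erased = ErasedRow {
        fields: vec![
            Box::new("part-0001.parquet".to_string()),
            Box::new(1024i64),
        ],
    };

    // Works, but only because offset 1 happens to hold an i64 today.
    assert_eq!(erased.get::<i64>(1), Some(&1024));
    // A stale offset or wrong type is a silent runtime `None`:
    assert_eq!(erased.get::<i64>(0), None);

    let typed = AddRow { path: "part-0001.parquet".into(), size: 1024 };
    assert_eq!(typed.size, 1024); // a typo'd field name would not compile
    assert_eq!(typed.path, "part-0001.parquet");
}
```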

<hr />

<p>There are some good pieces of the design which honestly I cannot speak to
because I don’t stub my toes on them. Ryan and I have discussed at length the
importance of deferring work as long as possible in kernel to achieve higher
performance. Some of the Expression and Transform APIs allow for lower memory
footprints and faster log replay when work can be deferred or outright
<em>avoided</em>.</p>

<p>In delta-rs some of the performance deficiencies we have seen since adopting
Delta kernel have more to do with our interop code than with kernel design
decisions. The delta-rs project is <em>massive</em>. As a general purpose Delta Lake
implementation, the surface area of changes that
<a href="https://github.com/roeap">Robert</a> had to touch to even get us to where we
are today is huge; his effort has been nothing short of heroic.</p>

<h2 id="community">Community</h2>

<p>The Delta kernel project is the first one I have worked on with Databricks
where there is <em>some</em> transparency around the week-to-week operations. 
The kernel Rust community has weekly meetings where
developers are talking to developers. 
Many of my early conversations with <a href="https://dennyglee.com/">Denny</a> were around
the propensity for Databricks to dump code into the Delta project as a fait
accompli. In one particularly egregious situation, there were protocol and
Delta/Spark changes which were reviewed, approved, and merged by Databricks
employees the week before being announced at <a href="https://dataandaisummit.com">Data and AI
Summit</a>. Kernel gets this right.</p>

<p>Even though I cannot make every weekly call with the kernel community, I love it when I can.</p>

<p><em>I don’t always attend the kernel weekly call, but when I do, I’m asking when the next release will happen.</em></p>

<p>For reasons I don’t think anybody really understands, Delta kernel moves <em>very</em>
slowly. Patch releases are of particular importance to me because delta-rs has
started to depend on the Delta kernel for its protocol implementation and
therefore <em>many</em> of our new bugs relate to Delta kernel in some way or another.</p>

<p>Releases have averaged around one every three weeks in 2025. Nine of the thirty
versions released to
<a href="https://crates.io/crates/delta_kernel/versions">crates.io</a> were patch fixes;
the other 21 were minor version bumps, which means <strong>70%</strong> of published
releases contained API breaking changes. Some of that is inevitable as developers
are figuring out the appropriate shape of different APIs. As a downstream consumer
of this release cycle it means that I am highly unlikely to ever receive bug fixes
without development effort to adapt to ever-changing APIs.</p>

<p>There is no free lunch.</p>

<p>For the <a href="https://crates.io/crates/deltalake">delta-rs</a> project this means our releases are <em>frequently blocked</em> on:</p>

<ul>
  <li>Delta kernel</li>
  <li><a href="https://crates.io/crates/arrow">Apache Arrow</a></li>
  <li><a href="https://crates.io/crates/datafusion">Apache Datafusion</a></li>
</ul>

<p>Delta kernel ships with a default engine that has a major version dependency on
Apache Arrow, a project which <em>also</em> avoids patch releases. This compounding
effect means that when a new <code class="language-plaintext highlighter-rouge">arrow</code> is released we (delta-rs) must wait for
that to be incorporated into both <code class="language-plaintext highlighter-rouge">datafusion</code> and <code class="language-plaintext highlighter-rouge">delta_kernel</code>, and for both
those crates to be released.</p>

<blockquote>
  <p>Any issue reported to delta-rs which requires a change in Arrow or Delta kernel
will typically take 1-2 months to resolve.</p>
</blockquote>

<h3 id="no-need-to-wait">No need to wait</h3>

<p>Up until yesterday, the latest released
<a href="https://crates.io/crates/deltalake/">deltalake</a> crate was <code class="language-plaintext highlighter-rouge">0.29.4</code> which
depended on Delta kernel <code class="language-plaintext highlighter-rouge">0.16.0</code>. That version is three months old and
unfortunately never saw any patch releases, which is part of the reason all four of the <code class="language-plaintext highlighter-rouge">0.29.x</code> releases of delta-rs depended upon it.</p>

<p>Using the crate downloads statistics as a <em>very</em> unscientific measure, I would
hazard a guess that <code class="language-plaintext highlighter-rouge">delta-rs</code> drives the majority of downloads for Delta
kernel.</p>

<p><img src="/images/post-images/2025-delta-kernel/delta_kernel_downloads.png" alt="delta_kernel downloads showing a lot of &quot;Other&quot;" /></p>

<p>The <code class="language-plaintext highlighter-rouge">0.18.0</code> release went out on November 20th, which has a small uptick, but
then the big spike in early December correlates strongly with
<a href="https://github.com/delta-io/delta-rs/pull/3949">this pull request</a>, which pulled
<code class="language-plaintext highlighter-rouge">0.18.x</code> into the delta-rs repository.</p>

<p>For completeness’ sake, the <code class="language-plaintext highlighter-rouge">deltalake</code> crate’s downloads have a very similar
shape, but due to the longer release cycle of <code class="language-plaintext highlighter-rouge">0.29.x</code> it is difficult to tell
which versions are being heavily downloaded.</p>

<p><img src="/images/post-images/2025-delta-kernel/deltalake_downloads.png" alt="deltalake downloads also showing plenty of &quot;Other&quot;" /></p>

<hr />

<p>Maintaining stable APIs is a pain, but becomes much more important the lower in
the stack any dependency lives.</p>

<p>One approach could be to create release branches which have changes
cherry-picked between them as is needed. This introduces more release
engineering work and can be challenging. For my own purposes I <em>have done this</em>
and backported fixes for both Delta kernel and delta-rs in various shapes to
support customers who cannot boil the ocean with unstable releases every two to
three weeks.</p>

<p>At <a href="https://tech.scribd.com">Scribd</a> a patch release of delta-rs, with <em>zero API changes</em> requires at least:</p>

<ul>
  <li>New Lambdas to be built.</li>
  <li>Those Lambdas to be deployed to a testing environment.</li>
  <li><em>waiting for enough data volume to demonstrate reliability</em></li>
  <li>Promotion of a Lambda to a production environment.</li>
  <li><em>waiting for enough data volume to demonstrate success</em></li>
</ul>

<p>When everything operates smoothly this is about two developer-hours of time
from end to end, but that is with <em>zero API changes</em>.</p>

<p>Every set of API changes in delta-rs, Delta kernel, or Apache Arrow introduces
unknown developer time to perform updates and upgrades. Unless a new release of
<em>any</em> of these dependencies confers significant performance or quality
improvements, the business looks at these upgrades as <strong>unnecessary cost</strong> and
instead prefers to simply <em>not</em> update.</p>

<p>As a consequence bugs can be discovered in production months after a given
Delta kernel release. For example <a href="https://github.com/delta-io/delta-kernel-rs/pull/1561">this performance
bug</a> in Delta kernel had
actually existed for <strong>months</strong> in released crates. It was not until delta-rs
adopted more of Delta kernel that I was able to bring upgrades all the way
to production and discover <a href="https://github.com/buoyant-data/oxbow/commit/2363be8869a025b90bc46c2d7ed1893aca2d37e4">a couple of serious performance issues in delta-rs and Delta kernel</a>.</p>

<p>This timeline is getting a little confusing even for me, so let’s recap:</p>

<ul>
  <li><strong>October 2024</strong>: <a href="https://github.com/delta-io/delta-kernel-rs/pull/373">A JSON parsing workaround introduced</a> into kernel and released in <code class="language-plaintext highlighter-rouge">0.4.0</code>.</li>
  <li><strong>July 2025</strong>: <a href="https://crates.io/crates/deltalake/0.27.0">deltalake 0.27.0</a>
released with first serious adoption of Delta kernel at <code class="language-plaintext highlighter-rouge">0.13.0</code>.</li>
  <li><strong>August 2025</strong>: delta-rs performance <a href="https://github.com/delta-io/delta-rs/pull/3660">issue identified and fixed</a> along with a separate Delta kernel <a href="https://github.com/delta-io/delta-kernel-rs/pull/1171">performance issue with wide tables identified</a>. Both problems were identified after I invested some spare work-cycles in using pre-release code to interact with production data sets at Scribd.</li>
  <li><strong>September 2025</strong>: <a href="https://github.com/buoyant-data/oxbow/commit/d8f7b683d7ff1498d1c2eea96a2642d8f5b490c4">oxbow incorporates 0.28.0</a> and that’s quickly reverted until delta-rs <code class="language-plaintext highlighter-rouge">0.29.x</code> is released with additional improvements, both in the crate and in the newer Delta kernel <code class="language-plaintext highlighter-rouge">0.16.0</code> it incorporates.</li>
</ul>

<p>From my perspective, the amount of time invested in the performance issues
alone has not been “paid back” by improvements delivered from Delta kernel.</p>

<hr />
<p><strong>NOTE:</strong> HR would like to remind me to adopt a growth-mindset.</p>

<p>The improvements from incorporating Delta kernel have not paid back the time-invested <strong><em>yet</em></strong>.</p>

<hr />

<p>For more than a year there were performance issues sitting in <code class="language-plaintext highlighter-rouge">main</code> and
released kernel crates.</p>

<p>The time delay between changes being made in kernel and those changes being
used for real workloads is <strong>long</strong>. Too long to be useful as a constructive
feedback cycle for development.</p>

<p>I believe the only way to improve this is with faster releases and faster
feedback.</p>

<h3 id="have-you-tried-just">Have you tried just</h3>

<p>The very-long user-feedback loop on released changes is only half of the
velocity troubles afflicting Delta kernel. I have personally avoided
contributing too much because the amount of yak-shaving can be pretty wild.</p>

<p>The performance improvement I recently suggested set a new personal TOP SCORE,
garnering a total of <em>84 comments</em> in the back-and-forth with four different
maintainers. That is more pull request comments than lines changed in the patch.</p>

<p>What is sometimes difficult to remember as a
maintainer is that a pull request does not represent the <em>start</em> of time
invested by a contributor. A pull request is usually the <em>end</em> of their
time-investment. In this case I had already invested between 5-8 hours of
profiling and understanding the issue before I could create the change.</p>

<p>Hidden in the yak-shaving <em>was useful feedback</em>, but the process was so frustrating
that I eventually threw in the towel and asked Nick to take it over after
about 12 hours of total time invested.</p>

<p>Of the currently <a href="https://github.com/delta-io/delta-kernel-rs/pulls?q=is%3Apr+is%3Aopen+sort%3Acomments-desc">open pull
requests</a>
the one with the most comments is at 99. Of the <a href="https://github.com/delta-io/delta-kernel-rs/pulls?q=is%3Apr+sort%3Acomments-desc+is%3Aclosed">closed pull
requests</a>
my maddening 84 comment odyssey doesn’t even fit on the <strong>first page</strong> of “most
commented” pull requests. The top spot is claimed by <a href="https://github.com/delta-io/delta-kernel-rs/pull/109">this pull
request</a> which has 369
comments and took over two months from open to merge. That monster is somewhat
of an outlier because it represents a substantial change earlier in the history
of Delta kernel, but a number of other changes are very much in the
hundreds-of-comments range.</p>

<p>The pull request culture in Delta kernel is fundamentally contributor hostile.</p>

<p>The suggestions I made to Nick on how to improve this are:</p>

<ul>
  <li>Assigning one maintainer (e.g. <code class="language-plaintext highlighter-rouge">CODEOWNERS</code>) to review each pull request.
There is relatively little benefit from multiple people offering differing
opinions on a non-maintainer’s pull request.</li>
  <li>Contributors should feel like their goals are shared with maintainers. The
<a href="https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/incorporating-feedback-in-your-pull-request">suggest
change</a>
functionality of GitHub pull requests is fantastic for this. Rather than
leaving a wall of text, suggesting direct code changes helps convey a shared
investment in the pull request.</li>
  <li>Better yet, rather than asking for tests or changes, <strong>make the changes</strong>.
Most contributors allow maintainers to push to their fork’s topic branches. I
regularly use this to add regression tests to contributors’ pull requests,
rather than asking them “please write a test.” Modelling good behavior
is usually more successful than <em>telling</em>.</li>
</ul>

<p>Some other ideas that come to mind:</p>

<ul>
  <li>Any comment with “nit: “ should simply be deleted. I see this at work from
time to time and will privately discuss with the developer how anti-social
that behavior comes across. Any bit of feedback that somebody feels is
nitpicky should be made in a follow up pull request or just <em>not</em>. Nitpicks
are a waste of everybody’s time.</li>
  <li>There is a habit of “stacking PRs” in this project and as I write this, there
are <strong>19</strong> open “stacked” pull requests. Smaller commits and smaller pull
requests should be preferred and would move quicker. I think there are a <em>lot</em> of
comments on pull requests because each pull request ends up being fairly
large and sits in an Open state for a long time.</li>
</ul>

<p>Many developers believe that code “stabilizes” as if some magic happens to code
in <code class="language-plaintext highlighter-rouge">main</code>. All code has a rapidly decaying half-life, especially code which
sits in open pull requests. The only way to demonstrate that anything is good
or bad is for it to be <em>used</em>. Stability comes from <em>use</em>.</p>

<p>I think everybody involved in the Delta kernel project, myself included, wants
a stable and high-performance foundation to build our Delta-based applications.
As Jez Humble and David Farley wrote in the book on <a href="https://en.wikipedia.org/wiki/Continuous_delivery">Continuous
Delivery</a>, a long cycle time
is usually <em>antithetical</em> to stability and reliability.</p>

<h2 id="theyre-good-kernels-brent">They’re good kernels Brent</h2>

<p>Golly this has been a bunch of words. To quote a wise man:</p>

<blockquote>
  <p>The Delta Kernel is one of the most technically challenging and ambitious open source projects</p>
</blockquote>

<p>I believe in the vision of Delta kernel and certainly wouldn’t be here if I
didn’t. The fragmentation that I see in the ecosystem is causing nothing but
trouble. Since starting this essay I have encountered <em>two</em> new and quirky
derivatives of delta-rs code which are trying to coerce it to do things which
Delta kernel is meant to support. In fact, the status quo of Delta kernel
already supports the two use-cases I stumbled into!</p>

<p>Having a stable and high-performance foundation means that features and
improvements added into kernel benefit <em>everybody</em>! How marvelous is that? The
trick is getting <em>everybody</em> to use kernel!</p>

<p>Kernel’s success is important to the Delta Lake ecosystem and numerous others.
For kernel to succeed, however, I believe we need to adjust course in 2026 and
build a stronger technology foundation: introducing more idiomatic Rust code,
leaning more heavily on the strengths of the Rust ecosystem in the interfaces,
and supporting Rust implementations with async/await as a focus rather than FFI.</p>

<p>Building in a more Rust-familiar way will enable more new contributors along
with their fresh perspectives. We will need to improve our release cadence and
change management into something clear and predictable. Making new developers
feel welcomed and their contributions valued will solidify kernel’s place as
the foundation in the ecosystem.</p>

<p>Stronger technology <em>and</em> a stronger community in 2026 will help Delta kernel
overcome the challenges we face today.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="deltalake" /><category term="opinion" /><summary type="html"><![CDATA[The Delta Kernel is one of the most technically challenging and ambitious open source projects I have worked on. Kernel is fundamentally about unifying all of our needs and wants from a Delta Lake implementation into a single cohesive yet-pluggable API surface. Towards the end of 2025 TD asked me to jot down some of the issues which have been frustrating me and/or slowing down the adoption of kernel in projects like delta-rs. At the outset of the project we all discussed concerns about what could actually be possible as we set out into uncharted territory. In many ways we have succeeded, in others we have failed.]]></summary></entry><entry><title type="html">Parallelism is a little tricky</title><link href="https://brokenco.de//2025/12/16/parallelism-is-tricky.html" rel="alternate" type="text/html" title="Parallelism is a little tricky" /><published>2025-12-16T00:00:00+00:00</published><updated>2025-12-16T00:00:00+00:00</updated><id>https://brokenco.de//2025/12/16/parallelism-is-tricky</id><content type="html" xml:base="https://brokenco.de//2025/12/16/parallelism-is-tricky.html"><![CDATA[<p>In theory many developers understand concurrency and parallelism, in practice I
think almost none of us do. At least not all the time. Building a mental model
of highly parallel interdependent software is incredibly time-consuming,
difficult, and error-prone. I have recently been doing a <em>lot</em> of performance
analysis with both <a href="https://github.com/delta-io/delta-rs">delta-rs</a> and
<a href="https://github.com/delta-io/delta-kernel-rs">delta-kernel-rs</a>. In the process
I have had to check some of my own assumptions of how things <em>should</em> work
compared to how they <em>do</em> work.</p>

<hr />
<p>Sidenote: to get an idea of how frequently we all “get it wrong”, subscribe to Aphyr’s <a href="https://jepsen.io/blog">Jepsen blog</a> for distributed systems safety research.</p>

<hr />

<p>The Delta Lake Rust binding has relied on <a href="https://tokio.rs/">Tokio</a> since the
beginning, which as any <code class="language-plaintext highlighter-rouge">/r/rust</code> commenter knows is an easy turbo button to
solve all your performance and parallelism needs!</p>

<p>When we were designing kernel however, there was a strong motivation <em>not</em> to
take a direct dependency on Tokio. Due to some early influences in the project,
there was a pretty strong push to support C/C++ based engines with
delta-kernel-rs. Those engines would need a Foreign-function Interface (FFI)
and pushing something like Tokio or even
<a href="https://docs.rs/futures/latest/futures/">futures</a> over an FFI boundary was
unsavory to say the least.</p>

<p>What may be one of our original performance sins in kernel was designing APIs
around the <a href="https://doc.rust-lang.org/std/iter/trait.Iterator.html">Iterator</a>
trait. I am writing this partially to help form my thoughts, but consider this screenshot from
<a href="https://github.com/KDAB/hotspot">Hotspot</a> showing Tokio tasks doing the work of “log replay” when opening a large complex Delta table:</p>

<p><img src="/images/post-images/2025-12-delta-rs/tokio-thread-switching.png" alt="Context switching in tasks" /></p>

<p>These two tasks are <em>concurrent</em> but they are not parallel. In <code class="language-plaintext highlighter-rouge">Iterator</code>
terms, this is about what I would expect to see. The conceptual model for execution is:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">Iterator</code> created.</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
  <li>“do work”</li>
  <li>return result</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
</ol>

<p>The fact that work is being done on different tasks is irrelevant. <code class="language-plaintext highlighter-rouge">Iterator</code>
is lazy, but is only going to “do work” when it is asked, thus a serial
invocation model.</p>

<p>When parallelism is the design goal, work <strong>must</strong> be able to happen at the same
time, but it does not necessarily need to be requested “lazily” in the
style of the <code class="language-plaintext highlighter-rouge">Iterator</code> trait.</p>

<p>In delta-rs <a href="https://github.com/roeap">Robert</a> pulled in some code from
<a href="https://datafusion.apache.org">Datafusion</a> which relies on Tokio’s
<a href="https://docs.rs/tokio/latest/tokio/task/struct.JoinSet.html">JoinSet</a> API.  The <code class="language-plaintext highlighter-rouge">JoinSet</code> is effectively what we want if we want an Iterator-style parallel work executor:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">JoinSet</code> created, “do work” begins</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
  <li>return result</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
  <li>return result</li>
  <li>“do work”</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
  <li>return result</li>
</ol>
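<p>The difference between the two models can be sketched with std threads and a
channel standing in for Tokio’s <code class="language-plaintext highlighter-rouge">JoinSet</code> (used here so the sketch stays
dependency-free): in the eager model the work is already running before the
consumer ever asks for a result.</p>

```rust
use std::sync::mpsc::channel;
use std::thread;
use std::time::{Duration, Instant};

// Lazy, Iterator-style: work happens inside each next() call, serially.
fn lazy_work(n: u64) -> (Vec<u64>, Duration) {
    let start = Instant::now();
    let out = (0..n)
        .map(|i| {
            thread::sleep(Duration::from_millis(20)); // "do work"
            i
        })
        .collect();
    (out, start.elapsed())
}

// Eager, JoinSet-style: all work is spawned up front; the consumer
// only receives finished results, in completion order.
fn eager_work(n: u64) -> (Vec<u64>, Duration) {
    let start = Instant::now();
    let (tx, rx) = channel();
    for i in 0..n {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(20)); // "do work"
            tx.send(i).unwrap();
        });
    }
    drop(tx); // channel closes once every worker has finished
    let mut out: Vec<u64> = rx.into_iter().collect();
    out.sort();
    (out, start.elapsed())
}

fn main() {
    let (lazy, lazy_t) = lazy_work(4);
    let (eager, eager_t) = eager_work(4);
    assert_eq!(lazy, eager);
    // The four sleeps overlap instead of queueing behind next().
    assert!(eager_t < lazy_t);
}
```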

<p>Currently the use of <code class="language-plaintext highlighter-rouge">JoinSet</code> happens much higher in the stack inside of
delta-rs, but does <em>not</em> happen deeper down in the delta-kernel-rs code.</p>

<p>What the profiling <em>likely</em> indicates is that there are serial <code class="language-plaintext highlighter-rouge">Iterator</code>
executions happening in the kernel layer which lead to a bottleneck for
callers, regardless of how parallel-capable those callers may be.</p>

<hr />

<p>Tokio has received criticism in the past about its suitability for heavy
CPU-bound operations. Its async/await primitives work incredibly well for
anything which has I/O wait involved. The scheduler can switch between tasks
when a socket is awaiting data, making it highly concurrent for I/O-bound
applications. Tokio functions similarly to Goroutines in Golang, greenlets in
Python, etc. As I dug deeper into this problem I wanted to ensure that Tokio
was going to behave as I expected with CPU-bound operations.</p>

<p>I compared performance of a <code class="language-plaintext highlighter-rouge">JoinSet</code> based program which generates
RSA keys, and a <a href="https://crates.io/crates/rayon">rayon</a> based program. Both are
close enough in performance and parallelism. Both effectively used all
available cores when the Tokio runtime was configured with a single worker
thread per core.</p>
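<p>The shape of that sanity check, rewritten here with scoped std threads and a
cheap stand-in computation (the real comparison used Tokio’s <code class="language-plaintext highlighter-rouge">JoinSet</code>, rayon,
and RSA key generation), looks something like:</p>

```rust
use std::thread;

// Deterministic stand-in for a CPU-bound task such as RSA keygen.
fn busy_work(seed: u64) -> u64 {
    (0..200_000).fold(seed, |acc, i| {
        acc.wrapping_mul(6364136223846793005).wrapping_add(i)
    })
}

fn main() {
    let jobs: Vec<u64> = (0..16).collect();
    let workers = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(4);

    // One OS thread per core, each taking a contiguous chunk of jobs:
    // roughly what a sized Tokio runtime or rayon arranges for you.
    let chunk_size = (jobs.len() + workers - 1) / workers;
    let results: Vec<u64> = thread::scope(|s| {
        let handles: Vec<_> = jobs
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk.iter().map(|&j| busy_work(j)).collect::<Vec<_>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    });

    // Joining in spawn order keeps results in job order.
    let expected: Vec<u64> = jobs.iter().map(|&j| busy_work(j)).collect();
    assert_eq!(results, expected);
}
```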

<hr />

<p>Coming back to the Delta Lake ecosystem and our beloved <code class="language-plaintext highlighter-rouge">Iterator</code>. I think
there are two paths ahead:</p>

<ul>
  <li>The Easy Road: taking <code class="language-plaintext highlighter-rouge">JoinSet</code> into the default engine of delta-kernel-rs
will at least alleviate some of the “concurrent but not parallel” problems
that are lurking down there.</li>
  <li>The Hard Road: attempting to put a synchronous <code class="language-plaintext highlighter-rouge">Engine</code> interface in front of
inherently I/O bound operations is going to lead to performance deficiencies
compared to an evented system like Tokio or anything else with a kqueue/epoll
reactor at its core. Putting async/await at the foundation of delta-kernel-rs
would allow for driving more concurrent and parallel behavior depending on
the use-case.</li>
</ul>

<p>The performance of delta-rs is a major focus of my work in the project. In 2026 I look
forward to sharing more analysis and more <a href="https://github.com/delta-io/delta-kernel-rs/pull/1561">pull
requests</a>!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="deltalake" /><summary type="html"><![CDATA[In theory many developers understand concurrency and parallelism, in practice I think almost none of us do. At least not all the time. Building a mental model of highly parallel interdependent software is incredibly time-consuming, difficult, and error-prone. I have recently been doing a lot of performance analysis with both delta-rs and delta-kernel-rs. In the process I have had to check some of my own assumptions of how things should work compared to how they do work.]]></summary></entry><entry><title type="html">The end of the road for kafka-delta-ingest</title><link href="https://brokenco.de//2025/10/30/kafka-delta-ingest-was-fun.html" rel="alternate" type="text/html" title="The end of the road for kafka-delta-ingest" /><published>2025-10-30T00:00:00+00:00</published><updated>2025-10-30T00:00:00+00:00</updated><id>https://brokenco.de//2025/10/30/kafka-delta-ingest-was-fun</id><content type="html" xml:base="https://brokenco.de//2025/10/30/kafka-delta-ingest-was-fun.html"><![CDATA[<p>After five years in production kafka-delta-ingest at Scribd has been shut off
and removed from our infrastructure.
<a href="https://github.com/delta-io/kafka-delta-ingest">kafka-delta-ingest</a> was the
motivation behind my team creating
<a href="https://github.com/delta-io/delta-rs">delta-rs</a>, the most successful open
source project I have started to date. With kafka-delta-ingest we achieved our
original stated goals and reduced streaming data ingestion costs by <strong>95%</strong>. In
the time since however, we have <em>further</em> reduced that cost <a href="https://www.youtube.com/watch?v=h8nCF_OI0O0">with even more
efficient infrastructure</a>.</p>

<p>The original kafka-delta-ingest/delta-rs implementations were created by the
joint efforts of the following talented developers across <em>three continents</em> in
the middle of 2020, an otherwise totally chill time in world history.</p>

<ul>
  <li><a href="https://github.com/houqp">QP Hou</a></li>
  <li><a href="https://github.com/xianwill">Christian Williams</a></li>
  <li><a href="https://github.com/mosyp">Mykhailo Osypov</a></li>
  <li><a href="https://github.com/nevi-me">@nevi-me</a></li>
</ul>

<p>Prior to our creation of delta-rs, the only way to read and write <a href="https://delta.io">Delta
Lake</a> tables was through <a href="https://spark.apache.org">Apache
Spark</a>. While it is an incredibly powerful tool for
reading and transforming data, it is far too slow and heavyweight for the
task of high-throughput data ingestion. QP and I found ourselves loving
<a href="https://rust-lang.org">Rust</a> and I was able to corner the funding to get the
project started on the promise of lower operational costs.</p>

<p>Boy howdy has the investment in Rust delivered. The implementation of kafka-delta-ingest dramatically lowered our operational costs as Christian shares in this video:</p>

<center><iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/do4jsxeKfd4?si=vAgTIsWWn4k7f5qi" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe></center>

<p>Christian also shared some <a href="https://www.youtube.com/watch?v=mLmsZ3qYfB0">architecture and discussion in this
video</a>, which I think are useful
for anybody building streaming systems around Delta Lake.</p>

<p>Here’s a <a href="https://www.youtube.com/watch?v=JvonUisY7vE&amp;t=51s">demo by Christian</a> too!</p>

<hr />

<p>Ultimately, the reason kafka-delta-ingest was decommissioned was that I created an <em>even
cheaper</em> ingestion process. My work on the
<a href="https://github.com/buoyant-data/oxbow">oxbow</a> suite coupled with
<a href="https://www.databricks.com/glossary/medallion-architecture">the medallion
architecture</a>
has made contemporary Delta Lake ingestion less than 10% of the total data
platform cost.</p>

<p>The big argument against kafka-delta-ingest was <a href="https://kafka.apache.org">Apache
Kafka</a>. If an organization has Kafka for other
reasons, then kafka-delta-ingest can be a useful “sidecar” process to persist
data flowing through Kafka. If however the organization is running Kafka <em>just</em>
for ingestion, there are cheaper options available. As the organization
evolved, the other consumers of Kafka drifted away, driving the value
proposition of kafka-delta-ingest lower and lower.</p>

<p>This doesn’t mean kafka-delta-ingest is not <em>useful</em>, it’s just no longer
useful at Scribd.</p>

<hr />

<p><a href="https://github.com/mightyshazam">Kyjah Keyes</a> and I are the maintainers of
kafka-delta-ingest and we are now both in the position of <em>not actually using
it</em> anymore.</p>

<p>I will continue to apply delta-rs upgrades to it, since kafka-delta-ingest
continues to be a useful test bed for API changes and integration testing, but
I don’t have big plans or ideas on how to grow the project further.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="s3" /><category term="deltalake" /><category term="kafka" /><category term="rust" /><summary type="html"><![CDATA[After five years in production kafka-delta-ingest at Scribd has been shut off and removed from our infrastructure. kafka-delta-ingest was the motivation behind my team creating delta-rs, the most successful open source project I have started to date. With kafka-delta-ingest we achieved our original stated goals and reduced streaming data ingestion costs by 95%. In the time since however, we have further reduced that cost with even more efficient infrastructure.]]></summary></entry><entry><title type="html">Delta Lake Live!</title><link href="https://brokenco.de//2025/09/18/delta-lake-live.html" rel="alternate" type="text/html" title="Delta Lake Live!" /><published>2025-09-18T00:00:00+00:00</published><updated>2025-09-18T00:00:00+00:00</updated><id>https://brokenco.de//2025/09/18/delta-lake-live</id><content type="html" xml:base="https://brokenco.de//2025/09/18/delta-lake-live.html"><![CDATA[<p>Every Tuesday morning at 7am I have a date.</p>

<p>For the past few weeks <a href="https://github.com/roeap">Robert</a> and I have been
jumping onto a shared <a href="https://twitch.tv/agentdero">Twitch</a> stream and working
through issues, code reviews, and design discussions for the
<a href="https://github.com/delta-io/delta-rs">delta-rs</a> project.</p>

<p>The idea for the project came up at Data and AI Summit earlier this year.
Robert lives in Europe and I am as west as west coast in the US generally gets.
The timezone spread has been making collaboration difficult on the topics which
require lively synchronous debate.</p>

<p>The Delta Lake project is open source and therefore, in my opinion, the discussions and development of the project should also be open! What better than a big open live stream to work through column mapping, deletion vectors, bugs, performance challenges, and more!</p>

<p>I have livestreamed development <a href="/2012/08/28/pairing-with-the-fourth-wall">in the
past</a> and found it useful, but with
“Delta Lake Live!” we have a much more regular schedule, agenda, and way for
folks in the chat to engage, making it all that much more fun!</p>

<p>The streams are <a href="https://www.youtube.com/watch?v=6EZM0AbLkWU&amp;list=PLzxP01GQMpjdXtIAVxv_ziQHqyhaEhAVh">also being archived on
YouTube</a>
but you’re more than welcome to pop by and hang out <a href="https://www.twitch.tv/agentdero/schedule">every Tuesday at 7am
PDT</a></p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="deltalake" /><summary type="html"><![CDATA[Every Tuesday morning at 7am I have a date.]]></summary></entry><entry><title type="html">Busily writing elsewhere</title><link href="https://brokenco.de//2025/05/03/writing-elsewhere.html" rel="alternate" type="text/html" title="Busily writing elsewhere" /><published>2025-05-03T00:00:00+00:00</published><updated>2025-05-03T00:00:00+00:00</updated><id>https://brokenco.de//2025/05/03/writing-elsewhere</id><content type="html" xml:base="https://brokenco.de//2025/05/03/writing-elsewhere.html"><![CDATA[<p>Writing has been a part of my work for a <em>long</em> time, it helps me think and
more importantly it helps me share ideas with other developers. Recently a
tremendous amount of my time has been spent writing internal design documents,
blog posts, and other materials. By the time it comes to personal blogging,
my words have all been spent.</p>

<p>On the <a href="https://buoyantdata.com">Buoyant Data</a> blog I have been writing about a
<em>lot</em> of <a href="https://delta.io">Delta Lake</a> related topics such as:</p>

<ul>
  <li><a href="https://www.buoyantdata.com/blog/2024-12-31-high-concurrency-logstore.html">Scaling streaming Delta Lake applications</a></li>
  <li><a href="https://www.buoyantdata.com/blog/2025-02-24-just-keep-buffering.html">Buffering more messages with serverless data ingestion</a></li>
  <li><a href="https://www.buoyantdata.com/blog/2025-03-09-lessons-learned-building-delta-rs.html">Lessons learned in building delta-rs</a></li>
  <li><a href="https://www.buoyantdata.com/blog/2025-04-22-rust-is-good-for-the-climate.html">Build more climate-friendly data applications with Rust</a></li>
</ul>

<p>Some of this work has been in preparation for the two upcoming talks I have at
<a href="https://www.databricks.com/dataaisummit">Data and AI Summit 2025</a>. Some of
these posts have been in doing research with clients, or just spelunking on my
own.</p>

<p>You can <a href="https://www.buoyantdata.com/rss.xml">subscribe to the RSS feed</a> for more up to date articles relating to high-efficiency data processing with Rust!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="deltalake" /><category term="buoyantdata" /><summary type="html"><![CDATA[Writing has been a part of my work for a long time, it helps me think and more importantly it helps me share ideas with other developers. Recently a tremendous amount of my time has been spent writing internal design documents, blog posts, and other materials. By the time it has come to personal blogging my words all been spent.]]></summary></entry><entry><title type="html">From the beginning, delta-rs to Delta Lake: The Definitive Guide</title><link href="https://brokenco.de//2024/11/15/deltalake-the-definitive-guide.html" rel="alternate" type="text/html" title="From the beginning, delta-rs to Delta Lake: The Definitive Guide" /><published>2024-11-15T00:00:00+00:00</published><updated>2024-11-15T00:00:00+00:00</updated><id>https://brokenco.de//2024/11/15/deltalake-the-definitive-guide</id><content type="html" xml:base="https://brokenco.de//2024/11/15/deltalake-the-definitive-guide.html"><![CDATA[<p>Nothing quite feels like “I made it!” like being <em>published</em>. Which is why I am
thrilled to share that <a href="https://bookshop.org/p/books/delta-lake-the-definitive-guide-modern-data-lakehouse-architectures-with-data-lakes-denny-lee/21429337?ean=9781098151942">Delta Lake: The Definitive
Guide</a>
is available for purchase, and I kind of helped! I wanted to share a little bit
about how my contributions (Chapter 6!) came about, because my entrance into
the <a href="https://delta.io">Delta Lake</a> ecosystem was about as unplanned as my
authorship of part of this wonderful book.</p>

<p>The <a href="https://github.com/delta-io/delta-rs">delta-rs</a> project started in 2020 and I wish that I could say it is because
I am a brilliant visionary. The project largely started because I have had a
bias against JVM-based technology stacks and I had stepped into a role at
<a href="https://tech.scribd.com">Scribd</a> where we were migrating to AWS, Databricks,
and a new architecture <em>anyways</em> so why not challenge the orthodoxy? My
colleague <a href="https://about.houqp.me/">QP Hou</a> and I were loving Rust and liked
Delta Lake from a design standpoint, but did not love <a href="https://spark.apache.org">Apache
Spark</a> for some of the things we needed to do.</p>

<p>I would consider the official start of the project to be April 11th, 2020 when
I sent our Databricks colleagues the following:</p>

<hr />

<p>Greetings! As I mentioned in our weekly sync up this week, we have an interest
in partnering with Databricks to develop and open source a native client
interface for Delta Lake.</p>

<p>For framing this conversation and scope of the native interface, I categorize
our compute workloads into three groups:</p>

<ol>
  <li><strong>Big offline data processing</strong>, requiring a cluster of compute resources where Spark makes a big dent.</li>
  <li><strong>Lightweight/small offline data processing</strong>, workloads needing “fractional
compute” resources, basically less than a single machine. (Ruby/Python type
tasks which move data around, or perform small-scale data accesses make up
the majority of these in our current infrastructure, we’ve discussed using
the Databricks Light runtime for these in the past, since the cost to
deploy/run these small tasks on Databricks clusters doesn’t make sense).</li>
  <li><strong>Boundary data-processing</strong>, where the task might involve a little bit of
production “online” data and a little bit of warehouse “offline” data to
complete its work. In our environment we have Ruby scripts whose sole job is
to sync pre-computed (by Spark) offline data into online data stores for the
production Rails application, etc, to access and serve.</li>
</ol>

<p>I don’t want to burn down our current investment in Ruby for many of the 2nd
and 3rd workloads, not to mention retraining a number of developers in-house to
learn how to effectively use Scala or pySpark.</p>

<p>My proposal is that we partner with Databricks and jointly develop an open
source client interface for Delta Lake. One where we would have at least one
developer from Databricks working with at least one developer from Scribd on a
jointly scoped effort to deliver a library capable of <em>initially</em> addressing
our ‘2’ and ‘3’ use-cases.</p>

<p>[..]</p>

<p>Further, I propose that we jointly develop a client interface in Rust, which
will allow us to easily extend that within the Databricks community to support
Golang, Python, Ruby, and Node clients.</p>

<p>The key benefits I imagine for us all:</p>

<ul>
  <li>
    <p>Much broader market share for Delta Lake as a technology. Not only would
companies like Scribd benefit, and continue to invest in Delta Lake, but
other companies would have an easier on-ramp into the Databricks ecosystem.
Basically, if you start using Delta Lake before you use Spark, you will (I
guarantee) reach a point where these lightweight workloads become heavyweight
workloads requiring the full power and glory of the Databricks runtime :D</p>
  </li>
  <li>
    <p>It’s a fantastic developer advocacy story that hits a number of key bullet
marketing points: open source, partner collaboration, Rust (so hot right now) :)</p>
  </li>
  <li>
    <p>Scribd is able to “immediately” take advantage of Delta Lake benefits without
burning up all our existing codebase and investment in Ruby tasks and
tooling. Thereby allowing for an easier onramp into Delta Lake and the
Databricks platform as a whole.</p>
  </li>
</ul>

<p>The scope of the effort I think would be largely around properly dealing with
the transaction log, since the Apache Arrow project has already created a
pretty decent <a href="https://crates.io/crates/parquet">parquet crate</a> in Rust. That
said, there may be some writer improvements we’d want/need to push upstream to
Apache Arrow to make this successful.</p>

<hr />

<p>On second thought, almost all of this has come true! What a brilliant sage! (plz clap)</p>

<p>Like many advancements, there’s a right time, a right place, and a right group
of people. Unfortunately Databricks didn’t join the party until later on, but
they were strong supporters of our initial work, providing guidance and helping to
make <a href="https://delta.io">Delta Lake</a> an ever-more thriving open source
community. The right people were all converging on the direction that made
this possible: <a href="https://github.com/nevi-me">Neville</a> helped make
<a href="https://github.com/apache/arrow-rs">arrow-rs</a> a much better <a href="https://parquet.apache.org">Apache
Parquet</a> writer. QP wrote the first version of the
protocol parser and created the first Python bindings for the library.
<a href="https://github.com/xianwill">Christian Williams</a> built out
<a href="https://github.com/delta-io/kafka-delta-ingest">kafka-delta-ingest</a> with
<a href="https://github.com/mosyp">Mykhailo Osypov</a> and helped prove that <strong>Rust is
way more efficient for data ingestion workloads</strong>. As time went on Will Jones,
Florian Valeye, and Robert Peck joined the party and helped turn delta-rs from
a small Scribd-motivated open source project into a thriving Rust and Python
project.</p>

<p><a href="https://bookshop.org/p/books/delta-lake-the-definitive-guide-modern-data-lakehouse-architectures-with-data-lakes-denny-lee/21429337?ean=9781098151942" target="_blank"><img src="/images/post-images/2024-deltalake/book-cover.jpg" align="right" width="200" /></a></p>

<p>Scribd had wild success with the data ingestion being in Rust, and the data
processing/query being in Spark. The community grew, Databricks grew, and at
some point some folks started working on a book.</p>

<p>As a long-time maintainer of delta-rs and talking head in the Delta and
Databricks ecosystem I was asked to be a technical reviewer of the book after
Prashanth, Scott, Tristen, and Denny had already gotten more than halfway
through the chapters.</p>

<p>I provided as much feedback as I could on their chapters. I reviewed the
outline and noticed “Chapter 8: TBD”.</p>

<p>What’s supposed to be Chapter 8? “<em>We’re not sure yet.</em>”</p>

<p>My friend <a href="https://kohsuke.org">Kohsuke</a> once marveled at how I was able to
acquire things for the <a href="https://jenkins.io">Jenkins project</a> by the simple act of
asking for them. There’s some skill involved in finding mutually beneficial
opportunities, but being uninhibited by the possibility somebody would say “no”
helps a lot.</p>

<p>“So this outline looks good, but when are you going to talk about Rust and
Python? There are dozens of us! Dozens!”</p>

<p><a href="https://dennyglee.com/">Denny</a> needed another chapter and I asked if I could
write about building native data applications in Rust and Python.</p>

<p>Suddenly I was helping to write a book.</p>

<hr />

<p><a href="https://tech.scribd.com">Scribd</a> is a fun company to work at. Books,
audiobooks, podcasts, articles. We have a deep appreciation for the written
word, telling stories, and learning. All of which I value highly. Before this
experience however I had never seen the <em>other</em> side of books. The creation,
the meetings, the rewrites, the edits, the reviews, going to press. It is
incredibly interesting and the team at O’Reilly are talented, helpful, and professional.</p>

<p>Going through copy-editing I was fielding review comments on the consistency of
tense, the subject of sentences, discussions about what is a proper noun and
how to consistently apply terms through <em>hundreds of pages</em> of content. I have
heard about how invaluable editors are; I have now seen them in action and am in
awe.</p>

<p>Over the years I have tried and failed to explain what I do to family members.
For people that don’t work in tech “working on the computer” all looks largely
the same, especially for older generations. Having your work, your name <em>in
print</em> has an intangible “wow” factor. More so than conference talks,
websites, GitHub stars, or branded t-shirts, a printed artifact recognizes the
accomplishments of the innumerable contributors to the Delta Lake ecosystem
over the years.</p>

<p>If you’re data inclined, I recommend picking up a copy, Prashanth, Scott,
Tristen, and Denny have written a very useful guide, and also I contributed a
bit too! :)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="deltalake" /><category term="buoyantdata" /><summary type="html"><![CDATA[Nothing quite feels like “I made it!” like being published. Which is why I am thrilled to share that Delta Lake: The Definitive Guide is available for purchase, and I kind of helped! I wanted to share a little bit about how my contributions (Chapter 6!) came about, because my entrance into the Delta Lake ecosystem was about as unplanned as my authorship of part of this wonderful book.]]></summary></entry><entry><title type="html">Data and AI Summit 2024 presentations</title><link href="https://brokenco.de//2024/10/17/data-ai-summit-videos.html" rel="alternate" type="text/html" title="Data and AI Summit 2024 presentations" /><published>2024-10-17T00:00:00+00:00</published><updated>2024-10-17T00:00:00+00:00</updated><id>https://brokenco.de//2024/10/17/data-ai-summit-videos</id><content type="html" xml:base="https://brokenco.de//2024/10/17/data-ai-summit-videos.html"><![CDATA[<p>This year has been so jam packed full of activities that I forgot to share some
videos from <a href="https://www.buoyantdata.com/blog/2024-06-04-data-and-ai-summit.html">Data and AI Summit
2024</a> this
past summer! The annual conference hosted by Databricks has become one of my
favorites to meet with other <a href="https://delta.io">Delta Lake</a> users and
developers to discuss the future of large-scale data ingestion and processing. This year however, I overdid it a little bit.</p>

<p>Using the excuse of promoting my consulting/professional services company
<a href="https://buoyantdata.com">Buoyant Data</a> I had effectively <em>three</em> speaking
engagements:</p>

<ul>
  <li><strong>The road to delta-rs 1.0</strong> at the Open Source Contributor Summit (Monday)</li>
  <li><strong>Fast, cheap, and easy data ingestion with AWS Lambda and Delta Lake</strong>, a
talk highlighting a lot of the successful patterns I have developed for
customers using AWS Lambda with Delta Lake for Rust to create shockingly
cheap data ingestion pipelines. (Thursday)</li>
  <li><strong>Let’s do data engineering in Rust!</strong>, a more fun deep-dive talk to help
people start to get into the world of implementing data systems with Rust. (Thursday)</li>
</ul>

<p>Unfortunately the first talk was not recorded, but it was probably the most
interesting! On Monday morning I was riding my bike from the Ferry Building to
the venue in San Francisco and my chain snapped off while I was sprinting off
from a green light. I went down <strong>hard</strong>, scraped up my knees, and generally
looked a fool lying in the middle of Market St.</p>

<p>The show must go on, so I hobbled to the <a href="https://tech.scribd.com">Scribd</a>
office, deposited my broken bike, and continued to the Open Source Summit.</p>

<p>What I did not know at the time was that I had fractured a bone in my wrist. I
did know however that I needed to go to a clinic, but <em>really</em> wanted to attend
the summit and take advantage of the once-a-year opportunity (literally!) for
some of the brightest minds in the data community to talk about the future of
Delta Lake and more.</p>

<p>So that first talk was given with my swollen wrist pulled to my heart, like a
broken wing, and I’m <em>sure</em> it was a ludicrous sight to see!</p>

<p>By Thursday my arm had been set and was in a sling, which is far less exciting.
Nonetheless, the two talks below are perhaps the only one-handed presentations
thus far in my career! I hope you enjoy!</p>

<center>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/XPoWb9u06xA?si=SNccWEJxorszRGO1" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/Fr5Nx1wuQmQ?si=Svc3GtewzxUyGI4M" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>
</center>

<hr />

<p><em>Note</em>: The presentation software used for this talk is the open source
<a href="https://mfontanini.github.io/presenterm/introduction.html">presenterm</a> tool
which is delightful for creating development-focused presentations like this
one!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="deltalake" /><category term="buoyantdata" /><category term="presentation" /><summary type="html"><![CDATA[This year has been so jam packed full of activities that I forgot to share some videos from Data and AI Summit 2024 this past summer! The annual conference hosted by Databricks has become one of my favorites to meet with other Delta Lake users and developers to discuss the future of large-scale data ingestion and processing. This year however, I overdid it a little bit.]]></summary></entry><entry><title type="html">Improving lock performance for delta-rs</title><link href="https://brokenco.de//2023/11/29/locking-with-deltalake.html" rel="alternate" type="text/html" title="Improving lock performance for delta-rs" /><published>2023-11-29T00:00:00+00:00</published><updated>2023-11-29T00:00:00+00:00</updated><id>https://brokenco.de//2023/11/29/locking-with-deltalake</id><content type="html" xml:base="https://brokenco.de//2023/11/29/locking-with-deltalake.html"><![CDATA[<p>I have had the good fortune this year to help a number of organizations develop
and deploy native data applications in Python and Rust using a project I helped
found: <a href="https://github.com/delta-io/delta-rs">delta-rs</a>. At a high level
delta-rs is a Rust implementation of the <a href="https://github.com/delta-io/delta/blob/master/PROTOCOL.md">Delta Lake
protocol</a> which
offers ACID-like transactions for data lake use-cases. One of the big areas of
my focus has been in evaluating and improving performance in highly concurrent
runtime environments on AWS.</p>

<p>To help others understand the problem domain I spent some time earlier in the
week documenting the challenges in AWS on the Buoyant Data blog: <a href="https://www.buoyantdata.com/blog/2023-11-27-concurrency-limitations-with-deltalake-on-aws.html">Concurrency
limitations for Delta Lake on
AWS</a></p>

<blockquote>
  <p>In the case of AWS S3’s consistency model many operations are strongly
consistent, but concurrent operations on the same key are not. AWS encourages
application-level object locking, which delta-rs implements using AWS
DynamoDB.</p>
</blockquote>

<p>AWS S3 is an incredible piece of technology that washes away a myriad of common
storage problems, and has been jokingly referred to as “the 8th wonder of the
world” by <a href="https://www.lastweekinaws.com/">Corey Quinn</a>. The lack of a
“putIfAbsent” like semantic is however <em>very</em> annoying for the Delta Lake
protocol, adding the need for an application-wide <em>lock</em> for Delta users:</p>

<blockquote>
  <p>The dynamodb-lock approach allows for some sensible cooperation between
concurrent writers but the key limitation is that all concurrent operations
must synchronize on the table itself. There is no smaller division of
concurrency than a table operation.
</blockquote>

<p>In the blog post I offer some potential approaches to mitigate the weakness of
needing a table-level lock for concurrent Delta Lake writers on AWS, but the
problem will unfortunately remain in some form or fashion until S3
introduces a “putIfAbsent” semantic which allows writers to “put” a file,
atomically, only if it doesn’t already exist.</p>
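<p>To illustrate what that missing primitive looks like, here is a toy, in-memory sketch in Rust (not the real S3 or DynamoDB API; the type and method names are mine) of the “putIfAbsent” semantic the Delta commit protocol needs: exactly one of two racing writers may create a given <code class="language-plaintext highlighter-rouge">_delta_log</code> commit file.</p>

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// A toy object store modeling the "putIfAbsent" semantic that the Delta
// protocol needs for commit files. Purely illustrative, not an AWS API.
struct ToyStore {
    objects: Mutex<HashMap<String, Vec<u8>>>,
}

impl ToyStore {
    fn new() -> Self {
        ToyStore {
            objects: Mutex::new(HashMap::new()),
        }
    }

    // Atomically write `key` only if it does not exist yet. Of two writers
    // racing to commit the same table version, exactly one wins; the loser
    // must re-read the log and retry its commit at version N+1.
    fn put_if_absent(&self, key: &str, body: Vec<u8>) -> bool {
        let mut objects = self.objects.lock().unwrap();
        if objects.contains_key(key) {
            false
        } else {
            objects.insert(key.to_string(), body);
            true
        }
    }
}

fn main() {
    let store = ToyStore::new();
    // First writer commits version 5 of the table.
    assert!(store.put_if_absent("_delta_log/00000000000000000005.json", b"commit".to_vec()));
    // A concurrent writer racing for the same version loses and must retry.
    assert!(!store.put_if_absent("_delta_log/00000000000000000005.json", b"commit".to_vec()));
}
```

<p>Lacking this primitive on S3 itself, delta-rs gets the same atomicity from a DynamoDB conditional write, which is precisely why the table-level lock exists.</p>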

<p>For concurrent Delta writers I can offer some advice, but unfortunately
effective cooperative distributed concurrency at scale remains a challenging
problem! :)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="buoyantdata" /><category term="deltalake" /><category term="rust" /><summary type="html"><![CDATA[I have had the good fortune this year to help a number of organizations develop and deploy native data applications in Python and Rust using a project I helped found: delta-rs. At a high level delta-rs is a Rust implementation of the Delta Lake protocol which offers ACID-like transactions for data lake use-cases. One of the big areas of my focus has been in evaluating and improving performance in highly concurrent runtime environments on AWS.]]></summary></entry></feed>