<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://brokenco.de//feed/by_tag/rust.xml" rel="self" type="application/atom+xml" /><link href="https://brokenco.de//" rel="alternate" type="text/html" /><updated>2026-04-12T21:39:52+00:00</updated><id>https://brokenco.de//feed/by_tag/rust.xml</id><title type="html">rtyler</title><subtitle>a moderately technical blog</subtitle><author><name>R. Tyler Croy</name></author><entry><title type="html">2026 March: Recently Studied Stuff</title><link href="https://brokenco.de//2026/03/21/fresh-from-rss.html" rel="alternate" type="text/html" title="2026 March: Recently Studied Stuff" /><published>2026-03-21T00:00:00+00:00</published><updated>2026-03-21T00:00:00+00:00</updated><id>https://brokenco.de//2026/03/21/fresh-from-rss</id><content type="html" xml:base="https://brokenco.de//2026/03/21/fresh-from-rss.html"><![CDATA[<p>Over the past week I have made a more conscious effort to keep track of some
really interesting articles that came through my feed reader. I am a big fan of
the open web and the power of RSS for disseminating interesting information
from actual people. Below are some of the posts I have most enjoyed recently!</p>

<p><strong><a href="https://felipe.rs/2024/10/23/arrow-over-http/">Compressed Apache Arrow tables over HTTP</a></strong></p>

<p>When discussing transport protocols for sending data between services at work
recently, a colleague asked “why can’t we just yeet Arrow over HTTP?” It turns out, you <a href="https://github.com/apache/arrow-experiments/tree/main/http/get_simple/python">absolutely can</a> and Arrow IPC streams even have a registered MIME type:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Content-Type: application/vnd.apache.arrow.stream
</code></pre></div></div>
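<p>For a flavor of the client side, here is a minimal sketch of my own in Rust,
assuming the <code class="language-plaintext highlighter-rouge">arrow</code> and <code class="language-plaintext highlighter-rouge">reqwest</code> (blocking) crates; the URL is made up:</p>

<pre><code class="language-rust">use std::io::Cursor;

use arrow::ipc::reader::StreamReader;

fn main() -&gt; Result&lt;(), Box&lt;dyn std::error::Error&gt;&gt; {
    // The server would respond with Content-Type: application/vnd.apache.arrow.stream
    let bytes = reqwest::blocking::get("http://localhost:8000/table")?.bytes()?;

    // Read record batches straight out of the IPC stream, no Flight required.
    let reader = StreamReader::try_new(Cursor::new(bytes), None)?;
    for batch in reader {
        println!("got {} rows", batch?.num_rows());
    }
    Ok(())
}
</code></pre>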

<p><strong><a href="https://blog.dataexpert.io/p/parquet-can-shrink-your-data-100x">Understanding Parquet format for beginners</a></strong></p>

<p>A great introduction to the <a href="https://parquet.apache.org">Apache Parquet</a> format
and why it makes so many things better in large data storage systems like
<a href="https://delta.io">Delta Lake</a>. I have written on this
<a href="/tag/parquet.html">topic</a> before and encourage you to take another read
through <a href="https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/">this blog
post</a>
by some maintainers of the <a href="https://crates.io/crates/parquet">parquet</a> crate.</p>

<p><strong><a href="https://apenwarr.ca/log/20260316">Every layer of review makes you 10x slower</a></strong></p>

<blockquote>
  <p>Every layer of approval makes a process 10x slower [..]</p>

  <p>Just to be clear, we’re counting “wall clock time” here rather than effort. Almost all the extra time is spent sitting and waiting.</p>

  <ul>
    <li>Code a simple bug fix: 30 minutes</li>
    <li>Get it code reviewed by the peer next to you: 300 minutes → 5 hours → half a day</li>
    <li>Get a design doc approved by your architects team first: 50 hours → about a week</li>
    <li>Get it on some other team’s calendar to do all that (for example, if a customer requests a feature): 500 hours → 12 weeks → one fiscal quarter</li>
  </ul>
</blockquote>

<p>This inspired these thoughts which I shared with the <a href="https://github.com/delta-io/delta-rs">delta-rs</a> community:</p>

<p>“what if we didn’t require code review for merging into main”</p>

<p>I’m exploring what we might need to make that happen.
“Why would you do such a thing, code review is so valuable!” I do find code
reviews valuable, but we lose a lot of flow time to timezones, differing work
schedules, and a number of other things. For small changes, especially bug
fixes that come with tests, I would be much more comfortable with maintainers
merging once CI goes green.</p>

<p>Some pieces of the puzzle that I think would be needed:</p>

<ul>
  <li>Soft caps on pull requests. I saw this mentioned somewhere else, but implementing a soft cap of &lt;500 lines per pull request can help people avoid massive, unreviewable changes and keep each change simpler to integrate.</li>
  <li>Incorporating some of the benchmarking work into CI that has already been explored. If performance of key operations is not affected and the build is green, go for it.</li>
  <li>Stronger semantic version checks: if our APIs have not changed and all tests pass, I’m generally comfortable with maintainers landing changes.</li>
  <li>Implementing Apache Software Foundation style release candidates and voting: this is where we would put a mandatory bottleneck. Rather than the jokey Slack emojis I tend to use, a true release candidate process would require review and a vote before we push something to users.</li>
</ul>

<p>All of this is to say that reviews can still be requested, but I would love to
see us land more improvements faster. With so many different work schedules
among maintainers, pushing each change through a review queue is often a lot
slower than necessary.</p>

<p><strong><a href="https://www.possiblerust.com/pattern/conditional-impls">Conditional Impls in Rust</a></strong></p>

<blockquote>
  <p>It’s possible in Rust to conditionally implement methods and traits based on
the traits implemented by a type’s own type parameters. While this is used
extensively in Rust’s standard library, it’s not necessarily obvious that
this is possible.</p>
</blockquote>
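<p>As a quick illustration of the pattern, here is a minimal sketch of my own
(not code from the post):</p>

<pre><code class="language-rust">use std::fmt::Display;

struct Wrapper&lt;T&gt; {
    value: T,
}

// These methods exist for every Wrapper&lt;T&gt;...
impl&lt;T&gt; Wrapper&lt;T&gt; {
    fn new(value: T) -&gt; Self {
        Wrapper { value }
    }
}

// ...but this method only exists when T itself implements Display.
impl&lt;T: Display&gt; Wrapper&lt;T&gt; {
    fn print(&amp;self) {
        println!("{}", self.value);
    }
}

fn main() {
    Wrapper::new(42).print(); // fine: i32 implements Display
    // Wrapper::new(vec![1]).print(); // compile error: Vec&lt;i32&gt; is not Display
}
</code></pre>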

<p>I have been vaguely aware of this capability but haven’t really taken the
time to consider it, so I really appreciated this post walking through the
conditional impl functionality in Rust.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rss" /><category term="arrow" /><category term="parquet" /><category term="rust" /><summary type="html"><![CDATA[Over the past week I have made a more conscious effort to keep track of some really interesting articles that came through my feed reader. I am a big fan of the open web and the power of RSS for disseminating interesting information from actual people. Below are some really interesting posts I have read recently!]]></summary></entry><entry><title type="html">The value of efficient software</title><link href="https://brokenco.de//2026/02/23/value-of-efficiency.html" rel="alternate" type="text/html" title="The value of efficient software" /><published>2026-02-23T00:00:00+00:00</published><updated>2026-02-23T00:00:00+00:00</updated><id>https://brokenco.de//2026/02/23/value-of-efficiency</id><content type="html" xml:base="https://brokenco.de//2026/02/23/value-of-efficiency.html"><![CDATA[<p>The value of efficient and thoughtfully designed software is going to continue
to grow. What I never expected was for the “AI” data center to be the catalyst
that could help many organizations understand that argument!</p>

<p>Today Hetzner, a major cloud services provider in Europe, <a href="https://www.hetzner.com/pressroom/statement-price-adjustment/">announced</a>:</p>

<blockquote>
  <p>There have been drastic price increases in various areas in the IT sector
recently. That is why, unfortunately, we must also increase the prices of our
products.</p>

  <p>The costs to operate our infrastructure and to buy new hardware have both
increased dramatically. Therefore, our price changes will affect both
existing products and new orders and will take effect starting on 1 April
2026.</p>
</blockquote>

<p>Last year for Earth Day I wrote <a href="https://www.buoyantdata.com/blog/2025-04-22-rust-is-good-for-the-climate.html">on the Buoyant Data blog</a>:</p>

<blockquote>
  <p>Time is money. In the cloud time is measured and billed by the vCPU/hour and
the most efficient software is always the cheapest.</p>
</blockquote>

<p>Nothing makes the case for more efficient software like more expensive
hardware!</p>

<p>In the past five years I have <em>repeatedly</em> seen success in taking a system
built on a less-efficient platform, redesigning and rebuilding it in Rust, and
reaping the rewards of lower operational costs.</p>

<p>For a simple exercise, imagine a service which costs $100,000/year to operate,
that’s roughly $1,900 a week. Assuming a developer’s time costs roughly $6,000
a week, taking a month to rebuild the service might cost $25,000. The
efficiency needed is then only about 25% to pay off that rewrite in a year, but
what I have consistently seen is an <em>order of magnitude</em> change in efficiency.</p>
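<p>A toy version of that arithmetic, using the same assumed numbers:</p>

<pre><code class="language-rust">fn main() {
    let annual_cost = 100_000.0; // current service cost, $/year
    let rewrite_cost = 25_000.0; // roughly one developer-month at ~$6,000/week

    // Efficiency gain needed to recoup the rewrite within one year:
    let break_even = rewrite_cost / annual_cost;
    println!("break-even savings: {:.0}%", break_even * 100.0); // 25%

    // With an order-of-magnitude win the service costs ~$10k/year instead:
    let monthly_savings = (annual_cost - 10_000.0) / 12.0;
    println!("payback: {:.1} months", rewrite_cost / monthly_savings); // ~3.3
}
</code></pre>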

<p>Instead of costing $100k, these newly deployed services tend to cost only
10-20% of what their predecessors did. That recoups the cost of conversion in a
couple of months, freeing up money to go towards different investments.</p>

<p>The biggest cost to contend with is opportunity cost, which is <em>much</em>
harder to model, and also much less subject to changing prices by your vendors.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="software" /><category term="cloud" /><category term="opinion" /><category term="rust" /><summary type="html"><![CDATA[The value of efficient and thoughtfully designed software is going to continue to grow. What I never expected was for the “AI” data center to be the catalyst that could help many organizations understand that argument!]]></summary></entry><entry><title type="html">Multimodal with Delta Lake</title><link href="https://brokenco.de//2026/01/19/multimodal-delta-lake.html" rel="alternate" type="text/html" title="Multimodal with Delta Lake" /><published>2026-01-19T00:00:00+00:00</published><updated>2026-01-19T00:00:00+00:00</updated><id>https://brokenco.de//2026/01/19/multimodal-delta-lake</id><content type="html" xml:base="https://brokenco.de//2026/01/19/multimodal-delta-lake.html"><![CDATA[<p>The rate of change for data storage systems has accelerated to a frenzied pace
and most storage architectures I have seen simply cannot keep up. Much of my
time is spent thinking about large-scale tabular data stored in <a href="https://delta.io">Delta
Lake</a> which is one of the “lakehouse” storage systems along
with <a href="https://iceberg.apache.org">Apache Iceberg</a> and others. These storage
architectures were developed 5-10 years ago to solve the problems organizations
faced moving from data warehouse architectures to massive-scale structured
data. The storage changes we need today must support
“multimodal data” which is a dramatic departure in many ways from the
traditional query and usage patterns our existing infrastructure supports.</p>

<blockquote>
  <p>Multimodal learning is a type of deep learning that integrates and processes
multiple types of data, referred to as modalities, such as text, audio, images,
or video. This integration allows for a more holistic understanding of complex
data, improving model performance in tasks like visual question answering,
cross-modal retrieval, text-to-image generation, aesthetic ranking,
and image captioning.</p>

  <p><a href="https://en.wikipedia.org/wiki/Multimodal_learning">From Wikipedia</a></p>
</blockquote>

<p>Honestly, I have been working on this problem for longer than I knew that it
had a name!</p>

<p>Working on <a href="https://tech.scribd.com/blog/2026/content-crush.html">Content
Crush</a> at Scribd I have
had to negotiate an ever-present challenge: how do we make multimodal data
work seamlessly alongside our classic tabular datasets?</p>

<p>A couple of the ideas that I have been thinking about revolve around one
principle: <strong>re-encoding of existing data is unacceptable.</strong> In the past I have
considered simply encoding binary data such as that from images or PDFs into
<a href="https://parquet.apache.org">Apache Parquet</a>. This approach suffers from a couple major flaws:</p>

<ul>
  <li>Re-encoding requires substantial computation for any non-trivial set of images, PDFs, video, etc.</li>
  <li>Redundant object storage: even with compression, it is unlikely that any
organization which has terabytes or petabytes of image data will want to
store a secondary copy of it for their multimodal needs.</li>
  <li>Embedding a 1MB PDF file inside of a Parquet file is <em>not silly</em> but
embedding a 10GB video file inside of a Parquet file is <em>very silly</em>. Any
approach taken should scale in a reasonable fashion for data in the gigabyte
to terabyte range.</li>
</ul>

<p>A secondary objective in my thinking has been to avoid needing substantial
client changes for working with multimodal data. I recently watched <a href="https://www.youtube.com/watch?v=YmY_NwaoxNk">a talk by
Ryan Johnson</a> about adding
transactional semantics to Delta Lake and one of the big takeaways that I
heard from him was about the troublesome nature of ensuring <em>all actors</em> in the
system cooperated with the transaction semantics. In a modern data environment
that could be <em>dozens</em> of different off-the-shelf libraries, Databricks
notebooks, AWS SageMaker transforms, and so on. The less “exposure” to the
client layer the better.</p>

<h2 id="parquet-anchors">Parquet Anchors</h2>

<p>The first idea that I had was “Parquet Anchors” which would be built on <a href="https://parquet.apache.org/docs/file-format/binaryprotocolextensions/">Binary
Protocol
Extensions</a>
in Apache Parquet. In most cases the rich text/image/video data is already
stored in object storage such as AWS S3 and a URL should be sufficient to
retrieve that data.</p>

<p>The extension of the binary protocol, as I understand it, would allow custom
information to be encoded in the Parquet files that are being written as part
of an existing Delta table. The specific mechanism of encoding this data is
somewhat irrelevant so long as it can carry the following (sketched in code
after the list):</p>

<ul>
  <li>Artifact name (e.g. <code class="language-plaintext highlighter-rouge">some.pdf</code>)</li>
  <li>Artifact URL (<code class="language-plaintext highlighter-rouge">s3://bucket/prefix/of/keys/some-10x9u09123.pdf</code>)</li>
  <li>Artifact length (number of bytes)</li>
  <li>Artifact content type (e.g. <code class="language-plaintext highlighter-rouge">application/pdf</code>)</li>
  <li>Checksum</li>
  <li>Checksum Algorithm</li>
</ul>
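<p>In Rust terms the anchor is little more than the following record; this is a
hypothetical sketch and the field names are mine, not part of any specification:</p>

<pre><code class="language-rust">/// Hypothetical "Parquet Anchor" metadata carried in the extension bytes.
#[derive(Debug, Clone)]
struct ParquetAnchor {
    name: String,               // e.g. "some.pdf"
    url: String,                // e.g. "s3://bucket/prefix/of/keys/some-10x9u09123.pdf"
    length: u64,                // number of bytes
    content_type: String,       // e.g. "application/pdf"
    checksum: String,
    checksum_algorithm: String, // e.g. "sha256"
}
</code></pre>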

<h3 id="pros">Pros</h3>
<p>The most obvious benefit of going down this route is the ease with which one
could update existing data files <em>and</em> this note from the Binary Protocol
Extensions document:</p>

<blockquote>
  <p><em>Existing readers will ignore the extension bytes with little processing overhead</em></p>
</blockquote>

<p>Logically, Parquet Anchors could be quite simple to implement, and <em>most</em>
users of a Delta table with Parquet Anchors would never know they were there.</p>

<h3 id="cons">Cons</h3>

<p>The natural downside of this feature being hidden from existing readers is
that clients must be updated in order to read the extension data properly. For
something like processing multimodal data where a row of content metadata
might refer to <code class="language-plaintext highlighter-rouge">some.pdf</code> this would mean the reader would have to have some
indication that it must:</p>

<ol>
  <li>Read the extended binary information</li>
  <li><em>Then</em> fetch the necessary artifacts</li>
</ol>

<p>There is another downside to this approach in that a table would need to be
“rewritten” but only <em>partially</em>. If a Parquet file added to the Delta table
references 1000 artifacts, then that <code class="language-plaintext highlighter-rouge">.parquet</code> file would need to be rewritten
to include the Parquet Anchors for those 1000 artifacts alongside that file’s
<code class="language-plaintext highlighter-rouge">.add</code> action. In essence I think this approach would require a full-table
rewrite where each <code class="language-plaintext highlighter-rouge">.parquet</code> in the transaction log would be retrieved,
processed, and rewritten with the appropriate Anchors.</p>

<p>Considering ways to address the shortcomings of Parquet Anchors I came up with
my next concept.</p>

<h2 id="virtual-delta-tables-vdt">Virtual Delta Tables (vdt)</h2>

<p>The notion of Parquet Anchors is, I think, worth holding onto: hyperlinks to
existing artifacts are a key part of the multimodal data storage solution, but
perhaps not as a direct encoding into the Parquet data files. Considering the
shortcomings led me to think of how to present a virtual Delta table “view” to
existing clients while hiding the disparate nature of the data behind the
scenes.</p>

<p>One underutilized feature of the Delta Lake protocol is the use of URLs in the
<code class="language-plaintext highlighter-rouge">add</code> actions which enables functionality like <a href="https://delta.io/blog/delta-lake-clone/">shallow
clones</a>. I have long thought of this
as a superpower that should really be used more.</p>

<h3 id="vdt0-just-the-artifacts">vdt0: just the artifacts</h3>

<p>The magic of the URL support in the Delta protocol is that the URLs don’t even
have to point to object storage. Nothing about the protocol dictates that the
URLs must point to <code class="language-plaintext highlighter-rouge">s3://</code> or <code class="language-plaintext highlighter-rouge">abfss://</code>; they can just as well be <code class="language-plaintext highlighter-rouge">https://</code>
URLs. AWS S3 supports <code class="language-plaintext highlighter-rouge">https://</code> URLs, but so does <em>every other web service</em>.</p>

<p>Imagine a storage architecture which already contains heaps of <code class="language-plaintext highlighter-rouge">.pdf</code>
artifacts. A <code class="language-plaintext highlighter-rouge">vdt</code> web service could provide a read-only URL structure which
maps the existing object storage structure into a Delta Lake URL scheme.</p>

<p>A virtual table with just those PDF artifacts could be configured at
<code class="language-plaintext highlighter-rouge">https://vdt.aws/v1/&lt;catalog&gt;/&lt;schema&gt;/&lt;table&gt;</code>. Using tooling like
<a href="https://github.com/s3s-project/s3s">s3s</a> <code class="language-plaintext highlighter-rouge">vdt</code> can provide S3-like operations
off of this virtual URL, exposing a virtualized JSON transaction log or
checkpoints for the Delta client.</p>

<p>Imagine the schema of such a virtual table for PDF artifacts:</p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>filename</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>content_type</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>url</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>filesize</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>data</td>
      <td><code class="language-plaintext highlighter-rouge">binary</code></td>
    </tr>
    <tr>
      <td>checksum</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>checksum_algo</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
  </tbody>
</table>

<p>The virtualized transaction log is where the real fun can begin. If information
about the artifacts can be sourced from an existing database, then the
virtualized transaction log could contain numerous <em>imagined</em> parquet files as
the <code class="language-plaintext highlighter-rouge">add</code> actions:</p>

<pre><code class="language-JSON">{
  "add": {
    "path": "datafiles/some-guid.parquet",
    "size": 841454,
    "modificationTime": 1512909768000,
    "dataChange": true,
    "stats": "{\"numRecords\":1,\"minValues\":{\"val..."
  }
}
</code></pre>

<p>The special path for the <code class="language-plaintext highlighter-rouge">some-guid.parquet</code> would perform <strong>on-demand</strong>
parquet encoding for the underlying artifacts.  The most primitive
implementation could simply represent <em>each</em> PDF file as a <code class="language-plaintext highlighter-rouge">.parquet</code> file with
an <code class="language-plaintext highlighter-rouge">add</code> action. So long as the <code class="language-plaintext highlighter-rouge">add</code> action conveyed the necessary file
statistics to allow the consuming engine to filter out files which are not
necessary, this could be a seamless way to expose structured PDF data to the
consumer. The <code class="language-plaintext highlighter-rouge">path</code> in the action could <em>also</em> refer to an already cached
version of the encoded file in S3 using the existing URL support in the
protocol; in this way the service could progressively cache encoded files on
the server side as needed.</p>
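<p>A primitive sketch of that on-demand encoding step, assuming the
<code class="language-plaintext highlighter-rouge">arrow</code> and <code class="language-plaintext highlighter-rouge">parquet</code> crates and a trimmed-down version of the schema above:</p>

<pre><code class="language-rust">use std::sync::Arc;

use arrow::array::{ArrayRef, BinaryArray, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

/// Encode a single artifact's metadata and raw bytes as an in-memory
/// .parquet file which a `vdt` service could serve on request.
fn encode_artifact(filename: &amp;str, content_type: &amp;str, data: &amp;[u8]) -&gt; Vec&lt;u8&gt; {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("filename", DataType::Utf8, true),
        Field::new("content_type", DataType::Utf8, true),
        Field::new("data", DataType::Binary, true),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int64Array::from(vec![1])) as ArrayRef,
            Arc::new(StringArray::from(vec![filename])),
            Arc::new(StringArray::from(vec![content_type])),
            Arc::new(BinaryArray::from(vec![data])),
        ],
    )
    .unwrap();

    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&amp;mut buffer, schema, None).unwrap();
    writer.write(&amp;batch).unwrap();
    writer.close().unwrap();
    buffer
}
</code></pre>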

<hr />

<p><strong>Brief aside</strong>: I have never fully understood why <a href="https://delta.io/sharing/">Delta
sharing</a> exists as a separate entity. In my opinion
the Delta Lake protocol coupled with a clever server-side backend could provide
identical functionality for all existing Delta implementations.</p>

<hr />

<p>Assuming the <code class="language-plaintext highlighter-rouge">vdt</code> service supports the schema defined above and can properly
retrieve the PDF artifacts and encode them as Parquet data on the fly, a query
such as <code class="language-plaintext highlighter-rouge">SELECT filename, raw FROM vdt WHERE filename = $?</code> should work out of the
box with existing Delta clients.</p>

<h3 id="pros-1">Pros</h3>

<p>Breaking the pretense of “objects must actually exist” with Delta Lake is very
liberating. On-demand encoding of artifacts into Apache Parquet would mean all
client-side libraries should be able to work seamlessly within their existing
environments.</p>

<p>When I think about approaches for implementing <code class="language-plaintext highlighter-rouge">vdt0</code> I can also
imagine many different potential avenues for optimization.</p>

<h3 id="cons-1">Cons</h3>

<p>While I really do like this idea, I’m not sure <em>how much</em> I should like it
considering the potential downsides:</p>

<ul>
  <li>Requires some existing structure behind the scenes to build up a sensible
virtual Delta log. For situations where artifacts are simply in a dumb bucket
somewhere, with no metadata already stored in a relational database,
producing a virtual transaction log would be quite difficult.</li>
  <li>I cannot imagine a sensible path for <strong>write</strong> workloads with <code class="language-plaintext highlighter-rouge">vdt0</code>.</li>
  <li>Without having implemented this (yet!) it is unclear how much compute time would be expended on uncached Parquet file encoding.</li>
  <li>Most data scientists want the PDF/image/etc but they don’t <em>typically</em> want
the raw bytes that they then have to parse through.</li>
</ul>

<hr />

<h2 id="uh-what-if-you-just-dont-use-delta-lake">Uh, what if you just don’t use Delta Lake?</h2>

<p>Hey good question. Great interlude opportunity!</p>

<p>As a seller of fine hammers and hammer accessories, everything does in fact
look like a nail.</p>

<p>Delta Lake is kind of a means to an end for me here. I think its protocol has
enough maturity in terms of features and client capabilities to provide
<em>almost</em> everything I need from a multimodal storage system. I just can’t/don’t
want to shove everything into a Delta table per se.</p>

<hr />

<h2 id="vdt1-adding-virtual-legs">vdt1: adding virtual legs</h2>

<p>Since I have already indulged in the heretical idea of “what if we just make
the files up” I went a level further to consider <em>what if we got even more
virtualized</em>. One key characteristic I dislike with the <code class="language-plaintext highlighter-rouge">vdt0</code> approach is that
it is <em>too simple</em>, believe it or not.</p>

<p>When I think about artifacts like PDFs, they have far more structure than just
bytes. There are pages, typically sections, text, images, titles, footnotes,
and so on. For most machine learning use-cases the data scientist may be
interested in raw bytes for some projects but much more often they are
interested in the <em>parsed</em> and <em>structured</em> data of the artifact.</p>

<p>While my expertise is largely around text-based storage and processing, I would
imagine image/audio/video artifacts also have similar structure of interest to
data scientists.</p>

<p>Indulging in even more virtual-thinking I started to think about collections of
data all associated with an artifact. There’s the raw data schema above, but for PDFs I can also envision:</p>

<p><strong>Paragraphs</strong></p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>page</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>offset</td>
      <td><code class="language-plaintext highlighter-rouge">integer</code></td>
    </tr>
    <tr>
      <td>text</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>is_heading</td>
      <td><code class="language-plaintext highlighter-rouge">bool</code></td>
    </tr>
    <tr>
      <td>heading_level</td>
      <td><code class="language-plaintext highlighter-rouge">integer</code></td>
    </tr>
  </tbody>
</table>

<p><strong>Images</strong></p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>content_type</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>page</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>data</td>
      <td><code class="language-plaintext highlighter-rouge">binary</code></td>
    </tr>
    <tr>
      <td>bounds_x</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>bounds_y</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
  </tbody>
</table>

<p><strong>Links</strong></p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>page</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>href</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>label</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
  </tbody>
</table>

<p>Taken all together this is only <em>21 columns</em> of data but could carry
<strong>most</strong> of the information needed for most multimodal workloads. I
mention the low column count because I have seen bug reports from Delta Lake
users talking about issues with tables containing <em>thousands of columns</em>.</p>

<p>A virtualized table schema could take these interior schemas and join them
together such that a single row might have: <code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">raw_filename</code>,
<code class="language-plaintext highlighter-rouge">raw_content_type</code>, <code class="language-plaintext highlighter-rouge">raw_url</code>, <code class="language-plaintext highlighter-rouge">raw_filesize</code>, <code class="language-plaintext highlighter-rouge">raw_data</code>, <code class="language-plaintext highlighter-rouge">raw_checksum</code>,
<code class="language-plaintext highlighter-rouge">raw_checksum_algo</code>, <code class="language-plaintext highlighter-rouge">paragraph_page</code>, <code class="language-plaintext highlighter-rouge">paragraph_text</code>, <code class="language-plaintext highlighter-rouge">paragraph_offset</code>,
<code class="language-plaintext highlighter-rouge">paragraph_is_heading</code>, <code class="language-plaintext highlighter-rouge">paragraph_heading_level</code>, <code class="language-plaintext highlighter-rouge">image_content_type</code>,
<code class="language-plaintext highlighter-rouge">image_page</code>, <code class="language-plaintext highlighter-rouge">image_data</code>, <code class="language-plaintext highlighter-rouge">image_bounds_x</code>, <code class="language-plaintext highlighter-rouge">image_bounds_y</code>, <code class="language-plaintext highlighter-rouge">link_page</code>,
<code class="language-plaintext highlighter-rouge">link_href</code>, <code class="language-plaintext highlighter-rouge">link_label</code>.</p>
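<p>Sketched as a Rust struct (abbreviated, with names mirroring the prefixed
columns above):</p>

<pre><code class="language-rust">// Abbreviated sketch of the flattened virtual row; every column except
// `id` is nullable, hence the Options.
struct VirtualRow {
    id: i64,
    raw_filename: Option&lt;String&gt;,
    raw_data: Option&lt;Vec&lt;u8&gt;&gt;,
    paragraph_page: Option&lt;i64&gt;,
    paragraph_text: Option&lt;String&gt;,
    image_page: Option&lt;i64&gt;,
    image_data: Option&lt;Vec&lt;u8&gt;&gt;,
    link_page: Option&lt;i64&gt;,
    link_href: Option&lt;String&gt;,
    // ...and so on for the remaining prefixed columns
}
</code></pre>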

<p>So long as the schema allows nullable columns for everything but <code class="language-plaintext highlighter-rouge">id</code>, the
<code class="language-plaintext highlighter-rouge">vdt</code> service can expose the disjointed data behind the scenes in a sensible
way with the <code class="language-plaintext highlighter-rouge">add</code> actions on the virtual Delta table and its file statistics.
For example an <code class="language-plaintext highlighter-rouge">add</code> action which includes <code class="language-plaintext highlighter-rouge">link</code> data would report all other
columns as entirely null within the file statistics’ <code class="language-plaintext highlighter-rouge">nullCount</code> such that any engine
querying for <code class="language-plaintext highlighter-rouge">raw</code> columns would skip that file entirely.</p>
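<p>Concretely, such an action might carry statistics like the following sketch,
built here with <code class="language-plaintext highlighter-rouge">serde_json</code>; the path and numbers are invented:</p>

<pre><code class="language-rust">use serde_json::json;

fn main() {
    // Per-file statistics are embedded in the action as a JSON string.
    let stats = json!({
        "numRecords": 1000,
        "nullCount": {
            "raw_data": 1000,       // every raw_* value is null in this file
            "paragraph_text": 1000, // paragraph_* too
            "link_href": 0          // only the link columns are populated
        }
    });
    let add = json!({
        "add": {
            "path": "datafiles/links-only.parquet",
            "size": 4096,
            "dataChange": true,
            "stats": stats.to_string()
        }
    });
    println!("{}", serde_json::to_string_pretty(&amp;add).unwrap());
}
</code></pre>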

<h3 id="pros-2">Pros</h3>

<p>I think this structure would be possible to build in a traditional Delta Lake
system assuming one wished to re-encode data into new storage. Hiding existing
data behind a virtualized Delta table allows us to avoid data denormalization.</p>

<p>Similar to <code class="language-plaintext highlighter-rouge">vdt0</code> there are optimization and caching approaches that are
readily available with <code class="language-plaintext highlighter-rouge">vdt1</code> but unlike <code class="language-plaintext highlighter-rouge">vdt0</code> the “write path” is more
apparent to me with this approach. By hiding metadata about an artifact inside
the virtualized data structure, writes which add rows with those columns could
sensibly be accepted and inserted into an internal Delta or other table.</p>

<p>Depending on how the metadata associated with an artifact is stored, the <code class="language-plaintext highlighter-rouge">vdt</code>
service could simply front a number of other conventional Delta tables and act
as a proxy, pushing predicates and I/O filtering “to the edge” as far
as they will go, before collecting results for the query engine.</p>

<h3 id="cons-2">Cons</h3>

<p>This approach is certainly the most complex but could potentially require the least amount of re-encoding of existing data assets. The devil is in the details with how one might map existing data sources together. My sketch above places a tremendous amount of emphasis on an <code class="language-plaintext highlighter-rouge">id</code> which acts as a primary key between all the metadata associated with a singular artifact.</p>

<p>Nothing defined thus far accounts for potential changes in an artifact or its
metadata as time goes on. If a new version of an existing document is uploaded,
the new version should likely be considered “canonical” but be <em>appended</em>
rather than <em>merged</em> with existing records. How one might sensibly model that
in a system like Delta which doesn’t support referential integrity between
datasets leads me back to the “anchors” idea from before.  That said, I’m not
sure if that’s much ado about nothing.</p>

<hr />

<p>From a data storage standpoint one key aspect of multimodal data is that the
different modalities are presented to the end user or system <strong>together</strong>. What
I like about the virtual Delta tables concept is that this it doesn’t require
substantial client changes to accomplish but <em>does</em> provide a path to present
various types of data <em>together</em> for a given artifact.</p>

<p>I have various bits and pieces of a potential <code class="language-plaintext highlighter-rouge">vdt</code> system lying around the
workshop floor. If the idea has legs I might take a crack at a prototype
implementation, but first I will need some feedback!</p>

<p>Let me know what you think by emailing me at <code class="language-plaintext highlighter-rouge">rtyler@</code> this domain!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="parquet" /><category term="deltalake" /><category term="ml" /><summary type="html"><![CDATA[The rate of change for data storage systems has accelerated to a frenzied pace and most storage architectures I have seen simply cannot keep up. Much of my time is spent thinking about large-scale tabular data stored in Delta Lake which is one of the “lakehouse” storage systems along with Apache Iceberg and others. These storage architectures were developed 5-10 years ago to solve problems faced moving from data warehouse architectures to massive scale structured data needs faced by many organizations. The storage changes we need today must support “multimodal data” which is a dramatic departure in many ways from the traditional query and usage patterns our existing infrastructure supports.]]></summary></entry><entry><title type="html">The challenges facing Delta Kernel</title><link href="https://brokenco.de//2026/01/12/delta-kernel-challenges.html" rel="alternate" type="text/html" title="The challenges facing Delta Kernel" /><published>2026-01-12T00:00:00+00:00</published><updated>2026-01-12T00:00:00+00:00</updated><id>https://brokenco.de//2026/01/12/delta-kernel-challenges</id><content type="html" xml:base="https://brokenco.de//2026/01/12/delta-kernel-challenges.html"><![CDATA[<p>The Delta Kernel is one of the most technically challenging and ambitious open
source projects I have worked on. Kernel is fundamentally about unifying <em>all</em>
of our needs and wants from a <a href="https://delta.io">Delta Lake</a> implementation
into a single cohesive yet pluggable API surface. Towards the end of 2025
<a href="https://github.com/tdas">TD</a> asked me to jot down some of the issues which
have been frustrating me and/or slowing down the adoption of kernel in projects
like <a href="https://github.com/delta-io/delta-rs">delta-rs</a>. At the outset of the
project we all discussed concerns about what could <em>actually be possible</em> as we
set out into uncharted territory. In many ways we have succeeded, in others we
have failed.</p>

<p>Reviewing the history, I was the second developer to commit code behind
<a href="https://github.com/zachschuermann">Zach</a> to the project.
Like all open source projects, Delta Kernel is the work of numerous people who
have all poured their time into making something happen <em>together</em>. I regularly
work with Robert, Zach, Nick, Ryan, and Steve to make delta-rs and
delta-kernel-rs <strong>better</strong>.</p>

<p>While we all have our personal motivations, we also have direction guided by our
employers in some cases. That means the goals for kernel from Databricks may
not align with those of my employer (<a href="https://tech.scribd.com">Scribd</a>), or others
participating in the project. This complicates trade-off decisions in many open
source projects where personal, professional, and hobby motivations intersect.</p>

<p>My hope is to characterize the weaknesses in kernel so that we can collectively
adjust in 2026 to make improvements to both the technical design of kernel and
the <em>community</em> and culture around kernel.</p>

<h2 id="design">Design</h2>

<p>From my perspective the original design trade-offs made in kernel were largely
driven by two key factors:</p>

<ol>
  <li><strong>Portability with non-Rust engines</strong>: this dictated the need for an
<a href="https://en.wikipedia.org/wiki/Foreign_function_interface">FFI</a> abstraction
on day zero. The <a href="https://duckdb.org/docs/stable/core_extensions/delta">Delta extension for
DuckDB</a> had an
outsized influence on this, ostensibly due to a desire from Databricks to
make DuckDB and Delta be best friendsies.</li>
  <li><strong>The Java kernel</strong>: the Delta kernel is actually <em>two</em> implementations, one
in Java for unifying JVM-based connectors, and one in Rust for basically
everybody else. Due to the number of folks involved in the Java kernel, the
Rust implementation was <em>strongly</em> encouraged to take design cues from the
Java design.</li>
</ol>

<p>More than anything these two factors have contributed to a number of what I
would consider original load-bearing sins of design for delta-kernel-rs.</p>

<blockquote>
  <p>These trade-offs resulted in a Rust-based project which <strong>abandons most of
the important benefits for using Rust</strong>.</p>
</blockquote>

<h3 id="building-for-the-lowest-common-denominator">Building for the lowest common Denominator</h3>

<p>Supporting cross-language and runtime interoperability is <strong>brutal</strong>. I have
done a lot of cross-language support for Ruby and Python projects in the past,
where at some point <em>somewhere</em> there’s a pointer being passed from one world
into another. It is objectively <strong>awful</strong>.</p>

<p>Over the years of delta-rs people have tried adding FFI hooks into it, despite
us never making <em>any</em> accommodations for it. Seriously, as recently as <a href="https://github.com/delta-io/delta-rs/issues/3973">this
month</a> somebody popped up
with yet-another set of Golang FFI bindings on top of delta-rs.</p>

<h4 id="ffi-is-hell">FFI is hell.</h4>

<p>A hell that we <em>intentionally marched into</em> with Delta kernel. For
the uninitiated, FFI is basically a convention for allowing multiple languages to
meet at a C <a href="https://en.wikipedia.org/wiki/Application_binary_interface">ABI
layer</a> and pass
pointers back and forth. There is some more about memory layout and other
silliness, but basically, it’s a way for everybody to dumb themselves down to a
C-style interface.</p>
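<p>In Rust that boundary looks something like this; the function is purely
illustrative, not kernel’s actual FFI surface:</p>

<pre><code class="language-rust">/// An unmangled, C-callable entry point: any language that can speak the
/// C ABI can call this. Rust's ownership guarantees stop at this boundary.
#[no_mangle]
pub extern "C" fn scan_next_batch(buffer: *mut u8, len: usize) -&gt; i32 {
    // Raw pointers in, integer status codes out: everything interesting
    // about the types has been erased down to what C can express.
    if buffer.is_null() || len == 0 {
        return -1;
    }
    0
}
</code></pre>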

<p>FFI is also stupid, but it is basically how all higher-level languages
such as Python, Ruby, JavaScript, Golang, Rust, etc. work. Somewhere down there
in the stack is a pointer passing into C-based system calls on your machine.
There be monsters.</p>

<p>One of our early design decisions, made to accommodate FFI-based engines, was
the adoption of <code class="language-plaintext highlighter-rouge">Iterator</code>-based interfaces rather than <code class="language-plaintext highlighter-rouge">Future</code>-based
interfaces. Previously I <a href="/2025/12/16/parallelism-is-tricky.html">wrote about our parallelism
challenges</a> which stem from this design
trade-off.</p>

<p>The debate was whether to hide an evented reactor like
<a href="https://tokio.rs">Tokio</a> <em>inside</em> kernel and hide that from the FFI caller, or
make the caller responsible for trying to make things event-driven. The early
influence of DuckDB weighed on the scales here, and the decision was made to
avoid embedding Tokio inside kernel.</p>
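<p>Reduced to a toy, the two shapes under debate look like this (illustrative
types only, assuming the <code class="language-plaintext highlighter-rouge">futures</code> crate):</p>

<pre><code class="language-rust">use futures::stream::{self, Stream};

// Iterator-based: the caller pulls one batch at a time, blocking as it goes.
// Trivial to expose over FFI.
fn scan_blocking() -&gt; impl Iterator&lt;Item = Vec&lt;u8&gt;&gt; {
    (0..3).map(|_| vec![0u8; 1024])
}

// Future/Stream-based: a runtime like Tokio can poll many of these at once,
// overlapping network I/O. Much harder to pass across an FFI boundary.
fn scan_async() -&gt; impl Stream&lt;Item = Vec&lt;u8&gt;&gt; {
    stream::iter((0..3).map(|_| vec![0u8; 1024]))
}
</code></pre>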

<p>In the Rust ecosystem it has taken a <em>long time</em> for us to <a href="https://areweasyncyet.rs/">become
async</a>. If you were curious why there has been such
an explosion of Rust across the systems programming ecosystem in the last five
years it’s because <strong>the Rust ecosystem is async</strong>.</p>

<p>The <em>first</em> Rust application I deployed into production used <code class="language-plaintext highlighter-rouge">async/await</code> from
the beginning, and without <em>any profiling</em> was an order of magnitude faster
than the system it replaced.</p>

<p><code class="language-plaintext highlighter-rouge">async/await</code> is the reason delta-rs was even successful in the first place!</p>

<p>There are ways to hack around the limitations of the <code class="language-plaintext highlighter-rouge">Iterator</code>
based API in Delta kernel, but the hill is <em>very</em> steep and will require
significant investment to make some parts of Delta kernel as fast as parallel
reads/scans would otherwise be.</p>

<p><code class="language-plaintext highlighter-rouge">async/await</code> gives incredible performance for free, but Delta kernel’s design choices mean it cannot take advantage of it and must pay the price.</p>

<h3 id="enginedata"><code class="language-plaintext highlighter-rouge">EngineData</code></h3>

<p>I am not smart enough to work on some parts of Delta kernel because of the
cleverness that is <code class="language-plaintext highlighter-rouge">EngineData</code>. Similar to
<a href="https://github.com/apache/arrow-rs">arrow-rs</a> and its <code class="language-plaintext highlighter-rouge">RecordBatch</code> and
<code class="language-plaintext highlighter-rouge">ArrayData</code> implementations, <code class="language-plaintext highlighter-rouge">EngineData</code> is an opaque type-erased container
for <em>stuff</em> and <em>things</em>.</p>

<p>One of the reasons I struggled to learn Rust, but ultimately came to love
the language is the strong type system which helps prevent whole classes of
problems. The strong type system also makes it a lot simpler for me to reason
about the code when I am working with it.</p>

<p>Everything in Delta kernel is
<a href="https://docs.rs/delta_kernel/latest/delta_kernel/engine_data/trait.EngineData.html">EngineData</a>
in one form or another. I was pretty preoccupied when this interface was
originally being hammered out so I’m less familiar with the history of
decisions that went into it, but I find the API of <code class="language-plaintext highlighter-rouge">EngineData</code> and its
counterparts of
<a href="https://docs.rs/delta_kernel/latest/delta_kernel/engine_data/trait.RowVisitor.html">RowVisitor</a>,
<a href="https://docs.rs/delta_kernel/latest/delta_kernel/engine_data/trait.GetData.html">GetData</a>,
and
<a href="https://docs.rs/delta_kernel/latest/delta_kernel/engine_data/trait.TypedGetData.html">TypedGetData</a>
to be <em>very</em> unpleasant to work with.</p>

<p>I <em>also</em> find
<a href="https://docs.rs/arrow/latest/arrow/array/struct.RecordBatch.html">RecordBatch</a>
unpleasant to work with. I really struggle to think of more user-unfriendly
APIs in the Rust data ecosystem. In the case of arrow’s <code class="language-plaintext highlighter-rouge">RecordBatch</code> I have
watched some of my colleagues pull in the <em>entire</em>
<a href="https://crates.io/crates/datafusion">datafusion</a> dependency just so they can
work with <code class="language-plaintext highlighter-rouge">RecordBatch</code> without resorting to the array offset and indices
silliness that permeates Apache Arrow code.</p>

<p>As unpleasant as I find <code class="language-plaintext highlighter-rouge">RecordBatch</code> there are <em>thousands</em> of developers
invested in its APIs and supporting infrastructure. <code class="language-plaintext highlighter-rouge">EngineData</code> does not have
a similar level of tooling, but shares some of the same razor-sharp edges.</p>

<p>The <code class="language-plaintext highlighter-rouge">EngineData</code> design has resulted in a <em>lot</em> of brittle <a href="https://github.com/delta-io/delta-kernel-rs/blob/e019ac3fa18707b633f625418d661ed198c86759/kernel/src/actions/visitors.rs#L114-L120">fixed array
offsets</a>
being littered throughout the Delta kernel codebase. These “getters” and the
visitors APIs result in the Rust type checker being <em>far</em> less useful with
Delta kernel than a more conventionally structured Rust project. This also
results in a much larger likelihood of runtime errors being emitted for
problems rather than compile-time checks.</p>

<p>The type-erased opaque bucket of bytes design of <code class="language-plaintext highlighter-rouge">EngineData</code> means that
working inside of <em>or with</em> Delta kernel sacrifices one of the most important
characteristics of the Rust language: the type checker.</p>
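<p>A toy example of the trade-off, mimicking the flavor of the getter APIs
rather than their real signatures:</p>

<pre><code class="language-rust">use std::any::Any;

// Type-erased access: a wrong index or wrong type becomes a runtime failure...
fn get_long(row: &amp;[Box&lt;dyn Any&gt;], index: usize) -&gt; Option&lt;i64&gt; {
    row.get(index)?.downcast_ref::&lt;i64&gt;().copied()
}

fn main() {
    let row: Vec&lt;Box&lt;dyn Any&gt;&gt; = vec![Box::new(42i64), Box::new("some.pdf".to_string())];
    assert_eq!(get_long(&amp;row, 0), Some(42));
    // ...where a plain struct field access would have failed to compile:
    assert_eq!(get_long(&amp;row, 1), None);
}
</code></pre>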

<hr />

<p>There are some good pieces of the design which honestly I cannot speak to
because I don’t stub my toes on them. Ryan and I have discussed at length the
importance of deferring work as long as possible in kernel to achieve higher
performance. Some of the Expression and Transform APIs allow for lower memory
footprints and faster log replay when work can be deferred or outright
<em>avoided</em>.</p>

<p>In delta-rs some of the performance deficiencies we have seen since adopting
Delta kernel have more to do with our interop code rather than kernel design
decisions. The delta-rs project is <em>massive</em>. As a general purpose Delta Lake
implementation, the surface area of changes that
<a href="https://github.com/roeap">Robert</a> had to touch to even get to where we are
today is enormous; his effort has been nothing short of heroic.</p>

<h2 id="community">Community</h2>

<p>The Delta kernel project is the first one I have worked on with Databricks
where there is <em>some</em> transparency around the week-to-week operations. 
The kernel Rust community has weekly meetings where
developers are talking to developers. 
Many of my early conversations with <a href="https://dennyglee.com/">Denny</a> were around
the propensity for Databricks to dump code into the Delta project as a fait
accompli. In one particularly egregious situation, there were protocol and
Delta/Spark changes which were reviewed, approved, and merged by Databricks
employees the week before being announced at <a href="https://dataandaisummit.com">Data and AI
Summit</a>. Kernel gets this right.</p>

<p>Even though I cannot make every weekly call with the kernel community, I love it when I can.</p>

<p><em>I don’t always attend the kernel weekly call, but when I do, I’m asking when the next release will happen.</em></p>

<p>For reasons I don’t think anybody really understands, Delta kernel moves <em>very</em>
slowly. Patch releases are of particular importance to me because delta-rs has
started to depend on the Delta kernel for its protocol implementation and
therefore <em>many</em> of our new bugs relate to Delta kernel in some way or another.</p>

<p>Releases have averaged around one every three weeks in 2025. Nine of the thirty
versions released to
<a href="https://crates.io/crates/delta_kernel/versions">crates.io</a> were patch fixes,
which means <strong>70%</strong> of published releases contained API breaking changes. Some
of that is inevitable as developers are figuring out the appropriate shape of
different APIs. As a consumer of this release cycle downstream this means that
I am highly unlikely to ever receive bug fixes without requiring development
effort to adapt to ever-changing APIs.</p>

<p>There is no free lunch.</p>

<p>For the <a href="https://crates.io/crates/deltalake">delta-rs</a> project this means our releases are <em>frequently blocked</em> on:</p>

<ul>
  <li>Delta kernel</li>
  <li><a href="https://crates.io/crates/arrow">Apache Arrow</a></li>
  <li><a href="https://crates.io/crates/datafusion">Apache Datafusion</a></li>
</ul>

<p>Delta kernel ships with a default engine that has a major version dependency on
Apache Arrow, a project which <em>also</em> avoids patch releases. This compounding
effect means that when a new <code class="language-plaintext highlighter-rouge">arrow</code> is released we (delta-rs) must wait for
that to be incorporated into both <code class="language-plaintext highlighter-rouge">datafusion</code> and <code class="language-plaintext highlighter-rouge">delta_kernel</code>, and for both
those crates to be released.</p>

<blockquote>
  <p>Any issue reported to delta-rs which requires a change in Arrow or Delta kernel
will typically take 1-2 months to resolve.</p>
</blockquote>

<h3 id="no-need-to-wait">No need to wait</h3>

<p>Up until yesterday, the latest released
<a href="https://crates.io/crates/deltalake/">deltalake</a> crate was <code class="language-plaintext highlighter-rouge">0.29.4</code> which
depended on Delta kernel <code class="language-plaintext highlighter-rouge">0.16.0</code>. That version is three months old and
unfortunately never saw any patch releases, which is part of the reason all four of the <code class="language-plaintext highlighter-rouge">0.29.x</code> releases of delta-rs depended upon it.</p>

<p>Using the crate downloads statistics as a <em>very</em> unscientific measure, I would
hazard a guess that <code class="language-plaintext highlighter-rouge">delta-rs</code> drives the majority of downloads for Delta
kernel.</p>

<p><img src="/images/post-images/2025-delta-kernel/delta_kernel_downloads.png" alt="delta_kernel downloads showing a lot of &quot;Other&quot;" /></p>

<p>The <code class="language-plaintext highlighter-rouge">0.18.0</code> release went out on November 20th, which has a small uptick, but
the big spike in early December correlates strongly with
<a href="https://github.com/delta-io/delta-rs/pull/3949">this pull request</a>, which pulled
<code class="language-plaintext highlighter-rouge">0.18.x</code> into the delta-rs repository.</p>

<p>For completeness’ sake, the <code class="language-plaintext highlighter-rouge">deltalake</code> crate’s downloads have a very similar
shape, but due to the longer release cycle of <code class="language-plaintext highlighter-rouge">0.29.x</code> it is difficult to tell
which versions are being heavily downloaded.</p>

<p><img src="/images/post-images/2025-delta-kernel/deltalake_downloads.png" alt="deltalake downloads also showing plenty of &quot;Other&quot;" /></p>

<hr />

<p>Maintaining stable APIs is a pain, but becomes much more important the lower in
the stack any dependency lives.</p>

<p>One approach could be to create release branches which have changes
cherry-picked between them as is needed. This introduces more release
engineering work and can be challenging. For my own purposes I <em>have done this</em>
and backported fixes for both Delta kernel and delta-rs in various shapes to
support customers who cannot boil the ocean with unstable releases every two to
three weeks.</p>

<p>At <a href="https://tech.scribd.com">Scribd</a> a patch release of delta-rs, with <em>zero API changes</em> requires at least:</p>

<ul>
  <li>New Lambdas to be built.</li>
  <li>Those Lambdas to be deployed to a testing environment.</li>
  <li><em>waiting for enough data volume to demonstrate reliability</em></li>
  <li>Promotion of a Lambda to a production environment.</li>
  <li><em>waiting for enough data volume to demonstrate success</em></li>
</ul>

<p>When everything operates smoothly this is about two developer-hours of time
from end to end, but that is with <em>zero API changes</em>.</p>

<p>Every set of API changes in delta-rs, Delta kernel, or Apache Arrow introduces
unknown developer time to perform updates and upgrades. Unless a new release of
<em>any</em> of these dependencies confers significant performance or quality
improvements, the business looks at these upgrades as <strong>unnecessary cost</strong> and
instead prefers to simply <em>not</em> update.</p>

<p>As a consequence bugs can be discovered in production months after a given
Delta kernel release. For example <a href="https://github.com/delta-io/delta-kernel-rs/pull/1561">this performance
bug</a> in Delta kernel had
actually existed for <strong>months</strong> in released crates. It was not until delta-rs
adopted more of Delta kernel that I was able to bring upgrades all the way
to production and discover <a href="https://github.com/buoyant-data/oxbow/commit/2363be8869a025b90bc46c2d7ed1893aca2d37e4">a couple of serious performance issues in delta-rs and Delta kernel</a>.</p>

<p>This timeline is getting a little confusing even for me, so let’s recap:</p>

<ul>
  <li><strong>October 2024</strong>: <a href="https://github.com/delta-io/delta-kernel-rs/pull/373">A JSON parsing workaround introduced</a> into kernel and released in <code class="language-plaintext highlighter-rouge">0.4.0</code>.</li>
  <li><strong>July 2025</strong>: <a href="https://crates.io/crates/deltalake/0.27.0">deltalake 0.27.0</a>
released with first serious adoption of Delta kernel at <code class="language-plaintext highlighter-rouge">0.13.0</code>.</li>
  <li><strong>August 2025</strong>: delta-rs performance <a href="https://github.com/delta-io/delta-rs/pull/3660">issue identified and fixed</a> along with a separate Delta kernel <a href="https://github.com/delta-io/delta-kernel-rs/pull/1171">performance issue with wide tables identified</a>. Both problems were identified after I invested some spare work-cycles in using pre-release code to interact with production data sets at Scribd.</li>
  <li><strong>September 2025</strong>: <a href="https://github.com/buoyant-data/oxbow/commit/d8f7b683d7ff1498d1c2eea96a2642d8f5b490c4">oxbow incorporates 0.28.0</a> and that’s quickly reverted until delta-rs <code class="language-plaintext highlighter-rouge">0.29.x</code> is released with additional improvements both in the crate and incorporated in the newer Delta kernel <code class="language-plaintext highlighter-rouge">0.16.0</code>.</li>
</ul>

<p>From my perspective, the amount of time invested in the performance issues
alone has not been “paid back” by improvements delivered from Delta kernel.</p>

<hr />
<p><strong>NOTE:</strong> HR would like to remind me to adopt a growth-mindset.</p>

<p>The improvements from incorporating Delta kernel have not paid back the time-invested <strong><em>yet</em></strong>.</p>

<hr />

<p>For more than a year there were performance issues sitting in <code class="language-plaintext highlighter-rouge">main</code> and
released kernel crates.</p>

<p>The time delay between changes being made in kernel and those changes being
used for real workloads is <strong>long</strong>. Too long to be useful as a constructive
feedback cycle for development.</p>

<p>I believe the only way to improve this is with faster releases and faster
feedback.</p>

<h3 id="have-you-tried-just">Have you tried just</h3>

<p>The very long user-feedback loops on released changes are only half of the
velocity troubles afflicting Delta kernel. I have personally avoided
contributing too much because the amount of yak-shaving can be pretty wild.</p>

<p>The performance improvement I recently suggested set a new personal TOP SCORE,
garnering a total of <em>84 comments</em> in the back-and-forth with four different
maintainers. That is more pull request comments than lines changed in the patch.</p>

<p>What is sometimes difficult to remember as a
maintainer is that a pull request does not represent the <em>start</em> of time
invested by a contributor. A pull request is usually the <em>end</em> of their
time-investment. In this case I had already invested between 5-8 hours of
profiling and understanding the issue before I could create the change.</p>

<p>Hidden in the yak-shaving <em>was useful feedback</em>, but the process was so frustrating
that I eventually threw in the towel and asked Nick to take it over after
about 12 hours of total time invested.</p>

<p>Of the currently <a href="https://github.com/delta-io/delta-kernel-rs/pulls?q=is%3Apr+is%3Aopen+sort%3Acomments-desc">open pull
requests</a>
the one with the most comments is at 99. Of the <a href="https://github.com/delta-io/delta-kernel-rs/pulls?q=is%3Apr+sort%3Acomments-desc+is%3Aclosed">closed pull
requests</a>
my maddening 84 comment odyssey doesn’t even fit on the <strong>first page</strong> of “most
commented” pull requests. The top spot is claimed by <a href="https://github.com/delta-io/delta-kernel-rs/pull/109">this pull
request</a> which has 369
comments and took over two months from open to merge. That monster is somewhat
of an outlier because it represents a substantial change earlier in the history
of Delta kernel, but a number of other changes are very much in the
hundreds-of-comments range.</p>

<p>The pull request culture in Delta kernel is fundamentally contributor hostile.</p>

<p>The suggestions I made to Nick on how to improve this are:</p>

<ul>
  <li>Assigning one maintainer (e.g. via <code class="language-plaintext highlighter-rouge">CODEOWNERS</code>; a sketch follows this list) to review each pull request.
There is relatively little benefit from multiple people offering differing
opinions on a non-maintainer’s pull request.</li>
  <li>Contributors should feel like their goals are shared with maintainers. The
<a href="https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/incorporating-feedback-in-your-pull-request">suggest
change</a>
functionality of GitHub pull requests is fantastic for this. Rather than
leaving a wall of text, suggesting direct code changes helps convey a shared
investment in the pull request.</li>
  <li>Better yet, rather than asking for tests or changes, <strong>make the changes</strong>.
Most contributors allow maintainers to push to their fork’s topic branches. I
regularly use this to add regression tests to contributors’ pull requests,
rather than asking them “please write a test.” Modelling good behavior
is usually more successful than <em>telling</em>.</li>
</ul>
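
<p>A minimal sketch of what that could look like with a GitHub <code class="language-plaintext highlighter-rouge">CODEOWNERS</code> file (the paths and handles below are hypothetical, not the project’s actual layout):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># CODEOWNERS (hypothetical): route each area to a single default reviewer
# instead of broadcasting every pull request to all maintainers.
# The last matching pattern takes precedence.
*           @maintainer-on-rotation
/kernel/    @maintainer-a
/ffi/       @maintainer-b
</code></pre></div></div>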

<p>Some other ideas that come to mind:</p>

<ul>
  <li>Any comment with “nit:” should simply be deleted. I see this at work from
time to time and will privately discuss with the developer how anti-social
that behavior comes across. Any bit of feedback that somebody feels is
nitpicky should be made in a follow-up pull request or just <em>not</em>. Nitpicks
are a waste of everybody’s time.</li>
  <li>There is a habit of “stacking PRs” in this project and, as I write this, there
are <strong>19</strong> open “stacked” pull requests. Smaller commits and smaller pull
requests should be preferred and would move quicker. I think there are a <em>lot</em> of
comments on pull requests because each pull request ends up being fairly
large and sits in an Open state for a long time.</li>
</ul>

<p>Many developers believe that code “stabilizes” as if some magic happens to code
in <code class="language-plaintext highlighter-rouge">main</code>. All code has a short half-life, especially code which
sits in open pull requests. The only way to demonstrate that anything is good
or bad is for it to be <em>used</em>. Stability comes from <em>use</em>.</p>

<p>I think everybody involved in the Delta kernel project, myself included, wants
a stable and high-performance foundation to build our Delta-based applications.
As Jez Humble and David Farley wrote in the book on <a href="https://en.wikipedia.org/wiki/Continuous_delivery">Continuous
Delivery</a>, a long cycle time
is usually <em>antithetical</em> to stability and reliability.</p>

<h2 id="theyre-good-kernels-brent">They’re good kernels Brent</h2>

<p>Golly this has been a bunch of words. To quote a wise man:</p>

<blockquote>
  <p>The Delta Kernel is one of the most technically challenging and ambitious open source projects I have worked on.</p>
</blockquote>

<p>I believe in the vision of Delta kernel and certainly wouldn’t be here if I
didn’t. The fragmentation that I see in the ecosystem is causing nothing but
trouble. Since starting this essay I have encountered <em>two</em> new and quirky
derivatives of delta-rs code which are trying to coerce it to do things which
Delta kernel is meant to support. In fact, the status quo of Delta kernel
supports the two use-cases I stumbled into!</p>

<p>Having a stable and high-performance foundation means that features and
improvements added into kernel benefit <em>everybody</em>! How marvelous is that? The
trick is getting <em>everybody</em> to use kernel!</p>

<p>Kernel’s success is important to the Delta Lake ecosystem and numerous others.
For kernel to succeed, however, I believe we need to adjust course in 2026 to
build a stronger technology foundation with more idiomatic Rust code:
leaning more heavily on the strengths of the Rust ecosystem in the interfaces,
and supporting Rust implementations with async/await as a focus, rather than FFI.</p>

<p>Building in a more Rust-familiar way will enable more new contributors along
with their fresh perspectives. We will need to evolve our release cadence and
change management into something clear and predictable. Making new developers
feel welcomed and their contributions valued will solidify kernel’s place as
the foundation in the ecosystem.</p>

<p>Stronger technology <em>and</em> a stronger community in 2026 will help Delta kernel
overcome the challenges we face today.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="deltalake" /><category term="opinion" /><summary type="html"><![CDATA[The Delta Kernel is one of the most technically challenging and ambitious open source projects I have worked on. Kernel is fundamentally about unifying all of our needs and wants from a Delta Lake implementation into a single cohesive yet-pluggable API surface. Towards the end of 2025 TD asked me to jot down some of the issues which have been frustrating me and/or slowing down the adoption of kernel in projects like delta-rs. At the outset of the project we all discussed concerns about what could actually be possible as we set out into uncharted territory. In many ways we have succeeded, in others we have failed.]]></summary></entry><entry><title type="html">Using sccache with not-S3</title><link href="https://brokenco.de//2026/01/02/sccache-with-not-s3.html" rel="alternate" type="text/html" title="Using sccache with not-S3" /><published>2026-01-02T00:00:00+00:00</published><updated>2026-01-02T00:00:00+00:00</updated><id>https://brokenco.de//2026/01/02/sccache-with-not-s3</id><content type="html" xml:base="https://brokenco.de//2026/01/02/sccache-with-not-s3.html"><![CDATA[<p>On a day-to-day basis I build a <em>lot</em> of Rust code. To make my life easier I
use <a href="https://github.com/mozilla/sccache">sccache</a> which I have written about
<a href="/2025/01/05/sccache-distributed-compilation.html">previously</a>. Periodically
the <code class="language-plaintext highlighter-rouge">sccache</code> daemon would exit and then no longer authenticate against my
local network’s not-S3 service.</p>

<p><code class="language-plaintext highlighter-rouge">sccache</code> would fail a <code class="language-plaintext highlighter-rouge">cargo build</code> command with an error like the following:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  sccache: error: Server startup failed: cache storage failed to read: Unexpected (temporary) at read =&gt; loading credential to sign http request

  Context:
     called: reqsign::LoadCredential
     service: s3
     path: .sccache_check
     range: 0-

  Source:
     error sending request for url (http://169.254.169.254/latest/api/token): operation timed out
</code></pre></div></div>

<p>Typically I would hit this error when I was busy, so I would disable <code class="language-plaintext highlighter-rouge">sccache</code>
by setting <code class="language-plaintext highlighter-rouge">RUSTC_WRAPPER=</code> in my environment. With a little more time on my
hands this winter holiday I went spelunking around in the <code class="language-plaintext highlighter-rouge">sccache</code> code and
found the issue!</p>

<p>That IP address is the AWS IMDSv2 service, which is actually being queried by
<a href="https://github.com/apache/OpenDAL">Apache OpenDAL</a> for credentials. Were I on
an AWS EC2 instance, this would return a token brokered by AWS STS allowing me
to use the instance’s role. Since I’m not on an EC2 machine and not even
remotely close to AWS, I needed to make <code class="language-plaintext highlighter-rouge">sccache</code> avoid this check.</p>

<p>Somewhat paradoxically, when <code class="language-plaintext highlighter-rouge">sccache</code> is configured <em>not</em> to use credentials,
it won’t enable the IMDSv2 feature in <code class="language-plaintext highlighter-rouge">opendal</code>, <em>but</em> the <code class="language-plaintext highlighter-rouge">opendal</code> subsystem
will still use the credentials defined in <code class="language-plaintext highlighter-rouge">~/.aws/credentials</code> associated with
my current <code class="language-plaintext highlighter-rouge">AWS_PROFILE</code>.</p>

<p>Quirky!</p>

<p>Updating my shell configuration with the following environment variable has made <code class="language-plaintext highlighter-rouge">sccache</code> easy breezy again!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export SCCACHE_S3_NO_CREDENTIALS=true
</code></pre></div></div>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="sccache" /><summary type="html"><![CDATA[On a day-to-day basis I build a lot of Rust code. To make my life easier I use sccache which I have written about previously. Periodically the sccache daemon would exit and then no longer authenticate against my local network’s not-S3 service.]]></summary></entry><entry><title type="html">Parallelism is a little tricky</title><link href="https://brokenco.de//2025/12/16/parallelism-is-tricky.html" rel="alternate" type="text/html" title="Parallelism is a little tricky" /><published>2025-12-16T00:00:00+00:00</published><updated>2025-12-16T00:00:00+00:00</updated><id>https://brokenco.de//2025/12/16/parallelism-is-tricky</id><content type="html" xml:base="https://brokenco.de//2025/12/16/parallelism-is-tricky.html"><![CDATA[<p>In theory many developers understand concurrency and parallelism, in practice I
think almost none of us do. At least not all the time. Building a mental model
of highly parallel interdependent software is incredibly time-consuming,
difficult, and error-prone. I have recently been doing a <em>lot</em> of performance
analysis with both <a href="https://github.com/delta-io/delta-rs">delta-rs</a> and
<a href="https://github.com/delta-io/delta-kernel-rs">delta-kernel-rs</a>. In the process
I have had to check some of my own assumptions of how things <em>should</em> work
compared to how they <em>do</em> work.</p>

<hr />
<p>Sidenote: to get an idea of how frequently we all “get it wrong”, subscribe to Aphyr’s <a href="https://jepsen.io/blog">Jepsen blog</a> for distributed systems safety research.</p>

<hr />

<p>The Delta Lake Rust binding has relied on <a href="https://tokio.rs/">Tokio</a> since the
beginning, which as any <code class="language-plaintext highlighter-rouge">/r/rust</code> commenter knows is an easy turbo button to
solve all your performance and parallelism needs!</p>

<p>When we were designing kernel however, there was a strong motivation <em>not</em> to
take a direct dependency on Tokio. Due to some early influences in the project,
there was a pretty strong push to support C/C++ based engines with
delta-kernel-rs. Those engines would need a foreign function interface (FFI)
and pushing something like Tokio or even
<a href="https://docs.rs/futures/latest/futures/">futures</a> over an FFI boundary was
unsavory to say the least.</p>

<p>What may be one of our original performance sins in kernel was designing APIs
around the <a href="https://doc.rust-lang.org/std/iter/trait.Iterator.html">Iterator</a>
trait. I am writing this partially to help form my thoughts, but consider this screenshot from
<a href="https://github.com/KDAB/hotspot">Hotspot</a> showing Tokio tasks doing the work of “log replay” when opening a large complex Delta table:</p>

<p><img src="/images/post-images/2025-12-delta-rs/tokio-thread-switching.png" alt="Context switching in tasks" /></p>

<p>These two tasks are <em>concurrent</em> but they are not parallel. In <code class="language-plaintext highlighter-rouge">Iterator</code>
terms, this is about what I would expect to see. The conceptual model for execution is:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">Iterator</code> created.</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
  <li>“do work”</li>
  <li>return result</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
</ol>

<p>The fact that work is being done on different tasks is irrelevant. <code class="language-plaintext highlighter-rouge">Iterator</code>
is lazy and only going to “do work” when it is asked, thus a serial
invocation model.</p>
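
<p>A minimal sketch of that serial model (illustrative only, not actual kernel code):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::time::Duration;

/// Illustrative stand-in for a kernel-style lazy iterator.
struct LogReplay {
    remaining: usize,
}

impl Iterator for LogReplay {
    type Item = u64;

    fn next(&amp;mut self) -&gt; Option&lt;u64&gt; {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        // "do work" happens inline, on the caller's thread, on every call.
        std::thread::sleep(Duration::from_millis(50)); // placeholder work
        Some(42)
    }
}

fn main() {
    // Each next() blocks until its batch is done; nothing overlaps.
    for batch in (LogReplay { remaining: 3 }) {
        println!("batch: {batch}");
    }
}
</code></pre></div></div>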

<p>When parallelism is designed, that means work <strong>must</strong> be done at the same
time, but it does not necessarily mean that it must be done “lazily” in the
style of the <code class="language-plaintext highlighter-rouge">Iterator</code> trait.</p>

<p>In delta-rs <a href="https://github.com/roeap">Robert</a> pulled in some code from
<a href="https://datafusion.apache.org">Datafusion</a> which relies on Tokio’s
<a href="https://docs.rs/tokio/latest/tokio/task/struct.JoinSet.html">JoinSet</a> API.  The <code class="language-plaintext highlighter-rouge">JoinSet</code> is effectively what we want if we want an Iterator-style parallel work executor:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">JoinSet</code> created, “do work” begins</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
  <li>return result</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
  <li>return result</li>
  <li>“do work”</li>
  <li><code class="language-plaintext highlighter-rouge">next()</code> is invoked</li>
  <li>return result</li>
</ol>
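
<p>A minimal sketch of that eager model (illustrative only; assumes tokio with the <code class="language-plaintext highlighter-rouge">macros</code>, <code class="language-plaintext highlighter-rouge">rt-multi-thread</code>, and <code class="language-plaintext highlighter-rouge">time</code> features):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::time::Duration;
use tokio::task::JoinSet;

#[tokio::main]
async fn main() {
    // "do work" begins as soon as tasks are spawned, on the runtime's
    // worker threads, before anyone asks for a result.
    let mut set = JoinSet::new();
    for i in 0..4u64 {
        set.spawn(async move {
            tokio::time::sleep(Duration::from_millis(50 * i)).await; // placeholder work
            i
        });
    }

    // join_next() plays the role of next(), yielding results in completion
    // order while the remaining tasks keep running in parallel.
    while let Some(result) = set.join_next().await {
        println!("completed: {result:?}");
    }
}
</code></pre></div></div>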

<p>Currently the use of <code class="language-plaintext highlighter-rouge">JoinSet</code> happens much higher in the stack inside of
delta-rs, but does <em>not</em> happen deeper down in the delta-kernel-rs code.</p>

<p>What the profiling <em>likely</em> indicates is that there are serial <code class="language-plaintext highlighter-rouge">Iterator</code>
executions happening in the kernel layer which lead to a bottleneck for
callers, regardless of how parallel-capable those callers may be.</p>

<hr />

<p>Tokio has received criticism in the past about its suitability for heavy
CPU-bound operations. Its async/await primitives work incredibly well for
anything which has I/O wait involved. The scheduler can switch between tasks
when a socket is awaiting data, making it highly concurrent for I/O-bound
applications. Tokio tasks function similarly to goroutines in Go, greenlets in
Python, etc. As I dug deeper into this problem I wanted to ensure that Tokio
was going to behave as I expected with CPU-bound operations.</p>

<p>I compared the performance of a <code class="language-plaintext highlighter-rouge">JoinSet</code>-based program which generates
RSA keys against a <a href="https://crates.io/crates/rayon">rayon</a>-based program. Both are
close enough in performance and parallelism. Both effectively used all
available cores when the Tokio runtime was configured with a single worker
thread per core.</p>
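
<p>The comparison looked roughly like the following sketch, with a CPU-bound stand-in instead of actual RSA key generation (not the real benchmark code):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use rayon::prelude::*;
use tokio::task::JoinSet;

// CPU-bound stand-in for RSA key generation.
fn burn(n: u64) -&gt; u64 {
    (0..n).fold(0, |acc, x| acc.wrapping_add(x.wrapping_mul(x)))
}

// rayon: data parallelism over a work-stealing thread pool.
fn with_rayon() -&gt; Vec&lt;u64&gt; {
    (0..16u64).into_par_iter().map(|_| burn(10_000_000)).collect()
}

// Tokio: CPU-bound tasks spawned onto a multi-threaded runtime,
// one worker thread per core as in the comparison above.
async fn with_tokio() -&gt; Vec&lt;u64&gt; {
    let mut set = JoinSet::new();
    for _ in 0..16 {
        set.spawn(async { burn(10_000_000) });
    }
    let mut results = Vec::with_capacity(16);
    while let Some(res) = set.join_next().await {
        results.push(res.expect("task panicked"));
    }
    results
}

fn main() {
    println!("rayon results: {}", with_rayon().len());
    let rt = tokio::runtime::Runtime::new().expect("runtime");
    println!("tokio results: {}", rt.block_on(with_tokio()).len());
}
</code></pre></div></div>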

<hr />

<p>Coming back to the Delta Lake ecosystem and our beloved <code class="language-plaintext highlighter-rouge">Iterator</code>, I think
there are two paths ahead:</p>

<ul>
  <li>The Easy Road: taking <code class="language-plaintext highlighter-rouge">JoinSet</code> into the default engine of delta-kernel-rs
will at least alleviate some of the “concurrent but not parallel” problems
that are lurking down there.</li>
  <li>The Hard Road: attempting to put a synchronous <code class="language-plaintext highlighter-rouge">Engine</code> interface in front of
inherently I/O bound operations is going to lead to performance deficiencies
compared to an evented system like Tokio or anything else with a kqueue/epoll
reactor at its core. Putting async/await at the foundation of delta-kernel-rs
would allow for driving more concurrent and parallel behavior depending on
the use-case.</li>
</ul>

<p>The performance of delta-rs is a major focus of my work in the project. In 2026 I look
forward to sharing more analysis and more <a href="https://github.com/delta-io/delta-kernel-rs/pull/1561">pull
requests</a>!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="deltalake" /><summary type="html"><![CDATA[In theory many developers understand concurrency and parallelism, in practice I think almost none of us do. At least not all the time. Building a mental model of highly parallel interdependent software is incredibly time-consuming, difficult, and error-prone. I have recently been doing a lot of performance analysis with both delta-rs and delta-kernel-rs. In the process I have had to check some of my own assumptions of how things should work compared to how they do work.]]></summary></entry><entry><title type="html">Things you should know about Url in Rust</title><link href="https://brokenco.de//2025/12/03/about-url.html" rel="alternate" type="text/html" title="Things you should know about Url in Rust" /><published>2025-12-03T00:00:00+00:00</published><updated>2025-12-03T00:00:00+00:00</updated><id>https://brokenco.de//2025/12/03/about-url</id><content type="html" xml:base="https://brokenco.de//2025/12/03/about-url.html"><![CDATA[<p>I would guess most developers think of URLs as a string with a <code class="language-plaintext highlighter-rouge">https://</code> at
the beginning. In many cases, assumptions are made about these URL-shaped
strings which may be confusing, misleading, or flat-out incorrect. The <a href="https://crates.io/crates/url">url</a> crate is compliant with the RFCs about URLs, but while being technically correct is the best kind of correct, that doesn’t mean it isn’t still confusing.</p>

<p>Here are some common misconceptions that I have seen crop up as I have worked on incorporating more and more <code class="language-plaintext highlighter-rouge">url::Url</code> usage in my Rust projects.</p>

<h3 id="slashes-are-load-bearing">Slashes are load-bearing</h3>

<p><em>Most</em> web frameworks will take a request like <code class="language-plaintext highlighter-rouge">https://example.com/hello//</code> and route that to the handler for <code class="language-plaintext highlighter-rouge">/hello</code>, conveniently dropping the redundant trailing slashes. From a URL specification standpoint, this is <em>probably not</em> correct. Where I might see a couple of trailing slashes, a URL parser sees a <code class="language-plaintext highlighter-rouge">hello</code> path segment followed by two empty path segments. Consider the following.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">left</span> <span class="o">=</span> <span class="nn">Url</span><span class="p">::</span><span class="nf">parse</span><span class="p">(</span><span class="s">"s3://bucket/prefix/"</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
<span class="k">let</span> <span class="n">right</span> <span class="o">=</span> <span class="nn">Url</span><span class="p">::</span><span class="nf">parse</span><span class="p">(</span><span class="s">"s3://bucket/prefix"</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
</code></pre></div></div>

<p>These are not equivalent.</p>

<p>The <code class="language-plaintext highlighter-rouge">path_segments()</code> are different too:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  left: ["prefix", ""]
 right: ["prefix"]
</code></pre></div></div>

<p>This is because the trailing slash means there’s another path segment; it just happens to be empty. Cue subtle bugs from user code which expects the two given URLs to behave identically because … well, S3 treats them as such, as do most other web servers today.</p>

<h3 id="join-the-fun">Join the fun</h3>

<p>With that trailing slash meaning there’s an empty path segment on the <code class="language-plaintext highlighter-rouge">Url</code>, joining onto a <code class="language-plaintext highlighter-rouge">Url</code> behaves differently than you might otherwise expect. For example:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">left</span><span class="nf">.join</span><span class="p">(</span><span class="s">"_delta_log"</span><span class="p">);</span> <span class="c1">// produces `s3://bucket/prefix/_delta_log`</span>
<span class="n">right</span><span class="nf">.join</span><span class="p">(</span><span class="s">"_delta_log"</span><span class="p">);</span> <span class="c1">// produces `s3://bucket/_delta_log`</span>
</code></pre></div></div>

<p>The <a href="https://docs.rs/url/latest/url/struct.Url.html#method.join">docs</a> try to make this clear:</p>

<blockquote>
  <p>A trailing slash is significant. Without it, the last path component is considered to be a “file” name to be removed to get at the “directory” that is used as the base.</p>
</blockquote>

<p>With the subtle yet significant behavior of the trailing slash, this nuance
might not be noticed by most developers.</p>
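
<p>In practice I find it safer to normalize before joining. A minimal sketch, using a hypothetical helper (not part of the <code class="language-plaintext highlighter-rouge">url</code> crate):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use url::Url;

/// Hypothetical helper: ensure a trailing slash so that join() treats the
/// last path segment as a "directory" rather than a "file" to be replaced.
fn ensure_trailing_slash(url: &amp;Url) -&gt; Url {
    let mut url = url.clone();
    if !url.path().ends_with('/') {
        let path = format!("{}/", url.path());
        url.set_path(&amp;path);
    }
    url
}

fn main() -&gt; Result&lt;(), url::ParseError&gt; {
    let right = Url::parse("s3://bucket/prefix")?;
    let joined = ensure_trailing_slash(&amp;right).join("_delta_log")?;
    assert_eq!(joined.as_str(), "s3://bucket/prefix/_delta_log");
    Ok(())
}
</code></pre></div></div>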

<h3 id="file-urls-are-weird">File URLs are weird.</h3>

<p>A file URL is one which starts with <code class="language-plaintext highlighter-rouge">file://</code>, but because a slash is not
always a slash on operating systems, especially those developed in Redmond, WA,
file URL behavior is not always consistent with what developers expect.</p>

<p>In the <code class="language-plaintext highlighter-rouge">url</code> crate I ended up <a href="https://github.com/servo/rust-url/issues/1086">filing a bug</a> for this behavior but as of today these two produce different results:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">Url</span><span class="p">::</span><span class="nf">parse</span><span class="p">(</span><span class="s">"file:///home/tyler/../../dev/null"</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
<span class="nn">Url</span><span class="p">::</span><span class="nf">from_file_path</span><span class="p">(</span><span class="s">"/home/tyler/../../dev/null"</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
</code></pre></div></div>

<p>The resulting <code class="language-plaintext highlighter-rouge">Url</code> structs are <em>not</em> equivalent, and the parsing of the file URL results in canonicalization, removing the <code class="language-plaintext highlighter-rouge">..</code> segments from the path and producing a <code class="language-plaintext highlighter-rouge">Url</code> that is effectively <code class="language-plaintext highlighter-rouge">/dev/null</code>. The second <code class="language-plaintext highlighter-rouge">Url</code> however has a <code class="language-plaintext highlighter-rouge">.path()</code> of the full uncanonicalized path passed in.</p>
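
<p>To make the difference concrete (behavior as described above, as of this writing):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use url::Url;

fn main() {
    let parsed = Url::parse("file:///home/tyler/../../dev/null").unwrap();
    let from_path = Url::from_file_path("/home/tyler/../../dev/null").unwrap();

    // parse() canonicalizes the dot-segments away...
    assert_eq!(parsed.path(), "/dev/null");
    // ...while from_file_path() keeps the uncanonicalized path verbatim.
    assert_eq!(from_path.path(), "/home/tyler/../../dev/null");
    assert_ne!(parsed, from_path);
}
</code></pre></div></div>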

<p>The oddities of file URLs abound, and <a href="https://url.spec.whatwg.org/">the URL specification</a> has a lot of documented “quirks” about Windows drive lettering and file URLs, which leads to irritating bugs like <a href="https://github.com/delta-io/delta-rs/issues/3551">this one</a>.</p>

<hr />

<p><code class="language-plaintext highlighter-rouge">Url</code> types are better than raw <code class="language-plaintext highlighter-rouge">str</code> types for working with URL shaped data in
any Rust program. The additional structure is really important for many reasons.
<strong>However</strong> the use of <code class="language-plaintext highlighter-rouge">Url</code> doesn’t absolve the developer of considering
user-inputs where slashes are plentiful and path segments are goofy.</p>

<p>Personally, I was hoping that simply adopting <code class="language-plaintext highlighter-rouge">Url</code> would let me care less
about garbage input, but unfortunately more structured garbage is still
garbage.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><summary type="html"><![CDATA[I would guess most developers think of URLs as a string with a https:// at the beginning. In many cases there are assumptions that are made about these URL-shaped strings which may be confusing, misleading, or flat out incorrect. The url crate is compliant to the RFCs about URLs, but while being technically correct is the best kind of correct, that doesn’t mean it still isn’t confusing.]]></summary></entry><entry><title type="html">Improving performance with the log crate</title><link href="https://brokenco.de//2025/11/30/log-log-log.html" rel="alternate" type="text/html" title="Improving performance with the log crate" /><published>2025-11-30T00:00:00+00:00</published><updated>2025-11-30T00:00:00+00:00</updated><id>https://brokenco.de//2025/11/30/log-log-log</id><content type="html" xml:base="https://brokenco.de//2025/11/30/log-log-log.html"><![CDATA[<p>On a small crate I maintain a friendly stranger made a suggestion to improve
performance by making logging optional.</p>

<p>It is rare that somebody will not only make a pull request to such a niche
crate but also share some performance numbers with their change, which I
<em>always</em> appreciate. Bringing receipts to a performance discussion is a
<strong>must</strong>.</p>

<p>The main concern they were addressing was logging statements with the
<a href="https://crates.io/crates/log">log</a> crate in a tight loop of invocations within
the crate. I was <em>certain</em> this was a common issue and went digging through the documentation again and found <strong><a href="https://docs.rs/log/latest/log/#compile-time-filters">Compile time filters</a></strong>.</p>

<p>With the <code class="language-plaintext highlighter-rouge">log</code> crate, these <code class="language-plaintext highlighter-rouge">Cargo.toml</code> features allow you to statically disable the <code class="language-plaintext highlighter-rouge">trace!</code>, <code class="language-plaintext highlighter-rouge">debug!</code>, etc macros at compile time, for example:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">xmltojson</span> <span class="p">=</span> <span class="s">"*"</span>
<span class="nn">log</span> <span class="o">=</span> <span class="p">{</span> <span class="py">version</span> <span class="p">=</span> <span class="s">"0.4"</span><span class="p">,</span> <span class="py">features</span> <span class="p">=</span> <span class="s">"release_max_level_info"</span><span class="p">}</span>
</code></pre></div></div>

<p>This would disable any log level more granular than <code class="language-plaintext highlighter-rouge">info!</code>, effectively disabling <code class="language-plaintext highlighter-rouge">trace!</code> and <code class="language-plaintext highlighter-rouge">debug!</code> in the resulting release builds.</p>
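
<p>As a minimal sketch (assuming a crate that already depends on <code class="language-plaintext highlighter-rouge">log</code>), a hot loop can keep its <code class="language-plaintext highlighter-rouge">debug!</code> statements without paying for them in release builds:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use log::{debug, info};

fn parse_nodes(nodes: &amp;[&amp;str]) {
    for node in nodes {
        // With `release_max_level_info` set, this expands to a no-op in
        // release builds: no formatting, no level check, no call at all.
        debug!("visiting node: {node}");
    }
    info!("parsed {} nodes", nodes.len());
}
</code></pre></div></div>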

<p>Pretty neat!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><summary type="html"><![CDATA[On a small crate I maintain a friendly stranger made a suggestion to improve performance, by making logging optional.]]></summary></entry><entry><title type="html">The end of the road for kafka-delta-ingest</title><link href="https://brokenco.de//2025/10/30/kafka-delta-ingest-was-fun.html" rel="alternate" type="text/html" title="The end of the road for kafka-delta-ingest" /><published>2025-10-30T00:00:00+00:00</published><updated>2025-10-30T00:00:00+00:00</updated><id>https://brokenco.de//2025/10/30/kafka-delta-ingest-was-fun</id><content type="html" xml:base="https://brokenco.de//2025/10/30/kafka-delta-ingest-was-fun.html"><![CDATA[<p>After five years in production kafka-delta-ingest at Scribd has been shut off
and removed from our infrastructure.
<a href="http://github.com/delta-io/kafka-delta-ingest">kafka-delta-ingest</a> was the
motivation behind my team creating
<a href="https://github.com/delta-io/delta-rs">delta-rs</a>, the most successful open
source project I have started to date. With kafka-delta-ingest we achieved our
original stated goals and reduced streaming data ingestion costs by <strong>95%</strong>. In
the time since however, we have <em>further</em> reduced that cost <a href="https://www.youtube.com/watch?v=h8nCF_OI0O0">with even more
efficient infrastructure</a>.</p>

<p>The original kafka-delta-ingest/delta-rs implementations were created by the
joint efforts of the following talented developers across <em>three continents</em> in
the middle of 2020, an otherwise totally chill time in world history.</p>

<ul>
  <li><a href="https://github.com/houqp">QP Hou</a></li>
  <li><a href="https://github.com/xianwill">Christian Williams</a></li>
  <li><a href="https://github.com/mosyp">Mykhailo Osypov</a></li>
  <li><a href="https://github.com/nevi-me">@nevi-me</a></li>
</ul>

<p>Prior to our creation of delta-rs, the only way to read and write <a href="https://delta.io">Delta
Lake</a> tables was through <a href="https://spark.apache.org">Apache
Spark</a>. While it is an incredibly powerful tool for
reading and transforming data, it is entirely too slow and overweight for the
task of high-throughput data ingestion. QP and I found ourselves loving
<a href="https://rust-lang.org">Rust</a> and I was able to corner the funding to get the
project started on the promise of lower operational costs.</p>

<p>Boy howdy has the investment in Rust delivered. The implementation of kafka-delta-ingest dramatically lowered our operational costs as Christian shares in this video:</p>

<center><iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/do4jsxeKfd4?si=vAgTIsWWn4k7f5qi" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe></center>

<p>Christian also shared some <a href="https://www.youtube.com/watch?v=mLmsZ3qYfB0">architecture and discussion in this
video</a>, which I think are useful
for anybody building streaming systems around Delta Lake.</p>

<p>Here’s a <a href="https://www.youtube.com/watch?v=JvonUisY7vE&amp;t=51s">demo by Christian</a> too!</p>

<hr />

<p>Ultimately, the reason kafka-delta-ingest was decommissioned was that I created an <em>even
cheaper</em> ingestion process. My work on the
<a href="https://github.com/buoyant-data/oxbow">oxbow</a> suite coupled with
<a href="https://www.databricks.com/glossary/medallion-architecture">the medallion
architecture</a>
has made contemporary Delta Lake ingestion less than 10% of the total data
platform cost.</p>

<p>The big argument against kafka-delta-ingest was <a href="https://kafka.apache.org">Apache
Kafka</a>. If an organization has Kafka for other
reasons, then kafka-delta-ingest can be a useful “sidecar” process to persist
data flowing through Kafka. If however the organization is running Kafka <em>just</em>
for ingestion, there are cheaper options available. As the organization
evolved, the other consumers of Kafka drifted away, driving the value
proposition of kafka-delta-ingest lower and lower.</p>

<p>This doesn’t mean kafka-delta-ingest is not <em>useful</em>, it’s just no longer
useful at Scribd.</p>

<hr />

<p><a href="https://github.com/mightyshazam">Kyjah Keyes</a> and I are the maintainers of
kafka-delta-ingest and we now are both in the position of <em>not actually using
it</em> anymore.</p>

<p>I will continue to make delta-rs upgrades to it, since kafka-delta-ingest
continues to be a useful test bed for API changes and integration testing, but
I don’t have big plans or ideas on how to grow the project further.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="s3" /><category term="deltalake" /><category term="kafka" /><category term="rust" /><summary type="html"><![CDATA[After five years in production kafka-delta-ingest at Scribd has been shut off and removed from our infrastructure. kafka-delta-ingest was the motivation behind my team creating delta-rs, the most successful open source project I have started to date. With kafka-delta-ingest we achieved our original stated goals and reduced streaming data ingestion costs by 95%. In the time since however, we have further reduced that cost with even more efficient infrastructure.]]></summary></entry><entry><title type="html">Delta Lake Live!</title><link href="https://brokenco.de//2025/09/18/delta-lake-live.html" rel="alternate" type="text/html" title="Delta Lake Live!" /><published>2025-09-18T00:00:00+00:00</published><updated>2025-09-18T00:00:00+00:00</updated><id>https://brokenco.de//2025/09/18/delta-lake-live</id><content type="html" xml:base="https://brokenco.de//2025/09/18/delta-lake-live.html"><![CDATA[<p>Every Tuesday morning at 7am I have a date.</p>

<p>For the past few weeks <a href="https://github.com/roeap">Robert</a> and I have been
jumping onto a shared <a href="https://twitch.tv/agentdero">Twitch</a> stream and working
through issues, code reviews, and design discussions for the
<a href="https://github.com/delta-io/delta-rs">delta-rs</a> project.</p>

<p>The idea for the project came up at Data and AI Summit earlier this year.
Robert lives in Europe and I am as west as west coast in the US generally gets.
The timezone spread has been making collaboration difficult on the topics which
require lively synchronous debate.</p>

<p>The Delta Lake project is open source and therefore, in my opinion, the discussions and development of the project should also be open! What better than a big open live stream to work through column mapping, deletion vectors, bugs, performance challenges, and more!</p>

<p>I have livestreamed development <a href="/2012/08/28/pairing-with-the-fourth-wall">in the
past</a> and found it useful, but with
“Delta Lake Live!” we have a much more regular schedule, agenda, and way for
folks in the chat to engage, making it all that much more fun!</p>

<p>The streams are <a href="https://www.youtube.com/watch?v=6EZM0AbLkWU&amp;list=PLzxP01GQMpjdXtIAVxv_ziQHqyhaEhAVh">also being archived on
YouTube</a>
but you’re more than welcome to pop by and hang out <a href="https://www.twitch.tv/agentdero/schedule">every Tuesday at 7am
PDT</a></p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="deltalake" /><summary type="html"><![CDATA[Every Tuesday morning at 7am I have a date.]]></summary></entry></feed>