<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://brokenco.de//feed/by_tag/arrow.xml" rel="self" type="application/atom+xml" /><link href="https://brokenco.de//" rel="alternate" type="text/html" /><updated>2026-04-12T21:39:52+00:00</updated><id>https://brokenco.de//feed/by_tag/arrow.xml</id><title type="html">rtyler</title><subtitle>a moderately technical blog</subtitle><author><name>R. Tyler Croy</name></author><entry><title type="html">2026 March: Recently Studied Stuff</title><link href="https://brokenco.de//2026/03/21/fresh-from-rss.html" rel="alternate" type="text/html" title="2026 March: Recently Studied Stuff" /><published>2026-03-21T00:00:00+00:00</published><updated>2026-03-21T00:00:00+00:00</updated><id>https://brokenco.de//2026/03/21/fresh-from-rss</id><content type="html" xml:base="https://brokenco.de//2026/03/21/fresh-from-rss.html"><![CDATA[<p>Over the past week I have made a more conscious effort to keep track of some
really interesting articles that came through my feed reader. I am a big fan of
the open web and the power of RSS for disseminating interesting information
from actual people. Below are some of the posts I have enjoyed the most recently!</p>

<p><strong><a href="https://felipe.rs/2024/10/23/arrow-over-http/">Compressed Apache Arrow tables over HTTP</a></strong></p>

<p>When discussing transport protocols for sending data between services at work
recently, a colleague asked “why can’t we just yeet Arrow over HTTP?” It turns out you <a href="https://github.com/apache/arrow-experiments/tree/main/http/get_simple/python">absolutely can</a>, and Arrow IPC streams even have a registered MIME type:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Content-Type: application/vnd.apache.arrow.stream
</code></pre></div></div>
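
<p>As a rough sketch of what producing such a response body could look like in Rust
(the HTTP wiring is left out, and treat the <code class="language-plaintext highlighter-rouge">arrow</code> crate usage here as an
illustration rather than a recipe):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

// Serialize a RecordBatch into an Arrow IPC stream suitable for an HTTP
// response body sent with Content-Type: application/vnd.apache.arrow.stream
fn arrow_ipc_body() -&gt; arrow::error::Result&lt;Vec&lt;u8&gt;&gt; {
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;

    let mut body = Vec::new();
    {
        let mut writer = StreamWriter::try_new(&amp;mut body, &amp;schema)?;
        writer.write(&amp;batch)?;
        writer.finish()?;
    }
    Ok(body)
}
</code></pre></div></div>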

<p><strong><a href="https://blog.dataexpert.io/p/parquet-can-shrink-your-data-100x">Understanding Parquet format for beginners</a></strong></p>

<p>A great introduction to the <a href="https://parquet.apache.org">Apache Parquet</a> format
and why it makes so many things better with large data storage systems like
<a href="https://delta.io">Delta Lake</a>. I have written on this
<a href="/tag/parquet.html">topic</a> before and encourage you to take another read
through <a href="https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/">this blog
post</a>
by some maintainers of the <a href="https://crates.io/crates/parquet">parquet</a> crate.</p>

<p><strong><a href="https://apenwarr.ca/log/20260316">Every layer of review makes you 10x slower</a></strong></p>

<blockquote>
  <p>Every layer of approval makes a process 10x slower [..]</p>

  <p>Just to be clear, we’re counting “wall clock time” here rather than effort. Almost all the extra time is spent sitting and waiting.</p>

  <ul>
    <li>Code a simple bug fix: 30 minutes</li>
    <li>Get it code reviewed by the peer next to you: 300 minutes → 5 hours → half a day</li>
    <li>Get a design doc approved by your architects team first: 50 hours → about a week</li>
    <li>Get it on some other team’s calendar to do all that (for example, if a customer requests a feature): 500 hours → 12 weeks → one fiscal quarter</li>
  </ul>
</blockquote>

<p>This inspired these thoughts which I shared with the <a href="https://github.com/delta-io/delta-rs">delta-rs</a> community:</p>

<p>“what if we didn’t require code review for merging into main”</p>

<p>I’m exploring what we might need in place to make that happen.
“Why would you do such a thing, code review is so valuable!” I do find code
review valuable, but we seem to lose a lot of flow time to timezones, differing
work schedules, and a number of other things. For small changes, especially bug
fixes that come with tests, I would be much more comfortable with maintainers
merging once CI goes green.</p>

<p>Some pieces of the puzzle that I think would be needed:</p>

<ul>
  <li>Soft caps on pull requests. I saw this mentioned somewhere else, but implementing a soft cap of &lt;500 lines per pull request helps people avoid massive unreviewable changes and keeps each change simple to integrate.</li>
  <li>Incorporating into CI some of the benchmarking work that has already been explored. If performance of key operations is not affected and the build is green, go for it.</li>
  <li>Stronger semantic version checks: if our APIs have not changed and all tests pass, I’m generally comfortable with maintainers landing stuff.</li>
  <li>Implementing Apache Software Foundation style release candidates and voting: this is where we would put a mandatory bottleneck. Rather than the jokey Slack emojis I tend to use, a true release candidate process would require review and a vote before we push anything to users.</li>
</ul>

<p>All of this is to say that reviews can still be requested, but I would love to
see us land more improvements faster, and our mix of schedules can make pushing
every change through a review queue a lot slower than necessary.</p>

<p><strong><a href="https://www.possiblerust.com/pattern/conditional-impls">Conditional Impls in Rust</a></strong></p>

<blockquote>
  <p>It’s possible in Rust to conditionally implement methods and traits based on
the traits implemented by a type’s own type parameters. While this is used
extensively in Rust’s standard library, it’s not necessarily obvious that
this is possible.</p>
</blockquote>
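
<p>A quick illustration of the pattern described above, using a made-up
<code class="language-plaintext highlighter-rouge">Wrapper</code> type: the extra method and trait impl only exist when the type
parameter satisfies the bound.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::fmt::Display;

struct Wrapper&lt;T&gt; {
    value: T,
}

// Available for every Wrapper&lt;T&gt;, no bounds required.
impl&lt;T&gt; Wrapper&lt;T&gt; {
    fn new(value: T) -&gt; Self {
        Wrapper { value }
    }
}

// This method only exists when T itself implements Display.
impl&lt;T: Display&gt; Wrapper&lt;T&gt; {
    fn describe(&amp;self) -&gt; String {
        format!("wrapping {}", self.value)
    }
}

// Likewise, Wrapper&lt;T&gt; is only Display when T is Display, much like the
// standard library only implements Clone for Option&lt;T&gt; when T: Clone.
impl&lt;T: Display&gt; Display for Wrapper&lt;T&gt; {
    fn fmt(&amp;self, f: &amp;mut std::fmt::Formatter&lt;'_&gt;) -&gt; std::fmt::Result {
        write!(f, "Wrapper({})", self.value)
    }
}

fn main() {
    let w = Wrapper::new(42);
    println!("{}", w.describe()); // i32 is Display, so this compiles
    // Wrapper::new(vec![1, 2, 3]).describe(); // would not compile: Vec&lt;i32&gt; is not Display
}
</code></pre></div></div>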

<p>I have been vaguely aware of this functionality but haven’t really taken the
time to consider it, so I really appreciated this post walking through the
conditional impl functionality in Rust.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rss" /><category term="arrow" /><category term="parquet" /><category term="rust" /><summary type="html"><![CDATA[Over the past week I have made a more conscious effort to keep track of some really interesting articles that came through my feed reader. I am a big fan of the open web and the power of RSS for disseminating interesting information from actual people. Below are some really interesting posts I have read recently!]]></summary></entry><entry><title type="html">Based Lake, a petabyte-scale low-latency data lake</title><link href="https://brokenco.de//2026/03/10/based-lake.html" rel="alternate" type="text/html" title="Based Lake, a petabyte-scale low-latency data lake" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://brokenco.de//2026/03/10/based-lake</id><content type="html" xml:base="https://brokenco.de//2026/03/10/based-lake.html"><![CDATA[<p>I had a chat today about building large scale low-latency data retrieval
systems around AWS S3. In doing so I got to share a bit of the talk proposal I
submitted to <a href="https://dataaisummit.com">Data and AI Summit</a> this year about
real-life work that has made it into production.</p>

<p>For years the conventional wisdom around <a href="https://delta.io">Delta Lake</a> has
been to <strong>not</strong> connect user-facing/online systems to Delta tables. Basically,
don’t point your Django app at your Delta tables. This continues to be a decent
<em>guideline</em> but definitely <strong>not a rule</strong>, and I have the performance data to
back that up.</p>

<p>My talk abstract:</p>

<blockquote>
  <p>Scribd hosts hundreds of millions of documents and has hundreds of billions of
objects across our buckets. Combining large-language models with massive
amounts of text has required investment in our new Content Library
architecture.  We selected Delta Lake as the underlying storage technology but
have pushed it to an extreme. Using the same Delta Lake architecture we offer
both direct data access for data scientists in Databricks Notebooks and online
data retrieval in milliseconds for user-facing web services.</p>

  <p>In this talk we will review principles of performance for each layer of the
stack: web APIs, the Delta Lake tables, Apache Parquet, and AWS S3.</p>
</blockquote>

<p>The work my colleague Eugene and I have done in this area builds heavily on
my previous research around <a href="/2025/06/24/low-latency-parquet.html">Low latency Parquet
reads</a>, which informed the work named <a href="https://tech.scribd.com/blog/2026/content-crush.html">Content
Crush</a> that I have
explored further on the Scribd tech blog and on the <a href="/2026/02/13/screaming-in-the-cloud.html">Screaming in the
Cloud</a> podcast.</p>

<p>I really hope that I am able to share results at Data and AI Summit from this
incredibly challenging work that I am undertaking. But even if I don’t, blog
posts like my musings on <a href="/2026/01/19/multimodal-delta-lake.html">Multimodal with Delta
Lake</a>, <a href="https://www.buoyantdata.com/blog/2024-12-31-high-concurrency-logstore.html">scaling streaming Delta Lake
applications</a>,
and a myriad of other articles I have published can be pieced together to form
the larger mosaic of insane large-scale data work I have been hammering on!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="arrow" /><category term="parquet" /><category term="deltalake" /><category term="databricks" /><category term="scribd" /><summary type="html"><![CDATA[I had a chat today about building large scale low-latency data retrieval systems around AWS S3. In doing so I got to share a bit of the talk proposal I submitted to Data and AI Summit this year about real-live work that has made it into production.]]></summary></entry><entry><title type="html">Low latency Parquet reads</title><link href="https://brokenco.de//2025/06/24/low-latency-parquet.html" rel="alternate" type="text/html" title="Low latency Parquet reads" /><published>2025-06-24T00:00:00+00:00</published><updated>2025-06-24T00:00:00+00:00</updated><id>https://brokenco.de//2025/06/24/low-latency-parquet</id><content type="html" xml:base="https://brokenco.de//2025/06/24/low-latency-parquet.html"><![CDATA[<p>The Apache Parquet file format has become the de facto standard for large data
systems but increasingly I find that most data engineers are not aware of <em>why</em>
it has become so popular. The format is <em>especially</em> interesting when taken
together with most cloud-based object storage systems, where some design
decisions allow for subsecond or millisecond latencies for Parquet readers.</p>

<p>In the cloud computing environment: <strong>efficiency wins</strong>. Hyperscalers make
money from renting you resources by the unit of time; the fewer resources and less
time your workload requires, the lower the cost. A <a href="https://aws.amazon.com/lambda/">Lambda
function</a> which runs in 1 second compared to 5
seconds is going to cost 80% less. At small scales this is often
inconsequential but with sufficient volume it makes a big difference. For
example, at 1 invocation per second the longer function costs ~$431/month
compared to ~$81/month.</p>

<p>I have been working on a project exploring new and novel use-cases for <a href="https://parquet.apache.org">Apache
Parquet</a>, the file format which underpins the
<a href="https://delta.io">Delta Lake</a> storage protocol. 
My work uses <code class="language-plaintext highlighter-rouge">.parquet</code> files smaller than 50MB in size and ultimately
<em>latency</em> is the biggest concern. When retrieving data from any data
service there is always a fixed cost of overhead regardless of the data
transferred. Retrieving a 1MB object or a 1GB object still requires locating
and loading the data from storage, validating authentication
credentials/headers, and then constructing a request stream.</p>

<p>Working in this domain I have discussed challenges with <a href="https://github.com/alamb">Andrew
Lamb</a> who has been doing similarly interesting
explorations at InfluxData. His work builds on what he and Raphael outlined in their 2022 post:
<a href="https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/"><strong>Querying Parquet with Millisecond
Latency</strong></a></p>

<p><em>Meanwhile</em> Databricks also <a href="https://thenewstack.io/lakebase-is-databricks-fully-managed-postgres-database-for-the-ai-era/">released
Lakebase</a>,
which I am confident is also utilizing Apache Parquet for similar retrieval
patterns for their PostgreSQL engine.</p>

<p>Somewhere way down the data stack we are all trying to squeeze as much out of
Parquet and S3 as possible.</p>

<hr />

<p>Because of my work on the
<a href="https://github.com/delta-io/delta-rs">delta-rs</a> project, I am quite familiar
with the <a href="https://github.com/apache/parquet-format?tab=readme-ov-file#file-format">Parquet file
format</a>
and the ways in which it can be read and written. 
I need to read <code class="language-plaintext highlighter-rouge">.parquet</code> files in an extremely low-latency
environment with worst-case performance around the 100ms mark. I picked up two
foundational dependencies of delta-rs: the
<a href="https://crates.io/crates/parquet">parquet</a> and
<a href="https://crates.io/crates/object_store">object_store</a> crates, and dove into the <strong>Parquet file format</strong>:</p>

<p><img src="/images/post-images/2025-05-parquet/parquet-format.gif" alt="Parquet File Format" /></p>

<p>The <code class="language-plaintext highlighter-rouge">.parquet</code> file has a “footer” which contains practically all the useful
metadata for understanding the file, with the last eight bytes holding the
footer length and the <code class="language-plaintext highlighter-rouge">PAR1</code> magic marker. This is largely useless trivia until you learn that most
object stores like AWS S3 allow for <code class="language-plaintext highlighter-rouge">Range</code> headers on the
<a href="https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html#API_GetObject_RequestSyntax">GetObject</a>
call with a <em>negative byte range</em>. For a large <code class="language-plaintext highlighter-rouge">.parquet</code> file you can retrieve
the last eight bytes with <code class="language-plaintext highlighter-rouge">Range: bytes=-8</code>, which tells you the footer length, then
fetch the footer itself with <code class="language-plaintext highlighter-rouge">Range: bytes=-&lt;footer length&gt;</code>, and at that point you
understand practically everything about the file! Those <code class="language-plaintext highlighter-rouge">Range</code> requests
even allow you to fetch individual row groups, a <em>hugely</em> beneficial
performance optimization when working with large <code class="language-plaintext highlighter-rouge">.parquet</code> files.</p>
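
<p>To make that concrete, here is a rough sketch of those two suffix-range requests
using the <code class="language-plaintext highlighter-rouge">object_store</code> crate (assuming a version with <code class="language-plaintext highlighter-rouge">GetRange::Suffix</code>; this is
illustrative rather than production code):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use object_store::{path::Path, GetOptions, GetRange, ObjectStore};

// Hypothetical helper: read the Parquet footer using only suffix Range requests.
async fn read_footer(
    store: &amp;dyn ObjectStore,
    location: &amp;Path,
) -&gt; object_store::Result&lt;Vec&lt;u8&gt;&gt; {
    // Request 1: the final eight bytes, a 4-byte little-endian footer length
    // followed by the 4-byte "PAR1" magic marker.
    let tail = store
        .get_opts(
            location,
            GetOptions { range: Some(GetRange::Suffix(8)), ..Default::default() },
        )
        .await?
        .bytes()
        .await?;
    let footer_len = u32::from_le_bytes(tail[0..4].try_into().unwrap()) as u64;

    // Request 2: the Thrift-encoded footer metadata plus those trailing eight bytes.
    let footer = store
        .get_opts(
            location,
            GetOptions { range: Some(GetRange::Suffix(footer_len + 8)), ..Default::default() },
        )
        .await?
        .bytes()
        .await?;
    Ok(footer.to_vec())
}
</code></pre></div></div>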

<p>Fortunately for everybody, this is <em>exactly</em> what
<a href="https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.ParquetObjectReader.html">ParquetObjectReader</a>
does! From the perspective of the underlying <code class="language-plaintext highlighter-rouge">ObjectStore</code> implementation the call flow is:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">get_opts(Range(-8))</code></li>
  <li><code class="language-plaintext highlighter-rouge">get_opts(Range(-&lt;footerlen&gt;))</code></li>
  <li><code class="language-plaintext highlighter-rouge">get_ranges(*row-groups)</code></li>
</ul>
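
<p>In practice, reading straight out of object storage looks something like the
sketch below (constructor signatures have shifted between <code class="language-plaintext highlighter-rouge">parquet</code> crate versions,
so treat this as illustrative):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::sync::Arc;

use futures::TryStreamExt;
use object_store::{path::Path, ObjectStore};
use parquet::arrow::async_reader::{ParquetObjectReader, ParquetRecordBatchStreamBuilder};

// Stream record batches out of object storage, letting the reader issue the
// suffix/range requests described above.
async fn read_batches(
    store: Arc&lt;dyn ObjectStore&gt;,
    location: Path,
) -&gt; Result&lt;(), Box&lt;dyn std::error::Error&gt;&gt; {
    // Older parquet crate versions want an ObjectMeta up front; newer ones can
    // take the path directly and discover the size with a suffix request.
    let meta = store.head(&amp;location).await?;
    let reader = ParquetObjectReader::new(store, meta);

    let stream = ParquetRecordBatchStreamBuilder::new(reader).await?.build()?;
    let batches: Vec&lt;_&gt; = stream.try_collect().await?;
    println!("read {} batches", batches.len());
    Ok(())
}
</code></pre></div></div>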

<p>For large <code class="language-plaintext highlighter-rouge">.parquet</code> files, hundreds of MBs or GBs, this approach works very
well for most processing engines, where having less data to deserialize and process
means tangible performance gains. In fact, I have it on good authority that
this approach is how the Databricks Photon engine’s <a href="https://docs.databricks.com/aws/en/optimizations/predictive-io">predictive
I/O</a> squeezes
even more query performance out of Apache Parquet.</p>

<p>For me, however, each request to S3 in the list above carries roughly 30ms of
overhead and they <em>must</em> be executed sequentially, which means three requests have
a <em>worst-case</em> latency of 90ms.</p>

<p>Hinting at a rough approximation of the footer size can eliminate one of the two
metadata calls, bringing the worst case down to 60ms. Accessing relevant data in
under 70-80ms is <em>good</em> but not great.</p>
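
<p>If the <code class="language-plaintext highlighter-rouge">parquet</code> crate in use exposes a footer size hint on the reader (which I
believe recent versions do), that hint looks roughly like this:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::sync::Arc;

use object_store::{path::Path, ObjectStore};
use parquet::arrow::async_reader::ParquetObjectReader;

// Sketch: over-estimate the footer size so the length probe and the metadata
// fetch can collapse into a single suffix request.
async fn hinted_reader(
    store: Arc&lt;dyn ObjectStore&gt;,
    location: &amp;Path,
) -&gt; Result&lt;ParquetObjectReader, Box&lt;dyn std::error::Error&gt;&gt; {
    let meta = store.head(location).await?;
    // 64 KiB is a guess; anything comfortably larger than the real footer works.
    Ok(ParquetObjectReader::new(store, meta).with_footer_size_hint(64 * 1024))
}
</code></pre></div></div>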

<p>Andrew and Raphael’s blog post <a href="https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/"><strong>Querying Parquet with Millisecond
Latency</strong></a>
is <em>full</em> of useful approaches for reducing query and processing time. At some
point however you hit the wall of fundamental performance overhead of the
object store itself.</p>

<p>I have hit that wall.</p>

<p>The options available in front of me are:</p>

<ol>
  <li>consider novel data structures <em>inside</em> the Parquet file</li>
  <li>secondary indices outside of the Parquet file</li>
  <li>aggressive caching strategies</li>
</ol>

<p>I’m not thrilled with <em>any</em> of them, though I have already utilized hacks from
#1 with Parquet data layout changes.</p>

<p>As frustrating as a problem that may genuinely be unsolvable can be, it has
been a lot of fun discussing strategies with folks at cloud providers, other
companies, and the open source community on how to squeeze every last bit of
performance out of Apache Parquet and cloud storage.</p>

<p>I might have to make peace with 60ms of latency, but not just yet.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="arrow" /><category term="parquet" /><category term="rust" /><summary type="html"><![CDATA[The Apache Parquet file format has become the de facto standard for large data systems but increasingly I find that most data engineers are not aware of why it has become so popular. The format is interesting especially when taken together with most cloud-based object storage systems, where some design decisions allow for subsecond or millisecond latencies for parquet readers.]]></summary></entry></feed>