<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://brokenco.de//feed/by_tag/ml.xml" rel="self" type="application/atom+xml" /><link href="https://brokenco.de//" rel="alternate" type="text/html" /><updated>2026-05-03T00:12:50+00:00</updated><id>https://brokenco.de//feed/by_tag/ml.xml</id><title type="html">rtyler</title><subtitle>a moderately technical blog</subtitle><author><name>R. Tyler Croy</name></author><entry><title type="html">Multimodal with Delta Lake</title><link href="https://brokenco.de//2026/01/19/multimodal-delta-lake.html" rel="alternate" type="text/html" title="Multimodal with Delta Lake" /><published>2026-01-19T00:00:00+00:00</published><updated>2026-01-19T00:00:00+00:00</updated><id>https://brokenco.de//2026/01/19/multimodal-delta-lake</id><content type="html" xml:base="https://brokenco.de//2026/01/19/multimodal-delta-lake.html"><![CDATA[<p>The rate of change for data storage systems has accelerated to a frenzied pace
and most storage architectures I have seen simply cannot keep up. Much of my
time is spent thinking about large-scale tabular data stored in <a href="https://delta.io">Delta
Lake</a> which is one of the “lakehouse” storage systems along
with <a href="https://iceberg.apache.org">Apache Iceberg</a> and others. These storage
architectures were developed 5-10 years ago to solve the problems many
organizations faced when moving from data warehouse architectures to
massive-scale structured data. The storage changes we need today must support
“multimodal data” which is a dramatic departure in many ways from the
traditional query and usage patterns our existing infrastructure supports.</p>

<blockquote>
  <p>Multimodal learning is a type of deep learning that integrates and processes
multiple types of data, referred to as modalities, such as text, audio, images,
or video. This integration allows for a more holistic understanding of complex
data, improving model performance in tasks like visual question answering,
cross-modal retrieval, text-to-image generation, aesthetic ranking,
and image captioning.</p>

  <p><a href="https://en.wikipedia.org/wiki/Multimodal_learning">From Wikipedia</a></p>
</blockquote>

<p>Honestly, I have been working on this problem for longer than I knew that it
had a name!</p>

<p>Working on <a href="https://tech.scribd.com/blog/2026/content-crush.html">Content
Crush</a> at Scribd I have
had to negotiate an ever-present challenge: how do we make multimodal data
work seamlessly with our classic tabular datasets?</p>

<p>A couple of the ideas that I have been thinking about revolve around one
principle: <strong>re-encoding of existing data is unacceptable.</strong> In the past I have
considered simply encoding binary data such as that from images or PDFs into
<a href="https://parquet.apache.org">Apache Parquet</a>. This approach suffers from a couple of major flaws:</p>

<ul>
  <li>Re-encoding requires substantial computation for any non-trivial set of images, PDFs, video, etc.</li>
  <li>Redundant object storage: even with compression, it is unlikely that any
organization which has terabytes or petabytes of image data will want to
store a secondary copy of it for their multimodal needs.</li>
  <li>Embedding a 1MB PDF file inside of a Parquet file is <em>not silly</em> but
embedding a 10GB video file inside of a Parquet file is <em>very silly</em>. Any
approach taken should scale in a reasonable fashion for data in the gigabyte
to terabyte range.</li>
</ul>

<p>A secondary objective in my thinking has been to avoid needing substantial
client changes for working with multimodal data. I recently watched <a href="https://www.youtube.com/watch?v=YmY_NwaoxNk">a talk by
Ryan Johnson</a> about adding
transactional semantics to Delta Lake and one of the big takeaways that I
heard from him was about the troublesome nature of ensuring <em>all actors</em> in the
system cooperated with the transaction semantics. In a modern data environment
that could be <em>dozens</em> of different off-the-shelf libraries, Databricks
notebooks, AWS SageMaker transforms, and so on. The less “exposure” to the
client layer the better.</p>

<h2 id="parquet-anchors">Parquet Anchors</h2>

<p>The first idea that I had was “Parquet Anchors” which would be built on <a href="https://parquet.apache.org/docs/file-format/binaryprotocolextensions/">Binary
Protocol
Extensions</a>
in Apache Parquet. In most cases the rich text/image/video data is already
stored in object storage such as AWS S3 and a URL should be sufficient to
retrieve that data.</p>

<p>The extension of the binary protocol, as I understand it, would allow custom
information to be encoded in the Parquet files that are being written as part
of an existing Delta Table. The specific mechanism of encoding this data is
somewhat irrelevant so long as it can carry:</p>

<ul>
  <li>Artifact name (e.g. <code class="language-plaintext highlighter-rouge">some.pdf</code>)</li>
  <li>Artifact URL (<code class="language-plaintext highlighter-rouge">s3://bucket/prefix/of/keys/some-10x9u09123.pdf</code>)</li>
  <li>Artifact length (number of bytes)</li>
  <li>Artifact content type (e.g. <code class="language-plaintext highlighter-rouge">application/pdf</code>)</li>
  <li>Checksum</li>
  <li>Checksum Algorithm</li>
</ul>
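
<p>To make that list concrete, here is a minimal Python sketch of what a single
anchor record might carry. The <code class="language-plaintext highlighter-rouge">ParquetAnchor</code> class, its field names, and the
<code class="language-plaintext highlighter-rouge">make_anchor</code> helper are hypothetical; nothing here is part of Parquet or Delta
Lake, and it says nothing about how the bytes would be laid out in the extension.</p>

<pre><code class="language-python">import hashlib
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ParquetAnchor:
    """Hypothetical anchor record; field names are illustrative only."""
    name: str            # artifact name, e.g. some.pdf
    url: str             # where the artifact already lives
    length: int          # number of bytes
    content_type: str    # e.g. application/pdf
    checksum: str        # hex digest of the artifact bytes
    checksum_algo: str   # name of the hashing algorithm


def make_anchor(name, url, content_type, payload, algo="sha256"):
    """Describe an artifact without copying its bytes anywhere."""
    digest = hashlib.new(algo, payload).hexdigest()
    return ParquetAnchor(name, url, len(payload), content_type, digest, algo)


anchor = make_anchor(
    "some.pdf",
    "s3://bucket/prefix/of/keys/some-10x9u09123.pdf",
    "application/pdf",
    b"%PDF-1.7 example bytes",
)
print(asdict(anchor))
</code></pre>

<p>The point of the sketch is that the anchor stays tiny regardless of whether the
artifact is a 1MB PDF or a 10GB video.</p>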

<h3 id="pros">Pros</h3>
<p>The most obvious benefit of going down this route is the ease with which one
could update existing data files <em>and</em> this note from the Binary Protocol
Extensions document:</p>

<blockquote>
  <p><em>Existing readers will ignore the extension bytes with little processing overhead</em></p>
</blockquote>

<p>Logically Parquet Anchors could be quite simple to implement, and <em>most</em>
users of a Delta table with Parquet Anchors would never know they were there.</p>

<h3 id="cons">Cons</h3>

<p>The natural downside of this feature being hidden from existing readers is that
clients must be updated in order to read the extension data properly. For
something like processing multimodal data where a row of content metadata
might refer to <code class="language-plaintext highlighter-rouge">some.pdf</code> this would mean the reader would have to have some
indication that it must:</p>

<ol>
  <li>Read the extended binary information</li>
  <li><em>Then</em> fetch the necessary artifacts</li>
</ol>
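
<p>A rough Python sketch of the second step, assuming the first step has already
produced an anchor as a plain dictionary. The <code class="language-plaintext highlighter-rouge">resolve_anchor</code> function and the
dictionary keys are hypothetical, and an in-memory dictionary stands in for
object storage so the example runs without a network.</p>

<pre><code class="language-python">import hashlib


def resolve_anchor(anchor, fetch):
    """Step 2: fetch the artifact an anchor points at and verify it.

    anchor is a dict with url, length, checksum and checksum_algo keys;
    fetch is any callable that turns a URL into bytes.
    """
    payload = fetch(anchor["url"])
    if len(payload) != anchor["length"]:
        raise ValueError("length mismatch for " + anchor["url"])
    digest = hashlib.new(anchor["checksum_algo"], payload).hexdigest()
    if digest != anchor["checksum"]:
        raise ValueError("checksum mismatch for " + anchor["url"])
    return payload


# Stand-in for object storage so the sketch runs offline.
FAKE_BUCKET = {"s3://bucket/some.pdf": b"not really a pdf"}

# Step 1 would read this from the Parquet extension bytes; here it is hard-coded.
anchor = {
    "url": "s3://bucket/some.pdf",
    "length": 16,
    "checksum": hashlib.sha256(b"not really a pdf").hexdigest(),
    "checksum_algo": "sha256",
}

payload = resolve_anchor(anchor, FAKE_BUCKET.__getitem__)
print(len(payload))
</code></pre>

<p>Every client that wants the artifact would need logic along these lines, which
is exactly the kind of client-side exposure I would like to avoid.</p>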

<p>There is another downside to this approach in that a table would need to be
“rewritten” but only <em>partially</em>. If a Parquet file added to the Delta table
references 1000 artifacts, then that <code class="language-plaintext highlighter-rouge">.parquet</code> file would need to be rewritten
to include the Parquet Anchors for those 1000 artifacts alongside that file’s
<code class="language-plaintext highlighter-rouge">add</code> action. In essence I think this approach would require a full-table
rewrite where each <code class="language-plaintext highlighter-rouge">.parquet</code> in the transaction log would be retrieved,
processed, and rewritten with the appropriate Anchors.</p>

<p>Considering ways to address the shortcomings of Parquet Anchors I came up with
my next concept.</p>

<h2 id="virtual-delta-tables-vdt">Virtual Delta Tables (vdt)</h2>

<p>The notion of Parquet Anchors is, I think, useful to hold onto: hyperlinks to
existing artifacts are a key part of the multimodal data storage solution, but
perhaps not as a direct encoding into the Parquet data files. Considering the
shortcomings led me to think of how to present a virtual Delta table “view” to
existing clients while hiding the disparate nature of the data behind the
scenes.</p>

<p>One underutilized feature of the Delta Lake protocol is the use of URLs in the
<code class="language-plaintext highlighter-rouge">add</code> actions which enables functionality like <a href="https://delta.io/blog/delta-lake-clone/">shallow
clones</a>. I have long thought of this
as a super power that should really be used more.</p>

<h3 id="vdt0-just-the-artifacts">vdt0: just the artifacts</h3>

<p>The magic of the URL support in the Delta protocol is that the URLs don’t even
have to point to object storage. Nothing about the protocol dictates that the
URLs must point to <code class="language-plaintext highlighter-rouge">s3://</code> or <code class="language-plaintext highlighter-rouge">abfss://</code> URLs, you can just point to <code class="language-plaintext highlighter-rouge">https://</code>
URLs. AWS S3 supports <code class="language-plaintext highlighter-rouge">https://</code> URLs, but so does <em>every other web service</em>.</p>

<p>Imagine a storage architecture which already contains heaps of <code class="language-plaintext highlighter-rouge">.pdf</code>
artifacts. A <code class="language-plaintext highlighter-rouge">vdt</code> web service could provide a read-only URL structure which
maps the existing object storage structure into a Delta Lake URL scheme.</p>

<p>A virtual table with just those PDF artifacts could be configured at
<code class="language-plaintext highlighter-rouge">https://vdt.aws/v1/&lt;catalog&gt;/&lt;schema&gt;/&lt;table&gt;</code>. Using tooling like
<a href="https://github.com/s3s-project/s3s">s3s</a>, <code class="language-plaintext highlighter-rouge">vdt</code> can provide S3-like operations
off of this virtual URL, exposing a virtualized JSON transaction log or
checkpoints for the Delta client.</p>
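
<p>As a sketch of what that URL structure might look like, the Python below builds
the table root, a commit file URL, and a data file URL. The
<code class="language-plaintext highlighter-rouge">datafiles/</code> layout and the function names are assumptions for illustration;
the only part taken from the Delta protocol is the zero-padded, 20-digit commit
file name under <code class="language-plaintext highlighter-rouge">_delta_log</code>.</p>

<pre><code class="language-python">from urllib.parse import quote

BASE = "https://vdt.aws/v1"  # hostname taken from the example above


def table_root(catalog, schema, table):
    """Root URL a Delta client would be pointed at."""
    parts = [quote(p, safe="") for p in (catalog, schema, table)]
    return BASE + "/" + "/".join(parts)


def commit_url(root, version):
    # Delta commit files are named with a zero-padded 20 digit version.
    return "{}/_delta_log/{:020d}.json".format(root, version)


def data_file_url(root, artifact_id):
    # Hypothetical layout: one imagined Parquet file per artifact.
    return "{}/datafiles/{}.parquet".format(root, artifact_id)


root = table_root("main", "content", "pdfs")
print(commit_url(root, 0))
print(data_file_url(root, "some-guid"))
</code></pre>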

<p>Imagine the schema of such a virtual table for PDF artifacts:</p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>filename</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>content_type</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>url</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>filesize</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>data</td>
      <td><code class="language-plaintext highlighter-rouge">binary</code></td>
    </tr>
    <tr>
      <td>checksum</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>checksum_algo</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
  </tbody>
</table>

<p>The virtualized transaction log is where the real fun can begin. If information
about the artifacts can be sourced from an existing database, then the
virtualized transaction log could contain numerous <em>imagined</em> parquet files as
the <code class="language-plaintext highlighter-rouge">add</code> actions:</p>

<pre><code class="language-JSON">{
  "add": {
    "path": "datafiles/some-guid.parquet",
    "size": 841454,
    "modificationTime": 1512909768000,
    "dataChange": true,
    "stats": "{\"numRecords\":1,\"minValues\":{\"val..."
  }
}
</code></pre>

<p>The special path for the <code class="language-plaintext highlighter-rouge">some-guid.parquet</code> would perform <strong>on-demand</strong>
parquet encoding for the underlying artifacts.  The most primitive
implementation could simply represent <em>each</em> PDF file as a <code class="language-plaintext highlighter-rouge">.parquet</code> file with
an <code class="language-plaintext highlighter-rouge">add</code> action. So long as the <code class="language-plaintext highlighter-rouge">add</code> action conveyed the necessary file
statistics to allow the consuming engine to filter out files which are not
necessary, this could be a seamless way to expose structured PDF data to the
consumer. The <code class="language-plaintext highlighter-rouge">path</code> in the action could <em>also</em> refer to an already cached
version of the encoded file in S3 using the existing URL support in the
protocol. In this way encoded files could be progressively cached on the
server side as needed.</p>
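
<p>Here is a minimal Python sketch of the most primitive version: one imagined
Parquet file per artifact, generated from rows in an existing metadata database.
The row layout, the <code class="language-plaintext highlighter-rouge">virtual_add_action</code> helper, and the use of the raw
artifact size as a stand-in for the encoded file size are all assumptions. A
real log would also need <code class="language-plaintext highlighter-rouge">protocol</code> and <code class="language-plaintext highlighter-rouge">metaData</code> actions, which are left
out here.</p>

<pre><code class="language-python">import json

# Rows as they might come from an existing metadata database (made-up values).
ARTIFACTS = [
    {"id": 1, "filename": "a.pdf", "filesize": 1024, "modified_ms": 1512909768000},
    {"id": 2, "filename": "b.pdf", "filesize": 2048, "modified_ms": 1512909769000},
]


def virtual_add_action(row):
    """Build an add action for a Parquet file that does not exist yet."""
    stats = {
        "numRecords": 1,
        "minValues": {"id": row["id"], "filename": row["filename"]},
        "maxValues": {"id": row["id"], "filename": row["filename"]},
        "nullCount": {"id": 0, "filename": 0},
    }
    return {
        "add": {
            "path": "datafiles/{}.parquet".format(row["id"]),
            "partitionValues": {},
            # The encoded size is unknown until the file is built; the raw
            # artifact size is used here as a placeholder assumption.
            "size": row["filesize"],
            "modificationTime": row["modified_ms"],
            "dataChange": True,
            # The protocol stores stats as a JSON string, not a nested object.
            "stats": json.dumps(stats),
        }
    }


def virtual_commit(rows):
    """One JSON document per line, like a commit file in _delta_log."""
    return "\n".join(json.dumps(virtual_add_action(r)) for r in rows)


print(virtual_commit(ARTIFACTS))
</code></pre>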

<hr />

<p><strong>Brief aside</strong>: I have never fully understood why <a href="https://delta.io/sharing/">Delta
sharing</a> exists as a separate entity. In my opinion
the Delta Lake protocol coupled with a clever server-side backend could provide
identical functionality for all existing Delta implementations.</p>

<hr />

<p>Assuming the <code class="language-plaintext highlighter-rouge">vdt</code> service supports the schema defined above and can properly
retrieve the PDF artifacts and encode them as Parquet data on the fly, a query
such as <code class="language-plaintext highlighter-rouge">SELECT filename, data FROM vdt WHERE filename = $?</code> should work from any existing Delta client.</p>

<h3 id="pros-1">Pros</h3>

<p>Breaking the pretense of “objects must actually exist” with Delta Lake is very
liberating.  Encoding artifacts on demand in Apache Parquet would mean all
client-side libraries should be able to seamlessly work within their existing
environments.</p>

<p>When I think about potential approaches for implementing <code class="language-plaintext highlighter-rouge">vdt0</code> I can also
imagine many different potential avenues for optimization.</p>

<h3 id="cons-1">Cons</h3>

<p>While I really do like this idea, I’m not sure <em>how much</em> I should like it
considering the potential downsides:</p>

<ul>
  <li>Requires some existing structure behind the scenes to build up a sensible
virtual Delta log. For situations where artifacts are simply in a dumb bucket
somewhere, with no metadata already stored in a relational database,
producing a virtual transaction log would be quite difficult.</li>
  <li>I cannot imagine a sensible path for <strong>write</strong> workloads with <code class="language-plaintext highlighter-rouge">vdt0</code>.</li>
  <li>Without having implemented this (yet!) it is unclear how much compute time would be expended on uncached Parquet file encoding.</li>
  <li>Most data scientists want the PDF/image/etc but they don’t <em>typically</em> want
the raw bytes that they then have to parse through.</li>
</ul>

<hr />

<h2 id="uh-what-if-you-just-dont-use-delta-lake">Uh, what if you just don’t use Delta Lake?</h2>

<p>Hey good question. Great interlude opportunity!</p>

<p>As a seller of fine hammers and hammer accessories, everything does in fact
look like a nail.</p>

<p>Delta Lake is kind of a means to an end for me here. I think its protocol has
enough maturity in terms of features and client capabilities to provide
<em>almost</em> everything I need from a multimodal storage system. I just can’t/don’t
want to shove everything into a Delta table per se.</p>

<hr />

<h2 id="vdt1-adding-virtual-legs">vdt1: adding virtual legs</h2>

<p>Since I have already indulged in the heretical idea of “what if we just make
the files up” I went a level further to consider <em>what if we got even more
virtualized</em>. One key characteristic I dislike with the <code class="language-plaintext highlighter-rouge">vdt0</code> approach is that
it is <em>too simple</em>, believe it or not.</p>

<p>When I think about artifacts like PDFs, they have far more structure than just
bytes. There are pages, typically sections, text, images, titles, footnotes,
and so on. For most machine learning use-cases the data scientist may be
interested in raw bytes for some projects but much more often they are
interested in the <em>parsed</em> and <em>structured</em> data of the artifact.</p>

<p>While my expertise is largely around text-based storage and processing, I would
imagine image/audio/video artifacts also have similar structure of interest to
data scientists.</p>

<p>Indulging in even more virtual-thinking I started to think about collections of
data all associated with an artifact. There’s the raw data schema above, but for PDFs I can also envision:</p>

<p><strong>Paragraphs</strong></p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>page</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>offset</td>
      <td><code class="language-plaintext highlighter-rouge">integer</code></td>
    </tr>
    <tr>
      <td>text</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>is_heading</td>
      <td><code class="language-plaintext highlighter-rouge">bool</code></td>
    </tr>
    <tr>
      <td>heading_level</td>
      <td><code class="language-plaintext highlighter-rouge">integer</code></td>
    </tr>
  </tbody>
</table>

<p><strong>Images</strong></p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>content_type</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>page</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>data</td>
      <td><code class="language-plaintext highlighter-rouge">binary</code></td>
    </tr>
    <tr>
      <td>bounds_x</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>bounds_y</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
  </tbody>
</table>

<p><strong>Links</strong></p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Datatype</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>id</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>page</td>
      <td><code class="language-plaintext highlighter-rouge">long</code></td>
    </tr>
    <tr>
      <td>href</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
    <tr>
      <td>label</td>
      <td><code class="language-plaintext highlighter-rouge">string</code></td>
    </tr>
  </tbody>
</table>

<p>Taken all together, with a single shared <code class="language-plaintext highlighter-rouge">id</code>, this only represents <em>21 columns</em> of data but could
represent <strong>most</strong> of the information needed for most multimodal workloads. I
mention the low column count because I have seen bug reports from Delta Lake
users talking about issues with tables containing <em>thousands of columns</em>.</p>

<p>A virtualized table schema could take these interior schemas and join them
together such that a single row might have: <code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">raw_filename</code>,
<code class="language-plaintext highlighter-rouge">raw_content_type</code>, <code class="language-plaintext highlighter-rouge">raw_url</code>, <code class="language-plaintext highlighter-rouge">raw_filesize</code>, <code class="language-plaintext highlighter-rouge">raw_data</code>, <code class="language-plaintext highlighter-rouge">raw_checksum</code>,
<code class="language-plaintext highlighter-rouge">raw_checksum_algo</code>, <code class="language-plaintext highlighter-rouge">paragraph_page</code>, <code class="language-plaintext highlighter-rouge">paragraph_text</code>, <code class="language-plaintext highlighter-rouge">paragraph_offset</code>,
<code class="language-plaintext highlighter-rouge">paragraph_is_heading</code>, <code class="language-plaintext highlighter-rouge">paragraph_heading_level</code>, <code class="language-plaintext highlighter-rouge">image_content_type</code>,
<code class="language-plaintext highlighter-rouge">image_page</code>, <code class="language-plaintext highlighter-rouge">image_data</code>, <code class="language-plaintext highlighter-rouge">image_bounds_x</code>, <code class="language-plaintext highlighter-rouge">image_bounds_y</code>, <code class="language-plaintext highlighter-rouge">link_page</code>,
<code class="language-plaintext highlighter-rouge">link_href</code>, <code class="language-plaintext highlighter-rouge">link_label</code>.</p>
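
<p>The joined column list can be derived mechanically from the interior schemas.
The Python sketch below does that by prefixing each column with its group name
and sharing a single <code class="language-plaintext highlighter-rouge">id</code>. The <code class="language-plaintext highlighter-rouge">joined_columns</code> helper is hypothetical,
and the column order follows the tables rather than the prose.</p>

<pre><code class="language-python"># Interior schemas copied from the tables above.
RAW = ["id", "filename", "content_type", "url", "filesize", "data",
       "checksum", "checksum_algo"]
PARAGRAPH = ["id", "page", "offset", "text", "is_heading", "heading_level"]
IMAGE = ["id", "content_type", "page", "data", "bounds_x", "bounds_y"]
LINK = ["id", "page", "href", "label"]


def joined_columns(groups):
    """Prefix every column with its group name, sharing a single id."""
    columns = ["id"]
    for prefix, names in groups.items():
        columns.extend(prefix + "_" + n for n in names if n != "id")
    return columns


COLUMNS = joined_columns(
    {"raw": RAW, "paragraph": PARAGRAPH, "image": IMAGE, "link": LINK}
)
print(len(COLUMNS))
print(COLUMNS)
</code></pre>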

<p>So long as the schema allows nullable columns for everything but <code class="language-plaintext highlighter-rouge">id</code>, the
<code class="language-plaintext highlighter-rouge">vdt</code> service can expose the disjointed data behind the scenes in a sensible
way with the <code class="language-plaintext highlighter-rouge">add</code> actions on the virtual Delta table and its file statistics.
For example an <code class="language-plaintext highlighter-rouge">add</code> action which includes <code class="language-plaintext highlighter-rouge">link</code> data would list all other
columns as entirely null within the file statistics <code class="language-plaintext highlighter-rouge">nullCount</code> such that any engine
querying for <code class="language-plaintext highlighter-rouge">raw</code> columns would ignore that file entirely.</p>
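
<p>A small Python sketch of that skipping decision, using the per-file null
counts (the Delta protocol names this statistics field <code class="language-plaintext highlighter-rouge">nullCount</code>). The
helper names and the sample values are hypothetical, and the rule shown only
applies to columns that an AND-ed <code class="language-plaintext highlighter-rouge">WHERE</code> clause compares against non-null
values; real engines have their own, more complete skipping logic.</p>

<pre><code class="language-python">import json


def file_may_match(add_action, predicate_columns):
    """Return False when stats prove a predicate column is entirely null."""
    stats = json.loads(add_action["stats"])
    records = stats["numRecords"]
    null_count = stats.get("nullCount", {})
    # A column missing from nullCount is unknown, so the file must be kept.
    return all(null_count.get(col) != records for col in predicate_columns)


def link_only_add(path, records):
    """Hypothetical add action for a file that only carries link_* data."""
    null_count = {"id": 0, "link_page": 0, "link_href": 0, "link_label": 0,
                  "raw_filename": records, "raw_data": records}
    stats = {"numRecords": records, "nullCount": null_count}
    return {"path": path, "stats": json.dumps(stats)}


action = link_only_add("datafiles/links-0001.parquet", 250)
print(file_may_match(action, ["raw_filename"]))
print(file_may_match(action, ["link_href"]))
</code></pre>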

<h3 id="pros-2">Pros</h3>

<p>I think this structure would be possible to build in a traditional Delta Lake
system assuming one wished to re-encode data into new storage. Hiding existing
data behind a virtualized Delta table allows us to avoid data denormalization.</p>

<p>Similar to <code class="language-plaintext highlighter-rouge">vdt0</code> there are optimization and caching approaches that are
readily available with <code class="language-plaintext highlighter-rouge">vdt1</code> but unlike <code class="language-plaintext highlighter-rouge">vdt0</code> the “write path” is more
apparent to me with this approach. By hiding metadata about an artifact inside
the virtualized data structure, writes which add rows with those columns could
sensibly be accepted and inserted into an internal Delta or other table.</p>

<p>Depending on how the metadata associated with an artifact is stored, the <code class="language-plaintext highlighter-rouge">vdt</code>
service could simply front a number of other conventional Delta tables and act
as a proxy, pushing predicates and I/O filtering “to the edge” as far as they
will go before collecting results for the query engine.</p>
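
<p>One way to picture the proxy is as a planner that splits the requested virtual
columns by prefix and hands each group to the table that owns it. The Python
below is only a sketch: the table locations and the <code class="language-plaintext highlighter-rouge">plan_scan</code> function are
made up, and predicate pushdown is left out entirely.</p>

<pre><code class="language-python">from collections import defaultdict

# Hypothetical backing tables, one per column prefix.
BACKING_TABLES = {
    "raw": "s3://lake/artifacts_raw",
    "paragraph": "s3://lake/artifacts_paragraphs",
    "image": "s3://lake/artifacts_images",
    "link": "s3://lake/artifacts_links",
}


def plan_scan(columns):
    """Group requested virtual columns by the backing table that owns them."""
    plan = defaultdict(list)
    for column in columns:
        if column == "id":
            continue  # id is the join key and is read from every table touched
        prefix, _, name = column.partition("_")
        if prefix not in BACKING_TABLES:
            raise KeyError("no backing table for column " + column)
        plan[BACKING_TABLES[prefix]].append(name)
    return {table: ["id"] + names for table, names in plan.items()}


print(plan_scan(["id", "raw_filename", "link_href", "link_label"]))
</code></pre>

<p>A query that only touches <code class="language-plaintext highlighter-rouge">link</code> columns would never cause the raw artifact
table to be read at all.</p>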

<h3 id="cons-2">Cons</h3>

<p>This approach is certainly the most complex but could potentially require the least amount of re-encoding of existing data assets. The devil is in the details with how one might map existing data sources together. My sketch above places a tremendous amount of emphasis on an <code class="language-plaintext highlighter-rouge">id</code> which acts as a primary key between all the metadata associated with a singular artifact.</p>

<p>Nothing defined thus far accounts for potential changes in an artifact or its
metadata as time goes on. If a new version of an existing document is uploaded,
the new version should likely be considered “canonical” but be <em>appended</em>
rather than <em>merged</em> with existing records. How one might sensibly model that
in a system like Delta which doesn’t support referential integrity between
datasets leads me back to the “anchors” idea from before.  That said, I’m not
sure if that’s much ado about nothing.</p>
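
<p>For what it is worth, the append-only half of that idea is simple to sketch in
Python. The <code class="language-plaintext highlighter-rouge">version</code> column is an assumption that appears nowhere in the
schemas above, and the sketch does nothing to address referential integrity
between datasets.</p>

<pre><code class="language-python">def canonical_versions(records):
    """Keep only the newest appended record for each artifact id."""
    latest = {}
    for record in records:
        seen = latest.get(record["id"])
        if seen is None:
            latest[record["id"]] = record
        else:
            latest[record["id"]] = max(seen, record, key=lambda r: r["version"])
    return latest


# Appended rows; nothing is updated in place. The version column is an assumption.
ROWS = [
    {"id": 7, "version": 1, "raw_checksum": "aaa"},
    {"id": 8, "version": 1, "raw_checksum": "bbb"},
    {"id": 7, "version": 2, "raw_checksum": "ccc"},
]

for artifact_id, row in sorted(canonical_versions(ROWS).items()):
    print(artifact_id, row["version"], row["raw_checksum"])
</code></pre>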

<hr />

<p>From a data storage standpoint one key aspect of multimodal data is that the
different modalities are presented to the end user or system <strong>together</strong>. What
I like about the virtual Delta tables concept is that it doesn’t require
substantial client changes to accomplish but <em>does</em> provide a path to present
various types of data <em>together</em> for a given artifact.</p>

<p>I have various bits and pieces of a potential <code class="language-plaintext highlighter-rouge">vdt</code> system lying around the
workshop floor. If the idea has legs I might take a crack at a prototype
implementation, but first I will need some feedback!</p>

<p>Let me know what you think by emailing me at <code class="language-plaintext highlighter-rouge">rtyler@</code> this domain!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="parquet" /><category term="deltalake" /><category term="ml" /><summary type="html"><![CDATA[The rate of change for data storage systems has accelerated to a frenzied pace and most storage architectures I have seen simply cannot keep up. Much of my time is spent thinking about large-scale tabular data stored in Delta Lake which is one of the “lakehouse” storage systems along with Apache Iceberg and others. These storage architectures were developed 5-10 years ago to solve problems faced moving from data warehouse architectures to massive scale structured data needs faced by many organizations. The storage changes we need today must support “multimodal data” which is a dramatic departure in many ways from the traditional query and usage patterns our existing infrastructure supports.]]></summary></entry><entry><title type="html">Who is “R Tyler Croy”</title><link href="https://brokenco.de//2024/08/15/who-am-i.html" rel="alternate" type="text/html" title="Who is “R Tyler Croy”" /><published>2024-08-15T00:00:00+00:00</published><updated>2024-08-15T00:00:00+00:00</updated><id>https://brokenco.de//2024/08/15/who-am-i</id><content type="html" xml:base="https://brokenco.de//2024/08/15/who-am-i.html"><![CDATA[<p>I asked a large language model this question:</p>

<blockquote>
  <p>A blast from the past</p>

  <p>R. Tyler Croy is a well-known figure in the world of video games,
particularly in the realm of retro gaming and emulation. He’s often referred
to as “R. Tyler Croy” or simply “RTC” by his  fans.</p>

  <p>Tyler Croy was one of the most prominent figures behind the popular emulator,
MAME (Multiple Arcade Machine Emulator). As a member of the MAME team, he
contributed significantly to the development and maintenance of the emulator,
which allows players to emulate classic arcade games on modern computers.</p>

  <p>Throughout the late 1990s and early 2000s, Croy was an active member of the
emulation community, working closely with other developers and enthusiasts to
create and share emulators for various retro gaming platforms. His
contributions helped bring many classic games to life again, allowing gamers
to experience and enjoy their favorite childhood titles on modern hardware.</p>

  <p>Although he’s not as actively involved in the emulation scene today, R. Tyler
Croy remains a legend among retro gaming enthusiasts and is often remembered
fondly for his tireless efforts in preserving gaming history.</p>
</blockquote>

<p>I figure that should also help answer the question “should I trust a large
language model?”</p>]]></content><author><name>R. Tyler Croy</name></author><category term="ml" /><category term="opinion" /><summary type="html"><![CDATA[I asked a large language model this question:]]></summary></entry><entry><title type="html">A large language model is not a good co-pilot</title><link href="https://brokenco.de//2024/05/12/helpful-code-generation.html" rel="alternate" type="text/html" title="A large language model is not a good co-pilot" /><published>2024-05-12T00:00:00+00:00</published><updated>2024-05-12T00:00:00+00:00</updated><id>https://brokenco.de//2024/05/12/helpful-code-generation</id><content type="html" xml:base="https://brokenco.de//2024/05/12/helpful-code-generation.html"><![CDATA[<p>Large language models (LLMs) seem to only be good at two things: summarizing
text and making up bullshit. The idea that a general purpose LLM is going to
herald a new age of software development efficiency is misleading in most cases
bordering on malicious. While there are a number of other recommendations or
predictive machine learning models which <em>can</em> improve software development
efficiency, LLMs propensity to generate bullshit undermines trust in a way that
makes me question their validity at baseline as a software development tool.</p>

<p>A couple months ago I recall using Google’s Bard in an attempt to generate some Terraform boilerplate code. It so fantastically failed to do so that the top Google search result from Stack Overflow was almost always <em>functional</em> code whereas Bard consistently produced invalid code.</p>

<p>This weekend while playing around I tried out Llama 3, another LLM which was
also trained by pilfering the collective code commons of the internet, but
still cannot generate useful code. I asked it about Rust, and about <a href="https://delta.io">Delta
Lake</a>, both of which it was able to explain about as well as the first page
of their respective Wikipedia pages.</p>

<p>I then asked:</p>

<blockquote>
  <p>How do you read a delta table in rust?</p>
</blockquote>

<p>Below is a portion of its response:</p>

<blockquote>
  <p>To read a Delta Lake table in Rust, you can use the <code class="language-plaintext highlighter-rouge">delta-rs</code> crate, which is a Rust wrapper around the Apache Arrow and Parquet libraries. Here’s an example of
how to read a Delta table:</p>
</blockquote>

<p>So far so good! You do use the <a href="https://github.com/delta-io/delta-rs">delta-rs</a> crate, which is based on Apache Arrow and Parquet libraries. I’m cheating a bit since I helped create delta-rs, so this is more like a proctored exam rather than a true co-piloting experience. Llama 3 continues with some code:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">delta</span><span class="p">::</span><span class="n">DeltaTable</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">arrow</span><span class="p">::</span><span class="n">record_batch</span><span class="p">;</span>

<span class="k">async</span> <span class="k">fn</span> <span class="nf">read_delta_table</span><span class="p">()</span> <span class="p">{</span>
    <span class="c1">// Load the Delta table using the delta-rs crate</span>
    <span class="k">let</span> <span class="n">dt</span> <span class="o">=</span> <span class="nn">DeltaTable</span><span class="p">::</span><span class="nf">load</span><span class="p">(</span><span class="s">"path/to/table"</span><span class="p">)</span><span class="k">.await</span><span class="nf">.unwrap</span><span class="p">();</span>
                                                             
    <span class="c1">// Get the first record batch (i.e., a chunk of data)   </span>
    <span class="k">let</span> <span class="n">rb</span> <span class="o">=</span> <span class="n">dt</span><span class="nf">.next</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>
    
    <span class="c1">// Iterate over the records in the record batch      </span>
    <span class="k">for</span> <span class="n">row</span> <span class="k">in</span> <span class="n">rb</span><span class="nf">.iter_rows</span><span class="p">()</span> <span class="p">{</span>
        <span class="c1">// Print each record as a string</span>
        <span class="nd">println!</span><span class="p">(</span><span class="s">"{:?}"</span><span class="p">,</span> <span class="n">row</span><span class="nf">.to_string</span><span class="p">());</span>
    <span class="p">}</span>
<span class="p">}</span> 
</code></pre></div></div>

<p>This code is complete and utter nonsense. That’s not how you import the
<code class="language-plaintext highlighter-rouge">deltalake</code> library. That’s not how you use Arrow. That’s not how you open a
Delta table with Rust. There is no <code class="language-plaintext highlighter-rouge">next()</code> function on a <code class="language-plaintext highlighter-rouge">DeltaTable</code> struct.
There’s no <code class="language-plaintext highlighter-rouge">iter_rows()</code> on a <code class="language-plaintext highlighter-rouge">RecordBatch</code>.</p>

<p>Almost everything about this code is incorrect. Not a little bit of “oh this is
perhaps referring to older APIs”, flat out wrong, will not compile, never would have, fail.</p>

<p>The explanation of this code provided by the LLM contains even more gems:</p>

<blockquote>
  <p>Here’s what’s happening:</p>

  <ol>
    <li><strong><code class="language-plaintext highlighter-rouge">DeltaTable::load()</code></strong>: We load the Delta table using the <code class="language-plaintext highlighter-rouge">DeltaTable</code> struct from the <code class="language-plaintext highlighter-rouge">delta-rs</code> crate. Pass the path to the Delta table as a string.</li>
    <li><strong><code class="language-plaintext highlighter-rouge">next()</code></strong>: We get the first record batch (a chunk of data) from the loaded Delta table using the <code class="language-plaintext highlighter-rouge">next()</code> method.</li>
    <li><strong><code class="language-plaintext highlighter-rouge">iter_rows()</code></strong>: We iterate over the records in the record batch using the <code class="language-plaintext highlighter-rouge">iter_rows()</code> method, which returns an iterator over the rows.</li>
    <li><strong><code class="language-plaintext highlighter-rouge">to_string()</code></strong>: For each row, we convert it to a string using the <code class="language-plaintext highlighter-rouge">to_string()</code> method.</li>
  </ol>

  <p>This code assumes you have the <code class="language-plaintext highlighter-rouge">delta-rs</code> crate installed and the necessary dependencies (e.g., Apache Arrow and Parquet). You can install <code class="language-plaintext highlighter-rouge">delta-rs</code> via Cargo:
<code class="language-plaintext highlighter-rouge">
[dependencies]
delta = "0.10.1"
</code></p>
</blockquote>

<p>As somebody with intimate knowledge of the <code class="language-plaintext highlighter-rouge">deltalake</code> library and protocol,
this is all ha-ha funny to me. But for somebody who might be trying to learn
something new or is rushing to get some work done, these sorts of answers are
so wildly misleading as to be dangerous. I don’t expect LLMs to be able to
always produce coherent and compilable Rust code, but at no point will <em>any</em>
of these snippets lead the reader to a productive path of inquiry.</p>

<p>It seems that most of the people pushing LLMs in software development have an
ulterior motive that is <em>not</em> in your best interest as a developer: executives
who want to squeeze more productivity out of you, platforms that want to
increase lock-in to their proprietary tools, or hardware manufacturers who want
to sell more chips.</p>

<p>This has been said a lot lately, but what makes delivering software hard is not
the “writing code” part, but typically the “understanding code” part, for which LLMs
are so poorly suited as to be worse than useless. If you cannot trust their
output, you have to both understand the slop they’re producing <em>and</em> be able to
figure out where it is wrong, and how you can fix it.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="software" /><category term="ml" /><category term="opinion" /><summary type="html"><![CDATA[Large language models (LLMs) seem to only be good at two things: summarizing text and making up bullshit. The idea that a general purpose LLM is going to herald a new age of software development efficiency is misleading in most cases bordering on malicious. While there are a number of other recommendations or predictive machine learning models which can improve software development efficiency, LLMs’ propensity to generate bullshit undermines trust in a way that makes me question their validity at baseline as a software development tool.]]></summary></entry><entry><title type="html">ChatGPT and your intellectual property</title><link href="https://brokenco.de//2023/01/09/chatgpt-and-your-ip.html" rel="alternate" type="text/html" title="ChatGPT and your intellectual property" /><published>2023-01-09T00:00:00+00:00</published><updated>2023-01-09T00:00:00+00:00</updated><id>https://brokenco.de//2023/01/09/chatgpt-and-your-ip</id><content type="html" xml:base="https://brokenco.de//2023/01/09/chatgpt-and-your-ip.html"><![CDATA[<p>There is an excessive number of <a href="https://en.wikipedia.org/wiki/ChatGPT">ChatGPT</a>
screenshots littering social media right now, and not nearly enough critical
thinking about feeding data into this novel new chatbot. An anecdotal survey of
my timeline includes people asking ChatGPT to solve math equations, write
emails for them, create short story prompts, identify bugs in code, or even
generate code for them. Behold, the power of AI!</p>

<p>ChatGPT is created by <a href="https://openai.com/blog/chatgpt/">OpenAI</a>, which despite
the name is <em>not</em> any form of “open” organization, but rather a startup which
has been <a href="https://siliconangle.com/2023/01/05/openai-startup-behind-chatgpt-discusses-tender-offer-value-29b">considering funding at a pretty monstrous
valuation</a>.
In essence, ChatGPT is an AI tool trained on a large corpus of public and
proprietary information, packaged up as a kooky chatbot.</p>

<p>Fine. Setting aside my own annoyance with ML developers co-opting data from
“the commons”, fine.</p>

<p>The zeal with which most people are dumping information into ChatGPT really
concerns me however. I have seen a number of people feeding their own source
code into ChatGPT to ask it to find bugs or security holes.  It would be
foolish to assume that the inputs into ChatGPT are not <em>also used to train
ChatGPT</em>, or at least the next generations of the model.</p>

<p>I am certainly no lawyer, but the two primary problems here are:</p>

<ul>
  <li>Most developers are not authorized to disclose proprietary information of
their employers. Pasting source code into <em>any</em> browser window creates a
liability, but a browser window with ChatGPT increases the likelihood that
the source code disclosed will be <em>reproduced</em> in the future, for some other
user of the system. Uh oh!</li>
  <li>Can the code <em>generated</em> by ChatGPT be considered <em>yours</em>? Who actually
owns the copyright to machine-generated code, or machine-generated anything
for that matter? Do the architects of the system own it, or the users
supplying the inputs? This particular wrinkle isn’t unique to ChatGPT; it applies to
any ML tool generating data which occupies a space adjacent to human-created,
copyrighted works.</li>
</ul>

<p>My concern with what OpenAI is doing with this data is not tin-foil paranoia.
<a href="https://news.yahoo.com/adobe-using-photos-train-ai-001413408.html">Adobe is catching
grief</a> for
opting Lightroom users <em>in</em> to train their AI with those users’ copyrighted or
proprietary works.</p>

<p>I am sure the legal system will catch up to the rapid evolution of these ML
robber barons, but until then I think we should all be <em>very</em> wary of feeding
intellectual property to these systems.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="software" /><category term="ml" /><category term="opinion" /><summary type="html"><![CDATA[There is an excessive number of ChatGPT screenshots littering social media right now, and not nearly enough critical thinking about feeding data into this novel new chatbot. An anecdotal survey of my timeline includes people asking ChatGPT to solve math equations, write emails for them, create short story prompts, identify bugs in code, or even generate code for them. Behold, the power of AI!]]></summary></entry><entry><title type="html">The problem with ML</title><link href="https://brokenco.de//2023/01/04/the-problem-with-ml.html" rel="alternate" type="text/html" title="The problem with ML" /><published>2023-01-04T00:00:00+00:00</published><updated>2023-01-04T00:00:00+00:00</updated><id>https://brokenco.de//2023/01/04/the-problem-with-ml</id><content type="html" xml:base="https://brokenco.de//2023/01/04/the-problem-with-ml.html"><![CDATA[<p>The holidays are the time of year when I typically field a lot of questions
from relatives about technology or the tech industry, and this year my favorite
questions were around <strong>AI</strong>. (<em>insert your own scary music</em>) Machine-learning
(ML) or Artificial Intelligence (AI) are being widely deployed and I have some
<strong>Problems™</strong> with that. Machine learning is not necessarily a new
domain, the practices commonly accepted as “ML” have been used for quite a
while to support search and recommendations use-cases. In fact, my day job
includes supporting data scientists and those who are actively creating models
and deploying them to production. <em>However</em>, many of my relatives outside of the tech industry believe that “AI” is going to replace people, their jobs, and/or run the future. I genuinely hope AI/ML comes nowhere close to this future imagined by members of my family.</p>

<p>Like many pieces of technology, ML is not inherently good or bad, but the
problem with ML as it is applied today is that <strong>its application is far
outpacing our understanding of its consequences</strong>.</p>

<p>Brian Kernighan, co-creator of the C programming language and UNIX, said:</p>

<blockquote>
  <p>Everyone knows that debugging is twice as hard as writing a program in the
first place. So if you’re as clever as you can be when you write it, how will
you ever debug it?</p>
</blockquote>

<p>Setting aside the <em>mountain</em> of ethical concerns around the application of ML
which have been and should continue to be discussed in the technology industry,
there’s a fundamental challenge with ML-based systems: I don’t think their
creators understand how they work, how their conclusions are determined, or how
to consistently improve them over time. Imagine you are a data scientist or ML
developer: how confident are you in what your models will predict between
experiments or evolutions of the model? Would you be willing to testify in a
court of law about the veracity of your model’s output?</p>

<p>Imagine you are a developer working on the models that Tesla’s “full
self-driving” (FSD) mode relies upon. Your model has been implicated in a Tesla
killing the driver and/or pedestrians (which <a href="https://www.reuters.com/business/autos-transportation/us-probing-fatal-tesla-crash-that-killed-pedestrian-2021-09-03/">has
happened</a>).
Do you think it would be possible to convince a judge and jury that your model
is <em>not</em> programmed to mow down pedestrians outside of a crosswalk? How do you
prove what a model is or is not supposed to do given never before seen inputs?</p>

<p>Traditional software <em>does</em> have a variation of this problem, but source code
lends itself to scrutiny far better than ML models, many of which have come
from successive evolutions of public training data, proprietary model changes,
and integrations with new data sources.</p>

<p>These problems may be solvable in the ML ecosystem, but the problem is that the
application of ML is outpacing our ability to understand, monitor, and diagnose
models when they do harm.</p>

<p>That model your startup is working on to help accelerate home loan approvals
based on historical mortgages: how do you assert that it is not
re-introducing racist policies like
    <a href="https://en.wikipedia.org/wiki/Redlining">redlining</a>? (Forms of this <a href="https://fortune.com/2020/02/11/a-i-fairness-eye-on-a-i/">have happened</a>.)</p>

<p>How about that fun image generation (AI art!) project you have been tinkering
with? It uses a publicly available model that was trained on millions of images
from the internet, and as a result in some cases unintentionally outputs
explicit images, or even what some jurisdictions might consider bordering on
child pornography. (Forms of this <a href="https://www.wired.com/story/lensa-artificial-intelligence-csem/">have
happened</a>.)</p>

<p>Really, anything you train on data “from the internet” is asking for
racist, pornographic, or otherwise offensive results, as the <a href="https://www.cbsnews.com/news/microsoft-shuts-down-ai-chatbot-after-it-turned-into-racist-nazi/">Microsoft
Tay</a>
example should have taught us.</p>

<p>Can you imagine the human-rights nightmare that could ensue from shoddy ML
models being brought into a healthcare setting? Law-enforcement? Or even
military settings?</p>

<hr />

<p>Machine-learning encompasses a very powerful set of tools and patterns, but our
ability to predict how those models will be used, what they will output, or how
to prevent negative outcomes is <em>dangerously</em> insufficient for use outside
of search and recommendation systems.</p>

<p>I understand how models are developed, how they are utilized, and what I
<em>think</em> they’re supposed to do.</p>

<p>Fundamentally the challenge with AI/ML is that we understand how to “make it
work”, but we don’t understand <em>why</em> it works.</p>

<p>Nonetheless we keep deploying “AI” anywhere there’s funding, consequences be
damned.</p>

<p>And that’s a problem.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="software" /><category term="ml" /><category term="aws" /><category term="databricks" /><summary type="html"><![CDATA[The holidays are the time of year when I typically field a lot of questions from relatives about technology or the tech industry, and this year my favorite questions were around AI. (insert your own scary music) Machine-learning (ML) or Artificial Intelligence (AI) are being widely deployed and I have some Problems™ with that. Machine learning is not necessarily a new domain, the practices commonly accepted as “ML” have been used for quite a while to support search and recommendations use-cases. In fact, my day job includes supporting data scientists and those who are actively creating models and deploying them to production. However, many of my relatives outside of the tech industry believe that “AI” is going to replace people, their jobs, and/or run the future. I genuinely hope AI/ML comes nowhere close to this future imagined by members of my family.]]></summary></entry></feed>