rtyler

Multimodal with Delta Lake

2026-01-19T00:00:00+00:00

The rate of change for data storage systems has accelerated to a frenzied pace and most storage architectures I have seen simply cannot keep up. Much of my time is spent thinking about large-scale tabular data stored in Delta Lake, which is one of the “lakehouse” storage systems along with Apache Iceberg and others. These storage architectures were developed 5-10 years ago to solve the problems many organizations faced when moving from data warehouse architectures to massive-scale structured data. The storage changes we need today must support “multimodal data”, which is a dramatic departure in many ways from the traditional query and usage patterns our existing infrastructure supports.

Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering, cross-modal retrieval, text-to-image generation, aesthetic ranking, and image captioning.

From Wikipedia

Honestly, I have been working on this problem for longer than I knew that it had a name!

Working on Content Crush at Scribd I have had to negotiate an ever-present challenge: how do we make multimodal data work seamlessly with our classic tabular datasets?

A couple of the ideas that I have been thinking about revolve around one principle: re-encoding of existing data is unacceptable. In the past I have considered simply encoding binary data such as that from images or PDFs into Apache Parquet. This approach suffers from a few major flaws:

Re-encoding requires substantial computation for any non-trivial set of images, PDFs, video, etc.
Redundant object storage: even with compression, it is unlikely that any organization which has terabytes or petabytes of image data will want to store a secondary copy of it for their multimodal needs.
Embedding a 1MB PDF file inside of a Parquet file is not silly but embedding a 10GB video file inside of a Parquet file is very silly. Any approach taken should scale in a reasonable fashion for data in the gigabyte to terabyte range.

A secondary objective in my thinking has been to avoid needing substantial client changes for working with multimodal data. I recently watched a talk by Ryan Johnson about adding transactional semantics to Delta Lake and one of the big takeaways that I heard from him was about the troublesome nature of ensuring all actors in the system cooperated with the transaction semantics. In a modern data environment that could be dozens of different off-the-shelf libraries, Databricks notebooks, AWS SageMaker transforms, and so on. The less “exposure” to the client layer the better.

Parquet Anchors

The first idea that I had was “Parquet Anchors” which would be built on Binary Protocol Extensions in Apache Parquet. In most cases the rich text/image/video data is already stored in object storage such as AWS S3 and a URL should be sufficient to retrieve that data.

The extension of the binary protocol, as I understand it, would allow custom information to be encoded in the Parquet files that are being written as part of an existing Delta table. The specific mechanism of encoding this data is somewhat irrelevant so long as it can carry:

Artifact name (e.g. some.pdf)
Artifact URL (s3://bucket/prefix/of/keys/some-10x9u09123.pdf)
Artifact length (number of bytes)
Artifact content type (e.g. application/pdf)
Checksum
Checksum Algorithm
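
To make the list above concrete, here is a minimal sketch of what a single anchor record might look like once decoded by a reader. The `ParquetAnchor` name, its field names, and the `verify` helper are all hypothetical; nothing here reflects an actual Parquet extension encoding, only the six pieces of information listed above.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ParquetAnchor:
    """Hypothetical decoded record pointing at one externally stored artifact."""

    name: str                # e.g. "some.pdf"
    url: str                 # e.g. "s3://bucket/prefix/of/keys/some-10x9u09123.pdf"
    length: int              # number of bytes
    content_type: str        # e.g. "application/pdf"
    checksum: str            # hex digest of the artifact bytes
    checksum_algorithm: str  # any name accepted by hashlib.new, e.g. "sha256"

    def verify(self, data: bytes) -> bool:
        """Return True when fetched bytes match the recorded length and checksum."""
        if len(data) != self.length:
            return False
        digest = hashlib.new(self.checksum_algorithm, data).hexdigest()
        return digest == self.checksum


# Stand-in bytes for an artifact fetched from object storage.
payload = b"%PDF-1.7 example bytes"
anchor = ParquetAnchor(
    name="some.pdf",
    url="s3://bucket/prefix/of/keys/some-10x9u09123.pdf",
    length=len(payload),
    content_type="application/pdf",
    checksum=hashlib.sha256(payload).hexdigest(),
    checksum_algorithm="sha256",
)
```

Carrying the length and checksum alongside the URL is what would let a reader detect that the artifact behind the link has changed or been truncated since the anchor was written.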

Pros

The most obvious benefit of going down this route is the ease with which one could update existing data files and this note from the Binary Protocol Extensions document:

Existing readers will ignore the extension bytes with little processing overhead

Logically Parquet Anchors could be quite simple to implement, and most users of a Delta table with Parquet Anchors would never know they were there.

Cons

The natural downside of this feature being hidden from existing readers is that clients must be updated in order to read the extension data properly. For something like processing multimodal data where a row of content metadata might refer to some.pdf this would mean the reader would have to have some indication that it must:

Read the extended binary information
Then fetch the necessary artifacts

There is another downside to this approach in that a table would need to be “rewritten” but only partially. If a Parquet file added to the Delta table references 1000 artifacts, then that .parquet file would need to be rewritten to include the Parquet Anchors for those 1000 artifacts alongside that file's add action. In essence I think this approach would require a full-table rewrite where each .parquet in the transaction log would be retrieved, processed, and rewritten with the appropriate Anchors.

Considering ways to address the shortcomings of Parquet Anchors I came up with my next concept.

Virtual Delta Tables (vdt)

The notion of Parquet Anchors I think is useful to hold onto: hyperlinks to existing artifacts are a key part of the multimodal data storage solution, but perhaps not as a direct encoding into the Parquet data files. Considering the shortcomings led me to think of how to present a virtual Delta table “view” to existing clients while hiding the disparate nature of the data behind the scenes.

One underutilized feature of the Delta Lake protocol is the use of URLs in the add actions which enables functionality like shallow clones. I have long thought of this as a super power that should really be used more.

vdt0: just the artifacts

The magic of the URL support in the Delta protocol is that the URLs don’t even have to point to object storage. Nothing about the protocol dictates that the URLs must point to s3:// or abfss:// URLs; you can just point to https:// URLs. AWS S3 supports https:// URLs, but so does every other web service.

Imagine a storage architecture which already contains heaps of .pdf artifacts. A vdt web service could provide a read-only URL structure which maps the existing object storage structure into a Delta Lake URL scheme.

A virtual table with just those PDF artifacts could be configured at https://vdt.aws/v1///. Tooling like s3svdt can provide S3-like operations off of this virtual URL, exposing a virtualized JSON transaction log or checkpoints for the Delta client.

Imagine the schema of such a virtual table for PDF artifacts:

Column	Datatype
id	`long`
filename	`string`
content_type	`string`
url	`string`
filesize	`long`
data	`binary`
checksum	`string`
checksum_algo	`string`

The virtualized transaction log is where the real fun can begin. If information about the artifacts can be sourced from an existing database, then the virtualized transaction log could contain numerous imagined parquet files as the add actions:

{
  "add": {
    "path": "datafiles/some-guid.parquet",
    "size": 841454,
    "modificationTime": 1512909768000,
    "dataChange": true,
    "stats": "{\"numRecords\":1,\"minValues\":{\"val..."
  }
}

The special path for the some-guid.parquet would perform on-demand parquet encoding for the underlying artifacts. The most primitive implementation could simply represent each PDF file as a .parquet file with an add action. So long as the add action conveyed the necessary file statistics to allow the consuming engine to filter out files which are not necessary, this could be a seamless way to expose structured PDF data to the consumer. The path in the action could also refer to an already cached version of the encoded file in S3 using the existing URL support in the protocol; in this way clients could progressively cache as need be on the server-side.
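
As a rough sketch of the one-file-per-PDF idea, the snippet below builds an add action for a Parquet file that does not exist yet and shows the min/max check an engine could run against its statistics. The function names, the `datafiles/<id>.parquet` path layout, the `modified_ms` key, and the estimated size are all assumptions made for illustration; only the shape of the add action and its JSON-encoded stats string follow the example above.

```python
import json


def virtual_add_action(artifact: dict, estimated_size: int) -> dict:
    """Build a Delta `add` action for a Parquet file that does not exist yet."""
    columns = ("id", "filename", "content_type", "url", "filesize")
    values = {name: artifact[name] for name in columns}
    stats = {
        "numRecords": 1,
        # One row per file, so the minimum and maximum are the same values.
        "minValues": values,
        "maxValues": values,
        "nullCount": {name: 0 for name in columns},
    }
    return {
        "add": {
            # Hypothetical path layout the vdt service would resolve on demand.
            "path": f"datafiles/{artifact['id']}.parquet",
            # The encoded size is unknown before encoding, so this is an estimate.
            "size": estimated_size,
            "modificationTime": artifact["modified_ms"],
            "dataChange": True,
            # Stats travel inside the action as a JSON-encoded string.
            "stats": json.dumps(stats),
        }
    }


def may_contain(action: dict, column: str, value) -> bool:
    """Min/max pruning: False means the engine can skip the file entirely."""
    stats = json.loads(action["add"]["stats"])
    return stats["minValues"][column] <= value <= stats["maxValues"][column]


artifact = {
    "id": 42,
    "filename": "some.pdf",
    "content_type": "application/pdf",
    "url": "s3://bucket/prefix/of/keys/some-10x9u09123.pdf",
    "filesize": 1_048_576,
    "modified_ms": 1512909768000,
}
action = virtual_add_action(artifact, estimated_size=841454)
```

With one row per virtual file, a filter on `filename` prunes every file except the matching one, so the service would only ever have to encode the artifacts a query actually touches.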

Brief aside: I have never fully understood why Delta sharing exists as a separate entity. In my opinion the Delta Lake protocol coupled with a clever server-side backend could provide identical functionality for all existing Delta implementations.

Assuming the vdt service supports the schema defined above and can properly retrieve the PDF artifacts and encode them as Parquet data on the fly, a query such as SELECT filename, data FROM vdt WHERE filename = $? could be answered for any existing Delta client without modification.

Pros

Breaking the pretense of “objects must actually exist” with Delta Lake is very liberating. On-demand encoding of artifacts in Apache Parquet would mean all client-side libraries should be able to seamlessly work within their existing environments.

When I think about potential approaches for implementing vdt0 I can also imagine many different potential avenues for optimization.

Cons

While I really do like this idea, I’m not sure how much I should like it considering the potential downsides:

Requires some existing structure behind the scenes to build up a sensible virtual Delta log. For situations where artifacts are simply in a dumb bucket somewhere, with no metadata already stored in a relational database, producing a virtual transaction log would be quite difficult.
I cannot imagine a sensible path for write workloads with vdt0.
Without having implemented this (yet!) it is unclear how much compute-time would be expended on uncached parquet file encoding.
Most data scientists want the PDF/image/etc but they don’t typically want the raw bytes that they then have to parse through.

Uh, what if you just don’t use Delta Lake?

Hey good question. Great interlude opportunity!

As a seller of fine hammers and hammer accessories, everything does in fact look like a nail.

Delta Lake is kind of a means to an end for me here. I think its protocol has enough maturity in terms of features and client capabilities to provide almost everything I need from a multimodal storage system. I just can’t/don’t want to shove everything into a Delta table per se.

vdt1: adding virtual legs

Since I have already indulged in the heretical idea of “what if we just make the files up” I went a level further to consider what if we got even more virtualized. One key characteristic I dislike with the vdt0 approach is that it is too simple, believe it or not.

When I think about artifacts like PDFs, they have far more structure than just bytes. There are pages, typically sections, text, images, titles, footnotes, and so on. For most machine learning use-cases the data scientist may be interested in raw bytes for some projects but much more often they are interested in the parsed and structured data of the artifact.

While my expertise is largely around text-based storage and processing, I would imagine image/audio/video artifacts also have similar structure of interest to data scientists.

Indulging in even more virtual-thinking I started to think about collections of data all associated with an artifact. There’s the raw data schema above, but for PDFs I can also envision:

Paragraphs

Column	Datatype
id	`long`
page	`long`
offset	`integer`
text	`string`
is_heading	`bool`
heading_level	`integer`

Images

Column	Datatype
id	`long`
content_type	`string`
page	`long`
data	`binary`
bounds_x	`long`
bounds_y	`long`

Links

Column	Datatype
id	`long`
page	`long`
href	`string`
label	`string`

Taken all together this only represents 21 columns of data (counting the shared id once) but could represent most of the information needed for most multimodal workloads. I mention the low column count because I have seen bug reports from Delta Lake users talking about issues with tables containing thousands of columns.

A virtualized table schema could take these interior schemas and join them together such that a single row might have: id, raw_filename, raw_content_type, raw_url, raw_filesize, raw_data, raw_checksum, raw_checksum_algo, paragraph_page, paragraph_text, paragraph_offset, paragraph_is_heading, paragraph_heading_level, image_content_type, image_page, image_data, image_bounds_x, image_bounds_y, link_page, link_href, link_label.
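
A small sketch of how that joined column list could be derived from the interior schemas, assuming each interior column is simply prefixed with its schema name and all of them share the single `id` key. The constant and function names are made up for illustration.

```python
# Interior schemas from the tables above, minus the shared `id` column.
RAW = ["filename", "content_type", "url", "filesize", "data", "checksum", "checksum_algo"]
PARAGRAPH = ["page", "offset", "text", "is_heading", "heading_level"]
IMAGE = ["content_type", "page", "data", "bounds_x", "bounds_y"]
LINK = ["page", "href", "label"]


def joined_columns() -> list[str]:
    """Prefix each interior schema's columns and share a single `id` key."""
    interior = {"raw": RAW, "paragraph": PARAGRAPH, "image": IMAGE, "link": LINK}
    columns = ["id"]
    for prefix, names in interior.items():
        columns.extend(f"{prefix}_{name}" for name in names)
    return columns


def empty_row(artifact_id: int) -> dict:
    """Every column except `id` is nullable, so a row starts out all None."""
    row = dict.fromkeys(joined_columns())
    row["id"] = artifact_id
    return row


# A row carrying only link data leaves every raw, paragraph, and image column null.
link_row = empty_row(42) | {
    "link_page": 3,
    "link_href": "https://example.com",
    "link_label": "Example",
}
```

The prefixes matter because `content_type`, `page`, and `data` appear in more than one interior schema and would otherwise collide in the joined table.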

So long as the schema allows nullable columns for everything but id, the vdt service can expose the disjointed data behind the scenes in a sensible way with the add actions on the virtual Delta table and its file statistics. For example, an add action which includes link data would list all other columns as entirely null within the nullCount file statistics, such that any engine querying for raw columns would ignore that file entirely.
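
The skipping decision described here can be sketched in a few lines. In the Delta protocol's per-file statistics the relevant field is `nullCount`, and a column is entirely null when its null count equals `numRecords`. This sketch assumes the query filters on the columns in question with comparison predicates joined by AND, since a comparison against an all-null column can never match; the `can_skip` helper and the example path are hypothetical.

```python
import json


def can_skip(add_action: dict, predicate_columns: list[str]) -> bool:
    """True when some AND-ed predicate column is entirely null in the file."""
    stats = json.loads(add_action["stats"])
    num_records = stats["numRecords"]
    null_count = stats.get("nullCount", {})
    # A column missing from the stats tells us nothing, so it never causes a skip.
    return any(null_count.get(col) == num_records for col in predicate_columns)


# A virtual file holding three link rows: raw_* columns are null in every row.
link_file = {
    "path": "datafiles/links-for-42.parquet",  # hypothetical path
    "stats": json.dumps({
        "numRecords": 3,
        "nullCount": {
            "id": 0,
            "link_page": 0,
            "link_href": 0,
            "link_label": 1,
            "raw_filename": 3,
            "raw_data": 3,
        },
    }),
}
```

A filter such as `raw_filename = 'some.pdf'` lets the engine drop `link_file` without ever requesting it, which for a virtual table also means the service never has to encode it.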

Pros

I think this structure would be possible to build in a traditional Delta Lake system assuming one wished to re-encode data into new storage. Hiding existing data behind a virtualized Delta table allows us to avoid data denormalization.

Similar to vdt0 there are optimization and caching approaches that are readily available with vdt1 but unlike vdt0 the “write path” is more apparent to me with this approach. By hiding metadata about an artifact inside the virtualized data structure, writes which add rows with those columns could sensibly be accepted and inserted into an internal Delta or other table.

Depending on how the metadata associated with an artifact is stored, the vdt service could simply front a number of other conventional Delta tables and act as a proxy, pushing predicates and I/O filtering “to the edge” as far as they will go before collecting results for the query engine.

Cons

This approach is certainly the most complex but could potentially require the least amount of re-encoding of existing data assets. The devil is in the details with how one might map existing data sources together. My sketch above places a tremendous amount of emphasis on an id which acts as a primary key between all the metadata associated with a singular artifact.

Nothing defined thus far accounts for potential changes in an artifact or its metadata as time goes on. If a new version of an existing document is uploaded, the new version should likely be considered “canonical” but be appended rather than merged with existing records. How one might sensibly model that in a system like Delta which doesn’t support referential integrity between datasets leads me back to the “anchors” idea from before. That said, I’m not sure if that’s much ado about nothing.

From a data storage standpoint one key aspect of multimodal data is that the different modalities are presented to the end user or system together. What I like about the virtual Delta tables concept is that it doesn’t require substantial client changes to accomplish but does provide a path to present various types of data together for a given artifact.

I have various bits and pieces of a potential vdt system lying around the workshop floor. If the idea has legs I might take a crack at a prototype implementation, but first I will need some feedback!

Let me know what you think by emailing me at rtyler@ this domain!

Who is “R Tyler Croy”

2024-08-15T00:00:00+00:00

I asked a large language model this question:

A blast from the past

R. Tyler Croy is a well-known figure in the world of video games, particularly in the realm of retro gaming and emulation. He’s often referred to as “R. Tyler Croy” or simply “RTC” by his fans.

Tyler Croy was one of the most prominent figures behind the popular emulator, MAME (Multiple Arcade Machine Emulator). As a member of the MAME team, he contributed significantly to the development and maintenance of the emulator, which allows players to emulate classic arcade games on modern computers.

Throughout the late 1990s and early 2000s, Croy was an active member of the emulation community, working closely with other developers and enthusiasts to create and share emulators for various retro gaming platforms. His contributions helped bring many classic games to life again, allowing gamers to experience and enjoy their favorite childhood titles on modern hardware.

Although he’s not as actively involved in the emulation scene today, R. Tyler Croy remains a legend among retro gaming enthusiasts and is often remembered fondly for his tireless efforts in preserving gaming history.

I figure that should also help answer the question “should I trust a large language model?”

A large language model is not a good co-pilot

2024-05-12T00:00:00+00:00

Large language models (LLMs) seem to only be good at two things: summarizing text and making up bullshit. The idea that a general purpose LLM is going to herald a new age of software development efficiency is misleading, in most cases bordering on malicious. While there are a number of other recommendation or predictive machine learning models which can improve software development efficiency, LLMs’ propensity to generate bullshit undermines trust in a way that makes me question their validity at baseline as a software development tool.

A couple months ago I recall using Google’s Bard in an attempt to generate some Terraform boilerplate code. It so fantastically failed to do so that the top Google search result from Stack Overflow was almost always functional code whereas Bard consistently produced invalid code.

This weekend while playing around I tried out Llama 3, another LLM which was also trained by pilfering the collective code commons of the internet, but still cannot generate useful code. I asked it about Rust, and about Delta Lake, which it was able to explain at about the level of the first page of their respective Wikipedia pages.

I then asked:

How do you read a delta table in rust?

Below is a portion of its response:

To read a Delta Lake table in Rust, you can use the delta-rs crate, which is a Rust wrapper around the Apache Arrow and Parquet libraries. Here’s an example of how to read a Delta table:

So far so good! You do use the delta-rs crate, which is based on Apache Arrow and Parquet libraries. I’m cheating a bit since I helped create delta-rs, so this is more like a proctored exam rather than a true co-piloting experience. Llama 3 continues with some code:

use delta::DeltaTable;
use arrow::record_batch;

async fn read_delta_table() {
    // Load the Delta table using the delta-rs crate
    let dt = DeltaTable::load("path/to/table").await.unwrap();
                                                             
    // Get the first record batch (i.e., a chunk of data)   
    let rb = dt.next().unwrap();
    
    // Iterate over the records in the record batch      
    for row in rb.iter_rows() {
        // Print each record as a string
        println!("{:?}", row.to_string());
    }
} 

This code is complete and utter nonsense. That’s not how you import the deltalake library. That’s not how you use Arrow. That’s not how you open a Delta table with Rust. There is no next() function on a DeltaTable struct. There’s no iter_rows() on a RecordBatch.

Almost everything about this code is incorrect. Not a little bit of “oh this is perhaps referring to older APIs”, flat out wrong, will not compile, never would have, fail.

The explanation of this code provided by the LLM contains even more gems:

Here’s what’s happening:

DeltaTable::load(): We load the Delta table using the DeltaTable struct from the delta-rs crate. Pass the path to the Delta table as a string.

next(): We get the first record batch (a chunk of data) from the loaded Delta table using the next() method.

iter_rows(): We iterate over the records in the record batch using the iter_rows() method, which returns an iterator over the rows.

to_string(): For each row, we convert it to a string using the to_string() method.

This code assumes you have the delta-rs crate installed and the necessary dependencies (e.g., Apache Arrow and Parquet). You can install delta-rs via Cargo: [dependencies] delta = "0.10.1"

As somebody with intimate knowledge of the deltalake library and protocol, this is all ha-ha funny to me. But for somebody who might be trying to learn something new or is rushing to get some work done, these sorts of answers are so wildly misleading as to be dangerous. I don’t expect LLMs to be able to always produce coherent and compilable Rust code, but at no point will any of these snippets lead the reader to a productive path of inquiry.

It seems that most of the people pushing LLMs in software development have an ulterior motive that is not in your best interest as a developer. Executives who want to squeeze more productivity out of you, platforms who want to increase lock-in to their proprietary tools, or hardware manufacturers who want to sell more chips.

This has been said a lot lately, but what makes delivering software hard is not the “writing code”, but typically the “understanding code” part. For which LLMs are so poorly suited as to be worse than useless. If you cannot trust their output, you have to both understand the slop they’re producing and be able to figure out where it is wrong, and how you can fix it.

ChatGPT and your intellectual property

2023-01-09T00:00:00+00:00

There is an excessive number of ChatGPT screenshots littering social media right now, and not nearly enough critical thinking about feeding data into this novel chatbot. An anecdotal survey of my timeline includes people asking ChatGPT to solve math equations, write emails for them, create short story prompts, identify bugs in code, or even generate code for them. Behold, the power of AI!

ChatGPT is created by OpenAI, which despite the name is not any form of “open” organization, but rather a startup which has been considering funding at a pretty monstrous valuation. In essence, ChatGPT is an AI tool trained on a large corpus of public and proprietary information, packaged up as a kooky chatbot.

Fine. Setting aside my own annoyance with ML developers co-opting data from “the commons”, fine.

The zeal with which most people are dumping information into ChatGPT really concerns me however. I have seen a number of people feeding their own source code into ChatGPT to ask it to find bugs or security holes. It would be foolish to assume that the inputs into ChatGPT are not also used to train ChatGPT, or at least the next generations of the model.

I am certainly no lawyer, but the two primary problems here are:

Most developers are not authorized to disclose proprietary information of their employers. Pasting source code into any browser window creates a liability, but a browser window with ChatGPT increases the likelihood that the source code disclosed will be reproduced in the future, for some other user of the system. Uh oh!
Can the code generated by ChatGPT be considered yours? Who actually owns the copyright to machine generated code, or machine generated anything for that matter? Do the architects of the system own it, or the users supplying the inputs? This particular wrinkle isn’t unique to ChatGPT, but applies to any ML tool generating data which occupies a space adjacent to human-created, copyrighted works.

My concerns with what OpenAI is doing with this data are not tin-foil paranoia. Adobe is catching grief for opting Lightroom users in to train their AI with those users’ copyrighted or proprietary works.

I am sure the legal system will catch up to the rapid evolution of these ML robber barons, but until then I think we should all be very wary of feeding intellectual property to these systems.

The problem with ML

2023-01-04T00:00:00+00:00

The holidays are the time of year when I typically field a lot of questions from relatives about technology or the tech industry, and this year my favorite questions were around AI. (insert your own scary music) Machine-learning (ML) or Artificial Intelligence (AI) are being widely deployed and I have some Problems™ with that. Machine learning is not necessarily a new domain, the practices commonly accepted as “ML” have been used for quite a while to support search and recommendations use-cases. In fact, my day job includes supporting data scientists and those who are actively creating models and deploying them to production. However, many of my relatives outside of the tech industry believe that “AI” is going to replace people, their jobs, and/or run the future. I genuinely hope AI/ML comes nowhere close to this future imagined by members of my family.

Like many pieces of technology, it is not inherently good or bad, but the problem with ML as it is applied today is that its application is far outpacing our understanding of its consequences.

Brian Kernighan, co-creator of the C programming language and UNIX, said:

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?

Setting aside the mountain of ethical concerns around the application of ML which have and should continue to be discussed in the technology industry, there’s a fundamental challenge with ML-based systems: I don’t think their creators understand how they work, how their conclusions are determined, or how to consistently improve them over time. Imagine you are a data scientist or ML developer, how confident are you in what your models will predict between experiments or evolutions of the model? Would you be willing to testify in a court of law about the veracity of your model’s output?

Imagine you are a developer working on the models that Tesla’s “full self-driving” (FSD) mode relies upon. Your model has been implicated in a Tesla killing the driver and/or pedestrians (which has happened). Do you think it would be possible to convince a judge and jury that your model is not programmed to mow down pedestrians outside of a crosswalk? How do you prove what a model is or is not supposed to do given never before seen inputs?

Traditional software does have a variation of this problem but source code lends itself to scrutiny far better than ML models, many of which have come from successive evolutions of public training data, proprietary model changes, and integrations with new data sources.

These problems may be solvable in the ML ecosystem, but the problem is that the application of ML is outpacing our ability to understand, monitor, and diagnose models when they do harm.

That model your startup is working on to help accelerate home loan approvals based on historical mortgages: how do you assert that your models are not re-introducing racist policies like redlining? (Forms of this have happened.)

How about that fun image generation (AI art!) project you have been tinkering with? It uses a publicly available model that was trained on millions of images from the internet, and as a result in some cases unintentionally outputs explicit images, or even what some jurisdictions might consider bordering on child pornography. (Forms of this have happened.)

Really anything you teach based on the data “from the internet” is asking for racist, pornographic, or otherwise offensive results, as the Microsoft Tay example should have taught us.

Can you imagine the human-rights nightmare that could ensue from shoddy ML models being brought into a healthcare setting? Law-enforcement? Or even military settings?

Machine-learning encompasses a very powerful set of tools and patterns, but our ability to predict how those models will be used, what they will output, or how to prevent negative outcomes is dangerously insufficient for use outside of search and recommendation systems.

I understand how models are developed, how they are utilized, and what I think they’re supposed to do.

Fundamentally the challenge with AI/ML is that we understand how to “make it work”, but we don’t understand why it works.

Nonetheless we keep deploying “AI” anywhere there’s funding, consequences be damned.

And that’s a problem.