<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://brokenco.de//feed/by_tag/buoyantdata.xml" rel="self" type="application/atom+xml" /><link href="https://brokenco.de//" rel="alternate" type="text/html" /><updated>2026-04-12T21:39:52+00:00</updated><id>https://brokenco.de//feed/by_tag/buoyantdata.xml</id><title type="html">rtyler</title><subtitle>a moderately technical blog</subtitle><author><name>R. Tyler Croy</name></author><entry><title type="html">Private Open Source</title><link href="https://brokenco.de//2026/04/01/private-open-source.html" rel="alternate" type="text/html" title="Private Open Source" /><published>2026-04-01T00:00:00+00:00</published><updated>2026-04-01T00:00:00+00:00</updated><id>https://brokenco.de//2026/04/01/private-open-source</id><content type="html" xml:base="https://brokenco.de//2026/04/01/private-open-source.html"><![CDATA[<p>Open source communities depend on a fundamental assumption that is no longer
true: the presumption of good faith actors. The hosts serving free and open
source code are scraped relentlessly, denying service to developers. Once that
code has been assimilated into various models, it is washed of all attribution
and license information, denying developers their rights. Some subset of users
then feel empowered, emboldened, or I’m not sure what exactly, by these models,
and lob massive thousand-line changes back at the developers. Nearly every
technology can be used for positive and negative ends,
but free and open source communities are being harmed from multiple directions
right now.</p>

<p>I am a big believer in <a href="https://openinfra.org/four-opens/">the four opens</a>:</p>

<blockquote>
  <p>The Four Opens are a set of principles that were created by the
OpenStack community as a way to guarantee that the users get all the benefits
associated with open source software, including the ability to engage with the
community and influence future evolution of the software.</p>

  <ul>
    <li>Open Source</li>
    <li>Open Design</li>
    <li>Open Development</li>
    <li>Open Community</li>
  </ul>
</blockquote>

<p>There is an implied “to the public” in each of the four opens, at least as I
have understood it over the past many (<em>many</em>) years. I have repeatedly
advocated for open (to the public) discourse and transparency when working with
companies like <a href="https://cloudbees.com">CloudBees</a> and
<a href="https://databricks.com">Databricks</a> as they have engaged with open source
projects.</p>

<p>The mounting negative pressures and in some cases <a href="https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/">outright
hostility</a>
towards free and open source projects has me reconsidering the implied “to the
public” and how these communities may need to evolve in the future.</p>

<p>I have never been a fan of invite-only Discord or Slack servers, both of
which are used by the <a href="https://datafusion.apache.org/contributor-guide/communication.html">Apache
DataFusion</a>
project for some odd reason. Still, there are good reasons to put a project’s shared
spaces in slightly more private and slightly less AI-accessible systems. A
little bit of privacy can lead to more candid conversations and <em>potentially</em> a
stronger feeling of community and safety.</p>

<p>My first line of thinking led me to the idea of “vouching” which I recall
<a href="https://mitchellh.com/writing">mitchellh</a> posting about in the fediverse, but
I couldn’t find a good linkable reference.</p>

<p>Vouching is what we did as kids when a new friend was suggested to join the
mischief: somebody would vouch for the new kid and say “hey, they’re my
neighbor, they’re cool” and then we would go start new trouble together. In the
context of an open source community vouching can:</p>

<ul>
  <li>Help build a web of trust without every person necessarily knowing each new person</li>
  <li>But <em>also</em> make the community tend toward homogeneity, since it will be
less welcoming to random newcomers.</li>
</ul>

<p>I think vouching could also exacerbate the likelihood of a <a href="https://en.wikipedia.org/wiki/XZ_Utils_backdoor">Jia
Tan</a> scenario, where the web of trust
within the community is compromised by a malicious actor. Getting <em>one</em> member
to vouch for you may lower the guard of all of the other members of the
community, making this style of attack easier to pull off.</p>

<p>Since I started writing this post a whole week has passed by without any new
ideas or patterns popping into mind. I’m curious how others are thinking about
it, so please let me know <a href="https://hacky.town/@rtyler/116329725989266400">on Mastodon</a> or via
email <code class="language-plaintext highlighter-rouge">rtyler@</code>~</p>]]></content><author><name>R. Tyler Croy</name></author><category term="opensource" /><category term="buoyantdata" /><category term="ai" /><summary type="html"><![CDATA[Open source communities depend on a fundamental assumption that is no longer true: the presumption of good faith actors. The hosts serving free and open source code are scraped relentlessly, denying service to developers. Once that code has been assimilated into various models it is washed of all attribution and license information, denying rights of the developers. Some subset of users then feel empowered, emboldened, I’m not sure what exactly by these models and lob massive thousand line changes back at the developers. Nearly every technology has the possibility to be used for positive and negative effects, but free and open source communities are being harmed from multiple directions right now.]]></summary></entry><entry><title type="html">On Data Engineering Central</title><link href="https://brokenco.de//2026/02/04/data-engineering-central.html" rel="alternate" type="text/html" title="On Data Engineering Central" /><published>2026-02-04T00:00:00+00:00</published><updated>2026-02-04T00:00:00+00:00</updated><id>https://brokenco.de//2026/02/04/data-engineering-central</id><content type="html" xml:base="https://brokenco.de//2026/02/04/data-engineering-central.html"><![CDATA[<p>I was lucky enough to <a href="https://dataengineeringcentral.substack.com/p/the-lakehouse-architecture-multimodal">record a podcast
episode</a>
with Daniel Beach of Data Engineering Central. Daniel and I have known each
other for a couple of years, sharing notes and ideas on the state of the ecosystem,
where it falls down, and where things are getting interesting.</p>

<p>In my opinion <a href="https://dataengineeringcentral.substack.com">Data Engineering
Central</a> has been one of the most
useful broad-ranging surveys of the ecosystem, curated by one crazy
Midwesterner: Daniel. He pulls no punches, and while we share criticisms of AI
in the industry and commercial tools, Daniel’s honesty has also put some of my
work on blast, such as <a href="https://dataengineeringcentral.substack.com/p/_internaldeltaprotocolerror">this
post</a>
about some terrible user experience and lopsided Delta Lake support in
<a href="https://github.com/delta-io/delta-rs">delta-rs</a>.</p>

<p>In his post Daniel highlights some of the topics we got into during our time chatting:</p>

<blockquote>
  <ul>
    <li>What the Lakehouse architecture gets right—and where it still falls short</li>
    <li>Why multimodal data (text, images, audio, video, embeddings) changes everything</li>
    <li>How open table formats like Delta Lake fit into the next generation of data platforms</li>
    <li>The growing gap between data tooling hype and day-to-day data engineering reality</li>
    <li>What skills and architectural thinking will matter most for data engineers over the next decade</li>
  </ul>
</blockquote>

<center><iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/WLlko-liHMg?si=9aGp1v-6nm2kbya0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe></center>

<p>I encourage you to <a href="https://dataengineeringcentral.substack.com/">subscribe</a> to
his newsletter or if that’s not your jam, you can <a href="https://dataengineeringcentral.substack.com/feed">subscribe to the RSS
feed</a> too.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="software" /><category term="dataeng" /><category term="buoyantdata" /><category term="databricks" /><category term="podcast" /><summary type="html"><![CDATA[I was lucky enough to record a podcast episode with Daniel Beach of Data Engineering Central. Daniel and I have known each other for a couple years sharing notes and ideas on the state of the ecosystem, where it falls down, and where things are getting interesting.]]></summary></entry><entry><title type="html">The last data file format</title><link href="https://brokenco.de//2025/07/16/no-way-parquet.html" rel="alternate" type="text/html" title="The last data file format" /><published>2025-07-16T00:00:00+00:00</published><updated>2025-07-16T00:00:00+00:00</updated><id>https://brokenco.de//2025/07/16/no-way-parquet</id><content type="html" xml:base="https://brokenco.de//2025/07/16/no-way-parquet.html"><![CDATA[<p>The layers of abstraction in most technology stacks have gotten incredibly deep
over the last decade. At some point way down there in the depths of <em>most</em> data
applications somebody <em>somewhere</em> has to actually read or write bytes to
storage. The flexibility of <a href="https://parquet.apache.org">Apache Parquet</a> has me
increasingly convinced that it just might be the <strong>last data file format I will
need</strong>.</p>

<p>In my <a href="/2025/06/24/low-latency-parquet.html">previous post</a> on the subject I
wrote about the file format’s novelty for semi-random data access <em>inside</em> of a
<code class="language-plaintext highlighter-rouge">.parquet</code> file. I’m certainly wandering off the beaten path with Apache
Parquet already. <em>Then</em> this blog post kind of blew my mind: <a href="https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/">Embedding
User-Defined Indexes in Apache Parquet
Files</a>.</p>

<blockquote>
  <p>However, Parquet is extensible with user-defined indexes: <strong>Parquet tolerates
unknown bytes within the file body</strong> and permits arbitrary key/value pairs in
its footer metadata. These two features enable embedding user-defined indexes
directly in the file—no extra files, no format forks, and no compatibility
breakage.</p>
</blockquote>

<p>Emphasis mine.</p>

<p>This is news to me.</p>

<p>And it is <em>absolutely wild</em>.</p>

<hr />

<p>The authors’ approach for embedding user-defined indexes in Apache Parquet
files is certainly novel, and their post is already worth a read.</p>

<p>But the fact that you can shove arbitrary blocks of bytes in the middle of the
otherwise columnar data format is incredible.</p>
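<p>The footer half of that trick is easy to experiment with from Rust. Here is a
minimal sketch of attaching an application-defined key/value entry to a file’s
footer metadata using the <code class="language-plaintext highlighter-rouge">arrow</code> and <code class="language-plaintext highlighter-rouge">parquet</code> crates; the key name and its
payload are made up for illustration:</p>

<pre><code class="language-rust">use std::fs::File;
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;
use parquet::format::KeyValue;

fn main() {
    // A single tiny column so the file is a valid Parquet file
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let ids = Int64Array::from(vec![1, 2, 3]);
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(ids)]).unwrap();

    // Attach an application-defined entry to the footer's key/value
    // metadata; "example.index.offset" and its payload are made up
    let props = WriterProperties::builder()
        .set_key_value_metadata(Some(vec![KeyValue::new(
            "example.index.offset".to_string(),
            "12345".to_string(),
        )]))
        .build();

    let file = File::create("/tmp/indexed.parquet").unwrap();
    let mut writer = ArrowWriter::try_new(file, schema, Some(props)).unwrap();
    writer.write(&amp;batch).unwrap();
    writer.close().unwrap();
}
</code></pre>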

<p>Modifying an Apache Parquet file still requires a rewrite of the whole
object, which means <code class="language-plaintext highlighter-rouge">.parquet</code> is not a file format to be used for heavy data
modification workloads.</p>

<p>Use-cases with large amounts of metadata and binary data, however, would fit nicely
within this parquet + unknown bytes design. Parquet readers which are ignorant
of the purpose of these unknown byte blocks will completely ignore them.</p>
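<p>A cooperating reader that <em>does</em> know what to look for can pull that footer
entry back out. Another sketch, reusing the made-up key from the writer example
above:</p>

<pre><code class="language-rust">use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() {
    let file = File::open("/tmp/indexed.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();

    // The footer's key/value metadata is just an optional list of
    // string pairs; readers that don't recognize a key skip over it
    if let Some(entries) = reader.metadata().file_metadata().key_value_metadata() {
        for kv in entries {
            println!("{} = {:?}", kv.key, kv.value);
        }
    }
}
</code></pre>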

<p>Altogether this is a new superpower, and I am contemplating whether I can use it for good or evil…</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="parquet" /><category term="buoyantdata" /><category term="dataeng" /><summary type="html"><![CDATA[The layers of abstraction in most technology stacks have gotten incredibly deep over the last decade. At some point way down there in the depths of most data applications somebody somewhere has to actually read or write bytes to storage. The flexibility of Apache Parquet has me increasingly convinced that it just might be the last data file format I will need.]]></summary></entry><entry><title type="html">Busily writing elsewhere</title><link href="https://brokenco.de//2025/05/03/writing-elsewhere.html" rel="alternate" type="text/html" title="Busily writing elsewhere" /><published>2025-05-03T00:00:00+00:00</published><updated>2025-05-03T00:00:00+00:00</updated><id>https://brokenco.de//2025/05/03/writing-elsewhere</id><content type="html" xml:base="https://brokenco.de//2025/05/03/writing-elsewhere.html"><![CDATA[<p>Writing has been a part of my work for a <em>long</em> time; it helps me think and,
more importantly, it helps me share ideas with other developers. Recently a
tremendous amount of my time has been spent writing internal design documents,
blog posts, and other materials. By the time it comes to personal blogging,
my words have all been spent.</p>

<p>On the <a href="https://buoyantdata.com">Buoyant Data</a> blog I have been writing about a
<em>lot</em> of <a href="https://delta.io">Delta Lake</a> related topics such as:</p>

<ul>
  <li><a href="https://www.buoyantdata.com/blog/2024-12-31-high-concurrency-logstore.html">Scaling streaming Delta Lake applications</a></li>
  <li><a href="https://www.buoyantdata.com/blog/2025-02-24-just-keep-buffering.html">Buffering more messages with serverless data ingestion</a></li>
  <li><a href="https://www.buoyantdata.com/blog/2025-03-09-lessons-learned-building-delta-rs.html">Lessons learned in building delta-rs</a></li>
  <li><a href="https://www.buoyantdata.com/blog/2025-04-22-rust-is-good-for-the-climate.html">Build more climate-friendly data applications with Rust</a></li>
</ul>

<p>Some of this work has been in preparation for the two upcoming talks I have at
<a href="https://www.databricks.com/dataaisummit">Data and AI Summit 2025</a>. Some of
these posts have come out of research with clients, or just spelunking on my
own.</p>

<p>You can <a href="https://www.buoyantdata.com/rss.xml">subscribe to the RSS feed</a> for more up to date articles relating to high-efficiency data processing with Rust!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="deltalake" /><category term="buoyantdata" /><summary type="html"><![CDATA[Writing has been a part of my work for a long time, it helps me think and more importantly it helps me share ideas with other developers. Recently a tremendous amount of my time has been spent writing internal design documents, blog posts, and other materials. By the time it has come to personal blogging my words all been spent.]]></summary></entry><entry><title type="html">Fedi-hired! Redesigning the company website</title><link href="https://brokenco.de//2024/12/02/fedihired.html" rel="alternate" type="text/html" title="Fedi-hired! Redesigning the company website" /><published>2024-12-02T00:00:00+00:00</published><updated>2024-12-02T00:00:00+00:00</updated><id>https://brokenco.de//2024/12/02/fedihired</id><content type="html" xml:base="https://brokenco.de//2024/12/02/fedihired.html"><![CDATA[<p>Today I launched a new rework of
<a href="https://www.buoyantdata.com">buoyantdata.com</a> thanks to the work of a designer
I found in the fediverse! The original “design” of the site was something I had
cobbled together with a Jekyll theme I originally ported to
<a href="https://cobalt-org.github.io/">Cobalt</a>, but it was always lacking.</p>

<p>The release of <a href="/2024/11/15/deltalake-the-definitive-guide.html">Delta Lake The Definitive
Guide</a> offered motivation to
update the site to help prospective customers understand what Buoyant Data can
do for them. I asked around on <a href="https://hacky.town/@rtyler">Mastodon</a> for
recommendations for a web designer in the US who would be open to a
short-term contract to perform some renovations.</p>

<p><a href="https://bgsulz.com/">Ben Sulzinsky</a> was one of the talented folks who reached out to offer to help and we quickly turned a <em>lot</em> of ideas around.</p>

<p>I am quite pleased with Ben’s work. He did a fantastic job taking a laundry
list of both highly-specific and rather vague requirements, and turning them
into re-usable components, structure, and styles. I would certainly recommend
you work with him too!</p>

<p>Periodically I’ll see solicitations to be <code class="language-plaintext highlighter-rouge">#fedihired</code> on Mastodon; I’m happy to have been able to fedi-hire somebody! (<em>even if only for a short-term contract</em>)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="buoyantdata" /><summary type="html"><![CDATA[Today I launched a new rework of buoyantdata.com thanks to the work of a designer I found in the fediverse! The original “design” of the site was something I had cobbled together with a Jekyll theme I originally ported to Cobalt, but it was always lacking.]]></summary></entry><entry><title type="html">From the beginning, delta-rs to Delta Lake: The Definitive Guide</title><link href="https://brokenco.de//2024/11/15/deltalake-the-definitive-guide.html" rel="alternate" type="text/html" title="From the beginning, delta-rs to Delta Lake: The Definitive Guide" /><published>2024-11-15T00:00:00+00:00</published><updated>2024-11-15T00:00:00+00:00</updated><id>https://brokenco.de//2024/11/15/deltalake-the-definitive-guide</id><content type="html" xml:base="https://brokenco.de//2024/11/15/deltalake-the-definitive-guide.html"><![CDATA[<p>Nothing quite feels like “I made it!” like being <em>published</em>. Which is why I am
thrilled to share that <a href="https://bookshop.org/p/books/delta-lake-the-definitive-guide-modern-data-lakehouse-architectures-with-data-lakes-denny-lee/21429337?ean=9781098151942">Delta Lake: The Definitive
Guide</a>
is available for purchase, and I kind of helped! I wanted to share a little bit
about how my contributions (Chapter 6!) came about, because my entrance into
the <a href="https://delta.io">Delta Lake</a> ecosystem was about as unplanned as my
authorship of part of this wonderful book.</p>

<p>The <a href="https://github.com/delta-io/delta-rs">delta-rs</a> project started in 2020 and I wish that I could say it is because
I am a brilliant visionary. The project largely started because I have had a
bias against JVM-based technology stacks and I had stepped into a role at
<a href="https://tech.scribd.com">Scribd</a> where we were migrating to AWS, Databricks,
and a new architecture <em>anyways</em> so why not challenge the orthodoxy? My
colleague <a href="https://about.houqp.me/">QP Hou</a> and I were loving Rust and liked
Delta Lake from a design standpoint, but did not love <a href="https://spark.apache.org">Apache
Spark</a> for some of the things we needed to do.</p>

<p>I would consider the official start of the project to be April 11th, 2020 when
I sent our Databricks colleagues the following:</p>

<hr />

<p>Greetings! As I mentioned in our weekly sync up this week, we have an interest
in partnering with Databricks to develop and open source a native client
interface for Delta Lake.</p>

<p>For framing this conversation and the scope of the native interface, I categorize
our compute workloads into three groups:</p>

<ol>
  <li><strong>Big offline data processing</strong>, requiring a cluster of compute resources where Spark makes a big dent.</li>
  <li><strong>Lightweight/small offline data processing</strong>, workloads needing “fractional
compute” resources, basically less than a single machine. (Ruby/Python type
tasks which move data around, or perform small-scale data accesses make up
the majority of these in our current infrastructure, we’ve discussed using
the Databricks Light runtime for these in the past, since the cost to
deploy/run these small tasks on Databricks clusters doesn’t make sense).</li>
  <li><strong>Boundary data-processing</strong>, where the task might involve a little bit of
production “online” data and a little bit of warehouse “offline” data to
complete its work. In our environment we have Ruby scripts whose sole job is
to sync pre-computed (by Spark) offline data into online data stores for the
production Rails application, etc, to access and serve.</li>
</ol>

<p>I don’t want to burn down our current investment in Ruby for many of the 2nd
and 3rd workloads, not to mention retraining a number of developers in-house to
learn how to effectively use Scala or pySpark.</p>

<p>My proposal is that we partner with Databricks and jointly develop an open
source client interface for Delta Lake. One where we would have at least one
developer from Databricks working with at least one developer from Scribd on a
jointly scoped effort to deliver a library capable of <em>initially</em> addressing
our ‘2’ and ‘3’ use-cases.</p>

<p>[..]</p>

<p>Further, I propose that we jointly develop a client interface in Rust, which
will allow us to easily extend it within the Databricks community to support
Golang, Python, Ruby, and Node clients.</p>

<p>The key benefits I imagine for us all:</p>

<ul>
  <li>
    <p>Much broader market share for Delta Lake as a technology. Not only would
companies like Scribd benefit, and continue to invest in Delta Lake, but
other companies would have an easier on-ramp into the Databricks ecosystem.
Basically, if you start using Delta Lake before you use Spark, you will (I
guarantee) reach a point where these lightweight workloads become heavyweight
workloads requiring the full power and glory of the Databricks runtime :D</p>
  </li>
  <li>
    <p>It’s a fantastic developer advocacy story that hits a number of key bullet
marketing points: open source, partner collaboration, Rust (so hot right now) :)</p>
  </li>
  <li>
    <p>Scribd is able to “immediately” take advantage of Delta Lake benefits without
burning up all our existing codebase and investment in Ruby tasks and
tooling. Thereby allowing for an easier onramp into Delta Lake and the
Databricks platform as a whole.</p>
  </li>
</ul>

<p>The scope of the effort I think would be largely around properly dealing with
the transaction log, since the Apache Arrow project has already created a
pretty decent <a href="https://crates.io/crates/parquet">parquet crate</a> in Rust. That
said, there may be some writer improvements we’d want/need to push upstream to
Apache Arrow to make this successful.</p>

<hr />

<p>Looking back, almost all of this has come true! What a brilliant sage! (plz clap)</p>

<p>Like many advancements, there’s a right time, a right place, and a right group
of people. Unfortunately Databricks didn’t join the party until later on, but
they were a strong supporter of our initial work, providing guidance and helping to
make <a href="https://delta.io">Delta Lake</a> an ever-more thriving open source
community. The right people were all converging on the direction that made
this possible: <a href="https://github.com/nevi-me">Neville</a> helped make
<a href="https://github.com/apache/arrow-rs">arrow-rs</a> a much better <a href="https://parquet.apache.org">Apache
Parquet</a> writer. QP wrote the first version of the
protocol parser and created the first Python bindings for the library.
<a href="https://github.com/xianwill">Christian Williams</a> built out
<a href="https://github.com/delta-io/kafka-delta-ingest">kafka-delta-ingest</a> with
<a href="https://github.com/mosyp">Mykhailo Osypov</a> and helped prove that <strong>Rust is
way more efficient for data ingestion workloads</strong>. As time went on Will Jones,
Florian Valeye, and Robert Peck joined the party and helped turn delta-rs from
a small Scribd-motivated open source project into a thriving Rust and Python
project.</p>

<p><a href="https://bookshop.org/p/books/delta-lake-the-definitive-guide-modern-data-lakehouse-architectures-with-data-lakes-denny-lee/21429337?ean=9781098151942" target="_blank"><img src="/images/post-images/2024-deltalake/book-cover.jpg" align="right" width="200" /></a></p>

<p>Scribd had wild success with the data ingestion being in Rust, and the data
processing/query being in Spark. The community grew, Databricks grew, and at
some point some folks started working on a book.</p>

<p>As a long-time maintainer of delta-rs and talking head in the Delta and
Databricks ecosystem I was asked to be a technical reviewer of the book after
Prashanth, Scott, Tristen, and Denny had already gotten more than halfway
through the chapters.</p>

<p>I provided as much feedback as I could on their chapters. I reviewed the
outline and noticed “Chapter 8: TBD”.</p>

<p>What’s supposed to be Chapter 8? “<em>We’re not sure yet.</em>”</p>

<p>My friend <a href="https://kohsuke.org">Kohsuke</a> once marveled at how I was able to
acquire things for the <a href="https://jenkins.io">Jenkins project</a> by the simple act of
asking for them. There’s some skill involved in finding mutually beneficial
opportunities, but being uninhibited by the possibility somebody would say “no”
helps a lot.</p>

<p>“So this outline looks good, but when are you going to talk about Rust and
Python? There are dozens of us! Dozens!”</p>

<p><a href="https://dennyglee.com/">Denny</a> needed another chapter and I asked if I could
write about building native data applications in Rust and Python.</p>

<p>Suddenly I was helping to write a book.</p>

<hr />

<p><a href="https://tech.scribd.com">Scribd</a> is a fun company to work at. Books,
audiobooks, podcasts, articles. We have a deep appreciation for the written
word, telling stories, and learning. All of which I value highly. Before this
experience however I had never seen the <em>other</em> side of books. The creation,
the meetings, the rewrites, the edits, the reviews, going to press. It is
incredibly interesting and the team at O’Reilly are talented, helpful, and professional.</p>

<p>Going through copy-editing I was fielding review comments on the consistency of
tense, the subjects of sentences, and discussions about what is a proper noun and
how to consistently apply terms through <em>hundreds of pages</em> of content. I had
heard about how invaluable editors are; having now seen them in action, I am in
awe.</p>

<p>Over the years I have tried and failed to explain what I do to family members.
For people that don’t work in tech “working on the computer” all looks largely
the same, especially for older generations. Having your work, your name <em>in
print</em> has an intangible “wow” factor. More so than conference talks,
websites, GitHub stars, or branded t-shirts, a printed artifact recognizes the
accomplishments of the innumerable contributors to the Delta Lake ecosystem
over the years.</p>

<p>If you’re data inclined, I recommend picking up a copy; Prashanth, Scott,
Tristen, and Denny have written a very useful guide, and I contributed a
bit too! :)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="deltalake" /><category term="buoyantdata" /><summary type="html"><![CDATA[Nothing quite feels like “I made it!” like being published. Which is why I am thrilled to share that Delta Lake: The Definitive Guide is available for purchase, and I kind of helped! I wanted to share a little bit about how my contributions (Chapter 6!) came about, because my entrance into the Delta Lake ecosystem was about as unplanned as my authorship of part of this wonderful book.]]></summary></entry><entry><title type="html">Data and AI Summit 2024 presentations</title><link href="https://brokenco.de//2024/10/17/data-ai-summit-videos.html" rel="alternate" type="text/html" title="Data and AI Summit 2024 presentations" /><published>2024-10-17T00:00:00+00:00</published><updated>2024-10-17T00:00:00+00:00</updated><id>https://brokenco.de//2024/10/17/data-ai-summit-videos</id><content type="html" xml:base="https://brokenco.de//2024/10/17/data-ai-summit-videos.html"><![CDATA[<p>This year has been so jam packed full of activities that I forgot to share some
videos from <a href="https://www.buoyantdata.com/blog/2024-06-04-data-and-ai-summit.html">Data and AI Summit
2024</a> this
past summer! The annual conference hosted by Databricks has become one of my
favorites to meet with other <a href="https://delta.io">Delta Lake</a> users and
developers to discuss the future of large-scale data ingestion and processing. This year however, I overdid it a little bit.</p>

<p>Using the excuse of promoting my consulting/professional services company
<a href="https://buoyantdata.com">Buoyant Data</a> I had effectively <em>three</em> speaking
engagements:</p>

<ul>
  <li><strong>The road to delta-rs 1.0</strong> at the Open Source Contributor Summit (Monday)</li>
  <li><strong>Fast, cheap, and easy data ingestion with AWS Lambda and Delta Lake</strong>, a
talk highlighting a lot of the successful patterns I have developed for
customers using AWS Lambda with Delta Lake for Rust to create shockingly
cheap data ingestion pipelines. (Thursday)</li>
  <li><strong>Let’s do data engineering in Rust!</strong>, a more fun deep-dive talk to help
people start to get into the world of implementing data systems with Rust. (Thursday)</li>
</ul>

<p>Unfortunately the first talk was not recorded, but it was probably the most
interesting! On Monday morning I was riding my bike from the Ferry Building to
the venue in San Francisco and my chain snapped off while I was sprinting off
from a green light. I went down <strong>hard</strong>, scraped up my knees, and generally
looked a fool lying in the middle of Market St.</p>

<p>The show must go on, so I hobbled to the <a href="https://tech.scribd.com">Scribd</a>
office, deposited my broken bike, and continued to the Open Source Summit.</p>

<p>What I did not know at the time was that I had fractured a bone in my wrist. I
did know however that I needed to go to a clinic, but <em>really</em> wanted to attend
the summit and take advantage of the once-a-year opportunity (literally!) for
some of the brightest minds in the data community to talk about the future of
Delta Lake and more.</p>

<p>So that first talk was given with my swollen wrist pulled to my heart, like a
broken wing, and I’m <em>sure</em> it was a ludicrous sight to see!</p>

<p>By Thursday my arm had been set and was in a sling, which is far less exciting.
Nonetheless, the two talks below are perhaps the only one-handed presentations
thus far in my career! I hope you enjoy!</p>

<center>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/XPoWb9u06xA?si=SNccWEJxorszRGO1" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/Fr5Nx1wuQmQ?si=Svc3GtewzxUyGI4M" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>
</center>

<hr />

<p><em>Note</em>: The presentation software used for this talk is the open source
<a href="https://mfontanini.github.io/presenterm/introduction.html">presenterm</a> tool
which is delightful for creating development-focused presentations like this
one!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="deltalake" /><category term="buoyantdata" /><category term="presentation" /><summary type="html"><![CDATA[This year has been so jam packed full of activities that I forgot to share some videos from Data and AI Summit 2024 this past summer! The annual conference hosted by Databricks has become one of my favorites to meet with other Delta Lake users and developers to discuss the future of large-scale data ingestion and processing. This year however, I overdid it a little bit.]]></summary></entry><entry><title type="html">Improving lock performance for delta-rs</title><link href="https://brokenco.de//2023/11/29/locking-with-deltalake.html" rel="alternate" type="text/html" title="Improving lock performance for delta-rs" /><published>2023-11-29T00:00:00+00:00</published><updated>2023-11-29T00:00:00+00:00</updated><id>https://brokenco.de//2023/11/29/locking-with-deltalake</id><content type="html" xml:base="https://brokenco.de//2023/11/29/locking-with-deltalake.html"><![CDATA[<p>I have had the good fortune this year to help a number of organizations develop
and deploy native data applications in Python and Rust using a project I helped
found: <a href="https://github.com/delta-io/delta-rs">delta-rs</a>. At a high level
delta-rs is a Rust implementation of the <a href="https://github.com/delta-io/delta/blob/master/PROTOCOL.md">Delta Lake
protocol</a> which
offers ACID-like transactions for data lake use-cases. One of the big areas of
my focus has been in evaluating and improving performance in highly concurrent
runtime environments on AWS.</p>

<p>To help others understand the problem domain I spent some time earlier in the
week documenting the challenges in AWS on the Buoyant Data blog: <a href="https://www.buoyantdata.com/blog/2023-11-27-concurrency-limitations-with-deltalake-on-aws.html">Concurrency
limitations for Delta Lake on
AWS</a></p>

<blockquote>
  <p>In the case of AWS S3’s consistency model many operations are strongly
consistent, but concurrent operations on the same key are not. AWS encourages
application-level object locking, which delta-rs implements using AWS
DynamoDB.</p>
</blockquote>
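<p>For context, when using the Rust <code class="language-plaintext highlighter-rouge">deltalake</code> crate that lock is opt-in via
storage options. A minimal sketch, assuming the option names documented by
delta-rs around this time (the bucket, table path, and lock table name here are
hypothetical):</p>

<pre><code class="language-rust">use std::collections::HashMap;

#[tokio::main]
async fn main() {
    // Option names as documented by delta-rs at the time of writing;
    // double-check them against the version you are running
    let mut options = HashMap::new();
    options.insert("AWS_S3_LOCKING_PROVIDER".to_string(), "dynamodb".to_string());
    options.insert("DYNAMO_LOCK_TABLE_NAME".to_string(), "delta_rs_lock_table".to_string());

    let table = deltalake::open_table_with_storage_options("s3://my-bucket/my-table", options)
        .await
        .unwrap();
    println!("loaded table at version {}", table.version());
}
</code></pre>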

<p>AWS S3 is an incredible piece of technology that washes away a myriad of common
storage problems, and has been jokingly referred to as “the 8th wonder of the
world” by <a href="https://www.lastweekinaws.com/">Corey Quinn</a>. The lack of a
“putIfAbsent”-like semantic is however <em>very</em> annoying for the Delta Lake
protocol, adding the need for an application-wide <em>lock</em> for Delta users:</p>

<blockquote>
  <p>The dynamodb-lock approach allows for some sensible cooperation between
concurrent writers but the key limitation is that all concurrent operations
must synchronize on the table itself. There is no smaller division of
concurrency than a table operation</p>
</blockquote>

<p>In the blog post I offer some potential approaches to mitigate the weakness of
needing a table-level lock for concurrent Delta Lake writers on AWS, but the
problem will unfortunately remain, in some form or fashion, until S3
introduces a “putIfAbsent” semantic which allows writers to “put” a file only
if it doesn’t already exist, in an atomic way.</p>
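<p>The DynamoDB side of that workaround is worth seeing on its own: a conditional
put with <code class="language-plaintext highlighter-rouge">attribute_not_exists()</code> is effectively the “putIfAbsent” that S3
lacks. A sketch with the <code class="language-plaintext highlighter-rouge">aws-sdk-dynamodb</code> crate, where the lock table and
attribute names are hypothetical:</p>

<pre><code class="language-rust">use aws_sdk_dynamodb::types::AttributeValue;
use aws_sdk_dynamodb::Client;

// Try to acquire a table-level lock by creating an item only if no
// item with this key exists yet; a ConditionalCheckFailedException
// on conflict is the "somebody else holds the lock" signal
async fn try_acquire_lock(client: &amp;Client, table_uri: &amp;str) -> bool {
    client
        .put_item()
        .table_name("delta_log_lock") // hypothetical lock table name
        .item("tablePath", AttributeValue::S(table_uri.to_string()))
        .condition_expression("attribute_not_exists(tablePath)")
        .send()
        .await
        .is_ok()
}
</code></pre>

<p>Of course lock expiry and safe release add real complexity on top of this
one-shot acquisition, which is what the locking implementation used by delta-rs
layers on.</p>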

<p>For concurrent Delta writers I can offer some advice, but unfortunately
effective cooperative distributed concurrency at scale remains a challenging
problem! :)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="buoyantdata" /><category term="deltalake" /><category term="rust" /><summary type="html"><![CDATA[I have had the good fortune this year to help a number of organizations develop and deploy native data applications in Python and Rust using a project I helped found: delta-rs. At a high level delta-rs is a Rust implementation of the Delta Lake protocol which offers ACID-like transactions for data lake use-cases. One of the big areas of my focus has been in evaluating and improving performance in highly concurrent runtime environments on AWS.]]></summary></entry></feed>