<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://brokenco.de//feed/by_tag/databricks.xml" rel="self" type="application/atom+xml" /><link href="https://brokenco.de//" rel="alternate" type="text/html" /><updated>2026-04-12T21:39:52+00:00</updated><id>https://brokenco.de//feed/by_tag/databricks.xml</id><title type="html">rtyler</title><subtitle>a moderately technical blog</subtitle><author><name>R. Tyler Croy</name></author><entry><title type="html">Based Lake, a petabyte-scale low-latency data lake</title><link href="https://brokenco.de//2026/03/10/based-lake.html" rel="alternate" type="text/html" title="Based Lake, a petabyte-scale low-latency data lake" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://brokenco.de//2026/03/10/based-lake</id><content type="html" xml:base="https://brokenco.de//2026/03/10/based-lake.html"><![CDATA[<p>I had a chat today about building large scale low-latency data retrieval
systems around AWS S3. In doing so I got to share a bit of the talk proposal I
submitted to <a href="https://dataaisummit.com">Data and AI Summit</a> this year about
real-life work that has made it into production.</p>

<p>For years the conventional wisdom around <a href="https://delta.io">Delta Lake</a> has
been to <strong>not</strong> connect user-facing/online systems to Delta tables. Basically,
don’t point your Django app at your Delta tables. This continues to be a decent
<em>guideline</em> but definitely <strong>not a rule</strong> and I have the performance data to
back that up.</p>
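
<p>To make this concrete, here is a minimal sketch of what an “online” read
against a Delta table can look like from Python using the
<a href="https://github.com/delta-io/delta-rs">delta-rs</a> bindings. The table
URI, column name, and filter are purely illustrative, and a real service would
cache table metadata between requests:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pyarrow.dataset as ds
from deltalake import DeltaTable

# Hypothetical table URI, for illustration only
table = DeltaTable("s3://example-bucket/content-library")

# Partition and file-statistics pruning keep the number of Parquet
# pages actually fetched from S3 small for a point lookup
rows = (table.to_pyarrow_dataset()
             .to_table(filter=(ds.field("document_id") == 42)))
print(rows.to_pydict())
</code></pre></div></div>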

<p>My talk abstract:</p>

<blockquote>
  <p>Scribd hosts hundreds of millions of documents and has hundreds of billions of
objects across our buckets. Combining large-language models with massive
amounts of text has required investment in our new Content Library
architecture.  We selected Delta Lake as the underlying storage technology but
have pushed it to an extreme. Using the same Delta Lake architecture we offer
both direct data access for data scientists in Databricks Notebooks and online
data retrieval in milliseconds for user-facing web services.</p>

  <p>In this talk we will review principles of performance for each layer of the
stack: web APIs, the Delta Lake tables, Apache Parquet, and AWS S3.</p>
</blockquote>

<p>The work my colleague Eugene and I have done in this area builds heavily
on my previous research around <a href="/2025/06/24/low-latency-parquet.html">Low latency Parquet
reads</a>, which informed the work named <a href="https://tech.scribd.com/blog/2026/content-crush.html">Content
Crush</a>. I have
explored that work further on the Scribd tech blog and on the <a href="/2026/02/13/screaming-in-the-cloud.html">Screaming in the
Cloud</a> podcast.</p>

<p>I really hope that I am able to share results at Data and AI Summit from this
incredibly challenging work that I am undertaking. But even if I don’t, blog
posts like my musings on <a href="/2026/01/19/multimodal-delta-lake.html">Multimodal with Delta
Lake</a>, <a href="https://www.buoyantdata.com/blog/2024-12-31-high-concurrency-logstore.html">scaling streaming Delta Lake
applications</a>,
and a myriad of other articles I have published can be pieced together to form
the larger mosaic of insane large-scale data work I have been hammering on!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="arrow" /><category term="parquet" /><category term="deltalake" /><category term="databricks" /><category term="scribd" /><summary type="html"><![CDATA[I had a chat today about building large scale low-latency data retrieval systems around AWS S3. In doing so I got to share a bit of the talk proposal I submitted to Data and AI Summit this year about real-life work that has made it into production.]]></summary></entry><entry><title type="html">On Data Engineering Central</title><link href="https://brokenco.de//2026/02/04/data-engineering-central.html" rel="alternate" type="text/html" title="On Data Engineering Central" /><published>2026-02-04T00:00:00+00:00</published><updated>2026-02-04T00:00:00+00:00</updated><id>https://brokenco.de//2026/02/04/data-engineering-central</id><content type="html" xml:base="https://brokenco.de//2026/02/04/data-engineering-central.html"><![CDATA[<p>I was lucky enough to <a href="https://dataengineeringcentral.substack.com/p/the-lakehouse-architecture-multimodal">record a podcast
episode</a>
with Daniel Beach of Data Engineering Central. Daniel and I have known each
other for a couple years sharing notes and ideas on the state of the ecosystem,
where it falls down, and where things are getting interesting.</p>

<p>In my opinion <a href="https://dataengineeringcentral.substack.com">Data Engineering
Central</a> has been one of the most
useful wide-ranging surveys of the ecosystem, curated by one crazy
mid-westerner: Daniel. He pulls no punches, and while we share criticisms of AI
in the industry and commercial tools, Daniel’s honesty has also put some of my
work on blast, such as <a href="https://dataengineeringcentral.substack.com/p/_internaldeltaprotocolerror">this
post</a>
about some terrible user-experience and lopsided Delta Lake support in
<a href="https://github.com/delta-io/deltars">delta-rs</a>.</p>

<p>In his post Daniel highlights some of the topics we got into during our time chatting:</p>

<blockquote>
  <ul>
    <li>What the Lakehouse architecture gets right—and where it still falls short</li>
    <li>Why multimodal data (text, images, audio, video, embeddings) changes everything</li>
    <li>How open table formats like Delta Lake fit into the next generation of data platforms</li>
    <li>The growing gap between data tooling hype and day-to-day data engineering reality</li>
    <li>What skills and architectural thinking will matter most for data engineers over the next decade</li>
  </ul>
</blockquote>

<center><iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/WLlko-liHMg?si=9aGp1v-6nm2kbya0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe></center>

<p>I encourage you to <a href="https://dataengineeringcentral.substack.com/">subscribe</a> to
his newsletter or if that’s not your jam, you can <a href="https://dataengineeringcentral.substack.com/feed">subscribe to the RSS
feed</a> too.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="software" /><category term="dataeng" /><category term="buoyantdata" /><category term="databricks" /><category term="podcast" /><summary type="html"><![CDATA[I was lucky enough to record a podcast episode with Daniel Beach of Data Engineering Central. Daniel and I have known each other for a couple years sharing notes and ideas on the state of the ecosystem, where it falls down, and where things are getting interesting.]]></summary></entry><entry><title type="html">From the beginning, delta-rs to Delta Lake: The Definitive Guide</title><link href="https://brokenco.de//2024/11/15/deltalake-the-definitive-guide.html" rel="alternate" type="text/html" title="From the beginning, delta-rs to Delta Lake: The Definitive Guide" /><published>2024-11-15T00:00:00+00:00</published><updated>2024-11-15T00:00:00+00:00</updated><id>https://brokenco.de//2024/11/15/deltalake-the-definitive-guide</id><content type="html" xml:base="https://brokenco.de//2024/11/15/deltalake-the-definitive-guide.html"><![CDATA[<p>Nothing quite feels like “I made it!” like being <em>published</em>. Which is why I am
thrilled to share that <a href="https://bookshop.org/p/books/delta-lake-the-definitive-guide-modern-data-lakehouse-architectures-with-data-lakes-denny-lee/21429337?ean=9781098151942">Delta Lake: The Definitive
Guide</a>
is available for purchase, and I kind of helped! I wanted to share a little bit
about how my contributions (Chapter 6!) came about, because my entrance into
the <a href="https://delta.io">Delta Lake</a> ecosystem was about as unplanned as my
authorship of part of this wonderful book.</p>

<p>The <a href="https://github.com/delta-io/delta-rs">delta-rs</a> project started in 2020, and I wish that I could say it was because
I am a brilliant visionary. The project largely started because I have had a
bias against JVM-based technology stacks and I had stepped into a role at
<a href="https://tech.scribd.com">Scribd</a> where we were migrating to AWS, Databricks,
and a new architecture <em>anyways</em> so why not challenge the orthodoxy? My
colleague <a href="https://about.houqp.me/">QP Hou</a> and I were loving Rust and liked
Delta Lake from a design standpoint, but did not love <a href="https://spark.apache.org">Apache
Spark</a> for some of the things we needed to do.</p>

<p>I would consider the official start of the project to be April 11th, 2020 when
I sent our Databricks colleagues the following:</p>

<hr />

<p>Greetings! As I mentioned in our weekly sync up this week, we have an interest
in partnering with Databricks to develop and open source a native client
interface for Delta Lake.</p>

<p>For framing this conversation and scope of the native interface, I categorize
our compute workloads into three groups:</p>

<ol>
  <li><strong>Big offline data processing</strong>, requiring a cluster of compute resources where Spark makes a big dent.</li>
  <li><strong>Lightweight/small offline data processing</strong>, workloads needing “fractional
compute” resources, basically less than a single machine. (Ruby/Python type
tasks which move data around, or perform small-scale data accesses make up
the majority of these in our current infrastructure, we’ve discussed using
the Databricks Light runtime for these in the past, since the cost to
deploy/run these small tasks on Databricks clusters doesn’t make sense).</li>
  <li><strong>Boundary data-processing</strong>, where the task might involve a little bit of
production “online” data and a little bit of warehouse “offline” data to
complete its work. In our environment we have Ruby scripts whose sole job is
to sync pre-computed (by Spark) offline data into online data stores for the
production Rails application, etc, to access and serve.</li>
</ol>

<p>I don’t want to burn down our current investment in Ruby for many of the 2nd
and 3rd workloads, not to mention retraining a number of developers in-house to
learn how to effectively use Scala or pySpark.</p>

<p>My proposal is that we partner with Databricks and jointly develop an open
source client interface for Delta Lake. One where we would have at least one
developer from Databricks working with at least one developer from Scribd on a
jointly scoped effort to deliver a library capable of <em>initially</em> addressing
our ‘2’ and ‘3’ use-cases.</p>

<p>[..]</p>

<p>Further, I propose that we jointly develop a client interface in Rust, which
will allow us to easily extend that within the Databricks community to support
Golang, Python, Ruby, and Node clients.</p>

<p>The key benefits I imagine for us all:</p>

<ul>
  <li>
    <p>Much broader market share for Delta Lake as a technology. Not only would
companies like Scribd benefit, and continue to invest in Delta Lake, but
other companies would have an easier on-ramp into the Databricks ecosystem.
Basically, if you start using Delta Lake before you use Spark, you will (I
guarantee) reach a point where these lightweight workloads become heavyweight
workloads requiring the full power and glory of the Databricks runtime :D</p>
  </li>
  <li>
    <p>It’s a fantastic developer advocacy story that hits a number of key bullet
marketing points: open source, partner collaboration, Rust (so hot right now) :)</p>
  </li>
  <li>
    <p>Scribd is able to “immediately” take advantage of Delta Lake benefits without
burning up all our existing codebase and investment in Ruby tasks and
tooling. Thereby allowing for an easier onramp into Delta Lake and the
Databricks platform as a whole.</p>
  </li>
</ul>

<p>The scope of the effort I think would be largely around properly dealing with
the transaction log, since the Apache Arrow project has already created a
pretty decent <a href="https://crates.io/crates/parquet">parquet crate</a> in Rust. That
said, there may be some writer improvements we’d want/need to push upstream to
Apache Arrow to make this successful.</p>

<hr />

<p>Looking back, almost all of this has come true! What a brilliant sage! (plz clap)</p>

<p>Like many advancements, there’s a right time, a right place, and a right group
of people. Unfortunately Databricks didn’t join the party until later on, but
they were a strong supporter of our initial work, providing guidance and helping to
make <a href="https://delta.io">Delta Lake</a> an ever-more thriving open source
community. The right people were all converging on the direction that made
this possible: <a href="https://github.com/nevi-me">Neville</a> helped make
<a href="https://github.com/apache/arrow-rs">arrow-rs</a> a much better <a href="https://parquet.apache.org">Apache
Parquet</a> writer. QP wrote the first version of the
protocol parser and created the first Python bindings for the library.
<a href="https://github.com/xianwill">Christian Williams</a> built out
<a href="https://github.com/delta-io/kafka-delta-ingest">kafka-delta-ingest</a> with
<a href="https://github.com/mosyp">Mykhailo Osypov</a> and helped prove that <strong>Rust is
way more efficient for data ingestion workloads</strong>. As time went on Will Jones,
Florian Valeye, and Robert Peck joined the party and helped turn delta-rs from
a small Scribd-motivated open source project into a thriving Rust and Python
project.</p>

<p><a href="https://bookshop.org/p/books/delta-lake-the-definitive-guide-modern-data-lakehouse-architectures-with-data-lakes-denny-lee/21429337?ean=9781098151942" target="_blank"><img src="/images/post-images/2024-deltalake/book-cover.jpg" align="right" width="200" /></a></p>

<p>Scribd had wild success with the data ingestion being in Rust, and the data
processing/query being in Spark. The community grew, Databricks grew, and at
some point some folks started working on a book.</p>

<p>As a long-time maintainer of delta-rs and talking head in the Delta and
Databricks ecosystem I was asked to be a technical reviewer of the book after
Prashanth, Scott, Tristen, and Denny had already gotten more than halfway
through the chapters.</p>

<p>I provided as much feedback as I could on their chapters. I reviewed the
outline and noticed “Chapter 8: TBD”.</p>

<p>What’s supposed to be Chapter 8? “<em>We’re not sure yet.</em>”</p>

<p>My friend <a href="https://kohsuke.org">Kohsuke</a> once marveled at how I was able to
acquire things for the <a href="https://jenkins.io">Jenkins project</a> by the simple act of
asking for them. There’s some skill involved in finding mutually beneficial
opportunities, but being uninhibited by the possibility somebody would say “no”
helps a lot.</p>

<p>“So this outline looks good, but when are you going to talk about Rust and
Python? There are dozens of us! Dozens!”</p>

<p><a href="https://dennyglee.com/">Denny</a> needed another chapter and I asked if I could
write about building native data applications in Rust and Python.</p>

<p>Suddenly I was helping to write a book.</p>

<hr />

<p><a href="https://tech.scribd.com">Scribd</a> is a fun company to work at. Books,
audiobooks, podcasts, articles. We have a deep appreciation for the written
word, telling stories, and learning. All of which I value highly. Before this
experience however I had never seen the <em>other</em> side of books. The creation,
the meetings, the rewrites, the edits, the reviews, going to press. It is
incredibly interesting and the team at O’Reilly are talented, helpful, and professional.</p>

<p>Going through copy-editing I was fielding review comments on the consistency of
tense, the subject of sentences, discussions about what is a proper noun and
how to consistently apply terms through <em>hundreds of pages</em> of content. I have
heard about how invaluable editors are; having now seen them in action, I am in
awe.</p>

<p>Over the years I have tried and failed to explain what I do to family members.
For people that don’t work in tech, “working on the computer” all looks largely
the same, especially for older generations. Having your work, your name <em>in
print</em> has an intangible “wow” factor. More so than conference talks,
websites, GitHub stars, or branded t-shirts, a printed artifact recognizes the
accomplishments of the innumerable contributors to the Delta Lake ecosystem
over the years.</p>

<p>If you’re data inclined, I recommend picking up a copy, Prashanth, Scott,
Tristen, and Denny have written a very useful guide, and also I contributed a
bit too! :)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="deltalake" /><category term="buoyantdata" /><summary type="html"><![CDATA[Nothing quite feels like “I made it!” like being published. Which is why I am thrilled to share that Delta Lake: The Definitive Guide is available for purchase, and I kind of helped! I wanted to share a little bit about how my contributions (Chapter 6!) came about, because my entrance into the Delta Lake ecosystem was about as unplanned as my authorship of part of this wonderful book.]]></summary></entry><entry><title type="html">Data and AI Summit 2024 presentations</title><link href="https://brokenco.de//2024/10/17/data-ai-summit-videos.html" rel="alternate" type="text/html" title="Data and AI Summit 2024 presentations" /><published>2024-10-17T00:00:00+00:00</published><updated>2024-10-17T00:00:00+00:00</updated><id>https://brokenco.de//2024/10/17/data-ai-summit-videos</id><content type="html" xml:base="https://brokenco.de//2024/10/17/data-ai-summit-videos.html"><![CDATA[<p>This year has been so jam packed full of activities that I forgot to share some
videos from <a href="https://www.buoyantdata.com/blog/2024-06-04-data-and-ai-summit.html">Data and AI Summit
2024</a> this
past summer! The annual conference hosted by Databricks has become one of my
favorites to meet with other <a href="https://delta.io">Delta Lake</a> users and
developers to discuss the future of large-scale data ingestion and processing. This year however, I overdid it a little bit.</p>

<p>Using the excuse of promoting my consulting/professional services company
<a href="https://buoyantdata.com">Buoyant Data</a>, I had effectively <em>three</em> speaking
engagements:</p>

<ul>
  <li><strong>The road to delta-rs 1.0</strong> at the Open Source Contributor Summit (Monday)</li>
  <li><strong>Fast, cheap, and easy data ingestion with AWS Lambda and Delta Lake</strong>, a
talk highlighting a lot of the successful patterns I have developed for
customers using AWS Lambda with Delta Lake for Rust to create shockingly
cheap data ingestion pipelines. (Thursday)</li>
  <li><strong>Let’s do data engineering in Rust!</strong>, a more fun deep-dive talk to help
people start to get into the world of implementing data systems with Rust. (Thursday)</li>
</ul>

<p>Unfortunately the first talk was not recorded, but it was probably the most
interesting! On Monday morning I was riding my bike from the Ferry Building to
the venue in San Francisco and my chain snapped off while I was sprinting off
from a green light. I went down <strong>hard</strong>, scraped up my knees, and generally
looked a fool lying in the middle of Market St.</p>

<p>The show must go on, so I hobbled to the <a href="https://tech.scribd.com">Scribd</a>
office, deposited my broken bike, and continued to the Open Source Summit.</p>

<p>What I did not know at the time was that I had fractured a bone in my wrist. I
did know however that I needed to go to a clinic, but <em>really</em> wanted to attend
the summit and take advantage of the once-a-year opportunity (literally!) for
some of the brightest minds in the data community to talk about the future of
Delta Lake and more.</p>

<p>So that first talk was given with my swollen wrist pulled to my heart, like a
broken wing, and I’m <em>sure</em> it was a ludicrous sight to see!</p>

<p>By Thursday my arm had been set and was in a sling, which is far less exciting.
Nonetheless, the two talks below are perhaps the only one-handed presentations
thus far in my career! I hope you enjoy!</p>

<center>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/XPoWb9u06xA?si=SNccWEJxorszRGO1" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/Fr5Nx1wuQmQ?si=Svc3GtewzxUyGI4M" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>
</center>

<hr />

<p><em>Note</em>: The presentation software used for this talk is the open source
<a href="https://mfontanini.github.io/presenterm/introduction.html">presenterm</a> tool
which is delightful for creating development-focused presentations like this
one!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="deltalake" /><category term="buoyantdata" /><category term="presentation" /><summary type="html"><![CDATA[This year has been so jam packed full of activities that I forgot to share some videos from Data and AI Summit 2024 this past summer! The annual conference hosted by Databricks has become one of my favorites to meet with other Delta Lake users and developers to discuss the future of large-scale data ingestion and processing. This year however, I overdid it a little bit.]]></summary></entry><entry><title type="html">The problem with ML</title><link href="https://brokenco.de//2023/01/04/the-problem-with-ml.html" rel="alternate" type="text/html" title="The problem with ML" /><published>2023-01-04T00:00:00+00:00</published><updated>2023-01-04T00:00:00+00:00</updated><id>https://brokenco.de//2023/01/04/the-problem-with-ml</id><content type="html" xml:base="https://brokenco.de//2023/01/04/the-problem-with-ml.html"><![CDATA[<p>The holidays are the time of year when I typically field a lot of questions
from relatives about technology or the tech industry, and this year my favorite
questions were around <strong>AI</strong>. (<em>insert your own scary music</em>) Machine-learning
(ML) or Artificial Intelligence (AI) are being widely deployed and I have some
<strong>Problems™</strong> with that. Machine learning is not necessarily a new
domain, the practices commonly accepted as “ML” have been used for quite a
while to support search and recommendations use-cases. In fact, my day job
includes supporting data scientists and those who are actively creating models
and deploying them to production. <em>However</em>, many of my relatives outside of the tech industry believe that “AI” is going to replace people, their jobs, and/or run the future. I genuinely hope AI/ML comes nowhere close to this future imagined by members of my family.</p>

<p>Like many pieces of technology, ML is not inherently good or bad, but the
problem with ML as it is applied today is that <strong>its application is far
outpacing our understanding of its consequences</strong>.</p>

<p>Brian Kernighan, co-author of <em>The C Programming Language</em> and early UNIX contributor, said:</p>

<blockquote>
  <p>Everyone knows that debugging is twice as hard as writing a program in the
first place. So if you’re as clever as you can be when you write it, how will
you ever debug it?</p>
</blockquote>

<p>Setting aside the <em>mountain</em> of ethical concerns around the application of ML
which have and should continue to be discussed in the technology industry,
there’s a fundamental challenge with ML-based systems: I don’t think their
creators understand how they work, how their conclusions are determined, or how
to consistently improve them over time. Imagine you are a data scientist or ML
developer, how confident are you in what your models will predict between
experiments or evolutions of the model? Would you be willing to testify in a
court of law about the veracity of your model’s output?</p>

<p>Imagine you are a developer working on the models that Tesla’s “full
self-driving” (FSD) mode relies upon. Your model has been implicated in a Tesla
killing the driver and/or pedestrians (which <a href="https://www.reuters.com/business/autos-transportation/us-probing-fatal-tesla-crash-that-killed-pedestrian-2021-09-03/">has
happened</a>).
Do you think it would be possible to convince a judge and jury that your model
is <em>not</em> programmed to mow down pedestrians outside of a crosswalk? How do you
prove what a model is or is not supposed to do given never before seen inputs?</p>

<p>Traditional software <em>does</em> have a variation of this problem but source code
lends itself to scrutiny far better than ML models, many of which have come
from successive evolutions of public training data, proprietary model changes,
and integrations with new data sources.</p>

<p>These problems may be solvable in the ML ecosystem, but the problem is that the
application of ML is outpacing our ability to understand, monitor, and diagnose
models when they do harm.</p>

<p>That model your startup is working on to help accelerate home loan approvals
based on historical mortgages: how do you assert that your models are not
re-introducing racist policies like
<a href="https://en.wikipedia.org/wiki/Redlining">redlining</a>? (Forms of this <a href="https://fortune.com/2020/02/11/a-i-fairness-eye-on-a-i/">have happened</a>.)</p>

<p>How about that fun image generation (AI art!) project you have been tinkering
with? It uses a publicly available model that was trained on millions of images
from the internet, and as a result in some cases unintentionally outputs
explicit images, or even what some jurisdictions might consider bordering on
child pornography. (Forms of this <a href="https://www.wired.com/story/lensa-artificial-intelligence-csem/">have
happened</a>.)</p>

<p>Really, any model you train on data “from the internet” is asking for
racist, pornographic, or otherwise offensive results, as the <a href="https://www.cbsnews.com/news/microsoft-shuts-down-ai-chatbot-after-it-turned-into-racist-nazi/">Microsoft
Tay</a>
example should have taught us.</p>

<p>Can you imagine the human-rights nightmare that could ensue from shoddy ML
models being brought into a healthcare setting? Law-enforcement? Or even
military settings?</p>

<hr />

<p>Machine-learning encompasses a very powerful set of tools and patterns, but our
ability to predict how those models will be used, what they will output, or how
to prevent negative outcomes is <em>dangerously</em> insufficient for use outside
of search and recommendation systems.</p>

<p>I understand how models are developed, how they are utilized, and what I
<em>think</em> they’re supposed to do.</p>

<p>Fundamentally the challenge with AI/ML is that we understand how to “make it
work”, but we don’t understand <em>why</em> it works.</p>

<p>Nonetheless we keep deploying “AI” anywhere there’s funding, consequences be
damned.</p>

<p>And that’s a problem.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="software" /><category term="ml" /><category term="aws" /><category term="databricks" /><summary type="html"><![CDATA[The holidays are the time of year when I typically field a lot of questions from relatives about technology or the tech industry, and this year my favorite questions were around AI. (insert your own scary music) Machine-learning (ML) or Artificial Intelligence (AI) are being widely deployed and I have some Problems™ with that. Machine learning is not necessarily a new domain, the practices commonly accepted as “ML” have been used for quite a while to support search and recommendations use-cases. In fact, my day job includes supporting data scientists and those who are actively creating models and deploying them to production. However, many of my relatives outside of the tech industry believe that “AI” is going to replace people, their jobs, and/or run the future. I genuinely hope AI/ML comes nowhere close to this future imagined by members of my family.]]></summary></entry><entry><title type="html">Meet Buoyant Data, and let me reduce your data platform costs</title><link href="https://brokenco.de//2023/01/02/introducing-buoyant-data.html" rel="alternate" type="text/html" title="Meet Buoyant Data, and let me reduce your data platform costs" /><published>2023-01-02T00:00:00+00:00</published><updated>2023-01-02T00:00:00+00:00</updated><id>https://brokenco.de//2023/01/02/introducing-buoyant-data</id><content type="html" xml:base="https://brokenco.de//2023/01/02/introducing-buoyant-data.html"><![CDATA[<p>One of the many things I learned in 2022 is that I have a particular knack for
understanding, analyzing, and optimizing the costs of data platform
infrastructure. These skills were born out of both curiosity and necessity in
the current economic climate, and have led me to start a small consultancy on
the side: <a href="https://www.buoyantdata.com/">Buoyant Data</a>. Big data infrastructure
can be hugely valuable to lots of businesses, but unfortunately it’s also an
area of the cloud bills that is frequently misunderstood, that’s something that
I can help with!</p>

<p><a href="https://www.duckbillgroup.com/about/">Mike Julian</a> from <a href="https://www.duckbillgroup.com/">The Duckbill
Group</a> once made the proclamation that the way
to <em>actually</em> save money in AWS is to design your infrastructure to be
cost-effective. “Optimization” techniques can only take you so far, and once
you’ve burned through all the optimizations, you may find yourself needing to
further reduce the cost of your infrastructure and have no more “fat” to trim! In the <a href="https://www.buoyantdata.com/blog/2022-12-18-initial-commit.html">first blog post</a> I outline a “reference architecture” for a data platform which I <strong>know</strong> is cost-effective, easy to manage, and lends itself well to growth.</p>

<p>Planning for sensible, cost-conscious growth is <em>very</em> important. As most data
platforms start to prove their value, the organization will bring even
<em>more</em> workloads to them. <a href="https://en.wikipedia.org/wiki/If_You_Give_a_Mouse_a_Cookie">If you give a data scientist a good
platform</a>, they
will find themselves wanting ever more from that data platform, and Buoyant
Data can help make sure that growth is sustainable <strong>and</strong> the value to the
business is easy to identify as well.</p>

<p>Please add the Buoyant Data <a href="https://www.buoyantdata.com/rss.xml">RSS feed</a> to your reader, as I have a number of blog posts queued up already with some gratis tips and tricks for understanding the cost of your data platform! 😄</p>

<hr />

<p>The technology stack for Buoyant Data is something I cannot wait to write more
about. After funding the creation of
<a href="https://github.com/delta-io/delta-rs">delta-rs</a> as part of my day job, I am
utilizing the library in a <strong>big</strong> way to build extremely lightweight and
cost-efficient data ingestion pipelines with Rust and AWS Lambda. There’s still
plenty of space for <a href="https://spark.apache.org">Apache Spark</a> on the querying
and processing side, but as
<a href="https://github.com/apache/arrow-datafusion">DataFusion</a> matures, I’m looking
forward to exploring where that can fit into the picture.</p>
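
<p>To give a flavor of how lightweight these ingestion pipelines can be, below
is a minimal sketch of a Lambda-style handler built on the
<code class="language-plaintext highlighter-rouge">deltalake</code> Python package;
the event shape, table URI, and column names are all hypothetical:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

import pandas as pd
from deltalake import write_deltalake

def handler(event, context):
    # Hypothetical event shape: a small batch of JSON records,
    # e.g. from an SQS-triggered invocation
    records = [json.loads(r["body"]) for r in event["Records"]]
    df = pd.DataFrame.from_records(records)
    # Append the micro-batch straight to a Delta table on S3,
    # no cluster required (the table URI is illustrative)
    write_deltalake("s3://example-bucket/delta/events", df, mode="append")
    return {"written": len(df)}
</code></pre></div></div>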

<p>There’s a lot of evolution happening right now in the data and ML platform
space, I’m really looking forward to growing <a href="https://buoyantdata.com">Buoyant
Data</a> in my spare time!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="software" /><category term="deltalake" /><category term="aws" /><summary type="html"><![CDATA[One of the many things I learned in 2022 is that I have a particular knack for understanding, analyzing, and optimizing the costs of data platform infrastructure. These skills were born out of both curiosity and necessity in the current economic climate, and have led me to start a small consultancy on the side: Buoyant Data. Big data infrastructure can be hugely valuable to lots of businesses, but unfortunately it’s also an area of the cloud bill that is frequently misunderstood, and that’s something that I can help with!]]></summary></entry><entry><title type="html">Local SQL querying in Jupyter Notebooks</title><link href="https://brokenco.de//2022/04/29/local-sql-with-jupyter.html" rel="alternate" type="text/html" title="Local SQL querying in Jupyter Notebooks" /><published>2022-04-29T00:00:00+00:00</published><updated>2022-04-29T00:00:00+00:00</updated><id>https://brokenco.de//2022/04/29/local-sql-with-jupyter</id><content type="html" xml:base="https://brokenco.de//2022/04/29/local-sql-with-jupyter.html"><![CDATA[<p>Designing, working with, or thinking about data consumes the vast majority of
my time these days, but almost all of that has been “in the cloud” rather than
locally. I recently watched <a href="https://www.youtube.com/watch?v=RqubKSF3wig">this talk about SQLite and
Go</a> which served as a good
reminder that I have a pretty powerful computer at my fingertips, and that
perhaps not all my workloads require a big <a href="https://spark.apache.org">Spark</a>
cluster in the sky. Shortly after watching that video I stumbled into a small
(200k rows) data set which I needed to run some queries against, and my first
attempt at auto-ingesting it into a <a href="https://delta.io">Delta table</a> in
Databricks failed, so I decided to launch a local <a href="https://jupyter.org/">Jupyter
notebook</a> and give it a try!</p>

<p>My originating data set was a comma-separated values file (CSV), so my first
intent was to just load it into SQLite using the <code class="language-plaintext highlighter-rouge">.mode csv</code> command in the
CLI, but I found that to be a bit restrictive. Notebooks have incredible
utility for incrementally working on data. Unfortunately Jupyter doesn’t have a
native SQL interface; instead everything has to run through Python. Through my
work with <a href="https://github.com/delta-io/delta-rs">delta-rs</a> I am somewhat
familiar with <a href="https://pandas.pydata.org/">Pandas</a> for processing data in
Python, so my first attempts were using the Pandas data frame API to munge
through my data.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pandas</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data/2021_05-2022_04.csv'</span><span class="p">)</span>
</code></pre></div></div>

<p>I could be dense, but I find SQL to be a pretty understandable tool in
comparison to data frames, so I needed to find some way to get the data into a
SQL interface. The solution that I ended up with was to create an in-memory
SQLite database and use Pandas to query it, which works <em>okay enough</em> to where
I continued working and didn’t bother thinking too much about how to optimize
the approach further:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sqlite3</span>
<span class="kn">import</span> <span class="nn">pandas</span>

<span class="c1"># Loading everything into a SQLite memory database because I hate data frames and SQL is nice
</span><span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">':memory:'</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pandas</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data/2021_05-2022_04.csv'</span><span class="p">)</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">to_sql</span><span class="p">(</span><span class="s">'usage'</span><span class="p">,</span> <span class="n">conn</span><span class="p">,</span> <span class="n">if_exists</span><span class="o">=</span><span class="s">'replace'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="c1"># useful little helper
</span><span class="n">sql</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">pandas</span><span class="p">.</span><span class="n">read_sql_query</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">conn</span><span class="p">)</span>


<span class="c1"># Show some sample data
</span><span class="n">sql</span><span class="p">(</span><span class="s">'SELECT * FROM usage LIMIT 3'</span><span class="p">)</span>
</code></pre></div></div>

<p>The benefit of this approach is that I can create additional tables in the
SQLite database with static data sets, or other CSVs. Since I’m also just doing
some simple ad-hoc analysis, I can skip writing anything to disk and keep
things snappy in memory.</p>
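
<p>For example, joining the usage data against a second CSV is just another
<code class="language-plaintext highlighter-rouge">to_sql</code> call away; this
builds on the snippet above, and the file and column names here are hypothetical:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Load a second, hypothetical CSV into its own in-memory table
owners = pandas.read_csv('data/team_owners.csv')
owners.to_sql('owners', conn, if_exists='replace', index=False)

# Plain SQL joins then work across both tables
sql("""
    SELECT u.*, o.owner
    FROM usage u
    LEFT JOIN owners o ON o.service = u.service
    LIMIT 5
""")
</code></pre></div></div>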

<p>I created the little <code class="language-plaintext highlighter-rouge">sql</code> lambda to make the notebook a bit more
understandable, and to avoid exposing the cursor or database connection to
every single cell, meaning that most of my cells in the notebook are simply
<code class="language-plaintext highlighter-rouge">sql('SELECT * FROM foo')</code> statements with some documentation surrounding
them.</p>

<p>Fairly simple, easy enough to play with data quickly on my local machine
without invoking all the infinite cosmic powers the cloud provides!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="dataeng" /><category term="databricks" /><summary type="html"><![CDATA[Designing, working with, or thinking about data consumes the vast majority of my time these days, but almost all of that has been “in the cloud” rather than locally. I recently watched this talk about SQLite and Go which served as a good reminder that I have a pretty powerful computer at my fingertips, and that perhaps not all my workloads require a big Spark cluster in the sky. Shortly after watching that video I stumbled into a small (200k rows) data set which I needed to run some queries against, and my first attempt at auto-ingesting it into a Delta table in Databricks failed, so I decided to launch a local Jupyter notebook and give it a try!]]></summary></entry><entry><title type="html">I’m a Databricks Beacon</title><link href="https://brokenco.de//2021/10/21/databricks-beacon.html" rel="alternate" type="text/html" title="I’m a Databricks Beacon" /><published>2021-10-21T00:00:00+00:00</published><updated>2021-10-21T00:00:00+00:00</updated><id>https://brokenco.de//2021/10/21/databricks-beacon</id><content type="html" xml:base="https://brokenco.de//2021/10/21/databricks-beacon.html"><![CDATA[<p>A bit of belated news but thanks to all the advocacy work we have been doing at
<a href="https://tech.scribd.com">Scribd</a>, I am now a <a href="https://databricks.com/discover/beacons/tyler-croy">Databricks
Beacon</a>. The Beacon program is similar
to Docker Captains, Microsoft MVPs, or Java Champions, a group of folks who are
considered skilled both with the technology and in communicating/sharing best
practices, tips, and shortcomings with the broader community.</p>

<p><img src="/images/post-images/databricks-beacons/header-image.png" alt="Beacon profile" /></p>

<p>From the <a href="https://databricks.com/discover/beacons/">site</a> itself:</p>

<blockquote>
  <p>The Databricks Beacons program is our way to thank and recognize the community members, data scientists, data engineers, developers and open source enthusiasts who go above and beyond to uplift the data and AI community.</p>

  <p>Whether they are speaking at conferences, leading workshops, teaching, mentoring, blogging, writing books, creating tutorials, offering support in forums or organizing meetups, they inspire others and encourage knowledge sharing – all while helping to solve tough data problems.</p>
</blockquote>

<p>I’m flattered to be included in the inaugural group of Beacons, which includes a
number of much more competent data leaders than myself. Most of what I bring to
the table is a <em>lot</em> of <a href="https://delta.io">Delta Lake</a> experience and advocacy.
Delta Lake is the bedrock of Scribd’s data platform and I have been investing
heavily in the space with our contribution of the
<a href="https://github.com/delta-io/delta-rs">delta-rs</a> Rust bindings, upon which
<a href="https://www.youtube.com/watch?v=mLmsZ3qYfB0">kafka-delta-ingest</a> was built.</p>

<p><a href="https://databricks.com/customers/data-team-effect/scribd">Scribd is a Databricks
customer</a>, and from
that angle I have been quite impressed with the organization and technologies
they have built. As folks who have seen <a href="https://youtu.be/h5bRBuVmhL4?t=1635">my public talks</a> about Databricks know,
I don’t hold back in my honest assessment of the platform’s strengths and
weaknesses, thus my surprise to be included as a Beacon ;)</p>

<p>I’m looking forward to more events where I am able to share some of the
real-world experiences we’re gaining at Scribd in building out massive data
platform systems with Delta Lake and Databricks. And as always, if you want to <a href="https://tech.scribd.com/careers/#open-positions">help us build out more</a>, feel free to email me!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="scribd" /><category term="databricks" /><summary type="html"><![CDATA[A bit of belated news but thanks to all the advocacy work we have been doing at Scribd, I am now a Databricks Beacon. The Beacon program is similar to Docker Captains, Microsoft MVPs, or Java Champions, a group of folks who are considered skilled both with the technology and in communicating/sharing best practices, tips, and shortcomings with the broader community.]]></summary></entry><entry><title type="html">Building a real-time data platform with Apache Spark and Delta Lake</title><link href="https://brokenco.de//2020/07/20/realtime-spark-deltalake.html" rel="alternate" type="text/html" title="Building a real-time data platform with Apache Spark and Delta Lake" /><published>2020-07-20T00:00:00+00:00</published><updated>2020-07-20T00:00:00+00:00</updated><id>https://brokenco.de//2020/07/20/realtime-spark-deltalake</id><content type="html" xml:base="https://brokenco.de//2020/07/20/realtime-spark-deltalake.html"><![CDATA[<p>The <a href="/2019/08/28/real-time-data-platform.html">Real-time Data Platform</a> is one
of the fun things we have been building at Scribd since I joined in 2019. Last
month I was fortunate enough to share some of our approach in a presentation at
Spark and AI Summit titled: “The revolution will be streamed.” At a high level,
what I had branded the “Real-time Data Platform” is really: <a href="https://kafka.apache.org">Apache
Kafka</a>, <a href="https://airflow.apache.org">Apache Airflow</a>,
<a href="https://spark.apache.org">Structured streaming with Apache Spark</a>, and a
smattering of microservices to help shuffle data around. All sitting on top of
<a href="https://delta.io">Delta Lake</a> which acts as an incredibly versatile and useful
storage layer for the platform.</p>
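
<p>As a rough sketch of the shape of those streaming jobs (this assumes a
SparkSession with the Kafka and Delta Lake connectors available; the broker,
topic, and S3 paths are purely illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Read an event stream from Kafka and append it to a Delta table
stream = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "events")
               .load())

(stream.selectExpr("CAST(value AS STRING) AS raw_event")
       .writeStream
       .format("delta")
       .option("checkpointLocation", "s3://bucket/_checkpoints/events")
       .start("s3://bucket/delta/events"))
</code></pre></div></div>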

<p>In my presentation, which is embedded below, I outline how we tie together Kafka, Databricks, and Delta Lake.</p>

<center>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/YmyCOr9Mr9Y" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</center>

<p>The recorded presentation also complements some of our
<a href="https://tech.scribd.com">tech.scribd.com</a> blog posts which I recommend reading as well:</p>

<ul>
  <li><a href="https://tech.scribd.com/blog/2020/streaming-with-delta-lake.html">Streaming data in and out of Delta Lake</a></li>
  <li><a href="https://tech.scribd.com/blog/2020/introducing-kafka-player.html">Streaming development work with Kafka</a></li>
  <li><a href="https://tech.scribd.com/blog/2020/shipping-rust-to-production.html">Ingesting production logs with Rust</a></li>
  <li><a href="https://tech.scribd.com/blog/2019/migrating-kafka-to-aws.html">Migrating Kafka to the cloud</a></li>
</ul>

<p>I am incredibly proud of the work the Platform Engineering organization has
done at Scribd to make real-time data a reality. I also cannot recommend Kafka +
Spark + Delta Lake highly enough for those with similar requirements.</p>

<p>Now that we have the platform in place, I am also excited for our late 2020 and
2021 roadmaps which will start to take advantage of real-time data.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="spark" /><category term="deltalake" /><category term="databricks" /><category term="scribd" /><summary type="html"><![CDATA[The Real-time Data Platform is one of the fun things we have been building at Scribd since I joined in 2019. Last month I was fortunate enough to share some of our approach in a presentation at Spark and AI Summit titled: “The revolution will be streamed.” At a high level, what I had branded the “Real-time Data Platform” is really: Apache Kafka, Apache Airflow, Structured streaming with Apache Spark, and a smattering of microservices to help shuffle data around. All sitting on top of Delta Lake which acts as an incredibly versatile and useful storage layer for the platform.]]></summary></entry></feed>