<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://brokenco.de//feed/by_tag/scribd.xml" rel="self" type="application/atom+xml" /><link href="https://brokenco.de//" rel="alternate" type="text/html" /><updated>2026-05-03T00:12:50+00:00</updated><id>https://brokenco.de//feed/by_tag/scribd.xml</id><title type="html">rtyler</title><subtitle>a moderately technical blog</subtitle><author><name>R. Tyler Croy</name></author><entry><title type="html">Based Lake, a petabyte-scale low-latency data lake</title><link href="https://brokenco.de//2026/03/10/based-lake.html" rel="alternate" type="text/html" title="Based Lake, a petabyte-scale low-latency data lake" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://brokenco.de//2026/03/10/based-lake</id><content type="html" xml:base="https://brokenco.de//2026/03/10/based-lake.html"><![CDATA[<p>I had a chat today about building large scale low-latency data retrieval
systems around AWS S3. In doing so I got to share a bit of the talk proposal I
submitted to <a href="https://dataaisummit.com">Data and AI Summit</a> this year about
real-life work that has made it into production.</p>

<p>For years the conventional wisdom around <a href="https://delta.io">Delta Lake</a> has
been to <strong>not</strong> connect user-facing/online systems to Delta tables. Basically,
don’t point your Django app at your Delta tables. This continues to be a decent
<em>guideline</em> but definitely <strong>not a rule</strong> and I have the performance data to
back that up.</p>

<p>My talk abstract:</p>

<blockquote>
  <p>Scribd hosts hundreds of millions of documents and has hundreds of billions of
objects across our buckets. Combining large language models with massive
amounts of text has required investment in our new Content Library
architecture.  We selected Delta Lake as the underlying storage technology but
have pushed it to an extreme. Using the same Delta Lake architecture we offer
both direct data access for data scientists in Databricks Notebooks and online
data retrieval in milliseconds for user-facing web services.</p>

  <p>In this talk we will review principles of performance for each layer of the
stack: web APIs, the Delta Lake tables, Apache Parquet, and AWS S3.</p>
</blockquote>

<p>The work done by myself and my colleague Eugene in this area has been heavily
related to my previous research around <a href="/2025/06/24/low-latency-parquet.html">Low latency Parquet
reads</a> which informed work named <a href="https://tech.scribd.com/blog/2026/content-crush.html">Content
Crush</a>, which I have
explored more on the Scribd tech blog and on the <a href="/2026/02/13/screaming-in-the-cloud.html">Screaming in the
Cloud</a> podcast.</p>

<p>I really hope that I am able to share results at Data and AI Summit from this
incredibly challenging work that I am undertaking. But even if I don’t, blog
posts like my musings on <a href="/2026/01/19/multimodal-delta-lake.html">Multimodal with Delta
Lake</a>, <a href="https://www.buoyantdata.com/blog/2024-12-31-high-concurrency-logstore.html">scaling streaming Delta Lake
applications</a>,
and a myriad of other articles I have published can be pieced together to form
the larger mosaic of insane large-scale data work I have been hammering on!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="arrow" /><category term="parquet" /><category term="deltalake" /><category term="databricks" /><category term="scribd" /><summary type="html"><![CDATA[I had a chat today about building large scale low-latency data retrieval systems around AWS S3. In doing so I got to share a bit of the talk proposal I submitted to Data and AI Summit this year about real-live work that has made it into production.]]></summary></entry><entry><title type="html">I’m a Databricks Beacon</title><link href="https://brokenco.de//2021/10/21/databricks-beacon.html" rel="alternate" type="text/html" title="I’m a Databricks Beacon" /><published>2021-10-21T00:00:00+00:00</published><updated>2021-10-21T00:00:00+00:00</updated><id>https://brokenco.de//2021/10/21/databricks-beacon</id><content type="html" xml:base="https://brokenco.de//2021/10/21/databricks-beacon.html"><![CDATA[<p>A bit of belated news but thanks to all the advocacy work we have been doing at
<a href="https://tech.scribd.com">Scribd</a>, I am now a <a href="https://databricks.com/discover/beacons/tyler-croy">Databricks
Beacon</a>. The Beacon program is similar
to Docker Captains, Microsoft MVPs, or Java Champions, a group of folks who are
considered both skilled with the technology and in communicating/sharing best
practices, tips, and short-comings with the broader community.</p>

<p><img src="/images/post-images/databricks-beacons/header-image.png" alt="Beacon profile" /></p>

<p>From the <a href="https://databricks.com/discover/beacons/">site</a> itself:</p>

<blockquote>
  <p>The Databricks Beacons program is our way to thank and recognize the community members, data scientists, data engineers, developers and open source enthusiasts who go above and beyond to uplift the data and AI community.</p>

  <p>Whether they are speaking at conferences, leading workshops, teaching, mentoring, blogging, writing books, creating tutorials, offering support in forums or organizing meetups, they inspire others and encourage knowledge sharing – all while helping to solve tough data problems.</p>
</blockquote>

<p>I’m flattered to be included in the inaugural group of Beacons, which includes a
number of much more competent data leaders than myself. Most of what I bring to
the table is a <em>lot</em> of <a href="https://delta.io">Delta Lake</a> experience and advocacy.
Delta Lake is the bedrock of Scribd’s data platform and I have been investing
heavily in the space with our contribution of the
<a href="https://github.com/delta-io/delta-rs">delta-rs</a> Rust bindings, upon which
<a href="https://www.youtube.com/watch?v=mLmsZ3qYfB0">kafka-delta-ingest</a> was built.</p>

<p><a href="https://databricks.com/customers/data-team-effect/scribd">Scribd is a Databricks
customer</a>, and from
that angle I have been quite impressed with the organization and technologies
they have built. As folks who have seen <a href="https://youtu.be/h5bRBuVmhL4?t=1635">my public talks</a> about Databricks know,
I don’t hold back in my honest assessment of the platform’s strengths and
weaknesses, thus my surprise to be included as a Beacon ;)</p>

<p>I’m looking forward to more events where I am able to share some of the
real-world experiences we’re gaining at Scribd in building out massive data
platform systems with Delta Lake and Databricks. And as always, if you want to <a href="https://tech.scribd.com/careers/#open-positions">help us build out more</a> feel free to email me!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="scribd" /><category term="databricks" /><summary type="html"><![CDATA[A bit of belated news but thanks to all the advocacy work we have been doing at Scribd_ I am now a Databricks Beacon. The Beacon program is similar to Docker Captains, Microsoft MVPs, or Java Champions, a group of folks who are considered both skilled with the technology and in communicating/sharing best practices, tips, and short-comings with the broader community.]]></summary></entry><entry><title type="html">Recovering from disasters with Delta Lake</title><link href="https://brokenco.de//2021/04/26/disaster-recovery-with-delta-lake.html" rel="alternate" type="text/html" title="Recovering from disasters with Delta Lake" /><published>2021-04-26T00:00:00+00:00</published><updated>2021-04-26T00:00:00+00:00</updated><id>https://brokenco.de//2021/04/26/disaster-recovery-with-delta-lake</id><content type="html" xml:base="https://brokenco.de//2021/04/26/disaster-recovery-with-delta-lake.html"><![CDATA[<p>Entering into the data platform space with a lot of experience in more
traditional production operations is a <em>lot</em> of fun, especially when you ask
questions like “what if <code class="language-plaintext highlighter-rouge">X</code> goes horribly wrong?”  My favorite scenario to
consider is: “how much damage could one accidentally cause with our existing
policies and controls?”  At <a href="https://tech.scribd.com">Scribd</a> we have made
<a href="https://delta.io">Delta Lake</a> a cornerstone of our data platform, and as such
I’ve spent a lot of time thinking about what could go wrong and how we would
defend against it.</p>

<p>To start I recommend reading this recent post from Databricks: <a href="https://databricks.com/blog/2021/04/20/attack-of-the-delta-clones-against-disaster-recovery-availability-complexity.html">Attack of the
Delta
Clones</a>
which provides a good overview of the <code class="language-plaintext highlighter-rouge">CLONE</code> operation in Delta and some
patterns for “undoing” mistaken operations. Their blog post does a fantastic
job demonstrating the power of clones in Delta Lake, for example:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Creating a new cloned table  from loan_details_delta</span>
<span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">TABLE</span> <span class="n">loan_details_delta_clone</span>
    <span class="n">DEEP</span> <span class="n">CLONE</span> <span class="n">loan_details_delta</span><span class="p">;</span>

<span class="c1">-- Original view of data</span>
<span class="k">SELECT</span> <span class="n">addr_state</span><span class="p">,</span> <span class="n">funded_amnt</span> <span class="k">FROM</span> <span class="n">loan_details_delta</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">addr_state</span><span class="p">,</span> <span class="n">funded_amnt</span><span class="p">;</span>

<span class="c1">-- Clone view of data</span>
<span class="k">SELECT</span> <span class="n">addr_state</span><span class="p">,</span> <span class="n">funded_amnt</span> <span class="k">FROM</span> <span class="n">loan_details_delta_clone</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">addr_state</span><span class="p">,</span> <span class="n">funded_amnt</span><span class="p">;</span>
</code></pre></div></div>

<p>For my disaster recovery needs, the clone-based approach is insufficient as I detailed in <a href="https://groups.google.com/g/delta-users/c/2WOymkv4KgI/m/zvqKkQwJDwAJ">this post</a> on the delta-users mailing list:</p>

<blockquote>
  <p>Our requirements are basically to prevent catastrophic loss of business critical data via:</p>

  <ul>
    <li>Erroneous rewriting of data by an automated job</li>
    <li>Inadvertent table drops through metastore automation.</li>
    <li>Overaggressive use of VACUUM command</li>
    <li>Failed manual sync/cleanup operations by Data Engineering staff</li>
  </ul>

  <p>It’s important to consider whether you’re worried about the transaction log
getting corrupted, files in storage (e.g. ADLS) disappearing, or both.</p>
</blockquote>

<p>Generally speaking, I’m less concerned about malicious actors than about
<em>incompetent</em> ones. It is <strong>far</strong> more likely that a member of the team
accidentally deletes data, than somebody kicking in a few layers of cloud-based
security and deleting it for us.</p>

<p>My preference is to work at a layer <em>below</em> Delta Lake to provide disaster
recovery mechanisms, in essence at the object store layer (S3). Relying strictly
on <code class="language-plaintext highlighter-rouge">CLONE</code> gets you copies of data which can definitely be beneficial <em>but</em> the
downside is that whatever is running the query has access to both the “source”
and the “backup” data.</p>

<p>The concern is that if some mistake was able to delete my source data, there’s
nothing actually standing in its way of deleting the backup data as well.</p>

<p>In my mailing list post, I posited a potential solution:</p>

<blockquote>
  <p>For example, with a simple nightly rclone(.org) based snapshot of an S3 bucket, the
“restore” might mean copying the transaction log and new parquet files <em>back</em> to
the originating S3 bucket and <em>losing</em> up to 24 hours of data, since the
transaction logs would basically be rewound to the last backup point.</p>
</blockquote>
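
<p>A minimal sketch of that idea, assuming rclone remotes named <code class="language-plaintext highlighter-rouge">prod:</code> and <code class="language-plaintext highlighter-rouge">backup:</code> configured with separate credentials. The remote names, bucket paths, and the <code class="language-plaintext highlighter-rouge">DRY_RUN</code> switch are all made up for illustration, and this is not necessarily how a production backup should be built:</p>

```shell
#!/bin/sh
# Hypothetical remotes and paths; in practice these would be defined in
# rclone's own configuration, each with its own credentials.
SOURCE="${SOURCE:-prod:delta-tables/events}"
BACKUP="${BACKUP:-backup:delta-snapshots/events}"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# snapshot DATE: copy the whole table directory (data files and _delta_log)
# into a dated prefix in the backup bucket.
snapshot() {
    run rclone copy "${SOURCE}" "${BACKUP}/$1"
}

# restore DATE: make the source match the chosen snapshot again. Anything
# written after the snapshot is removed, which is where the data loss of up
# to 24 hours comes from.
restore() {
    run rclone sync "${BACKUP}/$1" "${SOURCE}"
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">snapshot 2021-04-26</code> from a nightly job would take the copy, and <code class="language-plaintext highlighter-rouge">restore 2021-04-26</code> would rewind the table to it. The snapshot uses <code class="language-plaintext highlighter-rouge">rclone copy</code> so that nothing is ever deleted on the backup side, while the restore uses <code class="language-plaintext highlighter-rouge">rclone sync</code> because rewinding the transaction log requires removing the newer commits from the source.</p>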

<p>Since that email we have deployed our Delta Lake backup solution,
which operates strictly at an S3 layer and allows us to impose hard walls (IAM)
between writers of the source and backup data.</p>

<p>One of my colleagues is writing that blog post up for
<a href="https://tech.scribd.com">tech.scribd.com</a> and I hope to see it published later
this week so make sure you follow us on Twitter
<a href="https://twitter.com/scribdtech">@ScribdTech</a> or subscribe to the <a href="https://tech.scribd.com/feed.xml">RSS
feed</a>!</p>

<hr />

<p><strong>Update</strong>: my colleague Kuntal wrote <a href="https://tech.scribd.com/blog/2021/backing-up-data-warehouse.html">this blog post on backing up Delta Lake with AWS S3 Batch Operations</a> which is what we’re doing here at <a href="https://tech.scribd.com">Scribd</a></p>]]></content><author><name>R. Tyler Croy</name></author><category term="deltalake" /><category term="scribd" /><summary type="html"><![CDATA[Entering into the data platform space with a lot of experience in more traditional production operations is a lot of fun, especially when you ask questions like “what if X goes horribly wrong?” My favorite scenario to consider is: “how much damage could one accidentally cause with our existing policies and controls?” At Scribd we have made Delta Lake a cornerstone of our data platform, and as such I’ve spent a lot of time thinking about what could go wrong and how we would defend against it.]]></summary></entry><entry><title type="html">Building a real-time data platform with Apache Spark and Delta Lake</title><link href="https://brokenco.de//2020/07/20/realtime-spark-deltalake.html" rel="alternate" type="text/html" title="Building a real-time data platform with Apache Spark and Delta Lake" /><published>2020-07-20T00:00:00+00:00</published><updated>2020-07-20T00:00:00+00:00</updated><id>https://brokenco.de//2020/07/20/realtime-spark-deltalake</id><content type="html" xml:base="https://brokenco.de//2020/07/20/realtime-spark-deltalake.html"><![CDATA[<p>The <a href="/2019/08/28/real-time-data-platform.html">Real-time Data Platform</a> is one
of the fun things we have been building at Scribd since I joined in 2019. Last
month I was fortunate enough to share some of our approach in a presentation at
Spark and AI Summit titled: “The revolution will be streamed.” At a high level,
what I had branded the “Real-time Data Platform” is really: <a href="https://kafka.apache.org">Apache
Kafka</a>, <a href="https://airflow.apache.org">Apache Airflow</a>,
<a href="https://spark.apache.org">Structured streaming with Apache Spark</a>, and a
smattering of microservices to help shuffle data around. All sitting on top of
<a href="https://delta.io">Delta Lake</a> which acts as an incredibly versatile and useful
storage layer for the platform.</p>

<p>In my presentation, which is embedded below, I outline how we tie together Kafka, Databricks, and Delta Lake.</p>

<center>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/YmyCOr9Mr9Y" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</center>

<p>The recorded presentation also complements some of our
<a href="https://tech.scribd.com">tech.scribd.com</a> blog posts which I recommend reading as well:</p>

<ul>
  <li><a href="https://tech.scribd.com/blog/2020/streaming-with-delta-lake.html">Streaming data in and out of Delta Lake</a></li>
  <li><a href="https://tech.scribd.com/blog/2020/introducing-kafka-player.html">Streaming development work with Kafka</a></li>
  <li><a href="https://tech.scribd.com/blog/2020/shipping-rust-to-production.html">Ingesting production logs with Rust</a></li>
  <li><a href="https://tech.scribd.com/blog/2019/migrating-kafka-to-aws.html">Migrating Kafka to the cloud</a></li>
</ul>

<p>I am incredibly proud of the work the Platform Engineering organization has
done at Scribd to make real-time data a reality. I also cannot recommend Kafka +
Spark + Delta Lake highly enough for those with similar requirements.</p>

<p>Now that we have the platform in place, I am also excited for our late 2020 and
2021 roadmaps which will start to take advantage of real-time data.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="spark" /><category term="deltalake" /><category term="databricks" /><category term="scribd" /><summary type="html"><![CDATA[The Real-time Data Platform is one of the fun things we have been building at Scribd since I joined in 2019. Last month I was fortunate enough to share some of our approach in a presentation at Spark and AI Summit titled: “The revolution will be streamed.” At a high level, what I had branded the “Real-time Data Platform” is really: Apache Kafka, Apache Airflow, Structured streaming with Apache Spark, and a smattering of microservices to help shuffle data around. All sitting on top of Delta Lake which acts as an incredibly versatile and useful storage layer for the platform.]]></summary></entry><entry><title type="html">Changing the way the world reads at Scribd</title><link href="https://brokenco.de//2019/11/25/building-the-library.html" rel="alternate" type="text/html" title="Changing the way the world reads at Scribd" /><published>2019-11-25T00:00:00+00:00</published><updated>2019-11-25T00:00:00+00:00</updated><id>https://brokenco.de//2019/11/25/building-the-library</id><content type="html" xml:base="https://brokenco.de//2019/11/25/building-the-library.html"><![CDATA[<p>This week we launched the
<a href="https://tech.scribd.com">Scribd tech blog</a>, on which I published today’s
article: <a href="https://tech.scribd.com/blog/2019/building-the-library.html">We’re building the largest library in
history</a>. I
frequently have to remind myself that I have been here less than a year, and we
have undergone incredible positive change, with more coming in 2020.</p>

<p>The <a href="https://tech.scribd.com/blog/2019/building-the-library.html">post</a>
gives a high-level idea of what is to come for technology at Scribd in the
coming year or two, related to our <a href="https://blog.scribd.com/home/scribd-announces-58-million-strategic-investment-led-by-spectrum-equity">announcement
today</a>
of a major round of funding:</p>

<blockquote>
  <p>Today we are excited to announce Scribd has closed $58 million in equity
financing led by Spectrum Equity. The investment will be used to support
growth and product innovation, enhance operations, and further the company’s
mission to change the way the world reads.</p>
</blockquote>

<p>The most important detail I was able to share in the blog post is in the
Infrastructure section:</p>

<blockquote>
  <p>The future of our infrastructure, and our applications, is <strong>entirely in the
cloud</strong>. The migration [to AWS] requires shifting workloads between
datacenters with a tiny error and downtime budget. At our size, that’s many
terabytes of data and thousands of requests per second, which dictates
serious upfront planning, automation, testing, and monitoring of every facet
of our environment.</p>
</blockquote>

<p>Hiding behind this paragraph is a tremendous amount of my time over these
past few months. When I arrived at Scribd in January, there were no plans on the
roadmap to adopt a cloud provider for our infrastructure. I must have
been the straw that broke the camel’s back. “We need to move into the cloud”
was met with “We agree! What’s your plan?” And then it became one of the many
plates I have kept spinning.</p>

<p>We already have migrated a few services, including a major production service
which Core Platform moved over without any issues; I’m very proud of that one!</p>

<p>Unlike many “datacenter to cloud” migrations, I believe ours is unique in that
we have:</p>

<ul>
  <li>A very limited error and downtime budget.</li>
  <li>The green-light to share the process as we go along.</li>
</ul>

<p>I’m looking forward to sharing more on
<a href="https://tech.scribd.com">tech.scribd.com</a>
(<a href="https://tech.scribd.com/feed.xml">RSS</a>) as we move to AWS. I hope you’ll tune
in!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="scribd" /><category term="aws" /><summary type="html"><![CDATA[This week we launched the Scribd tech blog, on which I published today’s article: We’re building the largest library in history. I frequently have to remind myself that I have been here less than a year, and we have undergone incredible positive change, with more coming in 2020.]]></summary></entry><entry><title type="html">Building containers in Jenkins with Kaniko</title><link href="https://brokenco.de//2019/10/03/kanikosan.html" rel="alternate" type="text/html" title="Building containers in Jenkins with Kaniko" /><published>2019-10-03T00:00:00+00:00</published><updated>2019-10-03T00:00:00+00:00</updated><id>https://brokenco.de//2019/10/03/kanikosan</id><content type="html" xml:base="https://brokenco.de//2019/10/03/kanikosan.html"><![CDATA[<p>I have a love/hate relationship with containers. We have used containers for
production services in the Jenkins project’s
<a href="https://github.com/jenkins-infra">infrastructure</a> for six or seven years,
where they have been very useful. I run some desktop applications <a href="https://gist.github.com/rtyler/767cfab0e50d7d79100b52cf0a13427a">in
containers</a>.
There are even a few Kubernetes clusters which show the tell-tale signs of my
usage. Containers are great. Not a week goes by, however, without some oddity in
containers, or the tools around them, throwing a wrench into the gears and causing
me great frustration. This week was one of those weeks: we suddenly had
problems building our Docker containers in one of our Kubernetes environments.</p>

<p>I’m a strong supporter of running Jenkins workloads in Kubernetes for a myriad
of reasons, which I won’t go into here. Like most organizations however, we
don’t just need containers for the testing of our applications, we need to
package them into containers as well. As such, we need to build Docker
containers atop Kubernetes, which isn’t as straight-forward as you might hope.</p>

<p>For years I have followed the same approach that <a href="https://medium.com/hootsuite-engineering/building-docker-images-inside-kubernetes-42c6af855f25">Hootsuite describes in this
post</a>,
utilizing Docker’s own “Docker in Docker” container (<code class="language-plaintext highlighter-rouge">docker:dind</code>). By <a href="https://gist.github.com/rtyler/14a43e3c2c21d876d3f6315b1e82bc25">using
a pod with Docker-in-Docker and a Docker client
container</a>,
the <code class="language-plaintext highlighter-rouge">Jenkinsfile</code> can be <em>fairly</em> simple for building a container but certainly
not as simple as a plain <code class="language-plaintext highlighter-rouge">sh 'docker build rofl:copter'</code>. With the linked
configuration above, our pipelines would typically have an explicit stage which
would build Docker containers:</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pipeline</span> <span class="o">{</span>
    <span class="n">stages</span> <span class="o">{</span>
        <span class="n">stage</span><span class="o">(</span><span class="s1">'Buildo Roboto'</span><span class="o">)</span> <span class="o">{</span>
            <span class="n">agent</span> <span class="o">{</span> 
                <span class="n">kubernetes</span> <span class="o">{</span>
                    <span class="n">label</span> <span class="s1">'docker'</span>
                    <span class="n">defaultContainer</span> <span class="s1">'docker'</span>
                <span class="o">}</span>
            <span class="o">}</span>
            <span class="n">steps</span> <span class="o">{</span>
                <span class="n">sh</span> <span class="s1">'docker build -t roboto:latest .'</span>
            <span class="o">}</span>
        <span class="o">}</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>In one of our environments, this recently <strong>stopped working</strong>. What’s worse, is
that we still aren’t entirely sure why. We migrated the Jenkins workloads from
an older Kubernetes cluster to a newer one, and afterwards this “dind” approach
to building containers started throwing incredibly confusing network and
filesystem errors. Smart money is on some host kernel or filesystem
configuration issue which is causing the “dind” container, which must run
“privileged”, to function incorrectly. After an hour or two of debugging, I
said “forget this” (I may have used slightly different words) and started
looking at other options.</p>

<h2 id="kaniko">Kaniko</h2>

<p><a href="https://github.com/GoogleContainerTools/kaniko">Kaniko</a> is a curious tool from
Google which allows the building of containers on top of Kubernetes. By curious
I mean that it works fairly differently from a “stock” <code class="language-plaintext highlighter-rouge">docker build</code> invocation
and required some tweaking on our end to get things working comfortably. That
said, our initial work is promising and we think we’re going to be switching
fully over to it.</p>

<p>The biggest oddity is the need for the intermediate layers of the container build,
and the resultant image, to be published to a repository. My colleague
hypothesized that this was likely a pattern from Google Cloud Platform, where
local VM disks might not be as fast as the container registry affiliated with a
cluster. While there are local filesystem caching options we found them too
unreliable to be useful.</p>

<p>For our configuration of Kaniko, we riffed on the <em>Scripted</em> Pipeline examples
shared by my former colleagues <a href="https://go.cloudbees.com/docs/cloudbees-core/cloud-install-guide/kubernetes-using-kaniko/">at
CloudBees</a>,
but made some fairly significant modifications along the way. Most notably, we
decided to stand up an ephemeral Docker registry inside the Kaniko pod rather
than rely on an external registry for intermediate layers. The end product is
pushed to a well supported network-based registry, but the intermediate layers
are perfectly fine to run locally, as we have very fast disk I/O on our
Kubernetes nodes.</p>

<p>Kaniko’s invocation is much different, and the way it treats its build context
is also a little odd. In our testing we found that the <code class="language-plaintext highlighter-rouge">--cleanup</code> flag was not
enabled <em>by default</em> and successive calls to Kaniko would <strong>mash all</strong> the
files from different contexts on top of one another in some temp directory used
by Kaniko for builds, thereby leading to frustrating build failures. It should
also be noted that the Kaniko containers use Busybox for their shell, but it’s
on a fun non-standard path (<code class="language-plaintext highlighter-rouge">/busybox/sh</code>), so shell scripts expecting
<code class="language-plaintext highlighter-rouge">/bin/sh</code> or <code class="language-plaintext highlighter-rouge">/bin/bash</code> will definitely fail!</p>

<p>We use Declarative Pipeline very heavily and also utilize our own custom JNLP agent
image in Jenkins (custom root certificates!), so the snippet below should be
largely portable to your environment but may need some tweaks:</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pipeline</span> <span class="o">{</span>
    <span class="n">stages</span> <span class="o">{</span>
        <span class="n">stage</span><span class="o">(</span><span class="s1">'Buildo Roboto'</span><span class="o">)</span> <span class="o">{</span>
            <span class="n">agent</span> <span class="o">{</span> 
                <span class="n">kubernetes</span> <span class="o">{</span>
                    <span class="n">defaultContainer</span> <span class="s1">'kaniko'</span>
                    <span class="n">yamlFile</span> <span class="s1">'kaniko.yaml'</span>
                <span class="o">}</span>
            <span class="o">}</span>
            <span class="n">steps</span> <span class="o">{</span>
                <span class="cm">/*
                 * Since we're in a different pod than the rest of the
                 * stages, we'll need to grab our source tree since we don't
                 * have a shared workspace with the other pod(s)..
                 */</span>
                <span class="n">checkout</span> <span class="n">scm</span>
                <span class="n">sh</span> <span class="s1">'sh -c ./scripts/build-kaniko.sh'</span>
            <span class="o">}</span>
        <span class="o">}</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p><strong>kaniko.yaml</strong></p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># This pod specification is intended to be used within the Jenkinsfile for</span>
<span class="c1"># building the Docker containers</span>
<span class="c1">#</span>
<span class="c1"># E.g. /kaniko/executor --context `pwd` --destination localhost:5000/roboto:latest --insecure-registry localhost:5000 --cleanup</span>
<span class="nn">---</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Pod</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">kaniko</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">containers</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">jnlp</span>
    <span class="c1"># Overwriting the jnlp container's default "image" parameter, this will be</span>
    <span class="c1"># merged automatically with the Kubernetes plugin's built-in jnlp container</span>
    <span class="c1"># configuration, ensuring that the pod comes up and is accessible</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s1">'</span><span class="s">our-awesome-registry/rtyler/jenkins-agent:latest'</span>
  <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">kaniko</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">gcr.io/kaniko-project/executor:debug</span>
    <span class="na">imagePullPolicy</span><span class="pi">:</span> <span class="s">Always</span>
    <span class="c1"># Command and args are important to set in this manner such that the</span>
    <span class="c1"># Jenkins Pipeline can send commands to be executed from the Jenkinsfile via</span>
    <span class="c1"># stdin (that's how it really works!)</span>
    <span class="na">command</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">/busybox/sh</span>
    <span class="pi">-</span> <span class="s2">"</span><span class="s">-c"</span>
    <span class="na">args</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">/busybox/cat</span>
    <span class="na">tty</span><span class="pi">:</span> <span class="no">true</span>
  <span class="c1">#  Kaniko requires a registry, so we're just bringing one online in the pod</span>
  <span class="c1">#  for the intermediate caching of layers</span>
  <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">registry</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s1">'</span><span class="s">registry'</span>
    <span class="na">command</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">/bin/registry</span>
    <span class="pi">-</span> <span class="s">serve</span>
    <span class="pi">-</span> <span class="s">/etc/docker/registry/config.yml</span>
</code></pre></div></div>

<p>Our experience with Kaniko thus far is that it has been slower, and less
verbose in some of its output than <code class="language-plaintext highlighter-rouge">docker build</code>. Fortunately though it’s been
quite reliable, and that’s the key factor here!</p>

<p>Hopefully with the snippets of code above you won’t need to spend nearly as
much time tinkering as my colleague and I did. But in the process of switching
over to Kaniko we needed to do a <em>lot</em> of interactive debugging in Jenkins, so
I was glad to have something like an <a href="/2017/08/07/jenkins-pipeline-shell.html">interactive
shell</a> in my bag of Jenkins Pipeline
tricks.</p>

<p>While I liked the “dind” solution, the Kaniko-based solution works just as
well. Our next step is to hide some of this complexity with
<a href="https://jenkins.io/doc/book/pipeline/shared-libraries">shared libraries</a>, but
that’s a project for another day!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="jenkins" /><category term="kubernetes" /><category term="kaniko" /><category term="scribd" /><summary type="html"><![CDATA[I have a love/hate relationship with containers. We have used containers for production services in the Jenkins project’s infrastructure for six or seven years, where they have been very useful. I run some desktop applications in containers. There are even a few Kubernetes clusters which show the tell-tale signs of my usage. Containers are great. Not a week goes by however when some oddity in containers, or the tools around them, throws a wrench into the gears and causes me great frustration. This week was one of those weeks: we suddenly had problems building our Docker containers in one of our Kubernetes environments.]]></summary></entry><entry><title type="html">JKS? jfc. Adding a root certificate</title><link href="https://brokenco.de//2019/09/28/jks-jfc.html" rel="alternate" type="text/html" title="JKS? jfc. Adding a root certificate" /><published>2019-09-28T00:00:00+00:00</published><updated>2019-09-28T00:00:00+00:00</updated><id>https://brokenco.de//2019/09/28/jks-jfc</id><content type="html" xml:base="https://brokenco.de//2019/09/28/jks-jfc.html"><![CDATA[<p>TLS certificates have the largest “complexity/importance” scores imaginable.
Everything about them is error prone and seemingly over-engineered from top to
bottom, yet they are one of the most important pieces of security and
authentication in our software architectures. From an engineering management
standpoint, I am finding myself adopting the rule of: estimates for any project
involving certificates should be multiplied tenfold. If the project involves
the Java Virtual Machine (JVM) and the Java Key Store (JKS), multiply by
another ten I suppose. For my own future convenience, in this blog post I would
like to outline how to add a root certificate to a Java Key Store in Red
Hat-derived environments.</p>

<p>Like many corporate environments, we have our own internal Certificate
Authorities (CAs) which all derive their chain of trust from our internal root
certificate. Accessing internal services requires that the operating system has
that root certificate, or when accessing those internal services from anything
running atop the JVM, the default JKS must have the root certificate.</p>

<p>If you search around the web for how to add root certificates, you might find
the <code class="language-plaintext highlighter-rouge">update-ca-certificates</code> command, whose CentOS/RHEL manpage has the
following:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The directory /etc/pki/ca-trust/extracted/java/ contains a CA
certificate bundle in the java keystore file format. Distrust information
cannot be represented in this file format, and distrusted certificates are
missing from these files. File cacerts contains CA certificates trusted for TLS
server authentication.
</code></pre></div></div>

<p>You might assume, as I did, that this means the <code class="language-plaintext highlighter-rouge">update-ca-certificates</code> tool
is going to create files that the JVM picks up properly and your default JKS
will have the root certificate in place.</p>

<p>This is <em>false</em>, at least in the environments in which I have tested it.</p>

<p>Digging further I found <a href="https://connect2id.com/blog/importing-ca-root-cert-into-jvm-trust-store">this blog post</a> and used the following command to import the root certificate into JKS after installing it on the system at large:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>keytool -importcert -alias startssl -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit -file ca.der
</code></pre></div></div>
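
<p>If the import has to run more than once, say in a container image build, a
small wrapper can skip it when the alias is already present. What follows is
only a sketch under assumptions, written in Python: the function names are
made up for illustration, <code class="language-plaintext highlighter-rouge">keytool</code> is assumed to be on the <code class="language-plaintext highlighter-rouge">PATH</code>, the keystore is
assumed to live at the same Java 8 style path as above, and a non-zero exit
status from <code class="language-plaintext highlighter-rouge">keytool -list</code> is assumed to mean the alias is missing.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
import subprocess


def cacerts_path(java_home):
    # Assumption: the Java 8 style layout used in the command above
    return os.path.join(java_home, "jre", "lib", "security", "cacerts")


def list_command(keystore, alias, storepass="changeit"):
    # keytool -list with -alias looks up a single entry in the keystore
    return ["keytool", "-list", "-alias", alias,
            "-keystore", keystore, "-storepass", storepass]


def import_command(keystore, alias, cert_file, storepass="changeit"):
    # -noprompt skips the interactive question about trusting the certificate
    return ["keytool", "-importcert", "-noprompt", "-alias", alias,
            "-keystore", keystore, "-storepass", storepass,
            "-file", cert_file]


def ensure_imported(java_home, alias, cert_file, run=subprocess.run):
    """Import cert_file under alias unless it is already there.

    Returns True when an import was performed, False when it was skipped.
    The run parameter exists only so the function can be exercised without
    a real keytool.
    """
    keystore = cacerts_path(java_home)
    # Assumption: a non-zero exit status means the alias is not present yet
    found = run(list_command(keystore, alias), capture_output=True)
    if found.returncode == 0:
        return False
    run(import_command(keystore, alias, cert_file), check=True)
    return True
</code></pre></div></div>

<p>Calling <code class="language-plaintext highlighter-rouge">ensure_imported(os.environ["JAVA_HOME"], "startssl", "ca.der")</code> would
then mirror the one-off command above, while being safe to repeat.</p>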

<p>Using the SSLPoke tool referenced in <a href="https://confluence.atlassian.com/kb/unable-to-connect-to-ssl-services-due-to-pkix-path-building-failed-779355358.html">this Atlassian knowledgebase
article</a>
I was then <em>finally</em> able to access the same internal services from native
utilities (e.g. <code class="language-plaintext highlighter-rouge">curl</code>) and from the Java-based services which I was working
with at the time.</p>

<p>In my situation, the fact that all of this was happening within Docker
containers further complicated the debugging: multiply by another 2-5 on that
engineering estimate.</p>

<p>Certificates are too important to be this painful.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="security" /><category term="tls" /><category term="java" /><category term="scribd" /><summary type="html"><![CDATA[TLS certificates have the largest “complexity/importance” scores imaginable. Everything about them is error prone and seemingly over-engineered from top to bottom, yet they are one of the most important pieces of security and authentication in our software architectures. From an engineering management standpoint, I am finding myself adopting the rule of: estimates for any project involving certificates should be multiplied tenfold. If the project involves the Java Virtual Machine (JVM) and the Java Key Store (JKS), multiply by another ten I suppose. For my own future convenience, in this blog post I would like to outline how to add a root certificate to a Java Key Store in Red Hat-derived environments.]]></summary></entry><entry><title type="html">Ruby Infrastructure Engineering</title><link href="https://brokenco.de//2019/09/09/rubby-infra.html" rel="alternate" type="text/html" title="Ruby Infrastructure Engineering" /><published>2019-09-09T00:00:00+00:00</published><updated>2019-09-09T00:00:00+00:00</updated><id>https://brokenco.de//2019/09/09/rubby-infra</id><content type="html" xml:base="https://brokenco.de//2019/09/09/rubby-infra.html"><![CDATA[<p>My favorite part of the stack is the netherworld between the underlying
infrastructure and the app. That fuzzy grey area where data goes from databases
to object-relational mappers (ORMs), web servers to request libraries (e.g.
Rack/WSGI), and so on. In many cases a technology roadmap where one considers
infrastructure, but not the application, or vice-versa, is doomed from the
start. At Scribd, I have been given permission to hire more people that love
this layer of the stack, and I have taken to calling it “Ruby Infrastructure.”
It is a fairly unique phrase, and one that I want to define in greater detail.</p>

<p>I have described the general mission of <a href="https://jobs.lever.co/scribd/6fff482b-6363-4525-b6b0-6131d6994eef">the
team</a> as
follows:</p>

<blockquote>
  <p>The Ruby Infrastructure team will help Scribd adopt major ecosystem
improvements such as Sorbet, new Rails versions, and interpreter releases.
Measure and optimize performance across the thousands of requests per second
served by Ruby at Scribd. Create libraries that encapsulate common Ruby
application patterns and approaches. Open high quality pull requests to
improve upstream projects like Sidekiq, Rails, and Ruby itself.</p>
</blockquote>

<p>Ruby at Scribd is serious business. We run one of the largest Rails deployments
on the internet (hi
<a href="https://github.blog/2019-09-09-running-github-on-rails-6-0/">GitHub</a>!) and
need more focused effort on scaling it from a technology and organization
standpoint. The Ruby ecosystem has also matured greatly over the past 10 years
and every couple of months there are new improvements which Scribd can adopt.</p>

<p>The Ruby Infrastructure team is intended to be the group of people who make
sure that all our Ruby and Rails applications perform well, scale, and
are easy to develop and deploy.</p>

<p>To give you a better idea of what this team will do, here are some of the
projects which I have in mind:</p>

<h3 id="simplify-with-aurora">Simplify with Aurora</h3>

<p>We have over 7TB of online relational data which, for historical reasons, is
spread across a number of master-replica clusters. Migrating these databases
to <a href="https://aws.amazon.com/rds/aurora/#">RDS/Aurora</a> looks very
promising. The advertised read-performance and dataset storage scalability may
allow us to consolidate the database infrastructure and allow us to delete swaths
of complex database magic in the applications.</p>

<p>All that code for switching up database connections or delegating reads to
read-replicas <em>may</em> disappear behind the curtains of Aurora. We certainly need
to do some investigation here, but this is a prime example of that grey area
where the Ruby Infrastructure team will excel.</p>

<h3 id="web-socketin">Web Socketin’</h3>

<p>Enabling Web Sockets on smaller applications is trivially easy these days. For
larger sites like <a href="https://scribd.com/">Scribd.com</a> a large number of variables
need to be considered: do we terminate sockets in Rails? What will an
incredibly high connection count do to our existing app infrastructure? How
will application developers write code which supports Web Sockets and their
existing request flows? Does our app host capacity plan change dramatically as
a result?</p>

<p>Seemingly mundane requests like “can we enable web sockets?” from application
developers or product managers, at the scale of billions of requests per month,
can have far reaching implications that the Ruby Infrastructure team is poised
perfectly to answer.</p>

<h3 id="efficient-host-sizing">Efficient Host Sizing</h3>

<p>Our current infrastructure is at times over-provisioned. I won’t get into
the specifics in this post, but there is a lot of low-hanging fruit in understanding
our existing application footprints and then sizing our infrastructure around
them appropriately. Whether we’re talking about understanding or improving our
<a href="https://www.joyfulbikeshedding.com/blog/2019-03-29-the-status-of-ruby-memory-trimming-and-how-you-can-help-with-testing.html">memory
utilization</a>,
or becoming more elastic around CPU utilization. Building an overall
understanding of how these Ruby applications perform, how to tune them, and how
to structure their resource usage is going to be one of the frequently
re-evaluated projects for Ruby Infrastructure.</p>

<hr />

<p>There are a myriad of other interesting projects which will crop up once a
couple <a href="https://jobs.lever.co/scribd/6fff482b-6363-4525-b6b0-6131d6994eef">Ruby Infrastructure
Engineers</a>
join the company. Like the other teams in <a href="/2019/08/22/platform-engineering-at-scribd.html">Platform
Engineering</a>, this team will
be entirely remote, which means we can hire the most qualified people we’re
able to find, from nearly anywhere.</p>

<p>I’m excited to see the upstream pull requests, RailsConf presentations, and blog
posts that we’re going to be able to share once we start solving problems
together!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="ruby" /><category term="scribd" /><summary type="html"><![CDATA[My favorite part of the stack is the netherworld between the underlying infrastructure and the app. That fuzzy grey area where data goes from databases to object-relational mappers (ORMs), web servers to request libraries (e.g. Rack/WSGI), and so on. In many cases a technology roadmap where one considers infrastructure, but not the application, or vice-versa, is doomed from the start. At Scribd, I have been given permission to hire more people that love this layer of the stack, and I have taken to calling it “Ruby Infrastructure.” A phrase which is fairly unique, that I wanted to define in greater detail.]]></summary></entry><entry><title type="html">Defining the Real-time Data Platform</title><link href="https://brokenco.de//2019/08/28/real-time-data-platform.html" rel="alternate" type="text/html" title="Defining the Real-time Data Platform" /><published>2019-08-28T00:00:00+00:00</published><updated>2019-08-28T00:00:00+00:00</updated><id>https://brokenco.de//2019/08/28/real-time-data-platform</id><content type="html" xml:base="https://brokenco.de//2019/08/28/real-time-data-platform.html"><![CDATA[<p>One of the harder parts about building new platform infrastructure at a company
which has been around a while is figuring out exactly <em>where</em> to
begin. At <a href="https://www.scribd.com/about/engineering">Scribd</a> the company has
built a good product and curated a large corpus of written content, but
where next? As I alluded to in <a href="/2019/08/22/platform-engineering-at-scribd.html">my previous
post</a> about the Platform
Engineering organization, our “platform” components should help scale out,
accelerate, or open up entirely new avenues of development. In this article, I
want to describe one such project we have been working on and share some of the
thought process behind its inception and prioritization: the Real-time Data
Platform.</p>

<p>(sounds fancy huh?)</p>

<p>My first couple weeks at the company were intense. 
The idea of “Core Platform” was sketched out as a team “to scale apps and data” but that
was about the extent of it. The task I took on was to learn as much as I could,
as quickly as I could, in order to get the recruiting and hiring machine
started. Basically, I
needed to point Core Platform in a direction that was correct enough at a high
level in order to know what skills my future colleagues should have. While I
had <em>tons</em> of discussions and did plenty of reading, I almost feel sheepish to
admit this, but much of our direction was heavily influenced by two
conversations, both of which took less than an hour.</p>

<p>The first was with <a href="https://www.linkedin.com/in/kperko">Kevin Perko</a> (KP), the head
of our <a href="https://www.scribd.com/about/data_science">Data Science team</a>. His team
interacts the most with our current data platform (HDFS, Spark, Hive, etc); in
essence Data Science would be considered one of our customers. I asked some
variant of “what’s wrong with the data infrastructure?” and KP unloaded what
must have been months of pent up frustrations shared by his entire team. The
themes that emerged were:</p>

<ul>
  <li>Developers don’t think about the consumers of the data. Garbage in, garbage
out!</li>
  <li>Many nightly tasks spend a <em>lot</em> of time performing unnecessary pre-processing of data.</li>
  <li>The performance of the system is generally poor. Ad-hoc queries from data
scientists, depending on the time of day, are competing with automated tasks
for resources.</li>
  <li>Everything has to be done in this nightly dependent graph of tasks, and when
something goes wrong, it’s very manual to recover from errors and typically
ruins somebody’s day.</li>
</ul>

<p>After I assured KP that these were problems we would be solving, his next
statement became a mainstay of our relationship moving forward: “<em>when will it be
ready?</em>”</p>

<p>My second influential conversation was with <a href="https://twitter.com/mikkelewis">Mike
Lewis</a>, the head of Product. This conversation
was quite simple and didn’t involve as much trauma counseling as the previous.
I asked “what can’t you do today because of our technology limitations?” This
is a good question to ask product teams every now and again. They frequently
are optimising within their current constraints. One role of
platform and infrastructure teams is to remove those constraints. We discussed
the way in which users convert from passersby, to trial, to paid subscribers.
He also highlighted the importance of our recommendations and search results in
this funnel, and lamented the speed at which we can highlight relevant content
to new users. The maxim goes: the faster a new user sees relevant and
interesting content, the more likely they are to stick around.</p>

<p>Pattern matching between the current problems and the technology needed to
enable new product initiatives, I named and defined the high level objective for
the <strong>Real-time Data Platform</strong> as follows:</p>

<blockquote>
  <p><em>To provide a streaming data platform for collecting and acting upon behavioral data
in near real-time with the ultimate goal to enable day zero personalization in
Scribd’s products.</em></p>
</blockquote>

<p>In more concrete terms, the platform is a collection of cloud-based services
(in AWS, more on that later) for ingesting, processing, and storing behavioral
events from frontend, backend, and mobile clients.  The scope of the Real-time
Data Platform extends from event definition and schema, to the layout of events
persisted into long-term queryable storage, and the tooling which sits on
top of that queryable storage.</p>

<p>As the nominal “product owner” for the effort, I aimed to describe less about
what tools and technologies should be used, and instead forced myself to define
tech-agnostic requirements, thereby leaving the discovery work for the team I
would ultimately hire.</p>

<p>The requirements for the Real-time Data Platform are:</p>

<ul>
  <li>A high, nearing 100% data SLA. Meaning we must design in such a way to reduce
data loss or corruption at every point of the pipeline.</li>
  <li>Maintain data provenance through the pipeline from data creation to usage. In
essence, a Data Scientist should be able to easily track data from where it
originated, and understand the transformative steps along the way.</li>
  <li>Event streams should be considered API contracts, with schemas suggested or
enforced when possible. A consumer from an event stream should be able to
trust the quality of the events in that stream.</li>
  <li>Data processing and transformation must happen as close to ingestion as
possible. Events which arrive in long-term storage must be structured and
partitioned for optimal query performance with zero or minimal post-processing
required for most use-cases.</li>
  <li>The platform must scale as the data volume grows without requiring
significant redesign or rework.</li>
</ul>
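
<p>To make the “event streams as API contracts” and “structured and
partitioned” requirements a little more concrete, here is a small Python
sketch. Everything in it is hypothetical: the <code class="language-plaintext highlighter-rouge">page_view</code> event, its fields, and
the <code class="language-plaintext highlighter-rouge">event=/date=</code> prefix layout are invented for illustration rather than taken
from the platform itself, and timestamps are assumed to carry a UTC offset.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datetime import datetime, timezone

# Hypothetical contract for one stream: field name mapped to required type
PAGE_VIEW_SCHEMA = {
    "event": str,
    "user_id": int,
    "document_id": int,
    "timestamp": str,  # ISO 8601 with offset, e.g. 2019-08-28T12:00:00+00:00
}


def validate(event, schema):
    """Return a list of contract violations, empty when the event conforms."""
    problems = []
    for field, expected in schema.items():
        if field not in event:
            problems.append("missing field: " + field)
        elif not isinstance(event[field], expected):
            problems.append("wrong type for field: " + field)
    return problems


def partition_for(event):
    """Derive a storage prefix at ingestion time so queries can prune by date."""
    when = datetime.fromisoformat(event["timestamp"]).astimezone(timezone.utc)
    return "event={}/date={}".format(event["event"], when.strftime("%Y-%m-%d"))
</code></pre></div></div>

<p>An ingestion service along these lines could quarantine any event for which
<code class="language-plaintext highlighter-rouge">validate</code> reports problems, and write the rest under the prefix returned by
<code class="language-plaintext highlighter-rouge">partition_for</code>, so that little or no post-processing is needed before querying.</p>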

<p>In essence, we need to change a number of foundational ways in which we
generate, transfer, and consider the data which Scribd uses. As Core Platform
has unpeeled layer after layer of this onion, we have been able to affirm at
each step of the way that we’re moving in the right direction, which is by
itself quite exciting.</p>

<p>The design of the Real-time Data Platform which we’re currently building out is
something I will share at a high level in a subsequent blog post.</p>

<p>I want to finish this one with some parting thoughts. If you are building
<em>anything</em> foundational in a technology organization, you <strong>must</strong> talk to the
product team. You must also talk to your customers, but I don’t like to ask
them what they want, I like to ask what they don’t like and don’t want. Listen
to that negative feedback, understand what lies beneath the frustrations.
Finally, have a vision for the future, but build and deliver incrementally.
When I first sketched this out, I was forthcoming in stating “this is a 2020
project.” I made sure to clarify that this did not mean we wouldn’t deliver anything
to the business for 18 months. Instead, I made sure to explain that to
execute on this overall vision would be a long journey with milestones along
the way.</p>

<p>If you haven’t ever watched a skyscraper being built, you would be amazed at
how much of the time is spent digging a great big hole, sinking steel into
bedrock, and pouring concrete. Months of people working in a city block-sized
hole before anything takes shape that even resembles a skyscraper.  Building
strong foundations takes time, but that is in essence the role of any platform
and infrastructure organization. The challenge is to keep the business moving
forward today while <em>also</em> building those fundamental components upon which the
business will stand in a year or two.</p>

<p>It is tough, but that’s exactly what I signed up for. :)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="kafka" /><category term="scribd" /><category term="aws" /><summary type="html"><![CDATA[One of the harder parts about building new platform infrastructure at a company which has been around a while is figuring out exactly where to begin. At Scribd the company has built a good product and curated a large corpus of written content, but where next? As I alluded to in my previous post about the Platform Engineering organization, our “platform” components should help scale out, accelerate, or open up entirely new avenues of development. In this article, I want to describe one such project we have been working on and share some of the thought process behind its inception and prioritization: the Real-time Data Platform.]]></summary></entry><entry><title type="html">Zooming out to Platform Engineering at Scribd</title><link href="https://brokenco.de//2019/08/22/platform-engineering-at-scribd.html" rel="alternate" type="text/html" title="Zooming out to Platform Engineering at Scribd" /><published>2019-08-22T00:00:00+00:00</published><updated>2019-08-22T00:00:00+00:00</updated><id>https://brokenco.de//2019/08/22/platform-engineering-at-scribd</id><content type="html" xml:base="https://brokenco.de//2019/08/22/platform-engineering-at-scribd.html"><![CDATA[<p>The team that I joined <a href="https://scribd.com">Scribd</a> to build, <a href="/2019/03/28/scribd-core-platform.html">Core
Platform</a>, is now up and running with
five incredibly talented people. I could not be more pleased with the very
friendly and highly functional group of people we have been able to assemble.
With that team’s projects underway, my focus has been shifting, zooming out
to “Platform Engineering” as a comprehensive part of the engineering
group. In this post, I want to expand on what Platform Engineering is planned
to be and discuss some of the teams and their responsibilities.</p>

<p>I was hired as the “Director of Platform Engineering”, which at the time was an
especially ostentatious title considering an entire group didn’t yet exist. It
was so wacky that “Director” has been something I’m almost ashamed to
reference. It is not in my email signature and it doesn’t show up in Slack; I
don’t want it to interfere with my ability to discuss ideas or hack on
something with my colleagues. The role did however have intent behind it: for
me to focus on growing the organization, a big challenge which I’m fastidiously
working towards addressing. As currently scoped, the teams which compose
Platform Engineering are:</p>

<ul>
  <li><strong>Core Platform</strong>, provides foundational infrastructure to help Scribd scale
applications and data.</li>
  <li><strong>Data Engineering</strong>, treats data as a product, ensuring that high quality
data sets are accessible to internal users.</li>
  <li><strong>Ruby Infrastructure</strong>, helps Scribd adopt or upstream major ecosystem changes
which will improve organizational and operational performance of Ruby and
Rails.</li>
</ul>

<p>Defining the scope and charters for these teams has been a rather interesting
exercise. Figuring out with the Infrastructure, Data Science, and Internal
Tools teams where the edges of our respective responsibilities lie is one of
those good healthy debates every organization should have as it grows. A year
ago much of engineering was flat with lots of generalists. Compare that to
today, where both Product and Engineering groups are learning that
specialization when appropriately applied can be quite helpful.</p>

<p>What has also been personally challenging about hiring in Data Engineering is
my relative inexperience in the field. My jam has always been backend service
infrastructure. Across the industry we’re seeing data infrastructure melt
into backend production infrastructure. Scribd is no different, but we have a
lot of work to do, changing from a mindset of “dumping in the data lake” to
where Product and other parts of Engineering are viewing data as a more
integral part of their work, both in generating clean data and in
utilizing derived data sets to make more personalized or responsive user
experiences.</p>

<p>The barriers between “data platform” and “production engineering” remind me of
the now outdated silos between application developers and operations engineers.
I’m not sure what to call it, DevDataOps? Maybe DataDevOps?</p>

<p>I’ll have to figure out the hashtag later.</p>

<p>Anyways, like Core Platform, Data Engineering and Ruby Infrastructure are also
intended to be fully remote teams. I maintain that it is better to hire the
best people available rather than the best people “around here.” Hiring
remotely forces the organization to confront all of the collaboration and
communication problems that many growing companies ignore until it’s too late.
Recording meeting notes, sharing knowledge, pair problem solving, capturing
decisions, discussing project roles and responsibilities, all of these are crucial for
effective remote work and they are all unsurprisingly qualities of effective
colocated teams too.</p>

<p>The work we have done thus far in Core Platform I believe sets a strong
precedent for other teams within Platform Engineering and outside of it. We
have patterns of work defined and documented, which will make each successive
remote team we hire at Scribd that much easier to get up and running.</p>

<p>While we’re hiring across the board (who isn’t) the folks I am specifically
hiring for are:</p>

<ul>
  <li><strong>Core Platform</strong>
    <ul>
      <li><a href="https://jobs.lever.co/scribd/78b89735-e4f7-4f44-985e-e028bfca5698">Application Platform
Engineer</a></li>
      <li><a href="https://jobs.lever.co/scribd/ee84d062-19e8-47aa-9403-1935daae70ff">Data Platform
Engineer</a></li>
    </ul>
  </li>
  <li><strong>Data Engineering</strong>
    <ul>
      <li><a href="https://jobs.lever.co/scribd/7a9e16c6-9cb3-48a0-bf82-2e405a596fcd">Data Engineering
Manager</a></li>
      <li><a href="https://jobs.lever.co/scribd/46a9ef46-d214-483d-be09-f811c8b19127">Data
Engineer</a></li>
    </ul>
  </li>
  <li><strong>Ruby Infrastructure</strong>
    <ul>
      <li><a href="https://jobs.lever.co/scribd/6fff482b-6363-4525-b6b0-6131d6994eef">Ruby Infrastructure
Engineering</a></li>
    </ul>
  </li>
</ul>

<p>We’re also hiring an <a href="https://jobs.lever.co/scribd/d5aa5ade-e520-4c63-947c-d48bee5e748d">Infrastructure Team
Manager</a>
who I would be working heavily with.</p>

<p>If you’re curious about these roles, or Platform Engineering type things,
please email me: rtyler at brokenco.de</p>

<p>If you’re not curious about those roles, but want to share thoughts on remote
engineering, you can email me for that too! At some point I want to
write down all the patterns and practices I have learned, adopted, or stopped
using over the past five years for building successful remote engineering
organizations. That idea is pending a surplus of spare time which isn’t <em>currently</em>
in the budget however. :)</p>

<hr />

<p>I have been afforded a lot of leeway by my boss to publicly discuss not only
the projects that we’re working on, but a bit of the work we’re doing behind
the scenes. Over the coming months I’m looking forward to sharing even more
about what scaling up an organization like Scribd requires, where we’ve failed,
and where we’re succeeding.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="scribd" /><summary type="html"><![CDATA[The team that I joined Scribd to build, Core Platform is now up and running with five incredibly talented people. I could not be more pleased with the very friendly and highly functional group of people we have been able to assemble. With that team’s projects underway, my focus has been shifting, zooming out to “Platform Engineering” as a comprehensive part of the engineering group. In this post, I want to expand on what Platform Engineering is planned to be and discuss some of the teams and their responsibilities.]]></summary></entry></feed>