<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://brokenco.de//feed/by_tag/aws.xml" rel="self" type="application/atom+xml" /><link href="https://brokenco.de//" rel="alternate" type="text/html" /><updated>2026-05-03T00:12:50+00:00</updated><id>https://brokenco.de//feed/by_tag/aws.xml</id><title type="html">rtyler</title><subtitle>a moderately technical blog</subtitle><author><name>R. Tyler Croy</name></author><entry><title type="html">Screaming in the Cloud</title><link href="https://brokenco.de//2026/02/13/screaming-in-the-cloud.html" rel="alternate" type="text/html" title="Screaming in the Cloud" /><published>2026-02-13T00:00:00+00:00</published><updated>2026-02-13T00:00:00+00:00</updated><id>https://brokenco.de//2026/02/13/screaming-in-the-cloud</id><content type="html" xml:base="https://brokenco.de//2026/02/13/screaming-in-the-cloud.html"><![CDATA[<p>One of the reasons I work where I work is because of the fascinating
data-at-scale problems that they have. This has led me deep into the world of
<a href="https://delta.io">Delta Lake</a> and AWS S3.  Not one to take anything too
seriously, I have been cooking up absolutely bonkers solutions to some of these
<em>billions-scale</em> challenges I am tasked with solving.</p>

<p>Recently I was fortunate enough to discuss some of the objectively insane ideas
with an old PuppetConf pal <a href="https://www.linkedin.com/in/coquinn/">Corey Quinn</a>.</p>

<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/TZj38Bm1DC4?si=m_jo0HOFPHqPC--2" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>

<p>In <a href="https://tech.scribd.com/blog/2026/content-crush.html">this post</a> I wrote
about the design of Content Crush and how Scribd is consolidating objects in S3
to minimize our costs.</p>

<p><em>Checking if files are damaged? $100K. Using newer S3 tools? Way too expensive.
Normal solutions don’t work anymore. Tyler shares how with this much data, you
can’t just throw money at the problem, but rather you have to engineer your way
out.</em></p>

<p>For better or worse I have been having so much fun coming up with crazy data solutions
during the day, that I also am doing it on nights and weekends with my
consultancy <a href="https://www.buoyantdata.com">Buoyant Data</a>.</p>

<p>In the coming months I’m expecting to have some more time free up, so I’m
hoping to find another couple clients who need some AWS and data expertise to
spice up their infrastructure! You can find me at
<a href="mailto:rtyler@buoyantdata.com">rtyler@buoyantdata.com</a> for that type of thing,
but if you just want to share your own crazy ideas with me, or commiserate with
me about S3, you can find me at
<a href="mailto:rtyler@brokenco.de">rtyler@brokenco.de</a>.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="opinion" /><category term="aws" /><category term="podcast" /><summary type="html"><![CDATA[One of the reasons I work where I work is because of the fascinating data-at-scale problems that they have. This has led me deep into the world of Delta Lake and AWS S3. Not one to take anything too seriously, I have been cooking up absolutely bonkers solutions to some of these billions-scale challenges I am tasked with solving.]]></summary></entry><entry><title type="html">R.I.P. S3 Object Lambda</title><link href="https://brokenco.de//2025/10/15/rip-object-lambda.html" rel="alternate" type="text/html" title="R.I.P. S3 Object Lambda" /><published>2025-10-15T00:00:00+00:00</published><updated>2025-10-15T00:00:00+00:00</updated><id>https://brokenco.de//2025/10/15/rip-object-lambda</id><content type="html" xml:base="https://brokenco.de//2025/10/15/rip-object-lambda.html"><![CDATA[<p>Did you know that AWS S3 is almost 20 years old? The “cloud” as a concept is
fairly <em>recent</em> but in the time-distortion that has occurred since the rise of
the internet, I think many of us have lost track of how <em>old</em> some of these
public cloud providers are, and as a side-effect, how old their technology
offerings can become. Periodically you need to clean out the attic, and this week AWS did just that with their 
“<a href="https://aws.amazon.com/about-aws/whats-new/2025/10/aws-service-availability/">AWS Service Availability Updates</a>.”</p>

<p>In the list of services that probably have fewer users than most YC startups,
was one which I had recently found <em>incredibly useful</em>: <strong>S3 Object Lambda</strong>.</p>

<p>From Corey of <a href="https://www.lastweekinaws.com/">Last Week in AWS</a> infamy:</p>

<blockquote>
  <p>S3 Object Lambdas have always been a bit weird. You can still have Lambdas
operate on S3, and at least actual Lambdas are likely to see service
improvements; Object Lambdas have been moribund for years.</p>
</blockquote>

<p>Object Lambda is admittedly a <em>niche</em> product. But what makes it quite
interesting for my purposes is that it allows you to modify S3 requests en route. It
is by far the fastest way to add custom business logic around data stored in S3
while preserving S3’s API and semantics.</p>

<p>For example, you can create a <em>completely fabricated</em> key space with S3 Object
Lambda that represents a <em>logical</em> object layout, even if your physical object
layout, the actual bytes stored in S3, does not match.</p>

<p>As <em>handy</em> as I think S3 Object Lambda is, when I spoke with some folks
responsible for S3 Object Lambda at AWS earlier this year, it became clear that
there was no further investment in the feature. To me the writing was on the
wall that AWS was going to kill the feature <em>eventually</em>, so I proactively
shifted any work where it was present.</p>

<p>S3 Object Lambda now joins the graveyard next to <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html">S3
Select</a>
and closes the book on “what if S3 were a data application platform.” Instead
AWS continues to push vectors, vectors, VECTORS! Pivoting towards “what if S3
were an AI platform?”.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="aws" /><summary type="html"><![CDATA[Did you know that AWS S3 is almost 20 years old? The “cloud” as a concept is fairly recent but in the time-distortion that has occurred since the rise of the internet, I think many of us have lost track of how old some of these public cloud providers are, and as a side-effect, how old their technology offerings can become. Periodically you need to clean out the attic, and this week AWS did just that with their “AWS Service Availability Updates.”]]></summary></entry><entry><title type="html">The thing about appendable objects in S3</title><link href="https://brokenco.de//2025/08/26/express-many-zones.html" rel="alternate" type="text/html" title="The thing about appendable objects in S3" /><published>2025-08-26T00:00:00+00:00</published><updated>2025-08-26T00:00:00+00:00</updated><id>https://brokenco.de//2025/08/26/express-many-zones</id><content type="html" xml:base="https://brokenco.de//2025/08/26/express-many-zones.html"><![CDATA[<p>Storing bytes at scale is never as simple as we lead ourselves to believe. The
concept of files, or in the cloud “objects”, is a useful metaphor for an
<em>approximation</em> of reality but it’s not <em>actually reality</em>. As I have fallen
deeper and deeper into the rabbit hole, my mental model of what storage
really <em>is</em> has been challenged at every turn.</p>

<p>This evening I was at the <a href="https://www.duckbillgroup.com/san-francisco-finops-meetup/">San Francisco FinOps
Meetup</a> with the
nice folks from Chime and the Duckbill Group. Corey asked some questions about
S3 Express One Zone that I thought warranted a little bit more thought.</p>

<p>Last year Amazon announced that <a href="https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-express-one-zone-append-data-object/">S3 Express One Zone now supports the ability
to append data to an
object</a>.</p>

<p>Setting aside the discussion on whether S3 Express One Zone is <em>actually</em> useful for a
moment, I want to focus on the “appendable object” concept.</p>

<blockquote>
  <p>Applications that continuously receive data over a period of time need the
ability to add data to existing objects. For example, log-processing
applications continuously add new log entries to the end of existing log
files. Similarly, media-broadcasting applications add new video segments to
video files as they are transcoded and then immediately stream the video to
viewers.</p>
</blockquote>

<p>I don’t know much about media-broadcasting applications, so perhaps this
functionality is useful there, but I know a <strong>lot</strong> about log-processing
applications.</p>

<p>Corey’s fundamental question about appendable objects: <strong>is this useful in S3 Standard?</strong></p>

<p>After a good hour or two of consideration, I am going to say pretty
definitively: <em>probably not</em>.</p>

<p>Appendable objects work by requiring the writer, the caller of <code class="language-plaintext highlighter-rouge">PutObject</code>, to
specify the offset in the object at which to put the <em>new bytes</em>. This pushes a
coordination requirement onto the writer, one which I have difficulty conceiving
of a way to make work in real-world applications.</p>

<p>Setting <code class="language-plaintext highlighter-rouge">Standard</code> aside, I am having trouble grappling with how to design an application to use this functionality. Take the example provided in the AWS docs:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws s3api put-object --bucket amzn-s3-demo-bucket--azid--x-s3 \
        --key sampleinput/file001.bin \
        --body bucket-seed/file001.bin \
        --write-offset-bytes size-of-sampleinput/file001.bin
</code></pre></div></div>

<ul>
  <li>My application has written 4096kB of <code class="language-plaintext highlighter-rouge">file001.bin</code></li>
  <li>I have more data to append, I need to know that <strong>I am the only instance</strong> appending to <code class="language-plaintext highlighter-rouge">file001.bin</code></li>
  <li>I also need to know that <strong>no other process has appended</strong> to <code class="language-plaintext highlighter-rouge">file001.bin</code> past the original 4096kB boundary</li>
  <li>Then I <code class="language-plaintext highlighter-rouge">PutObject</code> the next 4096kB.</li>
</ul>
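
<p>To make that race concrete, here is a minimal sketch in Rust. <code class="language-plaintext highlighter-rouge">ToyBucket</code> and <code class="language-plaintext highlighter-rouge">OffsetMismatch</code> are hypothetical in-memory stand-ins, not the real S3 API, and the sketch assumes an append is rejected whenever the supplied offset is not the object's current size. Unlike S3, the toy also creates a missing object on an append at offset 0.</p>

```rust
use std::collections::HashMap;

/// Error for an append whose offset is not the object's current size.
/// (Hypothetical type; it only models the rejection, not S3's real error.)
#[derive(Debug, PartialEq)]
pub struct OffsetMismatch {
    pub expected: usize,
    pub given: usize,
}

/// Toy in-memory stand-in for a bucket. Not the real S3 API.
#[derive(Default)]
pub struct ToyBucket {
    objects: HashMap<String, Vec<u8>>,
}

impl ToyBucket {
    /// Current size of the object, or 0 if it does not exist.
    /// This plays the role of a HeadObject call.
    pub fn size(&self, key: &str) -> usize {
        self.objects.get(key).map_or(0, |bytes| bytes.len())
    }

    /// Append `body` only if `write_offset` equals the current size.
    /// Returns the new size of the object on success.
    pub fn append(
        &mut self,
        key: &str,
        write_offset: usize,
        body: &[u8],
    ) -> Result<usize, OffsetMismatch> {
        let object = self.objects.entry(key.to_string()).or_default();
        if write_offset != object.len() {
            return Err(OffsetMismatch {
                expected: object.len(),
                given: write_offset,
            });
        }
        object.extend_from_slice(body);
        Ok(object.len())
    }
}
```

<p>If two writers both read the size and then both append, only the first append succeeds. The second writer is holding a stale offset and has to notice the failure, re-read the size, and retry, which is exactly the coordination burden described above.</p>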

<p>There is external-to-S3 coordination that would be required by an application
to make sure two concurrent appenders don’t <em>ever</em> touch the same file. In
fact, the only safe way I can imagine this working is to put a lock entry into
a DynamoDB table saying <code class="language-plaintext highlighter-rouge">process-A</code> is appending to <code class="language-plaintext highlighter-rouge">file001.bin</code>, and <em>then</em>
the process would need to send <code class="language-plaintext highlighter-rouge">HeadObject</code> to make absolutely certain it had
the <em>correct offset bytes</em> before issuing a write.</p>
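
<p>A rough sketch of that lock-then-check-then-write sequence, again in Rust with in-memory stand-ins. <code class="language-plaintext highlighter-rouge">LockTable</code> and <code class="language-plaintext highlighter-rouge">locked_append</code> are hypothetical names, the lock table only models what a conditional write to DynamoDB would provide, and the <code class="language-plaintext highlighter-rouge">Vec</code> stands in for the object's bytes in S3.</p>

```rust
use std::collections::HashMap;

/// Toy stand-in for a DynamoDB lock table: object key -> owning process.
#[derive(Default)]
pub struct LockTable {
    owners: HashMap<String, String>,
}

impl LockTable {
    /// Take the lock only if nobody holds it (models a conditional write).
    pub fn try_acquire(&mut self, key: &str, owner: &str) -> bool {
        if self.owners.contains_key(key) {
            return false;
        }
        self.owners.insert(key.to_string(), owner.to_string());
        true
    }

    /// Release the lock, but only if `owner` is the current holder.
    pub fn release(&mut self, key: &str, owner: &str) {
        if self.owners.get(key).map(String::as_str) == Some(owner) {
            self.owners.remove(key);
        }
    }
}

/// Lock the key, re-read the object's size, append at that offset, unlock.
/// Returns the new size, or None if another process holds the lock.
pub fn locked_append(
    locks: &mut LockTable,
    object: &mut Vec<u8>,
    key: &str,
    owner: &str,
    body: &[u8],
) -> Option<usize> {
    if !locks.try_acquire(key, owner) {
        return None;
    }
    // The "HeadObject" step: only trust the size observed while holding the lock.
    let write_offset = object.len();
    object.extend_from_slice(body);
    debug_assert_eq!(object.len(), write_offset + body.len());
    locks.release(key, owner);
    Some(object.len())
}
```

<p>Even in this toy version, a process that dies between <code class="language-plaintext highlighter-rouge">try_acquire</code> and <code class="language-plaintext highlighter-rouge">release</code> leaves the lock behind, so a real implementation would also need lock expiry or some other recovery path.</p>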

<p>For an application where a single process is <em>guaranteed</em> to operate on a single object in S3, this would be viable, but I would need to make sure the application architecture ensures a number of guarantees are in place.</p>

<p>From a reliability standpoint, I don’t know what would happen should a process
<em>crash</em> in the middle of a write. Is the object forever corrupted? Are parts
left in limbo like when multi-part uploads are aborted? Perhaps at AWS their
applications don’t crash in the middle of I/O operations, but I can
confidently say that applications I write crash all the time!</p>

<p><strong>Byte offsets are just so damn dangerous</strong>.</p>

<p>As Corey now knows <a href="/2025/07/16/no-way-parquet.html">I have a love/hate relationship with Apache
Parquet</a>, which has been designed with a <em>lot</em>
of lessons learned from large scale data systems. Byte offsets as a way to write segments of an object are <em>extremely</em> likely to lead to corrupted data. Developers like to joke about the two hard problems in computer science:</p>

<ul>
  <li>Caching</li>
  <li>Naming things</li>
  <li>Off-by-one errors</li>
</ul>

<p>The probability of an application corrupting its own data is 1.0.</p>

<p>With <a href="https://parquet.apache.org">Apache Parquet</a> the <strong>footer</strong> contains the
important metadata about the data contained within the file. One major benefit
of the design is that the data must have been <strong>written first</strong> for a valid
file to exist. Contrast this to <a href="https://avro.apache.org">Apache Avro</a>, which I
am decidedly less fond of. Avro <em>starts</em> with the file header and then data
blocks. The data blocks on their own indicate how long each block is, but as
far as I can tell there is no way for a reader to tell if all the necessary
data blocks were actually written to storage. You can easily tell if a data
block was partially written, but I don’t believe you can tell if a data block
is simply missing.</p>

<p>The “finalization” of an Apache Parquet footer provides a very useful end for
the write of any particular data application.</p>
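
<p>As a small illustration of why a trailing footer is such a useful “done” marker, here is a sketch in Rust of a cheap completeness check. It relies only on the Parquet layout of a leading <code class="language-plaintext highlighter-rouge">PAR1</code> magic, then data and footer metadata, then a 4-byte little-endian footer length and a trailing <code class="language-plaintext highlighter-rouge">PAR1</code>. The function name <code class="language-plaintext highlighter-rouge">footer_len</code> is hypothetical, and the check only inspects the trailer. It does not parse or validate the footer metadata itself.</p>

```rust
/// Magic bytes found at the start and end of a Parquet file.
const MAGIC: &[u8; 4] = b"PAR1";

/// Returns the footer length if `file` starts with the magic and ends with a
/// plausible trailer (4-byte little-endian footer length, then the magic).
/// Returns None for anything that looks truncated or unfinished.
pub fn footer_len(file: &[u8]) -> Option<usize> {
    // Smallest possible file: leading magic + footer length + trailing magic.
    if file.len() < 12 || !file.starts_with(MAGIC) || !file.ends_with(MAGIC) {
        return None;
    }
    let len_start = file.len() - 8;
    let len_bytes = [
        file[len_start],
        file[len_start + 1],
        file[len_start + 2],
        file[len_start + 3],
    ];
    let len = u32::from_le_bytes(len_bytes) as usize;
    // The footer has to fit between the leading magic and the trailer.
    if len <= len_start - 4 {
        Some(len)
    } else {
        None
    }
}
```

<p>A writer that crashes before emitting the trailer leaves bytes that fail this check, so a reader can discard the file instead of silently consuming partial data.</p>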

<h2 id="just-answer-the-question">Just answer the question</h2>

<p>Fine, okay, what were we talking about again?</p>

<p>Corey wants to know whether appendable objects would be useful in S3 Standard.</p>

<p><strong>No</strong></p>

<p>Appendable objects require application level coordination which is largely
impractical for <em>most</em> developers, myself included, to safely manage. Standard
tier introduces the challenges of availability zones to the discussion,
cross-AZ latencies, and a myriad of other distributed computing problems. What
<em>would</em> be useful is cheaper <a href="https://docs.aws.amazon.com/firehose/latest/dev/create-transform.html">output conversions and
transformations</a>
from Kinesis Firehose. Most append-oriented applications I have seen, built, or
designed, require something in the shape of a Kinesis, <a href="https://kafka.apache.org">Apache
Kafka</a>, or similar to provide that mission-critical
<strong>durable data ordering</strong> function.</p>

<p>Output conversion with Kinesis is an incredibly novel tool at our disposal.
While expensive it makes turning data streams into objects in S3 <em>very</em>
simple.</p>

<p>Appendable objects are best suited for applications where losing data or
corrupting objects is acceptable.</p>

<p>Management has kindly requested that I stop building such applications, so I’ll
stick to more durable data primitives for now.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="aws" /><category term="opinion" /><summary type="html"><![CDATA[Storing bytes at scale is never as simple as we lead ourselves to believe. The concept of files, or in the cloud “objects”, is a useful metaphor for an approximation of reality but it’s not actually reality. As I have fallen deeper and deeper into the rabbit hole, my mental model of what is storage really has been challenged at every turn.]]></summary></entry><entry><title type="html">Ditching the cloud is most likely a bad idea</title><link href="https://brokenco.de//2023/02/21/ditching-the-cloud-is-complicated.html" rel="alternate" type="text/html" title="Ditching the cloud is most likely a bad idea" /><published>2023-02-21T00:00:00+00:00</published><updated>2023-02-21T00:00:00+00:00</updated><id>https://brokenco.de//2023/02/21/ditching-the-cloud-is-complicated</id><content type="html" xml:base="https://brokenco.de//2023/02/21/ditching-the-cloud-is-complicated.html"><![CDATA[<p>I have the dubious honor of leading a migration from an on-premise
managed colocation facility into AWS. It was necessary to help the business
succeed, but frankly I would rather not have needed to do it. Earlier this morning I saw <a href="https://world.hey.com/dhh/we-stand-to-save-7m-over-five-years-from-our-cloud-exit-53996caa">a
post</a>
about “leaving the cloud” by that attention-seeking guy who keeps trying to
keynote RailsConf, and I had some opinions. I was hopped up on caffeine and free
office
snacks, and just could not help but share my thoughts in the fediverse.</p>

<p>Long story short, I think the original author’s analysis is nonsense and will
most likely result in him Musking his own company. Either way, here are some thoughts saved for posterity:</p>

<hr />

<p>I have always disliked this dude’s simpleton analyses but <em>IF</em> you are
considering leaving AWS (or other cloud providers) you <em>must</em> include:</p>

<ul>
  <li>Operational cost: which is all that the original author’s analysis includes.</li>
  <li>Labor cost: migrations use people’s time, which is typically the biggest
portion of a company’s budget.</li>
  <li>Opportunity cost: managing infrastructure or migrating it means you’re not
investing in growing the business. If your business isn’t about running
infrastructure (e.g. CloudFlare, Fastly, etc), this typically means you’re
actively harming your business by focusing elsewhere.</li>
</ul>

<p>But there’s so much more!</p>

<p><em>IF</em> the business’ workloads are CPU intensive and consistent, buying metal
<em>might</em> be cheaper.</p>

<p>Otherwise, if your math shows that on-premise is cheaper then I would have
<em>questions</em> about the current infrastructure. Are you using:</p>

<ul>
  <li>ECS/Fargate is crazy cheap and works great for almost all web apps you can
shove into a container.</li>
  <li>AWS Aurora is crazy good and makes a <em>lot</em> of RDBMS work and scaling easy.</li>
  <li>AWS Savings Plans help further reduce costs for predictable compute.</li>
</ul>

<p><em>IF</em> the business already has a big investment into AWS S3, I hope you’re
planning to get punished with S3 egress costs.</p>

<p>S3 is a modern marvel as <a href="https://awscommunity.social/@Quinnypig">Corey Quinn</a>
has said. You literally cannot make faster, cheaper, or more resilient storage.
But AWS uses cost to <em>encourage</em> you not to walk away from S3.</p>

<p>Depending on the relation of the application to the S3 storage, transit fees
can eat you alive.</p>

<p><em>IF</em> the business’ SLAs allow for the risk of a single-site on-premise
deployment, that’s cool.</p>

<p>AWS can have downtimes but it can be enlightening to ask the ops old guard
about the time suck of configuration management, rack management, or dealing
with RMAs with shitty hardware vendors.</p>

<p>I don’t relish funding Jeff Bezos’ next super yacht any more than you do, but
the stack you can get on AWS is unrivaled in its cost, reliability, and ease
of use.</p>

<p>Nobody gives AWS enough credit for their security work.</p>

<p>Building secure infrastructure is really challenging. There’s patch management,
role-based access control systems, data encryption needs, certificates, all
sorts of things.</p>

<p>Not all clouds do it well (lol azure).</p>

<p>But walking away from VPCs, Security Groups (Network Isolation), IAM
(Role-based access controls), CloudTrail (audit logging), GuardDuty (intrusion
detection), and automated upgrades for managed services would have me very
seriously questioning what security posture the org may or may not have.</p>

<p>Anyways, I don’t love AWS. It’s a monoculture and it makes an ugly
anti-competitive business viable.</p>

<p>It’s still the right choice in my opinion for the vast majority of businesses.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="aws" /><category term="opinion" /><summary type="html"><![CDATA[I have the dubious honor of leading a migration from an on-premise managed colocation facility into AWS. It was necessary to help the business succeed, but frankly I would rather not have needed to do it. Earlier this morning I saw a post about ‘leaving the cloud” by that attention-seeking guy who keeps trying to keynote RailsConf, I had some opinions. I was hopped up on caffeine and free office snacks, and just could not help but share my thoughts in the fediverse.]]></summary></entry><entry><title type="html">The problem with ML</title><link href="https://brokenco.de//2023/01/04/the-problem-with-ml.html" rel="alternate" type="text/html" title="The problem with ML" /><published>2023-01-04T00:00:00+00:00</published><updated>2023-01-04T00:00:00+00:00</updated><id>https://brokenco.de//2023/01/04/the-problem-with-ml</id><content type="html" xml:base="https://brokenco.de//2023/01/04/the-problem-with-ml.html"><![CDATA[<p>The holidays are the time of year when I typically field a lot of questions
from relatives about technology or the tech industry, and this year my favorite
questions were around <strong>AI</strong>. (<em>insert your own scary music</em>) Machine-learning
(ML) or Artificial Intelligence (AI) are being widely deployed and I have some
<strong>Problems™</strong> with that. Machine learning is not necessarily a new
domain, the practices commonly accepted as “ML” have been used for quite a
while to support search and recommendations use-cases. In fact, my day job
includes supporting data scientists and those who are actively creating models
and deploying them to production. <em>However</em>, many of my relatives outside of the tech industry believe that “AI” is going to replace people, their jobs, and/or run the future. I genuinely hope AI/ML comes nowhere close to this future imagined by members of my family.</p>

<p>Like many pieces of technology, it is not inherently good or bad, but the
problem with ML as it is applied today is that <strong>its application is far
outpacing our understanding of its consequences</strong>.</p>

<p>Brian Kernighan, co-creator of the C programming language and UNIX, said:</p>

<blockquote>
  <p>Everyone knows that debugging is twice as hard as writing a program in the
first place. So if you’re as clever as you can be when you write it, how will
you ever debug it?</p>
</blockquote>

<p>Setting aside the <em>mountain</em> of ethical concerns around the application of ML
which have and should continue to be discussed in the technology industry,
there’s a fundamental challenge with ML-based systems: I don’t think their
creators understand how they work, how their conclusions are determined, or how
to consistently improve them over time. Imagine you are a data scientist or ML
developer, how confident are you in what your models will predict between
experiments or evolutions of the model? Would you be willing to testify in a
court of law about the veracity of your model’s output?</p>

<p>Imagine you are a developer working on the models that Tesla’s “full
self-driving” (FSD) mode relies upon. Your model has been implicated in a Tesla
killing the driver and/or pedestrians (which <a href="https://www.reuters.com/business/autos-transportation/us-probing-fatal-tesla-crash-that-killed-pedestrian-2021-09-03/">has
happened</a>).
Do you think it would be possible to convince a judge and jury that your model
is <em>not</em> programmed to mow down pedestrians outside of a crosswalk? How do you
prove what a model is or is not supposed to do given never before seen inputs?</p>

<p>Traditional software <em>does</em> have a variation of this problem but source code
lends itself to scrutiny far better than the ML models. Many of which have come
from successive evolutions of public training data, proprietary model changes,
and integrations with new data sources.</p>

<p>These problems may be solvable in the ML ecosystem, but the problem is that the
application of ML is outpacing our ability to understand, monitor, and diagnose
models when they do harm.</p>

<p>That model your startup is working on to help accelerate home loan approvals
based on historical mortgages, how do you assert that your models are not
re-introducing racist policies like
    <a href="https://en.wikipedia.org/wiki/Redlining">redlining</a>? (Forms of this <a href="https://fortune.com/2020/02/11/a-i-fairness-eye-on-a-i/">have happened</a>.)</p>

<p>How about that fun image generation (AI art!) project you have been tinkering
with? It uses a publicly available model that was trained on millions of images
from the internet, and as a result in some cases unintentionally outputs
explicit images, or even what some jurisdictions might consider bordering on
child pornography. (Forms of this <a href="https://www.wired.com/story/lensa-artificial-intelligence-csem/">have
happened</a>.)</p>

<p>Really anything you teach based on the data “from the internet” is asking for
racist, pornographic, or otherwise offensive results, as the <a href="https://www.cbsnews.com/news/microsoft-shuts-down-ai-chatbot-after-it-turned-into-racist-nazi/">Microsoft
Tay</a>
example should have taught us.</p>

<p>Can you imagine the human-rights nightmare that could ensue from shoddy ML
models being brought into a healthcare setting? Law-enforcement? Or even
military settings?</p>

<hr />

<p>Machine-learning encompasses a very powerful set of tools and patterns, but our
ability to predict how those models will be used, what they will output, or how
to prevent negative outcomes is <em>dangerously</em> insufficient for use outside
of search and recommendation systems.</p>

<p>I understand how models are developed, how they are utilized, and what I
<em>think</em> they’re supposed to do.</p>

<p>Fundamentally the challenge with AI/ML is that we understand how to “make it
work”, but we don’t understand <em>why</em> it works.</p>

<p>Nonetheless we keep deploying “AI” anywhere there’s funding, consequences be
damned.</p>

<p>And that’s a problem.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="software" /><category term="ml" /><category term="aws" /><category term="databricks" /><summary type="html"><![CDATA[The holidays are the time of year when I typically field a lot of questions from relatives about technology or the tech industry, and this year my favorite questions were around AI. (insert your own scary music) Machine-learning (ML) or Artificial Intelligence (AI) are being widely deployed and I have some Problems™ with that. Machine learning is not necessarily a new domain, the practices commonly accepted as “ML” have been used for quite a while to support search and recommendations use-cases. In fact, my day job includes supporting data scientists and those who are actively creating models and deploying them to production. However, many of my relatives outside of the tech industry believe that “AI” is going to replace people, their jobs, and/or run the future. I genuinely hope AI/ML comes nowhere close to this future imagined by members of my family.]]></summary></entry><entry><title type="html">Meet Buoyant Data, and let me reduce your data platform costs</title><link href="https://brokenco.de//2023/01/02/introducing-buoyant-data.html" rel="alternate" type="text/html" title="Meet Buoyant Data, and let me reduce your data platform costs" /><published>2023-01-02T00:00:00+00:00</published><updated>2023-01-02T00:00:00+00:00</updated><id>https://brokenco.de//2023/01/02/introducing-buoyant-data</id><content type="html" xml:base="https://brokenco.de//2023/01/02/introducing-buoyant-data.html"><![CDATA[<p>One of the many things I learned in 2022 is that I have a particular knack for
understanding, analyzing, and optimizing the costs of data platform
infrastructure. These skills were born out of both curiosity and necessity in
the current economic climate, and have led me to start a small consultancy on
the side: <a href="https://www.buoyantdata.com/">Buoyant Data</a>. Big data infrastructure
can be hugely valuable to lots of businesses, but unfortunately it’s also an
area of the cloud bills that is frequently misunderstood. That’s something that
I can help with!</p>

<p><a href="https://www.duckbillgroup.com/about/">Mike Julian</a> from <a href="https://www.duckbillgroup.com/">The Duckbill
Group</a> once made the proclamation that the way
to <em>actually</em> save money in AWS is to design your infrastructure to be
cost-effective. “Optimization” techniques can only take you so far, and once
you’ve burned through all the optimizations, you may find yourself needing to
further reduce the cost of your infrastructure and have no more “fat” to trim! In the <a href="https://www.buoyantdata.com/blog/2022-12-18-initial-commit.html">first blog post</a> I outline a “reference architecture” for a data platform which I <strong>know</strong> is cost-effective, easy to manage, and lends itself well to growth.</p>

<p>Planning for sensible, cost-conscious growth is <em>very</em> important. With most data
platforms as they start to prove their value, the organization will bring even
<em>more</em> workloads to them. <a href="https://en.wikipedia.org/wiki/If_You_Give_a_Mouse_a_Cookie">If you give a data scientist a good
platform</a>, they
will find themselves wanting ever more from that data platform, and Buoyant
Data can help make sure that growth is sustainable <strong>and</strong> the value to the
business is easy to identify as well.</p>

<p>Please add the Buoyant Data <a href="https://www.buoyantdata.com/rss.xml">RSS feed</a> to your reader, as I have a number of blog posts queued up already with some gratis tips and tricks for understanding the cost of your data platform! 😄</p>

<hr />

<p>The technology stack for Buoyant Data is something I cannot wait to write more
about. After funding the creation of
<a href="https://github.com/delta-io/delta-rs">delta-rs</a> as part of my day job, I am
utilizing the library in a <strong>big</strong> way to build extremely lightweight and
cost-efficient data ingestion pipelines with Rust and AWS Lambda. There’s still
plenty of space for <a href="https://spark.apache.org">Apache Spark</a> on the querying
and processing side, but as
<a href="https://github.com/apache/arrow-datafusion">DataFusion</a> matures, I’m looking
forward to exploring where that can fit into the picture.</p>

<p>There’s a lot of evolution happening right now in the data and ML platform
space, I’m really looking forward to growing <a href="https://buoyantdata.com">Buoyant
Data</a> in my spare time!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="databricks" /><category term="software" /><category term="deltalake" /><category term="aws" /><summary type="html"><![CDATA[One of the many things I learned in 2022 is that I have a particular knack for understanding, analyzing, and optimizing the costs of data platform infrastructure. These skills were born out of both curiosity and necessity in the current economic climate, and have led me to start a small consuhltancy on the side: Buoyant Data. Big data infrastructure can be hugely valuable to lots of businesses, but unfortunately it’s also an area of the cloud bills that is frequently misunderstood, that’s something that I can help with!]]></summary></entry><entry><title type="html">Generating pre-signed S3 URLs in Rust</title><link href="https://brokenco.de//2021/05/13/presigned-urls-rusoto.html" rel="alternate" type="text/html" title="Generating pre-signed S3 URLs in Rust" /><published>2021-05-13T00:00:00+00:00</published><updated>2021-05-13T00:00:00+00:00</updated><id>https://brokenco.de//2021/05/13/presigned-urls-rusoto</id><content type="html" xml:base="https://brokenco.de//2021/05/13/presigned-urls-rusoto.html"><![CDATA[<p>Creating Pre-signed S3 URLs in Rust took me a little more brainpower than I had
anticipated, so I thought I would share how to generate them using
<a href="https://rusoto.github.io/">Rusoto</a>.  <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html">Pre-signed
URLs</a>
allow the creation of purpose-built URLs for fetching or uploading objects to
S3, and can be especially useful when granting access to S3 objects to mobile
or web clients. In my use-case, I wanted the clients of my web service to be able to access some specific objects from a bucket.</p>

<p>Rusoto supports the creation of pre-signed URLs via the <a href="https://docs.rs/rusoto_s3/0.46.0/rusoto_s3/util/trait.PreSignedRequest.html">PreSignedRequest</a> trait, which is implemented for <code class="language-plaintext highlighter-rouge">GetObjectRequest</code>, <code class="language-plaintext highlighter-rouge">PutObjectRequest</code>, etc. The trait exposes a simple method <code class="language-plaintext highlighter-rouge">get_presigned_url</code>, which returns a String containing the URL with all the query parameters needed for a pre-signed request. <em>Unfortunately</em>, these <code class="language-plaintext highlighter-rouge">GetObjectRequest</code> structs don’t really blend easily with an existing <a href="https://docs.rs/rusoto_s3/0.46.0/rusoto_s3/struct.S3Client.html">S3Client</a>: the appropriate region and credentials have to be passed in explicitly whenever you want to generate a URL.</p>

<p>Starting with the region, I re-use some code we have in <a href="https://github.com/delta-io/delta-rs">delta-rs</a> for identifying the region in a way that allows testing with localstack or minio via the <code class="language-plaintext highlighter-rouge">AWS_ENDPOINT_URL</code> environment variable:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">rusoto_core</span><span class="p">::</span><span class="n">Region</span><span class="p">;</span>

<span class="k">let</span> <span class="n">region</span> <span class="o">=</span> <span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="n">url</span><span class="p">)</span> <span class="o">=</span> <span class="nn">std</span><span class="p">::</span><span class="nn">env</span><span class="p">::</span><span class="nf">var</span><span class="p">(</span><span class="s">"AWS_ENDPOINT_URL"</span><span class="p">)</span> <span class="p">{</span>
    <span class="nn">Region</span><span class="p">::</span><span class="n">Custom</span> <span class="p">{</span>
        <span class="n">name</span><span class="p">:</span> <span class="nn">std</span><span class="p">::</span><span class="nn">env</span><span class="p">::</span><span class="nf">var</span><span class="p">(</span><span class="s">"AWS_REGION"</span><span class="p">)</span><span class="nf">.unwrap_or_else</span><span class="p">(|</span><span class="n">_</span><span class="p">|</span> <span class="s">"custom"</span><span class="nf">.to_string</span><span class="p">()),</span>
        <span class="n">endpoint</span><span class="p">:</span> <span class="n">url</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
    <span class="nn">Region</span><span class="p">::</span><span class="nf">default</span><span class="p">()</span>
<span class="p">};</span>
</code></pre></div></div>

<p>For most users, this code doesn’t really do much, but if you’ve got a custom <code class="language-plaintext highlighter-rouge">AWS_REGION</code> or <code class="language-plaintext highlighter-rouge">AWS_ENDPOINT_URL</code>, you need to properly construct a custom <code class="language-plaintext highlighter-rouge">Region</code> in order for Rusoto to work.</p>
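
<p>As an aside, the same decision can be pulled into a plain function that takes the two values as arguments, which makes it easy to unit test without touching the process environment. This is only a sketch using the standard library: <code class="language-plaintext highlighter-rouge">RegionChoice</code>, <code class="language-plaintext highlighter-rouge">choose_region</code>, and <code class="language-plaintext highlighter-rouge">region_from_env</code> are hypothetical names standing in for <code class="language-plaintext highlighter-rouge">rusoto_core::Region</code>, and are not part of Rusoto or delta-rs:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// Hypothetical stand-in for rusoto_core::Region, just enough for this sketch.
#[derive(Debug, PartialEq)]
enum RegionChoice {
    /// Defer to whatever the SDK would pick by default.
    Default,
    /// A custom endpoint such as localstack or minio.
    Custom { name: String, endpoint: String },
}

/// Same decision as the snippet above, but the two values are arguments,
/// so tests never have to mutate the process environment.
fn choose_region(endpoint_url: Option&lt;String&gt;, region_name: Option&lt;String&gt;) -&gt; RegionChoice {
    match endpoint_url {
        Some(endpoint) =&gt; RegionChoice::Custom {
            name: region_name.unwrap_or_else(|| "custom".to_string()),
            endpoint,
        },
        None =&gt; RegionChoice::Default,
    }
}

/// Thin wrapper that reads the real environment variables.
fn region_from_env() -&gt; RegionChoice {
    choose_region(
        std::env::var("AWS_ENDPOINT_URL").ok(),
        std::env::var("AWS_REGION").ok(),
    )
}
</code></pre></div></div>

<p>With this split, <code class="language-plaintext highlighter-rouge">choose_region(None, None)</code> yields <code class="language-plaintext highlighter-rouge">RegionChoice::Default</code>, while passing only an endpoint yields a <code class="language-plaintext highlighter-rouge">Custom</code> region named <code class="language-plaintext highlighter-rouge">"custom"</code>.</p>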

<p>The next important argument that <code class="language-plaintext highlighter-rouge">get_presigned_url</code> requires is a set of <code class="language-plaintext highlighter-rouge">AwsCredentials</code>, which come from a credentials provider that I was originally quite worried about hacking into place. Once again, I went looking at the delta-rs codebase for inspiration and noticed our use of <code class="language-plaintext highlighter-rouge">ChainProvider</code>, which tries its best to find the right AWS credentials given the user’s environment:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">rusoto_credential</span><span class="p">::</span><span class="n">ChainProvider</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">rusoto_credential</span><span class="p">::</span><span class="n">ProvideAwsCredentials</span><span class="p">;</span>

<span class="k">let</span> <span class="n">provider</span> <span class="o">=</span> <span class="nn">ChainProvider</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="k">let</span> <span class="n">credentials</span> <span class="o">=</span> <span class="n">provider</span><span class="nf">.credentials</span><span class="p">()</span><span class="k">.await</span><span class="o">?</span><span class="p">;</span>
</code></pre></div></div>

<p>With those two pieces in place, I could finally construct the URL!</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">rusoto_s3</span><span class="p">::</span><span class="n">GetObjectRequest</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">rusoto_s3</span><span class="p">::</span><span class="nn">util</span><span class="p">::{</span><span class="n">PreSignedRequest</span><span class="p">,</span> <span class="n">PreSignedRequestOption</span><span class="p">};</span>

<span class="k">let</span> <span class="n">options</span> <span class="o">=</span> <span class="n">PreSignedRequestOption</span> <span class="p">{</span>
    <span class="n">expires_in</span><span class="p">:</span> <span class="nn">std</span><span class="p">::</span><span class="nn">time</span><span class="p">::</span><span class="nn">Duration</span><span class="p">::</span><span class="nf">from_secs</span><span class="p">(</span><span class="mi">300</span><span class="p">),</span>
<span class="p">};</span>
<span class="k">let</span> <span class="n">req</span> <span class="o">=</span> <span class="n">GetObjectRequest</span> <span class="p">{</span>
    <span class="n">bucket</span><span class="p">:</span> <span class="s">"my-bucket"</span><span class="nf">.to_string</span><span class="p">(),</span>
    <span class="n">key</span><span class="p">:</span> <span class="s">"secret.txt"</span><span class="nf">.to_string</span><span class="p">(),</span>
    <span class="o">..</span><span class="nn">Default</span><span class="p">::</span><span class="nf">default</span><span class="p">()</span>
<span class="p">};</span>
<span class="k">let</span> <span class="n">url</span> <span class="o">=</span> <span class="n">req</span><span class="nf">.get_presigned_url</span><span class="p">(</span><span class="o">&amp;</span><span class="n">region</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">credentials</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">options</span><span class="p">);</span>
</code></pre></div></div>

<p>Of course, in your application, managing a shared credentials provider or region may change how this code is structured. However you manage them, as long as you can pass references to both into the <code class="language-plaintext highlighter-rouge">get_presigned_url</code> function, you can generate useful pre-signed URLs for S3, <a href="https://min.io">Minio</a>, etc.</p>]]></content><author><name>R. Tyler Croy</name></author><category term="rust" /><category term="aws" /><summary type="html"><![CDATA[Creating Pre-signed S3 URLs in Rust took me a little more brainpower than I had anticipated, so I thought I would share how to generate them using Rusoto. Pre-signed URLs allow the creation of purpose-built URLs for fetching or uploading objects to S3, and can be especially useful when granting access to S3 objects to mobile or web clients. In my use-case, I wanted the clients of my web service to be able to access some specific objects from a bucket.]]></summary></entry><entry><title type="html">Intentionally leaking AWS keys</title><link href="https://brokenco.de//2021/01/15/leaking-aws-keys.html" rel="alternate" type="text/html" title="Intentionally leaking AWS keys" /><published>2021-01-15T00:00:00+00:00</published><updated>2021-01-15T00:00:00+00:00</updated><id>https://brokenco.de//2021/01/15/leaking-aws-keys</id><content type="html" xml:base="https://brokenco.de//2021/01/15/leaking-aws-keys.html"><![CDATA[<p>“Never check secrets into source control” is one of those <em>rules</em> that are 100%
correct, until it’s not. There are no universal laws in software, and recently
I had a reason to break this one. I checked AWS keys into a Git repository. I
then pushed those commits to a <em>public</em> repository on GitHub. I did this
<strong>intentionally</strong>, and lived to tell the tale. You almost certainly should
never do this, so I thought I would share what happens when you do.</p>

<p>I can imagine you thinking: “this guy posted his AWS credentials on purpose? He
must be an idiot.” I don’t disagree with your conclusion, but just let me
explain!</p>

<p>My use-case is pretty simple: the
<a href="https://github.com/delta-io/delta-rs">delta-rs</a> project needed a real S3
bucket to do some integration testing. I decided to set up a real S3 bucket for
our (read-only) integration tests. Fortunately our tests just needed to
retrieve objects from a bucket to confirm that an S3 bucket is presenting
itself as a Delta table properly. I would have <em>never</em> done this if we needed
“write” operations on the bucket.</p>

<h2 id="preparing">Preparing</h2>

<p>AWS has an integral access control framework called IAM, not to be confused
with an anagram of “AMI” which <a href="https://twitter.com/QuinnyPig">Corey Quinn</a> can
help you learn how to pronounce. IAM allows crafting policies and roles for
just about everything in AWS a dozen or more different ways. It slices, it
dices, it keeps your buckets safe. It is also configured with JSON, which is
awful, but I’ll have to save those rantings for another blog post. Anyways,
here’s the read-only policy that I set up for the bucket:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="nl">"Version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2012-10-17"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Statement"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
            </span><span class="nl">"Sid"</span><span class="p">:</span><span class="w"> </span><span class="s2">"VisualEditor0"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"Effect"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Allow"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"Action"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
                </span><span class="s2">"s3:GetLifecycleConfiguration"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetBucketTagging"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetInventoryConfiguration"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetObjectVersionTagging"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:ListBucketVersions"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetBucketLogging"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:ListBucket"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetAccelerateConfiguration"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetBucketPolicy"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetObjectVersionTorrent"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetObjectAcl"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetEncryptionConfiguration"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetBucketObjectLockConfiguration"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetBucketRequestPayment"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetObjectVersionAcl"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetObjectTagging"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetMetricsConfiguration"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetBucketOwnershipControls"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetBucketPublicAccessBlock"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetBucketPolicyStatus"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:ListBucketMultipartUploads"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetObjectRetention"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetBucketWebsite"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetBucketVersioning"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetBucketAcl"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetObjectLegalHold"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetBucketNotification"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetReplicationConfiguration"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:ListMultipartUploadParts"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetObject"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetObjectTorrent"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetBucketCORS"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetAnalyticsConfiguration"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetObjectVersionForReplication"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetBucketLocation"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"s3:GetObjectVersion"</span><span class="w">
            </span><span class="p">],</span><span class="w">
            </span><span class="nl">"Resource"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
                </span><span class="s2">"arn:aws:s3:::deltars"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"arn:aws:s3:::deltars/*"</span><span class="w">
            </span><span class="p">]</span><span class="w">
        </span><span class="p">}</span><span class="w">
    </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
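
<p>A quick way to sanity-check a policy like this one is to confirm that every action is an <code class="language-plaintext highlighter-rouge">s3:Get*</code> or <code class="language-plaintext highlighter-rouge">s3:List*</code> action. The sketch below is only a naming heuristic for illustration, not an official AWS classification of read-only actions, and both function names are hypothetical. Every action in the policy above passes it:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// Returns true when an IAM action name looks read-only for S3,
/// i.e. it starts with "s3:Get" or "s3:List". This is a naming
/// heuristic for illustration, not an official AWS classification.
fn looks_read_only(action: &amp;str) -&gt; bool {
    action.starts_with("s3:Get") || action.starts_with("s3:List")
}

/// Collects every action that fails the heuristic so it can be reviewed by hand.
fn suspicious_actions&lt;'a&gt;(actions: &amp;[&amp;'a str]) -&gt; Vec&lt;&amp;'a str&gt; {
    actions
        .iter()
        .copied()
        .filter(|action| !looks_read_only(action))
        .collect()
}
</code></pre></div></div>

<p>Anything returned by <code class="language-plaintext highlighter-rouge">suspicious_actions</code>, such as <code class="language-plaintext highlighter-rouge">s3:PutObject</code> or <code class="language-plaintext highlighter-rouge">s3:DeleteObject</code>, deserves a second look before the keys go anywhere public.</p>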

<p>I also set up an <a href="https://aws.amazon.com/aws-cost-management/aws-budgets/">AWS
Budget</a> to alert me
should this ever start to cost real money. My current monthly costs in this
AWS account are almost $1.50, so my budget is set such that if/when this
starts costing me more than a couple of dollars a month, AWS will email me so I
can figure out what to do in order to save my Snapple money.</p>

<p>Finally, I created an IAM user for the integration tests. This IAM user has a
single IAM policy attached to it, listed out above. I then took the AWS access
key ID and secret access key for the IAM user and checked those into Git.</p>

<hr />

<p><strong>2021-01-19 update:</strong> An anonymous reader points out:</p>

<p><em>Certain AWS APIs cannot be disabled via IAM, <a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_GetCallerIdentity.html">including
<code class="language-plaintext highlighter-rouge">sts:GetCallerIdentity</code></a>
which in turn allows anyone with the public credentials to run the AWS
equivalent of <code class="language-plaintext highlighter-rouge">whoami</code>:</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% AWS_PROFILE=rtyler aws sts get-caller-identity
{
    "UserId": "AIDAX7EGEQ7F24XVIBAAL",
    "Account": "547889645515",
    "Arn": "arn:aws:iam::547889645515:user/deltars-ro"
}
</code></pre></div></div>

<p><em>AWS account numbers and IAM user ARNs are not especially privileged but be
aware that publishing access keys has a side effect of disclosing those too.</em></p>

<hr />

<h2 id="boom-goes-the-dynamite">Boom goes the dynamite</h2>

<p>After preparing the integration tests, I pushed <a href="https://github.com/delta-io/delta-rs/pull/63">my pull
request</a> at <strong>13:05 PST</strong>. When
code is pushed to GitHub, anything that looks like an AWS access key is
immediately identified by robots around the world, most of them
malicious in intent, but a few designed to help developers like me who make silly mistakes.</p>
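
<p>The general idea behind such scanners is easy to sketch, though what follows is an illustrative guess rather than how any particular service works. It assumes access key IDs look like the one in this post: the prefix <code class="language-plaintext highlighter-rouge">AKIA</code> followed by 16 uppercase letters or digits, 20 characters in total. The function names are hypothetical:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// Returns true if `token` has the assumed shape of an AWS access key ID:
/// the literal prefix "AKIA" followed by 16 uppercase letters or digits.
fn looks_like_access_key_id(token: &amp;str) -&gt; bool {
    token.len() == 20
        &amp;&amp; token.starts_with("AKIA")
        &amp;&amp; token.bytes().all(|b| b.is_ascii_uppercase() || b.is_ascii_digit())
}

/// Splits text on anything that is not an ASCII letter or digit and keeps
/// only the tokens that look like access key IDs.
fn find_access_key_ids(text: &amp;str) -&gt; Vec&lt;&amp;str&gt; {
    text.split(|c: char| !c.is_ascii_alphanumeric())
        .filter(|token| looks_like_access_key_id(token))
        .collect()
}
</code></pre></div></div>

<p>Running something like <code class="language-plaintext highlighter-rouge">find_access_key_ids</code> over every file in every public push is cheap, which helps explain why leaked keys are found so quickly.</p>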

<p>At <strong>13:05:36 PST</strong>, an AWS Support Case was opened in my account:</p>

<blockquote>
  <p>Dear AWS customer,</p>

  <p>We have become aware that the AWS Access Key AKIAX7EGEQ7FT6CLQGWH, belonging to IAM User deltars-ro, along with the corresponding Secret Key is publicly available online at https://github.com/rtyler/delta.rs/blob/b3581ee06eee26d971bd3b76bb788c85ecf0c6c0/rust/tests/s3_test.rs .</p>

  <p>Your security is important to us and this exposure of your account’s IAM credentials poses a security risk to your AWS account, could lead to excessive charges from unauthorized activity, and violates the AWS Customer Agreement or other agreement with us governing your use of our Services.</p>

  <p>To protect your account from excessive charges and unauthorized activity, we have applied the “AWSCompromisedKeyQuarantine” AWS Managed Policy (“Quarantine Policy”) to the IAM User listed above. The Quarantine Policy applied to the User protects your account by limiting permissions for high risk AWS services.
You can view the policy by going here: https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AWSCompromisedKeyQuarantine$jsonEditor?section=permissions .</p>

  <p>For your security, DO NOT remove the Quarantine Policy before following the instructions below. In cases where the Quarantine Policy is causing production issues you may detach the policy from the user. NOTE: Only users with admin privileges or with access to iam:DetachUserPolicy may remove the policy. For instructions on how to remove managed policies go here: https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html#remove-policies-console . In the event of the unauthorized use of your AWS account, we may, at our sole discretion, provide you with concessions. However, a failure to follow the instructions below may jeopardize your ability to receive a concession.</p>

  <p>If you believe you’ve received this note in error, please contact us immediately via the support case.</p>

  <p>PLEASE FOLLOW THE INSTRUCTIONS BELOW TO SECURE YOUR ACCOUNT:</p>

  <p>Step 1: Delete or rotate the exposed AWS Access Key AKIAX7EGEQ7FT6CLQGWH. To delete IAM User Keys go to your AWS Management Console here: https://console.aws.amazon.com/iam/home#users . To delete Root User Keys go here: https://console.aws.amazon.com/iam/home#security_credential .</p>

  <p>If your application uses the exposed Access Key, you need to replace the Key. To replace the Key, first create a second Key (at that point both Keys will be active) and then modify your application to use the new Key.
Then disable (but do not delete) the exposed Key by clicking on the “Make inactive” option in the console. If there are any problems with your application, you can reactivate the exposed Key. When your application is fully functional using the new Key, please delete the exposed Key.</p>

  <p>NOTE: Only rotating or deleting the exposed Key may not be sufficient to protect your account, see Step 2.</p>

  <p>Step 2: Check your CloudTrail log for unsanctioned activity such as the creation of unauthorized IAM users, policies, roles or temporary security credentials. To secure your account please delete any unauthorized IAM users, roles and policies, and revoke any temporary credentials.</p>

  <p>To delete unauthorized IAM User, navigate to https://console.aws.amazon.com/iam/home#users . To delete unauthorized policies go here: https://console.aws.amazon.com/iam/home#/policies . To delete unauthorized roles go here: https://console.aws.amazon.com/iam/home#/roles  .</p>

  <p>Unauthorized temporary credentials may have been created for the IAM User deltars-ro with the exposed AWS Access Key AKIAX7EGEQ7FT6CLQGWH. You can revoke temporary credentials by following instructions outlined here: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_control-access_disable-perms.html#denying-access-to-credentials-by-issue-time . Temporary credentials can also be revoked by deleting the IAM User. NOTE: Deleting IAM Users may impact production workloads and should be done with care.</p>

  <p>Step 3: Check your CloudTrail log to review your AWS account for any unauthorized AWS usage, such as unauthorized EC2 instances, Lambda functions or EC2 Spot bids. You can also check usage by logging into your AWS Management Console and reviewing each service page. The “Bills” page in the Billing console can also be checked for unexpected usage. https://console.aws.amazon.com/billing/home#/bill</p>

  <p>Please keep in mind that unauthorized usage can occur in any region and that your console may show you only one region at a time. To switch between regions, you can use the dropdown in the top-right corner of the console screen.</p>

  <p>Please take steps to prevent any new credentials from being publicly exposed. See Best Practices of Managing your Access Keys at http://docs.aws.amazon.com/general/latest/gr/aws-access-keys-best-practices.html .</p>

  <p>WE RECOMMEND THAT YOU ENABLE AMAZON GUARDDUTY:</p>

  <p>Amazon GuardDuty is an AWS threat detection service that helps you continuously monitor and protect your AWS accounts and workloads. Enabling Amazon GuardDuty on your accounts gives you further visibility into malicious or unauthorized activity, alerting you to take action in order to reduce the risk of harm. To learn more, visit: https://aws.amazon.com/guardduty .</p>

  <p>If you have any questions, you can contact us by accessing the newly created Support Case in your account’s Support Center. If you do not see a new case, you can create a case from the Support Center here: https://console.aws.amazon.com/support/home?#/</p>

  <p>Thank you for your immediate attention to this matter.</p>
</blockquote>

<p>I <em>also</em> got emails from two third-party services:
<a href="https://gitguardian.com">GitGuardian</a> at <strong>13:09 PST</strong> and
<a href="https://leakd.io">leakd.io</a> at <strong>14:56 PST</strong>. Nice try folks, but AWS was
already on top of it within literal seconds of my git push.</p>

<p>I ignored the third-party services and responded to the AWS Support Case to let
them know that my disclosure was in fact intentional. The support person surely
rolled their eyes before reminding me that I would be responsible for charges
on the account and still recommended that I:</p>

<ul>
  <li>Change the password for the root account.</li>
  <li>Delete and rotate all access keys.</li>
  <li>Check for possible unauthorized usage.</li>
</ul>

<hr />

<p>Normally this story doesn’t end well. I did this on purpose and planned
accordingly. There is one incident of leaking AWS keys on GitHub which I personally know the details of (friend of a friend, I swear!).
An errant <code class="language-plaintext highlighter-rouge">git add</code>
resulted in a local credentials file being pushed to a personal, but public
repository. Because the email account linked to the AWS account was not
regularly checked, the key was used abusively to rack up a few hundred dollars
on an AWS bill before the keys were revoked.</p>

<p>If you add anything that looks like AWS keys to a public repository, website,
or really anything on the internet, malicious actors will download the keys and
try to launch services in your AWS account. Typically these are cryptocurrency
miners or spam gateways: anything that costs a lot of money, which they’re
happy you’ve volunteered to pay for.</p>

<p>Don’t check your AWS credentials into GitHub!</p>

<p>But if you must, do it safely :)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="security" /><category term="github" /><category term="aws" /><summary type="html"><![CDATA[“Never check secrets into source control” is one of those rules that are 100% correct, until it’s not. There are no universal laws in software, and recently I had a reason to break this one. I checked AWS keys into a Git repository. I then pushed those commits to a public repository on GitHub. I did this intentionally, and lived to tell the tale. You almost certainly should never do this, so I thought I would share what happens when you do.]]></summary></entry><entry><title type="html">Changing the way the world reads at Scribd</title><link href="https://brokenco.de//2019/11/25/building-the-library.html" rel="alternate" type="text/html" title="Changing the way the world reads at Scribd" /><published>2019-11-25T00:00:00+00:00</published><updated>2019-11-25T00:00:00+00:00</updated><id>https://brokenco.de//2019/11/25/building-the-library</id><content type="html" xml:base="https://brokenco.de//2019/11/25/building-the-library.html"><![CDATA[<p>This week we launched the
<a href="https://tech.scribd.com">Scribd tech blog</a>, on which I published today’s
article: <a href="https://tech.scribd.com/blog/2019/building-the-library.html">We’re building the largest library in
history</a>. I
frequently have to remind myself that I have been here less than a year, and we
have undergone incredible positive change, with more coming in 2020.</p>

<p>The <a href="https://tech.scribd.com/blog/2019/building-the-library.html">post</a>
outlines a high-level idea of what is to come for technology at Scribd in the
coming year or two, related to our <a href="https://blog.scribd.com/home/scribd-announces-58-million-strategic-investment-led-by-spectrum-equity">announcement
today</a>
of a major round of funding:</p>

<blockquote>
  <p>Today we are excited to announce Scribd has closed $58 million in equity
financing led by Spectrum Equity. The investment will be used to support
growth and product innovation, enhance operations, and further the company’s
mission to change the way the world reads.</p>
</blockquote>

<p>The most important detail I was able to share in the blog post is in the
Infrastructure section:</p>

<blockquote>
  <p>The future of our infrastructure, and our applications, is <strong>entirely in the
cloud</strong>. The migration [to AWS] requires shifting workloads between
datacenters with a tiny error and downtime budget. At our size, that’s many
terabytes of data and thousands of requests per second, which dictates
serious upfront planning, automation, testing, and monitoring of every facet
of our environment.</p>
</blockquote>

<p>Hiding behind this paragraph has been a tremendous amount of my time from these
past few months. When I arrived at Scribd in January, there were no plans in the
roadmap to adopt a cloud provider for our infrastructure. I must have
been the straw that broke the camel’s back. “We need to move into the cloud”
was met with “We agree! What’s your plan?” And then it became one of the many
plates I have kept spinning.</p>

<p>We have already migrated a few services, including a major production service
which Core Platform moved over without any issues; I’m very proud of that one!</p>

<p>Unlike many “datacenter to cloud” migrations, I believe ours is unique in that
we have:</p>

<ul>
  <li>A very limited error and downtime budget.</li>
  <li>The green-light to share the process as we go along.</li>
</ul>

<p>I’m looking forward to sharing more on
<a href="https://tech.scribd.com">tech.scribd.com</a>
(<a href="https://tech.scribd.com/feed.xml">RSS</a>) as we move to AWS. I hope you’ll tune
in!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="scribd" /><category term="aws" /><summary type="html"><![CDATA[This week we launched the Scribd tech blog, on which I published today’s article: We’re building the largest library in history. I frequently have to remind myself that I have been here less than a year, and we have undergone incredible positive change, with more coming in 2020.]]></summary></entry><entry><title type="html">Defining the Real-time Data Platform</title><link href="https://brokenco.de//2019/08/28/real-time-data-platform.html" rel="alternate" type="text/html" title="Defining the Real-time Data Platform" /><published>2019-08-28T00:00:00+00:00</published><updated>2019-08-28T00:00:00+00:00</updated><id>https://brokenco.de//2019/08/28/real-time-data-platform</id><content type="html" xml:base="https://brokenco.de//2019/08/28/real-time-data-platform.html"><![CDATA[<p>One of the harder parts about building new platform infrastructure at a company
which has been around a while is figuring out exactly <em>where</em> to
begin. At <a href="https://www.scribd.com/about/engineering">Scribd</a> the company has
built a good product and curated a large corpus of written content, but
where next? As I alluded to in <a href="/2019/08/22/platform-engineering-at-scribd.html">my previous
post</a> about the Platform
Engineering organization, our “platform” components should help scale out,
accelerate, or open up entirely new avenues of development. In this article, I
want to describe one such project we have been working on and share some of the
thought process behind its inception and prioritization: the Real-time Data
Platform.</p>

<p>(sounds fancy huh?)</p>

<p>My first couple weeks at the company were intense. 
The idea of “Core Platform” was sketched out as a team “to scale apps and data” but that
was about the extent of it. The task I took on was to learn as much as I could,
as quickly as I could, in order to get the recruiting and hiring machine
started. Basically, I
needed to point Core Platform in a direction that was correct enough at a high
level in order to know what skills my future colleagues should have. While I
had <em>tons</em> of discussions and did plenty of reading, I almost feel sheepish to
admit that much of our direction was heavily influenced by two
conversations, both of which took less than an hour.</p>

<p>The first was with <a href="https://www.linkedin.com/in/kperko">Kevin Perko</a> (KP), the head
of our <a href="https://www.scribd.com/about/data_science">Data Science team</a>. His team
interacts the most with our current data platform (HDFS, Spark, Hive, etc.); in
essence, Data Science is one of our customers. I asked some
variant of “what’s wrong with the data infrastructure?” and KP unloaded what
must have been months of pent-up frustrations shared by his entire team. The
themes that emerged were:</p>

<ul>
  <li>Developers don’t think about the consumers of the data. Garbage in, garbage
out!</li>
  <li>Many nightly tasks spend a <em>lot</em> of time performing unnecessary pre-processing of data.</li>
  <li>The performance of the system is generally poor. Depending on the time of
day, ad-hoc queries from data scientists compete with automated tasks for
resources.</li>
  <li>Everything has to be done in this nightly graph of dependent tasks, and when
something goes wrong, recovering from errors is very manual and typically
ruins somebody’s day.</li>
</ul>

<p>After I assured KP that these were problems we would be solving, his next statement
would become a mainstay of our relationship moving forward: “<em>when will it be
ready?</em>”</p>

<p>My second influential conversation was with <a href="https://twitter.com/mikkelewis">Mike
Lewis</a>, the head of Product. This conversation
was quite simple and didn’t involve as much trauma counseling as the previous.
I asked “what can’t you do today because of our technology limitations?” This
is a good question to ask product teams every now and again. They are frequently
optimizing within their current constraints. One role of
platform and infrastructure teams is to remove those constraints. We discussed
the way in which users convert from passersby, to trial, to paid subscribers.
He also highlighted the importance of our recommendations and search results in
this funnel, and lamented the speed at which we can highlight relevant content
to new users. The maxim goes: the faster a new user sees relevant and
interesting content, the more likely they are to stick around.</p>

<p>Pattern matching between the current problems and the technology needed to
enable new product initiatives, I named and defined the high-level objective for
the <strong>Real-time Data Platform</strong> as follows:</p>

<blockquote>
  <p><em>To provide a streaming data platform for collecting and acting upon behavioral data
in near real-time with the ultimate goal to enable day zero personalization in
Scribd’s products.</em></p>
</blockquote>

<p>In more concrete terms, the platform is a collection of cloud-based services
(in AWS, more on that later) for ingesting, processing, and storing behavioral
events from frontend, backend, and mobile clients.  The scope of the Real-time
Data Platform extends from event definition and schema, to the layout of events
persisted into long-term queryable storage, and the tooling which sits on
top of that queryable storage.</p>

<p>As the nominal “product owner” for the effort, I aimed to describe less about
what tools and technologies should be used, and instead forced myself to define
tech-agnostic requirements, thereby leaving the discovery work for the team I
would ultimately hire.</p>

<p>The Real-time Data Platform must have:</p>

<ul>
  <li>A high, nearing 100%, data SLA. This means we must design in such a way as to reduce
data loss or corruption at every point of the pipeline.</li>
  <li>Data provenance maintained through the pipeline from data creation to usage. In
essence, a Data Scientist should be able to easily track data from where it
originated, and understand the transformative steps along the way.</li>
  <li>Event streams should be considered API contracts, with schemas suggested or
enforced when possible. A consumer of an event stream should be able to
trust the quality of the events in that stream.</li>
  <li>Data processing and transformation must happen as close to ingestion as
possible. Events which arrive in long-term storage must be structured and
partitioned for optimal query performance with zero or minimal post-processing
required for most use-cases.</li>
  <li>The platform must scale as the data volume grows without requiring
significant redesign or rework.</li>
</ul>

<p>In essence, we need to change a number of foundational ways in which we
generate, transfer, and consider the data which Scribd uses. As Core Platform
has peeled back layer after layer of this onion, we have been able to affirm at
each step of the way that we’re moving in the right direction, which is by
itself quite exciting.</p>

<p>The design of the Real-time Data Platform which we’re currently building out is
something I will share at a high level in a subsequent blog post.</p>

<p>I want to finish this one with some parting thoughts. If you are building
<em>anything</em> foundational in a technology organization, you <strong>must</strong> talk to the
product team. You must also talk to your customers, but I don’t like to ask
them what they want; I like to ask what they don’t like and don’t want. Listen
to that negative feedback, and understand what lies beneath the frustrations.
Finally, have a vision for the future, but build and deliver incrementally.
When I first sketched this out, I was forthcoming in stating “this is a 2020
project.” I made sure to clarify that this did not mean we wouldn’t deliver anything
to the business for 18 months. Instead, I made sure to explain that
executing on this overall vision would be a long journey with milestones along
the way.</p>

<p>If you haven’t ever watched a skyscraper being built, you would be amazed at
how much of the time is spent digging a great big hole, sinking steel into
bedrock, and pouring concrete. Months of people working in a city block-sized
hole before anything takes shape that even resembles a skyscraper.  Building
strong foundations takes time, but that is in essence the role of any platform
and infrastructure organization. The challenge is to keep the business moving
forward today while <em>also</em> building those fundamental components upon which the
business will stand in a year or two.</p>

<p>It is tough, but that’s exactly what I signed up for. :)</p>]]></content><author><name>R. Tyler Croy</name></author><category term="kafka" /><category term="scribd" /><category term="aws" /><summary type="html"><![CDATA[One of the harder parts about building new platform infrastructure at a company which has been around a while is figuring out exactly where to begin. At Scribd the company has built a good product and curated a large corpus of written content, but where next? As I alluded to in my previous post about the Platform Engineering organization, our “platform” components should help scale out, accelerate, or open up entirely new avenues of development. In this article, I want to describe one such project we have been working on and share some of the thought process behind its inception and prioritization: the Real-time Data Platform.]]></summary></entry></feed>