rtyler

Screaming in the Cloud

2026-02-13T00:00:00+00:00

One of the reasons I work where I work is because of the fascinating data-at-scale problems that they have. This has led me deep into the world of Delta Lake and AWS S3. Not one to take anything too seriously, I have been cooking up absolutely bonkers solutions to some of these billions-scale challenges I am tasked with solving.

Recently I was fortunate enough to discuss some of the objectively insane ideas with an old PuppetConf pal Corey Quinn.

In this post I wrote about the design of Content Crush and how Scribd is consolidating objects in S3 to minimize our costs.

Checking if files are damaged? $100K. Using newer S3 tools? Way too expensive. Normal solutions don’t work anymore. Tyler shares how with this much data, you can’t just throw money at the problem, but rather you have to engineer your way out.

For better or worse I have been having so much fun coming up with crazy data solutions during the day that I am also doing it on nights and weekends with my consultancy, Buoyant Data.

In the coming months I’m expecting to have some more time free up, so I’m hoping to find another couple clients who need some AWS and data expertise to spice up their infrastructure! You can find me at rtyler@buoyantdata.com for that type of thing, but if you just want to share your own crazy ideas with me, or commiserate with me about S3, you can find me at rtyler@brokenco.de.

R.I.P. S3 Object Lambda

2025-10-15T00:00:00+00:00

Did you know that AWS S3 is almost 20 years old? The “cloud” as a concept is fairly recent but in the time-distortion that has occurred since the rise of the internet, I think many of us have lost track of how old some of these public cloud providers are, and as a side-effect, how old their technology offerings can become. Periodically you need to clean out the attic, and this week AWS did just that with their “AWS Service Availability Updates.”

In the list of services that probably have fewer users than most YC startups, was one which I had recently found incredibly useful: S3 Object Lambda.

From Corey of Last Week in AWS infamy:

S3 Object Lambdas have always been a bit weird. You can still have Lambdas operate on S3, and at least actual Lambdas are likely to see service improvements; Object Lambdas have been moribund for years.

Object Lambda is admittedly a niche product. But what makes it quite interesting for my purposes is it allows you to modify S3 requests en route. It is by far the fastest way to add custom business logic around data stored in S3 while preserving S3’s API and semantics.

For example, you can create a completely fabricated key space with S3 Object Lambda that represents a logical object layout, even if your physical object layout, the actual bytes stored in S3, does not match.
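A rough sketch of the logical-versus-physical idea, not how S3 Object Lambda itself is implemented: a lookup table maps each logical key to a byte range inside a larger physical object, and a request for the logical key becomes a ranged read of the physical one. The type names, keys, and offsets here are all hypothetical.

```rust
use std::collections::HashMap;

/// Where the bytes for a logical key actually live (hypothetical layout).
struct PhysicalLocation {
    /// Key of the object that really exists in the bucket.
    key: String,
    /// Offset of the first byte of the logical object.
    offset: u64,
    /// Length of the logical object in bytes.
    length: u64,
}

/// Maps the fabricated (logical) key space onto physical objects.
#[derive(Default)]
struct KeySpace {
    entries: HashMap<String, PhysicalLocation>,
}

impl KeySpace {
    fn insert(&mut self, logical: &str, key: &str, offset: u64, length: u64) {
        self.entries.insert(
            logical.to_string(),
            PhysicalLocation { key: key.to_string(), offset, length },
        );
    }

    /// Translate a logical key into the physical key plus the HTTP Range
    /// header value a ranged read would need.
    fn resolve(&self, logical: &str) -> Option<(String, String)> {
        let loc = self.entries.get(logical)?;
        // Empty objects have no valid byte range, so treat them as absent here.
        if loc.length == 0 {
            return None;
        }
        // HTTP byte ranges are inclusive on both ends.
        let last = loc.offset + loc.length - 1;
        Some((loc.key.clone(), format!("bytes={}-{}", loc.offset, last)))
    }
}
```

A caller would see `docs/a.txt` and `docs/b.txt` as separate objects even though both resolve to different ranges of the same physical object.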

As handy as I think S3 Object Lambda is, when I spoke with some folks responsible for S3 Object Lambda at AWS earlier this year, it became clear that there was no further investment in the feature. To me the writing was on the wall that AWS was going to kill the feature eventually, so I proactively shifted any work where it was present.

S3 Object Lambda now joins the graveyard next to S3 Select and closes the book on “what if S3 were a data application platform.” Instead AWS continues to push vectors, vectors, VECTORS! Pivoting towards “what if S3 were an AI platform?”

The thing about appendable objects in S3

2025-08-26T00:00:00+00:00

Storing bytes at scale is never as simple as we lead ourselves to believe. The concept of files, or in the cloud “objects”, is a useful metaphor for an approximation of reality but it’s not actually reality. As I have fallen deeper and deeper into the rabbit hole, my mental model of what storage really is has been challenged at every turn.

This evening I was at the San Francisco FinOps Meetup with the nice folks from Chime and the Duckbill Group. Corey asked some questions about S3 Express One Zone that I thought warranted a little bit more thought.

Last year Amazon announced that S3 Express One Zone now supports the ability to append data to an object.

Setting aside the discussion on whether S3 Express One Zone is actually useful for a moment, I want to focus on the “appendable object” concept.

Applications that continuously receive data over a period of time need the ability to add data to existing objects. For example, log-processing applications continuously add new log entries to the end of existing log files. Similarly, media-broadcasting applications add new video segments to video files as they are transcoded and then immediately stream the video to viewers.

I don’t know much about media-broadcasting applications, so perhaps this functionality is useful there, but I know a lot about log-processing applications.

Corey’s fundamental question about appendable objects: is this useful in S3 Standard?

After a good hour or two of consideration, I am going to say pretty definitively: probably not.

Appendable objects work by requiring the writer, the caller of PutObject, to specify the offset in the object at which to put new bytes. This pushes a coordination requirement onto the writer, and I have difficulty conceiving of a way to make that work in real-world applications.

Setting Standard aside, I am having trouble grappling with how to design an application to use this functionality. Take the example provided in the AWS docs:

aws s3api put-object --bucket amzn-s3-demo-bucket--azid--x-s3 \
        --key sampleinput/file001.bin \
        --body bucket-seed/file001.bin \
        --write-offset-bytes size-of-sampleinput/file001.bin

My application has written 4096kB of file001.bin
I have more data to append, so I need to know that I am the only instance appending to file001.bin
I also need to know that no other process has appended to file001.bin past the original 4096kB boundary
Then I PutObject the next 4096kB.

There is external-to-S3 coordination that would be required by an application to make sure two concurrent appenders don’t ever touch the same file. In fact, the only safe way I can imagine this working is to put a lock entry into a DynamoDB table saying process-A is appending to file001.bin, and then the process would need to send HeadObject to make absolutely certain it had the correct offset bytes before issuing a write.
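To make the race concrete, here is a toy in-memory model of a single appendable object. It is not the S3 API: the types are hypothetical, and the sketch assumes the store rejects any append whose offset does not equal the current object size. Even with that check, each writer still has to learn the current size first, and a writer working from a stale size loses.

```rust
/// Toy stand-in for one appendable object. Not the S3 API.
#[derive(Default)]
struct AppendableObject {
    data: Vec<u8>,
}

#[derive(Debug, PartialEq)]
enum AppendError {
    /// The offset the caller asked for did not match the current size.
    OffsetMismatch { requested: u64, actual: u64 },
}

impl AppendableObject {
    /// Plays the role of HeadObject in this model: report the current size.
    fn size(&self) -> u64 {
        self.data.len() as u64
    }

    /// Append `bytes` only if `offset` equals the current size.
    /// Returns the new size on success.
    fn append_at(&mut self, offset: u64, bytes: &[u8]) -> Result<u64, AppendError> {
        let actual = self.size();
        if offset != actual {
            return Err(AppendError::OffsetMismatch { requested: offset, actual });
        }
        self.data.extend_from_slice(bytes);
        Ok(self.size())
    }
}
```

If two writers both observe a size of 4096 and then each call `append_at(4096, ..)`, the first succeeds and the second gets `OffsetMismatch`. Without some external lock, every writer needs a re-read-and-retry loop, and nothing in this model tells a reader whether the last append was the final one.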

For an application where a single process is guaranteed to operate on a single object in S3, this would be viable, but I would need to make sure the application architecture ensures a number of guarantees are in place.

From a reliability standpoint, I don’t know what would happen should a process crash in the middle of a write. Is the object forever corrupted? Are parts left in limbo like when multi-part uploads are aborted? Perhaps at AWS their applications don’t crash in the middle of I/O operations, but I can confidently say that applications I write crash all the time!

Byte offsets are just so damn dangerous.

As Corey now knows I have a love/hate relationship with Apache Parquet, which has been designed with a lot of lessons learned from large scale data systems. Byte offsets as a way to write segments of an object are extremely likely to lead to corrupted data. Developers like to joke about the two hard problems in computer science:

Caching
Naming things
Off-by-one errors

The probability of an application corrupting its own data is 1.0.

With Apache Parquet the footer contains the important metadata about the data contained within the file. One major benefit of the design is that the data must have been written first for a valid file to exist. Contrast this to Apache Avro, which I am decidedly less fond of. Avro starts with the file header and then data blocks. The data blocks on their own indicate how long each block is, but as far as I can tell there is no way for a reader to tell if all the necessary data blocks were actually written to storage. You can easily tell if a data block was partially written, but I don’t believe you can tell if a data block is simply missing.

The “finalization” of an Apache Parquet footer provides a very useful end for the write of any particular data application.
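As a minimal sketch of why that finalization is handy: a plain (unencrypted) Parquet file starts with the 4-byte magic `PAR1` and ends with a 4-byte little-endian footer length followed by `PAR1` again, so a reader can cheaply tell whether the writer ever got as far as the footer. The function and enum names below are hypothetical, and this only checks the file’s framing; it does not parse the footer metadata itself.

```rust
const MAGIC: &[u8; 4] = b"PAR1";

#[derive(Debug, PartialEq)]
enum FooterCheck {
    /// Both magic markers are present and the footer length fits in the file.
    LooksComplete { footer_len: u32 },
    /// Too short to hold both magic markers and the footer length.
    TooShort,
    /// The file does not start with the magic bytes.
    MissingHeaderMagic,
    /// The file does not end with the magic bytes, e.g. the writer died early.
    MissingTrailerMagic,
    /// The declared footer length is larger than the bytes available.
    FooterLengthOutOfBounds,
}

fn check_parquet_trailer(bytes: &[u8]) -> FooterCheck {
    // 4 bytes header magic + 4 bytes footer length + 4 bytes trailer magic.
    if bytes.len() < 12 {
        return FooterCheck::TooShort;
    }
    if !bytes.starts_with(MAGIC) {
        return FooterCheck::MissingHeaderMagic;
    }
    if !bytes.ends_with(MAGIC) {
        return FooterCheck::MissingTrailerMagic;
    }
    let n = bytes.len();
    let mut len_bytes = [0u8; 4];
    len_bytes.copy_from_slice(&bytes[n - 8..n - 4]);
    let footer_len = u32::from_le_bytes(len_bytes);
    // The footer must fit between the header magic and the 8 trailing bytes.
    if footer_len as usize > n - 12 {
        return FooterCheck::FooterLengthOutOfBounds;
    }
    FooterCheck::LooksComplete { footer_len }
}
```

A file cut off anywhere before its last byte fails the trailer check, which is exactly the “did the write finish?” signal that a stream of appended blocks lacks.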

Just answer the question

Fine, okay, what were we talking about again?

Corey wants to know whether appendable objects are useful in S3 Standard?

Appendable objects require application level coordination which is largely impractical for most developers, myself included, to safely manage. Standard tier introduces the challenges of availability zones to the discussion, cross-AZ latencies, and a myriad of other distributed computing problems. What would be useful is cheaper output conversions and transformations from Kinesis Firehose. Most append-oriented applications I have seen, built, or designed, require something in the shape of a Kinesis, Apache Kafka, or similar to provide that mission-critical durable data ordering function.

Output conversion with Kinesis is an incredibly novel tool at our disposal. While expensive it makes turning data streams into objects in S3 very simple.

Appendable objects are best suited for applications where losing data or corrupting objects is acceptable.

Management has kindly requested that I stop building such applications, so I’ll stick to more durable data primitives for now.

Ditching the cloud is most likely a bad idea

2023-02-21T00:00:00+00:00

I have the dubious honor of leading a migration from an on-premise managed colocation facility into AWS. It was necessary to help the business succeed, but frankly I would rather not have needed to do it. Earlier this morning I saw a post about “leaving the cloud” by that attention-seeking guy who keeps trying to keynote RailsConf, and I had some opinions. I was hopped up on caffeine and free office snacks, and just could not help but share my thoughts in the fediverse.

Long story short, I think the original author’s analysis is nonsense and will most likely result in him Musking his own company. Either way, here are some thoughts saved for posterity:

I have always disliked this dude’s simpleton analyses but IF you are considering leaving AWS (or other cloud providers) you must include:

Operational cost: which is all that the original author’s analysis includes.
Labor cost: migrations use people’s time, which is typically the biggest portion of a company’s budget.
Opportunity cost: managing infrastructure or migrating it means you’re not investing in growing the business. If your business isn’t about running infrastructure (e.g. CloudFlare, Fastly, etc), this typically means you’re actively harming your business by focusing elsewhere.

But there’s so much more!

IF the business’ workloads are CPU intensive and consistent, buying metal might be cheaper.

Otherwise, if your math shows that on-premise is cheaper, then I would have questions about the current infrastructure. Are you using:

ECS/Fargate is crazy cheap and works great for almost all web apps you can shove into a container.
AWS Aurora is crazy good and makes a lot of RDBMS work and scaling easy.
AWS Savings Plans help further reduce costs for predictable compute.

IF the business already has a big investment into AWS S3, I hope you’re planning to get punished with S3 egress costs.

S3 is a modern marvel, as Corey Quinn has said. You literally cannot make faster, cheaper, or more resilient storage. But AWS uses cost to encourage you not to walk away from S3.

Depending on the relation of the application to the S3 storage, transit fees can eat you alive.

IF the business’ SLAs allow for the risk of a single-site on-premise deployment, that’s cool.

AWS can have downtimes but it can be enlightening to ask the ops old guard about the time suck of configuration management, rack management, or dealing with RMAs with shitty hardware vendors.

I don’t relish funding Jeff Bezos’ next super yacht any more than you do, but the stack you can get on AWS is unrivaled in its cost, reliability, and ease of use.

Nobody gives AWS enough credit for their security work.

Building secure infrastructure is really challenging. There’s patch management, role-based access control systems, data encryption needs, certificates, all sorts of things.

Not all clouds do it well (lol azure).

But walking away from VPCs, Security Groups (Network Isolation), IAM (Role-based access controls), CloudTrail (audit logging), GuardDuty (intrusion detection), and automated upgrades for managed services would have me very seriously questioning what security posture the org may or may not have.

Anyways, I don’t love AWS. It’s a monoculture and it makes an ugly anti-competitive business viable.

It’s still the right choice in my opinion for the vast majority of businesses.

The problem with ML

2023-01-04T00:00:00+00:00

The holidays are the time of year when I typically field a lot of questions from relatives about technology or the tech industry, and this year my favorite questions were around AI. (insert your own scary music) Machine-learning (ML) or Artificial Intelligence (AI) are being widely deployed and I have some Problems™ with that. Machine learning is not necessarily a new domain; the practices commonly accepted as “ML” have been used for quite a while to support search and recommendations use-cases. In fact, my day job includes supporting data scientists and those who are actively creating models and deploying them to production. However, many of my relatives outside of the tech industry believe that “AI” is going to replace people, their jobs, and/or run the future. I genuinely hope AI/ML comes nowhere close to this future imagined by members of my family.

Like many pieces of technology, it is not inherently good or bad, but the problem with ML as it is applied today is that its application is far outpacing our understanding of its consequences.

Brian Kernighan, co-creator of the C programming language and UNIX, said:

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?

Setting aside the mountain of ethical concerns around the application of ML which have and should continue to be discussed in the technology industry, there’s a fundamental challenge with ML-based systems: I don’t think their creators understand how they work, how their conclusions are determined, or how to consistently improve them over time. Imagine you are a data scientist or ML developer, how confident are you in what your models will predict between experiments or evolutions of the model? Would you be willing to testify in a court of law about the veracity of your model’s output?

Imagine you are a developer working on the models that Tesla’s “full self-driving” (FSD) mode relies upon. Your model has been implicated in a Tesla killing the driver and/or pedestrians (which has happened). Do you think it would be possible to convince a judge and jury that your model is not programmed to mow down pedestrians outside of a crosswalk? How do you prove what a model is or is not supposed to do given never before seen inputs?

Traditional software does have a variation of this problem, but source code lends itself to scrutiny far better than ML models, many of which have come from successive evolutions of public training data, proprietary model changes, and integrations with new data sources.

These problems may be solvable in the ML ecosystem, but the problem is that the application of ML is outpacing our ability to understand, monitor, and diagnose models when they do harm.

That model your startup is working on to help accelerate home loan approvals based on historical mortgages: how do you assert that your models are not re-introducing racist policies like redlining? (Forms of this have happened.)

How about that fun image generation (AI art!) project you have been tinkering with? It uses a publicly available model that was trained on millions of images from the internet, and as a result in some cases unintentionally outputs explicit images, or even what some jurisdictions might consider bordering on child pornography (forms of this have happened).

Really anything you teach based on the data “from the internet” is asking for racist, pornographic, or otherwise offensive results, as the Microsoft Tay example should have taught us.

Can you imagine the human-rights nightmare that could ensue from shoddy ML models being brought into a healthcare setting? Law-enforcement? Or even military settings?

Machine-learning encompasses a very powerful set of tools and patterns, but our ability to predict how those models will be used, what they will output, or how to prevent negative outcomes is dangerously insufficient for use outside of search and recommendation systems.

I understand how models are developed, how they are utilized, and what I think they’re supposed to do.

Fundamentally the challenge with AI/ML is that we understand how to “make it work”, but we don’t understand why it works.

Nonetheless we keep deploying “AI” anywhere there’s funding, consequences be damned.

And that’s a problem.

Meet Buoyant Data, and let me reduce your data platform costs

2023-01-02T00:00:00+00:00

One of the many things I learned in 2022 is that I have a particular knack for understanding, analyzing, and optimizing the costs of data platform infrastructure. These skills were born out of both curiosity and necessity in the current economic climate, and have led me to start a small consultancy on the side: Buoyant Data. Big data infrastructure can be hugely valuable to lots of businesses, but unfortunately it’s also an area of the cloud bill that is frequently misunderstood. That’s something that I can help with!

Mike Julian from The Duckbill Group once made the proclamation that the way to actually save money in AWS is to design your infrastructure to be cost-effective. “Optimization” techniques can only take you so far, and once you’ve burned through all the optimizations, you may find yourself needing to further reduce the cost of your infrastructure and have no more “fat” to trim! In the first blog post I outline a “reference architecture” for a data platform which I know is cost-effective, easy to manage, and lends itself well to growth.

Planning for sensible, cost-conscious growth is very important. With most data platforms, as they start to prove their value, the organization will bring even more workloads to them. If you give a data scientist a good platform, they will find themselves wanting ever more from that data platform, and Buoyant Data can help make sure that growth is sustainable and the value to the business is easy to identify as well.

Please add the Buoyant Data RSS feed to your reader, as I have a number of blog posts queued up already with some gratis tips and tricks for understanding the cost of your data platform! 😄

The technology stack for Buoyant Data is something I cannot wait to write more about. After funding the creation of delta-rs as part of my day job, I am utilizing the library in a big way to build extremely lightweight and cost-efficient data ingestion pipelines with Rust and AWS Lambda. There’s still plenty of space for Apache Spark on the querying and processing side, but as DataFusion matures, I’m looking forward to exploring where that can fit into the picture.

There’s a lot of evolution happening right now in the data and ML platform space, and I’m really looking forward to growing Buoyant Data in my spare time!

Generating pre-signed S3 URLs in Rust

2021-05-13T00:00:00+00:00

Creating pre-signed S3 URLs in Rust took me a little more brainpower than I had anticipated, so I thought I would share how to generate them using Rusoto. Pre-signed URLs allow the creation of purpose-built URLs for fetching or uploading objects to S3, and can be especially useful when granting access to S3 objects to mobile or web clients. In my use-case, I wanted the clients of my web service to be able to access some specific objects from a bucket.

Rusoto supports the creation of pre-signed URLs via the PreSignedRequest trait, which is implemented for GetObjectRequest, PutObjectRequest, etc. The trait exposes a simple method, get_presigned_url, which returns a String with all the query parameters needed for a pre-signed request. Unfortunately, these GetObjectRequest structs don’t really blend easily with an existing S3Client, and the call needs the appropriate region and credentials passed in whenever you want to use it.

Starting with the region, I re-use some code we have in delta-rs for identifying the region in a way that allows testing with localstack or minio via the AWS_ENDPOINT_URL environment variable:

use rusoto_core::Region;

let region = if let Ok(url) = std::env::var("AWS_ENDPOINT_URL") {
    Region::Custom {
        name: std::env::var("AWS_REGION").unwrap_or_else(|_| "custom".to_string()),
        endpoint: url,
    }
} else {
    Region::default()
};

For most users, this code doesn’t really do much, but if you’ve got a custom AWS_REGION or AWS_ENDPOINT_URL, you need to properly construct a custom Region in order for Rusoto to work.

The next important argument that get_presigned_url requires is a set of AwsCredentials, which I was originally quite worried about hacking into place. Once again I went looking at the delta-rs codebase for inspiration and noticed our use of ChainProvider, which tries its best to find the right AWS credentials given the user’s environment:

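use rusoto_credential::ChainProvider;
use rusoto_credential::ProvideAwsCredentials;

// ChainProvider looks in the usual places, such as environment variables and the profile file.
let provider = ChainProvider::new();
// credentials() is async and fallible, so this has to run inside an async fn that returns a Result.
let credentials = provider.credentials().await?;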

With those two pieces in place, I could finally construct the URL!

use rusoto_s3::GetObjectRequest;
use rusoto_s3::util::{PreSignedRequest, PreSignedRequestOption};

let options = PreSignedRequestOption {
    expires_in: std::time::Duration::from_secs(300),
};
let req = GetObjectRequest {
    bucket: "my-bucket".to_string(),
    key: "secret.txt".to_string(),
    ..Default::default()
};
let url = req.get_presigned_url(&region, &credentials, &options);

Of course, in your application, managing a shared credentials provider or region may change the structure of this code. However you manage them, as long as you can pass a reference to each into the get_presigned_url function, you can generate useful pre-signed URLs for S3, Minio, etc.

Intentionally leaking AWS keys

2021-01-15T00:00:00+00:00

“Never check secrets into source control” is one of those rules that are 100% correct, until it’s not. There are no universal laws in software, and recently I had a reason to break this one. I checked AWS keys into a Git repository. I then pushed those commits to a public repository on GitHub. I did this intentionally, and lived to tell the tale. You almost certainly should never do this, so I thought I would share what happens when you do.

I can imagine you thinking: “this guy posted his AWS credentials on purpose? He must be an idiot.” I don’t disagree with your conclusion, but just let me explain!

My use-case is pretty simple: the delta-rs project needed a real S3 bucket to do some integration testing. I decided to set up a real S3 bucket for our (read-only) integration tests. Fortunately our tests just needed to retrieve objects from a bucket to confirm that an S3 bucket is presenting itself as a Delta table properly. I would have never done this if we needed “write” operations on the bucket.

Preparing

AWS has an integral access control framework called IAM, not to be confused with an anagram of “AMI” which Corey Quinn can help you learn how to pronounce. IAM allows crafting policies and roles for just about everything in AWS a dozen or more different ways. It slices, it dices, it keeps your buckets safe. It is also configured with JSON, which is awful, but I’ll have to save those rantings for another blog post. Anyways, here’s the read-only policy that I set up for the bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetLifecycleConfiguration",
                "s3:GetBucketTagging",
                "s3:GetInventoryConfiguration",
                "s3:GetObjectVersionTagging",
                "s3:ListBucketVersions",
                "s3:GetBucketLogging",
                "s3:ListBucket",
                "s3:GetAccelerateConfiguration",
                "s3:GetBucketPolicy",
                "s3:GetObjectVersionTorrent",
                "s3:GetObjectAcl",
                "s3:GetEncryptionConfiguration",
                "s3:GetBucketObjectLockConfiguration",
                "s3:GetBucketRequestPayment",
                "s3:GetObjectVersionAcl",
                "s3:GetObjectTagging",
                "s3:GetMetricsConfiguration",
                "s3:GetBucketOwnershipControls",
                "s3:GetBucketPublicAccessBlock",
                "s3:GetBucketPolicyStatus",
                "s3:ListBucketMultipartUploads",
                "s3:GetObjectRetention",
                "s3:GetBucketWebsite",
                "s3:GetBucketVersioning",
                "s3:GetBucketAcl",
                "s3:GetObjectLegalHold",
                "s3:GetBucketNotification",
                "s3:GetReplicationConfiguration",
                "s3:ListMultipartUploadParts",
                "s3:GetObject",
                "s3:GetObjectTorrent",
                "s3:GetBucketCORS",
                "s3:GetAnalyticsConfiguration",
                "s3:GetObjectVersionForReplication",
                "s3:GetBucketLocation",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "arn:aws:s3:::deltars",
                "arn:aws:s3:::deltars/*"
            ]
        }
    ]
}
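Before publishing keys attached to a policy like this, it is worth double-checking that nothing in the action list can write. A minimal sketch, assuming a simple naming heuristic (anything not matching `s3:Get*` or `s3:List*` is suspicious) rather than IAM’s own access-level classification; the function name is hypothetical:

```rust
/// Return every action that does not look read-only under a simple
/// naming heuristic: only s3:Get* and s3:List* are considered safe.
fn suspicious_actions<'a>(actions: &[&'a str]) -> Vec<&'a str> {
    actions
        .iter()
        .copied()
        .filter(|a| !(a.starts_with("s3:Get") || a.starts_with("s3:List")))
        .collect()
}
```

Feeding it the `Action` array from the policy should return an empty list; a stray `s3:PutObject` or `s3:*` would be returned for review.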

I also set up an AWS Budget to alert me should this ever start to cost real money. My current monthly costs in this AWS account are almost $1.50, so my budget is set such that if/when this starts costing me more than a couple of dollars a month, AWS will email me so I can figure out what to do in order to save my snapple money.

Finally, I created an IAM user for the integration tests. This IAM user has a single IAM policy attached to it, listed out above. I then took the AWS access key ID and secret key for the IAM user and checked those into Git.

2021-01-19 update: An anonymous reader points out:

Certain AWS APIs cannot be disabled via IAM, including sts:GetCallerIdentity, which in turn allows anyone with the public credentials to run the AWS equivalent of whoami:

% AWS_PROFILE=rtyler aws sts get-caller-identity
{
    "UserId": "AIDAX7EGEQ7F24XVIBAAL",
    "Account": "547889645515",
    "Arn": "arn:aws:iam::547889645515:user/deltars-ro"
}

AWS account numbers and IAM user ARNs are not especially privileged but be aware that publishing access keys has a side effect of disclosing those too.

Boom goes the dynamite

After preparing the integration tests, I pushed my pull request at 13:05 PST. When pushing code to GitHub, anything that looks like an AWS access key is immediately identified by robots around the world, most of them malicious in intent, but a few designed to help developers like me who make silly mistakes.
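Spotting these keys is not hard, which is part of why it happens so fast. Long-term IAM user access key IDs are 20 characters: `AKIA` followed by 16 uppercase letters or digits. The sketch below is a guess at the simplest possible scanner, not how GitHub, AWS, or any particular bot actually does it; the function name is hypothetical and the key in the usage note is the placeholder from AWS documentation.

```rust
/// Find substrings that look like long-term AWS access key IDs:
/// "AKIA" followed by 16 uppercase ASCII letters or digits.
fn find_access_key_ids(text: &str) -> Vec<&str> {
    let bytes = text.as_bytes();
    let mut found = Vec::new();
    let mut i = 0;
    while i + 20 <= bytes.len() {
        let candidate = &bytes[i..i + 20];
        let shaped = candidate.starts_with(b"AKIA")
            && candidate[4..]
                .iter()
                .all(|b| b.is_ascii_uppercase() || b.is_ascii_digit());
        // Reject matches that are part of a longer alphanumeric run.
        let bounded_left = i == 0 || !bytes[i - 1].is_ascii_alphanumeric();
        let bounded_right = i + 20 == bytes.len() || !bytes[i + 20].is_ascii_alphanumeric();
        if shaped && bounded_left && bounded_right {
            // All 20 bytes are ASCII, so slicing the &str here is safe.
            found.push(&text[i..i + 20]);
            i += 20;
        } else {
            i += 1;
        }
    }
    found
}
```

Running it over a line such as `let key = "AKIAIOSFODNN7EXAMPLE";` returns that one key ID. A real scanner would also want to find and validate the matching secret key, which this sketch does not attempt.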

At 13:05:36 PST, an AWS Support Case was opened in my account:

Dear AWS customer,

We have become aware that the AWS Access Key AKIAX7EGEQ7FT6CLQGWH, belonging to IAM User deltars-ro, along with the corresponding Secret Key is publicly available online at https://github.com/rtyler/delta.rs/blob/b3581ee06eee26d971bd3b76bb788c85ecf0c6c0/rust/tests/s3_test.rs .

Your security is important to us and this exposure of your account’s IAM credentials poses a security risk to your AWS account, could lead to excessive charges from unauthorized activity, and violates the AWS Customer Agreement or other agreement with us governing your use of our Services.

To protect your account from excessive charges and unauthorized activity, we have applied the “AWSCompromisedKeyQuarantine” AWS Managed Policy (“Quarantine Policy”) to the IAM User listed above. The Quarantine Policy applied to the User protects your account by limiting permissions for high risk AWS services. You can view the policy by going here: https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AWSCompromisedKeyQuarantine$jsonEditor?section=permissions .

For your security, DO NOT remove the Quarantine Policy before following the instructions below. In cases where the Quarantine Policy is causing production issues you may detach the policy from the user. NOTE: Only users with admin privileges or with access to iam:DetachUserPolicy may remove the policy. For instructions on how to remove managed policies go here: https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html#remove-policies-console . In the event of the unauthorized use of your AWS account, we may, at our sole discretion, provide you with concessions. However, a failure to follow the instructions below may jeopardize your ability to receive a concession.

If you believe you’ve received this note in error, please contact us immediately via the support case.

PLEASE FOLLOW THE INSTRUCTIONS BELOW TO SECURE YOUR ACCOUNT:

Step 1: Delete or rotate the exposed AWS Access Key AKIAX7EGEQ7FT6CLQGWH. To delete IAM User Keys go to your AWS Management Console here: https://console.aws.amazon.com/iam/home#users . To delete Root User Keys go here: https://console.aws.amazon.com/iam/home#security_credential .

If your application uses the exposed Access Key, you need to replace the Key. To replace the Key, first create a second Key (at that point both Keys will be active) and then modify your application to use the new Key. Then disable (but do not delete) the exposed Key by clicking on the “Make inactive” option in the console. If there are any problems with your application, you can reactivate the exposed Key. When your application is fully functional using the new Key, please delete the exposed Key.

NOTE: Only rotating or deleting the exposed Key may not be sufficient to protect your account, see Step 2.

Step 2: Check your CloudTrail log for unsanctioned activity such as the creation of unauthorized IAM users, policies, roles or temporary security credentials. To secure your account please delete any unauthorized IAM users, roles and policies, and revoke any temporary credentials.

To delete unauthorized IAM User, navigate to https://console.aws.amazon.com/iam/home#users . To delete unauthorized policies go here: https://console.aws.amazon.com/iam/home#/policies . To delete unauthorized roles go here: https://console.aws.amazon.com/iam/home#/roles .

Unauthorized temporary credentials may have been created for the IAM User deltars-ro with the exposed AWS Access Key AKIAX7EGEQ7FT6CLQGWH. You can revoke temporary credentials by following instructions outlined here: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_control-access_disable-perms.html#denying-access-to-credentials-by-issue-time . Temporary credentials can also be revoked by deleting the IAM User. NOTE: Deleting IAM Users may impact production workloads and should be done with care.

Step 3: Check your CloudTrail log to review your AWS account for any unauthorized AWS usage, such as unauthorized EC2 instances, Lambda functions or EC2 Spot bids. You can also check usage by logging into your AWS Management Console and reviewing each service page. The “Bills” page in the Billing console can also be checked for unexpected usage. https://console.aws.amazon.com/billing/home#/bill

Please keep in mind that unauthorized usage can occur in any region and that your console may show you only one region at a time. To switch between regions, you can use the dropdown in the top-right corner of the console screen.

Please take steps to prevent any new credentials from being publicly exposed. See Best Practices of Managing your Access Keys at http://docs.aws.amazon.com/general/latest/gr/aws-access-keys-best-practices.html .

WE RECOMMEND THAT YOU ENABLE AMAZON GUARDDUTY:

Amazon GuardDuty is an AWS threat detection service that helps you continuously monitor and protect your AWS accounts and workloads. Enabling Amazon GuardDuty on your accounts gives you further visibility into malicious or unauthorized activity, alerting you to take action in order to reduce the risk of harm. To learn more, visit: https://aws.amazon.com/guardduty .

If you have any questions, you can contact us by accessing the newly created Support Case in your account’s Support Center. If you do not see a new case, you can create a case from the Support Center here: https://console.aws.amazon.com/support/home?#/

Thank you for your immediate attention to this matter.

I also got emails from two third-party services: GitGuardian at 13:09 PST and leakd.io at 14:56 PST. Nice try folks, but AWS was already on top of it within literal seconds of my git push.

I ignored the third-party services and responded to the AWS Support Case to let them know that my disclosure was in fact intentional. The support person surely rolled their eyes before reminding me that I would be responsible for charges on the account and still recommended that I:

Change the password for the root account.
Delete and rotate all access keys.
Check for possible unauthorized usage.

Normally this story doesn’t end well. I did this on purpose and planned accordingly. There is one incident of leaked AWS keys on GitHub which I personally know the details of (friend of a friend, I swear!). An errant git add resulted in a local credentials file being pushed to a personal but public repository. Because the email account linked to the AWS account was not regularly checked, the key was used abusively to rack up a few hundred dollars on an AWS bill before the keys were revoked.

If you add anything that looks like AWS keys to a public repository, website, or really anything on the internet, malicious actors will download the keys and try to launch services in your AWS account. Typically these are cryptocurrency miners or spam gateways: anything that costs a lot of money, which they’re happy you’ve volunteered to pay for.
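
To make “anything that looks like AWS keys” concrete, here is a minimal sketch of the kind of pattern match a scanner could run over text before it gets pushed. It assumes access key IDs are 20 characters beginning with AKIA, which is the shape of the key quoted in the AWS email above. The function name is made up for illustration, and real scanners presumably check for much more than this.

```python
import re

# Assumption: an access key ID is "AKIA" followed by 16 uppercase letters
# or digits, the same shape as the key quoted in the AWS email above.
ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_access_key_ids(text):
    """Return every substring of text that looks like an AWS access key ID."""
    return ACCESS_KEY_ID.findall(text)
```

Running something like this over staged changes before a git push would catch the errant credentials file from the story above, at least for keys of this shape.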

Don’t check your AWS credentials into GitHub!

But if you must, do it safely :)

Changing the way the world reads at Scribd

2019-11-25T00:00:00+00:00

This week we launched the Scribd tech blog, on which I published today’s article: We’re building the largest library in history. I frequently have to remind myself that I have been here less than a year, and we have undergone incredible positive change, with more coming in 2020.

The post presents a high-level idea of what is to come for technology at Scribd in the coming year or two, related to our announcement today of a major round of funding:

Today we are excited to announce Scribd has closed $58 million in equity financing led by Spectrum Equity. The investment will be used to support growth and product innovation, enhance operations, and further the company’s mission to change the way the world reads.

The most important detail I was able to share in the blog post is in the Infrastructure section:

The future of our infrastructure, and our applications, is entirely in the cloud. The migration [to AWS] requires shifting workloads between datacenters with a tiny error and downtime budget. At our size, that’s many terabytes of data and thousands of requests per second, which dictates serious upfront planning, automation, testing, and monitoring of every facet of our environment.

Hiding behind this paragraph has been a tremendous amount of my time from these past few months. When I arrived at Scribd in January, there were no plans in the roadmap to adopt a cloud provider for our infrastructure. I must have been the straw that broke the camel’s back. “We need to move into the cloud” was met with “We agree! What’s your plan?” And then it became one of the many plates I have kept spinning.

We already have migrated a few services, including a major production service which Core Platform moved over without any issues; I’m very proud of that one!

Unlike many “datacenter to cloud” migrations, I believe ours is unique in that we have:

A very limited error and downtime budget.
The green-light to share the process as we go along.

I’m looking forward to sharing more on tech.scribd.com (RSS) as we move to AWS. I hope you’ll tune in!

Defining the Real-time Data Platform

2019-08-28T00:00:00+00:00

One of the harder parts about building new platform infrastructure at a company which has been around a while is figuring out exactly where to begin. At Scribd the company has built a good product and curated a large corpus of written content, but where next? As I alluded to in my previous post about the Platform Engineering organization, our “platform” components should help scale out, accelerate, or open up entirely new avenues of development. In this article, I want to describe one such project we have been working on and share some of the thought process behind its inception and prioritization: the Real-time Data Platform.

(sounds fancy huh?)

My first couple weeks at the company were intense. The idea of “Core Platform” was sketched out as a team “to scale apps and data” but that was about the extent of it. The task I took on was to learn as much as I could, as quickly as I could, in order to get the recruiting and hiring machine started. Basically, I needed to point Core Platform in a direction that was correct enough at a high level in order to know what skills my future colleagues should have. While I had tons of discussions and did plenty of reading, I almost feel sheepish to admit that much of our direction was heavily influenced by two conversations, both of which took less than an hour.

The first was with Kevin Perko (KP), the head of our Data Science team. His team interacts the most with our current data platform (HDFS, Spark, Hive, etc.); in essence Data Science would be considered one of our customers. I asked some variant of “what’s wrong with the data infrastructure?” and KP unloaded what must have been months of pent-up frustrations shared by his entire team. The themes that emerged were:

Developers don’t think about the consumers of the data. Garbage in, garbage out!
Many nightly tasks spend a lot of time performing unnecessary pre-processing of data.
The performance of the system is generally poor. Ad-hoc queries from data scientists, depending on the time of day, are competing with automated tasks for resources.
Everything has to be done in this nightly dependency graph of tasks, and when something goes wrong, recovering from errors is very manual and typically ruins somebody’s day.

I assured KP that these were problems we would be solving. His next statement would become a mainstay of our relationship moving forward: “when will it be ready?”

My second influential conversation was with Mike Lewis, the head of Product. This conversation was quite simple and didn’t involve as much trauma counseling as the previous. I asked “what can’t you do today because of our technology limitations?” This is a good question to ask product teams every now and again. They frequently are optimising within their current constraints. One role of platform and infrastructure teams is to remove those constraints. We discussed the way in which users convert from passersby, to trial, to paid subscribers. He also highlighted the importance of our recommendations and search results in this funnel, and lamented the speed at which we can highlight relevant content to new users. The maxim goes: the faster a new user sees relevant and interesting content, the more likely they are to stick around.

Pattern matching between the current problems and the technology needed to enable new product initiatives, I named and defined the high level objective for the Real-time Data Platform as follows:

To provide a streaming data platform for collecting and acting upon behavioral data in near real-time with the ultimate goal to enable day zero personalization in Scribd’s products.

In more concrete terms, the platform is a collection of cloud-based services (in AWS, more on that later) for ingesting, processing, and storing behavioral events from frontend, backend, and mobile clients. The scope of the Real-time Data Platform extends from event definition and schema, to the layout of events persisted into long-term queryable storage, and the tooling which sits on top of that queryable storage.

As the nominal “product owner” for the effort, I aimed to describe less about what tools and technologies should be used, and instead forced myself to define tech-agnostic requirements, thereby leaving the discovery work for the team I would ultimately hire.

The Real-time Data Platform must have:

A high data SLA, nearing 100%, meaning we must design in such a way as to reduce data loss or corruption at every point of the pipeline.
Data provenance maintained through the pipeline from data creation to usage. In essence, a Data Scientist should be able to easily track data from where it originated, and understand the transformative steps along the way.
Event streams should be considered API contracts, with schemas suggested or enforced when possible. A consumer from an event stream should be able to trust the quality of the events in that stream.
Data processing and transformation must happen as close to ingestion as possible. Events which arrive in long-term storage must be structured and partitioned for optimal query performance with zero or minimal post-processing required for most use-cases.
The platform must scale as the data volume grows without requiring significant redesign or rework.
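
As a rough illustration of two of these requirements, treating an event stream as a contract and partitioning events on their way into storage, here is a minimal sketch. Everything in it is hypothetical: the schema, the field names, and the partition layout are invented for the example and say nothing about the tools the team would ultimately choose.

```python
from datetime import datetime

# Hypothetical contract for one event stream: field name -> required type.
PAGE_VIEW_SCHEMA = {"event": str, "user_id": int, "timestamp": str}


def validate_event(event, schema):
    """Return a list of problems; an empty list means the event honors the contract."""
    problems = []
    for field, expected in schema.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems


def partition_path(event):
    """Derive a date-based partition from the event's ISO-8601 timestamp."""
    ts = datetime.fromisoformat(event["timestamp"])
    return f"event={event['event']}/dt={ts.date().isoformat()}"
```

Rejecting or flagging events at ingestion, rather than cleaning them up in a nightly job, is what lets a consumer trust the stream without extra pre-processing.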

In essence, we need to change a number of foundational ways in which we generate, transfer, and consider the data which Scribd uses. As Core Platform has unpeeled layer after layer of this onion, we have been able to affirm at each step of the way that we’re moving in the right direction, which is by itself quite exciting.

The design of the Real-time Data Platform which we’re currently building out is something I will share at a high level in a subsequent blog post.

I want to finish this one with some parting thoughts. If you are building anything foundational in a technology organization, you must talk to the product team. You must also talk to your customers, but I don’t like to ask them what they want, I like to ask what they don’t like and don’t want. Listen to that negative feedback, understand what lies beneath the frustrations. Finally, have a vision for the future, but build and deliver incrementally. When I first sketched this out, I was forthcoming in stating “this is a 2020 project.” I made sure to clarify that this did not mean we wouldn’t deliver anything to the business for 18 months. Instead, I made sure to explain that to execute on this overall vision would be a long journey with milestones along the way.

If you haven’t ever watched a skyscraper being built, you would be amazed at how much of the time is spent digging a great big hole, sinking steel into bedrock, and pouring concrete. Months of people working in a city block-sized hole before anything takes shape that even resembles a skyscraper. Building strong foundations takes time, but that is in essence the role of any platform and infrastructure organization. The challenge is to keep the business moving forward today while also building those fundamental components upon which the business will stand in a year or two.

It is tough, but that’s exactly what I signed up for. :)