rtyler

Based Lake, a petabyte-scale low-latency data lake

2026-03-10T00:00:00+00:00

I had a chat today about building large-scale low-latency data retrieval systems around AWS S3. In doing so I got to share a bit of the talk proposal I submitted to Data and AI Summit this year about real-life work that has made it into production.

For years the conventional wisdom around Delta Lake has been to not connect user-facing/online systems to Delta tables. Basically, don’t point your Django app at your Delta tables. This continues to be a decent guideline but definitely not a rule and I have the performance data to back that up.

My talk abstract:

Scribd hosts hundreds of millions of documents and has hundreds of billions of objects across our buckets. Combining large language models with massive amounts of text has required investment in our new Content Library architecture. We selected Delta Lake as the underlying storage technology but have pushed it to an extreme. Using the same Delta Lake architecture we offer both direct data access for data scientists in Databricks Notebooks and online data retrieval in milliseconds for user-facing web services.

In this talk we will review principles of performance for each layer of the stack: web APIs, the Delta Lake tables, Apache Parquet, and AWS S3.

The work that my colleague Eugene and I have done in this area is closely related to my previous research around Low latency Parquet reads, which informed work named Content Crush, which I have explored more on the Scribd tech blog and on the Screaming in the Cloud podcast.

I really hope that I am able to share results at Data and AI Summit from this incredibly challenging work that I am undertaking. But even if I don’t, blog posts like my musings on Multimodal with Delta Lake, scaling streaming Delta Lake applications, and a myriad of other articles I have published can be pieced together to form the larger mosaic of insane large-scale data work I have been hammering on!

I’m a Databricks Beacon

2021-10-21T00:00:00+00:00

A bit of belated news but thanks to all the advocacy work we have been doing at Scribd, I am now a Databricks Beacon. The Beacon program is similar to Docker Captains, Microsoft MVPs, or Java Champions: a group of folks who are considered both skilled with the technology and in communicating/sharing best practices, tips, and shortcomings with the broader community.

From the site itself:

The Databricks Beacons program is our way to thank and recognize the community members, data scientists, data engineers, developers and open source enthusiasts who go above and beyond to uplift the data and AI community.

Whether they are speaking at conferences, leading workshops, teaching, mentoring, blogging, writing books, creating tutorials, offering support in forums or organizing meetups, they inspire others and encourage knowledge sharing – all while helping to solve tough data problems.

I’m flattered to be included in the inaugural group of Beacons, which includes a number of much more competent data leaders than myself. Most of what I bring to the table is a lot of Delta Lake experience and advocacy. Delta Lake is the bedrock of Scribd’s data platform and I have been investing heavily in the space with our contribution of the delta-rs Rust bindings, upon which kafka-delta-ingest was built.

Scribd is a Databricks customer, and from that angle I have been quite impressed with the organization and technologies they have built. As some folks who have seen my public talks about Databricks know, I also don’t hold back in my honest assessment of the platform’s strengths and weaknesses, thus my surprise to be included as a Beacon ;)

I’m looking forward to more events where I am able to share some of the real-world experiences we’re gaining at Scribd in building out massive data platform systems with Delta Lake and Databricks. And as always, if you want to help us build out more feel free to email me!

Recovering from disasters with Delta Lake

2021-04-26T00:00:00+00:00

Entering into the data platform space with a lot of experience in more traditional production operations is a lot of fun, especially when you ask questions like “what if X goes horribly wrong?” My favorite scenario to consider is: “how much damage could one accidentally cause with our existing policies and controls?” At Scribd we have made Delta Lake a cornerstone of our data platform, and as such I’ve spent a lot of time thinking about what could go wrong and how we would defend against it.

To start I recommend reading this recent post from Databricks: Attack of the Delta Clones which provides a good overview of the CLONE operation in Delta and some patterns for “undoing” mistaken operations. Their blog post does a fantastic job demonstrating the power of clones in Delta Lake, for example:

-- Creating a new cloned table from loan_details_delta
CREATE OR REPLACE TABLE loan_details_delta_clone
    DEEP CLONE loan_details_delta;

-- Original view of data
SELECT addr_state, funded_amnt FROM loan_details_delta GROUP BY addr_state, funded_amnt;

-- Clone view of data
SELECT addr_state, funded_amnt FROM loan_details_delta_clone GROUP BY addr_state, funded_amnt;

For my disaster recovery needs, the clone-based approach is insufficient as I detailed in this post on the delta-users mailing list:

Our requirements are basically to prevent catastrophic loss of business critical data via:

Erroneous rewriting of data by an automated job

Inadvertent table drops through metastore automation.

Overaggressive use of VACUUM command

Failed manual sync/cleanup operations by Data Engineering staff

It’s important to consider whether you’re worried about the transaction log getting corrupted, files in storage (e.g. ADLS) disappearing, or both.

Generally speaking, I’m less concerned about malicious actors so much as incompetent ones. It is far more likely that a member of the team accidentally deletes data, than somebody kicking in a few layers of cloud-based security and deleting it for us.

My preference is to work at a layer below Delta Lake to provide disaster recovery mechanisms, in essence at the object store layer (S3). Relying strictly on CLONE gets you copies of data which can definitely be beneficial but the downside is that whatever is running the query has access to both the “source” and the “backup” data.

The concern is that if some mistake was able to delete my source data, there’s nothing actually standing in its way of deleting the backup data as well.

In my mailing list post, I posited a potential solution:

For example, with a simple nightly rclone(.org) based snapshot of an S3 bucket, the “restore” might mean copying the transaction log and new parquet files back to the originating S3 bucket and losing up to 24 hours of data, since the transaction logs would basically be rewound to the last backup point.
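To make the "rewound to the last backup point" idea concrete, here is a minimal sketch that compares the Delta commit files present in a live table with those in a backup and reports which commits a restore would discard. It relies only on the Delta convention that commits are stored as zero-padded 20-digit JSON files under `_delta_log/`. The object keys and function names are hypothetical, and a real restore would also need to copy the referenced parquet files.

```python
import re

# Delta commit files are named with a zero-padded 20-digit version, e.g.
# _delta_log/00000000000000000042.json (checkpoint files do not match this).
COMMIT_RE = re.compile(r"_delta_log/(\d{20})\.json$")


def commit_versions(keys):
    """Return the sorted Delta commit versions found in a list of object keys."""
    versions = []
    for key in keys:
        match = COMMIT_RE.search(key)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


def rewind_report(live_keys, backup_keys):
    """Summarize what restoring the backup over the live table would lose."""
    live = commit_versions(live_keys)
    backup = commit_versions(backup_keys)
    if not backup:
        raise ValueError("backup contains no Delta commits")
    restore_point = backup[-1]
    # Every live commit newer than the restore point disappears on restore.
    lost = [version for version in live if version > restore_point]
    return {"restore_point": restore_point, "lost_commits": lost}


# Hypothetical keys for illustration only: the live table is at version 7,
# while the nightly backup captured versions 0 through 4.
live = [f"tables/events/_delta_log/{v:020d}.json" for v in range(0, 8)]
backup = [f"tables/events/_delta_log/{v:020d}.json" for v in range(0, 5)]
report = rewind_report(live, backup)
print(report)
```

Running a report like this before a restore makes the cost of the rewind explicit: the listed commits are the writes that would have to be replayed from upstream sources.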

Since that email we have deployed our Delta Lake backup solution, which operates strictly at an S3 layer and allows us to impose hard walls (IAM) between writers of the source and backup data.
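The "hard walls" can be pictured as two IAM policies that never grant the same principal destructive access to both buckets. The sketch below is an assumption about how such a split might look, not the policy we actually run: the bucket names, statement IDs, and function names are all hypothetical.

```python
import json

# Hypothetical bucket names for illustration; the post does not name the real ones.
SOURCE_BUCKET = "example-delta-source"
BACKUP_BUCKET = "example-delta-backup"


def bucket_arns(bucket):
    """ARNs covering the bucket itself and every object inside it."""
    return [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]


def lake_writer_policy():
    """Jobs that write Delta tables: read/write/delete on the source bucket,
    explicitly denied everything on the backup bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadWriteSource",
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                ],
                "Resource": bucket_arns(SOURCE_BUCKET),
            },
            {
                "Sid": "NoAccessToBackup",
                "Effect": "Deny",
                "Action": "s3:*",
                "Resource": bucket_arns(BACKUP_BUCKET),
            },
        ],
    }


def backup_writer_policy():
    """The backup process: read-only on the source, write-only on the backup,
    and no delete permission on either bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadSource",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": bucket_arns(SOURCE_BUCKET),
            },
            {
                "Sid": "WriteBackup",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": bucket_arns(BACKUP_BUCKET),
            },
        ],
    }


print(json.dumps(lake_writer_policy(), indent=2))
```

The point of the split is that a runaway job holding the first policy can damage the source but cannot touch the backup, which is exactly the property a CLONE in the same workspace does not give you.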

One of my colleagues is writing that blog post up for tech.scribd.com and I hope to see it published later this week so make sure you follow us on Twitter @ScribdTech or subscribe to the RSS feed!

Update: my colleague Kuntal wrote this blog post on backing up Delta Lake with AWS S3 Batch Operations which is what we’re doing here at Scribd

Building a real-time data platform with Apache Spark and Delta Lake

2020-07-20T00:00:00+00:00

The Real-time Data Platform is one of the fun things we have been building at Scribd since I joined in 2019. Last month I was fortunate enough to share some of our approach in a presentation at Spark and AI Summit titled: “The revolution will be streamed.” At a high level, what I had branded the “Real-time Data Platform” is really: Apache Kafka, Apache Airflow, Structured streaming with Apache Spark, and a smattering of microservices to help shuffle data around. All sitting on top of Delta Lake which acts as an incredibly versatile and useful storage layer for the platform.

In my presentation, which is embedded below, I outline how we tie together Kafka, Databricks, and Delta Lake.

The recorded presentation also complements some of our tech.scribd.com blog posts which I recommend reading as well:

I am incredibly proud of the work the Platform Engineering organization has done at Scribd to make real-time data a reality. I also cannot recommend Kafka + Spark + Delta Lake highly enough for those with similar requirements.

Now that we have the platform in place, I am also excited for our late 2020 and 2021 roadmaps which will start to take advantage of real-time data.

Changing the way the world reads at Scribd

2019-11-25T00:00:00+00:00

This week we launched the Scribd tech blog, on which I published today’s article: We’re building the largest library in history. I frequently have to remind myself that I have been here less than a year, and we have undergone incredible positive change, with more coming in 2020.

The post portends a high-level idea of what is to come for technology at Scribd in the coming year or two, related to our announcement today of a major round of funding:

Today we are excited to announce Scribd has closed $58 million in equity financing led by Spectrum Equity. The investment will be used to support growth and product innovation, enhance operations, and further the company’s mission to change the way the world reads.

The most important detail I was able to share in the blog post is in the Infrastructure section:

The future of our infrastructure, and our applications, is entirely in the cloud. The migration [to AWS] requires shifting workloads between datacenters with a tiny error and downtime budget. At our size, that’s many terabytes of data and thousands of requests per second, which dictates serious upfront planning, automation, testing, and monitoring of every facet of our environment.

Hiding behind this paragraph has been a tremendous amount of my time from these past few months. Arriving at Scribd in January, there were no plans in the roadmap to adopt a cloud provider for our infrastructure. I must have been the straw that broke the camel’s back. “We need to move into the cloud” was met with “We agree! What’s your plan?” And then it became one of the many plates I have kept spinning.

We already have migrated a few services, including a major production service which Core Platform moved over without any issues; I’m very proud of that one!

Unlike many “datacenter to cloud” migrations, I believe ours is unique in that we have:

A very limited error and downtime budget.
The green-light to share the process as we go along.

I’m looking forward to sharing more on tech.scribd.com (RSS) as we move to AWS, I hope you’ll tune in!

Building containers in Jenkins with Kaniko

2019-10-03T00:00:00+00:00

I have a love/hate relationship with containers. We have used containers for production services in the Jenkins project’s infrastructure for six or seven years, where they have been very useful. I run some desktop applications in containers. There are even a few Kubernetes clusters which show the tell-tale signs of my usage. Containers are great. Not a week goes by however when some oddity in containers, or the tools around them, throws a wrench into the gears and causes me great frustration. This week was one of those weeks: we suddenly had problems building our Docker containers in one of our Kubernetes environments.

I’m a strong supporter of running Jenkins workloads in Kubernetes for a myriad of reasons, which I won’t go into here. Like most organizations however, we don’t just need containers for the testing of our applications, we need to package them into containers as well. As such, we need to build Docker containers atop Kubernetes, which isn’t as straight-forward as you might hope.

For years I have followed the same approach that Hoot Suite describes in this post, utilizing Docker’s own “Docker in Docker” container (docker:dind). By using a pod with Docker-in-Docker and a Docker client container, the Jenkinsfile can be fairly simple for building a container but certainly not as simple as a plain sh 'docker build rofl:copter'. With the linked configuration above, our pipelines would typically have an explicit stage which would build Docker containers:

pipeline {
    // Declarative Pipeline requires a top-level agent; each stage picks its own
    agent none
    stages {
        stage('Buildo Roboto') {
            agent { 
                kubernetes {
                    label 'docker'
                    defaultContainer 'docker'
                }
            }
            steps {
                sh 'docker build -t roboto:latest .'
            }
        }
    }
}

In one of our environments, this recently stopped working. What’s worse, is that we still aren’t entirely sure why. We migrated the Jenkins workloads from an older Kubernetes cluster to a newer one, and afterwards this “dind” approach to building containers started throwing incredibly confusing network and filesystem errors. Smart money is on some host kernel or filesystem configuration issue which is causing the “dind” container, which must run “privileged”, to function incorrectly. After an hour or two of debugging, I said “forget this” (I may have used slightly different words) and started looking at other options.

Kaniko

Kaniko is a curious tool from Google which allows the building of containers on top of Kubernetes. By curious I mean that it works fairly differently from a “stock” docker build invocation and required some tweaking on our end to get things working comfortably. That said, our initial work is promising and we think we’re going to be switching fully over to it.

The biggest oddity is the need for intermediate layers in the container build, and the resultant image, to be published to a repository. My colleague hypothesized that this was likely a pattern from Google Cloud Platform, where local VM disks might not be as fast as the container registry affiliated with a cluster. While there are local filesystem caching options, we found them too unreliable to be useful.

For our configuration of Kaniko, we riffed on the Scripted Pipeline examples shared by my former colleagues at CloudBees, but made some fairly significant modifications along the way. Most notably, we decided to stand up an ephemeral Docker registry inside the Kaniko pod rather than rely on an external registry for intermediate layers. The end product is pushed to a well supported network-based registry, but the intermediate layers are perfectly fine to run locally, as we have very fast disk I/O on our Kubernetes nodes.

Kaniko’s invocation is much different, and the way it treats its build context is also a little odd. In our testing we found that the --cleanup flag was not enabled by default and successive calls to Kaniko would mash all the files from different contexts on top of one another in some temp directory used by Kaniko for builds, thereby leading to frustrating build failures. It should also be noted that the Kaniko containers use Busybox for their shell, but it’s on a fun non-standard path (/busybox/sh), so shell scripts expecting /bin/sh or /bin/bash will definitely fail!
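As a rough illustration of the invocation described above, the sketch below assembles the argument list for a single Kaniko build that pushes to a registry running in the same pod. It uses only the flags shown in the pod specification comment later in this post (`--context`, `--destination`, `--insecure-registry`, `--cleanup`). The function name and defaults are hypothetical, and our real build script is a shell script, so treat this purely as a way to see the arguments laid out.

```python
import os
import shlex


def kaniko_command(context, image, registry="localhost:5000", cleanup=True):
    """Build the argv for one Kaniko build that pushes to an in-pod registry."""
    argv = [
        "/kaniko/executor",
        "--context", context,
        "--destination", f"{registry}/{image}",
        # The ephemeral registry in the pod speaks plain HTTP.
        "--insecure-registry", registry,
    ]
    if cleanup:
        # Without --cleanup, files from successive builds end up layered on
        # top of one another in Kaniko's working directory.
        argv.append("--cleanup")
    return argv


argv = kaniko_command(os.getcwd(), "roboto:latest")
print(shlex.join(argv))
```

Passing `--cleanup` explicitly on every call matters most when one pod builds several images in a row, since that is when the contexts collide.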

We use Declarative Pipeline very heavily and also utilize our own custom JNLP agent image in Jenkins (custom root certificates!), so the snippet below should be largely portable to your environment but may need some tweaks:

pipeline {
    // Declarative Pipeline requires a top-level agent; each stage picks its own
    agent none
    stages {
        stage('Buildo Roboto') {
            agent { 
                kubernetes {
                    defaultContainer 'kaniko'
                    yamlFile 'kaniko.yaml'
                }
            }
            steps {
                /*
                 * Since we're in a different pod than the rest of the
                 * stages, we'll need to grab our source tree since we don't
                 * have a shared workspace with the other pod(s)..
                 */
                checkout scm
                sh 'sh -c ./scripts/build-kaniko.sh'
            }
        }
    }
}

kaniko.yaml

# This pod specification is intended to be used within the Jenkinsfile for
# building the Docker containers
#
# E.g. /kaniko/executor --context `pwd` --destination localhost:5000/roboto:latest --insecure-registry localhost:5000 --cleanup
---
apiVersion: v1
kind: Pod
metadata:
  name: kaniko
spec:
  containers:
  - name: jnlp
    # Overwriting the jnlp container's default "image" parameter, this will be
    # merged automatically with the Kubernetes plugin's built-in jnlp container
    # configuration, ensuring that the pod comes up and is accessible
    image: 'our-awesome-registry/rtyler/jenkins-agent:latest'
  - name: kaniko
    image: gcr.io/kaniko-project/executor:debug
    imagePullPolicy: Always
    # Command and args are important to set in this manner such that the
    # Jenkins Pipeline can send commands to be executed from the Jenkinsfile via
    # stdin (that's how it really works!)
    command:
    - /busybox/sh
    - "-c"
    args:
    - /busybox/cat
    tty: true
  #  Kaniko requires a registry, so we're just bringing one online in the pod
  #  for the intermediate caching of layers
  - name: registry
    image: 'registry'
    command:
    - /bin/registry
    - serve
    - /etc/docker/registry/config.yml

Our experience with Kaniko thus far is that it has been slower, and less verbose in some of its output than docker build. Fortunately though it’s been quite reliable, and that’s the key factor here!

Hopefully with the snippets of code above you won’t need to spend nearly as much time tinkering as my colleague and I did. But in the process of switching over to Kaniko we needed to do a lot of interactive debugging in Jenkins, so I was glad to have something like an interactive shell in my bag of Jenkins Pipeline tricks.

While I liked the “dind” solution, the Kaniko-based solution works just as well. The future development for us is to hide some of this complexity with shared libraries, but that’s a project for another day!

JKS? jfc. Adding a root certificate

2019-09-28T00:00:00+00:00

TLS certificates have the largest “complexity/importance” scores imaginable. Everything about them is error prone and seemingly over-engineered from top to bottom, yet they are one of the most important pieces of security and authentication in our software architectures. From an engineering management standpoint, I am finding myself adopting the rule of: estimates for any project involving certificates should be multiplied tenfold. If the project involves the Java Virtual Machine (JVM) and the Java Key Store (JKS), multiply by another ten I suppose. For my own future convenience, in this blog post I would like to outline how to add a root certificate to a Java Key Store in Red Hat-derived environments.

Like many corporate environments, we have our own internal Certificate Authorities (CAs) which all derive their chain of trust from our internal root certificate. Accessing internal services requires that the operating system has that root certificate, or when accessing those internal services from anything running atop the JVM, the default JKS must have the root certificate.

If you search around the web for how to add root certificates, you might find the update-ca-trust command, whose CentOS/RHEL manpage has the following:

The directory /etc/pki/ca-trust/extracted/java/ contains a CA
certificate bundle in the java keystore file format. Distrust information
cannot be represented in this file format, and distrusted certificates are
missing from these files. File cacerts contains CA certificates trusted for TLS
server authentication.

You might assume, as I did, that this means the update-ca-trust tool is going to create files that the JVM picks up properly and your default JKS will have the root certificate in place.

This is false. At least in the environments in which I have tested this.

Digging further I found this blog post and used the following command to import the root certificate into JKS after installing it on the system at large:

keytool -importcert -alias startssl -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit -file ca.der

Using the SSLPoke tool referenced in this Atlassian knowledgebase article I was then finally able to access the same internal services from native utilities (e.g. curl) and from the Java-based services which I was working with at the time.
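Since the location of the default keystore differs between Java versions, a small helper that finds it and prints the matching keytool commands can save some guessing. This is a sketch under assumptions: the alias `internal-root`, the fallback `JAVA_HOME` path, and the function names are hypothetical, while `changeit` is the stock cacerts password used in the command above.

```python
import os


def candidate_cacerts(java_home):
    """Likely locations of the default trust store under a JAVA_HOME."""
    return [
        # JDK 8 and earlier keep the JRE in a subdirectory
        os.path.join(java_home, "jre", "lib", "security", "cacerts"),
        # JDK 9 and later (and JRE-only installs) drop the jre/ prefix
        os.path.join(java_home, "lib", "security", "cacerts"),
    ]


def import_command(keystore, cert_file, alias, storepass="changeit"):
    """argv for importing a root certificate without an interactive prompt."""
    return [
        "keytool", "-importcert", "-noprompt",
        "-alias", alias,
        "-keystore", keystore,
        "-storepass", storepass,
        "-file", cert_file,
    ]


def list_command(keystore, alias, storepass="changeit"):
    """argv for checking that the alias is present after the import."""
    return [
        "keytool", "-list",
        "-alias", alias,
        "-keystore", keystore,
        "-storepass", storepass,
    ]


# Hypothetical fallback path; set JAVA_HOME for a real run.
java_home = os.environ.get("JAVA_HOME", "/usr/lib/jvm/java")
for keystore in candidate_cacerts(java_home):
    if os.path.exists(keystore):
        print(" ".join(import_command(keystore, "ca.der", "internal-root")))
        print(" ".join(list_command(keystore, "internal-root")))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; piping the output into a shell, or swapping `print` for `subprocess.run`, is the obvious next step inside a container build.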

In my situation, the fact that all of this was happening within Docker containers further complicated the debugging: multiply by another 2-5 on that engineering estimate.

Certificates are too important to be this painful.

Ruby Infrastructure Engineering

2019-09-09T00:00:00+00:00

My favorite part of the stack is the netherworld between the underlying infrastructure and the app. That fuzzy grey area where data goes from databases to object-relational mappers (ORMs), web servers to request libraries (e.g. Rack/WSGI), and so on. In many cases a technology roadmap where one considers infrastructure, but not the application, or vice-versa, is doomed from the start. At Scribd, I have been given permission to hire more people that love this layer of the stack, and I have taken to calling it “Ruby Infrastructure.” The phrase is fairly unique, so I wanted to define it in greater detail.

I have described the general mission of the team as follows:

The Ruby Infrastructure team will help Scribd adopt major ecosystem improvements such as Sorbet, new Rails versions, and interpreter releases. Measure and optimize performance across the thousands of requests per second served by Ruby at Scribd. Create libraries that encapsulate common Ruby application patterns and approaches. Open high quality pull requests to improve upstream projects like Sidekiq, Rails, and Ruby itself.

Ruby at Scribd is serious business. We run one of the largest Rails deployments on the internet (hi GitHub!) and need more focused effort on scaling it from a technology and organization standpoint. The Ruby ecosystem has also matured greatly over the past 10 years and every couple of months there are new improvements which Scribd can adopt.

The Ruby Infrastructure team is intended to be the group of people which make sure that all our Ruby and Rails applications are performing well, scaling, and are easy to develop and deploy.

To give you a better idea of what this team will do, here are some of the projects which I have in mind:

Simplify with Aurora

We have over 7TB of online relational data which, for historical reasons, is spread across a number of master-replica clusters. Migrating these databases to, and adopting, RDS/Aurora looks very promising. The advertised read-performance and dataset storage scalability may allow us to consolidate the database infrastructure and allow us to delete swaths of complex database magic in the applications.

All that code for switching up database connections or delegating reads to read-replicas may disappear behind the curtains of Aurora. We certainly need to do some investigation here, but this is a pristine example of that grey area where the Ruby Infrastructure team will excel.

Web Socketin’

Enabling Web Sockets on smaller applications is trivially easy these days. For larger sites like Scribd.com a large number of variables need to be considered: do we terminate sockets in Rails? What will an incredibly high connection count do to our existing app infrastructure? How will application developers write code which supports Web Sockets and their existing request flows? Does our app host capacity plan change dramatically as a result?

Seemingly mundane requests like “can we enable web sockets?” from application developers or product managers, at the scale of billions of requests per month, can have far reaching implications that the Ruby Infrastructure team is poised perfectly to answer.

Efficient Host Sizing

Our current infrastructure is at times over-provisioned. The specifics I won’t get into in this post, but there is a lot of low-hanging fruit in understanding our existing application footprints and then sizing our infrastructure around them appropriately, whether we’re talking about understanding or improving our memory utilization, or becoming more elastic around CPU utilization. Building an overall understanding of how these Ruby applications perform, how to tune them, and how to structure their resource usage is going to be one of the frequently re-evaluated projects for Ruby Infrastructure.

There are a myriad of other interesting projects which will crop up once a couple Ruby Infrastructure Engineers join the company. Like the other teams in Platform Engineering, this team will be entirely remote which means we can hire the most qualified people we’re able to find, from nearly anywhere.

I’m excited to see the upstream pull requests, RailsConf presentations, and blog posts that we’re going to be able to share once we start solving problems together!

Defining the Real-time Data Platform

2019-08-28T00:00:00+00:00

One of the harder parts about building new platform infrastructure at a company which has been around a while is figuring out exactly where to begin. At Scribd the company has built a good product and curated a large corpus of written content, but where next? As I alluded to in my previous post about the Platform Engineering organization, our “platform” components should help scale out, accelerate, or open up entirely new avenues of development. In this article, I want to describe one such project we have been working on and share some of the thought process behind its inception and prioritization: the Real-time Data Platform.

(sounds fancy huh?)

My first couple weeks at the company were intense. The idea of “Core Platform” was sketched out as a team “to scale apps and data” but that was about the extent of it. The task I took on was to learn as much as I could, as quickly as I could, in order to get the recruiting and hiring machine started. Basically, I needed to point Core Platform in a direction that was correct enough at a high level in order to know what skills my future colleagues should have. While I had tons of discussions and did plenty of reading, I almost feel sheepish to admit this, but much of our direction was heavily influenced by two conversations, both of which took less than an hour.

The first was with Kevin Perko (KP), the head of our Data Science team. His team interacts the most with our current data platform (HDFS, Spark, Hive, etc); in essence Data Science would be considered one of our customers. I asked some variant of “what’s wrong with the data infrastructure?” and KP unloaded what must have been months of pent up frustrations shared by his entire team. The themes that emerged were:

Developers don’t think about the consumers of the data. Garbage in, garbage out!
Many nightly tasks spend a lot of time performing unnecessary pre-processing of data.
The performance of the system is generally poor. Ad-hoc queries from data scientists, depending on the time of day, are competing for resources with automated tasks.
Everything has to be done in this nightly dependent graph of tasks, and when something goes wrong, it’s very manual to recover from errors and typically ruins somebody’s day.

Assuring KP that these were problems we would be solving, his next statement would become a mainstay of our relationship moving forward: “when will it be ready?”

My second influential conversation was with Mike Lewis the head of Product. This conversation was quite simple and didn’t involve as much trauma counseling as the previous. I asked “what can’t you do today because of our technology limitations?” This is a good question to ask product teams every now and again. They frequently are optimising within their current constraints. One role of platform and infrastructure teams is to remove those constraints. We discussed the way in which users convert from passersby, to trial, to paid subscribers. He also highlighted the importance of our recommendations and search results in this funnel, and lamented the speed at which we can highlight relevant content to new users. The maxim goes: the faster a new user sees relevant and interesting content, the more likely they are to stick around.

Pattern matching between the current problems and the technology needed to enable new product initiatives, I named and defined the high level objective for the Real-time Data Platform as follows:

To provide a streaming data platform for collecting and acting upon behavioral data in near real-time with the ultimate goal to enable day zero personalization in Scribd’s products.

In more concrete terms, the platform is a collection of cloud-based services (in AWS, more on that later) for ingesting, processing, and storing behavioral events from frontend, backend, and mobile clients. The scope of the Real-time Data Platform extends from event definition and schema, to the layout of events persisted into long-term queryable storage, and the tooling which sits on top of that queryable storage.

As the nominal “product owner” for the effort, I aimed to describe less about what tools and technologies should be used, and instead forced myself to define tech-agnostic requirements, thereby leaving the discovery work for the team I would ultimately hire.

The Real-time Data Platform must have:

A high, nearing 100% data SLA. Meaning we must design in such a way to reduce data loss or corruption at every point of the pipeline.
Maintain data provenance through the pipeline from data creation to usage. In essence, a Data Scientist should be able to easily track data from where it originated, and understand the transformative steps along the way.
Event streams should be considered API contracts, with schemas suggested or enforced when possible. A consumer from an event stream should be able to trust the quality of the events in that stream.
Data processing and transformation must happen as close to ingestion as possible. Events which arrive in long-term storage must be structured and partitioned for optimal query performance with zero or minimal post-processing required for most use-cases.
The platform must scale as the data volume grows without requiring significant redesign or rework.
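Two of these requirements, treating event streams as contracts and structuring data at ingestion, lend themselves to a small illustration. The sketch below is deliberately tech-agnostic and entirely hypothetical: the `page_views` stream, its fields, and the date-based partition layout are assumptions made for the example, not a description of what we built.

```python
from datetime import datetime, timezone

# Hypothetical contract for one event stream: field name -> required type
PAGE_VIEW_SCHEMA = {
    "event_id": str,
    "user_id": int,
    "document_id": int,
    "occurred_at": str,  # ISO 8601 timestamp with a UTC offset
}


def validate(event, schema):
    """Return a list of contract violations; an empty list means it conforms."""
    errors = []
    for field, expected in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    for field in event:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


def partition_path(event, table="page_views"):
    """Derive a date-based partition directory at ingestion time, in UTC."""
    occurred = datetime.fromisoformat(event["occurred_at"]).astimezone(timezone.utc)
    return f"{table}/date={occurred:%Y-%m-%d}"


good = {
    "event_id": "abc",
    "user_id": 1,
    "document_id": 2,
    "occurred_at": "2019-08-28T23:30:00-07:00",
}
bad = {"event_id": "abc", "user_id": "1"}

print(validate(good, PAGE_VIEW_SCHEMA), partition_path(good))
print(validate(bad, PAGE_VIEW_SCHEMA))
```

Rejecting or quarantining the second event at the edge is what lets a downstream consumer trust the stream, and computing the partition once at ingestion is what removes the nightly pre-processing step.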

In essence, we need to change a number of foundational ways in which we generate, transfer, and consider the data which Scribd uses. As Core Platform has unpeeled layer after layer of this onion, we have been able to affirm at each step of the way that we’re moving in the right direction, which is by itself quite exciting.

The design of the Real-time Data Platform which we’re currently building out is something I will share at a high level in a subsequent blog post.

I want to finish this one with some parting thoughts. If you are building anything foundational in a technology organization, you must talk to the product team. You must also talk to your customers, but I don’t like to ask them what they want, I like to ask what they don’t like and don’t want. Listen to that negative feedback, understand what lies beneath the frustrations. Finally, have a vision for the future, but build and deliver incrementally. When I first sketched this out, I was forthcoming in stating “this is a 2020 project.” I made sure to clarify that this did not mean we wouldn’t deliver anything to the business for 18 months. Instead, I made sure to explain that to execute on this overall vision would be a long journey with milestones along the way.

If you haven’t ever watched a skyscraper being built, you would be amazed at how much of the time is spent digging a great big hole, sinking steel into bedrock, and pouring concrete. Months of people working in a city block-sized hole before anything takes shape that even resembles a skyscraper. Building strong foundations takes time, but that is in essence the role of any platform and infrastructure organization. The challenge is to keep the business moving forward today while also building those fundamental components upon which the business will stand in a year or two.

It is tough, but that’s exactly what I signed up for. :)

Zooming out to Platform Engineering at Scribd

2019-08-22T00:00:00+00:00

The team that I joined Scribd to build, Core Platform, is now up and running with five incredibly talented people. I could not be more pleased with the very friendly and highly functional group of people we have been able to assemble. With that team’s projects underway, my focus has been shifting, zooming out to “Platform Engineering” as a comprehensive part of the engineering group. In this post, I want to expand on what Platform Engineering is planned to be and discuss some of the teams and their responsibilities.

I was hired as the “Director of Platform Engineering”, which at the time was an especially ostentatious title considering an entire group didn’t yet exist. It was so wacky that “Director” has been something I’m almost ashamed to reference. It is not in my email signature and it doesn’t show up in Slack; I don’t want it to interfere with my ability to discuss ideas or hack on something with my colleagues. The role did however have intent behind it: for me to focus on growing the organization. That is a big challenge which I’m fastidiously working towards addressing. As currently scoped, the teams which compose Platform Engineering are:

Core Platform, provides foundational infrastructure to help Scribd scale applications and data.
Data Engineering, treats data as a product, ensuring that high quality data sets are accessible to internal users.
Ruby Infrastructure, helps Scribd adopt or upstream major ecosystem changes which will improve organizational and operational performance of Ruby and Rails.

Defining the scope and charters for these teams has been a rather interesting exercise. Figuring out with the Infrastructure, Data Science, and Internal Tools teams where the edges of our respective responsibilities lie is one of those good healthy debates every organization should have as it grows. A year ago much of engineering was flat with lots of generalists. Compare that to today, where both Product and Engineering groups are learning that specialization, when appropriately applied, can be quite helpful.

What has also been personally challenging about hiring in Data Engineering is my relative inexperience in the field. My jam has always been backend service infrastructure. Across the industry we’re seeing data infrastructure melt into backend production infrastructure. Scribd is no different, but we have a lot of work to do, changing from a mindset of “dumping in the data lake” to where Product and other parts of Engineering are viewing data as a more integral part of their work. Both in generating clean data but also by utilizing derived data sets to make more personalized or responsive user experiences.

The barriers between “data platform” and “production engineering” remind me of the now outdated silos between application developers and operations engineers. I’m not sure what to call it, DevDataOps? Maybe DataDevOps?

I’ll have to figure out the hashtag later.

Anyways, like Core Platform, Data Engineering and Ruby Infrastructure are also intended to be fully remote teams. I maintain that it is better to hire the best people available rather than the best people “around here.” Hiring remotely forces the organization to confront all of the collaboration and communication problems that many growing companies ignore until it’s too late. Recording meeting notes, sharing knowledge, pair problem solving, capturing decisions, discussing project roles and responsibilities, all of these are crucial for effective remote work and they are all unsurprisingly qualities of effective colocated teams too.

The work we have done thus far in Core Platform I believe sets a strong precedent for other teams within Platform Engineering and outside of it. We have patterns of work defined and documented, which will make each successive remote team we hire at Scribd that much easier to get up and running.

While we’re hiring across the board (who isn’t?), the folks I am specifically hiring for are:

Core Platform
- Application Platform Engineer
- Data Platform Engineer
Data Engineering
- Data Engineering Manager
- Data Engineer
Ruby Infrastructure
- Ruby Infrastructure Engineer

We’re also hiring an Infrastructure Team Manager who I would be working heavily with.

If you’re curious about these roles, or Platform Engineering type things, please email me: rtyler at brokenco.de

If you’re not curious about those roles, but want to share thoughts on remote engineering, you can email me for that too! At some point I want to write down all the patterns and practices I have learned, adopted, or stopped using over the past five years for building successful remote engineering organizations. That idea, however, is pending a surplus of spare time which isn’t currently in the budget. :)

I have been afforded a lot of leeway by my boss to publicly discuss not only the projects that we’re working on, but a bit of the work we’re doing behind the scenes. Over the coming months I’m looking forward to sharing even more about what scaling up an organization like Scribd requires, where we’ve failed, and where we’re succeeding.