Building a real-time data platform with Apache Spark and Delta Lake

The Real-time Data Platform is one of the fun things we have been building at Scribd since I joined in 2019. Last month I was fortunate enough to share some of our approach in a presentation at Spark + AI Summit titled “The revolution will be streamed.” At a high level, what I had branded the “Real-time Data Platform” is really Apache Kafka, Apache Airflow, Structured Streaming with Apache Spark, and a smattering of microservices to help shuffle data around, all sitting on top of Delta Lake, which acts as an incredibly versatile and useful storage layer for the platform.
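To make the shape of that stack a little more concrete, here is a minimal sketch of a Structured Streaming job that reads a Kafka topic and appends it to a Delta table. This is an illustration, not Scribd's production code: the function names, broker address, topic, and paths are all hypothetical, and it assumes a `SparkSession` with the Kafka source and Delta Lake packages already available on the classpath.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="earliest"):
    """Build the option map for Spark's Kafka source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def stream_topic_to_delta(spark, bootstrap_servers, topic, table_path, checkpoint_path):
    """Continuously append a Kafka topic to a Delta table (hypothetical helper).

    `spark` is an existing SparkSession; the caller is assumed to have
    configured the Kafka and Delta Lake dependencies.
    """
    raw = (
        spark.readStream
        .format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )

    # Kafka delivers key and value as binary; cast them to strings and keep
    # the topic/partition/offset metadata alongside each record.
    events = raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "topic",
        "partition",
        "offset",
        "timestamp",
    )

    # The checkpoint location lets the query resume from where it left off
    # after a restart.
    return (
        events.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .start(table_path)
    )
```

A call might look like `stream_topic_to_delta(spark, "broker:9092", "events", "/delta/events", "/delta/_checkpoints/events")`, with every argument swapped for real values. Downstream batch or streaming jobs can then read the same Delta table.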

In my presentation, which is embedded below, I outline how we tie together Kafka, Databricks, and Delta Lake.

The recorded presentation also complements some of our tech.scribd.com blog posts, which I recommend reading as well.

I am incredibly proud of the work the Platform Engineering organization has done at Scribd to make real-time data a reality. I also cannot recommend Kafka + Spark + Delta Lake highly enough for those with similar requirements.

Now that we have the platform in place, I am also excited for our late 2020 and 2021 roadmaps, which will start to take advantage of real-time data.