After five years in production, kafka-delta-ingest at Scribd has been shut off and removed from our infrastructure. kafka-delta-ingest was the motivation behind my team creating delta-rs, the most successful open source project I have started to date. With kafka-delta-ingest we achieved our original stated goals and reduced streaming data ingestion costs by 95%. In the time since, however, we have reduced that cost even further with more efficient infrastructure.
The original kafka-delta-ingest/delta-rs implementations were created by the joint efforts of the following talented developers across three continents in the middle of 2020, an otherwise totally chill time in world history.
Prior to our creation of delta-rs, the only way to read and write Delta Lake tables was through Apache Spark. While Spark is an incredibly powerful tool for reading and transforming data, it is slow and overweight for the task of high-throughput data ingestion. QP and I found ourselves loving Rust, and I was able to corner the funding to get the project started on the promise of lower operational costs.
Boy howdy, has the investment in Rust delivered. The implementation of kafka-delta-ingest dramatically lowered our operational costs, as Christian shares in this video:
Christian also shared some architecture and discussion in this video, which I think are useful for anybody building streaming systems around Delta Lake.
Here’s a demo by Christian too!
Ultimately, kafka-delta-ingest was decommissioned because I created an even cheaper ingestion process. My work on the oxbow suite, coupled with the medallion architecture, has brought contemporary Delta Lake ingestion to less than 10% of the total data platform cost.
The big argument against kafka-delta-ingest was Apache Kafka. If an organization has Kafka for other reasons, then kafka-delta-ingest can be a useful “sidecar” process to persist data flowing through Kafka. If, however, the organization is running Kafka just for ingestion, there are cheaper options available. As the organization evolved, the other consumers of Kafka drifted away, driving the value proposition of kafka-delta-ingest lower and lower.
This doesn’t mean kafka-delta-ingest is not useful; it’s just no longer useful at Scribd.
Kyjah Keyes and I are the maintainers of kafka-delta-ingest, and we are now both in the position of not actually using it anymore.
I will continue to make delta-rs upgrades to it, since kafka-delta-ingest continues to be a useful test bed for API changes and integration testing, but I don’t have big plans or ideas on how to grow the project further.