Recovering from disasters with Delta Lake

Entering the data platform space with a background in more traditional production operations is a lot of fun, especially when you get to ask questions like “what if X goes horribly wrong?” My favorite scenario to consider is: “how much damage could one accidentally cause with our existing policies and controls?” At Scribd we have made Delta Lake a cornerstone of our data platform, and as such I’ve spent a lot of time thinking about what could go wrong and how we would defend against it.

To start, I recommend reading this recent post from Databricks, Attack of the Delta Clones, which provides a good overview of the CLONE operation in Delta and some patterns for “undoing” mistaken operations. Their blog post does a fantastic job of demonstrating the power of clones in Delta Lake, for example:

-- Creating a new cloned table from loan_details_delta
CREATE OR REPLACE TABLE loan_details_delta_clone
    DEEP CLONE loan_details_delta;

-- Original view of data
SELECT addr_state, funded_amnt FROM loan_details_delta GROUP BY addr_state, funded_amnt;

-- Clone view of data
SELECT addr_state, funded_amnt FROM loan_details_delta_clone GROUP BY addr_state, funded_amnt;

For my disaster recovery needs, the clone-based approach is insufficient, as I detailed in this post on the delta-users mailing list:

Our requirements are basically to prevent catastrophic loss of business critical data via:

Erroneous rewriting of data by an automated job

Inadvertent table drops through metastore automation

Overaggressive use of VACUUM command

Failed manual sync/cleanup operations by Data Engineering staff

It’s important to consider whether you’re worried about the transaction log getting corrupted, files in storage (e.g. ADLS) disappearing, or both.

Generally speaking, I’m less concerned about malicious actors than incompetent ones. It is far more likely that a member of the team accidentally deletes data than that somebody kicks in a few layers of cloud-based security and deletes it for us.

My preference is to provide disaster recovery mechanisms at a layer below Delta Lake, in essence at the object store layer (S3). Relying strictly on CLONE gets you copies of the data, which can definitely be beneficial, but the downside is that whatever is running the query has access to both the “source” and the “backup” data.

The concern is that if some mistake is able to delete my source data, nothing actually stands in the way of it deleting the backup data as well.

In my mailing list post, I posited a potential solution:

For example, with a simple nightly rclone(.org) based snapshot of an S3 bucket, the “restore” might mean copying the transaction log and new parquet files back to the originating S3 bucket and losing up to 24 hours of data, since the transaction logs would basically be rewound to the last backup point.
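To make the “rewind” idea concrete, here is a minimal Python sketch that compares a listing of the source bucket against a listing of the nightly backup. It is an illustration rather than the tooling we run: the `events/` table prefix, the function names, and the hard-coded key lists are all hypothetical, and it only looks at JSON commit files, ignoring checkpoints. The one thing it relies on is that Delta names its commit files in `_delta_log/` with a 20-digit, zero-padded version number.

```python
import re

# Delta commit files are named <20-digit zero-padded version>.json in _delta_log/.
COMMIT_RE = re.compile(r"_delta_log/(\d{20})\.json$")


def commit_versions(keys):
    """Return the sorted Delta commit versions found in a list of object keys."""
    versions = []
    for key in keys:
        match = COMMIT_RE.search(key)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


def restore_plan(source_keys, backup_keys):
    """Describe what restoring from the backup listing would copy and lose."""
    backup_versions = commit_versions(backup_keys)
    if not backup_versions:
        raise ValueError("backup contains no Delta commits to restore from")
    restore_point = backup_versions[-1]
    # Objects present in the backup but missing from the source must be copied back.
    objects_to_copy = sorted(set(backup_keys) - set(source_keys))
    # Commits written after the backup was taken are lost when the log is rewound.
    commits_lost = [v for v in commit_versions(source_keys) if v > restore_point]
    return {
        "restore_point": restore_point,
        "objects_to_copy": objects_to_copy,
        "commits_lost": commits_lost,
    }


# Hypothetical listings: the backup was taken at version 1, after which the
# source gained commit 2 and then lost a data file that version 1 still needs.
backup = [
    "events/_delta_log/00000000000000000000.json",
    "events/_delta_log/00000000000000000001.json",
    "events/part-0000.parquet",
    "events/part-0001.parquet",
]
source = [
    "events/_delta_log/00000000000000000000.json",
    "events/_delta_log/00000000000000000001.json",
    "events/_delta_log/00000000000000000002.json",
    "events/part-0001.parquet",
    "events/part-0002.parquet",
]

plan = restore_plan(source, backup)
print(plan)
```

In this made-up scenario the plan says to copy `events/part-0000.parquet` back and to accept the loss of commit 2, which is the “up to 24 hours of data” trade-off in miniature.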

Since that email, we have deployed our Delta Lake backup solution, which operates strictly at the S3 layer and allows us to impose hard walls (IAM) between writers of the source and backup data.
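As a rough illustration of what such a wall can look like, the Python sketch below builds an identity-based IAM policy document that explicitly denies all S3 actions on a backup bucket. This is not our actual policy: the bucket name `example-delta-backup` and the helper `deny_bucket_policy` are hypothetical, and the assumption is that a policy like this would be attached to whatever role the Delta writers run as. It leans on the fact that an explicit `Deny` in IAM overrides any `Allow` granted elsewhere.

```python
import json


def deny_bucket_policy(bucket):
    """Build an IAM policy document denying every S3 action on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyBackupBucketAccess",
                "Effect": "Deny",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # the bucket itself (e.g. listing)
                    f"arn:aws:s3:::{bucket}/*",  # every object inside the bucket
                ],
            }
        ],
    }


# Hypothetical bucket name, used only for illustration.
policy = deny_bucket_policy("example-delta-backup")
print(json.dumps(policy, indent=2))
```

With something along these lines in place, a job that goes rogue against the source tables has no path to the backup copies, no matter what it runs.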

One of my colleagues is writing that blog post up for tech.scribd.com, and I hope to see it published later this week, so make sure you follow us on Twitter @ScribdTech or subscribe to the RSS feed!

Update: my colleague Kuntal wrote this blog post on backing up Delta Lake with AWS S3 Batch Operations, which is what we’re doing here at Scribd.