Entering the data platform space with a background in more traditional production operations is a lot of fun, especially when you ask questions like "what if X goes horribly wrong?" My favorite scenario to consider is: "how much damage could one accidentally cause with our existing policies and controls?" At Scribd we have made Delta Lake a cornerstone of our data platform, and as such I've spent a lot of time thinking about what could go wrong and how we would defend against it.
To start, I recommend reading this recent post from Databricks: Attack of the Delta Clones, which provides a good overview of the CLONE operation in Delta and some patterns for "undoing" mistaken operations. Their blog post does a fantastic job demonstrating the power of clones in Delta Lake, for example:
```sql
-- Creating a new cloned table from loan_details_delta
CREATE OR REPLACE TABLE loan_details_delta_clone
  DEEP CLONE loan_details_delta;

-- Original view of data
SELECT addr_state, funded_amnt
  FROM loan_details_delta
  GROUP BY addr_state, funded_amnt;

-- Clone view of data
SELECT addr_state, funded_amnt
  FROM loan_details_delta_clone
  GROUP BY addr_state, funded_amnt;
```
For my disaster recovery needs, the clone-based approach is insufficient, as I detailed in this post on the delta-users mailing list:
Our requirements are basically to prevent catastrophic loss of business critical data via:
- Erroneous rewriting of data by an automated job
- Inadvertent table drops through metastore automation
- Overaggressive use of the VACUUM command, as sketched below
- Failed manual sync/cleanup operations by Data Engineering staff
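The VACUUM scenario is a good illustration of how little stands between routine maintenance and permanent data loss. As a minimal sketch (reusing the loan_details_delta table from the earlier example; the retention check setting is Delta's standard safeguard), this is all it takes to delete every file not referenced by the current table version:

```sql
-- Delta normally refuses to VACUUM with a retention window under 168 hours;
-- that safeguard is one session setting away from being disabled.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Permanently deletes all files outside the current version, destroying
-- time travel and any in-Delta "undo" along with them.
VACUUM loan_details_delta RETAIN 0 HOURS;
```

Once those files are gone, no CLONE or RESTORE of the surviving table can bring them back, which is why our defenses need to live outside of Delta entirely.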
It’s important to consider whether you’re worried about the transaction log getting corrupted, files in storage (e.g. ADLS) disappearing, or both.
Generally speaking, I'm less concerned about malicious actors than incompetent ones. It is far more likely that a member of the team accidentally deletes data than that somebody kicks in a few layers of cloud-based security and deletes it for us.
My preference is to work at a layer below Delta Lake to provide disaster recovery mechanisms, in essence at the object store layer (S3). Relying strictly on CLONE gets you copies of data, which can definitely be beneficial, but the downside is that whatever is running the query has access to both the "source" and the "backup" data.
The concern is that if some mistake were able to delete my source data, there would be nothing standing in the way of it deleting the backup data as well.
In my mailing list post, I posited a potential solution:
For example, with a simple nightly rclone (rclone.org) based snapshot of an S3 bucket, the "restore" might mean copying the transaction log and new parquet files back to the originating S3 bucket and losing up to 24 hours of data, since the transaction logs would basically be rewound to the last backup point.
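As a rough sketch of what that nightly snapshot could look like (the rclone remote and bucket names here are hypothetical), each run copies a table's transaction log and parquet files into a date-stamped prefix in a separate bucket:

```sh
# Hypothetical nightly snapshot of a Delta table; "s3remote" is an rclone
# remote configured for the AWS account, and the bucket names are made up.
# Each run lands a full copy under a date-stamped prefix, so a restore is
# just the reverse copy from the most recent good snapshot.
rclone sync \
    s3remote:prod-delta-lake/tables/loan_details_delta \
    s3remote:delta-lake-backups/$(date +%Y-%m-%d)/loan_details_delta
```

The date-stamped prefixes trade storage for simplicity: every snapshot is a self-contained copy of the table, so a restore never requires reconstructing state from incremental diffs.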
Since that email we have deployed our Delta Lake backup solution, which operates strictly at the S3 layer and allows us to impose hard walls (IAM) between writers of the source and backup data.
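To give a flavor of what those "hard walls" can look like, here is a hedged sketch of an S3 bucket policy (not our production policy; the bucket name, account ID, and role name are hypothetical) that lets a single backup role write to the backup bucket while denying deletes to everyone:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowBackupWriter",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/delta-backup-writer" },
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::delta-lake-backups",
        "arn:aws:s3:::delta-lake-backups/*"
      ]
    },
    {
      "Sid": "DenyAllDeletes",
      "Effect": "Deny",
      "Principal": "*",
      "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
      "Resource": "arn:aws:s3:::delta-lake-backups/*"
    }
  ]
}
```

With an explicit Deny in place, even a job that has been granted broad access to the source bucket cannot touch the backups without first rewriting the bucket policy itself.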
Update: my colleague Kuntal wrote this blog post on backing up Delta Lake with AWS S3 Batch Operations, which is what we're doing here at Scribd.