Data Audit with Databricks: Delta Time Travel
Databricks is an amazing platform for data engineering, data science and machine learning. One of the critical requirements of secure data processing is data audit - the ability to identity what data changes have been performed, when, and who authored the changes.
Unfortunately Databricks data science workspace does not provide audit capability. SQL Analytics workspaces provide query history but unfortunately it is limited to 30 days (confirmed with Microsoft product team on Azure).
Luckily Databricks Delta Lake Time Travel capability helps to fill this gap.
For details one can reference Microsoft docs: Query an older snapshot of a table (time travel)
The article below shows how easily this capability can be used for data audit : retrieve details of each change, including values before and after change. Below are tested insert, update and delete scenarios leveraging Delta Lake from Databricks academy samples.
First lets identify the record we will update with rest heart rate 72.82814960035947.
Now lets update the record to increase rest heart rate by 1. As we can see one record has been updated.
Now lets delete a record with heart rate value 52.06579439269219 (can see one record deleted)
Now lets insert one record copied from existing record but set an earlier date. Then lets check inserted date so we can validate against audit.
As we can see we can use the "describe history" command to identity who and when performed the update, delete and insert in Delta Lake.
But we can do better then that. We can use timestamp from table history to identify actual values in updated, deleted and inserted records.
We can see resting heart rate has been increased by 1.
Getting all the record with timestamp previous to delete and removing all records that exist after delete allows see deleted records.
Similarly getting all records after the insert and removing all records that existed prior allows identify inserted data.
One periodically would optimize Delta lake and use "VACUUM" to clean up files associated with a table. VACUUM removes uncommitted files older than a retention threshold. The default threshold is 7 days.
If you run VACUUM on a Delta table, you lose the ability to time travel back to a version older than the specified data retention period.
One can set the following retention parameters on Delta if default behavior proves insufficient.
from delta.tables import * deltaTable = DeltaTable.forName(spark, 'dbacademy_romicgd.health_profile_data') deltaTable.logRetentionDuration = "interval 365 days" deltaTable.deletedFileRetentionDuration = "interval 365 days"
From Databricks documentation Data retention
delta.logRetentionDuration = "interval <interval>": controls how long the history for a table is kept. The default is interval 30 days. Each time a checkpoint is written, Databricks automatically cleans up log entries older than the retention interval. If you set this config to a large enough value, many log entries are retained. This should not impact performance as operations against the log are constant time. Operations on history are parallel but will become more expensive as the log size increases.
delta.deletedFileRetentionDuration = "interval <interval>": controls how long ago a file must have been deleted before being a candidate for VACUUM. The default is interval 7 days. To access 30 days of historical data even if you run VACUUM on the Delta table, set delta.deletedFileRetentionDuration = "interval 30 days". This setting may cause your storage costs to go up.
Delta Lake safety check to prevent you from running a dangerous VACUUM command is spark.databricks.delta.retentionDurationCheck.enabled Spark configuration property (should be set to "true" unless you certain no damage or audit capability is lost due to VACUUM operation).
Detailed in Microsoft documentation: Vacuum a Delta table (Delta Lake on Azure Databricks)
Also as per following Microsoft documentation: Vaccuming with zero retention results in data loss - Azure Databricks