Run both Databricks Optimize and Vacuum?

Does it make sense to call BOTH Databricks (Delta) Optimize and Vacuum? It SEEMS like it makes sense but I don't want to just infer what to do. I want to ask.
Vacuum
Recursively vacuum directories associated with the Delta table and remove data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold. Files are deleted according to the time they have been logically removed from Delta’s transaction log + retention hours, not their modification timestamps on the storage system. The default threshold is 7 days.
Optimize
Optimizes the layout of Delta Lake data. Optionally optimize a subset of data or colocate data by column. If you do not specify colocation, bin-packing optimization is performed.
Second question: if the answer is yes, which is the best order of operations?
Optimize then Vacuum
Vacuum then Optimize

Yes, you need to run both commands, at least to clean up the files that were rewritten by OPTIMIZE. With default settings the order shouldn't matter, as VACUUM will delete files only after 7 days. The order matters only if you run VACUUM with a retention of 0 seconds, but that isn't recommended anyway as it removes the whole history.
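For example, a minimal sketch of that sequence in PySpark (assuming a Delta table named events and the default 7-day retention):
# OPTIMIZE compacts small files (bin-packing unless you ask for Z-ordering)
spark.sql("OPTIMIZE events")
# VACUUM then removes files that have been unreferenced for longer than the retention threshold
spark.sql("VACUUM events")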

Related

What happens during VACUUM with Delta tables?

When we run the VACUUM command, does it go through each parquet file and remove older versions of each record, or does it retain all the parquet files even if a file has only one record with the latest version? What about compaction? Is this any different?
Vacuum and Compaction go through the _delta_log/ folder in your Delta Lake Table and identify the files that are still being referenced.
Vacuum deletes all unreferenced files.
Compaction reads in the referenced files and writes your new partitions back to the table, unreferencing the existing files.
Think of a single version of a Delta Lake table as a set of parquet data files. Every version adds an entry (about files added and removed) to the transaction log (under _delta_log directory).
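If you want to see those add/remove entries yourself, a quick sketch (assuming the table lives at path) is to read the commit JSON files directly:
# Each commit file under _delta_log records "add" and "remove" actions for data files
log = spark.read.json(f"{path}/_delta_log/*.json")
log.select("add.path", "remove.path").show(truncate=False)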
VACUUM
VACUUM lets you define how many hours of history to retain (using the RETAIN number HOURS clause). That tells Delta Lake which versions can be deleted (everything older than number HOURS). These versions are "translated" into a series of parquet files (remember that a single parquet file belongs to a single version until it is deleted, which may take a couple of versions).
This translation gives the files to be deleted.
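For example, a sketch with a hypothetical events table; the DRY RUN variant lists the files that would be deleted without removing anything:
# Preview which files a 7-day (168 hour) retention window would delete
spark.sql("VACUUM events RETAIN 168 HOURS DRY RUN")
# Then actually delete them
spark.sql("VACUUM events RETAIN 168 HOURS")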
Compaction
Compaction is basically an optimization (and is usually triggered by the OPTIMIZE command, or by a combination of repartition, dataChange disabled, and overwrite).
This is nothing more than another version of the Delta table (but this time the data is not changed, so other concurrent transactions can still happily commit).
The explanation about VACUUM above applies here.
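As a sketch, compaction can be triggered either way mentioned above (the OPTIMIZE Python API assumes Delta Lake 2.0+; path and the file count are placeholders):
from delta.tables import DeltaTable

# Route 1: let Delta bin-pack small files for you
DeltaTable.forPath(spark, path).optimize().executeCompaction()

# Route 2: manual compaction - rewrite into fewer files without changing the data
(spark.read.format("delta").load(path)
    .repartition(16)
    .write.format("delta")
    .option("dataChange", "false")
    .mode("overwrite")
    .save(path))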

How to use vacuum to delete old files created by compaction without losing ability to time travel

I am running the OPTIMIZE command for compaction. Now I want to delete the old files left over after compaction. But if I use the VACUUM command, then I lose the ability to time travel. So, what is the better way to delete the old files left over from compaction without losing the ability to time travel?
It depends on what you are trying to achieve. Time travel is really meant for shorter-term debugging as opposed to long-term storage per se. If you would like to keep the data around long-term, consider using Delta CLONE, per the blog post Attack of the Delta Clones (Against Disaster Recovery Availability Complexity).
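A sketch of what that could look like, using Databricks' CLONE syntax (the table names and date are hypothetical):
# Snapshot the table as of the point in time you care about into a separate, long-lived table
spark.sql("""
  CREATE OR REPLACE TABLE events_snapshot_20190715
  DEEP CLONE events TIMESTAMP AS OF '2019-07-15'
""")
Because a deep clone copies the data files, vacuuming the source table no longer affects the archived snapshot.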

Delta Lake: don't we need a time partition for fully reprocessed tables anymore?

Objective
Suppose you're building a Data Lake and Star Schema with the help of ETL. The storage format is Delta Lake. One of the ETL responsibilities is to build Slowly Changing Dimension (SCD) tables (cumulative state). This means that every day, for every SCD table, the ETL reads the full table state, applies updates, and saves it back (full overwrite).
Question
One of the questions we argued about within my team: should we add a time partition to SCD (full overwrite) tables? That is, should I save the latest (full) table state to SOME_DIMENSION/ or to SOME_DIMENSION/YEAR=2020/MONTH=12/DAY=04/?
Considerations
On one hand, Delta Lake has all the required features: time travel & ACID. When you overwrite the whole table, logical deletion happens, and you're still able to query old versions and roll back to them. So Delta Lake is almost managing the time partition for you, and the code gets simpler.
On the other hand, I said "almost" because IMHO time travel & ACID don't cover 100% of the use cases. They have no notion of arrival time. For example:
Example (when you need a time partition)
The BA team reported that the SOME_FACT/YEAR=2019/MONTH=07/DAY=15 data are broken (facts must be stored with a time partition in any case, because data are processed by arrival time). In order to reproduce the issue in a DEV/TEST environment, you need the raw inputs of 1 fact table and 10 SCD tables.
With facts everything is simple, because you have the raw inputs in the Data Lake. But with incremental state (SCD tables) things get complex: how do you get the state of the 10 SCD tables at the point in time when SOME_FACT/YEAR=2019/MONTH=07/DAY=15 was processed? How do you do this automatically?
To complicate things even more, your environment may go through a bunch of bugfixes and history re-processings, meaning the 2019-07 data may be reprocessed sometime in 2020. And Delta Lake allows you to roll back only based on processing time or version number, so you don't actually know which version you should use.
On the other hand, with date partitioning, you are always sure that SOME_FACT/YEAR=2019/MONTH=07/DAY=15 was calculated over SOME_DIMENSION/YEAR=2019/MONTH=07/DAY=15.
It depends, and I think it's a bit more complicated.
Some context first: Delta gives you time travel only within the retained commit history, which is 30 days by default. If you are doing optimizations (and vacuuming the rewritten files), that window might be significantly shorter (7 days by default).
Also, you actually can query Delta tables as of a specific time, not only a version, but due to the above limitations (unless you are willing to pay the performance and financial cost of keeping a really long commit history), it's not useful from a long-term perspective.
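For completeness, this is what querying as of a specific time looks like (a sketch; the path and date are placeholders):
# Works only while that version is still within the retained history
df = (spark.read.format("delta")
      .option("timestampAsOf", "2020-12-04")
      .load("/mnt/lake/SOME_DIMENSION"))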
This is why a very common data lake architecture right now is the medallion approach (Bronze -> Silver -> Gold). Ideally, I'd store the raw inputs in the Bronze layer, keep the whole historical perspective in the Silver layer (already clean, validated, the best source of truth, with as much history as needed), and consume the current version directly from the Gold tables.
This avoids increasing the complexity of querying the SCDs due to additional partitions, while giving you the option to "go back" to the Silver layer if the need arises. But it's always a tradeoff decision; in any case, don't rely on Delta for long-term versioning.

Configuring TTL on a deltaLake table

I'm looking for a way to add a TTL (time-to-live) to my Delta Lake table so that any record in it goes away automatically after a fixed span. I haven't found anything concrete yet. Does anyone know if there's a workaround for this?
Unfortunately, there is no configuration called TTL (time-to-live) in Delta Lake tables.
You can remove files that are no longer referenced by a Delta table and are older than the retention threshold by running the VACUUM command on the table. VACUUM is not triggered automatically. The default retention threshold for the files is 7 days.
Delta Lake provides snapshot isolation for reads, which means that it is safe to run OPTIMIZE even while other users or jobs are querying the table. Eventually, however, you should clean up old snapshots.
You can do this by running the VACUUM command:
VACUUM events
You control the age of the latest retained snapshot by using the RETAIN HOURS option:
VACUUM events RETAIN 24 HOURS
For details on using VACUUM effectively, see the Vacuum documentation.
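If what you actually need is record-level expiry rather than file cleanup, one common workaround is a scheduled job that deletes old rows and then vacuums (a sketch; the events table and event_time column are hypothetical):
# Remove rows older than 30 days, then clean up the files left behind
spark.sql("DELETE FROM events WHERE event_time < date_sub(current_date(), 30)")
spark.sql("VACUUM events")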

Delta Lake (OSS) Table on EMR and S3 - Vacuum takes a long time with no jobs

I'm writing a lot of data into Databricks Delta lake using the open source version, running on AWS EMR with S3 as storage layer. I'm using EMRFS.
For performance improvements, I'm compacting and vacuuming the table every so often like so:
from delta.tables import DeltaTable

# Compact the table: rewrite it into num_files files without changing the data
(spark.read.format("delta").load(s3path)
    .repartition(num_files)
    .write.option("dataChange", "false").format("delta").mode("overwrite").save(s3path))

# Remove files that are no longer referenced, keeping 24 hours of history
t = DeltaTable.forPath(spark, s3path)
t.vacuum(24)
It's then deleting hundreds of thousands of files from S3. However, the vacuum step takes an extremely long time. During this time the job appears idle, but every ~5-10 minutes there is a small task indicating the job is alive and doing something.
I've read through this post, Spark: long delay between jobs, which seems to suggest it may be related to Parquet, but I don't see any options on the Delta side to tune any parameters.
I've also observed that the Delta vacuum command is quite slow. The open source developers are probably limited from making AWS-specific optimizations in the repo because the library is cross-platform (it needs to work on all clouds).
I've noticed that vacuum is slow even when running locally. You can clone the Delta repo, run the test suite on your local machine, and see for yourself.
Deleting hundreds of thousands of files stored in S3 is slow, even if you're using the AWS CLI. You should see if you can refactor your compaction operation to create fewer files that need to be vacuumed.
Suppose your goal is to create 1GB files. Perhaps you have 15,000 one-gig files and 20,000 small files. Right now, your compaction operation is rewriting all of the data (so all 35,000 original files need to be vacuumed post-compaction). Try to refactor your code to only compact the 20,000 small files (so the vacuum operation only needs to delete 20,000 files).
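If the small files are concentrated in particular partitions, one way to do that is the partition-scoped compaction pattern (a sketch; the partition predicate and file count are placeholders), so partitions that are already fine never get rewritten:
# Only rewrite the partition(s) that actually contain the small files
predicate = "date = '2020-12-04'"
(spark.read.format("delta").load(s3path)
    .where(predicate)
    .repartition(20)
    .write.format("delta")
    .option("dataChange", "false")
    .option("replaceWhere", predicate)
    .mode("overwrite")
    .save(s3path))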
The real solution is to build a vacuum command that's optimized for AWS. Delta Lake needs to work with all the popular clouds and the local filesystem. It should be pretty easy to make an open source library that reads the transaction log, figures out what files need to be deleted, makes a performant file deletion API call, and then writes out an entry to the transaction log that's Delta compliant. Maybe I'll make that repo ;)
Here's more info on the vacuum command. As a sidenote, you may want to use coalesce instead of repartition when compacting, as described here.
EDIT:
Delta issue: https://github.com/delta-io/delta/issues/395
and PR: https://github.com/delta-io/delta/pull/416
There was an issue filed for this in Delta Lake.
Problem Statement:
Delta Lake vacuum jobs were taking too long to finish because the underlying file deletion logic was sequential. This was a known bug in Delta Lake (v0.6.1). Ref: https://github.com/delta-io/delta/issues/395
Solution:
The Delta Lake team has resolved this issue, but a stable release containing the fix has not been published yet. Pull Request: https://github.com/delta-io/delta/pull/522
For v0.6.x
A lot of organizations are using 0.6.x in production and want this fix to be part of 0.6.x. The following post gives quick steps to build a Delta 0.6.1 jar with this patch:
https://swapnil-chougule.medium.com/delta-with-improved-vacuum-patch-381378e79d1d
With this change, parallel deletion of files is supported during the vacuum job. It speeds up the process and reduces execution time.
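On releases that include the patch, parallel deletion is controlled by a Spark config (set it before running vacuum):
# Delete files in parallel across the cluster instead of sequentially from the driver
spark.conf.set("spark.databricks.delta.vacuum.parallelDelete.enabled", "true")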
