What happens at VACUUM with delta tables? - delta-lake

When we run the VACUUM command, does it go through each parquet file and remove older versions of each record, or does it retain a parquet file even if only one record in it is at the latest version? What about compaction? Is that any different?

Vacuum and Compaction go through the _delta_log/ folder in your Delta Lake table and identify the files that are still being referenced.
Vacuum deletes all unreferenced files.
Compaction reads in the referenced files, writes your new partitions back to the table, and un-references the existing files.

Think of a single version of a Delta Lake table as a set of parquet data files. Every version adds an entry (about files added and removed) to the transaction log (under the _delta_log directory).
VACUUM
VACUUM lets you define how many hours of history to retain (using the RETAIN number HOURS clause). That tells Delta Lake which versions are old enough to be cleaned up. These versions are "translated" into a series of parquet files (remember that a single parquet file belongs to a single version until it is deleted, which may take a couple of versions).
This translation gives the files to be deleted.
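As a rough sketch (assuming a Delta table registered as events, which is just a placeholder name), the retention window can be passed either in SQL or through the delta-spark Python API:
from delta.tables import DeltaTable

# SQL form: delete unreferenced files older than 168 hours (7 days)
spark.sql("VACUUM events RETAIN 168 HOURS")

# Equivalent call through the Python API
DeltaTable.forName(spark, "events").vacuum(168)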
Compaction
Compaction is basically an optimization (usually triggered by the OPTIMIZE command, or by a combination of repartition, dataChange disabled and an overwrite; see the sketch below).
The result is nothing other than another version of the delta table (but this time the data is not changed, so other transactions can happily be committed).
The explanation about VACUUM above applies here.
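A minimal sketch of the manual variant, assuming a Delta table at a placeholder path (OPTIMIZE achieves the same in one command where it is available):
# Manual compaction: rewrite the table into fewer, larger files
path = "/tmp/delta/events"  # placeholder path
(spark.read
      .format("delta")
      .load(path)
      .repartition(16)                 # target number of files, chosen arbitrarily here
      .write
      .format("delta")
      .mode("overwrite")
      .option("dataChange", "false")   # marks the commit as a rearrangement, not new data
      .save(path))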

Related

Is it ok to use a delta table tracker based on parquet file name in Azure databricks?

Today at work I saw a delta lake tracker based on file name. By delta tracker, I mean a function that defines whether a parquet file has already been ingested or not.
The code would check which files (from the delta table) have not already been ingested, and the parquet files in the delta table would then be read using this: spark.createDataFrame(path, StringType())
Having worked with Delta tables, it does not seem ok to me to use a delta tracker that way.
In case a record is deleted, what are the chances that the delta log would point to a new file, and that this deleted record would be read as a new one?
In case a record is updated, what are the chances that the delta log would not point to a new file, and that this updated record would not be considered?
In case some maintenance is happening on the delta table, what are the chances that some new files are written out of nowhere, which may cause a record to be re-ingested?
Any observation or suggestion on whether it is ok to work that way would be great. Thank you
In Delta Lake everything works at the file level, so there are no 'in place' updates or deletes. Say a single record gets deleted (or updated); then roughly the following happens:
Read in the parquet file with the relevant record (+ the other records which happen to be in the file)
Write out all records except for the deleted record into a new parquet file
Update the transaction log with a new version, marking the old parquet file as removed and the new parquet file as added. Note the old parquet file doesn't get physically deleted until you run the VACUUM command.
The process for an update is basically the same.
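A minimal sketch of that sequence through the delta-spark API (the path and predicate are hypothetical); the table history then shows a new version recorded for the delete:
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/tmp/delta/events")  # placeholder path

# Rewrites the affected parquet files without the matching records
dt.delete("id = 42")

# The transaction log now holds a new version marking old files removed and new ones added
dt.history(5).select("version", "operation", "operationMetrics").show(truncate=False)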
To answer your questions more specifically:
In case record is deleted, what are the chances that the delta log would point to a new file, and that this deleted record would be read as a new one?
The delta log will point to a new file, but the deleted record will not be in there. There will be all the other records which happened to be in the original file.
In case record is updated, what would be the chance that delta log would not point to a new file, and that this updated record would not be considered?
Files are not updated in place, so this doesn't happen. A new file is written containing the updated record (+ any other records from the original file). The transaction log is updated to 'point' to this new file.
In case some maintenance is happening on the delta table, what are the chances that some new files are written out of nowhere? Which may cause a record to be re-ingested
This is possible, although not 'out of nowhere'. For example, if you run OPTIMIZE, existing parquet files get reshuffled/combined to improve performance. Basically this means a number of new parquet files will be written, and a new version in the transaction log will point to these parquet files. If you pick up all new files after this, you will re-ingest data.
Some considerations: if your delta table is append-only you could use Structured Streaming to read from it instead. If not, Databricks offers Change Data Feed, giving you record-level details of inserts, updates and deletes.
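For the Change Data Feed route, a minimal sketch (the table name my_table is a placeholder, and the feature has to be enabled on the table and supported by your runtime):
# Enable the change data feed on the table (placeholder name)
spark.sql("ALTER TABLE my_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read row-level changes starting from a version at which CDF was already enabled
changes = (spark.read
                .format("delta")
                .option("readChangeFeed", "true")
                .option("startingVersion", 5)   # hypothetical starting version
                .table("my_table"))

# Each row carries _change_type: insert, update_preimage, update_postimage or delete
changes.select("_change_type", "_commit_version").show()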

Spark overwrite does not delete files in target path

My goal is to build a daily process that will overwrite all partitions under a specific path in S3 with new data from a data frame.
I do -
df.write.format(source).mode("overwrite").save(path)
(Also tried the dynamic overwrite option).
However, in some runs the old data is not being deleted, meaning I see files from an old date together with new files under the same partition.
I suspect it has something to do with runs that broke in the middle due to memory issues and left some corrupted files that the next run did not delete, but I couldn't reproduce it yet.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") - this option keeps your existing partitions and only overwrites the partitions present in the incoming data. If you want to overwrite all existing partitions and keep only the ones from the current write, unset the above configuration (the default is static). (I tested this on Spark 2.4.4; see the sketch below.)
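A minimal sketch of the two modes, with df, path and the partition column as placeholders:
# Dynamic: only the partitions present in df are replaced; other partitions are kept
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.format("parquet").mode("overwrite").partitionBy("date").save(path)

# Static (the default): every existing partition under path is dropped before the write
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "static")
df.write.format("parquet").mode("overwrite").partitionBy("date").save(path)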

Run both Databricks Optimize and Vacuum?

Does it make sense to call BOTH Databricks (Delta) Optimize and Vacuum? It SEEMS like it makes sense but I don't want to just infer what to do. I want to ask.
Vacuum
Recursively vacuum directories associated with the Delta table and remove data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold. Files are deleted according to the time they have been logically removed from Delta’s transaction log + retention hours, not their modification timestamps on the storage system. The default threshold is 7 days.
Optimize
Optimizes the layout of Delta Lake data. Optionally optimize a subset of data or colocate data by column. If you do not specify colocation, bin-packing optimization is performed.
Second question: if the answer is yes, which is the best order of operations?
Optimize then Vacuum
Vacuum then Optimize
Yes, you need to run both commands, at least to clean up the files that were replaced by OPTIMIZE. With default settings, the order shouldn't matter, as VACUUM will delete files only after 7 days. Order will matter only if you run VACUUM with a retention of 0 seconds, but that's not recommended anyway as it will remove the whole history.
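As a sketch of a typical maintenance job (the table name events and the ZORDER column are placeholders; the default 7-day retention is kept):
# Compact small files; optionally colocate by a column, e.g. OPTIMIZE events ZORDER BY (event_time)
spark.sql("OPTIMIZE events")

# Afterwards, remove files that have been out of the transaction log longer than the retention threshold
spark.sql("VACUUM events")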

Configuring TTL on a deltaLake table

I'm looking for a way to add a TTL (time-to-live) to my deltaLake table so that any record in it goes away automatically after a fixed span. I haven't found anything concrete yet; does anyone know if there's a workaround for this?
Unfortunately, there is no configuration called TTL (time-to-live) in Delta Lake tables.
You can remove files that are no longer referenced by a Delta table and that are older than the retention threshold by running the vacuum command on the table. vacuum is not triggered automatically. The default retention threshold for the files is 7 days.
Delta Lake provides snapshot isolation for reads, which means that it is safe to run OPTIMIZE even while other users or jobs are querying the table. Eventually however, you should clean up old snapshots.
You can do this by running the VACUUM command:
VACUUM events
You control the age of the latest retained snapshot by using the RETAIN HOURS option:
VACUUM events RETAIN 24 HOURS
For details on using VACUUM effectively, see Vacuum.
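The same can be done from Python; a minimal sketch against the events table used above:
from delta.tables import DeltaTable

# Equivalent of VACUUM events RETAIN 24 HOURS
# Note: a retention below the default 7 days is rejected unless the
# spark.databricks.delta.retentionDurationCheck.enabled safety check is disabled
DeltaTable.forName(spark, "events").vacuum(24)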

Can you stream and batch from the same delta table?

I tried to stream and batch from the same delta table but ran into the small files problem on the batch side. But if you optimize the delta table, the streaming side will lose track of the files it reads because of the compaction resulting from the optimization.
When the OPTIMIZE command removes small files and adds back in compacted ones, these operations are flagged with the dataChange flag set to false. This flag tells streams that are following the transaction log that it is safe to ignore this transaction to avoid processing duplicate data.
I'll also note that DBR 5.3 contains a private preview feature called Auto-Optimize, which can perform this compaction before small files even make it into the table. This feature will be GA-ed in the next release of DBR.
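A minimal sketch of the interplay described above, with placeholder paths: a stream follows the table while OPTIMIZE rewrites small files in a dataChange = false commit, so the stream does not reprocess the compacted data.
path = "/tmp/delta/events"   # placeholder path

# Streaming side: follows the Delta table's transaction log
stream = (spark.readStream
               .format("delta")
               .load(path)
               .writeStream
               .format("console")
               .option("checkpointLocation", "/tmp/checkpoints/events")   # placeholder
               .start())

# Batch-side maintenance (where OPTIMIZE is available): the resulting commit is
# flagged dataChange = false, so the stream above skips it instead of re-reading files
spark.sql(f"OPTIMIZE delta.`{path}`")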
