I'm looking for a way to add a TTL (time-to-live) to my Delta Lake table so that any record in it goes away automatically after a fixed span. I haven't found anything concrete yet; does anyone know if there's a workaround for this?
Unfortunately, there is no TTL (time-to-live) configuration for Delta Lake tables.
You can remove files that are no longer referenced by a Delta table and are older than the retention threshold by running the VACUUM command on the table. VACUUM is not triggered automatically. The default retention threshold for the files is 7 days.
Delta Lake provides snapshot isolation for reads, which means that it is safe to run OPTIMIZE even while other users or jobs are querying the table. Eventually, however, you should clean up old snapshots.
You can do this by running the VACUUM command:
VACUUM events
You control the age of the latest retained snapshot by using the RETAIN num HOURS option:
VACUUM events RETAIN 24 HOURS
For details on using VACUUM effectively, see Vacuum.
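If you want a longer retention window than a per-command RETAIN clause gives you, a minimal sketch (the table name events is a placeholder) is to set it as a table property via delta.deletedFileRetentionDuration:
-- Keep logically removed data files for 14 days instead of the 7-day default
ALTER TABLE events SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 14 days')
-- Subsequent VACUUM runs honour the new threshold
VACUUM events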
When we run the VACUUM command, does it go through each Parquet file and remove older versions of each record, or does it retain a Parquet file even if only one record in it has the latest version? What about compaction? Is this any different?
Vacuum and Compaction go through the _delta_log/ folder in your Delta Lake table and identify the files that are still being referenced.
Vacuum deletes all unreferenced files.
Compaction reads in the referenced files and writes your new partitions back to the table, unreferencing the existing files.
Think of a single version of a Delta Lake table as a set of Parquet data files. Every version adds an entry (about files added and removed) to the transaction log (under the _delta_log directory).
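You can inspect those per-version entries yourself; a quick sketch, assuming a table named events:
-- Lists every version recorded in _delta_log, with its operation and timestamp
DESCRIBE HISTORY events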
VACUUM
VACUUM lets you define how many hours to retain (using the RETAIN num HOURS clause). That gives Delta Lake the versions to delete (everything older than num HOURS). These versions are "translated" into a series of Parquet files (remember that a single Parquet file can stay referenced across several consecutive versions until it is finally removed).
This translation gives the files to be deleted.
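A minimal sketch of that flow (the table name events is assumed; note that Delta Lake rejects a retention below the 7-day default unless spark.databricks.delta.retentionDurationCheck.enabled is set to false):
-- Preview the files that fall outside the retention window, without deleting anything
VACUUM events RETAIN 168 HOURS DRY RUN
-- Actually delete those files
VACUUM events RETAIN 168 HOURS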
Compaction
Compaction is basically an optimization (usually triggered by the OPTIMIZE command, or by a combination of repartition, the dataChange option disabled, and an overwrite).
This is nothing other than another version of a Delta table (but this time the data is not changed, so other concurrent transactions can still commit happily).
The explanation about VACUUM above applies here.
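For illustration, a typical compaction run in SQL (the table name events and the column eventType are placeholders):
-- Bin-packing compaction: rewrites many small files into fewer large ones as a new version
OPTIMIZE events
-- Optionally colocate related records to speed up reads filtered on that column
OPTIMIZE events ZORDER BY (eventType)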
We deleted some old data within our 3-node Cassandra cluster (v3.11) some days ago, which shall now be restored from a snapshot. Is there a possibility to restore the data from the snapshot without losing updates made since the snapshot was taken?
There are two approaches which come to my mind:
A)
Create an export via COPY keyspace.table TO 'xy.csv'
Truncate the table
Restore the table from the snapshot via sstableloader
Reimport the newer data via COPY keyspace.table FROM 'xy.csv'
B)
Just copy the SSTable files of the snapshot into the current table directory
Is A) a feasible option? What do we need to consider so that the COPY TO/FROM commands are synchronized across all nodes?
For option B) I read that the deletions that happened may be applied again (tombstone rows). Can I ignore this warning if we make sure the deletions happened more than 10 days ago (gc_grace_seconds)?
For exporting/importing data from Apache Cassandra®, there is an efficient tool: the DataStax Bulk Loader (aka DSBulk). You can refer to more documentation and examples here. For getting consistent reads and writes, you can pass --datastax-java-driver.basic.request.consistency LOCAL_QUORUM to your unload and load commands.
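A rough sketch of such a round trip (the keyspace ks, table tbl, and export directory are placeholders):
# Export with read consistency LOCAL_QUORUM
dsbulk unload -k ks -t tbl -url ./export --datastax-java-driver.basic.request.consistency LOCAL_QUORUM
# Re-import with write consistency LOCAL_QUORUM
dsbulk load -k ks -t tbl -url ./export --datastax-java-driver.basic.request.consistency LOCAL_QUORUM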
Does it make sense to call BOTH Databricks (Delta) OPTIMIZE and VACUUM? It SEEMS like it makes sense, but I don't want to just infer what to do. I want to ask.
Vacuum
Recursively vacuum directories associated with the Delta table and remove data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold. Files are deleted according to the time they have been logically removed from Delta’s transaction log + retention hours, not their modification timestamps on the storage system. The default threshold is 7 days.
Optimize
Optimizes the layout of Delta Lake data. Optionally optimize a subset of data or colocate data by column. If you do not specify colocation, bin-packing optimization is performed.
Second question: if the answer is yes, what is the best order of operations?
Optimize then Vacuum
Vacuum then Optimize
Yes, you need to run both commands, at least to clean up the files that were compacted by OPTIMIZE. With default settings the order shouldn't matter, as VACUUM will delete files only after 7 days. The order will matter only if you run VACUUM with a retention of 0 seconds, but that isn't recommended anyway, as it will remove the whole history.
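So a typical maintenance job (a sketch; the table name events is assumed) runs them in that order:
-- Compact first: writes new files and logically removes the old ones
OPTIMIZE events
-- Clean up afterwards: physically deletes files removed more than 7 days ago
VACUUM events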
I have a snapshot of a production system that I want to restore into a test system to recreate a problem; easy so far. One of the tables contains data with a TTL that has expired by now because the snapshot is a few weeks old.
The restore works fine using sstableloader; the destination cluster has a different topology. After the restore, cfstats shows I have key entries in the table.
As soon as I query the table, the rows are expired (the TTL is in the past) and I get 0 results.
Is there any way I can get the TTL reset as part of the restore? I'm using Cassandra 3, so the TTLRemover tool is not usable.
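For anyone diagnosing the same symptom, CQL can show how much TTL is left per cell; a sketch with placeholder names (ks.tbl, id, payload):
-- Remaining TTL in seconds; expired cells simply no longer appear in results
SELECT id, payload, TTL(payload) FROM ks.tbl;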
We have a MySQL server running which serves application writes. To do some batch processing, we have written a sync job to migrate data into a Cassandra cluster:
1. A daily sync job which transfers rows by updated timestamp for that day.
2. A complete sync job which transfers the complete data, overriding existing rows.
Now there may be a possibility that a row was deleted from MySQL; in that case, using the above approach, it will lie forever in Cassandra.
To solve that problem we have given a TTL of 15 days to every row. So eventually it will get deleted; if it was not deleted, then in the next full sync the TTL will be overwritten again.
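For reference, that per-row TTL looks like this in CQL (the table ks.tbl and its columns are placeholders; 15 days is 1,296,000 seconds):
-- Each full-sync upsert resets the row's TTL to 15 days
INSERT INTO ks.tbl (id, payload) VALUES (42, 'row-from-mysql') USING TTL 1296000;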
It's working fine as far as the use case is concerned, but the issue is that in a full sync the complete data set is overwritten, SSTables are generated continuously with compactions happening all the time, load averages shoot up with slowness, and the backup size increases (which could have been avoided).
Essentially we would want to replace the existing table data with the new data, but we don't want to truncate before starting the job, only after the job completes.
Is there any way this can be solved other than creating a new table altogether and dropping the old table once the data is generated?
You can look at the double-run migration strategy I presented here: http://www.slideshare.net/doanduyhai/from-rdbms-to-cassandra-without-a-hitch
It has the advantage of allowing 100% uptime and a possible rollback if things go wrong. The downside is the amount of work required in terms of releases and code.