Restoring a snapshot containing expired data

Restoring a snapshot containing expired data - cassandra

I have a snapshot of a production system that I want to restore into a test system to recreate a problem, easy so far. One of the tables contains data with a TTL that has expired now because the snapshot is a few weeks old.
The restore works fine using sstableloader, destination cluster has a different topology. After the restore cfstats show I have key entries in the table.
As soon as I query the table the rows are expired as the TTL is in the past and I get 0 results.
Is there any way I can get the TTL reset as part of the restore? I'm using Cassandra 3 so the TTLRemover tool is not usable.

Related

Configuring TTL on a deltaLake table

I'm looking for a way to add ttl(time-to-live) to my deltaLake table so that any record in it goes away automatically after a fixed span, I haven't found anything concrete of yet, any one knows if there's a workaround with this?

Unfortunately, there is no configuration called TTL (time-to-live) in Delta Lake tables.
You can remove files no longer referenced by a Delta table and are older than the retention threshold by running the vacuum command on the table. vacuum is not triggered automatically. The default retention threshold for the files is 7 days.
Delta Lake provides snapshot isolation for reads, which means that it is safe to run OPTIMIZE even while other users or jobs are querying the table. Eventually however, you should clean up old snapshots.
You can do this by running the VACUUM command:
VACUUM events
You control the age of the latest retained snapshot by using the RETAIN HOURS option:
VACUUM events RETAIN 24 HOURS
For details on using VACUUM effectively, see Vacuum.

Cassandra query table without partition key

I am trying to extract data from a table as part of a migration job.
The schema is as follows:
CREATE TABLE IF NOT EXISTS ${keyspace}.entries (
username text,
entry_type int,
entry_id text,
PRIMARY KEY ((username, entry_type), entry_id)
);
In order to query the table we need the partition keys, the first part of the primary key.
Hence, if we know the username and the entry_type, we can query the table.
In this case the username can be whatever, but the entry_type is an integer in the range 0-9.
When doning the extraction we iterate the table 10 times for every username to make sure we try all versions of entry_type.
We can no longer find any entries as we have depleted our list of usernames. But our nodetool tablestats report that there is still data left in the table, gigabytes even. Hence we assume the table is not empty.
But I cannot find a way to inspect the table to figure out what usernames remains in the table. If I could inspect it I could add the usernames left in the table to our extraction job and eventually we could deplete the table. But I cannot simply query the table as such:
SELECT * FROM ${keyspace}.entries LIMIT 1
as cassandra requires the partition keys to make meaningful queries.
What can I do to figure out what is left in our table?

As per the comment, the migration process includes a DELETE operation from the Cassandra table, but the engine will have a delay before actually removing from disk the affected records; this process is controlled internally with tombstones and the gc_grace_seconds attribute of the table. The reason for this delay is fully explained in this blog entry, for a tl dr, if the default value is still in place, Cassandra will need to pass at least 10 days (864,000 seconds) from the execution of the delete before the actual removal of the data.
For your case, one way to proceed is:
Ensure that all your nodes are "Up" and "Healthy" (UN)
Decrease the gc_grace_seconds attribute of your table, in the example, it will set it to 1 minute, while the default is
ALTER TABLE .entries with GC_GRACE_SECONDS = 60;
Manually compact the table:
nodetool compact entries
Once that the process is completed, nodetool tablestats should be up to date

To answer your first question, I would like to put more light on gc_grace_seconds property.
In Cassandra, data isn’t deleted in the same way it is in RDBMSs. Cassandra is designed for high write throughput, and avoids reads-before-writes. So in Cassandra, a delete is actually an update, and updates are actually inserts. A “tombstone” marker is written to indicate that the data is now (logically) deleted (also known as soft delete). Records marked tombstoned must be removed to claim back the storage space. Which is done by a process called Compaction. But remember that tombstones are eligible for physical deletion / garbage collection only after a specific number of seconds known as gc_grace_seconds. This is a very good blog to read more in detail : https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
Now possibly you are looking into table size before gc_grace_seconds and data is still there.
Coming to your second issue where you want to fetch some samples from the table without providing partition keys. You can analyze your table content using Spark. The Spark Cassandra Connector allows you to create Java applications that use Spark to analyze database data. You can follow the articles / documentation to write a quick handy spark application to analyze Cassandra data.
https://www.instaclustr.com/support/documentation/cassandra-add-ons/apache-spark/using-spark-to-sample-data-from-one-cassandra-cluster-and-write-to-another/
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkJavaApi.html
I would recommend not to delete records while you do the migration. Rather first complete the migration and post that do a quick validation / verification to ensure all records are migrated successfully (this use can easily do using Spark buy comparing dataframes from old and new tables). Post successful verification truncate the old table as truncate does not create tombstones and hence more efficient. Note that huge no of tombstone is not good for cluster health.

Cassandra data directory does not get updated with deletion

Currently, I am bench marking Cassandra database using YCSB framework. During this time I have performed (batch) insertion and deletion of the data quite regularly.
I am using Truncate command to delete keyspace rows. However, I am noticing that my Cassandra data directory swells up as the experiments.
I have checked and can confirm that even there is no data in the keystore when I checked the size of data directory. Is there a way to initialize a process so that Cassandra automatically release the stored space, or does it happen over time.

When you use Truncate cassandra will create snapshots of your data.
To disable it you will have to set auto_snapshot: false in cassandra.yaml file.
If you are using Delete, then cassandra use tombstone,i.e your data will not get deleted immediately. Data will get deleted once compaction is ran.
To remove previous snapshots one can use nodetool snapshot command.

How to store data in cassandra in a temporary location while sync is in progress?

We have a mysql server running which is serving application writes. To do some batch processing we have written a sync job to migrate data into cassandra cluster.
1. A daily sync job which transfers by updated timestamp for that day.
2. A complete sync job which transfers complete data, overriding existing ones.
Now there may be a possibility that the row was deleted from mysql, in that case using the above approach it will lie forever in cassandra.
To solve that problem we have given a TTL of 15 days for every row. So eventually it will get deleted, if it was not deleted then in next full sync the TTL will be over written again.
Its working fine as far as the use case is concerned but the issue is that in full sync complete data is over written and sstable is generated continuously with compactions happenning all the time, load averages shoot up with slowness and backup size increases (which could have been avoided).
Essentially we would want to replace the existing table data by new data but we dont want to truncate before starting the job but only after job completes.
Is there any way by which this can be solved other than creating a new table altogether and dropping past table when data is generated?

You can look at the double-run migration strategy I presented here: http://www.slideshare.net/doanduyhai/from-rdbms-to-cassandra-without-a-hitch
It has the advantage of allowing 100% uptime and possible rollback if things go wrong. The downside is the amount of work required in term of releases & codes

Cassandra get backup of expired data

I have a typically very huge amount of data on one of my Cassandra table. I want to keep only last two months of data in my table for that i used TTL of two month on every data. But now i want to keep expired data as a backup for later use case. Please suggest me what should i do to take backup?

Data in Cassandra is stored as files on the disk. You could just copy those files off your production machines onto whatever storage medium you would like to restore them later. You can follow the link below to see how you would do this:
https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_restore_c.html

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string