Cassandra - What precautions should I take while deleting the backup files?

As per the documentation at http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/operations/ops_backup_incremental_t.html:
As with snapshots, Cassandra does not automatically clear incremental backup files. DataStax recommends setting up a process to clear incremental backup hard-links each time a new snapshot is created.
So is it safe to trigger deletion of all the files in the backups directory immediately after invoking a snapshot?
How can I check whether the snapshot was not only invoked successfully, but also completed successfully?
What if I end up deleting a backup hard-link which was created "just after" invoking the snapshot, but before the moment I triggered the deletion of the backup files?
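For reference, a minimal sketch of the process the documentation recommends could look like the following; the keyspace name, the snapshot tag and the default /var/lib/cassandra/data path are placeholders, and nodetool snapshot normally returns only after the snapshot hard links have been created:
# take the snapshot first; clear the incremental backups only after the command has returned
nodetool snapshot -t daily_backup my_keyspace
# remove the incremental backup hard-links for that keyspace
find /var/lib/cassandra/data/my_keyspace/*/backups -type f -delete 2>/dev/null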

Related

How to do a global commit for a Spark batch job on ADLS Gen2?

I have a Spark batch application writing to ADLS Gen2 (hierarchical namespace).
When designing the application I was sure Spark would perform a global commit once the job is committed, but what it really does is commit on each task, meaning once a task completes writing, it moves its output from the temp location to the target storage.
So if the batch fails we have partial data, and on retry we get data duplication. Our scale is really huge, so rolling back (deleting the data) is not an option for us; the search would take a lot of time.
Is there any "built-in" solution, something we can use out of the box?
Right now we are considering writing to some temp destination and moving the files only after the whole job has completed, but we would like to find a more elegant solution (if one exists).
This is a known issue. Apache Iceberg, Hudi, and Delta Lake are among the possible solutions.
Alternatively, instead of writing the output directly to the "official" location, write it to a staging directory. Once the job is done, rename the staging directory to the official location.
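A rough sketch of that staging approach, assuming a hypothetical job script, made-up container/account/path names, and that the hadoop-azure (abfs) driver is set up so directory renames work against the hierarchical namespace:
# (sketch) inside the Spark job, write to a staging path, e.g.
#   df.write.parquet("abfss://container@account.dfs.core.windows.net/staging/run_42")
# then publish the output only if the whole job succeeded:
spark-submit my_batch_job.py && \
  hadoop fs -mv abfss://container@account.dfs.core.windows.net/staging/run_42 \
               abfss://container@account.dfs.core.windows.net/official/run_42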

Cassandra: concurrent snapshot and clear snapshot?

I have a daily cron job [1] which takes snapshots of Cassandra and uploads them to S3 buckets. After doing that, the snapshots are deleted.
However, there is also a pipeline job that takes snapshots of Cassandra, which I cannot modify. This job does not delete snapshots after it's done, and it relies on another daily cron job [2] to delete all snapshots (basically calling nodetool clearsnapshot).
My concern now is that the daily cron job [2] might delete my snapshots, and thus my cron job [1] will not be able to upload them to S3. What will happen if my nodetool snapshot and the other job's nodetool clearsnapshot happen at the same time? Is there a way to require the daily cron job [2] to happen after my cron job [1]?
nodetool snapshot has the functionality to tag the snapshots. One way to solve this is to come to an agreement with the owner of the other process so that every time a snapshot is taken, it is properly tagged.
Your backup procedure should be something similar to:
nodetool snapshot -t backup
... upload to s3 ...
nodetool clearsnapshot -t backup
The other pipeline can have its own tag:
nodetool snapshot -t pipeline
And the crontab should include the pipeline's tag:
nodetool clearsnapshot -t pipeline
If there is no chance to change the pipeline to include the tag, you can restrict the execution of the cron job so that it verifies no backup process is running (for example, by looking for a PID or lock file) before doing the clearsnapshot.
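As a hedged sketch, one way to enforce that ordering is to have both cron entries share a lock file via flock; the lock path and the upload_to_s3.sh script here are placeholders:
# cron job [1]: snapshot, upload, clean up - all while holding a shared lock
flock /var/lock/cassandra-backup.lock -c 'nodetool snapshot -t backup && upload_to_s3.sh && nodetool clearsnapshot -t backup'
# cron job [2]: waits for the lock, so it cannot clear snapshots while the backup is still running
flock /var/lock/cassandra-backup.lock -c 'nodetool clearsnapshot'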

Not able to restore cassandra data from snapshot

We have regular backups of our cluster, and we store the schema and snapshot backups on AWS S3 on a daily basis.
Somehow we have lost all the data. While recovering the data from backup we are able to recover the schema, but after copying the snapshot files to the /var/lib/cassandra/data directory, the data is not showing up in the tables.
After copying the data we have run nodetool refresh -- keyspace table, but still nothing works.
Could you please help with this?
I'm new to Apache Cassandra, but my first focus on this topic was backups.
If you want to restore from a snapshot (on a new node/cluster) you have to shut down Cassandra on every node and clear any existing data from these folders:
/var/lib/cassandra/data -> if you want to keep your system keyspaces, delete only your user keyspace folders
/var/lib/cassandra/commitlog
/var/lib/cassandra/hints
/var/lib/cassandra/saved_caches
After this, you have to start Cassandra again (the whole cluster). Create the keyspace and the table you want to restore. In your snapshot folder you will find a schema.cql script for the creation of the table.
After creating the keyspaces and tables again, wait a moment (the time depends on the number of nodes in your cluster and the keyspaces you want to restore).
Shut down the Cassandra cluster again.
Copy the files from the snapshot folder to the new folders of the tables you want to restore. Do this on ALL NODES!
After copying the files, start the nodes one by one.
Once all nodes are running, run the nodetool repair command.
If you check the data via cqlsh, keep the CONSISTENCY LEVEL in mind (ALL/QUORUM).
That's the way which works very well on my Cassandra cluster.
The general steps to follow for restoring a snapshot are:
1. Shut down Cassandra if it is still running.
2. Clear any existing data in the commitlog, data and saved caches directories.
3. Copy the snapshots to the relevant data directories.
4. Copy incremental backups to the data directory (if incremental backups are enabled). If required, set the restore_point_in_time parameter in commitlog_archiving.properties to the restore point.
5. Start Cassandra.
6. Run repair.
So try running repair after copying the data.
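Put together, a rough sketch of that sequence on a single node could look like this; the keyspace, table, backup path, snapshot name and service commands are placeholders (the table directory name includes a generated suffix, hence the globs), and step 4 is skipped if incremental backups are not enabled:
# 1-2: stop the node and clear the old data for the table being restored
sudo service cassandra stop
rm -rf /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
rm -rf /var/lib/cassandra/data/my_keyspace/my_table-*/*
# 3: copy the snapshot SSTables back into the table's data directory
cp /backups/my_keyspace/my_table/my_snapshot/* /var/lib/cassandra/data/my_keyspace/my_table-*/
# 5-6: start the node again, and run a repair once the whole cluster is back up
sudo service cassandra start
nodetool repair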

If auto_snapshot is enabled in cassandra.yaml, when will these snapshots be deleted?

If we set auto_snapshot: true in cassandra.yaml and we drop some table, a snapshot for that particular table will be created, right? So when will these snapshots be deleted? Do we need to delete them manually by running scripts? Or is there a setting I can enable to auto-delete them after some time?
So when will these snapshots be deleted?
Automatically? Never.
Do we need to delete them manually by running scripts?
Yes. This can be a long-term problem, so it is a good idea to have a script running to handle this. In fact, the DataStax docs have a recommendation on this:
When taking a snapshot, previous snapshot files are not automatically deleted. You should remove old snapshots that are no longer needed.
The nodetool clearsnapshot command removes all existing snapshot files from the snapshot directory of each keyspace. You should make it part of your back-up process to clear old snapshots before taking a new one.
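For example, a hypothetical cron entry that prunes snapshot directories older than 30 days under the default data path; adjust the path and the retention period for your setup:
# weekly cron entry: remove snapshot directories older than 30 days
0 3 * * 0  find /var/lib/cassandra/data/*/*/snapshots/* -maxdepth 0 -type d -mtime +30 -exec rm -rf {} +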

Backups folder in Opscenter keyspace growing really huge

We have a 10-node Cassandra cluster. We configured a repair in OpsCenter. We find there is a backups folder created for every table in the OpsCenter keyspace. It keeps growing huge. Is there a solution to this, or do we have to manually delete the data in each backups folder?
First off, backups are different from snapshots - you can take a look at the backup documentation for OpsCenter to learn more.
Incremental backups:
From the DataStax docs -
When incremental backups are enabled (disabled by default), Cassandra hard-links each flushed SSTable to a backups directory under the keyspace data directory. This allows storing backups offsite without transferring entire snapshots. Also, incremental backups combine with snapshots to provide a dependable, up-to-date backup mechanism.
...
As with snapshots, Cassandra does not automatically clear incremental backup files. DataStax recommends setting up a process to clear incremental backup hard-links each time a new snapshot is created.
You must have turned on incremental backups by setting incremental_backups to true in cassandra.yaml.
If you are interested in a backup strategy, I recommend you use the OpsCenter Backup Service instead. That way, you're able to control granularly which keyspace you want to back up and push your files to S3.
Snapshots
Snapshots are hard links to old (no longer used) SSTables. Snapshots protect you from yourself. For example, if you accidentally truncate the wrong keyspace, you'll still have a snapshot of that table that you can bring back. When you have too many snapshots, there are a couple of things you can do:
Don't run sync repairs
This is related to repairs because synchronous repairs generate a snapshot each time they run. To avoid this, you should run parallel repairs instead (the -par flag, or by setting the number of repairs in the OpsCenter config file; see the note below).
Clear your snapshots
If you have too many snapshots and need to free up space (maybe once you have backed them up to S3 or Glacier or something), go ahead and use nodetool clearsnapshot to delete them. This will free up space. You can also go in and remove them manually from your file system, but nodetool clearsnapshot removes the risk of rm -rf'ing the wrong thing.
Note: You may also be running repairs too fast if you don't have a ton of data (check my response to this other SO question for an explanation and the repair service config levers).
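As a quick, hedged way to see what is actually taking the space, and to clear the incremental hard-links once they have been shipped off the node (default data path assumed):
# how much space are the incremental backups and snapshots using?
du -sh /var/lib/cassandra/data/*/*/backups /var/lib/cassandra/data/*/*/snapshots 2>/dev/null
# once a backup run has copied them elsewhere, the hard-links in backups/ can be deleted
find /var/lib/cassandra/data/OpsCenter/*/backups -type f -delete 2>/dev/null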