I am writing a Python script that creates Cassandra backups and uploads them to external storage. The Cassandra cluster consists of three nodes, and every node has two JBOD disks.
The script runs the nodetool snapshot command; this command starts snapshot creation on the cluster and exits immediately, while the snapshot creation process continues. The script should now wait until snapshot creation has finished and then continue with the upload. The question is: how do I get the snapshot creation status?
nodetool snapshot creates hard links to the SSTables. When the command exits, the snapshot is complete; it is typically very fast.
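In other words, the script can treat nodetool snapshot as synchronous. A minimal sketch in Python, assuming nodetool is on the PATH and that upload_to_external_storage() is a hypothetical helper standing in for your own upload code:

import subprocess

def take_snapshot(tag: str) -> None:
    # nodetool snapshot returns only after the hard links for all SSTables exist,
    # so a successful exit code is the only "completion status" you need to check.
    subprocess.run(["nodetool", "snapshot", "-t", tag], check=True)

take_snapshot("daily-backup")
# At this point the snapshot directories under
# <data_dir>/<keyspace>/<table>/snapshots/daily-backup/ are complete,
# so it is safe to start the upload.
# upload_to_external_storage("daily-backup")  # hypothetical helper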
I have a daily cron job[1] which takes snapshots of Cassandra and uploads them to S3 buckets. After that, the snapshots are deleted.
However, there is also a pipeline job that takes snapshots of Cassandra, which I cannot modify. This job does not delete its snapshots after it is done; it relies on another daily cron job[2] to delete all snapshots (basically a call to nodetool clearsnapshot).
My concern is that the daily cron job[2] might delete my snapshots before my cron job[1] has been able to upload them to the S3 buckets. What happens if my nodetool snapshot and the other job's nodetool clearsnapshot run at the same time? Is there a way to require the daily cron job[2] to run after my cron job[1]?
nodetool snapshot has the ability to tag snapshots. One way to solve this is to agree with the owner of the other process that every snapshot it takes is properly tagged.
Your backup procedure should then look something like this:
nodetool snapshot -t backup
... upload to s3 ...
nodetool clearsnapshot -t backup
The other pipeline can have its own tag:
nodetool snapshot -t pipeline
And the cleanup crontab should use the pipeline's tag:
nodetool clearsnapshot -t pipeline
If there is no way to change the pipeline to include the tag, you can restrict the execution of the cron job so that it verifies no backup process is running (for example, by looking for a PID or lock file) before doing the clearsnapshot.
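A minimal sketch of that guard in Python, assuming the backup job creates a lock file at /var/run/cassandra-backup.lock while it runs (the path is just an illustration both jobs would have to agree on):

import os
import subprocess

LOCK_FILE = "/var/run/cassandra-backup.lock"  # hypothetical path shared with the backup job

def clear_snapshots_if_idle() -> None:
    # Skip the cleanup while the backup/upload job still holds its lock file.
    if os.path.exists(LOCK_FILE):
        print("Backup job still running, skipping clearsnapshot")
        return
    subprocess.run(["nodetool", "clearsnapshot"], check=True)

if __name__ == "__main__":
    clear_snapshots_if_idle()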
I want to know how job deletion works on Databricks. Does it immediately terminate the code execution and the job cluster? If I am using micro-batching, does it make sure the last batch is processed before terminating, or is it an abrupt termination that can cause data loss or corruption? How can I avoid that?
Also, what happens when I delete a job that is running on a cluster?
It will terminate immediately, not gracefully.
Are you using Structured Streaming or true micro-batching? If the former, then a checkpoint location will be enough to restart from the right place again (https://docs.databricks.com/spark/latest/structured-streaming/production.html).
If you have your own batch process, you will need to manually write a checkpoint file to keep track of where you are up to. Given the lack of transactions, I would make sure your pipeline is idempotent, so that if you do restart and repeat a batch there is no impact.
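A minimal sketch of such a manual checkpoint in Python, where CHECKPOINT_PATH, process_batch(), and the batches list are all placeholders for your own job:

import json
import os

CHECKPOINT_PATH = "/tmp/my_job_checkpoint.json"  # hypothetical location; use a DBFS path on Databricks

def read_checkpoint() -> int:
    # Return the last batch id that was fully processed, or -1 on the first run.
    if not os.path.exists(CHECKPOINT_PATH):
        return -1
    with open(CHECKPOINT_PATH) as f:
        return json.load(f)["last_batch_id"]

def write_checkpoint(batch_id: int) -> None:
    # Write atomically so an abrupt termination never leaves a half-written file.
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_batch_id": batch_id}, f)
    os.replace(tmp, CHECKPOINT_PATH)

def process_batch(batch) -> None:
    print("processing", batch)  # stand-in for your own idempotent batch logic

batches = ["batch-0", "batch-1", "batch-2"]  # stand-in for your own micro-batch source
last_done = read_checkpoint()
for batch_id, batch in enumerate(batches):
    if batch_id <= last_done:
        continue                     # already processed before the restart
    process_batch(batch)             # must be idempotent: a repeated run causes no harm
    write_checkpoint(batch_id)       # record the batch only after it succeeded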
I have a new Cassandra node joining an existing cluster. nodetool netstats currently shows the file transfer status at 25%. My question is: do I have to wait for compaction as well, or is the node joining process considered complete when the file transfer reaches 100%?
Once the newly added node shows status UN (Up/Normal), it will accept read and write requests. Compaction will run in the background and has no impact on this; you can also stop compaction temporarily if you want.
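A minimal sketch in Python of waiting for the node to reach UN, assuming nodetool is on the PATH and that the nodetool status output can be matched on its leading "UN" state column (the exact output format can vary slightly between Cassandra versions); NEW_NODE_IP is just an illustration:

import subprocess
import time

NEW_NODE_IP = "10.0.0.4"  # hypothetical address of the joining node

def node_is_up_normal(ip: str) -> bool:
    # Lines for live, fully joined nodes in `nodetool status` start with "UN".
    out = subprocess.run(["nodetool", "status"],
                         capture_output=True, text=True, check=True).stdout
    return any(line.startswith("UN") and ip in line for line in out.splitlines())

while not node_is_up_normal(NEW_NODE_IP):
    time.sleep(60)  # still joining (UJ) or streaming data; check again in a minute
print("Node is UN: it serves reads and writes while compaction continues in the background.")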
I have the following Cassandra question.
A few days ago I developed an application using C# and a single-node Cassandra database. While the application was in production, a power failure occurred and a Cassandra commit log got corrupted. Because of this the Cassandra node would not start, so I moved all commit log files to another directory and started the node.
Recently I noticed that the data from the day of the power failure is not available in the database. I still have all the commit log files, including the corrupted one.
Can you please suggest whether there is a way to recover the data using these commit log files?
Also, how can I avoid commit log corruption, so that data loss in production can be avoided?
Thank you.
There is no way to restore the node to its previous state if your commit logs are corrupted and you have no SSTables.
If your commit logs are healthy (meaning not corrupted), then you just need to restart the node. They will be replayed and, as a result, will rebuild the memtable(s) and flush generation-1 SSTables to disk.
What you can do to protect yourself is to forcibly create SSTables.
You can do that from the apache-cassandra/bin directory with:
nodetool flush
So if you are wary of losing commit logs, you can rebuild the node to its previous state from the SSTables created above using:
nodetool.bat refresh [keyspace] [columnfamily]
Alternatively, you can also create snapshots:
nodetool snapshot
This command takes a snapshot of all keyspaces on the node. You also have the option of enabling incremental backups, but those only record data written since the last snapshot.
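A minimal sketch in Python of that preventive routine (flush memtables to SSTables, then snapshot), which you could run from cron; the keyspace name my_keyspace and the tag are just illustrations:

import subprocess

KEYSPACE = "my_keyspace"  # hypothetical keyspace name

def flush_and_snapshot(tag: str) -> None:
    # Flush memtables so the data no longer lives only in the commit log ...
    subprocess.run(["nodetool", "flush", KEYSPACE], check=True)
    # ... then hard-link the resulting SSTables into a named snapshot.
    subprocess.run(["nodetool", "snapshot", "-t", tag, KEYSPACE], check=True)

flush_and_snapshot("pre-maintenance")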
For more information, see
https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsNodetool_r.html
I also suggest adding more nodes and increasing the replication factor to avoid such scenarios in the future.
Hope it helps!
As per the documentation at http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/operations/ops_backup_incremental_t.html:
As with snapshots, Cassandra does not automatically clear incremental backup files. DataStax recommends setting up a process to clear incremental backup hard-links each time a new snapshot is created.
So, is it safe to trigger deletion of all the files in the backups directory immediately after invoking the snapshot?
How can I check whether the snapshot was not only invoked successfully, but also completed successfully?
What if I end up deleting a backup hard link that was created just after invoking the snapshot, but before the moment I triggered the deletion of the backup files?