How to modify the memtable flush time interval in Cassandra?

I have enabled incremental backups in the cassandra.yaml file. As far as I know, when incremental backups are enabled, Cassandra backs up the data (into the backups directory) only when the memtable is flushed. But what if the memtable has not been flushed yet? I won't be able to get the incremental backup, right? I know that certain conditions must be met for the memtable to be flushed, such as a time interval or memtable space. My question is: how do I modify this so that even if I enter one record after the last snapshot, I can still back up the entire data set along with that latest entry?
Consider this example
Take the snapshot.
Clear incremental backup (backups directory)
Enter a record to a table.
Check for the incremental backup in backups directory. It is still empty.
Now how do I back up the record that was written after the last snapshot? In general, how do we back up the entire up-to-date data without taking a snapshot?

You can flush the files manually with nodetool flush just before taking the backup. That way you'll always have the latest memtable flushed.
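The flush-then-copy routine can be sketched in Python roughly as follows. This is only an illustration: it assumes nodetool is on the PATH and that incremental backups live under <data_dir>/<keyspace>/<table>/backups. The function names and the data_dir and dest arguments are hypothetical and not part of Cassandra.

```python
import shutil
import subprocess
from pathlib import Path


def flush_command(keyspace):
    """Build the nodetool invocation that flushes a keyspace's memtables."""
    return ["nodetool", "flush", keyspace]


def collect_incremental_backups(data_dir, keyspace, dest):
    """Copy every file in <data_dir>/<keyspace>/<table>/backups into dest.

    The directory layout is an assumption; adjust it to your installation.
    """
    copied = []
    for backup_dir in sorted((Path(data_dir) / keyspace).glob("*/backups")):
        # Keep one destination folder per table directory.
        table_dest = Path(dest) / keyspace / backup_dir.parent.name
        table_dest.mkdir(parents=True, exist_ok=True)
        for f in sorted(backup_dir.iterdir()):
            if f.is_file():
                copied.append(Path(shutil.copy2(f, table_dest)))
    return copied


def flush_and_backup(data_dir, keyspace, dest):
    """Flush first, so the latest writes reach disk, then copy the backups."""
    subprocess.run(flush_command(keyspace), check=True)
    return collect_incremental_backups(data_dir, keyspace, dest)
```

Calling flush_and_backup on each node just before the backup window would pick up records that were still sitting in the memtable.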

If you want to back up a cluster without taking a snapshot, you can do it by simply saving everything under the /data folder from every node (this mainly includes the .db files, stats files, etc.).
To avoid overwriting files, you should store each node's data together with its token information.
When you want to restore from this backup, you should spin up a cluster with the same number of nodes and simply copy the data, one-to-one, from each backed-up node to a restored node. Note that you'll have to modify cassandra.yaml on each restored node to include the relevant token (as well as the peers/seeds/etc.).
After all the data is copied, you can start C* process on all the nodes.

Related

cassandra: restoring partially lost data

Theoretical question:
Let's say I have a Cassandra cluster with some data in it.
Backups are created on a daily basis.
Now a subset of the data is lost, either through an application error or a manual deletion.
What is the best way to restore the data from an existing backup?
I can think of starting a separate node with the backup disk attached, then exporting the data manually through selects and reimporting it into the prod database.
That would work but sounds complicated. Is there a more straightforward solution for such problems?
If it's a single partition, your best bet is probably to use sstabledump or something like sstable-tools to read it from the backup and manually reinsert it. If you are OK with restoring everything deleted since the time of the snapshot: reduce gc_grace_seconds and force a compaction to purge any tombstones (otherwise they will continue to shadow the restored data), then use sstableloader or, if the token ranges are the same, copy the backed-up sstables back into the data directory.
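The second option can be laid out as an ordered list of commands. The sketch below only builds the command strings and does not run anything. The function name, host, and backup path are hypothetical, and 864000 is assumed to be the table's original gc_grace_seconds (it is the Cassandra default). Note that the compaction has to run on every replica, and that lowering gc_grace_seconds carries its own risks if a node is down at the time.

```python
DEFAULT_GC_GRACE_SECONDS = 864000  # Cassandra's default, 10 days


def restore_deleted_plan(keyspace, table, host, backup_table_dir,
                         gc_grace=DEFAULT_GC_GRACE_SECONDS):
    """Return the shell commands, in order, for purging tombstones and reloading."""
    alter = 'cqlsh {host} -e "ALTER TABLE {ks}.{tbl} WITH gc_grace_seconds = {n};"'
    return [
        # 1. Make existing tombstones immediately eligible for purging.
        alter.format(host=host, ks=keyspace, tbl=table, n=0),
        # 2. Force a major compaction so the tombstones are dropped.
        f"nodetool compact {keyspace} {table}",
        # 3. Stream the backed-up sstables into the cluster.
        f"sstableloader -d {host} {backup_table_dir}",
        # 4. Put gc_grace_seconds back to its previous value.
        alter.format(host=host, ks=keyspace, tbl=table, n=gc_grace),
    ]


if __name__ == "__main__":
    for step in restore_deleted_plan("shop", "orders", "10.0.0.1",
                                     "/backups/shop/orders"):
        print(step)
```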

Cassandra - recovery of data after accidental delete

As the data in case of Cassandra is physically removed during compaction, is it possible to access the recently deleted data in any way? I'm looking for something similar to Oracle Flashback feature (AS OF TIMESTAMP).
Also, I can see the pieces of deleted data in the relevant commit log file, however it's obviously unreadable. Is it possible to convert this file to a more readable format?
You will want to execute a restore from your commitlog.
The safest approach is to copy the commitlog to a new cluster (with the same schema) and restore it by following the instructions (comments) in the commitlog_archiving.properties file. In your case, you will want to set restore_point_in_time to a time between your insert and your delete.
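As a small illustration of picking that value, the Python sketch below computes the midpoint between the insert and the delete and formats it. It assumes the yyyy:MM:dd HH:mm:ss format in GMT that the comments of the stock commitlog_archiving.properties describe; check the comments in your own version of the file. The function name is hypothetical.

```python
from datetime import datetime, timezone


def restore_point_between(insert_time, delete_time):
    """Return a restore_point_in_time value halfway between two aware datetimes."""
    if not insert_time < delete_time:
        raise ValueError("insert_time must be earlier than delete_time")
    midpoint = insert_time + (delete_time - insert_time) / 2
    # Assumed format: yyyy:MM:dd HH:mm:ss, expressed in GMT/UTC.
    return midpoint.astimezone(timezone.utc).strftime("%Y:%m:%d %H:%M:%S")


if __name__ == "__main__":
    inserted = datetime(2020, 1, 1, 10, 0, 0, tzinfo=timezone.utc)
    deleted = datetime(2020, 1, 1, 10, 10, 0, tzinfo=timezone.utc)
    print("restore_point_in_time=" + restore_point_between(inserted, deleted))
```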

Cassandra data directory does not get updated with deletion

Currently, I am benchmarking the Cassandra database using the YCSB framework. During this time I have performed (batch) insertions and deletions of data quite regularly.
I am using the Truncate command to delete keyspace rows. However, I am noticing that my Cassandra data directory swells up as the experiments progress.
I have checked the size of the data directory and can confirm that it stays large even when there is no data in the keyspace. Is there a way to trigger a process so that Cassandra automatically releases the stored space, or does that happen over time?
When you use Truncate, Cassandra creates a snapshot of your data.
To disable this, set auto_snapshot: false in the cassandra.yaml file.
If you are using Delete, Cassandra uses tombstones, i.e. your data will not be deleted immediately. The data is removed once compaction has run.
To remove previous snapshots, use the nodetool clearsnapshot command.
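To see how much of the data directory is held by snapshots before clearing them, something like the following Python sketch can help. It assumes snapshots are stored under <data_dir>/<keyspace>/<table>/snapshots/<snapshot_name>; the function name is hypothetical, and nodetool listsnapshots should give comparable information on versions that have it.

```python
from collections import defaultdict
from pathlib import Path


def snapshot_usage(data_dir):
    """Return {snapshot_name: total file size in bytes} across all tables."""
    usage = defaultdict(int)
    # Assumed layout: <data_dir>/<keyspace>/<table>/snapshots/<snapshot_name>/
    for snap_dir in Path(data_dir).glob("*/*/snapshots/*"):
        if snap_dir.is_dir():
            usage[snap_dir.name] += sum(
                f.stat().st_size for f in snap_dir.rglob("*") if f.is_file()
            )
    return dict(usage)
```

The sizes are apparent file sizes. After a truncate the original data files are gone, so the snapshot files are the only remaining copies and these numbers reflect real disk usage.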

Proper cassandra keyspace restore procedure

I am looking for confirmation that my Cassandra backup and restore procedures are sound and I am not missing anything. Can you please confirm, or tell me if something is incorrect/missing?
Backups:
I run daily full backups of the keyspaces I care about, via "nodetool snapshot keyspace_name -t current_timestamp". After the snapshot has been taken, I copy the data to a mounted disk, dedicated to backups, then do a "nodetool clearsnapshot $keyspace_name -t $current_timestamp"
I also run hourly incremental backups - executing a "nodetool flush keyspace_name" and then moving files from the backup directory of each keyspace, into the backup mountpoint
Restore:
So far, the only valid way I have found to do a restore (and tested/confirmed) is to do this, on ALL Cassandra nodes in the cluster:
Stop Cassandra
Clear the commitlog *.log files
Clear the *.db files from the table I want to restore
Copy the snapshot/full backup files into that directory
Copy any incremental files I need to (I have not tested with multiple incrementals, but I am assuming I will have to overlay the files, in sequence from oldest to newest)
Start Cassandra
On one of the nodes, run a "nodetool repair keyspace_name"
So my questions are:
Does the above backup and restore strategy seem valid? Are any steps inaccurate or anything missing?
Is there a way to do this without stopping Cassandra on EVERY node? For example, is there a way to restore the data on ONE node, then somehow make it "authoritative"? I tried this, and, as expected, since the restored data is older, the data on the other nodes (which is newer) overwrites it when they sync up during repair.
Thank you!
There are two ways to restore Cassandra backups without restarting C*:
Copy the files into place, then run "nodetool refresh". This has the caveat that the rows will still be older than tombstones. So if you're trying to restore deleted data, it won't do what you want. It also only applies to the local server (you'll want to repair after)
Use "sstableloader". This will load data to all nodes. You'll need to make sure you have the sstables from a complete replica, which may mean loading the sstables from multiple nodes. Added bonus, this works even if the cluster size has changed. I'm not sure if ordering matters here (that is, I don't know if row timestamps are preserved through the load or if they're redefined during load)
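The two invocations can be sketched as small Python command builders. The helper names and example hosts are hypothetical. The path check reflects the requirement that sstableloader is pointed at a directory whose last two components are the keyspace and table names.

```python
from pathlib import PurePosixPath


def refresh_command(keyspace, table):
    """nodetool refresh: pick up sstables already copied into the table directory."""
    return ["nodetool", "refresh", keyspace, table]


def sstableloader_command(hosts, sstable_dir, keyspace, table):
    """sstableloader: stream sstables from sstable_dir to the cluster via hosts."""
    path = PurePosixPath(sstable_dir)
    # The loader derives keyspace and table from the last two path components.
    if path.parts[-2:] != (keyspace, table):
        raise ValueError(f"{sstable_dir} must end in {keyspace}/{table}")
    return ["sstableloader", "-d", ",".join(hosts), str(path)]


if __name__ == "__main__":
    print(" ".join(refresh_command("ks", "users")))
    print(" ".join(sstableloader_command(["10.0.0.1", "10.0.0.2"],
                                         "/restore/ks/users", "ks", "users")))
```

With refresh, the command is run on the node that received the files and is followed by a repair; with sstableloader, the call is repeated for each directory of backed-up sstables needed to cover a complete replica.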

Cassandra Scrub - define a destination directory for snapshot

In my C* 1.2.4 setup, I have an ssd drive of 200Gb for the data and a rotational drive for commit logs of 500Gb.
I had the unpleasant surprise during a scrub operation to fill in my ssd drive with the snapshots. That made the cassandra box unresponsive but it kept the status as up when doing nodetool status.
I am wondering if there is a way to specify the target directory for snapshots when doing a scrub.
Otherwise, do you have any ideas for workarounds?
I can do a column family at a time and then copy the snapshots folder, but I am open for smarter solutions.
Thanks,
H
Snapshots in Cassandra are created as hard links to your existing data files. This means at the time the snapshot is taken, it takes up almost no extra space. However, it causes the old files to remain so if you delete or update data, the old version is still there.
This means snapshots must be taken on the drive that stores the data. If you don't need the snapshot any more, just delete it with 'nodetool clearsnapshot' (see the nodetool help output for how to decide which snapshots to delete). If you want to keep the snapshot then you can move it elsewhere. It will only start using much disk space after a while, so you could keep it until you are happy the scrub didn't delete important data then delete the snapshot.
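The hard-link behaviour described above can be reproduced outside Cassandra with a few lines of Python. This is a generic filesystem demonstration on a temporary directory, not Cassandra's own snapshot code, and the file names are made up.

```python
import os
import tempfile
from pathlib import Path


def demo_hard_link_snapshot():
    """Show that a hard-linked 'snapshot' keeps data alive after the original is deleted."""
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "data.db"
        data_file.write_bytes(b"x" * 1024)

        snap_dir = Path(tmp) / "snapshots"
        snap_dir.mkdir()
        snap_file = snap_dir / "data.db"
        os.link(data_file, snap_file)  # a second name for the same bytes, no copy

        same_inode = data_file.stat().st_ino == snap_file.stat().st_ino
        links_before = snap_file.stat().st_nlink

        data_file.unlink()  # e.g. the file is removed after a compaction
        survived = snap_file.read_bytes() == b"x" * 1024
        links_after = snap_file.stat().st_nlink
        return same_inode, links_before, survived, links_after


if __name__ == "__main__":
    print(demo_hard_link_snapshot())
```

Because os.link fails when the source and target are on different filesystems, this also illustrates why a snapshot cannot be created directly on another drive: it first has to exist next to the data and can only be moved elsewhere afterwards.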

Resources