Cassandra: Does backing up one node's data make sense?

When I copy a node's snapshot into its /var/lib/cassandra/data/// directory and run 'nodetool refresh', what happens to the newly copied SSTables and the original SSTables? The original SSTables are still there, and some new writes are still in the commitlog and memtables.
Does backing up a single node's snapshot make sense? Other nodes may have the same data with more recent timestamps.

Does backing up one node's data make sense?
Not in a multi-node environment. Perhaps if you're removing a node from the cluster or running a rolling upgrade, but not for backup purposes on a live cluster.
This problem tends to be addressed with a parallel SSH tool; the example given by DataStax is pssh. This creates the snapshots at the same time on each node, giving you consistent data assuming you're not dealing with heavy writes (since C* is eventually consistent).
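A minimal sketch of that approach, assuming pssh is installed, hosts.txt lists every node in the cluster, and my_keyspace is a placeholder keyspace name:
    # Take a snapshot with the same tag on every node at roughly the same time.
    pssh -h hosts.txt -i "nodetool snapshot -t nightly_$(date +%Y%m%d) my_keyspace"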

Related

Cassandra: How to find node with matching token for restoring to newer cluster?

I want to restore data from an existing cluster to a newer cluster. I want to do so by copying the snapshot SSTables from the old cluster into the keyspaces of the newer cluster, as explained in http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html.
The same document says, "... the snapshot must be copied to the correct node with matching tokens". What does "node with matching tokens" really mean?
My current cluster has 5 nodes, each with num_tokens: 256. I am going to create another cluster with the same number of nodes, the same num_tokens, and the same schema. Do I need to follow the ring order while copying SSTables to the newer cluster? How do I find the matching target node for a given source node?
I tried the command "nodetool ring" to check whether I can use token values to match. But this command gives all the tokens for each host. How can I get the single token value (which determines the position of the node in the ring)? If I can get it, then I can find the matching nodes as well.
With vnodes it's really hard to copy the SSTables over correctly, because it's not just one assigned token that you have to reassign, but 256. To do what you're asking you need some additional steps, described at http://datascale.io/cloning-cassandra-clusters-fast-way/. Basically, reassign the 256 tokens of each node to a node in the other cluster so the ring is the same. The article you listed describes loading the snapshot back onto the same cluster, which is a lot simpler because you don't have to worry about different topologies. Worth noting that even in that scenario, if a node was added or removed since the snapshot, it will not work.
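A sketch of that token reassignment, assuming the default nodetool ring output format and a placeholder source-node IP of 10.0.0.1:
    # Collect all 256 tokens owned by the source node as a comma-separated list.
    nodetool ring | awk '$1 == "10.0.0.1" {print $NF}' | paste -sd, - > tokens_10.0.0.1.txt
    # On the matching node in the new cluster, before its first start, set in cassandra.yaml:
    #   num_tokens: 256
    #   initial_token: <comma-separated list from tokens_10.0.0.1.txt>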
The safest bet is to use sstableloader. It will walk through the SSTables and distribute the data to the appropriate nodes. It also opens up the possibility of making changes without worrying whether everything is correct, and it ensures everything lands on the correct nodes, so no worries about human error. Each node in the original cluster can run sstableloader on its own SSTables against the new cluster, which parallelizes the work pretty well.
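A minimal sketch of that, assuming the snapshot files have been copied into a staging directory whose last two path components are <keyspace>/<table>, which is the layout sstableloader expects; the host names and paths are placeholders:
    # Stream one table's SSTables to the new cluster; run one of these per
    # table, on each source node, to parallelize.
    sstableloader -d new-node-1,new-node-2 /tmp/restore/my_keyspace/my_table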
I would strongly recommend you use this opportunity to decrease the number of vnodes to 32. The default of 256 is excessive and absolutely horrible for rebuilds, Solr indexes, Spark, and, most of all, it ruins repairs. Especially if you use incremental repairs (the default), the additional ranges will cause much more anticompaction and load. If you use sstableloader on each SSTable it will just work. Increasing your streaming throughput in cassandra.yaml will potentially speed this up a bit as well.
If by chance you're using OpsCenter, this backup and restore to a new cluster is automated as well.
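For illustration, the relevant cassandra.yaml settings on the new cluster might look like this (the values are examples, not recommendations beyond what is said above):
    # cassandra.yaml on each node of the new cluster, set before first start
    num_tokens: 32
    # Raise streaming throughput above the stock default to speed up
    # sstableloader / bootstrap streaming.
    stream_throughput_outbound_megabits_per_sec: 400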

Why was some data lost after nodetool cleanup in Cassandra?

We added a new node to the datacenter and then ran nodetool cleanup, following "Add new node to existing cluster in cassandra". But after cleanup completed, we noticed that we had lost some data.
What could be the reason?
It's important to understand that nodetool cleanup is a potentially destructive tool. Your cluster needs to be in a fully repaired state beforehand (from regular, successful runs of nodetool repair).
When you add a new node to the cluster, the token ranges that each node is responsible for are adjusted, and lowered per node. This leaves data on the original nodes that they are no longer responsible for. And that is by design.
The idea is that if, for whatever reason, the node-add process failed and you had to leave your cluster at its original size, the data would still be there. But if you can't guarantee that your cluster was in a fully repaired state in the first place and cleanup was run, it's possible that not all replicas had made it to their proper nodes, even though the bootstrap process (like nodetool getendpoints) assumes that they had.
That's why it's important to ensure that you have been regularly running nodetool repair on your cluster before running nodetool cleanup.
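As an aside, a quick way to check which nodes are supposed to hold a given partition (the keyspace, table, and key below are placeholders):
    # Lists the replica endpoints for one partition key.
    nodetool getendpoints my_keyspace my_table some_partition_key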
nodetool cleanup frees partition keys that no longer belong to a node. After adding a node and transferring its portion of the data, that portion no longer belongs to the old node, so running cleanup will free some space on that node.
If you see that the old node now uses less storage, that is expected; it does not mean any data was lost.
On the other hand, if you really can't find some data, it could be due to data corruption or deleted data (tombstones). What do you mean by data loss, anyway?

Restoring Cassandra from a snapshot

So I did a test run / disaster recovery practice on a test cluster I built: deleting a table and restoring it in Cassandra from a snapshot.
This test cluster has four nodes, and I used the node restart method: after truncating the table in question, all nodes were shut down, the commitlog directories were cleared, and the current snapshot data was copied back into the table directory on each node. Afterwards, I brought each node back up. Then, following the documentation, I ran a repair on each node, followed by a refresh on each node.
My question is: why is it necessary to run a repair on each node afterwards, given that none of the nodes were down except when I shut them down to perform the restore procedure? (In this test instance it was a small amount of data and took very little time to repair; in our production environment the repairs would take about 12 hours, so this could be a HUGE issue for us in a disaster scenario.)
And I assume running the repair would be completely unnecessary on a single node instance, correct?
Just trying to figure out what the purpose of running the repair and subsequent refresh is.
What is repair?
Repair is one of Cassandra's main anti-entropy mechanisms. Essentially it ensures that all your nodes have the latest version of all the data. The reason it takes 12 hours (this is normal, by the way) is that it is an expensive operation -- I/O- and CPU-intensive -- to generate Merkle trees for all your data, compare them with the Merkle trees from other nodes, and stream any missing or outdated data.
Why run a repair after restoring from snapshots?
Repair gives you a consistency baseline. For example: if the snapshots weren't taken at the exact same time, you have a chance of reading stale data if you're using CL ONE and hit a replica restored from the older snapshot. Repair ensures all your replicas are up to date with the latest data available.
tl;dr:
repairs would take about 12 hours to perform so this could be a HUGE issue for us in a disaster scenario
While your repair is running, you have some risk of reading stale data if your snapshots don't contain exactly the same data. If they are old snapshots, gc_grace may have already passed for some tombstones, giving you a higher risk of zombie data if tombstones aren't well propagated across your cluster.
Related side rant - When to run a repair?
The colloquial meaning of the term "repair" seems to imply that your system is broken. We think, "I have to run a repair? I must have done something wrong to get into this un-repaired state!" That is simply not true. Repair is a normal maintenance operation in Cassandra. In fact, you should be running repair at least every gc_grace_seconds to ensure data consistency and avoid zombie data (or use the OpsCenter repair service).
In my opinion, we should have called it AntiEntropyMaintenance or CassandraOilChange or something rather than Repair : )
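A sketch of that routine maintenance, assuming each node is repaired in turn on a schedule that completes within gc_grace_seconds (the keyspace name is a placeholder):
    # Repair only this node's primary token ranges; run it on every node,
    # staggered, so the whole cluster is covered within gc_grace_seconds.
    nodetool repair -pr my_keyspace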

Cassandra - avoid nodetool cleanup

If we have added new nodes to a C* ring, do we need to run "nodetool cleanup" to get rid of the data that has now been assigned elsewhere? Or is this going to happen anyway during normal compactions?
During normal compactions, does C* remove data that no longer belongs on a node, or do we need to run "nodetool cleanup" for that? Asking because "cleanup" takes forever and crashes the node before finishing.
If we need to run "nodetool cleanup", is there a way to find out which nodes now hold data they should no longer own? (i.e., data that now belongs on the new nodes, but is still present on the old nodes because nobody removed it -- the data that "nodetool cleanup" would remove.) We have RF=3 and two data centers, each of which has a complete copy of the data. I assume we need to run cleanup on all nodes in the data center where we added nodes, because each row on a new node used to be on another node (the primary), plus two copies (replicas) on two other nodes.
If you are on Apache Cassandra 1.2 or newer, cleanup checks the metadata on the files so that it only does something if it needs to. So you are safe to just run it on every node, and only the nodes with extra data will do anything. The data will not be removed during the normal compaction process; you have to call cleanup to remove it.
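A minimal sketch of that, with the keyspace name as a placeholder (omit it to clean up all keyspaces):
    # Run on each node, one node at a time, after all new nodes have joined.
    nodetool cleanup my_keyspace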
What I found helpful is to compare how much space each node occupies in the data folder (for me it was /var/lib/cassandra/data). Some things like snapshots might differ between the nodes, but when you see that newer nodes use much less disk space than older ones, it might be because the older ones were never cleaned up after the newer ones were added. While you are there, you can also check what the biggest .db file is and whether your storage has enough free space for another file of that size. Cleanup appears to copy the data of the .db files into new ones, minus the data that now lives on other nodes, so you might need that extra space while it runs.
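A quick sketch of those checks (the path assumes the default data directory; the find flags assume GNU find):
    # Total data usage on this node.
    du -sh /var/lib/cassandra/data
    # Largest .db file, to gauge how much headroom cleanup may need.
    find /var/lib/cassandra/data -name '*.db' -printf '%s\t%p\n' | sort -n | tail -1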

Proper Cassandra keyspace restore procedure

I am looking for confirmation that my Cassandra backup and restore procedures are sound and I am not missing anything. Can you please confirm, or tell me if something is incorrect/missing?
Backups:
I run daily full backups of the keyspaces I care about, via "nodetool snapshot keyspace_name -t current_timestamp". After the snapshot has been taken, I copy the data to a mounted disk dedicated to backups, then do a "nodetool clearsnapshot $keyspace_name -t $current_timestamp".
I also run hourly incremental backups, executing a "nodetool flush keyspace_name" and then moving files from the backup directory of each keyspace into the backup mountpoint.
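For concreteness, a sketch of the daily full-backup step described above (the keyspace name, backup mount point, and tag format are placeholders):
    TAG=$(date +%Y%m%d%H%M)
    nodetool snapshot -t "$TAG" my_keyspace
    # Copy each table's snapshot directory to the backup disk, keyed by table name.
    for snap in /var/lib/cassandra/data/my_keyspace/*/snapshots/"$TAG"; do
        table=$(basename "$(dirname "$(dirname "$snap")")")
        mkdir -p /mnt/backups/"$TAG"/"$table"
        cp -a "$snap"/. /mnt/backups/"$TAG"/"$table"/
    done
    nodetool clearsnapshot -t "$TAG" my_keyspace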
Restore:
So far, the only valid way I have found to do a restore (and tested/confirmed) is to do this, on ALL Cassandra nodes in the cluster:
Stop Cassandra
Clear the commitlog *.log files
Clear the *.db files from the table I want to restore
Copy the snapshot/full backup files into that directory
Copy any incremental files I need to (I have not tested with multiple incrementals, but I am assuming I will have to overlay the files, in sequence from oldest to newest)
Start Cassandra
On one of the nodes, run a "nodetool repair keyspace_name"
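As a per-node sketch of those steps (the paths, keyspace, and table names are placeholders; the table directory name carries a UUID suffix on 2.2+, and the service commands assume an init-script install):
    sudo service cassandra stop
    rm /var/lib/cassandra/commitlog/*.log
    rm /var/lib/cassandra/data/my_keyspace/my_table-*/*.db
    cp /mnt/backups/"$TAG"/my_table/* /var/lib/cassandra/data/my_keyspace/my_table-*/
    # Overlay any incremental backup files here, oldest to newest, if needed.
    sudo service cassandra start
    # Then, once every node is back up, on one node only:
    #   nodetool repair my_keyspace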
So my questions are:
Does the above backup and restore strategy seem valid? Are any steps inaccurate or anything missing?
Is there a way to do this without stopping Cassandra on EVERY node? For example, is there a way to restore the data on ONE node and then somehow make it "authoritative"? I tried this and, as expected, since the restored data is older, the newer data on the other nodes overwrites it when they sync up during repair.
Thank you!
There are two ways to restore Cassandra backups without restarting C*:
Copy the files into place, then run "nodetool refresh". This has the caveat that the restored rows will still be older than any tombstones, so if you're trying to restore deleted data, it won't do what you want. It also only applies to the local server (you'll want to repair afterwards).
Use "sstableloader". This will load the data to all nodes. You'll need to make sure you have the SSTables from a complete replica, which may mean loading the SSTables from multiple nodes. Added bonus: this works even if the cluster size has changed. I'm not sure if ordering matters here (that is, I don't know whether row timestamps are preserved through the load or redefined during the load).
