We added a new node to datacenter and then run nodetool cleanup according to Add new node to existing cluster in cassandra. But after cleanup completed, we noticed that we lost some data.
What could be the reason?
Yes, it's important to understand that nodetool cleanup is a potentially destructive tool. Your cluster needs to be in a fully-repaired state (from regular, successful runs of nodetool repair prior).
When you add a new node to the cluster, the token ranges that each node is responsible for are adjusted, and lowered per node. This leaves data on the original nodes that they are no longer responsible for. And that is by design.
The idea was that if for whatever reason the node add process failed and you had to leave your cluster at its original size, then the data is still there. But if you can't guarantee that your cluster was in a fully-repaired state in the first place and cleanup was run, it's possible that not all replicas would have made it to their proper nodes. But like nodetool getendpoints the bootstrap process would have assumed that it was.
That's why it's important to ensure that you have been regularly running nodetool repair on your cluster before running nodetool cleanup.
nodetool cleanup frees partition keys no longer belonging to a node, so after adding a node and transferring it's portion of data, this "portion" is no longer belongs to the old node, so running cleanup will free some space on this node.
If you see that old node now have lower storage, it is ok, there wasn't any data loss.
On other hand, if you really can't find some data, it can be due to data corruption or deleted data (with tombstones). What do you mean by data loss anyway?
Related
I am wondering why it is a best practice to not run nodetool removenode. What is it used for? Is there a hierarchy of commands to run instead? What kind of issues arise when running said command? Any first hand experience/nightmare stories of using removenode? Overarching why not?
A default preference order would be:
Replacement node option (if replacement is planned)
Decomission
RemoveNode
Assassinate
However - there are situations where you will still choose a lower entry over an earlier one.
If the node being removed is operational, then you would normally run a decommission and allow the node to stream the data from itself to other nodes which will now be holding one of the replicas that was previously on the node being removed.
Removing a node will cause the token ranges to be recalculated and move, potentially requiring all nodes to start streaming data to other nodes who now own that range.
If the node is not operational, you can perform a nodetool removenode - this will trigger the same range movement and cause a large amount of streaming. There are streaming throughput throttles that are in place by default and can be adjusted to limit this impact.
You can also forcibly terminate either a decommission or a removenode by using nodetool [decommission | removenode] force - however, this means that one of the replicas of the data has not been re-established to another node, leaving you with less resilience.
Why would you do that? for the same streaming reason, if you accept the loss of resilience for a period of time, you can roll out the repair node by node in a controlled manner. This option should not be considered your 'default approach' or the choice taken lightly - I cannot stress or make that in bold enough.
The final option, when decommission / removenode is not available, is to assassinate the node - this is pretty much the same as performing a removenode, followed by an immediate force. You then have to manage to repairs and clean up in the same manner.
Outside of all of these 3 options - the best option is if your intention is to replace the node, then performing a replacement instead of a remove / add is the winner - this will only then require the new node has data streamed to it from the other replica's and there is no further token ring range movement. Instructions here
If the data disks are available, it is also possible to bring the replacement up without streaming the data, instructions here
The Datastax Documentation goes over the use case for nodetool removenode in depth.
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/nodetool/toolsRemoveNode.html
The gist of why it can be bad is this:
Warning: This command triggers cluster streaming. In large
environments, the additional streaming activity causes more pending
gossip tasks in the output of nodetool tpstats. Nodes can start to
appear offline and may need to be restarted to clear up the back log
of pending gossip tasks.
According to the document, this is when it should be used:
When the node is down and nodetool decommission cannot be used, use nodetool removenode. Run this command only on nodes that are down.
I've been tracking the growth of some big Cassandra tables using Spark rdd.count(). Up 'till now the expected behavior was consistent, the number of rows is constantly growing.
Today I ran nodetool cleanup on one of the seeds and as usual it ran for a 50+ minutes.
And now rdd.count() returns one third of the rows it did before....
Did I destroy data using nodetool cleanup? Or is the Spark count unreliable and was counting ghost keys? I got no errors during cleanup and lots don't show anything out of the usual. It did seem like a successful operation, until now.
Update 2016-11-13
Turns out the Cassandra documentation set me up for the loss of 25+ million rows of data.
The documentation is explicit:
Use nodetool status to verify that the node is fully bootstrapped and
all other nodes are up (UN) and not in any other state. After all new
nodes are running, run nodetool cleanup on each of the previously
existing nodes to remove the keys that no longer belong to those
nodes. Wait for cleanup to complete on one node before running
nodetool cleanup on the next node.
Cleanup can be safely postponed for low-usage hours.
Well you check the status of the other nodes via nodetool status and they are all UP and Normal (UN), BUT here's the catch, you also need to run the command is nodetool describecluster where you might find that the schemas were not synced.
My schemas were not synced and I ran cleanup, when all nodes were UN, up and running normally as per the documentation. The Cassandra documentation does not mention nodetool describecluster after adding new nodes.
So I merrily added nodes, waited till they were UN (Up / Normal) and ran cleanup.
As a result, 25+ million rows of data are gone. I hope this helps others avoid this dangerous pitfall. Basically the Datastax documentation sets you up to destroy data by recommending cleanup as a step of the process of adding new nodes.
In my opinion, that cleanup step should be taken out of the new node procedure documentation altogether. It should be mentioned, elsewhere, that cleanup is a good practice but not in the same section as adding new nodes...it's like recommending rm -rf / as one of the steps for virus removal. Sure will remove the virus...
Thank you Aravind R. Yarram for your reply, I came to the same conclusion as your reply and came here to update this. Appreciate your feedback.
I am guessing you might have either added/removed nodes from the cluster or decreased replication factor before running nodetool cleanup. Until you run the cleanup, I guess Cassandra still reports the old key ranges as part of the rdd.count() as old data still exists on those nodes.
Reference:
https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsCleanup.html
So I did something of a test run/disaster recovery practice deleting a table and restoring in Cassandra via snapshot on a test cluster I have built.
This test cluster has four nodes, and I used the node restart method so after truncating the table in question, all nodes were shutdown, commitlog directories cleared, and the current snapshot data copied back into the table directory for each node. Afterwards, I brought each node back up. Then following the documentation I ran a repair on each node, followed by a refresh on each node.
My question is, why is it necessary for me to run a repair on each node afterwards assuming none of the nodes were down except when I shut them down to perform the restore procedure? (in this test instance it was a small amount of data and took very little time to repair, if this happened in our production environment the repairs would take about 12 hours to perform so this could be a HUGE issue for us in a disaster scenario).
And I assume running the repair would be completely unnecessary on a single node instance, correct?
Just trying to figure out what the purpose of running the repair and subsequent refresh is.
What is repair?
Repair is one of Cassandra's main anti-entropy mechanisms. Essentially it ensures that all your nodes have the latest version of all the data. The reason it takes 12 hours (this is normal by the way) is that it is an expensive operation -- io and CPU intensive -- to generate merkel trees for all your data, compare them with merkel trees from other nodes, and stream any missing / outdated data.
Why run a repair after a restoring from snapshots
Repair gives you a consistency baseline. For Example: If the snapshots weren't taken at the exact same time, you have a chance of reading stale data if you're using CL ONE and hit a replica restored from the older snapshot. Repair ensures all your replicas are up to date with the latest data available.
tl;dr:
repairs would take about 12 hours to perform so this could be a HUGE
issue for us in a disaster scenario).
While your repair is running, you'll have some risk of reading stale data if your snapshots don't have the same exact data. If they are old snapshots, gc_grace may have already passed for some tombstones giving you a higher risk of zombie data if tombstones aren't well propagated across your cluster.
Related side rant - When to run a repair?
The coloquial definition of the term repair seems to imply that your system is broken. We think "I have to run a repair? I must have done something wrong to get to this un-repaired state!" This is simply not true. Repair is a normal maintenance operation with Cassandra. In fact, you should be running repair at least every gc_grace seconds to ensure data consistency and avoid zombie data (or use the opscenter repair service).
In my opinion, we should have called it AntiEntropyMaintenence or CassandraOilChange or something rather than Repair : )
If we have added new nodes to a C* ring, do we need to run "nodetool cleanup" to get rid of the data that has now been assigned elsewhere? Or is this going to happen anyway during normal compactions?
During normal compactions, does C* remove data that does no longer belong on this node, or do we need to run "nodetoool cleanup" for that? Asking because "cleanup" takes forever and crashes the node before finishing.
If we need to run "nodetool cleanup", is there a way to find out which nodes now have data they should no longer own? (i.e data that now belongs on the new nodes, but is still present on the old nodes because no one removed it. This is the data that "nodetool cleanup" would remove.) We have RF=3 and two data centers, each of which has a complete copy of the data. I assume we need to run cleanup on all nodes in the data center where we have added nodes, because each row on the new node used to be on another node (primary), plus two copies (replicas) on two other nodes.
If you are on Apache Cassandra 1.2 or newer, cleanup checks the meta data on files so that it only does something if it needs to. So you are safe to just run it on every node, and only those nodes with extra data will do something. The data will not be removed during the normal compaction process, you have to call cleanup to remove it.
What I found helpful is to just compare how much space each node occupies in the data folder (for me it was /var/lib/cassandra/data). Some things like snapshots might differ between the nodes but when you see that newer nodes use much less disk space than older ones it might be because they did not have a cleanup after the newer ones where added. While you are there, you can also check what is the biggest .db file in there and check if your storage is has enough free space to store another file of that size The cleanup seems to copy the data of the .db files into new ones, minus the data that is now on other nodes. So you might need that extra space while it runs.
Cassandra nodetool has a command called cleanup:
cleanup [keyspace][cf_name]
Triggers the immediate cleanup of keys no longer belonging to this
node. This has roughly the same effect on a node that a major
compaction does in terms of a temporary increase in disk space usage
and an increase in disk I/O. Optionally takes a list of column family
names.
My questions are:
When will a node having keys not belonging to it?
When should I issue a cleanup?
Should I do cleanup regularly (e.g. once per week)?
When will a node having keys not belonging to it?
When you have added new nodes to the cluster, decreased replication factor or moved tokens.
When should I issue a cleanup?
After one of the above operations, if you need to save disk space. There is no harm in delaying running it - there is a performance impact and the only reason to is to save disk space.
Should I do cleanup regularly (e.g. once per week)?
No, only if you need to save space after one of the above operations.
When will a node having keys not belonging to it?
When you bootstrap a new node, some of the existing nodes will lose ownership of data by transferring the ownership to the new node.
Reducing replication factor also does this.
When should I issue a cleanup?
After operations mentioned below, but before you start any other topology / replication change.
You should run it on all affected nodes in the cluster. When in doubt, run on all nodes.
One reason to run it is to reclaim the disk space used to store no longer owned data.
Another reason is that failure to do so may cause data consistency problems. You may see resurrection of deleted data. Consider the case of node A losing ownership of key k after bootstrapping a new node, and holding a live row for key k. Later, key k is deleted but deletion does not propagate to node A (no longer a replica). Then the deletion expires in the whole cluster. Then you change the topology such that A is the owner of key k again. It will serve the old, deleted, row.
Source: https://docs.datastax.com/en/dse/6.7/dse-admin/datastax_enterprise/tools/nodetool/toolsCleanup.html
No need to run nodetool cleanup after nodetoool decommission, nodetool replace, or nodetool removenode.
Should I do cleanup regularly (e.g. once per week)?
No need to.