I have a Cassandra 2.2.9 repair that has failed, and in this state the Cassandra metrics show about 70 repair tasks still pending. Cassandra should take care of retrying these failed tasks itself, but for whatever reason this time it has not.
The repairs take a long time, so instead of running the whole repair again, can I see the token ranges Cassandra chose for the repair, so that I can manually run just the last few tasks instead?
One way I found is to search the logs from the time you started the repair - Cassandra spits out the ID and token range of each repair task it will attempt when the repair begins.
With Cassandra 2.2.9 I found grepping the logs for new session: will sync did the trick :)
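For example, something like this (the log path is the usual default; the session ID, range, and names below are illustrative):

grep 'new session: will sync' /var/log/cassandra/system.log

That prints lines like [repair #a4c21c50-...] new session: will sync /10.0.0.1, /10.0.0.2 on range (123,456] for my_ks.[my_table], and a failed range can then be re-run by hand with something like:

nodetool repair -st 123 -et 456 my_ks my_table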
Currently, I'm running nodetool repair manually once a week (while nothing is happening on the Cassandra nodes - nothing is being inserted, etc.). I'm just wondering: can I run nodetool repair while data is being inserted?
Secondly, can I create a crontab entry that automatically runs nodetool repair every week, and is once a week enough?
Yes, if you run one repair at a time you shouldn't impact normal usage. Instead of a cron job I would recommend using Reaper (free and open source) for automating it. It would give you a bit more visibility, and it handles things a bit better than the defaults you get from just running nodetool.
Yes, you can run repair while data is being inserted. It may impact your traffic, so to limit that you can run repair per table or per keyspace.
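For example (the keyspace and table names are placeholders):

nodetool repair my_keyspace
nodetool repair my_keyspace my_table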
Yes, you can run repair and schedule a cron job for it.
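A minimal illustrative crontab entry, assuming nodetool lives at /usr/bin/nodetool and a weekly primary-range repair of one keyspace is enough:

# run a primary-range repair every Sunday at 02:00
0 2 * * 0 /usr/bin/nodetool repair -pr my_keyspace >> /var/log/cassandra/weekly_repair.log 2>&1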
I have scheduled incremental repair for every day, but while the repair is going on, our monitoring system reports COMPACTIONEXECUTOR_PENDING tasks.
I am wondering if I can introduce a check to see that compaction is not running before I trigger repair.
I should be able to check whether compaction is running by parsing the output of the nodetool netstats and nodetool compactionstats commands.
I will proceed with repair if both of the following checks pass:
nodetool netstats output contains Not sending any streams.
nodetool compactionstats output contains pending tasks: 0
But I want to get some expert opinion before I proceed.
Is my understanding correct?
I don't want to get into a situation in which these checks always fail and the repair process never gets triggered at all.
Thanks.
Compaction occurs regularly in Cassandra, so I'm a bit worried that only triggering repair when pending_compactions=0 will result in repair not running often enough. It depends on your traffic of course, e.g. if you have few writes you won't do many compactions. You should probably add a max wait time for pending_compactions=0, so that if the condition is still not true after the specified time, repair runs anyway.
To answer your question: nodetool uses JMX to fetch MBeans in Cassandra. You can see all the available MBeans here: http://cassandra.apache.org/doc/latest/operating/metrics.html
You want this MBean:
org.apache.cassandra.metrics:type=Compaction,name=PendingTasks
You can create your own JMX Client like this: How to connect to a java program on localhost jvm using JMX?
Or you can use jmxterm: https://github.com/jiaqi/jmxterm
My understanding is you could use it like this (assuming the default JMX port of 7199):
java -jar jmxterm-1.0.0-uber.jar
open localhost:7199
get -b org.apache.cassandra.metrics:type=Compaction,name=PendingTasks Value
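To tie this back to your check idea, here is a minimal shell sketch of the gate you describe, with the max-wait escape hatch suggested above (the 30-minute cap and the keyspace name are assumptions):

#!/bin/sh
# wait until no compactions are pending and no streams are active, capped at 30 minutes
max_wait=1800
waited=0
while [ "$waited" -lt "$max_wait" ]; do
    nodetool compactionstats | grep -q 'pending tasks: 0' &&
        nodetool netstats | grep -q 'Not sending any streams' && break
    sleep 60
    waited=$((waited + 60))
done
# run repair once the cluster is quiet or the cap is hit, so it is never skipped entirely
nodetool repair my_keyspace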
I've been tracking the growth of some big Cassandra tables using Spark rdd.count(). Up till now the behavior matched expectations: the number of rows was constantly growing.
Today I ran nodetool cleanup on one of the seeds, and as usual it ran for 50+ minutes.
And now rdd.count() returns one third of the rows it did before....
Did I destroy data using nodetool cleanup? Or is the Spark count unreliable and was it counting ghost keys? I got no errors during cleanup and the logs don't show anything out of the usual. It did seem like a successful operation, until now.
Update 2016-11-13
Turns out the Cassandra documentation set me up for the loss of 25+ million rows of data.
The documentation is explicit:
Use nodetool status to verify that the node is fully bootstrapped and all other nodes are up (UN) and not in any other state. After all new nodes are running, run nodetool cleanup on each of the previously existing nodes to remove the keys that no longer belong to those nodes. Wait for cleanup to complete on one node before running nodetool cleanup on the next node. Cleanup can be safely postponed for low-usage hours.
Well, you check the status of the other nodes via nodetool status and they are all Up and Normal (UN), BUT here's the catch: you also need to run nodetool describecluster, where you might find that the schemas were not synced.
My schemas were not synced, and I ran cleanup when all nodes were UN, up and running normally, as per the documentation. The Cassandra documentation does not mention nodetool describecluster after adding new nodes.
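For illustration, here is the kind of check that would have caught it; the cluster name, schema IDs, and addresses below are made up, but the shape of the output is what matters:

nodetool describecluster
# Cluster Information:
#     Name: MyCluster
#     Schema versions:
#         65e78f0e-e0e8-11e6-8a33-0f6c42e8f253: [10.0.0.1, 10.0.0.2]
#         5a1c145e-230f-11e7-9b4f-13b7a4a68e33: [10.0.0.3]
# more than one schema version listed means the schemas are NOT synced:
# wait for agreement before running cleanup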
So I merrily added nodes, waited till they were UN (Up / Normal) and ran cleanup.
As a result, 25+ million rows of data are gone. I hope this helps others avoid this dangerous pitfall. Basically the Datastax documentation sets you up to destroy data by recommending cleanup as a step of the process of adding new nodes.
In my opinion, that cleanup step should be taken out of the new-node procedure documentation altogether. It should be mentioned elsewhere that cleanup is a good practice, but not in the same section as adding new nodes... it's like recommending rm -rf / as one of the steps for virus removal. Sure, it will remove the virus...
Thank you Aravind R. Yarram for your reply; I came to the same conclusion and came here to update this. Appreciate your feedback.
I am guessing you might have either added/removed nodes from the cluster or decreased the replication factor before running nodetool cleanup. Until you run the cleanup, I guess Cassandra still reports the old key ranges as part of rdd.count(), as the old data still exists on those nodes.
Reference:
https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsCleanup.html
We had 3 regions for the Cassandra cluster, each with 2 nodes, 6 in total. Then we added 3 more regions, so now we have 12 Cassandra nodes in the cluster. After adding the nodes, we updated the replication factors and started nodetool repair. But the command has been hanging for more than 48 hours and has not finished yet. When we looked into the logs, 1 or 2 AntiEntropySessions are still pending, because some of the CFs are not fully synced. All AntiEntropySessions successfully get the Merkle tree from all the nodes for all CFs, but repair between some nodes does not complete for some CFs, which leads to pending AntiEntropySessions and the hanging repair.
We are using Cassandra 1.1.12, and we will not be able to upgrade Cassandra right now.
We have restarted the nodes and started the repair again, but it still hangs.
We have observed that one CF, which has frequent reads and writes in the initial 3 regions and is active during the repair, fails to sync completely every time.
Is it necessary that while running repair there shouldn't be any reads/writes to any CF?
Or can you suggest what the issue could be here?
Cassandra 1.1 is very old, so it's hard to remember exact issues, but there were problems with streaming back then that could cause hangs. Some causes were things like a read timing out or a connection being reset. Since you are past 1.1.11, though, you're OK to try subrange repairs.
Try to find a token range you can repair in an hour (keep running smaller and smaller ranges until you can complete one), and set a timeout of a couple of hours. Expect some repairs to fail (time out), so just retry them until they complete. If you cannot get a range to finish after many retries, keep making that subrange smaller; even then it may have problems if you have a very wide partition (you can check with nodetool cfstats), which will make things much worse.
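A rough sketch of one such attempt, assuming I remember the 1.1.11+ options right (-st/-et for the subrange; the tokens, keyspace, and CF names are placeholders):

# try one small token subrange, giving up after two hours
timeout 7200 nodetool repair -st 3074457345618258602 -et 3074457345618258700 my_keyspace my_cf
# on timeout, simply retry; if it keeps timing out, shrink the range and try again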
Once you get a completed repair, upgrade like crazy.
So I did a test run / disaster recovery practice of deleting a table and restoring it in Cassandra via snapshot, on a test cluster I have built.
This test cluster has four nodes, and I used the node restart method: after truncating the table in question, all nodes were shut down, the commitlog directories were cleared, and the current snapshot data was copied back into the table directory on each node. Afterwards, I brought each node back up. Then, following the documentation, I ran a repair on each node, followed by a refresh on each node.
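Per node, the steps were roughly the following (paths and names are placeholders for my setup):

nodetool drain && sudo service cassandra stop
rm -f /var/lib/cassandra/commitlog/*
cp /backups/my_ks/my_table/my_snapshot/* /var/lib/cassandra/data/my_ks/my_table/
sudo service cassandra start
nodetool repair my_ks my_table
nodetool refresh my_ks my_table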
My question is: why is it necessary to run a repair on each node afterwards, given that none of the nodes were down except when I shut them down to perform the restore procedure? (In this test instance it was a small amount of data and the repair took very little time; if this happened in our production environment the repairs would take about 12 hours to perform, so this could be a HUGE issue for us in a disaster scenario.)
And I assume running the repair would be completely unnecessary on a single-node instance, correct?
Just trying to figure out what the purpose of running the repair and subsequent refresh is.
What is repair?
Repair is one of Cassandra's main anti-entropy mechanisms. Essentially, it ensures that all your nodes have the latest version of all the data. The reason it takes 12 hours (this is normal, by the way) is that it is an expensive operation - I/O and CPU intensive - to generate Merkle trees for all your data, compare them with the Merkle trees from other nodes, and stream any missing or outdated data.
Why run a repair after a restoring from snapshots
Repair gives you a consistency baseline. For example: if the snapshots weren't taken at exactly the same time, you have a chance of reading stale data if you're using CL ONE and hit a replica restored from the older snapshot. Repair ensures all your replicas are up to date with the latest data available.
tl;dr:
repairs would take about 12 hours to perform so this could be a HUGE issue for us in a disaster scenario).
While your repair is running, you'll have some risk of reading stale data if your snapshots don't contain the exact same data. If they are old snapshots, gc_grace may have already passed for some tombstones, giving you a higher risk of zombie data if tombstones aren't well propagated across your cluster.
Related side rant - When to run a repair?
The colloquial definition of the term repair seems to imply that your system is broken. We think, "I have to run a repair? I must have done something wrong to get into this un-repaired state!" This is simply not true. Repair is a normal maintenance operation in Cassandra. In fact, you should be running repair at least every gc_grace seconds to ensure data consistency and avoid zombie data (or use the OpsCenter repair service).
In my opinion, we should have called it AntiEntropyMaintenance or CassandraOilChange or something rather than Repair : )