Why are hints not generated on surviving Cassandra nodes?

We have a 3-node cluster with a keyspace at RF=3. One of the nodes went down due to a hardware issue, but we observed that hints are not being generated on the surviving nodes.
The max_hint_window_in_ms parameter is set to 30 minutes and gc_grace_seconds is left at the default (10 days).
Why is this happening?

Hints will only be generated for max_hint_window_in_ms. After that window elapses, the coordinators will no longer generate hints for nodes which are down or unresponsive.
Hinting only resumes once the problematic node comes back online. Cheers!
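As an illustration, here is a minimal way to check the hint window and handoff state on a surviving coordinator; the paths and keyspace name below are assumptions for a typical package install, not taken from the question:

```sh
# The hint window is configured in milliseconds in cassandra.yaml:
# 30 minutes = 30 * 60 * 1000 = 1,800,000 ms
grep -E 'hinted_handoff_enabled|max_hint_window_in_ms' /etc/cassandra/cassandra.yaml

# Confirm hinted handoff is active on this (surviving) node
nodetool statushandoff

# If the dead node stayed down longer than the window, hints stop accruing;
# once it is back online, run a repair to re-sync the writes it missed
nodetool repair -pr my_keyspace
```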

Related

High CPU usage and traffic on some Cassandra nodes

As stated in the title, we are having a problem with our Cassandra cluster. There are 9 nodes with a replication factor of 3 using NetworkTopologyStrategy, all in the same DC and rack. The Cassandra version is 3.11.4 (we plan to move to 3.11.10). Instances have 4 CPUs and 32 GB RAM (we plan to move to 8 CPUs).
Whenever we try to run repair on our cluster (using Cassandra Reaper on one of our nodes), we lose one node somewhere in the process. We quickly stop the repair, restart the Cassandra service on the node, and wait for it to join the ring. As a result, we are never able to complete a repair these days.
I looked into the problem and realized that it is caused by high CPU usage on some of our nodes (exactly 3). You can see this in the one-week graph below; the ups and downs follow app usage, which is very low in the mornings.
I compared the running processes on each node, and there is nothing extra on the high-CPU nodes. I compared the configurations; they are identical. I couldn't find any difference.
I also realized that these nodes are the ones that take most of the traffic (both sent and received bytes), as the one-week graph below shows.
I did some research. I found this thread, and at the end it recommends setting dynamic_snitch: false in the Cassandra configuration. Our snitch is GossipingPropertyFileSnitch. In theory this snitch should work properly, but I guess it doesn't.
The job of a snitch is to provide information about your network topology so that Cassandra can efficiently route requests.
The only observation I have that could be the cause of this issue is a file called cassandra-topology.properties, which is specifically supposed to be removed when using GossipingPropertyFileSnitch:
The rack and datacenter for the local node are defined in cassandra-rackdc.properties and propagated to other nodes via gossip. If cassandra-topology.properties exists, it is used as a fallback, allowing migration from the PropertyFileSnitch.
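(For reference, the snitch-related files look roughly like this; the paths and values are only illustrative:)

```sh
# Each node declares only its own DC/rack; gossip propagates it to the others.
cat /etc/cassandra/cassandra-rackdc.properties
# dc=DC1
# rack=RACK1

# With GossipingPropertyFileSnitch, cassandra-topology.properties is read only
# as a fallback, e.g. while migrating away from the PropertyFileSnitch.
ls -l /etc/cassandra/cassandra-topology.properties
```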
I did not remove this file, as I couldn't find any hard proof that it is causing the issue. If you have any knowledge of this, or see any other reason for my problem, I would appreciate your help.
These two sentences tell me some important things about your cluster:
high CPU usage on some of our nodes (exactly 3).
I also realized that these nodes are the ones that take most of the traffic.
The obvious point is that your replication factor (RF) is 3 (the most common). The not-so-obvious point is that your data model is likely keyed on date or some other natural key, which results in the same (3?) nodes serving all of the traffic for long periods of time. Running repair during those high-traffic periods will likely lead to issues.
Some things to try:
Have a look at the data model, and see if there's a better way to partition the data to distribute traffic over the rest of the cluster. This is often done with a modeling technique known as "bucketing" (adding another component, usually time-based, to the partition key).
Are the partitions large? (Check with nodetool tablehistograms.) By "large," I mean > 10 MB. Large partitions could also be causing the repair operations to fail. If so, hopefully lowering resource consumption (below) will help.
Does your cluster sustain a high write throughput? If so, it may also be dealing with compactions (nodetool compactionstats). You could try lowering compaction throughput (nodetool setcompactionthroughput) to free up some resources. Repair operations can also invoke compactions.
Likewise, you can also lower streaming throughput (nodetool setstreamthroughput) during repairs. Repairs will take longer to stream data, but if that is what is really tipping over the node(s), it might be necessary (see the command sketch after this list).
In case you're not already, set up another instance and use Cassandra Reaper for repairs. It is so much better than triggering from cron. Plus, the UI allows for some finely-tuned config which might be necessary here. It also lets you pause and resume repairs, picking up where they left off.
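A rough sketch of the checks and throttles mentioned above; the keyspace/table names and throughput values are only examples to adapt to your hardware:

```sh
# Look for oversized partitions (partition size percentiles are in bytes)
nodetool tablehistograms my_keyspace my_table

# See whether compactions pile up during the busy/repair window
nodetool compactionstats

# Temporarily throttle compaction and streaming while repairs run
nodetool setcompactionthroughput 8    # MB/s (example; 3.11 default is 16)
nodetool setstreamthroughput 100      # Mbit/s (example; 3.11 default is 200)

# Restore the normal settings afterwards
nodetool setcompactionthroughput 16
nodetool setstreamthroughput 200
```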

ScyllaDB / Cassandra higher replication factor than total number of nodes with CL=QUORUM

I would highly appreciate it if someone could help with the questions below.
*RF= Replication Factor
*CL= Consistency Level
We have a requirement for strong consistency and high availability, so I have been testing RF and CL on a 7-node ScyllaDB cluster, keeping RF=7 (100% of the data on each node) and CL=QUORUM.
What happens to data copies / replication if 2 nodes go down? Does it replicate the data of the 2 down nodes (the 6th and 7th copies) onto the remaining 5 nodes?
Or will it simply discard those copies? What is the effect of RF=7 when there are only 5 active nodes?
I could not find anything in the logs. Is there any document/link that covers this case? Or how can I verify and prove this behaviour? Please explain.
With RF=7, the data is always replicated to 7 nodes.
When a node (or two) goes down, the remaining five nodes already have a copy, and no additional streaming is required.
Using CL=QUORUM (4 of the 7 replicas), even three nodes down will not hurt your HA or consistency.
When the failed nodes come back to life, they will be synced, either automatically using Hinted Handoff (for a short outage) or with Repair (for a longer one) [1].
If you replace a dead node [2], the other replicas will stream the data to it until it is up to speed with the rest of the cluster.
[1] https://docs.scylladb.com/architecture/anti-entropy/
[2] https://docs.scylladb.com/operating-scylla/procedures/cluster-management/replace_dead_node/
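To make the quorum arithmetic concrete, a minimal check (nothing here is specific to the cluster in the question):

```sh
# QUORUM = floor(RF / 2) + 1 = floor(7 / 2) + 1 = 4
# Reads/writes at QUORUM need 4 live replicas out of 7; with 2 nodes down
# you still have 5, and even with 3 down you have exactly 4.
nodetool status            # shows which nodes are UN (up/normal) vs DN (down)
nodetool describecluster   # confirms the live nodes agree on the schema
```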
Data will always be replicated to all nodes because you have set RF=7. If 2 nodes go down, the remaining nodes will store hints for them; once those nodes come back up, the remaining nodes will replay the hints automatically, provided the outage was within the hint window. If the hint window (default 3 hours) has expired, you need to run a manual repair to get the data back in sync across the cluster.

Cassandra 'nodetool repair -pr' taking way too much time

I am running a cluster with 1 datacenter (10 nodes) and Cassandra 2.1.7 installed on each. We are using SimpleStrategy (an old mistake).
The situation is that I have not run nodetool repair since the beginning, and now there is about 200 GB of data with RF=3.
Running a full repair or an incremental repair is the same at this point, so I tried to run a full repair, but this resulted in the coordinator node going down.
So I ended up running primary-range repairs (nodetool repair -pr) on each node, one at a time. But this is taking way too much time (15+ hours per node, hence weeks for all nodes).
Am I doing this wrong, or is this supposed to happen? Or is it a version problem?
In the future, if I run a full repair again after finishing this, will it take weeks as well?
Since full repair time is mainly driven by data size, it should take about the same amount of time.
I suggest moving to incremental repairs; this should save you time and resources.
Here's an article about how to do this in 2.1:
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
If your data size is too big, you can use sub-range repair. It's similar to repair -pr, but it focuses on a sub-range of tokens.
For more detail:
https://www.pythian.com/blog/effective-anti-entropy-repair-cassandra
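A minimal sketch of a sub-range repair; the tokens and keyspace name below are placeholders you would pick from your own ring:

```sh
# List the token boundaries owned by each node
nodetool ring

# Repair one token slice at a time instead of a full or -pr repair
nodetool repair -st -9223372036854775808 -et -4611686018427387904 my_keyspace
```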

Cassandra: Hanging node tool repair

We had 3 regions for the Cassandra cluster, each with 2 nodes, 6 in total. We have since added 3 more regions, so now we have 12 Cassandra nodes in the cluster. After adding the nodes, we updated the replication factors and started nodetool repair. But the command has been hanging for more than 48 hours and has not finished yet. When we looked into the logs, 1 or 2 AntiEntropySessions are still pending because some of the CFs are not fully synced. All AntiEntropySessions successfully get the Merkle trees from all the nodes for all CFs, but the repair between some nodes never completes for some CFs, which leads to pending AntiEntropySessions and the repair hanging.
We are using Cassandra 1.1.12. We are not able to upgrade Cassandra right now.
We have restarted the nodes and started the repair again, but it still hangs.
We have observed that one CF, which has frequent reads and writes in the initial 3 regions and is active during the repair, fails to sync completely every time.
Is it necessary that there be no reads/writes on any CF while running repair?
Or can you suggest what the issue could be here?
Cassandra 1.1 is very old, so it's hard to remember the exact issues, but there were problems with streaming back then which could cause hangs. Some causes were things like a read timing out or a connection being reset. Since you are past 1.1.11, though, you're OK to try sub-range repairs.
Try to find a token range small enough that you can repair it in an hour (keep running smaller and smaller ranges until one completes), and set a timeout of a couple of hours. Expect some repairs to fail (time out), so just retry them until they complete. If you cannot get a range through after many retries, continue to make that sub-range smaller; even then it may have problems if you have a partition that is very wide (you can check with nodetool cfstats), which will make it much worse.
Once you get a completed repair, upgrade like crazy.
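A rough sketch of that retry loop; the tokens, keyspace name, and two-hour timeout are all placeholders to adapt:

```sh
#!/bin/sh
# Retry a small sub-range repair until it completes, with a hard timeout.
START_TOKEN="0"                                       # placeholder: pick from nodetool ring
END_TOKEN="28356863910078205288614550619314017621"   # placeholder
KEYSPACE="my_keyspace"

until timeout 7200 nodetool repair -st "$START_TOKEN" -et "$END_TOKEN" "$KEYSPACE"; do
    echo "repair of range $START_TOKEN..$END_TOKEN failed or timed out; retrying"
    sleep 60
done
```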

Cassandra Replicas Down during nodetool repair?

I am developing an automated script for nodetool repair which would execute every weekend on all 6 Cassandra nodes. We have 3 in DC1 and 3 in DC2. I just want to understand the worst-case scenario: what happens if connectivity between DC1 and DC2 is lost, or a couple of replicas go down, before or during a nodetool repair? It could be a network issue, a network upgrade (which usually happens on weekends), or something else. I understand that nodetool repair computes a Merkle tree for each range of data on the node and compares it with the versions on the other replicas. So if there is no connectivity between replicas, how will nodetool repair behave? Will it really repair the nodes? Do I have to rerun nodetool repair after all nodes are up and connectivity is restored? Will there be any side effects from this event? I googled it but couldn't find many details. Any insight would be helpful.
Thanks.
Let's say you are using vnodes, which by default means that each node has 256 token ranges; either way, the idea is the same.
If the network problem happens after nodetool repair has already started, you will see in the logs that some ranges were successfully repaired and others were not. The error will say that the range repair failed because node "192.168.1.1 is dead", or something like that.
If the network error happens before nodetool repair starts, all the ranges will fail with the same error.
In both cases you will need to run another nodetool repair after the network problem is solved.
I don't know the amount of data you have in those 6 nodes, but in my experience, if the cluster can handle it, it is better to run nodetool repair on a different node each day of the week. For instance you can repair node 1 on Sunday, node 2 on Monday, and so on (see the cron sketch below). If you have a small amount of data, or the adds/updates during a day are not too many, you can even run a repair once a day. On a cluster that is already repaired, running nodetool repair more often takes much less time to finish; but again, if you have too much data, that may not be possible.
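For example, a staggered schedule might look like this in cron; the day, time, user, and keyspace are placeholders:

```sh
# /etc/cron.d/cassandra-repair on node 1 (runs Sundays at 02:00);
# use dow=1 on node 2, dow=2 on node 3, and so on.
# m  h  dom mon dow  user       command
0    2  *   *   0    cassandra  nodetool repair -pr my_keyspace
```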
Regarding side effects, you will only notice a difference in the data if you use consistency level ONE: if a query happens to be served by the "unrepaired" node, the data may differ from the "repaired" nodes. You can mitigate this by increasing the consistency level to TWO, for instance; then again, if two nodes are "unrepaired" and your query is resolved using those two nodes, you will see a difference again. There is a trade-off here, since the best option to avoid this "difference" is to set the consistency level equal to the replication factor, which brings another problem: when one of the replicas is down, the queries that need it fail and you'll start receiving timeouts.
Hope it helps!
There are multiple repair options available; you can choose one depending on your application's usage. If you are using DSE Cassandra, I would recommend scheduling an OpsCenter repair, which does incremental repair over a duration less than gc_grace_seconds.
Following are different options of doing repair:
Default (none): repairs all 3 partition ranges owned by the node on which it was run: 1 primary and 2 replica ranges. A total of 5 nodes will be involved: 2 nodes fix 1 partition range, 2 nodes fix 2 partition ranges, and 1 node fixes 3 partition ranges.
-par: does the above operation in parallel.
-pr: fixes only the primary partition range for the node on which it was run. If you are using a write consistency of EACH_QUORUM, use the -local option as well to reduce cross-DC traffic.
I would suggest going with option 3 (-pr) if you are already live in production, to avoid any performance impact from repair (example invocations below).
If you want to read about repair in more detail, please have a look here.
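Roughly, the invocations for those options look like this (the keyspace name is a placeholder):

```sh
# 1. Default: sequential repair of every range the node owns (primary + replicas)
nodetool repair my_keyspace

# 2. The same operation, run in parallel
nodetool repair -par my_keyspace

# 3. Primary range only; add -local to keep traffic inside this node's DC
nodetool repair -pr my_keyspace
nodetool repair -pr -local my_keyspace
```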
