Sudden load spikes in Cassandra cluster

We recently started having problems with our Cassandra cluster, and maybe someone has ideas on how to fix this. We're running Cassandra 3.11.7 on a 40-node cluster with replication factor 3, and we read and write at consistency level QUORUM.
Recently, a single node experienced a sudden spike in CPU load which then lasted for a while. During that period, we observed a lot of dropped and queued MUTATIONs. If we restart Cassandra on the problematic node, one or two other nodes start to suffer from the same problem. We have examined log files and access patterns but have not yet been able to find the cause.
What could be the most common reasons for such behaviour? Where should we take a closer look? Has anyone already had similar experiences?

If we restart Cassandra on the problematic node, one or two other nodes start to suffer from the same problem.
First of all, when a single node has a problem, restarting it generally achieves nothing. If anything, you'll clear the JVM heap, which will be quickly repopulated upon startup. Seriously, don't expect restarting a node to fix anything.
Has anyone already had similar experiences?
Yes, several times. For things not Cassandra related:
Are you in a cloud environment? Run iostat and look for things like high percentages of iowait and steal. Sometimes shared resources don't play well with others. If you don't have iostat, get it (yum install -y sysstat). A quick sketch of these checks follows below.
Check cron for all users. We once had an issue with a file integrity checker getting installed as a part of our base image, and it did exactly what you are talking about.
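Assuming a RHEL-style host with root access, that sketch might look like this (the intervals and paths are only examples):
# extended device stats every 5 seconds, 3 samples; watch %iowait and %steal
iostat -x 5 3
# list cron entries for every user, plus the system-wide cron directories
for u in $(cut -d: -f1 /etc/passwd); do echo "== $u =="; crontab -l -u "$u" 2>/dev/null; done
ls /etc/cron.d /etc/cron.daily /etc/cron.hourly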
What could be the most common reasons for such behaviour? Where should we take a closer look?
For Cassandra related issues, I see a few possibilities:
Repairs. Check if the node is running a repair. You can see Merkle Tree calculations with nodetool compactionstats and repair streams with nodetool netstats.
Compactions. Check nodetool compactionstats. If this is it, you can try lowering your compaction throughput so that it doesn't affect normal operations.
Garbage Collection. Check the gc.log.* files. If it's GC, it can usually be fixed by reading up on and adjusting the GC settings. If there isn't anyone on your team who is a JVM GC expert, I recommend using G1GC as it removes a lot of the guesswork.
Do note that none of the things mentioned above will ever be fixed with a reboot. In fact, it's likely they'll pick right back up where they left off.
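To make those checks concrete, here is a rough sketch of what I'd run on the hot node (the throughput value is only an example, and the GC log path assumes a package install):
# is a repair building Merkle trees or streaming data?
nodetool compactionstats
nodetool netstats
# if compactions are the culprit, throttle them (value in MB/s)
nodetool setcompactionthroughput 16
# look for long GC pauses
grep -i "pause" /var/log/cassandra/gc.log* | tail -20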

Related

How do I bring back a Cassandra 2.0 node that's been down for a long time

We have a Cassandra 2.0.17 cluster with 3 DCs, where each DC has 8 nodes and RF of 3. We have not been running regular repairs on it.
One node has been down for 2 months due to a hardware issue with one of the drives.
We finally got a new drive to replace the faulty one, and are trying to figure out the best way to bring the node back into the cluster.
We initially thought to just run nodetool repair, but from my research so far it seems like that would only be good if the node was down for less than gc_grace_seconds, which is 10 days.
It seems like that would mean removing the node and then adding it back in as a new node.
Someone mentioned somewhere that rather than completely removing the node and then bootstrapping it back in, I could potentially use the same procedure used for replacing a node, using the replace_address flag (or replace_address_first_boot if available), to replace the node with itself. But I couldn't find any real documentation or case studies of doing this.
It seems like this is not a typical situation - normally, either a node goes down for a short period of time and you can just run repair on it, or it needs to be replaced altogether. But it's hard to find much prior art on our exact use case.
What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster?
Is repair really not a good option here?
Also, whatever the answer is, how would I monitor the process and ensure that it's successful?
So here's what I would do:
If you haven't already, run a removenode on the "dead" node's host ID.
Fire up the old node, making sure that it is not a seed node and that auto_bootstrap is either true or not specified (it defaults to true unless explicitly set otherwise).
It should join right back in, and re-stream its data.
You can monitor its progress by running nodetool netstats | grep Already, which returns the streaming status per node, showing completion progress as the number of files streamed vs. the total number of files.
The advantage of doing it this way is that the node will not attempt to serve requests until bootstrapping is completed.
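A rough command-level sketch of the steps above, assuming a systemd-managed package install (the host ID below is a placeholder; take the real one from nodetool status):
# on any live node: find and remove the dead node's old identity
nodetool status        # note the Host ID of the DN entry
nodetool removenode 11111111-2222-3333-4444-555555555555   # placeholder host ID
# on the returning node: make sure its IP is NOT in the seeds list and that
# auto_bootstrap is absent or true in cassandra.yaml, then start it
sudo systemctl start cassandra
# watch streaming progress
nodetool netstats | grep Already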
If you run into trouble, feel free to comment here or ask for help in the cassandra-admins channel on DataStax's Discord server.
You have already mentioned that you are aware the node has to be removed if it is down for more than gc_grace_seconds.
What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster? Is repair really not a good option here?
So that is the answer: you cannot safely bring that node back if it has been down for more than gc_grace_seconds. It needs to be removed to prevent deleted data from reappearing.
https://stackoverflow.com/a/69098765/429476
From https://community.datastax.com/questions/3987/one-of-my-nodes-powered-off.html
Erick Ramirez answered May 12, 2020 (edited Dec 03, 2021):
If the node has been down for less than the smallest gc_grace_seconds, it should be as simple as starting Cassandra on the node and then running a repair on it.
If the node has been down longer than the smallest GC grace, you will need to wipe the node clean including deleting all the contents of data/, commitlog/ and saved_caches/. Then replace the node "with itself" by adding the replace_address flag and specifying its own IP. For details, see Replacing a dead node. Cheers!
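For completeness, a hedged sketch of that "replace with itself" route, assuming default package paths and using 10.0.0.12 as a stand-in for the node's own IP:
# stop Cassandra and wipe the old state
sudo systemctl stop cassandra
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
# add the replace flag for the first boot only, then start the node
echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.12"' | sudo tee -a /etc/cassandra/cassandra-env.sh
sudo systemctl start cassandra
# remove that line from cassandra-env.sh once the node has finished bootstrapping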

Cassandra repairs on TWCS

We have a 13-node Cassandra cluster (version 3.10) with RF 2 and read/write consistency of 1.
This means that the cluster isn't fully consistent, but eventually consistent. We chose this setup to speed up performance, and we can tolerate a few seconds of inconsistency.
The tables are set up with TWCS with read repair disabled, and we don't run full repairs on them.
However, we've discovered that some entries of the data are replicated only once instead of twice, which means that when the node without the data is queried, it fails to retrieve the entry.
My first question is how could this happen? Shouldn't Cassandra replicate all the data?
Now, if we choose to perform repairs, they will create overlapping tombstones, so these won't be deleted when their time is up. I'm aware of the unchecked_tombstone_compaction property to ignore the overlap, but I feel like it's a bad approach. Any ideas?
So you've obviously made some deliberate choices regarding your client CL. You've opted to potentially sacrifice consistency for speed. You have achieved your goals, but you assumed that data would always make it to all of the other nodes in the cluster that it belongs to. There are no guarantees of that, as you have found out. How could that happen? There are multiple reasons, I'm sure, some of which include: network issues, hardware overload (I/O, CPU, etc., which can cause dropped mutations), cassandra/dse being unavailable for whatever reason, etc.
If none of your nodes have been "off-line" for at least a few hours (whether it be dse or the host being unavailable), I'm guessing your nodes are dropping mutations, and I would check two things:
1) nodetool tpstats
2) Look through your cassandra logs
For DSE: cat /var/log/cassandra/system.log | grep -i mutation | grep -i drop (and debug.log as well)
I'm guessing you're probably dropping mutations, and the cassandra logs and tpstats will record this (tpstats only shows counts since the last cassandra/dse restart). If you are dropping mutations, you'll have to try to understand why; typically it's some sort of load pressure causing it.
I have scheduled 1-second vmstat output that spools to a log continuously with log rotation, so I can go back and check a few things out if our nodes start misbehaving. It could help.
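If it helps, a minimal sketch of that vmstat spool (the log path and rotation are assumptions; procps vmstat's -t flag adds a timestamp to each line):
# continuous 1-second vmstat with timestamps, appended to a log
nohup vmstat -t 1 >> /var/log/vmstat-$(hostname).log 2>&1 &
# pair this with a logrotate rule so the file doesn't grow without bound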
That's where I would start. Either way, your decision to use read/write CL=1 has put you in this spot. You may want to reconsider that approach.
Consistency level 1 can sometimes create problems for many reasons: if data is not replicated across the cluster properly due to dropped mutations, cluster/node overload, high CPU, high I/O, or network problems, you can suffer data inconsistency. Read repair handles this problem some of the time, if it is enabled. You can go with a manual repair to ensure the consistency of the cluster, but in your case you can get some zombie data back too.
I think that to avoid this kind of issue you should consider a CL of at least QUORUM for writes, or you should run a manual repair within gc_grace_seconds (default is 10 days) for all the tables in the cluster.
Also, you can use incremental repair so that Cassandra runs repairs in the background on chunks of data. For more details you can refer to the links below; a short command sketch follows after them.
http://cassandra.apache.org/doc/latest/operating/repair.html or https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsRepair.html
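As a rough sketch of what those repairs look like from the command line (my_keyspace is a placeholder; in 3.x, omitting -full makes the repair incremental):
# full repair of this node's primary ranges only -- run it on every node in turn
nodetool repair -pr -full my_keyspace
# incremental repair (the 3.x default when -full is omitted)
nodetool repair -pr my_keyspace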

Is there a way to speed up Cassandra nodetool repair?

I have a 10-node Cassandra cluster, and the currently installed version is 3.0.13.
How I launched it: nodetool repair -j 4 -pr
I would like to know if there are some configuration options to speed up this process; I still see "Anticompaction after repair" in progress when I check compactionstats.
The current state-of-the-art way of doing repairs is subrange repairs running all the time. See http://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html for some explanation:
While the idea behind incremental repair is brilliant, the implementation still has flaws that can cause severe damage to a production cluster, especially when using LCS and DTCS. The improvements and fixes planned for 4.0 will need to be thoroughly tested to prove they fixed incremental repair and allow it to be safely used as a daily routine.
That being said (or quoted), have a look at http://cassandra-reaper.io/ - a simple and easy tool for managing your repairs.
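If you are curious what a manual subrange repair looks like before handing the job to Reaper, a hedged sketch (the keyspace name and token values are placeholders; Reaper computes and schedules these ranges for you):
# repair a single token subrange of one keyspace
nodetool repair -st -9223372036854775808 -et -9100000000000000000 my_keyspace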

Speed of hinted handoff in Cassandra

Given a particular set of configurations and a particular size of data to be written to a node, can we predict how much time the hinted handoff will take to finish?
In my case, as soon as the node came up, I checked using the 'nodetool statushandoff' command that the hinted handoff had started running. However, it seems to be running endlessly. So is there any way, by looking at the configuration, the size of the missing data, etc., to know that after this much time the missing data will have been written to the node?
You should be able to track the progress with some hint metrics. Have a look at this page: http://cassandra.apache.org/doc/latest/operating/metrics.html#hintedhandoff-metrics
The TotalHintsInProgress metric will tell you how big the backlog is, and TotalHints will tell you the number of hints written on the node since startup. So by tracking these two metrics you should be able to give an estimate (good or bad) of how far it has come.
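If you want to poll those two metrics directly over JMX, here is a sketch assuming the jmxterm uber-jar is available locally and JMX is listening on the default port 7199 (the bean names are from memory, so verify them against the metrics page above for your version):
# read the current hint backlog and the total hints written since startup
echo 'get -b org.apache.cassandra.metrics:type=Storage,name=TotalHintsInProgress Count' | java -jar jmxterm.jar -l localhost:7199 -n
echo 'get -b org.apache.cassandra.metrics:type=Storage,name=TotalHints Count' | java -jar jmxterm.jar -l localhost:7199 -n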

How to speed up the bootstrap of a single node

I have a single node Cassandra installation on my development machine (and very little experience with Cassandra). I always had very little data in the node and I experienced no problems. Today I inserted about 9,000 elements in a table to experiment with a real world use case. Now when I start up the node, the boot time is extremely long. I get this in system.log:
Replaying /var/lib/cassandra/commitlog/CommitLog-3-1388134836280.log
...
Log replay complete, 9274 replayed mutations
That took 13 minutes and is hardly bearable. I wonder if there is a way to store the data such that it can be read at once without replaying the log. After all, 9,000 elements are nothing and there must be a quicker way to boot. I googled for hints and searched Cassandra's documentation but I didn't find anything. It's obvious that I'm not looking for the right things; would anybody be so kind as to point me to the right documents? Thanks.
There are a few things that might help. The most obvious thing you can do is flush the commit log before you shut down Cassandra. This is a good idea to do in production too. Before I stop a Cassandra node in production I'll run the following commands:
nodetool disablethrift
nodetool disablegossip
nodetool drain
The first two commands gracefully shut down connections to clients connected to this node and then to other nodes in the ring. The drain command flushes memtables to disk (sstables). This should minimize what needs to be replayed on startup.
There are other factors that can make startup take a long time. Cassandra opens all the SSTables on disk at startup. So the more column families and SSTables you have on disk the longer it will take before a node is able to start serving clients. There was some work done in the 1.2 release to speed this up (so if you are not on 1.2 yet you should consider upgrading). Reducing the number of SSTables would probably improve your start time.
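If you want to check whether SSTable count is the issue, a small sketch (the keyspace/table names are placeholders; forcing a major compaction is heavy-handed, so treat it as a dev-machine shortcut):
# how many SSTables does each table have?
nodetool cfstats my_keyspace | grep -i "sstable count"
# collapse them into one (fine on a dev box; think twice in production)
nodetool compact my_keyspace my_table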
Since you mentioned this was a development machine I'll also give you my dev environment observations. On my development machine I do a lot of creating and dropping column families and key spaces. This can cause some of the system CFs to grow significantly and eventually cause a noticeable slowdown. The easiest way to handle this is to have a script that can quickly bootstrap a new database and blow away all the old data in /var/lib/cassandra.
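As a sketch, such a reset script might look like this (paths assume a default package install; schema.cql is a hypothetical file that recreates the keyspaces and tables):
nodetool drain || true
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
sudo service cassandra start
cqlsh -f schema.cql    # hypothetical schema file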
