MUTATION_REQ/RSP message keep being dropped by Cassandra Cluster - cassandra

I have a Cassandra cluster on my development environment. Recently I did some write request testing, and found many "MUTATION_REQ/RSP was dropped" message from the log as follows.
${source_ip}:7000->/${dest_ip}:7000-SMALL_MESSAGES-d21c2c3e dropping message of type MUTATION_RSP whose timeout expired before reaching the network
MUTATION_REQ messages were dropped in last 5000 ms: 0 internal and 110 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 3971 ms
I also found there were more dropped MUTATION_REQ than MUTATION_RSP:
(I found that with "nodetool tpstats")
Latencies waiting in queue (micros) per dropped message types
Message type Dropped 50% 95% 99% Max
HINT_RSP 63 1131.752 2816.159 3379.391 3379.391
GOSSIP_DIGEST_SYN 0 1629.722 2816.159 4055.2690000000002 4055.2690000000002
HINT_REQ 4903 1955.666 1955.666 1955.666 1955.666
GOSSIP_DIGEST_ACK2 0 1131.752 2816.159 4055.2690000000002 4055.2690000000002
MUTATION_RSP **6146** 1358.102 2816.159 3379.391 89970.66
MUTATION_REQ **450775** 1358.102 3379.391 4866.323 4139110.981
My questions are:
Is it usual for a health cluster to have so many dropped MUTATION_REQ/RSP?
I supposed MUTATION_RSP were dropped on replica node, and MUTATION_REQ on coordinator node. am I correct?
Thanks

I had same issues and asked same question on Cassandra mailing list, here answer:
First thing to check, do you have NTP client running on all Cassandra
servers? Are their clock in sync? If you answer "yes" to both, check
the server load, does any server have high CPU usage or disk
utilization? Any swapping activity? If not, check the GC logs, and
looking for long GC pauses.
Mine issue was wrong clock synchronization.
So, first things first, check clock on each node with timedatectl utility, in output you should have two entries NTP enabled and NTP synchronized with yes. In case clock out of sync, force synchronization on all nodes and then make full repair on each affected node (I made nodetool -full -pr). After that you should radically less messages with MUTATION_* drops - from hundreds on heavy load to one-two per day.

Related

Reaper failed to run repair on Cassandra nodes

After Reaper failed to run repair on 18 nodes of Cassandra cluster, I ran a full repair of each node to fix the failed repair issue, after the full repair, Reaper executed successfully, but after a few days again the Reaper failed to run, I can see the following error in system.log
ERROR [RMI TCP Connection(33673)-10.196.83.241] 2021-09-01 09:01:18,005 RepairRunnable.java:276 - Repair session 81540931-0b20-11ec-a7fa-8d6977dd3c87 for range [(-606604147644314041,-98440495518284645], (-3131564913406859309,-3010160047914391044]] failed with error Terminate session is called
java.io.IOException: Terminate session is called
at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:191) ~[apache-cassandra-3.11.0.jar:3.11.0]
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
in nodetool tpstats I can see some pending tasks
Pool Name Active Pending
ReadStage 0 0
Repair#18 3 90
ValidationExecutor 3 3
Also in nodetool compactionstats there are 4 pending tasks:
-bash-4.2$ nodetool compactionstats
pending tasks: 4
- Main.visit: 1
- Main.post: 1
- Main.stream: 2
My question is why even after a full repair, reaper is still failing? and what is the root cause of pending repair?
PS: version of Reaper is 2.2.3, not sure if it is a bug in Reaper!
You most likely don't have enough segments in your Reaper repair definition, or the default timeout (30 mins) is too low for your repair.
Segments (and the associated repair session) get terminated when they reach the timeout, in order to avoid stuck repairs. When tuned inappropriately, this can give the behavior you're observing.
Nodetool doesn't set a timeout on repairs, which explains why it passes there. The good news is that nothing will prevent repair from passing with Reaper once tuned correctly.
We're currently working on adaptive repairs to have Reaper deal with this situation automatically, but in the meantime you'll need to deal with this manually.
Check the list of segments in the UI and apply the following rule:
If you have less than 20% of segments failing, double the timeout by adjusting the hangingRepairTimeoutMins value in the config yaml.
If you have more than 20% of segments failing, double the number of segments.
Once repair passes at least twice, check the maximum duration of segments and further tune the number of segments to have them last at most 15 mins.
Assuming you're not running Cassandra 4.0 yet, now that you ran repair through nodetool, you have sstables which are marked as repaired like incremental repair would. This will create a problem as Reaper's repairs don't mark sstables as repaired and you now have two different sstables pools (repaired and unrepaired), which cannot be compacted together.
You'll need to use the sstablerepairedset tool to mark all sstables as unrepaired to put all sstables back in the same pool. Please read the documentation to learn how to achieve this.
There could be a number of things taking place such as Reaper can't connect to the nodes via JMX (for whatever reason). It isn't possible to diagnose the problem with the limited information you've provided.
You'll need to check the Reaper logs for clues on the root cause.
As a side note, this isn't related to repairs and is a client/driver/app connecting to the node on the CQL port:
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
Cheers!

Cassandra-2.2.3 : Repeatedly facing "writing large partition error" even after multiple repairs

We have a 6 node each 2 datacenter Cassandra cluster production environment setup. We encounter large partition warning. We ran 2 successful repairs, still this is not getting resolved. How can I analyze and fix this?
BigTableWriter.java:184 - Writing large partition system_distributed/repair_history:rf_key_space:my_table (108140638 bytes)
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 1171896
Mismatch (Blocking): 808
Mismatch (Background): 131
Pool Name Active Pending Completed
Large messages n/a 11 0
Small messages n/a 0 48881938
Gossip messages n/a 0 113659
The system_distributed.repair_history table is not one that you really need to concern yourself with. Unfortunately, this can happen when a lot of repairs have been run. With 2.2, the only real solution is to TRUNCATE that table every now and then.

Cassandra dropped mutations

7/15/17
5:13:49.602 PM
{"line":"*|1|INFO|m2mmc_bm_platform_c1_cassandra_cassandra5.ef78f1ec-665a-11e7-8113-0242432187ce|110|m2mmc|cassandra|cassandra|m2mmc/bm/platform/c1/cassandra|org.apache.cassandra.net.MessagingService:1048|2017/07/15 17:13:49.602|MUTATION messages were dropped in last 5000 ms: 174 for internal timeout and 0 for cross node timeout. Mean internal dropped latency: 2395 ms and Mean cross-node dropped latency: 0 ms","source":"stdout","tag":"f8a31682d73b"}
Show syntax highlighted
I got the above error for one of my nodes .Is this really a problem? Since we run repairs every week the data should be consistent eventually.
I did not see any GC in system.log at that time.
Dropped mutations aren't typically a problem, as long as they're not occurring regularly. Your 2.3s latency suggests a GC pause, disk hang, or some other problem, but if it only happened this one time, repair/read repair will fix it, and you don't have anything to worry about. If it keeps happening, you should identify the cause.

High disk I/O on Cassandra nodes

Setup:
We have 3 nodes Cassandra cluster having data of around 850G on each node, we have LVM setup for Cassandra data directory (currently consisting 3 drives 800G + 100G + 100G) and have separate volume (non LVM) for cassandra_logs
Versions:
Cassandra v2.0.14.425
DSE v4.6.6-1
Issue:
After adding 3rd (100G) volume in LVM on each of the node, all the nodes went very high in disk I/O and they go down quite often, servers also become inaccessible and we need to reboot the servers, servers don't get stable and we need to reboot after every 10 - 15 mins.
Other Info:
We have DSE recommended server settings (vm.max_map_count, file descriptor) configured on all nodes
RAM on each node : 24G
CPU on each node : 6 cores / 2600MHz
Disk on each node : 1000G (Data dir) / 8G (Logs)
As I suspected, you are having throughput problems on your disk. Here's what I looked at to give you background. The nodetool tpstats output from your three nodes had these lines:
Pool Name Active Pending Completed Blocked All time blocked
FlushWriter 0 0 22 0 8
FlushWriter 0 0 80 0 6
FlushWriter 0 0 38 0 9
The column I'm concerned about is the All Time Blocked. As a ratio to completed, you have a lot of blocking. The flushwriter is responsible for flushing memtables to the disk to keep the JVM from running out of memory or creating massive GC problems. The memtable is an in-memory representation of your tables. As your nodes take more writes, they start to fill and need to be flushed. That operation is a long sequential write to disk. Bookmark that. I'll come back to it.
When flushwriters are blocked, the heap starts to fill. If they stay blocked, you will see the requests starting to queue up and eventually the node will OOM.
Compaction might be running as well. Compaction is a long sequential read of SSTables into memory and then a long sequential flush of the merge sorted results. More sequential IO.
So all these operations on disk are sequential. Not random IOPs. If your disk is not able to handle simultaneous sequential read and write, IOWait shoots up, requests get blocked and then Cassandra has a really bad day.
You mentioned you are using Ceph. I haven't seen a successful deployment of Cassandra on Ceph yet. It will hold up for a while and then tip over on sequential load. Your easiest solution in the short term is to add more nodes to spread out the load. The medium term is to find some ways to optimize your stack for sequential disk loads, but that will eventually fail. Long term is get your data on real disks and off shared storage.
I have told this to consulting clients for years when using Cassandra "If your storage has an ethernet plug, you are doing it wrong" Good rule of thumb.

Cassandra upgrade from 2.0.12.200 to 2.1.5.469 upgradesstables command seems stuck

As part of the upgrade of Datastax Entreprise from 4.6.1 to 4.7, I am upgrading Cassandra from 2.0.12.200 to 2.1.5.469. The overall DB size is about 27GB.
One step in the upgrade process is running nodetool upgradesstables on a seed node. However this command has been running for more than 48 hours now and looking at CPU and IO, this node seems to be effectively idle.
This is the current output from nodetool compactionstats. The numbers don't seem to change over the course of a few hours :
pending tasks: 28
compaction type keyspace table completed total unit progress
Upgrade sstables prod_2_0_7_v1 image_event_index 8731948551 17452496604 bytes 50.03%
Upgrade sstables prod_2_0_7_v1 image_event_index 9121634294 18710427035 bytes 48.75%
Active compaction remaining time : 0h00m00s
And this is the output from nodetool netstats:
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name Active Pending Completed
Commands n/a 0 0
Responses n/a 0 0
My question is, is this normal behavior? Is there a way for me to check if this command is stuck or if it is actually still doing work?
If it is stuck, what's the best course of action?
For what it's worth this node is an M3-Large instance on Amazon EC2 with all the data on the local ephemeral disk (SSD).

Resources