Cassandra instability that causes timeouts on queries

Cassandra 3.11.1, 5-node cluster.
Everything worked well until yesterday.
But yesterday (without any visible cause) we started getting random Read/Write Timeout Exceptions. The same query can complete in 1 ms, then time out when repeated, then complete in 1 ms again, so the application cannot work.
I'm a developer, not an admin, but I started looking around in nodetool and checked tpstats and its Dropped section. This is what I see:
Message type Dropped
READ 396
RANGE_SLICE 485
_TRACE 496047
HINT 0
MUTATION 1139
COUNTER_MUTATION 0
BATCH_STORE 28
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 0
To me this is a sign that something is very wrong, but I cannot work out how to diagnose it in more detail, what the cause is, or how to fix it.
After some experiments we see that the timeouts occur when the token is owned by a certain node;
for example, select id from mytable where it = '<token from invalid node>' fails with a timeout every 5 runs.
Are there any suggestions?

Some diagnostics.
In the logs (system.log), 2 of the nodes were spamming each other with
2018-05-23 10:05:38,281 INFO [HintsDispatcher:33]
HintsDispatchExecutor.java:289 deliver Finished hinted handoff of file
c53d4133-c681-4903-8399-60dfd8fa786f-1526980061074-1.hints to endpoint
/111.11.11.111: c53d4133-c681-4903-8399-60dfd8fa786f, partially
many, many times.
After restarting one of these nodes, the hints were deleted and the situation returned to normal.
But there is still no information on why this happened or how to prevent it.
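To tell whether the Dropped counters from nodetool tpstats are still growing or are only historical totals, two snapshots can be compared. The following is a minimal sketch, assuming the output has been captured as text; the function names parse_dropped and growth are hypothetical helpers, not part of any Cassandra tooling.

```python
def parse_dropped(tpstats_text):
    """Return {message_type: dropped_count} from the 'Message type  Dropped' section."""
    dropped = {}
    in_section = False
    for line in tpstats_text.splitlines():
        parts = line.split()
        if parts[:2] == ["Message", "type"]:
            # Header row of the dropped-messages section.
            in_section = True
            continue
        if in_section and len(parts) >= 2 and parts[1].isdigit():
            dropped[parts[0]] = int(parts[1])
    return dropped


def growth(before, after):
    """Message types whose dropped counter increased between two snapshots."""
    return {
        name: count - before.get(name, 0)
        for name, count in after.items()
        if count > before.get(name, 0)
    }


# Sample taken from the tpstats output quoted in the question.
sample = """Message type Dropped
READ 396
RANGE_SLICE 485
_TRACE 496047
HINT 0
MUTATION 1139
"""

if __name__ == "__main__":
    print(parse_dropped(sample))
```

Running parse_dropped on two captures taken a few minutes apart and passing both results to growth shows which message types are being dropped right now.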

Related

MUTATION_REQ/RSP message keep being dropped by Cassandra Cluster

I have a Cassandra cluster in my development environment. Recently I did some write request testing and found many "MUTATION_REQ/RSP was dropped" messages in the log, as follows.
${source_ip}:7000->/${dest_ip}:7000-SMALL_MESSAGES-d21c2c3e dropping message of type MUTATION_RSP whose timeout expired before reaching the network
MUTATION_REQ messages were dropped in last 5000 ms: 0 internal and 110 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 3971 ms
I also found that there were more dropped MUTATION_REQ than MUTATION_RSP messages
(found with "nodetool tpstats"):
Latencies waiting in queue (micros) per dropped message types
Message type Dropped 50% 95% 99% Max
HINT_RSP 63 1131.752 2816.159 3379.391 3379.391
GOSSIP_DIGEST_SYN 0 1629.722 2816.159 4055.2690000000002 4055.2690000000002
HINT_REQ 4903 1955.666 1955.666 1955.666 1955.666
GOSSIP_DIGEST_ACK2 0 1131.752 2816.159 4055.2690000000002 4055.2690000000002
MUTATION_RSP 6146 1358.102 2816.159 3379.391 89970.66
MUTATION_REQ 450775 1358.102 3379.391 4866.323 4139110.981
My questions are:
Is it usual for a healthy cluster to have so many dropped MUTATION_REQ/RSP messages?
I assume MUTATION_RSP messages are dropped on the replica node and MUTATION_REQ on the coordinator node. Am I correct?
Thanks
I had the same issue and asked the same question on the Cassandra mailing list. Here is the answer:
First thing to check, do you have NTP client running on all Cassandra
servers? Are their clock in sync? If you answer "yes" to both, check
the server load, does any server have high CPU usage or disk
utilization? Any swapping activity? If not, check the GC logs, and
looking for long GC pauses.
My issue was wrong clock synchronization.
So, first things first: check the clock on each node with the timedatectl utility; the output should contain two entries, NTP enabled and NTP synchronized, both set to yes. If the clocks are out of sync, force synchronization on all nodes and then run a full repair on each affected node (I ran nodetool repair -full -pr). After that you should see radically fewer dropped MUTATION_* messages: from hundreds under heavy load down to one or two per day.
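The clock check described above can be scripted so that it is easy to run on every node. This is a rough sketch, assuming timedatectl prints "key: value" lines; the label "System clock synchronized" is an assumption about newer systemd versions, which may not use the "NTP synchronized" label quoted in the answer. The helper names are hypothetical.

```python
import subprocess


def read_timedatectl():
    """Run timedatectl on the local node and return its text output."""
    return subprocess.run(
        ["timedatectl"], capture_output=True, text=True, check=True
    ).stdout


def clock_status(timedatectl_output):
    """Parse 'key: value' lines into a dict, splitting on the first colon only."""
    status = {}
    for line in timedatectl_output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def is_synchronized(status):
    """True if the synchronization entry reports yes.

    Both labels below are assumptions about what your distribution prints;
    a missing entry is treated as not synchronized.
    """
    for key in ("NTP synchronized", "System clock synchronized"):
        if key in status:
            return status[key] == "yes"
    return False


if __name__ == "__main__":
    print(is_synchronized(clock_status(read_timedatectl())))
```

Running the script on each Cassandra server (for example over ssh) gives a quick yes/no per node before deciding whether a forced resync and repair are needed.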

Reaper failed to run repair on Cassandra nodes

After Reaper failed to run repair on 18 nodes of a Cassandra cluster, I ran a full repair of each node to fix the failed repair issue. After the full repair, Reaper executed successfully, but a few days later Reaper failed to run again. I can see the following error in system.log:
ERROR [RMI TCP Connection(33673)-10.196.83.241] 2021-09-01 09:01:18,005 RepairRunnable.java:276 - Repair session 81540931-0b20-11ec-a7fa-8d6977dd3c87 for range [(-606604147644314041,-98440495518284645], (-3131564913406859309,-3010160047914391044]] failed with error Terminate session is called
java.io.IOException: Terminate session is called
at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:191) ~[apache-cassandra-3.11.0.jar:3.11.0]
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
In nodetool tpstats I can see some pending tasks:
Pool Name Active Pending
ReadStage 0 0
Repair#18 3 90
ValidationExecutor 3 3
Also in nodetool compactionstats there are 4 pending tasks:
-bash-4.2$ nodetool compactionstats
pending tasks: 4
- Main.visit: 1
- Main.post: 1
- Main.stream: 2
My question is: why is Reaper still failing even after a full repair, and what is the root cause of the pending repairs?
PS: the Reaper version is 2.2.3; I am not sure whether this is a bug in Reaper.
You most likely don't have enough segments in your Reaper repair definition, or the default timeout (30 mins) is too low for your repair.
Segments (and the associated repair session) get terminated when they reach the timeout, in order to avoid stuck repairs. When tuned inappropriately, this can give the behavior you're observing.
Nodetool doesn't set a timeout on repairs, which explains why it passes there. The good news is that nothing will prevent repair from passing with Reaper once tuned correctly.
We're currently working on adaptive repairs to have Reaper deal with this situation automatically, but in the meantime you'll need to deal with this manually.
Check the list of segments in the UI and apply the following rule:
If fewer than 20% of segments are failing, double the timeout by adjusting the hangingRepairTimeoutMins value in the config yaml.
If more than 20% of segments are failing, double the number of segments.
Once repair passes at least twice, check the maximum duration of segments and further tune the number of segments to have them last at most 15 mins.
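The rule above can be written down as a small helper to make the decision explicit. This is only a sketch of the rule of thumb as stated, not anything Reaper does itself; the function name next_tuning is hypothetical, and treating exactly 20% as the "more than" case is an assumption, since the rule does not say what to do at the boundary.

```python
def next_tuning(total_segments, failed_segments, timeout_mins, segment_count):
    """Return (timeout_mins, segment_count) to try on the next repair run.

    Assumption: a failure ratio of exactly 20% is handled like "more than 20%".
    """
    if total_segments <= 0:
        raise ValueError("total_segments must be positive")
    if failed_segments == 0:
        # Nothing failed, so keep the current settings.
        return timeout_mins, segment_count
    failure_ratio = failed_segments / total_segments
    if failure_ratio < 0.20:
        # Few failures: give each segment more time (hangingRepairTimeoutMins).
        return timeout_mins * 2, segment_count
    # Many failures: make each segment smaller by doubling the segment count.
    return timeout_mins, segment_count * 2


if __name__ == "__main__":
    # 30 of 200 segments failed with the default 30 minute timeout.
    print(next_tuning(200, 30, 30, 200))
```

The returned timeout would go into hangingRepairTimeoutMins, and the returned segment count into the repair definition.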
Assuming you're not running Cassandra 4.0 yet, now that you have run repair through nodetool, you have sstables which are marked as repaired, as incremental repair would do. This creates a problem because Reaper's repairs don't mark sstables as repaired, so you now have two different sstable pools (repaired and unrepaired), which cannot be compacted together.
You'll need to use the sstablerepairedset tool to mark all sstables as unrepaired to put all sstables back in the same pool. Please read the documentation to learn how to achieve this.
There could be a number of things taking place such as Reaper can't connect to the nodes via JMX (for whatever reason). It isn't possible to diagnose the problem with the limited information you've provided.
You'll need to check the Reaper logs for clues on the root cause.
As a side note, this isn't related to repairs and is a client/driver/app connecting to the node on the CQL port:
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
Cheers!

Cassandra-2.2.3 : Repeatedly facing "writing large partition error" even after multiple repairs

We have a production Cassandra cluster with 2 datacenters of 6 nodes each. We keep encountering a large partition warning. We ran 2 successful repairs, but it is still not resolved. How can I analyze and fix this?
BigTableWriter.java:184 - Writing large partition system_distributed/repair_history:rf_key_space:my_table (108140638 bytes)
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 1171896
Mismatch (Blocking): 808
Mismatch (Background): 131
Pool Name Active Pending Completed
Large messages n/a 11 0
Small messages n/a 0 48881938
Gossip messages n/a 0 113659
The system_distributed.repair_history table is not one that you really need to concern yourself with. Unfortunately, this can happen when a lot of repairs have been run. With 2.2, the only real solution is to TRUNCATE that table every now and then.

decipher dropped mutations message

How do I read dropped mutations error messages? What do internal and cross node mean? For mutations the drops are cross-node, while for read_repair/read they are internal. What does it mean?
INFO [ScheduledTasks:1] 2019-07-21 11:44:46,150 MessagingService.java:1281 - MUTATION messages were dropped in last 5000 ms: 0 internal and 65 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 4966 ms
INFO [ScheduledTasks:1] 2019-07-19 05:01:10,620 MessagingService.java:1281 - READ_REPAIR messages were dropped in last 5000 ms: 9 internal and 8 cross node. Mean internal dropped latency: 6013 ms and Mean cross-node dropped latency: 8164 ms
Does internal mean local node operations?
In this case, does internal refer to the mutation response from the local node, and does cross node mean the time it took to get a response back from other nodes, depending on the consistency level chosen?
What do READ and _TRACE dropped messages mean? There is no tracing enabled on any node in the cluster. What are these _TRACE dropped messages?
INFO [ScheduledTasks:1] 2019-07-25 21:17:13,878 MessagingService.java:1281 - READ messages were dropped in last 5000 ms: 1 internal and 0 cross node. Mean internal dropped latency: 5960 ms and Mean cross-node dropped latency: 0 ms
INFO [ScheduledTasks:1] 2019-07-25 20:38:43,788 MessagingService.java:1281 - _TRACE messages were dropped in last 5000 ms: 5035 internal and 0 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 0 ms
How do I read dropped mutations error messages
You can use JMX to read new dropped messages but you will not be able to check the messages which are already dropped. You can enable JMX using this link.
In JMX these are accessible via org.apache.cassandra.net:MessagingService or org.apache.cassandra.metrics:DroppedMessage.
What does Read and _Trace dropped mutations mean?
Read messages are messages corresponding to your actual read request. A read request times out after read_request_timeout_in_ms. There is no point in servicing reads after that point, since an error would already have been returned to the client.
Trace is used for recording traces (nodetool settraceprobability). It has a special executor (1 thread, 1000 queue depth) that throws away messages on insertion instead of within the execute
Check your trace probability with nodetool gettraceprobability.
Does internal mean local node operations?
Yes, you are right. Internal means local node operations.
If you are seeing a lot of MUTATION drops, it means a lot of writes are failing due to timeouts, and you may need to check your Cassandra servers and add capacity if necessary.
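The log lines quoted in the question have a fixed shape, so the internal and cross-node figures can be pulled out with a regular expression. This is a minimal sketch based only on the log formats that appear on this page (including the older "for internal timeout" wording); other Cassandra versions may word the line differently, and the function name parse_dropped_line is hypothetical.

```python
import re

# Matches both "0 internal and 65 cross node." and
# "174 for internal timeout and 0 for cross node timeout."
DROPPED_LINE = re.compile(
    r"(?P<type>\w+) messages were dropped in last (?P<window_ms>\d+) ms: "
    r"(?P<internal>\d+) (?:for )?internal(?: timeout)? and "
    r"(?P<cross_node>\d+) (?:for )?cross node(?: timeout)?\. "
    r"Mean internal dropped latency: (?P<internal_latency_ms>\d+) ms and "
    r"Mean cross-node dropped latency: (?P<cross_node_latency_ms>\d+) ms"
)


def parse_dropped_line(line):
    """Return the fields of a dropped-messages log line, or None if it does not match."""
    match = DROPPED_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Everything except the message type is numeric.
    return {
        key: value if key == "type" else int(value)
        for key, value in fields.items()
    }


if __name__ == "__main__":
    line = (
        "INFO [ScheduledTasks:1] 2019-07-21 11:44:46,150 MessagingService.java:1281 - "
        "MUTATION messages were dropped in last 5000 ms: 0 internal and 65 cross node. "
        "Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 4966 ms"
    )
    print(parse_dropped_line(line))
```

Feeding every matching line of system.log through parse_dropped_line makes it easy to see, per message type, whether drops are mostly internal (local node) or cross node.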

Cassandra dropped mutations

7/15/17
5:13:49.602 PM
{"line":"*|1|INFO|m2mmc_bm_platform_c1_cassandra_cassandra5.ef78f1ec-665a-11e7-8113-0242432187ce|110|m2mmc|cassandra|cassandra|m2mmc/bm/platform/c1/cassandra|org.apache.cassandra.net.MessagingService:1048|2017/07/15 17:13:49.602|MUTATION messages were dropped in last 5000 ms: 174 for internal timeout and 0 for cross node timeout. Mean internal dropped latency: 2395 ms and Mean cross-node dropped latency: 0 ms","source":"stdout","tag":"f8a31682d73b"}
I got the above error on one of my nodes. Is this really a problem? Since we run repairs every week, the data should eventually be consistent.
I did not see any GC in system.log at that time.
Dropped mutations aren't typically a problem, as long as they're not occurring regularly. Your 2.3s latency suggests a GC pause, disk hang, or some other problem, but if it only happened this one time, repair/read repair will fix it, and you don't have anything to worry about. If it keeps happening, you should identify the cause.

Resources