No response from any of the 3 replicas - cassandra

I need help understanding why a query with ConsistencyLevel.ONE would receive no response from any of the 3 replicas (RF=3). This happens sporadically, but frequently enough to cause headaches.
I see the same thing occasionally when doing queries with QUORUM, which I could almost rationalize: maybe 2 of the 3 replicas are "down" doing compaction, GC, or something similar.
But with ONE, doesn't that mean all 3 replicas are "down"? What could cause them to be down? Could something else be going on?
Full disclosure: we're running 2.1.9 with 15 nodes in a single data center in GCE, but the nodes are in 3 different zones. As far as I know, this isn't due to a network partition among the zones, but anything is possible.
Please help me understand the possible causes for 0 replicas responding. Thanks in advance!
Here's an example of an exception:
cassandra.OperationTimedOut: errors={: ReadTimeout('Error from server: code=1200 [Coordinator node timed out waiting for replica nodes\' responses] message="Operation timed out - received only 0 responses." info={\'received_responses\': 0, \'required_responses\': 1, \'consistency\': \'ONE\'}',), : ConnectionException('Host has been marked down or removed',)}, last_host=10.240.0.31
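For anyone hitting the same thing, these are the basic checks that seem relevant (a rough sketch; the system.log path is a package-install default and an assumption here). A coordinator that briefly marks replicas down, or replicas silently dropping READ messages under load, can both produce exactly this error at ONE:
nodetool status
grep -E "is now DOWN|is now UP" /var/log/cassandra/system.log | tail -20   # look for flapping nodes
nodetool tpstats | grep -i -A 20 dropped                                   # dropped READ messages on replicas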

Related

MUTATION_REQ/RSP messages keep being dropped by Cassandra Cluster

I have a Cassandra cluster in my development environment. Recently I did some write request testing and found many "MUTATION_REQ/RSP was dropped" messages in the log, such as the following:
${source_ip}:7000->/${dest_ip}:7000-SMALL_MESSAGES-d21c2c3e dropping message of type MUTATION_RSP whose timeout expired before reaching the network
MUTATION_REQ messages were dropped in last 5000 ms: 0 internal and 110 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 3971 ms
I also found (via nodetool tpstats) that far more MUTATION_REQ than MUTATION_RSP messages were dropped:
Latencies waiting in queue (micros) per dropped message types
Message type         Dropped       50%        95%        99%                  Max
HINT_RSP                  63  1131.752   2816.159   3379.391             3379.391
GOSSIP_DIGEST_SYN          0  1629.722   2816.159   4055.2690000000002   4055.2690000000002
HINT_REQ                4903  1955.666   1955.666   1955.666             1955.666
GOSSIP_DIGEST_ACK2         0  1131.752   2816.159   4055.2690000000002   4055.2690000000002
MUTATION_RSP            6146  1358.102   2816.159   3379.391             89970.66
MUTATION_REQ          450775  1358.102   3379.391   4866.323             4139110.981
My questions are:
Is it usual for a healthy cluster to have so many dropped MUTATION_REQ/RSP messages?
I assume MUTATION_RSP messages are dropped on the replica node and MUTATION_REQ on the coordinator node. Am I correct?
Thanks
I had the same issue and asked the same question on the Cassandra mailing list; here is the answer:
First thing to check: do you have an NTP client running on all Cassandra servers? Are their clocks in sync? If you answer "yes" to both, check the server load: does any server have high CPU usage or disk utilization? Any swapping activity? If not, check the GC logs and look for long GC pauses.
My issue turned out to be clock synchronization.
So, first things first: check the clock on each node with the timedatectl utility; in its output, both "NTP enabled" and "NTP synchronized" should say "yes". If the clocks are out of sync, force synchronization on all nodes and then run a full repair on each affected node (I ran nodetool repair -full -pr). After that you should see radically fewer MUTATION_* drops, from hundreds under heavy load to one or two per day.
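For reference, a minimal sketch of the checks described above (the exact timedatectl field names vary by systemd version, and paths are assumptions; adapt to your environment):
timedatectl | grep -E "NTP enabled|NTP synchronized"   # both should say "yes" on every node
nodetool repair -full -pr                              # after fixing NTP, full primary-range repair on each affected node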

Reaper failed to run repair on Cassandra nodes

After Reaper failed to run repair on the 18 nodes of our Cassandra cluster, I ran a full repair on each node to fix the failed repair. After the full repair, Reaper executed successfully, but after a few days Reaper failed again. I can see the following error in system.log:
ERROR [RMI TCP Connection(33673)-10.196.83.241] 2021-09-01 09:01:18,005 RepairRunnable.java:276 - Repair session 81540931-0b20-11ec-a7fa-8d6977dd3c87 for range [(-606604147644314041,-98440495518284645], (-3131564913406859309,-3010160047914391044]] failed with error Terminate session is called
java.io.IOException: Terminate session is called
at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:191) ~[apache-cassandra-3.11.0.jar:3.11.0]
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
In nodetool tpstats I can see some pending tasks:
Pool Name            Active   Pending
ReadStage                 0         0
Repair#18                 3        90
ValidationExecutor        3         3
Also in nodetool compactionstats there are 4 pending tasks:
-bash-4.2$ nodetool compactionstats
pending tasks: 4
- Main.visit: 1
- Main.post: 1
- Main.stream: 2
My question is: why is Reaper still failing even after a full repair, and what is the root cause of the pending repairs?
PS: the Reaper version is 2.2.3; I'm not sure if this is a bug in Reaper!
You most likely don't have enough segments in your Reaper repair definition, or the default timeout (30 mins) is too low for your repair.
Segments (and the associated repair session) get terminated when they reach the timeout, in order to avoid stuck repairs. When tuned inappropriately, this can give the behavior you're observing.
Nodetool doesn't set a timeout on repairs, which explains why it passes there. The good news is that nothing will prevent repair from passing with Reaper once tuned correctly.
We're currently working on adaptive repairs to have Reaper deal with this situation automatically, but in the meantime you'll need to deal with this manually.
Check the list of segments in the UI and apply the following rules:
If you have fewer than 20% of segments failing, double the timeout by adjusting the hangingRepairTimeoutMins value in the config yaml (see the sketch after this list).
If you have more than 20% of segments failing, double the number of segments.
Once repair passes at least twice, check the maximum duration of segments and further tune the number of segments to have them last at most 15 mins.
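As a rough illustration of the timeout rule above (the config path is a typical package-install location and an assumption; the value shown is the default):
grep hangingRepairTimeoutMins /etc/cassandra-reaper/cassandra-reaper.yaml
# hangingRepairTimeoutMins: 30    <- double this (e.g. to 60) and restart Reaper
The segment count, on the other hand, is typically part of the repair run or schedule definition, so adjust it in the Reaper UI when you recreate the run.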
Assuming you're not running Cassandra 4.0 yet, now that you've run repair through nodetool, you have sstables that are marked as repaired, just as incremental repair would do. This creates a problem because Reaper's repairs don't mark sstables as repaired, so you now have two different sstable pools (repaired and unrepaired) which cannot be compacted together.
You'll need to use the sstablerepairedset tool to mark all sstables as unrepaired to put all sstables back in the same pool. Please read the documentation to learn how to achieve this.
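A rough sketch of what that involves (keyspace, table and data paths are placeholders, and the node must be stopped while sstablerepairedset runs):
# with Cassandra stopped on the node:
find /var/lib/cassandra/data/my_keyspace/my_table-*/ -name "*Data.db" > /tmp/sstables.txt
sstablerepairedset --really-set --is-unrepaired -f /tmp/sstables.txt
# restart Cassandra, then spot-check with: sstablemetadata <sstable> | grep "Repaired at"   (0 means unrepaired)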
There could be a number of things taking place, such as Reaper not being able to connect to the nodes via JMX (for whatever reason). It isn't possible to diagnose the problem with the limited information you've provided.
You'll need to check the Reaper logs for clues on the root cause.
As a side note, this isn't related to repairs and is a client/driver/app connecting to the node on the CQL port:
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
Cheers!

Cassandra 3.10 debug.log contains frequent "FailureDetector.java:457 - Ignoring interval time of..."

The debug.log files for one of our Cassandra 3.10 clusters contain frequent messages similar to "FailureDetector.java:457 - Ignoring interval time of..."
The messages appear even if the cluster is idle. I see the messages at a rate of about 1 per second on each node of this 6 node cluster (3 nodes each in two data centers).
Can someone tell me what causes the messages and if they are something to be concerned about?
We have a couple of other small clusters supporting the same application (different environments) and I see this message much less often (days apart).
The FailureDetector is responsible for deciding whether a node is considered UP or DOWN.
The gossip process tracks state from other nodes both directly (nodes gossiping directly to it) and indirectly (nodes communicated about secondhand, third-hand, and so on). Rather than have a fixed threshold for marking failing nodes, Cassandra uses an accrual detection mechanism to calculate a per-node threshold that takes into account network performance, workload, and historical conditions. During gossip exchanges, every node maintains a sliding window of inter-arrival times of gossip messages from other nodes in the cluster.
Here you can find the source code which produces the log message. It is logged at DEBUG level because these messages may be helpful in tracking down the actual issue causing the latency, but they don't indicate a problem on their own.
In other words: your node measures the acknowledgement latency for each gossip message sent to the other nodes, e.g. X nanoseconds for IP address1, Z nanoseconds for IP address2, etc. If either X or Z is above the expected 2-second threshold defined by MAX_INTERVAL_IN_NANO, it gets reported.
Problems that can cause this log message:
Huge load on the node(s): e.g. too many large partitions
High pressure: e.g. too many queries in a short period of time
Bad network connection
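A quick way to look for these causes on each node (a hedged sketch; the exact tools depend on your OS and the node addresses are placeholders):
top -b -n 1 | head -15        # sustained high CPU?
iostat -x 1 3                 # saturated disks?
free -m                       # any swap in use is a red flag for Cassandra
ping -c 10 <other_node_ip>    # latency/packet loss between nodes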
The extra FailureDetector logging was added with this ticket:
Expose phi values from failure detector via JMX and tweak debug and trace logging (CASSANDRA-9526)
I also found this open issue, which might be related to your problem:
The failure detector becomes more sensitive when the network is flakey (CASSANDRA-9536)
Finally, I found this article about Gossiping and Failure Detection very useful.

Readtimeout when using User Defined Function in Cassandra

We have a single node Cassandra Cluster (Apache) with 2 vCPUs and around 16 GB RAM on AWS. We have around 28 GB of data uploaded into Cassandra.
Now, Cassandra works fine for select and group-by queries using primary keys; however, when we use User Defined Functions to run aggregate functions on a non-primary-key column, it gives a timeout.
To elaborate: we partition on Year, Month and Date for 3 years of data. For example, if two of the columns are Bill_ID and Bill_Amount, we want to sum Bill_Amount by Bill_ID using a UDF.
I'm kind of confused here: if the info says it has received 1 response and only 1 is required, why does it still report a timeout? And why do we only get a timeout when using User Defined Functions?
ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 1 responses." info={'received_responses': 1, 'required_responses': 1, 'consistency': 'ONE'}
We have increased the read timeouts in the yaml file to as high as 10 minutes.
Edit: adding a screenshot of the query, showing the results before setting --request-timeout and, after that, using the UDF. The table has 150 million rows with over 1095 partitions for just 3 years of data, with the primary key columns being year, day and month.
Try increasing the timeouts on the client side as well, for example with cqlsh:
cqlsh --request-timeout=3600
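If the server-side timeouts were already raised as described in the question, it's also worth confirming which values are actually in effect (a sketch; the yaml path is the usual package-install default and an assumption here). An aggregate over a non-key column touches every partition, so it runs as range reads, meaning range_request_timeout_in_ms matters as much as the plain read timeout:
grep -E "read_request_timeout_in_ms|range_request_timeout_in_ms" /etc/cassandra/cassandra.yaml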

Cassandra throwing NoHostAvailableException after 5 minutes of high IOPS run

I'm using the DataStax Cassandra 2.1 driver and performing read/write operations at a rate of ~8000 IOPS. I've used pooling options to configure my sessions and am using separate sessions for reads and writes, each of which connects to a different node in the cluster as its contact point.
This works fine for, say, 5 minutes, but after that I get a lot of exceptions like:
Failed with: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.0.1.123:9042 (com.datastax.driver.core.TransportException: [/10.0.1.123:9042] Connection has been closed), /10.0.1.56:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))
Can anyone help me out here on what could be the problem?
The exception asks me to increase the number of connections per host, but how high can I set this parameter?
Also, I'm not able to set CoreConnectionsPerHost beyond 2, as it throws an exception saying 2 is the max.
This is how I'm creating each read/write session:
PoolingOptions poolingOpts = new PoolingOptions();
poolingOpts.setCoreConnectionsPerHost(HostDistance.REMOTE, 2);
poolingOpts.setMaxConnectionsPerHost(HostDistance.REMOTE, 200);
poolingOpts.setMaxSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 128);
poolingOpts.setMinSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 2);

cluster = Cluster.builder()
        .withPoolingOptions(poolingOpts)
        .addContactPoint(ip)
        .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
        .withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
        .build();
Session s = cluster.connect(keySpace);
Your problem might not actually be in your code or the way you are connecting. If the problem is happening after a few minutes, it could simply be that your cluster is becoming overloaded trying to process the ingested data and cannot keep up. The typical sign of this is when you start seeing JVM garbage collection ("GC") messages in the Cassandra system.log file; too many small ones batched together, or large ones on their own, can mean that incoming clients are not being responded to, causing this kind of scenario. Verify that you do not have too many of these events showing up in your logs before you start looking at your code. Here's a good example of a large GC event:
INFO [ScheduledTasks:1] 2014-05-15 23:19:49,678 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 2896 ms for 2 collections, 310563800 used; max is 8375238656
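A quick way to check how often these show up (a sketch, assuming the default log location):
grep -c "GCInspector" /var/log/cassandra/system.log          # how many GC events have been logged
grep "GCInspector" /var/log/cassandra/system.log | tail -5   # are recent pauses in the multi-second range?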
When connecting to a cluster there are some recommendations, one of which is to have only one Cluster object per physical cluster. As per the article linked below (apologies if you've already studied this):
Use one cluster instance per (physical) cluster (per application lifetime)
Use at most one session instance per keyspace, or use a single Session and explicitly specify the keyspace in your queries
If you execute a statement more than once, consider using a prepared statement
You can reduce the number of network roundtrips and also have atomic operations by using batches
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/fourSimpleRules.html
As you are doing a high number of reads, I'd definitely recommend using setFetchSize as well, if it's applicable to your code:
http://www.datastax.com/documentation/developer/java-driver/2.1/common/drivers/reference/cqlStatements.html
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/reference/queryBuilderOverview.html
For reference, here are the connection options in case you find them useful:
http://www.datastax.com/documentation/developer/java-driver/2.1/common/drivers/reference/connectionsOptions_c.html
Hope this helps.
