We are seeing client-side timeouts on a 6-node Cassandra 2.1.15 cluster.
Looking at those timeouts in detail, we see that they arise only when two particular nodes act as coordinators (the native transport is enabled on all 6 nodes).
Here is the JMX information about coordinator latency on a healthy node:
Here is the JMX information about coordinator latency on one of the problematic nodes:
Look at the 999thPercentile: about 40 milliseconds on the healthy node versus 5.6 seconds on the problematic node.
5.6 seconds is significantly higher than the client timeout.
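As a hypothetical illustration of how percentiles like these can be read per node, here is a minimal JMX sketch against the coordinator-level ClientRequest Read Latency MBean, assuming the default JMX port 7199, no JMX authentication, and placeholder host names:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CoordinatorLatencyCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host names; replace with the six nodes of the cluster.
        String[] hosts = {"node1", "node2", "node3", "node4", "node5", "node6"};
        // Coordinator-level read latency metric (the 2.1 latency timers report microseconds).
        ObjectName readLatency = new ObjectName(
                "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency");
        for (String host : hosts) {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                Object p999 = mbs.getAttribute(readLatency, "999thPercentile");
                System.out.println(host + " read 999thPercentile: " + p999);
            } finally {
                connector.close();
            }
        }
    }
}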
We will certainly try to decommission and replace the two nodes that cause the timeouts, but we would first like to understand what is happening.
What could explain such different behaviour between coordinators? Any idea what could fix this?
While running nodetool decommission, I want to use 100% of my network, so I set nodetool setstreamthroughput 0. At the beginning, since the decommissioning node streams to multiple nodes, it can send data at about 900 Mbps. Later, as the number of receiving nodes goes down, it only sends at around 300 Mbps.
I see that the node sends one SSTable to one node at a time. I want to increase the parallelism, but nodetool reports one connection per host. How can I change this setting, i.e. get multiple connections per host, while streaming?
Most likely Cassandra 3.0 will not be able to utilize 100% of your network regardless of how you configure it. Even with multiple threads you push up against the point where the allocation rate of objects generated by streaming exceeds what the JVM can clean up, and the resulting GC pauses mean you only get 100% for short bursts. It's somewhat moot, though, as you cannot configure it to use more threads.
In Cassandra 4.0 you will be able to achieve this: http://cassandra.apache.org/blog/2018/08/07/faster_streaming_in_cassandra.html
I have a Cassandra cluster of multiple nodes. When I run nodetool gossipinfo, I see that one node has an RPC_READY value different from the others; all the other nodes share the same value. Can anyone explain what this property is, and whether it is a problem if the value is different on one node? I am using Cassandra version 2.2.8.
I would appreciate the response.
Before 2.2, when a node came up, it was broadcast to all the other nodes that it is now in an UP state. This sometimes happened before CQL was ready. The drivers listen for events such as state changes, so when the node went up they would try to connect to it.
If they tried before CQL was ready, the connection would fail and trigger a backoff, which greatly increased the time to connect to newly up nodes. This caused the driver's state for the node to flip from UP to DOWN with a bunch of log spam. RPC_READY is a state that tracks whether the node is actually ready for drivers to connect to. The JIRA where it was added is here. In current versions at least (I haven't looked at 2.2), RPC_READY can also change to false when the node is being shut down (drained) or decommissioned.
We recently deployed microservices into production, and these microservices communicate with the Cassandra nodes for reads and writes.
After the deployment, we started noticing a sudden drop in CPU to 0 on all Cassandra nodes in the primary DC. This happens at least once per day. Each time it happens, we see that two random nodes (in the SAME DC) cannot reach each other (nodetool describecluster), and nodetool tpstats shows those two nodes with a high number of ACTIVE Native-Transport-Requests, between 100 and 200. The two nodes are also storing HINTS for each other, but when I run longer pings between them I don't see any packet loss. When we restart those two Cassandra nodes, the issue is fixed for the moment. This has been happening for two weeks.
We use Apache Cassandra 2.2.8.
Also, the microservice logs show read/write timeouts before the sudden CPU drop on all Cassandra nodes.
You might be using a token-aware load balancing policy on the client and updating a single partition or token range heavily, in which case all of the coordination load is focused on a single replica set. You can change your application to use a RoundRobin (or DC-aware round robin) LoadBalancingPolicy, which will likely resolve it. If it does, you have a hotspot in your application and you may want to revisit your data model.
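For illustration, a minimal sketch of switching the client to a DC-aware round robin policy, assuming the DataStax Java driver 2.x; the contact point and the local DC name ("DC1") are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

public class ClientSetup {
    public static Session connect() {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")  // placeholder contact point
                // Spread coordination across all local-DC nodes instead of
                // routing every request to the replicas of the hot partition.
                .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("DC1"))
                .build();
        return cluster.connect();
    }
}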
It does look like a data model problem (hot partitions putting pressure on specific replicas).
But in any case you might want to add the following to your cassandra-env.sh to see if it helps:
JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=1024"
More information about this here: https://issues.apache.org/jira/browse/CASSANDRA-11363
Using Cassandra 2.2.8 with GossipingPropertyFileSnitch.
I'm repairing a node and compacting a large number of SSTables. To alleviate the load on that node's CPU, I want to route incoming web traffic to the other nodes in the cluster.
Could you please share how I can route client traffic to the other nodes in the cluster so this node can keep its CPU for the major maintenance work?
thanks in advance
Provided you have a replication factor and consistency level that can handle a node being down, you can take the node out of client traffic during the compactions:
nodetool disablebinary
nodetool disablethrift
This prevents your client application from sending requests to the node and from it acting as a coordinator, but it will still receive the mutations from writes so it won't fall behind. If you want to reduce the load further you can remove it from the cluster entirely with
nodetool disablebinary
nodetool disablethrift
nodetool disablegossip
But make sure you enable gossip again before max_hint_window_in_ms (defined in cassandra.yaml, default 3 hours) elapses. If you don't, the hints for that node will expire and never be delivered, leading to a consistency issue that will only be resolved by a repair.
Once you reconnect, wait for the pending and active hints to drop to 0 before disabling gossip again. Note: pending will always show +1 because of a regularly scheduled task, so wait for 1, not 0.
You can check the hint thread pool with OpsCenter, nodetool tpstats, or via JMX with org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=HintedHandoff,name=PendingTasks and org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=HintedHandoff,name=ActiveTasks
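As an illustration of that wait, here is a small sketch that polls those two gauges over JMX until the hints have drained, assuming the default JMX port 7199 and no JMX authentication (the host name is a placeholder):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class WaitForHints {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";  // node to watch (placeholder)
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        ObjectName pending = new ObjectName(
                "org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=HintedHandoff,name=PendingTasks");
        ObjectName active = new ObjectName(
                "org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=HintedHandoff,name=ActiveTasks");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            while (true) {
                Number p = (Number) mbs.getAttribute(pending, "Value");
                Number a = (Number) mbs.getAttribute(active, "Value");
                System.out.println("HintedHandoff pending=" + p + " active=" + a);
                // The regularly scheduled hint task keeps pending at 1, so treat <= 1 as drained.
                if (p.longValue() <= 1 && a.longValue() == 0) {
                    break;
                }
                Thread.sleep(10000);  // poll every 10 seconds
            }
        } finally {
            connector.close();
        }
    }
}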
We are using a 3-node Cassandra cluster (each node on a different VM) and are currently investigating failover times during write and read operations when one of the nodes dies.
Failover times are pretty good when shutting down one node gracefully; however, when killing a node (by shutting down the VM), the latency during the tests is about 12 seconds. I guess this has something to do with a TCP timeout?
Is there any way to tweak this?
Edit:
At the moment we are using Cassandra Version 2.0.10.
We are using the java client driver, version 2.1.9.
To describe the situation in more detail:
The write/read operations are performed with QUORUM consistency level and a replication factor of 3. The cluster consists of 3 nodes (c1, c2, c3), each on a different host (VM). The client driver is connected to c1. During the tests I shut down the host for c2. From then on we observe that the client blocks for more than 12 seconds, until the other nodes realize that c2 is gone. So I think this is not a client issue, since the client is connected to node c1, which is still running in this scenario.
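For context, the client setup described above looks roughly like the following sketch, assuming the DataStax Java driver 2.1.x (the contact point name "c1" is taken from the description, everything else is a placeholder):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;

public class TestClient {
    public static Session connect() {
        Cluster cluster = Cluster.builder()
                .addContactPoint("c1")  // the node the driver connects to in the test
                // All reads and writes in the test run at QUORUM (RF = 3).
                .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.QUORUM))
                .build();
        return cluster.connect();
    }
}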
Edit: I don't believe that running Cassandra inside a VM affects the network stack. In fact, killing the VM means the TCP connections are not terminated; in that case a remote host can only notice the failure through some timeout mechanism (either a timeout in the application-level protocol or a TCP timeout).
If the process is killed at the OS level, the OS's TCP stack will take care of terminating the TCP connection (IMHO with a TCP reset), so a remote host is notified of the failure immediately.
However, it might be important that even in situations where a host crashes due to a hardware failure, or is disconnected because of an unplugged network cable (in both cases the TCP connection is not terminated immediately), the failover time stays low. I've tried sending SIGKILL to the Cassandra process inside the VM; in that case the failover time is about 600 ms, which is fine.
kind regards
Failover times are pretty good when shutting down one node gracefully, however, when killing a node (by shutting down the VM) the latency during the tests is about 12 seconds
12 seconds is a pretty huge value. Some questions before investigating further:
What is your Cassandra version? Since version 2.0.2 there is a speculative retry mechanism that helps reduce latency in such failover scenarios: http://www.datastax.com/dev/blog/rapid-read-protection-in-cassandra-2-0-2
What client driver are you using (Java? C#? Which version?)? Normally, with a properly configured load balancing policy, when a node is down the client automatically retries the query by re-routing it to another replica. There is also speculative retry implemented on the driver side: http://datastax.github.io/java-driver/manual/speculative_execution/
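If you are on the Java driver, here is a minimal sketch of enabling driver-side speculative executions (available from Java driver 2.1.6, so the 2.1.9 you mention should have it). The contact point, delay, and execution count are placeholder values, and your statements should be idempotent for this to be safe:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;

public class SpeculativeClient {
    public static Cluster build() {
        return Cluster.builder()
                .addContactPoint("c1")  // placeholder contact point
                // If no response after 500 ms, send up to 2 extra executions to other hosts.
                .withSpeculativeExecutionPolicy(new ConstantSpeculativeExecutionPolicy(500, 2))
                .build();
    }
}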
Edit: for a node to be marked down, the gossip protocol uses the phi accrual failure detector. Instead of a binary state (UP/DOWN), the algorithm adjusts a suspicion level, and if the value rises above a threshold the node is considered down.
This algorithm is necessary to avoid marking a node down because of a brief network hiccup.
Look in the cassandra.yaml file for this config:
# phi value that must be reached for a host to be marked down.
# most users should never need to adjust this.
# phi_convict_threshold: 8
Another question: what load balancing strategy are you using in the driver? And did you use the speculative retry policy?