Why does a Cassandra node get picked as coordinator even when the driver keeps throwing OperationTimedOutException? - cassandra

I set up a Cassandra cluster with several coordinator nodes.
Sometimes one of the coordinator nodes becomes unavailable. My code handles this with a retry policy that moves on to the next node, which solves the immediate problem.
However, the problematic node still seems to receive traffic even while the driver keeps throwing OperationTimedOutException. This wastes time, since the node is useless in that state.
Further details:
Cassandra Driver -
I'm using Cassandra driver version 3.11.0 (cassandra-driver-core-3.11.0.jar)
Load balancing policy -
I didn't set any load balancing policy - thus, the default is used.
Retry Policy -
I implemented my own retry policy -
On a read timeout, write timeout, or unavailable error, I retry with the consistency level lowered to ONE. On a request error, I try a different host.
Is there any way to configure the driver so that, if it keeps throwing OperationTimedOutException while sending queries to a specific coordinator node, that node is not used for some period of time?

The Cassandra client connection caches the coordinator node, so it will keep sending queries to the same node. Tune the socket options in your application layer, including the client connection timeout:
SocketOptions options = new SocketOptions();
options.setConnectTimeoutMillis(30000);
options.setReadTimeoutMillis(30000);
options.setTcpNoDelay(true);

There are a few misconceptions in your question so let me begin by correcting them.
Misconception #1
"I set up a Cassandra cluster with several coordinator nodes."
All nodes in a Cassandra cluster are the same. This is one of the attributes that makes Cassandra awesome. Any node in the cluster can be picked as a coordinator. You can NOT configure/nominate/setup a node to be a coordinator while others aren't.
Misconception #2
"... if a coordinator node keeps throwing OperationTimedOutException ..."
Cassandra nodes are not capable of throwing OperationTimedOutException. OperationTimedOutException is a client-side exception which gets thrown by the driver when it doesn't get a response from a coordinator within the configured client timeout period.
It is different from the read and write timeout exceptions, which are thrown when the coordinator sends a response back to the driver reporting that a read or write request timed out on the server side.
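To make the distinction concrete, here is a minimal sketch in plain Java (no driver classes). It assumes a hypothetical ServerTimeout exception standing in for a server-reported read/write timeout, and a CompletableFuture standing in for the pending response. It is only an illustration of which side gives up first, not how the driver is implemented.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutOrigin {
    /** Hypothetical stand-in for a timeout that the coordinator reports back. */
    static final class ServerTimeout extends RuntimeException {
        ServerTimeout(String message) { super(message); }
    }

    /** Waits for a response and reports which side gave up first. */
    static String classify(CompletableFuture<String> response, long clientTimeoutMillis) {
        try {
            response.get(clientTimeoutMillis, TimeUnit.MILLISECONDS);
            return "OK";
        } catch (TimeoutException e) {
            // Nothing came back at all: the client stopped waiting.
            return "CLIENT_TIMEOUT";
        } catch (ExecutionException e) {
            // The coordinator did answer, but with an error.
            return e.getCause() instanceof ServerTimeout ? "SERVER_TIMEOUT" : "OTHER_ERROR";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "INTERRUPTED";
        }
    }
}
```

A future that never completes yields CLIENT_TIMEOUT (the analogue of OperationTimedOutException), while a future completed exceptionally with ServerTimeout yields SERVER_TIMEOUT (the analogue of a read or write timeout).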
Picking nodes
You didn't specify which driver + version you're using. OperationTimedOutException is in Java driver v3.x but not in v4.x (it was replaced with DriverTimeoutException which makes it clearer that the exception is client-side) so for the purposes of my response, I'm going to assume that you're using Java driver v3.11 (latest in the v3 series).
You also didn't specify which load balancing policies (LBP) you've configured and which retry policies. If you're using the latency-aware LBP LatencyAwarePolicy, the likely scenario is that the problematic node has the lowest latency so it is listed as the "preferred node" by the policy.
Handling misbehaving nodes is a very tough thing for the drivers to do, particularly when the nodes are unresponsive, because a driver can't know what is really going on if a node doesn't respond at all. The drivers can't be too aggressive about marking nodes as "down": a node that is only temporarily unavailable (for example, due to a GC pause) would then not get picked as a coordinator again for some time.
Sometimes the latency "signal" from a problematic node takes a while to bubble up before the driver can effectively route around it. This is because the driver averages the reported latencies over a period of one or two minutes, scaled so that older latencies are weighted less than newer ones. For an unresponsive node, the driver can only base that average on the last time the node reported a latency.
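The effect can be illustrated with a small time-decayed average in plain Java. This is a sketch only: the class name, the exponential weighting, and the scale parameter are assumptions for illustration and are not the driver's exact formula.

```java
final class DecayingLatency {
    private final double scaleNanos;   // how quickly old measurements lose weight
    private double average;
    private long lastUpdateNanos;
    private boolean hasSample;

    DecayingLatency(long scaleNanos) { this.scaleNanos = scaleNanos; }

    /** Folds in a new measurement; the older the previous average, the less it counts. */
    void record(long latencyNanos, long nowNanos) {
        if (!hasSample) {
            average = latencyNanos;
            hasSample = true;
        } else {
            double elapsed = Math.max(0L, nowNanos - lastUpdateNanos);
            double previousWeight = Math.exp(-elapsed / scaleNanos);
            average = previousWeight * average + (1.0 - previousWeight) * latencyNanos;
        }
        lastUpdateNanos = nowNanos;
    }

    /** Only changes when record() is called: a silent node keeps its last score. */
    double average() { return average; }
}
```

Note that the average only moves when a new measurement arrives. A node that stops responding produces no measurements, so its last (possibly very good) score stays in place until something else intervenes.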
For this reason, the LatencyAwarePolicy was dropped in Java driver v4 in preference for the new DefaultLoadBalancingPolicy which has a much better detection algorithm for slow replicas.
Your workaround using tryNextHost() is a bit clunky because you have to effectively wait for the retry policy to kick in. What you really need to focus on is the fact that your nodes become unresponsive. If your cluster is getting overloaded, you should consider increasing the capacity by adding more nodes.
Trying to come up with a software solution for what is an infrastructure capacity issue is never going to be successful in the long run. Cheers!

Related

Fail fast Cassandra NTR blocked tasks

We ran into an issue where a Cassandra node went down in a cluster of 18 nodes and the overall cluster read/write latencies spiked. As a result, the Native Transport Request (NTR) threads reached their maximum of 128 (the default), the NTR queue reached its maximum capacity (also 128 by default), and native transport requests started getting blocked.
I am not sure what "blocked" means here. Does Cassandra start failing incoming requests once the queue is full, or are the requests blocked on the server side until they time out?
If it's the latter, is it possible to fail these requests fast from the Cassandra server side?
We are using Apache Cassandra version 2.2.8 with the DataStax Java driver 3.0.0.
You can raise the limit on requests waiting to be coordinated with -Dcassandra.max_queued_native_transport_requests=4096 (available in 2.2.8+), which is a common enough configuration for workloads with many tiny requests. There is no feature to have it return an error instead of blocking, but the back pressure will be noticed on the client, where requests queue up until you get busy pool exceptions.
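As an analogy for the difference between "blocked" and "failed fast", here is a minimal plain-Java sketch using a bounded queue. The class and method names are hypothetical and this is not how Cassandra's native transport is implemented; it only contrasts the two admission styles.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class BoundedRequestQueue {
    private final BlockingQueue<String> queue;

    BoundedRequestQueue(int capacity) { queue = new ArrayBlockingQueue<>(capacity); }

    /** Fail-fast admission: returns false immediately when the queue is full. */
    boolean tryEnqueue(String request) { return queue.offer(request); }

    /** Blocking admission: the caller waits for a free slot, up to waitMillis. */
    boolean enqueueOrWait(String request, long waitMillis) {
        try {
            return queue.offer(request, waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    int queued() { return queue.size(); }
}
```

The behavior described in the answer corresponds to the blocking style: the caller is held back rather than rejected, which is why the pressure eventually shows up on the client side.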

Request timed out is not logging in server side Cassandra

I have set the server timeout in Cassandra to 60 seconds and the client timeout in the cpp driver to 120 seconds.
I use a BATCH query that has 18K operations. I get the "Request timed out" error in the cpp driver logs, but there is no trace of it in the Cassandra server logs, in spite of enabling ALL logs in Cassandra's logback.xml.
So how can I confirm whether it is thrown from the server side or the client side?
BATCH is not intended to work that way. It’s designed to apply 6 or 7 mutations to different tables atomically. You’re trying to use it like its RDBMS counterpart (Cassandra just doesn’t work that way). The BATCH timeout is designed to protect the node/cluster from crashing due to how expensive that query is for the coordinator.
In the system.log, you should see warnings/failures concerning the sheer size of your BATCH. If you’ve modified them and don’t see that, you should see a warning about a timeout threshold being exceeded (I think BATCH gets its own timeout in 3.0).
If all else fails, run your BATCH statement (part of it) in cqlsh with tracing on, and you’ll see precisely why this is a bad idea (server side).
Also, the default query timeouts are there to protect your cluster. You really shouldn’t need to alter those. You should change your query/model or approach before looking at adjusting the timeout.

Cassandra DB - Node is down and a request is made to fetch data in that Node

If we configure our replication factor in such a way that there are no replica nodes (data is stored on one node only) and the node that contains the requested data is down, how will Cassandra handle the request?
Will it return no data, or will the other nodes gossip and somehow pick up the data from the failed node's storage and send the required response? If the data is picked up, will the transfer between nodes happen as soon as the node goes down (gossip protocol) or only after a request is made?
I have researched for a long time how gossip works and how Cassandra achieves high availability, but I am wondering about the availability of data in the "no replicas" case, since I do not want to waste additional storage on occasional failures and at the same time I need availability and no data loss (even if delayed).
I assume that when you say there are "no replica nodes" you mean that you have set the replication factor to 1 (RF=1). In this case, if the request is a read, it will fail. If the request is a write, it will be stored as a hint, up to the maximum hint time, and will be replayed. If the node is down for longer than the hint time, that write will be lost. Hinted Handoff: repair during write path
In general, having only a single replica of data in your C* cluster goes against some of the basic design of how C* is meant to be used and is an anti-pattern. Data duplication is a normal and expected part of using C* and is what allows for its high availability. RF=1 introduces a single point of failure into the system, as the server containing that data can go out for any of a variety of reasons (including things like maintenance), which will cause requests to fail.
If you are truly looking for a solution that provides high availability and no data loss, then you need to increase your replication factor (the standard I usually see is RF=3) and set up your cluster's hardware in such a manner as to reduce or remove potential single points of failure.
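The arithmetic behind this can be sketched in a few lines of Java. The helper names are hypothetical; the quorum formula, floor(RF / 2) + 1, is the standard one.

```java
final class ReplicaMath {
    /** Replicas that must answer for QUORUM: floor(RF / 2) + 1. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** Replicas that can be down while a request needing requiredReplicas still succeeds. */
    static int tolerableFailures(int replicationFactor, int requiredReplicas) {
        return replicationFactor - requiredReplicas;
    }
}
```

With RF=1, every consistency level needs that single replica, so tolerableFailures(1, 1) is 0. With RF=3, QUORUM needs 2 replicas and tolerates 1 node being down, and ONE tolerates 2.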

Cassandra nodejs driver time out after a node moves

We use vnodes on our cluster.
I noticed that when the token space of a node changes (automatically on vnodes, during a repair or a cleanup after adding new nodes), the datastax nodejs driver gets a lot of "Operation timed out - received only X responses" for a few minutes.
I tried using ONE and LOCAL_QUORUM consistencies.
I suppose this is due to the coordinator not hitting the right node just after the move. This seems to be a logical behavior (data was moved) but we really want to address this particular issue.
What do you suggest we do to avoid this? A custom retry policy? Caching? Changing the consistency level?
Example of behavior
when we see this:
4/7/2016, 10:43am Info Host 172.31.34.155 moved from '8185241953623605265' to '-1108852503760494577'
We see a spike of those:
{
"message":"Operation timed out - received only 0 responses.",
"info":"Represents an error message from the server",
"code":4608,
"consistencies":1,
"received":0,
"blockFor":1,
"isDataPresent":0,
"coordinator":"172.31.34.155:9042",
"query":"SELECT foo FROM foo_bar LIMIT 10"
}
"I suppose this is due to the coordinator not hitting the right node just after the move. This seems to be a logical behavior (data was moved) but we really want to address this particular issue."
In fact, when a new node is added there is token range movement, but Cassandra can still serve read requests using the old token ranges until the scale-out has finished completely. So the behavior you're facing is very suspicious.
If you can reproduce this error, please activate query tracing to narrow down the issue.
The error can also be related to a node that is under heavy load and not replying fast enough.

Hazelcast - OperationTimeoutException

I am using Hazelcast version 3.3.1.
I have a 9 node cluster running on aws using c3.2xlarge servers.
I am using a distributed executor service and a distributed map.
Distributed executor service uses a single thread.
Distributed map is configured with no replication and no near-cache and stores about 1 million objects of size 1-2kb using Kryo serializer.
My use case goes as follow:
All 9 nodes constantly execute a synchronous remote operation on the distributed executor service and generate about 20k hits per second (about ~2k per node).
Invocations are executed using Hazelcast API: com.hazelcast.core.IExecutorService#executeOnKeyOwner.
Each operation accesses the distributed map on the node owning the partition, does some calculation using the stored object, and stores the object back into the map (for that I use the get and set API of the IMap object).
Every once in a while Hazelcast encounters timeout exceptions such as:
com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:mapService', op=GetOperation{}, partitionId=212, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[172.31.44.2]:5701, backupsExpected=0, backupsCompleted=0}, response=null, done=false} No response has been received! backups-expected:0 backups-completed: 0
In some cases I see map partitions start to migrate, which makes things even worse: nodes constantly leave and re-join the cluster, and the only way I can overcome the problem is by restarting the entire cluster.
I am wondering what may cause Hazelcast to block a map-get operation for 120 seconds?
I am pretty sure it's not network related since other services on the same servers operate just fine.
Also note that the servers are mostly idle (~70%).
Any feedback on my use case will be highly appreciated.
Why don't you make use of an entry processor? It is also sent to the machine owning the partition, and the load, modify, and store steps are done automatically and atomically, so there are no race problems. It will probably outperform the current approach significantly since there is less remoting involved.
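The idea of doing load, modify, and store as one atomic step can be illustrated with a plain-Java analogy. This sketch uses java.util.concurrent.ConcurrentHashMap rather than the Hazelcast API, and the helper name is hypothetical; it only shows the difference from a separate get followed by a set.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

final class AtomicUpdate {
    /**
     * Applies the change to the current value as a single atomic step per key,
     * instead of a separate get() followed by a put().
     */
    static <K, V> V update(ConcurrentHashMap<K, V> map, K key, UnaryOperator<V> change) {
        return map.compute(key, (k, current) -> change.apply(current));
    }
}
```

For example, AtomicUpdate.update(counters, "n", v -> v + 1) increments an existing entry without a window in which another thread could overwrite the value between the read and the write.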
The fact that map.get does not return for 120 seconds is indeed very confusing. In Hazelcast 3.5 we added logging and debugging support for this through the slow operation detector (executing side) and the slow invocation detector (caller side); if you upgrade, these should give you some insight into what is happening.
Do you see any Health monitor logs being printed?
