As per this flow: Cassandra read_request_timeout_in_ms set up for external (client) requests, I understand that setting the timeout on the server side alone is not enough; we need to set it on the client side too.
What is the difference between setting the timeout on the client side and on the server side?
Example:
Setting the request timeout on the server side in Cassandra (cassandra.yaml)
VS
Setting the request timeout on the client side in the Cassandra driver
EDITED:
driver read timeout: the driver did not receive any response from the current coordinator within SocketOptions.setReadTimeoutMillis. It invokes onRequestError on the retry policy with an OperationTimedOutException to decide what to do.
server read timeout: the driver did receive a response, but that response indicates that the coordinator timed out while waiting for other replicas. It invokes onReadTimeout on the retry policy to decide what to do.
Could somebody clearly explain the purpose of, and difference between, both please?
Setting the timeout on the server side, i.e. in cassandra.yaml, is not the same as setting the driver (aka client-side) timeout using SocketOptions.setReadTimeoutMillis. They both work independently; one does not override the other. In general, you should set the driver timeout slightly larger than the server-side timeout.
If a Cassandra node is reachable and working but is not able to respond within the read timeout configured in cassandra.yaml, it will throw a timeout exception and the driver will receive that same exception. The driver might then retry, if configured to.
If a Cassandra node does not respond at all for some reason, the driver cannot wait indefinitely. That is when the driver timeout kicks in and an exception is thrown on the client side.
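To make the distinction concrete, here is a minimal sketch with the v3 Java driver showing how the two failure modes typically surface to application code. The contact point, keyspace/table and timeout values are placeholders, and exactly how the exception is wrapped depends on the configured retry policy:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SocketOptions;
import com.datastax.driver.core.exceptions.OperationTimedOutException;
import com.datastax.driver.core.exceptions.ReadTimeoutException;

Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")                       // placeholder contact point
        .withSocketOptions(new SocketOptions()
                .setReadTimeoutMillis(13000))               // client-side (driver) timeout
        .build();
Session session = cluster.connect();
try {
    session.execute("SELECT * FROM ks.tbl WHERE id = 1");   // placeholder query
} catch (ReadTimeoutException e) {
    // Server-side timeout: the coordinator did respond, but it gave up waiting on
    // replicas within read_request_timeout_in_ms from cassandra.yaml.
} catch (OperationTimedOutException e) {
    // Client-side timeout: no response from the coordinator at all within
    // SocketOptions.setReadTimeoutMillis (it may also arrive wrapped in a
    // NoHostAvailableException, depending on the retry policy).
} finally {
    cluster.close();
}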
Related
I set up a Cassandra cluster with several coordinator nodes.
Sometimes one of the coordinator nodes becomes unavailable... my code handles this with a retry policy which moves to the next node, and the problem is solved.
However, it seems that the problematic node still receives traffic even though the driver keeps throwing OperationTimedOutException... this is time consuming, since that node is useless.
Further details:
Cassandra Driver -
I'm using Cassandra driver version 3.11.0 (cassandra-driver-core-3.11.0.jar)
Load balancing policy -
I didn't set any load balancing policy - thus, the default is used.
Retry Policy -
I implemented my own retry policy -
In case of a read/write timeout or an unavailable error, I retry while reducing the consistency level to ONE. In case of a request error, I try a different host (roughly as sketched below).
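A minimal sketch of that retry policy for the v3 Java driver (the class name and retry counts here are illustrative, not my exact implementation):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.DriverException;
import com.datastax.driver.core.policies.RetryPolicy;

public class DowngradeOrNextHostRetryPolicy implements RetryPolicy {

    @Override
    public RetryDecision onReadTimeout(Statement stmt, ConsistencyLevel cl,
            int requiredResponses, int receivedResponses, boolean dataRetrieved, int nbRetry) {
        // Retry once on the same host, downgraded to ONE.
        return nbRetry == 0 ? RetryDecision.retry(ConsistencyLevel.ONE) : RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onWriteTimeout(Statement stmt, ConsistencyLevel cl,
            WriteType writeType, int requiredAcks, int receivedAcks, int nbRetry) {
        return nbRetry == 0 ? RetryDecision.retry(ConsistencyLevel.ONE) : RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onUnavailable(Statement stmt, ConsistencyLevel cl,
            int requiredReplica, int aliveReplica, int nbRetry) {
        return nbRetry == 0 ? RetryDecision.retry(ConsistencyLevel.ONE) : RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onRequestError(Statement stmt, ConsistencyLevel cl,
            DriverException e, int nbRetry) {
        // e.g. an OperationTimedOutException: move on to the next host in the query plan.
        return RetryDecision.tryNextHost(cl);
    }

    @Override
    public void init(Cluster cluster) { }

    @Override
    public void close() { }
}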
Is there any way to configure the driver so that if it keeps throwing OperationTimedOutException while sending queries to a specific coordinator node, that node will not be contacted for some period of time?
The Cassandra client connection caches the coordinator node, so it will keep sending the query to the same node. Tune your application-layer socket configuration, including the client connection timeout:
SocketOptions options = new SocketOptions();
options.setConnectTimeoutMillis(30000); // how long the driver waits to establish a connection
options.setReadTimeoutMillis(30000);    // client-side, per-request read timeout
options.setTcpNoDelay(true);
// pass to the cluster via Cluster.builder()...withSocketOptions(options).build()
There are a few misconceptions in your question so let me begin by correcting them.
Misconception #1
I set up a Cassandra cluster with several coordinator nodes.
All nodes in a Cassandra cluster are the same. This is one of the attributes that makes Cassandra awesome. Any node in the cluster can be picked as a coordinator. You can NOT configure/nominate/set up a node to be a coordinator while others aren't.
Misconception #2
... if a coordinator node keeps throwing OperationTimedOutException ...
Cassandra nodes are not capable of throwing OperationTimedOutException. OperationTimedOutException is a client-side exception which gets thrown by the driver when it doesn't get a response from a coordinator within the configured client timeout period.
It is different from the read and write timeout exceptions, which are thrown when the coordinator does send a response back to the driver, indicating that a read or write request timed out on the server side.
Picking nodes
You didn't specify which driver + version you're using. OperationTimedOutException is in Java driver v3.x but not in v4.x (it was replaced with DriverTimeoutException, which makes it clearer that the exception is client-side), so for the purposes of my response I'm going to assume that you're using Java driver v3.11 (the latest in the v3 series).
You also didn't specify which load balancing policies (LBP) you've configured and which retry policies. If you're using the latency-aware LBP LatencyAwarePolicy, the likely scenario is that the problematic node has the lowest latency so it is listed as the "preferred node" by the policy.
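For reference, the load balancing policy in the v3 Java driver is set on the Cluster builder. A minimal sketch of the common token-aware, DC-aware setup (contact point and local DC name are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")                      // placeholder contact point
        .withLoadBalancingPolicy(new TokenAwarePolicy(
                DCAwareRoundRobinPolicy.builder()
                        .withLocalDc("DC1")                // placeholder local DC
                        .build()))
        .build();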
Handling misbehaving nodes is a very tough thing for the drivers to do, particularly if the nodes are unresponsive, because a driver won't know what is really going on if a node doesn't respond at all. The drivers can't be too aggressive about marking nodes as "down", because if a node is just temporarily unavailable (for example, due to a GC pause), it would needlessly stop being picked as a coordinator for a bit of time.
Sometimes, the latency "signal" from a problematic node takes a while to bubble up for a driver to effectively route around it because of the algorithm used by the driver to average out the reported latencies over a period of one or two minutes, scaled such that older latencies are weighted less than newer latencies. In the case of an unresponsive node, the driver can only base the average/scaling on the last time the node reported its latency.
For this reason, the LatencyAwarePolicy was dropped in Java driver v4 in preference for the new DefaultLoadBalancingPolicy which has a much better detection algorithm for slow replicas.
Your workaround using tryNextHost() is a bit clunky because you have to effectively wait for the retry policy to kick in. What you really need to focus on is the fact that your nodes become unresponsive. If your cluster is getting overloaded, you should consider increasing the capacity by adding more nodes.
Trying to come up with a software solution for what is an infrastructure capacity issue is never going to be successful in the long run. Cheers!
I have set the server timeout in Cassandra to 60 seconds and the client timeout in the cpp driver to 120 seconds.
I use a BATCH query with 18K operations. I get the "Request timed out" error in the cpp driver logs, but there is no TRACE available in the Cassandra server logs, despite enabling ALL logging in Cassandra's logback.xml.
So how can I confirm whether it is thrown from the server side or the client side in Cassandra?
BATCH is not intended to work that way. It's designed to apply 6 or 7 mutations to different tables atomically. You're trying to use it like its RDBMS counterpart (Cassandra just doesn't work that way). The BATCH timeout is designed to protect the node/cluster from crashing due to how expensive that query is for the coordinator.
In the system.log, you should see warnings/failures concerning the sheer size of your BATCH. If you've modified those thresholds and don't see that, you should see a warning about a timeout threshold being exceeded (I think BATCH gets its own timeout in 3.0).
If all else fails, run your BATCH statement (or part of it) in cqlsh with tracing on, and you'll see precisely why this is a bad idea (server side).
Also, the default query timeouts are there to protect your cluster. You really shouldn’t need to alter those. You should change your query/model or approach before looking at adjusting the timeout.
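For example, instead of one 18K-operation BATCH, the mutations can be issued as individual asynchronous statements. A minimal sketch, shown with the Java driver since that is what the rest of this page uses (the table, columns, and in-flight cap are placeholders); the same approach applies with the cpp driver:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

void writeIndividually(Session session, Map<Integer, String> rows) {
    // Placeholder table/columns; prepare once, bind per row.
    PreparedStatement ps = session.prepare("INSERT INTO ks.tbl (id, value) VALUES (?, ?)");
    List<ResultSetFuture> inFlight = new ArrayList<>();
    for (Map.Entry<Integer, String> row : rows.entrySet()) {
        inFlight.add(session.executeAsync(ps.bind(row.getKey(), row.getValue())));
        if (inFlight.size() >= 128) {                        // placeholder cap on concurrent requests
            for (ResultSetFuture f : inFlight) f.getUninterruptibly();
            inFlight.clear();
        }
    }
    for (ResultSetFuture f : inFlight) f.getUninterruptibly();
}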
When looking at node details in Datastax OpsCenter:
We can see that there were 34903422 "native-transport-requests", but 1072 were blocked.
Could someone explain what a native transport request is? How does it relate to a mutation?
Is it normal that they are being blocked and what does it mean?
BTW, we can also see that there were 93 mutations dropped, and we know what that means: What is mutation in cassandra?
The native transport is the CQL Native Protocol (as opposed to the Thrift protocol) and is the way all modern Cassandra drivers communicate with the server. This includes all reads/writes/schema changes/etc.
A blocked request is one that is sitting around waiting for something else to complete before it can run. Very few C* operations are actually blocking, so the total blocked number should be very low. The total count is just the all-time sum of requests that were blocked.
I am doing load testing on Cassandra by using JMeter.
After systematically increasing the load, I can see that more than 58000 active connections have been established by the driver with different nodes of Cassandra.
I started with 500 requests and added 500 more after every 10 iterations; like this I reached up to 2500 requests, where it started failing and throwing NoHostAvailableException. I thought that maybe Cassandra was down, but when I tried to send requests to Cassandra using the DataStax driver running on a different system, it worked fine. So now my question is:
When I am increasing the load on the DataStax Java driver, it is opening more connections instead of using the existing connections. Why is it not reusing the existing connections?
By default the driver should only have connections based on the number of nodes in the cluster (1 connection per node, I believe). This makes me think that your issue is with your JMeter code and not the Java driver.
In normal operation using the native protocol, the Java driver sends multiple requests along each connection simultaneously, so there is no need to open multiple connections to the same server. It is also possible to raise the limit of simultaneous requests per connection, as sketched below.
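For example, with the 3.x Java driver that per-connection limit is controlled through PoolingOptions (the contact point and numbers here are illustrative):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

PoolingOptions pooling = new PoolingOptions()
        .setMaxRequestsPerConnection(HostDistance.LOCAL, 4096)   // default is 1024 with protocol v3+
        .setMaxRequestsPerConnection(HostDistance.REMOTE, 1024);

Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")                             // placeholder contact point
        .withPoolingOptions(pooling)
        .build();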
The connection count was increasing due to creating a new session for each request. Now it is working fine:
import com.datastax.driver.core.*;
import com.datastax.driver.core.policies.ConstantReconnectionPolicy;
import com.datastax.driver.core.policies.DowngradingConsistencyRetryPolicy;

// Build the Cluster and Session once and reuse them for every request.
PoolingOptions poolingOptions = new PoolingOptions();
poolingOptions.setCoreConnectionsPerHost(HostDistance.LOCAL,
        poolingOptions.getMaxConnectionsPerHost(HostDistance.LOCAL));
Cluster cluster = Cluster.builder()
        .addContactPoints("192.168.114.42")
        .withPoolingOptions(poolingOptions)
        .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
        .withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
        .build();
Session session = cluster.connect("demodb");
Now the driver maintains 17-26 connections, irrespective of the number of transactions.
In a Cassandra/Astyanax development environment, I'm running a single local Cassandra node. When this single Cassandra node goes down (for whatever reason), any Astyanax-based client code (mutation batches, queries, ...) fails with something like this:
com.netflix.astyanax.connectionpool.exceptions.NoAvailableHostsException: NoAvailableHostsException: [host=None(0.0.0.0):0, latency=0(0), attempts=0]No hosts to borrow from
at com.netflix.astyanax.connectionpool.impl.RoundRobinExecuteWithFailover.<init>(RoundRobinExecuteWithFailover.java:30)
at com.netflix.astyanax.connectionpool.impl.TokenAwareConnectionPoolImpl.newExecuteWithFailover(TokenAwareConnectionPoolImpl.java:80)
at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:248)
at com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$4.execute(ThriftColumnFamilyQueryImpl.java:532)
and this exception is logged for each subsequent Astyanax-based client request, resulting in the above log spam. Basically, I'm asking if there is a way to configure an Astyanax connection pool so that it stops accepting requests and ideally provides some sort of callback which allows me to shut down my Astyanax-based client application (e.g. our server application).
What you want to do is a bad choice; the whole point of Cassandra is to build a distributed, multi-node cluster which can handle failures. However, you could catch that specific exception and disregard it in Java without doing the costly logging:
try {
    // lots of C* stuff
} catch (NoAvailableHostsException ex) {
    // Swallow the exception instead of logging it on every request.
}
Also, Astyanax doesn't accept requests; it's a client library, not a server. Cassandra is the server. You can tell Cassandra to stop accepting requests via JMX, by invoking the stopRPCServer() operation on org.apache.cassandra.db.StorageService.
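A minimal sketch of invoking that operation over JMX (the host is a placeholder, 7199 is Cassandra's default JMX port, and this assumes the MBean is registered as org.apache.cassandra.db:type=StorageService):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

void stopRpcServer() throws Exception {
    JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    try {
        MBeanServerConnection mbeans = connector.getMBeanServerConnection();
        ObjectName storageService = new ObjectName("org.apache.cassandra.db:type=StorageService");
        // Stops the Thrift RPC server so Cassandra stops accepting (Thrift) client requests.
        mbeans.invoke(storageService, "stopRPCServer", new Object[0], new String[0]);
    } finally {
        connector.close();
    }
}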