Cassandra write query timeout out after PT2S - cassandra

I have cassandra monolithic application where I want to write at high rate reading some payloads from queue. Cassandra cluster has 3 nodes . When i start processing large number of messages in parallel(by spawning threads) I get below exceptions
java.util.concurrent.ExecutionException: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT2S
I am creating CQLsession as bean
return CqlSession.builder().addContactPoints(contactPoints)
/*.addContactPoint(new InetSocketAddress("localhost", 9042))*/
.withConfigLoader(new DefaultDriverConfigLoader()).withLocalDatacenter("datacenter1")
.addTypeCodecs(new CustomDateCodec())
.withKeyspace("dev").build();
I am injecting this CqlSession into my mapper and other classes to run queries
In my datastax driver i have given ip of 3 nodes as contact points
Is there any tuning I need to do in CQLsession creation/ or my cassandra nodes so that they can take is writes at high concurrency ?
Also How many writes can I do in parallel ?
All are update statement without any if condition only on primary key

The timeout you're seeing is a result of your app overloading the cluster, effectively doing a DDoS attack.
PT2S is the 2-second write timeout. There will come a point when the commitlog disks can only take so much write IO. If you're seeing dropped mutations in the logs or nodetool tpstats, that's confirmation that the commitlog can't keep up with the writes.
If your cluster can sustain 10K writes/sec but your app is doing 20K writes then you need to double the size of your cluster (add more nodes) to support the throughput requirements. Cheers!

Related

Why does a Cassandra node get picked as coordinator even when the driver keeps throwing OperationTimedOutException?

I set up a Cassandra cluster with several coordinator nodes.
Sometimes one of the coordinator nodes becomes unavailable...my code handles this with a retry policy which moves to the next node and the problem is solved.
However, it seems that the problematic node still receives traffic even if the driver keeps throwing OperationTimedOutException...it is a time consuming since this node useless.
Further details:
Cassandra Driver -
I'm using Cassandra driver version 3.11.0 (cassandra-driver-core-3.11.0.jar)
Loading balancing policy -
I didn't set any load balancing policy - thus, the default is used.
Retry Policy -
I implemented my own retry policy -
In case of read/write timeout or unavailable retry cause - I'm using retry while reducing the consistency level to one. In case of request error - I'm trying a different host.
Is there anyway to configure that if the driver keeps throwing OperationTimedOutException while sending query to a specific coordinator node, this node will not be called for some period of time?
Cassandra client connection does the Cassandra co-ordinator node caching. So, It will continue sending the query to the same node. Tune your application layer socket config with the client connection timeout.
SocketOptions options = new SocketOptions();
options.setConnectTimeoutMillis(30000);
options.setReadTimeoutMillis(30000);
options.setTcpNoDelay(true);
There are a few misconceptions in your question so let me begin by correcting them.
Misconception #1
I set up a Cassandra cluster with several coordinator nodes.
All nodes in a Cassandra cluster are the same. This is one of the attributes that makes Cassandra awesome. Any node in the cluster can be picked as a coordinator. You can NOT configure/nominate/setup a node to be a coordinator while others aren't.
Misconception #2
... if a coordinator node keeps throwing OperationTimedOutException ...
Cassandra nodes are not capable of throwing OperationTimedOutException. OperationTimedOutException is a client-side exception which gets thrown by the driver when it doesn't get a response from a coordinator within the configured client timeout period.
It is a different exception from read or write timeout exceptions which are thrown when the coordinator sends a response back to the driver when a read or write request timed out on the server-side.
Picking nodes
You didn't specify which driver + version you're using. OperationTimedOutException is in Java driver v3.x but not in v4.x (it was replaced with DriverTimeoutException which makes it clearer that the exception is client-side) so for the purposes of my response, I'm going to assume that you're using Java driver v3.11 (latest in the v3 series).
You also didn't specify which load balancing policies (LBP) you've configured and which retry policies. If you're using the latency-aware LBP LatencyAwarePolicy, the likely scenario is that the problematic node has the lowest latency so it is listed as the "preferred node" by the policy.
Handling misbehaving nodes is a very tough thing to do for the drivers, particularly if the nodes are unresponsive because a driver won't know what is really going on if a node doesn't respond at all. The drivers can't be too aggressive at marking nodes as "down" because if the node is just temporarily unavailable (for example, due to a GC pause), it won't get picked again as a coordinator for a bit of time.
Sometimes, the latency "signal" from a problematic node takes a while to bubble up for a driver to effectively route around it because of the algorithm used by the driver to average out the reported latencies over a period of one or two minutes, scaled such that older latencies are weighted less than newer latencies. In the case of an unresponsive node, the driver can only base the average/scaling on the last time the node reported its latency.
For this reason, the LatencyAwarePolicy was dropped in Java driver v4 in preference for the new DefaultLoadBalancingPolicy which has a much better detection algorithm for slow replicas.
Your workaround using tryNextHost() is a bit clunky because you have to effectively wait for the retry policy to kick in. What you really need to focus on is the fact that your nodes become unresponsive. If your cluster is getting overloaded, you should consider increasing the capacity by adding more nodes.
Trying to come up with a software solution for what is an infrastructure capacity issue is never going to be successful in the long run. Cheers!

driver default retry policy

I am testing our cassandra cluster for resiliency, its a 9 node cluster with rf=3. When i disable all traffic on port 7000 of one node, the client gets a
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
The reason being that this host was only partially able to communicate to other nodes, but it then proceeds to retry again on the same host
INFO - c.d.d.c.p.LoggingRetryPolicy - - Retrying on read timeout on same host at consistency QUORUM (initial consistency: QUORUM, required responses: 2, received responses: 1, data retrieved: true, retries: 0)
It continues ALL it retries on the same host, and never recovers, eventually the request fails.
I can create a custom policy, but wondering why it never tries any other nodes ?
As per your definitions, the database will have only 3 copies of the information (RF=3), so, even though you have 9 nodes, the QUORUM will be evaluated only with the 3 nodes that actually are the owners of the data, this is defined with the number of tokens and their assignation in the nodes.
Before disabling the port in that node, was the cluster reported as healthy? (in other words, nodetool status reported all the nodes as UN Up and Normal). Is the latency reported by all the nodes similar? If you have a node with increased latencies, the query will timeout before it gets a response from it.
Before creating "custom policies", and once that you confirmed that all the nodes are healthy, reachable and available, you may want to explore using a lower consistency level (like ANY or ONE) which can improve resiliency and performance with an impact of accuracy, or increase the replication factor which will increase the number of nodes where you can find the data but with the inconvenience that the amount of disk utilization will increase.

Is spark JDBC sink transactionally safe at node level?

I have a question related to opening a transaction at partition level. If I use jdbc connector to write to database (postgess), will partition specific writes at worker node be transactionally safe i.e.
If a worker node goes down while writing the data, will the rows related to this partition/ worker node be rolled back?
There is a transaction boundary on the partition (see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L588)
But if there are failures afterwards before the task is marked as SUCCESS, for example with a network issue or timeout, then you might still get multiple writes

Connection pooling in cassandra

In Mongo and HBase, we have a way to track the client connection, Is there a way to get the total client connection in Cassandra?
On client-side Datastax driver reports metrics on connections, task queues, queries and errors (connection errors, r/w timeouts, retries, speculative executions).
Can be accessed via the Cluster.getMetrics() operation (java).
Datastax provides a cassandra structure, CassMetrics (cpp driver). This structure contains min and max microseconds to execute a query, pending requests, timed-out requests, total connections and available connections. It gives the entire performance metrics of a particular snapshot of the session.
Each member of the structure is explained clearly in the following datastax documentation- https://docs.datastax.com/en/developer/cpp-driver/2.3/api/struct.CassMetrics/
Here, is an example program on how to use the structure effectively - https://github.com/datastax/cpp-driver/blob/master/examples/perf/perf.c

Hazelcast - OperationTimeoutException

I am using Hazelcast version 3.3.1.
I have a 9 node cluster running on aws using c3.2xlarge servers.
I am using a distributed executor service and a distributed map.
Distributed executor service uses a single thread.
Distributed map is configured with no replication and no near-cache and stores about 1 million objects of size 1-2kb using Kryo serializer.
My use case goes as follow:
All 9 nodes constantly execute a synchronous remote operation on the distributed executor service and generate about 20k hits per second (about ~2k per node).
Invocations are executed using Hazelcast API: com.hazelcast.core.IExecutorService#executeOnKeyOwner.
Each operation accesses the distributed map on the node owning the partition, does some calculation using the stored object and stores the object in to the map. (for that I use the get and set API of the IMap object).
Every once in a while Hazelcast encounters a timeout exceptions such as:
com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:mapService', op=GetOperation{}, partitionId=212, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[172.31.44.2]:5701, backupsExpected=0, backupsCompleted=0}, response=null, done=false} No response has been received! backups-expected:0 backups-completed: 0
In some cases I see map partitions start to migrate which makes thing even worse, nodes constantly leave and re-join the cluster and the only way I can overcome the problem is by restarting the entire cluster.
I am wondering what may cause Hazelcast to block a map-get operation for 120 seconds?
I am pretty sure it's not network related since other services on the same servers operate just fine.
Also note that the servers are mostly idle (~70%).
Any feedbacks on my use case will be highly appreciated.
Why don't you make use of an entry processor? This is also send to the right machine owning the partition and the load, modify, store is done automatically and atomically. So no race problems. It will probably outperform the current approach significantly since there is less remoting involved.
The fact that the map.get is not returning for 120 seconds is indeed very confusing. If you switch to Hazelcast 3.5 we added some logging/debugging stuff for this using the slow operation detector (executing side) and slow invocation detector (caller side) and should give you some insights what is happening.
Do you see any Health monitor logs being printed?

Resources