In Mongo and HBase, we have a way to track the client connection, Is there a way to get the total client connection in Cassandra?
On client-side Datastax driver reports metrics on connections, task queues, queries and errors (connection errors, r/w timeouts, retries, speculative executions).
Can be accessed via the Cluster.getMetrics() operation (java).
Datastax provides a cassandra structure, CassMetrics (cpp driver). This structure contains min and max microseconds to execute a query, pending requests, timed-out requests, total connections and available connections. It gives the entire performance metrics of a particular snapshot of the session.
Each member of the structure is explained clearly in the following datastax documentation- https://docs.datastax.com/en/developer/cpp-driver/2.3/api/struct.CassMetrics/
Here, is an example program on how to use the structure effectively - https://github.com/datastax/cpp-driver/blob/master/examples/perf/perf.c
Related
I have cassandra monolithic application where I want to write at high rate reading some payloads from queue. Cassandra cluster has 3 nodes . When i start processing large number of messages in parallel(by spawning threads) I get below exceptions
java.util.concurrent.ExecutionException: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT2S
I am creating CQLsession as bean
return CqlSession.builder().addContactPoints(contactPoints)
/*.addContactPoint(new InetSocketAddress("localhost", 9042))*/
.withConfigLoader(new DefaultDriverConfigLoader()).withLocalDatacenter("datacenter1")
.addTypeCodecs(new CustomDateCodec())
.withKeyspace("dev").build();
I am injecting this CqlSession into my mapper and other classes to run queries
In my datastax driver i have given ip of 3 nodes as contact points
Is there any tuning I need to do in CQLsession creation/ or my cassandra nodes so that they can take is writes at high concurrency ?
Also How many writes can I do in parallel ?
All are update statement without any if condition only on primary key
The timeout you're seeing is a result of your app overloading the cluster, effectively doing a DDoS attack.
PT2S is the 2-second write timeout. There will come a point when the commitlog disks can only take so much write IO. If you're seeing dropped mutations in the logs or nodetool tpstats, that's confirmation that the commitlog can't keep up with the writes.
If your cluster can sustain 10K writes/sec but your app is doing 20K writes then you need to double the size of your cluster (add more nodes) to support the throughput requirements. Cheers!
I am using cassandra according to the following struct:
21 nodes , AWS EC2 i3.2xlarge , version 3.11.4 .
The application is opening about 5000 connection per node (so its 100k connections per cluster) using the datastax java connection driver.
Application is using autoscale and frequently opens/close connections.
Number of connections to open at once by app servers can reach up to 500 per node (opens simultaneously on all nodes at once - so its 10k connections opens at the same time across the cluster)
This cause spikes of load on cassandra and cause reads and writes latency.
I have noticed each time connections opens/close there are high number of reads from system_auth.roles and system_auth.role_permissions.
How can I prevent the load and resolve this issue ?
You need to modify your application to work with as small number of connections as possible. You need to have following in mind:
Create Cluster/Session object, once at start and keep it. Initialization of session is very expensive operation, it adds a load to Cassandra, and to your application as well
you may increase the number of the simultaneous requests per connection, instead of opening new connections. Protocol allows to have up to 32k requests per connection. Although, if you have too many requests in-flight, then it's a sign that your Cassandra doesn't keep with workload and can't answer fast enough. See documentation on connection pooling
I have set server timeout in cassandra as 60 seconds and client timeout in cpp driver as 120 seconds.
I use Batch query which has 18K operations, I get the Request timed out error in cpp driver logs but in Cassandra server logs there is no TRACE available in spite of enabling ALL logs in Cassandra logback.xml
So how can I confirm that It is thrown from the server / client side in Cassandra?
BATCH is not intended to work that way. It’s designed to apply 6 or 7 mutations to different tables atomically. You’re trying to use it like it’s RDBMS counterpart (Cassandra just doesn’t work that way). The BATCH timeout is designed to protect the node/cluster from crashing due to how expensive that query is for the coordinator.
In the system.log, you should see warnings/failures concerning the sheer size of your BATCH. If you’ve modified them and don’t see that, you should see a warning about a timeout threshold being exceeded (I think BATCH gets its own timeout in 3.0).
If all else fails, run your BATCH statement (part of it) in cqlsh with tracing on, and you’ll see precisely why this is a bad idea (server side).
Also, the default query timeouts are there to protect your cluster. You really shouldn’t need to alter those. You should change your query/model or approach before looking at adjusting the timeout.
I'm trying to tune Cassandra because I keep getting this error:
com.datastax.driver.core.exceptions.OperationTimedOutException: Timed out waiting for server response
But I'm not sure whether I should increase the number of connections I have per host or increase the number of requests per connection in the datastax driver.
In general, what is the best way to determine if adding more connections is better or adding more requests per connection?
I am using Hazelcast version 3.3.1.
I have a 9 node cluster running on aws using c3.2xlarge servers.
I am using a distributed executor service and a distributed map.
Distributed executor service uses a single thread.
Distributed map is configured with no replication and no near-cache and stores about 1 million objects of size 1-2kb using Kryo serializer.
My use case goes as follow:
All 9 nodes constantly execute a synchronous remote operation on the distributed executor service and generate about 20k hits per second (about ~2k per node).
Invocations are executed using Hazelcast API: com.hazelcast.core.IExecutorService#executeOnKeyOwner.
Each operation accesses the distributed map on the node owning the partition, does some calculation using the stored object and stores the object in to the map. (for that I use the get and set API of the IMap object).
Every once in a while Hazelcast encounters a timeout exceptions such as:
com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:mapService', op=GetOperation{}, partitionId=212, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[172.31.44.2]:5701, backupsExpected=0, backupsCompleted=0}, response=null, done=false} No response has been received! backups-expected:0 backups-completed: 0
In some cases I see map partitions start to migrate which makes thing even worse, nodes constantly leave and re-join the cluster and the only way I can overcome the problem is by restarting the entire cluster.
I am wondering what may cause Hazelcast to block a map-get operation for 120 seconds?
I am pretty sure it's not network related since other services on the same servers operate just fine.
Also note that the servers are mostly idle (~70%).
Any feedbacks on my use case will be highly appreciated.
Why don't you make use of an entry processor? This is also send to the right machine owning the partition and the load, modify, store is done automatically and atomically. So no race problems. It will probably outperform the current approach significantly since there is less remoting involved.
The fact that the map.get is not returning for 120 seconds is indeed very confusing. If you switch to Hazelcast 3.5 we added some logging/debugging stuff for this using the slow operation detector (executing side) and slow invocation detector (caller side) and should give you some insights what is happening.
Do you see any Health monitor logs being printed?