Cassandra CAS INSERT timeouts for requests with millisecond latency

We are load-testing our Cassandra cluster (3 nodes, replication factor 3) and have started to receive occasional WriteTimeoutExceptions for CAS insert operations on one table:
CREATE TABLE users.by_identity (
    account ascii,
    domain ascii,
    identity text,
    PRIMARY KEY ((account, domain), identity)
);
We are doing inserts with an IF NOT EXISTS clause into this table. When we increase the load to > 10 inserts/s for a single partition, client requests start to "time out":
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency SERIAL (2 replica were required but only 1 acknowledged the write)
The WriteType for the timeouts is CAS, and the exceptions are thrown only for this table. Execution time is always < 10 ms, while read/write timeouts on the cluster are configured to > 1000 ms.
Any ideas what issue we might be hitting, and why we are getting timeouts for requests that complete in milliseconds?
We are on Cassandra v3.0.8 and Datastax Java driver v3.1.0.

Sorry for the late answer, but you are probably hitting this bug: https://issues.apache.org/jira/browse/CASSANDRA-9328
You can likely confirm this in one of two ways: either reduce concurrency so that only one request is ever in flight at a time (since your requests finish in milliseconds, you can still issue 10 requests per second, just sequentially) while keeping your cluster setup (3 nodes, replication factor 3), or keep the request rate at 10/s and run against a single-node cluster. In either case you probably won't see any timeouts in under 1000 ms. Switch back to 10 concurrent requests against 3 nodes with replication factor 3 and you should reproduce the timeouts that fire well below the configured timeout.
Unfortunately the bug report doesn't provide any pseudocode for working around the problem, other than saying you should check the state yourself to see whether the write actually happened and retry based on that. If your writes are idempotent, a simple retry may be all you need.
Unfortunately our application was complicated enough that we couldn't work around the bug without a lot of other work, so we are still living with it. If this ends up being the problem you are hitting, I'd be interested to see pseudocode for how you worked around it, as it might provide inspiration for others hitting this problem as well.
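For what it's worth, here is a minimal sketch of the "check the state yourself" idea with the DataStax Java driver (the helper name and the SERIAL read-back are my assumptions, not something from the bug report). Note that if another client could insert the same row concurrently, the row existing afterwards doesn't prove it was your write that applied:

import com.datastax.driver.core.*;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class CasInsertExample {

    // Hypothetical helper: returns true if the row exists after the call, either
    // because our INSERT ... IF NOT EXISTS applied, or because a timed-out Paxos
    // round turns out to have committed anyway.
    static boolean insertIfNotExists(Session session, PreparedStatement insert,
                                     PreparedStatement readBack,
                                     String account, String domain, String identity) {
        try {
            ResultSet rs = session.execute(insert.bind(account, domain, identity));
            return rs.wasApplied();
        } catch (WriteTimeoutException e) {
            if (e.getWriteType() != WriteType.CAS) {
                throw e;
            }
            // The Paxos round may or may not have committed; read back at SERIAL
            // to observe the outcome instead of blindly failing.
            Statement check = readBack.bind(account, domain, identity)
                    .setConsistencyLevel(ConsistencyLevel.SERIAL);
            return session.execute(check).one() != null;
        }
    }

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            PreparedStatement insert = session.prepare(
                    "INSERT INTO users.by_identity (account, domain, identity) "
                            + "VALUES (?, ?, ?) IF NOT EXISTS");
            PreparedStatement readBack = session.prepare(
                    "SELECT identity FROM users.by_identity "
                            + "WHERE account = ? AND domain = ? AND identity = ?");
            boolean present = insertIfNotExists(session, insert, readBack,
                    "acct-1", "example.com", "alice");
            System.out.println("row present after call: " + present);
        }
    }
}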

Related

Cassandra lightweight transactions failing

There are 2 DCs, each with 3 nodes; the RF used for writes is 2 and reads use EACH_QUORUM. A lightweight transaction is used to ensure consistency of updates across DCs. What is happening is that, for certain records, hundreds (maybe thousands) of LWT updates hit the cluster at around the same time, and all of them fail with "Operation timed out - received only 0 responses" - not even one attempt manages to change the status of that record, and it makes everyone else fail. Ideally the first attempt would go through and change the values, so that subsequent LWT updates do not apply because the LWT condition is no longer satisfied. Is there any way to achieve this?
Tried increasing the cas_contention timeout, but that did not help beyond making all the transactions wait longer before failing. Used "local" consistency, which made the LWTs run faster, but that would not help in our case since we want strong consistency in both DCs. Any alternatives?
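For reference, the "condition no longer satisfied" outcome surfaces to the client through the [applied] column of the LWT response, which also carries the current values of the condition columns. A minimal sketch with the DataStax Java driver and hypothetical keyspace/table/column names (my_ks.my_table, id, status); this does not fix the contention timeouts, it only shows how a losing contender can see the winner's value without a second read:

import com.datastax.driver.core.*;

public class LwtStatusUpdate {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_ks")) {   // hypothetical keyspace
            // Only the first contender flips the status; the others get [applied] = false
            // plus the current value of the condition column.
            Row row = session.execute(
                    "UPDATE my_table SET status = 'DONE' WHERE id = 42 IF status = 'PENDING'")
                    .one();
            if (row.getBool("[applied]")) {
                System.out.println("we won the update");
            } else {
                System.out.println("already updated, current status: " + row.getString("status"));
            }
        }
    }
}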

What's the nature of Cassandra "write timeout"?

I am running a write-heavy program (10 threads, peaking at 25K writes/sec) against a 24-node Cassandra 3.5 cluster on AWS EC2 (each host is a c4.2xlarge: 8 vcores and 15 GB RAM).
Every once in a while my Java client, using DataStax driver 3.0.2, gets a write timeout:
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency TWO (2 replica were required but only 1 acknowledged the write)
at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:73)
at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:26)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:64)
The error happens infrequently and in a very unpredictable way. So far, I have not been able to link the failures to anything specific (e.g. program running time, data size on disk, time of day, or indicators of system load such as CPU, memory, or network metrics). Nonetheless, it is really disrupting our operations.
I am trying to find the root cause of the issue. Looking online for options, I am a bit overwhelmed by all the leads out there, such as:
Changing "write_request_timeout_in_ms" in "cassandra.yaml" (already changed to 5 seconds)
Using a proper "RetryPolicy" to keep the session going (already using DowngradingConsistencyRetryPolicy with session-level consistency ONE)
Changing cache size, heap size, etc. - never tried those because there are good reasons to discount them as the root cause.
One thing that is really confusing in my research is that I am getting this error from a fully replicated cluster with very few ClientRequest.timeout.write events:
I have a fully-replicated 24-node cluster spanning 5 AWS regions. Each region has at least 2 copies of the data.
My program runs at consistency level ONE, set at the Session level (Cluster builder with QueryOptions).
When the error happened, our Graphite chart registered no more than three (3) host hiccups, i.e. hosts reporting non-zero Cassandra.ClientRequest.Write.Timeouts.Count values.
I already set write_timeout to 5 seconds. The network is pretty fast (using iperf3 to verify) and stable
On paper, the situation should be well within Cassandra's failsafe range. So why does my program still fail? Are the numbers not what they appear to be?
It's not necessarily a bad thing to see timeouts or errors, especially if you're writing at a higher consistency level; the writes may still get through.
I see you mention CL=ONE; you can still get timeouts here, but the write (mutation) may still have gone through. I found this blog really useful: https://www.datastax.com/dev/blog/cassandra-error-handling-done-right. Check your server-side (node) logs at the time of the error for things like ERROR / WARN entries or GC pauses (as one of the comments above mentions); these kinds of events can make a node unresponsive and therefore cause a timeout or another type of error.
If your updates are idempotent (ideally) then you can build in some retry mechanism.
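If you go the retry route, a minimal sketch of such a mechanism (my own helper, not part of the driver) could look like this; it is only safe for idempotent writes:

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public final class IdempotentRetry {
    // The timed-out attempt may already have landed on some replicas;
    // retrying simply re-applies the same mutation.
    public static ResultSet execute(Session session, Statement statement, int maxAttempts) {
        WriteTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return session.execute(statement);
            } catch (WriteTimeoutException e) {
                last = e;   // coordinator gave up waiting for replicas; try again
            }
        }
        throw last;
    }
}

Driver 3.x also lets you mark statements idempotent (statement.setIdempotent(true)) so the driver's retry and speculative-execution machinery knows which writes are safe to replay.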

Replication acknowledgement in PostgreSQL + BDR

I'm using the libpq C library to test a PG + BDR replica set. I'd like to get acknowledgement of the replication of CRUD operations. My goal is to keep my own log of the replication time in milliseconds, or if possible in microseconds.
The program:
Starts 10-20 threads with separate connections; each thread runs 1000-5000 cycles of basic CRUD operations on three tables.
Which would be the best way?
Should I parse some high-verbosity logs, if they carry the needed data with timestamps, or should my C program start N threads (N = {number of nodes} - {the master I'm connected to}) after every CRUD operation and query the nodes for the data?
You can't get replay confirmation of individual xacts easily. The system keeps track of the log sequence number replayed by peer nodes but not what transaction IDs those correspond to, since it doesn't care.
What you seem to want is near-synchronous or semi-synchronous replication. There's some work coming there for 9.6 that will hopefully benefit BDR in time, but that's well in the future.
In the meantime you can see the log sequence number as restart_lsn in pg_replication_slots. This is not the position the replica has replayed to, but the oldest point it might have to restart replay at after a crash.
You can see the other LSN fields like replay_location only when a replica is connected in pg_stat_replication. Unfortunately in 9.4 there's no easy way to see which slot in pg_replication_slots is associated with which active connection in pg_stat_replication (fixed in 9.5, but BDR is based on 9.4 still). So you have to use the application_name set by BDR if you want to pick out individual nodes, and it's ... "interesting" to parse. Also often truncated.
You can get the current LSN of the server you committed an xact on, immediately after committing, by calling SELECT pg_current_xlog_location(), which will return a value like 0/19E0F060. You can then keep checking pg_stat_replication until the replay_location reported for the peer node(s) has reached or passed the LSN you captured right after commit.
It's not perfect. There could be other work done between when you commit and when you capture the server's current LSN. There's no way around that, but at worst you wait slightly too long. If you're using BDR you shouldn't be caring about micro or even milliseconds anyway, since it's an asynchronous replication solution.
The principles are quite similar to measuring replication lag for normal physical standby servers, so I suggest reading some docs on that. The exception is that pg_last_xact_replay_timestamp() won't work for logical replication, so you can't get lag from it; you have to use the LSNs and do your own timing client-side.
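As a rough illustration of the capture-then-poll approach (a sketch only: the question uses libpq from C, but the two queries are the same from JDBC; the connection URLs, credentials, and the choice of which node's pg_stat_replication to poll (it is populated on whichever node runs the walsender for the connection of interest) are my assumptions):

import java.sql.*;

public class BdrReplayWait {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection strings; "origin" is the node the transaction
        // was committed on, "monitor" is wherever the relevant pg_stat_replication
        // row for the peer connection lives.
        try (Connection origin = DriverManager.getConnection(
                     "jdbc:postgresql://origin:5432/mydb", "user", "pass");
             Connection monitor = DriverManager.getConnection(
                     "jdbc:postgresql://origin:5432/mydb", "user", "pass")) {

            // ... run and commit the CRUD operation on `origin` here ...

            String lsn;
            try (Statement st = origin.createStatement();
                 ResultSet rs = st.executeQuery("SELECT pg_current_xlog_location()")) {
                rs.next();
                lsn = rs.getString(1);               // e.g. "0/19E0F060"
            }

            long start = System.nanoTime();
            // Filter on application_name here if you only care about one peer.
            String poll = "SELECT bool_or(pg_xlog_location_diff(replay_location, ?::pg_lsn) >= 0) "
                        + "FROM pg_stat_replication";
            try (PreparedStatement ps = monitor.prepareStatement(poll)) {
                ps.setString(1, lsn);
                while (true) {
                    try (ResultSet rs = ps.executeQuery()) {
                        if (rs.next() && rs.getBoolean(1)) {
                            break;                   // peer has replayed past our commit
                        }
                    }
                    Thread.sleep(1);
                }
            }
            System.out.printf("replayed past %s after %.3f ms%n",
                    lsn, (System.nanoTime() - start) / 1e6);
        }
    }
}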

Frequent rpc_timeouts of the query SELECT count(*) FROM Keyspace1.Standard1 limit 5; in cassandra

I have a 5 node cassandra cluster with 3 nodes on a private DC & the other 2 on AWS.
Select * requests are timing out even when limited to 5 rows. I can understand timeouts for large result sets, but timing out at single digits looks strange.
Any one observed this before?
NOTE: Queries with a WHERE clause behave normally.
There are two or three options:
1) Your servers are too busy / slow to reply to the query.
2) You're hitting a tombstone exception, which sometimes doesn't get reported properly. Check the log on the cassandra server for the word 'tombstone' to be sure.
3) You're asking for too much data at once - less likely if it happens when you LIMIT 5.
I'm guessing it's #2. Look for tombstone warnings in your cassandra server logs. If that's the problem, you likely have a data model problem.
Are the nodes on two different networks (you said private DC and AWS)? If so, check the communication between the nodes.
What consistency level are you using when querying? Try a consistency of ONE and see the response, then check the communication between nodes (with a higher consistency level the coordinator always checks the consistency of data with other nodes before responding with results).
Does your select have a WHERE clause, or is it a plain SELECT *? If the latter, retrieving data from different nodes over a slow inter-node connection might again be the issue.
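If you are hitting the cluster through the DataStax Java driver rather than cqlsh, one quick way to try the consistency-ONE experiment suggested above is to set the consistency level on the statement itself (a sketch; the contact point is a placeholder, and the keyspace/table name is taken from the question):

import com.datastax.driver.core.*;

public class ConsistencyProbe {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Run the same query at CL=ONE to see whether waiting on cross-DC
            // replicas is what pushes it past the rpc timeout.
            Statement stmt = new SimpleStatement("SELECT * FROM Keyspace1.Standard1 LIMIT 5")
                    .setConsistencyLevel(ConsistencyLevel.ONE);
            for (Row row : session.execute(stmt)) {
                System.out.println(row);
            }
        }
    }
}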

Cassandra stress test fails after large number of inserts

I've been trying to insert 1 billion records into Cassandra using the stress test and it fails after a couple of million inserts with the following error:
Operation [641412926] retried 10 times - error inserting key 0641412926 ((UnavailableException))
Operation [641412995] retried 10 times - error inserting key 0641412995 ((UnavailableException))
Operation [641413235] retried 10 times - error inserting key 0641413235 ((UnavailableException))
Operation [641413164] retried 10 times - error inserting key 0641413164 ((UnavailableException))
I've observed this issue in every run of my stress test. Sometimes one of the nodes in the cluster goes down. Is this a known issue? Any particular reason why this is happening? I am using Cassandra 1.2.3 on a cluster of 8 machines.
Thanks,
VS
UnavailableException means that the node you contacted cannot find enough replicas in the cluster for the requested key to fulfill the request. If you have nodes going up and down during your stress test, you likely need more capacity to handle the load you're running against the cluster.
Why is this happening? You're probably under-capacity in some way. If you're not running out of disk space, you should evaluate your CPU load and your I/O to figure out what's going on. It's important, when using Cassandra, to distinguish between peak load and sustained load. While Cassandra can handle ephemeral peaks, it's entirely possible to throw more load at a node than it can handle in the long term. If your peak lasts for five minutes you'll probably be OK; if it lasts for days, you should add capacity, because your cluster will eventually fall behind.
The first thing to check is whether the node you are inserting to is up and Cassandra is running. Assuming it is, you may be overwhelming Cassandra. In general, applications running in the JVM cannot recover once garbage collection has failed in a catastrophic fashion; this may be the condition you are triggering, and it would explain why your Cassandra node does not recover. To confirm, enable more verbose GC logging and/or consult the existing JVM GC log messages in system.log.
