I've been trying to insert 1 billion records into Cassandra using the stress test and it fails after a couple of million inserts with the following error:
Operation [641412926] retried 10 times - error inserting key 0641412926 ((UnavailableException))
Operation [641412995] retried 10 times - error inserting key 0641412995 ((UnavailableException))
Operation [641413235] retried 10 times - error inserting key 0641413235 ((UnavailableException))
Operation [641413164] retried 10 times - error inserting key 0641413164 ((UnavailableException))
I've observed this issue in every run of my stress test. Sometimes one of the nodes in the cluster goes down. Is this a known issue? Is there a particular reason why this is happening? I am using Cassandra 1.2.3 on a cluster of 8 machines.
Thanks,
VS
UnavailableException means that the coordinator node you've contacted cannot find enough live replicas for the requested key to satisfy the request at the requested consistency level. If you have nodes going up and down during your stress test, you likely need more capacity to handle the load you're running against the cluster.
Why is this happening? You're probably under-capacity in some way. If you're not running out of disk space, you should evaluate your CPU load and your I/O to try to figure out what's going on. It's important, when using Cassandra, to distinguish between peak load and sustained load. While Cassandra can handle ephemeral peaks, it's entirely possible to throw more load at a node than it can handle in the long term. This means that if your peak lasts for five minutes you'll probably be OK; if your peak lasts for days, you should add capacity, because your cluster is eventually going to fall behind.
The first thing to check is whether the node you are inserting to is up and Cassandra is running. Assuming it is, you may be overwhelming Cassandra. In general, applications running within the JVM are unable to recover when the JVM garbage collection process has failed in a catastrophic fashion. This may be the error condition you are triggering, which would explain why your Cassandra node does not recover. To confirm whether this is the case, enable more verbose GC logging and/or consult the existing JVM GC log messages in system.log.
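For example, on an install of that era you can turn on verbose GC logging by adding standard HotSpot flags in conf/cassandra-env.sh (the log path below is an assumption; adjust it for your layout):

JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"

Independently of that, Cassandra's own GCInspector already logs long GC pauses to system.log, so grepping system.log for GCInspector around the time of the failures is a quick first check.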
We are load-testing our cassandra cluster (3 nodes, replication factor 3) and started to receive occasional WriteTimeoutExceptions for CAS insert operations on one table:
CREATE TABLE users.by_identity (
    account ascii,
    domain ascii,
    identity text,
    PRIMARY KEY ((account, domain), identity)
);
We are doing inserts with the IF NOT EXISTS clause into this table. When the load increases to more than 10 inserts/s for one partition, client requests start to "time out":
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency SERIAL (2 replica were required but only 1 acknowledged the write)
The WriteType for the timeouts is CAS, and exceptions are thrown only for this table. Execution time is always < 10 ms, while read/write timeouts are configured to > 1000 ms on the cluster.
Any ideas what issue we might be hitting, and why we are getting timeouts for requests with millisecond latencies?
We are on Cassandra v3.0.8 and Datastax Java driver v3.1.0.
Sorry for the late answer, but you are probably hitting this bug: https://issues.apache.org/jira/browse/CASSANDRA-9328
You can likely confirm this by either (a) reducing concurrency so there is only ever one request in flight at a time (if your requests are fast you can still do 10 requests per second, just issue them one after the other with no overlap) while keeping your cluster setup (3 nodes, replication factor 3), or (b) keeping your request rate at 10/s and changing your cluster to a single node. In either case you probably won't see any timeouts below 1000 ms. Then switch back to concurrent requests against 3 nodes with replication factor 3, and you will likely reproduce the timeouts that fire well below the configured timeout.
Unfortunately the bug report doesn't provide any pseudocode for working around the problem, but it does say you should check the state yourself to see whether the write actually happened and retry based on that. If your writes are idempotent, you may be able to simply retry.
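For what it's worth, that check-then-retry pattern for this table looks roughly like the sketch below (Java driver 3.x; the method name and the read-back-at-SERIAL choice are mine, not from the bug report):

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

// Sketch: LWT insert; on a CAS write timeout, read the row back to see
// whether the write actually landed instead of blindly retrying.
boolean insertIdentity(Session session, String account, String domain, String identity) {
    Statement insert = new SimpleStatement(
            "INSERT INTO users.by_identity (account, domain, identity) VALUES (?, ?, ?) IF NOT EXISTS",
            account, domain, identity);
    try {
        return session.execute(insert).wasApplied();
    } catch (WriteTimeoutException e) {
        if (e.getWriteType() != WriteType.CAS) {
            throw e; // not the Paxos phase; handle elsewhere
        }
        // The Paxos round timed out: the insert may or may not have been applied.
        // A SERIAL read will observe a committed CAS write even if the client never saw the ack.
        Statement check = new SimpleStatement(
                "SELECT identity FROM users.by_identity WHERE account = ? AND domain = ? AND identity = ?",
                account, domain, identity)
                .setConsistencyLevel(ConsistencyLevel.SERIAL);
        return session.execute(check).one() != null;
    }
}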
In our case the application was quite complicated and we were unable to work around the bug without a lot of other work, so we are still living with it. If this ends up being the problem you are hitting, I'd be interested to see pseudocode for how you worked around it, as it might provide inspiration for others hitting this problem as well.
We had 3 regions for our Cassandra cluster, each with 2 nodes, for a total of 6. We then added 3 more regions, so we now have 12 Cassandra nodes in the cluster. After adding the nodes, we updated the replication factors and started nodetool repair. The command has been hanging for more than 48 hours and has not finished yet. When we looked into the logs, 1 or 2 AntiEntropySessions are still pending because some of the column families are not fully synced. All AntiEntropySessions successfully receive Merkle trees from all the nodes for all column families, but the repair between some nodes never completes for some column families, which leaves AntiEntropySessions pending and the repair hanging.
We are using Cassandra 1.1.12 and are not able to upgrade right now.
We have restarted the nodes and started the repair again but it still hangs.
We have observed that one column family, which has frequent reads and writes in the initial 3 regions and is active during the repair, fails to sync completely every time.
Is it necessary that there be no reads/writes on any column family while repair is running?
Or can you suggest what the issue might be here?
Cassandra 1.1 is very old, so it's hard to remember the exact issues, but there were problems with streaming back then that could hang, caused by things like a read timing out or a connection being reset. Since you are past 1.1.11, though, you're OK to try subrange repairs.
Try to find a token range you can repair in about an hour (keep trying smaller and smaller ranges until one completes) and set a timeout of a couple of hours. Expect some repairs to fail (time out), so just retry them until they complete. If a range still won't finish after many retries, keep making the subrange smaller; even then it may have problems if you have a very wide partition (you can check with nodetool cfstats), which will make things much worse.
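For reference, a subrange repair invocation looks roughly like this (keyspace, column family, and tokens are placeholders; option names and placement may differ slightly on 1.1, so check nodetool help against your version):

nodetool repair -st <start_token> -et <end_token> <keyspace> <column_family>

Repeat over consecutive token ranges until the full ring has been covered, retrying any range that times out.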
Once you get a completed repair, upgrade like crazy.
I am running a write-heavy program (10 threads, peaking at 25K writes/sec) against a 24-node Cassandra 3.5 cluster on AWS EC2 (each host is a c4.2xlarge: 8 vCPUs and 15 GB RAM).
Every once in a while my Java client, using DataStax driver 3.0.2, gets a write timeout:
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency TWO (2 replica were required but only 1 acknowledged the write)
at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:73)
at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:26)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:64)
The error happens infrequently and in a very unpredictable way. So far, I have not been able to link the failures to anything specific (e.g. program running time, data size on disk, time of day, or indicators of system load such as CPU, memory, or network metrics). Nonetheless, it is really disrupting our operations.
I am trying to find the root cause of the issue. Looking online for options, I am a bit overwhelmed by all the leads out there, such as:
Changing "write_request_timeout_in_ms" in "cassandra.yaml" (already changed to 5 seconds)
Using proper "RetryPolicy" to keep the session going (already using DowngradingConsistencyRetryPolicy on a ONE session level consistency level)
Changing cache size, heap size, etc. (never tried those because there are good reasons to discount them as the root cause)
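For context, the driver setup described above (driver 3.0.x, session-level ONE plus the downgrading retry policy) looks roughly like this; the contact point is a placeholder:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DowngradingConsistencyRetryPolicy;

Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")                        // placeholder seed node
        .withQueryOptions(new QueryOptions()
                .setConsistencyLevel(ConsistencyLevel.ONE)) // session-level CL=ONE
        .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
        .build();
Session session = cluster.connect();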
One thing that is really confusing from my research is that I am getting this error from a fully replicated cluster with very few ClientRequest.timeout.write events:
I have a fully replicated 24-node cluster spanning 5 AWS regions; each region has at least 2 copies of the data
My program runs at consistency level ONE at the Session level (set on the Cluster builder with QueryOptions)
When the error happened, our Graphite chart registered hiccups on no more than three (3) hosts, i.e. hosts showing non-zero Cassandra.ClientRequest.Write.Timeouts.Count values
I already set the write timeout to 5 seconds, and the network is fast (verified with iperf3) and stable
On paper, the situation should be well within Cassandra's failsafe range. So why did my program still fail? Are the numbers not what they appear to be?
It's not always necessarily a bad thing to see timeouts or errors, especially if you're writing at a higher consistency level; the writes may still get through.
I see you mention CL=ONE; you can still get timeouts here, but the write (mutation) may still have got through. I found this blog post really useful: https://www.datastax.com/dev/blog/cassandra-error-handling-done-right. Check your server-side (node) logs at the time of the error to see if you have things like ERROR / WARN messages or GC pauses (as one of the comments above mentions); these kinds of events can make a node unresponsive and therefore cause a timeout or another type of error.
If your updates are idempotent (ideally they are), then you can build in a retry mechanism, along the lines of the sketch below.
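A minimal bounded-retry sketch for idempotent writes (driver 3.x; the helper name and retry count are illustrative):

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

// Only safe because the write is idempotent: replaying it cannot corrupt data.
ResultSet executeWithRetry(Session session, Statement stmt, int maxAttempts) {
    stmt.setIdempotent(true); // tell the driver a replay is safe
    WriteTimeoutException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return session.execute(stmt);
        } catch (WriteTimeoutException e) {
            last = e; // the write may still have landed; harmless to retry an idempotent write
        }
    }
    throw last;
}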
I have a single-node Cassandra installation on my development machine (and very little experience with Cassandra). I have always had very little data in the node and experienced no problems. Today I inserted about 9,000 elements into a table to experiment with a real-world use case. Now when I start up the node, the boot time is extremely long. I get this in system.log:
Replaying /var/lib/cassandra/commitlog/CommitLog-3-1388134836280.log
...
Log replay complete, 9274 replayed mutations
That took 13 minutes, which is hardly bearable. I wonder if there is a way to store the data so it can be read directly without replaying the log; after all, 9,000 elements are nothing and there must be a quicker way to boot. I googled for hints and searched Cassandra's documentation but didn't find anything. Clearly I'm not looking for the right things; would anybody be so kind as to point me to the right documents? Thanks.
There are a few things that might help. The most obvious thing you can do is flush the commit log before you shut down Cassandra. This is a good idea in production too. Before I stop a Cassandra node in production I'll run the following commands:
nodetool disablethrift
nodetool disablegossip
nodetool drain
The first two commands gracefully shut down connections to clients connected to this node and then to other nodes in the ring. The drain command flushes memtables to disk (sstables). This should minimize what needs to be replayed on startup.
There are other factors that can make startup take a long time. Cassandra opens all the SSTables on disk at startup. So the more column families and SSTables you have on disk the longer it will take before a node is able to start serving clients. There was some work done in the 1.2 release to speed this up (so if you are not on 1.2 yet you should consider upgrading). Reducing the number of SSTables would probably improve your start time.
Since you mentioned this is a development machine, I'll also share my dev-environment observations. On my development machine I do a lot of creating and dropping of column families and keyspaces. This can cause some of the system CFs to grow significantly and eventually cause a noticeable slowdown. The easiest way to handle this is to have a script that can quickly bootstrap a new database and blow away all the old data in /var/lib/cassandra, as sketched below.
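As a rough illustration (paths assume a default Debian-style package install; the schema file name is hypothetical), such a reset script can be as simple as:

sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
sudo service cassandra start
cqlsh -f schema.cql   # recreate keyspaces and tables once the node is back up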
I ran into a very strange problem while testing Cassandra. I have a very simple column family that stores video data (keys point to a time period and there is only one column containing ~2 MB of video for that period).
Use Case
I start loading data using the Hector API (round-robin) into 6 empty nodes (8 GB RAM for Cassandra). The load runs in 4 threads, each thread adding 4 rows per second.
After a while (running the load for an hour or so), roughly 100-200 GB have been added per node (depending on the replication factor), and then one or several nodes become unreachable (they don't respond to ping; only a reboot helps).
Why Compaction
I use tiered-level compaction, and monitoring the system (Debian) I can see that it is not the writes but compaction that takes almost all the resources (disk, memory) and causes the server to refuse writes and then fail.
After 30-40 minutes of the test, compaction tasks can no longer keep up and start to queue. The interesting thing is that there are no deletes or updates, so compaction just reads and rewrites the same data again and again without bringing me any real value (it could just as well be compacted once in the evening).
When I slow down the pace, i.e. run 2 threads with a 1-second delay, things go better, but will that still hold when I have 20 TB rather than 100 GB on a node?
Is Cassandra optimized for this type of workload? How are resources normally distributed between compaction and reads/writes?
Update
Updating the network driver solved the problem with nodes becoming unreachable.
Thanks,
Sergey.
Cassandra will use up to in_memory_compaction_limit_in_mb of memory for a compaction. It is routine to have compaction running while reads and writes are served simultaneously. It is also normal for compaction to fall behind if you continue to throw writes at it as fast as possible; if your read workload requires that compaction be up to date, or close to it, at all times, then you'll need a larger cluster to spread the load across more machines.
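That limit is a cassandra.yaml setting; 64 MB was the stock default in that era (shown purely as an illustration, not a tuning recommendation):

in_memory_compaction_limit_in_mb: 64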
The recommended amount of disk per node for online queries is up to 500 GB, maybe 1 TB if you're pushing it. Remember that this amount of data will have to be rebuilt if a node fails. Typical Cassandra workloads are CPU-bound or IOPS-bound, not disk-space-bound, so you won't be able to make good use of that space anyway.
(It's also possible to do batch analytics against Cassandra, which we do with the Cassandra Filesystem, in which case higher disk:cpu ratios are desirable, but we use a custom compaction strategy for that as well.)
It's not clear from your report why a server would become unreachable. This is really an OS-level problem. (Are you swapping? Disabling swap would be a good first step.)
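If swap does turn out to be enabled, turning it off on a Debian box is typically just:

sudo swapoff -a

plus commenting out any swap entries in /etc/fstab so it stays off after a reboot.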