I'm considering raising max_hint_window_in_ms to something like 72 hours. Anyone see issues with this? Essentially, it would allow us much longer downtime of nodes over a weekend without having to do a full repair.
It depends on the version. After C* 3.0 or DSE 5.0 when the hinted handoff storage was refactored its actually a very good idea to increase it. Before then (given your 2.1 tag assuming this is you) theres a lot of issues with accumulating too many hints highlighted in this blog post. Unless using a version after 3.0 I would not recommend increasing it too much.
To highlight some pre 3.0 issues:
Hints are stored in a C* table and acts like a queue which is a known antipattern, builds up many tombstones and slow/expensive reads
Hints are partitioned by node, so if one node is down a long time the partition gets very huge. This is handled better in the latest of C*/DSE but particularly in 2.1 this impacts compactions, and gcs significantly.
Compactions are called regularly and are required, but if there is nothing getting removed this means just rewriting the mutations over and over while the node is down (wasteful)
Individual mutations need to go through memtables and full write path vs just appended to disk
Related
We recently started having problems with our Cassandra cluster. Maybe someone has ideas on how to fix this. We're running Cassandra 3.11.7 on a 40 node cluster. We are using replication factor = 3 and read/write at consistency level QUORUM.
Recently, a single node experienced a sudden spike in CPU load which then last for a while. During that period, we can observe a lot of dropped and queued MUTATIONs. If we restart Cassandra on the problematic node, one or two other nodes start to suffer of the same problem. We have examined log files and access patterns and have not yet been able to find the reason.
What could be the most common reasons for such behaviour? Where should we take a closer look? Has anyone already had similar experiences?
If we restart Cassandra on the problematic node, one or two other nodes start to suffer of the same problem.
First of all, when a single node presents a problem, restarting it generally achieves nothing. If anything, you'll clear the JVM heap...which will be quickly repopulated upon startup. Seriously, don't expect restarting a node to fix anything.
Has anyone already had similar experiences?
Yes, several times. For things not Cassandra related:
Are you in a cloud environment? Run iostat and look for things like high percentages of iowait and steal. Sometimes shared resources don't play well with others. If you don't have iostat, get it (yum install -y sysstat).
Check cron for all users. We once had an issue with a file integrity checker getting installed as a part of our base image, and it did exactly what you are talking about.
What could be the most common reasons for such behaviour? Where should we take a closer look?
For Cassandra related issues, I see a few possibilities:
Repairs. Check if the node is running a repair. You can see Merkle Tree calculations with nodetool compactionstats and repair streams with nodetool netstats.
Compactions. Check nodetool compactionstats. If this is it, you can try lowering your compaction throughput so that it doesn't affect normal operations.
Garbage Collection. Check the gc.log.* files. If it's GC, it can usually be fixed by reading up on and adjusting the GC settings. If there isn't anyone on your team who is a JVM GC expert, I recommend using G1GC as it removes a lot of the guesswork.
Do note that everything I mentioned above can never be fixed with a reboot. In fact, it's likely it'll pick right back up where it left off.
We have a 13 nodes Cassandra cluster (version 3.10) with RP 2 and read/write consistency of 1.
This means that the cluster isn't fully consistent, but eventually consistent. We chose this setup to speed up the performance, and we can tolerate a few seconds of inconsistency.
The tables are set with TWCS with read-repair disabled, and we don't run full repairs on them
However, we've discovered that some entries of the data are replicated only once, and not twice, which means that when the not-updated node is queried it fails to retrieve the data.
My first question is how could this happen? Shouldn't Cassandra replicate all the data?
Now if we choose to perform repairs, it will create overlapping tombstones, therefore they won't be deleted when their time is up. I'm aware of the unchecked_tombstone_compaction property to ignore the overlap, but I feel like it's a bad approach. Any ideas?
So you've obviously made some deliberate choices regarding your client CL. You've opted to potentially sacrifice consistency for speed. You have achieved your goals, but you assumed that data would always make it to all of the other nodes in the cluster that it belongs. There are no guarantees of that, as you have found out. How could that happen? There are multiple reasons I'm sure, some of which include: networking/issues, hardware overload (I/O, CPU, etc. - which can cause dropped mutations), cassandra/dse being unavailable for whatever reasons, etc.
If none of your nodes have not been "off-line" for at least a few hours (whether it be dse or the host being unavailable), I'm guessing your nodes are dropping mutations, and I would check two things:
1) nodetool tpstats
2) Look through your cassandra logs
For DSE: cat /var/log/cassandra/system.log | grep -i mutation | grep -i drop (and debug.log as well)
I'm guessing you're probably dropping mutations, and the cassandra logs and tpstats will record this (tpstats will only show you since last cassandra/dse restart). If you are dropping mutations, you'll have to try to understand why - typically some sort of load pressure causing it.
I have scheduled 1-second vmstat output that spools to a log continuously with log rotation so I can go back and check a few things out if our nodes start "mis-behaving". It could help.
That's where I would start. Either way, your decision to use read/write CL=1 has put you in this spot. You may want to reconsider that approach.
Consistency level=1 can create a problem sometimes due to many reasons like if data is not replicating to the cluster properly due to mutations or cluster/node overload or high CPU or high I/O or network problem so in this case you can suffer data inconsistency however read repair handles this problem some times if it is enabled. you can go with manual repair to ensure consistency of the cluster but you can get some zombie data too for your case.
I think, to avoid this kind of issue you should consider CL at least Quorum for write or you should run manual repair within GC_grace_period(default is 10 days) for all the tables in the cluster.
Also, you can use incremental repair so that Cassandra run repair in background for chunk of data. For more details you can refer below link
http://cassandra.apache.org/doc/latest/operating/repair.html or https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsRepair.html
We have a 20 nodes Cassandra cluster running a lot of read requests (~900k/sec at peak). Our dataset is fairly small, so everything is served directly from memory (OS Page Cache). Our datamodel is quite simple (just a key/value) and all of our reads are performed with consistency level one (RF 3).
We use the Java Datastax driver with TokenAwarePolicy, so all of our reads should go directly to one node that has the requested data.
These are some metrics extracted from one of the nodes regarding client read request latency and local read latency.
org_apache_cassandra_metrics_ClientRequest_50thPercentile{scope="Read",name="Latency",} 105.778
org_apache_cassandra_metrics_ClientRequest_95thPercentile{scope="Read",name="Latency",} 1131.752
org_apache_cassandra_metrics_ClientRequest_99thPercentile{scope="Read",name="Latency",} 3379.391
org_apache_cassandra_metrics_ClientRequest_999thPercentile{scope="Read",name="Latency",} 25109.16
org_apache_cassandra_metrics_Keyspace_50thPercentile{keyspace=“<keyspace>”,name="ReadLatency",} 61.214
org_apache_cassandra_metrics_Keyspace_95thPercentile{keyspace="<keyspace>",name="ReadLatency",} 126.934
org_apache_cassandra_metrics_Keyspace_99thPercentile{keyspace="<keyspace>",name="ReadLatency",} 182.785
org_apache_cassandra_metrics_Keyspace_999thPercentile{keyspace="<keyspace>",name="ReadLatency",} 454.826
org_apache_cassandra_metrics_Table_50thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 105.778
org_apache_cassandra_metrics_Table_95thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 1131.752
org_apache_cassandra_metrics_Table_99thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 3379.391
org_apache_cassandra_metrics_Table_999thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 25109.16
Another important detail is that most of our queries (~70%) don't return anything, i.e., they are for records not found. So, bloom filters play an important role here and they seem to be fine:
Bloom filter false positives: 27574
Bloom filter false ratio: 0.00000
Bloom filter space used:
Bloom filter off heap memory used: 6760992
As it can be seen, the reads in each one of the nodes are really fast, the 99.9% is less than 0.5 ms. However, the client request latency is way higher, going above 4ms on the 99%. If I'm reading with CL ONE and using TokenAwarePolicy, shouldn't both values be similar to each other, since no coordination is required? Am I missing something? Is there anything else I could check to see what's going on?
Thanks in advance.
#luciano
there are various reasons why the coordinator and the replica can report different 99th percentiles for read latencies, even with token awareness configured in the client.
these can be anything that manifests in between the coordinator code to the replica's storage engine code in the read path.
examples can be:
read repairs (not directly related to a particular request, as is asynchronous to the read the triggered it, but can cause issues),
host timeouts (and/or speculative retries),
token awareness failure (dynamic snitch simply not keeping up),
GC pauses,
look for metrics anomalies per host, overlaps with GC, and even try to capture traces for some of the slower requests and investigate if they're doing everything you expect from C* (eg token awareness).
well-tuned and spec'd clusters may also witness the dynamic snitch simply not being able to keep up and do its intended job. in such situations disabling the dynamic snitch can fix the high latencies for top-end read percentiles. see https://issues.apache.org/jira/browse/CASSANDRA-6908
be careful though, measure and confirm hypotheses, as mis-applied solutions easily have negative effects!
Even if using TokenAwarePolicy, the driver can't work with the policy when the driver doesn't know which partition key is.
If you are using simple statements, no routing information is provided. So you need additional information to the driver by calling setRoutingKey.
The DataStax Java Driver's manual is a good friend.
http://docs.datastax.com/en/developer/java-driver/3.1/manual/load_balancing/#requirements
If TokenAware is perfectly working, CoordinatorReadLatency value is mostly same value with ReadLatency. You should check it too.
http://cassandra.apache.org/doc/latest/operating/metrics.html?highlight=coordinatorreadlatency
thanks for your reply and sorry about the delay in getting back to you.
One thing I’ve found out is that our clusters had:
dynamic_snitch_badness_threshold=0
in the config files. Changing that to the default value (0.1) helped a lot in terms of the client request latency.
The GC seems to be stable, even under high load. The pauses are constant (~10ms / sec) and I haven’t seen spikes (not even full gcs). We’re using CMS with a bigger Xmn (2.5GB).
Read repairs happen all the time (we have it set to 10% chance), so when the system is handling 800k rec/sec, we have ~80k read repairs/sec happening in background.
It also seems that we’re asking too much for the 20 machines cluster. From the client point of view, latency is quite stable until 800k qps, after that it starts to spike a little bit, but still under a manageable threshold.
Thanks for all the tips, the dynamic snitch thing was really helpful!
I have a two-node Cassandra cluster, with RF of 2. So both nodes contain 100% of data.
Now, I am running short on disk space. I can remove some old data, since they were aggregated and processed before, and I don't need them anymore.
I tried running a delete query from cqlsh, but I get a timeout. I tried increasing timeouts, but it seems that running a query from cqlsh will take much more time.
How can I disable this timeout for a single query or connection? Is there any other way, besides increasing timeout, to remove some data from a node?
My Cassandra version is 3.11.0.
PS. I increases write_request_timeout_in_ms in cassandra.yaml. Is this the correct one for delete queries?
Deletes really shouldn't timeout unless there is a problem related to something else. Its inserting a tombstone with no reads or anything and should be fast/cheap regardless of what exists already. Reading on other hand can be impacted a lot. I would guess GC related problems related to reads. You could check GC logs and maybe increase heap and reduce CMSInitiatingOccupancyFraction (if using cms and not g1).
So check GC and normal logs for issues (look for WARN, ERROR in system log) and at pause times in gc logs >1 second, there should be none.
After issuing delete you could try to do a force compaction (nodetool compact keyspace table) to see if it helps disk space. The delete by itself will not reduce disk space until the data has been compacted with the tombstone.
write_request_timeout_in_ms is the right setting, but if your hitting it something is wrong and your just masking it. It should really take less than 1 millisecond normal use.
Side note: RF=2 on a 2 node cluster is not how C* is designed to run. You have no availability on a database that sacrificed consistency for high availability.
Read-your-own-writes consistency is great improvement from the so called eventual consistency: if I change my profile picture I don't care if others see the change a minute later, but it looks weird if after a page reload I still see the old one.
Can this be achieved in Cassandra without having to do a full read-check on more than one node?
Using ConsistencyLevel.QUORUM is fine while reading an unspecified data and n>1 nodes are actually being read. However when client reads from the same node as he writes in (and actually using the same connection) it can be wasteful - some databases will in this case always ensure that the previously written (my) data are returned, and not some older one. Using ConsistencyLevel.ONE does not ensure this and assuming it leads to race conditions. Some test showed this: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/per-connection-quot-read-after-my-write-quot-consistency-td6018377.html
My hypothetical setup for this scenario is 2 nodes, replication factor 2, read level 1, write level 1. This leads to eventual consistency, but I want read-your-own-writes consistency on reads.
Using 3 nodes, RF=3, RL=quorum and WL=quorum in my opinion leads to wasteful read request if I being consistent only on "my" data is enough.
// seo: also known as: session consistency, read-after-my-write consistency
Good question.
We've had http://issues.apache.org/jira/browse/CASSANDRA-876 open for a while to add this, but nobody's bothered finishing it because
CL.ONE is just fine for a LOT of workloads without any extra gymnastics
Reads are so fast anyway that doing the extra one is not a big deal (and in fact Read Repair, which is on by default, means all the nodes get checked anyway, so the difference between CL.ONE and higher is really more about availability than performance)
That said, if you're motivated to help, ask on the ticket and I'll be happy to point you in the right direction.
I've been following Cassandra development for a little while and I haven't seen a feature like this mentioned.
That said, if you only have 2 nodes with a replication factor of 2, I would question whether Cassandra is the best solution. You are going to end up with the entire data set on each node, so a more traditional replicated SQL setup might be simpler and more widely tested. Cassandra is very promising but it is still only version 0.8.2 and problems are regularly reported on the mailing list.
The other way to solve the 'see my own updates' problem would be to cache the results somewhere closer to the client, whether in the web server, the application layer, or using something like memcached.