Cassandra and G1 garbage collector stop-the-world (STW) events

We have a 6-node Cassandra cluster under heavy utilization. We have been dealing a lot with garbage collector stop-the-world (STW) events, which can take up to 50 seconds on our nodes; in the meantime the Cassandra node is unresponsive, not even accepting new logins.
Extra details:
Cassandra Version: 3.11
Heap Size = 12 GB
We are using G1 Garbage Collector with default settings
Node size: 4 CPUs, 28 GB RAM
The G1 GC behavior is the same across all nodes.
Any help would be very much appreciated!
Edit 1:
Checking object creation stats, it does not look healthy at all.
Edit 2:
I have tried the settings suggested by Chris Lohfink; here are the GC reports:
Using CMS suggested settings
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTAtNDk=
Using G1 suggested settings
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTExLTE3
The behavior remains basically the same:
Old Gen starts to fill up.
GC can't clean it properly without a full GC and an STW event.
The full GC starts to take longer, until the node is completely unresponsive.
I'm going to get the cfstats output for maximum partition size and tombstones per read asap and edit the post again.

Have you looked at using Zing? Cassandra situations like these are a classic use case, as Zing fundamentally eliminates all GC-related glitches in Cassandra nodes and clusters.
You can see some details on the how/why in my recent "Understanding GC" talk from JavaOne (https://www.slideshare.net/howarddgreen/understanding-gc-javaone-2017). Or just skip to slides 56-60 for Cassandra-specific results.

Without knowing your existing settings or possible data model problems, here's a guess at some conservative settings to try to reduce evacuation pauses caused by not having enough to-space (check the GC logs):
-Xmx12G -Xms12G -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=500 -XX:-ReduceInitialCardMarks -XX:G1HeapRegionSize=32m
This should also help reduce the pause from updating the remembered set, which can become an issue, and, by setting G1HeapRegionSize, it reduces humongous object allocations, which can become a problem depending on the data model. Make sure -Xmn is not set.
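If it helps, here is a minimal sketch of where those flags could go in Cassandra 3.11's conf/jvm.options, assuming the stock file (the CMS flags it ships with enabled would need to be commented out first):
# conf/jvm.options (sketch) -- comment out the default CMS section before enabling G1
-Xms12G
-Xmx12G
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:-ReduceInitialCardMarks
-XX:G1HeapRegionSize=32m
# leave -Xmn (and HEAP_NEWSIZE in cassandra-env.sh) unset when using G1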
For what it's worth, a 12 GB heap with C* is probably better suited to CMS; you can certainly get better throughput. You just need to be careful of fragmentation over time with the rather large objects that can get allocated.
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=55 -XX:MaxTenuringThreshold=3 -Xmx12G -Xms12G -Xmn3G -XX:+CMSEdenChunksRecordAlways -XX:+CMSParallelInitialMarkEnabled -XX:+CMSParallelRemarkEnabled -XX:CMSWaitDuration=10000 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCondCardMark
Most likely there's an issue with the data model or you're under-provisioned, though.
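Either way, verifying any of this requires the GC logs mentioned above. A sketch of typical JDK 8 GC-logging flags (Cassandra 3.11 runs on Java 8, and its stock jvm.options ships with similar lines mostly commented out; the log path here is just an example):
-Xloggc:/var/log/cassandra/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M
The resulting log is what tools like gceasy.io (used for the reports above) analyze.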


Cassandra: memory consumption while compacting

I have ParNew GC warnings in system.log with pauses of over 8 seconds:
WARN [Service Thread] GCInspector.java:283 - ParNew GC in 8195ms. CMS Old Gen: 22316280488 -> 22578261416; Par Eden Space: 1717787080 -> 0; Par Survivor Space: 123186168 -> 214695936
It seems to happen when minor compactions occur on a particular table:
92128ed0-46fe-11ec-bf5a-0d5dfeeee6e2 ks table 1794583380 1754598812 {1:92467, 2:5291, 3:22510}
f6e3cd30-46fc-11ec-bf5a-0d5dfeeee6e2 ks table 165814525 160901558 {1:3196, 2:24814}
334c63f0-46fc-11ec-bf5a-0d5dfeeee6e2 ks table 126097876 122921938 {1:3036, 2:24599}
The table:
is configured with the LCS strategy.
average row size is 1 MB
there are also some wide rows, up to 60 MB (from cfhistograms; I don't know whether that includes the LZ4 compression applied to those rows?).
The heap size is 32GB.
Questions:
a. How many rows must fit into memory (at once!) during the compaction process? Is it just one, or more?
b. While compacting, is each partition read into memory in decompressed form, or in compressed form?
c. Do you think the compaction process in my case could fill up all the heap memory?
Thank you
Full list of GC settings:
-Xms32G
-Xmx32G
#-Xmn800M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSWaitDuration=10000
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways
a. How many rows must fit into memory (at once!) during the compaction process? Is it just one, or more?
It is definitely multiple.
b. While compacting, is each partition read into memory in decompressed form, or in compressed form?
The compression only works at the disk level. Before compaction can do anything with it, it needs to decompress and read it.
c. Do you think the compaction process in my case could fill up all the heap memory?
Yes, the compaction process allocates a significant amount of the heap, and running compactions will cause issues with an already stressed heap.
TBH, I see several opportunities for improvement with the GC settings listed. And right now, I think that's where the majority of the problems are. Let's start with the new gen size:
#-Xmn800M
With CMS you absolutely need to be explicit about your heap new size (Xmn). Especially with a gigantic heap. And yes, with CMS 32GB is "gigantic." The 100MB per CPU core wisdom is incorrect. With Cassandra, the heap new size should be in the range of 25% to 50% of the max heap size (Xmx). For 32GB, I'd say uncomment the Xmn line and set it to -Xmn12G.
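If you manage the heap from cassandra-env.sh rather than with raw JVM flags, a minimal sketch of the equivalent (MAX_HEAP_SIZE and HEAP_NEWSIZE are the stock cassandra-env.sh variables; set both or neither):
# cassandra-env.sh (sketch): 32 GB heap with a 12 GB new gen for CMS
MAX_HEAP_SIZE="32G"
HEAP_NEWSIZE="12G"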
Now let's look at these two:
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
Laid out linearly, the heap is split into a new/young generation, the old generation, and the permanent generation. Major, stop-the-world collections happen on inter-generational promotion (ex: new gen to old gen).
The new gen itself is split into the Eden space and the survivor spaces S0 and S1. What you want is for all your objects to be created, live, and die in the new gen. For that to happen, MaxTenuringThreshold (how many times an object can be copied between survivor spaces before being promoted) needs to be higher. Also, the survivor spaces need to be big enough to pull their weight. With SurvivorRatio=8, each survivor space will be 1/8th the size of the Eden space. So I'd go with these, just to start:
-XX:SurvivorRatio=2
-XX:MaxTenuringThreshold=6
That'll make the survivor spaces bigger, and allow objects to be passed between them 6 times. Hopefully, that's long enough to avoid having to promote them.
Adding these will help, too:
-XX:+AlwaysPreTouch
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:-UseBiasedLocking
For more info on these ^ check out Amy's Cassandra 2.1 Tuning Guide. But with Cassandra you do want to "pre touch," you do want to enable thread local allocation blocks (TLAB), you do want those blocks to be able to be resized, and you don't want biased locking.
Pick one of your nodes, make these changes, restart, and monitor performance. If they help (which I think they will), add them to the remaining nodes, as well.
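A sketch of that canary rollout, assuming a systemd-managed install and stock nodetool (the service name and log path may differ on your systems):
# drain and restart the canary node with the new settings
nodetool drain
sudo systemctl restart cassandra
# then watch GC behavior for a while
nodetool gcstats
grep GCInspector /var/log/cassandra/system.log | tail -20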
tl;dr;
I'd make these changes:
-Xmn12G
-XX:SurvivorRatio=2
-XX:MaxTenuringThreshold=6
-XX:+AlwaysPreTouch
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:-UseBiasedLocking
References:
CASSANDRA-8150 - An ultimately unsuccessful attempt to alter the default JVM settings. But the ensuing discussion resulted in one of the best compilations of JVM tuning wisdom.
Amy's Cassandra 2.1 Tuning Guide - It may be dated, but this is still one of the most comprehensive admin guides for Cassandra. Many of the settings and approaches discussed are still very relevant.

How to tune Cassandra for a large bare-metal server deployment

I have Cassandra deployed on large bare-metal servers: 56 cores, 756 GB RAM, and 20 TB of SSD. (I know it's an antipattern, but I have no option to create VMs or anything.) It's a 10-node cluster. What settings are important for such deployments?
I have a read- and write-heavy workload and am running into long compaction times leading to read and write timeouts.
I don't see CPU, memory, disk, or network being a bottleneck.
So I have a saying with dense node architectures: "Big servers equal big problems."
I can think of a few things off the top of my head which might help.
In the cassandra.yaml, check these two settings:
concurrent_compactors: 2
compaction_throughput_mb_per_sec: 16
Specifically, concurrent_compactors is one of those that can be set proportionally to the number of CPU cores. I wouldn't go too high, but maybe test it by increasing it by a factor of 2 and see if you notice anything. Also, with your resources, you should be able to set compaction_throughput_mb_per_sec to at least 256. The good news about this one is that you can set it ephemerally with nodetool, just to try it out.
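For example, a sketch of trying the throughput change ephemerally before persisting it (both nodetool subcommands are stock; 256 MB/s is just the value suggested above):
# check the current value, then raise it on this node only
nodetool getcompactionthroughput
nodetool setcompactionthroughput 256
# if it helps, persist it (and the compactor test) in cassandra.yaml:
# compaction_throughput_mb_per_sec: 256
# concurrent_compactors: 4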
Make sure that the disk-specific settings are optimized for SSDs:
disk_optimization_strategy: ssd
trickle_fsync: true
And make sure that the servers are set to use the G1GC collector, and you could probably afford to have a large heap of 32GB or so.
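A minimal sketch of that in conf/jvm.options, assuming a Cassandra 3.x stock file (the CMS section it ships with would need to be commented out):
# conf/jvm.options (sketch): G1 with a large fixed heap
-Xms32G
-Xmx32G
-XX:+UseG1GC
-XX:MaxGCPauseMillis=500
# leave -Xmn / HEAP_NEWSIZE unset with G1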
Also, take a read through Amy Tobey's Cassandra 2.1 Tuning Guide. She has lots of good info in there that still applies to Cassandra 3.
TBH though, Alex is right. The biggest wins are going to be in adjusting your table definitions. Cassandra's performance has more to do with the data model definition than anything else. If that's not correct, there's very little that any server-side "tuning" can help with.

How to restrict Cassandra to a fixed amount of memory

We are using Cassandra to collect data from ThingsBoard. The memory it started with was 4 GB (as reported by systemctl status for Cassandra), and after 15 hours it has reached 9.3 GB.
I want to know why there is this much of an increase in memory, and whether there is any way to control it or restrict Cassandra to a fixed amount of memory without losing data.
Check this for setting the max heap size used, but tune the Cassandra GC properly when you change it.
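A minimal sketch, assuming a package install where the heap is controlled from cassandra-env.sh (paths and values are examples; the process will still use some off-heap memory on top of the heap):
# cassandra-env.sh (sketch): fix the JVM heap instead of letting it be auto-calculated
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="800M"   # set together with MAX_HEAP_SIZE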

Cassandra High client read request latency compared to local read latency

We have a 20-node Cassandra cluster serving a lot of read requests (~900k/sec at peak). Our dataset is fairly small, so everything is served directly from memory (OS page cache). Our data model is quite simple (just key/value) and all of our reads are performed with consistency level ONE (RF 3).
We use the Java Datastax driver with TokenAwarePolicy, so all of our reads should go directly to one node that has the requested data.
These are some metrics extracted from one of the nodes regarding client read request latency and local read latency.
org_apache_cassandra_metrics_ClientRequest_50thPercentile{scope="Read",name="Latency",} 105.778
org_apache_cassandra_metrics_ClientRequest_95thPercentile{scope="Read",name="Latency",} 1131.752
org_apache_cassandra_metrics_ClientRequest_99thPercentile{scope="Read",name="Latency",} 3379.391
org_apache_cassandra_metrics_ClientRequest_999thPercentile{scope="Read",name="Latency",} 25109.16
org_apache_cassandra_metrics_Keyspace_50thPercentile{keyspace="<keyspace>",name="ReadLatency",} 61.214
org_apache_cassandra_metrics_Keyspace_95thPercentile{keyspace="<keyspace>",name="ReadLatency",} 126.934
org_apache_cassandra_metrics_Keyspace_99thPercentile{keyspace="<keyspace>",name="ReadLatency",} 182.785
org_apache_cassandra_metrics_Keyspace_999thPercentile{keyspace="<keyspace>",name="ReadLatency",} 454.826
org_apache_cassandra_metrics_Table_50thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 105.778
org_apache_cassandra_metrics_Table_95thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 1131.752
org_apache_cassandra_metrics_Table_99thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 3379.391
org_apache_cassandra_metrics_Table_999thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 25109.16
Another important detail is that most of our queries (~70%) don't return anything, i.e., they are for records not found. So, bloom filters play an important role here and they seem to be fine:
Bloom filter false positives: 27574
Bloom filter false ratio: 0.00000
Bloom filter space used:
Bloom filter off heap memory used: 6760992
As can be seen, the reads on each of the nodes are really fast: the 99.9th percentile is less than 0.5 ms. However, the client request latency is much higher, going above 4 ms at the 99th percentile. If I'm reading with CL ONE and using TokenAwarePolicy, shouldn't both values be similar, since no coordination is required? Am I missing something? Is there anything else I could check to see what's going on?
Thanks in advance.
#luciano
there are various reasons why the coordinator and the replica can report different 99th percentiles for read latencies, even with token awareness configured in the client.
these can be anything that manifests between the coordinator code and the replica's storage engine code in the read path.
examples can be:
read repairs (not directly related to a particular request, as they are asynchronous to the read that triggered them, but they can cause issues),
host timeouts (and/or speculative retries),
token awareness failure (dynamic snitch simply not keeping up),
GC pauses,
look for metrics anomalies per host and overlaps with GC, and even try to capture traces for some of the slower requests and investigate whether they're doing everything you expect from C* (e.g. token awareness).
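a sketch of capturing such traces with probabilistic tracing (the nodetool subcommand and the system_traces keyspace are stock; the probability is just an example):
# trace ~0.1% of requests handled by this node
nodetool settraceprobability 0.001
# later, inspect the slowest recorded sessions via cqlsh:
# SELECT session_id, duration, coordinator FROM system_traces.sessions LIMIT 20;
# and remember to turn tracing back off
nodetool settraceprobability 0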
well-tuned and spec'd clusters may also witness the dynamic snitch simply not being able to keep up and do its intended job. in such situations disabling the dynamic snitch can fix the high latencies for top-end read percentiles. see https://issues.apache.org/jira/browse/CASSANDRA-6908
be careful though, measure and confirm hypotheses, as mis-applied solutions easily have negative effects!
Even if you are using TokenAwarePolicy, the driver can't apply the policy when it doesn't know what the partition key is.
If you are using simple statements, no routing information is provided, so you need to give the driver that information by calling setRoutingKey.
The DataStax Java Driver's manual is a good friend.
http://docs.datastax.com/en/developer/java-driver/3.1/manual/load_balancing/#requirements
If token awareness is working perfectly, the CoordinatorReadLatency value is mostly the same as the ReadLatency value. You should check that too.
http://cassandra.apache.org/doc/latest/operating/metrics.html?highlight=coordinatorreadlatency
thanks for your reply and sorry about the delay in getting back to you.
One thing I’ve found out is that our clusters had:
dynamic_snitch_badness_threshold=0
in the config files. Changing that to the default value (0.1) helped a lot in terms of the client request latency.
The GC seems to be stable, even under high load. The pauses are constant (~10ms / sec) and I haven’t seen spikes (not even full gcs). We’re using CMS with a bigger Xmn (2.5GB).
Read repairs happen all the time (we have it set to a 10% chance), so when the system is handling 800k rec/sec, we have ~80k read repairs/sec happening in the background.
It also seems that we're asking too much of a 20-machine cluster. From the client's point of view, latency is quite stable up to 800k qps; after that it starts to spike a little bit, but it's still under a manageable threshold.
Thanks for all the tips, the dynamic snitch thing was really helpful!

Cassandra 1.2.x - RAM, heap, and JRE parameters - questions for better understanding

May I ask some questions to get a better understanding of Cassandra, the JRE, and RAM configuration (referring to v1.2.5 and the documentation as of May 2013)?
The current documentation and lots of Google research still left me with some open questions.
I'm interested in using Cassandra as a simple embedded datastore for a few hundred GB of data on 6 machines distributed across 3 locations, which also run a Java application.
1) Cassandra's stack sizing
The Windows .bat file has a default set to 1 GB, which I think is a bug; the Linux cassandra-env.sh defines 180k. Is this a "just leave it at 180k and forget about stack size" thing?
2) Cassandra's RAM usage
When using JNA, system RAM is basically split into 3 main areas:
Cassandra uses the assigned Java heap
Cassandra uses extra RAM obtained via JNA
The operating system uses the leftover RAM as disk cache
Current documentation basically only recommends: "don't set Java heap size higher than 8GB"
Is this info still up to date? (It could be that this statement dates from a time when the CMS garbage collector wasn't included in Java 1.6.)
How do I limit the JNA heap (is it the 'row_cache_size_in_mb' parameter?)
What is a good rule of thumb for laying out the 3 RAM areas (Java heap, JNA extra heap, OS cache) on a dedicated system running Cassandra 1.2.x?
when there is lots of RAM (128 GB)?
when there is little RAM (4 GB)?
(I know about the heap size calculator; this question is more about theoretical understanding and up-to-date info.)
3) Java Runtime
Why is the recommendation still to use Java 1.6 and not Java 1.7?
Is this a "maturity" operational recommendation?
Are there specific problems known from the near past?
Or just waiting a bit more until more people report flawless operation with 1.7?
4) Embedding Cassandra
The "-XX:MaxTenuringThreshold=1" in the C* start scripts is a slight hint to separate Cassandra from application code, which usually lives better with higher threshold. On the other hand the "1" might also be a bit outdated -
Is this setting still that important? (since now using CMS Garbage collector and JNA-RAM and maybe even using Java1.7?)
1) Are you looking at Xmx? I don't see Xss at all in cassandra.bat
2) Mostly correct, though Cassandra hasn't actually required JNA for off-heap allocation for a long time now (since 1.0, IIRC).
You don't want a heap larger than 8 GB because CMS and G1 still choke and cause STW pauses eventually. Short explanation: fragmentation. Longer: http://www.scribd.com/doc/37127094/GCTuningPresentationFISL10
Cassandra does off-heap allocation for the row cache and for storage engine metadata. The former is straightforward to tune; the latter is not. Basically, you need to have about 20 GB of RAM per TB of compressed data, end of story. Things you can do to reduce memory usage include disabling compression, reducing bloom filter accuracy, and increasing index_interval, all of which are going to reduce your performance, other things being equal.
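For illustration, hedged sketches of those knobs in their Cassandra 1.2-era form (option names and locations changed in later versions):
# cassandra.yaml: sample fewer index entries per SSTable (costs some read CPU)
index_interval: 512    # default was 128
# per table, via cqlsh: loosen the bloom filter and/or disable compression
# ALTER TABLE ks.cf WITH bloom_filter_fp_chance = 0.1;
# ALTER TABLE ks.cf WITH compression = {'sstable_compression': ''};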
3) Maturity. We're late adopters; we have fewer problems that way. Cassandra 2.0 will require Java 7.
4) This is not outdated.
