Cassandra: memory consumption while compacting

I'm seeing ParNew GC warnings in system.log with pauses of over 8 seconds:
WARN [Service Thread] GCInspector.java:283 - ParNew GC in 8195ms. CMS Old Gen: 22316280488 -> 22578261416; Par Eden Space: 1717787080 -> 0; Par Survivor Space: 123186168 -> 214695936
They seem to appear when minor compactions occur on a particular table:
92128ed0-46fe-11ec-bf5a-0d5dfeeee6e2 ks table 1794583380 1754598812 {1:92467, 2:5291, 3:22510}
f6e3cd30-46fc-11ec-bf5a-0d5dfeeee6e2 ks table 165814525 160901558 {1:3196, 2:24814}
334c63f0-46fc-11ec-bf5a-0d5dfeeee6e2 ks table 126097876 122921938 {1:3036, 2:24599}
The table :
is configured with LCS strategy.
average row size is 1MB
there are also some wide rows, up to 60MB (from cfhistograms; I'm not sure whether that figure accounts for the LZ4 compression applied to the table).
The heap size is 32GB.
Question :
a. How many rows must fit into memory (at once!) during the compaction process? Is it just one, or more?
b. While compacting, is each partition read into memory in decompressed form, or in compressed form?
c. Do you think the compaction process in my case could fill up all of the heap memory?
Thank you
full GC settings :
-Xms32G
-Xmx32G
#-Xmn800M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSWaitDuration=10000
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways

a. How many rows must fit into memory (at once!) during the compaction process? Is it just one, or more?
It is definitely multiple.
b. While compacting, is each partition read into memory in decompressed form, or in compressed form?
Compression only applies at the disk level. Before compaction can do anything with a partition, it has to read and decompress it.
c. Do you think the compaction process in my case could fill up all of the heap memory?
Yes, the compaction process allocates a significant amount of the heap, and running compactions will cause issues with an already stressed heap.
TBH, I see several opportunities for improvement with the GC settings listed. And right now, I think that's where the majority of the problems are. Let's start with the new gen size:
#-Xmn800M
With CMS you absolutely need to be explicit about your heap new size (Xmn). Especially with a gigantic heap. And yes, with CMS 32GB is "gigantic." The 100MB per CPU core wisdom is incorrect. With Cassandra, the heap new size should be in the range of 25% to 50% of the max heap size (Xmx). For 32GB, I'd say uncomment the Xmn line and set it to -Xmn12G.
So here is how memory is mapped out for CMS: laid out linearly, the heap is split into a new/young generation, the old generation, and the permanent generation. Major, stop-the-world collections happen on inter-generational promotion (ex: new gen to old gen).
Now let's look at these two:
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
The new gen is itself split into the Eden space and the survivor spaces S0 and S1. What you want is for all of your objects to be created, live, and die in the new gen. For that to happen, MaxTenuringThreshold (how many times an object can be copied between survivor spaces before being promoted) needs to be higher. Also, the survivor spaces need to be big enough to pull their weight. With SurvivorRatio=8, each survivor space will be 1/8th the size of the Eden space. So I'd go with these, just to start:
-XX:SurvivorRatio=2
-XX:MaxTenuringThreshold=6
That'll make the survivor spaces bigger, and allow objects to be passed between them 6 times. Hopefully, that's long enough to avoid having to promote them.
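To make the sizing concrete, here is a rough sketch of how the proposed -Xmn12G and -XX:SurvivorRatio=2 would carve up the 32GB heap, assuming the standard HotSpot rule that each survivor space gets young_gen / (SurvivorRatio + 2):

# Rough sketch (Python) of how the proposed settings divide the heap.
# Assumes the standard HotSpot formula: each survivor space is
# young_gen / (SurvivorRatio + 2), and Eden gets the remainder.
heap_gb = 32          # -Xmx32G / -Xms32G
young_gb = 12         # proposed -Xmn12G
survivor_ratio = 2    # proposed -XX:SurvivorRatio=2

survivor_gb = young_gb / (survivor_ratio + 2)   # 3 GB each for S0 and S1
eden_gb = young_gb - 2 * survivor_gb            # 6 GB
old_gb = heap_gb - young_gb                     # 20 GB

print(f"Xmn as % of Xmx: {young_gb / heap_gb:.0%}")  # 38%, within the 25-50% range
print(f"Eden: {eden_gb:.0f} GB, S0/S1: {survivor_gb:.0f} GB each, old gen: {old_gb:.0f} GB")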
Adding these will help, too:
-XX:+AlwaysPreTouch
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:-UseBiasedLocking
For more info on these ^ check out Amy's Cassandra 2.1 Tuning Guide. But with Cassandra you do want to "pre touch," you do want to enable thread local allocation blocks (TLAB), you do want those blocks to be able to be resized, and you don't want biased locking.
Pick one of your nodes, make these changes, restart, and monitor performance. If they help (which I think they will), add them to the remaining nodes, as well.
tl;dr;
I'd make these changes:
-Xmn12G
-XX:SurvivorRatio=2
-XX:MaxTenuringThreshold=6
-XX:+AlwaysPreTouch
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:-UseBiasedLocking
References:
CASSANDRA-8150 - An ultimately unsuccessful attempt to alter the default JVM settings. But the ensuing discussion resulted in one of the best compilations of JVM tuning wisdom.
Amy's Cassandra 2.1 Tuning Guide - It may be dated, but this is still one of the most comprehensive admin guides for Cassandra. Many of the settings and approaches discussed are still very relevant.

Related

Garbage collection tuning in Spark: how to estimate size of Eden?

I am reading about garbage collection tuning in Spark: The Definitive Guide by Bill Chambers and Matei Zaharia. This chapter is largely based on Spark's documentation. Nevertheless, the authors extend the documentation with an example of how to deal with too many minor collections but not many major collections.
Both official documentation and the book state that:
If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden is determined to be E, then you can set the size of the Young generation using the option -Xmn=4/3*E. (The scaling up by 4/3 is to account for space used by survivor regions as well.) (See here)
The book offers an example (Spark: The Definitive Guide, first ed., p. 324):
If your task is reading data from HDFS, the amount of memory used by the task can be estimated by using the size of the data block read from HDFS. Note that the size of a decompressed block is often two or three times the size of the block. So if you want to have three or four tasks' worth of working space, and the HDFS block size is 128 MB, we can estimate size of Eden to be 43,128 MB.
Even assuming that each decompressed block takes as much as 512 MB, that we have 4 tasks, and that we scale up by 4/3, I don't see how you arrive at an estimate of 43,128 MB of memory for Eden.
I would rather answer that ~3 GB should be enough for Eden given the book's assumptions.
Could anyone explain how this estimation should be calculated?
OK, I think the new Spark docs make it clear:
As an example, if your task is reading data from HDFS, the amount of memory used by the task can be estimated using the size of the data block read from HDFS. Note that the size of a decompressed block is often 2 or 3 times the size of the block. So if we wish to have 3 or 4 tasks' worth of working space, and the HDFS block size is 128 MB, we can estimate size of Eden to be 4*3*128MB.
So it's 4*3*128 MB (= 1536 MB), rather than the 43,128 MB printed in the book.
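Following the formula quoted from the docs, the arithmetic works out roughly like this (a sketch of the numbers only, nothing Spark-specific):

# Worked version of the docs' example: Eden sized for 4 tasks' worth of
# decompressed 128 MB HDFS blocks, then young gen scaled up by 4/3.
block_mb = 128            # HDFS block size
decompression_factor = 3  # "decompressed block is often 2 or 3 times the size"
tasks = 4                 # tasks' worth of working space

eden_mb = tasks * decompression_factor * block_mb   # 4 * 3 * 128 = 1536 MB
young_gen_mb = eden_mb * 4 / 3                      # -Xmn of roughly 2048 MB

print(f"Eden estimate E: {eden_mb} MB")
print(f"-Xmn = 4/3 * E : {young_gen_mb:.0f} MB (about 2 GB)")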

Cassandra and G1 Garbage collector stop the world event (STW)

We have a 6 node Cassandra cluster under heavy utilization. We have been dealing with a lot of garbage collector stop-the-world events, which can take up to 50 seconds on our nodes; in the meantime the Cassandra node is unresponsive, not even accepting new logins.
Extra details:
Cassandra Version: 3.11
Heap Size = 12 GB
We are using G1 Garbage Collector with default settings
Nodes size: 4 CPUs 28 GB RAM
The G1 GC behavior is the same across all nodes.
Any help would be very much appreciated!
Edit 1:
Checking object creation stats, it does not look healthy at all.
Edit 2:
I have tried the settings suggested by Chris Lohfink; here are the GC reports:
Using CMS suggested settings
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTAtNDk=
Using G1 suggested settings
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTExLTE3
The behavior remains basically the same:
Old Gen starts to fill up.
GC can't clean it properly without a full GC and a STW event.
The full GC starts to take longer, until the node is completely unresponsive.
I'm going to get the cfstats output for maximum partition size and tombstones per read asap and edit the post again.
Have you looked at using Zing? Cassandra situations like these are a classic use case, as Zing fundamentally eliminates all GC-related glitches in Cassandra nodes and clusters.
You can see some details on the how/why in my recent "Understanding GC" talk from JavaOne (https://www.slideshare.net/howarddgreen/understanding-gc-javaone-2017). Or just skip to slides 56-60 for Cassandra-specific results.
Without knowing your existing settings or possible data model problems, here's a guess at some conservative settings to try, to reduce evacuation pauses from not having enough to-space (check the GC logs):
-Xmx12G -Xms12G -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=500 -XX:-ReduceInitialCardMarks -XX:G1HeapRegionSize=32m
This should also help reduce the pause from updating the remembered set, which becomes an issue, and setting G1HeapRegionSize reduces humongous objects, which can become a problem depending on the data model. Make sure -Xmn is not set.
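For context on why G1HeapRegionSize matters here: G1 treats any single allocation of at least half a region as a "humongous" object and handles it outside the normal young-gen path, so larger regions raise that threshold. A quick sketch of the effect (region sizes chosen purely for illustration):

# G1 classifies an allocation as "humongous" once it is >= 50% of a region.
# Larger regions therefore keep bigger partitions/buffers out of the
# humongous allocation path.
def humongous_threshold_mb(region_size_mb):
    return region_size_mb / 2

for region_mb in (4, 8, 16, 32):   # 32m matches -XX:G1HeapRegionSize=32m above
    print(f"region {region_mb:>2} MB -> humongous at >= {humongous_threshold_mb(region_mb):.0f} MB")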
For what it's worth, a 12GB heap with C* is probably better suited to CMS; you can certainly get better throughput. You just need to be careful of fragmentation over time with the rather large objects that can get allocated.
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=55 -XX:MaxTenuringThreshold=3 -Xmx12G -Xms12G -Xmn3G -XX:+CMSEdenChunksRecordAlways -XX:+CMSParallelInitialMarkEnabled -XX:+CMSParallelRemarkEnabled -XX:CMSWaitDuration=10000 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCondCardMark
Most likely, though, there's an issue with the data model or you're under-provisioned.

Cassandra - how to disable memtable flush

I'm running Cassandra with a very small dataset so that the data can exist on memtable only. Below are my configurations:
In jvm.options:
-Xms4G
-Xmx4G
In cassandra.yaml,
memtable_cleanup_threshold: 0.50
memtable_allocation_type: heap_buffers
As per the documentation in cassandra.yaml, memtable_heap_space_in_mb and memtable_offheap_space_in_mb will each default to 1/4 of the heap size, i.e. ~1000MB.
According to the documentation here (http://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__memtable_cleanup_threshold), a memtable flush will be triggered if the total size of the memtable(s) goes beyond (1000+1000)*0.50 = 1000MB.
Now, if I perform several write requests resulting in almost ~300MB of data, the memtable still gets flushed (I see SSTables such as Data.db being created on the file system), and I don't understand why.
Could anyone explain this behavior and point out if I'm missing something here?
One additional trigger for memtable flushing is the amount of commit log space used: when commitlog_total_space_in_mb is exceeded, memtables covering the oldest commit log segments (32MB each by default) are flushed so those segments can be recycled.
http://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsMemtableThruput.html
http://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__commitlog_total_space_in_mb
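For reference, these are the cassandra.yaml settings involved in that calculation; the values below are illustrative examples only, not recommendations:

# Illustrative cassandra.yaml excerpt: the settings that govern when
# memtables are flushed. Values shown are examples only.
memtable_allocation_type: heap_buffers
memtable_heap_space_in_mb: 1024        # defaults to 1/4 of the heap if unset
memtable_offheap_space_in_mb: 1024     # defaults to 1/4 of the heap if unset
memtable_cleanup_threshold: 0.50       # flush the largest memtable past this ratio
commitlog_total_space_in_mb: 8192      # exceeding this also forces memtable flushes
commitlog_segment_size_in_mb: 32       # size of individual commit log segments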
Since Cassandra is meant to be persistent, it has to write to disk so it can recover the data after a node failure. If you don't need this durability, you can use a memory-based database instead, such as Redis or Memcached.
Below is the response I got from Cassandra user group, copying it here in case someone else is looking for the similar info.
After thinking about your scenario I believe your small SSTable size might be due to data compression. By default, all tables enable SSTable compression.
Let's go through your scenario. Say you have allocated 4GB to your Cassandra node. Your memtable_heap_space_in_mb and memtable_offheap_space_in_mb will each come to roughly 1GB. Since you have set memtable_cleanup_threshold to 0.50, a cleanup will be triggered when the total allocated memtable space exceeds 1/2 GB. Note that the cleanup threshold is 0.50 of 1GB, not of a combination of the heap and off-heap space. This memtable allocation is the total amount available to all tables on your node, including all system-related keyspaces. The cleanup process writes the largest memtable to disk.
In your case, I am assuming you are on a single node with only one table receiving insert activity. I do not think the commit log will trigger a flush in this circumstance, as by default the commit log has 8192 MB of space, unless it is placed on a very small disk.
I am assuming your table on disk is smaller than 500MB because of compression. You can disable compression on your table and see if this helps get the desired size.
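If you want to test that, disabling compression is a one-line change in CQL; the keyspace/table name below is just a placeholder:
ALTER TABLE my_keyspace.my_table WITH compression = {'enabled': 'false'};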
I have written up a blog post explaining memtable flushing (http://abiasforaction.net/apache-cassandra-memtable-flush/)
Let me know if you have any other question.
I hope this helps.

Cassandra In-Memory option

There is an In-Memory option introduced in the Cassandra by DataStax Enterprise 4.0:
http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/inMemory.html
But there is a 1GB size limit for an in-memory table.
Does anyone know the reasoning behind the 1GB limit? And is it possible to extend it to a larger in-memory table, say 64GB?
To answer your question: today it's not possible to bypass this limitation.
In-memory tables are stored within the JVM heap, and regardless of the amount of memory available on a single node, allocating more than 8GB to the JVM heap is not recommended.
The main reason for this limitation is that the Java garbage collector slows down when dealing with huge amounts of memory.
However, if you consider Cassandra as a distributed system, 1GB is not the real limitation:
(nodes*allocated_memory)/ReplicationFactor
allocated_memory is at most 1GB per node, so your table may contain many GB held in memory across different nodes.
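As a sketch of what that formula means in practice (the node count and replication factor below are just example numbers):

# Effective in-memory table capacity across a cluster, per the formula above.
nodes = 6                  # example cluster size
allocated_memory_gb = 1    # per-node in-memory table limit
replication_factor = 3     # example RF

capacity_gb = (nodes * allocated_memory_gb) / replication_factor
print(f"Effective in-memory capacity: {capacity_gb:.1f} GB")   # 2.0 GB in this example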
I think this will improve in the future, but dealing with 64GB in memory could be a real problem when you need to flush data to disk. One more consideration that adds to the limitation: avoid TTL when working with in-memory tables. TTL creates tombstones, and a tombstone is not deallocated until the gc_grace_seconds period passes; with the default value of 10 days, each tombstone keeps its portion of memory busy and unavailable, possibly for a long time.
HTH,
Carlo

Cassandra cfstats: differences between Live and Total used space values

For about a month I've been seeing the following used-space values for 3 of the nodes (I have replication factor = 3) in my Cassandra cluster's nodetool cfstats output:
Pending Tasks: 0
Column Family: BinaryData
SSTable count: 8145
Space used (live): 787858513883
Space used (total): 1060488819870
For other nodes I see good values, something like:
Space used (live): 780599901299
Space used (total): 780599901299
Note the roughly 25% difference (~254GB) between Live and Total space. It seems I have a lot of garbage on these 3 nodes which cannot be compacted for some reason.
The column family I'm talking about has a LeveledCompaction strategy configured with SSTable size of 100Mb:
create column family BinaryData with key_validation_class=UTF8Type
and compaction_strategy=LeveledCompactionStrategy
and compaction_strategy_options={sstable_size_in_mb: 100};
Note that the Total value has stayed like this for a month on all three nodes. I was relying on Cassandra to normalize the data automatically.
What I tried to decrease space (without result):
nodetool cleanup
nodetool repair -pr
nodetool compact [KEYSPACE] BinaryData (nothing happens: major compaction is ignored for LeveledCompaction strategy)
Are there any other things I should try to clean up the garbage and free the space?
OK, I have a solution. It looks like a Cassandra issue.
First, I dug into the Cassandra 1.1.9 sources and noticed that Cassandra performs some re-analysis of SSTables during node startup: it removes SSTables marked as compacted, recalculates the used space, and does some other things.
So what I did was restart the 3 problem nodes. The Total and Live values became equal immediately after the restart completed, then the compaction process started, and the used space is now shrinking.
Leveled compaction creates sstables of a fixed, relatively small size, in your case 100Mb, that are grouped into "levels". Within each level, sstables are guaranteed to be non-overlapping. Each level is ten times as large as the previous.
So basically, from this statement in the Cassandra docs, we can conclude that in your case the next, ten-times-larger level may not have formed yet, resulting in no compaction.
Coming to the second point: since you have set the replication factor to 3, the data has 3 copies, which contributes to this anomaly.
And finally, the 25% difference between Live and Total space is, as you know, due to delete operations.
For LeveledCompactionStrategy you want to set the sstable size to a maximum of around 15 MB. 100MB is going to cause you a lot of needless disk IO, and it will take a long time for data to propagate to higher levels, making deleted data stick around for a long time.
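If you go that route, the change mirrors the create statement shown in the question; in the cassandra-cli syntax used there it would look something like the following (same column family name as above; double-check the exact syntax against your version's cli help):
update column family BinaryData
  with compaction_strategy=LeveledCompactionStrategy
  and compaction_strategy_options={sstable_size_in_mb: 15};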
With a lot of deletes, you are most likely hitting some of the issues with minor compactions not doing a great job of cleaning up deleted data in Cassandra 1.1. There are a bunch of fixes for tombstone cleanup during minor compaction in Cassandra 1.2, especially when combined with LCS. I would take a look at testing Cassandra 1.2 in your Dev/QA environment. 1.2 still has some kinks being ironed out, so you will want to keep up to date with new versions, or even run off of the 1.2 branch in git, but for your data size and usage pattern, I think it will give you some definite improvements.
