nodetool gcstats "GC Reclaimed (MB)" value to high

nodetool gcstats "GC Reclaimed (MB)" value to high - cassandra

I have been monitoring gcstats from last couple of days and can't believe the value it return is correct.
nodetool gcstats [GC Reclaimed (MB)] shows below values in last 5 runs when nothing is running against the database
30356680056
663531768
4222674760
567091224
2147418944
The total keyspace size is less than 1 GB.

Thats the result of the java's jvm garbage collection not anything specific to C*. The difference of the garbage collector mbean values: http://docs.oracle.com/javase/7/docs/api/java/lang/management/GarbageCollectorMXBean.html since the last time gcstats was called.

Related

Why is the default value of spark.memory.fraction so low?

From the Spark configuration docs, we understand the following about the spark.memory.fraction configuration parameter:
Fraction of (heap space - 300MB) used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records. Leaving this at the default value is recommended.
The default value for this configuration parameter is 0.6 at the time of writing this question. This means that for an executor with, for example, 32GB of heap space and the default configurations we have:
300MB of reserved space (a hardcoded value on this line)
(32GB - 300MB) * 0.6 = 19481MB of shared memory for execution + storage
(32GB - 300MB) * 0.4 = 12987MB of user memory
This "user memory" is (according to the docs) used for the following:
The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
On an executor with 32GB of heap space, we're allocating 12,7GB of memory for this, which feels rather large!
Do these user data structures/internal metadata/safeguarding against OOM errors really need that much space? Are there some striking examples of user memory usage which illustrate the need of this big of a user memory region?

I did some research and imo its 0.6 not to ensure enough memory for user memory but to ensure that execution + storage can fit into old gen region of jvm
Here i found something interesting: Spark tuning
The tenured generation size is controlled by the JVM’s NewRatio
parameter, which defaults to 2, meaning that the tenured generation is
2 times the size of the new generation (the rest of the heap). So, by
default, the tenured generation occupies 2/3 or about 0.66 of the
heap. A value of 0.6 for spark.memory.fraction keeps storage and
execution memory within the old generation with room to spare. If
spark.memory.fraction is increased to, say, 0.8, then NewRatio may
have to increase to 6 or more.
So by default in OpenJvm this ratio is set to 2 so you have 0,66% for old-gen, they choose to use 0,6 to have small margin
I found that in version 1.6 this was changed to 0,75 and it was causing some issues, here is Jira ticket
In the description you will find sample code which is adding records to cache just to use whole memory reserved for exeution + storage. With storage + execution set to higher amount than old gen overhead for gc was really high and code which was executed on older version (with this setting equal to 0.6) was 6 time faster (40-50 sec vs 6 min)
There was discussion and community decided to roll it back to 0.6 in Spark 2.0, here is PR with changes
I think that if you want to increase performance a little bit, you can try to change it up to 0.66 but if you want to have more memory for execution+storageyou need to also adjust your jvm and change old/new ratio as well otherwise you may face performance issues

GC pause (G1 Evacuation Pause) issue in Cassandra

I'm getting the following error in the application logs that trying to connect to a Cassandra cluster with 6 nodes
com.datastax.driver.core.exceptions.OperationTimedOutException: [DB1:9042] Timed out waiting for server response
I have set java heap memory to 8GB (-Xms8G -Xmx8G), wondering if 8 GB is too much?
Below is time out configuration in cassandra.yaml
read_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 20000
write_request_timeout_in_ms: 10000
request_timeout_in_ms: 20000
In the application there aren't heavy delete or update statements, so my question is what else may cause the long GC pause? the majority types of the GC pause that I can see in the log is G1 Evacuation Pause, what does it mean exactly?

The heap size heavily depends on the amount of data that you need to process. Usually for production workload minimum of 16Gb was recommended. Also, G1 isn't very effective on the small heaps (less than 12Gb) - it's better to use default ParNewGC for such heap sizes, but you may need to tune GC to get better performance. Look into this blog post that explains tuning of the GC.
Regarding your question on the "G1 Evacuation Pause" - look into this blog posts: 1 and 2. Here is quote from 2nd post:
Evacuation Pause (G1 collection) in which live objects are copied from one set of regions (young OR young+old) to another set. It is a stop-the-world activity and all
the application threads are stopped at a safepoint during this time.
For you this means that you're filling regions very fast and regions are big, so it requires significant amount of time to copy data.

Why is counter cache not being utilized after I increase the size?

My Cassandra application entails primarily counter writes and reads. As such, having a counter cache is important to performance. I increased the counter cache size in cassandra.yaml from 1000 to 3500 and did a cassandra service restart. The results were not what I expected. Disk use went way up, throughput went way down and it appears the counter cache is not being utilized at all based on what I'm seeing in nodetool info (see below). It's been almost two hours now and performance is still very bad.
I saw this same pattern yesterday when I increased the counter cache from 0 to 1000. It went quite awhile without using the counter cache at all and then for some reason it started using it. My question is whether there is something I need to do to activate counter cache utilization?
Here are my settings in cassandra.yaml for the counter cache:
counter_cache_size_in_mb: 3500
counter_cache_save_period: 7200
counter_cache_keys_to_save: (currently left unset)
Here's what I get out of nodetool info after about 90 minutes:
Gossip active : true
Thrift active : false
Native Transport active: false
Load : 1.64 TiB
Generation No : 1559914322
Uptime (seconds) : 6869
Heap Memory (MB) : 15796.00 / 20480.00
Off Heap Memory (MB) : 1265.64
Data Center : WDC07
Rack : R10
Exceptions : 0
Key Cache : entries 1345871, size 1.79 GiB, capacity 1.95 GiB, 67936405 hits, 83407954 requests, 0.815 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 5294462, size 778.34 MiB, capacity 3.42 GiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 24064, size 1.47 GiB, capacity 1.47 GiB, 65602315 misses, 83689310 requests, 0.216 recent hit rate, 3968.677 microseconds miss latency
Percent Repaired : 8.561186035383143%
Token : (invoke with -T/--tokens to see all 256 tokens)
Here's a nodetool info on the Counter Cache prior to increasing the size:
Counter Cache : entries 6802239, size 1000 MiB, capacity 1000 MiB,
57154988 hits, 435820358 requests, 0.131 recent hit rate,
7200 save period in seconds
Update:
I've been running for several days now trying various values of the counter cache size on various nodes. It is consistent that the counter cache isn't enabled until it reaches capacity. That's just how it works as far as I can tell. If anybody knows a way to enable the cache before it is full let me know. I'm setting it very high because it seems optimal but that means that the cache is down for several hours while it fills up and while it's down my disks are absolutely maxed out with read requests...
Another update:
Further running shows that occasionally the counter cache does kick in before it fills up. I really don't know why that is. I don't see a pattern yet. I would love to know the criteria for when this does and does not work.
One last update:
While the counter cache is filling up native transport is disabled for the node as well. Setting the counter to 3.5 GB I'm now going 24 hours with the node in this low performance state with native transport disabled.

I have found out a way to 100% of the time avoid the counter cache not being enabled and native transport mode disabled. This approach avoids the serious performance problems I encountered waiting for the counter cache to enable (sometimes for hours in my case since I want a large counter cache):
1. Prior to starting Cassandra, set cassandra.yaml file field counter_cache_size_in_mb to 0
2. After starting cassandra and getting it up and running use node tool commands to set the cache sizes:
Example command:
nodetool setcachecapacity 2000 0 1000
In this example, the first value of 2000 sets the key cache size, the second value of 0 is the row cache size and the third value of 1000 is the counter cache size.
Take measurements and decide if those are the optimal values. If not, you can repeat step two without restarting Cassandra with new values as needed
Further details:
Some things that don't work:
Setting the counter_cache_size_in_mb value if the counter cache is not yet enabled. This is the case where you started Cassandra with a non-zero value in counter_cache_size_in_mb in Cassandra.yaml and you have not yet reached that size threshold. If you do this, the counter cache will never enabled. Just don't do this. I would call this a defect but it is the way things currently work.
Testing that I did:
I tested this on five separate nodes multiple times with multiple values. Both initially when Cassandra is just coming up and after some period of time. This method I have described worked in every case. I guess I should have saved some screenshots of nodetool info to show results.
One last thing: If Cassandra developers are watching could they please consider tweaking the code so that this workaround isn't necessary?

Severe degradation in Cassandra Write performance with continuous streaming data over time

I notice a severe degradation in Cassandra write performance with continuous writes over time.
I am inserting time series data with time stamp (T) as the column name in a wide column that stores 24 hours worth of data in a single row.
Streaming data is written from data generator (4 instances, each with 256 threads) inserting data into multiple rows in parallel.
Additionally, data is also inserted into a column family that has indexes over DateType and UUIDType.
CF1:
Col1 | Col2 | Col3(DateType) | Col(UUIDType4) |
RowKey1
RowKey2
:
:
CF2 (Wide column family):
RowKey1 (T1, V1) (T2, V3) (T4, V4) ......
RowKey2 (T1, V1) (T3, V3) .....
:
:
The no. of data points inserted/sec decreases over time until no further inserts are possible. The initial performance is of the order of 60000 ops/sec for ~6-8 hours and then it gradually tapers down to 0 ops/sec. Restarting the DataStax_Cassandra_Community_Server on all nodes helps restore the original throughput, but the behaviour is observed again after a few hours.
OS: Windows Server 2008
No.of nodes: 5
Cassandra version: DataStax Community 1.2.3
RAM: 8GB
HeapSize: 3GB
Garbage collector: default settings [ParNewGC]
I also notice a phenomenal increase in the no. of Pending write requests as reported by the OpsCenter (~of magnitude 200,000) when the performance begins to degrade.
I fail to understand what is preventing the write operations to be completed and why do they pile up over time? I do not see anything suspicious in the Cassandra logs.
Has the OS settings got anything to do with this?
Any suggestions to probe this issue further?

Do you see an increase in pending compactions (nodetool compactionstats)? Or are you seeing blocked flush writers (nodetool tpstats)? I'm guessing you're writing data to Cassandra faster than it can be consumed.
Cassandra won't block on writes, but that doesn't mean that you won't see an increase in the amount of heap used. Pending writes have overhead, as do blocked memtables. In addition, each SSTable has some memory overhead. If compactions fall behind this is magnified. At some point you probably don't have enough headroom in your heap to allocate the objects required for a single write, and you end up spending all your time waiting for an allocation that the GC can't provide.
With increased total capacity, or more IO on the machines consuming the data you would be able to sustain this write rate, but everything indicates you don't have enough capacity to sustain that load over time.

Bringing your write timeout in line with the new default in 2.0 (of 2s instead of 10s) will help with your write backlog by allowing load shedding to kick in faster: https://issues.apache.org/jira/browse/CASSANDRA-6059

How to prevent heap filling up

Firstly forgive me for what might be a very naive question.
I am on a mission to identify the right nosql database for my project.
I was inserting and updating records in the table (column family) in highly concurrent fashion.
Then i encountered this.
INFO 11:55:20,924 Writing Memtable-scan_request#314832703(496750/1048576 serialized/live bytes, 8204 ops)
INFO 11:55:21,084 Completed flushing /var/lib/cassandra/data/mykey/scan_request/mykey-scan_request-ic-14-Data.db (115527 bytes) for commitlog position ReplayPosition(segmentId=1372313109304, position=24665321)
INFO 11:55:21,085 Writing Memtable-scan_request#721424982(1300975/2097152 serialized/live bytes, 21494 ops)
INFO 11:55:21,191 Completed flushing /var/lib/cassandra/data/mykey/scan_request/mykey-scan_request-ic-15-Data.db (304269 bytes) for commitlog position ReplayPosition(segmentId=1372313109304, position=26554523)
WARN 11:55:21,268 Heap is 0.829968311377531 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN 11:55:21,268 Flushing CFS(Keyspace='mykey', ColumnFamily='scan_request') to relieve memory pressure
INFO 11:55:25,451 Enqueuing flush of Memtable-scan_request#714386902(324895/843149 serialized/live bytes, 5362 ops)
INFO 11:55:25,452 Writing Memtable-scan_request#714386902(324895/843149 serialized/live bytes, 5362 ops)
INFO 11:55:25,490 Completed flushing /var/lib/cassandra/data/mykey/scan_request/mykey-scan_request-ic-16-Data.db (76213 bytes) for commitlog position ReplayPosition(segmentId=1372313109304, position=27025950)
WARN 11:55:30,109 Heap is 0.9017950505664833 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid8849.hprof ...
Heap dump file created [1359702396 bytes in 105.277 secs]
WARN 12:25:26,656 Flushing CFS(Keyspace='mykey', ColumnFamily='scan_request') to relieve memory pressure
INFO 12:25:26,657 Enqueuing flush of Memtable-scan_request#728952244(419985/1048576 serialized/live bytes, 6934 ops)
Its to be noticed that i was able to insert & update around 6 million records before i got this. I am using cassandra on a single node. In-spite of the hint in the logs, i am not able to decide what config to change. I did check into the bin/cassandra shell script and i see they have done lots of manipulation before they came up with the -Xms & -Xmx values.
Kindly advice.

First, you can run
ps -ef|grep cassandra
to see what -Xmx is set to in your Cassandra. The default values of -Xms and -Xmx are based on the amount of your system's memory.
Check this for details:
http://www.datastax.com/documentation/cassandra/1.2/index.html?pagename=docs&version=1.2&file=index#cassandra/operations/ops_tune_jvm_c.html
You can try to increase MAX_HEAP_SIZE (in conf/cassandra-env.sh) to see if the problem would go away.
For example, you can replace
MAX_HEAP_SIZE="${max_heap_size_in_mb}M"
with
MAX_HEAP_SIZE="2048M"

I assume that tuning the Garbage Collector for Cassandra might solve the OOM error. Cassandra use Concurrent mark-and-sweep(CMS) JVM implementation for Garbage Collector when we use default settings.Most oftenly the CMS garbage collector would only start after the heap is almost fully populated. But the CMS process itself takes some time to finish and the problem is that the JVM runs out of space before the CMS process finished.We can set the percentage of used old generation space which triggers the CMS with following options in bin/cassandra.in.sh file under JAVA_OPTS variable
-XX:CMSInitiatingOccupancyFraction={percentage} - This controls the percentage of the old generation when the CMS is triggered and we can set this bit lower value to hold until CMS process finished.
-XX:+UseCMSInitiatingOccupancyOnly - This parameter ensures the percentage is kept fixed
Also with the following options we can achieve incremental CMS
-XX:+UseConcMarkSweepGC \
-XX:+CMSIncrementalMode \
-XX:+CMSIncrementalPacing \
-XX:CMSIncrementalDutyCycleMin=0 \
-XX:+CMSIncrementalDutyCycle=10
We can increase the parallel CMS threads considering the number of cores of the CPU
-XX:ParallelCMSThreads={numberOfTreads}
Further we can tune the garbage collection for young generation for make the process optimum. Here we have to control the amount of re-used objects
Increasing the size of the young generation
Delay the young generation objects promotion for old generation
To achieve this we can set following parameters
-XX:NewSize={size} - Determine the size of the young generation
-XX:NewMaxSize={size} - This is the maximum size of the young generation
-Xmn{size} - Fix the maximum size
-XX:NewRatio={n} - Set the ratio of young generation to old generation
Before the objects are migrate to the old generation from young generation they are put in to phase called "young survior". So we can controll the migration of the objects to old generation using following parameters
-XX:SurvivorRatio={n} - ration of "young eden" to "young survivors"
-XX:MaxTenuringThreshold={age} Number of objects to be migrated to old generation

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string