Heap Memory default allocation in Cassandra

As per cassandra-env.sh, the default heap memory allocation for 440G of total RAM should be 32765M (the maximum cap before the JVM switches to 64-bit references).
So why is it showing 32210157568 bytes (30718M) when I query "java -XX:+PrintCommandLineFlags -version" or "java -XX:+PrintFlagsFinal -version | grep -iE 'MaxHeapSize'"?
Why is there a difference of around 2G?
FYI: the jvm.options file was left at its defaults, and we are using DSE 5.1.3.

java -XX:+PrintFlagsFinal has nothing to do with Cassandra, and I don't know why you mention cassandra-env.sh. Anyway, let me answer the main part of the question.
In JDK 8, when -Xmx is not specified, the maximum heap size is estimated as
MaxHeapSize = min(1/4 RAM, max_heap_for_compressed_oops)
In your case the server has plenty of RAM, so the default heap size is limited by the maximum size supported by zero-based compressed oops, that is, 32 GB.
The heap obviously cannot start at the zero address (the null page is reserved by the OS), and the default heap alignment is 2 MB, so at least 2 MB must be subtracted.
Then, the JDK prefers to allocate the heap at HeapBaseMinAddress, which is equal to 2 GB on Linux. This leaves some room for the native heap of the process to grow. For this reason the JVM reduces the default maximum heap size by HeapBaseMinAddress.
That's why the final computed heap size is equal to
32 GB - 2 MB - 2 GB = 32210157568
If you give up the requirement for the zero-based compressed oops, you may set -XX:HeapBaseMinAddress=0. In this case the computed heap size would be
32 GB - 2 MB = 32766 MB
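To sanity-check the arithmetic, here is a minimal Java sketch that just reproduces the subtraction described above; the 32 GB compressed-oops limit, 2 GB HeapBaseMinAddress and 2 MB alignment are the values quoted in this answer, not something the program queries from the JVM.

// A minimal sketch: reproduce the default-heap subtraction described above.
public class DefaultHeapCalc {
    public static void main(String[] args) {
        long compressedOopsLimit = 32L * 1024 * 1024 * 1024; // 32 GB, zero-based compressed oops limit
        long heapBaseMinAddress  = 2L * 1024 * 1024 * 1024;  // 2 GB, Linux default
        long heapAlignment       = 2L * 1024 * 1024;         // 2 MB default heap alignment
        System.out.println(compressedOopsLimit - heapBaseMinAddress - heapAlignment);
        // prints 32210157568, matching the PrintFlagsFinal output in the question
    }
}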

Related

Explanation about Executor Summary in Spark Web UI

I have a few questions about the Executor section of Spark Web UI:
I see two numbers such as 0.0 B / 434.4 MiB under Storage Memory, On Heap Storage Memory, and Off Heap Storage Memory. What are those?
Also, is Storage Memory the sum of On Heap Storage Memory and Off Heap Storage Memory?
With regard to Off Heap Storage Memory, does it come purely from off-heap memory (set by spark.memory.offHeap.size), or does it include spark.executor.memoryOverhead? I'm not so sure the latter can be used for DataFrame storage, though.
Thank you in advance!
I'm not exactly sure which version you're on, so I'll base this answer on version 3.3.1 (the latest version at the time of writing):
We can understand what those 2 numbers are by looking at the HTML code that generates this page.
Storage Memory: Memory used / total available memory for storage of data like RDD partitions cached in memory.
On Heap Storage Memory: Memory used / total available memory for on heap storage of data like RDD partitions cached in memory.
Off Heap Storage Memory: Memory used / total available memory for off heap storage of data like RDD partitions cached in memory.
Storage Memory is indeed the sum of On Heap and Off Heap memory usage, both for the memory used:
/**
* Storage memory currently in use, in bytes.
*/
final def storageMemoryUsed: Long = synchronized {
  onHeapStorageMemoryPool.memoryUsed + offHeapStorageMemoryPool.memoryUsed
}
and for the total available memory:
/** Total amount of memory available for storage, in bytes. */
private def maxMemory: Long = {
  memoryManager.maxOnHeapStorageMemory + memoryManager.maxOffHeapStorageMemory
}
The off heap storage memory comes purely from the spark.memory.offHeap.size parameter, as can be seen here:
protected[this] val maxOffHeapMemory = conf.get(MEMORY_OFFHEAP_SIZE)
protected[this] val offHeapStorageMemory =
  (maxOffHeapMemory * conf.get(MEMORY_STORAGE_FRACTION)).toLong
This MEMORY_OFFHEAP_SIZE is defined by spark.memory.offHeap.size:
private[spark] val MEMORY_OFFHEAP_SIZE = ConfigBuilder("spark.memory.offHeap.size")
  .doc("The absolute amount of memory which can be used for off-heap allocation, " +
    " in bytes unless otherwise specified. " +
    "This setting has no impact on heap memory usage, so if your executors' total memory " +
    "consumption must fit within some hard limit then be sure to shrink your JVM heap size " +
    "accordingly. This must be set to a positive value when spark.memory.offHeap.enabled=true.")
  .version("1.6.0")
  .bytesConf(ByteUnit.BYTE)
  .checkValue(_ >= 0, "The off-heap memory size must not be negative")
  .createWithDefault(0)
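As a usage note, here is a minimal sketch (using Spark's Java API, assuming Spark 3.x with spark-sql on the classpath) of wiring those settings together; spark.memory.offHeap.enabled must also be true, otherwise the size is ignored. The application name and the 1g value are just placeholders.

// A minimal sketch of enabling off-heap storage memory via the Java API.
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class OffHeapConfigExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("offheap-demo")                       // placeholder name
            .set("spark.memory.offHeap.enabled", "true")      // required for the size to take effect
            .set("spark.memory.offHeap.size", "1g");          // feeds Off Heap Storage Memory in the UI
        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        System.out.println(spark.conf().get("spark.memory.offHeap.size"));
        spark.stop();
    }
}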

Troubleshooting and fixing Cassandra OOM issue

Although there are multiple threads about OOM issues, I would like to clarify certain things. We are running a 36-node Cassandra 3.11.6 cluster in Kubernetes, with 32 GiB allocated to each container.
The container is getting OOM killed (note: not a Java heap OutOfMemoryError, but the Linux cgroup OOM killer) since it reaches the 32 GiB memory limit of its cgroup.
Stats and configs
map[limits:map[ephemeral-storage:2Gi memory:32Gi] requests:map[cpu:7 ephemeral-storage:2Gi memory:32Gi]]
Cgroup Memory limit
34359738368 -> 32 Gigs
The JVM spaces auto calculated by Cassandra -Xms19660M -Xmx19660M -Xmn4096M
Grafana Screenshot
Cassandra Yaml --> https://pastebin.com/ZZLTc1cM
JVM Options --> https://pastebin.com/tjzZRZvU
Nodetool info output on a node which is already consuming 98% of the memory
nodetool info
ID : 59c53bdb-4f61-42f5-a42c-936ea232e12d
Gossip active : true
Thrift active : true
Native Transport active: true
Load : 179.71 GiB
Generation No : 1643635507
Uptime (seconds) : 9134829
Heap Memory (MB) : 5984.30 / 19250.44
Off Heap Memory (MB) : 1653.33
Data Center : datacenter1
Rack : rack1
Exceptions : 5
Key Cache : entries 138180, size 99.99 MiB, capacity 100 MiB, 9666222 hits, 10281941 requests, 0.940 recent hit rate, 14400 save period in seconds
Row Cache : entries 10561, size 101.76 MiB, capacity 1000 MiB, 12752 hits, 88528 requests, 0.144 recent hit rate, 900 save period in seconds
Counter Cache : entries 714, size 80.95 KiB, capacity 50 MiB, 21662 hits, 21688 requests, 0.999 recent hit rate, 7200 save period in seconds
Chunk Cache : entries 15498, size 968.62 MiB, capacity 1.97 GiB, 283904392 misses, 34456091078 requests, 0.992 recent hit rate, 467.960 microseconds miss latency
Percent Repaired : 8.28107989669628E-8%
Token : (invoke with -T/--tokens to see all 256 tokens)
What has been done
We made sure there is no memory leak in the Cassandra process, since we run custom trigger code. GC log analysis shows we occupy roughly 14 GiB of total JVM space.
Questions
We know Cassandra occupies off-heap space (bloom filters, memtables, etc.).
The Grafana screenshot shows the node occupying 98% of 32 GiB. The JVM heap is 19.5 GiB and the off-heap space in the nodetool info output is 1653.33 MB (~1.6 GiB), so heap plus off-heap comes to roughly 21 GiB. Where is the remaining ~10 GiB, and how do we account exactly for what is occupying it? (nodetool tablestats and nodetool cfstats output are not shared for compliance reasons.)
Our production cluster requires a lot of approvals, so attaching jconsole remotely is difficult. Are there other ways to account for this memory usage?
Once we account for the memory usage, what are the next steps to fix this and avoid the OOM kill?
There's a good chance that the SSTables are getting mapped to memory (cached with mmap()). If this is the case, it wouldn't happen immediately; memory usage would grow over time as SSTables are read and cached. I've written about this issue in https://community.datastax.com/questions/6947/.
There's an issue with a not-so-well-known configuration property called "disk access mode". When it's not set in cassandra.yaml, it defaults to mmap, which means that all SSTables get mmapped to memory. If so, you'll see an entry in the system.log on startup that looks like:
INFO [main] 2019-05-02 12:33:21,572 DatabaseDescriptor.java:350 - \
DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
The solution is to configure disk access mode to only cache SSTable index files (not the *-Data.db component) by setting:
disk_access_mode: mmap_index_only
For more information, see the link I posted above. Cheers!
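If you want to put a number on how much is mapped without attaching jconsole, the standard java.nio buffer-pool MXBeans report it; below is a minimal sketch of reading them. Inside Cassandra the same values are exposed over JMX under java.nio:type=BufferPool, so any JMX collector you already have can pick them up. Note these are capacities tracked by the JVM, not RSS.

// A minimal sketch: read the platform buffer-pool MXBeans. The "mapped" pool
// reflects memory-mapped files (such as mmapped SSTables); the "direct" pool
// reflects off-heap NIO buffers.
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class BufferPoolReport {
    public static void main(String[] args) {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%-7s count=%d used=%d bytes capacity=%d bytes%n",
                    pool.getName(), pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}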

Compaction cause OutOfMemoryError

I'm getting an OutOfMemoryError when running compaction on some big SSTables in production; the table size is around 800 GB. Compaction on small SSTables works properly, though.
$ nodetool compact keyspace1 users
error: Direct buffer memory
-- StackTrace --
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:693)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at org.apache.cassandra.io.compress.BufferType$2.allocate(BufferType.java:35)
Java heap memory (Xms and Xmx) has been set to 8 GB; I'm wondering if I should increase the Java heap to 12 or 16 GB?
It's not the heap size, but the so-called "direct memory". You need to check how much you have: it can be specified with something like -XX:MaxDirectMemorySize=512m, or it defaults to the same maximum size as the heap. You can increase it indirectly by increasing the heap size, or you can control it explicitly via the -XX flag. Here is a good article about non-heap memory in Java.
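To see that this limit is separate from the heap, here is a minimal sketch that reproduces the same class of error; the 16m cap and 64 MB request are arbitrary values chosen just for the demo.

// A minimal sketch: run with a deliberately small cap, e.g.
//   java -XX:MaxDirectMemorySize=16m DirectLimitDemo
// Requesting more direct memory than the cap throws
// java.lang.OutOfMemoryError: Direct buffer memory, regardless of -Xmx.
import java.nio.ByteBuffer;

public class DirectLimitDemo {
    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024 * 1024); // 64 MB request
        System.out.println("allocated " + buffer.capacity() + " bytes of direct memory");
    }
}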

Monitoring Java process memory utilization

I have 24 GB of RAM on my server (RHEL) and have assigned a 2 GB Xmx value to a Java process.
I need to check whether this 2 GB is being consumed completely. Can I look at top, see that this Java process is consuming 8.3% of memory (i.e. 2/24), and conclude that it is using 2 GB at that point? If it's less than 8.3%, I assume it has not reached the 2 GB mark. Let me know if my assumption is wrong.
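For reference, top reports the whole process RSS (heap plus metaspace, thread stacks, direct buffers, and so on), so the 8.3% figure only bounds the heap rather than measuring it. If you can run a small probe in the JVM, or read the heap figures with jconsole or jstat -gc, the actual numbers are available directly; a minimal sketch using the standard MemoryMXBean:

// A minimal sketch: query actual heap usage from inside the JVM instead of
// inferring it from top's RSS column, which also includes non-heap memory.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapUsageProbe {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%d MB committed=%d MB max=%d MB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
    }
}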

What is the Cache memory limit in spark by default?

What is the maximum cache limit in Spark? How much data can it hold at once?
See this. It is 0.6 x (JVM heap space - 300MB) by default.
I may be wrong, but to my understanding, here is the calculation:
What is the executor memory? Let's say it is 1 GB.
Then the heap size is 0.6 of that, which is 600 MB.
Then 50% of the heap size is cache, i.e. 300 MB.
http://spark.apache.org/docs/latest/tuning.html#memory-management-overview — here they must have assumed the executor memory is 500 MB. In fact, the default executor memory in local mode is 500 MB. If executor memory is 500 MB, then only 150 MB is allocated to cache.
It actually depends entirely on executor memory. Spark will hold as large a part of the RDD in memory as it can, and the rest will be fetched and recomputed on the fly each time it's needed. It is totally configurable, and you can check it here.
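For what it's worth, here is a minimal sketch of the arithmetic behind the 0.6 x (heap - 300 MB) figure from the first answer, using Spark's default spark.memory.fraction (0.6) and spark.memory.storageFraction (0.5); the 1 GiB executor heap is just an assumed input.

// A minimal sketch of the unified-memory arithmetic. The 1 GiB executor heap
// is an assumption; 300 MB is Spark's reserved memory, and 0.6 / 0.5 are the
// defaults of spark.memory.fraction and spark.memory.storageFraction.
public class StorageMemoryEstimate {
    public static void main(String[] args) {
        long executorHeap = 1024L * 1024 * 1024;               // assumed 1 GiB heap
        long reserved = 300L * 1024 * 1024;                    // reserved memory
        long unified = (long) ((executorHeap - reserved) * 0.6);
        long storageImmune = (long) (unified * 0.5);           // portion immune to eviction by execution
        System.out.printf("unified=%.1f MB, storage (eviction-immune)=%.1f MB%n",
                unified / 1048576.0, storageImmune / 1048576.0);
        // With a 1 GiB heap this prints roughly 434.4 MB / 217.2 MB; the 434.4 MB
        // matches the kind of total shown under Storage Memory in the Web UI.
    }
}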
