Understanding an Apache Cassandra Memtable Flush

A memtable is created for every table or column family. There can be multiple memtables for a table, but only one of them will be active. The rest will be waiting to be flushed. There are a few properties that affect a memtable's size and flushing frequency. These include:
memtable_flush_writers – This is the number of threads allocated for flushing memtables to disk. This defaults to two.
memtable_heap_space_in_mb – This is the total allocated space for all memtables on an Apache Cassandra node. By default, this is one-fourth your heap size. Specifying this property results in an absolute heap size in MB as opposed to a percentage of the total JVM heap.
memtable_cleanup_threshold – A percentage of your total available memtable space that will trigger a memtable cleanup. memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1). By default this is essentially 33% of your memtable_heap_space_in_mb.
A scheduled cleanup results in flushing of the table/column family that occupies the largest portion of memtable space. This keeps happening until the memtable memory in use drops below the cleanup threshold.
Let's assume we have an Apache Cassandra instance that has been allocated 4 GB of heap. Of this, only 3,925.5 MB is available to the Java runtime; see the StackOverflow question "Why do -Xmx and Runtime.maxMemory not agree" for the reasons behind this. By default, 981 MB of that is allocated to memtables, i.e. one-fourth of 3,925.5 MB. Our memtable_cleanup_threshold is the default value, i.e. 33 percent of the total memtable heap and off-heap memory. In our example that comes to 327 MB. Thus, when the total space allocated for all memtables exceeds 327 MB, a memtable cleanup is triggered. The cleanup process looks for the largest memtable and flushes it to disk.
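The arithmetic in this example can be reproduced with a short Python sketch. The function name and parameters are made up for illustration; the only inputs are the defaults quoted above (one-fourth of the usable heap, and 1 / (memtable_flush_writers + 1)).

```python
def memtable_thresholds(usable_heap_mb, flush_writers=2):
    """Return (memtable space in MB, cleanup ratio, flush trigger in MB)."""
    memtable_space_mb = usable_heap_mb / 4           # default: one-fourth of the heap
    cleanup_ratio = 1 / (flush_writers + 1)          # default memtable_cleanup_threshold
    trigger_mb = memtable_space_mb * cleanup_ratio   # usage at which a cleanup starts
    return memtable_space_mb, cleanup_ratio, trigger_mb


space_mb, ratio, trigger_mb = memtable_thresholds(3925.5)
print(f"memtable space:    {space_mb:.0f} MB")
print(f"cleanup threshold: {ratio:.0%}")
print(f"flush trigger:     {trigger_mb:.0f} MB")
print(f"space above it:    {space_mb - trigger_mb:.0f} MB")
```

With the figures from the example this gives roughly 981 MB of memtable space, a trigger at about 327 MB, and about 654 MB above the trigger.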
If I allocate 981 MB for memtables and Cassandra initiates a flush after 327 MB, then at any point in time Cassandra will have at most 327 MB of active memtables. What about the remaining (981 - 327) MB = 654 MB of memtable space? What is it used for? I assume that memtables queued for flushing occupy some portion of this 654 MB, but what about the rest? Isn't it being wasted?

memtable_heap_space_in_mb sets how much heap can be used for memtables. It is an upper limit, not a reservation, so it is not mandatory for all of it to be occupied by memtables. If memtables are using 327 MB, the rest of the total heap can be used for queries or repair operations.

Related

Troubleshooting and fixing Cassandra OOM issue

Although there are multiple threads about this OOM issue, we would like to clarify certain things. We are running a 36-node Cassandra cluster, version 3.11.6, on Kubernetes with 32 GB allocated to each container.
The container is getting OOM-killed (note: not a Java heap OOM error, but the Linux cgroup OOM killer) because it reaches the 32 GB memory limit of its cgroup.
Stats and configs
map[limits:map[ephemeral-storage:2Gi memory:32Gi] requests:map[cpu:7 ephemeral-storage:2Gi memory:32Gi]]
Cgroup Memory limit
34359738368 -> 32 Gigs
The JVM sizes auto-calculated by Cassandra: -Xms19660M -Xmx19660M -Xmn4096M
Grafana Screenshot
Cassandra Yaml --> https://pastebin.com/ZZLTc1cM
JVM Options --> https://pastebin.com/tjzZRZvU
Nodetool info output on a node which is already consuming 98% of the memory
nodetool info
ID : 59c53bdb-4f61-42f5-a42c-936ea232e12d
Gossip active : true
Thrift active : true
Native Transport active: true
Load : 179.71 GiB
Generation No : 1643635507
Uptime (seconds) : 9134829
Heap Memory (MB) : 5984.30 / 19250.44
Off Heap Memory (MB) : 1653.33
Data Center : datacenter1
Rack : rack1
Exceptions : 5
Key Cache : entries 138180, size 99.99 MiB, capacity 100 MiB, 9666222 hits, 10281941 requests, 0.940 recent hit rate, 14400 save period in seconds
Row Cache : entries 10561, size 101.76 MiB, capacity 1000 MiB, 12752 hits, 88528 requests, 0.144 recent hit rate, 900 save period in seconds
Counter Cache : entries 714, size 80.95 KiB, capacity 50 MiB, 21662 hits, 21688 requests, 0.999 recent hit rate, 7200 save period in seconds
Chunk Cache : entries 15498, size 968.62 MiB, capacity 1.97 GiB, 283904392 misses, 34456091078 requests, 0.992 recent hit rate, 467.960 microseconds miss latency
Percent Repaired : 8.28107989669628E-8%
Token : (invoke with -T/--tokens to see all 256 tokens)
What has been done
We have made sure there is no memory leak in the Cassandra process, since we run custom trigger code. GC log analysis shows we occupy roughly 14 GB of the total JVM space.
Questions
We know Cassandra occupies off-heap space (bloom filters, memtables, etc.).
The Grafana screenshot shows the node using 98% of the 32 GB. JVM heap = 19.5 GB, plus the off-heap space in the nodetool info output = 1653.33 MB (about 1.6 GB), so JVM heap + off heap is roughly 21 GB. Where is the remaining memory (about 10 GB)? How can we account for exactly what is occupying it? (nodetool tablestats and nodetool cfstats output are not shared for compliance reasons.)
Our production cluster requires many approvals, so deploying with remote JConsole is difficult. Are there other ways to account for this memory usage?
Once we have accounted for the memory usage, what are the next steps to fix this and avoid the OOM kill?
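The accounting in the question can be redone as a short Python sketch. It uses only figures quoted above: the cgroup limit, the -Xmx value, the off-heap figure from nodetool info, and the 98% usage read off the Grafana screenshot. The variable names are illustrative.

```python
MB = 1024 * 1024

cgroup_limit_mb = 34359738368 / MB      # cgroup memory limit in MB
used_mb = cgroup_limit_mb * 0.98        # 98% usage reported by Grafana
heap_max_mb = 19660                     # -Xmx19660M
off_heap_mb = 1653.33                   # "Off Heap Memory (MB)" from nodetool info

accounted_mb = heap_max_mb + off_heap_mb
unaccounted_mb = used_mb - accounted_mb

print(f"cgroup limit: {cgroup_limit_mb:.0f} MB")
print(f"in use (98%): {used_mb:.0f} MB")
print(f"heap + off-heap: {accounted_mb:.0f} MB")
print(f"unaccounted: {unaccounted_mb:.0f} MB")
```

This assumes the whole heap is committed (Xms equals Xmx) and leaves about 10.5 GB that neither the heap nor the off-heap figure from nodetool explains.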
There's a good chance that the SSTables are getting mapped to memory (cached with mmap()). If this is the case, it wouldn't be immediate: memory usage would grow over time as SSTables are read and then cached. I've written about this issue in https://community.datastax.com/questions/6947/.
There's an issue with a not-so-well-known configuration property called "disk access mode". When it's not set in cassandra.yaml, it defaults to mmap, which means that all SSTables get mmaped to memory. If so, you'll see an entry in the system.log on startup that looks like:
INFO [main] 2019-05-02 12:33:21,572 DatabaseDescriptor.java:350 - \
DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
The solution is to configure disk access mode to only cache SSTable index files (not the *-Data.db component) by setting:
disk_access_mode: mmap_index_only
For more information, see the link I posted above. Cheers!
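One way to check for mapped SSTables without JConsole is to read the Linux /proc/<pid>/smaps file of the Cassandra process and add up the resident size of the mappings backed by *-Data.db files. The following Python sketch does that on smaps-formatted text. The function name and the sample paths are hypothetical, and the sample text is hand-written for illustration, not real output.

```python
import re

# Header line of an smaps entry: address range, perms, offset, device, inode, path.
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+[0-9a-f]+\s+\S+\s+\d+\s*(.*)$")


def sstable_rss_kb(smaps_text, suffix="-Data.db"):
    """Sum the Rss (in kB) of all mappings whose path ends with `suffix`."""
    total_kb = 0
    in_sstable = False
    for line in smaps_text.splitlines():
        header = HEADER.match(line)
        if header:
            # A new mapping starts; remember whether it is an SSTable data file.
            in_sstable = header.group(1).strip().endswith(suffix)
        elif in_sstable and line.startswith("Rss:"):
            total_kb += int(line.split()[1])
    return total_kb


# Hand-written sample in smaps layout (hypothetical paths and sizes).
SAMPLE = """\
7f0000000000-7f0040000000 r--s 00000000 08:01 1234 /var/lib/cassandra/data/ks/users/md-1-big-Data.db
Size:            1048576 kB
Rss:              524288 kB
7f0040000000-7f0040100000 r--s 00000000 08:01 1235 /var/lib/cassandra/data/ks/users/md-1-big-Index.db
Size:               1024 kB
Rss:                 512 kB
7f0050000000-7f0050021000 rw-p 00000000 00:00 0
Size:                132 kB
Rss:                  64 kB
"""

print(sstable_rss_kb(SAMPLE), "kB resident in mapped SSTable data files")
```

On a live node the text would come from reading /proc/<pid>/smaps for the Cassandra PID, which usually needs the same user or root. A large total here would be consistent with the mmap explanation above.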

Compaction cause OutOfMemoryError

I'm getting an OutOfMemoryError when running compaction on some big SSTables in production. The table size is around 800 GB. Compaction on small SSTables works properly, though.
$ nodetool compact keyspace1 users
error: Direct buffer memory
-- StackTrace --
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:693)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at org.apache.cassandra.io.compress.BufferType$2.allocate(BufferType.java:35)
Java heap memory (Xms and Xmx) has been set to 8 GB. Should I increase it to 12 or 16 GB?
It's not the heap size; it's the so-called "direct memory". You need to check how much you have: it can be specified explicitly with something like -XX:MaxDirectMemorySize=512m; otherwise it takes the same maximum size as the heap. You can increase it indirectly by increasing the heap size, or you can control it explicitly via the -XX flag. Here is a good article about non-heap memory in Java.

Repair status not 100% after repair

I have noticed that some tables show less than 100% "Percent repaired" in the nodetool tablestats output. I have manually executed repairs on all nodes (3-node cluster, RF=3), but the value doesn't seem to change.
Example output:
Table: users
SSTable count: 3
Space used (live): 66636
Space used (total): 66636
Space used by snapshots (total): 0
Off heap memory used (total): 688
SSTable Compression Ratio: 0.5731829674519404
Number of partitions (estimate): 162
Memtable cell count: 11
Memtable data size: 483
Memtable off heap memory used: 0
Memtable switch count: 27
Local read count: 120833
Local read latency: NaN ms
Local write count: 12094
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 91.54
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 568
Bloom filter off heap memory used: 544
Index summary off heap memory used: 112
Compression metadata off heap memory used: 32
Compacted partition minimum bytes: 30
Compacted partition maximum bytes: 1916
Compacted partition mean bytes: 420
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
Repair was done with nodetool repair -pr
What is going on?
Percent repaired seems to be a misleading metric, as it refers to the percentage of SSTables repaired, and some conditions must be met for it to be computed:
- the tables should not be from system keyspaces
- the tables should have a replication factor greater than 1
- the repair should be incremental or full (non-subrange)
When you use nodetool repair -pr, that will invoke a full repair that won't be able to update this value.
For more information regarding incremental repairs, I would recommend this article from the Last Pickle. Since they adopted the maintenance of the reaper tool, they have become an authority regarding repairs.
Executing nodetool repair -pr will repair the primary range owned by the node that command is executed on.
What does this mean? The node this command is executed on has data that it "owns", i.e., its primary range, but the node also contains data/replicas "owned" by other nodes. You are not repairing the replicas "owned" by other nodes.
Now, if you execute that command on every single node in the cluster (not data center), it will cover all the token ranges.
EDIT / NOTE:
My answer did not properly address the question. Although what I wrote is accurate, the answer to the question is stated in the answer above mine; basically, the percentage repaired is a value that is for incremental repair usage and is not affected by a full repair. (Incremental repair marks the repaired ranges as it works so it does not spend time re-repairing later.)

How much free disk space is needed for compaction?

According to this article: http://thelastpickle.com/blog/2017/03/16/compaction-nuance.html
"It looks at the the total space used by all sstables, adds it up, and
checks it against available disk space"
Does it mean that total space is counted for all sstables stored on given node, or for all sstables that will be compacted?
We can assume that we have SizeTieredCompactionStrategy.
According to the source code in CompactionTask#buildCompactionCandidatesForAvailableDiskSpace for Cassandra 2.1.20 the write size is for all sstables that will be compacted (sstables that are not expired).
Also, as a general recommendation, you should fill your disk only up to 50% of its size so compactions can be executed safely.
For SizeTieredCompactionStrategy, at least 50% of the total disk space used for Cassandra data files should be free so that compaction can be executed safely.
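The reasoning behind the 50% recommendation can be sketched in Python: in the worst case, a compaction writes as many bytes as it reads, so the candidate SSTables must fit into the free space alongside the originals. This is a simplified illustration under that assumption, not Cassandra's actual logic. The function names and sizes are hypothetical.

```python
import shutil

GIB = 1024 ** 3


def free_bytes(path):
    """Free space on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def can_compact(sstable_sizes_bytes, available_bytes):
    """Worst case: nothing is purged, so the output equals the sum of the inputs."""
    return sum(sstable_sizes_bytes) <= available_bytes


# Four hypothetical 200 GiB SSTables selected for one compaction.
candidates = [200 * GIB] * 4

print(can_compact(candidates, 1024 * GIB))   # enough room for an 800 GiB output
print(can_compact(candidates, 500 * GIB))    # not enough room
print(free_bytes(".") // GIB, "GiB free on the current filesystem")
```

If a single major compaction covers all SSTables of a large table, the candidates add up to nearly the whole data set, which is why half the disk is suggested as headroom.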

What is the Cache memory limit in spark by default?

What is the maximum limit of the cache in Spark? How much data can it hold at once?
See this. It is 0.6 x (JVM heap space - 300MB) by default.
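As a worked example of that formula, here is a small Python sketch. It assumes the unified memory manager defaults of Spark 2.x and later: spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5, and 300 MB of reserved memory. The function name is made up for illustration.

```python
RESERVED_MB = 300  # memory Spark sets aside before applying the fractions


def spark_memory_mb(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Return (unified execution+storage memory, protected storage memory) in MB."""
    unified_mb = (heap_mb - RESERVED_MB) * memory_fraction
    storage_mb = unified_mb * storage_fraction
    return unified_mb, storage_mb


unified_mb, storage_mb = spark_memory_mb(1024)   # a 1 GB executor heap
print(f"unified memory:    {unified_mb:.1f} MB")
print(f"protected storage: {storage_mb:.1f} MB")
```

For a 1 GB heap this gives about 434 MB shared by execution and storage. About 217 MB of that is the storage region that execution cannot evict. Cached data can grow beyond that region while execution is not using the rest.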
I may be wrong, but to my understanding the calculation is as follows.
Take the executor memory; let's say it is 1 GB.
Then the heap size is 0.6 of it, which is 600 MB.
Then 50% of the heap size is cache, i.e. 300 MB.
In http://spark.apache.org/docs/latest/tuning.html#memory-management-overview they must have assumed that executor memory is 500 MB. In fact, for a local executor the default memory size is 500 MB. If executor memory is 500 MB, then only 150 MB is allocated to cache.
It actually depends entirely on executor memory. Spark will keep as large a part of the RDD in memory as it can, and the rest will be fetched and recomputed on the fly each time it is needed. It is fully configurable and you can check it here

Resources