Cassandra repair causes timeouts on one node

We are using a Cassandra (3.10-1) cluster with 5 nodes - each with 8 cores and 23GiB of memory, all in the same DC.
Replication factor - 2
Consistency level - 2
Lately, during the scheduled repair that runs about once a week, there are many query timeouts from one node - always the same node, across different repairs.
From looking at its logs:
debug.log shows that there are several errors when building the Merkle tree for the repair:
DEBUG [MemtablePostFlush:7354] 2018-12-30 23:52:08,314 ColumnFamilyStore.java:954 - forceFlush requested but everything is clean in user_device_validation_configuration
ERROR [ValidationExecutor:973] 2018-12-30 23:52:08,315 Validator.java:268 - Failed creating a merkle tree for [repair #b3887f60-0c8d-11e9-b894-754001cf0917 on keyspace1/user_device_validation_configuration, [(-8096630393642361664,-8073407512196669022]]], /10.52.5.42 (see log for details)
DEBUG [AntiEntropyStage:1] 2018-12-30 23:52:08,318 RepairMessageVerbHandler.java:114 - Validating ValidationRequest{gcBefore=1545349928} org.apache.cassandra.repair.messages.ValidationRequest#5c1c2b28
DEBUG [ValidationExecutor:973] 2018-12-30 23:52:08,319 StorageService.java:3313 - Forcing flush on keyspace keyspace1, CF raw_sensors
DEBUG [MemtablePostFlush:7354] 2018-12-30 23:52:08,319 ColumnFamilyStore.java:954 - forceFlush requested but everything is clean in raw_sensors
ERROR [ValidationExecutor:973] 2018-12-30 23:52:08,319 Validator.java:268 - Failed creating a merkle tree for [repair #b3887f60-0c8d-11e9-b894-754001cf0917 on keyspace1/raw_sensors, [(-8096630393642361664,-8073407512196669022]]], /10.52.5.42 (see log for details)
DEBUG [AntiEntropyStage:1] 2018-12-30 23:52:08,321 RepairMessageVerbHandler.java:114 - Validating ValidationRequest{gcBefore=1545349928} org.apache.cassandra.repair.messages.ValidationRequest#5c1c2b28
DEBUG [AntiEntropyStage:1] 2018-12-30 23:52:08,321 RepairMessageVerbHandler.java:142 - Got anticompaction request AnticompactionRequest{parentRepairSession=b387e320-0c8d-11e9-b894-754001cf0917} org.apache.cassandra.repair.messages.AnticompactionRequest#d4b7ed7b
ERROR [AntiEntropyStage:1] 2018-12-30 23:52:08,322 RepairMessageVerbHandler.java:168 - Got error, removing parent repair session
ERROR [AntiEntropyStage:1] 2018-12-30 23:52:08,322 CassandraDaemon.java:229 - Exception in thread Thread[AntiEntropyStage:1,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: Parent repair session with id = b387e320-0c8d-11e9-b894-754001cf0917 has failed.
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:171) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) ~[apache-cassandra-3.10.jar:3.10]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_131]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
Caused by: java.lang.RuntimeException: Parent repair session with id = b387e320-0c8d-11e9-b894-754001cf0917 has failed.
at org.apache.cassandra.service.ActiveRepairService.getParentRepairSession(ActiveRepairService.java:400) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:435) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:143) ~[apache-cassandra-3.10.jar:3.10]
... 7 common frames omitted
DEBUG [ValidationExecutor:973] 2018-12-30 23:52:08,323 StorageService.java:3313 - Forcing flush on keyspace keyspace1, CF mouse_interactions
DEBUG [MemtablePostFlush:7354] 2018-12-30 23:52:08,323 ColumnFamilyStore.java:954 - forceFlush requested but everything is clean in mouse_interactions
ERROR [ValidationExecutor:973] 2018-12-30 23:52:08,327 Validator.java:268 - Failed creating a merkle tree for [repair #b3887f60-0c8d-11e9-b894-754001cf0917 on keyspace1/mouse_interactions, [(-8096630393642361664,-8073407512196669022]]], /10.52.5.42 (see log for details)
DEBUG [GossipStage:1] 2018-12-30 23:52:10,643 FailureDetector.java:457 - Ignoring interval time of 2000164738 for /10.52.3.47
DEBUG [GossipStage:1] 2018-12-30 23:52:13,643 FailureDetector.java:457 - Ignoring interval time of 2000239040 for /10.52.3.47
DEBUG [ReadRepairStage:407] 2018-12-30 23:52:15,133 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(7486012912397474412, 000467657474000020376337363933643363613837616231643531633936396564616234636363613400) (a0e45fcd73255bcd93a63b15d41e0843 vs 7dff1a87a755cf62150befc8352f59e8)
at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.10.jar:3.10]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
DEBUG [GossipStage:1] 2018-12-30 23:52:26,639 FailureDetector.java:457 - Ignoring interval time of 2000385682 for /10.52.3.47
After a few hours looking at the GC logs, I noticed that GC is being triggered roughly every 20 seconds and stops the application for more than 10 seconds each time:
2018-12-31T06:24:57.450+0000: 1184437.292: Total time for which application threads were stopped: 18.0318658 seconds, Stopping threads took: 0.0005000 seconds
2018-12-31T06:24:57.483+0000: 1184437.325: Total time for which application threads were stopped: 0.0053815 seconds, Stopping threads took: 0.0007325 seconds
2018-12-31T06:24:57.565+0000: 1184437.407: Total time for which application threads were stopped: 0.0118127 seconds, Stopping threads took: 0.0057652 seconds
2018-12-31T06:24:57.604+0000: 1184437.446: Total time for which application threads were stopped: 0.0064909 seconds, Stopping threads took: 0.0023037 seconds
2018-12-31T06:24:57.701+0000: 1184437.543: Total time for which application threads were stopped: 0.0066540 seconds, Stopping threads took: 0.0031299 seconds
{Heap before GC invocations=1377084 (full 108682):
par new generation total 943744K, used 943711K [0x00000005c0000000, 0x0000000600000000, 0x0000000600000000)
eden space 838912K, 100% used [0x00000005c0000000, 0x00000005f3340000, 0x00000005f3340000)
from space 104832K, 99% used [0x00000005f99a0000, 0x00000005ffff7ce0, 0x0000000600000000)
to space 104832K, 0% used [0x00000005f3340000, 0x00000005f3340000, 0x00000005f99a0000)
concurrent mark-sweep generation total 7340032K, used 7340032K [0x0000000600000000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 71629K, capacity 73844K, committed 75000K, reserved 1116160K
class space used 8521K, capacity 8909K, committed 9176K, reserved 1048576K
2018-12-31T06:24:58.029+0000: 1184437.870: [Full GC (Allocation Failure) 2018-12-31T06:24:58.029+0000: 1184437.871: [CMS: 7340032K->7340031K(7340032K), 15.2051822 secs] 8283743K->7443230K(8283776K), [Metaspace: 71629K->71629K(1116160K)], 15.2055451 secs] [Times: user=13.94 sys=1.28, real=15.20 secs]
Heap after GC invocations=1377085 (full 108683):
par new generation total 943744K, used 103198K [0x00000005c0000000, 0x0000000600000000, 0x0000000600000000)
eden space 838912K, 12% used [0x00000005c0000000, 0x00000005c64c7950, 0x00000005f3340000)
from space 104832K, 0% used [0x00000005f99a0000, 0x00000005f99a0000, 0x0000000600000000)
to space 104832K, 0% used [0x00000005f3340000, 0x00000005f3340000, 0x00000005f99a0000)
concurrent mark-sweep generation total 7340032K, used 7340031K [0x0000000600000000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 71629K, capacity 73844K, committed 75000K, reserved 1116160K
class space used 8521K, capacity 8909K, committed 9176K, reserved 1048576K
}
2018-12-31T06:25:13.235+0000: 1184453.076: Total time for which application threads were stopped: 15.2103050 seconds, Stopping threads took: 0.0004204 seconds
2018-12-31T06:25:13.243+0000: 1184453.085: Total time for which application threads were stopped: 0.0047592 seconds, Stopping threads took: 0.0008416 seconds
2018-12-31T06:25:13.272+0000: 1184453.114: Total time for which application threads were stopped: 0.0085538 seconds, Stopping threads took: 0.0046376 seconds
2018-12-31T06:25:13.298+0000: 1184453.140: [GC (CMS Initial Mark) [1 CMS-initial-mark: 7340031K(7340032K)] 7536074K(8283776K), 0.0650538 secs] [Times: user=0.12 sys=0.01, real=0.06 secs]
2018-12-31T06:25:13.364+0000: 1184453.206: Total time for which application threads were stopped: 0.0728487 seconds, Stopping threads took: 0.0039520 seconds
2018-12-31T06:25:13.364+0000: 1184453.206: [CMS-concurrent-mark-start]
{Heap before GC invocations=1377085 (full 108684):
par new generation total 943744K, used 943215K [0x00000005c0000000, 0x0000000600000000, 0x0000000600000000)
eden space 838912K, 100% used [0x00000005c0000000, 0x00000005f3340000, 0x00000005f3340000)
from space 104832K, 99% used [0x00000005f99a0000, 0x00000005fff7bd98, 0x0000000600000000)
to space 104832K, 0% used [0x00000005f3340000, 0x00000005f3340000, 0x00000005f99a0000)
concurrent mark-sweep generation total 7340032K, used 7340031K [0x0000000600000000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 71631K, capacity 73844K, committed 75000K, reserved 1116160K
class space used 8521K, capacity 8909K, committed 9176K, reserved 1048576K
2018-12-31T06:25:13.821+0000: 1184453.662: [Full GC (Allocation Failure) 2018-12-31T06:25:13.821+0000: 1184453.663: [CMS2018-12-31T06:25:16.960+0000: 1184456.802: [CMS-concurrent-mark: 3.592/3.596 secs] [Times: user=12.47 sys=0.38, real=3.60 secs]
So I've started checking the data spread in the cluster - we are using num_tokens: 32.
The data seems to be balanced - roughly 40% on each node:
UN 10.52.2.94 672.64 GiB 32 ? ad3d1365-bbb7-4229-b586-40667ec22b41 rack1
UN 10.52.3.47 640.91 GiB 32 ? cdba952b-9685-4769-aaf4-22e538a5c37f rack1
UN 10.52.1.57 719.34 GiB 32 ? 13bb7573-eb30-489b-80c4-6e5a7c8e5f5e rack1
UN 10.52.5.42 743.04 GiB 32 ? c9e892c6-9281-4a49-b4c4-a147a387c3de rack1
UN 10.52.4.43 691.1 GiB 32 ? 53e3724e-f5a9-46b1-b330-7bb542f15f90 rack1
After checking the logs on the other nodes, I can't find any reason for those timeouts on that specific node.
Any thoughts or ideas as to what causes this to happen on the same node again and again?

That's really odd to only see this on one node. Double-check that the configs are the same. Otherwise, you might be writing/querying a large partition which that node is primarily responsible for.
Replication factor - 2 Consistency level - 2
In general, repairs can cause nodes to have trouble serving requests, as building Merkle Trees and streaming data are quite resource-intensive. I see two problems here:
Long GC pauses.
A RF/CL ratio which does not allow for any node to be down.
Starting with #2, when you have RF=2 and you're requiring 2 replicas to respond to your queries, you are essentially querying at CONSISTENCY_ALL. Therefore, should a node become overwhelmed and short on resources, your queries will be unable to complete. If it were me, I would increase the RF to 3 and run a repair (assuming the nodes have adequate storage). The other option would be to query at a consistency level of ONE, which is probably what you should be doing with RF=2 anyway.
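If you go the RF=3 route, the change itself is just a schema alter plus a repair - a rough sketch, assuming keyspace1 is defined with SimpleStrategy (swap in NetworkTopologyStrategy and your DC name if that's what your schema actually uses):
cqlsh -e "ALTER KEYSPACE keyspace1 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
# then build the new replicas - run this on each node, one at a time
nodetool repair -full keyspace1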
Also, when diagnosing query problems, it usually helps to see the actual query being run (which you have not provided). With Cassandra, more often than not, query issues are the result of queries which don't fit the designed table.
As for the long GC pauses, that's a trickier problem to solve. There's an old Cassandra JIRA (CASSANDRA-8150) ticket which talks about optimal settings for the CMS Garbage Collector. You should give that a read.
What is your MAX_HEAP set to? I see your new generation is less than 1GB, which is going to be too small. With 23GB of RAM on each node, I'd start with the following settings for using CMS GC:
Max Heap Size (Xmx) of 8GB-12GB (you want to leave about 50% of RAM for the OS/page-cache).
Initial Heap Size (Xms) equal to Max Heap Size to avoid the performance hit of resizing the heap.
Heap New Size (Xmn) of 40-50% of your Max Heap Size (so somewhere between 3GB-6GB). You want plenty of room available for the new generation.
MaxTenuringThreshold of 6 to 8, as you want objects to be passed around the new gen survivor spaces until they die, in lieu of being promoted to the old gen. By default, I think this is set to 1, which means objects will get promoted too quickly.
Basically, new-to-old/old-to-permanent promotion is where the long pauses happen. Ideally, you'd like all objects on your heap to be created, live, and die in the new gen, without being promoted.
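To make that concrete, here's a rough sketch of how those numbers might look in conf/jvm.options on Cassandra 3.x (treat the sizes as starting points to test, not a definitive configuration):
-Xms12G
-Xmx12G
-Xmn5G
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:MaxTenuringThreshold=6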
Otherwise, it might be worth your while to try using G1GC. For G1, I'd go with a Max/Initial Heap of about 50% of RAM, a MaxGCPauseMillis of 400-500ms, and don't set Heap New Size at all (G1 computes that).
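And if you do experiment with G1 instead, a similarly hedged sketch of those recommendations:
-Xms12G
-Xmx12G
-XX:+UseG1GC
-XX:MaxGCPauseMillis=500
# do not set -Xmn with G1; it sizes the young generation itself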

Related

GC pause (G1 Evacuation Pause) issue in Cassandra

I'm getting the following error in the application logs when trying to connect to a Cassandra cluster with 6 nodes:
com.datastax.driver.core.exceptions.OperationTimedOutException: [DB1:9042] Timed out waiting for server response
I have set the Java heap to 8GB (-Xms8G -Xmx8G); I'm wondering if 8GB is too much?
Below is the timeout configuration in cassandra.yaml:
read_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 20000
write_request_timeout_in_ms: 10000
request_timeout_in_ms: 20000
There aren't heavy delete or update statements in the application, so my question is: what else might cause the long GC pauses? The majority of the GC pauses I can see in the log are of type G1 Evacuation Pause - what does that mean exactly?
The heap size heavily depends on the amount of data that you need to process. Usually a minimum of 16GB is recommended for production workloads. Also, G1 isn't very effective on small heaps (less than 12GB) - it's better to use the default ParNew/CMS collector for such heap sizes, but you may need to tune the GC to get better performance. Look into this blog post that explains tuning of the GC.
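As a rough illustration only (the sizes are placeholders, and depending on the Cassandra version these lines live in conf/jvm.options or conf/cassandra-env.sh), that advice boils down to either of:
# stay on the default ParNew/CMS collector at the current heap
-Xms8G
-Xmx8G
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
# or grow the heap to 16G before committing to G1
# -Xms16G
# -Xmx16G
# -XX:+UseG1GC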
Regarding your question on the "G1 Evacuation Pause" - look into these blog posts: 1 and 2. Here is a quote from the 2nd post:
Evacuation Pause (G1 collection) in which live objects are copied from one set of regions (young OR young+old) to another set. It is a stop-the-world activity and all the application threads are stopped at a safepoint during this time.
For you this means that you're filling regions very quickly, and the regions are big, so it takes a significant amount of time to copy the data.
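To quantify that on the node itself, something along these lines against the GC log helps (the path is an assumption - point it at wherever your GC log is written, and the second grep relies on -XX:+PrintGCApplicationStoppedTime being part of your GC logging options):
grep -c "G1 Evacuation Pause" /var/log/cassandra/gc.log
grep "Total time for which application threads were stopped" /var/log/cassandra/gc.log | tail -50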

Has anyone faced Cassandra issue "Maximum memory usage reached"?

I'm using Apache Cassandra 3.0.6, a 4-node cluster, RF=3, CONSISTENCY '1', and a 16GB heap.
I'm getting this INFO message in system.log:
INFO [SharedPool-Worker-1] 2017-03-14 20:47:14,929 NoSpamLogger.java:91 - Maximum memory usage reached (536870912 bytes), cannot allocate chunk of 1048576 bytes
I don't know exactly which memory it refers to. I tried increasing file_cache_size_in_mb from 512 to 1024 in cassandra.yaml, but it immediately filled the additional 512MB as well and the application's recording stopped, with the same INFO message:
INFO [SharedPool-Worker-5] 2017-03-16 06:01:27,034 NoSpamLogger.java:91 - Maximum memory usage reached (1073741824 bytes), cannot allocate chunk of 1048576 bytes
Please suggest if anyone has faced the same issue. Thanks!!
Bhargav
As far as I can tell with Cassandra 3.11, no matter how large you set file_cache_size_in_mb, you will still get this message. The cache fills up, and writes this useless message. It happens in my case whether I set it to 2GB or 20GB. This may be a bug in the cache eviction strategy, but I can't tell.
The log message indicates that the node's off-heap cache is full because the node is busy servicing reads.
The 536870912 bytes in the log message is equivalent to 512 MB which is the default file_cache_size_in_mb.
It is fine to see occasional occurrences of this message in the logs, which is why it is logged at INFO level. But if it gets logged repeatedly, it is an indicator that the node is getting overloaded, and you should consider increasing the capacity of your cluster by adding more nodes.
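For reference, 536870912 bytes is exactly 512 * 1024 * 1024 (the 512MB default), and the 1073741824 bytes in the second message is the 1024MB it was raised to. If you do decide to raise the cap further anyway, the knob stays in cassandra.yaml and needs a restart (the value below is purely illustrative):
file_cache_size_in_mb: 2048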
For more info, see my post on DBA Stack Exchange -- What does "Maximum memory usage reached" mean in the Cassandra logs?. Cheers!

Cassandra Error message: Not marking nodes down due to local pause. Why?

I have 6 nodes - 1 Solr and 5 Spark nodes - using DataStax. My cluster is on servers similar to Amazon EC2, with EBS volumes. Each node has 3 EBS volumes, which compose a logical data disk using LVM. In OpsCenter the same node frequently becomes unresponsive, which leads to connection timeouts in my data system. My data amount is around 400GB with 3 replicas. I have 20 streaming jobs with a batch interval of one minute. Here is my error message:
/var/log/cassandra/output.log:WARN 13:44:31,868 Not marking nodes down due to local pause of 53690474502 > 5000000000
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-25 16:40:34,944 FailureDetector.java:258 - Not marking nodes down due to local pause of 64532052919 > 5000000000
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-25 16:59:12,023 FailureDetector.java:258 - Not marking nodes down due to local pause of 66027485893 > 5000000000
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-26 13:44:31,868 FailureDetector.java:258 - Not marking nodes down due to local pause of 53690474502 > 5000000000
EDIT:
These are my more specific configurations. I would like to know whether I am doing something wrong, and if so, how can I find out in detail what it is and how to fix it?
Our heap is set to:
MAX_HEAP_SIZE="16G"
HEAP_NEWSIZE="4G"
current heap:
[root@iZ11xsiompxZ ~]# jstat -gc 11399
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
0.0 196608.0 0.0 196608.0 6717440.0 2015232.0 43417600.0 23029174.0 69604.0 68678.2 0.0 0.0 1041 131.437 0 0.000 131.437
[root@iZ11xsiompxZ ~]# jmap -heap 11399
Attaching to process ID 11399, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.102-b14
using thread-local object allocation.
Garbage-First (G1) GC with 23 thread(s)
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 51539607552 (49152.0MB)
NewSize = 1363144 (1.2999954223632812MB)
MaxNewSize = 30920409088 (29488.0MB)
OldSize = 5452592 (5.1999969482421875MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 16777216 (16.0MB)
Heap Usage:
G1 Heap:
regions = 3072
capacity = 51539607552 (49152.0MB)
used = 29923661848 (28537.427757263184MB)
free = 21615945704 (20614.572242736816MB)
58.059545404588185% used
G1 Young Generation:
Eden Space:
regions = 366
capacity = 6878658560 (6560.0MB)
used = 6140461056 (5856.0MB)
free = 738197504 (704.0MB)
89.26829268292683% used
Survivor Space:
regions = 12
capacity = 201326592 (192.0MB)
used = 201326592 (192.0MB)
free = 0 (0.0MB)
100.0% used
G1 Old Generation:
regions = 1443
capacity = 44459622400 (42400.0MB)
used = 23581874200 (22489.427757263184MB)
free = 20877748200 (19910.572242736816MB)
53.04110320109241% used
40076 interned Strings occupying 7467880 bytes.
I don't know why this happens. Thanks a lot.
The message you see, Not marking nodes down due to local pause, is due to the JVM pausing. Although you're doing some good things here by posting JVM information, often a good place to start is just looking at /var/log/cassandra/system.log - for example, check for things such as ERROR and WARN. Also check the length and frequency of GC events by grepping for GCInspector.
Tools such as nodetool tpstats are your friend here, for seeing whether you have backed-up or dropped mutations, blocked flush writers, and such.
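Concretely, a first pass on the unresponsive node might look something like this (paths assume a default package install):
grep -e "ERROR" -e "GCInspector" /var/log/cassandra/system.log | tail -50
nodetool tpstats            # blocked flush writers or dropped mutations?
nodetool compactionstats    # long-running compactions fighting for I/O?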
Docs here have some good things to check for: https://docs.datastax.com/en/landing_page/doc/landing_page/troubleshooting/cassandra/cassandraTrblTOC.html
Also check that your nodes have the recommended production settings; this is something often overlooked:
http://docs.datastax.com/en/landing_page/doc/landing_page/recommendedSettingsLinux.html
Also, one thing to note is that Cassandra is rather I/O-sensitive, and "normal" EBS might not be fast enough for what you need here. Throw Solr into the mix too, and you can see a lot of I/O contention when a Cassandra compaction and a Lucene merge go for the disk at the same time.

nodetool gcstats "GC Reclaimed (MB)" value to high

I have been monitoring gcstats for the last couple of days and can't believe the values it returns are correct.
nodetool gcstats [GC Reclaimed (MB)] shows the values below over the last 5 runs, when nothing is running against the database:
30356680056
663531768
4222674760
567091224
2147418944
The total keyspace size is less than 1 GB.
That's the result of the JVM's garbage collection, not anything specific to C*. It is the difference of the garbage collector MBean values (http://docs.oracle.com/javase/7/docs/api/java/lang/management/GarbageCollectorMXBean.html) since the last time gcstats was called.
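Since the counters reset on every call, the easiest way to see that behaviour is to run it twice with a fixed interval in between, e.g.:
nodetool gcstats   # delta since the previous call (or since startup)
sleep 300
nodetool gcstats   # only what was reclaimed during those 5 minutes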

How to prevent heap filling up

Firstly, forgive me for what might be a very naive question.
I am on a mission to identify the right NoSQL database for my project.
I was inserting and updating records in a table (column family) in a highly concurrent fashion.
Then I encountered this:
INFO 11:55:20,924 Writing Memtable-scan_request#314832703(496750/1048576 serialized/live bytes, 8204 ops)
INFO 11:55:21,084 Completed flushing /var/lib/cassandra/data/mykey/scan_request/mykey-scan_request-ic-14-Data.db (115527 bytes) for commitlog position ReplayPosition(segmentId=1372313109304, position=24665321)
INFO 11:55:21,085 Writing Memtable-scan_request#721424982(1300975/2097152 serialized/live bytes, 21494 ops)
INFO 11:55:21,191 Completed flushing /var/lib/cassandra/data/mykey/scan_request/mykey-scan_request-ic-15-Data.db (304269 bytes) for commitlog position ReplayPosition(segmentId=1372313109304, position=26554523)
WARN 11:55:21,268 Heap is 0.829968311377531 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN 11:55:21,268 Flushing CFS(Keyspace='mykey', ColumnFamily='scan_request') to relieve memory pressure
INFO 11:55:25,451 Enqueuing flush of Memtable-scan_request#714386902(324895/843149 serialized/live bytes, 5362 ops)
INFO 11:55:25,452 Writing Memtable-scan_request#714386902(324895/843149 serialized/live bytes, 5362 ops)
INFO 11:55:25,490 Completed flushing /var/lib/cassandra/data/mykey/scan_request/mykey-scan_request-ic-16-Data.db (76213 bytes) for commitlog position ReplayPosition(segmentId=1372313109304, position=27025950)
WARN 11:55:30,109 Heap is 0.9017950505664833 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid8849.hprof ...
Heap dump file created [1359702396 bytes in 105.277 secs]
WARN 12:25:26,656 Flushing CFS(Keyspace='mykey', ColumnFamily='scan_request') to relieve memory pressure
INFO 12:25:26,657 Enqueuing flush of Memtable-scan_request#728952244(419985/1048576 serialized/live bytes, 6934 ops)
It should be noted that I was able to insert and update around 6 million records before I got this. I am using Cassandra on a single node. In spite of the hint in the logs, I am not able to decide what config to change. I did check the bin/cassandra shell script, and I see they do a lot of manipulation before coming up with the -Xms and -Xmx values.
Kindly advise.
First, you can run
ps -ef|grep cassandra
to see what -Xmx is set to in your Cassandra. The default values of -Xms and -Xmx are based on the amount of your system's memory.
Check this for details:
http://www.datastax.com/documentation/cassandra/1.2/index.html?pagename=docs&version=1.2&file=index#cassandra/operations/ops_tune_jvm_c.html
You can try to increase MAX_HEAP_SIZE (in conf/cassandra-env.sh) to see if the problem would go away.
For example, you can replace
MAX_HEAP_SIZE="${max_heap_size_in_mb}M"
with
MAX_HEAP_SIZE="2048M"
I assume that tuning the garbage collector for Cassandra might solve the OOM error. With the default settings, Cassandra uses the Concurrent Mark-Sweep (CMS) garbage collector. Most often the CMS collector only starts once the heap is almost fully populated, but the CMS process itself takes some time to finish, and the problem is that the JVM runs out of space before CMS is done. We can set the percentage of used old-generation space that triggers CMS with the following options in the bin/cassandra.in.sh file, under the JAVA_OPTS variable:
-XX:CMSInitiatingOccupancyFraction={percentage} - Controls the old-generation occupancy at which CMS is triggered; set it a bit lower so CMS has time to finish before the heap fills up.
-XX:+UseCMSInitiatingOccupancyOnly - Ensures the percentage above is kept fixed.
Also, with the following options we can enable incremental CMS:
-XX:+UseConcMarkSweepGC \
-XX:+CMSIncrementalMode \
-XX:+CMSIncrementalPacing \
-XX:CMSIncrementalDutyCycleMin=0 \
-XX:CMSIncrementalDutyCycle=10
We can increase the number of parallel CMS threads, taking into account the number of CPU cores:
-XX:ParallelCMSThreads={numberOfThreads}
Further, we can tune the young-generation garbage collection to make the process optimal. Here we have to control how much gets promoted out of the young generation, by:
Increasing the size of the young generation
Delaying the promotion of young-generation objects to the old generation
To achieve this we can set the following parameters (a combined example follows this list):
-XX:NewSize={size} - Sets the initial size of the young generation
-XX:MaxNewSize={size} - Sets the maximum size of the young generation
-Xmn{size} - Sets both the initial and maximum young-generation size
-XX:NewRatio={n} - Sets the ratio of the old generation to the young generation
Before objects are migrated from the young generation to the old generation, they are held in the young generation's survivor spaces. So we can control the promotion of objects to the old generation using the following parameters:
-XX:SurvivorRatio={n} - Ratio of the young "eden" space to a survivor space
-XX:MaxTenuringThreshold={age} - The number of young-generation collections an object must survive before being promoted to the old generation
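Putting these flags together, a hedged example of what the combined options might look like (the sizes are placeholders to adapt to your machine; depending on the version they go into conf/cassandra-env.sh via JVM_OPTS or conf/cassandra.in.sh via JAVA_OPTS):
JVM_OPTS="$JVM_OPTS -Xms4G -Xmx4G -Xmn1G"
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=4"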
