Garbage collection takes a long time with huge objects - Node.js

I have a Node.js app that does real-time messaging between Redis Pub/Sub and WebSockets. The application deals with very large strings, up to around 2.5 MB per individual message. It performs fine, except when GC starts. Here is what I see with the --trace_gc flag turned on:
[29290:0x35826f0] 104218566 ms: Mark-sweep 104.2 (1139.5) -> 20.7 (1057.8) MB, 153.4 / 0 ms (+ 19.4 ms in 16 steps since start of marking, biggest step 17.0 ms) [GC interrupt] [GC in old space requested].
[29290:0x35826f0] 104219639 ms: Mark-sweep 110.8 (1146.0) -> 110.1 (1146.0) MB, 20.1 / 0 ms (+ 63.3 ms in 25 steps since start of marking, biggest step 59.0 ms) [GC interrupt] [GC in old space requested].
[29290:0x35826f0] Increasing marking speed to 3 due to high promotion rate
[29290:0x35826f0] 104222147 ms: Mark-sweep 384.7 (1415.3) -> 128.1 (1163.8) MB, 390.0 / 0 ms (+ 19.9 ms in 27 steps since start of marking, biggest step 17.0 ms) [GC interrupt] [GC in old space requested].
[29290:0x35826f0] 104224639 ms: Mark-sweep 394.0 (1424.5) -> 48.0 (1084.9) MB, 324.0 / 0 ms (+ 22.7 ms in 10 steps since start of marking, biggest step 17.0 ms) [GC interrupt] [GC in old space requested].
[29290:0x35826f0] 104225695 ms: Mark-sweep 175.1 (1209.9) -> 88.2 (1124.0) MB, 169.0 / 0 ms (+ 18.3 ms in 11 steps since start of marking, biggest step 16.3 ms) [GC interrupt] [GC in old space requested].
Here is what I have tried:
Running node with the --expose-gc flag and manually calling the GC every minute; it didn't help much in this case.
I have increased the old space size to 3GB with --max-old-space-size=3072. This decreased the frequency at which the app gets stuck in GC, but I cannot keep increasing the memory indefinitely.
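For reference, that setup looks roughly like this (the entry file name is a placeholder):
// Launched with: node --trace_gc --expose-gc --max-old-space-size=3072 server.js
// --expose-gc makes V8 expose global.gc(); guard the call in case the flag is missing.
if (typeof global.gc === 'function') {
  setInterval(() => {
    global.gc(); // force a full collection once a minute
  }, 60 * 1000);
}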
Something I have gathered from reading in various places is that I am running out of free space in the old-space region, so V8 triggers a full mark-sweep collection, which takes a long time. Since my app deals with large objects (mostly JS strings), they are probably placed in old space directly.
What can be done to minimize the GC time? Note that the app exchanges a lot of messages, and splitting the strings into smaller pieces is not an option.
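One commonly suggested mitigation for this kind of workload is to keep the large payloads as Buffers rather than JS strings, since Buffer contents are allocated outside the V8-managed heap and only a small wrapper object is subject to GC. A minimal sketch, assuming an ioredis subscriber and a ws server (both libraries, the channel name, and the port are assumptions, not part of the question):
// Hypothetical relay: forward Redis pub/sub payloads to WebSocket clients as Buffers,
// so the ~2.5 MB bodies never become large JS strings on the V8 heap.
const Redis = require('ioredis');
const WebSocket = require('ws');

const sub = new Redis();
const wss = new WebSocket.Server({ port: 8080 });

sub.subscribe('messages');
// 'messageBuffer' delivers the payload as a Buffer instead of a decoded string.
sub.on('messageBuffer', (channel, payload) => {
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) {
      client.send(payload); // ws accepts Buffers directly
    }
  }
});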

Related

Meaning of "CG in old space requested" in NodeJS/V8 engine GC trace

I sometimes see this line in the GC logs when using the --trace_gc option of Node.js:
1502684 ms: Mark-sweep 26.7 (29.2) -> 18.8 (28.2) MB, 7.7 / 0.4 ms (+ 1.0 ms in 10 steps since start of marking, biggest step 0.2 ms, walltime since start of marking 28 ms) (average mu = 1.000, current mu = 1.000) finalize incremental marking via task; GC in old space requested
Can I conclude that the GC was only partial, based on the fact that it prints "GC in old space requested"? Does this mean old space wasn't garbage collected?
I also can't understand the numbers 26.7 (29.2) -> 18.8 (28.2) MB; it looks like it reduced used memory from 29.2 to 18.8, but then I don't understand why it is suffixed with (28.2).

Out of memory - Google Kubernetes

I have an Autopilot cluster, which should increase CPU/memory on demand, but I am still getting the following error.
I have not defined any limits/resources in the deployment. I am relying on Google to handle that automatically.
It's a Node application that reads large CSV files (300-400 MB), parses them, and inserts them into a MySQL DB (using TypeORM).
It seems it works with smaller files. The files are read one by one.
In this case there are over 1200 files (not all are 300-400 MB, but quite a few are).
It seems it does not work like I thought it would...
Is this a sign that there is something wrong with the JS code, or do I just need to increase the memory manually?
<--- Last few GCs --->
[1:0x7f9fde9bf330] 4338991 ms: Scavenge (reduce) 955.5 (1037.2) -> 955.5 (1037.2) MB, 7.1 / 0.0 ms (average mu = 0.295, current mu = 0.279) allocation failure
[1:0x7f9fde9bf330] 4339001 ms: Scavenge (reduce) 956.9 (1037.7) -> 956.9 (1038.0) MB, 7.6 / 0.0 ms (average mu = 0.295, current mu = 0.279) allocation failure
[1:0x7f9fde9bf330] 4339011 ms: Scavenge (reduce) 958.1 (1038.0) -> 958.1 (1038.5) MB, 7.7 / 0.0 ms (average mu = 0.295, current mu = 0.279) allocation failure
<--- JS stacktrace --->
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
I have not defined any limits/resources in the deployment. I am relying on Google to handle that automatically.
I don't think that's true. Copy pasting from the docs:
"Autopilot relies on what you specify in your deployment configuration to provision resources. If you do not specify resource requests for any container in the Pod, Autopilot applies default values. These defaults are designed to give the containers in your Pods an average amount of resources, which are suitable for many smaller workloads."
And this: "Important: Google recommends that you explicitly set your resource requests for each container to meet your application requirements, as these default values might not be sufficient, or optimal."
So it's likely that the default resource requests and limits are too low for your application. You should set them high enough that you don't run out of memory.
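For illustration, the relevant fragment of the Deployment manifest might look like the following (container name, image, and sizes are placeholders to tune, not recommendations):
# Fragment of a Deployment spec - explicit requests/limits instead of Autopilot defaults.
spec:
  template:
    spec:
      containers:
        - name: csv-importer        # placeholder name
          image: my-importer:latest # placeholder image
          resources:
            requests:
              cpu: "1"
              memory: "4Gi"
            limits:
              memory: "4Gi"
          env:
            # Also raise Node's own heap cap, or V8 can still hit its default
            # limit well below the container's memory limit.
            - name: NODE_OPTIONS
              value: "--max-old-space-size=3072"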

Cassandra repair causes timeouts on a node

We are using a Cassandra (3.10-1) cluster with 5 nodes, each with 8 cores and 23 GiB of memory, all in the same DC.
Replication factor - 2
Consistency level - 2
Lately, during the scheduled repair that runs about once a week, there are many timeouts on queries from one node - always the same node across different repairs.
From looking at its logs:
debug.log shows several errors when building the Merkle tree for the repair.
DEBUG [MemtablePostFlush:7354] 2018-12-30 23:52:08,314 ColumnFamilyStore.java:954 - forceFlush requested but everything is clean in user_device_validation_configuration
ERROR [ValidationExecutor:973] 2018-12-30 23:52:08,315 Validator.java:268 - Failed creating a merkle tree for [repair #b3887f60-0c8d-11e9-b894-754001cf0917 on keyspace1/user_device_validation_configuration, [(-8096630393642361664,-8073407512196669022]]], /10.52.5.42 (see log for details)
DEBUG [AntiEntropyStage:1] 2018-12-30 23:52:08,318 RepairMessageVerbHandler.java:114 - Validating ValidationRequest{gcBefore=1545349928} org.apache.cassandra.repair.messages.ValidationRequest#5c1c2b28
DEBUG [ValidationExecutor:973] 2018-12-30 23:52:08,319 StorageService.java:3313 - Forcing flush on keyspace keyspace1, CF raw_sensors
DEBUG [MemtablePostFlush:7354] 2018-12-30 23:52:08,319 ColumnFamilyStore.java:954 - forceFlush requested but everything is clean in raw_sensors
ERROR [ValidationExecutor:973] 2018-12-30 23:52:08,319 Validator.java:268 - Failed creating a merkle tree for [repair #b3887f60-0c8d-11e9-b894-754001cf0917 on keyspace1/raw_sensors, [(-8096630393642361664,-8073407512196669022]]], /10.52.5.42 (see log for details)
DEBUG [AntiEntropyStage:1] 2018-12-30 23:52:08,321 RepairMessageVerbHandler.java:114 - Validating ValidationRequest{gcBefore=1545349928} org.apache.cassandra.repair.messages.ValidationRequest#5c1c2b28
DEBUG [AntiEntropyStage:1] 2018-12-30 23:52:08,321 RepairMessageVerbHandler.java:142 - Got anticompaction request AnticompactionRequest{parentRepairSession=b387e320-0c8d-11e9-b894-754001cf0917} org.apache.cassandra.repair.messages.AnticompactionRequest#d4b7ed7b
ERROR [AntiEntropyStage:1] 2018-12-30 23:52:08,322 RepairMessageVerbHandler.java:168 - Got error, removing parent repair session
ERROR [AntiEntropyStage:1] 2018-12-30 23:52:08,322 CassandraDaemon.java:229 - Exception in thread Thread[AntiEntropyStage:1,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: Parent repair session with id = b387e320-0c8d-11e9-b894-754001cf0917 has failed.
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:171) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) ~[apache-cassandra-3.10.jar:3.10]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_131]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
Caused by: java.lang.RuntimeException: Parent repair session with id = b387e320-0c8d-11e9-b894-754001cf0917 has failed.
at org.apache.cassandra.service.ActiveRepairService.getParentRepairSession(ActiveRepairService.java:400) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:435) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:143) ~[apache-cassandra-3.10.jar:3.10]
... 7 common frames omitted
DEBUG [ValidationExecutor:973] 2018-12-30 23:52:08,323 StorageService.java:3313 - Forcing flush on keyspace keyspace1, CF mouse_interactions
DEBUG [MemtablePostFlush:7354] 2018-12-30 23:52:08,323 ColumnFamilyStore.java:954 - forceFlush requested but everything is clean in mouse_interactions
ERROR [ValidationExecutor:973] 2018-12-30 23:52:08,327 Validator.java:268 - Failed creating a merkle tree for [repair #b3887f60-0c8d-11e9-b894-754001cf0917 on keyspace1/mouse_interactions, [(-8096630393642361664,-8073407512196669022]]], /10.52.5.42 (see log for details)
DEBUG [GossipStage:1] 2018-12-30 23:52:10,643 FailureDetector.java:457 - Ignoring interval time of 2000164738 for /10.52.3.47
DEBUG [GossipStage:1] 2018-12-30 23:52:13,643 FailureDetector.java:457 - Ignoring interval time of 2000239040 for /10.52.3.47
DEBUG [ReadRepairStage:407] 2018-12-30 23:52:15,133 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(7486012912397474412, 000467657474000020376337363933643363613837616231643531633936396564616234636363613400) (a0e45fcd73255bcd93a63b15d41e0843 vs 7dff1a87a755cf62150befc8352f59e8)
at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.10.jar:3.10]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
DEBUG [GossipStage:1] 2018-12-30 23:52:26,639 FailureDetector.java:457 - Ignoring interval time of 2000385682 for /10.52.3.47
After a few hours with the GC logs, I noticed that GC is being triggered roughly every 20 seconds and stops the application for more than 10 seconds each time:
2018-12-31T06:24:57.450+0000: 1184437.292: Total time for which application threads were stopped: 18.0318658 seconds, Stopping threads took: 0.0005000 seconds
2018-12-31T06:24:57.483+0000: 1184437.325: Total time for which application threads were stopped: 0.0053815 seconds, Stopping threads took: 0.0007325 seconds
2018-12-31T06:24:57.565+0000: 1184437.407: Total time for which application threads were stopped: 0.0118127 seconds, Stopping threads took: 0.0057652 seconds
2018-12-31T06:24:57.604+0000: 1184437.446: Total time for which application threads were stopped: 0.0064909 seconds, Stopping threads took: 0.0023037 seconds
2018-12-31T06:24:57.701+0000: 1184437.543: Total time for which application threads were stopped: 0.0066540 seconds, Stopping threads took: 0.0031299 seconds
{Heap before GC invocations=1377084 (full 108682):
par new generation total 943744K, used 943711K [0x00000005c0000000, 0x0000000600000000, 0x0000000600000000)
eden space 838912K, 100% used [0x00000005c0000000, 0x00000005f3340000, 0x00000005f3340000)
from space 104832K, 99% used [0x00000005f99a0000, 0x00000005ffff7ce0, 0x0000000600000000)
to space 104832K, 0% used [0x00000005f3340000, 0x00000005f3340000, 0x00000005f99a0000)
concurrent mark-sweep generation total 7340032K, used 7340032K [0x0000000600000000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 71629K, capacity 73844K, committed 75000K, reserved 1116160K
class space used 8521K, capacity 8909K, committed 9176K, reserved 1048576K
2018-12-31T06:24:58.029+0000: 1184437.870: [Full GC (Allocation Failure) 2018-12-31T06:24:58.029+0000: 1184437.871: [CMS: 7340032K->7340031K(7340032K), 15.2051822 secs] 8283743K->7443230K(8283776K), [Metaspace: 71629K->71629K(1116160K)], 15.2055451 secs] [Times: user=13.94 sys=1.28, real=15.20 secs]
Heap after GC invocations=1377085 (full 108683):
par new generation total 943744K, used 103198K [0x00000005c0000000, 0x0000000600000000, 0x0000000600000000)
eden space 838912K, 12% used [0x00000005c0000000, 0x00000005c64c7950, 0x00000005f3340000)
from space 104832K, 0% used [0x00000005f99a0000, 0x00000005f99a0000, 0x0000000600000000)
to space 104832K, 0% used [0x00000005f3340000, 0x00000005f3340000, 0x00000005f99a0000)
concurrent mark-sweep generation total 7340032K, used 7340031K [0x0000000600000000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 71629K, capacity 73844K, committed 75000K, reserved 1116160K
class space used 8521K, capacity 8909K, committed 9176K, reserved 1048576K
}
2018-12-31T06:25:13.235+0000: 1184453.076: Total time for which application threads were stopped: 15.2103050 seconds, Stopping threads took: 0.0004204 seconds
2018-12-31T06:25:13.243+0000: 1184453.085: Total time for which application threads were stopped: 0.0047592 seconds, Stopping threads took: 0.0008416 seconds
2018-12-31T06:25:13.272+0000: 1184453.114: Total time for which application threads were stopped: 0.0085538 seconds, Stopping threads took: 0.0046376 seconds
2018-12-31T06:25:13.298+0000: 1184453.140: [GC (CMS Initial Mark) [1 CMS-initial-mark: 7340031K(7340032K)] 7536074K(8283776K), 0.0650538 secs] [Times: user=0.12 sys=0.01, real=0.06 secs]
2018-12-31T06:25:13.364+0000: 1184453.206: Total time for which application threads were stopped: 0.0728487 seconds, Stopping threads took: 0.0039520 seconds
2018-12-31T06:25:13.364+0000: 1184453.206: [CMS-concurrent-mark-start]
{Heap before GC invocations=1377085 (full 108684):
par new generation total 943744K, used 943215K [0x00000005c0000000, 0x0000000600000000, 0x0000000600000000)
eden space 838912K, 100% used [0x00000005c0000000, 0x00000005f3340000, 0x00000005f3340000)
from space 104832K, 99% used [0x00000005f99a0000, 0x00000005fff7bd98, 0x0000000600000000)
to space 104832K, 0% used [0x00000005f3340000, 0x00000005f3340000, 0x00000005f99a0000)
concurrent mark-sweep generation total 7340032K, used 7340031K [0x0000000600000000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 71631K, capacity 73844K, committed 75000K, reserved 1116160K
class space used 8521K, capacity 8909K, committed 9176K, reserved 1048576K
2018-12-31T06:25:13.821+0000: 1184453.662: [Full GC (Allocation Failure) 2018-12-31T06:25:13.821+0000: 1184453.663: [CMS2018-12-31T06:25:16.960+0000: 1184456.802: [CMS-concurrent-mark: 3.592/3.596 secs] [Times: user=12.47 sys=0.38, real=3.60 secs]
So I've started checking the data spread in the cluster - we are using num_tokens: 32,
and the data seems to be balanced, ~40% on each node:
UN 10.52.2.94 672.64 GiB 32 ? ad3d1365-bbb7-4229-b586-40667ec22b41 rack1
UN 10.52.3.47 640.91 GiB 32 ? cdba952b-9685-4769-aaf4-22e538a5c37f rack1
UN 10.52.1.57 719.34 GiB 32 ? 13bb7573-eb30-489b-80c4-6e5a7c8e5f5e rack1
UN 10.52.5.42 743.04 GiB 32 ? c9e892c6-9281-4a49-b4c4-a147a387c3de rack1
UN 10.52.4.43 691.1 GiB 32 ? 53e3724e-f5a9-46b1-b330-7bb542f15f90 rack1
After checking the logs on the other nodes, I can't find any reason for those timeouts on that specific node.
Any thoughts or ideas as to what causes this to happen on the same node again and again?
It's really odd to only see this on one node. Double-check that the configs are the same. Otherwise, you might be writing/querying a large partition that this node is primarily responsible for.
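One way to check the large-partition theory is with nodetool's per-table statistics (the table name here is taken from the logs above; substitute your own):
# Partition size percentiles for a suspect table:
nodetool tablehistograms keyspace1 raw_sensors
# Or check "Compacted partition maximum bytes" across tables:
nodetool tablestats keyspace1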
Replication factor - 2 Consistency level - 2
In general, repairs can cause nodes to have trouble serving requests, as building Merkle Trees and streaming data are quite resource-intensive. I see two problems here:
Long GC pauses.
A RF/CL ratio which does not allow for any node to be down.
Starting with #2, when you have RF=2 and you require 2 replicas to respond to your queries, you are essentially querying at consistency ALL. Therefore, should a node become overwhelmed and short on resources, your queries will be unable to complete. If it were me, I would increase the RF to 3 and run a repair (assuming the nodes have adequate storage). The other option would be to query at a consistency level of ONE, which is probably what you should be doing with RF=2 anyway.
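Roughly what those two options look like, using the keyspace name from the logs (adjust the replication class to whatever keyspace1 actually uses):
-- Option 1: raise the replication factor, then run a full repair on each node.
ALTER KEYSPACE keyspace1
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
-- afterwards: nodetool repair -full keyspace1

-- Option 2: keep RF=2, but read/write at ONE (shown here as a cqlsh setting).
CONSISTENCY ONE;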
Also, when diagnosing query problems, it usually helps to see the actual query being run (which you have not provided). With Cassandra, more often than not, query issues are the result of queries which don't fit the designed table.
As for the long GC pauses, that's a trickier problem to solve. There's an old Cassandra JIRA (CASSANDRA-8150) ticket which talks about optimal settings for the CMS Garbage Collector. You should give that a read.
What is your MAX_HEAP set to? I see your new generation is less than 1GB, which is going to be too small. With 23GB of RAM on each node, I'd start with the following settings for using CMS GC:
Max Heap Size (Xmx) of 8GB-12GB (you want to leave about 50% of RAM for the OS/page-cache).
Initial Heap Size (Xms) equal to Max Heap Size to avoid the performance hit of resizing the heap.
Heap New Size (Xmn) of 40-50% of your Max Heap Size (so somewhere between 3GB-6GB). You want plenty of room available for the new generation.
MaxTenuringThreshold of 6 to 8, as you want objects to be passed around the new gen survivor spaces until they die, in lieu of being promoted to the old gen. By default, I think this is set to 1, which means objects will get promoted too quickly.
Basically, new-to-old/old-to-permanent promotion is where the long pauses happen. Ideally, you'd like all objects on your heap to be created, live, and die in the new gen, without being promoted.
Otherwise, it might be worth your while to try using G1GC. For G1, I'd go with a Max/Initial Heap of about 50% of RAM, a MaxGCPauseMillis of 400-500ms, and don't set Heap New Size at all (G1 computes that).
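As a sketch only, those suggestions translate into jvm.options entries along these lines (the exact sizes are starting points to tune on your 23 GiB nodes, not prescriptions):
# CMS starting point, per the settings above:
-Xms8G
-Xmx8G
-Xmn3G
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:MaxTenuringThreshold=6

# Or G1 instead (leave -Xmn unset; G1 sizes the young gen itself):
#-Xms12G
#-Xmx12G
#-XX:+UseG1GC
#-XX:MaxGCPauseMillis=500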

Node.js V8 Mark-sweep optimization?

It seems that for my server the mark-sweep operation takes seconds, in a stop-the-world way:
Oct 17 08:26:27 s3 u[30843]: [30843:0x26671a0] 63025059 ms: Mark-sweep 2492.7 (3285.6) -> 2317.6 (2945.0) MB, 84.9 / 0.1 ms (+ 3223.4 ms in 3877 steps since start of marking, biggest step 731.7 ms, walltime since start of marking 3315 ms) finalize incremental marking via task GC in old space requested
Oct 17 08:26:27 s3 u[30843]: Execution blocked for 3273 ms
.
Oct 17 08:28:15 s3 u[30843]: [30843:0x26671a0] 63133051 ms: Mark-sweep 2499.8 (3298.4) -> 2313.4 (2947.1) MB, 160.2 / 0.1 ms (+ 3691.7 ms in 3679 steps since start of marking, biggest step 1073.4 ms, walltime since start of marking 3859 ms) finalize incremental marking via task GC in old space requested
Oct 17 08:28:15 s3 u[30843]: Execution blocked for 3791 ms
.
This behavior is similar to what is described in https://github.com/nodejs/help/issues/947; it seems to be somewhat related to memory consumption and worsens over time.
The problem existed in Node 7, but was barely noticeable. Now, with Node 8.12, it reaches a 5-second block within 24 hours.
I suspected it might have something to do with my unorthodox way of storing data in big objects, so I tried splitting one 2 GB object into smaller objects, cutting it down to 1 GB, but saw no obvious benefit. However, these objects are very simple, just big.
The questions are:
Is there a V8 option that makes mark-sweep mitigate this problem? Less often, smaller steps, skipping some optimizations, anything?
I could not find any decent documentation about V8 options - is there any?
How can I help mark-sweep do its task more efficiently?
Are there common cases to avoid where mark-sweep could struggle?
Any desperate hacks I could try?

Node.js process hangs when creating huge object

I have a node process that I use to add key-values to an object. When I get to about 9.88 million keys added, the process appears to hang. I assumed an out-of-memory issue, so I turned on trace_gc and also put a check in the code that adds the keys:
const { heapTotal, heapUsed } = process.memoryUsage()
// Bail out if heap usage is close to the currently allocated heap size.
if ((heapUsed / heapTotal) > 0.99) {
  throw new Error('Too much memory')
}
That condition was never met, and the error was never thrown. As for the --trace_gc output, my last scavenge log line was:
[21544:0x104000000] 2153122 ms: Scavenge 830.0 (889.8) -> 814.3 (889.8) MB, 1.0 / 0.0 ms allocation failure
Mark-sweep, however, continues logging this:
[21544:0x104000000] 3472253 ms: Mark-sweep 1261.7 (1326.9) -> 813.4 (878.8) MB, 92.3 / 0.1 ms (+ 1880.1 ms in 986 steps since start of marking, biggest step 5.6 ms, walltime since start of marking 12649 ms) finalize incremental marking via task GC in old space requested
Is this output consistent with memory issues?
I should note that having to add this many keys to the object is an edge case; normally the count is more likely in the thousands. In addition, the keys are added during a streaming process, so I don't know at the outset how many will be added. So besides trying to figure out what the specific problem is, I'm also looking for a way to detect that the problem is likely to occur before the process hangs.
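Not an answer to the root cause, but a hedged sketch of an early-warning check: compare heap usage against V8's actual limit (heap_size_limit, rather than heapTotal, which itself grows), and watch GC pause durations via perf_hooks. Both APIs are standard Node; the 90% and 200 ms thresholds below are made up:
// Early-warning sketch: heap headroom against the real V8 limit, plus long GC pauses.
const v8 = require('v8');
const { PerformanceObserver } = require('perf_hooks');

setInterval(() => {
  const { heap_size_limit, used_heap_size } = v8.getHeapStatistics();
  if (used_heap_size / heap_size_limit > 0.9) { // threshold is arbitrary
    console.warn('Heap above 90% of the V8 limit - likely to stall or OOM soon');
  }
}, 5000);

// Log garbage-collection pauses longer than 200 ms (threshold is arbitrary).
const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.duration > 200) {
      console.warn(`Long GC pause: ${entry.duration.toFixed(0)} ms`);
    }
  }
});
obs.observe({ entryTypes: ['gc'] });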

Resources