nodejs v8 memory gc allocation failure

I am trying to profile Node.js/V8 memory usage with a do-nothing server.
I used node-memwatch to take a heap diff: I collect heap info before the clients connect and again after the connections are torn down. I tried 200 concurrent connections from the client side.
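Roughly, the measurement setup described above looks like this (a minimal sketch assuming node-memwatch's HeapDiff API; the do-nothing server, port, and timing are placeholders, not the original code):
var http = require('http');
var memwatch = require('memwatch');

// do-nothing server: accept the request and reply immediately
var server = http.createServer(function (req, res) {
  res.end();
});
server.listen(8080);

// snapshot the heap before the load test...
var hd = new memwatch.HeapDiff();

// ...and diff it some time after the 200 connections have been torn down
setTimeout(function () {
  console.log(JSON.stringify(hd.end(), null, 2));
}, 60 * 1000);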
Here is the GC trace after the connections were torn down.
Can anyone help me understand:
1. Why does memory keep increasing? After the connections are torn down, the server is doing absolutely nothing. Shouldn't it just keep dropping as garbage gets collected?
2. What are those allocation failures? How do I actually interpret the trace here?
15802 ms: Mark-sweep 8.9 (45.0) -> 8.1 (45.0) MB, 58 ms [allocation failure] [GC in old space forced by flags].
16144 ms: Mark-sweep 9.2 (45.0) -> 8.4 (45.0) MB, 53 ms [allocation failure] [GC in old space forced by flags].
16495 ms: Mark-sweep 9.5 (45.0) -> 8.7 (46.0) MB, 60 ms [allocation failure] [GC in old space forced by flags].
16837 ms: Mark-sweep 9.8 (46.0) -> 9.0 (46.0) MB, 56 ms [allocation failure] [GC in old space forced by flags].
17197 ms: Mark-sweep 10.1 (46.0) -> 9.4 (46.0) MB, 62 ms [allocation failure] [GC in old space forced by flags].
17905 ms: Mark-sweep 11.5 (46.0) -> 10.0 (47.0) MB, 74 ms [Runtime::PerformGC] [GC in old space forced by flags].
18596 ms: Mark-sweep 12.2 (47.0) -> 10.7 (47.0) MB, 75 ms [Runtime::PerformGC] [GC in old space forced by flags].
19315 ms: Mark-sweep 12.8 (47.0) -> 11.3 (48.0) MB, 83 ms [allocation failure] [GC in old space forced by flags].
20035 ms: Mark-sweep 13.4 (48.0) -> 12.0 (49.0) MB, 90 ms [Runtime::PerformGC] [GC in old space forced by flags].
21487 ms: Mark-sweep 16.0 (49.0) -> 13.2 (50.0) MB, 96 ms [Runtime::PerformGC] [GC in old space forced by flags].
22950 ms: Mark-sweep 17.3 (50.0) -> 14.5 (52.0) MB, 116 ms [Runtime::PerformGC] [GC in old space forced by flags].
24376 ms: Mark-sweep 18.8 (52.0) -> 15.9 (53.0) MB, 114 ms [allocation failure] [GC in old space forced by flags].
25849 ms: Mark-sweep 19.9 (53.0) -> 17.2 (54.0) MB, 129 ms [Runtime::PerformGC] [GC in old space forced by flags].
28773 ms: Mark-sweep 25.2 (54.0) -> 19.7 (57.0) MB, 149 ms [allocation failure] [GC in old space forced by flags].
31725 ms: Mark-sweep 27.7 (57.0) -> 22.2 (59.0) MB, 172 ms [Runtime::PerformGC] [GC in old space forced by flags].
34678 ms: Mark-sweep 30.2 (59.0) -> 24.7 (61.0) MB, 190 ms [Runtime::PerformGC] [GC in old space forced by flags].
44045 ms: Mark-sweep 28.4 (61.0) -> 25.8 (63.0) MB, 180 ms [idle notification] [GC in old space forced by flags].
44216 ms: Mark-sweep 25.8 (63.0) -> 25.8 (63.0) MB, 170 ms [idle notification] [GC in old space requested].
57471 ms: Mark-sweep 26.9 (63.0) -> 25.8 (62.0) MB, 167 ms [Runtime::PerformGC] [GC in old space forced by flags].
57651 ms: Mark-sweep 26.8 (62.0) -> 25.5 (62.0) MB, 160 ms [Runtime::PerformGC] [GC in old space forced by flags].
57828 ms: Mark-sweep 26.5 (62.0) -> 25.5 (62.0) MB, 159 ms [Runtime::PerformGC] [GC in old space forced by flags].
Thanks,

"allocation failure" sounds very dramatic, but there is no real failure involved. It just means that we allocated so much memory that it is time to do a GC to see if we can collect some memory.
It looks like you are running with the --gc-global flag ("GC forced by flags"). That's a bad idea for production, though it may be fine for narrowing down a problem when debugging.
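For comparison, dropping that flag (with a hypothetical entry point named server.js) would mean launching with just:
node --trace-gc server.js
instead of something like:
node --trace-gc --gc-global server.js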
I can't tell why your process is leaking. You may find the heap profiler useful. See https://github.com/felixge/node-memory-leak-tutorial
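For example, one way to capture heap snapshots for that kind of comparison (not necessarily what the linked tutorial uses; the file paths and timing below are placeholders) is the heapdump module:
var heapdump = require('heapdump');

// take one snapshot before the load test and another after the
// connections have been torn down, then compare them in Chrome DevTools
heapdump.writeSnapshot('/tmp/before-' + Date.now() + '.heapsnapshot');

setTimeout(function () {
  heapdump.writeSnapshot('/tmp/after-' + Date.now() + '.heapsnapshot');
}, 60 * 1000);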

According to the code:
PrintF("%s %.1f (%.1f) -> %.1f (%.1f) MB, ",
CollectorString(),
static_cast<double>(start_object_size_) / MB,
static_cast<double>(start_memory_size_) / MB,
SizeOfHeapObjects(),
end_memory_size_mb);
Each line is one GC. When the GC starts,
start_object_size_ = heap_->SizeOfObjects();
In the GC summary:
PrintF("total_size_before=%" V8_PTR_PREFIX "d ", start_object_size_);
PrintF("total_size_after=%" V8_PTR_PREFIX "d ", heap_->SizeOfObjects());
As for why start_object_size_ increases while my app is idle, I am guessing that during the GC some objects get promoted to old space, which increases the object size in old space.
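As an illustration of that guess (hypothetical code, not the original server): any long-lived reference, for example a module-level array that something appends to per connection or per timer tick, keeps those objects reachable, so successive GCs promote them to old space and SizeOfObjects() keeps creeping up even while the process looks idle:
var retained = [];  // lives for the whole process lifetime

setInterval(function () {
  // anything pushed here survives every scavenge, gets promoted to
  // old space, and is never collected because it stays reachable
  retained.push({ ts: Date.now(), payload: new Array(1000).join('x') });
}, 10);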

Related

Neo4j: Worker for session..crashed. Java Heap Space OutOfMemoryError

The Spark job I am running:
It is a pretty simple program that was converted from Java to Scala and 'parallelized' (it was not intended to be run in parallel, but is an experiment to a) learn Spark and Neo4j and b) see if I can get some speed gains just by running on a Spark cluster with more nodes doing more work). The reason is that the big bottleneck is a spatial call within the Neo4j Cypher script (a withinDistance call). The test data set is pretty small: 52,000 nodes and a database of about 140 MB.
Also, when Neo4j starts up it gives me this warning:
Starting Neo4j.
WARNING: Max 4096 open files allowed, minimum of 40000 recommended. See the Neo4j manual.
/usr/share/neo4j/bin/neo4j: line 411: /var/run/neo4j/neo4j.pid: No such file or directory
Which is strange, since I believe that refers to open files and I asked the system admin to set it much higher. (ulimit -Hn seems to confirm this: it says 90,000, though ulimit -a shows open files at 4096 (the soft limit); I guess that is what Neo4j sees and complains about.)
Also, when I ran this locally on my macOS machine, the software would run and execute for about 14 hours or so (maybe 9), and then I would see in the console that the database would just stop talking to Spark. It wasn't down or anything: the job would time out, yet I could still cypher-shell into the database. But it would somehow lose the connection to the Spark jobs, so they would keep trying, and finally the spark-submit would just give up and stop.
C02RH2U9G8WM:scala-2.11 little.mac$ ulimit -Hn
unlimited
(Also, since the last edit I have upped my limits further in the Neo4j conf, now with a 4 GB max heap size.)
Some code bits from the job (using the code ported to Scala, with Spark DataFrames added; I know it is not properly parallelized, but I was hoping to get something working before pressing forward). I was building a hybrid program, like the Java code I ported, but using DataFrames from Spark (connected to Neo4j).
Essentially (pseudo code):
while (going through all these lat and lons)
{
  def DoCalculation()
  {
    val noBbox = "call spatial.bbox('geom', {lat:" + minLat + ",lon:" + minLon + "}, {lat:" + maxLat + ",lon:" + maxLon + "}) yield node return node.altitude as altitude, node.gtype as gtype, node.toDateFormatLong as toDateFormatLong, node.latitude as latitude, node.longitude as longitude, node.fromDateFormatLong as fromDateFormatLong, node.fromDate as fromDate, node.toDate as toDate ORDER BY node.toDateFormatLong DESC";
    try {
      // not overly sure what the partitions and batch are really doing for me
      val initialDf2 = neo.cypher(noBbox).partitions(5).batch(10000).loadDataFrame
      val theRow = initialDf2.collect() // was someStr
      for (i <- 0 until theRow.length) {
        // do more calculations
        var radius2 = 100
        // this call is where the biggest bottleneck is; the spatial withinDistance is where I thought
        // I could put this code on Spark, make the same long-running calls through DataFrames,
        // and get more speed gains by batching the work out to many nodes
        val pointQuery = "call spatial.withinDistance('geom', {lat:" + lat + ",lon:" + lon + "}, " + radius2 + ") yield node, distance WITH node, distance match (node:POINT) WHERE node.toDateFormatLong < " + toDateFormatLong + " return node.fromDateFormatLong as fromDateFormatLong, node.toDateFormatLong as toDateFormatLong";
        try {
          val pointResults = neo.cypher(pointQuery).loadDataFrame; // did I need to batch here?
          var prRow = pointResults.collect();
          // do stuff with prRow
        } catch {
          case e: Exception => e.printStackTrace
        }
        // do way more stuff with the data just in some Scala/Java data structures
      }
    } catch {
      case e: Exception => println("EMPTY COLLECTION")
    }
  }
}
Running a spark-submit job that uses the Spark connector to connect to Neo4j, I get these errors in /var/log/neo4j/neo4j.log:
java.lang.OutOfMemoryError: Java heap space
2017-12-27 03:17:13.969+0000 ERROR Worker for session '13662816-0a86-4c95-8b7f-cea9d92440c8' crashed. Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1855)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2068)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
at org.neo4j.bolt.v1.runtime.concurrent.RunnableBoltWorker.run(RunnableBoltWorker.java:88)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:109)
2017-12-27 03:17:23.244+0000 ERROR Worker for session '75983e7c-097a-4770-bcab-d63f78300dc5' crashed. Java heap space
java.lang.OutOfMemoryError: Java heap space
I know that in the neo4j.conf file I can change the heap sizes (currently commented out, but set to 512m). The thing I am asking about is what it says in the conf file:
# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size.
So doesn't this mean I should leave the heap sizes alone in the conf, if the calculated values are surely more than anything I could set? (These machines have 8 cores and 8 GB RAM.) Or would setting them explicitly really help? Maybe to 2000 (if it's in megabytes), to get two gigs? I ask because I feel the error log is reporting this out-of-memory error, but the real cause is something else.
EDIT: my JVM values from the debug.log
BEFORE:
2017-12-26 16:24:06.768+0000 INFO [o.n.k.i.DiagnosticsManager] NETWORK
2017-12-26 16:24:06.768+0000 INFO [o.n.k.i.DiagnosticsManager] System memory information:
2017-12-26 16:24:06.771+0000 INFO [o.n.k.i.DiagnosticsManager] Total Physical memory: 7.79 GB
2017-12-26 16:24:06.772+0000 INFO [o.n.k.i.DiagnosticsManager] Free Physical memory: 5.49 GB
2017-12-26 16:24:06.772+0000 INFO [o.n.k.i.DiagnosticsManager] Committed virtual memory: 5.62 GB
2017-12-26 16:24:06.773+0000 INFO [o.n.k.i.DiagnosticsManager] Total swap space: 16.50 GB
2017-12-26 16:24:06.773+0000 INFO [o.n.k.i.DiagnosticsManager] Free swap space: 16.49 GB
2017-12-26 16:24:06.773+0000 INFO [o.n.k.i.DiagnosticsManager] JVM memory information:
2017-12-26 16:24:06.773+0000 INFO [o.n.k.i.DiagnosticsManager] Free memory: 85.66 MB
2017-12-26 16:24:06.773+0000 INFO [o.n.k.i.DiagnosticsManager] Total memory: 126.00 MB
2017-12-26 16:24:06.774+0000 INFO [o.n.k.i.DiagnosticsManager] Max memory: 1.95 GB
2017-12-26 16:24:06.776+0000 INFO [o.n.k.i.DiagnosticsManager] Garbage Collector: G1 Young Generation: [G1 Eden Space, G1 Survivor Space]
2017-12-26 16:24:06.776+0000 INFO [o.n.k.i.DiagnosticsManager] Garbage Collector: G1 Old Generation: [G1 Eden Space, G1 Survivor Space, G1 Old Gen]
2017-12-26 16:24:06.777+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: Code Cache (Non-heap memory): committed=4.94 MB, used=4.93 MB, max=240.00 MB, threshold=0.00 B
2017-12-26 16:24:06.777+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: Metaspace (Non-heap memory): committed=14.38 MB, used=13.41 MB, max=-1.00 B, threshold=0.00 B
2017-12-26 16:24:06.777+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: Compressed Class Space (Non-heap memory): committed=1.88 MB, used=1.64 MB, max=1.00 GB, threshold=0.00 B
2017-12-26 16:24:06.778+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: G1 Eden Space (Heap memory): committed=39.00 MB, used=35.00 MB, max=-1.00 B, threshold=?
2017-12-26 16:24:06.778+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: G1 Survivor Space (Heap memory): committed=3.00 MB, used=3.00 MB, max=-1.00 B, threshold=?
2017-12-26 16:24:06.778+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: G1 Old Gen (Heap memory): committed=84.00 MB, used=1.34 MB, max=1.95 GB, threshold=0.00 B
2017-12-26 16:24:06.778+0000 INFO [o.n.k.i.DiagnosticsManager] Operating system information:
2017-12-26 16:24:06.779+0000 INFO [o.n.k.i.DiagnosticsManager] Operating System: Linux; version: 3.10.0-693.5.2.el7.x86_64; arch: amd64; cpus: 8
2017-12-26 16:24:06.779+0000 INFO [o.n.k.i.DiagnosticsManager] Max number of file descriptors: 90000
2017-12-26 16:24:06.780+0000 INFO [o.n.k.i.DiagnosticsManager] Number of open file descriptors: 103
2017-12-26 16:24:06.782+0000 INFO [o.n.k.i.DiagnosticsManager] Process id: 26252@hp380-1
2017-12-26 16:24:06.782+0000 INFO [o.n.k.i.DiagnosticsManager] Byte order: LITTLE_ENDIAN
2017-12-26 16:24:06.793+0000 INFO [o.n.k.i.DiagnosticsManager] Local timezone: Etc/GMT
2017-12-26 16:24:06.793+0000 INFO [o.n.k.i.DiagnosticsManager] JVM information:
2017-12-26 16:24:06.794+0000 INFO [o.n.k.i.DiagnosticsManager] VM Name: OpenJDK 64-Bit Server VM
2017-12-26 16:24:06.794+0000 INFO [o.n.k.i.DiagnosticsManager] VM Vendor: Oracle Corporation
2017-12-26 16:24:06.794+0000 INFO [o.n.k.i.DiagnosticsManager] VM Version: 25.151-b12
2017-12-26 16:24:06.794+0000 INFO [o.n.k.i.DiagnosticsManager] JIT compiler: HotSpot 64-Bit Tiered Compilers
2017-12-26 16:24:06.795+0000 INFO [o.n.k.i.DiagnosticsManager] VM Arguments: [-XX:+UseG1GC, -XX:-OmitStackTraceInFastThrow, -XX:+AlwaysPreTouch, -XX:+UnlockExperimentalVMOptions, -XX:+TrustFinalNonStaticFields, -XX:+DisableExplicitGC, -Djdk.tls.ephemeralDHKeySize=2048, -Dunsupported.dbms.udc.source=rpm, -Dfile.encoding=UTF-8]
2017-12-26 16:24:06.795+0000 INFO [o.n.k.i.DiagnosticsManager] Java classpath:
AFTER:
2017-12-27 16:17:30.740+0000 INFO [o.n.k.i.DiagnosticsManager] System memory information:
2017-12-27 16:17:30.749+0000 INFO [o.n.k.i.DiagnosticsManager] Total Physical memory: 7.79 GB
2017-12-27 16:17:30.750+0000 INFO [o.n.k.i.DiagnosticsManager] Free Physical memory: 4.23 GB
2017-12-27 16:17:30.750+0000 INFO [o.n.k.i.DiagnosticsManager] Committed virtual memory: 5.62 GB
2017-12-27 16:17:30.751+0000 INFO [o.n.k.i.DiagnosticsManager] Total swap space: 16.50 GB
2017-12-27 16:17:30.751+0000 INFO [o.n.k.i.DiagnosticsManager] Free swap space: 16.19 GB
2017-12-27 16:17:30.751+0000 INFO [o.n.k.i.DiagnosticsManager] JVM memory information:
2017-12-27 16:17:30.751+0000 INFO [o.n.k.i.DiagnosticsManager] Free memory: 1.89 GB
2017-12-27 16:17:30.751+0000 INFO [o.n.k.i.DiagnosticsManager] Total memory: 1.95 GB
2017-12-27 16:17:30.752+0000 INFO [o.n.k.i.DiagnosticsManager] Max memory: 1.95 GB
2017-12-27 16:17:30.777+0000 INFO [o.n.k.i.DiagnosticsManager] Garbage Collector: G1 Young Generation: [G1 Eden Space, G1 Survivor Space]
2017-12-27 16:17:30.777+0000 INFO [o.n.k.i.DiagnosticsManager] Garbage Collector: G1 Old Generation: [G1 Eden Space, G1 Survivor Space, G1 Old Gen]
2017-12-27 16:17:30.778+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: Code Cache (Non-heap memory): committed=4.94 MB, used=4.89 MB, max=240.00 MB, threshold=0.00 B
2017-12-27 16:17:30.778+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: Metaspace (Non-heap memory): committed=14.38 MB, used=13.42 MB, max=-1.00 B, threshold=0.00 B
2017-12-27 16:17:30.778+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: Compressed Class Space (Non-heap memory): committed=1.88 MB, used=1.64 MB, max=1.00 GB, threshold=0.00 B
2017-12-27 16:17:30.779+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: G1 Eden Space (Heap memory): committed=105.00 MB, used=59.00 MB, max=-1.00 B, threshold=?
2017-12-27 16:17:30.779+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: G1 Survivor Space (Heap memory): committed=0.00 B, used=0.00 B, max=-1.00 B, threshold=?
2017-12-27 16:17:30.779+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: G1 Old Gen (Heap memory): committed=1.85 GB, used=0.00 B, max=1.95 GB, threshold=0.00 B
2017-12-27 16:17:30.779+0000 INFO [o.n.k.i.DiagnosticsManager] Operating system information:
2017-12-27 16:17:30.780+0000 INFO [o.n.k.i.DiagnosticsManager] Operating System: Linux; version: 3.10.0-693.5.2.el7.x86_64; arch: amd64; cpus: 8
2017-12-27 16:17:30.780+0000 INFO [o.n.k.i.DiagnosticsManager] Max number of file descriptors: 90000
2017-12-27 16:17:30.781+0000 INFO [o.n.k.i.DiagnosticsManager] Number of open file descriptors: 103
2017-12-27 16:17:30.785+0000 INFO [o.n.k.i.DiagnosticsManager] Process id: 20774@hp380-1
2017-12-27 16:17:30.785+0000 INFO [o.n.k.i.DiagnosticsManager] Byte order: LITTLE_ENDIAN
2017-12-27 16:17:30.814+0000 INFO [o.n.k.i.DiagnosticsManager] Local timezone: Etc/GMT
2017-12-27 16:17:30.815+0000 INFO [o.n.k.i.DiagnosticsManager] JVM information:
2017-12-27 16:17:30.815+0000 INFO [o.n.k.i.DiagnosticsManager] VM Name: OpenJDK 64-Bit Server VM
2017-12-27 16:17:30.815+0000 INFO [o.n.k.i.DiagnosticsManager] VM Vendor: Oracle Corporation
2017-12-27 16:17:30.815+0000 INFO [o.n.k.i.DiagnosticsManager] VM Version: 25.151-b12
2017-12-27 16:17:30.815+0000 INFO [o.n.k.i.DiagnosticsManager] JIT compiler: HotSpot 64-Bit Tiered Compilers
2017-12-27 16:17:30.816+0000 INFO [o.n.k.i.DiagnosticsManager] VM Arguments: [-Xms2000m, -Xmx2000m, -XX:+UseG1GC, -XX:-OmitStackTraceInFastThrow, -XX:+AlwaysPreTouch, -XX:+UnlockExperimentalVMOptions, -XX:+TrustFinalNonStaticFields, -XX:+DisableExplicitGC, -Djdk.tls.ephemeralDHKeySize=2048, -Dunsupported.dbms.udc.source=rpm, -Dfile.encoding=UTF-8]
2017-12-27 16:17:30.816+0000 INFO [o.n.k.i.DiagnosticsManager] Java classpath:
Just an FYI: I still seem to get Java heap errors. These machines (not for production, just dev) have only 8 GB each.
We usually recommend setting these yourself. You can check your debug.log file for the logs during startup, which report the values it chose to use as defaults. You're looking for an excerpt like this:
JVM memory information:
Free memory: 204.79 MB
Total memory: 256.00 MB
Max memory: 4.00 GB
I believe the Total memory is the initial heap size and Max memory is the max heap size.
When setting this yourself, we usually recommend keeping the initial and max set to the same value. Here's a knowledge base article on estimating initial memory configuration that may be helpful.
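For reference, explicit values in neo4j.conf look like this (a sketch only; 2g matches the -Xms2000m/-Xmx2000m visible in the debug.log above, but the right numbers should come from the estimation article):
# Neo4j 3.x property names
dbms.memory.heap.initial_size=2g
dbms.memory.heap.max_size=2g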
If the defaults seem sufficient, then it may be better to look for other areas to optimize, or see if the issue is known on the apache-spark side of things.

Spark stuck in UDF on aws cluster

I have a Spark job that retrieves data from a few Redshift tables, applies some transformations like join and groupBy, and also applies some UDFs to some columns.
I have executed it in Spark standalone on my local machine and it works properly. But when I execute it on the AWS cluster, it gets stuck on the UDF; I have tried removing the UDF and then it works.
Related to that, I have found this
I need to use the UDFs, but if I do, the job gets stuck for 2 or 3 hours at those tasks and then the Spark job finishes without error; it just stops.
Has anyone experienced something similar? Any help will be appreciated.
EDIT:
When I remove the UDFs the job works properly.
But with the UDFs it gets stuck with a few tasks remaining; here is the end of the logs:
stdout log:
2017-06-07T09:18:01.929+0000: [GC (Allocation Failure) 2017-06-07T09:18:01.929+0000: [ParNew: 66492K->2341K(72512K), 0.0024644 secs] 648682K->584531K(1042416K), 0.0025210 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2017-06-07T09:18:01.962+0000: [GC (Allocation Failure) 2017-06-07T09:18:01.962+0000: [ParNew: 66758K->2487K(72512K), 0.0022863 secs] 648948K->584677K(1042416K), 0.0023321 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2017-06-07T09:18:02.001+0000: [GC (Allocation Failure) 2017-06-07T09:18:02.001+0000: [ParNew: 66999K->3757K(72512K), 0.0028101 secs] 649189K->585953K(1042416K), 0.0028601 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2017-06-07T09:18:02.030+0000: [GC (Allocation Failure) 2017-06-07T09:18:02.030+0000: [ParNew: 68269K->2462K(72512K), 0.0019834 secs] 650465K->584706K(1042416K), 0.0020289 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2017-06-07T09:18:02.130+0000: [GC (Allocation Failure) 2017-06-07T09:18:02.130+0000: [ParNew: 66974K->6797K(72512K), 0.0038833 secs] 649218K->589043K(1042416K), 0.0039409 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
2017-06-07T09:18:02.309+0000: [GC (Allocation Failure) 2017-06-07T09:18:02.309+0000: [ParNew: 71311K->8000K(72512K), 0.0209973 secs] 653556K->595016K(1042416K), 0.0210531 secs] [Times: user=0.10 sys=0.00, real=0.02 secs]
2017-06-07T09:18:02.331+0000: [GC (GCLocker Initiated GC) 2017-06-07T09:18:02.331+0000: [ParNew: 8632K->3373K(72512K), 0.0131140 secs] 595648K->595234K(1042416K), 0.0131557 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
2017-06-07T09:22:28.879+0000: [GC (Allocation Failure) 2017-06-07T09:22:28.879+0000: [ParNew: 67885K->1862K(72512K), 0.0018928 secs] 659746K->593723K(1042416K), 0.0019463 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2017-06-07T09:27:48.879+0000: [GC (Allocation Failure) 2017-06-07T09:27:48.879+0000: [ParNew: 66374K->1231K(72512K), 0.0014260 secs] 658235K->593093K(1042416K), 0.0014730 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2017-06-07T09:33:08.879+0000: [GC (Allocation Failure) 2017-06-07T09:33:08.879+0000: [ParNew: 65743K->1075K(72512K), 0.0016924 secs] 657605K->592937K(1042416K), 0.0017409 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
stderr log:
17/06/07 09:18:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/06/07 09:18:02 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1200 blocks
17/06/07 09:18:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/06/07 09:18:02 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/06/07 09:18:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/06/07 09:18:02 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 1200 blocks
17/06/07 09:18:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/06/07 09:18:02 INFO CodeGenerator: Code generated in 29.410073 ms
17/06/07 09:18:02 INFO CodeGenerator: Code generated in 8.06304 ms
17/06/07 09:18:02 INFO CodeGenerator: Code generated in 12.481201 ms
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_10 stored as values in memory (estimated size 928.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_9 stored as values in memory (estimated size 928.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_3 stored as values in memory (estimated size 904.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_15 stored as values in memory (estimated size 904.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_13 stored as values in memory (estimated size 888.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_5 stored as values in memory (estimated size 904.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_12 stored as values in memory (estimated size 904.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_8 stored as values in memory (estimated size 928.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO CodeGenerator: Code generated in 17.574289 ms
17/06/07 09:18:02 INFO CodeGenerator: Code generated in 8.639658 ms
Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']
Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']
Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']
Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']
Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']
Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']
Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']
Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']
Why do I get the message "Could not find valid SPARK_HOME while searching ..." when using the UDFs?

Spark Memory Error Java Runtime Environment

I have a Spark job written in Python, where I retrieve data from Redshift and then apply many transformations: join, filter, withColumn, agg, ...
There are around 30K records in the dataframes.
I perform all the transformations, and when I try to write an Avro file the Spark job fails.
My spark-submit:
. /usr/bin/spark-submit --packages="com.databricks:spark-avro_2.11:3.2.0" --jars RedshiftJDBC42-1.2.1.1001.jar --deploy-mode client --master yarn --num-executors 10 --executor-cores 3 --executor-memory 10G --driver-memory 14g --conf spark.sql.broadcastTimeout=3600 --conf spark.network.timeout=10000000 --py-files dependencies.zip iface_extractions.py 2016-10-01 > output.log
I'm using --executor-memory 10G --driver-memory 14g on 6 machines in Amazon with 8 cores and 15 GB RAM each. Why am I getting an out-of-memory error?
Error returned:
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 196608 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/hadoop/hs_err_pid13688.log
This is the end of spark log:
17/05/29 10:13:09 INFO TaskSetManager: Starting task 0.0 in stage 21.0 (TID 19, ip-10-185-53-172.eu-west-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 5779 bytes)
17/05/29 10:13:09 INFO TransportClientFactory: Successfully created connection to ip-10-185-53-172.eu-west-1.compute.internal/10.185.53.172:39759 after 3 ms (0 ms spent in bootstraps)
17/05/29 10:13:09 INFO BlockManagerInfo: Added broadcast_24_piece0 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 8.9 KB, free: 5.3 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_8_piece0 on 10.185.52.91:43829 in memory (size: 30.4 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_8_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.4 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_7_piece0 on 10.185.52.91:43829 in memory (size: 30.3 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_7_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.3 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_6_piece0 on 10.185.52.91:43829 in memory (size: 30.6 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_6_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.6 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Added taskresult_2 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 499.6 MB, free: 4.8 GB)
17/05/29 10:13:11 INFO TaskSetManager: Starting task 0.0 in stage 23.0 (TID 20, ip-10-185-53-172.eu-west-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 5779 bytes)
17/05/29 10:13:12 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 8.8 KB, free: 4.8 GB)
17/05/29 10:13:13 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 270161 ms on ip-10-185-53-172.eu-west-1.compute.internal (executor 2) (1/1)
17/05/29 10:13:13 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool
17/05/29 10:13:13 INFO DAGScheduler: ResultStage 3 (run at ThreadPoolExecutor.java:1142) finished in 270.162 s
17/05/29 10:13:13 INFO DAGScheduler: Job 3 finished: run at ThreadPoolExecutor.java:1142, took 270.230067 s
17/05/29 10:13:13 INFO BlockManagerInfo: Removed taskresult_3 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 499.5 MB, free: 5.3 GB)
17/05/29 10:13:16 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 10.185.52.91:43829 in memory (size: 5.5 KB, free: 8.2 GB)
17/05/29 10:13:17 INFO BlockManagerInfo: Removed broadcast_3_piece0 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 5.5 KB, free: 5.3 GB)
17/05/29 10:13:20 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 276982 ms on ip-10-185-53-172.eu-west-1.compute.internal (executor 2) (1/1)
17/05/29 10:13:20 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool
17/05/29 10:13:20 INFO DAGScheduler: ResultStage 2 (run at ThreadPoolExecutor.java:1142) finished in 276.984 s
17/05/29 10:13:20 INFO DAGScheduler: Job 2 finished: run at ThreadPoolExecutor.java:1142, took 277.000009 s
17/05/29 10:13:20 INFO BlockManagerInfo: Removed taskresult_2 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 499.6 MB, free: 5.8 GB)
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000667766000, 196608, 0) failed; error='Cannot allocate memory' (errno=12)
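As a rough sanity check on the numbers in the submit command (back-of-the-envelope only, assuming YARN's default executor memory overhead of max(384 MB, 10% of executor memory)):
executor container   ~ 10 GB heap + ~1 GB overhead = ~11 GB requested from YARN
driver (client mode) = 14 GB heap on the submitting node
node capacity        = 15 GB RAM, shared with the OS and any other daemons
On a 15 GB machine that leaves very little native memory headroom once an executor (or the driver) is running, which is consistent with the mmap "Cannot allocate memory" failure above.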

nodejs garbage collection output

Does anyone know where I can read about the output of the --trace-gc option in Node.js?
I am not asking for an explanation of how the GC works, as there is plenty of information about that, just about the output of --trace-gc.
I can guess the meaning of some of the fields, but I have no idea about others.
For instance:
what is the meaning of the number in parentheses,
the meaning of "steps" (it may be related to incremental marking & lazy sweeping),
is the heap size that is printed the total heap (adding young and old areas)?
...
An example:
[12994] 77042 ms: Scavenge 260.7 (298.1) -> 247.7 (298.1) MB, 9.4 ms [allocation failure].
[12994] 77188 ms: Scavenge 261.7 (298.1) -> 249.0 (300.1) MB, 7.4 ms [allocation failure].
[12994] 77391 ms: Scavenge 263.8 (301.1) -> 250.6 (302.1) MB, 8.1 ms [allocation failure].
[12994] 77511 ms: Scavenge 264.8 (302.1) -> 251.8 (304.1) MB, 7.4 ms [allocation failure].
[12994] 77839 ms: Scavenge 273.4 (304.1) -> 260.7 (305.1) MB, 8.3 ms (+ 55.7 ms in 201 steps since last GC) [allocation failure].
[12994] 78052 ms: Scavenge 274.3 (305.1) -> 261.9 (307.1) MB, 8.2 ms (+ 54.4 ms in 192 steps since last GC) [allocation failure].
[12994] 78907 ms: Scavenge 277.3 (308.1) -> 264.2 (309.1) MB, 10.1 ms (+ 51.5 ms in 196 steps since last GC) [allocation failure].
[12994] 80246 ms: Mark-sweep 272.2 (310.1) -> 82.9 (310.1) MB, 45.2 ms (+ 195.4 ms in 690 steps since start of marking, biggest step 1.2 ms) [GC interrupt] [GC in old space requested].
[12994] 80868 ms: Scavenge 99.3 (310.1) -> 85.5 (310.1) MB, 6.5 ms [allocation failure].
[12994] 81039 ms: Scavenge 100.2 (310.1) -> 86.8 (310.1) MB, 6.9 ms [allocation failure].
[12994] 81455 ms: Scavenge 102.2 (310.1) -> 88.8 (310.1) MB, 5.5 ms [allocation failure].
UPDATE
Looking at the file that creates the output (as suggested by mtth), I am adding an explanation of all the fields in case anyone is interested:
[12994] 77042 ms: Scavenge 260.7 (298.1) -> 247.7 (298.1) MB, 9.4 ms [allocation failure].
[pid] <time_since_start> :
<Phase> <heap_used_before (old+young)> (<allocated_heap_before>) ->
<heap_used_after (old+young)> (<allocated_heap_after>) MB,
<time_spent_gc> [<reason_of_gc>]
Additionally, when there has been any incremental marking between old-space GCs (full GCs), it appears in the scavenge trace, like this:
(+ <incremental_time_duration> ms in <incremental_marking_steps> steps since last GC)
When the trace corresponds to an old-space GC (full), it also shows the biggest step duration.
These traces correspond to Node.js 0.12.9, and they look similar at least in Node.js 4.2.2.
The closest thing to documentation I could find is the source of the function that generates the output. Using the comments in gc-tracer.h, we can figure out what each entry means. For example:
what is the meaning of the number in parentheses
The number inside the parens represents the total memory allocated from the OS (and the one before is the total memory used for objects in the heap).
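Putting that together, the first example line reads as: process 12994, 77042 ms after start, ran a Scavenge that shrank live objects from 260.7 MB to 247.7 MB while the total memory allocated from the OS stayed at 298.1 MB, took 9.4 ms, and was triggered because an allocation needed a GC to proceed ("allocation failure").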

Does node.js 0.2.5 leak memory?

Node appears to be leaking memory in this simple example. Can anyone else confirm?
https://gist.github.com/a8eadd54d1058bcda796
I was accidentally sending 2 new requests for each completed request. One on end and one on close.
I'm on 0.3.1, GC kicks in normally here.
Using node --trace_gc test.js, this hardly reaches 5 MB:
ivo@ivo:~/Desktop$ node --trace_gc test.js
Scavenge 0.9 -> 1.0 MB, 1 ms.
Scavenge 1.9 -> 1.8 MB, 0 ms.
Scavenge 2.6 -> 1.9 MB, 1 ms.
Mark-sweep 2.9 -> 1.8 MB, 6 ms.
Scavenge 2.8 -> 1.8 MB, 0 ms.
Scavenge 2.9 -> 1.9 MB, 0 ms.
Another run:
ivo@ivo:~/Desktop$ node --trace_gc test.js
Scavenge 0.9 -> 1.0 MB, 1 ms.
Scavenge 1.9 -> 1.8 MB, 0 ms.
Scavenge 2.6 -> 1.9 MB, 1 ms.
Mark-sweep 1.9 -> 1.8 MB, 4 ms.
Mark-sweep 1.8 -> 1.7 MB, 3 ms.
Mark-compact 1.7 -> 1.7 MB, 11 ms.
Scavenge 2.3 -> 1.8 MB, 0 ms.
Scavenge 2.3 -> 1.8 MB, 0 ms.
Scavenge 2.3 -> 1.8 MB, 0 ms.
Scavenge 2.0 -> 1.9 MB, 0 ms.
Mark-sweep 1.9 -> 1.6 MB, 3 ms.
Mark-compact 1.6 -> 1.6 MB, 10 ms.
V8 is very intelligent when it comes to GC'ing. One thing you might want to watch out for is pushing references into a global list or the like, because that will keep whatever those references point to alive.
If you really deal with large amounts of data, consider using Buffer and re-allocating on the fly; especially in 0.3.x, buffers are extremely fast.
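For instance, a small sketch of the Buffer idea (0.x-era API; modern Node would use Buffer.alloc, and the size here is arbitrary):
// allocate once, outside the hot path, and reuse it
var buf = new Buffer(64 * 1024);

function handleChunk(str) {
  // copying data into a Buffer keeps the bulk of it out of the pile of
  // small JS strings and objects that the GC has to trace on every cycle
  var written = buf.write(str, 0);
  // ... hand buf.slice(0, written) to whatever needs the bytes ...
}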
