I have a Spark job that retrieves data from a few Redshift tables, applies some transformations such as join and groupBy, and also applies some UDFs to some columns.
I have run it in Spark standalone mode on my local machine and it works properly, but when I run it on the AWS cluster it gets stuck on the UDF step; I know it is the UDF because the job works once I remove it.
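For context, here is a minimal sketch of the shape of the job (the real job is a PySpark script and is not included, so this is only an illustration in Scala; every table name, column, connection detail and the UDF body are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, udf}

val spark = SparkSession.builder().appName("redshift-udf-sketch").getOrCreate()

// read two hypothetical Redshift tables over JDBC
def readTable(table: String) =
  spark.read.format("jdbc")
    .option("url", "jdbc:redshift://host:5439/db")   // placeholder URL
    .option("dbtable", table)
    .option("user", "user")
    .option("password", "secret")
    .load()

val orders    = readTable("orders")
val customers = readTable("customers")

// a simple column-level UDF standing in for the real ones
val normalize = udf((s: String) => if (s == null) "" else s.trim.toLowerCase)

val result = orders.join(customers, Seq("customer_id"))
  .withColumn("country", normalize(col("country")))
  .groupBy("country")
  .agg(sum("amount").as("total_amount"))

result.count()   // force execution; the real job's output step is not described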
Related to that I have found this
I need to use the UDFs, but when I do the job gets stuck for 2 or 3 hours on those tasks and then the Spark job finishes without any error; it simply stops.
Has anyone experienced something similar? Any help would be appreciated.
EDIT:
When I remove the UDFs the job works properly.
But with the UDFs it gets stuck with a few tasks remaining; here is the end of the logs:
stdout log:
2017-06-07T09:18:01.929+0000: [GC (Allocation Failure) 2017-06-07T09:18:01.929+0000: [ParNew: 66492K->2341K(72512K),
0.0024644 secs] 648682K->584531K(1042416K), 0.0025210 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 2017-06-07T09:18:01.962+0000: [GC (Allocation Failure) 2017-06-07T09:18:01.962+0000: [ParNew: 66758K->2487K(72512K), 0.0022863 secs] 648948K->584677K(1042416K),
0.0023321 secs] [Times: user=0.02 sys=0.00, real=0.00 secs] 2017-06-07T09:18:02.001+0000: [GC (Allocation Failure) 2017-06-07T09:18:02.001+0000: [ParNew: 66999K->3757K(72512K),
0.0028101 secs] 649189K->585953K(1042416K), 0.0028601 secs] [Times: user=0.02 sys=0.00, real=0.00 secs] 2017-06-07T09:18:02.030+0000: [GC (Allocation Failure) 2017-06-07T09:18:02.030+0000: [ParNew: 68269K->2462K(72512K), 0.0019834 secs] 650465K->584706K(1042416K),
0.0020289 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 2017-06-07T09:18:02.130+0000: [GC (Allocation Failure) 2017-06-07T09:18:02.130+0000: [ParNew: 66974K->6797K(72512K),
0.0038833 secs] 649218K->589043K(1042416K), 0.0039409 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 2017-06-07T09:18:02.309+0000: [GC (Allocation Failure) 2017-06-07T09:18:02.309+0000: [ParNew: 71311K->8000K(72512K), 0.0209973 secs] 653556K->595016K(1042416K),
0.0210531 secs] [Times: user=0.10 sys=0.00, real=0.02 secs] 2017-06-07T09:18:02.331+0000: [GC (GCLocker Initiated GC) 2017-06-07T09:18:02.331+0000: [ParNew: 8632K->3373K(72512K), 0.0131140 secs] 595648K->595234K(1042416K), 0.0131557 secs] [Times: user=0.08 sys=0.00, real=0.02 secs] 2017-06-07T09:22:28.879+0000: [GC (Allocation Failure) 2017-06-07T09:22:28.879+0000: [ParNew: 67885K->1862K(72512K), 0.0018928 secs] 659746K->593723K(1042416K),
0.0019463 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 2017-06-07T09:27:48.879+0000: [GC (Allocation Failure) 2017-06-07T09:27:48.879+0000: [ParNew: 66374K->1231K(72512K),
0.0014260 secs] 658235K->593093K(1042416K), 0.0014730 secs] [Times: user=0.02 sys=0.00, real=0.00 secs] 2017-06-07T09:33:08.879+0000: [GC (Allocation Failure) 2017-06-07T09:33:08.879+0000: [ParNew: 65743K->1075K(72512K), 0.0016924 secs] 657605K->592937K(1042416K),
0.0017409 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
stderr log:
17/06/07 09:18:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/06/07 09:18:02 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1200 blocks
17/06/07 09:18:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/06/07 09:18:02 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/06/07 09:18:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/06/07 09:18:02 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 1200 blocks
17/06/07 09:18:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/06/07 09:18:02 INFO CodeGenerator: Code generated in 29.410073 ms
17/06/07 09:18:02 INFO CodeGenerator: Code generated in 8.06304 ms
17/06/07 09:18:02 INFO CodeGenerator: Code generated in 12.481201 ms
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_10 stored as values in memory (estimated size 928.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_9 stored as values in memory (estimated size 928.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_3 stored as values in memory (estimated size 904.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_15 stored as values in memory (estimated size 904.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_13 stored as values in memory (estimated size 888.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_5 stored as values in memory (estimated size 904.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_12 stored as values in memory (estimated size 904.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO MemoryStore: Block rdd_368_8 stored as values in memory (estimated size 928.0 B, free 5.4 GB)
17/06/07 09:18:02 INFO CodeGenerator: Code generated in 17.574289 ms
17/06/07 09:18:02 INFO CodeGenerator: Code generated in 8.639658 ms
Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']
Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']
Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']
Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']
Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']
Could not find valid SPARK_HOME while searching ['/mnt/yarn/usercache/hadoop/appcache/application_1496824216933_0005', '/mnt/yarn/usercache/hadoop/filecache/157/pyspark.zip/pyspark']
Why does the message Could not find valid SPARK_HOME while searching ... appear only when the UDFs are used?
Related
I am running a Spark job in cluster mode. My driver downloads files and adds them to all executors with SparkSession.sparkContext.addFile("file:///" + file.toString) (file here is a java.io.File object). I then call sc.textFile("file:///" + SparkFiles.get(fileName)), where fileName is file.getName, i.e. the file name of the java.io.File object. I get a file-not-found exception. The files are smaller than 500 KB. I tried reading the YARN logs and found this:
10:35:17 INFO executor.Executor: Fetching spark://foo.bar.ca:45133/files/QQ4hyC.csv with timestamp 1562250908486
19/07/04 10:35:17 INFO client.TransportClientFactory: Successfully created connection to foo.bar.ca/102.63.12.200:45000 after 1 ms (0 ms spent in bootstraps)
19/07/04 10:35:17 INFO util.Utils: Fetching spark://foo.bar.ca:45133/files/QQ4hyC.csv to /data15/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/spark-d0ee1f52-0b3d-4f28-8c40-31283fbc6c00/fetchFileTemp6119089786950130363.tmp
19/07/04 10:35:17 INFO util.Utils: Copying /data15/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/spark-d0ee1f52-0b3d-4f28-8c40-31283fbc6c00/-20762980841562250908486_cache to /data19/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/container_e94_1559671533076_152651_01_000002/./QQ4hyC.csv
19/07/04 10:35:17 INFO executor.Executor: Fetching spark://foo.bar.ca:45133/files/20mo2V.csv with timestamp 1562250908498
19/07/04 10:35:17 INFO util.Utils: Fetching spark://foo.bar.ca:45133/files/20mo2V.csv to /data15/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/spark-d0ee1f52-0b3d-4f28-8c40-31283fbc6c00/fetchFileTemp4236523310531688097.tmp
19/07/04 10:35:17 INFO util.Utils: Copying /data15/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/spark-d0ee1f52-0b3d-4f28-8c40-31283fbc6c00/-3045146541562250908498_cache to /data19/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/container_e94_1559671533076_152651_01_000002/./20mo2V.csv
19/07/04 10:35:17 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
19/07/04 10:35:17 INFO client.TransportClientFactory: Successfully created connection to foo.bar.ca/102.63.12.200:46650 after 5 ms (0 ms spent in bootstraps)
19/07/04 10:35:17 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.2 KB, free 912.3 MB)
19/07/04 10:35:17 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 148 ms
19/07/04 10:35:18 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.6 KB, free 912.3 MB)
19/07/04 10:35:18 INFO rdd.HadoopRDD: Input split: file:/data20/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/spark-28989d25-5d6e-4e49-8513-699d01ac0976/userFiles-dd2b37d9-0029-486e-9c55-84703887b1ca/QQ4hyC.csv:0+48343
19/07/04 10:35:18 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
19/07/04 10:35:18 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 31.5 KB, free 912.3 MB)
19/07/04 10:35:18 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 27 ms
19/07/04 10:35:18 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 512.6 KB, free 911.8 MB)
19/07/04 10:35:19 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.FileNotFoundException: File file:/data20/yarn/nm/usercache/networkhadoop/appcache/application_1559671533076_152651/spark-28989d25-5d6e-4e49-8513-699d01ac0976/userFiles-dd2b37d9-0029-486e-9c55-84703887b1ca/QQ4hyC.csv does not exist
If you look at the INFO util.Utils: Copying lines you can see that the file is indeed being copied from /data15/... to /data19/..., yet the FileNotFoundException is thrown for a userFiles path under /data20/...
From the official docs it seems SparkContext.addFile() adds files to the workers and SparkFiles.get() retrieves them on the worker nodes to which they were copied. Is this a bug?
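For illustration, here is a minimal sketch of the addFile / SparkFiles.get pattern described above (the path, file name and session setup are hypothetical; only the last two calls mirror the question):

import java.io.File
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("addfile-sketch").getOrCreate()
val sc = spark.sparkContext

val file = new File("/tmp/example.csv")            // file downloaded by the driver
sc.addFile("file:///" + file.toString)             // ship it to the executors

// SparkFiles.get resolves to the local copy on whichever JVM evaluates it.
// Since the textFile path string is built on the driver, the executors are asked to
// read that driver-local path, which is one way a mismatch like the one above can arise.
val rdd = sc.textFile("file:///" + SparkFiles.get(file.getName))
println(rdd.count())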
Spark job I am running:
It is a pretty simple program that was converted from Java to Scala and 'parallelized' (it was not originally intended to be run in parallel, but is an experiment to a) learn Spark and Neo4j and b) see if I can get some speed gains just by running on a Spark cluster with more nodes doing more work). The reason is that the big bottleneck is a spatial call within the Neo4j Cypher script (a withinDistance call). The test data set is pretty small: 52,000 nodes and a database of about 140 MB.
Also, when Neo4j starts up it gives me this warning:
Starting Neo4j.
WARNING: Max 4096 open files allowed, minimum of 40000 recommended. See the Neo4j manual.
/usr/share/neo4j/bin/neo4j: line 411: /var/run/neo4j/neo4j.pid: No such file or directory
Which is strange, since I believe that refers to open files and I asked the system admin to set it much higher (ulimit -Hn seems to confirm this: it says 90,000, though ulimit -a shows open files at 4096, the soft limit, which I guess is what Neo4j sees and complains about).
Also, when I ran this locally on my Mac OS X machine, the software would run for about 14 hours or so (maybe 9) and then I would see in the console that the database would just stop talking to Spark. It wasn't down or anything; the jobs would time out, yet I could still cypher-shell into the database. It would somehow lose the connection to the Spark jobs, so they would keep retrying and eventually spark-submit would just give up and stop.
C02RH2U9G8WM:scala-2.11 little.mac$ ulimit -Hn
unlimited
(Also, since the last edit I have raised my limits further in the Neo4j conf, now with 4 GB max memory for the heap sizes.)
Some code bits from the job (using the code ported to Scala, with Spark dataframes added; I know it is not properly parallelized, but I was hoping to get something working before pressing forward). I was building a hybrid program that mirrors the Java code I ported, but uses dataframes from Spark (connected to Neo4j).
Essentially (pseudo code):
// iterate over all of these lat/lon pairs (pseudo-code)
while (goingThroughAllTheseLatsAndLons) {
  doCalculation()
}

def doCalculation(): Unit = {
  val noBbox = "call spatial.bbox('geom', {lat:" + minLat + ",lon:" + minLon + "}, {lat:" + maxLat + ",lon:" + maxLon + "}) yield node return node.altitude as altitude, node.gtype as gtype, node.toDateFormatLong as toDateFormatLong, node.latitude as latitude, node.longitude as longitude, node.fromDateFormatLong as fromDateFormatLong, node.fromDate as fromDate, node.toDate as toDate ORDER BY node.toDateFormatLong DESC"
  try {
    // not overly sure what the partitions and batch settings are really doing for me
    val initialDf2 = neo.cypher(noBbox).partitions(5).batch(10000).loadDataFrame
    val theRow = initialDf2.collect() // was someStr
    for (i <- 0 until theRow.length) {
      // do more calculations
      val radius2 = 100
      // this call is where the biggest bottleneck is: the spatial withinDistance is where I thought
      // I could put this code on Spark, make the calls through dataframes, do the same long work,
      // and get more speed gains by batching it out to many nodes
      val pointQuery = "call spatial.withinDistance('geom', {lat:" + lat + ",lon:" + lon + "}, " + radius2 + ") yield node, distance WITH node, distance match (node:POINT) WHERE node.toDateFormatLong < " + toDateFormatLong + " return node.fromDateFormatLong as fromDateFormatLong, node.toDateFormatLong as toDateFormatLong"
      try {
        val pointResults = neo.cypher(pointQuery).loadDataFrame // did I need to batch here?
        val prRow = pointResults.collect()
        // do stuff with prRow
      } catch {
        case e: Exception => e.printStackTrace()
      }
      // do a lot more work with the data in plain Scala/Java data structures
    }
  } catch {
    case e: Exception => println("EMPTY COLLECTION")
  }
}
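For completeness, the neo handle in the code above comes from the neo4j-spark-connector; a minimal, hypothetical setup (connection details are placeholders, and the spark.neo4j.bolt.* option names are as documented for that connector) would look like:

import org.apache.spark.{SparkConf, SparkContext}
import org.neo4j.spark._

val conf = new SparkConf()
  .setAppName("neo4j-spatial-sketch")
  .set("spark.neo4j.bolt.url", "bolt://localhost:7687")   // placeholder connection details
  .set("spark.neo4j.bolt.user", "neo4j")
  .set("spark.neo4j.bolt.password", "secret")
val sc = new SparkContext(conf)

val neo = Neo4j(sc)   // the object that the cypher(...).loadDataFrame calls are made on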
Running a spark-submit job that uses the Spark connector to connect to Neo4j, I get these errors in /var/log/neo4j/neo4j.log:
java.lang.OutOfMemoryError: Java heap space
2017-12-27 03:17:13.969+0000 ERROR Worker for session '13662816-0a86-4c95-8b7f-cea9d92440c8' crashed. Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1855)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2068)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
at org.neo4j.bolt.v1.runtime.concurrent.RunnableBoltWorker.run(RunnableBoltWorker.java:88)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:109)
2017-12-27 03:17:23.244+0000 ERROR Worker for session '75983e7c-097a-4770-bcab-d63f78300dc5' crashed. Java heap space
java.lang.OutOfMemoryError: Java heap space
I know that in the neo4j.conf file I can change the heap sizes (currently commented out but set to 512m); the thing I am asking about is what it says in the conf file:
# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size.
So doesn't this mean I should leave the heap sizes alone in the conf, if they are calculated to surely be more than what I could set? (These machines have 8 cores and 8 GB of RAM.) Or would explicitly setting them really help, maybe to 2000 (if it's in megabytes) to get two gigs? I ask because I feel the error log is reporting an out-of-memory error that really happens for a different reason.
EDIT: my JVM values from the debug.log
BEFORE:
2017-12-26 16:24:06.768+0000 INFO [o.n.k.i.DiagnosticsManager] NETWORK
2017-12-26 16:24:06.768+0000 INFO [o.n.k.i.DiagnosticsManager] System memory information:
2017-12-26 16:24:06.771+0000 INFO [o.n.k.i.DiagnosticsManager] Total Physical memory: 7.79 GB
2017-12-26 16:24:06.772+0000 INFO [o.n.k.i.DiagnosticsManager] Free Physical memory: 5.49 GB
2017-12-26 16:24:06.772+0000 INFO [o.n.k.i.DiagnosticsManager] Committed virtual memory: 5.62 GB
2017-12-26 16:24:06.773+0000 INFO [o.n.k.i.DiagnosticsManager] Total swap space: 16.50 GB
2017-12-26 16:24:06.773+0000 INFO [o.n.k.i.DiagnosticsManager] Free swap space: 16.49 GB
2017-12-26 16:24:06.773+0000 INFO [o.n.k.i.DiagnosticsManager] JVM memory information:
2017-12-26 16:24:06.773+0000 INFO [o.n.k.i.DiagnosticsManager] Free memory: 85.66 MB
2017-12-26 16:24:06.773+0000 INFO [o.n.k.i.DiagnosticsManager] Total memory: 126.00 MB
2017-12-26 16:24:06.774+0000 INFO [o.n.k.i.DiagnosticsManager] Max memory: 1.95 GB
2017-12-26 16:24:06.776+0000 INFO [o.n.k.i.DiagnosticsManager] Garbage Collector: G1 Young Generation: [G1 Eden Space, G1 Survivor Space]
2017-12-26 16:24:06.776+0000 INFO [o.n.k.i.DiagnosticsManager] Garbage Collector: G1 Old Generation: [G1 Eden Space, G1 Survivor Space, G1 Old Gen]
2017-12-26 16:24:06.777+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: Code Cache (Non-heap memory): committed=4.94 MB, used=4.93 MB, max=240.00 MB, threshold=0.00 B
2017-12-26 16:24:06.777+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: Metaspace (Non-heap memory): committed=14.38 MB, used=13.41 MB, max=-1.00 B, threshold=0.00 B
2017-12-26 16:24:06.777+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: Compressed Class Space (Non-heap memory): committed=1.88 MB, used=1.64 MB, max=1.00 GB, threshold=0.00 B
2017-12-26 16:24:06.778+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: G1 Eden Space (Heap memory): committed=39.00 MB, used=35.00 MB, max=-1.00 B, threshold=?
2017-12-26 16:24:06.778+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: G1 Survivor Space (Heap memory): committed=3.00 MB, used=3.00 MB, max=-1.00 B, threshold=?
2017-12-26 16:24:06.778+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: G1 Old Gen (Heap memory): committed=84.00 MB, used=1.34 MB, max=1.95 GB, threshold=0.00 B
2017-12-26 16:24:06.778+0000 INFO [o.n.k.i.DiagnosticsManager] Operating system information:
2017-12-26 16:24:06.779+0000 INFO [o.n.k.i.DiagnosticsManager] Operating System: Linux; version: 3.10.0-693.5.2.el7.x86_64; arch: amd64; cpus: 8
2017-12-26 16:24:06.779+0000 INFO [o.n.k.i.DiagnosticsManager] Max number of file descriptors: 90000
2017-12-26 16:24:06.780+0000 INFO [o.n.k.i.DiagnosticsManager] Number of open file descriptors: 103
2017-12-26 16:24:06.782+0000 INFO [o.n.k.i.DiagnosticsManager] Process id: 26252#hp380-1
2017-12-26 16:24:06.782+0000 INFO [o.n.k.i.DiagnosticsManager] Byte order: LITTLE_ENDIAN
2017-12-26 16:24:06.793+0000 INFO [o.n.k.i.DiagnosticsManager] Local timezone: Etc/GMT
2017-12-26 16:24:06.793+0000 INFO [o.n.k.i.DiagnosticsManager] JVM information:
2017-12-26 16:24:06.794+0000 INFO [o.n.k.i.DiagnosticsManager] VM Name: OpenJDK 64-Bit Server VM
2017-12-26 16:24:06.794+0000 INFO [o.n.k.i.DiagnosticsManager] VM Vendor: Oracle Corporation
2017-12-26 16:24:06.794+0000 INFO [o.n.k.i.DiagnosticsManager] VM Version: 25.151-b12
2017-12-26 16:24:06.794+0000 INFO [o.n.k.i.DiagnosticsManager] JIT compiler: HotSpot 64-Bit Tiered Compilers
2017-12-26 16:24:06.795+0000 INFO [o.n.k.i.DiagnosticsManager] VM Arguments: [-XX:+UseG1GC, -XX:-OmitStackTraceInFastThrow, -XX:+AlwaysPreTouch, -XX:+UnlockExperimentalVMOptions, -XX:+TrustFinalNonStaticFields, -XX:+DisableExplicitGC, -Djdk.tls.ephemeralDHKeySize=2048, -Dunsupported.dbms.udc.source=rpm, -Dfile.encoding=UTF-8]
2017-12-26 16:24:06.795+0000 INFO [o.n.k.i.DiagnosticsManager] Java classpath:
AFTER:
2017-12-27 16:17:30.740+0000 INFO [o.n.k.i.DiagnosticsManager] System memory information:
2017-12-27 16:17:30.749+0000 INFO [o.n.k.i.DiagnosticsManager] Total Physical memory: 7.79 GB
2017-12-27 16:17:30.750+0000 INFO [o.n.k.i.DiagnosticsManager] Free Physical memory: 4.23 GB
2017-12-27 16:17:30.750+0000 INFO [o.n.k.i.DiagnosticsManager] Committed virtual memory: 5.62 GB
2017-12-27 16:17:30.751+0000 INFO [o.n.k.i.DiagnosticsManager] Total swap space: 16.50 GB
2017-12-27 16:17:30.751+0000 INFO [o.n.k.i.DiagnosticsManager] Free swap space: 16.19 GB
2017-12-27 16:17:30.751+0000 INFO [o.n.k.i.DiagnosticsManager] JVM memory information:
2017-12-27 16:17:30.751+0000 INFO [o.n.k.i.DiagnosticsManager] Free memory: 1.89 GB
2017-12-27 16:17:30.751+0000 INFO [o.n.k.i.DiagnosticsManager] Total memory: 1.95 GB
2017-12-27 16:17:30.752+0000 INFO [o.n.k.i.DiagnosticsManager] Max memory: 1.95 GB
2017-12-27 16:17:30.777+0000 INFO [o.n.k.i.DiagnosticsManager] Garbage Collector: G1 Young Generation: [G1 Eden Space, G1 Survivor Space]
2017-12-27 16:17:30.777+0000 INFO [o.n.k.i.DiagnosticsManager] Garbage Collector: G1 Old Generation: [G1 Eden Space, G1 Survivor Space, G1 Old Gen]
2017-12-27 16:17:30.778+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: Code Cache (Non-heap memory): committed=4.94 MB, used=4.89 MB, max=240.00 MB, threshold=0.00 B
2017-12-27 16:17:30.778+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: Metaspace (Non-heap memory): committed=14.38 MB, used=13.42 MB, max=-1.00 B, threshold=0.00 B
2017-12-27 16:17:30.778+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: Compressed Class Space (Non-heap memory): committed=1.88 MB, used=1.64 MB, max=1.00 GB, threshold=0.00 B
2017-12-27 16:17:30.779+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: G1 Eden Space (Heap memory): committed=105.00 MB, used=59.00 MB, max=-1.00 B, threshold=?
2017-12-27 16:17:30.779+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: G1 Survivor Space (Heap memory): committed=0.00 B, used=0.00 B, max=-1.00 B, threshold=?
2017-12-27 16:17:30.779+0000 INFO [o.n.k.i.DiagnosticsManager] Memory Pool: G1 Old Gen (Heap memory): committed=1.85 GB, used=0.00 B, max=1.95 GB, threshold=0.00 B
2017-12-27 16:17:30.779+0000 INFO [o.n.k.i.DiagnosticsManager] Operating system information:
2017-12-27 16:17:30.780+0000 INFO [o.n.k.i.DiagnosticsManager] Operating System: Linux; version: 3.10.0-693.5.2.el7.x86_64; arch: amd64; cpus: 8
2017-12-27 16:17:30.780+0000 INFO [o.n.k.i.DiagnosticsManager] Max number of file descriptors: 90000
2017-12-27 16:17:30.781+0000 INFO [o.n.k.i.DiagnosticsManager] Number of open file descriptors: 103
2017-12-27 16:17:30.785+0000 INFO [o.n.k.i.DiagnosticsManager] Process id: 20774#hp380-1
2017-12-27 16:17:30.785+0000 INFO [o.n.k.i.DiagnosticsManager] Byte order: LITTLE_ENDIAN
2017-12-27 16:17:30.814+0000 INFO [o.n.k.i.DiagnosticsManager] Local timezone: Etc/GMT
2017-12-27 16:17:30.815+0000 INFO [o.n.k.i.DiagnosticsManager] JVM information:
2017-12-27 16:17:30.815+0000 INFO [o.n.k.i.DiagnosticsManager] VM Name: OpenJDK 64-Bit Server VM
2017-12-27 16:17:30.815+0000 INFO [o.n.k.i.DiagnosticsManager] VM Vendor: Oracle Corporation
2017-12-27 16:17:30.815+0000 INFO [o.n.k.i.DiagnosticsManager] VM Version: 25.151-b12
2017-12-27 16:17:30.815+0000 INFO [o.n.k.i.DiagnosticsManager] JIT compiler: HotSpot 64-Bit Tiered Compilers
2017-12-27 16:17:30.816+0000 INFO [o.n.k.i.DiagnosticsManager] VM Arguments: [-Xms2000m, -Xmx2000m, -XX:+UseG1GC, -XX:-OmitStackTraceInFastThrow, -XX:+AlwaysPreTouch, -XX:+UnlockExperimentalVMOptions, -XX:+TrustFinalNonStaticFields, -XX:+DisableExplicitGC, -Djdk.tls.ephemeralDHKeySize=2048, -Dunsupported.dbms.udc.source=rpm, -Dfile.encoding=UTF-8]
2017-12-27 16:17:30.816+0000 INFO [o.n.k.i.DiagnosticsManager] Java classpath:
Just an FYI: I still seem to get Java heap errors. These machines (not for production, just dev) have only 8 GB each.
We usually recommend setting these yourself. You can check your debug.log file for the startup logs, which report the values it chose to use as defaults. You're looking for an excerpt like this:
JVM memory information:
Free memory: 204.79 MB
Total memory: 256.00 MB
Max memory: 4.00 GB
I believe the Total memory is the initial heap size and Max memory is the max heap size.
When setting this yourself, we usually recommend keeping the initial and max set to the same value. Here's a knowledge base article on estimating initial memory configuration that may be helpful.
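For example (sizes purely illustrative, property names as in Neo4j 3.x), that would mean uncommenting and setting both values in neo4j.conf:

# set initial and max heap to the same value, e.g. a 2 GB heap
dbms.memory.heap.initial_size=2000m
dbms.memory.heap.max_size=2000m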
If the defaults seem sufficient, then it may be better to look for other areas to optimize, or see if the issue is known on the apache-spark side of things.
I have a Spark job written in Python in which I retrieve data from Redshift and then apply many transformations: join, filter, withColumn, agg, ...
There are around 30K records in the dataframes.
I perform all the transformations, and when I try to write an AVRO file the Spark job fails.
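For reference, the write step looks roughly like this (the real job is PySpark and not shown, so this is only a hypothetical sketch in Scala; with the com.databricks:spark-avro_2.11:3.2.0 package the data source name is com.databricks.spark.avro):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-write-sketch").getOrCreate()
val result = spark.range(10).toDF("id")        // stand-in for the transformed dataframe
result.write
  .format("com.databricks.spark.avro")
  .mode("overwrite")
  .save("s3://bucket/path/avro-output/")       // placeholder output path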
My spark submit:
. /usr/bin/spark-submit --packages="com.databricks:spark-avro_2.11:3.2.0" --jars RedshiftJDBC42-1.2.1.1001.jar --deploy-mode client --master yarn --num-executors 10 --executor-cores 3 --executor-memory 10G --driver-memory 14g --conf spark.sql.broadcastTimeout=3600 --conf spark.network.timeout=10000000 --py-files dependencies.zip iface_extractions.py 2016-10-01 > output.log
I'm using --executor-memory 10G and --driver-memory 14g, on 6 machines in Amazon with 8 cores and 15 GB of RAM each. Why am I getting an out-of-memory error?
Error returned:
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 196608 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/hadoop/hs_err_pid13688.log
This is the end of spark log:
17/05/29 10:13:09 INFO TaskSetManager: Starting task 0.0 in stage 21.0 (TID 19, ip-10-185-53-172.eu-west-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 5779 bytes)
17/05/29 10:13:09 INFO TransportClientFactory: Successfully created connection to ip-10-185-53-172.eu-west-1.compute.internal/10.185.53.172:39759 after 3 ms (0 ms spent in bootstraps)
17/05/29 10:13:09 INFO BlockManagerInfo: Added broadcast_24_piece0 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 8.9 KB, free: 5.3 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_8_piece0 on 10.185.52.91:43829 in memory (size: 30.4 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_8_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.4 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_7_piece0 on 10.185.52.91:43829 in memory (size: 30.3 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_7_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.3 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_6_piece0 on 10.185.52.91:43829 in memory (size: 30.6 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_6_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.6 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Added taskresult_2 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 499.6 MB, free: 4.8 GB)
17/05/29 10:13:11 INFO TaskSetManager: Starting task 0.0 in stage 23.0 (TID 20, ip-10-185-53-172.eu-west-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 5779 bytes)
17/05/29 10:13:12 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 8.8 KB, free: 4.8 GB)
17/05/29 10:13:13 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 270161 ms on ip-10-185-53-172.eu-west-1.compute.internal (executor 2) (1/1)
17/05/29 10:13:13 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool
17/05/29 10:13:13 INFO DAGScheduler: ResultStage 3 (run at ThreadPoolExecutor.java:1142) finished in 270.162 s
17/05/29 10:13:13 INFO DAGScheduler: Job 3 finished: run at ThreadPoolExecutor.java:1142, took 270.230067 s
17/05/29 10:13:13 INFO BlockManagerInfo: Removed taskresult_3 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 499.5 MB, free: 5.3 GB)
17/05/29 10:13:16 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 10.185.52.91:43829 in memory (size: 5.5 KB, free: 8.2 GB)
17/05/29 10:13:17 INFO BlockManagerInfo: Removed broadcast_3_piece0 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 5.5 KB, free: 5.3 GB)
17/05/29 10:13:20 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 276982 ms on ip-10-185-53-172.eu-west-1.compute.internal (executor 2) (1/1)
17/05/29 10:13:20 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool
17/05/29 10:13:20 INFO DAGScheduler: ResultStage 2 (run at ThreadPoolExecutor.java:1142) finished in 276.984 s
17/05/29 10:13:20 INFO DAGScheduler: Job 2 finished: run at ThreadPoolExecutor.java:1142, took 277.000009 s
17/05/29 10:13:20 INFO BlockManagerInfo: Removed taskresult_2 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 499.6 MB, free: 5.8 GB)
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000667766000, 196608, 0) failed; error='Cannot allocate memory' (errno=12)
Spark version: 1.5.2, with YARN 2.7.1.2.3.0.0-2557
I'm running into a problem while exploring data through spark-shell: I'm trying to create a really fat dataframe with 3000 columns. Code as below:
val valueFunctionUDF = udf((valMap: Map[String, String], dataItemId: String) =>
valMap.get(dataItemId) match {
case Some(v) => v.toDouble
case None => Double.NaN
})
s1 is the main dataframe, and its schema is as below:
|-- combKey: string (nullable = true)
|-- valMaps: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
after I run the code:
dataItemIdVals.foreach { w =>
  s1 = s1.withColumn(w, valueFunctionUDF($"valMaps", $"combKey"))
}
my terminal just gets stuck after the code above, with the following info being printed out:
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 172.22.49.20:41494 in memory (size: 7.6 KB, free: 5.2 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:43026 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:44890 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:52020 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:33272 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:48481 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:44026 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:34539 in memory (size: 7.6 KB, free: 5.0 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:43734 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:42769 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:60603 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:59102 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:47578 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:43149 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:52488 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:52298 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 9
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 172.22.49.20:41494 in memory (size: 7.3 KB, free: 5.2 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:33272 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:59102 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:44026 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:42769 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:43149 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:43026 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:52298 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:42890 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:47578 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:60603 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:43734 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:48481 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:52020 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:52488 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:34539 in memory (size: 7.3 KB, free: 5.0 GB)
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 8
16/07/11 12:20:54 INFO ContextCleaner: Cleaned shuffle 0
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 7
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 6
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 5
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 4
Nothing is going on in the Spark UI, and I guess Spark is computing some metadata for the new dataframe (number of columns, etc.)? Has anyone seen this kind of issue before? Any way to get around it?
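For comparison, a variant that is sometimes tried for very wide dataframes is to build every column expression first and pass them to a single select, instead of thousands of chained withColumn calls, each of which re-analyzes a growing plan; here is a sketch reusing the names from the snippets above (not verified against this particular job):

import org.apache.spark.sql.functions.col

// same UDF call as in the loop above, just expressed as one wide projection
val newCols = dataItemIdVals.map(w => valueFunctionUDF($"valMaps", $"combKey").as(w))
val s2 = s1.select(s1.columns.map(col) ++ newCols: _*)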
I am trying to profile Node.js V8 memory with a do-nothing server.
I used node-memwatch to get a heap diff: I collect heap info before connecting and after the connections are torn down. I tried 200 concurrent connections from the client side.
Here is the GC trace after the connections were torn down.
Can anyone help me understand:
1. Why is memory increasing? After the connections are torn down the server is doing absolutely nothing; shouldn't it keep dropping as garbage gets collected?
2. What are those allocation failures? How do I really interpret the trace here?
15802 ms: Mark-sweep 8.9 (45.0) -> 8.1 (45.0) MB, 58 ms [allocation failure] [GC in old space forced by flags].
16144 ms: Mark-sweep 9.2 (45.0) -> 8.4 (45.0) MB, 53 ms [allocation failure] [GC in old space forced by flags].
16495 ms: Mark-sweep 9.5 (45.0) -> 8.7 (46.0) MB, 60 ms [allocation failure] [GC in old space forced by flags].
16837 ms: Mark-sweep 9.8 (46.0) -> 9.0 (46.0) MB, 56 ms [allocation failure] [GC in old space forced by flags].
17197 ms: Mark-sweep 10.1 (46.0) -> 9.4 (46.0) MB, 62 ms [allocation failure] [GC in old space forced by flags].
17905 ms: Mark-sweep 11.5 (46.0) -> 10.0 (47.0) MB, 74 ms [Runtime::PerformGC] [GC in old space forced by flags].
18596 ms: Mark-sweep 12.2 (47.0) -> 10.7 (47.0) MB, 75 ms [Runtime::PerformGC] [GC in old space forced by flags].
19315 ms: Mark-sweep 12.8 (47.0) -> 11.3 (48.0) MB, 83 ms [allocation failure] [GC in old space forced by flags].
20035 ms: Mark-sweep 13.4 (48.0) -> 12.0 (49.0) MB, 90 ms [Runtime::PerformGC] [GC in old space forced by flags].
21487 ms: Mark-sweep 16.0 (49.0) -> 13.2 (50.0) MB, 96 ms [Runtime::PerformGC] [GC in old space forced by flags].
22950 ms: Mark-sweep 17.3 (50.0) -> 14.5 (52.0) MB, 116 ms [Runtime::PerformGC] [GC in old space forced by flags].
24376 ms: Mark-sweep 18.8 (52.0) -> 15.9 (53.0) MB, 114 ms [allocation failure] [GC in old space forced by flags].
25849 ms: Mark-sweep 19.9 (53.0) -> 17.2 (54.0) MB, 129 ms [Runtime::PerformGC] [GC in old space forced by flags].
28773 ms: Mark-sweep 25.2 (54.0) -> 19.7 (57.0) MB, 149 ms [allocation failure] [GC in old space forced by flags].
31725 ms: Mark-sweep 27.7 (57.0) -> 22.2 (59.0) MB, 172 ms [Runtime::PerformGC] [GC in old space forced by flags].
34678 ms: Mark-sweep 30.2 (59.0) -> 24.7 (61.0) MB, 190 ms [Runtime::PerformGC] [GC in old space forced by flags].
44045 ms: Mark-sweep 28.4 (61.0) -> 25.8 (63.0) MB, 180 ms [idle notification] [GC in old space forced by flags].
44216 ms: Mark-sweep 25.8 (63.0) -> 25.8 (63.0) MB, 170 ms [idle notification] [GC in old space requested].
57471 ms: Mark-sweep 26.9 (63.0) -> 25.8 (62.0) MB, 167 ms [Runtime::PerformGC] [GC in old space forced by flags].
57651 ms: Mark-sweep 26.8 (62.0) -> 25.5 (62.0) MB, 160 ms [Runtime::PerformGC] [GC in old space forced by flags].
57828 ms: Mark-sweep 26.5 (62.0) -> 25.5 (62.0) MB, 159 ms [Runtime::PerformGC] [GC in old space forced by flags].
Thanks,
"allocation failure" sounds very dramatic, but there is no real failure involved. It just means that we allocated so much memory that it is time to do a GC to see if we can collect some memory.
It looks like you are running with the --gc-global flag ("GC forced by flags"). That's a bad idea for production, though it may be fine for narrowing down a problem when debugging.
I can't tell why your process is leaking. You may find the heap profiler useful. See https://github.com/felixge/node-memory-leak-tutorial
According to the code:
PrintF("%s %.1f (%.1f) -> %.1f (%.1f) MB, ",
       CollectorString(),
       static_cast<double>(start_object_size_) / MB,
       static_cast<double>(start_memory_size_) / MB,
       SizeOfHeapObjects(),
       end_memory_size_mb);
Each trace line is one GC. When the GC starts,
start_object_size_ = heap_->SizeOfObjects();
In gc summary:
PrintF("total_size_before=%" V8_PTR_PREFIX "d ", start_object_size_);
PrintF("total_size_after=%" V8_PTR_PREFIX "d ", heap_->SizeOfObjects());
As for why start_object_size_ increases while my app is idle, I am guessing that during GC some objects get promoted to old space, which increases the object size in old space.