java.lang.OutOfMemoryError: Java heap space in Spark Streaming job - apache-spark

I have a Spark Streaming job that runs in 10-minute batches. The driver machine is an m4.4xlarge EC2 instance (64 GB).
The job stalled after 18 hours and crashed with the exception below. From reading other posts it seems the driver may have run out of memory. How can I check this? My PySpark configuration is as follows.
Also, how do I check memory usage in the Spark UI? I only see the 11 task nodes I have, not the driver.
export PYSPARK_SUBMIT_ARGS='--master yarn --deploy-mode client
--driver-memory 10g
--executor-memory 10g
--executor-cores 4
--conf spark.driver.cores=5
--packages "org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2"
--conf spark.driver.maxResultSize=2g
--conf spark.shuffle.spill=true
--conf spark.yarn.driver.memoryOverhead=2048
--conf spark.yarn.executor.memoryOverhead=2048
--conf "spark.broadcast.blockSize=512M"
--conf "spark.memory.storageFraction=0.5"
--conf "spark.kryoserializer.buffer.max=1024"
--conf "spark.default.parallelism=600"
--conf "spark.sql.shuffle.partitions=600"
--driver-java-options -Dlog4j.configuration=file:///usr/lib/spark/conf/log4j.properties pyspark-shell'
[Stage 3507:> (0 + 0) / 600]Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$4.apply(TorrentBroadcast.scala:231)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$4.apply(TorrentBroadcast.scala:231)
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
at net.jpountz.lz4.LZ4BlockOutputStream.finish(LZ4BlockOutputStream.java:235)
at net.jpountz.lz4.LZ4BlockOutputStream.close(LZ4BlockOutputStream.java:175)
at java.io.ObjectOutputStream$BlockDataOutputStream.close(ObjectOutputStream.java:1828)
at java.io.ObjectOutputStream.close(ObjectOutputStream.java:742)
at org.apache.spark.serializer.JavaSerializationStream.close(JavaSerializer.scala:57)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$1.apply$mcV$sp(TorrentBroadcast.scala:238)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:237)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:107)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:86)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1387)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1012)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:933)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:936)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:935)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:935)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:873)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1630)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
[Stage 3507:> (0 + 0) / 600]18/02/23 12:59:33 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=8388437576763608177, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /172.23.56.231:58822; closing connection
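
To see whether the driver heap really is the one filling up, one option is to query the driver JVM from inside the PySpark shell. This is only a minimal sketch, assuming client mode so the driver JVM is the shell's own process; sc._jvm is PySpark's internal py4j gateway:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()                      # reuse the shell's existing context
runtime = sc._jvm.java.lang.Runtime.getRuntime()     # the driver JVM's Runtime
mb = 1024 * 1024
print("driver max heap : %d MB" % (runtime.maxMemory() // mb))
print("driver used heap: %d MB" % ((runtime.totalMemory() - runtime.freeMemory()) // mb))

The Executors tab of the Spark UI should also list a row named driver alongside the executors, with its storage memory and GC time.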

Related

How to decrease storage memory in Spark 2.3?

I run a PySpark job that does some transformations and saves the result as ORC files in HDFS. My Spark configuration is:
--driver-memory 12G
--executor-cores 2
--num-executors 8
--executor-memory 32G
${dll_app_spark_options}
--conf spark.kryoserializer.buffer.max=2047
--conf spark.driver.maxResultSize=4g
--conf spark.shuffle.memoryFraction=0.7
--conf spark.yarn.driver.memoryOverhead=4096
--conf spark.sql.shuffle.partitions=200
My job always fails because YARN kills executors for exceeding memory limits.
Storage memory for the executors and driver is shown below.
The DataFrame to save contains 1 million rows and 400 columns (each column is of type array<float>).
I want to decrease storage memory. I tried spark.shuffle.memoryFraction=0.7, but it gives the same result.
Any ideas?
To control storage memory you can use the following:
--conf spark.memory.storageFraction=0.1
or
--conf spark.memory.fraction=0.1
Please refer to spark-management-overview.
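
As a rough illustration of what those two settings control, here is a back-of-the-envelope sketch of Spark's unified memory model with the Spark 2.x defaults (300 MB reserved, spark.memory.fraction=0.6, spark.memory.storageFraction=0.5); the figures the UI reports will differ slightly:

# Guaranteed storage region = (heap - 300 MB reserved) * spark.memory.fraction * spark.memory.storageFraction.
# Storage can still borrow from the execution half while it is free.
def storage_memory_gb(heap_gb, memory_fraction=0.6, storage_fraction=0.5):
    usable = heap_gb - 300.0 / 1024                  # 300 MB reserved system memory
    return usable * memory_fraction * storage_fraction

print(round(storage_memory_gb(32), 1))                       # defaults: ~9.5 GB per 32 GB executor
print(round(storage_memory_gb(32, memory_fraction=0.1), 1))  # with spark.memory.fraction=0.1: ~1.6 GB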

Correct spark configuration to fully utilise EMR cluster resources

I'm quite new to configuring Spark, so I wanted to check whether I am fully utilising my EMR cluster.
The EMR cluster is running Spark 2.4 and Hadoop 2.8.5.
The app reads lots of small gzipped JSON files from S3, transforms the data and writes it back out to S3.
I've read various articles, but I was hoping to get my configuration double-checked in case some settings conflict with each other.
I'm using a c4.8xlarge cluster in which each of the 3 worker nodes has 36 CPU cores and 60 GB of RAM,
so that's 108 CPU cores and 180 GB of RAM overall.
Here are my spark-submit settings, which I paste into the EMR Add Step box:
--class com.example.app
--master yarn
--driver-memory 12g
--executor-memory 3g
--executor-cores 3
--num-executors 33
--conf spark.executor.memory=5g
--conf spark.executor.cores=3
--conf spark.executor.instances=33
--conf spark.driver.cores=16
--conf spark.driver.memory=12g
--conf spark.default.parallelism=200
--conf spark.sql.shuffle.partitions=500
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
--conf spark.speculation=false
--conf spark.yarn.am.memory=1g
--conf spark.executor.heartbeatInterval=360000
--conf spark.network.timeout=420000
--conf spark.hadoop.fs.hdfs.impl.disable.cache=true
--conf spark.kryoserializer.buffer.max=512m
--conf spark.shuffle.consolidateFiles=true
--conf spark.hadoop.fs.s3a.multiobjectdelete.enable=false
--conf spark.hadoop.fs.s3a.fast.upload=true
--conf spark.worker.instances=3
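
For a sanity check against the numbers above, a common rule of thumb is roughly 5 cores per executor, with one core and about 1 GB per node left for the OS and YARN daemons. A rough sizing sketch for this cluster (illustrative only, not a definitive configuration):

# 3 worker nodes, 36 vCPUs and 60 GB each (c4.8xlarge).
nodes, cores_per_node, mem_per_node_gb = 3, 36, 60

executors_per_node = (cores_per_node - 1) // 5                            # ~5 cores per executor
total_executors = nodes * executors_per_node - 1                          # keep one slot for the YARN AM
heap_per_executor_gb = (mem_per_node_gb - 1) / executors_per_node * 0.9   # leave ~10% for memoryOverhead

print(total_executors, round(heap_per_executor_gb, 1))                    # e.g. 20 executors, ~7.6 GB heap each

Independent of the sizing, the duplicated settings above are worth reconciling (--executor-memory 3g versus spark.executor.memory=5g, and likewise executor cores and instances), so the cluster is not sized from the value that ends up being ignored.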

Spark Streaming - Diagnostics: Container is running beyond physical memory limits

My Spark Streaming job failed with the exception below:
Diagnostics: Container is running beyond physical memory limits.
Current usage: 1.5 GB of 1.5 GB physical memory used; 3.6 GB of 3.1 GB
virtual memory used. Killing container.
Here is my spark-submit command:
spark2-submit \
--name App name \
--class Class name \
--master yarn \
--deploy-mode cluster \
--queue Queue name \
--num-executors 5 --executor-cores 3 --executor-memory 5G \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.locality.wait=10 \
--conf spark.task.maxFailures=8 \
--conf spark.ui.killEnabled=false \
--conf spark.logConf=true \
--conf spark.yarn.driver.memoryOverhead=512 \
--conf spark.yarn.executor.memoryOverhead=2048 \
--conf spark.yarn.max.executor.failures=40 \
jar path
I'm not sure what's causing this issue. Am I missing something in the command above, or is it failing because I didn't set --driver-memory in my spark-submit command?
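
For what it's worth, the 1.5 GB limit in the diagnostic matches what the driver's YARN container would get in cluster mode when --driver-memory is left at its default; a quick check of that arithmetic (assuming the 1 GB default):

driver_memory_mb = 1024      # spark.driver.memory default, since --driver-memory is not set
driver_overhead_mb = 512     # spark.yarn.driver.memoryOverhead from the command above
print("%.1f GB" % ((driver_memory_mb + driver_overhead_mb) / 1024.0))   # 1.5 GB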

Spark Job using more executors than allocated in jobs

I have following settings in my Spark job:
--num-executors 2
--executor-cores 1
--executor-memory 12G
--driver-memory 16G
--conf spark.streaming.dynamicAllocation.enabled=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.streaming.receiver.writeAheadLog.enable=false
--conf spark.executor.memoryOverhead=8192
--conf spark.driver.memoryOverhead=8192
My understanding is that the job should run with 2 executors; however, it is running with 3. This happens with several of my jobs. Could someone please explain why?
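
One way to see how many JVMs the application has actually registered is from a PySpark shell started with the same settings; this is only a sketch, using PySpark's internal handles (sc._jsc is the JavaSparkContext). Note that getExecutorMemoryStatus() also contains an entry for the driver itself, so the count it reports is usually num-executors + 1, and on YARN the extra container is often the driver/ApplicationMaster rather than a third executor:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
status = sc._jsc.sc().getExecutorMemoryStatus()      # Scala Map keyed by host:port, includes the driver
print("registered block managers:", status.size())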

java.lang.OutOfMemoryError: Java heap space despite having sufficient memory on Spark 2

I am running a Spark 2.1.1 job on an Azure VM (local mode) with 16 cores and 55 GB of RAM.
I initialize with this command:
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
And run the following script on the data:
import io.archivesunleashed.spark.matchbox.{ExtractDomain, ExtractLinks, RemoveHTML, RecordLoader, WriteGEXF}
import io.archivesunleashed.spark.rdd.RecordRDD._
val r = RecordLoader.loadArchives("/data2/toronto-mayor/*.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/data2/toronto-mayor-data/all-domains")
The data is relatively small (290 GB) but consists of 292 files ranging in size from 38 KB to 7 GB, with an average of around 1 GB. 100 GB of swap is available on the machine, and I've monitored htop while the job runs: there are no memory spikes above 45 GB and no swap usage. Everything seems to be working well, and then it all falls over...
It crashes with the following error:
ERROR Executor - Exception in task 13.0 in stage 0.0 (TID 13)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.StringCoding.safeTrim(StringCoding.java:89)
at java.lang.StringCoding.access$100(StringCoding.java:50)
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:154)
at java.lang.StringCoding.decode(StringCoding.java:193)
at java.lang.StringCoding.decode(StringCoding.java:254)
at java.lang.String.<init>(String.java:546)
at java.lang.String.<init>(String.java:566)
at io.archivesunleashed.data.WarcRecordUtils.getWarcResponseMimeType(WarcRecordUtils.java:102)
at io.archivesunleashed.spark.archive.io.ArchiveRecord.<init>(ArchiveRecord.scala:74)
at io.archivesunleashed.spark.matchbox.RecordLoader$$anonfun$2.apply(RecordLoader.scala:37)
at io.archivesunleashed.spark.matchbox.RecordLoader$$anonfun$2.apply(RecordLoader.scala:37)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Many of the other discussions on this site involve either cluster mode or setting --driver-memory. Any help is appreciated.
Attempts so far (updated)
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.memory.fraction=0.4 --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.memory.fraction=0.8 --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.default.parallelism=64 --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.default.parallelism=500 --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=100G --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --driver-memory 45G --executor-memory 10G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --driver-memory 45G --executor-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
The solution was ultimately to reduce the number of worker threads.
By default, Spark runs with local[*], which uses as many threads as there are cores on the machine, in this case 16.
Reducing the thread count (to local[12], as in the command below) let the jobs complete.
Syntax to run:
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --master local[12] --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
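
For intuition on why fewer threads helped: every local[*] task shares the single 45 GB driver heap, and gzipped archives are not splittable, so each multi-GB file is decompressed by one task. Some rough arithmetic (illustrative only):

# Heap headroom per concurrent task when all tasks share one 45 GB heap.
heap_gb = 45
for threads in (16, 14, 12):
    print(threads, "threads ->", round(heap_gb / threads, 1), "GB per concurrent task")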
