Spark YARN: Cannot allocate a page with more than 17179869176 bytes - apache-spark

I am joining 11 million records. I am running Spark 2.2.1 on an EMR cluster with 5 workers.
I am getting the following error while running the job:
executor 3): java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes
at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:277)
at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary(ShuffleExternalSorter.java:328)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:379)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I am not able to understand the possible reason for this. Please help me understand which parameter I should set.
Currently I am running with the following arguments: --num-executors 5 --conf spark.eventLog.enabled=true --executor-memory 70g --driver-memory 30g --executor-cores 16 --conf spark.shuffle.memoryFraction=0.5
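This limit (17179869176 bytes is Spark's maximum page size, roughly 16 GB) is usually hit when a single shuffle task has to buffer far too many records, for example because of skewed join keys or too few shuffle partitions, so the pointer array the shuffle sorter tries to grow no longer fits in one page. A minimal sketch of the kind of tuning usually tried first, assuming a Spark SQL join; the partition count and jar name are placeholders, not tested values:
# spark.sql.shuffle.partitions=2000 is an assumed starting point, not a tested value;
# your-job.jar stands in for the actual application jar.
spark-submit \
  --num-executors 5 \
  --executor-memory 70g \
  --executor-cores 16 \
  --driver-memory 30g \
  --conf spark.sql.shuffle.partitions=2000 \
  your-job.jar
If one join key dominates the data, raising the partition count alone will not help and the skewed key has to be handled separately (for example by salting it). Also note that spark.shuffle.memoryFraction is a legacy setting that Spark 2.x ignores unless spark.memory.useLegacyMode is enabled; spark.memory.fraction is its current counterpart.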

Related

Pyspark - java.lang.OutOfMemoryError: Java heap space while writing to csv file

When trying to write to a CSV file using the code below:
DF.coalesce(1).write \
    .option("header", "false") \
    .option("sep", ",") \
    .option("escape", '"') \
    .option("ignoreTrailingWhiteSpace", "false") \
    .option("ignoreLeadingWhiteSpace", "false") \
    .mode("overwrite") \
    .csv(filename)
I am getting the error below:
ileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.lang.OutOfMemoryError: Java heap space
Could someone advise a workaround?
Try increasing the executor memory in your spark-submit command.
Something like this:
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
For me, adding the Spark config below fixed the issue:
spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "15g") \
    .appName('sl-app') \
    .getOrCreate()
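Note that this only takes effect because, when the script is started with plain python, the driver JVM is launched at getOrCreate(); in local mode the driver and executors share that single JVM, so spark.driver.memory is the setting that sizes the heap. When launching through spark-submit, the equivalent would be roughly the following (the script name is a placeholder):
# your_script.py stands in for the actual PySpark application
spark-submit --driver-memory 15g your_script.py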

java.lang.OutOfMemoryError: Java heap space using Docker

So I am running the following locally (standalone):
~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --py-files afile.py run_script.py
And I got the following error:
java.lang.OutOfMemoryError: Java heap space
To get past this, I am running the following:
~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 6G --executor-memory 1G --py-files afile.py run_script.py
and the script runs normally.
Now, I am using a Docker build for Spark and run the following:
docker-compose up
docker exec app_master_1 bin/spark-submit --driver-memory 6G --executor-memory 1G --py-files afile.py run_script.py
In that case I still get the error of:
2018-06-13 21:43:16 WARN TaskSetManager:66 - Lost task 0.0 in stage 3.0 (TID 9, 172.17.0.3, executor 0): java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:77)
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:219)
at org.apache.spark.sql.execution.datasources.text.TextFileFormat$$anonfun$readToUnsafeMem$1$$anonfun$apply$4.apply(TextFileFormat.scala:143)
at org.apache.spark.sql.execution.datasources.text.TextFileFormat$$anonfun$readToUnsafeMem$1$$anonfun$apply$4.apply(TextFileFormat.scala:140)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.fold(TraversableOnce.scala:212)
at scala.collection.AbstractIterator.fold(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$19.apply(RDD.scala:1090)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$19.apply(RDD.scala:1090)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2123)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2123)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
and somewhere later:
2018-06-13 21:43:17 ERROR TaskSchedulerImpl:70 - Lost executor 0 on 172.17.0.3: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages
As far as I understand, even though it says the out-of-memory error is in executor 0, I have to increase the driver memory since it's standalone, right?
Any idea why this is happening and how to get past it?
Edit:
The error happens when I am trying to use sqlCont.read.json(json_path), even though the file is not that big.
As you can see, the worker node is initialized with 1 GB of memory in your Docker setup, while you are executing spark-submit with 1 GB of executor memory, so either decrease the executor memory or increase the worker memory when you create the Docker container.
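A sketch of the second option, under the assumption that the image follows Spark's standard standalone scripts, where the worker size is controlled by the SPARK_WORKER_MEMORY environment variable of the worker container; the 4g/3G values are illustrative only:
# Assumes the worker service in the compose file is started with SPARK_WORKER_MEMORY=4g;
# requesting a 3G executor then leaves headroom for overhead inside the worker.
docker exec app_master_1 bin/spark-submit \
  --driver-memory 6G \
  --executor-memory 3G \
  --py-files afile.py run_script.py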

java.lang.OutOfMemoryError: Java heap space despite having sufficient memory on Spark 2

I am running a Spark 2.1.1 job on an Azure VM (local mode), 16 core, 55 GB RAM.
I initialize with this command:
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
And run the following script on data:
import io.archivesunleashed.spark.matchbox.{ExtractDomain, ExtractLinks, RemoveHTML, RecordLoader, WriteGEXF}
import io.archivesunleashed.spark.rdd.RecordRDD._
val r = RecordLoader.loadArchives("/data2/toronto-mayor/*.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/data2/toronto-mayor-data/all-domains")
The data is relatively small (290GB) but consists of 292 files, ranging in size from 7GB to 38KB. Average size around 1GB. Swap of 100GB is available on this machine, and I've monitored htop while executing and there are no memory spikes above 45GB and no swap usage. It all seems to be working well, and then tumbles down...
It crashes with the following error:
ERROR Executor - Exception in task 13.0 in stage 0.0 (TID 13)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.StringCoding.safeTrim(StringCoding.java:89)
at java.lang.StringCoding.access$100(StringCoding.java:50)
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:154)
at java.lang.StringCoding.decode(StringCoding.java:193)
at java.lang.StringCoding.decode(StringCoding.java:254)
at java.lang.String.<init>(String.java:546)
at java.lang.String.<init>(String.java:566)
at io.archivesunleashed.data.WarcRecordUtils.getWarcResponseMimeType(WarcRecordUtils.java:102)
at io.archivesunleashed.spark.archive.io.ArchiveRecord.<init>(ArchiveRecord.scala:74)
at io.archivesunleashed.spark.matchbox.RecordLoader$$anonfun$2.apply(RecordLoader.scala:37)
at io.archivesunleashed.spark.matchbox.RecordLoader$$anonfun$2.apply(RecordLoader.scala:37)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Many of the other discussions on this site involve either cluster mode, or setting --driver-memory. Any help appreciated.
Attempts so far (updated)
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.memory.fraction=0.4 --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.memory.fraction=0.8 --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.default.parallelism=64 --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.default.parallelism=500 --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=100G --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --driver-memory 45G --executor-memory 10G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --driver-memory 45G --executor-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
The solution was ultimately to reduce the number of worker threads.
By default, Spark runs with local[*], which uses as many threads as there are cores on the machine, in this case 16.
By reducing the thread count (to local[12]) the jobs completed: with fewer tasks running concurrently, each task gets a larger share of the fixed 45 GB heap.
Syntax to run:
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --master local[12] --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"

Spark Streaming - java.io.IOException: Lease timeout of 0 seconds expired

I have a Spark Streaming application that uses checkpointing to write to HDFS.
Does anyone know the solution?
Previously we were using kinit to specify the principal and keytab, and we got the suggestion to specify these via the spark-submit command instead of kinit, but we still hit this error and it brings the Spark Streaming application down.
spark-submit --principal sparkuser@HADOOP.ABC.COM --keytab /home/sparkuser/keytab/sparkuser.keytab --name MyStreamingApp --master yarn-cluster --conf "spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC" --conf "spark.eventLog.enabled=true" --conf "spark.streaming.backpressure.enabled=true" --conf "spark.streaming.stopGracefullyOnShutdown=true" --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" --class com.abc.DataProcessor myapp.jar
I see multiple occurrences of the following exception in the logs, and finally a SIGTERM 15 that kills the executor and driver. We are using CDH 5.5.2.
2016-10-02 23:59:50 ERROR SparkListenerBus LiveListenerBus:96 -
Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:148)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:148)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:148)
at org.apache.spark.scheduler.EventLoggingListener.onUnpersistRDD(EventLoggingListener.scala:184)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:50)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1135)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
Caused by: java.io.IOException: Lease timeout of 0 seconds expired.
at org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:2370)
at org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:964)
at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:932)
at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:423)
at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:448)
at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:304)
at java.lang.Thread.run(Thread.java:745)

Spark sql very slow - Fails after couple of hours - Executors Lost

I am trying Spark SQL on a ~16 TB dataset with a large number of files (~50K). Each file is roughly 400-500 MB.
I am issuing a fairly simple Hive query on the dataset with just filters (no groupBys or joins), and the job is very, very slow. It runs for 7-8 hours and processes only about 80-100 GB on a 12-node cluster.
I have experimented with different values of spark.sql.shuffle.partitions, from 20 to 4000, but haven't seen much difference.
From the logs I have attached the YARN error at the end [1]. I am using the Spark config below [2] for the job.
Is there any other tuning I need to look into? Any tips would be appreciated.
Thanks
2. Spark config:
spark-submit \
  --master yarn-client \
  --driver-memory 1G \
  --executor-memory 10G \
  --executor-cores 5 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=2 \
  --conf spark.dynamicAllocation.minExecutors=2
1. YARN error:
16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed: container_1459747472046_1618_02_000003. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1459747472046_1618_02_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
I have explored the container logs but did not get much information from them.
I have seen the error logs below for a few containers but am not sure of their cause.
1. java.lang.NullPointerException at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:167)
2. java.lang.ClassCastException: Cannot cast org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisterExecutorFailed to org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisteredExecutor$
