Spark app failing with error org.apache.spark.shuffle.FetchFailedException

I am running Spark 2.4.0 on EMR and trying to process a huge amount of data (1 TB) using 100 nodes with 122 GB of memory and 16 cores each. I am getting the exceptions below after some time. Here are the parameters I've set:
--executor-memory 80g
--executor-cores 4
--driver-memory 80g
--driver-cores 1
spark = (SparkSession
.builder
.master("yarn")
.config("spark.shuffle.service.enabled","true")
.config("spark.dynamicAllocation.shuffleTracking.enabled","true")
.config("spark.dynamicAllocation.enabled", "true")
.config("spark.dynamicAllocation.minExecutors","50")
#.config("spark.dynamicAllocation.maxExecutors", "500")
.config("spark.dynamicAllocation.executorIdleTimeout","2m")
.config("spark.driver.maxResultSize", "16g")
.config("spark.kryoserializer.buffer.max", "2047")
.config("spark.rpc.message.maxSize", "2047")
.config("spark.memory.offHeap.enabled","true")
.config("spark.memory.offHeap.size","50g")
.config("spark.sql.autoBroadcastJoinThreshold", "-1")
.config("spark.sql.broadcastTimeout","1200")
.config("spark.sql.shuffle.partitions","200")
.config("spark.memory.storageFraction","0.3")
.config("spark.yarn.executor.memoryOverhead","2g")
.enableHiveSupport()
.getOrCreate())
Here are the three types of executor failures I've been getting, which eventually cause the corresponding stage to rerun. Sometimes the rerun succeeds, and sometimes it goes on retrying forever.
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 8
org.apache.spark.shuffle.FetchFailedException: Failure while fetching StreamChunkId{streamId=45963765394, chunkIndex=0}: java.lang.RuntimeException: Executor is not registered (appId=application_1625085506598_0885, execId=137)
org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-10-40-6-235.ap-south-1.compute.internal/10.40.6.235:7337
Attaching a screenshot of the Spark DAG.

Related

pyspark with spark 2.4 on EMR SparkException: Cannot broadcast the table that is larger than 8GB

I've checked the other posts related to this error and did not find anything that works.
What I'm trying to do:
df = spark.sql("""
SELECT DISTINCT
action.AccountId
...
,to_date(date) as Date
FROM sc_raw_report LEFT JOIN adwords_accounts ON action.AccountId=sc_raw_report.customer_id
WHERE date >= to_date(concat_ws('-',2018,1,1))
GROUP BY action.AccountId
,Account_Name
...
,to_date(date)
,substring(timestamp,12,2)
""")
df.show(5, False)
and then a saveAsTable. Nonetheless it returns an error:
py4j.protocol.Py4JJavaError: An error occurred while calling o119.showString.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
[...]
Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 13 GB
I've tried setting:
'spark.sql.autoBroadcastJoinThreshold': '-1'
But it did nothing.
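For clarity, a minimal sketch of how that setting is applied to the session (equivalent to passing --conf spark.sql.autoBroadcastJoinThreshold=-1 to spark-submit):
# -1 disables automatic broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")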
The table adwords_accounts is very small, and printing df.count() on sc_raw_report returns 2022197.
emr-5.28.0, Spark 2.4.4
My cluster core nodes: 15 × r4.4xlarge (16 vCore, 122 GiB memory, EBS-only storage)
Main (master) node: r5a.4xlarge (16 vCore, 128 GiB memory, EBS-only storage)
with config for spark-submit --deploy-mode cluster:
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem --conf fs.s3a.attempts.maximum=30 --conf spark.sql.crossJoin.enabled=true --executor-cores 5 --num-executors 5 --conf spark.dynamicAllocation.enabled=false --conf spark.executor.memoryOverhead=3g --driver-memory 22g --executor-memory 22g --conf spark.executor.instances=49 --conf spark.default.parallelism=490 --conf spark.driver.maxResultSize=0 --conf spark.sql.broadcastTimeout=3600
Anyone know what I can do here?
EDIT: additional info:
Upgrading to 16 instances of r4.8xlarge (32 vCPU, 244 GiB RAM) did nothing either.
Graph of the steps: the job runs, then goes idle before throwing the broadcast error.
Executors report from a few moments before the crash.
The config:
spark.serializer.objectStreamReset 100
spark.sql.autoBroadcastJoinThreshold -1
spark.executor.memoryOverhead 3g
spark.driver.maxResultSize 0
spark.shuffle.service.enabled true
spark.rdd.compress True
spark.stage.attempt.ignoreOnDecommissionFetchFailure true
spark.sql.crossJoin.enabled true
hive.metastore.client.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
spark.scheduler.mode FIFO
spark.driver.memory 22g
spark.executor.instances 5
spark.default.parallelism 490
spark.resourceManager.cleanupExpiredHost true
spark.executor.id driver
spark.driver.extraJavaOptions -Dcom.amazonaws.services.s3.enableV4=true
spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds 2000
spark.submit.deployMode cluster
spark.sql.broadcastTimeout 3600
spark.master yarn
spark.sql.parquet.output.committer.class com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter
spark.ui.filters org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.blacklist.decommissioning.timeout 1h
spark.sql.hive.metastore.sharedPrefixes com.amazonaws.services.dynamodbv2
spark.executor.memory 22g
spark.dynamicAllocation.enabled false
spark.sql.catalogImplementation hive
spark.executor.cores 5
spark.decommissioning.timeout.threshold 20
spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem true
spark.hadoop.yarn.timeline-service.enabled false
spark.yarn.executor.memoryOverheadFactor 0.1875
After the ShuffleMapStage, part of the shuffle block needs to be broadcast at the driver.
Please make sure the driver (in your case the AM in YARN) has enough memory/overhead.
Could you post the SparkContext runtime config?
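For reference, a minimal PySpark sketch for dumping that runtime config from the running application:
# Prints every resolved Spark property the driver actually received.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)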

Spark Standalone: application gets 0 cores

I seem to be unable to assign cores to an application. This leads to the following (apparently common) error message:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I have one master and two slaves in a Spark cluster. All are 8-core i7s with 16GB of RAM.
I have left spark-env.sh virtually untouched on all three, just specifying the master's IP address.
My spark-submit is the following:
nohup ./bin/spark-submit \
--jars ./ikoda/extrajars/ikoda_assembled_ml_nlp.jar,./ikoda/extrajars/stanford-corenlp-3.8.0.jar,./ikoda/extrajars/stanford-parser-3.8.0.jar \
--packages datastax:spark-cassandra-connector:2.0.1-s_2.11 \
--class ikoda.mlserver.Application \
--conf spark.cassandra.connection.host=192.168.0.33 \
--conf spark.cores.max=4 \
--driver-memory 4g --num-executors 2 --executor-memory 2g --executor-cores 2 \
--master spark://192.168.0.141:7077 ./ikoda/ikodaanalysis-mlserver-0.1.0.jar 1000 > ./logs/nohup.out &
I suspect I am conflating the SparkConf initialization in my code with the spark-submit. I need this because the app involves Spark Streaming, which can require reinitializing the SparkContext.
The SparkConf setup is as follows:
val conf = new SparkConf().setMaster(s"spark://$sparkmaster:7077").setAppName("MLPCURLModelGenerationDataStream")
conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
conf.set("spark.cassandra.connection.host", sparkcassandraconnectionhost)
conf.set("spark.driver.maxResultSize", sparkdrivermaxResultSize)
conf.set("spark.network.timeout", sparknetworktimeout)
conf.set("spark.jars.packages", "datastax:spark-cassandra-connector:"+datastaxpackageversion)
conf.set("spark.cores.max", sparkcoresmax)
The Spark UI shows the application with 0 cores assigned.
OK, this is definitely a case of programmer error.
But maybe others will make a similar error. The master had previously been used as a local Spark installation. I had put some executor settings in spark-defaults.conf and then, months later, had forgotten about this.
There is a cascading hierarchy whereby SparkConf settings take precedence, then spark-submit settings, and then spark-defaults.conf; spark-defaults.conf in turn overrides the defaults set by the Apache Spark team.
Once I removed the settings from spark-defaults, all was fixed.
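To illustrate that precedence, here is a minimal sketch (shown in PySpark for brevity; the property and value are only examples):
from pyspark import SparkConf, SparkContext

# A value set programmatically on SparkConf wins over the same key passed to
# spark-submit, which in turn wins over spark-defaults.conf.
conf = SparkConf().set("spark.cores.max", "4")
sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.cores.max"))  # "4", regardless of spark-defaults.conf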
It is because of the limit of your physical memory.
Your Spark memory in the Spark UI is 14.6 GB, so you must request less than 14.6 GB of memory for each executor. For this you can add a config to your SparkConf, something like this:
conf.set("spark.executor.memory", "10g")
If you request more than your physical memory, Spark doesn't allocate CPU cores to your job, displays 0 under Cores in the Spark UI, and runs nothing.
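Putting it together, a minimal sketch (shown in PySpark; the sizes and master URL mirror the question and are placeholders to adjust): keep the per-executor memory request below what a single worker offers, otherwise the master cannot place any executor and the application sits at 0 cores.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
    .setAppName("MLPCURLModelGenerationDataStream")
    .setMaster("spark://192.168.0.141:7077")    # standalone master from the question
    .set("spark.executor.memory", "10g")        # below the ~14.6 GB available per worker
    .set("spark.cores.max", "4"))
sc = SparkContext(conf=conf)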

SparkConf settings not used when running Spark app in cluster mode on YARN

I wrote a Spark application which sets some configuration via a SparkConf instance, like this:
SparkConf conf = new SparkConf().setAppName("Test App Name");
conf.set("spark.driver.cores", "1");
conf.set("spark.driver.memory", "1800m");
conf.set("spark.yarn.am.cores", "1");
conf.set("spark.yarn.am.memory", "1800m");
conf.set("spark.executor.instances", "30");
conf.set("spark.executor.cores", "3");
conf.set("spark.executor.memory", "2048m");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> inputRDD = sc.textFile(...);
...
When I run this application with the command (master=yarn & deploy-mode=client)
spark-submit --class spark.MyApp --master yarn --deploy-mode client /home/myuser/application.jar
everything seems to work fine; the Spark History UI shows the correct executor information.
But when running it with (master=yarn & deploy-mode=cluster)
my Spark UI shows the wrong executor information (~512 MB instead of ~1400 MB).
Also, my app name equals Test App Name when running in client mode, but is spark.MyApp when running in cluster mode. It seems that some default settings are taken when running in cluster mode. What am I doing wrong here? How can I make these settings work for cluster mode?
I'm using Spark 1.6.2 on a HDP 2.5 cluster, managed by YARN.
OK, I think I found the problem! In short: there is a difference between running Spark settings in Standalone mode and in YARN-managed mode!
So when you run Spark applications in Standalone mode, you can focus on Spark's Configuration documentation: http://spark.apache.org/docs/1.6.2/configuration.html
You can use the following settings for Driver & Executor CPU/RAM (just as explained in the documentation):
spark.executor.cores
spark.executor.memory
spark.driver.cores
spark.driver.memory
BUT: When running Spark inside a YARN-managed Hadoop environment, you have to be careful with the following settings and consider the following points:
Orient yourself on the "Spark on YARN" documentation rather than on the Configuration documentation linked above: http://spark.apache.org/docs/1.6.2/running-on-yarn.html (the properties explained there have a higher priority than the ones explained in the Configuration docs, which seem to describe only Standalone cluster vs. client mode, not YARN cluster vs. client mode!)
you can't use SparkConf to set properties in yarn-cluster mode! Instead use the corresponding spark-submit parameters:
--executor-cores 5
--executor-memory 5g
--driver-cores 3
--driver-memory 3g
In yarn-client mode you can't use the spark.driver.cores and spark.driver.memory properties! You have to use the corresponding AM properties in a SparkConf instance:
spark.yarn.am.cores
spark.yarn.am.memory
You can't set these AM properties via spark-submit parameters!
To set executor resources in yarn-client mode you can use
spark.executor.cores and spark.executor.memory in SparkConf
--executor-cores and --executor-memory parameters in spark-submit
If you set both, the SparkConf settings overwrite the spark-submit parameter values! (A short sketch follows at the end of this answer.)
These are my notes in textual form. Hope these findings help anybody else...
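A minimal sketch of the yarn-client case from those notes (shown in PySpark against the 1.6-era API; the values mirror the ones in the question and are placeholders):
from pyspark import SparkConf, SparkContext

# In yarn-client mode the AM resources must go through spark.yarn.am.*, while
# executor resources can be set here or via --executor-cores / --executor-memory
# on spark-submit (the SparkConf values win if both are given).
conf = (SparkConf()
    .setAppName("Test App Name")
    .setMaster("yarn-client")
    .set("spark.yarn.am.cores", "1")
    .set("spark.yarn.am.memory", "1800m")
    .set("spark.executor.instances", "30")
    .set("spark.executor.cores", "3")
    .set("spark.executor.memory", "2048m"))
sc = SparkContext(conf=conf)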
Just to add on to D. Müller's answer:
The same issue happened to me and I tried the settings in some different combinations. I am running PySpark 2.0.0 on a YARN cluster.
I found that driver-memory must be set during spark-submit, but executor-memory can be set in the script (i.e. in SparkConf) and the application will still work.
My application will die if driver-memory is less than 2g. The error is:
ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM
ERROR yarn.ApplicationMaster: User application exited with status 143
CASE 1:
driver & executor both written in SparkConf
spark = (SparkSession
.builder
.appName("driver_executor_inside")
.enableHiveSupport()
.config("spark.executor.memory","4g")
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.config("spark.driver.memory","2g")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster myscript.py
CASE 2:
- driver in spark submit
- executor in SparkConf in script
spark = (SparkSession
.builder
.appName("executor_inside")
.enableHiveSupport()
.config("spark.executor.memory","4g")
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster --conf spark.driver.memory=2g myscript.py
The job finished with SUCCEEDED status. The executor memory was correct.
CASE 3:
- driver in spark submit
- executor not written
spark = (SparkSession
.builder
.appName("executor_not_written")
.enableHiveSupport()
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster --conf spark.driver.memory=2g myscript.py
Apparently the executor memory was not set, which means CASE 2 actually picked up the executor memory settings despite them being written inside SparkConf.

Spark job fails due to stackoverflow error

My Spark job uses MLlib to train a LogisticRegression model on some data, but it fails with a StackOverflowError. Here is the error message shown in the spark-shell:
java.lang.StackOverflowError
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.generic.GenericCompanion.apply(GenericCompanion.scala:48)
...
When I check the Spark UI, there is no failed stage or job! This is how I run my spark-shell:
spark-shell --num-executors 100 --driver-memory 20g --conf spark.driver.maxResultSize=5g --executor-memory 8g --executor-cores 3
I even tried to increase the stack size by adding the following option when running the spark-shell, but it didn't help:
--conf "spark.driver.extraJavaOptions='-XX:ThreadStackSize=81920'"
What is the issue?

Spark GraphX memory out of error SparkListenerBus (java.lang.OutOfMemoryError: Java heap space)

I have an out-of-memory problem with Apache Spark (GraphX). The application runs, but after some time it shuts down. I use Spark 1.2.0. The cluster has enough memory and a number of cores. Another application where I am not using GraphX runs without problems. The application uses Pregel.
I submit the application in Hadoop YARN mode:
HADOOP_CONF_DIR=/etc/hadoop/conf spark-submit --class DPFile --deploy-mode cluster --master yarn --num-executors 4 --driver-memory 10g --executor-memory 6g --executor-cores 8 --files log4j.properties spark_routing_2.10-1.0.jar road_cr_big2 1000
Spark configuration:
val conf = new SparkConf(true)
.set("spark.eventLog.overwrite", "true")
.set("spark.driver.extraJavaOptions", "-Dlog4j.configuration=log4j.properties")
.set("spark.yarn.applicationMaster.waitTries", "60")
.set("yarn.log-aggregation-enable","true")
.set("spark.akka.frameSize", "500")
.set("spark.akka.askTimeout", "600")
.set("spark.core.connection.ack.wait.timeout", "600")
.set("spark.akka.timeout","1000")
.set("spark.akka.heartbeat.pauses","60000")
.set("spark.akka.failure-detector.threshold","3000.0")
.set("spark.akka.heartbeat.interval","10000")
.set("spark.ui.retainedStages","100")
.set("spark.ui.retainedJobs","100")
.set("spark.driver.maxResultSize","4G")
Thank you for your answers.
Log:
ERROR Utils: Uncaught exception in thread SparkListenerBus
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuilder.append(StringBuilder.java:132)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
at org.apache.spark.util.FileLogger.logLine(FileLogger.scala:192)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:88)
at org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:113)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$3.apply(SparkListenerBus.scala:50)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$3.apply(SparkListenerBus.scala:50)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:83)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:81)
at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:50)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1468)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
Exception in thread "SparkListenerBus" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuilder.append(StringBuilder.java:132)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
at org.apache.spark.util.FileLogger.logLine(FileLogger.scala:192)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:88)
at org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:113)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$3.apply(SparkListenerBus.scala:50)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$3.apply(SparkListenerBus.scala:50)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:83)
at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:81)
at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:50)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1468)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
ERROR LiveListenerBus: SparkListenerBus thread is dead! This means SparkListenerEvents have not been (and will no longer be) propagated to listeners for some time.
ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
