Apache Spark Configuration

I'm facing a memory issue while running Spark: I have already raised the memory settings to the maximum, but it still fails. Please check out the following issue:
Command-
spark2-shell --conf "spark.default.parallelism=40" --executor-memory 8g --driver-memory 32g --conf "spark.ui.port=4404" --conf spark.driver.maxResultSize=2048m --conf spark.executor.heartbeatInterval=200s
Error-
ERROR cluster.YarnScheduler: Lost executor 9 on ampanacdddbp01.au.amp.local: Executor heartbeat timed out after 123643 ms
WARN scheduler.TaskSetManager: Lost task 19.0 in stage 0.0 (TID 19, ampanacdddbp01.au.amp.local, executor 9): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 123643 ms
WARN spark.HeartbeatReceiver: Removing executor 3 with no recent heartbeats: 126935 ms exceeds timeout 120000 ms
ERROR cluster.YarnScheduler: Lost executor 3 on ampanacdddbp01.au.amp.local: Executor heartbeat timed out after 126935 ms
ERROR scheduler.TaskSetManager: Total size of serialized results of 23 tasks (1040.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
ERROR scheduler.TaskSetManager: Total size of serialized results of 24 tasks (1085.8 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Please help me with the configuration and how to fix this "lost executor error".

The default value of the parameter "spark.driver.maxResultSize" is 1g, i.e. 1024 MB. Since your application is trying to send more than the memory allocated to this property back to the driver, you are getting this error.
Try raising the value in one of the following ways:
Either pass a command-line argument while launching spark-shell: "--conf spark.driver.maxResultSize=4g"
Or set the property at the system level in "conf/spark-defaults.conf" (spark-env.sh only sets environment variables, not Spark properties)
Or set the property at the SparkContext level, as follows:
from pyspark import SparkConf, SparkContext
conf = SparkConf().set('spark.driver.maxResultSize', '4g')
sc = SparkContext(conf=conf)
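Note that the heartbeat timeouts in your log are a separate issue from maxResultSize: your command sets spark.executor.heartbeatInterval=200s, but heartbeats are checked against spark.network.timeout, which still defaults to 120s (hence "126935 ms exceeds timeout 120000 ms"), and Spark requires the heartbeat interval to be significantly less than the network timeout. A minimal sketch combining both fixes; the exact values here are illustrative, not tuned for your workload:
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .set('spark.driver.maxResultSize', '4g')          # headroom for the ~1 GB of serialized results
        .set('spark.network.timeout', '600s')             # must stay well above the heartbeat interval
        .set('spark.executor.heartbeatInterval', '60s'))  # keep heartbeats frequent relative to the timeout
sc = SparkContext(conf=conf)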
Hope it helps.
Regards,
Neeraj

Related

Spark executor self-exiting due to driver disassociated in Kubernetes with client deploy-mode

I have a Spark driver running in a Kubernetes pod with client deploy-mode, and it tries to start an executor.
The executor fails with this error:
{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", "class":"dispatcher-Executor", "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", "log":"Executor self-exiting due to : Driver 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting down.\n"}
The driver then attempts to start another executor, which fails with the same error, and this goes on and on.
In the driver pod, I see only the following errors:
22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.43.250:
22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.43.233:
22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.43.221:
22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on 192.168.43.217:
22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on 192.168.43.197:
22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on 192.168.43.237:
22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on 192.168.43.196:
22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on 192.168.43.228:
22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on 192.168.43.254:
22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on 192.168.43.204:
22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on 192.168.43.231:
What is wrong? And how can I get executors running correctly?
Looks like the cluster is disconnected. Which platform are you using?
We are using Kubernetes running on a private OpenStack cloud.
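No resolution was posted, but the symptom (each executor registers and then immediately reports the driver as disassociated) usually means the executors cannot keep a connection open back to the driver pod. For client mode on Kubernetes, the Spark documentation recommends exposing the driver through a headless service and advertising that address via spark.driver.host. A hedged sketch, where spark-driver-svc, the namespace, and the port numbers are hypothetical placeholders for your cluster:
from pyspark.sql import SparkSession
# Assumes a headless Service named 'spark-driver-svc' in namespace 'my-namespace'
# that selects the driver pod and exposes ports 40000 and 40001 -- adjust to your setup.
spark = (SparkSession.builder
         .master('k8s://https://kubernetes.default.svc')
         .config('spark.driver.host', 'spark-driver-svc.my-namespace.svc.cluster.local')
         .config('spark.driver.port', '40000')               # fixed port so the Service can route to it
         .config('spark.driver.blockManager.port', '40001')
         .config('spark.driver.bindAddress', '0.0.0.0')      # bind inside the pod, advertise the Service name
         .getOrCreate())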

Apache Spark driver logs don't specify reason of stage cancelling

I run Apache Spark on AWS EMR under YARN.
The cluster has 1 master and 10 executors.
After some hours of processing my cluster failed, so I went to look at the logs.
I see that all working executors were trying to kill their tasks at the same time (this is the log of one executor):
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 66.0 in stage 2.0 (TID 466), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 65.0 in stage 2.0 (TID 465), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 67.0 in stage 2.0 (TID 467), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 64.0 in stage 2.0 (TID 464), reason: Stage cancelled
20/03/05 00:02:12 ERROR Utils: Aborting a task
I see that the reason is "Stage cancelled", but I can't get any details about it. Looking at the driver logs, I find that their last record is from a much earlier time.
So I have 2 questions:
Why are the driver logs much shorter than the executor logs?
How can I get the real reason why the stage was cancelled? For reference, the last records in the driver log are:
20/03/04 18:39:40 INFO TaskSetManager: Starting task 159.0 in stage 1.0 (TID 359, ip-172-31-6-236.us-west-2.compute.internal, executor 40, partition 159, RACK_LOCAL, 8421 bytes)
20/03/04 18:39:40 INFO ExecutorAllocationManager: New executor 40 has registered (new total is 40)
20/03/04 18:39:41 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-6-236.us-west-2.compute.internal:33589 with 2.8 GB RAM, BlockManagerId(40, ip-172-31-6-236.us-west-2.compute.internal, 33589, None)
20/03/04 18:39:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-6-236.us-west-2.compute.internal:33589 (size: 44.7 KB, free: 2.8 GB)
20/03/04 18:39:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-6-236.us-west-2.compute.internal:33589 (size: 37.4 KB, free: 2.8 GB)

Why do I get "executor refused connection"?

I have something strange in the execution of my code:
When I execute the following line:
sourceList = joinLabelrdd_df.select("x").collect()
I get the following exception. Note that I have enough memory and CPUs.
19/07/14 11:22:34 ERROR TaskSchedulerImpl: Lost executor 5 on 172.16.140.68: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
19/07/14 11:22:34 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190714111835-0000/5 is now EXITED (Command exited with code 137)
This error causes another exception:
19/07/14 11:22:41 WARN TaskSetManager: Lost task 113.1 in stage 9.0 (TID 2154, 172.16.140.113, executor 9): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=113, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
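No answer was posted, but two details in the log narrow things down. Exit code 137 is 128 + 9, i.e. the executor process was killed with SIGKILL, typically by the operating system or container OOM killer, and collect() ships every selected row to the driver at once, which puts memory pressure on both sides. A hedged sketch of less memory-hungry alternatives, reusing joinLabelrdd_df from the question (process is a hypothetical placeholder for your own logic):
# Stream rows to the driver one partition at a time instead of all at once:
for row in joinLabelrdd_df.select("x").toLocalIterator():
    process(row.x)

# Or shrink the result before collecting, if duplicates are not needed:
sourceList = joinLabelrdd_df.select("x").distinct().collect()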

Lost executor Spark

I have a long-running job on Spark which, after running for hours, failed with the following errors.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 547 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 750.0 in stage 19.0 (TID 1565492, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 752.0 in stage 19.0 (TID 1565494, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 751.0 in stage 19.0 (TID 1565493, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 754.0 in stage 19.0 (TID 1565496, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 753.0 in stage 19.0 (TID 1565495, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 572 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 INFO DAGScheduler: Executor lost: 547 (epoch 45)
18/10/09 03:22:15 WARN TaskSetManager: Lost task 756.0 in stage 19.0 (TID 1565498, ip, executor 572): ExecutorLostFailure (executor 572 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
...
The strange thing is, I can't even see the lost executors in the Executors list to check their logs.
It would be great if someone can help fix the problem.
There are many possible causes, but the summary is the following:
Your master node is unable to reply to a specific executor in time, and the executor therefore fails with the error
Unable to register with external shuffle server due to
Why your master node cannot reply can have different reasons; it depends on how your code is structured and, if you are using EMR, on the size of your instances.
To solve it:
Increase your master node. For example, if you are using i3.4xlarge, use i3.8xlarge or even i3.16xlarge instead.
Increase the network timeout from 2 minutes to 5 minutes. This is done with the Spark configuration spark.network.timeout=300s.
Increase both the memory and the number of cores of your master node. To increase the number of cores of the YARN application master, set spark.yarn.am.cores=3. (Both settings are shown in the sketch below.)
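A minimal PySpark sketch of the two property changes above; the values are just the examples from this answer, not tuned recommendations:
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .set('spark.network.timeout', '300s')  # raise the default 120s RPC/shuffle timeout to 5 minutes
        .set('spark.yarn.am.cores', '3'))      # extra cores for the YARN application master (client deploy-mode)
sc = SparkContext(conf=conf)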
Hope this solves the issue.

spark worker with 32GB or more memory encountered a fatal error

I have three slaves in a standalone Spark cluster. Each slave has 48GB of RAM. When I assigned more than 31GB (e.g. 32GB or more) of RAM to my executors:
.config("spark.executor.memory", "44g")
the executors were terminated without much information during a join of two large DataFrames. The output at the driver showed "missing an output location for shuffle":
17/09/21 12:34:18 INFO StandaloneSchedulerBackend: Granted executor ID app-20170921123240-0000/3 on hostPort XXX.XXX.XXX.92:33705 with 6 cores, 44.0 GB RAM
17/09/21 12:34:18 WARN TaskSetManager: Lost task 14.0 in stage 7.0 (TID 124, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 5.0 in stage 7.0 (TID 115, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 17.0 in stage 7.0 (TID 127, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 8.0 in stage 7.0 (TID 118, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 2.0 in stage 7.0 (TID 112, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 11.0 in stage 7.0 (TID 121, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 INFO DAGScheduler: Executor lost: 0 (epoch 5)
17/09/21 12:34:18 INFO BlockManagerMaster: Removal of executor 0 requested
17/09/21 12:34:18 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_2 !
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_11 !
17/09/21 12:34:18 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170921123240-0000/3 is now RUNNING
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_5 !
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_8 !
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(0, XXX.XXX.XXX, 34840, None)
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/09/21 12:34:18 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor
The log message of Spark Master showed that the executors were "EXITED" and then relaunched:
17/09/21 12:34:18 INFO Master: Removing executor app-20170921123240-0000/0 because it is EXITED
17/09/21 12:34:18 INFO Master: Launching executor app-20170921123240-0000/3 on worker worker-20170921123014-152.83.247.92-33705
The log message of the Spark Worker showed that the executor exited with code 134:
17/09/21 12:34:18 INFO Worker: Executor app-20170921123240-0000/0 finished with state EXITED message Command exited with code 134 exitStatus 134
The only clue seems to be in the error log of the application, showing a fatal error has been detected by the JRE:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fdec0c92a73, pid=11300, tid=0x00007fd3a6951700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 )
# Problematic frame:
# V [libjvm.so+0x3ffa73] CardTableExtension::scavenge_contents_parallel(ObjectStartArray*, MutableSpace*, HeapWord*, PSPromotionManager*, unsigned int, unsigned int)+0x5e3
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
--------------- T H R E A D ---------------
Current thread (0x0000000001c9e800): GCTaskThread [stack: 0x00007fd3a6851000,0x00007fd3a6952000] [id=11308]
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000008
As long as I assign 31GB of RAM (or less) to each executor, my program works just fine. Has anyone encountered such a problem before?
44 GB may actually give you a smaller usable heap than 31 GB because of how Java stores object references: for heap sizes over 32 GB the JVM has to switch to 64-bit object references, so every object takes up more space. More details here: http://java-performance.info/over-32g-heap-java/
My rule of thumb is to either stay below 32 GB or go much higher (say, 50 GB). Usually it is more cost-efficient to use multiple JVMs, each with less than 32 GB of heap. With 48 GB of RAM I would stick to a 31 GB heap.
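To confirm which mode the executor JVMs actually run in, HotSpot can print its compressed-oops decision at startup. A minimal sketch following the 31 GB rule of thumb above; -XX:+PrintCompressedOopsMode is a HotSpot diagnostic flag (it requires -XX:+UnlockDiagnosticVMOptions):
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .set('spark.executor.memory', '31g')  # stays under the ~32 GB compressed-oops boundary
        # Each executor JVM logs whether compressed oops are in use at startup:
        .set('spark.executor.extraJavaOptions',
             '-XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode'))
sc = SparkContext(conf=conf)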
