How to avoid ExecutorFailure Error in Spark - apache-spark

How can I avoid executor failures while Spark jobs are executing?
We are using Spark 1.6 as part of Cloudera CDH 5.10.
I usually get the error below:
ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 127100 ms

There can be various reasons why task execution slows down until the heartbeat times out; you need to drill down to find the root cause.
Sometimes tuning the default timeout configuration also helps. Check the current values of the parameters below in the Spark UI, then increase them in spark-submit:
spark.worker.timeout
spark.network.timeout
spark.akka.timeout
Running the job with speculative execution (spark.speculation=true) also helps: if one or more tasks in a stage are running slowly, they will be re-launched.
Explore the Spark 1.6.0 configuration properties for more options.
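As a minimal sketch, the same tuning can also be set programmatically; the property names are standard Spark 1.6 settings, but the specific values below are only illustrative assumptions, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; the master and deploy settings come from spark-submit.
val conf = new SparkConf()
  .setAppName("timeout-tuning-sketch")
  .set("spark.network.timeout", "600s")  // default is 120s; also used as the executor heartbeat timeout
  .set("spark.worker.timeout", "240")    // standalone worker heartbeat timeout, in seconds
  .set("spark.speculation", "true")      // re-launch tasks that run much slower than their peers
val sc = new SparkContext(conf)

The equivalent spark-submit form is --conf spark.network.timeout=600s, and so on for each property.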

Related

Spark keeps relaunching executors after yarn kills them

I was testing with Spark in YARN cluster mode.
The Spark job runs in a lower-priority queue, and its containers are preempted when a higher-priority job comes along.
However, Spark relaunches the containers right after they are killed, and the higher-priority app kills them again, so the apps are stuck in this deadlock.
Infinite retry of executors is discussed here.
I found the trace below in the logs:
2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
So it seems any retry count I set is not even considered.
Is there a flag to indicate that all failures in an executor should be counted, and that the job should fail when maxFailures is reached?
spark version 2.11
Spark distinguishes between code throwing an exception and external issues, i.e. code failures and container failures.
But Spark does not count preemption as a container failure.
See ApplicationMaster.scala, where Spark decides to quit if the container-failure limit is hit.
It gets the number of failed executors from YarnAllocator.
YarnAllocator updates its failed-container count in some cases, but not for preemptions; see case ContainerExitStatus.PREEMPTED in the same function.
We use Spark 2.0.2, where the code is slightly different, but the logic is the same.
The fix seems to be to update the failed-containers collection for preemptions too.
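To make that concrete, here is a minimal, self-contained sketch of the decision being described. The exit-status values mirror YARN's ContainerExitStatus, but the object, method and counter names are my own simplification, not the actual YarnAllocator source:

object PreemptionCountingSketch {
  // Values from org.apache.hadoop.yarn.api.records.ContainerExitStatus
  val SUCCESS = 0
  val PREEMPTED = -102

  var numExecutorsFailed = 0

  // Simplified view of what happens when a completed container is processed.
  def onContainerCompleted(exitStatus: Int): Unit = exitStatus match {
    case SUCCESS   => ()  // clean exit: not a failure
    case PREEMPTED => ()  // preemption: logged but NOT counted, so the failure limit is never reached
    case _         => numExecutorsFailed += 1  // any other exit status counts as a container failure
  }

  def main(args: Array[String]): Unit = {
    Seq(PREEMPTED, PREEMPTED, 1).foreach(onContainerCompleted)
    println(s"failures counted: $numExecutorsFailed")  // prints 1: the two preemptions are ignored
  }
}

Under this logic, an endlessly preempted application never trips spark.yarn.max.executor.failures, which matches the behaviour described above.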

EMR 5.13: Spark 2.3.0 UI shows Executors remain alive

Ever since I've upgraded to EMR 5.13, I've been seeing strange metrics on the Spark & YARN UIs.
In this particular instance:
YARN showed that the process completed.
Ganglia shows that the cluster has been idle since completion of the last (118th) job.
The Spark UI also shows that all my 118 tasks have completed.
Even so, the Spark UI reports that all executors are alive, long (over 1 hr at the time of writing) after the last job completed.
Could this be a UI glitch, or is something else going on?
Frameworks / Platform:
EMR 5.13
Spark 2.3.0
Hive 2.3.2
Hadoop: Amazon 2.8.3
One executor with active tasks in your screenshot is marked as Dead; it shows statistics as of the moment of termination.
As you can see, executor #5 processed 624 tasks before termination. YARN then started a new executor #9 in its place, which completed 76 tasks.

Best practice to run multiple Spark instances at a time in the same JVM?

I am trying to initiate separate pyspark applications at the same time from the driver machine, so both applications run in the same JVM. Although each creates a separate SparkContext object, one of the jobs failed saying it failed to get broadcast_1.
16/12/06 08:18:40 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and
have sufficient resources
16/12/06 08:18:55 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and
have sufficient resources
16/12/06 08:18:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.26.7.195:44690) with ID 52
16/12/06 08:18:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.26.7.195, partition 0, ANY, 7307 bytes)
16/12/06 08:18:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 0 on executor id: 52 hostname: 172.26.7.195.
16/12/06 08:19:00 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.26.7.192:38343) with ID 53
16/12/06 08:19:02 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.26.7.195): java.io.IOException: org.apache.spark.SparkException: Failed
to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1260)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:67)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
I searched Google and Stack Overflow extensively and found that running multiple SparkContext objects in the same JVM is not recommended, and is not supported at all for Python.
My questions are:
In my application, I need to run multiple pyspark applications at the same time on a schedule. Is there any way to run multiple pyspark applications from the Spark driver at the same time, each creating its own SparkContext object?
If the answer to the first question is no, can I instead run, for example, one application from the driver and another from an executor, at the same time?
Finally, are there any better suggestions in terms of configuration or best practices for running parallel Spark applications on the same cluster?
My Setup:
Hadoop version: hadoop-2.7.1
Spark: 2.0.0
Python: python 3.4
Mongodb 3.2.10
Configuration:
VM-1: Hadoop primary node, Spark driver & executor, MongoDB
VM-2: Hadoop data node, Spark executor
The pyspark applications are launched from a normal crontab entry on VM-1.
I was also trying to do similar things and got a block manager registration error. I was trying to launch 2 different pyspark shells from the same node; after a lot of searching I realized that both pyspark shells were probably using the same driver JVM, and as one shell occupied the BlockManager, the other started throwing exceptions.
So I decided to use another approach, using different nodes to launch the driver programs and linking both programs to the same master with
pyspark --master <spark-master url> --total-executor-cores <num of cores to use>
Now I am no longer getting the error.
Hope this helps, and do tell if you find any reason or solution for launching more than one spark shell on the same driver.
Do you mean two Spark applications, or one Spark application and two Spark contexts? Two Spark applications, each with its own driver and SparkContext, should be achievable, unless your requirement forces them to share something.
When you have two Spark applications, they are just like any others, and the resources need to be shared like any other applications.
"WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources"
The driver is allocated resources in order to run, and the remaining resources are less than those specified for your application's executors.
For example:
The node has 4 cores x 16 GB RAM.
The driver configuration is Spark Driver Cores = 1, Spark Driver Memory = 8 GB.
The executor configuration is Spark Executor Cores = 4, Spark Executor Memory = 10 GB.
This will result in the error above.
The Driver resources + Executor resources cannot exceed the limit of the node (as determined by either physical hardware or spark-env settings)
In the above example:
Driver configured to use 1 CPU Core / 8 GB RAM
The Executor configuration cannot exceed 3 CPU Cores / 8 GB RAM
Note that the total executor resources will be
(Spark Executor Cores/ Executor Memory) * number of executors running on the node
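As a minimal sketch of keeping two concurrent applications within those limits (the property names are standard Spark settings, but the app name and the specific numbers are illustrative assumptions sized for the 4-core / 16 GB node above):

import org.apache.spark.{SparkConf, SparkContext}

// With a 1-core / 8 GB driver already on the node, roughly 3 cores / 8 GB remain,
// so cap each application well below that to leave room for the other one.
val conf = new SparkConf()
  .setAppName("app-one")                // hypothetical name
  .set("spark.cores.max", "1")          // total cores this application may take from the cluster
  .set("spark.executor.cores", "1")
  .set("spark.executor.memory", "3g")   // two apps x 3 GB stays under the remaining 8 GB
val sc = new SparkContext(conf)

Launching the second application with the same kind of caps (for pyspark, via --total-executor-cores and --executor-memory) leaves both with enough resources to schedule tasks.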

Spark Streaming stops after some time due to Executor Lost

I am using Spark 1.3 for a Spark Streaming application. When I start my application, I can see in the Spark UI that a few of the jobs have failed tasks. On investigating the job details, I see that some tasks failed due to an Executor Lost exception, either ExecutorLostFailure (executor 11 lost) or Resubmitted (resubmitted due to lost executor).
In the application logs from YARN, the only error shown is Lost executor 11 on <machineip>: remote Akka client disassociated. I don't see any other exception or error being thrown.
The application stops after a couple of hours. The logs show all the executors were lost when the application failed.
Can anyone suggest how to resolve this issue, or point to a link that does?
There are many potential reasons for executor loss. One thing I have observed in the past is that Java garbage collection can pause for very long periods under heavy load. As a result the executor is 'lost' when the GC pause takes too long, and it comes back shortly thereafter.
You can determine whether this is the issue by turning on executor GC logging. Simply add the following configuration:
--conf "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy"
See this great guide from Intel/DataBricks here for more details on GC tuning: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html

How can I run an Apache Spark shell remotely?

I have a Spark cluster setup with one master and 3 workers. I also have Spark installed on a CentOS VM. I'm trying to run a Spark shell from my local VM which would connect to the master, and allow me to execute simple Scala code. So, here is the command I run on my local VM:
bin/spark-shell --master spark://spark01:7077
The shell runs to the point where I can enter Scala code. It says that executors have been granted (x3 - one for each worker). If I peek at the Master's UI, I can see one running application, Spark shell. All the workers are ALIVE, have 2 / 2 cores used, and have allocated 512 MB (out of 5 GB) to the application. So, I try to execute the following Scala code:
sc.parallelize(1 to 100).count
Unfortunately, the command doesn't work. The shell will just print the same warning endlessly:
INFO SparkContext: Starting job: count at <console>:13
INFO DAGScheduler: Got job 0 (count at <console>:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 0(count at <console>:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:13), which has no missing parents
INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:13)
INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
Following my research into the issue, I have confirmed that the master URL I am using is identical to the one on the web UI. I can ping and ssh both ways (cluster to local VM, and vice-versa). Moreover, I have played with the executor-memory parameter (both increasing and decreasing the memory) to no avail. Finally, I tried disabling the firewall (iptables) on both sides, but I keep getting the same error. I am using Spark 1.0.2.
TL;DR Is it possible to run an Apache Spark shell remotely (and inherently submit applications remotely)? If so, what am I missing?
EDIT: I took a look at the worker logs and found that the workers had trouble finding Spark:
ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error running executor
java.io.IOException: Cannot run program "/usr/bin/spark-1.0.2/bin/compute-classpath.sh" (in directory "."): error=2, No such file or directory
...
Spark is installed in a different directory on my local VM than on the cluster. The path the worker is attempting to find is the one on my local VM. Is there a way for me to specify this path? Or must they be identical everywhere?
For the moment, I adjusted my directories to circumvent this error. Now, my Spark Shell fails before I get the chance to enter the count command (Master removed our application: FAILED). All the workers have the same error:
ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker#spark02:7078] -> [akka.tcp://sparkExecutor#spark02:53633]:
Error [Association failed with [akka.tcp://sparkExecutor#spark02:53633]]
[akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor#spark02:53633]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$annon2: Connection refused: spark02/192.168.64.2:53633
As suspected, I am running into network issues. What should I look at now?
I solved this problem between my Spark client and my Spark cluster.
Check your network: client A and the cluster must be able to ping each other. Then add two config lines to spark-env.sh on client A.
First:
export SPARK_MASTER_IP=172.100.102.156
export SPARK_JAR=/usr/spark-1.1.0-bin-hadoop2.4/lib/spark-assembly-1.1.0-hadoop2.4.0.jar
Second:
Test your Spark shell in cluster mode.
This problem can be caused by the network configuration. It looks like the error TaskSchedulerImpl: Initial job has not accepted any resources can have quite a few causes (see also this answer):
actual resource shortage
broken communication between the master and workers
broken communication between the master/workers and the driver
The easiest way to rule out the first two possibilities is to run a test with a Spark shell directly on the master. If this works, communication within the cluster itself is fine and the problem is caused by communication with the driver host. To analyze the problem further, it helps to look into the worker logs, which contain entries like
16/08/14 09:21:52 INFO ExecutorRunner: Launch command:
"/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java"
...
"--driver-url" "spark://CoarseGrainedScheduler#192.168.1.228:37752"
...
and test whether the worker can establish a connection to the driver's IP and port. Apart from general firewall / port forwarding issues, it is possible that the driver is binding to the wrong network interface. In that case you can export SPARK_LOCAL_IP on the driver before starting the Spark shell in order to bind to a different interface.
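For that connectivity test, a minimal, self-contained sketch (not part of Spark; the default host and port below are just the values from the example log line above, so substitute the ones from your own worker log):

import java.net.{InetSocketAddress, Socket}

object DriverReachabilityCheck {
  def main(args: Array[String]): Unit = {
    // Driver host and port as reported in the worker's "--driver-url" log line.
    val host = args.headOption.getOrElse("192.168.1.228")
    val port = if (args.length > 1) args(1).toInt else 37752
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), 5000)  // 5 second timeout
      println(s"Connected to $host:$port - the driver is reachable from this worker")
    } catch {
      case e: Exception => println(s"Cannot reach $host:$port - ${e.getMessage}")
    } finally {
      socket.close()
    }
  }
}

Run it on a worker host; if it cannot connect while the Spark shell is running on the driver, the problem is firewalling, routing, or the driver binding to the wrong interface rather than a Spark configuration issue.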
Some additional references:
Knowledge base entry on network connectivity issues.
Github discussion on improving the documentation of Initial job has not accepted any resources.
