I'm working with Spark 1.2.1.
When I run Spark jobs I sometimes get the executor state "Exited" and sometimes "Killed". In both scenarios the job finishes successfully, and I do invoke SparkContext.stop()...
I don't understand the meaning of these states.
What is the difference between Spark executor states Exited vs Killed?
Exited - It means the executor finished its processing and exited cleanly, without any errors or exceptions.
Killed - It means the executor was killed by the Worker, which stopped and asked it to shut down. This can happen for many reasons, such as some user-driven action, or the executor finished its processing but for some reason did not exit on its own while the Worker was shutting down, so the Worker had to kill it.
Also, as a good practice, we should invoke SparkContext.stop() at the end of the job. This does not guarantee that you will always get the "Exited" status, but it does ensure that cleanup is executed and resources are de-allocated.
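A minimal sketch of that stop-at-the-end pattern (the app name and job body are placeholders, not from the original question):

import org.apache.spark.{SparkConf, SparkContext}

object MyJob {                                   // hypothetical driver object
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("my-job")
    val sc = new SparkContext(conf)
    try {
      sc.parallelize(1 to 10).count()            // placeholder for the real job logic
    } finally {
      sc.stop()                                  // always stop, even if the job throws,
                                                 // so executors and resources are released
    }
  }
}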
Related
I was testing in Spark YARN cluster mode.
The Spark job runs in a lower-priority queue,
and its containers are preempted when a higher-priority job arrives.
However, it relaunches the containers right after they are killed,
and the higher-priority app kills them again.
So the apps are stuck in this deadlock.
Infinite retry of executors is discussed here.
I found the trace below in the logs.
2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
So it seems that whatever retry count I set is not even considered.
Is there a flag to indicate that all executor failures should be counted, and that the job should fail when maxFailures is reached?
spark version 2.11
Spark distinguishes between code throwing an exception and external issues, i.e. code failures and container failures.
But Spark does not count preemption as a container failure.
See ApplicationMaster.scala, where Spark decides to quit if the container-failure limit is hit.
It gets the number of failed executors from YarnAllocator.
YarnAllocator updates its failed-container count in some cases, but not for preemptions; see the case ContainerExitStatus.PREEMPTED in the same function.
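A simplified, paraphrased sketch of that decision (not a verbatim copy of YarnAllocator; names and surrounding logic differ between versions, and the real code also handles memory-limit exit codes):

import org.apache.hadoop.yarn.api.records.{ContainerExitStatus, ContainerStatus}

// Returns true when a completed container should count toward the failed-executor limit.
def countsAsFailure(status: ContainerStatus): Boolean =
  status.getExitStatus match {
    case ContainerExitStatus.SUCCESS   => false // clean exit, not a failure
    case ContainerExitStatus.PREEMPTED => false // preempted by the scheduler: NOT counted,
                                                // so the ApplicationMaster failure limit is never hit
    case _                             => true  // other non-zero exits do increment the failure count
  }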
We use Spark 2.0.2, where the code is slightly different, but the logic is the same.
The fix would seem to be to update the failed-containers collection for preemptions too.
When using the Spark shell, we sometimes get "Lost task 4 in stage 2.0" messages in between, and the reasons can be numerous (e.g., out of memory, executor killed, etc.). But is there any manual way to explicitly retrieve the tasks from an executor and kill it?
I am seeing about 3018 failed tasks for the job, because about 4 executors died.
The Executors summary (shown below in the Spark UI) has completely different statistics: out of 3018, about 2994 completed properly. My questions are:
Will they be re-tried again?
Is there a config to override/limit this?
After monitoring the job and manually validating the attempt counts, even for successful tasks, I realised:
Will they be re-tried again?
- Yes, even the successful tasks are retried.
Is there a config to override/limit this?
- I did not find any config to override this behaviour.
If an executor (Kubernetes pod) dies (e.g., with an OOM or a timeout), all of its tasks, even the successfully completed ones, are re-executed. One of the main reasons is that the shuffle writes from the executor are lost with the executor itself.
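An illustrative sketch of why the shuffle matters here (paths and the word-count logic are placeholders; sc is an existing SparkContext): the reduce side of a wide transformation fetches shuffle files that the map tasks wrote locally on their executors, so losing an executor loses that output and forces already-successful map tasks to re-run.

// Sketch only; input/output paths are hypothetical.
val counts = sc.textFile("hdfs:///data/input")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // reduce tasks fetch shuffle files written by the map tasks;
                        // if the executor (pod) that wrote them dies, those map tasks
                        // are re-executed even though they had completed successfully
counts.saveAsTextFile("hdfs:///data/output")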
We encounter a problem with a Spark 1.6 job (on YARN) that never ends when several jobs are launched simultaneously.
We found that when launching the Spark job in yarn-client mode we do not have this problem, unlike when launching it in yarn-cluster mode.
That could be a lead for finding the cause.
We changed the code to add a sparkContext.stop().
Indeed, the SparkContext was created (val sparkContext = createSparkContext) but never stopped. This change allowed us to decrease the number of jobs that remain blocked, but we still have some blocked jobs.
By analyzing the logs, we found this message repeating endlessly:
17/09/29 11:04:37 DEBUG SparkEventPublisher: Enqueue SparkListenerExecutorMetricsUpdate(1,WrappedArray())
17/09/29 11:04:41 DEBUG ApplicationMaster: Sending progress
17/09/29 11:04:41 DEBUG ApplicationMaster: Number of pending allocations is 0. Sleeping for 5000.
It seems that the job blocks when we call newAPIHadoopRDD to get data from HBase; that may be the issue.
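Roughly, the read looks like the sketch below (the table name is illustrative, and sparkContext is the context mentioned above):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Rough sketch of the kind of call that seems to block (table name is illustrative).
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

val hbaseRdd = sparkContext.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])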
Does anyone have any idea about this issue?
Thank you in advance
A running Spark Streaming job, which is supposed to run continuously, exited abruptly with the following error (found in the executor logs):
2017-07-28 00:19:38,807 [SIGTERM handler] ERROR org.apache.spark.util.SignalUtils$$anonfun$registerLogger$1$$anonfun$apply$1 (SignalUtils.scala:43) - RECEIVED SIGNAL TERM
The Spark Streaming job ran for ~62 hours before receiving this signal.
I couldn't find any other ERROR/WARN in the executor logs. Unfortunately, I haven't set up the driver logs yet, so I am unable to dig deeper into this specific issue.
I am using a Spark cluster in standalone mode.
Any reason why the driver might send this signal (after Spark Streaming ran fine for more than 60 hours)?