Spark job blocked and runs indefinitely - apache-spark

We are hitting a problem with a Spark 1.6 job (on YARN) that never ends when several jobs are launched simultaneously.
We found that launching the job in yarn-client mode avoids the problem, whereas launching it in yarn-cluster mode does not.
That could be a lead towards the cause.
We changed the code to add a sparkContext.stop().
Indeed, the SparkContext was created (val sparkContext = createSparkContext) but never stopped. This change has reduced the number of jobs that remain blocked, but we still have some blocked jobs.
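For reference, one way to make the stop unconditional (a sketch only, reusing the existing createSparkContext helper; the actual job logic is elided):

    // Minimal sketch: always stop the context, even if the job body throws.
    val sparkContext = createSparkContext   // existing helper from the code above
    try {
      // ... job logic (HBase read, transformations, actions) ...
    } finally {
      sparkContext.stop()
    }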
By analyzing the logs we found the following lines, which repeat endlessly:
17/09/29 11:04:37 DEBUG SparkEventPublisher: Enqueue SparkListenerExecutorMetricsUpdate(1,WrappedArray())
17/09/29 11:04:41 DEBUG ApplicationMaster: Sending progress
17/09/29 11:04:41 DEBUG ApplicationMaster: Number of pending allocations is 0. Sleeping for 5000.
It seems that the job blocks when we call newAPIHadoopRDD to get data from HBase; that may be the issue.
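For reference, a typical newAPIHadoopRDD read from HBase looks roughly like this (a sketch only; the table name is a placeholder and sparkContext is the context created above):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    // Sketch: read an HBase table as an RDD of (row key, row) pairs.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "some_table")   // placeholder name

    val hbaseRdd = sparkContext.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])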
Does anyone have an idea about this issue?
Thank you in advance.

Related

Spark structured streaming job stuck for hours without getting killed

I have a structured streaming job which reads from Kafka, performs aggregations and writes to HDFS. The job runs in cluster mode on YARN, using Spark 2.4.
Every 2-3 days this job gets stuck. It doesn't fail but hangs on some microbatch; the microbatch doesn't even start. The driver keeps printing the following log line repeatedly for hours:
Got an error when resolving hostNames. Falling back to /default-rack for all.
When I kill the streaming job and start it again, it runs fine once more.
How can I fix this?
See this issue: https://issues.apache.org/jira/browse/SPARK-28005
It is fixed in Spark 3.0. It seems to happen because there are no active executors.

Spark keeps relaunching executors after yarn kills them

I was testing with Spark in yarn-cluster mode.
The Spark job runs in a lower-priority queue, and its containers are preempted when a higher-priority job comes in.
However, Spark relaunches the containers right after they are killed, and the higher-priority app kills them again, so the apps are stuck in this deadlock.
Infinite retry of executors is discussed here.
I found the trace below in the logs.
2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
So it seems any retry count I set is not even considered.
Is there a flag to indicate that all failures in executors should be counted, so the job fails when maxFailures is reached?
Spark version 2.11
Spark distinguishes between code throwing an exception and external issues, i.e. code failures vs. container failures.
But Spark does not consider preemption a container failure.
See ApplicationMaster.scala: this is where Spark decides to quit if the container-failure limit is hit.
It gets the number of failed executors from YarnAllocator.
YarnAllocator updates its failed-container count in some cases, but not for preemptions; see case ContainerExitStatus.PREEMPTED in the same function.
We use Spark 2.0.2, where the code is slightly different but the logic is the same.
The fix seems to be to update the failed-containers collection for preemptions too (see the simplified sketch below).
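A simplified, hypothetical sketch of the accounting described above (not the actual YarnAllocator source; names are illustrative):

    import org.apache.hadoop.yarn.api.records.{ContainerExitStatus, ContainerStatus}

    object FailureAccountingSketch {
      var numExecutorsFailed = 0

      def onContainerCompleted(status: ContainerStatus): Unit = {
        status.getExitStatus match {
          case ContainerExitStatus.SUCCESS | ContainerExitStatus.PREEMPTED =>
            // Preempted containers are treated like clean exits: they do not
            // increment the failure counter, so the ApplicationMaster never
            // reaches its max-failures limit and keeps requesting replacements.
            // The fix discussed above would count preemptions here as well.
            ()
          case _ =>
            numExecutorsFailed += 1   // only "genuine" container failures count
        }
      }
    }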

Do successful tasks also get reprocessed on an executor crash?

I am seeing about 3018 failed tasks for the job because about 4 executors died. The Executors summary (in the Spark UI) shows completely different statistics: out of the 3018, about 2994 completed properly. My questions are:
Will they be retried again?
Is there a config to override/limit this?
After monitoring the job and manually validating the attempt counts, even for successful tasks, I realised:
Will they be retried again?
- Yes, even the successful tasks are retried.
Is there a config to override/limit this?
- I did not find any config to override this behaviour.
If an executor (a Kubernetes pod) dies (e.g. with an OOM or timeout), all of its tasks, even the successfully completed ones, are re-executed. One of the main reasons is that the shuffle writes from the executor are lost along with the executor itself!
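One possible mitigation, where the cluster manager supports it: on YARN or standalone deployments, the external shuffle service keeps shuffle files outside the executor JVM, so a lost executor does not necessarily force its completed map output to be recomputed. A minimal sketch of the relevant settings (the service itself must also be running on the nodes, e.g. as a YARN auxiliary service):

    import org.apache.spark.SparkConf

    // Sketch: standard Spark settings to serve shuffle files from the
    // external shuffle service rather than from the executor process.
    val conf = new SparkConf()
      .set("spark.shuffle.service.enabled", "true")
      // often paired with dynamic allocation, which requires the service
      .set("spark.dynamicAllocation.enabled", "true")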

Spark Streaming stops after sometime due to Executor Lost

I am using Spark 1.3 for a Spark Streaming application. When I start my application, I can see in the Spark UI that a few of the jobs have failed tasks. Investigating the job details, I see that some tasks failed due to an executor-lost exception, either ExecutorLostFailure (executor 11 lost) or Resubmitted (resubmitted due to lost executor).
In the application logs from YARN the only error shown is Lost executor 11 on <machineip>: remote Akka client disassociated. I don't see any other exception or error being thrown.
The application stops after a couple of hours. The logs show all the executors are lost when the application fails.
Can anyone suggest, or point to a link on, how to resolve this issue?
There are many potential reasons for executor loss. One thing I have observed in the past is that Java garbage collection can pause for very long periods under heavy load. As a result the executor is 'lost' when GC takes too long, and comes back shortly afterwards.
You can determine whether this is the issue by turning on executor GC logging. Simply add the following configuration:
--conf "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy"
See this great guide from Intel/Databricks for more details on GC tuning: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html

What is the difference between Spark executor states Exited vs Killed?

I'm working with Spark 1.2.1.
When I run Spark jobs I sometimes get executor state "Exited" and sometimes "Killed"; in both scenarios the job finishes successfully and I invoke SparkContext.stop()...
I don't understand the meaning of those states.
What is the difference between Spark executor states Exited vs Killed?
Exited - means the executor finished its processing and exited cleanly, without any errors or exceptions.
Killed - means the executor was killed by the Worker, which stopped and asked to kill it. This can happen for many reasons, such as some user-driven action, or perhaps your executor finished processing but for some reason did not exit while the Worker was shutting down, so the Worker had to kill it.
Also, as good practice, we should invoke SparkContext.stop() at the end of the job. This does not guarantee that you will always get the "Exited" status, but it does ensure that cleanup is executed and resources are de-allocated.
