How to handle executor failure in Apache Spark - apache-spark

I have run a job using spark-submit, and during the run we lost an executor. At that point, can the job recover? If it can, how does the recovery happen, and how do we get that executor back?

You cannot handle executor failures programmatically in your application, if that's what you are asking.
You can, however, configure Spark properties that guide the actual job execution, including how YARN schedules jobs and handles task and executor failures.
https://spark.apache.org/docs/latest/configuration.html#scheduling
Some important properties you may want to check out:
spark.task.maxFailures (default: 4): Number of failures of any particular task before giving up on the job. The total number of failures spread across different tasks will not cause the job to fail; a particular task has to fail this number of attempts. Should be greater than or equal to 1. Number of allowed retries = this value - 1.

spark.blacklist.application.maxFailedExecutorsPerNode (default: 2): (Experimental) How many different executors must be blacklisted for the entire application before the node is blacklisted for the entire application. Blacklisted nodes will be automatically added back to the pool of available resources after the timeout specified by spark.blacklist.timeout. Note that with dynamic allocation, though, the executors on the node may get marked as idle and be reclaimed by the cluster manager.

spark.blacklist.task.maxTaskAttemptsPerExecutor (default: 1): (Experimental) For a given task, how many times it can be retried on one executor before the executor is blacklisted for that task.
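As an illustration of how these properties are typically set, here is a minimal Scala sketch using SparkConf; the same keys can equally be passed with spark-submit --conf. The values are arbitrary examples, not recommendations:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Example only: tune these values for your own cluster and workload.
val conf = new SparkConf()
  .setAppName("failure-tolerant-job")
  // Allow each task up to 8 attempts before the job is failed.
  .set("spark.task.maxFailures", "8")
  // Enable blacklisting, and blacklist a node for the whole application
  // once 2 of its executors have been blacklisted.
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.application.maxFailedExecutorsPerNode", "2")
  // A task gets one attempt per executor before that executor is
  // blacklisted for the task.
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")

val spark = SparkSession.builder().config(conf).getOrCreate()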

Related

Spark BroadcastHashJoin operator and Dynamic Allocation enabled

In my company, we have the following scenario:
we have dynamic allocation enabled by default for every data pipeline, so we can save some costs and enable resource sharing among the different executions;
also, most of the queries we run perform joins, and Spark has some interesting optimizations around them, like changing the join strategy when it identifies that one side of the join is small enough to be broadcast. This is what we call a BroadcastHashJoin, and we have lots of queries with this operator in their query plans;
last but not least, our pipelines run on EMR clusters using the client mode.
We are having a problem that happens when the YARN (the RM on EMR) queue where a job was submitted is full and there are not enough resources to allocate new executors for the application. Since the driver process runs on the machine that submitted the application (client mode), the broadcast job starts anyway and, after 300s, fails with the broadcast timeout error.
Running the same job at a different time of day (when queue usage is not as high), it completes successfully.
My questions are all about how these three things work together (dynamic allocation, BHJ, client mode). Without dynamic allocation, it's easy to see that the broadcast reaches every executor that was requested initially through the spark-submit command. But if we enable dynamic allocation, how does the broadcast reach the executors that are allocated dynamically later? Does the driver have to send it again to every new executor? Are they subject to the same 300-second timeout? Is there a way to prevent the driver (in client mode) from starting the broadcast until it has enough executors?
Source: BroadcastExchangeExec source code here
PS: we have already tried setting the spark.dynamicAllocation.minExecutors property to 1, but with no success. The job still started with only the driver allocated and errored out after the 300s.
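For context, the 300s in the error is Spark SQL's broadcast timeout, which is configurable. A hedged sketch of the two SQL settings usually involved, assuming an existing SparkSession named spark; the values are illustrative workarounds, not a fix for the underlying queue contention:

// Illustrative only: raise the broadcast timeout above its 300s default...
spark.conf.set("spark.sql.broadcastTimeout", "600")
// ...or disable automatic broadcast joins so Spark falls back to a
// shuffle-based join instead of waiting on a broadcast.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")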

Spark keeps relaunching executors after yarn kills them

I was testing with Spark in YARN cluster mode.
The Spark job runs in a lower-priority queue, and its containers are preempted when a higher-priority job comes in.
However, Spark relaunches the containers right after they are killed, and the higher-priority app kills them again.
So the apps are stuck in this deadlock.
Infinite retry of executors is discussed here.
Found below trace in logs.
2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
So it seems any retry count I set is not even considered.
Is there a flag to indicate that all failures in an executor should be counted, and that the job should fail when maxFailures is reached?
spark version 2.11
Spark distinguishes between code throwing an exception and external issues, i.e. code failures and container failures.
But Spark does not count preemption as a container failure.
See ApplicationMaster.scala: here Spark decides to quit if the container failure limit is hit.
It gets the number of failed executors from YarnAllocator.
YarnAllocator updates its count of failed containers in some cases, but not for preemptions; see case ContainerExitStatus.PREEMPTED in the same function.
We use Spark 2.0.2, where the code is slightly different, but the logic is the same.
The fix seems to be to update the failed containers collection for preemptions too.
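To make that concrete, here is a paraphrased sketch of the decision YarnAllocator makes for each completed container. This is not the actual Spark source; the method and counter names are illustrative:

import org.apache.hadoop.yarn.api.records.{ContainerExitStatus, ContainerStatus}

// Sketch only: illustrative counter, not Spark's real field name.
var numExecutorsFailed = 0

def onCompletedContainer(status: ContainerStatus): Unit =
  status.getExitStatus match {
    case ContainerExitStatus.SUCCESS =>
      () // normal completion: nothing to count
    case ContainerExitStatus.PREEMPTED =>
      () // preempted by YARN: logged but NOT counted as a failure,
         // which is why preempted executors keep getting relaunched
    case _ =>
      // genuine failures (non-zero exit code, exceeded memory limits, ...)
      // are counted; ApplicationMaster quits once this passes the limit
      numExecutorsFailed += 1
  }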

Spark job in Dataproc dynamic vs static allocation

I have a Dataproc cluster:
master - 6 cores | 32g
worker{0-7} - 6 cores | 32g
Maximum allocation: memory: 24576, vCores: 6
I have two Spark Streaming jobs to submit, one after another.
First, I tried to submit with the default configuration (spark.dynamicAllocation.enabled=true).
In 30% of cases, I saw that the first job grabbed almost all available memory and the second was queued, waiting for resources for ages. (This is a streaming job that takes only a small portion of resources every batch.)
My second try was to turn dynamic allocation off. I submitted the same two jobs with identical configurations:
spark.dynamicAllocation.enabled=false
spark.executor.memory=12g
spark.executor.cores=3
spark.executor.instances=6
spark.driver.memory=8g
Surprisingly, in the YARN UI I saw:
7 Running Containers with 84g Memory allocation for the first job.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job
In Spark UI there are 6 executors and driver for the first job and 2 executors and driver for the second job
After retrying (deleting the previous jobs and submitting the same jobs) without dynamic allocation and with the same configurations, I got a totally different result:
5 Running Containers with 59g Memory allocation for both jobs, and 71g Reserved Memory for the second job. In the Spark UI I see 4 executors and a driver in both cases.
I have a couple of questions:
1) If dynamicAllocation=false, why is the number of YARN containers different from the number of executors? (At first I thought the additional YARN container was the driver, but its memory allocation is different.)
2) If dynamicAllocation=false, why doesn't YARN create containers matching my exact request, i.e. 6 containers (Spark executors) for both jobs? And why do two attempts with the same configuration lead to different results?
3) If dynamicAllocation=true, how can a Spark job with low memory consumption take control of all the YARN resources?
Thanks
Spark and YARN scheduling are pretty confusing. I'm going to answer the questions in reverse order:
3) You should not be using dynamic allocation in Spark streaming jobs.
The issue is that Spark continuously asks YARN for more executors as long as there's a backlog of tasks to run. Once a Spark job gets an executor, it keeps it until the executor is idle for 1 minute (configurable, of course). In batch jobs, this is okay because there's generally a large, continuous backlog of tasks.
However, in streaming jobs, there's a spike of tasks at the start of every micro-batch, but executors are actually idle most of the time. So a streaming job will grab a lot of executors that it doesn't need.
To fix this, the old streaming API (DStreams) has its own version of dynamic allocation: https://issues.apache.org/jira/browse/SPARK-12133. This JIRA has more background on why Spark's batch dynamic allocation algorithm isn't a good fit for streaming.
However, Spark Structured Streaming (likely what you're using) does not support dynamic allocation: https://issues.apache.org/jira/browse/SPARK-24815.
tl;dr Spark requests executors based on its task backlog, not based on memory used.
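For reference, the executor-idle and backlog behaviour described above is controlled by a couple of dynamic allocation properties. A small sketch, assuming you set them on a SparkConf; the values shown are simply the documented defaults, not tuning advice:

import org.apache.spark.SparkConf

// Sketch only: the knobs behind the behaviour described above.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  // Request executors while tasks have been backlogged for this long.
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
  // Release an executor after it has been idle for this long
  // (the "1 minute" mentioned above).
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")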
1 & 2) #Vamshi T is right. Every YARN application has an "Application Master", which is responsible for requesting containers for the application. Each of your Spark jobs has an app master that proxies requests for containers from the driver.
Your configuration doesn't seem to match what you're seeing in YARN, so not sure what's going on there. You have 8 workers with 24g given to YARN. With 12g executors, you should have 2 executors per node, for a total of 16 "slots". An app master + 6 executors should be 7 containers per application, so both applications should fit within the 16 slots.
We configure the app master to have less memory; that's why the total memory for an application isn't a clean multiple of 12g.
If you want both applications to schedule all their executors concurrently, you should set spark.executor.instances=5.
Assuming you're using structured streaming, you could also just run both streaming jobs in the same Spark application (submitting them from different threads on the driver).
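A hedged sketch of that last suggestion: with Structured Streaming, both queries can be started from the same SparkSession (each start() call is non-blocking, so separate threads are not strictly required). The sources, sinks, and query names below are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("two-streams-one-app").getOrCreate()

// Placeholder pipelines: swap in your real sources and sinks.
val query1 = spark.readStream.format("rate").load()
  .writeStream.format("console").queryName("stream-1").start()

val query2 = spark.readStream.format("rate").load()
  .writeStream.format("console").queryName("stream-2").start()

// Keep the driver alive until either query terminates.
spark.streams.awaitAnyTermination()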
Useful references:
Running multiple jobs in one application: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
Dynamic allocation: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
Spark-on-YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
I have noticed similar behavior in my experience as well, and here is what I observed. Firstly, the resources YARN allocates depend on the resources available on the cluster when the job is submitted. When both jobs are submitted at almost the same time with the same config, YARN distributes the available resources equally between the jobs. When you throw dynamic allocation into the mix, things get a little more complex. In your case:
7 Running Containers with 84g Memory allocation for the first job.
--You got 7 containers because you requested 6 executors: one container for each executor, plus one extra container for the application master.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job
--Since the second job was submitted some time later, YARN allocated the remaining resources: 2 containers, one for each executor, plus the extra one for your application master.
The number of containers will never match the number of executors you requested; it will always be one more, because you need one container to run your application master.
Hope that answers part of your question.

Do successful tasks also get reprocessed on an executor crash?

I am seeing about 3018 failed tasks for the job because about 4 executors died.
The Executors summary (in the Spark UI) shows completely different statistics: out of the 3018, about 2994 completed properly. My questions are:
Will they be retried again?
Is there a config to override/limit this?
After monitoring the job and manually validating the attempt counts, even for successful tasks, I realised:
Will they be retried again?
- Yes, even the successful tasks are retried.
Is there a config to override/limit this?
- I did not find any config to override this behaviour.
If an executor (a Kubernetes pod) dies (for example due to an OOM or a timeout), all of its tasks, even the successfully completed ones, are re-executed. One of the main reasons is that the shuffle writes from that executor are lost along with the executor itself.

How does spark.dynamicAllocation.enabled influence the order of jobs?

I need to understand when to use spark.dynamicAllocation.enabled - what are the advantages and disadvantages of using it? I have a queue where jobs get submitted.
9:30 AM --> Job A gets submitted with dynamicAllocation enabled.
10:30 AM --> Job B gets submitted with dynamicAllocation enabled.
Note: my data is huge (processing will be done on 10GB of data with transformations).
Which job gets preference in the allocation of executors, Job A or Job B, and how does Spark coordinate between the 2 applications?
Dynamic Allocation of Executors is about resizing your pool of executors.
Quoting Dynamic Allocation:
spark.dynamicAllocation.enabled Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload.
And later on in Dynamic Resource Allocation:
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
In other words, job A will usually finish before job B will be executed. Spark jobs are usually executed sequentially, i.e. a job has to finish before another can start.
Usually...
SparkContext is thread-safe and can handle multiple jobs from a single Spark application. That means you may submit jobs at the same time or one after another and, with some configuration, expect these two jobs to run in parallel.
Quoting Scheduling Within an Application:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
It is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
Wrapping up...
Which job gets preference in the allocation of executors, Job A or Job B, and how does Spark coordinate between the 2 applications?
Job A.
Unless you have enabled Fair Scheduler Pools:
The fair scheduler also supports grouping jobs into pools, and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a “high-priority” pool for more important jobs, for example, or to group the jobs of each user together and give users equal shares regardless of how many concurrent jobs they have instead of giving jobs equal shares.
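For completeness, a minimal sketch of how fair sharing within one application is typically enabled; the pool name and the allocation-file path are assumptions, not required values:

import org.apache.spark.sql.SparkSession

// Switch the in-application scheduler from the default FIFO to FAIR.
val spark = SparkSession.builder()
  .config("spark.scheduler.mode", "FAIR")
  // Optional: pool definitions (weight, minShare) live in an XML file.
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()

// Jobs submitted from this thread go to the (hypothetical) "high-priority" pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "high-priority")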
