Is there a way to tell Spark to continue a job after a single task has failed?
Or even better:
Can we configure a job to fail only if a certain percent of the tasks fails?
My scenario is like this:
I'm using PySpark to do some parallel computations.
I have a job that is composed of thousands of tasks (which are more or less independent of each other - I can allow some of them to fail).
One task fails (throws an exception), and after a few retries of that task the entire job is aborted.
Is there a way to change this (weird) behavior?
No, there is no such feature in Spark.
There is an open JIRA ticket (SPARK-10781) for it, but I don't see any action on it.
You can do it in MapReduce using the configs mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent.
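On the Spark side, the closest knob I'm aware of is the per-task retry limit rather than a failure percentage. A minimal sketch (assuming a fresh session, since this must be set before the context is created): raising spark.task.maxFailures buys more retries for transient failures, but the job is still aborted once any single task exhausts them.

```python
from pyspark.sql import SparkSession

# spark.task.maxFailures (default 4) is the number of attempts per task before
# Spark gives up and aborts the job - it is a retry count, not a failure percentage.
spark = (
    SparkSession.builder
    .appName("tolerate-transient-failures")
    .config("spark.task.maxFailures", "8")
    .getOrCreate()
)
```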
Related
I have a Spark job where a small minority of the tasks keep failing, causing the whole job to then fail, and nothing gets outputted to the table where results are supposed to go. Is there a way to get Spark to tolerate a few failed tasks and still write the output from the successful ones? I don't actually need 100% of the data to get through, so I'm fine with a few tasks failing.
No, that is not possible; it is not part of Spark's design. "No" is also an answer.
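What you can do is move the fault tolerance into your own code, so that a bad record never fails the task in the first place. A minimal sketch of that idea (the process function and output path are hypothetical stand-ins for your own logic):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("per-record-tolerance").getOrCreate()
sc = spark.sparkContext

def process(record):
    # Stand-in for your real per-record computation; may raise for bad input.
    return 1.0 / record

def safe_process(partition):
    # Catch failures per record so the task itself never fails;
    # bad records are skipped instead of aborting the whole job.
    for record in partition:
        try:
            yield process(record)
        except Exception:
            pass  # or log the bad record somewhere for later inspection

rdd = sc.parallelize([1, 2, 0, 4])   # the 0 would normally fail the task
rdd.mapPartitions(safe_process).saveAsTextFile("/tmp/successful_results")
```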
I have jobs that repartition huge datasets in Parquet format, and the file system used is s3a (S3).
Browsing through the Spark UI, I stumbled upon a job which has uncompleted tasks but is marked as successful.
The different categories of jobs are: i) Active, ii) Completed, iii) Failed.
I am unable to deduce the reason for this, nor am I able to tell whether this was actually a failed job, given that there is a separate category for failed jobs.
How do I resolve this ambiguity?
I have one question regarding Spark execution.
We all know that each Spark application (or driver program) may contain one or many actions.
My question is: which of these is correct - does a collection of jobs correspond to one action, or does each job correspond to one action? Here "job" means the one that can be seen in the Spark execution UI.
I think the latter is true (each job corresponds to one action). Please validate.
Thanks.
Your understanding is correct.
Each action in Spark corresponds to a Spark job, and these actions are called by the driver program in the application.
An action can therefore involve many transformations on the dataset (or RDD), which create the stages in the job.
A stage can be thought of as the set of calculations (tasks) that can each be computed on an executor without communication with other executors or with the driver.
In other words, a new stage begins whenever network traffic between workers is required, for example in a shuffle. These dependencies that create stage boundaries are called ShuffleDependencies.
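To make the mapping concrete, here is a minimal PySpark sketch (illustrative only): the transformations alone create no job, the single action triggers exactly one job, and the shuffle introduced by reduceByKey splits that job into two stages.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jobs-and-stages").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Transformations are lazy - nothing shows up in the UI yet.
counts = rdd.reduceByKey(lambda x, y: x + y)   # ShuffleDependency -> stage boundary

# The action triggers exactly one job, made of two stages (map side + reduce side).
print(counts.collect())
```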
Do tasks in the same stage work simultaneously? If so, what does the line between partitions in a stage refer to? (example of a DAG)
Here is a good link for your reading that explains the DAG in detail, along with a few other things that may be of interest: the Databricks blog on the DAG.
I can try to explain. As each stage is created, it has a set of tasks that the work is divided into. When an action is encountered, the driver sends the tasks to the executors; based on how your data is partitioned, N tasks are invoked on the data in your distributed cluster. So the arrows that you are seeing are the execution plan - for example, it cannot apply the map function prior to reading the file. Each node that holds some of the data will execute its tasks in the order given by the DAG.
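As a small, hypothetical illustration of how the number of partitions drives the number of parallel tasks in a stage:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-and-tasks").getOrCreate()
sc = spark.sparkContext

# 8 partitions -> the stage of the next action runs as 8 tasks,
# executed in parallel up to the number of available executor cores.
rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.getNumPartitions())             # 8
print(rdd.map(lambda x: x * 2).count())   # one action -> one job, 8 tasks in its stage
```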
When I look at the Jobs tab in the Spark UI, I can see a task status like 20/20 (4 failed).
Does that mean there is data loss for the failed tasks? Aren't those failed tasks retried on a different executor?
While you should be wary of failing tasks (they are frequently an indicator of an underlying memory issue), you need not worry about data loss. The stages have been marked as successfully completed, so the tasks that failed were in fact (eventually) successfully processed.
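A hypothetical way to see this retry behaviour for yourself: the snippet below makes one task attempt fail deliberately, and because the retried attempt succeeds, the job still returns the complete result (the UI simply records the extra failed attempt). The local[4, 2] master is only there so retries are enabled when running locally; on a cluster, spark.task.maxFailures (default 4) governs the retries.

```python
from pyspark import TaskContext
from pyspark.sql import SparkSession

# local[4, 2] = 4 threads, allow up to 2 attempts per task (so a retry is possible locally)
spark = (
    SparkSession.builder
    .master("local[4, 2]")
    .appName("retried-task-demo")
    .getOrCreate()
)
sc = spark.sparkContext

def flaky(x):
    # Fail only on the very first attempt of partition 0; the retried attempt succeeds,
    # so the stage completes and no data is lost despite the failed task attempt.
    ctx = TaskContext.get()
    if ctx.partitionId() == 0 and ctx.attemptNumber() == 0:
        raise RuntimeError("simulated transient failure")
    return x * 2

print(sc.parallelize(range(8), 4).map(flaky).collect())   # full result, nothing missing
```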