What does "failed" mean in a completed Spark job? - apache-spark

I have jobs that repartition huge datasets in Parquet format; the file system used is s3a (S3).
Browsing through the Spark UI, I stumbled upon a job that has uncompleted tasks, yet the job is marked as successful.
The Spark UI shows different categories of jobs: i) Active, ii) Completed, iii) Failed.
I am unable to deduce the reason for this failure, nor am I able to tell whether the job actually failed, given that there is a separate category for failed jobs.
How do I resolve this ambiguity?

Related

Spark execution - Relationship between spark execution job and spark action

I have one question regarding Spark execution.
We all know that each Spark application (or driver program) may contain one or many actions.
My question is which of these is correct: does a collection of jobs correspond to one action, or does each job correspond to one action? Here "job" means the one that can be seen in the Spark execution UI.
I think the latter is true (each job corresponds to one action). Please validate.
Thanks.
Your understanding is correct.
Each action in Spark corresponds to a Spark job, and these actions are called by the driver program in the application.
An action can therefore involve many transformations on the Dataset (or RDD), which is what creates stages within the job.
A stage can be thought of as the set of calculations (tasks) that can each be computed on an executor without communication with other executors or with the driver.
In other words, a new stage begins whenever network traffic between workers is required, for example in a shuffle. The dependencies that create stage boundaries are called ShuffleDependencies.
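For concreteness, here is a minimal sketch (assuming a hypothetical local PySpark session) of the action-to-job and shuffle-to-stage mapping described above:

```python
from pyspark.sql import SparkSession

# Minimal sketch: each action launches one job, and the shuffle introduced by
# groupBy creates a stage boundary inside that job.
spark = SparkSession.builder.master("local[*]").appName("jobs-and-stages").getOrCreate()

df = spark.range(0, 1_000_000)                              # transformations only: no job yet
grouped = df.groupBy((df.id % 10).alias("bucket")).count()  # shuffle dependency -> stage boundary

grouped.count()    # action #1 -> job #1 in the Spark UI (stages split at the shuffle)
grouped.collect()  # action #2 -> job #2
```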

In Apache Spark, do Tasks in the same Stage work simultaneously or not?

Do tasks in the same stage work simultaneously? If so, what does the line between partitions in a stage refer to? (example of a DAG)
Here is a good link for your reading that explains the DAG in detail, along with a few other things that may be of interest: databricks blog on DAG.
I can try to explain. As each stage is created, it is divided into a set of tasks. When an action is encountered, the driver sends the tasks to the executors. Based on how your data is partitioned, N tasks are invoked on the data in your distributed cluster. So the arrows that you are seeing are the execution plan: for instance, it cannot run the map function prior to reading the file. Each node that holds some data will execute those tasks in the order given by the DAG.
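As a rough illustration (a sketch assuming a local 4-core master), the partition count of the data is exactly the number of tasks the stage runs, and those tasks run concurrently up to the number of available cores:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("tasks-per-partition").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), numSlices=8)   # 8 partitions
print(rdd.getNumPartitions())                   # 8 -> the stage will run 8 tasks

# The action below launches one job; its single stage has 8 tasks, executed
# in parallel up to 4 at a time on this local[4] master.
print(rdd.map(lambda x: x * 2).sum())
```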

Distribution of spark code into jobs, stages and tasks [duplicate]

This question already has answers here:
What is the concept of application, job, stage and task in spark?
As per my understanding, each action in the whole application is translated into a job, each shuffle boundary within a job is translated into a stage, and each partition of a stage's input is translated into a task.
Please correct me if I am wrong; I am unable to find any actual definition.
Invoking an action inside a Spark application triggers the launch of a Spark job to fulfill it. Spark examines the DAG and formulates an execution plan. The execution plan consists of assembling the job's transformations into stages.
When Spark optimises code internally, it splits it into stages, where each stage consists of many little tasks. Each stage contains a sequence of transformations that can be completed without shuffling the full data.
Every task for a given stage is a single-threaded atom of computation consisting of exactly the same code, just applied to a different set of data. The number of tasks is determined by the number of partitions.
To manage the job flow and schedule tasks, Spark relies on an active driver process.
The executor processes are responsible for executing this work, in the form of tasks, as well as for storing any data that the user chooses to cache.
A single executor has a number of slots for running tasks and will run many concurrently throughout its lifetime.
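A small hedged sketch (hypothetical local session) of how partitions map to tasks once a shuffle splits a job into stages:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("stages-and-tasks").getOrCreate()

df = spark.range(0, 1_000_000)       # transformations only; no job launched yet
repartitioned = df.repartition(200)  # shuffle -> stage boundary

# The count() action launches one job with two stages:
#   stage 1: one task per pre-shuffle partition of df
#   stage 2: 200 tasks, one per post-shuffle partition
print(repartitioned.count())
```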

Spark Streaming failed executor tasks

When I look at the Jobs tab on the Spark UI, I can see task status like 20/20 (4 failed).
Does it mean there is data loss for the failed tasks? Aren't those failed tasks moved to a different executor?
While you should be wary of failing tasks (they are frequently an indicator of an underlying memory issue), you need not worry about data loss. The stages have been marked as successfully completed, so the tasks that failed were in fact (eventually) successfully processed.
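For reference, a hedged sketch of the setting that governs this retry behavior, spark.task.maxFailures (on a cluster the default is 4 attempts per task):

```python
from pyspark.sql import SparkSession

# Sketch only: on a cluster, spark.task.maxFailures (default 4) is the number of
# attempts a single task gets, potentially on different executors, before the
# stage and its job are aborted. Attempts that fail but are later retried
# successfully show up in the UI as "20/20 (4 failed)" on a stage that still
# completed.
spark = (
    SparkSession.builder
    .appName("task-retry-demo")
    .config("spark.task.maxFailures", "8")   # allow more attempts per task
    .getOrCreate()
)
```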

spark - continue job processing after task failure

Is there a way to tell spark to continue a job after a single task failed?
Or even better:
Can we configure a job to fail only if a certain percent of the tasks fails?
My scenario is like this:
I'm using pyspark to do some parallel computations.
I have a job that is composed of thousands of tasks (which are more or less independent of each other; I can allow some to fail).
One task fails (throws an exception), and after a few retries of this task the entire job is aborted.
Is there a way to change this (weird) behavior?
No, there is no such feature in Spark.
There is an open JIRA ticket (SPARK-10781) for it, but I don't see any action there.
You can do it in MapReduce using the configs mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent.
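One common workaround (a sketch, not a built-in feature; risky_computation is a hypothetical stand-in for your per-record work) is to catch exceptions inside the task itself, so individual record failures never surface to the scheduler:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tolerate-partial-failures").getOrCreate()
sc = spark.sparkContext

def risky_computation(x):
    # Hypothetical per-record work that may raise for some inputs.
    if x % 1000 == 0:
        raise ValueError(f"bad record: {x}")
    return x * 2

def safe(x):
    # Swallow the exception inside the task so the scheduler never sees a failure;
    # bad records are simply dropped (or could be routed to a side output instead).
    try:
        return [risky_computation(x)]
    except Exception:
        return []

results = sc.parallelize(range(10_000)).flatMap(safe).collect()
print(len(results))   # failed records are missing from the output rather than failing the job
```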
