Are failed tasks resubmitted in Apache Spark?

Are failed tasks automatically resubmitted in Apache Spark to the same or another executor?

Yes, but there is a parameter that controls the maximum number of failures:
spark.task.maxFailures (default: 4) - Number of individual task failures before giving up on the job. Should be greater than or equal to 1. Number of allowed retries = this value - 1.

I believe failed tasks are resubmitted, because I have seen the same failed task submitted multiple times in the Web UI. However, if the same task fails multiple times, the whole job fails:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 120 in stage 91.0 failed 4 times, most recent failure: Lost task 120.3 in stage 91.0
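For reference, a minimal sketch of raising that limit when building a SparkSession (the app name and the value 8 are purely illustrative, not recommendations):
from pyspark.sql import SparkSession

# spark.task.maxFailures controls how many times a single task may fail
# before the whole job is aborted (allowed retries = value - 1).
spark = (
    SparkSession.builder
    .appName('task-retry-demo')             # hypothetical app name
    .config('spark.task.maxFailures', '8')  # default is 4
    .getOrCreate()
)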

Related

org.apache.spark.SparkException: Job aborted due to stage failure in pyspark

Sorry for the duplicate post. I'm creating another post because the earlier ones did not solve my problem.
I'm running ML regression on pyspark 3.0.1, on a cluster with 640 GB of memory and 32 worker nodes.
I have a data set with 33751 rows and 63 columns, and I'm trying to prepare it for ML regression, so I wrote the following code:
from pyspark.ml.feature import VectorAssembler, StandardScaler
input_col = [...]  # list of input feature column names
# Combine the input columns into a single feature vector.
vector_assembler = VectorAssembler(inputCols=input_col, outputCol='ss_feature')
temp_train = vector_assembler.transform(train)
# Standardize the assembled feature vector, fitting on the training data.
standard_scaler = StandardScaler(inputCol='ss_feature', outputCol='scaled')
train = standard_scaler.fit(temp_train).transform(temp_train)
But I'm getting an error message when the last line executes:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 169 in stage 57.0 failed 4
times, most recent failure: Lost task 169.3 in stage 57.0 (TID 5522, 10.8.64.22, executor 11):
org.apache.spark.SparkException: Failed to execute user defined
function(VectorAssembler$$Lambda$6296/1890764576:
Can you suggest how I can solve this issue?

Databricks Checksum error while writing to a file

I am running a job on 9 nodes.
All of them write some information to files, doing simple writes like the one below:
dfLogging.coalesce(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
However I am receiving this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling
o106.save. : java.util.concurrent.ExecutionException:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task
1.0 in stage 14.0 (TID 259, localhost, executor driver): org.apache.hadoop.fs.ChecksumException: Checksum error:
file:/dbfs/delta/Logging/_delta_log/00000000000000000063.json at 0
exp: 1179219224 got: -1020415797
It looks to me that, because of concurrency, Spark is somehow failing and generating checksum errors.
Is there any known scenario that may be causing it?
There are a couple of things going on here, and they should explain why coalesce may not work.
What coalesce does is essentially combine the partitions on each worker. For example, if you have three workers, you can perform coalesce(3), which would consolidate the partitions on each worker.
What repartition does is it shuffles the data to increase/decrease the number of total partitions. In your case, if you have more than one worker and if you need a single output, you would have to use repartition(1) since you want the data to be on a single partition before writing it out.
Why would coalesce not work?
Spark limits the shuffling during coalesce. So you cannot perform a full shuffle (across different workers) when you are using coalesce, whereas you can perform a full shuffle when you are using repartition, although it is an expensive operation.
Here is the code that would work:
dfLogging.repartition(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
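As a small illustrative sketch (reusing the dfLogging DataFrame from the question), the difference also shows up in the physical plan: coalesce appears as a narrow Coalesce step, while repartition introduces a shuffle Exchange:
# coalesce(1) narrows the existing partitions without a full shuffle.
dfLogging.coalesce(1).explain()      # physical plan contains Coalesce(1)
# repartition(1) performs a full shuffle into a single partition.
dfLogging.repartition(1).explain()   # physical plan contains an Exchange (shuffle)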

Spark, Why does dropping columns cause spark job to fail?

In Spark 2.0, I am running a pyspark job where I read from a table, add some columns whose logic is based on windowing over 30 days' worth of data, and then use df.createOrReplaceTempView followed by spark.sql(create table as select * from ...) to create a table in HDFS.
This job runs successfully and creates a table in HDFS. However, I don't need all of the columns I just created in my dataframe; I only need half of the new columns, so I added some logic to drop the ones I don't need (all of the columns to be dropped were recently created). When I run the drop, df = df.select([c for c in df.columns if c not in ('a','b','d','e')]), the Spark job now fails!
error: Job aborted due to stage failure: Task 139 in stage 1.0 failed 4 times, most recent failure: Lost task 139.3 in stage 1.0 (TID 405, myhost, executor 197): ExecutorLostFailure (executor 197 exited caused by one of the running tasks) Reason: Container marked as failed: container_111 on host: myhost. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
You can use .drop("colname") to drop columns from the dataframe.
df1=df.drop("a","b","c","d")
Hope it helps you.
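For reference, a minimal sketch (with a throwaway DataFrame and the same column names as in the question) showing that drop and the select-based filter end up with the same columns:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Throwaway DataFrame with columns a-e, purely for illustration.
df = spark.createDataFrame([(1, 2, 3, 4, 5)], ['a', 'b', 'c', 'd', 'e'])

kept_via_drop = df.drop('a', 'b', 'd', 'e')
kept_via_select = df.select([c for c in df.columns if c not in ('a', 'b', 'd', 'e')])

print(kept_via_drop.columns)    # ['c']
print(kept_via_select.columns)  # ['c']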

Why does treeAggregate always change the partition count to 70?

Background:
I have several billion rows on which I need to run logistic regression. I chose L-BFGS, and it succeeded on a small data set (one hundred million rows) but always fails on the several billion rows.
I read the log and found this error:
Job aborted due to stage failure: Task 54 in stage 37.0 failed 4 times, most recent failure:
Lost task 54.3 in stage 37.0 (TID 79670, 10.215.155.83):
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
Which is triggered by:
treeAggregate at StandardScaler.scala:55
And I found that the treeAggregate function seems to change the partition count to 70, although I set it to 5000. That explains the OOM, but I don't know why it happens, and I wonder how to change it back to 5000 to avoid the OOM (the VM's memory limit is 14 GB, which cannot be changed).

Severe straggler tasks due to Locality Level being "Any" and a Network Fetch on cached RDD

A cached dataset that has been completely read through - successfully - is being reprocessed. A small number of tasks (typically 2 of 204, about 1%) may fail on a subsequent pass over the same (still cached) dataset. We are on Spark 1.3.1.
The following screenshot shows that - of 204 tasks - the last two seem to have been 'forgotten' by the scheduler.
Is there any way to get more information about these tasks that are in limbo?
All of the other tasks completed within a reasonable fraction of similar time: in particular, the 75th percentile is still within 50% of the median. It is just these last two stragglers that are killing the entire job completion time. Notice also that these are not due to record-count skew.
Update The two stragglers did finally finish - at over 7 minutes (over 3x longer than any of the other 202 tasks)!
15/08/15 20:04:54 INFO TaskSetManager: Finished task 201.0 in stage 2.0 (TID 601) in 133583 ms on x125 (202/204)
15/08/15 20:09:53 INFO TaskSetManager: Finished task 189.0 in stage 2.0 (TID 610) in 423230 ms on i386 (203/204)
15/08/15 20:10:05 INFO TaskSetManager: Finished task 190.0 in stage 2.0 (TID 611) in 435459 ms on i386 (204/204)
15/08/15 20:10:05 INFO DAGScheduler: Stage 2 (countByKey at MikeFilters386.scala:76) finished in 599.028 s
Suggestions on what to look for /review appreciated.
Another update: the TYPE has turned out to be Network for those two. What does that mean?
I had a similar issue to yours. Try increasing spark.locality.wait.
If that works, the following might apply to you:
https://issues.apache.org/jira/browse/SPARK-13718#
** ADDED **
Some extra information that I found helpful.
Spark will always initially assign a task to the executor that contains the respective cached RDD partition.
If the task is not accepted within the locality timeouts defined in the Spark config, it will try NODE_LOCAL, RACK_LOCAL, and ANY, in that sequence.
Regardless of whether the cached data are available locally (HDFS replicas), Spark will always fetch the cached partition from the node that contains it. It will only recompute the partition if that executor crashed and the RDD is therefore no longer cached. In many cases this also causes a network bottleneck on the original straggler node.
Have you tried using Spark speculation (spark.speculation true)? Spark will identify these stragglers and relaunch them on another node.
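For reference, a minimal sketch of setting both configs mentioned above when building the context; the app name and the wait value are only illustrative (on this Spark version the wait is given in milliseconds, newer versions also accept suffixes like '10s'):
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName('straggler-demo')          # hypothetical app name
    .set('spark.locality.wait', '10000')   # wait longer before falling back to less-local levels
    .set('spark.speculation', 'true')      # re-launch suspected straggler tasks on other nodes
)
sc = SparkContext(conf=conf)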
