Failure Handling of transformations in Spark

I read all the data into a PySpark DataFrame from S3.
I apply a filter transformation on the DataFrame and then write the DataFrame back to S3.
Let's say the DataFrame had 10 partitions of 64 MB each.
Now say that for partitions 1, 2, and 3 the filter and write were successful and their data was written to S3.
Now let's say that for partition 4 the filter errors out.
What will happen after this? Will Spark proceed with all the remaining partitions and skip partition 4, or will the program terminate after writing only 3 partitions?

The relevant parameter for non-local mode of operation is spark.task.maxFailures.
Say you have 32 tasks and 4 executors, with 7 tasks finished, 4 running, and 21 waiting in that stage.
If one of the 4 running tasks fails more times than spark.task.maxFailures after being re-scheduled,
then the job will stop and no more stages will be executed.
The 3 other running tasks will complete, but that's it.
A multi-stage job must stop, as a new stage can only start when all tasks of the previous stage have completed.

Transformations are all-or-nothing operations. In your case above, Spark will crash with errors from partition 4.
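To make the retry behaviour above concrete, spark.task.maxFailures can be set when the session is created. A minimal PySpark sketch, where the paths, the filter column, and the output format are placeholders rather than anything from the question (4 is simply Spark's usual default for this setting):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("filter-and-write")                # illustrative app name
    .config("spark.task.maxFailures", "4")      # a task may be retried up to 4 times before its stage fails
    .getOrCreate()
)

df = spark.read.parquet("s3://bucket/input/")   # placeholder input path
filtered = df.filter(df["value"] > 0)           # hypothetical filter column

# If any task exhausts its retries, the job fails; output files already written
# by tasks that succeeded earlier may still be left in the target location.
filtered.write.mode("overwrite").parquet("s3://bucket/output/")   # placeholder output path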

Related

First stage of action in Spark run by only one executor

I have a Spark program running with YARN as master, in client mode, with 3 executors.
By reading data from Elasticsearch through a connector, I'm able to load it into a DataFrame.
That DataFrame is repartitioned into three partitions using df = df.repartition(3).
Whenever I try an action such as count() or show(), the first stage, which (from this thread: Why spark count action has executed in three stages) I understood to be about reading the data, has only one task and is run by a single executor.
Is this behavior expected for this stage? Shouldn't I be able to run this stage in parallel with all the executors allocated?
It depends on the replication of your data.
If your data is replicated across more data nodes, you potentially have more executors that are able to read from it.
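One way to see this is that the parallelism of the read stage is driven by the source's input splits, while df.repartition(3) only affects the stages after the shuffle it introduces. A small sketch, assuming the elasticsearch-hadoop connector and a hypothetical index name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-read-check").getOrCreate()

# Hypothetical Elasticsearch read; "my-index" is a placeholder resource.
df = spark.read.format("org.elasticsearch.spark.sql").load("my-index")

print(df.rdd.getNumPartitions())   # partitions of the read itself; this drives the first stage

df = df.repartition(3)             # introduces a shuffle; only later stages run with 3 tasks
print(df.rdd.getNumPartitions())   # 3 from here on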

Spark number of tasks vs number of partitions

I am running Spark on my local machine with an i5 quad-core processor, i.e. 4 cores, 8 threads. I ran a simple Spark job to understand the behaviour, but got confused about how the number of partitions and the number of tasks differ on the Spark UI.
Below are the operations I did (put together in the sketch after this list):
Read a CSV file into a Spark DataFrame with 2 partitions.
Checked the number of underlying partitions using df.rdd.getNumPartitions(), which gives 2.
Applied withColumn logic to add another column.
Ran df1.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().show() to get the size of each partition.
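Put together, those steps might look like the following sketch; the CSV path and the added column are placeholders, not taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id, lit

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", "true").csv("data.csv")   # placeholder path; read ends up with 2 partitions
print(df.rdd.getNumPartitions())                           # 2, as reported in the question

df1 = df.withColumn("extra", lit(1))                       # hypothetical added column

# Size of each partition, as in the last step above.
df1.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().show()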
Questions:
With the above operations, Spark creates 7 job IDs. Why 7? My understanding is that a job is created when an action is called, and I do not have 7 actions.
I have 2 partitions, so shouldn't there be 2 tasks running? Why do I see a different number of tasks on different stages?

Why is my shuffle partition count not 200 (the default) during a group by operation? (Spark 2.4.5)

I am new to Spark and trying to understand its internals. So,
I am reading a small 50 MB Parquet file from S3, performing a group by, and then saving back to S3.
When I observe the Spark UI, I can see 3 stages created for this:
stage 0: load (1 task)
stage 1: ShuffleQueryStage for grouping (12 tasks)
stage 2: save (CoalescedShuffleReader) (26 tasks)
Code Sample:
from pyspark.sql import functions as F

df = spark.read.format("parquet").load(src_loc)
df_agg = df.groupby(grp_attribute) \
    .agg(F.sum("no_of_launches").alias("no_of_launchesGroup"))
df_agg.write.mode("overwrite").parquet(target_loc)
I am using an EMR cluster with 1 master and 3 core nodes (each with 4 vCores), so the default parallelism is 12. I am not changing any config at runtime, but I am not able to understand why 26 tasks are created in the final stage. As I understand it, by default the shuffle partition count should be 200. A screenshot of the UI is attached.
I tried similar logic on Databricks with Spark 2.4.5.
With spark.conf.set('spark.sql.adaptive.enabled', 'true'), the final number of my partitions is 2.
With spark.conf.set('spark.sql.adaptive.enabled', 'false') and spark.conf.set('spark.sql.shuffle.partitions', 75), the final number of my partitions is 75.
Using print(df_agg.rdd.getNumPartitions()) reveals this.
So the job output on the Spark UI does not reflect this. Maybe a repartition occurs at the end. Interesting, but not really an issue.
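That experiment boils down to toggling two configs before running the aggregation and checking the resulting partition count. A sketch of the check, reusing the variable names from the code sample above:

# With adaptive execution on, the post-shuffle partition count gets coalesced.
spark.conf.set("spark.sql.adaptive.enabled", "true")
df_agg = df.groupby(grp_attribute).agg(F.sum("no_of_launches").alias("no_of_launchesGroup"))
print(df_agg.rdd.getNumPartitions())   # 2 was observed above

# With adaptive execution off, spark.sql.shuffle.partitions is used as-is.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "75")
df_agg = df.groupby(grp_attribute).agg(F.sum("no_of_launches").alias("no_of_launchesGroup"))
print(df_agg.rdd.getNumPartitions())   # 75 was observed above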
In Spark SQL, the number of shuffle partitions is set using spark.sql.shuffle.partitions, which defaults to 200. In most cases this number is too high for smaller data and too small for bigger data; selecting the right value is always tricky for the developer.
So we need the ability to coalesce the shuffle partitions by looking at the mapper output. If the map side generates a small number of partitions, we want to reduce the overall shuffle partition count to improve performance.
In the latest version, Spark 3.0 with Adaptive Query Execution, this reduction of tasks is automated.
http://blog.madhukaraphatak.com/spark-aqe-part-2/
Considering this, in Spark 2.4.5 the Catalyst optimizer or EMR might have enabled this feature internally to reduce the tasks, rather than creating 200 of them.
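For reference, in Spark 3.0 the coalescing described above is driven by a couple of AQE settings. A sketch of how they might be set; the config names are the Spark 3.0 ones and the 64 MB target size is just an illustrative value:

# Spark 3.0+: Adaptive Query Execution can coalesce small post-shuffle partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Advisory target size per coalesced partition (assumption: tune to your data).
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")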

Default shuffle partition value in Spark

The default shuffle partition value in Spark is 200 partitions. I would like to clarify: is this number per input partition, or is the number of output partitions going to be 200 across all input partitions?
I looked at several materials and was not able to find the answer I am looking for.
I am not exactly sure whether I understood your question, but I think I can give you the best example I found, in the book Spark: The Definitive Guide, for understanding the number of partitions and the corresponding tasks in each stage.
The job from the book breaks down into the following stages and tasks:
Stage 1 with 8 Tasks
Stage 2 with 8 Tasks
Stage 3 with 6 Tasks
Stage 4 with 5 Tasks
Stage 5 with 200 Tasks
Stage 6 with 1 Task
The first two stages correspond to the range that you perform in order to create your DataFrames. By default when you create a DataFrame with range, it has eight partitions.
The next step is the repartitioning. This changes the number of partitions by shuffling the data. These DataFrames are shuffled into six partitions and five partitions, corresponding to the number of tasks in stages 3 and 4.
Stages 3 and 4 perform on each of those DataFrames and the end of the stage represents the join (a shuffle). Suddenly, we have 200 tasks. This is because of a Spark SQL configuration. The spark.sql.shuffle.partitions default value is 200, which means that when there is a shuffle performed during execution, it outputs 200 shuffle partitions by default. You can change this value, and the number of output partitions will change.
The final stage aggregates those partitions individually, then brings them all into a single partition before sending the final result to the driver.
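The book's code isn't reproduced above, but a PySpark sketch of the kind of job that produces this breakdown, reconstructed from the description (two range DataFrames with eight default partitions, repartitions to six and five, a join, then a single aggregate), could look like:

df1 = spark.range(2, 10000000, 2)       # 8 partitions by default -> a stage with 8 tasks
df2 = spark.range(2, 10000000, 4)       # 8 partitions by default -> a stage with 8 tasks

rep1 = df1.repartition(6)               # shuffled into 6 partitions -> 6 tasks
rep2 = df2.repartition(5)               # shuffled into 5 partitions -> 5 tasks

joined = rep1.join(rep2, ["id"])        # shuffle join -> 200 tasks (spark.sql.shuffle.partitions)
result = joined.selectExpr("sum(id)")   # final aggregation collapses to a single partition

result.collect()                        # 1 task in the last stage, result sent to the driver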
Another note on spark.sql.shuffle.partitions from the Spark docs:
spark.sql.shuffle.partitions (default: 200): Configures the number of partitions to use when shuffling data **for joins or aggregations**.

Spark write to CSV fails even after 8 hours

I have a DataFrame with roughly 200-600 GB of data that I am reading, manipulating, and then writing to CSV using the Spark shell (Scala) on an Elastic MapReduce cluster.
Here's how I'm writing to CSV:
result.persist.coalesce(20000).write.option("delimiter",",").csv("s3://bucket-name/results")
The result variable is created through a mix of columns from some other dataframes:
var result=sources.join(destinations, Seq("source_d","destination_d")).select("source_i","destination_i")
Now, I am able to read the CSV data it is based on in roughly 22 minutes. In the same program, I'm also able to write another (smaller) DataFrame to CSV in 8 minutes. However, for this result DataFrame it takes 8+ hours and still fails ... saying one of the connections was closed.
I'm running this job on 13 c4.8xlarge instances on EC2, with 36 cores and 60 GB of RAM each, so I thought I'd have the capacity to write to CSV, especially after 8 hours.
Many stages required retries or had failed tasks, and I can't figure out what I'm doing wrong or why it's taking so long. I can see from the Spark UI that it never even got to the write-CSV stage and was busy with persist stages, but without the persist call it was still failing after 8 hours. Any ideas? Help is greatly appreciated!
Update:
I've run the following commands to repartition the result variable into 66K partitions:
val r2 = result.repartition(66000) // confirmed with getNumPartitions
r2.write.option("delimiter",",").csv("s3://s3-bucket/results")
However, even after several hours, the jobs are still failing. What am I doing wrong still?
Note: I'm running the Spark shell via spark-shell --master yarn --driver-memory 50G
Update 2:
I've tried running the write with a persist first:
import org.apache.spark.storage.StorageLevel  // needed in the shell before referencing StorageLevel
r2.persist(StorageLevel.MEMORY_AND_DISK)
But I had many stages fail, returning "Job aborted due to stage failure: ShuffleMapStage 10 (persist at <console>:36) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 3", or saying "Connection from ip-172-31-48-180.ec2.internal/172.31.48.180:7337 closed".
[Screenshots: Executors page; Spark web UI pages for a node returning a shuffle error and a node returning an EC2 connection-closed error; overall Job Summary page]
"I can see from the Spark UI that it never even got to the write CSV stage and was busy with persist stages, but without the persist function it was still failing after 8 hours. Any ideas?"
That is a FetchFailedException, i.e. Spark failed to fetch a shuffle block.
Since you are able to deal with the small files and only the huge data fails, I strongly suspect there are not enough partitions.
The first thing to do is verify/print sources.rdd.getNumPartitions(), destinations.rdd.getNumPartitions(), and result.rdd.getNumPartitions().
You need to repartition after the data is loaded in order to distribute the data (via a shuffle) to the other nodes in the cluster. This will give you the parallelism you need for faster processing without failures.
Furthermore, verify the other configurations that were applied:
print all of the config values like this, and adjust them to the right values as needed.
sc.getConf.getAll
Also have a look at
SPARK-5928
Spark-TaskRunner-FetchFailedException possible reasons: OOM or container memory limits
Repartition both sources and destinations before joining, with the number of partitions chosen so that each partition would be roughly 10 MB - 128 MB (try to tune this); there is no need to make it 20000 (IMHO too many).
Then join on those two columns and write, without repartitioning again (i.e. the output partitions should be the same as the repartitioning done before the join).
If you still have trouble, try the same thing after converting both DataFrames to RDDs (there are some differences between the APIs, especially regarding repartitioning, key-value RDDs, etc.).
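A sketch of that advice, written in PySpark even though the question uses the Scala shell; the partition count of 2000 is a hypothetical starting point to tune, the input paths are placeholders, and the column names are taken from the question:

# `sources` and `destinations` stand in for the DataFrames from the question; placeholder paths.
sources = spark.read.option("header", "true").csv("s3://bucket-name/sources/")
destinations = spark.read.option("header", "true").csv("s3://bucket-name/destinations/")

n = 2000   # hypothetical: aim for roughly 10 MB - 128 MB per partition and tune from there

# Repartition both sides by the join keys before joining.
sources_p = sources.repartition(n, "source_d", "destination_d")
destinations_p = destinations.repartition(n, "source_d", "destination_d")

result = (
    sources_p.join(destinations_p, ["source_d", "destination_d"])
             .select("source_i", "destination_i")
)

# Write without repartitioning again, so the output keeps the partitioning chosen above.
result.write.option("delimiter", ",").csv("s3://bucket-name/results")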
