Default shuffle partition value in spark - apache-spark

The default shuffle partition value in Spark is 200 partitions. I would like to clarify: is this number per input partition, or is the total number of output partitions across all input partitions going to be 200?
I have looked at several materials and was not able to find the answer I am looking for.

I am not entirely sure I understood your question, but I think the best way to explain it is with an example from the book Spark: The Definitive Guide, which walks through the number of partitions and the corresponding tasks in each stage.
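The job in question looks roughly like the following (a hedged sketch, not the book's exact code; the range sizes and repartition counts are assumptions chosen only so the stage breakdown below lines up):

df1 = spark.range(2, 10000000, 2)    # range => 8 partitions by default on an 8-core machine
df2 = spark.range(2, 10000000, 4)    # range => 8 partitions by default on an 8-core machine
step1 = df1.repartition(6)           # shuffle into 6 partitions
step2 = df2.repartition(5)           # shuffle into 5 partitions
step3 = step1.join(step2, ["id"])    # join => spark.sql.shuffle.partitions (200) tasks
step4 = step3.selectExpr("sum(id)")  # final aggregation down to a single partition
step4.collect()                      # result sent to the driver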
For this job, the breakdown into stages and tasks is as follows:
Stage 1 with 8 Tasks
Stage 2 with 8 Tasks
Stage 3 with 6 Tasks
Stage 4 with 5 Tasks
Stage 5 with 200 Tasks
Stage 6 with 1 Task
The first two stages correspond to the range that you perform in order to create your DataFrames. By default when you create a DataFrame with range, it has eight partitions.
The next step is the repartitioning. This changes the number of partitions by shuffling the data. These DataFrames are shuffled into six partitions and five partitions, corresponding to the number of tasks in stages 3 and 4.
Stages 3 and 4 operate on each of those DataFrames, and the end of those stages represents the join (a shuffle). Suddenly, we have 200 tasks. This is because of a Spark SQL configuration: spark.sql.shuffle.partitions defaults to 200, which means that when a shuffle is performed during execution, it outputs 200 shuffle partitions by default. You can change this value, and the number of output partitions will change.
The final stage aggregates those partitions individually and then brings them all down to a single partition before sending the result to the driver.
Another note on spark.sql.shuffle.partitions, from the Spark docs:
spark.sql.shuffle.partitions (default: 200): Configures the number of partitions to use when shuffling data for joins or aggregations.
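For example, the value can be overridden at session level (the value 50 below is just illustrative):

spark.conf.set("spark.sql.shuffle.partitions", "50")
# Any subsequent wide transformation (join, groupBy, ...) now produces 50 shuffle partitions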

Related

spark shuffle partitions with coalesce

Let's say I have a dataset with 20 partitions when I read some data. Then I do an aggregate operation on that dataset, which would make the number of partitions 200 (because of the default shuffle partition setting). Now, without calling any action on that dataset so far, I apply coalesce on it with 30 partitions, and then call some Spark action on it.
So my question is: how many partitions will be in play while the aggregate operation on that dataset is performed? Will it be 30 partitions (because that was the number given to coalesce), or 200 shuffle partitions?
Editing to provide more clarification on my question:
I understand that the coalesce operation in itself will not shuffle unless we drastically change the number of partitions. I also understand that the final dataset will have numPartitions partitions, but my question is: if I change the number of partitions before calling any action on that dataframe, will the resulting action operate on the final number of partitions we gave (in my case 30), or will it also honor the intermediate partition count implied by the aggregate operation? In short, I mainly want to know whether the aggregation will be done with 200 partitions and then coalesce applied, or whether the aggregation will also be performed with only 30 (in my case) partitions.
Yes, your final action will operate on the partitions produced by coalesce, which in your case is 30.
As we know, there are two types of transformations: narrow and wide. Narrow transformations do not shuffle data between nodes; wide transformations shuffle data between nodes and generate a new set of partitions.
coalesce (when reducing the partition count) is a narrow transformation, as the documentation quoted below confirms: it does not trigger a shuffle or a new stage, it simply merges existing partitions. So the 200 shuffle partitions produced by the aggregation are collapsed into 30, and the stage that reads the aggregation's shuffle output runs with 30 tasks.
So yes, your action is going to work on 30 partitions.
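A minimal pyspark sketch of this (the data and column names are made up; the point is the partition count after the aggregation plus coalesce):

from pyspark.sql.functions import col

# Illustrative only: aggregation (wide, shuffles) followed by coalesce (narrow, no extra shuffle)
df = spark.range(0, 1000000).withColumn("key", col("id") % 100)
agg = df.groupBy("key").count()       # shuffle; 200 partitions under the default setting
out = agg.coalesce(30)                # merges those shuffle partitions into 30
print(out.rdd.getNumPartitions())     # 30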
https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/
Coalesce
Returns a new SparkDataFrame that has exactly numPartitions partitions. This operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, it will stay at the current number of partitions.
However, if you're doing a drastic coalesce on a SparkDataFrame, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
https://spark.apache.org/docs/2.2.1/api/R/coalesce.html
Coalesce: combines data into a smaller number of existing partitions, avoiding a full shuffle.
https://medium.com/@mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4#.36o8a7b5j

Failure Handling of transformations in Spark

I read all the data into a pyspark dataframe from s3.
I apply the filter transform on the dataframe. And then write the dataframe to S3.
Let's say the dataframe had 10 partitions of 64 MB each.
Now say for partitions 1, 2, and 3 the filter and write were successful and their data was written to S3.
Now let's say for partition 4 the filter errors out.
What will happen after this? Will Spark proceed with all the remaining partitions and skip partition 4, or will the program terminate after writing only 3 partitions?
The relevant parameter for non-local mode of operation is spark.task.maxFailures.
Say you have 32 tasks and 4 executors in a stage, 7 tasks have finished, 4 are running, and 21 are waiting. If one of the 4 running tasks fails more times than spark.task.maxFailures (after being re-scheduled each time), the job will stop and no more stages will be executed. The other 3 running tasks will complete, but that's it. A multi-stage job must stop, since a new stage can only start once all tasks of the previous stage have completed.
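For reference, a hedged sketch of setting this parameter when building the session (the value 8 is illustrative; the default is 4, and plain local mode does not retry failed tasks unless you use the local[N, maxFailures] master syntax):

from pyspark.sql import SparkSession

# Illustrative only: raise the per-task retry limit before a stage (and the job) is failed
spark = (SparkSession.builder
         .config("spark.task.maxFailures", "8")
         .getOrCreate())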
Transformations are all-or-nothing operations. In your case above, Spark will fail the job with the error from partition 4.

Apache Spark: How 200 Reducer Tasks Can Aggregate 20000+ Mapper Output?

Updated Question
What I am not clear about:
In the ShuffleMapStage, each mapper will create a .data and an .index file.
These data/index files will have a name like
shuffle_X_Y_Z
where
X = shuffle_id
Y = map_id
Z = reducer_id
I understand map_id can range from 1 to 22394.
But what about reducer_id?
Is it 1 to 200 (i.e. the default shuffle partitions for the ResultStage)?
Is it equal to the number of executors?
If it is 1 to 200, then how do these 200 tasks know which data/index files to read?
Help me understand that.
I am at a loss understanding how reduce/aggregation tasks work.
Say I have a simple example like:
input_df = spark.read.parquet("Big_folder_having parquets")
# While reading, Spark creates partitions based on the number of files and the number of 128MB blocks per file.
# Now I do a simple aggregation/count
input_df.createOrReplaceTempView("table1")
grouped_df = spark.sql("select key1, key2, count(1) as user_count from table1 group by 1,2")
# And simply write it with the default 200 shuffle partitions
grouped_df.write.format("parquet").mode("overwrite").save(my_save_path)
So for the input load, the parent RDD / input map stage has 22394 partitions.
As I understand it, each mapper will create a shuffle data and index file.
Now the next stage has only 200 tasks (the default shuffle partitions).
How can these 200 reducers/tasks process the output from 22394 mapper tasks?
(DAG screenshot attached in the original question.)
You have a cluster with 40 cores.
What happens is:
You ask Spark to read the files in the directory; it will do this 40 tasks at a time (since that is the number of cores you have), and the result will be an RDD with 22,394 partitions. (Be careful about shuffle spill; check the stage details.)
Then you ask Spark to group your data by some keys and then write it out.
Since the default shuffle partitions is 200, Spark will "move" the data from 22,394 partitions into 200 partitions and process 40 tasks/partitions at a time.
In other words...
When you request to group and save Spark will create plans (I recommend you investigate physical and logical plans) and it will say... "In order to do what the user is asking me to, I will create 200 tasks that will be executed against the data"
Then the executors will execute 40 tasks at a time.
There aren't mappers or reducers per se.
There are tasks that Spark will create and there are executors which will execute those tasks.
Edit:
Forgot to mention: the number of partitions in the final RDD/DataFrame determines the number of output files.
If you have 10 buckets with 10 apples each or 1 bucket with 100 apples, it's the same total number of apples. Asking how Spark can handle it is like asking how you can carry 10 buckets versus 1 bucket: it will either manage or it won't, depending on the amount of data you have. The issue you can run into is data being spilled to disk, because with only 200 partitions each partition has to handle more data, which may not necessarily fit into memory.
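A small hedged pyspark illustration of the same point (paths and column names are placeholders):

# Illustrative only: partition counts before and after a grouped aggregation
input_df = spark.read.parquet("/some/big/folder")        # placeholder path
print(input_df.rdd.getNumPartitions())                   # e.g. 22394, driven by file/block count

grouped_df = input_df.groupBy("key1", "key2").count()    # wide transformation: a shuffle
grouped_df.write.mode("overwrite").parquet("/some/output")   # placeholder path
# The write stage runs spark.sql.shuffle.partitions tasks (200 by default), so it
# produces up to 200 output files; tune that setting if tasks spill to disk.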

Is it possible to do repartition after using partitionBy in a spark DF?

I am asking this question because if I specify repartition as 5, then all my data (>200 GB) is moved to 5 different executors and 98% of the resources are unused. And then the partitionBy happens, which again creates a lot of shuffle. Is there a way for the partitionBy to happen first and then the repartition to run on the data?
Although the question is not entirely easy to follow, the following aligns with the other answer, and this approach should avoid the issues mentioned around unnecessary shuffling:
val n = ??? // some calculation for the number of partitions, based on cluster config and volume of data to process
df.repartition(n, $"field_1", $"field_2", ...)
  .sortWithinPartitions("field_x", "field_y")
  .write
  .partitionBy("field_1", "field_2", ...)
  .format("parquet")        // or another supported format
  .save("/output/location") // target path
whereby [field_1, field_2, ...] are the same set of fields for repartition and partitionBy.
You can use repartition(5, col("$colName")).
Thus, when you then call partitionBy("$colName"), you skip the shuffle for '$colName' since the data has already been repartitioned by it.
Also consider having roughly as many partitions as the number of executors multiplied by the number of cores in use, multiplied by 3 (this factor may vary between 2 and 4).
As we know, Spark can only run 1 concurrent task for every partition of an RDD. Assuming you have 8 cores per executor and 5 executors:
You need to have: 8 * 5 * 3 = 120 partitions
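A hedged pyspark sketch of deriving that number at runtime (using defaultParallelism and the factor 3 are both assumptions, not a fixed rule):

# Illustrative only: derive a partition count from the cores available to the application
cores = spark.sparkContext.defaultParallelism   # roughly executors * cores per executor
n = cores * 3                                   # rule-of-thumb factor, commonly 2-4
df = df.repartition(n, "field_1", "field_2")    # then write with partitionBy as above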

Why join in spark in local mode is so slow?

I am using Spark in local mode and a simple join is taking too long. I have fetched two dataframes, A (8 columns, 2.3 million rows) and B (8 columns, 1.2 million rows), joined them using A.join(B, condition, 'left'), and called an action at the end. It creates a single job with three stages: two for extracting the dataframes and one for the join. Surprisingly, the stage extracting dataframe A takes around 8 minutes, the one for dataframe B takes 1 minute, and the join happens within seconds. My important configuration settings are:
spark.master local[*]
spark.driver.cores 8
spark.executor.memory 30g
spark.driver.memory 30g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions 16
The only executor is the driver itself. While extracting the dataframes, I partitioned them into 32 parts (I also tried 16, 64, 50, 100, and 200). I have seen the shuffle write size reach 100 MB for the stage extracting dataframe A. So, to avoid the shuffle, I created 16 initial partitions for both dataframes and broadcast dataframe B (the smaller one), but it is not helping; there is still shuffle write. I used the broadcast(B) syntax for this. Am I doing something wrong? Why is shuffling still there? Also, when I look at the event timelines, only four cores are processing at any point in time, although I have a 2-core * 4 processor machine. Why is that so?
In short, "Join"<=>Shuffling, the big question here is how uniformly are your data distributed over partitions (see for example https://0x0fff.com/spark-architecture-shuffle/ , https://www.slideshare.net/SparkSummit/handling-data-skew-adaptively-in-spark-using-dynamic-repartitioning and just Google the problem).
Few possibilities to improve efficiency:
think more about your data (A and B) and partition data wisely;
analyze, are your data skewed?;
go into UI and look at the tasks timing;
choose such keys for partitions that during "join" only few partitions from dataset A shuffle with few partitions of B;
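Regarding the broadcast attempt mentioned in the question, a hedged pyspark sketch (dataframe and key names are placeholders); when B is small enough this removes the shuffle exchange for the join itself, although reading and any explicit repartitioning of A will still show shuffle writes:

from pyspark.sql.functions import broadcast

# Hint Spark to broadcast the smaller dataframe so the join avoids a shuffle exchange
joined = A.join(broadcast(B), on="some_key", how="left")   # placeholders for the real condition
joined.explain()   # look for BroadcastHashJoin instead of SortMergeJoin in the plan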
