Apache Spark: How 200 Reducer Tasks Can Aggregate 20000+ Mapper Output?

Updated Question
What I am not clear about:
In a ShuffleMapStage each mapper creates a .data and an .index file.
These data/index files have names like
shuffle_X_Y_Z
where
X = shuffle_id
Y = map_id
Z = REDUCER_ID
I understand map_id can range from 1 to 22394,
but what about REDUCER_ID?
Is it 1-200 (e.g. the default partition count for the ResultStage)?
Is it equal to the number of executors?
If it is 1-200, how do these 200 tasks know which data/index file to read?
Help me to understand that.
I am at a loss in understanding how reduce/aggregation tasks work.
Say I have a simple example like
input_df = spark.read.parquet("Big_folder_having parquets")
# Spark loads the data; the number of read partitions is roughly number of files * number of 128MB blocks.
# Now I do a simple aggregation/count
input_df.createOrReplaceTempView("table1")
grouped_df = spark.sql("select key1, key2, count(1) as user_count from table1 group by 1,2")
# And simply write it with default 200 parallelism
grouped_df.write.format("parquet").mode("overwrite").save(my_save_path)
So for the input load, the parent RDD/input map stage has 22394 partitions.
As I understand it, each mapper will create a shuffle data file and an index file.
The next stage has only 200 tasks (the default number of shuffle partitions).
How can these 200 reducers/tasks process the output from 22394 mapper tasks?
Attached DAG Screenshot

You have a cluster with 40 cores.
What happens is:
You ask Spark to read the files in the directory. It will do so 40 tasks at a time (since that is the number of cores you have), and the result will be an RDD with 22,394 partitions. (Be careful about shuffle spill. Check the stage details.)
Then you ask Spark to group your data by some keys and then write it out.
Since the default shuffle partitions is 200, Spark will "move" the data from 22,394 partitions into 200 partitions and process 40 tasks/partitions at a time.
In other words...
When you request to group and save Spark will create plans (I recommend you investigate physical and logical plans) and it will say... "In order to do what the user is asking me to, I will create 200 tasks that will be executed against the data"
Then the executors will execute 40 tasks at a time.
There aren't mappers or reducers per se.
There are tasks that Spark will create and there are executors which will execute those tasks.
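To make the "move the data from 22,394 partitions into 200 partitions" step concrete, here is a minimal pure-Python sketch (no Spark involved) of how a sort-based shuffle is commonly described: each map task writes its records grouped by target reduce partition, plus an index of offsets, and each reduce task reads only its own slice from every map output. The function names, the CRC-based hash and the tiny sizes (10 map tasks, 4 reducers) are assumptions made for illustration, not Spark's actual code.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 4  # stands in for spark.sql.shuffle.partitions (200 in the question)

def reducer_for(key, num_reducers=NUM_REDUCERS):
    # Deterministic stand-in for a hash partitioner (hypothetical helper).
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers

def write_map_output(records, num_reducers=NUM_REDUCERS):
    """Simulate one map task: return (data, index).

    data  : all records, grouped by the reducer that must receive them
    index : offsets, so that data[index[r]:index[r + 1]] is reducer r's slice
    """
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        buckets[reducer_for(key, num_reducers)].append((key, value))
    data, index = [], [0]
    for bucket in buckets:
        data.extend(bucket)
        index.append(len(data))
    return data, index

def run_reducer(reducer_id, map_outputs):
    """Simulate one reduce task: read its slice from EVERY map output and count."""
    counts = defaultdict(int)
    for data, index in map_outputs:
        for key, value in data[index[reducer_id]:index[reducer_id + 1]]:
            counts[key] += value
    return dict(counts)

# 10 "map tasks", each holding 50 (key, 1) records with (key1, key2) style keys.
map_inputs = [
    [((f"k{i % 7}", f"j{i % 3}"), 1) for i in range(m, m + 50)]
    for m in range(10)
]
map_outputs = [write_map_output(records) for records in map_inputs]

# 4 "reduce tasks", each aggregating across all 10 map outputs.
per_reducer = [run_reducer(r, map_outputs) for r in range(NUM_REDUCERS)]
results = {}
for partial in per_reducer:
    results.update(partial)
```

The point of the sketch is that the number of reduce tasks is independent of the number of map tasks: every reducer touches every map output, but only the slice that belongs to it, and every key ends up in exactly one reducer.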
Edit:
Forgot to mention, the number of the partitions in the RDD will determine the number of output files.

If you have 10 buckets with 10 apples or 1 bucket with 100 apples, it's all the same total apples.
Asking how it can handle it is similar to asking how can you carry 10 buckets or carry 1 bucket.
It will either do it or it won't, depending on the amount of data you have. The issue you can run into is data being spilled to disk: with 200 partitions, each partition has to handle more data, which may not fit into memory.
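The arithmetic behind both answers can be sketched in a few lines. The task counts (22,394 and 200) and the 40 cores come from the question and the first answer; the 100 GB total shuffle size is a made-up figure used only to show how the per-partition load changes with the partition count.

```python
import math

def waves(num_tasks, cores):
    # Number of rounds needed when `cores` tasks run at the same time.
    return math.ceil(num_tasks / cores)

def avg_partition_mb(total_shuffle_mb, num_partitions):
    # Average amount of shuffle data one reduce task has to handle.
    return total_shuffle_mb / num_partitions

map_waves = waves(22394, 40)    # 560 rounds of 40 read tasks
reduce_waves = waves(200, 40)   # 5 rounds of 40 aggregation tasks

HYPOTHETICAL_SHUFFLE_MB = 100 * 1024  # assumption: 100 GB of shuffle data
for partitions in (200, 2000):
    size = avg_partition_mb(HYPOTHETICAL_SHUFFLE_MB, partitions)
    print(f"{partitions} partitions -> about {size:.1f} MB per task")
```

If the per-task figure is larger than what an executor can hold in memory for one task, spill to disk becomes likely, which is the case for raising spark.sql.shuffle.partitions.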

Related

Minimize shuffle spill and shuffle write

I have following pipeline in HDFS which I am processing in spark
input table : batch, team, user, metric1, metric2
This table has user-level metrics in hourly batches. A user can have multiple entries in the same hour.
level 1 aggregation : gets the latest entry per user per batch
agg(metric1) as user_metric1, agg(metric2) as user_metric2 (group by batch, team, user)
level 2 aggregation : get team level metrics
agg(user_metric1) as team_metric1, agg(user_metric2) as team_metric2 (group by batch, team)
The input table is 8 GB (snappy parquet format) in HDFS. My Spark job shows a shuffle write of 40 GB and at least 1 GB of shuffle spill per executor.
To minimize this, if I repartition the input table at user level before performing the aggregation,
df = df.repartition('user')
would it improve performance? How should I approach this problem if I want to reduce shuffle?
While running with following resources
spark.executor.cores=6
spark.cores.max=48
spark.sql.shuffle.partitions=200
Spark shuffles data from one node to another because the data is distributed across the cluster. This can slow the computation down and generate heavy network traffic. In your case the shuffles come from the group by. If you repartition on the three group-by columns, it should reduce the number of shuffles. As for the configuration, spark.sql.shuffle.partitions defaults to 200; let's leave it as it is. The repartition itself takes some time, but once it has finished the aggregation will be faster:
new_df = df.repartition("batch","team", "user")

Why my shuffle partition is not 200(default) during group by operation? (Spark 2.4.5)

I am new to spark and trying to understand the internals of it. So,
I am reading a small 50MB parquet file from s3 and performing a group by and then saving back to s3.
When I observe the Spark UI, I can see 3 stages created for this,
stage 0 : load (1 tasks)
stage 1 : shufflequerystage for grouping (12 tasks)
stage 2: save (coalescedshufflereader) (26 tasks)
Code Sample:
from pyspark.sql import functions as F
df = spark.read.format("parquet").load(src_loc)
df_agg = df.groupby(grp_attribute)\
.agg(F.sum("no_of_launches").alias("no_of_launchesGroup"))
df_agg.write.mode("overwrite").parquet(target_loc)
I am using EMR instance with 1 master, 3 core nodes(each with 4vcores). So, default parallelism is 12. I am not changing any config in runtime. But I am not able to understand why 26 tasks are created in the final stage? As I understand by default the shuffle partition should be 200. Screenshot of the UI attached.
I tried a similar logic on Databricks with Spark 2.4.5.
I observe that with spark.conf.set('spark.sql.adaptive.enabled', 'true'), the final number of my partitions is 2.
I observe that with spark.conf.set('spark.sql.adaptive.enabled', 'false') and spark.conf.set('spark.sql.shuffle.partitions', 75), the final number of my partitions is 75.
Using print(df_agg.rdd.getNumPartitions()) reveals this.
So, the job output on the Spark UI does not reflect this. Maybe a repartition occurs at the end. Interesting, but not really an issue.
In Spark SQL, the number of shuffle partitions is set with spark.sql.shuffle.partitions, which defaults to 200. In most cases this number is too high for small data and too low for large data, so choosing the right value is always tricky for the developer.
What we need is the ability to coalesce the shuffle partitions by looking at the mapper output. If the map output is small, we want to reduce the overall number of shuffle partitions to improve performance.
In the latest version, Spark 3.0, Adaptive Query Execution automates this reduction in the number of tasks.
http://blog.madhukaraphatak.com/spark-aqe-part-2/
Given that, it is possible that in Spark 2.4.5 as well the Catalyst optimizer or EMR has enabled this feature internally, reducing the number of tasks instead of running 200.
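The idea of coalescing shuffle partitions by looking at the mapper output can be sketched as follows. This is a simplified greedy illustration and not Spark's actual algorithm: the function name, the 30 MB target and the partition sizes are all assumptions. Adjacent shuffle partitions are merged until adding the next one would exceed the target size, so many tiny partitions become a few reasonably sized tasks.

```python
def coalesce_partitions(partition_sizes, target_size):
    """Greedily merge adjacent shuffle partitions (hypothetical helper).

    partition_sizes : total map output size per shuffle partition
    target_size     : desired upper bound for a merged partition
    Returns a list of groups; each group is a list of original partition ids.
    """
    groups, current, current_size = [], [], 0
    for pid, size in enumerate(partition_sizes):
        # Close the current group if this partition would push it over the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(pid)
        current_size += size
    if current:
        groups.append(current)
    return groups

sizes_mb = [10, 10, 10, 50, 5, 5]          # made-up sizes for 6 shuffle partitions
groups = coalesce_partitions(sizes_mb, target_size=30)
print(groups)  # [[0, 1, 2], [3], [4, 5]] -> 3 tasks instead of 6
```

A partition that is already larger than the target (the 50 MB one here) is left on its own; this sketch only merges and never splits.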

How are partitions assigned to tasks in Spark

Let's say I'm reading 100 files from an S3 folder. Each file is of size 10 MB. When I execute df = spark.read.parquet(s3 path), how do the files (or rather partitions) get distributed across tasks? E.g. in this case df is going to have 100 partitions, and if spark has 10 tasks running for reading contents of this folder into the data frame, how the partitions are getting assigned to the 10 tasks? Is it in a round-robin fashion, or each task gets equal proportions of all partitions in a range based distribution, or something else? Any pointer to relevant resources would also be very helpful. Thank you.
The number of tasks follows directly from the number of partitions: each stage runs one task per partition.
Spark tries to partition the rows directly from original partitions without bringing anything to the driver.
The partition logic is to start with a randomly picked target partition and then assign partitions to the rows in a round-robin method. Note that "start" partition is picked for each source partition and there could be collisions.
The final distribution depends on many factors: a number of source/target partitions and the number of rows in your dataframe.
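The round-robin logic described above (which, as far as I can tell, applies to redistributing rows into a fixed number of target partitions) can be sketched in plain Python. The function name, the seeded random generator and the row counts are assumptions for illustration only.

```python
import random
from collections import Counter

def round_robin_targets(num_rows, num_targets, rng):
    """Target partition for each row of ONE source partition (hypothetical helper).

    A start partition is picked at random, then rows are dealt out one by one.
    """
    start = rng.randrange(num_targets)
    return [(start + i) % num_targets for i in range(num_rows)]

rng = random.Random(42)             # seeded only to make the run repeatable
source_partition_rows = [7, 3, 10]  # made-up row counts of 3 source partitions
NUM_TARGETS = 4

target_counts = Counter()
for rows in source_partition_rows:
    # Each source partition picks its own start, so starts can collide.
    target_counts.update(round_robin_targets(rows, NUM_TARGETS, rng))

for target in range(NUM_TARGETS):
    print(f"target partition {target}: {target_counts[target]} rows")
```

Within a single source partition the targets differ by at most one row. Across source partitions the totals can drift further apart when several of them happen to start at the same target, which is the collision the answer mentions.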

Default shuffle partition value in spark

The default shuffle partition value in Spark is 200 partitions. I would like to clarify: is this number per input partition, or will the number of output partitions be 200 across all input partitions?
I looked at several materials and not able to find the answer I am looking for.
I am not exactly sure whether I understood your question, but I think the best example I have found is in the book Spark: The Definitive Guide, which explains the number of partitions and the corresponding tasks in each stage.
For this job following is the explain output
This job breaks down into the following stages and tasks:
Stage 1 with 8 Tasks
Stage 2 with 8 Tasks
Stage 3 with 6 Tasks
Stage 4 with 5 Tasks
Stage 5 with 200 Tasks
Stage 6 with 1 Task
The first two stages correspond to the range that you perform in order to create your DataFrames. By default when you create a DataFrame with range, it has eight partitions.
The next step is the repartitioning. This changes the number of partitions by shuffling the data. These DataFrames are shuffled into six partitions and five partitions, corresponding to the number of tasks in stages 3 and 4.
Stages 3 and 4 perform on each of those DataFrames and the end of the stage represents the join (a shuffle). Suddenly, we have 200 tasks. This is because of a Spark SQL configuration. The spark.sql.shuffle.partitions default value is 200, which means that when there is a shuffle performed during execution, it outputs 200 shuffle partitions by default. You can change this value, and the number of output partitions will change.
The final result aggregates those partitions individually, brings them all to a single partition before finally sending the final result to the driver.
Another note on spark.sql.shuffle.partitions from spark docs
spark.sql.shuffle.partitions (default: 200): Configures the number of partitions to use when shuffling data **for joins or aggregations**.

spark.sql.shuffle.partitions of 200 default partitions conundrum

In many posts there is the statement - as shown below in some form or another - due to some question on shuffling, partitioning, due to JOIN, AGGR, whatever, etc.:
... In general whenever you do a spark sql aggregation or join which shuffles data this is the number of resulting partitions = 200.
This is set by spark.sql.shuffle.partitions. ...
So, my question is:
Do we mean that if we have set partitioning at 765 for a DF, for example,
That the processing occurs against 765 partitions, but that the output is coalesced / re-partitioned standardly to 200 - referring here to word resulting?
Or does it do the processing using 200 partitions after coalescing / re-partitioning to 200 partitions before JOINing, AGGR?
I ask as I never see a clear viewpoint.
I did the following test:
// genned a DS of some 20M short rows
df0.count
val ds1 = df0.repartition(765)
ds1.count
val ds2 = df0.repartition(765)
ds2.count
sqlContext.setConf("spark.sql.shuffle.partitions", "765")
// The above not included on 1st run, the above included on 2nd run.
ds1.rdd.partitions.size
ds2.rdd.partitions.size
val joined = ds1.join(ds2, ds1("time_asc") === ds2("time_asc"), "outer")
joined.rdd.partitions.size
joined.count
joined.rdd.partitions.size
On the 1st test - not defining sqlContext.setConf("spark.sql.shuffle.partitions", "765"), the processing and num partitions resulted was 200. Even though SO post 45704156 states it may not apply to DFs - this is a DS.
On the 2nd test - defining sqlContext.setConf("spark.sql.shuffle.partitions", "765"), the processing and num partitions resulted was 765. Even though SO post 45704156 states it may not apply to DFs - this is a DS.
It is a combination of both your guesses.
Assume you have a set of input data with M partitions and you set shuffle partitions to N.
When executing a join, Spark reads your input data in all M partitions and re-shuffles the data by key into N partitions. Imagine a trivial hash partitioner: the hash function applied to the key looks pretty much like A = hashcode(key) % N, and the data is then re-allocated to the node in charge of handling the A-th partition. Each node can be in charge of handling multiple partitions.
After shuffling, the nodes will work to aggregate the data in partitions they are in charge of. As no additional shuffling needs to be done here, the nodes can produce the output directly.
So in summary, your output will be coalesced to N partitions, however it is coalesced because it is processed in N partitions, not because spark applies one additional shuffle stage to specifically repartition your output data to N.
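The A = hashcode(key) % N rule can be tried out without Spark. The sketch below is an illustration under assumptions: a CRC-based hash stands in for the real hash function, N is 5, and the join is an inner join on made-up rows (the question uses an outer join). It shows why both sides of a join can be processed partition by partition once they have been shuffled with the same function.

```python
import zlib

def partition_for(key, num_partitions):
    # A = hashcode(key) % N, with CRC32 as a deterministic stand-in hash.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

def shuffle(rows, num_partitions):
    # Re-distribute (key, value) rows into N partitions by key.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions

N = 5                                                # plays the role of spark.sql.shuffle.partitions
left = [(k, f"L{k}") for k in range(20)]             # made-up left side
right = [(k, f"R{k}") for k in range(0, 20, 2)]      # made-up right side (even keys only)

left_parts = shuffle(left, N)
right_parts = shuffle(right, N)

# Equal keys share a partition index, so each of the N tasks joins only its own pair.
joined = []
for left_part, right_part in zip(left_parts, right_parts):
    right_by_key = dict(right_part)
    joined.extend(
        (key, left_value, right_by_key[key])
        for key, left_value in left_part
        if key in right_by_key
    )
```

The output naturally has N partitions because the work was done in N partitions, which matches the summary above: no extra repartition step is involved.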
spark.sql.shuffle.partitions is the parameter that decides the number of partitions used for shuffles such as joins or aggregations, i.e. wherever data moves across the nodes. The other setting, spark.default.parallelism, applies to RDD operations; for data read from files, the number of partitions is calculated from your data size and the maximum block size, which in HDFS is 128 MB. So if your job does not do any shuffle, the partition count comes from the input (or from the default parallelism), and if you are using RDDs you can set it yourself. When a shuffle happens, it will use 200.
val df = sc.parallelize(List(1,2,3,4,5),4).toDF()
df.count() // this will use 4 partitions
val df1 = df
df1.except(df).count // the shuffle will generate 200 partitions, in 2 stages

Resources