Minimize shuffle spill and shuffle write - apache-spark

I have the following pipeline in HDFS which I am processing in Spark:
input table: batch, team, user, metric1, metric2
This table has user-level metrics in hourly batches. Within the same hour a user can have multiple entries.
Level 1 aggregation: get the latest entry per user per batch
agg(metric1) as user_metric1, agg(metric2) as user_metric2 (group by batch, team, user)
Level 2 aggregation: get team-level metrics
agg(user_metric1) as team_metric1, agg(user_metric2) as team_metric2 (group by batch, team)
The input table is 8 GB (snappy Parquet format) in HDFS. My Spark job shows 40 GB of shuffle write and at least 1 GB of shuffle spill per executor.
In order to minimize this, if I repartition the input table on the user column before performing the aggregation,
df = df.repartition('user')
would it improve performance? How should I approach this problem if I want to reduce the shuffle?
I am running with the following resources:
spark.executor.cores=6
spark.cores.max=48
spark.sql.shuffle.partitions=200

Spark shuffles data from one node to another because the input data is distributed over the cluster; this can slow down the computation and put heavy network traffic on the cluster. In your case the shuffle is caused by the group by. If you repartition on the three columns of the group by, it will reduce the amount of shuffling. As for the Spark configuration, the default spark.sql.shuffle.partitions is 200; let's say we leave the configuration as it is. The repartition will take some time, but once it is finished the aggregation will be faster:
new_df = df.repartition("batch","team", "user")
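To make that concrete, here is a minimal PySpark sketch of repartitioning on the group-by columns before the two aggregation levels. The paths, the choice of max/sum as aggregate functions, and the SparkSession setup are assumptions for illustration, not taken from the original job:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("two-level-agg").getOrCreate()

df = spark.read.parquet("/path/to/input_table")  # hypothetical input location

# One explicit shuffle on the grouping columns; the level-1 aggregation can
# then run without an additional exchange, because the data is already
# hash-partitioned on (batch, team, user).
repartitioned = df.repartition("batch", "team", "user")

# Level 1: one row per (batch, team, user); max() stands in for whatever
# "latest entry" logic the real job uses.
user_level = repartitioned.groupBy("batch", "team", "user").agg(
    F.max("metric1").alias("user_metric1"),
    F.max("metric2").alias("user_metric2"),
)

# Level 2: the team rollup still shuffles, but only the much smaller
# user-level output.
team_level = user_level.groupBy("batch", "team").agg(
    F.sum("user_metric1").alias("team_metric1"),
    F.sum("user_metric2").alias("team_metric2"),
)

team_level.write.mode("overwrite").parquet("/path/to/output")  # hypothetical output location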

Related

How is a Spark Dataframe partitioned by default?

I know that an RDD is partitioned based on the key values using the HashPartitioner. But how is a Spark Dataframe partitioned by default, as it does not have the concept of key/value?
A Dataframe is partitioned depending on the number of tasks that run to create it.
There is no "default" partitioning logic applied. Here are some examples of how partitions are set:
A Dataframe created through val df = Seq(1 to 500000: _*).toDF() will have only a single partition.
A Dataframe created through val df = spark.range(0,100).toDF() has as many partitions as the number of available cores (e.g. 4 when your master is set to local[4]). Also, see remark below on the "default parallelism" that comes into effect for operations like parallelize with no parent RDD.
A Dataframe derived from an RDD (spark.createDataFrame(rdd, schema)) will have the same amount of partitions as the underlying RDD. In my case, as I have locally 6 cores, the RDD got created with 6 partitions.
A Dataframe consuming from a Kafka topic will have a number of partitions matching the partitions of the topic, because it can use as many cores/slots as the topic has partitions to consume the topic.
A Dataframe created by reading files, e.g. from HDFS, will have the number of partitions matching that of the files, unless individual files have to be split into multiple partitions based on spark.sql.files.maxPartitionBytes, which defaults to 128MB.
A Dataframe derived from a transformation requiring a shuffle will have the configurable amount of partitions set by spark.sql.shuffle.partitions (200 by default).
...
One of the major distinctions between the RDD and the Structured API is that you do not have as much control over the partitions as you have with RDDs, where you can even define a custom partitioner. This is not possible with Dataframes.
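If you want to verify a couple of these cases yourself, a small PySpark check along these lines should do (the local[4] master and the resulting counts are assumptions about the environment, and adaptive execution can change the post-shuffle number):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Follows the number of available cores (here local[4], so 4 partitions).
print(spark.range(0, 100).rdd.getNumPartitions())

# A transformation requiring a shuffle follows spark.sql.shuffle.partitions
# (200 by default, unless adaptive execution coalesces them).
print(spark.range(0, 100).groupBy("id").count().rdd.getNumPartitions())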
Default Parallelism
The documentation of the Execution Behavior configuration spark.default.parallelism explains:
For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
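To see what this resolves to on your own setup, a quick check (assuming an existing SparkSession named spark; this snippet is not part of the quoted documentation):

# Effective default parallelism reported by the scheduler.
print(spark.sparkContext.defaultParallelism)

# parallelize with no parent RDD picks up that value as its partition count.
print(spark.sparkContext.parallelize(range(1000)).getNumPartitions())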

Why my shuffle partition is not 200(default) during group by operation? (Spark 2.4.5)

I am new to spark and trying to understand the internals of it. So,
I am reading a small 50MB parquet file from s3 and performing a group by and then saving back to s3.
When I observe the Spark UI, I can see 3 stages created for this,
stage 0: load (1 task)
stage 1: shufflequerystage for grouping (12 tasks)
stage 2: save (coalescedshufflereader) (26 tasks)
Code Sample:
df = spark.read.format("parquet").load(src_loc)
df_agg = df.groupby(grp_attribute)\
.agg(F.sum("no_of_launches").alias("no_of_launchesGroup"))
df_agg.write.mode("overwrite").parquet(target_loc)
I am using an EMR instance with 1 master and 3 core nodes (each with 4 vcores). So the default parallelism is 12. I am not changing any config at runtime. But I am not able to understand why 26 tasks are created in the final stage. As I understand it, by default the number of shuffle partitions should be 200. Screenshot of the UI attached.
I tried a similar logic on Databricks with Spark 2.4.5.
I observe that with spark.conf.set('spark.sql.adaptive.enabled', 'true'), the final number of my partitions is 2.
I observe that with spark.conf.set('spark.sql.adaptive.enabled', 'false') and spark.conf.set('spark.sql.shuffle.partitions', 75), the final number of my partitions is 75.
Using print(df_agg.rdd.getNumPartitions()) reveals this.
So the job output on the Spark UI does not reflect this. Maybe a repartition occurs at the end. Interesting, but not really an issue.
In Spark SQL, the number of shuffle partitions is set using spark.sql.shuffle.partitions, which defaults to 200. In most cases, this number is too high for smaller data and too low for bigger data. Selecting the right value is always tricky for the developer.
So we need the ability to coalesce the shuffle partitions by looking at the mapper output. If the map stage generates a small amount of data, we want to reduce the overall number of shuffle partitions, which improves performance.
In the latest version, Spark 3.0 with Adaptive Query Execution, this reduction of tasks is automated.
http://blog.madhukaraphatak.com/spark-aqe-part-2/
Considering this, in Spark 2.4.5 the Catalyst optimizer or EMR might also have enabled this feature internally, reducing the number of tasks rather than creating 200 of them.
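As a rough sketch of what that looks like in code (Spark 3.0 configuration names; src_loc and grp_attribute are the names from the question above and are assumed to be defined; whether EMR enables something similar on 2.4.5 is not confirmed):

from pyspark.sql import functions as F

# Let Spark decide the post-shuffle partition count from the actual map
# output sizes instead of using the static spark.sql.shuffle.partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

df = spark.read.format("parquet").load(src_loc)
df_agg = df.groupby(grp_attribute).agg(F.sum("no_of_launches").alias("no_of_launchesGroup"))

# With AQE on, this is typically far below 200 for a 50MB input.
print(df_agg.rdd.getNumPartitions())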

Apache Spark: How 200 Reducer Tasks Can Aggregate 20000+ Mapper Output?

Updated Question
What I am not clear about:
In the ShuffleMapStage, each mapper will create a .data and an .index file.
These data/index files will have a name like
shuffle_X_Y_Z
where
X = shuffle_id
Y = map_id
Z = REDUCER_ID
I understand map_id can range from 1-22394.
BUT HOW ABOUT REDUCER_ID?
Is it 1-200 (e.g. the default number of partitions for the ResultStage)?
Is it equal to the number of executors?
If it is 1-200, then how do these 200 tasks know which data/index file to read?
Help me to understand that.
I am at a loss in understanding how reduce/aggregation tasks work.
Say I have a simple example like
input_df = spark.read.parquet("Big_folder_having parquets")
# Spark loads the data; the number of read partitions = number of files * number of 128MB blocks per file.
# Now I do a simple aggregation/count
input_df.createOrReplaceTempView("table1")
grouped_df = spark.sql("select key1, key2, count(1) as user_count from table1 group by 1,2")
# And simply write it with default 200 parallelism
grouped_df.write.format("parquet").mode("overwrite").save(my_save_path)
So for the input load, the parent RDD/input map stage has 22394 partitions.
As I understand, each mapper will create a shuffle data and index file.
Now the next stage has only 200 tasks (the default number of shuffle partitions).
How can these 200 reducers/tasks process the output from 22394 mapper tasks?
Attached DAG Screenshot
You have a cluster with 40 cores.
What happens is:
You ask Spark to read the files in the directory; it will do this 40 tasks at a time (since that is the number of cores you have) and the result will be an RDD with 22,394 partitions. (Be careful about shuffle spill; check the stage details.)
Then you ask Spark to group your data by some keys and then write it out.
Since the default number of shuffle partitions is 200, Spark will "move" the data from 22,394 partitions into 200 partitions and process 40 tasks/partitions at a time.
In other words...
When you request to group and save, Spark will create plans (I recommend you investigate the physical and logical plans) and it will say: "In order to do what the user is asking me to, I will create 200 tasks that will be executed against the data."
Then the executors will execute 40 tasks at a time.
There aren't mappers or reducers per se.
There are tasks that Spark will create and there are executors which will execute those tasks.
Edit:
Forgot to mention: the number of partitions in the RDD will determine the number of output files.
If you have 10 buckets with 10 apples or 1 bucket with 100 apples, it's all the same total apples.
Asking how it can handle it is similar to asking how you can carry 10 buckets versus 1 bucket.
It will either manage or it won't, depending on the amount of data you have. The issue you can run into is data being spilled to disk: with only 200 partitions, each partition has to handle more data, which may not fit into memory.
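A short PySpark sketch of that last point, reusing the names from the question (assumed to be defined); the coalesce call and the choice of 40 output files are illustrative assumptions, not something the answer prescribes:

from pyspark.sql import functions as F

spark.conf.set("spark.sql.shuffle.partitions", "200")

input_df = spark.read.parquet("Big_folder_having parquets")
grouped_df = input_df.groupBy("key1", "key2").agg(F.count(F.lit(1)).alias("user_count"))

# Each of the 200 reduce tasks fetches its own slice of every mapper's
# output file, which is how 200 tasks can consume 22394 map outputs.
# The post-shuffle partition count also equals the number of output files.
print(grouped_df.rdd.getNumPartitions())   # 200 with the setting above

# If 200 files are too many, shrink the partition count before writing.
grouped_df.coalesce(40).write.format("parquet").mode("overwrite").save(my_save_path)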

spark behavior on hive partitioned table

I use Spark 2.
Actually I am not the one executing the queries so I cannot include query plans. I have been asked this question by the data science team.
We have a Hive table partitioned into 2000 partitions and stored in Parquet format. When this table is used in Spark, exactly 2000 tasks are executed among the executors. But we have a block size of 256 MB and we were expecting about (total size / 256 MB) partitions, which would be much lower than 2000. Is there any internal logic by which Spark uses the physical structure of the data to create partitions? Any reference/help would be greatly appreciated.
UPDATE: It is the other way around. Our table is actually very large, about 3 TB with 2000 partitions. 3 TB / 256 MB would come to around 11720, but we get exactly the same number of partitions as the table is partitioned into physically. I just want to understand how the tasks are generated based on data volume.
In general Hive partitions are not mapped 1:1 to Spark partitions. 1 Hive partition can be split into multiple Spark partitions, and one Spark partition can hold multiple hive-partitions.
The number of Spark partitions when you load a hive-table depends on the parameters:
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
You can check the partitions e.g. using
spark.table(yourtable).rdd.partitions
This will give you an Array of FilePartitions which contain the physical path of your files.
Why you got exactly 2000 Spark partitions from your 2000 Hive partitions seems like a coincidence to me; in my experience this is very unlikely to happen. Note that the situation in Spark 1.6 was different: there the number of Spark partitions resembled the number of files on the filesystem (1 Spark partition for 1 file, unless the file was very large).
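For reference, here is a hedged Python sketch of the split-size formula Spark 2.x uses when planning file scans (mirroring FileSourceScanExec; the file sizes and parallelism below are made-up numbers):

def estimate_max_split_bytes(file_sizes, default_parallelism,
                             max_partition_bytes=128 * 1024 * 1024,
                             open_cost_in_bytes=4 * 1024 * 1024):
    # Every file is charged an "open cost" on top of its size.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    # Files are cut into chunks of at most this size and packed into partitions.
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

# Hypothetical example: 2000 Hive partitions with one 1.5 GB Parquet file each,
# read on a cluster whose defaultParallelism is 48.
print(estimate_max_split_bytes([1_500_000_000] * 2000, 48))  # capped at 128 MB

With a 128 MB cap over roughly 3 TB you would normally expect far more than 2000 read partitions, which is in line with the point above that landing on exactly 2000 looks coincidental.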
I just want to understand how the tasks are generated on data volume.
Tasks are a runtime artifact and their number is exactly the number of partitions.
The number of tasks does not correlate to data volume in any way. It's a Spark developer's responsibility to have enough partitions to hold the data.

increasing number of partitions in spark

I was using Hive to execute SQL queries on a project. I used ORC with a 50k stride for my data and created the Hive ORC tables using this configuration, with a certain date column as the partition.
Now I wanted to use Spark SQL to benchmark the same queries operating on the same data.
I executed the following query
val q1 = sqlContext.sql("select col1,col2,col3,sum(col4),sum(col5) from mytable where date_key=somedatkye group by col1,col2,col3")
In Hive this query takes 90 seconds. But Spark takes 21 minutes for the same query, and on looking at the job, I found the issue was that Spark creates 2 stages and the first stage has only 7 tasks, one for each of the 7 blocks of data within the given partition of the ORC file. The blocks are of different sizes, one is 5 MB while another is 45 MB, and because of this the stragglers take longer, making the whole job too slow.
How do I mitigate this issue in Spark? How do I manually increase the number of partitions, and thereby the number of tasks in stage 1, even though there are only 7 physical blocks for the given range of the query?
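The document does not include an answer for this one, but as a hedged starting point, two generic knobs are usually tried (the config name is for Spark 2.x+ and may not apply directly to the sqlContext/Spark 1.x setup in the question; the table, filter value, and partition counts are placeholders):

# 1) Ask Spark to cut file scans into smaller chunks so the scan stage gets
#    more tasks (only helps if the ORC stripes are actually splittable).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(16 * 1024 * 1024))  # 16 MB

# 2) Or repartition right after the scan: this adds one shuffle, but the
#    aggregation then runs on many evenly sized tasks instead of 7 skewed ones.
q1 = (spark.table("mytable")
          .where("date_key = 'somedatekey'")
          .repartition(64)
          .groupBy("col1", "col2", "col3")
          .agg({"col4": "sum", "col5": "sum"}))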
