How do wide transformations actually work with respect to the shuffle partitions configuration?
If I have the following program:
spark.conf.set("spark.sql.shuffle.partitions", "5")
val df = spark
.read
.option("inferSchema", "true")
.option("header", "true")
.csv("...\input.csv")
df.sort("sal").take(200)
Does it mean the sort would output 5 new partitions (as configured), and then Spark takes 200 records from those 5 partitions?
As mentioned in the comments, your sample code is not affected because this sort does not trigger a shuffle; in the plan you will find something like this:
== Physical Plan ==
TakeOrderedAndProject (2)
+- Scan csv (1)
But when you later do, for example, a join (or any other wide transformation that triggers a shuffle), you can see that the value of this parameter is used during the exchange (check the number of partitions row).
This may not be the case when adaptive query execution is enabled; in that situation it may look like this:
Now you can see that at the beginning the value from spark.sql.shuffle.partitions was used, but later AQE changed the plan and on the shuffle read the number of partitions was changed to 8 (you may also notice that the sort-merge join was changed to a broadcast hash join, which was also done by AQE).
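If you want to reproduce this without the screenshots, here is a minimal sketch (illustrative data, not the CSV from the question); AQE and broadcast joins are disabled so the configured value is what actually shows up:

spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.sql.shuffle.partitions", "5")

// two small illustrative datasets joined on a common column
val left  = spark.range(100000).toDF("id")
val right = spark.range(100000).toDF("id")

val joined = left.join(right, "id")
joined.explain()                      // look for Exchange hashpartitioning(id, 5)
println(joined.rdd.getNumPartitions)  // 5, taken from spark.sql.shuffle.partitions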
I have two dataframes, target_df and reference_df. I need to remove the account_ids in target_df that are present in reference_df.
target_df is created from a Hive table and will have hundreds of partitions. It is partitioned by date (20220101 to 20221101).
I am doing a left anti-join and writing the data to an HDFS location.
val numPartitions = 10
val df_purge = spark.sql(s"SELECT /*+ BROADCASTJOIN(ref) */ target.* FROM input_table target LEFT ANTI JOIN ${reference_table} ref ON target.${Customer_ID} = ref.${Customer_ID}")
df_purge.coalesce(numPartitions).write.partitionBy("date").mode("overwrite").parquet("hdfs_path")
I need to apply the same numPartitions value to each date partition, but it is being applied to the entire dataframe instead. For example, if there are 100 date partitions, I need 100 * 10 = 1000 part files. This code is not working as expected. I tried repartition("date"), but that causes a huge data shuffle.
Can anyone please provide an optimized solution. Thanks!
I am afraid that you cannot skip the shuffle in this case. repartition, coalesce, and partitionBy all work at the dataset level, and I don't think there is a way to just split each partition into 10 without a shuffle.
You tried to use coalesce, which indeed does not cause a shuffle, but coalesce can only be used to decrease the number of partitions, so it is not going to help you.
You can try to achieve what you want by using a combination of repartition and partitionBy. Here is a description of both functions (the same applies to Scala); source: https://sparkbyexamples.com:
PySpark repartition() is a DataFrame method that is used to increase
or reduce the partitions in memory and, when written to disk, it creates
all part files in a single directory.
PySpark partitionBy() is a method of DataFrameWriter class which is
used to write the DataFrame to disk in partitions, one sub-directory
for each unique value in partition columns.
If you first repartition your dataset with repartition(1000), Spark is going to create 1000 partitions in memory. Later, when you call partitionBy, Spark will create a sub-directory for each value and one part file for each in-memory partition that contains the given key.
So if, after the repartition, date X is present in 500 of the 1000 partitions, you will find 500 files in the sub-directory for that date.
In the article I mentioned previously you can find a simple example of this behaviour; check chapter 1.3, partitionBy(colNames : String*) Example:
#Use repartition() and partitionBy() together
dfRepart.repartition(2) \
    .write.option("header",True) \
    .partitionBy("state") \
    .mode("overwrite") \
    .csv("c:/tmp/zipcodes-state-more")
I have a Spark job where some tasks have zero records output and zero shuffle read size, while other tasks have memory and disk spill. Can someone help me with what I can do to optimize the execution?
Execution info: repartition_cnt=3500 (datasets are in S3 and execution is through Glue G2X with 298 DPUs)
Code:
fct_ate_df.repartition(expr(s"pmod(hash(mae_id, rowsin, dep), $repartition_cnt)"))
.write
.mode("overwrite")
.format("parquet")
.bucketBy(repartition_cnt, "rowsin", "rowsin","dep")
.sortBy("rowsin","dep")
.option("path", s"s3://b222-id/data22te=$dat22et_date")
.saveAsTable(s"btemp.intte_${table_name}_${regd}")
Summary metrics (screenshots): some tasks show no records output and no shuffle read, while others show records spilled to memory and disk.
You are using repartition by expression, and I think that is the reason why you see those empty partitions. In this case Spark internally uses hash partitioning, and this partitioner does not guarantee that the partitions will be of equal size.
Due to the hash algorithm you can be sure that records with the same expression value end up in the same partition, but you may also end up with empty partitions, or with partitions that contain, for example, 5 keys.
In this case numPartitions does not change much: when many keys land in the same bucket (and therefore the same partition), so that fewer non-empty partitions than numPartitions are produced, Spark generates empty partitions, as you can see in your example.
I think that if you want equally sized partitions, you can remove the expression in which you calculate the hash and leave only $repartition_cnt.
Thanks to that, Spark will use round-robin partitioning instead, and that one generates equal partitions.
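A hedged sketch of that change against your snippet (the write part is kept essentially as in your code, so it reuses your variables and path):

// Repartition by count only, so Spark uses round-robin partitioning
// and the in-memory partitions come out roughly equal in size.
fct_ate_df.repartition(repartition_cnt)
  .write
  .mode("overwrite")
  .format("parquet")
  .bucketBy(repartition_cnt, "rowsin", "dep")
  .sortBy("rowsin", "dep")
  .option("path", s"s3://b222-id/data22te=$dat22et_date")
  .saveAsTable(s"btemp.intte_${table_name}_${regd}")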
If you want to dig deeper you may take a look at the source code; I think these are nice starting points:
Here you can find the logic connected to repartition without an expression: Spark source code
Here you can find the logic used for partitioning by expression: Spark source code
Regards!
I'm trying to do the following crossJoin on two dataframes with 5 rows each, but Spark spawns 40000 tasks on my machine and it takes 30 seconds to complete. Any idea why that is happening?
df = spark.createDataFrame([['1','1'],['2','2'],['3','3'],['4','4'],['5','5']]).toDF('a','b')
df = df.repartition(1)
df.select('a').distinct().crossJoin(df.select('b').distinct()).count()
You call .distinct before the join; it requires a shuffle, so it repartitions the data based on the spark.sql.shuffle.partitions property value. Thus, df.select('a').distinct() and df.select('b').distinct() result in new DataFrames with 200 partitions each, and 200 x 200 = 40000 tasks for the cross join.
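As a rough Scala equivalent (the question itself is PySpark, so treat this only as an illustrative sketch, and note that it assumes AQE is not coalescing the shuffle partitions):

// Each distinct() shuffles into spark.sql.shuffle.partitions partitions,
// and the cross join then pairs every left partition with every right partition.
import spark.implicits._

val df = Seq(("1", "1"), ("2", "2"), ("3", "3"), ("4", "4"), ("5", "5"))
  .toDF("a", "b")
  .repartition(1)

println(df.select("a").distinct().rdd.getNumPartitions)  // 200 with the default setting
// 200 x 200 partitions -> 40000 tasks for the cross join's count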
Two things: it looks like you cannot directly control the number of partitions a DataFrame is created with, so we can first create an RDD instead (where you can specify the number of partitions) and convert it to a DataFrame. You can also set the shuffle partitions to 1. Both of these ensure you have just 1 partition during the whole execution and should speed things up.
Just note that this shouldn't be an issue at all for larger datasets, for which Spark is designed (it would be faster to achieve the same result on a dataset of this size not using spark at all). So in the general case you won't really need to do stuff like this, but tune the number of partitions to your resources/data.
spark.conf.set("spark.default.parallelism", "1")
spark.conf.set("spark.sql.shuffle.partitions", "1")
df = sc.parallelize([['1','1'],['2','2'],['3','3'],['4','4'],['5','5']], 1).toDF(['a','b'])
df.select('a').distinct().crossJoin(df.select('b').distinct()).count()
spark.conf.set only applies to the current session; if you want a more permanent change, put it in the actual Spark configuration file.
In many posts there is a statement, in some form or another like the one below, prompted by questions about shuffling and partitioning due to a JOIN, aggregation, etc.:
... In general whenever you do a spark sql aggregation or join which shuffles data this is the number of resulting partitions = 200.
This is set by spark.sql.shuffle.partitions. ...
So, my question is:
Do we mean that if we have set partitioning at 765 for a DF, for example,
that the processing occurs against 765 partitions, but that the output is coalesced / re-partitioned to the standard 200 - referring here to the word resulting?
Or does it do the processing using 200 partitions, after coalescing / re-partitioning to 200 partitions before the JOIN or aggregation?
I ask as I never see a clear viewpoint.
I did the following test:
// genned a DS of some 20M short rows
df0.count
val ds1 = df0.repartition(765)
ds1.count
val ds2 = df0.repartition(765)
ds2.count
sqlContext.setConf("spark.sql.shuffle.partitions", "765")
// The above not included on 1st run, the above included on 2nd run.
ds1.rdd.partitions.size
ds2.rdd.partitions.size
val joined = ds1.join(ds2, ds1("time_asc") === ds2("time_asc"), "outer")
joined.rdd.partitions.size
joined.count
joined.rdd.partitions.size
On the 1st test - without sqlContext.setConf("spark.sql.shuffle.partitions", "765") - the processing and the resulting number of partitions was 200. Even though SO post 45704156 states it may not apply to DFs - this is a DS.
On the 2nd test - with sqlContext.setConf("spark.sql.shuffle.partitions", "765") - the processing and the resulting number of partitions was 765. Even though SO post 45704156 states it may not apply to DFs - this is a DS.
It is a combination of both your guesses.
Assume you have a set of input data with M partitions and you set shuffle partitions to N.
When executing a join, Spark reads your input data in all M partitions and re-shuffles the data based on the key into N partitions. Imagine a trivial hash partitioner: the hash function applied to the key looks pretty much like A = hashcode(key) % N, and the data is then re-allocated to the node in charge of handling the Ath partition. Each node can be in charge of handling multiple partitions.
After shuffling, the nodes will work to aggregate the data in partitions they are in charge of. As no additional shuffling needs to be done here, the nodes can produce the output directly.
So in summary, your output ends up in N partitions, but it does so because it is processed in N partitions, not because Spark applies an additional shuffle stage to specifically repartition your output data to N.
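A small sketch that makes M and N visible (names and sizes are illustrative; AQE and broadcast joins are disabled here so the raw setting is observable):

// Inputs with M = 48 partitions, shuffle partitions set to N = 765;
// the join output lands in N partitions without any extra repartition step.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.sql.shuffle.partitions", "765")

val ds1 = spark.range(0, 1000000).repartition(48)   // M input partitions
val ds2 = spark.range(0, 1000000).repartition(48)

val joined = ds1.join(ds2, "id")
println(joined.rdd.getNumPartitions)                 // 765 = N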
spark.sql.shuffle.partitions is the parameter that decides the number of partitions during shuffles such as joins or aggregations, i.e. where data moves across the nodes. The other setting, spark.default.parallelism, is calculated based on your data size and the maximum block size (128 MB in HDFS). So if your job does not do any shuffle it will use the default parallelism value, or if you are using RDDs you can set it yourself. When a shuffle happens, it will use 200.
val df = sc.parallelize(List(1,2,3,4,5), 4).toDF()
df.count() // this will use 4 partitions
val df1 = df
df1.except(df).count() // will generate 200 partitions, running in 2 stages
When I try to write a dataframe to a Hive Parquet partitioned table:
df.write.partitionBy("key").mode("append").format("hive").saveAsTable("db.table")
it creates a lot of blocks in HDFS, each of which holds only a small amount of data.
I understand how this happens: each Spark sub-task creates a block and then writes data to it.
I also understand that a larger number of blocks can improve Hadoop performance, but after a certain threshold it starts to hurt performance.
If I want to set numPartition automatically, does anyone have a good idea how?
val numPartition = ??? // auto-calculate based on df size or something
df.repartition(numPartition).write
  .partitionBy("key")
  .format("hive")
  .saveAsTable("db.table")
First of all, why do you want an extra repartition step when you are already using partitionBy("key")? Your data will already be partitioned based on the key.
Generally, you could re-partition by a column value; that's a common scenario and helps in operations like reduceByKey, filtering based on a column value, etc. For example:
import spark.implicits._ // needed for toDF on a local collection

val birthYears = List(
(2000, "name1"),
(2000, "name2"),
(2001, "name3"),
(2000, "name4"),
(2001, "name5")
)
val df = birthYears.toDF("year", "name")
df.repartition($"year")
By default, Spark creates 200 partitions for shuffle operations, so 200 files/blocks (possibly very small ones) will be written to HDFS.
Configure the number of partitions to be created after a shuffle based on your data, using the configuration below:
spark.conf.set("spark.sql.shuffle.partitions", <Number of paritions>)
For example, with spark.conf.set("spark.sql.shuffle.partitions", "5"), Spark will create 5 partitions and 5 files will be written to HDFS.
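For the "auto set numPartition" part of the earlier question, here is a hedged sketch of one possible heuristic (the 128 MB target and the reliance on Catalyst's size estimate are assumptions, not a built-in Spark feature, and the estimate can be rough; assumes a reasonably recent Spark version):

// Derive a partition count from Catalyst's size estimate of the DataFrame
// and a target file size, then repartition before writing.
val targetFileBytes = 128L * 1024 * 1024                                 // ~128 MB per file (assumption)
val estimatedBytes  = df.queryExecution.optimizedPlan.stats.sizeInBytes  // Catalyst estimate
val numPartition    = math.max(1, (estimatedBytes / targetFileBytes).toInt)

df.repartition(numPartition)
  .write
  .partitionBy("key")
  .format("hive")
  .saveAsTable("db.table")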