As per the documentation:
spark.default.parallelism: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
Default value: for distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD.
I am not able to reproduce the documented behaviour.
Dataframe:
I create two DataFrames with 50 and 3 partitions respectively and join them. I expect the output to have 50 partitions, but it always has as many partitions as the DataFrame with more rows (3 here).
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.join(df2,df1.id==df2.id,'inner')
df3.rdd.getNumPartitions() #3
RDD:
I create two DataFrames with 50 and 3 partitions respectively and join their underlying RDDs. I expect the output RDD to have 50 partitions, but it has 53.
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.rdd.join(df2.rdd)
df3.getNumPartitions() #53
How should I understand the conf spark.default.parallelism?
For now I can answer the DataFrame part. I am still experimenting with RDDs, so I may add an edit later.
For DataFrames there is the parameter spark.sql.shuffle.partitions, which is used during joins etc. and is set to 200 by default. In your case it is not used, and there may be two reasons:
Your datasets are smaller than 10 MB, so one dataset is broadcast and there is no shuffle during the join. You end up with the number of partitions of the bigger dataset.
You may have AQE enabled, which changes the number of partitions.
I did a quick check with broadcast and AQE disabled, and the results are as expected:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark.conf.set("spark.sql.adaptive.enabled", False)
spark.conf.set("spark.sql.shuffle.partitions",100)
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.join(df2,df1.id==df2.id,'inner')
df3.rdd.getNumPartitions() #100
Out[35]: 100
For the second case with RDDs, I think the RDDs are co-partitioned and because of that Spark is not triggering a full shuffle: "Understanding Co-partitions and Co-Grouping In Spark"
What I can see in the query plan is that Spark is using a union instead of a join, and that is why the final RDD has 53 (50 + 3) partitions.
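To put the cases above side by side, here is a rough pure-Python model of which setting decides the partition count after a join. It is only a sketch, not Spark code: expected_join_partitions and its parameters are hypothetical names, it assumes AQE is disabled and the inputs are not already hash-partitioned on the join key, and the RDD branch reflects my understanding that PySpark builds the join from a union of both inputs, so please verify it on your version.

```python
def expected_join_partitions(left_parts, right_parts, api="dataframe",
                             broadcast_side=None, shuffle_partitions=200,
                             default_parallelism=None):
    """Rough model of the partition count after a join (assumes AQE is off)."""
    if api == "dataframe":
        # Broadcast join: no shuffle, the output keeps the partitioning
        # of the side that is NOT broadcast.
        if broadcast_side == "left":
            return right_parts
        if broadcast_side == "right":
            return left_parts
        # Shuffle join: both sides are shuffled into
        # spark.sql.shuffle.partitions partitions.
        return shuffle_partitions
    if api == "rdd":
        # Assumption: PySpark unions both inputs before grouping by key, so
        # without spark.default.parallelism the counts add up.
        if default_parallelism is not None:
            return default_parallelism
        return left_parts + right_parts
    raise ValueError("api must be 'dataframe' or 'rdd'")


# The three observations from this thread (df1 has 50 partitions, df2 has 3):
print(expected_join_partitions(50, 3, broadcast_side="left"))    # 3
print(expected_join_partitions(50, 3, shuffle_partitions=100))   # 100
print(expected_join_partitions(50, 3, api="rdd"))                # 53
```

In the DataFrame example df1 (10,000 rows) is the smaller side, which is why it is modelled as the broadcast side here.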
Related
We have two DataFrames, df_A and df_B:
df_A.rdd.getNumPartitions() # => 9
df_B.rdd.getNumPartitions() # => 160
df_A.createOrReplaceTempView('table_A')
df_B.createOrReplaceTempView('table_B')
After creating the joined DataFrame via Spark SQL,
df_C = spark.sql("""
select *
from table_A inner join table_B on (...)
""")
df_C.rdd.getNumPartitions() # => 160
How does Spark calculate and use these two partitions for two joined dataframes?
Shouldn't the number of partitions of the joined dataframe be 9 * 160 = 1440?
By default, Spark uses 200 partitions when shuffling data for joins or aggregations. You can change the value of spark.sql.shuffle.partitions to increase or decrease the number of partitions produced by a join.
https://spark.apache.org/docs/latest/sql-performance-tuning.html
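To see why a shuffle join gives N partitions rather than 9 * 160, here is a minimal pure-Python sketch of hash partitioning. It is an illustration only: hash_partition and shuffle_join are made-up helper names, and Python's built-in hash stands in for Spark's own hash function. Both sides are split into the same N buckets by key, so matching rows always meet in the same bucket and the join can run bucket by bucket.

```python
from collections import defaultdict


def hash_partition(rows, num_partitions):
    """Assign each (key, value) row to partition hash(key) % num_partitions."""
    parts = defaultdict(list)
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def shuffle_join(left, right, num_partitions):
    """Inner join computed partition by partition after hash partitioning."""
    left_parts = hash_partition(left, num_partitions)
    right_parts = hash_partition(right, num_partitions)
    joined = {}
    for p in range(num_partitions):
        # Index the right side of this partition by key.
        right_by_key = defaultdict(list)
        for key, value in right_parts.get(p, []):
            right_by_key[key].append(value)
        # Rows with equal keys are guaranteed to be in the same partition.
        joined[p] = [(key, (lv, rv))
                     for key, lv in left_parts.get(p, [])
                     for rv in right_by_key.get(key, [])]
    return joined


left = [(i, "a%d" % i) for i in range(20)]
right = [(i, "b%d" % i) for i in range(0, 20, 2)]
result = shuffle_join(left, right, num_partitions=4)
print(len(result))                             # 4 output partitions
print(sum(len(v) for v in result.values()))    # 10 joined rows
```

The output has as many partitions as the shuffle used (4 here, spark.sql.shuffle.partitions in Spark), regardless of how many partitions the inputs had before the shuffle.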
I have two large Spark DataFrames. I joined them on one common column:
df_joined = df1.join(df2.select("id",'label'), "id")
I got the result, but when I work with df_joined, it is too slow. As far as I know, we need to repartition df1 and df2 to prevent a large number of partitions in df_joined. So I changed the number of partitions:
df1r = df1.repartition(1)
df2r = df2.repartition(1)
df_joined = df1r.join(df2r.select("id",'label'), "id")
It is still not working.
Any ideas?
Spark runs 1 concurrent task for every partition of an RDD / DataFrame (up to the number of cores in the cluster).
If your cluster has 20 cores, you should have at least 20 partitions (in practice 2–3 times more). On the other hand, a single partition typically shouldn't contain more than 128 MB.
So, instead of the two lines below, which repartition your DataFrames into a single partition:
df1r = df1.repartition(1)
df2r = df2.repartition(1)
Repartition your data on the 'id' column (the join key) into n partitions, where n depends on the data size and the number of cores in the cluster:
df1r = df1.repartition(n, "id")
df2r = df2.repartition(n, "id")
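As a rough way to pick n from the two rules of thumb above (2–3 tasks per core, at most about 128 MB per partition), something like the following could work. It is a back-of-the-envelope sketch: suggest_num_partitions is a hypothetical helper, and the default of 3 tasks per core is just the upper end of the range mentioned above.

```python
import math


def suggest_num_partitions(total_cores, data_size_bytes,
                           tasks_per_core=3,
                           max_partition_bytes=128 * 1024 ** 2):
    """Pick the larger of the core-based and the size-based partition count."""
    by_cores = total_cores * tasks_per_core
    by_size = math.ceil(data_size_bytes / max_partition_bytes)
    return max(by_cores, by_size, 1)


# 20 cores and about 30 GiB of data: the size rule dominates.
print(suggest_num_partitions(total_cores=20, data_size_bytes=30 * 1024 ** 3))  # 240
# 20 cores and about 1 GiB of data: the core rule dominates.
print(suggest_num_partitions(total_cores=20, data_size_bytes=1 * 1024 ** 3))   # 60
```

The result is only a starting point. Skewed keys or wide rows can still justify a different value.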
In a Parquet data lake partitioned by year and month, with spark.default.parallelism set to e.g. 4, let's say I want to create a DataFrame consisting of months 11~12 from 2017 and months 1~3 from 2018 of two sources, A and B.
df = spark.read.parquet(
"A.parquet/_YEAR={2017}/_MONTH={11,12}",
"A.parquet/_YEAR={2018}/_MONTH={1,2,3}",
"B.parquet/_YEAR={2017}/_MONTH={11,12}",
"B.parquet/_YEAR={2018}/_MONTH={1,2,3}",
)
If I get the number of partitions, I see that Spark used spark.default.parallelism as the default:
df.rdd.getNumPartitions()
Out[4]: 4
Taking into account that after creating df I need to perform join and groupBy operations over each period, and that data is more or less evenly distributed over each one (around 10 million rows per period):
Question
Will a repartition improve the performance of my subsequent operations?
If so, if I have 10 different periods (5 per year in both A and B), should I repartition by the number of periods and explicitly reference the columns to repartition (df.repartition(10,'_MONTH','_YEAR'))?
Will a repartition improve the performance of my subsequent operations?
Typically it won't. The only reason to preemptively repartition data is to avoid further shuffling when the same Dataset is used for multiple joins based on the same condition.
If so, if I have 10 different periods (5 per year in both A and B), should I repartition by the number of periods and explicitly reference the columns to repartition (df.repartition(10,'_MONTH','_YEAR'))?
Let's go step-by-step:
should I repartition by the number of periods
Partitioners don't guarantee a 1:1 relationship between levels and partitions, so the only thing to remember is that you cannot have more non-empty partitions than unique keys. Using a significantly larger value therefore doesn't make sense.
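A small pure-Python sketch of that limit, assuming simple hash partitioning (non_empty_partitions is a hypothetical helper, and Python's built-in hash stands in for Spark's hash function). With the five distinct (_YEAR, _MONTH) combinations from the question, at most five partitions can receive rows no matter how many partitions are requested, and two keys may still collide in the same partition.

```python
def non_empty_partitions(keys, num_partitions):
    """Count partitions that receive at least one row under hash partitioning."""
    return len({hash(key) % num_partitions for key in keys})


periods = [(2017, 11), (2017, 12), (2018, 1), (2018, 2), (2018, 3)]
# 1000 rows spread evenly over the five periods.
row_keys = [periods[i % len(periods)] for i in range(1000)]

for n in (2, 10, 200):
    used = non_empty_partitions(row_keys, n)
    # used can never exceed min(n, number of distinct keys).
    print(n, used, used <= min(n, len(periods)))
```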
and explicitly reference the columns to repartition
If you repartition and subsequently join or groupBy, using the same set of columns for both parts is the only sensible solution.
Summary
Repartitioning before a join makes sense in two scenarios:
In the case of multiple subsequent joins:
df_ = df.repartition(10, "foo", "bar")
df_.join(df1, ["foo", "bar"])
...
df_.join(df2, ["foo", "bar"])
With a single join, when the desired number of output partitions is different from spark.sql.shuffle.partitions (and there is no broadcast join):
spark.conf.get("spark.sql.shuffle.partitions")
# 200
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
df1_ = df1.repartition(11, "foo", "bar")
df2_ = df2.repartition(11, "foo", "bar")
df1_.join(df2_, ["foo", "bar"]).rdd.getNumPartitions()
# 11
df1.join(df2, ["foo", "bar"]).rdd.getNumPartitions()
# 200
which might be preferable over:
spark.conf.set("spark.sql.shuffle.partitions", 11)
df1.join(df2, ["foo", "bar"]).rdd.getNumPartitions()
spark.conf.set("spark.sql.shuffle.partitions", 200)
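If you do go the set-and-restore route, a small context manager keeps the restore from being forgotten. This is only a sketch: shuffle_partitions is a hypothetical helper, and it assumes an object with get(key) and set(key, value) methods, which is how spark.conf is used in the snippets above.

```python
from contextlib import contextmanager


@contextmanager
def shuffle_partitions(conf, n):
    """Temporarily set spark.sql.shuffle.partitions and restore the old value."""
    key = "spark.sql.shuffle.partitions"
    previous = conf.get(key)
    conf.set(key, n)
    try:
        yield
    finally:
        # Restore the previous value even if the body raises.
        conf.set(key, previous)


# Intended usage (needs a live SparkSession named `spark`):
#
# with shuffle_partitions(spark.conf, 11):
#     df1.join(df2, ["foo", "bar"]).rdd.getNumPartitions()
```

As far as I understand, the setting is read when the query is planned, so the action (or at least the call that forces planning) should happen inside the with block, not after it.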
I am using Spark 1.5.0
I am doing a broadcast join since one of my DataFrames is around 30 GB (large_df) and the other is around 10 MB (small_df). Here is my code.
val df1 = large_df.join(broadcast(small_df), large_df("col2") === small_df("s_col2"))
Right after this, if I get the number of partitions for df1, I see the correct number (1000):
df1.rdd.partitions.size // 1000
Now I am doing a projection to select only certain columns of df1:
val df2 = df1.select("col2", "col4", "col6", "col8")
Right after this, if I get the number of partitions for df2, I see a smaller number (200). I am not sure whether it is just this select or whether some other operation is changing the number of partitions of my DataFrame.
df2.rdd.partitions.size // 200
How do I make sure that the number of partitions is not reduced?
You can set the following property of your SparkConf to 1000
spark.sql.shuffle.partitions
More info here: https://spark.apache.org/docs/1.2.0/sql-programming-guide.html
spark.sql.shuffle.partitions
Default: 200
Configures the number of partitions to use when shuffling data for joins or aggregations.
val df = Seq(
("A", 1), ("B", 2), ("A", 3), ("C", 1)
).toDF("k", "v")
df.rdd.getNumPartitions
Whenever you do shuffle operations on DataFrames, the default number of partitions is 200:
val partitioned = df.repartition($"k")
partitioned.rdd.getNumPartitions //Results 200
I have 3 DataFrames: df1, df2 and df3.
Each DataFrame has approximately 3 million rows. df1 and df3 have approx. 8 columns. df2 has only 3 columns.
(The source text file of df1 is approx. 600 MB in size.)
These are the operations performed:
df_new=df1 left join df2 ->group by df1 columns->select df1 columns, first(df2 columns)
df_final = df_new outer join df3
df_split1 = df_final filtered using condition1
df_split2 = df_final filtered using condition2
write df_split1,df_split2 into a single table after performing different operations on both dataframes
This entire process takes 15 minutes in PySpark 1.3.1, with default partition value = 10, executor memory = 30G, driver memory = 10G, and I have used cache() wherever necessary.
But when I use Hive queries, it takes barely 5 minutes. Is there any particular reason why my DataFrame operations are slow, and is there any way I can improve the performance?
You should be careful with the use of JOIN.
JOIN in Spark can be really expensive, especially if the join is between two DataFrames. You can avoid expensive operations by repartitioning the two DataFrames on the same column or by using the same partitioner.