I am using Spark 1.5.0
I am doing a broadcast join since one of my DataFrames is around 30 GB (large_df) and the other is around 10 MB (small_df). Here is my code:
val df1 = large_df.join(broadcast(small_df), large_df("col2") === small_df("s_col2"))
Right after this, if I get the number of partitions for df1, I see the correct number (1000)
df1.rdd.partitions.size // 1000
Now I project df1 to select only certain columns:
val df2 = df1.select("col2", "col4", "col6", "col8")
Right after this, if I get the number of partitions for df2, I see a smaller number (200). I am not sure whether it is just this select or some other operation that is changing the number of partitions of my DataFrame.
df2.rdd.partitions.size // 200
How do I make sure that the number of partitions is not reduced?
You can set the following property in your SparkConf to 1000:
spark.sql.shuffle.partitions
More info here: https://spark.apache.org/docs/1.2.0/sql-programming-guide.html
spark.sql.shuffle.partitions (default: 200): Configures the number of partitions to use when shuffling data for joins or aggregations.
val df = Seq(
("A", 1), ("B", 2), ("A", 3), ("C", 1)
).toDF("k", "v")
df.rdd.getNumPartitions
Whenever you do shuffle operations on DataFrames, the default number of partitions is 200:
val partitioned = df.repartition($"k")
partitioned.rdd.getNumPartitions //Results 200
As per the documentation:
spark.default.parallelism: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
Default value: for distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD.
I am not able to reproduce the documented behaviour.
Dataframe:
I create two DataFrames with 50 and 3 partitions respectively and join them. The output should have 50 partitions, but it always has the same number of partitions as the DataFrame with more rows (3 here):
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.join(df2,df1.id==df2.id,'inner')
df3.rdd.getNumPartitions() #3
RDD:
I create two DataFrames with 50 and 3 partitions respectively and join their underlying RDDs. The output RDD should have 50 partitions, but it has 53:
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.rdd.join(df2.rdd)
df3.getNumPartitions() #53
How to understand the conf spark.default.parallelism?
For now I can answer the DataFrame part; I am still experimenting with RDDs, so I may add an edit later.
For DataFrames there is a parameter, spark.sql.shuffle.partitions, which is used during joins and other shuffles; it is set to 200 by default. In your case it is not used, and there may be two reasons:
Your datasets are smaller than 10 MB (the default spark.sql.autoBroadcastJoinThreshold), so one dataset is broadcast and there is no shuffle during the join; you end up with the number of partitions of the bigger dataset.
You may have AQE (Adaptive Query Execution) enabled, which changes the number of partitions.
I did a quick check with broadcast and AQE disabled, and the results are as expected:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark.conf.set("spark.sql.adaptive.enabled", False)
spark.conf.set("spark.sql.shuffle.partitions",100)
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.join(df2,df1.id==df2.id,'inner')
df3.rdd.getNumPartitions() #100
Out[35]: 100
For the second case, with RDDs, I think the RDDs are co-partitioned and because of that Spark does not trigger a full shuffle: see "Understanding Co-partitions and Co-Grouping In Spark".
What I can see in the query plan is that Spark uses a union instead of a join, and that is why the final RDD has 53 partitions.
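The DataFrame reasoning above can be condensed into a small pure-Python sketch. It is only a simplified model, assuming AQE is disabled and neither side is already partitioned in a way that lets Spark skip the shuffle; the function name expected_join_partitions and its parameters are hypothetical and not part of any Spark API.

```python
def expected_join_partitions(streamed_partitions, shuffle_partitions=200, broadcast=False):
    """Simplified model (AQE off) of the partition count of a DataFrame join result.

    streamed_partitions: partitions of the side that is NOT broadcast.
    shuffle_partitions:  value of spark.sql.shuffle.partitions.
    broadcast:           True if one side is small enough to be broadcast.
    """
    if broadcast:
        # Broadcast join: no shuffle, so the output keeps the
        # partitioning of the streamed (non-broadcast) side.
        return streamed_partitions
    # Shuffle-based join (e.g. sort-merge): both sides are shuffled
    # into spark.sql.shuffle.partitions partitions.
    return shuffle_partitions


# The two situations from this thread:
print(expected_join_partitions(3, broadcast=True))          # 3
print(expected_join_partitions(3, shuffle_partitions=100))  # 100
```

The first call corresponds to the original question (one side broadcast, 3 partitions on the other side); the second corresponds to the check with broadcast disabled and spark.sql.shuffle.partitions set to 100.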
I have two DataFrames:
df1:
columns: col1, col2, col3
partitioned on col1
no of partitions: 120000
df2:
columns: col1, col2, col3
partitioned on col1
no of partitions: 80000
Now I want to join df1 and df2 on (df1.col1 = df2.col1 and df1.col2 = df2.col2) without much shuffling.
I tried the join, but it takes a lot of time.
How do I do it? Can anyone help?
In my opinion, you can try a broadcast join if one of your datasets is small (let's say a few hundred MB); in that case the smaller dataset is broadcast and you skip the shuffle.
Without a broadcast hint, Catalyst is probably going to pick SMJ (sort-merge join), and with this join algorithm the data needs to be repartitioned by the join key and then sorted. I prepared a quick example:
import org.apache.spark.sql.functions._
spark.conf.set("spark.sql.shuffle.partitions", "10")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val data = Seq(("test", 3),("test", 3), ("test2", 5), ("test3", 7), ("test55", 86))
val data2 = Seq(("test", 3),("test", 3), ("test2", 5), ("test3", 6), ("test33", 76))
val df = data.toDF("Name", "Value").repartition(5, col("Name"))
df.show
val df2 = data2.toDF("Name", "Value").repartition(5, col("Name"))
df2.show
df.join(df2, Seq("Name", "Value")).show
autoBroadcastJoinThreshold is set to -1 to disable broadcast joins.
sql.shuffle.partitions is set to 10 to show that the join uses this value during repartitioning.
I repartitioned the DataFrames into 5 partitions before the join and called an action to be sure that they are partitioned by the same column before the join.
In the SQL tab I can see that Spark repartitions the data again.
If you can't broadcast and your join takes a lot of time, check whether you have some skew.
You may read this blog post by Dima Statz to find more information about skew in joins.
I have two large Spark DataFrames. I joined them on one common column:
df_joined = df1.join(df2.select("id",'label'), "id")
I got the result, but when I want to work with df_joined, it's too slow. As far as I know, we need to repartition df1 and df2 to prevent a large number of partitions in df_joined. So I changed the number of partitions:
df1r = df1.repartition(1)
df2r = df2.repartition(1)
df_joined = df1r.join(df2r.select("id",'label'), "id")
It is still not working.
Any ideas?
Spark runs 1 concurrent task for every partition of an RDD / DataFrame (up to the number of cores in the cluster).
If your cluster has 20 cores, you should have at least 20 partitions (in practice 2-3 times more). On the other hand, a single partition typically shouldn't contain more than 128 MB.
So, instead of the two lines below, which repartition your DataFrames into 1 partition each:
df1r = df1.repartition(1)
df2r = df2.repartition(1)
repartition your data on the 'id' column (the join key) into n partitions, where n depends on the data size and the number of cores in the cluster:
df1r = df1.repartition(n, "id")
df2r = df2.repartition(n, "id")
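As a rough illustration of how n could be chosen from the rule of thumb above, here is a minimal pure-Python sketch. The helper name suggest_num_partitions, its parameters, and the default of 2 tasks per core are assumptions made for this example, not part of Spark.

```python
import math


def suggest_num_partitions(data_size_mb, total_cores,
                           max_partition_mb=128, tasks_per_core=2):
    """Hypothetical helper: pick n so that every core has work and
    no partition is expected to exceed max_partition_mb."""
    # Enough partitions to keep each one under the size cap.
    by_size = math.ceil(data_size_mb / max_partition_mb)
    # A few tasks per core so that all cores stay busy.
    by_cores = total_cores * tasks_per_core
    return max(by_size, by_cores, 1)


# Example: about 30 GB of data on a 20-core cluster.
n = suggest_num_partitions(30 * 1024, 20)
print(n)  # 240
```

The resulting value could then be passed as n to df1.repartition(n, "id") and df2.repartition(n, "id"). Treat it as a starting point and adjust after looking at actual task sizes in the Spark UI.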
I want to partition a DataFrame "df1" on 3 columns. This DataFrame has exactly 990 unique combinations of those 3 columns:
In [17]: df1.createOrReplaceTempView("df1_view")
In [18]: spark.sql("select count(*) from (select distinct(col1,col2,col3) from df1_view) as t").show()
+--------+
|count(1)|
+--------+
| 990|
+--------+
In order to optimize the processing of this dataframe, I want to partition df1 in order to get 990 partitions, one for each key possibility:
In [19]: df1.rdd.getNumPartitions()
Out[19]: 24
In [20]: df2 = df1.repartition(990, "col1", "col2", "col3")
In [21]: df2.rdd.getNumPartitions()
Out[21]: 990
I wrote a simple way to count rows in each partition:
In [22]: def f(iterator):
    ...:     a = 0
    ...:     for row in iterator:  # each element is a row of the partition
    ...:         a = a + 1
    ...:     print(a)
    ...:
In [23]: df2.foreachPartition(f)
What I get in fact is 628 partitions with one or more key values, and 362 empty partitions.
I assumed Spark would repartition evenly (1 key value = 1 partition), but that does not seem to be the case, and I feel like this repartitioning is adding data skew even though it should be the other way around.
What algorithm does Spark use to partition a DataFrame on columns?
Is there a way to achieve what I thought was possible?
I'm using Spark 2.2.0 on Cloudera.
To distribute data across partitions, Spark needs to convert the value of the column into a partition index. There are two default partitioners in Spark: HashPartitioner and RangePartitioner. Different transformations can apply different partitioners; for example, join applies a hash partitioner.
For the hash partitioner, the formula that converts a value into a partition index is basically value.hashCode() % numOfPartitions. In your case multiple values map to the same partition index, which leaves other partitions empty.
You could implement your own partitioner if you want better distribution. More about it is here and here and here.
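To see why about a third of the partitions end up empty, here is a minimal pure-Python sketch of hashing 990 distinct keys into 990 partitions. It uses MD5 as a stand-in hash, so the exact counts will not match Spark's own hash function, and the generated key tuples are hypothetical placeholders for the real (col1, col2, col3) combinations.

```python
import hashlib


def partition_index(key, num_partitions):
    """Map a key to a partition index: hash(key) % num_partitions.
    MD5 is only a deterministic stand-in for Spark's hash function."""
    digest = hashlib.md5(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def simulate(num_keys=990, num_partitions=990):
    """Count how many distinct keys land in each partition."""
    counts = [0] * num_partitions
    for i in range(num_keys):
        # Hypothetical distinct (col1, col2, col3) combination.
        key = (i % 10, (i // 10) % 10, i // 100)
        counts[partition_index(key, num_partitions)] += 1
    return counts


counts = simulate()
occupied = sum(1 for c in counts if c > 0)

# With a uniform hash, the expected number of non-empty partitions
# for k keys and n partitions is n * (1 - (1 - 1/n) ** k).
expected = 990 * (1 - (1 - 1 / 990) ** 990)

print(f"occupied={occupied}, empty={990 - occupied}, "
      f"expected occupied ~ {expected:.0f}")
```

For 990 keys and 990 partitions the expected number of occupied partitions is roughly 626 (about 63%), which is close to the 628 observed in the question. Hash partitioning is therefore behaving as expected; a one-key-per-partition layout needs an explicit mapping from key to partition index, which is what a custom partitioner provides.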
I have 3 dataframes df1, df2 and df3.
Each DataFrame has approximately 3 million rows. df1 and df3 have approx. 8 columns each. df2 has only 3 columns.
(source text file of df1 is approx 600MB size)
These are the operations performed:
df_new=df1 left join df2 ->group by df1 columns->select df1 columns, first(df2 columns)
df_final = df_new outer join df3
df_split1 = df_final filtered using condition1
df_split2 = df_final filtered using condition2
write df_split1,df_split2 into a single table after performing different operations on both dataframes
This entire process takes 15mins in pyspark 1.3.1, with default partition value = 10, executor memory = 30G, driver memory = 10G and I have used cache() wherever necessary.
But when I use hive queries, this hardly takes 5 mins. Is there any particular reason why my dataframe operations are slow and is there any way I can improve the performance?
You should be careful with the use of JOIN.
JOIN in Spark can be really expensive, especially if the join is between two DataFrames. You can avoid expensive operations by repartitioning the two DataFrames on the same column or by using the same partitioner.