Spark dataframe partition count - apache-spark

I am confused about how Spark decides the number of partitions for a DataFrame. Here is the list of steps and the partition counts:
i_df = sqlContext.read.json("json files")  # num partitions returned: 4, total records: 7000
p_df = sqlContext.read.format("csv")...    # other options elided; num partitions returned: 4, total records: 120k
j_df = i_df.join(p_df, i_df.productId == p_df.product_id)  # total records: 7000, but num partitions: 200
The first two dataframes have 4 partitions each, but as soon as I join them the result shows 200 partitions. I was expecting the join to produce 4 partitions, so why is it showing 200?
I am running this locally with
conf.setIfMissing("spark.master", "local[4]")

200 is the default number of shuffle partitions. You can change it by setting spark.sql.shuffle.partitions.
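For example (a minimal sketch, assuming the same sqlContext and dataframes as in the question, and that AQE is not coalescing partitions further):
sqlContext.setConf("spark.sql.shuffle.partitions", "4")   # set before the join is executed
j_df = i_df.join(p_df, i_df.productId == p_df.product_id)
print(j_df.rdd.getNumPartitions())                        # 4 instead of the default 200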

Related

understanding spark.default.parallelism

As per the documentation:
spark.default.parallelism: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user
spark.default.parallelism: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD
I am not able to reproduce the documented behaviour.
Dataframe:
I create two DataFrames with 50 and 3 partitions respectively and join them. The output should have 50 partitions but always ends up with the number of partitions of the larger DataFrame:
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.join(df2,df1.id==df2.id,'inner')
df3.rdd.getNumPartitions() #3
RDD:
I create two DataFrames with 50 and 3 partitions respectively and join their underlying RDDs. The output RDD should have 50 partitions but has 53:
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.rdd.join(df2.rdd)
df3.getNumPartitions() #53
How should I understand the spark.default.parallelism configuration?
For now I can answer the DataFrame part; I am still experimenting with the RDD case, so I may add an edit later.
For DataFrames there is the parameter spark.sql.shuffle.partitions, which is used during joins etc.; it is set to 200 by default. In your case it is not being used, and there may be two reasons:
Your datasets are smaller than 10 MB, so one of them is broadcast and there is no shuffle during the join; you end up with the number of partitions of the bigger dataset.
You may have AQE (adaptive query execution) enabled, which changes the number of partitions.
I did a quick check with broadcasting and AQE disabled, and the results are as expected:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark.conf.set("spark.sql.adaptive.enabled", False)
spark.conf.set("spark.sql.shuffle.partitions",100)
​
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.join(df2,df1.id==df2.id,'inner')
df3.rdd.getNumPartitions() #100
For the second case with RDDs, I think the RDDs are co-partitioned and because of that Spark does not trigger a full shuffle (see "Understanding Co-partitions and Co-Grouping In Spark").
What I can see in the query plan is that instead of a join Spark uses a union, and that is why you see 53 partitions in the final RDD.
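A small PySpark sketch consistent with that explanation (assuming a SparkSession named spark; the exact count can differ if spark.default.parallelism is set explicitly in the conf):
# PySpark's RDD join is built on union + groupByKey; without an explicit
# numPartitions, the 50 + 3 = 53 partitions of the union are carried through.
rdd1 = spark.sparkContext.parallelize([(i, "a") for i in range(100)], 50)
rdd2 = spark.sparkContext.parallelize([(i, "b") for i in range(100)], 3)
print(rdd1.union(rdd2).getNumPartitions())                   # 53
print(rdd1.join(rdd2).getNumPartitions())                    # 53 as well
print(rdd1.join(rdd2, numPartitions=8).getNumPartitions())   # 8 when set explicitly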

How is the number of partitions for an inner join calculated in Spark?

We have two dataframes, df_A and df_B.
df_A.rdd.getNumPartitions() # => 9
df_B.rdd.getNumPartitions() # => 160
df_A.createOrReplaceTempView('table_A')
df_B.createOrReplaceTempView('table_B')
After creating the joined dataframe via Spark SQL,
df_C = spark.sql("""
select *
from table_A inner join table_B on (...)
""")
df_C.rdd.getNumPartitions() # => 160
How does Spark calculate the number of partitions for the joined dataframe from these two inputs?
Shouldn't the number of partitions of the joined dataframe be 9 * 160 = 1440?
Spark uses 200 partitions by default when shuffling data for joins or aggregations. You can change spark.sql.shuffle.partitions to increase or decrease the number of partitions produced by a join.
https://spark.apache.org/docs/latest/sql-performance-tuning.html
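A hedged sketch of the non-broadcast path (assuming a SparkSession named spark, with AQE and broadcast joins disabled so nothing coalesces or skips the shuffle): both join inputs are hash-partitioned on the join key into the same spark.sql.shuffle.partitions buckets, so the output has that many partitions rather than 9 * 160.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark.conf.set("spark.sql.shuffle.partitions", 300)

df_A = spark.range(0, 1000).repartition(9)      # stand-ins for the real tables
df_B = spark.range(0, 1000).repartition(160)
df_A.createOrReplaceTempView("table_A")
df_B.createOrReplaceTempView("table_B")

df_C = spark.sql("select * from table_A a inner join table_B b on a.id = b.id")
print(df_C.rdd.getNumPartitions())              # 300, the configured shuffle partition count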

How to get the number of partitions in a dataset?

I know there are many questions on this topic, but none really answers my question.
Here is my scenario:
val data_codes = Seq("con_dist_1","con_dist_2","con_dist_3","con_dist_4","con_dist_5")
val codes = data_codes.toDF("item_code")
val partitioned_codes = codes.repartition($"item_code")
println( "getNumPartitions : " + partitioned_codes.rdd.getNumPartitions);
Output :
getNumPartitions : 200
It is supposed to give 5, right? Why is it giving 200? Where am I going wrong, and how do I fix this?
Because 200 is the default value of spark.sql.shuffle.partitions, which is applied to df.repartition. From the docs:
Returns a new Dataset partitioned by the given partitioning
expressions, using spark.sql.shuffle.partitions as number of
partitions. The resulting Dataset is hash partitioned.
The number of partitions is NOT RELATED to the number of (distinct) values in your dataframe. Repartitioning ensures that all records with the same key end up in the same partition, nothing else. So in your case it could be that all records land in one partition and 199 partitions are empty.
Even if you do codes.repartition($"item_code", 5), there is no guarantee that you get 5 equally sized partitions. As far as I know you cannot do this with the DataFrame API; maybe with an RDD and a custom partitioner.
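A small PySpark sketch of the same point (the question's code is Scala, but the behaviour is identical; the per-partition counts shown are only an example, since they depend on the hash of each key):
codes = spark.createDataFrame([("con_dist_%d" % i,) for i in range(1, 6)], ["item_code"])
parted = codes.repartition(5, "item_code")    # explicit target of 5 partitions
print(parted.rdd.getNumPartitions())          # 5
print(parted.rdd.glom().map(len).collect())   # e.g. [0, 2, 1, 2, 0]: uneven, some empty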

Pyspark: Why is show() or count() of a joined spark dataframe so slow?

I have two large Spark dataframes. I joined them on one common column:
df_joined = df1.join(df2.select("id",'label'), "id")
I got the result, but when I try to work with df_joined it is too slow. As far as I know, we need to repartition df1 and df2 to prevent a large number of partitions in df_joined. So I even changed the number of partitions:
df1r = df1.repartition(1)
df2r = df2.repartition(1)
df_joined = df1r.join(df2r.select("id",'label'), "id")
It is still not working.
Any ideas?
Spark runs 1 concurrent task for every partition of an RDD / DataFrame (up to the number of cores in the cluster).
If your cluster has 20 cores, you should have at least 20 partitions (in practice 2–3x more). On the other hand, a single partition typically shouldn't contain more than 128 MB.
So, instead of the two lines below, which repartition your dataframes into a single partition:
df1r = df1.repartition(1)
df2r = df2.repartition(1)
repartition your data on the 'id' column (the join key) into n partitions, where n depends on the data size and the number of cores in your cluster:
df1r = df1.repartition(n, "id")
df2r = df2.repartition(n, "id")
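A rough sketch of one way to pick n (a heuristic only; the 3x factor is illustrative and should be tuned against the ~128 MB-per-partition guideline above):
cores = spark.sparkContext.defaultParallelism   # roughly the total cores available to the app
n = cores * 3                                   # aim for ~2-3 tasks per core

df1r = df1.repartition(n, "id")
df2r = df2.repartition(n, "id")
df_joined = df1r.join(df2r.select("id", "label"), "id")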

Spark repartition is slow and shuffles too much data

My cluster:
5 data node
each data node has: 8 CPUs, 45GB memory
Due to some other configuration limit, I can only start 5 executors on each data node. So I did
spark-submit --num-executors 30 --executor-memory 2G ...
So each executor uses 1 core.
I have two data set, each is about 20 GB. In my code, I did:
val rdd1 = sc...cache()
val rdd2 = sc...cache()
val x = rdd1.cartesian(rdd2).repartition(30) map ...
In the Spark UI I saw that the repartition step took more than 30 minutes and caused a shuffle of more than 150 GB.
I don't think that is right, but I could not figure out what is going wrong...
Did you really mean "cartesian"?
You are multiplying every row in RDD1 by every row in RDD2. So if your rows were 1 MB each, you had about 20,000 rows per RDD. The cartesian product will return a set with 20,000 x 20,000 = 400 million records. And note that each row would now be double in width (2 MB), so RDD3 would be roughly 800 TB, whereas RDD1 and RDD2 were only 20 GB each.
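A quick way to gauge the blow-up before running the cartesian product (shown in PySpark for illustration; rdd1 and rdd2 stand for the cached RDDs above): the output row count is the product of the two input counts, so the data volume grows roughly quadratically.
n1, n2 = rdd1.count(), rdd2.count()   # cheap to recompute since both RDDs are cached
print("cartesian rows:", n1 * n2)     # e.g. 20,000 * 20,000 = 400,000,000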
Perhaps try:
val x = rdd1.union(rdd2).repartition(30) map ...
or maybe even:
val x = rdd1.zip(rdd2).repartition(30) map ...
?
