Spark repartition is slow and shuffles too much data

My cluster:
5 data nodes
each data node has: 8 CPUs, 45 GB memory
Due to some other configuration limit, I can only start 5 executors on each data node. So I did:
spark-submit --num-executors 30 --executor-memory 2G ...
so each executor uses 1 core.
I have two data sets, each about 20 GB. In my code, I did:
val rdd1 = sc...cache()
val rdd2 = sc...cache()
val x = rdd1.cartesian(rdd2).repartition(30).map ...
In the Spark UI, I saw that the repartition step took more than 30 minutes and caused a data shuffle of more than 150 GB.
I don't think that is right, but I could not figure out what went wrong...

Did you really mean "cartesian"?
You are pairing every row in rdd1 with every row in rdd2. If your rows averaged 1 MB each, each 20 GB RDD would hold about 20,000 rows, and the cartesian product would contain 20,000 x 20,000 = 400 million pairs. Each pair carries both rows, so it is about twice as wide (~2 MB), giving roughly 800 TB in the result, whereas rdd1 and rdd2 were only 20 GB each. (With smaller rows, the row counts, and hence the blow-up, are even larger.)
Perhaps try:
val x = rdd1.union(rdd2).repartition(30).map ...
or maybe even:
val x = rdd1.zip(rdd2).repartition(30).map ...
(note that zip requires both RDDs to have the same number of partitions and the same number of elements per partition)?
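The blow-up above is easy to sanity-check with plain arithmetic (a sketch; the 1 MB average row size is an assumption for illustration):

```python
GB = 1024 ** 3
MB = 1024 ** 2

row_size = 1 * MB                    # assumed average row width of 1 MB
rows_per_rdd = (20 * GB) // row_size # 20 GB per RDD -> 20,480 rows
pairs = rows_per_rdd ** 2            # size of the cartesian product
output_bytes = pairs * 2 * row_size  # each pair carries both rows

print(rows_per_rdd)              # 20480
print(pairs)                     # 419430400 (~400 million pairs)
print(output_bytes // 1024**4)   # 800 (terabytes)
```

Even before the repartition, materializing that product is the real cost; the repartition just makes the shuffle visible in the UI.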

Related

understanding spark.default.parallelism

As per the documentation:
spark.default.parallelism: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
spark.default.parallelism: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD.
I am not able to reproduce the documented behaviour.
Dataframe:
I create 2 DFs, with 50 and 3 partitions respectively, and join them. The output should have 50 partitions but instead always has the number of partitions of the larger (by row count) DF:
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.join(df2,df1.id==df2.id,'inner')
df3.rdd.getNumPartitions() #3
RDD:
I create 2 DFs, with 50 and 3 partitions respectively, and join their underlying RDDs. The output RDD should have 50 partitions but has 53:
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.rdd.join(df2.rdd)
df3.getNumPartitions() #53
How to understand the conf spark.default.parallelism?
For now I can answer the DataFrame part; I am still experimenting with RDDs, so I may edit this answer later.
For DataFrames there is the parameter spark.sql.shuffle.partitions, which is used during joins etc.; it is set to 200 by default. In your case it is not used, and there may be two reasons:
Your datasets are smaller than 10 MB, so one dataset is broadcast and there is no shuffle during the join; you end up with the number of partitions of the bigger dataset.
You may have AQE (Adaptive Query Execution) enabled, which changes the number of partitions.
I did a quick check with broadcast and AQE disabled, and the results are as expected:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark.conf.set("spark.sql.adaptive.enabled", False)
spark.conf.set("spark.sql.shuffle.partitions", 100)

df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.join(df2, df1.id==df2.id, 'inner')
df3.rdd.getNumPartitions() # 100
For the second case, with RDDs, I think the RDDs are co-partitioned and because of that Spark does not trigger a full shuffle (see "Understanding Co-partitions and Co-Grouping In Spark").
What I can see in the query plan is that instead of a shuffled join Spark uses a union, and that is why you see 53 (50 + 3) partitions in the final RDD.

Pyspark: Why show() or count() of a joined spark dataframe is so slow?

I have two large Spark DataFrames. I joined them on a common column:
df_joined = df1.join(df2.select("id", 'label'), "id")
I got the result, but when I want to work with df_joined it is too slow. As I understand it, we need to repartition df1 and df2 to prevent a large number of partitions in df_joined. So I even changed the number of partitions:
df1r = df1.repartition(1)
df2r = df2.repartition(1)
df_joined = df1r.join(df2r.select("id", 'label'), "id")
It is still NOT working.
Any ideas?
Spark runs 1 concurrent task for every partition of an RDD/DataFrame (up to the number of cores in the cluster).
If your cluster has 20 cores, you should have at least 20 partitions (in practice 2-3x more). On the other hand, a single partition typically shouldn't contain more than 128 MB.
So, instead of the two lines below, which repartition your dataframes into 1 partition:
df1r = df1.repartition(1)
df2r = df2.repartition(1)
repartition your data on the 'id' column (the joining key) into n partitions, where n depends on the data size and the number of cores in the cluster:
df1r = df1.repartition(n, "id")
df2r = df2.repartition(n, "id")

Spark dataframe partition count

I am confused about how Spark creates partitions for a dataframe. Here is the list of steps and the partition counts:
i_df = sqlContext.read.json("json files") // num partitions returned is 4, total records 7000
p_df = sqlContext.read.format("csv").Other options // num partitions returned is 4, total records 120k
j_df = i_df.join(p_df, i_df.productId == p_df.product_id) // total records 7000, but num partitions is 200
The first two dataframes have 4 partitions, but as soon as I join them the result shows 200 partitions. I was expecting 4 partitions after the join, so why is it showing 200?
I am running it on local with
conf.setIfMissing("spark.master", "local[4]")
200 is the default number of shuffle partitions. You can change it by setting spark.sql.shuffle.partitions.

Spark dataframe execution stuck for Sort & dropDuplicates operations

I have a dataframe with 1.4 billion rows and 20 columns. Here is my code:
df = sqlContext.read.parquet('path0')
df = df.sort('col_10')
df = df.dropDuplicates(['col_1', 'col_2', 'col_3', 'col_4', 'col_5'])
df.write.parquet('path1')
For the final write operation, out of 200 total tasks, 194 complete quickly (~6 min), but the remaining 6 take ~30 min.
What is preventing Spark from parallelizing the tasks properly?

Spark dataframe select operation and number of partitions

I am using Spark 1.5.0
I am doing a broadcast join, since one of my dataframes is around 30 GB (large_df) and the other is around 10 MB (small_df). Here is my code:
df1 = large_df.join(broadcast(small_df), large_df("col2") === small_df("s_col2"))
Right after this, if I get the number of partitions for df1, I see the correct number (1000):
df1.rdd.partitions.size // 1000
Now I do a projection to select only certain columns of df1:
df2 = df1.select("col2", "col4", "col6", "col8")
Right after this, if I get the number of partitions for df2, I see a smaller number (200). I am not sure whether it is just this select or some other operation that is changing the number of partitions of my dataframe:
df2.rdd.partitions.size // 200
How do I make sure that the number of partitions is not reduced?
You can set the following property of your SparkConf to 1000:
spark.sql.shuffle.partitions
More info here: https://spark.apache.org/docs/1.2.0/sql-programming-guide.html
spark.sql.shuffle.partitions
Default: 200
Configures the number of partitions to use when shuffling data for joins or aggregations.
val df = Seq(
  ("A", 1), ("B", 2), ("A", 3), ("C", 1)
).toDF("k", "v")
df.rdd.getNumPartitions
Whenever you do shuffle operations on dataframes, the default number of partitions is 200:
val partitioned = df.repartition($"k")
partitioned.rdd.getNumPartitions // Results: 200
