Spark dataframe execution stuck for Sort & dropDuplicates operations - apache-spark

I have a dataframe with 1.4 billion rows and 20 columns. Here is my code:
df = sqlContext.read.parquet('path0')
df = df.sort('col_10')
df = df.dropDuplicates(['col_1', 'col_2', 'col_3', 'col_4', 'col_5'])
df.write.parquet('path1')
For the final write operation, 194 of the 200 tasks complete quickly (~6 min), but the remaining 6 tasks take ~30 min.
What is preventing Spark from parallelizing the tasks evenly?

Related

understanding spark.default.parallelism

As per the documentation:
spark.default.parallelism:Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user
spark.default.parallelism: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD
I am not able to reproduce the documented behaviour.
Dataframe:
I create two DataFrames with 50 and 3 partitions respectively and join them. The output should have 50 partitions, but it always has the number of partitions of the larger DataFrame (3 here).
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.join(df2,df1.id==df2.id,'inner')
df3.rdd.getNumPartitions() #3
RDD:
I create two DataFrames with 50 and 3 partitions respectively and join their underlying RDDs. The output RDD should have 50 partitions, but it has 53.
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.rdd.join(df2.rdd)
df3.getNumPartitions() #53
How should I understand the spark.default.parallelism setting?
For now I can answer the DataFrame part; I am still experimenting with RDDs, so I may add an edit later.
For DataFrames there is a parameter, spark.sql.shuffle.partitions, which is used during joins and other shuffles; it is set to 200 by default. In your case it is not used, and there may be two reasons:
Your datasets are smaller than 10 MB, so one dataset is broadcast and there is no shuffle during the join; you end up with the number of partitions of the bigger dataset.
You may have AQE (adaptive query execution) enabled, which changes the number of partitions.
I did a quick check with broadcast joins and AQE disabled, and the results are as expected:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark.conf.set("spark.sql.adaptive.enabled", False)
spark.conf.set("spark.sql.shuffle.partitions",100)
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.join(df2,df1.id==df2.id,'inner')
df3.rdd.getNumPartitions() #100
Out[35]: 100
For the second case, with RDDs, I think the RDDs are co-partitioned and because of that Spark does not trigger a full shuffle: see "Understanding Co-partitions and Co-Grouping In Spark".
What I can see in the query plan is that Spark uses a union instead of a join, which is why the final RDD has 53 partitions (50 + 3).
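The 53 can be reproduced with a toy model in plain Python (no Spark required). It assumes that a PySpark RDD join is a union of both inputs followed by a shuffle, and that the shuffle falls back to the partition count of the unioned RDD when spark.default.parallelism is not set explicitly. The function name and its parameters are hypothetical, not a Spark API.

```python
from typing import Optional


def rdd_join_partitions(left: int, right: int,
                        default_parallelism: Optional[int] = None) -> int:
    """Toy model of the partition count of left_rdd.join(right_rdd)."""
    # Step 1: the inputs are unioned; a union keeps every input partition.
    unioned = left + right
    # Step 2 (assumption): the shuffle uses spark.default.parallelism if it
    # was set explicitly, otherwise the partition count of the unioned RDD.
    if default_parallelism is not None:
        return default_parallelism
    return unioned


print(rdd_join_partitions(50, 3))                         # setting not configured
print(rdd_join_partitions(50, 3, default_parallelism=8))  # setting configured
```

Under this model, the setting only takes effect for RDD joins when it is configured explicitly, which would explain why the documented behaviour is not observed with the defaults.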

How is the number of partitions for an inner join calculated in Spark?

We have two dataframes. df_A and df_B
df_A.rdd.getNumPartitions() # => 9
df_B.rdd.getNumPartitions() # => 160
df_A.createOrReplaceTempView('table_A')
df_B.createOrReplaceTempView('table_B')
After creating the joined dataframe via Spark SQL,
df_C = spark.sql("""
select *
from table_A inner join table_B on (...)
""")
df_C.rdd.getNumPartitions() # => 160
How does Spark calculate the number of partitions of the joined dataframe from those of the two inputs?
Shouldn't the number of partitions of the joined dataframe be 9 * 160 = 1440?
By default, Spark uses 200 partitions when shuffling data for joins or aggregations. You can change spark.sql.shuffle.partitions to increase or decrease the number of partitions produced by a join.
https://spark.apache.org/docs/latest/sql-performance-tuning.html
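One way to reason about the 160 is a toy model in plain Python. It assumes that table_A was small enough to be broadcast (the question does not say so) and that adaptive query execution is off. The function, its parameters and the byte sizes in the example are hypothetical; the 10 MB threshold and the 200 shuffle partitions are the defaults of spark.sql.autoBroadcastJoinThreshold and spark.sql.shuffle.partitions.

```python
def df_join_partitions(left_parts, right_parts, left_bytes, right_bytes,
                       broadcast_threshold=10 * 1024 * 1024,
                       shuffle_partitions=200):
    """Toy model of the partition count of a DataFrame inner join."""
    smaller = min(left_bytes, right_bytes)
    # A negative threshold disables broadcast joins, as in Spark.
    if broadcast_threshold >= 0 and smaller <= broadcast_threshold:
        # Broadcast join: the small side is copied to every executor, so
        # nothing is shuffled and the big side keeps its partitioning.
        return left_parts if left_bytes > right_bytes else right_parts
    # Shuffle join: both sides are repartitioned on the join key.
    return shuffle_partitions


# Made-up sizes: table_A is 1 MB with 9 partitions, table_B is 50 GB with 160.
print(df_join_partitions(9, 160, 1 * 1024**2, 50 * 1024**3))
# Same inputs with broadcast joins disabled.
print(df_join_partitions(9, 160, 1 * 1024**2, 50 * 1024**3,
                         broadcast_threshold=-1))
```

In neither branch is the result the product of the input partition counts; it is either the partitioning of the non-broadcast side or the shuffle setting.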

Spark condition on partition column from another table (Performance)

I have a huge Parquet table, named stored, partitioned on the registration_ts column.
I'd like to filter this table based on data obtained from a small table, stream.
In SQL the query would look like:
spark.sql("select * from stored where exists (select 1 from stream where stream.registration_ts = stored.registration_ts)")
With the DataFrame API:
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi")
This all works, but performance suffers because partition pruning is not applied. Spark does a full scan of the stored table, which is too expensive.
For example, this runs for 2 minutes:
stream.count
res45: Long = 3
//takes 2 minutes
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
[Stage 181:> (0 + 1) / 373]
This runs in 3 seconds:
val stream = stream.where("registration_ts in (20190516204l, 20190515143l,20190510125l, 20190503151l)")
stream.count
res44: Long = 42
//takes 3 seconds
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
The reason is that in the second example the partition filter is propagated to the joined stream table.
I'd like to achieve partition filtering on a dynamic set of partitions.
The only solution I was able to come up with is:
val partitions = stream.select('registration_ts).distinct.collect.map(_.getLong(0))
stored.where('registration_ts.isin(partitions:_*))
This collects the partition values to the driver and runs a second query. It works well only for a small number of partitions; when I tried it with 500k distinct partitions, the delay was significant.
There must be a better way.
Here's one way to do it in PySpark; I've verified in Zeppelin that it uses the set of values to prune the partitions.
from pyspark.sql.functions import col, collect_set
# collect_set returns the distinct values of a column as a list, and collect()
# returns a list of rows. [0] takes the first row, and the second [0] takes its
# first column, which is the list of distinct values.
filter_list = (
    spark.read.orc(HDFS_PATH)
    .agg(collect_set(COLUMN_WITH_FILTER_VALUES))
    .collect()[0][0]
)
# Use filter_list with isin to prune the partitions.
df = spark.read.orc(HDFS_PATH).filter(col(PARTITION_COLUMN).isin(filter_list))
df.show(5)
# You may want to check that filter_list is a valid, non-empty list of values
# before running the second spark.read and pruning the partitions.
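Why a literal list of partition values is fast can be illustrated without Spark. A partitioned table is stored as one directory per partition value, so a filter on the partition column can be answered by choosing directories before any file is read. The sketch below assumes Hive-style directory names (column=value); the paths and the helper name are made up for illustration.

```python
def prune_partitions(partition_dirs, column, wanted_values):
    """Keep only the partition directories whose value is in wanted_values."""
    wanted = {str(v) for v in wanted_values}
    kept = []
    for path in partition_dirs:
        # The last path component looks like "registration_ts=20190516204".
        leaf = path.rstrip("/").rsplit("/", 1)[-1]
        name, _, value = leaf.partition("=")
        if name == column and value in wanted:
            kept.append(path)
    return kept


dirs = [
    "/data/stored/registration_ts=20190516204",
    "/data/stored/registration_ts=20190515143",
    "/data/stored/registration_ts=20190401000",
]
selected = prune_partitions(dirs, "registration_ts", [20190516204, 20190515143])
print(selected)  # only these directories would need to be scanned
```

The cost of this selection grows with the number of values in the list, which is consistent with the slowdown reported for 500k distinct partitions.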

Pyspark: Why show() or count() of a joined spark dataframe is so slow?

I have two large Spark dataframes. I joined them on one common column:
df_joined = df1.join(df2.select("id",'label'), "id")
I get the result, but working with df_joined is too slow. As far as I know, df1 and df2 need to be repartitioned to avoid a large number of partitions in df_joined. So I changed the number of partitions:
df1r = df1.repartition(1)
df2r = df2.repartition(1)
df_joined = df1r.join(df2r.select("id",'label'), "id")
It is still slow.
Any ideas?
Spark runs 1 concurrent task for every partition of an RDD / DataFrame (up to the number of cores in the cluster).
If your cluster has 20 cores, you should have at least 20 partitions (in practice 2–3 times more). On the other hand, a single partition typically shouldn't contain more than 128 MB.
So, instead of the two lines below, which repartition each dataframe into a single partition:
df1r = df1.repartition(1)
df2r = df2.repartition(1)
repartition your data on the 'id' column, the join key, into n partitions (n depends on the data size and the number of cores in the cluster):
df1r = df1.repartition(n, "id")
df2r = df2.repartition(n, "id")
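To pick n, the two rules of thumb above (2–3 partitions per core, and at most about 128 MB per partition) can be combined. This is a minimal sketch in plain Python; the helper name, the default of 3 tasks per core and the 10 GiB example size are assumptions for illustration, not Spark settings.

```python
import math


def suggest_partitions(total_bytes, cores,
                       target_partition_bytes=128 * 1024 * 1024,
                       tasks_per_core=3):
    """Return a partition count satisfying both rules of thumb."""
    # Enough partitions to keep every core busy (2-3x the core count).
    by_cores = cores * tasks_per_core
    # Enough partitions that none exceeds the target size.
    by_size = math.ceil(total_bytes / target_partition_bytes)
    return max(by_cores, by_size, 1)


# Example: 10 GiB of data on a 20-core cluster.
n = suggest_partitions(total_bytes=10 * 1024**3, cores=20)
print(n)
```

The result can then be passed as n to the repartition calls above. Treat it as a starting point and adjust after looking at task durations in the Spark UI.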

Pyspark Operations are Slower than Hive

I have 3 dataframes df1, df2 and df3.
Each dataframe has approximately 3 million rows. df1 and df3 have approximately 8 columns; df2 has only 3 columns.
(The source text file of df1 is approximately 600 MB.)
These are the operations performed:
df_new=df1 left join df2 ->group by df1 columns->select df1 columns, first(df2 columns)
df_final = df_new outer join df3
df_split1 = df_final filtered using condition1
df_split2 = df_final filtered using condition2
write df_split1,df_split2 into a single table after performing different operations on both dataframes
This entire process takes 15 min in PySpark 1.3.1, with default partition value = 10, executor memory = 30G and driver memory = 10G; I have used cache() wherever necessary.
With Hive queries, the same work takes barely 5 min. Is there a particular reason why my dataframe operations are slow, and is there any way to improve the performance?
Be careful with joins.
A join in Spark can be really expensive, especially when it is between two dataframes, because it usually shuffles both sides. You can avoid the expensive shuffle by repartitioning the two dataframes on the same column or by using the same partitioner.
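Why partitioning both sides on the same column helps can be shown with a small plain-Python sketch. When two datasets are hash-partitioned on the join key with the same number of partitions, equal keys land at the same partition index, so each pair of partitions can be joined locally with no further data movement. The helper names and the sample rows are made up for illustration; this is a model of the idea, not Spark's implementation.

```python
def hash_partition(rows, num_partitions):
    """Split (key, value) rows into partitions by hash of the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def copartitioned_join(left_parts, right_parts):
    """Inner join that only ever compares partitions with the same index."""
    out = []
    for left, right in zip(left_parts, right_parts):
        lookup = {}
        for key, value in right:
            lookup.setdefault(key, []).append(value)
        for key, value in left:
            for other in lookup.get(key, []):
                out.append((key, value, other))
    return out


left = [(1, "a"), (2, "b"), (3, "c"), (5, "e")]
right = [(1, "x"), (3, "y"), (4, "z")]
n = 4  # both sides must use the same partition count
joined = copartitioned_join(hash_partition(left, n), hash_partition(right, n))
print(sorted(joined))
```

If the two sides used different partition counts or different keys, matching rows could sit at different indexes and the partition-by-partition join would miss them, which is why a regular join has to shuffle first.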

Resources