Pyspark Operations are Slower than Hive - apache-spark

I have 3 dataframes: df1, df2 and df3.
Each dataframe has approximately 3 million rows. df1 and df3 have approximately 8 columns each; df2 has only 3 columns.
(The source text file of df1 is approximately 600 MB.)
These are the operations performed:
df_new=df1 left join df2 ->group by df1 columns->select df1 columns, first(df2 columns)
df_final = df_new outer join df3
df_split1 = df_final filtered using condition1
df_split2 = df_final filtered using condition2
write df_split1,df_split2 into a single table after performing different operations on both dataframes
This entire process takes 15 minutes in PySpark 1.3.1, with the default partition value = 10, executor memory = 30G, driver memory = 10G, and I have used cache() wherever necessary.
But when I run the same logic as Hive queries, it hardly takes 5 minutes. Is there any particular reason why my dataframe operations are slow, and is there any way I can improve the performance?

You should be careful with your use of JOIN.
Joins in Spark can be really expensive, especially joins between two dataframes. You can avoid the expensive shuffle by repartitioning the two dataframes on the same column, or by using the same partitioner for both.

Related

Find difference between two dataframes of large data

I need to get the difference in data between two dataframes. I'm using subtract() for this.
# Dataframes that need to be compared
df1
df2
#df1-df2
new_df1 = df1.subtract(df2)
#df2-df1
new_df2 = df2.subtract(df1)
It works fine and the output is what I needed, but my only issue is with the performance.
Even for comparing 1gb of data, it takes around 50 minutes, which is far from ideal.
Is there any other optimised method to perform the same operation?
Following are some of the details regarding the dataframes:
df1 size = 9397995 * 30
df2 size = 1500000 * 30
All 30 columns are of dtype string.
Both dataframes are being loaded from a database through a JDBC connection.
Both dataframes have same column names and in same order.
You could use a "WHERE" clause on the datasets to filter out rows you don't need before the subtract. For example via the primary key: if you know that df1 has PKs ranging from 1 to 100, just filter df2 for those PKs (obviously after a union).

understanding spark.default.parallelism

As per the documentation:
spark.default.parallelism: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user
spark.default.parallelism: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD
I am not able to reproduce the documented behaviour.
Dataframe:
I create 2 DFs with 50 and 3 partitions respectively and join them. The output should have 50 partitions, but it always has the same number of partitions as the larger DF:
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.join(df2,df1.id==df2.id,'inner')
df3.rdd.getNumPartitions() #3
RDD:
I create 2 DFs with 50 and 3 partitions respectively and join their underlying RDDs. The output RDD should have 50 partitions, but it has 53 partitions:
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.rdd.join(df2.rdd)
df3.getNumPartitions() #53
How to understand the conf spark.default.parallelism?
For now I can answer the DF part; I am experimenting with RDDs, so maybe I will add an edit later.
For DFs there is the parameter spark.sql.shuffle.partitions, which is used during joins etc.; it is set to 200 by default. In your case it is not used, and there may be two reasons:
Your datasets are smaller than 10 MB, so one dataset is broadcast and there is no shuffle during the join, so you end up with the number of partitions from the bigger dataset.
You may have AQE (adaptive query execution) enabled, which changes the number of partitions.
I did a quick check with broadcast and AQE disabled, and the results are as expected:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark.conf.set("spark.sql.adaptive.enabled", False)
spark.conf.set("spark.sql.shuffle.partitions",100)
df1 = spark.range(1,10000).repartition(50)
df2 = spark.range(1,100000).repartition(3)
df3 = df1.join(df2,df1.id==df2.id,'inner')
df3.rdd.getNumPartitions() #100
Out[35]: 100
For the second case with RDDs, I think the RDDs are co-partitioned, and because of that Spark does not trigger a full shuffle (see "Understanding Co-partitions and Co-Grouping In Spark").
What I can see in the query plan is that instead of a join Spark uses a union, and that is why the final RDD has 53 partitions (50 + 3).

Should I remove not-required columns from DFs before joining them in Spark?

Does it make sense to remove not-required columns from Spark dataframes before joining them?
For example:
DF1 has 10 columns, DF2 has 15 columns, DF3 has 25 columns.
I want to join them, select the 10 columns I need, and save the result as .parquet.
Does it make sense to transform the DFs with a select of only the needed columns before the join, or will the Spark engine optimize the join by itself and not operate with all 50 columns during the join operation?
Yes, it makes perfect sense, because it reduces the amount of data shuffled between executors. And it's better to select only the necessary columns as early as possible - in most cases, if the file format allows it (Parquet, Delta Lake), Spark will read data only for the necessary columns, not all of them. I.e.:
df1 = spark.read.parquet("file1") \
.select("col1", "col2", "col3")
df2 = spark.read.parquet("file2") \
.select("col1", "col5", "col6")
joined = df1.join(df2, "col1")

Partitioning the data while reading from hive/hdfs based on column values in Spark

I have 2 Spark dataframes that I read from Hive using the sqlContext. Let's call these dataframes df1 and df2. The data in both dataframes is sorted on a column called PolicyNumber at the Hive level. PolicyNumber also happens to be the primary key for both dataframes. Below are sample values for both dataframes, although in reality both my dataframes are huge and spread across 5 executors as 5 partitions. For simplicity's sake, I will assume each partition has one record.
Sample df1
PolicyNumber FirstName
1 A
2 B
3 C
4 D
5 E
Sample df2
PolicyNumber PremiumAmount
1 450
2 890
3 345
4 563
5 2341
Now, I want to join df1 and df2 on PolicyNumber column. I can run the below piece of code and get my required output.
df1.join(df2, df1.PolicyNumber == df2.PolicyNumber)
Now, I want to avoid as much shuffle as possible to make this join efficient. So to avoid shuffle, while reading from hive, I want to partition df1 based on values of PolicyNumber Column in such a way that the row with PolicyNumber 1 will go to Executor 1, row with PolicyNumber 2 will go to Executor 2, row with PolicyNumber 3 will go to Executor 3 and so on. And I want to partition df2 in the exact same way I did for df1 as well.
This way, Executor 1 will now have the row from df1 with PolicyNumber=1 and also the row from df2 with PolicyNumber=1 as well.
Similarly, Executor 2 will have the row from df1 with PolicyNumber=2 and also the row from df2 with PolicyNumber=2, and so on.
This way, there will not be any shuffle required as now, the data is local to that executor.
My question is, is there a way to control the partitions in this granularity? And if yes, how do I do it.
Unfortunately there is no direct control over which data ends up on each executor. However, while reading the data into each dataframe, you can use CLUSTER BY on the join column, which helps distribute the sorted data to the right executors.
ex:
df1 = sqlContext.sql("SELECT * FROM TABLE1 CLUSTER BY JOIN_COLUMN")
df2 = sqlContext.sql("SELECT * FROM TABLE2 CLUSTER BY JOIN_COLUMN")
hope it helps.

Why do I get so many empty partitions when repartitioning a Spark Dataframe?

I want to partition a dataframe "df1" on 3 columns. This dataframe has exactly 990 unique combinations of those 3 columns:
In [17]: df1.createOrReplaceTempView("df1_view")
In [18]: spark.sql("select count(*) from (select distinct(col1,col2,col3) from df1_view) as t").show()
+--------+
|count(1)|
+--------+
| 990|
+--------+
In order to optimize the processing of this dataframe, I want to partition df1 in order to get 990 partitions, one for each key possibility:
In [19]: df1.rdd.getNumPartitions()
Out[19]: 24
In [20]: df2 = df1.repartition(990, "col1", "col2", "col3")
In [21]: df2.rdd.getNumPartitions()
Out[21]: 990
I wrote a simple way to count rows in each partition:
In [22]: def f(iterator):
    ...:     count = 0
    ...:     for row in iterator:   # iterates over the rows of one partition
    ...:         count += 1
    ...:     print(count)           # printed on the executors; visible in local mode
    ...:
In [23]: df2.foreachPartition(f)
And I notice that what I actually get is 628 partitions with one or more key values, and 362 empty partitions.
I assumed Spark would repartition evenly (1 key value = 1 partition), but that does not seem to be the case, and I feel like this repartitioning is adding data skew even though it should be the other way around...
What's the algorithm Spark uses to partition a dataframe on columns ?
Is there a way to achieve what I thought was possible ?
I'm using Spark 2.2.0 on Cloudera.
To distribute data across partitions, Spark needs to somehow convert the value of the column into a partition index. There are two default partitioners in Spark - HashPartitioner and RangePartitioner. Different transformations in Spark can apply different partitioners - e.g. join will apply the hash partitioner.
Basically, for the hash partitioner the formula to convert a value into a partition index is value.hashCode() % numOfPartitions. In your case, multiple values map to the same partition index, leaving other partitions empty.
You could implement your own partitioner if you want a better distribution.
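The ~628/362 split is exactly what uniform hashing predicts: throwing 990 distinct keys into 990 hash buckets leaves about a 1/e fraction (~37%) of buckets empty. This can be reproduced without Spark, using Python's built-in hash as a stand-in for Spark's hash function:

```python
# Distribute 990 distinct 3-column keys over 990 partitions via
# hash(key) % n and count occupied vs. empty partitions. Python's
# hash() is only a stand-in for Spark's hash, but the collision
# statistics are the same for any uniform hash.
n = 990
keys = [("a%d" % i, "b%d" % (i % 33), i) for i in range(n)]  # 990 distinct keys
occupied = len({hash(k) % n for k in keys})
print(occupied, n - occupied)  # typically around 626 occupied / 364 empty
```

So the observed 628 non-empty partitions are not a Spark bug, just hash collisions; only a custom partitioner that maps each distinct key to its own index gives the 1-key-per-partition layout the question asks for.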
