Spark Join optimization - apache-spark

Let's say I have two dataframes, A and B, that I want to join with an inner join. Each one has 100 columns and billions of rows.
If in my use case I'm only interested in 10 columns of A and 4 columns of B, does Spark do the optimization for me and shuffle only those 14 columns, or will it shuffle everything and then select the 14 columns?
Query 1:
A_select = A.select("{10 columns}").as("A")
B_select = B.select("{4 columns}").as("B")
result = A_select.join(B_select, $"A.id" === $"B.id")
Query 2:
A.join(B, $"A.id" === $"B.id").select("{14 columns}")
Is Query 1 == Query 2 in terms of behavior, execution time, and data shuffling?
Thanks in advance for your answers.

Yes, Spark will handle the optimization for you. Because of lazy evaluation and the Catalyst optimizer's column pruning, only the required attributes will be selected from the dataframes (A and B).
You can use the explain function to view the logical/physical plan:
result.explain()
Both queries will produce the same physical plan, so execution time and data shuffling will be the same.
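For a quick check, here is a minimal PySpark sketch (the column names a1/a2/b1 are made up, standing in for the real 100 columns) that builds both queries and prints their plans; both should show the same pruned set of columns being read:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# tiny stand-ins for A and B (hypothetical column names)
A = spark.createDataFrame([(1, "x", "y")], ["id", "a1", "a2"])
B = spark.createDataFrame([(1, "z")], ["id", "b1"])

# Query 1: select first, then join
q1 = (A.select("id", "a1", "a2").alias("A")
      .join(B.select("id", "b1").alias("B"), F.col("A.id") == F.col("B.id")))

# Query 2: join first, then select
q2 = (A.alias("A")
      .join(B.alias("B"), F.col("A.id") == F.col("B.id"))
      .select("A.id", "A.a1", "A.a2", "B.b1"))

# both optimized/physical plans should show the same pruned column set
q1.explain(True)
q2.explain(True)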
Reference: PySpark documentation for the explain function.

Related

Spark Window Function Null Skew

Recently I've encountered an issue running one of our PySpark jobs. While analyzing the stages in the Spark UI, I noticed that the longest running stage takes 1.2 hours out of the 2.5 hours the entire process takes to run.
Once I took a look at the stage details, it was clear that I'm facing severe data skew: a single task runs for the entire 1.2 hours while all other tasks finish within 23 seconds.
The DAG showed this stage involves window functions, which helped me quickly narrow down the problematic area to a few queries and find the root cause: the column account, used in Window.partitionBy("account"), had 25% null values.
I don't need the sum for the null accounts, but I do need those rows for further calculations, so I can't filter them out before the window function.
Here is my window function query:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, sum

problematic_account_window = Window.partitionBy("account")
sales_with_account_total_df = sales_df.withColumn("sum_sales_per_account", sum(col("price")).over(problematic_account_window))
So we found the one to blame - What can we do now? How can we resolve the skew and the performance issue?
We basically have 2 solutions for this issue:
Break the initial dataframe into 2 different dataframes: one that filters out the null values and calculates the sum, and a second that contains only the null values and is not part of the calculation. Lastly, we union the two back together.
Apply salting technique on the null values in order to spread the nulls on all partitions and provide stability to the stage.
Solution 1:
account_window = Window.partitionBy("account")
# split to null and non null
non_null_accounts_df = sales_df.where(col("account").isNotNull())
only_null_accounts_df = sales_df.where(col("account").isNull())
# calculate the sum for the non null
sales_with_non_null_accounts_df = non_null_accounts_df.withColumn("sum_sales_per_account", sum(col("price")).over(account_window))
# union the calculated result with the null-account df to build the final result
sales_with_account_total_df = sales_with_non_null_accounts_df.unionByName(only_null_accounts_df, allowMissingColumns=True)
Solution 2:
SPARK_SHUFFLE_PARTITIONS = int(spark.conf.get("spark.sql.shuffle.partitions"))
modified_sales_df = (sales_df
# create a random partition value that spans as much as number of shuffle partitions
.withColumn("random_salt_partition", lit(ceil(rand() * SPARK_SHUFFLE_PARTITIONS)))
# use the random partition values only in case the account value is null
.withColumn("salted_account", coalesce(col("account"), col("random_salt_partition")))
)
# modify the partition to use the salted account
salted_account_window = Window.partitionBy("salted_account")
# use the salted account window to calculate the sum of sales
sales_with_account_total_df = modified_sales_df.withColumn("sum_sales_per_account", sum(col("price")).over(salted_account_window))
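As a quick sanity check (a sketch on top of the Solution 2 code above, not part of the original post), you can look at the group sizes of the salted column; the formerly huge null group should now be split into many roughly even groups:
from pyspark.sql.functions import col

# count rows per salted partition key; no single group should dominate anymore
(modified_sales_df
    .groupBy("salted_account")
    .count()
    .orderBy(col("count").desc())
    .show(10))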
I decided to go with solution 2, since it didn't force me to create more dataframes for the sake of the calculation.
The salting technique resolved the skew: the exact same stage now runs in a total of 5.5 minutes instead of 1.2 hours. The only modification in the code was the salting column in the partitionBy. The comparison is based on the exact same cluster, node count, and cluster config.

Spark request only a partial sorting for row_number().over partitioned window

Version: DBR 8.4 | Spark 3.1.2
I'm trying to get the top 500 rows per partition, but I can see from the query plan that it is sorting the entire data set (50K rows per partition) before eventually filtering to the rows I care about.
max_rank = 500
ranking_order = (Window.partitionBy(['category', 'id'])
    .orderBy(F.col('primary').desc(), F.col('secondary')))
df_ranked = (df
.withColumn('rank', F.row_number().over(ranking_order))
.where(F.col('rank') <= max_rank)
)
df_ranked.explain()
I read elsewhere that expressions such as df.orderBy(desc("value")).limit(n) are optimized by the query planner to use TakeOrderedAndProject and avoid sorting the entire table. Is there a similar approach I can use here to trigger an optimization and avoid fully sorting all partitions?
For context, right now my query is taking 3.5 hours on a beefy 4 worker x 40 core cluster, and shuffle write time surrounding this query (including some projections not listed above) appears to be my biggest bottleneck, so I'm trying to cut down the amount of data as soon as possible.
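For reference, a standalone sketch (toy data, not from the question) of the global top-n case mentioned above; its physical plan uses TakeOrderedAndProject instead of a full sort, which is the behavior being asked for on a per-partition basis:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# toy data: the plan, not the numbers, is the point here
df = spark.range(100000).withColumn("value", F.rand())

# global top-n: orderBy + limit is planned as TakeOrderedAndProject
df.orderBy(F.col("value").desc()).limit(500).explain()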

avoid shuffling and long plans on multiple joins in pyspark

I am doing multiple joins with the same dataframe.
The dataframes I am joining with are the result of a group by on my original dataframe.
listOfCols = ["a","b","c",....]
for c in listOfCols:
    means = df.groupby(col(c)).agg(mean(target).alias(f"{c}_mean_encoding"))
    df = df.join(means, c, how="left")
This code produces more than 100,000 tasks and takes forever to finish.
I see a lot of shuffling happening in the DAG.
How can I optimize this code?
Well, after a LOT of tries and failures, I came up with the fastest solution.
Instead of 1.5 hours, this job now runs in 5 minutes...
I will put it here so that if someone stumbles onto it, they won't suffer as I did...
The solution was to use Spark SQL, which must be much more optimized internally than the dataframe API:
df.registerTempTable("df")
left_join_string = ""
for c in listOfCols:
    means = df.groupby(F.col(c)).agg(F.mean(target).alias(f"{c}_mean_encoding"))
    means.registerTempTable(f"means_{c}")
    left_join_string += f" left join means_{c} on df.{c} = means_{c}.{c}"
df = sqlContext.sql("SELECT * FROM df" + left_join_string)

How to optimize this short Spark Apache DataFrame function

The following function is supposed to join two DataFrames and return the number of checkouts per location. It is based on the Seattle Public Library data set.
def topKCheckoutLocations(checkoutDF: DataFrame, libraryInventoryDF: DataFrame, k: Int): DataFrame = {
checkoutDF
.join(libraryInventoryDF, "ItemType")
.groupBy("ItemBarCode", "ItemLocation") //grouping by ItemBarCode and ItemLocation
.agg(count("ItemBarCode")) //counting number of ItemBarCode for each ItemLocation
.withColumnRenamed("count(ItemBarCode)", "NumCheckoutItemsAtLocation")
.select($"ItemLocation", $"NumCheckoutItemsAtLocation")
}
When I run this, it takes ages to finish (40+ minutes), and I'm pretty sure it is not supposed to take more than a couple of minutes. Can I change the order of the calls to decrease computation time?
As I never managed to finish the computation, I never actually got to check whether the output is correct. I assume it is.
The checkoutDF has 3 million rows.
For Spark job performance:
Select only the required columns from each dataset before the join, to decrease the data size.
Partition both datasets by the join column ("ItemType") to avoid shuffling.
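A rough PySpark sketch of that advice (the question's code is Scala; which input holds ItemBarCode versus ItemLocation is assumed here, not taken from the data set):
from pyspark.sql import functions as F

# keep only the columns the query actually needs before joining
checkout_small = checkoutDF.select("ItemType", "ItemBarCode")            # assumes ItemBarCode lives in checkoutDF
inventory_small = libraryInventoryDF.select("ItemType", "ItemLocation")  # assumes ItemLocation lives in libraryInventoryDF

# repartition both sides on the join key
checkout_small = checkout_small.repartition("ItemType")
inventory_small = inventory_small.repartition("ItemType")

result = (checkout_small
    .join(inventory_small, "ItemType")
    .groupBy("ItemBarCode", "ItemLocation")
    .agg(F.count("ItemBarCode").alias("NumCheckoutItemsAtLocation"))
    .select("ItemLocation", "NumCheckoutItemsAtLocation"))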

Spark window function on dataframe with large number of columns

I have an ML dataframe which I read from csv files. It contains three types of columns:
ID Timestamp Feature1 Feature2...Feature_n
where n is ~500 (500 features in ML parlance). The total number of rows in the dataset is ~160 million.
As this is the result of a previous full join, there are many features which do not have values set.
My aim is to run a "fill" function (fillna-style, as in Python pandas), where each empty feature value gets set to the previously available value for that column, per ID and date.
I am trying to achieve this with the following spark 2.2.1 code:
val rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(-50000, -1)
val columns = Array(...) //first 30 columns initially, just to see it working
val rawDataSetFilled = columns.foldLeft(rawDataset) { (originalDF, columnToFill) =>
originalDF.withColumn(columnToFill, coalesce(col(columnToFill), last(col(columnToFill), ignoreNulls = true).over(window)))
}
I am running this job on 4 m4.large instances on Amazon EMR, with Spark 2.2.1 and dynamic allocation enabled.
The job runs for over 2 hours without completing.
Am I doing something wrong at the code level? Given the size of the data and the instances, I would assume it should finish in a reasonable amount of time. And I haven't even tried with the full 500 columns, just with about 30!
Looking in the container logs, all I see are many logs like this:
INFO codegen.CodeGenerator: Code generated in 166.677493 ms
INFO execution.ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I have tried setting the parameter spark.sql.windowExec.buffer.spill.threshold to something larger, without any impact. Is there some other setting I should know about? Those 2 lines are the only ones I see in any container log.
In Ganglia, I see most of the CPU cores peaking around full usage, but the memory usage is lower than the maximum available. All executors are allocated and are doing work.
I have managed to rewrite the foldLeft logic without using withColumn calls. Apparently they can be very slow for a large number of columns, and I was also getting stack overflow errors because of that.
I would be curious to know why there is such a massive difference, and what exactly happens behind the scenes with the query plan execution that makes repeated withColumn calls so slow.
Links which proved very helpful: Spark Jira issue and this stackoverflow question
var rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(Window.unboundedPreceding, Window.currentRow)
rawDataset = rawDataset.select(rawDataset.columns.map(column => coalesce(col(column), last(col(column), ignoreNulls = true).over(window)).alias(column)): _*)
rawDataset.write.option("header", "true").csv(outputLocation)
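For anyone doing the same from PySpark, a rough equivalent of the single-select approach above (the paths and session setup are placeholders, not from the original code):
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.option("header", "true").csv("s3://bucket/input")  # placeholder path

window = (Window.partitionBy("ID")
    .orderBy("DATE")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# build all forward-filled columns in one select instead of hundreds of withColumn calls
filled = raw.select([
    F.coalesce(F.col(c), F.last(F.col(c), ignorenulls=True).over(window)).alias(c)
    for c in raw.columns
])

filled.write.option("header", "true").csv("s3://bucket/output")  # placeholder path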
