How to optimize this short Spark Apache DataFrame function - apache-spark

Following function is supposed to join two DataFrames and return the number of checkouts per location. It is based on the Seattle Public Library data set.
def topKCheckoutLocations(checkoutDF: DataFrame, libraryInventoryDF: DataFrame, k: Int): DataFrame = {
checkoutDF
.join(libraryInventoryDF, "ItemType")
.groupBy("ItemBarCode", "ItemLocation") //grouping by ItemBarCode and ItemLocation
.agg(count("ItemBarCode")) //counting number of ItemBarCode for each ItemLocation
.withColumnRenamed("count(ItemBarCode)", "NumCheckoutItemsAtLocation")
.select($"ItemLocation", $"NumCheckoutItemsAtLocation")
}
When I run this, it takes ages to finish (40+ minutes), and I'm pretty sure it is not supposed to take more than a couple of minutes. Can I change the order of the calls to decrease computation time?
As I never managed to finish computation I never actually got to check whether the output is correct. I assume it is.
The checkoutDF has 3 mio. rows.

For spark job performance
Select the required column from the dataset before joins to
decrease data size
Partition your both dataset by join column ("ItemType") to avoid shuffling

Related

Spark Window Function Null Skew

Recently I've encountered an issue running one of our PySpark jobs. While analyzing the stages in Spark UI I have noticed that the longest running stage takes 1.2 hours to run out of the total 2.5 hours that takes for the entire process to run.
Once I took a look at the stage details it was clear that I'm facing a severe data skew, causing a single task to run for the entire 1.2 hours while all other tasks finish within 23 seconds.
The DAG showed this stage involves Window Functions which helped me to quickly narrow down the problematic area to a few queries and finding the root cause -> The column, account, that was being used in the Window.partitionBy("account") had 25% of null values.
I don't have an interest to calculate the sum for the null accounts though I do need the involved rows for further calculations therefore I can't filter them out prior the window function.
Here is my window function query:
problematic_account_window = Window.partitionBy("account")
sales_with_account_total_df = sales_df.withColumn("sum_sales_per_account", sum(col("price")).over(problematic_account_window))
So we found the one to blame - What can we do now? How can we resolve the skew and the performance issue?
We basically have 2 solutions for this issue:
Break the initial dataframe to 2 different dataframes, one that filters out the null values and calculates the sum on, and the second that contains only the null values and is not part of the calculation. Lastly we union the two together.
Apply salting technique on the null values in order to spread the nulls on all partitions and provide stability to the stage.
Solution 1:
account_window = Window.partitionBy("account")
# split to null and non null
non_null_accounts_df = sales_df.where(col("account").isNotNull())
only_null_accounts_df = sales_df.where(col("account").isNull())
# calculate the sum for the non null
sales_with_non_null_accounts_df = non_null_accounts_df.withColumn("sum_sales_per_account", sum(col("price")).over(account_window)
# union the calculated result and the non null df to the final result
sales_with_account_total_df = sales_with_non_null_accounts_df.unionByName(only_null_accounts_df, allowMissingColumns=True)
Solution 2:
SPARK_SHUFFLE_PARTITIONS = spark.conf.get("spark.sql.shuffle.partitions")
modified_sales_df = (sales_df
# create a random partition value that spans as much as number of shuffle partitions
.withColumn("random_salt_partition", lit(ceil(rand() * SPARK_SHUFFLE_PARTITIONS)))
# use the random partition values only in case the account value is null
.withColumn("salted_account", coalesce(col("account"), col("random_salt_partition")))
)
# modify the partition to use the salted account
salted_account_window = Window.partitionBy("salted_account")
# use the salted account window to calculate the sum of sales
sales_with_account_total_df = sales_df.withColumn("sum_sales_per_account", sum(col("price")).over(salted_account_window))
In my solution I've decided to use solution 2 since it didn't force me to create more dataframes for the sake of the calculation, and here is the result:
As seen above the salting technique helped resolving the skewness. The exact same stage now runs for a total of 5.5 minutes instead of 1.2 hours. The only modification in the code was the salting column in the partitionBy. The comparison shown is based on the exact same cluster/nodes amount/cluster config.

Pyspark - Loop n times - Each loop gets gradually slower

So basically i want to loop n times through my dataframe and apply a function in each loop
(perform a join).
My test-Dataframe is like 1000 rows and in each iteration, exactly one column will be added.
The first three loops perform instantly and from then its gets really really slow.
The 10th loop e.g. needs more than 10 minutes.
I dont understand why this happens because my Dataframe wont grow larger in terms of rows.
If i call my functions with n=20 e.g., the join performs instantly.
But when i loop iteratively 20 times, it gets stucked soon.
You have any idea what can potentially cause this problem?
Examble Code from Evaluating Spark DataFrame in loop slows down with every iteration, all work done by controller
import time
from pyspark import SparkContext
sc = SparkContext()
def push_and_pop(rdd):
# two transformations: moves the head element to the tail
first = rdd.first()
return rdd.filter(
lambda obj: obj != first
).union(
sc.parallelize([first])
)
def serialize_and_deserialize(rdd):
# perform a collect() action to evaluate the rdd and create a new instance
return sc.parallelize(rdd.collect())
def do_test(serialize=False):
rdd = sc.parallelize(range(1000))
for i in xrange(25):
t0 = time.time()
rdd = push_and_pop(rdd)
if serialize:
rdd = serialize_and_deserialize(rdd)
print "%.3f" % (time.time() - t0)
do_test()
I have fixed this issue with converting the df every n times to a rdd and back to df.
Code runs fast now. But i dont understand what exactly is the reason for that. The explain plan seems to rise very fast during iterations if i dont do the conversion.
This fix is also issued in the book "High Performance Spark" with this workaround.
While the Catalyst optimizer is quite powerful, one of the cases where
it currently runs into challenges is with very large query plans.
These query plans tend to be the result of iterative algorithms, like
graph algorithms or machine learning algorithms. One simple workaround
for this is converting the data to an RDD and back to
DataFrame/Dataset at the end of each iteration

Faster way to count values greater than 0 in Spark DataFrame?

I have a Spark DataFrame where all fields are integer type. I need to count how many individual cells are greater than 0.
I am running locally and have a DataFrame with 17,000 rows and 450 columns.
I have tried two methods, both yielding slow results:
Version 1:
(for (c <- df.columns) yield df.where(s"$c > 0").count).sum
Version 2:
df.columns.map(c => df.filter(df(c) > 0).count)
This calculation takes 80 seconds of wall clock time. With Python Pandas, it takes a fraction of second. I am aware that for small data sets and local operation, Python may perform better, but this seems extreme.
Trying to make a Spark-to-Spark comparison, I find that running MLlib's PCA algorithm on the same data (converted to a RowMatrix) takes less than 2 seconds!
Is there a more efficient implementation I should be using?
If not, how is the seemingly much more complex PCA calculation so much faster?
What to do
import org.apache.spark.sql.functions.{col, count, when}
df.select(df.columns map (c => count(when(col(c) > 0, 1)) as c): _*)
Why
Your both attempts create number of jobs proportional to the number of columns. Computing the execution plan and scheduling the job alone are expensive and add significant overhead depending on the amount of data.
Furthermore, data might be loaded from disk and / or parsed each time the job is executed, unless data is fully cached with significant memory safety margin which ensures that the cached data will not be evicted.
This means that in the worst case scenario nested-loop-like structure you use can roughly quadratic in terms of the number of columns.
The code shown above handles all columns at the same time, requiring only a single data scan.
The problem with your approach is that the file is scanned for every column (unless you have cached it in memory). The fastet way with a single FileScan should be:
import org.apache.spark.sql.functions.{explode,array}
val cnt: Long = df
.select(
explode(
array(df.columns.head,df.columns.tail:_*)
).as("cell")
)
.where($"cell">0).count
Still I think it will be slower than with Pandas, as Spark has a certain overhead due to the parallelization engine

Bad performance with window function in streaming job

I use Spark 2.0.2, Kafka 0.10.1 and the spark-streaming-kafka-0-8 integration. I want to do the following:
I extract features in a streaming job out of NetFlow connections and than apply the records to a k-means model. Some of the features are simple ones which are calculated directly from the record. But I also have more complex features which depend on records from a specified time window before. They count how many connections in the last second were to the same host or service as the current one. I decided to use the SQL window functions for this.
So I build window specifications:
val hostCountWindow = Window.partitionBy("plainrecord.ip_dst").orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
val serviceCountWindow = Window.partitionBy("service").orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
And a function which is called to extract this features on every batch:
def extractTrafficFeatures(dataset: Dataset[Row]) = {
dataset
.withColumn("host_count", count(dataset("plainrecord.ip_dst")).over(hostCountWindow))
.withColumn("srv_count", count(dataset("service")).over(serviceCountWindow))
}
And use this function as follows
stream.map(...).map(...).foreachRDD { rdd =>
val dataframe = rdd.toDF(featureHeaders: _*).transform(extractTrafficFeatures(_))
...
}
The problem is that this has a very bad performance. A batch needs between 1 and 3 seconds for a average input rate of less than 100 records per second. I guess it comes from the partitioning, which produces a lot of shuffling?
I tried to use the RDD API and countByValueAndWindow(). This seems to be much faster, but the code looks way nicer and cleaner with the DataFrame API.
Is there a better way to calculate these features on the streaming data? Or am I doing something wrong here?
Relatively low performance is to be expected here. Your code has to shuffle and sort data twice, once for:
Window
.partitionBy("plainrecord.ip_dst")
.orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
and once for:
Window
.partitionBy("service")
.orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
This will have a huge impact on the runtime and if these are the hard requirements you won't be able to do much better.

Array of RDDs? One RDD for a time window

I have a question about bucketing time events with Spark, and the best way to handle it.
So I'm ingesting a very large dataset, with specific start/stop times for each event.
For instance, I might load in three weeks of data. Within the main time window, I divide that into buckets of smaller intervals. So 3 weeks divided into 24 hour time buckets, with an array that looks like [(start_epoch, stop_epoch), (start_epoch, stop_epoch), ...]
Within each time bucket I map/reduce my events down into a smaller set.
I'd like to keep the events split up by the time bucket they belong to.
What is the best way to handle this? Each map/reduce operation results in a new RDD so I'm effectively left with a large array of RDDs.
Is it "safe" to just loop over that array from the driver, and then do other transformations/actions on each RDD to get results each time window?
Thanks!
I would suggest to think about it a bit differently:
You want to read your data, and then "keyBy" time rounded to hour resolution. And then you can reduceByKey(or combineByKey if you want another type in output).
While working with spark it's not necessary to collect items into arrays by some key(even antipattern)
RDD[Event] -> keyBy ts rounded to hour -> RDD[(hour, event)] -> reduceByKey(i.e. hour) -> RDD[(hour, aggregated view of all events in this hour)]

Resources