I have code whose goal is to take the 10M oldest records out of 1.5B records.
I tried to do it with orderBy and it never finished; then I tried it with a window function and it finished after 15 minutes.
My understanding is that with orderBy every executor takes part of the data, sorts it, and passes its top 10M to the final executor. Because 10M > partition size, effectively all the data gets passed to the final executor, and that is why it takes so long to finish.
I don't understand how the window solution works. What happens in the shuffle before the single executor starts to run? How does that shuffle help the sort on the single executor run faster?
I would appreciate any help with understanding how the window function works in the background in this case.
This is the code of the window function:
df = sess.sql("select * from table")
last_update_window = Window.orderBy(f.col("last_update").asc())
df = df.withColumn('row_number', f.row_number().over(last_update_window))
df = df.where(f.col('row_number') <= 1000000000)
This is the code for orderBy:
df = sess.sql("select * from table")
df = df.orderBy(f.col('last_update').asc()).limit(100000000)
The physical plan produced when executing the window function is shown in the answer below.
Run explain on both queries, and that will show you the different paths they take.
The window version sends all the data to one node. In this case you also use a where clause, which means a shuffle is used to complete the filtering. This turns out to be faster than the implementation used by limit when a large number of items is returned, most likely because of the volume of data: extra shuffles hurt for small data sets, but on large data sets they help spread the load and reduce the time the job takes.
== Physical Plan ==
*(2) Filter (isnotnull(row_number#66) && (row_number#66 <= 1000000000))
+- Window [row_number() windowspecdefinition(date#61 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS row_number#66], [date#61 ASC NULLS FIRST]
+- *(1) Sort [date#61 ASC NULLS FIRST], false, 0
+- Exchange SinglePartition
+- Scan hive default.table_persons [people#59, type#60, date#61], HiveTableRelation `default`.`table_persons`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [people#59, type#60, date#61]
"Order by" uses a different implementation and does not appear to perform as well under with 'large takes'. It doesn't use as many shuffles but this seems that it doesn't accomplish the work as fast when there is a large number of items to return.
== Physical Plan ==
TakeOrderedAndProject(limit=100000000, orderBy=[date#61 ASC NULLS FIRST], output=[people#59,type#60,date#61])
+- Scan hive default.table_persons [people#59, type#60, date#61], HiveTableRelation `default`.`table_persons`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [people#59, type#60, date#61]
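(For reference, the two plans above can be reproduced by calling explain() on the corresponding DataFrames from the question; the variable names below are just illustrative.)
df_window.explain()   # the window + where version
df_orderby.explain()  # the orderBy + limit version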
As an aside, for large skewed data sets it is common to add two extra shuffles, and this does in fact take less time than not having the extra shuffles. (But again, it would increase the time taken on small data sets.)
So what does TakeOrderedAndProject actually do? It uses a take and, for large datasets, sorts the data on disk instead of sorting in memory.
With your window there is a shuffle which uses ranges to sort the data. There are hints that further sorting is done in memory, which is where the performance tradeoff comes from.
And this is where I think you are getting the payoff. (I'd be interested to know whether adding a limit to your existing window speeds things up.)
From snippets on Reddit and from digging through issues and the code, the inference is that the sort is done in memory and spilled to disk only if required.
The other piece that I feel provides a lot of the performance boost is that you use a where clause. Take pulls back as many items from each partition as you have in the limit clause (see the implementation above), only to throw most of them away; where does not pull back any items that fail the filter condition. This movement of data [using limit] is likely where the real performance degradation of limit comes from.
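To make the "take pulls back limit-many rows from every partition" point concrete, here is a rough conceptual sketch in plain Python of what a take-ordered style operation does. This is not Spark's actual implementation, just an illustration of why a very large n is expensive: every partition contributes up to n candidate rows that must then be merged in one place.
import heapq
import itertools

def take_ordered(partitions, n, key):
    # each partition keeps only its n smallest rows (cheap, done in parallel)
    per_partition = [heapq.nsmallest(n, part, key=key) for part in partitions]
    # one place then merges all of those candidate lists (expensive when n is huge)
    merged = heapq.merge(*per_partition, key=key)
    return list(itertools.islice(merged, n))

# take_ordered([[5, 1, 9], [4, 2, 8]], n=3, key=lambda x: x)  ->  [1, 2, 4]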
Related
I have noticed that the output of the following code spark.read.format("csv").option("header",True).schema(schema).load(path).limit(nrows).rdd.getNumPartitions() is always 1, regardless of the argument nrows. Does using the limit clause always result in only 1 partition?
I haven't been able to find anything mentioned about such a constraint.
I don't fully understand the background of this, but for this statement the result will always be 1, because the limit is planned as a collect of the result:
>>> df3.limit(1000000).explain()
== Physical Plan ==
CollectLimit 1000000
+- FileScan csv
But LimitExec, the physical node for the limit operation, has several implementations such as GlobalLimitExec and LocalLimitExec. Some implementations can use the same number of partitions to fetch enough records in parallel, or pick subsets of partitions iteratively. Generally, you don't need to care how it does this.
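As a hedged illustration of that difference: when another transformation follows the limit, the planner typically emits LocalLimit/GlobalLimit nodes instead of a single CollectLimit. Here df3 is the DataFrame from the snippet above, and the filter column name is made up.
df3.limit(1000000).where("some_col is not null").explain()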
Version: DBR 8.4 | Spark 3.1.2
I'm trying to get the top 500 rows per partition, but I can see from the query plan that it is sorting the entire data set (50K rows per partition) before eventually filtering to the rows I care about.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

max_rank = 500
ranking_order = (Window.partitionBy(['category', 'id'])
                 .orderBy(F.col('primary').desc(), F.col('secondary')))
df_ranked = (df
    .withColumn('rank', F.row_number().over(ranking_order))
    .where(F.col('rank') <= max_rank)
)
df_ranked.explain()
I read elsewhere that expressions such as df.orderBy(desc("value")).limit(n) are optimized by the query planner to use TakeOrderedAndProject and avoid sorting the entire table. Is there a similar approach I can use here to trigger an optimization and avoid fully sorting all partitions?
For context, right now my query is taking 3.5 hours on a beefy 4-worker x 40-core cluster, and the shuffle write time surrounding this query (including some projections not listed above) appears to be my biggest bottleneck, so I'm trying to cut down the amount of data as early as possible.
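For what it's worth, the orderBy + limit pattern mentioned above is easy to check with explain(); a minimal hedged sketch (the column name 'value' and the limit are illustrative), where the plan should show a TakeOrderedAndProject node rather than a full sort:
df.orderBy(F.col('value').desc()).limit(500).explain()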
Let's say I have two DataFrames, A and B, that I want to join using an inner join; each one has 100 columns and billions of rows.
If in my use case I'm only interested in 10 columns of A and 4 columns of B, does Spark do the optimization for me and shuffle only those 14 columns, or will it shuffle everything and then select the 14 columns?
Query 1 :
A_select = A.select("{10 columns}").as("A")
B_select = B.select("{4 columns}").as("B")
result = A_select.join(B_select, $"A.id" === $"B.id")
Query 2 :
A.join(B, $"A.id"==$"B.id").select("{14 columns}")
Is Query 1 == Query 2 in terms of behavior, execution time, and data shuffling?
Thanks in advance for your answers.
Yes, Spark handles this optimization for you. Thanks to its lazy evaluation and column pruning, only the required attributes will be selected from the DataFrames (A and B).
You can use the explain function to view the logical/physical plan:
result.explain()
Both queries produce the same physical plan, so execution time and data shuffling will be the same.
Reference: the PySpark documentation for the explain function.
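A minimal, self-contained sketch of how one might confirm this (the DataFrames and column names below are made up; the real A and B from the question have far more columns): in both plans, the scan/Project nodes should list only the selected columns plus the join key.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
A = spark.createDataFrame([(1, "a", "x")], ["id", "a1", "a2"])
B = spark.createDataFrame([(1, "b", "y")], ["id", "b1", "b2"])

q1 = A.select("id", "a1").join(B.select("id", "b1"), on="id")   # Query 1 style
q2 = A.join(B, on="id").select("id", "a1", "b1")                # Query 2 style

q1.explain()
q2.explain()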
I have a Spark DataFrame where all fields are integer type. I need to count how many individual cells are greater than 0.
I am running locally and have a DataFrame with 17,000 rows and 450 columns.
I have tried two methods, both yielding slow results:
Version 1:
(for (c <- df.columns) yield df.where(s"$c > 0").count).sum
Version 2:
df.columns.map(c => df.filter(df(c) > 0).count)
This calculation takes 80 seconds of wall-clock time. With Python pandas it takes a fraction of a second. I am aware that for small data sets and local operation, Python may perform better, but this seems extreme.
Trying to make a Spark-to-Spark comparison, I find that running MLlib's PCA algorithm on the same data (converted to a RowMatrix) takes less than 2 seconds!
Is there a more efficient implementation I should be using?
If not, how is the seemingly much more complex PCA calculation so much faster?
What to do
import org.apache.spark.sql.functions.{col, count, when}
df.select(df.columns map (c => count(when(col(c) > 0, 1)) as c): _*)
Why
Both of your attempts create a number of jobs proportional to the number of columns. Computing the execution plan and scheduling the job alone are expensive and add significant overhead depending on the amount of data.
Furthermore, the data might be loaded from disk and/or parsed each time a job is executed, unless the data is fully cached with a significant memory safety margin that ensures the cached data will not be evicted.
This means that in the worst-case scenario the nested-loop-like structure you use can be roughly quadratic in the number of columns.
The code shown above handles all columns at the same time, requiring only a single data scan.
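If you are working in PySpark rather than Scala, a hedged equivalent of the same single-scan approach would look roughly like this (same idea: one job, one pass over the data, one count per column):
from pyspark.sql import functions as F

# one select over all columns -> a single job and a single scan of the data
counts = df.select([F.count(F.when(F.col(c) > 0, 1)).alias(c) for c in df.columns])
counts.show()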
The problem with your approach is that the file is scanned once for every column (unless you have cached it in memory). The fastest way, with a single FileScan, should be:
import org.apache.spark.sql.functions.{explode, array}
import spark.implicits._  // needed for the $"..." column syntax outside the spark-shell

val cnt: Long = df
  .select(
    explode(
      array(df.columns.head, df.columns.tail: _*)
    ).as("cell")
  )
  .where($"cell" > 0).count
Still, I think it will be slower than pandas here, as Spark has a certain overhead due to its parallelization engine.
UPD: The question is no longer valid, as it turned out that two of the 100 tables had several orders of magnitude more rows than the rest (which had about 500 each). When the "bad" tables are eliminated, the join is distributed fairly and completes in predictable time.
I have about 100 Spark DataFrames, <= 500 rows each but roughly the same size (I plan to have tens of thousands of rows later). The ids of the entries in all of the tables are subsets of the ids of the first (reference) table.
I want to left outer join all of the tables to the first one by id. I do it as follows (in pyspark):
(df1
    .join(df2, df2.id == df1.id, 'left_outer')
    .join(df3, df3.id == df1.id, 'left_outer')
    ...
)
This join operation generates 200 jobs, all but a few of which finish in a couple of seconds. The last job, however, takes extremely long (an hour or so) and runs (obviously) only on one processor. The Spark web UI reveals that this job has acquired too many shuffle records.
Why is this happening and how is it better to tune Spark to avoid this?
The query "explain select * from ... left outer join ... ... ..." looks as follows:
== Physical Plan ==
Project [id#0, ... rest of the columns (~205) ...]
HashOuterJoin [id#0], [id#370], LeftOuter, None
HashOuterJoin [id#0], [id#367], LeftOuter, None
HashOuterJoin [id#0], [id#364], LeftOuter, None
...
Exchange (HashPartitioning [id#364], 200)
Project [...cols...]
PhysicalRDD [...cols...], MapPartitionsRDD[183] at map at newParquet.scala:542
Exchange (HashPartitioning [id#367], 200)
Project [...cols...]
PhysicalRDD [..cols...], MapPartitionsRDD[185] at map at newParquet.scala:542
Exchange (HashPartitioning [id#370], 200)
Project [...cols...]
PhysicalRDD [...cols...], MapPartitionsRDD[187] at map at newParquet.scala:542
Using repartition after join may help.
I experienced a similar situation: joining two DataFrames with 200 partitions, then joining again and again, and it never finished.
I tried adding repartition(50) to the DataFrames being joined, and then it worked.
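A hedged PySpark sketch of that suggestion (the partition count 50 mirrors the answer above and would need tuning; repartitioning on the join key is one common variant of it):
# spread the inputs more evenly before joining
df1r = df1.repartition(50, "id")
df2r = df2.repartition(50, "id")
joined = df1r.join(df2r, df2r.id == df1r.id, "left_outer")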