Spark Dataset join performance - apache-spark

I receive a Dataset and I am required to join it with another table. The simplest solution that came to my mind was to create a second Dataset for the other table and perform the joinWith.
def joinFunction(dogs: Dataset[Dog]): Dataset[(Dog, Cat)] = {
  val cats: Dataset[Cat] = spark.table("dev_db.cat").as[Cat]
  dogs.joinWith(cats, ...)
}
Here my main concern is with spark.table("dev_db.cat"), as it feels like we are referring to all of the cat data as
SELECT * FROM dev_db.cat
and then doing a join at a later stage. Or will the query optimizer perform the join directly, without reading the whole table? Is there a better solution?

Here are some suggestions for your case:
a. If you have where, filter, limit, take, etc. operations, try to apply them before joining the two datasets. Spark cannot always push these operations below the join (limit and take in particular), so reduce the number of candidate records yourself as much as possible. Here is an excellent source of information on the Spark optimizer.
b. Try to co-locate the datasets and minimize the shuffled data by using the repartition function. The repartition should be based on the keys that participate in the join, e.g.:
import org.apache.spark.sql.functions.col
val dogsByKey = dogs.repartition(1024, col("key_col1"), col("key_col2"))
dogsByKey.join(cats, Seq("key_col1", "key_col2"), "inner")
c. Try to use broadcast for the smaller dataset if you are sure that it fits in memory (or raise spark.sql.autoBroadcastJoinThreshold so that Spark picks a broadcast join on its own). This gives a noticeable performance boost, since the smaller dataset is copied to every executor and the larger one does not have to be shuffled for the join.
If you can't apply any of the above then Spark doesn't have a way to know which records should be excluded and therefore will scan all the available rows from both datasets.
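To make suggestion (c) more concrete, here is a minimal plain-Python sketch of the idea behind a broadcast (map-side) hash join. It is not Spark code: the function name broadcast_hash_join and the sample records are assumptions made up for illustration. The small side becomes an in-memory lookup table (in Spark it would be copied to every executor), and the large side is streamed past it once.

```python
from collections import defaultdict

def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two iterables of dicts on `key`, hashing only the small side."""
    # Build the lookup table from the small side; this is the part that must fit in memory.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)
    # Stream the large side once; no sorting or repartitioning is needed.
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield (row, match)

# Hypothetical sample data standing in for the dogs and cats tables.
dogs = [{"id": 1, "name": "Rex"}, {"id": 2, "name": "Fido"}, {"id": 3, "name": "Bo"}]
cats = [{"id": 2, "name": "Tom"}, {"id": 3, "name": "Kit"}]
pairs = list(broadcast_hash_join(dogs, cats, "id"))
```

The sketch also shows why the technique only applies when one side is small: the whole lookup table lives in memory on every worker.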

Call explain on the joined Dataset and check whether predicate pushdown is used. Then you can judge whether your concern is justified.
In general, though, pushdown takes place as long as no complex data types are involved and there are no data type mismatches. You can see that with a simple createOrReplaceTempView as well. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html

Related

HashPartioning dataframes to achieve co-partitioning during join in PySpark

I am trying to figure out the best way to achieve co-partitioning on my two datasets to eliminate join-related shuffles. I'm working with 2 dataframes A and B, where A contains minimal user data including a field for the event IDs they interacted with, and B contains detailed information about the events. I am trying to join on 3 fields: day, event_type, and event_id. A and B need to be read from disk as they will be written to and read from by external clients on an ongoing basis.
The main goal of the project I'm working on is to enable the ability to quickly:
1. Filter by event_type
2. Join raw event details to user IDs
I understand that in order to achieve #1 I probably need to partition my parquet files on event_type so that the directory structure achieves easier filtering. In order to achieve #2 I should try to minimize shuffles as much as possible by means of co-partitioning keys from the two dataframes.
The data I'm working with consists of 3 days of event data (~12M rows per event type) and the goal is to get this working efficiently for 1-3 years of data.
In order to improve my join I first begin by filtering on the event_type I am interested in to narrow down the data on both dataframes. I then do the actual join on day and event_id. This naturally will result in shuffles since there is no co-partitioning so I've tried to address that using hash partitioning.
I read that repartition implements hash partitioning on the specified columns. I save my dataframes to disk and also include a partitionBy('day', 'event_type') in order to achieve better performance on filtering/grouping operations.
A\
    .repartition('day', 'event_id')\
    .write\
    .partitionBy('day', 'event_type')\
    .mode('overwrite')\
    .parquet('/path/to/A')
B\
    .repartition('day', 'event_id')\
    .write\
    .partitionBy('day', 'event_type')\
    .mode('overwrite')\
    .parquet('/path/to/B')
...
...
from pyspark.sql.functions import col
A = spark.read.parquet('/path/to/A')
B = spark.read.parquet('/path/to/B')
A.filter(col('event_type') == 'X')\
    .join(B.filter(col('event_type') == 'X'), on=['day', 'event_id'], how='inner')\
    .show()
When I execute this I still see a shuffle exchange in the plan as well as shuffle writes which take up around 5-10GB each. I also see longer executor compute times of around 21-41s which might not seem much on 3 days of data but might blow up for yearly data.
I am wondering what's a better way I can go about doing this - or if it is even possible to eliminate shuffles when working with dataframes? Answers to this question seem to suggest that it might be possible but not a great idea?
I am not even sure that doing both a repartition and a partitionBy is the correct approach. Is the initial partitioning using repartition() preserved at all when I re-read the parquet files from disk? I have read that this might not be the case - overall the information available seems either conflicting or without explicit sources attached.
Thank you for taking the time to help.
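What co-partitioning means for the join in this question can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: crc32 stands in for a deterministic hash and is not the hash Spark uses, and the names partition_index, hash_partition, join_partition, users and events are made up for the example. The point is that when both sides are partitioned with the same hash, the same key columns and the same number of partitions, matching rows end up in partitions with the same index, so each pair of partitions can be joined locally. Whether Spark can exploit such a layout after a round trip through parquet files is a separate matter that the sketch does not address.

```python
import zlib

def partition_index(key, num_partitions):
    # crc32 is used only because it is deterministic; it is NOT Spark's hash function.
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions

def hash_partition(rows, key_cols, num_partitions):
    """Distribute rows (dicts) over num_partitions buckets by their key columns."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        partitions[partition_index(key, num_partitions)].append(row)
    return partitions

def join_partition(left_rows, right_rows, key_cols):
    """Inner join of one pair of partitions; never looks at any other partition."""
    index = {}
    for right in right_rows:
        index.setdefault(tuple(right[c] for c in key_cols), []).append(right)
    return [(left, right)
            for left in left_rows
            for right in index.get(tuple(left[c] for c in key_cols), [])]

keys = ["day", "event_id"]
# Hypothetical stand-ins for dataframes A and B.
users = [{"day": "d1", "event_id": 1, "user": "u1"},
         {"day": "d1", "event_id": 2, "user": "u2"},
         {"day": "d2", "event_id": 1, "user": "u3"}]
events = [{"day": "d1", "event_id": 1, "detail": "click"},
          {"day": "d2", "event_id": 1, "detail": "view"},
          {"day": "d2", "event_id": 5, "detail": "scroll"}]

users_parts = hash_partition(users, keys, 4)
events_parts = hash_partition(events, keys, 4)
# Join partition i of one side only with partition i of the other side.
joined = [pair
          for left_part, right_part in zip(users_parts, events_parts)
          for pair in join_partition(left_part, right_part, keys)]
```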

PySpark: Best practice to add more columns to a DataFrame

Spark DataFrames have a method withColumn to add one new column at a time. To add multiple columns, a chain of withColumn calls is required. Is this the best practice?
I feel that using mapPartitions has more advantages. Let's say I have a chain of three withColumns and then one filter to remove Rows based on certain conditions. These are four different operations (I am not sure if any of these are wide transformations, though). But I can do it all in one go if I do a mapPartitions. It also helps if I have a database connection that I would prefer to open once per RDD partition.
My question has two parts.
The first part, this is my implementation of mapPartitions. Are there any unforeseen issues with this approach? And is there a more elegant way to do this?
from pyspark.sql import Row

def add_new_cols(rows):
    db = open_db_connection()
    new_rows = []
    new_row_1 = Row("existing_col_1", "existing_col_2", "new_col_1", "new_col_2")
    i = 0
    for each_row in rows:
        i += 1
        # conditionally omit rows
        if i % 3 == 0:
            continue
        db_result = db.get_some_result(each_row.existing_col_2)
        new_col_1 = ''.join([db_result, "_NEW"])
        new_col_2 = db_result
        new_f_row = new_row_1(each_row.existing_col_1, each_row.existing_col_2, new_col_1, new_col_2)
        new_rows.append(new_f_row)
    db.close()
    return iter(new_rows)

# The function must be defined before it is passed to mapPartitions.
df2 = df.rdd.mapPartitions(add_new_cols).toDF()
The second part, what are the tradeoffs in using mapPartitions over a chain of withColumn and filter?
I read somewhere that using the available methods with Spark DFs are always better than rolling out your own implementation. Please let me know if my argument is wrong. Thank you! All thoughts are welcome.
Are there any unforeseen issues with this approach?
Multiple. The most severe implications are:
A memory footprint several times higher than that of plain DataFrame code, plus significant garbage collection overhead.
The high cost of serialization and deserialization required to move data between execution contexts.
A break in the query plan that the planner cannot optimize across.
As written, the cost of schema inference in the toDF call (avoidable if a proper schema is provided) and possible re-execution of all preceding steps.
And so on...
Some of these can be avoided with udf and select / withColumn, others cannot.
let's say I have a chain of three withColumns and then one filter to remove Rows based on certain conditions. These are four different operations (I am not sure if any of these are wide transformations, though). But I can do it all in one go if I do a mapPartitions
Your mapPartitions doesn't remove any operations, and doesn't provide any optimizations that the Spark planner cannot perform on its own. Its only advantage is that it provides a nice scope for expensive connection objects.
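The connection-scope point can be sketched without Spark. The following plain-Python generator has the shape of a function one could pass to mapPartitions: it opens one connection per partition, yields rows lazily instead of collecting them in a list, and closes the connection in a finally block. FakeConnection and its get_some_result method are stand-ins invented for this example, not a real database API, and plain tuples are used in place of Row objects.

```python
class FakeConnection:
    """Hypothetical stand-in for a real database client."""
    opened = 0  # counts how many connections were created

    def __init__(self):
        FakeConnection.opened += 1
        self.closed = False

    def get_some_result(self, value):
        return "db_" + str(value)

    def close(self):
        self.closed = True

def add_new_cols(rows):
    """Process one partition: one connection for all of its rows."""
    db = FakeConnection()
    try:
        for i, (col_1, col_2) in enumerate(rows, start=1):
            # Conditionally omit rows, mirroring the question's example.
            if i % 3 == 0:
                continue
            db_result = db.get_some_result(col_2)
            # Yield instead of appending, so the partition is never held in a list.
            yield (col_1, col_2, db_result + "_NEW", db_result)
    finally:
        # Runs when the partition is exhausted or processing fails.
        db.close()

# A plain iterator plays the role of one RDD partition.
partition = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
result = list(add_new_cols(iter(partition)))
```

Yielding rows keeps the per-partition memory use independent of the partition size, which addresses part of the memory concern above but none of the serialization or planning costs.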
I read somewhere that using the available methods with Spark DFs are always better than rolling out your own implementation
When you start using executor-side Python logic you already diverge from Spark SQL. It doesn't matter if you use udf, RDD or the newly added vectorized udf. At the end of the day you should make the decision based on the overall structure of your code: if it is predominantly Python logic executed directly on the data, it might be better to stick with RDD or skip Spark completely.
If it is just a fraction of the logic, and doesn't cause severe performance issue, don't sweat about it.
Using df.withColumn() is the best way to add columns. They are all added lazily.

Does joining additional columns in Spark scale horizontally?

I have a dataset with about 2.4M rows, with a unique key for each row. I have performed some complex SQL queries on some other tables, producing a dataset with two columns, a key and the value true. This dataset is about 500 rows. Now I would like to (outer) join this dataset with my original table.
This produces a new table with a very sparse set of values (true in about 500 rows, null elsewhere).
Finally, I would like to do this about 200 times, giving me a final table of about 201 columns (the key, plus the 200 sparse columns).
When I run this, I notice that as it runs it gets considerably slower. The first join takes 2 seconds, then 4s, then 6s, then 10s, then 20s and after about 30 joins the system never recovers. Of course, the actual numbers are irrelevant as that depends on the cluster I'm running, but I'm wondering:
Is this slowdown expected?
I am using parquet as a data storage format (columnar storage) so I was hopeful that adding more columns would scale horizontally, is that a correct assumption?
All the columns I've joined so far are not needed for the Nth join, can they be unloaded from memory?
Are there other things I can do when combining lots of columns in spark?
Calling explain on each join in the loop shows that each join is getting more complex (appears to include all previous joins and it also includes the complex sql queries, even though those have been checkpointed). Is there a way to really checkpoint so each join is just a join? I am actually calling show() after each join, so I assumed the join is actually happening at that point.
Is this slowdown expected?
Yes, to some extent it is. Joins are among the most expensive operations in data-intensive systems (it is not a coincidence that products which claim linear scalability usually take joins off the table). Join-like operations in a distributed system typically require data exchange between nodes, which hits a number of high-latency operations.
In Spark SQL there is also the additional cost of computing the execution plan, which grows faster than linearly with the size of the plan.
I am using parquet as a data storage format (columnar storage) so I was hopeful that adding more columns would scale horizontally, is that a correct assumption?
No. Input format doesn't affect join logic at all.
All the columns I've joined so far are not needed for the Nth join, can they be unloaded from memory?
If truly excluded from the final output, they will be pruned from the execution plan. But since you join them for a reason, I assume that is not the case and they are required for the final output.
Is there a way to really checkpoint so each join is just a join? I am actually calling show() after each join, so I assumed the join is actually happening at that point.
show computes only a small subset of data required for the output. It doesn't cache, although shuffle files might be reused.
(appears to include all previous joins and it also includes the complex sql queries, even though those have been checkpointed).
Checkpoints are created only if the data is fully computed, and they don't remove stages from the execution plan. If you want to do it explicitly, write the partial result to persistent storage and read it back at the beginning of each iteration (it is probably overkill).
Are there other things I can do when combining lots of columns in spark?
The best thing you can do is to find a way to avoid joins completely. If the key is always the same, then a single shuffle followed by operations on groups / partitions (with byKey methods, window functions) might be a better choice.
However, if you "have a dataset with about 2.4M rows", then using a non-distributed system that supports in-place modification might be a much better choice.
In the most naive implementation you can compute each aggregate separately, sort by key and write to disk. Then data can be merged together line by line with negligible memory footprint.
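The naive merge described above can be sketched in plain Python. This is an illustration under stated assumptions, not the answerer's implementation: the names tag and merge_sparse_columns are made up, and in-memory lists stand in for the sorted files on disk. Each sparse column is a sorted stream of the keys for which the value is true; heapq.merge combines the streams lazily, so only the flags of the current key are held in memory at any time.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def tag(name, keys):
    """Turn a sorted stream of keys into a sorted stream of (key, column_name)."""
    for key in keys:
        yield (key, name)

def merge_sparse_columns(all_keys, flag_streams):
    """Yield (key, {column_name: True, ...}) for every base key, in key order.

    all_keys: sorted iterable of every key in the base table.
    flag_streams: dict of column_name -> sorted iterable of keys flagged true.
    """
    merged = heapq.merge(*(tag(name, keys) for name, keys in flag_streams.items()))
    grouped = groupby(merged, key=itemgetter(0))
    current = next(grouped, None)
    for key in all_keys:
        # Skip flagged keys that do not exist in the base table (left outer join).
        while current is not None and current[0] < key:
            current = next(grouped, None)
        flags = {}
        if current is not None and current[0] == key:
            flags = {name: True for _, name in current[1]}
            current = next(grouped, None)
        yield key, flags

# Hypothetical data: five base keys and two sparse columns.
base_keys = [1, 2, 3, 4, 5]
sparse = {"col_a": [2, 5], "col_b": [2, 3]}
wide_rows = list(merge_sparse_columns(base_keys, sparse))
```

Columns missing from a row's dictionary correspond to the null cells of the wide table in the question.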

Forcing pyspark join to occur sooner

PROBLEM: I have two tables that are vastly different in size. I want to join on some id by doing a left-outer join. Unfortunately, for some reason even after caching my actions after the join are being executed on all records even though I only want the ones from the left table. See below:
MY QUESTIONS:
1. How can I set this up so only the records that match the left table get processed through the costly wrangling steps?
LARGE_TABLE => ~900M records
SMALL_TABLE => 500K records
CODE:
combined = SMALL_TABLE.join(LARGE_TABLE, SMALL_TABLE.id == LARGE_TABLE.id, 'left_outer')
print(combined.count())
...
...
# EXPENSIVE STUFF!
w = Window().partitionBy("id").orderBy(col("date_time"))
data = data.withColumn('diff_id_flag', when(lag('id').over(w) != col('id'), lit(1)).otherwise(lit(0)))
Unfortunately, my execution plan shows the expensive transformation operation above is being done on ~900M records. I find this odd since I ran df.count() to force the join to execute eagerly rather than lazily.
Any Ideas?
ADDITIONAL INFORMATION:
- note that the expensive transformation in my code flow occurs after the join (at least that is how I interpret it) but my DAG shows the expensive transformation occurring as a part of the join. This is exactly what I want to avoid as the transformation is expensive. I want the join to execute and THEN the result of that join to be run through the expensive transformation.
- Assume the smaller table CANNOT fit into memory.
The best way to do this is to broadcast the tiny dataframe. Caching is good for multiple actions, which doesn't seem to be applicable to your particular use case.
df.count has no effect on the execution plan at all. It is just an expensive operation executed for no good reason.
Applying the window function in this case requires the same logic as the join. Because you join by id and partitionBy id, both stages require the same hash partitioning and a full data scan of both sides. There is no good reason to separate the two.
In practice the join logic should be applied before the window, serving as a filter for the downstream transformations in the same stage.

DataFrame orderBy followed by limit in Spark

I have a program that generates a DataFrame on which it will run something like
Select Col1, Col2...
orderBy(ColX) limit(N)
However, when I collect the data at the end, I find that it causes the driver to OOM if I take a large enough top N.
Another observation is that if I do just the sort or just the top, the problem does not occur. It happens only when sort and top are used together.
I am wondering why this could be happening. In particular, what is really going on underneath this combination of two transforms? How does Spark evaluate a query with both sorting and limit, and what is the corresponding execution plan?
I am also curious whether Spark handles sort and top differently for DataFrame and RDD.
EDIT:
Sorry, I didn't mean collect.
What I originally meant is that this happens when I call any action to materialize the data, regardless of whether it is collect (or another action that sends data back to the driver) or not, so the problem is definitely not the output size.
While it is not clear why this fails in this particular case, there are multiple issues you may encounter:
When you use limit it simply puts all data on a single partition, no matter how big n is. So while it doesn't explicitly collect, it is almost as bad.
On top of that, orderBy requires a full shuffle with range partitioning, which can cause different issues when the data distribution is skewed.
Finally, when you collect, the results can be larger than the amount of memory available on the driver.
If you collect anyway there is not much you can improve here. At the end of the day driver memory will be the limiting factor, but there are still some possible improvements:
First of all don't use limit.
Replace collect with toLocalIterator.
Use either orderBy |> rdd |> zipWithIndex |> filter or, if the exact number of values is not a hard requirement, filter the data directly based on an approximated distribution, as shown in Saving a spark dataframe in multiple parts without repartitioning (in Spark 2.0.0+ there is a handy approxQuantile method).
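The orderBy |> rdd |> zipWithIndex |> filter pipeline can be mimicked in plain Python to show what each step contributes. This is a sketch of the logic only, with a made-up function name top_n_by_index. In PySpark the corresponding chain would be built from DataFrame.orderBy, the rdd attribute, RDD.zipWithIndex (which pairs each element with its position) and RDD.filter; unlike limit, the filter is applied to every partition where it sits.

```python
def top_n_by_index(rows, sort_key, n):
    """Return the first n rows in sort order by filtering on the row index."""
    ordered = sorted(rows, key=sort_key)                        # orderBy
    indexed = ((row, idx) for idx, row in enumerate(ordered))   # zipWithIndex: (row, index)
    return [row for row, idx in indexed if idx < n]             # filter on the index

# Hypothetical sample values.
values = [5, 3, 9, 1, 7]
smallest_two = top_n_by_index(values, sort_key=lambda v: v, n=2)
```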

Resources