PySpark: Best practice to add more columns to a DataFrame - apache-spark

Spark DataFrames have a withColumn method to add one new column at a time. To add multiple columns, a chain of withColumn calls is required. Is this the best practice?
I feel that using mapPartitions has more advantages. Let's say I have a chain of three withColumn calls and then one filter to remove rows based on certain conditions. These are four different operations (I am not sure if any of these are wide transformations, though). But I can do it all in one go with a mapPartitions. It also helps if I have a database connection that I would prefer to open once per RDD partition.
My question has two parts.
First, here is my implementation of mapPartitions. Are there any unforeseen issues with this approach? And is there a more elegant way to do it?
from pyspark.sql import Row

df2 = df.rdd.mapPartitions(add_new_cols).toDF()

def add_new_cols(rows):
    db = open_db_connection()
    new_rows = []
    new_row_1 = Row("existing_col_1", "existing_col_2", "new_col_1", "new_col_2")
    i = 0
    for each_row in rows:
        i += 1
        # conditionally omit rows
        if i % 3 == 0:
            continue
        db_result = db.get_some_result(each_row.existing_col_2)
        new_col_1 = ''.join([db_result, "_NEW"])
        new_col_2 = db_result
        new_f_row = new_row_1(each_row.existing_col_1, each_row.existing_col_2, new_col_1, new_col_2)
        new_rows.append(new_f_row)
    db.close()
    return iter(new_rows)
The second part: what are the tradeoffs of using mapPartitions over a chain of withColumn and filter?
I read somewhere that using the available methods on Spark DataFrames is always better than rolling your own implementation. Please let me know if my argument is wrong. Thank you! All thoughts are welcome.

Are there any unforeseen issues with this approach?
Multiple. The most severe implications are:
A few times higher memory footprint compared to plain DataFrame code and significant garbage collection overhead.
High cost of serialization and deserialization required to move data between execution contexts.
Introducing a breaking point in the query planner.
As written, the cost of schema inference on the toDF call (it can be avoided if a proper schema is provided) and possible re-execution of all preceding steps.
And so on...
Some of these can be avoided with udf and select / withColumn, others cannot.
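For illustration, a minimal sketch (not the asker's exact code) of what the udf / withColumn / filter version could look like. fake_lookup is just a placeholder for the per-row database call, which would still have to be managed inside the UDF:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def fake_lookup(value):
    # placeholder for db.get_some_result(value)
    return str(value).upper()

lookup_udf = F.udf(fake_lookup, StringType())

df2 = (df
       .withColumn("new_col_2", lookup_udf(F.col("existing_col_2")))
       .withColumn("new_col_1", F.concat(F.col("new_col_2"), F.lit("_NEW")))
       .filter(F.col("new_col_1").isNotNull()))  # stand-in for the row-dropping condition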
Let's say I have a chain of three withColumn calls and then one filter to remove rows based on certain conditions. These are four different operations (I am not sure if any of these are wide transformations, though). But I can do it all in one go if I do a mapPartitions.
Your mapPartitions doesn't remove any operations, and doesn't provide any optimizations that the Spark planner cannot apply itself. Its only advantage is that it provides a nice scope for expensive connection objects.
I read somewhere that using the available methods on Spark DataFrames is always better than rolling your own implementation.
When you start using executor-side Python logic you already diverge from Spark SQL. It doesn't matter if you use udf, RDD, or the newly added vectorized udf. At the end of the day you should make the decision based on the overall structure of your code - if it is predominantly Python logic executed directly on the data, it might be better to stick with RDDs or skip Spark completely.
If it is just a fraction of the logic, and doesn't cause severe performance issues, don't sweat it.

Using df.withColumn() is the best way to add columns; they're all added lazily.

Related

Spark Dataset join performance

I receive a Dataset and I am required to join it with another table. The simplest solution that came to my mind was to create a second Dataset for the other table and perform a joinWith.
def joinFunction(dogs: Dataset[Dog]): Dataset[(Dog, Cat)] = {
  val cats: Dataset[Cat] = spark.table("dev_db.cat").as[Cat]
  dogs.joinWith(cats, ...)
}
Here, my main concern is with spark.table("dev_db.cat"), as it feels like we are referring to all of the cat data as
SELECT * FROM dev_db.cat
and then doing a join at a later stage. Or will the query optimizer directly perform the join without referring to the whole table? Is there a better solution?
Here are some suggestions for your case:
a. If you have where, filter, limit, take, etc. operations, try to apply them before joining the two datasets. Spark can't push these kinds of filters down for you, so you have to reduce the number of target records yourself as much as possible. Here is an excellent source of information on the Spark optimizer.
b. Try to co-locate the datasets and minimize the shuffled data by using the repartition function. The repartitioning should be based on the keys that participate in the join, e.g.:
dogs.repartition(1024, "key_col1", "key_col2")
dogs.join(cats, Seq("key_col1", "key_col2"), "inner")
c. Try to use broadcast for the smaller dataset if you are sure it can fit in memory (or increase the value of spark.broadcast.blockSize). This gives a certain boost to the performance of your Spark program, since it ensures the co-existence of the two datasets within the same node.
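For reference, a minimal sketch of the broadcast hint (shown in PySpark syntax, although the question uses Scala Datasets; the key column names are the hypothetical ones from point b):

from pyspark.sql.functions import broadcast

joined = dogs.join(broadcast(cats), ["key_col1", "key_col2"], "inner")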
If you can't apply any of the above then Spark doesn't have a way to know which records should be excluded and therefore will scan all the available rows from both datasets.
You need to do an explain and see whether predicate pushdown is used. Then you can judge whether your concern is correct or not.
In general, though, if no complex datatypes are used and/or datatype mismatches are not evident, then pushdown takes place. You can see that with a simple createOrReplaceTempView as well. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html
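As a rough sketch of that check (PySpark shown; the filter column name is made up), run explain on the joined and filtered result and look at where the filter appears and whether PushedFilters is populated in the physical plan:

cats = spark.table("dev_db.cat")
result = dogs.join(cats, "id").where(cats["age"] > 3)  # "age" is a hypothetical column
result.explain(True)  # prints parsed, analyzed, optimized and physical plans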

Forcing pyspark join to occur sooner

PROBLEM: I have two tables that are vastly different in size. I want to join on some id by doing a left-outer join. Unfortunately, for some reason, even after caching, the actions after the join are being executed on all records even though I only want the ones from the left table. See below:
MY QUESTIONS:
1. How can I set this up so only the records that match the left table get processed through the costly wrangling steps?
LARGE_TABLE => ~900M records
SMALL_TABLE => 500K records
CODE:
combined = SMALL_TABLE.join(LARGE_TABLE, SMALL_TABLE.id == LARGE_TABLE.id, 'left_outer')
print(combined.count())
...
...
# EXPENSIVE STUFF!
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, when, lit

w = Window.partitionBy("id").orderBy(col("date_time"))
data = data.withColumn('diff_id_flag', when(lag('id').over(w) != col('id'), lit(1)).otherwise(lit(0)))
Unfortunately, my execution plan shows the expensive transformation operation above is being done on ~900M records. I find this odd since I ran df.count() to force the join to execute eagerly rather than lazily.
Any Ideas?
ADDITIONAL INFORMATION:
- note that the expensive transformation in my code flow occurs after the join (at least that is how I interpret it) but my DAG shows the expensive transformation occurring as a part of the join. This is exactly what I want to avoid as the transformation is expensive. I want the join to execute and THEN the result of that join to be run through the expensive transformation.
- Assume the smaller table CANNOT fit into memory.
The best way to do this is to broadcast the tiny dataframe. Caching is good for multiple actions, which doesn't seem to be applicable to your particular use case.
df.count has no effect on the execution plan at all. It is just an expensive operation executed without any good reason.
Applying the window function here requires the same logic as the join. Because you join by id and partitionBy id, both stages will require the same hash partitioning and a full data scan on both sides. There is no acceptable reason to separate these two.
In practice the join logic should be applied before the window, serving as a filter for the downstream transformations in the same stage.
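A rough sketch of that advice, using the names from the question. A left-semi join is used here simply to restrict the large table to matching ids before the expensive window; broadcast only if the small side really fits in memory, which the asker says it may not:

from pyspark.sql import Window
from pyspark.sql.functions import broadcast, col, lag, when, lit

matched = LARGE_TABLE.join(broadcast(SMALL_TABLE), "id", "left_semi")

w = Window.partitionBy("id").orderBy(col("date_time"))
data = matched.withColumn(
    "diff_id_flag",
    when(lag("id").over(w) != col("id"), lit(1)).otherwise(lit(0)),
)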

Why is this Spark SQL UDF slower than an RDD?

I have some expensive analysis I need to perform on a DataFrame of pairs of objects. The setup looks something like this.
# This does the expensive work and holds some reference data
# Expensive to initialize so done only once
analyze = Analyze()

def analyze_row(row):
    # Turn the row into objects and pass them to the function above
    foo = Foo.from_dict(row.foo.asDict(recursive=True))
    bar = Bar.from_dict(row.bar.asDict(recursive=True))
    return analyze(foo, bar)
When I apply analyze_row as a UDF like so
analyze_row_udf = udf(analyze_row, result_schema)
results_df = input_df.withColumn("result", analyze_row_udf(struct("foo", "bar"))).select("result.*")
it is empirically slower than applying it to an RDD like so
results = input_df.rdd.map(analyze_row)
results_df = spark.createDataFrame(results, schema=result_schema)
All other things being equal, the UDF version didn't seem to make progress in an hour, while the RDD version completely finished in 30 mins. The cluster CPU was maxed out in both cases. Same behavior was reproduced on multiple tries.
I thought DataFrames are meant to supersede RDDs, partially because of better performance. How come an RDD seems to be much faster in this case?
DataFrames can supersede RDDs where:
There are execution plan optimizations (here none can be applied).
There are low-level optimizations in use - off-heap memory, code generation (once again, none are applied when you execute black-box code outside the JVM).
Optimized columnar storage is used (ditto).
Additionally, passing data between contexts is expensive, and merging partial results requires additional operations. It also more than doubles memory requirements.
It is hard to say why RDDs are strictly faster in your case (there have been significant improvements over time, and you didn't provide a version), but I'd guess you hit some border case.
Overall, for arbitrary Python code, DataFrames are not a better option at all. This might change a bit in the future with vectorized operations backed by Arrow.
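For completeness, a hedged sketch of what such a vectorized (pandas) UDF looks like in later Spark versions (3.x type-hint style, requires PyArrow; the column names and computation are made up, and object-heavy logic like the asker's may not translate to this form):

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def analyze_pair(foo_score: pd.Series, bar_score: pd.Series) -> pd.Series:
    # operates on whole Arrow-backed batches instead of one row at a time
    return foo_score * bar_score

results_df = input_df.withColumn("result", analyze_pair("foo_score", "bar_score"))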

DataFrame orderBy followed by limit in Spark

I have a program that generates a DataFrame on which it will run something like
Select Col1, Col2...
orderBy(ColX) limit(N)
However, when I collect the data at the end, I find that it causes the driver to OOM if I take a large enough top N.
Another observation is that if I just do the sort, or just the top, this problem will not happen. So it happens only when sort and top are used at the same time.
I am wondering why this could be happening. In particular, what is really going on underneath this combination of two transforms? How will Spark evaluate a query with both sorting and limit, and what is the corresponding execution plan underneath?
Also, just curious, does Spark handle sort and top differently between DataFrames and RDDs?
EDIT,
Sorry, I didn't mean collect;
what I originally meant is that this happens when I call any action to materialize the data, regardless of whether it is collect (or any action sending data back to the driver) or not. (So the problem is definitely not the output size.)
While it is not clear why this fails in this particular case, there are multiple issues you may encounter:
When you use limit it simply puts all data on a single partition, no matter how big n is. So while it doesn't explicitly collect, it is almost as bad.
On top of that, orderBy requires a full shuffle with range partitioning, which can result in different issues when the data distribution is skewed.
Finally, when you collect, the results can be larger than the amount of memory available on the driver.
If you collect anyway, there is not much you can improve here. At the end of the day driver memory will be a limiting factor, but there are still some possible improvements:
First of all don't use limit.
Replace collect with toLocalIterator.
Use either orderBy |> rdd |> zipWithIndex |> filter or, if the exact number of values is not a hard requirement, filter data directly based on an approximated distribution as shown in Saving a spark dataframe in multiple parts without repartitioning (in Spark 2.0.0+ there is a handy approxQuantile method).
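A rough sketch of the zipWithIndex variant (assuming a DataFrame df, an ordering column "colx" and a top count n, all hypothetical names):

ordered = df.orderBy("colx")
top_n_rdd = (ordered.rdd
             .zipWithIndex()                     # (Row, index) pairs
             .filter(lambda pair: pair[1] < n)   # keep only the first n rows
             .map(lambda pair: pair[0]))
top_n_df = spark.createDataFrame(top_n_rdd, schema=ordered.schema)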

When should the groupByKey API be used in Spark programming?

groupByKey suffers from shuffling the data. And groupByKey functionality can be achieved either by using combineByKey or reduceByKey. So when should this API be used? Is there any use case?
Combine and reduce will also eventually shuffle, but they have better memory and speed performance characteristics because they are able to do more work to reduce the volume of data before the shuffle.
Consider if you had to sum a numeric attribute by group in an RDD[(group, num)]. groupByKey will give you RDD[(group, List[num])], which you can then manually reduce using map. The shuffle would need to move all the individual nums to the destination partitions/nodes to get that list - many rows being shuffled.
Because reduceByKey knows what you are doing with the nums (i.e. summing them), it can sum each individual partition before the shuffle - so you'd have at most one row per group being written out to each shuffle partition/node.
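A toy PySpark illustration of that difference - both lines produce the same per-key sums, but reduceByKey combines within each partition before anything is shuffled:

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 5)])

sums_grouped = pairs.groupByKey().mapValues(sum)      # ships every individual num
sums_reduced = pairs.reduceByKey(lambda x, y: x + y)  # map-side combine first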
According to the link below, GroupByKey should be avoided.
Avoid GroupByKey
Avoid groupByKey when the data in the merge field will be reduced to a single value, e.g. in the case of a sum for a particular key.
Use groupByKey when you know that the merge field is not going to be reduced to a single value, e.g. building up a List - avoid doing that with reduceByKey via reduce(_ ++ _).
The reason being that reducing a list allocates memory on both the map side and the reduce side, and memory allocated on an executor that doesn't own the key is wasted during the shuffle.
A good example would be TopN.
More on this -
https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey
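As a small sketch of the TopN case mentioned above, where (a slice of) the full group really is needed - here keeping the three largest values per key:

pairs = sc.parallelize([("a", 1), ("a", 9), ("a", 4), ("b", 7), ("b", 2)])
top3 = pairs.groupByKey().mapValues(lambda nums: sorted(nums, reverse=True)[:3])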
I would say that if groupByKey is the last transformation in your chain of work (or anything you do after it has only narrow dependencies), you may consider it.
The reasons reduceByKey is preferred are:
1. The map-side combine, as alister mentioned above.
2. reduceByKey also partitions the data so that the sum/aggregation becomes narrow, i.e. it can happen within partitions.

Resources