How to compute the largest value in a column using withColumn? - apache-spark

I'm trying to compute the largest value of the following DataFrame in Spark 1.6.1:
val df = sc.parallelize(Seq(1,2,3)).toDF("id")
A first approach would be to select the maximum value, and it works as expected:
df.select(max($"id")).show
The second approach could be to use withColumn as follows:
df.withColumn("max", max($"id")).show
But unfortunately it fails with the following error message:
org.apache.spark.sql.AnalysisException: expression 'id' is neither
present in the group by, nor is it an aggregate function. Add to group
by or wrap in first() (or first_value) if you don't care which value
you get.;
How can I compute the maximum value in a withColumn function without any Window or groupBy? If not possible, how can I do it in this specific case using a Window?

The right approach is to compute the aggregate as a separate query and combine it with the actual result. Unlike the window functions suggested in many answers here, it doesn't require a shuffle to a single partition and is applicable to large datasets.
It can be done with withColumn using a separate action:
import org.apache.spark.sql.functions.{lit, max}
df.withColumn("max", lit(df.agg(max($"id")).as[Int].first))
but it is much cleaner to use either an explicit:
import org.apache.spark.sql.functions.broadcast
df.crossJoin(broadcast(df.agg(max($"id") as "max")))
or an implicit cross join:
spark.conf.set("spark.sql.crossJoin.enabled", true)
df.join(broadcast(df.agg(max($"id") as "max")))
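For the toy DataFrame from the question, all of the variants above should produce the same result; a sketch of the expected output (written here as comments, not captured from a run):
df.crossJoin(broadcast(df.agg(max($"id") as "max"))).show
// +---+---+
// | id|max|
// +---+---+
// |  1|  3|
// |  2|  3|
// |  3|  3|
// +---+---+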

There are a few categories of functions in Apache Spark.
Aggregate functions, e.g. max, for when we want to aggregate multiple rows into one.
Non-aggregate functions, e.g. abs, isnull, for when we want to transform one column into another.
Collection functions, e.g. explode, for when one row expands into multiple rows.
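To make the distinction concrete, here is a minimal sketch showing one function from each category (assuming the toy df from the question and spark-shell implicits):
import org.apache.spark.sql.functions.{max, abs, explode}

df.select(max($"id"))     // aggregate: many rows collapse into one
df.select(abs($"id"))     // non-aggregate: one value per row, row count unchanged
Seq(Seq(1, 2, 3)).toDF("ids").select(explode($"ids"))  // collection: one row expands into several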
Implicit aggregation
They are used when we want to aggregate multiple rows into one.
The following code internally performs an aggregation.
df.select(max($"id")).explain
== Physical Plan ==
*HashAggregate(keys=[], functions=[max(id#3)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_max(id#3)])
      +- *Project [value#1 AS id#3]
         +- Scan ExistingRDD[value#1]
We can also use multiple aggregate functions in a single select.
df.select(max($"id"), min($"id")).explain
Aggregate functions cannot be mixed with non-aggregate functions directly.
The following code will report an error.
df.select(max($"id"), $"id")
df.withColumn("max", max($"id"))
This is because max($"id") produces fewer values (a single one) than $"id".
Aggregate with over
In this case the aggregate is applied as an analytic (window) function, and its result is presented for every row in the result set.
We can use
df.select(max($"id").over, $"id").show
Or
df.withColumn("max", max($"id").over).show

This is Spark 2.0.
With withColumn and window functions it can be done as follows:
df.withColumn("max", max('id) over)
Note the empty over, which assumes an "empty" window (and is equivalent to over()).
If you however need a more complete WindowSpec you can do the following (again, this is 2.0):
import org.apache.spark.sql.expressions._
// the trick that has performance cost (!)
val window = Window.orderBy()
df.withColumn("max", max('id) over window).show
Please note that the code has a serious performance issue as reported by Spark itself:
WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
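As a quick sanity check (a sketch; the exact plan text may differ between versions), the physical plan of the windowed query should show the same single-partition exchange that the warning refers to:
df.withColumn("max", max('id) over window).explain
// Expect a Window operator on top of an Exchange SinglePartition node:
// every row is moved to one partition so the unpartitioned window can be evaluated.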

Related

row_number and orderby window function narrow transformation equivalent in spark

df.withColumn("col1", functions.row_number().over(Window.partitionBy("col2").orderBy("col3")));
I think this operation is a wide transformation in nature: it partitions the data and sorts each partition, which causes a lot of shuffling and hence performance issues.
I have a use case where the data in each partition is independent of the other partitions. So I want to run the above row_number function within each existing partition only, i.e. I need a narrow transformation, and I couldn't find a way to do it.
Is there a way to achieve this?
I read about orderBy vs sort, that they are wide and narrow respectively. Does that hold true in this case as well?

Spark Dataset join performance

I receive a Dataset and I am required to join it with another table. Hence the simplest solution that came to my mind was to create a second Dataset for the other table and perform the joinWith.
def joinFunction(dogs: Dataset[Dog]): Dataset[(Dog, Cat)] = {
  val cats: Dataset[Cat] = spark.table("dev_db.cat").as[Cat]
  dogs.joinWith(cats, ...)
}
Here my main concern is with spark.table("dev_db.cat"), as it feels like we are referring to all of the cat data as
SELECT * FROM dev_db.cat
and then doing a join at a later stage. Or will the query optimizer directly perform the join without referring to the whole table? Is there a better solution?
Here are some suggestions for your case:
a. If you have where, filter, limit, take etc. operations, try to apply them before joining the two datasets. Spark cannot always push these kinds of filters down, so you have to reduce the number of target records yourself as much as possible. Here is an excellent source of information on the Spark optimizer.
b. Try to co-locate the datasets and minimize the shuffled data by using the repartition function. The repartitioning should be based on the keys that participate in the join, i.e.:
dogs.repartition(1024, "key_col1", "key_col2")
dogs.join(cats, Seq("key_col1", "key_col2"), "inner")
c. Try to use broadcast for the smaller dataset if you are sure that it can fit in memory (or increase the value of spark.sql.autoBroadcastJoinThreshold so Spark chooses a broadcast join on its own). This gives a definite boost to the performance of your Spark program since it ensures the co-existence of the two datasets within the same node (see the sketch after this list).
If you can't apply any of the above then Spark doesn't have a way to know which records should be excluded and therefore will scan all the available rows from both datasets.
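A minimal sketch of points a and c together, assuming the dogs and cats Datasets from the question, the hypothetical key columns from point b, and a weight column used purely for illustration:
import org.apache.spark.sql.functions.{broadcast, col}

// filter the smaller side first (point a), then broadcast it (point c)
val slimCats = cats.filter(col("weight") < 10)
dogs.join(broadcast(slimCats), Seq("key_col1", "key_col2"), "inner")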
You need to do an explain and see whether predicate pushdown is used. Then you can judge whether your concern is correct or not.
In general, though, if no complex datatypes are used and there are no datatype mismatches, pushdown takes place. You can see that with a simple createOrReplaceTempView as well. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html
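A hedged sketch of such a check (the filter column and the join keys are assumptions, used only to show where to look in the plan):
import spark.implicits._

val cats = spark.table("dev_db.cat").as[Cat].filter($"age" > 2)    // "age" is an assumed column
val joined = dogs.joinWith(cats, dogs("cat_id") === cats("id"))    // join keys are assumed
joined.explain(true)
// For file-based sources (e.g. parquet) the scan node in the physical plan carries a
// PushedFilters: [...] entry; if the filter shows up there, predicate pushdown took place.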

How to use groupByKey() on multiple RDDs?

I have multiple RDDs with one common field CustomerId.
For eg:
debitcardRdd has data as (CustomerId, debitField1, debitField2, ......)
creditcardRdd has data as (CustomerId, creditField1, creditField2, ....)
netbankingRdd has data as (CustomerId, nbankingField1, nbankingField2, ....)
We perform different transformations on each individual RDD; however, we need to perform a transformation on the data from all 3 RDDs by grouping on CustomerId.
Example: (CustomerId, debitField1, creditField2, bankingField1, ....)
Is there any way we can group the data from all the RDDs based on the same key?
Note: In Apache Beam it can be done by using coGroupByKey, just checking if there is such an alternative available in Spark.
Just cogroup:
debitcardRdd.keyBy(_.CustomerId).cogroup(
  creditcardRdd.keyBy(_.CustomerId),
  netbankingRdd.keyBy(_.CustomerId)
)
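A brief sketch of consuming the cogrouped result (the RDD and field names come from the question; the tuple handling is just one possible shape):
debitcardRdd.keyBy(_.CustomerId)
  .cogroup(creditcardRdd.keyBy(_.CustomerId), netbankingRdd.keyBy(_.CustomerId))
  .mapValues { case (debits, credits, netbanking) =>
    // each value is an Iterable of the records sharing this CustomerId
    (debits.toSeq, credits.toSeq, netbanking.toSeq)
  }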
In contrast to the other answer, the .keyBy is imho not strictly required here (if the RDDs are already keyed), and note that cogroup, which is not well documented, can group up to three other RDDs in a single call.
val rddREScogX = rdd1.cogroup(rdd2, rdd3, rddn)
Points should go to the first answer.

Spark: where doesn't work properly

I have 2 datasets, and I want to create a joined dataset, so I did:
Dataset<Row> join = ds1.join(ds2, "id");
However, for performance enhancement I tried to replace the join with .where(cond) (I also tried .filter(cond)), like this:
Dataset<Row> join = ds1.where(col("id").equalTo(ds2.col("id")));
which also works, but not when one of the datasets is empty (in this case it returns the non-empty dataset), which is not the expected result.
So my question is why .where doesn't work properly in that case, or is there another optimized solution for joining 2 datasets without using join()?
A join and a where condition are 2 different things. Your where-condition code will fail due to an attribute resolution issue. A where or filter condition is specific to that DataFrame; if you mention a second DataFrame in the condition, it won't iterate over it like a join does. Please check whether your code is giving you a result at all.
Absolutely one of the key points when you want to join two RDDs is the partitioner used for them. If the first and the second RDD have the same partitioner, then your join operation will perform as well as it possibly can. If the partitioners differ, the first RDD's partitioner will be used to partition the second RDD.
So try to use a "light" key, e.g. an encoded or hashed output of a String instead of the raw value, and the same partitioner for both RDDs.
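A minimal sketch of the idea, assuming two hypothetical RDDs of records keyed by an id field (the names and the partition count are assumptions):
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(200)
// give both sides the same partitioner so the join does not need to reshuffle either of them
val left  = rdd1.keyBy(_.id).partitionBy(partitioner)
val right = rdd2.keyBy(_.id).partitionBy(partitioner)
val joined = left.join(right)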

PySpark: Best practice to add more columns to a DataFrame

Spark DataFrames have a method withColumn to add one new column at a time. To add multiple columns, a chain of withColumn calls is required. Is this the best practice?
I feel that using mapPartitions has more advantages. Let's say I have a chain of three withColumns and then one filter to remove rows based on certain conditions. These are four different operations (I am not sure if any of these are wide transformations, though). But I can do it all in one go if I do a mapPartitions. It also helps if I have a database connection that I would prefer to open once per RDD partition.
My question has two parts.
The first part, this is my implementation of mapPartitions. Are there any unforeseen issues with this approach? And is there a more elegant way to do this?
from pyspark.sql import Row

def add_new_cols(rows):
    db = open_db_connection()
    new_rows = []
    new_row_1 = Row("existing_col_1", "existing_col_2", "new_col_1", "new_col_2")
    i = 0
    for each_row in rows:
        i += 1
        # conditionally omit rows
        if i % 3 == 0:
            continue
        db_result = db.get_some_result(each_row.existing_col_2)
        new_col_1 = ''.join([db_result, "_NEW"])
        new_col_2 = db_result
        new_f_row = new_row_1(each_row.existing_col_1, each_row.existing_col_2, new_col_1, new_col_2)
        new_rows.append(new_f_row)
    db.close()
    return iter(new_rows)

df2 = df.rdd.mapPartitions(add_new_cols).toDF()
The second part, what are the tradeoffs in using mapPartitions over a chain of withColumn and filter?
I read somewhere that using the available methods with Spark DFs is always better than rolling out your own implementation. Please let me know if my argument is wrong. Thank you! All thoughts are welcome.
Are there any unforeseen issues with this approach?
Multiple. The most severe implications are:
A memory footprint a few times higher compared to plain DataFrame code, and significant garbage collection overhead.
High cost of serialization and deserialization required to move data between execution contexts.
Introducing a breaking point in the query planner.
As written, the cost of schema inference on the toDF call (which can be avoided if a proper schema is provided) and possible re-execution of all preceding steps.
And so on...
Some of these can be avoided with udf and select / withColumn, others cannot.
let's say I have a chain of three withColumns and then one filter to remove Rows based on certain conditions. These are four different operations (I am not sure if any of these are wide transformations, though). But I can do it all in one go if I do a mapPartitions
Your mapPartitions doesn't remove any operations, and doesn't provide any optimizations that the Spark planner could not apply on its own. Its only advantage is that it provides a nice scope for expensive connection objects.
I read somewhere that using the available methods with Spark DFs are always better than rolling out your own implementation
When you start using executor-side Python logic you already diverge from Spark SQL. It doesn't matter if you use udf, RDD, or the newly added vectorized udf. At the end of the day, you should make the decision based on the overall structure of your code - if it is predominantly Python logic executed directly on the data, it might be better to stick with RDDs or skip Spark completely.
If it is just a fraction of the logic and doesn't cause severe performance issues, don't sweat it.
Using df.withColumn() is the best way to add columns. They are all added lazily.
