PySpark: aggregate in expression vs function call

Can anyone tell me why the aggregate call works via expr here but not through the functions API? Spark version: 3.1.1.
I am trying to calculate sum of an array column using aggregate.
Works:
import pyspark.sql.functions as f
from pyspark.sql.types import *
sdf_login_7.withColumn("new_col", f.array(f.col("a"), f.col("b")).cast(ArrayType(IntegerType())))\
    .withColumn("sum", f.expr('aggregate(new_col, 0L, (acc,x) -> acc+x)'))\
    .select(["a", "b", "new_col", "sum"]).show(3)
+---+---+--------+----+
| a| b| new_col| sum|
+---+---+--------+----+
| 10| 41|[10, 41]| 51|
| 11| 74|[11, 74]| 85|
| 11| 80|[11, 80]| 91|
+---+---+--------+----+
only showing top 3 rows
Doesn't work:
sdf_login_7.withColumn("new_col", f.array(f.col("a"), f.col("b")).cast(ArrayType(IntegerType())))\
    .withColumn("sum", f.aggregate("new_col", f.lit(0), lambda acc, x: acc + x))\
    .select(["a", "b", "new_col", "sum"]).show(3)
Py4JError:
org.apache.spark.sql.catalyst.expressions.UnresolvedNamedLambdaVariable.freshVarName
does not exist in the JVM

Related

spark aggregate set of events per key including their change timestamps

For a dataframe of:
+----+--------+-------------------+----+
|user| dt| time_value|item|
+----+--------+-------------------+----+
| id1|20200101|2020-01-01 00:00:00| A|
| id1|20200101|2020-01-01 10:00:00| B|
| id1|20200101|2020-01-01 09:00:00| A|
| id1|20200101|2020-01-01 11:00:00| B|
+----+--------+-------------------+----+
I want to capture all the unique items (i.e. collect_set) but retain each item's own time_value.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.unix_timestamp
import org.apache.spark.sql.functions.collect_set
import org.apache.spark.sql.types.TimestampType
val timeFormat = "yyyy-MM-dd HH:mm"
val dx = Seq(
  ("id1", "20200101", "2020-01-01 00:00", "A"),
  ("id1", "20200101", "2020-01-01 10:00", "B"),
  ("id1", "20200101", "2020-01-01 9:00", "A"),
  ("id1", "20200101", "2020-01-01 11:00", "B")
).toDF("user", "dt", "time_value", "item")
  .withColumn("time_value", unix_timestamp(col("time_value"), timeFormat).cast(TimestampType))
dx.show
dx.groupBy("user", "dt").agg(collect_set("item")).show
+----+--------+-----------------+
|user| dt|collect_set(item)|
+----+--------+-----------------+
| id1|20200101| [B, A]|
+----+--------+-----------------+
This does not retain the time_value information for when the signal switched from A to B. How can I keep the time_value information for each item in the set?
Would it be possible to have the collect_set within a window function to achieve the desired result? Currently, I can only think of:
- use a window function to determine pairs of events
- filter to change events
- aggregate
which needs to shuffle multiple times. Alternatively, a UDF would be possible (collect_list(sort_array(struct(time_value, item)))), but that also seems rather clumsy.
Is there a better way?
I would indeed use window functions to isolate the change points; I don't think there are alternatives:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lag, lead, lit}

val win = Window.partitionBy($"user", $"dt").orderBy($"time_value")
dx
  .orderBy($"time_value")
  .withColumn("item_change_post", coalesce(lag($"item", 1).over(win) =!= $"item", lit(false)))
  .withColumn("item_change_pre", lead($"item_change_post", 1).over(win))
  .where($"item_change_pre" or $"item_change_post")
  .show()
+----+--------+-------------------+----+----------------+---------------+
|user| dt| time_value|item|item_change_post|item_change_pre|
+----+--------+-------------------+----+----------------+---------------+
| id1|20200101|2020-01-01 09:00:00| A| false| true|
| id1|20200101|2020-01-01 10:00:00| B| true| false|
+----+--------+-------------------+----+----------------+---------------+
then use something like groupBy($"user",$"dt").agg(collect_list(struct($"time_value",$"item")))
I don't think that multiple shuffles occur, because you always partition/group by the same keys.
You can try to make it more efficient by aggregating your initial dataframe to the min/max time_value for each item, then do the same as above.
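A minimal sketch of that final step, reusing dx and win from above; changes is just a name I'm introducing for the filtered dataframe produced by the where step:
import org.apache.spark.sql.functions.{coalesce, collect_list, lag, lead, lit, struct}

// the filtered change events from the snippet above
val changes = dx
  .withColumn("item_change_post", coalesce(lag($"item", 1).over(win) =!= $"item", lit(false)))
  .withColumn("item_change_pre", lead($"item_change_post", 1).over(win))
  .where($"item_change_pre" or $"item_change_post")

// one row per (user, dt), keeping each change event as a (time_value, item) struct
changes
  .groupBy($"user", $"dt")
  .agg(collect_list(struct($"time_value", $"item")) as "changes")
  .show(false)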

best way to write Spark Dataset[U]

OK, so I am aware that Dataset.as[U] just changes the view of the DataFrame for typed operations.
As seen in this example:
case class One(one: Int)

val df = Seq(
  (1, 2, 3),
  (11, 22, 33),
  (111, 222, 333)
).toDF("one", "two", "three")

val ds: Dataset[One] = df.as[One]
ds.show
prints
+----+----+-----+
| one| two|three|
+----+----+-----+
| 1| 2| 3|
| 11| 22| 33|
| 111| 222| 333|
+----+----+-----+
This is totally fine and works in my favor most of the time. BUT now I need to write that ds to disk, with only that column one.
To enforce the schema I could do .map(x => x): since this is a typed operation, the case class schema takes effect. This also results in a Dataset[One], but with the underlying data reduced to column one. It just seems awfully expensive, looking at the execution plan:
== Physical Plan ==
*SerializeFromObject [assertnotnull(input[0, $line2012488405320.$read$$iw$$iw$One, true]).one AS one#9408]
+- *MapElements <function1>, obj#9407: $line2012488405320.$read$$iw$$iw$One
+- *DeserializeToObject newInstance(class $line2012488405320.$read$$iw$$iw$One), obj#9406: $line2012488405320.$read$$iw$$iw$One
+- LocalTableScan [one#9391]
What are alternative implementations to achieve the following?
ds.show
+----+
| one|
+----+
| 1|
| 11|
| 111|
+----+
UPDATE 1
I was thinking of a general solution to the problem. Maybe something along these lines?
import scala.reflect.runtime.universe._
import org.apache.spark.sql.{Column, DataFrame, Dataset}
import org.apache.spark.sql.functions.col

def caseClassAccessorNames[T <: Product](implicit tag: TypeTag[T]) =
  typeOf[T]
    .members
    .collect { case m: MethodSymbol if m.isCaseAccessor => m.name }
    .map(_.toString)

def project[T <: Product](ds: Dataset[T])(implicit tag: TypeTag[T]): Dataset[T] = {
  import ds.sparkSession.implicits._
  val columnsOfT: Seq[Column] =
    caseClassAccessorNames[T].map(col)(scala.collection.breakOut)
  val t: DataFrame = ds.select(columnsOfT: _*)
  t.as[T]
}
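A quick check on the ds from the example above (just a sketch of the intended usage):
val projected: Dataset[One] = project(ds)
projected.show()
// now only the "one" column is left:
// +---+
// |one|
// +---+
// |  1|
// | 11|
// |111|
// +---+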
I managed to get this working for the trivial example, but need to evaluate it further. I wonder if there are alternative, maybe built-in, ways to achieve something like this?

Spark: DataFrame Aggregation (Scala)

I have the below requirement to aggregate data on a Spark dataframe in Scala, and I have two datasets.
Dataset 1 contains values (val1, val2, ...) for each "t" type, distributed over several different columns (t1, t2, ...).
val data1 = Seq(
("1","111",200,"221",100,"331",1000),
("2","112",400,"222",500,"332",1000),
("3","113",600,"223",1000,"333",1000)
).toDF("id1","t1","val1","t2","val2","t3","val3")
data1.show()
+---+---+----+---+----+---+----+
|id1| t1|val1| t2|val2| t3|val3|
+---+---+----+---+----+---+----+
| 1|111| 200|221| 100|331|1000|
| 2|112| 400|222| 500|332|1000|
| 3|113| 600|223|1000|333|1000|
+---+---+----+---+----+---+----+
Dataset 2 represents the same thing with a separate row for each "t" type.
val data2 = Seq(("1","111",200),("1","221",100),("1","331",1000),
("2","112",400),("2","222",500),("2","332",1000),
("3","113",600),("3","223",1000), ("3","333",1000)
).toDF("id*","t*","val*")
data2.show()
+---+---+----+
|id*| t*|val*|
+---+---+----+
| 1|111| 200|
| 1|221| 100|
| 1|331|1000|
| 2|112| 400|
| 2|222| 500|
| 2|332|1000|
| 3|113| 600|
| 3|223|1000|
| 3|333|1000|
+---+---+----+
Now, I need to group by the (id, t, t*) fields and print the balances sum(val) and sum(val*) as separate records.
And both balances should be equal.
My output should look like below:
+---+---+--------+---+---------+
|id1| t |sum(val)| t*|sum(val*)|
+---+---+--------+---+---------+
| 1|111| 200|111| 200|
| 1|221| 100|221| 100|
| 1|331| 1000|331| 1000|
| 2|112| 400|112| 400|
| 2|222| 500|222| 500|
| 2|332| 1000|332| 1000|
| 3|113| 600|113| 600|
| 3|223| 1000|223| 1000|
| 3|333| 1000|333| 1000|
+---+---+--------+---+---------+
I'm thinking of exploding dataset 1 into multiple records for each "t" type and then joining with dataset 2.
But could you please suggest a better approach that wouldn't hurt performance if the datasets become bigger?
The simplest solution is to do sub-selects and then union the datasets:
import org.apache.spark.sql.functions.col

val ts = Seq(1, 2, 3)
// keep id1 so it is still available for the aggregation later
val dfs = ts.map(t => data1.select(col("id1"), col("t" + t) as "t", col("val" + t) as "val"))
val unioned = dfs.drop(1).foldLeft(dfs(0))((l, r) => l.union(r))
val joined = unioned.join(data2, col("t") === col("t*"))
// aggregation goes here
You can also try array with explode:
import org.apache.spark.sql.functions.{array, col, explode, when}

val df1 = data1
  .withColumn("colList", array('t1, 't2, 't3))
  .withColumn("t", explode('colList))

val ds = df1
  .withColumn("val",
    when('t === 't1, 'val1)
      .when('t === 't2, 'val2)
      .when('t === 't3, 'val3)
      .otherwise(0))
  .select('id1, 't, 'val)
The last step is to join this Dataset with data2:
import org.apache.spark.sql.functions.{first, sum}

ds.join(data2, 't === col("t*"))
  .groupBy("t", "t*")
  .agg(first("id1") as "id1", sum("val"), sum("val*"))

create column with a running total in a Spark Dataset

Suppose we have a Spark Dataset with two columns, say Index and Value, sorted by the first column (Index).
((1, 100), (2, 110), (3, 90), ...)
We'd like to have a Dataset with a third column with a running total of the values in the second column (Value).
((1, 100, 100), (2, 110, 210), (3, 90, 300), ...)
Any suggestions on how to do this efficiently, with one pass through the data? Or are there any canned CDF-type functions out there that could be utilized for this?
If need be, the Dataset can be converted to a DataFrame or an RDD to accomplish the task, but it will have to remain a distributed data structure. That is, it cannot simply be collected and turned into an array or sequence, and no mutable variables are to be used (val only, no var).
but it will have to remain a distributed data structure.
Unfortunately what you've said you seek to do isn't possible in Spark. If you are willing to repartition the data set to a single partition (in effect consolidating it on a single host) you could easily write a function to do what you wish, keeping the incremented value as a field.
Since Spark functions don't share state across the network when they execute, there's no way to create the shared state you would need to keep the data set completely distributed.
If you're willing to relax your requirement and allow the data to be consolidated and read through in a single pass on one host then you may do what you wish by repartitioning to a single partition and applying a function. This does not pull the data onto the driver (keeping it in HDFS/the cluster) but does still compute the output serially, on a single executor. For example:
package com.github.nevernaptitsa

import java.io.Serializable
import java.util

import org.apache.spark.sql.{Encoders, SparkSession}

object SparkTest {

  class RunningSum extends Function[Int, Tuple2[Int, Int]] with Serializable {
    private var runningSum = 0
    override def apply(v1: Int): Tuple2[Int, Int] = {
      runningSum += v1
      (v1, runningSum)
    }
  }

  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .appName("runningSumTest")
      .master("local[*]")
      .getOrCreate()
    import session.implicits._

    session.createDataset(Seq(1, 2, 3, 4, 5))
      .repartition(1)
      .map(new RunningSum)
      .show(5)

    session.createDataset(Seq(1, 2, 3, 4, 5))
      .map(new RunningSum)
      .show(5)
  }
}
The two statements here show different output, the first providing the correct output (serial, because repartition(1) is called), and the second providing incorrect output because the result is computed in parallel.
Results from first statement:
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 2| 3|
| 3| 6|
| 4| 10|
| 5| 15|
+---+---+
Results from second statement:
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 5| 9|
+---+---+
A colleague suggested the following, which relies on the RDD.mapPartitionsWithIndex() method.
(To my knowledge, the other data structures do not provide this kind of reference to their partitions' indices.)
val data = sc.parallelize((1 to 5)) // sc is the SparkContext
val partialSums = data.mapPartitionsWithIndex{ (i, values) =>
Iterator((i, values.sum))
}.collect().toMap // will in general have size other than data.count
val cumSums = data.mapPartitionsWithIndex{ (i, values) =>
val prevSums = (0 until i).map(partialSums).sum
values.scanLeft(prevSums)(_+_).drop(1)
}
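As a quick sanity check (collect is used here only because the example is tiny), the cumulative sums should come out the same regardless of how the five elements are partitioned, since each partition adds in the totals of the partitions before it:
cumSums.collect()   // Array(1, 3, 6, 10, 15)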

How to use a broadcast collection in a udf?

How do I use a broadcast collection in a Spark SQL 1.6.1 UDF? The UDF should be called from the main SQL as shown below:
sqlContext.sql("""Select col1,col2,udf_1(key) as value_from_udf FROM table_a""")
udf_1() should look up a value in a small broadcast collection and return it to the main SQL.
Here's a minimal reproducible example in PySpark, illustrating the use of broadcast variables to perform lookups, employing a lambda function as a UDF inside a SQL statement.
# Create dummy data and register as table
df = sc.parallelize([
(1,"a"),
(2,"b"),
(3,"c")]).toDF(["num","let"])
df.registerTempTable('table')
# Create broadcast variable from local dictionary
myDict = {1: "y", 2: "x", 3: "z"}
broadcastVar = sc.broadcast(myDict)
# Alternatively, if your dict is a key-value rdd,
# you can do sc.broadcast(rddDict.collectAsMap())
# Create lookup function and apply it
sqlContext.registerFunction("lookup", lambda x: broadcastVar.value.get(x))
sqlContext.sql('select num, let, lookup(num) as test from table').show()
+---+---+----+
|num|let|test|
+---+---+----+
| 1| a| y|
| 2| b| x|
| 3| c| z|
+---+---+----+
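Since the question itself is about Spark SQL 1.6.1 in Scala, a comparable sketch might look like the following. It assumes the sc and sqlContext available in a Spark 1.6 shell, an already registered table table_a with columns col1, col2 and key, and a made-up lookup map:
// small local map, shipped to executors as a broadcast variable (placeholder contents)
val myMap = Map(1 -> "y", 2 -> "x", 3 -> "z")
val broadcastVar = sc.broadcast(myMap)

// register a UDF that looks the key up in the broadcast value
sqlContext.udf.register("udf_1", (key: Int) => broadcastVar.value.get(key).orNull)

sqlContext.sql("""Select col1, col2, udf_1(key) as value_from_udf FROM table_a""").show()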
