Create a column with a running total in a Spark Dataset

Suppose we have a Spark Dataset with two columns, say Index and Value, sorted by the first column (Index).
((1, 100), (2, 110), (3, 90), ...)
We'd like to have a Dataset with a third column with a running total of the values in the second column (Value).
((1, 100, 100), (2, 110, 210), (3, 90, 300), ...)
Any suggestions how to do this efficiently, with one pass through the data? Or are there any canned CDF type functions out there that could be utilized for this?
If need be, the Dataset can be converted to a DataFrame or an RDD to accomplish the task, but it has to remain a distributed data structure. That is, it cannot simply be collected and turned into an array or sequence, and no mutable variables are to be used (val only, no var).

but it will have to remain a distributed data structure.
Unfortunately what you've said you seek to do isn't possible in Spark. If you are willing to repartition the data set to a single partition (in effect consolidating it on a single host) you could easily write a function to do what you wish, keeping the incremented value as a field.
Since Spark functions don't share state across the network when they execute, there's no way to create the shared state you would need to keep the data set completely distributed.
If you're willing to relax your requirement and allow the data to be consolidated and read through in a single pass on one host then you may do what you wish by repartitioning to a single partition and applying a function. This does not pull the data onto the driver (keeping it in HDFS/the cluster) but does still compute the output serially, on a single executor. For example:
package com.github.nevernaptitsa

import java.io.Serializable
import org.apache.spark.sql.SparkSession

object SparkTest {
  // Stateful function: every deserialized copy keeps its own running sum.
  class RunningSum extends (Int => (Int, Int)) with Serializable {
    private var runningSum = 0
    override def apply(v1: Int): (Int, Int) = {
      runningSum += v1
      (v1, runningSum)
    }
  }

  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .appName("runningSumTest")
      .master("local[*]")
      .getOrCreate()
    import session.implicits._

    // First statement: a single partition, processed serially.
    session.createDataset(Seq(1, 2, 3, 4, 5))
      .repartition(1)
      .map(new RunningSum)
      .show(5)

    // Second statement: default partitioning, processed in parallel.
    session.createDataset(Seq(1, 2, 3, 4, 5))
      .map(new RunningSum)
      .show(5)
  }
}
The two statements here show different output, the first providing the correct output (serial, because repartition(1) is called), and the second providing incorrect output because the result is computed in parallel.
Results from first statement:
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 2| 3|
| 3| 6|
| 4| 10|
| 5| 15|
+---+---+
Results from second statement:
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 5| 9|
+---+---+
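One way to see where the second result comes from is a plain-Python model (no Spark involved). The partition layout [1], [2], [3], [4, 5] is an assumption chosen because it reproduces the output above; the real layout depends on the environment. The names RunningSum and map_per_partition are only for illustration.

```python
# Plain-Python model (no Spark): each partition works with its own copy of
# the stateful function, so the running sum restarts in every partition.

class RunningSum:
    """Mimics the stateful RunningSum function from the Scala example."""

    def __init__(self):
        self.running_sum = 0

    def __call__(self, v):
        self.running_sum += v
        return (v, self.running_sum)


def map_per_partition(partitions):
    """Apply a fresh RunningSum to each partition independently."""
    out = []
    for part in partitions:
        f = RunningSum()  # a separate copy per partition: state starts at 0
        out.extend(f(v) for v in part)
    return out


# Assumed layout for the parallel case (hypothetical, see lead-in).
parallel = map_per_partition([[1], [2], [3], [4, 5]])
# The repartition(1) case: everything in one partition.
single = map_per_partition([[1, 2, 3, 4, 5]])

print(parallel)  # [(1, 1), (2, 2), (3, 3), (4, 4), (5, 9)]
print(single)    # [(1, 1), (2, 3), (3, 6), (4, 10), (5, 15)]
```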

A colleague suggested the following which relies on the RDD.mapPartitionsWithIndex() method.
(To my knowledge, the other data structures do not provide this kind of reference to their partitions' indices.)
val data = sc.parallelize(1 to 5) // sc is the SparkContext
// First pass: one partial sum per partition, collected to the driver.
val partialSums = data.mapPartitionsWithIndex { (i, values) =>
  Iterator((i, values.sum))
}.collect().toMap // will in general have a size other than data.count
// Second pass: start each partition from the sum of all earlier partitions.
val cumSums = data.mapPartitionsWithIndex { (i, values) =>
  val prevSums = (0 until i).map(partialSums).sum
  values.scanLeft(prevSums)(_ + _).drop(1)
}
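The two-pass idea above can be modelled without Spark. The sketch below is plain Python, with partitions represented as ordinary lists; cumulative_sums is a hypothetical helper name, not part of any Spark API. It is meant only to show the offset arithmetic, under the assumption that the partitions are already in the desired order.

```python
from itertools import accumulate


def cumulative_sums(partitions):
    """Two-pass running total over a list of partitions (lists of numbers)."""
    # Pass 1: one total per partition (small; this is what gets collected).
    partial_sums = [sum(part) for part in partitions]
    # Offset of partition i = sum of the totals of partitions 0..i-1.
    offsets = [0] + list(accumulate(partial_sums))[:-1]
    # Pass 2: each partition scans its own values, starting from its offset.
    return [
        [offset + s for s in accumulate(part)]
        for part, offset in zip(partitions, offsets)
    ]


parts = [[1, 2], [3], [4, 5]]
result = cumulative_sums(parts)
print(result)  # [[1, 3], [6], [10, 15]]
```

Only the per-partition totals travel to the driver, so the data itself stays distributed; the cost is that the data is read twice.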

Related

spark aggregate set of events per key including their change timestamps

For a dataframe of:
+----+--------+-------------------+----+
|user| dt| time_value|item|
+----+--------+-------------------+----+
| id1|20200101|2020-01-01 00:00:00| A|
| id1|20200101|2020-01-01 10:00:00| B|
| id1|20200101|2020-01-01 09:00:00| A|
| id1|20200101|2020-01-01 11:00:00| B|
+----+--------+-------------------+----+
I want to capture all the unique items (i.e. collect_set), but have each item retain its own time_value.
import org.apache.spark.sql.functions.{col, collect_set, unix_timestamp}
import org.apache.spark.sql.types.TimestampType
// toDF needs spark.implicits._, which spark-shell imports automatically.
val timeFormat = "yyyy-MM-dd HH:mm"
val dx = Seq(
  ("id1", "20200101", "2020-01-01 00:00", "A"),
  ("id1", "20200101", "2020-01-01 10:00", "B"),
  ("id1", "20200101", "2020-01-01 09:00", "A"),
  ("id1", "20200101", "2020-01-01 11:00", "B")
).toDF("user", "dt", "time_value", "item")
  .withColumn("time_value", unix_timestamp(col("time_value"), timeFormat).cast(TimestampType))
dx.show
A plain aggregation such as
dx.groupBy("user", "dt").agg(collect_set("item")).show
+----+--------+-----------------+
|user| dt|collect_set(item)|
+----+--------+-----------------+
| id1|20200101| [B, A]|
+----+--------+-----------------+
does not retain the time_value at which the signal switched from A to B. How can I keep the time value for each item in the set?
Would it be possible to have the collect_set within a window function to achieve the desired result? Currently, I can only think of:
use a window function to determine pairs of events
filter to change events
aggregate
which needs to shuffle multiple times. Alternatively, a UDF would be possible (applied to sort_array(collect_list(struct(time_value, item)))), but that also seems rather clumsy.
Is there a better way?
I would indeed use window functions to isolate the change points; I don't think there are alternatives:
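import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lag, lead, lit}
val win = Window.partitionBy($"user", $"dt").orderBy($"time_value")
dx
  .orderBy($"time_value")
  // true when the item differs from the previous row's item
  .withColumn("item_change_post", coalesce(lag($"item", 1).over(win) =!= $"item", lit(false)))
  // true when the next row is a change point
  .withColumn("item_change_pre", lead($"item_change_post", 1).over(win))
  .where($"item_change_pre" or $"item_change_post")
  .show()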
+----+--------+-------------------+----+----------------+---------------+
|user| dt| time_value|item|item_change_post|item_change_pre|
+----+--------+-------------------+----+----------------+---------------+
| id1|20200101|2020-01-01 09:00:00| A| false| true|
| id1|20200101|2020-01-01 10:00:00| B| true| false|
+----+--------+-------------------+----+----------------+---------------+
Then use something like groupBy($"user",$"dt").agg(collect_list(struct($"time_value",$"item"))) to aggregate the remaining rows.
I don't think that multiple shuffles occur, because you always partition/group by the same keys.
You can try to make it more efficient by aggregating your initial dataframe to the min/max time_value for each item, then do the same as above.
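The lag/lead logic above can be checked on the sample data with a plain-Python sketch (no Spark). The function name change_points and the tuple layout (user, dt, time_value, item) are assumptions made for illustration; timestamps are kept as strings, which sort correctly in this fixed format.

```python
from itertools import groupby


def change_points(rows):
    """Keep rows directly before or after an item change within (user, dt).

    rows: iterable of (user, dt, time_value, item) tuples.
    """
    kept = []
    # Sorting by (user, dt, time_value) plays the role of partitionBy + orderBy.
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    for _, group in groupby(ordered, key=lambda r: (r[0], r[1])):
        g = list(group)
        for i, row in enumerate(g):
            # lag: this row's item differs from the previous row's item
            changed_from_prev = i > 0 and g[i - 1][3] != row[3]
            # lead: the next row's item differs from this row's item
            next_changes = i + 1 < len(g) and g[i + 1][3] != row[3]
            if changed_from_prev or next_changes:
                kept.append(row)
    return kept


events = [
    ("id1", "20200101", "2020-01-01 00:00:00", "A"),
    ("id1", "20200101", "2020-01-01 10:00:00", "B"),
    ("id1", "20200101", "2020-01-01 09:00:00", "A"),
    ("id1", "20200101", "2020-01-01 11:00:00", "B"),
]
kept = change_points(events)
for row in kept:
    print(row)
# ('id1', '20200101', '2020-01-01 09:00:00', 'A')
# ('id1', '20200101', '2020-01-01 10:00:00', 'B')
```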

Aggregating several fields simultaneously from Dataset

I have data with the following schema:
sourceip
destinationip
packets sent
And I want to calculate several aggregative fields out of this data and have the following schema:
ip
packets sent as sourceip
packets sent as destination
In the happy days of RDDs I could use aggregate, define a map of {ip -> []}, and count the appearances in a corresponding array location.
With Dataset/DataFrame, aggregate is no longer available; a UDAF could be used instead. Unfortunately, in my experience UDAF buffers are immutable, which makes them unusable here (a new instance has to be created on every map update).
Technically, I could convert the Dataset to an RDD, aggregate, and convert back to a Dataset, but I expect that to degrade performance, as Datasets are better optimized. UDAFs are out of the question due to the copying.
Is there any other way to perform aggregations?
It sounds like you need a standard melt and pivot combination (melt here is the helper defined in the linked question "How to melt Spark DataFrame?"):
val df = Seq(
  ("192.168.1.102", "192.168.1.122", 10),
  ("192.168.1.122", "192.168.1.65", 10),
  ("192.168.1.102", "192.168.1.97", 10)
).toDF("sourceip", "destinationip", "packets sent")
df.melt(Seq("packets sent"), Seq("sourceip", "destinationip"), "type", "ip")
  .groupBy("ip")
  .pivot("type", Seq("sourceip", "destinationip"))
  .sum("packets sent").na.fill(0).show
// +-------------+--------+-------------+
// | ip|sourceip|destinationip|
// +-------------+--------+-------------+
// | 192.168.1.65| 0| 10|
// |192.168.1.102| 20| 0|
// |192.168.1.122| 10| 10|
// | 192.168.1.97| 0| 10|
// +-------------+--------+-------------+
One way to go about it without any custom aggregation would be to use flatMap (or explode for dataframes) like this:
case class Info(ip : String, sent : Int, received : Int)
case class Message(from : String, to : String, p : Int)
val ds = Seq(
  Message("ip1", "ip2", 5),
  Message("ip2", "ip3", 7),
  Message("ip2", "ip1", 1),
  Message("ip3", "ip2", 3)).toDS()
ds
  .flatMap(x => Seq(Info(x.from, x.p, 0), Info(x.to, 0, x.p)))
  .groupBy("ip")
  .agg(sum('sent) as "sent", sum('received) as "received")
  .show
// +---+----+--------+
// | ip|sent|received|
// +---+----+--------+
// |ip2| 8| 8|
// |ip3| 3| 7|
// |ip1| 5| 1|
// +---+----+--------+
As far as the performance is concerned, I am not sure a flatMap is an improvement versus a custom aggregation though.
Here is a pyspark version using explode. It is more verbose but the logic is exactly the same as the flatMap version, only with pure dataframe code.
from pyspark.sql import functions as F

sc.parallelize([("ip1", "ip2", 5), ("ip2", "ip3", 7), ("ip2", "ip1", 1), ("ip3", "ip2", 3)]) \
    .toDF(("from", "to", "p")) \
    .select(F.explode(F.array(
        # the sender: p packets sent, 0 received
        F.struct(F.col("from").alias("ip"),
                 F.col("p").alias("sent"),
                 F.lit(0).cast("long").alias("received")),
        # the receiver: 0 packets sent, p received
        F.struct(F.col("to").alias("ip"),
                 F.lit(0).cast("long").alias("sent"),
                 F.col("p").alias("received"))))) \
    .groupBy("col.ip") \
    .agg(F.sum(F.col("col.sent")).alias("sent"), F.sum(F.col("col.received")).alias("received")) \
    .show()
# +---+----+--------+
# | ip|sent|received|
# +---+----+--------+
# |ip2| 8| 8|
# |ip3| 3| 7|
# |ip1| 5| 1|
# +---+----+--------+
Since you didn't mention the context or the aggregations you need, you could do something like the following:
val df = ??? // your dataframe/ dataset
From the Spark source:
(Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns. The available aggregate methods are avg, max, min, sum, count.
// Selects the age of the oldest employee and the aggregate expense for each department
df
  .groupBy("department")
  .agg(Map(
    "age" -> "max",
    "expense" -> "sum"
  ))

How to use a broadcast collection in a udf?

How can I use a broadcast collection in a Spark SQL 1.6.1 UDF? The UDF should be called from the main SQL query, as shown below:
sqlContext.sql("""Select col1,col2,udf_1(key) as value_from_udf FROM table_a""")
udf_1() should look up the key in a small broadcast collection and return the value to the main query.
Here's a minimal reproducible example in PySpark, illustrating the use of broadcast variables to perform lookups, employing a lambda function as a UDF inside a SQL statement.
# Create dummy data and register as table
df = sc.parallelize([
(1,"a"),
(2,"b"),
(3,"c")]).toDF(["num","let"])
df.registerTempTable('table')
# Create broadcast variable from local dictionary
myDict = {1: "y", 2: "x", 3: "z"}
broadcastVar = sc.broadcast(myDict)
# Alternatively, if your dict is a key-value rdd,
# you can do sc.broadcast(rddDict.collectAsMap())
# Create lookup function and apply it
sqlContext.registerFunction("lookup", lambda x: broadcastVar.value.get(x))
sqlContext.sql('select num, let, lookup(num) as test from table').show()
+---+---+----+
|num|let|test|
+---+---+----+
| 1| a| y|
| 2| b| x|
| 3| c| z|
+---+---+----+

Does GraphFrames api support creation of Bipartite graphs?

Does GraphFrames api support creation of Bipartite graphs in the current version?
Current version: 0.1.0
Spark version : 1.6.1
As pointed out in the comments to this question, neither GraphFrames nor GraphX has built-in support for bipartite graphs. However, both are flexible enough to let you create them. For a GraphX solution, see this previous answer. That solution uses a shared trait between the different vertex / object types. While that works with RDDs, it won't work for DataFrames: a row in a DataFrame has a fixed schema, so it can't sometimes contain a price column and sometimes not. It can have a price column that's sometimes null, but the column has to exist in every row.
Instead, the solution for GraphFrames seems to be that you need to define a DataFrame that's essentially a linear sub-type of both types of objects in your bipartite graph -- it has to contain all of the fields of both types of objects. This is actually pretty easy -- a join with full_outer is going to give you that. Something like this:
val players = Seq(
(1,"dave", 34),
(2,"griffin", 44)
).toDF("id", "name", "age")
val teams = Seq(
(101,"lions","7-1"),
(102,"tigers","5-3"),
(103,"bears","0-9")
).toDF("id","team","record")
You could then create a super-set DataFrame like this:
val teamPlayer = players.withColumnRenamed("id", "l_id").join(
teams.withColumnRenamed("id", "r_id"),
$"r_id" === $"l_id", "full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
.drop($"r_id")
.withColumnRenamed("l_id", "id")
teamPlayer.show
+---+-------+----+------+------+
| id| name| age| team|record|
+---+-------+----+------+------+
|101| null|null| lions| 7-1|
|102| null|null|tigers| 5-3|
|103| null|null| bears| 0-9|
| 1| dave| 34| null| null|
| 2|griffin| 44| null| null|
+---+-------+----+------+------+
You could possibly do it a little cleaner with structs:
val tpStructs = players.select($"id" as "l_id", struct($"name", $"age") as "player").join(
teams.select($"id" as "r_id", struct($"team",$"record") as "team"),
$"l_id" === $"r_id",
"full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
.drop($"r_id")
.withColumnRenamed("l_id", "id")
tpStructs.show
+---+------------+------------+
| id| player| team|
+---+------------+------------+
|101| null| [lions,7-1]|
|102| null|[tigers,5-3]|
|103| null| [bears,0-9]|
| 1| [dave,34]| null|
| 2|[griffin,44]| null|
+---+------------+------------+
I'll also point out that more or less the same solution would work in GraphX with RDDs. You could always create a vertex via joining two case classes that don't share any traits:
case class Player(name: String, age: Int)
val playerRdd = sc.parallelize(Seq(
  (1L, Player("dave", 34)),
  (2L, Player("griffin", 44))
))
case class Team(team: String, record: String)
val teamRdd = sc.parallelize(Seq(
  (101L, Team("lions", "7-1")),
  (102L, Team("tigers", "5-3")),
  (103L, Team("bears", "0-9"))
))
playerRdd.fullOuterJoin(teamRdd).collect foreach println
(101,(None,Some(Team(lions,7-1))))
(1,(Some(Player(dave,34)),None))
(102,(None,Some(Team(tigers,5-3))))
(2,(Some(Player(griffin,44)),None))
(103,(None,Some(Team(bears,0-9))))
With all respect to the previous answer, this seems like a more flexible way to handle it -- without having to share a trait between the combined objects.

Spark, DataFrame: apply transformer/estimator on groups

I have a DataFrame that looks like the following:
+-----------+-----+------------+
| userID|group| features|
+-----------+-----+------------+
|12462563356| 1| [5.0,43.0]|
|12462563701| 2| [1.0,8.0]|
|12462563701| 1| [2.0,12.0]|
|12462564356| 1| [1.0,1.0]|
|12462565487| 3| [2.0,3.0]|
|12462565698| 2| [1.0,1.0]|
|12462565698| 1| [1.0,1.0]|
|12462566081| 2| [1.0,2.0]|
|12462566081| 1| [1.0,15.0]|
|12462566225| 2| [1.0,1.0]|
|12462566225| 1| [9.0,85.0]|
|12462566526| 2| [1.0,1.0]|
|12462566526| 1| [3.0,79.0]|
|12462567006| 2| [11.0,15.0]|
|12462567006| 1| [10.0,15.0]|
|12462567006| 3| [10.0,15.0]|
|12462586595| 2| [2.0,42.0]|
|12462586595| 3| [2.0,16.0]|
|12462589343| 3| [1.0,1.0]|
+-----------+-----+------------+
The column types are userID: Long, group: Int, and features: Vector.
This is already a grouped DataFrame, i.e. a userID appears in a particular group at most once.
My goal is to scale the features column per group.
Is there a way to apply a feature transformer (in my case, a StandardScaler) per group instead of applying it to the full DataFrame?
P.S. Using ML is not mandatory, so a solution based on MLlib is fine too.
Compute statistics
Spark >= 3.0
Summarizer now supports standard deviations, so:
import org.apache.spark.ml.stat.Summarizer
val summary = data
  .groupBy($"group")
  .agg(Summarizer.metrics("mean", "std")
    .summary($"features").alias("stats"))
  .as[(Int, (Vector, Vector))]
  .collect.toMap
Spark >= 2.3
In Spark 2.3 or later you could also use Summarizer:
import org.apache.spark.ml.stat.Summarizer
val summaryVar = data
  .groupBy($"group")
  .agg(Summarizer.metrics("mean", "variance")
    .summary($"features").alias("stats"))
  .as[(Int, (Vector, Vector))]
  .collect.toMap
and adjust downstream code to handle variances instead of standard deviations.
Spark < 2.0 (also Spark < 2.3, with adjustments for conversions between ml and mllib Vectors)
You can compute statistics by group using almost the same code as the default StandardScaler:
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
// Compute multivariate statistics per group
val summary = data.select($"group", $"features")
  .rdd
  .map {
    case Row(group: Int, features: Vector) => (group, features)
  }
  .aggregateByKey(new MultivariateOnlineSummarizer)( // start from an empty summarizer
    (agg, v) => agg.add(v),            // seqOp: add a sample vector to the summarizer
    (agg1, agg2) => agg1.merge(agg2))  // combOp: merge two partial summarizers
  .mapValues {
    s => (
      s.variance.toArray.map(math.sqrt(_)), // standard deviation per dimension
      s.mean.toArray                        // mean per dimension
    )
  }.collectAsMap
Transformation
If the expected number of groups is relatively low, you can broadcast these:
val summaryBd = sc.broadcast(summary)
and transform your data:
val scaledRows = data.rdd.map { case Row(userID, group: Int, features: Vector) =>
  val (stdev, mean) = summaryBd.value(group)
  val vs = features.toArray.clone()
  // Standardize each dimension; dimensions with zero deviation become 0.0.
  for (i <- 0 until vs.size) {
    vs(i) = if (stdev(i) == 0.0) 0.0 else (vs(i) - mean(i)) * (1 / stdev(i))
  }
  Row(userID, group, Vectors.dense(vs))
}
val scaledDf = sqlContext.createDataFrame(scaledRows, data.schema)
Otherwise you can simply join. It shouldn't be hard to wrap this as an ML transformer with the group column as a param.
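To make the two steps (per-group statistics, then scaling) concrete, here is a plain-Python sketch without Spark. It assumes the sample (n-1) standard deviation is wanted and that a group with a single row gets a deviation of 0.0; scale_per_group and the row layout (user_id, group, features) are names chosen for illustration only.

```python
from collections import defaultdict
from statistics import mean, stdev


def scale_per_group(rows):
    """Standardize feature vectors per group.

    rows: list of (user_id, group, features), features being a list of floats.
    """
    by_group = defaultdict(list)
    for _, group, features in rows:
        by_group[group].append(features)

    # Per-group, per-dimension (stdev, mean), like the collected summary map.
    stats = {}
    for group, vectors in by_group.items():
        columns = list(zip(*vectors))
        # Assumption: a single-row group has no spread, so use 0.0.
        stds = [stdev(c) if len(c) > 1 else 0.0 for c in columns]
        means = [mean(c) for c in columns]
        stats[group] = (stds, means)

    scaled = []
    for user_id, group, features in rows:
        stds, means = stats[group]
        scaled.append((user_id, group, [
            0.0 if s == 0.0 else (x - m) / s  # same zero guard as above
            for x, m, s in zip(features, means, stds)
        ]))
    return scaled


sample = [
    (1, 1, [1.0, 10.0]),
    (2, 1, [3.0, 10.0]),
    (3, 2, [5.0, 2.0]),
]
scaled = scale_per_group(sample)
for row in scaled:
    print(row)
```

In this sample, the second dimension of group 1 is constant and group 2 has a single row, so both hit the zero-deviation guard and come out as 0.0.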
