Spark SQL dataframe: best way to compute across rowpairs - apache-spark

I have a Spark DataFrame "deviceDF" like so :
ID date_time state
a 2015-12-11 4:30:00 up
a 2015-12-11 5:00:00 down
a 2015-12-11 5:15:00 up
b 2015-12-12 4:00:00 down
b 2015-12-12 4:20:00 up
a 2015-12-12 10:15:00 down
a 2015-12-12 10:20:00 up
b 2015-12-14 15:30:00 down
I am trying to calculate the downtime for each of the IDs. I started simple by grouping based on id and separately computing the sum of all uptimes and downtimes. Then take the difference of the summed uptime and downtime.
val downtimeDF = deviceDF.filter($"state" === "down")
.groupBy("ID")
.agg(sum(unix_timestamp($"date_time")) as "down_time")
val uptimeDF = deviceDF.filter($"state" === "up")
.groupBy("ID")
.agg(sum(unix_timestamp($"date_time")) as "up_time")
val updownjoinDF = uptimeDF.join(downtimeDF, "ID")
val difftimeDF = updownjoinDF
.withColumn("diff_time", $"up_time" - $"down_time")
However there are few conditions that cause errors, such as the device went down but never came back up, in this case, the down_time is the difference between current_time and last_time it was down.
Also if the first entry for a particular device starts with 'up' then the down_time is difference of the first_entry and the time at the begining of this analysis, say 2015-12-11 00:00:00. Whats the best way to handle these border conditions using dataframe? Do I need to write a custom UDAF ?

The first thing you can try is to use window functions. While this is usually not the fastest possible solution it is concise and extremely expressive. Taking your data as an example:
import org.apache.spark.sql.functions.unix_timestamp
val df = sc.parallelize(Array(
("a", "2015-12-11 04:30:00", "up"), ("a", "2015-12-11 05:00:00", "down"),
("a", "2015-12-11 05:15:00", "up"), ("b", "2015-12-12 04:00:00", "down"),
("b", "2015-12-12 04:20:00", "up"), ("a", "2015-12-12 10:15:00", "down"),
("a", "2015-12-12 10:20:00", "up"), ("b", "2015-12-14 15:30:00", "down")))
.toDF("ID", "date_time", "state")
.withColumn("timestamp", unix_timestamp($"date_time"))
Lets define example window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lag, when, sum}
val w = Window.partitionBy($"ID").orderBy($"timestamp")
some helper columns
val previousTimestamp = coalesce(lag($"timestamp", 1).over(w), $"timestamp")
val previousState = coalesce(lag($"state", 1).over(w), $"state")
val downtime = when(
previousState === "down",
$"timestamp" - previousTimestamp
).otherwise(0).alias("downtime")
val uptime = when(
previousState === "up",
$"timestamp" - previousTimestamp
).otherwise(0).alias("uptime")
and finally a basic query:
val upsAndDowns = df.select($"*", uptime, downtime)
upsAndDowns.show
// +---+-------------------+-----+----------+------+--------+
// | ID| date_time|state| timestamp|uptime|downtime|
// +---+-------------------+-----+----------+------+--------+
// | a|2015-12-11 04:30:00| up|1449804600| 0| 0|
// | a|2015-12-11 05:00:00| down|1449806400| 1800| 0|
// | a|2015-12-11 05:15:00| up|1449807300| 0| 900|
// | a|2015-12-12 10:15:00| down|1449911700|104400| 0|
// | a|2015-12-12 10:20:00| up|1449912000| 0| 300|
// | b|2015-12-12 04:00:00| down|1449889200| 0| 0|
// | b|2015-12-12 04:20:00| up|1449890400| 0| 1200|
// | b|2015-12-14 15:30:00| down|1450103400|213000| 0|
// +---+-------------------+-----+----------+------+--------+
In a similar manner you cna look forward and if there is no more records in a group you can adjust total uptime / downtime using current timestamp.
Window functions provide some other useful features like window definitions with ROWS BETWEEN and RANGE BETWEEN clauses.
Another possible solution is to move your data to RDD and use low level operations with RangePartitioner, mapPartitions and sliding windows. For basic things you can even groupBy. This requires significantly more effort but is also much more flexible.
Finally there is a spark-timeseries package from Cloudera. Documentation is close to non-existent but tests are comprehensive enough to give you some idea how to use it.
Regarding custom UDAFs I wouldn't be to optimistic. UDAF API is rather specific and not exactly flexible.

Related

Spark: Transforming multiple dataframes in parallel

Understanding how to achieve best parallelism while transforming multiple dataframes in parallel
I have an array of paths
val paths = Array("path1", "path2", .....
I am loading dataframe from each path then transforming and writing to destination path
paths.foreach(path => {
val df = spark.read.parquet(path)
df.transform(processData).write.parquet(path+"_processed")
})
The transformation processData is independent of dataframe I am loading.
This limits to processing one dataframe at a time and most of my cluster resources are idle. As processing each dataframe is independent, I converted Array to ParArray of scala.
paths.par.foreach(path => {
val df = spark.read.parquet(path)
df.transform(processData).write.parquet(path+"_processed")
})
Now it is using more resources in cluster. I am still trying to understand how it works and how to fine tune the parallel processing here
If I increase the default scala parallelism using ForkJoinPool to higher number, can it lead to more threads spawning at driver side and will be in lock state waiting for foreach function to finish and eventually kill the driver?
How does it effect the centralized spark things like EventLoggingListnener which needs to handle more inflow of events as multiple dataframes are processed in parallel.
What parameters do I consider for optimal resource utilization.
Any other approach
Any resources I can go through to understand this scaling would be very helpful
The reason why this is slow is that spark is very good at parallelizing computations on lots of data, stored in one big dataframe. However, it is very bad at dealing with lots of dataframes. It will start the computation on one using all its executors (even though they are not all needed) and wait for it to finish before starting the next one. This results in a lot of inactive processors. This is bad but that's not what spark was designed for.
I have a hack for you. There might need to refine it a little, but you would have the idea. Here is what I would do. From a list of paths, I would extract all the schemas of the parquet files and create a new big schema that gathers all the columns. Then, I would ask spark to read all the parquet files using this schema (the columns that are not present will be set to null automatically). I would then union all the dataframes and perform the transformation on this big dataframe and finally use partitionBy to store the dataframes in separate files, while still doing all of it in parallel. It would look like this.
// let create two sample datasets with one column in common (id)
// and two different columns x != y
val d1 = spark.range(3).withColumn("x", 'id * 10)
d1.show
+---+----+
| id| x |
+---+----+
| 0| 0|
| 1| 10|
| 2| 20|
+---+----+
val d2 = spark.range(2).withColumn("y", 'id cast "string")
d2.show
+---+---+
| id| y|
+---+---+
| 0| 0|
| 1| 1|
+---+---+
// And I store them
d1.write.parquet("hdfs:///tmp/d1.parquet")
d2.write.parquet("hdfs:///tmp/d2.parquet")
// Now let's create the big schema
val paths = Seq("hdfs:///tmp/d1.parquet", "hdfs:///tmp/d2.parquet")
val fields = paths
.flatMap(path => spark.read.parquet(path).schema.fields)
.toSet //removing duplicates
.toArray
val big_schema = StructType(fields)
// and let's use it
val dfs = paths.map{ path =>
spark.read
.schema(big_schema)
.parquet(path)
.withColumn("path", lit(path.split("/").last))
}
// The we are ready to create one big dataframe
dfs.reduce( _ unionAll _).show
+---+----+----+----------+
| id| x| y| file|
+---+----+----+----------+
| 1| 1|null|d1.parquet|
| 2| 2|null|d1.parquet|
| 0| 0|null|d1.parquet|
| 0|null| 0|d2.parquet|
| 1|null| 1|d2.parquet|
+---+----+----+----------+
Yet, I do not recommend using unionAll on lots of dataframes. Because of spark's analysis of the execution plan, it can be very slow with many dataframes. I would use the RDD version although it is more verbose.
val rdds = sc.union(dfs.map(_.rdd))
// let's not forget to add the path to the schema
val big_df = spark.createDataFrame(rdds,
big_schema.add(StructField("path", StringType, true)))
transform(big_df)
.write
.partitionBy("path")
.parquet("hdfs:///tmp/processed.parquet")
And having a look at my processed directory, I get this:
hdfs:///tmp/processed.parquet/_SUCCESS
hdfs:///tmp/processed.parquet/path=d1.parquet
hdfs:///tmp/processed.parquet/path=d2.parquet
You should play with some variables here. Most important are: CPU cores, the size of each DF and a little use of futures. The propose is decide the priority of each DF to be processed. You can use FAIR configuration but that don't be enough and process all in parallel could consume a big part of your cluster. You have to assign priorities to DFs and use Future pooll to control the number of parallel Jobs running in your app.

Aggregating several fields simultaneously from Dataset

I have a data with the following scheme:
sourceip
destinationip
packets sent
And I want to calculate several aggregative fields out of this data and have the following schema:
ip
packets sent as sourceip
packets sent as destination
In the happy days of RDDs I could use aggregate, define a map of {ip -> []}, and count the appearances in a corresponding array location.
In the Dataset/Dataframe aggregate is no longer available, instead UDAF could be used, unfortunately, from the experience I had with UDAF they are immutable, means they cannot be used (have to create a new instance on every map update) example + explanation here
on one hand, technically, I could convert the Dataset to RDD, aggregate etc and go back to dataset. Which I expect would result in performance degradation, as Datasets are more optimized. UDAFs are out of the question due to the copying.
Is there any other way to perform aggregations?
It sounds like you need a standard melt (How to melt Spark DataFrame?) and pivot combination:
val df = Seq(
("192.168.1.102", "192.168.1.122", 10),
("192.168.1.122", "192.168.1.65", 10),
("192.168.1.102", "192.168.1.97", 10)
).toDF("sourceip", "destinationip", "packets sent")
df.melt(Seq("packets sent"), Seq("sourceip", "destinationip"), "type", "ip")
.groupBy("ip")
.pivot("type", Seq("sourceip", "destinationip"))
.sum("packets sent").na.fill(0).show
// +-------------+--------+-------------+
// | ip|sourceip|destinationip|
// +-------------+--------+-------------+
// | 192.168.1.65| 0| 10|
// |192.168.1.102| 20| 0|
// |192.168.1.122| 10| 10|
// | 192.168.1.97| 0| 10|
// +-------------+--------+-------------+
One way to go about it without any custom aggregation would be to use flatMap (or explode for dataframes) like this:
case class Info(ip : String, sent : Int, received : Int)
case class Message(from : String, to : String, p : Int)
val ds = Seq(Message("ip1", "ip2", 5),
Message("ip2", "ip3", 7),
Message("ip2", "ip1", 1),
Message("ip3", "ip2", 3)).toDS()
ds
.flatMap(x => Seq(Info(x.from, x.p, 0), Info(x.to, 0, x.p)))
.groupBy("ip")
.agg(sum('sent) as "sent", sum('received) as "received")
.show
// +---+----+--------+
// | ip|sent|received|
// +---+----+--------+
// |ip2| 8| 8|
// |ip3| 3| 7|
// |ip1| 5| 1|
// +---+----+--------+
As far as the performance is concerned, I am not sure a flatMap is an improvement versus a custom aggregation though.
Here is a pyspark version using explode. It is more verbose but the logic is exactly the same as the flatMap version, only with pure dataframe code.
sc\
.parallelize([("ip1", "ip2", 5), ("ip2", "ip3", 7), ("ip2", "ip1", 1), ("ip3", "ip2", 3)])\
.toDF(("from", "to", "p"))\
.select(F.explode(F.array(\
F.struct(F.col("from").alias("ip"),\
F.col("p").alias("received"),\
F.lit(0).cast("long").alias("sent")),\
F.struct(F.col("to").alias("ip"),\
F.lit(0).cast("long").alias("received"),\
F.col("p").alias("sent")))))\
.groupBy("col.ip")\
.agg(F.sum(F.col("col.received")).alias("received"), F.sum(F.col("col.sent")).alias("sent"))
// +---+----+--------+
// | ip|sent|received|
// +---+----+--------+
// |ip2| 8| 8|
// |ip3| 3| 7|
// |ip1| 5| 1|
// +---+----+--------+
Since you didn't mention the context and aggregations, you may do something like below,
val df = ??? // your dataframe/ dataset
From Spark source:
(Scala-specific) Compute aggregates by specifying a map from column
name to aggregate methods. The resulting DataFrame will also contain
the grouping columns. The available aggregate methods are avg, max,
min, sum, count.
// Selects the age of the oldest employee and the aggregate expense
for each department
df
.groupBy("department")
.agg(Map(
"age" -> "max",
"expense" -> "sum"
))

create column with a running total in a Spark Dataset

Suppose we have a Spark Dataset with two columns, say Index and Value, sorted by the first column (Index).
((1, 100), (2, 110), (3, 90), ...)
We'd like to have a Dataset with a third column with a running total of the values in the second column (Value).
((1, 100, 100), (2, 110, 210), (3, 90, 300), ...)
Any suggestions how to do this efficiently, with one pass through the data? Or are there any canned CDF type functions out there that could be utilized for this?
If need be, the Dataset can be converted to a Dataframe or an RDD to accomplish the task, but it will have to remain a distributed data structure. That is, it cannot be simply collected and turned to an array or sequence, and no mutable variables are to be used (val only, no var).
but it will have to remain a distributed data structure.
Unfortunately what you've said you seek to do isn't possible in Spark. If you are willing to repartition the data set to a single partition (in effect consolidating it on a single host) you could easily write a function to do what you wish, keeping the incremented value as a field.
Since Spark functions don't share state across the network when they execute, there's no way to create the shared state you would need to keep the data set completely distributed.
If you're willing to relax your requirement and allow the data to be consolidated and read through in a single pass on one host then you may do what you wish by repartitioning to a single partition and applying a function. This does not pull the data onto the driver (keeping it in HDFS/the cluster) but does still compute the output serially, on a single executor. For example:
package com.github.nevernaptitsa
import java.io.Serializable
import java.util
import org.apache.spark.sql.{Encoders, SparkSession}
object SparkTest {
class RunningSum extends Function[Int, Tuple2[Int, Int]] with Serializable {
private var runningSum = 0
override def apply(v1: Int): Tuple2[Int, Int] = {
runningSum+=v1
return (v1, runningSum)
}
}
def main(args: Array[String]): Unit ={
val session = SparkSession.builder()
.appName("runningSumTest")
.master("local[*]")
.getOrCreate()
import session.implicits._
session.createDataset(Seq(1,2,3,4,5))
.repartition(1)
.map(new RunningSum)
.show(5)
session.createDataset(Seq(1,2,3,4,5))
.map(new RunningSum)
.show(5)
}
}
The two statements here show different output, the first providing the correct output (serial, because repartition(1) is called), and the second providing incorrect output because the result is computed in parallel.
Results from first statement:
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 2| 3|
| 3| 6|
| 4| 10|
| 5| 15|
+---+---+
Results from second statement:
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 5| 9|
+---+---+
A colleague suggested the following which relies on the RDD.mapPartitionsWithIndex() method.
(To my knowledge, the other data structure do not provide this kind of reference to their partitions' indices.)
val data = sc.parallelize((1 to 5)) // sc is the SparkContext
val partialSums = data.mapPartitionsWithIndex{ (i, values) =>
Iterator((i, values.sum))
}.collect().toMap // will in general have size other than data.count
val cumSums = data.mapPartitionsWithIndex{ (i, values) =>
val prevSums = (0 until i).map(partialSums).sum
values.scanLeft(prevSums)(_+_).drop(1)
}

Does GraphFrames api support creation of Bipartite graphs?

Does GraphFrames api support creation of Bipartite graphs in the current version?
Current version: 0.1.0
Spark version : 1.6.1
As pointed out in the comments to this question, neither GraphFrames nor GraphX have built-in support for bipartite graphs. However, they both have more than enough flexibility to let you create bipartite graphs. For a GraphX solution, see this previous answer. That solution uses a shared trait between the different vertex / object type. And while that works with RDDs that's not going to work for DataFrames. A row in a DataFrame has a fixed schema -- it can't sometimes contain a price column and sometimes not. It can have a price column that's sometimes null, but the column has to exist in every row.
Instead, the solution for GraphFrames seems to be that you need to define a DataFrame that's essentially a linear sub-type of both types of objects in your bipartite graph -- it has to contain all of the fields of both types of objects. This is actually pretty easy -- a join with full_outer is going to give you that. Something like this:
val players = Seq(
(1,"dave", 34),
(2,"griffin", 44)
).toDF("id", "name", "age")
val teams = Seq(
(101,"lions","7-1"),
(102,"tigers","5-3"),
(103,"bears","0-9")
).toDF("id","team","record")
You could then create a super-set DataFrame like this:
val teamPlayer = players.withColumnRenamed("id", "l_id").join(
teams.withColumnRenamed("id", "r_id"),
$"r_id" === $"l_id", "full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
.drop($"r_id")
.withColumnRenamed("l_id", "id")
teamPlayer.show
+---+-------+----+------+------+
| id| name| age| team|record|
+---+-------+----+------+------+
|101| null|null| lions| 7-1|
|102| null|null|tigers| 5-3|
|103| null|null| bears| 0-9|
| 1| dave| 34| null| null|
| 2|griffin| 44| null| null|
+---+-------+----+------+------+
You could possibly do it a little cleaner with structs:
val tpStructs = players.select($"id" as "l_id", struct($"name", $"age") as "player").join(
teams.select($"id" as "r_id", struct($"team",$"record") as "team"),
$"l_id" === $"r_id",
"full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
.drop($"r_id")
.withColumnRenamed("l_id", "id")
tpStructs.show
+---+------------+------------+
| id| player| team|
+---+------------+------------+
|101| null| [lions,7-1]|
|102| null|[tigers,5-3]|
|103| null| [bears,0-9]|
| 1| [dave,34]| null|
| 2|[griffin,44]| null|
+---+------------+------------+
I'll also point out that more or less the same solution would work in GraphX with RDDs. You could always create a vertex via joining two case classes that don't share any traits:
case class Player(name: String, age: Int)
val playerRdd = sc.parallelize(Seq(
(1L, Player("date", 34)),
(2L, Player("griffin", 44))
))
case class Team(team: String, record: String)
val teamRdd = sc.parallelize(Seq(
(101L, Team("lions", "7-1")),
(102L, Team("tigers", "5-3")),
(103L, Team("bears", "0-9"))
))
playerRdd.fullOuterJoin(teamRdd).collect foreach println
(101,(None,Some(Team(lions,7-1))))
(1,(Some(Player(date,34)),None))
(102,(None,Some(Team(tigers,5-3))))
(2,(Some(Player(griffin,44)),None))
(103,(None,Some(Team(bears,0-9))))
With all respect to the previous answer, this seems like a more flexible way to handle it -- without having to share a trait between the combined objects.

How to get multiple queries in single run

For example I have a dataframe like below,
df
DataFrame[columnA: int, columnB: int]
If I have to do two checks. I am going over the data two times like below,
df.where(df.columnA == 412).count()
df.where(df.columnB == 25).count()
In normal code I would be having two count variable and incrementing on True. How would I go with spark dataframe? Appreciate if some one could link to right documentation also. Happy to see python or scala.
For example like this:
import org.apache.spark.sql.functions.sum
val df = sc.parallelize(Seq(
(412, 0),
(0, 25),
(412, 25),
(0, 25)
)).toDF("columnA", "columnB")
df.agg(
sum(($"columnA" === 412).cast("long")).alias("columnA"),
sum(($"columnB" === 25).cast("long")).alias("columnB")
).show
// +-------+-------+
// |columnA|columnB|
// +-------+-------+
// | 2| 3|
// +-------+-------+
or like this:
import org.apache.spark.sql.functions.{count, when}
df.agg(
count(when($"columnA" === 412, $"columnA")).alias("columnA"),
count(when($"columnB" === 25, $"columnB")).alias("columnB")
).show
// +-------+-------+
// |columnA|columnB|
// +-------+-------+
// | 2| 3|
// +-------+-------+
I am not aware of any specific documentation but I am pretty sure you'll find this in any good SQL reference.
#zero323's answer is spot on, but just to indicate the most flexible programming model is Spark, you could do your checks as if statements inside a map with a lambda function, e.g. (using the same dataframe as above)
import org.apache.spark.sql.functions._
val r1 = df.map(x => {
var x0 = 0
var x1 = 0
if (x(0) == 412) x0=1
if (x(1) == 25) x1=1
(x0, x1)
}).toDF("x0", "x1").select(sum("x0"), sum("x1")).show()
This model lets you do almost anything you can think of, though you're much better off sticking with the specific APIs where available.

Resources