Does spark.sql.Dataset.groupByKey support window operations like groupBy?

In Spark Structured Streaming, we can do window operations on event time with groupBy like:
import spark.implicits._
val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
window($"timestamp", "10 minutes", "5 minutes"),
$"word"
).count()
Does groupByKey also support window operations?
Thanks.

It is possible to write a helper function that makes it easier to generate a time-windowing function to give to groupByKey.
object windowing {
import java.sql.Timestamp
import java.time.Instant
/** given:
* a row type R
* a function from R to the Timestamp
* a windowing width in seconds
* return: a function that allows groupByKey to do windowing
*/
def windowBy[R](f:R=>Timestamp, width: Int) = {
val w = width.toLong * 1000L
(row: R) => {
val tsCur = f(row)
val msCur = tsCur.getTime()
val msLB = (msCur / w) * w
val instLB = Instant.ofEpochMilli(msLB)
val instUB = Instant.ofEpochMilli(msLB+w)
(Timestamp.from(instLB), Timestamp.from(instUB))
}
}
}
And in your example, it might be used like this:
import java.sql.Timestamp
case class MyRow(timestamp: Timestamp, word: String)
val windowBy60 = windowing.windowBy[MyRow](_.timestamp, 60)
// count words by time window
words.as[MyRow]
.groupByKey(windowBy60)
.count()
Or counting by (window, word) pairs:
words.as[MyRow]
.groupByKey(row => (windowBy60(row), row.word))
.count()
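The helper above assigns each row to exactly one fixed (tumbling) window, whereas the question's window($"timestamp", "10 minutes", "5 minutes") is a sliding window, where a row can fall into several overlapping windows. A minimal sketch of that assignment in plain Scala follows. SlidingWindows and windowsFor are hypothetical names, not part of Spark, and the windows are assumed to be aligned to the epoch.

```scala
import java.sql.Timestamp

object SlidingWindows {
  /** Returns every [start, end) window of length widthMs, starting every
    * slideMs (aligned to the epoch), that contains the timestamp ts.
    * Hypothetical helper for illustration; not a Spark API.
    */
  def windowsFor(ts: Timestamp, widthMs: Long, slideMs: Long): List[(Timestamp, Timestamp)] = {
    require(widthMs > 0 && slideMs > 0, "width and slide must be positive")
    val ms = ts.getTime
    // Latest window start that is at or before ms (floorDiv also handles negative epochs).
    val lastStart = Math.floorDiv(ms, slideMs) * slideMs
    // Walk backwards one slide at a time while the window still covers ms.
    Iterator.iterate(lastStart)(_ - slideMs)
      .takeWhile(start => start + widthMs > ms)
      .map(start => (new Timestamp(start), new Timestamp(start + widthMs)))
      .toList
      .reverse // earliest window first
  }
}
```

With a Dataset, one option would be to flatMap each row into one (window, word) pair per window returned here and then call groupByKey on that pair. That wiring is an assumption and is not shown.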

Yes and no. The window function cannot be used directly with groupByKey, as it is applicable only to the SQL / DataFrame API, but you can always extend the record with a window field:
val dfWithWindow = df.withColumn("window", window(...))
case class Window(start: java.sql.Timestamp, end: java.sql.Timestamp)
case class MyRecordWithWindow(..., window: Window)
and use it for grouping:
dfWithWindow.as[MyRecordWithWindow].groupByKey(_.window).mapGroups(...)

Related

Technique for joining with spark dataframe w/ custom partitioner works w/ python, but not scala?

I recently read an article that described how to custom partition a dataframe
[ https://dataninjago.com/2019/06/01/create-custom-partitioner-for-spark-dataframe/ ] in which the author illustrated the technique in Python. I use Scala, and the technique looked like a good way to address issues of skew, so I tried something similar, and what I found was that when one does the following:
- create 2 data frames, D1, D2
- convert D1, D2 to 2 Pair RDDs R1,R2
(where the key is the key you want to join on)
- repartition R1,R2 with a custom partitioner 'C'
where 'C' has 2 partitions (p-0,p-1) and
stuffs everything in P-1, except keys == 'a'
- join R1,R2 as R3
- OBSERVE that:
- partitioner for R3 is 'C' (same for R1,R2)
- when printing the contents of each partition of R3, all entries
except the one keyed by 'a' are in p-1
- set D1' <- R1.toDF
- set D2' <- R2.toDF
We note the following results:
0) The join of D1' and D2' produces the expected results (good)
1) The partitioners for D1' and D2' are None -- not Some(C),
as was the case with RDD's R1/R2 (bad)
2) The contents of the glom'd underlying RDDs of D1' and D2' did
not have everything (except key 'a') piled up
in partition 1 as expected.(bad)
So, I came away with the following conclusion... which will work for me practically... But it really irks me that I could not get the behavior in the article which used Python:
When one needs to use custom partitioning with DataFrames in Scala, one must
drop into RDDs, do the join or whatever operation on the RDD, then convert back
to a DataFrame. You can't apply the custom partitioner, then convert back to a
DataFrame, do your operations, and expect the custom partitioning to work.
Now...I am hoping I am wrong ! Perhaps someone with more expertise in Spark internals can guide me here. I have written a little program (below) to illustrate the results. Thanks in advance if you can set me straight.
UPDATE
In addition to the Spark code which illustrates the problem I also tried a simplified version of what the original article presented in Python. The conversions below create a dataframe, extract its underlying RDD and repartition it, then recover the dataframe and verify that the partitioner is lost.
Python snippet illustrating problem
from pyspark.sql.types import IntegerType
mylist = [1, 2, 3, 4]
df = spark.createDataFrame(mylist, IntegerType())
def travelGroupPartitioner(key):
return 0
dfRDD = df.rdd.map(lambda x: (x[0],x))
dfRDD2 = dfRDD.partitionBy(8, travelGroupPartitioner)
# this line uses the approach of the original article and maps to only the value,
# but map doesn't guarantee preserving the partitioner, so I tried without the
# map below...
df2 = spark.createDataFrame(dfRDD2.map(lambda x: x[1]))
print ( df2.rdd.partitioner ) # prints None
# create dataframe from partitioned RDD _without_ the map,
# and we _still_ lose partitioner
df3 = spark.createDataFrame(dfRDD2)
print ( df3.rdd.partitioner ) # prints None
Scala snippet illustrating problem
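import org.apache.spark.{Partitioner, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
object Question extends App {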
val conf =
new SparkConf().setAppName("blah").
setMaster("local").set("spark.sql.shuffle.partitions", "2")
val sparkSession = SparkSession.builder .config(conf) .getOrCreate()
val spark = sparkSession
import spark.implicits._
sparkSession.sparkContext.setLogLevel("ERROR")
class CustomPartitioner(num: Int) extends Partitioner {
def numPartitions: Int = num
def getPartition(key: Any): Int = if (key.toString == "a") 0 else 1
}
case class Emp(name: String, deptId: String)
case class Dept(deptId: String, name: String)
val value: RDD[Emp] = spark.sparkContext.parallelize(
Seq(
Emp("anne", "a"),
Emp("dave", "d"),
Emp("claire", "c"),
Emp("roy", "r"),
Emp("bob", "b"),
Emp("zelda", "z"),
Emp("moe", "m")
)
)
val employee: Dataset[Emp] = value.toDS()
val department: Dataset[Dept] = spark.sparkContext.parallelize(
Seq(
Dept("a", "ant dept"),
Dept("d", "duck dept"),
Dept("c", "cat dept"),
Dept("r", "rabbit dept"),
Dept("b", "badger dept"),
Dept("z", "zebra dept"),
Dept("m", "mouse dept")
)
).toDS()
val dumbPartitioner: Partitioner = new CustomPartitioner(2)
// Convert to-be-joined dataframes to custom repartition RDDs [ custom partitioner: cp ]
//
val deptPairRdd: RDD[(String, Dept)] = department.rdd.map { dept => (dept.deptId, dept) }
val empPairRdd: RDD[(String, Emp)] = employee.rdd.map { emp: Emp => (emp.deptId, emp) }
val cpEmpRdd: RDD[(String, Emp)] = empPairRdd.partitionBy(dumbPartitioner)
val cpDeptRdd: RDD[(String, Dept)] = deptPairRdd.partitionBy(dumbPartitioner)
assert(cpEmpRdd.partitioner.get == dumbPartitioner)
assert(cpDeptRdd.partitioner.get == dumbPartitioner)
// Here we join using RDDs and ensure that the resultant rdd is partitioned so most things end up in partition 1
val joined: RDD[(String, (Emp, Dept))] = cpEmpRdd.join(cpDeptRdd)
val reso: Array[(Array[(String, (Emp, Dept))], Int)] = joined.glom().collect().zipWithIndex
// item._2 is the partition index produced by zipWithIndex, not a size
reso.foreach((item: Tuple2[Array[(String, (Emp, Dept))], Int]) => println(s"partition index: ${item._2}. contents: ${item._1.toList}"))
System.out.println("partitioner of RDD created by joining 2 RDD's w/ custom partitioner: " + joined.partitioner)
assert(joined.partitioner.contains(dumbPartitioner))
// Recover the DataFrames from the custom-partitioned RDDs (cp*), not the unpartitioned pair RDDs
val recoveredDeptDF: DataFrame = cpDeptRdd.toDF
val recoveredEmpDF: DataFrame = cpEmpRdd.toDF
System.out.println(
"partitioner for DF recovered from custom partitioned RDD (not as expected!):" +
recoveredDeptDF.rdd.partitioner)
val joinedDf = recoveredEmpDF.join(recoveredDeptDF, "_1")
println("printing results of joining the 2 dataframes we 'recovered' from the custom partitioned RDDS (looks good)")
joinedDf.show()
println("PRINTING partitions of joined DF does not match the glom'd results we got from underlying RDDs")
joinedDf.rdd.glom().collect().
zipWithIndex.foreach {
item: Tuple2[Any, Int] =>
val asList = item._1.asInstanceOf[Array[org.apache.spark.sql.Row]].toList
println(s"partition index: ${item._2}. contents: $asList")
}
assert(joinedDf.rdd.partitioner.contains(dumbPartitioner)) // this will fail ;^(
}
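The partitioning rule used by CustomPartitioner above can be checked without a Spark session. The sketch below applies the same key rule to the department keys from the question and shows the layout one would expect from a partitioner that is actually honoured. PartitionRuleSketch and layout are hypothetical names used only for illustration.

```scala
object PartitionRuleSketch {
  // Same rule as getPartition in the question: key "a" goes to partition 0, the rest to 1.
  def getPartition(key: Any): Int = if (key.toString == "a") 0 else 1

  // Groups keys by the partition index they would be assigned to.
  def layout(keys: Seq[String]): Map[Int, Seq[String]] =
    keys.groupBy(k => getPartition(k))

  def main(args: Array[String]): Unit = {
    val keys = Seq("a", "d", "c", "r", "b", "z", "m")
    layout(keys).toSeq.sortBy(_._1).foreach { case (idx, ks) =>
      println(s"partition $idx: ${ks.mkString(", ")}")
    }
  }
}
```

Comparing this layout with the glom'd output of the DataFrame join makes the mismatch described in the question easy to see.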
Check out my new library, which adds a repartitionBy method at the Dataset/DataFrame API level.
Taking your Emp and Dept objects as example:
class DeptByIdPartitioner extends TypedPartitioner[Dept] {
override def getPartitionIdx(value: Dept): Int = if (value.deptId.startsWith("a")) 0 else 1
override def numPartitions: Int = 2
override def partitionKeys: Option[Set[PartitionKey]] = Some(Set(("deptId", StringType)))
}
class EmpByDepIdPartitioner extends TypedPartitioner[Emp] {
override def getPartitionIdx(value: Emp): Int = if (value.deptId.startsWith("a")) 0 else 1
override def numPartitions: Int = 2
override def partitionKeys: Option[Set[PartitionKey]] = Some(Set(("deptId", StringType)))
}
Note that we are extending TypedPartitioner.
It is compile-time safe: you won't be able to repartition a dataset of persons with the Emp partitioner.
val spark = SparkBuilder.getSpark()
import org.apache.spark.sql.exchange.implicits._ //<-- additional import
import spark.implicits._
val deptPartitioned = department.repartitionBy(new DeptByIdPartitioner)
val empPartitioned = employee.repartitionBy(new EmpByDepIdPartitioner)
Let's check how our data is partitioned:
Dep dataset:
Partition N 0
: List([a,ant dept])
Partition N 1
: List([d,duck dept], [c,cat dept], [r,rabbit dept], [b,badger dept], [z,zebra dept], [m,mouse dept])
If we join datasets that were repartitioned by the same key, Catalyst will properly recognize this:
val joined = deptPartitioned.join(empPartitioned, "deptId")
println("Joined:")
val result: Array[(Int, Array[Row])] = joined.rdd.glom().collect().zipWithIndex.map(_.swap)
for (elem <- result) {
println(s"Partition N ${elem._1}")
println(s"\t: ${elem._2.toList}")
}
Partition N 0
: List([a,ant dept,anne])
Partition N 1
: List([b,badger dept,bob], [c,cat dept,claire], [d,duck dept,dave], [m,mouse dept,moe], [r,rabbit dept,roy], [z,zebra dept,zelda])
What version of Spark are you using? If it's 2.x or above, it's recommended to use the DataFrame/Dataset API rather than RDDs.
That API is much easier to work with than RDDs, and it performs much better on later versions of Spark.
You may find the link below useful for how to join DFs:
How to join two dataframes in Scala and select on few columns from the dataframes by their index?
Once you get your joined DataFrame, you can use the link below for partitioning by column values, which I assume you're trying to achieve:
Partition a spark dataframe based on column value?

Cannot evaluate ML model on Structured Streaming, because RDD transformations and actions are invoked inside other transformations

This is a well-known limitation[1] of Structured Streaming that I'm trying to get around using a custom sink.
In what follows, modelsMap is a map of string keys to org.apache.spark.mllib.stat.KernelDensity models
and
streamingData is a streaming dataframe org.apache.spark.sql.DataFrame = [id1: string, id2: string ... 6 more fields]
I'm trying to evaluate each row of streamingData against its corresponding model from modelsMap, enhance each row with the prediction, and write to Kafka.
An obvious way would be .withColumn, using a UDF to predict, and write using kafka sink.
But this is illegal because:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It
could happen in the following cases: (1) RDD transformations and
actions are NOT invoked by the driver, but inside of other
transformations; for example, rdd1.map(x => rdd2.values.count() * x) is
invalid because the values transformation and count action cannot be
performed inside of the rdd1.map transformation. For more information,
see SPARK-5063.
I get the same error with a custom sink that extends ForeachWriter, which was a bit unexpected:
import org.apache.spark.sql.ForeachWriter
import java.util.Properties
import kafkashaded.org.apache.kafka.clients.producer._
class customSink(topic:String, servers:String) extends ForeachWriter[(org.apache.spark.sql.Row)] {
val kafkaProperties = new Properties()
kafkaProperties.put("bootstrap.servers", servers)
kafkaProperties.put("key.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
kafkaProperties.put("value.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
val results = new scala.collection.mutable.HashMap[String, String]
var producer: KafkaProducer[String, String] = _
def open(partitionId: Long,version: Long): Boolean = {
producer = new KafkaProducer(kafkaProperties)
true
}
def process(value: (org.apache.spark.sql.Row)): Unit = {
// Default to NaN; overwritten only when a model exists for the key
var prediction = Double.NaN
try {
val id1 = value(0)
val id2 = value(3)
val id3 = value(5)
val time_0 = value(6).asInstanceOf[Double]
val key = f"$id1/$id2/$id3"
val model = modelsMap(key)
println(s"Looking up key: $key")
// Assign to the outer variable; redeclaring it here would shadow it
prediction = model.estimate(Array[Double](time_0))(0)
println(prediction)
} catch {
case e: NoSuchElementException =>
// No model for this key, so prediction stays NaN
println(prediction)
}
producer.send(new ProducerRecord(topic, value.mkString(",")+","+prediction.toString))
}
def close(errorOrNull: Throwable): Unit = {
producer.close()
}
}
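import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
// Arguments follow the constructor order: (topic, servers)
val writer = new customSink("<topic>", "<broker>")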
val query = streamingData
.writeStream
.foreach(writer)
.outputMode("update")
.trigger(Trigger.ProcessingTime(10.seconds))
.start()
model.estimate is implemented under the hood using aggregate in mllib.stat, and there's no way to get around it.
What changes do I make? (I could collect each batch and execute a for loop using driver, but then I'm not using spark the way it's intended)
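One possible direction, sketched under assumptions: because model.estimate needs an RDD, the raw samples and bandwidth for each key could be kept as plain arrays and the density computed locally inside process or a UDF. The code below is not the MLlib implementation. It is a hand-written kernel density estimate that assumes a Gaussian kernel with the bandwidth used as its standard deviation, and LocalKde, samples, and bandwidth are hypothetical names.

```scala
object LocalKde {
  /** Gaussian kernel density estimate at point x.
    *
    * samples   - the observations the density is built from (assumed to fit in memory)
    * bandwidth - standard deviation of each Gaussian kernel (assumption, see lead-in)
    */
  def estimate(samples: Array[Double], bandwidth: Double, x: Double): Double = {
    require(samples.nonEmpty, "need at least one sample")
    require(bandwidth > 0, "bandwidth must be positive")
    // Normalisation: 1 / (n * h * sqrt(2 * pi))
    val norm = 1.0 / (math.sqrt(2 * math.Pi) * bandwidth * samples.length)
    // Sum the unnormalised Gaussian kernel contribution of every sample.
    val kernelSum = samples.map { s =>
      val z = (x - s) / bandwidth
      math.exp(-0.5 * z * z)
    }.sum
    kernelSum * norm
  }
}
```

This only makes sense when each key's sample array is small enough to ship to the executors, for example through a broadcast variable. Whether it matches the MLlib result to the last digit has not been verified here.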
References:
https://www.slideshare.net/databricks/realtime-machine-learning-analytics-using-structured-streaming-and-kinesis-firehose slide#11 mentions limitations
https://www.oreilly.com/learning/extend-structured-streaming-for-spark-ml
https://github.com/holdenk/spark-structured-streaming-ml (proposed solution)
https://issues.apache.org/jira/browse/SPARK-16454
https://issues.apache.org/jira/browse/SPARK-16407

Register a dynamically named dataframe as a temp table in Spark

I am trying to register temp tables from dynamically chosen dataframes.
The statements I generate come out as strings, and I am not sure if there is a way to execute such a string, or convert it to a dataframe, so that the temp table can be created.
Here are the steps to replicate this issue :
import org.apache.spark.sql._
val contact_df = sc.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
val acct_df = sc.makeRDD(1 to 5).map(i => (i, i / i)).toDF("value", "devide")
val dataframeJoins = Array(
Row("x","","","" ,"Y","",1,"contact_hotline_df","contact_df","acct_nbr","hotline_df","tm49_acct_nbr"),
Row("x","","","","Y","",2,"contact_hotline_acct_df","acct_df","tm06_acct_nbr" ,"contact_hotline_df","acct_nbr")
)
val dfJoinbroadcast = sc.broadcast(dataframeJoins)
val DFJoins1 = for ( row <- dfJoinbroadcast.value ) yield {
(row(8)+".registerTempTable(\""+row(8)+"\")" )
}
for (rows <- 0 until DFJoins1.size ){
println(DFJoins1(rows) )
DFJoins1(rows)
}
Here is the output of the above for loop :
contact_df.registerTempTable("contact_df")
acct_df.registerTempTable("acct_df")
I am not getting any error, but the table is not getting created.
When I run sqlContext.sql("select * from contact_df"), I get an error saying the table has not been created.
Is there a way to convert a string to a dataframe and execute it to create the temp table?
Please suggest.
Thanks,
Sreehari
Your code only concatenates strings and prints the result. The registerTempTable method is never called, which is why you can't use the table in the SQL query. Try this instead:
// assuming we have this string to object mapping
val tableNameToDf = Map("contact_df" -> contact_df, "acct_df" -> acct_df)
You could restructure your for loop into something like:
val dfJoins = for (row <- dfJoinbroadcast.value) yield {
// row(8) is typed Any, so read the table name as a String
val wannabeTable = row.getString(8)
tableNameToDf(wannabeTable).createOrReplaceTempView(wannabeTable)
wannabeTable
}

Aggregate for mode(most common element) in spark dataframes

In Spark I am using a library for which I am supposed to provide the aggregates and the library then does a series of joins/groupby's and calls the aggregate at the end. I am trying to avoid violating encapsulation (although I can if necessary), and just call this method with an aggregate (traditionally sum or min etc.)
In this case I am trying to run mode, however, which I am not sure of how to run in an aggregate.
Here's a Spark (2.1.0) UDAF to calculate the Statistical Mode for a given column:
package org.anish.spark.mostcommonvalue
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import scalaz.Scalaz._
/**
* Spark User Defined Aggregate Function to calculate the most frequent value in a column. This is similar to
* Statistical Mode. When two or more values are tied for most frequent, this function returns an arbitrary one of
* them, whereas a strict statistical mode would report all of the tied values.
*
* Usage:
*
* DataFrame / DataSet DSL
* val mostCommonValue = new MostCommonValue
* df.groupBy("group_id").agg(mostCommonValue(col("mode_column")), mostCommonValue(col("city")))
*
* Spark SQL:
* sqlContext.udf.register("mode", new MostCommonValue)
* %sql
* -- Use a group_by statement and call the UDAF.
* select group_id, mode(id) from table group by group_id
*
* Reference: https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
*
* Created by anish on 26/05/17.
*/
class MostCommonValue extends UserDefinedAggregateFunction {
// This is the input fields for your aggregate function.
// We use StringType, because Mode can also be meaningfully applied on nominal data
override def inputSchema: StructType =
StructType(StructField("value", StringType) :: Nil)
// This is the internal fields you keep for computing your aggregate.
// We store the frequency of all the distinct element we encounter for the given attribute in this HashMap
override def bufferSchema: StructType = StructType(
StructField("frequencyMap", DataTypes.createMapType(StringType, LongType)) :: Nil
)
// This is the output type of your aggregation function.
override def dataType: DataType = StringType
override def deterministic: Boolean = true
// This is the initial value for the buffer schema.
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = Map[String, Long]()
}
// This is how to update your buffer schema given an input.
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer(0) = buffer.getAs[Map[String, Long]](0) |+| Map(input.getAs[String](0) -> 1L)
}
// This is how you merge two objects with the bufferSchema type.
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1(0) = buffer1.getAs[Map[String, Long]](0) |+| buffer2.getAs[Map[String, Long]](0)
}
// This is where you output the final value, given the final value of your bufferSchema.
override def evaluate(buffer: Row): String = {
buffer.getAs[Map[String, Long]](0).maxBy(_._2)._1
}
}
credit/source:
https://gist.github.com/anish749/6a815ed281f538068a0d3a20ca9044fa
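The three UDAF callbacks above boil down to operations on a frequency map. The sketch below restates them in plain Scala without scalaz or Spark, to show what the |+| merge and maxBy steps compute. ModeSketch and its method names are hypothetical, and returning an Option for an empty buffer is a choice made here, not something the UDAF above does.

```scala
object ModeSketch {
  type Freq = Map[String, Long]

  // Counterpart of update: count one more occurrence of v.
  def update(buf: Freq, v: String): Freq =
    buf.updated(v, buf.getOrElse(v, 0L) + 1L)

  // Counterpart of merge: add the counts of two partial buffers key by key.
  def merge(a: Freq, b: Freq): Freq =
    b.foldLeft(a) { case (acc, (k, n)) => acc.updated(k, acc.getOrElse(k, 0L) + n) }

  // Counterpart of evaluate: the key with the highest count, or None for an empty buffer.
  def evaluate(buf: Freq): Option[String] =
    if (buf.isEmpty) None else Some(buf.maxBy(_._2)._1)
}
```

Each partition would build its own map with update, the partial maps would be combined with merge, and evaluate would run once on the final map.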

How to build a lookup map in Spark Streaming?

What is the best way to maintain application state in a spark streaming application?
I know of two ways:
- use the "union" operation to append to the lookup RDD and persist it after each union.
- save the state in a file or database and load it at the start of each batch.
From a performance perspective, which one is better? Also, is there a better way to do this?
You should really be using mapWithState(spec: StateSpec[K, V, StateType, MappedType]) as follows:
import org.apache.spark.streaming.{ StreamingContext, Seconds }
val ssc = new StreamingContext(sc, batchDuration = Seconds(5))
// checkpointing is mandatory
ssc.checkpoint("_checkpoints")
val rdd = sc.parallelize(0 to 9).map(n => (n, (n % 2).toString))
import org.apache.spark.streaming.dstream.ConstantInputDStream
val sessions = new ConstantInputDStream(ssc, rdd)
import org.apache.spark.streaming.{State, StateSpec, Time}
val updateState = (batchTime: Time, key: Int, value: Option[String], state: State[Int]) => {
println(s">>> batchTime = $batchTime")
println(s">>> key = $key")
println(s">>> value = $value")
println(s">>> state = $state")
val sum = value.getOrElse("").size + state.getOption.getOrElse(0)
state.update(sum)
Some((key, value, sum)) // mapped value
}
val spec = StateSpec.function(updateState)
val mappedStatefulStream = sessions.mapWithState(spec)
mappedStatefulStream.print()
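To see what the state function above accumulates across batches, the same per-key logic can be replayed on a plain Map without a StreamingContext. This is only a sketch of the bookkeeping, not of how Spark stores or checkpoints state. MapWithStateSketch and step are hypothetical names.

```scala
object MapWithStateSketch {
  /** Applies one batch of (key, value) records to the current state.
    * Mirrors updateState above: the new state is the old sum plus the value's length.
    * Returns the new state and the mapped records (key, value, sum).
    */
  def step(state: Map[Int, Int], batch: Seq[(Int, String)]): (Map[Int, Int], Vector[(Int, String, Int)]) =
    batch.foldLeft((state, Vector.empty[(Int, String, Int)])) {
      case ((st, out), (key, value)) =>
        val sum = value.length + st.getOrElse(key, 0)
        (st.updated(key, sum), out :+ ((key, value, sum)))
    }

  def main(args: Array[String]): Unit = {
    // Same records as the ConstantInputDStream example: keys 0..9, values "0" or "1".
    val batch = (0 to 9).map(n => (n, (n % 2).toString))
    val (afterFirst, _) = step(Map.empty, batch)
    val (afterSecond, mapped) = step(afterFirst, batch)
    mapped.foreach(println)
    println(afterSecond)
  }
}
```

Because the input stream repeats the same records every batch, each key's sum grows by the length of its value, which is 1 here, per batch.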
