The programming guide says that Structured Streaming guarantees end-to-end exactly-once semantics with appropriate sources/sinks.
However, I don't understand how this works when the job crashes and a watermark is applied.
Below is an example of how I currently imagine it working, please correct me on any points that I'm misunderstanding. Thanks in advance!
Example:
Spark Job: Count # events in each 1 hour window, with a 1 hour Watermark.
Messages:
A - timestamp 10am
B - timestamp 10:10am
C - timestamp 10:20am
X - timestamp 12pm
Y - timestamp 12:50pm
Z - timestamp 8pm
We start the job, read A, B, C from the Source and the job crashes at 10:30am before we've written them out to our Sink.
At 6pm the job comes back up and knows to re-process A, B, C using the saved checkpoint/WAL. The final count is 3 for the 10-11am window.
Next, it reads the new messages from Kafka, X, Y, Z in parallel since they belong to different partitions. Z is processed first, so the max event timestamp gets set to 8pm. When the job reads X and Y, they are now behind the watermark (8pm - 1 hour = 7pm), so they are discarded as old data. The final count is 1 for 8-9pm, and the job does not report anything for the 12-1pm window. We've lost data for X and Y.
---End example---
Is this scenario accurate?
If so, the 1 hour watermark may be sufficient to handle late/out-of-order data when it flows normally from Kafka to Spark, but not when the Spark job goes down or the Kafka connection is lost for a long period of time. Would the only option to avoid data loss be to use a watermark longer than you ever expect the job to be down for?
The watermark is a fixed value during the minibatch. In your example, since X, Y and Z are processed in the same minibatch, the watermark used for these records would be 9:20am. After that minibatch completes, the watermark is updated to 7pm.
Below is a quote from the design doc for SPARK-18124, the feature that implements the watermarking functionality:
To calculate the drop boundary in our trigger-based execution, we have to do the following:
- In every trigger, while aggregating the data, we also scan for the max value of event time in the trigger data.
- After the trigger completes, compute watermark = MAX(event time before trigger, max event time in trigger) - threshold.
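Applied to the example from the question, that rule plays out as follows (a rough sketch of the bookkeeping, not the actual Spark internals):
import java.lang.Math

// Hypothetical helper mirroring the rule above; thresholdMs is the watermark delay.
def nextWatermark(maxEventTimeBeforeTriggerMs: Long, maxEventTimeInTriggerMs: Long, thresholdMs: Long): Long =
  Math.max(maxEventTimeBeforeTriggerMs, maxEventTimeInTriggerMs) - thresholdMs

// Batch 0 processes A, B, C with max event time 10:20,
// so the watermark for batch 1 becomes 10:20 - 1 hour = 09:20.
// Batch 1 processes X, Y, Z against that 09:20 watermark (so X and Y are kept);
// its max event time is 20:00, so the watermark for batch 2 becomes 19:00.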
Perhaps a simulation is more descriptive:
import org.apache.hadoop.fs.Path
import java.sql.Timestamp
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.window   // auto-imported in spark-shell
import org.apache.spark.sql.streaming.ProcessingTime
import spark.implicits._                       // auto-imported in spark-shell (for $"...")

val dir = new Path("/tmp/test-structured-streaming")
val fs = dir.getFileSystem(sc.hadoopConfiguration)
fs.mkdirs(dir)

val schema = StructType(StructField("value", StringType) ::
  StructField("timestamp", TimestampType) ::
  Nil)

val eventStream = spark
  .readStream
  .option("sep", ";")
  .option("header", "false")
  .schema(schema)
  .csv(dir.toString)

// Watermarked aggregation
val eventsCount = eventStream
  .withWatermark("timestamp", "1 hour")
  .groupBy(window($"timestamp", "1 hour"))
  .count

def writeFile(path: Path, data: String): Unit = {
  val file = fs.create(path)
  file.writeUTF(data)
  file.close()
}

// Debug query
val query = eventsCount.writeStream
  .format("console")
  .outputMode("complete")
  .option("truncate", "false")
  .trigger(ProcessingTime("5 seconds"))
  .start()
writeFile(new Path(dir, "file1"), """
|A;2017-08-09 10:00:00
|B;2017-08-09 10:10:00
|C;2017-08-09 10:20:00""".stripMargin)
query.processAllAvailable()
val lp1 = query.lastProgress
// -------------------------------------------
// Batch: 0
// -------------------------------------------
// +---------------------------------------------+-----+
// |window |count|
// +---------------------------------------------+-----+
// |[2017-08-09 10:00:00.0,2017-08-09 11:00:00.0]|3 |
// +---------------------------------------------+-----+
// lp1: org.apache.spark.sql.streaming.StreamingQueryProgress =
// {
// ...
// "numInputRows" : 3,
// "eventTime" : {
// "avg" : "2017-08-09T10:10:00.000Z",
// "max" : "2017-08-09T10:20:00.000Z",
// "min" : "2017-08-09T10:00:00.000Z",
// "watermark" : "1970-01-01T00:00:00.000Z"
// },
// ...
// }
writeFile(new Path(dir, "file2"), """
|Z;2017-08-09 20:00:00
|X;2017-08-09 12:00:00
|Y;2017-08-09 12:50:00""".stripMargin)
query.processAllAvailable()
val lp2 = query.lastProgress
// -------------------------------------------
// Batch: 1
// -------------------------------------------
// +---------------------------------------------+-----+
// |window |count|
// +---------------------------------------------+-----+
// |[2017-08-09 10:00:00.0,2017-08-09 11:00:00.0]|3 |
// |[2017-08-09 12:00:00.0,2017-08-09 13:00:00.0]|2 |
// |[2017-08-09 20:00:00.0,2017-08-09 21:00:00.0]|1 |
// +---------------------------------------------+-----+
// lp2: org.apache.spark.sql.streaming.StreamingQueryProgress =
// {
// ...
// "numInputRows" : 3,
// "eventTime" : {
// "avg" : "2017-08-09T14:56:40.000Z",
// "max" : "2017-08-09T20:00:00.000Z",
// "min" : "2017-08-09T12:00:00.000Z",
// "watermark" : "2017-08-09T09:20:00.000Z"
// },
// "stateOperators" : [ {
// "numRowsTotal" : 3,
// "numRowsUpdated" : 2
// } ],
// ...
// }
writeFile(new Path(dir, "file3"), "")
query.processAllAvailable()
val lp3 = query.lastProgress
// -------------------------------------------
// Batch: 2
// -------------------------------------------
// +---------------------------------------------+-----+
// |window |count|
// +---------------------------------------------+-----+
// |[2017-08-09 10:00:00.0,2017-08-09 11:00:00.0]|3 |
// |[2017-08-09 12:00:00.0,2017-08-09 13:00:00.0]|2 |
// |[2017-08-09 20:00:00.0,2017-08-09 21:00:00.0]|1 |
// +---------------------------------------------+-----+
// lp3: org.apache.spark.sql.streaming.StreamingQueryProgress =
// {
// ...
// "numInputRows" : 0,
// "eventTime" : {
// "watermark" : "2017-08-09T19:00:00.000Z"
// },
// "stateOperators" : [ ],
// ...
// }
query.stop()
fs.delete(dir, true)
Notice how Batch 0 started with watermark 1970-01-01 00:00:00 while Batch 1 started with watermark 2017-08-09 09:20:00 (max event time of Batch 0 minus 1 hour). Batch 2, while empty, used watermark 2017-08-09 19:00:00.
Z is processed first, so the max event timestamp gets set to 8pm.
That's correct. Even though Z may be processed first, the watermark is subtracted from the maximum timestamp seen in the current query iteration. This means 08:00 PM is taken as the time the watermark threshold is subtracted from, so 12:00 and 12:50 would be discarded.
From the documentation:
For a specific window starting at time T, the engine will maintain state and allow late data to update the state until (max event time seen by the engine - late threshold > T)
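Plugging the question's numbers into that rule gives a quick sanity check (plain arithmetic, not Spark code):

// Once Z (8pm) has been seen and the watermark updated at the end of that trigger:
val maxEventTimeHours = 20.0   // Z at 8pm
val lateThresholdHours = 1.0   // the 1 hour watermark
val windowStartHours = 12.0    // the 12-1pm window
val dropsLateData = maxEventTimeHours - lateThresholdHours > windowStartHours  // 19 > 12, true
// ...but only from the *next* batch onwards, because the watermark is only
// updated after the trigger that processed X, Y and Z completes.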
Would the only option to avoid data loss be to use a watermark longer than you expect the job to ever go down for?
Not necessarily. Let's assume you cap the amount of data read per Kafka query at 100 items. If you read small batches, and you read serially from each partition, the maximum timestamp of each batch may not be the timestamp of the latest message in the broker, meaning you won't lose these messages.
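In Structured Streaming's Kafka source, that kind of cap is usually expressed with maxOffsetsPerTrigger. A minimal sketch, assuming a hypothetical topic name and broker address:

// Cap each micro-batch at 100 offsets across all partitions, so a long outage is
// drained in many small batches rather than one huge one; topic/servers are placeholders.
val events = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "100")
  .load()

Within each of those smaller batches the watermark is still fixed for the whole batch, so draining a long outage in many small batches lets the watermark advance gradually instead of jumping straight to the newest data.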
Related
I have a stream of data coming from Kafka; call it SourceStream.
I have another stream whose individual values are Spark SQL queries, each paired with a window size.
I want those queries to be applied to the SourceStream data, and the results of the queries passed to the sink.
E.g.:
Source Stream
Id type timestamp user amount
------- ------ ---------- ---------- --------
uuid1 A 342342 ME 10.0
uuid2 B 234231 YOU 120.10
uuid3 A 234234 SOMEBODY 23.12
uuid4 A 234233 WHO 243.1
uuid5 C 124555 IT 35.12
...
....
Query Stream
Id window query
------- ------ ------
uuid13 1 hour select 'uuid13' as u, max(amount) as output from df where type = 'A' group by ..
uuid21 5 minute select 'uuid121' as u, count(1) as output from df where amount > 100 group by ..
uuid321 1 day select 'uuid321' as u, sum(amount) as output from df where amount > 100 group by ..
...
....
Each query in the query stream would be applied to the source stream's incoming data over the window specified with that query, and the output would be sent to the sink.
How can I implement this with Spark?
A good feature of Spark Structured Streaming is that it can join a static dataframe with a streaming dataframe. To give an example: users is a static dataframe read from a database, and transactionStream comes from a stream. By joining them, we can get the spending of each country, accumulated as new batches arrive.
val spendingByCountry = (transactionStream
  .join(users, users("id") === transactionStream("userid"))
  .groupBy($"country")
  .agg(sum($"cost") as "spending"))

spendingByCountry.writeStream
  .outputMode("complete")
  .format("console")
  .start()
The sum of cost is accumulated as new batches arrive, as shown below.
-------------------------------
Batch: 0
-------------------------------
Country Spending
EN 90.0
FR 50.0
-------------------------------
Batch: 1
-------------------------------
Country Spending
EN 190.0
FR 150.0
If I want to introduce notification and reset logic on top of the example above, what would be the correct approach? The requirement is that if the spending is larger than some threshold, the country and spending records should be stored into a table and the spending should be reset to 0 to accumulate again.
One approach to achieve this is arbitrary stateful processing. The groupBy can be enhanced with a custom function, mapGroupsWithState, where you maintain all the business logic needed. Here is an example taken from the Spark docs:
// A mapping function that maintains an integer state for string keys and returns a string.
// Additionally, it sets a timeout to remove the state if it has not received data for an hour.
def mappingFunction(key: String, value: Iterator[Int], state: GroupState[Int]): String = {

  if (state.hasTimedOut) {                // If called when timing out, remove the state
    state.remove()

  } else if (state.exists) {              // If state exists, use it for processing
    val existingState = state.get         // Get the existing state
    val shouldRemove = ...                // Decide whether to remove the state
    if (shouldRemove) {
      state.remove()                      // Remove the state

    } else {
      val newState = ...
      state.update(newState)              // Set the new state
      state.setTimeoutDuration("1 hour")  // Set the timeout
    }

  } else {
    val initialState = ...
    state.update(initialState)            // Set the initial state
    state.setTimeoutDuration("1 hour")    // Set the timeout
  }

  ...                                     // return something
}

dataset
  .groupByKey(...)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(mappingFunction)
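Applied to the spending example above, a minimal sketch could look like the following. It uses flatMapGroupsWithState (a close relative of mapGroupsWithState that can emit zero or more rows per group) and assumes the transactionStream/users dataframes from the earlier example plus a SparkSession named spark; the Spending/Alert case classes and the 1000.0 threshold are made up for illustration:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._

// Hypothetical types for this sketch.
case class Spending(country: String, cost: Double)
case class Alert(country: String, spending: Double)

// Accumulate spending per country; once it crosses the threshold, emit an alert
// (which the sink can store in a table) and reset the accumulator to 0.
def trackSpending(threshold: Double)(
    country: String,
    rows: Iterator[Spending],
    state: GroupState[Double]): Iterator[Alert] = {
  val total = state.getOption.getOrElse(0.0) + rows.map(_.cost).sum
  if (total >= threshold) {
    state.update(0.0)                  // reset to accumulate again
    Iterator(Alert(country, total))    // notification record
  } else {
    state.update(total)
    Iterator.empty
  }
}

val alerts = transactionStream
  .join(users, users("id") === transactionStream("userid"))
  .select($"country", $"cost").as[Spending]
  .groupByKey(_.country)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(trackSpending(1000.0))

alerts.writeStream.outputMode("update").format("console").start()

The state here is just the running total per country, so the "reset" is simply state.update(0.0) at the moment the alert row is emitted; that row can then be routed to whatever table sink you use.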
I've been playing around with Spark Structured Streaming and mapGroupsWithState (specifically following the StructuredSessionization example in the Spark source). I want to confirm some limitations I believe exist with mapGroupsWithState, given my use case.
A session for my purposes is a group of uninterrupted activity for a user such that no two chronologically ordered (by event time, not processing time) events are separated by more than some developer-defined duration (30 minutes is common).
An example will help before jumping into code:
{"event_time": "2018-01-01T00:00:00", "user_id": "mike"}
{"event_time": "2018-01-01T00:01:00", "user_id": "mike"}
{"event_time": "2018-01-01T00:05:00", "user_id": "mike"}
{"event_time": "2018-01-01T00:45:00", "user_id": "mike"}
For the stream above, a session is defined with a 30 minute period of inactivity. In a streaming context, we should end up with one session (the second has yet to complete):
[
{
"user_id": "mike",
"startTimestamp": "2018-01-01T00:00:00",
"endTimestamp": "2018-01-01T00:05:00"
}
]
Now consider the following Spark driver program:
import java.sql.Timestamp
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
object StructuredSessionizationV2 {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[2]")
      .appName("StructuredSessionizationRedux")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    import spark.implicits._
    implicit val ctx = spark.sqlContext

    val input = MemoryStream[String]

    val EVENT_SCHEMA = new StructType()
      .add($"event_time".string)
      .add($"user_id".string)

    val events = input.toDS()
      .select(from_json($"value", EVENT_SCHEMA).alias("json"))
      .select($"json.*")
      .withColumn("event_time", to_timestamp($"event_time"))
      .withWatermark("event_time", "1 hours")
    events.printSchema()

    val sessionized = events
      .groupByKey(row => row.getAs[String]("user_id"))
      .mapGroupsWithState[SessionState, SessionOutput](GroupStateTimeout.EventTimeTimeout) {
        case (userId: String, events: Iterator[Row], state: GroupState[SessionState]) =>
          println(s"state update for user ${userId} (current watermark: ${new Timestamp(state.getCurrentWatermarkMs())})")
          if (state.hasTimedOut) {
            println(s"User ${userId} has timed out, sending final output.")
            val finalOutput = SessionOutput(
              userId = userId,
              startTimestampMs = state.get.startTimestampMs,
              endTimestampMs = state.get.endTimestampMs,
              durationMs = state.get.durationMs,
              expired = true
            )
            // Drop this user's state
            state.remove()
            finalOutput
          } else {
            val timestamps = events.map(_.getAs[Timestamp]("event_time").getTime).toSeq
            println(s"User ${userId} has new events (min: ${new Timestamp(timestamps.min)}, max: ${new Timestamp(timestamps.max)}).")
            val newState = if (state.exists) {
              println(s"User ${userId} has existing state.")
              val oldState = state.get
              SessionState(
                startTimestampMs = math.min(oldState.startTimestampMs, timestamps.min),
                endTimestampMs = math.max(oldState.endTimestampMs, timestamps.max)
              )
            } else {
              println(s"User ${userId} has no existing state.")
              SessionState(
                startTimestampMs = timestamps.min,
                endTimestampMs = timestamps.max
              )
            }
            state.update(newState)
            state.setTimeoutTimestamp(newState.endTimestampMs, "30 minutes")
            println(s"User ${userId} state updated. Timeout now set to ${new Timestamp(newState.endTimestampMs + (30 * 60 * 1000))}")
            SessionOutput(
              userId = userId,
              startTimestampMs = state.get.startTimestampMs,
              endTimestampMs = state.get.endTimestampMs,
              durationMs = state.get.durationMs,
              expired = false
            )
          }
      }

    val eventsQuery = sessionized
      .writeStream
      .queryName("events")
      .outputMode("update")
      .format("console")
      .start()

    input.addData(
      """{"event_time": "2018-01-01T00:00:00", "user_id": "mike"}""",
      """{"event_time": "2018-01-01T00:01:00", "user_id": "mike"}""",
      """{"event_time": "2018-01-01T00:05:00", "user_id": "mike"}"""
    )
    input.addData(
      """{"event_time": "2018-01-01T00:45:00", "user_id": "mike"}"""
    )
    eventsQuery.processAllAvailable()
  }

  case class SessionState(startTimestampMs: Long, endTimestampMs: Long) {
    def durationMs: Long = endTimestampMs - startTimestampMs
  }

  case class SessionOutput(userId: String, startTimestampMs: Long, endTimestampMs: Long, durationMs: Long, expired: Boolean)
}
Output of that program is:
root
|-- event_time: timestamp (nullable = true)
|-- user_id: string (nullable = true)
state update for user mike (current watermark: 1969-12-31 19:00:00.0)
User mike has new events (min: 2018-01-01 00:00:00.0, max: 2018-01-01 00:05:00.0).
User mike has no existing state.
User mike state updated. Timeout now set to 2018-01-01 00:35:00.0
-------------------------------------------
Batch: 0
-------------------------------------------
+------+----------------+--------------+----------+-------+
|userId|startTimestampMs|endTimestampMs|durationMs|expired|
+------+----------------+--------------+----------+-------+
| mike| 1514782800000| 1514783100000| 300000| false|
+------+----------------+--------------+----------+-------+
state update for user mike (current watermark: 2017-12-31 23:05:00.0)
User mike has new events (min: 2018-01-01 00:45:00.0, max: 2018-01-01 00:45:00.0).
User mike has existing state.
User mike state updated. Timeout now set to 2018-01-01 01:15:00.0
-------------------------------------------
Batch: 1
-------------------------------------------
+------+----------------+--------------+----------+-------+
|userId|startTimestampMs|endTimestampMs|durationMs|expired|
+------+----------------+--------------+----------+-------+
| mike| 1514782800000| 1514785500000| 2700000| false|
+------+----------------+--------------+----------+-------+
Given my session definition, the single event in the second batch should trigger an expiry of the session state and thus a new session. However, since the watermark (2017-12-31 23:05:00.0) has not passed the state's timeout (2018-01-01 00:35:00.0), the state isn't expired and the event is erroneously added to the existing session, despite the fact that more than 30 minutes have passed since the latest timestamp in the previous batch.
I think the only way for session state expiration to work as I'm hoping is if enough events from different users were received within the batch to advance the watermark past the state timeout for mike.
I suppose one could also mess with the stream's watermark, but I can't think of how I'd do that to accomplish my use case.
Is this accurate? Am I missing anything in how to properly do event time-based sessionization in Spark?
The implementation you have provided does not seem to work if the watermark interval is greater than the session gap duration.
For the logic you have shown to work, you need to set the watermark interval to < 30 mins.
If you really want the watermark interval to be independent of (or more than) the session gap duration, you need to wait until the watermark passes (watermark + gap) to expire the state. The merging logic seems to blindly merge the windows. This should take the gap duration into account before merging.
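A rough illustration of that last point, reusing the SessionState case class and the timestamps/state values from the question's code (a sketch of the merge check only, not a complete fix):

// Decide, before merging, whether the incoming events belong to the existing
// session or start a new one, instead of blindly extending the session.
val gapMs = 30 * 60 * 1000L

def mergeOrStartNew(existing: Option[SessionState], minTs: Long, maxTs: Long): SessionState =
  existing match {
    case Some(old) if minTs - old.endTimestampMs <= gapMs =>
      // Within the gap: extend the existing session.
      SessionState(
        startTimestampMs = math.min(old.startTimestampMs, minTs),
        endTimestampMs = math.max(old.endTimestampMs, maxTs))
    case _ =>
      // More than one gap after the previous session ended (or no state yet):
      // this starts a new session, and the old one should be emitted/closed here.
      SessionState(startTimestampMs = minTs, endTimestampMs = maxTs)
  }

// Inside the state function:
// val newState = mergeOrStartNew(state.getOption, timestamps.min, timestamps.max)

Emitting the closed old session at that point is awkward with mapGroupsWithState's single return value, which is why the next answer switches to flatMapGroupsWithState.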
EDIT: I think I need to answer the specific point of the original question instead of providing a full resolution.
To add to Arun's answer: the state function of map/flatMapGroupsWithState is called with events first, and then called with timed-out states. Given how this works, your code will reset the timeout even in a batch where the state should be timed out.
So while you can leverage the timeout feature to have the state function called even when the events don't contain the key, you still need to deal with the current watermark manually. That's why I set the timeout to the earliest session's end timestamp, and handle all evictions once the function is called.
——
You can refer to the code linked below to see how to achieve a session window with event time & watermark via flatMapGroupsWithState.
NOTE: I didn't clean up the code, and it tries to support both output modes, so once you decide on an output mode you can remove the unrelated code to make it simpler.
EDIT2: I had a wrong assumption regarding flatMapGroupsWithState; events are not guaranteed to be sorted.
Just updated the code: https://gist.github.com/HeartSaVioR/9a3aeeef0f1d8ee97516743308b14cd6#file-eventtimesessionwindowimplementationviaflatmapgroupswithstate-scala-L32-L189
As of Spark 3.2.0, Spark supports session windows natively.
https://databricks.com/blog/2021/10/12/native-support-of-session-window-in-spark-structured-streaming.html
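With the native API, the hand-rolled state management above collapses to roughly the following (a sketch against the watermarked events Dataset from the question; requires Spark 3.2+):

import org.apache.spark.sql.functions.{col, session_window}

// One row per (user, session), where a session closes after 30 minutes of
// event-time inactivity.
val sessions = events
  .groupBy(session_window(col("event_time"), "30 minutes"), col("user_id"))
  .count()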
My architecture:
1 EventHub with 8 Partitions & 2 TPUs
1 Streaming Analytics Job
6 Windows based on the same input (from 1mn to 6mn)
Sample Data:
{side: 'BUY', ticker: 'MSFT', qty: 1, price: 123, tradeTimestamp: 10000000000}
{side: 'SELL', ticker: 'MSFT', qty: 1, price: 124, tradeTimestamp:1000000000}
The EventHub PartitionKey is ticker
I would like to emit, every second, the following data:
(total quantity bought / total quantity sold) over the last minute, the last 2 minutes, the last 3 minutes, and so on
What I tried:
WITH TradesWindow AS (
SELECT
windowEnd = System.Timestamp,
ticker,
side,
totalQty = SUM(qty)
FROM [Trades-Stream] TIMESTAMP BY tradeTimestamp PARTITION BY PartitionId
GROUP BY ticker, side, PartitionId, HoppingWindow(second, 60, 1)
),
TradesRatio1MN AS (
SELECT
ticker = b.ticker,
buySellRatio = b.totalQty / s.totalQty
FROM TradesWindow b /* SHOULD I PARTITION HERE TOO ? */
JOIN TradesWindow s /* SHOULD I PARTITION HERE TOO ? */
ON s.ticker = b.ticker AND s.side = 'SELL'
AND DATEDIFF(second, b, s) BETWEEN 0 AND 1
WHERE b.side = 'BUY'
)
/* .... More windows.... */
/* FINAL OUTPUT: Joining all the windows */
SELECT
buySellRatio1MN = bs1.buySellRatio,
buySellRatio2MN = bs2.buySellRatio
/* more windows */
INTO [output]
FROM buySellRatio1MN bs1 /* SHOULD I PARTITION HERE TOO ? */
JOIN buySellRatio2MN bs2 /* SHOULD I PARTITION HERE TOO ? */
ON bs2.ticker = bs1.ticker
AND DATEDIFF(second, bs1, bs2) BETWEEN 0 AND 1
Issues:
This requires 6 EventHub consumer groups (each one can only have 5 readers); why? I don't have 5x6 SELECT statements on the input, so why?
The output doesn't seem consistent (I don't know if my JOINs are correct).
Sometimes the job doesn't output at all (maybe a partitioning problem? See the comments in the code about partitioning).
Briefly, is there a better way to achieve this? I couldn't find anything in the docs and examples about having multiple windows, joining them, and then joining the results of those joins, all from a single input.
For the first question, this depends on the internal implementation of the scale-out logic. See the details here.
For the output of the join: I don't see the whole query, but if you join a query using a 1 minute window with a query using a 2 minute window and a 1 s time "buffer", you will only get an output every 2 minutes. A UNION operator would be better for this.
From your sample and your goal, I think there is a much easier way to write this query using a UDA (User Defined Aggregate).
For this I will define a UDA function called "ratio" first:
function main() {
    this.init = function () {
        this.sumSell = 0.0;
        this.sumBuy = 0.0;
    }

    this.accumulate = function (value, timestamp) {
        if (value.side == "BUY") { this.sumBuy += value.qty };
        if (value.side == "SELL") { this.sumSell += value.qty };
    }

    this.computeResult = function () {
        if (this.sumSell == 0) {
            result = 0;
        }
        else {
            result = this.sumBuy / this.sumSell;
        }
        return result;
    }
}
Then I can simply use this SQL query for a 60-second window:
SELECT
windowEnd = System.Timestamp,
ticker,
uda.ratio(iothub) as ratio
FROM iothub PARTITION BY PartitionId
GROUP BY ticker, PartitionId, SlidingWindow(second, 60)
I am streaming data from Kafka and trying to limit the number of events per batch to 10. After processing 10-15 batches, there is a sudden spike in the batch size. Below are my settings:
spark.streaming.kafka.maxRatePerPartition=1
spark.streaming.backpressure.enabled=true
spark.streaming.backpressure.pid.minRate=1
spark.streaming.receiver.maxRate=2
Please check this image for the streaming behavior
This is a bug in Spark; please refer to: https://issues.apache.org/jira/browse/SPARK-18371
The pull request isn't merged yet, but you can pick it up and build Spark on your own.
To summarize the issue:
If you have spark.streaming.backpressure.pid.minRate set to a number <= the partition count, then an effective rate of 0 is calculated:
val totalLag = lagPerPartition.values.sum
...
val backpressureRate = Math.round(lag / totalLag.toFloat * rate)
...
(the second line calculates the rate per partition, where rate is the rate coming from the PID estimator; it defaults to minRate when the PID calculates a smaller value)
As here: DirectKafkaInputDStream code
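To see why this rounds down to zero, plug in some hypothetical numbers (minRate = 1 and four equally lagged partitions):

// With rate = minRate = 1 and 4 equally lagged partitions, every partition
// gets round(0.25) = 0, so the summed effective rate is 0.
val rate = 1L
val lagPerPartition = Map("tp0" -> 100L, "tp1" -> 100L, "tp2" -> 100L, "tp3" -> 100L)
val totalLag = lagPerPartition.values.sum                            // 400
val backpressureRatePerPartition = lagPerPartition.map {
  case (tp, lag) => tp -> Math.round(lag / totalLag.toFloat * rate)  // 0 for every partition
}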
When this comes out as 0, it causes a fallback to the (unreasonable) head of the partitions:
...
if (effectiveRateLimitPerPartition.values.sum > 0) {
val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
Some(effectiveRateLimitPerPartition.map {
case (tp, limit) => tp -> (secsPerBatch * limit).toLong
})
} else {
None
}
...
maxMessagesPerPartition(offsets).map { mmp =>
mmp.map { case (tp, messages) =>
val lo = leaderOffsets(tp)
tp -> lo.copy(offset = Math.min(currentOffsets(tp) + messages, lo.offset))
}
}.getOrElse(leaderOffsets)
As in DirectKafkaInputDStream#clamp
This makes backpressure essentially non-functional whenever your actual and minimum receive rate (in messages per batch, across partitions) is smaller than or roughly equal to the partition count and you experience significant lag (e.g. messages come in spikes while your processing capacity stays constant).
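Until you can run a build with the fix, one workaround that follows from the analysis above (my own assumption, not an official recommendation) is to keep minRate at or above your topic's partition count, so the per-partition rate never rounds down to zero:

import org.apache.spark.SparkConf

// If minRate >= partition count, round(lag/totalLag * rate) stays >= 1 per partition,
// so the clamp never falls back to the head of the partitions.
val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.pid.minRate", "8")    // >= your topic's partition count (8 is a placeholder)
  .set("spark.streaming.kafka.maxRatePerPartition", "1")   // still caps each partition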