As I understand it, the watermark is the last seen event time minus the late threshold. So if the last seen event time is 12:11 and the late threshold is 10 minutes, the watermark is 12:01. Since 12:01 is later than the window start time of 12:00, its state is dropped.
But I wrote this query:
stream
  .withWatermark("created", "2 seconds")
  .groupBy(
    window($"created", "2 seconds", "2 seconds"),
    $"animal"
  )
  .count()
  .writeStream
  .format("console")
  .outputMode(OutputMode.Update())
  .start()
And the output:
[2021-02-22 16:06:40.0,2021-02-22 16:06:42.0]:dog
[2021-02-22 16:06:40.0,2021-02-22 16:06:42.0]:owl
[2021-02-22 16:06:40.0,2021-02-22 16:06:42.0]:cat
[2021-02-22 16:06:34.0,2021-02-22 16:06:36.0]:pig
Last event time: 2021-02-22 16:06:41.696 in the window 40-42 sec
Pig time: 2021-02-22 16:06:35.696
As you can see, pig is in the 34-36 window, but the threshold is 2 seconds.
Why can I see pig in the output?
An interesting thing: if I push pig at the same time as the other events, but with the old timestamp, the event is added to the result set. But if the event is pushed after 2 seconds (the threshold) with the same timestamp, it will not show up in the result set.
I push all the data to the stream in one batch, and at that time there is no watermark yet, which is why I can see the old event in the result set. If I push some data to the stream, set a ProcessingTime trigger of, for example, 100 ms, and push the old data after 100 ms, the result is as expected.
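For what it's worth, you can also see the watermark the engine actually used for a micro-batch in the query progress. A minimal sketch, assuming the query above is assigned on start (e.g. val sq = stream... .start(); the handle name sq is hypothetical):
val progress = sq.lastProgress // null until the first micro-batch has completed
if (progress != null && progress.eventTime.containsKey("watermark")) {
  // eventTime also carries "min"/"max"/"avg" event times when the batch had data
  println(s"watermark after the last batch: ${progress.eventTime.get("watermark")}")
}
For the very first batch the reported watermark is still the epoch, which matches the behaviour described above.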
Related
I have two streams, 'left' stream and 'right' stream. I would like to do a leftOuter join on the streams. I would like to collect the events on 'left' stream that couldn't join with 'right' stream.
A watermark delay is set on both streams (20 minutes).
The issue is that, as long as there is data on the right stream within the watermark, the unjoined events show up. But let's say that, after a day of not generating any events on either stream, I generate 'left' events without generating any 'right' events; the 'left' events get dropped and do not show up as unjoined data.
I am expecting the unjoined events to show up at the end of watermark.
The code is as follows
leftdf = leftdf.withWatermark("left_event_time", "20 minutes")
rightdf = rightdf.withWatermark("right_event_time", "20 minutes")
joineddf = leftdf.join(
    rightdf,
    expr("""
        left_id = right_id AND
        left_event_time >= right_event_time - interval 20 minutes AND
        left_event_time <= right_event_time + interval 20 minutes
    """),
    "leftOuter"
)
successdf = joineddf.filter(col("right_field").isNotNull())
unjoined = joineddf.filter(col("right_field").isNull())
I am expecting to get unjoined events even if rightdf is empty.
I tried changing the watermark to 10 seconds for experimentation and generated events only on the 'left' stream. Even after waiting for a few minutes (7 minutes), the unjoined events didn't show up. But once I generated an event on the 'right' stream, the unjoined events that were generated earlier showed up.
In the blog post "Introducing Stream-Stream Joins in Apache Spark 2.3" joining clicks with impressions based on their adId is discussed:
# Define watermarks
impressionsWithWatermark = impressions \
  .selectExpr("adId AS impressionAdId", "impressionTime") \
  .withWatermark("impressionTime", "10 seconds")  # max 10 seconds late

clicksWithWatermark = clicks \
  .selectExpr("adId AS clickAdId", "clickTime") \
  .withWatermark("clickTime", "20 seconds")  # max 20 seconds late

# Inner join with time range conditions
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 minutes
  """)
)
I'd like to know if it's possible to filter the resulting stream so that only the rows with the latest clickTime are included in each "query interval".
The query interval is the interval given in the query join condition:
clickTime >= impressionTime AND
clickTime <= impressionTime + interval 1 minutes
So I might get the following sequence
{type:impression, impressionAdId:1, timestamp: 1}
{type:click, clickAdId:1, timestamp: 1}
{type:click, clickAdId:1, timestamp: 15}
And after t=60s or so Spark emits the following row in the dataframe:
{impressionTimestamp: 1, clickTimestamp: 15, clickAdId: 1, impressionAdId: 1}
I only posted Python code because that is what was in the article; answers with Java or Scala code are welcome too.
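Not an answer from the article, just one possible direction in Scala (since Scala answers are welcome): post-process each micro-batch with foreachBatch (Spark 2.4+) and keep, per impression, only the row with the greatest clickTime seen in that batch. Note this keeps the latest click per micro-batch, which only approximates "per query interval", and joined is a hypothetical name for the joined stream from the snippet above.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// keep, per impression, only the row with the greatest clickTime in this micro-batch
def keepLatestClick(batch: DataFrame, batchId: Long): Unit = {
  val latest = batch
    .groupBy("impressionAdId", "impressionTime")
    .agg(max(struct(col("clickTime"), col("clickAdId"))).as("latest"))
    .select(col("impressionAdId"), col("impressionTime"),
            col("latest.clickTime").as("clickTime"),
            col("latest.clickAdId").as("clickAdId"))
  latest.show(truncate = false) // or write to any batch sink here
}

joined.writeStream            // `joined` is the result of the join above (hypothetical name)
  .foreachBatch(keepLatestClick _)
  .start()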
Every streaming query in Structured Streaming is associated with an id and a runId.
Why do they change when I stop and start the following query?
// Reading datasets with records from a Kafka topic
import org.apache.spark.sql.functions._
import spark.implicits._

val idsPerBatch = spark.
  readStream.
  format("kafka").
  option("subscribe", "topic1").
  option("kafka.bootstrap.servers", "localhost:9092").
  load.
  withColumn("tokens", split('value, ",")).
  withColumn("seconds", 'tokens(0) cast "long").
  withColumn("event_time", to_timestamp(from_unixtime('seconds))). // <-- event time has to be a timestamp
  withColumn("id", 'tokens(1)).
  withColumn("batch", 'tokens(2) cast "int").
  withWatermark(eventTime = "event_time", delayThreshold = "10 seconds"). // <-- define the watermark (before groupBy!)
  groupBy($"event_time"). // <-- use event_time for grouping
  agg(collect_list("batch") as "batches", collect_list("id") as "ids").
  withColumn("event_time", to_timestamp($"event_time")) // <-- convert to a human-readable date
// start the query and display results to console
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

val sq = idsPerBatch.
  writeStream.
  format("console").
  option("truncate", false).
  trigger(Trigger.ProcessingTime(5.seconds)).
  outputMode(OutputMode.Append). // <-- Append output mode
  start
The id is persistent across runs as part of the checkpoint metadata.
Since you're using the ConsoleSink (i.e. console output), which doesn't support checkpointing, and you're not providing a checkpoint location, the id cannot be fetched from the metadata file (emphasis mine):
Returns the unique id of this query that persists across restarts
from checkpoint data. That is, this id is generated when a query is
started for the first time, and will be the same every time it is
restarted from checkpoint data
On the other hand, runId is generated each time you restart the query:
Returns the unique id of this run of the query. That is, every start/restart of a query will generate a unique runId. Therefore, every time a query is restarted from checkpoint, it will have the same id but different runIds.
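As a follow-up sketch (not from the original answer, with hypothetical paths): give the query a checkpoint location and a sink that can recover from it, e.g. a file sink. The id is then stored under <checkpointLocation>/metadata and read back on every restart, while runId keeps changing:
// reuses idsPerBatch and the imports from the snippets above
val sq = idsPerBatch.
  writeStream.
  format("parquet").
  option("path", "/tmp/ids-per-batch").                          // hypothetical output path
  option("checkpointLocation", "/tmp/ids-per-batch-checkpoint"). // hypothetical checkpoint path
  trigger(Trigger.ProcessingTime(5.seconds)).
  outputMode(OutputMode.Append).
  start

println(sq.id)    // the same value across restarts (persisted in the checkpoint metadata)
println(sq.runId) // a new value on every start/restart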
I want to count the unique accesses for each day using Spark Structured Streaming, so I use the following code:
.dropDuplicates("uuid")
and on the next day the state maintained for today should be dropped, so that I can get the right count of unique accesses for the next day and avoid OOM. The Spark documentation suggests using dropDuplicates together with a watermark, for example:
.withWatermark("timestamp", "1 day")
.dropDuplicates("uuid", "timestamp")
but the watermark column must be specified in dropDuplicates. In that case, uuid and timestamp are used as a combined key to deduplicate elements with the same uuid and timestamp, which is not what I expected.
So is there a perfect solution?
After a few days' effort I finally found a way myself.
While studying the source code of watermark and dropDuplicates, I discovered that besides an eventTime column, the watermark also supports a window column, so we can use the following code:
.select(
  window($"timestamp", "1 day"),
  $"timestamp",
  $"uuid"
)
.withWatermark("window", "1 day")
.dropDuplicates("uuid", "window")
Since all events in the same day have the same window, this produces the same result as deduplicating on uuid only. Hope this helps someone.
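For reference, the struct produced by window() is literally named window, which is why it can be referenced by name in withWatermark and dropDuplicates. A quick check, assuming a hypothetical events DataFrame with timestamp and uuid columns (and the usual spark.implicits._ and org.apache.spark.sql.functions._ imports):
events
  .select(window($"timestamp", "1 day"), $"timestamp", $"uuid")
  .printSchema()
// prints (roughly):
// root
//  |-- window: struct (with start and end timestamps)
//  |-- timestamp: timestamp
//  |-- uuid: string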
Below is a modification of the procedure proposed in the Spark documentation. The trick is to manipulate the event time, i.e. to put the event time into buckets. The assumption is that the event time is provided in milliseconds.
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// Removes all duplicates that fall into the same 15-minute tumbling window.
// Does NOT remove duplicates that fall into different 15-minute windows!
public static Dataset<Row> removeDuplicates(Dataset<Row> df) {
    // convert the epoch-millis event time into 15-minute buckets:
    // timestamp - (timestamp % (15 * 60))
    Column bucketCol = functions.to_timestamp(
        col("event_time").divide(1000).minus((col("event_time").divide(1000)).mod(15 * 60)));
    df = df.withColumn("bucket", bucketCol);
    String windowDuration = "15 minutes";
    df = df.withWatermark("bucket", windowDuration)
           .dropDuplicates("uuid", "bucket");
    return df.drop("bucket");
}
I found out that the whole window column didn't work for me, so I chose to use window.start (or window.end) instead:
.select(
  window($"timestamp", "1 day").getField("start").as("window"),
  $"timestamp",
  $"uuid"
)
.withWatermark("window", "1 day")
.dropDuplicates("uuid", "window")
If I set a batch interval of 5 seconds (Seconds(5)), then every 5 seconds I want to add a tag to the current batch's data. If I can tag every batch's data, then when I use the window() function I can filter the data by tag.
In the 1st 5 seconds I input some data:
hello
word
hello
after adding tags, the data looks like this:
(1st, hello) // "1st" is the custom tag that can identify this batch data
(1st, word)
(1st, hello)
In the 2nd 5 seconds I input some data:
spark
streaming
interval
time
after adding tags:
(2nd, spark)
(2nd, streaming)
(2nd, interval)
(2nd, time)
There are 3 options:
The best way would be to add some identification within the messages themselves, so that when you receive them you already have something that can identify each message.
The second option would be to create a custom receiver which can identify the message batch, add some tags, and then send it on to the Spark job.
The final option would be to leverage an Accumulator, something like this:
val sc = new SparkContext(conf)
val accum = sc.longAccumulator("Batch counter")
val recDStream: DStream[String] = ??? // write code here to get the stream
recDStream.foreachRDD { rdd =>
  accum.add(1)                        // one increment per batch (foreachRDD runs once per batch, on the driver)
  val tag = s"Batch-${accum.value}"   // read the counter on the driver
  rdd.map(record => (tag, record))    // tag every record of this batch
     .foreach(println)                // or push the tagged data wherever you need it
}
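An alternative sketch of my own (hypothetical names, not one of the options above): tag inside transform() so the tagged data stays a DStream, which can then be windowed and filtered by tag, which is what the question is ultimately after. The counter lives on the driver and is not fault-tolerant, so the numbering restarts after driver recovery.
import org.apache.spark.streaming.Seconds

// transform() is evaluated once per batch on the driver, so a plain counter works as a tag source
var batchNumber = 0L
val tagged = recDStream.transform { rdd =>
  batchNumber += 1
  val tag = s"Batch-$batchNumber"
  rdd.map(record => (tag, record)) // tag every record of this batch
}

tagged
  .window(Seconds(15), Seconds(5))              // the last 3 batches of 5 seconds each
  .filter { case (tag, _) => tag == "Batch-2" } // keep only one batch's records by its tag
  .print()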