Spark streaming dropping left join events when right side data is empty - apache-spark

I have two streams, a 'left' stream and a 'right' stream, and I would like to do a leftOuter join on them so that I can collect the events on the 'left' stream that couldn't be joined with the 'right' stream.
A watermark delay of 20 minutes is set on both streams.
The issue is that, as long as there is data on the right stream within the watermark, the unjoined events do show up. But say that after a day of not generating any events on either stream, I generate 'left' events without generating any 'right' events; the 'left' events then get dropped and never show up as unjoined data.
I expect the unjoined events to show up once the watermark has passed.
The code is as follows:
from pyspark.sql.functions import col, expr

leftdf = leftdf.withWatermark("left_event_time", "20 minutes")
rightdf = rightdf.withWatermark("right_event_time", "20 minutes")

joineddf = leftdf.join(
    rightdf,
    expr("""
        left_id = right_id AND
        left_event_time >= right_event_time - interval 20 minutes AND
        left_event_time <= right_event_time + interval 20 minutes
    """),
    "leftOuter"
)

successdf = joineddf.filter(col("right_field").isNotNull())
unjoined = joineddf.filter(col("right_field").isNull())
I am expecting to get unjoined events even if rightdf is empty.
I tried changing the watermark to 10 seconds for experimentation and generated events only on the 'left' stream. Even after waiting a few minutes (7 minutes), the unjoined events didn't show up. But once I generated an event on the 'right' stream, the unjoined events that had been generated earlier showed up.

Related

Don't understand Update mode and watermarking

As I understood it, the watermark is the last seen event time minus the late threshold. So if the last seen event time is 12:11 and the late threshold is 10 minutes, the watermark is 12:01. Since 12:01 is later than the window start time of 12:00, its state is dropped.
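For concreteness, a toy sketch of that arithmetic (plain Scala, not Spark API; the timestamps are the ones from the example above):

import java.sql.Timestamp

// Watermark = max event time seen so far - late threshold (toy arithmetic only).
val maxEventTime  = Timestamp.valueOf("2021-02-22 12:11:00").getTime // last seen event
val lateThreshold = 10 * 60 * 1000L                                  // 10 minutes in ms
val watermark     = maxEventTime - lateThreshold                     // corresponds to 12:01
println(new Timestamp(watermark))                                    // 2021-02-22 12:01:00.0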
But I wrote this query:
stream
  .withWatermark("created", "2 seconds")
  .groupBy(
    window($"created", "2 seconds", "2 seconds"),
    $"animal"
  )
  .count()
  .writeStream
  .format("console")
  .outputMode(OutputMode.Update())
  .start()
And the output:
[2021-02-22 16:06:40.0,2021-02-22 16:06:42.0]:dog
[2021-02-22 16:06:40.0,2021-02-22 16:06:42.0]:owl
[2021-02-22 16:06:40.0,2021-02-22 16:06:42.0]:cat
[2021-02-22 16:06:34.0,2021-02-22 16:06:36.0]:pig
Last event time: 2021-02-22 16:06:41.696 in the window 40-42 sec
Pig time: 2021-02-22 16:06:35.696
As you can see, the pig appears in the 34-36 second window, even though the threshold is 2 seconds.
Why can I see the pig in the output?
An interesting detail: if I push the pig at the same time as the other events but with the old timestamp, this event is added to the result set. But if the event is pushed after 2 seconds (the threshold) with the same timestamp, it is not shown in the result set.
I pushed all the data to the stream in one batch, and at that time there was no watermark yet, which is why I can see the old event in the result set. If I push some data to the stream with a ProcessingTime trigger of, for example, 100 ms, and then push the old data after 100 ms, the result is as expected.
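For illustration, a minimal sketch of the same query with an explicit processing-time trigger (assuming a SparkSession named spark and the same stream DataFrame with a created column as above; the 100 ms interval is just an example):

import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import spark.implicits._

val query = stream
  .withWatermark("created", "2 seconds")
  .groupBy(window($"created", "2 seconds", "2 seconds"), $"animal")
  .count()
  .writeStream
  .format("console")
  .outputMode(OutputMode.Update())
  .trigger(Trigger.ProcessingTime("100 milliseconds")) // micro-batch every 100 ms
  .start()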

Killing spark streaming job when no activity

I want to kill my spark streaming job when there is no activity (i.e. the receivers are not receiving messages) for a certain time. I tried doing this
var counter = 0
myDStream.foreachRDD { rdd =>
  if (rdd.count() == 0L) {
    counter = counter + 1
    if (counter == 40) {
      ssc.stop(true, true)
    }
  } else {
    counter = 0
  }
}
Is there a better way of doing this? How would I make a variable available to all receivers and update the variable by 1 whenever there is no activity?
Use a NoSQL table like Cassandra or HBase to keep the counter; you cannot reliably handle stream polling inside a loop. Implement the same logic against NoSQL or MariaDB and perform a graceful shutdown of your streaming job if no activity is happening.
The way I did it was to maintain a table in MariaDB for the streaming job, with a polling interval of 5 minutes. Every 5 minutes the job hits the database and writes the count of records it consumed; the same method also returns how many zero-record batches there have been as of the latest timestamp. This helped me a lot with managing the streaming job. The table also lets me automatically trigger the streaming job based on logic written in a shell script.
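As a rough sketch of that approach (the table name, columns, and connection details below are illustrative, not from the original setup):

import java.sql.DriverManager

myDStream.foreachRDD { (rdd, time) =>
  val count = rdd.count()
  // Record the batch count in an external table, e.g. MariaDB; a separate
  // script (or the job itself) can check how long the count has stayed at zero
  // and then call ssc.stop(stopSparkContext = true, stopGracefully = true).
  val conn = DriverManager.getConnection(
    "jdbc:mariadb://db-host:3306/monitoring", "user", "password")
  try {
    val stmt = conn.prepareStatement(
      "INSERT INTO stream_activity (job_name, batch_time, record_count) VALUES (?, ?, ?)")
    stmt.setString(1, "my-streaming-job")
    stmt.setTimestamp(2, new java.sql.Timestamp(time.milliseconds))
    stmt.setLong(3, count)
    stmt.executeUpdate()
  } finally {
    conn.close()
  }
}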

How to avoid sudden spikes in batch size in Spark streaming?

I am streaming data from Kafka and trying to limit the number of events per batch to 10 events. After processing 10-15 batches, there is a sudden spike in the batch size. Below are my settings:
spark.streaming.kafka.maxRatePerPartition=1
spark.streaming.backpressure.enabled=true
spark.streaming.backpressure.pid.minRate=1
spark.streaming.receiver.maxRate=2
Please check this image for the streaming behavior
This is a bug in Spark; please refer to https://issues.apache.org/jira/browse/SPARK-18371
The pull request isn't merged yet, but you may pick it up and build Spark on your own.
To summarize the issue:
If you have the spark.streaming.backpressure.pid.minRate set to a number <= partition count, then an effective rate of 0 is calculated:
val totalLag = lagPerPartition.values.sum
...
val backpressureRate = Math.round(lag / totalLag.toFloat * rate)
...
(The second line calculates the rate per partition, where rate comes from the PID estimator and falls back to minRate whenever the PID calculates something smaller.)
As here: DirectKafkaInputDStream code
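As a toy illustration of that rounding (assuming 4 partitions with equal lag and minRate = 1; the numbers are made up):

val rate = 1.0                                        // PID output, here stuck at minRate
val lagPerPartition = Map(0 -> 250L, 1 -> 250L, 2 -> 250L, 3 -> 250L)
val totalLag = lagPerPartition.values.sum             // 1000
val perPartitionRate = lagPerPartition.map {
  case (tp, lag) => tp -> Math.round(lag / totalLag.toFloat * rate)  // round(0.25) = 0
}
// Every per-partition rate rounds down to 0, so their sum is 0.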
When this works out to 0, the code falls back to the (unreasonable) head offsets of the partitions:
...
if (effectiveRateLimitPerPartition.values.sum > 0) {
  val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
  Some(effectiveRateLimitPerPartition.map {
    case (tp, limit) => tp -> (secsPerBatch * limit).toLong
  })
} else {
  None
}
...
maxMessagesPerPartition(offsets).map { mmp =>
  mmp.map { case (tp, messages) =>
    val lo = leaderOffsets(tp)
    tp -> lo.copy(offset = Math.min(currentOffsets(tp) + messages, lo.offset))
  }
}.getOrElse(leaderOffsets)
As in DirectKafkaInputDStream#clamp
This means backpressure basically does not work when your actual and minimum receive rate, in messages across partitions, is smaller than or roughly equal to the partition count and you experience significant lag (e.g. messages come in spikes while you have constant processing power).

How to expire state of dropDuplicates in structured streaming to avoid OOM?

I want to count the unique accesses for each day using Spark Structured Streaming, so I use the following code:
.dropDuplicates("uuid")
and on the next day the state maintained for today should be dropped, so that I can get the right count of unique accesses for the next day and avoid OOM. The Spark documentation suggests using dropDuplicates with a watermark, for example:
.withWatermark("timestamp", "1 day")
.dropDuplicates("uuid", "timestamp")
but the watermark column must be specified in dropDuplicates. In that case uuid and timestamp are used as a combined key, so only elements with the same uuid and the same timestamp are deduplicated, which is not what I expected.
So is there a perfect solution?
After a few days of effort I finally found the way myself.
While studying the source code of watermark and dropDuplicates, I discovered that besides an eventTime column, watermark also supports a window column, so we can use the following code:
.select(
  window($"timestamp", "1 day"),
  $"timestamp",
  $"uuid"
)
.withWatermark("window", "1 day")
.dropDuplicates("uuid", "window")
Since all events in the same day have the same window, this produces the same results as deduplicating on uuid alone. Hope this can help someone.
Below is a modification of the procedure proposed in the Spark documentation. The trick is to manipulate the event time, i.e. put the event time into buckets. The assumption is that the event time is provided in milliseconds.
// Removes all duplicates that fall in the same 15 minute tumbling window.
// Doesn't remove duplicates that fall in different 15 minute windows!
public static Dataset<Row> removeDuplicates(Dataset<Row> df) {
    // converts time into 15 minute buckets:
    // timestamp - (timestamp % (15 * 60))
    Column bucketCol = functions.to_timestamp(
        col("event_time").divide(1000).minus((col("event_time").divide(1000)).mod(15 * 60)));
    df = df.withColumn("bucket", bucketCol);

    String windowDuration = "15 minutes";
    df = df.withWatermark("bucket", windowDuration)
           .dropDuplicates("uuid", "bucket");

    return df.drop("bucket");
}
I found that the window column itself didn't work for me, so I used window.start (or window.end) instead, aliasing it back to "window" so the watermark and dropDuplicates below still refer to the same column:
.select(
  window($"timestamp", "1 day").getField("start").as("window"),
  $"timestamp",
  $"uuid"
)
.withWatermark("window", "1 day")
.dropDuplicates("uuid", "window")

How to add tags for every batch data in spark streaming?

If I set a batch interval of 5 seconds (Seconds(5)), I want to add a tag to each 5-second batch of data. If I can tag every batch's data, then when I use the window() function I can filter the data by tag.
In the first 5 seconds, I input some data:
hello
word
hello
After adding tags, the data looks like this:
(1st, hello) // "1st" is the custom tag that can identify this batch data
(1st, word)
(1st, hello)
In the next 5 seconds, I input some more data:
spark
streaming
interval
time
After adding tags:
(2nd, spark)
(2nd, streaming)
(2nd, interval)
(2nd, time)
There are three options:
The best way would be to add some identification within the messages themselves, so that when you receive them you already have something that identifies each message.
The second option would be to create a custom receiver that can identify the message batch, add the tags, and then pass the data on to the Spark job.
The final option would be to leverage an accumulator, something like this:
val sc = new SparkContext(conf)
val accum = sc.accumulator(0, "My Accumulator")
val recDStream = // write code here to obtain the stream

recDStream.foreachRDD { rdd =>
  accum += 1                                  // one increment per batch
  val tag = "Batch-" + accum.value            // e.g. "Batch-1", "Batch-2", ...
  rdd.map(record => (tag, record))            // tag every record in this batch
  // ...then write out or further process the tagged RDD
}
