Spark Structured Streaming - Force microbatch execution even without input rows - apache-spark

We have a Spark Structured Streaming query that counts the number of input rows received in the last hour, updating every minute, performing the aggregation with a sliding temporal window (windowDuration="1 hour", slideDuration="1 minute"). The query is configured with a processing-time trigger of 30 seconds, trigger(processingTime="30 seconds"), and the output mode of the query is append.
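For reference, a minimal sketch of such a query in PySpark (the Kafka source, column names, and the watermark duration are assumptions on my part; append mode on an aggregation requires a watermark):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.appName("hourly-count").getOrCreate()

# Hypothetical source: a Kafka topic whose record timestamp is used as event time.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS event_id",
                      "timestamp AS event_time"))

# Rows received in the last hour, for a window that slides every minute.
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window("event_time", "1 hour", "1 minute"))
          .agg(count("*").alias("count")))

query = (counts.writeStream
         .outputMode("append")
         .trigger(processingTime="30 seconds")
         .format("console")
         .start())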
This query produces results as long as new rows are received, which is consistent with the behaviour that the documentation indicates for fixed interval micro-batches:
If no new data is available, then no micro-batch will be kicked off.
However, we would like the query to produce results even when there are NO input rows: our use case is monitoring, and we would like to trigger alerts when no input messages arrive in the monitored system for a period of time.
For example, for the following input:
event_time   event_id
00:02        1
00:05        2
01:00        3
03:00        4
At processingTime=01:01, we would expect the following output row to be produced:
window.start   window.end   count
00:00          01:00        3
However, from this point there are no input rows until 03:00, and therefore no microbatch will be executed until then, missing the opportunity to produce output rows such as:
window.start   window.end   count
01:01          02:01        0
which would produce a monitoring alert in our system.
Is there any workaround for this behaviour, allowing executions of empty microbatches when there are no input rows?

This is not something Spark Structured Streaming provides, and there is no workaround as such. There was even an issue, which may still exist, where the last micro-batch of data was not processed.

Related

Is there a way to use Spark Structured Streaming to calculate daily aggregates?

I am planning to use structured streaming to calculate daily aggregates across different metrics.
Data volume < 1000 records per day.
Here is a simple example of the input data:
timestamp, Amount
1/1/20 10:00, 100
1/1/20 11:00, 200
1/1/20 23:00, 400
1/2/20 10:00, 100
1/2/20 11:00, 200
1/2/20 23:00, 400
1/2/20 23:10, 400
Expected output
Day, Amount
1/1/20, 700
1/2/20, 1100
I am planning to do something like this in Structured Streaming, but I'm not sure if it works or if it's the right way to do it:
parsedDF.withWatermark("date", "25 hours").groupBy("date", window("date", "24 hours")).sum("amount")
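As an illustration only, a runnable variant of that attempt might look like the following (the rate source, column names, and grouping by the window alone rather than by the raw timestamp are assumptions on my part):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, sum as sum_

spark = SparkSession.builder.appName("daily-agg-stream").getOrCreate()

# Placeholder streaming source providing "date" (timestamp) and "amount" columns.
parsedDF = (spark.readStream
            .format("rate")
            .load()
            .selectExpr("timestamp AS date", "CAST(value AS DOUBLE) AS amount"))

daily = (parsedDF
         .withWatermark("date", "25 hours")
         .groupBy(window("date", "24 hours"))       # one group per day
         .agg(sum_("amount").alias("amount")))

query = (daily.writeStream
         .outputMode("append")   # each day is emitted once the watermark passes it
         .format("console")
         .start())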
There is material overhead from running structured streams. Given that you're writing code to produce a single result every 24 hours, it would seem a better use of resources to do the following (sketched in code below), if you can trade an extra couple of minutes of latency for using far fewer resources:
Ingest data into a table, partitioned by day
Write a simple SQL query against this table to generate your daily aggregate(s)
Schedule the job to run [watermark] seconds after midnight.
That's assuming you're in the default output mode, since you didn't specify one. If you want to stick with streaming, more context on your code and your goal would be helpful. For example: how often do you want results, and do you need partial results before the end of the day? How long do you want to wait for late data to update the aggregates? Which output mode are you planning to use?
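A minimal sketch of that batch approach, with the table, path, and column names as assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.appName("daily-aggregates").getOrCreate()

# Step 1: ingest the raw records into a table partitioned by day.
raw = spark.read.json("s3://bucket/raw-events/")
(raw.withColumn("day", to_date("timestamp"))
    .write.mode("append")
    .partitionBy("day")
    .saveAsTable("events"))

# Step 2: a simple query, scheduled to run shortly after midnight,
# that aggregates yesterday's partition.
daily = spark.sql("""
    SELECT day, SUM(amount) AS amount
    FROM events
    WHERE day = date_sub(current_date(), 1)
    GROUP BY day
""")
daily.write.mode("append").saveAsTable("daily_aggregates")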

How to avoid selecting too much data

What we are doing is pretty much the following:
putting time-series data into Cassandra
running a Spark aggregation job every hour and writing the aggregated data back to Cassandra
One of the problems we found is that if the hourly job fails repeatedly, for example for 1 AM ~ 2 AM, 2 AM ~ 3 AM and 3 AM ~ 4 AM (or more), then the next run will aggregate the data from 1 AM to 5 AM (the last success time is recorded in Cassandra). The issue comes at that hour, because it now covers 4 (or more) hours of data, which is much larger than one hour of data and results in an OutOfMemory exception from selecting too much data from Cassandra into a dataframe.
Adding memory to the Spark executors is one way of fixing this. However, considering it's an edge case, I'm wondering if there's a mature pattern or architecture to deal with this issue.
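One pattern worth sketching here (purely illustrative, with assumed table and column names) is to split the catch-up range into fixed-size chunks so that a single run never reads more than one hour of data at a time:

from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hourly-aggregation").getOrCreate()

def aggregate_range(start, end):
    # Aggregate one bounded slice of the time series (table/column names assumed).
    df = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(table="events", keyspace="metrics")
          .load()
          .where(f"ts >= '{start}' AND ts < '{end}'"))
    agg = df.groupBy("metric").agg({"value": "sum"})
    (agg.write
        .format("org.apache.spark.sql.cassandra")
        .options(table="events_hourly", keyspace="metrics")
        .mode("append")
        .save())

# Catch up hour by hour instead of issuing one huge query, so a gap left by
# repeated failures never loads more than one hour into memory at once.
last_success = datetime(2024, 1, 1, 1, 0)    # in practice, read from Cassandra
now = datetime(2024, 1, 1, 5, 0)
cursor = last_success
while cursor < now:
    nxt = min(cursor + timedelta(hours=1), now)
    aggregate_range(cursor, nxt)
    cursor = nxt                              # persist progress after each chunk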

Record by record timestamp difference calculation

I am working on logic to find the consecutive time difference between two timestamps in the streaming layer (Spark) by comparing the previous time and the current time and storing the value in the database.
For eg:
2017-08-01 11:00:00
2017-08-01 11:05:00
2017-08-01 11:07:00
So according to the above timestamps, my consecutive differences will be 5 mins (11:00:00 to 11:05:00) and 2 mins respectively, and when I sum them I get 7 mins (5+2), which is the actual time difference. Now the real challenge is when I receive a delayed timestamp.
For eg:
2017-08-01 11:00:00
2017-08-01 11:05:00
2017-08-01 11:07:00
2017-08-01 11:02:00
Here, when I calculate the differences, they will be 5 mins, 2 mins and 5 mins respectively, and when I sum them I get 12 mins (5+2+5), which is greater than the actual time difference (7 mins) and therefore wrong.
Please help me find a workaround to handle this delayed timestamp in the record-by-record time difference calculation.
What you are experiencing is the difference between 'event time' and 'processing time'. In the best case, processing time will be nearly identical to event time, but sometimes, an input record is delayed, so the difference will be larger.
When you process streaming data, you define (explicitly or implicitly) a window of records that you look at. If you process records individually, this window has size 1. In your case, your window has size 2. But you could also have a window that is based on time, i.e. you can look at all records that have been received in the last 10 minutes.
If you want to process delayed records in order, you need to wait until the delayed records have arrived and then sort the records within the window. The problem then becomes: how long do you wait? The delayed records may show up 2 days later! How long to wait is a subjective question and depends on your application and its requirements.
Note that if your window is time-based, you will need to handle the case where no previous record is available.
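As an illustrative sketch of that approach (Spark batch code with assumed column names): once the wait period for late data has passed, sort by event time within the window and only then take consecutive differences, so the late 11:02 record falls back into order:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("consecutive-diff").getOrCreate()

# Records collected for one window, including the late 11:02 arrival.
events = spark.createDataFrame(
    [("2017-08-01 11:00:00",), ("2017-08-01 11:05:00",),
     ("2017-08-01 11:07:00",), ("2017-08-01 11:02:00",)],
    ["ts"]).withColumn("ts", F.to_timestamp("ts"))

# Sort by event time inside the window, then diff consecutive rows.
w = Window.orderBy("ts")
diffs = events.withColumn(
    "diff_seconds",
    F.col("ts").cast("long") - F.lag("ts").over(w).cast("long"))

diffs.show()
# Summing diff_seconds now gives 2 + 3 + 2 minutes = 7 minutes, the true
# elapsed time, because the late record was re-ordered before differencing.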
I highly recommend this article: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 to get to grips with streaming terminology and windows.

Stream Analytics Query working but no output to table

I've got a problem with my Stream Analytics job. I'm pulling events from an IoT Hub and grouping them into time windows based on their custom timestamps; I've already written a query that does this correctly. But the problem is that it just doesn't write anything into my output table (a NoSQL table on my Storage Account).
The query runs without problems in the query editor (when testing with a sample input file) and produces the correct output, but when running 'for real', it doesn't output anything (the output table remains empty). I've even tried renaming the table and outputting to a blob storage, but no dice. Here's the query:
SELECT
'general' AS partitionKey,
MIN(ID_frame) AS rowKey,
DATEADD(second, 1, DATEADD(hour, -3, System.TimeStamp)) AS window_start,
System.TimeStamp AS window_end,
COUNT(ID_frame) AS device_count
INTO
[IoT-Hub-output-table]
FROM
[IoT-Hub-input] TIMESTAMP BY custom_timestamp
GROUP BY TumblingWindow(Duration(hour, 3), Offset(second, -1))
The interesting part is that, if I omit any windowing in my query, then the table output works just fine.
I've been beating my head against the wall about this for a few days now, so I think I've already tried most of the obvious things.
As you are using a TumblingWindow of 3 hours, it means you will get a single output every 3 hours which contains an aggregate of all the events within that period.
So did you already wait for 3 hours for the first output to be generated?
I would try and set the window smaller, and try again to see if the output works correctly.
Turns out the query did output into my table, but with an amount of delay I didn't expect; I was waiting 20-30 minutes at most, while the first insertions began a little more than half an hour in. Thus I was cancelling the Analytics job before any output was produced and falsely assuming it just wouldn't output anything.
I found this to be the case after I noticed that 'sometimes' (when the job had been running for long enough) there appeared to be some output, and in those output records I noticed the big delay between my custom timestamp field and the general timestamp field (which the engine uses to record when the entity was last updated).

How big can the window be when using Spark Streaming?

We have some streaming data that needs to be aggregated, and we are considering using Spark Streaming to do it.
We need to generate three kinds of reports. The reports are based on
The last 5 minutes data
The last 1 hour data
The last 24 hour data
The frequency of reports is 5 minutes.
After reading the docs, the most obvious way to solve this seems to be to set up a Spark stream with a 5-minute batch interval and two windows of 1 hour and 1 day.
But I am worried that windows of one hour and one day may be too big. I do not have much experience with Spark Streaming, so what window lengths do you use in your environment?
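For concreteness, a minimal sketch of that setup with the DStream API (the socket source, host/port, and checkpoint path are assumptions on my part):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="windowed-reports")
ssc = StreamingContext(sc, batchDuration=300)   # 5-minute batches
ssc.checkpoint("/tmp/checkpoints")              # required for windowed operations

lines = ssc.socketTextStream("localhost", 9999)

# 5-minute report: just the current batch.
lines.count().pprint()

# 1-hour report, recomputed every 5 minutes.
lines.countByWindow(windowDuration=3600, slideDuration=300).pprint()

# 24-hour report, recomputed every 5 minutes; Spark has to retain a full day
# of data, which is why the window length matters for memory and checkpointing.
lines.countByWindow(windowDuration=24 * 3600, slideDuration=300).pprint()

ssc.start()
ssc.awaitTermination()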
