Single Sliding Window in Spark Structured Streaming - apache-spark

I am trying to implement a system that measures something based on data from the last n hours. Here is a simple example:
I am constantly receiving new messages in the following format: {"temp": <temp>, createdAt: <timestamp>}. I want to calculate the average temperature for the last hour, so if the time is now 15:10 I want to consider only those records whose createdAt field is later than 14:10. All others should be dropped.
Is there any way I could do this cleanly in Spark Structured Streaming while keeping only one window? I was looking at the sliding windows feature, but that would result in multiple windows, which I don't really need.

"I was looking at sliding windows feature but that would result in multiple windows, which I don't really need."
If you don't want overlapping windows, then what you are looking for is a tumbling window. In code, you use the same syntax as for a sliding window, but you leave the slideDuration parameter at its default of None.
PySpark docs on this topic - https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.window.html#pyspark-sql-functions-window
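To make the difference concrete, here is a minimal plain-Python sketch (not Spark code) of which windows a single event falls into. It assumes windows are aligned to the Unix epoch and that omitting the slide means "slide equals window length"; the helper name windows_for is made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def windows_for(event_time, window, slide=None):
    """Return the [start, end) windows that contain event_time.

    slide=None models a tumbling window (slide equals the window length).
    """
    slide = slide or window
    # Latest window start at or before the event, aligned to the epoch.
    start = event_time - (event_time - EPOCH) % slide
    result = []
    # Walk backwards over earlier starts while the window still covers the event.
    while start + window > event_time:
        result.append((start, start + window))
        start -= slide
    return sorted(result)

event = datetime(2024, 1, 1, 15, 10, tzinfo=timezone.utc)
hour = timedelta(hours=1)

tumbling = windows_for(event, hour)                        # one window
sliding = windows_for(event, hour, timedelta(minutes=30))  # two overlapping windows

for start, end in tumbling + sliding:
    print(start.time(), "-", end.time())
```

With these inputs the tumbling case yields the single window 15:00 to 16:00, while the sliding case yields 14:30 to 15:30 and 15:00 to 16:00. Note that the tumbling window is aligned to the clock, so it is not the same thing as "the 60 minutes before 15:10".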

Related

how sliding window functions work in spark?

I have a question about the sliding window function in Spark. I'll be receiving JSON events from a stream, and I would like to see the top value out of all the messages I've received in the last hour.
If I apply this logic using a tumbling or sliding window function, does it mean I'll get output only once every hour?
Does Spark store all events for the time frame I've specified in the window function in order to get the largest/top value?
Please help me understand.
Spark supports three types of time windows: tumbling (fixed), sliding and session.
A tumbling window is different from a sliding window: tumbling windows are fixed-size and do not overlap. If you use a tumbling window, you can get the output once every hour with the right configuration. If you use a sliding window, you can get the top N of all the messages you've received in the last hour (the previous 60 minutes).
Spark uses a state store provider to handle stateful operations.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#types-of-time-windows
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#state-store
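As a rough mental model of why a windowed max does not need every event to be kept, here is a plain-Python sketch (not Spark's actual state store) in which the state holds one running value per window and each batch is folded into it. The function name update_state and the one-hour tumbling window are assumptions for the example.

```python
WINDOW_SECONDS = 3600  # assumed: tumbling one-hour windows

def update_state(state, events):
    """Fold a batch of (epoch_seconds, value) events into per-window maxima."""
    for ts, value in events:
        window_start = ts - ts % WINDOW_SECONDS
        # Only the running maximum is kept; the event itself is not stored.
        if window_start not in state or value > state[window_start]:
            state[window_start] = value
    return state

state = {}
update_state(state, [(10, 5.0), (1800, 9.0), (3700, 2.0)])  # first micro-batch
update_state(state, [(3000, 7.0), (4000, 4.0)])             # second micro-batch
print(state)
```

After both batches the state contains two entries, one per window, no matter how many events arrived.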

Accessing the current sliding window in foreachBatch in Spark Structured Streaming

In Spark Structured Streaming I am using foreachBatch() to manually maintain a sliding window consisting of the last 200000 entries. With every micro-batch I receive about 50 rows. On this sliding window I manually calculate my desired metrics such as min, max, etc.
Spark also provides a sliding window function, but I have two problems with it:
1- The interval at which the sliding window is updated can only be configured as a time period; there seems to be no way to force an update with every single micro-batch that comes in. Is there a possibility that I do not see?
2- The bigger problem: it seems I can only do aggregations using grouping, like:
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
But I do not want to group over multiple sliding windows. I need something like the existing foreachBatch() that allows me to access not only the current batch but also/or the current sliding window. Is there something like that?
Thank you for your time!
You can probably use the flatMapGroupsWithState feature to achieve this.
Basically, you keep the information you need from previous batches in an internal state, update it with every batch, and use it in the next one.
See the links below:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#arbitrary-stateful-operations
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-demo-arbitrary-stateful-streaming-aggregation-flatMapGroupsWithState.html
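The idea of carrying state from one micro-batch to the next can be sketched in plain Python. This is not the flatMapGroupsWithState API itself; the class LastN is hypothetical and only shows the shape of the state: a bounded buffer of the most recent rows that is updated on every batch, with the metrics recomputed each time.

```python
from collections import deque

class LastN:
    """Keeps the most recent `capacity` values across batches (hypothetical helper)."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest items when it is full.
        self.rows = deque(maxlen=capacity)

    def update(self, batch):
        """Add one micro-batch and return metrics over the current window."""
        self.rows.extend(batch)
        if not self.rows:
            return {"count": 0, "min": None, "max": None}
        return {"count": len(self.rows), "min": min(self.rows), "max": max(self.rows)}

state = LastN(capacity=5)       # 200000 in the question; 5 keeps the demo readable
first = state.update([3, 8, 1])
second = state.update([7, 9, 4])  # the oldest value (3) falls out of the window
print(first, second)
```

Because the metrics are recomputed on every call, the window is effectively updated with each incoming batch rather than on a time interval. Recomputing min and max over the whole buffer is the simplest choice here; for a buffer of 200000 entries an incremental structure may be worth considering.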

spark structured streaming with large window size: memory consumption

We plan to implement a Spark Structured Streaming application that will consume a continuous flow of data: the evolution of a metric value over time.
This streaming application will use a window size of 7 days (with a sliding window) in order to frequently calculate the average of the metric value over the last 7 days.
1- Will Spark retain all 7 days of data (with a large impact on memory consumption), OR does Spark continuously calculate and update the requested average (and then discard the data it has handled), so that memory consumption is not affected much because the 7 days of data are not retained?
2- If the answer to the first question is that the 7 days of data are retained, does using a watermark prevent this retention?
Let's say we have a watermark of 1 hour: will only 1 hour of data be retained in Spark, OR are 7 days still retained in Spark memory, with the watermark only there to ignore incoming data whose timestamp is older than 1 hour?
A window size of 7 days is definitely significant, but the impact also depends on the volume of streaming data/records coming in. The trick lies in how you use the window duration, update interval, output mode and, if necessary, the watermark (if the business rule is not impacted).
1- If the stream is configured with a tumbling window (i.e. the window duration is the same as the update duration) and complete mode, you may end up with the full data being kept in memory for 7 days. However, if you configure the window duration to be 7 days with an update every x minutes, aggregates will be calculated every x minutes and only the result data will be kept in memory. So look at the window API parameters and configure them to get the results you need.
2- A watermark brings a different behaviour: it ignores records older than the watermark duration and updates the result table after each micro-batch crosses the watermark time. If your business rule allows for the watermark calculation, it is fine to use it too.
It is worth going through the API, the output modes and watermark usage in detail. This would help you choose the right combination.
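The interaction between window duration, slide and watermark can be modelled in a few lines of plain Python. This is only a simplified sketch under stated assumptions, not Spark's implementation: the state holds one (sum, count) pair per open window, the watermark is the largest event time seen so far minus a delay and is only advanced between batches, and a window is finalized and dropped once the watermark passes its end. The class WindowedAverage and the one-day slide are made up for the example.

```python
DAY, HOUR = 86400, 3600

class WindowedAverage:
    """Toy model of a sliding-window average with a watermark (hypothetical)."""

    def __init__(self, window, slide, delay):
        self.window, self.slide, self.delay = window, slide, delay
        self.state = {}            # window_start -> [sum, count]
        self.max_event_time = 0

    def process_batch(self, batch):
        """Fold (epoch_seconds, value) events into state; return finalized averages."""
        watermark = self.max_event_time - self.delay  # fixed for the whole batch
        for ts, value in batch:
            start = ts - ts % self.slide
            while start + self.window > ts:
                # Windows that ended before the watermark no longer accept data.
                if start + self.window > watermark:
                    acc = self.state.setdefault(start, [0.0, 0])
                    acc[0] += value
                    acc[1] += 1
                start -= self.slide
            self.max_event_time = max(self.max_event_time, ts)
        # Advance the watermark and evict every window that has now closed.
        watermark = self.max_event_time - self.delay
        finalized = {}
        for start in sorted(self.state):
            if start + self.window <= watermark:
                total, count = self.state.pop(start)
                finalized[start] = total / count
        return finalized

avg = WindowedAverage(window=7 * DAY, slide=DAY, delay=HOUR)
first = avg.process_batch([(10 * DAY + 100, 20.0)])
second = avg.process_batch([(11 * DAY + 2 * HOUR, 30.0)])
late = avg.process_batch([(10 * DAY + 200, 100.0)])  # late, but some windows are still open
print(first, second, late, len(avg.state))
```

In this model the state size is bounded by the number of open windows (seven here), not by the number of events, and a window's state lives for roughly the window duration plus the watermark delay. The late event in the third batch still updates the windows that remain open and is ignored only for the window that has already been finalized.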

Hazelcast Jet sliding window unit of measurement

Sorry for what may be a silly question, but it is unclear from the docs what the unit of measurement for a sliding window is. Is it milliseconds, seconds, or the number of items in the stream?
I noticed that the aggregation operation was producing empty results, which I had to filter out explicitly, probably because there was no data available for that window, so I guess the last option is not the case.
Jet doesn't specify a unit for windows; instead, windows are calculated in the same unit that your timestamps are specified in. Typically, if your timestamps are UNIX-style timestamps, it would be milliseconds, but you could also use nanoseconds, seconds, or minutes if that's how your timestamps are defined. The window refers specifically to event time and is not related to the number of events in the stream, only to their timestamps.

Spark streaming - waiting for data for window aggregations?

I have data in the format { host | metric | value | time-stamp }. We have hosts all around the world reporting metrics.
I'm a little confused about using window operations (say, 1 hour) to process data like this.
Can I tell my window when to start, or does it just start when the application starts? I want to ensure I'm aggregating all data from hour 11 of the day, for example. If my window starts at 10:50, I'll just get 10:50-11:50 and miss 10 minutes.
Even if the window is perfect, data may arrive late.
How do people handle this kind of issue? Do they make windows far bigger than needed and just grab the data they care about on every batch cycle (kind of sliding)?
In the past, I worked on a large-scale IoT platform and solved that problem by considering that the windows were only partial calculations. I modeled the backend (Cassandra) to receive more than 1 record for each window. The actual value of any given window would be the addition of all -potentially partial- records found for that window.
So, a perfect window would be 1 record, a split window would be 2 records, late-arrivals are naturally supported but only accepted up to a certain 'age' threshold. Reconciliation was done at read time. As this platform was orders of magnitude heavier in terms of writes vs reads, it made for a good compromise.
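The partial-record approach described above can be sketched as follows, in plain Python with an in-memory dictionary standing in for the Cassandra table. The names write_partial and read_window, the one-hour window and the six-hour age threshold are all assumptions for illustration; the original answer does not give these details.

```python
from collections import defaultdict

WINDOW = 3600        # assumed: one-hour windows, in seconds
MAX_AGE = 6 * 3600   # hypothetical cut-off for accepting late partial records

def write_partial(table, now, window_start, total, count):
    """Append a partial aggregate unless its window is too old; return True if stored."""
    if now - (window_start + WINDOW) > MAX_AGE:
        return False
    table[window_start].append((total, count))
    return True

def read_window(table, window_start):
    """Reconcile at read time by adding up every partial record of the window."""
    partials = table.get(window_start, [])
    return (sum(t for t, _ in partials), sum(c for _, c in partials))

table = defaultdict(list)
eleven = 11 * 3600  # window covering hour 11

write_partial(table, now=eleven + 3000, window_start=eleven, total=40.0, count=4)  # first part
write_partial(table, now=eleven + 3700, window_start=eleven, total=10.0, count=1)  # split window
write_partial(table, now=eleven + 5000, window_start=eleven, total=5.0, count=1)   # late arrival
rejected = write_partial(table, now=eleven + 10 * 3600, window_start=eleven, total=1.0, count=1)

print(read_window(table, eleven), rejected)
```

Writes stay cheap because they only append, and the cost of combining partial records is paid on the comparatively rare reads, which matches the write-heavy workload the answer describes.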
After speaking with people in depth on MapR forums, the consensus seems to be that hourly and daily aggregations should not be done in a stream, but rather in a separate batch job once the data is ready.
When doing streaming you should stick to small batches with windows that are relatively small multiples of the streaming interval. Sliding windows can be useful for, say, trends over the last 50 batches. Using them for tasks as large as an hour or a day doesn't seem sensible though.
Also, I don't believe you can tell your batches when to start/stop, etc.
