I have a doubt regarding the sliding window function in Spark. I'll be receiving JSON events from a stream, and I would like to find the top value across all messages received in the last 1 hour.
So if I apply this logic using a tumbling or sliding window function, does that mean I'll only get the output after every one hour?
Does Spark store all events for the time frame mentioned in the window function in order to find the largest/top value?
Please help me understand.
Spark supports three types of time windows: tumbling (fixed), sliding and session.
A tumbling window is different from a sliding window: a tumbling window is fixed-size and non-overlapping. If you use a tumbling window, you can get the output every one hour with the right configuration. If you use a sliding window instead, you can get the top N of all messages received in the last 1 hour (the previous 60 minutes), re-evaluated at each slide interval.
Internally, Spark uses a state store provider to handle these stateful operations.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#types-of-time-windows
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#state-store
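To make the difference concrete, here is a plain-Python sketch (not Spark code) of how the two window types assign an event time to buckets; the 1-hour size, 15-minute slide, and the 14:35 event time are illustrative values, not anything from the thread:

```python
from datetime import datetime, timedelta

def tumbling_window(ts, size):
    """Assign an event time to its single fixed, non-overlapping window."""
    epoch = datetime(1970, 1, 1)
    start = ts - ((ts - epoch) % size)
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """Return every overlapping window that contains the event time."""
    epoch = datetime(1970, 1, 1)
    start = ts - ((ts - epoch) % slide)  # latest window starting at/before ts
    windows = []
    while start + size > ts:             # walk back while ts is still inside
        windows.append((start, start + size))
        start -= slide
    return list(reversed(windows))

event = datetime(2024, 1, 1, 14, 35)
print(tumbling_window(event, timedelta(hours=1)))
# exactly one bucket: 14:00-15:00
print(sliding_windows(event, timedelta(hours=1), timedelta(minutes=15)))
# four overlapping buckets, from 13:45-14:45 up to 14:30-15:30
```

So with a tumbling window each event contributes to exactly one result per hour, while with a sliding window the same event is counted in several overlapping results.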
Related
I am trying to implement a system which measures something based on data from the last n hours. Here is a simple example:
I am constantly getting new messages in the following format: {"temp": <temp>, "createdAt": <timestamp>}. I want to calculate the average temperature for the last 1 hour, so if the time is now 15:10 I want to consider only those records whose createdAt field is set to a time after 14:10. Others should be dropped.
Is there any way I could do this gently in Spark Structured Streaming and keep only one window? I was looking at the sliding windows feature, but that would result in multiple windows, which I don't really need.
If you don't want overlapping windows, then what you are looking for is a tumbling window. From the code perspective, you use the same syntax as for a sliding window, but you leave the slideDuration parameter at its default None.
PySpark docs on this topic - https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.window.html#pyspark-sql-functions-window
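Under tumbling semantics each record lands in exactly one bucket, so you get one average per hour. A plain-Python illustration of that grouping (the records below are hypothetical; in PySpark the equivalent would be grouping by `window(col("createdAt"), "1 hour")` with no slideDuration):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records in the question's {"temp": ..., "createdAt": ...} shape.
records = [
    {"temp": 20.0, "createdAt": datetime(2024, 1, 1, 14, 15)},
    {"temp": 22.0, "createdAt": datetime(2024, 1, 1, 14, 50)},
    {"temp": 30.0, "createdAt": datetime(2024, 1, 1, 15, 10)},
]

# Tumbling 1-hour buckets: each record lands in exactly one window,
# namely the hour containing its createdAt timestamp.
sums = defaultdict(lambda: [0.0, 0])
for r in records:
    window_start = r["createdAt"].replace(minute=0, second=0, microsecond=0)
    sums[window_start][0] += r["temp"]
    sums[window_start][1] += 1

averages = {w: s / n for w, (s, n) in sorted(sums.items())}
print(averages)  # one average per hour: 21.0 for 14:00, 30.0 for 15:00
```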
I am using foreachBatch() in Spark Structured Streaming to manually maintain a sliding window consisting of the last 200000 entries. With every micro-batch I receive about 50 rows. On this sliding window I manually calculate my desired metrics like min, max, etc.
Spark also provides a sliding window function. But I have two problems with it:
The interval at which the sliding window is updated can only be configured as a time period; there seems to be no way to force an update with each single micro-batch coming in. Is there a possibility that I do not see?
The bigger problem: it seems I can only do aggregations using grouping, like:
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
But I do not want to group over multiple sliding windows. I need something like the existing foreachBatch() that gives me access not only to the current batch but also to the current sliding window. Is there something like that?
Thank you for your time!
You can probably use the flatMapGroupsWithState feature to achieve this.
Basically, you can store and keep updating state from previous batches (only the information you need) and use it in the next batch.
You can refer to the links below:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#arbitrary-stateful-operations
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-demo-arbitrary-stateful-streaming-aggregation-flatMapGroupsWithState.html
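As a rough sketch of the idea (plain Python, not Spark; WINDOW_SIZE stands in for the 200000 entries from the question), the state outlives any single batch, and only the information you need — here, the trimmed buffer — is carried forward to the next batch:

```python
from collections import deque

WINDOW_SIZE = 5  # stand-in for the 200000 entries in the question

def process_batch(state, batch):
    """Mimics carrying state across micro-batches: append the new rows,
    let the buffer trim itself to the last WINDOW_SIZE entries, and
    recompute the metrics over the whole window."""
    state.extend(batch)
    return {"min": min(state), "max": max(state), "count": len(state)}

# The "internal state" kept between batches: a bounded buffer that
# automatically discards the oldest entries.
state = deque(maxlen=WINDOW_SIZE)
print(process_batch(state, [3, 7, 1]))   # window is [3, 7, 1]
print(process_batch(state, [9, 2, 8]))   # window is the last 5: [7, 1, 9, 2, 8]
```

In real flatMapGroupsWithState code the deque would live in the GroupState object and be updated per key on each trigger, which gives you the per-micro-batch update the built-in window() function does not.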
We plan to implement a Spark Structured Streaming application which will consume a continuous flow of data: evolution of a metric value over time.
This streaming application will work with a window size of 7 days (a sliding window) in order to frequently calculate the average of the metric value over the last 7 days.
1- Will Spark retain all 7 days of data (significantly impacting memory consumption), OR does Spark continuously calculate and update the requested average (and then discard the handled data), so that memory consumption is not impacted as much (not retaining 7 days of data)?
2- If the answer to the first question is that those 7 days of data are retained, does the use of a watermark prevent this retention?
Let's say that we have a watermark of 1 hour: will only 1 hour of data be retained in Spark, OR are 7 days still retained in Spark memory, with the watermark only there to ignore incoming data whose timestamp is older than 1 hour?
A window size of 7 days is definitely a significant one, but it also depends on the volume of streaming data/records coming in. The trick lies in how you use the window duration, update interval, output mode and, if necessary, the watermark (if the business rule is not impacted).
1- If the streaming is configured with a tumbling window (i.e. the window duration is the same as the update duration) in complete mode, you may end up keeping the full 7 days of data in memory. However, if you configure the window duration to be 7 days with an update every x minutes, aggregates will be calculated every x minutes and only the result data will be kept in memory. Hence look at the window API parameters and configure them to get the results you need.
2- A watermark brings a different behaviour: it ignores records older than the watermark duration and updates the result tables after every micro-batch that crosses the watermark time. If your business rule can tolerate the watermark calculation, it is fine to use it too.
It is good to go through the API in detail, along with the output modes and watermark usage, in the Structured Streaming programming guide.
This would help you choose the right combination.
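The shape of what the state store actually keeps for a windowed average can be sketched in plain Python (illustrative only: hourly buckets instead of 7-day windows, a 1-hour watermark as in the question, and a simplified eviction rule). The point is that only per-window running aggregates (sum, count) are stored, never the raw events, and the watermark lets finished windows be evicted:

```python
from datetime import datetime, timedelta

class WindowedAverage:
    """Keeps only (sum, count) per window, not the raw events, and evicts
    a window once the watermark has passed its end time."""

    def __init__(self, window=timedelta(hours=1), delay=timedelta(hours=1)):
        self.window, self.delay = window, delay
        self.state = {}                     # window start -> (sum, count)
        self.max_event_time = datetime.min

    def ingest(self, ts, value):
        start = ts.replace(minute=0, second=0, microsecond=0)  # hourly bucket
        s, n = self.state.get(start, (0.0, 0))
        self.state[start] = (s + value, n + 1)
        self.max_event_time = max(self.max_event_time, ts)
        watermark = self.max_event_time - self.delay
        # A window whose end is before the watermark can no longer receive
        # late data: emit its final average and free the state.
        results = {}
        for w in [w for w in self.state if w + self.window < watermark]:
            s, n = self.state.pop(w)
            results[w] = s / n
        return results

wa = WindowedAverage()
wa.ingest(datetime(2024, 1, 1, 14, 10), 20.0)
wa.ingest(datetime(2024, 1, 1, 14, 20), 22.0)
# An event at 17:30 moves the watermark to 16:30, closing the 14:00 window.
print(wa.ingest(datetime(2024, 1, 1, 17, 30), 30.0))  # emits 14:00 -> 21.0
```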
In the documentation on https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking, an example is shown using a window of 10 minutes, using a watermark of 10 minutes and a trigger of 5 minutes.
In the diagram, when using the APPEND mode, the first results from the 12:00:00->12:10:00 window are only shown at 12:25:00. The reason is that at that time the watermark is at 12:11:00, so all windows before 12:11:00 can already be sent to the sink.
However, at 12:20:00 we already know the watermark is 12:11:00. So why isn't the first window sent at 12:20:00 instead of 12:25:00?
Because Spark applies a global watermark instead of a per-partition watermark: the watermark for the next batch is decided only when the tasks in the current batch finish. A single partition cannot decide the watermark on its own: it only knows about the events in its own partition.
So at 12:20:00, Spark receives the 12:21:00 event and processes it; at the end of the batch, Spark collects the event timestamps, determines the max timestamp, and decides the watermark for the next batch, 12:11:00, which will be the watermark in force for the 12:25:00 batch.
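This one-batch delay can be sketched in a few lines of plain Python (using the 10-minute watermark delay from the documentation example):

```python
from datetime import datetime, timedelta

DELAY = timedelta(minutes=10)

def next_watermark(current, batch_max_event_time):
    """The watermark is only recomputed when a batch finishes, so the value
    derived from this batch applies to the NEXT batch, not this one."""
    return max(current, batch_max_event_time - DELAY)

wm = datetime.min
# The batch triggered at 12:20 processes an event stamped 12:21 ...
wm_in_force_at_1220 = wm                  # ... under the OLD watermark, < 12:10
wm = next_watermark(wm, datetime(2024, 1, 1, 12, 21))
# During the 12:25 batch the watermark in force is 12:11, so the
# 12:00-12:10 window is finally past the watermark and gets appended.
wm_in_force_at_1225 = wm
print(wm_in_force_at_1225)  # 2024-01-01 12:11:00
```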
I want to use Spark Streaming to process events from Kafka, and I want to set the window width and slide in terms of the number of messages instead of time. Is this possible? I didn't see anything obvious in the API for this; I only saw time-based window options.
That's not possible. You can only set window duration.
This gets near enough to what I would want:
[Output an event based on a number of messages within a specific time period]
https://damieng.com/blog/2015/06/27/time-window-events-with-apache-spark-streaming
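The count-based behaviour itself is simple to express outside Spark; since window() only takes durations, you would have to implement something like this yourself (for example around foreachBatch or in custom state). A minimal sketch, with illustrative width and slide values:

```python
from collections import deque

def count_windows(stream, width, slide):
    """Yield a window every `slide` messages, each covering the last `width`
    messages: the count-based analogue of a time-based sliding window."""
    buf = deque(maxlen=width)   # automatically keeps only the newest `width`
    for i, msg in enumerate(stream, start=1):
        buf.append(msg)
        if i % slide == 0 and len(buf) == width:
            yield list(buf)

# Messages 0..9, windows of 4 messages sliding by 2 messages.
for w in count_windows(range(10), width=4, slide=2):
    print(w)   # [0,1,2,3], [2,3,4,5], [4,5,6,7], [6,7,8,9]
```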