Hazelcast Jet sliding window unit of measurement

Sorry if this is a silly question, but it is unclear from the docs what the unit of measurement for a sliding window is. Is it milliseconds, seconds, or the number of items in the stream?
I've noticed that the aggregation operation was producing empty results, which I had to filter out explicitly, probably because there was no data available for that window, so I guess the last option is not the case.

Jet doesn't specify a unit for windows; instead, windows are calculated in the same unit your timestamps are expressed in. Typically, if your timestamps are UNIX-style timestamps, that unit would be milliseconds, but it could also be nanoseconds, seconds, or minutes, depending on how your timestamps are defined. Windows refer specifically to event time; they are not related to the number of events in the stream, only to their timestamps.
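For illustration, here is a minimal sketch using the standalone Jet 4.x pipeline API and its built-in test source, whose SimpleEvent timestamps are epoch milliseconds, so the window numbers below are milliseconds too; the source, sizes, sink, and class name are placeholders, not a recommendation:

import com.hazelcast.jet.Jet;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.WindowDefinition;
import com.hazelcast.jet.pipeline.test.TestSources;

public class SlidingWindowUnits {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.readFrom(TestSources.itemStream(10))              // test source emitting epoch-millisecond timestamps
         .withTimestamps(event -> event.timestamp(), 1_000) // the timestamp function defines the unit; allow 1,000 ms of lag
         .window(WindowDefinition.sliding(60_000, 1_000))   // 60,000 "units" (here: ms) window, sliding by 1,000
         .aggregate(AggregateOperations.counting())
         .writeTo(Sinks.logger());

        Jet.newJetInstance().newJob(p);                     // streaming job; runs until cancelled
    }
}

If the timestamps were instead seconds since the epoch, sliding(60, 1) would mean a 60-second window sliding by one second.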

Related

Single Sliding Window in Spark Structured Streaming

I am trying to implement a system which measures something based on data from the last n hours. Here is a simple example:
I am constantly getting new messages in the following format: {"temp": <temp>, createdAt: <timestamp>}. I want to calculate the average temperature for the last hour, so if the current time is 15:10 I want to consider only those records whose createdAt field is set to a time after 14:10. Others should be dropped.
Is there any way I could do this cleanly in Spark Structured Streaming and keep only one window? I was looking at the sliding windows feature but that would result in multiple windows, which I don't really need.
I was looking at the sliding windows feature but that would result in multiple windows, which I don't really need.
If you don't want overlapping windows, then what you are looking for is a tumbling window. From the code perspective, you use the same syntax as for a sliding window, but you leave the slideDuration param at its default of None.
PySpark docs on this topic - https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.window.html#pyspark-sql-functions-window
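As a hedged sketch of the same idea using the Java Dataset API (the PySpark window() call in the docs above is analogous), assuming a streaming Dataset<Row> named messages with temp and createdAt columns; the 10-minute watermark and the avgTemp alias are only illustrative:

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// 'messages' is an assumed streaming Dataset<Row> with columns temp and createdAt (timestamp type).
Dataset<Row> hourlyAvg = messages
        .withWatermark("createdAt", "10 minutes")      // tolerate some late data (illustrative value)
        .groupBy(window(col("createdAt"), "1 hour"))   // no slideDuration given, so the window tumbles
        .agg(avg("temp").alias("avgTemp"));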

spark streaming understanding timeout setup in mapGroupsWithState

I am trying very hard to understand the timeout setup when using mapGroupsWithState for Spark Structured Streaming.
The link below has a very detailed specification, but I am not sure I understood it properly, especially the GroupState.setTimeoutTimestamp() option, i.e. setting up the state expiry to be related to the event time.
https://spark.apache.org/docs/3.0.0-preview/api/scala/org/apache/spark/sql/streaming/GroupState.html
I copied them out here:
With EventTimeTimeout, the user also has to specify the event-time watermark in the query using Dataset.withWatermark().
With this setting, data that is older than the watermark are filtered out.
The timeout can be set for a group by setting a timeout timestamp using GroupState.setTimeoutTimestamp(), and the timeout would occur when the watermark advances beyond the set timestamp.
You can control the timeout delay by two parameters - watermark delay and an additional duration beyond the timestamp in the event (which is guaranteed to be newer than watermark due to the filtering).
Guarantees provided by this timeout are as follows:
Timeout will never occur before the watermark has exceeded the set timeout.
Similar to processing-time timeouts, there is no strict upper bound on the delay before the timeout actually occurs. The watermark can advance only when there is data in the stream and the event time of the data has actually advanced.
question 1:
What is the timestamp in the sentence "and the timeout would occur when the watermark advances beyond the set timestamp"? Is it an absolute time, or a duration relative to the current event time in the state? I know I could also expire the state by removing it explicitly.
E.g. say I have some state data like the one below; when will it expire, by setting what value in which setting?
+-------+-----------+-------------------+
|expired|something | timestamp|
+-------+-----------+-------------------+
| false| someKey |2020-08-02 22:02:00|
+-------+-----------+-------------------+
question 2:
Reading the sentence "data that is older than the watermark are filtered out", I understand that late-arriving data is ignored after it is read from Kafka; is this correct?
question reason
Without understanding these, I cannot really apply them to use cases, i.e. when to use GroupState.setTimeoutDuration() and when to use GroupState.setTimeoutTimestamp().
Thanks a lot.
P.S. I also tried to read the following:
- https://www.waitingforcode.com/apache-spark-structured-streaming/stateful-transformations-mapgroupswithstate/read
(this confused me; I did not understand it)
- https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
(it did not say much about what I was interested in)
What is the timestamp in the sentence "and the timeout would occur when the watermark advances beyond the set timestamp"?
This is the timestamp you set by GroupState.setTimeoutTimestamp().
Is it an absolute time, or a duration relative to the current event time in the state?
It is a point in time (not a duration); in this answer it is set relative to the current watermark of the batch, as shown below.
Say I have some state data (column timestamp = 2020-08-02 22:02:00); when will it expire, by setting what value in which setting?
Let's assume your sink query has a defined processing trigger (set by trigger()) of 5 minutes. Also, let us assume that you have used a watermark before applying groupByKey and mapGroupsWithState. I understand you want to use timeouts based on event times (as opposed to processing times), so your query will look like:
ds.withWatermark("timestamp", "10 minutes")
  .groupByKey(...) // declare your key
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(
    ...) // your custom update logic
Now, it depends on how you set the timeout timestamp within your "custom update logic". Somewhere in your custom update logic you will need to call
state.setTimeoutTimestamp()
This method has four different signatures and it is worth scanning through their documentation. As we have set a watermark (with withWatermark) we can actually make use of that time. As a general rule: it is important to set the timeout timestamp (set by state.setTimeoutTimestamp()) to a value larger than the current watermark. To continue with our example, we add one hour, as shown below:
state.setTimeoutTimestamp(state.getCurrentWatermarkMs, "1 hour")
To conclude, your message can arrive in your stream between 22:00:00 and 22:15:00, and if that message was the last one for the key it will time out by 23:15:00 in your GroupState.
question 2: Reading the sentence "data that is older than the watermark are filtered out", I understand that late-arriving data is ignored after it is read from Kafka; is this correct?
Yes, this is correct. For the batch interval 22:00:00 - 22:05:00, all messages that have an event time (defined by the column timestamp) but arrive later than the declared 10-minute watermark allows (meaning later than 22:15:00) will be ignored by your query and will not be processed within your "custom update logic".
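To make the setTimeoutTimestamp() call more concrete, here is a rough sketch of what the custom update logic could look like, written against the Java API of mapGroupsWithState; the Reading input type, the SensorState state type, and their methods are made-up placeholders:

import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
import org.apache.spark.sql.streaming.GroupState;

// Placeholder types: Reading is the grouped input record, SensorState is the per-key state.
MapGroupsWithStateFunction<String, Reading, SensorState, String> update =
    (key, readings, state) -> {
        if (state.hasTimedOut()) {
            // The watermark has advanced past the timestamp set below; emit and drop the state.
            state.remove();
            return key + ": expired";
        }
        SensorState s = state.exists() ? state.get() : new SensorState();
        while (readings.hasNext()) {
            s.add(readings.next()); // placeholder accumulation logic
        }
        state.update(s);
        // Absolute event-time deadline: the current watermark plus one extra hour.
        state.setTimeoutTimestamp(state.getCurrentWatermarkMs(), "1 hour");
        return key + ": updated";
    };

This function would then be passed to mapGroupsWithState together with the state and output encoders and GroupStateTimeout.EventTimeTimeout().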

Using Timer for batch operations

I'm new to using Micrometer and am trying to see if there's a way to use a Timer that would also include a count of the number of items in a batch processing scenario. Since I'm processing the batch with Java streams, I didn't see an obvious way to record the timer for each item processed, so I was looking for a way to set a batch size attribute. One way I think that could work is to use the FunctionTimer from https://micrometer.io/docs/concepts#_function_tracking_timers, but I believe that requires the app to maintain a persistent monotonically increasing set of values for the total count and total time.
Is there a simpler way this can be done? Ultimately this data will be fed to New Relic. I've also tried setting tags for the batch size, but those seem to be reported as strings so I can't do any type of aggregation on the values.
Thanks!
A timer is intended for measuring an action and at a minimum results in two measurements: a count and a duration.
So a timer will work perfectly for your batch processing. In the Java stream, a peek operation might be a good place to put a timer.
If you were about to process 20 elements and were only measuring the time for all 20 elements together, you would need to create a new Counter for measuring the batch size. You could then divide the timer's total duration by your counter to get a per-item duration, or divide it by the timer's total count to get a per-batch duration.
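Here is a rough Java sketch of both options: per-item timing from peek(), and whole-batch timing plus a separate Counter for the batch size. The metric names, the process() method, and the class are placeholders, and the SimpleMeterRegistry would be replaced by the New Relic registry in production:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.List;
import java.util.concurrent.TimeUnit;

public class BatchTimingSketch {

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();      // swap for your New Relic registry
        Timer itemTimer = registry.timer("batch.item.duration"); // metric names here are made up
        Counter batchSize = registry.counter("batch.size");

        List<String> batch = List.of("a", "b", "c");

        // Option 1: time each element from a peek(); the Timer then carries both a count and a total time.
        batch.stream()
             .peek(item -> itemTimer.record(() -> process(item)))
             .forEach(item -> { /* terminal operation just drives the stream */ });

        // Option 2: time the whole batch once and record its size with a separate Counter.
        Timer wholeBatchTimer = registry.timer("batch.duration");
        wholeBatchTimer.record(() -> batch.forEach(BatchTimingSketch::process));
        batchSize.increment(batch.size());

        // The per-item mean is available directly from the per-item timer:
        System.out.printf("%d items, mean %.2f ms%n",
                itemTimer.count(), itemTimer.mean(TimeUnit.MILLISECONDS));
    }

    private static void process(String item) {
        // placeholder for the real per-item work
    }
}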
Feel free to add code snippets if you would like feedback for those.

Predefined (and large) windows? Any stream processing frameworks support this?

All the examples I see of windowing involve defining the windows. E.g., tumbling 1-minute windows, or sliding 1-minute windows, etc. In my situation, all my data has timestamped events, but that's not the primary interest.
All my data also has an associated period that I do not have control over. That is the desired window in my case. The periods are time-based, but their length varies, roughly between 2 and 3 weeks.
So, if I look at just the period, a stream of values might look like this (almost everything from the current period, with a few stragglers from the last period early on in the current period):
... PERIOD 6, PERIOD 5, PERIOD 6, PERIOD 6, PERIOD 6, PERIOD 6, ...
It's not clear to me how to handle this situation in terms of watermarks/triggers/etc. If I'm understanding the terminology correctly, I've thought of something like this: the watermark for PERIOD N occurs when the first event with PERIOD N+1 is processed. The lateness horizon (for garbage-collecting state) for the PERIOD N window can be 1-2 days after the timestamp of the first event with PERIOD N+1. I'd like triggers to be accumulating and to fire every 5 minutes (ideally, I'd like this trigger interval to grow over time: more frequent at the beginning of the window, less frequent as time passes).
I'm trying to use terminology from this article (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102), so sorry if any of it is incorrect.
I'm particularly confused about how watermarks seem to be continuous and based on event time. In my case, I have both event time (the timestamp) and event time (the period). If I'm understanding this correctly, the watermark curve for my situation (as in the above article) would look like a step function?
I haven't yet picked a stream processing framework to use. Does my situation make sense for any of them? Does this require a lot of custom logic? Does any framework make this easier? Is this a known problem with a name?
Any help is appreciated.
In Flink, one way to achieve this is to use a processing-time window for the aggregation. You then use a rich map function to maintain the accumulated counts before the window. In the end, you sink the aggregates to long-term data storage.
You can take a look at my blog post, where we did something similar to this; see the section "A peek into Milestone Two".
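The answer above is high-level, so here is one possible reading of it as a sketch in the Flink Java DataStream API; the events stream, the Event type, and its getPeriod()/value fields are assumptions, and the 5-minute processing-time window is only an example flush interval:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// 'events' is an assumed DataStream<Event>; Event is a POJO with a period key and a numeric 'value' field.
DataStream<Event> partials = events
    .keyBy(Event::getPeriod)                                     // one logical group per period
    .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))   // flush a partial aggregate every 5 minutes
    .sum("value");                                               // partial sum of the 'value' field
// Downstream, accumulate the partial results per period (e.g. in a stateful rich function
// or directly in the long-term store), as the answer describes.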

Spark streaming - waiting for data for window aggregations?

I have data in the format { host | metric | value | time-stamp }. We have hosts all around the world reporting metrics.
I'm a little confused about using window operations (say, 1 hour) to process data like this.
Can I tell my window when to start, or does it just start when the application starts? I want to ensure I'm aggregating all data from hour 11 of the day, for example. If my window starts at 10:50, I'll just get 10:50-11:50 and miss 10 minutes.
Even if the window is perfect, data may arrive late.
How do people handle this kind of issue? Do they make windows far bigger than needed and just grab the data they care about on every batch cycle (kind of sliding)?
In the past, I worked on a large-scale IoT platform and solved that problem by treating the windows as only partial calculations. I modeled the backend (Cassandra) to receive more than one record for each window. The actual value of any given window would be the sum of all (potentially partial) records found for that window.
So, a perfect window would be 1 record and a split window would be 2 records; late arrivals are naturally supported but only accepted up to a certain 'age' threshold. Reconciliation was done at read time. As this platform was orders of magnitude heavier in writes than reads, it made for a good compromise.
After speaking with people in depth on MapR forums, the consensus seems to be that hourly and daily aggregations should not be done in a stream, but rather in a separate batch job once the data is ready.
When doing streaming you should stick to small batches with windows that are relatively small multiples of the streaming interval. Sliding windows can be useful for, say, trends over the last 50 batches. Using them for tasks as large as an hour or a day doesn't seem sensible though.
Also, I don't believe you can tell your batches when to start/stop, etc.
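To make the "windows that are relatively small multiples of the streaming interval" suggestion concrete, here is a minimal sketch against the older Spark Streaming (DStream) Java API; the metrics stream, its Metric element type, and the 2-minute batch interval are assumptions:

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;

// 'metrics' is an assumed JavaDStream<Metric> created on a 2-minute batch interval.
// The window length and slide must be multiples of that interval; here the window covers
// the last 10 minutes and is recomputed every batch:
JavaDStream<Metric> recent = metrics.window(Durations.minutes(10), Durations.minutes(2));
recent.foreachRDD(rdd -> {
    long count = rdd.count(); // aggregate the last 10 minutes here (per-host reduces, averages, ...)
});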
