Tumbling window with dynamic duration - Azure

While passing a reference data field as the duration in TumblingWindow, I get a compile-time error saying the window duration must be a positive float constant.
Can anyone please guide me?
group by TumblingWindow(minute, referencetable.EntryTime)

At the moment we don't support variable time windows, so you need to set the time explicitly and not load it from the reference data. Sorry for the inconvenience.
A workaround, in case you have only a few different time durations, would be to have different steps/subqueries for the different durations and use a WHERE clause to decide whether or not that step produces an output (see the sketch below).
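In case it helps, a rough sketch of that workaround could look like the following. It assumes two supported durations (30 and 60 seconds); the input, reference data, column, and output names are all made up for illustration:

WITH Agg30s AS (
    SELECT i.DeviceId, COUNT(*) AS EventCount
    FROM Input i
    JOIN RefConfig c ON i.DeviceId = c.DeviceId   -- reference data join (assumed names)
    WHERE c.WindowSeconds = 30                    -- only emit when this duration is configured
    GROUP BY i.DeviceId, TumblingWindow(second, 30)
),
Agg60s AS (
    SELECT i.DeviceId, COUNT(*) AS EventCount
    FROM Input i
    JOIN RefConfig c ON i.DeviceId = c.DeviceId
    WHERE c.WindowSeconds = 60
    GROUP BY i.DeviceId, TumblingWindow(second, 60)
)
SELECT * INTO Output30s FROM Agg30s
SELECT * INTO Output60s FROM Agg60s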
Let me know if you have further questions.
JS (from the Azure Stream Analytics team)

Related

Set Azure Batch MaxWallClockTime Node SDK

I'm trying to set a maxWallClockTime of 72 hours using the ISO 8601 duration format. The documentation for this property is useless, so I'm basing my guess of using the 8601 format on that being the way the same property is set at the Batch job level when using the CLI. My constraints object is as follows:
const taskConstraints = {
maxWallClockTime: 'P3D' //ISO 8601 Duration Format e.g. P3Y6M4DT12H30M5S represents a duration of three years, six months, four days, twelve hours, thirty minutes, and five seconds.
};
However, this results in the following error:
task.constraints.maxWallClockTime must be a TimeSpan/Duration.
I cannot find any examples that set this property and use the JavaScript API; any pointers to better documentation or example code would be greatly appreciated.
Agreed, the docs are lacking here. I haven't tested this out locally yet, but from looking at the code I believe the answer depends on whether you are using the older Node.js-specific azure-batch package or the newer @azure/batch package, which also runs in web browsers.
For the "azure-batch" package, it looks like it takes a Moment.js duration object. Here's the related JSDoc string:
* @property {moment.duration} [maxWallClockTime] The maximum elapsed time
* that the Task may run, measured from the time the Task starts. If the Task
* does not complete within the time limit, the Batch service terminates it.
* If this is not specified, there is no time limit on how long the Task may
* run.
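Based on that, a minimal sketch for the older azure-batch package might look like this (untested; it assumes the moment package is installed and that a plain Moment.js duration object is accepted here):

const moment = require('moment');

// Pass a Moment.js duration instead of an ISO 8601 string.
const taskConstraints = {
  maxWallClockTime: moment.duration(72, 'hours')
};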
For the newer "@azure/batch" package, it should take an ISO-8601 duration string. If you're using that package then the value you're trying to use looks right to me, and maybe it's a bug (I'd have to try to repro it).

Tracking a counter value in Application Insights

I'm trying to use Application Insights to keep track of a counter of the number of active streams in my application. I have two goals:
Show the current (or at least recent) number of active streams in a dashboard
Activate a kind of warning if the number exceeds a certain limit.
These streams can be quite long-lived, and sometimes brief. So the number can sometimes change, say, 100 times a second, and sometimes remain unchanged for many hours.
I have been trying to track this active streams count as an application insights metric.
I'm incrementing a counter in my application when a new stream opens, and decrementing it when one closes. On each change I use the telemetry client, something like this:
var myMetric = myTelemetryClient.GetMetric("Metricname");
myMetric.TrackValue(myCount);
When I query my metric values with Kusto, I see that because of these clusters of activity within a 10-second period, my metric values get aggregated. For the purposes of my alarm I can live with that, as I can look at the max value of the aggregate. But I can't present a dashboard of the number of active streams, as I have no way of knowing the number of active streams between my measurement points. I know the min, max, and average of the aggregate period, but I don't know the last value, and since it can be somewhere between 0 and 1000, it's no help.
Since the solution I have doesn't serve my needs, I thought of a couple of changes:
Adding a scheduled pump to my counter component, which will send the current counter value once every, say, 5 minutes. But I don't like that I then have to add a thread for each of these counters.
Adding a timer to send the current value once, 5 minutes after the last change, with the countdown reset each time the counter changes. This has the same problem as above, and does an excessive amount of work resetting the timer when the counter could be changing thousands of times a second.
In the end, I don't think my needs are all that exotic, so I wonder if I'm using app insights incorrectly.
Is there some way I can change the metric's behavior to suit my purposes? I appreciate that it's pre-aggregating before sending data in order to reduce ingest costs, but it's preventing me from solving a simple problem.
Is a metric even the right way to do this? Are there alternative approaches within app insights?
You can use TrackMetric instead of the GetMetric ceremony to track individual values without aggregation. From the docs:
Microsoft.ApplicationInsights.TelemetryClient.TrackMetric is not the preferred method for sending metrics. Metrics should always be pre-aggregated across a time period before being sent. Use one of the GetMetric(..) overloads to get a metric object for accessing SDK pre-aggregation capabilities. If you are implementing your own pre-aggregation logic, you can use the TrackMetric() method to send the resulting aggregates.
But you can also use events as described next:
If your application requires sending a separate telemetry item at every occasion without aggregation across time, you likely have a use case for event telemetry; see TelemetryClient.TrackEvent (Microsoft.ApplicationInsights.DataContracts.EventTelemetry).
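As a rough illustration (not from the docs; the metric and event names are made up, and myCount stands in for your counter), sending individual values could look like this:

using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

var client = new TelemetryClient(TelemetryConfiguration.CreateDefault());
double myCount = 42; // the current counter value from your application

// Option 1: send the raw value with no SDK-side pre-aggregation.
client.TrackMetric(new MetricTelemetry("ActiveStreams", myCount));

// Option 2: send an event that carries the count as a measurement.
var evt = new EventTelemetry("ActiveStreamsChanged");
evt.Metrics["ActiveStreams"] = myCount;
client.TrackEvent(evt);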

spark streaming understanding timeout setup in mapGroupsWithState

I am trying very hard to understand the timeout setup when using mapGroupsWithState for Spark Structured Streaming.
The link below has a very detailed specification, but I am not sure I understood it properly, especially the GroupState.setTimeoutTimestamp() option, i.e. setting the state expiry to be related to the event time.
https://spark.apache.org/docs/3.0.0-preview/api/scala/org/apache/spark/sql/streaming/GroupState.html
I copied them out here:
With EventTimeTimeout, the user also has to specify the event time watermark in the query using Dataset.withWatermark().
With this setting, data that is older than the watermark are filtered out.
The timeout can be set for a group by setting a timeout timestamp using GroupState.setTimeoutTimestamp(), and the timeout would occur when the watermark advances beyond the set timestamp.
You can control the timeout delay by two parameters - watermark delay and an additional duration beyond the timestamp in the event (which is guaranteed to be newer than watermark due to the filtering).
Guarantees provided by this timeout are as follows:
Timeout will never occur before the watermark has exceeded the set timeout.
Similar to processing-time timeouts, there is no strict upper bound on the delay when the timeout actually occurs. The watermark can advance only when there is data in the stream, and the event time of the data has actually advanced.
question 1:
What is "this timestamp" in the sentence "the timeout would occur when the watermark advances beyond the set timestamp"? Is it an absolute time, or a duration relative to the current event time in the state? I know I could expire the state by removing it.
E.g. say I have some state data like below; when will it expire, by setting what value in which settings?
+-------+-----------+-------------------+
|expired|something | timestamp|
+-------+-----------+-------------------+
| false| someKey |2020-08-02 22:02:00|
+-------+-----------+-------------------+
question 2:
Reading the sentence "data that is older than the watermark are filtered out", I understand that late-arriving data is ignored after it is read from Kafka; is this correct?
question reason
Without understanding these, I cannot really apply them to use cases, i.e. when to use GroupState.setTimeoutDuration() and when to use GroupState.setTimeoutTimestamp().
Thanks a lot.
P.S. I also tried to read the links below:
- https://www.waitingforcode.com/apache-spark-structured-streaming/stateful-transformations-mapgroupswithstate/read
(it confused me; I did not understand it)
- https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
(did not say much about what I was interested in)
What is this timestamp in the sentence "the timeout would occur when the watermark advances beyond the set timestamp"?
This is the timestamp you set by GroupState.setTimeoutTimestamp().
Is it an absolute time, or a duration relative to the current event time in the state?
This is a relative time (not duration) based on the current batch window.
say I have some data state (column timestamp=2020-08-02 22:02:00), when will it expire by setting up what value in what settings?
Let's assume your sink query has a defined processing trigger (set by trigger()) of 5 minutes. Also, let us assume that you have used a watermark before applying groupByKey and mapGroupsWithState. I understand you want to use timeouts based on event time (as opposed to processing time), so your query will look like:
ds.withWatermark("timestamp", "10 minutes")
  .groupByKey(...) // declare your key
  .mapGroupsWithState(
    GroupStateTimeout.EventTimeTimeout)(
    ...) // your custom update logic
Now, it depends on how you set the timeout timestamp within your "custom update logic". Somewhere in your custom update logic you will need to call
state.setTimeoutTimestamp()
This method has four different signatures and it is worth scanning through their documentation. As we have set a watermark (withWatermark), we can actually make use of that time. As a general rule: it is important to set the timeout timestamp (via state.setTimeoutTimestamp()) to a value larger than the current watermark. To continue with our example, we add one hour as shown below:
state.setTimeoutTimestamp(state.getCurrentWatermarkMs, "1 hour")
To conclude, your message can arrive into your stream between 22:00:00 and 22:15:00, and if that message was the last one for the key, it will time out by 23:15:00 in your GroupState.
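For completeness, here is a rough sketch of what such a "custom update logic" could look like; the Event and MyState classes, the column names, and the one-hour expiry are just assumptions for illustration:

import org.apache.spark.sql.streaming.GroupState

case class Event(key: String, timestamp: java.sql.Timestamp)
case class MyState(count: Long)

def updateState(key: String, events: Iterator[Event], state: GroupState[MyState]): Long = {
  if (state.hasTimedOut) {
    // The watermark has passed the timeout timestamp set below, so drop the state.
    state.remove()
    0L
  } else {
    val updated = MyState(state.getOption.map(_.count).getOrElse(0L) + events.size)
    state.update(updated)
    // Expire this key one hour after the current watermark.
    state.setTimeoutTimestamp(state.getCurrentWatermarkMs, "1 hour")
    updated.count
  }
}

You would then pass this function as the "custom update logic" in the mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(...) call shown above.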
question 2: Reading the sentence "data that is older than the watermark are filtered out", I understand that late-arriving data is ignored after it is read from Kafka; is this correct?
Yes, this is correct. For the batch interval 22:00:00 - 22:05:00, all messages that have an event time (defined by the column timestamp) but arrive later than the declared watermark of 10 minutes (meaning later than 22:15:00) will be ignored in your query and are not going to be processed within your "custom update logic".

Azure Stream Analytics Window based on variable time

I know I can create Stream Analytics windows as follows:
TumblingWindow(second, 30)
This would make fixed windows every 30 seconds.
Is it possible to make the 30 seconds dynamic? This would mean we get multiple windows through each other, all on different time schedules.
I'm experimenting with reference input files, and I would like to get the number of seconds from the reference file rather than having it fixed in the query.
If I create the Window with input from a reference file, I get the error:
Error : Invalid window duration: 'timespanInSeconds'. Window duration must be a positive float constant.
Even though it seems to be a valid JSON number. Is what I'm trying to do even possible?
Something in the docs that I've found:
https://msdn.microsoft.com/en-us/azure/stream-analytics/reference/tumbling-window-azure-stream-analytics
It states:
A big integer which describes the size of the window. The windowsize is static and cannot be changed dynamically at runtime.

Dealing with a daily time window across timezones in Node.js

Currently, I'm working on a project that requires selecting a window of time within which an event may be triggered. This window is selected by the user as a start time (24-hour time), an end time (24-hour time), and a timezone. My goal is then to be able to convert these times into UTC based on the offset of the provided timezone and save them into MySQL.
The main problem is that I have set up the entire flow to deal with time-only data types from the mobile app all the way back to the MySQL database. I have been trying to figure out a solution that won't require changing all those data types to include both date and time, which would require changes in many parts of the project.
Can I make this calculation without dealing with the date? I don't believe I can, as timezone offsets range from -12:00 to +14:00, which would push some windows to the next or previous day when converted to UTC.
Is the correct approach to add in the date component and then continue to update it as time progresses? I also want to ensure daylight saving time doesn't create errors (a small illustration of that issue is below).
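For example (a hypothetical illustration using Luxon, purely to show why a date is needed), the same wall-clock time in the same zone maps to different UTC times depending on the date because of daylight saving time:

const { DateTime } = require('luxon');

// 09:00 local time in America/New_York on two different dates:
const winter = DateTime.fromISO('2021-01-15T09:00', { zone: 'America/New_York' });
const summer = DateTime.fromISO('2021-07-15T09:00', { zone: 'America/New_York' });

console.log(winter.toUTC().toISO()); // 2021-01-15T14:00:00.000Z (EST, UTC-5)
console.log(summer.toUTC().toISO()); // 2021-07-15T13:00:00.000Z (EDT, UTC-4)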
Ultimately, I would like to take the best approach, so if I have to change a lot now I'd rather do that than deal with a headache later. Any thoughts would be greatly appreciated!
