We are using the Spring Integration 4.2.3 aggregator component with a group-timeout defined, and we expect the group to time out after the given value while messages are still being added to the group and the release-size criterion has not been met.
However, we are seeing different results: under heavy load the aggregator waits for all messages to be added to the group rather than expiring the group when the timeout is reached.
Is there any way to override the aggregator behaviour so that the timeout is based on the first message in the group rather than the last one?
Well, actually you can do what you need even now, using the same group-timeout-expression. But you have to consult the #root object of the evaluation context, which is exactly what you need - the MessageGroup. On it you can call one of these methods for your purpose:
/**
 * @return the timestamp (milliseconds since epoch) associated with the creation of this group
 */
long getTimestamp();

/**
 * @return the timestamp (milliseconds since epoch) associated with the time this group was last updated
 */
long getLastModified();
Therefore, an expression for your original request might look like:
group-timeout-expression="timestamp + 10000 - T(System).currentTimeMillis()"
And we get an adjusted timeout, which is applied to the scheduled task as new Date(System.currentTimeMillis() + groupTimeout).
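For reference, here is a minimal sketch of how that expression can sit in the XML configuration; the channel names are placeholders, and send-partial-result-on-expiry="true" is included so the expired group releases whatever it has collected instead of being discarded:

<int:aggregator input-channel="inputChannel"
                output-channel="outputChannel"
                send-partial-result-on-expiry="true"
                group-timeout-expression="timestamp + 10000 - T(System).currentTimeMillis()"/>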
No; the timeout is currently based on the arrival of the last message only.
If you use a MessageGroupStoreReaper instead, the time is based on the group creation by default, but that can be changed by setting the group store's timeoutOnIdle to true.
If your group is not timing out at all, perhaps the thread pool in the default taskScheduler is exhausted - it only has 10 threads by default.
You can increase the pool size or inject a dedicated scheduler into the aggregator.
We debugged the issue with your group-timeout-expression (timestamp + 20000 - T(System).currentTimeMillis()) and found that the expression evaluates to a negative value once messages keep flowing in, which causes the group to never be released.
The code block where the issue occurs is in AbstractCorrelatingMessageHandler.java.
Once we removed the "groupTimeout >= 0" condition, the group now expires via the else block, and the code behaves the way we expected.
Could you let me know why you are not forcing the group to time out when the computed timeout becomes negative?
I'm trying to use Application Insights to keep track of the number of active streams in my application. I have two goals:
Show the current (or at least recent) number of active streams in a dashboard
Activate a kind of warning if the number exceeds a certain limit.
These streams can be quite long lived, and sometimes brief. So the number can sometimes change say 100 times a second, and sometimes remain unchanged for many hours.
I have been trying to track this active streams count as an application insights metric.
I'm incrementing a counter in my application when a new stream opens, and decrementing it when one closes. On each change I use the telemetry client, something like this:
var myMetric = myTelemetryClient.GetMetric("Metricname");
myMetric.TrackValue(myCount);
When I query my metric values with Kusto, I see that because of these clusters of activity within a 10-second period, my metric values get aggregated. For the purposes of my alarm I can live with that, as I can look at the max value of the aggregate. But I can't present a dashboard of the number of active streams, because I have no way of knowing the number of active streams between my measurement points. I know the min, max and average of each aggregate period, but I don't know its last value, and since that can be anywhere between 0 and 1000, it's no help.
Since the solution I have doesn't serve my needs, I thought of a couple of changes:
Adding a scheduled pump to my counter component, which will send the current counter value once every, say, 5 minutes. But I don't like that I then have to add a thread for each of these counters.
Adding a timer to send the current value once, 5 minutes after the last change, where the countdown gets reset each time the counter changes. This has the same problem as above, and does an excessive amount of work resetting the timer when the counter could be changing thousands of times a second.
In the end, I don't think my needs are all that exotic, so I wonder if I'm using app insights incorrectly.
Is there some way I can change the metric's behavior to suit my purposes? I appreciate that it's pre-aggregating before sending data in order to reduce ingest costs, but it's preventing me from solving a simple problem.
Is a metric even the right way to do this? Are there alternative approaches within app insights?
You can use TrackMetric instead of the GetMetric ceremony to track individual values without aggregation. From the docs:
Microsoft.ApplicationInsights.TelemetryClient.TrackMetric is not the preferred method for sending metrics. Metrics should always be pre-aggregated across a time period before being sent. Use one of the GetMetric(..) overloads to get a metric object for accessing SDK pre-aggregation capabilities. If you are implementing your own pre-aggregation logic, you can use the TrackMetric() method to send the resulting aggregates.
But you can also use events as described next:
If your application requires sending a separate telemetry item at every occasion without aggregation across time, you likely have a use case for event telemetry; see TelemetryClient.TrackEvent (Microsoft.ApplicationInsights.DataContracts.EventTelemetry).
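Purely as an illustration of the two options, here is a minimal sketch using the Application Insights Node.js SDK (the applicationinsights npm package); the connection string, metric name and event name are placeholders, and the .NET TrackMetric/TrackEvent calls are the direct analogues:

// Sketch: send raw metric values or discrete events instead of SDK pre-aggregation.
import * as appInsights from "applicationinsights";

appInsights.setup("<your-connection-string>").start(); // placeholder
const client = appInsights.defaultClient;

let activeStreams = 0;

export function onStreamOpened(): void {
    activeStreams++;
    // Unaggregated metric value (the TrackMetric route):
    client.trackMetric({ name: "ActiveStreams", value: activeStreams });
    // Or one telemetry item per occurrence (the TrackEvent route):
    client.trackEvent({ name: "StreamOpened", properties: { activeStreams: String(activeStreams) } });
}

export function onStreamClosed(): void {
    activeStreams--;
    client.trackMetric({ name: "ActiveStreams", value: activeStreams });
}

Sending the raw value on every change lets you pick the latest value per time bin on the query side rather than relying on the SDK's pre-aggregation.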
I am trying very hard to understand the timeout setup when using mapGroupsWithState in Spark Structured Streaming.
The link below has a very detailed specification, but I am not sure I understood it properly, especially the GroupState.setTimeoutTimestamp() option, i.e. when the state expiry is supposed to be related to the event time.
https://spark.apache.org/docs/3.0.0-preview/api/scala/org/apache/spark/sql/streaming/GroupState.html
I copied them out here:
With EventTimeTimeout, the user also has to specify the event time watermark in the query using Dataset.withWatermark().
With this setting, data that is older than the watermark are filtered out.
The timeout can be set for a group by setting a timeout timestamp using GroupState.setTimeoutTimestamp(), and the timeout would occur when the watermark advances beyond the set timestamp.
You can control the timeout delay by two parameters - watermark delay and an additional duration beyond the timestamp in the event (which is guaranteed to be newer than watermark due to the filtering).
Guarantees provided by this timeout are as follows:
Timeout will never occur before the watermark has exceeded the set timeout.
Similar to processing time timeouts, there is no strict upper bound on the delay before the timeout actually occurs. The watermark can advance only when there is data in the stream, and the event time of the data has actually advanced.
question 1:
What is the timestamp in the sentence "the timeout would occur when the watermark advances beyond the set timestamp"? Is it an absolute time, or a duration relative to the current event time in the state? I know I could expire the state by removing it explicitly.
E.g. say I have some state data like below; what value would I have to set, in which setting, and when would it then expire?
+-------+-----------+-------------------+
|expired|something | timestamp|
+-------+-----------+-------------------+
| false| someKey |2020-08-02 22:02:00|
+-------+-----------+-------------------+
question 2:
Reading the sentence "data that is older than the watermark are filtered out", I understand that late-arriving data is ignored after it is read from Kafka. Is this correct?
reason for the question
Without understanding these, I cannot really apply them to my use cases, i.e. when to use GroupState.setTimeoutDuration() and when to use GroupState.setTimeoutTimestamp().
Thanks a lot.
PS: I also tried to read the following:
- https://www.waitingforcode.com/apache-spark-structured-streaming/stateful-transformations-mapgroupswithstate/read
(confused me; I did not understand it)
- https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
(did not say much about the part I am interested in)
What is the timestamp in the sentence "the timeout would occur when the watermark advances beyond the set timestamp"?
This is the timestamp you set by GroupState.setTimeoutTimestamp().
is it an absolute time or is it a relative time duration to the current event time in the state?
This is a relative time (not duration) based on the current batch window.
say I have some data state (column timestamp=2020-08-02 22:02:00), when will it expire by setting up what value in what settings?
Let's assume your sink query has a defined processing trigger (set by trigger()) of 5 minutes. Also, let us assume that you have applied a watermark before the groupByKey and the mapGroupsWithState. I understand you want to use timeouts based on event time (as opposed to processing time), so your query will look like:
ds.withWatermark("timestamp", "10 minutes")
  .groupByKey(...) // declare your key
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(
    ...) // your custom update logic
Now, it depends on how you set the timeout timestamp within your "custom update logic". Somewhere in your custom update logic you will need to call
state.setTimeoutTimestamp()
This method has four different signatures and it is worth scanning through their documentation. As we have set a watermark (with withWatermark) we can actually make use of that time. As a general rule: it is important to set the timeout timestamp (via state.setTimeoutTimestamp()) to a value larger than the current watermark. To continue with our example, we add one hour as shown below:
state.setTimeoutTimestamp(state.getCurrentWatermarkMs, "1 hour")
To conclude: your message can arrive in your stream between 22:00:00 and 22:15:00, and if that message was the last one for its key, the GroupState will time out at around 23:15:00.
question 2: Reading the sentence "data that is older than the watermark are filtered out", I understand the late-arriving data is ignored after it is read from Kafka. Is this correct?
Yes, this is correct. For the batch interval 22:00:00 - 22:05:00, all messages that have an event time (defined by the timestamp column) in that interval but arrive later than the declared watermark of 10 minutes (meaning later than 22:15:00) will be ignored in your query and are not going to be processed within your "custom update logic".
I have a Node.js function that needs to be executed for each order in my application. In this function my app gets an order number from an Oracle database, processes the order, and then adds 1 to that number in the database (this needs to be the last thing in the function because an order can fail, in which case the number should not be used).
If all orders received at time T are processed at the same time (asynchronously), then the same order number will be used for multiple orders, and I don't want that.
So I used RabbitMQ to try to remedy this situation, since it is a queue. The processes seem to finish in the order they should, but a second process does NOT wait for the first one to finish (ack) before it begins, so in the end I have the same problem of the same order number being used multiple times.
Is there any way I can configure my queue to process one message at a time, i.e. to only start processing message n+1 once message n has been acknowledged?
This would be a life saver to me!
If the problem is to avoid duplicate order numbers, then use an Oracle sequence, or use an identity column when you insert into a table to generate the order number:
CREATE TABLE mytab (
    id   NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY(START WITH 1),
    data VARCHAR2(20));
INSERT INTO mytab (data) VALUES ('abc');
INSERT INTO mytab (data) VALUES ('def');
SELECT * FROM mytab;
This will give:
ID DATA
---------- --------------------
1 abc
2 def
If the problem is that you want orders to be processed sequentially, then don't pull an order from the queue until the previous one is finished. This will limit your throughput, so you need to understand your requirements and make some architectural decisions.
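In RabbitMQ terms that usually means a single consumer with a prefetch count of 1 and manual acknowledgements, so the broker will not deliver message n+1 until message n has been acked. A minimal sketch, assuming the amqplib package, a queue named "orders" and an existing processOrder function (all three are placeholders):

// Sketch: one consumer, prefetch(1), manual ack, so messages are handled strictly one at a time.
import * as amqp from "amqplib";

declare function processOrder(order: string): Promise<void>; // your existing order logic (placeholder)

async function consumeSequentially(): Promise<void> {
    const connection = await amqp.connect("amqp://localhost");
    const channel = await connection.createChannel();
    await channel.assertQueue("orders", { durable: true });

    // Deliver at most one unacknowledged message to this consumer at a time.
    await channel.prefetch(1);

    await channel.consume("orders", async (msg) => {
        if (msg === null) return;
        try {
            await processOrder(msg.content.toString());
            channel.ack(msg);               // only now will the broker deliver the next message
        } catch (err) {
            channel.nack(msg, false, true); // requeue on failure
        }
    }, { noAck: false });
}

Note that prefetch applies per consumer, so this only guarantees sequential processing if a single consumer is attached to the queue.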
Overall, it sounds like Oracle Advanced Queuing would be a good fit. See the node-oracledb documentation on AQ.
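For a rough idea of what AQ looks like from Node.js, here is a sketch with node-oracledb against a hypothetical RAW queue called ORDER_QUEUE (the queue name and credentials are placeholders):

// Sketch: enqueue and dequeue with node-oracledb Advanced Queuing (RAW payload).
import * as oracledb from "oracledb";

async function aqExample(): Promise<void> {
    const connection = await oracledb.getConnection({
        user: "app", password: "secret", connectString: "localhost/XEPDB1", // placeholders
    });
    const queue = await connection.getQueue("ORDER_QUEUE");

    await queue.enqOne("order-123");    // enqueue a RAW (string/Buffer) payload
    const msg = await queue.deqOne();   // waits for a message by default
    if (msg) {
        console.log(msg.payload.toString());
    }

    await connection.commit();          // AQ operations are transactional
    await connection.close();
}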
I have a Logic App that fetches data from an API endpoint. The API uses pagination with a limit of 50 objects per request and then provides a link to the next 50 objects, until all the objects have been returned; however, I have no idea how many objects there will be for each request. My flow is briefly described below:
First make an initial HTTP request against the endpoint
Parse the response HTTP body to be able to use the nextLink URL provided.
Until loop with the condition to run until nextLink is equal to null.
In the Until loop I have a Set Variable action that is set to a new URL for each request made, with a new pagination offset at the end of the URL: "&_offset=100"
The issue with the Until loop is that you can set limits for Count and Timeout, as you can see here. As I have no clue how many pages there will be, I expect this loop to run until the specified condition is met. However, I have tried specifying some different values, listed below:
Count = 1 - Resulted in just 1 run
Count = empty - Resulted in it running for an hour (approx 3300 loops), as specified by the Timeout value.
Count = 60 - Resulted in it running for 60 times
I have researched how many pages this specific request has, and it turns out to be 290. My expectation is that this Until loop will run until nextLink is equal to null, which will be after 290 loops. But I wonder if there is any possibility to specify a dynamic value for Count in the Until action?
I expect the Until action to run as many times as needed based on how many pages there are; at least that is what I suppose it should do, because if I need to specify how many times it should run, then this action is pretty useless. Hopefully someone in here has faced the same issue.
Best regards
As far as I know, the "Until" action requires us to define at least one limit to prevent endless loops.
For your problem, you can just define a count that is large enough to let your endpoint return all of the pages. If you want to specify a dynamic value for the count, you need to meet two conditions:
You have to be able to access the total number of pages (if your endpoint provides a URL to get it).
The count set in the "Until" action can only reference trigger inputs, trigger outputs and parameters.
According to the statement in your question, I guess you can't meet these two conditions. So I think we can just set a count that is large enough, as sketched below.
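For reference, in the underlying workflow definition (the code view) that simply means raising the limit on the Until action; the action name, variable name and the count of 1000 below are placeholders, and your HTTP request and Set Variable actions go inside "actions":

"Until_nextLink_is_null": {
    "type": "Until",
    "expression": "@equals(variables('nextLink'), null)",
    "limit": {
        "count": 1000,
        "timeout": "PT2H"
    },
    "actions": {},
    "runAfter": {}
}

The loop still exits as soon as the expression is true, so a generous count only acts as a safety net.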
I need to ensure that the same job added to the queue isn't duplicated within a certain period of time.
Is it worth including partial timestamps (i.e. D/M/Y-HH:M) in my unique jobId strings, so a job is only processed if it doesn't fall within the same minute?
It would still duplicate if one job was added at 12:01 and the other at 12:09 – or does Bull have a much better way of doing this?
Bull is designed to support idempotence by ignoring jobs that were added with existing job ids. Be careful not to enable options such as removeOnComplete, since such a job will be removed after completion and will not be considered the next time you add a job.
In your case, where you want to make sure that no new jobs are added during a given timespan, just make sure that all the job ids generated during that timespan are the same, for example, as you wrote in your comment, by removing the last 4 digits of your UNIX timestamp (see the sketch below).
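A minimal sketch of that idea; the queue name, the payload shape and the 10-second bucket size are all placeholders you would adapt:

// Sketch: derive the jobId from a coarse time bucket so repeats within the
// same bucket share an id and Bull ignores them.
import Queue from "bull";

const ordersQueue = new Queue("orders"); // hypothetical queue name

const BUCKET_MS = 10000; // size of the dedup window (here: 10 seconds)

async function addOrderJob(payload: object): Promise<void> {
    const bucket = Math.floor(Date.now() / BUCKET_MS);
    // Jobs added with an already-existing jobId are ignored by Bull.
    await ordersQueue.add(payload, { jobId: `order-${bucket}` });
}

If the deduplication should also depend on the job's content, fold a stable hash of the payload into the jobId as well.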
I feel you should use Bull's API to check whether the job is already running or not, and then decide whether to add the job to the queue (a patch on the producer side).
You can also check whether a similar job is already running when you are processing the job (inside the process function) and do an early return instead of executing it (a patch on the consumer side).
You can use the Queue getJobs function to do so:
getJobs(types: string[], start?: number, end?: number, asc?: boolean):Promise<Job[]>
"Returns a promise that will return an array of job instances of the given types. Optional parameters for range and ordering are provided."
From documentation:
https://github.com/OptimalBits/bull/blob/develop/REFERENCE.md#queuegetjobs
The Job item should provide enough data so you can find the one you are looking for.
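For example, here is a sketch of the consumer-side check; the queue name and the orderId field used to decide whether two jobs are "similar" are assumptions:

// Sketch: inside the process function, skip the work if a similar job is already active.
import Queue from "bull";

const ordersQueue = new Queue("orders"); // hypothetical queue name

ordersQueue.process(async (job) => {
    // Look at jobs that are currently being processed.
    const activeJobs = await ordersQueue.getJobs(["active"]);
    const similar = activeJobs.find(
        (other) => other.id !== job.id && other.data.orderId === job.data.orderId // assumed data field
    );
    if (similar) {
        return; // early return instead of executing the duplicate job
    }
    // ... do the actual work here
});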