Using Timer for batch operations

Using Timer for batch operations - micrometer

I'm new to using Micrometer and am trying to see if there's a way to use a Timer that would also include a count of the number of items in a batch processing scenario. Since I'm processing the batch with Java streams, I didn't see an obvious way to record the timer for each item processed, so I was looking for a way to set a batch size attribute. One way I think that could work is to use the FunctionTimer from https://micrometer.io/docs/concepts#_function_tracking_timers, but I believe that requires the app to maintain a persistent monotonically increasing set of values for the total count and total time.
Is there a simpler way this can be done? Ultimately this data will be fed to New Relic. I've also tried setting tags for the batch size, but those seem to be reported as strings so I can't do any type of aggregation on the values.
Thanks!

A timer is intended for measuring an action and at a minimum results in two measurements: a count and a duration.
So a timer will work perfectly for your batch processing. In the the java stream, a peek operation might be a good place to put a timer.
If you were about to process 20 elements and you were just measuring the time for all 20 elements, you would need to create a new Counter for measuring the batch size. You could them divide the timer's total duration against your counter to get a per-item duration or divide it against the timer's total count to get a per-batch duration.
Feel free to add code snippets if you would like feedback for those.

Related

Bulls Queue Performance and Scalability: Queue.add(), Queue.getJob(jobId), Job.remove()

My use case is to create dynamic delayed job. (I am Using Bulls Queue which can be used to create delayed Jobs.)
Based on some event add some more delay to the delayed interval (further delay the job).
Since I could not find any function to update the Delayed Interval for a Job I came up with the following steps:
onEvent(jobId):
// queue is of Type Bull.Queue
// job is of type bull.Job
job = queue.getJob(jobId)
data = job.data
delay = job.toJSON().delay
job.remove()
queue.add("jobName", {value: 1}, {jobId: jobId, delayed: delay + someValue})
This pretty much solves my problem.
But I am worried about the SCALE at which these operations will happen.
I am expecting nearly 50K events per minute or even more in near future.
My Queue size is expected to grow based on unique JobId.
I am expecting more than:
1 million daily entry
around 4-5 million weekly entry
10-12 million monthly entry.
Also, after 60-70 days delayed interval for jobs will reach, and older jobs will be removed one by one.
I can run multiple processor to handle these delayed job which is not an issue.
My queue size will be stabilise after 60-70 days and more or less my queue will have around 10 million jobs.
I can vertically scale my REDIS as required.
But I want to understand the time complexity for below queries:
queue.getJob(jobId) // Get Job By Id
job.remove() // remove job from queue
queue.add(name, data, opts) // add a delayed job to this queue
If any of these operations are O(N) OR the QUEUE can keep some max number of Jobs which is less than 10 million.
Then I might have to discard this design and come up with something entirely different.
Need advice from experienced folks who can guide me on how solve this problem.
Any kind of help is appreciated.

Taking reference from the source code:
queue.getJob(jobId)
This should be O(1) since it's mostly using hash based solutions using hmget. You're only requesting one job and according to official redis docs, the time complexity is O(N) where N is the requested number of keys which will be in the order of O(1) since I'm expecting bull is storing few number of fields at the hash key.
job.remove()
Considering that a considerable number of your jobs is going to be delayed and a fraction of them are moved to waiting or active queue. This should be O(logN) on an amortized level as it's mostly using zrem for these operations.
queue.add(name, data, opts)
For job addition in a delayed queue, bull is using zadd so this is again O(logN).

Tracking a counter value in application insights

I'm trying to use application insights to keep track of a counter of number of active streams in my application. I have 2 goals to achieve:
Show the current (or at least recent) number of active streams in a dashboard
Activate a kind of warning if the number exceeds a certain limit.
These streams can be quite long lived, and sometimes brief. So the number can sometimes change say 100 times a second, and sometimes remain unchanged for many hours.
I have been trying to track this active streams count as an application insights metric.
I'm incrementing a counter in my application when a new stream opens, and decrementing when one closes. On each change I use the telemetry client something like this
var myMetric = myTelemetryClient.GetMetric("Metricname");
myMetric.TrackValue(myCount);
When I query my metric values with Kusto, I see that because of these clusters of activity within a 10 sec period, my metric values get aggregated. For the purposes of my alarm, I can live with that, as I can look at the max value of the aggregate. But I can't present a dashboard of the number of active streams, as I have no way of knowing the number of active streams between my measurement points. I know the min value, max and average, but I don't know the last value of the aggregate period, and since it can be somewhere between 0 and 1000, its no help.
So the solution I have doesn't serve my needs, I thought of a couple of changes:
Adding a scheduled pump to my counter component, which will send the current counter value, once every say 5 minutes. But I don't like that I then have to add a thread for each of these counters.
Adding a timer to send the current value once, 5 minutes after the last change. Countdown gets reset each time the counter changes. This has the same problem as above, and does an excessive amount of work to reset the counter when it could be changing thousands of times a second.
In the end, I don't think my needs are all that exotic, so I wonder if I'm using app insights incorrectly.
Is there some way I can change the metric's behavior to suit my purposes? I appreciate that it's pre-aggregating before sending data in order to reduce ingest costs, but it's preventing me from solving a simple problem.
Is a metric even the right way to do this? Are there alternative approaches within app insights?

You can use TrackMetric instead of the GetMetric ceremony to track individual values withouth aggregation. From the docs:
Microsoft.ApplicationInsights.TelemetryClient.TrackMetric is not the preferred method for sending metrics. Metrics should always be pre-aggregated across a time period before being sent. Use one of the GetMetric(..) overloads to get a metric object for accessing SDK pre-aggregation capabilities. If you are implementing your own pre-aggregation logic, you can use the TrackMetric() method to send the resulting aggregates.
But you can also use events as described next:
If your application requires sending a separate telemetry item at every occasion without aggregation across time, you likely have a use case for event telemetry; see TelemetryClient.TrackEvent (Microsoft.ApplicationInsights.DataContracts.EventTelemetry).

Predefined (and large) windows? Any stream processing frameworks support this?

All the examples I see of windowing involve defining the windows. E.g., tumbling 1-minute windows, or sliding 1-minute windows, etc. In my situation, all my data has timestamped events, but that's not the primary interest.
All my data also has an associated period that I do not have control over. That is the desired window in my case. The periods are time-based, but they vary from 2-3 weeks, roughly.
So, if I look at just the period of a stream of values might look like this (almost everything from the current period, a few stragglers from the last period early on in current period),
... PERIOD 6, PERIOD 5, PERIOD 6, PERIOD 6, PERIOD 6, PERIOD 6, ...
It's not clear to me how to handle this situation in terms of watermarks/triggers/etc? If I'm understanding all this terminology correctly I've thought of something like this: the watermark for PERIOD N occurs when the first event with PERIOD (N+1) is processed. The lateness horizon (for garbage collecting state) for the PERIOD N window can be 1-2 days after the timestamp of the first event with PERIOD (N+1). I'd like triggers to be accumulating and every 5 minutes (ideally, I'd like this trigger duration to be increasing: more frequent at beginning of the window, less frequent as time passes).
I'm trying to use terminology from this article, https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 sorry if it's incorrect
I'm particularly confused about how watermarks seem to be continuous and based on event-time. In my case, I have both event-time (timestamp) and event-time (period). If I'm understanding this correctly, the curve of my situation (as in the above article) would look like a step-function?
I haven't yet picked a stream processing framework to use. Does my situation make sense for any of them? Does this require a lot of custom logic? Does any framework make this easier? Is this a known problem with a name?
Any help is appreciated.

In Flink, one way to achieve this is to use the processing time window for aggregation. Then you use a rich map function to maintain the accumulated counts before the window. In the end, you sink the aggregates to long-term data storage
You can take a look at my blog post where we did something similar to this. Take a look at Section A peek into Milestone Two

Hazelcast Jet sliding window unit of measurement

Sorry for may be silly question but it is unclear from docs what is the unit of measurement for sliding window? Is it milliseconds, seconds or number of items in the stream?
I've noticed the aggregation operation was producing empty results and I had to filter them explicitly because probably there was no data available for that window, so I guess last point it not an option.

Jet doesn't specify a unit for windows, instead the windows are calculated based on the same unit that your timestamps are specified in. Typically if your timestamps are UNIX-style timestamps then it would be in milliseconds, but you could also use nanoseconds, seconds, or minutes if that's how your timestamps are defined. It refers to specifically event time and is not related to number of events in the stream, only to their timestamps.

Multiple windows of different durations in Spark Streaming application

I would like to process a real-time stream of data (from Kafka) using Spark Streaming. I need to compute various stats from the incoming stream and they need to be computed for windows of varying durations. For example, I might need to compute the avg value of a stat 'A' for the last 5 mins while at the same time compute the median for stat 'B' for the last 1 hour.
In this case, what's the recommended approach to using Spark Streaming? Below are a few options I could think of:
(i) Have a single DStream from Kafka and create multiple DStreams from it using the window() method. For each of these resulting DStreams, the windowDuration would be set to different values as required. eg:
// pseudo-code
val streamA = kafkaDStream.window(Minutes(5), Minutes(1))
val streamB = kafkaDStream.window(Hours(1), Minutes(10))
(ii) Run separate Spark Streaming apps - one for each stat
Questions
To me (i) seems like a more efficient approach. However, I have a couple of doubts regarding that:
How would streamA and streamB be represented in the underlying
datastructure.
Would they share data - since they originate from the
KafkaDStream? Or would there be duplication of data?
Also, are there more efficient methods to handle such a use case.
Thanks in advance

Your (i) streams look sensible, will share data, and you can look at WindowedDStream to get an idea of the underlying representation. Note your streams are of course lazy, so only the batches being computed upon are in the system at any given time.
Since the state you have to maintain for the computation of an average is small (2 numbers), you should be fine. I'm more worried about the median (which requires a pair of heaps).
One thing you haven't made clear, though, is if you really need the update component of your aggregation that is implied by the windowing operation. Your streamA maintains the last 5 minutes of data, updated every minute, and streamB maintains the last hour updated every 10 minutes.
If you don't need that freshness, not requiring it will of course should minimize the amount of data in the system. You can have a streamA with a batch interval of 5mins and a streamB which is deducted from it (with window(Hours(1)), since 60 is a multiple of 5) .

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string