Here's my stream analytics topology
EventHubSource => Job A (HoppingWindow every second) => EventHubA
EventHubSource => Job B (HoppingWindow every second) => EventHubB
Each job has a different consumer group in EventHubSource.
Each job is embarrassingly parallel and consumes only
14% SU resources.
When testing the JobA and JobC, the difference between the windowEnd and the original Event Time is just some few millisecond (~300), which is ok (latency from my producer + eventhub + stream analytics processing time).
But when I join both streams in a new Job C like this:
EventHubA
\
=> Job C (Join Datediff = 0 and timestamp by windowEnd)
/
EventHubB
This produces some output, but the problems comes here:
The real events are multiple minutes apart even if they were pushed at the same time by Job A and B (same windowEnd)
When I inspect the data coming out from EventHub A and B, the difference between the windowEnd and the real event timestamp ranges between 39 and 44 minutes, for all of them. But when testing like mentionned above, it was only 300ms.
The worst part here is that when I run it in prod, it only emits some dozen events and stops, even if the input count is still in the thousands.
It's been weeks I'm working on this and everytime I'm dealing with some cryptic behavior from ASA, my topology is quite simple and I'm only using simple hopping windows of 1s hop, this shouldn't take weeks of tweaking and trial errors without even understanding what's happening.
For people who used ASA and AWS Kinesis analytics, did you find Kinesis analytics simpler to work with ? What annoys me here in ASA is the unpredictable behavior and issues without error messages (I activated log analytics and no error was there...)
Sorry to hear you encountered some issues with ASA. I see you have a 1 second hopping windows, but what is the total size of the windows and what is your approximate throughput?
Regarding the delay: Looking are your question, I think your ASA job may not have enough CPU resources, and then the event processing is delayed. Unfortunately this is not visible in the current SU% metric, but we plan to show metrics for both CPU and memory in the future.
To confirm this is the root cause, you can check the number of backlogged events in the job diagram. If there are lot of events backlogged, you may need to increase the number of SUs for this job.
You also mentioned the job stops after a dozen output, do you see an error message in the logs?
Related
Context
I have created a streaming job using Azure portal which aggregates data using a day wise TUMBLINGWINDOW. Have attached a code snippet below, modified from the docs, which shows similar logic.
SELECT
DATEADD(day, -1, System.Timestamp()) AS WindowStart
System.Timestamp() AS WindowEnd,
TollId,
COUNT(*)
FROM Input TIMESTAMP BY EntryTime
GROUP BY TumblingWindow(day, 1), TollId
Now that the job has been running and can see it producing output I want to be able to reduce the costs ideally by setting some sort of time scheduling so that the job can run and still produce the same output without being on all the time.
The only real constraint being that the aggregated output at the end of each TUMBLINGWINDOW has to remain the same as if it were running all the time (no impact of stop-starting on output).
This then brings me to my question.
Update: 2021-02-28
Before going into the question another thing that drove me was that through Azure portal you can manually start and stop a job. When you start/restart a job you can set a custom start time for the job/query. With this level of control say I start a job (or have a job running) and then decide to stop it for majority of the day and then turn it on at say 11:30pm each day with a custom start time of midnight of the current day then it would be able to be on for approx 30min before it would output the results (yet still to my understanding produce the same aggregation results/effect compared to if it was on the whole day up until that point). This job could then be paused again at 00:30am ( the next day for which it stays paused for the majority of the day (1380min total until 11:30pm again) upon which the same above logic is applied.
This way it remains off the majority of the day yet still can produce the same output for each day wise window (correct me if I am wrong in my thinking). The only issue with this to me seems to be the fact someone would have to manually perform this. Thus I was driven to the docs looking for a way to automate this.
Question
How can I start and stop a job in an automated fashion such that the required output would still remain intact but so that the job doesn't have to remain on all the time (like it currently is)?
Does the documentation linked above suffice given the context above, if so what are some possible arrangements for the N minutes (on) and M minutes (off) time variables for this to work?
Is this possible given the scenario that I want to aggregate on a one day TUMBLINGWINDOW window (whereby I want each window to start and end at midnight of each day, as per its default behaviour.)?
Eg
Window start: 2022-02-20 00:00:00 Window end: 2022-02-21 00:00:00 (aggregation performed),
Window start: 2022-02-21 00:00:00 Window end: 2022-02-22 00:00:00 (aggregation performed),
Window start: 2022-02-22 00:00:00 Window end: 2022-02-23 00:00:00 (aggregation performed),
....so on
Thoughts
I found this documentation from Microsoft regarding auto-pausing jobs using a few methods
However came across a paragraph (quoted below) which made me doubtful whether it seems reasonable in my particular use case (TUMBLING 1 day window as described in my question section).
Note
There are downsides to auto-pausing a job. The main ones being the loss of the low latency /real time capabilities, and the potential risks from allowing the input event backlog to grow unsupervised while a job is paused. Auto-pausing should not be considered for most production scenarios running at scale.
Could this method
There are 3 ways to lower costs:
downscale your job, you will have higher latency but for a lower cost, up to a point where your job crashes because it runs out of memory over time and/or can't catch up with its backlog. Here you need to keep an eye on your metrics to make sure you can react before it's too late
going further, you can regroup multiple queries into a single job. This job most likely won't be aligned in partitions, so it won't be able to scale linearly (adding SUs is not guaranteed to give you better performance). Same comment as above, plus you need to remember that when you need to scale back up, you probably will have to break down that job into multiple jobs to again be able to scale in a linear fashion
finally you can auto-pause a job, one way to implement that being explained in the doc you linked. I wrote that doc, and what I meant by that comment is that here again you are taking the risk of overloading the job if it can't run long enough to process the backlog of events. This is a risky proposition for most production scenarios
But if you know what you are doing, and are monitoring closely the appropriate metrics (as explained in the doc), this is definitely something you should explore.
Finally, all of these approaches, including the auto-pause one, will deal with tumbling windows transparently for you.
Update: 2022-03-03 following comments here
Update: 2022-03-04 following comments there
There are 3 time dimensions here:
When the job is running or not: the wall clock
When the time window is expected to output results: Tumbling(day,1) -> 00:00AM every day, this is absolute (on the day, on the hour, on the minute...) and independent of the job start time below
What output you want produced from the job, via the job start time
Let's say you have the job running 24/7 for multiple months, and decide to stop it at noon (12:00PM) on the 1st day of March.
It already has generated an output for the last day of February, at 00:00AM Mar1.
You won't see a difference in output until the following day, 00:00AM Mar2, when you expect to see the daily window of Mar1, but it's not output because the job is stopped.
Let's start the job at 01:00AM Mar2 wall clock time. If you want the missing time window, you should either pick a start time at 'when last stopped' (noon the day before), or a custom time any time before 23:59PM Mar1. What you are driving is the output window you want. Here you are telling ASA you want all the windows from that point onward.
ASA will then reload all the data it needs to generate that window (make sure the event hub has enough retention for that, we don't cache data between restarts in the job): Azure Stream Analytics will automatically look back at the data in the input source. For instance, if you start a job “Now” and if your query uses a 5-minutes Tumbling Window, Azure Stream Analytics will seek data from 5 minutes ago in the input. The first possible output event would have a timestamp equal to or greater than the current time, and ASA guarantees that all input events that may logically contribute to the output has been accounted for.
My use case is to create dynamic delayed job. (I am Using Bulls Queue which can be used to create delayed Jobs.)
Based on some event add some more delay to the delayed interval (further delay the job).
Since I could not find any function to update the Delayed Interval for a Job I came up with the following steps:
onEvent(jobId):
// queue is of Type Bull.Queue
// job is of type bull.Job
job = queue.getJob(jobId)
data = job.data
delay = job.toJSON().delay
job.remove()
queue.add("jobName", {value: 1}, {jobId: jobId, delayed: delay + someValue})
This pretty much solves my problem.
But I am worried about the SCALE at which these operations will happen.
I am expecting nearly 50K events per minute or even more in near future.
My Queue size is expected to grow based on unique JobId.
I am expecting more than:
1 million daily entry
around 4-5 million weekly entry
10-12 million monthly entry.
Also, after 60-70 days delayed interval for jobs will reach, and older jobs will be removed one by one.
I can run multiple processor to handle these delayed job which is not an issue.
My queue size will be stabilise after 60-70 days and more or less my queue will have around 10 million jobs.
I can vertically scale my REDIS as required.
But I want to understand the time complexity for below queries:
queue.getJob(jobId) // Get Job By Id
job.remove() // remove job from queue
queue.add(name, data, opts) // add a delayed job to this queue
If any of these operations are O(N) OR the QUEUE can keep some max number of Jobs which is less than 10 million.
Then I might have to discard this design and come up with something entirely different.
Need advice from experienced folks who can guide me on how solve this problem.
Any kind of help is appreciated.
Taking reference from the source code:
queue.getJob(jobId)
This should be O(1) since it's mostly using hash based solutions using hmget. You're only requesting one job and according to official redis docs, the time complexity is O(N) where N is the requested number of keys which will be in the order of O(1) since I'm expecting bull is storing few number of fields at the hash key.
job.remove()
Considering that a considerable number of your jobs is going to be delayed and a fraction of them are moved to waiting or active queue. This should be O(logN) on an amortized level as it's mostly using zrem for these operations.
queue.add(name, data, opts)
For job addition in a delayed queue, bull is using zadd so this is again O(logN).
We have a business problem that needs solving and would like some guidance from the community on the combination of products in Azure we could use to solve it.
The Problem:
I work for a business that produces online games. We would like to display the number of users playing a specific game in a 24 Hour Window, but we want the value to update every minute. Essentially the output that HoppingWindow(Duration(hour, 24), Hop(minute, 1)) function in Azure Stream Analytics will provide.
Currently, the amount of events are around 17 Million a day and the Stream Analytics Job seems to be struggling with the load. We tried the following so far;
Tests Done:
17 Million Events -> Event Hub (32 Partitions) -> ASA (42 Streaming Units) -> Table Storage
Failed: Stream Analytics Job never outputs on large timeframes (Stopped test at 8 Hours)
17 Million Events -> Event Hub (32 Partitions) -> FUNC -> Table Storage (With Proper Partition/Row Key)
Failed: Table storage does not support distinct count
17 Million Events -> Event Hub -> FUNC -> Cosmos DB
Tentative: Cosmos DB doesn't support distinct count, not natively anyways. Seems to be some hacks going around, but not sure that's the way to go.
Is there any known designs geared for processing 17 Million events a minute, every minute?
Edit: As per the comments, the code.
SELECT
GameId,
COUNT(DISTINCT UserId) AS ActiveCount,
DateAdd(hour, -24, System.TimeStamp()) AS StartWindowUtc,
System.TimeStamp() AS EndWindowUtc INTO [out]
FROM
[in] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
HoppingWindow(Duration(hour, 24), Hop(minute, 1)),
GameId,
UserId
The expected output, note that in reality there will be 1440 records per GameId. One for each minute
To be clear, the problem is that generating the expected output on the larger timeframes, ie 24 Hours doesn't output or at the very least takes 8+ Hours to output. The smaller window sizes work, for example changing the above code to use HoppingWindow(Duration(minute, 10), Hop(minute, 5)).
The tests that followed assumed that ASA is not the answer to the problem and we tried different approaches. Which seemed to have caused a bit of confusion, sorry about that
The way ASA scales up at the moment is with 1 node vertically from 1 to 6SU, then horizontally with multiple nodes of 6SU above that threshold.
Obviously, to be able to scale horizontally a job needs to be parallelizable, which means the stream will be distributed across nodes according to the partition scheme.
Now if the input stream, the query and the output destination are aligned in partitions, then the job is called embarrassingly parallel and that's where you'll be able to reach maximum scale. Each pipeline, from entry to output, will be independent, and each node will only have to maintain in memory the data pertaining to its own state (those 24h of data). That's what we're looking for here: minimizing the local data store at each node.
With EH supporting 32 partitions on most SKUs, the maximum scale publicly available on ASA is 192SU (6*32). If partitions are balanced, that's means that each node will have the least amount of data to maintain in its own state store.
Then we need to minimize the payload itself (the size of an individual message), but from the look of the query that's already the case.
Could you try scaling up to 192SU and see what happens?
We are also working on multiple other features that could help on that scenario. Let me know if that could be of interest to you.
I have data in the format { host | metric | value | time-stamp }. We have hosts all around the world reporting metrics.
I'm a little confused about using window operations (say, 1 hour) to process data like this.
Can I tell my window when to start, or does it just start when the application starts? I want to ensure I'm aggregating all data from hour 11 of the day, for example. If my window starts at 10:50, I'll just get 10:50-11:50 and miss 10 minutes.
Even if the window is perfect, data may arrive late.
How do people handle this kind of issue? Do they make windows far bigger than needed and just grab the data they care about on every batch cycle (kind of sliding)?
In the past, I worked on a large-scale IoT platform and solved that problem by considering that the windows were only partial calculations. I modeled the backend (Cassandra) to receive more than 1 record for each window. The actual value of any given window would be the addition of all -potentially partial- records found for that window.
So, a perfect window would be 1 record, a split window would be 2 records, late-arrivals are naturally supported but only accepted up to a certain 'age' threshold. Reconciliation was done at read time. As this platform was orders of magnitude heavier in terms of writes vs reads, it made for a good compromise.
After speaking with people in depth on MapR forums, the consensus seems to be that hourly and daily aggregations should not be done in a stream, but rather in a separate batch job once the data is ready.
When doing streaming you should stick to small batches with windows that are relatively small multiples of the streaming interval. Sliding windows can be useful for, say, trends over the last 50 batches. Using them for tasks as large as an hour or a day doesn't seem sensible though.
Also, I don't believe you can tell your batches when to start/stop, etc.
I am a newbie to Spark Streaming and I have some doubts regarding the same like
Do we need always more than one executor or with one we can do our job
I am pulling data from kafka using createDirectStream which is receiver less method and batch duration is one minute , so is my data is received for one batch and then processed during other batch duration or it is simultaneously processed
If it is processed simultaneously then how is it assured that my processing is finished in the batch duration
How to use the that web UI to monitor and debugging
Do we need always more than one executor or with one we can do our job
It depends :). If you have a very small volume of traffic coming in, it could very well be that one machine code suffice in terms of load. In terms of fault tolerance that might not be a very good idea, since a single executor could crash and make your entire stream fault.
I am pulling data from kafka using createDirectStream which is
receiver less method and batch duration is one minute , so is my data
is received for one batch and then processed during other batch
duration or it is simultaneously processed
Your data is read once per minute, processed, and only upon the completion of the entire job will it continue to the next. As long as your batch processing time is less than one minute, there shouldn't be a problem. If processing takes more than a minute, you will start to accumulate delays.
If it is processed simultaneously then how is it assured that my
processing is finished in the batch duration?
As long as you don't set spark.streaming.concurrentJobs to more than 1, a single streaming graph will be executed, one at a time.
How to use the that web UI to monitor and debugging
This question is generally too broad for SO. I suggest starting with the Streaming tab that gets created once you submit your application, and start diving into each batch details and continuing from there.
To add a bit more on monitoring
How to use the that web UI to monitor and debugging
Monitor your application in the Streaming tab on localhost:4040, the main metrics to look for are Processing Time and Scheduling Delay. Have a look at the offical doc : http://spark.apache.org/docs/latest/streaming-programming-guide.html#monitoring-applications
batch duration is one minute
Your batch duration a bit long, try to adjust it with lower values to improve your latency. 4 seconds can be a good start.
Also it's a good idea to monitor these metrics on Graphite and set alerts. Have a look at this post https://stackoverflow.com/a/29983398/3535853