Event Processing (17 Million Events/day) using Azure Stream Analytics HoppingWindow Function - 24H Window, 1 Minute Hops - azure

We have a business problem that needs solving and would like some guidance from the community on the combination of products in Azure we could use to solve it.
The Problem:
I work for a business that produces online games. We would like to display the number of users playing a specific game in a 24 Hour Window, but we want the value to update every minute. Essentially the output that HoppingWindow(Duration(hour, 24), Hop(minute, 1)) function in Azure Stream Analytics will provide.
Currently, the amount of events are around 17 Million a day and the Stream Analytics Job seems to be struggling with the load. We tried the following so far;
Tests Done:
17 Million Events -> Event Hub (32 Partitions) -> ASA (42 Streaming Units) -> Table Storage
Failed: Stream Analytics Job never outputs on large timeframes (Stopped test at 8 Hours)
17 Million Events -> Event Hub (32 Partitions) -> FUNC -> Table Storage (With Proper Partition/Row Key)
Failed: Table storage does not support distinct count
17 Million Events -> Event Hub -> FUNC -> Cosmos DB
Tentative: Cosmos DB doesn't support distinct count, not natively anyways. Seems to be some hacks going around, but not sure that's the way to go.
Is there any known designs geared for processing 17 Million events a minute, every minute?
Edit: As per the comments, the code.
SELECT
GameId,
COUNT(DISTINCT UserId) AS ActiveCount,
DateAdd(hour, -24, System.TimeStamp()) AS StartWindowUtc,
System.TimeStamp() AS EndWindowUtc INTO [out]
FROM
[in] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
HoppingWindow(Duration(hour, 24), Hop(minute, 1)),
GameId,
UserId
The expected output, note that in reality there will be 1440 records per GameId. One for each minute
To be clear, the problem is that generating the expected output on the larger timeframes, ie 24 Hours doesn't output or at the very least takes 8+ Hours to output. The smaller window sizes work, for example changing the above code to use HoppingWindow(Duration(minute, 10), Hop(minute, 5)).
The tests that followed assumed that ASA is not the answer to the problem and we tried different approaches. Which seemed to have caused a bit of confusion, sorry about that

The way ASA scales up at the moment is with 1 node vertically from 1 to 6SU, then horizontally with multiple nodes of 6SU above that threshold.
Obviously, to be able to scale horizontally a job needs to be parallelizable, which means the stream will be distributed across nodes according to the partition scheme.
Now if the input stream, the query and the output destination are aligned in partitions, then the job is called embarrassingly parallel and that's where you'll be able to reach maximum scale. Each pipeline, from entry to output, will be independent, and each node will only have to maintain in memory the data pertaining to its own state (those 24h of data). That's what we're looking for here: minimizing the local data store at each node.
With EH supporting 32 partitions on most SKUs, the maximum scale publicly available on ASA is 192SU (6*32). If partitions are balanced, that's means that each node will have the least amount of data to maintain in its own state store.
Then we need to minimize the payload itself (the size of an individual message), but from the look of the query that's already the case.
Could you try scaling up to 192SU and see what happens?
We are also working on multiple other features that could help on that scenario. Let me know if that could be of interest to you.

Related

Stream Analytics too slow (time slippage between two streams)?

Here's my stream analytics topology
EventHubSource => Job A (HoppingWindow every second) => EventHubA
EventHubSource => Job B (HoppingWindow every second) => EventHubB
Each job has a different consumer group in EventHubSource.
Each job is embarrassingly parallel and consumes only
14% SU resources.
When testing the JobA and JobC, the difference between the windowEnd and the original Event Time is just some few millisecond (~300), which is ok (latency from my producer + eventhub + stream analytics processing time).
But when I join both streams in a new Job C like this:
EventHubA
\
=> Job C (Join Datediff = 0 and timestamp by windowEnd)
/
EventHubB
This produces some output, but the problems comes here:
The real events are multiple minutes apart even if they were pushed at the same time by Job A and B (same windowEnd)
When I inspect the data coming out from EventHub A and B, the difference between the windowEnd and the real event timestamp ranges between 39 and 44 minutes, for all of them. But when testing like mentionned above, it was only 300ms.
The worst part here is that when I run it in prod, it only emits some dozen events and stops, even if the input count is still in the thousands.
It's been weeks I'm working on this and everytime I'm dealing with some cryptic behavior from ASA, my topology is quite simple and I'm only using simple hopping windows of 1s hop, this shouldn't take weeks of tweaking and trial errors without even understanding what's happening.
For people who used ASA and AWS Kinesis analytics, did you find Kinesis analytics simpler to work with ? What annoys me here in ASA is the unpredictable behavior and issues without error messages (I activated log analytics and no error was there...)
Sorry to hear you encountered some issues with ASA. I see you have a 1 second hopping windows, but what is the total size of the windows and what is your approximate throughput?
Regarding the delay: Looking are your question, I think your ASA job may not have enough CPU resources, and then the event processing is delayed. Unfortunately this is not visible in the current SU% metric, but we plan to show metrics for both CPU and memory in the future.
To confirm this is the root cause, you can check the number of backlogged events in the job diagram. If there are lot of events backlogged, you may need to increase the number of SUs for this job.
You also mentioned the job stops after a dozen output, do you see an error message in the logs?

Spark streaming - waiting for data for window aggregations?

I have data in the format { host | metric | value | time-stamp }. We have hosts all around the world reporting metrics.
I'm a little confused about using window operations (say, 1 hour) to process data like this.
Can I tell my window when to start, or does it just start when the application starts? I want to ensure I'm aggregating all data from hour 11 of the day, for example. If my window starts at 10:50, I'll just get 10:50-11:50 and miss 10 minutes.
Even if the window is perfect, data may arrive late.
How do people handle this kind of issue? Do they make windows far bigger than needed and just grab the data they care about on every batch cycle (kind of sliding)?
In the past, I worked on a large-scale IoT platform and solved that problem by considering that the windows were only partial calculations. I modeled the backend (Cassandra) to receive more than 1 record for each window. The actual value of any given window would be the addition of all -potentially partial- records found for that window.
So, a perfect window would be 1 record, a split window would be 2 records, late-arrivals are naturally supported but only accepted up to a certain 'age' threshold. Reconciliation was done at read time. As this platform was orders of magnitude heavier in terms of writes vs reads, it made for a good compromise.
After speaking with people in depth on MapR forums, the consensus seems to be that hourly and daily aggregations should not be done in a stream, but rather in a separate batch job once the data is ready.
When doing streaming you should stick to small batches with windows that are relatively small multiples of the streaming interval. Sliding windows can be useful for, say, trends over the last 50 batches. Using them for tasks as large as an hour or a day doesn't seem sensible though.
Also, I don't believe you can tell your batches when to start/stop, etc.

Solution for delaying events for N days

We're currently writing an application in Microsoft Azure and we're planning to use Event Hubs to handle processing of real time events.
However, after an initial processing we will have to delay further processing of the events for N number of days. The process will work like this:
Event triggered -> Place event in Event Hub -> Event gets fetched from Event Hub and processed -> Event should be delay for X days -> Event gets' further processed (two last steps might be a loop)
How can we achieve this delay of further event processing without using polling or similar strategies. One idea is to use Azure Queues and their visibility timeout, but 7 days is the supported maximum according to the documentation and our business demands are in the 1-3 months maximum range. Number of events in our system should be max 10k per day.
Any ideas would be appreciated, thanks!
As you already mentioned - EventHubs supports only 7 days window of data to be retained.
Event Hubs are typically used as real-time telemetry data pipe-lines where data seek performance is critical. For 99.9% usecases/scenarios our users typically require last couple of hours, if not seconds.
However, after the real-time processing is over, and If you still need to re-analyze the data after a while, for ex: run a Hadoop job on last months data - our seek pattern & store are not optimized for it. We recommend to forward the messages to other data archival stores which are specialized for big-data queries.
As - data archival is an ask that most of our customers naturally look for - we are releasing a new feature which automatically archives the data in AVRO format into Azure storage.

Multiple windows of different durations in Spark Streaming application

I would like to process a real-time stream of data (from Kafka) using Spark Streaming. I need to compute various stats from the incoming stream and they need to be computed for windows of varying durations. For example, I might need to compute the avg value of a stat 'A' for the last 5 mins while at the same time compute the median for stat 'B' for the last 1 hour.
In this case, what's the recommended approach to using Spark Streaming? Below are a few options I could think of:
(i) Have a single DStream from Kafka and create multiple DStreams from it using the window() method. For each of these resulting DStreams, the windowDuration would be set to different values as required. eg:
// pseudo-code
val streamA = kafkaDStream.window(Minutes(5), Minutes(1))
val streamB = kafkaDStream.window(Hours(1), Minutes(10))
(ii) Run separate Spark Streaming apps - one for each stat
Questions
To me (i) seems like a more efficient approach. However, I have a couple of doubts regarding that:
How would streamA and streamB be represented in the underlying
datastructure.
Would they share data - since they originate from the
KafkaDStream? Or would there be duplication of data?
Also, are there more efficient methods to handle such a use case.
Thanks in advance
Your (i) streams look sensible, will share data, and you can look at WindowedDStream to get an idea of the underlying representation. Note your streams are of course lazy, so only the batches being computed upon are in the system at any given time.
Since the state you have to maintain for the computation of an average is small (2 numbers), you should be fine. I'm more worried about the median (which requires a pair of heaps).
One thing you haven't made clear, though, is if you really need the update component of your aggregation that is implied by the windowing operation. Your streamA maintains the last 5 minutes of data, updated every minute, and streamB maintains the last hour updated every 10 minutes.
If you don't need that freshness, not requiring it will of course should minimize the amount of data in the system. You can have a streamA with a batch interval of 5mins and a streamB which is deducted from it (with window(Hours(1)), since 60 is a multiple of 5) .

Strange data access time in Azure Table Storage while using .Take()

this is our situation:
We store user messages in table Storage. The Partition key is the UserId and the RowKey is used as a message id.
When a users opens his message panel we want to just .Take(x) number of messages, we don't care about the sortOrder. But what we have noticed is that the time it takes to get the messages varies very much by the number of messages we take.
We did some small tests:
We did 50 * .Take(X) and compared the differences:
So we did .Take(1) 50 times and .Take(100) 50 times etc.
To make an extra check we did the same test 5 times.
Here are the results:
As you can see there are some HUGE differences. The difference between 1 and 2 is very strange. The same for 199-200.
Does anybody have any clue how this is happening? The Table Storage is on a live server btw, not development storage.
Many thanks.
X: # Takes
Y: Test Number
Update
The problem only seems to come when I'm using a wireless network. But I'm using the cable the times are normal.
Possibly the data is collected in batches of a certain number x. When you request x+1 rows, it would have to take two batches and then drop a certain number.
Try running your test with increments of 1 as the Take() parameter, to confirm or dismiss this assumption.

Resources