Azure Function with Event Hub trigger receives weird amount of events

I have an Event Hub with an Azure Function connected to it. With small amounts of data everything works well, but when I tested it with 10,000 events I got very peculiar results.
For test purposes I send the numbers 0 to 9999 to the Event Hub and log the data in Application Insights (AI) and in Service Bus. In the first test, Azure shows that the hub received exactly 10,000 events, but Service Bus and AI got every message between 0 and 4500 and only every second message after 4500 (so about 30% were lost). In the second test, I got all messages from 0 to 9999, but every second message between 3500 and 3200 was duplicated. I would like to receive every message exactly once. What did I do wrong?
public async Task Run([EventHubTrigger("%EventHubName%", Connection = "AzureEventHubConnectionString")] EventData[] events, ILogger log)
{
    int id = _random.Next(1, 100000);
    _context.Log.TraceInfo("Started. Count: " + events.Length + ". " + id); // AI log
    foreach (var message in events)
    {
        // Log with ASB
        var mess = new Message();
        mess.Body = message.EventBody.ToArray();
        await queueClient.SendAsync(mess);
    }
    _context.Log.TraceInfo("Completed. " + id); // AI log
}

By using EventData[] events, you are reading events from the hub in batch mode. That is why you see X events processed at a time, and then the next batch processed a few seconds later.
Instead of EventData[], simply use EventData.
If you want to try batch processing, check that all events are sent with the same partition key when you publish them to the hub. Otherwise they can be split across several partitions. Throughput also depends on the capacity you have provisioned: TUs (throughput units), PUs (processing units), or CUs (capacity units), depending on the tier.
Egress per throughput unit: up to 2 MB per second or 4096 events per second.
Refer to the Event Hubs documentation on throughput limits for the Basic, Standard, and Premium tiers.

There are a couple of things likely happening, though I can only speculate with the limited context that we have. Knowing more about the testing methodology, tier of your Event Hubs namespace, and the number of partitions in your Event Hub would help.
The first thing to be aware of is that the timing between when an event is published and when it is available in a partition to be read is non-deterministic. When a publish operation completes, the Event Hubs broker has acknowledged receipt of the events and taken responsibility for ensuring they are persisted to multiple replicas and made available in a specific partition. However, it is not a guarantee that the event can immediately be read.
Depending on how you sent the events, the broker may also need to route events from a gateway by performing a round-robin or applying a hash algorithm. If you're looking to optimize the time from publish to availability, taking ownership of partition distribution and publishing directly to a partition can help, as can ensuring that you're publishing with the right degree of concurrency for your host environment and scenario.
With respect to duplication, it's important to be aware that Event Hubs offers an "at least once" guarantee; your consuming application should expect some duplicates and needs to be able to handle them in the way that is appropriate for your application scenario.
Azure Functions uses a set of event processors in its infrastructure to read events. The processors collaborate with one another to share work and distribute the responsibility for partitions between them. Because collaboration takes place using storage as an intermediary to synchronize, there is an overlap of partition ownership when instances are scaled up or scaled down, during which time the potential for duplication is increased.
Functions makes the decision to scale based on the number of events that it sees waiting in partitions to be read. In the case of your test, if your publication pattern increases rapidly and Functions sees "the event backlog" grow to the point that it feels the need to scale by multiple instances, you'll see more duplication than you otherwise would for a period of 10-30 seconds until partition ownership normalizes. To mitigate this, using an approach of gradually increasing speed of publishing over a 1-2 minute period can help to smooth out the scaling and reduce (but not eliminate) duplication.
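Because delivery is "at least once", the consumer has to tolerate redelivery. The following is a minimal sketch of one way to do that, not the way Functions or Event Hubs handles it: it skips any event whose (partition, sequence number) pair has been seen before, relying on the fact that sequence numbers are unique within a partition. The names Event and IdempotentForwarder are hypothetical, and the in-memory set and list stand in for a durable store and the Service Bus queue.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    partition_id: str
    sequence_number: int
    body: bytes


@dataclass
class IdempotentForwarder:
    """Forwards each event at most once, keyed by (partition, sequence number)."""
    seen: set = field(default_factory=set)          # stand-in for a durable store
    forwarded: list = field(default_factory=list)   # stand-in for the Service Bus queue

    def handle_batch(self, events):
        for event in events:
            key = (event.partition_id, event.sequence_number)
            if key in self.seen:
                continue  # redelivered event, already forwarded
            self.forwarded.append(event.body)
            self.seen.add(key)
```

In a real deployment the set of seen keys would need to live in storage shared by all function instances, since the duplicates described above come from different instances briefly owning the same partition.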

Related

What happens to events in an event hub after Stream Analytics does its work and routes them to Service Bus?

I have the following scenario:
The event hub (EH1) is configured with a retention policy of 7 days.
Producers publish events to EH1.
The events from EH1 are routed by Stream Analytics (SA), after it performs certain calculations over 1-hour time windows, to Service Bus, which gets both the raw events (as messages) and the summarized calculations.
Let's say that over the 24 hours of day 1, producers publish 1 million events to EH1.
SA kicks in and routes the raw events as well as the summarized calculations (over 1-hour periods) to Service Bus.
Assume that after day 1, no events are pushed to EH1 for the next 15 days.
Questions:
1. How long will the 1 million raw events (from day 1) stay in EH1?
2. Will those 1 million raw events (from day 1) still be there from day 2 (after the 1st hour) through day 7 (because the retention policy is 7 days)? Or will they be gone after day 1, when SA is done processing all of them? If neither, what else happens?
3. What metrics should I look at in EH1 to prove whatever the answer is to both (1) and (2)?
First of all, you should take a look at how consumer groups work.
In short, a consumer (any app or code that receives events from the event hub) must read events through a consumer group (call it cg_1 here). Reading does not remove events from the hub; the consumer only tracks its own position in that group, typically through checkpoints. The next time it reads from cg_1 and resumes from that position, the events it has already read are not delivered again.
But if you switch to another consumer group (for example, a newly created one named cg_2), you can read all the data again, even though it has already been read through cg_1.
So for your questions:
#1:
Since you have configured a retention policy of 7 days, the events (raw data) are kept in the event hub for 7 days. If the events have already been received via a consumer group, a consumer resuming from its position in that group will not receive them again, but you can use another consumer group to receive the data again.
#2:
As with question 1, the raw events are stored in the event hub for the retention period you have configured, regardless of whether SA has already processed them.
#3:
There is no such metric, but you can easily write client code, create a new consumer group, and then read the data to check whether it is still there.
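The behaviour described in this answer can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how Event Hubs is implemented: the class RetainedLog, its fields, and the day-based clock are all hypothetical. It shows the two points that matter here: events expire by age rather than by being read, and each consumer group keeps its own position.

```python
from dataclasses import dataclass, field


@dataclass
class RetainedLog:
    """Toy model of one partition: events expire by age, not by being read."""
    retention_days: int
    events: list = field(default_factory=list)        # (published_day, body)
    checkpoints: dict = field(default_factory=dict)   # consumer group -> next index

    def publish(self, day, body):
        self.events.append((day, body))

    def read(self, group, today):
        """Return retained events this group has not yet checkpointed past."""
        start = self.checkpoints.get(group, 0)
        self.checkpoints[group] = len(self.events)
        return [body for day, body in self.events[start:]
                if today - day < self.retention_days]
```

In this model, a second read through cg_1 returns nothing new, a first read through cg_2 still returns everything inside the retention window, and any read after the window has passed returns nothing regardless of the group.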

Azure Durable Functions - fanout-fanin scalability

We are a skills-based development company that creates competitions. The players in a competition can upload photos and rank each other's photos to earn points. One of the key requirements is to update the competition leaderboard regularly to keep the players interested. We are looking for a fan-out/fan-in architecture to implement the leaderboard. A typical workflow is attached.
From our analysis, Durable Functions seems to be the best option.
However, we have the following constraints:
Each competition has about 500 players
A player will be ranking up to 500 photos
I have been reading through the documentation, but could not find anything on the scalability of this approach using Durable Functions. Any comments or insights are highly appreciated.
You can find the performance targets for Durable Functions here: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-perf-and-scale#performance-targets
Parallel activity execution (fan-out): 100 activities per second, per instance
Parallel response processing (fan-in): 150 responses per second, per instance
If you run on an Azure Functions Consumption plan, the scale controller will scale out to more instances as more messages appear in the work item queue.
This is the queue used to start activities (which you would use to calculate a single player's score).
You can also improve fan-in performance by doing what they say in the docs:
Unlike fan-out, fan-in operations are limited to a single VM. If your application uses the fan-out, fan-in pattern and you are concerned about fan-in performance, consider sub-dividing the activity function fan-out across multiple sub-orchestrations.
So you'd have:
Main orchestrator
  Batch 0 sub-orchestrator
    Activity for user 0 in batch 0
    Activity for user 1 in batch 0
    ...
  Batch 1 sub-orchestrator
    Activity for user 0 in batch 1
    Activity for user 1 in batch 1
    ...
  ...
This kind of sub-orchestrator batching is faster because the orchestrator's history table gains more and more rows as the activities complete.
The orchestrator has to load these rows every time a result arrives.
So by capping the number of rows per orchestration you get maximal performance.
TL;DR: I think the fan-out will scale well, but you may want to do sub-orchestrator batching to improve fan-in performance.
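The batching step itself is simple to sketch. The helper below is a hypothetical illustration (the name split_into_batches, the player ID format, and the batch size of 50 are assumptions, not values from the Durable Functions documentation); it only shows how the 500 players from the question could be divided so that each sub-orchestrator handles a bounded number of activities.

```python
def split_into_batches(player_ids, batch_size):
    """Split the player list into consecutive batches, one per sub-orchestrator."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [player_ids[i:i + batch_size]
            for i in range(0, len(player_ids), batch_size)]


players = [f"player-{n}" for n in range(500)]
batches = split_into_batches(players, batch_size=50)
# The main orchestrator would start one sub-orchestrator per batch,
# and each sub-orchestrator would fan out one activity per player.
```

The right batch size is a tuning decision: smaller batches keep each sub-orchestrator's history short, while more batches add rows to the main orchestrator's history.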

Azure Service Bus performance issue MassTransit

So I've been playing around with MassTransit and Azure Service Bus Premium, here's a sample of one of my consumers. Hypothetical initial load for one publisher would be about 1000 messages a second. However whenever I attempt to configure a consumer, it seems to generally average out at about 20-40 messages per loop.
cfg.ReceiveEndpoint("ReceivePoint", e =>
{
    e.PrefetchCount = 500;
    e.MaxConcurrentCalls = 20;
    e.Batch<IBlahContract>(b =>
    {
        b.MessageLimit = 500;
        b.TimeLimit = TimeSpan.FromSeconds(1);
        b.Consumer(() => new BatchBlahConsumer(provider.GetRequiredService<IRepository>(), provider.GetRequiredService<ILogger<BatchBlahConsumer>>()));
    });
});
I did try the throughput test, which managed a thousand-plus messages a second. Does anyone have any tips on how to achieve optimal performance? And might it make more sense to consider a managed instance of RabbitMQ, since this needs to scale? It just feels like Azure Service Bus isn't really suited to such high throughput.
Edit: A slight addition to this: I suspect it's related to a requirement to keep prefetch at about 20, so that consumer concurrency is what really defines performance. So basically, it needs consumer-level configuration based on estimated requirements, which makes me lean more towards using RabbitMQ.
Your batch message limit is 500, which is honestly way too high. With MaxConcurrentCalls set to 20, you'll always hit the timeout instead of the batch size limit, because the Azure client library will only ever deliver 20 messages at once, and the batch size is significantly higher than that value (500 vs. 20). You need to set the concurrency high enough that a batch can fill up, or you'll always be completing the batch on timeout alone.
Lower the batch size and increase MaxConcurrentCalls so that they are the same, or at least so that the batch size is less than the concurrent calls limit. Batches can then complete on message receipt instead of waiting to time out.
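The reasoning in this answer can be expressed as a small model. This is a simplified sketch of the explanation above, not MassTransit's actual implementation; the function name batch_outcome is hypothetical, and it assumes the queue always has messages waiting.

```python
def batch_outcome(message_limit, max_concurrent_calls):
    """Model how a batch closes when the queue always has messages waiting.

    Messages collected into a batch count against the concurrency limit until
    the batch completes, so a batch can never hold more than
    max_concurrent_calls messages.
    """
    collected = min(message_limit, max_concurrent_calls)
    closed_by = "size" if collected == message_limit else "timeout"
    return collected, closed_by
```

With the settings from the question (a limit of 500 and 20 concurrent calls), the model yields batches of 20 that close only when the one-second time limit expires, which is consistent with the 20-40 messages per loop the asker observed.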

Would SQS batch size max limit result in slower processing through Lambdas?

I'm aware that AWS has allowed SQS to be one of the event source mappings for Lambdas. I'm glad that this is possible now as I would then not have to poll from the queue every few seconds through a cron job. However, it appears that the maximum possible value for batchSize is limited to 10. From my understanding, the batchSize is the number of messages a single Lambda invocation will receive from the queue.
This sounds like it could be an issue for me because, in my case, I may have a few hundreds of thousands of messages at a time in the queue. Those messages don't need any heavy processing; they just need to be parsed and saved to the database as a record. It's pretty simple.
If the batchSize is limited to only 10 messages per retrieval, I foresee a few issues:
It may take a long time to finish processing the messages on the queue.
Besides being slow, processing only 10 messages in a single Lambda invocation sounds a little wasteful. Given how simple the processing is, I'm pretty sure a single invocation could handle at least a few thousand messages.
Having only 10 messages per retrieval may also mean that I need to make more write operations to my database, because each of these messages needs to be inserted as a record in the database.
Are my concerns valid in this case? If so, is there anything else I can do with SQS and Lambdas to overcome those concerns?
Your assumption about a limit of 10 is correct.
Lambda will spin up more instances to run in parallel, if there are more messages available. See Scaling and Processing. This means that if there are 1000 messages available, Lambda might spin up 100 concurrent executions to quickly process all the messages.
Once a lambda function has processed the 10 messages of a batch, it continues with processing other batches. As lambda bills in 100ms intervals, the wasted time is minimal.
As for the database writes you could pre-process the messages before inserting them into the queue.
In that case, you need to let your Lambda function fetch the messages from the queue and process them itself, rather than having Lambda triggered by SQS. You would probably use a CloudWatch event to trigger the Lambda function, depending on your use case.
Please note that SQS returns a maximum of 10 messages per receive call, but you can write the code to make retrieval much more efficient.
One package that is very efficient at this is squiss-ts.
In this case you could let your Lambda function run for 15 minutes (the maximum) and process as many messages as possible. Idempotency is key when you are designing this kind of application, so that if a message wasn't processed in this run, it will be processed in the next run.
The downside of this approach is that you need to scale your Lambda functions manually depending on how many messages you anticipate.
You're right that a larger batch size seems appropriate for your use case.
As of late 2020, if you specify a batch window in seconds, you can then specify a batch size of up to 10,000 messages.
So with this new option you can now configure your lambda to wait and receive much larger batches per invocation.
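Whatever the batch size, the concern about database writes can be addressed by writing each batch with a single bulk call instead of one insert per message. The sketch below assumes that every message body is JSON and uses a hypothetical save_rows callable in place of a real bulk insert; only the event["Records"][n]["body"] shape comes from the SQS event that Lambda passes to the handler.

```python
import json


def parse_records(event):
    """Extract and decode the JSON body of every SQS message in the batch."""
    return [json.loads(record["body"]) for record in event["Records"]]


def make_handler(save_rows):
    """Build a Lambda handler that writes each batch with one bulk call.

    save_rows is a hypothetical callable standing in for a bulk insert
    (for example a single multi-row INSERT).
    """
    def handler(event, context):
        rows = parse_records(event)
        if rows:
            save_rows(rows)  # one database round trip per invocation
        return {"processed": len(rows)}
    return handler
```

With this shape, the number of database round trips equals the number of invocations, so a larger batch size directly reduces the write count.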

Solution for delaying events for N days

We're currently writing an application in Microsoft Azure and we're planning to use Event Hubs to handle processing of real time events.
However, after initial processing we will have to delay further processing of the events for N days. The process will work like this:
Event triggered -> Place event in Event Hub -> Event gets fetched from Event Hub and processed -> Event should be delayed for X days -> Event gets further processed (the last two steps might be a loop)
How can we achieve this delay of further event processing without using polling or similar strategies? One idea is to use Azure Queues and their visibility timeout, but 7 days is the supported maximum according to the documentation, and our business demands are in the 1-3 month range at most. The number of events in our system should be at most 10k per day.
Any ideas would be appreciated, thanks!
As you already mentioned, Event Hubs retains data for only a 7-day window.
Event Hubs is typically used as a real-time telemetry data pipeline where data seek performance is critical. In 99.9% of use cases, our users need only the last couple of hours of data, if not seconds.
However, if you still need to re-analyze the data some time after the real-time processing is over (for example, to run a Hadoop job on last month's data), our seek pattern and store are not optimized for it. We recommend forwarding the messages to other data archival stores that are specialized for big-data queries.
Because data archival is something most of our customers naturally ask for, we are releasing a new feature that automatically archives the data in Avro format into Azure Storage.
