Event Hub - retry policy to keep retrying forever - Azure

As per my business case, when the processor application that processes the Event Hub encounters any transient fault (429, 449, 503, etc.), it should retry as many times as it takes to succeed, with exponential back-off, in order to avoid data loss.
Will the Event Hub lease expire if the processor thread sleeps for a long time while waiting out the exponential back-off between retries?
Is the above approach recommended? If not, how can data loss be avoided during 429s or other transient faults?
Note: throwing the message back to the Event Hub after a few retries is not an option in my case, as it breaks message ordering.

Related

Azure Event Hub - How to achieve infinite retry?

An Event Hub consumer needs to keep processing a received message until it succeeds, despite transient faults. How can this infinite retry be achieved while honoring the Event Hub partition lease expiry?
The business scenario here is not important; the approach for infinite retry (taking partition lease expiry into account) is what I'm looking for.
Note: I'm reading messages in batches, and processing any of them can hit a transient fault that needs a retry. So driving some logic with an 'offset' value may not be efficient, but I'm not sure whether anyone has achieved infinite retries by leveraging the offset value.
The consumer can retry on transient failures indefinitely until cancellation is requested. The lease won't expire just because a retry takes longer than expected.
Please check the API documentation for reference: https://learn.microsoft.com/en-us/dotnet/api/azure.messaging.eventhubs.processor.processeventargs?view=azure-dotnet
CancellationToken
A CancellationToken to indicate that the processor is requesting that the handler stop its activities. If this token is requesting cancellation, then either the processor is attempting to shutdown or ownership of the partition has changed.
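As an illustration (not part of the Event Hubs SDK), a minimal retry loop with capped exponential back-off that keeps going until cancellation is requested could look like the sketch below. Here operation, isTransient, and cancellation are placeholders for your own processing code, your own transient-fault check (429, 449, 503, etc.), and whatever cancellation signal your host exposes (the CancellationToken above, in the .NET processor's case).

// Sketch: retry an operation forever with capped exponential back-off,
// stopping only when cancellation is requested. All names are placeholders.
async function retryForever(operation, isTransient, cancellation, baseMs = 100, maxMs = 60000) {
  for (let attempt = 0; !cancellation.requested; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (!isTransient(err)) throw err; // only transient faults are retried
      const delayMs = Math.min(baseMs * 2 ** attempt, maxMs);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error('Cancellation requested before the operation succeeded');
}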

What should be done when the provisioned throughput is exceeded?

I'm using the AWS SDK for JavaScript (Node.js) to read data from a DynamoDB table. The auto scaling feature does a great job most of the time, and the consumed Read Capacity Units (RCU) are really low for most of the day. However, there's a scheduled job that runs around midnight which consumes about 10x the provisioned RCU, and since auto scaling takes some time to adjust the capacity, there are a lot of throttled read requests. Furthermore, I suspect my requests are not being completed (though I can't find any exceptions in my error log).
In order to handle this situation, I've considered increasing the provisioned RCU using the AWS API (updateTable) but calculating the number of RCU my application needs may not be straightforward.
So my second guess was to retry failed requests and simply wait for auto scaling to increase the provisioned RCU. As pointed out by the AWS docs and some Stack Overflow answers (particularly about ProvisionedThroughputExceededException):
The AWS SDKs for Amazon DynamoDB automatically retry requests that receive this exception. So, your request is eventually successful, unless the request is too large or your retry queue is too large to finish.
I've read similar questions (this one, this one and this one) but I'm still confused: is this exception raised only if the request is too large or the retry queue is too large to finish (i.e., after the automatic retries), or can it be raised before the retries?
Most importantly: is that the exception I should be expecting in my context (so I can catch it and retry until auto scaling increases the RCU)?
Yes.
Every time your application sends a request that exceeds your capacity, you get a ProvisionedThroughputExceededException from DynamoDB. However, your SDK handles this for you and retries. The default retry delay starts at 50 ms, the default number of retries is 10, and the backoff is exponential by default.
This means you get retries at:
50ms
100ms
200ms
400ms
800ms
1.6s
3.2s
6.4s
12.8s
25.6s
If after the 10th retry your request has still not succeeded, the SDK passes the ProvisionedThroughputExceededException back to your application and you can handle it how you like.
You could handle it by increasing the provisioned throughput, but another option is to change the default retry settings when you create the DynamoDB connection. For example:
new AWS.DynamoDB({maxRetries: 13, retryDelayOptions: {base: 200}});
This would mean you retry 13 times with an initial delay of 200 ms, giving your request up to 819.2 s (200 ms × 2^12) before the final retry rather than 25.6 s.
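If you also want to catch the exception once the SDK's own retries are exhausted and keep waiting for auto scaling to catch up, a sketch along these lines might work (AWS SDK for JavaScript v2; the table name, key, and application-level retry limit are made up for illustration):

const AWS = require('aws-sdk');

// Raise the SDK's built-in retry budget, as described above.
const dynamodb = new AWS.DynamoDB({ maxRetries: 13, retryDelayOptions: { base: 200 } });

// Application-level retry for the case where the SDK still surfaces
// ProvisionedThroughputExceededException after its own retries.
async function getItemWithRetry(params, attempt = 0) {
  try {
    return await dynamodb.getItem(params).promise();
  } catch (err) {
    if (err.code === 'ProvisionedThroughputExceededException' && attempt < 5) {
      const delayMs = 1000 * 2 ** attempt; // extra back-off while auto scaling catches up
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      return getItemWithRetry(params, attempt + 1);
    }
    throw err; // not transient, or we gave up
  }
}

// Example call (hypothetical table and key):
// getItemWithRetry({ TableName: 'MyTable', Key: { id: { S: '42' } } });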

Azure - Storage queue message renewal

We are using an Azure WebJob to process messages from an Azure Storage queue. After 5 unsuccessful attempts, messages are moved to the poison queue. Instead of that, I want to keep processing the message until it has been processed successfully.
Kindly assist me with this.
You can configure the maximum number of retries (default is 5) before a message is sent to the poison queue. You can also add an 'int dequeueCount' parameter to your method to check the number of times it has been called and base your decisions on that.
Having said that, you should definitely have some error-handling strategy in place. Just retrying indefinitely until you succeed is a recipe for failure.
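The question is about the .NET WebJobs SDK, but as a rough JavaScript analogue, an Azure Functions queue trigger (built on the same SDK) exposes the dequeue count and the poison-queue threshold in a similar way. A sketch, with hypothetical names and values:

// host.json (Functions v2+): raise the poison-queue threshold from the default 5.
// { "version": "2.0", "extensions": { "queues": { "maxDequeueCount": 10 } } }

// index.js: inspect the dequeue count before deciding what to do.
async function processMessage(item) {
  // your own processing logic goes here
}

module.exports = async function (context, queueItem) {
  const dequeueCount = context.bindingData.dequeueCount;
  context.log(`Attempt ${dequeueCount} for message: ${JSON.stringify(queueItem)}`);

  // Throwing from here makes the runtime retry the message until
  // maxDequeueCount is reached, after which it moves to the poison queue.
  await processMessage(queueItem);
};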

Azure Functions Event Hub trigger bindings

Just have a couple of questions regarding the usage of Azure Functions with an EventHub in an IoT scenario.
Event Hubs has partitions, and messages from a specific device typically go to the same partition. How are the instances of an Azure Function distributed across Event Hub partitions? Is it based on performance? If one instance of an Azure Function manages to process events from all partitions, is that enough, or might one end up with one Azure Function instance per Event Hub partition?
What about the read offset? Does this binding somehow record where it stopped reading the event stream? I thought functions are meant to be stateless, yet here we have some state.
Thanks
Each instance of an Event Hub-Triggered Function is backed by only 1 EventProcessorHost(EPH) instance. Event Hub ensures that only 1 EPH can get a lease on a given partition.
Answer to Question 1:
Let's elaborate on this with a contrived example. Suppose we begin with the following setup and assumptions for an EventHub:
10 partitions.
1000 events distributed evenly across all partitions => 100 messages in each partition.
When your Function is first enabled, there is only 1 instance of the Function. Let's call this Function instance Function_0. Function_0 will have 1 EPH that manages to get a lease on all 10 partitions. Let this EPH be called EPH_0, and it will start reading events from partitions 0-9. From this point forward, one of the following will happen:
Only 1 Function instance is needed - Function_0 is able to process all 1000 messages before Azure Functions' scaling logic kicks in. Hence, all 1000 messages are processed by Function_0.
Add 1 more Function instance - Azure Functions' scaling logic determines that Function_0 seems sluggish, so a new instance Function_1 is created, resulting in EPH_1. Event Hub detects that a new EPH instance is trying to read messages and starts load balancing the partitions across the EPH instances, e.g., partitions 0-4 are assigned to EPH_0 and partitions 5-9 are assigned to EPH_1. If all Function executions succeed without errors, both EPH_0 and EPH_1 checkpoint successfully and all 1000 messages are processed. Once check-pointing succeeds, those 1000 messages should never be retrieved again.
Add N more Function instances - Azure Functions' scaling logic determines that both Function_0 and Function_1 are still sluggish and repeats workflow 2 for Function_2...N, where N > 9. Event Hub will load balance the partitions across the Function_0...9 instances. Unique to Azure Functions' current scaling logic is the fact that N can be greater than the number of partitions. This is done to ensure that there are always EPH instances readily available to quickly get a lock on the partition(s). As a customer, you are only charged for the resources used when your Function instance executes; you are not charged for this over-provisioning.
Answer to Question 2:
EPH uses a check-pointing mechanism to mark the last known successfully read message. An Event Hub-triggered Function can be set up to process 1 message or a batch of messages at a time. The option you choose needs to consider the following:
1. Speed of message processing - Processing messages in batches instead of a single message at a time is one of the factors that will speed up the ability of your Azure Function workflow to keep up with the incoming messages in your Event Hub.
2. Tolerance for duplicates - If check-pointing fails due to errors in your Function code, a timeout, or a lost partition lease (updated Aug 24th, 2017), then the next EPH that gets a lease on that partition will start retrieving messages from the last known checkpoint. Event Hub guarantees at-least-once delivery but not at-most-once delivery. Azure Functions will not attempt to change that behavior. If not having duplicate messages is a priority, then you will need to mitigate it in your workflow. As such, when check-pointing fails, there are more duplicate messages to manage if your Function is processing messages at the batch level.
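As an illustration of the batch option, a sketch of an Event Hub-triggered Function in JavaScript (v3 programming model) follows; it assumes function.json declares an eventHubTrigger binding named eventHubMessages with "cardinality" set to "many":

// index.js -- sketch of a batch Event Hub trigger handler.
module.exports = async function (context, eventHubMessages) {
  for (const message of eventHubMessages) {
    context.log(`Processing event: ${JSON.stringify(message)}`);
    // If processing throws, events after the last successful checkpoint can be
    // redelivered (at-least-once), so the handler should be idempotent.
  }
};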
Function Apps are based on the WebJobs SDK, which uses EventProcessorHost to consume events from Event Hubs. So you can look up information about EventProcessorHost, and it will be applicable to your Function App.
In particular, you can find the implementation of IEventProcessor here.
To your questions:
Not sure what you mean by "one instance". One listener will be created per partition, but they can all be hosted inside a single App Plan instance if the load is low. At a high level, you should not care much: in the Consumption Plan you pay per execution time, no matter how many servers/processes/threads are running. Of course, you should care whether the auto-scaling works well enough for high load, but that needs to be tested anyway.
Functions are stateless in the sense that you can't save anything in memory between two function executions. You are totally fine saving state in external storage. The Function App will use PartitionContext.CheckpointAsync() to checkpoint the current offset. Azure Storage is used internally; again, you can read more about how that works in the Event Hubs and EventProcessorHost docs, e.g. here.

Status shows as never finished for one of my WebJobs while processing a message

I have a WebJob that processes a message only once by using the condition (DeliveryCount = 1), because I don't want another instance to process it if the lock time expired for the first WebJob. When another WebJob tries to process the message after the lock time has expired, the condition (DeliveryCount = 1) is not met, and it exits the method, which deletes the message from the queue automatically.
The problem here is that if the message ends up in a never-finished state (anything other than success), I won't have the message in the queue to process. How do I handle this situation?
I think part of the problem is that you're trying to use the MaxDeliveryCount property to prevent concurrent message processing:
MaxDeliveryCount
The max delivery count setting is not used to prevent multiple consumers from processing a message at the same time; it's used to prevent "poison messages" - messages whose contents prevent successful processing by any consumer and which would therefore otherwise be processed forever.
I recommend you determine exactly what it is you're trying to accomplish. If you want a simple competing consumers scenario where multiple webjobs consume messages from a single queue, then there are standard ways to accomplish that:
good description of competing consumers
competing consumers with Service Bus queues
You can use MaxDeliveryCount in conjunction with competing consumers... if you want to prevent poison messages you can set MaxDeliveryCount to something larger than 1 and still give other consumers a chance to process messages whose locks expire.
Azure Service Bus supports dead-lettering of poison messages that exceed max delivery count, so you're able to examine such messages offline... they aren't simply deleted forever.
You might also need to add code in your WebJobs to renew locks prior to their expiration... otherwise Service Bus can't differentiate between "valid messages that are taking a long time to process" and "poison messages that can't be processed". Without lock renewal, your long-running valid messages will be dead-lettered the same as poison messages, which is almost certainly not what you want.
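For reference, a sketch of those ideas (checking the delivery count, renewing the lock during long processing, and settling the message explicitly) using the @azure/service-bus package; the queue name, renewal interval, and doWork are placeholders:

const { ServiceBusClient } = require('@azure/service-bus');

async function doWork(body) {
  // your long-running processing goes here
}

async function main() {
  const client = new ServiceBusClient(process.env.SERVICEBUS_CONNECTION_STRING);
  const receiver = client.createReceiver('my-queue');

  receiver.subscribe(
    {
      processMessage: async (message) => {
        console.log(`deliveryCount = ${message.deliveryCount}`);

        // Renew the lock periodically so a slow-but-valid message isn't
        // mistaken for a poison message while it is being processed.
        const renewal = setInterval(
          () => receiver.renewMessageLock(message).catch(() => {}),
          20000
        );
        try {
          await doWork(message.body);
          await receiver.completeMessage(message);
        } catch (err) {
          // Abandoning increments the delivery count; once MaxDeliveryCount is
          // exceeded the message is dead-lettered rather than deleted.
          await receiver.abandonMessage(message);
        } finally {
          clearInterval(renewal);
        }
      },
      processError: async (args) => console.error(args.error),
    },
    { autoCompleteMessages: false }
  );
}

main().catch(console.error);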
Good luck!
