Azure Event Hub - How to achieve infinite retry?

An Event Hub consumer needs to keep processing a received message until it succeeds when transient faults occur. How can this infinite retry be achieved while honoring the Event Hub partition lease expiry?
The business scenario is not important here; what I'm looking for is an approach to infinite retry that takes partition lease expiry into account.
Note: I'm reading messages in batches, and processing any message can hit a transient fault that needs a retry. Driving the logic from an 'offset' value may not be efficient, but I'm not sure whether anyone has achieved infinite retries by leveraging the offset.

The consumer can retry transient failures indefinitely, until cancellation is requested. Note that the lease won't expire just because a retry takes longer than expected.
See the API documentation for reference: https://learn.microsoft.com/en-us/dotnet/api/azure.messaging.eventhubs.processor.processeventargs?view=azure-dotnet
CancellationToken
A CancellationToken to indicate that the processor is requesting that the handler stop its activities. If this token is requesting cancellation, then either the processor is attempting to shutdown or ownership of the partition has changed.
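The retry-until-cancelled idea can be sketched without any Azure dependency. This is a minimal illustration, not the SDK's behavior: `RetryUntilCancelled`, `TransientException`, and the delay parameters are hypothetical names, and a `BooleanSupplier` stands in for the .NET `CancellationToken` described above. It retries only faults marked as transient, doubles the delay up to a cap, and checks for cancellation before every attempt.

```java
import java.time.Duration;
import java.util.concurrent.CancellationException;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

final class RetryUntilCancelled {

    /** Hypothetical marker for faults worth retrying (for example 429 or 503). */
    static final class TransientException extends RuntimeException {
        TransientException(String message) {
            super(message);
        }
    }

    /**
     * Runs the operation until it succeeds or cancellation is requested.
     * Non-transient exceptions are not caught and propagate to the caller.
     */
    static <T> T run(Supplier<T> operation, BooleanSupplier cancelled,
                     Duration initialDelay, Duration maxDelay) {
        Duration delay = initialDelay;
        while (true) {
            // Assumption: "cancelled" turns true on shutdown or loss of partition ownership.
            if (cancelled.getAsBoolean()) {
                throw new CancellationException("cancellation requested");
            }
            try {
                return operation.get();
            } catch (TransientException e) {
                sleep(delay);
                // Exponential back-off, capped at maxDelay.
                Duration doubled = delay.multipliedBy(2);
                delay = doubled.compareTo(maxDelay) > 0 ? maxDelay : doubled;
            }
        }
    }

    private static void sleep(Duration delay) {
        try {
            Thread.sleep(delay.toMillis());
        } catch (InterruptedException e) {
            // Restore the interrupt flag and treat the interrupt as cancellation.
            Thread.currentThread().interrupt();
            throw new CancellationException("interrupted while backing off");
        }
    }
}
```

Because cancellation is only checked between attempts, this sketch can take up to `maxDelay` to notice a cancellation request; a smaller cap trades back-off length for responsiveness.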

Related

Event Hub - retry policy to continue retry forever

In my business case, if the processor application handling Event Hub events encounters a transient fault (429, 449, 503, etc.), the operation should be retried as many times as it takes to succeed, with exponential back-off, in order to avoid data loss.
Will the Event Hub lease expire if the processor thread sleeps for long periods while waiting between retries under exponential back-off?
Is the above approach recommended? If not, how can data loss be avoided during 429s or other transient faults?
Note: sending the message back to the Event Hub after a few retries is not an option in my case, as it would break message ordering.

EventProcessorClient - AmqpRetryOptions options behaviour

Here is our current scenario: listen to all the partitions of a given event hub, process each message based on its content, and re-process the same event (up to a configured number of times, say 3) if the initial processing fails internally.
Following the recommended guidelines for higher throughput, we are planning to use EventProcessorClient (the Azure SDK client for consuming events from Azure Event Hubs). It is initialized with EventProcessorClientBuilder, based on the docs.
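import com.azure.core.amqp.AmqpRetryMode;
import com.azure.core.amqp.AmqpRetryOptions;
import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import java.time.Duration;

EventProcessorClient eventProcessorClient = new EventProcessorClientBuilder()
    .consumerGroup("consumer-group")
    .checkpointStore(new BlobCheckpointStore(blobContainerAsyncClient))
    .processEvent(eventContext -> {
        // process eventData and throw error if the internal logic fails
    })
    .processError(errorContext -> {
        // printf takes %s placeholders, not {}
        System.out.printf("Error occurred in partition processor for partition %s, %s%n",
            errorContext.getPartitionContext().getPartitionId(),
            errorContext.getThrowable());
    })
    .connectionString(connectionString)
    // Fixed mode: up to 3 retries, 120 seconds apart
    .retry(new AmqpRetryOptions()
        .setMaxRetries(3).setMode(AmqpRetryMode.FIXED).setDelay(Duration.ofSeconds(120)))
    .buildEventProcessorClient();
eventProcessorClient.start();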
However, the retry logic doesn't kick in at all. After checking the documentation further, I wonder whether this only applies to an explicit instance of EventHubAsyncClient. Any suggestions or input on what is required to achieve this retry capability?
Thanks in advance.
"re-process (until configured no. of times - say 3) the same event if the initial processing fails internally."
That is not how retries work with the processor. The AmqpRetryOptions control how many times the client will retry service operations (aka operations that use the AMQP protocol), such as reading from the partition.
Once the processor delivers events, your application owns responsibility for the code that processes them - that includes error handling and retries.
The reason for this is that the EventProcessorClient does not have sufficient understanding of your application code and scenarios to determine the correct action to take in the face of an exception. For example, it has no way to know if processing is stateful and has been corrupted or is safe to retry.
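Since retries of processing code belong to the application, one way to get the "re-process up to 3 times" behavior is a small helper called from inside the processEvent handler. The sketch below is an assumption-laden illustration: `BoundedRetry`, `process`, and the `onExhausted` callback are hypothetical names, not part of the Azure SDK, and it assumes the handler is safe to run again after a failure.

```java
import java.util.function.Consumer;

final class BoundedRetry {

    /**
     * Runs the handler up to maxAttempts times.
     * Returns the attempt number that succeeded, or -1 if every attempt failed,
     * in which case onExhausted is invoked once with the event.
     */
    static <E> int process(E event, Consumer<E> handler, int maxAttempts,
                           Consumer<E> onExhausted) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(event);
                return attempt;
            } catch (RuntimeException e) {
                // Assumption: every runtime failure is retryable; real code
                // would inspect the exception and possibly wait before retrying.
            }
        }
        // The application decides what "giving up" means (log, park, alert).
        onExhausted.accept(event);
        return -1;
    }
}
```

Inside the processor this could be called as `BoundedRetry.process(eventContext.getEventData(), this::handle, 3, this::park)`, where `handle` and `park` are the application's own (hypothetical) methods.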

How does Azure Service Bus Queue guarantee at most once delivery?

According to this doc, Service Bus supports two modes: Receive-and-Delete and Peek-Lock.
In Peek-Lock mode, if the consumer crashes, hangs, or does a very long GC right after processing the message but before the message is "Completed", and the visibility time expires, there's a chance that the same message is delivered twice.
Then how can Microsoft say that Service Bus supports an at-most-once delivery mode? Is it because of the Receive-and-Delete mode, which sends each message only once? But then again, if something happens while the consumer is processing the message, that valuable info is lost.
If so, what is the best way to ensure exactly-once delivery using Azure Service Bus as the queue and Azure Functions as consumers?
P.S. One approach I can think of is storing MessageIDs in a blob, but since the number of MessageIDs could be very large in my case, storing and loading all of them is not the right approach.
Azure Functions will always consume Service Bus messages in Peek-Lock mode. Exactly Once delivery is basically not possible in the general case: there's always a chance that the consuming application will crash at the wrong time, just before completing the message, and then the message will be re-delivered.
You should strive to implement Effectively Once processing. This is usually achieved with an idempotent message processor.
Storing Message IDs (consumer-side de-duplication) is one option. You could have a policy to clean up old Message IDs to keep the size of such storage manageable. To make this 100% reliable, you would have to store the Message ID in the same transaction as the other modifications done by the processor.
Other options really depend on your processing scenario. Find a way to make it idempotent, so that processing the same message multiple times is functionally the same as processing it just once.
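Consumer-side de-duplication with a clean-up policy can be sketched as follows. This is an in-memory illustration only: `DedupingProcessor` and `processOnce` are hypothetical names, the "policy" is simply evicting the oldest ID once a cap is reached, and it does not give the transactional guarantee mentioned above, because the ID set is not stored together with the handler's side effects.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class DedupingProcessor {

    // Insertion-ordered set of message IDs that drops the oldest entry when full.
    private final Set<String> seen;

    DedupingProcessor(int maxRememberedIds) {
        this.seen = Collections.newSetFromMap(new LinkedHashMap<String, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxRememberedIds;
            }
        });
    }

    /** Returns true if the handler ran, false if the ID was already recorded. */
    synchronized boolean processOnce(String messageId, Runnable handler) {
        if (seen.contains(messageId)) {
            return false; // duplicate delivery: skip the business logic
        }
        handler.run();
        // Record the ID only after success, so a failed attempt can be redelivered.
        seen.add(messageId);
        return true;
    }
}
```

An ID that has been evicted is treated as new again, so the cap has to be large enough to cover the window in which redelivery can realistically happen.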

Status as never finished by one of my Webjob while processing the message

I have a webjob that processes a message only once by checking the condition (DeliveryCount = 1), because I don't want another instance to process it if the lock time expires on the first webjob. When another webjob tries to process the message after the lock time has expired, the condition (DeliveryCount = 1) is not met, so it exits the method, which automatically deletes the message from the queue.
The problem is that if the message ends up in a "never finished" state (anything other than success), the message is no longer in the queue to process. How should I handle this situation?
I think part of the problem is that you're trying to use the MaxDeliveryCount property to prevent concurrent message processing:
MaxDeliveryCount
The max delivery count setting is not used to prevent multiple consumers from processing a message at the same time, it's used to prevent "poison messages" where any consumer attempts to process a message whose contents prevent successful processing, and therefore the message would otherwise be processed forever.
I recommend you determine exactly what it is you're trying to accomplish. If you want a simple competing consumers scenario where multiple webjobs consume messages from a single queue, then there are standard ways to accomplish that:
good description of competing consumers
competing consumers with Service Bus queues
You can use MaxDeliveryCount in conjunction with competing consumers... if you want to prevent poison messages you can set MaxDeliveryCount to something larger than 1 and still give other consumers a chance to process messages whose locks expire.
Azure Service Bus supports dead-lettering of poison messages that exceed max delivery count, so you're able to examine such messages offline... they aren't simply deleted forever.
You might also need to add code in your webjobs to renew locks prior to their expiration... otherwise service bus can't differentiate between "valid messages that are taking a long time to process" and "poison messages that can't be processed". Without lock renewal your long-running valid messages will be dead-lettered the same as poison messages, which is almost certainly not what you want.
Good luck!

Azure queue - can I verify a message will be read only once?

I am using an Azure queue and have several different processes reading from the queue.
My system is built in a way that assumes each message is read only once.
This Microsoft article claims Azure queues have an at least once delivery guarantee which potentially means two processes can read the same message from the queue.
This StackOverflow thread claims that if I use GetMessage then the message becomes invisible to all other processes for the invisibility timeout.
Assuming I use GetMessage() and never exceed the message invisibility time before I DeleteMessage, can I assume I will get each message only once?
I think there is a property on the queue message named DequeueCount, which is the number of times the message has been dequeued. It's maintained by the queue service. I think you can use this property to identify whether your message has been read before.
https://learn.microsoft.com/en-us/dotnet/api/azure.storage.queues.models.queuemessage.dequeuecount?view=azure-dotnet
No. The following can happen:
GetMessage()
Add some records in a database...
Generate some files...
DeleteMessage() -> Unexpected failure (process that crashes, instance that reboots, network connectivity issues, ...)
In this case your logic was executed without calling DeleteMessage. This means, once the invisibility timeout expires, the message will appear in the queue and be processed once again. You will need to make sure that your process is idempotent:
Idempotence is the property of certain operations in mathematics and computer science, that they can be applied multiple times without changing the result beyond the initial application.
An alternative solution would be to use Service Bus Queues with the ReceiveAndDelete mode (see this page under How to Receive Messages from a Queue). If you receive the message, it will be marked as consumed and never appear again. This way you can be sure it is delivered At-Most-Once (see the comparison with Storage Queues here). But then again, if something happens while you are processing the message (e.g. the server crashes), you could lose valuable information.
Update:
The following order simulates At-Most-Once processing in storage queues. The message can arrive multiple times via GetMessage, but will only be processed once by your business logic (with the risk that some of your business logic will never execute).
GetMessage()
DeleteMessage()
AddRecordsToDatabase()
GenerateFiles()
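The difference between the two orderings can be seen in a small simulation. Everything here is hypothetical scaffolding rather than the Storage Queues API: `FakeQueue` is an in-memory stand-in whose un-deleted messages reappear when `expireInvisibility()` is called, and the boolean flags simulate a crash at the critical point.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class DeliveryOrderDemo {

    /** Hypothetical in-memory queue with a manual "invisibility timeout". */
    static final class FakeQueue {
        private final Deque<String> visible = new ArrayDeque<>();
        private final List<String> inFlight = new ArrayList<>();

        void add(String message) {
            visible.add(message);
        }

        /** Hides the message from other readers until it is deleted or expires. */
        String getMessage() {
            String message = visible.poll();
            if (message != null) {
                inFlight.add(message);
            }
            return message;
        }

        void deleteMessage(String message) {
            inFlight.remove(message);
        }

        /** Simulates the invisibility timeout: un-deleted messages reappear. */
        void expireInvisibility() {
            visible.addAll(inFlight);
            inFlight.clear();
        }
    }

    /** At-least-once: work first, delete last. A crash in between causes redelivery. */
    static void processThenDelete(FakeQueue queue, Runnable work, boolean crashBeforeDelete) {
        String message = queue.getMessage();
        if (message == null) {
            return;
        }
        work.run();
        if (crashBeforeDelete) {
            return; // simulated crash: DeleteMessage never runs
        }
        queue.deleteMessage(message);
    }

    /** At-most-once: delete first, work last. A crash in between loses the work. */
    static void deleteThenProcess(FakeQueue queue, Runnable work, boolean crashBeforeWork) {
        String message = queue.getMessage();
        if (message == null) {
            return;
        }
        queue.deleteMessage(message);
        if (crashBeforeWork) {
            return; // simulated crash: the business logic never runs
        }
        work.run();
    }
}
```

With a crash on the first attempt followed by `expireInvisibility()` and a second attempt, `processThenDelete` runs the work twice, while `deleteThenProcess` never runs it because the message is already gone.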
