EventProcessorClient - AmqpRetryOptions options behaviour - azure

Here is our current scenario: listen to all the partitions on a given event hub, process each message based on its content, and re-process the same event (up to a configured number of times, say 3) if the initial processing fails internally.
Following the recommended guidelines for higher throughput, we are planning to use EventProcessorClient (from the Azure SDK) to consume the events from Azure Event Hubs. It is initialized with EventProcessorClientBuilder, based on the docs.
EventProcessorClient eventProcessorClient = new EventProcessorClientBuilder()
    .consumerGroup("consumer-group")
    .checkpointStore(new BlobCheckpointStore(blobContainerAsyncClient))
    .processEvent(eventContext -> {
        // process eventData and throw error if the internal logic fails
    })
    .processError(errorContext -> {
        // printf takes %s placeholders, not the {} style used by logging frameworks
        System.out.printf("Error occurred in partition processor for partition %s, %s%n",
            errorContext.getPartitionContext().getPartitionId(),
            errorContext.getThrowable());
    })
    .connectionString(connectionString)
    .retry(new AmqpRetryOptions()
        .setMaxRetries(3).setMode(AmqpRetryMode.FIXED).setDelay(Duration.ofSeconds(120)))
    .buildEventProcessorClient();
eventProcessorClient.start();
However, the retry logic doesn't kick in at all. Checking the documentation further, I wonder if this is only applicable to an explicit instance of EventHubAsyncClient. Any suggestions or inputs on what is required to achieve this retry capability?
Thanks in advance.

"re-process (until configured no. of times - say 3) the same event if the initial processing fails internally."
That is not how retries work with the processor. The AmqpRetryOptions control how many times the client will retry service operations (aka operations that use the AMQP protocol), such as reading from the partition.
Once the processor delivers events, your application owns responsibility for the code that processes them - that includes error handling and retries.
The reason for this is that the EventProcessorClient does not have sufficient understanding of your application code and scenarios to determine the correct action to take in the face of an exception. For example, it has no way to know if processing is stateful and has been corrupted or is safe to retry.
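Since retries are the application's responsibility, one option is to wrap the processing logic in a small retry helper and call it from inside the processEvent handler. The following is a minimal sketch rather than anything the SDK provides: RetryingHandler and processWithRetry are hypothetical names, and it assumes the processing logic is safe to repeat and signals failure with an unchecked exception.

```java
import java.util.function.Consumer;

/** Hypothetical helper (not part of the Azure SDK): bounded retries around an event handler. */
final class RetryingHandler {
    private RetryingHandler() {}

    /**
     * Runs the handler up to maxAttempts times.
     * Returns the number of the attempt that succeeded, or rethrows the last failure.
     */
    static <T> int processWithRetry(T event, Consumer<T> handler, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(event);
                return attempt;
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt failed
    }
}
```

Inside processEvent, the handler body would then become something like RetryingHandler.processWithRetry(eventContext, ctx -> handle(ctx), 3), where handle is the application's own method. What to do once the attempts are exhausted (skip, dead-letter, stop the processor) remains an application decision.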

Related

Azure Eventhub - how to resend/replay the same batch of events to the same Eventhub client

Let's say a transient fault occurs while processing a batch of events from Azure Event Hubs, and the fault persists even after retries. What kind of exception can the processor throw back to Event Hubs so that it sends the same batch of events again (replay) to the same processor instance, without moving the checkpoint forward, for reprocessing?
Package "Microsoft.Azure.WebJobs.Extensions.EventHubs" Version="5.1.2"
The client is an Azure Function running on AKS with a KEDA scaling configuration.
In a nutshell - there is no exception that your code can throw that would stop checkpointing in an Azure Function app.
Functions has a unique model in that checkpointing is automatic and it moves forward regardless of success or failure of the processing code that is executing.
The most common pattern for when your Function encounters an error where it cannot process an event is to dead-letter it so that another application can revisit and process it in the future. More detail and other patterns to consider can be found in the article Resilient Event Hubs and Functions design.
If none of these work in your application scenario, your best path forward would likely be to consider not using Functions as a host platform and use the event processor directly in your application. That would give you full control over when checkpoints are emitted.
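The dead-letter pattern mentioned above can be sketched independently of any host platform. This is an illustration under stated assumptions, not the approach from the linked article: DeadLetteringHandler and DeadLetter are hypothetical names, and an in-memory list stands in for a durable destination such as a queue, a second event hub, or blob storage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical wrapper: sets failed events aside instead of letting the failure escape. */
final class DeadLetteringHandler<T> {
    /** A failed event plus the reason it was set aside. */
    static final class DeadLetter<E> {
        final E event;
        final String reason;

        DeadLetter(E event, String reason) {
            this.event = event;
            this.reason = reason;
        }
    }

    private final Consumer<T> handler;
    // Assumption: an in-memory list stands in for a durable dead-letter destination.
    private final List<DeadLetter<T>> deadLetters = new ArrayList<>();

    DeadLetteringHandler(Consumer<T> handler) {
        this.handler = handler;
    }

    /** Returns true if the event was processed, false if it was dead-lettered. */
    boolean handle(T event) {
        try {
            handler.accept(event);
            return true;
        } catch (RuntimeException e) {
            deadLetters.add(new DeadLetter<>(event, e.toString()));
            return false;
        }
    }

    List<DeadLetter<T>> deadLetters() {
        return deadLetters;
    }
}
```

Because the wrapper never rethrows, the stream keeps moving while the failed events stay available for another application to revisit later.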

EventHub data bursty with long pauses

I'm seeing multi-second pauses in the event stream, even reading from the retention pool.
Here's the main nugget of EH setup:
BlobContainerClient storageClient = new BlobContainerClient(blobcon, BLOB_NAME);
RTMTest.eventProcessor = new EventProcessorClient(storageClient, consumerGroup, ehubcon, EVENTHUB_NAME);
And then the do nothing processor:
static async Task processEventHandler(ProcessEventArgs eventArgs)
{
    RTMTest.eventsPerSecond++;
    RTMTest.eventCount++;
    if ((RTMTest.eventCount % 16) == 0)
    {
        await eventArgs.UpdateCheckpointAsync(eventArgs.CancellationToken);
    }
}
And then a typical execution:
15:02:23: no events
15:02:24: no events
15:02:25: reqs=643
15:02:26: reqs=656
15:02:27: reqs=1280
15:02:28: reqs=2221
15:02:29: no events
15:02:30: no events
15:02:31: no events
15:02:32: no events
15:02:33: no events
15:02:34: no events
15:02:35: no events
15:02:36: no events
15:02:37: no events
15:02:38: no events
15:02:39: no events
15:02:40: no events
15:02:41: no events
15:02:42: no events
15:02:43: no events
15:02:44: reqs=3027
15:02:45: reqs=3440
15:02:47: reqs=4320
15:02:48: reqs=9232
15:02:49: reqs=4064
15:02:50: reqs=395
15:02:51: no events
15:02:52: no events
15:02:53: no events
The event hub, blob storage and RTMTest webjob are all in US West 2. The event hub has 16 partitions. It's correctly calling my handler, as evidenced by the bursts of data. The error handler is not called.
Here are two applications side by side, left using Redis, right using Event Hub. The events turn into the animations so you can visually watch the long stalls. Note: these are vaccines being reported around the US, either live or via batch reconciliations from the pharmacies.
[Image: vaccine reporting animations]
Any idea why I see the multi-second stalls?
Thanks.
Event Hubs consumers make use of a prefetch queue when reading. This is essentially a local cache of events that the consumer tries to keep full by streaming in continually from the service. To prioritize throughput and avoid waiting on the network, consumers read exclusively from prefetch.
The pattern that you're describing falls into the "many smaller events" category, which will often drain the prefetch quickly if event processing is also quick. If your application is reading more quickly than the prefetch can refill, reads will start to take longer and return fewer events, as it waits on network operations.
One thing that may help is to test using higher values for PrefetchCount and CacheEventCount in the options when creating your processor. These default to a prefetch of 300 and a cache event count of 100. You may want to try testing with something like 750/250 and see what happens. We recommend keeping at least a 3:1 ratio of prefetch to cache event count.
It is also possible that your processor is being asked to do more work than is recommended for consistent performance across all partitions it owns. There's good discussion of different behaviors in the Troubleshooting Guide, and ultimately, capturing a +/- 5-minute slice of the SDK logs described here would give us the best view of what's going on. That's more detail and requires more back-and-forth discussion than works well on StackOverflow; I'd invite you to open an issue in the Azure SDK repository if you go down that path.
Something to keep in mind is that Event Hubs is optimized to maximize overall throughput and not for minimizing latency for individual events. The service offers no SLA for the time between when an event is received by the service and when it becomes available to be read from a partition.
When the service receives an event, it acknowledges receipt to the publisher and the send call completes. At this point, the event still needs to be committed to a partition. Until that process is complete, it isn't available to be read. Normally, this takes milliseconds but may occasionally take longer for the Standard tier because it is a shared instance. Transient failures, such as a partition node being rebooted/migrated, can also impact this.
With your near-real-time reading, you may be processing quickly enough that there's nothing client-side that will help. In this case, you'd need to consider adding more TUs, moving to a Premium/Dedicated tier, or using more partitions to increase concurrency.
Update:
For those interested without access to the chat, log analysis shows a pattern of errors that indicates that either the host owns too many partitions and load balancing is unhealthy or there is a rogue processor running in the same consumer group but not using the same storage container.
In either case, partition ownership is bouncing frequently causing them to stop, move to a new host, reinitialize, and restart - only to stop and have to move again.
I've suggested reading through the Troubleshooting Guide, as this scenario and some of the other symptoms are discussed in detail.
I've also suggested reading through the samples for the processor - particularly Event Processor Configuration and Event Processor Handlers. Each has guidance around processor use and configuration that should be followed to maximize throughput.
@Jesse very patiently examined my logs and led me to the "duh" moment of realizing I just needed a separate consumer group for this second application of the Event Hub data. Now things are rock solid. Thanks Jesse!

How does Azure Service Bus Queue guarantee at most once delivery?

According to this doc service bus supports two modes Receive-and-Delete and Peek-Lock.
In Peek-Lock mode, if the consumer crashes, hangs, or hits a very long GC pause right after processing the message but before the message is marked "Completed", and the visibility time expires, there's a chance that the same message is delivered twice.
Then how can Microsoft say that Service Bus supports an at-most-once delivery mode? Is it because of the Receive-and-Delete mode, which sends each message only once? But then again, if something happens while a consumer is processing the message, that valuable info is lost.
If yes, what is the best way to ensure exactly-once delivery using Azure Service Bus as the queue and Azure Functions as the consumers?
P.S. The one approach I can think of is storing MessageIDs in blob storage, but since in my case the number of MessageIDs could be very large, storing and loading all of them is not the right approach.
Azure Functions will always consume Service Bus messages in Peek-Lock mode. Exactly Once delivery is basically not possible in the general case: there's always a chance that the consuming application will crash at the wrong time, just before completing the message, and then the message will be re-delivered.
You should strive to implement Effectively Once processing. This is usually achieved with an idempotent message processor.
Storing MessageIDs (consumer-side de-duplication) is one option. You could have a policy to clean up old Message IDs to keep the size of such storage manageable. To make this 100% reliable you would have to store the Message ID in the same transaction as the other modifications done by the processor.
Other options really depend on your processing scenario. Find a way to make it idempotent, so that processing the same message multiple times is functionally the same as processing it just once.
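As a rough illustration of consumer-side de-duplication with a clean-up policy, here is a minimal sketch that keeps only the most recent message IDs. SeenMessageIds and markIfNew are hypothetical names, and the store is in-memory only: it does not survive a restart and does not give the transactional guarantee described above, so treat it as a picture of the bounded-window idea rather than a complete solution.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded store of recently seen message IDs (consumer-side de-duplication). */
final class SeenMessageIds {
    private final Map<String, Boolean> seen;

    SeenMessageIds(final int capacity) {
        // Insertion-ordered map that evicts the oldest ID once capacity is exceeded.
        this.seen = new LinkedHashMap<String, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true the first time an ID is seen, false for a duplicate still in the window. */
    synchronized boolean markIfNew(String messageId) {
        return seen.putIfAbsent(messageId, Boolean.TRUE) == null;
    }
}
```

A processor would call markIfNew with the message ID before doing any work and skip the message when it returns false. A redelivery that arrives after its ID has been evicted is processed again, so the window size has to cover the longest redelivery delay you expect.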

NodeJS with Redis message queue - How to set multiple consumers (threads)

I have a Node.js project that exposes a simple REST API for an external web application. This webhook must cope with a large number of requests per second and must return 200 OK to the caller very quickly. To achieve that, I am investigating a simple Redis queue: each request is enqueued and handled asynchronously later on (via a consumer thread).
The redis simple queue seems like an easy way to achieve this task (https://github.com/smrchy/rsmq)
1) Is rsmq.receiveMessage() { ....... } a blocking method? If this handler is slow, will it impact my server's performance?
2) If the answer to question 1 is yes, is it recommended to move the consumption of the messages to an external microservice (a dedicated consumer)? What are the best practices for creating multi-threaded consumers in such an environment?
You can use pubsub feature provided by redis https://redis.io/topics/pubsub
You can publish to various channels without any knowledge of the subscribers. Subscribers can subscribe to the channels they wish.
sreeni
1) No, it won't block the event loop, however you will only start processing a second message once you call the "next" method, i.e., you will process one message at a time. To overcome this, you can start multiple workers in parallel. Take a look here: https://stackoverflow.com/a/45984677/7201847
2) That's an architectural decision that depends on the load you have to support and the hardware capacity you have. I would recommend at least two Node.js processes, one for adding the messages to the queue and another one to actually processing them, with the option to start additional worker processes if needed, depending on the results of your performance tests.

C# Masstransit how to handle exception when the queue is not available or down

I am using MassTransit with RabbitMQ to consume messages from a queue. Can anyone tell me how to handle the exception when the queue is down or not available to get the message? The following is my setup:
var busControl = Bus.Factory.CreateUsingRabbitMq(cfg =>
{
    var host = cfg.Host(new Uri(configManager.RabbitMqUrl), h =>
    {
        h.Username(configManager.RabbitMqUserName);
        h.Password(configManager.RabbitMqPassword);
    });
    cfg.ReceiveEndpoint(host, RabbitMqConstants.Change, e =>
    {
        e.UseRetry(Retry.Immediate(configManager.ProcessorRetryNumber));
        e.Handler<ChangeDetected>(context =>
        {
            var task = Task.Run(() => consumer.Consume(context));
            return task;
        });
    });
});
Thanks
In our in-house RabbitMQ messaging implementation, we have approached/solved the publishing side of this (broker not available when we want to publish) in two ways:
[1] We use Polly to asynchronously orchestrate a limited number of publishing retries (with delay between tries). This overcomes situations where loss of connectivity to the broker is a minor network blip.
[2] If all publish retries fail, we use a 'message hospital' concept: we store enough detail about the failed-to-publish message to an alternative source (database; with additional failover to local file store), such that we can republish the failed messages later, if desired. A variant on 'store and forward' (we can republish in bulk, but we also allow manual intervention to choose whether to republish).
All depends how important it is to you 'never to lose a message'. Some redundancy of RabbitMQ brokers (clustering as Chris Patterson suggests or federation) is also an obvious step. Clustering/federation gives you protection if you lose/want to do maintenance on one/some of your brokers. The resilience strategies [1], [2] above give you protection if for some reason the message publisher can't see any RabbitMQ broker (for example network fault nearer the publisher).
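Strategies [1] and [2] can be sketched together in a few lines. This is a hedged illustration rather than the implementation described above: it is written in Java instead of C# with Polly, HospitalPublisher is a hypothetical name, the broker is modelled as a callback that throws on failure, and an in-memory queue stands in for the database or file store.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical publisher: bounded retries with a delay, then a 'message hospital'. */
final class HospitalPublisher {
    private final Consumer<String> broker; // assumed publish call; throws when unreachable
    private final int maxAttempts;
    private final long delayMillis;
    // Assumption: an in-memory queue stands in for a database or local file store.
    private final Queue<String> hospital = new ArrayDeque<>();

    HospitalPublisher(Consumer<String> broker, int maxAttempts, long delayMillis) {
        this.broker = broker;
        this.maxAttempts = maxAttempts;
        this.delayMillis = delayMillis;
    }

    /** Returns true if published; otherwise parks the message and returns false. */
    boolean publish(String message) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                broker.accept(message);
                return true;
            } catch (RuntimeException e) {
                if (attempt < maxAttempts && !pause()) {
                    break; // interrupted while waiting: stop retrying
                }
            }
        }
        hospital.add(message);
        return false;
    }

    /** Republishes parked messages once; returns how many succeeded. */
    int republishAll() {
        int published = 0;
        for (int remaining = hospital.size(); remaining > 0; remaining--) {
            String message = hospital.poll();
            try {
                broker.accept(message);
                published++;
            } catch (RuntimeException e) {
                hospital.add(message); // still failing: keep it parked
            }
        }
        return published;
    }

    int parkedCount() {
        return hospital.size();
    }

    /** Sleeps for the configured delay; returns false if the thread was interrupted. */
    private boolean pause() {
        try {
            Thread.sleep(delayMillis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Keeping republishAll as an explicit call mirrors the option of manual intervention: an operator or a scheduled job decides when parked messages go back to the broker.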
For receiving messages, MassTransit will automatically reconnect to the broker (RabbitMQ) when it comes back online. For sending messages, if your application is unable to connect to the broker to send, that's another problem entirely.
When using messaging in applications, it often becomes the single most important aspect of your infrastructure. So if you need high availability, then a cluster setup may be in your future (there are articles on clustering RabbitMQ out there).
MassTransit does not have any store-and-forward concepts in it, the broker needs to be available. While a few options have been discussed, nothing is concrete at this point nor generally available.
After reading its documentation, I realized that MassTransit does not handle situations where the producer fails to send/publish to the MQ or the consumer fails to send back the ACK.
So I had to go with another tool, CAP, which implements a local transaction table. You can put the send-message action within the same local DB transaction as your business code. The downside is that CAP does not have sagas implemented yet.
Otherwise, you have to implement a durable outbox pattern with the local transaction table by yourself.

Resources