IEventProcessor processes message two times - azure

Hi I am using Event hub for ingesting data at the frequency of 1 second.
I am continuously pushing simulated data from console application to event hub and then storing into the SQL data base.
Now its been more than 5 days and I found every day some times my receiver process data two times that why i got duplicate records into the database.
Since it happen only once or twice in a day so I am not even able to trace.
Can any one faced such situation so far ?
Or is it possible then host can process same messages twice ?
Or is it an issue of async behavior of receiver ?
Looking forward for the help....
Code snippet :
public class SimpleEventProcessor : IEventProcessor
{
Stopwatch checkpointStopWatch;
async Task IEventProcessor.CloseAsync(PartitionContext context, CloseReason reason)
{
Console.WriteLine("Processor Shutting Down. Partition '{0}', Reason: '{1}'.", context.Lease.PartitionId, reason);
if (reason == CloseReason.Shutdown)
{
await context.CheckpointAsync();
}
}
Task IEventProcessor.OpenAsync(PartitionContext context)
{
Console.WriteLine("SimpleEventProcessor initialized. Partition: '{0}', Offset: '{1}'", context.Lease.PartitionId, context.Lease.Offset);
this.checkpointStopWatch = new Stopwatch();
this.checkpointStopWatch.Start();
return Task.FromResult<object>(null);
}
async Task IEventProcessor.ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
{
foreach (EventData eventData in messages)
{
string data = Encoding.UTF8.GetString(eventData.GetBytes());
// store data into SQL database / database call.
}
// Call checkpoint every 5 minutes, so that worker can resume processing from 5 minutes back if it restarts.
if (this.checkpointStopWatch.Elapsed > TimeSpan.FromMinutes(0))
{
await context.CheckpointAsync();
this.checkpointStopWatch.Restart();
}
if (messages.Count() > 0)
await context.CheckpointAsync();
}
}

Event Hub guarantees at least once delivery:
It has the following characteristics:
low latency
capable of receiving and processing millions of events per second
at least once delivery
So you can expect this to happen.
Also take in account the situation that checkpointing just has occurred, then some more message (lets call them A and B) are processed and then the process fails. The next time the reading process is started again after the failure message consumption will start at the last checkpointed message, so in other words, message A and B will be processed again.

Related

Event Hub Consumer in Service Fabric

I'm trying to get a service fabric to consistently pull messages from an azure event hub. I seem to have everything wired up but notice that my consumer just stops pulling events.
I have a hub with a couple thousand events I've pushed to it. Configured the hub with 1 partition and have my service fabric service with also only 1 partition to ease debugging.
Service starts, creates the EventHubClient, from there uses it to create a PartitionReceiver. The receiver is passed to an "EventLoop" that enters an "infinite" while that calls receiver.ReceiveAsync. The code for the EventLoop is below.
What I am observing is the first time through the loop I almost always get 1 message. Second time through I get somewhere between 103 and 200ish messages. After that, I get no messages. Also seems like if I restart the service, I get the same messages again - but that's because when I restart the service I'm having it start back at the beginning of the stream.
Would expect this to keep running until my 2000 messages were consumed and then it would wait for me (polling ocassionally).
Is there something specific I need to do with the Azure.Messaging.EventHubs 5.3.0 package to make it keep pulling events?
//Here is how I am creating the EventHubClient:
var connectionString = "something secret";
var connectionStringBuilder = new EventHubsConnectionStringBuilder(connectionString)
{
EntityPath = "NameOfMyEventHub"
};
try
{
m_eventHubClient = EventHubClient.Create(connectionStringBuilder);
}
//Here is how I am getting the partition receiver
var receiver = m_eventHubClient.CreateReceiver("$Default", m_partitionId, EventPosition.FromStart());
//The event loop which the receiver is passed to
private async Task EventLoop(PartitionReceiver receiver)
{
m_started = true;
while (m_keepRunning)
{
var events = await receiver.ReceiveAsync(m_options.BatchSize, TimeSpan.FromSeconds(5));
if (events != null) //First 2/3 times events aren't null. After that, always null and I know there are more in the partition/
{
var eventsArray = events as EventData[] ?? events.ToArray();
m_state.NumProcessedSinceLastSave += eventsArray.Count();
foreach (var evt in eventsArray)
{
//Process the event
await m_options.Processor.ProcessMessageAsync(evt, null);
string lastOffset = evt.SystemProperties.Offset;
if (m_state.NumProcessedSinceLastSave >= m_options.BatchSize)
{
m_state.Offset = lastOffset;
m_state.NumProcessedSinceLastSave = 0;
await m_state.SaveAsync();
}
}
}
}
m_started = false;
}
**EDIT, a question was asked on the number of partitions. The event hub has a single partition and the SF service also has a single one.
Intending to use service fabric state to keep track of my offset into the hub, but that's not the concern for now.
Partition listeners are created for each partition. I get the partitions like this:
public async Task StartAsync()
{
// slice the pie according to distribution
// this partition can get one or more assigned Event Hub Partition ids
string[] eventHubPartitionIds = (await m_eventHubClient.GetRuntimeInformationAsync()).PartitionIds;
string[] resolvedEventHubPartitionIds = m_options.ResolveAssignedEventHubPartitions(eventHubPartitionIds);
foreach (var resolvedPartition in resolvedEventHubPartitionIds)
{
var partitionReceiver = new EventHubListenerPartitionReceiver(m_eventHubClient, resolvedPartition, m_options);
await partitionReceiver.StartAsync();
m_partitionReceivers.Add(partitionReceiver);
}
}
When the partitionListener.StartAsync is called, it actually creates the PartitionListener, like this (it's actually a bit more than this, but the branch taken is this one:
m_eventHubClient.CreateReceiver(m_options.EventHubConsumerGroupName, m_partitionId, EventPosition.FromStart());
Thanks for any tips.
Will
How many partition do you have? I can't see in your code how you make sure you read all partitions in the default consumer group.
Any specific reason why you are using PartitionReceiver instead of using an EventProcessorHost?
To me, SF seems like a perfect fit for using the event processor host. I see there is already a SF integrated solution that uses stateful services for checkpointing.

About checkpoint strategy in event hub processor

I use event hubs processor host to receive and process the events from event hubs. For better performance, I call checkpoint every 3 minutes instead of every time when receiving the events:
public async Task ProcessEventAsync(context, messages)
{
foreach (var eventData in messages)
{
// do something
}
if (checkpointStopWatth.Elapsed > TimeSpan.FromMinutes(3);
{
await context.CheckpointAsync();
}
}
But the problem is, that there might be some events never being checkpoint if not new events sending to event hubs, as the ProcessEventAsync won't be invoked if no new messages.
Any suggestions to make sure all processed events being checkpoint, but still checkpoint every several mins?
Update: Per Sreeram's suggestion, I updated the code as below:
public async Task ProcessEventAsync(context, messages)
{
foreach (var eventData in messages)
{
// do something
}
this.lastProcessedEventsCount += messages.Count();
if (this.checkpointStopWatth.Elapsed > TimeSpan.FromMinutes(3);
{
this.checkpointStopWatch.Restart();
if (this.lastProcessedEventsCount > 0)
{
await context.CheckpointAsync();
this.lastProcessedEventsCount = 0;
}
}
}
Great case - you are covering!
You could experience loss of event checkpoints (and as a result event replay) in the below 2 cases:
when you have sparse data flow (for ex: a batch of messages every 5 mins and your checkpoint interval is 3 mins) and EventProcessorHost instance closes for some reason - you could see 2 min of EventData - re-processing. To handle that case,
Keep track of the lastProcessedEvent after completing IEventProcessor.onEvents/IEventProcessor.ProcessEventsAsync & checkpoint when you get notified on close - IEventProcessor.onClose/IEventProcessor.CloseAsync.
There might just be a case when - there are no more events to a specific EventHubs partition. In this case, you would never see the last event being checkpointed - with your Checkpointing strategy. However, this is uncommon, when you have continuous flow of EventData and you are not sending to specific EventHubs partition (EventHubClient.send(EventData_Without_PartitionKey)). If you think - you could run into this situation, use the:
EventProcessorOptions.setInvokeProcessorAfterReceiveTimeout(true); // in java or
EventProcessorOptions.InvokeProcessorAfterReceiveTimeout = true; // in C#
flag to wake up the processEventsAsync every so often. Then, keep track of, LastProcessedEventData and LastCheckpointedEventData and make a judgement whether to checkpoint when no Events are received, based on EventData.SequenceNumber property on those events.

Getting Data from EventHub is delayed

I have an EventHub configured in Azure, also a consumer group for reading the data. It was working fine for some days. Suddenly, I see there is a delay in incoming data(around 3 days). I use Windows Service to consume data in my server. I have around 500 incoming messages per minute. Can anyone help me out to figure this out ?
It might be that you are processing them items too slow. Therefore the work to be done grows and you will lag behind.
To get some insight in where you are in the event stream you can use code like this:
private void LogProgressRecord(PartitionContext context)
{
if (namespaceManager == null)
return;
var currentSeqNo = context.Lease.SequenceNumber;
var lastSeqNo = namespaceManager.GetEventHubPartition(context.EventHubPath, context.ConsumerGroupName, context.Lease.PartitionId).EndSequenceNumber;
var delta = lastSeqNo - currentSeqNo;
logWriter.Write(
$"Last processed seqnr for partition {context.Lease.PartitionId}: {currentSeqNo} of {lastSeqNo} in consumergroup '{context.ConsumerGroupName}' (lag: {delta})",
EventLevel.Informational);
}
the namespaceManager is build like this:
namespaceManager = NamespaceManager.CreateFromConnectionString("Endpoint=sb://xxx.servicebus.windows.net/;SharedAccessKeyName=yyy;SharedAccessKey=zzz");
I call this logging method in the CloseAsync method:
public Task CloseAsync(PartitionContext context, CloseReason reason)
{
LogProgressRecord(context);
return Task.CompletedTask;
}
logWriter is just some logging class I have used to write info to blob storage.
It now outputs messages like
Last processed seqnr for partition 3: 32780931 of 32823804 in consumergroup 'telemetry' (lag: 42873)
so when the lag is very high you could be processing events that have occurred a long time ago. In that case you need to scale up/out your processor.
If you notice a lag you should measure how long it takes to process a given number of item. You can then try to optimize performance and see whether it improves. We did it like:
public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> events)
{
try
{
stopwatch.Restart();
// process items here
stopwatch.Stop();
await CheckPointAsync(context);
logWriter.Write(
$"Processed {events.Count()} events in {stopwatch.ElapsedMilliseconds}ms using partition {context.Lease.PartitionId} in consumergroup {context.ConsumerGroupName}.",
EventLevel.Informational);
}
}

How to parallelize an azure worker role?

I have got a Worker Role running in azure.
This worker processes a queue in which there are a large number of integers. For each integer I have to do processings quite long (from 1 second to 10 minutes according to the integer).
As this is quite time consuming, I would like to do these processings in parallel. Unfortunately, my parallelization seems to not be efficient when I test with a queue of 400 integers.
Here is my implementation :
public class WorkerRole : RoleEntryPoint {
private readonly CancellationTokenSource cancellationTokenSource = new CancellationTokenSource();
private readonly ManualResetEvent runCompleteEvent = new ManualResetEvent(false);
private readonly Manager _manager = Manager.Instance;
private static readonly LogManager logger = LogManager.Instance;
public override void Run() {
logger.Info("Worker is running");
try {
this.RunAsync(this.cancellationTokenSource.Token).Wait();
}
catch (Exception e) {
logger.Error(e, 0, "Error Run Worker: " + e);
}
finally {
this.runCompleteEvent.Set();
}
}
public override bool OnStart() {
bool result = base.OnStart();
logger.Info("Worker has been started");
return result;
}
public override void OnStop() {
logger.Info("Worker is stopping");
this.cancellationTokenSource.Cancel();
this.runCompleteEvent.WaitOne();
base.OnStop();
logger.Info("Worker has stopped");
}
private async Task RunAsync(CancellationToken cancellationToken) {
while (!cancellationToken.IsCancellationRequested) {
try {
_manager.ProcessQueue();
}
catch (Exception e) {
logger.Error(e, 0, "Error RunAsync Worker: " + e);
}
}
await Task.Delay(1000, cancellationToken);
}
}
}
And the implementation of the ProcessQueue:
public void ProcessQueue() {
try {
_queue.FetchAttributes();
int? cachedMessageCount = _queue.ApproximateMessageCount;
if (cachedMessageCount != null && cachedMessageCount > 0) {
var listEntries = new List<CloudQueueMessage>();
listEntries.AddRange(_queue.GetMessages(MAX_ENTRIES));
Parallel.ForEach(listEntries, ProcessEntry);
}
}
catch (Exception e) {
logger.Error(e, 0, "Error ProcessQueue: " + e);
}
}
And ProcessEntry
private void ProcessEntry(CloudQueueMessage entry) {
try {
int id = Convert.ToInt32(entry.AsString);
Service.GetData(id);
_queue.DeleteMessage(entry);
}
catch (Exception e) {
_queueError.AddMessage(entry);
_queue.DeleteMessage(entry);
logger.Error(e, 0, "Error ProcessEntry: " + e);
}
}
In the ProcessQueue function, I try with different values of MAX_ENTRIES: first =20 and then =2.
It seems to be slower with MAX_ENTRIES=20, but whatever the value of MAX_ENTRIES is, it seems quite slow.
My VM is a A2 medium.
I really don't know if I do the parallelization correctly ; maybe the problem comes from the worker itself (which may be it is hard to have this in parallel).
You haven't mentioned which Azure Messaging Queuing technology you are using, however for tasks where I want to process multiple messages in parallel I tend to use the Message Pump Pattern on Service Bus Queues and Subscriptions, leveraging the OnMessage() method available on both Service Bus Queue and Subscription Clients:
QueueClient OnMessage() - https://msdn.microsoft.com/en-us/library/microsoft.servicebus.messaging.queueclient.onmessage.aspx
SubscriptionClient OnMessage() - https://msdn.microsoft.com/en-us/library/microsoft.servicebus.messaging.subscriptionclient.onmessage.aspx
An overview of how this stuff works :-) - http://fabriccontroller.net/blog/posts/introducing-the-event-driven-message-programming-model-for-the-windows-azure-service-bus/
From MSDN:
When calling OnMessage(), the client starts an internal message pump
that constantly polls the queue or subscription. This message pump
consists of an infinite loop that issues a Receive() call. If the call
times out, it issues the next Receive() call.
This pattern allows you to use a delegate (or anonymous function in my preferred case) that handles the receipt of the Brokered Message instance on a separate thread on the WaWorkerHost process. In fact, to increase the level of throughput, you can specify the number of threads that the Message Pump should provide, thereby allowing you to receive and process 2, 4, 8 messages from the queue in parallel. You can additionally tell the Message Pump to automagically mark the message as complete when the delegate has successfully finished processing the message. Both the thread count and AutoComplete instructions are passed in the OnMessageOptions parameter on the overloaded method.
public override void Run()
{
var onMessageOptions = new OnMessageOptions()
{
AutoComplete = true, // Message-Pump will call Complete on messages after the callback has completed processing.
MaxConcurrentCalls = 2 // Max number of threads the Message-Pump can spawn to process messages.
};
sbQueueClient.OnMessage((brokeredMessage) =>
{
// Process the Brokered Message Instance here
}, onMessageOptions);
RunAsync(_cancellationTokenSource.Token).Wait();
}
You can still leverage the RunAsync() method to perform additional tasks on the main Worker Role thread if required.
Finally, I would also recommend that you look at scaling your Worker Role instances out to a minimum of 2 (for fault tolerance and redundancy) to increase your overall throughput. From what I have seen with multiple production deployments of this pattern, OnMessage() performs perfectly when multiple Worker Role Instances are running.
A few things to consider here:
Are your individual tasks CPU intensive? If so, parallelism may not help. However, if they are mostly waiting on data processing tasks to be processed by other resources, parallelizing is a good idea.
If parallelizing is a good idea, consider not using Parallel.ForEach for queue processing. Parallel.Foreach has two issues that prevent you from being very optimal:
The code will wait until all kicked off threads finish processing before moving on. So, if you have 5 threads that need 10 seconds each and 1 thread that needs 10 minutes, the overall processing time for Parallel.Foreach will be 10 minutes.
Even though you are assuming that all of the threads will start processing at the same time, Parallel.Foreach does not work this way. It looks at number of cores on your server and other parameters and generally only kicks off number of threads it thinks it can handle, without knowing too much about what's in those threads. So, if you have a lot of non-CPU bound threads that /can/ be kicked off at the same time without causing CPU over-utilization, default behaviour will not likely run them optimally.
How to do this optimally:
I am sure there are a ton of solutions out there, but for reference, the way we've architected it in CloudMonix (that must kick off hundreds of independent threads and complete them as fast as possible) is by using ThreadPool.QueueUserWorkItem and manually keeping track number of threads that are running.
Basically, we use a Thread-safe collection to keep track of running threads that are started by ThreadPool.QueueUserWorkItem. Once threads complete, remove them from that collection. The queue-monitoring loop is indendent of executing logic in that collection. Queue-monitoring logic gets messages from the queue if the processing collection is not full up to the limit that you find most optimal. If there is space in the collection, it tries to pickup more messages from the queue, adds them to the collection and kick-start them via ThreadPool.QueueUserWorkItem. When processing completes, it kicks off a delegate that cleans up thread from the collection.
Hope this helps and makes sense

Message being retried when operation takes time

I have a messaging system using Azure ServiceBus but I'm using Nimbus on top of that. I have an endpoint that sends a command to another endpoint and at one point the handler class on the other side picks it up, so it is all working fine.
When the operation takes time, roughly more than 20 second or so, the handler gets 'another' call with the same message. It looks like Nimbus is retrying the message that is already being handled by an other (even the same) instance of the handler, I don't see any exceptions being thrown and I could easily repro this with the following handler:
public class Synchronizer : IHandleCommand<RequestSynchronization>
{
public async Task Handle(RequestSynchronization synchronizeInfo)
{
Console.WriteLine("Received Synchronization");
await Task.Delay(TimeSpan.FromSeconds(30)); //Simulate long running process
Console.WriteLine("Got through first timeout");
await Task.Delay(TimeSpan.FromSeconds(30)); //Simulate another long running process
Console.WriteLine("Got through second timeout");
}
}
My question is: How do I disable this behavior? I am happy for the transaction take time as it is a heavy process that I have off-loaded from my website, which was the whole point of going with this architecture in the first place.
In other words, I was expecting the message to not to be picked up by another handler while one has picked it up and is processing it, unless there's an exception and the message goes back to the queue and eventually gets picked up for a retry.
Any ideas how to do this? Anything I'm missing?
By default, ASB/WSB will give you a message lock of 30 seconds. The idea is that you pop a BrokeredMessage off the head of the queue but have to either .Complete() or .Abandon() that message within the lock timeout.
If you don't do that, the service bus assumes that you've crashed or otherwise failed and it will return that message to the queue to be re-processed.
You have a couple of options:
1) Implement ILongRunningHandler on your handler. Nimbus will pay attention to the remaining lock time and automatically renew your message lock. Caution: The maximum message lock time supported by ASB/WSB is five minutes no matter how many times you renew so if your handler takes longer than that then you might want option #2.
public class Synchronizer : IHandleCommand<RequestSynchronization>, ILongRunningTask
{
public async Task Handle(RequestSynchronization synchronizeInfo)
{
Console.WriteLine("Received Synchronization");
await Task.Delay(TimeSpan.FromSeconds(30)); //Simulate long running process
Console.WriteLine("Got through first timeout");
await Task.Delay(TimeSpan.FromSeconds(30)); //Simulate another long running process
Console.WriteLine("Got through second timeout");
}
}
2) In your handler, call a Task.Run(() => SomeService(yourMessage)) and return. If you do this, be careful about lifetime scoping of dependencies if your handler takes any. If you need an IFoo, take a dependency on a Func> (or equivalent depending on your container) and resolve that within your handling task.
public class Synchronizer : IHandleCommand<RequestSynchronization>
{
private readonly Func<Owned<IFoo>> fooFunc;
public Synchronizer(Func<Owned<IFoo>> fooFunc)
{
_fooFunc = fooFunc;
}
public async Task Handle(RequestSynchronization synchronizeInfo)
{
// don't await!
Task.Run(() => {
using (var foo = _fooFunc())
{
Console.WriteLine("Received Synchronization");
await Task.Delay(TimeSpan.FromSeconds(30)); //Simulate long running process
Console.WriteLine("Got through first timeout");
await Task.Delay(TimeSpan.FromSeconds(30)); //Simulate another long running process
Console.WriteLine("Got through second timeout");
}
});
}
}
I think you are looking for the code here: http://www.uglybugger.org/software/post/support_for_long_running_handlers_in_nimbus

Resources