Avoid consuming the same events in parallel from Event Hub - Azure

I'm using:
the Azure platform to run a microservice-architecture software solution.
Microservices that use Azure Event Hub for communication in special cases.
Kubernetes with 2 clusters (primary, secondary).
Per application namespace, there is 1 event-listener pod running per cluster for consuming from Event Hub.
The last point is relevant to my current problem:
The load balancers share traffic between the primary and secondary clusters. This means that 2 event-listener pods are running per application at the same time. They just react to events, but sometimes they consume the same event from the Event Hub, which causes duplicated notification mails.
So finally my question is: how can I avoid reading the same event twice at the same time? I thought the Event Hub offset is always increasing, but there is no guarantee against both pods starting at the same moment.

You will need to use a separate consumer group per pod to avoid epoch errors.
That said, both pods will read the same events, so you have two options.
Have an active-passive setup: one consumer group, and one pod that reads the events and delegates the work for each event. If that pod fails, a health/heartbeat mechanism brings the second pod online.
Have an active-active setup: two consumer groups and two active pods. You will need to implement idempotent processing.
Idempotent processing, where processing the same message multiple times produces the same result, is good practice regardless of approach. It would allow you to replay batches of events in which one errored without adverse effects on the integrity of your data.
I would opt for the first option; a single Event Hub reader will process thousands of events per second and pass the work off to your microservices.
If you have lower message volumes and need guaranteed message processing, then Service Bus may be a better choice, since messages there can be locked, completed, and abandoned.
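A minimal sketch of idempotent processing under the active-active option, assuming some external store with a conditional insert; IProcessedStore, TryMarkProcessedAsync, and the mail-sending method are hypothetical placeholders, not SDK APIs:

```csharp
using System.Threading.Tasks;

// Hypothetical store interface; TryMarkProcessedAsync returns true only
// the first time an id is recorded (e.g. a conditional insert in a DB).
public interface IProcessedStore
{
    Task<bool> TryMarkProcessedAsync(string messageId);
}

public class NotificationHandler
{
    private readonly IProcessedStore _store;

    public NotificationHandler(IProcessedStore store) => _store = store;

    public async Task HandleAsync(string messageId, string payload)
    {
        // If the other pod already recorded this id, skip the side effect,
        // so duplicate deliveries do not send a second notification mail.
        if (!await _store.TryMarkProcessedAsync(messageId))
            return;

        await SendNotificationMailAsync(payload);
    }

    // Placeholder for the actual mail-sending logic.
    private Task SendNotificationMailAsync(string payload) => Task.CompletedTask;
}
```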

Related

Event Hub -- how to prevent duplicate handling when consumers scale out

When we have multiple consumers of Event Hub (or any messaging service, for that matter), how do we make sure that no message is processed twice, especially when a consumer auto-scales out to multiple instances?
I know we could keep track of the last message processed, but then again, between checking whether the message was processed and actually processing it, another instance could have processed it already (a race condition?).
So, how can that be solved in a scalable way?
[UPDATE]
I am aware there is a recommendation to have at least as many partitions as there are consumers, but what should be done when a single consumer cannot keep up with the messages directed to it and needs to scale out to multiple instances?
Each processor takes a lease on a partition; see the docs:
An event processor instance typically owns and processes events from one or more partitions. Ownership of partitions is evenly distributed among all the active event processor instances associated with an event hub and consumer group combination.
So scaling out doesn't result in duplicate message processing because a new processor cannot take a lease on a partition that is already being handled by another processor.
Then, regarding your comment:
i am aware there is a recommendation to have at least as many partitions as there are consumers
It is the other way around: it is recommended to have as many consumers as you have partitions. If you have more consumers than partitions, the consumers will compete with each other to obtain a lock on a partition.
Now, regarding duplicate messages: since Event Hub guarantees at-least-once delivery, there isn't much you can do to prevent this. There aren't that many scalable services that offer at-most-once delivery; I know that Azure Service Bus Queues do offer this if you really need it.
The question may arise what can cause duplicate message processing. Well, while processing messages the processor does some checkpointing: once in a while it will store its position within a partition's event sequence (remember, a partition is bound to a single processor). Now, when the processor instance crashes between two checkpoints, a new instance will resume processing messages from the position of the last checkpoint. That may very well lead to older messages being processed again.
If a reader disconnects from a partition, when it reconnects it begins reading at the checkpoint that was previously submitted by the last reader of that partition in that consumer group.
So that means you need to make sure your processing logic is idempotent. How you do that is up to you, as I don't know your use case.
One option is to track each individual message to see whether it has already been processed. If you do not have a unique ID to check on, maybe you can generate a hash of the whole message and compare with that.
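A small sketch of that hashing option, assuming you have the raw event body as bytes; the resulting key would then be checked against whatever store you use for tracking processed messages:

```csharp
using System;
using System.Security.Cryptography;

public static class Dedupe
{
    // Derive a stable deduplication key from the raw message body when
    // no unique message id is available.
    public static string ComputeKey(byte[] messageBody)
    {
        using (var sha = SHA256.Create())
        {
            return Convert.ToBase64String(sha.ComputeHash(messageBody));
        }
    }
}
```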

Spark Streaming job fails with ReceiverDisconnectedException

I have a Spark Streaming job which captures near-real-time data from Azure Event Hub and runs 24/7.
More interestingly, my job fails at least 2 times a day with the below error. If I google the error, the Microsoft docs give me 'This exception is thrown if two or more PartitionReceiver instances connect to the same partition with different epoch values'. I'm not worried about data loss, because Spark checkpointing will automatically take care of the data when I restart the job, but my question is why the Spark Streaming job fails 2-3 times a day with the same error.
Has anybody faced the same issue? Is there any solution/workaround available for this? Any help would be much appreciated.
Error:
This exception is thrown if two or more PartitionReceiver instances connect to the same partition with different epoch values.
What is a PartitionReceiver?
It is a logical representation of receiving from an Event Hub partition.
A PartitionReceiver is tied to a ConsumerGroup + Partition combination. If you are creating an epoch-based PartitionReceiver (i.e., PartitionReceiver.Epoch != 0), you cannot have more than one active receiver per ConsumerGroup + Partition combination. You can have multiple receivers per ConsumerGroup + Partition combination with non-epoch receivers.
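For illustration, this is roughly how an epoch receiver is created with the Microsoft.Azure.EventHubs SDK (the connection string, consumer group, partition, and epoch value are placeholders). Event Hubs disconnects any existing receiver on the same consumer group + partition that holds a lower epoch, which is what surfaces as ReceiverDisconnectedException on the losing side:

```csharp
using Microsoft.Azure.EventHubs;

// The connection string must include the EntityPath of the hub.
var client = EventHubClient.CreateFromConnectionString(
    "<event-hub-connection-string>");

// An epoch receiver claims exclusive ownership of this consumer group +
// partition; any receiver with a lower epoch gets disconnected.
PartitionReceiver receiver = client.CreateEpochReceiver(
    consumerGroupName: "$Default",
    partitionId: "0",
    eventPosition: EventPosition.FromStart(),
    epoch: 1);
```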
It sounds like you are running two instances of the application, two concurrent classes, or two applications that use the same Event Hub consumer group. Event Hub consumer groups are effectively pointers to a point in time on the event stream. If you try to use one consumer group from two instances of code, then you get a conflict like the one you are seeing.
Either:
Ensure you only have a single instance reading from the consumer group at a time.
Use two consumer groups when you need two separate programs or sets of functionality to process the event hub at the same time.
If you are looking to parallelize for performance, look into Event Hub partitioning and how to take advantage of processing each partition independently.
There is also an alternative scenario where an event hub partition is switched over to another host as part of the event hub's internal load balancing, in which case you may see the error you are receiving. If so, just log it and continue on.
For more details, refer to "Features and terminology in Azure Event Hubs" and "Event Hubs Receiver Epoch".
Hope this helps.

Azure Functions Event Hub trigger bindings

Just have a couple of questions regarding the usage of Azure Functions with an Event Hub in an IoT scenario.
Event Hub has partitions, and typically messages from a specific device go to the same partition. How are the instances of an Azure Function distributed across Event Hub partitions? Is it based on performance? If one instance of an Azure Function manages to process events from all partitions, then that is enough; otherwise, might one end up with one instance of an Azure Function per Event Hub partition?
What about the read offset? Does this binding somehow record where it stopped reading the event stream? I thought Functions were meant to be stateless, yet here we have some state.
Thanks
Each instance of an Event Hub-triggered Function is backed by only 1 EventProcessorHost (EPH) instance. Event Hub ensures that only 1 EPH can hold a lease on a given partition.
Answer to Question 1:
Let's elaborate on this with a contrived example. Suppose we begin with the following setup and assumptions for an EventHub:
10 partitions.
1000 events distributed evenly across all partitions => 100 messages in each partition.
When your Function is first enabled, there is only 1 instance of the Function. Let's call this Function instance Function_0. Function_0 will have 1 EPH that manages to get a lease on all 10 partitions. Let this EPH be called EPH_0, and it will start reading events from partitions 0-9. From this point forward, one of the following will happen:
Only 1 Function instance is needed - Function_0 is able to process all 1000 events before Azure Functions' scaling logic kicks in. Hence, all 1000 messages are processed by Function_0.
Add 1 more Function instance - Azure Functions' scaling logic determines that Function_0 seems sluggish, so a new instance Function_1 is created, resulting in EPH_1. Event Hub detects that a new EPH instance is trying to read messages and starts load balancing the partitions across the EPH instances, e.g., partitions 0-4 are assigned to EPH_0 and partitions 5-9 are assigned to EPH_1. If all Function executions succeed without errors, both EPH_0 and EPH_1 checkpoint successfully and all 1000 messages are processed. Once checkpointing succeeds, all 1000 messages should never be retrieved again.
Add N more Function instances - Azure Functions' scaling logic determines that both Function_0 and Function_1 are still sluggish, and will repeat workflow 2 for Function_2...Function_N, where N > 9. Event Hub will load balance the partitions across the Function_0...Function_9 instances. Unique to Azure Functions' current scaling logic is the fact that N can be greater than the number of partitions. This is done to ensure that there are always EPH instances readily available to quickly get a lock on the partition(s). As a customer, you are only charged for the resources used when your Function instance executes; you are not charged for this over-provisioning.
Answer to Question 2:
EPH uses a checkpointing mechanism to mark the last known successfully read message. An Event Hub-triggered Function can be set up to process 1 message or a batch of messages at a time. The option you choose needs to consider the following (a minimal batch-trigger sketch follows this list):
1. Speed of message processing - Processing messages in batches instead of a single message at a time is one of the factors that will speed up the ability of your Azure Function workflow to keep up with the incoming messages in your Event Hub.
2. Tolerance for duplicates - If checkpointing fails due to errors in your Function code, or (updated Aug 24th, 2017) a timeout or a lost partition lease, then the next EPH that gets a lease on that partition will start retrieving messages from the last known checkpoint. Event Hub guarantees at-least-once delivery but not at-most-once delivery, and Azure Functions will not attempt to change that behavior. If avoiding duplicate messages is a priority, then you will need to mitigate it in your workflow. As such, when checkpointing fails, there are more duplicate messages to manage if your Function is processing messages at the batch level.
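A minimal sketch of a batch-receiving, Event Hub-triggered Function (C# class library style; the hub name and connection setting name are placeholder assumptions):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.EventHubs;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class EventHubBatchFunction
{
    // Binding to EventData[] (rather than a single EventData) processes a
    // whole batch per invocation; the trigger checkpoints after the
    // invocation completes without throwing.
    [FunctionName("ProcessEvents")]
    public static async Task Run(
        [EventHubTrigger("my-hub", Connection = "EventHubConnection")] EventData[] events,
        ILogger log)
    {
        foreach (var e in events)
        {
            // If this throws, no checkpoint is written and the whole batch
            // may be redelivered -- hence the need for idempotent handling.
            log.LogInformation("Processing event at offset {Offset}",
                e.SystemProperties.Offset);
            await Task.CompletedTask; // placeholder for real work
        }
    }
}
```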
Function Apps are based on the WebJobs SDK, which uses EventProcessorHost to consume events from Event Hubs. So you can look up information about EventProcessorHost and it will be applicable to your Function App.
Particularly, you can find the implementation of IEventProcessor here.
To your questions:
Not sure what you mean by "one instance". One listener will be created per partition, but they can all be hosted inside a single App Plan instance if the load is low. At a high level, you should not care much: in the Consumption Plan you pay per execution time, no matter how many servers/processes/threads are running. Of course, you should care whether the auto-scaling works well enough under high load, but that needs to be tested anyway.
Functions are stateless in the sense that you can't save anything in memory between two function executions. You are totally fine saving state in external storage. The Function App will use PartitionContext.CheckpointAsync() for checkpointing the current offset. Azure Storage is used internally; again, you can read more about how it works in the Event Hubs and EventProcessorHost docs, e.g. here.
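To make the checkpointing mechanics concrete, here is a stripped-down IEventProcessor of the kind EventProcessorHost runs per partition (Microsoft.Azure.EventHubs.Processor API; error handling and real processing omitted):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.EventHubs;
using Microsoft.Azure.EventHubs.Processor;

public class SimpleEventProcessor : IEventProcessor
{
    public Task OpenAsync(PartitionContext context) => Task.CompletedTask;

    public Task CloseAsync(PartitionContext context, CloseReason reason)
        => Task.CompletedTask;

    public Task ProcessErrorAsync(PartitionContext context, Exception error)
        => Task.CompletedTask;

    public async Task ProcessEventsAsync(
        PartitionContext context, IEnumerable<EventData> messages)
    {
        foreach (var eventData in messages)
        {
            // ... handle the event here ...
        }

        // Persist the current offset (to Azure Storage) so a restarted or
        // rebalanced processor resumes here instead of at the last
        // checkpoint written by a previous owner of the partition.
        await context.CheckpointAsync();
    }
}
```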

Sharing EventHub between Azure Fabric reliable actors

I have an application where I map devices from the physical world to Reliable Actors in Azure Service Fabric. Each time I receive a message from a device, I want to push a message to an event hub.
What I'm doing right now is creating/using/closing the EventHubClient object for each message.
This is very inefficient (it takes about 1500 ms), but it solves an issue I had in the past when I kept the EventHubClient in memory: when I have a lot of devices, the underlying virtual machine can quickly run out of network connections.
I'm thinking about creating a new actor that would be responsible for pushing data to the EventHub (by keeping the EventHubClient alive). Because of the turn-based concurrency model of Reliable Actors, I'm not sure it's a good idea. If I get 10,000 devices pushing data "at the same time", each of their actors will block to push the message to the new actor that pushes messages to the EventHub.
What is the recommended approach for this scenario?
Thanks,
One approach would be to create a stateless service that is responsible for pushing messages to the EventHub. Each time an Actor receives a message from the device (by the way, how are they communicating with actors?), the Actor calls the stateless service. The stateless service in turn would be responsible for creating, maintaining, and disposing of one EventHubClient per service. A Reliable Service does not introduce the same 'overhead' when handling incoming messages as a Reliable Actor does. If it is important for your application that the messages reach the EventHub in strictly the same order that they were produced in, then you would have to do this with a Stateful Service and a Reliable Queue. (Note: there is, on the other hand, no guarantee that Actors would be able to finish handling incoming messages in the same order as they are produced.)
You could then fine-tune the solution by experimenting with the instance count (https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-availability-services) to make sure you have enough instances to handle the throughput of incoming messages. How many instances you need is roughly determined by the number of nodes and cores per node, although other factors may also affect it.
Devices communicate with your Actors, the Actors in turn communicate with the Service (which may be Stateless, or Stateful if you want to queue messages; see below), and each Service manages an EventHubClient that can push messages to the EventHub.
If your cluster is unable to support an instance count for this service that is high enough (a little simplified: more instances = higher throughput), then you may need to create it as a Stateful Service instead, put messages in a Reliable Queue in the Service, and then have the Service's RunAsync process the queue in order. This could absorb the pressure of peaks in load.
The Service Fabric Azure-Samples WordCount shows how you work with different Partitions to make the messages from Actors target different instances (or really partitions).
A general tip would be to not try to use Actors for everything (though for the right things they are great and reduce complexity a lot); the Reliable Services model supports a lot more scenarios and requirements and could really complement your Actors (rather than trying to make Actors do something they are not really designed for).
You could use a pub/sub pattern here (use the BrokerService).
By decoupling event publishing from event processing, you don't need to worry about the turn based concurrency model.
Publishers:
The Actor sends out messages by simply publishing them to a BrokerService.
Subscribers:
Then you use one or more Stateless Services or (different) Actors as subscribers of the events.
They would send them on to the EventHub at their own pace.
Event Hub client:
Using this approach you'd have full control over the EventHubClient instance counts and lifetimes.
You could increase event processing power by simply adding more subscribers.
In my opinion, you should call the event hub directly from your actors via a background thread with an internal in-memory queue. You should aggregate messages and use SendBatch to improve performance.
The event hub is able to handle the load by itself.
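A rough sketch of that idea: a single shared sender keeps one EventHubClient alive, actors enqueue without blocking, and a background loop drains the queue with SendBatchAsync (legacy Microsoft.ServiceBus.Messaging API; the batch size and idle delay are arbitrary, and the batch size limit is not handled here):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

// One shared instance per process keeps a single EventHubClient alive
// instead of opening and closing a connection for every actor call.
public class BufferedEventHubSender
{
    private readonly EventHubClient _client;
    private readonly ConcurrentQueue<EventData> _queue =
        new ConcurrentQueue<EventData>();

    public BufferedEventHubSender(string connectionString, string hubPath)
    {
        _client = EventHubClient.CreateFromConnectionString(connectionString, hubPath);
        Task.Run(SendLoopAsync); // start the background drain loop
    }

    // Actors call this; it never blocks on the network.
    public void Enqueue(EventData data) => _queue.Enqueue(data);

    private async Task SendLoopAsync()
    {
        while (true)
        {
            var batch = new List<EventData>();
            while (batch.Count < 100 && _queue.TryDequeue(out EventData data))
                batch.Add(data);

            if (batch.Count > 0)
                await _client.SendBatchAsync(batch); // one network call per batch
            else
                await Task.Delay(50); // idle: wait for more messages
        }
    }
}
```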

Detect and Delete Orphaned Queues, Topics, or Subscriptions on Azure Service Bus

If there are no longer any publishers or subscribers reading or writing to a Queue, Topic, or Subscription, because of crashes or other abnormal terminations (instance restart, etc.), is that Queue/Topic/Subscription effectively orphaned?
I tested this by creating a few Queues, and then terminating the applications. Those Queues were still on the Service Bus a long time later. It seems that they will just stay there forever. That would be wonderful if we WANTED that behavior, but in this case, we do not.
How can we detect and delete these Queues, Topics, and Subscriptions? They will count towards Azure limits, etc., and we cannot accumulate these orphaned entities every time an instance is restarted/patched/crashes.
If it helps make the question clearer, this is a unique situation in which the Queues/Topics/Subscriptions have special names, or special Filters, and a very limited set of publishers (1) and subscribers (1) for a limited time. This is not a case where we want survivability. These are instance-specific response channels. Whether we use Queues or Subscriptions is immaterial. If the instance is gone, so is the need for that Queue (or Subscription).
This is part of a solution where each web role has a dedicated response channel that it monitors. At any time, this web role may have dozens of requests pending via other messaging channels (Queues/Topics), and it is waiting for the answers on multiple threads. We need the response to come back to the thread that placed the message, so that the web role can respond to the caller. It is no good in this situation to simply have a Subscription based on the machine, because it will be receiving messages for other threads. We need each publishing thread to establish a dedicated response channel, so that the only thing on that channel is the response for that thread.
Even if we use Subscriptions (with some kind of instance-related filter) to do a long-polling receive operation on the Subscription, if the web role instance dies, that Subscription will be orphaned, correct?
This question can be boiled down like so:
If there are no more publishers or subscribers to a Queue/Topic/Subscription, then that service is effectively orphaned. How can those orphans be detected and cleaned up?
In this scenario you are looking for the Queues/Subscriptions to be "dynamic" in nature. They would be created and removed based on use, as opposed to the current explicit provisioning model for these entities. Service Bus provides you with the APIs to perform create/delete operations, so you can hook these into role OnStart/OnStop events appropriately. If those operations fail for some reason, then the orphaned entities will remain; again, you can run a cleanup operation on them based on some unique identifier in the entities' names. An example of this can be seen here: http://windowsazurecat.com/2011/08/how-to-simplify-scale-inter-role-communication-using-windows-azure-service-bus/
In the near future we will add more metadata and query capabilities to Queues/Topics/Subscriptions so you can see when they were last accessed and make cleanup decisions.
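A sketch of that create-on-start/delete-on-stop idea with the NamespaceManager API (the per-instance naming convention is an assumption for illustration):

```csharp
using Microsoft.ServiceBus;

public static class ResponseChannel
{
    // Call from the role's OnStart: provision the instance-specific queue.
    public static void Create(string connectionString, string instanceId)
    {
        var manager = NamespaceManager.CreateFromConnectionString(connectionString);
        string path = "responses-" + instanceId; // hypothetical naming convention
        if (!manager.QueueExists(path))
            manager.CreateQueue(path);
    }

    // Call from the role's OnStop: remove the queue on clean shutdown. A
    // separate sweep job is still needed to catch crashes that skip OnStop.
    public static void Delete(string connectionString, string instanceId)
    {
        var manager = NamespaceManager.CreateFromConnectionString(connectionString);
        string path = "responses-" + instanceId;
        if (manager.QueueExists(path))
            manager.DeleteQueue(path);
    }
}
```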
Service Bus Queues are built using the "brokered messaging" infrastructure designed to integrate applications or application components that may span multiple communication protocols, data contracts, trust domains, and/or network environments. This allows for a mechanism to communicate reliably with durable messaging.
If a client (publisher) sends a message to a service bus queue and then crashes, the message will be stored on the Queue until a consumer reads the message off the queue. Also, if your consumer dies and restarts, it will just poll the queue and pick up any work that is waiting for it (you can scale out and have multiple consumers reading from the queue to increase throughput). Service Bus Queues allow you to decouple your applications via a durable cloud gateway, analogous to MSMQ on-premises (or other queuing technologies).
What I'm really trying to say is that you won't get an orphaned queue; you might get poisoned messages that you will need to handle. This blog post gives some very detailed information re: Service Bus Queues and their capacity and quotas, which might give you a better understanding: http://msdn.microsoft.com/en-us/library/windowsazure/hh767287.aspx
Re: queue management, you can do this via Visual Studio (1.7 SDK & Tools) or there is an excellent tool called Service Bus Explorer that will make your life easier for queue management: http://code.msdn.microsoft.com/windowsazure/Service-Bus-Explorer-f2abca5a
*Note: the default maximum number of queues is 10,000 (per service namespace; this can be increased via a support call).
As Abhishek Lai mentioned, there is no orphan-detection capability supported.
Orphan detection can be implemented externally in multiple ways.
For example, whenever you send/receive a message, update a timestamp in an SQL database to indicate that the queue/topic/subscription is still active. This timestamp can then be used to determine orphans.
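The same last-activity idea can also lean on the entity metadata Service Bus later exposed. A sketch of a sweep job using QueueDescription.AccessedAt (assuming the per-instance naming convention above, and that AccessedAt is fresh enough for your tolerance):

```csharp
using System;
using System.Linq;
using Microsoft.ServiceBus;

public static class OrphanSweeper
{
    // Delete per-instance response queues that nobody has touched recently.
    public static void Sweep(string connectionString, TimeSpan maxIdle)
    {
        var manager = NamespaceManager.CreateFromConnectionString(connectionString);
        DateTime cutoff = DateTime.UtcNow - maxIdle;

        foreach (var queue in manager.GetQueues()
                                     .Where(q => q.Path.StartsWith("responses-")))
        {
            // AccessedAt reflects the last send/receive activity on the queue.
            if (queue.AccessedAt < cutoff)
                manager.DeleteQueue(queue.Path);
        }
    }
}
```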
If your process crashes, which is very much possible, there will be an issue with message delivery within the queue; however, the queue will still be available to process your requests. Handling application crashes and unreadable messages with Windows Azure Service Bus queues is described here:
The Service Bus provides functionality to help you gracefully recover from errors in your application or difficulties processing a message. If a receiver application is unable to process the message for some reason, then it can call the Abandon method on the received message (instead of the Complete method). This will cause the Service Bus to unlock the message within the queue and make it available to be received again, either by the same consuming application or by another consuming application.
In the event that the application crashes after processing the message but before the Complete request is issued, then the message will be redelivered to the application when it restarts. This is often called At Least Once Processing, that is, each message will be processed at least once but in certain situations the same message may be redelivered. If the scenario cannot tolerate duplicate processing, then application developers should add additional logic to their application to handle duplicate message delivery. This is often achieved using the MessageId property of the message, which will remain constant across delivery attempts.
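In code, that Complete/Abandon pattern with MessageId-based duplicate detection looks roughly like this with the brokered messaging API (PeekLock is the default receive mode; IsAlreadyProcessed and Process are placeholders for your own logic):

```csharp
using Microsoft.ServiceBus.Messaging;

public static class QueueReceiver
{
    public static void Start(string connectionString, string queuePath)
    {
        // PeekLock (the default ReceiveMode) keeps the message locked
        // until Complete() or Abandon() is called.
        var client = QueueClient.CreateFromConnectionString(connectionString, queuePath);

        client.OnMessage(message =>
        {
            try
            {
                // MessageId remains constant across delivery attempts,
                // so it is the natural key for duplicate detection.
                if (!IsAlreadyProcessed(message.MessageId))
                    Process(message);

                message.Complete();  // done: remove the message from the queue
            }
            catch
            {
                message.Abandon();   // unlock the message for redelivery
            }
        });
    }

    // Placeholders for application-specific logic.
    static bool IsAlreadyProcessed(string messageId) => false;
    static void Process(BrokeredMessage message) { }
}
```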
If there are no longer any processes reading nor writing to a queue, because of crashes or other abnormal terminations (instance restart, etc.), is that queue effectively orphaned?
No, the queue is in place to allow communication to occur via Brokered Messages; if all your apps die for some reason, then the queue still exists and will be there when they come alive again. It's the communication channel for loosely coupled applications. Regarding billing: "Messages are charged based on the number of messages sent to, or delivered by, the Service Bus during the billing month", so you won't be charged if a queue exists but nobody is using it.
I tested this by creating a few queues, and then terminating the applications. Those queues were still on the machine a long time later.
The whole point of the queue is to guarantee message delivery between loosely coupled applications. Think of the queue as an entity or application in its own right with high availability (SLA), as it's hosted in Azure; your producers/consumers can die/restart and the queue will remain active in Azure. *Note: I got a bit confused by your wording re: "still on the machine a long time later"; the queue doesn't actually live on your machine, it sits up in Azure in a designated service bus namespace. You can view and manage the queues via the tools I pointed out in the previous answer.
How can we detect and delete these queues, as they will count towards Azure limits, etc.?
As stated above, the default maximum number of queues is 10,000 (per service namespace; this can be increased via a support call), and queue management can be done via the tools mentioned in the other answer. You should only be looking to delete queues when you no longer have producers/consumers looking to write to them (i.e., never again). You can of course create and delete queues from your producer/consumer applications via namespaceManager.QueueExists; more information here: How to Use Service Bus Queues
If it helps make the question clearer, this is a unique situation in which the queues have special names, and a very limited set of publishers (1) and subscribers (1) for a limited time.
It sounds like you need to use Topics & Subscriptions (How to Use Service Bus Topics/Subscriptions; this link also has a section on "How to Delete Topics and Subscriptions"). If you have a very limited lifetime, then you could handle topic creation/deletion in your apps; otherwise you could have a separate Queue/Topic/Subscription setup/deletion script to handle this logic...
