Event Hubs - avoiding duplicate consumption in case of multiple consumer groups - Azure

I am working on a task where I am using PySpark/Python to read events from an Event Hub. When I have multiple consumer groups, I am getting duplicate messages, which is behaviour I want to avoid.
E.g. I have 2 consumer groups (CG) and 2 events. CG1 is consuming event1, and while that is still in progress the 2nd event is triggered, so CG2 consumes it, which is good. But now, after CG1 is free following event1's consumption, it consumes event2 as well, which we want to avoid. Even though the checkpoint is available, it's failing. Is this the default behaviour?

According to your comment, you added multiple consumer groups in order to handle a large number of messages:
Q: Why did you choose to use multiple consumer groups anyway?
A: There are a good number of messages flowing in, so we added two.
Scaling out is done using partitions, not consumer groups. Consumer groups are designed to be independent views of the stream; you can't work against that.
Your question:
I have 2 consumer groups (CG) and 2 events. CG1 is consuming event1, and while that is still in progress the 2nd event is triggered, so CG2 consumes it, which is good. But now, after CG1 is free following event1's consumption, it consumes event2 as well, which we want to avoid. Even though the checkpoint is available, it's failing. Is this the default behaviour?
The answer is yes, this is the default behaviour. A consumer group is a separate view of the whole message stream. Each consumer group has its own offset (checkpoint) marking how far it has processed that stream. That means that each and every message will be received by each and every consumer group.
From the docs:
Consumer groups: A view (state, position, or offset) of an entire event hub. Consumer groups enable consuming applications to each have a separate view of the event stream. They read the stream independently at their own pace and with their own offsets.
This picture of the architecture also shows how the messages flow through all consumer groups.
See also this answer that provides more details about consumer groups.
Again, if you want to scale, do not use consumer groups; instead, tweak your provisioned throughput units or partitions, or improve the processing logic. See the docs about scalability.
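To make the "separate view" behaviour concrete, here is a minimal sketch using the azure-eventhub Python package; the connection string, hub name, and the idea of passing the consumer group name on the command line are placeholders, not code from the question:

```python
import sys
from azure.eventhub import EventHubConsumerClient

# Placeholder connection details.
CONN_STR = "<event-hub-connection-string>"
HUB_NAME = "<hub-name>"

def on_event(partition_context, event):
    # Every consumer group receives every event; only the offset it tracks differs.
    print(partition_context.consumer_group,
          partition_context.partition_id,
          event.body_as_str())

client = EventHubConsumerClient.from_connection_string(
    CONN_STR,
    consumer_group=sys.argv[1],   # e.g. "$Default" or "cg2"
    eventhub_name=HUB_NAME,
)

with client:
    # "-1" means start from the beginning of the retained stream.
    client.receive(on_event=on_event, starting_position="-1")
```

Run this once with $Default and once with a second consumer group name: both processes print the same events, which is exactly the duplicate consumption described in the question.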

Related

Is there a way to get count of events in Azure Event Hub consumer group for a particular time period?

Is there a way to get the count of events in an Azure Event Hub for a particular time period? I need to get the count of events that arrive within an hour.
No, there is no way to get it as of now. If you look at the docs, Event Hubs is a high-throughput, low-latency, durable stream of events on Azure, so a count may not be accurate at any given point in time:
unlike queues, there is no concept of queue length in Azure Event Hubs because the data is processed as a stream
I'm not sure that the context of this question is correct, as a consumer group is just a logical grouping of those reading from an Event Hub and nothing gets published to it. For the remainder of my thoughts, I'm assuming that the nomenclature was a mistake, and what you're interested in is understanding what events were published to a partition of the Event Hub.
There's no available metric or report that I'm aware of that would surface the requested information. However, assuming that you know the time range that you're interested in, you can write a utility to compute this:
1. Connect one of the consumer types from the Event Hubs SDK of your choice to the partition, using FromEnqueuedTime or the equivalent as your starting position. (sample)
2. For each event read, inspect the EnqueuedTime or equivalent and compare it to the window that you're interested in.
3. If the event was enqueued within your window, increase your count and repeat, starting at #2. If the event was later than your window, you're done.
The count that you've accumulated would be the number of events that the Event Hubs broker received during that time interval (a sketch follows after the note below).
Note: If you're doing this across partitions, you'll want to read each partition individually and count. The ReadEvents method on the consumer does not guarantee ordering or fairness; it would be difficult to have a deterministic spot to decide that you're done reading.
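Putting those steps together, a rough sketch with the azure-eventhub Python package might look like the following; the connection string, hub name, partition id and time window are placeholders, and it counts a single partition at a time as the note above suggests:

```python
import threading
from datetime import datetime, timezone
from azure.eventhub import EventHubConsumerClient

# Placeholders: connection details and the hour you want to count.
CONN_STR = "<event-hub-connection-string>"
HUB_NAME = "<hub-name>"
window_start = datetime(2023, 1, 1, 9, 0, tzinfo=timezone.utc)
window_end = datetime(2023, 1, 1, 10, 0, tzinfo=timezone.utc)

count = 0
done = threading.Event()

def on_event(partition_context, event):
    global count
    if event is None:                       # max_wait_time expired: no more events to read
        done.set()
    elif event.enqueued_time < window_end:  # steps 2/3: inside the window, count it
        count += 1
    else:                                   # past the window on this partition: done
        done.set()

client = EventHubConsumerClient.from_connection_string(
    CONN_STR, consumer_group="$Default", eventhub_name=HUB_NAME
)

# Step 1: start reading partition "0" from the enqueued time at the start of the window.
reader = threading.Thread(
    target=client.receive,
    kwargs=dict(on_event=on_event, partition_id="0",
                starting_position=window_start, max_wait_time=30),
    daemon=True,
)
reader.start()
done.wait()
client.close()
print(f"Events enqueued on partition 0 in the window: {count}")
```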
Assuming you really just need to know "how many events did my Event Hub receive in a particular hour" - you can take a look at the metrics published by the Event Hub. Be aware that, as mentioned in the other answers, the count might not be 100% accurate given the nature of the system.
Take a look at the Incoming Messages metric. If you take this for any given hour, it will give you the count of messages that were received during this time period. You can split the namespace-level metric by Event Hub, and since every consumer group receives every single message anyway, the incoming count is independent of consumer groups.
This is an example of how it can look in the UI, though you should also be able to export it to a log analytics workspace.
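If you prefer to pull the same metric programmatically rather than from the portal, a rough sketch with the azure-monitor-query Python package could look like this; the resource ID, hub name, and the EntityName dimension filter are assumptions you would need to adapt:

```python
from datetime import datetime, timedelta, timezone
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder: resource ID of the Event Hubs namespace.
resource_uri = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.EventHub/namespaces/<namespace>"
)

client = MetricsQueryClient(DefaultAzureCredential())
start = datetime(2023, 1, 1, 9, 0, tzinfo=timezone.utc)

result = client.query_resource(
    resource_uri,
    metric_names=["IncomingMessages"],
    timespan=(start, timedelta(hours=1)),        # the hour you care about
    granularity=timedelta(hours=1),              # one data point covering that hour
    aggregations=[MetricAggregationType.TOTAL],
    filter="EntityName eq '<hub-name>'",         # assumed dimension for splitting by hub
)

for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.total)
```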

In-order processing in Azure event hubs with Partitions and multiple "event processor" clients

I plan to utilize all 32 partitions in Azure event hubs.
Requirement: "ordered" processing per partition is critical.
Question: If I increase the TUs (Throughput Units) to the maximum available of 20 across all 32 partitions, I get 40 MB/s of egress. Let's say I calculated that I need 500 parallel client threads (EventProcessorClient) processing in parallel to achieve my throughput needs. How do I achieve this level of parallelism with EventProcessorClient while honoring my "ordering" requirement?
By the way, in Kafka I can create 500 partitions in a topic, and Kafka allows only one consumer thread per partition, guaranteeing event order.
In short, you really can't do what you're looking to do in the way that you're describing.
The EventProcessorClient is bound to a given Event Hub and consumer group combination and will collaborate with other processors using the same Event Hub/consumer group to evenly distribute the load. Adding more processors than the number of partitions would result in them being idle. You could work around this by using additional consumer groups, but the EventProcessorClient instances will only coordinate with others in the same consumer group; the processors for each consumer group would act independently and you'd end up processing the same events multiple times.
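For reference, this is roughly what a single load-balancing processor looks like with the Python SDK, where EventHubConsumerClient plus a blob checkpoint store is the counterpart of EventProcessorClient; the connection strings, container and hub names are placeholders. Running several identical copies with the same consumer group and checkpoint store makes them share the 32 partitions between them, and events stay ordered within each partition because each partition is owned and processed sequentially by exactly one processor at a time:

```python
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

# Placeholder connection details.
EVENTHUB_CONN_STR = "<event-hub-connection-string>"
STORAGE_CONN_STR = "<blob-storage-connection-string>"

checkpoint_store = BlobCheckpointStore.from_connection_string(
    STORAGE_CONN_STR, container_name="checkpoints"
)

client = EventHubConsumerClient.from_connection_string(
    EVENTHUB_CONN_STR,
    consumer_group="$Default",
    eventhub_name="my-hub",
    checkpoint_store=checkpoint_store,  # shared store -> partition ownership is balanced
)

def on_event(partition_context, event):
    # Events arrive in order within this partition; process them sequentially here.
    print(partition_context.partition_id, event.body_as_str())
    partition_context.update_checkpoint(event)

with client:
    # Without a partition_id the client claims a fair share of the partitions
    # and rebalances as identical processes start or stop.
    client.receive(on_event=on_event, starting_position="-1")
```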
There are also quotas on the service side that you may not be taking into account.
Assuming that you're using the Standard tier, the maximum number of concurrent reads that you could have for one Event Hub, across all partitions, is 100. For a given Event Hub, you can create a maximum of 20 consumer groups, and each consumer group may have a maximum of 5 active readers at a time. The Event Hubs Quotas page discusses these limits. That said, a dedicated instance allows higher limits, but you would still have a gap with the strict ordering that you're looking to achieve.
Without knowing more about your specific application scenarios, how long it takes for an event to be processed, the relative size of the event body, and what your throughput target is, it's difficult to offer alternative suggestions that may better fit your needs.

Azure Event Hub Consumer Groups

I get that consumer groups will retrieve messages from the Event Hub; however:
Assume I have 1 producer, 32 partitions and 2 CGs; CG1 will use the data in a different system but needs the same messages.
Does a message in the queue replicate across all 32 partitions so CG1 & CG2 will retrieve the same sequenced data?
New to this so thanks for any help!
Both consumer groups will receive the same data, and they each have an independent view of that data. That means that CG1 might be further along in processing the messages than CG2. If, for example, CG1 processes messages in memory, it might be going through the messages faster than CG2, which is doing a lot of I/O to handle them. But they will both have access to all the data.
Does this make sense to you?

Create multiple Event Hub receivers to process a huge volume of data concurrently

Hi, I have an Event Hub with two consumer groups.
Many devices are sending data to my Event Hub and I want to save all messages to my database.
Since data is being sent by multiple devices, the ingress is high, so in order to process those messages I have written one EventHub Trigger WebJob to process the messages and save them to the database.
But saving these messages to my database is a time-consuming task; in other words, the receiver is slower than the sender.
So is there any way to process these messages faster by creating multiple receivers or something similar?
I have created two event receivers with different consumer groups, but I found that the same message is processed by both trigger functions, so duplicate data is being saved in my database.
So please help me understand how I can create multiple receivers that will process unique messages in parallel.
Creating multiple consumer groups won't help you out, as you found out yourself. Different consumer groups all read the same data, but each at its own pace.
In order to increase the processing speed there are just 2 options:
1. Make the process/code itself faster, so try to optimize the code that is saving the data to the database (a sketch of one common approach follows below).
2. Increase the number of partitions so more consumers can read the data in parallel (one active reader per partition). This means, however, that you will have to recreate the Event Hub, as you cannot increase/decrease the partition count after the Event Hub is created. See the docs for guidance.
About option 2: the number of concurrent data consumers is equal to the number of partitions created. For example, if you have 4 partitions you can have up to 4 concurrent data readers processing the data.
I do not know your situation, but if you have certain peaks during which the processing is too slow and it catches up during quieter hours, you might be able to live with the current situation. If not, you will have to do something like I outlined above.
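As an illustration of the first option, one common way to speed up the receiving side is to read and write in batches instead of doing one database insert per event. A rough sketch with the azure-eventhub Python package; save_batch_to_db, the connection string and hub name are placeholders:

```python
from azure.eventhub import EventHubConsumerClient

# Placeholder connection details.
CONN_STR = "<event-hub-connection-string>"
HUB_NAME = "<hub-name>"

def save_batch_to_db(rows):
    # Placeholder: one bulk insert per batch is much cheaper than one insert per event.
    pass

def on_event_batch(partition_context, events):
    rows = [e.body_as_str() for e in events]
    if rows:
        save_batch_to_db(rows)
    # Checkpoint once per batch (persists progress when a checkpoint_store is configured).
    partition_context.update_checkpoint()

client = EventHubConsumerClient.from_connection_string(
    CONN_STR, consumer_group="$Default", eventhub_name=HUB_NAME
)

with client:
    client.receive_batch(
        on_event_batch=on_event_batch,
        max_batch_size=100,  # hand up to 100 events to the callback at once
        max_wait_time=5,     # or whatever has arrived after 5 seconds
    )
```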

Apache kafka message dispatching and balance loading

I've just started with Apache Kafka and am really trying to figure out how I could design my system to use it in a proper manner.
I'm building a system which processes data; my chunk of data is a task (object) that needs to be processed, and the object knows how it should be processed, so that's not a problem.
My system is split into 3 main components: a Publisher (code which spawns tasks), the transport (Kafka itself), and a set of Consumers, which are workers that just pull data from the queue and process it somehow. It's important to note that a Consumer can be a publisher itself if its task needs two-step computation (the Consumer just creates new tasks and sends them back to the transport).
So let's start with the idea that I have 3 servers: a single root publisher (the Kafka server is also running there) and 2 consumer servers which actually handle the tasks. The data workflow is like this: the Publisher creates a task and puts it on the transport, then one of the consumers takes this task from the queue and handles it. It would be nice if each consumer handled the same amount of tasks as the others (so the workload is spread equally between consumers).
Which Kafka configuration pattern do I need to use for this case? Does Kafka have some message balancing feature, or do I need to create 2 partitions where each consumer is bound to a single partition and can consume data only from that partition?
In Kafka, the number of partitions roughly translates to the parallelism of the system.
The general tip is to create more partitions per topic (e.g. 10) and, while creating the consumer, specify a number of consumer threads corresponding to the number of partitions.
In the high-level consumer API, while creating the consumer you can provide the number of streams (threads) to create per topic. Assuming you create 10 partitions and run the consumer process from a single machine, you can give the topicCount as 10. If you run the consumer process from 2 servers, you could specify the topicCount as 5.
Please refer to this link
The createMessageStreams call registers the consumer for the topic, which results in rebalancing the consumer/broker assignment. The API encourages creating many topic streams in a single call in order to minimize this rebalancing.
You can also dynamically increase the number of partitions using the kafka-add-partitions.sh command under kafka/bin. After increasing the partitions you can restart the consumer process with an increased topicCount.
Also, while producing you should use the KeyedMessage class with some random key within your message object so that the messages are evenly distributed across the different partitions.
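For what it's worth, the same balancing behaviour with the current consumer/producer API looks roughly like this; a sketch using the kafka-python package, with the broker address, topic name and group id as placeholders. Start the consumer script once on each of the two consumer servers and the topic's partitions are split between them:

```python
from kafka import KafkaConsumer, KafkaProducer

# --- Producer: key each message so it is spread across the partitions. ---
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("tasks", key=b"task-42", value=b'{"work": "..."}')
producer.flush()

# --- Consumer: every process started with the same group_id is assigned ---
# --- its own share of the topic's partitions, so the load is balanced.  ---
consumer = KafkaConsumer(
    "tasks",
    group_id="workers",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```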
