I get that consumer groups retrieve messages from an Event Hub; however:
Assume I have 1 producer, 32 partitions and 2 CGs. CG1 will feed the data to a different destination but needs the same messages.
Does a message replicate across all 32 partitions so that CG1 and CG2 will retrieve the same sequenced data?
New to this so thanks for any help!
Both consumer groups will receive the same data, and each has an independent view of it. That means CG1 might be further along in processing the messages than CG2. If, for example, CG1 processes messages in memory, it might be going through them faster than CG2, which is doing a lot of I/O to handle them. But they will both have access to all the data.
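A minimal sketch of that independence, using the azure-eventhub Python SDK; the connection string, hub name, and consumer group names are placeholders, and both consumer groups are assumed to already exist on the hub:

    import threading
    from azure.eventhub import EventHubConsumerClient

    CONN_STR = "<event hubs connection string>"  # placeholder
    HUB = "<event hub name>"                     # placeholder

    def on_event(partition_context, event):
        # Every consumer group receives every event; only the position
        # each group has reached differs, so cg1 and cg2 can be at
        # different offsets in the same stream.
        print(partition_context.consumer_group,
              partition_context.partition_id,
              event.body_as_str())

    def run(consumer_group):
        client = EventHubConsumerClient.from_connection_string(
            CONN_STR, consumer_group=consumer_group, eventhub_name=HUB)
        with client:
            # "-1" = start from the beginning of each partition.
            client.receive(on_event=on_event, starting_position="-1")

    # Two independent views of the same partitions, read in parallel.
    for cg in ("cg1", "cg2"):
        threading.Thread(target=run, args=(cg,), daemon=True).start()
    threading.Event().wait()  # keep the main thread alive

Both callbacks see the same events in the same per-partition order; nothing is replicated across partitions for this, the two groups simply read the same stream independently.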
Does this make sense to you?
I am working on a task where I am using pyspark/python to read events from an event hub. When I have multiple consumer groups, I am getting duplicate messages, which is the behaviour I see.
E.g.: I have 2 consumer groups (CG) and 2 events. CG1 consumes event1, and while this is in progress the 2nd event is triggered, so CG2 consumes it, which is good. But once CG1 is free after consuming event1, it consumes event2 as well, which we want to avoid. Even though the checkpoint is available, it's failing. Is this the default behaviour?
From your comment, you added multiple consumer groups in order to handle a large volume of messages:
Q: Why did you choose to use multiple consumer groups anyway?
A: There are a good number of messages flowing in, so we added two.
Scaling out is done using partitions, not using consumer groups. They are designed to be independent. You can't work against that.
Your question:
I have 2 consumer groups (CG) and 2 events. CG1 consumes event1, and while this is in progress the 2nd event is triggered, so CG2 consumes it, which is good. But once CG1 is free after consuming event1, it consumes event2 as well, which we want to avoid. Even though the checkpoint is available, it's failing. Is this the default behaviour?
The answer is yes, this is the default behaviour. A consumer group is a separate view of the whole message stream. Each consumer group has its own offset (checkpoint) marking how far it has progressed through that stream. That means each and every message will be received by each and every consumer group.
From the docs:
Consumer groups: A view (state, position, or offset) of an entire event hub. Consumer groups enable consuming applications to each have a separate view of the event stream. They read the stream independently at their own pace and with their own offsets.
This picture of the architecture also shows how the messages flow through all consumer groups.
See also this answer that provides more details about consumer groups.
Again, if you want to scale, do not use consumer groups; instead, tweak your provisioned throughput units and partitions, or improve the processing logic. See the docs about scalability.
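To make that concrete: the usual way to scale consumption is several receivers in the same consumer group sharing a checkpoint store, so the partitions are divided between them. A hedged sketch with the azure-eventhub Python SDK and the blob checkpoint store from the azure-eventhub-checkpointstoreblob package; all connection strings and names are placeholders:

    from azure.eventhub import EventHubConsumerClient
    from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

    def on_event(partition_context, event):
        print(event.body_as_str())                  # stand-in for your processing
        partition_context.update_checkpoint(event)  # record this group's progress

    checkpoint_store = BlobCheckpointStore.from_connection_string(
        "<storage connection string>", container_name="checkpoints")  # placeholders

    client = EventHubConsumerClient.from_connection_string(
        "<event hubs connection string>",   # placeholder
        consumer_group="$Default",
        eventhub_name="<event hub name>",   # placeholder
        checkpoint_store=checkpoint_store,
    )
    with client:
        client.receive(on_event=on_event)

Run this same script as N processes (N at most the partition count): because they share one consumer group and one checkpoint store, the instances coordinate and split the partitions between them instead of each receiving every event.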
I plan to utilize all 32 partitions in Azure Event Hubs.
Requirement: "Ordered" processing per partition is critical.
Question: If I increase the TUs (throughput units) to the maximum available of 20 across all 32 partitions, I get 40 MB/s of egress. Let's say I calculated that I need 500 client threads processing in parallel (EventProcessorClient) to achieve my throughput needs. How do I achieve this level of parallelism with EventProcessorClient while honoring my "ordering" requirement?
Btw, in Kafka I can create 500 partitions in a topic, and Kafka allows only 1 consumer thread per partition, guaranteeing event order.
In short, you really can't do what you're looking to do in the way that you're describing.
The EventProcessorClient is bound to a given Event Hub and consumer group combination and will collaborate with other processors using the same Event Hub/consumer group to evenly distribute the load. Adding more processors than the number of partitions would result in them being idle. You could work around this by using additional consumer groups, but the EventProcessorClient instances will only coordinate with others in the same consumer group; the processors for each consumer group would act independently and you'd end up processing the same events multiple times.
There are also quotas on the service side that you may not be taking into account.
Assuming that you're using the Standard tier, the maximum number of concurrent reads that you could have on a single partition of an Event Hub is 100: you can create a maximum of 20 consumer groups for a given Event Hub, and each consumer group may have a maximum of 5 active readers on a partition at a time. The Event Hubs Quotas page discusses these limits. That said, a dedicated instance allows higher limits, but you would still have a gap with the strict ordering that you're looking to achieve.
Without knowing more about your specific application scenario, how long it takes for an event to be processed, the relative size of the event body, and what your throughput target is, it's difficult to offer alternative suggestions that may better fit your needs.
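That said, if the ordering you need is really per entity (per device, per user, ...) rather than per partition, one common pattern is to keep one receiver per partition and fan events out to in-process worker queues keyed by entity id: each key stays strictly ordered while many keys proceed in parallel. A rough sketch only; the key extraction and handler below are assumptions, not anything from the question:

    import queue, threading

    NUM_WORKERS = 16  # per receiver; tune to your processing latency

    def handle(event):
        pass  # placeholder for your per-event processing

    # One FIFO queue per worker. Events with the same key always land
    # on the same queue, so per-key order is preserved.
    queues = [queue.Queue(maxsize=1000) for _ in range(NUM_WORKERS)]

    def worker(q):
        while True:
            handle(q.get())

    for q in queues:
        threading.Thread(target=worker, args=(q,), daemon=True).start()

    def dispatch(event, key):
        # Same key -> same worker -> events for that key stay in order.
        queues[hash(key) % NUM_WORKERS].put(event)

The trade-off is checkpointing: you can only safely checkpoint an event once everything before it in its partition has been handled.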
Hi, I have an event hub with two consumer groups.
Many devices are sending data to my event hub, and I want to save every message to my database.
Since the data is sent by multiple devices, the ingress is high, so in order to process those messages I have written an EventHub Trigger WebJob that processes each message and saves it to the database.
But saving these messages to my database is a time-consuming task; in other words, the receiver is slower than the sender.
So is there any way to process these messages faster by creating multiple receivers or something similar?
I created two event receivers with different consumer groups, but I found that the same message is processed by both trigger functions, so duplicate data is being saved to my database.
So please help me understand how I can create multiple receivers that process unique messages in parallel.
Creating multiple consumer groups won't help you out, as you found out yourself. Different consumer groups all read the same data, but each can do so at its own speed.
In order to increase the speed of processing there are just 2 options:
1. Make the process/code itself faster, for example by optimizing the code that saves the data to the database (see the sketch after this list).
2. Increase the number of partitions so that more consumers can read data in parallel, one consumer per partition. This means, however, that you will have to recreate the Event Hub, as you cannot increase or decrease the partition count after the Event Hub is created. See the docs for guidance.
About option 2: the maximum number of concurrent data consumers equals the number of partitions created. For example, if you have 4 partitions you can have up to 4 concurrent data readers processing the data.
I do not know your situation, but if you have certain peaks during which the processing is too slow and it catches up during quieter hours, you might be able to live with the current situation. If not, you will have to do something like I outlined.
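For option 1, batching the database writes is often the biggest win, since the per-message round trip to the database usually dominates. A rough sketch with the azure-eventhub Python SDK; save_batch_to_db is a hypothetical stand-in for your own bulk insert:

    from collections import defaultdict
    from azure.eventhub import EventHubConsumerClient

    BATCH_SIZE = 100
    buffers = defaultdict(list)  # one buffer per partition

    def save_batch_to_db(events):
        pass  # placeholder: a single bulk INSERT for the whole batch

    def on_event(partition_context, event):
        buf = buffers[partition_context.partition_id]
        buf.append(event)
        if len(buf) >= BATCH_SIZE:
            save_batch_to_db(buf)
            # Checkpoint only after the batch is stored: on a crash the
            # unsaved events are re-read rather than lost.
            partition_context.update_checkpoint(event)
            buf.clear()

    client = EventHubConsumerClient.from_connection_string(
        "<event hubs connection string>",   # placeholder
        consumer_group="$Default",
        eventhub_name="<event hub name>")   # placeholder
    with client:
        client.receive(on_event=on_event)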
Can messages from a given partition ever be divided among multiple threads? Let's say that I have a single partition and a hundred processes with a hundred threads each; will the messages from my single partition be given to only one of those 10,000 threads?
Multiple threads cannot consume the same partition unless those threads are in different consumer groups. Only a single thread will consume the messages from the single partition, even though you have lots of idle consumers.
The number of partitions is the unit of parallelism in Kafka. To have more consumers reading in parallel, you must increase the number of partitions of the topic up to the parallelism you want to achieve, or put every single thread into a separate consumer group, but I think the latter is not desirable.
If you have multiple consumers consuming from the same topic under the same consumer group, then the messages in the topic are distributed among those consumers. In other words, each consumer will get a non-overlapping subset of the messages. The following lines are taken from the Kafka FAQ page:
Should I choose multiple group ids or a single one for the consumers?
If all consumers use the same group id, messages in a topic are distributed among those consumers. In other words, each consumer will get a non-overlapping subset of the messages. Having more consumers in the same group increases the degree of parallelism and the overall throughput of consumption. See the next question for the choice of the number of consumer instances. On the other hand, if each consumer is in its own group, each consumer will get a full copy of all messages.
Why some of the consumers in a consumer group never receive any message?
Currently, a topic partition is the smallest unit that we distribute messages among consumers in the same consumer group. So, if the number of consumers is larger than the total number of partitions in a Kafka cluster (across all brokers), some consumers will never get any data. The solution is to increase the number of partitions on the broker.
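A small illustration of both FAQ answers with the kafka-python client; the broker address and topic name are assumptions:

    from kafka import KafkaConsumer

    # Consumers A and B share a group id: the topic's partitions are
    # split between them, and each message goes to only one of the two.
    # (In practice, run each in its own process.)
    consumer_a = KafkaConsumer("events", group_id="workers",
                               bootstrap_servers="localhost:9092")
    consumer_b = KafkaConsumer("events", group_id="workers",
                               bootstrap_servers="localhost:9092")

    # Consumer C has its own group id: it receives a full copy of the
    # stream, independently of A and B.
    consumer_c = KafkaConsumer("events", group_id="auditor",
                               bootstrap_servers="localhost:9092")

    for message in consumer_c:
        print(message.partition, message.offset, message.value)

If the "events" topic has only one partition, one of A and B will sit idle, exactly as the second FAQ answer describes.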
Yes, except in extreme cases.
The Kafka high-level consumer can make sure that one message is only consumed once, and that one partition is consumed by at most one thread most of the time.
That is because there is a local queue in the Kafka high-level consumer.
The consumer considers a message consumed as soon as you have polled it out of that local queue.
So let's tell a story:
Thread 1 consumes partition 0.
Thread 1 polls a message m0; messages m1, m2, ... are already sitting in its local queue.
A rebalance happens: Kafka clears the local queue and re-registers the consumers.
Thread 2 now consumes partition 0, but thread 1 is still processing m0.
Thread 2 can poll m1, m2, ... now.
You can see that two threads are consuming the same partition at this moment.
Instead of using threads, it is better to increase the number of consumers and partitions to get better throughput and better control.
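If you do want exactly one consumer per partition without relying on group rebalancing, the kafka-python client also supports manual assignment; the broker address, topic name, and partition id below are assumptions:

    from kafka import KafkaConsumer, TopicPartition

    # Manual assignment: this consumer reads partition 0 of "events" and
    # nothing else. No group rebalancing is involved, so the
    # "two threads on one partition" window described above cannot occur.
    consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
    consumer.assign([TopicPartition("events", 0)])

    for message in consumer:
        print(message.offset, message.value)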
As per my understanding, Event Hubs can process/ingest millions of messages per second, and to tune the ingestion we can use throughput units.
More throughput units = more ingestion power.
But on the receiving/consuming side, you can create up to 32 receivers (since we can create 32 partitions, and one partition can be consumed by one receiver).
Based on the above, if one single message takes 100 milliseconds to process, one consumer can process 10 messages per second, and 32 consumers can process 32 * 10 = 320 messages per second.
How can I make my receivers consume more messages (for example, 5-10k per second)?
1) Either I have to process messages asynchronously inside ProcessEventsAsync, but in this case I would not be able to maintain ordering.
2) Or I have to ask Microsoft to allow me to create more partitions.
Please advise.
TLDR: You will need to ask Microsoft to increase the number of partitions you are allowed, and remember that there is currently no way to increase the partition count of an existing Event Hub.
You are correct that your unit of consumption parallelism is the partition. If your consumers can only do 10/second in order, or even 100/second in order, then you will need more partitions to consume millions of events. While 100 ms/event certainly seems slow to me and I think you should look for optimizations there (i.e. farm out work you don't need to wait for, commit less often, etc.), you will reach the point of needing more partitions at scale.
Some things to keep in mind: 32 partitions gives you only 32 MB/s of ingress and 64 MB/s of egress. Both of these factors matter, since that egress throughput is shared by all the consumer groups you use. So if you have 4 consumer groups reading the data (16 MB/s each), you'll need twice as many partitions (or at least throughput units) for input as you would based solely on your data ingress (because otherwise you would fall behind).
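Back-of-the-envelope math with the numbers from this thread (the 100 ms processing time and the 5,000/s target come from the question; the rest is arithmetic):

    import math

    per_event_seconds = 0.100         # 100 ms to process one event, in order
    target_events_per_second = 5000   # lower end of the asker's 5-10k goal

    # With strictly ordered, serial processing, one partition sustains:
    events_per_partition = 1 / per_event_seconds   # 10 events/s
    partitions_needed = math.ceil(target_events_per_second / events_per_partition)
    print(partitions_needed)   # 500 -> far beyond 32, hence asking Microsoft

    # Egress is shared by all consumer groups: 32 partitions give 64 MB/s
    # total, so 4 consumer groups reading everything get ~16 MB/s each.
    print(64 / 4)              # 16.0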
With regards to your comment about multitenancy: will you have one 'database consumer' group that handles all your tenants, all of whose data flows through the same hub? If so, that sounds like a sensible use; what would not be so sensible is having one consumer group per tenant, each consuming the entire stream.