I have multiple consumers subscribed to the same Pulsar topic. How do I make sure certain messages go to specific consumers?
The closest thing I have found is the Key_Shared subscription type. However, it seems to group messages by hash range and then choose a consumer at random to send each group to.
If you use the Key_Shared subscription type, you can specify which hash ranges a particular consumer is assigned by using the sticky hash-range policy (KeySharedPolicy.stickyHashRange()). Messages are routed by hashing the message key into a fixed range (0-65535 by default), so producers need to set a key on each message for this to work.
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.KeySharedPolicy;
import org.apache.pulsar.client.api.Range;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionType;

// This consumer only receives messages whose key hashes into 0-100 or 1000-2000.
Consumer<String> consumer = getClient().newConsumer(Schema.STRING)
        .topic("topicName")
        .subscriptionName("sub1")
        .subscriptionType(SubscriptionType.Key_Shared)
        .keySharedPolicy(KeySharedPolicy.stickyHashRange()
                .ranges(Range.of(0, 100), Range.of(1000, 2000)))
        .subscribe();
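The sticky ranges only take effect if producers actually set a key on each message, since routing is driven by the hash of that key. As a minimal illustration of the producer side (shown here with the Python client and an assumed local broker URL; any client language works the same way):

import pulsar

# Assumed local broker URL; replace with your cluster's service URL.
client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer('topicName')

# Every message with the same partition_key hashes to the same slot in the
# 0-65535 key space, so it always lands on the consumer that owns that slot.
producer.send('some payload'.encode('utf-8'), partition_key='device-42')

client.close()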
Related
Is there a way to get the count of events in an Azure Event Hub for a particular time period? I need to get the count of events that arrive in a given hour.
No, there is no built-in way to get that count as of now. If you look at the docs, Event Hubs is a high-throughput, low-latency, durable stream of events on Azure, so a count may not be accurate at any given point in time. Unlike queues, there is no concept of queue length in Azure Event Hubs, because the data is processed as a stream.
I'm not sure that the context of this question is correct, as a consumer group is just a logical grouping of those reading from an Event Hub and nothing gets published to it. For the remainder of my thoughts, I'm assuming that the nomenclature was a mistake, and what you're interested in is understanding what events were published to a partition of the Event Hub.
There's no available metric or report that I'm aware of that would surface the requested information. However, assuming that you know the time range that you're interested in, you can write a utility to compute this (a rough sketch appears after the steps below):
1. Connect one of the consumer types from the Event Hubs SDK of your choice to the partition, using FromEnqueuedTime or the equivalent as your starting position. (sample)
2. For each event read, inspect the EnqueuedTime or equivalent and compare it to the window that you're interested in.
3. If the event was enqueued within your window, increase your count and repeat, starting at #2. If the event was later than your window, you're done.
The count that you've accumulated would be the number of events that the Event Hubs broker received during that time interval.
Note: If you're doing this across partitions, you'll want to read each partition individually and count. The ReadEvents method on the consumer does not guarantee ordering or fairness; it would be difficult to have a deterministic spot to decide that you're done reading.
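A rough sketch of those steps with the Python SDK (azure-eventhub); the connection string, hub name, partition id and time window are placeholders, and the other language SDKs offer equivalent calls:

import threading
from datetime import datetime, timezone
from azure.eventhub import EventHubConsumerClient

CONN_STR = "<event-hubs-connection-string>"   # placeholder
EVENTHUB_NAME = "<event-hub-name>"            # placeholder
PARTITION_ID = "0"                            # count one partition at a time

window_start = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
window_end = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)

count = 0
past_window = threading.Event()

def on_event(partition_context, event):
    # Steps 2 and 3 above: count events enqueued inside the window, stop once past it.
    global count
    if event is None:
        return
    if event.enqueued_time < window_end:
        count += 1
    else:
        past_window.set()   # an event beyond the window means we're done

client = EventHubConsumerClient.from_connection_string(
    CONN_STR, consumer_group="$Default", eventhub_name=EVENTHUB_NAME)

# Step 1 above: start reading the partition at the first event enqueued at or
# after window_start; receive() blocks, so it runs on a worker thread.
worker = threading.Thread(
    target=client.receive,
    kwargs={"on_event": on_event,
            "partition_id": PARTITION_ID,
            "starting_position": window_start},
    daemon=True)
worker.start()

past_window.wait()   # note: blocks until at least one event past the window arrives
client.close()       # stops the receive loop

print(f"Events enqueued to partition {PARTITION_ID} in the window: {count}")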
Assuming you really just need to know "how many events did my Event Hub receive in a particular hour", you can take a look at the metrics published by the Event Hub. Be aware that, as mentioned in the other answers, the count might not be 100% accurate given the nature of the system.
Take a look at the Incoming Messages metric. If you take this for any given hour, it will give you the count of messages that were received during that time period. You can split the namespace-level metric by Event Hub, and since every consumer group receives every single message, the incoming count does not depend on consumer groups.
The portal UI can chart this directly, and you should also be able to export the metric to a Log Analytics workspace.
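If you would rather pull the same number programmatically than read it off the chart, a metrics query along these lines should work; this is a hedged sketch using the azure-monitor-query package with a placeholder namespace resource id:

from datetime import datetime, timedelta, timezone
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

# Placeholder resource id of the Event Hubs namespace.
NAMESPACE_RESOURCE_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.EventHub/namespaces/<namespace>"
)

client = MetricsQueryClient(DefaultAzureCredential())

start = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)

response = client.query_resource(
    NAMESPACE_RESOURCE_ID,
    metric_names=["IncomingMessages"],
    timespan=(start, end),
    granularity=timedelta(hours=1),
    aggregations=[MetricAggregationType.TOTAL],
    # filter="EntityName eq '<event-hub-name>'",  # optionally restrict to one hub
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.total)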
I am working on a task where I am using pyspark/python to read events from an Event Hub. When I have multiple consumer groups, I am getting duplicate messages; I want to understand whether this is expected behaviour.
E.g.: I have 2 consumer groups (CG) and 2 events. CG1 is consuming event 1, and while that is in progress the 2nd event is triggered, so CG2 consumes it, which is good. But once CG1 is free after consuming event 1, it consumes event 2 as well, which we want to avoid. Even though a checkpoint is available, this still happens. Is this the default behaviour?
Based on your comment, you added multiple consumer groups in order to handle a large volume of messages:
Q: Why did you choose to use multiple consumer groups anyway?
A: There are good number of messages which flows in so we added two.
Scaling out is done using partitions, not using consumer groups. They are designed to be independent. You can't work against that.
Your question:
I have 2 consumer groups (CG) and 2 events. CG1 is consuming event 1, and while that is in progress the 2nd event is triggered, so CG2 consumes it, which is good. But once CG1 is free after consuming event 1, it consumes event 2 as well, which we want to avoid. Even though a checkpoint is available, this still happens. Is this the default behaviour?
The answer is yes, this is the default behaviour. A consumer group is a separate view of the whole message stream. Each consumer group has its own offset (checkpoint) marking where it is in terms of the messages it has processed from that stream. That means that each and every message will be received by each and every consumer group.
From the docs:
Consumer groups: A view (state, position, or offset) of an entire event hub. Consumer groups enable consuming applications to each have a separate view of the event stream. They read the stream independently at their own pace and with their own offsets.
The architecture diagram in the docs also shows how messages flow through all consumer groups.
See also this answer that provides more details about consumer groups.
Again, if you want to scale, do not add consumer groups; instead tweak your provisioned throughput units or partitions, or improve the processing logic. See the docs about scalability.
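To make the scaling point concrete: with the plain Python SDK (the same idea carries over to the Spark connector), scaling out means running several identical workers in the same consumer group that share a checkpoint store, so Event Hubs load-balances the partitions across them. A sketch with placeholder names, assuming the azure-eventhub and azure-eventhub-checkpointstoreblob packages:

from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

EVENTHUB_CONN_STR = "<event-hubs-connection-string>"   # placeholder
EVENTHUB_NAME = "<event-hub-name>"                     # placeholder
STORAGE_CONN_STR = "<storage-connection-string>"       # placeholder
BLOB_CONTAINER = "<checkpoint-container>"              # placeholder

def on_event(partition_context, event):
    print(f"partition {partition_context.partition_id}: {event.body_as_str()}")
    # Checkpointing records this consumer group's progress, so another worker
    # (or a restart) resumes where this one left off instead of re-reading.
    partition_context.update_checkpoint(event)

checkpoint_store = BlobCheckpointStore.from_connection_string(
    STORAGE_CONN_STR, BLOB_CONTAINER)

client = EventHubConsumerClient.from_connection_string(
    EVENTHUB_CONN_STR,
    consumer_group="$Default",          # one consumer group for the whole app
    eventhub_name=EVENTHUB_NAME,
    checkpoint_store=checkpoint_store,  # shared store lets workers split partitions
)

with client:
    # Run several copies of this exact worker to scale out; each one is assigned
    # a subset of the partitions automatically.
    client.receive(on_event=on_event, starting_position="-1")  # "-1" = from start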
I'm working with a Kafka Consumer using KAFKAJS.
I have a requirement to know all the partitions assigned to a specific consumer.
For example, let's say Topic 1 has 5 partitions. And there are 2 consumers with the same clientId.
So, one will be assigned 3 partitions and the other 2. I want each consumer to know which partitions it has been assigned.
We can use consumer.describeGroup to query the state of the consumer group. https://kafka.js.org/docs/consuming#a-name-describe-group-a-describe-group
The memberAssignment field describes which topic-partitions are assigned to each member, but it's a Buffer, so you'll need to decode it using AssignerProtocol.MemberAssignment.decode:
https://github.com/tulios/kafkajs/blob/master/index.js#L18
We can also listen to the GROUP_JOIN event. It already contains the member assignment, so you don't need to actually describeGroup:
https://kafka.js.org/docs/instrumentation-events#consumer
I get that consumer groups will retrieve messages from the Event Hub; however:
Assume I have 1 producer, 32 partitions and 2 CGs. CG1 will use the data in a different system but needs the same messages.
Does a message replicate across all 32 partitions, so that CG1 and CG2 will retrieve the same sequenced data?
New to this so thanks for any help!
Both consumer groups will receive the same data; a message is written to a single partition (not copied to all 32), and each consumer group reads every partition independently. They both have an independent view of that data. That means that CG1 might be further along in processing the messages than CG2. If, for example, CG1 processes messages in memory, it might be going through them faster than CG2, which is doing a lot of I/O to handle the messages. But they will both have access to all the data.
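A quick way to see this in practice is to point two receivers at the same Event Hub using different consumer groups; both callbacks will print every event. A minimal sketch with the Python SDK, assuming placeholder connection details and consumer groups named cg1 and cg2:

import threading
from azure.eventhub import EventHubConsumerClient

CONN_STR = "<event-hubs-connection-string>"   # placeholder
EVENTHUB_NAME = "<event-hub-name>"            # placeholder

def run_receiver(consumer_group):
    client = EventHubConsumerClient.from_connection_string(
        CONN_STR, consumer_group=consumer_group, eventhub_name=EVENTHUB_NAME)

    def on_event(partition_context, event):
        # Each consumer group sees every event; only its own offset differs.
        print(f"[{consumer_group}] partition {partition_context.partition_id}: "
              f"{event.body_as_str()}")

    with client:
        client.receive(on_event=on_event, starting_position="-1")

# Both receivers print the same per-partition sequence of events.
for cg in ("cg1", "cg2"):
    threading.Thread(target=run_receiver, args=(cg,), daemon=True).start()

threading.Event().wait()   # keep the main thread alive while the receivers run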
Does this make sense to you?
I have a scenario where different types of messages are streamed from a Kafka producer.
If I don't want to use a different topic per message type, how do I handle this on the Spark Structured Streaming consumer side?
I.e. I want to use only one topic for different types of messages, say Student records, Customer records, etc.
How do I identify which type of message has been received from the Kafka topic?
Please let me know how to handle this scenario on the Kafka consumer side.
Kafka topics don't inherently have "types of data". It's all bytes, so yes, you can serialize completely separate objects into the same topic, but consumers must then add logic that knows about all the possible types that will be added to the topic.
That being said, Structured Streaming is built on the idea of having structured data with a schema, so it likely will not work if you have completely different types in the same topic without at least performing a filter first, based on some inner attribute that is always present among all types.
Yes, you can do this by adding "some attribute" to the message itself when producing, which signifies a logical topic or operation, and then differentiating on the Spark side via the Structured Streaming Kafka integration, e.g. checking the message content for that attribute and processing accordingly.
Partitioning is, of course, still what is used for ordering.
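For example, if each producer embeds a discriminator field in a JSON payload (the field name recordType, the topic name and the broker address below are assumptions), the Structured Streaming side could parse it and split the stream roughly like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("multi-type-topic").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder broker
       .option("subscribe", "mixed-topic")                     # placeholder topic
       .load())

# Envelope schema: only the fields common to every message type; the
# type-specific payload stays a raw JSON string and is parsed downstream.
envelope = StructType([
    StructField("recordType", StringType()),   # e.g. "student" or "customer"
    StructField("payload", StringType()),
])

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), envelope).alias("msg"))
          .select("msg.recordType", "msg.payload"))

# Route each logical type to its own sink / processing logic.
students = parsed.filter(col("recordType") == "student")
customers = parsed.filter(col("recordType") == "customer")

students_query = (students.writeStream.format("console")
                  .option("checkpointLocation", "/tmp/chk/students").start())
customers_query = (customers.writeStream.format("console")
                   .option("checkpointLocation", "/tmp/chk/customers").start())

spark.streams.awaitAnyTermination()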