Is it possible to have multiple producers for the same topic on Pulsar? - apache-pulsar

I know you can set topic subscription to be shared subscription to allow for multiple Consumers on the same topic. Can this also be done for multiple Producers?
For some reason, when I try to, I get the error: Producer with name '<topic_name>' is already connected to topic

Yes, you can have multiple producers on a topic. You just have to make sure each producer has a unique name. From the ProducerBuilder.producerName section of the Java client API docs:
When specifying a name, it is up to the user to ensure that, for a
given topic, the producer name is unique across all Pulsar's clusters.
Brokers will enforce that only a single producer a given name can be
publishing on a topic.
The easiest way to ensure the producer name is unique is to let Pulsar set it automatically for you. From the same section:
If not assigned, the system will generate a globally unique name which
can be accessed with Producer.getProducerName().
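To make the rule concrete, here is a toy in-memory sketch of the behaviour the docs describe. It is not the Pulsar client API: ToyTopic, connect, and the "auto-" prefix are names made up for illustration, and a real broker does far more than this.

```python
import uuid


class ToyTopic:
    """Toy stand-in for a broker-side topic; NOT the Pulsar client API."""

    def __init__(self):
        self.producers = set()

    def connect(self, producer_name=None):
        # No name given: generate a unique one, as the docs say the system does.
        if producer_name is None:
            producer_name = f"auto-{uuid.uuid4()}"
        # A name that is already in use on this topic is rejected.
        if producer_name in self.producers:
            raise ValueError(
                f"Producer with name '{producer_name}' is already connected to topic"
            )
        self.producers.add(producer_name)
        return producer_name


topic = ToyTopic()
first = topic.connect()           # auto-generated name
second = topic.connect()          # a second producer on the same topic is fine
named = topic.connect("orders")   # explicit names work while they stay unique

try:
    topic.connect("orders")       # reusing a name is refused
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

In this model, any number of producers can attach to the topic, and only a reused name triggers the error from the question.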

Related

Multiple instance of producers sharing the same producer name

I have a REST API that pushes incoming requests to an Apache Pulsar topic. The producer has a name (say, "api-integration-producer"). When I run multiple instances of this service (typically in Kubernetes), the service fails to start, complaining that a producer with the name "api-integration-producer" is already registered with the Pulsar broker.
So this means I cannot run multiple instances of the service with a producer that produces to the same topic, or rather, with a producer that shares the same name. However, I have solved this problem by generating a random producer name (appending a UUID to "api-integration-producer").
Does this have an impact on the exactly-once scenario? What is the right way to name & run the Pulsar producers?
A random producer name is fine for most cases.
Something to think about: https://www.splunk.com/en_us/blog/it/effectively-once-semantics-in-apache-pulsar.html
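The naming approach from the question (a fixed prefix plus a UUID) could be sketched roughly like this. make_producer_name and the "pod-0" instance id are hypothetical names for illustration. The resulting string would then be passed as the producer name to whichever client you use.

```python
import uuid


def make_producer_name(base, instance_id=None):
    """Return '<base>-<suffix>'; the suffix is instance_id if given, else a random UUID."""
    suffix = instance_id if instance_id is not None else str(uuid.uuid4())
    return f"{base}-{suffix}"


# Random suffix: unique on every start, as described in the question.
random_a = make_producer_name("api-integration-producer")
random_b = make_producer_name("api-integration-producer")

# Stable suffix (hypothetical per-instance id): same name after a restart.
stable = make_producer_name("api-integration-producer", instance_id="pod-0")
```

One hedged aside on the exactly-once question: as far as I understand, Pulsar's broker-side deduplication is tracked per producer name, so a name that changes on every restart cannot be matched against its previous sequence. If that matters to you, a stable per-instance suffix may be a better fit than a fresh UUID.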
You have to choose access mode.
Are you using a partitioned or non-partitioned topic?
https://pulsar.apache.org/docs/en/concepts-messaging/#access-mode
https://github.com/apache/pulsar/wiki/PIP-68:-Exclusive-Producer

Understanding Azure Event Hubs partitioned consumer pattern

Azure Event Hub uses the partitioned consumer pattern described in the docs.
I have some problems understanding the consumer side of this model when it comes to a real world scenario.
So let's say I have 1000 messages sent to the event hub with 4 partitions, without defining any partition ID. This means the messages will be distributed across all partitions using the round-robin method.
Now I want to have two applications distributing the messages to two different databases. My questions there are:
Let's say for the first application, I want to store all messages in Database 1. This means that, for maximum speed, I need 4 threads (consumers) in my consumer application, each listening to one partition of the event hub, right? Each of them also has to store its own offset for the partition it is reading (checkpoint).
Let's say my second application wants to filter the messages and only store a subset of them in Database 2. There I also need 4 consumers since I don't know which message goes to which partition, right?
Also, for the two applications I need to have two consumer groups, but why? Is the filtering of the messages defined in the consumer group? I don't really get why I need this, since the applications' consumers store the partition checkpoints themselves and I can do the filtering within the applications themselves.
I know there is the EventProcessorHost class but I want to understand the concept of the EventHub on a lower level.
Let's say for the first application, I want to store all messages in Database 1. This means that, for maximum speed, I need 4 threads (consumers) in my consumer application, each listening to one partition of the event hub, right? Each of them also has to store its own offset for the partition it is reading (checkpoint).
Correct, you should have one process per provisioned partition. So, if you have 4 partitions you should have 4 processes, each processing the messages of a specific partition. If you process the messages using an EventProcessorHost, it will take care of spinning up the processes for you.
Let's say my second application wants to filter the messages and only store a subset of them in Database 2. There I also need 4 consumers since I don't know which message goes to which partition, right?
What do you mean by a consumer? You need another 4 processes to process the messages, but they should be configured to read using a different consumer group. Otherwise they will compete with the processes of the first application.
Also, for the two applications I need to have two consumer groups, but why? Is the filtering of the messages defined in the consumer group? I don't really get why I need this, since the applications' consumers store the partition checkpoints themselves and I can do the filtering within the applications themselves.
Let us define a consumer group:
Consumer groups enable multiple consuming applications to each have a separate view of the incoming message stream, and to read the stream independently at its own pace with its own offset
So yes, you need 2 different consumer groups.
Each consumer group will get all messages sent to the event hub partitions. Each consumer group tracks its own progress in the stream of messages. That is why you need two for your scenario.
Say you define an additional consumer group called "App2-Consumer-Group", the reader processes will receive all messages but should take no action for messages they are not interested in.
If you do not create an additional consumer group, the reader processes for the default consumer group will process the messages for the first application and mark them as processed using the checkpointing mechanism. The reader processes for the second application won't get any messages since they are already marked as processed. (In real life, when using one consumer group, some messages might be picked up by the reader processes for the first application and some by the reader processes for the second application, as the processes will try to get a lock on a specific partition.)
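The difference between one shared consumer group and two separate ones can be sketched with a toy in-memory model. This is not the Azure SDK: ToyEventHub, send, read, and the group names are all made up for illustration, and the "keep every tenth event" filter is an arbitrary stand-in for the second application's logic.

```python
from collections import defaultdict


class ToyEventHub:
    """Toy in-memory model of partitions plus per-group offsets; not the Azure SDK."""

    def __init__(self, partition_count):
        self.partitions = [[] for _ in range(partition_count)]
        self._next = 0
        # checkpoints[group][partition] -> index of the next event to read
        self.checkpoints = defaultdict(lambda: defaultdict(int))

    def send(self, event):
        # No partition id given: spread events round-robin.
        self.partitions[self._next].append(event)
        self._next = (self._next + 1) % len(self.partitions)

    def read(self, group, partition):
        """Return the unread events for this group and partition, then checkpoint."""
        start = self.checkpoints[group][partition]
        events = self.partitions[partition][start:]
        self.checkpoints[group][partition] = len(self.partitions[partition])
        return events


hub = ToyEventHub(partition_count=4)
for i in range(1000):
    hub.send(i)

# Application 1 stores everything; application 2 filters in its own code.
db1 = [e for p in range(4) for e in hub.read("app1", p)]
db2 = [e for p in range(4) for e in hub.read("app2", p) if e % 10 == 0]

# Two applications sharing ONE group: the second finds nothing left to read.
shared_first = [e for p in range(4) for e in hub.read("shared", p)]
shared_second = [e for p in range(4) for e in hub.read("shared", p)]
```

Here both "app1" and "app2" see all 1000 events because each group keeps its own checkpoint per partition, while the second reader on the "shared" group gets nothing. The filtering lives in the application, not in the consumer group.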
I think this image shows clearly how consumer groups track their own progress in the stream of messages, and hence why you need two of them if you have 2 different sets of processing logic for the 2 different applications:

Unique identifier for Spring Integration

I have a pub/subscribe queue in Spring Integration. Once a message is put on the queue I can see that a new message ID is generated, with a different message ID for each of the subscribers. I want to use the initial unique message ID as a unique identifier while the message flows through various microservice subscribers. Can I get the original message ID from each of the subscribers?
Also, if I had multiple Spring Integration instances writing messages into a single Kafka queue, would the message ID be unique?
I think Kafka deserves its own SO question. Regarding the same ID for all the subflows: how about setting applySequence = true on the PublishSubscribeChannel? Each message copy will then be sent with the sequence details headers, where IntegrationMessageHeaderAccessor.CORRELATION_ID is an exact copy of the original message's ID.
The point of messaging is that each new message really should be a new, unique object. This way each message is a standalone entity that doesn't affect the others and may not even know they exist. Statelessness is one of the consistency goals of messaging per se.
Therefore, if you would like to carry some identifier through all the messages, you should consider using some other header, not the id. For this purpose the framework already provides a conventional mechanism called correlation and sequence details: https://docs.spring.io/spring-integration/docs/5.0.4.RELEASE/reference/html/messaging-channels-section.html#channel-configuration-pubsubchannel
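As a language-neutral illustration of the correlation idea, here is a toy model in Python. It is not Spring Integration code: new_message and publish_with_sequence are hypothetical helpers, and the header keys simply mirror the "id", "correlationId", "sequenceNumber" and "sequenceSize" header names.

```python
import uuid


def new_message(payload, **headers):
    """Build a message dict with a fresh, unique 'id' header."""
    return {"payload": payload, "headers": {**headers, "id": str(uuid.uuid4())}}


def publish_with_sequence(message, subscriber_count):
    """Give each subscriber its own copy that points back at the original id."""
    original_id = message["headers"]["id"]
    return [
        new_message(
            message["payload"],
            correlationId=original_id,
            sequenceNumber=n,
            sequenceSize=subscriber_count,
        )
        for n in range(1, subscriber_count + 1)
    ]


original = new_message("order-42")
copies = publish_with_sequence(original, subscriber_count=3)
```

Every copy gets a new id of its own, but all of them carry the original message's id in correlationId, which is the value a subscriber would read to trace the flow.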

what is best practice to consume messages from multiple kafka topics?

I need to consume messages from different Kafka topics.
Should I create a different consumer instance per topic and then start a new processing thread per partition,
or
should I subscribe to all topics from a single consumer instance and then start different processing threads?
Thanks & regards,
Megha
The only rule is that you have to account for what Kafka does and doesn't guarantee:
Kafka only guarantees message order for a single topic/partition. edit: this also means you can get messages out of order if your single topic Consumer switches partitions for some reason.
When you subscribe to multiple topics with a single Consumer, that Consumer is assigned a topic/partition pair for each requested topic.
That means the order of incoming messages for any one topic will be correct, but you cannot guarantee that ordering between topics will be chronological.
You also can't guarantee that you will get messages from any particular subscribed topic in any given period of time.
I recently had a bug because my application subscribed to many topics with a single Consumer. Each topic was a live feed of images at one image per message. Since all the topics always had new images, each poll() was only returning images from the first topic to register.
If processing all messages is important, you'll need to be certain that each Consumer can process messages from all of its subscribed topics faster than the messages are created. If it can't, you'll either need more Consumers committing reads in the same group, or you'll have to be OK with the fact that some messages may never be processed.
Obviously one Consumer/topic is the simplest, but it does add some overhead to have the additional Consumers. You'll have to determine whether that's important based on your needs.
The only way to correctly answer your question is to evaluate your application's specific requirements and capabilities, and build something that works within those and within Kafka's limitations.
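The "can each Consumer keep up?" check above is just arithmetic, and a back-of-the-envelope sketch may help. The topic names and rates below are invented for illustration, and consumers_needed is a hypothetical helper, not part of any Kafka client.

```python
import math


def consumers_needed(produce_rates, per_consumer_rate):
    """Smallest number of consumers in one group whose combined rate covers production."""
    total = sum(produce_rates.values())
    return math.ceil(total / per_consumer_rate)


# Hypothetical image feeds, in messages per second.
rates = {"camera-a": 30, "camera-b": 30, "camera-c": 45}

# Assume one consumer can process about 50 messages per second.
needed = consumers_needed(rates, per_consumer_rate=50)
single_consumer_keeps_up = sum(rates.values()) <= 50
```

With these made-up numbers a single consumer falls behind, so either more consumers join the same group or some messages go unprocessed. Bear in mind that consumers in a group beyond the total partition count would sit idle, so the partition count caps how far this scales.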
This really depends on the logic of your application: does it need to see all messages together in one place, or not? Sometimes consuming from a single topic is easier to implement in terms of your application's business logic.

Azure event hubs and multiple consumer groups

Need help on using Azure event hubs in the following scenario. I think consumer groups might be the right option for this scenario, but I was not able to find a concrete example online.
Here is the rough description of the problem and the proposed solution using the event hubs (I am not sure if this is the optimal solution. Will appreciate your feedback)
I have multiple event-sources that generate a lot of event data (telemetry data from sensors) which needs to be saved to our database and some analysis like running average, min-max should be performed in parallel.
The sender can only send data to a single endpoint, but the event-hub should make this data available to both the data handlers.
I am thinking about using two consumer groups. The first one will be a cluster of worker role instances that take care of saving the data to our key-value store, and the second consumer group will be an analysis engine (likely Azure Stream Analytics).
Firstly, how do I set up the consumer groups, and is there something that I need to do on the sender/receiver side so that copies of events appear in all consumer groups?
I did read many examples online, but they either use client.GetDefaultConsumerGroup(); and/or have all partitions processed by multiple instances of the same worker role.
For my scenario, when an event is triggered, it needs to be processed by two different worker roles in parallel (one that saves the data and a second one that does some analysis).
Thank You!
TLDR: Looks reasonable, just make two Consumer Groups by using different names with CreateConsumerGroupIfNotExists.
Consumer Groups are primarily a concept so exactly how they work depends on how your subscribers are implemented. As you know, conceptually they are a group of subscribers working together so that each group receives all the messages and under ideal (won't happen) circumstances probably consumes each message once. This means that each Consumer Group will "have all partitions processed by multiple instances of the same worker role." You want this.
This can be implemented in different ways. Microsoft has provided two ways to consume messages from Event Hubs directly, plus the option to use things like Stream Analytics, which are probably built on top of the two direct ways. The first way is the Event Hub Receiver; the second, which is higher level, is the Event Processor Host.
I have not used Event Hub Receiver directly so this particular comment is based on the theory of how these sorts of systems work and speculation from the documentation: While they are created from EventHubConsumerGroups this serves little purpose as these receivers do not coordinate with one another. If you use these you will need to (and can!) do all the coordination and committing of offsets yourself which has advantages in some scenarios such as writing the offset to a transactional DB in the same transaction as computed aggregates. Using these low level receivers, having different logical consumer groups using the same Azure consumer group probably shouldn't (normative not practical advice) be particularly problematic, but you should use different names in case it either does matter or you change to EventProcessorHosts.
Now onto more useful information, EventProcessorHosts are probably built on top of EventHubReceivers. They are a higher level thing and there is support to enable multiple machines to work together as a logical consumer group. Below I've included a lightly edited snippet from my code that makes an EventProcessorHost with a bunch of comments left in explaining some choices.
//Namespaces assumed for the classic Service Bus SDK types used below.
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

//We need an identifier for the lease. It must be unique across concurrently
//running instances of the program. There are three main options for this. The
//first is a static value from a config file. The second is the machine's NETBIOS
//name, i.e. System.Environment.MachineName. The third is a random value unique per
//run, which we have chosen here; if our VMs have very weak randomness bad things
//may happen.
string hostName = Guid.NewGuid().ToString();

//It's not clear if we want this here long term or if we prefer that the Consumer
//Groups be created out of band. Nor are there necessarily good tools to discover
//existing consumer groups.
NamespaceManager namespaceManager =
    NamespaceManager.CreateFromConnectionString(eventHubConnectionString);
EventHubDescription ehd = namespaceManager.GetEventHub(eventHubPath);
namespaceManager.CreateConsumerGroupIfNotExists(ehd.Path, consumerGroupName);

host = new EventProcessorHost(hostName, eventHubPath, consumerGroupName,
    eventHubConnectionString, storageConnectionString, leaseContainerName);

//Call something like this when you want it to start
host.RegisterEventProcessorFactoryAsync(factory);
You'll notice that I told Azure to make a new Consumer Group if it doesn't exist; without that call, you'll get a lovely error message if it doesn't. I honestly don't know what the purpose of this is, because it doesn't include the Storage connection string, which needs to be the same across instances in order for the EventProcessorHost's coordination (and presumably commits) to work properly.
Here I've provided a picture from Azure Storage Explorer of the leases, and presumably offsets, from a Consumer Group I was experimenting with in November. Note that while I have a testhub and a testhub-testcg container, this is due to manually naming them. If they were in the same container it would be things like "$Default/0" vs "testcg/0".
As you can see, there is one blob per partition. My assumption is that these blobs are used for two things. The first is the blob leases for distributing partitions amongst instances (see here); the second is storing the offsets within the partition that have been committed.
Rather than the data getting pushed to the Consumer Groups the consuming instances are asking the storage system for data at some offset in one partition. EventProcessorHosts are a nice high level way of having a logical consumer group where each partition is only getting read by one consumer at a time, and where the progress the logical consumer group has made in each partition is not forgotten.
Remember that the throughput per partition is measured so that if you're maxing out ingress you can only have two logical consumers that are all up to speed. As such you'll want to make sure you have enough partitions, and throughput units, that you can:
Read all the data you send.
Catch up within the 24 hour retention period if you fall behind for a few hours due to issues.
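The catch-up requirement can be turned into a rough calculation. The figures below are assumptions for illustration: 1 MB/s of ingress and 2 MB/s of egress, which is my understanding of the per-throughput-unit limits, and a 3 hour outage. catch_up_hours is a hypothetical helper.

```python
import math


def catch_up_hours(outage_hours, ingress_mb_s, read_mb_s):
    """Hours needed to clear a backlog while new data keeps arriving."""
    if read_mb_s <= ingress_mb_s:
        return math.inf  # the backlog never shrinks
    backlog_mb = outage_hours * 3600 * ingress_mb_s
    return backlog_mb / ((read_mb_s - ingress_mb_s) * 3600)


# One logical consumer with the whole (assumed) 2 MB/s of egress to itself.
one_group = catch_up_hours(outage_hours=3, ingress_mb_s=1.0, read_mb_s=2.0)

# Two logical consumers sharing that egress get 1 MB/s each: no headroom at all.
two_groups = catch_up_hours(outage_hours=3, ingress_mb_s=1.0, read_mb_s=1.0)
```

Under these assumptions a single consumer needs about as long as the outage itself to catch up, and two consumers at maxed-out ingress can keep pace but never recover a backlog. If the catch-up time comes out longer than the retention period, the oldest unread data would expire first.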
In conclusion: consumer groups are what you need. The examples you read that use a specific consumer group are good; within each logical consumer group use the same name for the Azure Consumer Group, and have different logical consumer groups use different names.
I haven't yet used Azure Stream Analytics, but at least during the preview release you are limited to the default consumer group. So don't use the default consumer group for something else, and if you need two separate lots of Azure Stream Analytics you may need to do something nasty. But it's easy to configure!
