Apache Pulsar message filtering based on consumer ID

We have a unique requirement in our Apache Pulsar solution: we need to filter message content based on who is consuming the message from a given topic. We could solve this by creating a separate topic per consumer, but we would like to know if there is a better way to have a single topic that all consumers connect to, with the message content filtered based on who is receiving it.
I read about the EntryFilter interface in Apache Pulsar, but I am not sure whether it is meant for the producer or the consumer side.
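For reference, EntryFilter is a broker-side plugin interface (available since Pulsar 2.10), so it is neither a producer nor a consumer API: the broker decides, per subscription, whether to deliver each entry. Below is a minimal sketch in Java, assuming the producer stamps each message with a hypothetical "target" property naming the subscription that should receive it:

import org.apache.bookkeeper.mledger.Entry;
import org.apache.pulsar.broker.service.plugin.EntryFilter;
import org.apache.pulsar.broker.service.plugin.FilterContext;

public class PerSubscriptionFilter implements EntryFilter {

    @Override
    public FilterResult filterEntry(Entry entry, FilterContext context) {
        // Look up the assumed "target" property set by the producer.
        String target = context.getMsgMetadata().getPropertiesList().stream()
                .filter(kv -> "target".equals(kv.getKey()))
                .map(kv -> kv.getValue())
                .findFirst()
                .orElse(null);

        // No target property: deliver to every subscription.
        if (target == null || target.equals(context.getSubscription().getName())) {
            return FilterResult.ACCEPT;
        }
        return FilterResult.REJECT; // skip this entry for this subscription
    }

    @Override
    public void close() {
        // nothing to release in this sketch
    }
}

The filter is packaged as a NAR archive and enabled through the broker configuration (entryFilterNames / entryFiltersDirectory), so the filtering happens at dispatch time rather than in the client.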

Related

How do multiple Kafka consumers in the same consumer group read messages from one partition in a topic?

I would like to know how consumers in the same consumer group read messages from one topic that has only one partition.
For example, I have 3 consumers in one consumer group, and that group is polling messages from Topic A, which has a single partition. If 1000 messages arrive one by one in Topic A, how will they be delivered to the 3 consumers?
Will 3 messages be delivered to the 3 consumers in parallel, and once each is processed, another one delivered? In other words, will they receive messages in parallel?
Or will a single consumer fetch all of those messages, since there is only one partition?
Please also suggest the best architecture approach for the above scenario.
Thanks.
I want to process multiple messages in parallel from one topic that has one partition, across 4 consumers.
I am using Kafka with Node.js microservices and the kafkajs package.
In your scenario, only one consumer of that consumer group will read the data, most probably the first one you started. I'm not 100% sure, as I never tried it out, but I assume the additional consumers will just idle without workload.
This question is essentially the same as yours.
If you want parallelism across consumers, you cannot avoid having multiple partitions; that is the main purpose of the whole partitioning concept (see the sketch below).
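A minimal sketch of this behavior, using the Java client for illustration (the question uses kafkajs; the broker address, group id, and topic name here are placeholders). If several copies of this program join the same group on a single-partition topic, only the instance that owns the partition receives records; the others poll and get nothing back:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SinglePartitionGroupDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("group.id", "demo-group");              // same group.id in every instance
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("topic-a")); // hypothetical single-partition topic
            while (true) {
                // Only the group member assigned the single partition gets records here.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value()));
            }
        }
    }
}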

Multiple instances of producers sharing the same producer name

I have a REST API that pushes incoming requests to an Apache Pulsar topic. The producer has a name (say, "api-integration-producer"). As I run multiple instances of this service (typically in Kubernetes), the service fails to start, complaining that a producer with the name "api-integration-producer" is already registered with the Pulsar broker.
So this means I cannot run multiple instances of the service with a producer that produces to the same topic, or with producers that share the same name. However, I have solved this problem by generating a random producer name (appending a UUID to "api-integration-producer").
Does this have an impact on the exactly-once scenario? What is the right way to name and run Pulsar producers?
A random producer name is fine for most cases.
Something to think about: https://www.splunk.com/en_us/blog/it/effectively-once-semantics-in-apache-pulsar.html
You also have to choose an access mode.
Are you using a partitioned or non-partitioned topic?
https://pulsar.apache.org/docs/en/concepts-messaging/#access-mode
https://github.com/apache/pulsar/wiki/PIP-68:-Exclusive-Producer
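A minimal sketch of the random-name approach, assuming the default Shared access mode (the service URL and topic name are placeholders):

import java.util.UUID;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.ProducerAccessMode;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class NamedProducerExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Producer<byte[]> producer = client.newProducer(Schema.BYTES)
                .topic("persistent://public/default/api-integration")
                // A unique name per instance avoids the "producer already
                // registered" error when several replicas start up.
                .producerName("api-integration-producer-" + UUID.randomUUID())
                // Shared is the default; Exclusive (PIP-68) would instead
                // guarantee a single writer on the topic.
                .accessMode(ProducerAccessMode.Shared)
                .create();

        producer.send("hello".getBytes());
        producer.close();
        client.close();
    }
}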

Reading messages in bulk through a Pulsar consumer

I am using the Node.js Pulsar client to consume messages from a Pulsar topic. The consumer is subscribed to the topic using the shared subscription mode. Currently, each call to receive gets a single message from the topic. Is there a way to receive messages in bulk?
The fact that you get messages one by one doesn't mean that the Pulsar client doesn't use batching and other optimization techniques in the background. The official documentation for the Pulsar Java consumer describes the receiverQueueSize parameter, which controls how many messages are accumulated client-side. By default, the Pulsar consumer uses reasonable values for its parameters and should perform quite well for most applications. Are you experiencing any issues or slow performance?
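For example, the prefetch queue can be tuned when building the consumer (a Java sketch, assuming an existing pulsarClient as in the snippet further down; topic and subscription names are placeholders):

Consumer<byte[]> consumer = pulsarClient.newConsumer()
        .topic("my-topic")
        .subscriptionName("my-subscription")
        .receiverQueueSize(2000) // prefetch up to 2000 messages in the client
        .subscribe();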
Update
Since version 2.4.1 of Apache Pulsar, it is possible to receive messages in batches using the consumer. First, create the consumer with a BatchReceivePolicy config (change the values to whatever suits your use case):
Consumer<GenericRecord> consumer = pulsarClient
        .newConsumer(Schema.AUTO_CONSUME())
        .topic("my-topic")                   // other configuration such as
        .subscriptionName("my-subscription") // topic and subscription
        .batchReceivePolicy(BatchReceivePolicy.builder()
                .maxNumMessages(5000)
                .maxNumBytes(10 * 1024 * 1024)
                .timeout(1, TimeUnit.SECONDS)
                .build())
        .subscribe();
Second, use the batchReceive method to get a batch of messages:
Messages<GenericRecord> messages = consumer.batchReceive();
When all messages are processed, simply acknowledge all of them:
consumer.acknowledge(messages);

Unique identifier for Spring Integration

I have a publish/subscribe channel in Spring Integration. Once a message is put on the channel, I can see a new message ID generated, and a different message ID for each of the subscribers. I want to use the initial unique message ID as an identifier while the message flows through the various microservice subscribers. Can I get the original message ID from each of the subscribers?
Also, if I had multiple Spring Integration instances writing messages into a single Kafka topic, would the message ID be unique?
I think Kafka deserves its own SO question. Regarding the same ID for all the subflows: how about applySequence = true on the PublishSubscribeChannel? Each message copy will then be sent with the sequence-details headers, where the IntegrationMessageHeaderAccessor.CORRELATION_ID is exactly a copy of the original message's ID.
The thing about messaging is that each new message should really be a new, unique object. That way each message is a stand-alone entity; it doesn't affect the others and may not even know about their existence. Statelessness is one of the consistency goals of messaging per se.
Therefore, if you would like to carry some identifier across all the messages, you should consider using some other header, not the id. For this purpose the framework already provides a conventional mechanism called correlation and sequence details (see the sketch below): https://docs.spring.io/spring-integration/docs/5.0.4.RELEASE/reference/html/messaging-channels-section.html#channel-configuration-pubsubchannel
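A minimal sketch of that suggestion using Java configuration (the bean and method names here are arbitrary):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.IntegrationMessageHeaderAccessor;
import org.springframework.integration.channel.PublishSubscribeChannel;
import org.springframework.messaging.Message;

@Configuration
public class PubSubConfig {

    @Bean
    public PublishSubscribeChannel fanOutChannel() {
        PublishSubscribeChannel channel = new PublishSubscribeChannel();
        // Each subscriber's copy gets the sequence-details headers;
        // CORRELATION_ID carries the id of the original message.
        channel.setApplySequence(true);
        return channel;
    }

    // In any subscriber, the original id can be read back from the headers:
    static Object originalId(Message<?> copy) {
        return copy.getHeaders().get(IntegrationMessageHeaderAccessor.CORRELATION_ID);
    }
}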

What is the best practice to consume messages from multiple Kafka topics?

I need to consume messages from different Kafka topics.
Should I create a different consumer instance per topic and then start a new processing thread per partition,
or
should I subscribe to all topics from a single consumer instance and then start different processing threads?
Thanks & regards,
Megha
The only rule is that you have to account for what Kafka does and doesn't guarantee:
Kafka only guarantees message order within a single topic/partition. Edit: this also means you can get messages out of order if your single-topic Consumer switches partitions for some reason.
When you subscribe to multiple topics with a single Consumer, that Consumer is assigned a topic/partition pair for each requested topic.
That means the order of incoming messages for any one topic will be correct, but you cannot guarantee that ordering between topics will be chronological.
You also can't guarantee that you will get messages from any particular subscribed topic in any given period of time.
I recently had a bug because my application subscribed to many topics with a single Consumer. Each topic was a live feed of images, at one image per message. Since all the topics always had new images, each poll() only returned images from the first topic to register.
If processing all messages is important, you'll need to be certain that each Consumer can process messages from all of its subscribed topics faster than the messages are created. If it can't, you'll either need more Consumers committing reads in the same group, or you'll have to be OK with the fact that some messages may never be processed.
Obviously one Consumer per topic is the simplest, but the additional Consumers do add some overhead. You'll have to determine whether that's important based on your needs.
The only way to correctly answer your question is to evaluate your application's specific requirements and capabilities, and build something that works within those and within Kafka's limitations. A sketch of the single-Consumer, multi-topic case follows.
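A minimal sketch of the single-Consumer, multi-topic setup described above (Java client; broker address, group id, and topic names are placeholders). Iterating per partition makes the guarantee explicit: order is reliable inside each topic/partition, but not across topics:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MultiTopicConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "multi-topic-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("feed-a", "feed-b", "feed-c"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // Records within one topic/partition arrive in order; there is no
                // chronological guarantee between topics, and a busy topic can
                // dominate what poll() returns.
                records.partitions().forEach(tp ->
                        records.records(tp).forEach(r ->
                                System.out.printf("%s[%d]@%d: %s%n",
                                        tp.topic(), tp.partition(), r.offset(), r.value())));
            }
        }
    }
}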
This really depends on the logic of your application: does it need to see all messages together in one place, or not? Sometimes consuming from a single topic is easier to implement in terms of your application's business logic.
