Reading messages in bulk through a Pulsar consumer

I am using the Node.js Pulsar client to consume messages from a Pulsar topic. The consumer is subscribed to the topic in shared subscription mode. Currently, each call to receive gets a single message from the topic. Is there a way to receive messages in bulk?

The fact that you get messages one by one doesn't mean that the Pulsar client doesn't use batching and other optimization techniques in the background. The official documentation for the Pulsar Java consumer describes the receiverQueueSize parameter, which controls how many messages the consumer accumulates ahead of your receive calls. By default, the Pulsar consumer uses reasonable values for its parameters, and it should perform quite well for most applications. Do you experience any issues or slow performance?
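For example, the prefetch queue can be tuned when building the consumer. A minimal Java sketch, with placeholder topic and subscription names:

Consumer<byte[]> consumer = pulsarClient
        .newConsumer()
        .topic("my-topic")                     // placeholder
        .subscriptionName("my-subscription")   // placeholder
        .subscriptionType(SubscriptionType.Shared)
        .receiverQueueSize(1000)               // messages to prefetch; 1000 is the default
        .subscribe();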
Update
Since Apache Pulsar 2.4.1 it is possible to receive messages in batches with the consumer. First, create the consumer with a BatchReceivePolicy config (change the values to suit your use case):
Consumer<GenericRecord> consumer = pulsarClient
        .newConsumer(Schema.AUTO_CONSUME())
        .batchReceivePolicy(BatchReceivePolicy.builder()
                .maxNumMessages(5000)
                .maxNumBytes(10 * 1024 * 1024)
                .timeout(1, TimeUnit.SECONDS)
                .build())
        // .. other configuration such as topic and subscription
        .topic("my-topic")                     // placeholder
        .subscriptionName("my-subscription")   // placeholder
        .subscribe();
Second, use the batchReceive method to get a batch of messages:
Messages<GenericRecord> messages = consumer.batchReceive();
When all messages are processed, simply acknowledge all of them:
consumer.acknowledge(messages);
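Messages<T> is iterable, so the batch can be processed with a plain loop before the acknowledge call above:

for (Message<GenericRecord> message : messages) {
    // processing logic is application-specific
    System.out.println(message.getValue());
}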

Related

Apache Pulsar message filtering based on consumer id

We have a unique need in our Apache Pulsar solution: we need to filter message content based on who is consuming the message from a given topic. We could solve this by creating a separate topic per consumer, but we would like to know if there is a better way to have a single topic that all consumers connect to, with the message content filtered based on who is receiving it.
I read about the EntryFilter interface in Apache Pulsar but am not sure whether it applies to the producer or the consumer.
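For what it's worth, EntryFilter is a broker-side plugin evaluated when entries are dispatched to consumers, so it sits on the consuming path, not the producing one. A rough sketch of its shape, based on the interface introduced in Pulsar 2.10 (the "target" message property used for matching is a hypothetical convention, and method names may differ between versions):

public class PerConsumerFilter implements EntryFilter {
    @Override
    public FilterResult filterEntry(Entry entry, FilterContext context) {
        // Called by the broker per entry, per consumer, at dispatch time.
        String consumerName = context.getConsumer().consumerName();
        boolean matches = context.getMsgMetadata().getPropertiesList().stream()
                .anyMatch(p -> p.getKey().equals("target")
                            && p.getValue().equals(consumerName));
        return matches ? FilterResult.ACCEPT : FilterResult.REJECT;
    }
}

Note that an EntryFilter can only accept or reject entries per consumer; it does not rewrite message content.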

Consumer is receiving only 50 percent of the messages published to the topic

We're noticing that exactly 50 percent of the messages produced to our Pulsar topic are reaching our app. Everything was working fine yesterday, when our Pulsar consumer app was getting 100% of the messages produced to the topic. We haven't made any settings changes in our app. What is happening to the missing messages? Where are they going?
Pulsar isn't losing your messages.
It looks like you're using a shared subscription with more than one consumer connected. That other consumer is receiving your other messages, since the topic dispatches them round-robin across consumers on a shared subscription. This behavior can occur by design if your consumers auto-scale on a shared subscription.
If you check the topic stats ($ pulsar-admin topics stats, see the Pulsar admin documentation), look for your subscription by name under "subscriptions" in the response. In that object, you can see the "type", which will be marked as "Shared", and a list of "consumers". I'd expect that you have more than one consumer in that list.
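The same check can be done programmatically with the Java admin client. A sketch, assuming a local broker and placeholder topic/subscription names (getters may vary slightly between Pulsar versions):

PulsarAdmin admin = PulsarAdmin.builder()
        .serviceHttpUrl("http://localhost:8080")   // placeholder admin URL
        .build();
TopicStats stats = admin.topics().getStats("persistent://public/default/my-topic");
SubscriptionStats sub = stats.getSubscriptions().get("my-subscription");
System.out.println(sub.getType());             // "Shared"
System.out.println(sub.getConsumers().size()); // more than 1 explains the 50/50 split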

Spring Integration - Parallel ordered processing

In the application I work on, there is a continuous flow of messages coming from a TCP socket. The messages have different types. Different types of messages should be processed in parallel, but each specific type of message must be processed in the order it arrives.
I used ExecutorChannel from Spring Integration, and it solves the parallel-processing need. I created a channel for each specific type of message.
But I cannot guarantee ordered processing of messages for specific types.
Is there a way to do ordered processing with publish/subscribe channels while also using parallel processing?
Consider using the same ExecutorChannel approach, but as an input for each type. The trick is that each of them should be configured with a single-threaded executor. So, you have as many single-threaded executor channels as you have message types, as sketched below.
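A minimal sketch of that idea (channel names and handlers are illustrative):

// One single-threaded ExecutorChannel per message type:
// parallel across types, strictly ordered within a type.
ExecutorChannel typeAChannel = new ExecutorChannel(Executors.newSingleThreadExecutor());
ExecutorChannel typeBChannel = new ExecutorChannel(Executors.newSingleThreadExecutor());

typeAChannel.subscribe(message -> processTypeA(message));
typeBChannel.subscribe(message -> processTypeB(message));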
Another trick is a QueueChannel for each type, with fixed-delay polling endpoints as subscribers to those queues.
One more option, available since Spring Integration 5.0, is FluxMessageChannel. Ordering is guaranteed by the internal Reactor Flux, and parallelism is achieved by the subscribers: message processing in a Flux happens on the subscriber's thread.
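A hedged sketch of the Flux-based variant, grouping by a hypothetical "type" header and giving each group its own single-threaded scheduler so per-type order is preserved:

FluxMessageChannel channel = new FluxMessageChannel();

Flux.from(channel)                                        // FluxMessageChannel is a Publisher<Message<?>>
    .groupBy(message -> message.getHeaders().get("type")) // the "type" header is an assumption
    .flatMap(group -> group
            .publishOn(Schedulers.newSingle("type-" + group.key()))
            .doOnNext(message -> process(message)))       // process(...) is a placeholder
    .subscribe();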

What is best practice to consume messages from multiple Kafka topics?

I need to consume messages from different Kafka topics.
Should I create a different consumer instance per topic and then start a new processing thread per partition,
or
should I subscribe to all topics from a single consumer instance and then start different processing threads?
Thanks & regards,
Megha
The only rule is that you have to account for what Kafka does and doesn't guarantee:
Kafka only guarantees message order for a single topic/partition. Edit: this also means you can get messages out of order if your single-topic Consumer switches partitions for some reason.
When you subscribe to multiple topics with a single Consumer, that Consumer is assigned a topic/partition pair for each requested topic.
That means the order of incoming messages for any one topic will be correct, but you cannot guarantee that ordering between topics will be chronological.
You also can't guarantee that you will get messages from any particular subscribed topic in any given period of time.
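For illustration, a single Consumer subscribed to several topics; the broker address, group id, topic names, and process(...) are placeholders:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("topic-a", "topic-b"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        // Order is guaranteed only within one topic/partition,
        // not across topics and not over any time window between topics.
        process(record);
    }
}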
I recently had a bug because my application subscribed to many topics with a single Consumer. Each topic was a live feed of images at one image per message. Since all the topics always had new images, each poll() was only returning images from the first topic to register.
If processing all messages is important, you'll need to be certain that each Consumer can process messages from all of its subscribed topics faster than the messages are created. If it can't, you'll either need more Consumers committing reads in the same group, or you'll have to be OK with the fact that some messages may never be processed.
Obviously one Consumer/topic is the simplest, but it does add some overhead to have the additional Consumers. You'll have to determine whether that's important based on your needs.
The only way to correctly answer your question is to evaluate your application's specific requirements and capabilities, and build something that works within those and within Kafka's limitations.
This really depends on the logic of your application: does it need to see all messages together in one place, or not? Sometimes, consumption from a single topic can be easier to implement in terms of the business logic of your application.

MQTT to Kafka. How to avoid duplicates

My requirement is to load balance 2 MQTT nodes running on different VMs, with consumers connected to the MQTT brokers on both nodes. The job of the consumers is to subscribe to one topic and, after receiving the data, publish it to Kafka. The problem I see is that since both MQTT consumers are subscribed to the same topic, they will receive the same message and both will insert it into Kafka, thereby creating duplicates. Is there any way to avoid writing duplicates into Kafka?
I have tried the Mosquitto and Mosca brokers, but they do not support clustering, so subscribed clients were not getting messages if they were subscribed to a different node than the node where the message was published. Both nodes are behind HAProxy.
I am currently using the emqtt broker, which supports clustering and solves the load-balancing issue, but it seems it does not support shared subscriptions across cluster nodes.
A feature like the Kafka consumer group is what is required, I believe. Any ideas?
Have you tried HiveMQ?
It offers so-called shared subscriptions.
If shared subscriptions are used, all clients that share the same subscription will receive messages in an alternating fashion, so each message is delivered to only one client in the group.
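For illustration, a minimal sketch with the Eclipse Paho Java client, assuming a broker that supports the $share/<group>/<topic> syntax (URL, group, and topic names are placeholders):

MqttClient client = new MqttClient("tcp://broker:1883", MqttClient.generateClientId());
client.connect();

// Every bridge instance subscribes with the same group name;
// the broker delivers each message to only one member of the group,
// so only one instance forwards it to Kafka.
client.subscribe("$share/kafka-bridge/sensors/data", 1, (topic, message) -> {
    // forward message.getPayload() to Kafka here
});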
