Pulsar Reader misses messages that were written when reader was offline - apache-pulsar

I have a Pulsar reader that reads messages and processes them. As messages arrive on a topic, the reader receives them. But if the reader application goes offline for some time and then reconnects to the Pulsar cluster, it does not read the messages that were written to Pulsar while it was offline. When the reader is restarted, it also does not re-read the old messages that were already consumed. The reader was created with startMessageId(MessageId.latest).
Is this the expected behaviour of the reader, or am I missing something? Below is my code for reference.
PulsarClient pulsarClient = PulsarClient.builder().serviceUrl(pulsarServiceUrl).build();
Reader<String> reader = pulsarClient.newReader(Schema.STRING)
        .readerName("reader-1")
        .startMessageId(MessageId.latest)
        .topic("persistent://public/default/topic-1")
        .create();
while (true) {
    Message<String> data = reader.readNext(); // blocks if there are no new messages
    System.out.println(data.getValue());
}

With the Reader interface you have to keep track of the message IDs you are reading from the topic yourself. Because you are using MessageId.latest, every time the client connects (for example, after a restart) it positions itself at the end of the topic. If you want to pick up where you left off, you need to keep track of the last message ID received by the application and use that as the startMessageId.
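For illustration, here is a minimal sketch of that approach, assuming a local file as the checkpoint store (the path and variable names are made up; note that startMessageId is exclusive by default, so the reader resumes with the message after the checkpoint):

import java.nio.file.*;
import org.apache.pulsar.client.api.*;

Path checkpoint = Paths.get("/tmp/last-message-id"); // hypothetical checkpoint location

// Resume from the stored ID if we have one, otherwise start at the end.
MessageId start = Files.exists(checkpoint)
        ? MessageId.fromByteArray(Files.readAllBytes(checkpoint))
        : MessageId.latest;

Reader<String> reader = pulsarClient.newReader(Schema.STRING)
        .topic("persistent://public/default/topic-1")
        .startMessageId(start)
        .create();

while (true) {
    Message<String> msg = reader.readNext();
    System.out.println(msg.getValue());
    // Persist the checkpoint after each message; toByteArray() gives a durable form.
    Files.write(checkpoint, msg.getMessageId().toByteArray());
}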
If you don't want to keep track of the message IDs, you can get Pulsar to do that for you by using the Consumer interface instead. With a Consumer you specify a subscription name, and the subscription keeps track of your position in the topic. So if a client using the Consumer interface restarts, it will pick up where it left off. See the Pulsar documentation on consumers for more info.
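A minimal sketch of the Consumer approach, assuming the same topic as above and a hypothetical subscription name:

import org.apache.pulsar.client.api.*;

PulsarClient client = PulsarClient.builder().serviceUrl(pulsarServiceUrl).build();

// The subscription ("reader-1-sub" is a made-up name) stores the position
// server-side, so a restarted client resumes where it left off.
Consumer<String> consumer = client.newConsumer(Schema.STRING)
        .topic("persistent://public/default/topic-1")
        .subscriptionName("reader-1-sub")
        .subscribe();

while (true) {
    Message<String> msg = consumer.receive(); // blocks until a message arrives
    System.out.println(msg.getValue());
    consumer.acknowledge(msg); // acknowledging advances the subscription cursor
}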

Related

Consumer is receiving only 50 percent of the messages published to the topic

We're noticing that exactly 50 percent of the messages produced to our Pulsar topic are reaching our app. Everything was working fine yesterday, when our Pulsar consumer app was getting 100% of the messages produced to the topic. We haven't made any settings changes in our app. What is happening to the missing messages? Where are they going?
Pulsar isn't losing your messages.
It looks like you're using a shared subscription and have more than one consumer connected. That other consumer is receiving your other messages, since the topic dispatches them round-robin among the consumers of a shared subscription. This can happen by design if your consumers auto-scale on a shared subscription.
If you check the topic stats ($ pulsar-admin topics stats, documented in the pulsar-admin CLI reference), look in "subscriptions" in the response for your subscription by its name. In that object you can see the "type", which will be marked as "Shared", and a list of "consumers". I'd expect that you have more than one consumer in that list.
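If you'd rather check programmatically, here is a rough sketch using the Java admin client, assuming a recent Pulsar version (where the stats objects expose getters) and the default admin HTTP port:

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TopicStats;

public class CheckSubscriptions {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // assumption: default admin endpoint
                .build();

        TopicStats stats = admin.topics().getStats("persistent://public/default/topic-1");

        // Print each subscription's type and consumer count. A "Shared" type
        // with more than one consumer explains the round-robin delivery.
        stats.getSubscriptions().forEach((name, sub) ->
                System.out.printf("subscription=%s type=%s consumers=%d%n",
                        name, sub.getType(), sub.getConsumers().size()));

        admin.close();
    }
}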

"Bystander" Pulsar Consumer for logging

How do I create a Pulsar consumer that listens to one or more topics but does not acknowledge the messages, basically only "eavesdropping" on the conversation?
The Reader interface could be a good fit for this use case. At creation time you specify from which message you would like to start reading the topic, and there is no need to acknowledge messages. When the reader is stopped, no subscription is left behind. You can find an example in the Pulsar documentation.
One more option is to use interceptors, but that only works if you have access to the consumer code.
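For the interceptor route, here is a rough sketch of a logging interceptor, assuming a recent Pulsar Java client (the class name is made up); you would attach it with .intercept(new LoggingInterceptor()) on the consumer builder:

import java.util.Set;
import org.apache.pulsar.client.api.*;

public class LoggingInterceptor implements ConsumerInterceptor<String> {
    @Override
    public Message<String> beforeConsume(Consumer<String> consumer, Message<String> message) {
        // Called before the message is handed to the application; log and pass through.
        System.out.println("saw message " + message.getMessageId());
        return message;
    }

    // No-ops: we only observe, we don't change acknowledgement behavior.
    @Override public void onAcknowledge(Consumer<String> c, MessageId id, Throwable t) {}
    @Override public void onAcknowledgeCumulative(Consumer<String> c, MessageId id, Throwable t) {}
    @Override public void onNegativeAcksSend(Consumer<String> c, Set<MessageId> ids) {}
    @Override public void onAckTimeoutSend(Consumer<String> c, Set<MessageId> ids) {}
    @Override public void close() {}
}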

ReceiveAsync from Azure Service Bus topic without a body

I'm creating a consumer of an Azure Service Bus topic (subscription) that does nothing but store some statistics. The messages sent to the topic contain a rather large body, which is handled by another consumer on a second subscription on the same topic.
Since the statistics consumer can handle a large number of messages in one go, I was wondering if it is possible to receive a lot of messages but leave out the body, to improve performance when communicating with Service Bus and to receive even more messages in one go.
I'm currently doing this:
this.messageReceiver = new MessageReceiver(conn, path);
...
// Receive up to 10 messages, waiting at most 5 seconds for them.
await messageReceiver.ReceiveAsync(10, TimeSpan.FromSeconds(5));
It works pretty sweet but it would be nice to be able to receive 100 or more messages, without having to worry about moving large messages over the network.
Before anyone suggests it, I already know that I can ask for a count, etc. on a topic subscription. I still need the Message object since that contains an entry in the UserProperties dictionary that is used to calculate the stats.
Not possible. You can peek, but that brings the whole payload and headers, without incrementing the DeliveryCount of the message. You could request it as a broker feature.

RabbitMQ - Single concurrent worker per routing key

I'm quite new to RabbitMQ, and I'm trying to see if I can achieve what I need with it.
I am looking for the Worker Queues pattern but with one caveat. I want to have only a single worker running concurrently per routing key.
An example for clarification:
If I send the following messages with routing keys, in order: a, a, b, c, I want to have only 3 workers running concurrently. When the first a message is received, a worker picks it up and handles it.
When the next a message is received and the previous a message is still being handled (not acknowledged), the new a message should wait in the queue. When the b and c messages are received, they each get a worker handling them. When the first a message is acknowledged, any worker can pick up the next a message.
Would that pattern be possible using RabbitMQ in a natural way (without writing any application code on my side to handle the locking and stuff)?
Edit:
Another clarification: all workers can and should handle all messages, and I don't want a queue per worker, since I want to share the load between them and the publisher doesn't know which worker should process a message. But I do want to make sure that no 2 workers are working on messages sharing the same key at the same time.
For example, if I have a Publisher publishing messages with a userId field, I want to make sure no 2 Workers are handling messages with the same userId at the same time.
Edit 2
Expanding on the userId example. Let's say I have a single Publisher and 3 Workers. The publisher publishes messages like these: { userId: 1, text: 'Hello' }, with varying userIds. My 3 Workers all do the same thing to these messages, so I can have any of them handle the messages coming in. But what I'm trying to achieve is to have only a single worker processing messages from a certain user at any one time. If a Worker has received a message with userId 1 and is still processing it, and another message with userId 1 arrives, I want to make sure no other Worker picks up that message. But messages coming in with different userIds should be processed by other available Workers.
userIds are not known beforehand, and the publisher doesn't know how many workers there are or anything specific about them; it just wants to schedule the messages for processing.
What you're asking is not possible with routing keys, but it is built into queues with a few settings.
If you define "queue_a" for a messages, "queue_b" for b messages, etc., you can then have as many consumers connect to them as you want.
RabbitMQ will only deliver a given message to a single consumer of a given queue.
The way it works with multiple consumers on a single queue is basic round-robin dispatch of the messages. That is, the first message will be delivered to one of the consumers, and the next message (assuming the first consumer is still busy) will be delivered to the next consumer.
So, that should satisfy the need to deliver the message to any given consumer of the queue.
To ensure your messages have an equal chance of getting to any of the consumers (and are not all delivered to the same consumer all the time), there are two other settings you should put in place.
First, make sure to set the consumer's "no ack" setting (sometimes called "auto ack") to false. This will force you to ack the message from your code.
Second, set the "consumer prefetch" limit of the consumer to 1.
With this combination of settings, a single consumer will retrieve a single message and begin working on it. While that consumer is working, any message waiting in the queue will be delivered to other consumers if any are available. If there are none available, the message will wait in the queue until a consumer is available.
With this, you should be able to achieve the behavior you want on a given queue.
...
Keep in mind this only applies to queues, though. Routing keys cannot be managed this way; every matching routing key on an exchange causes a copy of the message to be sent to the destination queue.
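To make those settings concrete, here is a minimal sketch of such a worker using the RabbitMQ Java client (the queue name follows the example above; the host and queue flags are assumptions):

import java.nio.charset.StandardCharsets;
import com.rabbitmq.client.*;

public class Worker {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumption: local broker
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // One queue per key, as in the answer: "queue_a" for a messages.
        channel.queueDeclare("queue_a", true, false, false, null);

        // Prefetch of 1: the broker gives this consumer at most one
        // unacknowledged message at a time.
        channel.basicQos(1);

        DeliverCallback onDeliver = (consumerTag, delivery) -> {
            String body = new String(delivery.getBody(), StandardCharsets.UTF_8);
            System.out.println("handling: " + body);
            // Do the work, then ack manually so the broker can dispatch the next message.
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };

        // autoAck=false: acknowledgements must come from application code.
        channel.basicConsume("queue_a", false, onDeliver, consumerTag -> {});
    }
}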

Distributed pub/sub with single consumer per message type

I have no clue if it's better to ask this here, or over on Programmers.SE, so if I have this wrong, please migrate.
First, a bit about what I'm trying to implement. I have a node.js application that takes messages from one source (a socket.io client), and then does processing on the message, which might result in zero or more messages back out, either to the sender, or other clients within that group.
For the processing, I would like to essentially just shove the message into a queue, then it works its way through various message processors that might kick off their own items, and eventually, the bit running socket.io is informed "Hey, send this message back"
As a concrete example, say a user signs into the service. That sign-in message is placed in the queue, where the authorization processor gets it, does its thing, then places a message back in the queue saying the client has been authorized. This goes back to the socket.io socket that is connected to the client, along with other clients that might be interested. It can also go to other subsystems that might want to do more processing on authorization (looking up user info, sending more info to the client based on their data, etc).
If I wanted strong coupling, this would be easy, but I tried that before, and it just turns into a mess of spaghetti code that's very fragile, and I would like to avoid that. Another wrench in the setup is that this should be cluster-able, which is where the real problem comes in. There might be more than one, say, authorization processor running, but the authorization message should be processed only once.
So, in short, I'm looking for a pattern/technique that will allow me to, essentially, have multiple "groups" of subscribers for a message, and the message will be processed only once per group.
I thought about maybe having each instance of a processor generate a unique name that would be used as a list in Redis. This name would then be registered with some sort of dispatch handler and placed into a set for that group of subscribers. Then, when a message arrives, the dispatcher pulls a random member out of that set and places the message into that member's list. While it seems like this would work, it seems somewhat over-complicated and fragile.
The core problem is I've never designed a system like this, so I'm not even sure the proper terms to use or look up. If anyone can point me in the right direction for this, I would be most appreciative.
I think what you're describing is similar to the https://www.getbridge.com/ service. I tried it but ended up writing my own based on zeromq; it allows you to register services, request/reply (req -> <- rep), and channels, which are pub/sub workers.
As for the design, I used client -> broker -> services & channels, which are all plug-and-play using auto-discovery: the services register their schema with the brokers, which open a TCP connection so that brokers on other servers can communicate with that broker group's services. Internal services and clients then connect via unix sockets or IPC channels, whichever is preferred.
I ended up wrapping the redis publish/subscribe functions a bit to do this. Each type of message processor gets a "group name", and there can be multiple instances of a processor within that group (so multiple instances of the program can run for clustering).
When publishing a message, I generate an incremental ID, store the message in a string key under that ID, then publish the message ID.
On the receiving end, the first thing the subscriber does is attempt to add the message ID it just got from the publisher into a set of received messages for that group, using sadd. If sadd returns 0, the message has already been grabbed by another instance, and the subscriber just returns. If it returns 1, the full message is pulled out of the string key and sent to the listener.
Of course, this relies on redis being single threaded, which I imagine will continue to be the case.
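Here is a rough Java sketch of that pattern using the Jedis client (the key, channel, and group names are made up), with the publisher side included for context:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

public class GroupedPubSub {
    static final String CHANNEL = "messages";       // hypothetical channel name
    static final String GROUP = "auth-processors";  // hypothetical group name

    // Publisher side: allocate an incremental ID, store the body under a
    // string key, then publish only the ID.
    static void publish(Jedis jedis, String payload) {
        long id = jedis.incr("message:next-id");
        jedis.set("message:" + id, payload);
        jedis.publish(CHANNEL, Long.toString(id));
    }

    public static void main(String[] args) {
        // A subscribed connection cannot issue other commands, so onMessage
        // opens its own connection for the SADD/GET calls.
        new Jedis("localhost").subscribe(new JedisPubSub() {
            @Override
            public void onMessage(String channel, String messageId) {
                try (Jedis jedis = new Jedis("localhost")) {
                    // sadd returns 1 only for the first instance in the group
                    // to add this ID; everyone else sees 0 and skips it.
                    if (jedis.sadd("received:" + GROUP, messageId) == 0) {
                        return;
                    }
                    String body = jedis.get("message:" + messageId);
                    System.out.println("processing " + body);
                }
            }
        }, CHANNEL);
    }
}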
What you might be looking for is an AMQP protocol implementation, where you can have queues bound to custom exchanges and implement a pub-sub model.
RabbitMQ is a popular AMQP implementation with lots of client libraries; it also has a node.js library.
