Thread Sleep in the Kafka Listener - multithreading

I am trying to pause/resume the Kafka container. Using the following code snippet to do so:
kafkaListenerEndpointRegistry.getListenerContainer("MAIN").pause();
When I call pause(), I also need to call Thread.sleep() so that the remaining messages in the batch are not processed. For every message in the batch, I call another API that has a rate limit. To stay within this rate limit, I need to stop processing messages.
If the main thread sleeps, will it stop the listener from sending the heartbeat? Does it also stop the heartbeat thread in the background?
The documentation says: "When a container is paused, it continues to poll() the consumer, avoiding a rebalance if group management is being used, but it does not retrieve any records."
But I am pausing the container and making the thread sleep. How will this impact the flow?

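You must never sleep the consumer thread. The heartbeat is sent from a separate background thread and is not stopped by the sleep, but if the consumer thread does not return to poll() within max.poll.interval.ms, the consumer is removed from the group and a rebalance occurs.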
Instead, reduce the max.poll.records so the pause will take effect more quickly (the consumer won't actually pause until the records received by the previous poll are processed).
You can throw an exception after pausing the consumer, but you will need to resume the container somehow.
I opened a new issue to improve this behavior: https://github.com/spring-projects/spring-kafka/issues/2280
If you are subject to rate limits, consider using KafkaTemplate.receive() methods, on a schedule, or a polled Spring Integration adapter, instead of using a message-driven approach.
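The scheduled, pull-based approach can be sketched without any Kafka dependency. The following is only an illustration under stated assumptions: ScheduledPuller is a hypothetical class, the in-memory Queue stands in for the topic, and the handler stands in for the rate-limited API call. It is not Spring Kafka API. The point is that the rate limit is enforced by how many records each tick pulls, so no thread ever sleeps while holding unprocessed records.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical stand-in for a scheduled, pull-based consumer (not Spring Kafka API).
class ScheduledPuller {
    private final Queue<String> source;      // assumption: stands in for the topic
    private final int maxPerTick;            // rate limit: records handled per tick
    private final Consumer<String> handler;  // assumption: the rate-limited API call

    ScheduledPuller(Queue<String> source, int maxPerTick, Consumer<String> handler) {
        this.source = source;
        this.maxPerTick = maxPerTick;
        this.handler = handler;
    }

    // Pulls at most maxPerTick records and returns how many were handled.
    int pullOnce() {
        int handled = 0;
        while (handled < maxPerTick) {
            String record = source.poll();
            if (record == null) {
                break; // nothing left right now
            }
            handler.accept(record);
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) throws InterruptedException {
        Queue<String> topic = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 5; i++) {
            topic.add("msg-" + i);
        }
        ScheduledPuller puller =
                new ScheduledPuller(topic, 2, r -> System.out.println("handled " + r));
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // At most 2 records every 100 ms.
        scheduler.scheduleAtFixedRate(puller::pullOnce, 0, 100, TimeUnit.MILLISECONDS);
        Thread.sleep(350);
        scheduler.shutdownNow();
    }
}
```

With a real broker, pullOnce() would fetch records on demand instead of reading from the in-memory queue; the tick interval and maxPerTick together define the call rate against the downstream API.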

Related

How to throttle my cron worker from pushing messages to RabbitMQ?

Context:
We have a microservice that consumes (subscribes to) messages from 50+ RabbitMQ queues.
Messages for these queues are produced in two places:
When the application encounters business logic with a short delay (such as sending emails or notifying another service), it sends the message directly to the exchange (which in turn routes it to the queue).
When we encounter business logic with a long delay, we have a messages table whose entries are messages that have to be executed after some time.
A cron worker runs every 10 minutes, scans the messages table, and pushes the messages to RabbitMQ.
Scenario:
Let's say the messages table has 10,000 messages which will be queued in next cron run,
9.00 AM - Cron worker runs and it queues 10,000 messages to RabbitMQ queue.
We do have subscribers that listen to the queue and start consuming the messages, but due to some issue in the system or a slow third-party response, each message takes 1 minute to complete.
9.10 AM - The cron worker runs again after 10 minutes, sees that 9,000+ messages are not yet completed and that their scheduled time has passed, and so pushes 9,000+ duplicate messages to the queue.
Note: The subscribers which consumes the messages are idempotent, so there is no issue in duplicate processing
Design Idea I had in my mind but not best logic
I can have 4 status ( RequiresQueuing, Queued, Completed, Failed )
Whenever a message is inserted i can set the status to RequiresQueuing
Next when cron worker picks and pushes the messages successfully to Queue i can set it to Queued
When a subscriber finishes, it marks the message status as Completed / Failed.
There is an issue with the above logic: let's say RabbitMQ somehow goes down, or in some cases we have to purge the queue for maintenance.
Now the messages marked as Queued are in the wrong state, because they have to be identified again and their status changed manually.
Another Example
Let say I have RabbitMQ Queue named ( events )
This events queue has 5 subscribers, each subscribers gets 1 message from the queue and post this event using REST API to another micro service ( event-aggregator ). Each API Call usually takes 50ms.
Use Case:
Due to high load, the number of events produced becomes 3x.
Also, the microservice (event-aggregator) that accepts the events has become slow, and its response time has increased from 50ms to 1 minute.
The cron worker follows your design mentioned above and queues messages every minute. Now the queue is becoming too large, but I also cannot increase the number of subscribers because the dependent microservice (event-aggregator) is lagging too.
Now the question is: if I keep sending messages to the events queue, it just bloats the queue.
https://www.rabbitmq.com/memory.html - While reading this page, i found out that rabbitmq won't even accept the connection if it reaches high watermark fraction (default is 40%). Of course this can be changed, but this requires manual intervention.
So if the queue length increases it affects the rabbitmq memory, that is reason i thought of throttling at producer level.
Questions
How can I throttle my cron worker so that it skips a particular run, or inspects the queue and, if it is already heavily loaded, does not push the messages?
How can I handle the use cases described above? Is there a design that solves my problem? Has anyone faced the same issue?
Thanks in advance.
Answer
Check the comments on the accepted answer for throttling using queueCount.
You can combine QoS (quality of service) and manual ACK to get around this problem.
Your exact scenario is documented in https://www.rabbitmq.com/tutorials/tutorial-two-python.html. That example is in Python, but you can refer to the examples for other languages as well.
Let's say you have 1 publisher and 5 worker scripts, all reading from the same queue. Each worker script takes 1 minute to process a message. You can set QoS at the channel level. If you set it to 1, each worker script is allocated only 1 message at a time, so 5 messages are processed at a time. No new message is delivered to a worker until that worker sends a manual ACK.
If you want to increase the throughput of message processing, you can increase the number of worker nodes.
The idea of updating the tables based on message status is not a good option: avoiding DB polling is the main reason systems use queues, and it would cause a scaling issue. At some point you have to update the tables, and you would hit a bottleneck because of locking and isolation levels.
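The producer-side throttling by queue count mentioned at the top of this answer can be sketched as a small decision function. This is a minimal illustration, not the accepted answer's code: PublishThrottle, allowedBatch, and the high-water mark of 10,000 are hypothetical names and values. How the current depth is obtained is also an assumption; with the RabbitMQ Java client it might come from something like Channel.messageCount(queue), which should be verified against the client version in use.

```java
// Hypothetical helper: decides how many rows a cron run may publish,
// based on how many messages are already waiting in the queue.
class PublishThrottle {
    private final long highWaterMark; // assumption: max messages allowed to sit in the queue

    PublishThrottle(long highWaterMark) {
        if (highWaterMark < 0) {
            throw new IllegalArgumentException("highWaterMark must be >= 0");
        }
        this.highWaterMark = highWaterMark;
    }

    // Returns how many of the pending rows this run may publish.
    // Zero means "skip this run": the queue is already at or above the mark.
    long allowedBatch(long currentDepth, long pendingRows) {
        long headroom = Math.max(0, highWaterMark - currentDepth);
        return Math.min(headroom, pendingRows);
    }

    public static void main(String[] args) {
        PublishThrottle throttle = new PublishThrottle(10_000);
        // 9,000 still queued from the previous run: only top up the remaining headroom.
        System.out.println(throttle.allowedBatch(9_000, 10_000)); // prints 1000
        // Queue is over the mark: skip this run entirely.
        System.out.println(throttle.allowedBatch(12_000, 500));   // prints 0
    }
}
```

This keeps the queue bounded without tracking per-message status in the table; the consumers' QoS setting still governs how fast the queue drains.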

Amazon SQS better way of handling listeners

I have an SQS Queue which has a lot of messages (typically in thousands). Presently I am having multiple listeners (which are created by threads created from the same source) and each listener listens to the queue and receives messages. As soon as a listener receives a message from the Queue, that listener deletes the message from the Queue. The message will be processed only after deleting the message from the queue. I am having a visibility timeout of 30 seconds.
I am not using any locks or anything else to handle duplicates, since I delete each message from the queue as soon as I receive it. I haven't seen a duplicate so far, but I am worried that it might happen.
Now, the question is, which is a better way, having multiple listeners this way or listening to the queue in a single thread, and then spinning up new threads to process each message you receive?
Firstly, it is worth understanding the concept of message invisibility timeout.
When a message is retrieved from an Amazon SQS queue (eg by your thread), the message is marked as invisible in Amazon SQS. Best-practice is for your thread to then process the message and then delete the message after it has completed processing the message. This way, if the thread fails, the message will automatically become visible on the queue again and another thread can process it.
With your current application design, if a thread fails then the message is lost and will not be retried. You should consider changing your code to delete the message only after it has been processed.
Using multiple threads to process messages is recommended, because it will allow higher message throughput by processing messages in parallel. It is also a simpler design, and simple is always best. Your alternate idea of having one process retrieve messages and then firing off threads to process the message is more complex and does not provide any benefits.
Amazon SQS queues can occasionally return the same message more than once. It is rare, but can happen. The multiple-thread design will probably result in it happening more often than the single-thread design, because multiple threads might simultaneously retrieve the same message. However, it could still happen in the single-thread model, too.
If processing the same message twice is a concern, then consider using a FIFO queue (not currently available in every AWS Region). This will guarantee that every message is received only once. Alternatively, your code would need to check whether a particular message has already been processed (eg by checking in a database).
The multiple-thread design will also allow you to scale horizontally by having multiple systems (even across multiple Availability Zones) process messages, whereas your single-thread design has a single point of failure and is less scalable.
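The two recommendations above (delete only after processing, and check whether a message was already processed) can be sketched together. This is an in-memory illustration only: IdempotentWorker and handle are hypothetical names, the map stands in for the database check the answer mentions, and the boolean return value stands in for the decision to call delete on the queue. No AWS SDK calls are shown.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Consumer;

// Hypothetical worker: processes each message id at most once and tells the
// caller whether the message may now be deleted from the queue.
class IdempotentWorker {
    // Assumption: in-memory stand-in for a database table.
    // FALSE = processing in progress, TRUE = processing finished.
    private final ConcurrentMap<String, Boolean> state = new ConcurrentHashMap<>();

    // Returns true when the caller should delete the message.
    boolean handle(String messageId, String body, Consumer<String> processor) {
        Boolean previous = state.putIfAbsent(messageId, Boolean.FALSE);
        if (previous != null) {
            // Duplicate delivery: delete it only if the first copy already finished.
            return previous;
        }
        try {
            processor.accept(body);
            state.put(messageId, Boolean.TRUE);
            return true; // processed: safe to delete now
        } catch (RuntimeException e) {
            state.remove(messageId); // allow a retry after the visibility timeout
            return false;            // do not delete: the message becomes visible again
        }
    }

    public static void main(String[] args) {
        IdempotentWorker worker = new IdempotentWorker();
        Consumer<String> processor = body -> System.out.println("processing " + body);
        System.out.println(worker.handle("id-1", "hello", processor)); // processes, then prints true
        System.out.println(worker.handle("id-1", "hello", processor)); // duplicate: prints true only
    }
}
```

Because a failed message is never deleted, it reappears once the visibility timeout expires and another thread can retry it.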

Understanding Timeout In Partitioned Batch Jobs

I am trying to understand the ways timeouts can be specified for partitioned steps:
1. JMS outbound-gateway receive-timeout
2. JMS outbound-gateway reply-timeout
3. JMS outbound-gateway replyListener receive-timeout
4. partition handler messagingOperations receive-timeout
I want to be able to timeout when a step takes too long and clean up. By looking at the stack trace, the reply listener does not go away after partition ends (and may receive a late responding message after job has completed).
1. The time the executor thread will wait in the gateway for a reply to arrive (partition to complete) before giving up.
2. A timeout when writing to the reply-channel - in general will only apply if the send can block - such as when the reply channel is a bounded queue channel that is full.
3. When using a reply listener, the container polls the JMS client for messages, this timeout is simply how long the thread blocks in the client waiting for a reply before looping around and waiting again - it has no bearing on messages timing out; it only really affects how quickly the container will respond to a stop().
4. The time the partition handler will wait for all partitions to complete (unless pollRepositoryForResults is true in which case, the handler's timeout property represents that and the receive timeout is not used).
So it sounds like #4 is what you want.

Concurrent message processing in RabbitMQ consumer

I am new to RabbitMQ, so please excuse me if my question sounds trivial. I want to publish messages to RabbitMQ which will be processed by a RabbitMQ consumer.
My consumer machine is a multi-core machine (preferably a worker role on Azure), but QueueingBasicConsumer pushes one message at a time. How can I write my program to utilize all cores, so that I can process multiple messages concurrently?
One solution could be to open multiple channels in multiple threads and process the messages there. But in this case, how would I decide the number of threads?
Another approach could be to read messages on the main thread, then create a task and pass the message to this task. In this case I would have to stop consuming messages once many messages (above a threshold) are already in progress. I am not sure how this could be implemented.
Thanks In Advance
Your second option sounds much more reasonable - consume on a single channel and spawn multiple tasks to handle the messages. To implement concurrency control, you could use a semaphore to control the number of tasks in flight. Before starting a task, you would wait for the semaphore to become available, and after a task has finished, it would signal the semaphore to allow other tasks to run.
You haven't specified your language/technology stack of choice, but whatever you do, try to utilise a thread pool instead of creating and managing threads yourself. In .NET, that would mean using Task.Run to process messages asynchronously.
Example C# code:
using (var semaphore = new SemaphoreSlim(MaxMessages))
{
    while (true)
    {
        // Block until the next delivery arrives.
        var args = (BasicDeliverEventArgs)consumer.Queue.Dequeue();
        // Wait for a free slot so that at most MaxMessages tasks run at once.
        semaphore.Wait();
        Task.Run(() => ProcessMessage(args))
            .ContinueWith(_ => semaphore.Release());
    }
}
Instead of controlling the concurrency level yourself, you might find it easier to enable explicit ACK control on the channel, and use RabbitMQ Consumer Prefetch to set the maximum number of unacknowledged messages. This way, you will never receive more messages than you wanted at once.
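For readers on the JVM, the semaphore approach described earlier in this answer might look roughly like the following. It is a sketch under assumptions, not RabbitMQ client code: BoundedDispatcher is a hypothetical class, messages are plain strings, and the consuming loop that calls dispatch() is left out. The in-flight counters exist only to make the concurrency limit observable.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical dispatcher: one consuming thread hands messages to a pool,
// and a semaphore caps how many are processed at the same time.
class BoundedDispatcher {
    private final Semaphore permits;
    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    BoundedDispatcher(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Blocks the consuming thread until a permit is free, then submits the message.
    void dispatch(String message, Consumer<String> handler) throws InterruptedException {
        permits.acquire();
        try {
            pool.execute(() -> {
                int now = inFlight.incrementAndGet();
                peak.accumulateAndGet(now, Math::max); // record the highest concurrency seen
                try {
                    handler.accept(message);
                } finally {
                    inFlight.decrementAndGet();
                    permits.release(); // always free the slot, even if the handler throws
                }
            });
        } catch (RuntimeException e) {
            permits.release(); // the task was rejected, so give the permit back
            throw e;
        }
    }

    // Stops accepting work and waits for running tasks; returns false on timeout.
    boolean shutdownAndWait(long timeout, TimeUnit unit) throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(timeout, unit);
    }

    int peakInFlight() {
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        BoundedDispatcher dispatcher = new BoundedDispatcher(4);
        for (int i = 0; i < 10; i++) {
            dispatcher.dispatch("msg-" + i, m -> System.out.println("handled " + m));
        }
        dispatcher.shutdownAndWait(5, TimeUnit.SECONDS);
        System.out.println("peak in flight: " + dispatcher.peakInFlight());
    }
}
```

Releasing the permit in a finally block matters: a handler that throws would otherwise leak a slot and gradually stall the consumer.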

Azure ServiceBus Retry Delay

I am using Microsoft Azure ServiceBus for queue messages, with WCF for the subscriptions. I am trying to implement retry logic. I use Peek/Lock to view the message and then have to do some local processing on the message. If that processing fails, I unlock the message so I can try processing it again. The problem is that I need to build in a delay between processing attempts. Currently the message is popped back into the queue and is then processed almost immediately. There needs to be about 2 minutes between attempts.
If you always have to wait 2 minutes before re-processing the message of that particular queue, you could try to configure the lock-timeout on the queue to be 2 minutes (plus the time you expect it will take you to process the message) and then just let the lock expire, instead of unlocking it. This has the downside that you would need to keep an eye on your processing time, and extend the lock's timeout if needed.
Another option could be to receive and complete the message, set a scheduled delivery of 2 minutes into the future, and resend the message. This has the downside that you need to consume it and ack it, which involves certain risks (e.g. your process dies before you get a chance to resend the message).
"If the message is Peeked in Peek Lock mode from a Queue then you don't have the receive context in the message. You can receive the message in Peek Lock mode, which will lock the message for the interval specified by the 'lock duration' property of the queue. A locked message cannot be received again until its lock expires. Thus, setting the lock duration to 2 minutes and receiving messages in Peek Lock mode will solve this issue.
You can either write custom code to update the Lock Duration property, or use tools like Service Bus Explorer, Serverless360 etc., which provide options to update the property through a graphical user interface."
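The "resend with a scheduled delivery time" option can be illustrated with an in-memory analogue. This sketch does not use the Service Bus SDK: DelayedRetry is a hypothetical class, and java.util.concurrent.DelayQueue stands in for the broker's scheduled delivery. The 200 ms delay in the demo is a placeholder for the 2 minutes in the question.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical in-memory stand-in for a message that is resent with a
// scheduled delivery time: it cannot be taken from the queue until it is due.
class DelayedRetry implements Delayed {
    final String body;
    private final long readyAtNanos; // monotonic time at which the retry becomes visible

    DelayedRetry(String body, long delay, TimeUnit unit) {
        this.body = body;
        this.readyAtNanos = System.nanoTime() + unit.toNanos(delay);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
    }

    public static void main(String[] args) throws InterruptedException {
        DelayQueue<DelayedRetry> retries = new DelayQueue<>();
        // Processing failed: schedule the retry instead of unlocking immediately.
        retries.put(new DelayedRetry("order-42", 200, TimeUnit.MILLISECONDS));
        System.out.println(retries.poll());      // prints null: the retry is not due yet
        System.out.println(retries.take().body); // blocks until due, then prints order-42
    }
}
```

Unlike a broker-side schedule, this in-memory queue loses pending retries if the process dies, which is the same class of risk the answer describes for the complete-and-resend approach.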

Resources