Azure Service Bus Queue Performance - azure

I am using an Azure Service Bus queue for one of my requirements. The requirement is simple: an Azure Function acts as an API and creates multiple jobs in the queue. The function scales out on demand, so new instances are created as needed. The jobs it creates are processed by a Windows service, so the sender is the Azure Function and the receiver is the Windows service. Because the function scales out, several instances run in parallel, so jobs are added to the queue in parallel, probably one job every 500 ms. The Windows service is a single instance: a queue listener that listens to this queue and processes jobs in parallel. So there may be many senders, but only one receiver. The number of jobs running in parallel must be limited (to 4, since each job takes a long time and a lot of CPU). Right now I am using an Azure Service Bus queue with the configuration below. My question is which configuration gives the best performance for this particular requirement.
Deleting the job from the queue will not be an issue for me. So, can I use Delete instead of Peek-Lock?
Also, right now the listener does not receive the items in order. I want to maintain the order in which they were created. My requirement is maximum performance. The job done by the Windows service is a CPU-intensive task, which is why I have limited it to 4, since the system has 4 cores.
Max delivery count: 4, Message lock duration: 5 min, MaxConcurrentCalls: 4 (in the listener). I am new to Service Bus and need a suggestion for this.
One more question: suppose the listener gets 4 jobs in parallel and starts executing them. One job finishes and is marked completed. Will the listener pick up the next item immediately, or wait for all 4 jobs to complete (MaxConcurrentCalls: 4)?

Deleting the job from the queue will not be an issue for me. So, can I use Delete instead of Peek-Lock?
Receiving messages in PeekLock receive mode is less performant than ReceiveAndDelete. With ReceiveAndDelete you save the roundtrips to the broker that are needed to complete messages.
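To make the difference concrete, here is a minimal in-memory sketch of the two receive modes. It is a toy model, not the Service Bus SDK: `InMemoryQueue`, `crash_after_receive`, and the `round_trips` counter are hypothetical names made up for illustration. It shows the extra settle call that PeekLock needs, and the other side of the trade: in ReceiveAndDelete a message that was handed out is gone even if the receiver crashes before finishing the work.

```python
from collections import deque


class InMemoryQueue:
    """Toy queue that mimics the two receive modes (not the Service Bus SDK)."""

    def __init__(self, messages):
        self._ready = deque(messages)
        self._locked = {}      # lock token -> message handed out but not settled
        self._next_token = 0
        self.round_trips = 0   # calls that would go to the broker

    def receive_and_delete(self):
        """One round trip: the message is removed as soon as it is handed out."""
        self.round_trips += 1
        return self._ready.popleft() if self._ready else None

    def peek_lock(self):
        """First round trip: hand the message out but keep it, locked."""
        self.round_trips += 1
        if not self._ready:
            return None
        self._next_token += 1
        message = self._ready.popleft()
        self._locked[self._next_token] = message
        return self._next_token, message

    def complete(self, token):
        """Second round trip: settle the message so it is removed for good."""
        self.round_trips += 1
        del self._locked[token]

    def expire_locks(self):
        """Simulate lock expiry: unsettled messages become available again."""
        for token in sorted(self._locked):
            self._ready.append(self._locked[token])
        self._locked.clear()

    def __len__(self):
        return len(self._ready) + len(self._locked)


def crash_after_receive(queue, mode):
    """Receive one message, 'crash' before the work is done, return what is left."""
    if mode == "receive_and_delete":
        queue.receive_and_delete()   # message is already gone
    else:
        queue.peek_lock()            # never completed
        queue.expire_locks()         # lock runs out, message comes back
    return len(queue)
```

With two queued jobs, `crash_after_receive` leaves one job behind in ReceiveAndDelete mode and both jobs in PeekLock mode. Whether that loss is acceptable depends on the workload, which is why the answer only recommends ReceiveAndDelete once losing a job is not a concern.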
Max delivery count: 4, Message lock duration: 5 min, MaxConcurrentCalls: 4 (in the listener). I am new to Service Bus and need a suggestion for this.
MaxDeliveryCount is how many times delivery of a message can be attempted before it's dead-lettered. Your value happens to equal the number of cores, but the two settings are unrelated and shouldn't be tied together. Could be just a coincidence.
MessageLockDuration will only matter if you use PeekLock receive mode. For ReceiveAndDelete it won't matter.
As for Concurrency, even though your work is CPU bound, I'd benchmark if higher concurrency would be possible.
An additional parameter on the message receiver to look into would be PrefetchCount. It can improve the overall performance by making fewer roundtrips to the broker.
One more question: suppose the listener gets 4 jobs in parallel and starts executing them. One job finishes and is marked completed. Will the listener pick up the next item immediately, or wait for all 4 jobs to complete (MaxConcurrentCalls: 4)?
The listener will immediately start processing the 5th message, since your concurrency is set to 4 and processing of one message has completed.
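The behaviour can be illustrated with a small model of a concurrency cap. This is not the Service Bus message pump itself, only a sketch using an `asyncio.Semaphore` with 4 slots; `run_jobs` is a hypothetical helper and `asyncio.sleep` stands in for the real CPU-bound work.

```python
import asyncio


async def run_jobs(durations, max_concurrent=4):
    """Run jobs under a concurrency cap and record start/done events in order."""
    events = []
    slots = asyncio.Semaphore(max_concurrent)

    async def handle(job_id, duration):
        async with slots:                  # wait for one of the free slots
            events.append(("start", job_id))
            await asyncio.sleep(duration)  # placeholder for the actual job
            events.append(("done", job_id))

    await asyncio.gather(
        *(handle(job_id, d) for job_id, d in enumerate(durations, start=1))
    )
    return events
```

Calling `asyncio.run(run_jobs([0.05, 0.3, 0.3, 0.3, 0.05]))` starts jobs 1 to 4 at once. Job 5 starts as soon as job 1 is done, while jobs 2 to 4 are still running, so a free slot is refilled immediately rather than after the whole batch of 4.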
Also, right now the listener does not receive the items in order. I want to maintain the order in which they were created.
To process messages in the order they were sent, you will need to send and receive messages using sessions.
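As a rough model of what sessions give you: messages that carry the same session id form their own FIFO stream, and one receiver holds that stream and works through it in order. The sketch below is plain Python, not the Service Bus SDK; `group_by_session`, `drain_session`, and the `tenant-a` / `tenant-b` ids are made-up names for illustration.

```python
from collections import OrderedDict, deque


def group_by_session(messages):
    """Split (session_id, body) pairs into per-session FIFO queues, keeping arrival order."""
    sessions = OrderedDict()
    for session_id, body in messages:
        sessions.setdefault(session_id, deque()).append(body)
    return sessions


def drain_session(sessions, session_id):
    """One receiver holds the session and handles its messages one at a time."""
    processed = []
    pending = sessions[session_id]
    while pending:
        processed.append(pending.popleft())
    return processed


# Interleaved arrivals from two hypothetical senders.
arrivals = [
    ("tenant-a", "a1"),
    ("tenant-b", "b1"),
    ("tenant-a", "a2"),
    ("tenant-a", "a3"),
    ("tenant-b", "b2"),
]
```

`drain_session(group_by_session(arrivals), "tenant-a")` yields `a1`, `a2`, `a3` in that order. Note the trade-off this implies: ordering holds within a session, so putting every job into a single session would give a global order but would also serialize the work instead of using all 4 concurrent slots.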
My requirement is maximum performance. The job done by the Windows service is a CPU-intensive task, which is why I have limited it to 4, since the system has 4 cores.
There are multiple things to take into consideration. The location of your Windows service would impact the latency and message throughput. Scaling out could help, etc.

Related

How to throttle my cron worker from pushing messages to RabbitMQ?

Context:
We have a microservice which consumes (subscribes to) messages from 50+ RabbitMQ queues.
Messages for these queues are produced in two places:
When the application process encounters business logic with a short delay (like sending emails or notifying another service), it sends the message directly to the exchange (which in turn routes it to the queue).
When we encounter long or delayed execution business logic, we have a messages table with entries for messages that have to be executed after some time.
We have a cron worker which runs every 10 minutes, scans the messages table, and pushes the messages to RabbitMQ.
Scenario:
Let's say the messages table has 10,000 messages which will be queued in the next cron run.
9.00 AM - The cron worker runs and queues 10,000 messages to the RabbitMQ queue.
We do have subscribers listening to the queue, and they start consuming the messages, but due to some issue in the system or a slow 3rd-party response, each message takes 1 minute to complete.
9.10 AM - The cron worker runs again after 10 minutes and sees 9000+ messages that are not yet completed and whose scheduled time has passed, so it once again pushes 9000+ duplicate messages to the queue.
Note: The subscribers which consume the messages are idempotent, so duplicate processing is not an issue.
Design idea I had in mind, though not the best logic
I can have 4 statuses (RequiresQueuing, Queued, Completed, Failed).
Whenever a message is inserted, I can set the status to RequiresQueuing.
Next, when the cron worker picks up the messages and pushes them successfully to the queue, I can set it to Queued.
When a subscriber completes a message, it marks the status as Completed / Failed.
There is an issue with the above logic: let's say RabbitMQ somehow goes down, or in some case we have to purge the queue for maintenance.
Now the messages marked as Queued are in the wrong state, because they have to be identified again and their status has to be changed manually.
Another Example
Let's say I have a RabbitMQ queue named ( events ).
This events queue has 5 subscribers; each subscriber gets 1 message from the queue and posts the event via a REST API to another microservice ( event-aggregator ). Each API call usually takes 50 ms.
Use Case:
Due to high load, the number of events produced becomes 3x.
Also, the microservice ( event-aggregator ) which accepts the events has become slow in processing; the response time increased from 50 ms to 1 minute.
The cron worker follows the design mentioned above and queues messages every minute. Now the queue is becoming too large, but I also cannot increase the number of subscribers because the dependent microservice ( event-aggregator ) is lagging.
Now the question is: if I keep sending messages to the events queue, it just bloats the queue.
https://www.rabbitmq.com/memory.html - While reading this page, I found out that RabbitMQ won't even accept the connection if it reaches the high watermark fraction (default is 40%). Of course this can be changed, but that requires manual intervention.
So as the queue length increases it affects RabbitMQ's memory, which is the reason I thought of throttling at the producer level.
Questions
How can I throttle my cron worker to skip that particular run, or somehow inspect the queue and detect that it is already heavily loaded so it doesn't push the messages?
How can I handle the use cases described above? Is there a design which solves my problem? Has anyone faced the same issue?
Thanks in advance.
Answer
Check the comments on the accepted answer for throttling using queueCount.
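A minimal sketch of throttling by queue count, assuming the cron worker can read the current queue depth before it publishes (with RabbitMQ that could be, for example, the message count reported by a passive queue declare). The names `select_batch`, `cron_run`, `get_queue_depth`, `publish`, and the default cap of 1000 are all hypothetical and only there for illustration.

```python
def select_batch(pending, queue_depth, max_queue_depth):
    """Return the pending messages this run may publish without exceeding the cap."""
    free_slots = max(0, max_queue_depth - queue_depth)
    return pending[:free_slots]


def cron_run(pending, get_queue_depth, publish, max_queue_depth=1000):
    """One cron run: check the queue depth first, then publish only what fits.

    get_queue_depth and publish are injected callables standing in for the
    real broker calls, so the throttling logic stays independent of the client.
    """
    batch = select_batch(pending, get_queue_depth(), max_queue_depth)
    for message in batch:
        publish(message)
    return len(batch)
```

If the queue already holds 9000 messages and the cap is 1000, the run publishes nothing and the rows simply stay pending until the next run. This avoids both the duplicate pushes and the extra status column that can go stale after a purge.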
You can combine QoS (Quality of Service) and manual ACK to get around this problem.
Your exact scenario is documented in https://www.rabbitmq.com/tutorials/tutorial-two-python.html. This example is for Python; you can refer to the other examples as well.
Let's say you have 1 publisher and 5 worker scripts, all reading from the same queue. Each worker script takes 1 minute to process a message. You can set QoS at the channel level. If you set it to 1, each worker script will be allocated only 1 message at a time, so 5 messages are processed at a time. No new messages will be delivered until one of the 5 worker scripts does a manual ACK.
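The delivery rule described above can be modelled in a few lines. This is a toy broker, not a RabbitMQ client: `Broker`, `register`, `dispatch`, and `ack` are hypothetical names. In the linked tutorial the same effect is obtained on a real channel with a prefetch count of 1 plus manual acknowledgements.

```python
from collections import deque


class Broker:
    """Toy broker that enforces a per-consumer prefetch limit (not a RabbitMQ client)."""

    def __init__(self, messages, prefetch=1):
        self.ready = deque(messages)
        self.prefetch = prefetch
        self.unacked = {}  # consumer name -> delivered but unacknowledged messages

    def register(self, consumer):
        self.unacked[consumer] = []

    def dispatch(self):
        """Deliver ready messages to every consumer that is below its prefetch limit."""
        delivered = []
        for consumer, inflight in self.unacked.items():
            while self.ready and len(inflight) < self.prefetch:
                message = self.ready.popleft()
                inflight.append(message)
                delivered.append((consumer, message))
        return delivered

    def ack(self, consumer, message):
        """Manual ack: frees one prefetch slot for that consumer."""
        self.unacked[consumer].remove(message)
```

With 10 messages, 5 registered workers, and `prefetch=1`, the first `dispatch()` hands out 5 messages and a second call hands out none. Only after a worker calls `ack` does the next `dispatch()` give that worker another message, which is the back-pressure the answer relies on.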
If you want to increase the throughput of message processing, you can increase the number of worker nodes.
Updating the tables based on message status is not a good option. Avoiding DB polling is the main reason systems use queues, and polling would cause a scaling issue. At some point you have to update the tables, and you would hit a bottleneck because of locking and isolation levels.

Is it possible to stop MassTransit service after a saga/consumer completes

So I know this runs a bit counter to MassTransit style, but I want to take advantage of some key features of MT such as message broker connection management, sagas, and scheduled messages.
However, I know the service will be rarely used. This is a fairly large data take from an API which has a throttle of 12,000 requests per hour. Once every 24 hours a saga will start to take data and move it into Data Lake. The service will run for some minutes until the throttle is hit, then start again where it left off (state) when enough time has passed, maybe something like 30 minutes later. The amount of data means this will repeat for several hours (2 to 4).
The fit for a saga and a scheduled message seems pretty good. But it would be better if the service did not have to incur operating costs for being awake 24x7. There will only ever be one request at a time for one set of API credentials. There may come a time when we might have multiple sets of credentials.
Is there a way to nicely close down the service when the saga completes?
As this is likely to be implemented with a container instance, I propose to start an instance from a queue-triggered function or similar.
Assuming that this is the approach you want to take (versus just an Azure Web Job, triggered by Azure Scheduler), there are a number of options:
Publish an event when the saga completes, consume that event, use Task.Run() or whatever to stop the bus.
Use a receive observer to keep track of in-flight messages and when it reaches zero and stays there for n seconds, stop the bus, exit the function.
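The second option boils down to a small piece of bookkeeping. MassTransit is a .NET library, so the sketch below is only a language-neutral model of that bookkeeping written in Python, not MassTransit code: `IdleTracker` and its methods are hypothetical names, and the calls to `message_started` / `message_finished` stand in for whatever the receive observer reports.

```python
import time


class IdleTracker:
    """Tracks in-flight messages and reports when the service has been idle long enough."""

    def __init__(self, idle_seconds, clock=time.monotonic):
        self.idle_seconds = idle_seconds
        self.clock = clock              # injectable so the logic can be tested
        self.in_flight = 0
        self.idle_since = clock()       # idle from the start until a message arrives

    def message_started(self):
        self.in_flight += 1
        self.idle_since = None          # busy: no idle period is running

    def message_finished(self):
        self.in_flight -= 1
        if self.in_flight == 0:
            self.idle_since = self.clock()   # idle period starts now

    def should_stop(self):
        """True once nothing has been in flight for at least idle_seconds."""
        if self.idle_since is None:
            return False
        return self.clock() - self.idle_since >= self.idle_seconds
```

A host loop would poll `should_stop()` and stop the bus when it returns true. One thing to keep in mind with this design: while the saga is waiting for its next scheduled message nothing is in flight, so n has to be chosen with that gap in mind, or the shutdown should key off the saga-completed event from the first option instead.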
Though I wonder why not just use a scheduled job via Azure, seems easier unless MassTransit is being used for more than just scheduling.

How do you scale Azure function app(background job) based on the number items pending in a database?

So suppose that you have an application that lets a user request a job. For example (hypothetical): a user uploads a video. An entry is made in the RDBMS with the URL of the video in blob storage, and the status is set to "Pending".
There is a recurring timer-triggered function app that runs every 10 seconds or so; it gets 10 pending jobs from the RDBMS and performs some compression etc.
The problem here is that as long as the number of requests stays at 10-30 videos per 10 seconds we should be fine. But if the number of requests increases all of a sudden, say to 200 requests per 10 seconds, there will be a lot of jobs pending and the user would have to wait 10 times longer than usual to see the status change. How do you scale out the function app automatically in such a scenario? Does it have to be manual?
There's an easier way to get fan out and parallel processing through multiple concurrently running Azure Functions.
Add an Azure Service Bus Queue to your solution.
For each video that needs to be processed, enqueue a service bus message with the appropriate data you'll need to retrieve and process the video (like the BlobId).
Have your Azure Function triggered by a ServiceBusTrigger.
Azure will spin up additional instances of your Azure Function as the queue depth increases. It'll also scale in idle instances after there's no more data to process.

Failure handling for Queue Centric work pattern

I am planning to use a queue-centric design as described here for one of my applications. That essentially consists of using an Azure queue where work requests are queued from the UI. A worker reads from the queue, processes the message, and deletes it from the queue.
The 'work' done by the worker is within a transaction, so if the worker fails before completing, upon restart it picks up the same message again (as it has not been deleted from the queue) and tries to perform the operation again (up to a max number of retries).
To scale I could use two methods:
Multiple workers each with a separate queue. So if I have five workers W1 to W5, I have 5 queues Q1 to Q5 and each worker knows which queue to read from, and failure handling is similar to the case with one queue and one worker.
One queue and multiple workers. Here failure/retry handling would be more involved and might end up using the 'Invisibility' time in the message queue to make sure no two workers pick up the same job. The invisibility time would have to be calculated to make sure that it's long enough for the job to complete, yet not so long that retries are performed after a long time.
I would like to know if the 1st approach is the correct way to go. What are robust ways of handling failures in the second approach above?
You would be better off taking approach 2 - a single queue, but with multiple workers.
This is better because:
The process that delivers messages to the queue only needs to know about a single queue endpoint. This reduces complexity at this end;
Scaling the number of workers that are pulling from the queue is now decoupled from any code / configuration changes - you can scale up and down much more easily (and at runtime)
If you are worried about the visibility, you can initially choose a default timespan, and then if the worker looks like it's taking too long, it can periodically call UpdateMessage() to update the visibility of the message.
Finally, if your worker times out and fails to complete processing of the message, the message will be picked up again by some other worker to try again. You can also use the DequeueCount property of the message to manage the number of retries.
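Here is a small model of how the invisibility timeout, its renewal, and the dequeue count fit together. It is a toy in-memory queue, not the Azure Storage SDK: `VisibilityQueue`, `take_job`, the 30 second timeout, and the limit of 3 attempts are all assumptions made for illustration, and time is passed in explicitly as `now` to keep the example deterministic.

```python
class VisibilityQueue:
    """Toy queue with an invisibility timeout and a dequeue counter (not the Azure SDK)."""

    def __init__(self, bodies):
        # Each entry: body, time at which it becomes visible, times it was dequeued.
        self.messages = [
            {"body": body, "visible_at": 0.0, "dequeue_count": 0} for body in bodies
        ]

    def get(self, now, visibility_timeout):
        """Return the first visible message and hide it for visibility_timeout seconds."""
        for message in self.messages:
            if message["visible_at"] <= now:
                message["visible_at"] = now + visibility_timeout
                message["dequeue_count"] += 1
                return message
        return None

    def extend(self, message, now, visibility_timeout):
        """Worker is still busy: push the invisibility deadline further out."""
        message["visible_at"] = now + visibility_timeout

    def delete(self, message):
        """Work finished: remove the message for good."""
        self.messages.remove(message)


def take_job(queue, now, visibility_timeout=30, max_attempts=3):
    """Fetch the next job, discarding messages that were retried too often."""
    while True:
        message = queue.get(now, visibility_timeout)
        if message is None:
            return None
        if message["dequeue_count"] > max_attempts:
            queue.delete(message)  # a real worker might move it to a poison queue
            continue
        return message
```

A worker that takes a job at time 0 hides it from the other workers until time 30. If it calls `extend` at time 25, the job stays hidden until 55. If the worker dies instead, the job reappears once the timeout passes and its `dequeue_count` goes up on the next `get`, which is what lets any worker enforce the retry limit.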
Multiple workers each with a separate queue. So if I have five workers W1 to W5, I have 5 queues Q1 to Q5 and each worker knows which queue to read from and failure handling is similar as the case with one queue and one worker
With this approach I see following issues:
This approach makes your architecture tightly coupled (thus defeating the whole purpose of using queues). Because each worker role listens to a dedicated queue, the web application responsible for pushing messages to the queues always needs to know how many workers are running. Any time you scale your worker role up or down, you somehow need to tell the web application so that it can start pushing messages to the appropriate queue.
If a worker role instance is taken down for whatever reason there's a possibility that some messages may not be processed ever as other worker role instances are working on their dedicated queues.
There may be a possibility of under utilization/over utilization of worker role instances depending on how web application pushes the messages in the queue. For optimal utilization, web application should know about the worker role utilization so that it can decide which queue to send message to. This is certainly not a desired thing for a web application to do.
I believe #2 is the correct way to go. @Brendan Green has covered your concerns about #2 in his answer excellently.

Azure queue message priority

I have a queue in Azure Storage named, for example, 'messages'. Every hour some service pushes a batch of messages to this queue that should update data. But in some cases I also push a message to this queue from another place, and I want that message to be processed immediately, yet I cannot set a priority for it.
What is the best solution for this problem?
Can I use two different queues ('messages' and 'messages-priority'), or is that a bad approach?
The correct approach is to use multiple queues: a 'normal priority' and a 'high priority' queue. What we have implemented is multiple queue reader threads in a single worker role. Each thread first checks the high priority queue and, if it's empty, looks in the normal queue. This way the high priority messages will be processed by the first available thread (pretty much immediately), and the same code runs regardless of where messages come from. It also saves having a reader continuously looking in a single queue and having to be backed off because there are seldom messages.
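The polling order each reader thread follows can be sketched like this. The example uses two in-process `queue.Queue` objects as stand-ins for the two Azure Storage queues; `next_message` and the queue contents are hypothetical.

```python
import queue


def next_message(high, normal):
    """Return (source, message) from the high-priority queue first, else the normal one."""
    for name, q in (("high", high), ("normal", normal)):
        try:
            return name, q.get_nowait()
        except queue.Empty:
            continue      # this queue is empty, try the next one
    return None           # both queues are empty


high_priority = queue.Queue()
normal_priority = queue.Queue()
normal_priority.put("hourly-update-1")
normal_priority.put("hourly-update-2")
high_priority.put("urgent-update")
```

Even though the urgent message was enqueued last, the first call to `next_message(high_priority, normal_priority)` returns it, and later calls fall back to the hourly messages. When the function returns `None`, a reader thread would typically sleep for a short back-off interval before polling again.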

Resources