I am using an Azure worker role to read and process queue messages.
It usually works fine, but sometimes performance is very slow and the queue is not read promptly.
The queue message count then starts to grow, and all functionality is delayed.
Web app details:
The main use of the app is vehicle tracking. Each vehicle carries a device that sends GPS data every 15 seconds. The web role receives this data and pushes it into the queue, and the worker role then reads and processes each message.
Sometimes worker role performance is very low: it takes 2 seconds to read a single message.
I can't say it is caused by workload. There are trips in the morning and evening, when I have to do more processing (such as sending messages), yet at those times it works fine. In the afternoon there are no trips, so the worker role simply reads messages and pushes them into Azure Table storage. Even so, once or twice a day it stops reading the queue quickly; the queue message count grows past 5000, and all data processing is delayed.
How can I avoid this?
Related
Context:
We have a microservice that consumes (subscribes to) messages from 50+ RabbitMQ queues.
Messages for these queues are produced in two places:
When the application encounters business logic that needs only a short delay (like sending emails or notifying another service), it sends the message directly to the exchange (which in turn routes it to the queue).
When we encounter business logic with a long delay, we use a messages table whose entries are messages that have to be executed after some time.
A cron worker runs every 10 minutes, scans the messages table, and pushes the due messages to RabbitMQ.
Scenario:
Let's say the messages table has 10,000 messages that will be queued in the next cron run.
9.00 AM - The cron worker runs and queues 10,000 messages to the RabbitMQ queue.
We do have subscribers listening to the queue, and they start consuming the messages, but due to some issue in the system or a slow third-party response, each message takes 1 minute to complete.
9.10 AM - The cron worker runs again 10 minutes later and sees that 9000+ messages are still not completed and their scheduled time has passed, so it pushes 9000+ duplicate messages to the queue.
Note: The subscribers that consume the messages are idempotent, so duplicate processing causes no issue.
Design idea I had in mind (probably not the best logic)
I can have 4 statuses (RequiresQueuing, Queued, Completed, Failed).
Whenever a message is inserted, I can set its status to RequiresQueuing.
When the cron worker picks up a message and successfully pushes it to the queue, I can set the status to Queued.
When a subscriber finishes the message, it marks the status as Completed or Failed.
There is an issue with the above logic: let's say RabbitMQ somehow goes down, or in some cases we have to purge the queue for maintenance.
The messages marked as Queued are then in the wrong state, because they have to be identified again and their status changed manually.
Another Example
Let's say I have a RabbitMQ queue named "events".
This events queue has 5 subscribers. Each subscriber gets 1 message from the queue and posts the event via a REST API to another microservice (event-aggregator). Each API call usually takes 50 ms.
Use case:
Due to high load, the number of events produced becomes 3x.
The microservice that accepts the events (event-aggregator) also becomes slow, and its response time increases from 50 ms to 1 minute.
The cron worker follows the design you mentioned above and queues messages every minute. Now the queue is becoming too large, but I cannot increase the number of subscribers because the dependent microservice (event-aggregator) is also lagging.
Now the question is: if I keep sending messages to the events queue, I am just bloating the queue.
https://www.rabbitmq.com/memory.html - While reading this page, I found out that RabbitMQ won't even accept connections once it reaches the high watermark fraction (default is 40%). Of course this can be changed, but that requires manual intervention.
So if the queue length increases, it affects RabbitMQ memory; that is the reason I thought of throttling at the producer level.
Questions
How can I throttle my cron worker so that it skips a particular run, or somehow inspects the queue, detects that it is already heavily loaded, and does not push more messages?
How can I handle the use cases described above? Is there a design that solves my problem? Has anyone faced the same issue?
Thanks in advance.
Answer
Check the comments on the accepted answer for throttling using queueCount.
You can combine QoS (quality of service) and manual ACK to get around this problem.
Your exact scenario is documented in https://www.rabbitmq.com/tutorials/tutorial-two-python.html. This example is for Python; examples for other languages are available as well.
Let's say you have 1 publisher and 5 worker scripts, all reading from the same queue, and each worker script takes 1 minute to process a message. You can set QoS at the channel level. If you set it to 1, each worker script is allocated only 1 message at a time, so 5 messages are processed at a time. No new message is delivered to a worker until that worker sends a manual ACK.
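A minimal sketch of one such worker, assuming the pika client, a broker on localhost, and a durable queue named "events" (the queue name, the host, and the handle function are placeholders, not taken from the question):

```python
import time


def handle(body: bytes) -> None:
    """Hypothetical job; stands in for the slow, 1-minute task."""
    time.sleep(0.01)


def on_message(channel, method, properties, body):
    """Process one message, then acknowledge it manually."""
    try:
        handle(body)
    except Exception:
        # Hand the message back to the broker so it can be redelivered.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        return
    # Only after this ack does the broker send this worker another message.
    channel.basic_ack(delivery_tag=method.delivery_tag)


def main(queue_name: str = "events") -> None:
    # pika is a third-party client; it is imported here so that the
    # callback above can be exercised without a broker.
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    # Assumption: the queue is durable; this must match the existing queue.
    channel.queue_declare(queue=queue_name, durable=True)
    # QoS: at most one unacknowledged message per worker.
    channel.basic_qos(prefetch_count=1)
    # auto_ack defaults to False, so acknowledgements are manual.
    channel.basic_consume(queue=queue_name, on_message_callback=on_message)
    channel.start_consuming()
```

Calling main() in each of the 5 worker processes would give the behaviour described: at most 5 messages in flight, with the rest waiting in the queue.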
If you want to increase the throughput of message processing, you can increase the worker nodes count.
Updating the tables based on message status is not a good option. Avoiding DB polling is the main reason systems use queues, and polling would cause a scaling issue. At some point you have to update the tables, and you would hit a bottleneck because of locking and isolation levels.
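For the producer-side throttling the question asks about, one possible sketch is to have the cron worker read the current queue depth and publish only what fits under a limit. The names select_batch, cron_run, get_depth, publish, and max_depth are all hypothetical, and the limit of 1000 is an arbitrary example value:

```python
from typing import Callable, List, Sequence


def select_batch(pending: Sequence[str], queue_depth: int, max_depth: int) -> List[str]:
    """Return the pending messages to publish in this cron run.

    Publishes only as many as fit under max_depth. Returns an empty list
    when the queue is already at or over the limit, which skips the run.
    """
    room = max_depth - queue_depth
    if room <= 0:
        return []
    return list(pending[:room])


def cron_run(
    pending: Sequence[str],
    get_depth: Callable[[], int],
    publish: Callable[[str], None],
    max_depth: int = 1000,
) -> int:
    """Publish a throttled batch and return how many messages were sent.

    get_depth is supplied by the caller. With pika, for example, it could
    read channel.queue_declare(queue=name, passive=True).method.message_count.
    """
    batch = select_batch(pending, get_depth(), max_depth)
    for message in batch:
        publish(message)
    return len(batch)
```

Messages that do not fit stay in the table and are considered again on the next run, so a slow consumer delays publishing instead of growing the queue.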
I'm having a little trouble understanding the difference between a message that has a scheduled message time ('scheduledEnqueueTime') and the time to live (default 14 days).
What's the difference between them?
I'm understanding it as the longest time that I can put something on the queue before it wakes up and does a dequeue is 14 days (default). Is this incorrect?
FYI - In my app I need to place messages on the queue to wake up, in some cases, up to 60 days from the current day. I see I can increase the pricing tier of the service bus to standard pricing and that will increase the time to live. Is this what I need to do?
Time to Live is the duration after which Service Bus will discard the message if nobody has processed it.
With Scheduled Enqueue Time you can hide the message so that nobody can process it until you want them to. This is independent of the time to live.
Scheduled messages do not materialize in the queue until the defined enqueue time.
Sidenote: You can also "defer" messages, but you have to explicitly unlock these messages from the queue. Scheduling would be better for your case.
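The difference can be illustrated with a small model in plain Python. It does not use the Service Bus SDK, the ScheduledMessage class is hypothetical, and it assumes the time-to-live clock only starts once the message is actually enqueued:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ScheduledMessage:
    scheduled_enqueue_time: datetime  # when the message becomes visible
    time_to_live: timedelta           # how long it may then wait unprocessed

    def state(self, now: datetime) -> str:
        if now < self.scheduled_enqueue_time:
            return "scheduled"  # hidden; no receiver can see it yet
        # Assumption: expiry is counted from the enqueue time, not the send time.
        if now >= self.scheduled_enqueue_time + self.time_to_live:
            return "expired"
        return "active"


sent_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
message = ScheduledMessage(
    scheduled_enqueue_time=sent_at + timedelta(days=60),
    time_to_live=timedelta(days=14),
)
```

Under that assumption, a message scheduled 60 days ahead with a 14-day time to live stays hidden for 60 days and then has 14 days in which to be processed.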
I am using an Azure Service Bus queue for one of my requirements. The requirement is simple: an Azure Function acts as an API and creates multiple jobs in the queue. The function is scalable, with new instances created on demand. The jobs it creates are processed by a Windows service, so the sender is the Azure Function and the receiver is the Windows service. Since the Azure Function is scalable, many function instances will run in parallel, so jobs will be added to the queue in parallel, probably one job every 500 ms. The Windows service is a single-instance queue listener that listens to this queue and executes jobs in parallel. So there may be many senders, but the receiver is one instance. The number of jobs running in parallel must be limited (to 4, since each takes considerable time and CPU). Right now, I am using an Azure Service Bus queue with the following configuration. My doubt is which configuration gives the best performance for this particular requirement.
Deletion of a job from the queue will not be an issue for me, so can I use Delete instead of Peek-Lock?
Also, right now, the listener does not receive items in order. I want to maintain the order in which they were created. My requirement is maximum performance. The job done by the Windows service is a CPU-intensive task, which is why I have limited it to 4, since the system has 4 cores.
Max delivery count: 4, Message lock duration: 5 min, MaxConcurrentCalls: 4 (in the listener). I am new to Service Bus and need a suggestion for this.
One more doubt: suppose the listener got 4 jobs in parallel and started executing them, and one job completes. Will the listener pick up the next item immediately, or wait for all 4 jobs to complete (MaxConcurrentCalls: 4)?
Deletion of a job from the queue will not be an issue for me, so can I use Delete instead of Peek-Lock?
Receiving messages in PeekLock receive mode will be less performant than ReceiveAndDelete. With ReceiveAndDelete you save the roundtrips to the broker that are needed to complete messages.
Max delivery count: 4, Message lock duration: 5 min, MaxConcurrentCalls: 4 (in the listener). I am new to Service Bus and need a suggestion for this.
MaxDeliveryCount is how many times delivery of a message can be attempted before it is dead-lettered. It appears to be set equal to the number of cores, but the two are unrelated; this could be just a coincidence.
MessageLockDuration will only matter if you use PeekLock receive mode. For ReceiveAndDelete it won't matter.
As for Concurrency, even though your work is CPU bound, I'd benchmark if higher concurrency would be possible.
An additional parameter on the message receiver to look into would be PrefetchCount. It can improve the overall performance by making fewer roundtrips to the broker.
One more doubt: suppose the listener got 4 jobs in parallel and started executing them, and one job completes. Will the listener pick up the next item immediately, or wait for all 4 jobs to complete (MaxConcurrentCalls: 4)?
The listener will immediately start processing the 5th message as your concurrency is set to 4 and one message processing has been completed.
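This slot-based behaviour can be modelled with a semaphore. The following is an analogy in plain Python asyncio, not the Service Bus client itself; run_jobs and its parameters are hypothetical names:

```python
import asyncio


async def run_jobs(durations, max_concurrent_calls=4):
    """Run one job per duration with at most max_concurrent_calls at once.

    Returns the peak number of simultaneously running jobs and an ordered
    list of ("start", index) / ("done", index) events.
    """
    semaphore = asyncio.Semaphore(max_concurrent_calls)
    running = 0
    peak = 0
    events = []

    async def worker(index, duration):
        nonlocal running, peak
        async with semaphore:  # wait for a free slot
            running += 1
            peak = max(peak, running)
            events.append(("start", index))
            await asyncio.sleep(duration)  # stands in for the real job
            events.append(("done", index))
            running -= 1

    await asyncio.gather(*(worker(i, d) for i, d in enumerate(durations)))
    return peak, events


# Job 1 is short, so the fifth job (index 4) starts as soon as job 1
# finishes, while jobs 0, 2 and 3 are still running.
peak, events = asyncio.run(run_jobs([0.05, 0.01, 0.05, 0.05, 0.01]))
```

In this model the fifth job starts as soon as any one slot frees up, and never more than 4 jobs run at the same time.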
Also, right now, the listener does not receive items in order. I want to maintain the order in which they were created.
To process messages in the order they were sent in you will need to send and receive messages using sessions.
My requirement is maximum performance. The job done by the Windows service is a CPU-intensive task, which is why I have limited it to 4, since the system has 4 cores.
There are multiple things to take into consideration. The location of your Windows service would affect latency and message throughput. Scaling out could help, etc.
Suppose that you have an application that lets a user request a job. For example (hypothetical): a user uploads a video. An entry is made in an RDBMS with the URL of the video in blob storage, and the status is set to "Pending".
There is a recurring, time-triggered function app that runs every 10 seconds or so; it gets 10 pending jobs from the RDBMS and performs some compression etc.
The problem here is that as long as the number of requests stays at 10-30 videos per 10 seconds we should be fine. But if the number of requests increases all of a sudden, say to 200 requests per 10 seconds, there will be a lot of pending jobs and the user would have to wait 10 times longer than usual to see the status change. How do you scale out the function app automatically in such a scenario? Does it have to be manual?
There's an easier way to get fan out and parallel processing through multiple concurrently running Azure Functions.
Add an Azure Service Bus Queue to your solution.
For each video that needs to be processed, enqueue a service bus message with the appropriate data you'll need to retrieve and process the video (like the BlobId).
Have your Azure Function triggered by a ServiceBusTrigger.
Azure will spin up additional instances of your Azure Function as the queue depth increases. It'll also scale in idle instances after there's no more data to process.
I have data going from my system to an Azure IoT Hub. I timestamp each data packet when I send it. An Azure Function is triggered by the IoT Hub; in the function I read the message, get the timestamp, and record how long the data took to reach the function. I also have another program running on my system that listens for data on the IoT Hub and records that time too.
Most of the time, the delay measured in the Azure Function is in milliseconds, but sometimes I see a long delay before the Azure Function is triggered (I conclude this because the program that reads from the IoT Hub shows that the data reached the IoT Hub quickly, with no delay).
Would anybody know why an Azure Function might be triggered late?
Is this the same question that was asked here? https://github.com/Azure/Azure-Functions/issues/711
I'll copy/paste my answer for others to see:
Based on what I see in the logs and your description, I think the latency can be explained as being caused by a cold-start of your function app process. If a function app goes idle for approximately 20 minutes, then it is unloaded from memory and any subsequent trigger will initiate a cold start.
Basically, the following sequence of events takes place:
The function app goes idle and is unloaded (this happened about 5 minutes before the trigger you mentioned).
You send the new event.
The event eventually gets noticed by our scale controller, which polls for events on a 10 second interval.
Our scale controller initiates a cold-start of your function app. This can add a few more seconds depending on the content of your function app (it was about 6 seconds in this case).
So unfortunately this is a known behavior with the consumption plan. You can read up on this issue here: https://blogs.msdn.microsoft.com/appserviceteam/2018/02/07/understanding-serverless-cold-start/. The blog post also discusses some ways you can work around this if it's problematic for your scenario.