Scaling Long-Running Message Processing with Azure Service Bus

What is the best way to scale a Worker Role that is processing many long-running Azure Service Bus messages using the QueueClient message pump?
If I use OnMessageOptions.MaxConcurrentCalls = 6 with QueueClient.OnMessage,
does that mean I can only process a maximum of 6 messages at a time?
Is it bad form to have the OnMessage callback spawn a new Task to complete its long-running processing?
Should I be using QueueClient.OnMessageAsync instead?
Thanks for any help.

By “long running” do you mean IO-bound or CPU-bound?
Assuming IO-bound, I wouldn't spawn a new Task in the OnMessage callback. This creates thread management overhead that can slow processing down at scale.
Consider using OnMessageAsync if your operations are IO-bound, and make sure that you await the asynchronous implementations of those operations. This uses your existing threads much more efficiently.
If your operations are CPU-bound then Task creation may do more for you. The mechanics of this are discussed in a series of excellent posts by Stephen Cleary:
http://blog.stephencleary.com/2013/10/taskrun-etiquette-and-proper-usage.html
The MaxConcurrentCalls property controls the number of concurrent requests to the service bus. Increasing this number has a limited impact if you’re IO-bound and limited by available bandwidth. I would recommend doing a bit of performance testing with the Azure client-side performance counters to get the optimum value for your environment.
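For the IO-bound case, a minimal sketch of the async pump (assuming the classic Microsoft.ServiceBus.Messaging client; the connection string, queue name and DoIoBoundWorkAsync are placeholders):

    using System.Diagnostics;
    using System.Threading.Tasks;
    using Microsoft.ServiceBus.Messaging;

    var client = QueueClient.CreateFromConnectionString(connectionString, "myqueue");

    var options = new OnMessageOptions
    {
        MaxConcurrentCalls = 6,   // at most 6 messages handed to the callback concurrently
        AutoComplete = false      // complete explicitly once the async work has succeeded
    };
    options.ExceptionReceived += (s, e) => Trace.TraceError(e.Exception.ToString());

    client.OnMessageAsync(async message =>
    {
        await DoIoBoundWorkAsync(message);   // await the IO-bound work; no extra Task.Run needed
        await message.CompleteAsync();
    }, options);

Because the callback awaits rather than blocks, the six concurrent calls tie up very few threads while the IO is in flight.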

Related

Improving Amazon SQS Performance

Everything I can find about performance of Amazon Simple Queue Service (SQS), including their own documentation, suggests that getting high throughput requires multiple threads. And I've verified this myself using the JS API with Node 12. If I create multiple threads, I get about the same throughput on each thread, so the total throughput increase is pretty much linear. But I'm running this on a nice machine with lots of cores. When I run in Lambda on a single core, multiple threads don't improve the performance, and generally this is what I would expect of multi-threaded apps.
But here's what I don't understand - there should be very little going on here in the way of CPU, most of the time is spent waiting on web requests. The AWS SQS API appears to be asynchronous in that all of the methods use callbacks for the responses, and I'm using Promises to "asyncify" all of the API calls, with multiple tasks running concurrently. Normally doing this with any kind of async IO is handled great by Node, and improves throughput hugely, I do it all the time with database APIs, multiple streams, etc. But SQS definitely isn't behaving that way, it's behaving as though its IO is actually synchronous and blocking threads on the network calls, which would be outrageous for any modern API.
Has anyone had success getting high SQS message throughput in a single Node thread? The max I'm seeing is about 50 to 100 messages/sec for FIFO queues (send, receive, and delete, all of which are calling the batch methods with the max batch size of 10). And this is running in Lambda, i.e. on their own network, which is only slightly faster than running it on my laptop over the Internet, another surprising find. Amazon's documentation says FIFO queues should support up to 3000 messages per second when batching, which would be just fine for me. Does it really take multiple threads on multiple cores or virtual CPUs to achieve this? That would be ridiculous; I just can't believe that much CPU would be used, it should be mostly IO time, which should be asynchronous.
Edit:
As I continued to test, I found that the linear improvement with the number of threads only happened when each thread was processing a different queue. If the threads are all processing the same queue, there is no improvement by adding threads. So it behaves as though each queue is throttled by Amazon. But the throughput to which it seems to be throttling is way below what I found documented as the max throughput. Really confused and disappointed right now!
Michael's comments to the original question were right on. I was sending all messages to the same message group. I had previously been working with AMQP message queues, in which messages will be ordered in the queue in the order they're sent, and they'll be distributed to subscribers in that order. But when multiple listeners are consuming the AMQP queue, because of varying network latencies, there is no guarantee that they'll be received in that order chronologically.
So that's actually a really cool feature of SQS, the guarantee that messages will be chronologically received in the order they were sent within the same message group. In my case, I don't care about the receipt order. So now I'm setting a unique message group ID on each message, and scaling up performance by increasing the number of async message receive loops, still just in one thread, and the throughput is amazing!
So the bottom line: If exact receipt order of messages isn't important for your FIFO queue, set the message group ID to a unique value on each message, and scale out with more receiver tasks to get the best throughput performance. If you do need guaranteed message ordering, it looks like around 50 messages per second is about the best you'll do.
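The thread above is about the Node SDK, but as a rough sketch of the same idea (shown here with the AWS SDK for .NET; the queue URL, the message body and the ReceiveLoopAsync helper are hypothetical):

    using System;
    using System.Linq;
    using System.Threading.Tasks;
    using Amazon.SQS;
    using Amazon.SQS.Model;

    var sqs = new AmazonSQSClient();

    await sqs.SendMessageAsync(new SendMessageRequest
    {
        QueueUrl = queueUrl,                            // must be a .fifo queue
        MessageBody = body,
        MessageGroupId = Guid.NewGuid().ToString(),     // unique per message: ordering is given up for throughput
        MessageDeduplicationId = Guid.NewGuid().ToString()
    });

    // Scale the consumer side with several concurrent receive loops in a single process.
    var receivers = Enumerable.Range(0, 10).Select(_ => ReceiveLoopAsync(sqs, queueUrl));
    await Task.WhenAll(receivers);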

NodeJS with Redis message queue - How to set multiple consumers (threads)

I have a Node.js project that exposes a simple REST API for an external web application. This webhook must cope with a large number of requests per second as well as return 200 OK very quickly to the caller. To make that happen, I am investigating a simple Redis queue onto which each request is enqueued, to be handled asynchronously later on (via a consumer thread).
The redis simple queue seems like an easy way to achieve this task (https://github.com/smrchy/rsmq)
1) Is rsmq.receiveMessage() { ....... } a blocking method? If this handler is slow, will it impact my server's performance?
2) If the answer to question 1 is yes, is it recommended to extract the consumption of the messages into an external microservice (a dedicated consumer)? What are the best practices for creating multi-threaded consumers in such an environment?
You can use the pub/sub feature provided by Redis: https://redis.io/topics/pubsub
You can publish to various channels without any knowledge of the subscribers. Subscribers can subscribe to the channels they wish.
1) No, it won't block the event loop, however you will only start processing a second message once you call the "next" method, i.e., you will process one message at a time. To overcome this, you can start multiple workers in parallel. Take a look here: https://stackoverflow.com/a/45984677/7201847
2) That's an architectural decision that depends on the load you have to support and the hardware capacity you have. I would recommend at least two Node.js processes: one for adding the messages to the queue and another one to actually process them, with the option to start additional worker processes if needed, depending on the results of your performance tests.

Failure handling for Queue Centric work pattern

I am planning to use a queue-centric design as described here for one of my applications. That essentially consists of using an Azure queue where work requests are queued from the UI. A worker reads from the queue, processes the message, and deletes it from the queue.
The 'work' done by the worker is within a transaction, so if the worker fails before completing, upon restart it picks up the same message again (as it has not been deleted from the queue) and tries to perform the operation again (up to a max number of retries).
To scale I could use two methods:
Multiple workers, each with a separate queue. So if I have five workers W1 to W5, I have 5 queues Q1 to Q5, each worker knows which queue to read from, and failure handling is similar to the case with one queue and one worker.
One queue and multiple workers. Here failure/retry handling would be more involved and might end up using the 'invisibility' time in the message queue to make sure no two workers pick up the same job. The invisibility time would have to be calculated to make sure that it's enough for the job to complete, yet not so large that retries are performed after a long time.
I would like to know if the 1st approach is the correct way to go. What are robust ways of handling failures in the second approach above?
You would be better off taking approach 2 - a single queue, but with multiple workers.
This is better because:
The process that delivers messages to the queue only needs to know about a single queue endpoint. This reduces complexity at this end;
Scaling the number of workers that are pulling from the queue is now decoupled from any code / configuration changes - you can scale up and down much more easily (and at runtime)
If you are worried about the visibility, you can initially choose a default timespan, and then if the worker looks like it's taking too long, it can periodically call UpdateMessage() to update the visibility of the message.
Finally, if your worker times out and fails to complete processing of the message, it'll be picked up again by some other worker to try again. You can also use the DequeueCount property of the message to manage the number of retries.
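A minimal sketch of that flow, assuming the classic Microsoft.WindowsAzure.Storage.Queue client (the queues, timeouts, retry limit and ProcessAsync are placeholders):

    var message = await queue.GetMessageAsync(TimeSpan.FromMinutes(5), null, null); // initial invisibility window
    if (message == null) return;

    if (message.DequeueCount > 5)
    {
        // Too many retries: treat as poison, park it elsewhere and remove it from the work queue.
        await poisonQueue.AddMessageAsync(new CloudQueueMessage(message.AsString));
        await queue.DeleteMessageAsync(message);
        return;
    }

    // If processing looks like it will run long, extend the lease rather than guessing a huge timeout up front.
    await queue.UpdateMessageAsync(message, TimeSpan.FromMinutes(5), MessageUpdateFields.Visibility);

    await ProcessAsync(message);
    await queue.DeleteMessageAsync(message);

A real worker would renew the visibility periodically while the job runs; the single UpdateMessageAsync call here just stands in for that loop.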
Multiple workers, each with a separate queue. So if I have five workers W1 to W5, I have 5 queues Q1 to Q5, each worker knows which queue to read from, and failure handling is similar to the case with one queue and one worker.
With this approach I see following issues:
This approach makes your architecture tightly coupled (thus defeating the whole purpose of using queues). Because each worker role listens to a dedicated queue, the web application responsible for pushing messages into the queue always needs to know how many workers are running. Any time you scale your worker role up or down, you somehow need to tell the web application so that it can start pushing messages into the appropriate queue.
If a worker role instance is taken down for whatever reason, there's a possibility that some messages may never be processed, as the other worker role instances are working on their own dedicated queues.
There may be a possibility of under-utilization/over-utilization of worker role instances depending on how the web application pushes messages into the queues. For optimal utilization, the web application would have to know about worker role utilization so that it can decide which queue to send a message to. This is certainly not a desirable thing for a web application to do.
I believe #2 is the correct way to go. @Brendan Green has covered your concerns about #2 excellently in his answer.

WCF Multithreading With Scalability Considerations

Working on a stateless WCF REST web service, I have an operation with 3 independent tasks. Each one can be run independently. Each task consists of a web service call to an external API and a follow-up local DB read operation that takes less than 0.25 sec.
The first thing that comes to mind is that I should spawn 3 separate threads, then join and return the result. Using a thread pool would probably not be a good idea here as it's limited to 250 threads max.
Performance is of concern, but not at the expense of scalability.
Should I be concerned about the overhead of starting and joining 3 separate threads for each web service call?
Wrap the calls to the external service into async Task methods, then call them from your WCF method. This will use the thread pool and will queue your web service calls nicely if the thread pool is exhausted.
You can use async IO to perform the webservice calls. Async IO does not occupy any thread at all while it is running. You can do the same thing for the database calls. This alleviates any threading concern that you might have.
Alternatively, you can rely on the thread-pool. You can increase the limits. You can calculate how many threads you need: If 100 requests arrive per second and each one takes 2 seconds to complete you need 200 threads. This can easily be served by the built-in thread pool assuming you configure appropriate limits.
In case the external service is down and takes 30 seconds to time out, this number now shoots up to 3000 threads, which I consider unsafe. So you either need a low timeout, a circuit breaker, or async IO.
So in order to decide you need to forecast load and latency.
I'll link to some discussion for why and when to use async IO:
https://stackoverflow.com/a/25087273/122718 Why does the EF 6 tutorial use asynchronous calls?
https://stackoverflow.com/a/12796711/122718 Should we switch to use async I/O by default?
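As a sketch of the async-IO route for the three tasks described in the question (the URLs, the HttpClient instance and the local helpers are placeholders):

    public async Task<Result> GetCombinedResultAsync()
    {
        // Start all three external calls concurrently; no thread is blocked while they are in flight.
        Task<string> call1 = httpClient.GetStringAsync(externalApiUrl1);
        Task<string> call2 = httpClient.GetStringAsync(externalApiUrl2);
        Task<string> call3 = httpClient.GetStringAsync(externalApiUrl3);

        await Task.WhenAll(call1, call2, call3);

        // Follow-up local DB reads (< 0.25 sec each), also awaited rather than blocked on.
        var record1 = await GetLocalRecordAsync(call1.Result);
        var record2 = await GetLocalRecordAsync(call2.Result);
        var record3 = await GetLocalRecordAsync(call3.Result);

        return Combine(record1, record2, record3);
    }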

Is it acceptable to use ThreadPool.GetAvailableThreads to throttle the amount of work a service performs?

I have a service which polls a queue very quickly to check for more 'work' which needs to be done. There is always more work in the queue than a single worker can handle. I want to make sure a single worker doesn't grab too much work when the service is already at max capacity.
Let's say my worker grabs 10 messages from the queue every N ms and uses the Parallel Library to process each message in parallel on different threads. The work itself is very IO heavy: many SQL Server queries and even Azure Table storage (HTTP requests) are made for a single unit of work.
Is using ThreadPool.GetAvailableThreads() the proper way to throttle how much work the service is allowed to grab?
I see that I have access to available WorkerThreads and CompletionPortThreads. For an IO-heavy process, is it more appropriate to look at how many CompletionPortThreads are available? I believe 1000 is the number made available per process regardless of CPU count.
Update - Might be important to know that the queue I'm working with is an Azure Queue. So, each request to check for messages is made as an async http request which returns with the next 10 messages. (and costs money)
I don't think using IO completion ports is a good way to work out how much to grab.
I assume that the ideal situation is where you run out of work just as the next set arrives, so you've never got more backlog than you can reasonably handle.
Why not keep track of how long it takes to process a job and how long it takes to fetch jobs, and adjust the amount of work fetched each time based on that, with suitable minimum/maximum values to stop things going crazy if you have a few really cheap or really expensive jobs?
You'll also want to work out a reasonable optimum degree of parallelization - it's not clear to me whether it's really IO-heavy, or whether it's just "asynchronous request heavy", i.e. you spend a lot of time just waiting for the responses to complicated queries which in themselves are cheap for the resources of your service.
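One possible shape of that feedback loop (untested sketch; the target window, the bounds and the fetch/process helpers are placeholders):

    int batchSize = 10;
    var target = TimeSpan.FromSeconds(5);   // how long one batch should keep the worker busy

    while (true)
    {
        var messages = await FetchMessagesAsync(batchSize);   // hypothetical fetch helper
        var sw = Stopwatch.StartNew();
        await ProcessAllAsync(messages);                      // hypothetical processing helper
        sw.Stop();

        // Scale the next fetch toward the target window, clamped so a few odd jobs can't send it crazy.
        double factor = target.TotalMilliseconds / Math.Max(1.0, sw.Elapsed.TotalMilliseconds);
        batchSize = Math.Max(1, Math.Min(32, (int)(batchSize * factor)));
    }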
I've been working virtually the same problem in the same environment. I ended up giving each WorkerRole an internal work queue, implemented as a BlockingCollection<>. There's a single thread that monitors that queue - when the number of items gets low it requests more items from the Azure queue. It always requests the maximum number of items, 32, to cut down costs. It also has automatic backoff in the event that the queue is empty.
Then I have a set of worker threads that I started myself. They sit in a loop, pulling items off the internal work queue. The number of worker threads is my main way to optimize the load, so I've got that set up as an option in the .cscfg file. I'm currently running 35 threads/worker, but that number will depend on your situation.
I tried using TPL to manage the work, but I found it more difficult to manage the load. Sometimes TPL would under-parallelize and the machine would be bored, other times it would over-parallelize and the Azure queue message visibility would expire while the item was still being worked.
This may not be the optimal solution, but it seems to be working OK for me.
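A rough sketch of that layout (the capacities, thresholds, cloudQueue, workerThreadCount and ProcessMessage are placeholders):

    var workItems = new BlockingCollection<CloudQueueMessage>(boundedCapacity: 64);

    // Single feeder task: tops the internal queue up from the Azure queue.
    Task.Run(async () =>
    {
        while (true)
        {
            if (workItems.Count < 16)
            {
                var batch = (await cloudQueue.GetMessagesAsync(32)).ToList();  // always ask for the max to cut costs
                foreach (var msg in batch) workItems.Add(msg);
                if (batch.Count == 0) await Task.Delay(TimeSpan.FromSeconds(5));  // back off when the queue is empty
            }
            else
            {
                await Task.Delay(250);
            }
        }
    });

    // Worker threads: the count comes from configuration (e.g. the .cscfg setting).
    for (int i = 0; i < workerThreadCount; i++)
    {
        new Thread(() =>
        {
            foreach (var msg in workItems.GetConsumingEnumerable())
                ProcessMessage(msg);
        }) { IsBackground = true }.Start();
    }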
I decided to keep an internal counter of how many messages are currently being processed. I used Interlocked.Increment/Decrement to manage the counter in a thread-safe manner.
I would have used the Semaphore class since each message is tied to its own Thread but wasn't able to due to the async nature of the queue poller and the code which spawned the threads.
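Something along these lines (sketch only; the limit and the FetchBatchAsync/ProcessAsync helpers are placeholders):

    private int _inFlight;
    private const int MaxInFlight = 20;

    public async Task PollLoopAsync()
    {
        while (true)
        {
            int free = MaxInFlight - Volatile.Read(ref _inFlight);
            if (free <= 0) { await Task.Delay(100); continue; }        // at capacity: skip this poll

            var messages = await FetchBatchAsync(Math.Min(10, free));  // async HTTP call to the Azure queue
            foreach (var msg in messages)
            {
                Interlocked.Increment(ref _inFlight);
                _ = Task.Run(async () =>
                {
                    try { await ProcessAsync(msg); }
                    finally { Interlocked.Decrement(ref _inFlight); }
                });
            }
        }
    }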
