Improving Amazon SQS Performance

Everything I can find about performance of Amazon Simple Queue Service (SQS), including their own documentation, suggests that getting high throughput requires multiple threads. And I've verified this myself using the JS API with Node 12. If I create multiple threads, I get about the same throughput on each thread, so the total throughput increase is pretty much linear. But I'm running this on a nice machine with lots of cores. When I run in Lambda on a single core, multiple threads don't improve the performance, and generally this is what I would expect of multi-threaded apps.
But here's what I don't understand - there should be very little going on here in the way of CPU, most of the time is spent waiting on web requests. The AWS SQS API appears to be asynchronous in that all of the methods use callbacks for the responses, and I'm using Promises to "asyncify" all of the API calls, with multiple tasks running concurrently. Normally doing this with any kind of async IO is handled great by Node, and improves throughput hugely, I do it all the time with database APIs, multiple streams, etc. But SQS definitely isn't behaving that way, it's behaving as though its IO is actually synchronous and blocking threads on the network calls, which would be outrageous for any modern API.
Has anyone had success getting high SQS message throughput in a single Node thread? The max I'm seeing is about 50 to 100 messages/sec for FIFO queues (send, receive, and delete, all of which are calling the batch methods with the max batch size of 10). And this is running in Lambda, i.e. on their own network, which is only slightly faster than running it on my laptop over the Internet, another surprising find. Amazon's documentation says FIFO queues should support up to 3000 messages per second when batching, which would be just fine for me. Does it really take multiple threads on multiple cores or virtual CPUs to achieve this? That would be ridiculous. I just can't believe that much CPU would be used; it should be mostly IO time, which should be asynchronous.
Edit:
As I continued to test, I found that the linear improvement with the number of threads only happened when each thread was processing a different queue. If the threads are all processing the same queue, there is no improvement by adding threads. So it behaves as though each queue is throttled by Amazon. But the throughput to which it seems to be throttling is way below what I found documented as the max throughput. Really confused and disappointed right now!

Michael's comments to the original question were right on. I was sending all messages to the same message group. I had previously been working with AMQP message queues, in which messages will be ordered in the queue in the order they're sent, and they'll be distributed to subscribers in that order. But when multiple listeners are consuming the AMQP queue, because of varying network latencies, there is no guarantee that they'll be received in that order chronologically.
So that's actually a really cool feature of SQS, the guarantee that messages will be chronologically received in the order they were sent within the same message group. In my case, I don't care about the receipt order. So now I'm setting a unique message group ID on each message, and scaling up performance by increasing the number of async message receive loops, still just in one thread, and the throughput is amazing!
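A minimal sketch of running several async receive loops in a single thread. `receiveBatch`, `handle` and `shouldStop` are hypothetical injected functions, not part of the AWS SDK; in practice `receiveBatch` would wrap a long-polling receive call and `handle` would process and delete the batch.

```javascript
// Runs `loopCount` independent receive loops concurrently on one thread.
// `receiveBatch` is a hypothetical async function that resolves to an array
// of messages (an empty array means nothing was available).
// `handle` is a hypothetical async function that processes one batch.
async function runReceiveLoops(loopCount, receiveBatch, handle, shouldStop) {
  let processed = 0;

  async function loop() {
    while (!shouldStop()) {
      // While this request is pending, the other loops keep running.
      const batch = await receiveBatch();
      if (batch.length === 0) continue; // assumes receiveBatch waits (long polling)
      await handle(batch);
      processed += batch.length;
    }
  }

  // Start every loop before awaiting any of them, so their I/O overlaps.
  const loops = [];
  for (let i = 0; i < loopCount; i++) loops.push(loop());
  await Promise.all(loops);
  return processed;
}
```

The number of loops is the tuning knob here; the right value depends on latency and on how the messages are spread across message groups.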
So the bottom line: If exact receipt order of messages isn't important for your FIFO queue, set the message group ID to a unique value on each message, and scale out with more receiver tasks to get the best throughput performance. If you do need guaranteed message ordering, it looks like around 50 messages per second is about the best you'll do.
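As a rough sketch of the "unique message group ID on each message" approach, the helper below builds batch entries of at most 10 messages each. The function name `buildFifoBatches` is made up for this example, and using a random hex string as the ID is just one possible choice.

```javascript
const crypto = require('crypto');

// Splits message bodies into batches of at most `batchSize` entries and gives
// every entry its own MessageGroupId, so no message is ordered relative to
// any other message.
function buildFifoBatches(bodies, batchSize = 10) {
  const batches = [];
  for (let i = 0; i < bodies.length; i += batchSize) {
    const entries = bodies.slice(i, i + batchSize).map((body, j) => {
      const unique = crypto.randomBytes(16).toString('hex');
      return {
        Id: String(i + j),              // must be unique within one batch request
        MessageBody: body,
        MessageGroupId: unique,         // unique per message: no ordering constraint
        MessageDeduplicationId: unique, // needed unless content-based deduplication is enabled
      };
    });
    batches.push(entries);
  }
  return batches;
}
```

Each returned array is meant to be passed as the `Entries` parameter of a `sendMessageBatch` call, together with the `QueueUrl`.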

Related

Replaying Messages in Order

I am implementing a consumer that processes messages from a queue where the order of messages is important. I would like to implement a mechanism using NodeJS where:
1. The consumer function consumes messages m1, m2, ..., mN from the queue.
2. It performs an IO-intensive operation to process each message: m -> m'.
3. It stores the result m' in a Redis cache.
4. It acknowledges the queue after each message is processed (step 2).
5. A different function listens for messages from the cache and:
   - sends the processed messages m' to an external system;
   - if the external system was able to process the message, deletes the processed message from the cache;
   - if the external system rejects the processed message, stops sending messages, discards the unsent processed messages in the cache and resets the offset to the last accepted message in the queue. For example, if m12' was the last message accepted by the system and I have acknowledged m23 from the queue, then I have to discard m13' to m23' and reset the offset so that the consumer can read and start processing from m13 again.
A few assumptions:
The processing of m to m' is intensive, and I am processing the messages optimistically, knowing that most of the time there won't be a failure.
With the current assumptions and goals, is there any way I can achieve this with RabbitMQ or any Azure equivalent? My client would prefer not to use Kafka or any Azure equivalent of Kafka (Azure Event Hub).

In scenarios where the messages will always be generated in sequence, a simple queue is probably all you need.
Azure Queues are pretty simple to get into, but the general mode of operation for queues is to remove the messages as they are processed successfully.
If you can avoid the scenario where you must "roll back" or re-process from an earlier point in time (in other words, if you can avoid the orchestration aspect), then this would be a much simpler option.
It's the "go back and replay" that you will struggle with. If you can implement two queues in a sequential pattern, where processing messages from one queue successfully pushes the message into the next queue, then we never need to go back, because the secondary consumer can never process ahead of the primary.
With Azure Event Hubs it is much easier to reset the offset for processing, because the messages stay in the bucket regardless of their read state, (in fact any given message does not have such a state) and the consumer maintains the offset pointer itself. It also has support for multiple consumer groups, which will make a copy of the message available to each consumer.
You can up your plan to maintain the data for up to 7 days without blowing the budget.
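To illustrate the idea of a retained log where the consumer owns the offset, here is a small in-memory sketch. It is not Event Hubs client code; `RetainedLog` and `OffsetConsumer` are hypothetical names used only for this example.

```javascript
// An append-only log that keeps messages whether or not they have been read.
class RetainedLog {
  constructor() { this.entries = []; }
  // Appends a message and returns its offset.
  append(message) { this.entries.push(message); return this.entries.length - 1; }
  // Returns up to `max` messages starting at `offset`.
  read(offset, max) { return this.entries.slice(offset, offset + max); }
}

// The consumer, not the log, keeps track of the read position.
class OffsetConsumer {
  constructor(log) { this.log = log; this.offset = 0; }
  poll(max = 1) {
    const batch = this.log.read(this.offset, max);
    this.offset += batch.length;
    return batch;
  }
  // Rewind so the next poll starts right after the last accepted message.
  rewindTo(lastAcceptedOffset) { this.offset = lastAcceptedOffset + 1; }
}
```

In the example from the question, m12 sits at offset 11, so after a rejection `rewindTo(11)` makes the next `poll()` return m13 again.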
There are two problems with large-scale telemetry ingestion services like Azure Event Hubs for your use case:
- The order of receipt is less reliable for messages that arrive extremely close together. The hub is designed to receive many messages from many sources concurrently, so its internal architecture cares much less about preserving the precise order. It records the precise receipt timestamp on each message, but it does not guarantee that the overall sequence of records will exactly match what you would get by sorting on the receipt timestamp. (It's a subtle but important distinction.)
- Event Hubs (and many client processing code examples) are designed to guarantee Exactly Once delivery across multiple concurrent consuming threads. Again, the consumers are encouraged to be asynchronous, and the service will try to ensure that failed processing attempts are retried by the next available thread.
So you could use Event Hubs, but you would have to bypass or disable a lot of its features, which is generally a strong sign that it is not the correct fit for your purpose. If you want to explore it though, you would want to limit the concurrency aspects:
- Minimise the partition count.
  You probably want 1 partition for each message producer, or at least for each sequential set; maintaining sequence is simpler inside a single partition.
- Make sure your message sender (producer) only sends to a specific partition.
  Each producer MUST use a unique partition key.
- Create a consumer group for each of your consumers.
- Process messages one at a time, not in batches.
- Process with a single thread.
I have a lot of experience in designing MS Azure based solutions for Industrial IoT (telemetry from PLCs) and Agricultural IoT (Raspberry Pi) device implementations. In almost all cases we think that the order of messaging is important, but unless you are maintaining real-time two-way command and control, you can usually get away with an optimistic approach where each message and any derivatives are or were correct at the time of transmission.
If there is the remote possibility that a device can be offline for any period of time, then dealing with the stale data flushing through the system when a device comes back online can really play havoc with sequential logic programming.
Take a step back to analyse your solution. Event Hubs does offer a convenient way to roll back the processing to a previous offset, as long as that record is still in the bucket, but can you redesign your logic flow so that you do not have to re-process old data?
What is the requirement that drives this sequence? If it is so important to maintain the sequence, then you should probably process the data with a single consumer that does everything, or look at chaining the queues in a sequential manner.

Behaviour of Vert.x Event-bus when reaching the limit

I'm missing one piece of understanding of how Event Bus / Hazelcast works.
Imagine a case with a consumer verticle and a producer verticle communicating over the clustered EB. The consuming part is doing CPU / memory / IO-intensive calculations.
When at some point due to the load the consumer is not able to handle the messages immediately, what is going to happen?
Would the messages be queued inside the ring-buffer and eventually be processed later (considering Netty's SingleThreadEventLoop limit of 2 billion, as per "Size of event bus in vert.x")? Will they be dropped in case of reaching the limit?
In general, can the messages in the EB be considered persistent and with a delivery guarantee, as long as no component in the cluster crashes?

If the consumers cannot cope with the messages, Vert.x will accumulate messages in a queue in memory.
When the queue reaches its limit, the messages will be dropped. The number of elements in the queue can be configured with MessageConsumer#setMaxBufferedMessages. It does not depend on message size.
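The behaviour described above can be pictured with a small in-memory buffer that counts messages rather than bytes. This is only an illustration, not Vert.x code: the class name `BoundedBuffer` is made up, and dropping the newest message (rather than the oldest) is an assumption.

```javascript
// Illustration only: a consumer-side buffer with a fixed capacity, measured
// in number of messages, that discards incoming messages once it is full.
class BoundedBuffer {
  constructor(maxBuffered) {
    this.maxBuffered = maxBuffered;
    this.items = [];
    this.dropped = 0;
  }
  // Returns true if the message was buffered, false if it was dropped.
  offer(message) {
    if (this.items.length >= this.maxBuffered) {
      this.dropped++; // assumption: the incoming message is the one discarded
      return false;
    }
    this.items.push(message);
    return true;
  }
  // Removes and returns the oldest buffered message (undefined if empty).
  take() { return this.items.shift(); }
}
```

A slow consumer that calls `take()` less often than the producer calls `offer()` will see `dropped` grow once the buffer is full, which is why the event bus cannot be treated as a durable queue.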
If you need delivery guarantees, don't use the EventBus, use a messaging system like ActiveMQ (Vert.x has clients for such messaging systems).
In general, Vert.x does its best not to lose messages but the EventBus is simply not a full-featured messaging system.

Scaling Long Running Message Processing Azure Service Bus

What is the best way to scale a Worker Role that is processing many long-running Azure Service Bus messages using the QueueClient message pump?
If I am using QueueClient.OnMessageOptions.MaxConcurrentCalls = 6 and QueueClient.OnMessage, does that mean I can only process a maximum of 6 messages at a time?
Is it bad form for the long-running processing within the OnMessage callback to spawn a new Task to complete its processing?
Should I be using QueueClient.OnMessageAsync instead?
Thanks for any help.

By “long running” do you mean IO-bound or CPU-bound?
Assuming IO-bound then I wouldn’t spawn a new Task in the OnMessage callback. This creates thread management overhead that can slow processing down at scale.
Consider using OnMessageAsync if you are using IO-bound operations and make sure that you await the asynchronous implementations of any of these operations. This uses your existing threads much more efficiently.
If your operations are CPU-bound then Task creation may do more for you. The mechanics of this are discussed in a series of excellent posts by Stephen Cleary:
http://blog.stephencleary.com/2013/10/taskrun-etiquette-and-proper-usage.html
The MaxConcurrentCalls property controls the number of concurrent requests to the service bus. Increasing this number has a limited impact if you’re IO-bound and limited by available bandwidth. I would recommend doing a bit of performance testing with the Azure client-side performance counters to get the optimum value for your environment.
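The effect of a concurrency cap combined with awaited IO can be sketched in Node.js, the language used elsewhere on this page. This is an analogy, not the Service Bus SDK: `pump` and its parameters are hypothetical, and `maxConcurrent` only plays the role that MaxConcurrentCalls plays in the question.

```javascript
// Processes `messages` with at most `maxConcurrent` handlers in flight.
// Each handler is expected to await its IO instead of blocking.
async function pump(messages, maxConcurrent, handler) {
  let next = 0;      // index of the next message to hand out
  let inFlight = 0;  // handlers currently running
  let peak = 0;      // highest value inFlight has reached

  async function worker() {
    while (next < messages.length) {
      const message = messages[next++];
      inFlight++;
      peak = Math.max(peak, inFlight);
      try {
        await handler(message);
      } finally {
        inFlight--;
      }
    }
  }

  const workers = [];
  const count = Math.min(maxConcurrent, messages.length);
  for (let i = 0; i < count; i++) workers.push(worker());
  await Promise.all(workers);
  return peak;
}
```

With IO-bound handlers, the cap limits how many requests are outstanding at once; it does not need one thread per message.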

Azure Service Bus - Determine Number of Active Connections (Topic/Queue)

Since Azure Service Bus limits the maximum number of concurrent connections to a Queue or Topic to 100, is there a method that we can use to query our Queues/Topics to determine how many concurrent connections there are?
We are aware that we can capture the throttling events, but would very much prefer an active approach, where we can proactively increase or decrease the number of Queues/Topics when the system is under a heavy load.
The use case here is a process waiting for a reply message, where the reply is coming from a long-running process, and the subscription is using a Correlation Filter to facilitate two-way communication between the Publisher and Subscriber. Thus, we must have a BeginReceive() going in order to await the response, and each such Publisher will be consuming a connection for the duration of their wait time. The system already balances load across multiple Topics, but we need a way to be proactive about how many Topics are created, so that we do not get throttled too often, but at the same time not have an excess of Topics for this purpose.

I don't believe it is currently possible to query the listener counts. I think that the subscriber object also figures into that, so in theory, if you have up to 2000 subscribers per topic and if each allows up to 100 connections, that's a lot of potential connections. We just need to keep in mind that subscribers are cooperative (each gets a copy of all messages) and receivers on subscribers are competitive (only one gets it).
I've also seen unconfirmed reports of performance delays when you start running > 1,000 subscribers so make sure you test this scenario.
But... given your scenario, I'd deduce that performance time likely isn't the biggest factor (you have long-running processes already). So introducing a couple of seconds of lag into the workflow likely won't be critical. If that's the case, I'd set the timeout for your BeginReceive to something fairly short (a couple of seconds) and have a sleep/wait delay between attempts. This gives other listeners an opportunity to get messages as well. We might also want to consider an approach where we attempt to receive multiple messages and then assign them out to other processes for processing (correlation in this case?).
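The short-timeout-plus-pause idea might look roughly like this in Node.js. `tryReceive` is a hypothetical function standing in for a receive call with a timeout; it is assumed to resolve to a message, or to null when the timeout expires. The default values are placeholders.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Polls with a short receive timeout and pauses between empty attempts, so
// that no receive is outstanding during the pause.
async function waitForReply(tryReceive, options = {}) {
  const { receiveTimeoutMs = 2000, pauseMs = 1000, maxAttempts = 30 } = options;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const message = await tryReceive(receiveTimeoutMs);
    if (message !== null) return { message, attempts: attempt };
    if (attempt < maxAttempts) await sleep(pauseMs);
  }
  return { message: null, attempts: maxAttempts }; // gave up without a reply
}
```

The trade-off is added latency of up to one pause interval per reply, which the answer argues is acceptable when the work itself is long-running.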
Just some thoughts.

Is it acceptable to use ThreadPool.GetAvailableThreads to throttle the amount of work a service performs?

I have a service which polls a queue very quickly to check for more 'work' which needs to be done. There is always more work in the queue than a single worker can handle. I want to make sure a single worker doesn't grab too much work when the service is already at max capacity.
Let's say my worker grabs 10 messages from the queue every N ms and uses the Parallel Library to process each message in parallel on different threads. The work itself is very IO heavy. Many SQL Server queries and even Azure Table storage calls (HTTP requests) are made for a single unit of work.
Is using ThreadPool.GetAvailableThreads() the proper way to throttle how much work the service is allowed to grab?
I see that I have access to available WorkerThreads and CompletionPortThreads. For an IO heavy process, is it more appropriate to look at how many CompletionPortThreads are available? I believe 1000 is the number made available per process regardless of CPU count.
Update - Might be important to know that the queue I'm working with is an Azure Queue. So, each request to check for messages is made as an async http request which returns with the next 10 messages. (and costs money)

I don't think using IO completion ports is a good way to work out how much to grab.
I assume that the ideal situation is where you run out of work just as the next set arrives, so you've never got more backlog than you can reasonably handle.
Why not keep track of how long it takes to process a job and how long it takes to fetch jobs, and adjust the amount of work fetched each time based on that, with suitable minimum/maximum values to stop things going crazy if you have a few really cheap or really expensive jobs?
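One possible way to turn that suggestion into numbers is sketched below. The formula is an assumption made for this example, not something stated in the answer: it estimates how many jobs can be finished during one poll interval plus one fetch, then clamps the result. All names and parameters are hypothetical.

```javascript
// Exponentially weighted moving average of observed durations (in ms).
class Ewma {
  constructor(alpha, initial) { this.alpha = alpha; this.value = initial; }
  update(sample) {
    this.value = this.alpha * sample + (1 - this.alpha) * this.value;
    return this.value;
  }
}

function clamp(x, lo, hi) { return Math.min(hi, Math.max(lo, x)); }

// Estimates how many jobs can be completed before the next fetch returns,
// bounded by `min` and `max` so that outliers cannot push it to extremes.
function nextFetchSize({ avgJobMs, avgFetchMs, pollIntervalMs, parallelism, min, max }) {
  const windowMs = pollIntervalMs + avgFetchMs;
  const capacity = Math.floor((parallelism * windowMs) / avgJobMs);
  return clamp(capacity, min, max);
}
```

For example, with 4 jobs running in parallel, 200 ms per job, a 900 ms poll interval and 100 ms per fetch, the window is 1000 ms and the function asks for 20 messages.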
You'll also want to work out a reasonable optimum degree of parallelization - it's not clear to me whether it's really IO-heavy, or whether it's just "asynchronous request heavy", i.e. you spend a lot of time just waiting for the responses to complicated queries which in themselves are cheap for the resources of your service.

I've been working virtually the same problem in the same environment. I ended up giving each WorkerRole an internal work queue, implemented as a BlockingCollection<>. There's a single thread that monitors that queue - when the number of items gets low it requests more items from the Azure queue. It always requests the maximum number of items, 32, to cut down costs. It also has automatic backoff in the event that the queue is empty.
Then I have a set of worker threads that I started myself. They sit in a loop, pulling items off the internal work queue. The number of worker threads is my main way to optimize the load, so I've got that set up as an option in the .cscfg file. I'm currently running 35 threads/worker, but that number will depend on your situation.
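The refill decision and the backoff can be separated from the threading details. The sketch below is written in Node.js for consistency with the rest of this page; the class name, the low-watermark parameter and the doubling backoff are assumptions, since the answer does not say how its backoff is computed.

```javascript
// Decides when the internal work queue should be topped up and how long to
// wait before the next request to the remote queue.
class RefillPolicy {
  constructor({ lowWatermark, baseDelayMs, maxDelayMs }) {
    this.lowWatermark = lowWatermark;
    this.baseDelayMs = baseDelayMs;
    this.maxDelayMs = maxDelayMs;
    this.delayMs = baseDelayMs;
  }
  // True when the local queue has run low enough to request more items.
  shouldFetch(localCount) { return localCount <= this.lowWatermark; }
  // Call after each fetch: doubles the delay after an empty result (up to the
  // maximum) and resets it as soon as items arrive again.
  recordFetch(fetchedCount) {
    this.delayMs = fetchedCount === 0
      ? Math.min(this.delayMs * 2, this.maxDelayMs)
      : this.baseDelayMs;
    return this.delayMs;
  }
}
```

The monitoring loop would call `shouldFetch` with the current size of the internal queue, request the maximum batch when it returns true, and then wait for the delay returned by `recordFetch`.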
I tried using TPL to manage the work, but I found it more difficult to manage the load. Sometimes TPL would under-parallelize and the machine would be bored, other times it would over-parallelize and the Azure queue message visibility would expire while the item was still being worked.
This may not be the optimal solution, but it seems to be working OK for me.

I decided to keep an internal counter of how many messages are currently being processed. I used Interlocked.Increment/Decrement to manage the counter in a thread-safe manner.
I would have used the Semaphore class since each message is tied to its own Thread but wasn't able to due to the async nature of the queue poller and the code which spawned the threads.
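The counter idea carries over to other runtimes. The sketch below is a Node.js version with a hypothetical `InFlightGate` class; because Node runs callbacks on a single thread, a plain increment is enough here, whereas the .NET code described above needs Interlocked.

```javascript
// Tracks how many messages are being processed and reports how many more
// may be taken from the queue right now.
class InFlightGate {
  constructor(maxInFlight) {
    this.maxInFlight = maxInFlight;
    this.inFlight = 0;
  }
  // Number of additional messages the service can accept at this moment.
  available() { return Math.max(0, this.maxInFlight - this.inFlight); }
  // Call when a message starts processing.
  begin() { this.inFlight++; }
  // Call when a message finishes, whether it succeeded or failed.
  end() { this.inFlight--; }
}
```

The poller would ask for at most `available()` messages on each request and skip the request entirely when that value is zero.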
