How to have multiple round robin RabbitMQ consumers without concurrency - node.js

I am putting together a Node.js system that receives a large volume of events, and the order in which those events are processed is critical. The application also has to scale and survive a RabbitMQ consumer falling over, so I have multiple consumers reading from a queue that is bound to a direct exchange, with 'noAck' set to false and a prefetch count of 1 on each consumer.
This is ensuring that my messages are being processed in order; however, both consumers are processing events concurrently, whereas my desired outcome is:
Consumer A          Consumer B
----------          ----------
process event 1
...
acknowledge
                    process event 2
                    ...
                    acknowledge
process event 3
...
..and so on.
I realise that this reduces the efficiency of my nodes but the guarantee that events are fully processed in order is much more critical to me.
Any help would be greatly appreciated.

"This is ensuring that my messages are being processed in order"
No. Whenever you have concurrent consumers, setting basic_qos to 1 (a prefetch count of 1) won't prevent messages from being processed out of order: each consumer still works on its own message at the same time as the others.
If you need ordered processing you must have 1 consumer per queue. If you want parallelism, the best you can do is partition the message stream into several queues, using something like the Consistent Hash Exchange. Global ordering across the whole stream is lost once messages are partitioned, but one consumer per queue ensures ordered processing within each queue.
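To illustrate the partitioning idea, here is a minimal in-process sketch of a consistent hash ring. It is not the RabbitMQ plugin's API; the class name HashRing, the queue names, and the routing keys are all made up for the example. The point it shows is that every event with the same routing key lands in the same queue, so a single consumer per queue sees that key's events in arrival order.

```python
import bisect
import hashlib
from collections import defaultdict


def _hash(value: str) -> int:
    # Stable hash; Python's built-in hash() for str is randomized per process.
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical helper: maps routing keys to queue names on a hash ring."""

    def __init__(self, queues, points_per_queue=100):
        # Each queue owns several points on the ring to even out the spread.
        self._ring = sorted(
            (_hash(f"{q}#{i}"), q) for q in queues for i in range(points_per_queue)
        )
        self._points = [p for p, _ in self._ring]

    def queue_for(self, routing_key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(routing_key)) % len(self._ring)
        return self._ring[idx][1]


def partition(events, ring):
    """events: iterable of (routing_key, payload) in arrival order."""
    queues = defaultdict(list)
    for key, payload in events:
        queues[ring.queue_for(key)].append((key, payload))
    return queues


ring = HashRing(["q0", "q1", "q2"])
events = [(f"account-{n % 5}", n) for n in range(20)]
queues = partition(events, ring)
```

Events for different keys may end up interleaved differently than in the original stream, which is the ordering loss the answer mentions.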

Related

Apache Pulsar - ioThreads/listenerThreads and message ordering

We are developing an application that requires messages with the same key to be processed strictly in sequence. In addition, for performance/throughput reasons, we need to introduce parallel processing.
Parallelizing is easy - we can have a single thread receiving the messages, calculating a hash on the key, and use hash % number of workers to distribute the message to a particular blocking queue with a worker on the other side. This guarantees that messages with the same key are dispatched to the same worker, so ordering is guaranteed - as long as the receiver gets the messages in order.
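The dispatch scheme described above could look roughly like this. It is a sketch in Python rather than the Pulsar client API: the names dispatch, worker_index and NUM_WORKERS are invented for the example, and a plain list of (key, sequence) tuples stands in for the messages a real receiver would get from the consumer.

```python
import queue
import threading
import zlib

NUM_WORKERS = 4
STOP = object()  # sentinel telling a worker to exit


def worker_index(key: str, num_workers: int = NUM_WORKERS) -> int:
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_workers


def worker(inbox: queue.Queue, processed: list) -> None:
    while True:
        item = inbox.get()
        if item is STOP:
            break
        processed.append(item)  # stand-in for real processing


def dispatch(messages):
    inboxes = [queue.Queue(maxsize=100) for _ in range(NUM_WORKERS)]
    results = [[] for _ in range(NUM_WORKERS)]
    threads = [
        threading.Thread(target=worker, args=(inboxes[i], results[i]))
        for i in range(NUM_WORKERS)
    ]
    for t in threads:
        t.start()
    # Single receiver: hands each message to the worker that owns its key.
    for key, seq in messages:
        inboxes[worker_index(key)].put((key, seq))
    for inbox in inboxes:
        inbox.put(STOP)
    for t in threads:
        t.join()
    return results


results = dispatch([(f"key-{n % 7}", n) for n in range(50)])
```

Because only one thread feeds each blocking queue and only one worker drains it, per-key order is whatever order the receiver saw.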
The questions are:
Does increasing ioThreads and listenerThreads (default = 1) have an impact on performance, i.e. should we expect to see more messages flowing through or will I/O always be the limiting factor?
If we increase them, are we still guaranteed ordering?
The Pulsar documentation is not clear...
Does increasing ioThreads and listenerThreads (default = 1) have an impact on performance, i.e. should we expect to see more messages flowing through or will I/O always be the limiting factor?
It might, depending on various factors.
IoThreads: this is the thread pool used to manage the TCP connections with brokers. If you're producing/consuming across many topics, you'll most likely be interacting with multiple brokers and thus have multiple TCP connections opened. Increasing the ioThreads count might remove the "single thread bottleneck", though it would only be effective if such bottleneck is indeed present (most of the time it will not be the case...). You can check the CPU utilization in your consumer process, across all threads, to see if there's any thread approaching 100% (of a single CPU core).
ListenerThreads: this is the size of the thread pool used when you attach a message listener to the consumer. Typically this is the thread pool the application uses to process the messages (unless processing hops to a different thread). It might make sense to increase the thread count here if the application's processing is reaching the limit of 1 CPU core.
If we increase them, are we still guaranteed ordering?
Yes.
IO threads: 1 TCP connection is always mapped to 1 IO thread
ListenerThreads: 1 Consumer is assigned to 1 listener thread
You may also want to look at using the new key-shared subscription type that was introduced in Pulsar 2.4. Per the documentation,
Messages are delivered in a distribution across consumers and message with same key or same ordering key are delivered to only one consumer.

Single Producer, Multiple Consumers with a few unusual twists

I have a (Posix) server that acts as a proxy for many clients to another upstream server. Messages typically flow down from the upstream server, are then matched against, and pushed out to some subset of the clients interested in that traffic (maintaining the FIFO order from the upstream server). Currently, this proxy server is single threaded using an event loop (e.g. - select, epoll, etc.), but now I'd like to make it multithreaded so that the proxy can more fully utilize an entire machine and achieve much higher throughput.
My high-level design is to have a pool of N worker pthreads (where N is some small multiple of the number of cores on the machine), each running its own event loop. Each client connection will be assigned to a specific worker thread, which will then be responsible for servicing all of that client's I/O and timeout needs for the duration of the connection. I also intend to have a single dedicated thread that pulls in the messages from the upstream server. Once a message is read in, its contents can be considered constant / unchanging until it is no longer needed and reclaimed. The workers never alter the message contents -- they just pass them along to their clients as needed.
My first question is: should the matching of client interests preferably be done by the producer thread or the worker threads?
In the former approach, for each worker thread, the producer could check the interests (e.g. - group membership) of the worker's clients. If the message matched any clients, then it could push the message onto a dedicated queue for that worker. This approach requires some kind of synchronization between the producer and each worker about their client's rarely changing interests.
In the latter approach, the producer just pushes every message onto some kind of queue shared by all of the worker threads. Then each worker thread checks ALL of the messages for a match against their clients' interests and processes each message that matches. This is a twist on the usual SPMC problem where a consumer is usually assumed to unilaterally take an element for themselves, rather than all consumers needing to do some processing on every element. This approach distributes the matching work across multiple threads, which seems desirable, but I worry it may cause more contention between the threads depending on how we implement their synchronization.
In both approaches, when a message is no longer needed by any worker thread, it then needs to be reclaimed. So, some tracking needs to be done to know when no worker thread needs a message any longer.
My second question is: what is a good way of tracking whether a message is still needed by any of the worker threads?
A simple way to do this would be to assign to each message a count of how many worker threads still need to process the message when it is first produced. Then, when each worker is done processing a message it would decrement the count in a thread-safe manner and if/when the count went to zero we would know it could be reclaimed.
Another way to do this would be to assign 64-bit sequence numbers to the messages as they come in; each thread would then record the highest sequence number up through which it has processed. We could then reclaim all messages with sequence numbers less than or equal to the minimum processed sequence number across all of the worker threads.
The latter approach seems like it could more easily allow for a lazy reclamation process with less cross-thread synchronization. That is, you could have a "clean-up" thread that runs only periodically and computes the minimum across the worker threads. For example, if we assume that reads and writes of a 64-bit integer are atomic and a worker's fully processed sequence number is always monotonically increasing, then the "clean-up" thread can just periodically read the workers' fully processed counts (maybe with some memory barrier) and compute the minimum.
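A rough sketch of the sequence-number approach, to make the bookkeeping concrete. It is written in Python and is single-threaded on purpose, so it shows only the watermark logic and none of the atomics or memory barriers a pthreads version would need. MessageStore and its method names are invented for the illustration.

```python
from collections import OrderedDict


class MessageStore:
    """Holds messages by sequence number until every worker has passed them."""

    def __init__(self, num_workers: int):
        self._messages = OrderedDict()  # seq -> payload, in arrival order
        self._next_seq = 1
        # processed[i] = highest sequence number worker i has fully processed.
        self.processed = [0] * num_workers

    def publish(self, payload) -> int:
        seq = self._next_seq
        self._messages[seq] = payload
        self._next_seq += 1
        return seq

    def mark_processed(self, worker_id: int, seq: int) -> None:
        # Watermarks only ever move forward.
        if seq > self.processed[worker_id]:
            self.processed[worker_id] = seq

    def reclaim(self) -> int:
        """Drop everything at or below the slowest worker's watermark."""
        low_water = min(self.processed)
        reclaimed = 0
        while self._messages and next(iter(self._messages)) <= low_water:
            self._messages.popitem(last=False)
            reclaimed += 1
        return reclaimed

    def pending(self):
        return list(self._messages)


store = MessageStore(num_workers=3)
for i in range(5):
    store.publish(f"m{i}")
store.mark_processed(0, 5)
store.mark_processed(1, 3)  # slowest worker
store.mark_processed(2, 4)
freed = store.reclaim()
```

Here the slowest worker has finished message 3, so messages 1 to 3 are reclaimed and 4 and 5 stay. A stalled worker holds back reclamation for everyone, which is the main cost of this scheme compared with per-message counters.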
Third question: what is the best way for workers to realize that they have new work to do in their queue(s)?
Each worker thread is going to be managing its own event loop of client file descriptors and timeouts. Is it best for each worker thread to just have their own pipe to which signal data can be written by the producer to poke them into action? Or should they just periodically check their queue(s) for new work? Are there better ways to do this?
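For the pipe option, a minimal sketch of the usual "self-pipe" wake-up might look like this. It uses Python on a POSIX system instead of C, and WorkerInbox is a made-up name. The read end of the pipe would be added to the worker's existing select/epoll set alongside the client descriptors.

```python
import os
import queue
import select


class WorkerInbox:
    """A work queue paired with a pipe that an event loop can select() on."""

    def __init__(self):
        self.items = queue.Queue()
        self.read_fd, self.write_fd = os.pipe()
        os.set_blocking(self.read_fd, False)

    def push(self, message) -> None:
        # Producer side: enqueue first, then poke the worker with one byte.
        self.items.put(message)
        os.write(self.write_fd, b"x")

    def drain(self):
        # Worker side: clear the wake-up bytes, then take everything queued.
        try:
            os.read(self.read_fd, 4096)
        except BlockingIOError:
            pass
        out = []
        while True:
            try:
                out.append(self.items.get_nowait())
            except queue.Empty:
                return out

    def close(self) -> None:
        os.close(self.read_fd)
        os.close(self.write_fd)


inbox = WorkerInbox()
idle, _, _ = select.select([inbox.read_fd], [], [], 0)       # nothing queued yet
inbox.push("msg-1")
inbox.push("msg-2")
ready, _, _ = select.select([inbox.read_fd], [], [], 1.0)    # pipe is readable
drained = inbox.drain()
after, _, _ = select.select([inbox.read_fd], [], [], 0)      # quiet again
inbox.close()
```

One caveat with this sketch: the producer writes one byte per message, so a worker that never drains could eventually fill the pipe and block the producer. Writing a byte only when the queue goes from empty to non-empty avoids that.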
Last question: what kind of data structure and synchronization should I use for the queue(s) between the producer and the consumer?
I'm aware of lock-free data structures but I don't have a good feel for whether they'd be preferable in my situation or if I should instead just go with a simple mutex for operations that affect the queue. Also, in the shared queue approach, I'm not entirely sure how a worker thread should track "where" it is in processing the queue.
Any insights would be greatly appreciated! Thanks!
Based on your problem description, matching of client interests needs to be done for each client for each message anyway, so the work in matching is the same whichever type of thread it occurs in. That suggests the matching should be done in the client threads to improve concurrency. Synchronization overhead should not be a major issue if the "producer" thread ensures the messages are flushed to main memory (technically, "synchronize memory with respect to other threads") before their availability is made known to the other threads, as the client threads can all read the information from main memory simultaneously without synchronizing with each other. The client threads will not be able to modify messages, but they should not need to.
Message reclamation is probably better done by tracking the current message number of each thread rather than by having a message specific counter, as a message specific counter presents a concurrency bottleneck.
I don't think you need formal queueing mechanisms. The "producer" thread can simply keep a volatile variable updated which contains the number of the most recent message that has been flushed to main memory, and the client threads can check the variable when they are free to do work, sleeping if no work is available. (In C or C++, volatile alone does not order memory between threads, so in practice this would be an atomic variable written with release semantics and read with acquire semantics.) You could get more sophisticated on the thread management, but the additional efficiency improvement would likely be minor.
I don't think you need sophisticated data structures for this. You need volatile variables for the number of the latest message that is available for processing and for the number of the most recent message that has been processed by each client thread. You need to flush the messages themselves to main memory. You need some way of finding the messages in main memory from the message number, perhaps using a circular buffer of pointers, or of messages if the messages are all of the same length. You don't really need much else with respect to the data to be communicated between the threads.

Java Multi threading two producers and 1 Consumer Issue

I need to solve a multiple-producer, single-consumer problem.
The restriction is that I have two producers and one consumer. The consumer should start processing only when it has received a notification from both producers; until then it shouldn't do anything. Each producer works independently, though, and can keep on producing. Could you please assist me in doing this?
HSK
Create two blocking queues - one for each producer. The consumer knows about both queues, and tries to take an element from each of them. (It can do that just by taking from one then the other.) When it's got an element from each, it processes the pair, then repeats.
You'll need to consider what you want to happen if one producer is much faster than another though - you probably want the queues to be bounded, and work out what to do if one producer "fills" its queue.
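A small sketch of the two-queue idea, written in Python for brevity (in Java the same shape would use two bounded BlockingQueue instances). The function names, labels and queue size are just for illustration.

```python
import queue
import threading


def producer(out: queue.Queue, label: str, count: int) -> None:
    for n in range(count):
        out.put(f"{label}{n}")  # blocks while the bounded queue is full


def consumer(q_a: queue.Queue, q_b: queue.Queue, pairs: int) -> list:
    results = []
    for _ in range(pairs):
        a = q_a.get()  # waits for producer A
        b = q_b.get()  # then waits for producer B
        results.append((a, b))  # stand-in for processing the pair
    return results


q_a = queue.Queue(maxsize=2)
q_b = queue.Queue(maxsize=2)
threads = [
    threading.Thread(target=producer, args=(q_a, "a", 5)),
    threading.Thread(target=producer, args=(q_b, "b", 5)),
]
for t in threads:
    t.start()
pairs = consumer(q_a, q_b, 5)
for t in threads:
    t.join()
```

With maxsize=2 a fast producer simply blocks once it is two items ahead of the consumer, which is one possible answer to the "what if one producer fills its queue" question.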

Threadpool multi-queue job dispatch algorithm

I'm curious to know if there is a widely accepted solution for managing thread resources in a threadpool given the following scenario/constraints:
1. Incoming jobs are all of the same nature and could be processed by any thread in the pool.
2. Incoming jobs will be 'bucketed' into different queues based on some attribute of the incoming job such that all jobs going to the same bucket/queue MUST be processed serially.
3. Some buckets will be less busy than others at different points during the lifetime of the program.
My question is on the theory behind a threadpool's implementation. What algorithm could be used to efficiently allocate available threads to incoming jobs across all buckets?
Edit: Another design goal would be to eliminate as much latency as possible between a job being enqueued and it being picked up for processing, assuming there are available idle threads.
Edit2: In the case I'm thinking of there are a relatively large number of queues (50-100) which have unpredictable levels of activity, but probably only 25% of them will be active at any given time.
The first (and most costly) solution I can think of is to simply have 1 thread assigned to each queue. While this will ensure incoming requests are picked up immediately, it is obviously inefficient.
The second solution is to combine the queues together based on expected levels of activity so that the number of queues is inline with the number of threads in the pool, allowing one thread to be assigned to each queue. The problem here will be that incoming jobs, which otherwise could be processed in parallel, will be forced to wait on each other.
The third solution is to create the maximum number of queues, one for each set of jobs that must be processed serially, but only allocate threads based on the number of queues we expect to be busy at any given time (which could also be adjusted by the pool at runtime). So this is where my question comes in: Given that we have more queues than threads, how does the pool go about allocating idle threads to incoming jobs in the most efficient way possible?
I would like to know if there is a widely accepted approach. Or if there are different approaches - who makes use of which one? What are the advantages/disadvantages, etc?
Edit3: This might be best expressed in pseudocode.
You should probably eliminate nr. 2 from your specification. All you really need to comply with is that threads take up buckets and process the queues inside the buckets in order. It makes no sense to process a serialized queue with another thread pool or to do some serialization of tasks in parallel. Thus your spec simply becomes that the threads iterate the FIFO in the buckets, and it's up to the pool manager to insert properly constructed buckets. So your bucket will be:
struct task_bucket
{
    void *ctx;      // context relevant data
    fifo_t *queue;  // your fifo
};
Then it's up to you to make the threadpool smart enough to know what to do on each iteration of the queue. For example the ctx can be a function pointer and the queue can contain data for that function, so the worker thread simply calls the function on each iteration with the provided data.
Reflecting the comments:
If the size of the bucket list is known beforehand and isn't likely to change during the lifetime of the program, you'd need to figure out whether that is important to you. You will need some way for the threads to select a bucket to take. The easiest way is to have a FIFO queue that is filled by the manager and emptied by the threads. Classic reader/writer.
Another possibility is a heap. The worker removes the highest priority from the heap and processes the bucket queue. Both removal by the workers and insertion by the manager reorders the heap so that the root node is the highest priority.
Both these strategies assume that the workers throw away the buckets and the manager makes new ones.
If keeping the buckets is important, you run the risk of workers only attending to the last modified task, so the manager will either need to reorder the bucket list or modify priorities of each bucket and the worker iterates looking for the highest priority. It is important that memory of ctx remains relevant while threads are working or threads will have to copy this as well. Workers can simply assign the queue locally and set queue to NULL in the bucket.
ADDED: I now tend to agree that you might start simple and just keep a separate thread for each bucket, and only if this simple solution is understood to have problems you look for something different. And a better solution might depend on what exactly problems the simple one causes.
In any case, I leave my initial answer below, appended with an afterthought.
You can make a special global queue of "job is available in bucket X" signals.
All idle workers would wait on this queue, and when a signal is put into the queue one thread will take it and proceed to the corresponding bucket to process jobs there until the bucket becomes empty.
When an incoming job is submitted into an in-order bucket, check whether a worker thread is already assigned to this bucket. If one is assigned, the new job will eventually be processed by that worker thread, so no signal should be sent. If no worker is assigned, check whether the bucket was empty before the new job arrived. If it was empty, place a signal into the global signal queue that a new job has arrived in this bucket; if it was not empty, such a signal should have been sent already and a worker thread should soon arrive, so do nothing.
ADDED: It occurred to me that my idea above can cause starvation for some jobs if the number of threads is less than the number of "active" buckets and there is a never-ending flow of incoming tasks. If all threads are already busy and a new job arrives in a bucket that is not yet being served, it may take a long time before a thread is freed to work on this new job. So there is a need to check whether there are idle workers and, if not, create a new one... which adds more complexity.
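Here is one way the signal-queue idea could be sketched, in Python. It is a simplification and not the answerer's exact design: the two checks ("is a worker assigned?" and "was the bucket empty?") are collapsed into a single "claimed" set that covers both a queued signal and an attached worker. BucketPool and make_job are hypothetical names, and None is reserved as the shutdown marker, so it cannot be used as a bucket id.

```python
import queue
import threading
from collections import deque


class BucketPool:
    def __init__(self, num_workers: int):
        self._lock = threading.Lock()
        self._jobs = {}        # bucket id -> deque of pending jobs
        self._claimed = set()  # buckets with a signal queued or a worker attached
        self._signals = queue.Queue()
        self._workers = [
            threading.Thread(target=self._run) for _ in range(num_workers)
        ]
        for w in self._workers:
            w.start()

    def submit(self, bucket, job) -> None:
        with self._lock:
            self._jobs.setdefault(bucket, deque()).append(job)
            if bucket not in self._claimed:
                self._claimed.add(bucket)
                self._signals.put(bucket)  # "job is available in bucket X"

    def _run(self) -> None:
        while True:
            bucket = self._signals.get()
            if bucket is None:  # shutdown marker
                return
            while True:
                with self._lock:
                    pending = self._jobs[bucket]
                    if not pending:
                        self._claimed.discard(bucket)  # release the bucket
                        break
                    job = pending.popleft()
                job()  # runs outside the lock; one worker per bucket at a time

    def shutdown(self) -> None:
        for _ in self._workers:
            self._signals.put(None)
        for w in self._workers:
            w.join()


log = {}
log_lock = threading.Lock()


def make_job(bucket, n):
    def job():
        with log_lock:
            log.setdefault(bucket, []).append(n)
    return job


pool = BucketPool(num_workers=3)
for n in range(30):
    bucket = n % 5
    pool.submit(bucket, make_job(bucket, n))
pool.shutdown()
```

A worker stays with a bucket until it is empty, so a permanently busy bucket can monopolize a thread. Capping the number of jobs taken per visit and re-queuing the signal would be one way to soften that.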
Keep it Simple: I'd use 1 thread per queue. Simplicity is worth a lot, and threads are quite cheap. 100 threads won't be an issue on most OS's.
By using a thread per queue, you also get a real scheduler. If a thread blocks (depending on what you're doing), another thread can be scheduled. You won't get deadlock until every single one blocks. The same cannot be said if you use fewer threads: if the queues the threads happen to be servicing block, then even if other queues are "runnable", and even if work in those other queues might unblock the blocked threads, you'll have deadlock.
Now, in particular scenarios, using a threadpool may be worth it. But then you're talking about optimizing a particular system, and the details matter. How expensive are threads? How good is the scheduler? What about blocking? How long are the queues, how frequently updated, etc.
So in general, with just the information that you have around 100 queues, I'd just go for a thread per queue. Yes, there's some overhead: all solutions will have that. A threadpool will introduce synchronization issues and overhead. And the overhead of a limited number of threads is fairly minor. You're mostly talking about around 100MB of address space - not necessarily memory. If you know most queues will be idle, you could further implement an optimization to stop threads on empty queues and start them when needed (but beware of race conditions and thrashing).

Consuming RabbitMQ with Node - limit parallel processing

I've got a RabbitMQ queue that might, at times, hold a considerable amount of data to process.
As far as I understand, using channel.consume will try to force the messages into the Node program, even if it's reaching its RAM limit (and, eventually, crash).
What is the best way to ensure workers get only as many tasks to process as they are capable of handling?
I'm thinking about using a chain of (transform) streams together with channel.get (which gets just one message). If the first stream's buffer is full, we simply stop getting messages.
I believe what you want is to specify the consumer prefetch.
This indicates to RabbitMQ how many messages it should "push" to the consumer at once.
An example is provided here
channel.prefetch(1);
A value of 1 is the lowest limit you can set (0 means no limit) and should give the lowest memory consumption for your Node program. Note that prefetch only takes effect for consumers that acknowledge messages manually, i.e. with noAck set to false.
This is based on my reading of your description. If my understanding is correct, I'd also recommend renaming your question: parallel processing relates more to multiple consumers on a single queue than to a single consumer receiving all the messages.
