Multithreading: several producers + one consumer

I've got the following problem: I have several threads (producers) calculating positions of moving objects, and one thread (consumer) that prints the calculation results. Every thread has its own time scale. The synchronization problem is that the consumer can print results only once all of the producers have calculated a position for the printing moment. In other words, the consumer has to compare its current time with that of each producer and decide whether the results can be printed yet. I found a similar example where synchronization was done with a semaphore, but it had only one producer. Does anyone know a smart solution?

Consumer loop:
  wait n times
  collect data
  do its thing
  signal all n producers

Producer loop (n in parallel):
  do its thing
  make data available
  signal to consumer
  wait

(Sorry, I don't know anything about Qt, so this is just the general algorithm.)
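For concreteness, here is a minimal sketch of that rendezvous using C++20 semaphores rather than anything Qt-specific; the thread count, step count, and the position field are invented for illustration:

#include <array>
#include <cstdio>
#include <semaphore>
#include <thread>

constexpr int kProducers = 3;  // assumed number of producer threads
constexpr int kSteps = 5;      // assumed number of printing moments

struct Slot {
    std::binary_semaphore dataReady{0};  // producer -> consumer: result published
    std::binary_semaphore resume{0};     // consumer -> producer: compute next step
    double position{0.0};                // the published result
};

std::array<Slot, kProducers> slots;

void producer(int id) {
    for (int step = 0; step < kSteps; ++step) {
        slots[id].position = id * 100.0 + step;  // "do its thing" (stand-in calculation)
        slots[id].dataReady.release();           // make data available + signal consumer
        slots[id].resume.acquire();              // wait for the consumer's go-ahead
    }
}

void consumer() {
    for (int step = 0; step < kSteps; ++step) {
        for (auto& s : slots) s.dataReady.acquire();             // wait n times
        for (auto& s : slots) std::printf("%.1f ", s.position);  // collect data, do its thing
        std::printf("\n");
        for (auto& s : slots) s.resume.release();                // signal all n producers
    }
}

int main() {
    std::thread c(consumer);
    std::array<std::thread, kProducers> ps;
    for (int i = 0; i < kProducers; ++i) ps[i] = std::thread(producer, i);
    for (auto& p : ps) p.join();
    c.join();
}

One binary semaphore pair per producer avoids the classic pitfall of a single counting semaphore, where one fast producer could be counted twice for the same round while a slow one hasn't published yet.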
EDIT: If the producers have a buffer rather than waiting to synchronise, then you can do this:

Consumer loop:
  wait, then check all buffers; repeat while any buffer is empty
  collect data; if any buffer was full, signal that producer
  do its thing

Producer loop (n in parallel):
  do its thing
  wait if buffer is full
  queue data
  signal to consumer
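And a sketch of the buffered variant, again a generic C++ illustration (the bounded-buffer class and its capacity are assumptions, not part of the question):

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// One bounded buffer per producer; the consumer polls all of them.
template <typename T>
class BoundedBuffer {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable notFull_;
    std::condition_variable notEmpty_;
    const std::size_t cap_;
public:
    explicit BoundedBuffer(std::size_t cap) : cap_(cap) {}

    void push(T v) {  // producer: "wait if buffer is full", then "queue data"
        std::unique_lock<std::mutex> lk(m_);
        notFull_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push(std::move(v));
        notEmpty_.notify_one();  // "signal to consumer"
    }

    bool tryPop(T& out) {  // consumer: non-blocking check of one buffer
        std::lock_guard<std::mutex> lk(m_);
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        notFull_.notify_one();  // "if buffer was full, signal that producer"
        return true;
    }
};

The consumer calls tryPop on every producer's buffer and only proceeds with printing once each buffer has yielded a value for the current time step.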

Related

Why does the Disruptor hold lots of data when the producer is much faster than the consumer?

I'm learning about the LMAX Disruptor and have a problem: when I have a very large ring buffer, like 1024 slots, and my producer is much faster than my consumer, the ring buffer holds lots of data but does not publish the events until my application ends. This means my application loses lots of data (it is not a daemon).
I've tried slowing down the producer, which works, but I can't use that approach in my application because it would reduce performance greatly.
val ringBufferSize = 1024
val disruptor = new Disruptor[util.Map[String, Object]](new MessageEventFactory, ringBufferSize, new MessageThreadFactory, ProducerType.MULTI, new BlockingWaitStrategy)
disruptor.handleEventsWith(new MessageEventHandler(batchSize, this))
disruptor.setDefaultExceptionHandler(new MessageExceptionHandler)
val ringBuffer = disruptor.start
val producer = new MessageEventProducer(ringBuffer)

part.foreach { row =>
  // Thread.sleep(2000)
  accm.add(1)
  producer.onData(row)
  // flush(row)
}
I want a way to control the batch size of the disruptor myself. Is there any method to consume the rest of the data still held when my application ends?
If you let your application end abruptly, your consumers will end abruptly too, of course. There is no need to slow down the producer; you simply need to block your application from exiting until all consumers (i.e. event handlers) have finished working on the outstanding events.
The normal way to do this is to invoke Disruptor.shutdown() on the main thread, thus blocking the application from exiting until Disruptor.shutdown() has returned.
In your code snippet above, you'd add that call after the part.foreach statement, before the routine returns; it blocks until all events have been handled to completion.
The Disruptor excels mainly at buffering (smoothing out) bursts of data coming from a single (extremely fast) producer thread or multiple (still pretty fast) producer threads, feeding that data to consumers which perform in a predictable manner, and eliminating as much latency and lock-contention overhead as possible. If your producers are in fact much faster than your consumers, you may find that simply invoking the consumer code from within your lambda yields similar or better results - unless you use advanced techniques such as batching, or set up the Disruptor to run multiple instances of the same consumer in parallel threads, which requires modifying the event handler implementation (see the Disruptor FAQ).
In your example, it seems that all you are trying to accomplish is to feed an already available set of data (your "part" collection) into a single event handler (MessageEventHandler). In such a use case, you might be better off with something like part.stream().parallel().forEach(... messageEventHandler.onEvent(event) ...)

Disruptor park/halt several EventHandlers when exception occurs

We have run into a high CPU usage situation when one of our EventHandlers broke.
Let's say we have several consumers (EventHandlers) that are configured to run sequentially over the buffer. If the first EventHandler throws an exception, is there a way to halt all the other EventHandlers (and wake them up later)?
What we are doing is putting the failing thread to sleep, and afterwards we try to consume the same event again. But we have noticed that the other threads keep running and trying to read from the RingBuffer even when there are no events to read, raising the CPU beyond acceptable levels.
For the moment I'm ruling out the disruptor's WaitStrategy as the cause, because under normal conditions it works as expected. We are using a BlockingWaitStrategy there.
Some more explanations for the sake of the example
INPUT -> [A*] -> [B] -> [C] -> [D]
Where INPUT is the event polled from the RingBuffer and A, B, C and D are the different EventHandlers that are executing sequentially. A* is the consumer throwing an exception.
What we want to achieve is that when consumer A cannot consume an event (e.g. after an exception happens), the onEvent(...) method of that consumer does not exit but stays in a loop, sleeping regularly and retrying the same event when it wakes up. Meanwhile all the other consumers should be parked or sleeping until A succeeds.
We are using disruptor version 3.3.0.
I have been googling but haven't found a working solution.
Thanks in advance.
Salva.
A colleague has found out that this issue could be related to the while loop in the waitFor method of BlockingWaitStrategy:
long availableSequence;
while ((availableSequence = dependentSequence.get()) < sequence) {
    barrier.checkAlert();
}
After several tests we have come across this possible solution:
var availableSequence: Long = dependentSequence.get()
while (availableSequence < sequence) {
  this.lock.lock()
  this.lock.unlock()
  availableSequence = dependentSequence.get()
}
availableSequence
Basically it makes one thread lock the resource, and with that we momentarily park all the other consumers, avoiding the high CPU usage.
The second point here is the while condition. It only fires when the available sequence (that is, the sequence of the dependent threads) is below the current sequence number, and that only happens when one thread is holding the lock, for example when A throws the exception.
We are still investigating whether this is a valid solution, or whether it can have some undesired side effects.
Any thought about it is welcome.

Haskell Thread Communication Pattern Scenario

You have two threads, A and B. Thread A is in a forever loop, listening on blocking socket 1. Thread B is also in a forever loop, listening on blocking socket 2. Both sockets may return data at arbitrary times, so thread A may sleep forever waiting for data, whereas thread B constantly gets data from its socket and goes on with its processing. That's the background.
Now suppose they need to share a dictionary. When thread A gets some data (if ever), it adds a key-value pair to the dictionary after some processing, and then continues to wait for more data. When thread B receives data from its socket, it first queries the dictionary to see if there is information related to the data it has received, before going on with its processing. There are no deletions from the dictionary, only inserts and queries (I'd be interested if this makes a difference to the end solution).
In a standard imperative language like Python or C, this is pretty easy to do by making the dictionary available in both scopes and only querying it after a thread has acquired a lock, so thread B always sees the most (well, almost) up-to-date dictionary.
In Haskell, I seem to be struggling to come up with a good implementation of this pattern. An MVar can only hold one item at a time, so thread A can't simply keep putting in dictionaries: a new update might occur, and it would not be able to push the new dictionary until thread B had fetched the old one from the MVar. On the other hand, if thread B uses an MVar to send a ready signal "ok!" to thread A, it may be the case that thread A is sleeping on its read socket, so it would be unable to send back data until its read socket unblocked! There are also channels, but that seems messy, since I would have to keep sending new dictionaries and thread B would discard all but the last one.
The alternative solution that would work is to simply send the updates down a channel and have thread B construct the dictionary for itself. However, I'm wondering if there are better solutions.
Thanks for taking the time to read this very long question!
You can use an MVar in the following way:
When thread A gets new data, it takes the dictionary with takeMVar. When that succeeds, it updates the dictionary and puts it back into the MVar.
When thread B gets data, it also takes the dictionary with takeMVar - in the above scenario, where A seldom gets data, that should succeed rather quickly on average. Then it does the lookup and puts the dictionary back.
As hammar pointed out, it's probably better not to use takeMVar and putMVar directly, but to wrap them in modifyMVar_ (or modifyMVar) so that the MVar is not left empty if one thread gets an exception while using the dictionary.
In thread A, something like
modifyMVar_ mvar (\dict -> return (insert key value dict))
(note that the function passed to modifyMVar_ simply returns the new dictionary; modifyMVar_ itself takes care of the putMVar). In thread B all you need is a simple readMVar (thanks to @hammar again for pointing that out).

multithreading: how to process data in a vector, while the vector is being populated?

I have a single-threaded Linux app which I would like to make parallel. It reads a data file, creates objects, and places them in a vector. Then it calls a compute-intensive method (0.5 s+) on each object. I want to call the method in parallel with object creation. While I've looked at Qt and TBB, I am open to other options.
I planned to start the thread(s) while the vector was empty. Each one would call makeSolids (below), which has a while loop that runs until interpDone==true and all objects in the vector have been processed. However, I'm a n00b when it comes to threading, and I've been looking for a ready-made solution.
QtConcurrent::map(Iter begin, Iter end, function()) looks very easy, but I can't use it on a vector that's changing in size, can I? And how would I tell it to wait for more data?
I also looked at Intel's TBB, but it looked like my main thread would halt if I used parallel_for or parallel_while. That stinks, since their memory manager was recommended (Open CASCADE's mmgt has poor performance when multithreaded).
/** Intended to be called by a thread.
 *  \param start the first item to get from the vector
 *  \param incr  how many to skip over (4 for 4 threads)
 */
void g2m::makeSolids(uint start, uint incr) {
  uint curr = start;
  while ((!interpDone) || (lineVector.size() > curr)) {
    if (lineVector.size() > curr) {
      if (lineVector[curr]->isMotion()) {
        ((canonMotion*)lineVector[curr])->setSolidMode(SWEPT);
        ((canonMotion*)lineVector[curr])->computeSolid();
      }
      lineVector[curr]->setDispMode(BEST);
      lineVector[curr]->display();
      curr += incr;
    } else {
      uio::sleep();  // wait a little bit for the interpreter to catch up
    }
  }
}
EDIT: To summarize, what's the simplest way to process a vector at the same time that the main thread is populating the vector?
Firstly, to benefit from threading you need to find similarly slow tasks for each thread to do. You said your per-object processing takes .5s+; how long does your file reading / object creation take? It could easily be a tenth or a thousandth of that time, in which case your multithreading approach is going to produce negligible benefit. If that's the case (yes, I'll answer your original question soon in case it's not), then think about simultaneously processing multiple objects. Given that your processing takes quite a while, the thread creation overhead isn't terribly significant, so you could simply have your main file-reading/object-creation thread spawn a new thread and point it at the newly created object. The main thread then continues reading/creating subsequent objects. Once all objects are read/created and all the processing threads are launched, the main thread "joins" (waits for) the worker threads. If this would create too many threads (thousands), then put a limit on how far ahead the main thread is allowed to get: it might read/create 10 objects then join 5, then read/create 10, join 10, read/create 10, join 10, etc. until finished.
Now, if you really want the read/create to be in parallel with the processing, but the processing to be serialised, then you can still use the above approach but join after each object. That's kind of weird if you're designing this with only this approach in mind, but good because you can easily experiment with the object processing parallelism above as well.
Alternatively, you can use a more complex approach that involves just the main thread (the one the OS creates when your program starts) and a single worker thread that the main thread must start. They should be coordinated using a mutex (a primitive ensuring mutually exclusive, i.e. non-concurrent, access to data) and a condition variable, which lets the worker thread block efficiently until the main thread has provided more work. The terms - mutex and condition variable - are the standard terms in the POSIX threading that Linux uses, so they should appear in the explanation of whichever libraries you're interested in. In summary, the worker thread waits until the main read/create thread broadcasts a wake-up signal indicating that another object is ready for processing. You may want a counter holding the index of the last fully created, ready-for-processing object, so the worker thread can maintain its own count of processed objects and work through the ready ones before once again checking the condition variable.
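A minimal sketch of that coordination (the names items and done are invented stand-ins for the question's lineVector and interpDone):

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

struct Object { int id; };
void process(Object*) { /* the slow, compute-intensive call */ }

std::mutex m;
std::condition_variable cv;
std::vector<Object*> items;  // stand-in for lineVector
bool done = false;           // stand-in for interpDone

// Main thread: append a new object and wake the worker.
void publish(Object* obj) {
    { std::lock_guard<std::mutex> lk(m); items.push_back(obj); }
    cv.notify_one();
}

// Main thread, once the file is fully read: let the worker drain and exit.
void finish() {
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_all();
}

// Worker thread: block until more work arrives or everything is done.
void worker() {
    std::size_t next = 0;
    for (;;) {
        Object* obj = nullptr;
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return next < items.size() || done; });
            if (next >= items.size()) return;  // done and fully drained
            obj = items[next++];
        }
        process(obj);  // run outside the lock so the reader is never blocked
    }
}

The reader calls publish for each object as it is created and finish once the file is exhausted; the slow work runs outside the lock, so the reader and worker only contend for the brief bookkeeping sections.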
It's hard to tell if you have been thinking about this problem deeply and there is more than you are letting on, or if you are just over thinking it, or if you are just wary of threading.
Reading the file and creating the objects is fast; the one method is slow. The dependency is each consecutive ctor depends on the outcome of the previous ctor - a little odd - but otherwise there are no data integrity issues so there doesn't seem to be anything that needs to be protected by mutexes and such.
Why is this more complicated than something like this (in crude pseudo-code):
while (!eof)
{
    readfile;
    object O(data);
    push_back(O);
    pthread_create(...., O, makeSolid);
}

while (x < vector.size())
{
    pthread_join();
    x++;
}
If you don't want to loop on the joins in your main thread, then spawn off a thread to wait on them by passing it a vector of TIDs.
If the number of created objects/threads is insane, use a thread pool. Or put a counter in the creation loop to limit the number of threads that can be created before running ones are joined.
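For reference, a hedged, runnable version of that scheme using std::thread rather than raw pthreads; the cap of 8 outstanding threads and the Object/makeSolid definitions are invented for illustration:

#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Object { int id; };
void makeSolid(Object&) { /* the slow per-object method */ }

int main() {
    const std::size_t kMaxOutstanding = 8;  // assumed cap on active worker threads
    std::vector<Object> objects;
    objects.reserve(100);  // reserve up front so references stay valid while threads run
    std::vector<std::thread> threads;
    std::size_t joined = 0;

    for (int i = 0; i < 100; ++i) {  // stand-in for the read-file/create-object loop
        objects.push_back(Object{i});
        threads.emplace_back(makeSolid, std::ref(objects.back()));
        while (threads.size() - joined > kMaxOutstanding)  // throttle: join oldest first
            threads[joined++].join();
    }
    while (joined < threads.size())  // join the stragglers
        threads[joined++].join();
}

The reserve call matters: without it, push_back could reallocate the vector and invalidate the references that running threads hold.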
@Caleb: quite - perhaps I should have emphasized active threads. The GUI thread should always be counted as one.

Maintaining Order in a Multi-Threaded Pipeline

I'm considering a multi-threaded architecture for a processing pipeline. My main processing module has an input queue, from which it receives data packets. It then performs transformations on these packets (decryption, etc.) and places them into an output queue.
The threading comes in where many input packets can have their contents transformed independently from one another.
However, the punchline is that the output queue must have the same ordering as the input queue (i.e., the first pulled off the input queue must be the first pushed onto the output queue, regardless of whether its transformations finished first.)
Naturally, there will be some kind of synchronisation at the output queue, so my question is: what would be the best way of ensuring that this ordering is maintained?
Have a single thread read the input queue, post a placeholder on the output queue, and then hand the item over to a worker thread to process. When the data is ready the worker thread updates the placeholder. When the thread that needs the value from the output queue reads the placeholder it can then block until the associated data is ready.
Because only a single thread reads the input queue, and this thread immediately puts the placeholder on the output queue, the order in the output queue is the same as that in the input. The worker threads can be numerous, and can do the transformations in any order.
On platforms that support futures, they are ideal as the placeholder. On other systems you can use an event, monitor or condition variable.
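For example, with C++11 futures as the placeholders (Packet, transform and the single-reader dispatch function are assumptions; a real implementation would also lock the shared queue):

#include <future>
#include <queue>
#include <utility>

struct Packet { int data; };
Packet transform(Packet p) { p.data += 1; return p; }  // stand-in for decryption etc.

std::queue<std::future<Packet>> outputQueue;  // guard with a mutex in real code

// Single reader thread: the future goes into the output queue immediately,
// so output order always matches input order, even if a later packet's
// transformation finishes first.
void dispatch(Packet in) {
    outputQueue.push(std::async(std::launch::async, transform, in));
}

// Consumer: pop placeholders in order; get() blocks until the worker is done.
Packet nextResult() {
    std::future<Packet> f = std::move(outputQueue.front());
    outputQueue.pop();
    return f.get();
}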
With the following assumptions:
- there is one input queue, one output queue and one working queue
- there is only one input queue listener
- each output message contains a wait handle and a pointer to the worker/output data
- there may be an arbitrary number of worker threads

I would consider the following flow:

The input queue listener does these steps:
- extracts an input message
- creates an output message: initializes the worker data struct and resets the wait handle
- enqueues a pointer to the output message into the working queue
- enqueues a pointer to the output message into the output queue

A worker thread does the following:
- waits on the working queue to extract a pointer to an output message from it
- processes the message based on the given data and sets the event when done

The consumer does the following:
- waits on the output queue to extract a pointer to an output message from it
- waits on the handle until the output data is ready
- does something with the data
That's going to be implementation-specific. One general solution is to number the input items and preserve the numbering so you can later sort the output items. This could be done once the output queue is filled, or it could be done as part of filling it. In other words, you could insert them into their proper position and only allow the queue to be read when the next available item is sequential.
edit
I'm going to sketch out a basic scheme, trying to keep it simple by using the appropriate primitives:
Instead of queueing a Packet into the input queue, we create a future value around it and enqueue that into both the input and output queues. In C#, you could write it like this:
var future = new Lazy<Packet>(delegate() { return Process(packet); }, LazyThreadSafetyMode.ExecutionAndPublication);
A thread from the pool of workers dequeues a future from the input queue and executes future.Value, which causes the delegate to run on demand and returns once the delegate has finished processing the packet.
One or more consumers dequeue a future from the output queue. Whenever they need the value of the packet, they call future.Value, which returns immediately if a worker thread has already run the delegate.
Simple, but works.
If you are using a windowed approach (a known number of elements in flight), use an array for the output queue - for example in media streaming, where you discard packets that weren't processed quickly enough.
Otherwise, use a priority queue (a special kind of heap, often implemented on top of a fixed-size array) for the output items.
You need to add a sequence number, or some other datum you can sort the items on, to each data packet. A priority queue is a tree-like structure that maintains the ordering of items on insert/pop.
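To make the sequence-number idea concrete, here is a small reorder-buffer sketch in C++ (all names invented; callers must synchronise access to it): items may complete in any order, but are emitted strictly in sequence:

#include <cstdint>
#include <map>
#include <utility>

template <typename T>
class Reorderer {
    std::map<std::uint64_t, T> pending_;  // completed items waiting for their turn
    std::uint64_t next_ = 0;              // next sequence number to release
public:
    // Called when a worker finishes item `seq`; emit() receives items in order.
    template <typename Emit>
    void complete(std::uint64_t seq, T item, Emit emit) {
        pending_.emplace(seq, std::move(item));
        for (auto it = pending_.find(next_); it != pending_.end();
             it = pending_.find(next_)) {
            emit(std::move(it->second));  // release the next consecutive item
            pending_.erase(it);
            ++next_;
        }
    }
};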
