Parallel Game Of Life - Information exchange between threads - multithreading

I am trying to implement a parallel version of 'Game Of Life'.
This parallel version divides the game's board into regions, each governed by a single thread which is responsible for calculating that region's next state and for conducting the state update afterwards.
One of the constraints I am facing here is the fact that - "Each thread is allowed to access only its own region cells. All other information should be communicated from the neighboring threads by some other memory".
So, the way I understand this, even if a thread only wants to read a cell outside its own region, it must somehow request that cell's state from the thread which owns it.
We are encouraged to consider a producer/consumer solution for this task, so I have considered using a public static producer/consumer queue into which state requests would be enqueued, but some related issues are still unclear to me:
If thread A is in the middle of a job, how can I ask it to pause its work, serve thread B's information request, and then resume its previous job? Is that even possible?
Which thread is responsible for this queue? A dedicated thread that manages the queue alongside the regular region threads? I am not sure.

The easiest solution is to imagine there are multiple steps in each round.
Let's say there are N threads.
step 1: each thread makes a list of the cells it needs to discover. It puts each "question" into one of the N queues (one queue per thread).
wait for all the threads to finish
step 2: each thread fills in the responses for the questions in its own queue
wait for all the threads to finish
step 3: each thread computes the new state of its region
wait for all the threads to finish
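A minimal C++ sketch of this round structure, assuming a C++20 std::barrier and one mailbox per thread; CellRequest, Mailbox and region_worker are illustrative names, and the actual Game of Life logic is only indicated in comments:

#include <barrier>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One "question" posted by a thread that needs a foreign cell's state.
struct CellRequest { int asker, x, y; };

// One mailbox per thread; neighbouring threads enqueue their questions here.
struct Mailbox {
    std::mutex m;
    std::queue<CellRequest> questions;
};

void region_worker(int id, int generations,
                   std::barrier<>& sync, std::vector<Mailbox>& mail) {
    for (int g = 0; g < generations; ++g) {
        // step 1: enqueue questions about border cells owned by other threads,
        // e.g. lock mail[owner].m and push {id, x, y} onto mail[owner].questions
        sync.arrive_and_wait();

        // step 2: drain my own mailbox and publish the answers (e.g. into a
        // per-asker reply buffer), reading only cells that belong to me
        sync.arrive_and_wait();

        // step 3: compute my region's next state from my own cells plus the replies
        sync.arrive_and_wait();
    }
}

int main() {
    const int N = 4, generations = 10;
    std::barrier<> sync(N);
    std::vector<Mailbox> mail(N);
    std::vector<std::thread> pool;
    for (int i = 0; i < N; ++i)
        pool.emplace_back(region_worker, i, generations, std::ref(sync), std::ref(mail));
    for (auto& t : pool) t.join();
}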

Related

Worker threads, jobs which have a particular "colour", only one job of each colour must be running, what synchronization primitive do I need?

I have N worker threads, and a queue of jobs. When a worker thread is idle, it picks a job from the queue and starts running it. So far, so easy.
However my jobs have a property which I'll call "colour". Two jobs of the same colour must never be running at the same time.
For example, say there are 2 threads, one is running a red job, and the other is idle. The idle thread looks at the queue, but it must not pick a red job (since if it did there would be two red jobs running which is not allowed). If there's, say, a blue job in the queue, it could run it. If there are only red jobs in the queue, then the idle thread must wait until the other thread finishes. Then both threads are idle and they must pick differently coloured jobs to run, and if by that time there are still only red jobs in the queue, one would have to stay idle.
My question is what synchronization primitives I should use here. I thought about grouping the queue into colours, and I can attach a mutex to each colour group, but then I'm stuck on how to use those mutexes. (The actual program is written in OCaml using pthreads so it has access to the usual pthread primitives and ones built on top of pthreads).
This sounds like a rather unusual case, but it's related to a real world problem: I'm writing a dependency runner (think: "make"). It must never run two recipes ("jobs") in parallel if both recipes target the same output file ("colour").
There are a few "obvious" solutions.
You could have a "red" thread and a "blue" thread, servicing separate "red" and "blue" queues. Since you never want to have more than one "red" job in progress, having a thread pool doesn't buy you anything.
If for some reason you still want to multiplex "red" jobs on a thread pool, you could still use separate "red" and "blue" queues, plus a boolean flag per queue indicating whether a job of the corresponding colour is in progress. An idle thread will then iterate over all non-empty queues and pick the first one that has in_progress == false.
You could have a single queue combined with a set of boolean flags. An idle thread will then iterate over the queue until it finds a job for which in_progress[color] == false and pick that.
My question is what synchronization primitives I should use here.
There is a variety of synchronization objects you could use to protect the queue and any other shared objects against data races. Mutexes and semaphores are prominent among these, but there are others, too.
However, since one of the possibilities is that a thread must suspend operation until a condition is satisfied (a task of a new colour becomes available, or a task is completed) you need a condition variable. Condition variables are necessarily used together with mutexes, with the natural idiom being that the mutex protects (at least) the shared data that must be accessed to evaluate the condition. Thus, "one mutex and one condition variable" is a good answer to the question posed.
But you seem also uncertain about how to structure the queue and how to use the synchronization primitives. That is specific to your particular project, but it is possible to make some general observations and suggestions. Your explanation of the real-world case suggests that colour values are essentially arbitrary, as opposed to being drawn from a small, fixed vocabulary. This rules out dedicating a separate thread to each colour, putting you instead in the situation where at any given time, every idle thread is viable for running each task that is eligible to run.
Schematically, then, every worker thread would do something like this:
1. Lock the mutex.
2. If no enqueued task is eligible to run, then
   a. Wait on the condition variable.
   b. When the thread returns from the wait, go back to (2).
3. (If control reaches this point then there is an eligible task on the queue.) Dequeue an eligible task, T, of colour C(T).
4. Update colour-tracking data to show that a task of colour C(T) is running.
5. Unlock the mutex.
6. Run task T.
7. Lock the mutex.
8. Update colour-tracking data to show that there is no longer a task of colour C(T) running.
9. Broadcast to the condition variable.
10. Go back to (2).
I presume you will want to include a termination condition to allow your threads to exit cleanly. That would probably be evaluated as part of step (2). Make sure that after breaking out of the loop, threads unlock the mutex.
Note also that a thread that wants to enqueue a new task while the workers are running -- whether a worker thread itself or some other -- must hold the mutex while doing so, and should broadcast to the CV after doing so.
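As an illustration only, here is a compact C++ sketch of that schematic loop (the question itself is OCaml on pthreads, where Mutex, Condition.wait and Condition.broadcast map onto the same steps; Job, Scheduler and the member names are invented for the example):

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_set>
#include <utility>

struct Job { std::string colour; std::function<void()> run; };

struct Scheduler {
    std::mutex m;
    std::condition_variable cv;
    std::deque<Job> queue;
    std::unordered_set<std::string> running;   // colours currently in progress
    bool stop = false;

    void submit(Job j) {
        { std::lock_guard<std::mutex> lk(m); queue.push_back(std::move(j)); }
        cv.notify_all();                       // wake workers to re-check eligibility
    }

    void shutdown() {
        { std::lock_guard<std::mutex> lk(m); stop = true; }
        cv.notify_all();
    }

    void worker() {                            // run this in each pool thread
        std::unique_lock<std::mutex> lk(m);    // (1) lock the mutex
        for (;;) {
            auto it = queue.begin();           // (2) look for an eligible task
            while (it != queue.end() && running.count(it->colour)) ++it;
            if (it == queue.end()) {
                if (stop) return;              // termination condition
                cv.wait(lk);                   // (2a/2b) wait, then re-check
                continue;
            }
            Job j = std::move(*it);            // (3) dequeue an eligible task
            queue.erase(it);
            running.insert(j.colour);          // (4) mark its colour as running
            lk.unlock();                       // (5)
            j.run();                           // (6) run the task
            lk.lock();                         // (7)
            running.erase(j.colour);           // (8) colour no longer running
            cv.notify_all();                   // (9) broadcast, then loop to (2)
        }
    }
};

The single mutex protects both the queue and the set of running colours, which is exactly the data a thread must examine to decide whether it may proceed; shutdown() shows one way to handle the termination condition mentioned above.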
I thought about grouping the queue into colours, and I can attach a mutex to each colour group, but then I'm stuck on how to use those mutexes.
Indeed. A mutex per group does not help, because only one mutex can be associated with the CV. That mutex must be held while a thread is evaluating whether the condition for proceeding is satisfied, and while a thread is updating any of the data involved in evaluating that condition. Additional mutexes protecting subsets of that data would be unhelpful, because there could never be more than one thread contending for them (the one holding the main mutex).
It might still be reasonable to add structure to your queue to facilitate the evaluation of which tasks are eligible to run, but you haven't given us enough information to suggest details. (And the question is already plenty broad, so please do not expand it with any such information.)

Vulkan Queue Synchronization in Multithreading

In my application it is imperative that "state" and "graphics" are processed in separate threads. So for example, the "state" thread is only concerned with updating object positions, and the "graphics" thread is only concerned with graphically outputting the current state.
For simplicity, let's say that the entirety of the state data is contained within a single VkBuffer. The "state" thread creates a Compute Pipeline with a Storage Buffer backed by the VkBuffer, and periodically vkCmdDispatchs to update the VkBuffer.
Concurrently, the "graphics" thread creates a Graphics Pipeline with a Uniform Buffer backed by the same VkBuffer, and periodically draws/vkQueuePresentKHRs.
Obviously there must be some sort of synchronization mechanism to prevent the "graphics" thread from reading from the VkBuffer whilst the "state" thread is writing to it.
The only idea I have is to use a host mutex held from vkQueueSubmit until vkWaitForFences in both threads.
I want to know, is there perhaps some other method that is more efficient or is this considered to be OK?
Try using semaphores. They are used to synchronize operations solely on the GPU, which is much more efficient than waiting in the app and submitting work only after previous work is fully processed.
When You submit work You can provide a semaphore which gets signaled when this work is finished. When You submit another work You can provide the same semaphore on which the second batch should wait. Processing of the second batch will start automatically when the semaphore gets signaled (this semaphore is also automatically unsignaled and can be reused).
(I think there are some constraints on using semaphores, associated with queues. I will update the answer later when I confirm this but they should be sufficient for Your purposes.
[EDIT] There are constraints on using semaphores but it shouldn't affect You - when You use a semaphore as a wait semaphore during submission, no other queue can wait on the same semaphore.)
There are also events in Vulkan which can be used for similar purposes but their use is a little bit more complicated.
If You really need to synchronize the GPU and Your application, use fences. They are signaled in a similar way as semaphores, but You can check their state on the app side, and You need to manually unsignal them before You can use them again.
[EDIT]
I've added an image that more or less shows what I think You should do. One thread calculates state and with each submission adds a semaphore to the top of the list (or a ring buffer as #NicolasBolas wrote). This semaphore gets signaled when the submission is finished (it is provided in pSignalSemaphores during "compute" batch submission).
Second thread renders Your scene. It manages its own list of semaphores similarly to the compute thread. But when You want to render things, You need to be sure that the compute thread has finished its calculations. That's why You need to take the latest "compute" semaphore and wait on it (provide it in pWaitSemaphores during "render" batch submission). When You submit rendering commands, the compute thread can't start and modify the data because it may influence the results of the rendering. So the compute thread also needs to wait until the most recent rendering is done. That's why the compute thread also needs to provide a wait semaphore (the most recent "rendering" semaphore).
You just need to synchronize the submissions. The rendering thread cannot submit while the compute thread is submitting commands, and vice versa. That's why adding semaphores to the lists (and taking semaphores from the lists) should be synchronized. But this has nothing to do with Vulkan; some mutex will probably be helpful (for example a C++-ish std::lock_guard<std::mutex>). But this synchronization is a problem only when You have a single buffer.
Another thing is what to do with old semaphores from both lists. You cannot directly check what is their state and You cannot directly unsignal them. The state of semaphores can be checked by using additional fences provided with each submission. You don't wait on them but from time to time check if a given fence is signaled and, if it is, You can destroy old semaphore (as You cannot unsignal it from the application) or You can make an empty submission, with no command buffers, and use that semaphore as a wait semaphore. This way the semaphore will be unsignaled and You can reuse it. But I don't know which solution is more optimal: destroying old and creating new semaphores, or unsignaling them with empty submissions.
When You have a single buffer, a one-element list/ring is probably enough. But a more optimal solution would use some kind of ping-pong set of buffers - You read data from one buffer but store results in another, and in the next step You swap them. That's why in the image above, the lists of semaphores (rings) may have more elements depending on Your setup. The more independent buffers and semaphores in the lists (up to some reasonable count), the better the performance, as You reduce the time wasted on waiting. But this complicates Your code and it may also increase lag (the rendering thread gets data that is a bit older than the data currently processed by the compute thread). So You may need to balance performance, code complexity and rendering lag.
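A rough sketch of how the two submissions could be wired together. All handles (queues, command buffers, semaphores, fences) are assumed to exist already; recording, error handling and the semaphore lists described above are omitted, and both submissions are shown in one function for brevity even though in this scenario each happens on its own thread. In rounds after the first, the compute submission would additionally wait on the previous renderDone semaphore.

#include <vulkan/vulkan.h>

void submit_one_round(VkQueue computeQueue, VkQueue graphicsQueue,
                      VkCommandBuffer computeCmdBuf, VkCommandBuffer graphicsCmdBuf,
                      VkSemaphore computeDone, VkSemaphore renderDone,
                      VkFence computeFence, VkFence graphicsFence)
{
    // "Compute" batch: signals computeDone when the GPU finishes it.
    VkSubmitInfo computeSubmit = {};
    computeSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    computeSubmit.commandBufferCount = 1;
    computeSubmit.pCommandBuffers = &computeCmdBuf;
    computeSubmit.signalSemaphoreCount = 1;
    computeSubmit.pSignalSemaphores = &computeDone;
    vkQueueSubmit(computeQueue, 1, &computeSubmit, computeFence);

    // "Render" batch: the GPU (not the CPU) waits on computeDone before the
    // vertex shader stage runs, and signals renderDone for the next compute batch.
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_SHADER_BIT;
    VkSubmitInfo graphicsSubmit = {};
    graphicsSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    graphicsSubmit.waitSemaphoreCount = 1;
    graphicsSubmit.pWaitSemaphores = &computeDone;
    graphicsSubmit.pWaitDstStageMask = &waitStage;
    graphicsSubmit.commandBufferCount = 1;
    graphicsSubmit.pCommandBuffers = &graphicsCmdBuf;
    graphicsSubmit.signalSemaphoreCount = 1;
    graphicsSubmit.pSignalSemaphores = &renderDone;
    vkQueueSubmit(graphicsQueue, 1, &graphicsSubmit, graphicsFence);
}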
How you do this depends on two factors:
Whether you want to dispatch the compute operation on the same queue as its corresponding graphics operation.
The ratio of compute operations to their corresponding graphics operations.
#2 is the most important part.
Even though they are generated in separate threads, there must be at least some idea that the graphics operation is being fed by a particular compute operation (otherwise, how would the graphics thread know where to read its data from?). So, how do you do that?
At the end of the day, that part has nothing to do with Vulkan. You need to use some inter-thread communication mechanism to allow the graphics thread to ask, "which compute task's data should I be using?"
Typically, this would be done by having the compute thread add every compute operation it does to some kind of circular buffer (thread-safe of course. And non-locking). When the graphics thread goes to decide where to read its data from, it asks the circular buffer for the most recently added compute operation.
In addition to the "where to read its data from" information, this would also provide the graphics thread with an appropriate Vulkan synchronization primitive to use to synchronize its command buffer(s) with the compute operation's CB.
If the compute and graphics operations are being dispatched on the same queue, then this is pretty simple. There doesn't have to actually be a synchronization primitive. So long as the graphics CBs are issued after the compute CBs in the batch, all the graphics CBs need is to have a vkCmdPipelineBarrier at the front which waits on all memory operations from the compute stage.
srcStageMask would be STAGE_COMPUTE_SHADER_BIT, with dstStageMask being, well, pretty much everything (you could narrow it down, but it won't matter, since at the very least your vertex shader stage will need to be there).
You would need a single VkMemoryBarrier in the pipeline barrier. Its srcAccessMask would be SHADER_WRITE_BIT, while the dstAccessMask would be however you intend to read it. If the compute operations wrote some vertex data, you need VERTEX_ATTRIBUTE_READ_BIT. If they wrote some uniform buffer data, you need UNIFORM_READ_BIT. And so on.
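For example, assuming the compute dispatch wrote uniform-buffer data that the vertex shader reads, the barrier recorded near the front of the graphics command buffer might look roughly like this (graphicsCmdBuf is an assumed handle):

#include <vulkan/vulkan.h>

void record_compute_to_graphics_barrier(VkCommandBuffer graphicsCmdBuf)
{
    VkMemoryBarrier barrier = {};
    barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;   // what the compute shader did
    barrier.dstAccessMask = VK_ACCESS_UNIFORM_READ_BIT;   // how the graphics work reads it

    vkCmdPipelineBarrier(graphicsCmdBuf,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // srcStageMask
                         VK_PIPELINE_STAGE_VERTEX_SHADER_BIT,   // dstStageMask (narrow if possible)
                         0,                                     // dependencyFlags
                         1, &barrier,                           // global memory barrier
                         0, nullptr,                            // no buffer memory barriers
                         0, nullptr);                           // no image memory barriers
}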
If you're dispatching these operations on separate queues, that's where you need an actual synchronization object.
There are several problems:
You cannot detect if a Vulkan semaphore has been signaled by user code. Nor can you set a semaphore to the unsignaled state by user code. Nor can you reasonably submit a batch that has a semaphore in it that is currently signaled and nobody's waiting on it. You can do the latter, but it won't do the right thing.
In short, you can never submit a batch that signals a semaphore unless you are certain that some process is going to wait for it.
You cannot issue a batch that waits on a semaphore, unless a batch that signals it is "pending execution". That is, your graphics thread cannot vkQueueSubmit its batch until it is certain that the compute queue has submitted its signaling batch.
So what you have to do is this. When the graphics thread goes to get its compute data, it must signal the compute thread to add a semaphore to its next submit call. When the graphics thread submits its graphics operation, it then waits on that semaphore.
But to ensure proper ordering, the graphics thread cannot submit its operation until the compute thread has submitted the semaphore signaling operation. That requires a CPU-synchronization operation of some form. It could be as simple as the graphics thread polling an atomic variable set by the compute thread.
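A minimal sketch of that CPU-side handshake, assuming the compute thread simply counts its submissions in an atomic (the names and helpers are made up for illustration):

#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical helpers: the compute thread calls mark_submitted(n) right after
// the vkQueueSubmit that signals batch n's semaphore; the graphics thread calls
// wait_for_submission(n) before submitting the batch that waits on that semaphore.
std::atomic<std::uint64_t> compute_submits_done{0};

void mark_submitted(std::uint64_t n) {
    compute_submits_done.store(n, std::memory_order_release);
}

void wait_for_submission(std::uint64_t n) {
    while (compute_submits_done.load(std::memory_order_acquire) < n)
        std::this_thread::yield();   // simple polling; a condition variable would also work
}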

Do we always need 2 threads for thread barrier to work?

I wanted to check whether a thread barrier is the right way to solve a problem in a Spring Integration project where you have to poll the DB continuously 2-3 times at specific time intervals for incoming events, check for a trigger, and eventually time out.
Also, do we always need 2 threads for thread barrier to work? The suspended thread and the trigger thread.
BarrierMessageHandler is based on logic like this:
Message<?> releaseMessage = syncQueue.poll(this.timeout, TimeUnit.MILLISECONDS);
and therefore blocks the current thread.
So, to release that block you definitely need another thread which offers a value for that SynchronousQueue.

should I use memory fences in a supervisor-workers model?

I am building multithreading support for my application.
In my application, it can happen that a worker needs to access the "work field" of another worker to complete its own job. I have tried to make this safe with pthread mutexes, but they turned out to be horribly slow, even when there is only one worker and therefore no contention.
So, I came up with another idea. Let a worker complete the jobs it can, and add the jobs that have the aforementioned problem to its own per-worker queue: when all the workers are done, the main supervisor thread will complete the unfinished jobs, in the hope that they will be orders of magnitude fewer than the jobs completed by the workers.
My question is: should I throw in a memory fence, at the moment that I transfer the execution from the supervisor to the workers and vice-versa?
EDIT:
more details (the code is on github, see pool::collision_wsc()). Each thread reads pointers from various "cells" (each of which is basically a std::vector), and applies some operation to the objects pointed to (collisions between hard spheres).
The point is that a cell interacts with (some of) its neighbours, but some of those cells might be owned by another worker (one sphere might be near the bounds of its cell and collide with a sphere from another cell).
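For reference, here is a minimal C++ sketch of the deferral scheme described above; Job, run_round and the foreign-data test are placeholders rather than the actual code from the repository:

#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Job = std::function<void()>;

// Each worker runs its own jobs and defers the ones that would touch another
// worker's data; the supervisor finishes the leftovers serially after joining.
void run_round(const std::vector<std::vector<Job>>& perWorkerJobs) {
    const std::size_t n = perWorkerJobs.size();
    std::vector<std::vector<Job>> deferred(n);         // one private leftover queue per worker
    std::vector<std::thread> workers;

    for (std::size_t w = 0; w < n; ++w) {
        workers.emplace_back([&, w] {
            for (const Job& job : perWorkerJobs[w]) {
                const bool needsForeignData = false;   // placeholder for the real test
                if (needsForeignData)
                    deferred[w].push_back(job);        // only worker w writes this slot
                else
                    job();
            }
        });
    }
    for (auto& t : workers) t.join();                  // join is the hand-over point

    for (const auto& q : deferred)                     // supervisor completes unfinished jobs
        for (const Job& job : q) job();
}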

Threadpool multi-queue job dispatch algorithm

I'm curious to know if there is a widely accepted solution for managing thread resources in a threadpool given the following scenario/constraints:
Incoming jobs are all of the same nature and could be processed by any thread in the pool.
Incoming jobs will be 'bucketed' into different queues based on some attribute of the incoming job, such that all jobs going to the same bucket/queue MUST be processed serially.
Some buckets will be less busy than others at different points during the lifetime of the program.
My question is on the theory behind a threadpool's implementation. What algorithm could be used to efficiently allocate available threads to incoming jobs across all buckets?
Edit: Another design goal would be to eliminate as much latency as possible between a job being enqueued and it being picked up for processing, assuming there are available idle threads.
Edit2: In the case I'm thinking of there are a relatively large number of queues (50-100) which have unpredictable levels of activity, but probably only 25% of them will be active at any given time.
The first (and most costly) solution I can think of is to simply have 1 thread assigned to each queue. While this will ensure incoming requests are picked up immediately, it is obviously inefficient.
The second solution is to combine the queues together based on expected levels of activity so that the number of queues is in line with the number of threads in the pool, allowing one thread to be assigned to each queue. The problem here will be that incoming jobs, which otherwise could be processed in parallel, will be forced to wait on each other.
The third solution is to create the maximum number of queues, one for each set of jobs that must be processed serially, but only allocate threads based on the number of queues we expect to be busy at any given time (which could also be adjusted by the pool at runtime). So this is where my question comes in: Given that we have more queues than threads, how does the pool go about allocating idle threads to incoming jobs in the most efficient way possible?
I would like to know if there is a widely accepted approach. Or if there are different approaches - who makes use of which one? What are the advantages/disadvantages, etc?
Edit3: This might be best expressed in pseudo code.
You should probably eliminate nr. 2 from your specification. All you really need to comply with is that threads take up buckets and process the queues inside the buckets in order. It makes no sense to process a serialized queue with another threadpool or to do some serialization of tasks in parallel. Thus your spec simply becomes that the threads iterate the fifo in the buckets, and it's up to the pool manager to insert properly constructed buckets. So your bucket will be:
struct task_bucket
{
    void *ctx;     // context-relevant data
    fifo_t *queue; // your fifo
};
Then it's up to you to make the threadpool smart enough to know what to do on each iteration of the queue. For example the ctx can be a function pointer and the queue can contain data for that function, so the worker thread simply calls the function on each iteration with the provided data.
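A rough sketch of that per-bucket iteration, reusing the placeholder fifo_t from the struct above and assuming a hypothetical fifo_pop() that returns the next queued item, or NULL when the queue is empty:

void *fifo_pop(fifo_t *queue);            /* assumed fifo primitive, not a real library call */
typedef void (*task_fn)(void *item);

void process_bucket(struct task_bucket *bucket)
{
    task_fn fn = (task_fn)bucket->ctx;    /* ctx tells the worker what to do with each item */
    for (void *item = fifo_pop(bucket->queue); item != NULL;
         item = fifo_pop(bucket->queue))
        fn(item);                         /* jobs in one bucket run serially, in order */
}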
Reflecting the comments:
If the size of the bucket list is known beforehand and isn't likely to change during the lifetime of the program, you'd need to figure out if that is important to you. You will need some way for the threads to select a bucket to take. The easiest way is to have a FIFO queue that is filled by the manager and emptied by the threads. Classic reader/writer.
Another possibility is a heap. The worker removes the highest priority from the heap and processes the bucket queue. Both removal by the workers and insertion by the manager reorders the heap so that the root node is the highest priority.
Both these strategies assume that the workers throw away the buckets and the manager makes new ones.
If keeping the buckets is important, you run the risk of workers only attending to the last modified task, so the manager will either need to reorder the bucket list or modify priorities of each bucket and the worker iterates looking for the highest priority. It is important that memory of ctx remains relevant while threads are working or threads will have to copy this as well. Workers can simply assign the queue locally and set queue to NULL in the bucket.
ADDED: I now tend to agree that you might start simple and just keep a separate thread for each bucket, and only if this simple solution is understood to have problems you look for something different. And a better solution might depend on what exactly problems the simple one causes.
In any case, I leave my initial answer below, appended with an afterthought.
You can make a special global queue of "job is available in bucket X" signals.
All idle workers would wait on this queue, and when a signal is put into the queue one thread will take it and proceed to the corresponding bucket to process jobs there until the bucket becomes empty.
When an incoming job is submitted into an in-order bucket, it should be checked whether a worker thread is already assigned to this bucket. If one is assigned, the new job will eventually be processed by that worker thread, so no signal should be sent. If no worker is assigned, check whether the bucket is empty or not. If empty, place a signal into the global signal queue that a new job has arrived in this bucket; if not empty, such a signal should have been made already and a worker thread should soon arrive, so do nothing.
ADDED: I got a thought that my idea above can cause starvation for some jobs if the number of threads is less than the number of "active" buckets and there is a never-ending flow of incoming tasks. If all threads are already busy and a new job arrives into a bucket that is not yet served, it may take a long time before a thread is freed to work on this new job. So there is a need to check whether there are idle workers, and if not, create a new one... which adds more complexity.
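A compact C++ sketch of this signal-queue scheme, using a single lock for clarity; shutdown and the starvation concern from the note above are not addressed, and all names are illustrative:

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

struct Bucket {
    std::queue<std::function<void()>> jobs;
    bool workerAssigned = false;          // is some worker currently serving this bucket?
};

struct Dispatcher {
    std::mutex m;
    std::condition_variable cv;
    std::vector<Bucket> buckets;
    std::deque<int> signals;              // "a job is available in bucket X"

    explicit Dispatcher(int nBuckets) : buckets(nBuckets) {}

    void submit(int b, std::function<void()> job) {
        std::lock_guard<std::mutex> lk(m);
        const bool wasEmpty = buckets[b].jobs.empty();
        buckets[b].jobs.push(std::move(job));
        // Signal only if nobody serves this bucket and it was empty before;
        // otherwise a worker is (or will soon be) attending to it already.
        if (!buckets[b].workerAssigned && wasEmpty) {
            signals.push_back(b);
            cv.notify_one();
        }
    }

    void worker() {                       // run by each pool thread
        std::unique_lock<std::mutex> lk(m);
        for (;;) {
            cv.wait(lk, [&] { return !signals.empty(); });
            const int b = signals.front();
            signals.pop_front();
            buckets[b].workerAssigned = true;
            while (!buckets[b].jobs.empty()) {            // drain the bucket, in order
                auto job = std::move(buckets[b].jobs.front());
                buckets[b].jobs.pop();
                lk.unlock();
                job();                                    // run the job outside the lock
                lk.lock();
            }
            buckets[b].workerAssigned = false;
        }
    }
};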
Keep it Simple: I'd use 1 thread per queue. Simplicity is worth a lot, and threads are quite cheap. 100 threads won't be an issue on most OS's.
By using a thread per queue, you also get a real scheduler. If a thread blocks (depends on what you're doing), another thread can be queued. You won't get deadlock until every single one blocks. The same cannot be said if you use fewer threads - if the queues the threads happen to be servicing block, then even if other queues are "runnable" and even if these other queues might unblock the blocked threads, you'll have deadlock.
Now, in particular scenarios, using a threadpool may be worth it. But then you're talking about optimizing a particular system, and the details matter. How expensive are threads? How good is the scheduler? What about blocking? How long are the queues, how frequently updated, etc.
So in general, with just the information that you have around 100 queues, I'd just go for a thread per queue. Yes, there's some overhead: all solutions will have that. A threadpool will introduce synchronization issues and overhead. And the overhead of a limited number of threads is fairly minor. You're mostly talking about around 100MB of address space - not necessarily memory. If you know most queues will be idle, you could further implement an optimization to stop threads on empty queues and start them when needed (but beware of race conditions and thrashing).
