threading synchronization at high speed

I have a threading question and what I'd qualify as a modest threading background.
Suppose I have the following (oversimplified) design and behavior:
Object ObjectA - has a reference to object ObjectB and a method MethodA().
Object ObjectB - has a reference to ObjectA, an array of elements ArrayB and a method MethodB().
ObjectA is responsible for instantiating ObjectB. ObjectB.ObjectA will point to ObjectB's instantiator.
Now, whenever some conditions are met, a new element is added in ObjectB.ArrayB and a new thread is started for this element, say ThreadB_x, where x goes from 1 to ObjectB.ArrayB.Length. Each such thread calls ObjectB.MethodB() to pass some data in, which in turn calls ObjectB.ObjectA.MethodA() for data processing.
So multiple threads call the same method ObjectB.MethodB(), and it's very likely that they do so at the very same time. There's a lot of code in MethodB that creates and initializes new objects, so I don't think there are problems there. But then this method calls ObjectB.ObjectA.MethodA(), and I don't have the slightest idea of what's going on in there. Based on the results I get, nothing goes wrong, apparently, but I'd like to be sure of that.
For now, I enclosed the call to ObjectB.ObjectA.MethodA() in a lock statement inside ObjectB.MethodB(), so I'm thinking this will ensure there are no clashes between calls to MethodA(), though I'm not 100% sure of that. But what happens if each ThreadB_x calls ObjectB.MethodB() a lot of times and very, very fast? Will I have a queue of calls waiting for ObjectB.ObjectA.MethodA() to finish?
Thanks.

Your question is very difficult to answer because of the lack of information. It depends on the average time spent in methodA, how many times this method is called per thread, how many cores are allocated to the process, and the OS scheduling policy, to name a few parameters.
All things being equal, as the number of threads grows toward infinity, you can easily imagine that the probability of two threads requesting access to a shared resource simultaneously tends to one. This probability grows faster in proportion to the amount of time spent on the shared resource. That intuition is probably the reason for your question.
The main idea of multithreading is to parallelize code which can effectively be computed concurrently, and to avoid contention as much as possible. In your setup, if methodA is not pure, i.e., if it may change the state of the process - or, in C++ parlance, if it cannot be made const - then it is a source of contention (recall that a function can only be pure if it uses pure functions or constants in its body).
One way of dealing with a shared resource is to protect it with a mutex, as you've done in your code. Another way is to try to turn its use into an async service, with one thread handling it and the others requesting computations from that thread. In effect you end up with an explicit queue of requests, but the threads making those requests are free to work on something else in the meantime. The goal is always to maximize computation time as opposed to thread-management time, which is paid each time a thread gets rescheduled.
Of course, this is not always possible, e.g., when the result of methodA belongs to a strongly ordered chain of computation.
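For illustration, here is a minimal sketch of that async-service idea in C++ (the Request type, the queue layout and the function names are made up for this sketch, not taken from the question): one dedicated thread owns the contended resource, and the ThreadB_x workers only enqueue requests and carry on.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct Request { int payload; };            // hypothetical data passed to the processing routine

std::queue<Request>     requests;           // the explicit queue of pending calls
std::mutex              queue_mutex;        // protects the queue only, not the work itself
std::condition_variable queue_cv;
bool                    shutting_down = false;

// The single thread that owns the shared resource (it plays the role of MethodA).
void service_loop() {
    for (;;) {
        Request r;
        {
            std::unique_lock<std::mutex> lock(queue_mutex);
            queue_cv.wait(lock, [] { return shutting_down || !requests.empty(); });
            if (shutting_down && requests.empty()) return;
            r = requests.front();
            requests.pop();
        }                                    // lock released before the slow processing
        // ... process r here, serially, with no other thread touching the shared state ...
    }
}

// Called by the many ThreadB_x workers instead of invoking MethodA directly.
void submit(Request r) {
    {
        std::lock_guard<std::mutex> lock(queue_mutex);
        requests.push(r);
    }
    queue_cv.notify_one();                   // the caller returns immediately and keeps working
}

// Usage: start the service once, e.g. std::thread service(service_loop);

Whether this beats the plain lock depends on how long methodA takes compared to the cost of queuing a request.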

Related

Is it safe to update an object in a thread without locks if other threads will not access it?

I have a vector of entities. In the update cycle I iterate through the vector and update each entity: read its position, calculate the current speed, write the updated position. During the update I can also change some other objects in other parts of the program, but each of those objects is related only to the current entity, and other entities will not touch it.
So I want to run this code in threads. I split the vector into a few chunks and update each chunk in a different thread. As I see it, the threads are fully independent: each thread on each iteration works with independent memory regions and doesn't affect the work of the other threads.
Do I need any locks here? I assume that everything should work without any mutexes, etc. Am I right?
Short answer
No, you do not need any lock or synchronization mechanism, as your problem appears to be an embarrassingly parallel task.
Longer answer
A race condition can only appear if two threads access the same memory at the same time and at least one of the accesses is a write operation. If your program has this characteristic, then you need to make sure that the threads access the memory in an ordered fashion. One way to do that is by using locks (it is not the only one, though). Otherwise the result is undefined behavior (UB).
It seems that you found a way to split the work among your threads such that each thread can work independently from the others. This is the best-case scenario for concurrent programming, as it does not require any synchronization. The complexity of the code decreases dramatically, and the speedup is usually substantial.
Please note that, as @acelent pointed out in the comment section, if you need changes made by one thread to be visible to another thread, then you might need some sort of synchronization, because depending on the memory model and on the hardware, changes made in one thread might not be immediately visible in the other.
This means that you might write to a variable from Thread 1 and, some time later, read the same memory from Thread 2 and still not see the write made by Thread 1.
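As a minimal sketch of the setup being discussed (the Entity type and the update formula are placeholders, not taken from the question): each thread touches only its own contiguous chunk, no locks are taken, and joining the threads at the end of the cycle is what makes all of their writes visible to the main thread before the next iteration.

#include <functional>
#include <thread>
#include <vector>

struct Entity { float position = 0.f, speed = 0.f; };     // placeholder entity

void update_range(std::vector<Entity>& entities, std::size_t begin, std::size_t end, float dt) {
    for (std::size_t i = begin; i < end; ++i) {
        entities[i].speed    += 1.0f * dt;                 // this thread writes only [begin, end)
        entities[i].position += entities[i].speed * dt;
    }
}

void update_cycle(std::vector<Entity>& entities, unsigned num_threads, float dt) {
    std::vector<std::thread> workers;
    const std::size_t chunk = entities.size() / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end   = (t + 1 == num_threads) ? entities.size() : begin + chunk;
        workers.emplace_back(update_range, std::ref(entities), begin, end, dt);
    }
    for (auto& w : workers) w.join();   // join() synchronizes: all updates are visible after this
}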
"I separate vector into few chunks and update each chunk in different threads" - in this case you do not need any lock or synchronization mechanism, however, the system performance might degrade considerably due to false sharing depending on how the chunks are allocated to threads. Note that the compiler may eliminate false sharing using thread-private temporal variables.
You can find plenty of information in books and on the wiki. Here is some info: https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
There is also a Stack Overflow post on this: does false sharing occur when data is read in openmp?
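For illustration only (this per-thread counter is a hypothetical addition, not part of the question's code), the usual mitigation is to pad per-thread data out to separate cache lines:

#include <vector>

// Tightly packed per-thread counters would share a cache line and ping-pong between cores.
// Padding each one to a full line (64 bytes is a typical line size) keeps them apart.
struct alignas(64) PaddedCounter {
    long value = 0;
};

// One counter per thread, each on its own cache line; thread t writes only counters[t].value,
// so no lock is needed and no cache line is shared between writers.
// (C++17 guarantees the vector's allocator honours the over-alignment.)
std::vector<PaddedCounter> make_counters(unsigned num_threads) {
    return std::vector<PaddedCounter>(num_threads);
}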

Which are blocking Vulkan functions?

In Vulkan, it is recommended to spread the API calls across separate threads for better throughput. I am unsure which categories of calls are the computationally expensive ones that would cause a thread to block, and thus should be used asynchronously.
As I see it, these are the potential calls/family-of-calls that could take a long time to execute.
vkAcquireNextImageKHR()
vkQueueSubmit()
vkQueuePresentKHR()
memcpy into mapped memory
vkBegin/EndCommandBuffer
vkCmd* calls for drawing and compute
But the more I think about them, the more it seems that most would be fairly cheap to call. I'll explain my rationale, which is probably flawed.
vkAcquireNextImageKHR()
This could block if you choose a non-zero timeout. But it's likely that a sufficiently optimized app would call this function with a 0 timeout and just do other work if the image is not yet available. So this function can be made effectively instant; there's no need to wait if the app is smart enough.
vkQueueSubmit()
This function takes a fence, which will be signaled when the GPU has finished executing the command buffers. So it doesn't actually wait around while the GPU performs the work. I'm assuming this function is the one that starts the physical movement of the command buffer data to the GPU, but I'm also assuming that it just tells the hardware to read from some memory location and then returns as quickly as possible. So it wouldn't wait around while the command buffers get sent to the GPU.
vkQueuePresentKHR()
Signals the GPU to send some image to the window/monitor. It doesn't have to wait for much, does it?
memcpy into mapped memory
This is probably slow.
vkCmd* calls
This family of calls is the one I'm most unsure about. When I read about threads and Vulkan, it's usually these calls that get put onto the threads. But, what are these calls doing, really? Are they building some opcode buffer, made up of some ints and pointers, to be sent to the GPU? If so, that should be extremely fast. The actual work would be carrying out the operations described by those opcodes.
Define "block". The traditional definition of "block"ing is to wait on some internal synchronization, and thereby taking longer than would strictly be necessary for the operation. Doing a memcpy is not doing any synchronization; it's just copying data.
So you don't seem to be concerned about "block"ing; you're merely talking about what operations are expensive.
vkQueueSubmit does not block. But that doesn't mean it's not expensive. It is not "tell[ing] the hardware to read from some memory location". Just look at its interface. It doesn't take a single command buffer; it takes an arbitrary number of them, grouped into batches, with each batch waiting on semaphores before execution, signaling semaphores after execution, and the whole operation signaling a fence.
You cannot reasonably expect an implementation of such a thing to merely copy some pointers around.
And that doesn't even get into the issues of different types of command buffers. Submitting SIMULTANEOUS_USE command buffers may require creating temporary copies of their buffered data, so that different batches can contain the same command buffer.
Now obviously, vkQueueSubmit is going to return well before any of the work it submits actually gets executed. But don't make the mistake of thinking that it's free to ship work off to the GPU. The Vulkan specification takes time out in a note to directly tell you not to call the function any more frequently than you can get away with:
Submission can be a high overhead operation, and applications should attempt to batch work together into as few calls to vkQueueSubmit as possible.
The reason to present on the same thread that submitted the CBs that generate the image being presented is not because any of those operations are necessarily slow. It's simple pragmatism: these three operations (acquire, submit, present) must happen in order, and the simplest and easiest way to ensure that is to do them on the same thread.
You cannot submit work that renders to a swapchain image until you have acquired it. Therefore, either you do it on the same thread, or you have to have some inter-thread communication pipe to tell the thread waiting to build the primary CB what the acquired image is. The two processes cannot overlap.
Unlike acquire, present is a queue operation. And both vkQueueSubmit and vkQueuePresentKHR require that access to their VkQueue parameters be "externally synchronized". That of course means that you cannot call them from different threads, on the same VkQueue, at the same time. So if you tried to do these in parallel, you'd need a mutex or something to synchronize CPU access to the VkQueue.
Whereas if you do them on the same thread, there's no need.
Additionally, in order to present an image, you must provide a semaphore that the present will wait on. This semaphore will get signaled by the batch that generates data for the image. Vulkan requires semaphore signal/wait pairs to be ordered; you cannot perform a queue operation that waits on a semaphore until the operation that signals that semaphore has been submitted. Therefore, either you do it on the same thread in sequence, or you use some inter-thread communication pipe to tell whatever thread is waiting to present the image that the submit operation that renders to it has been issued.
So what is to be gained by splitting these operations up onto different threads? They have to happen in sequence, so you may as well do them in sequence the easiest way that exists: on the same thread.
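To make the ordering concrete, here is a sketch of one frame on a single thread (all handles are assumed to have been created during setup; error handling and swapchain recreation are omitted):

// Assumes <vulkan/vulkan.h>; device, swapchain, queue, cmd (a recorded VkCommandBuffer),
// the semaphores imageAvailable / renderFinished and the fence inFlight already exist.
uint32_t imageIndex;
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX, imageAvailable, VK_NULL_HANDLE, &imageIndex);

VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit = {};
submit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit.waitSemaphoreCount   = 1;
submit.pWaitSemaphores      = &imageAvailable;        // rendering waits until the image is acquired
submit.pWaitDstStageMask    = &waitStage;
submit.commandBufferCount   = 1;
submit.pCommandBuffers      = &cmd;
submit.signalSemaphoreCount = 1;
submit.pSignalSemaphores    = &renderFinished;        // the present below waits on this
vkQueueSubmit(queue, 1, &submit, inFlight);           // batch as much work as possible into this one call

VkPresentInfoKHR present = {};
present.sType              = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
present.waitSemaphoreCount = 1;
present.pWaitSemaphores    = &renderFinished;         // ordered after the submit that signals it
present.swapchainCount     = 1;
present.pSwapchains        = &swapchain;
present.pImageIndices      = &imageIndex;
vkQueuePresentKHR(queue, &present);                   // same VkQueue, same thread: no extra mutex needed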
While timeline semaphores now allow you to call the present function before submitting the work that increments the semaphore counter, you still can't call them on separate threads (without synchronization) because they affect the same queue. So you may as well issue them on the same thread (though not necessarily in acquire, submit, present order).
Ultimately, it's not clear what the point of this exercise is. Yes, an individual vkCmd* call will be pretty fast. So what? In a real scene, you will be calling these functions thousands of times per frame, and spreading that recording work evenly across 4 cores can cut the CPU time spent on it roughly fourfold.

Semaphores & threads - what is the point?

I've been reading about semaphores and came across this article:
www.csc.villanova.edu/~mdamian/threads/posixsem.html
So, this page states that if there are two threads accessing the same data, things can get ugly. The solution is to allow only one thread to access the data at the same time.
This is clear and I understand the solution, only why would anyone need threads to do this? What is the point? If the threads are blocked so that only one can execute, why use them at all? There is no advantage. (Or maybe this is just a dumb example; in that case, please point me to a sensible one.)
Thanks in advance.
Consider this:
#include <semaphore.h>

// A binary semaphore used as a mutex around the shared counter;
// initialized elsewhere with sem_init(&g_shared_variable_mutex, 0, 1).
sem_t g_shared_variable_mutex;
int   g_shared_variable = 0;

void update_shared_variable() {
    sem_wait(&g_shared_variable_mutex);   // blocks only if another thread is inside this section
    g_shared_variable++;
    sem_post(&g_shared_variable_mutex);
}

// The do_thing_* functions are each thread's independent work, defined elsewhere.
void thread1() {
    do_thing_1a();
    do_thing_1b();
    do_thing_1c();
    update_shared_variable();  // may block
}

void thread2() {
    do_thing_2a();
    do_thing_2b();
    do_thing_2c();
    update_shared_variable();  // may block
}
Note that all of the do_thing_xx functions still happen simultaneously. The semaphore only comes into play when the threads need to modify some shared (global) state or use some shared resource. So a thread will only block if another thread is trying to access the shared thing at the same time.
Now, if the only thing your threads are doing is working with one single shared variable/resource, then you are correct: there is no point in having threads at all (it would actually be less efficient than just one thread, due to context switching).
When you are using multithreading, not all of the code that runs will be blocking. For example, if you had a queue and two threads reading from it, you would make sure that no two threads read from the queue at the same time, so that part would be blocking, but it's also the part that will probably take the least time. Once you have retrieved the item to process from the queue, all the rest of the code can run in parallel.
The idea behind threads is to allow simultaneous processing. A shared resource must be governed to avoid things like deadlocks or starvation. If something can take a while to process, why not create multiple instances of that work so it finishes faster? The bottleneck is just what you mentioned: when a process has to wait for I/O.
When the time spent blocked on the shared resource is small compared to the processing time, that is when you want to use multiple threads.
This is of course an SSCCE (Short, Self-Contained, Correct Example).
Let's say you have 2 worker threads that do a lot of work and write the results to a file.
You only need to lock access to the file (the shared resource).
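A minimal sketch of that scenario (the file name and the fake computation are placeholders): each thread spends most of its time on its own work and takes the lock only for the short write.

#include <fstream>
#include <mutex>
#include <thread>

std::mutex    file_mutex;
std::ofstream results("results.txt");             // the shared resource

void worker(int id) {
    for (int i = 0; i < 1000; ++i) {
        long value = static_cast<long>(id) * i;   // stand-in for "a lot of work", done fully in parallel
        std::lock_guard<std::mutex> lock(file_mutex);
        results << id << ' ' << value << '\n';    // only this short write is serialized
    }
}

int main() {
    std::thread t1(worker, 1), t2(worker, 2);
    t1.join();
    t2.join();
}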
The problem with trivial examples....
If the problem you're trying to solve can be broken down into pieces that can be executed in parallel then threads are a good thing.
A slightly less trivial example - imagine a for loop where the data being processed in each iteration is different every time. In that circumstance you could execute each iteration of the for loop simultaneously in separate threads. And indeed some compilers, like Intel's, will convert suitable for loops into threads automatically for you. In that particular circumstance no semaphores are needed, because of the iterations' data independence.
But say you wanted to process a stream of data, and that processing had two distinct steps, A and B. The thread-less approach would be to read some data, do A, then B, and then output the data before reading more input. Alternatively you could have one thread reading and doing A, and another thread doing B and the output. So how do you get the interim result from the first thread to the second?
One way would be to have a memory buffer to contain the interim result. The first thread could write the interim result to a memory buffer and the second could read from it. But with two threads operating independently there's no way for the first thread to know if it's safe to overwrite that buffer, and there's no way for the second to know when to read from it.
That's where you can use semaphores to synchronise the action of the two threads. The first thread takes a semaphore that I'll call empty, fills the buffer, and then posts a semaphore called filled. Meanwhile the second thread will take the filled semaphore, read the buffer, and then post empty. So long as filled is initialised to 0 and empty is initialised to 1 it will work. The second thread will process the data only after the first has written it, and the first won't write it until the second has finished with it.
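In POSIX terms (matching the article linked in the question; the buffer and its size are placeholders), that handshake looks roughly like this:

#include <semaphore.h>

char  buffer[4096];       // the shared interim result
sem_t empty;              // 1 means the buffer may be (over)written
sem_t filled;             // 1 means the buffer holds unread data

void init_buffer_sems() {
    sem_init(&empty, 0, 1);    // the buffer starts out writable
    sem_init(&filled, 0, 0);   // nothing to read yet
}

void producer_step() {         // the thread doing step A
    sem_wait(&empty);          // wait until the consumer has finished with the buffer
    /* read input, do A, write the interim result into buffer */
    sem_post(&filled);         // tell the consumer there is data
}

void consumer_step() {         // the thread doing step B
    sem_wait(&filled);         // wait until the producer has written
    /* read buffer, do B, output the result */
    sem_post(&empty);          // the buffer may be overwritten again
}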
It's only worth it, of course, if the amount of time each thread spends processing data outweighs the time spent waiting on semaphores. That limits the extent to which splitting code up into threads yields a benefit; going beyond that point tends to mean the overall execution is effectively serial.
You can do multithreaded programming without semaphores at all. There's the Actor model, or Communicating Sequential Processes (the one I favour). It's well worth looking up JCSP on Wikipedia.
In these programming styles, data is shared between threads by sending it down communication channels. So instead of using semaphores to grant another thread access to data, it would be sent a copy of that data down something a bit like a network socket or a pipe. The advantage of CSP (which constrains the channel so that a send finishes only once the receiver has read) is that it stops you falling into the many, many pitfalls that plague multithreaded programs. It sounds inefficient (copying data is inefficient), but actually it's not so bad with Intel's QPI architecture or AMD's HyperTransport. And it means that the 'channel' really could be a network connection; scalability is built in by design.

Performance loss in C++0x and static local variables?

What is a realistic performance loss due to the fact that in C++0x all other threads shall wait in a case like this:
#include <string>

std::string& program_name() {
    static std::string instance = "Parallel Pi";
    return instance;
}
Let's assume the optimal scenario: the programmer was very careful so that, even with 100 threads, only the main thread calls the function program_name; all the other 99 worker threads are busy doing useful stuff which does not involve calling this "critical" function.
I quote from the new C++0x standard, § 6.7(4) [stmt.dcl]:
...such an object is initialized the first time control passes through its declaration... If control enters the declaration concurrently while the object is being initialized, the concurrent execution shall wait for completion of the initialization...
What is the realistic overhead that a real-world compiler needs to impose on me to ensure that static initialization is done as required by the standard?
Is a lock/mutex required? I assume they are expensive, even when not really needed?
If they are expensive, will this be done by less expensive mechanisms?
edit: added string...
If control enters the declaration concurrently while the object is being initialized, the concurrent execution shall wait for completion of the initialization...
I think this is a reasonable and very normal thing to do in concurrent programming. Anyway, this statement doesn't say that all other threads must wait for this initialization; they have to wait only if they need to access the object while it is being initialized.
Is a lock/mutex required? I assume they are expensive, even when not really needed?
Could be. A mutex/lock isn't actually that expensive; it becomes expensive only when the locked code fragment needs to be accessed frequently by many or even all threads.
If they are expensive, will this be done by less expensive mechanisms?
There are also non-lock-based solutions, AFAIK. In practice, mainstream compilers typically guard the static with a flag that is checked with a cheap atomic read on the fast path, so a full lock is only involved while the very first call is performing the initialization.
If you were really concerned about the price of the lock, you could simply call the function once before you start your worker threads, which would initialize the static. If you call it after the threads start, either you or the compiler has to arrange for locking of some sort, so there is no real extra overhead.
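A minimal sketch of that suggestion (worker_main and the thread count are placeholders): force the one-time initialization before any worker exists, so the workers never contend on it.

#include <string>
#include <thread>
#include <vector>

std::string& program_name() {
    static std::string instance = "Parallel Pi";   // initialized on the first call
    return instance;
}

void worker_main() {
    // uses program_name() freely; the static has already been constructed
}

int main() {
    program_name();                                 // one-time initialization, done before spawning
    std::vector<std::thread> workers;
    for (int i = 0; i < 99; ++i)
        workers.emplace_back(worker_main);
    for (auto& w : workers)
        w.join();
}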

Should access to a shared resource be locked by a parent thread before spawning a child thread that accesses it?

If I have the following pseudocode:
sharedVariable = somevalue;
CreateThread(threadWhichUsesSharedVariable);
Is it theoretically possible for a multicore CPU to execute code in threadWhichUsesSharedVariable() which reads the value of sharedVariable before the parent thread writes to it? For full theoretical avoidance of even the remote possibility of a race condition, should the code look like this instead:
sharedVariableMutex.lock();
sharedVariable = somevalue;
sharedVariableMutex.unlock();
CreateThread(threadWhichUsesSharedVariable);
Basically I want to know if the spawning of a thread explicitly linearizes the CPU at that point, and is guaranteed to do so.
I know that the overhead of thread creation probably takes enough time that this would never matter in practice, but the perfectionist in me is afraid of the theoretical race condition. In extreme conditions, where some threads or cores might be severely lagged and others are running fast and efficiently, I can imagine that it might be remotely possible for the order of execution (or memory access) to be reversed unless there was a lock.
I would say that your pseudocode is safe on any correctly functioning multiprocessor system. The C++ compiler cannot generate a call to CreateThread() before sharedVariable has received a correct value unless it can prove to itself that doing so is safe. You are guaranteed that your single-threaded code executes equivalently to a completely non-reordered linear execution path. Any system that "time warps" the thread creation ahead of the variable assignment is seriously broken.
I don't think declaring sharedVariable as volatile does anything useful in this case.
Given your example and if you were using Java then the answer would be "No". In Java it is not possible for the thread to spawn and read your value before the assignment operation is complete. In some other languages this might be a different story.
"Variables shared between multiple threads (e.g., instance variables of objects) have atomic assignment guaranteed by the Java language specification for all data types except longs and doubles... If a method consists solely of a single variable access or assignment, there is no need to make it synchronized for thread-safety, and every reason not to do so for performance."
reference
If your double or long is declared volatile, then you are also guaranteed that the assignment is an atomic operation.
Update:
Your example is going to work in C++ just like it works in Java. Theoretically there is no way that the thread spawning will begin or complete before the assignment, even with out-of-order execution.
Note that your example is VERY specific, and in any other case it is recommended that you ensure the shared resource is protected properly. The new C++ standard is coming out with a lot of atomic facilities, so you could declare your variable as atomic and the assignment operation would be visible to all threads without the need for locking. CAS (compare-and-swap) is your next best option.
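A sketch of that in C++11 terms (the thread body is a placeholder): the store in the parent happens-before anything the new thread does, and the atomic also makes later concurrent updates, including a CAS loop, safe.

#include <atomic>
#include <thread>

std::atomic<int> sharedVariable{0};

void threadWhichUsesSharedVariable() {
    int seen = sharedVariable.load();               // guaranteed to see at least the parent's store
    // CAS-style update: retry until we swap in our new value atomically
    int expected = seen;
    while (!sharedVariable.compare_exchange_weak(expected, expected + 1)) {
        // on failure, expected was refreshed with the current value; just try again
    }
}

int main() {
    sharedVariable = 42;                            // the assignment from the question
    std::thread t(threadWhichUsesSharedVariable);   // thread creation also synchronizes-with the new thread
    t.join();
}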
