I am just starting to learn the ins and outs of multithreaded programming and have a few basic questions that, once answered, should keep me occupied for quite some time. I understand that multithreading loses its effectiveness once you have created more threads than there are cores (due to context switching and cache flushing). With that understood, I can think of two ways to multithread a recursive function, but I am not sure which is the common way to approach the problem. One seems much more complicated, perhaps with a higher payoff, but that's what I hope you will be able to tell me.
Below is pseudo-code for two different methods of multithreading a recursive function. I have used the terminology of merge sort for simplicity, but it's not that important. It is easy to see how to generalize the methods to other problems. Also, I will personally be employing these methods using the pthreads library in C, so the thread syntax mildly reflects this.
Method 1:
main ()
{
    A = array of length N
    NUM_CORES = get number of functional cores
    chunk[NUM_CORES] = array of indices partitioning A into (N / NUM_CORES) sized chunks
    thread_id[NUM_CORES] = array of thread ids
    thread[NUM_CORES] = array of thread type
    //start NUM_CORES threads, each working on one chunk of A
    for i = 0 to (NUM_CORES - 1) {
        thread_id[i] = thread_start(thread[i], MergeSort, chunk[i])
    }
    //wait for all threads to finish
    //Merge chunks appropriately
    exit
}
MergeSort ( chunk )
{
    MergeSort ( lowerSubChunk )
    MergeSort ( higherSubChunk )
    Merge(lowerSubChunk, higherSubChunk)
}
//Merge(,) not shown
Method 2:
main ()
{
    A = array of length N
    NUM_CORES = get number of functional cores
    chunk = indices 0 and N
    thread_id[NUM_CORES] = array of thread ids
    thread[NUM_CORES] = array of thread type
    //lock variable aka mutex
    THREADS_IN_USE = 1
    MergeSort( chunk )
    exit
}
MergeSort ( chunk )
{
    lock THREADS_IN_USE
    if ( THREADS_IN_USE < NUM_CORES ) {
        FREE_CORE = find index of unused core
        thread_id[FREE_CORE] = thread_start(thread[FREE_CORE], MergeSort, lowerSubChunk)
        THREADS_IN_USE++
        unlock THREADS_IN_USE
        MergeSort( higherSubChunk )
        //wait for thread_id[FREE_CORE] and current thread to finish
        lock THREADS_IN_USE
        THREADS_IN_USE--
        unlock THREADS_IN_USE
        Merge(lowerSubChunk, higherSubChunk)
    }
    else {
        unlock THREADS_IN_USE
        MergeSort( lowerSubChunk )
        MergeSort( higherSubChunk )
        Merge(lowerSubChunk, higherSubChunk)
    }
}
//Merge(,) not shown
Visually, one can think of the differences between these two methods as follows:
Method 1: creates NUM_CORES separate recursion trees, each one having a single core traversing it.
Method 2: creates a single recursion tree but has all cores traversing it. In particular, whenever there is a free core, it is set to work on the "left child subtree" of the first node where MergeSort is called after the core is freed.
The problem with Method 1 is that if it is the case that the running time of the recursive function varies with the distribution of values within each initial subchunk (i.e. the chunk[i]), one thread could finish much faster leaving a core sitting idle while the others finish. With Merge Sort this is not likely to be the case since the work of MergeSort happens in Merge whose runtime isn't affected much by the distribution of values in the (sorted) subchunks. However, with a more involved recursive function, the running time on one subchunk could be much longer!
With Method 2 it is possible to have the same problem. Again, with merge sort it's not clear, since the running time for each subchunk is likely to be similar, but the line "//wait for thread_id[FREE_CORE] and current thread to finish" would also require one core to wait for the other. However, with Method 2, all calls to Merge run as soon as possible, whereas with Method 1 one must wait for all NUM_CORES calls to MergeSort to finish and then do NUM_CORES - 1 merges afterward (although this can be multithreaded as well, to an extent).
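For comparison, here is a minimal sketch of a variant of Method 2, written in Rust rather than C with pthreads. It is not the pseudo-code above translated literally: the shared THREADS_IN_USE counter and its lock are replaced by a spawn-depth budget (the hypothetical parameter spawn_depth), so a new thread is started only in the top few levels of the recursion tree. The names merge, merge_sort and spawn_depth are illustrative assumptions.

```rust
use std::thread;

/// Merge the two sorted halves v[..mid] and v[mid..] using a temporary buffer.
fn merge(v: &mut [i32], mid: usize) {
    let mut merged = Vec::with_capacity(v.len());
    let (mut i, mut j) = (0, mid);
    while i < mid && j < v.len() {
        if v[i] <= v[j] {
            merged.push(v[i]);
            i += 1;
        } else {
            merged.push(v[j]);
            j += 1;
        }
    }
    merged.extend_from_slice(&v[i..mid]);
    merged.extend_from_slice(&v[j..]);
    v.copy_from_slice(&merged);
}

/// Sort `v`. While `spawn_depth > 0`, the lower half is sorted on a new
/// thread and the higher half on the current one (assumed spawn policy).
fn merge_sort(v: &mut [i32], spawn_depth: u32) {
    if v.len() < 2 {
        return;
    }
    let mid = v.len() / 2;
    {
        let (lower, higher) = v.split_at_mut(mid);
        if spawn_depth > 0 {
            // The scope waits for the spawned thread before returning,
            // which plays the role of the "wait for ... to finish" line.
            thread::scope(|s| {
                s.spawn(move || merge_sort(lower, spawn_depth - 1));
                merge_sort(higher, spawn_depth - 1);
            });
        } else {
            merge_sort(lower, 0);
            merge_sort(higher, 0);
        }
    }
    merge(v, mid);
}

fn main() {
    let mut data = vec![5, 3, 9, 1, 4, 8, 2, 7, 6, 0];
    merge_sort(&mut data, 2); // at most 2^2 = 4 threads sorting at once
    println!("{:?}", data);
}
```

With a depth budget, the number of threads is fixed by the shape of the recursion tree, so no lock is needed, but a core that finishes early is not reassigned to other work the way the counter-based version intends.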
(though the syntax might not be completely correct)
Are both of these methods used in practice? Are there situations where one is more beneficial than the other? Is this the correct way to implement Method 2? (In this case, is THREADS_IN_USE a semaphore?)
Thanks so much for your help!
I'm fairly new to Rust. I graduated with a Computer Engineering degree 4 years ago, and I remember discussing (and understanding) atomic operations in my Operating Systems course. However, since graduating, I've been working primarily in high-level languages where I haven't had to care about low-level stuff like atomics. Now that I'm getting into Rust, I'm struggling to remember how a lot of this stuff works.
I'm currently trying to understand the source code for the hibitset library, specifically atomic.rs.
This module specifies an AtomicBitSet type which corresponds to the BitSet type from lib.rs, but using atomic values and operations. From my understanding, an "atomic operation" is an operation that is guaranteed to not be interrupted by another thread; any "load" or "store" on the same value will have to wait for the operation to finish before proceeding. Following from this definition, an "atomic value" is a value whose operations are fully atomic. AtomicBitSet uses AtomicUsize, which is a usize wrapper where all methods are fully atomic. However, AtomicBitSet specifies several operations that seem to not be atomic (add and remove), and there is one atomic operation: add_atomic. Looking at add vs add_atomic, I can't really tell what the difference is.
Here is add (verbatim):
/// Adds `id` to the `BitSet`. Returns `true` if the value was
/// already in the set.
#[inline]
pub fn add(&mut self, id: Index) -> bool {
    use std::sync::atomic::Ordering::Relaxed;
    let (_, p1, p2) = offsets(id);
    if self.layer1[p1].add(id) {
        return true;
    }
    self.layer2[p2].store(self.layer2[p2].load(Relaxed) | id.mask(SHIFT2), Relaxed);
    self.layer3
        .store(self.layer3.load(Relaxed) | id.mask(SHIFT3), Relaxed);
    false
}
This method calls load() and store() directly. I'm assuming that the fact that it's using Ordering::Relaxed is what makes this method non-atomic, because another thread doing the same thing to a different index might clobber this operation.
Here is add_atomic (verbatim):
/// Adds `id` to the `AtomicBitSet`. Returns `true` if the value was
/// already in the set.
///
/// Because we cannot safely extend an AtomicBitSet without unique ownership
/// this will panic if the Index is out of range.
#[inline]
pub fn add_atomic(&self, id: Index) -> bool {
    let (_, p1, p2) = offsets(id);
    // While it is tempting to check of the bit was set and exit here if it
    // was, this can result in a data race. If this thread and another
    // thread both set the same bit it is possible for the second thread
    // to exit before l3 was set. Resulting in the iterator to be in an
    // incorrect state. The window is small, but it exists.
    let set = self.layer1[p1].add(id);
    self.layer2[p2].fetch_or(id.mask(SHIFT2), Ordering::Relaxed);
    self.layer3.fetch_or(id.mask(SHIFT3), Ordering::Relaxed);
    set
}
This method uses fetch_or instead of calling load and store directly, which I'm assuming is what makes this method atomic.
But why does the usage of Ordering::Relaxed still allow this to be considered atomic? I realize that the individual "or" operations are atomic, but the full method could be run at the same time as another thread. Wouldn't that have an impact?
Moreover, why would a type like this expose non-atomic methods? Is it just for performance? That seems confusing to me. If I were to pick an AtomicBitSet over a BitSet because it's going to be used by more than one thread, I'd probably want to only use atomic operations on it. If I didn't I wouldn't be using it. Right?
I'd also love an explanation of the comment inside add_atomic. As-is it does not make sense to me. Doesn't the non-atomic version still have to care about that? It seems like the two methods are doing effectively the same thing, just with different levels of atomicity.
I'd really just love some help wrapping my head around atomics. I think I understand ordering after reading this and this, but both are still using concepts that I don't understand. When they talk about one thread "seeing" something from another, what does that mean exactly? When it's said that sequentially-consistent operations have the same order "across all threads" what does that even mean? Does the processor change the instruction order differently for different threads?
In the non-atomic case, this line:
self.layer2[p2].store(self.layer2[p2].load(Relaxed) | id.mask(SHIFT2), Relaxed);
is more or less equivalent to:
let tmp1 = self.layer2[p2];
let tmp2 = tmp1 | id.mask(SHIFT2);
self.layer2[p2] = tmp2;
so another thread could change self.layer2[p2] between the moment it is read into tmp1 and the moment tmp2 is stored into it. So if another thread tries to set another bit at the same time, there is a risk that the following sequence occurs:
thread 1 reads an empty mask,
thread 2 reads an empty mask,
thread 1 sets bit 1 of the mask and writes it,
thread 2 sets bit 2 of the mask and writes it, thus overwriting the value set by thread 1,
in the end only bit 2 is set!
The same goes for self.layer3.
In the atomic case, the use of fetch_or guarantees that the whole read-modify-write cycle is atomic.
In both cases, since the ordering is relaxed, the writes to layer2 and layer3 may seem to occur in any order as seen from other threads.
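The difference can be sketched with a plain AtomicUsize standing in for one layer of the bit set. The first function replays the bad sequence listed above by hand on a single thread (so the outcome is deterministic), and the second lets two real threads each set their own bit with fetch_or. The function names and the bit values 0b01 and 0b10 are assumptions for illustration and are not part of hibitset.

```rust
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};
use std::thread;

/// Replays the bad interleaving by hand: both "threads" load before either stores.
fn lost_update() -> usize {
    let mask = AtomicUsize::new(0);
    let seen_by_1 = mask.load(Relaxed); // thread 1 reads an empty mask
    let seen_by_2 = mask.load(Relaxed); // thread 2 reads an empty mask
    mask.store(seen_by_1 | 0b01, Relaxed); // thread 1 writes its bit
    mask.store(seen_by_2 | 0b10, Relaxed); // thread 2 overwrites it with only its own bit
    mask.load(Relaxed)
}

/// Two real threads, each setting its own bit with an atomic read-modify-write.
fn with_fetch_or() -> usize {
    let mask = AtomicUsize::new(0);
    thread::scope(|s| {
        s.spawn(|| mask.fetch_or(0b01, Relaxed));
        s.spawn(|| mask.fetch_or(0b10, Relaxed));
    });
    // Both threads have been joined here, so both updates are visible.
    mask.load(Relaxed)
}

fn main() {
    println!("load + store: {:#04b}", lost_update());
    println!("fetch_or:     {:#04b}", with_fetch_or());
}
```

In the first function one bit is lost even though every individual load and store is atomic. In the second, both bits survive regardless of which thread runs first, because each fetch_or is a single indivisible read-modify-write.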
The comment inside add_atomic describes an issue that arises when two threads try to add the same bit. Assume that add_atomic was written like this:
pub fn add_atomic(&self, id: Index) -> bool {
    let (_, p1, p2) = offsets(id);
    if self.layer1[p1].add(id) {
        return true;
    }
    self.layer2[p2].fetch_or(id.mask(SHIFT2), Ordering::Relaxed);
    self.layer3.fetch_or(id.mask(SHIFT3), Ordering::Relaxed);
    false
}
Then you risk the following sequence:
thread 1 sets bit 1 in layer1 and sees that it wasn't set beforehand,
thread 2 tries to set bit 1 in layer1 and sees that thread 1 already set it, so thread 2 returns from add_atomic,
thread 2 executes another operation that requires reading layer3, but layer3 has not been updated yet, so thread 2 gets a wrong value!
thread 1 updates layer3, but it is too late.
This is why the add_atomic case ensures that layer2 and layer3 are set properly in all threads even if it looked like the bit was already set beforehand.
Let's suppose a simple if like this:
if (something)
// do_something
else
// do_else
Suppose that this if-else statement is executed in parallel in different threads, and each thread yielding a different result, but constant through its own life. For example, in thread 1 the condition is always evaluated as false, in thread 2, true; in thread 3 always true as well, and so on.
Does branch prediction consider the execution context of each thread when gathering its statistics? If it doesn't (I don't think that's the case, but it's difficult to check by testing), the CPU will see the condition following a random pattern and won't be able to predict it at all.
If we ignore SMT (e.g. hyper-threading), most architectures have a branch predictor per hardware thread.
It's tightly coupled with the fetch unit of the individual core. A few (AMD?) store some branch prediction information in the L1/L2 I-cache, but mostly the target for the next fetch.
So if you don't run your code on an SMT core, you are in heaven and will get close to 100% correct predictions at the cost of a few instructions.
If you run your code on an SMT core, you will often find your life is hell, with 50+% mispredictions.
Now, you can solve your problem easily; you just have to use more code: check your condition earlier and call a branch of your code that contains only do_something or only do_else.
If you have a loop that calls your function where you have your branch you can do something like:
if (something)
    do_something_loop();
else
    do_else_loop();
void do_something_loop() {
    for (auto x : myVec)
        do_something;
}
This has the disadvantage that you need to maintain 2 nearly equal branches of code.
Or you can put your branch in a function, brancher(), which you make a template function; thanks to the magic of dead code elimination you should not get any branches in the loops.
C++ concept code:
template<bool b_something>
void brancher() {
    // do things
    if (b_something) {
        // do_something
    } else {
        // do_else
    }
    // do more things
}
void branch_user() {
    if (something) {
        for (auto x : myVec)
            brancher<true>();
    } else {
        for (auto x : myVec)
            brancher<false>();
    }
}
Now you only have to maintain the 2 branches of the outer function which hopefully is less work.
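The same hoisting idea can be sketched in Rust, with a const generic parameter taking the place of the C++ template parameter. This is only an illustration under assumptions: the arithmetic in each arm is a stand-in for do_something and do_else, and the names brancher, branch_user and values are hypothetical.

```rust
/// `DO_SOMETHING` is a compile-time constant, so in each instantiation
/// the compiler can drop the arm that is never taken.
fn brancher<const DO_SOMETHING: bool>(x: i64) -> i64 {
    if DO_SOMETHING {
        x * 2 // stands in for do_something
    } else {
        x + 1 // stands in for do_else
    }
}

/// The run-time condition is tested once, outside the loops.
fn branch_user(something: bool, values: &[i64]) -> i64 {
    if something {
        values.iter().map(|&x| brancher::<true>(x)).sum()
    } else {
        values.iter().map(|&x| brancher::<false>(x)).sum()
    }
}

fn main() {
    let values = [1, 2, 3];
    println!("{} {}", branch_user(true, &values), branch_user(false, &values));
}
```

As with the template version, only the outer function contains the duplicated loops; the body of brancher is written once.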
I am not familiar with multithreading, locks, or atomic/non-atomic operations.
Recently I saw an interview question as below.
Put f1 and f2 in two separate threads and run them at the same time, when both of them return, what is the value of a?
int a = 2, b = 0, c = 0
func f1()
{
a = a * 2
a = b
}
func f2()
{
c = a + 11
a = c
}
I tried to implement the above code in an Objective-C environment, and what I got is a = 11. I'm not sure if this is right, since what I did was put f1 on the main queue and f2 on a global dispatch queue and run it asynchronously, which could be incorrect.
If someone could give an answer and explain the process based on the level of register accessing, CPU processing, memory usage, that would be great.
The answer is that the resulting value of a is unpredictable. It can be anything. Since access to a is not atomic and there is no synchronization, different threads might see different values for a depending on timing. If you manage to make a unaligned and run this on x86, you might even see a torn value for a, that is, one that neither thread ever wrote.
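To get a feel for how many outcomes there are even in the best case, here is a small sketch that enumerates every interleaving of the four statements and collects the final values of a. It rests on a strong assumption that real unsynchronized code does not guarantee: each statement executes as one indivisible step and its effect is immediately visible to the other thread. The type Vars and the function names are illustrative.

```rust
use std::collections::BTreeSet;

/// The shared variables from the interview question.
#[derive(Clone, Copy)]
struct Vars {
    a: i32,
    b: i32,
    c: i32,
}

/// Statement number `pc` of f1: `a = a * 2`, then `a = b`.
fn f1_step(pc: usize, v: &mut Vars) {
    if pc == 0 { v.a *= 2 } else { v.a = v.b }
}

/// Statement number `pc` of f2: `c = a + 11`, then `a = c`.
fn f2_step(pc: usize, v: &mut Vars) {
    if pc == 0 { v.c = v.a + 11 } else { v.a = v.c }
}

/// Try every interleaving from program counters (pc1, pc2), recording the final `a`.
fn explore(pc1: usize, pc2: usize, v: Vars, out: &mut BTreeSet<i32>) {
    if pc1 == 2 && pc2 == 2 {
        out.insert(v.a);
        return;
    }
    if pc1 < 2 {
        let mut next = v;
        f1_step(pc1, &mut next);
        explore(pc1 + 1, pc2, next, out);
    }
    if pc2 < 2 {
        let mut next = v;
        f2_step(pc2, &mut next);
        explore(pc1, pc2 + 1, next, out);
    }
}

/// Distinct final values of `a`, in ascending order.
fn final_values() -> Vec<i32> {
    let mut out = BTreeSet::new();
    explore(0, 0, Vars { a: 2, b: 0, c: 0 }, &mut out);
    out.into_iter().collect()
}

fn main() {
    println!("{:?}", final_values()); // [0, 11, 13, 15]
}
```

Under that assumption four results are possible, and 11 (the value observed in the question) corresponds to f1 running to completion before f2 starts. Without the assumption, the language gives no guarantee at all about the result.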
I'm not quite sure what this term means. I saw it during a course where we are learning about concurrency. I've seen a lot of definitions for data interleaving, but I couldn't find anything about process interleaving.
When looking at the term my instincts tell me it is the use of threads to run more than one process simultaneously, is that correct?
If you imagine a process as a (possibly infinite) sequence/trace of statements (e.g. obtained by loop unfolding), then the set of possible interleavings of several processes consists of all possible sequences that merge the statements of those processes while keeping each process's own statements in order.
Consider for example the processes
int i;
proctype A() {
i = 1;
}
proctype B() {
i = 2;
}
Then the possible interleavings are i = 1; i = 2 and i = 2; i = 1, i.e. the possible final values for i are 1 and 2. This can be of course more complex, for instance in the presence of guarded statements: Then the next possible statements in an interleaving sequence are not necessarily those at the position of the next program counter, but only those that are allowed by the guard; consider for example the proctype
proctype B() {
if
:: i == 0 -> i = 2
:: else -> skip
fi
}
Then the possible interleavings (given A() as before) are i = 1; skip and i = 2; i = 1, so there is only one possible final value for i.
Indeed the notion of interleavings is crucial for Spin's view of concurrency. In a trace semantics, the set of possible traces of concurrent processes is the set of possible interleavings of the traces of the individual processes.
It simply means performing (data access or execution or ...) in an arbitrary order (see the note below). In the case of concurrency, it usually refers to action interleaving.
If the processes P and Q are in parallel composition (P||Q), then their actions will be interleaved. Consider the following processes:
PLAYING = (play_music -> stop_music -> STOP).
PERFORMING = (dance -> STOP).
||PLAY_PERFORM = (PLAYING || PERFORMING).
So each primitive process can be shown as: (generated by the LTSA model-checking tool)
Then the possible traces as the result of action interleaving will be:
dance -> play_music -> stop_music
play_music -> dance -> stop_music
play_music -> stop_music -> dance
Here is the LTSA tool generated output of this example.
Note: "arbitrary" here means an arbitrary choice of which process executes next, not of the order of the code within a process. The code within each process is always executed sequentially.
If it is still something that you're not comfortable with you can take a look at: https://www.doc.ic.ac.uk/~jnm/book/firstbook/pdf/ch3.pdf
Hope it helps! :)
Operating systems support tasks (or processes). But for now let's think of "activities".
Activities can be executed in parallel. Here are two activities, P and Q:
P: abc
Q: def
a, b, c, d, e, f are operations.
Each operation has always the same effect independent of what other
operations may be executing at the same time (atomicity).
What is the effect of executing the two activities concurrently? We
do not know for sure, but we know that it will be the same as obtained
by executing sequentially an INTERLEAVING of the two activities
[interleavings are also called SCHEDULES]. Here are the possible
interleavings of these two activities:
abcdef
abdcef
abdecf
abdefc
adbcef
......
defabc
That is, the operations of the two activities are sequenced in all possible ways that preserve the order in which the operations appeared in the two activities. A serial interleaving [serial schedule] of two activities is one where all the operations of one activity precede all the operations of the other activity.
The importance of the concept of interleaving is that it allows us to express the meaning of concurrent programs: The parallel execution of activities is equivalent to the sequential execution of one of the interleavings of these activities.
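The definition above translates directly into a short recursive enumeration. The following sketch generates every interleaving of two activities given as strings of one-character operations; the function name interleavings is an assumption, and the byte slicing assumes ASCII operation names.

```rust
/// All interleavings of `p` and `q` that keep each activity's internal order.
/// Assumes one ASCII character per operation.
fn interleavings(p: &str, q: &str) -> Vec<String> {
    if p.is_empty() {
        return vec![q.to_string()];
    }
    if q.is_empty() {
        return vec![p.to_string()];
    }
    let mut result = Vec::new();
    // Either the next operation comes from p ...
    for rest in interleavings(&p[1..], q) {
        result.push(format!("{}{}", &p[..1], rest));
    }
    // ... or it comes from q.
    for rest in interleavings(p, &q[1..]) {
        result.push(format!("{}{}", &q[..1], rest));
    }
    result
}

fn main() {
    let all = interleavings("abc", "def");
    println!(
        "{} interleavings, first: {}, last: {}",
        all.len(),
        all[0],
        all[all.len() - 1]
    );
}
```

For P = abc and Q = def this yields 20 interleavings (the number of ways to choose which 3 of the 6 positions belong to P). The first and last ones produced, abcdef and defabc, are the two serial schedules.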
For detailed information: https://cis.temple.edu/~ingargio/cis307/readings/interleave.html
I need an example of how to program a parallel iter function using OCaml threads. My first idea was to have a function similar to this:
let procs = 4 ;;
let rec _part part i lst = match lst with
    [] -> ()
  | hd::tl ->
      let idx = i mod procs in
      (* Printf.printf "part idx=%i\n" idx; *)
      let accu = part.(idx) in
      part.(idx) <- (hd::accu);
      _part part (i+1) tl ;;
Then a parallel iter could look like this (here as process-based variant):
let iter f lst = let part = Array.create procs [] in
  _part part 0 lst;
  let rec _do i =
    (* Printf.printf "do idx=%i\n" i; *)
    match Unix.fork () with
      0 -> (* Code of child *)
        if i < procs then
          begin
            (* Printf.printf "child %i\n" i; *)
            List.iter f part.(i)
          end
    | pid -> (* Code of father *)
        (* Printf.printf "father %i\n" i; *)
        if i >= procs then ignore (Unix.waitpid [] pid)
        else _do (i+1)
  in
  _do 0 ;;
Because the Thread module is used a little differently, how would I code this using OCaml's Thread module?
And there is another question: the _part function must scan the whole list to split it into n parts, and then each part is handled by its own process (here). Is there a solution that doesn't require splitting the list first?
If you have a function which processes a list, and you want to run it on several lists independently, you can call Thread.create with that function and every list. If you store your lists in array part then:
let threads = Array.map (Thread.create (List.iter f)) part in
Array.iter Thread.join threads
INRIA OCaml threads are not actual threads: only one thread executes at any given time, which means if you have four processors and four threads, all four threads will use the same processor and the other three will remain unused.
Where threads are useful is that they still allow asynchronous programming: some Thread module primitives can wait for an external resource to become available. This can reduce the time your software spends blocked by an unavailable resource, because you can have another thread do something else in the meantime. You can also use this to concurrently start several external asynchronous processes (like querying several web servers through HTTP). If you don't have a lot of resource-related blocking, this is not going to help you.
As for your list-splitting question: to access an element of a list, you must traverse all previous elements. While this traversal could theoretically be split across several threads or processes, the communication overhead would likely make it a lot slower than just splitting things ahead of time in one process. Or using arrays.
Answer to a question from the comments. The answer does not quite fit in a comment itself.
There is a lock on the OCaml runtime. The lock is released when an OCaml thread is about to enter a C function that
may block;
may take a long time.
So you can only have one OCaml thread using the heap, but you can sometimes have non-heap-using C functions working in parallel with it.
See for instance the file ocaml-3.12.0/otherlibs/unix/write.c
memmove (iobuf, &Byte(buf, ofs), numbytes); // if we kept the data in the heap
                                            // the GC might move it from
                                            // under our feet.
enter_blocking_section();                   // release lock.
                                            // Another OCaml thread may
                                            // start in parallel of this one now.
ret = write(Int_val(fd), iobuf, numbytes);
leave_blocking_section();                   // take lock again to continue
                                            // with OCaml code.