Branch prediction and multithreading

Let's suppose a simple if like this:
if (something)
    // do_something
else
    // do_else
Suppose this if-else statement is executed in parallel by several threads, and each thread gets a different result that stays constant for the thread's whole life. For example, in thread 1 the condition always evaluates to false; in thread 2, always true; in thread 3, always true as well; and so on.
Does branch prediction take the execution context of each thread into account when it builds its statistics? If it doesn't (I think it does, but it's difficult to check by testing), the CPU will see the condition following a random pattern and won't be able to predict it at all.

If we ignore SMT (e.g. hyper-threading), most architectures have one branch predictor per hardware thread.
It's tightly coupled with the fetch unit of the individual core. A few (AMD?) store some branch prediction information in the L1/L2 I-cache, but mostly just the target for the next fetch.
So if you don't run your code on an SMT core, you are in heaven and will get essentially 100% correct prediction, at the cost of a few instructions.
If you run your code on an SMT core, you will often find your life is hell, with 50+% mispredicts.
You can solve the problem easily, though; you just need more code: check your condition earlier and call a branch of your code that has do_something or do_else in it.
If you have a loop that calls your function where you have your branch you can do something like:
if (something)
    do_something_loop();
else
    do_else_loop();
void do_something_loop() {
    for (auto x : myVec)
        do_something;
}
This has the disadvantage that you need to maintain 2 nearly equal branches of code.
Or you can put your branch in a function, brancher(), and make it a template function; thanks to the magic of dead code elimination, you should not get any branches in the loops.
C++ concept code:
template<bool b_something>
void brancher() {
    // do things
    if (b_something) {
        // do_something
    } else {
        // do_else
    }
    // do more things
}
void branch_user() {
    if (something) {
        for (auto x : myVec)
            brancher<true>();
    } else {
        for (auto x : myVec)
            brancher<false>();
    }
}
Now you only have to maintain the 2 branches of the outer function which hopefully is less work.
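To make the template idea concrete, here is a minimal compilable sketch. The names accumulate_all and accumulate_dispatch, and the add/subtract loop body, are made up for illustration and are not from the question. It only shows the shape: the condition is tested once, outside the loop, and each instantiation sees a compile-time constant that the compiler can fold away (whether it actually does depends on your compiler and optimization level).

```cpp
#include <vector>

// Hypothetical loop body: kAdd plays the role of `something`.
// Because kAdd is a template parameter, the `if` below tests a
// compile-time constant in each instantiation.
template <bool kAdd>
long accumulate_all(const std::vector<int>& values) {
    long total = 0;
    for (int v : values) {
        if (kAdd) {
            total += v;   // the "do_something" side
        } else {
            total -= v;   // the "do_else" side
        }
    }
    return total;
}

// The only runtime test of the condition happens here, once per call.
long accumulate_dispatch(bool add, const std::vector<int>& values) {
    return add ? accumulate_all<true>(values)
               : accumulate_all<false>(values);
}
```

Each thread would call accumulate_dispatch with its own flag, so the per-thread decision is made once instead of once per element.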

Related

How to ensure the comparison result still holds in multi-threading?

Suppose there are 3 threads,
Thread 1 and 2 will increase or decrease a global variable X atomically.
thread 1:
atomic_increase(X)
thread 2:
atomic_decrease(X)
Thread 3 will check whether X is greater than some predefined value and act accordingly.
thread 3:
if( X > 5 ) {... logic 1 ...}
else {... logic 2 ....}
I think the atomic_xxx operations are not enough. They only synchronize the modifications between threads 1 and 2.
What if X is changed by thread 1 or 2 after thread 3 finishes the comparison and enters logic 1?
Do I have to use a mutex to synchronize all 3 threads when modifying or reading X?
ADD 1
BTW, logic 1 and logic 2 don't modify the X.
In short, yes: reads also need to be synchronized in some way, otherwise there is a real risk of inconsistent reads. If the increment is implemented as a separate read and write rather than as one indivisible operation, a read performed between the two will be inconsistent.
However, if logic 1 or logic 2 does something to X, your problems don't stop there. I think you then need the concept of a transaction, which starts with a read (the X > 5 check) and ends with a write (logic 1 or logic 2).
Yes, and the answer is a happens-before link. Let's say Thread-1 has started executing the atomic_increase method. It will hold the lock and enter the synchronized block to update X.
private void atomic_increase() {
    synchronized (lock) {
        X = X + 1; // <-- Thread-1 entered synchronized block, yet to update variable X
    }
}
Now, for Thread-3 to run its logic, it needs to read the variable X. If that read is not synchronized (on the same monitor), the value it sees can be stale, since Thread-1's update may not be visible yet.
private void runLogic() {
    if (X > 5) { // <-- Reading X here can be inconsistent: no
                 // happens-before between atomic_increase and runLogic
    } else {
    }
}
We could have prevented this by maintaining a happens-before link between the atomic operation and the runLogic method. If runLogic is synchronized (on the same monitor), it has to wait until Thread-1 has updated X, so we are guaranteed to get the last updated value of X.
private void runLogic() {
    synchronized (lock) {
        if (X > 5) { // <-- Reading X here will be consistent, since there
                     // is happens-before between atomic_increase and runLogic
        } else {
        }
    }
}
The answer depends on what your application does. If neither logic 1 nor logic 2 modifies X, it is quite possible that there is no need for additional synchronization (besides using an atomic_load to read X).
I assume you use intrinsics for atomic operations, and not simply an increment in a mutex (or in a synchronized block in Java). E.g. in Java there is an AtomicInteger class with methods such as 'incrementAndGet' and 'get'. If you use them, there is probably no need for additional synchronization, but it depends what you actually want to achieve with logic 1 or logic 2.
If you want to, e.g., display a message when X > 5, you can do that. By the time the message is displayed the value of X may already have changed, but the fact remains that the message was triggered by X being greater than 5 for at least some time.
In other words, without additional synchronization, the only guarantee is that logic 1 runs because X was observed to be greater than 5; there is no guarantee that it stays that way while logic 1 executes. That may or may not be OK for you.
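As an illustration of the "atomic load plus a local decision" approach described above, here is a minimal C++ sketch. It assumes std::atomic<int> stands in for whatever atomic intrinsics the question uses; check_once and the returned strings are hypothetical placeholders for logic 1 and logic 2. The point is only that thread 3 reads X exactly once and then works on its own copy.

```cpp
#include <atomic>
#include <string>

// Shared counter standing in for the question's X.
std::atomic<int> X{0};

// Threads 1 and 2: indivisible read-modify-write operations.
void atomic_increase() { X.fetch_add(1); }
void atomic_decrease() { X.fetch_sub(1); }

// Thread 3: take one snapshot of X and decide on the local copy.
// X itself may change right after the load; the decision reflects
// the value that was observed, not the value "now".
std::string check_once() {
    const int snapshot = X.load();
    if (snapshot > 5) {
        return "logic 1";
    }
    return "logic 2";
}
```

If logic 1 must only run while X is still greater than 5, this is not enough and a mutex held across both the check and the logic (by all three threads) would be needed, as the other answers describe.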

Raku: calling the same function in threads with different parameters

I remember from my college days that threads share resources and memory. I do not know the specifics of Raku's implementation of threads, but if multiple threads call the same global function at the same time with different parameters, will they interfere with one another, given that a global function is a single block of code shared by all the threads? E.g., this example does not show interference, but what about more complicated code?
sub add ($a, $b) { $a + $b };
for 1..100 { start { sleep 1.rand; say "I am $_, {add($_, 1000)}"; } };
You should not have to worry about accessing a global function from multiple threads at the same time, in principle: arguments are passed by value, and parameters are lexical to the function.
There is one exception I can think of: using a state variable inside such a function. There is a known race-condition on the initialization of a state variable, and updates of the form $foo++ will most likely miss increments when being run from multiple threads at the same time. E.g.:
my int $a;
await (^10).map: { start { $a++ for ^100000 } }
say $a; # 893127
Aka, not the 1000000 you'd expect. Fortunately, to handle that case, we have atomic integers:
my atomicint $a;
await (^10).map: { start { $a⚛++ for ^100000 } }
say $a; # 1000000
But that's just showing off and not directly an answer to your question :-)
Should you have a piece of code that you want only one thread to execute at a time, you could use a Lock and its protect method:
my $lock = Lock.new; # usually in the mainline of a program
# ... code
$lock.protect: {
# code executed by only 1 thread at a time
}
Please note that this is considered to be "plumbing", aka use this only when you need to, as it opens you up to deadlocks.

blockingForEach(), why apply function to blocked observables

I'm having trouble understanding the point of a blocking Observable, specifically blockingForEach()
What is the point in applying a function to an Observable that we will never see?? Below, I'm attempting to have my console output in the following order
this is the integer multiplied by two:2
this is the integer multiplied by two:4
this is the integer multiplied by two:6
Statement comes after multiplication
My current method prints the statement before the multiplication
fun rxTest(){
    val observer1 = Observable.just(1,2,3).observeOn(AndroidSchedulers.mainThread())
    val observer2 = observer1.map { response -> response * 2 }
    observer2
        .observeOn(AndroidSchedulers.mainThread())
        .subscribeOn(AndroidSchedulers.mainThread())
        .subscribe{ it -> System.out.println("this is the integer multiplied by two:" + it) }
    System.out.println("Statement comes after multiplication ")
}
Now I have changed my method to include blockingForEach()
fun rxTest(){
    val observer1 = Observable.just(1,2,3).observeOn(AndroidSchedulers.mainThread())
    val observer2 = observer1.map { response -> response * 2 }
    observer2
        .observeOn(AndroidSchedulers.mainThread())
        .subscribeOn(AndroidSchedulers.mainThread())
        .blockingForEach { it -> System.out.println("this is the integer multiplied by two:" + it) }
    System.out.println("Statement comes after multiplication ")
}
1.) What happens to the transformed observables once they are no longer blocking? Wasn't that just unnecessary work, since we never see those Observables?
2.) Why does my System.out.println("Statement...") output appear before my observables' output when I'm subscribing? It's like observer2 skips its blocking method, makes the System.out call and then resumes its subscription.
It's not clear what you mean by your statement that you will "never see" values emitted by an observer chain. Each value that is emitted in the observer chain is seen by the observers downstream from the point where it is emitted. The point where you subscribe to the observer chain is the usual place to perform a side effect, such as printing a value or storing it into a variable. Thus, the values are always seen.
In your examples, you are getting confused by how the schedulers work. When you use the observeOn() or subscribeOn() operators, you are telling the observer chain to emit values only after they have been moved onto a different thread. When you move data between threads, the destination thread has to be able to process the data. If your main code is running on that same thread, you can lock yourself out or re-order operations.
Normally, the use of blocking operations is strongly discouraged. Blocking operations can often be used when testing, because you have full control of the consequences. There are a couple of other situations where blocking may make sense. An example would be an application that requires access to a database or other resource; the application has no purpose without that resource, so it blocks until it becomes available or a timeout occurs, kicking it out.

Multithread+Recursion strategies

I am just starting to learn the ins and outs of multithreaded programming and have a few basic questions that, once answered, should keep me occupied for quite some time. I understand that multithreading loses its effectiveness once you have created more threads than there are cores (due to context switching and cache flushing). With that understood, I can think of two ways to multithread a recursive function, but I am not quite sure what the common way to approach the problem is. One seems much more complicated, perhaps with a higher payoff, but that's what I hope you will be able to tell me.
Below is pseudo-code for two different methods of multithreading a recursive function. I have used the terminology of merge sort for simplicity, but it's not that important. It is easy to see how to generalize the methods to other problems. Also, I will personally be employing these methods using the pthreads library in C, so the thread syntax mildly reflects this.
Method 1:
main ()
{
    A = array of length N
    NUM_CORES = get number of functional cores
    chunk[NUM_CORES] = array of indices partitioning A into (N / NUM_CORES) sized chunks
    thread_id[NUM_CORES] = array of thread id’s
    thread[NUM_CORES] = array of thread type
    //start NUM_CORES threads on working on each chunk of A
    for i = 0 to (NUM_CORES - 1) {
        thread_id[i] = thread_start(thread[i], MergeSort, chunk[i])
    }
    //wait for all threads to finish
    //Merge chunks appropriately
    exit
}
MergeSort ( chunk )
{
    MergeSort ( lowerSubChunk )
    MergeSort ( higherSubChunk )
    Merge(lowerSubChunk, higherSubChunk)
}
//Merge(,) not shown
Method 2:
main ()
{
    A = array of length N
    NUM_CORES = get number of functional cores
    chunk = indices 0 and N
    thread_id[NUM_CORES] = array of thread id’s
    thread[NUM_CORES] = array of thread type
    //lock variable aka mutex
    THREADS_IN_USE = 1
    MergeSort( chunk )
    exit
}
MergeSort ( chunk )
{
    lock THREADS_IN_USE
    if ( THREADS_IN_USE < NUM_CORES ) {
        FREE_CORE = find index of unused core
        thread_id[FREE_CORE] = thread_start(thread[FREE_CORE], MergeSort, lowerSubChunk)
        THREADS_IN_USE++
        unlock THREADS_IN_USE
        MergeSort( higherSubChunk )
        //wait for thread_id[FREE_CORE] and current thread to finish
        lock THREADS_IN_USE
        THREADS_IN_USE--
        unlock THREADS_IN_USE
        Merge(lowerSubChunk, higherSubChunk)
    }
    else {
        unlock THREADS_IN_USE
        MergeSort( lowerSubChunk )
        MergeSort( higherSubChunk )
        Merge(lowerSubChunk, higherSubChunk)
    }
}
//Merge(,) not shown
Visually, one can think of the differences between these two methods as follows:
Method 1: creates NUM_CORES separate recursion trees, each one having a single core traversing it.
Method 2: creates a single recursion tree but has all cores traversing it. In particular, whenever there is a free core, it is set to work on the "left child subtree" of the first node where MergeSort is called after the core is freed.
The problem with Method 1 is that if it is the case that the running time of the recursive function varies with the distribution of values within each initial subchunk (i.e. the chunk[i]), one thread could finish much faster leaving a core sitting idle while the others finish. With Merge Sort this is not likely to be the case since the work of MergeSort happens in Merge whose runtime isn't affected much by the distribution of values in the (sorted) subchunks. However, with a more involved recursive function, the running time on one subchunk could be much longer!
With Method 2 it is possible to have the same problem. Again, with merge sort it's not clear, since the running time for each subchunk is likely to be similar, but the line //wait for thread_id[FREE_CORE] and current thread to finish would also require one core to wait for the other. However, with Method 2, all calls to Merge run ASAP, as opposed to Method 1, where one must wait for NUM_CORES calls to MergeSort to finish and then do NUM_CORES - 1 merges afterward (although you can multithread this as well...to an extent).
(though the syntax might not be completely correct)
Are both of these methods used in practice? Are there situations where one is more beneficial than the other? Is this the correct way to implement Method 2? (In this case, is THREADS_IN_USE a semaphore?)
Thanks so much for your help!
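For reference, Method 2 can be sketched in C++ roughly as follows. This is a minimal sketch under assumptions, not a tuned implementation and not the asker's pthreads code: std::thread stands in for pthreads, a std::atomic<int> counter replaces the mutex-protected THREADS_IN_USE, std::inplace_merge stands in for the Merge that is not shown, and merge_sort, threads_in_use and max_threads are names made up for the example.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Threads currently working, counting the caller
// (the question's THREADS_IN_USE, starting at 1).
std::atomic<int> threads_in_use{1};

// Sorts the half-open range a[lo, hi).
// max_threads plays the role of NUM_CORES.
void merge_sort(std::vector<int>& a, std::size_t lo, std::size_t hi,
                int max_threads) {
    assert(lo <= hi && hi <= a.size());
    if (hi - lo < 2) return;
    const std::size_t mid = lo + (hi - lo) / 2;

    // Try to reserve a slot; fetch_add returns the previous count.
    const bool spawn = threads_in_use.fetch_add(1) < max_threads;
    if (spawn) {
        // Lower half on a new thread, upper half on this one.
        std::thread lower(merge_sort, std::ref(a), lo, mid, max_threads);
        merge_sort(a, mid, hi, max_threads);
        lower.join();                 // wait for the other half
        threads_in_use.fetch_sub(1);  // release the slot
    } else {
        threads_in_use.fetch_sub(1);  // budget exhausted: undo the reservation
        merge_sort(a, lo, mid, max_threads);
        merge_sort(a, mid, hi, max_threads);
    }
    // The two halves are disjoint, so no locking is needed on `a` itself.
    std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}
```

A call such as merge_sort(v, 0, v.size(), 4) sorts v with at most 4 threads alive at a time. Note that a thread which finishes early is only reused when some later recursive call happens to find a free slot, which is the same idle-core concern raised in the question.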

table of functions vs switch in golang

i am writing a simple emulator in go (should i? or should i go back to c?).
anyway, i am fetching the instruction and decoding it. at this point i have a byte like 0x81, and i have to execute the right function.
should i have something like this
func (sys *cpu) eval() {
    switch opcode {
    case 0x80:
        sys.add(sys.b)
    case 0x81:
        sys.add(sys.c)
    // etc.
    }
}
or something like this
var fnTable = []func(*cpu) {
    0x80: func(sys *cpu) {
        sys.add(sys.b)
    },
    0x81: func(sys *cpu) {
        sys.add(sys.c)
    },
}
func (sys *cpu) eval() {
    fnTable[opcode](sys)
}
1.which one is better?
2.which one is faster?
also
3.can i declare a function inline?
4.i have a cpu struct in which i have the registers etc. would it be faster if i have the registers and all as globals? (without the struct)
thank you very much.
I did some benchmarks and the table version is faster than the switch version once you have more than about 4 cases.
I was surprised to discover that the Go compiler (gc, anyway; not sure about gccgo) doesn't seem to be smart enough to turn a dense switch into a jump table.
Update:
Ken Thompson posted on the Go mailing list describing the difficulties of optimizing switch.
1. The first version looks better to me, YMMV.
2. Benchmark it. It depends on how good the compiler is at optimizing. The "jump table" version might be faster if the compiler doesn't try hard enough to optimize.
3. It depends on what you mean by "declaring a function inline". Go can declare and define functions/methods at the top level only. But functions are first-class citizens in Go, so one can have variables, parameters, return values and structured types of function type. In all these places a function literal can [also] be assigned to the variable/field/element...
4. Possibly. Still, I would suggest not keeping the cpu state in a global variable. Once you decide to go emulating multicore, it will be welcome ;-)
If you have the AST of some expression and you want to eval it for a large number of data rows, you can compile it once into a tree of lambdas and not evaluate any switches on each iteration at all.
For example, given this AST: {* (a, {+ (b, c)})}
The compile function (in a very rough pseudo-language) will be something like this:
func (e *evaluator) compile(branch ast) {
    switch branch.type {
    case binaryOperator:
        l, r := e.compile(branch.arg0), e.compile(branch.arg1) // compile the operands once
        switch branch.op {
        case *: return func() {return l() * r()}
        case +: return func() {return l() + r()}
        }
    case BasicLit: return func() {return branch.arg0}
    case Ident: return func() {return e.GetIdent(branch.arg0)}
    }
}
So eventually compile returns a func that must be called on each row of your data, and there will be no switches or other calculation stuff at all.
There remains the question of operations on data of different types; that is for your own research ;)
This is an interesting approach in situations where no jump-table mechanism is available :) but sure, a func call is a more complex operation than a jump.

Resources