Calling the same function in threads with different parameters

I remember from back in my college days that threads share resources and memory. I do not know the specifics of Raku's implementation of threads, but if multiple threads call the same global function with different parameters at the same time, will they interfere with one another because a global function is a single block of code shared by all the threads? E.g., this example does not show interference, but what about more complicated code?
sub add ($a, $b) { $a + $b };
for 1..100 { start { sleep 1.rand; say "I am $_, {add($_, 1000)}"; } };

You should not have to worry about accessing a global function from multiple threads at the same time, in principle: arguments are passed by value, and parameters are lexical to the function.
There is one exception I can think of: using a state variable inside such a function. There is a known race-condition on the initialization of a state variable, and updates of the form $foo++ will most likely miss increments when being run from multiple threads at the same time. E.g.:
my int $a;
await (^10).map: { start { $a++ for ^100000 } }
say $a; # 893127
Aka, not the 1000000 you'd expect. Fortunately, to handle that case, we have atomic integers:
my atomicint $a;
await (^10).map: { start { $a⚛++ for ^100000 } }
say $a; # 1000000
But that's just showing off and not directly an answer to your question :-)
Should you have a piece of code that you want to make sure only one thread executes at a time, you could use a Lock and the protect method on it:
my $lock = Lock.new; # usually in the mainline of a program
# ... code
$lock.protect: {
    # code executed by only 1 thread at a time
}
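As a sketch of how that might look (not part of the original answer): the lossy counter from before can also be made to reach 1000000 by serializing each increment with protect, far more slowly than with an atomicint, but it illustrates the locking:
my $lock = Lock.new;
my int $a;
await (^10).map: { start { for ^100000 { $lock.protect: { $a++ } } } }
say $a; # 1000000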
Please note that this is considered to be "plumbing", aka use this only when you need to, as it opens you up to deadlocks.

Related

Understanding Raku's `&?BLOCK` compile-time variable

I really appreciate Raku's &?BLOCK variable – it lets you recurse within an unnamed block, which can be extremely powerful. For example, here's a simple, inline, and anonymous factorial function:
{ when $_ ≤ 1 { 1 };
$_ × &?BLOCK($_ - 1) }(5) # OUTPUT: «120»
However, I have some questions about it when used in more complex situations. Consider this code:
{ say "Part 1:";
my $a = 1;
print ' var one: '; dd $a;
print ' block one: '; dd &?BLOCK ;
{
my $a = 2;
print ' var two: '; dd $a;
print ' outer var: '; dd $OUTER::a;
print ' block two: '; dd &?BLOCK;
print "outer block: "; dd &?OUTER::BLOCK
}
say "\nPart 2:";
print ' block one: '; dd &?BLOCK;
print 'postfix for: '; dd &?BLOCK for (1);
print ' prefix for: '; for (1) { dd &?BLOCK }
};
which yields this output (I've shortened the block IDs):
Part 1:
var one: Int $a = 1
block one: -> ;; $_? is raw = OUTER::<$_> { #`(Block|…6696) ... }
var two: Int $a = 2
outer var: Int $a = 1
block two: -> ;; $_? is raw = OUTER::<$_> { #`(Block|…8496) ... }
outer block: -> ;; $_? is raw = OUTER::<$_> { #`(Block|…8496) ... }
Part 2:
block one: -> ;; $_? is raw = OUTER::<$_> { #`(Block|…6696) ... }
postfix for: -> ;; $_ is raw { #`(Block|…9000) ... }
prefix for: -> ;; $_ is raw { #`(Block|…9360) ... }
Here's what I don't understand about that: why does the &?OUTER::BLOCK refer (based on its ID) to block two rather than block one? Using OUTER with $a correctly causes it to refer to the outer scope, but the same thing doesn't work with &?BLOCK. Is it just not possible to use OUTER with &?BLOCK? If not, is there a way to access the outer block from the inner block? (I know that I can assign &?BLOCK to a named variable in the outer block and then access that variable in the inner block. I view that as a workaround but not a full solution because it sacrifices the ability to refer to unnamed blocks, which is where much of &?BLOCK's power comes from.)
Second, I am very confused by Part 2. I understand why the &?BLOCK that follows the prefix for refers to an inner block. But why does the &?BLOCK that precedes the postfix for also refer to its own block? Is a block implicitly created around the body of the for statement? My understanding is that the postfix forms were useful in large part because they do not require blocks. Is that incorrect?
Finally, why do some of the blocks have OUTER::<$_> in them but others do not? I'm especially confused by Block 2, which is not the outermost block.
Thanks in advance for any help you can offer! (And if any of the code behavior shown above indicates a Rakudo bug, I am happy to write it up as an issue.)
That's some pretty confusing stuff you've encountered. That said, it does all make some kind of sense...
Why does the &?OUTER::BLOCK refer (based on its ID) to block two rather than block one?
Per the doc, &?BLOCK is a "special compile variable", as is the case for all variables that have a ? as their twigil.
As such it's not a symbol that can be looked up at run-time, which is what syntax like $FOO::bar is supposed to be about afaik.
So I think the compiler ought by rights reject use of a "compile variable" with the package lookup syntax. (Though I'm not sure. Does it make sense to do "run-time" lookups in the COMPILING package?)
There may already be a bug filed (in either of the GH repos rakudo/rakudo/issues or raku/old-issues-tracker/issues) about it being erroneous to try to do a run-time lookup of a special compile variable (the ones with a ? twigil). If not, it makes sense to me to file one.
Using OUTER with $a correctly causes it to refer to the outer scope
The symbol associated with the $a variable in the outer block is stored in the stash associated with the outer block. This is what's referenced by OUTER.
Is it just not possible to use OUTER with &?BLOCK?
I reckon not for the reasons given above. Let's see if anyone corrects me.
If not, is there a way to access the outer block from the inner block?
You could pass it as an argument. In other words, close the inner block with }(&?BLOCK); instead of just }. Then you'd have it available as $_ in the inner block.
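A minimal sketch of that suggestion, with the outer Block arriving in the inner block as $_:
{
    dd &?BLOCK;   # the outer block
    {
        dd $_;    # the same Block object, received as the argument
    }(&?BLOCK);
}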
Why does the &?BLOCK that precedes the postfix for also refer to its own block?
It is surprising until you know why, but...
Is a block implicitly created around the body of the for statement?
Seems so, so the body can take an argument passed by each iteration of the for.
My understanding is that the postfix forms were useful in large part because they do not require blocks.
I've always thought of their benefit as being that they A) avoid a separate lexical scope and B) avoid having to type in the braces.
Is that incorrect?
It seems so. for has to be able to supply a distinct $_ to its statement(s) (you can put a series of statements in parens), so if you don't explicitly write braces, it still has to create a distinct lexical frame, and presumably it was considered better that the &?BLOCK variable tracked that distinct frame with its own $_, and "pretended" that was a "block", and displayed its gist with a {...}, despite there being no explicit {...}.
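A small sketch (not from the question) showing that the postfix for's statement gets its own $_, distinct from the enclosing one:
$_ = 'outer';
say $_ for <a b c>;  # a␤b␤c␤ (the for statement's own $_)
say $_;              # outer  (the enclosing block's $_ is untouched)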
Why do some of the blocks have OUTER::<$_> in them but others do not?
While for (and given etc) always passes an "it" aka $_ argument to its blocks/statements, other blocks do not have an argument automatically passed to them, but they will accept one if the writer of the code passes one manually.
To support this wonderful idiom in which one can either pass or not pass an argument, blocks other than ones that are automatically fed an $_ are given this default of binding $_ to the outer block's $_.
I'm especially confused by Block 2, which is not the outermost block.
I'm confused by you being especially confused by that. :) If the foregoing hasn't sufficiently cleared this last aspect up for you, please comment on what it is about this last bit that's especially confusing.
During compilation the compiler has to keep track of various things, one of which is the current block that it is compiling.
The block object gets stored in the compiled code wherever the compiler sees the special variable $?BLOCK.
Basically the compile-time variables aren't really variables, but more like macros.
So whenever the compiler sees $?BLOCK, it replaces it with whatever block it is currently compiling.
It just happens that $?OUTER::BLOCK is somehow close enough to $?BLOCK that it replaces that too.
I can show you that there really isn't a variable by that name by trying to look it up by name.
{ say ::('&?BLOCK') } # ERROR: No such symbol '&?BLOCK'
Also every pair of {} (that isn't a hash ref or hash index) denotes a new block.
So each of these lines will say something different:
{
    say $?BLOCK.WHICH;
    say "{ $?BLOCK.WHICH }";
    if True { say $?BLOCK.WHICH }
}
That means if you declare a variable inside one of those constructs it is confined to that construct.
"{ my $a = "abc"; say $a }"; # abc
say $a; # COMPILE ERROR: Variable '$a' is not declared
if True { my $b = "def"; say $b } # def
say $b; # COMPILE ERROR: Variable '$b' is not declared
In the case of postfix for, the left side needs to be a lambda/closure so that for can set $_ to the current value.
It was probably just easier to fake it up to be a Block than to create a new Code type just for that use.
Especially since an entire Raku source file is also considered a Block.
A bare Block can have an optional argument.
my &foo;
given 5 {
    &foo = { say $_ }
}
foo( );  # 5
foo(42); # 42
If you give it an argument it sets $_ to that value.
If you don't, $_ will point to whatever $_ was outside of that declaration. (Closure)
For many of the uses of that construct, doing that can be very handy.
sub call-it-a (&c){
    c()
}
sub call-it-b (&c, $arg){
    c( $arg * 10 )
}

for ^5 {
    call-it-a( { say $_ } );     # 0␤ 1␤ 2␤ 3␤ 4␤
    call-it-b( { say $_ }, $_ ); # 0␤10␤20␤30␤40␤
}
For call-it-a we needed it to be a closure over $_ to work.
For call-it-b we needed it to be an argument instead.
By having :( ;; $_? is raw = OUTER::<$_> ) as the signature it caters to both use-cases.
This makes it easy to create simple lambdas that just do what you want them to do.

What is the difference between this "atomic" Rust code and its "non-atomic" counterpart?

I'm fairly new to Rust. I graduated with a Computer Engineering degree 4 years ago, and I remember discussing (and understanding) atomic operations in my Operating Systems course. However, since graduating, I've been working primarily in high-level languages where I haven't had to care about low-level stuff like atomics. Now that I'm getting into Rust, I'm struggling to remember how a lot of this stuff works.
I'm currently trying to understand the source code for the hibitset library, specifically atomic.rs.
This module specifies an AtomicBitSet type which corresponds to the BitSet type from lib.rs, but using atomic values and operations. From my understanding, an "atomic operation" is an operation that is guaranteed to not be interrupted by another thread; any "load" or "store" on the same value will have to wait for the operation to finish before proceeding. Following from this definition, an "atomic value" is a value whose operations are fully atomic. AtomicBitSet uses AtomicUsize, which is a usize wrapper where all methods are fully atomic. However, AtomicBitSet specifies several operations that seem to not be atomic (add and remove), and there is one atomic operation: add_atomic. Looking at add vs add_atomic, I can't really tell what the difference is.
Here is add (verbatim):
/// Adds `id` to the `BitSet`. Returns `true` if the value was
/// already in the set.
#[inline]
pub fn add(&mut self, id: Index) -> bool {
    use std::sync::atomic::Ordering::Relaxed;
    let (_, p1, p2) = offsets(id);
    if self.layer1[p1].add(id) {
        return true;
    }
    self.layer2[p2].store(self.layer2[p2].load(Relaxed) | id.mask(SHIFT2), Relaxed);
    self.layer3
        .store(self.layer3.load(Relaxed) | id.mask(SHIFT3), Relaxed);
    false
}
This method calls load() and store() directly. I'm assuming that the fact that it's using Ordering::Relaxed is what makes this method non-atomic, because another thread doing the same thing to a different index might clobber this operation.
Here is add_atomic (verbatim):
/// Adds `id` to the `AtomicBitSet`. Returns `true` if the value was
/// already in the set.
///
/// Because we cannot safely extend an AtomicBitSet without unique ownership
/// this will panic if the Index is out of range.
#[inline]
pub fn add_atomic(&self, id: Index) -> bool {
    let (_, p1, p2) = offsets(id);
    // While it is tempting to check of the bit was set and exit here if it
    // was, this can result in a data race. If this thread and another
    // thread both set the same bit it is possible for the second thread
    // to exit before l3 was set. Resulting in the iterator to be in an
    // incorrect state. The window is small, but it exists.
    let set = self.layer1[p1].add(id);
    self.layer2[p2].fetch_or(id.mask(SHIFT2), Ordering::Relaxed);
    self.layer3.fetch_or(id.mask(SHIFT3), Ordering::Relaxed);
    set
}
This method uses fetch_or instead of calling load and store directly, which I'm assuming is what makes this method atomic.
But why does the usage of Ordering::Relaxed still allow this to be considered atomic? I realize that the individual "or" operations are atomic, but the full method could be run at the same time as another thread. Wouldn't that have an impact?
Moreover, why would a type like this expose non-atomic methods? Is it just for performance? That seems confusing to me. If I were to pick an AtomicBitSet over a BitSet because it's going to be used by more than one thread, I'd probably want to only use atomic operations on it. If I didn't I wouldn't be using it. Right?
I'd also love an explanation of the comment inside add_atomic. As-is it does not make sense to me. Doesn't the non-atomic version still have to care about that? It seems like the two methods are doing effectively the same thing, just with different levels of atomicity.
I'd really just love some help wrapping my head around atomics. I think I understand ordering after reading this and this, but both are still using concepts that I don't understand. When they talk about one thread "seeing" something from another, what does that mean exactly? When it's said that sequentially-consistent operations have the same order "across all threads" what does that even mean? Does the processor change the instruction order differently for different threads?
In the non-atomic case, this line:
self.layer2[p2].store(self.layer2[p2].load(Relaxed) | id.mask(SHIFT2), Relaxed);
is more or less equivalent to:
let tmp1 = self.layer2[p2];
let tmp2 = tmp1 | id.mask(SHIFT2);
self.layer2[p2] = tmp2;
so another thread could change self.layer2[p2] between the moment it is read into tmp1 and the moment tmp2 is stored into it. So if another thread tries to set another bit at the same time, there is a risk that the following sequence occurs:
thread 1 reads an empty mask,
thread 2 reads an empty mask,
thread 1 sets bit 1 of the mask and writes it,
thread 2 sets bit 2 of the mask and writes it, thus overwriting the value set by thread 1,
in the end only bit 2 is set!
The same goes for self.layer3.
In the atomic case, the use of fetch_or guarantees that the whole read-modify-write cycle is atomic.
In both cases, since the ordering is relaxed, the writes to layer2 and layer3 may seem to occur in any order as seen from other threads.
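As a standalone sketch (not taken from hibitset), here eight threads each set one bit of a shared mask; the commented-out load/store version can lose bits, while fetch_or, even with Relaxed ordering, always ends up with all eight bits set:
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};
use std::thread;

fn main() {
    static MASK: AtomicUsize = AtomicUsize::new(0);

    let handles: Vec<_> = (0..8)
        .map(|bit| {
            thread::spawn(move || {
                // Non-atomic read-modify-write: another thread may store between
                // our load and our store, and its bit would then be overwritten:
                // MASK.store(MASK.load(Relaxed) | (1 << bit), Relaxed);

                // Atomic read-modify-write: the whole cycle is one indivisible step.
                MASK.fetch_or(1 << bit, Relaxed);
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!("{:08b}", MASK.load(Relaxed)); // always 11111111 with fetch_or
}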
The comment inside add_atomic is meant to avoid an issue when two threads try to add the same bit. Assume that add_atomic was written like this:
pub fn add_atomic(&self, id: Index) -> bool {
    let (_, p1, p2) = offsets(id);
    if self.layer1[p1].add(id) {
        return true;
    }
    self.layer2[p2].fetch_or(id.mask(SHIFT2), Ordering::Relaxed);
    self.layer3.fetch_or(id.mask(SHIFT3), Ordering::Relaxed);
    false
}
Then you risk the following sequence:
thread 1 sets bit 1 in layer1 and sees that it wasn't set beforehand,
thread 2 tries to set bit 1 in layer1 and sees that thread 1 already set it, so thread 2 returns from add_atomic,
thread 2 executes another operation that requires reading layer3, but layer3 has not been updated yet, so thread 2 gets a wrong value!
thread 1 updates layer3, but it is too late.
This is why the add_atomic case ensures that layer2 and layer3 are set properly in all threads even if it looked like the bit was already set beforehand.

Branch prediction and multithreading

Let's suppose a simple if like this:
if (something)
    // do_something
else
    // do_else
Suppose that this if-else statement is executed in parallel in different threads, with each thread yielding a different result that stays constant throughout its life. For example, in thread 1 the condition is always evaluated as false; in thread 2, always true; in thread 3, always true as well; and so on.
Does branch prediction consider the execution context of each thread to make its statistics? Because if it doesn't (I don't think so, but it's difficult to check by testing), the CPU will see the condition following a random pattern and won't be able to predict it at all.
If we ignore SMT (e.g. hyper-threading), most architectures have a branch predictor per hardware thread.
It's tightly coupled with the fetch unit of the individual core. A few (AMD?) store some branch-prediction information in the L1/L2 I-cache, but mostly just the target for the next fetch.
So if you don't run your code on SMT you are in heaven and will get it 100% predicted every time, at the cost of a few instructions.
If you run your code on SMT you will often find your life is hell, with 50%+ mispredictions.
Now you can solve your problem easily; it just takes more code: check your condition earlier and call a version of your code with do_something or do_else in it.
If you have a loop that calls your function where you have your branch you can do something like:
if (something)
    do_something_loop();
else
    do_else_loop();

void do_something_loop() {
    for (auto x : myVec)
        do_something;
}
This has the disadvantage that you need to maintain 2 nearly equal branches of code.
Or you can put your branch in a function (brancher() in the concept code below), make it a template function, and thanks to the magic of dead code elimination you should not get any branches in the loops.
C++ concept code:
template<bool b_something>
void brancher() {
    // do things
    if (b_something) {
        // do_something
    } else {
        // do_else
    }
    // do more things
}

void branch_user() {
    if (something) {
        for (auto x : myVec)
            brancher<true>();
    } else {
        for (auto x : myVec)
            brancher<false>();
    }
}
Now you only have to maintain the 2 branches of the outer function which hopefully is less work.

Multithread+Recursion strategies

I am just starting to learn the ins and outs of multithreaded programming and have a few basic questions that, once answered, should keep me occupied for quite some time. I understand that multithreading loses its effectiveness once you have created more threads than there are cores (due to context switching and cache flushing). With that understood, I can think of two ways to employ multithreading in a recursive function... but am not quite sure what the common way to approach the problem is. One seems much more complicated, perhaps with a higher payoff... but that's what I hope you will be able to tell me.
Below is pseudo-code for two different methods of multithreading a recursive function. I have used the terminology of merge sort for simplicity, but it's not that important. It is easy to see how to generalize the methods to other problems. Also, I will personally be employing these methods using the pthreads library in C, so the thread syntax mildly reflects this.
Method 1:
main ()
{
    A = array of length N
    NUM_CORES = get number of functional cores
    chunk[NUM_CORES] = array of indices partitioning A into (N / NUM_CORES) sized chunks
    thread_id[NUM_CORES] = array of thread id's
    thread[NUM_CORES] = array of thread type

    // start NUM_CORES threads, each working on one chunk of A
    for i = 0 to (NUM_CORES - 1) {
        thread_id[i] = thread_start(thread[i], MergeSort, chunk[i])
    }

    // wait for all threads to finish
    // merge chunks appropriately
    exit
}

MergeSort ( chunk )
{
    MergeSort ( lowerSubChunk )
    MergeSort ( higherSubChunk )
    Merge(lowerSubChunk, higherSubChunk)
}

// Merge(,) not shown
Method 2:
main ()
{
    A = array of length N
    NUM_CORES = get number of functional cores
    chunk = indices 0 and N
    thread_id[NUM_CORES] = array of thread id's
    thread[NUM_CORES] = array of thread type

    // lock variable aka mutex
    THREADS_IN_USE = 1

    MergeSort( chunk )
    exit
}

MergeSort ( chunk )
{
    lock THREADS_IN_USE
    if ( THREADS_IN_USE < NUM_CORES ) {
        FREE_CORE = find index of unused core
        thread_id[FREE_CORE] = thread_start(thread[FREE_CORE], MergeSort, lowerSubChunk)
        THREADS_IN_USE++
        unlock THREADS_IN_USE

        MergeSort( higherSubChunk )

        // wait for thread_id[FREE_CORE] and current thread to finish
        lock THREADS_IN_USE
        THREADS_IN_USE--
        unlock THREADS_IN_USE

        Merge(lowerSubChunk, higherSubChunk)
    }
    else {
        unlock THREADS_IN_USE
        MergeSort( lowerSubChunk )
        MergeSort( higherSubChunk )
        Merge(lowerSubChunk, higherSubChunk)
    }
}

// Merge(,) not shown
Visually, one can think of the differences between these two methods as follows:
Method 1: creates NUM_CORES separate recursion trees, each one having a single core traversing it.
Method 2: creates a single recursion tree but has all cores traversing it. In particular, whenever there is a free core, it is set to work on the "left child subtree" of the first node where MergeSort is called after the core is freed.
The problem with Method 1 is that if it is the case that the running time of the recursive function varies with the distribution of values within each initial subchunk (i.e. the chunk[i]), one thread could finish much faster leaving a core sitting idle while the others finish. With Merge Sort this is not likely to be the case since the work of MergeSort happens in Merge whose runtime isn't affected much by the distribution of values in the (sorted) subchunks. However, with a more involved recursive function, the running time on one subchunk could be much longer!
With Method 2 it is possible to have the same problem. Again, with merge sort it's not clear, since the running time for each subchunk is likely to be similar, but the line //wait for thread_id[FREE_CORE] and current thread to finish would also require one core to wait for the other. However, with Method 2 all calls to Merge run ASAP, as opposed to Method 1, where one must wait for NUM_CORES calls to MergeSort to finish and then do NUM_CORES - 1 merges afterward (although you can multithread this as well... to an extent).
Are both of these methods used in practice? Are there situations where one is more beneficial than the other? Is this the correct way to implement Method 2, even though the syntax might not be completely correct? (In that case, is THREADS_IN_USE a semaphore?)
Thanks so much for your help!

when "if" is written at the end of line in perl, what is the scope of it

This question actually comes from using threads. We know that in Perl threads we have a function called lock, and according to http://perldoc.perl.org/threads/shared.html, lock places an advisory lock on a variable until the lock goes out of scope. OK, what if we write something like this:
1 sub foo {
2     lock($obj) if threads::shared::is_shared($obj); # equivalent to if (threads::shared::is_shared($obj)) { lock($obj); } ?
3     ... rest of the code
4     ... more code
5 }
So is the scope of the lock from line 2 to line 4, or just line 2? If the "if" statement adds a block around it, then lock($obj) may only last for line 2; see my comment in the code.
The question has been answered, but I want to add some findings:
I found that no matter whether you write
lock($obj) if threads::shared::is_shared($obj);
or
if (threads::shared::is_shared($obj)) {
    lock($obj);
}
the scope of the lock is the same: the whole foo() subroutine.
The if statement modifier doesn't put an implicit block around the statement it applies to. So the scope of the lock (if it is applied) is the whole of the rest of your subroutine.
From the very document you linked in the question:
my $var :shared;
{
    lock($var);
    # $var is locked from here to the end of the block
    ...
}
# $var is now unlocked
So the lock lasts until the end of the block.
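If you want the lock to cover only part of the subroutine rather than the whole of it, one option is to wrap that part in a bare block; a minimal sketch (the variable names are just illustrative):
use threads;
use threads::shared;

my $obj :shared = 0;

sub foo {
    # ... code that runs without holding the lock ...
    {
        lock($obj) if threads::shared::is_shared($obj);
        # if the lock was taken, it is held until this bare block ends
    }
    # the lock (if any) has been released here
    # ... more code ...
}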

Resources