How to ensure the comparison result still holds in multithreading?

Suppose there are 3 threads. Threads 1 and 2 will increase or decrease a global variable X atomically.
thread 1:
atomic_increase(X)
thread 2:
atomic_decrease(X)
Thread 3 will check whether X is greater than some predefined value and act accordingly.
thread 3:
if( X > 5 ) {... logic 1 ...}
else {... logic 2 ....}
I think the atomic_xxx operations are not enough. They can only synchronize the modifications between threads 1 and 2.
What if X is changed by thread 1 or 2 after thread 3 finishes the comparison and enters logic 1?
Do I have to use a mutex to synchronize all three threads when modifying or reading X?
Update 1: By the way, logic 1 and logic 2 don't modify X.

In short, yes: reads also need to be synchronized in some way, otherwise the risk of inconsistent reads is real. A read performed between the read and the write of atomic_increase will be inconsistent.
However, if logic 1 or logic 2 does anything with X, your problems don't stop there. In that case you need the concept of a transaction, which starts with a read (the X > 5 check) and ends with a write (logic 1 or logic 2).
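For the simpler case, where the logic only needs X to hold still while it runs, a minimal sketch (in Java, with illustrative names) is to take the same lock around the read-plus-logic that the writers take:

class Counter {
    private final Object lock = new Object();
    private int x = 0;

    void increase() {                  // thread 1
        synchronized (lock) { x++; }
    }

    void decrease() {                  // thread 2
        synchronized (lock) { x--; }
    }

    void runLogic() {                  // thread 3
        // Taking the same lock as the writers means X cannot change
        // while logic 1 or logic 2 executes.
        synchronized (lock) {
            if (x > 5) {
                // ... logic 1 ...
            } else {
                // ... logic 2 ...
            }
        }
    }
}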

Yes, and the answer is a happens-before link. Let's say Thread-1 has started executing the atomic_increase method. It will acquire the lock and enter the synchronized block to update X.
private void atomic_increase() {
    synchronized (lock) {
        X = X + 1; // <-- Thread-1 entered synchronized block, yet to update variable X
    }
}
Now, for Thread-3 to run the logic, it needs to read the variable X, and if that read is not synchronized (on the same monitor), it can return a stale value, since X may not have been updated by Thread-1 yet.
private void runLogic() {
    if (X > 5) { // <-- Reading X here can be inconsistent: no happens-before
                 //     between atomic_increase and runLogic
    } else {
    }
}
We could have prevented this by establishing a happens-before link between the atomic operation and the runLogic method. If runLogic is synchronized (on the same monitor), it has to wait until variable X has been updated by Thread-1, so we are guaranteed to read the last updated value of X.
private void runLogic() {
    synchronized (lock) {
        if (X > 5) { // <-- Reading X here will be consistent, since there is
                     //     a happens-before between atomic_increase and runLogic
        } else {
        }
    }
}

The answer depends on what your application does. If neither logic 1 nor logic 2 modifies X, it is quite possible that there is no need for additional synchronization (besides using an atomic load to read X).
I assume you use intrinsics for atomic operations, and not simply an increment inside a mutex (or inside a synchronized block in Java). For example, in Java there is an AtomicInteger class with methods such as incrementAndGet and get. If you use them, there is probably no need for additional synchronization, but it depends on what you actually want to achieve with logic 1 or logic 2.
If you want to, say, display a message when X > 5, then you can do that. By the time the message is displayed the value of X may already have changed, but the fact remains that the message was triggered by X being greater than 5 for at least some time.
In other words, without additional synchronization you only have the guarantee that logic 1 will be called if X becomes greater than 5; there is no guarantee that it will remain so during the execution of logic 1. That may or may not be acceptable for you.
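Concretely, a minimal sketch of that pattern with AtomicInteger (method names are only illustrative):

import java.util.concurrent.atomic.AtomicInteger;

class Example {
    private final AtomicInteger x = new AtomicInteger();

    void increase() { x.incrementAndGet(); }   // thread 1
    void decrease() { x.decrementAndGet(); }   // thread 2

    void runLogic() {                          // thread 3
        if (x.get() > 5) {
            // logic 1: X was greater than 5 at the moment of the read,
            // but it may have changed by the time this code runs
        } else {
            // logic 2
        }
    }
}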

Related

What is the difference between this "atomic" Rust code and its "non-atomic" counterpart?

I'm fairly new to Rust. I graduated with a Computer Engineering degree 4 years ago, and I remember discussing (and understanding) atomic operations in my Operating Systems course. However, since graduating, I've been working primarily in high-level languages where I haven't had to care about low-level stuff like atomics. Now that I'm getting into Rust, I'm struggling to remember how a lot of this stuff works.
I'm currently trying to understand the source code for the hibitset library, specifically atomic.rs.
This module specifies an AtomicBitSet type which corresponds to the BitSet type from lib.rs, but using atomic values and operations. From my understanding, an "atomic operation" is an operation that is guaranteed to not be interrupted by another thread; any "load" or "store" on the same value will have to wait for the operation to finish before proceeding. Following from this definition, an "atomic value" is a value whose operations are fully atomic. AtomicBitSet uses AtomicUsize, which is a usize wrapper where all methods are fully atomic. However, AtomicBitSet specifies several operations that seem to not be atomic (add and remove), and there is one atomic operation: add_atomic. Looking at add vs add_atomic, I can't really tell what the difference is.
Here is add (verbatim):
/// Adds `id` to the `BitSet`. Returns `true` if the value was
/// already in the set.
#[inline]
pub fn add(&mut self, id: Index) -> bool {
    use std::sync::atomic::Ordering::Relaxed;
    let (_, p1, p2) = offsets(id);
    if self.layer1[p1].add(id) {
        return true;
    }
    self.layer2[p2].store(self.layer2[p2].load(Relaxed) | id.mask(SHIFT2), Relaxed);
    self.layer3
        .store(self.layer3.load(Relaxed) | id.mask(SHIFT3), Relaxed);
    false
}
This method calls load() and store() directly. I'm assuming that the fact that it's using Ordering::Relaxed is what makes this method non-atomic, because another thread doing the same thing to a different index might clobber this operation.
Here is add_atomic (verbatim):
/// Adds `id` to the `AtomicBitSet`. Returns `true` if the value was
/// already in the set.
///
/// Because we cannot safely extend an AtomicBitSet without unique ownership
/// this will panic if the Index is out of range.
#[inline]
pub fn add_atomic(&self, id: Index) -> bool {
    let (_, p1, p2) = offsets(id);

    // While it is tempting to check of the bit was set and exit here if it
    // was, this can result in a data race. If this thread and another
    // thread both set the same bit it is possible for the second thread
    // to exit before l3 was set. Resulting in the iterator to be in an
    // incorrect state. The window is small, but it exists.
    let set = self.layer1[p1].add(id);
    self.layer2[p2].fetch_or(id.mask(SHIFT2), Ordering::Relaxed);
    self.layer3.fetch_or(id.mask(SHIFT3), Ordering::Relaxed);
    set
}
This method uses fetch_or instead of calling load and store directly, which I'm assuming is what makes this method atomic.
But why does the usage of Ordering::Relaxed still allow this to be considered atomic? I realize that the individual "or" operations are atomic, but the full method could be run at the same time as another thread. Wouldn't that have an impact?
Moreover, why would a type like this expose non-atomic methods? Is it just for performance? That seems confusing to me. If I were to pick an AtomicBitSet over a BitSet because it's going to be used by more than one thread, I'd probably want to only use atomic operations on it. If I didn't I wouldn't be using it. Right?
I'd also love an explanation of the comment inside add_atomic. As-is it does not make sense to me. Doesn't the non-atomic version still have to care about that? It seems like the two methods are doing effectively the same thing, just with different levels of atomicity.
I'd really just love some help wrapping my head around atomics. I think I understand ordering after reading this and this, but both are still using concepts that I don't understand. When they talk about one thread "seeing" something from another, what does that mean exactly? When it's said that sequentially-consistent operations have the same order "across all threads" what does that even mean? Does the processor change the instruction order differently for different threads?
In the non-atomic case, this line:
self.layer2[p2].store(self.layer2[p2].load(Relaxed) | id.mask(SHIFT2), Relaxed);
is more or less equivalent to:
let tmp1 = self.layer2[p2];
let tmp2 = tmp1 | id.mask(SHIFT2);
self.layer2[p2] = tmp2;
so another thread could change self.layer2[p2] between the moment it is read into tmp1 and the moment tmp2 is stored into it. So if another thread tries to set another bit at the same time, there is a risk that the following sequence occurs:
thread 1 reads an empty mask,
thread 2 reads an empty mask,
thread 1 sets bit 1 of the mask and writes it,
thread 2 sets bit 2 of the mask and writes it, thus overwriting the value set by thread 1,
in the end only bit 2 is set!
The same goes for self.layer3.
In the atomic case, the use of fetch_or guarantees that the whole read-modify-write cycle is atomic.
In both cases, since the ordering is relaxed, the writes to layer2 and layer3 may seem to occur in any order as seen from other threads.
The comment inside add_atomic is meant to avoid an issue when two threads try to add the same bit. Assume that add_atomic was written like this:
pub fn add_atomic(&self, id: Index) -> bool {
    let (_, p1, p2) = offsets(id);
    if self.layer1[p1].add(id) {
        return true;
    }
    self.layer2[p2].fetch_or(id.mask(SHIFT2), Ordering::Relaxed);
    self.layer3.fetch_or(id.mask(SHIFT3), Ordering::Relaxed);
    false
}
Then you risk the following sequence:
thread 1 sets bit 1 in layer1 and sees that it wasn't set beforehand,
thread 2 tries to set bit 1 in layer1 and sees that thread 1 already set it, so thread 2 returns from add_atomic,
thread 2 executes another operation that requires reading layer3, but layer3 has not been updated yet, so thread 2 gets a wrong value!
thread 1 updates layer3, but it is too late.
This is why the add_atomic case ensures that layer2 and layer3 are set properly in all threads even if it looked like the bit was already set beforehand.

blockingForEach(), why apply function to blocked observables

I'm having trouble understanding the point of a blocking Observable, specifically blockingForEach().
What is the point of applying a function to an Observable that we will never see? Below, I'm attempting to have my console output in the following order:
this is the integer multiplied by two:2
this is the integer multiplied by two:4
this is the integer multiplied by two:6
Statement comes after multiplication
My current method prints the statement before the multiplication
fun rxTest() {
    val observer1 = Observable.just(1, 2, 3).observeOn(AndroidSchedulers.mainThread())
    val observer2 = observer1.map { response -> response * 2 }

    observer2
        .observeOn(AndroidSchedulers.mainThread())
        .subscribeOn(AndroidSchedulers.mainThread())
        .subscribe { it -> System.out.println("this is the integer multiplied by two:" + it) }

    System.out.println("Statement comes after multiplication ")
}
Now I have my changed my method to include blockingForEach()
fun rxTest() {
    val observer1 = Observable.just(1, 2, 3).observeOn(AndroidSchedulers.mainThread())
    val observer2 = observer1.map { response -> response * 2 }

    observer2
        .observeOn(AndroidSchedulers.mainThread())
        .subscribeOn(AndroidSchedulers.mainThread())
        .blockingForEach { it -> System.out.println("this is the integer multiplied by two:" + it) }

    System.out.println("Statement comes after multiplication ")
}
1.) What happens to the transformed observables once they are no longer blocking? Wasn't that just unnecessary work, since we never see those Observables?
2.) Why does my System.out.println("Statement...") appear before my observables when I'm subscribing? It's like observer2 skips its blocking method, makes the System.out call and then resumes its subscription.
It's not clear what you mean by saying that you will "never see" values emitted by an observer chain. Each value emitted in the observer chain is seen by the observers downstream of the point where it is emitted. The point where you subscribe to the observer chain is the usual place to perform a side effect, such as printing a value or storing it in a variable. Thus, the values are always seen.
In your examples, you are getting confused by how the schedulers work. When you use the observeOn() or subscribeOn() operators, you are telling the observer chain to emit values on a different thread. When you move data between threads, the destination thread has to be able to process that data. If your main code is running on the same thread, you can lock yourself out, or operations will be re-ordered.
Normally, the use of blocking operations is strongly discouraged. Blocking operations can often be used in tests, because there you have full control of the consequences. There are a couple of other situations where blocking may make sense, for example an application that requires access to a database or another resource: the application has no purpose without that resource, so it blocks until the resource becomes available or a timeout kicks it out.
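For comparison, here is a minimal sketch in plain RxJava (assuming RxJava 2.x and no Android schedulers, so the whole chain runs on the calling thread) that produces the ordering the question asks for:

import io.reactivex.Observable;

public class BlockingDemo {
    public static void main(String[] args) {
        Observable.just(1, 2, 3)
                .map(i -> i * 2)
                // blockingForEach consumes the stream on the calling thread,
                // so this loop completes before the next statement runs
                .blockingForEach(i ->
                        System.out.println("this is the integer multiplied by two:" + i));

        System.out.println("Statement comes after multiplication");
    }
}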

Evaluation order for always blocks triggered within always blocks in Verilog?

I understand that, for 2 always blocks with the same trigger, their order of evaluation is completely unpredictable.
However, suppose I have:
always @(a) begin : blockX
    c = 0;
    d = a + 2;
    if (c != 1) e = 2;
end

always @(a) begin : blockY
    e = 3;
end

always @(d) begin : blockZ
    c = 1;
    e = 1;
end
Suppose block X evaluates first. Does changing d in blockX immediately jump to blockZ? If not, when is blockZ evaluated with respect to blockY?
My programmer's instinct thinks of the sequence of events as a stack, where evaluating blockX is like a function call to blockZ and I immediately jump there in the code, then finish evaluating blockX.
However, because we call the active events queue, well, a queue, this suggests blockZ is enqueued at the back of the active events queue, and I'm 100% guaranteed it will be evaluated last (unless there are other triggered always blocks).
There's also the intermediate possibility, where it's neither first nor last but is also evaluated in a random and unpredictable order.
So in this example, are 1, 2, or 3 all possible final values for e, depending on how the compiler is feeling at run time?
Additionally, while I understand, of course, that this represents awful style, where might I find the specification for this kind of behavior?
Always blocks are not function calls. See a recent answer I just gave for a similar question. These blocks are concurrent processes. The LRM only guarantees the ordering of statements within a begin/end block. There is no defined ordering between concurrently executing begin/end blocks (see Section 4.7, Nondeterminism, in the 1800-2012 LRM), so a simulator is free to interleave the statements in any way as long as it honors the order within a single block.
So you are correct that e could have the final values 1, 2 or 3 depending on how a simulator decides to implement and optimize your code.

How to explain read/write of global variables in a multi-threaded environment

I am not familiar with multi-threading, locks, or atomic/non-atomic operations.
Recently I saw an interview question as below.
Put f1 and f2 in two separate threads and run them at the same time; when both of them return, what is the value of a?
int a = 2, b = 0, c = 0

func f1()
{
    a = a * 2
    a = b
}

func f2()
{
    c = a + 11
    a = c
}
I tried to implement the above code in an Objective-C environment and what I got is a = 11. I'm not sure whether this is right, since what I did was put f1 on the main queue and f2 on a global dispatch queue and run them asynchronously, which could be incorrect.
If someone could give an answer and explain the process at the level of register access, CPU processing and memory usage, that would be great.
The answer is: the result of a is effectively random; it can be anything. Since access to a is not atomic and there is no synchronization, different threads might see different values for a depending on timing. If you manage to make a unaligned and run it on x86, you might even see a value for a that neither thread ever wrote.
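To see the nondeterminism for yourself, here is a minimal sketch in Java rather than Objective-C (class and field names are only illustrative):

public class RaceDemo {
    static int a, b, c;

    public static void main(String[] args) throws InterruptedException {
        for (int run = 0; run < 10; run++) {
            a = 2; b = 0; c = 0;

            Thread t1 = new Thread(() -> { a = a * 2; a = b; });   // f1
            Thread t2 = new Thread(() -> { c = a + 11; a = c; });  // f2

            t1.start(); t2.start();
            t1.join();  t2.join();

            // Under a clean interleaving this prints 0, 11, 13 or 15, but
            // because the accesses are unsynchronized no outcome is guaranteed.
            System.out.println("a = " + a);
        }
    }
}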

Multithread+Recursion strategies

I am just starting to learn the ins and outs of multithreaded programming and have a few basic questions that, once answered, should keep me occupied for quite some time. I understand that multithreading loses its effectiveness once you have created more threads than there are cores (due to context switching and cache flushing). With that understood, I can think of two ways to employ multithreading of a recursive function, but am not quite sure what the common way to approach the problem is. One seems much more complicated, perhaps with a higher payoff, but that's what I hope you will be able to tell me.
Below is pseudo-code for two different methods of multithreading a recursive function. I have used the terminology of merge sort for simplicity, but it's not that important. It is easy to see how to generalize the methods to other problems. Also, I will personally be employing these methods using the pthreads library in C, so the thread syntax mildly reflects this.
Method 1:
main ()
{
    A = array of length N
    NUM_CORES = get number of functional cores
    chunk[NUM_CORES] = array of indices partitioning A into (N / NUM_CORES) sized chunks
    thread_id[NUM_CORES] = array of thread ids
    thread[NUM_CORES] = array of thread type

    // start NUM_CORES threads, each working on one chunk of A
    for i = 0 to (NUM_CORES - 1) {
        thread_id[i] = thread_start(thread[i], MergeSort, chunk[i])
    }

    // wait for all threads to finish
    // merge chunks appropriately
    exit
}

MergeSort ( chunk )
{
    MergeSort ( lowerSubChunk )
    MergeSort ( higherSubChunk )
    Merge(lowerSubChunk, higherSubChunk)
}

// Merge(,) not shown
Method 2:
main ()
{
    A = array of length N
    NUM_CORES = get number of functional cores
    chunk = indices 0 and N
    thread_id[NUM_CORES] = array of thread ids
    thread[NUM_CORES] = array of thread type

    // lock variable aka mutex
    THREADS_IN_USE = 1

    MergeSort( chunk )
    exit
}

MergeSort ( chunk )
{
    lock THREADS_IN_USE
    if ( THREADS_IN_USE < NUM_CORES ) {
        FREE_CORE = find index of unused core
        thread_id[FREE_CORE] = thread_start(thread[FREE_CORE], MergeSort, lowerSubChunk)
        THREADS_IN_USE++
        unlock THREADS_IN_USE

        MergeSort( higherSubChunk )

        // wait for thread_id[FREE_CORE] and current thread to finish
        lock THREADS_IN_USE
        THREADS_IN_USE--
        unlock THREADS_IN_USE

        Merge(lowerSubChunk, higherSubChunk)
    }
    else {
        unlock THREADS_IN_USE
        MergeSort( lowerSubChunk )
        MergeSort( higherSubChunk )
        Merge(lowerSubChunk, higherSubChunk)
    }
}

// Merge(,) not shown
Visually, one can think of the differences between these two methods as follows:
Method 1: creates NUM_CORES separate recursion trees, each one having a single core traversing it.
Method 2: creates a single recursion tree but has all cores traversing it. In particular, whenever there is a free core, it is set to work on the "left child subtree" of the first node where MergeSort is called after the core is freed.
The problem with Method 1 is that if the running time of the recursive function varies with the distribution of values within each initial subchunk (i.e. chunk[i]), one thread could finish much sooner than the others, leaving a core sitting idle while they finish. With merge sort this is not likely to be the case, since the work of MergeSort happens in Merge, whose runtime isn't affected much by the distribution of values in the (sorted) subchunks. However, with a more involved recursive function, the running time on one subchunk could be much longer!
With Method 2 it is possible to have the same problem. Again, with merge sort it's not clear, since the running time for each subchunk is likely to be similar, but the line //wait for thread_id[FREE_CORE] and current thread to finish would also require one core to wait for the other. However, with Method 2 all calls to Merge run as soon as possible, as opposed to Method 1, where one must wait for NUM_CORES calls to MergeSort to finish and then do NUM_CORES - 1 merges afterward (although you can multithread this as well... to an extent).
(though the syntax might not be completely correct)
Are both of these methods used in practice? Are there situations where one is more beneficial over the other? Is this the correct way to implement Method 2? (in this case, THREADS_IN_USE is a semaphore?)
Thanks so much for your help!
