Where exactly is the synchronization point when using semaphores

I have a question regarding the actual synchronization points in the following C-like pseudocode examples. In our slides the synchronization point is shown to occur at the point indicated below.
Two process 2 way synchronization, x and y = 0 to start
Process 1
signal(x);
//Marked as sync point
wait(y);
Process 2
signal(y);
//This arrow isn't as exact but appears to be near the middle again.
wait(x);
Now for just two process 2 way sync this seems to make sense. However, when expanding this to 3 process 3 way sync, this logic seems to break down. There are no arrows given in the slide deck.
3 Process 3 Way Synchronization (S0, S1, S2 = 0 to start)
Process 0
signal(S0);
signal(S0);
wait(S1);
wait(S2);
Process 1
signal(S1);
signal(S1);
wait(S0);
wait(S2);
Process 2
signal(S2);
signal(S2);
wait(S0);
wait(S1);
Now I find the sync point couldn't actually be between the signal and the wait. For example:
Let's say Process 0 runs first and signals S0 once, so S0 = 1. Suppose that before the second signal(S0) can run, the process is interrupted and Process 1 runs next. Suppose only one signal(S1) runs before Process 1 is also interrupted, so S1 = 1. Now Process 2 runs without being interrupted. Both of its signal(S2) calls run, so S2 = 2. Then wait(S0) runs and decrements S0 to 0; Process 2 is allowed to continue because S0 is not negative. Then wait(S1) runs and the same thing happens: S1 drops to 0 and Process 2 continues.
At this point Process 2 is done running, yet Process 0 and Process 1 have not finished their signals. If the sync point is truly between the signals and the waits, then this solution to 3 process 3 way sync is incorrect.
A similar issue can arise in solution for 3 process 3 way synchronization that allows each process to run more than one instance of itself at a time. Attached is that slide but I will not explain why the "middle" point in the process can't be the sync point as I already have a huge wall of text.
Please let me know which way is correct, no amount of googling has given me an answer. I will include all relevant slides.
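The schedule described in the question can be replayed step by step. The following is only a single-threaded sketch in Python: CountingSemaphore and replay_schedule are hypothetical names, and the toy semaphore assumes the textbook definition in which wait() decrements the value and the caller blocks only if the result is negative.

```python
class CountingSemaphore:
    """Toy model (assumption): the value may go negative, and a negative
    value after wait() means the caller would have to block."""

    def __init__(self):
        self.value = 0

    def signal(self):
        self.value += 1

    def wait(self):
        self.value -= 1
        return self.value >= 0  # True: the caller continues without blocking


def replay_schedule():
    s0, s1, s2 = CountingSemaphore(), CountingSemaphore(), CountingSemaphore()
    s0.signal()            # Process 0: first signal(S0), then preempted
    s1.signal()            # Process 1: first signal(S1), then preempted
    s2.signal()            # Process 2: first signal(S2)
    s2.signal()            # Process 2: second signal(S2)
    passed_s0 = s0.wait()  # Process 2: wait(S0), S0 goes 1 -> 0
    passed_s1 = s1.wait()  # Process 2: wait(S1), S1 goes 1 -> 0
    return passed_s0 and passed_s1, (s0.value, s1.value, s2.value)


print(replay_schedule())
```

Under these assumptions Process 2 gets past both waits while Process 0 and Process 1 have each issued only one of their two signals, which is the situation the question describes.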

Related

Result of 100 concurrent threads, each incrementing variable to 100

I'm writing to ask about this question from 'The Little Book of Semaphores' by Allen B. Downey.
Question from 'The Little Book of Semaphores'
Puzzle: Suppose that 100 threads run the following program concurrently. (if you are not familiar with Python, the for loop runs the update 100 times.):
for i in range(100):
    temp = count
    count = temp + 1
What is the largest possible value of count after all threads have completed? What is the smallest possible value? Hint: the first question is easy; the second is not.
My understanding is that count is a variable shared by all threads, and that its initial value is 0.
I believe that the largest possible value is 10,000, which occurs when there is no interleaving between threads.
I believe that the smallest possible value is 100. If line 2 is executed for each thread, they will each have a value of temp = 0. If line 3 is then executed for each thread, they will each set count = 1. If the same behaviour occurs in each iteration, the final value of count will be 100.
Is this correct, or is there another execution path that can result in a value smaller than 100 for count?

The worst case that I can think of will leave count equal to two. It's extremely unlikely that this would ever happen in practice, but in theory, it's possible. I'll need to talk about Thread A, Thread B, and 98 other threads:
Thread A reads count as zero, but then it is preempted before it can do anything else,
Thread B is allowed to run 99 iterations of its loop, and 98 other threads all run to completion before thread A finally is allowed to run again,
Thread A writes 1 to count before—are you ready to believe this?—it gets preempted again!
Thread B starts its 100th iteration. It gets as far as reading count as 1 (just now written by thread A) before thread A finally comes roaring back to life and runs to completion,
Thread B is last to cross the finish line after it writes 2 to count.
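The interleaving above can be checked with a deterministic simulation. This is a sketch, not real threading: each "thread" is modelled as a Python generator that performs exactly one read or one write per step, and the names worker, step and worst_case are made up for illustration.

```python
def worker(shared, iterations):
    """One simulated thread: every next() performs a single read or write."""
    for _ in range(iterations):
        temp = shared["count"]       # read step
        yield
        shared["count"] = temp + 1   # write step
        yield


def step(thread, n):
    """Let one simulated thread perform n memory operations."""
    for _ in range(n):
        next(thread)


def worst_case(n_threads=100, iterations=100):
    shared = {"count": 0}
    threads = [worker(shared, iterations) for _ in range(n_threads)]
    a, b, others = threads[0], threads[1], threads[2:]

    step(a, 1)                      # A reads count == 0, then is preempted
    step(b, 2 * (iterations - 1))   # B runs 99 full iterations
    for t in others:
        step(t, 2 * iterations)     # the other 98 threads run to completion
    step(a, 1)                      # A writes 1, then is preempted again
    step(b, 1)                      # B starts its last iteration, reads 1
    step(a, 2 * (iterations - 1))   # A runs its remaining 99 iterations
    step(b, 1)                      # B writes 2 and finishes last
    return shared["count"]


print(worst_case())
```

The simulation only confirms that this particular schedule ends with count equal to 2; it does not by itself prove that no schedule can do worse.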

What is the logic behind this function and its output? - Queue

import queue
import threading
import time

q = queue.Queue()
for i in [3, 2, 1]:
    def f():
        time.sleep(i)
        print(i)
        q.put(i)
    threading.Thread(target=f).start()
print(q.get())
For this piece of code, it returns 1. The reason for this is because the queue is FIFO and "1" is put first as it slept the least time.
extended question,
If I continue to run q.get() twice, it still outputs the same value "1" rather than "2" and "3". Can anyone tell me why that is? Is there anything to do with threading?
Another extended question,
When the code finishes running completely, but there are still threads that haven't finished, will they get shut down immediately as the whole program finishes?
q.get()
#this gives me 1, but I suppose it should give me 2
q.get()
#this gives me 1, but I suppose it should give me 3
Update:
This is Python 3 code.

Assuming that the language is Python 3:
The second and third calls to q.get() return 1 because each of the three threads puts a 1 into the queue. There is never a 2 or a 3 in the queue.
I'm not a Python expert, so I don't fully understand what to expect in this case, but the function f does not appear to capture the value of the loop variable i. The i in the function f appears to be the same variable as the i in the loop, and the loop leaves i == 1 before any of the three threads wakes up from sleeping. So, in all three threads, i == 1 by the time q.put(i) is called.
When the code finishes running completely, but there are still threads that haven't finished, will they get shut down immediately?
No. The process won't exit until all of its threads (including the main thread) have terminated. If you want to create a thread that will be automatically, forcibly, abruptly terminated when all of the "normal" threads are finished, then you can make that thread a daemon thread.
See https://docs.python.org/3/library/threading.html, and search for "daemon".
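One common way to make each thread see its own value is to bind i as a default argument when f is defined. The sketch below compares both variants; collect is a hypothetical helper, and starting the threads only after the loop has finished is an assumption that stands in for the time.sleep calls of the question, so that the result is deterministic.

```python
import queue
import threading


def collect(bind_value):
    """Run three threads and return, sorted, what they put on the queue."""
    q = queue.Queue()
    threads = []
    for i in [3, 2, 1]:
        if bind_value:
            def f(i=i):   # default argument: the current value of i is stored now
                q.put(i)
        else:
            def f():      # i is looked up only when f actually runs
                q.put(i)
        threads.append(threading.Thread(target=f))

    # Assumption: starting the threads after the loop replaces time.sleep.
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(q.get() for _ in threads)


print(collect(bind_value=False))
print(collect(bind_value=True))
```

Passing the value through threading.Thread(target=f, args=(i,)) to a function that takes a parameter would be an alternative to the default argument.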

If one thread writes to a location and another thread is reading, can the second thread see the new value then the old?

Start with x = 0. Note there are no memory barriers in any of the code below.
volatile int x = 0
Thread 1:
while (x == 0) {}
print "Saw non-zero"
while (x != 0) {}
print "Saw zero again!"
Thread 2:
x = 1
Is it ever possible to see the second message, "Saw zero again!", on any (real) CPU? What about on x86_64?
Similarly, in this code:
volatile int x = 0
Thread 1:
while (x == 0) {}
x = 2
Thread 2:
x = 1
Is the final value of x guaranteed to be 2? Or could the CPU caches update main memory in some arbitrary order: x = 1 gets into a CPU's cache where thread 1 can see it, then thread 1 gets moved to a different CPU where it writes x = 2 to that CPU's cache, and the x = 2 gets written back to main memory before x = 1?

Yes, it's entirely possible. The compiler could, for example, have just written x to memory but still have the value in a register. One while loop could check memory while the other checks the register.
It doesn't happen due to CPU caches because cache coherency hardware logic makes the caches invisible on all CPUs you are likely to actually use.
Theoretically, the write race you talk about could happen due to posted write buffering and read prefetching. Miraculous tricks were used to make this impossible on x86 CPUs to avoid breaking legacy code. But you shouldn't expect future processors to do this.

Leaving aside for a second tricks done by the compiler (even ones allowed by language standards), I believe you're asking how the micro-architecture could behave in such scenario. Keep in mind that the code would most likely expand into a busy wait loop of cmp [x] + jz or something similar, which hides a load inside it. This means that [x] is likely to live in the cache of the core running thread 1.
At some point, thread 2 would come and perform the store. If it resides on a different core, the line would first be invalidated completely from the first core. If these are 2 threads running on the same physical core - the store would immediately affect all chronologically younger loads.
Now, the most likely thing to happen on a modern out-of-order machine is that all the loads in the pipeline at this point would be different iterations of the same first loop (since any branch predictor facing so many repetitive "taken" resolution is likely to assume the branch will continue being taken, until proven wrong), so what would happen is that the first load to encounter the new value modified by the other thread will cause the matching branch to simply flush the entire pipe from all younger operations, without the 2nd loop ever having a chance to execute.
However, it's possible that for some reason you did get to the 2nd loop (let's say the predictor issues a not-taken prediction just at the right moment when the loop condition check saw the new value). In this case, the question boils down to this scenario:
Time -->
----------------------------------------------------------------
thread 1
cmp [x],0 execute
je ... execute (not taken)
...
cmp [x],0 execute
jne ... execute (not taken)
Can_We_Get_Here:
...
thread2
store [x],1 execute
In other words, given that most modern CPUs may execute instructions out of order, can a younger load be evaluated before an older one to the same address, allowing the store (from another thread) to change the value so it may be observed inconsistently by the loads.
My guess is that the above timeline is quite possible given the nature of out-of-order execution engines today, as they simply arbitrate and perform whatever operation is ready. However, on most x86 implementations there are safeguards to protect against such a scenario, since the memory ordering rules strictly say -
8.2.3.2 Neither Loads Nor Stores Are Reordered with Like Operations
Such mechanisms may detect this scenario and flush the machine to prevent the stale/wrong values from becoming visible. So the answer is: no, it should not be possible, unless of course the software or the compiler changes the nature of the code to prevent the hardware from noticing the relation. Then again, memory ordering rules are sometimes flaky, and I'm not sure all x86 manufacturers adhere to the exact same wording, but this is a pretty fundamental example of consistency, so I'd be very surprised if one of them missed it.

The answer seems to be, "this is exactly the job of the CPU cache coherency." x86 processors implement the MESI protocol (or a variant of it), which guarantees that the second thread can't see the new value and then the old.

Parallel processing - Connected Data

Problem
Summary: Apply a function F in parallel to each element of an array, where F is NOT thread safe.
I have a set of elements E to process, let's say a queue of them.
I want to process all these elements in parallel using the same function f( E ).
Now, ideally I could call a map based parallel pattern, but the problem has the following constraints.
Each element contains a pair of 2 objects.( E = (A,B) )
Two elements may share an object. ( E1 = (A1,B1); E2 = (A1, B2) )
The function f cannot process two elements that share an object, so E1 and E2 cannot be processed in parallel.
What is the right way of doing this?
My thoughts are as follows:
1. Trivial thought: Keep a set of active As and Bs, and start processing an element only when no other thread is already using its A or its B.
So, when you give the element to a thread, you add its A and B to the active set.
Pick the first element; if its objects are not in the active set, spawn a new thread, otherwise push it to the back of the queue of elements.
Do this till the queue is empty.
Will this cause a deadlock? Ideally, when processing finishes, some elements will become available, right?
2. The other thought is to make a graph of these connected objects.
Each node represents an object (A / B). Each element is an edge connecting A & B, and then somehow process the data such that we know the elements are never overlapping.
Questions
How can we achieve this best?
Is there a standard pattern to do this ?
Is there a problem with these approaches?
Not necessary, but if you could tell the TBB methods to use, that'll be great.

The "best" approach depends on a lot of factors here:
How many elements "E" do you have and how much work is needed for f(E). --> Check if it's really worth it to work the elements in parallel (if you need a lot of locking and don't have much work to do, you'll probably slow down the process by working in parallel)
Is there any possibility to change the design that can make f(E) multi-threading safe?
How many elements "A" and "B" are there? Is there any logic to which elements "E" share specific versions of A and B? --> If you can sort the elements E into separate lists where each A and B only appears in a single list, then you can process these lists parallel without any further locking.
If there are many different A's and B's and you don't share too many of them, you may want to do a trivial approach where you just lock each "A" and "B" when entering and wait until you get the lock.
Whenever you do "lock and wait" with multiple locks it's very important that you always take the locks in the same order (e.g. always A first and B second) because otherwise you may run into deadlocks. This locking order needs to be observed everywhere (a single place in the whole application that uses a different order can cause a deadlock)
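The "lock each A and B in a fixed order" approach could look roughly like the following Python sketch. It assumes the objects are hashable and identified by equality; process_all, run and the rank table are illustrative names, and the global order is simply the order in which objects are first seen.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def process_all(elements, f, max_workers=4):
    """Apply f(a, b) to every element; elements sharing an object never overlap."""
    elements = list(elements)
    # One lock per distinct object (assumption: objects are hashable).
    locks = {obj: threading.Lock() for pair in elements for obj in pair}
    # A fixed global rank gives every thread the same acquisition order.
    rank = {obj: n for n, obj in enumerate(locks)}

    def run(pair):
        a, b = pair
        # The set avoids acquiring the same lock twice when a == b.
        ordered = sorted({a, b}, key=rank.__getitem__)
        for obj in ordered:
            locks[obj].acquire()
        try:
            return f(a, b)
        finally:
            for obj in reversed(ordered):
                locks[obj].release()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, elements))
```

Because every worker takes its locks in ascending rank, no cycle of waiting threads can form, which is the property the paragraph above asks for.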
Edit: Also, if you do "try lock" you need to ensure that the order is always the same. Otherwise you can cause a livelock:
1. thread 1 locks A
2. thread 2 locks B
3. thread 1 tries to lock B and fails
4. thread 2 tries to lock A and fails
5. thread 1 releases lock A
6. thread 2 releases lock B
7. Go to 1 and repeat...
The chances that this actually repeats endlessly are relatively slim, but it should be avoided anyway.
Edit 2: In principle, I guess I'd just split E(Ax, Bx) into different lists based on Ax (e.g. one list for all E's that share the same A). Then process these lists in parallel with locking of "B" (there you can still "TryLock" and continue if the required B is already used).
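A rough Python sketch of the "split by A, lock only B" idea follows. It assumes that an object used as an A never also appears as a B, and it uses a plain blocking lock per B instead of the "TryLock" variant; process_grouped and run_group are hypothetical names.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def process_grouped(elements, f, max_workers=4):
    """Group elements by A, run each group sequentially, guard each B with a lock."""
    elements = list(elements)
    groups = defaultdict(list)
    for a, b in elements:
        groups[a].append(b)
    # Assumption: the A objects and the B objects are disjoint sets.
    b_locks = {b: threading.Lock() for _, b in elements}

    def run_group(a):
        results = []
        for b in groups[a]:        # one task owns this A, so A needs no lock
            with b_locks[b]:       # only one lock is ever held at a time
                results.append(f(a, b))
        return results

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_group = list(pool.map(run_group, list(groups)))
    return [result for group in per_group for result in group]
```

Since a task never holds more than one lock at once, lock ordering is not a concern here; the cost is that a single large group for one A is processed serially. Results come back grouped by A rather than in the original order.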

Reusable Barrier Algorithm

I'm looking into the Reusable Barrier algorithm from the book "The Little Book Of Semaphores" (archived here).
The puzzle is on page 31 (Basic Synchronization Patterns/Reusable Barrier), and I have come up with a 'solution' (or not) which differs from the solution from the book (a two-phase barrier).
This is my 'code' for each thread:
# n = 4; threads running
# semaphore = n max., initialized to 0
# mutex, unowned.
start:
    mutex.wait()
    counter = counter + 1
    if counter = n:
        semaphore.signal(4) # add 4 at once
        counter = 0
    mutex.release()
    semaphore.wait()
    # critical section
    semaphore.release()
    goto start
This does seem to work, I've even inserted different sleep timers into different sections of the threads, and they still wait for all the threads to come before continuing each and every loop. Am I missing something? Is there a condition that this will fail?
I've implemented this using the Windows library Semaphore and Mutex functions.
Update:
Thank you to starblue for the answer. It turns out that if, for whatever reason, a thread is slow between mutex.release() and semaphore.wait(), any thread that arrives at semaphore.wait() after a full loop will be able to go through again, since one of the N signals will still be unused.
And having put a Sleep command for thread number 3, I got this result where one can see that thread 3 missed a turn the first time, with thread 1 having done 2 turns, and then catching up on the second turn (which was in fact its 1st turn).
Thanks again to everyone for the input.

One thread could run several times through the barrier while some other thread doesn't run at all.
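For comparison, the two-phase barrier that the question mentions as the book's solution can be sketched in Python with threading.Semaphore. This is a reconstruction under assumptions, not the asker's Windows code: TwoPhaseBarrier, phase1, phase2 and run_rounds are names chosen here for illustration.

```python
import threading


class TwoPhaseBarrier:
    """Reusable barrier built from two turnstiles (sketch)."""

    def __init__(self, n):
        self.n = n
        self.count = 0
        self.mutex = threading.Lock()
        self.turnstile = threading.Semaphore(0)   # closed until all arrive
        self.turnstile2 = threading.Semaphore(1)  # open until all arrive

    def phase1(self):
        with self.mutex:
            self.count += 1
            if self.count == self.n:
                self.turnstile2.acquire()  # close the second turnstile
                self.turnstile.release()   # open the first one
        self.turnstile.acquire()
        self.turnstile.release()           # let the next thread through

    def phase2(self):
        with self.mutex:
            self.count -= 1
            if self.count == 0:
                self.turnstile.acquire()   # close the first turnstile
                self.turnstile2.release()  # open the second one
        self.turnstile2.acquire()
        self.turnstile2.release()


def run_rounds(n=4, rounds=50):
    """Each thread logs its round number between the two phases."""
    barrier = TwoPhaseBarrier(n)
    log, log_lock = [], threading.Lock()

    def worker():
        for r in range(rounds):
            barrier.phase1()
            with log_lock:
                log.append(r)              # the "critical point"
            barrier.phase2()

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

With the second phase in place, no thread can start round r + 1 before every thread has left round r, so the log returned by run_rounds never shows a later round ahead of an earlier one.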
