Understanding the Weak memory model - multithreading

Assume we have two threads, working with two variables A and B in memory:
Thread 1       Thread 2
========       ========
1) A = 1       3) B = 1
2) Print(B)    4) Print(A)
I know that in a sequentially consistent (strong) model you would get 1 -> 2 -> 3 -> 4 executed in order. x86 is TSO, which is close to a strong model (but not as strong as one).
I don't understand what the weak model is. Does a weak model just pick random instructions and execute them, i.e. would things like 4 -> 2 -> 3 -> 1 be possible?
I have 2 more questions regarding this topic:
What is the difference between out-of-order execution done by a CPU to make use of instruction cycles that would otherwise be wasted, and memory reordering due to the memory model? Are they the same thing, or does memory reordering deal only with load/store instructions?
Is the memory model a concern only when dealing with multiple threads? Why is it not an issue in single-threaded programs?

Sequential consistency does not tell you that 1,2,3,4 will execute in that order at all.
Sequential consistency tells you that if CPU0 is executing 1,2 and CPU1 is executing 3,4, each CPU will execute its own block in order: no side effect (memory store) of 2 will be perceivable before those of 1, and no side effect of 4 will be perceivable before those of 3.
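To make this concrete with the question's own example, here is a hedged C++11 sketch of that store-buffering test; r1 and r2 stand in for the two Print calls (an assumption for brevity). Under sequentially consistent atomics the outcome r1 == r2 == 0 is impossible; with relaxed atomics (or on hardware weaker than SC, including TSO) it is allowed, though whether you actually observe it depends on timing.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> A{0}, B{0};
int r1, r2;

int main() {
    std::thread t1([] {
        A.store(1, std::memory_order_relaxed);   // 1) A = 1
        r1 = B.load(std::memory_order_relaxed);  // 2) Print(B)
    });
    std::thread t2([] {
        B.store(1, std::memory_order_relaxed);   // 3) B = 1
        r2 = A.load(std::memory_order_relaxed);  // 4) Print(A)
    });
    t1.join();
    t2.join();
    // With memory_order_seq_cst this would never print "r1 = 0, r2 = 0";
    // with relaxed ordering (or on TSO hardware) it may.
    std::printf("r1 = %d, r2 = %d\n", r1, r2);
}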
If initially A=B=0, then:
Thread 1         Thread 2
========         ========
1) A = 1         3) B = 1
2) Print(A,B)    4) Print(A,B)
All sequential consistency tells us is that the possible outputs are:
Thread 1: { 1, 0 }, { 1, 1 }
Thread 2: { 0, 1 }, { 1, 1 }.
If we extend it to an initial state of A=B=C=D=0
Thread 1          Thread 2
========          ========
A = 1             D = 1
C = 1             B = 1
Print(A,B,C,D)    Print(A,B,C,D)
Thread1 valid outputs:
1: {1, 0, 1, 0} -- no effects from thread2 seen
2: {1, 0, 1, 1} -- update of D visible; not B
3: {1, 1, 1, 0} -- update of B visible; not D
4: {1, 1, 1, 1} -- update of B and D visible.
Thread2 valid outputs:
5: {0, 1, 0, 1} -- no effects from thread1 seen
6: {0, 1, 1, 1} -- update of C visible; not A
7: {1, 1, 0, 1} -- update of A visible; not C
8: {1, 1, 1, 1} -- update of A and C visible.
Under sequential consistency, outputs 1,2,4 and 5,6,8 are possible.
Under weaker consistency models, outputs 1,2,3,4 and 5,6,7,8 are possible.
Note that in neither case does a thread fail to see its own updates in order; the outputs 3 and 7 result from a thread seeing the other thread's updates out of order.
If you require a specific ordering to be maintained, inserting a barrier instruction[1] is the preferred approach. When the CPU encounters a barrier, it affects either the prefetch queue (read barrier), the store queue (write barrier), or both (rw barrier).
When there are two memory writes, A = 1; C = 1; you can install a write barrier as membar w; store A; store C. This ensures that all stores before the store to A will be seen before either the store to A or the store to C, but it enforces no ordering between A and C.
You can instead install it as store A; membar w; store C, which ensures that the store to A will be seen before the store to C; and store A; store C; membar w ensures that both A and C will be seen before any subsequent stores.
So which barrier or barrier combination is right for your case?
[1] more modern architectures incorporate barriers into the load and store instructions themselves; so you might have a store.sc A; store C;. The advantage here is to limit the scope of the store barrier so that the store unit only has to serialize these stores, rather than suffer the latency of the entire queue.
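To see what the middle placement looks like in practice, here is a hedged C++11 sketch using std::atomic_thread_fence: the release fence plays the role of membar w between the two stores, and the reader needs a matching acquire fence for the guarantee to hold under the C++ memory model. The names A and C mirror the prose above; this is an illustration, not a definitive mapping to any particular ISA.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> A{0}, C{0};

void writer() {
    A.store(1, std::memory_order_relaxed);                // store A
    std::atomic_thread_fence(std::memory_order_release);  // membar w
    C.store(1, std::memory_order_relaxed);                // store C
}

void reader() {
    if (C.load(std::memory_order_relaxed) == 1) {         // saw the store to C...
        std::atomic_thread_fence(std::memory_order_acquire);
        // ...so the store to A must be visible as well
        std::printf("A = %d\n", A.load(std::memory_order_relaxed));  // always 1
    }
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}

The footnote's store-with-barrier idea is roughly analogous to what C++ exposes as a release store, C.store(1, std::memory_order_release), which folds the ordering into the store instruction itself on architectures that support it.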

Related

What should be the value of the semaphore

What is the maximum value (____) with which the semaphore s can be initialized so that a deadlock is possible, given two threads: the first executing the method sequence "A A B C C" and the second "A A B C C B"?
...
semaphore s = ____;
void A() {
    wait(&s);
    ...
}
void B() {
    signal(&s);
    ...
}
void C() {
    wait(&s);
    ...
}
...
In that case, the minimum value that the semaphore will attain is: ____. (0, 4, 2, 6, -3, 3, -2, 5, -1, 1).
I tried to solve it like this: the minimum value of s needed to be sure that there will be no deadlock is 4 (because in the worst-case scenario the sequence is A A A A and then B), so the maximum value with a deadlock is 3. Then, for the second part, because we have 8 wait() calls in total (for A and C) and 3 signal() calls in total (for B), we obtain 3 - 8 + 3 = -2.
Can someone confirm whether this is the correct solution?
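One way to sanity-check this reasoning is to brute-force every interleaving. Below is a hypothetical C++11 checker, a sketch assuming the textbook semantics in which wait() decrements the semaphore and blocks the caller while the value is negative, and signal() increments it and wakes one blocked thread; the sequences and the candidate initial values are taken from the question. For each initial value it reports whether a deadlock is reachable and the minimum semaphore value seen on any interleaving.

#include <algorithm>
#include <iostream>
#include <string>

// Operation sequences from the question: A and C call wait(), B calls signal().
const std::string SEQ[2] = {"AABCC", "AABCCB"};

int minSeen;        // minimum semaphore value observed on any path
bool deadlockFound; // does any interleaving leave a thread stuck forever?

struct State {
    std::size_t i[2];  // next operation index per thread
    bool blocked[2];   // thread is suspended inside wait()
    int s;             // semaphore value (may go negative)
};

void explore(State st) {
    minSeen = std::min(minSeen, st.s);
    bool moved = false;
    for (int t = 0; t < 2; ++t) {
        if (st.blocked[t] || st.i[t] >= SEQ[t].size()) continue;
        moved = true;
        State nx = st;
        char op = SEQ[t][nx.i[t]++];
        if (op == 'B') {           // signal(): increment, wake one blocked thread
            nx.s++;
            bool woke = false;
            for (int u = 0; u < 2; ++u) {
                if (!nx.blocked[u]) continue;
                State w = nx;
                w.blocked[u] = false;  // branch over which thread gets woken
                explore(w);
                woke = true;
            }
            if (!woke) explore(nx);
        } else {                   // wait(): decrement, block if negative
            nx.s--;
            if (nx.s < 0) nx.blocked[t] = true;
            explore(nx);
        }
    }
    // No runnable thread left, but someone is still blocked: deadlock.
    if (!moved && (st.blocked[0] || st.blocked[1]))
        deadlockFound = true;
}

int main() {
    for (int s0 = 0; s0 <= 6; ++s0) {
        minSeen = s0;
        deadlockFound = false;
        explore(State{{0, 0}, {false, false}, s0});
        std::cout << "s = " << s0
                  << ": deadlock " << (deadlockFound ? "possible" : "impossible")
                  << ", minimum value seen " << minSeen << "\n";
    }
}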

How do I keep the two receiving processes from running twice in a row in a Promela model?

I am a beginner with Spin. I am trying to make the model run the two receiving processes (the proctype called Consumer in the model) alternately, i.e. consumer 1, consumer 2, consumer 1, consumer 2, ... But when I run this code, the output of the two consumer processes shows up in random order. Can someone help me?
This is the code I am struggling with.
mtype = {P, C};
mtype turn = P;
chan ch1 = [1] of {bit};
byte current_consumer = 1;
byte previous_consumer;
active [2] proctype Producer()
{
    bit a = 0;
    do
    :: atomic {
        turn == P ->
        ch1 ! a;
        printf("The producer %d --> sent %d!\n", _pid, a);
        a = 1 - a;
        turn = C;
    }
    od
}
active [2] proctype Consumer()
{
    bit b;
    do
    :: atomic {
        turn == C ->
        current_consumer = _pid;
        ch1 ? b;
        printf("The consumer %d --> received %d!\n\n", _pid, b);
        assert(current_consumer == _pid);
        turn = P;
    }
    od
}
A sample of the output is shown in the attached screenshot.
First of all, let me draw your attention to this excerpt of atomic's documentation:
If any statement within the atomic sequence blocks, atomicity is lost, and other processes are then allowed to start executing statements. When the blocked statement becomes executable again, the execution of the atomic sequence can be resumed at any time, but not necessarily immediately. Before the process can resume the atomic execution of the remainder of the sequence, the process must first compete with all other active processes in the system to regain control, that is, it must first be scheduled for execution.
In your model, this is currently not causing any problem because ch1 is a buffered channel (i.e. it has size >= 1). However, any small change in the model could break this invariant.
From the comments, I understand that your goal is to alternate consumers, but you don't really care which producer is sending the data.
To be honest, your model already contains two examples of how processes can alternate with one another:
The Producer/Consumers alternate one another via turn, by assigning a different value each time
The Producer/Consumers alternate one another also via ch1, since this has size 1
However, both approaches are alternating Producer/Consumers rather than Consumers themselves.
One approach I like is message filtering with eval (see the docs): each Consumer knows its own id, waits for a token carrying its own id on a separate channel, and only when that token is available does it start doing some work.
byte current_consumer;
chan prod2cons = [1] of { bit };
chan cons = [1] of { byte };

proctype Producer(byte id; byte total)
{
    bit a = 0;
    do
    :: true ->
        // atomic is only for printing purposes
        atomic {
            prod2cons ! a;
            printf("The producer %d --> sent %d\n", id, a);
        }
        a = 1 - a;
    od
}

proctype Consumer(byte id; byte total)
{
    bit b;
    do
    :: cons ? eval(id) ->
        current_consumer = id;
        atomic {
            prod2cons ? b;
            printf("The consumer %d --> received %d\n\n", id, b);
        }
        assert(current_consumer == id);
        // yield turn to the next Consumer
        cons ! ((id + 1) % total)
    od
}

init {
    run Producer(0, 2);
    run Producer(1, 2);
    run Consumer(0, 2);
    run Consumer(1, 2);
    // First consumer is 0
    cons ! 0;
}
This model, briefly:
Producers/Consumers alternate via prod2cons, a channel of size 1. This enforces the following behavior: after some producer creates a message, some consumer must consume it.
Consumers alternate via cons, a channel of size 1 containing a token value indicating which consumer is currently allowed to perform some work. All consumers peek at the contents of cons, but only the one with a matching id is allowed to consume the token and move on. At the end of its turn, the consumer creates a new token with the next id in the chain. Consumers alternate in round-robin fashion.
The output is:
The producer 0 --> sent 0
The consumer 1 --> received 0
The producer 1 --> sent 1
The consumer 0 --> received 1
The producer 1 --> sent 0
The consumer 1 --> received 0
...
The producer 0 --> sent 0
The consumer 1 --> received 0
The producer 0 --> sent 1
The consumer 0 --> received 1
The producer 0 --> sent 0
The consumer 1 --> received 0
The producer 0 --> sent 1
The consumer 0 --> received 1
Notice that producers do not necessarily alternate with one another, whereas consumers do -- as requested.

Computing c𝑖 = √(a𝑖 × b𝑖) in parallel using nested parallelism

Let's say we have two vectors A = (a𝑖) and B = (b𝑖), each of size n, and we have to compute a new vector C = (c𝑖) as c𝑖 = √(a𝑖 × b𝑖) for i = 1, ..., n.
Main question: what would be the best way to compute the c𝑖 in parallel (using nested parallelism, i.e. using sync and spawn)?
I think the understanding below is correct for the sequential computation:
for (i = 1 to n) {
    C[i] = Math.sqrt(A[i] * B[i]);
}
And is there any way to use parallel for loops to compute C in parallel?
If so, I think the approach will be the following:
parallel for (i = 1 to n) {
    C[i] = Math.sqrt(A[i] * B[i]);
}
Is it correct?
Assuming that by best you mean fastest, the usual approach would be to divide A and B into chunks, spawn a separate thread to handle each of these chunks in parallel, and wait for all the threads to finish their tasks.
The optimal number of chunks for such a computation will most likely be the number of CPU cores on your machine. So the pseudocode would look like:
chunkSize = ceiling(n / numberOfCPUs)
for (t = 1 to numberOfCPUs) {
    startIndex = (t - 1) * chunkSize + 1
    size = min(chunkSize, C.size - startIndex + 1)
    threads.add(Thread.spawn(startIndex, size))
}
threads.join()
Where each thread, provided with the startIndex and size, computes:
for (i = startIndex to startIndex + size - 1) {
    C[i] = Math.sqrt(A[i] * B[i])
}
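In a concrete language, the chunked version might look like the following C++11 sketch (an illustration assuming 0-based vectors, unlike the 1-based pseudocode above, and using std::thread for spawn/join):

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1000000;
    std::vector<double> A(n, 2.0), B(n, 8.0), C(n);

    // One chunk per hardware thread, as suggested above.
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (n + workers - 1) / workers;  // ceiling(n / workers)

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < workers; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(n, begin + chunk);
        if (begin >= end) break;  // more workers than elements
        threads.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                C[i] = std::sqrt(A[i] * B[i]);
        });
    }
    for (auto& th : threads) th.join();
    std::printf("C[0] = %f\n", C[0]);  // sqrt(2 * 8) = 4
}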
Another approach would be to have a pool of threads and give those threads a single shared queue of indices 1, 2, ..., n. Each thread on each iteration polls the top index (call it i) and calculates C[i]. As soon as the queue is empty, the work is done. The problem here is that you need an additional synchronization mechanism to guarantee that every index is processed by exactly one thread. For some simple tasks (like yours) such a mechanism might consume more resources than the actual calculation, but for relatively long-running tasks it works pretty well.
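A minimal C++11 sketch of that shared-queue idea, with an atomic counter standing in for the explicit queue; fetch_add is exactly the synchronization mechanism that guarantees each index goes to one thread only:

#include <algorithm>
#include <atomic>
#include <cmath>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1000000;
    std::vector<double> A(n, 2.0), B(n, 8.0), C(n);

    std::atomic<std::size_t> next{0};  // the shared "queue" of indices
    auto worker = [&] {
        // fetch_add hands each index to exactly one thread
        for (std::size_t i; (i = next.fetch_add(1)) < n; )
            C[i] = std::sqrt(A[i] * B[i]);
    };

    std::vector<std::thread> pool;
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    for (unsigned t = 0; t < workers; ++t)
        pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}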
There is also a combined approach: you break the initial set of tasks into chunks and provide each thread in the pool with its own chunk, but when a thread is done with its chunk, it starts 'stealing' tasks from other threads so as not to sit idle. On many real tasks this gives better results than either of the previous approaches.
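A toy version of that combined idea, under the same assumptions as the sketches above (real work-stealing schedulers use per-thread deques; here each chunk merely exposes an atomic cursor that idle threads can also advance):

#include <algorithm>
#include <atomic>
#include <cmath>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1000000;
    std::vector<double> A(n, 2.0), B(n, 8.0), C(n);

    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (n + workers - 1) / workers;
    std::vector<std::atomic<std::size_t>> cursor(workers);
    for (unsigned t = 0; t < workers; ++t) cursor[t] = t * chunk;

    auto run = [&](unsigned self) {
        // Process our own chunk first, then sweep the other chunks and
        // "steal" whatever indices are still unclaimed there.
        for (unsigned k = 0; k < workers; ++k) {
            const unsigned c = (self + k) % workers;
            const std::size_t end = std::min(n, (c + 1) * chunk);
            for (std::size_t i; (i = cursor[c].fetch_add(1)) < end; )
                C[i] = std::sqrt(A[i] * B[i]);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < workers; ++t) pool.emplace_back(run, t);
    for (auto& th : pool) th.join();
}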

Ask for multiple (or all) violation traces in Spin

Is it possible to get multiple (or all) violation traces for a property using Spin?
As an example, I created the Promela model below:
byte mutex = 0;
active proctype A() {
A1:  mutex == 0;  /* Is free? */
A2:  mutex++;     /* Get mutex */
A3:               /* A's critical section */
A4:  mutex--;     /* Release mutex */
}
active proctype B() {
B1:  mutex == 0;  /* Is free? */
B2:  mutex++;     /* Get mutex */
B3:               /* B's critical section */
B4:  mutex--;     /* Release mutex */
}
ltl {[] (mutex < 2)}
It has a naive mutex implementation. One would expect that processes A and B could not be in their critical sections at the same time, and I wrote an LTL expression to check that.
Running
spin -run mutex_example.pml
shows that the property is not valid and running
spin -p -t mutex_example.pml
shows the sequence of statements that violates the property.
Never claim moves to line 4 [(1)]
2: proc 1 (B:1) mutex_example.pml:11 (state 1) [((mutex==0))]
4: proc 0 (A:1) mutex_example.pml:4 (state 1) [((mutex==0))]
6: proc 1 (B:1) mutex_example.pml:12 (state 2) [mutex = (mutex+1)]
8: proc 0 (A:1) mutex_example.pml:5 (state 2) [mutex = (mutex+1)]
spin: _spin_nvr.tmp:3, Error: assertion violated
spin: text of failed assertion: assert(!(!((mutex<2))))
Never claim moves to line 3 [assert(!(!((mutex<2))))]
spin: trail ends after 9 steps
#processes: 2
mutex = 2
9: proc 1 (B:1) mutex_example.pml:14 (state 3)
9: proc 0 (A:1) mutex_example.pml:7 (state 3)
9: proc - (ltl_0:1) _spin_nvr.tmp:2 (state 6)
This shows that the sequence of statements (indicated by labels) 'B1' -> 'A1' -> 'B2' -> 'A2' violates the property, but there are other interleavings leading to that (e.g. 'A1' -> 'B1' -> 'B2' -> 'A2').
Can I ask Spin to give me multiple (or all) traces?
I doubt that you can get all violation traces in Spin.
For example, if we consider the following model, then there are infinitely many counter-examples.
byte mutex = 0;
active [2] proctype P() {
    do
    :: mutex == 0 ->
        mutex++;
        /* critical section */
        mutex--;
    od
}
ltl { [] (mutex <= 1) }
What you can do is use different search algorithms for the verifier, which might yield different counter-examples:
-search (or -run) generate a verifier, and compile and run it
options before -search are interpreted by spin to parse the input
options following a -search are used to compile and run the verifier pan
valid options that can follow a -search argument include:
-bfs perform a breadth-first search
-bfspar perform a parallel breadth-first search
-bcs use the bounded-context-switching algorithm
-bitstate or -bit, use bitstate storage
-biterate use bitstate with iterative search refinement (-w18..-w35)
-swarmN,M like -biterate, but running all iterations in parallel
perform N parallel runs and increment -w every M runs
default value for N is 10, default for M is 1
-link file.c link executable pan to file.c
-collapse use collapse state compression
-hc use hash-compact storage
-noclaim ignore all ltl and never claims
-p_permute use process scheduling order permutation
-p_rotateN use process scheduling order rotation by N
-p_reverse use process scheduling order reversal
-ltl p verify the ltl property named p
-safety compile for safety properties only
-i use the dfs iterative shortening algorithm
-a search for acceptance cycles
-l search for non-progress cycles
similarly, a -D... parameter can be specified to modify the compilation
and any valid runtime pan argument can be specified for the verification
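For example, a breadth-first search typically reports a shortest counter-example, which is often different from the one found by the default depth-first search:
spin -search -bfs mutex_example.pml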

LTL model checking using Spin and Promela syntax

I'm trying to reproduce the ALGOL 60 code written by Dijkstra in the paper titled "Cooperating sequential processes". The code is the first attempt to solve the mutex problem. Here is the syntax:
begin integer turn; turn:= 1;
parbegin
process 1: begin Ll: if turn = 2 then goto Ll;
critical section 1;
turn:= 2;
remainder of cycle 1; goto L1
end;
process 2: begin L2: if turn = 1 then goto L2;
critical section 2;
turn:= 1;
remainder of cycle 2; goto L2
end
parend
end
So I tried to reproduce the above code in Promela and here is my code:
#define true 1
#define Aturn true
#define Bturn false

bool turn, status;

active proctype A()
{
L1: (turn == 1);
    status = Aturn;
    goto L1;
    /* critical section */
    turn = 1;
}

active proctype B()
{
L2: (turn == 2);
    status = Bturn;
    goto L2;
    /* critical section */
    turn = 2;
}

never { /* ![]p */
    if
    :: (!status) -> skip
    fi;
}

init {
    turn = 1;
    run A(); run B();
}
What I'm trying to do is verify that the fairness property will never hold, because the loop at label L1 runs forever.
The issue here is that my never claim block is not producing any error; the output I get simply says that my statement was never reached.
Here is the actual output from iSpin:
spin -a dekker.pml
gcc -DMEMLIM=1024 -O2 -DXUSAFE -DSAFETY -DNOCLAIM -w -o pan pan.c
./pan -m10000
Pid: 46025
(Spin Version 6.2.3 -- 24 October 2012)
+ Partial Order Reduction
Full statespace search for:
never claim - (not selected)
assertion violations +
cycle checks - (disabled by -DSAFETY)
invalid end states +
State-vector 44 byte, depth reached 8, errors: 0
11 states, stored
9 states, matched
20 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.001 equivalent memory usage for states (stored*(State-vector + overhead))
0.291 actual memory usage for states
128.000 memory used for hash table (-w24)
0.534 memory used for DFS stack (-m10000)
128.730 total actual memory usage
unreached in proctype A
dekker.pml:13, state 4, "turn = 1"
dekker.pml:15, state 5, "-end-"
(2 of 5 states)
unreached in proctype B
dekker.pml:20, state 2, "status = 0"
dekker.pml:23, state 4, "turn = 2"
dekker.pml:24, state 5, "-end-"
(3 of 5 states)
unreached in claim never_0
dekker.pml:30, state 5, "-end-"
(1 of 5 states)
unreached in init
(0 of 4 states)
pan: elapsed time 0 seconds
No errors found -- did you verify all claims?
I've read all the Spin documentation on the never{..} block but couldn't find my answer (here is the link). I've also tried using ltl{..} blocks (link), but that just gave me a syntax error, even though the documentation explicitly says they can appear outside init and proctypes. Can someone help me correct this code?
Thank you
You've redefined 'true', which can't possibly be good. I axed that redefinition and the never claim fails. But the failure is immaterial to your goal: the initial state of 'status' is 'false', and thus the never claim exits, which is a failure.
Also, it is slightly bad form to assign 1 or 0 to a bool; assign true or false instead - or use bit. Why not follow the Dijkstra code more closely - use an 'int' or 'byte'. It is not as if performance will be an issue in this problem.
You don't need 'active' if you are going to call 'run' - just one or the other.
My translation of 'process 1' would be:
proctype A()
{
L1: turn != 2 ->
    /* critical section */
    status = Aturn;
    turn = 2;
    /* remainder of cycle 1 */
    goto L1;
}
but I could be wrong on that.
