What should be the value of the semaphore - multithreading

What is the maximum value to which the semaphore s can be initialized (____) such that a deadlock is still possible, given two threads: the first executes the method sequence "A A B C C" and the second executes "A A B C C B"?
...
semaphore s = ____;
void A(){
    wait(&s);
    ...
}
void B(){
    signal(&s);
    ...
}
void C(){
    wait(&s);
    ...
}
...
In that case, the minimum value that the semaphore will attain is ____. (Options: 0, 4, 2, 6, -3, 3, -2, 5, -1, 1.)
I tried to solve it like this: the minimum value of s that guarantees there is no deadlock is 4 (because in the worst case the sequence is A A A A and then B), so the maximum value with a possible deadlock is 3. Then, for the second part, since there are 8 wait() calls in total (for A and C) and 3 signal() calls in total (for B), we obtain 3 - 8 + 3 = -2.
Can someone confirm if this is the correct solution?
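One way to double-check the reasoning is to brute-force every interleaving. Below is a small C++ sketch I put together (not part of the exercise) that, for each candidate initial value, reports whether some interleaving of the two sequences can reach a deadlock. It assumes the textbook semantics where wait() blocks while the semaphore is 0 and never drives it negative; if your course uses the "decrement first, then block while negative" convention, the minimum attained value has to be computed differently.

#include <iostream>
#include <set>
#include <string>
#include <tuple>

// 'W' = wait(&s), 'S' = signal(&s)
static const std::string T1 = "WWSWW";   // A A B C C
static const std::string T2 = "WWSWWS";  // A A B C C B

// Depth-first search over all interleavings; returns true if a state is
// reachable in which every unfinished thread is blocked on wait() with s == 0.
bool deadlockReachable(int pc1, int pc2, int s,
                       std::set<std::tuple<int, int, int>>& seen) {
    if (!seen.insert(std::make_tuple(pc1, pc2, s)).second) return false;  // state already explored
    bool t1Done = pc1 == (int)T1.size();
    bool t2Done = pc2 == (int)T2.size();
    if (t1Done && t2Done) return false;                                   // both threads finished
    bool t1Blocked = !t1Done && T1[pc1] == 'W' && s == 0;
    bool t2Blocked = !t2Done && T2[pc2] == 'W' && s == 0;
    if ((t1Done || t1Blocked) && (t2Done || t2Blocked)) return true;      // deadlock
    if (!t1Done && !t1Blocked &&
        deadlockReachable(pc1 + 1, pc2, s + (T1[pc1] == 'W' ? -1 : 1), seen))
        return true;
    if (!t2Done && !t2Blocked &&
        deadlockReachable(pc1, pc2 + 1, s + (T2[pc2] == 'W' ? -1 : 1), seen))
        return true;
    return false;
}

int main() {
    for (int init = 0; init <= 8; ++init) {
        std::set<std::tuple<int, int, int>> seen;
        std::cout << "s = " << init << ": deadlock "
                  << (deadlockReachable(0, 0, init, seen) ? "possible" : "not possible")
                  << "\n";
    }
}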

Related

CUDA equivalent of pragma omp task

I am working on a problem where the amount of work per thread may vary drastically: for example, one thread may handle 1,000,000 elements while another handles only 1 or 2. I stumbled upon this question, where the answer solves the unbalanced workload by using OpenMP tasks on the CPU, so my question is: can I achieve the same in CUDA?
In case you want more context:
The problem I'm trying to solve is this: I have n tuples, each with a starting point, an ending point, and a value.
(0, 3, 1), (3, 6, 2), (6, 10, 3), ...
For each tuple I want to write its value into every position between the starting point and the ending point of another, initially empty, array.
1, 1, 1, 2, 2, 2, 3, 3, 3, 3, ...
It is guaranteed that the start/end ranges do not overlap.
My current approach is one thread per tuple, but the start and end points can vary a lot, so the imbalanced workload between threads might become a bottleneck for the program. That may be rare, but it could very well happen.
The most common thread strategy I can think of in CUDA is to assign one thread per output point, and then have each thread do the work necessary to populate its output point.
For your stated objective (have each thread do roughly equal work) this is a useful strategy.
I will suggest using thrust for this. The basic idea is to:
determine the necessary size of the output based on the input
spin up a set of threads equal to the output size, where each thread determines its "insert index" in the output array by using a vectorized binary search on the input
with the insert index, insert the appropriate value in the output array.
I have used your data; the only change is that I changed the inserted values from 1,2,3 to 5,2,7:
$ cat t1871.cu
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/binary_search.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <iostream>
using namespace thrust::placeholders;
typedef thrust::tuple<int,int,int> mt;
// returns selected item from tuple
struct my_cpy_functor1
{
    __host__ __device__ int operator()(mt d){ return thrust::get<1>(d); }
};
struct my_cpy_functor2
{
    __host__ __device__ int operator()(mt d){ return thrust::get<2>(d); }
};
int main(){
    mt my_data[] = {{0, 3, 5}, {3, 6, 2}, {6, 10, 7}};
    int ds = sizeof(my_data)/sizeof(my_data[0]); // determine data size
    int os = thrust::get<1>(my_data[ds-1]) - thrust::get<0>(my_data[0]); // and output size
    thrust::device_vector<mt> d_data(my_data, my_data+ds); // transfer data to device
    thrust::device_vector<int> d_idx(ds+1); // create index array for searching of insertion points
    thrust::transform(d_data.begin(), d_data.end(), d_idx.begin()+1, my_cpy_functor1()); // set index array
    thrust::device_vector<int> d_ins(os); // create array to hold insertion points
    thrust::upper_bound(d_idx.begin(), d_idx.end(), thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(os), d_ins.begin()); // identify insertion points
    thrust::transform(thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(d_ins.begin(), _1 -1)), thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(d_ins.end(), _1 -1)), d_ins.begin(), my_cpy_functor2()); // insert
    thrust::copy(d_ins.begin(), d_ins.end(), std::ostream_iterator<int>(std::cout, ","));
    std::cout << std::endl;
}
$ nvcc -o t1871 t1871.cu -std=c++14
$ ./t1871
5,5,5,2,2,2,7,7,7,7,
$

How do I make the two receiving processes not run twice in a row in a Promela model?

I am a beginner with Spin. I am trying to make the model run the two receiving processes (called Consumer in the model) alternately, i.e. consumer 1, consumer 2, consumer 1, consumer 2, ... But when I run this code, the output of the 2 consumer processes appears in random order. Can someone help me?
This is the code I am struggling with.
mtype = {P, C};
mtype turn = P;
chan ch1 = [1] of {bit};
byte current_consumer = 1;
byte previous_consumer;
active [2] proctype Producer()
{
    bit a = 0;
    do
    :: atomic {
           turn == P ->
           ch1 ! a;
           printf("The producer %d --> sent %d!\n", _pid, a);
           a = 1 - a;
           turn = C;
       }
    od
}
active [2] proctype Consumer()
{
    bit b;
    do
    :: atomic {
           turn == C ->
           current_consumer = _pid;
           ch1 ? b;
           printf("The consumer %d --> received %d!\n\n", _pid, b);
           assert(current_consumer == _pid);
           turn = P;
       }
    od
}
Sample output is shown in the attached screenshot.
First of all, let me draw your attention to this excerpt of atomic's documentation:
If any statement within the atomic sequence blocks, atomicity is lost, and other processes are then allowed to start executing statements. When the blocked statement becomes executable again, the execution of the atomic sequence can be resumed at any time, but not necessarily immediately. Before the process can resume the atomic execution of the remainder of the sequence, the process must first compete with all other active processes in the system to regain control, that is, it must first be scheduled for execution.
In your model, this is currently not causing any problem because ch1 is a buffered channel (i.e. it has size >= 1). However, any small change in the model could break this invariant.
From the comments, I understand that your goal is to alternate consumers, but you don't really care which producer is sending the data.
To be honest, your model already contains two examples of how processes can alternate with one another:
The Producer/Consumers alternate one another via turn, by assigning a different value each time
The Producer/Consumers alternate one another also via ch1, since this has size 1
However, both approaches are alternating Producer/Consumers rather than Consumers themselves.
One approach I like is message filtering with eval (see the docs): each Consumer knows its own id, waits for a token carrying its own id on a separate channel, and only when that token is available does it start doing some work.
byte current_consumer;
chan prod2cons = [1] of { bit };
chan cons = [1] of { byte };
proctype Producer(byte id; byte total)
{
    bit a = 0;
    do
    :: true ->
       // atomic is only for printing purposes
       atomic {
           prod2cons ! a;
           printf("The producer %d --> sent %d\n", id, a);
       }
       a = 1 - a;
    od
}
proctype Consumer(byte id; byte total)
{
    bit b;
    do
    :: cons?eval(id) ->
       current_consumer = id;
       atomic {
           prod2cons ? b;
           printf("The consumer %d --> received %d\n\n", id, b);
       }
       assert(current_consumer == id);
       // yield turn to the next Consumer
       cons ! ((id + 1) % total)
    od
}
init {
    run Producer(0, 2);
    run Producer(1, 2);
    run Consumer(0, 2);
    run Consumer(1, 2);
    // First consumer is 0
    cons!0;
}
This model, briefly:
Producers/Consumers alternate via prod2cons, a channel of size 1. This enforces the following behavior: after some producers created a message some consumer must consume it.
Consumers alternate via cons, a channel of size 1 containing a token value that indicates which consumer is currently allowed to perform some work. All consumers peek at the contents of cons, but only the one with a matching id is allowed to consume the token and move on. At the end of its turn, the consumer creates a new token with the next id in the chain. Consumers alternate in a round-robin fashion.
The output is:
The producer 0 --> sent 0
The consumer 1 --> received 0
The producer 1 --> sent 1
The consumer 0 --> received 1
The producer 1 --> sent 0
The consumer 1 --> received 0
...
The producer 0 --> sent 0
The consumer 1 --> received 0
The producer 0 --> sent 1
The consumer 0 --> received 1
The producer 0 --> sent 0
The consumer 1 --> received 0
The producer 0 --> sent 1
The consumer 0 --> received 1
Notice that producers do not necessarily alternate with one another, whereas consumers do -- as requested.

Understanding the Weak memory model

Assume we have two threads, working with two variables A and B in memory:
Thread 1         Thread 2
========         ========
1) A = 1         3) B = 1
2) Print(B)      4) Print(A)
I know that in a sequentially consistent (strong) model you would get
1 -> 2 -> 3 -> 4 executed in order. x86 is TSO, which is close to a strong model (but not quite as strong as one).
I don't understand what the weak model is. Does a weak model just pick random instructions and execute them? i.e. would things like 4 -> 2 -> 3 -> 1 be possible?
I have 2 more questions regarding this topic:
What is the difference between out-of-order execution, which a CPU does to make use of instruction cycles that would otherwise be wasted, and memory reordering due to the memory model? Are they the same thing, or does memory reordering only concern load/store instructions?
Is the memory model a concern only when dealing with multiple threads? Why is it not an issue in single-threaded programs?
Sequential consistency does not tell you that 1,2,3,4 will execute in that global order at all.
Sequential consistency tells you that if CPU0 is executing 1,2 and CPU1 is executing 3,4, then each CPU will execute its own block in that order: no side effect (memory store) of 2 will be perceivable before those of 1, and no side effect of 4 will be perceivable before those of 3.
If earlier A=B=0, then:
Thread 1         Thread 2
========         ========
1) A = 1         3) B = 1
2) Print(A,B)    4) Print(A,B)
All sequential consistency tells us is that the possible outputs are:
Thread 1: { 1, 0 }, { 1, 1 }
Thread 2: { 0, 1 }, { 1, 1 }
If we extend it to an initial state of A=B=C=D=0
Thread 1          Thread 2
========          ========
A = 1             D = 1
C = 1             B = 1
Print(A,B,C,D)    Print(A,B,C,D)
Thread1 valid outputs:
1: {1, 0, 1, 0} -- no effects from thread2 seen
2: {1, 0, 1, 1} -- update of D visible; not B
3: {1, 1, 1, 0} -- update of B visible; not D
4: {1, 1, 1, 1} -- update of B and D visible.
Thread2 valid outputs:
5: {0, 1, 0, 1} -- no effects from thread1 seen
6: {0, 1, 1, 1} -- update of C visible; not A
7: {1, 1, 0, 1} -- update of A visible; not C
8: {1, 1, 1, 1} -- update of A and C visible.
In sequential consistency, 1,2,4 : 5,6,8 are possible.
In weaker consistencies, 1,2,3,4 : 5,6,7,8 are possible.
Note that in neither case would a thread fail to see its own updates in order; but outputs 3 and 7 result from a thread seeing the other thread's updates out of order.
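To make the distinction concrete, here is a hedged C++11 sketch of the original two-variable program using std::atomic (my illustration, not part of the answer above). Built with memory_order_seq_cst, the outcome in which both threads read 0 is forbidden; switch both operations to memory_order_relaxed and that outcome becomes legal, and may occasionally be observed on real hardware, since a store can sit in the store buffer while the subsequent load completes.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> A{0}, B{0};

int main() {
    // Change to std::memory_order_relaxed to allow the "both read 0" outcome.
    constexpr std::memory_order order = std::memory_order_seq_cst;
    int bothZero = 0;

    for (int i = 0; i < 100000; ++i) {
        A.store(0, order);
        B.store(0, order);
        int r1 = -1, r2 = -1;

        std::thread t1([&] { A.store(1, order); r1 = B.load(order); });  // 1) A = 1   2) Print(B)
        std::thread t2([&] { B.store(1, order); r2 = A.load(order); });  // 3) B = 1   4) Print(A)
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) ++bothZero;  // impossible under sequential consistency
    }
    std::printf("runs where both threads read 0: %d\n", bothZero);
}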
If you require a specific ordering to be maintained, inserting a barrier instruction[1] is the preferred approach. When the CPU encounters a barrier, it affects either the prefetched loads (read barrier), the store queue (write barrier), or both (read/write barrier).
When there are two memory writes, A = 1; C = 1;, you can install a write barrier as membar w; store A; store C. This ensures that all stores issued before the store to A will be seen before either the store to A or the store to C, but it enforces no ordering between A and C.
You can instead install it as store A; membar w; store C, which ensures that the store to A will be seen before the store to C; and store A; store C; membar w ensures that both A and C will be seen before any subsequent stores.
So which barrier or barrier combination is right for your case?
[1] more modern architectures incorporate barriers into the load and store instructions themselves; so you might have a store.sc A; store C;. The advantage here is to limit the scope of the store barrier so that the store unit only has to serialize these stores, rather than suffer the latency of the entire queue.
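As a language-level illustration of the store A; membar w; store C placement (my mapping to C++11 atomics, not the answer's pseudo-assembly): a release fence between two relaxed stores orders them, provided the reading side pairs it with an acquire operation.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> A{0}, C{0};

void writer() {
    A.store(1, std::memory_order_relaxed);                 // store A
    std::atomic_thread_fence(std::memory_order_release);   // membar w
    C.store(1, std::memory_order_relaxed);                 // store C
}

void reader() {
    while (C.load(std::memory_order_acquire) == 0) {       // spin until the store to C is visible
    }
    assert(A.load(std::memory_order_relaxed) == 1);        // the store to A is then guaranteed visible too
}

int main() {
    std::thread w(writer), r(reader);
    w.join();
    r.join();
}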

Shared counter using combining tree deadlock issue

I am working on a shared counter increment application using the combining tree concept. My goal is to make this application work on 2^n cores, such as 4, 8, 16, 32, etc. The algorithm may fail if a thread dies; the assumption is that there are no thread failures and no very slow threads.
Two threads compete at each leaf node, and the one that arrives later goes up the tree.
The first one to arrive waits until the second one goes up the hierarchy and comes back down with the correct return value.
The second thread wakes the first thread up.
Each thread gets the correct fetchAndAdd value.
But this algorithm sometimes gets stuck inside the while (nodes[index].isActive == 1) or while (nodes[index].waiting == 1) loop. I don't see any possibility of a deadlock, because only two threads compete at each node. Could you enlighten me on this problem?
int increment(int threadId, int index, int value) {
    int lastValue = __sync_fetch_and_add(&nodes[index].firstValue, value);
    if (index == 0) return lastValue;
    while (nodes[index].isActive == 1) {
    }
    if (lastValue == 0) {
        while (nodes[index].waiting == 1) {
        }
        nodes[index].waiting = 1;
        nodes[index].isActive = 0;
    } else {
        nodes[index].isActive = 1;
        nodes[index].result = increment(threadId, (index - 1)/2, nodes[index].firstValue);
        nodes[index].firstValue = 0;
        nodes[index].waiting = 0;
    }
    return nodes[index].result + lastValue;
}
I don't think that will work on 1 core. You loop infinitely on isActive, because you can't set isActive to 0 unless it is already 0.
I'm not sure if your code has a mechanism to stop this, but here's my best crack at it. Here is an interleaving that causes the problem:
ex)
thread 1:  nodes[10].isActive = 1
thread 2:  // next run on index 10
           while (nodes[index].isActive == 1) { /* here is the deadlock */ }
It's hard to understand exactly what's going on here and what you're trying to do, but I would recommend that you somehow make it possible to deactivate nodes[index].isActive. You may want to set it to 0 at the end of the function.
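To illustrate that last point in isolation (a generic sketch, not the combining-tree code itself): the thread that raises the flag must eventually lower it, or the partner spins forever. std::atomic is used here only so the spin loop is well defined; mapping this back onto your node structure is the actual fix.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> active{1};   // the node was already marked active by the thread going up the tree
std::atomic<int> result{0};

void combining_thread() {     // plays the role of the thread that went up the hierarchy
    result.store(42);         // ... deposit the combined return value ...
    active.store(0);          // reset the flag; without this store the waiter below never exits its loop
}

void waiting_thread() {       // plays the role of the first arriver parked at the node
    while (active.load() == 1) {   // same shape as while (nodes[index].isActive == 1)
    }
    std::printf("combined result: %d\n", result.load());
}

int main() {
    std::thread w(waiting_thread), c(combining_thread);
    w.join();
    c.join();
}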

How do I parallelize GPars Actors?

My understanding of GPars Actors may be off, so please correct me if I'm wrong. I have a Groovy app that polls a web service for jobs. When one or more jobs are found, it sends each job to a DynamicDispatchActor I've created, and the job is handled. The jobs are completely self-contained and don't need to return anything to the main thread. When multiple jobs come in at once, I'd like them to be processed in parallel, but no matter what configuration I try, the actor processes them first-in, first-out.
To give a code example:
def poolGroup = new DefaultPGroup(new DefaultPool(true, 5))

def actor = poolGroup.messageHandler {
    when {Integer msg ->
        println("I'm number ${msg} on thread ${Thread.currentThread().name}")
        Thread.sleep(1000)
    }
}

def integers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

integers.each {
    actor << it
}
This prints out:
I'm number 1 on thread Actor Thread 31
I'm number 2 on thread Actor Thread 31
I'm number 3 on thread Actor Thread 31
I'm number 4 on thread Actor Thread 31
I'm number 5 on thread Actor Thread 31
I'm number 6 on thread Actor Thread 31
I'm number 7 on thread Actor Thread 31
I'm number 8 on thread Actor Thread 31
I'm number 9 on thread Actor Thread 31
I'm number 10 on thread Actor Thread 31
There is a slight pause between each printout. Also notice that every printout comes from the same actor/thread.
What I'd like to see is the first 5 numbers printed out immediately, because the thread pool is set to 5, and then the next 5 numbers as those threads free up. Am I completely off base here?
To make it run as you expect, there are a few changes to make:
import groovyx.gpars.group.DefaultPGroup
import groovyx.gpars.scheduler.DefaultPool

def poolGroup = new DefaultPGroup(new DefaultPool(true, 5))

def closure = {
    when {Integer msg ->
        println("I'm number ${msg} on thread ${Thread.currentThread().name}")
        Thread.sleep(1000)
        stop()
    }
}

def integers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def actors = integers.collect { poolGroup.messageHandler(closure) << it }
actors*.join()
Full gist file: https://gist.github.com/wololock/7f1348e04f68710e42d2
Then the output will be:
I'm number 5 on thread Actor Thread 5
I'm number 4 on thread Actor Thread 4
I'm number 1 on thread Actor Thread 1
I'm number 3 on thread Actor Thread 3
I'm number 2 on thread Actor Thread 2
I'm number 6 on thread Actor Thread 3
I'm number 9 on thread Actor Thread 4
I'm number 7 on thread Actor Thread 2
I'm number 8 on thread Actor Thread 5
I'm number 10 on thread Actor Thread 1
Now let's take a look at what changed. First of all, in your previous example you worked with a single actor only. You defined poolGroup correctly, but then you created a single actor and shifted all the computation to that single instance. To run those computations in parallel, you have to rely on poolGroup and only send input to a message handler; the pool group will handle actor creation and lifecycle management. This is what we do in:
def actors = integers.collect { poolGroup.messageHandler(closure) << it }
It creates a collection of actors, each started with its given input. The pool group takes care that the specified pool size is not exceeded. Then you have to join each actor, and this can be done using Groovy's spread operator: actors*.join(). Thanks to that, the application waits to terminate until all actors have stopped their computation. That's why we have to add the stop() call to the when closure of the message handler's body: without it, the program won't terminate, because the pool group does not know that the actors have done their job; they may be waiting, e.g., for another message.
Alternative solution
We can also consider an alternative solution that uses GPars parallelized iterations:
import groovyx.gpars.GParsPool

// This example is dummy, but let's assume that this processor is
// a stateless component shared between threads.
class Processor {
    void process(int number) {
        println "${Thread.currentThread().name} starting with number ${number}"
        Thread.sleep(1000)
    }
}

def integers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Processor processor = new Processor()

GParsPool.withPool 5, {
    integers.eachParallel { processor.process(it) }
}
In this example you have a stateless component, Processor, and you parallelize the computation by using one instance of that stateless Processor with multiple input values.
I've tried to figure out the case you mentioned in the comment, but I'm not sure whether a single actor can process multiple messages at a time. Statelessness of an actor means only that it does not change its internal state while processing a message and must not store any other information in the actor's scope. It would be great if someone could correct me if my reasoning is not correct :)
I hope this will help you. Best!
