OpenMP: Divide all the threads into different groups

I would like to divide all the threads into 2 different groups, since I have two parallel tasks to run asynchronously. For example, if 8 threads are available in total, I would like 6 threads dedicated to task1 and the other 2 dedicated to task2.
How can I achieve this with OpenMP?

This is a job for OpenMP nested parallelism, available as of OpenMP 3: you can use OpenMP tasks to start two independent tasks and then, within those tasks, have nested parallel regions that use the appropriate number of threads.
As a quick example:
#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
    omp_set_nested(1);   /* make sure nested parallelism is on */
    int nprocs = omp_get_num_procs();
    int nthreads1 = nprocs/3;
    int nthreads2 = nprocs - nthreads1;

    #pragma omp parallel default(none) shared(nthreads1, nthreads2) num_threads(2)
    #pragma omp single
    {
        #pragma omp task
        #pragma omp parallel for num_threads(nthreads1)
        for (int i=0; i<16; i++)
            printf("Task 1: thread %d of the %d children of %d: handling iter %d\n",
                   omp_get_thread_num(), omp_get_team_size(2),
                   omp_get_ancestor_thread_num(1), i);

        #pragma omp task
        #pragma omp parallel for num_threads(nthreads2)
        for (int j=0; j<16; j++)
            printf("Task 2: thread %d of the %d children of %d: handling iter %d\n",
                   omp_get_thread_num(), omp_get_team_size(2),
                   omp_get_ancestor_thread_num(1), j);
    }

    return 0;
}
Running this on an 8 core (16 hardware threads) node,
$ gcc -fopenmp nested.c -o nested -std=c99
$ ./nested
Task 2: thread 3 of the 11 children of 0: handling iter 6
Task 2: thread 3 of the 11 children of 0: handling iter 7
Task 2: thread 1 of the 11 children of 0: handling iter 2
Task 2: thread 1 of the 11 children of 0: handling iter 3
Task 1: thread 2 of the 5 children of 1: handling iter 8
Task 1: thread 2 of the 5 children of 1: handling iter 9
Task 1: thread 2 of the 5 children of 1: handling iter 10
Task 1: thread 2 of the 5 children of 1: handling iter 11
Task 2: thread 6 of the 11 children of 0: handling iter 12
Task 2: thread 6 of the 11 children of 0: handling iter 13
Task 1: thread 0 of the 5 children of 1: handling iter 0
Task 1: thread 0 of the 5 children of 1: handling iter 1
Task 1: thread 0 of the 5 children of 1: handling iter 2
Task 1: thread 0 of the 5 children of 1: handling iter 3
Task 2: thread 5 of the 11 children of 0: handling iter 10
Task 2: thread 5 of the 11 children of 0: handling iter 11
Task 2: thread 0 of the 11 children of 0: handling iter 0
Task 2: thread 0 of the 11 children of 0: handling iter 1
Task 2: thread 2 of the 11 children of 0: handling iter 4
Task 2: thread 2 of the 11 children of 0: handling iter 5
Task 1: thread 1 of the 5 children of 1: handling iter 4
Task 2: thread 4 of the 11 children of 0: handling iter 8
Task 2: thread 4 of the 11 children of 0: handling iter 9
Task 1: thread 3 of the 5 children of 1: handling iter 12
Task 1: thread 3 of the 5 children of 1: handling iter 13
Task 1: thread 3 of the 5 children of 1: handling iter 14
Task 2: thread 7 of the 11 children of 0: handling iter 14
Task 2: thread 7 of the 11 children of 0: handling iter 15
Task 1: thread 1 of the 5 children of 1: handling iter 5
Task 1: thread 1 of the 5 children of 1: handling iter 6
Task 1: thread 1 of the 5 children of 1: handling iter 7
Task 1: thread 3 of the 5 children of 1: handling iter 15
Updated: I've changed the above to include the thread ancestor; there was some confusion because there were (for instance) two "thread 1"s printed - here I've also printed the ancestor (e.g., "thread 1 of the 5 children of 1" vs "thread 1 of the 11 children of 0").
From the OpenMP standard, section 3.2.4: "The omp_get_thread_num routine returns the thread number, within the current team, of the calling thread." And from section 2.5: "When a thread encounters a parallel construct, a team of threads is created to execute the parallel region [...] The thread that encountered the parallel construct becomes the master thread of the new team, with a thread number of zero for the duration of the new parallel region."
That is, within each of those (nested) parallel regions, teams of threads are created which have thread ids starting at zero; but just because those ids overlap within the team doesn't mean they're the same threads. Here I've emphasized that by printing their ancestor number as well, but if the threads were doing CPU-intensive work you'd also see with monitoring tools that there were indeed 16 active threads, not just 11.
The reason why they are team-local thread numbers and not globally-unique thread numbers is pretty straightforward; it would be almost impossible to keep track of globally-unique thread numbers in an environment where nested and dynamic parallelism can happen. Say there are three teams of threads, numbered [0..5], [6..10], and [11..15], and the middle team completes. Do we leave gaps in the thread numbering? Do we interrupt all threads to change their global numbers? What if a new team is started with 7 threads? Do we start them at 6 and have overlapping thread ids, or do we start them at 16 and leave gaps in the numbering?
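To see this team-local numbering directly, here is a separate minimal sketch (not part of the answer above; the thread counts 2 and 3 are arbitrary) that prints each nested thread's level, team-local number, and outer-team ancestor:
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_nested(1);                      /* enable nested parallelism */
    #pragma omp parallel num_threads(2)     /* outer team: threads 0 and 1 */
    {
        #pragma omp parallel num_threads(3) /* each outer thread spawns its own inner team */
        {
            /* omp_get_thread_num() restarts at 0 in every inner team;
               the ancestor number is what tells the inner teams apart */
            printf("level %d: thread %d of %d, child of outer thread %d\n",
                   omp_get_level(), omp_get_thread_num(),
                   omp_get_team_size(omp_get_level()),
                   omp_get_ancestor_thread_num(1));
        }
    }
    return 0;
}
With 2 outer and 3 inner threads this should print six level-2 lines, with thread numbers 0..2 appearing once under each ancestor - the same numbers, but different threads.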

Related

OpenMP task firstprivate

I have a question regarding the OpenMP task pragma, if we suppose the following code:
#pragma omp parallel
{
x = omp_get_thread_num();
#pragma omp task firstprivate(x)
//do something with x
}
As far as I understand tasking, it is not guaranteed which thread executes the task.
So my question is: is "x" in the task now the ID of the thread that generated the task, or of the one that executes it?
E.g. if thread 0 comes across the task and thread 3 executes it, x should be 0 then, right?
So my question is: is "x" in the task now the ID of the thread that generated the task, or of the one that executes it?
It depends. If the data-sharing attribute of 'x' in the parallel region is shared (which it typically is by default), then:
'x' can be equal to any thread ID from 0 to the total number of threads in the team minus 1. This is because there is a race condition during the update of the variable 'x'.
This can be showcased with the following code:
#include <omp.h>
#include <stdio.h>
#include <unistd.h>   /* for sleep() */

int main(){
    int x;
    #pragma omp parallel
    {
        x = omp_get_thread_num();
        if(omp_get_thread_num() == 1){
            sleep(5);
            #pragma omp task firstprivate(x)
            {
                printf("Value of x = %d | ID Thread executing = %d\n", x, omp_get_thread_num());
            }
        }
    }
    return 0;
}
So the thread with ID=1 creates the task; however, 'x' can have a value different from '1' and also different from the ID of the thread currently executing the task. This is because, while the thread with ID=1 is waiting in sleep(5);, the remaining threads in the team can update the value of 'x'.
Typically, the canonical form in such use-cases would be to use a single construct wrapped around the task creation, as follows:
#include <omp.h>
#include <stdio.h>

int main(){
    int x;
    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("I am the task creator '%d'\n", omp_get_thread_num());
            x = omp_get_thread_num();
            #pragma omp task firstprivate(x)
            {
                printf("Value of x = %d | ID Thread executing = %d\n", x, omp_get_thread_num());
            }
        }
    }
    return 0;
}
And in this case, as @Michael Klemm mentioned in the comments:
..., x will contain the ID of the thread that created the task. So, yes, if thread 0 created the task, x will be zero even though thread 3 is picked to execute the task.
This also applies in cases where the variable 'x' is private at the time the statement x = omp_get_thread_num(); happens.
Therefore, if you run the code above, you should always get "I am the task creator" with the same value as "Value of x =", but you may get a different value for "ID Thread executing". For example:
I am the task creator '4'
Value of x = 4 | ID Thread executing = 7
This is in accordance with the behaviour specified in the OpenMP standard, namely:
The task construct is a task generating construct. When a thread encounters a task construct, an explicit task is generated from the code for the associated structured block. The data environment of the task is created according to the data-sharing attribute clauses on the task construct, per-data environment ICVs, and any defaults that apply.
The encountering thread may immediately execute the task, or defer its execution. In the latter case, any thread in the team may be assigned the task.
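As a minimal illustration of that last point (a sketch, not taken from the answer above), making 'x' private by declaring it inside the parallel region does not change the outcome: firstprivate(x) on the task still captures the value written by the thread that encountered the task construct:
#include <omp.h>
#include <stdio.h>

int main(){
    #pragma omp parallel
    {
        int x = omp_get_thread_num(); /* x is private: declared inside the parallel region */
        #pragma omp single
        {
            /* firstprivate(x) copies the encountering thread's private x into the task */
            #pragma omp task firstprivate(x)
            {
                printf("Value of x = %d | ID Thread executing = %d\n",
                       x, omp_get_thread_num());
            }
        }
    }
    return 0;
}
Here "Value of x" is always the ID of the thread that executed the single construct, while "ID Thread executing" may be any thread in the team.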

Cython parallel prange - thread locality?

I am iterating using prange over a list like this:
from cython.parallel import prange, threadid

cdef int tid
cdef CythonElement tEl
cdef int a, b, c

# elList: python list of CythonElement instances is passed via function call
for n in prange(nElements, schedule='dynamic', nogil=True):
    with gil:
        tEl = elList[n]
        tid = threadid()
        a = tEl.a
        b = tEl.b
        c = tEl.c
        print("thread {:} elnumber {:}".format(tid, tEl.elNumber))
    #nothing is done here
    with gil:
        print("thread {:} elnumber {:}".format(tid, tEl.elNumber))
    # some other computations based on a, b and c here ...
I expect an output like this:
thread 0 elnumber 1
thread 1 elnumber 2
thread 2 elnumber 3
thread 3 elnumber 4
thread 0 elnumber 1
thread 1 elnumber 2
thread 2 elnumber 3
thread 3 elnumber 4
But i get:
thread 1 elnumber 1
thread 0 elnumber 3
thread 3 elnumber 2
thread 2 elnumber 4
thread 3 elnumber 4
thread 1 elnumber 2
thread 0 elnumber 4
thread 2 elnumber 4
So, somehow the thread-local variable tEl gets overwritten across the threads? What am I doing wrong? Thank you!
It looks like Cython deliberately chooses to exclude any Python variables (including Cython cdef classes) from the list of thread-local variables.
I suspect this is deliberate to avoid reference counting issues - they'd need to drop the reference count of all the thread-local variables at the end of the loop (it wouldn't be an insurmountable problem, but might be a big change). Therefore I think it's unlikely to be fixed, but a documentation update might be helpful.
The solution is to refactor your loop body into a function, where every variable ends up effectively "local" to the function so that it isn't an issue:
cdef f(CythonElement tEl):
    cdef int tid
    with nogil:
        tid = threadid()
        with gil:
            print("thread {:} elnumber {:}".format(tid, tEl.elNumber))
        with gil:
            print("thread {:} elnumber {:}".format(tid, tEl.elNumber))
    # I've trimmed the function a bit for the sake of being testable

# then for the loop:
for n in prange(nElements, schedule='dynamic', nogil=True):
    with gil:
        f(elList[n])
Cython provides parallelism based on threads. The order in which threads are executed is not guaranteed, hence the out-of-order values for thread.
If you want tEl to be private to the thread, you should not define it globally. Try moving cdef CythonElement tEl inside the prange loop. See http://cython-devel.python.narkive.com/atEB3yrQ/openmp-thread-private-variable-not-recognized-bug-report-discussion (the part on private variables).

How do I parallelize GPars Actors?

My understanding of GPars Actors may be off so please correct me if I'm wrong. I have a Groovy app that polls a web service for jobs. When one or more jobs are found it sends each job to a DynamicDispatchActor I've created, and the job is handled. The jobs are completely self-contained and don't need to return anything to the main thread. When multiple jobs come in at once I'd like them to be processed in parallel, but no matter what configuration I try the actor processes them first in first out.
To give a code example:
def poolGroup = new DefaultPGroup(new DefaultPool(true, 5))

def actor = poolGroup.messageHandler {
    when {Integer msg ->
        println("I'm number ${msg} on thread ${Thread.currentThread().name}")
        Thread.sleep(1000)
    }
}

def integers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
integers.each {
    actor << it
}
This prints out:
I'm number 1 on thread Actor Thread 31
I'm number 2 on thread Actor Thread 31
I'm number 3 on thread Actor Thread 31
I'm number 4 on thread Actor Thread 31
I'm number 5 on thread Actor Thread 31
I'm number 6 on thread Actor Thread 31
I'm number 7 on thread Actor Thread 31
I'm number 8 on thread Actor Thread 31
I'm number 9 on thread Actor Thread 31
I'm number 10 on thread Actor Thread 31
With a slight pause in between each print out. Also notice that each printout happens from the same Actor/thread.
What I'd like to see here is the first 5 numbers are printed out instantly because the thread pool is set to 5, and then the next 5 numbers as those threads free up. Am I completely off base here?
To make it run as you expect, there are a few changes to make:
import groovyx.gpars.group.DefaultPGroup
import groovyx.gpars.scheduler.DefaultPool

def poolGroup = new DefaultPGroup(new DefaultPool(true, 5))

def closure = {
    when {Integer msg ->
        println("I'm number ${msg} on thread ${Thread.currentThread().name}")
        Thread.sleep(1000)
        stop()
    }
}

def integers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def actors = integers.collect { poolGroup.messageHandler(closure) << it }
actors*.join()
Full gist file: https://gist.github.com/wololock/7f1348e04f68710e42d2
Then the output will be:
I'm number 5 on thread Actor Thread 5
I'm number 4 on thread Actor Thread 4
I'm number 1 on thread Actor Thread 1
I'm number 3 on thread Actor Thread 3
I'm number 2 on thread Actor Thread 2
I'm number 6 on thread Actor Thread 3
I'm number 9 on thread Actor Thread 4
I'm number 7 on thread Actor Thread 2
I'm number 8 on thread Actor Thread 5
I'm number 10 on thread Actor Thread 1
Now let's take a look at what changed. First of all, in your previous example you worked with a single actor only. You defined poolGroup correctly, but then you created a single actor and shifted all the computation to this single instance. To run those computations in parallel you have to rely on poolGroup and only send input to a message handler - the pool group will handle actor creation and lifecycle management. This is what we do in:
def actors = integers.collect { poolGroup.messageHandler(closure) << it }
It creates a collection of actors, each started with a given input. The pool group takes care that the specified pool size is not exceeded. Then you have to join each actor, which can be done using Groovy's spread operator: actors*.join(). Thanks to that, the application will wait to terminate until all actors have stopped their computation. That's why we have to add the stop() call to the when closure of the message handler's body - without it, it won't terminate, because the pool group does not know that the actors did their job - they may be waiting e.g. for another message.
Alternative solution
We can also consider alternative solution that uses GPars parallelized iterations:
import groovyx.gpars.GParsPool

// This example is a dummy, but let's assume this processor is a
// stateless component shared between threads.
class Processor {
    void process(int number) {
        println "${Thread.currentThread().name} starting with number ${number}"
        Thread.sleep(1000)
    }
}

def integers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Processor processor = new Processor()

GParsPool.withPool 5, {
    integers.eachParallel { processor.process(it) }
}
In this example you have a stateless component Processor, and computations are parallelized using one instance of that stateless Processor with multiple input values.
I've tried to figure out the case you mentioned in the comment, but I'm not sure if a single actor can process multiple messages at a time. Statelessness of an actor means only that it does not change its internal state during the processing of a message and must not store any other information in actor scope. It would be great if someone could correct me if my reasoning is not correct :)
I hope this will help you. Best!

OpenMP: splitting loop based on NUMA

I am running the following loop using, say, 8 OpenMP threads:
float* data;
int n;

#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data, n)
for ( int i = 0; i < n; ++i )
{
    DO SOMETHING WITH data[i]
}
Due to NUMA, I'd like to run first half of the loop (i = 0, ..., n/2-1) with threads 0,1,2,3
and second half (i = n/2, ..., n-1) with threads 4,5,6,7.
Essentially, I want to run two loops in parallel, each loop using a separate group of OpenMP threads.
How do I achieve this with OpenMP?
Thank you
PS: Ideally, if threads from one group are done with their half of the loop and the other half is still not done, I'd like threads from the finished group to join the unfinished group in processing the other half of the loop.
I am thinking about something like below, but I wonder if I can do this with OpenMP and no extra book-keeping:
int n;
int i0 = 0;
int i1 = n / 2;

#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data,n,i0,i1)
for ( int i = 0; i < n; ++i )
{
    int nt = omp_get_thread_num();
    int j;

    #pragma omp critical
    {
        if ( nt < 4 ) {
            if ( i0 < n / 2 ) j = i0++; // First 4 threads process first half
            else              j = i1++; // of loop unless first half is finished
        }
        else {
            if ( i1 < n ) j = i1++; // Second 4 threads process second half
            else          j = i0++; // of loop unless second half is finished
        }
    }

    DO SOMETHING WITH data[j]
}
Probably the best approach is to use nested parallelization, first over NUMA nodes and then within each node; that way you can still use the dynamic-scheduling infrastructure while breaking the data up amongst thread groups:
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    const int ngroups=2;
    const int npergroup=4;
    const int ndata = 16;

    omp_set_nested(1);
    #pragma omp parallel for num_threads(ngroups)
    for (int i=0; i<ngroups; i++) {
        int start = (ndata*i+(ngroups-1))/ngroups;
        int end = (ndata*(i+1)+(ngroups-1))/ngroups;

        #pragma omp parallel for num_threads(npergroup) shared(i, start, end) schedule(dynamic,1)
        for (int j=start; j<end; j++) {
            printf("Thread %d from group %d working on data %d\n", omp_get_thread_num(), i, j);
        }
    }

    return 0;
}
Running this gives
$ gcc -fopenmp -o nested nested.c -Wall -O -std=c99
$ ./nested | sort -n -k 9
Thread 0 from group 0 working on data 0
Thread 3 from group 0 working on data 1
Thread 1 from group 0 working on data 2
Thread 2 from group 0 working on data 3
Thread 1 from group 0 working on data 4
Thread 3 from group 0 working on data 5
Thread 3 from group 0 working on data 6
Thread 0 from group 0 working on data 7
Thread 0 from group 1 working on data 8
Thread 3 from group 1 working on data 9
Thread 2 from group 1 working on data 10
Thread 1 from group 1 working on data 11
Thread 0 from group 1 working on data 12
Thread 0 from group 1 working on data 13
Thread 2 from group 1 working on data 14
Thread 0 from group 1 working on data 15
But note that the nested approach may well change the thread assignments over what the one-level threading would be, so you will probably have to play with KMP_AFFINITY or other mechanisms a bit more to get the bindings right again.
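If you want to check where the nested threads actually land, a rough sketch along the following lines can help; note this is an illustration with assumptions, not part of the answer: sched_getcpu() is Linux/glibc-specific, and the outer/inner thread counts are hard-coded to match the example above.
#define _GNU_SOURCE
#include <sched.h>   /* sched_getcpu(), Linux/glibc only */
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_nested(1);
    #pragma omp parallel num_threads(2)      /* one team per NUMA node */
    {
        int group = omp_get_thread_num();
        #pragma omp parallel num_threads(4)  /* workers within the node */
        {
            /* report which CPU each nested thread runs on, so the effect of
               OMP_PROC_BIND / KMP_AFFINITY settings is directly visible */
            printf("group %d, thread %d -> cpu %d\n",
                   group, omp_get_thread_num(), sched_getcpu());
        }
    }
    return 0;
}
With OpenMP 4.0 affinity controls, something like OMP_PLACES=cores together with OMP_PROC_BIND=spread,close should spread the two outer groups across the machine while keeping each group's inner threads close together.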

Perl Queue and Threads abnormal exit

I am quite new to Perl, especially Perl Threads.
I want to accomplish:
1. Have 5 threads that will enqueue data (random numbers) into a Thread::Queue.
2. Have 3 threads that will dequeue data from the Thread::Queue.

The complete code that I wrote in order to achieve the above:
#!/usr/bin/perl -w
use strict;
use threads;
use Thread::Queue;

my $queue = new Thread::Queue();
our @Enquing_threads;
our @Dequeuing_threads;

sub buildQueue
{
    my $TotalEntry=1000;
    while($TotalEntry-- >0)
    {
        my $query = rand(10000);
        $queue->enqueue($query);
        print "Enque thread with TID " .threads->tid . " got $query,";
        print "Queue Size: " . $queue->pending . "\n";
    }
}

sub process_Queue
{
    my $query;
    while ($query = $queue->dequeue)
    {
        print "Dequeu thread with TID " .threads->tid . " got $query\n";
    }
}

push @Enquing_threads,threads->create(\&buildQueue) for 1..5;
push @Dequeuing_threads,threads->create(\&process_Queue) for 1..3;
Issues that I am facing:
1. The threads are not running as concurrently as expected.
2. The entire program exits abnormally with the following console output:
Perl exited with active threads:
8 running and unjoined
0 finished and unjoined
0 running and detached
Enque thread with TID 5 got 6646.13585023883,Queue Size: 595
Enque thread with TID 1 got 3573.84104215917,Queue Size: 595
Any help on code-optimization is appreciated.
This behaviour is to be expected: When the main thread exits, all other threads exit as well. If you don't care, you can $thread->detach them. Otherwise, you have to manually $thread->join them, which we'll do.
The $thread->join waits for the thread to complete, and fetches the return value (threads can return values just like subroutines, although the context (list/void/scalar) has to be fixed at spawn time).
We will detach the threads that enqueue data:
threads->create(\&buildQueue)->detach for 1..5;
Now for the dequeueing threads, we put them into a lexical variable (why are you using globals?), so that we can join them later:
my @dequeue_threads = map threads->create(\&process_queue), 1 .. 3;
Then wait for them to complete:
$_->join for @dequeue_threads;
We know that the detached threads will finish execution before the program exits, because the only way for the dequeueing threads to exit is to exhaust the queue.
Except for one and a half bugs. You see, there is a difference between an empty queue and a finished queue. If the queue is just empty, the dequeueing threads will block on $queue->dequeue until they get some input. The traditional solution is to dequeue while the value they get is defined. We can break the loop by supplying as many undef values in the queue as there are threads reading from the queue. More modern versions of Thread::Queue have an end method that makes dequeue return undef for all subsequent calls.
The problem is when to end the queue. We should do this after all enqueueing threads have exited. Which means we should wait for them manually. Sigh.
my @enqueueing = map threads->create(\&enqueue), 1..5;
my @dequeueing = map threads->create(\&dequeue), 1..3;
$_->join for @enqueueing;
$queue->enqueue(undef) for 1..3;
$_->join for @dequeueing;
And in sub dequeuing: while(defined( my $item = $queue->dequeue )) { ... }.
Using the defined test fixes another bug: rand can return zero, although this is quite unlikely and will slip through most tests. The contract of rand is that it returns a pseudo-random floating point number greater than or equal to zero and less than some upper bound: a number from the interval [0, x). The bound defaults to 1.
If you don't want to join the enqueueing threads manually, you could use a semaphore to signal completion. A semaphore is a multithreading primitive that can be incremented and decremented, but not below zero. If a decrement operation would let the count drop below zero, the call blocks until another thread raises the count again. If the start count is 1, this can be used as a flag to guard resources.
We can also start with a negative value of 1 - $NUM_THREADS and have each thread increment the value, so that only once all threads have exited can it be decremented again.
use threads; # make a habit of importing `threads` as the first thing
use strict; use warnings;
use feature 'say';

use Thread::Queue;
use Thread::Semaphore;

use constant {
    NUM_ENQUEUE_THREADS => 5, # it's good to fix the thread counts early
    NUM_DEQUEUE_THREADS => 3,
};

sub enqueue {
    my ($out_queue, $finished_semaphore) = @_;
    my $tid = threads->tid;

    # iterate over ranges instead of using the while($maxval --> 0) idiom
    for (1 .. 1000) {
        $out_queue->enqueue(my $val = rand 10_000);
        say "Thread $tid enqueued $val";
    }

    $finished_semaphore->up;
    # try a non-blocking decrement. Returns true only for the last thread exiting.
    if ($finished_semaphore->down_nb) {
        $out_queue->end; # for sufficiently modern versions of Thread::Queue
        # $out_queue->enqueue(undef) for 1 .. NUM_DEQUEUE_THREADS;
    }
}

sub dequeue {
    my ($in_queue) = @_;
    my $tid = threads->tid;

    while(defined( my $item = $in_queue->dequeue )) {
        say "thread $tid dequeued $item";
    }
}

# create the queue and the semaphore
my $queue = Thread::Queue->new;
my $enqueuers_ended_semaphore = Thread::Semaphore->new(1 - NUM_ENQUEUE_THREADS);

# kick off the enqueueing threads -- they handle themself
threads->create(\&enqueue, $queue, $enqueuers_ended_semaphore)->detach for 1..NUM_ENQUEUE_THREADS;

# start and join the dequeuing threads
my @dequeuers = map threads->create(\&dequeue, $queue), 1 .. NUM_DEQUEUE_THREADS;
$_->join for @dequeuers;
Don't be surprised if the threads do not seem to run in parallel, but sequentially: this task (enqueuing a random number) is very fast and is not well suited for multithreading (enqueueing is more expensive than creating a random number).
Here is a sample run where each enqueuer only creates two values:
Thread 1 enqueued 6.39390993005694
Thread 1 enqueued 0.337993319585337
Thread 2 enqueued 4.34504733960242
Thread 2 enqueued 2.89158054485114
Thread 3 enqueued 9.4947585773571
Thread 3 enqueued 3.17079715055542
Thread 4 enqueued 8.86408863197179
Thread 5 enqueued 5.13654995317669
Thread 5 enqueued 4.2210886147538
Thread 4 enqueued 6.94064174636395
thread 6 dequeued 6.39390993005694
thread 6 dequeued 0.337993319585337
thread 6 dequeued 4.34504733960242
thread 6 dequeued 2.89158054485114
thread 6 dequeued 9.4947585773571
thread 6 dequeued 3.17079715055542
thread 6 dequeued 8.86408863197179
thread 6 dequeued 5.13654995317669
thread 6 dequeued 4.2210886147538
thread 6 dequeued 6.94064174636395
You can see that thread 5 managed to enqueue a few things before thread 4. Threads 7 and 8 don't get to dequeue anything; thread 6 is too fast. Also, all enqueuers are finished before the dequeuers are spawned (for such a small number of inputs).
