Cython parallel prange - thread locality? - multithreading

I am iterating using prange over a list like this:
from cython.parallel import prange, threadid
cdef int tid
cdef CythonElement tEl
cdef int a, b, c
# elList: python list of CythonElement instances is passed via function call
for n in prange(nElements, schedule='dynamic', nogil=True):
    with gil:
        tEl = elList[n]
        tid = threadid()
        a = tEl.a
        b = tEl.b
        c = tEl.c
        print("thread {:} elnumber {:}".format(tid, tEl.elNumber))
    # nothing is done here
    with gil:
        print("thread {:} elnumber {:}".format(tid, tEl.elNumber))
    # some other computations based on a, b and c here ...
I expect an output like this:
thread 0 elnumber 1
thread 1 elnumber 2
thread 2 elnumber 3
thread 3 elnumber 4
thread 0 elnumber 1
thread 1 elnumber 2
thread 2 elnumber 3
thread 3 elnumber 4
But I get:
thread 1 elnumber 1
thread 0 elnumber 3
thread 3 elnumber 2
thread 2 elnumber 4
thread 3 elnumber 4
thread 1 elnumber 2
thread 0 elnumber 4
thread 2 elnumber 4
So somehow the thread-local variable tEl is being overwritten across threads? What am I doing wrong? Thank you!

It looks like Cython deliberately chooses to exclude any Python variables (including Cython cdef classes) from the list of thread-local variables.
I suspect this is deliberate to avoid reference counting issues - they'd need to drop the reference count of all the thread-local variables at the end of the loop (it wouldn't be an insurmountable problem, but might be a big change). Therefore I think it's unlikely to be fixed, but a documentation update might be helpful.
The solution is to refactor your loop body into a function, where every variable ends up effectively "local" to the function, so that it isn't an issue:
cdef f(CythonElement tEl):
    cdef int tid
    with nogil:
        tid = threadid()
        with gil:
            print("thread {:} elnumber {:}".format(tid, tEl.elNumber))
        with gil:
            print("thread {:} elnumber {:}".format(tid, tEl.elNumber))
    # I've trimmed the function a bit for the sake of being testable
# then for the loop:
for n in prange(nElements, schedule='dynamic', nogil=True):
    with gil:
        f(elList[n])
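The effect of moving the loop body into a function can be illustrated outside Cython as well. The following is only a rough analogy using plain Python threads, not Cython code: the names loop_body and run are made up for this sketch, and a threading.Barrier stands in for the "nothing is done here" gap, so that every thread has assigned its variable before any thread reads it again. Because el is local to each call, no thread can overwrite another thread's value.

```python
import threading


def loop_body(n, elements, barrier, seen):
    el = elements[n]        # local to this call, hence private to the calling thread
    before = el
    barrier.wait()          # every thread has assigned its own `el` by now
    seen[n] = (before, el)  # with a local variable these two always agree


def run(elements):
    # One thread per element, mirroring one loop iteration per thread.
    barrier = threading.Barrier(len(elements))
    seen = [None] * len(elements)
    threads = [
        threading.Thread(target=loop_body, args=(n, elements, barrier, seen))
        for n in range(len(elements))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print(run([1, 2, 3, 4]))
```

If el were instead a single module-level variable assigned by every thread, the value read after the barrier could belong to a different thread, which is the symptom described in the question.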

Cython provides parallelism based on threads. The order in which threads are executed is not guaranteed, hence the disordered values for thread.
If you want tEl to be private to the thread, you should not define it globally. Try moving cdef CythonElement tEl within the prange. See the part on private variables in the Cython documentation.


Code takes much more time to finish with more than 1 thread

I want to benchmark some Fortran code with OpenMP-threads with a critical-section. To simulate a realistic environment I tried to generate some load before this critical-section.
!Compiler call: gfortran -fopenmp -o minExample.x minExample.f90
PROGRAM minExample
  USE omp_lib
  INTEGER :: n_chars, real_alloced
  INTEGER :: nx,ny,nz,ix,iy,iz, idx
  INTEGER :: nthreads, lasteinstellung,i
  INTEGER, PARAMETER :: dp = kind(1.0d0)
  REAL (KIND = dp) :: j
  CHARACTER(LEN=32) :: arg
  nx = 2
  ny = 2
  nz = 2
  lasteinstellung= 10000
  CALL getarg(1, arg)
  READ(arg,*) nthreads
  !$omp parallel
  !$omp master
  !$omp end master
  !$omp end parallel
  WRITE(*,*) "Running OpenMP benchmark on ",nthreads," thread(s)"
  n_chars = 0
  idx = 0
  !$omp parallel do default(none) collapse(3) &
  !$omp shared(nx,ny,nz,n_chars) &
  !$omp private(ix,iy,iz, idx) &
  !$omp private(lasteinstellung,j) !&
  DO iz=-nz,nz
    DO iy=-ny,ny
      DO ix=-nx,nx
        ! WRITE(*,*) ix,iy,iz
        j = 0.0d0
        DO i=1,lasteinstellung
          j = j + real(i)
        END DO
        !$omp critical
        n_chars = n_chars + 1
        idx = n_chars
        !$omp end critical
      END DO
    END DO
  END DO
  !$omp end parallel do
END PROGRAM minExample
I compiled this code with gfortran -fopenmp -o test.x test.f90 and executed it with time ./test.x THREAD
Executing this code gives some strange behaviour depending on the thread-count (set with OMP_SET_NUM_THREADS): compared with one thread (6ms) the execution with more threads costs a lot more time (2 threads: 16000ms, 4 threads: 9000ms) on my multicore machine.
What could cause this behaviour? Is there a better (but still easy) way to generate load without running in some cache-effects or related things?
Edit: strange behaviour: if I keep the write in the nested loops, execution speeds up dramatically with 2 threads. If it's commented out, execution with 2 or 3 threads takes forever (the write shows the loop variables incrementing very slowly), but not with 1 or 4 threads. I also tried this code on another multicore machine. There it takes forever with 1 and 3 threads, but not with 2 or 4 threads.
If the code you are showing is really complete, you are missing a definition of lasteinstellung inside the parallel region, in which it is private. Its value there is undefined, so the loop
DO i=1,lasteinstellung
  j = j + real(i)
END DO
can take a completely arbitrary number of iterations.
If the value is defined somewhere earlier, in code you do not show, you probably want firstprivate instead of private.

OpenMP: Divide all the threads into different groups

I would like to divide all the threads into 2 different groups, since I have two parallel tasks to run asynchronously. For example, if 8 threads are available in total, I would like 6 threads dedicated to task1 and the other 2 dedicated to task2.
How can I achieve this with OpenMP?
This is a job for OpenMP nested parallelism, as of OpenMP 3: you can use OpenMP tasks to start two independent tasks and then within those tasks, have parallel sections which use the appropriate number of threads.
As a quick example:
#include <stdio.h>
#include <omp.h>
int main(int argc, char **argv) {
    omp_set_nested(1); /* make sure nested parallelism is on */
    int nprocs = omp_get_num_procs();
    int nthreads1 = nprocs/3;
    int nthreads2 = nprocs - nthreads1;
    #pragma omp parallel default(none) shared(nthreads1, nthreads2) num_threads(2)
    #pragma omp single
    {
        #pragma omp task
        #pragma omp parallel for num_threads(nthreads1)
        for (int i=0; i<16; i++)
            printf("Task 1: thread %d of the %d children of %d: handling iter %d\n",
                   omp_get_thread_num(), omp_get_team_size(2),
                   omp_get_ancestor_thread_num(1), i);
        #pragma omp task
        #pragma omp parallel for num_threads(nthreads2)
        for (int j=0; j<16; j++)
            printf("Task 2: thread %d of the %d children of %d: handling iter %d\n",
                   omp_get_thread_num(), omp_get_team_size(2),
                   omp_get_ancestor_thread_num(1), j);
    }
    return 0;
}
Running this on an 8 core (16 hardware threads) node,
$ gcc -fopenmp nested.c -o nested -std=c99
$ ./nested
Task 2: thread 3 of the 11 children of 0: handling iter 6
Task 2: thread 3 of the 11 children of 0: handling iter 7
Task 2: thread 1 of the 11 children of 0: handling iter 2
Task 2: thread 1 of the 11 children of 0: handling iter 3
Task 1: thread 2 of the 5 children of 1: handling iter 8
Task 1: thread 2 of the 5 children of 1: handling iter 9
Task 1: thread 2 of the 5 children of 1: handling iter 10
Task 1: thread 2 of the 5 children of 1: handling iter 11
Task 2: thread 6 of the 11 children of 0: handling iter 12
Task 2: thread 6 of the 11 children of 0: handling iter 13
Task 1: thread 0 of the 5 children of 1: handling iter 0
Task 1: thread 0 of the 5 children of 1: handling iter 1
Task 1: thread 0 of the 5 children of 1: handling iter 2
Task 1: thread 0 of the 5 children of 1: handling iter 3
Task 2: thread 5 of the 11 children of 0: handling iter 10
Task 2: thread 5 of the 11 children of 0: handling iter 11
Task 2: thread 0 of the 11 children of 0: handling iter 0
Task 2: thread 0 of the 11 children of 0: handling iter 1
Task 2: thread 2 of the 11 children of 0: handling iter 4
Task 2: thread 2 of the 11 children of 0: handling iter 5
Task 1: thread 1 of the 5 children of 1: handling iter 4
Task 2: thread 4 of the 11 children of 0: handling iter 8
Task 2: thread 4 of the 11 children of 0: handling iter 9
Task 1: thread 3 of the 5 children of 1: handling iter 12
Task 1: thread 3 of the 5 children of 1: handling iter 13
Task 1: thread 3 of the 5 children of 1: handling iter 14
Task 2: thread 7 of the 11 children of 0: handling iter 14
Task 2: thread 7 of the 11 children of 0: handling iter 15
Task 1: thread 1 of the 5 children of 1: handling iter 5
Task 1: thread 1 of the 5 children of 1: handling iter 6
Task 1: thread 1 of the 5 children of 1: handling iter 7
Task 1: thread 3 of the 5 children of 1: handling iter 15
Updated: I've changed the above to include the thread ancestor; there was some confusion because there were (for instance) two "thread 1"s printed. Here I've also printed the ancestor (e.g., "thread 1 of the 5 children of 1" vs "thread 1 of the 11 children of 0").
From the OpenMP standard, section 3.2.4: “The omp_get_thread_num routine returns the thread number, within the current team, of the calling thread.” And from section 2.5: “When a thread encounters a parallel construct, a team of threads is created to execute the parallel region [...] The thread that encountered the parallel construct becomes the master thread of the new team, with a thread number of zero for the duration of the new parallel region.”
That is, within each of those (nested) parallel regions, teams of threads are created which have thread ids starting at zero; but just because those ids overlap within the team doesn't mean they're the same threads. Here I've emphasized that by printing their ancestor number as well, but if the threads were doing CPU-intensive work you'd also see with monitoring tools that there were indeed 16 active threads, not just 11.
The reason why they are team-local thread numbers and not globally unique thread numbers is pretty straightforward: it would be almost impossible to keep track of globally unique thread numbers in an environment where nested and dynamic parallelism can happen. Say there are three teams of threads, numbered [0..5], [6..10], and [11..15], and the middle team completes. Do we leave gaps in the thread numbering? Do we interrupt all threads to change their global numbers? What if a new team is started, with 7 threads? Do we start them at 6 and have overlapping thread ids, or do we start them at 16 and leave gaps in the numbering?

OPENMP running the same job on threads

In my OpenMP code, I want all threads to do the same job and then take the average at the end (basically to calculate the error). (How do I calculate the error? Each thread generates different random numbers, so the result from each thread is different.)
Here is simple code
program ...
do i=1,Nstep
!.... some code goes here
end do
sum = result(from thread 0)+result(from thread 1)+...
sum = sum/(number of threads)
Put simply, I need every OpenMP thread to run the whole do loop, without the loop being split into blocks among the threads.
I can do what I want using MPI and MPI_reduce, but I want to write a hybrid code OPENMP + MPI. I haven't figured out the OPENMP part, so suggestions please?
It is as simple as applying sum reduction over result:
USE omp_lib ! for omp_get_num_threads()
INTEGER :: num_threads
result = 0.0
num_threads = 1
!$OMP PARALLEL REDUCTION(+:result)
!$OMP SINGLE
num_threads = omp_get_num_threads()
!$OMP END SINGLE
do i = 1, Nstep
  result = ...
end do
!$OMP END PARALLEL
result = result / num_threads
Here num_threads is a shared INTEGER variable that is assigned the actual number of threads used to execute the parallel region. The assignment is put in a SINGLE construct, since it suffices for one thread, no matter which one, to execute the assignment.
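The data flow of this reduction (every thread runs the same loop with its own random numbers, and the per-thread results are summed and divided by the thread count) can be sketched in Python as well. This is an illustration only, not a translation of the OpenMP code: simulate and run_replicas are hypothetical names, the loop body is a placeholder that averages uniform random numbers, and in standard CPython builds the threads do not execute Python bytecode in parallel, so the sketch shows the structure rather than any speed-up.

```python
import random
import threading


def simulate(seed, nstep):
    # Placeholder for the real loop body: each replica uses its own generator.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(nstep):
        total += rng.random()
    return total / nstep


def run_replicas(num_threads=4, nstep=1000):
    partial = [0.0] * num_threads  # one slot per thread, so no locking is needed

    def work(tid):
        partial[tid] = simulate(seed=tid, nstep=nstep)

    threads = [threading.Thread(target=work, args=(tid,)) for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The "reduction": sum the per-thread results and divide by the thread count.
    return sum(partial) / num_threads


if __name__ == "__main__":
    print(run_replicas())
```

Giving each thread its own slot in partial plays the role of the private copies that the REDUCTION clause creates; the final sum corresponds to combining those copies at the end of the parallel region.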

Perl Queue and Threads abnormal exit

I am quite new to Perl, especially Perl Threads.
I want to accomplish:
Have 5 threads that will enqueue data (random numbers) into a queue
Have 3 threads that will dequeue data from the queue
The complete code that I wrote in order to achieve above mission:
#!/usr/bin/perl -w
use strict;
use threads;
use Thread::Queue;
my $queue = new Thread::Queue();
our @Enquing_threads;
our @Dequeuing_threads;
sub buildQueue
{
    my $TotalEntry=1000;
    while($TotalEntry-- >0)
    {
        my $query = rand(10000);
        $queue->enqueue($query);
        print "Enque thread with TID " .threads->tid . " got $query,";
        print "Queue Size: " . $queue->pending . "\n";
    }
}
sub process_Queue
{
    my $query;
    while ($query = $queue->dequeue)
    {
        print "Dequeu thread with TID " .threads->tid . " got $query\n";
    }
}
push @Enquing_threads,threads->create(\&buildQueue) for 1..5;
push @Dequeuing_threads,threads->create(\&process_Queue) for 1..3;
Issues that I am Facing:
The threads are not running as concurrently as expected.
The entire program abnormally exit with following console output:
Perl exited with active threads:
8 running and unjoined
0 finished and unjoined
0 running and detached
Enque thread with TID 5 got 6646.13585023883,Queue Size: 595
Enque thread with TID 1 got 3573.84104215917,Queue Size: 595
Any help on code-optimization is appreciated.
This behaviour is to be expected: When the main thread exits, all other threads exit as well. If you don't care, you can $thread->detach them. Otherwise, you have to manually $thread->join them, which we'll do.
The $thread->join waits for the thread to complete, and fetches the return value (threads can return values just like subroutines, although the context (list/void/scalar) has to be fixed at spawn time).
We will detach the threads that enqueue data:
threads->create(\&buildQueue)->detach for 1..5;
Now for the dequeueing threads, we put them into a lexical variable (why are you using globals?), so that we can join them later:
my @dequeue_threads = map threads->create(\&process_Queue), 1 .. 3;
Then wait for them to complete:
$_->join for @dequeue_threads;
We know that the detached threads will finish execution before the program exits, because the only way for the dequeueing threads to exit is to exhaust the queue.
Except for one and a half bugs. You see, there is a difference between an empty queue and a finished queue. If the queue is just empty, the dequeueing threads will block on $queue->dequeue until they get some input. The traditional solution is to dequeue while the value they get is defined. We can break the loop by supplying as many undef values in the queue as there are threads reading from the queue. More modern versions of Thread::Queue have an end method that makes dequeue return undef for all subsequent calls.
The problem is when to end the queue. We should do this after all enqueueing threads have exited. Which means we should wait for them manually. Sigh.
my @enqueueing = map threads->create(\&enqueue), 1..5;
my @dequeueing = map threads->create(\&dequeue), 1..3;
$_->join for @enqueueing;
$queue->enqueue(undef) for 1..3;
$_->join for @dequeueing;
And in sub dequeuing: while(defined( my $item = $queue->dequeue )) { ... }.
Using the defined test fixes another bug: rand can return zero, although this is quite unlikely and will slip through most tests. The contract of rand is that it returns a pseudo-random floating point number from zero (inclusive) up to some upper bound (exclusive): a number from the interval [0, x). The bound defaults to 1.
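The same shutdown pattern (join the producers, then push one sentinel per consumer) is not specific to Perl. As a minimal sketch in Python, with None playing the role of undef, it might look like this; the names producer, consumer, run and SENTINEL are illustrative and not part of the original code:

```python
import queue
import random
import threading

NUM_PRODUCERS = 5
NUM_CONSUMERS = 3
SENTINEL = None  # plays the role of undef in the Perl version


def producer(q, count):
    for _ in range(count):
        q.put(random.uniform(0, 10_000))


def consumer(q, results, lock):
    while True:
        item = q.get()          # blocks while the queue is merely empty
        if item is SENTINEL:    # identity test, so a legitimate 0.0 does not end the loop
            break
        with lock:
            results.append(item)


def run(count_per_producer=100):
    q = queue.Queue()
    results = []
    lock = threading.Lock()
    producers = [threading.Thread(target=producer, args=(q, count_per_producer))
                 for _ in range(NUM_PRODUCERS)]
    consumers = [threading.Thread(target=consumer, args=(q, results, lock))
                 for _ in range(NUM_CONSUMERS)]
    for t in producers + consumers:
        t.start()
    for t in producers:         # wait until nothing more will be enqueued
        t.join()
    for _ in consumers:         # one sentinel per consumer marks the queue as finished
        q.put(SENTINEL)
    for t in consumers:
        t.join()
    return results


if __name__ == "__main__":
    print(len(run(10)))
```

Testing `item is SENTINEL` instead of the truthiness of item mirrors the defined test: a loop written as `while item:` would stop early on a value of zero, just like the original Perl loop.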
If you don't want to join the enqueueing threads manually, you could use a semaphore to signal completion. A semaphore is a multithreading primitive that can be incremented and decremented, but not below zero. If a decrement operation would let the count drop below zero, the call blocks until another thread raises the count. If the start count is 1, this can be used as a flag to block resources.
We can also start with a negative value 1 - $NUM_THREADS, and have each thread increment the value, so that only when all threads have exited, it can be decremented again.
use threads; # make a habit of importing `threads` as the first thing
use strict; use warnings;
use feature 'say';
use Thread::Queue;
use Thread::Semaphore;
use constant {
  NUM_ENQUEUE_THREADS => 5, # it's good to fix the thread counts early
  NUM_DEQUEUE_THREADS => 3,
};
sub enqueue {
  my ($out_queue, $finished_semaphore) = @_;
  my $tid = threads->tid;
  # iterate over ranges instead of using the while($maxval --> 0) idiom
  for (1 .. 1000) {
    $out_queue->enqueue(my $val = rand 10_000);
    say "Thread $tid enqueued $val";
  }
  # each exiting thread raises the count by one
  $finished_semaphore->up;
  # try a non-blocking decrement. Returns true only for the last thread exiting.
  if ($finished_semaphore->down_nb) {
    $out_queue->end; # for sufficiently modern versions of Thread::Queue
    # $out_queue->enqueue(undef) for 1 .. NUM_DEQUEUE_THREADS;
  }
}
sub dequeue {
  my ($in_queue) = @_;
  my $tid = threads->tid;
  while(defined( my $item = $in_queue->dequeue )) {
    say "thread $tid dequeued $item";
  }
}
# create the queue and the semaphore
my $queue = Thread::Queue->new;
my $enqueuers_ended_semaphore = Thread::Semaphore->new(1 - NUM_ENQUEUE_THREADS);
# kick off the enqueueing threads -- they handle themselves
threads->create(\&enqueue, $queue, $enqueuers_ended_semaphore)->detach for 1..NUM_ENQUEUE_THREADS;
# start and join the dequeuing threads
my @dequeuers = map threads->create(\&dequeue, $queue), 1 .. NUM_DEQUEUE_THREADS;
$_->join for @dequeuers;
Don't be surprised if the threads do not seem to run in parallel, but sequentially: this task (enqueuing a random number) is very fast, and is not well suited for multithreading (enqueueing is more expensive than creating a random number).
Here is a sample run where each enqueuer only creates two values:
Thread 1 enqueued 6.39390993005694
Thread 1 enqueued 0.337993319585337
Thread 2 enqueued 4.34504733960242
Thread 2 enqueued 2.89158054485114
Thread 3 enqueued 9.4947585773571
Thread 3 enqueued 3.17079715055542
Thread 4 enqueued 8.86408863197179
Thread 5 enqueued 5.13654995317669
Thread 5 enqueued 4.2210886147538
Thread 4 enqueued 6.94064174636395
thread 6 dequeued 6.39390993005694
thread 6 dequeued 0.337993319585337
thread 6 dequeued 4.34504733960242
thread 6 dequeued 2.89158054485114
thread 6 dequeued 9.4947585773571
thread 6 dequeued 3.17079715055542
thread 6 dequeued 8.86408863197179
thread 6 dequeued 5.13654995317669
thread 6 dequeued 4.2210886147538
thread 6 dequeued 6.94064174636395
You can see that 5 managed to enqueue a few things before 4. The threads 7 and 8 don't get to dequeue anything, 6 is too fast. Also, all enqueuers are finished before the dequeuers are spawned (for such a small number of inputs).

Seeking help with a MT design pattern

I have a queue of 1000 work items and an n-proc machine (assume n =
4). The main thread spawns n (=4) worker threads at a time (250 outer
iterations) and waits for all threads to complete before processing
the next n (=4) items, until the entire queue is processed:
for(i= 0 to queue.Length / numprocs)
for(j= 0 to numprocs)
The work done by each (worker) thread is not homogeneous. Therefore, in
one batch (of n), if thread 1 spends 1000 s doing work and the other 3
threads only 1 s, the above design is inefficient, because after 1 s the
other 3 processors are idling. Besides, there is no pooling: 1000
distinct threads are being created.
How do I use the NT thread pool (I am not familiar enough with it, hence
the long-winded question) and QueueUserWorkItem to achieve the above? The
following constraints should hold:
The main thread requires that all work items are processed before
it can proceed. So I would think that a waitall-like construct as above
is required.
I want to create as many threads as processors (i.e. not 1000 threads
at a time).
Also, I don't want to create 1000 distinct events, pass them to the worker
threads, and wait on all events using the QueueUserWorkItem API or
Existing code is in C++. I prefer C++ because I don't know C#.
I suspect that the above is a very common pattern and was looking for
input from you folks.
I'm not a C++ programmer, so I'll give you some half-way pseudo code for it:
tcount = 0
maxproc = 4
while queue_item = queue.get_next() # depends on implementation of queue
                                    # may well be:
                                    # for i=0; i<queue.length; i++
    while tcount == maxproc
        wait 0.1 seconds # or some other interval that isn't as cpu intensive
                         # as continuously running the loop
    tcount += 1 # must be atomic (reading the value and writing the new
                # one must happen consecutively without interruption from
                # other threads). Note that a plain ++tcount on an int is
                # not atomic in C++; use std::atomic<int> or
                # InterlockedIncrement instead.
    new thread(worker, queue_item)
function worker(item)
    # stuff with item here...
    tcount -= 1 # must be atomic
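For comparison, here is a sketch of the same goal using a fixed-size thread pool instead of the polling counter above. This is a different technique from the pseudo code: the pool keeps at most max_workers threads alive, each idle thread picks up the next item as soon as it finishes, and leaving the with block waits for all items. It is written in Python rather than C++, and worker and process_all are hypothetical names with a placeholder workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def worker(item):
    # Stand-in for non-homogeneous work: some items take longer than others.
    time.sleep(0.001 * (item % 3))
    return item * item


def process_all(items, max_workers=4):
    # At most `max_workers` threads exist at any time, and they are reused.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands out items to idle workers and yields results in input order;
        # exiting the `with` block waits for every submitted item to finish.
        return list(pool.map(worker, items))


if __name__ == "__main__":
    print(process_all(range(20)))
```

Because no batch boundary exists, a slow item occupies only one thread while the others keep draining the queue, and the main thread needs a single wait rather than one event per work item.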
