Worker pool pattern - deadlock - multithreading

The wait-for-task pattern is the base pattern for the pooling pattern. The gobyexample code looks wrong to me, because it uses buffered channels.
Consider the code below:
package main

import (
	"fmt"
	"runtime"
)

// pooling: You are a manager and you hire a team of employees. None of the new
// employees know what they are expected to do and wait for you to provide work.
// When work is provided to the group, any given employee can take it and you
// don't care who it is. The amount of time you wait for any given employee to
// take your work is unknown because you need a guarantee that the work you're
// sending is received by an employee.
func pooling() {
	jobCh := make(chan int)    // signalling data on channel with guarantee - unbuffered
	resultCh := make(chan int) // signalling data on channel with guarantee - unbuffered

	workers := runtime.NumCPU() // 4 workers
	for worker := 0; worker < workers; worker++ {
		go func(emp int) {
			var p int
			for p = range jobCh {
				fmt.Printf("employee %d : recv'd signal : %d\n", emp, p) // do the work
			}
			fmt.Printf("employee %d : recv'd shutdown signal\n", emp) // worker is signaled with closed state channel
			resultCh <- p * 2
		}(worker)
	}

	const jobs = 6
	for jobNum := 1; jobNum <= jobs; jobNum++ {
		jobCh <- jobNum
		fmt.Println("manager : sent signal :", jobNum)
	}

	close(jobCh)
	fmt.Println("manager : sent shutdown signal")

	for a := 1; a <= jobs; a++ { //cannot range on 'resultCh'
		fmt.Println("Result received: ", <-resultCh)
	}
}

func main() {
	pooling()
}

The manager (pooling()) is not receiving all six results from the 4 workers (employees), as shown below:
$ uname -a
Linux user 4.15.0-99-generic #100-Ubuntu SMP Wed Apr 22 20:32:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ go version
go version go1.14.1 linux/amd64
$ go install
$ bin/cs61a
manager : sent signal : 1
manager : sent signal : 2
manager : sent signal : 3
employee 3 : recv'd signal : 3
employee 3 : recv'd signal : 4
manager : sent signal : 4
manager : sent signal : 5
employee 3 : recv'd signal : 5
employee 3 : recv'd signal : 6
manager : sent signal : 6
manager : sent shutdown signal
employee 3 : recv'd shutdown signal
employee 2 : recv'd signal : 2
Result received: 12
employee 0 : recv'd signal : 1
employee 0 : recv'd shutdown signal
employee 2 : recv'd shutdown signal
Result received: 2
Result received: 4
employee 1 : recv'd shutdown signal
Result received: 0
fatal error: all goroutines are asleep - deadlock!
goroutine 1 [chan receive]:
/home/../src/ +0x25f
/home/../src/ +0x20
$ bin/cs61a
manager : sent signal : 1
employee 0 : recv'd signal : 1
manager : sent signal : 2
manager : sent signal : 3
manager : sent signal : 4
employee 3 : recv'd signal : 2
manager : sent signal : 5
manager : sent signal : 6
employee 2 : recv'd signal : 4
employee 2 : recv'd shutdown signal
employee 0 : recv'd signal : 5
manager : sent shutdown signal
Result received: 8
employee 0 : recv'd shutdown signal
Result received: 10
employee 1 : recv'd signal : 3
employee 1 : recv'd shutdown signal
Result received: 6
employee 3 : recv'd signal : 6
employee 3 : recv'd shutdown signal
Result received: 12
fatal error: all goroutines are asleep - deadlock!
goroutine 1 [chan receive]:
/home/user/../ +0x25f
/home/user/../ +0x20
As per @Mark's comments, moving resultCh <- p * 2 into the loop gives the deadlock below, which makes sense, because all goroutines are blocked. Would making resultCh a buffered channel resolve this problem? But a buffered channel does not signal data with a guarantee.
$ go install
$ bin/cs61a
manager : sent signal : 1
manager : sent signal : 2
manager : sent signal : 3
manager : sent signal : 4
employee 1 : recv'd signal : 2
employee 2 : recv'd signal : 3
employee 0 : recv'd signal : 1
employee 3 : recv'd signal : 4
fatal error: all goroutines are asleep - deadlock!
goroutine 1 [chan send]:
/home/user/../myhub/cs61a/Main.go:33 +0xfb
/home/user/../myhub/cs61a/Main.go:46 +0x20
goroutine 6 [chan send]:
main.pooling.func1(0xc00001e0c0, 0xc00001e120, 0x0)
/home/user/../myhub/cs61a/Main.go:24 +0x136
created by main.pooling
/home/user/../myhub/cs61a/Main.go:20 +0xb7
goroutine 7 [chan send]:
main.pooling.func1(0xc00001e0c0, 0xc00001e120, 0x1)
/home/user/../myhub/cs61a/Main.go:24 +0x136
created by main.pooling
/home/user/../myhub/cs61a/Main.go:20 +0xb7
goroutine 8 [chan send]:
main.pooling.func1(0xc00001e0c0, 0xc00001e120, 0x2)
/home/user/../myhub/cs61a/Main.go:24 +0x136
created by main.pooling
/home/user/../myhub/cs61a/Main.go:20 +0xb7
goroutine 9 [chan send]:
main.pooling.func1(0xc00001e0c0, 0xc00001e120, 0x3)
/home/user/../myhub/cs61a/Main.go:24 +0x136
created by main.pooling
/home/user/../myhub/cs61a/Main.go:20 +0xb7
Why is pooling() not able to receive results from all workers?
The manager is receiving only 4 results out of 6. One of the results received is zero (Result received: 0), although the data sent on resultCh is always supposed to be non-zero. Why does resultCh receive a zero value? It looks like resultCh is closed.
Note: The correct working of resultCh is not part of the responsibility of the worker pool pattern. The worker pool pattern only ensures that the work is submitted to an employee successfully using jobCh.

Why is pooling() not able to receive results from all workers?
The loop within the goroutine(s) (for p = range jobCh) will process all requests. However, the code that sends to resultCh is outside the loop, so it is executed only once (after the loop has finished) within each goroutine.
This is as per @Mark's comment; your response about scope is correct but irrelevant. The for loop iterates through the items on the channel; when the channel is closed the loop ends, and p contains the value processed on the last iteration (if any), which is then sent to resultCh.
This means that resultCh is sent one value per goroutine (four values in your case, based upon the comment in your code). If you want to publish a value to resultCh for every value received on jobCh, then you need to move the send into the loop (playground):
var p int
for p = range jobCh {
	fmt.Printf("employee %d : recv'd signal : %d\n", emp, p) // do the work
	resultCh <- p * 2
}
fmt.Printf("employee %d : recv'd shutdown signal\n", emp)
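With both channels unbuffered, moving the send into the loop is not enough on its own when the manager sends every job before it starts receiving: the workers block on resultCh while the manager blocks on jobCh, which is the deadlock reported in the question. The following is a minimal sketch of one way around that, not the only one. Jobs are sent from a separate goroutine, and resultCh is closed once all workers have returned so the manager can range over it. The function name runPool and the use of sync.WaitGroup are assumptions made for illustration; they are not part of the original code.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// runPool is a hypothetical helper: it starts `workers` goroutines, feeds them
// the job numbers 1..jobs, and returns every doubled result in sorted order.
func runPool(workers, jobs int) []int {
	jobCh := make(chan int)    // unbuffered, as in the question
	resultCh := make(chan int) // unbuffered, as in the question

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobCh {
				resultCh <- p * 2 // one result per job, sent inside the loop
			}
		}()
	}

	// Send the jobs from a separate goroutine so the caller is free to
	// receive results while jobs are still being handed out.
	go func() {
		for j := 1; j <= jobs; j++ {
			jobCh <- j
		}
		close(jobCh)
	}()

	// Close resultCh once every worker has returned, so the range below ends.
	go func() {
		wg.Wait()
		close(resultCh)
	}()

	var results []int
	for r := range resultCh {
		results = append(results, r)
	}
	sort.Ints(results) // arrival order is not deterministic
	return results
}

func main() {
	fmt.Println(runPool(4, 6)) // prints [2 4 6 8 10 12]
}
```

Both channels stay unbuffered here, so every send is still a guaranteed hand-off. A buffered resultCh with capacity equal to the number of jobs would also avoid the deadlock, at the cost of that guarantee.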
The manager is receiving only 4 results out of 6. One of the results received is zero (Result received: 0), although the data sent on resultCh is always supposed to be non-zero. Why does resultCh receive a zero value? It looks like resultCh is closed.
You cannot predict how many jobs each goroutine will process (and the logs show that this differed between your two runs). From your first log we can tell which goroutine processed which jobs:
Employee 0: 1
Employee 1:
Employee 2: 2
Employee 3: 3, 4, 5, 6
You will note that Employee 1 did not process any jobs. This means that this employee's loop, for p = range jobCh, terminated without ever assigning anything to p, and thus resultCh <- p * 2 sent 0 (the zero value for an int) to resultCh (as per the comment from @Shudipta Sharma).
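The zero result can be reproduced in isolation. The sketch below uses a hypothetical helper, lastSeen, that mirrors the worker's loop: it ranges over a channel into a variable declared outside the loop and returns whatever that variable holds afterwards.

```go
package main

import "fmt"

// lastSeen ranges over ch and returns the last value received, or the zero
// value of int if the channel was closed before anything arrived.
func lastSeen(ch <-chan int) int {
	var p int
	for p = range ch {
		// nothing to do; p keeps the most recent value
	}
	return p
}

func main() {
	// A worker that never received a job: the channel is closed while empty.
	empty := make(chan int)
	close(empty)
	fmt.Println(lastSeen(empty) * 2) // prints 0

	// A worker whose last job was 6.
	one := make(chan int, 1)
	one <- 6
	close(one)
	fmt.Println(lastSeen(one) * 2) // prints 12
}
```

So the 0 does not come from resultCh being closed; it is simply p's initial value, doubled.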


How can I keep the same receiving process from running twice in a row in a Promela model?

I am a beginner with Spin. I am trying to make the model run the two receiving processes (the proctype called Consumer in the model) alternately, i.e. consumer 1, consumer 2, consumer 1, consumer 2, and so on. But when I run this code, the output from the 2 consumer processes appears in random order. Can someone help me?
This is the code I am struggling with.
mtype = {P, C};
mtype turn = P;
chan ch1 = [1] of {bit};
byte current_consumer = 1;
byte previous_consumer;

active [2] proctype Producer()
{
    bit a = 0;
    do
    :: atomic {
        turn == P ->
            ch1 ! a;
            printf("The producer %d --> sent %d!\n", _pid, a);
            a = 1 - a;
            turn = C;
       }
    od
}

active [2] proctype Consumer()
{
    bit b;
    do
    :: atomic{
        turn == C ->
            current_consumer = _pid;
            ch1 ? b;
            printf("The consumer %d --> received %d!\n\n", _pid, b);
            assert(current_consumer == _pid);
            turn = P;
       }
    od
}
The sample output is shown in the attached photo.
First of all, let me draw your attention to this excerpt of atomic's documentation:
If any statement within the atomic sequence blocks, atomicity is lost, and other processes are then allowed to start executing statements. When the blocked statement becomes executable again, the execution of the atomic sequence can be resumed at any time, but not necessarily immediately. Before the process can resume the atomic execution of the remainder of the sequence, the process must first compete with all other active processes in the system to regain control, that is, it must first be scheduled for execution.
In your model, this is currently not causing any problem because ch1 is a buffered channel (i.e. it has size >= 1). However, any small change in the model could break this invariant.
From the comments, I understand that your goal is to alternate consumers, but you don't really care which producer is sending the data.
To be honest, your model already contains two examples of how processes can alternate with one another:
The Producer/Consumers alternate with one another via turn, which is assigned a different value each time.
The Producer/Consumers also alternate with one another via ch1, since it has size 1.
However, both approaches alternate Producers with Consumers rather than the Consumers among themselves.
One approach I like is message filtering with eval (see docs): each Consumer knows its own id, waits for a token with its own id in a separate channel, and only when that is available it starts doing some work.
byte current_consumer;
chan prod2cons = [1] of { bit };
chan cons = [1] of { byte };

proctype Producer(byte id; byte total)
{
    bit a = 0;
    do
    :: true ->
        // atomic is only for printing purposes
        atomic {
            prod2cons ! a;
            printf("The producer %d --> sent %d\n", id, a);
        }
        a = 1 - a;
    od
}

proctype Consumer(byte id; byte total)
{
    bit b;
    do
    :: cons?eval(id) ->
        current_consumer = id;
        atomic {
            prod2cons ? b;
            printf("The consumer %d --> received %d\n\n", id, b);
        }
        assert(current_consumer == id);
        // yield turn to the next Consumer
        cons ! ((id + 1) % total)
    od
}

init {
    run Producer(0, 2);
    run Producer(1, 2);
    run Consumer(0, 2);
    run Consumer(1, 2);
    // First consumer is 0
    cons ! 0;
}
This model, briefly:
Producers/Consumers alternate via prod2cons, a channel of size 1. This enforces the following behavior: after a producer has created a message, some consumer must consume it before another message can be sent.
Consumers alternate via cons, a channel of size 1 containing a token value indicating which consumer is currently allowed to perform some work. All consumers peek on the contents of cons, but only the one with a matching id is allowed to consume the token and move on. At the end of its turn, the consumer creates a new token with the next id in the chain. Consumers alternate in a round robin fashion.
The output is:
The producer 0 --> sent 0
The consumer 1 --> received 0
The producer 1 --> sent 1
The consumer 0 --> received 1
The producer 1 --> sent 0
The consumer 1 --> received 0
The producer 0 --> sent 0
The consumer 1 --> received 0
The producer 0 --> sent 1
The consumer 0 --> received 1
The producer 0 --> sent 0
The consumer 1 --> received 0
The producer 0 --> sent 1
The consumer 0 --> received 1
Notice that producers do not necessarily alternate with one another, whereas consumers do -- as requested.

SPIN program using channels - verification gives "missing pars in receive" error though simulation works fine

I have a program that uses channels for inter-process messaging. It is driving me nuts.
When I run my program by typing:
spin ipc-verify.pml
It works fine (shown by the prints in my program) and exits gracefully as designed.
However, when I try to verify by doing the following:
spin -a ipc-verify.pml
gcc -DVECTORSZ=4096 -DVERBOSE -o pan pan.c
It fails in the first statement in the server where the server is trying to read on the channel, with the error:
pan:1: missing pars in receive (at depth 20)
It seems like I am missing something very simple, but can't put my finger on it. I am new to Spin, doing it as part of my coursework, so please pardon if it is a simple, silly question.
Here is a brief description of the program:
The program starts 3 processes: 1 server and 2 clients. A client sends a number to the server, which responds with the square of the number. There is a request channel on which every client sends its requests (each message carries the client id, which tells the server which client to respond to), and a response channel on which the server sends its responses to the clients. Clients use random receive on the channel to find the message for their id.
The line of code where I believe it fails is this:
:: ch_clientrequest ? msgtype, client_id, client_request ->
I actually have a bigger program that exhibits this behavior, so I tried to reproduce it in this program. I read through various ways of getting more data from Spin about this error, and also googled around. I also tried changing the message structure (more fields, fewer fields), using a regular receive instead of a random receive, and so on. Nothing seems to change this error!
Here is the full error trace from running ./pan:
pan:1: missing pars in receive (at depth 20)
pan: wrote ipc-verify.pml.trail
(Spin Version 6.5.1 -- 20 December 2019)
Warning: Search not completed
+ Partial Order Reduction
+ FullStack Matching
Full statespace search for:
never claim - (none specified)
assertion violations +
acceptance cycles - (not selected)
invalid end states +
State-vector 2104 byte, depth reached 20, errors: 1
21 states, stored
0 states, matched
0 matches within stack
21 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
stackframes: 0/0
stats: fa 0, fh 0, zh 0, zn 0 - check 0 holds 0
stack stats: puts 0, probes 0, zaps 0
Stats on memory usage (in Megabytes):
0.043 equivalent memory usage for states (stored*(State-vector + overhead))
1.164 actual memory usage for states
128.000 memory used for hash table (-w24)
0.534 memory used for DFS stack (-m10000)
129.315 total actual memory usage
I have tried to find out what this run-time message from the verifier means, but couldn't find much. Based on various experiments with the code, it seems that the verifier thinks the message I am trying to receive has more parameters than I am reading. I tried to see whether it is reacting to the actual message received, which might have fewer fields, but that doesn't seem to be the case.
I have been banging my head on this for a full day, with no leads. Any pointers or ideas to solve this would be much appreciated.
I am running this on my Linux box with Spin 6.5.
One hub controller (server), 8 clients.
Each client sends a message to the hub, hub responds with the message it received.
#define N 2 // Number of clients
#define MQLENGTH 100

// Message types used below (declaration order assumed)
mtype = {START_CLIENT, COMPUTE_REQUEST, COMPUTE_RESPONSE, STOP_CLIENT, STOP_HUB}

typedef ClientRequest {
    byte num;
}

typedef HubResponse {
    bool isNull; // To indicate whether there is data or not. Set True for START and STOP messages
    int id;
    byte num;
    int sqnum;
}

typedef IdList {
    byte ids[N]; // Use to store the ids assigned to each client process
}

IdList idlist;

chan ch_clientrequest = [MQLENGTH] of {mtype, byte, ClientRequest} // Hub listens to this
chan ch_hubresponse = [MQLENGTH] of {mtype, byte, HubResponse} // Clients read from this
int message_served = 0

proctype Client(byte id) {
    // A client reads the message and responds to it
    mtype msgtype
    HubResponse hub_response
    ClientRequest client_request

    do
    :: ch_hubresponse ?? msgtype, eval(id), hub_response ->
        printf("\nClient Id: %d, Received - MsgType: %e", id, msgtype)
        if
        :: (msgtype == COMPUTE_RESPONSE) ->
            // print the message
            printf("\nClient Id: %d, Received - num = %d, sqnum = %d", id, hub_response.num, hub_response.sqnum)
            // send another message. new num = sqnum
            client_request.num = hub_response.sqnum % 256 // To keep it as byte
            if
            :: (client_request.num < 2) ->
                client_request.num = 2
            :: else ->
                skip
            fi
            ch_clientrequest ! COMPUTE_REQUEST(id, client_request)
            printf("\nClient Id: %d, Sent - num = %d", id, client_request.num)
        :: (msgtype == STOP_CLIENT) ->
            // break from the do loop
            break
        :: (msgtype == START_CLIENT) ->
            client_request.num = id // Start with num = id
            ch_clientrequest ! COMPUTE_REQUEST(id, client_request)
            printf("\nClient Id: %d, Sent - num = %d", id, client_request.num)
        fi
    od
    printf("\nClient exiting. Id = %d", id)
}

proctype Hub() {
    // Hub sends a start message to each client, and then keeps responding to what it receives
    HubResponse hr
    ClientRequest client_request
    mtype msgtype
    byte client_id
    int i
    byte num

    for (i: 0 .. ( N - 1) ) {
        // Send a start message
        hr.isNull = true
        ch_hubresponse ! START_CLIENT(idlist.ids[i], hr) // Send a start message
    }

    // All of the clients have been started. Now wait for the message and respond appropriately
    do
    :: ch_clientrequest ? msgtype, client_id, client_request ->
        printf("\nHub Controller. Received - MsgType: %e", msgtype)
        if
        :: (msgtype == COMPUTE_REQUEST) ->
            // handle the message
            num = client_request.num
            hr.isNull = false
            hr.id = client_id
            hr.num = num
            hr.sqnum = num * num
            ch_hubresponse ! COMPUTE_RESPONSE(client_id, hr) // Send a response message
            message_served ++
        :: (msgtype == STOP_HUB) ->
            // break from the do loop, send stop message to all clients, and exit
            break
        fi
    od

    // loop through the ids and send stop message
    for (i: 0 .. ( N - 1) ) {
        // Send a stop message
        hr.isNull = true
        ch_hubresponse ! STOP_CLIENT(idlist.ids[i], hr) // Send a stop message
    }
    printf("\nServer exiting.")
}

active proctype Main() {
    // Start the clients and give them an id to use
    ClientRequest c
    pid n;
    n = _nr_pr;
    byte i
    for (i: 1.. N ) {
        run Client(i)
        idlist.ids[i-1] = i
    }
    // Start the hub and give it the list of ids
    run Hub()
    // Send a message to Hub to stop serving
    (message_served >= 100);
    ch_clientrequest ! STOP_HUB(0, c)
    // Wait for all processes to exit
    (n == _nr_pr);
    printf("\nAll processes have exited!")
}

How do I return a worker back to the worker pool in Go

I am implementing a worker pool which takes jobs from a channel. After it kept timing out, I realised that when a panic occurs within a worker function, the worker does not return to the pool, even though I have implemented a recovery mechanism.
In the Go Playground, I was able to replicate the issue:
Worker Pool Reference
Modified code for the playground:
package main

import "fmt"
import "time"
import "log"

func recovery(id int, results chan<- int) {
	if r := recover(); r != nil {
		log.Print("IN RECOVERY FUNC - Failed worker: ", id)
		results <- 0
	}
}

func worker(id int, jobs <-chan int, results chan<- int) {
	for j := range jobs {
		defer recovery(id, results)
		if id == 1 {
			panic("simulated failure") // test panic for worker 1
		}
		fmt.Println("worker", id, "started job", j)
		time.Sleep(time.Second)
		fmt.Println("worker", id, "finished job", j)
		results <- j * 2
	}
}

func main() {
	jobs := make(chan int, 100)
	results := make(chan int, 100)
	for w := 1; w <= 3; w++ {
		go worker(w, jobs, results)
	}
	for j := 1; j <= 10; j++ {
		jobs <- j
	}
	close(jobs)
	for a := 1; a <= 10; a++ {
		<-results
	}
}
For testing, I have implemented a panic when worker 1 is used. When run, the func panics as expected, and goes into recovery as expected (does not push a value into the channel either), however worker 1 never seems to come back.
Output without panic:
worker 3 started job 1
worker 1 started job 2
worker 2 started job 3
worker 1 finished job 2
worker 1 started job 4
worker 3 finished job 1
worker 3 started job 5
worker 2 finished job 3
worker 2 started job 6
worker 3 finished job 5
worker 3 started job 7
worker 1 finished job 4
worker 1 started job 8
worker 2 finished job 6
worker 2 started job 9
worker 1 finished job 8
worker 1 started job 10
worker 3 finished job 7
worker 2 finished job 9
worker 1 finished job 10
Output with panic:
worker 3 started job 1
2009/11/10 23:00:00 RECOVERY Failed worker: 1
worker 2 started job 3
worker 2 finished job 3
worker 2 started job 4
worker 3 finished job 1
worker 3 started job 5
worker 3 finished job 5
worker 3 started job 6
worker 2 finished job 4
worker 2 started job 7
worker 2 finished job 7
worker 2 started job 8
worker 3 finished job 6
worker 3 started job 9
worker 3 finished job 9
worker 3 started job 10
worker 2 finished job 8
worker 3 finished job 10
How do I return worker 1 to the pool after recovery (or during the recovery process)?
If you care about the errors, you could pass an errors channel into the worker functions; when a worker encounters an error, it sends it down the channel and then continues. The main loop could process those errors.
Or, if you don't care about the error, simply use continue to skip that job.
The continue statement stops processing the current iteration of the loop and moves on to the next.
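If the job itself can panic (rather than return an error), it helps to remember that a deferred call runs only when the enclosing function returns. In the question, the panic unwinds worker(), recovery() stops it, and worker() then returns, so that goroutine is gone. A minimal sketch of one way to combine recovery with the errors-channel idea is shown below: each job runs in its own function call, the recover is scoped to that call, and the worker loop continues. The names process and runJobs, and the j%3 failure condition, are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// process handles a single job. The deferred recover is scoped to this call,
// so a panic ends only this job, not the worker's loop.
func process(id, j int) (result int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("worker %d: job %d: %v", id, j, r)
		}
	}()
	if j%3 == 0 { // hypothetical failure condition, for illustration only
		panic("simulated failure")
	}
	return j * 2, nil
}

func worker(id int, jobs <-chan int, results chan<- int, errs chan<- error, wg *sync.WaitGroup) {
	defer wg.Done()
	for j := range jobs {
		res, err := process(id, j)
		if err != nil {
			errs <- err
			continue // skip this job and take the next one
		}
		results <- res
	}
}

// runJobs is a hypothetical driver: it runs jobs 1..n on the given number of
// workers and returns the sorted results plus the number of failed jobs.
func runJobs(workers, n int) (results []int, failed int) {
	jobs := make(chan int, n)
	resCh := make(chan int, n)
	errCh := make(chan error, n)
	var wg sync.WaitGroup
	for w := 1; w <= workers; w++ {
		wg.Add(1)
		go worker(w, jobs, resCh, errCh, &wg)
	}
	for j := 1; j <= n; j++ {
		jobs <- j
	}
	close(jobs)
	wg.Wait()
	close(resCh)
	close(errCh)
	for r := range resCh {
		results = append(results, r)
	}
	for range errCh {
		failed++
	}
	sort.Ints(results)
	return results, failed
}

func main() {
	results, failed := runJobs(3, 10)
	fmt.Println(results, failed) // prints [2 4 8 10 14 16 20] 3
}
```

All three workers stay in the pool for the whole run; only the individual jobs 3, 6 and 9 are reported as failed.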

Linux - Using mutex to synchronise serial port

I'm writing a C program for Linux.
The program can start a timer: both the main program and the timer can send and receive characters on a serial port.
My attempt is to serialize access to the serial port with a mutex in a global structure, initialized when the port is opened with:
if (pthread_mutex_init( &pED->lockSerial, NULL) != 0)
lwsl_err("lockSerial init failed\n");
I protected all the functions that send data on the port as follow:
ssize_t cmdFirmwareVersion(EngineData *pED)
{
    if (pED->fdSerialPort==-1)
        return -1;
    unsigned char cmd[] = { 0x00, 0x00, 0x7F };
    write( pED->fdSerialPort, cmd, sizeof(cmd));
    int rx = read ( pED->fdSerialPort, rxbuffer, sizeof rxbuffer);
    dump( rxbuffer, rx);
    return rx;
}
#define LOCK_SERIAL if (0!=pthread_mutex_lock(&pED->lockSerial)) {printf("Err lock");return 0;}
#define UNLOCK_SERIAL pthread_mutex_unlock(&pED->lockSerial);
When I run the program and start the timer, I see that the requests are regular. When I trigger one of these calls another way (from a websocket rx function), the program hangs and I need to kill it.
Why does the entire program stop?
If a process hangs, it could be because of a circular wait on mutexes, or because a thread holds a mutex and tries to lock it again. Either can cause a deadlock.
The ps output shows a thread's state as D or S while it is waiting for a resource, so the process appears to be hung.
D uninterruptible sleep (usually IO)
S interruptible sleep (waiting for an event to complete)
To demonstrate, I made a thread that holds a mutex and tries to lock it again.
The ps output and GDB show that the main thread and the child thread are both sleeping.
xxxx@virtualBox:~$ ps -eflT |grep a.out
0 S root 3982 3982 2265 0 80 0 - 22155 - 20:28 pts/0 00:00:00 ./a.out
1 S root 3982 3984 2265 0 80 0 - 22155 - 20:28 pts/0 00:00:00 ./a.out
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ffff7fdf740 (LWP 4625) "a.out" 0x00007ffff7bbed2d in __GI___pthread_timedjoin_ex (
threadid=140737345505024, thread_return=0x0, abstime=0x0, block= <optimized out>) at pthread_join_common.c:89
2 Thread 0x7ffff77c4700 (LWP 4629) "a.out" __lll_lock_wait ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135

OpenMP: Divide all the threads into different groups

I would like to divide all the threads into 2 different groups, since I have two parallel tasks to run asynchronously. For example, if 8 threads are available in total, I would like 6 threads dedicated to task1 and the other 2 dedicated to task2.
How can I achieve this with OpenMP?
This is a job for OpenMP nested parallelism, as of OpenMP 3: you can use OpenMP tasks to start two independent tasks and then within those tasks, have parallel sections which use the appropriate number of threads.
As a quick example:
#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
    omp_set_nested(1);   /* make sure nested parallelism is on */
    int nprocs = omp_get_num_procs();
    int nthreads1 = nprocs/3;
    int nthreads2 = nprocs - nthreads1;

    #pragma omp parallel default(none) shared(nthreads1, nthreads2) num_threads(2)
    #pragma omp single
    {
        #pragma omp task
        #pragma omp parallel for num_threads(nthreads1)
        for (int i=0; i<16; i++)
            printf("Task 1: thread %d of the %d children of %d: handling iter %d\n",
                   omp_get_thread_num(), omp_get_team_size(2),
                   omp_get_ancestor_thread_num(1), i);

        #pragma omp task
        #pragma omp parallel for num_threads(nthreads2)
        for (int j=0; j<16; j++)
            printf("Task 2: thread %d of the %d children of %d: handling iter %d\n",
                   omp_get_thread_num(), omp_get_team_size(2),
                   omp_get_ancestor_thread_num(1), j);
    }
    return 0;
}
Running this on an 8 core (16 hardware threads) node,
$ gcc -fopenmp nested.c -o nested -std=c99
$ ./nested
Task 2: thread 3 of the 11 children of 0: handling iter 6
Task 2: thread 3 of the 11 children of 0: handling iter 7
Task 2: thread 1 of the 11 children of 0: handling iter 2
Task 2: thread 1 of the 11 children of 0: handling iter 3
Task 1: thread 2 of the 5 children of 1: handling iter 8
Task 1: thread 2 of the 5 children of 1: handling iter 9
Task 1: thread 2 of the 5 children of 1: handling iter 10
Task 1: thread 2 of the 5 children of 1: handling iter 11
Task 2: thread 6 of the 11 children of 0: handling iter 12
Task 2: thread 6 of the 11 children of 0: handling iter 13
Task 1: thread 0 of the 5 children of 1: handling iter 0
Task 1: thread 0 of the 5 children of 1: handling iter 1
Task 1: thread 0 of the 5 children of 1: handling iter 2
Task 1: thread 0 of the 5 children of 1: handling iter 3
Task 2: thread 5 of the 11 children of 0: handling iter 10
Task 2: thread 5 of the 11 children of 0: handling iter 11
Task 2: thread 0 of the 11 children of 0: handling iter 0
Task 2: thread 0 of the 11 children of 0: handling iter 1
Task 2: thread 2 of the 11 children of 0: handling iter 4
Task 2: thread 2 of the 11 children of 0: handling iter 5
Task 1: thread 1 of the 5 children of 1: handling iter 4
Task 2: thread 4 of the 11 children of 0: handling iter 8
Task 2: thread 4 of the 11 children of 0: handling iter 9
Task 1: thread 3 of the 5 children of 1: handling iter 12
Task 1: thread 3 of the 5 children of 1: handling iter 13
Task 1: thread 3 of the 5 children of 1: handling iter 14
Task 2: thread 7 of the 11 children of 0: handling iter 14
Task 2: thread 7 of the 11 children of 0: handling iter 15
Task 1: thread 1 of the 5 children of 1: handling iter 5
Task 1: thread 1 of the 5 children of 1: handling iter 6
Task 1: thread 1 of the 5 children of 1: handling iter 7
Task 1: thread 3 of the 5 children of 1: handling iter 15
Updated: I've changed the above to include the thread ancestor; there was some confusion because there were (for instance) two "thread 1"s printed. Here I've also printed the ancestor (e.g., "thread 1 of the 5 children of 1" vs "thread 1 of the 11 children of 0").
From the OpenMP standard, S.3.2.4, “The omp_get_thread_num routine returns the thread number, within the current team, of the calling thread.”, and from section 2.5, “When a thread encounters a parallel construct, a team of threads is created to execute the parallel region [...] The thread that encountered the parallel construct becomes the master thread of the new team, with a thread number of zero for the duration of the new parallel region.”
That is, within each of those (nested) parallel regions, teams of threads are created which have thread ids starting at zero; but just because those ids overlap within the team doesn't mean they're the same threads. Here I've emphasized that by printing their ancestor number as well, but if the threads were doing CPU-intensive work you'd also see with monitoring tools that there were indeed 16 active threads, not just 11.
The reason why they are team-local thread numbers and not globally-unique thread numbers is pretty straightforward; it would be almost impossible to keep track of globally-unique thread numbers in an environment where nested and dynamic parallelism can happen. Say there are three teams of threads, numbered [0..5], [6..10], and [11..15], and the middle team completes. Do we leave gaps in the thread numbering? Do we interrupt all threads to change their global numbers? What if a new team is started, with 7 threads? Do we start them at 6 and have overlapping thread ids, or do we start them at 16 and leave gaps in the numbering?
