That's a bit of a glib title, but in playing around with Promises I wanted to see how far I could stretch the idea. In this program, I make it so I can specify how many promises I want to make.
The default value in the thread scheduler is 16 threads (rakudo/ThreadPoolScheduler.pm)
If I specify more than that number, the program hangs but I don't get a warning (say, like "Too many threads").
If I set RAKUDO_MAX_THREADS, I can stop the program hanging but eventually there is too much thread competition to run.
I have two questions, really.
How would a program know how many more threads it can make? That's slightly more than the number of promises, for what that's worth.
How would I know how many threads I should allow, even if I can make more?
This is Rakudo 2017.01 on my puny MacBook Air with 4 cores:
my $threads = @*ARGS[0] // %*ENV<RAKUDO_MAX_THREADS> // 1;
put "There are $threads threads";

my $channel = Channel.new;

# start some promises
my @promises;
for 1 .. $threads {
    @promises.push: start {
        react {
            whenever $channel -> $i {
                say "Thread {$*THREAD.id} got $i";
            }
        }
    }
}
put "Done making threads";

for ^100 { $channel.send( $_ ) }
put "Done sending";

$channel.close;
await |@promises;
put "Done!";
This isn't actually about Promise per se, but rather about the thread pool scheduler. A Promise itself is just a synchronization construct. The start construct actually does two things:
Ensures a fresh $_, $/, and $! inside of the block
Calls Promise.start with that block
And Promise.start also does two things:
Creates and returns a Promise
Schedules the code in the block to be run on the thread pool, and arranges that successful completion keeps the Promise and an exception breaks the Promise.
It's not only possible, but also relatively common, to have Promise objects that aren't backed by code on the thread pool. Promise.in, Promise.anyof and Promise.allof factories don't immediately schedule anything, and there are all kinds of uses of a Promise that involve doing Promise.new and then calling keep or break later on. So I can easily create and await on 1000 Promises:
my @p = Promise.new xx 1000;
start { sleep 1; .keep for @p };
await @p;
say 'done'; # completes, no trouble
Similarly, a Promise is not the only thing that can schedule code on the ThreadPoolScheduler. The many things that return Supply (like intervals, file watching, asynchronous sockets, asynchronous processes) all schedule their callbacks there too. It's possible to throw code there fire-and-forget style by doing $*SCHEDULER.cue: { ... } (though often you care about the result, or any errors, so it's not especially common).
The current Perl 6 thread pool scheduler has a configurable but enforced upper limit, which defaults to 16 threads. If you create a situation where all 16 are occupied but unable to make progress, and the only thing that can make progress is stuck in the work queue, then deadlock will occur. This is nothing unique to the Perl 6 thread pool; any bounded pool will be vulnerable to this (and any unbounded pool will be vulnerable to using up all resources and getting the process killed :-)).
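This failure mode isn't specific to Perl 6, so here is a deliberately tiny C++ sketch of it (all names are mine, not from the post): a fixed pool of four workers, each stuck waiting on a future that can only be kept by a job sitting behind them in the queue. It hangs by design, which is exactly the point.

#include <chrono>
#include <condition_variable>
#include <functional>
#include <future>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A deliberately tiny fixed-size pool; "cue" mimics $*SCHEDULER.cue.
class FixedPool {
    std::queue<std::function<void()>> jobs;
    std::mutex m;
    std::condition_variable cv;
public:
    explicit FixedPool(int n) {
        for (int i = 0; i < n; ++i)
            std::thread([this] {
                for (;;) {
                    std::function<void()> job;
                    {
                        std::unique_lock<std::mutex> lock(m);
                        cv.wait(lock, [this] { return !jobs.empty(); });
                        job = std::move(jobs.front());
                        jobs.pop();
                    }
                    job();
                }
            }).detach();
    }
    void cue(std::function<void()> job) {
        { std::lock_guard<std::mutex> lock(m); jobs.push(std::move(job)); }
        cv.notify_one();
    }
};

int main() {
    const int pool_size = 4;                  // stands in for max_threads = 16
    FixedPool pool(pool_size);

    std::promise<void> gate;
    std::shared_future<void> opened = gate.get_future().share();

    // Occupy every worker with a job that blocks until the gate opens...
    for (int i = 0; i < pool_size; ++i)
        pool.cue([opened] { opened.wait(); });

    // ...then queue the one job that would open the gate. It can never be
    // dequeued, because all workers are stuck above it: a pool deadlock.
    pool.cue([&gate] { gate.set_value(); });

    std::this_thread::sleep_for(std::chrono::seconds(2));
    std::cout << "still deadlocked: the unblocking job is stuck in the queue\n";
    return 0;   // detached workers die with the process; demo only
}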
As mentioned in another post, Perl 6.d will make await and react non-blocking constructs; this has always been the plan, but there were insufficient development resources to realize it in time for Perl 6.c. The use v6.d.PREVIEW pragma provides early access to this feature. (Also, fair warning, it's a work in progress.) The upshot of this is that an await or react on a thread owned by the thread pool will pause the execution of the scheduled code (for those curious, by taking a continuation) and allow the thread to get on with further work. The resumption of the code will be scheduled when the awaited thing completes, or the react block gets done. Note that this means you can be on a different OS thread before and after the await or react in 6.d. (Most Perl 6 users will not need to care about this. It's mostly relevant for those writing bindings to C libraries, or doing overly systems-y stuff. And a good C library binding will make it so users of the binding don't have to care.)
The upcoming 6.d change doesn't eliminate the possibility of exhausting the thread pool, but it will mean that a bunch of ways you can do so in 6.c will no longer be of concern (of note, writing recursive divide-and-conquer things that await the results of the divided parts, or having thousands of active react blocks launched with start react { ... }).
Looking forward, the thread pool scheduler itself will also become smarter. What follows is speculation, though given I'll likely be the one implementing the changes it's probably the best speculation on offer. :-) The thread pool will start following the progress being made, and use it to dynamically tune the pool size. This will include noticing that no progress is being made and, combined with the observation that the work queues contain items, adding threads to try and resolve the deadlock - at the cost of memory overhead of added threads. Today the thread pool conservatively tends to spawn up to its maximum size anyway, even if this is not a particularly optimal choice; most likely some kind of hill-climbing algorithm will be used to try and settle on an optimal number instead. Once that happens, the default max_threads can be raised substantially, so that more programs will - at the cost of a bunch of memory overhead - be able to complete, but most will run with just a handful of threads.
Quick fix: add use v6.d.PREVIEW; on the first line.
This fixes a number of thread exhaustion issues.
I made a few other changes, like using $*SCHEDULER.max_threads for the default and adding the Promise “id” to the output, so that it is easy to see that the Thread id doesn't necessarily correlate with a given Promise.
#! /usr/bin/env perl6
use v6.d.PREVIEW; # <--

my $threads = @*ARGS[0] // $*SCHEDULER.max_threads;
put "There are $threads threads";

my $channel = Channel.new;

# start some promises
my @promises;
for 1 .. $threads {
    @promises.push: start {
        react {
            whenever $channel -> $i {
                say "Thread $*THREAD.id() ($_) got $i";
            }
        }
    }
}
put "Done making threads";

for ^100 { $channel.send( $_ ) }
put "Done sending";

$channel.close;
await @promises;
put "Done!";
There are 16 threads
Done making threads
Thread 4 (14) got 0
Thread 4 (14) got 1
Thread 8 (8) got 3
Thread 10 (6) got 4
Thread 6 (1) got 5
Thread 16 (5) got 2
Thread 3 (16) got 7
Thread 7 (8) got 8
Thread 7 (9) got 9
Thread 5 (3) got 6
Thread 3 (6) got 10
Thread 11 (2) got 11
Thread 14 (5) got 12
Thread 4 (16) got 13
Thread 16 (15) got 14 # <<
Thread 13 (11) got 15
Thread 4 (15) got 16 # <<
Thread 4 (15) got 17 # <<
Thread 4 (15) got 18 # <<
Thread 11 (15) got 19 # <<
Thread 13 (15) got 20 # <<
Thread 3 (15) got 21 # <<
Thread 9 (13) got 22
Thread 18 (15) got 23 # <<
Thread 18 (15) got 24 # <<
Thread 8 (13) got 25
Thread 7 (15) got 26 # <<
Thread 3 (15) got 27 # <<
Thread 7 (15) got 28 # <<
Thread 8 (15) got 29 # <<
Thread 13 (13) got 30
Thread 14 (13) got 31
Thread 8 (13) got 32
Thread 6 (13) got 33
Thread 9 (15) got 34 # <<
Thread 13 (15) got 35 # <<
Thread 9 (15) got 36 # <<
Thread 16 (15) got 37 # <<
Thread 3 (15) got 38 # <<
Thread 18 (13) got 39
Thread 3 (15) got 40 # <<
Thread 7 (14) got 41
Thread 12 (15) got 42 # <<
Thread 15 (15) got 43 # <<
Thread 4 (1) got 44
Thread 11 (1) got 45
Thread 7 (15) got 46 # <<
Thread 8 (15) got 47 # <<
Thread 7 (15) got 48 # <<
Thread 17 (15) got 49 # <<
Thread 10 (10) got 50
Thread 10 (15) got 51 # <<
Thread 11 (14) got 52
Thread 6 (8) got 53
Thread 5 (13) got 54
Thread 11 (15) got 55 # <<
Thread 11 (13) got 56
Thread 3 (13) got 57
Thread 7 (13) got 58
Thread 16 (16) got 59
Thread 5 (15) got 60 # <<
Thread 5 (15) got 61 # <<
Thread 6 (15) got 62 # <<
Thread 5 (15) got 63 # <<
Thread 5 (15) got 64 # <<
Thread 17 (11) got 65
Thread 15 (15) got 66 # <<
Thread 17 (15) got 67 # <<
Thread 11 (13) got 68
Thread 10 (15) got 69 # <<
Thread 3 (15) got 70 # <<
Thread 11 (15) got 71 # <<
Thread 6 (15) got 72 # <<
Thread 16 (13) got 73
Thread 6 (13) got 74
Thread 17 (15) got 75 # <<
Thread 4 (13) got 76
Thread 8 (13) got 77
Thread 12 (15) got 78 # <<
Thread 6 (11) got 79
Thread 3 (15) got 80 # <<
Thread 11 (13) got 81
Thread 7 (13) got 82
Thread 4 (15) got 83 # <<
Thread 7 (15) got 84 # <<
Thread 7 (15) got 85 # <<
Thread 10 (15) got 86 # <<
Thread 7 (15) got 87 # <<
Thread 12 (13) got 88
Thread 3 (13) got 89
Thread 18 (13) got 90
Thread 6 (13) got 91
Thread 18 (13) got 92
Thread 15 (15) got 93 # <<
Thread 16 (15) got 94 # <<
Thread 12 (15) got 95 # <<
Thread 17 (15) got 96 # <<
Thread 11 (13) got 97
Thread 15 (16) got 98
Thread 18 (7) got 99
Done sending
Done!
I wrote an FFI for critical sections, and I wrote a test for it in Haxe.
Tests run in order defined (public functions are tests)
This test test_critical_section will intermittently hang and fail:
 1  var criticalSection:CriticalSection;
 2
 3  #if master
 4  public function test_init_critical_section() {
 5      return assert(attempt({
 6          criticalSection = synch.SynchLib.critical_section_init(SPIN_COUNT);
 7          trace('criticalSection: $criticalSection');
 8      }));
 9  }
10  var criticalValue = 0;
11  var done = 0;
12  var numThreads = 50;
13  function work_in_critical_section(ID:Int, a:AssertionBuffer) {
14      sys.thread.Thread.create(() -> {
15          inline function threadMsg(msg:String)
16              trace('Thread ID $ID: $msg');
17
18
19          threadMsg("Attempting to enter critical section");
20          criticalSection.critical_section_enter();
21          threadMsg("Entering critical section. Doing work.");
22          Sys.sleep(Std.random(100)/500); // simulate work in section
23          criticalValue += 10;
24          done++;
25          a.assert(criticalValue == done * 10);
26          threadMsg("Leaving critical section. Work done. done: " + done);
27          criticalSection.critical_section_leave();
28          if (done == numThreads) {
29              a.assert(criticalValue == numThreads * 10);
30              a.done();
31
32          }
33      });
34  }
35  @:timeout(30000)
36  public function test_critical_section() {
37      var a = new AssertionBuffer();
38      for (i in 0...numThreads)
39          work_in_critical_section(i, a);
40      return a;
41  }
But when I add Sys.sleep(ID/5); just before entrance into the critical section (on the blank line 18), the test passes every single time (with any number of threads). Without it, the test fails randomly (more often with a higher number of threads).
My conclusion from this test is that entrance to a critical section is not atomic, and multiple threads simultaneously attempting to enter may leave the critical section in an undefined state (leading to undefined/hanging behavior).
Is this the right conclusion or am I simply mis-using critical sections (and thus, the test needs to be re-written)? And if it is the right conclusion.. does this not mean that entrance into the critical section needs its own atomic locking/synchronization mechanism..? (and further, if that is the case.. what is the point of critical sections, why would I not just use whatever that atomic synchronization mechanism is?)
To me, this seems problematic. For example, suppose 10 threads meet at a synchronization barrier (with a capacity of 10), and then all 10 need to proceed through a critical section immediately after the 10th thread arrives. Does that mean I'd have to synchronize/serialize access to the critical section entrance method (for instance, by sleeping so as to ensure only one thread attempts to enter the section at a given tick, as done to fix the failing test above)?
The FFI is written on top of synchapi.h (see EnterCriticalSection).
You read done outside the critical section. That is a race condition. If you want to look at the value of done, you need to do it before you leave the critical section.
You might see a write to done from another thread, triggering the assert before the write to criticalValue is visible to the thread that saw the write to done.
If the critical section protects criticalValue and done, then it is an error to access either of them without being in the critical section unless you are sure every thread that might access them has terminated. Your code violates this rule.
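This isn't Haxe, but here is a minimal C++ analogue of the corrected pattern (std::mutex standing in for the Windows critical section, all names mine): take a snapshot of done while still holding the lock, and only use that snapshot afterwards.

#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

int criticalValue = 0;
int done = 0;
std::mutex section;                         // plays the role of the critical section

void work(int numThreads) {
    int doneSnapshot;
    {
        std::lock_guard<std::mutex> lock(section);
        criticalValue += 10;
        ++done;
        assert(criticalValue == done * 10); // both values read under the lock
        doneSnapshot = done;                // copy out before unlocking
    }
    if (doneSnapshot == numThreads) {
        // Safe: we were the last writer, and every earlier write
        // happened-before our lock acquisition.
        assert(criticalValue == numThreads * 10);
    }
}

int main() {
    const int numThreads = 50;
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i)
        threads.emplace_back(work, numThreads);
    for (auto& t : threads) t.join();
}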
This question already has an answer here:
Why can't I use cocoa frameworks in different forked processes?
Before Sierra, I used to be able to initialize GLUT on the child process after forking the original process. With the latest version of Sierra, this seems to have changed. The following program crashes with a segmentation fault. If I instead move all the glut functions to the parent process, everything works. Why is there a difference between using the parent/child process?
#include <stdlib.h>
#include <unistd.h>   /* fork(), sleep() */
#include <GLUT/glut.h>

void pass(void){
}

int main(int argc, char* argv[]) {
    pid_t childpid;
    childpid = fork();

    if (childpid == 0){
        glutInit(&argc,argv);
        glutInitDisplayMode(GLUT_SINGLE | GLUT_RGB | GLUT_DEPTH);
        glutInitWindowSize(100,100);
        glutCreateWindow("test");
        glutDisplayFunc(pass);
        glGetError();
        glutMainLoop();
    }else{
        sleep(5);
    }
    exit(1);
}
The segmentation fault I get:
Crashed Thread: 0 Dispatch queue: com.apple.main-thread
Exception Type: EXC_BAD_INSTRUCTION (SIGILL)
Exception Codes: 0x0000000000000001, 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Termination Signal: Illegal instruction: 4
Termination Reason: Namespace SIGNAL, Code 0x4
Terminating Process: exc handler [0]
Application Specific Information:
BUG IN CLIENT OF LIBDISPATCH: _dispatch_main_queue_callback_4CF called from the wrong thread
crashed on child side of fork pre-exec
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libdispatch.dylib 0x00007fffe8e7bd21 _dispatch_main_queue_callback_4CF + 1291
1 com.apple.CoreFoundation 0x00007fffd3c7bbe9 __CFRUNLOOP_IS_SERVICING_THE_MAIN_DISPATCH_QUEUE__ + 9
2 com.apple.CoreFoundation 0x00007fffd3c3d00d __CFRunLoopRun + 2205
3 com.apple.CoreFoundation 0x00007fffd3c3c514 CFRunLoopRunSpecific + 420
4 com.apple.Foundation 0x00007fffd57e1c9b -[NSRunLoop(NSRunLoop) limitDateForMode:] + 196
5 com.apple.glut 0x0000000104f39e93 -[GLUTApplication run] + 321
6 com.apple.glut 0x0000000104f46b0e glutMainLoop + 279
7 a.out 0x0000000104f24ed9 main + 121 (main.c:18)
8 libdyld.dylib 0x00007fffe8ea4255 start + 1
fork() does not create a thread, it forks the process. After calling fork you have two processes with nearly identical address space contents, but their address spaces are protected from each other. The general rule regarding fork() is that the only sensible thing to do in the child process after a fork is to replace the process image with execve(); doing anything else requires a lot of foresight in the program's design.
Threads are created in a different way. I suggest you use the actual threading primitives offered by your programming language of choice.
That being said, many OSs and GUI libraries want a process' GUI parts to run in the main thread, so that could be part of the reason as well. Also be aware that OpenGL and multithreading is a little bit finicky.
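A rough sketch of that advice applied to the program above, under the assumption that the "child" work doesn't itself need a GUI: no fork at all, GLUT stays on the main thread, and the other work moves to a std::thread. do_background_work is a placeholder for whatever the child process used to do.

#include <chrono>
#include <thread>
#include <GLUT/glut.h>   // <GL/glut.h> on non-Apple platforms

static void display(void) { /* draw nothing */ }

static void do_background_work() {
    // stands in for the work the forked child used to do
    std::this_thread::sleep_for(std::chrono::seconds(5));
}

int main(int argc, char* argv[]) {
    std::thread worker(do_background_work);   // background work, same process

    glutInit(&argc, argv);                    // GUI stays on the main thread
    glutInitDisplayMode(GLUT_SINGLE | GLUT_RGB | GLUT_DEPTH);
    glutInitWindowSize(100, 100);
    glutCreateWindow("test");
    glutDisplayFunc(display);

    worker.detach();                          // glutMainLoop() never returns
    glutMainLoop();
    return 0;
}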
I'm reading the Linux 2.6.11 source.
The implementation of sys_sigsuspend is as follows:
/*
 * Atomically swap in the new signal mask, and wait for a signal.
 */
asmlinkage int
sys_sigsuspend(int history0, int history1, old_sigset_t mask)
{
    struct pt_regs * regs = (struct pt_regs *) &history0;
    sigset_t saveset;

    mask &= _BLOCKABLE;
    spin_lock_irq(&current->sighand->siglock);
    saveset = current->blocked;
    siginitset(&current->blocked, mask);
    recalc_sigpending();
    spin_unlock_irq(&current->sighand->siglock);

    regs->eax = -EINTR;
    while (1) {
        current->state = TASK_INTERRUPTIBLE;
        schedule();
        if (do_signal(regs, &saveset))
            return -EINTR;
    }
}
In ULK3 (Understanding the Linux Kernel, 3rd edition) the author says:
the sigsuspend( ) system call does not allow signals to be sent after unblocking and before the schedule( ) invocation, because other processes cannot grab the CPU during that time interval.
Between spin_unlock_irq and schedule the syscall can be interrupted and preempted, so another process has enough time to send an unblocked signal to this process.
But in that case, the signal would be lost, because the process calls schedule() after the signal has already been delivered.
That's why sigsuspend should be atomic, but it's NOT, according to its implementation.
The sigsuspend implementation is correct, but the explanation in ULK seems to be misleading.
While a process executes kernel code, that execution is never interrupted by user signals. Instead, such signals accumulate inside the current task structure. At the moment the process leaves kernel code and returns to user mode, all accumulated (and not blocked) signals are delivered.
The kernel's schedule() function checks whether any signals have accumulated. If they have, and current->state is TASK_INTERRUPTIBLE, schedule() returns. So signals collected before the schedule() call are not lost.
The atomicity of the sigsuspend() system call means that if a signal temporarily unblocked by the call is emitted, the call is guaranteed to see it and return. That atomicity is achieved simply by placing both the unblocking and the checking of signals inside the same kernel function.
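A user-space illustration (not the kernel code above; names like got_signal are mine) of the pattern that depends on exactly this guarantee: block the signal, test the flag, then let sigsuspend() atomically unblock and wait, so the signal cannot slip in between the test and the sleep.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;
static void handler(int sig) { (void)sig; got_signal = 1; }

int main() {
    struct sigaction sa = {};
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, nullptr);

    // Block SIGALRM so it cannot fire between testing got_signal and waiting.
    sigset_t block, orig;
    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    sigprocmask(SIG_BLOCK, &block, &orig);

    alarm(1);   // the signal arrives roughly one second from now

    while (!got_signal) {
        // Atomically restore the old mask (unblocking SIGALRM) and sleep.
        // If the signal is already pending, sigsuspend() returns at once,
        // so the wakeup cannot be lost in the gap the question worries about.
        sigsuspend(&orig);
    }

    sigprocmask(SIG_SETMASK, &orig, nullptr);
    puts("signal observed, no lost wakeup");
    return 0;
}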
I would like to divide all the threads into 2 different groups, since I have two parallel tasks to run asynchronously. For example, if 8 threads are available in total, I would like 6 threads dedicated to task1 and the other 2 dedicated to task2.
How can I achieve this with OpenMP?
This is a job for OpenMP nested parallelism, as of OpenMP 3: you can use OpenMP tasks to start two independent tasks and then within those tasks, have parallel sections which use the appropriate number of threads.
As a quick example:
#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
    omp_set_nested(1);   /* make sure nested parallelism is on */
    int nprocs = omp_get_num_procs();
    int nthreads1 = nprocs/3;
    int nthreads2 = nprocs - nthreads1;

    #pragma omp parallel default(none) shared(nthreads1, nthreads2) num_threads(2)
    #pragma omp single
    {
        #pragma omp task
        #pragma omp parallel for num_threads(nthreads1)
        for (int i=0; i<16; i++)
            printf("Task 1: thread %d of the %d children of %d: handling iter %d\n",
                   omp_get_thread_num(), omp_get_team_size(2),
                   omp_get_ancestor_thread_num(1), i);

        #pragma omp task
        #pragma omp parallel for num_threads(nthreads2)
        for (int j=0; j<16; j++)
            printf("Task 2: thread %d of the %d children of %d: handling iter %d\n",
                   omp_get_thread_num(), omp_get_team_size(2),
                   omp_get_ancestor_thread_num(1), j);
    }
    return 0;
}
Running this on an 8 core (16 hardware threads) node,
$ gcc -fopenmp nested.c -o nested -std=c99
$ ./nested
Task 2: thread 3 of the 11 children of 0: handling iter 6
Task 2: thread 3 of the 11 children of 0: handling iter 7
Task 2: thread 1 of the 11 children of 0: handling iter 2
Task 2: thread 1 of the 11 children of 0: handling iter 3
Task 1: thread 2 of the 5 children of 1: handling iter 8
Task 1: thread 2 of the 5 children of 1: handling iter 9
Task 1: thread 2 of the 5 children of 1: handling iter 10
Task 1: thread 2 of the 5 children of 1: handling iter 11
Task 2: thread 6 of the 11 children of 0: handling iter 12
Task 2: thread 6 of the 11 children of 0: handling iter 13
Task 1: thread 0 of the 5 children of 1: handling iter 0
Task 1: thread 0 of the 5 children of 1: handling iter 1
Task 1: thread 0 of the 5 children of 1: handling iter 2
Task 1: thread 0 of the 5 children of 1: handling iter 3
Task 2: thread 5 of the 11 children of 0: handling iter 10
Task 2: thread 5 of the 11 children of 0: handling iter 11
Task 2: thread 0 of the 11 children of 0: handling iter 0
Task 2: thread 0 of the 11 children of 0: handling iter 1
Task 2: thread 2 of the 11 children of 0: handling iter 4
Task 2: thread 2 of the 11 children of 0: handling iter 5
Task 1: thread 1 of the 5 children of 1: handling iter 4
Task 2: thread 4 of the 11 children of 0: handling iter 8
Task 2: thread 4 of the 11 children of 0: handling iter 9
Task 1: thread 3 of the 5 children of 1: handling iter 12
Task 1: thread 3 of the 5 children of 1: handling iter 13
Task 1: thread 3 of the 5 children of 1: handling iter 14
Task 2: thread 7 of the 11 children of 0: handling iter 14
Task 2: thread 7 of the 11 children of 0: handling iter 15
Task 1: thread 1 of the 5 children of 1: handling iter 5
Task 1: thread 1 of the 5 children of 1: handling iter 6
Task 1: thread 1 of the 5 children of 1: handling iter 7
Task 1: thread 3 of the 5 children of 1: handling iter 15
Updated: I've changed the above to include the thread ancestor; there was some confusion because there were (for instance) two "thread 1"s printed - here I've also printed the ancestor (e.g., "thread 1 of the 5 children of 1" vs "thread 1 of the 11 children of 0").
From the OpenMP standard, S.3.2.4, “The omp_get_thread_num routine returns the thread number, within the current team, of the calling thread.”, and from section 2.5, “When a thread encounters a parallel construct, a team of threads is created to execute the parallel region [...] The thread that encountered the parallel construct becomes the master thread of the new team, with a thread number of zero for the duration of the new parallel region.”
That is, within each of those (nested) parallel regions, teams of threads are created which have thread ids starting at zero; but just because those ids overlap within the team doesn't mean they're the same threads. Here I've emphasized that by printing their ancestor number as well, but if the threads were doing CPU-intensive work you'd also see with monitoring tools that there were indeed 16 active threads, not just 11.
The reason why they are team-local thread numbers and not globally-unique thread numbers is pretty straightforward; it would be almost impossible to keep track of globally-unique thread numbers in an environment where nested and dynamic parallelism can happen. Say there are three teams of threads, numbered [0..5], [6..10], and [11..15], and the middle team completes. Do we leave gaps in the thread numbering? Do we interrupt all threads to change their global numbers? What if a new team is started with 7 threads? Do we start them at 6 and have overlapping thread ids, or do we start them at 16 and leave gaps in the numbering?
I have a program that reads about 1000 images and creates a statistical summary of their contents. Each image is processed in its own thread using OpenMP, and I have the thread limit set to match my number of processors.
Until about two weeks ago, the program ran fine. Now, however, if I run the program more than once, my system slows down and eventually freezes up.
In order to troubleshoot, I wrote the simple code listed below that emulates what my program is doing. This code will freeze my system, just as my original program does, after trying to read only a few files at line 35.
I ran the program, successively reverting to an earlier kernel after each failure, and found that it fails with all 3.6 kernels up to version 3.6.8.
However, when I go back to kernel 3.5.6, it works.
test.cc:
1 #include <cstdio>
2 #include <iostream>
3 #include <vector>
4 #include <unistd.h>
5
6 using namespace std;
7
8 int main ()
9 {
10 // number of files
11 const size_t N = 1000;
12 // total system memory
13 const size_t MEM = sysconf (_SC_PHYS_PAGES) * sysconf (_SC_PAGE_SIZE);
14 // file size
15 const size_t SZ = MEM/N;
16
17 // create temp filenames
18 vector<string> fn (N);
19 for (size_t i = 0; i < fn.size (); ++i)
20 fn[i] = string (tmpnam (NULL));
21
22 // write a bunch of files to disk
23 for (size_t i = 0; i < fn.size (); ++i)
24 {
25 vector<char> a (SZ);
26 FILE *fp = fopen (fn[i].c_str (), "wb");
27 fwrite (&a[0], a.size (), 1, fp);
28 clog << fn[i] << " written" << endl;
29 }
30
31 // read a bunch of files from disk
32 #pragma omp parallel for
33 for (size_t i = 0; i < fn.size (); ++i)
34 {
35 vector<char> a (SZ);
36 FILE *fp = fopen (fn[i].c_str (), "rb");
37 fread (&a[0], a.size (), 1, fp);
38 clog << fn[i] << " read" << endl;
39 }
40
41 return 0;
42 }
Makefile:
a:
	g++ -fopenmp -Wall -o test -g test.cc
	./test
My question is: What is different about kernel 3.6 that would cause this program to fail, but does not cause it to fail in version 3.5?
Without going through the code, if you want to set some limits to your processes, have a look at cgroups for limiting resource usage.
As for the freezing - you are trying to read/write GBs of data to disk at once. Given the speeds of ~100MB/s of today's hard-drives, I would expect a freeze at the time the kernel decides to flush the caches to the disk - which will probably occur as soon as you try to read a reasonably sized chunk of data from the disk under memory pressure (since you allocated lots of memory, the space for caches is limited).
You can try to mmap() the files or change the kernel I/O scheduler.
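Here is a rough sketch of the mmap() suggestion (the function name and structure are mine, not from the question): map each file, process it in one pass, and tell the kernel the pages will not be needed again, so they don't pile up in the page cache while the writes from the first loop are still being flushed. Error handling is kept minimal.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

size_t checksum_file(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;

    struct stat st;
    fstat(fd, &st);
    size_t len = static_cast<size_t>(st.st_size);

    void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return 0;

    madvise(p, len, MADV_SEQUENTIAL);         // hint: one linear pass

    size_t sum = 0;
    const unsigned char* bytes = static_cast<const unsigned char*>(p);
    for (size_t i = 0; i < len; ++i)
        sum += bytes[i];                      // stands in for the image processing

    madvise(p, len, MADV_DONTNEED);           // drop the pages once we're done
    munmap(p, len);
    return sum;
}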
I haven't looked at your code in depth, but I noticed some practices that seem problematic (at least, I think they are):
First, the critical section inside the OpenMP loop. That is a synchronization point, and putting it in every iteration sounds problematic to me. Since each thread must be sure no other one has entered it, the overhead that synchronization introduces probably increases with the number of threads.
Second: I am not very used to C++, but I guess that every time vector<char> a (SZ) is executed, memory is allocated (and freed at the end of the block). Excuse me if I am wrong. Since you know the value of SZ beforehand, it would be better to allocate a vector<vector<char> > with as many elements as threads before the parallel region. Then, in the parallel region, each thread would access its own vector<char>.
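A short sketch of both suggestions applied to the question's reading loop (the function name and signature are mine): one scratch buffer per thread, allocated once before the parallel region, and no per-iteration critical section or logging inside the loop.

#include <cstdio>
#include <string>
#include <vector>
#include <omp.h>

// Read every file into a per-thread scratch buffer that is allocated once,
// instead of allocating (and freeing) SZ bytes on every iteration.
void read_files(const std::vector<std::string>& fn, size_t SZ) {
    std::vector<std::vector<char>> buffers(omp_get_max_threads(),
                                           std::vector<char>(SZ));

    #pragma omp parallel for
    for (size_t i = 0; i < fn.size(); ++i) {
        std::vector<char>& a = buffers[omp_get_thread_num()];
        if (FILE* fp = std::fopen(fn[i].c_str(), "rb")) {
            std::fread(a.data(), a.size(), 1, fp);
            std::fclose(fp);
        }
    }
}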