I have a program that needs to run at 100% performance, but I see that it is sometimes paused for more than 20 µs. I've struggled with this for a while and can't find the reason/explanation.
So my question is:
Why is my program "paused"/"stalled" for 20 µs every now and then?
To investigate this I wrote the following small program:
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <iostream>
#include <signal.h>

using namespace std;

// Read CLOCK_MONOTONIC and fold it into a single nanosecond count.
unsigned long long get_time_in_ns(void)
{
    struct timespec tmp;
    if (clock_gettime(CLOCK_MONOTONIC, &tmp) == 0)
    {
        return (unsigned long long)tmp.tv_sec * 1000000000ULL + tmp.tv_nsec;
    }
    else
    {
        exit(1);
    }
}
// Cleared by the SIGINT handler; volatile so the main loop re-reads it.
volatile bool go_on = true;

static void Sig(int sig)
{
    (void)sig;
    go_on = false;
}
int main()
{
    unsigned long long t1 = 0;
    unsigned long long t2 = 0;
    unsigned long long t3 = 0;
    unsigned long long t4 = 0;
    unsigned long long t5 = 0;
    unsigned long long t2saved = 0;
    unsigned long long t3saved = 0;
    unsigned long long t4saved = 0;
    unsigned long long t5saved = 0;

    struct sigaction sig;
    memset(&sig, 0, sizeof(sig));
    sig.sa_handler = Sig;
    if (sigaction(SIGINT, &sig, 0) < 0)
    {
        cout << "sigaction failed" << endl;
        return 0;
    }

    while (go_on)
    {
        // Take five back-to-back timestamps...
        t1 = get_time_in_ns();
        t2 = get_time_in_ns();
        t3 = get_time_in_ns();
        t4 = get_time_in_ns();
        t5 = get_time_in_ns();

        // ...and track the largest gap ever seen between consecutive calls.
        if ((t2-t1) > t2saved) t2saved = t2-t1;
        if ((t3-t2) > t3saved) t3saved = t3-t2;
        if ((t4-t3) > t4saved) t4saved = t4-t3;
        if ((t5-t4) > t5saved) t5saved = t5-t4;

        cout <<
            t1 << " " <<
            t2-t1 << " " <<
            t3-t2 << " " <<
            t4-t3 << " " <<
            t5-t4 << " " <<
            t2saved << " " <<
            t3saved << " " <<
            t4saved << " " <<
            t5saved << endl;
    }

    cout << endl << "Closing..." << endl;
    return 0;
}
The program simply tests how long it takes to call the function "get_time_in_ns". It does this 5 times in a row and also tracks the longest time measured.
Normally a call takes about 30 ns, but sometimes it takes as long as 20000 ns, which I don't understand.
A little part of the program output is:
8909078678739 37 29 28 28 17334 17164 17458 18083
8909078680355 36 30 29 28 17334 17164 17458 18083
8909078681947 38 28 28 27 17334 17164 17458 18083
8909078683521 37 29 28 27 17334 17164 17458 18083
8909078685096 39 27 28 29 17334 17164 17458 18083
8909078686665 37 29 28 28 17334 17164 17458 18083
8909078688256 37 29 28 28 17334 17164 17458 18083
8909078689827 37 27 28 28 17334 17164 17458 18083
The output shows that a normal call takes approx. 30 ns (columns 2 to 5), but that the largest measured times are nearly 20000 ns (columns 6 to 9).
I start the program like this:
chrt -f 99 nice -n -20 myprogram
Any ideas why the call sometimes takes 20000 ns when it normally takes 30 ns?
The program is executed on a dual Xeon (8 cores each) machine.
I connect using SSH.
top shows:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8107 root rt -20 16788 1448 1292 S 3.0 0.0 0:00.88 myprogram
2327 root 20 0 69848 7552 5056 S 1.3 0.0 0:37.07 sshd
Even the lowest niceness value (-20) is not a real-time priority; the process still runs under SCHED_OTHER, the default time-sharing policy. You need to switch to a real-time scheduling policy with sched_setscheduler(), either SCHED_FIFO or SCHED_RR as required.
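For illustration, here is a minimal sketch (not from the original program) of switching the calling process to SCHED_FIFO; it requires root or CAP_SYS_NICE, and priority 99 mirrors the chrt -f 99 invocation above:
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param sp;
    sp.sched_priority = 99;  /* 1..99 for SCHED_FIFO; 99 matches chrt -f 99 */

    /* pid 0 means "the calling process" */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
    {
        perror("sched_setscheduler");
        return 1;
    }

    /* ... the time-critical loop would run here under SCHED_FIFO ... */
    return 0;
}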
Note that even that will still not give you absolute 100% CPU if yours isn't the only task running. Even for a task that runs without interruption, Linux reserves a small share of CPU time for non-real-time tasks (5% by default; see /proc/sys/kernel/sched_rt_runtime_us) so that a runaway RT task cannot effectively hang the machine. Of course, a real-time task needing 100% CPU time is unlikely to perform correctly anyway.
Edit: As pointed out, the process already runs with an RT scheduler (nice values only affect SCHED_OTHER, so setting one in addition is pointless). The rest of my answer still applies as to how and why other tasks still get to run (remember that there are also a number of kernel tasks).
Probably the only way to do better than this is to dedicate one CPU core to the task. Obviously this only works on multi-core CPUs. There is a question related to that here: Whole one core dedicated to single process
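As a rough sketch of that approach (my own example, not from the question): pin the process to a single core with sched_setaffinity(); to truly dedicate the core you would also keep everything else off it, e.g. with the isolcpus= kernel boot parameter. The core number 3 here is arbitrary:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);  /* allow this process to run only on core 3 */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
    {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... from here on, the process never migrates off core 3 ... */
    return 0;
}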
Related
We have some API integration tests where it takes ~30 minutes to run a test class with 19 rows in its where: table. We're trying to speed this up using Spock's (experimental) Parallel Execution feature. We are using a simple SpockConfig.groovy file:
runner {
  parallel {
    enabled true
    fixed(4)
  }
}
This still took ~30 minutes on a GitLab runner. Is there any way to log something out so we can verify that the test is running in parallel? Or am I misunderstanding the nature of Spock's parallelization?
You can parallelise Spock testing on several levels, i.e. per specification (test class) or per feature (test method). The per-method setting also means that in iterated (unrolled) tests, iterations can run in parallel. Here is some proof:
import static org.spockframework.runtime.model.parallel.ExecutionMode.CONCURRENT

runner {
  parallel {
    enabled true
    // These values are the default already, specifying them is redundant
    // defaultSpecificationExecutionMode = CONCURRENT
    // defaultExecutionMode = CONCURRENT
    fixed 4
  }
}
package de.scrum_master.testing

import spock.lang.Specification

import static java.lang.System.currentTimeMillis
import static java.lang.Thread.currentThread

class ParallelExecutionTest extends Specification {
  static long startTime = currentTimeMillis()
  static volatile int execCount = 0

  def "feature-A-#count"() {
    given:
    def i = ++execCount
    printf "%5d %2d [%s] >> %s%n", currentTimeMillis() - startTime, i, currentThread().name, specificationContext.currentIteration.displayName
    sleep 1000
    printf "%5d %2d [%s] << %s%n", currentTimeMillis() - startTime, i, currentThread().name, specificationContext.currentIteration.displayName

    where:
    count << (1..8)
  }

  def "feature-B-#count"() {
    given:
    def i = ++execCount
    printf "%5d %2d [%s] >> %s%n", currentTimeMillis() - startTime, i, currentThread().name, specificationContext.currentIteration.displayName
    sleep 1000
    printf "%5d %2d [%s] << %s%n", currentTimeMillis() - startTime, i, currentThread().name, specificationContext.currentIteration.displayName

    where:
    count << (1..8)
  }
}
The output would look something like this (I used the volatile counter variable to make it easier to sort the log output in my editor into groups of four threads running simultaneously):
68 1 [ForkJoinPool-1-worker-3] >> feature-B-1
68 2 [ForkJoinPool-1-worker-2] >> feature-A-5
68 3 [ForkJoinPool-1-worker-1] >> feature-B-6
68 4 [ForkJoinPool-1-worker-4] >> feature-B-2
1177 1 [ForkJoinPool-1-worker-3] << feature-B-1
1177 3 [ForkJoinPool-1-worker-1] << feature-B-6
1179 4 [ForkJoinPool-1-worker-4] << feature-B-2
1180 2 [ForkJoinPool-1-worker-2] << feature-A-5
1191 5 [ForkJoinPool-1-worker-3] >> feature-B-3
1199 6 [ForkJoinPool-1-worker-4] >> feature-B-4
1200 7 [ForkJoinPool-1-worker-1] >> feature-B-5
1204 8 [ForkJoinPool-1-worker-2] >> feature-A-6
2199 5 [ForkJoinPool-1-worker-3] << feature-B-3
2205 7 [ForkJoinPool-1-worker-1] << feature-B-5
2213 6 [ForkJoinPool-1-worker-4] << feature-B-4
2213 8 [ForkJoinPool-1-worker-2] << feature-A-6
2203 9 [ForkJoinPool-1-worker-3] >> feature-B-7
2207 10 [ForkJoinPool-1-worker-1] >> feature-A-1
2217 11 [ForkJoinPool-1-worker-4] >> feature-B-8
2222 12 [ForkJoinPool-1-worker-2] >> feature-A-8
3205 9 [ForkJoinPool-1-worker-3] << feature-B-7
3213 10 [ForkJoinPool-1-worker-1] << feature-A-1
3229 12 [ForkJoinPool-1-worker-2] << feature-A-8
3230 11 [ForkJoinPool-1-worker-4] << feature-B-8
3208 13 [ForkJoinPool-1-worker-3] >> feature-A-2
3216 14 [ForkJoinPool-1-worker-1] >> feature-A-3
3234 15 [ForkJoinPool-1-worker-4] >> feature-A-4
3237 16 [ForkJoinPool-1-worker-2] >> feature-A-7
4214 13 [ForkJoinPool-1-worker-3] << feature-A-2
4230 14 [ForkJoinPool-1-worker-1] << feature-A-3
4245 15 [ForkJoinPool-1-worker-4] << feature-A-4
4245 16 [ForkJoinPool-1-worker-2] << feature-A-7
As you can clearly see, with your own settings you can expect multiple features and even multiple iterations of your spec to run concurrently. If in your case the concurrent iterations take as long to finish as running the tests sequentially, it could mean that those tests spend a lot of time on a shared resource which is synchronised, or in some other way can only be used by one caller at a time. So the problem is probably not Spock, but the shared resource or the way you synchronise on it.
I'm observing behaviour which doesn't seem to be in line with how pthread_cond_signal and pthread_cond_wait should behave (according to the man pages). man 3 pthread_cond_signal stipulates that:
The pthread_cond_signal() function shall unblock at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
This isn't precise enough and doesn't clarify whether the thread calling pthread_cond_signal will, at the same time, yield its time back to the scheduler.
Here's an example program:
#include <pthread.h>
#include <iostream>
#include <time.h>
#include <unistd.h>

int mMsg = 0;
pthread_mutex_t mMsgMutex;
pthread_cond_t mMsgCond;
pthread_t consumerThread;
pthread_t producerThread;

void* producer(void* data) {
    (void) data;
    while (true) {
        pthread_mutex_lock(&mMsgMutex);
        std::cout << "1> locked" << std::endl;
        mMsg += 1;
        std::cout << "1> sending signal, mMsg = " << mMsg << std::endl;
        pthread_cond_signal(&mMsgCond);
        pthread_mutex_unlock(&mMsgMutex);
    }

    return nullptr;
}

void* consumer(void* data) {
    (void) data;
    pthread_mutex_lock(&mMsgMutex);

    while (true) {
        while (mMsg == 0) {
            pthread_cond_wait(&mMsgCond, &mMsgMutex);
        }
        std::cout << "2> wake up, msg: " << mMsg << std::endl;
        mMsg = 0;
    }

    return nullptr;
}

int main()
{
    pthread_mutex_init(&mMsgMutex, nullptr);
    pthread_cond_init(&mMsgCond, nullptr);

    pthread_create(&consumerThread, nullptr, consumer, nullptr);

    std::cout << "starting producer..." << std::endl;

    sleep(1);
    pthread_create(&producerThread, nullptr, producer, nullptr);

    pthread_join(consumerThread, nullptr);
    pthread_join(producerThread, nullptr);
    return 0;
}
Here's the output:
starting producer...
1> locked
1> sending signal, mMsg = 1
1> locked
1> sending signal, mMsg = 2
1> locked
1> sending signal, mMsg = 3
1> locked
1> sending signal, mMsg = 4
1> locked
1> sending signal, mMsg = 5
1> locked
1> sending signal, mMsg = 6
1> locked
1> sending signal, mMsg = 7
1> locked
1> sending signal, mMsg = 8
1> locked
1> sending signal, mMsg = 9
1> locked
1> sending signal, mMsg = 10
2> wake up, msg: 10
1> locked
1> sending signal, mMsg = 1
1> locked
1> sending signal, mMsg = 2
1> locked
1> sending signal, mMsg = 3
...
It seems there is no guarantee that pthread_cond_signal will immediately unblock a waiting pthread_cond_wait thread, and that any number of pthread_cond_signal calls can be lost after the first one has been issued.
Is this really the intended behaviour or am I doing something wrong here?
This is the intended behavior. pthread_cond_signal does not yield its remaining runtime; the calling thread will continue to run.
And yes, pthread_cond_signal will immediately unblock (one or more of) the threads waiting on the corresponding condition variable. However, that doesn't guarantee that the waiting thread will immediately run. It just tells the OS that this thread is no longer blocked; it's up to the OS thread scheduler to decide when to start running it. Since the signalling thread is already running and hot in the cache, it will likely have plenty of time to do something before the now-unblocked thread starts doing anything.
In your example above, if you don't want to skip messages, what you're probably looking for is a producer-consumer queue, perhaps backed by a ring buffer.
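For illustration, here is a minimal sketch of such a queue using the same pthread primitives (my own example; the capacity of 8 is arbitrary): a fixed-size ring buffer where the producer blocks while the buffer is full, so no message is ever dropped:
#include <pthread.h>

const int CAP = 8;
int ring[CAP];
int head = 0, tail = 0, count = 0;
pthread_mutex_t qMutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t qNotEmpty = PTHREAD_COND_INITIALIZER;
pthread_cond_t qNotFull = PTHREAD_COND_INITIALIZER;

void put(int msg) {
    pthread_mutex_lock(&qMutex);
    while (count == CAP)              // full: wait until the consumer drains
        pthread_cond_wait(&qNotFull, &qMutex);
    ring[tail] = msg;
    tail = (tail + 1) % CAP;
    ++count;
    pthread_cond_signal(&qNotEmpty);
    pthread_mutex_unlock(&qMutex);
}

int get() {
    pthread_mutex_lock(&qMutex);
    while (count == 0)                // empty: wait until the producer fills
        pthread_cond_wait(&qNotEmpty, &qMutex);
    int msg = ring[head];
    head = (head + 1) % CAP;
    --count;
    pthread_cond_signal(&qNotFull);
    pthread_mutex_unlock(&qMutex);
    return msg;
}
Since put() blocks instead of overwriting, the producer can never get more than CAP messages ahead of the consumer, which is exactly what prevents the message loss you observed.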
That's a bit of a glib title, but in playing around with Promises I wanted to see how far I could stretch the idea. In this program, I make it so I can specify how many promises I want to make.
The default value in the thread scheduler is 16 threads (rakudo/ThreadPoolScheduler.pm)
If I specify more than that number, the program hangs but I don't get a warning (say, like "Too many threads").
If I set RAKUDO_MAX_THREADS, I can stop the program hanging but eventually there is too much thread competition to run.
I have two questions, really.
How would a program know how many more threads it can make? That's slightly more than the number of promises, for what that's worth.
How would I know how many threads I should allow, even if I can make more?
This is Rakudo 2017.01 on my puny Macbook Air with 4 cores:
my $threads = @*ARGS[0] // %*ENV<RAKUDO_MAX_THREADS> // 1;
put "There are $threads threads";

my $channel = Channel.new;

# start some promises
my @promises;
for 1 .. $threads {
    @promises.push: start {
        react {
            whenever $channel -> $i {
                say "Thread {$*THREAD.id} got $i";
            }
        }
    }
}
put "Done making threads";

for ^100 { $channel.send( $_ ) }
put "Done sending";

$channel.close;
await |@promises;
put "Done!";
This isn't actually about Promise per se, but rather about the thread pool scheduler. A Promise itself is just a synchronization construct. The start construct actually does two things:
Ensures a fresh $_, $/, and $! inside of the block
Calls Promise.start with that block
And Promise.start also does two things:
Creates and returns a Promise
Schedules the code in the block to be run on the thread pool, and arranges that successful completion keeps the Promise and an exception breaks the Promise.
It's not only possible, but also relatively common, to have Promise objects that aren't backed by code on the thread pool. Promise.in, Promise.anyof and Promise.allof factories don't immediately schedule anything, and there are all kinds of uses of a Promise that involve doing Promise.new and then calling keep or break later on. So I can easily create and await on 1000 Promises:
my @p = Promise.new xx 1000;
start { sleep 1; .keep for @p };
await @p;
say 'done' # completes, no trouble
Similarly, a Promise is not the only thing that can schedule code on the ThreadPoolScheduler. The many things that return Supply (like intervals, file watching, asynchronous sockets, asynchronous processes) all schedule their callbacks there too. It's possible to throw code there fire-and-forget style by doing $*SCHEDULER.cue: { ... } (though often you care about the result, or any errors, so it's not especially common).
The current Perl 6 thread pool scheduler has a configurable but enforced upper limit, which defaults to 16 threads. If you create a situation where all 16 are occupied but unable to make progress, and the only thing that could make progress is stuck in the work queue, then deadlock will occur. This is nothing unique to the Perl 6 thread pool; any bounded pool will be vulnerable to this (and any unbounded pool will be vulnerable to using up all resources and getting the process killed :-)).
As mentioned in another post, Perl 6.d will make await and react non-blocking constructs; this has always been the plan, but there were insufficient development resources to realize it in time for Perl 6.c. The use v6.d.PREVIEW pragma provides early access to this feature. (Also, fair warning, it's a work in progress.) The upshot of this is that an await or react on a thread owned by the thread pool will pause the execution of the scheduled code (for those curious, by taking a continuation) and allow the thread to get on with further work. The resumption of the code will be scheduled when the awaited thing completes, or the react block gets done. Note that this means you can be on a different OS thread before and after the await or react in 6.d. (Most Perl 6 users will not need to care about this. It's mostly relevant for those writing bindings to C libraries, or doing other systems-y stuff. And a good C library binding will make it so users of the binding don't have to care.)
The upcoming 6.d change doesn't eliminate the possibility of exhausting the thread pool, but it does mean that a number of ways you could do so in 6.c will no longer be of concern (notably, writing recursive divide-and-conquer code that awaits the results of the divided parts, or having thousands of active react blocks launched with start react { ... }).
Looking forward, the thread pool scheduler itself will also become smarter. What follows is speculation, though given I'll likely be the one implementing the changes it's probably the best speculation on offer. :-) The thread pool will start following the progress being made, and use it to dynamically tune the pool size. This will include noticing that no progress is being made and, combined with the observation that the work queues contain items, adding threads to try and resolve the deadlock - at the cost of memory overhead of added threads. Today the thread pool conservatively tends to spawn up to its maximum size anyway, even if this is not a particularly optimal choice; most likely some kind of hill-climbing algorithm will be used to try and settle on an optimal number instead. Once that happens, the default max_threads can be raised substantially, so that more programs will - at the cost of a bunch of memory overhead - be able to complete, but most will run with just a handful of threads.
Quick fix: add use v6.d.PREVIEW; on the first line.
This fixes a number of thread exhaustion issues.
I also made a few other changes, like defaulting the thread count to $*SCHEDULER.max_threads, and printing the Promise "id" so that it is easy to see that the Thread id doesn't necessarily correlate with a given Promise.
#! /usr/bin/env perl6

use v6.d.PREVIEW; # <--

my $threads = @*ARGS[0] // $*SCHEDULER.max_threads;
put "There are $threads threads";

my $channel = Channel.new;

# start some promises
my @promises;
for 1 .. $threads {
    @promises.push: start {
        react {
            whenever $channel -> $i {
                say "Thread $*THREAD.id() ($_) got $i";
            }
        }
    }
}
put "Done making threads";

for ^100 { $channel.send( $_ ) }
put "Done sending";

$channel.close;
await @promises;
put "Done!";
There are 16 threads
Done making threads
Thread 4 (14) got 0
Thread 4 (14) got 1
Thread 8 (8) got 3
Thread 10 (6) got 4
Thread 6 (1) got 5
Thread 16 (5) got 2
Thread 3 (16) got 7
Thread 7 (8) got 8
Thread 7 (9) got 9
Thread 5 (3) got 6
Thread 3 (6) got 10
Thread 11 (2) got 11
Thread 14 (5) got 12
Thread 4 (16) got 13
Thread 16 (15) got 14 # <<
Thread 13 (11) got 15
Thread 4 (15) got 16 # <<
Thread 4 (15) got 17 # <<
Thread 4 (15) got 18 # <<
Thread 11 (15) got 19 # <<
Thread 13 (15) got 20 # <<
Thread 3 (15) got 21 # <<
Thread 9 (13) got 22
Thread 18 (15) got 23 # <<
Thread 18 (15) got 24 # <<
Thread 8 (13) got 25
Thread 7 (15) got 26 # <<
Thread 3 (15) got 27 # <<
Thread 7 (15) got 28 # <<
Thread 8 (15) got 29 # <<
Thread 13 (13) got 30
Thread 14 (13) got 31
Thread 8 (13) got 32
Thread 6 (13) got 33
Thread 9 (15) got 34 # <<
Thread 13 (15) got 35 # <<
Thread 9 (15) got 36 # <<
Thread 16 (15) got 37 # <<
Thread 3 (15) got 38 # <<
Thread 18 (13) got 39
Thread 3 (15) got 40 # <<
Thread 7 (14) got 41
Thread 12 (15) got 42 # <<
Thread 15 (15) got 43 # <<
Thread 4 (1) got 44
Thread 11 (1) got 45
Thread 7 (15) got 46 # <<
Thread 8 (15) got 47 # <<
Thread 7 (15) got 48 # <<
Thread 17 (15) got 49 # <<
Thread 10 (10) got 50
Thread 10 (15) got 51 # <<
Thread 11 (14) got 52
Thread 6 (8) got 53
Thread 5 (13) got 54
Thread 11 (15) got 55 # <<
Thread 11 (13) got 56
Thread 3 (13) got 57
Thread 7 (13) got 58
Thread 16 (16) got 59
Thread 5 (15) got 60 # <<
Thread 5 (15) got 61 # <<
Thread 6 (15) got 62 # <<
Thread 5 (15) got 63 # <<
Thread 5 (15) got 64 # <<
Thread 17 (11) got 65
Thread 15 (15) got 66 # <<
Thread 17 (15) got 67 # <<
Thread 11 (13) got 68
Thread 10 (15) got 69 # <<
Thread 3 (15) got 70 # <<
Thread 11 (15) got 71 # <<
Thread 6 (15) got 72 # <<
Thread 16 (13) got 73
Thread 6 (13) got 74
Thread 17 (15) got 75 # <<
Thread 4 (13) got 76
Thread 8 (13) got 77
Thread 12 (15) got 78 # <<
Thread 6 (11) got 79
Thread 3 (15) got 80 # <<
Thread 11 (13) got 81
Thread 7 (13) got 82
Thread 4 (15) got 83 # <<
Thread 7 (15) got 84 # <<
Thread 7 (15) got 85 # <<
Thread 10 (15) got 86 # <<
Thread 7 (15) got 87 # <<
Thread 12 (13) got 88
Thread 3 (13) got 89
Thread 18 (13) got 90
Thread 6 (13) got 91
Thread 18 (13) got 92
Thread 15 (15) got 93 # <<
Thread 16 (15) got 94 # <<
Thread 12 (15) got 95 # <<
Thread 17 (15) got 96 # <<
Thread 11 (13) got 97
Thread 15 (16) got 98
Thread 18 (7) got 99
Done sending
Done!
I have a program that reads about 1000 images and creates a statistical summary of their contents. Each image is processed in its own thread using OpenMP, and I have the thread limit set to match my number of processors.
Until about two weeks ago, the program ran fine. Now, however, if I run the program more than once, my system slows down and eventually freezes up.
In order to troubleshoot, I wrote the simple code listed below that emulates what my program is doing. This code will freeze my system, just as my original program does, after trying to read only a few files at line 35.
I ran the program, successively reverting to an earlier kernel after each failure, and found that it fails with all 3.6 kernels up to version 3.6.8.
However, when I go back to kernel 3.5.6, it works.
test.cc:
1 #include <cstdio>
2 #include <iostream>
3 #include <vector>
4 #include <unistd.h>
5
6 using namespace std;
7
8 int main ()
9 {
10     // number of files
11     const size_t N = 1000;
12     // total system memory
13     const size_t MEM = sysconf (_SC_PHYS_PAGES) * sysconf (_SC_PAGE_SIZE);
14     // file size
15     const size_t SZ = MEM/N;
16
17     // create temp filenames
18     vector<string> fn (N);
19     for (size_t i = 0; i < fn.size (); ++i)
20         fn[i] = string (tmpnam (NULL));
21
22     // write a bunch of files to disk
23     for (size_t i = 0; i < fn.size (); ++i)
24     {
25         vector<char> a (SZ);
26         FILE *fp = fopen (fn[i].c_str (), "wb");
27         fwrite (&a[0], a.size (), 1, fp);
28         clog << fn[i] << " written" << endl;
29     }
30
31     // read a bunch of files from disk
32     #pragma omp parallel for
33     for (size_t i = 0; i < fn.size (); ++i)
34     {
35         vector<char> a (SZ);
36         FILE *fp = fopen (fn[i].c_str (), "rb");
37         fread (&a[0], a.size (), 1, fp);
38         clog << fn[i] << " read" << endl;
39     }
40
41     return 0;
42 }
Makefile:
a:
	g++ -fopenmp -Wall -o test -g test.cc
	./test
My question is: What is different about kernel 3.6 that would cause this program to fail, but does not cause it to fail in version 3.5?
Without going through the code: if you want to set limits on your processes, have a look at cgroups for limiting resource usage.
As for the freezing: you are trying to read/write GBs of data to and from disk at once. Given the ~100 MB/s speeds of today's hard drives, I would expect a freeze at the moment the kernel decides to flush the caches to disk, which will probably occur as soon as you try to read a reasonably sized chunk of data from the disk under memory pressure (since you allocated lots of memory, the space for caches is limited).
You can try to mmap() the files or change the kernel I/O scheduler.
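For example, a minimal sketch of how the body of the read loop could use mmap() instead of fread(), so pages are faulted in on demand instead of being copied into a second SZ-byte buffer (the helper name sum_file is made up for illustration, and error handling is reduced to the essentials):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Touch every page of one file without allocating a user-space buffer.
long sum_file (const char *name)
{
    int fd = open (name, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    fstat (fd, &st);

    char *p = (char *) mmap (NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close (fd);                      // the mapping stays valid after close
    if (p == MAP_FAILED) return -1;

    long sum = 0;
    for (off_t i = 0; i < st.st_size; ++i)
        sum += p[i];                 // pages are faulted in as needed

    munmap (p, st.st_size);
    return sum;
}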
I haven't looked at your code in depth, but I noticed some bad practices (at least, I think they are):
First, the critical section inside the OpenMP loop. That is a synchronisation point, and putting it in every iteration sounds problematic to me. Since each thread must be sure no other one has entered it, the overhead that synchronisation introduces probably increases with the number of threads.
Second: I am not very used to C++, but I guess that every time vector<char> a (SZ) is executed, memory is allocated (and freed at the end of the block). Excuse me if I am wrong. Since you know the value of SZ beforehand, you'd better allocate a vector<vector<char> > with as many elements as threads before the parallel region. Then, in the parallel region, you'd make each thread access its own vector<char>, as sketched below.
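A rough sketch of that suggestion applied to the read loop of test.cc (my own naming; fn would be filled with filenames as in test.cc, SZ is a stand-in value here, and omp_get_max_threads()/omp_get_thread_num() come from <omp.h>):
#include <omp.h>
#include <cstdio>
#include <string>
#include <vector>

using namespace std;

int main ()
{
    const size_t SZ = 1 << 20;   // stand-in for the MEM/N of test.cc
    vector<string> fn;           // would be filled with filenames as in test.cc

    // Allocate one reusable buffer per thread once, before the parallel region.
    vector<vector<char> > bufs (omp_get_max_threads (), vector<char> (SZ));

    #pragma omp parallel for
    for (size_t i = 0; i < fn.size (); ++i)
    {
        vector<char> &a = bufs[omp_get_thread_num ()]; // this thread's buffer
        FILE *fp = fopen (fn[i].c_str (), "rb");
        if (fp)
        {
            fread (&a[0], a.size (), 1, fp);
            fclose (fp);
        }
    }
    return 0;
}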
There are a number of posts and references on how to get CPU Utilization using statistics in /proc/stat. However, most of them use only four of the 7+ CPU stats (user, nice, system, and idle), ignoring the remaining jiffie CPU counts present in Linux 2.6 (iowait, irq, softirq).
As an example, see Determining CPU utilization.
My question is this: Are the iowait/irq/softirq numbers also counted in one of the first four numbers (user/nice/system/idle)? In other words, does the total jiffie count equal the sum of the first four stats? Or, is the total jiffie count equal to the sum of all 7 stats? If the latter is true, then a CPU utilization formula should take all of the numbers into account, like this:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long double a[7], b[7], loadavg;
    FILE *fp;

    for (;;)
    {
        /* first sample of the 7 per-CPU counters */
        fp = fopen("/proc/stat", "r");
        fscanf(fp, "%*s %Lf %Lf %Lf %Lf %Lf %Lf %Lf",
               &a[0], &a[1], &a[2], &a[3], &a[4], &a[5], &a[6]);
        fclose(fp);

        sleep(1);

        /* second sample, one second later */
        fp = fopen("/proc/stat", "r");
        fscanf(fp, "%*s %Lf %Lf %Lf %Lf %Lf %Lf %Lf",
               &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &b[6]);
        fclose(fp);

        /* busy = everything except idle ([3]); divide by the total delta */
        loadavg = ((b[0]+b[1]+b[2]+b[4]+b[5]+b[6]) - (a[0]+a[1]+a[2]+a[4]+a[5]+a[6]))
                / ((b[0]+b[1]+b[2]+b[3]+b[4]+b[5]+b[6]) - (a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]));
        printf("The current CPU utilization is : %Lf\n", loadavg);
    }
    return 0;
}
I think iowait/irq/softirq are not counted in one of the first 4 numbers. You can see the comment of irqtime_account_process_tick in the kernel code for more detail:
(for Linux kernel 4.1.1)
2815  * Tick demultiplexing follows the order
2816  *  - pending hardirq update      <-- this is irq
2817  *  - pending softirq update      <-- this is softirq
2818  *  - user_time
2819  *  - idle_time                   <-- iowait is included here, discussed below
2820  *  - system time
2821  *    - check for guest_time
2822  *  - else account as system_time
For the idle time handling, see the account_idle_time function:
2772 /*
2773  * Account for idle time.
2774  * @cputime: the cpu time spent in idle wait
2775  */
2776 void account_idle_time(cputime_t cputime)
2777 {
2778         u64 *cpustat = kcpustat_this_cpu->cpustat;
2779         struct rq *rq = this_rq();
2780
2781         if (atomic_read(&rq->nr_iowait) > 0)
2782                 cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
2783         else
2784                 cpustat[CPUTIME_IDLE] += (__force u64) cputime;
2785 }
If the CPU is idle AND there is some IO pending, the time is counted in CPUTIME_IOWAIT. Otherwise, it is counted in CPUTIME_IDLE.
To conclude, I think the jiffies in irq/softirq should be counted as "busy" because the CPU was actually handling some IRQ or soft IRQ. On the other hand, the jiffies in "iowait" should be counted as "idle" because the CPU was not doing anything, just waiting for a pending IO to happen.
From busybox, the magic in its top applet is:
static const char fmt[] ALIGN1 = "cp%*s %llu %llu %llu %llu %llu %llu %llu %llu";
int ret;

if (!fgets(line_buf, LINE_BUF_SIZE, fp) || line_buf[0] != 'c' /* not "cpu" */)
	return 0;
ret = sscanf(line_buf, fmt,
		&p_jif->usr, &p_jif->nic, &p_jif->sys, &p_jif->idle,
		&p_jif->iowait, &p_jif->irq, &p_jif->softirq,
		&p_jif->steal);
if (ret >= 4) {
	p_jif->total = p_jif->usr + p_jif->nic + p_jif->sys + p_jif->idle
		+ p_jif->iowait + p_jif->irq + p_jif->softirq + p_jif->steal;
	/* procps 2.x does not count iowait as busy time */
	p_jif->busy = p_jif->total - p_jif->idle - p_jif->iowait;
}