Using C++11 and/or C11 thread_local, should we expect any performance penalty over non-thread_local storage on x86 (32- or 64-bit) Linux, Red Hat 5 or newer, with a recent g++/gcc (say, version 4 or newer) or clang?

On Ubuntu 18.04 x86_64 with gcc-8.3 (options -pthread -m{arch,tune}=native -std=gnu++17 -g -O3 -ffast-math -falign-{functions,loops}=64 -DNDEBUG) the difference is almost imperceptible:
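#include <benchmark/benchmark.h>

struct A { static unsigned n; };
unsigned A::n = 0;
struct B { static thread_local unsigned n; };
thread_local unsigned B::n = 0;

template<class T>
void bm(benchmark::State& state) {
    for(auto _ : state)
        benchmark::DoNotOptimize(++T::n); // loop body missing in the source; assumed to increment the counter
}
BENCHMARK_TEMPLATE(bm, A);
BENCHMARK_TEMPLATE(bm, B);
BENCHMARK_MAIN();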
Run on (16 X 5000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 256 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.59, 0.49, 0.38
Benchmark        Time       CPU   Iterations
bm<A>         1.09 ns   1.09 ns    642390002
bm<B>         1.09 ns   1.09 ns    633963210
On x86_64, thread_local variables are accessed relative to the fs segment register. Instructions using this addressing mode are often 2 bytes longer, so in theory they can take more time.
On other platforms it depends on how access to thread_local variables is implemented. See "ELF Handling For Thread-Local Storage" for more details.
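As a complement to the timing above, here is a minimal sketch of what thread_local changes semantically: every thread gets its own instance, while a plain static is shared by all threads. The names shared_counter, per_thread_counter and bump_in_threads are illustrative and do not come from the benchmark.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Shared by all threads; atomic so concurrent increments are well defined.
static std::atomic<unsigned> shared_counter{0};
// One independent instance per thread (on x86_64 Linux, addressed via fs).
static thread_local unsigned per_thread_counter = 0;

// Runs `threads` threads, each incrementing both counters `iters` times.
// Returns the per-thread counter value each thread saw when it finished.
std::vector<unsigned> bump_in_threads(unsigned threads, unsigned iters) {
    assert(threads > 0);
    std::vector<unsigned> seen(threads, 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&seen, t, iters] {
            for (unsigned i = 0; i < iters; ++i) {
                ++shared_counter;
                ++per_thread_counter;
            }
            seen[t] = per_thread_counter;  // each thread writes only its own slot
        });
    }
    for (auto& th : pool) th.join();
    return seen;
}
```

Calling bump_in_threads(4, 1000) should leave every entry of the result at 1000, because no thread sees another thread's per_thread_counter, while shared_counter accumulates the increments of all threads.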


Twice as many page faults when reading from a large malloced array instead of just storing?

I am running a simple test that monitors page faults with the code below. What I don't understand is how changing a single line doubled my page-fault count.
If I use
ptr[i+4096] = 'A'
I get 25,722 page faults with the perf tool, which is what I expected,
but if I use
tmp = ptr[i+4096]
instead, the page faults double to 51,322.
I don't know how to explain it. Below is the complete code. Thanks!
#include <stdlib.h>

void do_something() {
    int i;
    char* ptr;
    char tmp;
    ptr = malloc(100*1024*1024);
    int j = 0;
    int k = 0;
    for (i = 0; i < 100*1024*1024; i+=4096) {
        //ptr[i+4096] = 'A' ;
        tmp = ptr[i+4096]; // note: reads one byte past the buffer on the last iteration
        for (j = 0 ; j < 4096; j++)
            ptr[i+j] = (char) (i & 0xff); // pagefault
    }
}

int main(int argc, char* argv[]) {
    do_something();
    return 0;
}
Machine Info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
Stepping: 2
CPU MHz: 3096.188
BogoMIPS: 6197.81
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
3.10.0-514.32.3.el7.x86_64 #1
malloc() will often satisfy requests for memory by asking the OS for new pages, e.g., via mmap. Such pages are generally allocated lazily: no actual page is allocated until the first access.
What happens then depends on the type of the first access: when you do a read first, Linux will map in a shared read-only COW page of zeros to satisfy it, and if you later write to that page it takes a second fault to allocate the private writable page.
When you just do the write first, the first step is skipped. That's the usual case, since code generally doesn't read from newly allocated memory, which has undefined contents (at least when you get it from malloc).
Note that the above is a description of how newly allocated pages work in Linux. When you use malloc there is another layer: malloc will generally try to satisfy requests from blocks the process freed earlier, rather than continually requesting new memory. When memory is re-used, it will generally already be paged in and the above won't apply. Of course, for your initial big allocation of 100 MiB there is no memory to re-use, so you can be sure the allocator is getting it from the OS.
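The read-first versus write-first behaviour can be observed directly. The sketch below is one way to do it, assuming Linux, anonymous mmap and the ru_minflt counter from getrusage; the helper names minor_faults and faults_touching are made up for this example. Exact counts depend on the kernel configuration (for instance transparent huge pages), so treat the numbers as indicative only.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

// Minor (no disk I/O) page faults taken by this process so far.
static long minor_faults() {
    struct rusage ru{};
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

// Maps `pages` fresh anonymous pages, optionally reads each page first,
// then writes each page. Returns the minor faults taken while touching
// them, or -1 if the mapping could not be created.
long faults_touching(std::size_t pages, bool read_first) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t len = pages * page;
    void* mem = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    volatile char* p = static_cast<volatile char*>(mem);

    const long before = minor_faults();
    for (std::size_t i = 0; i < len; i += page) {
        if (read_first) {
            char tmp = p[i];  // first touch is a read: may map the shared zero page
            (void)tmp;
        }
        p[i] = 'A';           // a write needs a private, writable page
    }
    const long after = minor_faults();
    munmap(mem, len);
    return after - before;
}
```

On a typical configuration one would expect faults_touching(n, true) to report roughly twice as many faults as faults_touching(n, false), matching the doubling seen with perf in the question.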

OpenMP worst performance with more threads (following openMP tutorials)

I'm starting to work with OpenMP and I follow these tutorials:
OpenMP Tutorials
I'm coding exactly what appears in the video, but instead of getting better performance with more threads, I get worse performance. I don't understand why.
Here's my code:
#include <iostream>
#include <time.h>
#include <omp.h>
using namespace std;
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main()
{
    clock_t t;
    t = clock();
    int i, nthreads; double pi, sum[NUM_THREADS];
    step = 1.0/(double)num_steps;
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if(id == 0) nthreads = nthrds;
        for(i=id, sum[id]=0.0; i < num_steps; i = i + nthrds)
        {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    for(i = 0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
    t = clock() - t;
    cout << "time: " << t << " miliseconds" << endl;
}
As you can see, it's exactly the same as in the video; I only added code to measure the elapsed time.
In the tutorial, the more threads are used, the better the performance.
In my case, that doesn't happen. Here are the timings I got:
1 thread: 433590 miliseconds
2 threads: 1705704 miliseconds
3 threads: 2689001 miliseconds
4 threads: 4221881 miliseconds
Why do I get this behavior?
-- EDIT --
gcc version: gcc 5.5.0
result of lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz
Stepping: 3
CPU MHz: 2594.436
CPU max MHz: 3600,0000
CPU min MHz: 800,0000
BogoMIPS: 5188.41
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
-- EDIT --
I've tried using omp_get_wtime() instead, like this:
#include <iostream>
#include <time.h>
#include <omp.h>
using namespace std;
static long num_steps = 100000000;
double step;
#define NUM_THREADS 8
int main()
{
    int i, nthreads; double pi, sum[NUM_THREADS];
    step = 1.0/(double)num_steps;
    double start_time = omp_get_wtime();
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if(id == 0) nthreads = nthrds;
        for(i=id, sum[id]=0.0; i < num_steps; i = i + nthrds)
        {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    for(i = 0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
    double time = omp_get_wtime() - start_time;
    cout << "time: " << time << " seconds" << endl;
}
The behavior is different now, although I still have some questions.
If I increase the number of threads one at a time (1, 2, 3, 4, ...), the results are basically the same as before and the performance gets worse. However, if I increase to 64 or 128 threads, I do get better performance: the time drops from 0.44 s (for 1 thread) to 0.13 s (for 128 threads).
My question is: why don't I see the same behaviour as in the tutorial, where
2 threads perform better than 1,
3 threads perform better than 2, etc.?
Why do I only get better performance with a much larger number of threads?
instead of better performances with more threads I get worse ... I don't understand why.
Well, let's make the testing a bit more systematic and repeatable to see whether that is the case:
// time: 1535120 milliseconds 1 thread
// time: 200679 milliseconds 1 thread -O2
// time: 191205 milliseconds 1 thread -O3
// time: 184502 milliseconds 2 threads -O3
// time: 189947 milliseconds 3 threads -O3
// time: 202277 milliseconds 4 threads -O3
// time: 182628 milliseconds 5 threads -O3
// time: 192032 milliseconds 6 threads -O3
// time: 185771 milliseconds 7 threads -O3
// time: 187606 milliseconds 16 threads -O3
// time: 187231 milliseconds 32 threads -O3
// time: 186131 milliseconds 64 threads -O3
Ref.: a few sample runs on the TiO.RUN platform as a quick mock-up, where limited resources impose a ceiling on the achievable performance.
These runs show the effect of the { -O2 | -O3 } optimisation levels far more than the degradation with a growing number of threads proposed above.
Next comes the "background" noise of an unmanaged code-execution ecosystem, where the O/S can easily skew simplistic performance benchmarking.
If you are interested in further details, read about the law of diminishing returns (for real-world compositions of the [SERIAL] and [PARALLEL] parts of process scheduling), for which Dr. Gene Amdahl laid down the principal rules explaining why more threads do not bring much better performance. A more contemporary re-formulation of this law also explains why more threads may even perform worse than a well-tuned peak, because of the add-on overheads they incur.
#include <time.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
using namespace std;
static long num_steps = 100000000;
double step;
#define NUM_THREADS 7
int main()
{
    clock_t t;
    t = clock();
    int i, nthreads; double pi, sum[NUM_THREADS];
    step = 1.0 / ( double )num_steps;
    omp_set_num_threads( NUM_THREADS );
    // struct timespec start;
    // t = clock(); // _________________________________________ BEST START HERE
    // clock_gettime( CLOCK_MONOTONIC, &start ); // ____________ USING MONOTONIC CLOCK
    #pragma omp parallel
    {
        int i,
            nthrds = omp_get_num_threads(),
            id = omp_get_thread_num();
        double x;
        if ( id == 0 ) nthreads = nthrds;
        for ( i = id, sum[id] = 0.0;
              i < num_steps;
              i += nthrds
              )
        {
            x = ( i + 0.5 ) * step;
            sum[id] += 4.0 / ( 1.0 + x * x );
        }
    }
    // t = clock() - t; // _____________________________________ BEST STOP HERE
    // clock_gettime( CLOCK_MONOTONIC, &end ); // ______________ USING MONOTONIC CLOCK
    for ( i = 0, pi = 0.0;
          i < nthreads;
          i++
          ) pi += sum[i] * step;
    t = clock() - t;
    //                                               // time: 1535120 milliseconds 1 thread
    //                                               // time: 200679 milliseconds 1 thread -O2
    //                                               // time: 191205 milliseconds 1 thread -O3
    printf( "time: %ld milliseconds %d threads\n",   // time: 184502 milliseconds 2 threads -O3
            ( long )t,                               // time: 189947 milliseconds 3 threads -O3
            NUM_THREADS                              // time: 202277 milliseconds 4 threads -O3
            );                                       // time: 182628 milliseconds 5 threads -O3
}                                                    // time: 192032 milliseconds 6 threads -O3
                                                     // time: 185771 milliseconds 7 threads -O3
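One caveat when reading the numbers above: clock() reports processor time used by the whole process, in units of CLOCKS_PER_SEC rather than milliseconds, so with several busy threads it grows with the CPU time of all of them and says little about elapsed time. A minimal sketch of measuring both side by side, using std::chrono::steady_clock for wall time, could look like this. The names Timing, time_both and integrate_pi are illustrative, and the integral is run serially here to keep the example free of OpenMP.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

struct Timing {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // processor time used by the whole process
};

// Times `work` with both a monotonic wall clock and std::clock().
template <class F>
Timing time_both(F&& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();
    Timing t;
    t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}

// Same midpoint-rule integral as in the listings above, run serially.
double integrate_pi(long num_steps) {
    const double step = 1.0 / static_cast<double>(num_steps);
    double sum = 0.0;
    for (long i = 0; i < num_steps; ++i) {
        const double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * step;
}
```

For single-threaded work the two values should be close; for a parallel region, cpu_seconds can exceed wall_seconds by up to the number of busy threads, which is why wall time is the number to compare across thread counts.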
The major problem in that version is false sharing. This is explained later in the video you started to watch. You get this when many threads are accessing data that is adjacent in memory (the sum array). The video also explains how to use padding to manually avoid this issue.
That said, the idiomatic solution is to use a reduction and not even bother with the manual work sharing:
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i < num_steps; i++)
{
    double x = (i+0.5)*step;
    sum += 4.0/(1.0+x*x);
}
This is also explained in a later video of the series. It is much simpler than what you started with and most likely the most efficient way.
Although the presenter is certainly competent, the style of these OpenMP tutorial videos is very much bottom-up. I'm not sure that is a good educational approach. In any case, you should probably watch all of the videos to learn how best to use OpenMP in practice.
Why do I only get better performance with much bigger amount of threads?
This is a bit counterintuitive: you very rarely get better performance from using more OpenMP threads than hardware threads, unless this is indirectly fixing another issue. In your case, the large number of threads means that the sum array is spread out over a larger region of memory, so false sharing is less likely.
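For comparison with the reduction, the manual padding fix mentioned earlier can be sketched as follows. This is only an illustration under stated assumptions: a 64-byte cache line (common on x86), and the hypothetical names PaddedSum, kSlots and pi_padded. It deliberately avoids omp.h calls so that it also builds without -fopenmp, in which case the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cmath>

// 64 bytes is assumed to be the cache-line size (true for most x86 CPUs).
struct alignas(64) PaddedSum {
    double value = 0.0;
};

constexpr int kSlots = 4;  // hypothetical number of partial sums

// Midpoint-rule estimate of pi; each slot accumulates into its own cache
// line, so threads working on different slots do not invalidate each
// other's lines.
double pi_padded(long num_steps) {
    const double step = 1.0 / static_cast<double>(num_steps);
    PaddedSum sums[kSlots];
    #pragma omp parallel for
    for (int slot = 0; slot < kSlots; ++slot) {
        for (long i = slot; i < num_steps; i += kSlots) {
            const double x = (i + 0.5) * step;
            sums[slot].value += 4.0 / (1.0 + x * x);
        }
    }
    double pi = 0.0;
    for (int slot = 0; slot < kSlots; ++slot) pi += sums[slot].value * step;
    return pi;
}
```

The reduction clause remains the simpler choice; the padded version mainly shows why spreading the partial sums apart removes the slowdown.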

Halide AOT for OpenCL works fine as static library but not as shared object

I am trying to compile the code below both to a static library and to an object file:
Halide::Func f("f");
Halide::Var x("x");
f(x) = x;
f.gpu_tile(x, 4);
f.bound(x, 0, 16);
Halide::Target target = Halide::get_target_from_environment();
// f.compile_to_static_library("mylib", {}, "f", target);
// f.compile_to_file("mylib", {}, "f", target);
With static linking everything works fine and the output is correct:
Halide::Buffer<int> output(16);
std::cout << output(10) << std::endl;
But when I link the object file into a shared object,
gcc -shared -pthread mylib.o -o
and open it from code (Ubuntu 16.04),
void* handle = dlopen("", RTLD_NOW);
int (*func)(halide_buffer_t*);
*(void**)(&func) = dlsym(handle, "f");
I receive a CL_INVALID_MEM_OBJECT error. Here is the debugging log:
CL: halide_opencl_init_kernels (user_context: 0x0, state_ptr: 0x7f1266b5a4e0, program: 0x7f1266957480, size: 1577
load_libopencl (user_context: 0x0)
Loaded OpenCL runtime library:
create_opencl_context (user_context: 0x0)
Got platform 'Intel(R) OpenCL', about to create context (t=6249430)
Multiple CL devices detected. Selecting the one with the most cores.
Device 0 has 20 cores
Device 1 has 4 cores
Selected device 0
device name: Intel(R) HD Graphics
device vendor: Intel(R) Corporation
device profile: FULL_PROFILE
global mem size: 1630 MB
max mem alloc size: 815 MB
local mem size: 65536
max compute units: 20
max workgroup size: 256
max work item dimensions: 3
max work item sizes: 256x256x256x0
clCreateContext -> 0x1899af0
clCreateCommandQueue 0x1a26a80
clCreateProgramWithSource -> 0x1a26ab0
clBuildProgram 0x1a26ab0 -D MAX_CONSTANT_BUFFER_SIZE=854799155 -D MAX_CONSTANT_ARGS=8
Time: 1.015832e+02 ms
CL: halide_opencl_run (user_context: 0x0, entry: kernel_f_s0_x___deprecated_block_id_x___block_id_x, blocks: 4x1x1, threads: 4x1x1, shmem: 0
clCreateKernel kernel_f_s0_x___deprecated_block_id_x___block_id_x -> Time: 1.361700e-02 ms
clSetKernelArg 0 4 [0x2e00010000000000 ...] 0
clSetKernelArg 1 8 [0x2149040 ...] 1
Mapped dev handle is: 0x2149040
Error: CL: clSetKernelArg failed: CL_INVALID_MEM_OBJECT
Aborted (core dumped)
Thank you very much for your help! Commit state c7375fa. I'm happy to provide extra information if necessary.
Solution: in this case the runtime is duplicated. Load the shared object with the RTLD_DEEPBIND flag:
void* handle = dlopen("", RTLD_NOW | RTLD_DEEPBIND);
RTLD_DEEPBIND (since glibc 2.3.4)
Place the lookup scope of the symbols in this library ahead of the global scope. This means that a self-contained library will use its own symbols in preference to global symbols with the same name contained in libraries that have already been loaded. This flag is not specified in POSIX.1-2001.
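A slightly more defensive version of the loading code might look like the sketch below. The helper name load_symbol_deepbind is made up for this example, the library path and symbol name are left as parameters because the original file name is not shown, and RTLD_DEEPBIND is a glibc extension, so this assumes Linux with glibc.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_DEEPBIND is only declared with GNU extensions
#endif
#include <cassert>
#include <cstdio>
#include <dlfcn.h>

// Opens `path` so that the library prefers its own symbols over global
// ones that are already loaded, then looks up `symbol`.
// Returns nullptr on any failure.
void* load_symbol_deepbind(const char* path, const char* symbol) {
    void* handle = dlopen(path, RTLD_NOW | RTLD_DEEPBIND);
    if (!handle) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return nullptr;
    }
    dlerror();  // clear any stale error before calling dlsym
    void* sym = dlsym(handle, symbol);
    if (const char* err = dlerror()) {
        std::fprintf(stderr, "dlsym failed: %s\n", err);
        dlclose(handle);
        return nullptr;
    }
    return sym;  // the handle is intentionally kept open while the symbol is in use
}
```

The returned pointer would then be cast to the expected signature, for example int (*)(halide_buffer_t*) for the pipeline above. On older glibc versions the program may need to be linked with -ldl, and sanitizer runtimes are known to reject RTLD_DEEPBIND.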

nanosleep sleep 60 microseconds too long

I have the following test, compiled with g++, in which nanosleep sleeps too long: it takes
60 microseconds to finish, whereas I expected it to take less than 1 microsecond:
int main()
{
    struct timeval startx, endx;
    gettimeofday(&startx, NULL);
    struct timespec req={0};
    req.tv_nsec=100 ;
    nanosleep(&req,NULL) ;
    gettimeofday(&endx, NULL);
    return 0 ;
}
My environment: uname -r shows:
cat /boot/config-$(uname -r) | grep HZ
# CONFIG_NO_HZ_IDLE is not set
# CONFIG_NO_HZ_FULL_ALL is not set
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
Should I change something in the HZ config so that nanosleep does exactly what I expect?
My CPU information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz
Stepping: 2
CPU MHz: 3600.015
BogoMIPS: 6804.22
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
Edit:
#ifdef SYS_gettid
pid_t tid = syscall(SYS_gettid);
#else
#error "SYS_gettid unavailable on this system"
#endif
This gets the tid of the thread I want to give high priority; running
chrt -v -r -p 99 tid then achieves this goal. Thanks to Johan Boule for the
kind help, much appreciated!
Edit2:
#ifdef SYS_gettid
const char *sched_policy[] = {
    "SCHED_OTHER", "SCHED_FIFO", "SCHED_RR", "SCHED_BATCH" // entries missing in the source; assumed from Linux policy numbers 0..3
};
struct sched_param sp = {
    .sched_priority = 99
};
pid_t tid = syscall(SYS_gettid);
sched_setscheduler(tid, SCHED_RR, &sp);
printf("Scheduler Policy is %s.\n", sched_policy[sched_getscheduler(0)]);
#else
#error "SYS_gettid unavailable on this system"
#endif
This does exactly what I want without the help of chrt.
(This is not an answer)
For what it's worth, using the more modern functions leads to the same result:
#include <stddef.h>
#include <time.h>
#include <stdio.h>
int main()
{
    struct timespec startx, endx;
    clock_gettime(CLOCK_MONOTONIC, &startx);
    struct timespec req={0};
    req.tv_nsec=100 ;
    clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL);
    clock_gettime(CLOCK_MONOTONIC, &endx);
    return 0 ;
}

Custom allocators vs. promises and packaged tasks

Are the allocator-taking constructors of standard promise/packaged_task supposed to use the allocator for just the state object itself, or should this be guaranteed for all (internal) related objects?
[futures.promise]: "...allocate memory for the shared state"
[futures.task.members]: "...allocate memory needed to store the internal data structures"
In particular, are the below bugs or features?
*MSVC 2013.4, Boost 1.57, short_alloc.h by Howard Hinnant
Example 1
#include <boost/thread/future.hpp>
#include "short_alloc.h"
#include <cstdio>
void *operator new( std::size_t s ) {
    printf( "alloc %Iu\n", s );
    return malloc( s );
}
void operator delete( void *p ) {
    free( p );
}
int main() {
    const int N = 1024;
    arena< N > a;
    short_alloc< int, N > al( a );
    printf( "[promise]\n" );
    auto p = boost::promise< int >( std::allocator_arg, al );
    p.set_value( 123 );
    printf( "[packaged_task]\n" );
    auto q = boost::packaged_task< int() >( std::allocator_arg, al, [] { return 123; } );
    return 0;
}
alloc 8
alloc 12
alloc 8
alloc 24
alloc 8
alloc 12
alloc 8
alloc 24
FWIW, the output with the default allocator is
alloc 144
alloc 8
alloc 12
alloc 8
alloc 16
alloc 160
alloc 8
alloc 12
alloc 8
alloc 16
Example 2
AFAICT, MSVC's std::mutex does an unavoidable heap allocation, and therefore, so does std::promise which uses it. Is this a conformant behaviour?
N.B. there are a couple of issues with your code. In C++14, if you replace operator delete(void*) then you must also replace operator delete(void*, std::size_t). You can use a feature-test macro to see if the compiler requires that:
void operator delete( void *p ) {
    free( p );
}
#if __cpp_sized_deallocation
// Also define sized-deallocation function:
void operator delete( void *p, std::size_t ) {
    free( p );
}
#endif
Secondly, the correct printf format specifier for size_t is zu, not u, so you should be using %zu.
AFAICT, MSVC's std::mutex does an unavoidable heap allocation, and therefore, so does std::promise which uses it. Is this a conformant behaviour?
It's certainly questionable whether std::mutex should use dynamic allocation. Its constructor can't, because it must be constexpr. It could delay the allocation until the first call to lock() or try_lock() but lock() doesn't list failure to acquire resources as a valid error condition, and it means try_lock() could fail to lock an uncontended mutex if it can't allocate the resources it needs. That's allowed, if you squint at it, but is not ideal.
But regarding your main question, as you quoted, the standard only says this for promise:
The second constructor uses the allocator a to allocate memory for the shared state.
That doesn't say anything about other resources needed by the promise. It's reasonable to assume that any synchronization objects like mutexes are part of the shared state, not the promise, but that wording doesn't require that the allocator is used for memory the shared state's members require, only for the memory needed by the shared state itself.
For packaged_task the wording is broader and implies that all internal state should use the allocator, although it could be argued that it means the allocator is used to obtain memory for the stored task and the shared state, but again that members of the shared state don't have to use the allocator.
In summary, I don't think the standard is 100% clear whether the MSVC implementation is allowed, but IMHO an implementation that does not need additional memory from malloc or new is better (and that's how the libstdc++ <future> implementation works).
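To check what a given standard library does, one option is to pass a counting allocator to std::promise and see how many allocations go through it. The sketch below assumes a hosted C++11 (or later) implementation; CountingAllocator, g_allocator_calls and allocations_for_promise are names invented for this example, and the count it reports is implementation-specific.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <memory>
#include <new>

// Number of allocate() calls made through any CountingAllocator.
static std::size_t g_allocator_calls = 0;

// Minimal C++11 allocator that counts allocations and forwards to operator new.
template <class T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        ++g_allocator_calls;
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

// Constructs (and destroys) a std::promise<int> with the allocator and
// reports how many allocations went through it.
std::size_t allocations_for_promise() {
    g_allocator_calls = 0;
    {
        std::promise<int> p(std::allocator_arg, CountingAllocator<int>{});
    }
    return g_allocator_calls;
}
```

Combining this with a replaced global operator new, as in the question, shows whether anything bypasses the supplied allocator: a count of at least one here together with no extra global allocations would indicate that the implementation routes all of the promise's memory through the allocator.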
