nanosleep sleeps 60 microseconds too long - Linux

I have the following test, compiled with g++, in which nanosleep sleeps too long: it takes 60 microseconds to finish, while I expected it to take less than 1 microsecond:
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

int main()
{
    struct timeval startx, endx;
    gettimeofday(&startx, NULL);
    struct timespec req = {0};
    req.tv_sec = 0;
    req.tv_nsec = 100;
    nanosleep(&req, NULL);
    gettimeofday(&endx, NULL);
    printf("(%ld)(%ld)\n", (long)startx.tv_sec, (long)startx.tv_usec);
    printf("(%ld)(%ld)\n", (long)endx.tv_sec, (long)endx.tv_usec);
    return 0;
}
My environment: uname -r shows:
3.10.0-123.el7.x86_64
cat /boot/config-$(uname -r) | grep HZ
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
# CONFIG_NO_HZ_FULL_ALL is not set
CONFIG_NO_HZ=y
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_MACHZ_WDT=m
Should I change something in the HZ config so that nanosleep does exactly what I expect?
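As a quick sanity check before touching the HZ config, the kernel-reported timer resolution can be queried directly; a minimal sketch (same headers as the test above):

#include <stdio.h>
#include <time.h>

int main()
{
    struct timespec res;
    clock_getres(CLOCK_MONOTONIC, &res);   /* resolution of the monotonic clock */
    printf("resolution: %ld s %ld ns\n", (long)res.tv_sec, res.tv_nsec);
    return 0;
}

If high-resolution timers are enabled, this typically reports 1 ns, which would suggest the 60 us overshoot comes from scheduling latency rather than from HZ.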
My CPU information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz
Stepping: 2
CPU MHz: 3600.015
BogoMIPS: 6804.22
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
Edit:
#ifdef SYS_gettid
    /* needs <unistd.h> and <sys/syscall.h> */
    pid_t tid = syscall(SYS_gettid);
    printf("thread1...tid=(%d)\n", tid);
#else
    #error "SYS_gettid unavailable on this system"
#endif
This gets the tid of the thread I want to run at high priority; then
chrt -v -r -p 99 <tid>
achieves the goal. Thanks to Johan Boule for the kind help, greatly appreciated!
Edit 2:
#ifdef SYS_gettid
    /* needs <sched.h>, <unistd.h> and <sys/syscall.h> */
    const char *sched_policy[] = {
        "SCHED_OTHER",
        "SCHED_FIFO",
        "SCHED_RR",
        "SCHED_BATCH"
    };
    struct sched_param sp = { .sched_priority = 99 };
    pid_t tid = syscall(SYS_gettid);
    printf("thread1...tid=(%d)\n", tid);
    sched_setscheduler(tid, SCHED_RR, &sp);
    printf("Scheduler Policy is %s.\n", sched_policy[sched_getscheduler(0)]);
#else
    #error "SYS_gettid unavailable on this system"
#endif
This does exactly what I want without the help of chrt.
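One caveat: SCHED_RR at priority 99 normally requires root or CAP_SYS_NICE, so it is worth checking the return value; a minimal sketch on top of the Edit 2 snippet:

if (sched_setscheduler(tid, SCHED_RR, &sp) == -1) {
    perror("sched_setscheduler");   /* typically EPERM when the capability is missing */
}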

(This is not an answer.)
For what it's worth, using the more modern functions leads to the same result:
#include <stddef.h>
#include <time.h>
#include <stdio.h>
int main()
{
    struct timespec startx, endx;
    clock_gettime(CLOCK_MONOTONIC, &startx);
    struct timespec req = {0};
    req.tv_sec = 0;
    req.tv_nsec = 100;
    clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL);
    clock_gettime(CLOCK_MONOTONIC, &endx);
    printf("(%ld)(%ld)\n", (long)startx.tv_sec, startx.tv_nsec);
    printf("(%ld)(%ld)\n", (long)endx.tv_sec, endx.tv_nsec);
    return 0;
}
Output:
(296441)(153832940)
(296441)(153888488)

Related

mpirun on multicore architecture: --bind-to l3 or --bind-to core

I am running a code on an architecture with 24 cores per socket and would like to use one MPI rank for each set of three cores bound to an L3 cache block; so 8 MPI ranks per socket, 16 per node, with 3 threads per rank. I think the following command line should apply:
mpirun --bind-to l3 -np 16 gmx_mpi mdrun -nt 3
with --bind-to binding the MPI ranks to each block of L3 cache, -np allocating 16 MPI ranks per node, and -nt setting the number of threads per MPI rank to 3. Is this the correct approach?
If the core is capable of multithreading (2 threads), is it right to write
mpirun --bind-to l3 -np 16 gmx_mpi mdrun -nt 6
--bind-to core, I assume, binds one MPI rank per core, either with no spanning into threads, or spanning into 2 threads per core to exploit MT, e.g.
mpirun --bind-to core -np 48 gmx_mpi mdrun -nt 2
with 48 ranks, one per core, on a 2-socket platform, and 2 threads per core (MT).
Would you confirm?
I always use this piece of code, which I inherited from somewhere many years ago, to print out bindings at runtime. For example, on my 4-core laptop:
dsh#e7390dh:binding$ mpicc -o bind bind.c utilities.c
dsh#e7390dh:binding$ mpirun -n 4 ./bind
Rank 2 on core 2,6 of node <e7390dh>
Rank 3 on core 3,7 of node <e7390dh>
Rank 0 on core 0,4 of node <e7390dh>
Rank 1 on core 1,5 of node <e7390dh>
i.e. each process is bound to one physical core but can run on either hyperthread. If there is no binding you get a range, e.g. "on core [0-7]".
Hope this is useful.
bind.c:
#include <stdio.h>
#include <mpi.h>
void printlocation();
int main(void)
{
    MPI_Init(NULL, NULL);
    printlocation();
    MPI_Finalize();
    return 0;
}
utilities.c:
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sched.h>
#include <mpi.h>
/* Borrowed from util-linux-2.13-pre7/schedutils/taskset.c */
static char *cpuset_to_cstr(cpu_set_t *mask, char *str)
{
    char *ptr = str;
    int i, j, entry_made = 0;
    for (i = 0; i < CPU_SETSIZE; i++) {
        if (CPU_ISSET(i, mask)) {
            int run = 0;
            entry_made = 1;
            for (j = i + 1; j < CPU_SETSIZE; j++) {
                if (CPU_ISSET(j, mask)) run++;
                else break;
            }
            if (!run)
                sprintf(ptr, "%d,", i);
            else if (run == 1) {
                sprintf(ptr, "%d,%d,", i, i + 1);
                i++;
            } else {
                sprintf(ptr, "%d-%d,", i, i + run);
                i += run;
            }
            while (*ptr != 0) ptr++;
        }
    }
    ptr -= entry_made;  /* drop the trailing comma, if any entry was written */
    *ptr = 0;
    return str;
}

void printlocation()
{
    int rank, namelen;
    char hnbuf[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(hnbuf, 0, sizeof(hnbuf));
    MPI_Get_processor_name(hnbuf, &namelen);
    cpu_set_t coremask;
    char clbuf[7 * CPU_SETSIZE];
    memset(clbuf, 0, sizeof(clbuf));
    (void)sched_getaffinity(0, sizeof(coremask), &coremask);
    cpuset_to_cstr(&coremask, clbuf);
    printf("Rank %d on core %s of node <%s>\n", rank, clbuf, hnbuf);
}
The exact option seems to be --bind-to l3cache:
mpirun --bind-to l3cache -np 16 gmx_mpi mdrun -nt 6
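To double-check what was actually applied, Open MPI's mpirun can also report the bindings itself, with no helper code needed:

mpirun --report-bindings --bind-to l3cache -np 16 gmx_mpi mdrun -nt 3

It prints one line per rank describing the cores that rank was bound to, which can be cross-checked against the output of bind.c above.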

any performance penalty to be expected with thread_local?

Using C++11 and/or C11 thread_local, should we expect any performance penalty over non-thread_local storage on x86 (32- or 64-bit) Linux, Red Hat 5 or newer, with a recent g++/gcc (say, version 4 or newer) or clang?
On Ubuntu 18.04 x86_64 with gcc-8.3 (options -pthread -m{arch,tune}=native -std=gnu++17 -g -O3 -ffast-math -falign-{functions,loops}=64 -DNDEBUG) the difference is almost imperceptible:
#include <benchmark/benchmark.h>

struct A { static unsigned n; };
unsigned A::n = 0;

struct B { static thread_local unsigned n; };
thread_local unsigned B::n = 0;

template<class T>
void bm(benchmark::State& state) {
    for(auto _ : state)
        benchmark::DoNotOptimize(++T::n);
}

BENCHMARK_TEMPLATE(bm, A);
BENCHMARK_TEMPLATE(bm, B);

BENCHMARK_MAIN();
Results:
Run on (16 X 5000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 256 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.59, 0.49, 0.38
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
bm<A> 1.09 ns 1.09 ns 642390002
bm<B> 1.09 ns 1.09 ns 633963210
On x86_64, thread_local variables are accessed relative to the fs segment register. Instructions with such an addressing mode are often 2 bytes longer, so, theoretically, they can take more time.
On other platforms it depends on how access to thread_local variables is implemented. See ELF Handling For Thread-Local Storage for more details.
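One way to see this concretely is to compare the code generated for a plain and a thread-local counter; a minimal C sketch (illustrative names, assuming gcc on x86_64 compiling an executable, where the local-exec TLS model applies):

/* tls_peek.c - inspect with: gcc -O2 -S tls_peek.c */
unsigned plain_n;
_Thread_local unsigned tls_n;   /* C11 spelling of thread_local */

void bump_plain(void) { plain_n++; }   /* typically: addl $1, plain_n(%rip)   */
void bump_tls(void)   { tls_n++;   }   /* typically: addl $1, %fs:tls_n@tpoff */

The %fs: segment-override prefix is part of what makes the thread-local encoding slightly longer.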

How to set up OpenMP to use all hyperthreads for parallel processing?

Please help me, I want to use OpenMP for parallel processing in my program with all hardware threads. I set it up as follows:
#pragma omp parallel
{
    omp_set_num_threads(272);   /* note: called inside the region, this cannot resize it */
    region my_routine processing;   /* (pseudocode for the parallel work) */
}
When I execute it, I use the top utility to check the CPU usage, and only sometimes does it reach 6800% (mostly it stays below 5500%) - it is not stable. I want it to be stable (always reaching 6800%) while my program executes.
What am I doing wrong with OpenMP, or is there another method to use all the hardware threads?
Thanks a lot.
This is my platform:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 272
On-line CPU(s) list: 0-271
Thread(s) per core: 4
Core(s) per socket: 68
Socket(s): 1
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 87
Model name: Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
Stepping: 1
CPU MHz: 1392.507
BogoMIPS: 2799.81
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
NUMA node0 CPU(s): 0-271
NUMA node1 CPU(s):
Step 0: safety first: check with your cluster provider's HPC support team that they do not consider it harmful to put such a high-level workload on their owned/operated cluster device(s).
Step 1: set up for a "smoke-on" flight test
prepare the lstopo command (apt-get it, or ask your admin to fix this, if necessary), create the system's NUMA-topology map as a PDF file (see the one-liner after this list), and post it here
prepare the htop command (apt-get it, or ask your admin to fix this, if necessary) and configure it via F2:
set up METERS to show CPUs (1&2/4): first half in 2 shorter columns in the left MONITOR panel
set up METERS to show CPUs (3&4/4): second half in 2 shorter columns in the left MONITOR panel
set up COLUMNS to show at least the { PPID, PID, TGID, CPU, CPU%, STATUS, command } column fields
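For the topology map, a possible one-liner (assuming hwloc's lstopo was built with graphical output support; the file name is illustrative):

lstopo numa-topology.pdf

lstopo infers the output format from the file extension.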
Step 2: with the htop monitor running, run the compiled OpenMP code
Expect something like this on the terminal CLI; the htop monitor will show the NUMA-CPU workloads' live landscape better than any single number:
Real time: 23.027 s
User time: 45.337 s
Sys. time: 0.047 s
Exit code: 0
stdout will read something like this:
WARMUP: OpenMP thread[ 0] instantiated as thread[ 0]
WARMUP: OpenMP thread[ 3] instantiated as thread[ 3]
...
WARMUP: OpenMP thread[271] instantiated as thread[271]
my_routine(): thread[ 0] START_TIME( 2078891848 )
my_routine(): thread[ 2] START_TIME( -528891186 )
...
my_routine(): thread[ 2] ENDED_TIME( 635748478 ) sum = 1370488.801186
HOT RUN: in thread[ 2] my_routine() returned 10.915321 ....
my_routine(): thread[ 4] ENDED_TIME( -1543969584 ) sum = 1370489.030301
HOT RUN: in thread[ 4] my_routine() returned 11.133672 ....
my_routine(): thread[ 1] ENDED_TIME( -213996360 ) sum = 1370489.060176
HOT RUN: in thread[ 1] my_routine() returned 11.158897 ....
...
my_routine(): thread[ 0] ENDED_TIME( -389214506 ) sum = 1370489.079366
HOT RUN: in thread[270] my_routine() returned 11.149798 ....
my_routine(): thread[ 3] ENDED_TIME( -586400566 ) sum = 1370489.125829
HOT RUN: in thread[269] my_routine() returned 11.091430 ....
OpenMP ver(201511)...finito
mock-up source ( on TiO.run ):
#include <omp.h> // ------------------------------------ compile flags: -fopenmp -O3
#include <stdio.h>

#define MAX_COUNT 999999999
#define MAX_THREADS 272

double my_routine()
{
    // omp_get_wtime() returns a double, so it is printed with %f
    printf( "my_routine(): thread[%3d] START_TIME( %20.6f )\n", omp_get_thread_num(), omp_get_wtime() );
    double temp = omp_get_wtime(),
           sum  = 0;
    for ( int count = 0; count < MAX_COUNT; count++ )
    {
        sum += ( omp_get_wtime() - temp ); temp = omp_get_wtime();
    }
    printf( "my_routine(): thread[%3d] ENDED_TIME( %20.6f ) sum = %15.6f\n", omp_get_thread_num(), omp_get_wtime(), sum );
    return( sum );
}

void warmUp() // -------------------------------- prevents performance skewing in-situ
{             // NOP-alike payload, yet enforces all thread-instantiations to happen
    #pragma omp parallel for num_threads( MAX_THREADS )
    for ( int i = 0; i < MAX_THREADS; i++ )
        printf( "WARMUP: OpenMP thread[%3d] instantiated as thread[%3d]\n", i, omp_get_thread_num() );
}

int main( int argc, char **argv )
{
    omp_set_num_threads( MAX_THREADS ); warmUp(); // ---------- pre-arrange all threads
    #pragma omp parallel for
    for ( int i = 0; i < MAX_THREADS; i++ )
        printf( "HOT RUN: in thread[%3d] my_routine() returned %34.6f ....\n", omp_get_thread_num(), my_routine() );
    printf( "\nOpenMP ver(%d)...finito", _OPENMP );
}
I execute on CentOS 7 and am a little confused by your guide. This is my NUMA topology and the htop monitor while I run the application. You can see it only uses 1 thread/core, and even that one thread cannot reach 100%. How can I use 4 threads/core, or 100% of 1 thread/core?

OpenMP core assignment fails

My CentOS 6 VM shows four cores in /proc/cpuinfo, and /sys/devices/system/cpu/online shows 0-3.
I am trying to run the following code on cores 2 and 3 using KMP_AFFINITY="explicit,proclist=[2-3]":
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>

int main (int argc, char *argv[]) {
    int nthreads, tid, cid;
    #pragma omp parallel private(nthreads, tid, cid) /* cid private too: each thread queries its own CPU */
    {
        tid = omp_get_thread_num();
        cid = sched_getcpu();
        printf("Hello from thread %d on core %d\n", tid, cid);
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    }
    return 0;
}
When compiled with icc (ICC) 16.0.1 20151021, it fails to detect the available cores and executes everything on core 0.
$ OMP_NUM_THREADS=4 ./a.out
OMP: Warning #123: Ignoring invalid OS proc ID 2.
OMP: Warning #123: Ignoring invalid OS proc ID 3.
OMP: Warning #124: No valid OS proc IDs specified - not using affinity.
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Whereas gcc (GCC) 4.4.7 20120313, with GOMP_CPU_AFFINITY="2-3", executes properly on cores 2 and 3, as set.
I used strace to check what's going on under the hood, and I noticed something strange:
[...]
open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 3
read(3, "0-3\n", 8192) = 4
[...]
sched_getaffinity(0, 1048576, { 1 }) = 8
sched_setaffinity(0, 8, { 4521c26fbb1c38c1 }) = -1 EFAULT (Bad address)
[...]
Could this be an error in the Intel implementation of OpenMP?
I cannot upgrade my compiler to fix it in this case. Is it possible to use the GCC OpenMP library instead of the Intel one when compiling with icc?
Update:
I managed to compile the code with gcc and link it with iomp using the following command:
gcc omp.c -L/opt/intel/compilers_and_libraries_2016/linux/lib/intel64_lin/ -liomp5
The execution outputs no warning, but is still not correct:
$ OMP_NUM_THREADS=4 ./a.out
Hello from thread 0 on core 0
Number of threads = 1
The same sched_setaffinity error as previously shown appears.
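A possible stop-gap, assuming the VM really does expose CPUs 2 and 3, is to set the affinity outside the OpenMP runtime and let the threads inherit it:

OMP_NUM_THREADS=2 taskset -c 2,3 ./a.out

This bypasses the runtime's proclist parsing entirely, so it does not explain the warnings, but it can confirm whether pinning to those cores works at all.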

Twice as many page faults when reading from a large malloced array instead of just storing?

I am doing a simple test on monitoring page faults with the code below. What I don't understand is how the simple one-line change below doubled my page-fault count.
If I use
ptr[i+4096] = 'A';
I get 25,722 page faults with the perf tool, which is what I expected,
but if I use
tmp = ptr[i+4096];
instead, the page faults double to 51,322.
I don't know how to explain it. Below is the complete code. Thanks!
#include <stdlib.h>

void do_something() {
    int i;
    char* ptr;
    char tmp;
    ptr = malloc(100*1024*1024);
    int j = 0;
    for (i = 0; i < 100*1024*1024; i+=4096) {
        //ptr[i+4096] = 'A' ;
        tmp = ptr[i+4096];  // touches the next page ahead (and reads 1 byte past the buffer on the last iteration)
        for (j = 0 ; j < 4096; j++)
            ptr[i+j] = (char) (i & 0xff); // pagefault
    }
    free(ptr);
}

int main(int argc, char* argv[]) {
    do_something();
    return 0;
}
Machine Info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
Stepping: 2
CPU MHz: 3096.188
BogoMIPS: 6197.81
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
3.10.0-514.32.3.el7.x86_64 #1
malloc() will often satisfy requests for memory by asking the OS for new pages, e.g., via mmap. Such pages are generally allocated lazily: no actual page is allocated until the first access.
What happens then depends on the type of the first access: when you read first, Linux maps in a shared read-only copy-on-write page of zeros to satisfy the fault, and if you later write, a second fault is taken to allocate the private writable page.
When you write first, that first step is skipped. This is the usual case, since code generally doesn't read from newly allocated memory, whose contents are undefined (at least when it comes from malloc).
Note that the above describes how newly allocated pages work in Linux. When you use malloc there is another layer: malloc will generally try to satisfy requests from blocks the process freed earlier, rather than continually requesting new memory. When memory is re-used, it will generally already be paged in and the above won't apply. Of course, for your initial big allocation of 100 MiB there is no memory to re-use, so you can be sure the allocator gets it from the OS.
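A minimal sketch that makes the two cases directly comparable under perf (the file name and command-line switch are illustrative, not from the original post):

/* fault_demo.c - build: gcc -O0 fault_demo.c -o fault_demo
 * compare: perf stat -e minor-faults ./fault_demo read
 *     vs.: perf stat -e minor-faults ./fault_demo write  */
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[]) {
    size_t len = 100u * 1024 * 1024;
    volatile char *p = malloc(len);
    char sink = 0;
    if (!p) return 1;
    int read_first = (argc > 1 && strcmp(argv[1], "read") == 0);
    for (size_t i = 0; i < len; i += 4096) {
        if (read_first)
            sink += p[i];   /* read first: faults in the shared zero page  */
        p[i] = 'A';         /* write: takes the fault for the private page */
    }
    free((void *)p);
    return sink;
}

The "read" run should show roughly twice the minor faults of the "write" run, matching the doubling seen in the question.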
