How to set up OpenMP to use all hyperthreads for parallel processing? - linux

Please help me: I want to use OpenMP for parallel processing in my program with all hardware threads. I set it up as follows:
#pragma omp parallel
{
omp_set_num_threads(272);
region my_routine processing;
}
When I execute it, I use the "top" utility to check the CPU usage, and only sometimes does it reach 6800% (mostly it stays below 5500%) - it is not stable. I want it to be stable (always reaching 6800%) for the whole time my program is executing.
What am I doing wrong with OpenMP, or is there another method to use all the hardware threads?
Thanks a lot.
This is my platform:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 272
On-line CPU(s) list: 0-271
Thread(s) per core: 4
Core(s) per socket: 68
Socket(s): 1
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 87
Model name: Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
Stepping: 1
CPU MHz: 1392.507
BogoMIPS: 2799.81
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
NUMA node0 CPU(s): 0-271
NUMA node1 CPU(s):

Step 0: safety first: check with your cluster provider's HPC support team whether they consider it harmful to place such a high workload on the cluster device(s) they own and operate.
Step 1: set up a "smoke-on" flight test
prepare the lstopo command ( apt-get it or ask your admin to install it, if necessary ), create the system's NUMA-topology map as a PDF file ( see the example command after this list ) and post it here
prepare the htop command ( apt-get it or ask your admin to install it, if necessary ) and configure it via F2 Setup:
set up METERS to show CPUs (1&2/4): first half in 2 shorter columns in the left MONITOR panel,
set up METERS to show CPUs (3&4/4): second half in 2 shorter columns in the right MONITOR panel,
set up COLUMNS to show at least the { PPID, PID, TGID, CPU, CPU%, STATUS, command } column fields
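If lstopo is available, producing the topology map is a one-liner; treat this as a sketch, since exact options may differ between hwloc versions ( the output format is normally guessed from the file extension ):
lstopo numa_topology.pdf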
Step 2: with htop-monitor running, run the compiled OpenMP code
Expect something like the output below on the terminal CLI, yet the htop monitor will show the live landscape of the NUMA-CPU workloads better than any single number:
Real time: 23.027 s
User time: 45.337 s
Sys. time: 0.047 s
Exit code: 0
stdout will read something like this:
WARMUP: OpenMP thread[ 0] instantiated as thread[ 0]
WARMUP: OpenMP thread[ 3] instantiated as thread[ 3]
...
WARMUP: OpenMP thread[271] instantiated as thread[271]
my_routine(): thread[ 0] START_TIME( 2078891848 )
my_routine(): thread[ 2] START_TIME( -528891186 )
...
my_routine(): thread[ 2] ENDED_TIME( 635748478 ) sum = 1370488.801186
HOT RUN: in thread[ 2] my_routine() returned 10.915321 ....
my_routine(): thread[ 4] ENDED_TIME( -1543969584 ) sum = 1370489.030301
HOT RUN: in thread[ 4] my_routine() returned 11.133672 ....
my_routine(): thread[ 1] ENDED_TIME( -213996360 ) sum = 1370489.060176
HOT RUN: in thread[ 1] my_routine() returned 11.158897 ....
...
my_routine(): thread[ 0] ENDED_TIME( -389214506 ) sum = 1370489.079366
HOT RUN: in thread[270] my_routine() returned 11.149798 ....
my_routine(): thread[ 3] ENDED_TIME( -586400566 ) sum = 1370489.125829
HOT RUN: in thread[269] my_routine() returned 11.091430 ....
OpenMP ver(201511)...finito
mock-up source ( on TiO.run ):
#include <omp.h> // ------------------------------------ compile flags: -fopenmp -O3
#include <stdio.h>
#define MAX_COUNT 999999999
#define MAX_THREADS 272
double my_routine()
{
printf( "my_routine(): thread[%3d] START_TIME( %20d )\n", omp_get_thread_num(), omp_get_wtime() );
double temp = omp_get_wtime(),
sum = 0;
for ( int count = 0; count < MAX_COUNT; count++ )
{
sum += ( omp_get_wtime() - temp ); temp = omp_get_wtime();
}
printf( "my_routine(): thread[%3d] ENDED_TIME( %20d ) sum = %15.6f\n", omp_get_thread_num(), omp_get_wtime(), sum );
return( sum );
}
void warmUp() // -------------------------------- prevents performance skewing in-situ
{ // NOP-alike payload, yet enforces all thread-instantiations to happen
#pragma omp parallel for num_threads( MAX_THREADS )
for ( int i = 0; i < MAX_THREADS; i++ )
printf( "WARMUP: OpenMP thread[%3d] instantiated as thread[%3d]\n", i, omp_get_thread_num() );
}
int main( int argc, char **argv )
{
omp_set_num_threads( MAX_THREADS ); warmUp(); // ---------- pre-arrange all threads
#pragma omp parallel for
for ( int i = 0; i < MAX_THREADS; i++ )
printf( "HOT RUN: in thread[%3d] my_routine() returned %34.6f ....\n", omp_get_thread_num(), my_routine() );
printf( "\nOpenMP ver(%d)...finito", _OPENMP );
}
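It may also help to pin the threads explicitly before the hot run, so that all 272 hardware threads really get used. The following environment variables are standard since OpenMP 4.0; the second line is the Intel-runtime-specific alternative ( a sketch, not verified on this particular KNL box ):
OMP_NUM_THREADS=272 OMP_PLACES=threads OMP_PROC_BIND=spread ./a.out
KMP_AFFINITY=granularity=fine,compact OMP_NUM_THREADS=272 ./a.out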

I execute on CentOS 7 and am a little confused by your guide. This is my NUMA topology and the htop monitor while I run the application. You can see it only uses 1 thread per core, and even that one thread cannot reach 100%. How can I use 4 threads per core, or at least 100% of 1 thread per core?

Related

mpi_run on multicore architecture --bind-to l3 or --bind-to core

I am running a code on a 24-core architecture and would like to use one MPI rank for each set of three cores bound to an L3 cache block. So, 8 MPI ranks per socket, 16 per node, with 3 threads per rank. I think the following command line should apply:
mpirun --bind-to l3 -np 16 gmx_mpi mdrun -nt 3
--bind-to binds the MPI ranks to each L3 cache block, -np allocates 16 MPI ranks per node, and -nt sets the number of threads per MPI rank to 3. Is this the correct approach?
If the core is capable of multithreading (2 threads) is it right to write
mpirun --bind-to l3 -np 16 gmx_mpi mdrun -nt 6
--bind-to core is, I assume, binding one MPI rank per core, either with no spanning into threads, or spanning into 2 threads per core to exploit MT, e.g.
mpirun --bind-to core -np 48 gmx_mpi mdrun -nt 2
with 48 ranks, one per core, on a 2-socket platform and 2 threads per core (MT).
Would you confirm?
I always use this piece of code, that I inherited from somewhere many years ago, to print out bindings at runtime. For example, on my 4-core laptop:
dsh#e7390dh:binding$ mpicc -o bind bind.c utilities.c
dsh#e7390dh:binding$ mpirun -n 4 ./bind
Rank 2 on core 2,6 of node <e7390dh>
Rank 3 on core 3,7 of node <e7390dh>
Rank 0 on core 0,4 of node <e7390dh>
Rank 1 on core 1,5 of node <e7390dh>
i.e. each process is bound to one physical core but can run on either hypercore. If there is no binding you get a range, e.g. "on core [0-7]".
Hope this is useful.
bind.c:
#include <stdio.h>
#include <mpi.h>
void printlocation();
int main(void)
{
MPI_Init(NULL,NULL);
printlocation();
MPI_Finalize();
return 0;
}
utilities.c:
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sched.h>
#include <mpi.h>
/* Borrowed from util-linux-2.13-pre7/schedutils/taskset.c */
static char *cpuset_to_cstr(cpu_set_t *mask, char *str)
{
char *ptr = str;
int i, j, entry_made = 0;
for (i = 0; i < CPU_SETSIZE; i++) {
if (CPU_ISSET(i, mask)) {
int run = 0;
entry_made = 1;
for (j = i + 1; j < CPU_SETSIZE; j++) {
if (CPU_ISSET(j, mask)) run++;
else break;
}
if (!run)
sprintf(ptr, "%d,", i);
else if (run == 1) {
sprintf(ptr, "%d,%d,", i, i + 1);
i++;
} else {
sprintf(ptr, "%d-%d,", i, i + run);
i += run;
}
while (*ptr != 0) ptr++;
}
}
ptr -= entry_made;
*ptr = 0;
return(str);
}
void printlocation()
{
int rank, namelen;
char hnbuf[MPI_MAX_PROCESSOR_NAME];
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
memset(hnbuf, 0, sizeof(hnbuf));
MPI_Get_processor_name(hnbuf, &namelen);
cpu_set_t coremask;
char clbuf[7 * CPU_SETSIZE];
memset(clbuf, 0, sizeof(clbuf));
(void)sched_getaffinity(0, sizeof(coremask), &coremask);
cpuset_to_cstr(&coremask, clbuf);
printf("Rank %d on core %s of node <%s>\n", rank, clbuf, hnbuf);
}
the exact command seems to be --bind-to l3cache
mpirun --bind-to l3cache -np 16 gmx_mpi mdrun -nt 6
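If the MPI in use is Open MPI, adding --report-bindings to the launch command prints every rank's binding mask at startup, which gives the same information as bind.c without extra code ( a sketch; the option name may differ in other MPI implementations ):
mpirun --bind-to l3cache --report-bindings -np 16 gmx_mpi mdrun -nt 6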

openmp core assignation fails

My Centos 6 VM shows four cores when displaying the content of /proc/cpuinfo, and /sys/devices/system/cpu/online shows 0-3.
I am trying to run the following code on the core 2 and 3 using KMP_AFFINITY="explicit,proclist=[2-3]"
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
int main (int argc, char *argv[]) {
int nthreads, tid, cid;
#pragma omp parallel private(nthreads, tid, cid)
{
tid = omp_get_thread_num();
cid = sched_getcpu();
printf("Hello from thread %d on core %d\n", tid, cid);
if (tid == 0) {
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
}
}
When compiled with icc (ICC) 16.0.1 20151021, it fails to detect the available cores and executes everything on core 0.
$ OMP_NUM_THREADS=4 ./a.out
OMP: Warning #123: Ignoring invalid OS proc ID 2.
OMP: Warning #123: Ignoring invalid OS proc ID 3.
OMP: Warning #124: No valid OS proc IDs specified - not using affinity.
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Whereas gcc (GCC) 4.4.7 20120313, with GOMP_CPU_AFFINITY="2-3", executes properly on cores 2 and 3, as set.
I used strace to check what's going on under the hood, and I noticed something strange :
[...]
open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 3
read(3, "0-3\n", 8192) = 4
[...]
sched_getaffinity(0, 1048576, { 1 }) = 8
sched_setaffinity(0, 8, { 4521c26fbb1c38c1 }) = -1 EFAULT (Bad address)
[...]
Could this be a bug in Intel's OpenMP implementation?
I cannot upgrade my compiler to fix it in this case. Is it possible to use the GCC OpenMP library instead of the Intel one when compiling with icc ?
Update:
I managed to compile the code with gcc and link it against iomp using the following command:
gcc omp.c -L/opt/intel/compilers_and_libraries_2016/linux/lib/intel64_lin/ -liomp5
The execution outputs no warnings, but is still not correct:
$ OMP_NUM_THREADS=4 ./a.out
Hello from thread 0 on core 0
Number of threads = 1
Same sched_setaffinity error as previously shown.
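As a quick sanity check ( a sketch, not from the original thread ), you can print the affinity mask the process actually starts with; if the VM or a parent process restricts it to CPU 0 only, as the sched_getaffinity(0, ..., { 1 }) line in the strace output suggests, that alone would explain the warnings about invalid OS proc IDs 2 and 3:
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <stdio.h>
int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    printf("process may run on %d CPU(s):", CPU_COUNT(&mask));
    for (int i = 0; i < CPU_SETSIZE; i++)
        if (CPU_ISSET(i, &mask))
            printf(" %d", i);
    printf("\n");
    return 0;
}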

OpenMP worst performance with more threads (following openMP tutorials)

I'm starting to work with OpenMP and I follow these tutorials:
OpenMP Tutorials
I'm coding exactly what appears in the video, but instead of better performance with more threads I get worse performance. I don't understand why.
Here's my code:
#include <iostream>
#include <time.h>
#include <omp.h>
using namespace std;
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main()
{
clock_t t;
t = clock();
int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0/(double)num_steps;
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id, nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if(id == 0) nthreads = nthrds;
for(i=id, sum[id]=0.0; i < num_steps; i = i + nthrds)
{
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
for(i = 0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
t = clock() - t;
cout << "time: " << t << " miliseconds" << endl;
}
As you can see, it's exactly the same as in the video; I only added code to measure the elapsed time.
In the tutorial, the more threads are used, the better the performance.
In my case, that doesn't happen. Here are the timings I got:
1 thread: 433590 miliseconds
2 threads: 1705704 miliseconds
3 threads: 2689001 miliseconds
4 threads: 4221881 miliseconds
Why do I get this behavior?
-- EDIT --
gcc version: gcc 5.5.0
result of lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz
Stepping: 3
CPU MHz: 2594.436
CPU max MHz: 3600,0000
CPU min MHz: 800,0000
BogoMIPS: 5188.41
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
-- EDIT --
I've tried using omp_get_wtime() instead, like this:
#include <iostream>
#include <time.h>
#include <omp.h>
using namespace std;
static long num_steps = 100000000;
double step;
#define NUM_THREADS 8
int main()
{
int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0/(double)num_steps;
double start_time = omp_get_wtime();
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id, nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if(id == 0) nthreads = nthrds;
for(i=id, sum[id]=0.0; i < num_steps; i = i + nthrds)
{
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
for(i = 0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
double time = omp_get_wtime() - start_time;
cout << "time: " << time << " seconds" << endl;
}
The behavior is different, although I have some questions.
Now, if I increase the number of threads one by one (1 thread, 2 threads, 3, 4, ...), the results are basically the same as before: the performance gets worse. However, if I increase to 64 or 128 threads I do get better performance; the timing decreases from 0.44 s (for 1 thread) to 0.13 s (for 128 threads).
My question is: why don't I get the same behaviour as in the tutorial?
2 threads getting better performance than 1,
3 threads getting better performance than 2, etc.
Why do I only get better performance with a much bigger number of threads?
instead of better performances with more threads I get worse ... I don't understand why.
Well, let's make the testing a bit more systematic and repeatable, to see whether that claim holds:
// time: 1535120 milliseconds 1 thread
// time: 200679 milliseconds 1 thread -O2
// time: 191205 milliseconds 1 thread -O3
// time: 184502 milliseconds 2 threads -O3
// time: 189947 milliseconds 3 threads -O3
// time: 202277 milliseconds 4 threads -O3
// time: 182628 milliseconds 5 threads -O3
// time: 192032 milliseconds 6 threads -O3
// time: 185771 milliseconds 7 threads -O3
// time: 187606 milliseconds 16 threads -O3
// time: 187231 milliseconds 32 threads -O3
// time: 186131 milliseconds 64 threads -O3
ref.: a few sample runs of a fast mock-up on the TiO.RUN platform ... where the limited resources impose a certain glass ceiling ...
This shows the effects of the { -O2 | -O3 } compilation-mode optimisations far more than the principal degradation with a growing number of threads proposed above.
Next comes the "background" noise of the non-managed code-execution ecosystem, where the O/S will easily skew simplistic performance benchmarking.
If you are indeed interested in further details, feel free to read about the law of diminishing returns ( about the real-world composition of the [SERIAL] and [PARALLEL] parts of process scheduling ), where Dr. Gene AMDAHL initiated the principal rules for
why more threads do not deliver much better performance ( and why a more contemporary re-formulation of this law explains why more threads may even bring a negative improvement, since they accrue more expensive add-on overheads, compared with a right-tuned peak performance ).
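For reference, the classical form of Amdahl's law bounds the speed-up of a workload with parallel fraction p, run on N threads, as
S(N) = 1 / ( ( 1 - p ) + p / N )
and the overhead-strict re-formulation adds the per-thread setup and termination costs to the denominator, which is why beyond some thread count the measured times stop improving and can even grow again.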
#include <time.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
using namespace std;
static long num_steps = 100000000;
double step;
#define NUM_THREADS 7
int main()
{
clock_t t;
t = clock();
int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0 / ( double )num_steps;
omp_set_num_threads( NUM_THREADS );
// struct timespec start;
// t = clock(); // _________________________________________ BEST START HERE
// clock_gettime( CLOCK_MONOTONIC, &start ); // ____________ USING MONOTONIC CLOCK
#pragma omp parallel
{
int i,
nthrds = omp_get_num_threads(),
id = omp_get_thread_num();
double x;
if ( id == 0 ) nthreads = nthrds;
for ( i = id, sum[id] = 0.0;
i < num_steps;
i += nthrds
)
{
x = ( i + 0.5 ) * step;
sum[id] += 4.0 / ( 1.0 + x * x );
}
}
// t = clock() - t; // _____________________________________ BEST STOP HERE
// clock_gettime( CLOCK_MONOTONIC, &end ); // ______________ USING MONOTONIC CLOCK
for ( i = 0, pi = 0.0;
i < nthreads;
i++
) pi += sum[i] * step;
t = clock() - t;
// // time: 1535120 milliseconds 1 thread
// // time: 200679 milliseconds 1 thread -O2
// // time: 191205 milliseconds 1 thread -O3
printf( "time: %d milliseconds %d threads\n", // time: 184502 milliseconds 2 threads -O3
t, // time: 189947 milliseconds 3 threads -O3
NUM_THREADS // time: 202277 milliseconds 4 threads -O3
); // time: 182628 milliseconds 5 threads -O3
} // time: 192032 milliseconds 6 threads -O3
// time: 185771 milliseconds 7 threads -O3
The major problem in that version is false sharing. This is explained later in the video you started to watch. You get this when many threads are accessing data that is adjacent in memory (the sum array). The video also explains how to use padding to manually avoid this issue.
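For completeness, this is roughly what the padded version looks like ( a sketch, assuming 64-byte cache lines, i.e. 8 doubles per line, so each thread's partial sum lives on its own cache line ):
#include <omp.h>
#include <iostream>
static long num_steps = 100000000;
#define NUM_THREADS 2
#define PAD 8   // assumption: a 64-byte cache line holds 8 doubles
int main()
{
    double step = 1.0 / (double)num_steps;
    double sum[NUM_THREADS][PAD];              // one cache line per thread
    int nthreads = NUM_THREADS;
    omp_set_num_threads(NUM_THREADS);
    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;        // remember how many threads we actually got
        sum[id][0] = 0.0;
        for (long i = id; i < num_steps; i += nthrds)
        {
            double x = (i + 0.5) * step;
            sum[id][0] += 4.0 / (1.0 + x * x); // partial sums PAD doubles apart: no false sharing
        }
    }
    double pi = 0.0;
    for (int i = 0; i < nthreads; i++) pi += sum[i][0] * step;
    std::cout << "pi = " << pi << " in " << omp_get_wtime() - t0 << " s\n";
}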
That said, the idiomatic solution is to use a reduction and not even bother with the manual work sharing:
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i < num_steps; i++)
{
double x = (i+0.5)*step;
sum += 4.0/(1.0+x*x);
}
This is also explained in a later video of the series. It is much simpler than what you started with and most likely the most efficient way.
Although the presenter is certainly competent, the style of these OpenMP tutorial videos is very much bottom-up. I'm not sure that is a good educational approach. In any case you should probably watch all of the videos to know how best to use OpenMP in practice.
Why do I only get better performance with much bigger amount of threads?
This is a bit counterintuitive: you very rarely get better performance from using more OpenMP threads than hardware threads - unless this indirectly fixes another issue. In your case the large number of threads means that the sum array is spread out over a larger region of memory, so false sharing is less likely.

c++ std async : almost no effect to use several cores

This question is related to:
c++ std::async : faster on 4 cores compared to 8 cores
In the previous question, I was wondering why some code would run faster on 4 cores rather than 8 (answer: my cpu had 4 cores and 8 threads)
Now I am discovering that the code runs at more or less the same speed independently of the number of cores used.
I am on ubuntu 16.06. c++11. Intel® Core™ i7-8550U CPU @ 1.80GHz × 8
Here is the code for benchmarking computation time against the number of cores used:
#include <math.h>
#include <future>
#include <ctime>
#include <vector>
#include <iostream>
#define NB_JOBS 2000.0
#define MAX_CORES 8
// no special meaning to this function,
// just uses some CPU
static bool _expensive(int nb_jobs){
for(int job=0;job<nb_jobs;job++){
float x = 0.6;
bool b = true;
double f = 1;
for(int i=0;i<1000;i++){
if(!b) f=-1;
for(double j=1;j<2.0;j+=0.01) x+= f* pow(1.0/sin(x),j);
b = !b;
}
}
return true;
}
static double _duration(int nb_cores){
std::clock_t begin = clock();
int nb_jobs_per_core = rint ( NB_JOBS / (float)nb_cores );
std::vector < std::future<bool> > futures;
for(int i=0;i<nb_cores;i++){
futures.push_back( std::async(std::launch::async,_expensive,nb_jobs_per_core));
}
for (auto &e: futures) {
bool foo = e.get();
}
std::clock_t end = clock();
double duration = double(end - begin) / CLOCKS_PER_SEC;
return duration;
}
int main(){
for(int nb_cores=1 ; nb_cores<=MAX_CORES ; nb_cores++){
double duration = _duration(nb_cores);
std::cout << nb_cores << " threads: " << duration << "\n";
}
return 0;
}
Here is the output:
1 threads: 8.55817
2 threads: 8.76621
3 threads: 7.90191
4 threads: 8.4656
5 threads: 10.5494
6 threads: 11.6175
7 threads: 21.697
8 threads: 24.3621
Using more cores seems to have only a marginal impact.
What troubles me is that the CPU has 4 cores. So I was expecting the program to run (around) 4 times faster when using 4 threads. It does not.
Note: "htop" shows usage of virtual cores as expected by the program, i.e. first one core used at 100%, then 2, ..., and at the end 8.
If I replace:
futures.push_back( std::async(std::launch::async,[...]
by :
futures.push_back( std::async(std::launch::async|std::launch::deferred,[...]
then I get:
1 threads: 8.6459
2 threads: 8.69905
3 threads: 10.7763
4 threads: 11.4505
5 threads: 11.8426
6 threads: 10.4282
7 threads: 9.55181
8 threads: 9.05565
and htop shows only 1 virtual core being used 100% during the full duration.
Is there anything I am doing wrong?
Note: I tried on several desktops, all with various specs (number of cores and threads), and observed something similar.
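Not from the original post, but one detail worth checking in this benchmark: std::clock() measures CPU time, and on Linux that is summed over all threads of the process, so it cannot show a wall-clock speed-up from running jobs in parallel. A minimal sketch measuring both side by side with std::chrono::steady_clock ( burn() is a hypothetical stand-in workload, not the original _expensive() ):
#include <chrono>
#include <ctime>
#include <future>
#include <iostream>
#include <vector>
// hypothetical stand-in workload
static void burn()
{
    volatile double x = 0.0;
    for (long i = 1; i <= 50000000L; i++)
        x += 1.0 / (double)i;
}
int main()
{
    std::clock_t c0 = std::clock();              // CPU time, summed over all threads
    auto w0 = std::chrono::steady_clock::now();  // wall-clock time
    std::vector<std::future<void>> futures;
    for (int i = 0; i < 4; i++)                  // 4 concurrent jobs
        futures.push_back(std::async(std::launch::async, burn));
    for (auto &f : futures) f.get();
    std::clock_t c1 = std::clock();
    auto w1 = std::chrono::steady_clock::now();
    std::cout << "cpu  time: " << double(c1 - c0) / CLOCKS_PER_SEC << " s\n";
    std::cout << "wall time: " << std::chrono::duration<double>(w1 - w0).count() << " s\n";
}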

nanosleep sleep 60 microseconds too long

I have the following test, compiled with g++, in which nanosleep sleeps too long: it takes
60 microseconds to finish, whereas I expected it to cost less than 1 microsecond:
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

struct timeval startx, endx;

int main()
{
gettimeofday(&startx, NULL);
struct timespec req={0};
req.tv_sec=0;
req.tv_nsec=100 ;
nanosleep(&req,NULL) ;
gettimeofday(&endx, NULL);
printf("(%d)(%d)\n",startx.tv_sec,startx.tv_usec);
printf("(%d)(%d)\n",endx.tv_sec,endx.tv_usec);
return 0 ;
}
My environment: uname -r shows:
3.10.0-123.el7.x86_64
cat /boot/config-$(uname -r) | grep HZ
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
# CONFIG_NO_HZ_FULL_ALL is not set
CONFIG_NO_HZ=y
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_MACHZ_WDT=m
Should I do something with the HZ config so that nanosleep will do exactly what I expect?!
my cpu information :
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz
Stepping: 2
CPU MHz: 3600.015
BogoMIPS: 6804.22
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
Edit :
#ifdef SYS_gettid
pid_t tid = syscall(SYS_gettid);
printf("thread1...tid=(%d)\n",tid);
#else
#error "SYS_gettid unavailable on this system"
#endif
this will get the tid of the thread I would like to give high priority, then do
chrt -v -r -p 99 tid to achieve this goal. Thanks to Johan Boule for the
kind help, greatly appreciated!!!
Edit2 :
#ifdef SYS_gettid
const char *sched_policy[] = {
"SCHED_OTHER",
"SCHED_FIFO",
"SCHED_RR",
"SCHED_BATCH"
};
struct sched_param sp = {
.sched_priority = 99
};
pid_t tid = syscall(SYS_gettid);
printf("thread1...tid=(%d)\n",tid);
sched_setscheduler(tid, SCHED_RR, &sp);
printf("Scheduler Policy is %s.\n", sched_policy[sched_getscheduler(0)]);
#else
#error "SYS_gettid unavailable on this system"
#endif
this will do exactly what I want without the help of chrt
(This is not an answer)
For what it's worth, using the more modern functions leads to the same result:
#include <stddef.h>
#include <time.h>
#include <stdio.h>
int main()
{
struct timespec startx, endx;
clock_gettime(CLOCK_MONOTONIC, &startx);
struct timespec req={0};
req.tv_sec=0;
req.tv_nsec=100 ;
clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL);
clock_gettime(CLOCK_MONOTONIC, &endx);
printf("(%d)(%d)\n",startx.tv_sec,startx.tv_nsec);
printf("(%d)(%d)\n",endx.tv_sec,endx.tv_nsec);
return 0 ;
}
Output:
(296441)(153832940)
(296441)(153888488)
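A side note, offered as an assumption rather than something established in this thread: for a non-realtime thread the kernel also applies a "timer slack" to short sleeps, 50 microseconds by default, which matches the roughly 60-microsecond result seen above. prctl(PR_SET_TIMERSLACK, ...) shrinks the slack for the calling thread without needing SCHED_RR; a minimal sketch:
#include <stdio.h>
#include <time.h>
#include <sys/prctl.h>
int main(void)
{
    /* assumption: the default 50 us timer slack dominates the overshoot;
       request 1 ns slack for this thread (Linux >= 2.6.28) */
    prctl(PR_SET_TIMERSLACK, 1UL, 0, 0, 0);
    struct timespec t0, t1, req = { 0, 100 };   /* 100 ns requested sleep */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("slept %ld ns\n",
           (long)(t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec));
    return 0;
}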
