which openmp schedule am I running?

How can I check at runtime which OpenMP schedule is in effect?
I compile my code with a parallel loop and runtime scheduling:
#pragma omp parallel for schedule(runtime) collapse(2)
for (j = 1; j > -2; j -= 2) {
    for (i = 0; i < n; i++) {
        //nested loop code here
    }
}
and I specify the environment variable OMP_SCHEDULE=dynamic,50.
How can I check at runtime that my program actually picked up the OMP_SCHEDULE variable?
I am using OpenMP 3.1 with GCC 4.7.3.

I downloaded http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf, went to the section "C/C++ Stub Routines", and found this:
void omp_get_schedule(omp_sched_t *kind, int *chunk_size)
{
    *kind = omp_sched_static;
    *chunk_size = 0;
}
Then I made this test:
/*
typedef enum omp_sched_t {
    omp_sched_static = 1,
    omp_sched_dynamic = 2,
    omp_sched_guided = 3,
    omp_sched_auto = 4
} omp_sched_t;
*/
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_sched_t kind;
    int chunk;
    omp_get_schedule(&kind, &chunk);
    printf("%d %d\n", kind, chunk);
}
and compiled it with
gcc -fopenmp -O3 foo.c
and then ran
export OMP_SCHEDULE=static,50
./a.out
1 50
export OMP_SCHEDULE=dynamic,100
./a.out
2 100
Note that omp_get_schedule only reports the runtime schedule, i.e. what OMP_SCHEDULE defines. If you override the schedule in a directive, e.g.
#pragma omp parallel for schedule(static,1)
and define OMP_SCHEDULE=dynamic,100, then omp_get_schedule still reports dynamic scheduling with chunk size 100.
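For a human-readable check, here is a minimal sketch along the same lines (the name mapping simply follows the omp_sched_t enum quoted above):

#include <omp.h>
#include <stdio.h>

/* Map the omp_sched_t values (see the enum above) to printable names. */
static const char *sched_name(omp_sched_t kind) {
    switch (kind) {
        case omp_sched_static:  return "static";
        case omp_sched_dynamic: return "dynamic";
        case omp_sched_guided:  return "guided";
        case omp_sched_auto:    return "auto";
        default:                return "unknown";
    }
}

int main(void) {
    omp_sched_t kind;
    int chunk;
    omp_get_schedule(&kind, &chunk);
    printf("runtime schedule: %s, chunk size %d\n", sched_name(kind), chunk);
    return 0;
}

With OMP_SCHEDULE=dynamic,50 this should print "runtime schedule: dynamic, chunk size 50".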

Related

openmp memory leak when using g++ and Intel's libiomp5

I found this code:
#include <iostream>
#include <vector>

template<typename T>
void worknested1(std::vector<T> &x){
#if defined(_OPENMP)
#pragma omp parallel for num_threads(1)
#endif
    for (int j = 0; j < x.size(); ++j) {
        x[j] = (T)0;
    }
}

template<typename T>
void worknested0(){
#if defined(_OPENMP)
#pragma omp parallel num_threads(1)
#endif
    {
        std::vector<T> a; a.resize(100);
#if defined(_OPENMP)
#pragma omp for
#endif
        for (int i = 0; i < 10000; ++i) {
            worknested1(a);
        }
    }
}

void work(){
#if defined(_OPENMP)
#pragma omp parallel for num_threads(18)
#endif
    for (int i = 0; i < 1000000; ++i) {
        worknested0<double>();
    }
}

int main(){
    work();
    return 0;
}
to produce a nice memory leak when compiled with
g++ -Ofast -fopenmp -c test.cpp
g++ -L /opt/intel/oneapi/compiler/2022.0.2/linux/compiler/lib/intel64_lin -o exe test.o -Wl,--start-group -l iomp5 -l pthread -lm -ldl -Wl,--end-group
Depending on the number of iterations in work it will eat the whole of my 256 GB of RAM.
The g++ version is 11.2.
The problem does not occur with icpc 2021.5.0 and clang++ 13.0.1.
Further, a workaround is to change the num_threads(1) in worknested0 to num_threads(2) (sketched after the environment listing below).
Is there anything wrong in the code or is this a bug/incompatibility between g++ and Intel?
Any suggestions on how to get this going are appreciated (switching to gomp is not an option at the moment, as it appears to kill MKL performance).
OS is Arch Linux, kernel 5.16.
OMP environment is:
OMP_PLACES=cores
OMP_PROC_BIND=true
OMP_DYNAMIC=FALSE
OMP_MAX_ACTIVE_LEVELS=2147483647
OMP_NUM_THREADS=18
OMP_STACKSIZE=2000M
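For reference, the workaround only touches worknested0; sketched below (everything else stays exactly as in the code above):

template<typename T>
void worknested0(){
#if defined(_OPENMP)
// Workaround: request two threads here instead of one.
#pragma omp parallel num_threads(2)
#endif
    {
        std::vector<T> a; a.resize(100);
#if defined(_OPENMP)
#pragma omp for
#endif
        for (int i = 0; i < 10000; ++i) {
            worknested1(a);
        }
    }
}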

error: undefined reference to `sched_setaffinity' on windows xp

Basically, the code below was intended for use on Linux, and maybe that's the reason I get the error, because I'm using Windows XP, but I figure that pthreads should work just as well on both machines. I'm using gcc as my compiler and I did link with -lpthread, but I got the following error anyway.
|21|undefined reference to `sched_setaffinity'|
|30|undefined reference to `sched_setaffinity'|
If there is another method of setting the thread affinity using pthreads (on Windows), let me know. I already know about the windows.h thread affinity functions, but I want to keep things multiplatform. Thanks.
#include <stdio.h>
#include <math.h>
#include <sched.h>

double waste_time(long n)
{
    double res = 0;
    long i = 0;
    while (i < n * 200000)
    {
        i++;
        res += sqrt(i);
    }
    return res;
}

int main(int argc, char **argv)
{
    unsigned long mask = 1; /* processor 0 */

    /* bind process to processor 0 */
    if (sched_setaffinity(0, sizeof(mask), &mask) < 0) // line 21
    {
        perror("sched_setaffinity");
    }

    /* waste some time so the work is visible with "top" */
    printf("result: %f\n", waste_time(2000));

    mask = 2; /* process switches to processor 1 now */
    if (sched_setaffinity(0, sizeof(mask), &mask) < 0) // line 30
    {
        perror("sched_setaffinity");
    }

    /* waste some more time to see the processor switch */
    printf("result: %f\n", waste_time(2000));
}
sched_getaffinity() and sched_setaffinity() are strictly Linux-specific calls. Windows provides its own set of specific Win32 API calls that affect scheduling. See this answer for sample code for Windows.
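To keep the code multiplatform, the usual pattern is a small #ifdef around the two APIs. A minimal sketch (SetThreadAffinityMask is the Win32 call; the Linux branch uses the cpu_set_t interface rather than the raw unsigned long shown in the question):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for CPU_ZERO/CPU_SET and pthread_setaffinity_np on Linux */
#endif
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <pthread.h>
#include <sched.h>
#endif

/* Pin the calling thread to a single CPU; returns 0 on success. */
static int pin_to_cpu(int cpu)
{
#ifdef _WIN32
    /* Win32: the affinity is a plain bit mask. */
    return SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << cpu) ? 0 : -1;
#else
    /* Linux: build a cpu_set_t and apply it to this thread. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
#endif
}

int main(void)
{
    if (pin_to_cpu(0) != 0)
        perror("pin_to_cpu");
    /* ... run the waste_time() work from the question here, then switch with pin_to_cpu(1) ... */
    return 0;
}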

Thread for interprocess communication in OpenMP

I have an OpenMP parallelized program that looks like that:
[...]
#pragma omp parallel
{
    //initialize threads
    #pragma omp for
    for (...)
    {
        //Work is done here
    }
}
Now I'm adding MPI support. What I need is a thread that handles the communication; in my case, it calls GatherAll all the time and fills/empties a linked list for receiving/sending data from the other processes. That thread should send/receive until a flag is set. Right now there is no MPI stuff in the example; my question is about the implementation of that routine in OpenMP.
How do I implement such a thread? For example, I tried to introduce a single directive here:
[...]
int kill = 0;
#pragma omp parallel shared(kill)
{
    //initialize threads
    #pragma omp single nowait
    {
        while (!kill)
            send_receive();
    }
    #pragma omp for
    for (...)
    {
        //Work is done here
    }
    kill = 1;
}
but in this case the program gets stuck because the implicit barrier after the for-loop waits for the thread in the while-loop above.
Thank you, rugermini.
You could try adding a nowait clause to your single construct:
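For example (this is just the single block from the question with the clause spelled out):

#pragma omp single nowait
{
    while (!kill)
        send_receive();
}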
EDIT: responding to the first comment
If you enable nested parallelism for OpenMP, you might be able to achieve what you want by making two levels of parallelism. In the top level, you have two concurrent parallel sections, one for the MPI communications, the other for local computation. This last section can itself be parallelized, which gives you a second level of parallelisation. Only threads executing this level will be affected by barriers in it.
#include <iostream>
#include <omp.h>

int main()
{
    int kill = 0;
#pragma omp parallel sections
    {
#pragma omp section
        {
            while (kill == 0) {
                /* manage MPI communications */
            }
        }
#pragma omp section
        {
#pragma omp parallel
#pragma omp for
            for (int i = 0; i < 10000; ++i) {
                /* your workload */
            }
            kill = 1;
        }
    }
}
However, you must be aware that your code is going to break if you don't have at least two threads, which means you're breaking the assumption that the sequential and parallelized versions of the code should do the same thing.
It would be much cleaner to wrap your OpenMP kernel inside a more global MPI communication scheme (potentially using asynchronous communications to overlap communications with computations).
You have to be careful, because you can't just have your MPI calling thread "skip" the omp for loop; all threads in the thread team have to go through the for loop.
There are a couple of ways you could do this: with nested parallelism and tasks, you could launch one task to do the message passing and another to call a work routine which has an omp parallel for in it:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

void work(int rank) {
    const int n = 14;
#pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int tid = omp_get_thread_num();
        printf("%d:%d working on item %d\n", rank, tid, i);
    }
}

void sendrecv(int rank, int sneighbour, int rneighbour, int *data) {
    const int tag = 1;
    MPI_Sendrecv(&rank, 1, MPI_INT, sneighbour, tag,
                 data, 1, MPI_INT, rneighbour, tag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv) {
    int rank, size;
    int sneighbour;
    int rneighbour;
    int data;
    int got;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &got);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    omp_set_nested(1);

    sneighbour = rank + 1;
    if (sneighbour >= size) sneighbour = 0;
    rneighbour = rank - 1;
    if (rneighbour < 0) rneighbour = size - 1;

#pragma omp parallel
    {
#pragma omp single
        {
#pragma omp task
            {
                sendrecv(rank, sneighbour, rneighbour, &data);
                printf("Got data from %d\n", data);
            }
#pragma omp task
            work(rank);
        }
    }

    MPI_Finalize();
    return 0;
}
Alternately, you could make your omp for loop schedule(dynamic) so that the other threads can pick up some of the slack while the master thread is sending, and the master thread can pick up some work when it's done:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

void sendrecv(int rank, int sneighbour, int rneighbour, int *data) {
    const int tag = 1;
    MPI_Sendrecv(&rank, 1, MPI_INT, sneighbour, tag,
                 data, 1, MPI_INT, rneighbour, tag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv) {
    int rank, size;
    int sneighbour;
    int rneighbour;
    int data;
    int got;
    const int n = 14;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &got);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    omp_set_nested(1);

    sneighbour = rank + 1;
    if (sneighbour >= size) sneighbour = 0;
    rneighbour = rank - 1;
    if (rneighbour < 0) rneighbour = size - 1;

#pragma omp parallel
    {
#pragma omp master
        {
            sendrecv(rank, sneighbour, rneighbour, &data);
            printf("Got data from %d\n", data);
        }
#pragma omp for schedule(dynamic)
        for (int i = 0; i < n; i++) {
            int tid = omp_get_thread_num();
            printf("%d:%d working on item %d\n", rank, tid, i);
        }
    }

    MPI_Finalize();
    return 0;
}
Hmmm. If you are indeed adding MPI 'support' to your program, then you ought to be using MPI_Allgather, as MPI_Gatherall does not exist. Note that MPI_Allgather is a collective operation, that is, all processes in the communicator call it. You can't have one process gathering data while the other processes do whatever it is they do. What you could do is use MPI one-sided communications to implement your idea; this will be a little tricky, but no more than that if one process only reads the memory of other processes.
I'm puzzled by your use of the term 'thread' wrt MPI. I fear that you are confusing OpenMP and MPI, one of whose variants is called OpenMPI. Despite this name it is as different from OpenMP as chalk from cheese. MPI programs are written in terms of processes, not threads. The typical OpenMP implementation does indeed use threads, though the details are generally well-hidden from the programmer.
I'm seriously impressed that you are trying, or seem to be trying, to use MPI 'inside' your OpenMP code. This is exactly the opposite of work I do, and see others do on some seriously large computers. The standard mode for such 'hybrid' parallelisation is to write MPI programs which call OpenMP code. Many of today's very large computers comprise collections of what are, in effect, multicore boxes. A typical approach to programming one of these is to have one MPI process running on each box, and for each of those processes to use one OpenMP thread for each core in the box.
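For illustration, here is a minimal sketch of that standard hybrid mode (a generic placeholder workload, with MPI_Allgather standing in for the non-existent GatherAll): each MPI process does its node-local work with OpenMP and then joins the collective.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int provided, rank, size;
    /* MPI_THREAD_FUNNELED: only the thread that called MPI_Init_thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* OpenMP handles the node-local computation ... */
    double local = 0.0;
#pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 100000; i++)
        local += i * 1e-6;

    /* ... and MPI_Allgather, a collective that every rank must call, exchanges the results. */
    double *all = (double *)malloc(size * sizeof(double));
    MPI_Allgather(&local, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, MPI_COMM_WORLD);

    if (rank == 0)
        printf("rank 0 now holds %d partial results\n", size);

    free(all);
    MPI_Finalize();
    return 0;
}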

GLIB: g_atomic_int_get becomes NO-OP?

In a larger piece of code, I noticed that the g_atomic_* functions in glib were not doing what I expected, so I wrote this simple example:
#include <stdlib.h>
#include "glib.h"
#include "pthread.h"
#include "stdio.h"

void *set_foo(void *ptr) {
    g_atomic_int_set(((int*)ptr), 42);
    return NULL;
}

int main(void) {
    int foo = 0;
    pthread_t other;
    if (pthread_create(&other, NULL, set_foo, &foo) == 0) {
        pthread_join(other, NULL);
        printf("Got %d\n", g_atomic_int_get(&foo));
    } else {
        printf("Thread did not run\n");
        exit(1);
    }
}
When I compile this with GCC's '-E' option (stop after pre-processing), I notice that the call to g_atomic_int_get(&foo) has become:
(*(&foo))
and g_atomic_int_set(((int*)ptr), 42) has become:
((void) (*(((int*)ptr)) = (42)))
Clearly I was expecting some atomic compare and swap operations, not just simple (thread-unsafe) assignments. What am I doing wrong?
For reference my compile command looks like this:
gcc -m64 -E -o foo.E `pkg-config --cflags glib-2.0` -O0 -g foo.c
The architecture you are on does not require a memory barrier for atomic integer set/get operations, so the transformation is valid.
Here's where it's defined: http://git.gnome.org/browse/glib/tree/glib/gatomic.h#n60
This is a good thing, because otherwise you'd need to lock a global mutex for every atomic operation.
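For comparison, here is a minimal sketch of the same test with C++11 std::atomic: with relaxed ordering the compiler is likewise free to emit plain loads and stores on architectures where aligned int accesses are already atomic, and it is the join that guarantees the main thread sees the value.

#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<int> foo{0};
    std::thread other([&foo] {
        // Relaxed store: typically a plain mov on x86, no fence needed.
        foo.store(42, std::memory_order_relaxed);
    });
    other.join();  // join() synchronizes-with the thread's completion.
    std::printf("Got %d\n", foo.load(std::memory_order_relaxed));
    return 0;
}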

Crash in program using OpenMP, x64 only

The program below crashes when I build it in Release x64 (all other configurations run fine).
Am I doing it wrong or is it an OpenMP issue?
Well-grounded workarounds are highly appreciated.
To reproduce, build a project (console application) with the code below.
Build with /openmp, /GL, and one of /O1, /O2 or /Ox in the Release x64 configuration.
That is, OpenMP support and C++ optimization must be turned on. The resulting program should crash (though, of course, it should not).
#include <omp.h>
#include <vector>

class EmptyClass
{
public:
    EmptyClass() {}
};

class SuperEdge
{
public:
    SuperEdge() { mp_points[0] = NULL; mp_points[1] = NULL; }
private:
    const int* mp_points[2];
};

EmptyClass CreateEmptyClass(SuperEdge s)
{
    return EmptyClass();
}

int main(int argc, wchar_t* argv[], wchar_t* envp[])
{
    std::vector<int> v;
    long count = 1000000;
    SuperEdge edge;
#pragma omp parallel for
    for (long i = 0; i < count; ++i)
    {
        EmptyClass p = CreateEmptyClass(edge);
#pragma omp critical
        {
            v.push_back(0);
        }
    }
    return 0;
}
I think it is a bug. Looking at the ASM output with /O2, the push_back call has been optimized away and there are just a couple of reserve calls and what look like direct accesses instead. The reserve calls, however, don't appear to be in the critical section, and you end up with heap corruption. Doing a Release x64 build with /openmp /GL /Od, you will see that there is a call to push_back in the asm, between the _vcomp_enter_critsect calls, and it doesn't crash. I'd report it to MS. (Tested with VS 2010.)
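One possible workaround (a sketch, untested against that exact compiler and flag combination) is to keep push_back on thread-private vectors and merge once per thread, so the shared vector is only touched inside a single critical region per thread:

#include <omp.h>
#include <vector>

int main()
{
    const long count = 1000000;
    std::vector<int> v;
#pragma omp parallel
    {
        std::vector<int> local;              // each thread grows its own vector
#pragma omp for nowait
        for (long i = 0; i < count; ++i)
        {
            local.push_back(0);              // no shared container touched here
        }
#pragma omp critical
        {
            v.insert(v.end(), local.begin(), local.end());   // one merge per thread
        }
    }
    return 0;
}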

Resources