openmp memory leak when using g++ and Intel's libiomp5 - memory-leaks

I found this code:
#include <iostream>
#include <vector>
template<typename T>
void worknested1(std::vector<T> &x){
#if defined(_OPENMP)
#pragma omp parallel for num_threads(1)
#endif
for(int j=0;j<x.size();++j){
x[j]=(T)0;
}
}
template<typename T>
void worknested0(){
#if defined(_OPENMP)
#pragma omp parallel num_threads(1)
#endif
{
std::vector<T> a;a.resize(100);
#if defined(_OPENMP)
#pragma omp for
#endif
for(int i=0;i<10000;++i){
worknested1(a);
}
}
};
void work(){
#if defined(_OPENMP)
#pragma omp parallel for num_threads(18)
#endif
for(int i=0;i<1000000;++i){
worknested0<double>();
}
}
int main(){
work();
return(0);
}
to produce a nice memory leak when compiled with
g++ -Ofast -fopenmp -c test.cpp
g++ -L /opt/intel/oneapi/compiler/2022.0.2/linux/compiler/lib/intel64_lin -o exe test.o -Wl,--start-group -l iomp5 -l pthread -lm -ldl -Wl,--end-group
Depending on the number of iterations in work, it will eat all 256 GB of my RAM.
The g++ version is 11.2.
The problem does not occur with icpc 2021.5.0 and clang++ 13.0.1.
Further, changing num_threads(1) in worknested0 to num_threads(2) works around the problem.
Is there anything wrong in the code or is this a bug/incompatibility between g++ and Intel?
Any suggestions on how to get this working are appreciated (switching to gomp is not an option at the moment, as it appears to kill MKL performance).
OS is Arch Linux, kernel 5.16.
OMP environment is:
OMP_PLACES=cores
OMP_PROC_BIND=true
OMP_DYNAMIC=FALSE
OMP_MAX_ACTIVE_LEVELS=2147483647
OMP_NUM_THREADS=18
OMP_STACKSIZE=2000M

Related

which openmp schedule am I running?

How can I check the OpenMP schedule at runtime?
I compile my code with a parallel loop and a runtime schedule
#pragma omp parallel for schedule(runtime) collapse(2)
for(j=1;j>-2;j-=2){
for(i=0;i<n;i++){
//nested loop code here
}
}
and I specify the environment variable OMP_SCHEDULE=dynamic,50.
How can I check at runtime that my program actually picked up the OMP_SCHEDULE variable?
I am using OpenMP 3.1 with gcc 4.7.3.
I downloaded http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf
Then went to the section "C/C++ Stub Routines" and found this
void omp_get_schedule(omp_sched_t *kind, int *chunk_size)
{
*kind = omp_sched_static;
*chunk_size = 0;
}
Then made this test
/*
typedef enum omp_sched_t {
omp_sched_static = 1,
omp_sched_dynamic = 2,
omp_sched_guided = 3,
omp_sched_auto = 4
} omp_sched_t;
*/
#include <omp.h>
#include <stdio.h>
int main(void) {
omp_sched_t kind;
int chunk;
omp_get_schedule(&kind, &chunk);
printf("%d %d\n", kind, chunk);
}
and compiled
gcc -fopenmp -O3 foo.c
and then
export OMP_SCHEDULE=static,50
./a.out
1 50
export OMP_SCHEDULE=dynamic,100
./a.out
2 100
Note that omp_get_schedule only reports the schedule that applies to loops declared with schedule(runtime), which is what OMP_SCHEDULE sets. If you specify the schedule explicitly in the pragma, e.g.
#pragma omp parallel for schedule(static,1)
and define OMP_SCHEDULE=dynamic,100, then omp_get_schedule still reports dynamic scheduling with chunk size 100, even though that loop itself runs with static,1.
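The check above can be wrapped in a small helper that prints a readable name instead of the raw enum value. This is only a sketch: schedule_name and runtime_schedule are hypothetical helper names, the numeric values are taken from the enum quoted above, and masking off the top bit assumes that newer runtimes may use it as a "monotonic" modifier flag.

```cpp
#include <string>
#if defined(_OPENMP)
#include <omp.h>
#endif

// Map the numeric schedule kind (values as in the enum quoted above) to a name.
std::string schedule_name(unsigned kind) {
    // Assumption: newer runtimes may set the top bit as a modifier flag; ignore it.
    switch (kind & 0x7fffffffu) {
        case 1: return "static";
        case 2: return "dynamic";
        case 3: return "guided";
        case 4: return "auto";
        default: return "unknown";
    }
}

// Describe the schedule that schedule(runtime) loops would currently use.
std::string runtime_schedule() {
#if defined(_OPENMP)
    omp_sched_t kind;
    int chunk = 0;
    omp_get_schedule(&kind, &chunk);
    return schedule_name(static_cast<unsigned>(kind)) + "," + std::to_string(chunk);
#else
    return "OpenMP disabled";  // compiled without -fopenmp
#endif
}
```

Printing runtime_schedule() once at program start (built with -fopenmp) should show whether the OMP_SCHEDULE setting was picked up.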

how to implement POSIX threads ( pthread.h ) on fedora 9

I need to use pthreads, but it seems that I do not have it on my Fedora system and I cannot find out how to install it.
Thanks
Fedora 9 uses Linux Kernel version 2.6 and this version is fully compatible with libc 2.3.2. This libc contains the pthread.h header.
Check this implementation example.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h> /* for exit() */
#define NUM_THREADS 5
void *PrintHello(void *threadid)
{
long tid;
tid = (long)threadid;
printf("Hello World! It's me, thread #%ld!\n", tid);
pthread_exit(NULL);
}
int main (int argc, char *argv[])
{
pthread_t threads[NUM_THREADS];
int rc;
long t;
for(t=0; t<NUM_THREADS; t++){
printf("In main: creating thread %ld\n", t);
rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
if (rc){
printf("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
}
/* Last thing that main() should do */
pthread_exit(NULL);
}
And compile with:
gcc program.c -o program -lpthread
The pthread.h header is provided by the glibc headers, so make sure they are installed before compiling your application.
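To confirm that the system itself reports POSIX threads support, independent of whether the header is installed, the sysconf query below may help. It is a minimal sketch: posix_threads_version and report_pthread_support are hypothetical helper names, and the check only covers runtime support, not the presence of the development headers.

```cpp
#include <cstdio>
#include <unistd.h>

// Return the POSIX threads option version reported by the system,
// or -1 if the option is not supported.
long posix_threads_version() {
    return sysconf(_SC_THREADS);
}

// Print a short human-readable summary of the result.
void report_pthread_support() {
    long version = posix_threads_version();
    if (version > 0)
        std::printf("POSIX threads supported (option version %ld)\n", version);
    else
        std::printf("POSIX threads not reported as supported\n");
}
```

If this compiles and reports support but pthread.h is still not found, the missing piece is most likely the header package rather than the library.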

C++ 11 threading error

I have a C++11 program that gives me this error:
terminate called after throwing an instance of 'std::system_error'
what(): Operation not permitted
Code:
const int popSize=100;
void initializePop(mt3dSet mt3dPop[], int index1, int index2, std::string ssmName, std::string s1, std::string s2, std::string s3, std::string s4, std::string mt3dnam, std::string obf1, int iNSP, int iNRM, int iNCM, int iNLY, int iOPT, int iNPS, int iNWL, int iNRO, int ssmPosition, int obsPosition ){
if((index1 >= index2)||index1<0||index2>popSize){
std::cout<<"\nInitializing population...\nIndex not valid..\nQuitting...\n";
exit(1);
}
for(int i=index1; i<index2; i++){
mt3dPop[i].setSSM(ssmName, iNSP, iNRM, iNCM, iNLY);
mt3dPop[i].setNam(toString(s1,s3,i));
mt3dPop[i].setObsName(toString(s1,s4,i));
mt3dPop[i].setSsmName(toString(s1,s2,i));
mt3dPop[i].getSSM().generateFl(toString(s1,s2,i),iOPT,iNPS);
mt3dPop[i].generateNam(mt3dnam, ssmPosition, obsPosition);
mt3dPop[i].setFitness(obf1, iNWL, iNRO);
}}
void runPackage(ifstream& inFile){
//all variables/function parameters for function call are read from inFile
unsigned int numThreads = std::thread::hardware_concurrency();// =4 in my computer
std::vector<std::thread> vt(numThreads-1);//three threads
for(int j=0; j<numThreads-1; j++){
vt[j]= std::thread(initializePop,mt3dPop,j*popSize/numThreads, (j+1)*popSize/numThreads, ssmName, s1,s2, s3, s4, mt3dnam,obf1,iNSP, iNRM, iNCM, iNLY, iOPT, iNPS, iNWL, iNRO, ssmPosition, obsPosition ); //0-24 in thread 1, 25-49 in thread 2, 50-74 in thread 3
}
//remaining 75 to 99 in main thread
initializePop(mt3dPop,(numThreads-1)*popSize/numThreads, popSize, ssmName, s1,s2, s3, s4, mt3dnam,obf1,iNSP, iNRM, iNCM, iNLY, iOPT, iNPS, iNWL, iNRO, ssmPosition, obsPosition);
for(int j=0; j<numThreads-1; j++){
vt[j].join();
}}
What does the error mean and how do I fix it?
You need to link correctly, and compile with -std=c++11 - see this example.
I'm guessing you had the same problem I did: I passed -pthread and -std=c++11 only when compiling, not when linking. Both steps need them: compile with -std=c++11, and pass -std=c++11 and -pthread again when linking.
Probably you want to do something like this:
g++ -c <input_files> -std=c++11
then
g++ -o a.out <input_files> -std=c++11 -pthread
... at least I think that's right. (Someone to confirm?)
How to reproduce these errors:
#include <iostream>
#include <stdlib.h>
#include <string>
using namespace std;
void task1(std::string msg){
cout << "task1 says: " << msg;
}
int main() {
std::thread t1(task1, "hello");
t1.join();
return 0;
}
Compile and run:
el@defiant ~/foo4/39_threading $ g++ -o s s.cpp
s.cpp: In function ‘int main()’:
s.cpp:9:3: error: ‘thread’ is not a member of ‘std’
s.cpp:9:15: error: expected ‘;’ before ‘t1’
You forgot to #include <thread>. Include it and try again:
#include <iostream>
#include <stdlib.h>
#include <string>
#include <thread>
using namespace std;
void task1(std::string msg){
cout << "task1 says: " << msg;
}
int main() {
std::thread t1(task1, "hello");
t1.join();
return 0;
}
Compile and run:
el@defiant ~/foo4/39_threading $ g++ -o s s.cpp -std=c++11
el@defiant ~/foo4/39_threading $ ./s
terminate called after throwing an instance of 'std::system_error'
what(): Operation not permitted
Aborted (core dumped)
This is the error from the question. It occurs because -pthread was not specified when compiling and linking. Add it:
el@defiant ~/foo4/39_threading $ g++ -o s s.cpp -pthread -std=c++11
el@defiant ~/foo4/39_threading $ ./s
task1 says: hello
Now it works.
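To get a clearer diagnostic than an abort, the thread creation can be wrapped in a try/catch for std::system_error. This is a sketch only: try_spawn_thread is a hypothetical helper, not part of the question's code, and it merely reports the failure rather than fixing the missing -pthread flag.

```cpp
#include <string>
#include <system_error>
#include <thread>

// Try to run a trivial task on a new thread.
// Returns an empty string on success, or a description of the failure.
std::string try_spawn_thread() {
    bool ran = false;
    try {
        std::thread t([&ran] { ran = true; });
        t.join();  // join() synchronizes, so reading `ran` afterwards is safe
    } catch (const std::system_error& e) {
        // This is the exception that otherwise terminates the program.
        return std::string("std::system_error: ") + e.what() +
               " (code " + std::to_string(e.code().value()) + ")";
    }
    return ran ? "" : "thread did not run";
}
```

Calling this once at startup and printing a non-empty result makes a build that lacks -pthread easy to spot.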

Segmentation fault when calling backtrace() on Linux x86

I am attempting to do the following: write a wrapper for the pthreads library that will log some information whenever one of its APIs is called.
One piece of info I would like to record is the stack trace.
Below is the minimal snippet from the original code that can be compiled and run AS IS.
Initializations (file libmutex.c):
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <dlfcn.h>
static int (*real_mutex_lock)(pthread_mutex_t *) __attribute__((__may_alias__));
static void *pthread_libhandle;
#ifdef _BIT64
#define PTHREAD_PATH "/lib64/libpthread.so.0"
#else
#define PTHREAD_PATH "/lib/libpthread.so.0"
#endif
static inline void load_real_function(char* function_name, void** real_func) {
char* msg;
*(void**) (real_func) = dlsym(pthread_libhandle, function_name);
msg = dlerror();
if (msg != NULL)
printf("init: real_%s load error %s\n", function_name, msg);
}
void __attribute__((constructor)) my_init(void) {
printf("init: trying to dlopen '%s'\n", PTHREAD_PATH);
pthread_libhandle = dlopen(PTHREAD_PATH, RTLD_LAZY);
if (pthread_libhandle == NULL) {
fprintf(stderr, "%s\n", dlerror());
exit(EXIT_FAILURE);
}
load_real_function("pthread_mutex_lock", (void**) &real_mutex_lock);
}
The wrapper and the call to backtrace.
I have chopped as much as possible from the methods, so yes, I know that I never call the original pthread_mutex_lock for example.
void my_backtrace(void) {
#define SIZE 100
void *buffer[SIZE];
int nptrs;
nptrs = backtrace(buffer, SIZE);
printf("backtrace() returned %d addresses\n", nptrs);
}
int pthread_mutex_lock(pthread_mutex_t *mutex) {
printf("In pthread_mutex_lock\n"); fflush(stdout);
my_backtrace();
return 0;
}
To test this I use this binary (file tst_mutex.c):
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
int main (int argc, char *argv[]) {
pthread_mutex_t x;
printf("Before mutex\n"); fflush(stdout);
pthread_mutex_lock(&x);
printf("after mutex\n");fflush(stdout);
return 0;
}
Here is the way all this is compiled:
rm -f *.o *.so tst_mutex
cc -Wall -D_BIT64 -c -m64 -fPIC libmutex.c
cc -m64 -o libmutex.so -shared -fPIC -ldl -lpthread libmutex.o
cc -Wall -m64 tst_mutex.c -o tst_mutex
and run
LD_PRELOAD=$(pwd)/libmutex.so ./tst_mutex
This crashes with segmentation fault on Linux x86.
On Linux PPC everything works flawlessly.
I have tried a few versions of GCC compilers, GLIBC libraries and Linux distros - all fail.
The output is
init: trying to dlopen '/lib64/libpthread.so.0'
Before mutex
In pthread_mutex_lock
In pthread_mutex_lock
In pthread_mutex_lock
...
...
./run.sh: line 1: 25023 Segmentation fault LD_PRELOAD=$(pwd)/libmutex.so ./tst_mutex
suggesting that there is a recursion here.
I have looked at the source code for backtrace() - there is no call in it to locking mechanism. All it does is a simple walk over the stack frame linked list.
I have also checked the library code with objdump, but that hasn't revealed anything out of the ordinary.
What is happening here?
Any solution/workaround?
Oh, and maybe the most important thing. This only happens with the pthread_mutex_lock function!!
Printing the stack from any other overridden pthread_* function works just fine ...
It is a stack overflow, caused by an endless recursion (as remarked by @Chris Dodd).
The backtrace() function makes different system calls depending on whether the program was linked against the pthread library, even if the program never calls a pthread function explicitly.
Here is a simple program that uses the backtrace() function and does not use any pthread function.
#include <stdio.h>
#include <stdlib.h>
#include <execinfo.h>
int main(void)
{
void* buffer[100];
int num_ret_addr;
num_ret_addr=backtrace(buffer, 100);
printf("returned number of addr %d\n", num_ret_addr);
return 0;
}
Let's compile it without linking against pthread and inspect the program's system calls with the strace utility. No mutex-related system call appears in the output.
$ gcc -o backtrace_no_thread backtrace.c
$ strace -o backtrace_no_thread.out ./backtrace_no_thread
Now let's compile the same code, linking it against the pthread library, run strace and look at its output.
$ gcc -o backtrace_with_thread backtrace.c -lpthread
$ strace -o backtrace_with_thread.out ./backtrace_with_thread
This time the output contains mutex-related system calls (their names may depend on the platform). Here is a fragment of the strace output file obtained on an x86 Linux machine.
futex(0x3240553f80, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x324480d350, FUTEX_WAKE_PRIVATE, 2147483647) = 0
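One common way to break this kind of recursion is a per-thread reentrancy guard around the backtrace() call. The answer above does not propose this; it is a sketch under the assumption that thread-local storage is usable inside the preloaded library, and guarded_backtrace and in_trace_hook are hypothetical names.

```cpp
#include <execinfo.h>

// Per-thread flag: true while this thread is already inside the tracing hook.
static thread_local bool in_trace_hook = false;

// Capture a backtrace unless this thread is already doing so.
// Returns the number of frames captured, or -1 when called re-entrantly
// (for example from a lock taken inside backtrace() itself).
int guarded_backtrace(void** buffer, int size) {
    if (in_trace_hook) return -1;  // nested call: skip tracing
    in_trace_hook = true;
    int frames = backtrace(buffer, size);
    in_trace_hook = false;
    return frames;
}
```

In the wrapper, pthread_mutex_lock would call guarded_backtrace instead of backtrace and, on a -1 result, skip the logging and go straight to the real function.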

GLIB: g_atomic_int_get becomes NO-OP?

In a larger piece of code, I noticed that the g_atomic_* functions in glib were not doing what I expected, so I wrote this simple example:
#include <stdlib.h>
#include "glib.h"
#include "pthread.h"
#include "stdio.h"
void *set_foo(void *ptr) {
g_atomic_int_set(((int*)ptr), 42);
return NULL;
}
int main(void) {
int foo = 0;
pthread_t other;
if (pthread_create(&other, NULL, set_foo, &foo)== 0) {
pthread_join(other, NULL);
printf("Got %d\n", g_atomic_int_get(&foo));
} else {
printf("Thread did not run\n");
exit(1);
}
}
When I compile this with GCC's '-E' option (stop after pre-processing), I notice that the call to g_atomic_int_get(&foo) has become:
(*(&foo))
and g_atomic_int_set(((int*)ptr), 42) has become:
((void) (*(((int*)ptr)) = (42)))
Clearly I was expecting some atomic compare and swap operations, not just simple (thread-unsafe) assignments. What am I doing wrong?
For reference my compile command looks like this:
gcc -m64 -E -o foo.E `pkg-config --cflags glib-2.0` -O0 -g foo.c
The architecture you are on does not require a memory barrier for atomic integer set/get operations, so the transformation is valid.
Here's where it's defined: http://git.gnome.org/browse/glib/tree/glib/gatomic.h#n60
This is a good thing, because otherwise you'd need to lock a global mutex for every atomic operation.
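The same idea can be illustrated with C++ std::atomic in place of GLib. This is an analogue, not GLib's implementation: on platforms where atomic<int> is lock-free, a relaxed load or store of an int typically compiles to an ordinary memory access. The names shared_value, set_value, get_value and int_is_lock_free are hypothetical.

```cpp
#include <atomic>

std::atomic<int> shared_value{0};

// Relaxed store and load: each access is atomic, with no extra ordering guarantees.
void set_value(int v) { shared_value.store(v, std::memory_order_relaxed); }
int get_value() { return shared_value.load(std::memory_order_relaxed); }

// True when the platform implements atomic<int> without a lock.
bool int_is_lock_free() { return shared_value.is_lock_free(); }
```

When int_is_lock_free() returns true, no mutex is involved in either operation, which matches the point made above.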
