Memory leak when using OpenMP - multithreading

The test case below runs out of memory on 32-bit machines (throwing std::bad_alloc) in the loop that follows the "post MT section" message when OpenMP is used. If the OpenMP #pragmas are commented out, the code runs to completion, so it appears that memory allocated in parallel threads is not freed correctly and the process runs out of memory.
Is there something wrong with the memory allocation and deletion code below, or is this a bug in gcc v4.2.2 or OpenMP? I also tried gcc v4.3 and got the same failure.
#include <iostream>
#include <new>
#include <vector>
int main(int argc, char** argv)
{
std::cout << "start " << std::endl;
{
std::vector<std::vector<int*> > nts(100);
#pragma omp parallel
{
#pragma omp for
for(int begin = 0; begin < int(nts.size()); ++begin) {
for(int i = 0; i < 1000000; ++i) {
nts[begin].push_back(new int(5));
}
}
}
std::cout << " pre delete " << std::endl;
for(int begin = 0; begin < int(nts.size()); ++begin) {
for(int j = 0; j < nts[begin].size(); ++j) {
delete nts[begin][j];
}
}
}
std::cout << "post MT section" << std::endl;
{
std::vector<std::vector<int*> > nts(100);
int begin, i;
try {
for(begin = 0; begin < int(nts.size()); ++begin) {
for(i = 0; i < 2000000; ++i) {
nts[begin].push_back(new int(5));
}
}
} catch (std::bad_alloc &e) {
std::cout << e.what() << std::endl;
std::cout << "begin: " << begin << " i: " << i << std::endl;
throw;
}
std::cout << "pre delete 1" << std::endl;
for(int begin = 0; begin < int(nts.size()); ++begin) {
for(int j = 0; j < nts[begin].size(); ++j) {
delete nts[begin][j];
}
}
}
std::cout << "end of prog" << std::endl;
char c;
std::cin >> c;
return 0;
}

Changing the first OpenMP loop from 1000000 to 2000000 causes the same error. This suggests that the out-of-memory problem is related to the OpenMP stack limit.
Try setting the stack limit to unlimited in bash with
ulimit -s unlimited
You can also set the OpenMP environment variable OMP_STACKSIZE to 100MB or more.
UPDATE 1: I changed the first loop to
{
std::vector<std::vector<int*> > nts(100);
#pragma omp for schedule(static) ordered
for(int begin = 0; begin < int(nts.size()); ++begin) {
for(int i = 0; i < 2000000; ++i) {
nts[begin].push_back(new int(5));
}
}
std::cout << " pre delete " << std::endl;
for(int begin = 0; begin < int(nts.size()); ++begin) {
for(int j = 0; j < nts[begin].size(); ++j) {
delete nts[begin][j];
}
}
}
Then, I get a memory error at i=1574803 on the Main thread.
UPDATE 2: If you are using the Intel compiler, you can add the following to the top of your code and it will solve the problem (providing you have enough memory for the extra overhead).
std::cout << "Previous stack size " << kmp_get_stacksize_s() << std::endl;
kmp_set_stacksize_s(1000000000);
std::cout << "Now stack size " << kmp_get_stacksize_s() << std::endl;
UPDATE 3: For completeness, as another member mentioned: if you are performing numerical computation, it is best to preallocate everything in a single new float[1000000] instead of using OpenMP to do 1000000 separate allocations. This applies to allocating objects as well.

I have seen this issue elsewhere without OpenMP, using just pthreads. The extra memory consumption in multithreaded code appears to be typical behavior for the standard memory allocator. Switching to the Hoard allocator makes the extra memory consumption go away.

Why are you using int* as the inner vector member? That's very wasteful - you have 4 bytes (sizeof(int), strictly) of data and 2-3 times more again of heap control structure for every vector entry. Try this just using vector<int> and see if it runs better.
I'm not an OpenMP expert, but this usage seems weird in its asymmetry: you fill the vectors in a parallel section and clear them in non-parallel code. I cannot tell you whether that's wrong, but it 'feels' wrong.

Related

Threads executing an operation a second

I'm trying to create a concurrent code where I execute a function per second, this function prints a character and waits a second on that thread. The behaviour I expect is to print each character after another but this doesn't happen, instead, it prints all of the characters of the inner loop execution. I'm not sure if this is somewhat related to an I/O operation or whatnot.
I've also tried creating an array of threads, where each thread is created during the execution of the inner loop, but the behaviour is the same, even without calling join(). What might be wrong with the code?
The following code is what I've tried; I used a clock to check that it was waiting the correct amount of time.
#include <iostream>
#include <thread>
#include <chrono>
#include <string>
void print_char();
int main() {
using Timer = std::chrono::high_resolution_clock;
using te = std::chrono::duration<double>;
using s = std::chrono::seconds;
te interval;
for (int i = 0; i < 100; i++) {
auto a = Timer::now();
for (int j = 0; j < i; j++) {
std::thread t(print_char);
t.join();
}
auto b = Timer::now();
interval = b-a;
std::cout << std::chrono::duration_cast<s>(interval).count();
std::cout << std::endl;
}
return 0;
}
void print_char() {
std::cout << "*";
std::this_thread::sleep_for(std::chrono::seconds(1));
}
The behaviour I expect is to print each character after another but this doesn't happen, instead, it prints all of the characters of the inner loop execution.
You need to flush the output stream in order to see the text:
std::cout << "*" << std::flush;
std::endl calls std::flush, which is why you see whole lines displayed once the inner loop is complete. Your threads do add the '*' characters once per second; you just don't see them until the stream is flushed.
Consider the code
std::thread t(print_char);
t.join();
The first line creates and starts a thread. The second line immediately waits for the thread to end. That makes your program serial, not parallel. In fact, it's no different from calling the function directly instead of creating the thread.
If you want to have the thread operate in parallel and independently from your main thread, you should have the loop in the thread function itself instead. Perhaps something like
std::atomic<bool> keep_running{true};
void print_char() {
while (keep_running) {
std::cout << "*";
std::this_thread::sleep_for(std::chrono::seconds(1));
}
}
Then in the main function you just create the thread, and do something else until you want the thread to end.
std::thread t(print_char);
// Do something else...
keep_running = false;
t.join();
In regard to your current code, it's really no different than
for (int i = 0; i < 100; i++) {
auto a = Timer::now();
for (int j = 0; j < i; j++) {
print_char();
}
auto b = Timer::now();
interval = b-a;
std::cout << std::chrono::duration_cast<s>(interval).count();
std::cout << std::endl;
}

Posix semaphore for synchronisation between two different processes [duplicate]

According to my understanding, a semaphore should be usable across related processes without it being placed in shared memory. If so, why does the following code deadlock?
#include <iostream>
#include <semaphore.h>
#include <sys/wait.h>
using namespace std;
static int MAX = 100;
int main(int argc, char* argv[]) {
int retval;
sem_t mutex;
cout << sem_init(&mutex, 1, 0) << endl;
pid_t pid = fork();
if (0 == pid) {
// sem_wait(&mutex);
cout << endl;
for (int i = 0; i < MAX; i++) {
cout << i << ",";
}
cout << endl;
sem_post(&mutex);
} else if(pid > 0) {
sem_wait(&mutex);
cout << endl;
for (int i = 0; i < MAX; i++) {
cout << i << ",";
}
cout << endl;
// sem_post(&mutex);
wait(&retval);
} else {
cerr << "fork error" << endl;
return 1;
}
// sem_destroy(&mutex);
return 0;
}
When I run this on Gentoo/Ubuntu Linux, the parent hangs. Apparently, it did not receive the post by child. Uncommenting sem_destroy won't do any good. Am I missing something?
Update 1:
This code works
mutex = (sem_t *) mmap(NULL, sizeof(sem_t), PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0);
if (mutex == MAP_FAILED) {
perror("mmap");
exit(1);
}
Thanks,
Nilesh.
The wording in the manual page is kind of ambiguous.
If pshared is nonzero, then the semaphore is shared between processes,
and should be located in a region of shared memory.
Since a child created by fork(2) inherits its parent's memory
mappings, it can also access the semaphore.
Yes, but it still has to be in a shared region. Otherwise the memory simply gets copied with the usual CoW and that's that.
You can solve this in at least two ways:
Use sem_open("my_sem", ...)
Use shm_open and mmap to create a shared region
An excellent article on this topic, for future passers-by:
http://blog.superpat.com/2010/07/14/semaphores-on-linux-sem_init-vs-sem_open/

Addition of two arrays not working using pthread

Could someone please help me identify the issue in the following code?
Background: The test code adds two arrays, input1 and input2, and stores the results in output, using 4 threads.
The problem is that one of the threads does not do its work correctly: the output buffer randomly shows "0" for the range belonging to one of the threads. Any help is highly appreciated.
#include<iostream>
#include<pthread.h>
#include<unistd.h>
using namespace std;
int input1[1000], input2[1000], output[1000];
void* Addition(void* offset) {
int *local_offset = (int*)offset;
for(int i = ((*local_offset) * 250); i < ((*local_offset)+1)*250; ++i) {
output[i] = input1[i] + input2[i];
}
pthread_exit(0);
}
int main() {
pthread_t thread_id[4];
void* status;
fill_n(input1, 1000, 3); // input1, fill the buffer with 3
fill_n(input2, 1000, 4); // input2, fill the buffer with 4
fill_n(output, 1000, 0); // output, fill the buffer with 0
// create 4 thread with load of 250items
for(int i = 0; i < 4; ++i) {
int result = pthread_create(&thread_id[i], NULL, Addition, &i);
if(result) cout << "Thread creation failed" << endl;
}
// join the 4-threads
for(int i = 0; i < 4; ++i) {
int result = pthread_join(thread_id[i], &status);
if(result) cout << "Join failed " << i << endl;
}
// print output buffer, the output buffer not updated properly,
// noticed"0" for 1 & 2 thread randomly
for(int i =0; i < 1000; ++i)
cout << i << " " << output[i] << endl;
pthread_exit(NULL);
}
I have found the root cause of the issue.
Passing &i gives unpredictable results because every thread receives a pointer to the same loop variable, and the main thread's ++i may change i before a newly created thread reads it. Each thread needs its own dedicated memory that ++i cannot overwrite.
The problematic line is:
int result = pthread_create(&thread_id[i], NULL, Addition, &i);
To solve, replace with &shared_data[i] ....
int result = pthread_create(&thread_id[i], NULL, Addition, &shared_data[i]);
The shared_data is an array of 4 elements.. like this
int shared_data[4] = {0,1,2,3};

Lambda expressions, concurrency and static variables

As far as I know, such use of static storage within a lambda is legal. Essentially, it counts the number of entries into the closure:
#include <vector>
#include <iostream>
#include <algorithm>
#include <iterator>
typedef std::pair<int,int> mypair;
std::ostream &operator<< (std::ostream &os, mypair const &data) {
return os << "(" << data.first << ": " << data.second << ") ";
}
int main()
{
int n;
std::vector<mypair> v;
std::cin >> n;
v.resize(n); // resize, not reserve: for_each needs n existing elements to visit
std::for_each(std::begin(v), std::end(v), [](mypair& x) {
static int i = 0;
std::cin >> x.second;
x.first = i++;
});
std::for_each(std::begin(v), std::end(v), [](mypair& x) {
std::cout << x;
});
return 0;
}
Let's assume I have a container 'workers' of threads.
std::vector<std::thread> workers;
for (int i = 0; i < 5; i++) {
workers.push_back(std::thread([]()
{
std::cout << "thread #" << "start\n";
doLengthyOperation();
std::cout << "thread #" << "finish\n";
}));
}
The code in doLengthyOperation() is a contained, self-sufficient operation, akin to creating a new process.
Suppose I join the threads using for_each, and the stored variable must count the number of active tasks, not just the number of entries. What possible implementations are there for such a counter if I want to avoid relying on global variables, so that nobody else can interfere with it, and so that separate "flavors" of threads are supported automatically?
std::for_each(workers.begin(), workers.end(), [](std::thread &t)
{
t.join();
});
The surrounding scope ends soon after the threads have been started, and this may repeat: new threads can be added to the container later. That would seem to require a global variable, which I want to avoid. Moreover, the whole operation is a template.
The best way to handle this is to capture an instance of std::atomic<int> which provides a thread safe counter. Depending on the lifetime of lambdas and the surrounding scope, you may wish to capture by reference or shared pointer.
To take your example:
std::vector<std::thread> workers;
auto counter = std::make_shared<std::atomic<int>>(0);
for (int i = 0; i < 5; i++) {
workers.push_back(std::thread([counter]()
{
std::cout << "thread #" << "start\n";
(*counter)++;
doLengthyOperation();
(*counter)--;
std::cout << "thread #" << "finish\n";
}));
}

Thread running between fork and exec blocks other thread read

While studying the possibility of improving Recoll performance by using vfork() instead of fork(), I've encountered a fork() issue which I can't explain.
Recoll repeatedly execs external commands to translate files, so that's what the sample program does: it starts threads which repeatedly execute "ls" and read back the output.
The following problem is not a "real" one, in the sense that an actual program would not do what triggers the issue. I just stumbled on it while having a look at what threads were stopped or not between fork()/vfork() and exec().
When I have one of the threads busy-looping between fork() and exec(), the other thread never completes the data reading: the last read(), which should indicate eof, is blocked forever or until the other thread's looping ends (at which point everything resumes normally, which you can see by replacing the infinite loop with one which completes). While read() is blocked, the "ls" command has exited (ps shows <defunct>, a zombie).
There is a random aspect to the issue, but the sample program "succeeds" most of the time. I tested with Linux kernels 3.2.0 (Debian), 3.13.0 (Ubuntu) and 3.19 (Ubuntu). Works on a VM, but you need at least 2 procs, I could not make it work with one processor.
Here follows the sample program, I can't see what I'm doing wrong.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <memory.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <pthread.h>
#include <iostream>
using namespace std;
struct thread_arg {
int tnum;
int loopcount;
const char *cmd;
};
void* task(void *rarg)
{
struct thread_arg *arg = (struct thread_arg *)rarg;
const char *cmd = arg->cmd;
for (int i = 0; i < arg->loopcount; i++) {
pid_t pid;
int pipefd[2];
if (pipe(pipefd)) {
perror("pipe");
exit(1);
}
pid = fork();
if (pid) {
cerr << "Thread " << arg->tnum << " parent " << endl;
if (pid < 0) {
perror("fork");
exit(1);
}
} else {
// Child code. Either exec ls or loop (thread 1)
if (arg->tnum == 1) {
cerr << "Thread " << arg->tnum << " looping" <<endl;
for (;;);
//for (int cc = 0; cc < 1000 * 1000 * 1000; cc++);
} else {
cerr << "Thread " << arg->tnum << " child" <<endl;
}
close(pipefd[0]);
if (pipefd[1] != 1) {
dup2(pipefd[1], 1);
close(pipefd[1]);
}
cerr << "Thread " << arg->tnum << " child calling exec" <<
endl;
execlp(cmd, cmd, NULL);
perror("execlp");
_exit(255);
}
// Parent closes write side of pipe
close(pipefd[1]);
int ntot = 0, nread;
char buf[1000];
while ((nread = read(pipefd[0], buf, 1000)) > 0) {
ntot += nread;
cerr << "Thread " << arg->tnum << " nread " << nread << endl;
}
cerr << "Total " << ntot << endl;
close(pipefd[0]);
int status;
cerr << "Thread " << arg->tnum << " waiting for process " << pid
<< endl;
if (waitpid(pid, &status, 0) != -1) {
if (status) {
cerr << "Child exited with status " << status << endl;
}
} else {
perror("waitpid");
}
}
return 0;
}
int main(int, char **)
{
int loopcount = 5;
const char *cmd = "ls";
cerr << "cmd [" << cmd << "]" << " loopcount " << loopcount << endl;
const int nthreads = 2;
pthread_t threads[nthreads];
for (int i = 0; i < nthreads; i++) {
struct thread_arg *arg = new struct thread_arg;
arg->tnum = i;
arg->loopcount = loopcount;
arg->cmd = cmd;
int err;
if ((err = pthread_create(&threads[i], 0, task, arg))) {
cerr << "pthread_create failed, err " << err << endl;
exit(1);
}
}
void *status;
for (int i = 0; i < nthreads; i++) {
pthread_join(threads[i], &status);
if (status) {
cerr << "pthread_join: " << status << endl;
exit(1);
}
}
}
What's happening is that your pipes are getting inherited by both child processes instead of just one.
What you want to do is:
Create pipe with 2 ends
fork(), child inherits both ends of the pipe
child closes the read end, parent closes the write end
...so that the child ends up with just one end of one pipe, which is dup2()'ed to stdout.
But your threads race with each other, so what can happen is this:
Thread 1 creates pipe with 2 ends
Thread 0 creates pipe with 2 ends
Thread 1 fork()s. The child process has inherited 4 file descriptors, not 2!
Thread 1's child closes the read end of the pipe that thread 1 opened, but it keeps a reference to the read end and write end of thread 0's pipe too.
Later, thread 0 waits forever because it never gets an EOF on the pipe it is reading because the write end of that pipe is still held open by thread 1's child.
You will need to define a critical section that starts before pipe(), encloses the fork(), and ends after close() in the parent, and enter that critical section from only one thread at a time using a mutex.
