Mpiexec fails to terminate when program ends - pbs

I am running an MPI program on a cluster. When the program ends, the job does not, so I have to wait for it to time out.
I am not sure how to debug this. I checked that the program reaches the finalize statement in MPI, and it does. I am using the Elemental library.
Final lines of the program:
if (grid.Rank() == 0) std::cout << "Finalize" << std::endl;
std::string message = std::string("rank_") +
std::to_string(mpi::Rank(mpi::COMM_WORLD)) + "_a";
std::cout << message;
Finalize();
message = message + "b";
std::cout << message;
mpi::Finalize();
message = message + "c";
std::cout << message;
return 0;
The output is:
Finalize
rank_0_arank_0_abrank_0_abcmpiexec: killall: caught signal 15 (Terminated).
mpiexec: kill_tasks: killing all tasks.
mpiexec: wait_tasks: waiting for taub205.
mpiexec: killall: caught signal 15 (Terminated).
=>> PBS: job killed: walltime 801 exceeded limit 780
----------------------------------------
Begin Torque Epilogue (Tue Nov 4 16:15:19 2014)
Job ID: ***
Username: ***
Group: ***
Job Name: mpi_test1
Session: 11270
Limits:
ncpus=1,neednodes=1:ppn=6:m24G:taub,nodes=1:ppn=6:m24G:taub,walltime=00:13:00
Resources: cput=00:02:12,mem=429524kb,vmem=773600kb,walltime=00:13:21
Job Queue: secondary
Account: ***
Nodes: taub205
End Torque Epilogue
----------------------------------------
Running these modules on https://campuscluster.illinois.edu/hardware/#taub
> module list
Currently Loaded Modulefiles:
1) torque/4.2.9 5) gcc/4.7.1
2) moab/7.2.9 6) mvapich2/2.0b-gcc-4.7.1
3) env/taub 7) mvapich2/mpiexec
4) blas 8) lapack

Related

pthread_cond_signal doesn't unblock pthread_cond_wait thread immediately

I'm observing behaviour which doesn't seem to be in line with how pthread_cond_signal and pthread_cond_wait should behave (according to the man pages). man 3 pthread_cond_signal stipulates that:
The pthread_cond_signal() function shall unblock at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
This isn't precise enough and doesn't clarify if, at the same time, the thread calling pthread_cond_signal will yield its time back to the scheduler.
Here's an example program:
#include <pthread.h>
#include <iostream>
#include <time.h>
#include <unistd.h>

int mMsg = 0;
pthread_mutex_t mMsgMutex;
pthread_cond_t mMsgCond;
pthread_t consumerThread;
pthread_t producerThread;

void* producer(void* data) {
    (void) data;
    while(true) {
        pthread_mutex_lock(&mMsgMutex);
        std::cout << "1> locked" << std::endl;
        mMsg += 1;
        std::cout << "1> sending signal, mMsg = " << mMsg << "" << std::endl;
        pthread_cond_signal(&mMsgCond);
        pthread_mutex_unlock(&mMsgMutex);
    }

    return nullptr;
}

void* consumer(void* data) {
    (void) data;
    pthread_mutex_lock(&mMsgMutex);

    while(true) {
        while (mMsg == 0) {
            pthread_cond_wait(&mMsgCond, &mMsgMutex);
        }
        std::cout << "2> wake up, msg: " << mMsg << std::endl;
        mMsg = 0;
    }

    return nullptr;
}

int main()
{
    pthread_mutex_init(&mMsgMutex, nullptr);
    pthread_cond_init(&mMsgCond, nullptr);

    pthread_create(&consumerThread, nullptr, consumer, nullptr);

    std::cout << "starting producer..." << std::endl;

    sleep(1);
    pthread_create(&producerThread, nullptr, producer, nullptr);

    pthread_join(consumerThread, nullptr);
    pthread_join(producerThread, nullptr);
    return 0;
}
Here's the output:
starting producer...
1> locked
1> sending signal, mMsg = 1
1> locked
1> sending signal, mMsg = 2
1> locked
1> sending signal, mMsg = 3
1> locked
1> sending signal, mMsg = 4
1> locked
1> sending signal, mMsg = 5
1> locked
1> sending signal, mMsg = 6
1> locked
1> sending signal, mMsg = 7
1> locked
1> sending signal, mMsg = 8
1> locked
1> sending signal, mMsg = 9
1> locked
1> sending signal, mMsg = 10
2> wake up, msg: 10
1> locked
1> sending signal, mMsg = 1
1> locked
1> sending signal, mMsg = 2
1> locked
1> sending signal, mMsg = 3
...
It seems like there's no guarantee that a pthread_cond_signal call will immediately unblock a thread waiting in pthread_cond_wait. At the same time, it seems that any number of pthread_cond_signal calls can be lost after the first one has been issued.
Is this really the intended behaviour or am I doing something wrong here?
This is the intended behavior. The thread calling pthread_cond_signal does not yield its remaining runtime, but continues to run.
And yes, pthread_cond_signal will immediately unblock one or more threads waiting on the corresponding condition variable. However, that doesn't guarantee that the waiting thread will immediately run. It just tells the OS that this thread is no longer blocked, and it's up to the OS thread scheduler to decide when to start running it. Since the signalling thread is already running, is hot in the cache, etc., it will likely have plenty of time to do something before the now-unblocked thread starts doing anything.
In your example above, if you don't want to skip messages, maybe what you're looking for is something like a producer-consumer queue, maybe backed by a ring buffer.
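A minimal sketch of such a queue, written with the C++11 standard library rather than raw pthreads. The names MessageQueue and runProducerConsumer are hypothetical, and the queue is an unbounded std::queue instead of a ring buffer, to keep the example short. Each message is stored until the consumer takes it, so nothing is skipped even if the producer runs far ahead.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical helper: an unbounded FIFO guarded by a mutex.
class MessageQueue {
public:
    void push(int message) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            messages_.push(message);
        }
        // Wake one waiting consumer; the message stays queued either way.
        notEmpty_.notify_one();
    }

    int pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate guards against spurious wakeups and missed signals.
        notEmpty_.wait(lock, [this] { return !messages_.empty(); });
        int message = messages_.front();
        messages_.pop();
        return message;
    }

private:
    std::mutex mutex_;
    std::condition_variable notEmpty_;
    std::queue<int> messages_;
};

// Sends the values 1..count through the queue and returns what arrived.
std::vector<int> runProducerConsumer(int count) {
    assert(count >= 0);
    MessageQueue queue;
    std::vector<int> received;

    std::thread consumer([&] {
        for (int i = 0; i < count; ++i) received.push_back(queue.pop());
    });
    std::thread producer([&] {
        for (int i = 1; i <= count; ++i) queue.push(i);
    });

    producer.join();
    consumer.join();
    return received;
}
```

Calling runProducerConsumer(1000) from main should return all 1000 values in order, regardless of how the scheduler interleaves the two threads.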

boost::thread not updating global variable

I am using a wrapper function in an external software package to start a new thread, which updates a global variable, yet the update seems invisible to the main thread. I can't call join(), because that would block the main thread and crash the software. boost::async, boost::thread and boost::packaged_task all behave the same way.
uint32 *Dval;
bool hosttask1()
{
while(*Dval<10)
{
++*Dval;
PlugIn::gResultOut << " within thread global value: " << *Dval << std::endl;
Sleep(500);
}
return false;
}
void SU_HostThread1(uint32 *value)
{
Dval = value;
*Dval = 2;
PlugIn::gResultOut << " before thread: " << *value << " before thread global: " << *Dval << std::endl;
auto myFuture = boost::async(boost::launch::async,&hosttask1);
//boost::thread thread21 = boost::thread(&hosttask1);
//boost::packaged_task<bool> pt(&hosttask1);
//boost::thread thread21 = boost::thread(boost::move(pt));
}
When I call the function:
number a=0
su_hostthread1(a)
sleep(2) //seconds
result(" function returned "+a+" \n")
OUTPUT:
before thread value: 2 before thread global value: 2
within thread global value: 3
within thread global value: 4
within thread global value: 5
within thread global value: 6
function returned 2
within thread global value: 7
within thread global value: 8
within thread global value: 9
within thread global value: 10
Any ideas?
Thanks in advance!
If you share data between threads, you must synchronize access to that data. The two possible ways are a mutex protecting the data and atomic operations. The simple reason is that caches and read/write reordering (both by the CPU and the compiler) exist. This is a complex topic that can't be fully explained in an answer here, but there are a few good books out there and also plenty of code that gets it right.
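A minimal sketch of the mutex variant, assuming std::thread and std::mutex in place of the Boost types used in the question. SharedCounter, countUpTo, readCounter and runCounterDemo are hypothetical names. The important point is that every read and every write of the shared value goes through the same mutex, and that the mutex lives alongside the data rather than inside the worker function.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical holder: the mutex is stored next to the data it protects.
struct SharedCounter {
    std::mutex mutex;
    unsigned value = 0;
};

// Worker: increments the counter until it reaches the limit.
void countUpTo(SharedCounter& counter, unsigned limit) {
    for (;;) {
        {
            std::lock_guard<std::mutex> lock(counter.mutex);
            if (counter.value >= limit) return;
            ++counter.value;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

// Reader: takes the same mutex, so it observes the worker's writes.
unsigned readCounter(SharedCounter& counter) {
    std::lock_guard<std::mutex> lock(counter.mutex);
    return counter.value;
}

// The calling thread polls the counter while the worker is still running.
unsigned runCounterDemo(unsigned start, unsigned limit) {
    SharedCounter counter;
    counter.value = start;
    std::thread worker([&counter, limit] { countUpTo(counter, limit); });

    while (readCounter(counter) < limit) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    worker.join();
    return readCounter(counter);
}
```

With start 2 and limit 10, as in the question, the polling thread sees the value climb to 10 without waiting for the worker to be joined first.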
The following code correctly reproduces what I intend to do. Mainly, the thread updates a global variable which the main thread correctly observes.
#include "stdafx.h"
#include <iostream>
#include <boost/thread.hpp>
#include <boost/chrono.hpp>
unsigned long *dataR;
bool hosttask1()
{
bool done = false;
std::cout << "In thread global value: " << *dataR << "\n"; //*value11 << *dataL <<
unsigned long cc = 0;
boost::mutex m;
while (!done)
{
m.lock();
*dataR = cc;
m.unlock();
cc++;
std::cout << "In thread loop global value: "<< *dataR << "\n";
if (cc==5) done = true;
}
return done;
}
void SU_HostThread1(unsigned long *value)
{
dataR = value;
std::cout << "Before thread value: " << *value << " Before thread global value: " << *dataR << "\n"; //*value11 << *dataL <<
auto myFuture = boost::async(boost::launch::async, &hosttask1);
return;
}
int main()
{
unsigned long value =1;
unsigned long *value11;
value11 = &value;
SU_HostThread1(value11);
boost::this_thread::sleep(boost::posix_time::seconds(1));
std::cout << "done with end value: " << *value11 << "\n";
return 0;
}
output:
Before thread value: 1 Before thread global value: 1
In thread global value: 1
In thread loop global value: 0
In thread loop global value: 1
In thread loop global value: 2
In thread loop global value: 3
In thread loop global value: 4
done with end value: 4
Yet when I copy this exactly into the SDK of the external software, the main thread does not see the updated global value. Any ideas why?
Thanks
output in external software:
before thread value: 1 before thread global value: 1
In thread global value: 1
In thread loop global value: 0
In thread loop global value: 1
In thread loop global value: 2
In thread loop global value: 3
In thread loop global value: 4
done with end value: 1
Likely this is because the compiler doesn't generally think about multithreading when optimising your code. It has seen that your code checks a value repeatedly, and it knows that in single-threaded code that value cannot change, so it just omitted the check.
If you declare the variable as volatile, the compiler will re-read it on every access, at the cost of less efficient code. Note, however, that volatile is not a thread-synchronisation mechanism in C++, so it may hide the symptom without making the code correct.
Of course you also have to understand that when a value is written, there are circumstances when it may not all be written in one go, so if you are unlucky enough to read it back when it is half-written, you get back a garbage value. The fix for that is to declare it as std::atomic, which also stops the optimiser from assuming the value cannot change behind its back. More complex code may then be emitted to ensure that the write and the read cannot intersect (or dedicated processor primitives might be used for small objects such as integers).
Most variables are not shared between threads, and when they are, it is up to the programmer to identify them and balance optimisation against the thread synchronisation needs during design.
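A minimal sketch of the std::atomic approach, using std::thread instead of Boost. The function name waitForWorkerValue and the variable names are hypothetical. The waiting loop reads an atomic flag, so the compiler cannot hoist the read out of the loop, and the default sequentially consistent ordering means the final value is visible once the flag is seen as true.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the last value the worker published, as seen by the calling thread.
unsigned long waitForWorkerValue() {
    std::atomic<unsigned long> shared{1};
    std::atomic<bool> done{false};

    std::thread worker([&] {
        for (unsigned long i = 0; i < 5; ++i) {
            shared.store(i);   // atomic write: never observed half-written
        }
        done.store(true);      // published after the last value
    });

    // Each iteration performs a real atomic load of the flag.
    while (!done.load()) {
        std::this_thread::yield();
    }
    unsigned long observed = shared.load();

    worker.join();
    return observed;
}
```

Here the worker writes 0 through 4, so the calling thread observes 4 once the flag is set.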

Multiple threads running on one core instead of four depending on the OS

I am using Raspbian on a Raspberry Pi 3.
I need to divide my code into a few blocks (2 or 4) and assign a thread per block to speed up calculations.
At the moment, I am testing with simple loops (see the attached code) on one thread and then on 4 threads. The execution time on 4 threads is always 4 times longer, so it looks like these 4 threads are scheduled to run on the same CPU.
How can I assign each thread to run on a different CPU? Even 2 threads on 2 CPUs would make a big difference to me.
I even tried g++ 6, with no improvement. Using OpenMP in the code with "#pragma omp for" still runs on one CPU.
I ran this code on Fedora Linux x86 and saw the same behavior, but on Windows 8.1 with VS2015 I got different results: the time was the same on one thread and on 4 threads, so there it was running on different CPUs.
Would you have any suggestions?
Thank you.
#include <iostream>
//#include <arm_neon.h>
#include <ctime>
#include <thread>
#include <mutex>
#include <iostream>
#include <vector>
using namespace std;
float simd_dot0() {
unsigned int i;
unsigned long rezult;
for (i = 0; i < 0xfffffff; i++) {
rezult = i;
}
return rezult;
}
int main() {
unsigned num_cpus = std::thread::hardware_concurrency();
std::mutex iomutex;
std::vector<std::thread> threads(num_cpus);
cout << "Start Test 1 CPU" << endl; // prints !!!Hello World!!!
double t_start, t_end, scan_time;
scan_time = 0;
t_start = clock();
simd_dot0();
t_end = clock();
scan_time += t_end - t_start;
std::cout << "\nExecution time on 1 CPU: "
<< 1000.0 * scan_time / CLOCKS_PER_SEC << "ms" << std::endl;
cout << "Finish Test on 1 CPU" << endl; // prints !!!Hello World!!!
cout << "Start Test 4 CPU" << endl; // prints !!!Hello World!!!
scan_time = 0;
t_start = clock();
for (unsigned i = 0; i < 4; ++i) {
threads[i] = std::thread([&iomutex, i] {
{
simd_dot0();
std::cout << "\nExecution time on CPU: "
<< i << std::endl;
}
// Simulate important work done by the tread by sleeping for a bit...
});
}
for (auto& t : threads) {
t.join();
}
t_end = clock();
scan_time += t_end - t_start;
std::cout << "\nExecution time on 4 CPUs: "
<< 1000.0 * scan_time / CLOCKS_PER_SEC << "ms" << std::endl;
cout << "Finish Test on 4 CPU" << endl; // prints !!!Hello World!!!
cout << "!!!Hello World!!!" << endl; // prints !!!Hello World!!!
while (1);
return 0;
}
Edit :
On Raspberry Pi3 Raspbian I used g++4.9 and 6 with the following flags :
-std=c++11 -ftree-vectorize -Wl--no-as-needed -lpthread -march=armv8-a+crc -mcpu=cortex-a53 -mfpu=neon-fp-armv8 -funsafe-math-optimizations -O3
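One thing worth ruling out before looking at CPU affinity (a possibility, not a diagnosis confirmed anywhere in this thread): on Linux, clock() reports processor time consumed by the whole process, summed over all of its threads, whereas the Microsoft C runtime's clock() reports elapsed wall-clock time. Four threads running in parallel would therefore show roughly four times the clock() value on Linux even when each runs on its own core. The sketch below measures both figures side by side; Timing, busyLoop and timeThreads are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>
#include <vector>

// Hypothetical result type holding both measurements in milliseconds.
struct Timing {
    double wallMs;  // elapsed real time
    double cpuMs;   // processor time reported by std::clock()
};

// Busy work; the volatile sink keeps the loop from being optimised away.
void busyLoop(unsigned long iterations) {
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < iterations; ++i) {
        sink = i;
    }
}

// Runs the loop on threadCount threads and times it both ways.
Timing timeThreads(unsigned threadCount, unsigned long iterations) {
    const auto wallStart = std::chrono::steady_clock::now();
    const std::clock_t cpuStart = std::clock();

    std::vector<std::thread> threads;
    for (unsigned i = 0; i < threadCount; ++i) {
        threads.emplace_back(busyLoop, iterations);
    }
    for (auto& t : threads) {
        t.join();
    }

    const std::clock_t cpuEnd = std::clock();
    const auto wallEnd = std::chrono::steady_clock::now();

    Timing result;
    result.wallMs =
        std::chrono::duration<double, std::milli>(wallEnd - wallStart).count();
    result.cpuMs = 1000.0 * static_cast<double>(cpuEnd - cpuStart) / CLOCKS_PER_SEC;
    return result;
}
```

If timeThreads(4, n) shows wallMs close to the single-thread figure while cpuMs is about four times larger, the threads are in fact running on separate cores and only the measurement was misleading.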

Why am I experiencing unexpected behavior with Linux signal handling?

I live in an environment with Win7/MSVC 2010sp1, two different Linux boxes (Red Hat) with g++ versions (4.4.7, 4.1.2), and AIX with xlc++ (08.00.0000.0025).
Not so long ago it was requested that we move some code from AIX over to Linux. It didn't take too long to see that Linux was a bit different. Normally when a signal is thrown, we handle it and throw a C++ exception. That was not working as expected.
Long story short, throwing c++ exceptions from a signal handler isn't going to work.
Some time later, I put together a fix that uses setjmp/longjmp to move the exception out of the handler. After some testing, the dang thing worked on all platforms. After an obligatory round of cubicle happy dance I moved on to setting up some unit tests. Oops.
Some of my tests were failing on Linux. What I observed was that the raise function only worked once. With two tests using SIGILL, the first one passed, and the second one failed. I broke out an axe, and started chopping away at the code to remove as much cruft as possible. That yielded this smaller example.
#include <csetjmp>
#include <iostream>
#include <signal.h>
jmp_buf mJmpBuf;
jmp_buf *mpJmpBuf = &mJmpBuf;
int status = 0;
int testCount = 3;
void handler(int signalNumber)
{
signal(signalNumber, handler);
longjmp(*mpJmpBuf, signalNumber);
}
int main(void)
{
if (signal(SIGILL, handler) != SIG_ERR)
{
for (int test = 1; test <= testCount; test++)
{
try
{
std::cerr << "Test " << test << "\n";
if ((status = setjmp(*mpJmpBuf)) == 0)
{
std::cerr << " About to raise SIGILL" << "\n";
int returnStatus = raise(SIGILL);
std::cerr << " Raise returned value " << returnStatus
<< "\n";
}
else
{
std::cerr << " Caught signal. Converting signal "
<< status << " to exception" << "\n";
std::exception e;
throw e;
}
std::cerr << " SIGILL should have been thrown **********\n";
}
catch (std::exception &)
{ std::cerr << " Caught exception as expected\n"; }
}
}
else
{ std::cerr << "The signal handler wasn't registered\n"; }
return 0;
}
For the Windows and the AIX boxes I get the expected output.
Test 1
About to raise SIGILL
Caught signal. Converting signal 4 to exception
Caught exception as expected
Test 2
About to raise SIGILL
Caught signal. Converting signal 4 to exception
Caught exception as expected
Test 3
About to raise SIGILL
Caught signal. Converting signal 4 to exception
Caught exception as expected
For both Linux boxes it looks like this.
Test 1
About to raise SIGILL
Caught signal. Converting signal 4 to exception
Caught exception as expected
Test 2
About to raise SIGILL
Raise returned value 0
SIGILL should have been thrown **********
Test 3
About to raise SIGILL
Raise returned value 0
SIGILL should have been thrown **********
So, my real question is "What is going on here?"
My rhetorical questions are:
Is anyone else observing this behavior?
What should I do to try to troubleshoot this issue?
What other things should I be aware of?
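You must use sigsetjmp/siglongjmp to ensure correct behavior when mixing signals and jumps. On Linux, setjmp/longjmp do not save and restore the signal mask, so after the first longjmp out of the handler SIGILL stays blocked and later raise calls never reach the handler. If you change your code accordingly, it will work correctly under Linux.
You also used the old signal API, which is not recommended. I encourage you to use the much more reliable sigaction interface. The first benefit is that you no longer need to re-install the handler inside the handler itself.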
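A minimal sketch of that combination, assuming a POSIX system. The names jumpBuffer, jumpOutOfHandler and countCaughtSignals are hypothetical, and the conversion to a C++ exception is left out to keep the example small. Passing 1 as the second argument of sigsetjmp saves the signal mask, and siglongjmp restores it, so SIGILL is unblocked again after every jump.

```cpp
#include <cassert>
#include <setjmp.h>
#include <signal.h>

// Jump target shared between the handler and the code that raises the signal.
static sigjmp_buf jumpBuffer;

// Handler installed through sigaction; it stays installed after each delivery.
extern "C" void jumpOutOfHandler(int signalNumber) {
    siglongjmp(jumpBuffer, signalNumber);
}

// Raises SIGILL the given number of times and counts how many were caught.
// Returns -1 if the handler could not be installed.
int countCaughtSignals(int attempts) {
    struct sigaction action = {};
    action.sa_handler = jumpOutOfHandler;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    if (sigaction(SIGILL, &action, nullptr) != 0) {
        return -1;
    }

    // volatile: the value must survive the non-local jump.
    volatile int caught = 0;
    for (int test = 0; test < attempts; ++test) {
        if (sigsetjmp(jumpBuffer, 1) == 0) {  // 1 = save the signal mask
            raise(SIGILL);  // control leaves through siglongjmp in the handler
        } else {
            caught = caught + 1;  // reached after each jump out of the handler
        }
    }
    return caught;
}
```

Calling countCaughtSignals(3) corresponds to the three tests in the question and should report three caught signals. Keep in mind that siglongjmp, like longjmp, skips destructors of objects on the frames it jumps over.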

Wait for a thread to join with time limit

I've got a thread that invokes a function MyFunc with parameters params. Basically, it outputs dots to a stream while MyFunc is running, one every 500 ms. I need to wait for the thread for 1 minute, then output either "MyFunc successfully completed" if the function finishes its work within 1 minute, or "Timeout" if it is still running after 1 minute. How can I do that?
std::future<void> f = std::async(std::launch::async, MyFunc, params);
std::chrono::milliseconds span(500);
while (f.wait_for(span) == std::future_status::timeout)
std::cout << '.';
You can use wait_for() without a problem.
std::future<void> f = std::async(std::launch::async, MyFunc, params);
// Keep the helper future in a named variable: a temporary future returned by
// std::async blocks in its destructor until the task finishes, which would
// defeat the 60-second limit.
auto dots = std::async(std::launch::async, [&]()
{
    // for your use, you may want to change it from 0 seconds to something
    // like 1 second, or 500 ms
    while (f.wait_for(std::chrono::seconds(0)) != std::future_status::ready)
        std::cout << ".";
});
auto because = dots.wait_for(std::chrono::seconds(60));
if (because == std::future_status::ready)
    std::cout << "Successfully Completed\n";
else
    std::cout << "Timeout";
Remember when you started waiting, or count the number of times you waited. Then you check those values on each iteration and determine whether more than 1min has passed. In that case you exit the loop.
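A minimal sketch of that idea, assuming the task was started with std::launch::async. The helper name waitWithDots and its parameters are hypothetical. It records a deadline once, then checks it after every 500 ms style wait.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

// Prints a dot after each tick while the task is still running.
// Returns true if the task finished before the limit, false on timeout.
bool waitWithDots(std::future<void>& task,
                  std::chrono::milliseconds tick,
                  std::chrono::milliseconds limit) {
    // Remember when the waiting must end.
    const auto deadline = std::chrono::steady_clock::now() + limit;
    for (;;) {
        if (task.wait_for(tick) == std::future_status::ready) {
            return true;
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;
        }
        std::cout << '.' << std::flush;
    }
}
```

With the code from the question, this would be called as waitWithDots(f, std::chrono::milliseconds(500), std::chrono::seconds(60)), printing "MyFunc successfully completed" when it returns true and "Timeout" otherwise. Note that a future obtained from std::async still blocks in its destructor until the task ends, so a timeout only lets the caller report the situation; it does not stop MyFunc.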
