MPI_Comm_size Segmentation fault - linux

Hello, everyone. I get these errors when running a parallel program with MPI and OpenMP on Linux:
[node65:03788] *** Process received signal ***
[node65:03788] Signal: Segmentation fault (11)
[node65:03788] Signal code: Address not mapped (1)
[node65:03788] Failing at address: 0x44000098
[node65:03788] [ 0] /lib64/libpthread.so.0 [0x2b663e446c00]
[node65:03788] [ 1] /public/share/mpi/openmpi-1.4.5//lib/libmpi.so.0(MPI_Comm_size+0x60) [0x2b663d694360]
[node65:03788] [ 2] fdtd_3D_xyzPML_MPI_OpenMP(main+0xaa) [0x42479a]
[node65:03788] [ 3] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b663e56f184]
[node65:03788] [ 4] fdtd_3D_xyzPML_MPI_OpenMP(_ZNSt8ios_base4InitD1Ev+0x39) [0x405d79]
[node65:03788] *** End of error message ***
-----------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 3787 on node node65 exited on signal 11 (Segmentation fault).
-----------------------------------------------------------------------------
After analyzing the core files, I get the following message:
[Thread debugging using libthread_db enabled]
[New Thread 47310344057648 (LWP 26962)]
[New Thread 1075841344 (LWP 26966)]
[New Thread 1077942592 (LWP 26967)]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 47310344057648 (LWP 26962)]
0x00002b074afb3360 in PMPI_Comm_size () from /public/share/mpi/openmpi-1.4.5//lib/libmpi.so.0
What causes this? Thanks for your help.
The code (test.cpp) is as follows, so you can try it yourself:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include "mpi.h"
int main(int argc, char* argv[])
{
int nprocs = 1; //the number of processes
int myrank = 0;
int provide;
MPI_Init_thread(&argc,&argv,MPI_THREAD_FUNNELED,&provide);
if (MPI_THREAD_FUNNELED != provide)
{
printf ("provided %d != required %d\n", provide, MPI_THREAD_FUNNELED);
return 0;
}
MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myrank);
int num_threads = 1; //Openmp
omp_set_dynamic(1);
num_threads = 16;
omp_set_num_threads(num_threads);
#pragma omp parallel
{
printf ("%d omp thread from %d mpi process\n", omp_get_thread_num(), myrank);
}
MPI_Finalize();
}

Well, this is probably not much of an answer, but I had this problem when mixing up different MPI installations (OpenMPI and MVAPICH2, to be precise).
Here are a few things to check:
Which version of MPI you linked against:
ldd <application> | grep -i mpi
libmpi.so.1 => /usr/lib64/mpi/gcc/openmpi/lib64/libmpi.so.1 (0x00007f90c03cc000)
Which version of MPI is dynamically loaded:
echo $LD_LIBRARY_PATH | tr : "\n" | grep -i mpi
/usr/lib64/mpi/gcc/openmpi/lib64
Whether you override this dynamic loading (this variable should be empty, unless you know what you're doing):
echo $LD_PRELOAD
If that's all OK, check that every library you link to that relies on MPI was also linked against the same version. If no other library is linked to MPI, nothing should appear.
ldd <application> | sed "s/^\s*\(.*=> \)\?//;s/ (0x[0-9a-fA-F]*)$//" | xargs -L 1 ldd | grep -i mpi
If something suspect does show up, say libmpich.so.3 => /usr/lib64/mpi/gcc/MVAPICH2/1.8.1/lib/libmpich.so.3 for example, remove the -L 1 (so that ldd prints the name of each library before its dependencies) and replace grep with something to view the output (nothing, or less, or vim -), then search for that suspect line.
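The ldd checks above look at what would be loaded. As a complement, here is a minimal sketch that lists what the running process has actually mapped, by scanning /proc/self/maps (Linux only). The helper name count_mapped_files is hypothetical and not part of MPI or any library; calling it with "mpi" from inside the application is just one assumed way to use it.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print and count the mapped files whose path
 * contains needle. Returns -1 if /proc/self/maps cannot be opened. */
int count_mapped_files(const char *needle)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    char line[4096], last[4096] = "";
    int count = 0;

    while (fgets(line, sizeof line, f)) {
        /* The pathname column is the only one that contains a '/'. */
        char *path = strchr(line, '/');
        if (!path)
            continue;                       /* anonymous mapping */
        path[strcspn(path, "\n")] = '\0';

        /* A file usually has several consecutive mappings; report it once. */
        if (strstr(path, needle) && strcmp(path, last) != 0) {
            printf("mapped: %s\n", path);
            count++;
        }
        strcpy(last, path);
    }
    fclose(f);
    return count;
}
```

Calling count_mapped_files("mpi") right after MPI_Init_thread() should print a single MPI installation; paths from two different installations would point to the mix-up described above.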

Related

How can I put a prefix on all output from GDB?

I would like to put a prefix, like "GDB> ", on every output from gdb to distinguish it from the output of my program.
Is that possible?
An example:
test.c
#include <stdio.h>
int main(int argc, char **argv)
{
char *p = NULL;
printf("Test begins...\n");
p[0] = '\0'; // Forcing a segmentation fault.
printf("Test finished.\n");
return 0;
}
Debugging with this command line:
$ gdb -q -ex "set confirm off" -ex run -ex quit --args ./test
The output is:
Reading symbols from ./test...
Starting program: /home/me/tst/test
Test begins...
Program received signal SIGSEGV, Segmentation fault.
0x000000000040113b in main ()
What I have in mind is to make the output look something like this:
GDB> Reading symbols from ./test...
GDB> Starting program: /home/me/tst/test
Test begins...
GDB>
GDB> Program received signal SIGSEGV, Segmentation fault.
GDB> 0x000000000040113b in main ()

Suppressing the segfault signal

I am analyzing a set of buggy programs that may terminate with a segfault under some tests. The segfault event is logged in /var/log/syslog.
For example, the following snippet returns Segmentation fault, and it is logged.
#!/bin/bash
./test
My question is how to suppress the segfault such that it does NOT appear in the system log. I tried trap to capture the signal in the following script:
#!/bin/bash
set -bm
trap "echo 'something happened'" {1..64}
./test
It returns:
Segmentation fault
something happened
So it does trap the segfault, but the segfault is still logged.
kernel: [81615.373989] test[319]: segfault at 0 ip 00007f6b9436d614 sp 00007ffe33fb77f8 error 6 in libc-2.19.so[7f6b942e1000+1bb000]
You can try to change ./test to the following line:
. ./test
This will execute ./test in the same shell.
We can suppress the log message system-wide with, e.g.,
echo 0 >/proc/sys/debug/exception-trace
See also:
Making the Linux kernel shut up about segfaulting user programs
Is there a way to temporarily disable segfault messages in dmesg?
We can suppress the log message for a single process if we run it under ptrace() control, as in a debugger. This program does that:
exe.c
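#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
int main(int argc, char *args[])
{
    pid_t pid;
    if (*++args)
    {
        if ((pid = fork()) > 0)
        {
            /* Parent: act as a minimal tracer for the child. */
            int status;
            while (wait(&status) > 0)
            {
                /* Child is gone: return its exit status, or 128+n for signal n. */
                if (!WIFSTOPPED(status))
                    return WIFSIGNALED(status) ? 128+WTERMSIG(status)
                                               : WEXITSTATUS(status);
                /* Child stopped on a signal: pass it on, except the SIGTRAP from exec. */
                int signal = WSTOPSIG(status);
                if (signal == SIGTRAP) signal = 0;
                ptrace(PTRACE_CONT, pid, 0, signal);
            }
            perror("wait");
        }
        else if (pid == 0)
        {
            /* Child: ask to be traced, then run the target program. */
            ptrace(PTRACE_TRACEME, 0, 0, 0);
            execvp(*args, args);
            perror(*args);
        }
        else
            perror("fork");
    }
    return 1;
}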
It is called with the buggy program as its argument, in your case
exe ./test
- then the exit status of exe normally is the exit status of test, but if test was terminated by signal n (11 for Segmentation fault), it is 128+n.
After I wrote this, I realized that we can also use strace for the purpose, e. g.
strace -enone ./test
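The 128+n convention described above can be sketched in isolation. This is a minimal illustration without ptrace; the helper names shell_style_status and run_child are hypothetical, and the child here just raises a signal or exits instead of running a real buggy program.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map a wait() status to the code described above:
 * the exit status, or 128+n if the child was killed by signal n. */
int shell_style_status(int status)
{
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    return -1;
}

/* Hypothetical helper: fork a child that raises signo (if nonzero)
 * or exits with code, then report its status in the style above. */
int run_child(int signo, int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (signo)
            raise(signo);
        _exit(code);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return shell_style_status(status);
}
```

With this mapping, a child killed by SIGKILL (signal 9) is reported as 137, and a child killed by a segmentation fault (signal 11) would be reported as 139.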

How to find all runnable processes

I'm learning about the scheduler and trying to print all runnable processes. So I have written a kernel module that uses the for_each_process macro to iterate over all processes and prints the ones in the "runnable" state. But this seems like a stupid (and inefficient) way of doing it. So I thought about getting a reference to all the running queues and using their red-black trees to go over the runnable processes, but I couldn't find a way to do this.
I have found out that there is a list of sched_class structures for each CPU, which are stop_sched_class->rt_sched_class->fair_sched_class->idle_sched_class, and each one of them has its own running queue. But I couldn't find a way to reach them all.
I have used the module that finds all runnable processes to also print the addresses of their running queues. It seems I have 3 running queues (while having only two processors).
The module:
#include <linux/module.h> /* Needed by all modules */
#include <linux/kernel.h> /* Needed for KERN_INFO */
#include <linux/sched.h>
MODULE_LICENSE("GPL");
struct cfs_rq {
struct load_weight load;
unsigned int nr_running, h_nr_running;
};
void printList(void){
int count;
struct task_struct * tsk;
count = 0;
for_each_process(tsk){
if(tsk->state)
continue;
printk("pid: %d rq: %p (%d)\n", tsk->pid, tsk->se.cfs_rq, tsk->se.cfs_rq->nr_running);
count++;
}
printk("count is: %d\n", count);
}
int init_module(void)
{
printList();
return 0;
}
void cleanup_module(void)
{
printk(KERN_INFO "Goodbye world proc.\n");
}
The output:
[ 8215.627038] pid: 9147 ffff88007bbe9200 (3)
[ 8215.627043] pid: 9148 ffff8800369b0200 (2)
[ 8215.627045] pid: 9149 ffff8800369b0200 (2)
[ 8215.627047] pid: 9150 ffff88007bbe9200 (3)
[ 8215.627049] pid: 9151 ffff88007bbe9200 (3)
[ 8215.627051] pid: 9154 ffff8800a46d4600 (1)
[ 8215.627053] count is: 6
[ 8215.653741] Goodbye world proc.
About the computer:
$ uname -a
Linux k 3.13.0-39-generic #66-Ubuntu SMP Tue Oct 28 13:30:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/cpuinfo | grep 'processor' | wc -l
2
So my questions are:
How can I print all runnable processes in a nicer way?
How are running queues made and managed?
Are the running queues somehow linked each other? (How?)
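Run ps -A -l and look for the entries with process state R (the S column) and process flags 1 (the F column).
You can try the command below. Note that grep matches R or D anywhere in the line, so the header and some other entries also show up.
Sample output: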
127:~$ ps -A -l | grep -e R -e D
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
1 S 0 1367 2 0 80 0 - 0 - ? 00:00:01 SEPDRV_ABNORMAL
4 R 1000 2634 2569 2 80 0 - 794239 - ? 00:25:06 Web Content
1 D 0 20091 2 0 80 0 - 0 - ? 00:00:00 kworker/3:2
4 R 1000 21077 9361 0 80 0 - 7229 - pts/17 00:00:00 ps
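The same information that ps reads is available from user space in /proc/<pid>/stat. The sketch below is a rough illustration of that approach, not a replacement for walking the kernel's run queues; the helper names read_state and count_procs_in_state are hypothetical, and only thread-group leaders (the numeric entries directly under /proc) are examined.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Return the state letter from /proc/<pid>/stat, or 0 on failure. */
static char read_state(const char *pid)
{
    char path[300], buf[512];
    snprintf(path, sizeof path, "/proc/%s/stat", pid);

    FILE *f = fopen(path, "r");
    if (!f)
        return 0;                    /* process may have exited already */
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* Format: "pid (comm) state ...". comm may itself contain spaces
     * or ')', so look for the last ')' instead of splitting on spaces. */
    char *p = strrchr(buf, ')');
    return (p && p[1] == ' ' && p[2]) ? p[2] : 0;
}

/* Hypothetical helper: print and count processes in the given state. */
int count_procs_in_state(char state)
{
    DIR *d = opendir("/proc");
    if (!d)
        return -1;

    int count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (!isdigit((unsigned char)e->d_name[0]))
            continue;                /* not a process directory */
        if (read_state(e->d_name) == state) {
            printf("pid %s is in state %c\n", e->d_name, state);
            count++;
        }
    }
    closedir(d);
    return count;
}
```

Calling count_procs_in_state('R') should report at least the calling process itself, since it is running while it scans.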

How to tell if a downstream process in a Unix pipe has crashed

I have a Linux process (let's call it the main process) whose standard output is piped to another process (called the downstream process) by means of the shell's pipe operator (|). The main process is set up to receive SIGPIPE signals if the downstream process crashes. Unfortunately, SIGPIPE is not raised until the main process writes to stdout. Is there a way to tell sooner that the downstream process has terminated?
One approach is to write continuously to the downstream process, but that seems wasteful. Another approach is to have a separate watchdog process that monitors all relevant processes, but that is complex. Or perhaps there is some way to use select() to trigger the signal. I am hoping that the main process can do all this itself.
It appears the stdout file descriptor becomes "ready for reading" when the receiver crashes:
$ gcc -Wall select-downstream-crash.c -o select-downstream-crash
$ gcc -Wall crash-in-five-seconds.c -o crash-in-five-seconds
$ ./select-downstream-crash | ./crash-in-five-seconds
... five seconds pass ...
stdout is ready for reading
Segmentation fault
select-downstream-crash.c
#include <err.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>
int main(void)
{
fd_set readfds;
int rc;
FD_ZERO(&readfds);
FD_SET(STDOUT_FILENO, &readfds);
rc = select(STDOUT_FILENO + 1, &readfds, NULL, NULL, NULL);
if (rc < 0)
err(1, "select");
if (FD_ISSET(STDOUT_FILENO, &readfds))
fprintf(stderr, "stdout is ready for reading\n");
return 0;
}
crash-in-five-seconds.c
#include <stdio.h>
#include <unistd.h>
int main(void)
{
sleep(5);
putchar(*(char*)NULL);
return 0;
}
I tried this on Linux, but don't know if it'll work elsewhere. It would be nice to find some documentation explaining this observation.
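A likely explanation for this observation is that Linux flags an error condition on the write end of a pipe once the read end has been closed: poll(2) documents that POLLERR is set in that case, and select() appears to report that condition as readiness. The sketch below checks for POLLERR directly with poll(); the helper names reader_closed and stdout_reader_closed are hypothetical, and the behavior is assumed to be Linux-specific.

```c
#include <poll.h>
#include <unistd.h>

/* Returns 1 if the read end of the pipe behind write_fd has been closed,
 * 0 if it is still open (or the timeout expired), -1 on error.
 * POLLERR is reported even though no events are requested. */
int reader_closed(int write_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = write_fd, .events = 0, .revents = 0 };

    if (poll(&pfd, 1, timeout_ms) < 0)
        return -1;
    return (pfd.revents & POLLERR) ? 1 : 0;
}

/* Convenience wrapper for the case in the question: stdout is the pipe. */
int stdout_reader_closed(int timeout_ms)
{
    return reader_closed(STDOUT_FILENO, timeout_ms);
}
```

A timeout of -1 makes the call block until the downstream process goes away, which avoids writing to the pipe just to provoke SIGPIPE.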
If the main process forks the other processes, then it will get SIGCHLD notifications when they exit.

Externally disabling signals for a Linux program

On Linux, is it possible to somehow disable signaling for programs externally... that is, without modifying their source code?
Context:
I'm calling a C (and also a Java) program from within a bash script on Linux. I don't want any interruptions for my bash script, and for the other programs that the script launches (as foreground processes).
While I can use a...
trap '' INT
... in my bash script to disable the Ctrl C signal, this works only when the program control happens to be in the bash code. That is, if I press Ctrl C while the C program is running, the C program gets interrupted and it exits! This C program is doing some critical operation, because of which I don't want it to be interrupted. I don't have access to the source code of this C program, so signal handling inside the C program is out of the question.
#!/bin/bash
trap 'echo You pressed Ctrl C' INT
# A C program to emulate a real-world, long-running program,
# which I don't want to be interrupted, and for which I
# don't have the source code!
#
# File: y.c
# To build: gcc -o y y.c
#
# #include <stdio.h>
# int main(int argc, char *argv[]) {
# printf("Performing a critical operation...\n");
# for(;;); // Do nothing forever.
# printf("Performing a critical operation... done.\n");
# }
./y
Regards,
/HS
The process signal mask is inherited across exec, so you can simply write a small wrapper program that blocks SIGINT and executes the target:
#include <signal.h>
#include <unistd.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
sigset_t sigs;
sigemptyset(&sigs);
sigaddset(&sigs, SIGINT);
sigprocmask(SIG_BLOCK, &sigs, 0);
if (argc > 1) {
execvp(argv[1], argv + 1);
perror("execvp");
} else {
fprintf(stderr, "Usage: %s <command> [args...]\n", argv[0]);
}
return 1;
}
If you compile this program to noint, you would just execute ./noint ./y.
As ephemient notes in comments, the signal disposition is also inherited, so you can have the wrapper ignore the signal instead of blocking it:
#include <signal.h>
#include <unistd.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
struct sigaction sa = { 0 };
sa.sa_handler = SIG_IGN;
sigaction(SIGINT, &sa, 0);
if (argc > 1) {
execvp(argv[1], argv + 1);
perror("execvp");
} else {
fprintf(stderr, "Usage: %s <command> [args...]\n", argv[0]);
}
return 1;
}
(and of course for a belt-and-braces approach, you could do both).
A handler set with the "trap" command is local to the shell process and never applies to children (only an ignored signal, as in trap '' INT, is passed on to them).
To really trap the signal, you have to hack it using an LD_PRELOAD hook. This is a non-trivial task (you have to compile a loadable library with _init() and sigaction() inside), so I won't include the full code here. You can find an example for SIGSEGV in Phrack Volume 0x0b, Issue 0x3a, Phile #0x03.
Alternatively, try the nohup and tail trick.
nohup your_command &
tail -F nohup.out
I would suggest that your C (and Java) application needs rewriting so that it can handle an exception. What happens if it really does need to be interrupted, the power fails, etc.?
If that fails, J-16 is right on the money. Does the user need to interact with the process, or just see the output (do they even need to see the output)?
The solutions explained above are not working for me, even when chaining both commands proposed by Caf.
However, I finally succeeded in getting the expected behavior this way:
#!/bin/zsh
setopt MONITOR
TRAPINT() { print AAA }
print 1
( ./child & ; wait)
print 2
If I press Ctrl-C while child is running, the script waits for it to exit, then prints AAA and 2. child will not receive any signals.
The subshell is used to prevent the PID from being shown.
And sorry... this is for zsh though the question is for bash, but I do not know bash enough to provide an equivalent script.
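This is example code for enabling signals like Ctrl+C for programs which block them. It preloads a sigaddset() that only logs the call and adds nothing, so the signal sets the program builds stay empty.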
fixControlC.c
#include <stdio.h>
#include <signal.h>
int sigaddset(sigset_t *set, int signo) {
printf("int sigaddset(sigset_t *set=%p, int signo=%d)\n", set, signo);
return 0;
}
Compile it:
gcc -fPIC -shared -o fixControlC.so fixControlC.c
Run it:
LD_LIBRARY_PATH=. LD_PRELOAD=fixControlC.so mysqld

Resources