Hide this OpenMPI message - openmpi

Anytime I run an MPI program with "mpirun -n 1 myprogram" I get this message:
Reported: 1 (out of 1) daemons - 1 (out of 1) procs
How do I disable this message? I am using Open MPI 1.6.5

For some reason the value of the orte_report_launch_progress MCA parameter is set to true. This could either be coming from the system-wide Open MPI configuration file or from an environment variable named OMPI_MCA_orte_report_launch_progress. In any case, you may override it by passing --mca orte_report_launch_progress 0 to mpirun:
mpirun --mca orte_report_launch_progress 0 -n 1 myprogram
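If you want to confirm where the setting comes from, you could first check the environment and the system-wide parameter file (the exact path depends on your installation prefix; /usr/local/etc is only an example):
env | grep OMPI_MCA_orte_report_launch_progress
grep orte_report_launch_progress /usr/local/etc/openmpi-mca-params.conf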
If the value is coming from the system-wide Open MPI configuration, you may also override it by appending the following to $HOME/.openmpi/mca-params.conf (create the file and the directory if they don't exist yet):
orte_report_launch_progress = 0
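For example, a minimal sketch:
mkdir -p "$HOME/.openmpi"
echo "orte_report_launch_progress = 0" >> "$HOME/.openmpi/mca-params.conf"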

Related

How can I run Open MPI under Slurm?

I am unable to run Open MPI under Slurm through a Slurm script.
In general, I am able to obtain the hostname and run Open MPI on my machine.
$ mpirun hostname
myHost
$ cd NPB3.3-SER/ && make ua CLASS=B && mpirun -n 1 bin/ua.B.x inputua.data # Works
But if I do the same through the Slurm script, mpirun hostname returns an empty string, and consequently I am unable to run mpirun -n 1 bin/ua.B.x inputua.data.
slurm-script.sh:
#!/bin/bash
#SBATCH -o slurm.out # STDOUT
#SBATCH -e slurm.err # STDERR
#SBATCH --mail-type=ALL
export LD_LIBRARY_PATH="/usr/lib/openmpi/lib"
mpirun hostname > output.txt # Returns empty
cd NPB3.3-SER/
make ua CLASS=B
mpirun --host myHost -n 1 bin/ua.B.x inputua.data
$ sbatch -N1 slurm-script.sh
Submitted batch job 1
The error I am receiving:
There are no allocated resources for the application
bin/ua.B.x
that match the requested mapping:
------------------------------------------------------------------
Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.
A daemon (pid unknown) died unexpectedly with status 1 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
------------------------------------------------------------------
If Slurm and OpenMPI are recent versions, make sure that OpenMPI is compiled with Slurm support (run ompi_info | grep slurm to find out) and just run srun bin/ua.B.x inputua.data in your submission script.
Alternatively, mpirun bin/ua.B.x inputua.data should work too.
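For the Slurm-aware case, a minimal submission script might look like this (a sketch reusing the paths and output files from the question):
#!/bin/bash
#SBATCH -o slurm.out # STDOUT
#SBATCH -e slurm.err # STDERR
cd NPB3.3-SER/
make ua CLASS=B
srun bin/ua.B.x inputua.data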
If OpenMPI is compiled without Slurm support the following should work:
srun hostname > output.txt
cd NPB3.3-SER/
make ua CLASS=B
mpirun --hostfile output.txt -n 1 bin/ua.B.x inputua.data
Also make sure that running export LD_LIBRARY_PATH="/usr/lib/openmpi/lib" does not overwrite other library paths that are needed. Better would probably be export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/lib/openmpi/lib" (or a slightly more involved version, shown just below, if you want to avoid a leading : when the variable starts out empty).
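One such version, which only inserts the separating colon when LD_LIBRARY_PATH is already non-empty:
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}/usr/lib/openmpi/lib"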
What you need is to: 1) run mpirun, 2) from Slurm, 3) with --host.
To determine what is responsible for this not working (Problem 1), you could test a few things.
Whatever you test, you should test exactly the same thing via the command line (CLI) and via Slurm (S).
It is expected that some of these tests will produce different results in the CLI and S cases.
A few notes:
1) You are not testing exactly the same things in CLI and S.
2) You say that you are "unable to run mpirun -n 1 bin/ua.B.x inputua.data", while the problem is actually with mpirun --host myHost -n 1 bin/ua.B.x inputua.data.
3) The fact that mpirun hostname > output.txt returns an empty file (Problem 2) does not necessarily have the same origin as your main problem, see paragraph above. You can overcome this problem by using scontrol show hostnames
or with the environment variable SLURM_NODELIST (on which scontrol show hostnames is based), but this will not solve Problem 1.
To work around Problem 2, which is not the most important, try a few things via both CLI and S.
The Slurm script below may be helpful.
#!/bin/bash
#SBATCH -o slurm_hostname.out # STDOUT
#SBATCH -e slurm_hostname.err # STDERR
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}/usr/lib64/openmpi/lib"
mpirun hostname > hostname_mpirun.txt # 1. Returns values ok for me
hostname > hostname.txt # 2. Returns values ok for me
hostname -s > hostname_slurmcontrol.txt # 3. Returns values ok for me
scontrol show hostnames > hostname_scontrol.txt # 4. Returns values ok for me
echo ${SLURM_NODELIST} > hostname_nodelist.txt # 5. Returns values ok for me
(The ${VAR:+...} form in the export line only adds a colon separator when LD_LIBRARY_PATH is already non-empty.)
From what you say, I understand 2, 3, 4 and 5 work ok for you, and 1 does not.
So you could now use mpirun with a suitable --host or --hostfile option.
Note the different format of the output of scontrol show hostnames (e.g., for me cnode17<newline>cnode18) and echo ${SLURM_NODELIST} (cnode[17-18]).
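For example, a sketch of turning the compact nodelist form into a hostfile that mpirun accepts (the hostfile name is arbitrary):
scontrol show hostnames "${SLURM_NODELIST}" > hostfile.txt
mpirun --hostfile hostfile.txt -n 1 bin/ua.B.x inputua.data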
The host names could perhaps also be obtained from file names set dynamically with %h and %n in slurm.conf; look for e.g. SlurmdLogFile and SlurmdPidFile.
To diagnose/work around/solve Problem 1, try mpirun with/without --host, in CLI and S.
From what you say, assuming you used the correct syntax in each case, this is the outcome:
1. mpirun, CLI (original post).
"Works".
2. mpirun, S (comment?).
Same error as item 4 below?
Note that mpirun hostname in S should have produced similar output in your slurm.err.
3. mpirun --host, CLI (comment).
Error
There are no allocated resources for the application bin/ua.B.x that match the requested mapping:
...
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
4. mpirun --host, S (original post).
Error (same as item 3 above?)
There are no allocated resources for the application
bin/ua.B.x
that match the requested mapping:
------------------------------------------------------------------
Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.
...
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
As per the comments, you may have a wrong LD_LIBRARY_PATH set.
You may also need to use mpirun --prefix ...
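The general form would be something like the following, assuming the installation prefix matches the library path from the question (/usr/lib/openmpi/lib implies a prefix of /usr/lib/openmpi):
mpirun --prefix /usr/lib/openmpi --host myHost -n 1 bin/ua.B.x inputua.data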
Related?
https://github.com/easybuilders/easybuild-easyconfigs/issues/204

How do you specify nodes on `mpirun`'s command line?

How do I use mpirun's -machine flag?
To select which cluster node to execute on, I figured out how to use mpirun's -machinefile option, like this:
> mpirun -machinefile $HOME/utils/Host_file -np <integer> <executable-filename>
Host_file contains a list of the nodes, one on each line.
But I want to submit a whole bunch of processes with different arguments and I don't want them running on the same node. That is, I want to do something like
> mpirun -machinefile $HOME/utils/Host_file -np 1 filename 1
nano Host_file % change the first node name
> mpirun -machinefile $HOME/utils/Host_file -np 1 filename 2
nano Host_file
> mpirun -machinefile $HOME/utils/Host_file -np 1 filename 3
nano Host_file
...
I could use the -machine flag and then just type a different node for each execution. But I can't get it to work. For example
> mpirun -machine node21-ib -np 1 FPU
> mpirun -machine node21 -np 1 FPU
always executes on the master node.
I also tried the -nodes option
> mpirun -nodes node21-ib -np 1 FPU
> mpirun -nodes node21 -np 1 FPU
But that just executes on my current node.
Similarly, I've tried the -nolocal and -exclude options without success.
So I have a simple question: How do I use the -machine option? Or is there a better way to do this (for a Linux newbie)?
I'm using the following version of MPI, which seems to have surprisingly little documentation on the web (so far, the entirety of the documentation I have comes from > mpirun --help).
> mpichversion
MPICH Version: 1.2.7
MPICH Release date: $Date: 2005/06/22 16:33:49$
MPICH Patches applied: none
MPICH configure: --with-device=ch_gen2 --with-arch=LINUX -prefix=/usr/local/mvapich-gcc --with-romio --without-mpe -lib=-L/usr/lib64 -Wl,-rpath=/usr/lib64 -libverbs -libumad -lpthread
MPICH Device: ch_gen2
Thanks for your help.
What you need is to specify a hosts file.
For example, in your mpirun command try mpirun -np 4 -hostfile hosts ./exec,
where hosts contains your node addresses, typically in the form 192.168.1.201:8, where the last digit is the maximum number of cores; put each node on its own line. Ideally you should install some cluster management software, for example Torque and Maui.
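A minimal hosts file, assuming two hypothetical nodes with 8 cores each, would look like:
192.168.1.201:8
192.168.1.202:8
mpirun -np 4 -hostfile hosts ./exec then starts the four processes on the listed nodes.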

How do you configure a Hudson Linux slave to generate core files?

I've been seeing occasional segmentation faults in glibc on several different Fedora Core 9 Hudson slaves. I've attempted to configure each slave to generate core files and place them in /corefiles, but have had no luck.
Here is what I've done on each Linux slave:
1) Created a corefile storage location
sudo install -m 1777 -d /corefiles
2) Directed the corefiles to the storage location by adding the following to /etc/sysctl.conf
kernel.core_pattern = /corefiles/core.%e-PID:%p-%t-signal_%s-%h
3) Enabled unlimited corefiles for all users by adding the following to /etc/profile
ulimit -c unlimited
Is there some additional Linux magic required or do I need to do something to the Hudson slave or JVM?
Thanks for the help
Did you reboot or run "sysctl -p" (as root) after editing /etc/sysctl.conf?
Also, if I remember correctly, ulimit values are per user, and a plain ulimit call won't survive a reboot. You should add this to /etc/security/limits.conf:
* soft core unlimited
Or call ulimit in the script that starts Hudson if you don't want everyone to produce core dumps.
I figured this out :-).
The issue is that Hudson invokes bash as a non-interactive shell, which bypasses the ulimit setting in /etc/profile. The solution is to add the BASH_ENV environment variable to the Hudson slaves and set its value to a file that contains ulimit -c unlimited.
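A minimal sketch of that setup (the file path is only an illustration; any file readable by the build user works):
echo "ulimit -c unlimited" | sudo tee /etc/hudson-bash-env
export BASH_ENV=/etc/hudson-bash-env
Set BASH_ENV in the slave's node configuration or in whatever script launches the slave agent, so that every non-interactive bash started by Hudson sources that file.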

I don't get a core dump from every process

I'm trying to get a core dump, so I use:
ulimit -c unlimited
I run my program in the background, and then I kill it:
kill -SEGV %1
But I just get:
[1]+ Exit 1 ./Test
and no core dump is created.
I did the same with other programs and it works, so why doesn't it work for this one? Can anybody help me?
Thanks. (GNU/Linux, Debian 2.6.26)
If your program traps the SEGV signal and does something else, it won't invoke the OS core dump routine. Check that it doesn't do that.
Under Linux, processes which change their user ID using setuid, seteuid or similar calls are excluded from dumping core for security reasons (think: /bin/passwd dumping core while it has /etc/shadow in memory).
You can re-enable core dumping in Linux programs which change their user ID by calling prctl(PR_SET_DUMPABLE, 1) after the change of UID.
Also, you might want to check that the program you're running does not change its working directory (chdir()), because then it will create the core file in a different directory than the one you're running it from.
And you can try this too:
kill -ABRT pid
Try (as root):
sysctl kernel.core_pattern=core
and then repeat your experiment. On some systems that variable is set to /dev/null by default.
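You can check the current setting first with:
cat /proc/sys/kernel/core_pattern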
However, if you see exit status 1, perhaps the program indeed intercepts the signal.

How to redirect output of an already running process [duplicate]

This question already has answers here:
Redirect STDERR / STDOUT of a process AFTER it's been started, using command line?
(8 answers)
Closed 6 years ago.
Normally I would start a command like
longcommand &
I know you can redirect it by doing something like
longcommand > /dev/null;
for instance to get rid of the output or
longcommand > output.log 2>&1
to capture output.
But I sometimes forget, and was wondering if there is a way to capture or redirect after the fact.
longcommand
ctrl-z
bg 2>&1 > /dev/null
or something like that so I can continue using the terminal without messages popping up on the terminal.
See Redirecting Output from a Running Process.
First, I run the command cat > foo1 in one session and check that data from stdin is copied to the file. Then, in another session, I redirect the output.
Start by finding the PID of the process:
$ ps aux | grep cat
rjc 6760 0.0 0.0 1580 376 pts/5 S+ 15:31 0:00 cat
Now check the file handles it has open:
$ ls -l /proc/6760/fd
total 3
lrwx------ 1 rjc rjc 64 Feb 27 15:32 0 -> /dev/pts/5
l-wx------ 1 rjc rjc 64 Feb 27 15:32 1 -> /tmp/foo1
lrwx------ 1 rjc rjc 64 Feb 27 15:32 2 -> /dev/pts/5
Now run GDB:
$ gdb -p 6760 /bin/cat
GNU gdb 6.4.90-debian
[license stuff snipped]
Attaching to program: /bin/cat, process 6760
[snip other stuff that's not interesting now]
(gdb) p close(1)
$1 = 0
(gdb) p creat("/tmp/foo3", 0600)
$2 = 1
(gdb) q
The program is running. Quit anyway (and detach it)? (y or n) y
Detaching from program: /bin/cat, process 6760
The p command in GDB prints the value of an expression; the expression can be a function call, including a system call. So I execute a close() system call and pass it file handle 1, then I execute a creat() system call to open a new file. The result of creat() was 1, which means that it replaced the previous file handle. If I wanted to use the same file for stdout and stderr, or wanted to replace a file handle with some other number, I would need to call the dup2() system call instead.
For this example I chose creat() instead of open() because it has fewer parameters. The C macros for the open() flags are not usable from GDB (it doesn't read C headers), so I would have to look them up in the header files; that's not hard, but it would take more time. Note that 0600 is the octal permission for the owner having read/write access while the group and others have no access. It would also work to use 0 for that parameter and run chmod on the file later on.
After that I verify the result:
ls -l /proc/6760/fd/
total 3
lrwx------ 1 rjc rjc 64 2008-02-27 15:32 0 -> /dev/pts/5
l-wx------ 1 rjc rjc 64 2008-02-27 15:32 1 -> /tmp/foo3 <====
lrwx------ 1 rjc rjc 64 2008-02-27 15:32 2 -> /dev/pts/5
Typing more data into cat results in the file /tmp/foo3 being appended to.
If you want to close the original session you need to close all file handles for it, open a new device that can be the controlling tty, and then call setsid().
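The same sequence can also be run non-interactively with gdb's batch mode; a sketch using the PID and file from the example above (assuming gdb can resolve close and creat in the attached process):
$ gdb -p 6760 -batch -ex 'p close(1)' -ex 'p creat("/tmp/foo3", 0600)'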
You can also do it using reredirect (https://github.com/jerome-pouiller/reredirect/).
The command below redirects the outputs (standard and error) of the process PID to FILE:
reredirect -m FILE PID
The README of reredirect also explains other interesting features: how to restore the original state of the process, how to redirect to another command or to redirect only stdout or stderr.
The tool also provides relink, a script that redirects the outputs to the current terminal:
relink PID
relink PID | grep useful_content
(reredirect seems to have the same features as Dupx, described in another answer, but it does not depend on gdb.)
Dupx
Dupx is a simple *nix utility to redirect standard output/input/error of an already running process.
Motivation
I've often found myself in a situation where a process I started on a remote system via SSH takes much longer than I had anticipated. I need to break the SSH connection, but if I do so, the process will die when it tries to write something to the stdout/stderr of a broken pipe. I wish I could suspend the process with ^Z and then do a
bg %1 >/tmp/stdout 2>/tmp/stderr
Unfortunately this will not work (in shells I know).
http://www.isi.edu/~yuri/dupx/
Screen
If the process is running in a screen session, you can use screen's log command to log the output of that window to a file:
Switch to the script's window and press C-a H to start logging.
Now you can:
$ tail -f screenlog.2 | grep whatever
From screen's man page:
log [on|off]
Start/stop writing output of the current window to a file "screenlog.n" in the window's default directory, where n is the number of the current window. This filename can be changed with the 'logfile' command. If no parameter is given, the state of logging is toggled. The session log is appended to the previous contents of the file if it already exists. The current contents and the contents of the scrollback history are not included in the session log. Default is 'off'.
I'm sure tmux has something similar as well.
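For tmux, the closest equivalent is the pipe-pane command; a sketch (run inside the session, the log path is arbitrary):
tmux pipe-pane -o 'cat >> ~/output.log'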
I collected some information on the internet and prepared a script that requires no external tool: see my response here. Hope it's helpful.
