"mpirun was unable to launch the specified application as it could not find an executable" error - openmpi

I came across a strange problem: something that used to work no longer does.
I run an Open MPI program with TAU profiling across two computers. It seems that mpirun can't run the tau_exec program on the remote host. Maybe it's a permission issue?
cluster@master:~/software/mpi_in_30_source/test2$ mpirun -np 2 --hostfile hostfile -d tau_exec -v -T MPI,TRACE,PROFILE ./hello.exe
[master:19319] procdir: /tmp/openmpi-sessions-cluster@master_0/4568/0/0
[master:19319] jobdir: /tmp/openmpi-sessions-cluster@master_0/4568/0
[master:19319] top: openmpi-sessions-cluster@master_0
[master:19319] tmp: /tmp
[slave2:06777] procdir: /tmp/openmpi-sessions-cluster@slave2_0/4568/0/1
[slave2:06777] jobdir: /tmp/openmpi-sessions-cluster@slave2_0/4568/0
[slave2:06777] top: openmpi-sessions-cluster@slave2_0
[slave2:06777] tmp: /tmp
[master:19319] [[4568,0],0] node[0].name master daemon 0 arch ff000200
[master:19319] [[4568,0],0] node[1].name slave2 daemon 1 arch ff000200
[slave2:06777] [[4568,0],1] node[0].name master daemon 0 arch ff000200
[slave2:06777] [[4568,0],1] node[1].name slave2 daemon 1 arch ff000200
[master:19319] Info: Setting up debugger process table for applications
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_proctable_size = 2
MPIR_proctable:
(i, host, exe, pid) = (0, master, /home/cluster/software/mpi_in_30_source/test2/tau_exec, 19321)
(i, host, exe, pid) = (1, slave2, /home/cluster/software/mpi_in_30_source/test2/tau_exec, 0)
--------------------------------------------------------------------------
mpirun was unable to launch the specified application as it could not find an executable:
Executable: tau_exec
Node: slave2
while attempting to start process rank 1.
--------------------------------------------------------------------------
[slave2:06777] sess_dir_finalize: job session dir not empty - leaving
[slave2:06777] sess_dir_finalize: job session dir not empty - leaving
[master:19319] sess_dir_finalize: job session dir not empty - leaving
[master:19319] sess_dir_finalize: proc session dir not empty - leaving
orterun: exiting with status -123
On slave2:
cluster@slave2:~/software/mpi_in_30_source/test2$ tau_exec -T MPI,TRACE,PROFILE ./hello.exe
hello MPI user: from process = 0 on machine=slave2, of NCPU=1 processes
cluster@slave2:~/software/mpi_in_30_source/test2$ which tau_exec
/home/cluster/tools/tau-2.22.2/arm_linux/bin/tau_exec
So there is a working tau_exec on both nodes. When I run mpirun without tau_exec everything works.
cluster@master:~/software/mpi_in_30_source/test2$ mpirun -np 2 --hostfile hostfile ./hello.exe
hello MPI user: from process = 0 on machine=master, of NCPU=2 processes
hello MPI user: from process = 1 on machine=slave2, of NCPU=2 processes

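Try putting the full path to tau_exec in your command line. It's possible that your PATH isn't the same on all of the nodes: mpirun starts the remote processes through a non-interactive session (typically ssh), which may not read the same shell startup files that set PATH in your interactive login. If that's the case, the executable can't be found on any node whose PATH doesn't include the TAU bin directory.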
It's most likely not a permission issue, but I don't remember all of the error messages in Open MPI to tell you how helpful they might be.
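To see why the full path helps: a command name without a slash is looked up in the PATH of the environment the remote process starts in, while a name with a directory part is used as given. The Python sketch below mimics that lookup with shutil.which and two hypothetical PATH values standing in for master and slave2; the tau_exec file it creates is a dummy, not the real TAU binary.

```python
import os
import shutil
import stat
import tempfile
from typing import Optional


def resolve(name: str, path_env: str) -> Optional[str]:
    """Mimic a launcher's lookup: a bare name is searched for in path_env
    (a PATH-style string); a name with a directory part is only checked
    where it points, so path_env does not matter for it."""
    return shutil.which(name, path=path_env)


with tempfile.TemporaryDirectory() as tau_bin, tempfile.TemporaryDirectory() as other_bin:
    # Dummy stand-in for TAU's bin/tau_exec (not the real tool).
    tau_exec = os.path.join(tau_bin, "tau_exec")
    with open(tau_exec, "w") as f:
        f.write("#!/bin/sh\nexit 0\n")
    os.chmod(tau_exec, os.stat(tau_exec).st_mode | stat.S_IXUSR)

    path_on_master = tau_bin + os.pathsep + other_bin  # hypothetical: includes TAU
    path_on_slave2 = other_bin                         # hypothetical: lacks TAU

    found_on_master = resolve("tau_exec", path_on_master)
    found_on_slave2 = resolve("tau_exec", path_on_slave2)
    full_path_on_slave2 = resolve(tau_exec, path_on_slave2)

print(found_on_master == tau_exec)      # True
print(found_on_slave2)                  # None
print(full_path_on_slave2 == tau_exec)  # True
```

In the question's setup this would mean something like mpirun -np 2 --hostfile hostfile /home/cluster/tools/tau-2.22.2/arm_linux/bin/tau_exec -v -T MPI,TRACE,PROFILE ./hello.exe, assuming TAU is installed at that same path on both nodes (the path is the one which tau_exec printed on slave2).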

I once had an error like this when I tried to give the output file a custom name.
Just try leaving it with the default name:
mpirun -n <number> a.out
That is how it worked for me!

Maybe it's because you have Open MPI installed as well as MPICH2, so you should run the command below as root:
root~# update-alternatives --config mpirun
There are 2 choices for the alternative mpirun (providing /usr/bin/mpirun).
Selection | Path | Priority | Status
*0 | /usr/bin/mpirun.openmpi | 50 | auto mode
1 | /usr/bin/mpirun.mpich2 | 40 | manual mode
2 | /usr/bin/mpirun.openmpi | 50 | manual mode
Press enter to keep the current choice[*], or type selection number: 1
Then select the MPICH2 version, as above, and mpirun should run normally.
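After switching, one way to confirm which implementation mpirun now points to is to follow the symlink chain (on Debian/Ubuntu, /usr/bin/mpirun links to /etc/alternatives/mpirun, which links to the selected binary). The sketch below does this with os.path.realpath; the demo builds a throwaway chain in a temporary directory instead of touching the real /usr/bin.

```python
import os
import shutil
import tempfile
from typing import Optional


def mpirun_target(path_env: Optional[str] = None) -> Optional[str]:
    """Return the file `mpirun` finally resolves to after following all
    symlinks, or None if it is not on PATH (path_env=None uses the real PATH)."""
    found = shutil.which("mpirun", path=path_env)
    return os.path.realpath(found) if found else None


# Demo chain shaped like: bin/mpirun -> alternatives/mpirun -> bin/mpirun.openmpi
with tempfile.TemporaryDirectory() as root:
    bindir = os.path.join(root, "bin")
    altdir = os.path.join(root, "alternatives")
    os.mkdir(bindir)
    os.mkdir(altdir)
    real = os.path.join(bindir, "mpirun.openmpi")
    with open(real, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(real, 0o755)
    os.symlink(real, os.path.join(altdir, "mpirun"))
    os.symlink(os.path.join(altdir, "mpirun"), os.path.join(bindir, "mpirun"))

    target = mpirun_target(bindir)
    expected = os.path.realpath(real)

print(os.path.basename(target))  # mpirun.openmpi
print(target == expected)        # True
```

On a real machine, mpirun_target() with no argument should end in mpirun.mpich2 or mpirun.openmpi depending on the selection.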

If you're running a shell script with mpirun, make sure you've run chmod +x script_file.sh; otherwise you'll see this error.
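A minimal Python equivalent of that chmod +x step, assuming you want execute permission for owner, group and others. Plain chmod +x (without a u/g/o letter) also leaves out any bits masked by your umask, which this sketch does not model; script_file.sh is only the placeholder name used in the answer.

```python
import os
import stat
import tempfile


def make_executable(path: str) -> None:
    """Add execute permission for owner, group and others (umask not applied)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


with tempfile.TemporaryDirectory() as d:
    script = os.path.join(d, "script_file.sh")
    with open(script, "w") as f:
        f.write("#!/bin/sh\necho hello\n")
    os.chmod(script, 0o644)          # a typical mode for a freshly written file
    before = stat.S_IMODE(os.stat(script).st_mode)
    make_executable(script)
    after = stat.S_IMODE(os.stat(script).st_mode)

print(oct(before), oct(after))  # 0o644 0o755
```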

Related

ps command -o option gives "ERROR: Garbage option"

I have two SUSE 11 machines, both with the same kernel version:
Linux version 2.6.32.59-0.7-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP 2012-07-13 15:50:56 +0200
On one machine the command below works:
ps -u test -o '%U %p %P %c'
but on the other it gives the following error:
ERROR: Garbage option.
********* simple selection ********* ********* selection by list *********
-A all processes -C by command name
-N negate selection -G by real group ID (supports names)
-a all w/ tty except session leaders -U by real user ID (supports names)
-d all except session leaders -g by session OR by effective group name
-e all processes -p by process ID
T all processes on this terminal -s processes in the sessions given
a all w/ tty, including other users -t by tty
g OBSOLETE -- DO NOT USE -u by effective user ID (supports names)
r only running processes U processes for specified users
x processes w/o controlling ttys t by tty
*********** output format ********** *********** long options ***********
-o,o user-defined -f full --Group --User --pid --cols --ppid
-j,j job control s signal --group --user --sid --rows --info
-O,O preloaded -o v virtual memory --cumulative --format --deselect
-l,l long u user-oriented --sort --tty --forest --version
-F extra full X registers --heading --no-heading --context
********* misc options *********
-V,V show version L list format codes f ASCII art forest
-m,m,-L,-T,H threads S children in sum -y change -l format
-M,Z security data c true command name -c scheduling class
-w,w wide output n numeric WCHAN,UID -H process hierarchy
I really can't figure out what the problem is. Can anyone suggest what could be wrong?
EDIT: which ps gives the same path on both machines:
/bin/ps
I checked the md5sum of the ps binary on both machines and they were different, so I suspect someone replaced this binary without notice. Copying ps from a correct source solved the problem.
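The md5sum comparison used above can also be done from Python with hashlib; md5_of below returns the same hex digest md5sum prints. The two file names are hypothetical stand-ins for copies of /bin/ps taken from each machine, filled with tiny dummy contents.

```python
import hashlib
import os
import tempfile


def md5_of(path: str, chunk_size: int = 1 << 16) -> str:
    """Hex MD5 digest of a file, the same value md5sum prints."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


with tempfile.TemporaryDirectory() as d:
    # Hypothetical stand-ins for /bin/ps copied from the two machines.
    ps_machine1 = os.path.join(d, "ps_machine1")
    ps_machine2 = os.path.join(d, "ps_machine2")
    with open(ps_machine1, "wb") as f:
        f.write(b"abc")
    with open(ps_machine2, "wb") as f:
        f.write(b"abd")
    digest1, digest2 = md5_of(ps_machine1), md5_of(ps_machine2)

print(digest1)             # 900150983cd24fb0d6963f7d28e17f72
print(digest1 == digest2)  # False
```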

Bash Shell: How can I know if TeamViewer has disconnected?

Sometimes TeamViewer disconnects itself (or gets disconnected) from its main servers on the Internet.
I am writing a script that checks whether the connection is lost and, if so, kills and restarts the relevant process to get TeamViewer up and running again.
The problem is: I don't know how to detect that TeamViewer has lost its remote access capability (that is, the capability to be accessed and controlled remotely).
What I have tried so far:
Checking the TeamViewer process and/or daemon. Not valid: they keep running even after a disconnect.
Checking the network interfaces (NICs). Not valid: TeamViewer does not seem to add any.
Looking at TeamViewer's main window. Not usable programmatically, or at least not easy to implement.
How can I programmatically know if TeamViewer has disconnected?
I don't know whether the method differs between platforms, but I would at least like a solution for a Linux shell, Bash if possible.
I'm probably late, but I ran into the same problem and found a possible solution. I'm using TeamViewer 12.
I noticed that in my case some GUI-related processes are sometimes not launched, so the machine does not show as online in my computer and contact list. If I ssh into it and list the TeamViewer processes using:
ps -ef | grep [t]eamviewer
I get just one process, the teamviewer daemon:
root 1808 1 0 09:22 ? 00:00:53 /opt/teamviewer/tv_bin/teamviewerd -d
But, when everything is fine I have:
root 1808 1 0 09:22 ? 00:00:53 /opt/teamviewer/tv_bin/teamviewerd -d
rocco 10975 8713 0 09:31 ? 00:00:58 /opt/teamviewer/tv_bin/wine/bin/wineserver
rocco 11064 10859 0 09:31 ? 00:00:33 /opt/teamviewer//tv_bin/TVGuiSlave.64 31 1
rocco 11065 10859 0 09:31 ? 00:00:28 /opt/teamviewer//tv_bin/TVGuiDelegate 31 1
So simply counting the number of processes works for me:
#!/bin/bash
# Succeeds (exit status 0) if the machine can reach the Internet.
online() {
    ping -c1 www.google.com > /dev/null 2>&1
}

if online
then
    # Count TeamViewer processes; the quoted [t] pattern keeps grep from matching itself.
    count=$(ps -ef | grep -c '[t]eamviewer')
    if [ "$count" -gt 3 ]
    then
        echo "Machine online, teamviewer connected"
    else
        echo "Machine online, teamviewer not connected, trying restart daemon"
        sudo teamviewer --daemon restart
    fi
fi
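The same process-count heuristic can be sketched in Python, asking ps only for each process's command line (ps -eo args=) so there is no grep pipeline to match itself. The threshold of 3 comes from the answer above and reflects TeamViewer 12 on that machine; it is an assumption for any other version or setup. The counting function is kept separate so it can be checked against the sample listing.

```python
import subprocess
from typing import Iterable

# From the bash answer (TeamViewer 12): more than 3 matching processes
# meant "connected" on that machine; other versions may differ.
CONNECTED_THRESHOLD = 3


def count_matching(command_lines: Iterable[str], needle: str = "teamviewer") -> int:
    """Count command lines containing `needle` (case-sensitive)."""
    return sum(1 for line in command_lines if needle in line)


def teamviewer_process_count() -> int:
    """Count TeamViewer processes from `ps -eo args=` (no header line).
    Caveat: this script's own command line counts too if its path
    contains the needle."""
    out = subprocess.run(["ps", "-eo", "args="], capture_output=True,
                         text=True, check=True).stdout
    return count_matching(out.splitlines())


def looks_connected(count: int) -> bool:
    return count > CONNECTED_THRESHOLD

# On a live machine: looks_connected(teamviewer_process_count())
```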
Have you considered trapping the signal (if possible) and executing a function that restarts TeamViewer?
Start it from a script and trap an exit signal:
function restartTV {
    # Re-start TeamViewer
    sudo /etc/init.d/something start
}
trap restartTV EXIT # or an appropriate signal
sudo /etc/init.d/something stop
# Do the work...

Linux jobs command - how to see the full path of the running process

I am using Linux SUSE 11 and running a lot of jobs.
The path of each job is very long, for example:
cmd>/user/data/some/very/very/very/long/path/to/my/command/run_me param0 param1 param2
When I run many of these commands, I want to know which have finished and which are still running, say after a day or so.
Using the jobs command, I see only the following:
[1] + Running ...
[2] + Running ...
[3] + Running ...
[4] + Running ...
So I can't tell which exact command is running.
The top command is not helpful either, because it shows the process and not the exact script/program I am running.
My shell is /usr/bin/tcsh
In tcsh, jobs -l gives you a "long" listing which includes the PID. You can then use this number to examine the job with ps, by poking around in the /proc pseudo-filesystem, or however you like.
% jobs -l
[1] + 19038 Running tail --follow=name /path/to/long/and/complex /long/and/complex/files /and/so/on ...
From this listing, you can grab 19038 and see what it's really doing.
% ps -o args= --width 1200 19038
tail --follow=name /path/to/long/and/complex /long/and/complex/files /and/so/on /really/long /etc/motd
as well as
% tr '\0' '\n' </proc/19038/cmdline
tail
--follow=name
/path/to/long/and/complex
/long/and/complex/files
/and/so/on
/really/long
/etc/motd
or, somewhat messily, with top:
% setenv COLUMNS 512
% top -b -n1 -c -p 19038
(The output is too ugly to show here and does not add anything useful.)
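The /proc route can also be scripted. /proc/<pid>/cmdline holds the arguments separated by NUL bytes (usually with a trailing NUL), which is why translating NULs to newlines with tr prints one argument per line. A minimal Python sketch, Linux only; 19038 would be the PID that jobs -l reported.

```python
import os
from typing import List


def split_cmdline(raw: bytes) -> List[str]:
    """Split the contents of /proc/<pid>/cmdline into arguments.

    The trailing NUL produces a final empty piece, which is dropped;
    empty arguments in the middle are kept."""
    parts = raw.split(b"\0")
    if parts and parts[-1] == b"":
        parts.pop()
    return [p.decode(errors="replace") for p in parts]


def cmdline_of(pid: int) -> List[str]:
    """Full argv of a running process (Linux /proc only)."""
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return split_cmdline(f.read())


print(split_cmdline(b"tail\0--follow=name\0/etc/motd\0"))
# ['tail', '--follow=name', '/etc/motd']
if os.path.exists("/proc/self/cmdline"):
    print(" ".join(cmdline_of(os.getpid())))  # this interpreter's own command line
```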

taskset-wrapped command has a "?" mark, not sure how it's introduced

I'm running commands from Python via Popen. On a multi-core system, I have a function that returns the string "/usr/bin/taskset -c <>" based on the system's utilization. I then put this string in front of the system command before passing it to Popen.
As far as I can tell, the taskset wrapper works: in the "ps -elf" output, the system command appears wrapped in a taskset command:
4 S root 18986 18978 0 80 0 - 15016 poll_s 10:54 pts/3 00:00:00 sudo /usr/bin/taskset -c 0?sudo /usr/sbin/tcpdump -s 0 -nei lo
I'm not sure what the "?" means; I don't see it when I run the command manually from a Linux console.
I'm issuing the command via Popen.
I have a function that decides whether the system is multi-core; if it is, it returns the string "sudo /usr/bin/taskset -c":
if multicore():
    taskstring = "sudo /usr/bin/taskset -c %s" % cpu
else:
    taskstring = ""
The subsequent command is tcpdump, so it will be:
command = taskstring+" sudo /usr/sbin/tcpdump -s 0 -nei lo"
cmd=command.split(" ")
subprocess.Popen(cmd,stdout=open('%s' % fileout,'w'),stderr=subprocess.STDOUT)
I'm running on Ubuntu 14.04, if that matters.
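ps prints characters it cannot display as "?", so one plausible cause (not confirmed in the question) is a non-printing character such as a trailing newline in cpu, which command.split(" ") leaves in place. A sketch that sidesteps string splitting by building the argument list directly and validating cpu; multicore() and the utilization function are the question's own and are not reproduced here.

```python
from typing import List, Optional

TCPDUMP = ["sudo", "/usr/sbin/tcpdump", "-s", "0", "-nei", "lo"]


def build_argv(cpu: Optional[str]) -> List[str]:
    """Build Popen's argument list directly instead of splitting a string.

    cpu is whatever the utilization function returned (e.g. "0" or "0\\n"),
    or None when the machine is not multi-core. Surrounding whitespace is
    stripped and anything that is not a plain taskset CPU list is rejected."""
    if cpu is None:
        return list(TCPDUMP)
    cpu_list = str(cpu).strip()
    if not cpu_list or any(ch not in "0123456789,-" for ch in cpu_list):
        raise ValueError("unexpected CPU list: %r" % (cpu,))
    return ["sudo", "/usr/bin/taskset", "-c", cpu_list] + TCPDUMP


# How a stray newline survives the original str.split(" ") approach:
broken = ("sudo /usr/bin/taskset -c %s" % "0\n") + " sudo /usr/sbin/tcpdump -s 0 -nei lo"
print(repr(broken.split(" ")[3]))  # '0\n'
print(build_argv("0\n")[:4])       # ['sudo', '/usr/bin/taskset', '-c', '0']
# Usage (needs sudo, so not run here):
# subprocess.Popen(build_argv(cpu), stdout=open(fileout, "w"), stderr=subprocess.STDOUT)
```

Printing repr(cpu) just before building the command is a quick way to check whether a hidden character is actually present.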

setuid on an executable doesn't seem to work

I wrote a small C utility called killSPR to kill the following processes on my RHEL box. The idea is that anyone who logs into this Linux box can use this utility to kill the processes listed below (which doesn't work, as explained below).
cadmn@rhel /tmp > ps -eaf | grep -v grep | grep " SPR "
cadmn 5822 5821 99 17:19 ? 00:33:13 SPR 4 cadmn
cadmn 10466 10465 99 17:25 ? 00:26:34 SPR 4 cadmn
cadmn 13431 13430 99 17:32 ? 00:19:55 SPR 4 cadmn
cadmn 17320 17319 99 17:39 ? 00:13:04 SPR 4 cadmn
cadmn 20589 20588 99 16:50 ? 01:01:30 SPR 4 cadmn
cadmn 22084 22083 99 17:45 ? 00:06:34 SPR 4 cadmn
cadmn@rhel /tmp >
This utility is owned by the user cadmn (under which these processes run) and has the setuid flag set on it (shown below).
cadmn@rhel /tmp > ls -l killSPR
-rwsr-xr-x 1 cadmn cusers 9925 Dec 17 17:51 killSPR
cadmn@rhel /tmp >
The C code is given below:
/*
 * Program Name: killSPR.c
 * Description: A simple program that kills all SPR processes that
 * run as user cadmn
 */
#include <stdio.h>
#include <stdlib.h>   /* system() */

int main(void)
{
    char input[2];    /* fgets needs real storage, not an uninitialized pointer */
    printf("Before you proceed, find out under which ID I'm running. Hit enter when you are done...");
    fflush(stdout);
    fgets(input, sizeof input, stdin);
    const char *killCmd = "kill -9 $(ps -eaf | grep -v grep | grep \" SPR \" | awk '{print $2}')";
    system(killCmd);
    return 0;
}
A user (pmn) different from cadmn tries to kill the above-mentioned processes with this utility and fails (shown below):
pmn@rhel /tmp > ./killSPR
Before you proceed, find out under which ID I'm running. Hit enter when you are done...
sh: line 0: kill: (5822) - Operation not permitted
sh: line 0: kill: (10466) - Operation not permitted
sh: line 0: kill: (13431) - Operation not permitted
sh: line 0: kill: (17320) - Operation not permitted
sh: line 0: kill: (20589) - Operation not permitted
sh: line 0: kill: (22084) - Operation not permitted
pmn@rhel /tmp >
While the user is waiting at the prompt above, inspecting the killSPR process shows that it runs as user cadmn (see below), yet killSPR is unable to terminate the processes.
cadmn@rhel /tmp > ps -eaf | grep -v grep | grep killSPR
cadmn 24851 22918 0 17:51 pts/36 00:00:00 ./killSPR
cadmn@rhel /tmp >
BTW, none of the main partitions are mounted with nosuid:
pmn@rhel /tmp > mount | grep nosuid
pmn@rhel /tmp >
The setuid flag on the executable doesn't seem to have the desired effect. What am I missing here? Have I misunderstood how setuid works?
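First and foremost, the setuid bit only changes the effective UID the program runs with; the real UID is still that of the user who invoked it. To make the real UID match as well, the program has to call setreuid() itself (when the owner is not root, setuid() only changes the effective UID, which is already set). Without that, bash, which system() runs as /bin/sh on RHEL, sees that the real and effective UIDs differ and drops back to the invoking user.
Avoid system() and exec'ing a shell, because bash drops those privileges for security reasons. You can call kill() directly to kill the processes.
Check these out: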
http://linux.die.net/man/2/setuid
http://man7.org/linux/man-pages/man2/setreuid.2.html
http://man7.org/linux/man-pages/man2/kill.2.html
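Per kill(2), an unprivileged process may signal a target when its real or effective UID equals the target's real or saved set-user-ID, so a setuid-cadmn program that calls kill() itself should pass that check even though its real UID is pmn. Linux ignores the setuid bit on interpreted scripts, so the Python sketch below cannot replace the compiled killSPR binary; it only illustrates the logic that would replace the ps | grep | awk | kill pipeline: find PIDs in /proc and signal each one directly. Matching on the basename of argv[0] being SPR is an assumption based on the ps listing and is stricter than grep " SPR ". In killSPR.c the same loop would call kill(pid, SIGKILL).

```python
import os
import signal
from typing import Dict, List


def read_cmdlines(proc: str = "/proc") -> Dict[int, List[str]]:
    """Map PID -> non-empty arguments for every process whose cmdline is readable."""
    result = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:  # process exited meanwhile, or not readable
            continue
        args = [a.decode(errors="replace") for a in raw.split(b"\0") if a]
        if args:  # kernel threads have an empty cmdline
            result[int(entry)] = args
    return result


def pids_named(cmdlines: Dict[int, List[str]], name: str) -> List[int]:
    """PIDs whose program name (basename of argv[0]) equals `name`."""
    return sorted(pid for pid, argv in cmdlines.items()
                  if os.path.basename(argv[0]) == name)


def kill_all(name: str, sig: int = signal.SIGKILL) -> Dict[int, str]:
    """Send `sig` (SIGKILL, like kill -9) to each matching process without a
    shell, recording the outcome per PID."""
    outcome = {}
    for pid in pids_named(read_cmdlines(), name):
        if pid == os.getpid():
            continue
        try:
            os.kill(pid, sig)
            outcome[pid] = "signalled"
        except ProcessLookupError:
            outcome[pid] = "already gone"
        except PermissionError:  # EPERM, as in the question's output
            outcome[pid] = "not permitted"
    return outcome


if hasattr(os, "getresuid"):
    print(os.getresuid())  # (real, effective, saved) UIDs of this process
```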
You should replace your system() call with an exec call. The manual for system() says it does not work properly from a set-user-ID program, because the shell drops privileges.
The reason is explained in man system:
Do not use system() from a program with set-user-ID or set-group-ID privileges, because strange values for some environment variables might be used to subvert system integrity. Use the exec(3) family of functions instead, but not execlp(3) or execvp(3). system() will not, in fact, work properly from programs with set-user-ID or set-group-ID privileges on systems on which /bin/sh is bash version 2, since bash 2 drops privileges on startup. (Debian uses a modified bash which does not do this when invoked as sh.)
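If you replace system() with exec, you won't be able to use shell syntax (such as the $(...) and pipes in killCmd) unless you exec /bin/sh -c <shell command>, which is what system() actually does; that may reintroduce the bash privilege drop described in the quote above.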
Check out this link on making a shell script a daemon:
Best way to make a shell script daemon?
You might also want to search for 'linux script to service'; I found a couple of links on this subject.
The idea is that you wrap the work in a small 'service'-type script with some basic controls, so a user can control a program that runs as another user by calling that script instead. For example, you could wrap up /usr/var/myservice/SPRkiller as a 'service' script that any user could then call like this: service SPRkiller start. SPRkiller would then run and kill the appropriate processes (assuming the SPR 'program' is run as a non-root user).
This is what it sounds like you are trying to achieve. Running a program (shell script/C program/whatever) carries the same user restrictions on it no matter what (except for escalation bugs/hacks).
On a side note, you seem to have a slight misunderstanding of user rights on Linux/Unix, as well as of what certain commands and functions do. If a user does not have permission to do a certain action (like killing another user's process), then calling setuid on the program you want to kill (or on kill itself) will have no effect, because the user does not have permission to access another user's 'space' without superuser rights. So even if you're in a shell script or a C program and call the same system command, you will get the same effect.
http://www.linux.com/learn/ is a great resource, and here's a link for file permissions
Hope that helps.
