Running a program in a MPI process, which exec file is myapp_mpi. The command line below launches the app correctly
mpirun --bind-to none -np 128 myapp_mpi apprun -resethway -noconfout -nsteps 8000 -s benchData.tpr -cpo state.cpt -e ener.edr -dlb no -pin off -v
I now wish to constrain the thread binding with a script pin.sh. The command line below would then produce an error.
mpirun --bind-to none -np 128 pin.sh myapp_mpi apprun -resethway -noconfout -nsteps 8000 -s benchData.tpr -cpo state.cpt -e ener.edr -dlb no -pin off -v
I get the following error
--------------------------------------------------------------------------
Open MPI tried to fork a new process via the "execve" system call but
failed. Open MPI checks many things before attempting to launch a
child process, but nothing is perfect. This error may be indicative
of another problem on the target host, or even something as silly as
having specified a directory for your application. Your job will now
abort.
Local host: machine001
Working dir: /home/user/myapp/bin
Application name: /home/user/myapp/bin/pin.sh
Error: Exec format error
--------------------------------------------------------------------------
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error:
Error code: 1
Error name: (null)
Node: machine001
when attempting to start process rank 0.
--------------------------------------------------------------------------
125 total processes failed to start
File locations are good, a priori. Any clue ?
Related
I have a BitBake build process that runs on a Docker container (CentOS 7). The BitBake fails during recipe gcc-cross-i586-5.2.0-r0: task do_compile on each run that I try it in.
An example of bitbake's output:
NOTE: recipe gcc-cross-i586-5.2.0-r0: task do_compile: Started
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
NOTE: Tasks Summary: Attempted 1538 tasks of which 17 didn't need to be rerun and all succeeded.
Is this a problem with recipe gcc-cross-i586-5.2.0-r0: task do_compile? Perhaps an out-of-memory error? I don't know what the -9 refers to or how to find out more information about it.
Try:
$ bitbake -c cleansstate gcc-cross ; bitbake -k gcc-cross
How much you have memory of ram?
Report log error here.
This worked for me,
Edit conf/local.conf and decrease the number of working threads by adding the following to you conf/local.conf file (under the build directory):
BB_NUMBER_THREADS = "6"
Just a long shot, -9 in kernel land means EBADF (bad file number.) Is it possible you have done some operations as root and some files are not accessible during the build? Is the issue reproducible? ie. can you rm -rf tmp and does it happen again? Make sure you don't have any permissions issues in your project directory and associated file system(s).
I have a custom program (e.g. myexe) being executed by a web app using PHP's exec() function. It does not fail when run using the PHP CLI nor does myexe fail when run from the command line with me as a user. I have built myexe so that there are no memory issues when profiled using valgrind. myexe is about 26MB in size.
To simplify the situation, I have run myexe on the command line under the user 'apache' and reproduced the failure.
su -s /bin/sh apache -c "/usr/local/bin/myexe parm1 parm2..."
==> Segmentation fault (core dumped)
BUT when I change the user to myself and run the same command above, it works.
su -s /bin/sh mike -c "/usr/local/bin/myexe parm1 parm2..."
==> WORKS
Here's the error from the system log file:
Jul 9 18:26:15 DEVSTN-1 kernel: myexe[27352]: segfault at 7fffa2bf9ff8 ip 0000000000410324 sp 00007fffa2bfa000 error 6 in myexe[400000+5ae000]
Jul 9 18:26:16 DEVSTN-1 abrt[27353]: Saved core dump of pid 27352 (/usr/local/bin/myexe) to /var/spool/abrt/ccpp-2015-07-09-18:26:15-27352 (13631488 bytes)
Jul 9 18:26:16 DEVSTN-1 abrtd: Directory 'ccpp-2015-07-09-18:26:15-27352' creation detected
Jul 9 18:26:17 DEVSTN-1 abrtd: Executable '/usr/local/bin/myexe' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Jul 9 18:26:17 DEVSTN-1 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2015-07-09-18:26:15-27352' exited with 1
Jul 9 18:26:17 DEVSTN-1 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2015-07-09-18:26:15-27352'
My configuration:
CentOS6 2.6.32-504.23.4.el6.x86_64
Apache/2.2.15 (CentOS)
PHP Version 5.3.3
Am I correct with assuming that PHP has nothing to do with the error?
What should I do next?
Correct; PHP has nothing to do with the error. This is a segmentation fault caused by invalid memory access (either overflowing a buffer, or accessing already-freed memory) in myexe. It seems to have saved a core dump to /var/spool/abrt/ccpp-2015-07-09-18:26:15-27352, so, try debugging with GDB:
gdb /usr/local/bin/myexe -c /var/spool/abrt/ccpp-2015-07-09-18:26:15-27352
(gdb) bt
And try to see where the executable is failing. To get useful output, it will need to be compiled with debugging symbols. If it doesn't fail running as root or a different user, or running in an interactive terminal, I'd look for bugs that could be triggered by being unable to open a file, unable to read an expected environment variable, etc. to help isolate your problem.
Running the executable under strace might help figure out what's going on as well.
Found the problem by entering a bash shell user user apache and running the program using gdb.
Turns out myexe was trying to create a directory under the user's home dir (/home/apache) which doesn't exist.
What helped me was knowing how to start a shell under a different user and using gdb.
Here's the command to start a shell under another user (apache):
su -s /bin/bash apache
I am trying to run Mesos (without zookeeper) using monit to keep slaves running.
I use the following scripts to start and stop mesos slaves:
start-slave.sh:
#!/bin/bash
nohup /home/someuser/mesos/build/bin/mesos-slave.sh
--master=192.168.0.241:5050
--strict=false
--log_dir=/home/someuser/mesos/logs < /dev/null &
sleep 1
pidof lt-mesos-slave > /home/someuser/run/mesos-slave.pid
stop-slave.sh:
#!/bin/bash
cat /home/someuser/run/mesos-slave.pid | xargs kill -9
When I start the scripts via ssh they work great. However when I use them via monit as follows the slaves register (I can see them on the online interface) but when I try to run a computation using spark, it fails in the sense that most of the tasks are lost.
Monit setup:
check process mesos-slave with pidfile /home/someuser/run/mesos-slave.pid
start program = "/home/someuser/run/start-mesos.sh"
as uid someuser
stop program = "/home/someuser/run/stop-mesos.sh"
as uid someuser
if failed port 5051 then restart
Log exerp:
I0925 14:06:21.461169 10633 slave.cpp:2413] Executor '20140924-160043-4043352256-5050-7966-0' of framework 20140925-140255-4043352256-5050-11608-0000 has terminated with signal Killed
E0925 14:06:21.461323 10639 slave.cpp:2686] Failed to unmonitor container for executor 20140924-160043-4043352256-5050-7966-0 of framework 20140925-140255-4043352256-5050-11608-0000: Not monitored
I0925 14:06:21.462224 10633 slave.cpp:2018] Handling status update TASK_LOST (UUID: 8258a34e-7831-4e5d-ba55-6df2b905b5ba) for task 0 of framework 20140925-140255-4043352256-5050-11608-0000 from #0.0.0.0:0
Am I using Monit correctly?
I have stuck in a such problem that my c++ server program can't coredump when terminate abnormally. The program running in daemon mode with chdir to '/'.
I had done the following things:
ulimit -c unlimited, so coredump enabled.
echo "/tmp/coredump/core.%e.%p.%t" > /proc/sys/kernel/core_pattern, and chmod a+w coredump, so it has the permission to write coredump file.
and I had try such things:
send a SIGABRT via kill -6, it can coredump.
in dmesg, I can't find any info about the abnormally terminates process.
running the program not in daemon mode.
My OS version: CentOS release 6.4 (Final), x86_64
ps. the server program installed a signal handler (sigaction() with flag SA_RESETHAND) to catch such signals {SIGHUP, SIGINT, SIGQUIT, SIGTERM} for normally terminates(free resources). so it can exclude the signal shielding.
When I start the Cygwin shell in Windows XP, I get a stream of messages like this:
0 [main] -bash 2920 C:\cygwin\bin\bash.exe: *** fatal error - prefork: couldn't create pipe process trackerWin32 error 161
I get yet more messages whenever I run scripts and commands. The shell appears to work underneath the flood of errors, but it is very annoying.
Is there any way to find out what is causing this or to fix it?