Shell script process is getting killed automatically - Linux

I am facing a problem with a shell script. I have a script that runs in an infinite loop, say with PID X. The process runs for 4-5 hours and then gets killed automatically. This happens only on systems that have been running for a long time, and sometimes I observe it getting killed after 2 hours as well.
I am not able to find the reason why it goes down or why it gets killed. No one is using the system other than me, and I am running the process as the root user.
Can anyone explain, or suspect a reason for, who is killing the process?
Below is the sample script:
#!/bin/bash
until ./test.tcl; do
echo "Server 'test.tcl' crashed with exit code $?. Respawing.." >&2
done
The test.tcl script itself also runs in an infinite loop; it traps signals and performs some special operations. But we find that test.tcl is also going down.
So is there any way to capture who killed it and how?

Enable core dumps on your system; this is the most commonly used method for crash analysis. I know it is a bit painful to run gdb on a core file, but more often than not you can find something out of it.
Here is a reference link for you: http://www.cyberciti.biz/tips/linux-core-dumps.html.
Another way is to trace your script with "strace -p PID-X". Note that it will slow your system down, especially over several hours as in your case, but it can be a last resort.
Hope the above is helpful to you.
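For example, a minimal sketch of both approaches (the core pattern path and trace log path are just illustrations; PID-X stands for the PID of your looping script):

# Allow core files for this shell and anything started from it
# (this helps the next run, not a process that is already running).
ulimit -c unlimited
# Write cores to a predictable place (needs root; the path is only an example).
echo "/tmp/core.%e.%p" > /proc/sys/kernel/core_pattern

# Attach strace to the running script and log every syscall and signal
# (-f follows children, -tt adds timestamps) until it dies.
strace -f -tt -o /tmp/trace.log -p PID-X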

It is better to check which signals are generated and delivered to that script by the OS at the time; one of those signals may be what is killing your process.
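One way to see which signals actually arrive is to log them from inside the script itself (the log path here is just an example). Note that SIGKILL cannot be trapped, so if the script dies with nothing logged, suspect an explicit kill -9 or the kernel's OOM killer:

# Log every catchable signal with a timestamp before the script dies.
for sig in HUP INT QUIT ABRT TERM USR1 USR2; do
    trap "echo \"\$(date): caught SIG$sig\" >> /tmp/signals.log" "$sig"
done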

Related

Linux job scheduler launching a script 2 hours after it terminates

I have a script that runs for an unknown period of time, depending on its input. It can run for one hour when little data is available, or for 8 hours if there is a lot of data to process.
I need to run it periodically, specifically 2 hours after the previous run has completed.
Is there a utility to do that?
Use 'at' instead of 'cron', and at the end of your script add:
echo "$0 $*" | at now + 2 hours
This means that each occurrence is chained - so if it terminates abnormally the next instance won't be scheduled - but I don't think there's a more robust solution without adding a lot of code/complexity.
I don't like the at solution proposed, so here is another one:
1. Use cron to launch your application every two hours.
2. On startup, your application(*) checks whether a pidfile exists.
2.1 If it is present, another instance may be running: read the pid from the file and check whether it belongs to a running process, a zombie, or nothing at all. If it is the pid of a running process, exit. If it is the pid of a zombie, the previous job ended unexpectedly, so delete the pidfile and go to step 3. Otherwise the pidfile is stale: delete it and go to step 3.
3. Create a new pidfile, write your own pid into it, and then proceed to do your job.
*: In order not to add complexity, the application mentioned here could also be a simple wrapper that spawns your real code using exec.
This solution can also be scripted quite easily.
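A minimal sketch of such a wrapper (the pidfile path and application path are placeholders, and the zombie case is left out for brevity):

#!/bin/bash
PIDFILE=/var/run/myjob.pid

if [ -f "$PIDFILE" ]; then
    oldpid=$(cat "$PIDFILE")
    # kill -0 sends no signal, it only checks whether the process exists.
    if kill -0 "$oldpid" 2>/dev/null; then
        echo "Another instance (pid $oldpid) is still running, exiting." >&2
        exit 1
    fi
    # Stale pidfile from a previous run: remove it and carry on.
    rm -f "$PIDFILE"
fi

echo $$ > "$PIDFILE"
# exec replaces this wrapper with the real job, so the pid in the file
# keeps pointing at the job itself.
exec /path/to/your_application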
Hope it helps,
SnoopyBBT
If it looks complicated, here is another, dirtier solution:
while true ; do
    ./your_application
    sleep 7200
done
Hope this helps,
SnoopyBBT

Monitor multiple instances of the same process

I'm trying to monitor multiple instances of the same process. I can't for the life of me do this without running into a problem.
All the examples I have seen so far on the internet involve me writing out the PID or monitoring the process itself. The issue is that if one instance fails, it doesn't mean all the rest have failed as well.
Writing out the PID for each process would probably mean starting each process with a short delay so I record the correct one, since the only way I have to record the PID is by probing the process name.
If I'm wrong about this, please correct me. But so far I haven't found a way to monitor each individual process when they all have the same name.
To add to the above, the processes are run in a batch script and each one is run in its own screen (ffmpeg would otherwise not be able to run in the background).
If anyone can point me vaguely in the right direction on how to do this in Linux I would really appreciate it. I read somewhere that it would be possible to set up symlinks which would then give me fake process names and that way I can monitor the 'fake' process name.
See man wait. For example, in a shell script:
wget "$url1" &
pid1=$!
wget "$url2" &
pid2=$!
wait $pid1 $pid2
This will launch both wget processes and wait until both have finished (or failed).
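If you start the instances from the same batch script, the same idea works for ffmpeg: record $! for each instance as you launch it, and you can then check (or wait on) each one individually. A rough sketch, assuming you can edit that script (file names, log paths and ffmpeg arguments are placeholders; redirecting stdin from /dev/null usually lets ffmpeg run in the background without needing screen):

#!/bin/bash
# Launch each ffmpeg instance in the background and remember its PID.
ffmpeg -i input1.ts output1.mp4 </dev/null &>/var/log/ffmpeg1.log &
echo $! > /var/run/ffmpeg1.pid

ffmpeg -i input2.ts output2.mp4 </dev/null &>/var/log/ffmpeg2.log &
echo $! > /var/run/ffmpeg2.pid

# Later, check one instance without touching the others.
if kill -0 "$(cat /var/run/ffmpeg1.pid)" 2>/dev/null; then
    echo "instance 1 is still running"
else
    echo "instance 1 has exited"
fi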

Stopping a nohup parallel simulation in R

I need to cancel a parallel processing simulation that I started on a Linux server using R.
Last night I connected to the server using ssh and started the simulation from the shell with the nohup command:
nohup R CMD BATCH mysimulation.R
The mysimulation.R file uses the mclapply command in the multicore package to spawn additional instances of R to run the simulation in parallel. For instance, when I run top in the shell I see ten different instances of R running, then finishing a run of the simulation, then starting again to continue with the additional simulation replications.
I have almost no experience working directly with a Linux server (the extent of my knowledge is cd, ls, and the nohup command mentioned above). I searched around a bit for a solution and thought killing the process might work. I tried:
kill -9 mypid (which said it killed the process).
However, the simulation continues to run. I can see the instances of R continuing to run, close, and respawn. Can anyone point me to resources or walk me through the specific steps I need to take to shut down this simulation? Thanks.
It seems you need to kill several processes at once. Maybe one of the answers to the post Best way to kill all child processes might help you.
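Since the worker processes forked by mclapply normally stay in the same process group as the parent R session, killing the whole group usually takes everything down at once. A rough sketch (replace mypid with the PID of the parent R process you see in top):

# Find the process group of the parent R session and kill the whole group.
pgid=$(ps -o pgid= -p mypid | tr -d ' ')
kill -- -"$pgid"        # sends SIGTERM to every process in the group
# If anything survives, you can fall back to killing all your R processes:
# pkill -9 -u "$USER" '^R$'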
You may also take a look at this page. I hope it helps.

How to find the reason for a dead process without a log file on Unix?

This is an interview question.
A developer started a process.
But when a customer wanted to use the process, they found it wasn't running.
The developer logged in and found the process had died. How can the developer find out what went wrong?
Follow-up: a running process is supposed to write logs to a file, but there are no logs in the file. How can the developer figure out what is going on inside the process?
I think:
If the program can be re-run, I would use gdb to track the process.
If not, check the output file from the process (the application program), or add print statements to the code.
But are there other ways to do it using information generated by the OS?
If you have the disk space and spare CPU power, you can leave strace following the program to catch the sequence leading up to exit.
One possible cause if the program died without leaving any trace is the Out-Of-Memory (OOM) killer. This will leave a message in the kernel log if it kills your process.
From the same answer, process accounting can be modified to provide some clues by telling you the exit code along with the exit time.
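If the OOM killer is the suspect, the kernel log is the place to look; a quick check (the exact wording of the message varies between kernel versions):

# Look for OOM killer activity in the kernel ring buffer.
dmesg | grep -i -E "killed process|out of memory"
# On systemd-based systems the same messages are kept in the journal:
# journalctl -k | grep -i "killed process"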
"are there other ways to do it using information generated by the OS?"
A core dump is one option.
Sometimes programs don't create core dumps. In that case, knowing the exit code of your software may help.
You can use the script below to start your software and log its exit status, which helps in finding the reason it exited.
Example:
#!/bin/bash
./myprogram
#get exit code
exitvalue=$?
#log exit code value to /var/log/messages
logger -s "exit code of my program is $exitvalue"
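As a small extension of the script above (exitvalue is the variable it already sets), exit statuses above 128 conventionally mean the program was terminated by a signal, which narrows down the cause:

# In bash, an exit status of 128+N means the process was killed by signal N
# (e.g. 137 = SIGKILL, 139 = SIGSEGV, 143 = SIGTERM).
if [ "$exitvalue" -gt 128 ]; then
    logger -s "myprogram was killed by signal $((exitvalue - 128))"
fi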
... use a debugger like gdb ...

Linux, timing out on subprocess

OK, I need to write some code that calls a script and, if the operation in the script hangs, terminates the process.
The preferred language is Python, but I'm also looking through C and bash script documentation too.
Seems like an easy problem, but I can't decide on the best solution.
From research so far:
Python: has some weird threading model where the interpreter runs only one thread at a time, so it won't work?
C: the preferred solution so far seems to be SIGALRM + fork + execl. But SIGALRM handlers are not heap safe, so they can trash everything?
Bash: timeout program? Not standard on all distros?
Since I'm a newbie to Linux, I'm probably unaware of 500 different gotchas with those functions, so can anyone tell me what's the safest and cleanest way?
Avoid SIGALRM because there is not much safe stuff to do inside the signal handler.
Considering the system calls that you should use, in C, after doing the fork-exec to start the subprocess, you can periodically call waitpid(2) with the WNOHANG option to inspect whether the subprocess is still running. If waitpid returns 0 (process is still running) and the desired timeout has passed, you can kill(2) the subprocess.
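A rough bash analogue of the same polling idea, for when C is not required (the script name and the 30-second deadline are just examples):

./long_time_script.sh &
pid=$!
deadline=$((SECONDS + 30))   # SECONDS is a bash builtin counter

# Poll periodically; kill -0 only tests whether the process still exists.
while kill -0 "$pid" 2>/dev/null; do
    if [ "$SECONDS" -ge "$deadline" ]; then
        echo "timeout reached, killing $pid" >&2
        kill "$pid"
        break
    fi
    sleep 1
done
wait "$pid" 2>/dev/null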
In bash you can do something similar to this:
start the script/program in background with &
get the process id of the background process
sleep for some time
and then kill the process (if it has already finished, the kill simply fails), or first check whether the process is still alive and only then kill it.
Example:
sh long_time_script.sh &
pid=$!
sleep 30s
kill $pid
You can even try to use trap 'script_stopped $pid' SIGCHLD - see the bash man page for more info.
UPDATE: I found another command, timeout. It does exactly what you need - it runs a command with a time limit. Example:
timeout 10s sleep 15s
will kill the sleep after 10 seconds.
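For completeness: timeout sends SIGTERM by default; -s picks a different signal and -k adds a follow-up SIGKILL if the command ignores the first one (the values here are only examples, and the exact options depend on your coreutils version):

# Send SIGTERM after 10s; if the script is still alive 5s later, SIGKILL it.
timeout -k 5s -s TERM 10s ./long_time_script.sh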
There is a collection of Python code that has features to do exactly this, and without too much difficulty if you know the APIs.
The Pycopia collection has the scheduler module for timing out functions, and the proctools module for spawning subprocesses and sending signals to them. The kill method can be used in this case.
