linux - check if program has died - linux

i wrote a program that needs to continuously run. but since im a bad programmer it crashes every so often. is there a way to have another program watch it and restart it when it crashes?

Not to be specious, but if you're a bad programmer, what's to say your watching programming won't fail too ;) And, you should get better so that you don't have this issue (for this reason). That said, you will probably have need of the following answer eventually.
However, if getting better isn't possible, just run a cron job at regular intervals looking for the name of your program in the output from 'ps'. And that answer you can get from superuser.com

No need for 3rd party programs
All of this can be accomplished with the linux inittab
inittab MAN pages
Look for "respawn"

You can use supervisord
http://supervisord.org/

Since Stackoverflow is a programming site, let me give you an overview of how such a watcher would be implemented.
First thing to know is that your watcher will have to start the watched program yourself. You do this with fork and exec.
What you can then do is wait for the program to exit. You can use of the wait system calls (i.e. wait, waitpid or wait4) depending on your specific needs. You can also catch SIGCHLD so that you can get asynchronously informed when your child exits (you will then need to call wait to get it's status).
Now that you have the status, you can tell if the process died due to a signal with the macro WIFSIGNALED. If that macro returns true, then your program crashed and needs to be restarted.

It still won't continuously run if you have another task monitoring it... it will still have a short amount of down time while it restarts.
Additionally, if you are acting as a network (or local) server process, you'll lose any state about requests in progress; I hope this is ok (Of course your clients may have built-in timeout and retry).
Finally, if your process crashed while it was in the middle of storing any persistent data, I hope it has a mechanism of coping with half-written files, etc.
However, if you intend it to be robust, all of these things should be true anyway, so you can use something like supervisord safely.

I use Monit to watch over my programs and services.

Related

Can a Linux process/thread terminate without pass through do_exit()?

To verify the behavior of a third party binary distributed software I'd like to use, I'm implementing a kernel module whose objective is to keep track of each child this software produces and terminates.
The target binary is a Golang produced one, and it is heavily multi thread.
The kernel module I wrote installs hooks on the kernel functions _do_fork() and do_exit() to keep track of each process/thread this binary produces and terminates.
The LKM works, more or less.
During some conditions, however, I have a scenario I'm not able to explain.
It seems like a process/thread could terminate without passing through do_exit().
The evidence I collected by putting printk() shows the process creation but does not indicate the process termination.
I'm aware that printk() can be slow, and I'm also aware that messages can be lost in such situations.
Trying to prevent message loss due to slow console (for this particular application, serial tty 115200 is used), I tried to implement a quicker console, and messages have been collected using netconsole.
The described setup seems to confirm a process can terminate without pass through the do_exit() function.
But because I wasn't sure my messages couldn't be lost on the printk() infrastructure, I decided to repeat the same test but replacing printk() with ftrace_printk(), which should be a leaner alternative to printk().
Still the same result, occasionally I see processes not passing through the do_exit(), and verifying if the PID is currently running, I have to face the fact that it is not running.
Also note that I put my hook in the do_exit() kernel function as the first instruction to ensure the function flow does not terminate inside a called function.
My question is then the following:
Can a Linux process terminate without its flow pass through the do_exit() function?
If so, can someone give me a hint of what this scenario can be?
After a long debug session, I'm finally able to answer my own question.
That's not all; I'm also able to explain why I saw the strange behavior I described in my scenario.
Let's start from the beginning: monitoring a heavily multithreading application. I observed rare cases where a PID that suddenly stops exists without observing its flow to pass through the Linux Kernel do_exit() function.
Because this my original question:
Can a Linux process terminate without pass through the do_exit() function?
As for my current knowledge, which I would by now consider reasonably extensive, a Linux process can not end its execution without pass through the do_exit() function.
But this answer is in contrast with my observations, and the problem leading me to this question is still there.
Someone here suggested that the strange behavior I watched was because my observations were somehow wrong, alluding my method was inaccurate, as for my conclusions.
My observations were correct, and the process I watched didn't pass through the do_exit() but terminated.
To explain this phenomenon, I want to put on the table another question that I think internet searchers may find somehow useful:
Can two processes share the same PID?
If you'd asked me this a month ago, I'd surely answered this question with: "definitively no, two processes can not share the same PID."
Linux is more complex, though.
There's a situation in which, in a Linux system, two different processes can share the same PID!
https://elixir.bootlin.com/linux/v4.19.20/source/fs/exec.c#L1141
Surprisingly, this behavior does not harm anyone; when this happens, one of these two processes is a zombie.
updated to correct an error
The circumstances of this duplicate PID are more intricate than those described previously. The process must flush the previous exec context if a threaded process forks before invoking an execve (the fork copies also the threads). If the intention is to use the execve() function to execute a new text, the kernel must first call the flush_old_exec()  function, which then calls the de_thread() function for each thread in the process other than the task leader. Except the task leader, all the process' threads are eliminated as a result. Each thread's PID is changed to that of the leader, and if it is not immediately terminated, for example because it needs to wait an operation completion, it keeps using that PID.
end of the update
That was what I was watching; the PID I was monitoring did not pass through the do_exit() because when the corresponding thread terminated, it had no more the PID it had when it started, but it had its leader's.
For people who know the Linux Kernel's mechanics very well, this is nothing to be surprised for; this behavior is intended and hasn't changed since 2.6.17.
Current 5.10.3, is still this way.
Hoping this to be useful to internet searchers; I'd also like to add that this also answers the followings:
Question: Can a Linux process/thread terminate without pass through do_exit()? Answer: NO, do_exit() is the only way a process has to end its execution — both intentional than unintentional.
Question: Can two processes share the same PID? Answer: Normally don't. There's some rare case in which two schedulable entities have the same PID.
Question: Do Linux kernel have scenarios where a process change its PID? Answer: yes, there's at least one scenario where a Process changes its PID.
Can a Linux process terminate without its flow pass through the do_exit() function?
Probably not, but you should study the source code of the Linux kernel to be sure. Ask on KernelNewbies. Kernel threads and udev or systemd related things (or perhaps modprobe or the older hotplug) are probable exceptions. When your /sbin/init of pid 1 terminates (that should not happen) strange things would happen.
The LKM works, more or less.
What does that means? How could a kernel module half-work?
And in real life, it does happen sometimes that your Linux kernel is panicking or crashes (and it could happen with your LKM, if it has not been peer-reviewed by the Linux kernel community). In such a case, there is no more any notion of processes, since they are an abstraction provided by a living Linux kernel.
See also dmesg(1), strace(1), proc(5), syscalls(2), ptrace(2), clone(2), fork(2), execve(2), waitpid(2), elf(5), credentials(7), pthreads(7)
Look also inside the source code of your libc, e.g. GNU libc or musl-libc
Of course, see Linux From Scratch and Advanced Linux Programming
And verifying if the PID is currently running,
This can be done is user land with /proc/, or using kill(2) with a 0 signal (and maybe also pidfd_send_signal(2)...)
PS. I still don't understand why you need to write a kernel module or change the kernel code. My intuition would be to avoid doing that when possible.

How to identify if a long-running process died?

I'm working on a daemon that communicates with several processes. The daemon can't monitor the processes all the time, but it must be able to properly identify if a process dies to release scare resources it holds for it.
The processes can communicate with the daemon, giving it some information at the start, but not vice versa. So the daemon can't just ask a process its identity.
The simplest form would be to use just their PID. But eventually another process could be assigned the same PID without my tool noticing.
A better approach would be to use PID plus the time the process started. A new process with the same PID would have a distinct start time. But I couldn't find a way how to get the process start time in a POSIX way. Using ps or looking at /proc/<pid>/stat seems not portable enough.
A more complicated idea that seems POSIX-compliant would be:
Each process creates a temporary file.
Locks it using flock
Tells my daemon "my identity is connected with this file".
Any time the daemon can check the temporary file. If it's locked, the process is alive. If it's not, the process is dead.
But this seems unnecessarily complicated.
Is there a better, or standard way?
Edit: The daemon must be able to resume after a restart, so it's not possible to keep a persistent connection for each process.
But I couldn't find a way how to get the process start time in a POSIX way.
Try the standard "etime" format specifier: LC_ALL=C ps -eo etime= $PIDS
In fairness, I would probably construct my own table of live processes rather that relying on the process table and elapsed time. That's fundamentally your file-locking approach, though I'd probably aggregate all the lockfiles together in a known place and name them by PID, e.g., /var/run/my-app/8819.lock. Indeed, this might even be retrofitted on to the long-running processes, since file locks on file descriptors can be inherited across exec().
(Of course, if the long-running processes I cared about had a common parent, then I'd rather query the common parent, who can be a reliable authority on which processes are running and which are not.)
The standard way is the unnecessarily complicated one. That' life in a POSIX-compliant environment...
Other methods than the file exist and have various benefits/tradeoffs - most of the "standard" IPC mechanisms would work for this as well - a socket, pipe, message queue, shared memory... Basically pick one mechanism that allows your application to announce to the daemon that it has started (and maybe that it's exiting, for an orderly shutdown). In between, it could send periodic "I'm still here" messages and the daemon could notice when it doesn't get one, or the daemon could poll periodically or something... There's quite a few ways to accomplish what you want, but without knowing more about the exact architecture you're trying to achieve, it's difficult to point at the "one best way"...

is it possible to run multiple executables in a shell script with the same PID?

I have a simple Shell script in which multiple executables will run sequentially. Every time a new executable starts to run, a new process with a new PID will start. Is it possible to run them with the same PID? I know for a shell script, we can use "source". But I do not know how to handle executables.
In principle, I believe it should be possible, but in practice it would be very complicated and brittle.
The exec family of system calls in Linux allows a process to replace itself with an entirely new process, which holds on to the same PID. The tricky part would be to somehow "return" from the second process back to the first. When exec is called, the OS loads everything it needs to start running the new process, and wipes out every piece of state related to the current process (the one being replaced). And when the new process terminates, the OS releases all resources (including the PID) associated with that process.
So if you really wanted to do this, you would have to hijack how processes terminate in order to restart your original process rather than let the OS clean everything up. How can you do this? Well, execle and execvpe functions allow a program to specify the environment of the new process before starting the process. Since every process depends on libc (or equivalent) to bring up/tear down a process, you should be able to provide a custom libc which would re-start execution of your script, or exec another process. The great difficulty would be hacking such a libc. Additionally, you would have to figure out a good way for your master program to keep state even though the OS wipes away any memory it might have been using when it called exec. You can probably accomplish this with temporary files.
So long story short, don't do it. While it's kind of fun for me to sit here and think about the massive hacks that would be required to pull this off, it would be a huge pain and I'm sure there is a much more elegant solution to whatever problem you are actually trying to solve.
The PID is assigned by OS when shell creates a new process. There is no way to tell the OS to use some specific PID. So it's not possible.

Can the operating system restart a process that is stuck in infinite loop?

The other day, when doing testing on a Linux server, we observed that under some conditions, one process could die and then started again. After checking the code, we found it was caused by an infinite loop.
This aroused my curiosity how the process went dead and then got started? Is it the OS who detects and determines the abnormal process and get it restarted? If yes, how does that work?
Let's assume you won't be able to fix your code... And let's ignore all crazy options like attaching gdb via script or so.
You can either check CPU usage (most accidental infinite loops that I've done used 100% of CPU for hours :) ), or (more likely option) use strace to check what the software is doing right now and implement your own signature tracing (if those 20 APIs repeats 20 times let's assume infinite loop or so).
For example:
#!/bin/bash
strace -p`cat your_app.pid` | ./your_signature_evaluator
# Or
strace -p12345 | ./your_signature_evaluator
As for automatic system recognition... It seems normal that program crashes after calling things in loop uncontrollably (for example malloc() until you deplete memory, opening files...), but I've (and correct me in comment if I'm wrong) never seen system (kernel) restarting the app. I think you've either:
have conditions (signal handling, whatever) inside program that helps to recover
you're running a watchdog (check every 20 seconds that <pid> is running and if not start new instance)
you're running distribution that provides service/program configuration with restart if stopped
But I really doubt that Linux would be so nice to your application on it's own.
If it could the person that wrote that kernel will have solved the halting problem
PS: Vytor - Web servers are in an infinite loop and do not use 100% CPU.

pthread_rwlock across processes: Repair after crash?

I'm working on linux and I'm using a pthread_rwlock, which is stored in shared memory and shared over multiple processes. This mostly works fine, but when I kill a process (SIGKILL) while it is holding a lock, it appears that the lock is still held (regardless of whether it's a read- or write-lock).
Is there any way to recognize such a state, and possibly even repair it?
The real answer is to find a decent way to stop a process. Killing it with SIGKILL is not a decent way to do it.
This feature is specified for mutexes, called robustness (PTHREAD_MUTEX_ROBUST) but not for rwlocks. The standard doesn't provide it and kernel.org doesn't even have a page on rwlocks. So, like I said:
Find another way to stop the process (perhaps another signal that can be handled ?)
Release the lock when you exit
#cnicutar - that "real answer" is pretty dubious. It's the kernel's job to handle cross process responsibilities of freeing of resources and making sure things are marked consistent - userspace can't effectively do the job when stuff goes wrong.
Granted if everybody plays nice the robust features will not be needed but for a robust system you want to make sure the system doesn't go down from some buggy client process.

Resources