How to identify if a long-running process died? - linux

I'm working on a daemon that communicates with several processes. The daemon can't monitor the processes all the time, but it must be able to reliably detect when a process dies so that it can release the scarce resources it holds for that process.
The processes can communicate with the daemon, giving it some information at the start, but not vice versa. So the daemon can't just ask a process its identity.
The simplest form would be to use just their PID. But eventually another process could be assigned the same PID without my tool noticing.
A better approach would be to use the PID plus the time the process started. A new process with the same PID would have a distinct start time. But I couldn't find a way to get the process start time in a POSIX way. Using ps or looking at /proc/<pid>/stat does not seem portable enough.
A more complicated idea that seems POSIX-compliant would be:
Each process creates a temporary file.
Locks it using flock
Tells my daemon "my identity is connected with this file".
Any time the daemon can check the temporary file. If it's locked, the process is alive. If it's not, the process is dead.
But this seems unnecessarily complicated.
Is there a better, or standard way?
Edit: The daemon must be able to resume after a restart, so it's not possible to keep a persistent connection for each process.

But I couldn't find a way to get the process start time in a POSIX way.
Try the standard "etime" format specifier, selecting the processes with -p: LC_ALL=C ps -o etime= -p "$PIDS"
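As a rough sketch of how a daemon could turn that output into a (PID, start time) identity: parse the elapsed time and subtract it from the current time. The helper names parse_etime and approx_start_time are made up for this illustration, and the result is only accurate to about a second, so any comparison of stored and fresh start times should allow some slack.

```python
import os
import subprocess
import time

def parse_etime(text):
    """Convert ps 'etime' output, formatted [[dd-]hh:]mm:ss, to seconds."""
    text = text.strip()
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    while len(parts) < 3:          # pad missing hours with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def approx_start_time(pid):
    """Return the approximate start time of pid in Unix seconds, or None if ps finds no such process."""
    result = subprocess.run(
        ["ps", "-o", "etime=", "-p", str(pid)],
        capture_output=True,
        text=True,
        env=dict(os.environ, LC_ALL="C"),
    )
    if result.returncode != 0 or not result.stdout.strip():
        return None
    return time.time() - parse_etime(result.stdout)
```

The daemon would store the pair (pid, start time) when a process registers and treat a later mismatch of more than a couple of seconds as "the PID was reused".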
In fairness, I would probably construct my own table of live processes rather than relying on the process table and elapsed time. That's fundamentally your file-locking approach, though I'd probably aggregate all the lockfiles together in a known place and name them by PID, e.g., /var/run/my-app/8819.lock. Indeed, this might even be retrofitted onto the long-running processes, since file locks on file descriptors can be inherited across exec().
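A minimal sketch of that lock-file table, assuming Python's fcntl.flock wrapper is acceptable for illustration. The function names hold_lock and is_alive are hypothetical, and the caller chooses the lock path (for instance one file per PID in a known directory).

```python
import fcntl
import os

def hold_lock(path):
    """Worker side: create the lock file and take an exclusive lock on it.

    The returned descriptor must stay open for the life of the process;
    the kernel drops the lock when the process exits or is killed.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd

def is_alive(path):
    """Daemon side: report whether some process still holds the lock on path."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        return False
    try:
        # A shared, non-blocking request fails only while an exclusive lock is held.
        fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
    except BlockingIOError:
        return True
    else:
        return False
    finally:
        os.close(fd)   # closing the descriptor also releases the probe lock
```

Because the check only needs the path, the daemon can be restarted and re-probe every lock file it finds without having kept any connection open.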
(Of course, if the long-running processes I cared about had a common parent, then I'd rather query the common parent, who can be a reliable authority on which processes are running and which are not.)

The standard way is the unnecessarily complicated one. That's life in a POSIX-compliant environment...

Methods other than the file exist and have various benefits/tradeoffs - most of the "standard" IPC mechanisms would work for this as well - a socket, pipe, message queue, shared memory... Basically pick one mechanism that allows your application to announce to the daemon that it has started (and maybe that it's exiting, for an orderly shutdown). In between, it could send periodic "I'm still here" messages and the daemon could notice when it doesn't get one, or the daemon could poll periodically or something... There are quite a few ways to accomplish what you want, but without knowing more about the exact architecture you're trying to achieve, it's difficult to point at the "one best way"...
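To make the "I'm still here" idea concrete, here is a small sketch of the bookkeeping on the daemon side, independent of which IPC mechanism delivers the messages. The class name HeartbeatTable and its methods are invented for this example, and the timeout is an arbitrary value the daemon would have to tune.

```python
import time

class HeartbeatTable:
    """Track the last heartbeat per client and report clients that went silent."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence before a client counts as dead
        self.last_seen = {}         # client id -> time of last heartbeat

    def beat(self, client_id, now=None):
        """Record a heartbeat; 'now' can be injected to make the logic testable."""
        self.last_seen[client_id] = time.monotonic() if now is None else now

    def reap(self, now=None):
        """Remove and return the clients that have been silent for longer than the timeout."""
        now = time.monotonic() if now is None else now
        dead = [c for c, seen in self.last_seen.items() if now - seen > self.timeout]
        for client_id in dead:
            del self.last_seen[client_id]
        return dead
```

The daemon would call beat() whenever a message arrives and reap() on a timer, releasing the resources of every client that reap() returns.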

Related

Correct way of calling fork() after parent has created threads?

I'm implementing a complex application that takes third-party plug-ins, and I want to run the plug-in code in child processes for isolation. The parent process needs to be multithreaded, but I have read that fork may be unsafe in multithreaded processes, particularly if you do not immediately call execve, and that pthread_atfork is not a complete solution.
What do other complex applications do about this? I know Chrome uses both subprocesses and multithreading simultaneously, so it must be possible.
The behavior of fork() in a multithreaded program is well-defined. On success, the child process has exactly one thread -- the same one that called fork() in the parent program. Although this can be a problem, whether it actually is a problem depends on the circumstances.
When is fork()ing a problem for a multithreaded program?
The main reason for fork()ing to present a problem in a multithreaded program is that the child process depends on mutexes, condition variables, etc. that other threads can no longer be relied upon to manipulate. For example, if the child needs to acquire a process-private mutex that it does not already hold, then it may be that that mutex was held by a different thread at the time of the fork. In that case, it will never be released in the child process, because no thread that could release it exists in the child.
When is fork()ing not a problem for a multithreaded program?
One of the common idioms involving fork() is to immediately follow it up by execing another program. That's no problem, regardless of the threadedness of the parent.
Alternatively, if the child process does not depend on any problematic resources, then nothing special need be done. Note that process-shared interthread objects are not "problematic" in this sense. This situation is fairly common, and it sounds like it might be your case.
Otherwise, it's not a problem if the parent's forking thread can and does acquire all the process-private interthread resources that the child will need before it forks. Handlers registered by pthread_atfork() can help with this under some circumstances, but under others, it makes more sense for that to be done in the immediate environs of the fork call.
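As an illustration of the "acquire before forking" pattern, here is a sketch in Python, whose os.register_at_fork plays roughly the role that pthread_atfork() plays in C. The lock state_lock and the function name are hypothetical stand-ins for whatever process-private resource the child needs.

```python
import os
import threading

# Hypothetical process-private lock that the child will need after the fork.
state_lock = threading.Lock()

# Take the lock in the forking thread just before fork(), then release it on
# both sides, so the child never inherits it in a state held by a thread
# that does not exist in the child.
os.register_at_fork(
    before=state_lock.acquire,
    after_in_parent=state_lock.release,
    after_in_child=state_lock.release,
)

def fork_and_use_lock_in_child():
    """Fork a child that needs state_lock; return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            # Only the forking thread exists here; the lock is free again.
            if state_lock.acquire(timeout=5):
                code = 0
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

Without the handlers, a child forked while another thread held state_lock would wait on it forever, which is exactly the failure described above.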
Overall
You've presented the question as if fork()ing was a deep and troublesome problem for multithreaded programs. It is certainly a problem that should be considered, and it is typically best to avoid using both multiple threads and multiple processes. Therefore, inasmuch as you want multiple processes so as to have separate address spaces and perhaps name spaces into which to load plugins, perhaps you should consider using separate processes wherever you now use threads. On the other hand, if you exercise some thought and care, you can probably make it work just fine for your multi-threaded process to fork children and interact with them.
If you cannot ensure that fork is only used under safe circumstances, as described in John Bollinger's answer, a general workaround is to use a "fork server". Before creating any threads, the original process forks once. The child process is the fork server; it remains single-threaded. The parent process now goes ahead and creates its threads. Whenever the parent would want to call fork, it instead sends a message to the fork server asking it to do so.
If the (ultimate) child processes also need to communicate with the parent, the easiest way to accomplish this is to have the parent create pipes for each child's stdin and stdout, and then transfer the child sides of those pipes to the fork server, using a SCM_RIGHTS special message. You can send file descriptors and data simultaneously. The communication protocol between the fork server and the parent might need to get pretty fancy — look at the posix_spawn API for a more-or-less complete list of all the knobs you might want. (Note: posix_spawn is just a library wrapper around fork; using it will not avoid the original problem.)
The fork server is also responsible for calling waitpid and relaying exit statuses back to the parent. This is trickier than it ought to be, because the standard APIs for waiting for the next of several possible events (select and poll) do not accept a process ID as one of the things to wait for. (BSD's kqueue does, but you're probably not on a BSD.) You have to do a messy dance with SIGCHLD and a pipe-to-self instead.
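A heavily simplified sketch of the fork-server idea follows. It is an assumption-laden illustration rather than a complete design: the line-based JSON protocol, the ForkServer class and the _serve function are invented here, file descriptors are not passed with SCM_RIGHTS, and the server waits for each child synchronously instead of doing the SIGCHLD and pipe-to-self dance.

```python
import json
import os
import socket

def _serve(sock):
    """Fork-server loop: read one JSON argv per line, run it, report the result."""
    reader = sock.makefile("r")
    writer = sock.makefile("w")
    for line in iter(reader.readline, ""):      # "" means the parent closed its end
        argv = json.loads(line)
        child = os.fork()
        if child == 0:
            try:
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)                   # reached only if exec failed
        _, status = os.waitpid(child, 0)
        if os.WIFEXITED(status):
            result = {"pid": child, "exit": os.WEXITSTATUS(status)}
        else:
            result = {"pid": child, "signal": os.WTERMSIG(status)}
        writer.write(json.dumps(result) + "\n")
        writer.flush()

class ForkServer:
    """Create this before starting any threads; it forks the single-threaded server."""

    def __init__(self):
        ours, theirs = socket.socketpair()
        self.server_pid = os.fork()
        if self.server_pid == 0:
            code = 1
            try:
                ours.close()
                _serve(theirs)
                code = 0
            finally:
                os._exit(code)
        theirs.close()
        self._sock = ours
        self._reader = ours.makefile("r")
        self._writer = ours.makefile("w")

    def run(self, argv):
        """Ask the server to fork and exec argv; block until it reports the outcome."""
        self._writer.write(json.dumps(argv) + "\n")
        self._writer.flush()
        return json.loads(self._reader.readline())

    def close(self):
        """Close the connection, which makes the server exit, then reap it."""
        self._writer.close()
        self._reader.close()
        self._sock.close()
        os.waitpid(self.server_pid, 0)
```

The multithreaded parent only ever writes to and reads from the socket, so it never calls fork itself after its threads exist.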

Passing messages between processes

I need to write a simple function which does the following in linux:
Create two processes.
Have thread1 in Process1 do some small operation, and send a message to Process2 via thread2 once operation is completed.
Process2 shall acknowledge the received message.
I have no idea where to begin
I have written two simple functions which simply count from 0 to 1000 in a loop (the loop is run in a function called by a thread) and I have compiled them to get the binaries.
I am executing these one after the other (both running in the background) from a shell script
Once process1 reaches 1000 in its loop, I want the first process to send a "Complete" message to the other.
I am not sure if my approach is correct on the process front and I have absolutely no idea how to communicate between these two.
Any help will be appreciated.
LostinSpace
You'd probably want to use pipes for this. Depending on how the processes are started, you either want named or anonymous pipes:
Use named pipes (aka fifo, man mkfifo) if the processes are started independently of each other.
Use anonymous pipes (man 2 pipe) if the processes are started by a parent process through forking. The parent process would create the pipes, the child processes would inherit them. This is probably the "most beautiful" solution.
In both cases, the end points of the pipes are used just like any other file descriptor (but more like sockets than files).
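A minimal sketch of the anonymous-pipe variant, assuming a parent that forks the worker itself. The function name and the literal messages "Complete" and "ACK" are chosen for this illustration only; here the child plays the role of the process that finishes its work, and the parent is the one that acknowledges.

```python
import os

def worker_then_ack():
    """Fork a worker that reports completion over one pipe and gets an ACK over another."""
    ack_r, ack_w = os.pipe()      # parent -> child: acknowledgement
    done_r, done_w = os.pipe()    # child -> parent: completion message
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            os.close(ack_w)
            os.close(done_r)
            for _ in range(1001):             # stand-in for counting from 0 to 1000
                pass
            os.write(done_w, b"Complete")
            if os.read(ack_r, 16) == b"ACK":  # block until the parent acknowledges
                code = 0
        finally:
            os._exit(code)
    os.close(ack_r)
    os.close(done_w)
    message = os.read(done_r, 16)             # blocks until the worker has finished
    os.write(ack_w, b"ACK")
    _, status = os.waitpid(pid, 0)
    os.close(done_r)
    os.close(ack_w)
    child_got_ack = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
    return message, child_got_ack
```

With named pipes the reads and writes look the same; only the setup changes, since both programs open a path created with mkfifo instead of inheriting descriptors.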
If you aren't familiar with pipes yet, I recommend getting a copy of Marc Rochkind's book "Advanced UNIX Programming", where these techniques are explained in great detail with easy-to-understand example code. That book also presents other inter-process communication methods (the only other really useful inter-process communication method on POSIX systems is shared memory, but just for fun/completeness he presents some hacks).
Since you create the processes (I assume you are using fork()), you may want to look at eventfd().
eventfd() provides a lightweight mechanism to send events from one process or thread to another.
More information on eventfd() and a small example can be found here: http://man7.org/linux/man-pages/man2/eventfd.2.html.
Signals or named pipes (since you're starting the two processes separately) are probably the way to go here if you're just looking for a simple solution. For signals, your client process (the one sending "Done") will need to know the process id of the server, and for named pipes they will both need to know the location of a pipe file to communicate through.
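For the signal route, a small sketch is shown here under the assumption that SIGUSR1 is free to use as the "Done" notification. The function name is hypothetical. The receiver blocks the signal and then waits for it with sigwait, which avoids the race where the signal arrives before a handler is installed.

```python
import os
import signal

def notify_child_with_signal():
    """Send SIGUSR1 to a forked receiver; return True if the receiver saw it."""
    # Block SIGUSR1 before forking so the receiver inherits the blocked mask
    # and the signal stays pending until it calls sigwait().
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    try:
        pid = os.fork()
        if pid == 0:
            code = 1
            try:
                if signal.sigwait({signal.SIGUSR1}) == signal.SIGUSR1:
                    code = 0
            finally:
                os._exit(code)
        os.kill(pid, signal.SIGUSR1)      # the sender must know the receiver's PID
        _, status = os.waitpid(pid, 0)
        return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

A signal carries no payload beyond its number, so anything richer than "Done" is better sent over a pipe.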
However, I want to point out a neat IPC/networking tool that can make your job a lot easier if you're designing a larger, more robust system: 0MQ can make this kind of client/server interaction dead simple, and allows you to start up the programs in whatever order you like (if you structure your code correctly). I highly recommend it.

pthread_rwlock across processes: Repair after crash?

I'm working on linux and I'm using a pthread_rwlock, which is stored in shared memory and shared over multiple processes. This mostly works fine, but when I kill a process (SIGKILL) while it is holding a lock, it appears that the lock is still held (regardless of whether it's a read- or write-lock).
Is there any way to recognize such a state, and possibly even repair it?
The real answer is to find a decent way to stop a process. Killing it with SIGKILL is not a decent way to do it.
This feature is specified for mutexes, called robustness (PTHREAD_MUTEX_ROBUST) but not for rwlocks. The standard doesn't provide it and kernel.org doesn't even have a page on rwlocks. So, like I said:
Find another way to stop the process (perhaps another signal that can be handled?)
Release the lock when you exit
@cnicutar - that "real answer" is pretty dubious. It's the kernel's job to handle cross-process responsibilities such as freeing resources and making sure things are marked consistent; userspace can't effectively do the job when things go wrong.
Granted, if everybody plays nice the robust features will not be needed, but for a robust system you want to make sure it doesn't go down because of some buggy client process.

How do I reliably track child/grandchild processes on a POSIX system?

I have an interesting (at least to me) problem: I can't manage to find a way to reliably and portably get information on grandchildren processes in certain cases. I have an application, AllTray, that I am trying to get to work in certain strange cases where its subprocess spawns a child and then dies. AllTray's job is essentially to dock an application to the task tray, which is (usually) specified as a command line for AllTray to invoke (i.e., alltray xterm would start xterm, and manage it in AllTray).
Most GUI software runs just fine under it. It sets the _NET_WM_PID property on its window (or a widget library does) and all's well, because _NET_WM_PID == fork()ed child. However, in some cases (such as when running oowriter, or software written to run under KDE such as K3b), the child process that AllTray runs is a wrapper, be it a shell script (as in OO.o's case) or a strange program that fork()s and exec()s itself and effectively backgrounds itself, since the parent process dies very early.
I had the idea to not reap my child processes, so as to preserve in the process table the parent process ID for my grandchildren, so that I could link them back to me by traversing the family tree from bottom-to-top. That doesn't work, though: once my child process dies and turns into a zombie, the system considers my grandchild process to be an orphan, and init adopts it. This appears to be the case on at least Linux 2.6 and NetBSD; I'd presume it's probably the norm, and POSIX doesn't seem to specify that to be the case, so I was hoping for the opposite.
Since that approach won't work, I thought about using LD_PRELOAD and intercepting my child process' call to fork(), and passing information back to my parent process. However, I'm concerned that won't be as portable as the ideal solution, because different systems have different rules on how the dynamic linker does things like LD_PRELOAD. It won't work for setuid/setgid GUI applications either without the helper library also being setuid or setgid, at least on Linux systems. Generally, it smells like a bad idea to me, and feels quite hackish.
So, I'm hoping that someone has an idea on how to do this, or if the idea of relying on a mechanism like LD_PRELOAD is really the only option I have short of patching kernels (which is not going to happen).
You could investigate the possibility of using process groups to keep track of, well, process groups. A process group is a property (just a number) which you can set before forking, and child processes then inherit it automatically.
AllTray can create a new process group for each application started with it. You can then send signals to all members of the process group. I suppose the most useful signals here would be TERM and KILL, in order to kill an application managed in AllTray.
I'm not sure if there is a convenient way to figure out whether all members of the process group have already exited. You may have to resort to going through the entire process list and calling getpgid for each process to see if there are any left in the process group.
Note that process groups won't work for applications which create new process groups themselves. But that's relatively rare and you probably don't need to worry about such applications.
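A sketch of the process-group approach, with a wrapper that forks a grandchild and exits at once, much like the launchers described in the question. The function names are made up for this example. One way to probe the group, used here, is to send signal 0 to it: that performs only the existence and permission checks. Note that members that have died but have not yet been reaped still count as present.

```python
import os
import signal
import time

def launch_wrapper_in_new_group(grandchild_seconds=30):
    """Start a wrapper in its own process group; the wrapper forks a grandchild and exits.

    Returns the process group ID, which equals the wrapper's PID.
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(r)
            os.setpgid(0, 0)              # become leader of a new process group
            if os.fork() == 0:
                os.close(w)
                time.sleep(grandchild_seconds)   # the long-running "real" application
                os._exit(0)
            os.write(w, b"x")             # tell the parent the grandchild exists
        finally:
            os._exit(0)
    os.close(w)
    os.read(r, 1)
    os.close(r)
    os.waitpid(pid, 0)                    # the wrapper is gone; the group lives on
    return pid

def group_has_members(pgid):
    """Return True if at least one process is still in the process group."""
    try:
        os.killpg(pgid, 0)                # signal 0: check only, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True                       # members exist but belong to another user
    return True
```

Calling os.killpg(pgid, signal.SIGTERM) afterwards reaches the grandchild even though its original parent has already exited.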

Starting and stopping a forked process

Is it possible for a parent process to start and stop a child (forked) process in Unix?
I want to implement a task scheduler (see here) which is able to run multiple processes at the same time which I believe requires either separate processes or threads.
How can I stop the execution of a child process and resume it after a given amount of time?
(If this is only possible with threads, how are threads implemented?)
You could write a simple scheduler imitation using signals.
If you have the necessary permissions, the stop signal (SIGSTOP) suspends the execution of a process and the continue signal (SIGCONT) resumes it.
With signals you would not have any fine-grained control over the "scheduling", but I guess an OS-grade scheduler is not the purpose of this exercise anyway.
Check kill (2) and signal (7) manual pages.
There are also many guides to using Unix signals on the web.
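A short sketch of pausing and resuming a child this way. The function name and the pause length are arbitrary choices for the example. waitpid with WUNTRACED and WCONTINUED lets the parent confirm that each state change actually happened.

```python
import os
import signal
import time

def pause_and_resume(pid, pause_seconds=0.1):
    """Stop child pid, wait a little, continue it; return (stopped, resumed)."""
    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)     # returns once the child has stopped
    stopped = os.WIFSTOPPED(status)

    time.sleep(pause_seconds)                     # the child gets no CPU time here

    os.kill(pid, signal.SIGCONT)
    _, status = os.waitpid(pid, os.WCONTINUED)    # returns once the child runs again
    resumed = os.WIFCONTINUED(status)
    return stopped, resumed
```

A toy scheduler would keep a list of child PIDs and call something like this in a loop, letting one child run while the others are stopped.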
You can use signals, but in the usual UNIX world it's probably easier to use semaphores. Once you set the semaphore to not let the other process proceed, the scheduler will swap it out in the normal course of things; when you clear the semaphore, it will become ready to run again.
You can do the exact same thing with threads of course; the only dramatic difference is you save a heavyweight context switch.
Just a side note: If you are using signal(), the behavior may be different on different unixes. If you are using Linux, check the "Portability" section of the signal manpage, and the sigaction manpage, which is preferred.

Resources