Which is the better way to restart a daemontools service?

You've pushed a code update to a daemontools service and want to restart it so it picks up the changes. The service itself is simple and has no built-in signal handling. Which way is better?
svc -d; sleep 5; svc -u
Sends TERM and then CONT, and tells supervise not to restart the service. The sleep gives it time to exit before svc -u brings it back up.
svc -h
Sends a HUP signal. The process will die on reception of the signal, and daemontools will restart it.
I've always done some variation on the first, but somebody pointed out today that the HUP works just as well. I like that better, but I've been doing it the other way for so long that I can't remember whether there was a reason.
I thought it might be because a process in uninterruptible sleep waiting on I/O ignores signals, but according to Wikipedia, "When the process is sleeping uninterruptibly, signals accumulated during the sleep will be noticed when the process returns from the system call or trap."
Anybody have an informed opinion on best practice?

It would appear that option 2 (sending a HUP signal) is somewhat cleaner, but in the end, both will get the job done and neither is inherently superior.

Send SIGHUP. It's shorter to type and won't make you wait as long.

Related

When will /proc/<pid> be removed?

Process A opened and mmapped thousands of files while running. Then kill -9 <pid of process A> is issued. I have a question about the order of the two events below.
a) /proc/<pid of process A> cannot be accessed.
b) all files opened by process A are closed.
More background about the question:
Process A is a multi-threaded background service. It is started with the command ./process_A args1 arg2 arg3.
There is also a watchdog process which periodically (every second) checks whether process A is still alive, and restarts it if it is dead. The watchdog checks process A as follows (a rough sketch of this check is shown after the list):
1) collect all numerical subdirectories under /proc/
2) compare each /proc/<pid>/cmdline with the cmdline of process A. If one matches, process A is alive and nothing is done; otherwise process A is restarted.
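For illustration, a minimal sketch of that check might look like the following (the function name and parameter are illustrative, not taken from the question's actual code):
/* Hypothetical sketch of the watchdog check described above: scan /proc
 * for numeric directories and compare each cmdline's argv[0] with the
 * command used to start process A, e.g. "./process_A". */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static int process_a_alive(const char *wanted_argv0)
{
    DIR *proc = opendir("/proc");
    if (!proc)
        return 0;

    struct dirent *entry;
    while ((entry = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)entry->d_name[0]))
            continue;                       /* not a numeric (PID) directory */

        char path[300];
        snprintf(path, sizeof path, "/proc/%s/cmdline", entry->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                       /* process vanished in the meantime */

        char cmd[4096] = {0};
        if (fread(cmd, 1, sizeof cmd - 1, f) == 0)
            cmd[0] = '\0';                  /* kernel threads have an empty cmdline */
        fclose(f);

        /* cmdline is NUL-separated, so this compares argv[0] only. */
        if (strcmp(cmd, wanted_argv0) == 0) {
            closedir(proc);
            return 1;                       /* found a matching process */
        }
    }
    closedir(proc);
    return 0;
}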
Process A does the following during initialization:
1) open fileA
2) flock fileA
3) mmap fileA into memory
4) close fileA
Process A will mmap thousands of files after initialization.
After several minutes, kill -9 <pid of process A> is issued.
The watchdog detects the death of process A and restarts it. But sometimes the new process A gets stuck at step 2 (flock fileA). After some debugging, we found that fileA is unlocked when process A is killed, but sometimes that unlock happens only after the new process has already reached step 2 (flock fileA).
So we suspect that checking whether process A is alive by monitoring /proc/<pid of process A> is not correct.
then kill -9 is issued
This is a bad habit. You'd better send a SIGTERM first, because well-behaved and well-designed programs can catch it (and exit nicely and properly when getting a SIGTERM). In some cases I even recommend: send SIGTERM, wait two or three seconds, send SIGQUIT, wait two more seconds, and at last send SIGKILL (for those badly written or misbehaving programs). Read signal(7) and signal-safety(7). In multi-threaded, but Linux-specific, programs, you might use signalfd(2) or the pipe(7)-to-self trick (well explained in the Qt documentation, but not Qt-specific).
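A sketch of that escalation in C might look like this (the delays and the helper name are arbitrary choices, not a standard API):
/* Escalate from SIGTERM to SIGQUIT to SIGKILL, with a short grace
 * period after each signal. */
#include <errno.h>
#include <signal.h>
#include <unistd.h>

static void terminate_gently(pid_t pid)
{
    static const int signals[] = { SIGTERM, SIGQUIT, SIGKILL };

    for (size_t i = 0; i < sizeof signals / sizeof signals[0]; i++) {
        if (kill(pid, signals[i]) == -1 && errno == ESRCH)
            return;                   /* process already gone */
        sleep(2);                     /* give it a chance to exit cleanly */
        if (kill(pid, 0) == -1 && errno == ESRCH)
            return;                   /* it exited; no need to escalate */
    }
}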
If your Linux system is systemd-based, you could imagine your program A being started with systemd facilities. Then you would use systemd facilities to "communicate" with it. In some ways (I don't know the details), systemd is making signals almost obsolete. Notice that signals are not multi-thread friendly; they were designed, in the previous century, for single-threaded processes.
we suspect the way to check whether a process is alive by monitoring /proc/ is not correct.
The usual (and faster, and "atomic" enough) way to detect the existence of a process (on which you have enough privileges, e.g. which runs with your uid/gid) is to use kill(2) with a signal number (the second argument to kill) of 0. To quote that manpage:
If sig is 0, then no signal is sent, but existence and permission checks are still performed; this can be used to check for the existence of a process ID or process group ID that the caller is permitted to signal.
Of course, that other process can still terminate before any further interaction with it, because Linux has preemptive scheduling.
Your watchdog process should instead use kill(pid-of-process-A, 0) to check the existence and liveness of process A. Using /proc/pid-of-process-A/ is not the correct way to do that.
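A minimal sketch of that check (the function name is illustrative):
/* Returns 1 if the process exists (even if we lack permission to
 * signal it), 0 if there is no such process. */
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

static int process_exists(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return 1;                     /* exists and we may signal it */
    if (errno == EPERM)
        return 1;                     /* exists, but belongs to someone else */
    return 0;                         /* errno == ESRCH: no such process */
}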
And whatever you code, process A could disappear asynchronously (in particular if it has some bug that causes a segmentation fault). When a process terminates (even with a segmentation fault), the kernel acts on its file locks and releases them.
Don't scan /proc/PID to find out if a specific process has terminated. There are lots of better ways to do that, such as having your watchdog program actually launch the server program and wait for it to terminate.
Or, have the watchdog listen on a TCP socket, and have the server process connect to it and send its PID. If either end dies, the other can notice that the connection was closed (hint: send a heartbeat packet every so often, to detect a frozen peer). If the watchdog receives a connection from another server while the first is still running, it can decide to allow it or tell one of the instances to shut down (via TCP or kill()).

Are there suspend/resume signals in Linux?

My application needs to react to hibernation so it can do some actions on suspending and other actions on resuming. I've found some distribution-specific ways to achieve it (UPower + DBus) but didn't find anything universal. Is there a way to do it?
Thanks!
A simple solution to this is to use a self-pipe. Open up a pipe and periodically write timestamps to it. select on this pipe to read the timestamps and compare them to the current time. When there is a big gap, that means you have just woken up from system suspension or hibernate mode.
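One untested way to realize that idea, using a second thread as the periodic writer (the thread, the 1-second interval, and the 5-second threshold are my assumptions):
/* Self-pipe sketch: a ticker thread writes the current time into the
 * pipe once a second; the main loop select()s on the read end and
 * compares each timestamp it reads with the current time. A timestamp
 * written shortly before the machine suspended will look stale when it
 * is finally read after resume. */
#include <pthread.h>
#include <stdio.h>
#include <sys/select.h>
#include <time.h>
#include <unistd.h>

static int pipe_fds[2];

static void *ticker(void *arg)          /* writer side of the self-pipe */
{
    (void)arg;
    for (;;) {
        time_t now = time(NULL);
        if (write(pipe_fds[1], &now, sizeof now) < 0)
            break;
        sleep(1);
    }
    return NULL;
}

int main(void)
{
    pipe(pipe_fds);
    pthread_t tid;
    pthread_create(&tid, NULL, ticker, NULL);

    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(pipe_fds[0], &rfds);
        select(pipe_fds[0] + 1, &rfds, NULL, NULL, NULL);

        time_t stamp;
        if (read(pipe_fds[0], &stamp, sizeof stamp) != sizeof stamp)
            continue;
        time_t now = time(NULL);
        if (now - stamp > 5)            /* far more than the 1-second write interval */
            printf("woke up from suspend/hibernate (gap: %ld s)\n",
                   (long)(now - stamp));
    }
}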
As for the other way around (reacting before the suspend), there is not much time between the lid being closed and the machine actually going to sleep.
If you really need to act on suspend, then you will need to set up power-management hooks in pm-utils, like this: https://help.ubuntu.com/community/PowerManagement/ReducedPower. The hook could be as simple as
kill -1 `cat mypid` ; sleep 1
Your process would then trap SIGHUP and do what needs to be done to prepare for suspension. The sleep delays the suspend long enough for your program to react to the signal.
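On the receiving side, the daemon could install a handler along the lines of this sketch (the flag name and the work done in the loop are assumptions):
/* Trap SIGHUP (sent by the pm-utils hook above), set a flag, and let
 * the main loop do the actual pre-suspend work. */
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t suspending = 0;

static void on_sighup(int sig)
{
    (void)sig;
    suspending = 1;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sighup;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGHUP, &sa, NULL);

    for (;;) {
        if (suspending) {
            /* flush buffers, close connections, checkpoint state, ... */
            suspending = 0;
        }
        pause();                        /* wait for the next signal or event */
    }
}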
I believe you are looking for SIGSTOP and SIGCONT signals. You can send these to a running process like so:
kill -STOP pid
sleep 60
kill -CONT pid

What's the best way to signal threads that sleep or block to stop?

I've got a service that I need to shut down and update. I'm having difficulties with this in two different cases:
I have some threads that sleep for large amounts of time. Obviously I can't wait for them to wake up to finish shutting down the service. I had a thought to use an AutoResetEvent that gets set by some controller thread when the sleep interval is up (by just checking every two seconds or something), and triggering it immediately at OnClose time. Is there a better way to facilitate that?
I have one thread that makes a call to a blocking method call (one which I cannot modify). How do you signal such a thread to stop?
I'm not sure if I understood your first question correctly, but have you looked at using WaitForSingleObject as an alternative to Sleep? You can specify a timeout as well as an object to wait on, so if you want it to wake up earlier, just signal the object.
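For instance, a hedged Win32 sketch of that pattern (the event, thread, and 60-second interval are illustrative, not from the question's code):
/* The worker "sleeps" by waiting on a manual-reset shutdown event with
 * a timeout, so setting the event wakes it immediately. */
#include <windows.h>
#include <stdio.h>

static HANDLE g_shutdown_event;          /* manual-reset, initially unsignaled */

static DWORD WINAPI worker(LPVOID arg)
{
    (void)arg;
    for (;;) {
        /* Equivalent of Sleep(60000), but interruptible via the event. */
        DWORD rc = WaitForSingleObject(g_shutdown_event, 60 * 1000);
        if (rc == WAIT_OBJECT_0)
            break;                       /* shutdown requested: exit now */
        /* rc == WAIT_TIMEOUT: the normal "interval elapsed" path */
        printf("periodic work\n");
    }
    return 0;
}

int main(void)
{
    g_shutdown_event = CreateEvent(NULL, TRUE, FALSE, NULL);
    HANDLE t = CreateThread(NULL, 0, worker, NULL, 0, NULL);

    Sleep(5000);                         /* pretend the service ran for a while */
    SetEvent(g_shutdown_event);          /* OnClose: wake all sleeping workers */
    WaitForSingleObject(t, INFINITE);    /* join the worker */
    CloseHandle(t);
    CloseHandle(g_shutdown_event);
    return 0;
}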
What exactly do you mean by "call to a blocking method"? Or did you just mean a blocking call? In general, there isn't a way to interrupt a thread without forcefully terminating it. However, if the call is a system call, there might be ways to return control by making the call fail, e.g. cancelling I/O or closing an associated handle.
For 1., you can get your threads into an interruptible sleep by using SleepEx rather than Sleep. Once they get this shutdown kick (initiated from your termination logic using QueueUserApc), you can detect that it happened using the return code from SleepEx and terminate those threads accordingly. This is similar to the suggestion to use WaitForSingleObject, but you don't need another per-thread handle that's just used to terminate the associated thread.
The return value is zero if the specified time interval expired. The return value is WAIT_IO_COMPLETION if the function returned due to one or more I/O completion callback functions. This can happen only if bAlertable is TRUE, and if the thread that called the SleepEx function is the same thread that called the extended I/O function.
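A rough sketch of that approach (the flag name and the no-op APC are assumptions):
/* The worker sleeps alertably; the shutdown path queues a (no-op) APC
 * to that thread, which makes SleepEx return WAIT_IO_COMPLETION early. */
#include <windows.h>
#include <stdio.h>

static volatile LONG g_shutting_down = 0;

static void CALLBACK wake_apc(ULONG_PTR param)
{
    (void)param;                         /* the APC only needs to run; no work here */
}

static DWORD WINAPI worker(LPVOID arg)
{
    (void)arg;
    for (;;) {
        DWORD rc = SleepEx(60 * 1000, TRUE);   /* alertable sleep */
        if (rc == WAIT_IO_COMPLETION && g_shutting_down)
            break;                       /* woken by the shutdown APC */
        printf("periodic work\n");
    }
    return 0;
}

int main(void)
{
    HANDLE t = CreateThread(NULL, 0, worker, NULL, 0, NULL);

    Sleep(3000);                         /* pretend the service ran for a while */
    InterlockedExchange(&g_shutting_down, 1);
    QueueUserAPC(wake_apc, t, 0);        /* kick the sleeping worker */

    WaitForSingleObject(t, INFINITE);
    CloseHandle(t);
    return 0;
}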
For 2., that's a tough one unless you have access to some resource used in that thread that can cause the blocking call to abort in such a way that the calling thread can handle it cleanly. You may just have to implement code to kill that thread with extreme prejudice using TerminateThread (probably this should be the last thing you do before exiting the process) and see what happens under test.
An easy and reliable solution is to kill the service process. A process is the memory-safe abstraction of the OS, after all, so you can safely terminate one without regard for process-internal state - of course, if your process is communicating or fiddling with external state, all bets are off...
Additionally, you could implement the solution which OS's themselves commonly do: one warning signal asking the process to clean up as best possible (which sets a flag and gracefully exits what can be gracefully stopped), and then forceful termination if the process doesn't exit by itself (which ends pesky things like blocking I/O).
All services should be built such that forceful termination isn't harmful, since these processes are system managed and may be terminated by things such as a reboot - i.e., your service ideally should permit this without corrupting storage anyhow.
Oh, and one final warning: Windows services may share a process (I presume for efficiency, though it strikes me as an avoidable optimization), so if you go this route, you want to make sure your service is not sharing a process with other services. You can ensure this by passing the option SERVICE_WIN32_OWN_PROCESS to ChangeServiceConfig.

Should I be worried about the order in which processes in a process group receive signals?

I want to terminate a process group by sending SIGTERM to the processes within it. This can be accomplished via kill(), but the manuals I found provide few details about how exactly it works:
int kill(pid_t pid, int sig);
...
If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.
However, in which order will the signal be sent to the processes that form the group? Imagine the following situation: a pipe is set up between a master and a slave process in the group. If the slave is killed while kill(-pid) is being processed but the master is not yet, the master might report this as an internal failure (upon receiving notification that its child is dead). However, I want all processes to understand that such termination was caused by something external to their process group.
How can I avoid this confusion? Should I be doing something more than a mere kill(-pid, SIGTERM)? Or is it resolved by underlying properties of the OS that I'm not aware of?
Note that I can't modify the code of the processes in the group!
Try doing it as a three-step process:
kill(-pid, SIGSTOP);
kill(-pid, SIGTERM);
kill(-pid, SIGCONT);
The first SIGSTOP should put all the processes into a stopped state. They cannot catch this signal, so this should stop the entire process group.
The SIGTERM will be queued for the process but I don't believe it will be delivered, since the processes are stopped (this is from memory, and I can't currently find a reference but I believe it is true).
The SIGCONT will start the processes again, allowing the SIGTERM to be delivered. If the slave gets the SIGCONT first, the master may still be stopped so it will not notice the slave going away. When the master gets the SIGCONT, it will be followed by the SIGTERM, terminating it.
I don't know if this will actually work, and it may be implementation dependent on when all the signals are actually delivered (including the SIGCHLD to the master process), but it may be worth a try.
My understanding is that you cannot rely on any specific order of signal delivery.
You could avoid the issue if you send the TERM signal to the master process only, and then have the master kill its children.
Even if all the various varieties of UNIX promised to deliver the signals in a particular order, the scheduler might still decide to run the critical child-process code before the parent code.
Even your STOP/TERM/CONT sequence will be vulnerable to this.
I'm afraid you may need something more complicated. Perhaps the child process could catch the SIGTERM and then loop until its parent exits before it exits itself? Be sure to add a timeout if you do this.
Untested: Use shared memory and put in some kind of "we're dying" semaphore, which may be checked before I/O errors are treated as real errors. mmap() with MAP_ANONYMOUS|MAP_SHARED and make sure it survives your way of fork()ing processes.
Oh, and be sure to use the volatile keyword, or accesses to your semaphore may be optimized away.
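An untested sketch of that idea, assuming the flag is set by whoever initiates the shutdown before it signals the group (the names are made up):
/* "We're dying" flag in an anonymous shared mapping, created before
 * fork() so that parent and children all see the same flag. */
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t *group_dying;

int main(void)
{
    group_dying = mmap(NULL, sizeof *group_dying, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    *group_dying = 0;

    pid_t child = fork();
    if (child == 0) {
        /* Child: before treating a broken pipe or dead sibling as an
         * internal error, check whether the whole group is going down. */
        sleep(1);
        if (*group_dying)
            fprintf(stderr, "not an internal failure: group is shutting down\n");
        _exit(0);
    }

    /* Whoever initiates the shutdown sets the flag first ... */
    *group_dying = 1;
    /* ... and only then sends SIGTERM to the group, e.g. kill(-pgid, SIGTERM). */
    wait(NULL);
    return 0;
}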

kill -9 and production application

What problems can kill -9 cause in a production application (in Linux, to be exact)?
I have an application which does some periodic work; stopping it takes a long time, and I don't care if some jobs get aborted - the work can be finished by new processes. So can I use kill -9 just to stop it immediately, or can this cause serious OS problems?
For example, Unicorn uses it as a normal working procedure:
When your application goes awry, a BOFH can just "kill -9" the runaway worker process without worrying about tearing all clients down, just one.
But this article claims:
The -9 (or KILL) argument to kill(1) should never be used on Unix systems
PS: I understand that kill -9 cannot be handled by the application, and I know that for my application it doesn't cause any problems. I'm just interested in whether it can cause problems at the OS level - active shared memory segments and lingering sockets sound dangerous to me.
kill -9 doesn't give an application a chance to shut down cleanly.
Normally an application can catch a SIGINT/SIGTERM and shut down cleanly (close files, save data etc.). An application can't catch a SIGKILL (which occurs with a kill -9) and so it can't do any of this (optional) cleanup.
A better approach is to use a standard kill, and if the application remains unresponsive, then use kill -9.
kill -9 won't cause any "serious OS problems". But the process will stop immediately, which means it might leave data in an odd state.
It depends what kind of application it is.
Something like a database may either lose data (if it does not write all its data to a persistent transaction log at once), or take longer to start up next time, or both.
Although crash-only design is a good principle, few applications currently conform to it.
For example, the mysql database is not "crash only" and killing it with a kill -9 will result in either significantly longer startup time (than a clean shutdown), data loss, or both, depending on the settings (and to some extent, luck).
On the other hand, Cassandra actually encourages the use of kill -9 as a shutdown mechanism; it supports nothing else.
The KILL signal cannot be caught by the application. If the application is in the middle of writing some complex data structure to disk when you kill it, the structure may be only half-written, resulting in a corrupted data file. It is usually best to implement some other signal such as SIGUSR1 as the "stop" signal, as this can be caught and allows the application to shut down in a controlled manner.
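A minimal sketch of such a "stop" signal (the flag and the per-iteration work are placeholders):
/* Catch SIGUSR1, set a flag, and let the main loop finish its current
 * write before exiting. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t stop_requested = 0;

static void on_stop(int sig)
{
    (void)sig;
    stop_requested = 1;                 /* just set a flag; the main loop does the real shutdown */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_stop;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);

    while (!stop_requested) {
        /* ... do one unit of work, completing any write in full ... */
        sleep(1);
    }
    /* flush buffers, close files, then exit cleanly */
    puts("shutting down cleanly");
    return 0;
}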
