Why linux process with status 'D' can be killed ? [duplicate] - linux

Sometimes whenever I write a program in Linux and it crashes due to a bug of some sort, it will become an uninterruptible process and continue running forever until I restart my computer (even if I log out). My questions are:
What causes a process to become uninterruptible?
How do I stop that from happening?
This is probably a dumb question, but is there any way to interrupt it without restarting my computer?

An uninterruptible process is a process which happens to be in a system call (kernel function) that cannot be interrupted by a signal.
To understand what that means, you need to understand the concept of an interruptible system call. The classic example is read(). This is a system call that can take a long time (seconds) since it can potentially involve spinning up a hard drive, or moving heads. During most of this time, the process will be sleeping, blocking on the hardware.
While the process is sleeping in the system call, it can receive a Unix asynchronous signal (say, SIGTERM), then the following happens:
The system call exits prematurely, and is set up to return -EINTR to user space.
The signal handler is executed.
If the process is still running, it gets the return value from the system call, and it can make the same call again.
Returning early from the system call enables the user space code to immediately alter its behavior in response to the signal. For example, terminating cleanly in reaction to SIGINT or SIGTERM.
On the other hand, some system calls are not allowed to be interrupted in this way. If the system calls stalls for some reason, the process can indefinitely remains in this unkillable state.
LWN ran a nice article that touched this topic in July.
To answer the original question:
How to prevent this from happening: figure out which driver is causing you trouble, and either stop using, or become a kernel hacker and fix it.
How to kill an uninterruptible process without rebooting: somehow make the system call terminate. Frequently the most effective manner to do this without hitting the power switch is to pull the power cord. You can also become a kernel hacker and make the driver use TASK_KILLABLE, as explained in the LWN article.

When a process is on user mode, it can be interrupted at any time (switching to kernel mode). When the kernel returns to user mode, it checks if there are any signals pending (including the ones which are used to kill the process, such as SIGTERM and SIGKILL). This means a process can be killed only on return to user mode.
The reason a process cannot be killed in kernel mode is that it could potentially corrupt the kernel structures used by all the other processes in the same machine (the same way killing a thread can potentially corrupt data structures used by other threads in the same process).
When the kernel needs to do something which could take a long time (waiting on a pipe written by another process or waiting for the hardware to do something, for instance), it sleeps by marking itself as sleeping and calling the scheduler to switch to another process (if there is no non-sleeping process, it switches to a "dummy" process which tells the cpu to slow down a bit and sits in a loop — the idle loop).
If a signal is sent to a sleeping process, it has to be woken up before it will return to user space and thus process the pending signal. Here we have the difference between the two main types of sleep:
TASK_INTERRUPTIBLE, the interruptible sleep. If a task is marked with this flag, it is sleeping, but can be woken by signals. This means the code which marked the task as sleeping is expecting a possible signal, and after it wakes up will check for it and return from the system call. After the signal is handled, the system call can potentially be automatically restarted (and I won't go into details on how that works).
TASK_UNINTERRUPTIBLE, the uninterruptible sleep. If a task is marked with this flag, it is not expecting to be woken up by anything other than whatever it is waiting for, either because it cannot easily be restarted, or because programs are expecting the system call to be atomic. This can also be used for sleeps known to be very short.
TASK_KILLABLE (mentioned in the LWN article linked to by ddaa's answer) is a new variant.
This answers your first question. As to your second question: you can't avoid uninterruptible sleeps, they are a normal thing (it happens, for instance, every time a process reads/writes from/to the disk); however, they should last only a fraction of a second. If they last much longer, it usually means a hardware problem (or a device driver problem, which looks the same to the kernel), where the device driver is waiting for the hardware to do something which will never happen. It can also mean you are using NFS and the NFS server is down (it is waiting for the server to recover; you can also use the "intr" option to avoid the problem).
Finally, the reason you cannot recover is the same reason the kernel waits until return to user mode to deliver a signal or kill the process: it would potentially corrupt the kernel's data structures (code waiting on an interruptible sleep can receive an error which tells it to return to user space, where the process can be killed; code waiting on an uninterruptible sleep is not expecting any error).

Uninterruptable processes are USUALLY waiting for I/O following a page fault.
Consider this:
The thread tries to access a page which is not in core (either an executable which is demand-loaded, a page of anonymous memory which has been swapped out, or a mmap()'d file which is demand loaded, which are much the same thing)
The kernel is now (trying to) load it in
The process can't continue until the page is available.
The process/task cannot be interrupted in this state, because it can't handle any signals; if it did, another page fault would happen and it would be back where it was.
When I say "process", I really mean "task", which under Linux (2.6) roughly translates to "thread" which may or may not have an individual "thread group" entry in /proc
In some cases, it may be waiting for a long time. A typical example of this would be where the executable or mmap'd file is on a network filesystem where the server has failed. If the I/O eventually succeeds, the task will continue. If it eventually fails, the task will generally get a SIGBUS or something.

To your 3rd question:
I think you can kill the uninterruptable processes by running
sudo kill -HUP 1.
It will restart init without ending the running processes and after running it, my uninterruptable processes were gone.

If you are talking about a "zombie" process (which is designated as "zombie" in ps output), then this is a harmless record in the process list waiting for someone to collect its return code and it could be safely ignored.
Could you please describe what and "uninterruptable process" is for you? Does it survives the "kill -9 " and happily chugs along? If that is the case, then it's stuck on some syscall, which is stuck in some driver, and you are stuck with this process till reboot (and sometimes it's better to reboot soon) or unloading of relevant driver (which is unlikely to happen). You could try to use "strace" to find out where your process is stuck and avoid it in the future.

Related

Can I coredump a process that is blocking on disk activity (preferably without killing it)?

I want to dump the core of a running process that is, according to /proc/<pid>/status, currently blocking on disk activity. Actually, it is busy doing work on the GPU (should be 4 hours of work, but it has taken significantly longer now). I would like to know how much of the process's work has been done, so it'd be good to be able to dump the process's memory. However, as far as I know, "blocking on disk activity" means that it's not possible to interrupt the process in any way, and coredumping a process e.g. using gdb requires interrupting and temporarily stopping the process in order to attach via ptrace, right?
I know that I could just read /proc/<pid>/{maps,mem} as root to get the (maybe inconsistent) memory state, but I don't know any way to get hold of the process's userspace CPU register values... they stay the same while the process is inside the kernel, right?
You can probably run gcore on your program. It's basically a wrapper around GDB that attaches, uses the gcore command, and detaches again.
This might interrupt your IO (as if it received a signal, which it will), but your program can likely restart it if written correctly (and this may occur in any case, due to default handling).

When does a process handle a signal

I want to know when does a linux process handles the signal.
Assuming that the process has installed the signal handler for a signal, I wanted to know when would the process's normal execution flow be interrupted and signal handler called.
According to http://www.tldp.org/LDP/tlk/ipc/ipc.html, the process would handle the signal when it exits from a system call. This would mean that a normal instruction like a = b+c (or its equivalent machine code) would not be interrupted because of signal.
Also, there are system calls which would get interrupted (and fail with EINTR or get restarted) upon receiving a signal. This means that signal is processed even before the system call completes. This behaviour seems to b conflicting with what I have mentioned in the previous paragraph.
So, I am not clear as to when is the signal processed and in which process states would it be handled by the process. Can it be interrupted
Anytime it enters from kernel space to user space, or
Anytime it is in user space, or
Anytime the process is scheduled for execution by the scheduler
Thanks!
According to http://www.tldp.org/LDP/tlk/ipc/ipc.html, the process would handle the signal when it exits from a system call. This would mean that a normal instruction like a = b+c (or its equivalent machine code) would not be interrupted because of signal.
Well, if that were the case, a CPU-intensive process would not obey the process scheduler. The scheduler, in fact, can interrupt a process at any point of time when its time quantum has elapsed. Unless it is a FIFO real-time process.
A more correct definition: One point when a signal is delivered to the process is when the control flow leaves the kernel mode to resume executing user-mode code. That doesn't necessarily involve a system call.
A lot of the semantics of signal handling are documented (for Linux, anyway - other OSes probably have similar, but not necessarily in the same spot) in the section 7 signal manual page, which, if installed on your system, can be accessed like this:
man 7 signal
If manual pages are not installed, online copies are pretty easy to find.

Process state when calling syscall?

What process state has when it calls a syscall?
I mean, don't asume it's an I/O syscall like read or write...
It's the process itselft that executes kernel code, or the process is suspendes and there's like a "kernel thread" that execute the syscall handler (and knows wich process called (current))?
I'm not sure if changes from executing to ready, or executing to blocked.
It's the process itself that switches to kernel mode and executes the system call - although it switches to a kernel stack to do so. A process executing inside the kernel has state Running, and can be pre-empted and end up in state Runnable.
It depends what the syscall does.
Suppose there is a hypothetical syscall which calculates PI to a lot of digits, and places the result in a buffer the application specifies, then the process will probably just be in the "R" running state. Switching to kernel mode doesn't stop it running in the context of the task that made the call.
Of course many system calls wait for things - consider sleep() for example, which releases the CPU rather than spinning. This puts the process to sleep, having registered a kernel timer to wake it up.
Quite a lot of syscalls never sleep, the likes of getpid() which just retrieve information which is always in ram. And many which do sometimes sleep don't necessarily do so, for example, if you call read() on data already in a kernel buffer.

Linux Process States

In Linux, what happens to the state of a process when it needs to read blocks from a disk? Is it blocked? If so, how is another process chosen to execute?
When a process needs to fetch data from a disk, it effectively stops running on the CPU to let other processes run because the operation might take a long time to complete – at least 5ms seek time for a disk is common, and 5ms is 10 million CPU cycles, an eternity from the point of view of the program!
From the programmer point of view (also said "in userspace"), this is called a blocking system call. If you call write(2) (which is a thin libc wrapper around the system call of the same name), your process does not exactly stop at that boundary; it continues, in the kernel, running the system call code. Most of the time it goes all the way up to a specific disk controller driver (filename → filesystem/VFS → block device → device driver), where a command to fetch a block on disk is submitted to the proper hardware, which is a very fast operation most of the time.
THEN the process is put in sleep state (in kernel space, blocking is called sleeping – nothing is ever 'blocked' from the kernel point of view). It will be awakened once the hardware has finally fetched the proper data, then the process will be marked as runnable and will be scheduled. Eventually, the scheduler will run the process.
Finally, in userspace, the blocking system call returns with proper status and data, and the program flow goes on.
It is possible to invoke most I/O system calls in non-blocking mode (see O_NONBLOCK in open(2) and fcntl(2)). In this case, the system calls return immediately and only report submitting the disk operation. The programmer will have to explicitly check at a later time whether the operation completed, successfully or not, and fetch its result (e.g., with select(2)). This is called asynchronous or event-based programming.
Most answers here mentioning the D state (which is called TASK_UNINTERRUPTIBLE in the Linux state names) are incorrect. The D state is a special sleep mode which is only triggered in a kernel space code path, when that code path can't be interrupted (because it would be too complex to program), with the expectation that it would block only for a very short time. I believe that most "D states" are actually invisible; they are very short lived and can't be observed by sampling tools such as 'top'.
You can encounter unkillable processes in the D state in a few situations. NFS is famous for that, and I've encountered it many times. I think there's a semantic clash between some VFS code paths, which assume to always reach local disks and fast error detection (on SATA, an error timeout would be around a few 100 ms), and NFS, which actually fetches data from the network which is more resilient and has slow recovery (a TCP timeout of 300 seconds is common). Read this article for the cool solution introduced in Linux 2.6.25 with the TASK_KILLABLE state. Before this era there was a hack where you could actually send signals to NFS process clients by sending a SIGKILL to the kernel thread rpciod, but forget about that ugly trick.…
While waiting for read() or write() to/from a file descriptor return, the process will be put in a special kind of sleep, known as "D" or "Disk Sleep". This is special, because the process can not be killed or interrupted while in such a state. A process waiting for a return from ioctl() would also be put to sleep in this manner.
An exception to this is when a file (such as a terminal or other character device) is opened in O_NONBLOCK mode, passed when its assumed that a device (such as a modem) will need time to initialize. However, you indicated block devices in your question. Also, I have never tried an ioctl() that is likely to block on a fd opened in non blocking mode (at least not knowingly).
How another process is chosen depends entirely on the scheduler you are using, as well as what other processes might have done to modify their weights within that scheduler.
Some user space programs under certain circumstances have been known to remain in this state forever, until rebooted. These are typically grouped in with other "zombies", but the term would not be correct as they are not technically defunct.
A process performing I/O will be put in D state (uninterruptable sleep), which frees the CPU until there is a hardware interrupt which tells the CPU to return to executing the program. See man ps for the other process states.
Depending on your kernel, there is a process scheduler, which keeps track of a runqueue of processes ready to execute. It, along with a scheduling algorithm, tells the kernel which process to assign to which CPU. There are kernel processes and user processes to consider. Each process is allocated a time-slice, which is a chunk of CPU time it is allowed to use. Once the process uses all of its time-slice, it is marked as expired and given lower priority in the scheduling algorithm.
In the 2.6 kernel, there is a O(1) time complexity scheduler, so no matter how many processes you have up running, it will assign CPUs in constant time. It is more complicated though, since 2.6 introduced preemption and CPU load balancing is not an easy algorithm. In any case, it’s efficient and CPUs will not remain idle while you wait for the I/O.
As already explained by others, processes in "D" state (uninterruptible sleep) are responsible for the hang of ps process. To me it has happened many times with RedHat 6.x and automounted NFS home directories.
To list processes in D state you can use the following commands:
cd /proc
for i in [0-9]*;do echo -n "$i :";cat $i/status |grep ^State;done|grep D
To know the current directory of the process and, may be, the mounted NFS disk that has issues you can use a command similar to the following example (replace 31134 with the sleeping process number):
# ls -l /proc/31134/cwd
lrwxrwxrwx 1 pippo users 0 Aug 2 16:25 /proc/31134/cwd -> /auto/pippo
I found that giving the umount command with the -f (force) switch, to the related mounted nfs file system, was able to wake-up the sleeping process:
umount -f /auto/pippo
the file system wasn't unmounted, because it was busy, but the related process did wake-up and I was able to solve the issue without rebooting.
Assuming your process is a single thread, and that you're using blocking I/O, your process will block waiting for the I/O to complete. The kernel will pick another process to run in the meantime based on niceness, priority, last run time, etc. If there are no other runnable processes, the kernel won't run any; instead, it'll tell the hardware the machine is idle (which will result in lower power consumption).
Processes that are waiting for I/O to complete typically show up in state D in, e.g., ps and top.
Yes, the task gets blocked in the read() system call. Another task which is ready runs, or if no other tasks are ready, the idle task (for that CPU) runs.
A normal, blocking disc read causes the task to enter the "D" state (as others have noted). Such tasks contribute to the load average, even though they're not consuming the CPU.
Some other types of IO, especially ttys and network, do not behave quite the same - the process ends up in "S" state and can be interrupted and doesn't count against the load average.
Yes, tasks waiting for IO are blocked, and other tasks get executed. Selecting the next task is done by the Linux scheduler.
Generally the process will block. If the read operation is on a file descriptor marked as non-blocking or if the process is using asynchronous IO it won't block. Also if the process has other threads that aren't blocked they can continue running.
The decision as to which process runs next is up to the scheduler in the kernel.

What is an uninterruptible process?

Sometimes whenever I write a program in Linux and it crashes due to a bug of some sort, it will become an uninterruptible process and continue running forever until I restart my computer (even if I log out). My questions are:
What causes a process to become uninterruptible?
How do I stop that from happening?
This is probably a dumb question, but is there any way to interrupt it without restarting my computer?
An uninterruptible process is a process which happens to be in a system call (kernel function) that cannot be interrupted by a signal.
To understand what that means, you need to understand the concept of an interruptible system call. The classic example is read(). This is a system call that can take a long time (seconds) since it can potentially involve spinning up a hard drive, or moving heads. During most of this time, the process will be sleeping, blocking on the hardware.
While the process is sleeping in the system call, it can receive a Unix asynchronous signal (say, SIGTERM), then the following happens:
The system call exits prematurely, and is set up to return -EINTR to user space.
The signal handler is executed.
If the process is still running, it gets the return value from the system call, and it can make the same call again.
Returning early from the system call enables the user space code to immediately alter its behavior in response to the signal. For example, terminating cleanly in reaction to SIGINT or SIGTERM.
On the other hand, some system calls are not allowed to be interrupted in this way. If the system calls stalls for some reason, the process can indefinitely remains in this unkillable state.
LWN ran a nice article that touched this topic in July.
To answer the original question:
How to prevent this from happening: figure out which driver is causing you trouble, and either stop using, or become a kernel hacker and fix it.
How to kill an uninterruptible process without rebooting: somehow make the system call terminate. Frequently the most effective manner to do this without hitting the power switch is to pull the power cord. You can also become a kernel hacker and make the driver use TASK_KILLABLE, as explained in the LWN article.
When a process is on user mode, it can be interrupted at any time (switching to kernel mode). When the kernel returns to user mode, it checks if there are any signals pending (including the ones which are used to kill the process, such as SIGTERM and SIGKILL). This means a process can be killed only on return to user mode.
The reason a process cannot be killed in kernel mode is that it could potentially corrupt the kernel structures used by all the other processes in the same machine (the same way killing a thread can potentially corrupt data structures used by other threads in the same process).
When the kernel needs to do something which could take a long time (waiting on a pipe written by another process or waiting for the hardware to do something, for instance), it sleeps by marking itself as sleeping and calling the scheduler to switch to another process (if there is no non-sleeping process, it switches to a "dummy" process which tells the cpu to slow down a bit and sits in a loop — the idle loop).
If a signal is sent to a sleeping process, it has to be woken up before it will return to user space and thus process the pending signal. Here we have the difference between the two main types of sleep:
TASK_INTERRUPTIBLE, the interruptible sleep. If a task is marked with this flag, it is sleeping, but can be woken by signals. This means the code which marked the task as sleeping is expecting a possible signal, and after it wakes up will check for it and return from the system call. After the signal is handled, the system call can potentially be automatically restarted (and I won't go into details on how that works).
TASK_UNINTERRUPTIBLE, the uninterruptible sleep. If a task is marked with this flag, it is not expecting to be woken up by anything other than whatever it is waiting for, either because it cannot easily be restarted, or because programs are expecting the system call to be atomic. This can also be used for sleeps known to be very short.
TASK_KILLABLE (mentioned in the LWN article linked to by ddaa's answer) is a new variant.
This answers your first question. As to your second question: you can't avoid uninterruptible sleeps, they are a normal thing (it happens, for instance, every time a process reads/writes from/to the disk); however, they should last only a fraction of a second. If they last much longer, it usually means a hardware problem (or a device driver problem, which looks the same to the kernel), where the device driver is waiting for the hardware to do something which will never happen. It can also mean you are using NFS and the NFS server is down (it is waiting for the server to recover; you can also use the "intr" option to avoid the problem).
Finally, the reason you cannot recover is the same reason the kernel waits until return to user mode to deliver a signal or kill the process: it would potentially corrupt the kernel's data structures (code waiting on an interruptible sleep can receive an error which tells it to return to user space, where the process can be killed; code waiting on an uninterruptible sleep is not expecting any error).
Uninterruptable processes are USUALLY waiting for I/O following a page fault.
Consider this:
The thread tries to access a page which is not in core (either an executable which is demand-loaded, a page of anonymous memory which has been swapped out, or a mmap()'d file which is demand loaded, which are much the same thing)
The kernel is now (trying to) load it in
The process can't continue until the page is available.
The process/task cannot be interrupted in this state, because it can't handle any signals; if it did, another page fault would happen and it would be back where it was.
When I say "process", I really mean "task", which under Linux (2.6) roughly translates to "thread" which may or may not have an individual "thread group" entry in /proc
In some cases, it may be waiting for a long time. A typical example of this would be where the executable or mmap'd file is on a network filesystem where the server has failed. If the I/O eventually succeeds, the task will continue. If it eventually fails, the task will generally get a SIGBUS or something.
To your 3rd question:
I think you can kill the uninterruptable processes by running
sudo kill -HUP 1.
It will restart init without ending the running processes and after running it, my uninterruptable processes were gone.
If you are talking about a "zombie" process (which is designated as "zombie" in ps output), then this is a harmless record in the process list waiting for someone to collect its return code and it could be safely ignored.
Could you please describe what and "uninterruptable process" is for you? Does it survives the "kill -9 " and happily chugs along? If that is the case, then it's stuck on some syscall, which is stuck in some driver, and you are stuck with this process till reboot (and sometimes it's better to reboot soon) or unloading of relevant driver (which is unlikely to happen). You could try to use "strace" to find out where your process is stuck and avoid it in the future.

Resources