Sometimes when I write a program in Linux and it crashes due to a bug of some sort, it becomes an uninterruptible process and continues running forever until I restart my computer (even if I log out). My questions are:
What causes a process to become uninterruptible?
How do I stop that from happening?
This is probably a dumb question, but is there any way to interrupt it without restarting my computer?
An uninterruptible process is a process which happens to be in a system call (kernel function) that cannot be interrupted by a signal.
To understand what that means, you need to understand the concept of an interruptible system call. The classic example is read(). This is a system call that can take a long time (seconds) since it can potentially involve spinning up a hard drive, or moving heads. During most of this time, the process will be sleeping, blocking on the hardware.
While the process is sleeping in the system call, it can receive an asynchronous Unix signal (say, SIGTERM). When that happens:
The system call exits prematurely, and is set up to return -EINTR to user space (which the C library reports as -1 with errno set to EINTR).
The signal handler is executed.
If the process is still running, it gets the return value from the system call, and it can make the same call again.
Returning early from the system call enables the user space code to immediately alter its behavior in response to the signal. For example, terminating cleanly in reaction to SIGINT or SIGTERM.
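To make this concrete, here is a minimal C sketch of the retry-on-EINTR pattern; the handler name and the decision to simply log and retry are illustrative assumptions, not part of the original answer:

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal;

static void handler(int sig)
{
    got_signal = sig;            /* just record which signal arrived */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;     /* note: SA_RESTART deliberately not set */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTERM, &sa, NULL);

    char buf[4096];
    ssize_t n;
    for (;;) {
        n = read(STDIN_FILENO, buf, sizeof buf);  /* may sleep for a while */
        if (n >= 0)
            break;                                /* data or EOF */
        if (errno == EINTR) {
            /* The sleep was interrupted by the signal; this is the point
               where the program can react, e.g. terminate cleanly. */
            fprintf(stderr, "interrupted by signal %d, retrying\n",
                    (int)got_signal);
            continue;
        }
        perror("read");
        return 1;
    }
    printf("read %zd bytes\n", n);
    return 0;
}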
On the other hand, some system calls are not allowed to be interrupted in this way. If such a system call stalls for some reason, the process can remain indefinitely in this unkillable state.
LWN ran a nice article that touched on this topic in July.
To answer the original question:
How to prevent this from happening: figure out which driver is causing you trouble, and either stop using it, or become a kernel hacker and fix it.
How to kill an uninterruptible process without rebooting: somehow make the system call terminate. Frequently the most effective way to do this without hitting the power switch is to pull the power cord of the device the call is waiting on, so the driver gives up and the system call returns. You can also become a kernel hacker and make the driver use TASK_KILLABLE, as explained in the LWN article (a sketch of that pattern follows).
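If you do go the kernel route, a minimal sketch of the TASK_KILLABLE pattern might look like this. It assumes a reasonably recent kernel with wait_event_killable(); the module, wait-queue, and thread names are all invented for illustration:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/sched/signal.h>
#include <linux/delay.h>
#include <linux/err.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_wq);
static bool demo_done;
static struct task_struct *demo_task;

static int demo_waiter(void *unused)
{
    allow_signal(SIGKILL);           /* let SIGKILL reach this kthread */
    /* A killable sleep: ordinary signals do not wake the task (so there
     * is no EINTR handling to retrofit), but a fatal signal does, so the
     * task cannot get stuck forever in unkillable D state. */
    if (wait_event_killable(demo_wq, demo_done || kthread_should_stop()))
        pr_info("killable_demo: woken by a fatal signal\n");
    else
        pr_info("killable_demo: woken normally\n");
    while (!kthread_should_stop())   /* park until module unload */
        msleep(100);
    return 0;
}

static int __init demo_init(void)
{
    demo_task = kthread_run(demo_waiter, NULL, "killable_demo");
    return PTR_ERR_OR_ZERO(demo_task);
}

static void __exit demo_exit(void)
{
    demo_done = true;
    wake_up(&demo_wq);
    kthread_stop(demo_task);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

In a real driver the same wait_event_killable() call would sit in the syscall path instead of a demo kthread, so kill -9 on the stuck process would get through even while it sleeps in the kernel.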
When a process is in user mode, it can be interrupted at any time (switching to kernel mode). When the kernel returns to user mode, it checks whether there are any signals pending (including the ones used to kill the process, such as SIGTERM and SIGKILL). This means a process can be killed only on return to user mode.
The reason a process cannot be killed in kernel mode is that it could potentially corrupt the kernel structures used by all the other processes in the same machine (the same way killing a thread can potentially corrupt data structures used by other threads in the same process).
When the kernel needs to do something which could take a long time (waiting on a pipe written by another process or waiting for the hardware to do something, for instance), it sleeps by marking itself as sleeping and calling the scheduler to switch to another process (if there is no non-sleeping process, it switches to a "dummy" process which tells the CPU to slow down a bit and sits in a loop — the idle loop).
If a signal is sent to a sleeping process, it has to be woken up before it will return to user space and thus process the pending signal. Here we have the difference between the two main types of sleep:
TASK_INTERRUPTIBLE, the interruptible sleep. If a task is marked with this flag, it is sleeping, but can be woken by signals. This means the code which marked the task as sleeping is expecting a possible signal, and after it wakes up will check for it and return from the system call. After the signal is handled, the system call can potentially be automatically restarted (and I won't go into details on how that works).
TASK_UNINTERRUPTIBLE, the uninterruptible sleep. If a task is marked with this flag, it is not expecting to be woken up by anything other than whatever it is waiting for, either because it cannot easily be restarted, or because programs are expecting the system call to be atomic. This can also be used for sleeps known to be very short.
TASK_KILLABLE (mentioned in the LWN article linked to by ddaa's answer) is a newer variant, added in Linux 2.6.25.
This answers your first question. As to your second question: you can't avoid uninterruptible sleeps; they are a normal thing (one happens, for instance, every time a process reads from or writes to the disk), and they should last only a fraction of a second. If they last much longer, it usually means a hardware problem (or a device driver problem, which looks the same to the kernel), where the device driver is waiting for the hardware to do something which will never happen. It can also mean you are using NFS and the NFS server is down (the process is waiting for the server to recover; you can also use the "intr" mount option to avoid the problem).
Finally, the reason you cannot recover is the same reason the kernel waits until return to user mode to deliver a signal or kill the process: it would potentially corrupt the kernel's data structures (code waiting on an interruptible sleep can receive an error which tells it to return to user space, where the process can be killed; code waiting on an uninterruptible sleep is not expecting any error).
Uninterruptible processes are usually waiting for I/O following a page fault.
Consider this:
The thread tries to access a page which is not in core (either an executable which is demand-loaded, a page of anonymous memory which has been swapped out, or a mmap()'d file which is demand loaded, which are much the same thing)
The kernel is now (trying to) load it in
The process can't continue until the page is available.
The process/task cannot be interrupted in this state, because it can't handle any signals; if it did, another page fault would happen and it would be back where it was.
When I say "process", I really mean "task", which under Linux (2.6) roughly translates to "thread" which may or may not have an individual "thread group" entry in /proc
In some cases, it may be waiting for a long time. A typical example of this would be where the executable or mmap'd file is on a network filesystem where the server has failed. If the I/O eventually succeeds, the task will continue. If it eventually fails, the task will generally get a SIGBUS or something.
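As a user-space illustration of that last point, here is a small C sketch (the file name and page size are arbitrary) where the backing store behind an mmap()'d page disappears, so touching the page delivers SIGBUS instead of completing the fault:

#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf env;

static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(env, 1);               /* escape from the faulting access */
}

int main(void)
{
    char path[] = "/tmp/busdemo-XXXXXX";
    int fd = mkstemp(path);
    ftruncate(fd, 4096);              /* one page of backing store */

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);

    ftruncate(fd, 0);                 /* the backing store disappears */

    if (sigsetjmp(env, 1) == 0) {
        p[0] = 'x';   /* page fault: no page behind the mapping -> SIGBUS */
        puts("no fault (unexpected)");
    } else {
        puts("got SIGBUS: the faulting page could not be provided");
    }

    munmap(p, 4096);
    close(fd);
    unlink(path);
    return 0;
}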
To your 3rd question:
I think you can kill uninterruptible processes by running
sudo kill -HUP 1
This restarts init without ending the running processes, and after running it, my uninterruptible processes were gone.
If you are talking about a "zombie" process (designated as "zombie" in ps output), then it is a harmless record in the process list waiting for someone to collect its exit status, and it can be safely ignored.
Could you please describe what an "uninterruptible process" is for you? Does it survive kill -9 and happily chug along? If that is the case, then it's stuck on some syscall, which is stuck in some driver, and you are stuck with this process until reboot (and sometimes it's better to reboot soon) or until the relevant driver is unloaded (which is unlikely to happen). You could try strace to find out where your process is stuck and avoid it in the future.
I have this problem: I need to determine whether a Linux thread has stopped running because it crashed, rather than because it exited normally. The reason is to try to restart the thread without resetting/restarting the whole system.
pthread_join() does not seem like a good option because I have several threads to monitor and the function returns for one specific thread; it doesn't work in "parallel". At the moment I have a keep-alive signal from each thread to the main thread, but I'm looking for some system call or thread attribute to query the state.
Any suggestion?
Thread "crashes"
The only way that a pthreads thread can terminate abnormally while other threads in the process continue to run is via thread cancellation,* which is not well described as a "crash". In particular, if a signal is received whose effect is abnormal termination, then the whole process terminates, not just the thread that handled the signal. Other kinds of errors do not cause threads to terminate.
On the other hand, if by "crash" you mean normal termination in response to the thread detecting an error condition, then you have no limitation on what the thread can do prior to terminating to communicate about its state. For example,
it could update a shared object that tracks information about your threads
it could write to a pipe designated for the purpose
it could raise a signal
If you like, you can use pthread_cleanup_push() to register thread cleanup handlers to help with that.
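For instance, here is a sketch (all names invented) combining a shared status object with a cleanup handler, so the monitor can see that a thread has finished and with what result:

#include <pthread.h>
#include <stdio.h>

struct status {
    int done;     /* set by the cleanup handler when the thread ends */
    int error;    /* set by the thread before exiting on failure */
};

static void mark_done(void *arg)
{
    /* Runs when the thread terminates, whether by returning, by
       pthread_exit(), or by cancellation. */
    ((struct status *)arg)->done = 1;
}

static void *worker(void *arg)
{
    struct status *st = arg;
    pthread_cleanup_push(mark_done, st);
    st->error = 42;              /* pretend an error condition was detected */
    pthread_exit(NULL);          /* early termination; handler still runs */
    pthread_cleanup_pop(1);      /* not reached; keeps push/pop balanced */
    return NULL;
}

int main(void)
{
    struct status st = { 0, 0 };
    pthread_t t;
    pthread_create(&t, NULL, worker, &st);
    pthread_join(t, NULL);
    printf("done=%d error=%d\n", st.done, st.error);
    return 0;
}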
On the third hand, if you're asking about detecting live threads that are failing to make progress -- because they are deadlocked, for example -- then your best bet is probably to implement some form of heartbeat monitor. That would involve each thread you want to monitor periodically updating a shared object that tracks the time of each thread's last update. If a thread goes too long between beats then you can guess that it may be stalled. This requires you to instrument all the threads you want to monitor.
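A minimal sketch of such a heartbeat monitor follows; the thread count, beat period, and staleness threshold are all arbitrary choices, and one thread deliberately stalls to show the detection:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NTHREADS 4

/* One slot per monitored thread; workers write, the monitor reads. */
static _Atomic time_t heartbeat[NTHREADS];

static void *worker(void *arg)
{
    int id = (int)(intptr_t)arg;
    for (;;) {
        heartbeat[id] = time(NULL);        /* "I'm alive" */
        /* ... one unit of real work would go here ... */
        usleep(100 * 1000);
        if (id == 3)                       /* simulate one thread stalling */
            pause();
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        heartbeat[i] = time(NULL);         /* primed so startup isn't flagged */
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
    }

    for (;;) {
        sleep(1);
        time_t now = time(NULL);
        for (int i = 0; i < NTHREADS; i++)
            if (now - heartbeat[i] > 2)    /* no beat for 2 s: suspicious */
                fprintf(stderr, "thread %d may be stalled\n", i);
    }
}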
Thread cancellation
You should not use thread cancellation. But if you did, and if you include termination because of cancellation in your definition of "crash", then you still have all the options above available to you, but you must engage them by registering one or more cleanup handlers.
GNU-specific options
The main issues with using pthread_join() to check thread state are
it doesn't work for daemon threads, and
pthread_join() blocks until the specified thread terminates.
For daemon threads, you need one of the approaches already discussed, but for ordinary threads on GNU/Linux, Glibc provides non-standard pthread_tryjoin_np(), which performs a non-blocking attempt to join a thread, and also pthread_timedjoin_np(), which performs a join attempt with a timeout. If you are willing to rely on Glibc-specific functions then one of these might serve your purpose.
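A sketch of the non-blocking variant, assuming Glibc (hence the _np suffix and the _GNU_SOURCE feature macro); the worker's sleep just stands in for real work:

#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
    (void)arg;
    sleep(2);                    /* pretend to do some work */
    return (void *)42;
}

int main(void)
{
    pthread_t t;
    void *res;
    pthread_create(&t, NULL, worker, NULL);

    for (;;) {
        int rc = pthread_tryjoin_np(t, &res);
        if (rc == 0) {                       /* thread has terminated */
            printf("finished, result=%ld\n", (long)res);
            break;
        }
        if (rc != EBUSY) {                   /* real error */
            fprintf(stderr, "tryjoin failed: %d\n", rc);
            break;
        }
        puts("still running; doing something else meanwhile");
        sleep(1);                            /* poll again later */
    }
    return 0;
}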
Linux-specific options
The Linux kernel makes per-process thread status information available via the /proc filesystem. See How to check the state of Linux threads?, for example. Do be aware, however, that the details vary a bit from one kernel version to another. And if you're planning to do this a lot, then also be aware that even though /proc is a virtual filesystem (so no physical disk is involved), you still access it via slow-ish I/O interfaces.
Any of the other alternatives is probably better than reading files in /proc. I mention it only for completeness.
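For completeness, a sketch of reading a thread's state letter this way; thread_state() is a made-up helper, and note that the tid is the kernel thread id (as from gettid()), not a pthread_t:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Made-up helper: return the state letter (R, S, D, Z, ...) of thread
   `tid` in process `pid`, or '?' if it cannot be read. */
static char thread_state(pid_t pid, pid_t tid)
{
    char path[64], line[256], state = '?';
    FILE *f;

    snprintf(path, sizeof path, "/proc/%d/task/%d/status",
             (int)pid, (int)tid);
    f = fopen(path, "r");
    if (!f)
        return '?';
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "State:", 6) == 0) {
            sscanf(line + 6, " %c", &state);  /* e.g. "State: S (sleeping)" */
            break;
        }
    }
    fclose(f);
    return state;
}

int main(void)
{
    /* The main thread's tid equals the pid. */
    printf("main thread state: %c\n", thread_state(getpid(), getpid()));
    return 0;
}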
Overall
I'm looking for some system call or thread attribute to understand the state
The pthreads API does not provide a "have you terminated?" function or any other such state-inquiry function, unless you count pthread_join(). If you want that then you need to roll your own, which you can do by means of some of the facilities already discussed.
*Do not use thread cancellation.
I want to know when a Linux process handles a signal.
Assuming that the process has installed a signal handler for a signal, I want to know when the process's normal execution flow would be interrupted and the signal handler called.
According to http://www.tldp.org/LDP/tlk/ipc/ipc.html, the process would handle the signal when it exits from a system call. This would mean that a normal instruction like a = b+c (or its equivalent machine code) would not be interrupted because of signal.
Also, there are system calls which get interrupted (and fail with EINTR or get restarted) upon receiving a signal. This means the signal is processed even before the system call completes. This behaviour seems to conflict with what I mentioned in the previous paragraph.
So I am not clear on when the signal is processed and in which process states it would be handled. Can the process be interrupted:
Anytime it enters from kernel space to user space, or
Anytime it is in user space, or
Anytime the process is scheduled for execution by the scheduler
Thanks!
According to http://www.tldp.org/LDP/tlk/ipc/ipc.html, the process would handle the signal when it exits from a system call. This would mean that a normal instruction like a = b+c (or its equivalent machine code) would not be interrupted because of signal.
Well, if that were the case, a CPU-intensive process would not obey the process scheduler. The scheduler, in fact, can interrupt a process at any point in time when its time quantum has elapsed, unless it is a FIFO real-time process.
A more correct definition: One point when a signal is delivered to the process is when the control flow leaves the kernel mode to resume executing user-mode code. That doesn't necessarily involve a system call.
A lot of the semantics of signal handling are documented (for Linux, anyway - other OSes probably have similar, but not necessarily in the same spot) in the section 7 signal manual page, which, if installed on your system, can be accessed like this:
man 7 signal
If manual pages are not installed, online copies are pretty easy to find.
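A small sketch showing that delivery point in action: the same blocking read() either fails with EINTR or is transparently restarted, depending on whether the handler was installed with SA_RESTART (the 2-second alarm and the "restart" argument are arbitrary choices for the demo):

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void handler(int sig)
{
    (void)sig;                    /* nothing to do; we only interrupt read() */
}

int main(int argc, char **argv)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    if (argc > 1 && strcmp(argv[1], "restart") == 0)
        sa.sa_flags = SA_RESTART; /* kernel restarts the call after handler */
    sigaction(SIGALRM, &sa, NULL);

    alarm(2);                     /* SIGALRM arrives while read() sleeps */

    char buf[64];
    ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
    if (n < 0 && errno == EINTR)
        puts("read() was interrupted: signal handled on return to user mode");
    else
        printf("read() returned %zd (call restarted or data arrived)\n", n);
    return 0;
}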
What state does a process have when it calls a syscall?
I mean, don't assume it's an I/O syscall like read or write...
Is it the process itself that executes the kernel code, or is the process suspended while something like a "kernel thread" executes the syscall handler (and knows which process made the call, via current)?
I'm not sure whether it changes from executing to ready, or from executing to blocked.
It's the process itself that switches to kernel mode and executes the system call - although it switches to a kernel stack to do so. A process executing inside the kernel has state Running, and can be pre-empted and end up in state Runnable.
It depends what the syscall does.
Suppose there is a hypothetical syscall which calculates PI to a lot of digits, and places the result in a buffer the application specifies, then the process will probably just be in the "R" running state. Switching to kernel mode doesn't stop it running in the context of the task that made the call.
Of course many system calls wait for things - consider sleep() for example, which releases the CPU rather than spinning. This puts the process to sleep, having registered a kernel timer to wake it up.
Quite a lot of syscalls never sleep, the likes of getpid(), which just retrieves information that is always in RAM. And many which do sometimes sleep don't necessarily do so; for example, if you call read() on data already in a kernel buffer.
In Linux, what happens to the state of a process when it needs to read blocks from a disk? Is it blocked? If so, how is another process chosen to execute?
When a process needs to fetch data from a disk, it effectively stops running on the CPU to let other processes run, because the operation might take a long time to complete – at least 5 ms seek time for a disk is common, and 5 ms is 10 million CPU cycles on a 2 GHz CPU, an eternity from the point of view of the program!
From the programmer point of view (also said "in userspace"), this is called a blocking system call. If you call write(2) (which is a thin libc wrapper around the system call of the same name), your process does not exactly stop at that boundary; it continues, in the kernel, running the system call code. Most of the time it goes all the way up to a specific disk controller driver (filename → filesystem/VFS → block device → device driver), where a command to fetch a block on disk is submitted to the proper hardware, which is a very fast operation most of the time.
Only then is the process put to sleep (in kernel space, blocking is called sleeping – nothing is ever 'blocked' from the kernel point of view). It will be awakened once the hardware has finally fetched the proper data; then the process will be marked as runnable and will be scheduled. Eventually, the scheduler will run the process.
Finally, in userspace, the blocking system call returns with proper status and data, and the program flow goes on.
It is possible to invoke many I/O system calls in non-blocking mode (see O_NONBLOCK in open(2) and fcntl(2)). In this case, the system calls return immediately, reporting only whether the operation could be started; the programmer has to explicitly check at a later time whether the operation can be completed, successfully or not, and fetch its result (e.g., with select(2)). Note that this applies to pipes, sockets, terminals and the like; O_NONBLOCK has no effect on regular-file disk reads, which need a true asynchronous I/O interface instead. This is called asynchronous or event-based programming.
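A sketch of the non-blocking pattern using a pipe (a pipe rather than a regular file, since regular files ignore O_NONBLOCK, as noted above):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main(void)
{
    int p[2];
    if (pipe(p) < 0) { perror("pipe"); return 1; }
    fcntl(p[0], F_SETFL, O_NONBLOCK);      /* read end won't sleep */

    char buf[16];
    ssize_t n = read(p[0], buf, sizeof buf);
    if (n < 0 && errno == EAGAIN)
        puts("no data yet: read() returned immediately instead of sleeping");

    write(p[1], "hi", 2);                  /* now make data available */

    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(p[0], &rfds);
    select(p[0] + 1, &rfds, NULL, NULL, NULL);  /* sleep until readable */

    n = read(p[0], buf, sizeof buf);
    printf("read %zd bytes after select() said the fd was ready\n", n);
    return 0;
}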
Most answers here mentioning the D state (which is called TASK_UNINTERRUPTIBLE in the Linux state names) are incorrect. The D state is a special sleep mode which is only triggered in a kernel space code path, when that code path can't be interrupted (because it would be too complex to program), with the expectation that it would block only for a very short time. I believe that most "D states" are actually invisible; they are very short lived and can't be observed by sampling tools such as 'top'.
You can encounter unkillable processes in the D state in a few situations. NFS is famous for that, and I've encountered it many times. I think there's a semantic clash between some VFS code paths, which assume they always reach local disks with fast error detection (on SATA, an error timeout would be around a few hundred milliseconds), and NFS, which actually fetches data from the network, which is more resilient and has slow recovery (a TCP timeout of 300 seconds is common). Read this article for the cool solution introduced in Linux 2.6.25 with the TASK_KILLABLE state. Before this era there was a hack where you could actually deliver signals to NFS client processes by sending a SIGKILL to the kernel thread rpciod, but forget about that ugly trick.…
While waiting for a read() or write() on a file descriptor to return, the process will be put in a special kind of sleep, known as "D" or "disk sleep". This is special because the process cannot be killed or interrupted while in such a state. A process waiting for a return from ioctl() would also be put to sleep in this manner.
An exception to this is when a file (such as a terminal or other character device) is opened in O_NONBLOCK mode, used when it's assumed that a device (such as a modem) will need time to initialize. However, you indicated block devices in your question. Also, I have never tried an ioctl() that is likely to block on an fd opened in non-blocking mode (at least not knowingly).
How another process is chosen depends entirely on the scheduler you are using, as well as what other processes might have done to modify their weights within that scheduler.
Some user space programs under certain circumstances have been known to remain in this state forever, until rebooted. These are typically grouped in with other "zombies", but the term would not be correct as they are not technically defunct.
A process performing I/O will be put in D state (uninterruptable sleep), which frees the CPU until there is a hardware interrupt which tells the CPU to return to executing the program. See man ps for the other process states.
Depending on your kernel, there is a process scheduler which keeps track of a runqueue of processes ready to execute. It, along with a scheduling algorithm, tells the kernel which process to assign to which CPU. There are kernel processes and user processes to consider. Each process is allocated a time-slice, which is a chunk of CPU time it is allowed to use. Once the process uses all of its time-slice, it is marked as expired and given lower priority in the scheduling algorithm.
In the 2.6 kernel, there is an O(1) time complexity scheduler, so no matter how many processes you have running, it will assign CPUs in constant time. It is more complicated though, since 2.6 introduced preemption and CPU load balancing is not an easy algorithm. In any case, it's efficient and CPUs will not remain idle while you wait for the I/O.
As already explained by others, processes in the "D" state (uninterruptible sleep) are responsible for hangs of the ps process. To me it has happened many times with RedHat 6.x and automounted NFS home directories.
To list processes in D state you can use the following commands:
cd /proc
for i in [0-9]*; do echo -n "$i: "; grep '^State' "$i/status"; done | grep D
To find the current directory of the process and, maybe, the mounted NFS disk that has issues, you can use a command similar to the following example (replace 31134 with the sleeping process number):
# ls -l /proc/31134/cwd
lrwxrwxrwx 1 pippo users 0 Aug 2 16:25 /proc/31134/cwd -> /auto/pippo
I found that issuing the umount command with the -f (force) switch on the related mounted NFS file system was able to wake up the sleeping process:
umount -f /auto/pippo
The file system wasn't unmounted, because it was busy, but the related process did wake up and I was able to solve the issue without rebooting.
Assuming your process is a single thread, and that you're using blocking I/O, your process will block waiting for the I/O to complete. The kernel will pick another process to run in the meantime based on niceness, priority, last run time, etc. If there are no other runnable processes, the kernel won't run any; instead, it'll tell the hardware the machine is idle (which will result in lower power consumption).
Processes that are waiting for I/O to complete typically show up in state D in, e.g., ps and top.
Yes, the task gets blocked in the read() system call. Another task which is ready runs, or if no other tasks are ready, the idle task (for that CPU) runs.
A normal, blocking disc read causes the task to enter the "D" state (as others have noted). Such tasks contribute to the load average, even though they're not consuming the CPU.
Some other types of IO, especially ttys and network, do not behave quite the same - the process ends up in "S" state and can be interrupted and doesn't count against the load average.
Yes, tasks waiting for IO are blocked, and other tasks get executed. Selecting the next task is done by the Linux scheduler.
Generally the process will block. If the read operation is on a file descriptor marked as non-blocking or if the process is using asynchronous IO it won't block. Also if the process has other threads that aren't blocked they can continue running.
The decision as to which process runs next is up to the scheduler in the kernel.