get current->pid while in interrupt - linux

I'm writing something on the linux scheduler and I need to know which process was running before my interrupt came in.. is the current structure available? If I do current->pid while in the interrupt handler, do I get the pid of the process I interrupted?

You can, current->pid exists and is the process that was interrupted (may be the idle thread, or any).
If you're writing inside the Linux scheduler, you should be very careful. current is changed by the scheduler as it chooses a new process to run, so its value depends on when exactly you read it.

I wouldn't expect current to be valid outside process context. If you're working on scheduler maybe you can get hold of where it stores pointer to running task, e.g. struct cfs_rq.

Related

Can a Linux process/thread terminate without pass through do_exit()?

To verify the behavior of a third party binary distributed software I'd like to use, I'm implementing a kernel module whose objective is to keep track of each child this software produces and terminates.
The target binary is a Golang produced one, and it is heavily multi thread.
The kernel module I wrote installs hooks on the kernel functions _do_fork() and do_exit() to keep track of each process/thread this binary produces and terminates.
The LKM works, more or less.
During some conditions, however, I have a scenario I'm not able to explain.
It seems like a process/thread could terminate without passing through do_exit().
The evidence I collected by putting printk() shows the process creation but does not indicate the process termination.
I'm aware that printk() can be slow, and I'm also aware that messages can be lost in such situations.
Trying to prevent message loss due to slow console (for this particular application, serial tty 115200 is used), I tried to implement a quicker console, and messages have been collected using netconsole.
The described setup seems to confirm a process can terminate without pass through the do_exit() function.
But because I wasn't sure my messages couldn't be lost on the printk() infrastructure, I decided to repeat the same test but replacing printk() with ftrace_printk(), which should be a leaner alternative to printk().
Still the same result, occasionally I see processes not passing through the do_exit(), and verifying if the PID is currently running, I have to face the fact that it is not running.
Also note that I put my hook in the do_exit() kernel function as the first instruction to ensure the function flow does not terminate inside a called function.
My question is then the following:
Can a Linux process terminate without its flow pass through the do_exit() function?
If so, can someone give me a hint of what this scenario can be?
After a long debug session, I'm finally able to answer my own question.
That's not all; I'm also able to explain why I saw the strange behavior I described in my scenario.
Let's start from the beginning: monitoring a heavily multithreading application. I observed rare cases where a PID that suddenly stops exists without observing its flow to pass through the Linux Kernel do_exit() function.
Because this my original question:
Can a Linux process terminate without pass through the do_exit() function?
As for my current knowledge, which I would by now consider reasonably extensive, a Linux process can not end its execution without pass through the do_exit() function.
But this answer is in contrast with my observations, and the problem leading me to this question is still there.
Someone here suggested that the strange behavior I watched was because my observations were somehow wrong, alluding my method was inaccurate, as for my conclusions.
My observations were correct, and the process I watched didn't pass through the do_exit() but terminated.
To explain this phenomenon, I want to put on the table another question that I think internet searchers may find somehow useful:
Can two processes share the same PID?
If you'd asked me this a month ago, I'd surely answered this question with: "definitively no, two processes can not share the same PID."
Linux is more complex, though.
There's a situation in which, in a Linux system, two different processes can share the same PID!
https://elixir.bootlin.com/linux/v4.19.20/source/fs/exec.c#L1141
Surprisingly, this behavior does not harm anyone; when this happens, one of these two processes is a zombie.
updated to correct an error
The circumstances of this duplicate PID are more intricate than those described previously. The process must flush the previous exec context if a threaded process forks before invoking an execve (the fork copies also the threads). If the intention is to use the execve() function to execute a new text, the kernel must first call the flush_old_exec()  function, which then calls the de_thread() function for each thread in the process other than the task leader. Except the task leader, all the process' threads are eliminated as a result. Each thread's PID is changed to that of the leader, and if it is not immediately terminated, for example because it needs to wait an operation completion, it keeps using that PID.
end of the update
That was what I was watching; the PID I was monitoring did not pass through the do_exit() because when the corresponding thread terminated, it had no more the PID it had when it started, but it had its leader's.
For people who know the Linux Kernel's mechanics very well, this is nothing to be surprised for; this behavior is intended and hasn't changed since 2.6.17.
Current 5.10.3, is still this way.
Hoping this to be useful to internet searchers; I'd also like to add that this also answers the followings:
Question: Can a Linux process/thread terminate without pass through do_exit()? Answer: NO, do_exit() is the only way a process has to end its execution — both intentional than unintentional.
Question: Can two processes share the same PID? Answer: Normally don't. There's some rare case in which two schedulable entities have the same PID.
Question: Do Linux kernel have scenarios where a process change its PID? Answer: yes, there's at least one scenario where a Process changes its PID.
Can a Linux process terminate without its flow pass through the do_exit() function?
Probably not, but you should study the source code of the Linux kernel to be sure. Ask on KernelNewbies. Kernel threads and udev or systemd related things (or perhaps modprobe or the older hotplug) are probable exceptions. When your /sbin/init of pid 1 terminates (that should not happen) strange things would happen.
The LKM works, more or less.
What does that means? How could a kernel module half-work?
And in real life, it does happen sometimes that your Linux kernel is panicking or crashes (and it could happen with your LKM, if it has not been peer-reviewed by the Linux kernel community). In such a case, there is no more any notion of processes, since they are an abstraction provided by a living Linux kernel.
See also dmesg(1), strace(1), proc(5), syscalls(2), ptrace(2), clone(2), fork(2), execve(2), waitpid(2), elf(5), credentials(7), pthreads(7)
Look also inside the source code of your libc, e.g. GNU libc or musl-libc
Of course, see Linux From Scratch and Advanced Linux Programming
And verifying if the PID is currently running,
This can be done is user land with /proc/, or using kill(2) with a 0 signal (and maybe also pidfd_send_signal(2)...)
PS. I still don't understand why you need to write a kernel module or change the kernel code. My intuition would be to avoid doing that when possible.

Does kill(SIGSTOP) take effect by the time kill() returns?

Suppose I have a parent process and a child process (started with e.g. fork() or clone()) running on Linux. Further suppose that there is some shared memory that both the parent and the child can modify.
Within the context of the parent process, I would like to stop the child process and know that it has actually stopped, and moreover that any shared memory writes made by the child are visible to the parent (including whatever synchronization or cache flushes that may require in a multi-processor system).
This answer, which speaks of using kill(SIGSTOP) to stop a child process, contains an interesting tidbit:
When the first kill() call succeeds, you can safely assume that the child has stopped.
Is this statement actually true, and if so, can anyone expound on it, or point me to some more detailed documentation (e.g. a Linux manpage)? Otherwise, is there another mechanism that I can use to ensure that the child process is completely stopped and is not going to be doing any more writes to the shared memory?
I'm imagining something along the lines of:
the parent sends a different signal (e.g. SIGUSR1), which the child can handle
the child handles the SIGUSR1 and does something like a pthread_cond_wait() in the signal handler to safely "stop" (though still running from the kernel perspective) -- this is not fully fleshed out in my mind yet, just an idea
I'd like to avoid reinventing the wheel if there's already an established solution to this problem. Note that the child process needs to be stopped preemptively; adding some kind of active polling to the child process is not an option in this case.
If it only existed on Linux, pthread_suspend() would be perfect ...
It definitely sounds like you should be using a custom signal with a handler, and not sigstop.
It's rare not to care about the state of the child at all, e.g. being fine with it having stored 32bits out of a single non-atomic 64bit write, or logically caught between two dependent writes.
Even if you are, POSIX allows the OS to not make shared writes immediately visible to other processes, so the child should have a chance to call msync for portability, to ensure that writes are completely synced.
The POSIX documentation on Signal Concepts strongly suggests, but does not explicitly say, that the targeted process will be STOPped by the time kill() returns:
A signal is said to be "generated" for (or sent to) a process or thread when the event that causes the signal first occurs... Examples of such events include ... invocations of the kill() and sigqueue() functions.
The documentation is at pains to distinguish signal generation from delivery (when the signal action takes effect) or acceptance. Unfortunately, it sometimes mentions actions taken in response to a stop signal upon generation, and sometimes upon delivery. Given that something must happen upon generation per se, I'd agree that the target process must be STOPped by the time your call returns.
However, at the cost of another syscall, you can be sure. Since you have a parent/child relationship in your design, you can waitpid()/WUNTRACED to receive notification that your child process has, indeed, STOPped.
Edit
See the other answer from that other guy [sic] for reasons why you might not want to do this.

Why linux process with status 'D' can be killed ? [duplicate]

Sometimes whenever I write a program in Linux and it crashes due to a bug of some sort, it will become an uninterruptible process and continue running forever until I restart my computer (even if I log out). My questions are:
What causes a process to become uninterruptible?
How do I stop that from happening?
This is probably a dumb question, but is there any way to interrupt it without restarting my computer?
An uninterruptible process is a process which happens to be in a system call (kernel function) that cannot be interrupted by a signal.
To understand what that means, you need to understand the concept of an interruptible system call. The classic example is read(). This is a system call that can take a long time (seconds) since it can potentially involve spinning up a hard drive, or moving heads. During most of this time, the process will be sleeping, blocking on the hardware.
While the process is sleeping in the system call, it can receive a Unix asynchronous signal (say, SIGTERM), then the following happens:
The system call exits prematurely, and is set up to return -EINTR to user space.
The signal handler is executed.
If the process is still running, it gets the return value from the system call, and it can make the same call again.
Returning early from the system call enables the user space code to immediately alter its behavior in response to the signal. For example, terminating cleanly in reaction to SIGINT or SIGTERM.
On the other hand, some system calls are not allowed to be interrupted in this way. If the system calls stalls for some reason, the process can indefinitely remains in this unkillable state.
LWN ran a nice article that touched this topic in July.
To answer the original question:
How to prevent this from happening: figure out which driver is causing you trouble, and either stop using, or become a kernel hacker and fix it.
How to kill an uninterruptible process without rebooting: somehow make the system call terminate. Frequently the most effective manner to do this without hitting the power switch is to pull the power cord. You can also become a kernel hacker and make the driver use TASK_KILLABLE, as explained in the LWN article.
When a process is on user mode, it can be interrupted at any time (switching to kernel mode). When the kernel returns to user mode, it checks if there are any signals pending (including the ones which are used to kill the process, such as SIGTERM and SIGKILL). This means a process can be killed only on return to user mode.
The reason a process cannot be killed in kernel mode is that it could potentially corrupt the kernel structures used by all the other processes in the same machine (the same way killing a thread can potentially corrupt data structures used by other threads in the same process).
When the kernel needs to do something which could take a long time (waiting on a pipe written by another process or waiting for the hardware to do something, for instance), it sleeps by marking itself as sleeping and calling the scheduler to switch to another process (if there is no non-sleeping process, it switches to a "dummy" process which tells the cpu to slow down a bit and sits in a loop — the idle loop).
If a signal is sent to a sleeping process, it has to be woken up before it will return to user space and thus process the pending signal. Here we have the difference between the two main types of sleep:
TASK_INTERRUPTIBLE, the interruptible sleep. If a task is marked with this flag, it is sleeping, but can be woken by signals. This means the code which marked the task as sleeping is expecting a possible signal, and after it wakes up will check for it and return from the system call. After the signal is handled, the system call can potentially be automatically restarted (and I won't go into details on how that works).
TASK_UNINTERRUPTIBLE, the uninterruptible sleep. If a task is marked with this flag, it is not expecting to be woken up by anything other than whatever it is waiting for, either because it cannot easily be restarted, or because programs are expecting the system call to be atomic. This can also be used for sleeps known to be very short.
TASK_KILLABLE (mentioned in the LWN article linked to by ddaa's answer) is a new variant.
This answers your first question. As to your second question: you can't avoid uninterruptible sleeps, they are a normal thing (it happens, for instance, every time a process reads/writes from/to the disk); however, they should last only a fraction of a second. If they last much longer, it usually means a hardware problem (or a device driver problem, which looks the same to the kernel), where the device driver is waiting for the hardware to do something which will never happen. It can also mean you are using NFS and the NFS server is down (it is waiting for the server to recover; you can also use the "intr" option to avoid the problem).
Finally, the reason you cannot recover is the same reason the kernel waits until return to user mode to deliver a signal or kill the process: it would potentially corrupt the kernel's data structures (code waiting on an interruptible sleep can receive an error which tells it to return to user space, where the process can be killed; code waiting on an uninterruptible sleep is not expecting any error).
Uninterruptable processes are USUALLY waiting for I/O following a page fault.
Consider this:
The thread tries to access a page which is not in core (either an executable which is demand-loaded, a page of anonymous memory which has been swapped out, or a mmap()'d file which is demand loaded, which are much the same thing)
The kernel is now (trying to) load it in
The process can't continue until the page is available.
The process/task cannot be interrupted in this state, because it can't handle any signals; if it did, another page fault would happen and it would be back where it was.
When I say "process", I really mean "task", which under Linux (2.6) roughly translates to "thread" which may or may not have an individual "thread group" entry in /proc
In some cases, it may be waiting for a long time. A typical example of this would be where the executable or mmap'd file is on a network filesystem where the server has failed. If the I/O eventually succeeds, the task will continue. If it eventually fails, the task will generally get a SIGBUS or something.
To your 3rd question:
I think you can kill the uninterruptable processes by running
sudo kill -HUP 1.
It will restart init without ending the running processes and after running it, my uninterruptable processes were gone.
If you are talking about a "zombie" process (which is designated as "zombie" in ps output), then this is a harmless record in the process list waiting for someone to collect its return code and it could be safely ignored.
Could you please describe what and "uninterruptable process" is for you? Does it survives the "kill -9 " and happily chugs along? If that is the case, then it's stuck on some syscall, which is stuck in some driver, and you are stuck with this process till reboot (and sometimes it's better to reboot soon) or unloading of relevant driver (which is unlikely to happen). You could try to use "strace" to find out where your process is stuck and avoid it in the future.

Linux - Handle process termination

I need to watch for a process with a known PID in Linux. Once it is terminated want to execute a command with reason of termination.
Questions
How to subscribe to process health rather than polling (e.g. watch
command)?
Where to inject the event handler in OS' user space?
How to detect the termination/failure reason inside handler?
Note
The process I intend to keep a tab on is not forked as a child process of some parent through which it can be monitored.
The process type is generic (good number of them are daemons)
The only way to get that kind of control over another process, is to use ptrace(2) to trace the target process. You would use ptrace(PTRACE_ATTACH, pid) to attach to the process, after which you effectively become the target process's parent (and can use wait or more ptrace calls to figure out what the process is doing).

What is an uninterruptible process?

Sometimes whenever I write a program in Linux and it crashes due to a bug of some sort, it will become an uninterruptible process and continue running forever until I restart my computer (even if I log out). My questions are:
What causes a process to become uninterruptible?
How do I stop that from happening?
This is probably a dumb question, but is there any way to interrupt it without restarting my computer?
An uninterruptible process is a process which happens to be in a system call (kernel function) that cannot be interrupted by a signal.
To understand what that means, you need to understand the concept of an interruptible system call. The classic example is read(). This is a system call that can take a long time (seconds) since it can potentially involve spinning up a hard drive, or moving heads. During most of this time, the process will be sleeping, blocking on the hardware.
While the process is sleeping in the system call, it can receive a Unix asynchronous signal (say, SIGTERM), then the following happens:
The system call exits prematurely, and is set up to return -EINTR to user space.
The signal handler is executed.
If the process is still running, it gets the return value from the system call, and it can make the same call again.
Returning early from the system call enables the user space code to immediately alter its behavior in response to the signal. For example, terminating cleanly in reaction to SIGINT or SIGTERM.
On the other hand, some system calls are not allowed to be interrupted in this way. If the system calls stalls for some reason, the process can indefinitely remains in this unkillable state.
LWN ran a nice article that touched this topic in July.
To answer the original question:
How to prevent this from happening: figure out which driver is causing you trouble, and either stop using, or become a kernel hacker and fix it.
How to kill an uninterruptible process without rebooting: somehow make the system call terminate. Frequently the most effective manner to do this without hitting the power switch is to pull the power cord. You can also become a kernel hacker and make the driver use TASK_KILLABLE, as explained in the LWN article.
When a process is on user mode, it can be interrupted at any time (switching to kernel mode). When the kernel returns to user mode, it checks if there are any signals pending (including the ones which are used to kill the process, such as SIGTERM and SIGKILL). This means a process can be killed only on return to user mode.
The reason a process cannot be killed in kernel mode is that it could potentially corrupt the kernel structures used by all the other processes in the same machine (the same way killing a thread can potentially corrupt data structures used by other threads in the same process).
When the kernel needs to do something which could take a long time (waiting on a pipe written by another process or waiting for the hardware to do something, for instance), it sleeps by marking itself as sleeping and calling the scheduler to switch to another process (if there is no non-sleeping process, it switches to a "dummy" process which tells the cpu to slow down a bit and sits in a loop — the idle loop).
If a signal is sent to a sleeping process, it has to be woken up before it will return to user space and thus process the pending signal. Here we have the difference between the two main types of sleep:
TASK_INTERRUPTIBLE, the interruptible sleep. If a task is marked with this flag, it is sleeping, but can be woken by signals. This means the code which marked the task as sleeping is expecting a possible signal, and after it wakes up will check for it and return from the system call. After the signal is handled, the system call can potentially be automatically restarted (and I won't go into details on how that works).
TASK_UNINTERRUPTIBLE, the uninterruptible sleep. If a task is marked with this flag, it is not expecting to be woken up by anything other than whatever it is waiting for, either because it cannot easily be restarted, or because programs are expecting the system call to be atomic. This can also be used for sleeps known to be very short.
TASK_KILLABLE (mentioned in the LWN article linked to by ddaa's answer) is a new variant.
This answers your first question. As to your second question: you can't avoid uninterruptible sleeps, they are a normal thing (it happens, for instance, every time a process reads/writes from/to the disk); however, they should last only a fraction of a second. If they last much longer, it usually means a hardware problem (or a device driver problem, which looks the same to the kernel), where the device driver is waiting for the hardware to do something which will never happen. It can also mean you are using NFS and the NFS server is down (it is waiting for the server to recover; you can also use the "intr" option to avoid the problem).
Finally, the reason you cannot recover is the same reason the kernel waits until return to user mode to deliver a signal or kill the process: it would potentially corrupt the kernel's data structures (code waiting on an interruptible sleep can receive an error which tells it to return to user space, where the process can be killed; code waiting on an uninterruptible sleep is not expecting any error).
Uninterruptable processes are USUALLY waiting for I/O following a page fault.
Consider this:
The thread tries to access a page which is not in core (either an executable which is demand-loaded, a page of anonymous memory which has been swapped out, or a mmap()'d file which is demand loaded, which are much the same thing)
The kernel is now (trying to) load it in
The process can't continue until the page is available.
The process/task cannot be interrupted in this state, because it can't handle any signals; if it did, another page fault would happen and it would be back where it was.
When I say "process", I really mean "task", which under Linux (2.6) roughly translates to "thread" which may or may not have an individual "thread group" entry in /proc
In some cases, it may be waiting for a long time. A typical example of this would be where the executable or mmap'd file is on a network filesystem where the server has failed. If the I/O eventually succeeds, the task will continue. If it eventually fails, the task will generally get a SIGBUS or something.
To your 3rd question:
I think you can kill the uninterruptable processes by running
sudo kill -HUP 1.
It will restart init without ending the running processes and after running it, my uninterruptable processes were gone.
If you are talking about a "zombie" process (which is designated as "zombie" in ps output), then this is a harmless record in the process list waiting for someone to collect its return code and it could be safely ignored.
Could you please describe what and "uninterruptable process" is for you? Does it survives the "kill -9 " and happily chugs along? If that is the case, then it's stuck on some syscall, which is stuck in some driver, and you are stuck with this process till reboot (and sometimes it's better to reboot soon) or unloading of relevant driver (which is unlikely to happen). You could try to use "strace" to find out where your process is stuck and avoid it in the future.

Resources