Can a Linux process/thread terminate without pass through do_exit()? - linux

To verify the behavior of a third party binary distributed software I'd like to use, I'm implementing a kernel module whose objective is to keep track of each child this software produces and terminates.
The target binary is a Golang produced one, and it is heavily multi thread.
The kernel module I wrote installs hooks on the kernel functions _do_fork() and do_exit() to keep track of each process/thread this binary produces and terminates.
The LKM works, more or less.
During some conditions, however, I have a scenario I'm not able to explain.
It seems like a process/thread could terminate without passing through do_exit().
The evidence I collected by putting printk() shows the process creation but does not indicate the process termination.
I'm aware that printk() can be slow, and I'm also aware that messages can be lost in such situations.
Trying to prevent message loss due to slow console (for this particular application, serial tty 115200 is used), I tried to implement a quicker console, and messages have been collected using netconsole.
The described setup seems to confirm a process can terminate without pass through the do_exit() function.
But because I wasn't sure my messages couldn't be lost on the printk() infrastructure, I decided to repeat the same test but replacing printk() with ftrace_printk(), which should be a leaner alternative to printk().
Still the same result, occasionally I see processes not passing through the do_exit(), and verifying if the PID is currently running, I have to face the fact that it is not running.
Also note that I put my hook in the do_exit() kernel function as the first instruction to ensure the function flow does not terminate inside a called function.
My question is then the following:
Can a Linux process terminate without its flow pass through the do_exit() function?
If so, can someone give me a hint of what this scenario can be?

After a long debug session, I'm finally able to answer my own question.
That's not all; I'm also able to explain why I saw the strange behavior I described in my scenario.
Let's start from the beginning: monitoring a heavily multithreading application. I observed rare cases where a PID that suddenly stops exists without observing its flow to pass through the Linux Kernel do_exit() function.
Because this my original question:
Can a Linux process terminate without pass through the do_exit() function?
As for my current knowledge, which I would by now consider reasonably extensive, a Linux process can not end its execution without pass through the do_exit() function.
But this answer is in contrast with my observations, and the problem leading me to this question is still there.
Someone here suggested that the strange behavior I watched was because my observations were somehow wrong, alluding my method was inaccurate, as for my conclusions.
My observations were correct, and the process I watched didn't pass through the do_exit() but terminated.
To explain this phenomenon, I want to put on the table another question that I think internet searchers may find somehow useful:
Can two processes share the same PID?
If you'd asked me this a month ago, I'd surely answered this question with: "definitively no, two processes can not share the same PID."
Linux is more complex, though.
There's a situation in which, in a Linux system, two different processes can share the same PID!
https://elixir.bootlin.com/linux/v4.19.20/source/fs/exec.c#L1141
Surprisingly, this behavior does not harm anyone; when this happens, one of these two processes is a zombie.
updated to correct an error
The circumstances of this duplicate PID are more intricate than those described previously. The process must flush the previous exec context if a threaded process forks before invoking an execve (the fork copies also the threads). If the intention is to use the execve() function to execute a new text, the kernel must first call the flush_old_exec()  function, which then calls the de_thread() function for each thread in the process other than the task leader. Except the task leader, all the process' threads are eliminated as a result. Each thread's PID is changed to that of the leader, and if it is not immediately terminated, for example because it needs to wait an operation completion, it keeps using that PID.
end of the update
That was what I was watching; the PID I was monitoring did not pass through the do_exit() because when the corresponding thread terminated, it had no more the PID it had when it started, but it had its leader's.
For people who know the Linux Kernel's mechanics very well, this is nothing to be surprised for; this behavior is intended and hasn't changed since 2.6.17.
Current 5.10.3, is still this way.
Hoping this to be useful to internet searchers; I'd also like to add that this also answers the followings:
Question: Can a Linux process/thread terminate without pass through do_exit()? Answer: NO, do_exit() is the only way a process has to end its execution — both intentional than unintentional.
Question: Can two processes share the same PID? Answer: Normally don't. There's some rare case in which two schedulable entities have the same PID.
Question: Do Linux kernel have scenarios where a process change its PID? Answer: yes, there's at least one scenario where a Process changes its PID.

Can a Linux process terminate without its flow pass through the do_exit() function?
Probably not, but you should study the source code of the Linux kernel to be sure. Ask on KernelNewbies. Kernel threads and udev or systemd related things (or perhaps modprobe or the older hotplug) are probable exceptions. When your /sbin/init of pid 1 terminates (that should not happen) strange things would happen.
The LKM works, more or less.
What does that means? How could a kernel module half-work?
And in real life, it does happen sometimes that your Linux kernel is panicking or crashes (and it could happen with your LKM, if it has not been peer-reviewed by the Linux kernel community). In such a case, there is no more any notion of processes, since they are an abstraction provided by a living Linux kernel.
See also dmesg(1), strace(1), proc(5), syscalls(2), ptrace(2), clone(2), fork(2), execve(2), waitpid(2), elf(5), credentials(7), pthreads(7)
Look also inside the source code of your libc, e.g. GNU libc or musl-libc
Of course, see Linux From Scratch and Advanced Linux Programming
And verifying if the PID is currently running,
This can be done is user land with /proc/, or using kill(2) with a 0 signal (and maybe also pidfd_send_signal(2)...)
PS. I still don't understand why you need to write a kernel module or change the kernel code. My intuition would be to avoid doing that when possible.

Related

Does kill(SIGSTOP) take effect by the time kill() returns?

Suppose I have a parent process and a child process (started with e.g. fork() or clone()) running on Linux. Further suppose that there is some shared memory that both the parent and the child can modify.
Within the context of the parent process, I would like to stop the child process and know that it has actually stopped, and moreover that any shared memory writes made by the child are visible to the parent (including whatever synchronization or cache flushes that may require in a multi-processor system).
This answer, which speaks of using kill(SIGSTOP) to stop a child process, contains an interesting tidbit:
When the first kill() call succeeds, you can safely assume that the child has stopped.
Is this statement actually true, and if so, can anyone expound on it, or point me to some more detailed documentation (e.g. a Linux manpage)? Otherwise, is there another mechanism that I can use to ensure that the child process is completely stopped and is not going to be doing any more writes to the shared memory?
I'm imagining something along the lines of:
the parent sends a different signal (e.g. SIGUSR1), which the child can handle
the child handles the SIGUSR1 and does something like a pthread_cond_wait() in the signal handler to safely "stop" (though still running from the kernel perspective) -- this is not fully fleshed out in my mind yet, just an idea
I'd like to avoid reinventing the wheel if there's already an established solution to this problem. Note that the child process needs to be stopped preemptively; adding some kind of active polling to the child process is not an option in this case.
If it only existed on Linux, pthread_suspend() would be perfect ...
It definitely sounds like you should be using a custom signal with a handler, and not sigstop.
It's rare not to care about the state of the child at all, e.g. being fine with it having stored 32bits out of a single non-atomic 64bit write, or logically caught between two dependent writes.
Even if you are, POSIX allows the OS to not make shared writes immediately visible to other processes, so the child should have a chance to call msync for portability, to ensure that writes are completely synced.
The POSIX documentation on Signal Concepts strongly suggests, but does not explicitly say, that the targeted process will be STOPped by the time kill() returns:
A signal is said to be "generated" for (or sent to) a process or thread when the event that causes the signal first occurs... Examples of such events include ... invocations of the kill() and sigqueue() functions.
The documentation is at pains to distinguish signal generation from delivery (when the signal action takes effect) or acceptance. Unfortunately, it sometimes mentions actions taken in response to a stop signal upon generation, and sometimes upon delivery. Given that something must happen upon generation per se, I'd agree that the target process must be STOPped by the time your call returns.
However, at the cost of another syscall, you can be sure. Since you have a parent/child relationship in your design, you can waitpid()/WUNTRACED to receive notification that your child process has, indeed, STOPped.
Edit
See the other answer from that other guy [sic] for reasons why you might not want to do this.

Can I coredump a process that is blocking on disk activity (preferably without killing it)?

I want to dump the core of a running process that is, according to /proc/<pid>/status, currently blocking on disk activity. Actually, it is busy doing work on the GPU (should be 4 hours of work, but it has taken significantly longer now). I would like to know how much of the process's work has been done, so it'd be good to be able to dump the process's memory. However, as far as I know, "blocking on disk activity" means that it's not possible to interrupt the process in any way, and coredumping a process e.g. using gdb requires interrupting and temporarily stopping the process in order to attach via ptrace, right?
I know that I could just read /proc/<pid>/{maps,mem} as root to get the (maybe inconsistent) memory state, but I don't know any way to get hold of the process's userspace CPU register values... they stay the same while the process is inside the kernel, right?
You can probably run gcore on your program. It's basically a wrapper around GDB that attaches, uses the gcore command, and detaches again.
This might interrupt your IO (as if it received a signal, which it will), but your program can likely restart it if written correctly (and this may occur in any case, due to default handling).

How to list threads were killed by the kernel?

Is there any way to list all the killed processes in a linux device?
I saw this answer suggesting:
check in:
/var/log/kern.log
but it is not generic. there is any other way to do it?
What I want to do:
list thread/process if it got killed. What function in the kernel should I edit to list all the killed tid/pid and their names, or alternitavily is there a sysfs does it anyway?
The opposite of do_fork is do_exit, here:
do_exit kernel source
I'm not able to find when threads are exiting, other than:
release_task
I believe "task" and "thread" are (almost) synonymous in Linux.
First, task and thread contexts are different in the kernel.
task (using tasklet api) runs in software interrupt context (meaning you cannot sleep while you are in the task ctx) while thread (using kthread api, or workqueue api) runs the handler in process ctx (i.e. sleep-able ctx).
In both cases, if a thread hangs in the kerenl, you cannot kill it.
if you run "ps" command from the shell, you can see it there (normally with "[" and "]" braces) but any attempt to kill it won't work.
the kernel is trusted code, such a situation shouldn't happen, and if it does, it indicates a kernel (or kernel module) bug.
normally the whole machine will hand after a while because the core running that thread is not responding (you will see a message in /var/log/messages or the console with more info) in some other cases the machine may survive but that specific core is dead. depends on the kernel configuration.

Program stalls during long runs

Fixed:
Well this seems a bit silly. Turns out top was not displaying correctly and programs actually continue to run. Perhaps the CPU time became too large to display? Either way, the program seems to be working fine and this whole question was moot.
Thanks (and sorry for the silly question).
Original Q:
I am running a simulation on a computer running Ubuntu server 10.04.3. Short runs (<24 hours) run fine, but long runs eventually stall. By stall, I mean that the program no longer gets any CPU time, but it still holds all information in memory. In order to run these simulations, I SSH and nohup the program and pipe any output to a file.
Miscellaneous information:
The system is definitely not running out of RAM. The program does not need to read or write to the hard drive until completion; the computation is done completely in memory. The program is not killed, as it still has a PID after it stalls. I am using openmp, but have increased the max number of processes and the max time is unlimited. I am finding the largest eigenvalues of a matrix using the ARPACK fortran library.
Any thoughts on what is causing this behavior or how to resume my currently stalled program?
Thanks
I assume this is an OpenMP program from your tags, though you never actually state this. Is ARPACK threadsafe?
It sounds like you are hitting a deadlock (more common in MPI programs than OpenMP, but it's definitely possible). The first thing to do is to compile with debugging flags on, then the next time you find this problem, attach with a debugger and find out what the various threads are doing. For gdb, for instance, some instructions for switching between threads are shown here.
Next time your program "stalls", attach GDB to it and do thread apply all where.
If all your threads are blocked waiting for some mutex, you have a
deadlock.
If they are waiting for something else (e.g. read), then you need to figure out what prevents the operation from completing.
Generally on UNIX you don't need to rebuild with debug flags on to get a meaningful stack trace. You wouldn't get file/line numbers, but they may not be necessary to diagnose the problem.
A possible way of understanding what a running program (that is, a process) is doing is to attach a debugger to it with gdb program *pid* (which works well only when the program has been compiled with debugging enabled with -g), or to use strace on it, using strace -p *pid*. the strace command is an utility (technically, a specialized debugger built above the ptrace system call interface) which shows you all the system calls done by a program or a process.
There is also a variant, called ltrace that intercepts the call to functions in dynamic libraries.
To get a feeling of it, try for instance strace ls
Of course, strace won't help you much if the running program is not doing any system calls.
Regards.
Basile Starynkevitch

Faster forking of large processes on Linux?

What's the fastest, best way on modern Linux of achieving the same effect as a fork-execve combo from a large process ?
My problem is that the process forking is ~500MByte big, and a simple benchmarking test achieves only about 50 forks/s from the process (c.f ~1600 forks/s from a minimally sized process) which is too slow for the intended application.
Some googling turns up vfork as having being invented as the solution to this problem... but also warnings about not to use it. Modern Linux seems to have acquired related clone and posix_spawn calls; are these likely to help ? What's the modern replacement for vfork ?
I'm using 64bit Debian Lenny on an i7 (the project could move to Squeeze if posix_spawn would help).
On Linux, you can use posix_spawn(2) with the POSIX_SPAWN_USEVFORK flag to avoid the overhead of copying page tables when forking from a large process.
See Minimizing Memory Usage for Creating Application Subprocesses for a good summary of posix_spawn(2), its advantages and some examples.
To take advantage of vfork(2), make sure you #define _GNU_SOURCE before #include <spawn.h> and then simply posix_spawnattr_setflags(&attr, POSIX_SPAWN_USEVFORK)
I can confirm that this works on Debian Lenny, and provides a massive speed-up when forking from a large process.
benchmarking the various spawns over 1000 runs at 100M RSS
user system total real
fspawn (fork/exec): 0.100000 15.460000 40.570000 ( 41.366389)
pspawn (posix_spawn): 0.010000 0.010000 0.540000 ( 0.970577)
Outcome: I was going to go down the early-spawned helper subprocess route as suggested by other answers here, but then I came across this re using huge page support to improve fork performance.
Having tried it myself using libhugetlbfs to simply make all my app's mallocs allocate huge pages, I'm now getting around 2400 forks/s regardless of the process size (over the range I'm interested in anyway). Amazing.
Did you actually measure how much time forks take? Quoting the page you linked,
Linux never had this problem; because Linux used copy-on-write semantics internally, Linux only copies pages when they changed (actually, there are still some tables that have to be copied; in most circumstances their overhead is not significant)
So the number of forks doesn't really show how big the overhead will be. You should measure the time consumed by forks, and (which is a generic advice) consumed only by the forks you actually perform, not by benchmarking maximum performance.
But if you really figure out that forking a large process is a slow, you may spawn a small ancillary process, pipe master process to its input, and receive commands to exec from it. The small process will fork and exec these commands.
posix_spawn()
This function, as far as I understand, is implemented via fork/exec on desktop systems. However, in embedded systems (particularly, in those without MMU on board), processes are spawned via a syscall, interface to which is posix_spawn or a similar function. Quoting the informative section of POSIX standard describing posix_spawn:
Swapping is generally too slow for a realtime environment.
Dynamic address translation is not available everywhere that POSIX might be useful.
Processes are too useful to simply option out of POSIX whenever it must run without address translation or other MMU services.
Thus, POSIX needs process creation and file execution primitives that can be efficiently implemented without address translation or other MMU services.
I don't think that you will benefit from this function on desktop if your goal is to minimize time consumption.
If you know the number of subprocess ahead of time, it might be reasonable to pre-fork your application on startup then distribute the execv information via a pipe. Alternatively, if there is some sort of "lull" in your program it might be reasonable to fork ahead of time a subprocess or two for quick turnaround at a later time. Neither of these options would directly solve the problem but if either approach is suitable to your app, it might allow you to side-step the issue.
I've come across this blog post: http://blog.famzah.net/2009/11/20/a-much-faster-popen-and-system-implementation-for-linux/
pid = clone(fn, stack_aligned, CLONE_VM | SIGCHLD, arg);
Excerpt:
The system call clone() comes to the rescue. Using clone() we create a
child process which has the following features:
The child runs in the same memory space as the parent. This means that no memory structures are copied when the child process is
created. As a result of this, any change to any non-stack variable
made by the child is visible by the parent process. This is similar to
threads, and therefore completely different from fork(), and also very
dangerous – we don’t want the child to mess up the parent.
The child starts from an entry function which is being called right after the child was created. This is like threads, and unlike fork().
The child has a separate stack space which is similar to threads and fork(), but entirely different to vfork().
The most important: This thread-like child process can call exec().
In a nutshell, by calling clone in the following way, we create a
child process which is very similar to a thread but still can call
exec():
However I think it may still be subject to the setuid problem:
http://ewontfix.com/7/ "setuid and vfork"
Now we get to the worst of it. Threads and vfork allow you to get in a
situation where two processes are both sharing memory space and
running at the same time. Now, what happens if another thread in the
parent calls setuid (or any other privilege-affecting function)? You
end up with two processes with different privilege levels running in a
shared address space. And this is A Bad Thing.
Consider for example a multi-threaded server daemon, running initially
as root, that’s using posix_spawn, implemented naively with vfork, to
run an external command. It doesn’t care if this command runs as root
or with low privileges, since it’s a fixed command line with fixed
environment and can’t do anything harmful. (As a stupid example, let’s
say it’s running date as an external command because the programmer
couldn’t figure out how to use strftime.)
Since it doesn’t care, it calls setuid in another thread without any
synchronization against running the external program, with the intent
to drop down to a normal user and execute user-provided code (perhaps
a script or dlopen-obtained module) as that user. Unfortunately, it
just gave that user permission to mmap new code over top of the
running posix_spawn code, or to change the strings posix_spawn is
passing to exec in the child. Whoops.

Resources