Why does kthreadd have PID 2 and systemd have PID 1? - linux

On a systemd-based system, when I run ps -auxf, I see that kthreadd has PID 2 while systemd has PID 1.
So who assigns PID 2 to kthreadd, and why does it get PID 2 if kthreadd is what starts systemd?

I don't think that kthreadd is starting init (in your case symlinked to systemd).
init is started by the kernel initialization. kthreadd is started just after. See this kernel threads wikipage, and, for Linux 4.2, its file init/main.c, function rest_init, near lines 397, where you see:
/*
* We need to spawn init first so that it obtains pid 1, however
* the init task will end up wanting to create kthreads, which, if
* we schedule it before we create kthreadd, will OOPS.
*/
kernel_thread(kernel_init, NULL, CLONE_FS);
numa_default_policy();
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
So kthreadd does not start init: both are created by the kernel in rest_init, before either of them is scheduled and starts executing.
The static kernel_init function (lines 930 - 975 of init/main.c) has notably:
if (execute_command) {
    ret = run_init_process(execute_command);
    if (!ret)
        return 0;
    panic("Requested init %s failed (error %d).",
          execute_command, ret);
}
if (!try_to_run_init_process("/sbin/init") ||
    !try_to_run_init_process("/etc/init") ||
    !try_to_run_init_process("/bin/init") ||
    !try_to_run_init_process("/bin/sh"))
    return 0;
So kernel_init sets up the init process (à la execve(2)) and has /sbin/init and the other fallback paths hardwired.
And the init process has had PID 1 for ages (since the primordial Unix of the 1970s, and also in old Linux 1.x kernels without kernel threads). It is a strong Unix convention (on which a lot of software probably depends). You can use systemd as your init process, but you could also use sysvinit or simply bash (it is sometimes useful to pass init=/bin/bash to the kernel through GRUB for repair purposes) or something else (e.g. runit).
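The relationship can also be checked from user space. The following is a minimal sketch, assuming a Linux system with procfs mounted at /proc; the helper names parse_status and describe are made up for this illustration. Inside a container or a PID namespace, PID 2 may be an ordinary process or may not exist at all.

```python
from pathlib import Path


def parse_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    return fields


def describe(pid, proc_root="/proc"):
    """Return (command name, parent PID) for the given PID."""
    status = Path(proc_root, str(pid), "status").read_text()
    fields = parse_status(status)
    return fields["Name"], int(fields["PPid"])
```

On a typical non-containerized system, describe(1) and describe(2) should both report a parent PID of 0, because both tasks are created by the kernel itself rather than by each other.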

Related

Can not get correct pid in WSL2

I am learning Linux programming.
When I tried to write a simple module that prints the family of a process, I found that I cannot get the correct PID of the process and of its parent. How can I fix it?
Here is part of my code.
static pid_t pid = 1;
module_param(pid, int, 0644);

static int hello_init(void)
{
    struct task_struct *p;
    struct list_head *pp;
    struct task_struct *psibling;
    struct pid *kpid;

    /* The number is looked up in the PID namespace of the caller (insmod). */
    kpid = find_get_pid(pid);
    p = pid_task(kpid, PIDTYPE_PID);
    if (p == NULL) {
        put_pid(kpid);
        return -ESRCH; /* no such process */
    }
    /* p->pid is the global PID, as seen from the root PID namespace. */
    printk("me: %d %s\n", p->pid, p->comm);
    if (p->parent == NULL) {
        printk("No Parent\n");
    }
    else {
        printk("Parent: %d %s\n", p->parent->pid, p->parent->comm);
    }
    list_for_each(pp, &p->parent->children) {
        psibling = list_entry(pp, struct task_struct, sibling);
        printk("sibling %d %s \n", psibling->pid, psibling->comm);
    }
    list_for_each(pp, &p->children) {
        psibling = list_entry(pp, struct task_struct, sibling);
        printk("children %d %s \n", psibling->pid, psibling->comm);
    }
    put_pid(kpid); /* drop the reference taken by find_get_pid() */
    return 0;
}
result:
sudo insmod module.ko pid=1
dmesg
[ 6396.170631] me: 237 systemd
[ 6396.170633] Parent: 235 unshare
[ 6396.170633] sibling 237 systemd
[ 6396.170633] children 286 systemd-journal
[ 6396.170634] children 306 systemd-udevd
[ 6396.170635] children 314 systemd-network
[ 6396.170635] children 501 snapfuse
[ 6396.170636] children 508 dbus-daemon
[ 6396.170636] children 509 NetworkManager
[ 6396.170637] children 632 systemd-logind
[ 6396.170637] children 639 systemd
[ 6396.170638] children 665 rtkit-daemon
[ 6396.170638] children 671 polkitd
[ 6396.170638] children 711 udisksd
[ 6396.170639] children 761 upowerd
I'm not a Linux systems development expert, but I'll take a stab at helping based on what I see you trying.
First, you don't mention it in your question, but you are clearly running some sort of Systemd enablement. As you know, Systemd isn't normally supported on WSL. At a high level, the scripts to enable Systemd on WSL all have two essential functions:
Create a new PID namespace where Systemd is running as PID1. At the most basic level, this can be done via:
sudo -b unshare --pid --fork --mount-proc /lib/systemd/systemd --system-unit=basic.target
We can see the unshare in the list of processes returned, so that's getting called, at least.
Wait for Systemd to fully start, then enter the namespace that was created above. This is typically something like:
sudo -E nsenter --all -t $(pgrep -xo systemd) $SHELL
The actual scripts are typically a bit more complicated in order to handle multiple shells, distributions, etc. They also attempt to preserve more of the WSL environment inside the namespace in order to enable the Interop features such as running Windows .exes. But the core concept is always the same.
So, taking a guess here (again, as a non-systems-dev guy), it seems that:
kpid=find_get_pid(1) is returning the systemd process inside the namespace
pid_task(kpid, PIDTYPE_PID) is returning the "true" process information from the root namespace.
It seems to me that code must be running outside the namespace, since you see the unshare as part of it. From within the namespace, the unshare doesn't exist. You can verify this (inside the namespace) with ps -ef | grep unshare.
There are at least two possible solutions:
If it's not an issue (and from the comments, it wasn't), then just run your code from the root pid namespace. I'm assuming that your Systemd script is running via your shell startup files, so you should be able to get back to the root namespace by starting up with something like wsl ~ -e bash --noprofile --norc. This will start the shell without any of the startup scripts.
Of course, other techniques for disabling the Systemd script are probably documented by whatever script you are using.
If you do want your code to work properly from within a PID namespace, then you'll probably need to find the namespace (I'd start with the source of lsns as an example).
Then find the task struct within that namespace (probably find_task_by_pid_ns?).
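The two views of the same process can also be inspected from user space, without a kernel module. This is a minimal sketch, assuming a kernel that exposes the NSpid field in /proc/<pid>/status (Linux 4.1 or later); parse_nspid and pids_across_namespaces are hypothetical helper names.

```python
from pathlib import Path


def parse_nspid(status_text):
    """Return the PIDs on the NSpid line, outermost PID namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(token) for token in line.split()[1:]]
    return []


def pids_across_namespaces(pid="self"):
    """Read NSpid for a process as seen from the caller's /proc mount."""
    return parse_nspid(Path("/proc", str(pid), "status").read_text())
```

Read from the root namespace, the systemd process in the dmesg output above would be expected to show a line such as "NSpid: 237 1": PID 237 in the root namespace and PID 1 inside the namespace created by unshare.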

Does adding '&' make it run as a daemon?

I am aware that adding a '&' at the end makes a command run in the background, but does it also mean that it runs as a daemon?
Like:
celery -A project worker -l info &
celery -A project worker -l info --detach
I am sure that the first one runs in the background; the second, as stated in the documentation, runs in the background as a daemon.
I would love to know the main difference between the commands above.
They are different!
The "&" version runs in the background, but not as a daemon; a daemon process detaches itself from the terminal.
In C, a daemon can do this in code:
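/* needs <fcntl.h>, <stdlib.h> and <unistd.h> */
if (fork() != 0) exit(0);   /* parent (or failed fork) exits; the child is not a group leader */
setsid();                   /* new session, no controlling terminal */
close(0);                   /* then reopen /dev/null as fd 0, 1 and 2 */
close(1);
close(2);
open("/dev/null", O_RDWR);  /* lowest free descriptor: becomes fd 0 */
dup(0);                     /* fd 1 */
dup(0);                     /* fd 2 */
if (fork() != 0) exit(0);   /* session leader exits; the daemon cannot reacquire a terminal */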
This ensures that the process is no longer in the same session or process group as the terminal and thus won't be killed together with it. The I/O redirection keeps output from appearing on the terminal. (See: https://unix.stackexchange.com/questions/56495/whats-the-difference-between-running-a-program-as-a-daemon-and-forking-it-into)
A daemon runs in its own session, is not attached to a terminal, has no file descriptor inherited from its parent still open, has no parent caring for it (other than init), and has its current directory set to / so as not to prevent an umount. A process started with "&" has none of these properties.
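The "own session" part can be observed directly. The sketch below is illustrative only (session_ids_after_setsid is a made-up name): it forks a child, lets the child call setsid(), and reports the session IDs on both sides. A job started with "&" skips the setsid() step and therefore stays in the shell's session.

```python
import os


def session_ids_after_setsid():
    """Fork a child that calls setsid(); return (parent_sid, child_pid, child_sid)."""
    read_fd, write_fd = os.pipe()
    child_pid = os.fork()
    if child_pid == 0:
        # Child: freshly forked, so not a process group leader and allowed to setsid().
        try:
            os.close(read_fd)
            os.setsid()
            os.write(write_fd, str(os.getsid(0)).encode())
        finally:
            os._exit(0)
    # Parent: read the session ID reported by the child, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        child_sid = int(pipe.read())
    os.waitpid(child_pid, 0)
    return os.getsid(0), child_pid, child_sid
```

After setsid() the child is the leader of a new session, so its session ID equals its own PID and differs from the parent's session ID.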
Yes, the process will be run as a daemon, or background process; they both do the same thing.
You can verify this by looking at the opt parser in the source code (if you really want to verify this):
.. cmdoption:: --detach

    Detach and run in the background as a daemon.
https://github.com/celery/celery/blob/d59518f5fb68957b2d179aa572af6f58cd02de40/celery/bin/beat.py#L12
https://github.com/celery/celery/blob/d59518f5fb68957b2d179aa572af6f58cd02de40/celery/platforms.py#L365
Ultimately, the code below is what detaches it in the DaemonContext. Notice the fork and exit calls:
def _detach(self):
    if os.fork() == 0:      # first child
        os.setsid()         # create new session
        if os.fork() > 0:   # pragma: no cover
            # second child
            os._exit(0)
    else:
        os._exit(0)
    return self
Not really. The process started with & runs in the background, but is attached to the shell that started it, and the process output goes to the terminal.
Meaning, if the shell dies or is killed (or the terminal is closed), that process will be sent a HUP signal (SIGHUP) and will die as well (if it doesn't catch it, or if its output goes to the terminal).
The nohup command runs a command so that it ignores SIGHUP and, if its output would go to the terminal, redirects that output to a file (nohup.out by default), so the command keeps running when the parent shell dies.
Example:
You can see that by opening two terminals. In one run
sleep 500 &
in the other one run ps -ef to see the list of processes, and near the bottom something like
me   1234  1201 ... sleep 500
     ^     ^
     |     parent process ID (the shell)
     process ID
Close the terminal in which sleep is running in the background, then run ps -ef again: the sleep process is gone.
A daemon is usually started by the system, through init or upstart (its owner may then be changed to a regular user).

What happens to other processes when a Docker container's PID1 exits?

Consider the following, which runs sleep 60 in the background and then exits:
$ cat run.sh
sleep 60&
ps
echo Goodbye!!!
$ docker run --rm -v $(pwd)/run.sh:/run.sh ubuntu:16.04 bash /run.sh
PID TTY TIME CMD
1 ? 00:00:00 bash
5 ? 00:00:00 sleep
6 ? 00:00:00 ps
Goodbye!!!
This will start a Docker container, with bash as PID1. It then fork/execs a sleep process, and then bash exits. When the Docker container dies, the sleep process somehow dies too.
My question is: what is the mechanism by which the sleep process is killed? I tried trapping SIGTERM in a child process, and that appears to not get tripped. My presumption is that something (either Docker or the Linux kernel) is sending SIGKILL when shutting down the cgroup the container is using, but I've found no documentation anywhere clarifying this.
EDIT The closest I've come to an explanation is the following quote from baseimage-docker:
If your init process is your app, then it'll probably only shut down itself, not all the other processes in the container. The kernel will then forcefully kill those other processes, not giving them a chance to gracefully shut down, potentially resulting in file corruption, stale temporary files, etc. You really want to shut down all your processes gracefully.
So at least according to this, the implication is that when the container exits, the kernel will send a SIGKILL to all remaining processes. But I'd still like clarity on how it decides to do that (i.e., is it a feature of cgroups?), and ideally a more authoritative source would be nice.
OK, I seem to have come up with some more solid evidence that this is, in fact, the Linux kernel doing the terminating. In the clone(2) man page, there's this useful section:
CLONE_NEWPID (since Linux 2.6.24)
The first process created in a new namespace (i.e., the process
created using the CLONE_NEWPID flag) has the PID 1, and is the
"init" process for the namespace. Children that are orphaned
within the namespace will be reparented to this process rather than
init(8). Unlike the traditional init process, the "init" process of a
PID namespace can terminate, and if it does, all of the processes in
the namespace are terminated.
Unfortunately this is still vague on how exactly the processes in the namespace are terminated, but perhaps that's because, unlike a normal process exit, no entry is left in the process table. Whatever the case is, it seems clear that:
The kernel itself is killing the other processes
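They are not killed in a way that allows them any chance to do cleanup, making it (almost?) identical to a SIGKILL. The pid_namespaces(7) man page confirms this: when the "init" process of a PID namespace terminates, the kernel terminates all other processes in the namespace with SIGKILL.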
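One way to avoid the abrupt kill, in the spirit of the baseimage-docker advice quoted above, is for PID 1 to wait until its children have exited before it exits itself. The following is a minimal sketch, assuming the container's entry point is a Python script; wait_for_all_children is a hypothetical helper, not part of Docker.

```python
import os


def wait_for_all_children():
    """Block until every child has exited; return {pid: exit code or -signal}."""
    results = {}
    while True:
        try:
            pid, status = os.wait()
        except ChildProcessError:
            # No children are left, so exiting now cuts nothing short.
            return results
        if os.WIFEXITED(status):
            results[pid] = os.WEXITSTATUS(status)
        elif os.WIFSIGNALED(status):
            results[pid] = -os.WTERMSIG(status)
```

Because orphaned processes in a PID namespace are reparented to its PID 1, calling this just before the entry point exits also collects processes that were started indirectly. In the run.sh example, the shell equivalent would be a wait before the final echo.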

reboot within an initrd image

I am looking for a way to restart/reset my Linux system from within an init-bottom script. At the time my script is executed, the system is mounted under /root and I have access to a busybox.
But the "reboot" command which is part of my busybox does not work. Is there any other possibility?
My system is booted normally with an initramfs image and my script is eventually causing an update process. The new systemd which comes with debian irritates this. But with a power reset everything is fine.
I have found this:
echo b >/proc/sysrq-trigger
(This reboots immediately, much like a hardware reset, without syncing or unmounting the disks.)
If you -are- init (the PID of your process/script is 1), then starting the busybox reboot program won't work, since it tries to signal init (which is not started) to reboot.
Instead, as PID 1, you should do what init would do: call the kernel API for rebooting directly. See reboot(2) for details.
Assuming you are running a C program or something similar, one would do:
#include <unistd.h>
#include <sys/reboot.h>
int main(void) {
    sync();                     /* reboot(2) does not flush buffers itself */
    return reboot(RB_AUTOBOOT); /* RB_AUTOBOOT is 0x1234567 */
}
This is much better than executing the sysrq trigger which will act more like a panic restart than a clean restart.
As a final note, busybox's init actually forks a process to do the reboot for it. This is because the reboot system call actually also exits the program, and the system should never run without an init process (that would also panic the kernel). Hence in this case, you would do something like:
pid_t pid;
pid = vfork();
if (pid == 0) { /* child */
    reboot(0x1234567);
    _exit(EXIT_SUCCESS);
}
while (1)
    pause(); /* Parent (init) waits without spinning */

Init script won't "stop" forked C program

I have a C program that has a "daemon" mode, so that I can have it run in the background. When it is run with "-d", it forks using the following code:
if (daemon_mode == 1)
{
    int i = fork();
    if (i < 0) exit(1); // error
    if (i > 0) exit(0); // parent
}
I created an init script, and when I manually run it to start my daemon, it starts OK. However, if I run it with "stop", the daemon isn't stopped.
I imagine the issue is that the PID has changed due to forking. What am I doing wrong, and how do I fix it?
If you are using a pid file to control the process, then you are likely correct that changing the pid is causing a problem. Just write the pid file after you have daemonized rather than before.
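The ordering can be sketched as follows. This is an illustration under assumptions, not the asker's program: write_pid_file and start_daemon are hypothetical names, and the parent returns the child's PID here so the result can be inspected, where the C program above would call exit(0).

```python
import os


def write_pid_file(path):
    """Record the PID of the calling process in the given file."""
    with open(path, "w") as f:
        f.write(f"{os.getpid()}\n")


def start_daemon(pid_path, work):
    """Fork, then let the child write the PID file before running work()."""
    pid = os.fork()
    if pid > 0:
        return pid  # the C program's parent would call exit(0) here
    try:
        # Written after the fork, so the file holds the PID that keeps running.
        write_pid_file(pid_path)
        work()
    finally:
        os._exit(0)
```

The "stop" action of the init script can then read the number back from the file and send that process a SIGTERM.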
