When you call a system call such as fork in process X, the kernel is said to be executing in process context. So, fork can be said to be running in process X, right?
But if schedule() is called (and it isn't a syscall) in the same process, would you say that it is running as part of X? Or does it run in the swapper process? Or does the question sound absurd, taking into account the monolithic nature of the kernel?
schedule() always runs in process context. The special thing about it is that it can change which process context is current, but it always has a process context: prior to the call to context_switch() it runs in the context of the process being switched out, and afterwards it runs in the context of the process switched in.
The Linux kernel does not have a dedicated "swapper" process in the old Unix sense; the name survives only as that of the idle task (PID 0), which runs whenever nothing else is eligible to run.
It really depends on where the schedule() call is made from; schedule() can be called either from process context or from a work queue. Work queues are kernel-scheduled threads:
# ps auxw | grep worker
root 1378 0.0 0.0 0 0 ? S 20:45 0:00 [kworker/1:0]
root 1382 0.0 0.0 0 0 ? S 20:45 0:00 [kworker/2:0]
root 1384 0.0 0.0 0 0 ? S 20:45 0:00 [kworker/3:1]
...
The [..] brackets signify that these processes do not execute in userspace.
The worker_thread() function calls schedule() after handling a work item but before starting all over again.
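For illustration, here is a minimal sketch of that pattern; my_work_pending() and my_handle_work() are hypothetical stand-ins for a real work list, and the real worker_thread() in kernel/workqueue.c is considerably more involved:

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/types.h>

/* Hypothetical helpers standing in for a real list of pending work items. */
static bool my_work_pending(void) { return false; }
static void my_handle_work(void) { }

static int my_worker(void *data)
{
    while (!kthread_should_stop()) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (my_work_pending()) {
            __set_current_state(TASK_RUNNING);
            my_handle_work();   /* handle one item, in process context */
        } else {
            schedule();         /* give up the CPU; we sleep in the
                                   context of this kernel thread */
        }
    }
    return 0;
}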
schedule() can also be called on behalf of a process, by a driver, by signal-handling code, by filesystem internals, or from myriad other places.
The scheduler takes care of all processes, so it does not run inside any one process.
Of course, when a process is scheduled out, e.g. because of a clock interrupt, some process was running (and later another one is scheduled).
You cannot view all of the kernel as running on behalf of processes (only system calls are).
Q: So, fork can be said to be running in process X, right?
A: Yes, absolutely. The system call by which a process REQUESTS to "fork" occurs in user space. The act of making the system call TRANSITIONS from user space to kernel space. The IMPLEMENTATION of the system call may involve many separate steps. Some may occur in user space; other steps occur in kernel space.
Q: ...taking into account the monolithic nature of the kernel ?
A: The issue of "user space" vs "kernel space" has absolutely NOTHING to do with whether the kernel happens to be "monolithic", a "microkernel" or something else entirely.
Related
How kernel threads get executed on the CPU
Do these kernel threads get scheduled by the scheduler, like normal user-space processes?
Or do they get woken up when some events happen?
root 2 0 0 Nov30 ? 00:00:00 [kthreadd]
root 3 2 0 Nov30 ? 00:00:03 [ksoftirqd/0]
The answer to both questions is yes: kernel threads get scheduled just like user threads, and they normally block pending certain events (different events per kernel thread).
The answer is yes.
The only major difference between kernel threads and user-space processes is that task->mm == NULL for kernel threads.
Hence they don't have a distinct user address space. The rest is pretty much the same for kernel threads and user-space processes.
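As a small illustration (a sketch, not taken from the question), a kernel code path can tell the two apart simply by looking at the mm pointer; the PF_KTHREAD flag encodes the same information, and the function name here is just illustrative:

#include <linux/sched.h>
#include <linux/types.h>

/* Kernel threads have no user address space, so their mm pointer is NULL. */
static bool is_kernel_thread(const struct task_struct *tsk)
{
    return tsk->mm == NULL;   /* equivalently: tsk->flags & PF_KTHREAD */
}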
Is there any way to list all the killed processes in a linux device?
I saw this answer suggesting:
check in:
/var/log/kern.log
but it is not generic. Is there any other way to do it?
What I want to do:
list a thread/process if it got killed. What function in the kernel should I edit to list all the killed tids/pids and their names, or alternatively, is there a sysfs entry that already does this?
The opposite of do_fork is do_exit, here:
do_exit kernel source
I'm not able to find where threads exit, other than:
release_task
I believe "task" and "thread" are (almost) synonymous in Linux.
First, task and thread contexts are different in the kernel.
A task (using the tasklet API) runs in software-interrupt context (meaning you cannot sleep while you are in the tasklet ctx), while a thread (using the kthread API or the workqueue API) runs its handler in process ctx (i.e. a sleep-able ctx).
In both cases, if a thread hangs in the kernel, you cannot kill it.
If you run the "ps" command from the shell, you can see it there (normally with "[" and "]" brackets), but any attempt to kill it won't work.
Since the kernel is trusted code, such a situation shouldn't happen, and if it does, it indicates a kernel (or kernel-module) bug.
Normally the whole machine will hang after a while because the core running that thread is not responding (you will see a message in /var/log/messages or on the console with more info); in some other cases the machine may survive, but that specific core is dead. It depends on the kernel configuration.
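To make the two contexts concrete, here is a minimal module-style sketch; the names are illustrative and the tasklet callback signature is the one used since kernel 5.9, so take it as an assumption about your kernel version:

#include <linux/interrupt.h>
#include <linux/kthread.h>
#include <linux/delay.h>

/* Softirq context: must not sleep (no msleep(), mutex_lock(), blocking I/O). */
static void my_tasklet_fn(struct tasklet_struct *t)
{
    /* do something short and non-blocking */
}
static DECLARE_TASKLET(my_tasklet, my_tasklet_fn);

/* Process context: sleeping is allowed. */
static int my_thread_fn(void *data)
{
    while (!kthread_should_stop())
        msleep(1000);
    return 0;
}

You would trigger them with tasklet_schedule(&my_tasklet) and kthread_run(my_thread_fn, NULL, "my_thread") respectively.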
I have some doubts regarding the concurrency of POSIX threads on a multiprocessor machine. I have found similar questions on SO about it but didn't find a conclusive answer.
Below is my understanding. I want to know if I am correct.
1) POSIX threads are user-level threads and the kernel is not aware of them.
2) The kernel scheduler will treat the process (with all its threads) as one entity for scheduling. It is the thread library that in turn chooses which thread to run. It can slice the CPU time given by the kernel among the runnable threads.
3) User threads can run on different CPU cores, i.e. if threads T1 & T2 are created by a process (T), then T1 can run on CPU1 and T2 can run on CPU2, BUT they can't run concurrently.
Please let me know if my understanding is correct.
Thanks...
Since you marked your question with the "Linux" tag, I'm going to answer it according to the standard pthreads implementation under Linux. If you are talking about "green" threads, which are scheduled at the VM/language level instead of by the OS, then your answers are mostly correct. But my comments below are about Linux pthreads.
1) Posix threads are user level threads and kernel is not aware of it.
No, this is certainly not correct. The Linux kernel and the pthreads libraries work together to administer the threads. The kernel does the context switching, scheduling, memory management, cache management, etc. There is other administration done at the user level of course, but without the kernel, much of the power of pthreads would be lost.
2) Kernel scheduler will treat Process( with all its threads) as one entity for scheduling. It is the thread library that in turn chooses which thread to run. It can slice the cpu time given by the kernel among the run-able threads.
No, the kernel treats each process thread as one entity. It has its own rules about time slicing that take processes (and process priorities) into consideration, but each sub-process thread is a schedulable entity.
3) User threads can run on different cpu cores. ie Let threads T1 & T2 be created by a Process(T), then T1 can run in Cpu1 and T2 can run in Cpu2 BUT they cant run concurrently.
No. Concurrent execution is expected for multi-threaded programs. That's why synchronization and mutexes are so important and why programmers put up with the complexity of multithreaded programming.
One way to prove this to yourself is to look at the output of ps with the -L option to show the associated threads. ps usually wraps a multi-threaded process into one line, but with -L you can see that the kernel has a separate virtual process-id (the LWP column) for each thread:
ps -ef | grep 20587
foo 20587 1 1 Apr09 ? 00:16:39 java -server -Xmx1536m ...
versus
ps -eLf | grep 20587
foo 20587 1 20587 0 641 Apr09 ? 00:00:00 java -server -Xmx1536m ...
foo 20587 1 20588 0 641 Apr09 ? 00:00:30 java -server -Xmx1536m ...
foo 20587 1 20589 0 641 Apr09 ? 00:00:03 java -server -Xmx1536m ...
...
I'm not sure if Linux threads still do this but historically pthreads used the clone(2) system call to create another thread copy of itself:
Unlike fork(2), these calls allow the child process to share parts of its execution context with the calling process, such as the memory space, the table of file descriptors, and the table of signal handlers.
This is different from fork(2) which is used when another full process is created.
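As a hedged illustration of those flags (not the exact call NPTL makes: a real pthread also passes CLONE_THREAD and a few more flags, which would prevent the simple waitpid() used below), here is a minimal clone(2) example:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static int thread_fn(void *arg)
{
    /* Runs in the same address space as the parent because of CLONE_VM. */
    printf("child says: %s\n", (const char *)arg);
    return 0;
}

int main(void)
{
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (!stack)
        return 1;

    /* Share memory, filesystem info, file descriptors and signal handlers,
     * i.e. the "parts of the execution context" from the man page quote. */
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD;
    pid_t pid = clone(thread_fn, stack + stack_size, flags, "hello");
    if (pid == -1) {
        perror("clone");
        return 1;
    }
    waitpid(pid, NULL, 0);   /* reap the child (SIGCHLD was requested) */
    free(stack);
    return 0;
}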
POSIX does not specify how the threads created with pthread_create are scheduled on to processor cores. This is up to the implementation.
However, I would expect the following in a quality implementation, and this is the case for current versions of linux:
Threads are full kernel threads, and scheduled by the kernel
Threads from the same process can run concurrently on separate processors
i.e. all 3 of your numbered statements are wrong with current implementations of Linux, but they could in theory be true for another implementation that also conformed to POSIX.
Most POSIX thread implementations use OS support to provide thread functionality - they wrap the system calls needed to manage threads. As such, the behaviour of threads with respect to scheduling etc. depends upon the underlying OS.
So, on most modern OS:
A process with POSIX threads is no different to the kernel than any other process with multiple threads.
The kernel scheduler dispatches threads, not processes. A 'process' is often regarded as a higher-level construct that has code, memory-management, quotas, auditing, and security permissions, but not execution. A process can do nothing unless a thread runs its code, which is why, when a process is created, a 'main thread' is created at the same time, else nothing would run. The OS scheduling algorithm may use the process that the thread runs as one of the parameters to decide on which set of ready threads to run next - it's cheaper to swap one thread out with one from the same process - but it does not have to.
Slicing the cpu time given by the kernel among the run-able threads is a side-effect of the OS timer interrupt when there are more ready threads than there are cores to run them. Any machine that regularly has to resort to (ugh! - I hate the term), 'time slicing' should be regarded as overloaded and should have more CPU or less work. Threads should ideally only become ready when signaled by another thread or an IO driver, not because the OS timer interrupt has decided to run it in place of another thread that could still be doing useful work.
They can indeed run concurrently. If two threads are ready and there are two cores, the threads are dispatched onto the two cores. It does not matter whether they are from the same process or not.
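A quick way to see this for yourself (a sketch; sched_getcpu() is a glibc/Linux-specific call) is to run two threads from the same process and print which core each one is currently on:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void *report_cpu(void *name)
{
    for (int i = 0; i < 5; i++) {
        /* With two free cores you will typically see different CPU numbers. */
        printf("%s running on CPU %d\n", (const char *)name, sched_getcpu());
        sleep(1);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, report_cpu, "T1");
    pthread_create(&t2, NULL, report_cpu, "T2");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Compile with gcc -pthread; while it runs, ps -eLf shows one LWP per thread, just like the java output above.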
Is the Linux kernel scheduler part of the init process? My understanding is that it is part of kernel threads managed internally, not visible to the user by either top or ps. Please correct my understanding.
Is it possible to view standard kernel threads through any kernel debugger to see how standard threads occupy CPU activity?
-Kartlee
Kernel threads can be seen through "top" and "ps" and can be distinguished by having zero VM size (they have no userspace, so no userspace memory map).
These are created by kernel_thread (or its friends). Some facilities create one thread per CPU and tie it to a CPU, so you see stuff like aio/0 aio/1 on the PS list.
Also some work is done through the several deferred execution mechanisms and gets attributed to other tasks, typically something called "events/0" (one per CPU). Time spent "really" in interrupts isn't counted anywhere (it just runs at the expense of whatever task happened to be on that CPU at the time).
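For completeness, here is a minimal module-style sketch of that deferred-execution mechanism (my_work and my_deferred_fn are illustrative names): work queued with schedule_work() is later executed by one of the shared worker threads ("events/N" on old kernels, kworker/N on current ones), so its CPU time is attributed to that thread rather than to the caller.

#include <linux/workqueue.h>
#include <linux/sched.h>
#include <linux/printk.h>
#include <linux/module.h>

static void my_deferred_fn(struct work_struct *work)
{
    /* Executed later by a shared worker thread. */
    pr_info("deferred work running in %s\n", current->comm);
}
static DECLARE_WORK(my_work, my_deferred_fn);

static int __init my_init(void)
{
    schedule_work(&my_work);   /* just queues it and returns immediately */
    return 0;
}

static void __exit my_exit(void)
{
    flush_work(&my_work);      /* ensure it has finished before unloading */
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");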
1) Is the Linux kernel scheduler part of the init process?
-> No. The scheduler is a kernel subsystem; the init process is just a process (albeit a special one) and is itself scheduled by the scheduler.
2) My understanding is that it is part of kernel threads managed internally, not visible to the user by either top or ps. Please correct my understanding.
-> The scheduler is not a separate kernel thread; it is kernel code that runs in the context of whichever task calls schedule(), so it is not shown to the user as a process.
3) Is it possible to view standard kernel threads through any kernel debugger to see how standard threads occupy cpu activity?
-> yes!
Use ps aux; a kernel thread's name is shown surrounded by square brackets, e.g. [kthreadd].
Kernel threads are created with the kthread_create() function, and the creation is ultimately handled by kthreadd, i.e. the PID 2 thread in the kernel.
All kernel threads are forked/copied/cloned by kthreadd (PID 2), not by init (PID 1).
the source code is here: https://elixir.bootlin.com/linux/latest/source/kernel/kthread.c
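For reference, here is a minimal sketch of creating such a thread from a module with the kthread API ("my_kthread" is an arbitrary name); kthreadd performs the actual creation on our behalf, and the thread then shows up in ps aux as [my_kthread]:

#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>
#include <linux/module.h>

static struct task_struct *worker;

static int my_thread_fn(void *data)
{
    while (!kthread_should_stop())
        msleep(1000);   /* idle loop; a real thread would do work here */
    return 0;
}

static int __init my_init(void)
{
    worker = kthread_run(my_thread_fn, NULL, "my_kthread");
    if (IS_ERR(worker))
        return PTR_ERR(worker);
    return 0;
}

static void __exit my_exit(void)
{
    kthread_stop(worker);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");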
In Linux, what happens to the state of a process when it needs to read blocks from a disk? Is it blocked? If so, how is another process chosen to execute?
When a process needs to fetch data from a disk, it effectively stops running on the CPU to let other processes run because the operation might take a long time to complete – at least 5ms seek time for a disk is common, and 5ms is 10 million CPU cycles, an eternity from the point of view of the program!
From the programmer's point of view (also said "in userspace"), this is called a blocking system call. If you call read(2) (which is a thin libc wrapper around the system call of the same name), your process does not exactly stop at that boundary; it continues, in the kernel, running the system call code. Most of the time it goes all the way down to a specific disk controller driver (filename → filesystem/VFS → block device → device driver), where a command to fetch a block from disk is submitted to the proper hardware, which is a very fast operation most of the time.
THEN the process is put in sleep state (in kernel space, blocking is called sleeping – nothing is ever 'blocked' from the kernel point of view). It will be awakened once the hardware has finally fetched the proper data, then the process will be marked as runnable and will be scheduled. Eventually, the scheduler will run the process.
Finally, in userspace, the blocking system call returns with proper status and data, and the program flow goes on.
It is possible to invoke most I/O system calls in non-blocking mode (see O_NONBLOCK in open(2) and fcntl(2)). In this case, the system call returns immediately, only reporting that the operation has been submitted. The programmer will have to explicitly check at a later time whether the operation completed, successfully or not, and fetch its result (e.g., with select(2)). This is called asynchronous or event-based programming.
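Here is a minimal userspace sketch of that pattern, assuming a FIFO at the placeholder path /tmp/myfifo (create it with mkfifo and write to it from another shell); the descriptor is opened with O_NONBLOCK and select(2) waits for readiness instead of blocking inside read(2):

#include <fcntl.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main(void)
{
    /* O_RDWR keeps a writer reference open so we don't see an immediate
     * EOF when no external writer exists yet (Linux-specific but common). */
    int fd = open("/tmp/myfifo", O_RDWR | O_NONBLOCK);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    /* Block in select() until data arrives, instead of inside read(). */
    if (select(fd + 1, &rfds, NULL, NULL, NULL) > 0) {
        char buf[256];
        ssize_t n = read(fd, buf, sizeof buf);   /* returns without sleeping */
        printf("read %zd bytes\n", n);
    }
    close(fd);
    return 0;
}

Note that for regular files on disk O_NONBLOCK has little effect; the readiness-based pattern above applies to pipes, sockets and ttys, while true asynchronous disk I/O goes through interfaces such as POSIX AIO or io_uring.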
Most answers here mentioning the D state (which is called TASK_UNINTERRUPTIBLE in the Linux state names) are incorrect. The D state is a special sleep mode which is only triggered in a kernel space code path, when that code path can't be interrupted (because it would be too complex to program), with the expectation that it would block only for a very short time. I believe that most "D states" are actually invisible; they are very short lived and can't be observed by sampling tools such as 'top'.
You can encounter unkillable processes in the D state in a few situations. NFS is famous for that, and I've encountered it many times. I think there's a semantic clash between some VFS code paths, which assume always reaching local disks and fast error detection (on SATA, an error timeout would be around a few hundred ms), and NFS, which actually fetches data from the network, which is more resilient and has slow recovery (a TCP timeout of 300 seconds is common). Read this article for the cool solution introduced in Linux 2.6.25 with the TASK_KILLABLE state. Before this era there was a hack where you could actually send signals to NFS client processes by sending a SIGKILL to the kernel thread rpciod, but forget about that ugly trick.
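To illustrate the difference inside the kernel (a sketch with an illustrative wait queue and condition): wait_event() sleeps in TASK_UNINTERRUPTIBLE, i.e. the D state, while wait_event_killable() uses TASK_KILLABLE and lets SIGKILL, and only SIGKILL, interrupt the sleep.

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static int my_condition;

static int wait_old_style(void)
{
    wait_event(my_wq, my_condition);   /* D state, cannot be killed */
    return 0;
}

static int wait_killable_style(void)
{
    /* Still ignores ordinary signals, but SIGKILL works; returns
     * -ERESTARTSYS if the task was killed while waiting. */
    return wait_event_killable(my_wq, my_condition);
}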
While waiting for a read() or write() to/from a file descriptor to return, the process will be put in a special kind of sleep, known as "D" or "Disk Sleep". This is special because the process cannot be killed or interrupted while in such a state. A process waiting for a return from ioctl() would also be put to sleep in this manner.
An exception to this is when a file (such as a terminal or other character device) is opened in O_NONBLOCK mode, passed when it's assumed that a device (such as a modem) will need time to initialize. However, you indicated block devices in your question. Also, I have never tried an ioctl() that is likely to block on an fd opened in non-blocking mode (at least not knowingly).
How another process is chosen depends entirely on the scheduler you are using, as well as what other processes might have done to modify their weights within that scheduler.
Some user space programs under certain circumstances have been known to remain in this state forever, until rebooted. These are typically grouped in with other "zombies", but the term would not be correct as they are not technically defunct.
A process performing I/O will be put in the D state (uninterruptible sleep), which frees the CPU until there is a hardware interrupt which tells the CPU to return to executing the program. See man ps for the other process states.
Depending on your kernel, there is a process scheduler, which keeps track of a runqueue of processes ready to execute. It, along with a scheduling algorithm, tells the kernel which process to assign to which CPU. There are kernel processes and user processes to consider. Each process is allocated a time-slice, which is a chunk of CPU time it is allowed to use. Once the process uses all of its time-slice, it is marked as expired and given lower priority in the scheduling algorithm.
In the 2.6 kernel there is an O(1)-time-complexity scheduler, so no matter how many processes you have running, it will assign CPUs in constant time. It is more complicated than that, though, since 2.6 introduced preemption and CPU load balancing is not an easy algorithm. In any case, it's efficient and CPUs will not remain idle while you wait for the I/O.
As already explained by others, processes in the "D" state (uninterruptible sleep) are responsible for hangs of the ps process. To me this has happened many times with Red Hat 6.x and automounted NFS home directories.
To list processes in D state you can use the following commands:
cd /proc
for i in [0-9]*; do echo -n "$i :"; cat $i/status | grep ^State; done | grep D
To find the current directory of the process and, maybe, the mounted NFS disk that has issues, you can use a command similar to the following example (replace 31134 with the sleeping process number):
# ls -l /proc/31134/cwd
lrwxrwxrwx 1 pippo users 0 Aug 2 16:25 /proc/31134/cwd -> /auto/pippo
I found that giving the umount command with the -f (force) switch to the related mounted NFS file system was able to wake up the sleeping process:
umount -f /auto/pippo
The file system wasn't unmounted, because it was busy, but the related process did wake up and I was able to solve the issue without rebooting.
Assuming your process is a single thread, and that you're using blocking I/O, your process will block waiting for the I/O to complete. The kernel will pick another process to run in the meantime based on niceness, priority, last run time, etc. If there are no other runnable processes, the kernel won't run any; instead, it'll tell the hardware the machine is idle (which will result in lower power consumption).
Processes that are waiting for I/O to complete typically show up in state D in, e.g., ps and top.
Yes, the task gets blocked in the read() system call. Another task which is ready runs, or if no other tasks are ready, the idle task (for that CPU) runs.
A normal, blocking disc read causes the task to enter the "D" state (as others have noted). Such tasks contribute to the load average, even though they're not consuming the CPU.
Some other types of IO, especially ttys and network, do not behave quite the same - the process ends up in "S" state and can be interrupted and doesn't count against the load average.
Yes, tasks waiting for IO are blocked, and other tasks get executed. Selecting the next task is done by the Linux scheduler.
Generally the process will block. If the read operation is on a file descriptor marked as non-blocking or if the process is using asynchronous IO it won't block. Also if the process has other threads that aren't blocked they can continue running.
The decision as to which process runs next is up to the scheduler in the kernel.