How does one thread wake up other threads inside the Linux kernel?

My question is: how and when does one thread wake up other thread(s)?
I tried looking at the Linux kernel code, but didn't find what I was looking for.
For example, suppose one thread is waiting on a mutex, a condition variable, or a file descriptor (an eventfd, for example).
What work is performed by the thread that releases the mutex, and what work is performed by the other CPU core that is about to run the sleeping thread?
I have searched existing answers, but did not find the details.
I have read that the scheduler can usually be called:
after a system call before returning to userspace
after interrupt processing
after timer interrupt processing - for example, every 1 ms (HZ = 1000) or 4 ms (HZ = 250)
I believe that the thread releasing some resource ends up, through some system call, in the kernel function try_to_wake_up. This function picks a task (or tasks) and sets its state to RUNNABLE. That work is performed by the signaling thread and takes some time. But how does the task actually start running? If the system is busy, there may be no free CPUs to run it. Some time in the future - for example, on a timer tick, or when some other thread goes to sleep or exhausts its quantum - the scheduler is invoked on some CPU and picks the runnable task to run. Perhaps the task will preferably run on the CPU where it ran previously.
But there must be some other scenario. When there are idle CPUs, I believe the task is woken immediately, without waiting for up to 1 ms or even 4 ms (wake-up latency is usually a few microseconds, not milliseconds).
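For reference, this is the kind of minimal user-space test I have in mind for measuring that latency. It is only a sketch: the eventfd, the CPU numbers 0 and 1, the one-second settle delay and the build command are my own arbitrary choices, not anything taken from the kernel.

/* Measure the wake-up latency of an eventfd wakeup between two threads
 * pinned to different cores.  Build with: gcc -O2 -pthread wakeup_latency.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <time.h>
#include <unistd.h>

static int efd;
static struct timespec t_signal;        /* taken just before the wakeup */

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *waiter(void *arg)
{
    (void)arg;
    pin_to_cpu(1);                      /* sleep on core 1 */
    uint64_t val;
    read(efd, &val, sizeof(val));       /* blocks until the other thread writes */
    struct timespec t_wake;
    clock_gettime(CLOCK_MONOTONIC, &t_wake);
    long ns = (t_wake.tv_sec - t_signal.tv_sec) * 1000000000L
            + (t_wake.tv_nsec - t_signal.tv_nsec);
    printf("wake-up latency: %ld ns\n", ns);
    return NULL;
}

int main(void)
{
    efd = eventfd(0, 0);
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    pin_to_cpu(0);                      /* signal from core 0 */
    sleep(1);                           /* let the waiter block first */
    uint64_t one = 1;
    clock_gettime(CLOCK_MONOTONIC, &t_signal);
    write(efd, &one, sizeof(one));      /* the waiter is made runnable here */
    pthread_join(t, NULL);
    return 0;
}

As far as I understand, the write() above is exactly the point where the signaling thread does the wake-up work in the kernel, and the measured gap is the cross-CPU wake-up latency I am asking about.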
Also, for example, imagine a situation where some thread runs exclusively on some CPU core.
That core may be isolated from kernel threads and interrupt handlers, and only one user thread has its affinity set to run on this core and only on this core. I believe that if there are enough free CPU cores, no other threads will normally be scheduled to run on that core (am I wrong?).
This core may also have the nohz_full option enabled. So when the user thread goes to sleep, the core goes to sleep too: no IRQs from devices and no timer IRQs are processed.
So there must be some way for one CPU to tell another CPU (through an interrupt) to start running, call the scheduler, and run the user thread that is ready to wake up.
The scheduler must run not on the CPU that releases the resource, but on some other CPU, which itself has to be woken. Maybe this is done via an IPI? Can you help me find the corresponding code in the kernel, or describe how it works?
Thank you.

Related

OS Round Robin thread scheduling

Let's assume you have an OS that tries to run threads in round-robin scheduling.
I know of two instances when the OS will try to switch between multiple threads (there could be more...):
When the current thread actually yields the CPU earlier on its own.
When the OS receives a timer interrupt.
The question is: let's say the OS has a maximum compute-bound time of, say, 5 ms (the OS receives a timer interrupt every 5 ms). This assumption means that each thread can own a CPU core for at most 5 ms.
What happens if a process/thread finishes its time slice earlier than 5 ms? Will this cause the next thread scheduled to have a compute-bound time of less than 5 ms, since the next timer interrupt will occur and the thread will have no choice but to give up the CPU?
Specific Example:
What happens if a process/thread finishes its time slice earlier than 5 ms, let's say after 2 ms?
I know another thread will be scheduled, but will that thread get a full time slice of 5 ms, or will it only have 3 ms before the next timer interrupt occurs?
This question likely depends on the OS (not provided). On mainstream OSs, a yielding task typically waits for a resource or for a given time. The operating system will not reschedule it unless it becomes ready (completed I/O operation, available lock, timeout, etc.). When a new task is ready, the OS scheduler is free either to wait for the end of a time slice or to reschedule the previous task. It is common for the task to be rescheduled so as to increase the responsiveness of multithreaded applications (waiting a few milliseconds when code tries to acquire a lock that is already taken is not reasonable). That being said, this behavior is often not implemented so directly. Windows makes use of priority boosts to do that. On Linux, CFS tries to make the schedule fair so that all tasks get a balanced share of the available resources (e.g. cores). The priority of the target tasks also matters. The OS of some famous gaming consoles uses a round-robin scheduler by default, and it only schedules lower-priority tasks if there is no high-priority task. On such systems, when a higher-priority task becomes ready, the current one is interrupted directly (with no delay except the overhead of a context switch).
Put shortly, the OS does not necessarily have to wait for a timer interrupt to perform context switches. Also, yes, time slices are generally never left empty, so they are reused by other tasks. Whether the newly scheduled task gets a full time slice depends on the actual OS scheduler. Also note that a thread does not "give up the CPU": user-land threads have no real control over this. In practice, either a schedule-like kernel call happens during a system call (causing a context switch of the current task), or a system interrupt causes kernel code to be executed that typically makes this schedule-like call at the end of a time slice.
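As a small illustrative experiment of the "reschedule instead of waiting for the end of a time slice" behaviour (the core number, the 10 ms sleep and the build command are arbitrary choices, and the result depends on your machine's load): pin a spinning thread and a periodically sleeping thread to the same core and measure how late the sleeper wakes. On Linux/CFS the sleeper usually gets the core back within a small fraction of a time slice.

/* Build with: gcc -O2 -pthread wakeup_preempt.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

static void pin_to_cpu0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *spinner(void *arg)
{
    (void)arg;
    pin_to_cpu0();
    for (;;)                            /* burn CPU forever on core 0 */
        ;
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, spinner, NULL);
    pin_to_cpu0();                      /* compete with the spinner on the same core */

    for (int i = 0; i < 10; i++) {
        struct timespec req = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
        struct timespec before, after;
        clock_gettime(CLOCK_MONOTONIC, &before);
        nanosleep(&req, NULL);          /* sleep 10 ms, then become ready again */
        clock_gettime(CLOCK_MONOTONIC, &after);
        long ns = (after.tv_sec - before.tv_sec) * 1000000000L
                + (after.tv_nsec - before.tv_nsec);
        printf("requested 10 ms, actually slept %ld ns\n", ns);
    }
    return 0;
}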
There are a lot of ways threads can yield, e.g. posting or waiting on a semaphore, input, output, etc. If a thread yields, or its scheduling timer times out (5 ms), the OS will rummage through the list of threads to see what else can be run.
That rummaging literally involves running through the list of threads and seeing what their status is.
Some threads may be listed as "preempted" (i.e. they hadn't yielded; the OS's 5 ms scheduling timer timed out and the OS suspended them in favour of another), in which case one of those can simply be reinstated (registers restored, program counter set, and the CPU picks up from that point). Round-robin scheduling simply adds an extra piece of information, namely "when did this thread last get run?", with the OS favouring the thread that has not run for the longest time.
Others will be listed as waiting on I/O (so those can't be run), and yet more will be listed as waiting on locks like semaphores (so those can't be run either).
Note that something like a sem_post() is also a yield, giving the OS a chance to do this rummaging, perhaps finding a thread that is listed as waiting on the semaphore that's just been posted.
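For example (an illustrative POSIX snippet, not tied to any particular OS's internals - the names and the one-second delay are just for the demo):

/* One thread blocks in sem_wait(); when the main thread calls sem_post(),
 * the kernel can pick the waiter as runnable and wake it.
 * Build with: gcc -O2 -pthread sem_demo.c */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static sem_t sem;

static void *waiter(void *arg)
{
    (void)arg;
    printf("waiter: going to sleep in sem_wait()\n");
    sem_wait(&sem);               /* the thread is taken off the run queue here */
    printf("waiter: woken up by sem_post()\n");
    return NULL;
}

int main(void)
{
    sem_init(&sem, 0, 0);         /* initial count 0: the first sem_wait() blocks */
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    sleep(1);                     /* give the waiter time to block */
    printf("main: posting the semaphore\n");
    sem_post(&sem);               /* makes the waiter runnable again */
    pthread_join(t, NULL);
    sem_destroy(&sem);
    return 0;
}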
If an OS determines that across all processes there are no runnable threads at all (nothing waiting on semaphores, everything hung up waiting for I/O), and there is nothing for it to do itself, what happens next depends on the CPU. For some older CPUs, the OS would literally have to enter an infinite loop waiting for some interrupt to fire from some device. More modern CPUs have an instruction which will suspend execution until some interrupt fires.
Basically, the OS scheduler is part interrupt service routine (responding to timer or device interrupts), and part "ordinary" code that simply manages lists of threads when threads voluntarily yield.

Is the scheduler built into the kernel a program or a process?

I looked up the CPU scheduler source code built into the kernel.
https://github.com/torvalds/linux/tree/master/kernel/sched
But I have a question.
There are mixed opinions about the CPU scheduler on the Internet.
I saw an opinion that the CPU scheduler is a process.
Question: If so, when running ps -ef on Linux, the scheduler process should be visible. But it was difficult to find the PID and name of any scheduler process.
The PID of a CPU scheduler process isn't mentioned on the internet either. However, the PID 0 swapper process has historically been called SCHED, but in Linux PID 0 functions as the idle process.
I also saw an opinion that the CPU scheduler is not a process.
The CPU scheduler is passive source code built into the kernel, and user processes frequently enter the kernel and execute that code.
Question: How does a user process execute the kernel's scheduler source code on its own?
What if you created a user program that does not make any system call that uses the kernel's scheduler?
How would such a user process end up running the scheduler in the kernel without such code?
You have two similar questions ("The opinion that the scheduler built into the kernel is the program and the opinion that it is the process" and "I want to know how to implement the cpu scheduling process in Linux operating system"), so I'll answer both of them here.
The answer is that it doesn't work that way at all. The scheduler is not called by user-mode processes via system calls; the scheduler isn't a system call. There are timers that are programmed to raise interrupts after some time has elapsed. Timers are accessed through registers mapped into the address space, often called memory-mapped I/O (MMIO). You write to a position specified by the ACPI tables (https://wiki.osdev.org/ACPI), and that lets you control the chips in the CPU or external PCI devices (PCI is everything nowadays).
When the timer reaches 0, it triggers an interrupt. Interrupts are raised by hardware (the CPU). The CPU thus includes a special mechanism that lets the OS determine the position it will jump to on an interrupt (https://wiki.osdev.org/Interrupt_Descriptor_Table). Interrupts are used by the CPU to notify the OS that an event has happened. Without interrupts, the OS would have to reserve at least one core of the processor for a special kernel process that would constantly poll the registers of peripherals and other things; that would be impossible to implement in practice. Also, if user-mode processes made the scheduler call by themselves, the kernel would be a slave to user mode, because it would not be able to tell whether a process is finished, and processes could be selfish with CPU time.
I didn't look at the source code, but I think the scheduler is also often called on I/O completion (also on interrupt, but not always on a timer interrupt). I am quite sure that the scheduler must not be preempted; that is, interrupts (and other things) will be disabled while the schedule() function runs.
I don't think you can call the scheduler a process (not even a kernel thread). The scheduler can be called by kernel threads that are created for bottom-half processing of interrupts. In bottom-half processing, the top "half" of the interrupt handler runs fast and efficiently, while the bottom "half" is queued as work and runs when the scheduler decides it should be scheduled. This has the effect of creating some kernel threads. The scheduler can thus be called from kernel threads, but not always from the bottom half of interrupts. There has to be a mechanism to call the scheduler without the scheduler having to schedule the task itself; otherwise the kernel would stop functioning.

How does the scheduler know that a thread is blocked waiting for input?

When a thread executing user code is waiting for input, how does the scheduler know to interrupt it, or how does the thread know to call the scheduler? The average programmer of a simple single-threaded application is unlikely to insert sched_yield() everywhere. Does the compiler insert sched_yield() during optimisation, does the thread just spin until the general timer interrupt set by the scheduler fires, or does the user have to explicitly call wait() or sleep() functions in order for the context to switch?
This question is especially relevant if the scheduler is not preemptive, because then the thread has to call the scheduler itself when it is waiting for input for throughput to be acceptable, but I'm not sure how it does this.
Be careful not to confuse preemption with the ability of a process to sleep. Processes can sleep even with a non-preempting scheduler. This is what happens when a process is waiting for I/O. The process makes a system call such as read(), and the driver determines that no data is available. The kernel then internally puts the process to sleep by updating a data structure used by the scheduler. The scheduler then executes other processes until an interrupt or some other event occurs that wakes the original process. The awoken process then becomes eligible again for scheduling.
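As a rough user-space sketch of that sequence (the pipe standing in for the device, and the two-second delay, are just illustrative choices):

/* The reader blocks inside read() and is put to sleep by the kernel until
 * data arrives.  Build with: gcc -O2 block_read.c */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fds[2];
    pipe(fds);                              /* fds[0] = read end, fds[1] = write end */

    pid_t pid = fork();
    if (pid == 0) {                         /* child: the "device" producing data */
        close(fds[0]);
        sleep(2);                           /* no data available for 2 seconds */
        const char msg[] = "data ready\n";
        write(fds[1], msg, strlen(msg));
        close(fds[1]);
        return 0;
    }

    close(fds[1]);
    char buf[64];
    printf("parent: calling read(), will sleep until data arrives\n");
    ssize_t n = read(fds[0], buf, sizeof(buf) - 1);   /* parent sleeps here */
    if (n > 0) {
        buf[n] = '\0';
        printf("parent: woken up, got: %s", buf);
    }
    close(fds[0]);
    wait(NULL);
    return 0;
}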
On the other hand preemption is the ability of an architecture's scheduler to stop execution of a process without its cooperation. The interruption can occur anywhere in the program's instruction stream. Control returns to the scheduler which can then execute other processes and return to the interrupted (preempted) process later. Most schedulers allocate time slices where a process is allowed to run for up to a predetermined amount of time, after which it is preempted if higher-priority processes need time slices.
Unless you're writing drivers or kernel code, you don't need to worry about the underlying mechanisms too much. When writing user-space applications the key concepts are (1) that some system calls may block which means your process is put to sleep until an event occurs, and (2) on preemptible systems (all mainstream modern operating systems) your program may be preempted at any time so that other processes can run.
* Note that in some platforms, such as Linux, a thread is really just another process which shares its virtual address space with another process. Processes and threads are therefore treated exactly the same by the scheduler.
It is not clear to me whether your question is about theory or practice. In practice, in every modern operating system, I/O operations are privileged, meaning that in order for a user process or thread to access files, devices and so on, it must issue a system call.
The kernel then has the opportunity to do whatever it considers appropriate. For example, it can check whether the I/O operation will block and, if so, switch out the running process (i.e. "call" the scheduler) after issuing the operation.
Note that this mechanism can work even when there is no timer interrupt handled by the kernel. Anyway, in general it will depend on your system. For example, in an embedded system where no OS exists (or only a minimal one), it could be entirely the responsibility of the user's code to invoke the scheduler before issuing a blocking operation.
The kernel can be preemptive, not the scheduler.
First, sched_yield() and wait() are forms of voluntary preemption, where the process itself gives up the CPU even if the kernel is non-preemptive.
If the kernel has the ability to switch to another process when the time quantum has expired or a higher-priority process becomes runnable, then we are talking about involuntary preemption, i.e. a preemptive kernel, and it can happen at the different places explained below.
The difference is that in sched_yield() the process stays in the runnable TASK_RUNNING state but just goes to the end of the run queue for its static priority. The process must wait to get the CPU again.
On the other hand, wait() puts the process into a sleeping TASK_(UN)INTERRUPTIBLE state on a wait queue, calls schedule() and waits for an event to occur. When the event occurs, the process is moved to the run queue again. But that doesn't mean it will get the CPU immediately.
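A small illustrative demo of that difference, using a condition-variable wait as the sleeping case (the names and the one-second delay are arbitrary):

/* sched_yield() keeps the calling thread runnable, while a condition wait
 * puts the thread to sleep until another thread signals it.
 * Build with: gcc -O2 -pthread yield_vs_wait.c */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;

static void *sleeper(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready)
        pthread_cond_wait(&cond, &lock);  /* off the run queue until signalled */
    pthread_mutex_unlock(&lock);
    printf("sleeper: signalled and made runnable again\n");
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, sleeper, NULL);

    /* Voluntary yield: this thread stays runnable and merely moves to the
     * back of the run queue for its priority level. */
    sched_yield();

    sleep(1);                             /* let the sleeper block first */
    pthread_mutex_lock(&lock);
    ready = 1;
    pthread_cond_signal(&cond);           /* wakeup: the sleeper goes back to the run queue */
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return 0;
}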
Here is an explanation of when schedule() can be called after a process is woken up:
Wakeups don't really cause entry into schedule(). They add a task to the run-queue and that's it.
If the new task added to the run-queue preempts the current task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets called on the nearest possible occasion:
If the kernel is preemptible (CONFIG_PREEMPT=y):
in syscall or exception context, at the next outmost preempt_enable() (this might be as soon as the wake_up()'s spin_unlock()!)
in IRQ context, on return from the interrupt handler to preemptible context
If the kernel is not preemptible (CONFIG_PREEMPT is not set), then at the next:
cond_resched() call
explicit schedule() call
return from syscall or exception to user-space
return from interrupt handler to user-space

At what points in a program does the system switch threads

I know that threads cannot actually run in parallel on the same core, but on a regular desktop system there are normally hundreds or even thousands of threads, which is of course much more than today's average of 4-core CPUs. So the system actually runs some thread for X time, then switches to run another thread for Y amount of time, and so on.
My question is, how does the system decide how much time to execute each thread?
I know that when a program calls sleep() on a thread for an amount of time, the operating system can use this time to execute other threads, but what happens when a program does not call sleep at all?
E.g:
#include <stdbool.h>
#include <stdio.h>

int main(int argc, char const *argv[])
{
    while (true)
        printf("busy");
    return 0;
}
When does the operating system decide to suspend this thread and execute another?
The OS keeps a container of all the threads that can use CPU execution (usually such threads are described as being 'ready'). On most desktop systems, this is a very small fraction of the total number of threads. Most threads in such systems are waiting on either I/O (this includes sleeping - waiting on timer I/O) or inter-thread signaling; such threads cannot use CPU execution, and so the OS does not dispatch them onto cores.
A software syscall (e.g. a request to open a file, to sleep, or to wait for a signal from another thread), or a hardware interrupt from a peripheral device (e.g. a disk controller, NIC, keyboard or mouse), may cause the set of ready threads to change and so initiate a scheduling run.
When run, the scheduler decides which set of ready threads to assign to the available cores. The algorithm it uses is a compromise that tries to optimize overall performance by balancing the cost of expensive context switches against the need for responsive I/O. The kernel CAN stop any thread on any core and preempt it, but it would surely prefer not to :)
So:
My question is, how does the system decide how much time to execute
each thread?
Essentially, it does not. If the set of ready threads is not greater than the number of cores, there is no need to stop/control/influence a CPU-intensive loop - it can be allowed to run forever, taking up a whole core.
Note that your example is very poor - the printf() call will request output from the OS and, if that output is not immediately possible, the OS will block your seemingly 'CPU only' thread until it is.
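If you want a loop that really is CPU-only, something like this (just an illustration) makes no system calls at all and can only lose the core through preemption:

#include <stdint.h>

int main(void)
{
    volatile uint64_t counter = 0;   /* volatile keeps the loop from being optimised away */
    for (;;)
        counter++;                   /* no system calls, so no voluntary sleep or block */
}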
but what happens when a program does not call sleep at all?
It's just one more thread. If it is purely CPU-intensive, then whether it runs continually depends upon the loading on the box and the number of cores available, as already described. It can, of course, get blocked by requesting I/O or electing to wait for a signal from another thread, so removing itself from the set of ready threads.
Note that one I/O device is a hardware timer. This is very useful for timing out system calls and providing sleep() functionality. It mainly has a visible effect on boxes where the number of ready threads is larger than the number of cores available to run them (i.e. the box is overloaded, or the tasks it runs have no limits on CPU use). It can result in the available cores being shared out among the ready threads, giving the illusion of running more threads than the machine is physically capable of (try not to get hung up on sleep() and the timer interrupt - it is just one of many interrupts that can change thread state).
It is this behaviour of the timer hardware, interrupt and driver that gives rise to the appalling 'quantum', 'time-sharing', 'round-robin' etc. confusion and FUD that surrounds the operation of modern preemptive kernels.
A preemptive kernel, and its drivers etc., is a state machine. Syscalls from running threads and hardware interrupts from peripheral devices go in; a set of running threads comes out.
It depends on which type of scheduling your OS is using. For example, let's take Round Robin:
In order to schedule processes fairly, a round-robin scheduler generally employs time-sharing, giving each job a time slot or quantum (its allowance of CPU time), and interrupting the job if it is not completed by then. The job is resumed the next time a time slot is assigned to that process. If the process terminates or changes its state to waiting during its attributed time quantum, the scheduler selects the first process in the ready queue to execute.
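Here is a toy simulation of that queue rotation (the 5 ms quantum and the per-task runtimes are made-up numbers; a real kernel obviously does not store tasks this way):

/* Toy round-robin simulation.  Build with: gcc -O2 rr_sim.c */
#include <stdio.h>

#define QUANTUM_MS 5
#define NTASKS     3

int main(void)
{
    int remaining[NTASKS] = { 12, 7, 3 };   /* fake CPU time each task still needs, in ms */
    int queue[NTASKS]     = { 0, 1, 2 };    /* circular ready queue of task indices */
    int head = 0, count = NTASKS, clock = 0;

    while (count > 0) {
        int task = queue[head];             /* pick the task at the front */
        head = (head + 1) % NTASKS;
        count--;

        int slice = remaining[task] < QUANTUM_MS ? remaining[task] : QUANTUM_MS;
        printf("t=%2d ms: task %d runs for %d ms\n", clock, task, slice);
        clock += slice;
        remaining[task] -= slice;

        if (remaining[task] > 0) {          /* quantum expired: requeue at the back */
            int tail = (head + count) % NTASKS;
            queue[tail] = task;
            count++;
        } else {
            printf("t=%2d ms: task %d finished\n", clock, task);
        }
    }
    return 0;
}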
There are other scheduling algorithms as well; you will find this link useful: https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/5_CPU_Scheduling.html
The operating system has a component called the scheduler that decides which thread should run and for how long. There are essentially two basic kinds of schedulers: cooperative and preemptive. Cooperative scheduling requires that the threads cooperate and regularly hand control back to the operating system, for example by doing some kind of IO. Most modern operating systems use preemptive scheduling.
In preemptive scheduling the operating system gives a time slice for the thread to run. The OS does this by setting a handler for a CPU timer: the CPU regularly runs a piece of code (the scheduler) that checks if the current thread's time slice is over, and possibly decides to give the next time slice to a thread that is waiting to run. The size of the time slice and how to choose the next thread depends on the operating system and the scheduling algorithm you use. When the OS switches to a new thread it saves the state of the CPU (register contents, program counter etc) for the current thread into main memory, and restores the state of the new thread - this is called a context switch.
If you want to know more, the Wikipedia article on Scheduling has lots of information and pointers to related topics.

Does a thread waiting on IO also block a core?

In the synchronous/blocking model of computation we usually say that a thread of execution will wait (be blocked) while it waits for an IO task to complete.
My question is simply will this usually cause the CPU core executing the thread to be idle, or will a thread waiting on IO usually be context switched out and put into a waiting state until the IO is ready to be processed?
A CPU core is normally not dedicated to one particular thread of execution. The kernel is constantly switching processes being executed in and out of the CPU. The process currently being executed by the CPU is in the "running" state. The list of processes waiting for their turn are in a "ready" state. The kernel switches these in and out very quickly. Modern CPU features (multiple cores, simultaneous multithreading, etc.) try to increase the number of threads of execution that can be physically executed at once.
If a process is I/O blocked, the kernel will just set it aside (put it in the "waiting" state) and not even consider giving it time in the CPU. When the I/O has finished, the kernel moves the blocked process from the "waiting" state to the "ready" state so it can have its turn ("running") in the CPU.
So your blocked thread of execution blocks only that: the thread of execution. The CPU and the CPU cores continue to have other threads of execution switched in and out of them, and are not idle.
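As a rough illustration of this (the pipe, the core number and the two-second window are arbitrary choices): while one thread sleeps in read(), a second thread pinned to the same core keeps executing, so the core is not idle.

/* Build with: gcc -O2 -pthread core_not_idle.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int fds[2];
static volatile int stop;
static volatile uint64_t work_done;

static void pin_to_cpu0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *blocked_thread(void *arg)
{
    (void)arg;
    pin_to_cpu0();
    char buf[1];
    read(fds[0], buf, 1);        /* sleeps in the kernel until data arrives */
    return NULL;
}

static void *busy_thread(void *arg)
{
    (void)arg;
    pin_to_cpu0();               /* same core as the blocked thread */
    while (!stop)
        work_done++;             /* keeps the core busy the whole time */
    return NULL;
}

int main(void)
{
    pipe(fds);
    pthread_t a, b;
    pthread_create(&a, NULL, blocked_thread, NULL);
    pthread_create(&b, NULL, busy_thread, NULL);
    sleep(2);                    /* the reader stays blocked for ~2 seconds */
    write(fds[1], "x", 1);       /* wake it up */
    pthread_join(a, NULL);
    stop = 1;
    pthread_join(b, NULL);
    printf("work done while the other thread was blocked: %llu increments\n",
           (unsigned long long)work_done);
    return 0;
}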
For most programming languages, used in standard ways, the answer is that it will block your thread, but not your CPU.
You would need to explicitly reserve a CPU for a particular thread (affinity) for one thread to block an entire CPU core. To be more explicit, see this question:
You could call the SetProcessAffinityMask on every process but yours with a mask that excludes just the core that will "belong" to your process, and use it on your process to set it to run just on this core (or, even better, SetThreadAffinityMask just on the thread that does the time-critical task).
If we assume it's not async, then I would say that, in that case, the thread would be put on the waiting queue for sure, and its state would be "waiting".
Context-switching wise, in my opinion it needs a little more explanation, since the term context switch can mean/involve many things (swapping in/out, page table updates, register updates, etc.). Depending on the current state of execution, a second thread belonging to the same process might be scheduled to run while the thread blocked on the I/O operation is still waiting.
In that case, context switching would most likely be limited to changing register values on the CPU core (although the owning process might even get swapped out if there's not much memory left).
No - in Java, a blocked thread does not participate in scheduling.
