Measuring Context Switch time in kernel space

There are plenty of programs available that can be used to calculate context switch time from user space. But all of these have several sources of overhead, such as the overhead of the clock_gettime() timer and of the read/write operations on a pipe.
Is it possible to measure context switch time in Linux kernel space, where the above overhead won't be present?
Maybe two global variables could be added in a kernel module to store the time when the context_switch function is called and the time when it finishes.
The challenge I am facing with this approach is that the context_switch function can be called by any process and from any core.
Is it feasible or advisable to add something to struct task_struct or struct rq?
I am using Ubuntu 16.04.

If you want to measure the delay of a context switch between threads (not including thread execution time):
Depending on the kernel config, you can refer to
__schedule : the scheduler's main function
preempt_schedule_common
schedule
preempt_schedule_context
preempt_schedule_irq
However, it would be hard to calculate the exact delay, since the entire scheduling path does not seem to run inside an interrupt-disabling spinlock context. If you don't disable local interrupts, your delay calculation will include ISR servicing time.
__schedule() disables local interrupts only for some specific critical sections. In any case, the core part of scheduling is the __schedule() function.
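For comparison, the user-space method the question mentions can be sketched in Python as below. This is only a baseline under stated assumptions, not a kernel-space measurement: the helper name pipe_round_trip_ns is made up for illustration, both processes are pinned to one CPU (where the platform allows it) so that every hand-off needs a context switch, and the reported figure still includes the pipe read/write, timer and interpreter overhead that the question wants to avoid.

```python
import os
import time


def pipe_round_trip_ns(iterations=1000):
    """Average time in ns for one parent -> child -> parent ping-pong."""
    # Pin to a single CPU (Linux only) so that each hand-off forces a
    # context switch instead of the two processes running in parallel.
    original = None
    if hasattr(os, "sched_setaffinity"):
        try:
            original = os.sched_getaffinity(0)
            os.sched_setaffinity(0, {min(original)})
        except OSError:
            original = None  # pinning not permitted; measure unpinned

    to_child_r, to_child_w = os.pipe()
    to_parent_r, to_parent_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: echo every byte straight back, then exit without cleanup.
        for _ in range(iterations):
            os.read(to_child_r, 1)
            os.write(to_parent_w, b"x")
        os._exit(0)

    start = time.perf_counter_ns()
    for _ in range(iterations):
        os.write(to_child_w, b"x")   # wake the child
        os.read(to_parent_r, 1)      # block until the child answers
    elapsed = time.perf_counter_ns() - start

    os.waitpid(pid, 0)
    for fd in (to_child_r, to_child_w, to_parent_r, to_parent_w):
        os.close(fd)
    if original is not None:
        os.sched_setaffinity(0, original)  # restore the old CPU mask
    return elapsed / iterations


if __name__ == "__main__":
    print(f"{pipe_round_trip_ns():.0f} ns per round trip")
```

When both processes share one CPU, each round trip contains two context switches, so half the printed value is an upper bound on a single switch rather than its true cost.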

Related

Would kernel-level threads be clearly preferable to user-level threads if system calls were as fast as procedure calls?

Some web search results told me that the only deficiency of kernel-level threads is the slow speed of their management (create, switch, terminate, etc.). It seems that if all operations on kernel-level threads go through system calls, the answer to my question would be yes. However, I've searched a lot to find out whether kernel-level threads are managed entirely through system calls, but found nothing. I also have an instinct that such management should be done by the OS automatically, because only the OS knows which thread is suitable to run at a specific time. So it seems impossible for programmers to write explicit system calls to manage threads. I'd appreciate any ideas.

Some web search results told me that the only deficiency of kernel-level threads is the slow speed of their management (create, switch, terminate, etc.).
It's not that simple. To understand, think about what causes task switches. Here's a (partial) list:
a device told a device driver that an operation completed (some data arrived, etc) causing a thread that was waiting for the operation to unblock and then preempt the currently running thread. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
enough time passed; either causing an "end of time slice" task switch, or causing a sleeping thread to unblock and preempt. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the thread accessed virtual memory that isn't currently accessible, triggering the kernel's page fault handler which finds out that the current task has to wait while the kernel fetches data from swap space or from a file (if the virtual memory is part of a memory mapped file), or has to wait for the kernel to free up RAM by sending other pages to swap space (if virtual memory was involved in some kind of "copy on write"); causing a task switch because the currently running task can't continue. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
a new process is being created, and its initial thread preempts the currently running thread. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread asked kernel to do something with a file and kernel got "VFS cache miss" that prevents the request from being performed without any task switches. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread releases a mutex or sends some data (e.g. using a pipe or socket); causing a thread that belongs to a different process to unblock and preempt. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread releases a mutex or sends some data (e.g. using a pipe or socket); causing a thread that belongs to the same process to unblock and preempt. For this case you're running user-space code when you find out that a task switch is needed, so in theory user-space task switching is faster, but in practice it can just as easily be an indicator of poor design (using too many threads and/or far too much lock contention).
a new thread is being created for the same process; and the new thread preempts the currently running thread. For this case you're running user-space code when you find out that a task switch is needed, so user-space task switching is faster; but only if the kernel isn't informed (e.g. so that utilities like "top" can properly display details for threads) - if the kernel is informed anyway then it doesn't make much difference where the task switch happens.
For most software (which doesn't use very many threads), doing task switches in the kernel is faster. Of course it's also (hopefully) fairly irrelevant for performance (because time spent switching tasks should be tiny compared to time spent doing other work).
And I always have an instinct that such management should be done by the OS automatically because only OS knows which thread would be suitable to run at a specific time.
Yes; but possibly not for the reason you think.
Another problem with user-space threading (besides making most task switches slower) is that it can't support global thread priorities without becoming a severe security disaster. Specifically; a process can't know if its own thread is higher or lower priority than a thread belonging to a different process (unless it has information about all threads for the entire OS, which is information that normal processes shouldn't be trusted to have); so user-space threading leads to wasting CPU time doing unimportant work (for one process) when there's important work to do (for a different process).
Another problem with user-space threading is that (for some CPUs - e.g. most 80x86 CPUs) the CPUs are not independent, and there may be power management decisions involved with scheduling. For example: most 80x86 CPUs have hyper-threading (where a core is shared by 2 logical processors), where a smart scheduler may say "one logical processor in the core is running a high priority/important thread, so the other logical processor in the same core should not run a low priority/unimportant thread because that would make the important work slower"; most 80x86 CPUs have "turbo boost" (with similar "don't let low priority threads ruin the turbo-boost/performance of high priority thread" possibilities); and most CPUs have thermal management (where the scheduler might say "Hey, these threads are all low priority, so let's underclock the CPU so that it cools down and can go faster later (has more thermal headroom) when there's high priority/more important work to do!").
Would kernel-level threads be clearly preferable to user-level threads if system calls were as fast as procedure calls?
If system calls were as fast as normal procedure calls, then the performance differences between user-space threading and kernel threading would disappear (but all the other problems with user-space threading would remain). However, the reason why system calls are slower than normal procedure calls is that they pass through a kind of "isolation barrier" (that isolates kernel's code and data from malicious user-space code); so to make system calls as fast as normal procedure calls you'd have to get rid of the isolation (effectively turning the kernel into a kind of "global shared library" that can be dynamically linked) but without that isolation you'll have an extreme security disaster. In other words; to have any hope of achieving acceptable security, system calls must be slower than normal procedure calls.

Your basic premise is wrong. System calls are much slower than procedure calls in almost every interesting architecture.
The perceived cpu throughput is based on pipelining, speculative execution and fetching. The syscall stops the pipeline, invalidates the speculative execution and halts the speculative fetching, is a store and instruction barrier, and may flush the write fifo.
So, the processor slows down to its ‘spec’ speed around the syscall, accelerating back up until the syscall return, whereupon it does about the exact same thing.
Attempts to optimise this area have given rise to lots of papers named after fictional James Bond organizations, and not conciliatory enough apologies from not embarrassed enough cpu product managers. Google spectre as an example, then follow the associated links.
The other cost of syscall
A bit over 30 years ago, some smart guys wrote a paper about least privilege. Conceptually, it is a stunner. The basic premise is that whatever your program is doing, it should do it with the least privilege possible.
If your program is inverting arrays, according to the notion of least privilege, it should not be able to disable interrupts. Disabling interrupts can cause a very difficult to diagnose system failure. Simple user code should not have this ability.
The notion of user and kernel modes of execution evolved from early computer systems, and (with the possible exception of the iax32 / 80286 ) are increasingly showing their inadequacy in the connected computer environment. At one point in time you could say "this is a single user system"; but the IoT dweebs have made everything multi-user.
Least privilege insists that all code should execute with the minimum privilege required to complete the task at hand. Thus, nothing should be in the kernel that absolutely doesn't need to be. If you think that is a radical thought, in Ken Thompson's 1977(?) paper on the UNIX kernel he states exactly the same thing.
So no, putting your junk in the kernel just means you have increased the attack surface for no valid reason. Try to think in terms of exposing minimum risk, it leads to better software and better sleep.

Context switch between kernel threads vs user threads

Copy pasted from this link:
Thread switching does not require Kernel mode privileges.
User level threads are fast to create and manage.
Kernel threads are generally slower to create and manage than the user threads.
Transfer of control from one thread to another within the same process requires a mode switch to the Kernel.
I never came across these points while reading standard operating systems reference books. Though these points sound logical, I wanted to know how they reflect in Linux. To be precise :
Can someone give detailed steps involved in context switching between user threads and kernel threads, so that I can find the step difference between the two.
Can someone explain the difference with an actual context switch example or code? Maybe the system calls involved (in the case of context switching between kernel threads) and the thread library calls involved (in the case of context switching between user threads).
Can someone link me to the Linux source code line (say on github) handling context switch.
I also wonder why a context switch between kernel threads requires changing to kernel mode. Aren't we already in kernel mode for the first thread?

Can someone give detailed steps involved in context switching between user threads and kernel threads, so that I can find the step difference between the two.
Let's imagine a thread needs to read data from a file, but the file isn't cached in memory and disk drives are slow so the thread has to wait; and for simplicity let's also assume that the kernel is monolithic.
For kernel threading:
thread calls a "read()" function in a library or something; which must cause at least a switch to kernel code (because it's going to involve device drivers).
the kernel adds the IO request to the disk driver's "queue of possibly many pending requests"; realizes the thread will need to wait until the request completes, sets the thread to "blocked waiting for IO" and switches to a different thread (that may belong to a completely different process, depending on global thread priorities). The kernel returns to the user-space of whatever thread it switched to.
later; the disk hardware causes an IRQ which causes a switch back to the IRQ handler in kernel code. The disk driver finishes up the work it had to do for the (currently blocked) thread and unblocks that thread. At this point the kernel might decide to switch to the "now unblocked" thread; and the kernel returns to the user-space of the "now unblocked" thread.
For user threading:
thread calls a "read()" function in a library or something; which must cause at least a switch to kernel code (because it's going to involve device drivers).
the kernel adds the IO request to the disk driver's "queue of possibly many pending requests"; realizes the thread will need to wait until the request completes but can't take care of that because some fool decided to make everything worse by doing thread switching in user space, so the kernel returns to user-space with "IO request has been queued" status.
after the pointless extra overhead of switching back to user-space; the user-space scheduler does the thread switch that the kernel could have done. At this point the user-space scheduler will either tell kernel it has nothing to do and you'll have more pointless extra overhead switching back to kernel; or user-space scheduler will do a thread switch to another thread in the same process (which may be the wrong thread because a thread in a different process is higher priority).
later; the disk hardware causes an IRQ which causes a switch back to the IRQ handler in kernel code. The disk driver finishes up the work it had to do for the (currently blocked) thread; but the kernel isn't able to do the thread switch to unblock the thread because some fool decided to make everything worse by doing thread switching in user space. Now we've got a problem - how does kernel inform the user-space scheduler that the IO has finished? To solve this (without any "user-space scheduler running zero threads constantly polls kernel" insanity) you have to have some kind of "kernel puts notification of IO completion on some kind of queue and (if the process was idle) wakes the process up" which (on its own) will be more expensive than just doing the thread switch in the kernel. Of course if the process wasn't idle then code in user-space is going to have to poll its notification queue to find out if/when the "notification of IO completion" arrives, and that's going to increase latency and overhead. In any case, after lots of stupid pointless and avoidable overhead; the user-space scheduler can do the thread switch.
Can someone explain the difference with an actual context switch example or code? Maybe the system calls involved (in the case of context switching between kernel threads) and the thread library calls involved (in the case of context switching between user threads).
The actual low-level context switch code typically begins with something like:
save whichever registers are "callee-saved" (must be preserved for the caller) according to the calling conventions on the stack
save the current stack top in some kind of "thread info structure" belonging to the old thread
load a new stack top from some kind of "thread info structure" belonging to the new thread
pop whichever registers are "callee-saved" according to the calling conventions
return
However:
usually (for modern CPUs) there's a relatively large amount of "SIMD register state" (e.g. for 80x86 with support for AVX-512 I think it's over 2 KiB of stuff, since the 32 ZMM registers alone take 32 x 64 bytes). CPU manufacturers often have mechanisms to avoid saving parts of that state if it wasn't changed, and to (optionally) postpone the loading of (pieces of) that state until it's actually used (and avoid it completely if it's not actually used). All of that requires the kernel.
if it's a task switch and not just used for thread switches you might need some kind of "if virtual address space needs to change { change virtual address space }" on top of that
normally you want to keep track of statistics, like how much CPU time a thread has used. This requires some kind of "thread_info.time_used += now() - time_at_last_thread_switch;"; which gets difficult/ugly when "process switching" is separated from "thread switching".
normally there's other state (e.g. pointer to thread local storage, special registers for performance monitoring and/or debugging, ...) that may need to be saved/loaded during thread switches. Often this state is not directly accessible in user code.
normally you also want to set a timer to expire when the thread has used too much time; either because you're doing some kind of "time multiplexing" (e.g. round-robin scheduler) or because it's a cooperative scheduler where you need to have some kind of "terminate this task after 5 seconds of not responding in case it goes into an infinite loop forever" safeguard.
this is just the low level task/thread switching in isolation. There is almost always higher level code to select a task to switch to, handle "thread used too much CPU time", etc.
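As a rough illustration of how the low-level switch, the "select a task to switch to" code and the time accounting fit together, here is a minimal cooperative user-space scheduler in Python. It is only a sketch: Python generators stand in for saved stacks and registers, the names worker and run_round_robin are invented for this example, and nothing here resembles the kernel's actual code.

```python
import time
from collections import deque


def worker(name, steps, log):
    """A toy 'thread' that records its progress and yields after each step."""
    for i in range(steps):
        log.append((name, i))
        yield  # voluntarily give up the CPU (cooperative switch)


def run_round_robin(tasks):
    """Run (name, generator) pairs in rotation; return ns used per task."""
    ready = deque(tasks)   # the run queue
    time_used = {}         # per-task accounting, like thread_info.time_used
    while ready:
        name, gen = ready.popleft()        # select the next task
        start = time.perf_counter_ns()
        try:
            next(gen)                      # "load" its state and resume it
            ready.append((name, gen))      # still runnable: back of queue
        except StopIteration:
            pass                           # task finished: drop it
        time_used[name] = time_used.get(name, 0) + (
            time.perf_counter_ns() - start)
    return time_used


if __name__ == "__main__":
    log = []
    usage = run_round_robin([("A", worker("A", 2, log)),
                             ("B", worker("B", 1, log))])
    print(log)
    print(usage)
```

Because the switch only happens at a yield, a task that never yields would run forever, which is exactly why the list above mentions a timer as a safeguard.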
Can someone link me to Linux source code line (say on github) handling context switch
Someone probably can't. It's not one line; it's many lines of assembly for each different architecture, plus extra higher-level code (for timers, support routines, the "select a task to switch to" code, for exception handlers to support "lazy SIMD state load", ...); which probably all adds up to something like 10 thousand lines of code spread across 50 files.
I also wonder why a context switch between kernel threads requires changing to kernel mode. Aren't we already in kernel mode for the first thread?
Yes; often you're already in kernel code when you find out that a thread switch is needed.
Rarely/sometimes (mostly only due to communication between threads belonging to the same process - e.g. 2 or more threads in the same process trying to acquire the same mutex/semaphore at the same time; or threads sending data to each other and waiting for data from each other to arrive) kernel isn't involved; and in some cases (which are almost always massive design failures - e.g. extreme lock contention problems, failure to use "worker thread pools" to limit the number of threads needed, etc) it's possible for this to be the dominant cause of thread switches, and therefore possible that doing thread switches in user space can be beneficial (e.g. as a work-around for the massive design failures).

Don't limit yourself to Linux or even UNIX; they are neither the first nor the last word on systems or programming models. The synchronous execution model dates back to the early days of computing, and is not particularly well suited to larger scale concurrent and reactive programming.
Golang, for example, employs a great many lightweight user threads -- goroutines -- and multiplexes them on a smaller set of heavyweight kernel threads to produce a more compelling concurrency paradigm. Some other programming systems take similar approaches.

Does the kernel only execute on occurrence of an exception

I'm learning about embedded Linux. I can't seem to find proper answers for my questions below.
My understanding is, when user-space applications are executing, if we want to perform IO for example, a system call is made which will cause a SW interrupt, generally causing the MCU to switch from non-privileged mode to privileged mode and the kernel will perform the IO on behalf of the application.
Similarly, when a hardware interrupt occurs, I'm guessing this will cause the modes to switch again and execute an interrupt handler within the kernel.
What's not clear to me is, are these the only times when the kernel code gets control of the CPU?
With only one core for example, if user application code is running, shouldn't the kernel be getting control of the CPU from time to time to check things, regardless of whether an interrupt has occurred or not. Perhaps there is a periodic timer interrupt allowing this?
Also, if we have multiple cores, could the kernel just be running all the time on one core while user applications on another?

Read Operating Systems: Three Easy Pieces since an entire book is needed to answer your questions. Later, study the source code of the kernel, with the help of https://kernelnewbies.org/
Interrupts happen really often (perhaps hundreds, or even thousands, per second). Try cat /proc/interrupts (see proc(5)) a few times in a terminal.
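To put a number on "really often", the counters in /proc/interrupts can be sampled twice and subtracted. The sketch below assumes the usual Linux layout (a header row naming the CPUs, then one row per interrupt source with one counter per CPU); the function names total_interrupts and interrupts_per_second are made up for this example.

```python
import time


def total_interrupts(text):
    """Sum the per-CPU counters of every row in /proc/interrupts text."""
    lines = text.splitlines()
    cpu_count = len(lines[0].split())      # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue                       # not an "IRQ: counts" row
        # Only the first cpu_count fields can be counters; the rest is
        # a textual description of the interrupt source.
        counts = [int(f) for f in rest.split()[:cpu_count] if f.isdigit()]
        totals[label.strip()] = sum(counts)
    return totals


def interrupts_per_second(interval=1.0, path="/proc/interrupts"):
    """Sample the file twice and return the overall interrupt rate."""
    with open(path) as f:
        before = total_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = total_interrupts(f.read())
    return (sum(after.values()) - sum(before.values())) / interval


if __name__ == "__main__":
    print(f"{interrupts_per_second():.0f} interrupts/s")
```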
the kernel will perform the IO on behalf of the application.
Not always immediately. If you read a file, its content could be in the page cache (and then no physical IO is needed). If disk access (or networking) is required, the kernel will schedule (read about preemptive scheduling) some IO to happen and context switch to other runnable tasks. Much later, several interrupts have been handled (some of which may be triggered by physical devices related to your IO), and finally (many milliseconds later) your process could return -in user space- from the read(2) system call and be running again. During that delay, other processes have been running.
Also, if we have multiple cores, could the kernel just be running all the time on one core while user applications on another?
It depends a lot (even on the kernel version). Probably the kernel is not running on the same core. YMMV.

What's not clear to me is, are these the only times when the kernel code gets control of the CPU?
Yes; the kernel cannot interrupt running user code by itself. But the kernel sets up timer hardware that generates a timer interrupt at regular intervals, and the kernel uses that interrupt to implement task scheduling.
Also, if we have multiple cores, could the kernel just be running all the time on one core while user applications on another?
You can think of a multi-core system as multiple machines that share memory and are able to send interrupts to each other.

Process with multiple threads on multiprocessor system. How do they work?

So I was reading about Processes and Threads and I had a question. Following is the scenario.
Uniprocessor Environment
I understand that the OS rotates the processes over the processor, each for a particular time period (quantum). Now I get it when the process is single-threaded, i.e. has just one path of execution. In that case, whenever it is assigned the processor, it continues with its execution. Let's say the process forks or just creates a new thread. Now how does the entire process work? Is it that the OS will say to process P "Go on, continue with execution" and the process within itself will pick the new thread or the parent thread in rotation, so that if there are more than two threads, the rotation seems fair to each thread? Or does the OS actually interact with the threads? (In that case I am not sure what happens.)
Multiprocessor Environment
Now say I have a multiprocessor environment. In this case, if there is just a single-threaded process, the OS will assign one of the processors to it and it will go on with its execution. Now say there are multiple threads in the process. If I assign one of the processors to the process and ask it to continue its execution, and the process has to pick one of its threads to execute, then there will never be parallel processing going on in that specific process, since the process will have to put just one of its threads on the processor.
So how does it happen in both the cases?
Cheers.

Process Scheduling
Operating Systems ultimately control these types of thread scheduling.
Windows systems are priority-based and so will allow a process to consume more resources than others. This is why your machine can 'hang' if a process has been escalated to a high priority. Priorities range between 1 and 31 as far as I know.
Mac OS / Linux / Unix are time-based, allowing all processes to have equal amounts of CPU time. Therefore loading more processes will slow your system down as they all share a smaller slice of execution time.
Uniprocessor Environment
The OS is ultimately responsible for this, but switching processes involves (I cannot guarantee accuracy here, it's just an indication):
Halting a process / thread
Storing the current stack (code location)
Storing the current registers of the CPU
Asking the kernel for the next process/thread to run
Kernel indicates which one has to be run
OS reloads the registers from the cache
OS reloads the current stack for the next application.
Resumes the process
Obviously the more threads and processes you have running, the slower it will become. The problem is that the time taken to switch processes can actually take longer than the time allowed to execute the process.
Threads are just child processes of a single process. For a single processor, it just looks like additional work.
Multi-processor Environment
Multi-processor environments work differently as the cache is shared amongst processors. I believe these are called L1 (Level) and L2 caches. So the difference is that processor A can reload the state stored by processor B without conflicts. 'Hyper-threading' also has the same approach, although this is processor specific. The difference here is that a processor could solely control a specific process; this is called 'CPU affinity'. It's not encouraged for every process, but it does allow an application to have a dedicated processor to work off.

This is OS-specific, of course, but most operating systems schedule at the thread level. A process is just a grouping of threads. For example, on Linux, threads are called "tasks" and each is scheduled independently. They are created with the clone call. What is typically called a thread is a task which shares its address space (and other resources such as file descriptors, mount points, etc.) with the creating task. Note that the clone call can also create what is typically called a process if the flags to enable sharing are not passed.
Considering the above, any thread may be scheduled at any time on any processor, no matter how many processors there are available. That said, most OSs also attempt to maintain some measure of processor affinity to avoid excessive cache misses, but usually if a thread is runnable and a different CPU is available, it will change CPUs. Often there is also a way to specify which CPUs a particular thread may execute upon.
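The "a process is just a grouping of threads" point can be observed from user space. In the Python sketch below (the helper name collect_thread_ids is hypothetical), every thread reports the same process ID but its own kernel-assigned thread ID, which on Linux is the ID of the task the scheduler sees. It assumes Python 3.8 or later for threading.get_native_id().

```python
import os
import threading


def collect_thread_ids(count=3):
    """Return (pid, kernel thread id) for `count` threads alive at once."""
    barrier = threading.Barrier(count)
    lock = threading.Lock()
    results = []

    def body():
        ids = (os.getpid(), threading.get_native_id())
        with lock:
            results.append(ids)
        # Keep every thread alive until all have reported, so the kernel
        # cannot hand a finished thread's ID to a later one.
        barrier.wait()

    threads = [threading.Thread(target=body) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for pid, tid in collect_thread_ids():
        print(f"pid={pid} tid={tid}")
```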

It doesn't matter whether there is 1 processor or 128. The OS manages access to resources to try to efficiently match up requests with availability, and that includes CPU execution. If a thread is running, it has already managed to get some CPU but, if it requests a resource that is not immediately available, it no longer needs any CPU until that other resource does become free, and so the OS will remove CPU execution from it and, if there is another thread that is waiting for CPU, it will hand it over. When the requested resource does become available, the thread will be made ready again. If there is a core free, it will be made running 'immediately'; if not, the CPU scheduling algorithm makes a decision on whether to stop a currently-running thread to free up a core or to leave the newly-ready thread waiting.
It's better to try and ignore things like 'time-slice, quantum, priority' - it causes much confusion and FUD. If a running thread wants something it cannot have yet, it doesn't need any more CPU cycles, and the OS will take them away and, if another thread needs it, apply them there. That is why preemptive multitaskers exist - to match up threads with resources in an attempt to maximize forward progress.
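One way to watch the OS take the CPU away from a thread that "wants something it cannot have yet" is the voluntary context switch counter that getrusage() exposes. The sketch below is an assumption-laden illustration: the helper name voluntary_switches_during is invented, and how the counter behaves depends on the platform (some systems or sandboxes may always report 0).

```python
import resource
import time


def voluntary_switches_during(blocking_call):
    """Count voluntary context switches of this process around a call."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw
    blocking_call()   # e.g. a sleep or a blocking read
    after = resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw
    return after - before


if __name__ == "__main__":
    # Sleeping means the thread needs no CPU, so the scheduler switches
    # away from it; that is recorded as a voluntary switch.
    print(voluntary_switches_during(lambda: time.sleep(0.05)))
```

On a typical Linux system the sleep is expected to add at least one voluntary switch, while a pure computation of similar length usually adds none.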

Linux Threads and process - CPU affinity

I have a few questions related to thread and process scheduling.
When my process goes to sleep and wakes up again, will it always be scheduled on the same CPU that it was scheduled on before?
When I create a thread from the process, will it also always be executed on the same CPU, even if other CPUs are free and sleeping?
I would like to know the mechanism in Linux specifically. Also, I am creating the threads through the pthread library. I am facing a random hang that is not always reproducible, and I need this information to proceed in the right direction.

On single processor/core systems
Yes
Yes
On multi processor/core systems
No.
No.
Use taskset to retrieve or set a process's CPU affinity on multicore systems. Setting the CPU affinity to a specific processor/core will change the answers to
Yes
Yes
also for multicore systems.
From within an application you may use sched_setaffinity and/or sched_getaffinity to adjust the CPU affinity.
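As a small sketch of the same calls from within an application, Python's os module exposes sched_getaffinity and sched_setaffinity on Linux. The helper name pin_to_one_cpu is made up for this example, and choosing the lowest-numbered allowed CPU is an arbitrary assumption.

```python
import os


def pin_to_one_cpu(pid=0):
    """Restrict `pid` (0 = calling process) to one CPU it may already use.

    Returns the previous CPU set and the chosen CPU so that the caller
    can restore the old mask later.
    """
    allowed = os.sched_getaffinity(pid)     # CPUs we may currently run on
    target = min(allowed)                   # arbitrary choice: lowest CPU
    os.sched_setaffinity(pid, {target})     # from now on: only this CPU
    return allowed, target


if __name__ == "__main__":
    previous, cpu = pin_to_one_cpu()
    print(f"was allowed on {sorted(previous)}, now pinned to CPU {cpu}")
    os.sched_setaffinity(0, previous)       # undo the pinning
```

Threads created after the call inherit the new mask, which is why pinning changes both answers above to "Yes".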
Edit: Additional details about how/when CPU swaps are managed with respect to cache disadvantages:
The Linux/SMP Scheduler: "... In order to achieve good system performance, Linux/SMP (2.4 kernel) adopts an empirical rule to solve the dilemma ..." Read the details in the linked reference, section The Linux/SMP Scheduler.
For the newer CFS (Completely Fair Scheduler) you'd look at sched_migration_cost. "...if the real runtime of the task is smaller than the values of this parameter then the scheduler assumes that it is still in the cache and tries to avoid moving the task to another CPU during the load balancing procedure ..." (e.g.: Completely Fair Scheduler and its tuning).

When a process goes to sleep and later wakes up, it will not necessarily be scheduled on the same CPU. In a multiprocessor environment it may be scheduled on any CPU, according to the scheduler policy. A process can go to sleep for different reasons, for example because it is waiting for I/O or for some other resource. When the event occurs, the process moves from the waiting state to the ready state. At that point the scheduler will place the process on whichever CPU is free, which need not be the same CPU as before.
For more information about the scheduler, read the scheduler source code in the Linux kernel source tree.
