In Non-Blocking IO - what is exactly does the IO? - multithreading

Even after a lot of reading I cannot seem to understand how non-blocking IO actually works at the OS level.
If a thread is the most granular unit to an OS scheduler, ie. any work that must be done must be done using a thread. Then which thread actually does the IO in non-blocking mode ?
Example:
Lets say a thread requests the contents of a socket in non-blocking
mode, lets say the open system call in POSIX.
This basically means that the thread wants to be notified or will check a particular status for the completion of the IO. It does not wait for the IO to be completed. But my question is who does the IO ?
Does the kernel spin up a thread to wait for the IO ? If so how is it different from the main thread spinning up a child thread and doing the same ?
Most threads are mapped to kernel threads (green threads), then if non-blocking modes spin up threads then what's the great advantage ? All I get is my main thread not waiting.
Are there other ways to complete IO without the use of threads ? Like Direct Memory Access (DMA) ? I have heard that it uses hardware interrupts (no idea how threads are not involved). So is all non-blocking IO, DMA ? Even reading from a disk ?
All these questions are basically the same: if NIO involves just spawning threads to do the waiting then how is it different from async IO, if not, then how is the IO even done ?
More info: There is no Thread

For simple, low-speed I/O devices (keyboard, mouse, serial ports, etc.) The CPU pretty much has to handle every byte of data. The secret sauce that makes that work is hardware interrupts.
When a hardware device needs attention from the CPU (e.g., because it has received a byte of data, and it needs the CPU to "read" the byte before the device is able to receive the next one) It signals an interrupt to the CPU.
The CPU handles the interrupt by;
saving the the context of whatever thread was executing,
possibly elevating the current privilege level, and then
it effectively calls a function—the interrupt handler—that is responsible for servicing the hardware.
A typical interrupt handler for a low-speed I/O device might;
read a byte from a register,
maybe set a flag in another register,
store the byte into a circular buffer, and then
return.
The handler "returns" by executing a special "return-from-interrupt" opcode that restores the context of the thread that was interrupted, and restores the previous privilege level.
For high-speed I/O devices such as file storage devices or network interfaces, it's much the same except the hardware most likely will DMA entire blocks or packets of data to/from some physical memory buffer before it triggers the interrupt.

Related

How the OS knows when an I/O operation has finished execution?

Consider the situation, where you issue a read from the disc (I/O operation). Then what is the exact mechanism that the OS uses to get to know whether the operation has been executed?
Then what is the exact mechanism that the OS uses to get to know whether the operation has been executed?
The exact mechanism depends on the specific hardware (and OS and scenario); but typically when a device finishes doing something the device triggers an IRQ that causes the CPU to interrupt whatever it was doing and switch to a device driver's interrupt handler.
Sometimes/often device driver ends up maintaining a queue or buffer of pending commands; so that when its interrupt handler is executed (telling it that a previous command has completed) it takes the next pending command and tells the device to start it. Sometimes/often this also includes some kind of IO priority scheme, where driver can ask device to do more important work sooner (while less important work is postponed/remains pending).
A device driver is typically also tied to scheduler in some way - a normal thread in user-space might (directly or indirectly - e.g. via. file system) request that data be transferred and the scheduler will be told to not give that thread CPU time because it's blocked/waiting for something; and then later when the transfer is completed the device driver's interrupt handler tells the scheduler that the requesting thread can continue, causing it to be unblocked/able to be given CPU time by scheduler again.

Context switch between kernel threads vs user threads

Copy pasted from this link:
Thread switching does not require Kernel mode privileges.
User level threads are fast to create and manage.
Kernel threads are generally slower to create and manage than the user threads.
Transfer of control from one thread to another within the same process requires a mode switch to the Kernel.
I never came across these points while reading standard operating systems reference books. Though these points sound logical, I wanted to know how they reflect in Linux. To be precise :
Can someone give detailed steps involved in context switching between user threads and kernel threads, so that I can find the step difference between the two.
Can someone explain the difference with actual context switch example or code. May be system calls involved (in case of context switching between kernel threads) and thread library calls involved (in case of context switching between user threads).
Can someone link me to Linux source code line (say on github) handling context switch.
I also doubt why context switch between kernel threads requires changing to kernel mode. Aren't we already in kernel mode for first thread?
Can someone give detailed steps involved in context switching between user threads and kernel threads, so that I can find the step difference between the two.
Let's imagine a thread needs to read data from a file, but the file isn't cached in memory and disk drives are slow so the thread has to wait; and for simplicity let's also assume that the kernel is monolithic.
For kernel threading:
thread calls a "read()" function in a library or something; which must cause at least a switch to kernel code (because it's going to involve device drivers).
the kernel adds the IO request to the disk driver's "queue of possibly many pending requests"; realizes the thread will need to wait until the request completes, sets the thread to "blocked waiting for IO" and switches to a different thread (that may belong to a completely different process, depending on global thread priorities). The kernel returns to the user-space of whatever thread it switch to.
later; the disk hardware causes an IRQ which causes a switch back to the IRQ handler in kernel code. The disk driver finishes up the work it had to do the for (currently blocked) thread and unblocks that thread. At this point the kernel might decide to switch to the "now unblocked" thread; and the kernel returns to the user-space of the "now unblocked" thread.
For user threading:
thread calls a "read()" function in a library or something; which must cause at least a switch to kernel code (because it's going to involve device drivers).
the kernel adds the IO request to the disk driver's "queue of possibly many pending requests"; realizes the thread will need to wait until the request completes but can't take care of that because some fool decided to make everything worse by doing thread switching in user space, so the kernel returns to user-space with "IO request has been queued" status.
after the pointless extra overhead of switching back to user-space; the user-space scheduler does the thread switch that the kernel could have done. At this point the user-space scheduler will either tell kernel it has nothing to do and you'll have more pointless extra overhead switching back to kernel; or user-space scheduler will do a thread switch to another thread in the same process (which may be the wrong thread because a thread in a different process is higher priority).
later; the disk hardware causes an IRQ which causes a switch back to the IRQ handler in kernel code. The disk driver finishes up the work it had to do for the (currently blocked) thread; but the kernel isn't able to do the thread switch to unblock the thread because some fool decided to make everything worse by doing thread switching in user space. Now we've got a problem - how does kernel inform the user-space scheduler that the IO has finished? To solve this (without any "user-space scheduler running zero threads constantly polls kernel" insanity) you have to have some kind of "kernel puts notification of IO completion on some kind of queue and (if the process was idle) wakes the process up" which (on its own) will be more expensive than just doing the thread switch in the kernel. Of course if the process wasn't idle then code in user-space is going to have to poll its notification queue to find out if/when the "notification of IO completion" arrives, and that's going to increase latency and overhead. In any case, after lots of stupid pointless and avoidable overhead; the user-space scheduler can do the thread switch.
Can someone explain the difference with actual context switch example or code. May be system calls involved (in case of context switching between kernel threads) and thread library calls involved (in case of context switching between user threads).
The actual low-level context switch code typically begins with something like:
save whichever registers are "caller preserved" according to the calling conventions on the stack
save the current stack top in some kind of "thread info structure" belonging to the old thread
load a new stack top from some kind of "thread info structure" belonging to the new thread
pop whichever registers are "caller preserved" according to the calling conventions
return
However:
usually (for modern CPUs) there's a relatively large amount of "SIMD register state" (e.g. for 80x86 with support for AVX-512 I think it's over 4 KiB of of stuff). CPU manufacturers often have mechanisms to avoid saving parts of that state if it wasn't changed, and to (optionally) postpone the loading of (pieces of) that state until its actually used (and avoid it completely if its not actually used). All of that requires kernel.
if it's a task switch and not just used for thread switches you might need some kind of "if virtual address space needs to change { change virtual address space }" on top of that
normally you want to keep track of statistics, like how much CPU time a thread has used. This requires some kind of "thread_info.time_used += now() - time_at_last_thread_switch;"; which gets difficulty/ugly when "process switching" is separated from "thread switching".
normally there's other state (e.g. pointer to thread local storage, special registers for performance monitoring and/or debugging, ...) that may need to be saved/loaded during thread switches. Often this state is not directly accessible in user code.
normally you also want to set a timer to expire when the thread has used too much time; either because you're doing some kind of "time multiplexing" (e.g. round-robin scheduler) or because its a cooperating scheduler where you need to have some kind of "terminate this task after 5 seconds of not responding in case it goes into an infinite loop forever" safe-guard.
this is just the low level task/thread switching in isolation. There is almost always higher level code to select a task to switch to, handle "thread used too much CPU time", etc.
Can someone link me to Linux source code line (say on github) handling context switch
Someone probably can't. It's not one line; it's many lines of assembly for each different architecture, plus extra higher-level code (for timers, support routines, the "select a task to switch to" code, for exception handlers to support "lazy SIMD state load", ...); which probably all adds up to something like 10 thousand lines of code spread across 50 files.
I also doubt why context switch between kernel threads requires changing to kernel mode. Aren't we already in kernel mode for first thread?
Yes; often you're already in kernel code when you find out that a thread switch is needed.
Rarely/sometimes (mostly only due to communication between threads belonging to the same process - e.g. 2 or more threads in the same process trying to acquire the same mutex/semaphore at the same time; or threads sending data to each other and waiting for data from each other to arrive) kernel isn't involved; and in some cases (which are almost always massive design failures - e.g. extreme lock contention problems, failure to use "worker thread pools" to limit the number of threads needed, etc) it's possible for this to be the dominant cause of thread switches, and therefore possible that doing thread switches in user space can be beneficial (e.g. as a work-around for the massive design failures).
Don't limit yourself to Linux or even UNIX, they are neither the first nor last word on systems or programming models. The synchronous execution model dates back to the early days of computing, and are not particularly well suited to larger scale concurrent and reactive programming.
Golang, for example, employs a great many lightweight user threads -- goroutines -- and multiplexes them on a smaller set of heavyweight kernel threads to produce a more compelling concurrency paradigm. Some other programming systems take similar approaches.

C# When thread switching will most probably occur?

I was wondering when .Net would most probably switch from a thread to another?
I understand we can't predict when this will happen exactly, but is there any intelligence in this? For example, when a thread is executed will it try to wait for a method to returns or a loop to finish before switching?
I'm not an expert on .NET, but in general scheduling is handled by the kernel.
Either your thread's timeslice has expired (threads/processes only get a certain amount of CPU time)
Your thread has blocked for IO.
Some other obscure reason, like waiting for an IPC message, a network packet or something.
Threads can be preempted at any point along their execution path, be it in a loop or returning from a function. This in general isn't handled by the underlying VM (.NET or JVM) but is controlled by the OS.
Of course there is 'intelligence', of a sort:). The set of running threads can only change upon an interrupt, either:
An actual hardware interrupt from a peripheral device, eg. disk, NIC, KB, mouse, timer.
A software interrupt, (ie. a system call), that can change the state of thread/s. This encompasses sleep calls and calls to wait/signal on inter-thread synchro objects, as well as I/O calls that request data that is not immediately available.
If there is no interrupt, the OS cannot change the set of running threads because it is not entered. The OS does not know or care about loops, function/methods calls, (except those that make system calls as above), gotos or any other user-level flow-control mechanisms.
I read your question now, it may not be rellevant anymore, but after reading the above answers, i want to just to make sure:
Threads are managed (or as i know) by the process they belong to. There is nothing to do with the Operation System(and that's is the main reason why working with multithreads is more faster than working with multiprocess, because there are data sharing between threads and the switching between them is occuring faster than the context switch wich occure between process by the Short-Term-Scheduler).
(NOTE: There are two types of threads: USER_MODE' threads and KERNEL_MODE' threadss, and each os can have both of them or just on of them. Anyway a thread that working in a user application environment is considered as a USER_MODE' thread and managed by the process it's belong to.)
Am I Write?
Thanks!!!

How do system calls like select() or poll() work under the hood?

I understand that async I/O ops via select() and poll() do not use processor time i.e its not a busy loop but then how are these really implemented under the hood ? Is it supported in hardware somehow and is that why there is not much apparent processor cost for using these ?
It depends on what the select/poll is waiting for. Let's consider a few cases; I'm going to assume a single-core machine for simplification.
First, consider the case where the select is waiting on another process (for example, the other process might be carrying out some computation and then outputs the result through a pipeline). In this case the kernel will mark your process as waiting for input, and so it will not provide any CPU time to your process. When the other process outputs data, the kernel will wake up your process (give it time on the CPU) so that it can deal with the input. This will happen even if the other process is still running, because modern OSes use preemptive multitasking, which means that the kernel will periodically interrupt processes to give other processes a chance to use the CPU ("time-slicing").
The picture changes when the select is waiting on I/O; network data, for example, or keyboard input. In this case, while archaic hardware would have to spin the CPU waiting for input, all modern hardware can put the CPU itself into a low-power "wait" state until the hardware provides an interrupt - a specially handled event that the kernel handles. In the interrupt handler the CPU will record the incoming data and after returning from the interrupt will wake up your process to allow it to handle the data.
There is no hardware support. Well, there is... but is nothing special and it depends on what kind of file descriptor are you watching. If there is a device driver involved, the implementation depends on the driver and/or the device. For example, sockets. If you wait for some data to read, there are a sequence of events:
Some process calls poll()/select()/epoll() system call to wait for data in a socket. There is a context switch from the user mode to the kernel.
The NIC interrupts the processor when some packet arrives. The interrupt routine in the driver push the packet in the back of a queue.
There is a kernel thread that takes data from that queue and wakes up the network code inside the kernel to process that packet.
When the packet is processed, the kernel determines the socket that was expecting for it, saves the data in the socket buffer and returns the system call back to user space.
This is just a very brief description, there are a lot of details missing but I think that is enough to get the point.
Another example where no drivers are involved is a unix socket. If you wait for data from one of them, the process that waits is added to a list. When other process on the other side of the socket writes data, the kernel checks that list and the point 4 is applied again.
I hope it helps. I think that examples are the best to undertand it.

Is the timeslice given to a thread that is waiting on I/O "wasted"?

I'm currently analyzing the pros and cons of writing a server using a threaded model or event driven model. I already know the many cons of the threaded model (does not scale well due to context switching overhead, virtual memory limitations, etc.) but I came upon another one in my analysis and would like to verify that my understanding of threads is correct.
If I have 5 threads, 1 which is doing work (not being blocked), 4 which are being blocked waiting for I/O (for example waiting on data from a socket), isn't the CPU time given to those 4 threads essentially wasted since no work is actually being done (assuming no data arrives)? The timeslice given to those 4 blocked threads is taking away time from the 1 thread actually doing work, correct?
In this case I'm explicitly saying that the socket is a blocking one.
No. Although it actually depends on the type of OS, type of I/O (polled/DMA) and device driver architecture, most device I/O is performed using DMA + interrupts. In such cases a thread is put into a sleep state until an interrupt is triggered for such I/O operations and scheduler does not visit those threads until their pending I/O is complete. Only polling I/O can cause consumption of CPU, such as PIO mode for hard disks.
Threads don't need to use their entire timeslice. I don't know the specifics, but if blocked threads even get time, they certainly don't use it all.
Obviously, these details vary platform-to-platform-to-environment-to-etc.

Resources