How do system calls like select() or poll() work under the hood?

I understand that async I/O ops via select() and poll() do not consume processor time, i.e. it's not a busy loop, but then how are these actually implemented under the hood? Is there hardware support somehow, and is that why there is no apparent processor cost to using them?

It depends on what the select/poll is waiting for. Let's consider a few cases; I'm going to assume a single-core machine for simplicity.
First, consider the case where the select is waiting on another process (for example, the other process might be carrying out some computation and then sending the result through a pipe). In this case the kernel will mark your process as waiting for input, and so it will not give it any CPU time. When the other process writes data, the kernel will wake your process up (give it time on the CPU) so that it can deal with the input. This happens even if the other process is still running, because modern OSes use preemptive multitasking: the kernel periodically interrupts processes to give other processes a chance to use the CPU ("time-slicing").
The picture changes when the select is waiting on device I/O; network data, for example, or keyboard input. In this case, while archaic hardware would have had to spin the CPU waiting for input, all modern hardware can put the CPU into a low-power "wait" state until the device raises an interrupt - an event the kernel handles specially. The interrupt handler records the incoming data, and after returning from the interrupt the kernel wakes your process up to let it handle that data.
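To make this concrete, here is a minimal user-space sketch using the standard POSIX select(), watching stdin. While the process sits inside select() it consumes no CPU; the kernel only makes it runnable again once input arrives:

```c
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main(void)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(STDIN_FILENO, &rfds);   /* watch stdin (fd 0) for readability */

    /* The process sleeps inside the kernel here, consuming no CPU,
       until the kernel wakes it because stdin became readable. */
    int ready = select(STDIN_FILENO + 1, &rfds, NULL, NULL, NULL);

    if (ready > 0 && FD_ISSET(STDIN_FILENO, &rfds)) {
        char buf[128];
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
        printf("woke up with %zd bytes\n", n);
    }
    return 0;
}
```

Run it and type nothing: a process monitor will show it using no CPU time while it waits, which is exactly the "not a busy loop" behaviour the question asks about.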

There is no hardware support. Well, there is... but it is nothing special, and it depends on what kind of file descriptor you are watching. If there is a device driver involved, the implementation depends on the driver and/or the device. Take sockets, for example. If you wait for some data to read, there is a sequence of events:
1. Some process calls the poll()/select()/epoll() system call to wait for data on a socket. There is a context switch from user mode to the kernel.
2. The NIC interrupts the processor when a packet arrives. The interrupt routine in the driver pushes the packet onto the back of a queue.
3. A kernel thread takes data from that queue and wakes up the network code inside the kernel to process the packet.
4. When the packet is processed, the kernel determines which socket was waiting for it, saves the data in the socket buffer, and the system call returns to user space.
This is just a very brief description; a lot of details are missing, but I think it is enough to get the point.
Another example, where no drivers are involved, is a Unix domain socket. If you wait for data on one of these, the waiting process is added to a list. When a process on the other side of the socket writes data, the kernel checks that list and step 4 applies again.
I hope that helps. I think examples are the best way to understand it.
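In that spirit, a small runnable sketch of the Unix-socket case using standard POSIX calls (socketpair/fork/poll): the parent sleeps in poll() until the child's write() finds it on the socket's wait list and wakes it, as in step 4 above.

```c
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);   /* a connected Unix socket pair */

    if (fork() == 0) {            /* child: the writer on the other end */
        sleep(1);
        write(sv[1], "x", 1);     /* this write wakes the waiting parent */
        _exit(0);
    }

    struct pollfd pfd = { .fd = sv[0], .events = POLLIN };
    /* The parent sleeps in the kernel here; the child's write() makes
       it runnable again. */
    poll(&pfd, 1, -1);

    char c;
    read(sv[0], &c, 1);
    printf("woken up, got '%c'\n", c);
    return 0;
}
```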

Related

How the OS knows when an I/O operation has finished execution?

Consider the situation where you issue a read from the disk (an I/O operation). What is the exact mechanism the OS uses to learn that the operation has completed?
The exact mechanism depends on the specific hardware (and OS and scenario), but typically, when a device finishes doing something, it triggers an IRQ that causes the CPU to interrupt whatever it was doing and switch to a device driver's interrupt handler.
Sometimes/often the device driver ends up maintaining a queue or buffer of pending commands, so that when its interrupt handler is executed (telling it that a previous command has completed) it takes the next pending command and tells the device to start it. Sometimes/often this also includes some kind of IO priority scheme, where the driver can ask the device to do more important work sooner (while less important work remains pending).
A device driver is typically also tied to the scheduler in some way: a normal thread in user space might (directly or indirectly, e.g. via a file system) request that data be transferred, and the scheduler is told not to give that thread CPU time because it's blocked/waiting for something; later, when the transfer completes, the device driver's interrupt handler tells the scheduler that the requesting thread can continue, unblocking it so it can be given CPU time by the scheduler again.
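As a rough illustration of that driver/scheduler tie-in, here is a hedged sketch in the style of a Linux driver; the device specifics are invented, only the wait-queue pattern matters:

```c
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/interrupt.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(read_wq);
static int data_ready;

/* Interrupt handler: the device signals that a transfer has finished. */
static irqreturn_t dev_irq(int irq, void *dev_id)
{
    /* ...acknowledge the device, start the next queued command... */
    data_ready = 1;
    wake_up_interruptible(&read_wq);   /* tell the scheduler: unblock waiters */
    return IRQ_HANDLED;
}

/* read() path: the calling thread sleeps here until the IRQ fires. */
static ssize_t dev_read(struct file *f, char __user *buf,
                        size_t len, loff_t *off)
{
    if (wait_event_interruptible(read_wq, data_ready))
        return -ERESTARTSYS;   /* woken by a signal instead of data */
    data_ready = 0;
    /* ...copy the completed transfer to user space... */
    return 0;
}
```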

In non-blocking IO - what exactly does the IO?

Even after a lot of reading I cannot seem to understand how non-blocking IO actually works at the OS level.
If a thread is the most granular unit known to the OS scheduler, i.e. any work that must be done must be done using a thread, then which thread actually does the IO in non-blocking mode?
Example:
Let's say a thread requests the contents of a socket in non-blocking mode, let's say via the open system call in POSIX.
This basically means that the thread wants to be notified of, or will check a particular status for, the completion of the IO. It does not wait for the IO to complete. But my question is: who does the IO?
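For concreteness, non-blocking mode from user space looks roughly like this with standard POSIX calls (the device path is just a made-up example):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* O_NONBLOCK makes read() return immediately instead of sleeping. */
    int fd = open("/dev/ttyUSB0", O_RDONLY | O_NONBLOCK);
    if (fd < 0) { perror("open"); return 1; }

    char buf[128];
    ssize_t n = read(fd, buf, sizeof buf);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* No data yet: the call simply reports "nothing yet" and
           returns; no thread is left waiting on our behalf. */
        puts("no data available yet");
    }
    close(fd);
    return 0;
}
```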
Does the kernel spin up a thread to wait for the IO? If so, how is that different from the main thread spawning a child thread and doing the same?
Most threads are mapped to kernel threads (as opposed to green threads), so if non-blocking mode just spins up threads, what's the great advantage? All I get is my main thread not waiting.
Are there other ways to complete IO without the use of threads? Like direct memory access (DMA)? I have heard that it uses hardware interrupts (I have no idea how threads are not involved). So is all non-blocking IO DMA? Even reading from a disk?
All these questions are basically the same: if NIO involves just spawning threads to do the waiting, then how is it different from async IO; and if not, then how is the IO even done?
More info: There is no Thread
For simple, low-speed I/O devices (keyboard, mouse, serial ports, etc.), the CPU pretty much has to handle every byte of data. The secret sauce that makes that work is hardware interrupts.
When a hardware device needs attention from the CPU (e.g., because it has received a byte of data, and it needs the CPU to "read" the byte before the device is able to receive the next one), it signals an interrupt to the CPU.
The CPU handles the interrupt by:
saving the context of whatever thread was executing,
possibly elevating the current privilege level, and then
effectively calling a function—the interrupt handler—that is responsible for servicing the hardware.
A typical interrupt handler for a low-speed I/O device might:
read a byte from a register,
maybe set a flag in another register,
store the byte into a circular buffer, and then
return.
The handler "returns" by executing a special "return-from-interrupt" opcode that restores the context of the thread that was interrupted, and restores the previous privilege level.
For high-speed I/O devices such as file storage devices or network interfaces, it's much the same, except that the hardware will most likely DMA entire blocks or packets of data to/from some physical memory buffer before it triggers the interrupt.

How does the kernel track which processes receive data from an interrupt?

In a preemptive kernel (say Linux), say process A makes a call to getc on stdin, so it's blocked waiting for a character. I feel like I have a fundamental misunderstanding of how the kernel knows to wake process A at that point and deliver the data once it's received.
My understanding is that this process can be put into a suspended state while the scheduler schedules other processes/threads to run, or it gets preempted. When the keypress happens, through polling or interrupts depending on the implementation, the OS runs a device driver that decodes the key that was pressed. However, it's possible (and likely) that my process A isn't currently running. At this point, I'm confused about how my process that was blocked waiting on I/O is queued to run again, and especially how the kernel knows which process is waiting for what. It seems like the device drivers must hold some form of wait queue.
Similarly, and I'm not sure if this is exactly related to the above: if my browser window, for example, is in focus, it seems to receive key presses, but other windows do not. Does every window/process have the ability to "listen" for keyboard events even when not in focus, but just doesn't, for the sake of user experience?
So I'm curious how kernels (or at least some of them) keep track of which processes are waiting on which events, and when those events come in, how they determine which processes to schedule to run.
The events that processes wait on are abstract software events, such as a particular queue becoming non-empty, rather than concrete hardware events, such as an interrupt 4635 occurring.
Some configuration (perhaps guided by a hardware description like a device tree) identifies interrupt 4635 as a signal from a given serial device at a given address. The serial device driver configures itself so it can access the device registers of this serial port, and attaches its interrupt handler to the given interrupt identifier (4635).
Once configured, when an interrupt from the serial device is raised, the lowest level of the kernel invokes this serial device's interrupt handler. In turn, when the handler sees a new character arriving, it places it in the input queue of that device. As it enqueues the character, it may notice that some process(es) are waiting for that queue to be non-empty, and cause them to be run.
That approximately describes the situation using condition variables as the signalling mechanism between interrupts and processes, as was established in UNIX-y kernels 44 years ago. Other approaches involve releasing a semaphore on each character in the queue; or replying with messages for each character. There are many forms of synchronization that can be used.
Common to all such mechanisms is that the caller chooses to suspend itself to wait for IO to complete, and does so by associating its suspension with the instance of the object it is expecting input from.
What happens next can vary; typically the waiting process, which is now running, reattempts to remove a character from the input queue. It is possible some other process got to it first, in which case it merely goes back to waiting for the queue to become non-empty.
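That retry is exactly why condition-variable waits are wrapped in a loop. A user-space analogue of the pattern with POSIX threads (the ring buffer here is purely illustrative):

```c
#include <pthread.h>

#define QSIZE 256
static int q[QSIZE];
static unsigned head, tail;         /* head == tail means empty */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

/* Consumer: the "blocked reader" waiting for input. */
int get_char(void)
{
    pthread_mutex_lock(&lock);
    /* A loop, not an 'if': another consumer may drain the queue
       between the wake-up and the moment this thread actually runs. */
    while (head == tail)
        pthread_cond_wait(&nonempty, &lock);
    int c = q[head++ % QSIZE];
    pthread_mutex_unlock(&lock);
    return c;
}

/* Producer: e.g. the interrupt-driven input path. */
void put_char(int c)
{
    pthread_mutex_lock(&lock);
    if (tail - head < QSIZE) {      /* drop when full; a sketch only */
        q[tail++ % QSIZE] = c;
        pthread_cond_signal(&nonempty);  /* wake one waiter */
    }
    pthread_mutex_unlock(&lock);
}
```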
So, the OS doesn't explicitly route the character from the device to the application; a series of implicit and indirect steps does.

Is there really no way to control priority of workqueue processing as compared to user processes/threads?

I've been reading a variety of references that discuss the use of bottom-half work queues for deferred processing in linux drivers. From what I glean, it seems like any work done by kernel work queues gets scheduled just like ordinary user processes/threads, and that the only real difference between a kernel work-queue-related process and a user process is that the work queue can move data between user-side buffers and kernel buffers.

I would appreciate knowing if my interpretation of these references is correct, or whether there are mechanisms by which I can maintain some degree of control over the priority of work queue processing. More specifically, I'd like to know if I can guarantee that a work queue process has higher priority than any user process, at least when the work queue process is not sleeping.

I'm asking this question in the context of handling reads/writes from/to chips hanging off a 400 kHz (i.e. slow) I2C bus. We're running linux 2.6.10 on an ARM9 processor. - Thanks!

Server running in linux kernel. Should listen happen in a thread or not?

I am writing a client/server in the Linux kernel. (Yes, inside the kernel. It's a design decision that has been taken and finalised; it's not going to change.)
The server reads incoming packets from a raw socket. The transport protocol for these packets (on which the raw socket is listening) is custom and UDP-like. In short, I do not have to listen for incoming connections and then fork a thread to handle each connection.
I have to just process any IP datagram coming in on that raw socket. I will keep reading packets in an infinite loop on the raw socket. In the user-level equivalent program, I would have created a separate thread and kept listening for incoming packets.
Now for kernel level server, I have doubts about whether I should run it in a separate thread or not because:
1. I think read() is an I/O operation, so somewhere inside read() the kernel must call the schedule() function to relinquish control of the processor. Thus, after calling read() on the raw socket, the current kernel execution context will be put on hold (on a sleep queue, maybe?) until packets are available. When packets arrive, the kernel interrupt context will signal that the reading context, which is sleeping on the queue, is once again ready to run. I am using 'context' here on purpose instead of 'thread'. Thus I should not require a separate kernel thread.
2. On the other hand, if read() does not relinquish control, then the entire kernel will be blocked.
Can anyone provide tips about how should I design my server?
What is the fallacy of the argument presented in point 1?
I'm not sure whether you need a raw socket at all in the kernel. Inside the kernel you can add a netfilter hook, or register something else (???) which will receive all packets; this might be what you want.
If you DID use a raw socket inside the kernel, then you'd probably need a kernel thread (i.e. one started by kernel_thread) to call read() on it. But it need not be a kernel thread; it could be a userspace thread which just makes a special syscall or device call to invoke the desired kernel-mode routine.
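For illustration, on modern kernels that kernel-thread approach would look roughly like this (the kthread API has since replaced kernel_thread; socket creation and error handling are elided):

```c
#include <linux/kthread.h>
#include <linux/net.h>
#include <linux/uio.h>
#include <net/sock.h>

static struct socket *raw_sock;     /* created earlier with sock_create_kern() */
static struct task_struct *reader;

static int reader_fn(void *data)
{
    char buf[2048];
    struct msghdr msg = {};
    struct kvec iov;

    while (!kthread_should_stop()) {
        iov.iov_base = buf;
        iov.iov_len = sizeof(buf);
        /* Blocks (sleeps) until a datagram arrives; only this kernel
           thread sleeps, the rest of the kernel keeps running. */
        int len = kernel_recvmsg(raw_sock, &msg, &iov, 1, sizeof(buf), 0);
        if (len > 0) {
            /* ...process the packet in buf... */
        }
    }
    return 0;
}

/* in module init, after creating raw_sock: */
/* reader = kthread_run(reader_fn, NULL, "rawsock-reader"); */
```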
If you have a hook registered, the context it's called in is probably something which should not do too much processing; I don't know exactly what that is likely to be, it may be a "bottom half handler" or "tasklet", whatever they are (these types of control structures keep changing from one version to another). I hope it's not actually an interrupt service routine.
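For reference, registering such a hook looks roughly like this on recent kernels; the hook signature has changed several times between versions, so treat this as a sketch rather than version-portable code. The hook runs in softirq context, so it must not sleep:

```c
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <net/net_namespace.h>

static unsigned int my_hook(void *priv, struct sk_buff *skb,
                            const struct nf_hook_state *state)
{
    /* Inspect or steal the packet here, but keep it short and do not
       sleep; defer heavy work to a thread or workqueue. */
    return NF_ACCEPT;   /* let the packet continue up the stack */
}

static struct nf_hook_ops my_ops = {
    .hook     = my_hook,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_PRE_ROUTING,   /* see every inbound IP datagram */
    .priority = NF_IP_PRI_FIRST,
};

static int __init my_init(void)
{
    return nf_register_net_hook(&init_net, &my_ops);
}

static void __exit my_exit(void)
{
    nf_unregister_net_hook(&init_net, &my_ops);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");
```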
In answer to your original question:
Yes, sys_read will block the calling thread, whether it's a kernel thread or a userspace one. The system will not hang. However, if the calling thread is not in a state where blocking makes sense, the kernel will panic (scheduling in interrupt or something)
Yes, you will need to do this in a separate thread; no, it won't hang the system. However, making system calls in kernel mode is very iffy, although it does work (sort of).
But if you installed some kind of hook instead, you wouldn't need to do any of that.
I think your best bet might be to emulate the way drivers are written, think of your server as a virtual device sitting on top of the ones that the requests are coming from. Example: a mouse driver accepts continuous input, but doesn't lock the system if programmed correctly, and a network adapter is probably more similar to your case.
