I understand that it means an I/O function that could block indefinitely instead returns immediately. My question is, how does it do that? What happens if the function has to return immediately, but the I/O device is not available? Obviously it can't return immediately with the results of the I/O operation, because the operation hasn't had a chance to execute, so it has to do one of two things: either (1) return now with a result indicating failure, or (2) return control to the main program temporarily and perform the I/O operation concurrently with the main program, then return again when the I/O is completed. Which of these is it? What is the exact procedure followed? None of the sources I've been able to find clarify this point.
An I/O function delegates its operation to the OS kernel. In general, these operations are asynchronous: the OS instructs a peripheral device to perform an operation, and eventually receives an interrupt from the device, indicating success or failure. In the meantime, the OS does many other things, including allowing user programs to run.
When an I/O operation is blocking for the user, then this means that the OS will not schedule CPU time for that user process until it has received the completion interrupt from the hardware. It then looks as if the function returned only after completion. In reality, it is ready to return immediately. It is only the OS that keeps the user process in a waiting state until the underlying hardware request has completed.
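The blocking case can be observed from user space. This is a minimal sketch in Python, whose os module wraps the POSIX calls fairly directly. The pipe, the writer thread, and the 0.2 s delay are assumptions chosen for illustration; they stand in for a slow device.

```python
import os
import threading
import time

def blocking_read_demo(delay=0.2):
    """Block in read() on an empty pipe until a writer thread supplies data."""
    r, w = os.pipe()                      # pipes are in blocking mode by default

    def writer():
        time.sleep(delay)                 # stands in for a slow device
        os.write(w, b"done")

    t = threading.Thread(target=writer)
    t.start()
    start = time.monotonic()
    data = os.read(r, 16)                 # the calling thread is parked here
    waited = time.monotonic() - start
    t.join()
    os.close(r)
    os.close(w)
    return data, waited

data, waited = blocking_read_demo()
print(data, f"read() returned after about {waited:.2f}s")
```

From the program's point of view, read() simply takes a long time. The thread was not given CPU time until the data it was waiting for had been written.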
When an I/O operation is non-blocking for the user, the OS lets the user process continue immediately after it has initiated the corresponding hardware operation. It is then necessary to establish a notification mechanism so that the user process learns when the operation completes. Details on how this is done vary from OS to OS.
Addendum:
In POSIX, non-blocking means that if a request cannot be fulfilled immediately (e.g. you want to read something but data has not yet been received), then you get an error status (EAGAIN or EWOULDBLOCK). It is then up to you to re-issue the request later.
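A minimal sketch of that POSIX behaviour, again in Python. The pipe is just a convenient descriptor that has no data yet, and try_read is a hypothetical helper name; Python reports EAGAIN/EWOULDBLOCK as BlockingIOError.

```python
import errno
import os

def try_read(fd, n=16):
    """Return the bytes read, or None if the read would have blocked."""
    try:
        return os.read(fd, n)
    except BlockingIOError as exc:        # raised for EAGAIN / EWOULDBLOCK
        assert exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
        return None

r, w = os.pipe()
os.set_blocking(r, False)                 # sets O_NONBLOCK on the read end

first = try_read(r)                       # nothing written yet: error status
os.write(w, b"hello")
second = try_read(r)                      # request re-issued later: succeeds
os.close(r)
os.close(w)
print(first, second)                      # None b'hello'
```

So of the two options in the question, POSIX non-blocking mode is option (1): the call returns at once with a failure status, and nothing is performed in the background on the caller's behalf.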
Consider the situation where you issue a read from the disk (an I/O operation). Then what is the exact mechanism that the OS uses to get to know whether the operation has been executed?
Then what is the exact mechanism that the OS uses to get to know whether the operation has been executed?
The exact mechanism depends on the specific hardware (and OS and scenario); but typically when a device finishes doing something the device triggers an IRQ that causes the CPU to interrupt whatever it was doing and switch to a device driver's interrupt handler.
Often the device driver ends up maintaining a queue or buffer of pending commands, so that when its interrupt handler is executed (telling it that a previous command has completed) it takes the next pending command and tells the device to start it. Often this also includes some kind of I/O priority scheme, where the driver can ask the device to do more important work sooner (while less important work is postponed and remains pending).
A device driver is typically also tied to the scheduler in some way: a normal thread in user space might (directly or indirectly, e.g. via the file system) request that data be transferred, and the scheduler will be told not to give that thread CPU time because it is blocked waiting for something. Later, when the transfer completes, the device driver's interrupt handler tells the scheduler that the requesting thread can continue, so it is unblocked and can be given CPU time again.
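The interplay described above (pending queue, completion interrupt, unblocking the requester) can be modelled in a few lines. This is a toy user-space simulation, not real driver code: FakeDriver, submit, and interrupt are hypothetical names, a threading.Event stands in for the scheduler's blocked/runnable state, and the "IRQ" is just a method call made by the main thread.

```python
import collections
import threading

class FakeDriver:
    """Toy model: one command in flight, the rest pending in a FIFO queue."""

    def __init__(self):
        self.pending = collections.deque()
        self.in_flight = None
        self.lock = threading.Lock()
        self.started = []                  # order in which the "device" got commands

    def _start_next(self):
        # Caller holds self.lock. Hand the next pending command to the device.
        self.in_flight = self.pending.popleft() if self.pending else None
        if self.in_flight:
            self.started.append(self.in_flight[0])

    def submit(self, name):
        """Queue a command and return an Event the requester can sleep on."""
        done = threading.Event()
        with self.lock:
            self.pending.append((name, done))
            if self.in_flight is None:
                self._start_next()
        return done

    def interrupt(self):
        """The device reports that the in-flight command has completed."""
        with self.lock:
            _, done = self.in_flight
            self._start_next()             # keep the device busy
        done.set()                         # unblock the thread that asked for it

driver = FakeDriver()
events = {name: driver.submit(name) for name in ("read A", "read B")}
woken = []

def requester(name):
    events[name].wait()                    # blocked until the completion "IRQ"
    woken.append(name)

threads = [threading.Thread(target=requester, args=(n,)) for n in events]
for t in threads:
    t.start()
driver.interrupt()                         # "read A" done; driver starts "read B"
threads[0].join()
driver.interrupt()                         # "read B" done
threads[1].join()
print(driver.started, woken)
```

Neither requester thread polls anything. Each one sleeps until the completion handler for its own command marks it runnable again.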
As mentioned in the man page of signal(7),
Interruption of system calls and library functions by signal handlers
If a signal handler is invoked while a system call or library function call is blocked, then either:
* the call is automatically restarted after the signal handler returns; or
* the call fails with the error EINTR.
Which of these two behaviors occurs depends on the interface and whether or not the signal handler was established using the SA_RESTART flag (see sigaction(2)). The details vary across UNIX systems; below, the details for Linux.
If a blocked call to one of the following interfaces is interrupted by a signal handler, then the call will be automatically restarted after the signal handler returns if the SA_RESTART flag was used; otherwise the call will fail with the error EINTR:
* read(2), readv(2), write(2), writev(2), and ioctl(2) calls on "slow" devices. A "slow" device is one where the I/O call may block for an indefinite time, for example, a terminal, pipe, or socket. If an I/O call on a slow device has already transferred some data by the time it is interrupted by a signal handler, then the call will return a success status (normally, the number of bytes transferred). Note that a (local) disk is not a slow device according to this definition; I/O operations on disk devices are not interrupted by signals.
Since a blocked call to one of these interfaces (read, write) can be interrupted by a signal handler, and is automatically restarted after the handler returns if the SA_RESTART flag was used, a process blocked in a read/write system call must be in the TASK_INTERRUPTIBLE state.
But when I tried to find out which blocked system calls put a process in the TASK_UNINTERRUPTIBLE state, I found https://unix.stackexchange.com/questions/62697/why-is-i-o-uninterruptible and "Why doing I/O in Linux is uninterruptible?", and both say that a blocked I/O call (read, write) puts a process in TASK_UNINTERRUPTIBLE.
It is also mentioned here: https://access.redhat.com/sites/default/files/attachments/processstates_20120831.pdf
The Uninterruptible state is mostly used by device drivers waiting for disk or network I/O. When the process is sleeping uninterruptibly, signals accumulated during the sleep are noticed when the process returns from the system call or trap. In Linux systems, the command ps -l uses the letter D in the state field (S) to indicate that the process is in an Uninterruptible sleep state. In that case, the process state flag is set as follows:
p->state = TASK_UNINTERRUPTIBLE
LEARN MORE: Read more about D states in the Red Hat Knowledgebase:
https://access.redhat.com/knowledge/solutions/59989/
It's kind of confusing.
I would also like to know which other blocking system calls can put a process in the TASK_UNINTERRUPTIBLE state.
With read(2) or write(2) family syscalls, the type of sleep depends on the type of file that is being accessed. In the documentation you quoted, "slow" devices are those where a read/write will sleep interruptibly, and "fast" devices are those which will sleep uninterruptibly (the uninterruptible sleep state is named D for "Disk Wait", since originally read/write on disk files was the most common reason for this type of sleep).
Note that "blocking" technically refers only to interruptible sleep.
Almost any system call can enter uninterruptible sleep, because this can happen (among other things) when a process needs to acquire a lock protecting an internal kernel resource. Usually, this sort of uninterruptible sleep is so short-lived that you will not notice it.
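The interruptible sleep on a "slow" device can be observed from user space. The sketch below is only an illustration under a few assumptions: a pipe serves as the slow device, SIGALRM and the 0.1 s timer are arbitrary choices, and Interrupted is a made-up exception name. CPython installs its handlers without SA_RESTART and retries an interrupted call itself unless the handler raises (PEP 475), so raising from the handler is how the EINTR case is made visible here. The code has to run in the main thread, because Python only allows signal handlers to be installed there.

```python
import os
import signal

class Interrupted(Exception):
    """Raised by the handler so the pending read is abandoned, not retried."""

def on_alarm(signum, frame):
    raise Interrupted()

r, w = os.pipe()                           # w stays open, so read() blocks instead of seeing EOF
signal.signal(signal.SIGALRM, on_alarm)
signal.setitimer(signal.ITIMER_REAL, 0.1)  # deliver SIGALRM after 0.1 s

try:
    os.read(r, 16)                         # interruptible sleep on a "slow" device
    outcome = "read completed"
except Interrupted:
    outcome = "read interrupted by signal"
finally:
    signal.setitimer(signal.ITIMER_REAL, 0)        # cancel the timer
    signal.signal(signal.SIGALRM, signal.SIG_DFL)  # restore the default action
    os.close(r)
    os.close(w)
print(outcome)
```

A read on a local disk file would not behave this way according to the quoted man page, since that sleep is uninterruptible; it is also usually too short to catch with a timer.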
From what I've read here, the Go scheduler will automatically determine if a goroutine is blocking on I/O, and will automatically switch to processing other goroutines on a thread that isn't blocked.
What I'm wondering is how the scheduler then figures out that that goroutine has stopped blocking on I/O.
Does it just do some kind of polling every so often to check if it's still blocking? Is there some kind of background thread running that checks the status of all goroutines?
For example, if you were to do an HTTP GET request inside a goroutine that took 5s to get a response, it would block while waiting for the response, and the scheduler would switch to processing another goroutine. Now given that, when the server returns a response, how does the scheduler understand that the response has arrived, and it's time to go back to the goroutine that made the GET so that it can process the result of the GET?
All I/O must be done through syscalls, and the way syscalls are implemented in Go, they are always called through code that is controlled by the runtime. This means that when you call a syscall, instead of just calling it directly (thus giving up control of the thread to the kernel), the runtime is notified of the syscall you want to make, and it does it on the goroutine's behalf. This allows it to, for example, do a non-blocking syscall instead of a blocking one (essentially telling the kernel, "please do this thing, but instead of blocking until it's done, return immediately, and let me know later once the result is ready"). This allows it to continue doing other work in the meantime.
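The idea of "do a non-blocking call, park the task, and resume it when the kernel reports readiness" can be sketched with generators and the selectors module. This is a toy analogy in Python and not how the Go runtime is actually written: run, fetcher, and server are hypothetical names, and a task yields a socket to mean "park me until this is readable".

```python
import collections
import selectors
import socket

def run(tasks):
    """Run generator tasks; yielding a socket parks the task until it is readable."""
    sel = selectors.DefaultSelector()
    ready = collections.deque(tasks)
    parked = 0
    while ready or parked:
        if not ready:
            # Nothing runnable: ask the kernel which parked tasks can continue.
            for key, _ in sel.select():
                sel.unregister(key.fileobj)
                ready.append(key.data)
                parked -= 1
            continue
        task = ready.popleft()
        try:
            sock = next(task)              # run until the task has to wait for I/O
        except StopIteration:
            continue                       # task finished
        sel.register(sock, selectors.EVENT_READ, task)
        parked += 1
    sel.close()

log = []
a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

def fetcher():
    log.append("fetcher: waiting for response")
    yield a                                # would block, so hand control back
    log.append("fetcher: got " + a.recv(16).decode())

def server():
    log.append("server: sending response")
    b.send(b"pong")
    yield from ()                          # makes this a generator that ends at once

run([fetcher(), server()])
a.close()
b.close()
print(log)
```

There is no periodic polling of individual tasks here. The scheduler asks the kernel once, for all parked tasks together, and the kernel's answer says which of them may resume.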
I understand that async I/O operations via select() and poll() do not use processor time, i.e. it's not a busy loop, but then how are these really implemented under the hood? Is it supported in hardware somehow, and is that why there is not much apparent processor cost for using these?
It depends on what the select/poll is waiting for. Let's consider a few cases; I'm going to assume a single-core machine for simplification.
First, consider the case where the select is waiting on another process (for example, the other process might be carrying out some computation and then outputs the result through a pipeline). In this case the kernel will mark your process as waiting for input, and so it will not provide any CPU time to your process. When the other process outputs data, the kernel will wake up your process (give it time on the CPU) so that it can deal with the input. This will happen even if the other process is still running, because modern OSes use preemptive multitasking, which means that the kernel will periodically interrupt processes to give other processes a chance to use the CPU ("time-slicing").
The picture changes when the select is waiting on I/O; network data, for example, or keyboard input. In this case, while archaic hardware would have to spin the CPU waiting for input, all modern hardware can put the CPU itself into a low-power "wait" state until the hardware provides an interrupt - a specially handled event that the kernel handles. In the interrupt handler the CPU will record the incoming data and after returning from the interrupt will wake up your process to allow it to handle the data.
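That the wait costs almost no CPU can be checked by comparing wall-clock time with the process's CPU time around a select call. A minimal sketch, assuming a pipe that nobody ever writes to and an arbitrary 0.3 s timeout:

```python
import os
import select
import time

r, w = os.pipe()

wall_start = time.monotonic()
cpu_start = time.process_time()
readable, _, _ = select.select([r], [], [], 0.3)   # nothing is written: times out
wall = time.monotonic() - wall_start
cpu = time.process_time() - cpu_start

os.close(r)
os.close(w)
print(f"ready={readable} wall={wall:.3f}s cpu={cpu:.3f}s")
```

The wall-clock figure should be close to the timeout while the CPU figure stays near zero, because the process is not scheduled at all while it waits.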
There is no special hardware support. Well, there is some, but it is nothing special, and it depends on what kind of file descriptor you are watching. If a device driver is involved, the implementation depends on the driver and/or the device. Take sockets, for example. If you wait for some data to read, this sequence of events takes place:
1. Some process calls the poll()/select()/epoll() system call to wait for data on a socket. There is a context switch from user mode to the kernel.
2. The NIC interrupts the processor when a packet arrives. The interrupt routine in the driver pushes the packet onto the back of a queue.
3. A kernel thread takes data from that queue and wakes up the network code inside the kernel to process the packet.
4. When the packet has been processed, the kernel determines which socket was expecting it, saves the data in the socket buffer, and returns from the system call back to user space.
This is just a very brief description; a lot of details are missing, but I think it is enough to get the point.
Another example, where no drivers are involved, is a Unix socket. If you wait for data from one of them, the waiting process is added to a list. When the process on the other side of the socket writes data, the kernel checks that list and step 4 applies again.
I hope it helps. I think that examples are the best way to understand it.
In Linux, when you make a blocking i/o call like read or accept, what actually happens?
My thoughts: the process gets taken out of the run queue and put into a waiting or blocking state on some wait queue. Then when a TCP connection is made (for accept) or the hard drive is ready (for a file read), a hardware interrupt is raised which lets the waiting processes wake up and run (in the case of a file read, how does Linux know which processes to awaken, as there could be lots of processes waiting on different files?). Or perhaps instead of hardware interrupts, the individual process itself polls to check availability. Not sure, help?
Each Linux device seems to be implemented slightly differently, and the preferred way seems to vary every few Linux releases as safer/faster kernel features are added, but generally:
* The device driver creates read and write wait queues for a device.
* Any process thread wanting to wait for i/o is put on the appropriate wait queue. When an interrupt occurs the handler wakes up one or more waiting threads. (Obviously the threads don't run immediately as we are in interrupt context, but are added to the kernel's scheduling queue).
* When scheduled by the kernel the thread checks to see if conditions are right for it to proceed - if not it goes back on the wait queue.
A typical example (slightly simplified):
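In the driver at initialisation:
    init_waitqueue_head(&readers_wait_q);
In the read function of a driver:
    /* Non-blocking caller and nothing to read: fail instead of sleeping. */
    if (read_avail == 0 && (filp->f_flags & O_NONBLOCK))
    {
        return -EAGAIN;
    }
    /* Sleep until data arrives; the macro takes the queue head itself, not a pointer. */
    if (wait_event_interruptible(readers_wait_q, read_avail != 0))
    {
        /* signal interrupted the wait, return */
        return -ERESTARTSYS;
    }
    to_copy = min(user_max_read, read_avail);
    /* copy_to_user() returns the number of bytes it could NOT copy. */
    if (copy_to_user(user_buf, read_ptr, to_copy))
    {
        return -EFAULT;
    }
Then the interrupt handler just issues:
    wake_up_interruptible(&readers_wait_q);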
Note that wait_event_interruptible() is a macro that hides a loop: it checks the condition (read_avail != 0 in this case) and, if the thread is woken while the condition is still false, puts it back to sleep on the wait queue.
As mentioned there are a number of variations - the main one is that if there is potentially a lot of work for the interrupt handler to do then it does the bare minimum itself and defers the rest to a work queue or tasklet (generally known as the "bottom half") and it is this that would wake the waiting threads.
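The same wait-queue pattern can be tried out in user space. The following is an analogy in Python rather than kernel code: ToyDevice is a hypothetical class, a threading.Condition plays the role of readers_wait_q, and a BlockingIOError plays the role of -EAGAIN for a non-blocking caller with nothing to read.

```python
import threading

class ToyDevice:
    """User-space stand-in for the driver: a buffer plus a wait queue of readers."""

    def __init__(self):
        self.buf = bytearray()
        self.readers = threading.Condition()     # plays the role of readers_wait_q

    def read(self, max_bytes, nonblock=False):
        with self.readers:
            if not self.buf and nonblock:
                raise BlockingIOError("no data available")   # like -EAGAIN
            while not self.buf:                  # re-check after every wake-up
                self.readers.wait()              # sleep on the wait queue
            data = bytes(self.buf[:max_bytes])
            del self.buf[:max_bytes]
            return data

    def interrupt(self, incoming):
        """Plays the interrupt handler: store the data, then wake the waiters."""
        with self.readers:
            self.buf += incoming
            self.readers.notify_all()            # like wake_up_interruptible()

dev = ToyDevice()
try:
    dev.read(4, nonblock=True)
    nonblocking_result = "data"
except BlockingIOError:
    nonblocking_result = "would block"

got = []
reader = threading.Thread(target=lambda: got.append(dev.read(4)))
reader.start()
dev.interrupt(b"abcdef")
reader.join()
print(nonblocking_result, got, bytes(dev.buf))
```

The while loop around wait() is the part that wait_event_interruptible() hides: a woken thread always re-checks the condition, because another reader may already have consumed the data.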
See the Linux Device Drivers book for more details - pdf available here:
http://lwn.net/Kernel/LDD3
Effectively, the call only returns when the file is ready to read, when data is available on a socket, when a connection has arrived, and so on.
To make sure it can return immediately, you probably want to use the select system call to find a ready file descriptor first.
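A short sketch of that approach in Python: ask select which descriptors are ready, then read only from those. The three pipes are an arbitrary setup for the example, and a timeout of 0 makes select itself return immediately.

```python
import os
import select

pipes = [os.pipe() for _ in range(3)]          # three (read_fd, write_fd) pairs
read_fds = [r for r, _ in pipes]
os.write(pipes[1][1], b"ready")                # only the second pipe has data

ready, _, _ = select.select(read_fds, [], [], 0)   # timeout 0: check, do not wait
chunks = {read_fds.index(fd): os.read(fd, 16) for fd in ready}

for r, w in pipes:
    os.close(r)
    os.close(w)
print(chunks)                                  # {1: b'ready'}
```

Because read is called only on descriptors that select reported as readable, none of the reads has to wait.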
Read this: http://www.minix3.org/doc/
It's a very clear, very easy-to-understand explanation. It generally applies to Linux as well.