Disk I/O queue overflow - Linux

From what I understand, the disk device has a queue that stores read/write requests from the Linux kernel. What happens when the device doesn't drain the queue fast enough (i.e. it overflows)?
Does this queue extend (logically) into DRAM?
Can some requests be lost?

Does this queue extend (logically) into DRAM?
Where do you think that queue is in the first place? It's in RAM.
The IO buffering infrastructure of any operating system can only serve the purpose of avoiding blocking whatever program tries to do an IO operation.
E.g. imagine you have a program that writes data to a file. To do that, it calls the write system call. In the operating system, that goes to the file system driver, which decides which disk sectors get changed.
Now, that change command goes to the I/O subsystem, which puts the command in a queue. If that queue is full, the file system call blocks, i.e. it doesn't complete until there is space in the queue, which means that the write call blocks.
Very simple: for as long as the device you're writing to doesn't keep up, your writing program gets stopped in the write call. That's pretty logical. It's like trying to push mail into a full postbox. Until someone takes out the mail at the other end, you can't push in new mail, so the postman will have to wait.
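The postbox behaviour can be imitated in user space. The following is only a sketch of the idea, not how the kernel implements its request queue: a bounded `queue.Queue` stands in for the I/O queue, a hypothetical `device` thread drains it slowly, and the producer's `put()` blocks whenever the queue is full. The names `run_demo`, `queue_depth` and `service_time` are made up for the illustration.

```python
import queue
import threading
import time


def run_demo(queue_depth=2, requests=6, service_time=0.05):
    """Push `requests` items through a bounded queue drained by a slow 'device'."""
    q = queue.Queue(maxsize=queue_depth)   # a full queue makes put() block
    completed = []

    def device():
        # Hypothetical slow device: services one request at a time.
        while True:
            item = q.get()
            if item is None:               # sentinel: no more requests
                return
            time.sleep(service_time)       # pretend to perform the I/O
            completed.append(item)

    worker = threading.Thread(target=device)
    worker.start()

    start = time.monotonic()
    for i in range(requests):
        q.put(i)                           # blocks while the queue is full
    submit_time = time.monotonic() - start

    q.put(None)
    worker.join()
    return completed, submit_time
```

Every request ends up in `completed`, in order; none is dropped. The producer simply spends most of `submit_time` waiting inside `put()`, which is the analogue of a program being stopped in its write call.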

The queue doesn't extend to RAM. There's a disk cache with dirty pages. The OS really would like to write those to disk. Some programs may even block while they're waiting for their dirty pages to be written. And as programs get blocked, they stop writing further data to disk. Pretty self-limiting, actually.

Related

NodeJS is reading the same file blocking?

We have a backend expressjs server that will read off of the disk for many files whenever a front-end client connects.
At the OS level, are these reads blocking?
I.E., if two people connect at the same time, will whoever gets scheduled second have to wait to read the file until the first person who is currently reading it finishes?
We are just using fs.readFile to read files.
EDIT: I'm implementing caching anyway (it's a legacy codebase, don't hate me), I'm just curious if these reads are blocking and this might improve response time from not having to wait until the file is free to read.
fs.readFile() is not blocking for nodejs. It's a non-blocking, asynchronous operation. While one fs.readFile() operation is in progress, other nodejs code can run.
If two fs.readFile() calls are in operation at the same time, they will both proceed in parallel.
Nodejs itself uses a native OS thread pool with a default size of 4 for file operations so it will support up to 4 file operations in parallel. Beyond 4, it queues the next operation so when one of the 4 finishes, then the next one in line will start to execute.
Within the OS, it will time slice these different threads to achieve parallel operation. But, at the disk controller itself for a spinning drive, only one particular read operation can be occurring at once because the disk head can only be on one track at a given time. So, the underlying read operations reading from different parts of a spinning disk will eventually be serialized at the disk controller as it moves the disk head to read from a given track.
But, if two separate reads are trying to read from the same file, the OS will typically cache that info so the 2nd read won't have to read from the disk again, it will just get the data from an OS cache.
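The "pool of 4, queue the rest" behaviour can be imitated outside Node. The sketch below is a Python analogue, not libuv's actual implementation: `ThreadPoolExecutor` plays the role of the thread pool, and the helper names (`make_sample_file`, `read_many`) and the artificial `time.sleep` delay are assumptions made for the illustration.

```python
import os
import tempfile
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def make_sample_file(content):
    """Create a temporary file holding `content` and return its path."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(content)
        return f.name


def read_many(paths, pool_size=4):
    """Read all `paths` through a fixed-size pool; report peak concurrency."""
    active = 0
    peak = 0
    lock = threading.Lock()

    def read_file(path):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        try:
            time.sleep(0.02)               # stand-in for a slow disk
            with open(path, "rb") as f:
                return f.read()
        finally:
            with lock:
                active -= 1

    # At most `pool_size` reads run at once; the rest wait in the pool's queue.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(read_file, paths))
    return results, peak
```

Submitting ten reads of the same file returns ten identical results, and `peak` never exceeds the pool size. The caller that submitted the work is free to do other things while the pool is busy.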
I inherited this codebase and am going to implement some caching anyway, but was just curious if caching would also improve response time since we would be reading from non-blocking process memory instead of (potentially) blocking filesystem memory.
OS file caching is heavily, heavily optimized (it's a problem operating systems have spent decades working on). Implementing my own level of caching on top of the OS isn't where I would think you'd find the highest bang for the buck for improving performance. While there may be a temporary lock used in the OS file cache, that lock would only exist for the duration of a memory copy from cache to target read location which is really, really short. Probably not something anything would notice. And, that temporary lock is not blocking nodejs at all.

Will non-blocking I/O be put to sleep during copying data from kernel to user?

I ask this question because I am looking at multiplexing I/O in Go, which is using epollwait.
When a socket is ready, a goroutine is woken up and begins to read the socket in non-blocking mode. If the read system call is still blocked while copying data from kernel to user space, I assume the kernel thread the goroutine is attached to will be put to sleep as well.
I am not sure of that, hoping someone can help correct me if I am wrong.
I fail to quite parse what you've written.
I'll make a guess and conjecture that you might be overlooking the fact that the write(2) and read(2) syscalls (and those of their ilk such as send(2) and recv(2)) on sockets put into non-blocking mode are free to consume (and return, respectively) less data than requested.
In other words, a write(2) call on a non-blocking socket told to write 1 megabyte of data will consume only as much data as currently fits into the associated kernel buffer and return immediately, signalling how much it consumed. The next immediate call to write(2) will likely fail with EWOULDBLOCK.
The same goes for the read(2) call: if you pass it a buffer large enough to hold 1 megabyte of data, and tell it to read that number of bytes, the call will only drain the contents of the kernel buffer and return immediately, signalling how much data it actually copied. The next immediate call to read(2) will likely fail with EWOULDBLOCK.
So, any attempt to get or put data to the socket succeeds almost immediately: either after the data has been shoveled between the kernel's buffer and user space, or right away, with the EAGAIN error (EWOULDBLOCK names the same condition; on Linux the two constants have the same value).
Sure, there's supposedly a possibility for an OS thread to be suspended right in the middle of performing such a syscall, but this does not count as "blocking in a syscall."
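The short-write behaviour described above is easy to observe. Here is a minimal sketch, assuming a Unix-like system where `socket.socketpair()` is available; the function name `fill_then_drain` and the 1 MiB offer size are arbitrary choices for the illustration. Python surfaces EAGAIN/EWOULDBLOCK as `BlockingIOError`.

```python
import socket


def fill_then_drain(offer=1 << 20):
    """Offer `offer` bytes per send() on a non-blocking socket until it refuses."""
    writer, reader = socket.socketpair()
    writer.setblocking(False)
    reader.setblocking(False)
    payload = b"x" * offer
    accepted = []                          # bytes consumed by each send()
    try:
        while True:
            try:
                # May consume fewer bytes than offered; never sleeps.
                accepted.append(writer.send(payload))
            except BlockingIOError:        # EAGAIN / EWOULDBLOCK: buffer full
                break

        drained = 0
        while True:
            try:
                chunk = reader.recv(65536) # returns whatever is buffered
            except BlockingIOError:        # nothing left to read right now
                break
            drained += len(chunk)
        return accepted, drained
    finally:
        writer.close()
        reader.close()
```

Each entry in `accepted` is how much one `send()` actually consumed, and the loop ends on the first `BlockingIOError` rather than putting the thread to sleep. The amount drained on the other side equals the total accepted.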
Update to the original answer in response to the following comment of the OP:
<…>
This is what I see in book
"UNIX Network Programming" (Volume 1, 3rd), chapter 6.2:
A synchronous I/O operation causes the requesting process
to be blocked until that I/O operation completes. Using these
definitions, the first four I/O models—blocking, nonblocking, I/O
multiplexing, and signal-driven I/O—are all synchronous because the
actual I/O operation (recvfrom) blocks the process.
It uses "blocks" to describe nonblocking I/O operation. That makes me confused.
I still don't understand why the book uses "blocks the process" if the process is actually not blocked.
I can only guess that the book's author intended to highlight that the process is indeed blocked from the moment it enters a syscall until it returns from it. Reads from and writes to a non-blocking socket do block to transfer the data, if available, between the kernel and the user space. We colloquially say this does not block because we mean "it does not block waiting and doing nothing for an indeterminate amount of time".
The book's author might contrast this to the so-called asynchronous I/O (called "overlapping" on Windows™)—where you basically give the kernel a buffer with/for data and ask it to do away with it completely in parallel with your code—in the sense the relevant syscall returns right away and the I/O is carried out in background (with regard to your user-space code).
To my knowledge, Go does not use the kernel's async I/O facilities on any platform it supports. You might look there for the developments regarding Linux and its contemporary io_uring subsystem.
Oh, and one more point. The book might (at that point through the narrative at least) be discussing a simplified "classic" scheme where there are no in-process threads, and the sole unit of concurrency is the process (with a single thread of execution). In this scheme, any syscall obviously blocks the whole process. In contrast, Go works only on kernels which support threads, so in a Go program a syscall never blocks the whole process—only the thread it's called on.
Let me take yet another stab at explaining the problem as—I perceive—the OP stated it.
The problem of serving multiple client requests is not new—one of the more visible first statements of it is "The C10k problem".
To quickly recap it, a single threaded server with blocking operations on the sockets it manages is only realistically able to handle a single client at a time.
To solve it, there exist two straightforward approaches:
Fork a copy of the server process to handle each incoming client connection.
On an OS which supports threads, fork a new thread inside the same process to handle each incoming client.
They have their pros and cons but they both suck with regard to resource usage, and—which is more important—they do not play well with the fact most clients have relatively low rate and bandwidth of I/O they perform with regard to the processing resources available on a typical server.
In other words, when serving a typical TCP/IP exchange with a client, the serving thread most of the time sleeps in the write(2) and read(2) calls on the client socket.
This is what most people mean when talking about "blocking operations" on sockets: if a socket is blocking, an operation on it will block until it can actually be carried out, and the originating thread will be put to sleep for an indeterminate amount of time.
Another important thing to note is that when the socket becomes ready, the amount of work done is typically minuscule compared to the amount of time slept between the wakeups.
While the thread sleeps, its resources (such as memory) are effectively wasted, as they cannot be used to do any other work.
Enter "polling". It combats the problem of wasted resources by noticing that the points of readiness of networked sockets are relatively rare and far between, so it makes sense to have lots of such sockets be served by a single thread: this keeps the thread almost as busy as theoretically possible, and also allows scaling out when needed: if a single thread is unable to cope with the data flow, add another thread, and so on.
This approach is definitely cool but it has a downside: the code which reads and writes data must be rewritten to use callback style instead of the original plain sequential style. Writing with callbacks is hard: you usually have to implement intricate buffer management and state machines to deal with this.
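To make the "one thread, many sockets, callbacks" shape concrete, here is a small sketch using Python's `selectors` module (which wraps epoll on Linux). It is a toy under stated assumptions: the connections are local socket pairs rather than real network clients, each request fits in a single `recv()`, and the names `serve_with_polling` and `on_readable` are invented for the example.

```python
import selectors
import socket


def serve_with_polling(messages):
    """Serve every connection in `messages` (name -> bytes) from one thread."""
    sel = selectors.DefaultSelector()
    replies = {}
    client_ends = []

    def on_readable(conn, name):
        # Callback style: invoked only after the selector reports readiness,
        # so recv() returns immediately instead of sleeping.
        replies[name] = conn.recv(4096).upper()
        sel.unregister(conn)
        conn.close()

    for name, payload in messages.items():
        client, server = socket.socketpair()
        server.setblocking(False)
        sel.register(server, selectors.EVENT_READ, (on_readable, name))
        client.sendall(payload)            # hypothetical client request
        client_ends.append(client)

    # The single thread sleeps only in select(), never on one particular socket.
    while len(replies) < len(messages):
        events = sel.select(timeout=1.0)
        if not events:
            break                          # nothing became ready in time
        for key, _mask in events:
            callback, name = key.data
            callback(key.fileobj, name)

    for client in client_ends:
        client.close()
    sel.close()
    return replies
```

Even in this toy, the per-connection logic lives in a callback that must finish its work in one shot. A real protocol would need per-connection buffers and state to survive partial reads, which is the complexity the paragraph above complains about.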
The Go runtime solves this problem by adding another layer of scheduling for its execution flow units—goroutines: for goroutines, operations on the sockets are always blocking, but when a goroutine is about to block on a socket, this is transparently handled by suspending only the goroutine itself—until the requested operation will be able to proceed—and using the thread the goroutine was running on to do other work¹.
This gives the best of both approaches: the programmer can write classic no-brainer sequential callback-free networking code, yet the threads used to handle networking requests are fully utilized².
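Go code would be the natural illustration here. As a rough analogue only, the same "sequential-looking code on top of a poller" shape can be sketched with Python's `asyncio`, with the important difference that Python requires explicit `await` points where Go suspends goroutines transparently. The coroutine names and the use of local socket pairs instead of real clients are assumptions of this sketch.

```python
import asyncio
import socket


async def handle(reader, writer):
    # Reads like plain blocking code, but each await suspends only this
    # coroutine; the single OS thread goes on to run other coroutines.
    line = await reader.readline()
    writer.write(line.upper())
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def client(sock, message):
    reader, writer = await asyncio.open_connection(sock=sock)
    writer.write(message)
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return reply


async def main(messages):
    tasks = []
    for message in messages:
        server_end, client_end = socket.socketpair()
        server_streams = await asyncio.open_connection(sock=server_end)
        tasks.append(asyncio.create_task(handle(*server_streams)))
        tasks.append(asyncio.create_task(client(client_end, message)))
    results = await asyncio.gather(*tasks)
    # handle() returns None; keep only the clients' replies, in order.
    return [r for r in results if r is not None]
```

Running `asyncio.run(main([b"ping\n", b"pong\n"]))` serves both connections on one thread, and neither handler contains a callback or an explicit state machine.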
As to the original question of blocking, both the goroutine and the thread it runs on are indeed blocked when the data transfer on a socket is happening, but since what happens is data shoveling between a kernel and a user-space buffer, the delay is most of the time small, and is no different to the classic "polling" case.
Note that performing syscalls in Go, including I/O on non-pollable descriptors (at least up until, and including, Go 1.14), does block both the calling goroutine and the thread it runs on, but is handled differently from I/O on pollable descriptors: when a special monitoring thread notices a goroutine has spent more than a certain amount of time in a syscall (20 µs, IIRC), the runtime pulls the so-called "processor" (a runtime thing which runs goroutines on OS threads) from under the goroutine and tries to make it run another goroutine on another OS thread; if there is a goroutine wanting to run but no free OS thread, the Go runtime creates another one.
Hence "normal" blocking I/O is still blocking in Go in both senses: it blocks both goroutines and OS threads, but the Go scheduler makes sure the program as a whole is still able to make progress.
This could arguably be a perfect case for using true asynchronous I/O provided by the kernel, but it's not there yet.
¹ See this classic essay for more info.
² The Go runtime is certainly not the first one to pioneer this idea. For instance, look at the State Threads library (and the more recent libtask) which implement the same approach in plain C; the ST library has superb docs which explain the idea.

Linux splice() + kernel AIO when writing to disk

With kernel AIO and O_DIRECT|O_SYNC, there is no copying into kernel buffers and it is possible to get fine grained notification when data is actually flushed to disk. However, it requires data to be held in user space buffers for io_prep_pwrite().
With splice(), it is possible to move data directly to disk from kernel space buffers (pipes) without ever having to copy it around. However, splice() returns immediately after data is queued and does not wait for actual writes to the disk.
The goal is to move data from sockets to disk without copying it around while getting confirmation that it has been flushed out. How to combine both previous approaches?
By combining splice() with O_SYNC, I expect splice() to block and one has to use multiple threads to mask latency. Alternatively, one could use asynchronous io_prep_fsync()/io_prep_fdsync(), but this waits for all data to be flushed, not for a specific write. Neither is perfect.
What would be required is a combination of splice() with kernel AIO, allowing zero copy and asynchronous confirmation of writes, such that a single event driven thread can move data from sockets to the disk and get confirmations when required, but this doesn't seem to be supported. Is there a good workaround / alternative approach?
To get a confirmation of the writes, you can't use splice().
There's aio stuff in userspace, but if you were doing it in the kernel it might come to finding out which bio's (block I/O) are generated and waiting for those:
Block I/O structure:
http://www.makelinux.net/books/lkd2/ch13lev1sec3
If you want to use AIO, you will need to use io_getevents():
http://man7.org/linux/man-pages/man2/io_getevents.2.html
Here are some examples on how to perform AIO:
http://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt
If you do it from userspace and use msync it's still kind of up in the air if it is actually on spinning rust yet.
msync() docs:
http://man7.org/linux/man-pages/man2/msync.2.html
You might have to soften expectations in order to make it more robust, because it might be very expensive to actually be sure that the writes are physically written to disk.
The 'highest' typical standard for write assurance in light of something like power removal is a journal recording operation that modifies the storage. The journal itself is append only and you can see if entries are complete when you play it back. That very last journal entry may not be complete, so something may still be potentially lost.
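The "see if entries are complete when you play it back" idea can be sketched as follows. This is an illustration only, with an invented record format (4-byte length, 4-byte CRC32, payload) and an in-memory `bytearray` standing in for the journal file; a real journal would append to a file and flush each entry to stable storage.

```python
import struct
import zlib

# Hypothetical record header: payload length, CRC32 of the payload.
HEADER = struct.Struct("<II")


def append_entry(journal, payload):
    """Append one framed entry to the in-memory journal (a bytearray)."""
    journal += HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(journal):
    """Return (complete entries, number of valid bytes) from a journal image."""
    entries = []
    offset = 0
    while offset + HEADER.size <= len(journal):
        length, crc = HEADER.unpack_from(journal, offset)
        start = offset + HEADER.size
        payload = bytes(journal[start:start + length])
        if len(payload) < length or zlib.crc32(payload) != crc:
            break                          # torn or corrupt tail: stop here
        entries.append(payload)
        offset = start + length
    return entries, offset
```

If the image is cut off in the middle of the last entry, as after a power loss, `replay()` returns every earlier entry plus the offset where valid data ends. Only the final, incomplete entry is lost.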

Transferring data between process calls

I have a Linux process that is being called numerous times, and I need to make this process as fast as possible.
The problem is that I must maintain a state between calls (load data from previous call and store it for the next one), without running another process / daemon.
Can you suggest fast ways to do so? I know I can use files for I/O, and would like to avoid it, for obvious performance reasons. Should (can?) I create a named pipe to read/write from and by that avoid real disk I/O?
Pipes aren't appropriate for this. Use POSIX shared memory or a POSIX message queue if you are absolutely sure files are too slow, which you should test first.
In the shared memory case your program creates the segment with shm_open() if it doesn't exist or opens it if it does. You mmap() the memory and make whatever changes and exit. You only shm_unlink() when you know your program won't be called anymore and no longer needs the shared memory.
With message queues, just set up the queue. Your program reads the queue, makes whatever changes, writes the queue and exits. Call mq_unlink() when you no longer need the queue.
Both methods have kernel persistence so you lose the shared memory and the queue on a reboot.
It sounds like you have a process that is continuously executed by something.
Why not create a factory that spawns the worker threads?
The factory could provide the workers with any information needed.
... I can use files for I/O, and would like to avoid it, for obvious performance reasons.
I wonder what these reasons are...
Linux caches files in kernel memory in the page cache. Writes go to the page cache first; in other words, a write() syscall is a kernel call that only copies the data from user space to the page cache (it is a bit more complicated when the system is under stress). Some time later the kernel's writeback threads (pdflush on older kernels) write the data to disk asynchronously.
File read() first checks the page cache to see if the data is already available in memory to avoid a disk read. What it means is that if one program writes data to files and another program reads it, these two programs are effectively communicating via kernel memory as long as the page cache keeps those files.
If you want to avoid disk writes entirely, that is, the state does not need to be persisted across OS reboots, those files can be put in /dev/shm or in /tmp, which are normally the mount points of in-memory filesystems.
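A minimal sketch of keeping state between invocations in an in-memory filesystem might look like this. It assumes `/dev/shm` is a writable tmpfs (typical on Linux) and otherwise falls back to the system temp directory; the 8-byte counter layout and the names `state_dir` and `bump_counter` are made up for the example, and there is no locking, so it is only safe if invocations do not overlap.

```python
import mmap
import os
import struct
import tempfile


def state_dir():
    # /dev/shm is normally tmpfs on Linux; fall back to the temp dir elsewhere.
    if os.path.isdir("/dev/shm") and os.access("/dev/shm", os.W_OK):
        return "/dev/shm"
    return tempfile.gettempdir()


def bump_counter(path):
    """Load a 64-bit counter from `path`, increment it, store it, return it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, 8)                # a newly created file reads as zeros
        with mmap.mmap(fd, 8) as mem:      # shared mapping of the file
            (value,) = struct.unpack_from("<Q", mem, 0)
            value += 1
            struct.pack_into("<Q", mem, 0, value)
        return value
    finally:
        os.close(fd)
```

Each call stands in for one run of the program: it picks up the value left by the previous run and leaves the new one behind. The state lives until the file is removed or the machine reboots.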

issuing a disk read from bottom-half of device driver

In a Xen setup, IO accesses from guest VMs go through a privileged domain called dom0 that is just a modified Linux kernel which has calls from and to the XEN hypervisor. For block IO, they have a split driver model whose front-end is in the guest VM and the backend of the driver in the domain0. The backend just creates a 'bio' structure and invokes submit_bio() as in traditional linux block driver code.
My goal here is to check if there is any problem in the data written to disk (lost data, silently corrupted writes, misdirected writes, etc). So I need to read the data that was written to disk and compare it with a cached copy of the data (this is a common disk function called 'read after write'). My question is, is it not possible to invoke __bread() from my backend driver level? The kernel crashes when __bread() is invoked. Does anyone understand the reason for this? Also, if this isn't possible, what other ways are there to read a specific block of data from disk at the driver's bottom half?
Can I intercept and clone the bio structure of the writes, change the operation to a read in my new bio, and invoke submit_bio() again? I did that, but the sector number in the bio structure that is returned by the completion callback of submit_bio() is some random value and not the one I sent.
Thanks.
If this were my task, I'd first try writing a new I/O scheduling algorithm. Start by copying the code of the cfq, deadline, noop or as (anticipatory) scheduler and work from there to self-submit read commands after accepting write requests. noop would probably be the easiest one to modify to read immediately after write and propagate errors upwards, but I can't imagine the performance would be very good. If you use one of the other schedulers as a base, it would probably be much more difficult to signal an error immediately after the write (perhaps a few seconds would have elapsed before reads were scheduled again), so it would really only be useful as a diagnostic after the fact, and not something that could benefit applications directly.

Resources