Uninterruptable write in Linux - linux

According to an answer on this question : Why doing I/O in Linux is uninterruptible? I/O on linux is uninterruptible (uninterruptible in sleep). But if I start a process ,say a large 'dd' on a file and while the process is going on I forcefully unmount the Filesystem (where the file is),the process gets killed . ideally it should be in a hung state because it is sleeping and is UN.

"Uninterruptible" applies to the low-level read/write operations handled by the kernel. In C programming, these correspond broadly to read() and write() calls on the C standard library. That a utility can be interrupted does not say much about whether I/O operations can be interrupted, because a specific file operation in a utility might correspond to many low-level I/O operations.
In the case of dd, the default transfer block size is 512 bytes, so copying a large file might consist of many I/O operations. dd can be interrupted between these operations. I would expect the same to apply to most utilities that operate on files. If you can force them to work with huge data blocks (e.g., specify a gigabyte-size argument for bs= in dd) then you might be able to see that low-level I/O operations are uninterruptible.

Related

Is it thread-safe to write to the same pipe from multiple threads sharing the same file descriptor in Linux?

I have a Linux process with two threads, both sharing the same file descriptor to write data of 400 bytes to the same pipe every 100ms. I'm wondering if POSIX guarantees that this is thread-safe or if I need to add additional synchronization mechanisms to serialize the writing to the pipe from multiple threads (not processes).
I'm also aware that POSIX guarantees that multiple writes to the same pipe from different processes that are less than PIPE_BUF bytes are atomically written. But I'm not sure if the same guarantee applies to writes from multiple threads within the same process.
Can anyone provide some insight on this? Are there any additional synchronization mechanisms that I should use to ensure thread safety when writing to the same pipe from multiple threads using the same file descriptor in Linux?
Thank you in advance for any help or advice!
In the posix standard, on general information we read:
2.9.1 Thread-Safety
All functions defined by this volume of POSIX.1-2008 shall be thread-safe, except that the following functions need not be thread-safe.
And neither read nor write are listed afterwards. And so indeed, it is safe to call them from multiple threads. This however only means that the syscall won't crash, it doesn't say anything about the exact behaviour of calling them in parallel. In particular, it doesn't say about atomicity.
However in docs regarding write syscall we read:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called {PIPE_BUF}. This volume of POSIX.1-2008 does not say whether write requests for more than {PIPE_BUF} bytes are atomic, but requires that writes of {PIPE_BUF} or fewer bytes shall be atomic.
And in the same doc we also read:
Write requests to a pipe or FIFO shall be handled in the same way as a regular file with the following exceptions:
and the guarantee about atomicity (when size below PIPE_BUF) is repeated.
man 2 write (Linux man-pages 6.02) says:
According to POSIX.1-2008/SUSv4 Section XSI 2.9.7 ("Thread Interactions
with Regular File Operations"):
All of the following functions shall be atomic with respect to each
other in the effects specified in POSIX.1-2008 when they operate on
regular files or symbolic links: ...
Among the APIs subsequently listed are write() and writev(2). And
among the effects that should be atomic across threads (and processes)
are updates of the file offset. However, before Linux 3.14, this was
not the case: if two processes that share an open file description (see
open(2)) perform a write() (or writev(2)) at the same time, then the
I/O operations were not atomic with respect to updating the file off-
set, with the result that the blocks of data output by the two pro-
cesses might (incorrectly) overlap. This problem was fixed in Linux
3.14.
So, it should be safe as long as you're running at least Linux 3.14 (which is almost 9 years old).

Linux Kernel - msync Locking Behavior

I am investigating an application that writes random data in fixed-size chunks (e.g. 4k) to random locations in a large buffer file. I have several processes (not threads) doing that, each process has its own buffer file assigned.
If I use mmap+msync to write and persist data to disk, I see a performance spike for 16 processes, and a performance drop for more threads (32 processes).
If I use open+write+fsync, I do not see such a spike, instead a performance plateau (and mmap is slower than open/write).
I've read multiple times [1,2] that both mmap and msync can take locks. With vtune, I analyzed that we are indeed spinlocking, and spending the most time in clear_page_erms and xas_load functions.
However, when reading the source code for msync [3], I cannot understand whether these locks are global or per-file. The paper [2] states that the locks are on radix-trees within the kernel that are per-file, however, as I do observe some spinlocks in the kernel, I believe that some locks may be global, as I have one file per process.
Do you have an explanation on why we have such a spike at 16 processes for mmap and input on the locking behavior of msync?
Thank you!
Best,
Maximilian
[1] https://kb.pmem.io/development/100000025-Why-msync-is-less-optimal-for-persistent-memory/
[2] Optimizing Memory-mapped I/O for Fast Storage Devices, Papagiannis et al., ATC '20
[3] https://elixir.bootlin.com/linux/latest/source/mm/msync.c
In the 4.19 kernel source, the current task keeps its own mm_struct, which contains a single semaphore used to arbitrate accesses to all memory regions being synced. All of the threads in a process acting on one of your buffer files will therefore take this semaphore, operate on some region(s) of the file, and release the semaphore.
While I can't rationalise the exact number of 16 processes where you hit your performance cliff, clearly when you use mmap() you are forcing entry into the msync(M_SYNC) code section for VM_SHARED. This invokes vfs_fsync_range() and guarantees actual synchronous disk I/O will happen, which is going to generally slow down the show: it does not allow advantageous grouping of I/Os for economy and tends to maximise the actual time spent waiting for disk I/O to complete.
To avoid this, ensure that each thread in your process manages a dedicated subset of 4K chunks in the buffer file, avoids mmap() on the buffer file, and schedules async I/O. So long as you avoid mmap() on the buffer file itself, each thread will be alone in writing (safely, if you design well) to its own section of the buffer file. You will therefore be able to specify all of your I/O to be asynchronous, which should allow better aggregation and significantly improve your application's performance i.e. avoid that cliff at 16 processes or whatever the number ends up being. Obviously, you'll still have to ensure that any thread writing to one of its chunks either completes that write or has not yet begun it (and drops any request(s) to do so) if a request for another write to the same chunk comes along.

File Access (read/write) synchronization between 'n' processes in Linux

I am studying Operating Systems this semester and was just wondering how Linux handles file access (read/write) synchronization, what is the default implementation does it use semaphores, mutexes or monitors? And can you please tell me where I would find this in the source codes or my own copy of Ubuntu and how to disable it?
I need to disable it so i can check if my own implementation of this works, also how do i add my own implementation to the system.
Here's my current plan please tell me if its okay:
Disable the default implementation, add my own. (recompile kernel if need be)
My own version would keep track of every incoming process and maintain a list of what files they were using adn whenever a file would repeat i would check if its a reader process or a writer process
I will be going with a reader preferred solution to the readers writers problem.
Kernel doesn't impose process synchronization (it should be performed by processes while kernel only provides tools for that), but it can guarantee atomicity on some operations: atomic operation can not be interrupted and its result cannot be altered by other operation running in parallel.
Speaking of writing to a file, it has some atomicity guarantees. From man -s3 write:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called {PIPE_BUF}. This volume of IEEE Std 1003.1-2001 does not say whether write requests for more than {PIPE_BUF} bytes are atomic, but requires that writes of {PIPE_BUF} or fewer bytes shall be atomic.
Some discussion on SO: Atomicity of write(2) to a local filesystem.
To maintain atomicity, various kernel routines hold i_mutex mutex of an inode. I.e. in generic_file_write_iter():
mutex_lock(&inode->i_mutex);
ret = __generic_file_write_iter(iocb, from);
mutex_unlock(&inode->i_mutex);
So other write() calls won't mess with your call. Readers, however doesn't lock i_mutex, so they may get invalid data. Actual locking for readers is performed in page cache, so a page (4096 bytes on x86) is a minimum amount data that guarantees atomicity in kernel.
Speaking of recompiling kernel to test your own implementation, there are two ways of doing that: download vanilla kernel from http://kernel.org/ (or from Git), patch and build it - it is easy. Recompiling Ubuntu kernels is harder -- it will require working with Debian build tools: https://help.ubuntu.com/community/Kernel/Compile
I'm not clear about what you trying to achieve with your own implementation. If you want to apply strictier synchronization rules, maybe it is time to look at TxOS?

Difference between POSIX AIO and libaio on Linux?

What I seem to understand:
POSIX AIO APIs are prototyped in <aio.h> and you link your program with librt(-lrt), while the libaio APIs in <libaio.h> and your program is linked with libaio (-laio).
What I can't figure out:
1.Does the kernel handle the either of these methods differently?
2.Is the O_DIRECT flag mandatory for using either of them?
As mentioned in this post, libaio works fine without O_DIRECT when using libaio.Okay,understood but:
According to R.Love's Linux System Programming book, Linux supports aio (which I assume is POSIX AIO) on regular files only if opened with O_DIRECT.But a small program that I wrote (using aio.h,linked with -lrt) that calls aio_write on a file opened without the O_DIRECT flag works without issues.
On linux, the two AIO implementations are fundamentally different.
The POSIX AIO is a user-level implementation that performs normal blocking I/O in multiple threads, hence giving the illusion that the I/Os are asynchronous. The main reason to do this is that:
it works with any filesystem
it works (essentially) on any operating system (keep in mind that gnu's libc is portable)
it works on files with buffering enabled (i.e. no O_DIRECT flag set)
The main drawback is that your queue depth (i.e. the number of outstanding operations you can have in practice) is limited by the number of threads you choose to have, which also means that a slow operation on one disk may block an operation going to a different disk. It also affects which I/Os (or how many) is seen by the kernel and the disk scheduler as well.
The kernel AIO (i.e. io_submit() et.al.) is kernel support for asynchronous I/O operations, where the io requests are actually queued up in the kernel, sorted by whatever disk scheduler you have, presumably some of them are forwarded (in somewhat optimal order one would hope) to the actual disk as asynchronous operations (using TCQ or NCQ). The main restriction with this approach is that not all filesystems work that well or at all with async I/O (and may fall back to blocking semantics), files have to be opened with O_DIRECT which comes with a whole lot of other restrictions on the I/O requests. If you fail to open your files with O_DIRECT, it may still "work", as in you get the right data back, but it probably isn't done asynchronously, but is falling back to blocking semantics.
Also keep in mind that io_submit() can actually block on the disk under certain circumstances.

Linux Process States

In Linux, what happens to the state of a process when it needs to read blocks from a disk? Is it blocked? If so, how is another process chosen to execute?
When a process needs to fetch data from a disk, it effectively stops running on the CPU to let other processes run because the operation might take a long time to complete – at least 5ms seek time for a disk is common, and 5ms is 10 million CPU cycles, an eternity from the point of view of the program!
From the programmer point of view (also said "in userspace"), this is called a blocking system call. If you call write(2) (which is a thin libc wrapper around the system call of the same name), your process does not exactly stop at that boundary; it continues, in the kernel, running the system call code. Most of the time it goes all the way up to a specific disk controller driver (filename → filesystem/VFS → block device → device driver), where a command to fetch a block on disk is submitted to the proper hardware, which is a very fast operation most of the time.
THEN the process is put in sleep state (in kernel space, blocking is called sleeping – nothing is ever 'blocked' from the kernel point of view). It will be awakened once the hardware has finally fetched the proper data, then the process will be marked as runnable and will be scheduled. Eventually, the scheduler will run the process.
Finally, in userspace, the blocking system call returns with proper status and data, and the program flow goes on.
It is possible to invoke most I/O system calls in non-blocking mode (see O_NONBLOCK in open(2) and fcntl(2)). In this case, the system calls return immediately and only report submitting the disk operation. The programmer will have to explicitly check at a later time whether the operation completed, successfully or not, and fetch its result (e.g., with select(2)). This is called asynchronous or event-based programming.
Most answers here mentioning the D state (which is called TASK_UNINTERRUPTIBLE in the Linux state names) are incorrect. The D state is a special sleep mode which is only triggered in a kernel space code path, when that code path can't be interrupted (because it would be too complex to program), with the expectation that it would block only for a very short time. I believe that most "D states" are actually invisible; they are very short lived and can't be observed by sampling tools such as 'top'.
You can encounter unkillable processes in the D state in a few situations. NFS is famous for that, and I've encountered it many times. I think there's a semantic clash between some VFS code paths, which assume to always reach local disks and fast error detection (on SATA, an error timeout would be around a few 100 ms), and NFS, which actually fetches data from the network which is more resilient and has slow recovery (a TCP timeout of 300 seconds is common). Read this article for the cool solution introduced in Linux 2.6.25 with the TASK_KILLABLE state. Before this era there was a hack where you could actually send signals to NFS process clients by sending a SIGKILL to the kernel thread rpciod, but forget about that ugly trick.…
While waiting for read() or write() to/from a file descriptor return, the process will be put in a special kind of sleep, known as "D" or "Disk Sleep". This is special, because the process can not be killed or interrupted while in such a state. A process waiting for a return from ioctl() would also be put to sleep in this manner.
An exception to this is when a file (such as a terminal or other character device) is opened in O_NONBLOCK mode, passed when its assumed that a device (such as a modem) will need time to initialize. However, you indicated block devices in your question. Also, I have never tried an ioctl() that is likely to block on a fd opened in non blocking mode (at least not knowingly).
How another process is chosen depends entirely on the scheduler you are using, as well as what other processes might have done to modify their weights within that scheduler.
Some user space programs under certain circumstances have been known to remain in this state forever, until rebooted. These are typically grouped in with other "zombies", but the term would not be correct as they are not technically defunct.
A process performing I/O will be put in D state (uninterruptable sleep), which frees the CPU until there is a hardware interrupt which tells the CPU to return to executing the program. See man ps for the other process states.
Depending on your kernel, there is a process scheduler, which keeps track of a runqueue of processes ready to execute. It, along with a scheduling algorithm, tells the kernel which process to assign to which CPU. There are kernel processes and user processes to consider. Each process is allocated a time-slice, which is a chunk of CPU time it is allowed to use. Once the process uses all of its time-slice, it is marked as expired and given lower priority in the scheduling algorithm.
In the 2.6 kernel, there is a O(1) time complexity scheduler, so no matter how many processes you have up running, it will assign CPUs in constant time. It is more complicated though, since 2.6 introduced preemption and CPU load balancing is not an easy algorithm. In any case, it’s efficient and CPUs will not remain idle while you wait for the I/O.
As already explained by others, processes in "D" state (uninterruptible sleep) are responsible for the hang of ps process. To me it has happened many times with RedHat 6.x and automounted NFS home directories.
To list processes in D state you can use the following commands:
cd /proc
for i in [0-9]*;do echo -n "$i :";cat $i/status |grep ^State;done|grep D
To know the current directory of the process and, may be, the mounted NFS disk that has issues you can use a command similar to the following example (replace 31134 with the sleeping process number):
# ls -l /proc/31134/cwd
lrwxrwxrwx 1 pippo users 0 Aug 2 16:25 /proc/31134/cwd -> /auto/pippo
I found that giving the umount command with the -f (force) switch, to the related mounted nfs file system, was able to wake-up the sleeping process:
umount -f /auto/pippo
the file system wasn't unmounted, because it was busy, but the related process did wake-up and I was able to solve the issue without rebooting.
Assuming your process is a single thread, and that you're using blocking I/O, your process will block waiting for the I/O to complete. The kernel will pick another process to run in the meantime based on niceness, priority, last run time, etc. If there are no other runnable processes, the kernel won't run any; instead, it'll tell the hardware the machine is idle (which will result in lower power consumption).
Processes that are waiting for I/O to complete typically show up in state D in, e.g., ps and top.
Yes, the task gets blocked in the read() system call. Another task which is ready runs, or if no other tasks are ready, the idle task (for that CPU) runs.
A normal, blocking disc read causes the task to enter the "D" state (as others have noted). Such tasks contribute to the load average, even though they're not consuming the CPU.
Some other types of IO, especially ttys and network, do not behave quite the same - the process ends up in "S" state and can be interrupted and doesn't count against the load average.
Yes, tasks waiting for IO are blocked, and other tasks get executed. Selecting the next task is done by the Linux scheduler.
Generally the process will block. If the read operation is on a file descriptor marked as non-blocking or if the process is using asynchronous IO it won't block. Also if the process has other threads that aren't blocked they can continue running.
The decision as to which process runs next is up to the scheduler in the kernel.

Resources