From man pages
O_NONBLOCK or O_NDELAY
This flag has no effect for regular files and block
devices; that is, I/O operations will (briefly) block when
device activity is required, regardless of whether O_NONBLOCK
is set. Since O_NONBLOCK semantics might eventually be
implemented, applications should not depend upon blocking
behavior when specifying this flag for regular files and block
devices.
From my question, I had the following understanding of the I/O system:
Device <-----> Kernel Buffers <-----> Process
So whenever the buffers are full (for a write) or empty (for a read), the corresponding call from the process can block or not, depending on the flag above. The kernel's interaction with the device does not block the process, and the kernel may or may not use DMA to communicate with the device.
But it looks like my understanding is wrong, as I can't see why regular file descriptors can't be non-blocking. Could somebody help me here?
"Blocking" is defined as waiting for a file to become readable or writable.
Regular files are always readable and/or writable; in other words, it is always possible to try to start the read/write operation without having to wait for some external event:
when reading, the kernel already knows if there are more bytes in the file (if the end of the file has been reached, it is not possible to block to wait for some other process to append more bytes);
when writing, the kernel already knows if there is enough space on the disk to write something (if the disk is full, it is not possible to block to wait for some other process to delete some data to free up space).
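A small demonstration of this point (a sketch, assuming Linux; the file path is only an example):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];

        /* O_NONBLOCK is accepted for a regular file, but has no effect. */
        int fd = open("/etc/hostname", O_RDONLY | O_NONBLOCK);
        if (fd < 0) { perror("open"); return 1; }

        /* This read may block briefly while the kernel fetches data from
         * disk, but it will never fail with EAGAIN on a regular file. */
        ssize_t n = read(fd, buf, sizeof(buf));
        printf("read %zd bytes\n", n);

        close(fd);
        return 0;
    }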
Related
Let's say I know that a file descriptor fd is open for reading in my process. I would like to pipe data from this fd into a FIFO that is available for reading outside of my process, in a way that avoids calling poll or select on fd and manually reading/forwarding data. Can this be done?
You mean ask the OS to do that behind the scenes on an ongoing basis from now on? Like an I/O redirection?
No, you can't do that.
You could spawn a thread that does nothing but read the file fd and write the pipe fd, though. To avoid the overhead of copying memory around in read(2) and write(2) system calls, you can use sendfile(out_fd, in_fd, NULL, 4096) to tell the kernel to copy a page from in_fd to out_fd. See the man page.
You might have better results with splice(2), since it's designed for use with files and pipes. sendfile(2) used to require out_fd to be a socket. (Designed for zero-copy sending static data on TCP sockets, e.g. from a web server.)
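A minimal sketch of such a copier thread, assuming Linux and using splice(2) as suggested above (the struct and function names are made up for illustration, and error handling is kept minimal):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    struct copy_args {
        int in_fd;   /* regular file opened for reading */
        int pipe_fd; /* write end of the FIFO/pipe */
    };

    static void *copier_thread(void *arg)
    {
        struct copy_args *a = arg;
        ssize_t n;

        /* Move up to 64 KiB per call from in_fd into the pipe without
         * copying the data through user space. */
        while ((n = splice(a->in_fd, NULL, a->pipe_fd, NULL,
                           65536, SPLICE_F_MOVE)) > 0)
            ;

        if (n < 0)
            perror("splice");
        return NULL;
    }

Started once with pthread_create(), the thread keeps forwarding data until splice() reports end of file or an error.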
Linux does have asynchronous I/O, so you can queue up a read or write to happen in the background. That's not a good choice here, because you can't queue up a copy from one fd to another (there is no async splice(2) or sendfile(2)). Even if there were, it would have a specific request size, not a fire-and-forget keep-copying-forever behavior. AFAIK, threads have become the preferred way to do async I/O, rather than the POSIX AIO facilities.
I have a Linux application that streams data to files on a directly-attached SAS storage array. It fills large buffers, writes them in O_DIRECT mode, then recycles the buffers (i.e. fills them again, etc.). I do not need to use O_SYNC for data integrity, because I can live with data loss on crashes, delayed writing, etc. I'm primarily interested in high throughput, and I seem to get better performance without O_SYNC. However, I am wondering whether it is safe: if O_DIRECT is used but not O_SYNC, when exactly does the write() system call return?
If the write() returns after the DMA to the storage array's cache has been completed, then my application is safe to fill the buffer again. The array itself is in write-back mode: it will write to disk eventually, which is acceptable to me.
If the write returns immediately after the DMA has been initiated (but not yet completed), then my application is not safe, because it would overwrite the buffer while the DMA is still in progress. Obviously I don't want to write corrupted data; but in this case there is also no way that I know to figure out when the DMA for a buffer has been completed and it is safe to refill.
(There are actually several parallel threads, each one with its pool of buffers, although this may not really matter for the question above.)
When the write call returns you can reuse the buffer without any danger. You don't know that the write has made it to disk, but you indicated that was not an issue for you.
One supporting reference is at http://www.makelinux.net/ldd3/chp-15-sect-3, which states:
For example, the use of direct I/O requires that the write system call
operate synchronously; otherwise the application does not know when it
can reuse its I/O buffer.
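That is exactly the pattern the question describes; a minimal sketch of it, assuming Linux and a 4096-byte logical block size (O_DIRECT requires the buffer, the transfer size, and the file offset to be suitably aligned):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BUF_SIZE (1 << 20)   /* 1 MiB per buffer */

    int main(void)
    {
        void *buf;
        int fd = open("stream.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (posix_memalign(&buf, 4096, BUF_SIZE) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }

        for (int i = 0; i < 8; i++) {
            memset(buf, i, BUF_SIZE);                 /* "fill" the buffer */
            if (write(fd, buf, BUF_SIZE) != BUF_SIZE) {
                perror("write");
                break;
            }
            /* write() has returned, so the buffer can be refilled on the
             * next iteration: with O_DIRECT the call does not complete
             * until the DMA out of this buffer is done. */
        }

        free(buf);
        close(fd);
        return 0;
    }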
My motivation
I'd love to write a distributed file system using FUSE. I'm still designing the code before I jump in. It'll possibly be written in C or Go; the question is, how do I deal with network I/O in parallel?
My problem
More specifically, I want my file system to write locally, and have a thread do the network overhead asynchronously. It doesn't matter if it's slightly delayed in my case, I simply want to avoid slow writes to files because the code has to contact some slow server somewhere.
My understanding
There are two conflicting ideas in my head. One is that the FUSE kernel module uses the ABI of my program to hijack the process and call the specific FUSE functions I implemented (sync or async, whatever); the other is that the program is running and blocking to receive events from the kernel module (which I don't think is the case, but I could be wrong).
Whatever it is, does it mean I can simply start a thread and do the network stuff there? I'm a bit lost on how that works. Thanks.
You don't need to do any hijacking. The FUSE kernel module registers as a filesystem provider (of type fusefs). It then services read/write/open/etc. calls by dispatching them to the user-mode process. When that process returns, the kernel module gets the return value and returns from the corresponding system call.
If you want the server (i.e. the user-mode process) to be asynchronous and multi-threaded, all you have to do is dispatch the operation (assuming it's a write; you can't parallelize input this way) to another thread in that process, and return immediately to FUSE. That way, your user-mode process can, at its leisure, write out to the remote server.
You could similarly try to parallelize read, but the issue here is that you won't be able to return to FUSE (and thus release the reading process) until you have at least the beginning of the data read.
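For the write side, a sketch of that "dispatch and return" idea using the high-level libfuse API (assuming libfuse 3; queue_remote_write() is a hypothetical stand-in for whatever hands the data to your background network thread, and the file is assumed to have been opened so that fi->fh holds a local fd):

    #define FUSE_USE_VERSION 31
    #include <errno.h>
    #include <unistd.h>
    #include <fuse.h>

    /* Hypothetical helper: copies the data and hands it to a background
     * thread that pushes it to the remote server at its leisure. */
    void queue_remote_write(const char *path, const char *data,
                            size_t size, off_t offset);

    static int myfs_write(const char *path, const char *buf, size_t size,
                          off_t offset, struct fuse_file_info *fi)
    {
        /* 1. Satisfy the write locally so the caller is not held up. */
        ssize_t n = pwrite((int)fi->fh, buf, size, offset);
        if (n < 0)
            return -errno;

        /* 2. Queue the network copy and return to FUSE immediately. */
        queue_remote_write(path, buf, (size_t)n, offset);
        return (int)n;
    }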
How does the Linux kernel handle multiple reads/writes to procfs? For instance, if two processes write to procfs at once, is one process queued (i.e. a kernel trap actually blocks one of the processes), or is there a kernel thread running for each core?
The concern is: if you have a buffer used within a function (static, or in global scope), do you have to protect it, or will the code run sequentially?
It depends on each and every procfs file implementation. No one can even give you a definite answer, because each driver can implement its own procfs folder and files (you didn't specify any specific files). Quick browsing in http://lxr.free-electrons.com/source/fs/proc/ shows that some files do use locks.
Either way, you can't use a global buffer without protection, because a context switch can always occur: if not in the kernel, then it can catch your reader thread right after it finishes the read syscall and before it starts to process the data it read.
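If the buffer in question lives on the kernel side (e.g. in your own procfs handler), the safe approach is to protect it yourself. A sketch, assuming a 5.6+ kernel where procfs handlers are registered through struct proc_ops (names are made up for illustration):

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/proc_fs.h>
    #include <linux/uaccess.h>
    #include <linux/mutex.h>

    static char shared_buf[128];
    static DEFINE_MUTEX(buf_lock);   /* serializes access to shared_buf */

    static ssize_t demo_write(struct file *file, const char __user *ubuf,
                              size_t count, loff_t *ppos)
    {
        size_t n = min_t(size_t, count, sizeof(shared_buf) - 1);

        mutex_lock(&buf_lock);
        if (copy_from_user(shared_buf, ubuf, n)) {
            mutex_unlock(&buf_lock);
            return -EFAULT;
        }
        shared_buf[n] = '\0';
        mutex_unlock(&buf_lock);

        return count;
    }

    static const struct proc_ops demo_ops = {
        .proc_write = demo_write,
    };

    static int __init demo_init(void)
    {
        proc_create("demo_shared_buf", 0222, NULL, &demo_ops);
        return 0;
    }

    static void __exit demo_exit(void)
    {
        remove_proc_entry("demo_shared_buf", NULL);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");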
My question: in Linux (and in FreeBSD, and generally in UNIX), is it possible/legal to read a single file descriptor simultaneously from two threads?
I did some searching but found nothing, although a lot of people ask similar questions about reading/writing from/to a socket fd at the same time (meaning reading while another thread is writing, not reading while another is reading). I have also read some man pages but got no clear answer to my question.
Why I ask: I tried to implement a simple program that counts lines in stdin, like wc -l. I was actually testing my home-made C++ I/O engine for overhead, and discovered that wc is 1.7 times faster. I trimmed down some C++ and came closer to wc's speed, but didn't reach it. Then I experimented with the input buffer size and optimized it, but wc is still clearly a bit faster. Finally I created two threads which read the same STDIN_FILENO in parallel, and this at last was faster than wc! But the line count became incorrect... so I suppose the reads return some unexpected junk. Doesn't the kernel care which process reads?
Edit: I did some research and discovered only that calling read directly via syscall() does not change anything. The kernel code seems to do some synchronization handling, but I didn't understand much of it (read_write.c).
That's unspecified behavior; POSIX says:
The read() function shall attempt to read nbyte bytes from the file
associated with the open file descriptor, fildes, into the buffer
pointed to by buf. The behavior of multiple concurrent reads on the
same pipe, FIFO, or terminal device is unspecified.
About accessing a single file descriptor concurrently (i.e. from multiple threads or even processes), I'm going to cite POSIX.1-2008 (IEEE Std 1003.1-2008), Subsection 2.9.7 Thread Interactions with Regular File Operations:
2.9.7 Thread Interactions with Regular File Operations
All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2008 when they operate on regular files or symbolic links:
[…] read() […]
If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them. […]
At first glance, this looks quite good. However, I hope you did not miss the restriction "when they operate on regular files or symbolic links".
@jarero cites:
The behavior of multiple concurrent reads on the same pipe, FIFO, or terminal device is unspecified.
So, implicitly, we're agreeing, I assume: it depends on the type of the file you are reading. You said you read from STDIN. Well, if your STDIN is a plain file, you can use concurrent access. Otherwise you shouldn't.
When used with a descriptor (fd), read() and write() rely on internal state of the underlying open file, the "current offset", to know where the read or write will occur. As a result, they aren't thread-safe.
To allow a single descriptor to be used by multiple threads simultaneously, pread() and pwrite() are provided. With those interfaces, the descriptor and the desired offset are specified, so the "current offset" in the descriptor isn't used.
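A sketch of the pread() approach applied to the line-counting experiment, assuming the input is a regular (seekable) file rather than a pipe or terminal: each thread reads its own fixed byte range, so neither touches the shared current offset of the descriptor.

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    struct chunk { int fd; off_t off; size_t len; long lines; };

    static void *count_lines(void *arg)
    {
        struct chunk *c = arg;
        char buf[65536];
        size_t done = 0;

        while (done < c->len) {
            size_t want = c->len - done;
            if (want > sizeof(buf))
                want = sizeof(buf);
            /* pread() takes an explicit offset, so the two threads never
             * race on the descriptor's current offset. */
            ssize_t n = pread(c->fd, buf, want, c->off + (off_t)done);
            if (n <= 0)
                break;
            for (ssize_t i = 0; i < n; i++)
                if (buf[i] == '\n')
                    c->lines++;
            done += (size_t)n;
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        off_t size = lseek(fd, 0, SEEK_END);
        struct chunk c[2] = {
            { fd, 0,        (size_t)(size / 2),        0 },
            { fd, size / 2, (size_t)(size - size / 2), 0 },
        };

        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, count_lines, &c[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);

        printf("%ld lines\n", c[0].lines + c[1].lines);
        close(fd);
        return 0;
    }

Splitting in the middle of a line doesn't affect the total, since each newline byte is counted exactly once.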