How do I/O requests work in the Linux kernel?

Good day! I have a question about how I/O works at a low level. More specifically: how many bytes can be handled in one I/O request?
For example, if we read an input document line by line (say a document containing 10 lines), will the Linux kernel submit all 10 lines in one I/O request, or will it submit 10 separate I/O requests? Thank you in advance for any comments or suggestions.

Just came across this question. Let me give an overview of the layers of I/O in the Linux kernel.
1. The application performs a line read
Userspace
2. The read/write system call moves control into the kernel
Kernel
3. The VFS layer comes into the picture and finds the filesystem the file belongs to
4. The filesystem-specific read is invoked, say ext3_read
5. The filesystem checks whether the data is in the page cache; if it is, the data is returned from there
6. If the data is not in the page cache, the filesystem uses the file's inode to map the file block number to a sector number on the disk
7. The filesystem prepares an IOVEC, a structure for performing I/O against the disk. It consists of a vector of RAM pages and the sector numbers obtained from the inode mapping.
8. The IOVEC is converted into a SCSI/NVMe command and sent to the lower-layer device driver to read from the disk.
If you read multiple lines, you make multiple system calls into the kernel. For each call, the filesystem may fetch the data from the page cache or issue a read to the disk.
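To make the userspace side of this concrete, here is a minimal sketch (my illustration, with "input.txt" as a placeholder path) of reading a 10-line document line by line. Note that fgets() is serviced from a stdio buffer that the C library fills with larger read() system calls, so ten lines typically do not become ten separate I/O requests; running the program under strace shows the system calls that actually reach the kernel.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* "input.txt" is a placeholder path for a small 10-line document. */
    FILE *f = fopen("input.txt", "r");
    if (!f) {
        perror("fopen");
        return EXIT_FAILURE;
    }

    char line[256];
    /* Each fgets() returns one line, but the C library normally serves it
     * from an internal buffer filled by a single larger read() system call,
     * which the kernel in turn satisfies from the page cache or via a
     * block-layer request to the disk. */
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);

    fclose(f);
    return EXIT_SUCCESS;
}
```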

Related

What are the benefits and micro-ops of the ENQCMD instruction?

ENQCMD and MOVDIR64B are two instructions in Intel DSA.
MOVDIR64B reads 64-bytes from the source memory address and performs a 64-byte direct-store operation to the destination address. The ENQCMD instruction allows software to write commands to enqueue registers, which are special device registers accessed using memory-mapped I/O (MMIO).
My question is - what is the aim of designing those two instructions?
Based on my understanding, setting up the memory-mapped I/O area (the register) requires OS support, i.e. the device driver. After setting up the MMIO area, we can access it using the write() system call, which is also implemented in the device driver. For general architectures, Linux supports iowrite64() to write 8-byte values one at a time. Hence, if we want to write 64 bytes, we need to call iowrite64() 8 times.
With the help of MOVDIR64B, a new API was created for Intel DSA - __iowrite512() - which writes 64 bytes atomically.
I agree that the latter is at least more efficient than the former, but I am confused about the time it takes to transfer the data.
Consider the following case: given a device (Intel DSA) that supports MOVDIR64B and ENQCMD, suppose we want to transfer 64 bytes of data from memory to an MMIO register. There are two options: call iowrite64() 8 times (in a loop), or call __iowrite512() once. Will the latter be 8 times faster than the former?
My guess is that the difference is unlikely to be a full 8 times, but that the latter will be faster. How much faster would it be? Is it documented anywhere? I do not have Intel DSA hardware, so I am not sure how to test it.
Besides, what other benefits does ENQCMD have? Is it broken up into several micro-operations? If so, what micro-operations make up ENQCMD?
iowrite64 uses a UC access to MMIO space, so writes are serialized, not pipelined. That is, only one UC write can be in flight at a time from a single CPU thread, and the CPU doesn't continue execution until the MMIO write is complete.
MOVDIR64B has the potential to be faster than even a single iowrite64, because it uses the WC memory type instead of UC (even if the destination address is UC). After the write is issued by the CPU, it can continue execution. Multiple direct stores can be streamed to the device. That means that multiple direct stores can be in flight at one time from a single CPU thread. MOVDIRI also behaves this way.
As far as I know, the time to actually transfer the data to the destination is the same regardless of the size (between 1 and 64 bytes). Of course that is dependent on the width of the data path within the SoC, which could be different for different implementations.
The main advantage of MOVDIR64B is that the descriptor arrives at the device all at once instead of in pieces. The device doesn't have to worry about receiving a partial descriptor or receiving parts of two descriptors interleaved. In fact, Intel DSA ignores writes smaller than 64 bytes to a portal.
To realize the full benefit of streaming writes, the destination address for each MOVDIR64B from a single CPU thread should be different. Each Intel DSA portal is a 4096-byte page, so there are 64 unique addresses within each portal. Descriptor writes from a single CPU can be striped across the 64 addresses. (It doesn't matter whether writes from multiple CPUs use the same address or different addresses, but normally you would not expect multiple CPUs to be using the same dedicated WQ in DSA.)
ENQCMD allows the device to respond to software whether it accepted the descriptor or not. This allows multiple applications to use the same shared WQ without risk of a descriptor being lost because the shared WQ is full. Applications can submit descriptors without any driver involvement (after setup), and without any lock or communication between the applications.
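As a rough illustration of how these instructions are used from software, here is a hedged userspace sketch using the _movdir64b and _enqcmd compiler intrinsics from immintrin.h (built with -mmovdir64b -menqcmd). The portal pointer is assumed to come from mmap'ing a work-queue portal exposed by the device driver, the descriptor layout is a stand-in rather than the real DSA format, and the return convention of _enqcmd (it reflects EFLAGS.ZF, where ZF=1 means the descriptor was not accepted) should be checked against the intrinsics documentation.

```c
#include <immintrin.h>
#include <stdint.h>

/* 64-byte, 64-byte-aligned stand-in for a device descriptor. */
struct descriptor {
    uint8_t bytes[64];
} __attribute__((aligned(64)));

/* Dedicated WQ: MOVDIR64B posts the descriptor as one 64-byte direct
 * store; no acceptance status comes back. The portal address must be
 * 64-byte aligned. */
static void submit_dedicated(void *portal, const struct descriptor *desc)
{
    _movdir64b(portal, desc);
}

/* Shared WQ: ENQCMD reports via ZF whether the device accepted the
 * descriptor, so software can retry when the queue is full. A non-zero
 * return (ZF set) is assumed here to mean "not accepted, retry". */
static int submit_shared(void *portal, const struct descriptor *desc)
{
    int retries = 1000;                 /* arbitrary retry budget */

    while (_enqcmd(portal, desc)) {
        if (--retries == 0)
            return -1;                  /* queue stayed full */
    }
    return 0;
}
```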

When and how is mmap'ed memory swapped in and out?

In my understanding, mmap'ing a file that fits into RAM will be like having the file in memory.
Say that we have 16G of RAM, and we first mmap a 10G file that we use for a while. This should be fairly efficient in terms of access. If we then mmap a second 10G file, will that cause the first one to be swapped out? Or parts of it? If so, when will this happen: at the mmap call, or on accessing the memory area of the newly loaded file?
And if we then access memory belonging to the first file's mapping again, will that cause it to be paged back in? So, say we alternate reads between memory corresponding to the first file and the second file, will that lead to disastrous performance?
Lastly, if any of this is true, would it be better to mmap several smaller files?
As has been discussed, your file will be accessed in pages; on x86_64 (and IA32) architectures, a page is typically 4096 bytes. So, very little if any of the file will be loaded at mmap time. The first time you access some page in either file, then the kernel will generate a page fault and load some of your file. The kernel may prefetch pages, so more than one page may be loaded. Whether it does this depends on your access pattern.
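As a hedged illustration of that last point (the filename is a placeholder, not from the original answer), madvise() lets you tell the kernel what access pattern to expect, which influences how aggressively it reads ahead around each page fault:

```c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* "bigfile.dat" is a placeholder for the large file discussed above. */
    int fd = open("bigfile.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return EXIT_FAILURE; }

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    /* Hint the kernel about the expected access pattern; this influences
     * how much it prefetches around each page fault. */
    madvise(p, st.st_size, MADV_SEQUENTIAL);   /* or MADV_RANDOM */

    /* Touch the first byte: this is where the first page fault happens
     * and the first page(s) of the file are actually read in. */
    volatile char first = ((char *)p)[0];
    (void)first;

    munmap(p, st.st_size);
    close(fd);
    return EXIT_SUCCESS;
}
```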
In general, your performance should be good if your working set fits in memory. That is, if you're only regularly accessing 3G of file data across the two files, then as long as you have 3G of RAM available to your process, things should generally be fine.
On a 64-bit system there's no reason to split the files, and you'll be fine if the parts you need tend to fit in RAM.
Note that if you mmap an existing file, swap space will not be required to read that file. When an object is backed by a file on the filesystem, the kernel can read its pages back from that file rather than from swap space. However, if you specify MAP_PRIVATE in your call to mmap, pages you modify are not written back to the file, so swap space may be required to hold them.
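A minimal sketch of the difference, assuming fd is an open descriptor for an existing file and len is its size:

```c
#include <stddef.h>
#include <sys/mman.h>

/* fd: an open file descriptor for an existing file; len: its size. */
static void map_both_ways(int fd, size_t len)
{
    /* File-backed, shared: dirty pages are written back to the file
     * itself, so no swap space is needed to evict them. */
    void *shared = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* File-backed, private (copy-on-write): once a page is modified it
     * becomes anonymous, is never written back to the file, and can only
     * be evicted to swap. */
    void *priv = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

    /* ... use the mappings ... */
    munmap(shared, len);
    munmap(priv, len);
}
```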
Your question does not have a definitive answer, since swapping in and out is handled by the kernel, and each kernel will have a different implementation (and Linux itself offers different profiles depending on your usage: RT, desktop, server…).
Generally speaking, though, whatever you load into memory is handled in pages, so your mmap'ed file is loaded into (and evicted from) memory page by page across the levels of the memory hierarchy (the caches, RAM and swap).
So if you load two 10GB files into memory, parts of both will end up split between RAM and swap, and the kernel will try to keep in RAM the pages you are likely to use now, and guess what you will load next.
What this means is that if you do truly random access to a few bytes of data in both files alternately, you should expect awful performance; if you access contiguous chunks sequentially from both files alternately, you should expect decent performance.
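If you want to observe this rather than guess, mincore() reports which pages of a mapping are currently resident in RAM. A small sketch (my addition, not part of the original answer):

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* Print how many pages of the mapping [addr, addr+len) are resident in
 * RAM. addr must be page-aligned (mmap() return values always are). */
static void report_residency(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;
    unsigned char *vec = malloc(npages);
    if (!vec)
        return;

    if (mincore(addr, len, vec) == 0) {
        size_t resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;     /* bit 0: page is in RAM */
        printf("%zu of %zu pages resident\n", resident, npages);
    }
    free(vec);
}
```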
You can read more details about kernel paging and virtual memory theory here:
https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html
https://en.wikipedia.org/wiki/Paging

Updating of dirty pages in swap in Linux

What I have read:
Swap space has no filesystem.
The disk has a filesystem. Whenever a file on disk is modified, its modified content is written to a new disk block (not to the original block) and the associated data structures are updated.
Dirty pages are written back to swap before they are paged out (which happens for various reasons).
The question is: are dirty pages written back to their original page slots, or are they written to a new page slot? If they are written to a new page slot, what is the procedure?
Let me try to answer the questions you raise in generic terms.
First of all, the page partition is called a swap partition in Unix for historical reasons. In the old days before virtual memory, entire processes were swapped out; now processes are paged out.
For performance reasons, the operating system wants to do paging in complete blocks. A page generally maps to one or more disk blocks. On most non-Unix systems, the page file is a contiguous file. Paging is done using virtual block I/O to the page file (and to the executable file or libraries).
The traditional Unix (inode-based) filesystem was a quick-and-dirty design. There is no way to create a contiguous file; the only way to get contiguous storage is to use an entire disk or disk partition. Unix databases and page files have therefore traditionally been disk partitions (Mac OS uses a different scheme). Instead of doing virtual block I/O to a page file, the system does logical (or physical) I/O to the disk.
When a process allocates virtual memory, page-file space is normally a prerequisite, so the page-file location backing a page frame stays in the same place. If that were not the case, a process might need to page something out and find no available location in the page file.

Bypassing 4KB block size limitation on block layer/device

We are developing an SSD-type storage hardware device that can take read/write requests for big block sizes (>4KB at a time, even MBs in size).
My understanding is that Linux and its filesystems will "chop" files into 4KB blocks that are passed to the block device driver, which then needs to physically transfer each block to or from the device (e.g., for a write).
I am also aware that the kernel page size plays a role in this limitation, as it is set to 4KB.
As an experiment, I want to find out whether there is a way to actually increase this block size, so that we save some time (instead of doing multiple 4KB writes, we can do one write with a bigger block size).
Is there any FS or existing project I can look at for this?
If not, what is needed to do this experiment - which parts of Linux need to be modified?
I am trying to find out the level of difficulty and the resources needed, or whether it is even possible, and/or any reason why we would not need to do this at all. Any comment is appreciated.
Thanks.
The 4k limitation is due to the page cache. The main issue is that if you have a 4k page size, but a 32k block size, what happens if the file is only 2000 bytes long, so you only allocate a 4k page to cover the first 4k of the block. Now someone seeks to offset 20000, and writes a single byte. Now suppose the system is under a lot of memory pressure, and the 4k page for the first 2000 bytes, which is clean, gets pushed out of memory. How do you track which parts of the 32k block contain valid data, and what happens when the system needs to write out the dirty page at offset 20000?
Also, let's assume that the system is under a huge amount of memory pressure and we need to write out that last page; what if there isn't enough memory available to instantiate the other 28k of the 32k block so that we can do the read-modify-write cycle just to update that one dirty 4k page at offset 20000?
These problems can all be solved, but it would require a lot of surgery in the VM layer. The VM layer would need to know that, for this filesystem, pages need to be instantiated in chunks of 8 pages at a time, and that if there is memory pressure to push out a particular page, you need to write out all 8 pages at the same time if any of them are dirty, and then drop all 8 pages from the page cache at the same time. All of this implies that you want to track page usage and page dirtiness not at the 4k page level, but at the compound 32k page/"block" level. It will basically involve changes to almost every single part of the VM subsystem, from the page cleaner, to the page fault handler, the page scanner, the writeback algorithms, etc., etc., etc.
Also consider that even if you did hire a Linux VM expert to do this work (which the HDD vendors would deeply love you for, since they also want to be able to ship HDDs with a 32k or 64k physical sector size), it will be 5-7 years before such a modified VM layer makes its appearance in a Red Hat Enterprise Linux kernel, or the equivalent enterprise or LTS kernel for SuSE or Ubuntu. So if you are working at a startup that is hoping to sell your SSD product into the enterprise market, you might as well give up on this approach now. It's just not going to work before you run out of money.
Now, if you happen to be working for a large cloud company that makes its own hardware (à la Facebook, Amazon, Google, etc.), maybe you could go down this particular path, since they don't use enterprise kernels that add new features at a glacial pace - but for that same reason, they want to stick relatively close to the upstream kernel to minimize their maintenance cost.
If you do work for one of these large cloud companies, I'd strongly recommend contacting other companies in the same space; maybe you could collaborate on this kind of development work and together try to get the change upstream. It really, really is not a trivial change, though - especially since the upstream Linux kernel developers will demand that it not negatively impact performance in the common case, which will not involve >4k block devices any time in the near future. And if you work at a Facebook, Google, Amazon, etc., this is not the sort of change you would want to maintain as a private patch to your kernel, but something you would want to get upstream, since otherwise it would be such a massive, invasive change that supporting it as an out-of-tree patch would be a huge headache.
Although I've never written a device driver for Linux, I find it very unlikely that this is a real limitation of the driver interface. I guess it's possible that you would want to break I/O into scatter-gather lists where each entry in the list is one page long (to improve memory allocation performance and decrease memory fragmentation), but most device types can handle those directly nowadays, and I don't think anything in the driver interface actually requires it. In fact, the simplest way that requests are issued to block devices (described on page 13 -- marked as page 476 -- of that text) looks like it receives:
a sector start number
a number of sectors to transfer (no limit is mentioned, let alone a limit of 8 512B sectors)
a pointer to write the data into / read the data from (not a scatter-gather list for this simple case, I guess)
whether this is a read versus a write
I suspect that if you're seeing exclusively 4K accesses it's probably a result of the caller not requesting more than 4K at a time -- if the filesystem you're running on top of your device only issues 4K reads, or whatever is using the filesystem only accesses one block at a time, there is nothing your device driver can do to change that on its own!
Using one block at a time is common for random access patterns like database read workloads, but database log or FS journal writes, or large serial file reads on a traditional (not copy-on-write) filesystem, would issue large I/Os more like what you're expecting. If you want to try issuing large reads against your device directly, to see whether it's possible through whatever driver you have now, you could use dd if=/dev/yourdevice of=/dev/null with increasing bs values and see whether going from bs=4K to bs=1M shows a significant throughput increase.
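If you would rather test from a small program than with dd, here is a hedged sketch (the device path is a placeholder and reading a raw block device needs sufficient privileges) that opens the device with O_DIRECT and issues a single 1 MiB read, so the request size handed to the block layer is not dictated by a filesystem's 4K blocks (the block layer may still split it according to the device's queue limits):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* "/dev/sdX" is a placeholder; O_DIRECT bypasses the page cache, so
     * the read size we ask for is passed to the block layer largely
     * as-is rather than being generated 4K at a time by a filesystem. */
    int fd = open("/dev/sdX", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    size_t len = 1 << 20;                   /* 1 MiB request */
    void *buf;
    if (posix_memalign(&buf, 4096, len)) {  /* O_DIRECT needs alignment */
        close(fd);
        return EXIT_FAILURE;
    }

    ssize_t n = read(fd, buf, len);
    printf("read returned %zd bytes\n", n);

    free(buf);
    close(fd);
    return EXIT_SUCCESS;
}
```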

writeback of dirty pages in linux

I have a question regarding the writeback of dirty pages. If only a portion of a page's data is modified, will writeback write the whole page to the disk, or only the part of the page with the modified data?
The memory management hardware on x86 systems has a granularity of 4096 bytes. This means it is not possible to find out which bytes of a 4096-byte page have really changed and which ones are unchanged.
Theoretically the disk driver system could check if bytes have been changed and not write the 512-byte blocks that have not been changed.
However this would mean that - if the blocks are no longer in disk cache memory - the page must be read from hard disk to check if it has changed before writing.
I do not think Linux does it that way, because reading the page back from disk would cost too much time.
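A small sketch of the consequence ("data.bin" is a placeholder for an existing file of at least one page): modifying a single byte through a shared mapping dirties the whole 4096-byte page, and writeback - whether triggered by msync() or by the kernel's flusher threads - writes back the whole page.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* "data.bin" is a placeholder for an existing file >= 4096 bytes. */
    int fd = open("data.bin", O_RDWR);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    /* Modify a single byte: the MMU can only mark the whole 4096-byte
     * page as dirty, so writeback operates on the whole page, not on
     * the one changed byte. */
    p[100] ^= 0xff;

    msync(p, 4096, MS_SYNC);   /* force writeback of the dirty page now */

    munmap(p, 4096);
    close(fd);
    return EXIT_SUCCESS;
}
```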
On each hardware interrupt, the CPU wants to transfer as much data as the hard disk controller can handle - this size is what we call the block size (or one sector, in Linux):
http://en.wikipedia.org/wiki/Disk_sector
https://superuser.com/questions/121252/how-do-i-find-the-hardware-block-read-size-for-my-hard-drive
But waiting too long for a single interrupt for a large file can make the system appear unresponsive, so it is logical to break the transfer into smaller chunks (like 512 bytes) so that the CPU can handle other tasks while each 512-byte piece is transferred. Therefore, whether you changed one byte or 511 bytes, as long as the change is within a single block, all of that block's data gets written at the same time. And throughout the Linux kernel, blocks are flagged as dirty for writing (or not) by a single identifier - the sector number - so tracking anything smaller than a sector would be too difficult to manage efficiently.
All this said, don't forget that the hard disk controller itself also has a minimum block size for write operations.
