Linux: writes are split into 512K chunks

I have a user-space application that generates big SCSI writes (details below). However, when I'm looking at the SCSI commands that reach the SCSI target (i.e. the storage, connected by the FC) something is splitting these writes into 512K chunks.
The application basically issues 1M-sized direct writes straight to the device:
fd = open("/dev/sdab", ..|O_DIRECT);
write(fd, ..., 1024 * 1024);
This code causes two SCSI WRITEs to be sent, 512K each.
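For reference, here is a minimal self-contained sketch of such a write (the device path and fill pattern are illustrative; O_DIRECT additionally requires an aligned buffer):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1024 * 1024;          /* one 1M write from user space */
    void *buf;

    /* O_DIRECT requires an aligned buffer; 4096 covers most devices */
    if (posix_memalign(&buf, 4096, len) != 0)
        return 1;
    memset(buf, 0xab, len);

    int fd = open("/dev/sdab", O_WRONLY | O_DIRECT);
    if (fd < 0)
        return 1;

    /* a single write() here; the block layer may still split it
       into multiple SCSI WRITEs further down the stack */
    ssize_t n = pwrite(fd, buf, len, 0);

    close(fd);
    free(buf);
    return n == (ssize_t)len ? 0 : 1;
}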
However, if I issue a direct SCSI command, without the block layer, the write is not split.
I issue the following command from the command line:
sg_dd bs=1M count=1 blk_sgio=1 if=/dev/urandom of=/dev/sdab oflag=direct
I can see one single 1M-sized SCSI WRITE.
The question is, what is splitting the write and, more importantly, is it configurable?
The Linux block layer seems to be the culprit (since SG_IO doesn't pass through it), and 512K seems too arbitrary a number not to be some sort of configurable parameter.

As described in an answer to the "Why is the size of my IO requests being limited, to about 512K" Unix & Linux Stack Exchange question and the "Device limitations" section of the "When 2MB turns into 512KB" document by kernel block layer maintainer Jens Axboe, this can be because your device and kernel have size restrictions (visible in /sys/block/<disk>/queue/):
max_hw_sectors_kb: the maximum size of a single I/O the hardware can accept
max_sectors_kb: the maximum size the block layer will send
max_segment_size and max_segments: the DMA engine's scatter-gather (SG) limits (the maximum size of each segment and the maximum number of segments for a single I/O)
The segment restrictions matter a lot when the buffer the I/O comes from is not contiguous; in the worst case each segment can be as small as a page (4096 bytes on x86 platforms). This means the SG list can limit a single I/O to a size of 4096 * max_segments.
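A small sketch that prints these limits (assuming the disk is sdab; substitute your device name):

#include <stdio.h>

static void print_limit(const char *name)
{
    char path[128], value[64];
    snprintf(path, sizeof(path), "/sys/block/sdab/queue/%s", name);
    FILE *f = fopen(path, "r");
    if (f) {
        if (fgets(value, sizeof(value), f))
            printf("%-20s %s", name, value);   /* sysfs value ends with '\n' */
        fclose(f);
    }
}

int main(void)
{
    print_limit("max_hw_sectors_kb");
    print_limit("max_sectors_kb");
    print_limit("max_segments");
    print_limit("max_segment_size");
    return 0;
}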
The question is, what is splitting the write
As you guessed the Linux block layer.
and, more importantly, is it configurable?
You can fiddle with max_sectors_kb, but the rest is fixed and comes from device/driver restrictions (so in your case I'm going to guess probably not, but you might see bigger I/O directly after a reboot due to less memory fragmentation).
512K seems too arbitrary a number not to be some sort of a configurable parameter
The value is likely related to fragmented SG buffers. Let's assume you're on an x86 platform and have a max_segments of 128, so:
4096 * 128 / 1024 = 512
and that's where 512K could come from.
Bonus chatter: according to https://twitter.com/axboe/status/1207509190907846657 , if your device uses an IOMMU rather than a DMA engine then you shouldn't be segment limited...

The blame is indeed on the block layer; the SCSI layer itself has little regard for the size. You should check, though, that the underlying layers are indeed able to pass your request through, especially with regard to direct I/O, since that may be split into many small pages and require a scatter-gather list longer than what can be supported by the hardware or even just the drivers (libata is/was somewhat limited).
You should look at and tune /sys/class/block/$DEV/queue; there are assorted files there, and the one most likely to match what you need is max_sectors_kb, but you can just try it out and see what works for you. You may also need to tune the partitions' variables.
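If you want to script the tuning rather than echo by hand, a sketch along these lines raises max_sectors_kb (the device name is illustrative; root is required and the kernel still caps the value at max_hw_sectors_kb):

#include <stdio.h>

int main(void)
{
    /* ask the block layer to allow up to 1M per request; the kernel
       clamps this to max_hw_sectors_kb, so re-read the file to verify */
    FILE *f = fopen("/sys/class/block/sdab/queue/max_sectors_kb", "w");
    if (!f)
        return 1;
    fprintf(f, "1024\n");
    return fclose(f) == 0 ? 0 : 1;
}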

There's a max sectors per request attribute of the block driver. I'd have to check how to modify it. You used to be able to get this value via blockdev --getmaxsect but I'm not seeing the --getmaxsect option on my machine's blockdev.

Looking at the following files should tell you whether the logical block size is different, possibly 512 in your case. I am not sure, however, whether you can write to these files to change those values (the logical block size, that is).
/sys/block/<disk>/queue/physical_block_size
/sys/block/<disk>/queue/logical_block_size

try ioctl(fd, BLKSECTSET, &blocks)
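For what it's worth, here is a sketch that reads the current limit with the companion BLKSECTGET ioctl; whether BLKSECTSET actually changes anything on a modern kernel is something you would have to verify, and the device path is just an example:

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sdab", O_RDONLY);
    if (fd < 0)
        return 1;

    unsigned short max_sectors = 0;
    /* max sectors per request, the value blockdev --getmaxsect used to report */
    if (ioctl(fd, BLKSECTGET, &max_sectors) == 0)
        printf("max sectors per request: %hu\n", max_sectors);

    close(fd);
    return 0;
}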

Related

Storing and writing to text file in STM32 F4 internal flash memory

I need to store a text file in the STM32 F446RE internal flash memory. This text file will contain log data that needs to be written to and updated consistently. I know there are a couple of ways of writing to it, including embedding the text as a constant string/data in the source code or implementing a file system like FatFs (not suitable for STM32 F4 flash due to its sector orientation). It has a total of 8 sectors that vary in size: sectors 0-3 each contain 16 kB, sector 4 contains 64 kB, and sectors 5-7 each contain 128 kB. This translates to a total of 512 kB of flash memory. These are not sufficient for what I'm looking for, and I was wondering if anyone has ideas? I'm using the STM32CubeIDE.
Writing to FLASH memory requires an erase operation first. It seems that you already know that erase operations must be performed on whole sectors. Note also that FLASH memory wears out with repeated erase/write cycles.
I suggest one of three approaches depending upon how much data you must store and your coding abilities.
An "in-chip" approach is to implement a circular buffer in RAM and maintain your log there. If power is lost then you need code to commit that RAM buffer to FLASH. On power-up you need code to restore the RAM buffer from FLASH. This implies that your design does not suffer frequent power cycles and that you can maintain power to the microcontroller long enough to save the buffer from RAM to FLASH.
The next option is to use an external memory chip. EEPROMs are not terribly fast and are also subject to wear. FRAM is fast and endures trillions of writes before wear becomes an issue. It is available with I2C or SPI interfaces, so you can use a number of chips to provide a reasonable buffer size and handle the chip-to-memory mapping in your handler code. FRAM is not cheap, though.
Finally, there is an option to add an SSD drive. These devices include "wear levelling" to maintain their active lives. However, you need a suitable interface such as USB or PCI.
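For the first ("in-chip") approach, here is a minimal sketch of the RAM circular buffer; names and sizes are illustrative, and committing it to FLASH on power loss would use the usual sector-erase/program HAL calls, which are not shown:

#include <stdint.h>

#define LOG_SIZE 4096u          /* RAM budget for the log, adjust to taste */

typedef struct {
    uint8_t  data[LOG_SIZE];
    uint32_t head;              /* next write position */
    uint32_t used;              /* bytes currently stored */
} log_buf_t;

static log_buf_t g_log;

/* Append a record; the oldest data is overwritten once the buffer is full. */
void log_append(const uint8_t *rec, uint32_t len)
{
    for (uint32_t i = 0; i < len; i++) {
        g_log.data[g_log.head] = rec[i];
        g_log.head = (g_log.head + 1u) % LOG_SIZE;
    }
    g_log.used = (g_log.used + len > LOG_SIZE) ? LOG_SIZE : g_log.used + len;
}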
HTH

Bypassing 4KB block size limitation on block layer/device

We are developing an SSD-type storage hardware device that can take read/write requests for big block sizes, >4KB at a time (even MBs in size).
My understanding is that Linux and its filesystems will "chop" I/O into 4KB blocks that are passed to the block device driver, which will need to physically fill each block with data to/from the device (e.g., for a write).
I am also aware the kernel page size has a role in this limitation as it is set at 4KB.
As an experiment, I want to find out whether there is a way to actually increase this block size, so that we can save some time (instead of doing multiple 4KB writes, we could do one write with a bigger block size).
Is there any FS or any existing project that I can take a look for this?
If not, what is needed to do this experiment - what parts of linux needs to be modified?
I'm trying to find out the level of difficulty and the resources needed, or whether it is even impossible, and/or any reason why we do not even need to do so. Any comment is appreciated.
Thanks.
The 4k limitation is due to the page cache. The main issue is this: suppose you have a 4k page size but a 32k block size, and the file is only 2000 bytes long, so you only allocate a 4k page to cover the first 4k of the block. Now someone seeks to offset 20000 and writes a single byte. Now suppose the system is under a lot of memory pressure, and the 4k page for the first 2000 bytes, which is clean, gets pushed out of memory. How do you track which parts of the 32k block contain valid data, and what happens when the system needs to write out the dirty page at offset 20000?
Also, let's assume that the system is under a huge amount of memory pressure and we need to write out that last page; what if there isn't enough memory available to instantiate the other 28k of the 32k block, so we can do the read-modify-write cycle just to update that one dirty 4k page at offset 20000?
These problems can all be solved, but it would require a lot of surgery in the VM layer. The VM layer would need to know that for this file system, pages need to be instantiated in chunks of 8 pages at a time, and if there is memory pressure to push out a particular page, you need to write out all 8 of those pages at the same time if the block is dirty, and then drop all 8 pages from the page cache at the same time. All of this implies that you want to track page usage and dirtiness not at the 4k page level, but at the compound 32k page/"block" level. It basically will involve changes to almost every single part of the VM subsystem, from the page cleaner, to the page fault handler, the page scanner, the writeback algorithms, etc., etc., etc.
Also consider that even if you did hire a Linux VM expert to do this work, (which the HDD vendors would deeply love you for, since they also want to be able to deploy HDD's with a 32k or 64k physical sector size), it will be 5-7 years before such a modified VM layer would make its appearance in a Red Hat Enterprise Linux kernel, or the equivalent enterprise or LTS kernel for SuSE or Ubuntu. So if you are working at a startup who is hoping to sell your SSD product into the enterprise market --- you might as well give up now with this approach. It's just not going to work before you run out of money.
Now, if you happen to be working for a large cloud company that makes its own hardware (a la Facebook, Amazon, Google, etc.), maybe you could go down this particular path, since they don't use enterprise kernels that add new features at a glacial pace --- but for that reason, they want to stick relatively close to the upstream kernel to minimize their maintenance cost.
If you do work for one of these large cloud companies, I'd strongly recommend that you contact other companies who are in this same space; maybe you could collaborate with them to see if together you could do this kind of development work and try to get this kind of change upstream. It really, really is not a trivial change, though --- especially since the upstream Linux kernel developers will demand that this not negatively impact performance in the common case, which will not involve >4k block devices any time in the near future. And if you work at a Facebook, Google, Amazon, etc., this is not the sort of change that you would want to maintain as a private change to your kernel, but something that you would want to get upstream, since otherwise it would be such a massive, invasive change that supporting it as an out-of-tree patch would be a huge headache.
Although I've never written a device driver for Linux, I find it very unlikely that this is a real limitation of the driver interface. I guess it's possible that you would want to break I/O into scatter-gather lists where each entry in the list is one page long (to improve memory allocation performance and decrease memory fragmentation), but most device types can handle those directly nowadays, and I don't think anything in the driver interface actually requires it. In fact, the simplest way that requests are issued to block devices (described on page 13 -- marked as page 476 -- of that text) looks like it receives:
a sector start number
a number of sectors to transfer (no limit is mentioned, let alone a limit of 8 512B sectors)
a pointer to write the data into / read the data from (not a scatter-gather list for this simple case, I guess)
whether this is a read versus a write
I suspect that if you're seeing exclusively 4K accesses it's probably a result of the caller not requesting more than 4K at a time -- if the filesystem you're running on top of your device only issues 4K reads, or whatever is using the filesystem only accesses one block at a time, there is nothing your device driver can do to change that on its own!
Using one block at a time is common for random access patterns like database read workloads, but database log or FS journal writes or large serial file reads on a traditional (not copy-on-write) filesystem would issue large I/Os more like what you're expecting. If you want to try issuing large reads against your device directly to see if it's possible through whatever driver you have now, you could use dd if=/dev/rdiskN of=/dev/null bs=N to see if increasing the bs parameter from 4K to 1M shows a significant throughput increase.
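Equivalently, here is a rough C sketch of timing one large O_DIRECT read against the raw device; the device path and size are placeholders, and you would vary len (4K, 64K, 1M, ...) and compare throughput:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1024 * 1024;      /* try 4096, 65536, 1048576, ... */
    void *buf;
    if (posix_memalign(&buf, 4096, len))
        return 1;

    int fd = open("/dev/sdab", O_RDONLY | O_DIRECT);
    if (fd < 0)
        return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = pread(fd, buf, len, 0);  /* one large direct read */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("read %zd bytes in %.6f s\n", n, secs);

    close(fd);
    free(buf);
    return 0;
}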

writeback of dirty pages in linux

I have a question regarding the writeback of the dirty pages. If a portion of page data is modified, will the writeback write the whole page to the disk, or only the partial page with modified data?
The memory management hardware on x86 systems has a granularity of 4096 bytes. This means it is not possible to find out which bytes of a 4096-byte page have really changed and which ones are unchanged.
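A trivial sketch to confirm that granularity on a given machine:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* the VM tracks dirtiness at this page granularity */
    printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
    return 0;
}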
Theoretically the disk driver system could check if bytes have been changed and not write the 512-byte blocks that have not been changed.
However this would mean that - if the blocks are no longer in disk cache memory - the page must be read from hard disk to check if it has changed before writing.
I do not think that Linux would do this in that way because reading the page from disk would cost too much time.
On EACH hardware interrupt, the CPU would like to write as much data as the hard disk controller can handle - this size is what we define as the block size (or ONE sector, in Linux):
http://en.wikipedia.org/wiki/Disk_sector
https://superuser.com/questions/121252/how-do-i-find-the-hardware-block-read-size-for-my-hard-drive
But waiting too long on a SINGLE interrupt for a large file can make the system appear unresponsive, so it is logical to break the transfer into smaller chunks (like 512 bytes) so that the CPU can handle other tasks while transferring each 512 bytes down. Therefore, whether you changed one byte or 511 bytes, as long as it is within that single block, all the data gets written at the same time. And throughout the Linux kernel, flagging blocks as dirty for write or not goes by a single unique identifier, the sector number, so anything smaller than the sector size is too difficult to manage efficiently.
All that said, don't forget that the hard disk controller itself also has a minimum block size for write operations.

Implementing asynchronous file system with FUSE on Linux

I tried to ask on FUSE's mailing list but I haven't received any response so far... I have a couple of questions. I'm going to implement a low-level FUSE file system and watch over fuse_chan's descriptor with epoll.
(1) I have to fake inodes for all objects in my file system, right? Are there any rules about choosing inodes for objects in VFS (e.g. do I have to use only positive values or can I use values in some range)?
(2) Can I make fuse_chan's descriptor nonblocking? If yes, please tell me whether I can assume that fuse_chan_recv()/fuse_chan_send() will receive/send a whole request structure, or do I have to override them with functions handling partial send and receive?
(3) What about buffer size? I see that in fuse_loop() a new buffer is allocated for each call, so I assume that the buffer size is not fixed. However, maybe there is some maximum possible buffer size? I can then allocate a larger buffer and reduce memory allocation operations.
(1) Inodes are defined as unsigned integers, so in theory, you could use any values.
However, since there could be programs which are not careful, I'd play it safe and only use non-zero, positive integers up to INT_MAX.
(2) FUSE uses a special kernel device. While fuse_chan_recv() does not support partial reads, that may not be required, as the kernel should not return partial packets anyway.
(3) Path names in Linux are at most 4096 chars. This puts a limit on the buffer size:
$ grep PATH_MAX /usr/include/linux/limits.h
#define PATH_MAX 4096 /* # chars in a path name including nul */
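As an illustration of the epoll part of (2): with libfuse 2 the channel descriptor comes from fuse_chan_fd(), but the sketch below is written against the libfuse 3 low-level API (where the descriptor is exposed via fuse_session_fd()), and it assumes the session has already been created and mounted elsewhere:

#define FUSE_USE_VERSION 31
#include <fuse_lowlevel.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Sketch only: pull requests off the FUSE fd when epoll says it is readable. */
void epoll_fuse_loop(struct fuse_session *se)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fuse_session_fd(se), &ev);

    struct fuse_buf buf = { .mem = NULL };   /* libfuse (re)allocates .mem */

    while (!fuse_session_exited(se)) {
        struct epoll_event out;
        if (epoll_wait(epfd, &out, 1, -1) <= 0)
            continue;
        /* receives one whole request; returns its size or a negative errno */
        int res = fuse_session_receive_buf(se, &buf);
        if (res > 0)
            fuse_session_process_buf(se, &buf);
    }

    free(buf.mem);
    close(epfd);
}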

Linux block device with odd (not even) size

Is it possible to create a Linux (2.6) block device (such as a loopback device) with an odd size? I couldn't make it happen. losetup seems to round down to a 512-byte boundary. The ubd devices of User-mode Linux seem to round up to a 512-byte boundary. In struct request, we have sector_t __sector for the block offset for read/write operations.
I'm asking this question just for educational purposes. I can cope with the 512 byte boundary, but I'm still interested if it would be possible to bypass it. In this question I'm not interested in other layers of abstraction (such as using regular files or character devices).
No. The Linux 2.6 block layer doesn't comprehend anything smaller than 512 bytes. Anything smaller (especially not a power of 2) would require a major rewrite of an awful lot of code.
This is what makes a block device a block device rather than a character device: the block granularity. The dichotomy exists because it is vastly more efficient to model real hardware that works a block at a time as an abstraction that also deals in blocks. To do otherwise would turn every operation into a much more costly computation.
The way to bypass it is, as you mention, to use a character oriented device or abstraction. This is central to the Unix device model: everything is a series of octets, except for the things that can only be virtualized as one.
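For the record, here is a sketch that queries what the block layer reports for a device's size and logical sector size, which makes the 512-byte granularity visible; the loop device path is just an example:

#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/loop0", O_RDONLY);
    if (fd < 0)
        return 1;

    uint64_t bytes = 0;
    int lbs = 0;
    ioctl(fd, BLKGETSIZE64, &bytes);   /* total size in bytes (rounded by the layer) */
    ioctl(fd, BLKSSZGET, &lbs);        /* logical sector size */
    printf("%llu bytes, %d-byte logical sectors\n",
           (unsigned long long)bytes, lbs);

    close(fd);
    return 0;
}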
