Bypassing 4KB block size limitation on block layer/device - linux

We are developing an SSD-type storage hardware device that can take read/write requests for block sizes larger than 4KB at a time (even in the MB range).
My understanding is that Linux and its filesystems will "chop" files into 4KB blocks that are passed down to the block device driver, which then has to physically transfer each block to or from the device (e.g., for a write).
I am also aware that the kernel page size plays a role in this limitation, since it is set to 4KB.
As an experiment, I want to find out whether there is a way to actually increase this block size, so that we save some time (instead of doing multiple 4KB writes, we could do a single write with a bigger block size).
Is there any filesystem or existing project that I can look at for this?
If not, what would be needed to do this experiment? Which parts of Linux would need to be modified?
I am trying to gauge the level of difficulty and the resources needed, or whether it is even impossible to do, and/or whether there is a reason we do not even need to do so. Any comment is appreciated.
Thanks.

The 4k limitation is due to the page cache. The main issue is this: if you have a 4k page size but a 32k block size, what happens if a file is only 2000 bytes long, so you only allocate a 4k page to cover the first 4k of the block? Now someone seeks to offset 20000 and writes a single byte. Then suppose the system comes under a lot of memory pressure, and the 4k page for the first 2000 bytes, which is clean, gets pushed out of memory. How do you track which parts of the 32k block contain valid data, and what happens when the system needs to write out the dirty page at offset 20000?
Also, assume the system is under a huge amount of memory pressure and we need to write out that last page; what if there isn't enough memory available to instantiate the other 28k of the 32k block, so that we can do the read-modify-write cycle needed just to update that one dirty 4k page at offset 20000?
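To make that read-modify-write cycle concrete, here is a rough userspace sketch (the device path, block size, and offsets are made up): to change 4k in the middle of a 32k block, you first have to read the whole 32k block, patch it, and then write the whole 32k back.

    /* Sketch: updating 4 KiB inside a 32 KiB block on a device whose smallest
     * addressable unit is 32 KiB forces a read-modify-write of the whole block.
     * Hypothetical device path and offsets; error handling abbreviated. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK 32768            /* device block size (assumed)           */
    #define PAGE  4096             /* the region we actually want to change */

    int main(void)
    {
        int fd = open("/dev/hypothetical_bigblock_dev", O_RDWR);
        if (fd < 0) return 1;

        char *block;
        if (posix_memalign((void **)&block, BLOCK, BLOCK)) return 1;

        off_t block_start = 0;     /* the 32 KiB block containing our offset */

        /* 1. Read the full 32 KiB block, even though only 4 KiB will change. */
        if (pread(fd, block, BLOCK, block_start) != BLOCK) return 1;

        /* 2. Modify one 4 KiB page worth of data inside it (around offset 20000). */
        memset(block + 4 * PAGE, 0xAB, PAGE);

        /* 3. Write the full 32 KiB back; the device cannot accept anything smaller. */
        if (pwrite(fd, block, BLOCK, block_start) != BLOCK) return 1;

        free(block);
        close(fd);
        return 0;
    }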
These problems can all be solved, but it would require a lot of surgery in the VM layer. The VM layer would need to know that for this file system, pages need to be instantiated in chunks of 8 pages at a time, and if there is memory pressure to push out a particular page, you need to write out all 8 pages at the same time if any are dirty, and then drop all 8 pages from the page cache at the same time. All of this implies that you want to track page usage and dirtiness not at the 4k page level, but at the compound 32k page/"block" level. It will basically involve changes to almost every single part of the VM subsystem, from the page cleaner, to the page fault handler, the page scanner, the writeback algorithms, etc., etc., etc.
Also consider that even if you did hire a Linux VM expert to do this work (which the HDD vendors would deeply love you for, since they also want to be able to deploy HDDs with a 32k or 64k physical sector size), it will be 5-7 years before such a modified VM layer would make its appearance in a Red Hat Enterprise Linux kernel, or the equivalent enterprise or LTS kernel from SuSE or Ubuntu. So if you are working at a startup that is hoping to sell your SSD product into the enterprise market --- you might as well give up now on this approach. It's just not going to work before you run out of money.
Now, if you happen to be working for a large cloud company that makes its own hardware (a la Facebook, Amazon, Google, etc.), maybe you could go down this particular path, since they don't use enterprise kernels that add new features at a glacial pace --- but even so, they want to stick relatively close to the upstream kernel to minimize their maintenance cost.
If you do work for one of these large cloud companies, I'd strongly recommend that you contact other companies in this same space; maybe you could collaborate with them to do this kind of development work and together try to get this kind of change upstream. It really, really is not a trivial change, though --- especially since the upstream Linux kernel developers will demand that it not negatively impact performance in the common case, which will not involve >4k block devices any time in the near future. And if you work at a Facebook, Google, Amazon, etc., this is not the sort of change that you would want to maintain as a private patch to your kernel, but something that you would want to get upstream, since otherwise it would be such a massive, invasive change that supporting it as an out-of-tree patch would be a huge headache.

Although I've never written a device driver for Linux, I find it very unlikely that this is a real limitation of the driver interface. I guess it's possible that you would want to break I/O into scatter-gather lists where each entry in the list is one page long (to improve memory allocation performance and decrease memory fragmentation), but most device types can handle those directly nowadays, and I don't think anything in the driver interface actually requires it. In fact, the simplest way that requests are issued to block devices (described on page 13 -- marked as page 476 -- of that text) looks like it receives:
a sector start number
a number of sectors to transfer (no limit is mentioned, let alone a limit of 8 512B sectors)
a pointer to write the data into / read the data from (not a scatter-gather list for this simple case, I guess)
whether this is a read versus a write
I suspect that if you're seeing exclusively 4K accesses it's probably a result of the caller not requesting more than 4K at a time -- if the filesystem you're running on top of your device only issues 4K reads, or whatever is using the filesystem only accesses one block at a time, there is nothing your device driver can do to change that on its own!
Using one block at a time is common for random access patterns like database read workloads, but database log or FS journal writes or large serial file reads on a traditional (not copy-on-write) filesystem would issue large I/Os more like what you're expecting. If you want to try issuing large reads against your device directly to see if it's possible through whatever driver you have now, you could use dd if=/dev/rdiskN of=/dev/null bs=N to see if increasing the bs parameter from 4K to 1M shows a significant throughput increase.
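If you would rather probe this from C than with dd, a rough sketch along these lines (the device path is made up, and O_DIRECT plus the 1 MiB default request size are assumptions) issues one large read per syscall, so you can compare, say, 4K versus 1M request sizes and see whether the larger requests actually help:

    /* Rough throughput probe: read a device with a configurable request size.
     * O_DIRECT is used so the page cache does not hide what reaches the driver. */
    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        size_t req = argc > 1 ? strtoul(argv[1], NULL, 0) : 1 << 20;  /* default 1 MiB */
        int fd = open("/dev/hypothetical_ssd", O_RDONLY | O_DIRECT);  /* made-up path */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, req)) return 1;   /* O_DIRECT wants aligned buffers */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        long long total = 0;
        for (int i = 0; i < 1024; i++) {                 /* issue 1024 sequential requests */
            ssize_t n = read(fd, buf, req);
            if (n <= 0) break;
            total += n;
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%lld bytes in %.3f s (%.1f MiB/s)\n", total, secs, total / secs / (1 << 20));

        free(buf);
        close(fd);
        return 0;
    }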

How to make the OS schedule disk accesses optimally?

Suppose that a process needs to access the file system in many (1000+) places, and the order is not important to the program logic. However, the order obviously matters for performance if the file system is stored on a (spinning) hard disk.
How can the application programmer communicate to the OS that it should schedule the accesses optimally? Launching 1000+ threads does not seem practical. Does database management software accomplish this, and if so, then how?
Additional details: I had a large (1TB+) mmapped file where I needed to read 1000+ chunks of about 1KB, each time in new, unpredictable places.
In the early days, when parameters like seek time (Wikipedia: Hard disk drive performance characteristics → Seek time) were very expensive and thus very important, database vendors paid attention to the on-disk data representation and layout, as can be seen e.g. in Oracle8i: Designing and Tuning for Performance → Tuning I/O.
The important optimization parameters changed with the appearance of solid-state drives (SSDs), where the seek time is effectively zero (or at least constant) because there is nothing to rotate. Some of the new parameters are addressed by Wikipedia: Solid-state drive (SSD) → optimized file systems.
But even those optimization parameters go away with the use of Wikipedia: In-memory databases. The list of vendors is pretty long, and all the big players are on it.
So how to schedule your accesses optimally depends a lot on the use case (1000 concurrent hits is not a sufficient problem description); buying some RAM is one of the options, and "how can the programmer communicate with the OS" will be one of the last (not first) questions.
Files and their transactions are cached in various devices in your computer; RAM and the HD cache are the most usual places. The file system driver may also implement IO transaction queues, defragmentation, and error-correction logic that makes things complicated for the developer who wants to control every aspect of file access. This level of complexity is ultimately designed to provide integrity, security, performance, and coordination of file access across all processes of your system.
Optimization efforts should not interfere with the system's own caching and prediction algorithms, not just for IO but for all caches. Trying to second-guess your system is a waste of your time and your processors' time.
Most probably your IO operations and data will stay in caches and later be committed to your storage devices when your OS sees fit.
That said, there are always options like database suites, mmap, readahead mechanisms, and direct IO to your drive. You will need to invest time benchmarking any of your efforts. I advise against multiple IO threads because cache contention will make things even slower than one thread.
The kernel will already reorder the read/write requests (e.g. to fit the spin of a mechanical disk), if they come from various processes or threads. BTW, most of the reads & writes would go to the kernel file system cache, not to the disk.
You might consider using posix_fadvise(2) and perhaps (in a separate thread) readahead(2). If, instead of read(2)-ing, you use mmap(2) to project some file portion into virtual memory, you might also use madvise(2).
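For example, a sketch along these lines (the file name, chunk count, and offsets are made up) hands the kernel all the upcoming random reads up front, so it can queue and reorder the underlying disk accesses however it sees fit:

    /* Sketch: hint all upcoming random 1K reads to the kernel before reading. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define NCHUNKS 1000
    #define CHUNK   1024

    int main(void)
    {
        int fd = open("bigfile.dat", O_RDONLY);      /* made-up file name */
        if (fd < 0) return 1;

        struct stat st;
        fstat(fd, &st);
        long pg = sysconf(_SC_PAGESIZE);

        /* Pretend these 1000 offsets came from an index lookup. */
        off_t offsets[NCHUNKS];
        for (int i = 0; i < NCHUNKS; i++)
            offsets[i] = (rand() % (st.st_size / CHUNK)) * CHUNK;

        /* read() path: tell the kernel what we will need before we need it. */
        for (int i = 0; i < NCHUNKS; i++)
            posix_fadvise(fd, offsets[i], CHUNK, POSIX_FADV_WILLNEED);

        /* mmap() path: same idea with madvise(); addresses must be page-aligned. */
        char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map != MAP_FAILED) {
            for (int i = 0; i < NCHUNKS; i++) {
                off_t aligned = offsets[i] & ~((off_t)pg - 1);
                madvise(map + aligned, CHUNK + (offsets[i] - aligned), MADV_WILLNEED);
            }
            /* ... read the chunks; by now many pages are already on their way ... */
            munmap(map, st.st_size);
        }

        close(fd);
        return 0;
    }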
Of course, the file system does not usually guarantee that a sequential portion of a file is physically located sequentially on the disk (and even the disk firmware might reorder sectors). See the picture on the Ext2 wikipage, also relevant for Ext4. Some file systems might be better in that respect, and you could tune their block size (at mkfs time).
I would not recommend having thousands of threads (at most a few dozen).
Finally, it might be worth buying an SSD or some more RAM (for the file cache). See http://linuxatemyram.com/
Actual performance would depend a lot on the particular system and hardware.
Perhaps using an indexed file library like GDBM or a database library like SQLite (or a real database like PostgreSQL) would be worthwhile! Having fewer but bigger files could also help.
BTW, you are mmap-ing and reading small chunks of 1K (smaller than the 4K page size). You could use madvise (if possible, in advance), but you should try to read larger chunks, since every file access will bring in at least a whole page.
You really should benchmark!

writeback of dirty pages in linux

I have a question regarding the writeback of the dirty pages. If a portion of page data is modified, will the writeback write the whole page to the disk, or only the partial page with modified data?
The memory management hardware on x86 systems has a granularity of 4096 bytes. This means it is not possible to find out which bytes of a 4096-byte page have really changed and which ones are unchanged.
Theoretically, the disk driver could compare the data and skip writing the 512-byte blocks that have not changed.
However, this would mean that, if the blocks are no longer in the disk cache, the page must first be read back from the hard disk to check what has changed before writing.
I do not think Linux does it that way, because reading the page back from disk would cost too much time.
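A quick way to convince yourself of the page granularity (a sketch; the file name is made up, and you would need something like iostat or blktrace to watch the resulting I/O) is to dirty a single byte through a file mapping and force it out:

    /* Sketch: dirty one byte of a mapped file and force writeback. Watching
     * the device should show at least a full page/sector being written. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("testfile.dat", O_RDWR);      /* made-up, pre-existing file */
        if (fd < 0) return 1;

        long pg = sysconf(_SC_PAGESIZE);            /* normally 4096 */
        char *map = mmap(NULL, pg, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) return 1;

        map[100] = 'x';                             /* dirties the whole containing page */
        msync(map, pg, MS_SYNC);                    /* force writeback of that page now */

        munmap(map, pg);
        close(fd);
        return 0;
    }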
Upon each hardware interrupt, the CPU wants to transfer as much data as the hard disk controller can handle; this size is what we define as the blksize (or one sector, in Linux):
http://en.wikipedia.org/wiki/Disk_sector
https://superuser.com/questions/121252/how-do-i-find-the-hardware-block-read-size-for-my-hard-drive
But waiting too long on a single interrupt for a large file can make the system appear unresponsive, so it is logical to break the transfer into smaller chunks (like 512 bytes) so that the CPU can handle other tasks while each 512-byte chunk is transferred. Therefore, whether you changed one byte or 511 bytes, as long as the change is within that single block, all of its data gets written at the same time. And throughout the Linux kernel, flagging blocks as dirty (or not) for writeback goes by a single unique identifier, the sector number, so anything smaller than the sector size is too fine-grained to manage efficiently.
All that said, don't forget that the hard disk controller itself also has a minimum block size for write operations.
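If you want to check those sizes for a particular device, the block layer exposes them through ioctls; a small sketch (the device path is just an example, and you need permission to open it):

    /* Sketch: query a block device's logical and physical sector sizes. */
    #include <fcntl.h>
    #include <linux/fs.h>     /* BLKSSZGET, BLKPBSZGET */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/sda", O_RDONLY);   /* example device */
        if (fd < 0) { perror("open"); return 1; }

        int logical = 0;
        unsigned int physical = 0;
        ioctl(fd, BLKSSZGET, &logical);        /* logical sector size, e.g. 512   */
        ioctl(fd, BLKPBSZGET, &physical);      /* physical sector size, e.g. 4096 */

        printf("logical %d bytes, physical %u bytes\n", logical, physical);
        close(fd);
        return 0;
    }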

Block based storage

I would like to store a couple of entries in a file (optimized for reading), and a good data structure for that seems to be a B+ tree. It offers O(log(n)/log(b)) access time, where b is the number of entries in one block.
There are many papers etc. describing B+ trees, but I still have some trouble understanding block-based storage systems in general. Maybe someone can point me in the right direction or answer a couple of questions:
Do (all common) file systems create new files at the beginning of a new block? So, can I be sure that seek(0) will set the read/write head to a multiple of the device's block size?
Is it right that I only should use calls like pread(fd, buf, n * BLOCK_SIZE, p * BLOCK_SIZE) (with n, p being integers) to ensure that I always read full blocks?
Is it better to read() BLOCK_SIZE bytes into an array or mmap() those instead? Or is there only a difference if I mmap many blocks and access only a few? What is better?
Should I try to avoid keys spanning multiple blocks by adding padding bytes at the end of each block? Should I do the same for the leaf nodes, by adding padding bytes between the data too?
Many thanks,
Christoph
In general, file systems create new files at the beginning of a new block because that is how the underlying device works. Hard disks are block devices and thus cannot handle anything less than a "block" or "sector". Additionally, operating systems treat memory and memory mappings in terms of pages, which are usually even larger (sectors are often 512 or 1024 bytes, pages usually 4096 bytes).
One exception to this rule that comes to mind would be ReiserFS, which puts small files directly into the filesystem structure (which, if I remember right, is incidentally a B+ tree!). For very small files this can actually be a viable optimization, since the data is already in RAM without another seek, but it can equally be an anti-optimization, depending on the situation.
It does not really matter, because the operating system will read data in units of full pages (normally 4kB) into the page cache anyway. Reading one byte will transfer 4kB and return a byte, reading another byte will serve you from the page cache (if it's the same page or one that was within the readahead range).
read is implemented by copying data from the page cache, whereas mmap simply remaps the pages into your address space (possibly marking them copy-on-write, depending on your protection flags). Therefore, mmap will always be at least as fast and usually faster. mmap is more comfortable too, but has the disadvantage that it may block at unexpected times when it needs to fetch more pages that are not in RAM (though that is generally true for any application or data that is not locked into memory). read, on the other hand, blocks when you tell it to, and not otherwise.
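To put the two paths side by side, here is a minimal sketch (made-up file name and block number) that fetches the same block once with pread into a private buffer and once through mmap:

    /* Sketch: pread copies from the page cache; mmap maps the cached pages directly. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096

    int main(void)
    {
        int fd = open("index.db", O_RDONLY);     /* made-up file name */
        if (fd < 0) return 1;

        off_t block_no = 7;                      /* some block we want */

        /* pread: the kernel copies BLOCK_SIZE bytes out of the page cache. */
        char *buf = malloc(BLOCK_SIZE);
        pread(fd, buf, BLOCK_SIZE, block_no * BLOCK_SIZE);

        /* mmap: the same page-cache pages are mapped into our address space with
         * no copy; touching them may fault and block if they are not in RAM. */
        char *map = mmap(NULL, BLOCK_SIZE, PROT_READ, MAP_SHARED, fd,
                         block_no * BLOCK_SIZE);
        if (map != MAP_FAILED) {
            volatile char first = map[0];        /* forces the page in */
            (void)first;
            munmap(map, BLOCK_SIZE);
        }

        free(buf);
        close(fd);
        return 0;
    }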
The same is true under Windows with the exception that memory mapped files under pre-Vista Windows don't scale well under high concurrency, as the cache manager serializes everything.
Generally one tries to keep data compact, because less data means fewer pages, and fewer pages means higher likelihood they're in the page cache and fit within the readahead range. Therefore I would not add padding, unless it is necessary for other reasons (alignment).
Filesystems that support delayed allocation don't decide where on disc new files go right away. Lots of newer filesystems support packing very small files into their own pages or sharing them with metadata (for example, reiser puts very tiny files into the inode?). But for larger files, mostly, yes.
You can do this, but the OS page cache will always read an entire block in, and just copy the bits you requested into your app's memory.
It depends on whether you're using direct IO or non-direct IO.
If you're using direct IO, which bypasses the OS's cache, you don't use mmap. Most databases do not use mmap and use direct IO.
Direct IO means that the pages don't go through the OS's page cache, they don't get cached at all by the OS and don't push other blocks out of the OS cache. It also means that all reads and writes need to be done on block boundaries. Block boundaries can sometimes be determined by a statfs call on the filesystem.
Most databases seem to take the view that they should manage their own page cache themselves, and use the OS only for physical reads/writes. Therefore they typically use direct and synchronous IO.
Linus Torvalds famously disagrees with this approach. I think the vendors really do it to achieve better consistency of behaviour across different OSs.
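For reference, a rough sketch of a block-aligned direct read (the file name is made up, and the 4096-byte alignment is an assumption; check what your filesystem actually requires, e.g. via statfs, before relying on it):

    /* Sketch: a block-aligned read with O_DIRECT, bypassing the page cache. */
    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096        /* assumed; verify against the filesystem */

    int main(void)
    {
        int fd = open("tree.db", O_RDONLY | O_DIRECT);      /* made-up file name */
        if (fd < 0) return 1;

        void *buf;
        if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE))   /* buffer must be aligned */
            return 1;

        /* With O_DIRECT, the offset and length must also be block-size multiples. */
        off_t p = 42;                                       /* block number to fetch */
        ssize_t n = pread(fd, buf, BLOCK_SIZE, p * BLOCK_SIZE);
        (void)n;

        free(buf);
        close(fd);
        return 0;
    }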
Yes. Doing otherwise would cause unnecessary complications in FS design.
And the options (as an alternative to "only") are ...?
In Windows, memory-mapped files work faster than the file API (ReadFile). I guess on Linux it's the same, but you can conduct your own measurements.

Why do discussions of "swappiness" act like information can only be in one place at a time?

I've been reading up on Linux's "swappiness" tuneable, which controls how aggressive the kernel is about swapping applications' memory to disk when they're not being used. If you Google the term, you get a lot of pages like this discussing the pros and cons. In a nutshell, the argument goes like this:
If your swappiness is too low, inactive applications will hog all the system memory that other programs might want to use.
If your swappiness is too high, when you wake up those inactive applications, there's going to be a big delay as their state is read back off the disk.
This argument doesn't make sense to me. If I have an inactive application that's using a ton of memory, why doesn't the kernel page its memory to disk AND leave another copy of that data in-memory? This seems to give the best of both worlds: if another application needs that memory, it can immediately claim the physical RAM and start writing to it, since another copy of it is on disk and can be swapped back in when the inactive application is woken up. And when the original app wakes up, any of its pages that are still in RAM can be used as-is, without having to pull them off the disk.
Or am I missing something?
If I have an inactive application that's using a ton of memory, why doesn't the kernel page its memory to disk AND leave another copy of that data in-memory?
Let's say we did it. We wrote the page to disk, but left it in memory. A while later another process needs memory, so we want to kick out the page from the first process.
We need to know with absolute certainty whether the first process has modified the page since it was written out to disk. If it has, we have to write it out again. The way we would track this is to take away the process's write permission to the page back when we first wrote it out to disk. If the process tries to write to the page again there will be a page fault. The kernel can note that the process has dirtied the page (and will therefore need to be written out again) before restoring the write permission and allowing the application to continue.
Therein lies the problem. Taking away write permission from the page is actually somewhat expensive, particularly on multiprocessor machines, because all CPUs must purge that page translation from their caches to make sure the write permission is really gone.
If the process does write to the page, taking a page fault is even more expensive. I'd presume that a non-trivial number of these pages would end up taking that fault, which eats into the gains we were looking for by leaving it in memory.
So is it worth doing? I honestly don't know. I'm just trying to explain why leaving the page in memory isn't so obvious a win as it sounds.
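As a userspace analogy for the mechanism described above (not the kernel's actual code path, just an illustration of the write-protect-and-fault idea), you can take write permission away with mprotect and catch the resulting fault to learn that a page has been dirtied again:

    /* Illustration only: write-protect a page after "writing it out",
     * and catch the fault that marks it dirty again. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *page;
    static long  pagesz;
    static volatile sig_atomic_t dirty;

    static void on_fault(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        /* The process touched the protected page: remember that it is dirty
         * again and restore write access so the faulting write can proceed. */
        dirty = 1;
        mprotect((char *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1)),
                 pagesz, PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        pagesz = sysconf(_SC_PAGESIZE);
        page = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        strcpy(page, "initial contents");

        /* "Write the page out" and drop write permission, as described above. */
        mprotect(page, pagesz, PROT_READ);
        dirty = 0;

        page[0] = 'X';                       /* faults; handler marks it dirty */
        printf("dirty after write: %d\n", dirty);
        return 0;
    }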
(*) This whole thing is very similar to a mechanism called Copy-On-Write, which is used when a process fork()s. The child process is very likely going to execute just a few instructions and call exec(), so it would be silly to copy all of the parent's pages. Instead, the write permission is taken away and the child is simply allowed to run. Copy-On-Write is a win because the page fault is almost never taken: the child almost always calls exec() immediately.
Even if you page the app's memory to disk and keep it in memory, you would still have to decide when an application should be considered "inactive", and that is what swappiness controls. Paging to disk is expensive in terms of IO and you don't want to do it too often. There is also another variable in this equation: the fact that Linux uses the remaining memory as disk buffers/cache.
According to this 1, that is exactly what Linux does.
I'm still trying to make sense of a lot of this, so any authoritative links would be appreciated.
The first thing the VM does is clean pages and move them to the clean list.
When cleaning anonymous memory (things which do not have an actual file backing store; you can see the segments in /proc/<pid>/maps which are anonymous and have no filesystem vnode storage behind them), the first thing the VM is going to do is take the "dirty" pages and "clean" them by writing the contents of the page out to swap. Then, when the VM has a shortage of completely free memory and is worried about its ability to grant new free pages, it can go through the list of 'clean' pages and, based on how recently they were used and what kind of memory they are, move those pages to the free list.
Once the memory pages are placed on the free list, they are no longer associated with the contents they had before. If a program comes along and references the memory location that the page was previously serving, the program will take a major fault, a (most likely completely different) page will be grabbed from the free list, and the data will be read into that page from disk. Once this is done, the page is actually still 'clean' since it has not been modified. If the VM chooses to use that page's slot on swap for a different page in RAM, then the page would again be 'dirtied', or if the app wrote to that page it would be 'dirtied'. And then the process begins again.
Also, swappiness is pretty horrible for server applications in a business/transactional/online/latency-sensitive environment. When I've got 16GB RAM boxes where I'm not running a lot of browsers and GUIs, I typically want all my apps nearly pinned in memory. The bulk of my RAM tends to be 8-10GB Java heaps that I NEVER want paged to disk, ever, and the cruft that is available is processes like mingetty (but even there the glibc pages in those apps are shared by other apps and actually used, so even the RSS size of those useless processes is mostly shared, used pages). I normally don't see more than a few tens of MB of the 16GB actually cleaned to swap. I would advise very, very low swappiness numbers or zero swappiness for servers -- the unused pages should be a small fraction of the overall RAM, and trying to reclaim that relatively tiny amount of RAM for buffer cache risks swapping application pages and taking latency hits in the running app.

Store more than 3GB of video-frames in memory, on 32-bit OS

At work we have an application to play 2K (2048*1556px) OpenEXR film sequences. It works well... apart from with sequences that are over 3GB (quite common), where it has to unload old frames from memory, despite the fact that all machines have 8-16GB of memory (which is addressable via the Linux BIGMEM stuff).
The frames have to be cached in memory to play back in realtime. The OS is a several-year-old 32-bit Fedora distro (not possible to upgrade to 64-bit for the foreseeable future). The limitation is 3GB of address space per process.
Basically, is it possible to cache more than 3GB of data in memory, somehow? My initial idea was to spread the data between multiple processes, but I've no idea if this is possible...
One possibility may be to use mmap. You would map/unmap different parts of your data into the same virtual memory region. You could only have one set mapped at a time, but as long as there was enough physical memory, the data should stay resident.
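A minimal sketch of that idea (the file name and window size are made up): keep one window of the frame data mapped at a time and slide it as playback advances; pages you unmap can stay in the kernel's page cache, so remapping them later is cheap as long as physical memory holds out.

    /* Sketch: slide a fixed-size mmap window over a file larger than the
     * 3GB per-process address space. */
    #define _FILE_OFFSET_BITS 64   /* needed on 32-bit to address a >2GB file */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define WINDOW (512UL * 1024 * 1024)   /* 512 MB mapped at any one time */

    int main(void)
    {
        int fd = open("sequence.cache", O_RDONLY);   /* made-up frame cache file */
        if (fd < 0) return 1;

        struct stat st;
        fstat(fd, &st);

        for (off_t off = 0; off < st.st_size; off += WINDOW) {
            size_t len = (size_t)((st.st_size - off) < (off_t)WINDOW
                                  ? (st.st_size - off) : (off_t)WINDOW);

            char *win = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
            if (win == MAP_FAILED) break;

            /* ... decode/display the frames that live in [off, off + len) ... */

            munmap(win, len);   /* the pages may stay in the kernel's page cache */
        }

        close(fd);
        return 0;
    }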
How about creating a RAM drive and loading the file into that ... assuming the RAM drive supports the BIGMEM stuff for you.
You could use multiple processes: each process loads a view of the file as a shared memory segment, and the player process then maps the segments in turn as needed.
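A rough sketch of one loader process in that scheme (the segment name and size are made up; on older glibc, link with -lrt): it publishes its slice of the frames as a POSIX shared memory object, and the player later shm_opens and mmaps whichever segment holds the frames it needs.

    /* Sketch: one loader process publishing its slice of the frame data. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SEG_SIZE (1UL << 30)            /* 1 GB slice held by this loader */

    int main(void)
    {
        /* Each loader uses its own name, e.g. /frames.0, /frames.1, ... */
        int shm = shm_open("/frames.0", O_CREAT | O_RDWR, 0600);
        if (shm < 0) { perror("shm_open"); return 1; }
        if (ftruncate(shm, SEG_SIZE) < 0) return 1;

        char *seg = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, shm, 0);
        if (seg == MAP_FAILED) return 1;

        /* ... fill seg[] with this loader's decoded frames ... */
        memset(seg, 0, 4096);               /* placeholder for real frame data */

        /* Keep running; the player process does shm_open("/frames.0", O_RDONLY)
         * and mmaps the same segment when it needs these frames. */
        pause();
        return 0;
    }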
My, what an interesting problem :)
(EDIT: Oh, I just read Rob's ram drive post...I got all excited by the problem...but have a bit more to suggest, so I won't delete)
Would it be possible to...
set up a multi-gigabyte ram disk, and then
modify the program to do all its reading from the "disk"?
I'd guess the ram disk part is where all the problems would be, since the size of the ram disk would be OS and file system dependent. You might have to create multiple ram disks and have your code jump between them. Or maybe you could set up a RAID-0 stripe set over multiple ram disks. Or, if there are still OS limitations and you can afford to drop a couple grand (4k?), set up a hardware RAID-0 stripe set with some of those new blazing-fast solid state drives. Or...
Fun, fun, fun.
Be sure to follow up!
I assume you can modify the application. If so, the easiest thing would be to start the application several times (once for each 3GB chunk of video), have each one hold a chunk of video, and use another program to synchronize them so they each take control of the framebuffer (or other video output) in turn.
The synchronization is going to be a little messy, perhaps, but it can be simplified if each app has its own framebuffer and the sync program points the video controller to the correct framebuffer in between frames when switching to the next app.
#dbr said:
There is a review machine with an absurd fiber-channel-RAID-array that can play 2K files direct from the array easily. The issue is with the artist-workstations, so it wouldn't be one $4000 RAID array, it'd be hundreds..
Well, if you can accept a limit of ~30GB, then maybe a single 36GB SSD drive would be enough? Those go for ~US$1k each I think, and the data rates might be enough. That may very well be cheaper than a pure RAM approach. There are smaller sizes available, too. If ~60GB is enough, you could probably get away with a JBOD array of 2 for double the cost, and skip the RAID controller. Be sure only to look at the higher-end SSD options--the low end is filled with glorified memory sticks. :P
