Write a cached page before it is reclaimed

Write a cached page before it is reclaimed - linux

everyone. I am stuck on the following question.
I am working on a hybrid storage system which uses an ssd as a cache layer for hard disk. To this end, the data read from the hard disk should be written to the ssd to boost the subsequent reads of this data. Since Linux caches data read from disk in the page cache, the writing of data to the ssd can be delayed; however, the pages caching the data may be freed, and accessing the freed pages is not recommended. Here is the question: I have "struct page" pointers pointing to the pages to be written to the ssd. Is there any way to determine whether the page represented by the pointer is valid or not (by valid I mean the cached page can be safely written to the ssd? What will happen if a freed page is accessed via the pointer? Is the data of the freed page the same as that before freeing?

Are you using cleancache module? You should only get valid pages from it and it should remain valid until your callback function finished.

Isn't this a cleancache/frontswap reimplementation? (https://www.kernel.org/doc/Documentation/vm/cleancache.txt).
The benefit of existing cleancache code is that it calls your code only just before it frees a page, so before the page resides in RAM, and when there is no space left in RAM for it the kernel calls your code to back it up in tmem (transient memory).
Searching I also found an existing project that seems to do exactly this: http://bcache.evilpiepirate.org/:
Bcache is a Linux kernel block layer cache. It allows one or more fast
disk drives such as flash-based solid state drives (SSDs) to act as a
cache for one or more slower hard disk drives.
Bcache patches for the Linux kernel allow one to use SSDs to cache
other block devices. It's analogous to L2Arc for ZFS, but Bcache also
does writeback caching (besides just write through caching), and it's
filesystem agnostic. It's designed to be switched on with a minimum of
effort, and to work well without configuration on any setup. By
default it won't cache sequential IO, just the random reads and writes
that SSDs excel at. It's meant to be suitable for desktops, servers,
high end storage arrays, and perhaps even embedded.

What you are trying to achieve looks like the following:
Before the page is evicted from the pagecache, you want to cache it. This, in concept, is called a Victim cache. You can look for papers around this.
What you need is a way to "pin" the pages targeted for eviction for the duration of the IO. Post IO, you can free the pagecache page.
But, this will delay the eviction, which is possibly needed during memory pressure to create more un-cached pages.
So, one possible solution is to start your caching algorithm a bit before the pagecache eviction starts.
A second possible solution is to set aside a bunch of free pages and exchange the page being evicted form the page cache with a page from the free pool, and cache the evicted page in the background. But, you need to now synchronize with file block deletes, etc

Related

Do Linux and macOS have an `OfferVirtualMemory` counterpart?

Windows, starting with a certain unspecified update of Windows 8.1, has the excellent OfferVirtualMemory and ReclaimVirtualMemory system calls which allow memory regions to be "offered" to the OS. This removes them from the working set, reduces the amount of physical memory usage that is attributed to the calling process, and puts them onto the standby memory list of the program, but without ever swapping out the contents anywhere.
(Below is a brief and rough explanation of what those do and how standby lists work, to help people understand what kind of system call I'm looking for, so skip ahead if you already know all of this.)
Quick standby list reference
Pages in the standby list can be returned back to the working set of the process, which is when their contents are swapped out to disk and the physical memory is used for housing a fresh allocation or swapping in memory from disk (if there's no available "dead weight" zeroed memory on the system), or no swapping happens and the physical memory is returned to the same virtual memory region they were first removed from, sidestepping the swapping process while still having reduced the working set of the program to, well, the memory it's actively working on, back when they were removed from the working set and put into the standby list to begin with.
Alternatively, if another program requests physical memory and the system doesn't have zeroed pages (if no program was closed recently, for example, and the rest of RAM has been used up with various system caches), physical memory from the standby list of a program can be zeroed, removed from the standby list, and handed over to the program which requested the memory.
Back to memory offering
Since the offered memory never gets swapped out if, upon being removed from the standby list, it no longer belongs to the same virtual memory segment (removed from standby by anything other than ReclaimVirtualMemory), the reclamation process can fail, reporting that the contents of the memory region are now undefined (uninitialized memory has been fetched from the program's own standby list or from zeroed memory). This means that the program will have to re-generate the contents of the memory region from another data source, or by rerunning some computation.
The practical effect, when used to implement an intelligent computation cache system, is that, firstly, the reported working set of the program is reduced, giving a more accurate picture of how much memory it really needs. Secondly, the cached data, which can be re-generated from another region of memory, can be quickly discarded for another program to use that cache, without waiting for the disk (and putting additional strain on it, which adds up over time and results in increased wear) as it swaps out the contents of the cache, which aren't too expensive to recreate.
One good example of a use case is the render cache of a web browser, where it can just re-render parts of the page upon request, and has little to no use in having those caches taking up the working set and bugging the user which high memory usage. Pages which aren't currently being shown are the moment where this approach may give the biggest theoretical yield.
The question
Do Linux and macOS have a comparable API set that allows memory to be marked as discardable at the memory manager's discretion, with a fallible system call to lock that memory back in, declaring the memory uninitialized if it was indeed discarded?

Linux 4.5 and later has madvise with the MADV_FREE, the memory may be replaced with pages of zeros anytime until they are next written.
To lock the memory back in write to it, then read it to check if it has been zeroed. This needs to be done separately for every page.
Before Linux 4.12 the memory was freed immediately on systems without swap.
You need to take care of compiler memory reordering so use atomic_signal_fence or equivalent in C/C++.

Is anonymous memory part of the page cache on Linux?

Is anonymous memory - i.e. program heap and stack - part of the page cache on Linux? The linked documentation of the kernel does not state that.
But the Wikipedia entry about Page Cache contains a graphic (look at the top right) which gives me the impression that malloc() allocates dynamic memory within the page cache:
Does that make sense? Regarding mmap(), when it is used to access files it makes sense to use the page cache. Also generally for anonymous memory e.g. malloc() and anonymous mappings through mmap()?
I would appreciate some explanation.
Thank you.
Edit 2021-03-14
I've decided it is the best to ask the kernel maintainers of the memory subsystem on their mailing-list. Luckily Matthew Wilcox responded and helped me. Extract:
Anonymous memory is not handled by the page cache.
Anonymous pages are handled in a number of different ways -- they can be found on LRU lists (Least Recently Used) and they can be found through the page tables. Somewhat ad-hoc.
The wikipedia diagram is wrong. And it contains further flaws.
If a system provides swap and if anonymous memory is swapped - it enters the swap cache, not the page cache.
The discussion can be read on here or here.

TLDR: No, except for anonymous memory with special filesystem backing (like IPC shmem).
Update: Corrected answer to incorporate new info from the kernel mailing list discussion with OP.
The page cache originally was meant to be an OS-level region of memory for fast lookup of disk-backed files and in its original form was a buffer cache (meant to cache blocks from disk). The notion of a page cache came about later in 1995 after Linux's inception, but the premise was similar, just a new abstraction -- pages [1].
In fact, eventually the two caches became one: the page cache included the buffer cache, or rather, the buffer cache is the page cache [1, 2].
So what does go in the page cache? Aside from traditional disk-backed files, in an attempt to make the page cache as general purpose as possible, Linux has a few examples of page types that don't adhere to the traditional notion of disk-backed pages, yet are still stored in the page cache. Of course, as mentioned, the buffer cache (which is the same as the page cache) is used to store disk-backed blocks of data. Blocks aren't necessarily the same size as pages. In fact, I learned that they can be smaller than pages [pg.323 of 3]. In that case, pages considered part of the buffer cache might consist of multiple blocks corresponding to non-contiguous regions of memory on disk. I'm unclear whether, then, each page in the buffer cache must be a one-to-one mapping between a page and a file, or if one page can consist of blocks from different files. Nonetheless, this is one page cache usage that doesn't adhere to the strictest definition of the original page cache.
Next is the swap cache. As Barmar mentioned in the comments, anonymous (non-file backed pages) can be swapped out to disk. Along the way to disk and back, pages are put in the swap cache. The swap cache repurposes similar data structures as the page cache, specifically the address_space struct, albeit with swap flags set and a few other differences [pg. 731 of 4, 5] However, since the swap cache is considered separate from the page cache, anonymous pages in the swap cache are not considered to be "in the page cache."
Finally: the question about whether mmap/malloc are allocating memory in the page cache. As discussed in [5], typically, mmap uses memory that comes from the free page list, not the page cache (unless there were no free pages left, I assume). When using mmap to map files for reading and writing, these pages do end up residing within the page cache. However, for anonymous memory, mmap/mallocd pages do not normally reside within the page cache.
One exception to this is anonymous memory that has special filesystem backing. For instance, shared memory mmapd between processes for IPC is backed by the ram-based tmpfs [6]. This memory sits in the page cache, but is anonymous because it has no disk-backing file [pg. 600 of 4].
Sources:
https://lwn.net/Articles/712467/
https://www.thomas-krenn.com/en/wiki/Linux_Page_Cache_Basics
https://www.doc-developpement-durable.org/file/Projets-informatiques/cours-&-manuels-informatiques/Linux/Linux%20Kernel%20Development,%203rd%20Edition.pdf
https://doc.lagout.org/operating%20system%20/linux/Understanding%20Linux%20Kernel.pdf
https://lore.kernel.org/linux-mm/20210315000738.GR2577561#casper.infradead.org/
https://github.com/torvalds/linux/blob/master/Documentation/filesystems/tmpfs.rst

Bypassing 4KB block size limitation on block layer/device

We are developing an ssd-type storage hardware device that can take read/write request for big block size >4KB at a time (even in MBs size).
My understanding is that linux and its filesystem will "chop down" files into 4KB block size that will be passed to block device driver, which will need to physically fill the block with data from the device (ex., for write)
I am also aware the kernel page size has a role in this limitation as it is set at 4KB.
For experiment, I want to find out if there is a way to actually increase this block size, so that we will save some time (instead of doing multiple 4KB writes, we can do it with bigger block size).
Is there any FS or any existing project that I can take a look for this?
If not, what is needed to do this experiment - what parts of linux needs to be modified?
Trying to find out the level of difficulties and resource needed. Or, if it is even impossible to do so and/or any reason why we do not even need to do so. Any comment is appreciated.
Thanks.

The 4k limitation is due to the page cache. The main issue is that if you have a 4k page size, but a 32k block size, what happens if the file is only 2000 bytes long, so you only allocate a 4k page to cover the first 4k of the block. Now someone seeks to offset 20000, and writes a single byte. Now suppose the system is under a lot of memory pressure, and the 4k page for the first 2000 bytes, which is clean, gets pushed out of memory. How do you track which parts of the 32k block contain valid data, and what happens when the system needs to write out the dirty page at offset 20000?
Also, let's assume that the system is under a huge amount of memory pressure, we need to write out that last page; what if there isn't enough memory available to instantiante the other 28k of the 32k block, so we can do the read-modify-write cycle just to update that one dirty 4k page at offset 20000?
These problems can all be solved, but it would require a lot of surgery in the VM layer. The VM layer would need to know that for this file system, pages need to be instantiated in chunks of 8 pages at a time, and if that there is memory pressure to push out a particular page, you need write out all of the 8 pages at the same time if it is dirty, and then drop all 8 pages from the page cache at the same time. All of this implies that you want to track page usage and page dirty not at the 4k page level, but at the compound 32k page/"block" level. It basically will involve changes to almost every single part of the VM subsystem, from the page cleaner, to the page fault handler, the page scanner, the writeback algorithms, etc., etc., etc.
Also consider that even if you did hire a Linux VM expert to do this work, (which the HDD vendors would deeply love you for, since they also want to be able to deploy HDD's with a 32k or 64k physical sector size), it will be 5-7 years before such a modified VM layer would make its appearance in a Red Hat Enterprise Linux kernel, or the equivalent enterprise or LTS kernel for SuSE or Ubuntu. So if you are working at a startup who is hoping to sell your SSD product into the enterprise market --- you might as well give up now with this approach. It's just not going to work before you run out of money.
Now, if you happen to be working for a large Cloud company who is making their own hardware (ala Facebook, Amazon, Google, etc.) maybe you could go down this particular path, since they don't use enterprise kernels that add new features at a glacial pace --- but for that reason, they want to stick relatively close to the upstream kernel to minimize their maintenance cost.
If you do work for one of these large cloud companies, I'd strongly recommend that you contact other companies who are in this same space, and maybe you could collaborate with them to see if together you could do this kind of development work and together try to get this kind of change upstream. It really, really is not a trivial change, though --- especially since the upstream linux kernel developers will demand that this not negatively impact performance in the common case, which will not be involving > 4k block devices any time in the near future. And if you work at a Facebook, Google, Amazon, etc., this is not the sort of change that you would want to maintain as a private change to your kernel, but something that you would want to get upstream, since other wise it would be such a massive, invasive change that supporting it as an out-of-tree patch would be huge headache.

Although I've never written a device driver for Linux, I find it very unlikely that this is a real limitation of the driver interface. I guess it's possible that you would want to break I/O into scatter-gather lists where each entry in the list is one page long (to improve memory allocation performance and decrease memory fragmentation), but most device types can handle those directly nowadays, and I don't think anything in the driver interface actually requires it. In fact, the simplest way that requests are issued to block devices (described on page 13 -- marked as page 476 -- of that text) looks like it receives:
a sector start number
a number of sectors to transfer (no limit is mentioned, let alone a limit of 8 512B sectors)
a pointer to write the data into / read the data from (not a scatter-gather list for this simple case, I guess)
whether this is a read versus a write
I suspect that if you're seeing exclusively 4K accesses it's probably a result of the caller not requesting more than 4K at a time -- if the filesystem you're running on top of your device only issues 4K reads, or whatever is using the filesystem only accesses one block at a time, there is nothing your device driver can do to change that on its own!
Using one block at a time is common for random access patterns like database read workloads, but database log or FS journal writes or large serial file reads on a traditional (not copy-on-write) filesystem would issue large I/Os more like what you're expecting. If you want to try issuing large reads against your device directly to see if it's possible through whatever driver you have now, you could use dd if=/dev/rdiskN of=/dev/null bs=N to see if increasing the bs parameter from 4K to 1M shows a significant throughput increase.

Linux flush_cache_range() behavior

My code has a user mode mapping (set up via mmap()) which I need to flush after writing to it from the CPU but before I dispatch the data by DMA’ing the underlying physical memory. Also I need to invalidate the cache after data has arrived via a DMA to the underlying physical memory but before I attempt to read from it with the CPU.
In my mind “cache flushing” and “cache invalidating” mean two different things. Roughly “cache flushing” means writing what’s in the cache out to memory (or simply cache data goes to memory) whereas “cache invalidating” means subsequently assuming all cache contents are stale so that any attempts to read from that range will provoke a fresh read from memory (or simply memory data goes to cache).
However in the kernel I do not find two calls but instead just one: flush_cache_range().
This is the API I use for both tasks and it “seems to work”… at least it has up until the present issue I'm trying to debug.
This is possibly because the behavior of flush_cache_range() just might be to:
1) First write any dirty cache entries to memory- THEN
2) Invalidate all cache entries
IF is this is what this API really does then my use of it in this role is justified. After all it’s how I myself might implement it. The precise question for which I seek a confident answer is:
IS that in fact how flush_cache_range() actually works?

Whether caches need to be invalidated or flushed is architecture dependent.
You should always use the Linux DMA functions to handle these issues correctly.
Read DMA-API-HOWTO.txt and DMA-API.txt.

Why do discussions of "swappiness" act like information can only be in one place at a time?

I've been reading up on Linux's "swappiness" tuneable, which controls how aggressive the kernel is about swapping applications' memory to disk when they're not being used. If you Google the term, you get a lot of pages like this discussing the pros and cons. In a nutshell, the argument goes like this:
If your swappiness is too low, inactive applications will hog all the system memory that other programs might want to use.
If your swappiness is too high, when you wake up those inactive applications, there's going to be a big delay as their state is read back off the disk.
This argument doesn't make sense to me. If I have an inactive application that's using a ton of memory, why doesn't the kernel page its memory to disk AND leave another copy of that data in-memory? This seems to give the best of both worlds: if another application needs that memory, it can immediately claim the physical RAM and start writing to it, since another copy of it is on disk and can be swapped back in when the inactive application is woken up. And when the original app wakes up, any of its pages that are still in RAM can be used as-is, without having to pull them off the disk.
Or am I missing something?

If I have an inactive application that's using a ton of memory, why doesn't the kernel page its memory to disk AND leave another copy of that data in-memory?
Lets say we did it. We wrote the page to disk, but left it in memory. A while later another process needs memory, so we want to kick out the page from the first process.
We need to know with absolute certainty whether the first process has modified the page since it was written out to disk. If it has, we have to write it out again. The way we would track this is to take away the process's write permission to the page back when we first wrote it out to disk. If the process tries to write to the page again there will be a page fault. The kernel can note that the process has dirtied the page (and will therefore need to be written out again) before restoring the write permission and allowing the application to continue.
Therein lies the problem. Taking away write permission from the page is actually somewhat expensive, particularly in multiprocessor machines. It is important that all CPUs purge their cache of page translations to make sure they take away the write permission.
If the process does write to the page, taking a page fault is even more expensive. I'd presume that a non-trivial number of these pages would end up taking that fault, which eats into the gains we were looking for by leaving it in memory.
So is it worth doing? I honestly don't know. I'm just trying to explain why leaving the page in memory isn't so obvious a win as it sounds.
(*) This whole thing is very similar to a mechanism called Copy-On-Write, which is used when a process fork()s. The child process is very likely going to execute just a few instructions and call exec(), so it would be silly to copy all of the parents pages. Instead the write permission is taken away and the child simply allowed to run. Copy-On-Write is a win because the page fault is almost never taken: the child almost always calls exec() immediately.

Even if you page the apps memory to disk and keep it in memory, you would still have to decide when should an application be considered "inactive" and that's what swapiness controls. Paging to disk is expensive in terms of IO and you don't want to do it too often. There is also another variable on this equation, and that is the fact that Linux uses of remaining memory as disk buffers/cache.

According to this 1 that is exactly what Linux does.
I'm still trying to make sense of a lot of this, so any authoritative links would be appreciated.

The first thing the VM does is clean pages and move them to the clean list.
When cleaning anonymous memory (things which do not have an actual file backing store, you can see the segments in /proc//maps which are anonymous and have no filesystem vnode storage behind them), the first thing the VM is going to do is take the "dirty" pages and "clean" then by writing the contents of the page out to swap. Now when the VM has a shortage of completely free memory and is worried about its ability to grant new free pages to be used, it can go through the list of 'clean' pages and based on how recently they were used and what kind of memory they are it will move those pages to the free list.
Once the memory pages are placed on the free list, they no longer are associated with the contents they had before. If a program comes along a references the memory location the page was serving previously the program will take a major fault and a (most likely completely different) page will be grabbed from the free list and the data will be read into the page from disk. Once this is done, the page is actually still 'clean' since it has not been modified. If the VM chooses to use that page on swap for a different page in RAM then the page would be again 'dirtied', or if the app wrote to that page it would be 'dirtied'. And then the process begins again.
Also, swappinness is pretty horrible for server applications in a business/transactional/online/latency-sensitive environment. When I've got 16GB RAM boxes where I'm not running a lot of browsers and GUIs, I typically want all my apps nearly pinned in memory. The bulk of my RAM tends to be 8-10GB java heaps that I NEVER want paged to disk, ever, and the cruft that is available are processes like mingetty (but even there the glibc pages in those apps are shared by other apps and actually used, so even the RSS size of those useless processes are mostly shared, used pages). I normally don't see more than a few 10MBs of the 16GB actually cleaned to swap. I would advise very, very low swappiness numbers or zero swappiness for servers -- the unused pages should be a small fraction of the overall RAM and trying to reclaim that relatively tiny amount of RAM for buffer cache risks swapping application pages and taking latency hits in the running app.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string