If my server's memory is full, I would like to evict old objects based on the number of times they are requested.
Can I do that in Varnish?
Varnish storage backends (e.g. malloc or file) implement an LRU strategy when objects need to be removed (i.e. nuked) from the cache, so eviction is driven by recency of use rather than by request count. From the docs [1]:
Watch the n_lru_nuked counter with varnishstat or some other tool. If
you have a lot of LRU activity then your cache is evicting objects due
to space constraints and you should consider increasing the size of
the cache.
[1] https://varnish-cache.org/docs/6.3/users-guide/sizing-your-cache.html
An availability operation can be compared to a cache flush: cache contents are released to main memory.
Similarly, a visibility operation can be compared to a cache invalidation: the cache consumes the contents of main memory.
(It doesn't have to be a 1:1 hardware mapping, but you get the idea.)
It seems nonsensical to perform a visibility operation before a write (since we're about to overwrite whatever is in our imaginary or not-so-imaginary cache either way) or an availability operation after a read (nothing has changed!).
I've seen code that includes memory writes in dstAccessMask and/or memory reads in srcAccessMask. What's the point?
It seems nonsensical to perform a visibility operation before a write (since we're about to overwrite whatever is in our imaginary or not-so-imaginary cache either way) or an availability operation after a read (nothing has changed!).
So which not-so-imaginary cache are you talking about? And how are you enforcing coherency with other not-so-imaginary caches in the system? GPUs have many levels of parallel caching, some of which are coherent in hardware and some of which are not. What you get is entirely implementation-defined behavior ...
memory writes in dstAccessMask
Missing invalidation before a write can cause write-after-write hazards on the GPU (you can end up with mismatched parallel copies of the data in two different caches). You can also get issues if the host modified the data, which is why HOST is a possible stage.
memory reads in srcAccessMask
I can't think of a case where you'd need memory operations after a read-only usage.
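For the common case, the masks look roughly like this (a sketch; `cmd` stands for whichever command buffer you happen to be recording into):

    #include <vulkan/vulkan.h>

    /* Sketch: make a transfer write available, then visible to later shader reads. */
    void barrier_after_upload(VkCommandBuffer cmd)
    {
        VkMemoryBarrier barrier = {
            .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
            .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT, /* availability: "flush" the write */
            .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,    /* visibility: "invalidate" for the read */
        };
        vkCmdPipelineBarrier(cmd,
                             VK_PIPELINE_STAGE_TRANSFER_BIT,        /* srcStageMask */
                             VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, /* dstStageMask */
                             0,                                     /* dependencyFlags */
                             1, &barrier,                           /* global memory barriers */
                             0, NULL,                               /* buffer memory barriers */
                             0, NULL);                              /* image memory barriers */
    }

Write bits in dstAccessMask (e.g. VK_ACCESS_SHADER_WRITE_BIT) can also be meaningful, per the write-after-write case above; read bits in srcAccessMask add nothing.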
Is anonymous memory, i.e. program heap and stack, part of the page cache on Linux? The linked kernel documentation does not say.
But the Wikipedia entry on Page Cache contains a graphic (top right) which gives me the impression that malloc() allocates dynamic memory within the page cache.
Does that make sense? For mmap(), when it is used to access files, it makes sense to use the page cache. But does the page cache also hold anonymous memory, e.g. malloc() allocations and anonymous mappings created with mmap()?
I would appreciate some explanation.
Thank you.
Edit 2021-03-14
I decided it was best to ask the kernel maintainers of the memory subsystem on their mailing list. Luckily Matthew Wilcox responded and helped me. Extract:
Anonymous memory is not handled by the page cache.
Anonymous pages are handled in a number of different ways -- they can be found on LRU lists (Least Recently Used) and they can be found through the page tables. Somewhat ad-hoc.
The wikipedia diagram is wrong. And it contains further flaws.
If a system provides swap and if anonymous memory is swapped - it enters the swap cache, not the page cache.
The discussion can be read here or here.
TLDR: No, except for anonymous memory with special filesystem backing (like IPC shmem).
Update: Corrected answer to incorporate new info from the kernel mailing list discussion with OP.
The page cache was originally meant to be an OS-level region of memory for fast lookup of disk-backed files, and in its original form it was a buffer cache (meant to cache blocks from disk). The notion of a page cache came later, in 1995, a few years after Linux's inception, but the premise was similar, just with a new abstraction: pages [1].
In fact, eventually the two caches became one: the page cache included the buffer cache, or rather, the buffer cache is the page cache [1, 2].
So what does go in the page cache? Aside from traditional disk-backed files, in an attempt to make the page cache as general-purpose as possible, Linux has a few examples of page types that don't adhere to the traditional notion of disk-backed pages, yet are still stored in the page cache. Of course, as mentioned, the buffer cache (which is the same as the page cache) is used to store disk-backed blocks of data. Blocks aren't necessarily the same size as pages; in fact, they can be smaller than pages [pg. 323 of 3]. In that case, a page that is part of the buffer cache might consist of multiple blocks corresponding to non-contiguous regions on disk. I'm unclear whether each page in the buffer cache must then map one-to-one to a single file, or whether one page can hold blocks from different files. Nonetheless, this is one page-cache usage that doesn't adhere to the strictest definition of the original page cache.
Next is the swap cache. As Barmar mentioned in the comments, anonymous (non-file-backed) pages can be swapped out to disk. On the way to disk and back, pages are put in the swap cache. The swap cache repurposes data structures similar to the page cache's, specifically the address_space struct, albeit with swap flags set and a few other differences [pg. 731 of 4, 5]. However, since the swap cache is considered separate from the page cache, anonymous pages in the swap cache are not considered to be "in the page cache."
Finally, the question of whether mmap/malloc allocate memory in the page cache. As discussed in [5], mmap typically uses memory that comes from the free page list, not the page cache (unless there are no free pages left, I assume). When mmap is used to map files for reading and writing, those pages do end up residing in the page cache. However, for anonymous memory, mmap'd/malloc'd pages do not normally reside in the page cache.
One exception to this is anonymous memory that has special filesystem backing. For instance, shared memory mmap'd between processes for IPC is backed by the RAM-based tmpfs [6]. This memory sits in the page cache, but is anonymous because it has no backing file on disk [pg. 600 of 4].
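To make that concrete, roughly (a user-space sketch; /etc/hostname is just a stand-in for any regular file): the private anonymous mapping is not in the page cache, the file-backed mapping is, and the shared anonymous mapping is shmem/tmpfs-backed and therefore is.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 4096;

        /* Private anonymous mapping: pages come from the free list, not the
           page cache (if swapped out, they go through the swap cache). */
        char *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* File-backed mapping: these pages live in the page cache. */
        int fd = open("/etc/hostname", O_RDONLY);
        char *file = fd >= 0 ? mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0)
                             : MAP_FAILED;

        /* Shared anonymous mapping (e.g. for IPC): backed by shmem/tmpfs, so
           it is in the page cache even though there is no file on disk. */
        char *shm = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        printf("anon=%p file=%p shm=%p\n", (void *)anon, (void *)file, (void *)shm);

        if (anon != MAP_FAILED) munmap(anon, len);
        if (file != MAP_FAILED) munmap(file, len);
        if (shm  != MAP_FAILED) munmap(shm, len);
        if (fd >= 0) close(fd);
        return 0;
    }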
Sources:
[1] https://lwn.net/Articles/712467/
[2] https://www.thomas-krenn.com/en/wiki/Linux_Page_Cache_Basics
[3] https://www.doc-developpement-durable.org/file/Projets-informatiques/cours-&-manuels-informatiques/Linux/Linux%20Kernel%20Development,%203rd%20Edition.pdf
[4] https://doc.lagout.org/operating%20system%20/linux/Understanding%20Linux%20Kernel.pdf
[5] https://lore.kernel.org/linux-mm/20210315000738.GR2577561#casper.infradead.org/
[6] https://github.com/torvalds/linux/blob/master/Documentation/filesystems/tmpfs.rst
Under the total store order (TSO) memory consistency model, an x86 CPU has a write buffer to hold pending write requests, and later reads can be served out of that write buffer (which is how loads appear to be reordered ahead of earlier stores). The model also says that write requests leave the write buffer and are issued toward the cache hierarchy in FIFO order, which is the same as program order.
I am curious about:
To serve the write requests issued from the write buffer, does the L1 cache controller handle them, complete the cache coherence for them, and insert the data into the L1 cache in the same order in which they were issued?
Your terminology is unusual. You say "finish the cache coherence"; what actually happens is that the core has to get (exclusive) ownership of the cache line before it can modify it. At the instant/cycle that modification happens, it becomes part of the view of memory contents shared by all participants in the cache-coherency protocol.
So yes, you do "finish the cache coherence" = get exclusive ownership before the store can even enter cache and become globally visible = available for requests to share that cache line. Cache always maintains coherence (that's the point of MESI); it doesn't get out of sync and then wait for coherence. I think your confusion stems from your mental model not matching that reality.
(Weakly-ordered architectures have mind-bending possibilities like not all cores seeing the stores from two other cores in the same order; that can happen by private store-forwarding between SMT threads on one physical core letting another logical core see a store ahead of commit to L1d = global visibility.)
I think you know some of this, but let me start from the basics.
L1 cache in each core participates in the cache-coherency protocol that keeps its cache coherent with the other caches in the coherency domain (e.g. L2 and L3, and L1 in other cores, but not video-RAM caches inside the GPU).
Loads become globally visible at the instant their data is read from L1 cache (or from the store buffer or from uncacheable RAM or MMIO). MFENCE can force them to wait for earlier stores to become globally visible before sampling L1, to avoid StoreLoad reordering.
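As a concrete illustration, here's the classic store/load litmus test in C11 atomics (a sketch; the seq_cst fence typically compiles to mfence or a locked instruction on x86). Without the fences, each thread's load can be satisfied before its own store becomes globally visible (StoreLoad reordering), so both r1 and r2 can end up 0; with the fences, that outcome is forbidden.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    atomic_int x = 0, y = 0;
    int r1, r2;

    int t1(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_release);
        /* Full barrier: earlier store must be globally visible before the load samples L1. */
        atomic_thread_fence(memory_order_seq_cst);
        r1 = atomic_load_explicit(&y, memory_order_acquire);
        return 0;
    }

    int t2(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_release);
        atomic_thread_fence(memory_order_seq_cst);
        r2 = atomic_load_explicit(&x, memory_order_acquire);
        return 0;
    }

    int main(void)
    {
        thrd_t a, b;
        thrd_create(&a, t1, NULL);
        thrd_create(&b, t2, NULL);
        thrd_join(a, NULL);
        thrd_join(b, NULL);
        /* With the fences, r1 == 0 && r2 == 0 is impossible; without them it can happen. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }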
Stores become globally visible at the instant their data is committed to L1 cache. The conditions required before this can happen are:
It's finished executing: the data+address is in a store buffer entry. (i.e. the store-address and store-data uops executed on the appropriate ports once their inputs were ready, writing the address and data into the store buffer, aka Memory Order Buffer on Intel CPUs).
It's retired from the out-of-order part of the core, and thus known to be non-speculative. Before retirement, we don't know that it and all preceding instructions won't fault, or that it wasn't in the shadow of a branch mispredict or other mis-speculation.
Retirement can only happen after it's finished executing, but isn't tied to commitment to L1d. The store buffer can continue to track a non-speculative store that will definitely happen eventually even after the ROB (out-of-order execution ReOrder Buffer) has forgotten about the store instruction.
All preceding loads/stores/fences are already globally visible (because of x86's memory ordering rules). This excludes weakly-ordered ops (NT stores); other loads/stores can pass them.
The cache line is in the Exclusive or Modified state of the MESI/MESIF/MOESI cache-coherence protocol, in the L1d cache of the current core. This can take a long time if the RFO (read for ownership) encounters a cache miss in outer levels of cache, or contention with other cores that also want exclusive access to write, or atomically RMW, a cache line.
See wikipedia's MESI article for diagrams of allowed state transitions, and details. The key point is that coherency is achieved by only allowing a core to modify its copy of a cache line when it's sure that no other caches contain that line, so that it's impossible for two conflicting copies of the same line to exist.
Intel CPUs actually use MESIF, while AMD CPUs actually use MOESI to allow cache->cache data transfer of dirty data instead of write-back to a shared outer cache like the basic MESI protocol requires.
Also note that modern Intel designs (before Skylake-AVX512) use a large shared inclusive L3 cache as a backstop for cache coherency, so snoop requests don't actually have to be broadcast to all cores; they just check L3 tags (which contain extra metadata to track which core is caching what.
Intel's L3 is tag-inclusive even for lines that inner caches have in Exclusive or Modified state and thus are Invalid in L3. See this paper for more details of a simplified version of what Intel does).
Also related: I wrote an answer recently about why we have small/fast L1 + larger L2/L3, instead of one big cache, including some links to other cache-related stuff.
Back to the actual question:
Yes, stores are committed to L1 in program order, because that's the order that x86 requires them to become globally visible. L1-commit order is the same thing as global-visibility order.
Instead of "finish the cache coherence", instead you should say "get ownership of the cache line". This involves communicating with other caches using the cache coherency protocol, so I guess you probably meant "finish getting exclusive ownership using the cache coherency protocl".
The memory ordering part of the MESI wiki article points out that buffering stores in a store queue is separate from out-of-order execution in general.
The store buffer decouples commit to L1d from OoO exec retirement. This can potentially hide a lot more store latency than the regular out-of-order window size. However, retired stores must eventually happen (in the right order) even if an interrupt arrives, so allowing lots of retired but not committed stores can increase interrupt latency.
The store buffer tries to commit retired stores to L1d as quickly as it can, but it's restricted by the memory ordering rules. (i.e. other cores will see stores soon; you don't need a fence to flush the store buffer unless you need the current thread to wait for that to happen before a later load in this thread. e.g. for sequentially-consistent stores.)
On a weakly-ordered ISA, later stores can commit to L1d while an earlier store is still waiting for a cache miss. (But you'd still need a memory order buffer to preserve the illusion of a single core running instructions in program order.)
The store buffer can have multiple cache misses in flight at once, because even on strongly-ordered x86 it can send an RFO for a cache line before that store is the oldest one in the buffer.
Yes, in a model like x86-TSO, stores are most likely committed to L1 in program order, and Peter's answer covers it well. That is, the store buffer is maintained in program order, and the core will commit only the oldest store (or perhaps several consecutive oldest stores, if they are all going to the same cache line) to L1 before moving on.1
However, you mention in the comments your concern that this might impact performance by essentially making the store buffer commit a blocking (serialized) process:
And why I am confused about this problem is that cache controller
could handle the requests in a non-blocking way. But, to conform to
the TSO and make sure data globally visible on a multi-core system,
should cache controller follow the store ordering? Because if there
are two variable A and B being updated sequentially on core 1 and core
2 get the updated B from core 1, then core 2 must also can see the
updated A. And to achieve this, I think the private cache hierarchy on
core 1 have to finishes the cache coherence of the variable A and B in
order and make them globally visible. Am I right?
The good news is that even though the store buffer might commit to L1 in an ordered way, only the oldest store at a time, it can still get plenty of parallelism with respect to the rest of the memory subsystem by looking ahead in the store buffer and making prefetch RFO requests: trying to get the line into the E state in the local core even before the store reaches the front of the queue to commit to L1.
This approach doesn't violate ordering, since the stores are still written in program order, but it allows full parallelism when resolving L1 store misses. It is the L1 store misses that really matter anyway: store hits in L1 can commit rapidly, at least 1 per cycle, so committing a bunch of hits doesn't help much; getting MLP on store misses is very important, especially for scattered stores the prefetcher can't deal with.
Do x86 chips actually use a technique like this? Almost certainly. Most convincingly, tests of a long series of random writes show a much better average latency than the full memory latency, implying MLP significantly better than one. You can also find patents like this one or this one where Intel describes pretty much exactly this method.
Still, nothing is perfect. There is some evidence that ordering concerns cause weird performance hiccups when stores miss in L1, even if they hit in L2.
1 It is certainly possible that it can commit stores out of order if it maintains the illusion of in-order commit, e.g. by not relinquishing ownership of cache lines written out of order until order is restored, but this is prone to deadlocks and other complicated cases, and I have no evidence that x86 does so.
My code has a user mode mapping (set up via mmap()) which I need to flush after writing to it from the CPU but before I dispatch the data by DMA’ing the underlying physical memory. Also I need to invalidate the cache after data has arrived via a DMA to the underlying physical memory but before I attempt to read from it with the CPU.
In my mind “cache flushing” and “cache invalidating” mean two different things. Roughly “cache flushing” means writing what’s in the cache out to memory (or simply cache data goes to memory) whereas “cache invalidating” means subsequently assuming all cache contents are stale so that any attempts to read from that range will provoke a fresh read from memory (or simply memory data goes to cache).
However in the kernel I do not find two calls but instead just one: flush_cache_range().
This is the API I use for both tasks and it “seems to work”… at least it has up until the present issue I'm trying to debug.
This is possibly because the behavior of flush_cache_range() just might be to:
1) first write any dirty cache entries to memory, and then
2) invalidate all cache entries.
If this is what the API really does, then my use of it in this role is justified. After all, it's how I might implement it myself. The precise question for which I seek a confident answer is:
Is that in fact how flush_cache_range() actually works?
Whether caches need to be invalidated or flushed is architecture dependent.
You should always use the Linux DMA functions to handle these issues correctly.
Read DMA-API-HOWTO.txt and DMA-API.txt.
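Roughly, the streaming DMA API expresses both of your operations like this (an untested sketch; dev, buf, len and the single-buffer style are placeholders for whatever your driver actually uses):

    #include <linux/dma-mapping.h>
    #include <linux/errno.h>

    /* Sketch: CPU writes buf, then the device DMAs it out. */
    static int dma_out_example(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t handle;

        /* Map the buffer for device access; on architectures with non-coherent
           DMA this writes back (flushes) the CPU cache lines covering buf. */
        handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, handle))
            return -ENOMEM;

        /* ... program the device with `handle`, start the DMA, wait for it ... */

        dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
        return 0;
    }

    /* Sketch: the device DMAs into buf, then the CPU reads it. `handle` is the
       dma_addr_t from an earlier dma_map_single(..., DMA_FROM_DEVICE). */
    static void dma_in_example(struct device *dev, void *buf, size_t len,
                               dma_addr_t handle)
    {
        /* Invalidate stale CPU cache lines so the CPU sees the DMA'd data. */
        dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);

        /* ... CPU reads buf ... */

        /* Hand the buffer back to the device for the next transfer. */
        dma_sync_single_for_device(dev, handle, len, DMA_FROM_DEVICE);
    }

The map/sync calls do whichever of flush or invalidate the architecture actually needs, which is exactly why you should use them instead of calling flush_cache_range() directly.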
I am stuck on the following question.
I am working on a hybrid storage system which uses an SSD as a cache layer for a hard disk. To this end, data read from the hard disk should be written to the SSD to speed up subsequent reads of that data. Since Linux caches data read from disk in the page cache, the writing of data to the SSD can be delayed; however, the pages caching the data may be freed, and accessing freed pages is not recommended. Here is the question: I have struct page pointers to the pages to be written to the SSD. Is there any way to determine whether the page a pointer refers to is still valid (by valid I mean that the cached page can safely be written to the SSD)? What will happen if a freed page is accessed via the pointer? Is the data of the freed page the same as it was before freeing?
Are you using the cleancache module? You should only get valid pages from it, and they should remain valid until your callback function has finished.
Isn't this a cleancache/frontswap reimplementation? (https://www.kernel.org/doc/Documentation/vm/cleancache.txt).
The benefit of the existing cleancache code is that it calls your code just before it frees a page, i.e. while the page still resides in RAM: when there is no space left in RAM for it, the kernel calls your code to back it up in tmem (transient memory).
Searching around, I also found an existing project that seems to do exactly this: http://bcache.evilpiepirate.org/:
Bcache is a Linux kernel block layer cache. It allows one or more fast
disk drives such as flash-based solid state drives (SSDs) to act as a
cache for one or more slower hard disk drives.
Bcache patches for the Linux kernel allow one to use SSDs to cache
other block devices. It's analogous to L2Arc for ZFS, but Bcache also
does writeback caching (besides just write through caching), and it's
filesystem agnostic. It's designed to be switched on with a minimum of
effort, and to work well without configuration on any setup. By
default it won't cache sequential IO, just the random reads and writes
that SSDs excel at. It's meant to be suitable for desktops, servers,
high end storage arrays, and perhaps even embedded.
What you are trying to achieve looks like the following:
Before the page is evicted from the pagecache, you want to cache it. This, in concept, is called a Victim cache. You can look for papers around this.
What you need is a way to "pin" the pages targeted for eviction for the duration of the IO. Post IO, you can free the pagecache page.
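Roughly, the pinning could look like this (an untested sketch, assuming you already hold a valid struct page * obtained while the page was still in the page cache):

    #include <linux/mm.h>
    #include <linux/pagemap.h>

    /* Pin a pagecache page so it stays valid while it is copied to the SSD. */
    static void write_page_to_ssd(struct page *page)
    {
        get_page(page);   /* take a reference so the page cannot be freed */
        lock_page(page);  /* stabilize contents against concurrent truncate/writeback */

        /* ... submit the IO that copies this page to the SSD and wait for it ... */

        unlock_page(page);
        put_page(page);   /* drop the reference; the kernel may now evict or reuse the page */
    }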
But this will delay the eviction, which may be needed under memory pressure to free up pages.
So, one possible solution is to start your caching algorithm a bit before the pagecache eviction starts.
A second possible solution is to set aside a bunch of free pages and exchange the page being evicted from the page cache with a page from the free pool, caching the evicted page in the background. But you now need to synchronize with file block deletes, etc.