Linux flush_cache_range() behavior

My code has a user mode mapping (set up via mmap()) which I need to flush after writing to it from the CPU but before I dispatch the data by DMA’ing the underlying physical memory. Also I need to invalidate the cache after data has arrived via a DMA to the underlying physical memory but before I attempt to read from it with the CPU.
In my mind “cache flushing” and “cache invalidating” mean two different things. Roughly “cache flushing” means writing what’s in the cache out to memory (or simply cache data goes to memory) whereas “cache invalidating” means subsequently assuming all cache contents are stale so that any attempts to read from that range will provoke a fresh read from memory (or simply memory data goes to cache).
However in the kernel I do not find two calls but instead just one: flush_cache_range().
This is the API I use for both tasks and it “seems to work”… at least it has up until the present issue I'm trying to debug.
This is possibly because the behavior of flush_cache_range() just might be to:
1) First write any dirty cache entries to memory, THEN
2) Invalidate all cache entries
IF this is what this API really does, then my use of it in this role is justified. After all, it's how I myself might implement it. The precise question for which I seek a confident answer is:
IS that in fact how flush_cache_range() actually works?

Whether caches need to be invalidated or flushed is architecture dependent.
You should always use the Linux DMA functions to handle these issues correctly.
Read DMA-API-HOWTO.txt and DMA-API.txt.
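For illustration, a rough sketch of the streaming DMA mapping pattern those documents describe; the struct device pointer, the buffer, and the device-programming steps are placeholders for whatever your driver actually uses:

    #include <linux/dma-mapping.h>
    #include <linux/errno.h>

    /* Sketch only: dev and buf come from your driver; the "program the
     * hardware" steps are placeholders, not real kernel APIs. */

    /* CPU filled buf; device will read it via DMA. */
    static int send_to_device(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t handle;

        /* Mapping for DMA_TO_DEVICE performs whatever cache flush the
         * architecture needs (nothing at all on coherent hardware). */
        handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, handle))
            return -ENOMEM;

        /* ... program the hardware with 'handle', wait for completion ... */

        dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
        return 0;
    }

    /* Device will fill buf via DMA; CPU reads it afterwards. */
    static int receive_from_device(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t handle;

        handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, handle))
            return -ENOMEM;

        /* ... device DMAs into the buffer ... */

        /* Unmapping for DMA_FROM_DEVICE performs whatever cache
         * invalidation the architecture needs before the CPU reads. */
        dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
        return 0;
    }

Whether the underlying implementation flushes, invalidates, or does nothing at all is then the architecture's problem, not yours.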

Related

How to put data in L2 cache with A72 Core?

I have an array of data that looks like this:
uint32_t data[128]; //Could be more than L1D Cache size
In order to do computation on it, I want to put the data as close as possible to my computing unit, so in the L2 Cache.
My target runs with a Linux kernel and some additional apps.
I know that I can get access to a certain area of memory with mmap, and I have successfully done it in some part of my available memory shared between cores.
How to do the same thing but in the L2 Cache area?
I've read part of the gcc documentation and the AArch64 assembly instruction set but cannot figure out a way to achieve this.
How to do the same thing but in the L2 Cache area?
Your hardware doesn't support that.
In general, the ARMv8 architecture doesn't make any guarantees about the contents of caches and does not provide any means to explicitly manipulate or query them - it only makes guarantees and provides tools for dealing with coherency.
Specifically, from section D4.4.1 "General behavior of the caches" of the spec:
[...] the architecture cannot guarantee whether:
• A memory location present in the cache remains in the cache.
• A memory location not present in the cache is brought into the cache.
Instead, the following principles apply to the behavior of caches:
• The architecture has a concept of an entry locked down in the cache.
How lockdown is achieved is IMPLEMENTATION DEFINED, and lockdown might
not be supported by:
— A particular implementation.
— Some memory attributes.
• An unlocked entry in a cache might not remain in that cache. The
architecture does not guarantee that an unlocked cache entry remains in
the cache or remains incoherent with the rest of memory. Software must
not assume that an unlocked item that remains in the cache remains dirty.
• A locked entry in a cache is guaranteed to remain in that cache. The
architecture does not guarantee that a locked cache entry remains
incoherent with the rest of memory, that is, it might not remain dirty.
[...]
• Any memory location is not guaranteed to remain incoherent with the rest of memory.
So basically you want cache lockdown. Consulting the manual of your CPU though:
• The Cortex-A72 processor does not support TLB or cache lockdown.
So you can't put something in cache on purpose. Now, you might be able to tell whether something has been cached by trying to observe side effects. The two common side effects of caches are latency and coherency. So you could try and time access times or modify the contents of DRAM and check whether you see that change in your cached mapping... but that's still a terrible idea.
For one, both of these are destructive operations, meaning they will change the property you're measuring, by measuring it. And for another, just because you observe them once does not mean you can rely on that happening.
Bottom line: you cannot guarantee that something is held in any particular cache by the time you use it.
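Just to illustrate the latency side effect mentioned above (and why relying on it is a bad idea), a crude userspace probe could look like the following sketch; the array size is arbitrary and the numbers it prints are noisy, not a guarantee of anything:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 20)   /* 4 MiB of uint32_t: bigger than typical L1/L2 */

    static uint64_t ns_now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    static uint64_t sum_pass(const uint32_t *p)
    {
        uint64_t s = 0;
        for (size_t i = 0; i < N; i++)
            s += p[i];
        return s;
    }

    int main(void)
    {
        uint32_t *p = calloc(N, sizeof *p);
        if (!p)
            return 1;

        uint64_t t0 = ns_now();
        uint64_t s1 = sum_pass(p);   /* cold pass: mostly cache misses */
        uint64_t t1 = ns_now();
        uint64_t s2 = sum_pass(p);   /* warm pass: perhaps some hits   */
        uint64_t t2 = ns_now();

        printf("cold %llu ns, warm %llu ns (sums %llu %llu)\n",
               (unsigned long long)(t1 - t0), (unsigned long long)(t2 - t1),
               (unsigned long long)s1, (unsigned long long)s2);
        free(p);
        return 0;
    }

Even if the second pass is faster, that only tells you something was cached somewhere at that moment, nothing more.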
Cache is not a place where data should be stored, it's just... a cache? :)
I mean, your processor decides which data it should cache and where (L1/L2/L3), and the logic depends on the CPU implementation.
If you wanted to, you could try to work out the algorithm for placing and replacing data in the cache and play with it (without guarantees, of course) by using dedicated instructions to prefetch your data and then non-caching instructions for the rest of your program's accesses.
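For instance, a sketch of that prefetch-hint idea using GCC's/Clang's __builtin_prefetch; the hint is purely advisory and gives no placement guarantee:

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: ask the CPU to start pulling upcoming elements toward the
     * caches while we work on the current ones. The distance (16) is a
     * guess; the CPU is free to ignore the hint entirely. */
    uint64_t sum_with_prefetch(const uint32_t *data, size_t n)
    {
        uint64_t s = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&data[i + 16], 0 /* read */, 3 /* keep */);
            s += data[i];
        }
        return s;
    }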
Maybe for modern ARM there are easier ways; I spoke from an x86/x64 perspective, but my whole point is: are you really sure that you need this?
CPUs are smart enough to cache the data they need, and they do it better and better year by year.
I'd recommend using any profiler that can show you cache misses, to be sure that your data is not in the cache already.
If it isn't, the first thing to optimize is the algorithm. Try to figure out why there was a cache miss: maybe you should load less data in the loop by using temporary variables, for example, or even unroll the loop manually to control where and what is being accessed.

What does it even mean to make an image visible to writes (or available to reads)?

An availability operation can be compared to a cache flush - cache contents are released to main memory.
Similarly, a visibility operation can be compared to a cache invalidation - the cache consumes the contents of main memory.
(it doesn't have to be a 1:1 hw mapping, but you get the idea)
It seems nonsensical to perform a visibility operation before a write (since we're about to overwrite whatever is in our imaginary or not-so-imaginary cache either way) or an availability operation after a read (nothing has changed!).
I saw code, which includes memory writes in dstAccessMask and/or memory reads in srcAccessMask. What's the point?
It seems nonsensical to perform a visibility operation before a write (since we're about to overwrite whatever is in our imaginary or not-so-imaginary cache either way) or an availability operation after a read (nothing has changed!).
So which not-so-imaginary cache are you talking about? And how are you enforcing coherency with other not-so-imaginary caches in the system? GPUs have many levels of parallel caching, some of which are coherent in hardware and some of which are not. What you get is entirely implementation-defined behavior ...
memory writes in dstAccessMask
Missing invalidation before a write can cause write-after-write hazards on the GPU (you can end up with mismatched parallel copies of the data in two different caches). You can also get issues if the HOST modified the data, which is why HOST is a possible stage.
memory reads in srcAccessMask
Can't think of a case where you'd need memory operations after a read-only usage.
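To tie that back to the flush/invalidate analogy from the question, a minimal sketch of a global memory barrier, assuming compute-shader writes followed by a transfer read and a command buffer cmd being recorded, might look like:

    #include <vulkan/vulkan.h>

    /* Sketch: make prior compute-shader writes available ("flush") and then
     * visible ("invalidate") to a subsequent transfer read. */
    static void make_writes_available_and_visible(VkCommandBuffer cmd)
    {
        VkMemoryBarrier barrier = {
            .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
            .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,   /* availability op */
            .dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT,  /* visibility op   */
        };

        vkCmdPipelineBarrier(cmd,
                             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                             VK_PIPELINE_STAGE_TRANSFER_BIT,
                             0,            /* dependencyFlags */
                             1, &barrier,
                             0, NULL,      /* no buffer barriers */
                             0, NULL);     /* no image barriers  */
    }

Which caches (if any) actually get flushed or invalidated by this is, as above, entirely implementation-defined.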

Write a cached page before it is reclaimed

Hi everyone, I am stuck on the following question.
I am working on a hybrid storage system which uses an SSD as a cache layer for a hard disk. To this end, data read from the hard disk should be written to the SSD to speed up subsequent reads of that data. Since Linux caches data read from disk in the page cache, the writing of the data to the SSD can be delayed; however, the pages caching the data may be freed, and accessing freed pages is not recommended. Here is the question: I have struct page pointers pointing to the pages to be written to the SSD. Is there any way to determine whether the page a pointer refers to is still valid (by valid I mean that the cached page can be safely written to the SSD)? What will happen if a freed page is accessed via the pointer? Is the data of a freed page the same as it was before freeing?
Are you using the cleancache module? You should only get valid pages from it, and they should remain valid until your callback function finishes.
Isn't this a cleancache/frontswap reimplementation? (https://www.kernel.org/doc/Documentation/vm/cleancache.txt).
The benefit of the existing cleancache code is that it calls your code just before it frees a page, so the page still resides in RAM; when there is no space left in RAM for it, the kernel calls your code to back it up in tmem (transient memory).
Searching I also found an existing project that seems to do exactly this: http://bcache.evilpiepirate.org/:
Bcache is a Linux kernel block layer cache. It allows one or more fast
disk drives such as flash-based solid state drives (SSDs) to act as a
cache for one or more slower hard disk drives.
Bcache patches for the Linux kernel allow one to use SSDs to cache
other block devices. It's analogous to L2Arc for ZFS, but Bcache also
does writeback caching (besides just write through caching), and it's
filesystem agnostic. It's designed to be switched on with a minimum of
effort, and to work well without configuration on any setup. By
default it won't cache sequential IO, just the random reads and writes
that SSDs excel at. It's meant to be suitable for desktops, servers,
high end storage arrays, and perhaps even embedded.
What you are trying to achieve looks like the following:
Before the page is evicted from the pagecache, you want to cache it. This, in concept, is called a Victim cache. You can look for papers around this.
What you need is a way to "pin" the pages targeted for eviction for the duration of the IO. Post IO, you can free the pagecache page.
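A minimal sketch of that pinning idea, assuming you already hold a valid struct page pointer; ssd_write_page() is a hypothetical stand-in for your own IO path, not a kernel API:

    #include <linux/mm.h>

    /* Sketch: take an extra reference so the page cannot be freed while it
     * is being copied to the SSD. ssd_write_page() is hypothetical. */
    static int cache_victim_page(struct page *page)
    {
        int ret;

        get_page(page);               /* pin: reclaim cannot free it now */
        ret = ssd_write_page(page);   /* your IO, e.g. via kmap_local_page() */
        put_page(page);               /* unpin: reclaim may proceed */

        return ret;
    }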
But, this will delay the eviction, which is possibly needed during memory pressure to create more un-cached pages.
So, one possible solution is to start your caching algorithm a bit before the pagecache eviction starts.
A second possible solution is to set aside a pool of free pages, exchange the page being evicted from the page cache with a page from the free pool, and cache the evicted page in the background. But you now need to synchronize with file block deletes, etc.

How to prioritize write() over mmap updates (or delay mmap page cache flush)

I'm running a specialized DB daemon on a debian-64 box with 64G of RAM and lots of disk space. It uses an on-disk hashtable (mmaped) and writes the actual data into a file with regular write() calls. When doing a lot of updates, a big part of the mmap gets dirty and the page cache tries to flush it to disk, producing lots of random writes which in turn slow down the performance of the regular (sequential) writes to the data file.
If it were possible to delay the page cache flush of the mmaped area performance would improve (I assume), since several (or all) changes to the dirty page would be written at once instead of once for every update (worst case, in reality of course it aggregates a lot of changes anyway).
So my question: Is it possible to delay the page cache flush for a memory-mapped area? Or is it possible to prioritize the regular write()s? Or does anyone have any other ideas? madvise and posix_fadvise don't seem to make any difference...
You could play with the tunables in /proc/sys/vm. For example, increase the value in dirty_writeback_centisecs to make pdflush wake up somewhat less often, increase dirty_expire_centisecs so data is allowed to stay dirty for longer until it must be written out, and increase dirty_background_ratio to allow more dirty pages to stay in RAM before something must be done.
See here for a somewhat comprehensive description of what all the values do.
Note that this will affect every process on your machine, but seeing how you're running a huge database server, chances are that this is no problem since you don't want anything else to run on the same machine anyway.
Now of course this delays writes, but it still doesn't fully solve the problem of dirty-page writeback competing with your write() calls (though it will likely collapse a few writes if there are many updates).
But: you can use the sync_file_range syscall to force the start of write-out for pages in a given range on your "write" file descriptor (SYNC_FILE_RANGE_WRITE). So while the dirty pages will be written back at some unknown time later (and with greater grace periods), you manually kick off writeback on the ones you're interested in.
This doesn't give any guarantees, but it should just work.
Be sure to absolutely positively read the documentation, better read it twice. sync_file_range can very easily corrupt or lose data if you use it wrong. In particular, you must be sure metadata is up-to-date and flushed if you appended to a file, or data that has been "successfully written" will just be "gone" in case of a crash.
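As a rough sketch (fd is your data-file descriptor, and offset/len delimit the range you just finished writing):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Sketch: ask the kernel to start writeback of the just-written range,
     * without waiting for it to complete. No durability guarantees. */
    static int kick_writeback(int fd, off_t offset, off_t len)
    {
        return sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
    }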
I would try mlock. If you mlock the relevant memory range, it may prevent the flush from occurring. You could munlock when you're done.
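A sketch of that suggestion, assuming map and map_len describe the mmap'd hashtable region; note that whether mlock actually holds back dirty writeback is not something the interface guarantees:

    #include <sys/mman.h>

    /* Sketch: lock the mapped hashtable into RAM for the duration of an
     * update burst. mlock() pins the pages in memory; any effect on
     * writeback timing is incidental, not guaranteed. */
    static int lock_hashtable(void *map, size_t map_len)
    {
        if (mlock(map, map_len) != 0)
            return -1;
        /* ... perform the burst of updates ... */
        return munlock(map, map_len);
    }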

unbuffered I/O in Linux

I'm writing lots and lots of data that will not be read again for weeks. As my program runs, the amount of free memory on the machine (displayed with 'free' or 'top') drops very quickly, while the amount of memory my app uses does not increase, and neither does the amount of memory used by other processes.
This leads me to believe the memory is being consumed by the filesystem's cache. Since I do not intend to read this data for a long time, I'm hoping to bypass the system's buffers so that my data is written directly to disk. I don't have dreams of improving perf or being a super ninja; my hope is to give a hint to the filesystem that I'm not going to be coming back for this memory any time soon, so don't spend time optimizing for those cases.
On Windows I've faced similar problems and fixed them using FILE_FLAG_NO_BUFFERING|FILE_FLAG_WRITE_THROUGH: the machine's memory was not consumed by my app and the machine was more usable in general. I'm hoping to duplicate the improvements I've seen, but on Linux. On Windows there is the restriction of writing in sector-sized pieces; I'm happy with this restriction for the amount of gain I've measured.
is there a similar way to do this in Linux?
The closest equivalent to the Windows flags you mention I can think of is to open your file with the open(2) flags O_DIRECT | O_SYNC:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In
general this will degrade performance, but it is useful in special
situations, such as when applications do their own caching. File I/O
is done directly to/from user space buffers. The O_DIRECT flag on its
own makes an effort to transfer data synchronously, but does not
give the guarantees of the O_SYNC flag that data and necessary metadata
are transferred. To guarantee synchronous I/O, O_SYNC must be used in
addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is
described in raw(8).
Granted, while trying to do research on this flag to confirm it's what you want, I found this interesting piece telling you that unbuffered I/O is a bad idea, with Linus describing it as "brain damaged". According to that, you should be using madvise() instead to tell the kernel how to cache pages. YMMV.
You can use O_DIRECT, but in that case you need to do the block IO yourself; you must write in multiples of the FS block size and on block boundaries (it is possible that this is not strictly mandatory, but if you do not, performance will suck x1000 because every unaligned write will need a read first).
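A sketch of what that can look like, with 4096 standing in for the real required alignment (query it for your filesystem/device instead of hard-coding it):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Sketch: O_DIRECT wants the buffer, the file offset and the length
     * all aligned; 4096 is an assumed alignment, not a universal value. */
    static int write_direct(const char *path)
    {
        void *buf;
        ssize_t n;
        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
        if (fd < 0)
            return -1;
        if (posix_memalign(&buf, 4096, 4096) != 0) {
            close(fd);
            return -1;
        }
        /* ... fill buf with exactly 4096 bytes of data ... */
        n = write(fd, buf, 4096);
        free(buf);
        close(fd);
        return n == 4096 ? 0 : -1;
    }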
Another much less impacting way of stopping your blocks using up the OS cache without using O_DIRECT, is to use posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED). Under Linux 2.6 kernels which support it, this immediately discards (clean) blocks from the cache. Of course you need to use fdatasync() or such like first, otherwise the blocks may still be dirty and hence won't be cleared from the cache.
It is probably a bad idea to call fdatasync() and posix_fadvise( ... POSIX_FADV_DONTNEED) after every write; instead, wait until you've written a reasonable amount (50M, 100M maybe).
So in short:
after every significant chunk of writes,
call fdatasync() followed by posix_fadvise( ... POSIX_FADV_DONTNEED), as sketched below.
This will flush the data to disc and immediately remove it from the OS cache, leaving space for more important things.
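A minimal sketch of that pattern, assuming fd is the file being written and offset/len delimit the chunk just written:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Sketch: after a sizeable chunk has been written through fd, push it
     * to disk and then drop the now-clean pages from the page cache. */
    static int write_out_and_forget(int fd, off_t offset, off_t len)
    {
        if (fdatasync(fd) != 0)   /* make the dirty pages clean first */
            return -1;
        return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    }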
Some users have found that things like fast-growing log files can easily blow "more useful" stuff out of the disc cache, which reduces cache hits a lot on a box which needs to have a lot of read cache, but also writes logs quickly. This is the main motivation for this feature.
However, like any optimisation
a) You're not going to need it so
b) Do not do it (yet)
as my program runs the amount of free memory on the machine drops very quickly
Why is this a problem? Free memory is memory that isn't serving any useful purpose. When it's used to cache data, at least there is a chance it will be useful.
If one of your programs requests more memory, file caches will be the first thing to go. Linux knows that it can re-read that data from disk whenever it wants, so it will just reap the memory and give it a new use.
It's true that Linux by default waits around 30 seconds (this is what the value used to be anyhow) before flushing writes to disk. You can speed this up with a call to fsync(). But once the data has been written to disk, there's practically zero cost to keeping a cache of the data in memory.
Seeing as you write to the file and don't read from it, Linux will probably guess that this data is the best to throw out, in preference to other cached data. So don't waste effort trying to optimise unless you've confirmed that it's a performance problem.

Resources