Prime and Probe vs Evict and Reload - security

I'm trying to figure out the differences between two cache side-channel attacks: Prime and Probe vs Evict and Reload.
It seems that both attacks are identical: the adversary evicts the victim's data from cache sets by filling them with its own data, then periodically tests for a cache miss or hit, which lets it infer the memory access pattern of the victim.
I did find a lecture from Black Hat Asia 2017 in which they explain that Prime and Probe doesn't require shared memory, so my assumption is that the attacks are otherwise identical, and that the term Prime and Probe simply refers to Evict and Reload on unshared memory?

Evict+Reload uses shared memory (usually a shared library) in the middle.
The attacker first evicts the shared memory from the cache set using an eviction set. If the victim now accesses the shared memory, it will overwrite the attacker's data in the cache. The attacker then also accesses the shared memory and measures how long it takes. If it was fast, the victim accessed the shared memory in between; if it was slow, they did not.
For Prime+Probe, the attacker first primes/fills the cache set with their eviction set.
The victim may now access memory of their own that maps to the same cache set and thereby evict some of the attacker's data. The attacker then accesses all of their memory and measures the time. If it was fast, the victim did not access the memory; if it was slow, they did (because of the resulting cache miss).
So essentially the idea behind the two attacks is similar, but Prime+Probe does not need shared memory and therefore works slightly differently.
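Both attacks bottom out in the same measurement primitive: time a single memory access and compare it against a hit/miss threshold. A minimal sketch for x86 with GCC/Clang intrinsics (the function name is mine, and the threshold separating hit from miss must be calibrated per machine):

```c
#include <stdint.h>
#include <x86intrin.h>

/* Time one access to `addr`. A low cycle count suggests a cache hit,
 * a high one a miss; the hit/miss threshold is machine-specific. */
static uint64_t probe(const volatile uint8_t *addr)
{
    unsigned aux;
    uint64_t start = __rdtscp(&aux);  /* rdtscp waits for prior instructions */
    (void)*addr;                      /* the access being timed */
    uint64_t end = __rdtscp(&aux);
    return end - start;
}
```

In Evict+Reload the probed address lies in the shared memory; in Prime+Probe the attacker instead probes each line of its own eviction set.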

Related

Do Linux and macOS have an `OfferVirtualMemory` counterpart?

Windows, starting with a certain unspecified update of Windows 8.1, has the excellent OfferVirtualMemory and ReclaimVirtualMemory system calls which allow memory regions to be "offered" to the OS. This removes them from the working set, reduces the amount of physical memory usage that is attributed to the calling process, and puts them onto the standby memory list of the program, but without ever swapping out the contents anywhere.
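For reference, a minimal sketch of the offer/reclaim cycle with these two calls (the helper names are mine; I'm reading ERROR_SUCCESS as "contents intact" per the API's documented contract, so double-check the exact error codes against memoryapi.h):

```c
#include <windows.h>

/* Rough sketch of the offer/reclaim cycle for a regenerable cache.
 * `buf`/`size` are assumed to come from VirtualAlloc(MEM_COMMIT). */
void cache_offer(void *buf, SIZE_T size)
{
    /* Pages leave the working set; the OS may repurpose them. */
    OfferVirtualMemory(buf, size, VmOfferPriorityLow);
}

int cache_reuse(void *buf, SIZE_T size)
{
    /* ERROR_SUCCESS means the contents survived on the standby list;
       any other result means they are now undefined and the caller
       must regenerate them before use. */
    return ReclaimVirtualMemory(buf, size) == ERROR_SUCCESS;
}
```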
(Below is a brief and rough explanation of what those do and how standby lists work, to help people understand what kind of system call I'm looking for, so skip ahead if you already know all of this.)
Quick standby list reference
Pages in the standby list can be returned to the working set of the process in two ways. In the slow path, their contents were swapped out to disk because the physical memory was repurposed, either to house a fresh allocation or to swap in other memory from disk (when there's no available "dead weight" zeroed memory on the system), so the contents must be read back in. In the fast path, no swapping happens: the physical memory is returned to the same virtual memory region it was first removed from, sidestepping the swap entirely, while the program's working set had still been reduced in the meantime to the memory it was actively working on.
Alternatively, if another program requests physical memory and the system doesn't have zeroed pages (if no program was closed recently, for example, and the rest of RAM has been used up with various system caches), physical memory from the standby list of a program can be zeroed, removed from the standby list, and handed over to the program which requested the memory.
Back to memory offering
Since the offered memory never gets swapped out, if upon being removed from the standby list it no longer belongs to the same virtual memory segment (i.e. it was removed from standby by anything other than ReclaimVirtualMemory), the reclamation can fail, reporting that the contents of the memory region are now undefined (the reclaimed range is backed by uninitialized pages from the program's own standby list or by zeroed memory). The program then has to regenerate the contents of the region from another data source, or by rerunning some computation.
The practical effect, when this is used to implement an intelligent computation cache, is twofold. First, the reported working set of the program shrinks, giving a more accurate picture of how much memory it really needs. Second, cached data that can be regenerated from another region of memory can be discarded quickly so another program can use the physical memory, without waiting for the disk (and without the additional strain on it, which adds up over time and results in increased wear) as it would otherwise swap out cache contents that aren't expensive to recreate.
One good example of a use case is the render cache of a web browser: it can simply re-render parts of the page on request, and gains little from having those caches inflate the working set and alarm the user with high memory usage. Pages that aren't currently being shown are where this approach may give the biggest theoretical yield.
The question
Do Linux and macOS have a comparable API set that allows memory to be marked as discardable at the memory manager's discretion, with a fallible system call to lock that memory back in, declaring the memory uninitialized if it was indeed discarded?
Linux 4.5 and later has madvise with MADV_FREE: the kernel may replace the pages with pages of zeros at any time until they are next written.
To lock the memory back in, write to it, then read it to check whether it has been zeroed. This needs to be done separately for every page.
Before Linux 4.12 the memory was freed immediately on systems without swap.
You also need to take care of compiler memory reordering, so use atomic_signal_fence or an equivalent in C/C++.
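A rough sketch of that per-page write-then-read scheme, under my own assumptions: a 4 KiB page size and the first two bytes of each page reserved for bookkeeping (the canary offset and helper names are invented for illustration):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE   4096   /* assumed; use sysconf(_SC_PAGESIZE) in real code */
#define CANARY 1      /* reserved byte checked after reclaim */

/* Offer: tag each page, then let the kernel discard the range lazily. */
static void offer(uint8_t *addr, size_t len)
{
    for (size_t off = 0; off < len; off += PAGE)
        ((volatile uint8_t *)addr)[off + CANARY] = 0xA5;
    madvise(addr, len, MADV_FREE);
}

/* Reclaim: returns 0 if all contents survived, -1 if any page was
 * replaced with zeros and the contents must be regenerated. */
static int reclaim(uint8_t *addr, size_t len)
{
    int lost = 0;
    for (size_t off = 0; off < len; off += PAGE) {
        volatile uint8_t *p = addr + off;
        p[0] = 1;                                   /* dirty the page: cancels MADV_FREE */
        atomic_signal_fence(memory_order_seq_cst);  /* keep the write before the read */
        if (p[CANARY] != 0xA5)                      /* zeroed => contents were dropped */
            lost = 1;
    }
    return lost ? -1 : 0;
}
```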

Spectre with device memory

Regarding the Spectre security issues and side-channel attacks:
Both x86 and ARM provide a way to disable caching/speculative access on specific memory pages, so any side-channel attack (Spectre, Meltdown) on those memory regions should be impossible. Why are we not using this to prevent side-channel attacks, by storing all sensitive information (passwords, keys, etc.) in slow but secure(?) memory regions while placing the non-sensitive data in fast but insecure normal memory? Access time on those pages will increase by a huge factor (~100x), but the kernel fixes are not cheap either. So maybe reducing the performance of only a few memory pages is faster than a slight overall decrease?
It would shift the responsibility for fixing the issues from the OS to the application developer, which would be a huge change. But hoping that the kernel will somehow fix all bugs doesn't seem to be a good approach either.
So my questions are:
Will the use of "device" memory pages really prevent such attacks?
What are the downsides? (Besides the obvious performance cost.)
How practical would the usage be?
Because our compilers / toolchains / OSes don't have support for using uncacheable memory for some variables, or for avoiding spilling copies of them (or of temporaries calculated from them) to the stack.
Also, AFAIK you can't even allocate a page of UC memory in a user-space process on Linux even if you wanted to. That could of course be changed with a new flag for mmap and/or mprotect. Hopefully it could be designed so that running new binaries on an old system would get regular write-back memory (and thus still work, but without the security advantages).
I don't think there are any denial-of-service implications to letting unprivileged user-space map WC or UC memory; you can already use NT stores and/or clflush to force memory access and compete for a larger share of system memory-controller time / resources.

What is coherent memory on GPU?

I have stumbled more than once on the terms "non-coherent" and "coherent" memory in tech papers related to graphics programming. I have been searching for a simple and clear explanation, but have found mostly "hardcore" papers of this type. I would be glad to receive a layman's-style answer on what coherent memory actually is on GPU architectures and how it compares to other (presumably non-coherent) memory types.
Memory is memory. But different things can access that memory. The GPU can access memory, the CPU can access memory, maybe other hardware bits, whatever.
A particular thing has "coherent" access to memory if changes made by others to that memory are visible to the reader. Now, you might think this is foolishness. After all, if the memory has been changed, how could someone possibly be unable to see it?
Simply put, caches.
It turns out that changing memory is expensive. So we do everything possible to avoid changing memory unless we absolutely have to. When you write a single byte from the CPU to a pointer in memory, the CPU doesn't write that byte yet. Or at least, not to memory. It writes it to a local copy of that memory called a "cache."
The reason for this is that, generally speaking, applications do not write (or read) single bytes. They are more likely to write (and read) lots of bytes, in small chunks. So if you're going to perform an expensive operation like a memory load or store, you should load or store a large chunk of memory. So you store all of the changes you're going to make to a chunk of memory in a cache, then make a single write of that cached chunk to actual memory at some point in the future.
But if you have two separate devices that use the same memory, you need some way to be certain that writes one device makes are visible to other devices. Most GPUs can't read the CPU cache. And most CPU languages don't have language-level support to say "hey, that stuff I wrote to memory? I really mean for you to write it to memory now." So you usually need something to ensure visibility of changes.
In Vulkan, memory labeled with VK_MEMORY_PROPERTY_HOST_COHERENT_BIT means that, if you read/write that memory (via a mapped pointer, since that's the only way Vulkan lets you directly write to memory), you don't need to call vkInvalidateMappedMemoryRanges/vkFlushMappedMemoryRanges to make sure the CPU/GPU can see those changes; the visibility of any changes is guaranteed in both directions. If that flag isn't set on the memory type, then you must use those functions to ensure the coherency of the specific regions of data you want to access.
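For a non-coherent memory type, the host-write path looks roughly like this (a sketch; `device` and `memory` are assumed to exist already, and using VK_WHOLE_SIZE sidesteps the nonCoherentAtomSize alignment rules for the flushed range):

```c
#include <string.h>
#include <vulkan/vulkan.h>

/* Write through a mapped pointer, then flush so the GPU sees it.
 * With HOST_COHERENT memory the flush call could simply be dropped. */
void upload(VkDevice device, VkDeviceMemory memory,
            const void *data, VkDeviceSize dataSize)
{
    void *ptr;
    vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &ptr);
    memcpy(ptr, data, (size_t)dataSize);

    VkMappedMemoryRange range = {
        .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
        .memory = memory,
        .offset = 0,
        .size   = VK_WHOLE_SIZE,   /* whole allocation, alignment-safe */
    };
    vkFlushMappedMemoryRanges(device, 1, &range);
    vkUnmapMemory(device, memory);
}
```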
With coherent memory, one of two things is going on in terms of hardware. Either CPU access to the memory is not cached in any of the CPU's caches, or the GPU has direct access to the CPU's caches (perhaps due to being on the same die as the CPU(s)). You can usually tell that the latter is happening, because on-die GPU implementations of Vulkan don't bother to offer non-coherent memory options.
If memory is coherent then all threads accessing that memory must agree on the state of the memory at all times, e.g.: if thread 0 reads memory location A and thread 1 reads the same location at the same time, both threads should always read the same value.
But if memory is not coherent then the two threads might read back different values. Thread 0 could think that location A contains a 1, while thread 1 thinks that location contains a 2. The threads would have an incoherent view of the memory.
Coherence is hard to achieve with a high number of cores, since often every core must be aware of the memory accesses of all other cores. If you have 4 cores in a quad-core CPU, coherence is not that hard to achieve, as every core must be informed about the memory accesses of only 3 other cores; but in a GPU with 16 cores, every core must be made aware of the memory accesses of 15 other cores. The cores exchange data about the contents of their caches using so-called cache coherence protocols.
This is why GPUs often only support limited forms of coherency. If some memory locations are read only or are only accessed by a single thread, then no coherence is required. If caches are small and coherence is not always required but only at specific instructions of the program, then it is possible to achieve correct behavior of the program using cache flushes before or after specific memory accesses.
If your hardware offers both coherent and non-coherent memory types, then you can expect the non-coherent memory to be faster; but if you try to run parallel algorithms that rely on coherence using this memory, they will fail in really weird ways.

mmap(): resetting old memory to a zero'd non-resident state

I'm writing a memory allocation routine, and it's currently running smoothly. I get my memory from the OS with mmap() in 4096-byte pages. When I start my memory allocator I allocate 1gig of virtual address space with mmap(), and then as allocations are made I divide it up into hunks according to the specifics of my allocation algorithm.
I feel safe allocating as much as a gig of memory on a whim because I know mmap() doesn't actually put pages into physical memory until I actually write to them.
Now, the program using my allocator might have a spurt where it needs a lot of memory, in which case the OS would eventually have to put a whole gig's worth of pages into physical RAM. The trouble is that the program might then go into a dormant period where it frees most of that gig and uses only minimal amounts of memory. Yet all I really do inside my allocator's MyFree() function is flip a few bits of bookkeeping data to mark the previously used gig as free, and I know this doesn't cause the OS to remove those pages from physical memory.
I can't use something like munmap() to fix this problem, because the nature of the allocation algorithm is such that it requires a continuous region of memory without any holes in it. Basically I need a way to tell the OS "Listen, you can take these pages out of physical memory and clear them to 0, but please remap them on the fly when I need them again, as if they were freshly mmap()'d"
What would be the best way to go about this?
Actually, after writing this all up I just realized that I can probably do an munmap() followed immediately by a fresh mmap(). Would that be the correct way to go about it? I get the sense that there's probably some more efficient way to do this.
You are looking for madvise(addr, length, MADV_DONTNEED). From the manpage:
MADV_DONTNEED: Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
Note especially the language about how subsequent accesses will succeed but revert to zero-fill-on-demand (for mappings without an underlying file).
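Inside the allocator's MyFree() path, that would look something like this (a sketch; `base` is the start of the mmap()'d region and the range is assumed page-aligned):

```c
#include <stdio.h>
#include <sys/mman.h>

/* Release the physical pages behind a freed hunk while keeping the
 * virtual range mapped and contiguous; the next access to the range
 * faults in fresh zero pages on demand. */
static void release_hunk(char *base, size_t offset, size_t len)
{
    if (madvise(base + offset, len, MADV_DONTNEED) != 0)
        perror("madvise(MADV_DONTNEED)");
}
```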
Your thinking-out-loud alternative of an munmap followed immediately by another mmap will also work, but it risks kernel-side inefficiencies because the kernel is no longer tracking the allocation as a single contiguous region; if there are many such unmap-and-remap events, the kernel-side data structures might wind up quite bloated.
By the way, with this kind of allocator it's very important that you use MAP_NORESERVE for the initial allocation, and then touch each page as you allocate it, and trap any resulting SIGSEGV and fail the allocation. (And you'll need to document that your allocator installs a handler for SIGSEGV.) If you don't do this your application will not work on systems that have disabled memory overcommit. See the mmap manpage for more detail.
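A sketch of that MAP_NORESERVE-plus-SIGSEGV scheme (helper names are mine, and jumping out of a SIGSEGV handler is a well-known but delicate trick, so treat this as illustrative). The initial reservation would be mmap(NULL, 1 << 30, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0), and each page is then committed like this:

```c
#include <setjmp.h>
#include <signal.h>
#include <sys/mman.h>

static sigjmp_buf commit_fail;

static void on_segv(int sig)
{
    (void)sig;
    siglongjmp(commit_fail, 1);   /* unwind back into commit_page() */
}

/* Touch one page to commit it; returns 0 on success, -1 if the kernel
 * could not back the page (the first write raised SIGSEGV). */
static int commit_page(void *page)
{
    struct sigaction sa = {0}, old;
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    int rc = 0;
    if (sigsetjmp(commit_fail, 1) == 0)
        *(volatile char *)page = 0;   /* first touch faults the page in */
    else
        rc = -1;                      /* report the allocation as failed */

    sigaction(SIGSEGV, &old, NULL);
    return rc;
}
```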

Write a cached page before it is reclaimed

Hello everyone. I am stuck on the following question.
I am working on a hybrid storage system which uses an SSD as a cache layer for a hard disk. To this end, data read from the hard disk should be written to the SSD to speed up subsequent reads of that data. Since Linux caches data read from disk in the page cache, writing the data to the SSD can be delayed; however, the pages caching the data may be freed, and accessing freed pages is not recommended. Here is the question: I have struct page pointers pointing to the pages to be written to the SSD. Is there any way to determine whether the page behind such a pointer is still valid (by valid I mean the cached page can safely be written to the SSD)? What happens if a freed page is accessed via the pointer? Is the data of a freed page the same as it was before freeing?
Are you using the cleancache module? You should only get valid pages from it, and they should remain valid until your callback function finishes.
Isn't this a cleancache/frontswap reimplementation? (https://www.kernel.org/doc/Documentation/vm/cleancache.txt)
The benefit of the existing cleancache code is that it calls your code just before it frees a page, i.e. while the page still resides in RAM; when there is no space left in RAM for it, the kernel calls your code to back the page up in tmem (transient memory).
Searching, I also found an existing project that seems to do exactly this: http://bcache.evilpiepirate.org/:
Bcache is a Linux kernel block layer cache. It allows one or more fast disk drives such as flash-based solid state drives (SSDs) to act as a cache for one or more slower hard disk drives.
Bcache patches for the Linux kernel allow one to use SSDs to cache other block devices. It's analogous to L2Arc for ZFS, but Bcache also does writeback caching (besides just write through caching), and it's filesystem agnostic. It's designed to be switched on with a minimum of effort, and to work well without configuration on any setup. By default it won't cache sequential IO, just the random reads and writes that SSDs excel at. It's meant to be suitable for desktops, servers, high end storage arrays, and perhaps even embedded.
What you are trying to achieve looks like the following:
Before a page is evicted from the page cache, you want to cache it. In concept this is called a victim cache; you can look for papers on it.
What you need is a way to "pin" the pages targeted for eviction for the duration of the IO (see the sketch below). After the IO, you can free the page cache page.
But this will delay the eviction, which may be needed during memory pressure to create more free pages.
So one possible solution is to start your caching algorithm a bit before page cache eviction starts.
A second possible solution is to set aside a pool of free pages and exchange the page being evicted from the page cache with a page from the free pool, caching the evicted page in the background. But you then need to synchronize with file block deletes, etc.
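A rough kernel-side sketch of the pinning step, assuming the page is looked up in the page cache by (mapping, index); kernel-internal APIs like these shift between versions (newer kernels use folio variants), so treat it as illustrative:

```c
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Pin a page cache page long enough to copy it out to the SSD.
 * find_get_page() takes a reference, so the page cannot be freed
 * and reused underneath us while we hold it. */
static int copy_page_to_ssd(struct address_space *mapping, pgoff_t index)
{
    struct page *page = find_get_page(mapping, index);
    if (!page)
        return -ENOENT;            /* already evicted */

    lock_page(page);               /* stabilize the contents */
    /* ... submit the SSD write using page_address(page) / kmap() ... */
    unlock_page(page);

    put_page(page);                /* drop our reference */
    return 0;
}
```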
