spectre with device memory

spectre with device memory - security

Regarding the spectre security issues and side-channel attacks.
In both x86 and ARM exists a method to disable caches/speculative access on specific memory pages. So any side-channel attack (spectre, meltdown) on these memory regions should be impossible. So why are we not using this to prevent side-channel attacks by storing all secure information (password, keys, etc.) into slow but secure (?) memory regions, while placing the unsecure data into the fast but unsecure normal memory? Accesstime on these pages will decrease by a huge factor (~100), but the kernel fixes are not cheap either. So maybe reducing the performance of only a few memory-pages is faster then a slightly overall decrease?
It would shift the responsibility of fixing the issues from the OS to the application-developer, which would be a huge change. But hoping, that the kernel will somehow fix all bugs seems not to a be good approach either.
So my questions are
Will the use of "device" memory-pages really prevent such attacks?
What are the downsides of it? (Besides the obvious performance issues)
How practical would be the usage?

Because our compilers / toolchains / OSes don't have support for using uncacheable memory for some variables, and for avoiding spilling copies of them to the stack. (Or temporaries calculated from them.)
Also AFAIK, you can't even allocate a page of UC memory in a user-space process on Linux even if you wanted to. That could of course be changed with a new flag for mmap and/or mprotect. Hopefully it could be designed so that running new binaries on an old system would get regular write-back memory (and thus still work, but without security advantages).
I don't think there are any denial-of-service implications to letting unprivileged user-space map WC or UC memory; you can already use NT stores and / or clflush to force memory access and compete for a larger share of system memory-controller time / resources.

Related

Why can't we have a safe ISA?

Accroding to this paper: https://doi.org/10.1109/SP.2013.13, Memory corruption bugs are one of the oldest problems in computer security. The lack of memory safety and type safety has caused countless bugs, causing billions of dollars and huge efforts to fix them.
But the root of C/C++'s memory vulnerability can trace down to the ISA level. At ISA level, every instruction can access any memory address without any fine grained safe check (only corase grained check like page fault). Sure, we can implement memory safe at a higher software level, like Java (JVM), but this leads to significant cost of performance. In a word, we can't have both safety and performance at the same time on existing CPUs.
My question is, why can't we implement the safety at the hardware level? If the CPU has a safe ISA, which ensures the memory safe by, I don't know, taking the responsbilities of malloc and free, then maybe we can get rid of the performance decline of software safe checking. If anyone professional in microelectronics can tell me, is this idea realistic?

Depending on what you mean, it could make it impossible implement memory-unsafe languages like C in a normal way. e.g. every memory access would have to be to some object that has a known size? I'd guess an operating system for such a machine might have to work around that "feature" by telling it that the entire address space was one large array object. Or else you'd need some mechanism for a read system call to know the proper bounds of the object it's writing in the copy_to_user() part of its job. And then there's other OS stuff like accessing the same physical page from different virtual pages.
The OP (via asking on Reddit) found the CHERI project which is an attempt at this idea, involving "... revisit fundamental design choices in hardware and software to dramatically improve system security." Changing hardware alone can't work; compilers need to change, too. But they were able to adapt "Clang/LLVM, FreeBSD, FreeRTOS, and applications such as WebKit," so their approach could be practical. (Unlike the hypothetical versions I was imagining when writing other parts of this answer.)
CHERI uses "fine-grained memory protection", and "Language and compiler extensions" to implement memory-safe C and C++, and higher-level languages.
So it's not a drop-in replacement, and it sounds like you have to actively use the features to gain safety. As I argue in the rest of the answer, hardware can't do it alone, and it's highly non-trivial even with software cooperation. It's easy to come up with ways that wouldn't work. :P
For hardware-enforced memory-safety to be possible, hardware would have to know about every object and its size, and be able to cache that structure in a way that allows efficient lookups to find the bounds. Page tables (4k granularity, or larger in more modern ISAs) are already hard enough for hardware for hardware to cache efficiently for large programs, and that's without even considering which pointer goes with which object.
Checking a TLBs as part of every load and store can be done efficiently, but checking another structure in parallel with that might be problematic. Especially when the ranges don't have power-of-2 sizes and natural alignment, the way pages do, which makes it possible to build a TLB from content-addressable memory that checks for a match against each of several possible values for the high bits. (e.g. a page is 4k in size, always starting at a 4k alignment boundary.)
You mean it may cost too much at hardware level, like the die area?
Die area might not even be the biggest problem, especially these days. It would cost power, and/or cost latency in very important critical paths such as L1d hit load-use latency. Even if you could come up with some plausible way for software to make tables that hardware could check, or otherwise solve the other parts of this problem.
Modifying a page-table entry requires invalidating the entry, including TLB shootdown for other cores. If every free (and some malloc) cost inter-core communication to do similar things for object tables, that would be very expensive.
I think inventing a way for software to tell the hardware about objects would be an even bigger problem. malloc and free aren't something you can just build in to a CPU where memory addressing works anything like existing CPUs, or like it does in C. Software needs to manage memory, it doesn't make sense to try to build that in to a CPU. So then malloc and free (and mmap with file-backed mappings and shared memory...) need a way to tell the CPU about objects. Seems like a mess.
I think at best an ISA could provide more tools software can use to make bounds-checks cheaper. Perhaps some kind of extra semantics on loads/stores, like an extra operand for indexed addressing modes for load or store that takes a max?
At least if we want an ISA to work anything like current ones, rather than work like a JVM or a Transmeta Crusoe and internally recompile for some real ISA.
Intel's MPX ISA extension to x86 was an attempt to let software set up bound ranges, but it's been mostly abandoned due to lower performance than pure software. Intel even dropped it from their recent CPUs (Not present in 10th Gen CPUs using 10nm lithography, or later.)
This is all just off the top of my head; I haven't searched for any serious proposals for how a system could plausibly work.
I don't think memory safety is something you can easily add after the fact to languages like C that weren't originally designed with it.

Have a look to "Code for malloc and free" at SO. Those commands are very, very far away from even being defined within an instruction set.

How to read stale values on x86

My goal is to read in stale and outdated values of memory without cache-coherence. I have attempted to use prefetchnta to perform a non-temporal load, but it failed to fetch outdated values. I am looking into performing some kind of Streaming Memory-to-Memory Direct-Memory-Access, but am having a little trouble due to the overwhelming amount of background knowledge required to proceed with my current project. Currently I am attempting to mess around with udmabuf but even that is going slowly. It should be noted that ideally I would like to ignore the contents of all CPU caches, including the current CPU.
To provide my reasoning as to why: I am developing software that can be used to prove correctness of programs written for non-volatile memory. As the CPU Cache is volatile, the CPU's write-back cache will still be volatile and the arbitrary nature of how they are written back to memory needs to be observed.
I would sincerely appreciate it if someone could give me some pointers of how to proceed. I do not mind digging into the Linux kernel, as in fact I am doing that now, nor do I mind modifying it, I just need a little guidance in the right direction.

I haven't played around with this, but my understanding from the docs is that for loads (unlike NT stores) nothing can bypass cache or override the strong ordering of memory types like the normal WB (write-back). And even NT stores evict already-cached data, so they can't break coherence for this or another core that has cached data for the line you're writing.
You can do weakly-ordered loads from WC (write-combining) memory regions (with prefetchnta or SSE4 movntdqa), but they're probably still coherent at the physical address level.
#MargaretBloom commented
IIRC Intel warns the developer about multiple mapping with different cache types, which may indeed be good in this case.
so maybe you could actually bypass cache coherence with multiple virtual mappings of the same physical page.
I don't know if it's possible to do non-coherent DMA with a PCI / PCIe device, but that might be your only hope for getting actual DRAM contents without going through cache.
Normally (always?) DMA on modern x86 systems is cache-coherent, which is good for performance. To maintain backwards compat with 386 and earlier CPUs without caches, the first x86 CPUs with caches had cache-coherent DMA, not introducing cache-control instructions until later generations, since existing OSes didn't use them. In modern systems, memory controllers are built-in to the CPU. So on Intel CPUs, the system agent can snoop L3 tags to see if a line is cached anywhere on-chip in parallel with sending the request to the memory controller. Or a Xeon can DMA right into L3 cache without data having to bounce through DRAM, good for high bandwidth NICs.
There's an INVD instruction which invalidates all caches without doing write-back first, but I think that includes the shared L3 cache, and probably the private caches of all other cores. So you can't practically use it on a Linux system where other cores are potentially in the middle of doing stuff; you'd potentially corrupt kernel data structures by using it, as well as simulating power failure on a machine with NVDIMMs for the process you were interested in.
Maybe if you somehow offlined all the other CPU cores, and disabled interrupts on the one core that was still up
you could wbinvd (write-back+invalidate) to flush all caches
then run some code under test
then invd and see what made it to DRAM
Then re-enable interrupts. Interrupt handlers could end up with some kernel data cached and some in memory, or get device drivers out of sync with hardware, if any interrupts are handled between the wbinvd and the invd.
Update: someone did actually attempt this:
How to run "invd" instruction with disabled SMP support?
How to explicitly load a structure into L1d cache? Weird results with INVD with CR0.CD = 1 on isolated core with/without hyperthreading - invd worked so well it nuked some of the stores done by printk in the mis-designed attempt to log something about it.

lock contention in memory allocation - multi-threaded vs. multi-process

We have developed a big C++ application that is running satisfactorily at several sites on big Linux and Solaris boxes (up to 160 CPU cores or even more). It's a heavily multi-threaded (1000+ threads), single-process architecture, consuming huge amounts of memory (200 GB+). We are LD_PRELOADing Google Perftool's tcmalloc (or libumem/mtmalloc on Solaris) to avoid memory allocation performance bottlenecks with generally good results. However, we are starting to see adverse effects of lock contention during memory allocation/deallocation on some bigger installations, especially after the process has been running for a while (which hints to aging/fragmentation effects of the allocator).
We are considering changing to a multi-process/shared memory architecture (the heavy allocation/deallocation will not happen in shared memory, rather on the regular heap).
So, finally, here's our question: can we assume that the virtual memory manager of modern Linux kernels is capable of efficiently handing out memory to hundreds of concurent processes? Or do we have to expect running into the same kind of problems with memory allocation contention that we see in our single-process/multi-threading environment? I tend to hope for a better overall system performance, as we would no longer be limited to a single address space, and that having several independent address spaces would require less locking on the part of the virtual memory manager. Anyone have any actual experience or performance data comparing multi-threaded vs. multi-process memory allocation?

I tend to hope for a better overall system performance, as we would no longer be limited to a single address space, and that having several independent address spaces would require less locking on the part of the virtual memory manager.
There is no reason to expect this. Unless your code is so badly designed that it constantly goes back to the OS to allocate memory, it won't make any significant difference. Your application should only need to go back to the OS's virtual memory manager when it needs more virtual memory, which should not occur significantly once the process reaches its stable size.
If you are constantly allocating and freeing all the way back to the OS, you should stop doing that. If you're not, then you can keep multiple pools of already-allocated memory that can be used by multiple threads without contention. And, as a benefit, your context switches will be cheaper because TLB's don't have to be flushed.
Only if you can't reduce the frequency of address space changes (for example, if you must map and unmap files) or if you have to change other shared resources (like file descriptors) should you look at multiprocess options.

Can caches keep data from multiple processes?

We know that cache uses virtual addresses. So, how does this work when multiple processes are involved, especially for the shared caches, such as shared L2 cache, or even for a local L1 cache, when processes are switched, as in simultaneous (hyper) multithreading, you could have threads from two different processes running on the same physical core. Is hyperthreading any good when threads from different processes are involved, or can it only boost performance when threads of the same process are involved?

None of the major x86 CPU microarchitectures use virtually-addressed caches. They all use virtually-indexed / physically-tagged (VIPT) L1 caches. VIPT is a performance hack that allows tags from a set to be fetched in parallel with a TLB lookup.
The bits of the address that are used as the index are the same in the physical and virtual addresses. (i.e. they're part of the offset within a 4k page, so they don't need to be translated by the TLB). This means that it effectively behaves exactly like a phys/phys (PIPT) cache, avoiding all problems of virtual addressing.
This is made possible by keeping the cache small and having enough ways. Intel's L1 caches are 32kiB, 8-way associativity, with 64B lines. This accounts for all the within-a-page address bits. (see other resources for diagrams and more detailed explanations.)
Hyperthreading works fine with separate processes, because x86 CPUs avoid cache aliasing (synonym / homonym problems). They work like physically-addressed caches. Two memory-intensive processes that don't share any memory might run slower with hyperthreading than without, though. Competitive sharing of the caches can be worse than just running one process after the other finishes, if that's an option.
For processes that bottleneck on something other than a resource that hyperthreading shares, HT certainly helps. e.g. with branch mispredicts. Also with cache misses due to unpredictable access to a big working set that would still miss often without hyperthreading.
CPUs that use virt/virt caches do need to invalidate them on context switches, or have extra tags to keep track of which PID they were for. This is like what caches currently do to support virtualization: they're tagged with VM IDs, so they know which VM's physical address it's for. virt/virt L1 means you don't need a fast TLB: it's only needed on L1 misses, so the L1 cache is also caching translations.
Some designs must use phys/phys L1, but I don't know any specific examples. The virt/phys trick is pretty common in high-performance CPUs, because an L1 with enough ways to make it possible is just a good idea anyway.
Note that only L1 ever uses virt addresses. Big L2 and L3 caches are always phys/phys.
Other links:
http://www.realworldtech.com/forum/?threadid=76592&curpostid=76600 This whole thread goes into a bunch of detail and questions about caches. David Kanter's posts tend to explain things in a readable way. I haven't read the whole thread. RWT forums are now searchable! so if you google for more details, you're likely to see more hits from years of forum threads there.
Paul Clayton explains why phys-idx/virt-tag (PIVT) is such a bad idea that nobody would ever build one: it has the disadvantages of virtually-addressed caches without any advantages. (Wikipedia says that MIPS r6000 is the only known implementation, and gives the extremely esoteric reason: Even a TLB would be too large to implement in Emitter-Coupled Logic, so they implement a TLB slice to just translate enough bits for a physical index. Given that limitation, PIPT and VIPT were not options, and they decided to go PIVT instead of VIVT.
Another very-detailed answer from Paul Clayton about caches.

Why are Sempaphores limited in Linux

We just ran out of semaphores on our Linux box, due to the use of too many Websphere Message Broker instances or somesuch.
A colleague and I got to wondering why this is even limited - it's just a bit of memory, right?
I thoroughly googled and found nothing.
Anyone know why this is?
cheers

Semaphores, when being used, require frequent access with very, very low overhead.
Having an expandable system where memory for each newly requested semaphore structure is allocated on the fly would introduce complexity that would slow down access to them because it would have to first look up where the particular semaphore in question at the moment is stored, then go fetch the memory where it is stored and check the value. It is easier and faster to keep them in one compact block of fixed memory that is readily at hand.
Having them dispersed throughout memory via dynamic allocation would also make it more difficult to efficiently use memory pages that are locked (that is, not subject to being swapped out when there are high demands on memory). The use of "locked in" memory pages for kernel data is especially important for time-sensitive and/or critical kernel functions.
Having the limit be a tunable parameter (see links in the comments of original question) allows it to be increased at runtime if needed via an "expensive" reallocation and relocation of the block. But typically this is done one time at system initialization before anything much is even using semaphores.
That said, the amount of memory used by a semaphore set is rather tiny. With modern memory available on systems being in the many gigabytes the original default limits on the number of them might seem a bit stingy. But keep in mind that on many systems semaphores are rarely used by user space processes and the linux kernel finds its way into lots of small embedded systems with rather limited memory, so setting the default limit arbitrarily high in case it might be used seems wasteful.
The few software packages, such as Oracle database for example, that do depend on having many semaphores available, typically do recommend in their installation and/or system tuning advice to increase the system limits.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string