I want to disable the low level cache on an ARMv8-A platform running Linux, in order to measure performance of optimized code, independent of cache access.
For Intel systems I found the following resource (Is there a way to disable CPU cache (L1/L2) on a Linux system?), but it cannot be applied directly because the instruction set is different.
So far I have a kernel module which alters the corresponding system register to disable instruction and data cache.
#include <linux/module.h>
int init_module(void)
{
int64_t value;
asm volatile("\
MRS %0, SCTLR_EL1 // Read SCTLR_EL1 into Xt\n\
BIC %0, %0, (1<<2) // clear bit 2, SCTLR_EL1.C\n\
BIC %0, %0, (1<<12) // clear bit 12, SCTLR_EL1.I\n\
MSR SCTLR_EL1, %0 // Write Xt to SCTLR_EL1\n\
" : "+r" (value));
return 0;
}
void cleanup_module(void)
{
int64_t value;
asm volatile("\
MRS %0, SCTLR_EL1 // Read SCTLR_EL1 into Xt\n\
ORR %0, %0, (1<<2) // set bit 2, SCTLR_EL1.C\n\
ORR %0, %0, (1<<12) // set bit 12, SCTLR_EL1.I\n\
MSR SCTLR_EL1, %0 // Write Xt to SCTLR_EL1\n\
": "+r" (value));
}
MODULE_LICENSE("GPL");
However, loading it results in a complete system freeze (as soon as the corresponding bits in the system register are changed). My guess is that I still need some kind of cache clear, but I didn't find anything useful in the ARM manuals.
Does anyone have helpful hints on how I could succeed in disabling the cache on ARM, or on what I am missing here? Thanks.
In general, this is unworkable, for several reasons.
Firstly, clearing the SCTLR.C bit only makes all data accesses non-cacheable and prevents allocating into any caches. Any data already in the caches is still there, especially dirty lines from anything recently written; consider what happens when your function returns and the caller tries to restore a stack frame which doesn't even exist in the memory it's now accessing.
Secondly, there are very few uniprocessor ARMv8 systems; assuming you're running SMP Linux, and suddenly disable the caches on just whichever CPU the module loader happened to be scheduled on, then even disregarding the first point things are going to go downhill very fast. Linux expects all CPUs to be coherent with each other, and will typically become very broken very rapidly if that assumption is violated. Note that it's not even worth venturing into SMP cross-calling for this; suffice to say the only safe way to even attempt to run Linux with caches disabled is to make sure they are never enabled to begin with, except...
Thirdly, there is no guarantee Linux will even run with caches disabled. On current hardware, all of the locking and atomic operations in the kernel (not to mention userspace) rely on the exclusive access instructions. Whilst the CPU cluster(s) will implement the architecturally-required local and global exclusive monitors for cacheable memory (usually as part of the cache machinery itself), it is dependent on the system whether a global exclusive monitor for non-cacheable accesses is implemented, as such a thing must be external to the CPU (usually in the interconnect or memory controller). Many systems don't implement such a global monitor, in which case exclusive accesses to external memory may fault, do nothing, or other various implementation-defined behaviours which will result in Linux crashing or deadlocking. It is effectively impossible to run Linux with the cache off on such a system - the amount of hacking just to get a UP arm64 kernel to work (SMP would be literally impossible) would be impractical alone; good luck with userspace.
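To make the third point concrete, here is a minimal sketch of the kind of primitive that depends on exclusive accesses. The names demo_spinlock, demo_lock, demo_unlock and locked_increments are hypothetical and are not taken from the kernel. On an ARMv8.0 core such as a Cortex-A53, a C11 test-and-set like the one below is typically compiled to a load-exclusive/store-exclusive retry loop, which is exactly what stops working when no suitable exclusive monitor covers the memory being accessed.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical minimal spinlock, for illustration only. */
typedef struct {
    atomic_flag locked;
} demo_spinlock;

static void demo_lock(demo_spinlock *l)
{
    /* Test-and-set is an atomic read-modify-write. On AArch64 without the
       LSE extension it is typically lowered to an LDAXR/STXR retry loop,
       i.e. to exclusive accesses. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        /* spin until the previous holder clears the flag */
    }
}

static void demo_unlock(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Perform n increments, each under the lock, and return the final count. */
long locked_increments(long n)
{
    demo_spinlock lock = { ATOMIC_FLAG_INIT };
    long counter = 0;

    assert(n >= 0);
    for (long i = 0; i < n; i++) {
        demo_lock(&lock);
        counter++;
        demo_unlock(&lock);
    }
    return counter;
}
```

With caches on, locked_increments(1000) simply returns 1000. The point of the sketch is only that every lock acquisition goes through the exclusive-access path described above.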
As it happens, though, the worst problem is none of that, it's this:
...in order to measure performance of optimized code, independent of cache access.
If the code is intended to run in deployment with caches disabled, then logically it can't be intended to run under Linux, therefore the effort spent in hacking up Linux would be better spent on benchmarking in a more realistic execution environment so that results are actually representative. On the other hand, if it is intended to run with caches enabled (under Linux or any other OS), then benchmarking with caches disabled will give meaningless results and be a waste of time. "Optimising" for e.g. an instruction-fetch-bandwidth bottleneck which won't exist in practice is not going to lead you in the right direction.
I have done this on ARMv8-A Linux. I did it not to measure performance but to check whether the Xilinx ZCU104 platform might have a coherence error. My conclusion is that the PYNQ image Xilinx provides must have some coherence error in the communication between PL and PS.
Here is my workaround:
My platform is a Cortex-A53 running Ubuntu 18, which starts at EL2, switches to EL1, and supports SMP on four CPU cores. I therefore need to turn off the other cores so that coherency between cores is no longer a concern. Thanks to CPU hot-plug, I just run:
echo '0' > /sys/devices/system/cpu/cpu1/online
echo '0' > /sys/devices/system/cpu/cpu2/online
echo '0' > /sys/devices/system/cpu/cpu3/online
Then I run dmesg to verify that the other cores have been turned off.
I built the kernel source tree, because I could not find it on my system. You can run uname -r to see your kernel version, and look in /usr/src to see whether your system already has it.
I built the kernel module. Using GCC inline asm, it flushes all caches and sets SCTLR_EL1.C to 0.
I insmod the module. Then I ran my program and got the right result, although it ran 20 times slower than with multi-core and the D-cache on.
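As an alternative to reading dmesg, the hot-plug step can also be checked from a program. The sketch below is an assumption-laden helper, not part of the workaround above: it reads /sys/devices/system/cpu/online, which the kernel exposes as a "cpulist" string such as "0" or "0-3", and counts the CPUs in it. The function names count_cpus_in_list and count_online_cpus are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Count the CPUs named in a kernel "cpulist" string such as "0", "0-3"
   or "0,2-3". Returns -1 on malformed input. */
int count_cpus_in_list(const char *list)
{
    const char *p = list;
    int count = 0;

    assert(list != NULL);
    while (*p != '\0' && *p != '\n') {
        char *end;
        long first = strtol(p, &end, 10);
        long last = first;

        if (end == p)
            return -1;              /* no digits where a CPU number was expected */
        p = end;
        if (*p == '-') {            /* a range "first-last" */
            p++;
            last = strtol(p, &end, 10);
            if (end == p || last < first)
                return -1;
            p = end;
        }
        count += (int)(last - first + 1);
        if (*p == ',')
            p++;                    /* move on to the next list entry */
    }
    return count;
}

/* Read the list of online CPUs from sysfs; returns -1 if it cannot be read. */
int count_online_cpus(void)
{
    char buf[256];
    FILE *f = fopen("/sys/devices/system/cpu/online", "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return count_cpus_in_list(buf);
}
```

After the three echo commands, count_online_cpus() would be expected to return 1 on the four-core system described here.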
I like examples, so I wrote a bit of self-modifying code in c...
#include <stdio.h>
#include <sys/mman.h> // linux
int main(void) {
unsigned char *c = mmap(NULL, 7, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|
MAP_ANONYMOUS, -1, 0); // get executable memory
c[0] = 0b11000111; // mov (x86_64), immediate mode, full-sized (32 bits)
c[1] = 0b11000000; // to register rax (000) which holds the return value
// according to linux x86_64 calling convention
c[6] = 0b11000011; // return
for (c[2] = 0; c[2] < 30; c[2]++) { // incr immediate data after every run
// rest of immediate data (c[3:6]) are already set to 0 by MAP_ANONYMOUS
printf("%d ", ((int (*)(void)) c)()); // cast c to func ptr, call ptr
}
putchar('\n');
return 0;
}
...which works, apparently:
>>> gcc -Wall -Wextra -std=c11 -D_GNU_SOURCE -o test test.c; ./test
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
But honestly, I didn't expect it to work at all. I expected the instruction containing c[2] = 0 to be cached upon the first call to c, after which all consecutive calls to c would ignore the repeated changes made to c (unless I somehow explicitly invalidated the cache). Luckily, my cpu appears to be smarter than that.
I guess the cpu compares RAM (assuming c even resides in RAM) with the instruction cache whenever the instruction pointer makes a large-ish jump (as with the call to the mmapped memory above), and invalidates the cache when it doesn't match (all of it?), but I'm hoping to get more precise information on that. In particular, I'd like to know if this behavior can be considered predictable (barring any differences of hardware and os), and relied on?
(I probably should refer to the Intel manual, but that thing is thousands of pages long and I tend to get lost in it...)
What you are doing is usually referred to as self-modifying code. Intel's platforms (and probably AMD's too) do the job of maintaining i/d cache coherency for you, as the manual points out (Manual 3A, System Programming):
11.6 SELF-MODIFYING CODE
A write to a memory location in a code segment that is currently cached in the
processor causes the associated cache line (or lines) to be invalidated.
But this assertion is valid as long as the same linear address is used for modifying and fetching, which is not the case for debuggers and binary loaders since they don't run in the same address-space:
Applications that include self-modifying code use the same
linear address for modifying and fetching the instruction. Systems software, such as
a debugger, that might possibly modify an instruction using a different linear address
than that used to fetch the instruction, will execute a serializing operation, such as a
CPUID instruction, before the modified instruction is executed, which will automatically
resynchronize the instruction cache and prefetch queue.
By contrast, many other architectures, such as PowerPC, always require such a synchronization step, and it must be done explicitly (E500 Core Manual):
3.3.1.2.1 Self-Modifying Code
When a processor modifies any memory location that can contain an instruction, software must
ensure that the instruction cache is made consistent with data memory and that the modifications
are made visible to the instruction fetching mechanism. This must be done even if the cache is
disabled or if the page is marked caching-inhibited.
It is interesting to notice that PowerPC requires the issue of a context-synchronizing instruction even when caches are disabled; I suspect it enforces a flush of deeper data processing units such as the load/store buffers.
The code you proposed is unreliable on architectures without snooping or advanced cache-coherency facilities, and therefore likely to fail.
Hope this helps.
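For code that has to work on architectures without coherent instruction caches, a common approach is to ask the compiler to emit whatever synchronization the target needs. The sketch below assumes GCC or Clang, whose __builtin___clear_cache builtin does this (it may compile to nothing on x86). It patches a tiny "return imm" routine repeatedly, as the question does, but synchronizes after every patch. The helper names emit_return_imm and sum_of_patched_returns are hypothetical, the x86 encoding used is the shorter B8 form of mov rather than the C7 form in the question, and only x86 and little-endian AArch64 encodings are provided.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                 /* MAP_ANONYMOUS under strict -std=c11 */
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Encode "return imm" for the host CPU into buf.
   Returns the code length in bytes, or 0 if the host is not supported. */
static size_t emit_return_imm(unsigned char *buf, uint16_t imm)
{
#if defined(__x86_64__) || defined(__i386__)
    buf[0] = 0xB8;                              /* mov eax, imm32 */
    buf[1] = (unsigned char)(imm & 0xFF);
    buf[2] = (unsigned char)(imm >> 8);
    buf[3] = 0;
    buf[4] = 0;
    buf[5] = 0xC3;                              /* ret */
    return 6;
#elif defined(__aarch64__) && defined(__AARCH64EL__)
    uint32_t insn[2];
    insn[0] = 0x52800000u | ((uint32_t)imm << 5);   /* movz w0, #imm */
    insn[1] = 0xD65F03C0u;                          /* ret */
    memcpy(buf, insn, sizeof insn);
    return sizeof insn;
#else
    (void)buf;
    (void)imm;
    return 0;
#endif
}

/* Patch the routine to return 0, 1, ..., n-1 in turn, call it after each
   patch, and return the sum of the results (-1 if unsupported or if the
   executable mapping cannot be created). */
int sum_of_patched_returns(int n)
{
    unsigned char *mem;
    int sum = 0;

    assert(n >= 0 && n <= 0xFFFF);
    mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    for (int i = 0; i < n; i++) {
        size_t len = emit_return_imm(mem, (uint16_t)i);
        if (len == 0) {
            munmap(mem, 4096);
            return -1;
        }
        /* Make the new bytes visible to instruction fetch before calling. */
        __builtin___clear_cache((char *)mem, (char *)mem + len);
        sum += ((int (*)(void))mem)();
    }
    munmap(mem, 4096);
    return sum;
}
```

Calling sum_of_patched_returns(30) reproduces the 0 to 29 sequence from the question and adds the values up, giving 435 on a supported host.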
It's pretty simple; the write to an address that's in one of the cache lines in the instruction cache invalidates it from the instruction cache. No "synchronization" is involved.
The CPU handles cache invalidation automatically, you don't have to do anything manually. Software can't reasonably predict what will or will not be in CPU cache at any point in time, so it's up to the hardware to take care of this. When the CPU saw that you modified data, it updated its various caches accordingly.
By the way, many x86 processors (that I worked on) snoop not only the instruction cache but also the pipeline, instruction window - the instructions that are currently in flight. So self modifying code will take effect the very next instruction. But, you are encouraged to use a serializing instruction like CPUID to ensure that your newly written code will be executed.
I just reached this page in one of my searches and want to share my knowledge of this area of the Linux kernel!
Your code executes as expected and there are no surprises for me here. The mmap() syscall and the processor's cache coherency protocol do this trick for you. The flags "PROT_READ|PROT_WRITE|PROT_EXEC" ask mmap() to set up the iTLB and dTLB of the L1 cache and the TLB of the L2 cache for this physical page correctly. This low-level, architecture-specific kernel code does this differently depending on the processor architecture (x86, AMD, ARM, SPARC, etc.). Any kernel bug here will mess up your program!
This is just for explanation purpose.
Assume that your system is not doing much and there are no process switches between "a[0]=0b01000000;" and the start of "printf("\n");"...
Also, assume that you have 1K of L1 iCache and 1K of dCache in your processor, and some L2 cache in the core. (Nowadays these are on the order of a few MBs.)
mmap() sets up your virtual address space and iTLB1, dTLB1 and TLB2s.
"a[0]=0b01000000;" will actually trap (H/W magic) into kernel code, where your physical address will be set up and all processor TLBs will be loaded by the kernel. Then you will be back in user mode and your processor will actually load 16 bytes (H/W magic, a[0] to a[3]) into the L1 dCache and the L2 cache. The processor will really go to memory again only when you refer to a[4], and so on (ignore predictive loading for now!). By the time you complete "a[7]=0b11000011;", your processor has done 2 burst READs of 16 bytes each on the external bus. There are still no actual WRITEs to physical memory. All WRITEs so far are happening within the L1 dCache (H/W magic, the processor knows) and the L2 cache, and the DIRTY bit is set for the cache line.
"a[3]++;" will have STORE Instruction in the Assembly code, but the Processor will store that only in L1 dCache&L2 and it will not go to Physical Memory.
Let's come to the function call "a()". Again, the processor does the instruction fetch from the L2 cache into the L1 iCache, and so on.
Result of this user mode program will be the same on any Linux under any processor, due to correct implementation of low level mmap() syscall and Cache coherency protocol!
If you are writing this code in an embedded processor environment without the OS assistance of the mmap() syscall, you will find the problem you are expecting. This is because you are not using either the H/W mechanism (TLBs) or the software mechanism (memory barrier instructions).
My goal is to read in stale and outdated values of memory without cache-coherence. I have attempted to use prefetchnta to perform a non-temporal load, but it failed to fetch outdated values. I am looking into performing some kind of Streaming Memory-to-Memory Direct-Memory-Access, but am having a little trouble due to the overwhelming amount of background knowledge required to proceed with my current project. Currently I am attempting to mess around with udmabuf but even that is going slowly. It should be noted that ideally I would like to ignore the contents of all CPU caches, including the current CPU.
To provide my reasoning as to why: I am developing software that can be used to prove correctness of programs written for non-volatile memory. As the CPU Cache is volatile, the CPU's write-back cache will still be volatile and the arbitrary nature of how they are written back to memory needs to be observed.
I would sincerely appreciate it if someone could give me some pointers of how to proceed. I do not mind digging into the Linux kernel, as in fact I am doing that now, nor do I mind modifying it, I just need a little guidance in the right direction.
I haven't played around with this, but my understanding from the docs is that for loads (unlike NT stores) nothing can bypass cache or override the strong ordering of memory types like the normal WB (write-back). And even NT stores evict already-cached data, so they can't break coherence for this or another core that has cached data for the line you're writing.
You can do weakly-ordered loads from WC (write-combining) memory regions (with prefetchnta or SSE4 movntdqa), but they're probably still coherent at the physical address level.
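As a small illustration of the point about NT stores, the sketch below first brings a line into cache with a normal store, overwrites it with a non-temporal store, and then reads it back with a normal load. The function name nt_store_then_load is hypothetical, and the test only covers the single-core case, where program order already guarantees the result. It shows that the cached copy does not survive as a stale value, but it is not a proof of cross-core behaviour. On non-x86-64 hosts it falls back to an ordinary store.

```c
#include <assert.h>
#if defined(__x86_64__)
#include <emmintrin.h>              /* _mm_stream_si32, _mm_sfence (SSE2) */
#endif

/* Cache a line with a normal store, overwrite it with a non-temporal
   store, and return what a normal load then observes. */
int nt_store_then_load(int old_value, int new_value)
{
    static _Alignas(64) int slot;   /* one int on its own cache line */
    volatile int *vslot = &slot;

    assert(old_value != new_value);
    *vslot = old_value;             /* normal store: the line is now cached */
#if defined(__x86_64__)
    _mm_stream_si32(&slot, new_value);  /* MOVNTI: non-temporal store */
    _mm_sfence();                       /* drain the write-combining buffer */
#else
    *vslot = new_value;             /* fallback: ordinary store */
#endif
    return *vslot;                  /* normal load of the same address */
}
```

The load returns new_value, never the previously cached old_value, which is consistent with the NT store evicting or updating the cached copy rather than leaving it behind.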
@MargaretBloom commented:
IIRC Intel warns the developer about multiple mapping with different cache types, which may indeed be good in this case.
so maybe you could actually bypass cache coherence with multiple virtual mappings of the same physical page.
I don't know if it's possible to do non-coherent DMA with a PCI / PCIe device, but that might be your only hope for getting actual DRAM contents without going through cache.
Normally (always?) DMA on modern x86 systems is cache-coherent, which is good for performance. To maintain backwards compat with 386 and earlier CPUs without caches, the first x86 CPUs with caches had cache-coherent DMA, not introducing cache-control instructions until later generations, since existing OSes didn't use them. In modern systems, memory controllers are built-in to the CPU. So on Intel CPUs, the system agent can snoop L3 tags to see if a line is cached anywhere on-chip in parallel with sending the request to the memory controller. Or a Xeon can DMA right into L3 cache without data having to bounce through DRAM, good for high bandwidth NICs.
There's an INVD instruction which invalidates all caches without doing write-back first, but I think that includes the shared L3 cache, and probably the private caches of all other cores. So you can't practically use it on a Linux system where other cores are potentially in the middle of doing stuff; you'd potentially corrupt kernel data structures by using it, as well as simulating power failure on a machine with NVDIMMs for the process you were interested in.
Maybe if you somehow offlined all the other CPU cores, and disabled interrupts on the one core that was still up
you could wbinvd (write-back+invalidate) to flush all caches
then run some code under test
then invd and see what made it to DRAM
Then re-enable interrupts. Interrupt handlers could end up with some kernel data cached and some in memory, or get device drivers out of sync with hardware, if any interrupts are handled between the wbinvd and the invd.
Update: someone did actually attempt this:
How to run "invd" instruction with disabled SMP support?
How to explicitly load a structure into L1d cache? Weird results with INVD with CR0.CD = 1 on isolated core with/without hyperthreading - invd worked so well it nuked some of the stores done by printk in the mis-designed attempt to log something about it.
I want to emulate a system that prohibits unaligned memory accesses on x86/x86_64.
Is there some debugging tool or special mode to do this?
I want to run many (CPU-intensive) tests on several x86/x86_64 PCs while working with software (C/C++) designed for SPARC or some other similar CPU, but my access to SPARC machines is limited.
As far as I know, SPARC always checks that memory reads and writes are naturally aligned (a byte can be read from any address, but reading a 4-byte word is only allowed when the address is divisible by 4).
Maybe Valgrind or PIN has such a mode? Or is there a special compiler mode?
I'm looking for a non-commercial Linux tool, but Windows tools are acceptable too.
Or maybe there is a secret CPU flag in EFLAGS?
I've just read question Does unaligned memory access always cause bus errors? which linked to Wikipedia article Segmentation Fault.
In the article, there's a wonderful reminder of the rather uncommon Intel processor flag AC, aka Alignment Check.
And here's how to enable it (from Wikipedia's Bus Error example, with a red-zone clobber bug fixed for x86-64 System V so this is safe on Linux and MacOS, and converted from Basic asm, which is never a good idea inside functions: you want changes to AC to be ordered wrt. memory accesses):
#if defined(__GNUC__)
# if defined(__i386__)
/* Enable Alignment Checking on x86 */
__asm__("pushf\n orl $0x40000,(%%esp)\n popf" ::: "memory");
# elif defined(__x86_64__)
/* Enable Alignment Checking on x86_64 */
__asm__("add $-128, %%rsp \n" // skip past the red-zone, in case there is one and the compiler has local vars there.
"pushf\n"
"orl $0x40000,(%%rsp)\n"
"popf \n"
"sub $-128, %%rsp" // and restore the stack pointer.
::: "memory"); // ordered wrt. other mem access
# endif
#endif
Once enabled, it works a lot like the ARM alignment settings in /proc/cpu/alignment; see the answer to How to trap unaligned memory access? for examples.
Additionally, if you're using GCC, I suggest you enable -Wcast-align warnings. When building for a target with strict alignment requirements (ARM for example), GCC will report locations that might lead to unaligned memory access.
But note that libc's handwritten asm for memcpy and other functions will still make unaligned accesses, so setting AC is often not practical on x86 (including x86-64). GCC will sometimes emit asm that makes unaligned accesses even if your source doesn't, e.g. as an optimization to copy or zero two adjacent array elements or struct members at once.
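Because library code may misbehave while AC is set, it can help to confirm the flag's state around the region under test. The following is a minimal sketch, assuming GCC or Clang on x86-64. The names read_flags and alignment_check_enabled are hypothetical. It only reads the flags register and changes nothing; on other targets it reports the flag as clear.

```c
#include <assert.h>
#include <stdint.h>

#define EFLAGS_AC (UINT64_C(1) << 18)   /* same bit as the 0x40000 mask above */

/* Read RFLAGS on x86-64; returns 0 on other targets. */
static inline uint64_t read_flags(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t flags;
    __asm__ volatile(
        "lea -128(%%rsp), %%rsp\n\t"    /* step over the red zone (lea leaves flags untouched) */
        "pushfq\n\t"                    /* push RFLAGS */
        "pop %0\n\t"                    /* pop it into the output register */
        "lea 128(%%rsp), %%rsp"         /* restore the stack pointer */
        : "=r"(flags)
        :
        : "memory");
    return flags;
#else
    return 0;
#endif
}

/* Nonzero if the Alignment Check flag is currently set. */
int alignment_check_enabled(void)
{
    return (read_flags() & EFLAGS_AC) != 0;
}
```

In a process that has not touched the flag, alignment_check_enabled() is expected to return 0; after the enabling snippet above has run, it should return 1.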
It's tricky and I haven't done it personally, but I think you can do it in the following way:
x86_64 CPUs (specifically I've checked the Intel Core i7, but I guess others as well) have a performance counter, MISALIGN_MEM_REF, which counts misaligned memory references.
So first of all, you can run your program and use the "perf" tool under Linux to get a count of the number of misaligned accesses your code has made.
A more tricky and interesting hack would be to write a kernel module that programs the performance counter to generate an interrupt on overflow and gets it to overflow on the first unaligned load/store. Respond to this interrupt in your kernel module by sending a signal to your process.
This will, in effect, turn the x86_64 into a core that doesn't support unaligned access.
This won't be simple though: besides your code, the system libraries also use unaligned accesses, so it will be tricky to separate them from your own code.
Both GCC and Clang have UndefinedBehaviorSanitizer built in. One of those checks, alignment, can be enabled with -fsanitize=alignment. It'll emit code to check pointer alignment at runtime and abort if unaligned pointers are dereferenced.
See online documentation at:
https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
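Once the sanitizer has pointed at an offending access, the usual portable fix is to go through memcpy instead of dereferencing a cast pointer. The sketch below illustrates that pattern. The helper names (is_aligned, load_u32, store_u32, roundtrip_at_offset) are hypothetical and are not part of any library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if p is a multiple of alignment, which must be a power of two. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Read a 32-bit value from any address. memcpy has no alignment
   requirement, so this is well defined on strict-alignment targets too. */
uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write a 32-bit value to any address. */
void store_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}

/* Store v at a byte offset into a 4-byte-aligned buffer and read it back.
   With offset 1, 2 or 3 the access is misaligned for a uint32_t. */
uint32_t roundtrip_at_offset(size_t offset, uint32_t v)
{
    _Alignas(4) unsigned char buf[16] = {0};

    assert(offset + sizeof v <= sizeof buf);
    store_u32(buf + offset, v);
    return load_u32(buf + offset);
}
```

By contrast, writing *(uint32_t *)(buf + 1) is the kind of access that -fsanitize=alignment reports and that a SPARC-like target would trap on.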
Perhaps you could somehow compile to SSE, with all aligned moves. Unaligned accesses with movaps are illegal and would probably behave like illegal unaligned accesses on other architectures.
I am studying how the CPU changes from user mode to kernel mode in Linux. I came across two different methods: interrupts and sysenter.
I could not understand how sysenter works. Could someone please explain what exactly happens in the cpu when the sysenter instruction is run?
The problem that a program faces when it wants to get into the kernel (aka "making syscalls") is that user programs cannot access anything kernel-related, yet the program has to somehow switch the CPU into "kernel mode".
On an interrupt, this is done by the hardware.
It also happens automatically when a (CPU-, not C++) exception occurs, like accessing memory that doesn't exist, a division by zero, or invoking a privileged instruction in user code. Or trying to execute an unimplemented instruction. This last thing is actually a decent way to implement a "call the kernel" interface: the CPU runs into an instruction that it doesn't know, so it raises an exception which drops the CPU into kernel mode and into the kernel. The kernel code could then check whether the "correct" unimplemented instruction was used and perform the syscall stuff if it was, or just kill the process if it was any other unimplemented instruction.
Of course, doing something like this isn't, well, "clean". It's more like a dirty hack, abusing what should be an error to implement a perfectly valid control flow change. Hence, CPUs do tend to have actual instructions to do essentially the same thing, just in a more "defined" way. The main purpose of anything like a "sysenter" instruction is still the same: it changes the CPU into "kernel mode", saves the position where the "sysenter" was called, and continues execution somewhere in the kernel.
As for the difference between a "software interrupt" and "sysenter": "sysenter" is specifically optimized for this kind of use case. For example, it doesn't get the kernel address to call from memory like a (software-)interrupt does, but instead uses a special register to get the address from, which saves the memory address lookup. It might also have additional optimizations internally, based on the fact that software-interrupts might be handled more like interrupts, and the sysenter instruction doesn't actually need that. I don't know the precise details of the implementations of these instructions on the CPUs, you would probably have to read the Intel manuals to really get into such details.
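From user space, the kernel-entry instruction is normally hidden behind libc. The sketch below, which assumes Linux with glibc, uses the generic syscall() wrapper to issue getpid by number and compares nothing but the result. Which instruction libc actually executes underneath (int 0x80, sysenter, or syscall on x86-64) depends on the architecture and the libc build, so this only shows the interface, not the instruction itself. The name raw_getpid is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                 /* for the syscall() declaration */
#endif
#include <assert.h>
#include <sys/syscall.h>            /* SYS_getpid */
#include <sys/types.h>
#include <unistd.h>                 /* syscall(), getpid() */

/* Ask the kernel for our PID by system call number instead of through the
   dedicated getpid() wrapper. */
pid_t raw_getpid(void)
{
    long ret = syscall(SYS_getpid);

    assert(ret > 0);                /* getpid cannot fail */
    return (pid_t)ret;
}
```

raw_getpid() returns the same value as getpid(): both end up switching the CPU into kernel mode, running the kernel's handler, and returning to the instruction after the entry point.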
I'm trying to implement a kernel function intercept (replacing a System.map's pointer to function, using this old method (c) Silvio); according to this forum post, some possible flaws may be related to instruction cache and mp-IRQ sources, which aren't flushed/updated after the interception.
What are these subsystems and how to deal with them in this case?
The instruction cache isn't a Linux kernel subsystem - it's part of the CPU.
Fetching code from main memory takes a lot of time, so CPUs use cache memory to cache code sections. This is the instruction cache that holds copies of instructions (code) that the CPU has a reason to believe will be needed soon.
If you change the instructions (code) in memory, as the example referred to does, but do not flush the instruction cache, your changed code might mysteriously fail to run until some random point in time where the instruction cache entry holding the instruction you replaced gets cleared.
mp-IRQ is short for Multiple Processor Interrupts. The problem in this context is that on an SMP (multiple CPU or multi-core) system, the code that plants your trace point might be running on one CPU while another CPU is executing the code being patched. To handle that safely you need to do the very complex task of syncing all the CPUs to make sure the code you are trying to patch is not being used on some other CPU by an interrupt.