I'm looking for a Linux utility that allows profiling the cache eviction in my program.
Specifically, I'm interested in finding what causes certain cache line(s) to be repeatedly evicted from L2 cache.
Any suggestions?
You have several options at your disposal, some of which are free. Below I'll mostly talk about profiling L2 misses, not necessarily L2 evictions since those are more or less the same thing. Lines get evicted from the L2 because another line is being brought in, and another line is being brought in usually due to an L2 miss1.
Cachegrind
First, I'd try out cachegrind. This basically runs your binary under a type of lightweight virtual machine which allows it to intercept all memory accesses and subsequently model their effect on the cache. It can pinpoint exactly where cache misses occur, who is responsible for eviction and so on.
It is important to note that cachegrind doesn't actually tell you what's going on with the hardware caches but rather what happens in its cache model. Since the L1 and L2 are simple enough on Intel x86, the cachegrind model should be accurate, except in unusual cases.
Cachegrind can only simulate two cache levels, but modern Intel has 3 or sometimes 4. That shouldn't be a problem if you are trying to evaluate L2 misses though. By default cachegrind sets the L1 cache to the detected values of the local L1 cache, and it's LLC to the detected values of the LLC. In your case, you'll want to override that latter decision to reflect the L2 cache, not the LLC. You can find the details in the manual, but this should be correct for recent Intel Broadwell and earlier:
--LL=262144,8,64
For Skylake client/Kaby Lake and friends you'd want:
--LL=262144,4,64
For Skylake-X server you'll want to look up the new values because the L2 changed.
The primary downside of this approach is that you can't be 100% sure that the cache model is an accurate reflection of reality (e.g., it doesn't model things like prefetching or virtual-physical paging). Another downside is that running a process under cachegrind is probably an order of magnitude slower than running it native, but for an investigation outside of "production" this probably isn't an issue.
Perf
You can use the default, included and free profiling tool to learn exactly what's actually going on with your actual hardware: perf.
In particular, you can use perf record combined with perf report or perf annotate to determine where in your program misses are occurring. You can start with something like this:
perf record -e mem_load_retired.l2_miss <your process>
This periodically records where an L2 misses appears. You can display the result with perf report which lets you explore the results interactively. There are lots of other options such as --call-graph to record the full call graph which may be useful.
The perf record approach always where you where in your code something is happening, but it doesn't help you determine what memory was being accessed when the misses occurred. That often doesn't matter: the location in the code often makes it very obvious what memory is being accessed. Sometimes, however, that's not the case: you have some code that might access a large region of memory and you want to know the address to figure how why misses are occurring.
In that case you can use perf mem, which records both the location, in code, of the miss and the address of the miss. This tool isn't as polished as the others, but the source is at least available so you could always make some improvements. I cover this option in some detail in another answer.
The primary disadvantage of perf is that it is less straightforward to use than something like cachegrind. The behavior and available events depends on your hardware and kernel version, and sometimes things like stack traces don't work, etc. You have to be relatively comfortable with the command line to make good use of this tool.
VTune
This tool uses the same underlying performance counters as perf, but uses a GUI based exploration and is perhaps easier to jump into than perf. It takes more of a top down approach: telling you where the problems are and allowing you to drill down, whereas perf is more about "here's the raw data, figure out what's wrong".
It provides specific analyses like the Memory Access Analysis which might be appropriate for your problem.
The main downside is that this is a paid product, unless you qualify to use it for free. It may be somewhat easier to use than perf but it's still not exactly easy and there is a lot of magic that goes on so if something goes wrong it may be hard to debug.
1 In some scenarios this might not be true. The main one I can think of is if prefetching into L2 causes most lines to arrive before they are missing. In that case, the number of L2 replacements might be might higher than the number of L2 misses. This is the kind of thing that cachegrind won't be able to help you with, but perf can: you can compare the number of L2 lines in/replaced to the number of L2 misses and see if they are close. If they aren't, you'll have to play around with other counters to see if prefetching is the cause.
Related
Accroding to this paper: https://doi.org/10.1109/SP.2013.13, Memory corruption bugs are one of the oldest problems in computer security. The lack of memory safety and type safety has caused countless bugs, causing billions of dollars and huge efforts to fix them.
But the root of C/C++'s memory vulnerability can trace down to the ISA level. At ISA level, every instruction can access any memory address without any fine grained safe check (only corase grained check like page fault). Sure, we can implement memory safe at a higher software level, like Java (JVM), but this leads to significant cost of performance. In a word, we can't have both safety and performance at the same time on existing CPUs.
My question is, why can't we implement the safety at the hardware level? If the CPU has a safe ISA, which ensures the memory safe by, I don't know, taking the responsbilities of malloc and free, then maybe we can get rid of the performance decline of software safe checking. If anyone professional in microelectronics can tell me, is this idea realistic?
Depending on what you mean, it could make it impossible implement memory-unsafe languages like C in a normal way. e.g. every memory access would have to be to some object that has a known size? I'd guess an operating system for such a machine might have to work around that "feature" by telling it that the entire address space was one large array object. Or else you'd need some mechanism for a read system call to know the proper bounds of the object it's writing in the copy_to_user() part of its job. And then there's other OS stuff like accessing the same physical page from different virtual pages.
The OP (via asking on Reddit) found the CHERI project which is an attempt at this idea, involving "... revisit fundamental design choices in hardware and software to dramatically improve system security." Changing hardware alone can't work; compilers need to change, too. But they were able to adapt "Clang/LLVM, FreeBSD, FreeRTOS, and applications such as WebKit," so their approach could be practical. (Unlike the hypothetical versions I was imagining when writing other parts of this answer.)
CHERI uses "fine-grained memory protection", and "Language and compiler extensions" to implement memory-safe C and C++, and higher-level languages.
So it's not a drop-in replacement, and it sounds like you have to actively use the features to gain safety. As I argue in the rest of the answer, hardware can't do it alone, and it's highly non-trivial even with software cooperation. It's easy to come up with ways that wouldn't work. :P
For hardware-enforced memory-safety to be possible, hardware would have to know about every object and its size, and be able to cache that structure in a way that allows efficient lookups to find the bounds. Page tables (4k granularity, or larger in more modern ISAs) are already hard enough for hardware for hardware to cache efficiently for large programs, and that's without even considering which pointer goes with which object.
Checking a TLBs as part of every load and store can be done efficiently, but checking another structure in parallel with that might be problematic. Especially when the ranges don't have power-of-2 sizes and natural alignment, the way pages do, which makes it possible to build a TLB from content-addressable memory that checks for a match against each of several possible values for the high bits. (e.g. a page is 4k in size, always starting at a 4k alignment boundary.)
You mean it may cost too much at hardware level, like the die area?
Die area might not even be the biggest problem, especially these days. It would cost power, and/or cost latency in very important critical paths such as L1d hit load-use latency. Even if you could come up with some plausible way for software to make tables that hardware could check, or otherwise solve the other parts of this problem.
Modifying a page-table entry requires invalidating the entry, including TLB shootdown for other cores. If every free (and some malloc) cost inter-core communication to do similar things for object tables, that would be very expensive.
I think inventing a way for software to tell the hardware about objects would be an even bigger problem. malloc and free aren't something you can just build in to a CPU where memory addressing works anything like existing CPUs, or like it does in C. Software needs to manage memory, it doesn't make sense to try to build that in to a CPU. So then malloc and free (and mmap with file-backed mappings and shared memory...) need a way to tell the CPU about objects. Seems like a mess.
I think at best an ISA could provide more tools software can use to make bounds-checks cheaper. Perhaps some kind of extra semantics on loads/stores, like an extra operand for indexed addressing modes for load or store that takes a max?
At least if we want an ISA to work anything like current ones, rather than work like a JVM or a Transmeta Crusoe and internally recompile for some real ISA.
Intel's MPX ISA extension to x86 was an attempt to let software set up bound ranges, but it's been mostly abandoned due to lower performance than pure software. Intel even dropped it from their recent CPUs (Not present in 10th Gen CPUs using 10nm lithography, or later.)
This is all just off the top of my head; I haven't searched for any serious proposals for how a system could plausibly work.
I don't think memory safety is something you can easily add after the fact to languages like C that weren't originally designed with it.
Have a look to "Code for malloc and free" at SO. Those commands are very, very far away from even being defined within an instruction set.
I have an array of data that looks like this :
uint32_t data[128]; //Could be more than L1D Cache size
In order to do computation on it, I want to put the data as close as possible to my computing unit so in the L2 Cache.
My target runs with a linux kernel and some additionnal apps
I know that I can get an access to a certain area of the memory with mmap and I have succesfully done it in some part of my available memory shared between cores.
How to do the same thing but in L2 Cache area ?
I've read part of gcc documentation and AArch64 assembly instruction set but cannot figure out the way to achieve this.
How to do the same thing but in L2 Cache area ?
Your hardware doesn't support that.
In general, the ARMv8 architecture doesn't make any guarantees about the contents of caches and does not provide any means to explicitly manipulate or query them - it only makes guarantees and provides tools for dealing with coherency.
Specifically, from section D4.4.1 "General behavior of the caches" of the spec:
[...] the architecture cannot guarantee whether:
• A memory location present in the cache remains in the cache.
• A memory location not present in the cache is brought into the cache.
Instead, the following principles apply to the behavior of caches:
• The architecture has a concept of an entry locked down in the cache.
How lockdown is achieved is IMPLEMENTATION DEFINED, and lockdown might
not be supported by:
— A particular implementation.
— Some memory attributes.
• An unlocked entry in a cache might not remain in that cache. The
architecture does not guarantee that an unlocked cache entry remains in
the cache or remains incoherent with the rest of memory. Software must
not assume that an unlocked item that remains in the cache remains dirty.
• A locked entry in a cache is guaranteed to remain in that cache. The
architecture does not guarantee that a locked cache entry remains
incoherent with the rest of memory, that is, it might not remain dirty.
[...]
• Any memory location is not guaranteed to remain incoherent with the rest of memory.
So basically you want cache lockdown. Consulting the manual of your CPU though:
• The Cortex-A72 processor does not support TLB or cache lockdown.
So you can't put something in cache on purpose. Now, you might be able to tell whether something has been cached by trying to observe side effects. The two common side effects of caches are latency and coherency. So you could try and time access times or modify the contents of DRAM and check whether you see that change in your cached mapping... but that's still a terrible idea.
For one, both of these are destructive operations, meaning they will change the property you're measuring, by measuring it. And for another, just because you observe them once does not mean you can rely on that happening.
Bottom line: you cannot guarantee that something is held in any particular cache by the time you use it.
Cache - is not a place where data should be stored, it's just... cache? :)
I mean, your processor decide which data it should cache and where (L1/L2/L3) and logic depends on CPU implementation.
If you wanted to, you could try to find the algorithm of placing and replacing data in cache and play with this (without guaranties, of course) by using dedicated instructions to prefetch your data and then maintain the cache with non-caching instructions for your other program.
Maybe for modern ARM there are easier ways, I spoke from x86/x64 perspective, but my whole point is "are you really sure that you need this"?
CPU's smart enough to cache the data which they need and they do it better and better year by year.
I'd recommend you to use any profiler that can show you cache misses to be sure than your data is not presented in the cache already.
If it don't, the first thing to optimize - is an algorithm. Try to figure out why there was a cache miss - maybe you should load less data in loop by using temp variables, for example, or even unroll the loop manually to control where and what being accessed.
I have been trying to profile some code that I wrote as a small memory test on my machine and by using perf I noticed:
Performance counter stats for './MemBenchmark':
15,980 LLC-loads
8,714 LLC-load-misses # 54.53% of all LL-cache hits
10.002878281 seconds time elapsed
The whole idea of the benchmark is 'stress' the memory so in my books the higher I can make the miss rate the better I think.
EDIT: Is there functionality within Perf that will allow a file to be profiled into different sections? e.g. If main() contains three for loops, is it possible to profile each loop individually to see the number of LLC load misses?
Remember that LLC-loads only counts loads that missed in L1d and L2. As a fraction of total loads (L1-dcache-loads), that's probably a very good hit rate for the cache hierarchy overall (thanks to good locality and/or successful prefetch.)
(Your CPU has a 3-level cache, so the Last Level is the shared L3; the L1 and L2 are private per-core caches. On CPU with only 2 levels of cache, the LLC would be L2.)
Only 9k accesses that had to go all the way to DRAM 10 seconds is very very good.
A low LLC hit rate with such a low total LLC-loads tells you that your workload has good locality for most of its accesses, but the accesses that do miss often have to go all the way to DRAM, and only half of them benefit from having L3 cache at all.
related: Cache friendly offline random read, and see #BeeOnRope's answer on Understanding perf detail when comparing two different implementations of a BFS algorithm where he says the absolute number of LLC misses is what counts for performance.
An algorithm with poor locality will generate a lot of L2 misses, and often a lot of L3 hits (quite possibly with a high L3 hit rate), but also many total L3 misses, so the pipeline is stalled a lot of the time waiting for memory.
What metric could you suggest to measure how my program performs in terms of stressing the memory?
Do you want to know how much total memory traffic your program causes, including prefetches? i.e. what kind of impact it might have on other programs competing for memory bandwidth? offcore_requests.all_requests could tell you how many requests (including L2 prefetches, page walks, and both loads and stores, but not L3 prefetches) make it past L2 to the shared L3 cache, whether or not they hit in shared L3. (Use the ocperf.py wrapper for perf. My Skylake has that event; IDK if your Nehalem will.)
As far as detecting whether your code bottlenecks on memory, LLC-load-misses per second as an absolute measure would be reasonable. Skylake at least has a cycle_activity.stalls_l3_miss to count cycles where no uops executed and there was an outstanding L3 miss. If that's more than a couple % of total cycles, you'd want to look into avoiding those stalls.
(I haven't tried using these events to learn anything myself, they might not be the most useful suggestion. It's hard to know the right question to ask yourself when profiling; there are lots of events you could look at but using them to learn something that helps you figure out how to change your code is hard. It helps a lot to have a good mental picture of how your code uses memory, so you know what to look for. For such a general question, it's hard to say much.)
Is there a way you could suggest that can break down the benchmark file to see which loops are causing the most stress?
You can use perf record -e whatever / perf report -Mintel to do statistical sample-based profiling for any event you want, to see where the hotspots are.
But for cache misses, sometimes the blame lies with some code that looped over an array and evicted lots of valuable data, not the code touching the valuable data that would still be hot.
A loop over a big array might not see many cache misses itself if hardware prefetching does its job.
linux perf: how to interpret and find hotspots. It can be very useful to use stack sampling if you don't know exactly what's slow and fast in your program. Sampling the call stack on each event will show you which function call high up in the call tree is to blame for all the work its callees are doing. Avoiding that call in the first place can be much better than speeding up the functions it calls by a bit.
(Avoid work instead of just doing the same work with better brute force. Careful applications of the maximum brute-force a modern CPU can bring to bear with AVX2 is useful after you've established that you can't avoid doing it in the first place.)
I would like to know what techniques I should look up-to for optimizing a given algorithm for a given architecture. How do I improve performance using better caching. How do I reduce cache coherency or what access patterns should I avoid in my algorithm/program so that cache coherency doesn't impact my performance?
I understand a few standard techniques for using the recently cached data in L1 but how would I use data in a shared cache(say L2) on a multi-core effectively thereby I avoid a main-memory access which is even more costlier?
Basically, I am interested in what data access patterns I should try to exploit or avoid for a better mapping to my given architecture. What data structure I could use, in what scenarios for what architectures(with different levels of private cache and shared cache) to improve performance. Thanks.
What techniques I should look up-to for optimizing a given algorithm for a given architecture?
Micro-architectures vary, so learn the details of your specific processor. Intel provides good documentation in their optimization guide. If you are using an Intel processor you'll want to read sections 8.3 and 8.6:
8.3 OPTIMIZATION GUIDELINES
This section summarizes optimization guidelines for tuning multithreaded applications. Five areas are listed (in order of importance):
Thread synchronization
Bus utilization
Memory optimization
Front end optimization
Execution resource optimization
Practices associated with each area are listed in this section. Guidelines for each area are discussed in greater depth in sections that follow. Most of the coding recommendations improve performance scaling with processor cores; and scaling due to HT Technology. Techniques that apply to only one environment are noted.
8.6 MEMORY OPTIMIZATION
Efficient operation of caches is a critical aspect of memory optimization. Efficient operation of caches needs to address the following:
Cache blocking
Shared memory optimization
Eliminating 64-KByte aliased data accesses
Preventing excessive evictions in first-level cache
What data access patterns I should try to exploit or avoid for a better mapping to my given architecture?
Exploit
When caches are full and an access misses in the cache the cache must evict something to make room for the new data/code, what is evicted is usually based on an approximation of least-recently used (LRU). If possible then your code should have strong locality of reference:
Try to pack data that is used close in time in the algorithm such that it is close in space (address)
Pack data tightly, don't use a 64-bit integer when a 32-bit integer will do, for example
Sometimes the alignment of an "object" (related data) relative to a cache line matters. For example, if there is an array of objects each of 64-bytes and they are accessed randomly then aligning at a 64-byte boundary will improve cache efficiency by not bringing in data that is not used. If the object isn't aligned then every object touched brings in two cache lines, but only 64-bytes are needed, so 50% of data transferred isn't used (assumes cache lines are 64-bytes).
As #PaulA.Clayton pointed out in the comments, pre-fetching data is very important, as it hides part or all of the memory latency. "Also, exploiting stride-based hardware prefetching can be quite beneficial. (Software prefetching can also be useful in some cases.) Getting pointers early helps increase memory-level parallelism."
In order to facilitate the hardware pre-fetcher and to increase the utilization of the data that is brought into the cache pay careful attention to how matrices and other large structures are stored and accessed... see Wikipedia article on row-major order.
Avoid
Data that you don't use often shouldn't be close to data that you use frequently
Avoid false sharing. If two or more threads access the same cache line but are not sharing the same data within the cache line and at least one of them is a writer you have false sharing... there will be unnecessary burden and latency hit associated with cache coherency protocol.
Try not to use new data until you are done with the older data
Measure
As Andrei Alexandrescu said in this talk - when it comes to performance tuning the only intuition that is right is "I should measure this." Familiarize yourself with cache performance monitoring tools, for example:
perf
Cachegrind
The key principle is locality: when you have the choice, process nearby data first (avoid sparse accesses), and perform data reuse as soon as possible (regroup successive passes over the same data).
For multithreaded programs, the principle is separate locality: ensure that the threads work on disjoint data sets (use distinct copies is necessary/possible).
Unless you have very good reasons to do so, stay away from the peculiarities of the hardware.
It should be mentioned that code is also cached in the same way as data. Small, dense code with a lot of inlining and few jumps/calls will put less strain on the L1C cache and, ultimately, L2, L3 and RAM where collisions with data fetches will occur.
If you are using hyperthreading there appears to be evidence to indicate that a lower optimization level (O1) on two hyperthreads in a core will overall get more work done than a single, highly optimized (O2 and higher) thread.
Context:
-- embedded platform running Linux with some static RAM which is declared about 3 times faster then the rest of RAM (dynamic). The amount of this fast memory is 512kB and the official name is eSRAM. (Details not important for this post: Galileo board, information on eSRAM and relevant kernel API: https://communities.intel.com/servlet/JiveServlet/previewBody/22488-102-1-26046/Quark_SWDevManLx_330235_001.pdf)
-- eSRAM can be used by an application with some support from the kernel---a simple driver that allocates kernel memory on its behalf, overlays the memory with eSRAM (this is done in physical space) and mmaps it to app's virtual memory space. This was tested and confirmed to work as expected.
Problem:
Identify which sections of app's data (and possibly code) to map into eSRAM to achieve optimum performance gain. A suitable analysis tool is required.
After some search I'm not sure if any existing tool is actually suited to this task. Currently my best bet is to develop a specialized Valgrind tool. But maybe there is already something in the ecosystem to start with. Any advice/information is welcome even if, for instance, a tool is kind of partially suited etc.
P.S.
Full analysis should probably take a lot of factors into account, like:
-- memory access patterns (cache performance)
-- changes over time (one could consider eSRAM paging)
...
I have taken a look at Valgrind Cachegrind. It can collect data about data cache reades and data cache writes. And cg_annotate can report Line-by-line Counts for you program. Can it be useful for you to find variables in your program that cause most operations with data cache and in this way to identify data that can benefit most from moving to quick memory? http://valgrind.org/docs/manual/cg-manual.html#cg-manual.line-by-line
Probably, you are interested in D cache reads (Dr) and D cache writes (Dw), or even (Dr+Dw). In that way you can find a place in your code which does most (Dr+Dw) and try to move this place in your quick memory.