I have a question regarding a paper I am reading right now, which is a demonstration of an attack against some tampering resistant software, using self-hashing mechanism. This kind of self hashing is working because authors are making the assumption that the executed code is the same as the hashed code, which is true except against some manipulations against the way a processor is manipulating the memory.
In the paper, there is the following sentence which troubles me : "A critical (implicit) assumption of both the hashing in Aucsmith’s IVK and checksum systems employing networks is that processors operate such that D(x) = I(x), where D(x) is the bit-string result of a “data read”
from memory address x, and I(x) is the bit-string result of an “instruction fetch” of corresponding length from x."
How do you state the difference between D(x) and I(x) ? What is the difference between a data read and an instruction fetch ?
Thanks for your help
The difference in these operations is when they occur and where the data is stored before use. Most processors have dedicated caches for instructions. This may mean that the data is fetched from main memory twice: once into a data cache for calculating the hash and again into the instruction cache.
I cannot find it now, but a year ago I read of a means of hiding malicious code on an Intel processor by causing a cache incoherency between these two caches. The processor would execute the malicious code, but any other tool reading the same memory as mere data would see good code. Here is a means of accomplishing this on an ARM chip.
Related
I've been looking into RISC-V and the RISC pipeline, and realised that a Memory Access can happen at the same time as the Instruction Fetch. Assuming that it is a fairly basic implementation without any cache, this is a hazard. I did a bit of digging and found this. It talks about inserting a bubble/stall, which makes sense, but how would one go about that? I thought about using a NOP, but the IF to actually grab that would still cause contention. Is the stall inserted by the Load/Store instruction? Or is it something else?
In computer architecture, there are two kind of architectures - Harvard and von Neumann. In Harvard architecture,instruction and data memory are separate but von Neumann has same instruction and data memory resulting in structural hazards.
Generally in RISC-V, we use separate instruction and data memory to avoid structural hazard. As in, if you have only data cache, architecturally you should fetch instruction directly from the memory.
Several minutes ago, I and my friends solved some algorithmic problems on the leetcode.com and share our solutions. We used high level languages and when new memory allocated by Array.new(128) in Ruby or int[] map = new int[128]; in Java it already filled by zero-like values nil or 0 respectively.
So it's guarantied that high level program have cleared place.
And here I have a question: In C or Assembler program could it happens that new chunk of memory stores data from other process unchanged?
And thus one process get data of another process. And even may be data from another user that worked in system some time ago. Could it be a way information leaked?
Do OS clear a memory before sharing it among processes? and If so is it very expensive to run so many iterations?
Thank you.
UPD: http://www.cplusplus.com/articles/ETqpX9L8/ looks like it need to clear valuable data in "lower-level" languages manually to prevent data leaks to other processes.
Yes, in lower-level languages where memory is not initialized, it could contain valuable stuff from other processes. There have been encryption key leakage attacks done this way by continually allocating memory and scanning it for what looks like useful information.
Security sensitive programs that store passwords or crypto keys, etc should always clear the memory ASAP after use. It's not only to prevent leaks through re-allocated memory, but there are also other attack vectors like RAM dumps that could be used to extract secrets. Always zero or randomize your memory when you are done with it.
As part of writing driver code, i have come across codes which uses memory barrier (fencing). After reading and surfing through Google, learnt as to why it is used and helpful in SMP. Thinking through this, in multi threaded programming we would find many instances where there are memory races and putting barrier in all places would cost system CPU. I was wondering how to:
I know about specific code path which use common memory to access data, do I need memory barrier in all these places?
Any specific technique or tip which will help me identify this pitfalls?
This is very generic questions but wanted to get insight on others experiences and any tips which would help to identify such pitfalls.
Often device hardware is sensitive to the order in which device registers are written. Modern systems are weakly-coupled and typically have write-combining hardware between the CPU and memory.
Suppose you write a single byte of a 32-bit object. What is in the write-combining hardware is now A _ _ _. Instead of immediately initiating a read/modify/write cycle to update the A byte, the hardware sets a timer. The hope is that the CPU will send the B, C, and D bytes before the timer expires. The timer expires, the data in the write-combining register gets dumped into memory.
Setting a barrier causes the write-combining hardware to use what it has. If only slot A is occupied then only slot A gets written.
Now supposed the hardware expected the bytes to be written in the strict order A, C, B, D. Without precautions the hardware registers get written in the wrong order. The result is what you expect: Ready! Fire! Aim!
Barriers should be placed judiciously because their incorrect use can seriously impede performance. Not every device write needs a barrier; judgement is called for.
I would like to know what techniques I should look up-to for optimizing a given algorithm for a given architecture. How do I improve performance using better caching. How do I reduce cache coherency or what access patterns should I avoid in my algorithm/program so that cache coherency doesn't impact my performance?
I understand a few standard techniques for using the recently cached data in L1 but how would I use data in a shared cache(say L2) on a multi-core effectively thereby I avoid a main-memory access which is even more costlier?
Basically, I am interested in what data access patterns I should try to exploit or avoid for a better mapping to my given architecture. What data structure I could use, in what scenarios for what architectures(with different levels of private cache and shared cache) to improve performance. Thanks.
What techniques I should look up-to for optimizing a given algorithm for a given architecture?
Micro-architectures vary, so learn the details of your specific processor. Intel provides good documentation in their optimization guide. If you are using an Intel processor you'll want to read sections 8.3 and 8.6:
8.3 OPTIMIZATION GUIDELINES
This section summarizes optimization guidelines for tuning multithreaded applications. Five areas are listed (in order of importance):
Thread synchronization
Bus utilization
Memory optimization
Front end optimization
Execution resource optimization
Practices associated with each area are listed in this section. Guidelines for each area are discussed in greater depth in sections that follow. Most of the coding recommendations improve performance scaling with processor cores; and scaling due to HT Technology. Techniques that apply to only one environment are noted.
8.6 MEMORY OPTIMIZATION
Efficient operation of caches is a critical aspect of memory optimization. Efficient operation of caches needs to address the following:
Cache blocking
Shared memory optimization
Eliminating 64-KByte aliased data accesses
Preventing excessive evictions in first-level cache
What data access patterns I should try to exploit or avoid for a better mapping to my given architecture?
Exploit
When caches are full and an access misses in the cache the cache must evict something to make room for the new data/code, what is evicted is usually based on an approximation of least-recently used (LRU). If possible then your code should have strong locality of reference:
Try to pack data that is used close in time in the algorithm such that it is close in space (address)
Pack data tightly, don't use a 64-bit integer when a 32-bit integer will do, for example
Sometimes the alignment of an "object" (related data) relative to a cache line matters. For example, if there is an array of objects each of 64-bytes and they are accessed randomly then aligning at a 64-byte boundary will improve cache efficiency by not bringing in data that is not used. If the object isn't aligned then every object touched brings in two cache lines, but only 64-bytes are needed, so 50% of data transferred isn't used (assumes cache lines are 64-bytes).
As #PaulA.Clayton pointed out in the comments, pre-fetching data is very important, as it hides part or all of the memory latency. "Also, exploiting stride-based hardware prefetching can be quite beneficial. (Software prefetching can also be useful in some cases.) Getting pointers early helps increase memory-level parallelism."
In order to facilitate the hardware pre-fetcher and to increase the utilization of the data that is brought into the cache pay careful attention to how matrices and other large structures are stored and accessed... see Wikipedia article on row-major order.
Avoid
Data that you don't use often shouldn't be close to data that you use frequently
Avoid false sharing. If two or more threads access the same cache line but are not sharing the same data within the cache line and at least one of them is a writer you have false sharing... there will be unnecessary burden and latency hit associated with cache coherency protocol.
Try not to use new data until you are done with the older data
Measure
As Andrei Alexandrescu said in this talk - when it comes to performance tuning the only intuition that is right is "I should measure this." Familiarize yourself with cache performance monitoring tools, for example:
perf
Cachegrind
The key principle is locality: when you have the choice, process nearby data first (avoid sparse accesses), and perform data reuse as soon as possible (regroup successive passes over the same data).
For multithreaded programs, the principle is separate locality: ensure that the threads work on disjoint data sets (use distinct copies is necessary/possible).
Unless you have very good reasons to do so, stay away from the peculiarities of the hardware.
It should be mentioned that code is also cached in the same way as data. Small, dense code with a lot of inlining and few jumps/calls will put less strain on the L1C cache and, ultimately, L2, L3 and RAM where collisions with data fetches will occur.
If you are using hyperthreading there appears to be evidence to indicate that a lower optimization level (O1) on two hyperthreads in a core will overall get more work done than a single, highly optimized (O2 and higher) thread.
I'm designing a program and i found that assuming implicit cache coherency make the design much much easier. For example my single writer (always the same thread) multiple reader (always other threads) scenarios are not using any mutexes.
It's not a problem for current Intel CPU's. But i want this program to generate income for at least the next ten years (a short time for software) so i wonder if you think this could be a problem for future cpu architectures.
I suspect that future CPU generations will still handle cache coherence for you. Without this, most mainstream programming methodologies would fail. I doubt any CPU architecture that will be used widely in the next ten years will invalidate the current programming model - it may extend it, but it's difficult to drop something so widely assumed.
That being said, programming with the assumption of implicit cache coherency is not always a good idea. There are many issues with false sharing that can easily be avoided if you purposefully try to isolate your data. Handling this properly can lead to huge performance boosts (rather, a lack of huge performance losses) on current generation CPUs. Granted, it's more work in the design, but it is often required.
We are already there. Computers claim cache coherency but at the same time they have a temporary store buffer for writes, reads can be completed via this buffer instead of the cache (ie the store buffer has just become a incoherent cache) and invalidate requests are also queued allowing the processor to temporarily use cache lines it knows are stale.
X86 doesn't use many of these techniques, but it does use some. As long as memory stays significantly slower than the CPU, expect to see more of these techniques and others yet devised to be used. Even itanium, failed as it is, uses many of these ideas, so expect intel to migrate them into x86 over time.
As for avoiding locks, etc: it is always hard to guage people's level of expertise over the Internet so either you are misguided with what you think might work, or you are on the cutting edge of lockfree programming. Hard to tell.
Do you understand the MESI protocol, memory barriers and visibility? Have you read stuff from Paul McKenney, etc?
I don't know per se. But I'd like to see a trend toward non-cache coherent modes.
The conceptual mind shift is significant (can't just pass data in a method call, must pass it through a queue to an async method), but it's required as we move more and more into a multicore world anyway. The closer we get to one processor per memory bank the better. Because then we're working in a world of network message routing, where data is just not available rather than having threads that can silently stomp on data.
However, as Reed Copsey points out, the whole x86 world of computing is built on the assumption of cache coherency (which is even bigger than Microsoft's market share!). So it won't go away any time soon!
Here is a paper from reputed authors in computer architecture area which argues that cache coherence is here to stay.
http://acg.cis.upenn.edu/papers/cacm12_why_coherence.pdf
"Why On-Chip Cache Coherence Is Here to Stay" -By Martin, Hill and Sorin
You are making a strange request. You are asking for our (the SO community) assumptions about future CPU architectures - a very dangerous proposition. Are you willing to put your money where our mouth is? Because if we're wrong and your application will fail it will be you who's not making any money..
Anyway, I would suspect things are not going to change that dramatically because of all the legacy code that was written for single threaded execution but that's just my opinion.
The question seems misleading to me. The CPU architecture is not that important, what is important is the memory model of the platform you are working for.
You are developing the application is some environment, with some defined memory model. E.g. if you are currently targeting x86, you can be pretty sure any future platform will implement the same memory model when it is running x86 code. The same is true for Java or .NET VMs and other execution platforms.
If you expect to port your current application at some other platforms, if the platform memory model will be different, you will have to adjust for it, but in such case you are the one doing the port and you have the complete control over how you do it. This is however true even for current platforms, e.g. PowerPC memory model allows much more reorderings to happen than the x86 one.