What does it mean to "train" a branch predictor? - security

I was reading this article about a theoretical CPU vulnerability similar to Spectre, and it noted that:
"The attacker needs to train the branch predictor such that it
reliably mispredicts the branch."
I roughly understand what branch prediction is and how it works, but what does it mean to "train" a branch predictor? Does this mean biasing one branch so that it is much more computationally expensive than the other, or does it mean (in a loop) continually having the CPU correctly predict a particular branch before proceeding to the next, mispredicted branch?
E.g.,
// Train branch predictor
for (int i = 0; i < 512; i++)
{
    if (true) {
        // Do some instructions
    } else {
        // Do some other instructions
    }
}
// The branch predictor is now "trained"/biased to predict the first branch?
// Proceed to attack
Do branch predictors use weights to bias the prediction one way or the other based on previous predictions/mispredictions?

It means creating a branch that aliases the branch you're attacking (by putting it at a specific address, maybe the same virtual address as in another process; a 4k or some other power-of-2 offset may also work), and running it a bunch of times to bias the predictor.
So that when the branch you're attacking with Spectre actually runs, it will be predicted the way you want. (Or for indirect branches, will jump to the virtual address you want).
Modern TAGE branch predictors index based on branch history (of the other branches in the dynamic instruction stream leading up to this branch), so training them properly can be complicated...
But at the most simplistic level, yes, branch predictors with more than 1 bit of state remember more than just the last branch direction. Wikipedia has a big article about many different implementations of branch prediction, from simple 2-bit saturating counters on up.
Training them involves making a branch you control go the same way repeatedly.
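For intuition, here is a minimal simulation of a single 2-bit saturating counter in C++ (a toy sketch, not how any particular CPU implements it; the names and the initial state are arbitrary). It shows why feeding the same outcome repeatedly "trains" the counter, and why a single opposite outcome afterwards isn't enough to flip its prediction:

#include <cstdio>

// One 2-bit saturating counter: states 0,1 predict "not taken"; 2,3 predict
// "taken". Real predictors keep tables of these, indexed by branch address
// and/or branch history.
struct TwoBitCounter {
    int state = 2;                        // start "weakly taken" (arbitrary)
    bool predict() const { return state >= 2; }
    void update(bool taken) {             // saturate at 0 and 3
        if (taken) { if (state < 3) state++; }
        else       { if (state > 0) state--; }
    }
};

int main() {
    TwoBitCounter c;
    for (int i = 0; i < 100; i++) c.update(false);   // "training" phase
    std::printf("after training: %s\n", c.predict() ? "taken" : "not taken");
    c.update(true);                                  // one opposite outcome
    std::printf("after one taken outcome: %s\n",
                c.predict() ? "taken" : "not taken");
}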
Specifically, you'd put something like this asm in a loop (at a known address), and run it repeatedly.
xor eax,eax ; eax=0 and thus set ZF
jnz .target ; always not-taken
Then the victim branch that aliases it will be predicted not-taken and will speculatively fall through into the Spectre "gadget" you want to run, even though it's normally taken.
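As a rough sketch of that training loop (GNU-style inline asm for GCC/Clang on x86; the function name and iteration count are only illustrative, and in a real attack what matters is placing this branch at an address that aliases the victim's):

// Sketch only: run an always-not-taken branch many times so the predictor
// entry it maps to learns "not taken".
static void train_not_taken(void) {
    for (int i = 0; i < 1000; i++) {
        asm volatile(
            "xor %%eax, %%eax \n\t"   /* eax = 0 and thus set ZF     */
            "jnz 1f           \n\t"   /* always not-taken: ZF is set */
            "1:               \n\t"
            ::: "eax", "cc");
    }
}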

A branch predictor works by remembering the recent behaviour of branches (their outcomes and, for indirect branches, their targets). The simplest form of prediction simply remembers which way a branch went the last time it was hit; more complex predictors exist and are common.
The "training" is simply populating that memory. For the simple (1-value) predictor, that means taking the branch you want to favour, once. For complex predictors, it will mean executing the favoured branch multiple times until the processor reliably predicts the desired outcome.

Related

Initialize Random.Seed once to keep seed

I have a function with a lot of random.choice, random.choices, random.randint, random.uniform, etc. calls, and I have to put random.seed before them all.
I use Python 3.8.6. Is there any way to keep a seed initialized just in the function, or at least a way to toggle it, instead of doing it every time?
It sounds like you have a misconception about how PRNGs work. They are generators (it's right there in the name!) which produce a sequence of values that are deterministic but are constructed in an attempt to be indistinguishable from true randomness based on statistical tests. In essence, they attempt to pass a Turing test/imitation game for randomness. They do this by maintaining an internal state of bits, which gets updated via a deterministic algorithm with every call to the generator. The only role a seed is supposed to play is to set the initial internal state so that separate runs create reproducible sequences. Repeating the seed for separate runs can be useful for debugging and for playing tricks to reduce the variability of some classes of estimators in Monte Carlo simulation.
All PRNGs eventually cycle. Since the internal state is composed of a finite number of bits and the state update mechanism is deterministic, the entire sequence will repeat from the point where any state duplication occurs. In other words, the output sequence is actually a loop of values. The seed value has nothing to do with the quality of the pseudo-random numbers; that's determined by the state transition algorithm. Conceptually you can think of the seed as just being the entry point to the PRNG's cycle. (Note that this doesn't mean you have a cycle just because you observe the same output; cycling occurs only when the internal state that produces the output repeats. That's why the 1980s and 90s saw the emergence of PRNGs whose state space contained more bits than the output space, allowing duplicate output values, as predicted by the birthday problem, without having the sequence repeat verbatim from that point on.)
If you mess with a good PRNG by reseeding it multiple times, you're undoing all of the hard work that went into designing an algorithm which passes statistically based Turing tests. Since the seed does not determine the quality of the results, you're incurring additional cost (to spawn a new state from the seed), gaining no statistical benefit, and quite likely harming the ability to pass statistical testing. Don't do that!
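The question is about Python's random module, but the discipline is the same in any language. Here is a minimal sketch of it in C++ with std::mt19937 (the seed value is arbitrary): seed once up front, then keep drawing from the same generator object.

#include <iostream>
#include <random>

int main() {
    // Seed ONCE: the seed only chooses the entry point into the generator's
    // cycle; it has nothing to do with the quality of the output.
    std::mt19937 gen(12345);
    std::uniform_int_distribution<int> die(1, 6);

    // Keep drawing from the same generator object; do NOT re-seed per call.
    for (int i = 0; i < 5; ++i)
        std::cout << die(gen) << ' ';
    std::cout << '\n';

    // Re-running the program with the same seed reproduces the same sequence,
    // which is the legitimate use of seeding (debugging, variance reduction).
}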

Obfuscation of checksum guards

As part of my project, I have to insert small pieces of code, called checksum guards, into a C program. What these guards do is calculate the checksum value of a portion of code using a function (add, xor, etc.) which operates on the instruction opcodes. So, if somebody has tampered with the instructions (add, modify, delete) in that region of code, the checksum value will change and the intrusion will be detected.
Here is the research paper which talks about this technique:
https://www.cerias.purdue.edu/assets/pdf/bibtex_archive/2001-49.pdf
Here is the guard template:
guard:
    add ebp, -checksum      ; subtract the expected checksum
    mov eax, client_addr    ; start of the guarded code region
for:
    cmp eax, client_end
    jg  end                 ; stop once past the end of the region
    mov ebx, dword[eax]
    add ebp, ebx            ; accumulate the region, dword by dword
    add eax, 4
    jmp for
end:
Now, I have two questions.
Would putting the guards in the assembly be better than putting them in the source program?
Assuming I put them in the assembly (at an appropriate place), what kind of obfuscation should I use to prevent the guard template from being easily visible? (Since when I have more than one guard, the attacker should not be able to easily find all the guard templates and disable all the guards together, as that would leave the code with no security.)
Thank you in advance.
From the attacker's point of view (no sources), the first question doesn't matter: they are tampering with the final binary machine code, and whether it was produced from .c or .s makes zero difference. So I would worry mainly about how to generate the correct binary with the appropriate checksums. I'm not aware of any way to get a proper checksum from inside the C source, but I can imagine an external tool running over the assembler files created by the C compiler as a post-processing step, before the .s files are compiled into .o. But keep in mind that some calls and addresses are just relative offsets, and that the binary loaded into memory is patched by the OS loader according to the linker's relocation table so that these point to real memory addresses. Thus some of the data bytes will change (the opcodes will stay fixed).
Your guard template doesn't take that into account: it checksums whole instructions, data bytes included. (Some advanced guards carry opcode definitions and checksum/encrypt/decrypt only the opcode bytes themselves, without the operand bytes.)
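To make that concrete, here is roughly what the guard loop computes, written as a hypothetical C helper (the names are illustrative). It sums everything in the range, operand and immediate bytes included, which is exactly what breaks once the loader patches addresses:

#include <stdint.h>
#include <string.h>

// Dword-wise sum of a code region [begin, end), mirroring the asm template
// (the "add ebp, dword[eax]" loop). Because it checksums the raw bytes,
// relocated addresses embedded inside instructions change the result at
// load time; a more robust guard would checksum only the opcode bytes.
static uint32_t region_checksum(const uint8_t *begin, const uint8_t *end) {
    uint32_t sum = 0;
    for (const uint8_t *p = begin; p + 4 <= end; p += 4) {
        uint32_t word;
        memcpy(&word, p, 4);
        sum += word;
    }
    return sum;
}
// The guard effectively checks region_checksum(...) == expected, with
// "expected" playing the role of the -checksum constant in the template.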
Otherwise it's a neat touch that the result of a mismatch is a damaged ebp value, ruining any surrounding C code (*) that works with stack variables. But it's still an artificial test: an attacker can simply patch out both add ebp, -checksum and add ebp, ebx, making the guard harmless.
(*) Note that you have to put the guard in the middle of ordinary C code to get real runtime problems from an invalid ebp value. If you put it at the end of a subroutine that finishes with pop ebp, everything would still work fine.
So to the second question:
You definitely want more malicious ways of reacting to a wrong checksum than only ebp damage. Usually the hardest to remove is to make the checksum value part of some genuine calculation, skewing the results just slightly, so that serious use of the software becomes impossible but it takes the user a while to notice.
You can also fold your checksumming into some genuine code loop, so that simply skipping the whole loop also skips valid code (but I can only imagine adding this by hand into the assembly generated from C, so you would have to redo it after every recompilation of that particular C source).
The guard template itself can then be obfuscated by any imaginable mutation (different registers, reordered instructions, alternative instruction variants); search for viruses with mutating encoders to get some ideas.
I didn't read the whole paper, but from the figures I would say the main point is to make the guarded areas overlap, so that patching out one guard affects another one. That sounds to me like the extra ingredient needed to make the scheme somewhat functional (although it still looks like a normal challenge for 8-bit game crackers ;), not even "hard" level). But it also means you would need either a very cunning external tool to work out that cyclic tree of dependencies and insert the guard templates in the correct order, or to do it all manually.
Of course, when doing it manually, you have to redo it after each new C compilation, so it's worth the effort only for something very precious and expensive, or rock-solid stable, where you won't produce another revision for the next 10 years or so... :D

Branch mispredictions

This question may be silly but I will ask it anyway.
I've heard about branch prediction from this answer by Mysticial,
and I want to know if it is possible for the following to happen.
Let's say I have this piece of C++ code:
while (memoryAddress = getNextAddress()) {
    if (haveAccess(memoryAddress))
        // change the value of *memoryAddress
    else
        // do something else
}
So if the branch predictor in some case wrongly predicts that the if statement is true, and the program then changes the value of *memoryAddress, can something bad happen as a result?
Can things like a segmentation fault happen?
The branch predictor inside a processor is designed to have no functionally observable effects.
The branch predictor is not sophisticated enough to get it right every time, regardless of attempts to trick it such as yours. If it were right every time, that would simply be how branches are always executed; it wouldn't be a "predictor".
The condition of the branch is still computed while execution continues speculatively, and if the condition turns out not to have the predicted value, nothing is committed to memory. Execution goes back to the correct branch.

Managing a stateful computation system in Haskell

So, I have a system of stateful processors that are chained together. For example, a processor might output the average of its last 10 inputs. It requires state to calculate this average.
I would like to submit values to the system, and get the outputs. I also would like to jump back and restore the state at any time in the past. I.e., I run 1000 values through the system. Now I want to "move" the system back to exactly as it was after I had sent the 500th value through. Then I want to "replay" the system from that point again.
I also need to be able to persist the historical state to disk so I can restore the whole thing again some time in the future (and still have the move back and replay functions work). And of course, I need to do this with gigabytes of data, and have it be extremely fast :)
I had been approaching it using closures to hold state, but I'm wondering if it would make more sense to use a monad. I have only read through 3 or 4 analogies for monads, so I don't understand them well yet; feel free to educate me.
What I'm imagining is something like this: each processor modifies its state in the monad in such a way that its history is kept and tied to an id for each processing step. Then somehow the monad is able to switch its state to a past step id and run the system with the monad in that state. And the monad would have some mechanism for (de)serializing itself for storage.
(and given the size of the data... it really shouldn't even all be in memory, which would mean the monad would need to be mapped to disk, cached, etc...)
Is there an existing library/mechanism/approach/concept that has already been done to accomplish or assist in accomplishing what I'm trying to do?
So, I have a system of stateful processors that are chained together. For example, a processor might output the average of its last 10 inputs. It requires state to calculate this average.
First of all, it sounds like what you have are not just "stateful processors" but something like finite-state machines and/or transducers. This is probably a good place to start for research.
I would like to submit values to the system, and get the outputs. I also would like to jump back and restore the state at any time in the past. I.e., I run 1000 values through the system. Now I want to "move" the system back to exactly as it was after I had sent the 500th value through. Then I want to "replay" the system from that point again.
The simplest approach here, of course, is to simply keep a log of all prior states. But since it sounds like you have a great deal of data, the storage needed could easily become prohibitive. I would recommend thinking about how you might construct your processors in a way that could avoid this, e.g.:
If a processor's state can be reconstructed easily from the states of its neighbors a few steps prior, you can avoid logging it directly
If a processor is easily reversible in some situations, you don't need to log those immediately; rewinding can be calculated directly, and logging can be done as periodic snapshots
If you can nail a processor down to a very small number of states, make sure to do so.
If a processor behaves in very predictable ways on certain kinds of input, you can record that as such--e.g., if it idles on numeric input below some cutoff, rather than logging each value just log "idled for N steps".
I also need to be able to persist the historical state to disk so I can restore the whole thing again some time in the future (and still have the move back and replay functions work). And of course, I need to do this with gigabytes of data, and have it be extremely fast :)
Explicit state is your friend. Functions are a convenient way to represent active state machines, but they can't be serialized in any simple way. You want a clean separation of a (basically static) network of processors vs. a series of internal states used by each processor to calculate the next step.
Is there an existing library/mechanism/approach/concept that has already been done to accomplish what I'm trying to do? Does the monad approach make sense? Are there other better/special approaches that would help it do this efficiently especially given the enormous amount of data I have to manage?
If most of your processors resemble finite state transducers, and you need to have processors that take inputs of various types and produce different types of outputs, what you probably want is actually something with a structure based on Arrows, which gives you an abstraction for things that compose "like functions" in some sense, e.g., connecting the input of one processor to the output of another.
Furthermore, as long as you avoid the ArrowApply class and make sure that your state machine model only returns an output value and a new state, you'll be guaranteed to avoid implicit state because (unlike functions) Arrows aren't automatically higher-order.
Given the size of the data... it really shouldn't even all be in memory, which would mean the monad would need to be mapped to disk, cached, etc...
Given a static representation of your processor network, it shouldn't be too difficult to also provide an incremental input/output system that would read the data, serialize/deserialize the state, and write any output.
As a quick, rough starting point, here's an example of probably the simplest version of what I've outlined above, ignoring the logging issue for the moment:
-- A generic processor with explicit internal state of type s,
-- inputs of type a, and outputs of type b.
data Transducer s a b = Transducer { curState :: s
                                   , runStep  :: s -> a -> (s, b)
                                   }

-- Push a list of inputs through, threading the state manually, and return
-- the updated transducer along with the collected outputs.
runTransducer :: Transducer s a b -> [a] -> (Transducer s a b, [b])
runTransducer t []     = (t, [])
runTransducer t (x:xs) = let (s, y)   = runStep t (curState t) x
                             (t', ys) = runTransducer (t { curState = s }) xs
                         in (t', y : ys)
It's a simple, generic processor with explicit internal state of type s, input of type a, and output of type b. The runTransducer function shoves a list of inputs through, updating the state value manually, and collects a list of outputs.
P.S. -- since you were asking about monads, you might want to know if the example I gave is one. In fact, it's a combination of multiple common monads, though which ones depends on how you look at it. However, I've deliberately avoided treating it as a monad! The thing is, monads capture only abstractions that are in some sense very powerful, but that same power also makes them more resistant in some ways to optimization and static analysis. The main thing that needs to be ruled out is processors that take other processors as input and run them, which (as you can imagine) can create convoluted logic that's nearly impossible to analyze.
So, while the processors probably could be monads, and in some logical sense intrinsically are, it may be more useful to pretend that they aren't, imposing an artificial limitation in order to make static analysis simpler.

Hypothesis testing and GPGPU

I'm very new to GPGPU and programming. I'm interested to know whether statistical hypothesis tests like the one-sample Kolmogorov-Smirnov test (K–S test) and Levene's test could be implemented on a GPGPU (SIMD) using CUDA. If so, what would the limitations be?
I have only read web definitions of these tests, but if I understood them correctly, they can be accelerated well by the kind of parallelism expressed by SIMD (in particular as implemented by CUDA).
In the K-S test, one has to compute the difference between a function and an estimate of it at N sample points, and then take the maximum difference. In other words, one has to perform the same operation on N different values, which is exactly SIMD (single instruction, multiple data).
In Levene's test there is again the same pattern: differences, squares and multiplications over N different values.
What SIMD can do is a sort of FOR statement over N sets of values, provided that the iterations are independent of each other. Thus, in CUDA for example, the compiler can allocate the iterations to the processing elements of the graphics device, so that, executing in parallel, the FOR loop is run for all the data in the time of a single iteration.
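As a rough illustration of that shape (in standard C++ with parallel algorithms rather than actual CUDA code, and with cdf standing in for the hypothesized distribution function; the names here are only illustrative), the K-S statistic is an independent per-sample computation followed by a max reduction, which is exactly what a CUDA kernel plus a reduction would do:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

// D = max over i of |F_empirical(x_i) - cdf(x_i)| for sorted samples x_i.
// Each term is independent of the others, so this is a data-parallel map
// followed by a max reduction.
double ks_statistic(const std::vector<double>& sorted_samples,
                    double (*cdf)(double)) {
    const std::size_t n = sorted_samples.size();
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    return std::transform_reduce(
        std::execution::par_unseq, idx.begin(), idx.end(), 0.0,
        [](double a, double b) { return std::max(a, b); },   // reduction: max
        [&](std::size_t i) {                                  // per-sample map
            const double empirical = double(i + 1) / double(n);
            return std::fabs(empirical - cdf(sorted_samples[i]));
        });
}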
The CUDA toolkit provides its own C/C++ compiler (NVCC), with which the marked portions of the code are dispatched to the GPGPU rather than to the CPU, and therefore distributed across its parallel processing elements.
