I need to prove that the optimal page replacement algorithm is indeed optimal, and I'm not sure exactly how to start. I thought maybe proof by contradiction, but once I formulated an alternative claim, I wasn't sure how to show that it would have equal or fewer page faults than OPT.
Longest Forward Distance (LFD)
Replace the page whose next request is farthest (in the future)
Theorem:
LFD (longest forward distance) is an optimal algorithm.
Proof:
For contradiction, assume that LFD is not optimal
Then there exists a finite input sequence α on which LFD is not optimal (assume that the length of α is |α| = n)
Let OPT be an optimal solution for α such that
– OPT processes requests 1,2, …, i in the same way as LFD
– OPT processes request i+1 differently than LFD
– Any other optimal strategy processes one of the first i+1 requests differently than LFD
• Hence, OPT is the optimal solution that behaves in the same way as LFD
for as long as possible --> we have i < n
• Goal: Construct OPT′ that is identical with LFD for req. 1, … , i+1
Case 1: Request i+1 does not lead to a page fault
• LFD does not change the content of the fast memory
• OPT behaves differently than LFD --> OPT replaces some page in the fast memory
– As both algorithms behave in the same way up to request i, they also have the same fast memory content when request i+1 arrives
– OPT therefore does not require the new page for request i+1
– Hence, OPT can also load that page later (without extra cost) --> OPT′
Case 2: Request i+1 does lead to a page fault
• LFD and OPT move the same page into the fast memory, but they evict different pages
– If OPT loads more than one page, all pages that are not required for request i+1 can also be loaded later
• Say, LFD evicts page p and OPT evicts page p’
• By the definition of LFD, p′ is requested again (say at request ℓ) before page p
Now, there are two cases:
a) OPT keeps p in fast memory until request ℓ
– OPT could evict p at request i+1, keeping p′ instead, and load p (instead of p′) back into the fast memory at request ℓ at no extra cost, just like LFD
b) OPT evicts p at request ℓ’ < ℓ
– OPT could evict p at request i+1, keeping p′ instead, and at request ℓ′ evict p′ instead of p (the evictions of p and p′ are swapped), again at no extra cost, just like LFD
In both cases we obtain an optimal strategy OPT′ that agrees with LFD on requests 1, …, i+1, contradicting the choice of OPT. Hence no strategy incurs fewer page faults than LFD, i.e., LFD is the optimal page replacement technique.
LFD is also known as the optimal page replacement technique (OPT).
PS: in the proof, the name 'OPT' simply denotes an optimal strategy; it should not be confused with the optimal page replacement technique itself.
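To make the strategy concrete, here is a minimal Python sketch (not part of the proof) that simulates LFD, i.e. the rule of evicting the page whose next request is farthest in the future, and counts its page faults; a FIFO simulator is included so you can check on small inputs that LFD never faults more often. Function names and the example sequence are just illustrative.
from collections import deque

def lfd_faults(requests, capacity):
    """Count page faults of LFD: on a fault with full fast memory,
    evict the page whose next request is farthest in the future."""
    memory, faults = set(), 0
    for i, page in enumerate(requests):
        if page in memory:
            continue
        faults += 1
        if len(memory) == capacity:
            def next_use(p):
                # index of the next request of p, or infinity if never used again
                for j in range(i + 1, len(requests)):
                    if requests[j] == p:
                        return j
                return float('inf')
            memory.remove(max(memory, key=next_use))
        memory.add(page)
    return faults

def fifo_faults(requests, capacity):
    """Same fault count for FIFO, for comparison."""
    memory, order, faults = set(), deque(), 0
    for page in requests:
        if page in memory:
            continue
        faults += 1
        if len(memory) == capacity:
            memory.remove(order.popleft())
        memory.add(page)
        order.append(page)
    return faults

reqs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(lfd_faults(reqs, 3), fifo_faults(reqs, 3))   # 7 9 -> LFD never does worse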
I am modeling the Towers of Hanoi problem with n discs and k pegs, and I am trying to find its maximum branching factor. The problem is that, as both the number of discs and the number of pegs are variable, so is the number of actions possible for each node. How can I find a generic way of assessing the maximum branching factor depending on k and n?
In general the smallest disc can move to any other peg: k-1 options.
The second smallest disc (at the top of the stack on a peg; it might not be the second smallest overall) can move onto any peg except the one with the smallest disc: k-2 options.
This continues until the largest disk on the top of a peg, which can't move anywhere (assuming n>k).
So, as long as every peg holds at least one disc, the maximum branching factor is: (k-1)+(k-2)+(k-3)+...+2+1 = (k-1)*k/2
The only time you won't get this is when one of the pegs contains no disks. If n>>k this will rarely happen. But, it means that if you are searching from random states to a goal state, you should consider searching backwards, because the standard goal state has the lowest branching factor since only one peg has a disc.
The n < k case can be similarly analyzed, except that you stop after n disks and subtract an additional term for the moves we counted the first time around that aren't available now:
k(k-1)/2 - (k-n)(k-n-1)/2
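For completeness, here is a small Python sketch (illustrative, not from the answer) of the closed-form maximum branching factor, together with a brute-force count of legal moves from a hand-picked worst-case state to sanity-check it:
def max_branching_factor(n, k):
    """Maximum number of legal moves with n discs and k pegs."""
    if n >= k:
        return k * (k - 1) // 2
    return k * (k - 1) // 2 - (k - n) * (k - n - 1) // 2

def legal_moves(state):
    """Count legal moves from a state (list of stacks, top of stack last):
    a top disc may move onto an empty peg or onto a larger top disc."""
    count = 0
    for src, src_stack in enumerate(state):
        if not src_stack:
            continue
        for dst, dst_stack in enumerate(state):
            if dst != src and (not dst_stack or dst_stack[-1] > src_stack[-1]):
                count += 1
    return count

# Worst case: every peg occupied, e.g. the k-1 smallest discs alone on their
# own pegs and the remaining discs stacked on the last peg.
k, n = 5, 8
state = [[d] for d in range(1, k)] + [list(range(n, k - 1, -1))]
print(max_branching_factor(n, k), legal_moves(state))   # 10 10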
I understand the part of the paper where they trick the CPU into speculatively loading part of the victim's memory into the CPU cache. The part I do not understand is how they retrieve it from the cache.
They don't retrieve it directly (out of bounds read bytes are not "retired" by the CPU and cannot be seen by the attacker in the attack).
The attack vector is to do the "retrieval" one bit at a time. After the CPU cache has been prepared (flushing the lines that have to be flushed), and the CPU has been "taught" that an if branch is taken while its condition relies on non-cached data, the CPU speculatively executes the couple of lines inside the if scope, including an out-of-bounds access (giving a byte B), and then immediately accesses some authorized, non-cached array at an index that depends on one bit of the secret B (B itself is never directly seen by the attacker). Finally, the attacker reads the same authorized array at the index corresponding to, say, that bit being zero: if that read is fast, the data was still in the cache, meaning the bit of B is zero. If the read is (relatively) slow, the CPU had to load that data into its cache, meaning it wasn't there earlier, meaning the bit of B was one.
For instance, assume Cond is not cached, none of ValidArray is cached, and LargeEnough is big enough to ensure the CPU will not load both ValidArray[ valid-index + 0 ] and ValidArray[ valid-index + LargeEnough ] into its cache in one shot:
if ( Cond ) {
// the next 2 lines are only speculatively executed
V = SomeArray[ out-of-bounds-attacked-index ]
Dummy = ValidArray [ valid-index + ( V & bit ) * LargeEnough ]
}
// the next code is always retired (executed, not only speculatively)
t1 = get_cpu_precise_time()
Dummy2 = ValidArray [ valid-index ]
diff = get_cpu_precise_time() - t1
if (diff > SOME_CALCULATED_VALUE) {
// bit was its value (1, or 2, or 4, or ... 128)
}
else {
// bit was 0
}
where bit is tried successively, first as 0x01, then 0x02, ... up to 0x80. By measuring the time (number of CPU cycles) the "next" code takes for each bit, the value of V is revealed:
if ValidArray[ valid-index + 0 ] is in the cache, V & bit is 0
otherwise V & bit is bit
This takes time: each bit requires preparing the CPU L1 cache, and the same bit is tried several times to minimize timing errors, etc.
Then the correct attack "offset" has to be determined to read an interesting area.
Clever attack, but not so easy to implement.
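Purely as an illustration of this bit-by-bit reconstruction (not part of the original description), here is a tiny Python sketch. The helper access_is_slow() is hypothetical; it stands in for the whole prepare-the-cache / mistrain-the-branch / time-the-reload procedure for one bit mask.
def recover_byte(access_is_slow):
    """Rebuild the secret byte V from eight timing measurements, one per bit mask."""
    v = 0
    for bit in (0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80):
        # A slow reload of ValidArray[valid-index] means the speculative code
        # touched the "LargeEnough" slot instead, i.e. V & bit was set.
        if access_is_slow(bit):
            v |= bit
    return v

# Fake measurement for a secret byte of 0x5A, just to show how V is assembled:
print(hex(recover_byte(lambda bit: bool(0x5A & bit))))   # 0x5a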
how they retrieve it from cache
Basically, the secret retrieved speculatively is immediately used as an index to read from another array, called side_effects. All we need is to "touch" an index in the side_effects array, so that the corresponding element gets loaded from memory into the CPU cache:
secret = base_array[huge_index_to_a_secret];
tmp = side_effects[secret * PAGE_SIZE];
Then the latency to access each element in the side_effects array is measured and compared to the main memory access time:
for (i = 0; i < 256; i++) {
start = time();
tmp = side_effects[i * PAGE_SIZE];
latency = time() - start;
if (latency < MIN_MEMORY_ACCESS_TIME)
return i; // so, that was the secret!
}
If the latency is lower than the minimum memory access time, the element is in the cache, so the secret was the current index. If the latency is high, the element is not in the cache, so we continue our measurements.
So, basically we do not retrieve any information directly, rather we touch some memory during the speculative execution and then observe the side effects.
Here is a Spectre-based Meltdown proof of concept in 99 lines of code, which you might find easier to understand than the other PoCs:
https://github.com/berestovskyy/spectre-meltdown
In general, this technique is called a side-channel attack, and more information can be found on Wikipedia: https://en.wikipedia.org/wiki/Side-channel_attack
I would like to contribute one piece of information to the already existing answers, namely how the attacker can actually probe an array from the victim process in the probing phase. This is a problem, because Spectre (unlike Meltdown) runs in the victim's process and even through the cache the attacker cannot just query arrays from other processes.
In short: With Spectre the FLUSH+RELOAD attack needs KSM or another method for shared memory. That way the attacker (to my understanding) can replicate the relevant parts of the victim's memory in his own address space and thus will be able to query the cache for the access times on the probe array.
Long Explanation:
One big difference between Meltdown and Spectre is that in Meltdown the whole attack runs in the address space of the attacker. Thus, it's quite clear how the attacker can both cause changes to the cache and read the cache at the same time. With Spectre, however, the attack itself runs in the process of the victim. By using so-called gadgets, the victim executes code that writes the secret data into the index of a probe array, e.g. with a = array2[array1[x] * 4096].
The proof-of-concepts that have been linked in other answers implement the basic branching/speculation concept of Spectre, but all code seems to run in the same process. Thus, of course it is no problem to have gadget code write to array2 and then read array2 for probing. In a real-world scenario, however, the victim process would write to array2 which is also located in the victim process.
Now, the problem - which the paper in my opinion does not explain well - is that the attacker has to be able to probe the cache for the array in the victim's address space (array2). Theoretically, this could be done either from within the victim again or from the attacker's address space.
The original paper only describes it vaguely, probably because it was clear to the authors:
For the final phase, the sensitive data is recovered. For Spectre attacks using Flush+Reload or Evict+Reload, the recovery process consists of timing the access to memory addresses in the cache lines being monitored.
To complete the attack, the adversary measures which location in array2 was brought into the cache, e.g., via Flush+Reload or Prime+Probe.
Accessing the cache for array2 from within the victim's address space would be possible, but it would require another gadget and the attacker would have to be able to trigger execution of this gadget. This seemed quite unrealistic to me, especially in Spectre-PHT.
In the paper Detecting Spectre Attacks by identifying Cache Side-Channel Attacks using Machine Learning I found my missing explanation:
In order for the FLUSH+RELOAD attack to work in this case,
three preconditions have to be met. [...] But most
importantly the CPU must have a mechanism like Kernel Same-page Merging (KSM) [4] or Transparent Page Sharing (TPS) [54]
enabled [10].
KSM allows processes to share pages by merging different virtual
addresses into the same page, if they reference the same physical
address. It thereby increases the memory density, allowing for a
more efficient memory usage. KSM was first implemented in Linux
2.6.32 and is enabled by default [33].
KSM explains how the attacker can access array2 that normally would only be available within the victim's process.
I'm working with a serial protocol. Messages are of variable length that is known in advance. On both transmission and reception sides, I have the message saved to a shift register that is as long as the longest possible message.
I need to calculate CRC32 of these registers, the same as for Ethernet, as fast as possible. Since messages are variable length (anything from 12 to 64 bits), I chose serial implementation that should run already in parallel with reception/transmission of the message.
I ran into a problem with organizing the data before the calculation. As specified here, the data needs to be bit-reversed, padded with 32 zeros and complemented before the calculation.
Even if I set aside the part about running in parallel with receiving or transmitting data, how can I efficiently extract only my relevant message from the max-length register so that I can pad it before the calculation? I know that ideas like
newregister[31:0] <= oldregister[X:0] // X is my variable length
don't work. It's also impossible to have the generate for-loop that I use to bit-reverse the old vector run a variable number of times. I could use a counter to serially move the data to the desired length, but I cannot afford to lose that much time.
Alternatively, is there an operation that would directly give me the padded and complemented result? I don't even know how to start developing such an idea.
Thanks in advance for any insight.
You've misunderstood how to do a serial CRC; the Python question you quote isn't relevant. You only need a 32-bit shift register, with appropriate feedback taps. You'll get a million hits if you do a Google search for "serial crc" or "ethernet crc". There's at least one Xilinx app note that does the whole thing for you. You'll need to be careful to preload the 32-bit register with the correct value, and whether or not you invert the 32-bit data on completion.
EDIT
The first hit on 'xilinx serial crc' is xapp209, which has the basic answer in fig 1. On top of this, you need the taps, the preload value, whether or not to invert the answer, and the value to check against on reception. I'm sure they used to do all this in another app note, but I can't find it at the moment. The basic references are the Ethernet 802.3 spec (3.2.8 Frame Check Sequence field, which was p27 in the original book), and the V42 spec (8.1.1.6.2 32-bit frame check sequence, page 311 in the old CCITT Blue Book). Both give the taps. V42 requires a preload to all 1's, inversion on completion, and gives the test value on reception. Warren has a (new) chapter in Hacker's Delight, which shows the taps graphically; see his website.
You only need the online generators to check your solution. Be careful, though: they will generally have different preload values, and may or may not invert the result, and may or may not be bit-reversed.
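As a cross-check for a hardware implementation, the following is a small software reference model (my own sketch, not from the app note) of the bit-serial Ethernet CRC-32: a 32-bit register preloaded with all 1's, one data bit processed per step with the reflected polynomial 0xEDB88320, and the result inverted at the end. The LSB-first-per-byte bit ordering is an assumption you must match to your own serialization.
import zlib

def crc32_serial(bits):
    """Bit-serial Ethernet CRC-32 (reflected form): preload the 32-bit register
    with all 1's, shift in one data bit per step, invert the result at the end."""
    crc = 0xFFFFFFFF
    for bit in bits:                      # data bits, LSB-first within each byte
        feedback = (crc ^ bit) & 1
        crc >>= 1
        if feedback:
            crc ^= 0xEDB88320             # reflected form of 0x04C11DB7
    return crc ^ 0xFFFFFFFF

def bytes_to_bits(data):
    """LSB-first bit order within each byte, as Ethernet serializes octets."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]

# Sanity check against a known-good CRC-32 (same parameters as Ethernet):
msg = b"123456789"
assert crc32_serial(bytes_to_bits(msg)) == zlib.crc32(msg)   # both 0xCBF43926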
Since X is a variable, you will need to do the bit assignments with a for-loop. The for-loop needs to be inside an always block, and the for-loop must statically unroll (i.e. the starting index, ending index, and step value must be constants).
for(i=0; i<32; i=i+1) begin
if (i<X)
newregister[i] <= oldregister[i];
else
newregister[i] <= 1'b0; // pad zeros
end
I am using the Intel Xeon Phi coprocessor, which has up to 240 threads, and I am working on minimizing the number of threads used for a particular application (or maximize performance) while being within a percentage of the best execution time. So for example if I have the following measurements:
Threads | Execution time
240     | 100 s
200     | 105 s
150     | 107 s
120     | 109 s
100     | 120 s
I would like to select a number of threads between 120 and 150, since the "performance curve" there seems to stabilize and the reduction in execution time is not that significant (in this case, around 15% of the best measured time). I did this using an exhaustive search algorithm (measuring from 1 to 240 threads), but my problem is that it takes too long for smaller numbers of threads (obviously depending on the size of the problem).
To try to reduce the number of measurements, I developed a sort of "binary search" algorithm. Basically I have an upper and a lower limit (beginning at 0 and 240 threads), I take the value in the middle and measure it along with 240 threads, and I compute the percent difference between both values. If the difference is larger than 15% (this value was selected after analyzing the results of the exhaustive search) then the midpoint becomes the new lower bound (120-240), and if it is smaller it becomes the new upper bound (0-120); if I get a better execution time I store it as the best execution time.
The problem with this algorithm is that, first of all, this is not necessarily a sorted array of execution times, and for some problem sizes the exhaustive search results show two different minima; for example, in one case I get the best performance at 80 threads and at 170, and I would like the search to return 80, not 170 threads. However, for the other cases where there is only one minimum, the algorithm found a value very close to the expected one.
If anyone has a better idea or knows of an existing search algorithm or heuristic that could help me I would be really grateful.
I take it that your goal is to get the best relative performance for the smallest number of threads, while still maintaining some limit on performance based on a coefficient (<= 1) of the best possible performance. I.e., if the coefficient is 0.85, then the performance should be no less than 85% of the performance using all threads.
It seems like what you should be trying to do is simply find the minimum number of threads required to obtain the performance bound. Rather than looking at 1-240 threads, start at 240 threads and reduce the number of threads until you can place a lower bound on the performance limit. You can then work up from the lower bound in such a way that you find the minimum without passing over it. If you don't have a predefined performance bound, you can calculate one on the fly based on diminishing returns.
As long as the performance limit has not been exceeded, halve the number of threads (starting with the max number of threads). The count that exceeds the performance limit is a lower bound on the number of threads required.
Starting at the lower bound on the number of threads, Z, add m threads if they can be added without getting within the performance limit. Repeatedly double the number of threads added until you get within the performance limit. If adding the threads gets within the performance limit, subtract the last addition and reset the number of threads to be added to m. If even just adding m gets within the limit, then add the last m threads and return the number of threads.
It might be clearer to give an example of what the process looks like step by step, where "Passed" means that the number of threads is still outside of the performance limit, and "Failed" means it is either on the performance limit or inside of it.
Try adding 1m (Z + 1m). Passed. Threads = Z + m.
Try adding 2m (Z + 3m). Passed. Threads = Z + 3m.
Try adding 4m (Z + 7m). Failed. Threads = Z + 3m. Reset.
Try adding 1m. Passed. Threads = Z + 4m.
Try adding 2m. Passed. Threads = Z + 6m.
Z + 7m failed earlier so reset.
Comparisons/lookups are cheap, use them to prevent duplication of work.
Try adding 1m. Failed. Threads = Z + 6m. Reset.
Cannot add less than 1m and still stay outside of the performance limit.
The solution is Z + 7m threads.
Since Z + 6m is m threads short of the performance limit.
It's a bit inefficient, but it does find the minimum number of threads (>= Z) required to obtain the performance bound, to within an error of m-1 threads, using only O(log (N-Z)) tests. This should be enough in most cases; if it isn't, just skip step 1 and use Z=m. Unless increasing the number of threads rapidly decreases the run-time, causing very slow runs when Z is very small. In that case, doing step 1 and using interpolation can give an idea of how quickly the run-time increases as the number of threads decreases, which is also useful for determining a good performance limit if none is given.
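For what it's worth, here is a Python sketch of the two-phase search described above, with the galloping step written slightly more compactly (the failing count and the smallest acceptable count are tracked separately instead of "subtracting the last addition"). run_time() is a hypothetical callback that performs one measurement; the timing model in the example is synthetic.
def min_threads_within_limit(run_time, max_threads, coefficient=0.85, m=4):
    """Find (approximately) the smallest thread count whose execution time is
    within 1/coefficient of the all-threads time, to a granularity of m."""
    best_time = run_time(max_threads)
    limit = best_time / coefficient          # largest acceptable execution time
    cache = {max_threads: best_time}

    def meets(t):
        if t not in cache:
            cache[t] = run_time(t)           # one measurement per distinct count
        return cache[t] <= limit

    # Step 1: halve the thread count until the limit is exceeded; that count
    # (called Z in the answer) is a lower bound on the threads required.
    z = max_threads
    while z > 1 and meets(z // 2):
        z //= 2
    lower = max(z // 2, 1)

    # Step 2: gallop up from the lower bound, doubling the step while the limit
    # is still exceeded and resetting the step to m whenever a count meets it.
    best = max_threads                       # smallest count known to meet the limit
    step = m
    while lower + m < best:
        candidate = min(lower + step, best - 1)
        if meets(candidate):
            best, step = candidate, m          # acceptable: tighten from above, reset
        else:
            lower, step = candidate, step * 2  # still too slow: advance the lower end
    return best

# Example with a synthetic timing model (diminishing returns with more threads):
fake_run_time = lambda t: 100.0 + 2000.0 / t
print(min_threads_within_limit(fake_run_time, 240))   # 76 (true minimum here is 73)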
I'm trying to introduce Hoopl into a compiler and ran into a problem: building a graph for Hoopl makes the nodes appear in the order in which their labels were introduced.
Eg:
(define (test) (if (eq? (random) 1 ) 2 (if (eq? (random) 2 ) 3 0) ) )
"compiles" to
L25: call-direct random -> _tmp7_6
branch L27
L26: return RETVAL
L27: iconst 1 _tmp8_7
branch L28
L28: call-direct eq? _tmp7_6, _tmp8_7 -> _tmp4_8
branch L29
L29: cond-branch _tmp4_8 L30 L31
L30: iconst 2 RETVAL
branch L26
L31: call-direct random -> _tmp12_10
branch L32
L32: iconst 2 _tmp13_11
branch L33
L33: call-direct eq? _tmp12_10, _tmp13_11 -> _tmp9_12
branch L34
L34: cond-branch _tmp9_12 L36 L37
L35: assign RETVAL _tmp6_15
branch L26
L36: iconst 3 _tmp6_15
branch L35
L37: iconst 0 _tmp6_15
branch L35
The order of instructions (in the order of showGraph) is strange because of the order in which the graph is built recursively from the AST. In order to generate code I need to reorder the blocks in a more natural way, say, place return RETVAL at the end of the function, and merge blocks like this
branch Lx:
Lx: ...
into one block, and so on. It seems that I need something like:
block1 = get block
Ln = get_last jump
block2 = find block Ln
if (some conditions)
remove block2
replace block1 (merge block1 block2)
I'm totally confused about how to perform this with Hoopl. Of course, I could dump all the nodes and then perform the transformations outside the Hoopl framework, but I believe that this is a bad idea.
Can someone give me a clue? I did not find any useful examples. Something similar is performed in the Lambdachine project, but it seems too complicated.
There is also another question. Is there any point in making all Call instructions non-local? What is the point of this, considering that the implementation of Call does not change any local variables and always transfers control to the next instruction of the block? If Call instructions are defined like
data Insn e x where
Call :: [Expr] -> Expr -> Label -> Insn O C -- last instruction of the block
then the graph looks even more strange. So instead I use
-- what's the difference from any other primitive, like "add a b -> c"?
Call :: [Expr] -> Expr -> Label -> Insn O O
Maybe I'm wrong about this?
It is possible to implement the "block merging" using Hoopl. Your question is too generic, so I'll give you a plan:
Determine what analysis type this optimization requires (either forward or backward)
Design the analysis lattice
Design the transfer function
Design the rewriting function
Create a pass
Merge the pass with other passes of the same direction so they interleave
Run the pass using fuel
Convert optimized graph back to the form you need
Which steps do you have problems with? Steps 1 and 2 should be rather straightforward once you've read the papers.
You should also understand the general concept of a basic block: why instructions are merged into blocks, why the control flow graph consists of blocks and not of individual instructions, and why analysis is performed on blocks rather than on individual instructions.
Your rewrite function should use the facts to rewrite the last node in the block. So the fact lattice should include not only "information about reachability", but also the destination blocks themselves.
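Hoopl specifics aside, the merging rule itself is simple: a block that ends in an unconditional branch to a block with exactly one predecessor can absorb that block. Below is a purely illustrative Python sketch of that transformation on a toy CFG (the labels and instruction tuples are made up); in Hoopl you would express the same idea as a rewrite of the last node, as described above.
def successors(block):
    """Labels a block can transfer control to, judging by its last instruction."""
    op = block[-1]
    if op[0] == 'branch':
        return [op[1]]
    if op[0] == 'cond-branch':
        return [op[2], op[3]]
    return []                                # e.g. ('return',)

def merge_blocks(cfg, entry):
    """Repeatedly absorb a block into its unique predecessor when that
    predecessor ends in an unconditional branch to it."""
    preds = {}
    for label, block in cfg.items():
        for s in successors(block):
            preds.setdefault(s, set()).add(label)
    changed = True
    while changed:
        changed = False
        for label in list(cfg):
            if label not in cfg:
                continue                     # already absorbed in this pass
            block = cfg[label]
            op = block[-1]
            if op[0] != 'branch':
                continue
            target = op[1]
            if target != entry and target != label and preds.get(target) == {label}:
                cfg[label] = block[:-1] + cfg.pop(target)   # drop the branch, absorb
                for s in successors(cfg[label]):            # repoint predecessor sets
                    preds.setdefault(s, set()).discard(target)
                    preds[s].add(label)
                changed = True
    return cfg

# Toy CFG in the spirit of the listing in the question:
cfg = {
    'L27': [('iconst', 1, '_tmp8_7'), ('branch', 'L28')],
    'L28': [('call-direct', 'eq?'), ('branch', 'L29')],
    'L29': [('cond-branch', '_tmp4_8', 'L30', 'L31')],
    'L30': [('iconst', 2, 'RETVAL'), ('return',)],
    'L31': [('iconst', 0, 'RETVAL'), ('return',)],
}
print(list(merge_blocks(cfg, 'L27')))   # ['L27', 'L30', 'L31']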
I've found and tried a couple of ways to do the trick:
Using foldBlockNodesF3 function or other foldBlockNodes... functions
Using preorder_dfs* functions (like in Lambdachine project)
Build the graph with larger blocks from the start
The last option is not useful for me, because FactBase is keyed by labels, so every instruction that changes the liveness of variables needs a label to be usable in the subsequent analysis.
So, my final solution is to use the foldBlockNodesF3 function to linearize the graph and delete the extra labels manually, with simultaneous register allocation.