How does the Spectre attack read the cache it tricked the CPU into loading?

I understand the part of the paper where they trick the CPU into speculatively loading part of the victim's memory into the CPU cache. The part I do not understand is how they retrieve it from the cache.

They don't retrieve it directly: out-of-bounds reads are never "retired" by the CPU, so their result is never visible to the attacker as data.
One way to do the "retrieval" is a bit at a time. After the CPU cache has been prepared (flushing the lines that need to be flushed) and the branch predictor has been "taught" that an if branch is taken while its condition depends on uncached data, the CPU speculatively executes the couple of lines inside the if, including an out-of-bounds access (yielding a byte B that the attacker will never see directly), and then immediately accesses some authorized, uncached array at an index that depends on one bit of B. Finally, the attacker times a (retired) access to that same authorized array at the index corresponding to the bit being 0: if that access is fast, the data was already in the cache, so the bit of B is 0; if it is (relatively) slow, the CPU had to load that data into the cache, meaning it wasn't there earlier, so the bit of B is 1.
For instance: assume Cond is uncached, no part of ValidArray is cached, and LargeEnough is big enough to ensure the CPU will not bring both ValidArray[ valid-index + 0 ] and ValidArray[ valid-index + LargeEnough ] into the cache in one shot.
if ( Cond ) {
    // the next 2 lines are only speculatively executed
    V = SomeArray[ out-of-bounds-attacked-index ]
    Dummy = ValidArray[ valid-index + ( V & bit ) * LargeEnough ]
}
// the code below is always retired (really executed, not only speculated)
t1 = get_cpu_precise_time()
Dummy2 = ValidArray[ valid-index ]
diff = get_cpu_precise_time() - t1
if (diff > SOME_CALCULATED_VALUE) {
    // the access was slow: V & bit == bit, the tested bit is 1
}
else {
    // the access was fast: V & bit == 0, the tested bit is 0
}
where bit takes the values 0x01, then 0x02, ... up to 0x80 in successive runs. By measuring the time (number of CPU cycles) the retired code takes for each bit, the value of V is revealed:
if ValidArray[ valid-index + 0 ] is in the cache, V & bit is 0,
otherwise V & bit is bit.
This takes time: each bit requires preparing the CPU L1 cache again, and the same bit is usually tried several times to average out timing noise, etc.
Then the right attack "offset" has to be found in order to read an interesting area.
A clever attack, but not so easy to implement.
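To make the "prepare the CPU cache" step concrete, here is a small illustrative sketch for x86 (mine, not from the answer above), using the _mm_clflush and _mm_mfence intrinsics; the array and variable names simply mirror the pseudocode and are assumptions, not real victim symbols:

#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */
#include <stdint.h>
#include <stddef.h>

/* Illustrative stand-ins for the names used in the pseudocode above. */
static uint8_t ValidArray[2 * 4096 + 64];
static size_t  some_array_length = 16;   /* the value Cond depends on */

/* Flush everything the later timing step depends on, so that a cache hit can
 * only come from whatever the speculatively executed code touched. */
static void prepare_cache(size_t valid_index, size_t large_enough)
{
    _mm_clflush(&some_array_length);                      /* make Cond slow to resolve  */
    _mm_clflush(&ValidArray[valid_index]);                /* probe line for "bit is 0"  */
    _mm_clflush(&ValidArray[valid_index + large_enough]); /* probe line for "bit is 1"  */
    _mm_mfence();                                         /* wait until flushes complete */
}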

"how they retrieve it from cache"
Basically, the secret that was read speculatively is immediately used as an index to read from another array, called side_effects here. All we need is to "touch" one index in the side_effects array, so that the corresponding element gets loaded from memory into the CPU cache:
secret = base_array[huge_index_to_a_secret];
tmp = side_effects[secret * PAGE_SIZE];
Then the latency of accessing each element of the side_effects array is measured and compared to a normal memory access time:
for (i = 0; i < 256; i++) {
    start = time();
    tmp = side_effects[i * PAGE_SIZE];
    latency = time() - start;
    if (latency < MIN_MEMORY_ACCESS_TIME)
        return i;   // so, that was the secret!
}
If the latency is lower than the minimum memory access time, the element is in the cache, so the secret was the current index. If the latency is high, the element is not in the cache, and we continue our measurements.
So, basically, we do not retrieve any information directly; rather, we touch some memory during the speculative execution and then observe the side effects.
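For completeness, here is a hedged sketch of how the abstract time() above is typically implemented on x86 in such PoCs, using the __rdtscp intrinsic plus a fence; this is my illustration, not code from the linked PoC, and the name timed_load is made up:

#include <x86intrin.h>   /* __rdtscp */
#include <emmintrin.h>   /* _mm_lfence */
#include <stdint.h>

/* Return how many TSC ticks a single load from `addr` takes.  A cache hit is
 * typically a few tens of ticks and a miss a few hundred, which is the gap
 * the MIN_MEMORY_ACCESS_TIME threshold above is meant to separate. */
static inline uint64_t timed_load(const volatile uint8_t *addr)
{
    unsigned int aux;
    uint64_t t0 = __rdtscp(&aux);   /* waits for earlier instructions to finish */
    (void)*addr;                    /* the probed access                        */
    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();                   /* keep later code from being hoisted above */
    return t1 - t0;
}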
Here is a Spectre-based Meltdown proof of concept in 99 lines of code that you might find easier to understand than the other PoCs:
https://github.com/berestovskyy/spectre-meltdown
In general, this technique is called a side-channel attack, and more information can be found on Wikipedia: https://en.wikipedia.org/wiki/Side-channel_attack

I would like to contribute one piece of information to the already existing answers, namely how the attacker can actually probe an array from the victim process in the probing phase. This is a problem, because Spectre (unlike Meltdown) runs in the victim's process and even through the cache the attacker cannot just query arrays from other processes.
In short: With Spectre the FLUSH+RELOAD attack needs KSM or another method for shared memory. That way the attacker (to my understanding) can replicate the relevant parts of the victim's memory in his own address space and thus will be able to query the cache for the access times on the probe array.
Long Explanation:
One big difference between Meltdown and Spectre is that in Meltdown the whole attack runs in the address space of the attacker. Thus, it's quite clear how the attacker can both cause changes to the cache and read the cache at the same time. With Spectre, however, the attack itself runs in the process of the victim. By using so-called gadgets, the victim is made to execute code that uses the secret data as an index into a probe array, e.g. with a = array2[array1[x] * 4096].
The proof-of-concepts that have been linked in other answers implement the basic branching/speculation concept of Spectre, but all code seems to run in the same process. Thus, of course it is no problem to have gadget code write to array2 and then read array2 for probing. In a real-world scenario, however, the victim process would write to array2 which is also located in the victim process.
Now, the problem - which the paper in my opinion does not explain well - is that the attacker has to be able to probe the cache for the victim's address space array (array2). Theoretically, this could be done either from within the victim again or from the attackers address space.
The original paper only describes it vaguely, probably because it was clear to the authors:
For the final phase, the sensitive data is recovered. For Spectre attacks using Flush+Reload or Evict+Reload, the recovery process consists of timing the access to memory addresses in the cache lines being monitored.
To complete the attack, the adversary measures which location in array2 was brought into the cache, e.g., via Flush+Reload or Prime+Probe.
Accessing the cache for array2 from within the victim's address space would be possible, but it would require another gadget and the attacker would have to be able to trigger execution of this gadget. This seemed quite unrealistic to me, especially in Spectre-PHT.
In the paper Detecting Spectre Attacks by identifying Cache Side-Channel Attacks using Machine Learning I found my missing explanation:
In order for the FLUSH+RELOAD attack to work in this case, three preconditions have to be met. [...] But most importantly the CPU must have a mechanism like Kernel Same-page Merging (KSM) [4] or Transparent Page Sharing (TPS) [54] enabled [10].
KSM allows processes to share pages by merging different virtual addresses into the same page, if they reference the same physical address. It thereby increases the memory density, allowing for a more efficient memory usage. KSM was first implemented in Linux 2.6.32 and is enabled by default [33].
KSM explains how the attacker can access array2, which would normally only be available within the victim's process.
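To illustrate the KSM mechanism itself (my own sketch, not from the cited paper): on Linux a process opts its pages into same-page merging with madvise(MADV_MERGEABLE); once ksmd has merged two identical pages, both processes are backed by the same physical page, which is what makes a cross-process Flush+Reload on array2 possible. The helper name below is made up:

#define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS / MADV_MERGEABLE */
#include <sys/mman.h>
#include <string.h>
#include <stddef.h>

/* Sketch: map a private, page-aligned copy of some content and mark it
 * MADV_MERGEABLE.  If the victim holds identical page content and ksmd is
 * running (CONFIG_KSM, /sys/kernel/mm/ksm/run = 1), the kernel eventually
 * backs both mappings with one shared physical page, so cache lines loaded
 * by the victim become observable to the attacker. */
static void *map_mergeable_copy(const void *content, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    memcpy(p, content, len);
    madvise(p, len, MADV_MERGEABLE);   /* opt these pages into same-page merging */
    return p;
}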

Invalidate range by virtual address in dcache_inval_poc(start,end); ARMV8; Cache;

I'm confused by the implementation of dcache_inval_poc(start, end): https://github.com/torvalds/linux/blob/v5.15/arch/arm64/mm/cache.S#L134. There is no sanity check on the "end" address, so what happens if the range (start, end) passed from an upper layer, such as dma_sync_single_for_cpu/dma_sync_single_for_device, is larger than the L1 data cache? E.g. dcache_inval_poc(start, start + 256KB), while the L1 D-cache size is 32KB.
After going through the source code of dcache_inval_poc(start, end) at https://github.com/torvalds/linux/blob/v5.15/arch/arm64/mm/cache.S#L152, I tried to convert the loop into C-like pseudo-code as follows:
x0_kaddr = start;
while (x0_kaddr < end) {
    dc_civac(x0_kaddr);          /* clean & invalidate this line by VA */
    x0_kaddr += cache_line_size;
}
If "end - start" > L1 D-cache size, the loop will still run, however, the "x0_kaddr" address no longer exists in the D-cache.
Your confusion comes from the fact that you are thinking in terms of cache lines somehow mapped on top of some memory range. But the function really is "invalidate a range by virtual address", expressed in terms of the mapped memory that is available.
As long as the start and end parameters are valid virtual addresses of normal memory, it is fine.
The memory range does not have to be cached as a whole; only some of the data in the given range might be cached, or none at all.
So say there is a 2 MB buffer in physical DDR memory that is mapped and can be accessed through virtual addresses.
Say L1 is 32 KB.
Then at most 32 KB of the 2 MB buffer can be cached at any moment (possibly none at all), and you don't know which part, if any, is in the cache.
For that reason you run a loop over the virtual addresses of your 2 MB buffer. If the cache_line_size block of data at the current address is in the cache, that cache line gets invalidated. If the data is only in DDR memory and not in the cache, the operation is basically a nop.
It is good practice to provide start and end addresses aligned to cache_line_size, because the low address bits are clipped to the cache-line boundary and you might otherwise miss cleaning some data at the tail of the buffer.
PS: if you want to operate directly on cache lines, there are other functions for that, and they take way and set parameters to address cache lines directly.
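To make "aligned to cache_line_size" concrete, here is a small illustrative helper (mine, not from the kernel); it assumes the buffer itself was allocated cache-line aligned and sized, so rounding the range outwards never touches anyone else's data:

#include <stdint.h>

/* Round an arbitrary [start, end) range outwards to whole cache lines before
 * handing it to an invalidate-by-VA routine.  Only safe when the buffer really
 * owns those whole lines; otherwise the invalidate could throw away a
 * neighbour's dirty data. */
static inline void range_to_cache_lines(uintptr_t *start, uintptr_t *end,
                                        uintptr_t line /* power of two, e.g. 64 */)
{
    *start &= ~(line - 1);                      /* round start down */
    *end    = (*end + line - 1) & ~(line - 1);  /* round end up     */
}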

Getting the percentage of used space and used inodes in a mount

I need to calculate the percentage of used space and used inodes for a mount path (e.g. /mnt/mycustommount) in Go.
This is my attempt:
var statFsOutput unix.Statfs_t
err := unix.Statfs(mount_path, &statFsOutput)
if err != nil {
return err
}
totalBlockCount := statFsOutput.Blocks      // Total data blocks in filesystem
freeSystemBlockCount := statFsOutput.Bfree  // Free blocks in filesystem
freeUserBlockCount := statFsOutput.Bavail   // Free blocks available to unprivileged user
Now the proportion I need would be something like this:
x : 100 = (totalBlockCount - free[which?]BlockCount) : totalBlockCount
i.e. x : 100 = usedBlockCount : totalBlockCount. But I don't understand the difference between Bfree and Bavail (what's an 'unprivileged' user got to do with filesystem blocks?).
For inodes my attempt:
totalInodeCount = statFsOutput.Files
freeInodeCount = statFsOutput.Ffree
// so now the proportion is
// x : 100 = (totalInodeCount - freeInodeCount) : totalInodeCount
How do I get the percentage of used storage?
And is my inode calculation correct?
Your comment expression isn't valid Go, so I can't interpret it without guessing. With guessing, it looks correct, but I can only judge the code I imagine you will write, not the code you will actually write; if the two differ, my judgement of the imagined code is irrelevant.
That aside, I can answer your question here:
(what's an 'unprivileged' user got to do with filesystem blocks?)
The Linux statfs call uses the same fields as 4.4BSD. The default 4.4BSD file system (the one called the "fast file system") uses a blocks-with-fragmentation approach to allocate blocks in a sort of stochastic manner. This allocation process works very well on an empty file system, and continues to work well, without extreme slowdown, on somewhat-full file systems. Computerized modeling of its behavior, however, showed pathological slowdowns (amounting to linear search, more or less) were possible if the block usage exceeded somewhere around 90%.
(Later, analysis of real file systems found that the slowdowns generally did not hit until the block usage exceeded 95%. But the idea of a 10% "reserve" was pretty well established by then.)
Hence, if a then-popular large disk drive of 400 MB [1] gave 10% for inodes and another 10% for reserved blocks, that meant that ordinary users could allocate about 320 MB of file data. At that point the drive was "100% full", but it could go to 111% by using up the remaining blocks. Those blocks were reserved for the super-user, though.
These days, instead of a "super user", one can have a capability that can be granted or revoked. However, these days we don't use the same file systems either. So there may be no difference between bfree and bavail on your system.
[1] Yes, the 400 MB Fujitsu Eagle was a large drive back then (in multiple senses: it used a 19 inch rack mount setup). People are spoiled today with their multi-terabyte SSDs. 😀
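To make the arithmetic concrete: df-style tools typically charge the used percentage against what an unprivileged user can actually reach, i.e. used = Blocks - Bfree and the denominator is used + Bavail, and the inode proportion from the question is correct as written. Below is a minimal sketch of both computations in C with statvfs (my illustration; the Go unix.Statfs_t fields Blocks / Bfree / Bavail / Files / Ffree correspond to f_blocks / f_bfree / f_bavail / f_files / f_ffree):

#include <stdio.h>
#include <sys/statvfs.h>

/* Sketch: used-space and used-inode percentages for a mount point.  The
 * reserved (root-only) blocks, i.e. f_bfree - f_bavail, are excluded from
 * what an ordinary user can fill. */
static int print_usage(const char *mount_path)
{
    struct statvfs s;
    if (statvfs(mount_path, &s) != 0)
        return -1;

    unsigned long long used  = s.f_blocks - s.f_bfree;  /* blocks already in use    */
    unsigned long long avail = s.f_bavail;              /* still usable by non-root */
    double space_pct = (used + avail) ? 100.0 * used / (used + avail) : 0.0;

    unsigned long long used_inodes = s.f_files - s.f_ffree;
    double inode_pct = s.f_files ? 100.0 * used_inodes / s.f_files : 0.0;

    printf("space: %.1f%% used, inodes: %.1f%% used\n", space_pct, inode_pct);
    return 0;
}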

Buffer overflow exploitation 101

I've heard in a lot of places that buffer overflows and illegal indexing in C-like languages may compromise the security of a system. But in my experience all it does is crash the program I'm running. Can anyone explain how buffer overflows could cause security problems? An example would be nice.
I'm looking for a conceptual explanation of how something like this could work. I don't have any experience with ethical hacking.
First, buffer overflows (BOF) are only one way of gaining code execution. When one is exploited, the attacker basically gains control of the process: he can make the process execute arbitrary code with the current process privileges (running the process as a high- or low-privileged user respectively increases or reduces the impact of exploiting a BOF in that application). This is why it is always strongly recommended to run applications with the least privileges needed.
Basically, to understand how a BOF works, you have to understand how the code you have built gets compiled into machine code (assembly) and how the data managed by your software is laid out in memory.
I will try to give you a basic example of one subcategory of BOF, the stack-based buffer overflow:
Imagine you have an application asking the user to provide a username.
This data will be read from user input and then stored in a variable called USERNAME, allocated as a 20-byte array of chars.
For this scenario to work, we will assume the program does not check the length of the user input.
At some point during processing, the user input is copied into the 20-byte USERNAME variable, but since the input is longer (let's say 500 bytes), the data around this variable gets overwritten in memory.
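As a minimal sketch of such a vulnerable function (my illustration; the names are chosen to match the layout shown next):

#include <stdio.h>
#include <string.h>

/* Deliberately unsafe: strcpy() does not know that username is only 20 bytes,
 * so any input longer than 19 characters spills into the neighbouring stack
 * slots and, eventually, into the saved return address.  (The exact ordering
 * of locals on the stack is up to the compiler; the layout below is the
 * idealised one used in this explanation.) */
void login(const char *input)
{
    char username[20];
    int  variable2 = 0;
    int  variable3 = 0;

    strcpy(username, input);   /* no length check: the bug */

    printf("hello %s (%d, %d)\n", username, variable2, variable3);
}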
Imagine a memory layout like this:
size in bytes   20          4            4            4
data            [USERNAME]  [variable2]  [variable3]  [RETURN ADDRESS]
If you define the three local variables USERNAME, variable2 and variable3, they may be stored in memory the way shown above.
Notice the RETURN ADDRESS: this 4-byte memory region stores the address of the instruction that follows the call to the current function. Thanks to it, when the function reaches its end, the program flow naturally goes back to the instruction just after the original call.
If the attacker provides a username made of 24 'A' characters, the memory layout becomes something like this:
size in bytes   20          4            4            4
data            [USERNAME]  [variable2]  [variable3]  [RETURN ADDRESS]
new data        [AAA...AA]  [  AAAA   ]  [variable3]  [RETURN ADDRESS]
Now, if the attacker sends 50 'A' characters as the USERNAME, the memory layout looks like this:
size in bytes   20          4            4            4
data            [USERNAME]  [variable2]  [variable3]  [RETURN ADDRESS]
new data        [AAA...AA]  [  AAAA   ]  [  AAAA   ]  [  AAAA  ] [AAA...]
In this situation, at the end of the function the program will crash, because it tries to return to the invalid address 0x41414141 (the char 'A' is 0x41): the overwritten RETURN ADDRESS does not point to valid code.
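Continuing the illustrative login() sketch from above, triggering that crash could look like this (exact behaviour depends on compiler, stack protector and platform):

#include <string.h>

void login(const char *input);   /* the vulnerable sketch shown earlier */

int main(void)
{
    char attack[51];
    memset(attack, 'A', 50);   /* the 50 'A' characters from the example */
    attack[50] = '\0';
    login(attack);             /* typically crashes trying to return to 0x41414141 */
    return 0;
}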
If you replace the multiple 'A's with carefully chosen bytes, you may be able to:
overwrite RETURN ADDRESS with an interesting location,
place "executable code" in the first 20 + 4 + 4 bytes.
You could, for instance, set RETURN ADDRESS to the address of the first byte of the USERNAME variable (this method is mostly not usable anymore thanks to the many protections, such as non-executable stacks and ASLR, that have been added both to operating systems and to compiled programs).
I know it is quite complex to understand at first, and this explanation is a very basic one. If you want more detail, please just ask.
I suggest you have a look at tutorials like this one, which are quite advanced but more realistic.

Objective C for loops with @autoreleasepool and ARC

As part of an app that allows auditors to create findings and associate photos with them (saved as Base64 strings due to a limitation of the web service), I have to loop through all findings and their photos within an audit and set their sync value to true.
While performing this loop I see a memory spike from around 40 MB up to 500 MB (for roughly 350 photos and 255 findings), and this number never goes down. On average our users create around 1000 findings and 500-700 photos before attempting to use this feature. I have attempted to use @autoreleasepool blocks to keep the memory down, but it never seems to get released.
for (Finding * __autoreleasing f in self.audit.findings) {
    @autoreleasepool {
        [f setToSync:@YES];
        NSLog(@"%@", f.idFinding);
        for (FindingPhoto * __autoreleasing p in f.photos) {
            @autoreleasepool {
                [p setToSync:@YES];
                p = nil;
            }
        }
        f = nil;
    }
}
The relationships and retain cycles look like this
Audit has a strong reference to Finding
Finding has a weak reference to Audit and a strong reference to FindingPhoto
FindingPhoto has a weak reference to Finding
What am I missing in terms of being able to effectively loop through these objects and set their properties without causing such a huge spike in memory? I'm assuming it has something to do with so many Base64 strings being loaded into memory while looping but never being released.
So, first, make sure you have a batch size set on the fetch request. Choose a relatively small number, but not too small because this isn't for UI processing. You want to batch a reasonable number of objects into memory to reduce loading overhead while keeping memory usage down. Try 50 or 100 and see how it goes, then consider upping the batch size a little.
If all of the objects you're loading are managed objects then the correct way to evict them during processing is to turn them into faults. That's done by calling refreshObject:mergeChanges: on the context. BUT - that discards any changes, and your loop is specifically there to make changes.
So, what you should really be doing is batch saving the objects you've modified and then turning those objects back into faults to remove the data from memory.
So, in your loop, keep a counter of how many you've modified and save the context each time you hit that count and refresh all the objects that were processed so far. The batch on the fetch and the batch size to save should be the same number.
There's probably a big difference in size between your "Finding" objects and the associated images. So your primary aim should be to redesign your database so that unfaulting (loading) a Finding object does not automatically load the Base64-encoded image.
That's actually one of the major strengths of Core Data: loading only part of an object hierarchy. Just move the Base64-encoded data into a managed object of its own so that Core Data does not load it along with the Finding; it will still be loaded on demand when the relationship is touched.

DMA memcpy operation in Linux

I want to do DMA using the dma_async_memcpy_buf_to_buf function, which is in dmaengine.c (linux/drivers/dma). For this, I added a function to dmatest.c (linux/drivers/dma) as follows:
void foo(void)
{
    int index = 0;
    dma_cookie_t cookie;
    size_t len = 0x20000;
    ktime_t start, end, end1, end2, end3;
    s64 actual_time;
    u16 *dest;
    u16 *src;

    dest = kmalloc(len, GFP_KERNEL);
    src = kmalloc(len, GFP_KERNEL);

    for (index = 0; index < len/2; index++)
    {
        dest[index] = 0xAA55;
        src[index] = 0xDEAD;
    }

    start = ktime_get();
    cookie = dma_async_memcpy_buf_to_buf(chan, dest, src, len);
    while (dma_async_is_tx_complete(chan, cookie, NULL, NULL) == DMA_IN_PROGRESS)
    {
        dma_sync_wait(chan, cookie);
    }
    end = ktime_get();
    actual_time = ktime_to_ns(ktime_sub(end, start));
    printk("Time taken for function() execution dma: %lld\n", (long long)actual_time);

    memset(dest, 0, len);

    start = ktime_get();
    memcpy(dest, src, len);
    end = ktime_get();
    actual_time = ktime_to_ns(ktime_sub(end, start));
    printk("Time taken for function() execution non-dma: %lld\n", (long long)actual_time);
}
There are some issues with DMA:
Interestingly, the memcpy execution time is less than that of dma_async_memcpy_buf_to_buf. Maybe it is related to a problem with ktime_get().
Is my approach in the foo function correct for performing a DMA operation? I'm not sure about this.
How can I measure the tick counts of the memcpy and dma_async_memcpy_buf_to_buf functions, in terms of CPU usage?
Finally, is a DMA operation possible at application level? So far I have only used it at kernel level, as you can see above (dmatest.c is inserted as a kernel module).
There are multiple issues in your question, which makes it kind of hard to answer exactly what you are asking:
Yes, your general DMA operation invocation algorithm is correct.
The fundamental difference between plain memcpy and a DMA operation for copying memory is not a direct performance gain, but (a) a performance gain from preserving the CPU cache / prefetcher state when using the DMA operation (that state would most likely be garbled by a plain old memcpy executed on the CPU itself), and (b) a true background operation that leaves the CPU available to do other work (see the sketch at the end of this answer).
Given (a), it is kind of pointless to use DMA operations on anything smaller than the CPU cache size, i.e. dozens of megabytes. Typically it is done for fast off-CPU stream processing, i.e. moving data that is produced or consumed by external devices anyway, such as fast network cards, video streaming / capture / encoding hardware, etc.
Comparing async and sync operations in terms of wall-clock elapsed time is wrong. There might be hundreds of threads / processes running, and nobody guarantees that you will be scheduled on the next tick rather than several thousand ticks later.
Using ktime_get for benchmarking purposes is wrong: it is fairly imprecise, especially for such short jobs. Profiling kernel code is in fact a pretty hard and complex task that is well beyond the scope of this question. A quick recommendation here would be to refrain from such micro-benchmarks altogether and to profile a much bigger and more complete job, similar to what you are ultimately trying to achieve.
Measuring "ticks" on modern CPUs is also kind of pointless, although you can use CPU-vendor-specific tools, such as Intel's VTune.
Using DMA copy operations at application level is fairly pointless: at least I cannot come up with a single viable scenario off the top of my head where it would be worth the trouble. It is not inherently faster, and, more importantly, I seriously doubt that your application's performance bottleneck is memory copying. For that to be the case, you would generally have to be doing everything else faster than regular memory copying, and I cannot really think of anything at application level that would be faster than memcpy. And if we are talking about communication with some other, off-CPU processing device, then it is automatically not application level.
Generally, memory copy performance is usually limited by memory speed, i.e. clock frequency and timings. You are not going to get any miracle boost over regular memcpy in direct performance, simply because memcpy executed on the CPU is already fast enough: the CPU usually runs at a 3x-5x-10x higher clock frequency than the memory.
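To illustrate point (b), here is a hedged, untested sketch of the dmaengine flow with a completion callback instead of polling. It assumes chan was obtained with dma_request_chan(), that dst and src are DMA-able (e.g. kmalloc'd) buffers, and the function name is made up; most error handling is trimmed for brevity:

#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
#include <linux/completion.h>

static void copy_done(void *arg)
{
    complete((struct completion *)arg);     /* signalled from the DMA callback */
}

static int async_copy(struct dma_chan *chan, void *dst, void *src, size_t len)
{
    struct device *dev = chan->device->dev;
    struct dma_async_tx_descriptor *tx;
    struct completion done;
    dma_addr_t dma_dst, dma_src;

    dma_dst = dma_map_single(dev, dst, len, DMA_FROM_DEVICE);
    dma_src = dma_map_single(dev, src, len, DMA_TO_DEVICE);

    tx = dmaengine_prep_dma_memcpy(chan, dma_dst, dma_src, len,
                                   DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
    if (!tx) {
        dma_unmap_single(dev, dma_src, len, DMA_TO_DEVICE);
        dma_unmap_single(dev, dma_dst, len, DMA_FROM_DEVICE);
        return -ENOMEM;
    }

    init_completion(&done);
    tx->callback = copy_done;
    tx->callback_param = &done;

    dmaengine_submit(tx);
    dma_async_issue_pending(chan);

    /* The CPU is free here: do other useful work instead of spinning. */

    wait_for_completion(&done);

    dma_unmap_single(dev, dma_src, len, DMA_TO_DEVICE);
    dma_unmap_single(dev, dma_dst, len, DMA_FROM_DEVICE);
    return 0;
}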
