Strange code for preventing false sharing

Strange code for preventing false sharing - multithreading

I want to discuss the following structure in golang from this link
// Local per-P Pool appendix.
57 type poolLocal struct {
58 private interface{} // Can be used only by the respective P.
59 shared []interface{} // Can be used by any P.
60 Mutex // Protects shared.
61 pad [128]byte // Prevents false sharing.
62 }
The above structure can be accessed only one thread at a time as Mutex is used. The coder will Lock the structure in the beginning of a thread and unlock it when the thread completes. So the memory is not shared between threads. So no more than one core will have
access to the memory. So, by my understanding, false sharing cannot happen here. If false sharing cannot happen, why did the coder pad the structure
with extra bytes (pad [128]byte) ? Is my understanding wrong?

Memory locations on the same cache line are subject to false-sharing, which is very bad for performance.
Cache line size ranges from 32 to 128 bytes, depending on processor model. 128 byte pad will reduce chance for same cache line being used by different processes and that improvesthe performace
as i see it, the following would be better as it would be more explicit
type poolLocal struct {
_ [64]byte // Prevents false sharing.
private interface{} // Can be used only by the respective P.
shared []interface{} // Can be used by any P.
Mutex // Protects shared.
_ [64]byte // Prevents false sharing.
}

Related

Rust pointer being freed was not allocated error

Here's the situation, I want to do some data conversion from a string, and for convenience, I converted it to a pointer in the middle, and now I want to return the part of the string, but I'm stuck with this exception:
foo(74363,0x10fd2fdc0) malloc: *** error for object 0x7ff65ff000d1: pointer being freed was not allocated
foo(74363,0x10fd2fdc0) malloc: *** set a breakpoint in malloc_error_break to debug
When I try to debug the program, I got the error message as shown above.
Here's my sample code:
fn main() {
unsafe {
let mut s = String::from_utf8_unchecked(vec![97, 98]);
let p = s.as_ptr();
let k = p.add(1);
String::from_raw_parts(k as *mut u8, 1, 1);
}
}

You should never use an unsafe function without understanding its documentation, 100%.
So, what does String::from_raw_parts says:
Safety
This is highly unsafe, due to the number of invariants that aren't
checked:
The memory at ptr needs to have been previously allocated by the same allocator the standard library uses, with a required alignment of exactly 1.
length needs to be less than or equal to capacity.
capacity needs to be the correct value.
Violating these may cause problems like corrupting the allocator's internal data structures.
The ownership of ptr is effectively transferred to the String which may then deallocate, reallocate or change the contents of memory pointed to by the pointer at will. Ensure that nothing else uses the pointer after calling this function.
There are two things that stand out here:
The memory at ptr needs to have been previously allocated.
capacity needs to be the correct value.
And those are related to how allocations work in Rust. Essentially, deallocation only expects the very pointer value (and type) that allocation returned.
Shenanigans such as trying to deallocate a pointer pointing in the middle of an allocation, with a different alignment, or with a different size, are Not Allowed.
Furthermore, you also missed:
Ensure that nothing else uses the pointer after calling this function.
Here, the original instance of String is still owning the allocation, and you are trying to deallocate one byte out of it. It cannot ever go well.

How does Spectre attack read the cache it tricked CPU to load?

I understand the part of the paper where they trick the CPU to speculatively load the part of the victim memory into the CPU cache. Part I do not understand is how they retrieve it from cache.

They don't retrieve it directly (out of bounds read bytes are not "retired" by the CPU and cannot be seen by the attacker in the attack).
A vector of attack is to do the "retrieval" a bit at a time. After the CPU cache has been prepared (flushing the cache where it has to be), and has been "taught" that a if branch goes through while the condition relies on non-cached data, the CPU speculatively executes the couple of lines from the if scope, including an out-of-bounds access (giving a byte B), and then immediately access some authorized non-cached array at an index that depends on one bit of the secret B (B will never directly be seen by the attacker). Finally, attacker retrieves the same authorized data array from, say, an index calculated with B bit, say zero: if the retrieval of that ok byte is fast, data was still in the cache, meaning B bit is zero. If the retrieval is (relatively) slow, the CPU had to load in its cache that ok data, meaning it didn't earlier, meaning B bit was one.
For instance, Cond, all ValidArray not cached, LargeEnough is big enough to ensure the CPU will not load both ValidArray[ valid-index + 0 ] and ValidArray[ valid-index + LargeEnough ] in its cache in one shot
if ( Cond ) {
// the next 2 lines are only speculatively executed
V = SomeArray[ out-of-bounds-attacked-index ]
Dummy = ValidArray [ valid-index + ( V & bit ) * LargeEnough ]
}
// the next code is always retired (executed, not only speculatively)
t1 = get_cpu_precise_time()
Dummy2 = ValidArray [ valid-index ]
diff = get_cpu_precise_time() - t1
if (diff > SOME_CALCULATED_VALUE) {
// bit was its value (1, or 2, or 4, or ... 128)
}
else {
// bit was 0
}
where bit is tried successively being first 0x01, then 0x02... to 0x80. By measuring the "time" (number of CPU cycles) the "next" code takes for each bit, the value of V is revealed:
if ValidArray[ valid-index + 0 ] is in the cache, V & bit is 0
otherwise V & bit is bit
This takes time, each bit requires to prepare the CPU L1 cache, tries several time the same bit to minimize timing errors etc...
Then the correct attack "offset" has to be determined to read an interesting area.
Clever attack, but not so easy to implement.

how they retrieve it from cache
Basically, the secret retrieved speculatively is immediately used as an index to read from another array called side_effects. All we need is to "touch" an index in side_effects array, so the corresponding element get from memory to CPU cache:
secret = base_array[huge_index_to_a_secret];
tmp = side_effects[secret * PAGE_SIZE];
Then the latency to access each element in side_effects array is measured and compared to a memory access time:
for (i = 0; i < 256; i++) {
start = time();
tmp = side_effects[i * PAGE_SIZE];
latency = time() - start;
if (latency < MIN_MEMORY_ACCESS_TIME)
return i; // so, thas was the secret!
}
If latency is lower that minimum memory access time, the element is in cache, so the secret was the current index. If the latency is high, the element is not in cache, so we continue our measurements.
So, basically we do not retrieve any information directly, rather we touch some memory during the speculative execution and then observe the side effects.
Here is the Specter-Based Meltdown proof of concept in 99 lines of code you might find easier to understand that the other PoCs:
https://github.com/berestovskyy/spectre-meltdown
In general, this technique is called Side-Channel Attack and more information could be found on Wikipedia: https://en.wikipedia.org/wiki/Side-channel_attack

I would like to contribute one piece of information to the already existing answers, namely how the attacker can actually probe an array from the victim process in the probing phase. This is a problem, because Spectre (unlike Meltdown) runs in the victim's process and even through the cache the attacker cannot just query arrays from other processes.
In short: With Spectre the FLUSH+RELOAD attack needs KSM or another method for shared memory. That way the attacker (to my understanding) can replicate the relevant parts of the victim's memory in his own address space and thus will be able to query the cache for the access times on the probe array.
Long Explanation:
One big difference between Meltdown and Spectre is that in Meltdown the whole attack is running in the address space of the attacker. Thus, it's quite clear how the attacker can both cause changes to the cache and read the cache at the same time. With Spectre however, the attack itself runs in the process of the victim. By using so called gadgets the victim will execute code that writes the secret data into the index of a probe array, e.g. with a = array2[array1[x] * 4096].
The proof-of-concepts that have been linked in other answers implement the basic branching/speculation concept of Spectre, but all code seems to run in the same process. Thus, of course it is no problem to have gadget code write to array2 and then read array2 for probing. In a real-world scenario, however, the victim process would write to array2 which is also located in the victim process.
Now, the problem - which the paper in my opinion does not explain well - is that the attacker has to be able to probe the cache for the victim's address space array (array2). Theoretically, this could be done either from within the victim again or from the attackers address space.
The original paper only describes it vaguely, probably because it was clear to the authors:
For the final phase, the sensitive data is recovered. For Spectre attacks using Flush+Reload or Evict+Reload, the recovery process consists of timing the access to memory addresses in the cache lines being monitored.
To complete the attack, the adversary measures which location in array2 was brought into the cache, e.g., via Flush+Reload or Prime+Probe.
Accessing the cache for array2 from within the victim's address space would be possible, but it would require another gadget and the attacker would have to be able to trigger execution of this gadget. This seemed quite unrealistic to me, especially in Spectre-PHT.
In the paper Detecting Spectre Attacks by identifying Cache Side-Channel Attacks using Machine Learning I found my missing explanation:
In order for the FLUSH+RELOAD attack to work in this case,
three preconditions have to be met. [...] But most
importantly the CPU must have a mechanism like Kernel Same-page Merging (KSM) [4] or Transparent Page Sharing (TPS) [54]
enabled [10].
KSM allows processes to share pages by merging different virtual
addresses into the same page, if they reference the same physical
address. It thereby increases the memory density, allowing for a
more efficient memory usage. KSM was first implemented in Linux
2.6.32 and is enabled by default [33].
KSM explains how the attacker can access array2 that normally would only be available within the victim's process.

malloc/realloc/free capacity optimization

When you have a dynamically allocated buffer that varies its size at runtime in unpredictable ways (for example a vector or a string) one way to optimize its allocation is to only resize its backing store on powers of 2 (or some other set of boundaries/thresholds), and leave the extra space unused. This helps to amortize the cost of searching for new free memory and copying the data across, at the expense of a little extra memory use. For example the interface specification (reserve vs resize vs trim) of many C++ stl containers have such a scheme in mind.
My question is does the default implementation of the malloc/realloc/free memory manager on Linux 3.0 x86_64, GLIBC 2.13, GCC 4.6 (Ubuntu 11.10) have such an optimization?
void* p = malloc(N);
... // time passes, stuff happens
void* q = realloc(p,M);
Put another way, for what values of N and M (or in what other circumstances) will p == q?

From the realloc implementation in glibc trunk at http://sources.redhat.com/git/gitweb.cgi?p=glibc.git;a=blob;f=malloc/malloc.c;h=12d2211b0d6603ac27840d6f629071d1c78586fe;hb=HEAD
First, if the memory has been obtained via mmap() instead of sbrk(), which glibc malloc does for large requests, >= 128 kB by default IIRC:
if (chunk_is_mmapped(oldp))
{
void* newmem;
#if HAVE_MREMAP
newp = mremap_chunk(oldp, nb);
if(newp) return chunk2mem(newp);
#endif
/* Note the extra SIZE_SZ overhead. */
if(oldsize - SIZE_SZ >= nb) return oldmem; /* do nothing */
/* Must alloc, copy, free. */
newmem = public_mALLOc(bytes);
if (newmem == 0) return 0; /* propagate failure */
MALLOC_COPY(newmem, oldmem, oldsize - 2*SIZE_SZ);
munmap_chunk(oldp);
return newmem;
}
(Linux has mremap(), so in practice this is what is done).
For smaller requests, a few lines below we have
newp = _int_realloc(ar_ptr, oldp, oldsize, nb);
where _int_realloc is a bit big to copy-paste here, but you'll find it starting at line 4221 in the link above. AFAICS, it does NOT do the constant factor optimization increase that e.g. the C++ std::vector does, but rather allocates exactly the amount requested by the user (rounded up to the next chunk boundaries + alignment stuff and so on).
I suppose the idea is that if the user wants this factor of 2 size increase (or any other constant factor increase in order to guarantee logarithmic efficiency when resizing multiple times), then the user can implement it himself on top of the facility provided by the C library.

Perhaps you can use malloc_usable_size (google for it) to find the answer experimentally. This function, however, seems undocumented, so you will need to check out if it is still available at your platform.
See also How to find how much space is allocated by a call to malloc()?

x86 equivalent for LWARX and STWCX

I'm looking for an equivalent of LWARX and STWCX (as found on the PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/web sites/forums for lock/wait-free programing).
Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.
int* IncrementAndRetrieve(int **ptr)
{
int val;
int *pval;
do
{
// fetch the pointer to the value
pval = *ptr;
// if its NULL, then just return NULL, the smart pointer
// will then become NULL as well
if(pval == NULL)
return NULL;
// Grab the reference count
val = lwarx(pval);
// make sure the pointer we grabbed the value from
// is still the same one referred to by 'ptr'
if(pval != *ptr)
continue;
// Increment the reference count via 'stwcx' if any other threads
// have done anything that could potentially break then it should
// fail and try again
} while(!stwcx(pval, val + 1));
return pval;
}
I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've so far found for the x86).
Thanks

As Michael mentioned, what you're probably looking for is the cmpxchg instruction.
It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).
If you need the stronger semantics on the x86 architecture, you can approximate it by using the x86s double-width compare-and-swap (DWCAS) instruction cmpxchg8b, or cmpxchg16b under x86_64. This allows you to atomically swap two consecutive 'natural sized' words at once, instead of just the usual one. The basic idea is one of the two words contains the value of interest, and the other one contains an always incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter to wrap between attempts is so low that it's a reasonable substitute for most purposes.

x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).

You're probably looking for the cmpxchg family of instructions.
You'll need to precede these with a lock instruction to get equivalent behaviour.
Have a look here for a quick overview of what's available.
You'll likely end up with something similar to this:
mov ecx,dword ptr [esp+4]
mov edx,dword ptr [esp+8]
mov eax,dword ptr [esp+12]
lock cmpxchg dword ptr [ecx],edx
ret 12
You should read this paper...
Edit
In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.

if you are on 64 bits and limit yourself to say 1tb of heap, you can pack the counter into the 24 unused top bits. if you have word aligned pointers the bottom 5 bits are also available.
int* IncrementAndRetrieve(int **ptr)
{
int val;
int *unpacked;
do
{
val = *ptr;
unpacked = unpack(val);
if(unpacked == NULL)
return NULL;
// pointer is on the bottom
} while(!cas(unpacked, val, val + 1));
return unpacked;
}

Don't know if LWARX and STWCX invalidate the whole cache line, CAS and DCAS do. Meaning that unless you are willing to throw away a lot of memory (64 bytes for each independent "lockable" pointer) you won't see much improvement if you are really pushing your software into stress. The best results I've seen so far were when people consciously casrificed 64b, planed their structures around it (packing stuff that won't be subject of contention), kept everything alligned on 64b boundaries, and used explicit read and write data barriers. Cache line invalidation can cost approx 20 to 100 cycles, making it a bigger real perf issue then just lock avoidance.
Also, you'd have to plan different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing" - one request "leaks" and then releases all it's memory bulk at the end) or datailed allocation management so that one structure under contention never receives memory realesed by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive but it's either that or paying the price for GC.

What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (Win32 function; assembly: XADD).
The reason that your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and stwcx without invalidating the stwcx.

Why is thread local storage so slow?

I'm working on a custom mark-release style memory allocator for the D programming language that works by allocating from thread-local regions. It seems that the thread local storage bottleneck is causing a huge (~50%) slowdown in allocating memory from these regions compared to an otherwise identical single threaded version of the code, even after designing my code to have only one TLS lookup per allocation/deallocation. This is based on allocating/freeing memory a large number of times in a loop, and I'm trying to figure out if it's an artifact of my benchmarking method. My understanding is that thread local storage should basically just involve accessing something through an extra layer of indirection, similar to accessing a variable via a pointer. Is this incorrect? How much overhead does thread-local storage typically have?
Note: Although I mention D, I'm also interested in general answers that aren't specific to D, since D's implementation of thread-local storage will likely improve if it is slower than the best implementations.

The speed depends on the TLS implementation.
Yes, you are correct that TLS can be as fast as a pointer lookup. It can even be faster on systems with a memory management unit.
For the pointer lookup you need help from the scheduler though. The scheduler must - on a task switch - update the pointer to the TLS data.
Another fast way to implement TLS is via the Memory Management Unit. Here the TLS is treated like any other data with the exception that TLS variables are allocated in a special segment. The scheduler will - on task switch - map the correct chunk of memory into the address space of the task.
If the scheduler does not support any of these methods, the compiler/library has to do the following:
get current ThreadId
Take a semaphore
Lookup the pointer to the TLS block by the ThreadId (may use a map or so)
Release the semaphore
Return that pointer.
Obviously doing all this for each TLS data access takes a while and may need up to three OS calls: Getting the ThreadId, Take and Release the semaphore.
The semaphore is btw required to make sure no thread reads from the TLS pointer list while another thread is in the middle of spawning a new thread. (and as such allocate a new TLS block and modify the datastructure).
Unfortunately it's not uncommon to see the slow TLS implementation in practice.

Thread locals in D are really fast. Here are my tests.
64 bit Ubuntu, core i5, dmd v2.052
Compiler options: dmd -O -release -inline -m64
// this loop takes 0m0.630s
void main(){
int a; // register allocated
for( int i=1000*1000*1000; i>0; i-- ){
a+=9;
}
}
// this loop takes 0m1.875s
int a; // thread local in D, not static
void main(){
for( int i=1000*1000*1000; i>0; i-- ){
a+=9;
}
}
So we lose only 1.2 seconds of one of CPU's cores per 1000*1000*1000 thread local accesses.
Thread locals are accessed using %fs register - so there is only a couple of processor commands involved:
Disassembling with objdump -d:
- this is local variable in %ecx register (loop counter in %eax):
8: 31 c9 xor %ecx,%ecx
a: b8 00 ca 9a 3b mov $0x3b9aca00,%eax
f: 83 c1 09 add $0x9,%ecx
12: ff c8 dec %eax
14: 85 c0 test %eax,%eax
16: 75 f7 jne f <_Dmain+0xf>
- this is thread local, %fs register is used for indirection, %edx is loop counter:
6: ba 00 ca 9a 3b mov $0x3b9aca00,%edx
b: 64 48 8b 04 25 00 00 mov %fs:0x0,%rax
12: 00 00
14: 48 8b 0d 00 00 00 00 mov 0x0(%rip),%rcx # 1b <_Dmain+0x1b>
1b: 83 04 08 09 addl $0x9,(%rax,%rcx,1)
1f: ff ca dec %edx
21: 85 d2 test %edx,%edx
23: 75 e6 jne b <_Dmain+0xb>
Maybe compiler could be even more clever and cache thread local before loop to a register
and return it to thread local at the end (it's interesting to compare with gdc compiler),
but even now matters are very good IMHO.

One needs to be very careful in interpreting benchmark results. For example, a recent thread in the D newsgroups concluded from a benchmark that dmd's code generation was causing a major slowdown in a loop that did arithmetic, but in actuality the time spent was dominated by the runtime helper function that did long division. The compiler's code generation had nothing to do with the slowdown.
To see what kind of code is generated for tls, compile and obj2asm this code:
__thread int x;
int foo() { return x; }
TLS is implemented very differently on Windows than on Linux, and will be very different again on OSX. But, in all cases, it will be many more instructions than a simple load of a static memory location. TLS is always going to be slow relative to simple access. Accessing TLS globals in a tight loop is going to be slow, too. Try caching the TLS value in a temporary instead.
I wrote some thread pool allocation code years ago, and cached the TLS handle to the pool, which worked well.

I've designed multi-taskers for embedded systems, and conceptually the key requirement for thread-local storage is having the context switch method save/restore a pointer to thread-local storage along with the CPU registers and whatever else it's saving/restoring. For embedded systems which will always be running the same set of code once they've started up, it's easiest to simply save/restore one pointer, which points to a fixed-format block for each thread. Nice, clean, easy, and efficient.
Such an approach works well if one doesn't mind having space for every thread-local variable allocated within every thread--even those that never actually use it--and if everything that's going to be within the thread-local storage block can be defined as a single struct. In that scenario, accesses to thread-local variables can be almost as fast as access to other variables, the only difference being an extra pointer dereference. Unfortunately, many PC applications require something more complicated.
On some frameworks for the PC, a thread will only have space allocated for thread-static variables if a module that uses those variables has been run on that thread. While this can sometimes be advantageous, it means that different threads will often have their local storage laid out differently. Consequently, it may be necessary for the threads to have some sort of searchable index of where their variables are located, and to direct all accesses to those variables through that index.
I would expect that if the framework allocates a small amount of fixed-format storage, it may be helpful to keep a cache of the last 1-3 thread-local variables accessed, since in many scenarios even a single-item cache could offer a pretty high hit rate.

If you can't use compiler TLS support, you can manage TLS yourself.
I built a wrapper template for C++, so it is easy to replace an underlying implementation.
In this example, i've implemented it for Win32.
Note: Since you cannot obtain an unlimited number of TLS indices per process (at least under Win32),
you should point to heap blocks large enough to hold all thread specific data.
This way you have a minimum number of TLS indices and related queries.
In the "best case", you'd have just 1 TLS pointer pointing to one private heap block per thread.
In a nutshell: Don't point to single objects, instead point to thread specific, heap memory/containers holding object pointers to achieve better performance.
Don't forget to free memory if it isn't used again.
I do this by wrapping a thread into a class (like Java does) and handle TLS by constructor and destructor.
Furthermore, i store frequently used data like thread handles and ID's as class members.
usage:
for type*:
tl_ptr<type>
for const type*:
tl_ptr<const type>
for type* const:
const tl_ptr<type>
const type* const:
const tl_ptr<const type>
template<typename T>
class tl_ptr {
protected:
DWORD index;
public:
tl_ptr(void) : index(TlsAlloc()){
assert(index != TLS_OUT_OF_INDEXES);
set(NULL);
}
void set(T* ptr){
TlsSetValue(index,(LPVOID) ptr);
}
T* get(void)const {
return (T*) TlsGetValue(index);
}
tl_ptr& operator=(T* ptr){
set(ptr);
return *this;
}
tl_ptr& operator=(const tl_ptr& other){
set(other.get());
return *this;
}
T& operator*(void)const{
return *get();
}
T* operator->(void)const{
return get();
}
~tl_ptr(){
TlsFree(index);
}
};

We have seen similar performance issues from TLS (on Windows). We rely on it for certain critical operations inside our product's "kernel'. After some effort I decided to try and improve on this.
I'm pleased to say that we now have a small API that offers > 50% reduction in CPU time for an equivalent operation when the callin thread doesn't "know" its thread-id and > 65% reduction if calling thread has already obtained its thread-id (perhaps for some other earlier processing step).
The new function ( get_thread_private_ptr() ) always returns a pointer to a struct we use internally to hold all sorts, so we only need one per thread.
All in all I think the Win32 TLS support is poorly crafted really.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string