Why is thread local storage so slow? - multithreading

I'm working on a custom mark-release style memory allocator for the D programming language that works by allocating from thread-local regions. It seems that the thread local storage bottleneck is causing a huge (~50%) slowdown in allocating memory from these regions compared to an otherwise identical single threaded version of the code, even after designing my code to have only one TLS lookup per allocation/deallocation. This is based on allocating/freeing memory a large number of times in a loop, and I'm trying to figure out if it's an artifact of my benchmarking method. My understanding is that thread local storage should basically just involve accessing something through an extra layer of indirection, similar to accessing a variable via a pointer. Is this incorrect? How much overhead does thread-local storage typically have?
Note: Although I mention D, I'm also interested in general answers that aren't specific to D, since D's implementation of thread-local storage will likely improve if it is slower than the best implementations.

The speed depends on the TLS implementation.
Yes, you are correct that TLS can be as fast as a pointer lookup. It can even be faster on systems with a memory management unit.
For the pointer lookup you need help from the scheduler though. The scheduler must - on a task switch - update the pointer to the TLS data.
Another fast way to implement TLS is via the Memory Management Unit. Here the TLS is treated like any other data with the exception that TLS variables are allocated in a special segment. The scheduler will - on task switch - map the correct chunk of memory into the address space of the task.
If the scheduler does not support any of these methods, the compiler/library has to do the following:
Get the current thread ID
Take a semaphore
Look up the pointer to the TLS block by thread ID (perhaps via a map)
Release the semaphore
Return that pointer.
Obviously doing all this for each TLS data access takes a while and may need up to three OS calls: getting the thread ID, and taking and releasing the semaphore.
The semaphore is, by the way, required to make sure no thread reads from the TLS pointer list while another thread is in the middle of spawning a new thread (and thus allocating a new TLS block and modifying the data structure).
Unfortunately it's not uncommon to see the slow TLS implementation in practice.

Thread locals in D are really fast. Here are my tests.
64 bit Ubuntu, core i5, dmd v2.052
Compiler options: dmd -O -release -inline -m64
// this loop takes 0m0.630s
void main() {
    int a; // register allocated
    for (int i = 1000*1000*1000; i > 0; i--) {
        a += 9;
    }
}

// this loop takes 0m1.875s
int a; // thread local in D, not static
void main() {
    for (int i = 1000*1000*1000; i > 0; i--) {
        a += 9;
    }
}
So we lose only about 1.2 seconds of one CPU core per 1000*1000*1000 thread-local accesses.
Thread locals are accessed through the %fs register, so only a couple of processor instructions are involved:
Disassembling with objdump -d:
- this is local variable in %ecx register (loop counter in %eax):
8: 31 c9 xor %ecx,%ecx
a: b8 00 ca 9a 3b mov $0x3b9aca00,%eax
f: 83 c1 09 add $0x9,%ecx
12: ff c8 dec %eax
14: 85 c0 test %eax,%eax
16: 75 f7 jne f <_Dmain+0xf>
- this is thread local, %fs register is used for indirection, %edx is loop counter:
6: ba 00 ca 9a 3b mov $0x3b9aca00,%edx
b: 64 48 8b 04 25 00 00 mov %fs:0x0,%rax
12: 00 00
14: 48 8b 0d 00 00 00 00 mov 0x0(%rip),%rcx # 1b <_Dmain+0x1b>
1b: 83 04 08 09 addl $0x9,(%rax,%rcx,1)
1f: ff ca dec %edx
21: 85 d2 test %edx,%edx
23: 75 e6 jne b <_Dmain+0xb>
Maybe the compiler could be even more clever and cache the thread local in a register before the loop
and write it back to the thread local at the end (it would be interesting to compare with the gdc compiler),
but even now things look very good IMHO.

One needs to be very careful in interpreting benchmark results. For example, a recent thread in the D newsgroups concluded from a benchmark that dmd's code generation was causing a major slowdown in a loop that did arithmetic, but in actuality the time spent was dominated by the runtime helper function that did long division. The compiler's code generation had nothing to do with the slowdown.
To see what kind of code is generated for tls, compile and obj2asm this code:
__thread int x;
int foo() { return x; }
TLS is implemented very differently on Windows than on Linux, and will be very different again on OSX. But, in all cases, it will be many more instructions than a simple load of a static memory location. TLS is always going to be slow relative to simple access. Accessing TLS globals in a tight loop is going to be slow, too. Try caching the TLS value in a temporary instead.
I wrote some thread pool allocation code years ago, and cached the TLS handle to the pool, which worked well.

I've designed multi-taskers for embedded systems, and conceptually the key requirement for thread-local storage is having the context switch method save/restore a pointer to thread-local storage along with the CPU registers and whatever else it's saving/restoring. For embedded systems which will always be running the same set of code once they've started up, it's easiest to simply save/restore one pointer, which points to a fixed-format block for each thread. Nice, clean, easy, and efficient.
Such an approach works well if one doesn't mind having space for every thread-local variable allocated within every thread--even those that never actually use it--and if everything that's going to be within the thread-local storage block can be defined as a single struct. In that scenario, accesses to thread-local variables can be almost as fast as access to other variables, the only difference being an extra pointer dereference. Unfortunately, many PC applications require something more complicated.
On some frameworks for the PC, a thread will only have space allocated for thread-static variables if a module that uses those variables has been run on that thread. While this can sometimes be advantageous, it means that different threads will often have their local storage laid out differently. Consequently, it may be necessary for the threads to have some sort of searchable index of where their variables are located, and to direct all accesses to those variables through that index.
I would expect that if the framework allocates a small amount of fixed-format storage, it may be helpful to keep a cache of the last 1-3 thread-local variables accessed, since in many scenarios even a single-item cache could offer a pretty high hit rate.

If you can't use compiler TLS support, you can manage TLS yourself.
I built a wrapper template for C++, so it is easy to replace an underlying implementation.
In this example, I've implemented it for Win32.
Note: Since you cannot obtain an unlimited number of TLS indices per process (at least under Win32),
you should point to heap blocks large enough to hold all thread specific data.
This way you have a minimum number of TLS indices and related queries.
In the "best case", you'd have just 1 TLS pointer pointing to one private heap block per thread.
In a nutshell: Don't point to single objects, instead point to thread specific, heap memory/containers holding object pointers to achieve better performance.
Don't forget to free memory if it isn't used again.
I do this by wrapping a thread into a class (like Java does) and handle TLS by constructor and destructor.
Furthermore, I store frequently used data like thread handles and IDs as class members.
usage:
for type*:
tl_ptr<type>
for const type*:
tl_ptr<const type>
for type* const:
const tl_ptr<type>
const type* const:
const tl_ptr<const type>
template<typename T>
class tl_ptr {
protected:
    DWORD index;
public:
    tl_ptr(void) : index(TlsAlloc()) {
        assert(index != TLS_OUT_OF_INDEXES);
        set(NULL);
    }
    void set(T* ptr) {
        TlsSetValue(index, (LPVOID)ptr);
    }
    T* get(void) const {
        return (T*)TlsGetValue(index);
    }
    tl_ptr& operator=(T* ptr) {
        set(ptr);
        return *this;
    }
    tl_ptr& operator=(const tl_ptr& other) {
        set(other.get());
        return *this;
    }
    T& operator*(void) const {
        return *get();
    }
    T* operator->(void) const {
        return get();
    }
    ~tl_ptr() {
        TlsFree(index);
    }
};

We have seen similar performance issues from TLS (on Windows). We rely on it for certain critical operations inside our product's "kernel". After some effort I decided to try and improve on this.
I'm pleased to say that we now have a small API that offers a > 50% reduction in CPU time for an equivalent operation when the calling thread doesn't "know" its thread ID, and a > 65% reduction if the calling thread has already obtained its thread ID (perhaps for some other earlier processing step).
The new function ( get_thread_private_ptr() ) always returns a pointer to a struct we use internally to hold all sorts, so we only need one per thread.
All in all I think the Win32 TLS support is poorly crafted really.

Related

Results of doing += on a double from multiple threads

Consider the following code:
void add(double& a, double b) {
    a += b;
}
which according to godbolt compiles on a Skylake to:
add(double&, double):
vaddsd xmm0, xmm0, QWORD PTR [rdi]
vmovsd QWORD PTR [rdi], xmm0
ret
If I call add(a, 1.23) and add(a, 2.34) from different threads (for the same variable a), will a definitely end up as either a+1.23, a+2.34, or a+1.23+2.34?
That is, will one of these results definitely happen given this assembly, and a will not end up in some other state?
Here is the question that seems relevant to me:
Does the CPU fetch the word you are dealing with in a single operation?
Some processors might allow access to a variable that happens to be misaligned in memory by doing two fetches one after the other - non-atomically, of course.
In that case, problems would arise if another thread writes to that area of memory after the first thread has already fetched the first part of the word but before it fetches the second part.
thread 1 fetches first part of a XXXX
thread 1 fetches second part of a YYYY
thread 2 fetches first part of a XXXX
thread 1 increments double represented as XXXXYYYY that becomes ZZZZWWWW by adding b
thread 1 writes back in memory ZZZZ
thread 1 writes back in memory WWWW
thread 2 fetches second part of a that is now WWWW
thread 2 increments double represented as XXXXWWWW that becomes VVVVPPPP by adding b
thread 2 writes back in memory VVVV
thread 2 writes back in memory PPPP
For keeping it compact I used one character to represent 8 bits.
Now XXXXWWWW and VVVVPPPP are representations of totally different floating-point values than the ones you would have expected. That is because you ended up mixing two halves of two different binary representations (IEEE 754) of double variables.
That said, I know that on certain ARM-based architectures such unaligned accesses are not allowed (they would cause a trap to be generated), but I suspect that Intel processors do allow them.
Therefore, if your variable a is aligned, your result can be any of
a+1.23, a+2.34, a+1.23+2.34
if your variable might be mis-aligned (i.e. has got an address that is not a multiple of 8) your result can be any of
a+1.23, a+2.34, a+1.23+2.34 or a rubbish value
As a further note, please bear in mind that even if in your environment alignof(double) == 8, that is not necessarily enough to conclude that you are not going to have misalignment issues. It all depends on where your particular variable comes from. Consider the following (or run it here):
#pragma pack(push, 1)
struct Packet
{
    unsigned char val1;
    unsigned char val2;
    double val3;
    unsigned char val4;
    unsigned char val5;
};
#pragma pack(pop)

int main()
{
    static_assert(alignof(double) == 8);
    double d = 0;
    add(d, 1.23);      // your a parameter is aligned
    Packet p = {};
    add(p.val3, 1.23); // your a parameter is now NOT aligned
    return 0;
}
Therefore asserting alignof() doesn't necessarily guarantee your variable is aligned. If your variable is not involved in any packing then you should be OK.
Please allow me just a disclaimer for whoever else is reading this answer: using std::atomic<double> in these situations is the best compromise in terms of implementation effort and performance to achieve thread safety. Some CPU architectures have special, efficient instructions for dealing with atomic variables without injecting heavy fences, and that might already satisfy your performance requirements.

Strange code for preventing false sharing

I want to discuss the following structure in golang from this link
// Local per-P Pool appendix.
type poolLocal struct {
    private interface{}   // Can be used only by the respective P.
    shared  []interface{} // Can be used by any P.
    Mutex                 // Protects shared.
    pad     [128]byte     // Prevents false sharing.
}
The above structure can be accessed by only one thread at a time, since a Mutex is used. The coder will lock the structure at the beginning of a thread's work and unlock it when the thread completes. So the memory is not shared between threads, and no more than one core will have access to it. By my understanding, then, false sharing cannot happen here. If false sharing cannot happen, why did the coder pad the structure with extra bytes (pad [128]byte)? Is my understanding wrong?
Memory locations on the same cache line are subject to false-sharing, which is very bad for performance.
Cache line size ranges from 32 to 128 bytes, depending on the processor model. A 128-byte pad reduces the chance of the same cache line being used by different processors, and that improves performance.
As I see it, the following would be better, as it is more explicit:
type poolLocal struct {
    _       [64]byte      // Prevents false sharing.
    private interface{}   // Can be used only by the respective P.
    shared  []interface{} // Can be used by any P.
    Mutex                 // Protects shared.
    _       [64]byte      // Prevents false sharing.
}

x86 reserved EFLAGS bit 1 == 0: how can this happen?

I'm using the Win32 API to stop/start/inspect/change thread state. Generally works pretty well. Sometimes it fails, and I'm trying to track down the cause.
I have one thread that is forcing context switches on other threads by:
thread stop
fetch processor state into windows context block
read thread registers from windows context block to my own context block
write thread registers from another context block into windows context block
restart thread
This works remarkably well... but ... very rarely, context switches seem to fail.
(Symptom: my multithreaded system blows sky high, executing at strange places with strange register contents.)
The context control is accomplished by:
if ((suspend_count=SuspendThread(WindowsThreadHandle))<0)
{ printf("TimeSlicer Suspend Thread failure");
...
}
...
Context.ContextFlags = (CONTEXT_INTEGER | CONTEXT_CONTROL | CONTEXT_FLOATING_POINT);
if (!GetThreadContext(WindowsThreadHandle,&Context))
{ printf("Context fetch failure");
...
}
call ContextSwap(&Context); // does the context swap
if (ResumeThread(WindowsThreadHandle)<0)
{ printf("Thread resume failure");
...
}
None of the print statements ever get executed. I conclude that Windows thinks the context operations all happened reliably.
Oh, yes, I do know when a thread being stopped is not computing [e.g., in a system function] and won't attempt to stop/context switch it. I know this because each thread that does anything other-than-computing sets a thread specific "don't touch me" flag, while it is doing other-than-computing. (Device driver programmers will recognize this as the equivalent of "interrupt disable" instructions).
So, I wondered about the reliability of the content of the context block.
I added a variety of sanity tests on various register values pulled out of the context block; you can actually decide that ESP is OK (within bounds of the stack area defined in the TIB), PC is in the program that I expect or in a system call, etc. No surprises here.
I decided to check that the condition code bits (EFLAGS) were being properly read out; if this were wrong, it would cause a switched task to take a "wrong branch" when its state was restored. So I added the following code to verify that the purported EFLAGS register contains stuff that only look like EFLAGS according to the Intel reference manual (http://en.wikipedia.org/wiki/FLAGS_register).
mov eax, Context.EFlags[ebx] ; ebx points to Windows Context block
mov ecx, eax ; check that we seem to have flag bits
and ecx, 0FFFEF32Ah ; where we expect constant flag bits to be
cmp ecx, 000000202h ; expected state of constant flag bits
je #f
breakpoint ; trap if unexpected flag bit status
##:
On my Win 7 AMD Phenom II X6 1090T (hex core), it traps occasionally with a breakpoint, with ECX = 0200h. It fails the same way on my Win 7 Intel i7 system. I would ignore this, except it hints that EFLAGS isn't being stored correctly, as I suspected.
According to my reading of the Intel (and also the AMD) reference manuals, bit 1 is reserved and always has the value "1". Not what I see here.
Obviously, MS fills the context block by doing complicated things on a thread stop. I expect them to store the state accurately. This bit isn't stored correctly.
If they don't store this bit correctly, what else don't they store?
Any explanations for why the value of this bit could/should be zero sometimes?
EDIT: My code dumps the registers and the stack on catching a breakpoint.
The stack area contains the context block as a local variable.
Both EAX, and the value in the stack at the proper offset for EFLAGS in the context block contain the value 0244h. So the value in the context block really is wrong.
EDIT2: I changed the mask and comparison values to
and ecx, 0FFFEF328h ; was FFEF32Ah where we expect flag bits to be
cmp ecx, 000000200h
This seems to run reliably with no complaints. Apparently Win7 doesn't do bit 1 of eflags right, and it appears not to matter.
Still interested in an explanation, but apparently this is not the source of my occasional context switch crash.
Microsoft has a long history of squirreling away a few bits in places that aren't really used. Raymond Chen has given plenty of examples, e.g. using the lower bit(s) of a pointer that's not byte-aligned.
In this case, Windows might have needed to store some of its thread context in an existing CONTEXT structure, and decided to use an otherwise unused bit in EFLAGS. You couldn't do anything with that bit anyway, and Windows will get that bit back when you call SetThreadContext.

Is there a way to check whether the processor cache has been flushed recently?

On i386 Linux. Preferably in C (the C/POSIX standard libraries) or via /proc if possible. If not, is there any piece of assembly or third-party library that can do this?
Edit: I'm trying to test whether a kernel module clears a cache line or the whole processor cache (with wbinvd()). The program runs as root, but I'd prefer to stay in user space if possible.
Cache coherent systems do their utmost to hide such things from you. I think you will have to observe it indirectly, either by using performance counting registers to detect cache misses or by carefully measuring the time to read a memory location with a high resolution timer.
This program works on my x86_64 box to demonstrate the effects of clflush. It times how long it takes to read a global variable using rdtsc. Being a single instruction tied directly to the CPU clock makes direct use of rdtsc ideal for this.
Here is the output:
took 81 ticks
took 81 ticks
flush: took 387 ticks
took 72 ticks
You see 3 trials: The first ensures i is in the cache (which it is, because it was just zeroed as part of BSS), the second is a read of i that should be in the cache. Then clflush kicks i out of the cache (along with its neighbors) and shows that re-reading it takes significantly longer. A final read verifies it is back in the cache. The results are very reproducible and the difference is substantial enough to easily see the cache misses. If you cared to calibrate the overhead of rdtsc() you could make the difference even more pronounced.
If you can't read the memory address you want to test (although even mmap of /dev/mem should work for these purposes) you may be able to infer what you want if you know the cacheline size and associativity of the cache. Then you can use accessible memory locations to probe the activity in the set you're interested in.
Source code:
#include <stdio.h>
#include <stdint.h>

inline void
clflush(volatile void *p)
{
    asm volatile ("clflush (%0)" :: "r"(p));
}

inline uint64_t
rdtsc()
{
    unsigned long a, d;
    asm volatile ("rdtsc" : "=a" (a), "=d" (d));
    return a | ((uint64_t)d << 32);
}

volatile int i;

inline void
test()
{
    uint64_t start, end;
    volatile int j;
    start = rdtsc();
    j = i;
    end = rdtsc();
    printf("took %lu ticks\n", end - start);
}

int
main(int ac, char **av)
{
    test();
    test();
    printf("flush: ");
    clflush(&i);
    test();
    test();
    return 0;
}
I don't know of any generic command to get the cache state, but there are ways:
1. I guess this is the easiest: if you've got your kernel module, just disassemble it and look for cache invalidation / flushing instructions (three come to mind: WBINVD, CLFLUSH, INVD).
2. You said it is for i386, but I guess you don't mean an 80386. The problem is that there are many different processors with different extensions and features. E.g. the newest Intel series includes performance/profiling registers for the cache system, which you can use to evaluate cache misses, hits, number of transfers, and the like.
3. Similar to 2, and very dependent on the system you've got: with a multiprocessor configuration you could watch the first processor's cache behaviour from the second via the cache coherence protocol (MESI).
You mentioned WBINVD - AFAIK that will always flush the complete cache, i.e. all cache lines.
It may not be an answer to your specific question, but have you tried using a cache profiler such as Cachegrind? It can only be used to profile userspace code, but you might be able to use it nonetheless, by e.g. moving the code of your function to userspace if it does not depend on any kernel-specific interfaces.
It might actually be more effective than trying to ask the processor for information that may or may not exist, and that will probably be affected by your merely asking about it - yes, Heisenberg was way ahead of his time :-)

x86 equivalent for LWARX and STWCX

I'm looking for an equivalent of LWARX and STWCX (as found on the PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/websites/forums for lock-free/wait-free programming)?
Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.
int* IncrementAndRetrieve(int **ptr)
{
    int val;
    int *pval;
    do
    {
        // fetch the pointer to the value
        pval = *ptr;

        // if it's NULL, then just return NULL; the smart pointer
        // will then become NULL as well
        if(pval == NULL)
            return NULL;

        // grab the reference count
        val = lwarx(pval);

        // make sure the pointer we grabbed the value from
        // is still the same one referred to by 'ptr'
        if(pval != *ptr)
            continue;

        // increment the reference count via 'stwcx'; if any other thread
        // has done anything that could potentially break this, it should
        // fail and try again
    } while(!stwcx(pval, val + 1));
    return pval;
}
I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've so far found for the x86).
Thanks
As Michael mentioned, what you're probably looking for is the cmpxchg instruction.
It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).
If you need the stronger semantics on the x86 architecture, you can approximate them by using x86's double-width compare-and-swap (DWCAS) instruction, cmpxchg8b (or cmpxchg16b under x86_64). This allows you to atomically swap two consecutive 'natural sized' words at once, instead of just the usual one. The basic idea is that one of the two words contains the value of interest and the other contains an always-incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter wrapping between attempts is so low that it's a reasonable substitute for most purposes.
x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).
You're probably looking for the cmpxchg family of instructions.
You'll need to precede these with a lock prefix to get the equivalent behaviour.
Have a look here for a quick overview of what's available.
You'll likely end up with something similar to this:
mov ecx,dword ptr [esp+4]
mov edx,dword ptr [esp+8]
mov eax,dword ptr [esp+12]
lock cmpxchg dword ptr [ecx],edx
ret 12
You should read this paper...
Edit
In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.
If you are on 64 bits and limit yourself to, say, 1 TB of heap, you can pack the counter into the 24 unused top bits. If you have word-aligned pointers, the bottom 5 bits are also available.
int* IncrementAndRetrieve(int **ptr)
{
    intptr_t val;       // packed pointer-plus-counter word
    int *unpacked;
    do
    {
        val = (intptr_t)*ptr;
        unpacked = unpack(val);  // strip the counter bits to recover the pointer
        if(unpacked == NULL)
            return NULL;
        // pointer is on the bottom, counter in the spare bits
    } while(!cas(unpacked, val, val + 1));
    return unpacked;
}
I don't know if LWARX and STWCX invalidate the whole cache line; CAS and DCAS do. This means that unless you are willing to spend a lot of memory (64 bytes for each independent "lockable" pointer), you won't see much improvement if you are really pushing your software under stress. The best results I've seen so far were when people consciously sacrificed 64 B, planned their structures around it (packing things that won't be subject to contention), kept everything aligned on 64-byte boundaries, and used explicit read and write data barriers. A cache-line invalidation can cost approximately 20 to 100 cycles, making it a bigger real performance issue than mere lock avoidance.
Also, you'd have to plan a different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing" - one request "leaks" and then releases all its memory in bulk at the end) or detailed allocation management, so that one structure under contention never receives memory released by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive, but it's either that or paying the price for GC.
What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (a Win32 function; in assembly: XADD).
The reason your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and the stwcx without invalidating the stwcx.
