preemptions due to malloc

I am thinking of the following scenario and I want to double check it with you.
One Linux process has two or more threads running in parallel on different cores. Say they both call malloc with the same size, small enough that malloc does not have to invoke mmap. In other words, the heap is already big enough, having been grown by earlier sbrk calls. In such a case, the memory allocations happen entirely in user space. Looking at the source on GitHub, I have seen that a mutex protects the internal data structures that malloc uses.
My question is: can a thread be preempted by the kernel because both threads try to acquire the same lock? In other words, will one of the threads suffer a penalty in its execution because the other one holds that lock?
Thanks,

Related

Multithreading on multiple core/processors

I get the idea that if locking and unlocking a mutex is an atomic operation, it can protect the critical section of code in case of a single processor architecture.
Any thread, which would be scheduled first, would be able to "lock" the mutex in a single machine code operation.
But how are mutexes any good when the threads are running on multiple cores, where different threads could be running at the same time on different "cores"?
I can't seem to grasp how a multithreaded program would work without any deadlock or race condition on multiple cores.

The general answer:
Mutexes are an operating system concept. An operating system offering mutexes has to ensure that these mutexes work correctly on all hardware that this operating system wants to support. If implementing a mutex is not possible for a specific hardware, the operating system cannot offer mutexes on that hardware. If the operating system requires the existence of mutexes to work correctly, it cannot support that hardware at all. How the operating system implements mutexes for a specific hardware is unsurprisingly very hardware dependent and varies a lot between operating systems and their supported hardware.
The detailed answer:
Most general purpose CPUs offer atomic operations. These operations are designed to be atomic across all CPU cores within a system, whether these cores are part of a single or multiple individual CPUs.
With as little as two atomic operations, atomic_or and atomic_and, it is possible to implement a lock. E.g. think of
int atomic_or ( int * addr, int val )
It atomically calculates *addr = *addr | val and returns the old value of *addr prior to performing the calculation. If *lock == 0 and multiple threads call atomic_or(lock, 1), then only one of them will get 0 as result; only the first thread to perform that operation. All other threads get 1 as result. The one thread that got 0 is the winner, it has the lock, all other threads register for an event and go to sleep.
The winner thread now has exclusive access to the section following the atomic_or. It can perform the desired work and, once it is done, it clears the lock again (atomic_and(lock, 0)) and generates a system event signalling that the lock is available again.
The system will then wake up one, some, or all of the threads that registered for this event before going to sleep, and the race for the lock starts all over. Either one of the woken threads wins the race or possibly none of them does, because another thread was even faster and grabbed the lock after the atomic_and but before the other threads were woken up. That is okay and still correct, as still only one thread has access. All threads that failed to obtain the lock go back to sleep.
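To make the atomic_or / atomic_and idea concrete, here is a minimal sketch in C++. It assumes std::atomic's fetch_or and fetch_and as stand-ins for the atomic_or and atomic_and described above, and it replaces "register for an event and go to sleep" with a simple yield, so it behaves like a spinlock rather than a real OS mutex. The names OrAndLock and count_with_lock are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical lock type built from the two atomic operations in the text.
class OrAndLock {
public:
    void lock() {
        // fetch_or returns the value held before the OR. Only the caller
        // that sees 0 has won the race and owns the lock.
        while (flag_.fetch_or(1, std::memory_order_acquire) != 0) {
            // Assumption: yielding stands in for "sleep until the event".
            std::this_thread::yield();
        }
    }

    void unlock() {
        // AND with 0 clears the flag, making the lock available again.
        flag_.fetch_and(0, std::memory_order_release);
    }

private:
    std::atomic<int> flag_{0};
};

// Runs `threads` workers that each add `per_thread` to a plain counter
// while holding the lock, and returns the final count.
long count_with_lock(int threads, long per_thread) {
    assert(threads > 0);
    OrAndLock lock;
    long counter = 0;  // deliberately not atomic; the lock protects it
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (long i = 0; i < per_thread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

If the lock works, count_with_lock(4, 10000) returns exactly 40000 regardless of which cores the threads run on, because only the thread that saw 0 from fetch_or ever touches the counter.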
Of course, the actual implementations of modern systems are often much more complicated than that, they may take things like threads priorities into account (high prio threads may be preferred in the lock race) or might ensure that every thread waiting for a mutex will eventually also get it (precautions exist that prevent a thread from always losing the lock-race). Also mutexes can be recursive, in which case the system ensures that the same thread can obtain the same mutex multiple times without deadlocking and this requires some additional bookkeeping.
Probably needless to say but atomic operations are more expensive operations as they require the cores within a system to synchronize their work and this will slow their processing throughput. They may be somewhat expensive if all cores run on a single CPU but they may even be very expensive if there are multiple CPUs as the synchronization must take place over the CPU bus system that connects the CPUs with each other and this bus system usually does not operate at CPU speed level.
On the other hand, using mutexes will always slow down processing to begin with as providing exclusive access to resources has to slow down processing if multiple threads ever require access at the same time to continue their work. So for implementing mutexes this is irrelevant. Actually, if you can implement a function in a thread-safe way using just atomic operations instead of full featured mutexes, you will quite often have a noticeable speed benefit, despite these operations being more expensive than normal operations.
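As a small illustration of that last point, the sketch below counts the same number of increments twice: once under a std::mutex and once with a single atomic fetch_add. The function names are invented for this example, and no timing claims are made here; how big the difference is depends on the hardware and the amount of contention.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Every increment takes and releases a full mutex.
long count_with_mutex(int threads, long per_thread) {
    assert(threads > 0);
    std::mutex m;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (long i = 0; i < per_thread; ++i) {
                std::lock_guard<std::mutex> guard(m);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}

// Every increment is one atomic read-modify-write and no lock at all.
long count_with_atomic(int threads, long per_thread) {
    assert(threads > 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (long i = 0; i < per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}
```

Both functions return the same total; the atomic version simply never puts a thread to sleep.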

Threads are managed by the operating system, which among other things, is responsible for scheduling threads to cores, so it can also avoid scheduling a specific thread onto a core.
A mutex is an operating-system concept. You're basically asking the OS to block a thread until some other thread tells the OS it's OK.

On modern operating systems, threads are an abstraction over the physical hardware. A programmer targets the thread as an abstraction for code execution. There is no separate abstraction for working on a hardware core available. The operating system is responsible for mapping threads to physical cores.
A mutex is a data structure that lives in system memory. Any thread that has access can read that memory position, regardless of what thread or core it is running on. It doesn't matter whether your code is executing on core 1 or 20; it still has the ability to read the current state of the lock.
In other words, regardless of the number of threads or cores, there is only shared system memory for them to act on.

Consumer-producer: Pausing the consumer if memory usage goes beyond a particular threshold

I have a situation where my main thread (producer) allocates a huge chunk of memory on the heap for a task, does some work on that buffer, and then hands the buffer to worker threads (consumers) for further processing (which first compress the data and then write it to disk). Once a worker thread is done with its job, it releases the memory that the producer acquired for the task.
However, there can be a situation where my main thread allocates too much memory, and my system starts swapping other programs out to disk to accommodate the memory requirement. Since the disk becomes busy, the worker threads find it difficult to write to disk (and eventually free any memory), and meanwhile the producer continues to allocate more memory for other tasks. In the end this kills my system's performance.
What would be a good design for this problem?
Additionally, if pausing the main thread by pre-computing the memory requirement in advance is an option, how can I arrive at a reliable number?

Possible design options
single-producer-multiple-consumers blocking queue between producer and workers
atomic task inboxes for each worker in the pool, with the producer round-robining tasks among them and busy-spinning/blocking when unable to post (I think Herb Sutter features this design in one of his concurrency lectures)
Memory allocation-wise it is always beneficial to be deterministic, even as deterministic as pre-allocating everything on startup. It is just not always possible or practical to be that strict, so usually a combination of fixed/dynamic sizing and startup/runtime allocation takes place in any non-trivial system.
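A minimal sketch of the first option, the single-producer-multiple-consumers blocking queue, might look like the following. BoundedQueue and run_pipeline are hypothetical names, the capacity of 2 is an arbitrary choice, and adding up buffer sizes stands in for the real compress-and-write work. The point is only that push() blocks the producer when the workers fall behind, which caps how many buffers can be alive at once.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical bounded queue: push() blocks while `capacity` items wait.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Called by the producer; blocks while the queue is full.
    void push(T item) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [this] { return q_.size() < capacity_; });
        q_.push_back(std::move(item));
        not_empty_.notify_one();
    }

    // Called by workers; returns false once the queue is closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop_front();
        not_full_.notify_one();
        return true;
    }

    // Called by the producer after the last push().
    void close() {
        {
            std::lock_guard<std::mutex> lk(m_);
            closed_ = true;
        }
        not_empty_.notify_all();
    }

private:
    const std::size_t capacity_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::deque<T> q_;
    bool closed_ = false;
};

// Demo: one producer hands `tasks` buffers to `workers` consumers.
// Returns the total number of bytes the consumers processed.
std::size_t run_pipeline(int workers, int tasks, std::size_t buffer_bytes) {
    BoundedQueue<std::vector<char>> queue(2);  // at most 2 buffers waiting
    std::atomic<std::size_t> consumed{0};

    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            std::vector<char> buf;
            while (queue.pop(buf)) {
                consumed += buf.size();  // stand-in for compress + write
            }
        });
    }
    for (int t = 0; t < tasks; ++t) {
        queue.push(std::vector<char>(buffer_bytes, 'x'));  // may block here
    }
    queue.close();
    for (auto& th : pool) th.join();
    return consumed.load();
}
```

With this shape, the number of live buffers is bounded by the queue capacity, plus one per worker that is still processing, plus the one the producer is currently building. That also gives a way to arrive at a memory budget: multiply that count by the largest buffer size.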

Malloc performance in a multithreaded environment

I've been running some experiments with the openmp framework and found some odd results I'm not sure I know how to explain.
My goal is to create this huge matrix and then fill it with values. I made some parts of my code parallel loops in order to gain performance from my multithreaded environment. I'm running this on a machine with 2 quad-core Xeon processors, so I can safely put up to 8 concurrent threads in there.
Everything works as expected, but for some reason the for loop that actually allocates the rows of my matrix has an odd performance peak when running with only 3 threads. From there on, adding more threads just makes my loop take longer, with 8 threads actually taking more time than a single thread would.
This is my parallel loop:
int width = 11;
int height = 39916800;
vector<vector<int> > matrix;
matrix.resize(height);
#pragma omp parallel shared(matrix,width,height) private(i) num_threads(3)
{
    #pragma omp for schedule(dynamic,chunk)
    for(i = 0; i < height; i++){
        matrix[i].resize(width);
    }
} /* End of parallel block */
This made me wonder: is there a known performance problem when calling malloc (which I suppose is what the resize method of the vector template class is actually calling) in a multithreaded environment? I found some articles saying something about performance loss when freeing heap space in a multithreaded environment, but nothing specific about allocating new space as in this case.
Just to give you an example, I'm placing below a graph of the time it takes for the loop to finish as a function of the number of threads for both the allocation loop, and a normal loop that just reads data from this huge matrix later on.
Both times were measured using the gettimeofday function and seem to return very similar and accurate results across different execution instances. So, does anyone have a good explanation?

You are right about vector::resize() internally calling malloc. Implementation-wise malloc is fairly complicated. I can see multiple places where malloc can lead to contention in a multi-threaded environment.
malloc probably keeps a global data structure in user space to manage the user's heap address space. This global data structure needs to be protected against concurrent access and modification. Some allocators have optimizations that reduce the number of times this global data structure is accessed... I don't know how far Ubuntu has come along.
malloc allocates address space. So when you actually begin to touch the allocated memory, you go through a "soft page fault", a page fault that lets the OS kernel allocate the backing RAM for the allocated address space. This can be expensive because of the trip to the kernel, and it requires the kernel to take some global locks to access its own global RAM resource data structures.
The user-space allocator probably keeps some allocated space from which to hand out new allocations. However, once that space runs out, the allocator needs to go back to the kernel and allocate some more address space. This is also expensive: it requires a trip to the kernel and the kernel taking some global locks to access its global address-space management data structures.
Bottom line: these interactions can be fairly complicated. If you are running into these bottlenecks, I would suggest that you simply "pre-allocate" your memory. This involves allocating it and then touching all of it (all from a single thread) so that you can use that memory later from all your threads without running into lock contention at user or kernel level.
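One way to read the "pre-allocate" advice for the matrix in the question is sketched below: a single contiguous allocation made on one thread, with the worker threads only writing into it. This is an illustration under assumptions, not the asker's code. It uses std::thread instead of OpenMP so it is self-contained, stores the matrix flat rather than as a vector of vectors, and fill_flat_matrix is an invented name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Fills a height x width matrix stored in one contiguous allocation and
// returns the sum of all elements so the result can be checked.
long long fill_flat_matrix(std::size_t height, std::size_t width,
                           unsigned threads) {
    assert(threads > 0);
    // One allocation on one thread. Value-initialising the elements also
    // touches every page before the workers start.
    std::vector<int> flat(height * width, 0);

    const std::size_t rows_per_thread = (height + threads - 1) / threads;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        const std::size_t begin =
            std::min<std::size_t>(height, t * rows_per_thread);
        const std::size_t end =
            std::min<std::size_t>(height, begin + rows_per_thread);
        // Each worker writes a disjoint range of rows; no allocator calls.
        pool.emplace_back([&flat, width, begin, end] {
            for (std::size_t r = begin; r < end; ++r) {
                for (std::size_t c = 0; c < width; ++c) {
                    flat[r * width + c] = static_cast<int>(c);
                }
            }
        });
    }
    for (auto& th : pool) th.join();

    long long sum = 0;
    for (int v : flat) sum += v;
    return sum;
}
```

Element (r, c) lives at flat[r * width + c], so the parallel part never calls malloc and never contends on the allocator's lock.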

Memory allocators are definitely a possible contention point for multiple threads.
Fundamentally, the heap is a shared data structure, since it is possible to allocate memory on one thread, and de-allocate it on another. In fact, your example does exactly that - the "resize" will free memory on each of the worker threads, which was initially allocated elsewhere.
Typical implementations of malloc included with gcc and other compilers use a shared global lock and work reasonably well across threads if memory allocation pressure is relatively low. Above a certain allocation level, however, threads will begin to serialize on the lock, you'll get excessive context switching and cache thrashing, and performance will degrade. Your program is an example of something which is allocation heavy, with an alloc + dealloc in the inner loop.
I'm surprised that an OpenMP-compatible compiler doesn't have a better threaded malloc implementation. They certainly exist - take a look at this question for a list.

Technically, the STL vector uses std::allocator, which eventually calls new. new in turn calls libc's malloc (on your Linux system).
This malloc implementation is quite efficient as a general-purpose allocator and is thread-safe, but it is not scalable (GNU libc's malloc derives from Doug Lea's dlmalloc). There are numerous allocators and papers that improve upon dlmalloc to provide scalable allocation.
I would suggest that you take a look at Hoard from Dr. Emery Berger, tcmalloc from Google and Intel Threading Building Blocks scalable allocator.

How expensive are threads?

How expensive is an OS-native thread? The host OS allocates some virtual memory for the thread stack and a little bit of kernel memory for the thread control structures. Am I missing something?

It can increase the scheduler workload, depending how busy the thread is, and the kind of scheduler. It will also allocate physical memory for the first page of the stack.
The main cost in many cases is cache pollution. Having too many active concurrent tasks kills performance because too many threads share too little cache, and they keep shoving each other back out to main memory. That is a far worse indignity for a thread to suffer than simply being put to sleep: sleeping incurs a single penalty of several hundred cycles, while fetching from main memory incurs a similar overhead several times during a single time-slice. It also means proportionally more context switching, since much less work gets done during that time-slice.

How efficient is locking an unlocked mutex? What is the cost of a mutex?

In a low level language (C, C++ or whatever): I have the choice in between either having a bunch of mutexes (like what pthread gives me or whatever the native system library provides) or a single one for an object.
How efficient is it to lock a mutex? I.e. how many assembler instructions are there likely and how much time do they take (in the case that the mutex is unlocked)?
How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as much mutex variables in my code as I have int variables and it doesn't really matter?
(I am not sure how much differences there are between different hardware. If there is, I would also like to know about them. But mostly, I am interested about common hardware.)
The point is, by using many mutexes which each cover only a part of the object instead of a single mutex for the whole object, I could save many blocks. And I am wondering how far I should go with this. I.e. should I try to save every possible block as far as possible, no matter how much more complicated it gets and how many more mutexes this means?
WebKit's blog post (2016) about locking is closely related to this question, and explains the differences between a spinlock, adaptive lock, futex, etc.

I have the choice in between either having a bunch of mutexes or a single one for an object.
If you have many threads and access to the object happens often, then multiple locks would increase parallelism, at the cost of maintainability, since more locking means more debugging of the locking.
How efficient is it to lock a mutex? I.e. how many assembler instructions are there likely and how much time do they take (in the case that the mutex is unlocked)?
The precise assembler instructions are the least of a mutex's overhead; the memory/cache coherency guarantees are the main overhead. And the less often a particular lock is taken, the better.
A mutex is made of two major parts (oversimplifying): (1) a flag indicating whether the mutex is locked or not and (2) a wait queue.
Changing the flag takes just a few instructions and is normally done without a system call. If the mutex is locked, a syscall will happen to add the calling thread to the wait queue and start the waiting. Unlocking is cheap if the wait queue is empty, but otherwise needs a syscall to wake up one of the waiting processes. (On some systems cheap/fast syscalls are used to implement the mutexes; they become slow (normal) system calls only in case of contention.)
Locking an unlocked mutex is really cheap. Unlocking a mutex without contention is cheap too.
How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as much mutex variables in my code as I have int variables and it doesn't really matter?
You can throw as many mutex variables into your code as you wish. You are only limited by the amount of memory your application can allocate.
Summary: user-space locks (and mutexes in particular) are cheap and not subject to any system limit. But too many of them spell a nightmare for debugging. Simple table:
1. Fewer locks means more contention (slow syscalls, CPU stalls) and less parallelism.
2. Fewer locks means fewer problems debugging multi-threading issues.
3. More locks means less contention and higher parallelism.
4. More locks means more chances of running into undebuggable deadlocks.
A balanced locking scheme for the application should be found and maintained, generally balancing #2 and #3.
(*) The problem with fewer, very frequently locked mutexes is that if you have too much locking in your application, it causes too much inter-CPU/core traffic to flush the mutex memory from the data caches of other CPUs to guarantee cache coherency. The cache flushes are like lightweight interrupts and are handled by CPUs transparently, but they do introduce so-called stalls (search for "stall").
And the stalls are what make the locking code run slowly, often without any apparent indication of why the application is slow. (Some architectures provide inter-CPU/core traffic stats, some do not.)
To avoid the problem, people generally resort to a large number of locks to decrease the probability of lock contention and to avoid the stalls. That is the reason why cheap user-space locking, not subject to system limits, exists.
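The "many locks to reduce contention" idea is often implemented as lock striping. The sketch below is one possible shape, with invented names (StripedCounters, hammer) and an arbitrary stripe count of 16: each key hashes to one shard, and each shard has its own mutex, so threads working on different shards never wait for each other.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical counter table split into shards, one mutex per shard.
class StripedCounters {
public:
    void increment(int key) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> guard(s.m);  // locks one shard only
        ++s.counts[key];
    }

    long get(int key) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> guard(s.m);
        auto it = s.counts.find(key);
        return it == s.counts.end() ? 0 : it->second;
    }

private:
    static constexpr std::size_t kStripes = 16;  // arbitrary for the demo
    struct Shard {
        std::mutex m;
        std::unordered_map<int, long> counts;
    };

    Shard& shard_for(int key) {
        return shards_[std::hash<int>{}(key) % kStripes];
    }

    std::array<Shard, kStripes> shards_;
};

// Demo: `threads` workers each perform `per_thread` increments spread
// over `keys` different keys. Returns the sum over all keys.
long hammer(int threads, int keys, int per_thread) {
    assert(threads > 0 && keys > 0);
    StripedCounters table;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&table, keys, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                table.increment(i % keys);
            }
        });
    }
    for (auto& th : pool) th.join();

    long total = 0;
    for (int k = 0; k < keys; ++k) total += table.get(k);
    return total;
}
```

Because no operation ever holds more than one shard mutex at a time, this particular layout cannot deadlock on its own locks, which keeps it on the manageable side of the trade-off in the table above.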

I wanted to know the same thing, so I measured it.
On my box (AMD FX(tm)-8150 Eight-Core Processor at 3.612361 GHz),
locking and unlocking an unlocked mutex that is in its own cache line and is already cached, takes 47 clocks (13 ns).
Due to synchronization between two cores (I used CPU #0 and #1),
I could only call a lock/unlock pair once every 102 ns on two threads,
so once every 51 ns, from which one can conclude that it takes roughly 38 ns to recover after a thread does an unlock before the next thread can lock it again.
The program that I used to investigate this can be found here:
https://github.com/CarloWood/ai-statefultask-testsuite/blob/b69b112e2e91d35b56a39f41809d3e3de2f9e4b8/src/mutex_test.cxx
Note that it has a few hardcoded values specific for my box (xrange, yrange and rdtsc overhead), so you probably have to experiment with it before it will work for you.
The graph it produces in that state is:
This shows the result of benchmark runs on the following code:
uint64_t do_Ndec(int thread, int loop_count)
{
    uint64_t start;
    uint64_t end;
    int __d0;
    asm volatile ("rdtsc\n\tshl $32, %%rdx\n\tor %%rdx, %0" : "=a" (start) : : "%rdx");
    mutex.lock();
    mutex.unlock();
    asm volatile ("rdtsc\n\tshl $32, %%rdx\n\tor %%rdx, %0" : "=a" (end) : : "%rdx");
    asm volatile ("\n1:\n\tdecl %%ecx\n\tjnz 1b" : "=c" (__d0) : "c" (loop_count - thread) : "cc");
    return end - start;
}
The two rdtsc calls measure the number of clocks that it takes to lock and unlock `mutex' (with an overhead of 39 clocks for the rdtsc calls on my box). The third asm is a delay loop. The size of the delay loop is 1 count smaller for thread 1 than it is for thread 0, so thread 1 is slightly faster.
The above function is called in a tight loop of size 100,000. Despite that the function is slightly faster for thread 1, both loops synchronize because of the call to the mutex. This is visible in the graph from the fact that the number of clocks measured for the lock/unlock pair is slightly larger for thread 1, to account for the shorter delay in the loop below it.
In the above graph the bottom right point is a measurement with a delay loop_count of 150, and then following the points at the bottom, towards the left, the loop_count is reduced by one each measurement. When it becomes 77 the function is called every 102 ns in both threads. If subsequently loop_count is reduced even further it is no longer possible to synchronize the threads and the mutex starts to be actually locked most of the time, resulting in an increased amount of clocks that it takes to do the lock/unlock. Also the average time of the function call increases because of this; so the plot points now go up and towards the right again.
From this we can conclude that locking and unlocking a mutex every 50 ns is not a problem on my box.
All in all, my conclusion is that the answer to the OP's question is that adding more mutexes is better as long as that results in less contention.
Try to hold mutexes for as short a time as possible. The only reason to put them, say, outside a loop would be if that loop iterates faster than once every 100 ns (or rather, the number of threads that want to run that loop at the same time, times 50 ns), or when 13 ns times the loop size is more delay than the delay you get from contention.
EDIT: I have become a lot more knowledgeable on the subject and now doubt the conclusion that I presented here. First of all, CPU 0 and 1 turn out to be hyper-threaded; even though AMD claims to have 8 real cores, there is certainly something very fishy because the delay between two other cores is much larger (i.e., 0 and 1 form a pair, as do 2 and 3, 4 and 5, and 6 and 7). Secondly, std::mutex is implemented in such a way that it spins for a bit before actually doing system calls when it fails to immediately obtain the lock on a mutex (which no doubt will be extremely slow). So what I have measured here is the absolute ideal situation, and in practice locking and unlocking might take drastically more time per lock/unlock.
Bottom line: a mutex is implemented with atomics. To synchronize atomics between cores, an internal bus must be locked, which freezes the corresponding cache line for several hundred clock cycles. If a lock cannot be obtained, a system call has to be performed to put the thread to sleep; that is obviously extremely slow (system calls are on the order of 10 microseconds). Normally that is not really a problem because that thread has to sleep anyway, but it could be a problem under high contention, where a thread can't obtain the lock during the time that it normally spins and so does the system call, but CAN take the lock shortly thereafter. For example, if several threads lock and unlock a mutex in a tight loop and each keeps the lock for 1 microsecond or so, then they might be slowed down enormously by the fact that they are constantly put to sleep and woken up again. Also, once a thread sleeps and another thread has to wake it up, that other thread has to do a system call and is delayed ~10 microseconds; this delay thus happens while unlocking a mutex when another thread is waiting for that mutex in the kernel (after spinning took too long).

This depends on what you actually call a "mutex", the OS mode, etc.
At minimum it's the cost of an interlocked memory operation. It's a relatively heavy operation (compared to other primitive assembler instructions).
However, the cost can be very much higher. If what you call a "mutex" is a kernel object (i.e. an object managed by the OS) and you run in user mode, every operation on it leads to a kernel-mode transition, which is very heavy.
For example on Intel Core Duo processor, Windows XP.
Interlocked operation: takes about 40 CPU cycles.
Kernel mode call (i.e. system call) - about 2000 CPU cycles.
If this is the case - you may consider using critical sections. It's a hybrid of a kernel mutex and interlocked memory access.
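The hybrid idea (try cheaply in user mode first, only then block) can be sketched in portable C++ as below. This is not how Win32 critical sections are implemented internally; it is only a loose illustration of the spin-then-block pattern using std::mutex::try_lock, and spin_then_lock and the spin count are assumptions made for this example.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: poll the mutex a bounded number of times, then
// fall back to a blocking acquire.
void spin_then_lock(std::mutex& m, int spins) {
    for (int i = 0; i < spins; ++i) {
        if (m.try_lock()) return;     // acquired without blocking
        std::this_thread::yield();    // give other threads a chance to run
    }
    m.lock();                         // may put the thread to sleep
}

// Demo: several threads increment a shared counter using the helper.
long count_with_spin_then_lock(int threads, long per_thread, int spins) {
    assert(threads > 0 && spins >= 0);
    std::mutex m;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (long i = 0; i < per_thread; ++i) {
                spin_then_lock(m, spins);
                ++counter;
                m.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

When the lock is held only briefly, most acquisitions succeed in the polling phase and the expensive blocking path is rarely taken.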

I'm completely new to pthreads and mutex, but I can confirm from experimentation that the cost of locking/unlocking a mutex is almost zilch when there is no contention, but when there is contention, the cost of blocking is extremely high. I ran a simple code with a thread pool in which the task was just to compute a sum in a global variable protected by a mutex lock:
y = exp(-j*0.0001);
pthread_mutex_lock(&lock);
x += y ;
pthread_mutex_unlock(&lock);
With one thread, the program sums 10,000,000 values virtually instantaneously (less than one second); with two threads (on a MacBook with 4 cores), the same program takes 39 seconds.
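One common way to avoid that kind of contention is to let each thread accumulate into a local variable and take the lock only once at the end. The sketch below shows the idea using std::thread and std::mutex rather than the pthread calls above; the function name and the way the index range is split across threads are assumptions for illustration, not the code that was measured.

```cpp
#include <cassert>
#include <cmath>
#include <mutex>
#include <thread>
#include <vector>

// Sums exp(-j * 0.0001) for j in [0, n) using `threads` workers.
// Each worker locks the mutex once instead of once per term.
double sum_with_local_partials(int threads, int n) {
    assert(threads > 0);
    std::mutex lock;
    double x = 0.0;  // shared total, protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            double local = 0.0;  // private to this thread, no lock needed
            for (int j = t; j < n; j += threads) {
                local += std::exp(-j * 0.0001);
            }
            std::lock_guard<std::mutex> guard(lock);
            x += local;  // the only contended operation
        });
    }
    for (auto& th : pool) th.join();
    return x;
}
```

The result can differ from a serial sum in the last few bits because floating-point addition is performed in a different order.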

The cost will vary depending on the implementation but you should keep in mind two things:
The cost will most likely be minimal, since it's both a fairly primitive operation and one that will be optimised as much as possible due to its use pattern (used a lot).
It doesn't matter how expensive it is, since you need to use it if you want safe multi-threaded operation. If you need it, then you need it.
On single processor systems, you can generally just disable interrupts long enough to atomically change data. Multi-processor systems can use a test-and-set strategy.
In both those cases, the instructions are relatively efficient.
As to whether you should provide a single mutex for a massive data structure, or have many mutexes, one for each section of it, that's a balancing act.
By having a single mutex, you have a higher risk of contention between multiple threads. You can reduce this risk by having a mutex per section but you don't want to get into a situation where a thread has to lock 180 mutexes to do its job :-)

I just measured it on my Windows 10 system.
This is testing Single Threaded code with no contention at all.
Compiler: Visual Studio 2019, x64 release, with loop overhead subtracted from measurements.
Using std::mutex takes about 74 machine cycles, while using a native Win32 CRITICAL_SECTION takes about 53 machine cycles.
So unless 100 machine cycles is a significant amount of time compared to the code itself, the mutexes aren't going to be the source of a performance problem.
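For anyone who wants to repeat this kind of measurement, a rough portable sketch is given below. It is not the program used for the numbers above: it reports nanoseconds from std::chrono::steady_clock rather than machine cycles, does not subtract loop overhead, and the function name is invented. Treat the result as an upper bound on the uncontended cost on your own machine.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>

// Returns the average wall-clock time, in nanoseconds, of one
// uncontended lock/unlock pair on a std::mutex (loop overhead included).
double average_lock_unlock_ns(long iterations) {
    assert(iterations > 0);
    std::mutex m;
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        m.lock();
        m.unlock();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling it with a few million iterations and an optimised build gives a more stable average than a short run.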
