Why doesn't glibc malloc use thread-local arenas? - multithreading

glibc malloc assigns each thread only one arena. Why doesn't malloc simply use a thread-local arena to avoid locking?
I have searched on Google and found no answers related to this question.

Related

preemptions due to malloc

I am thinking of the following scenario and I want to double check it with you.
One Linux process with 2 or more threads running in parallel on different cores. Let's say that they both call malloc with the same size, such that malloc will not have to invoke mmap. In other words, the heap is big enough and was previously increased by other sbrk invocations. In such a case, the memory allocations happen entirely in user space. By looking at the sources on GitHub I have seen that there is a mutex protecting the internal data structures that malloc uses.
My question is: can a thread be preempted by the kernel given that the threads try to acquire the same lock? In other words, will one of the threads suffer a penalty in its execution due to the fact that the other has got that lock?
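To make the scenario concrete, here is a minimal sketch (C++ with pthreads; the thread count, allocation size, and iteration count are arbitrary) of the kind of workload I have in mind, where both threads may contend on the same internal malloc lock:

// Two threads allocating and freeing concurrently from the same heap.
// If they end up using the same glibc arena, they contend on its mutex.
#include <cstdlib>
#include <pthread.h>

static void* worker(void*) {
    for (int i = 0; i < 1000000; ++i) {
        void* p = std::malloc(64);   // may block on the arena lock
        std::free(p);
    }
    return nullptr;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, nullptr, worker, nullptr);
    pthread_create(&t2, nullptr, worker, nullptr);
    pthread_join(t1, nullptr);
    pthread_join(t2, nullptr);
}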
Thanks,

When does dlopen block?

A shared library is loaded through LD_PRELOAD, and the constructor of that library calls dlopen("libc.so.6").
The problem is that dlopen takes forever; debugging shows the following:
dlopen calls __dlopen, which calls calloc, then an unknown function (??), and finally __GI___pthread_mutex_lock.
Providing unlimited resources before dlopen, which I suspected might help, doesn't solve the problem.
The problem only happens if LD_PRELOAD is set with the shared library (mentioned above) and the target application is Firefox on Linux; any other application works without problems (dlopen doesn't block)!
When does dlopen block?
When it needs a lock that is not available for some reason.
debugging shows
You need more debugging. dlopen calls calloc, which requires a malloc lock. Nothing special about that.
It must be that some other thread is holding this malloc lock, and is waiting for your LD_PRELOADed library to finish its initialization (thus creating a deadlock). You should be able to find that other thread with (gdb) thread apply all where.
It may also matter what functions you are trying to interpose in your LD_PRELOADed library.

Memory management while using threads

1) I tried searching for how memory is allocated when we use threads in a program but couldn't find the answer. Here, What and where are the stack and heap? explains how the stack and heap work when a single program is run. But what happens when it comes to a program with threads?
2) Using an OpenMP parallel region creates threads, and the parallel code is executed concurrently in each thread. Does this occupy more memory than the same code with sequential execution?
In general, yes, [user-space] stacks are one per thread, whereas the heap is usually shared by all threads. See for example this Linux question. However, on some operating systems (OS), on Windows in particular, even a single-threaded app may use more than one heap. Using OpenMP for threading doesn't change these basics, which are mostly dependent on the operating system. So unless you narrow your question to a specific OS, more can't be said at this level of generality.
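As a quick illustration of the point above (just a sketch I'm adding here, not part of the linked question): each thread sees its own address for a stack variable, while a pointer obtained from the heap refers to memory every thread can see:

// Per-thread stacks vs. a shared heap (compile with -std=c++11 -pthread).
#include <cstdio>
#include <thread>

int* shared_heap_int = new int(42);    // heap allocation: visible to every thread

void show(const char* name) {
    int on_my_stack = 0;               // stack variable: one copy per thread
    std::printf("%s: stack var at %p, heap var at %p\n",
                name, (void*)&on_my_stack, (void*)shared_heap_int);
}

int main() {
    std::thread a(show, "thread A"), b(show, "thread B");
    a.join();
    b.join();
    // The stack addresses differ between threads; the heap address is the same.
}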
Since I'm too lazy to draw this myself, here's the comparative illustration from PThreads Programming by Nichols et al. (1996)
A somewhat more detailed (and alas potentially a bit more confusing) diagram is found in the free LLNL POSIX Threads Programming tutorial by B. Barney.
And yes, as you correctly suspected, running more threads does consume more stack memory. You can actually exhaust the virtual address space of a process just with thread stacks if you make enough of them. Various implementations of OpenMP have a STACKSIZE environment variable (or thereabout) that controls how much stack OpenMP allocates for a thread.
Regarding Z boson's question/suggestion about Thread Local Storage (TLS): roughly (i.e. conceptually) speaking, Thread Local Storage is a per-thread heap. There are differences from the per-process heap in the API used to manipulate it, at the very least because each thread needs its own separate pointer to its own TLS, but basically you have a heap-like chunk of the process address space that's reserved to each thread. TLS is optional, you don't have to use it. OpenMP provides its own abstraction/directive for TLS-like persistent per-thread data, called THREADPRIVATE. It's not necessary that the OpenMP THREADPRIVATE uses the operating system's TLS support, however there's a Linux-focused paper which says that such an implementation gave the best performance, at least in that environment.
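For concreteness, here is a minimal sketch (my own, not taken from the paper) of per-thread data at the language level, showing both C++11 thread_local and the OpenMP THREADPRIVATE directive mentioned above (compile with -fopenmp):

// Two ways to get per-thread ("TLS-like") data.
#include <cstdio>
#include <omp.h>

thread_local int tls_counter = 0;      // C++11 TLS: one instance per thread

int tp_counter = 0;                    // OpenMP threadprivate: same idea
#pragma omp threadprivate(tp_counter)

int main() {
    #pragma omp parallel
    {
        tls_counter += omp_get_thread_num();   // each thread updates its own copy
        tp_counter  += omp_get_thread_num();
        std::printf("thread %d: tls=%d tp=%d\n",
                    omp_get_thread_num(), tls_counter, tp_counter);
    }
}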
And here is a subtlety (or why I said "roughly speaking" when I compared TLS to per-thread heaps): assume you want a per-thread heap, say, in order to reduce locking contention on the main heap. You don't actually have to store an entire per-thread heap in each thread's TLS. It suffices to store in each thread's TLS a different head pointer to heaps allocated in the shared per-process space. Identifying and automatically using per-thread heaps in a program (in order to reduce locking contention on the main heap) is a fairly difficult CS problem. Heap allocators which do this automatically are called scalable/parallel[izing] heap allocators or thereabout. For example, Intel TBB provides one such allocator, and it can be used in your program even if you use nothing else from TBB. Although some people seem to believe Intel's TBB allocator contains black magic, it's in fact not really different from the aforementioned basic idea of using TLS to point to some thread-local heap, which in turn is made of several doubly-linked lists segregated by block/object size, as the following diagrams from the Intel paper on TBB illustrate:
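To make the "head pointer in TLS" idea a bit more tangible, here is a toy sketch (an illustration of the idea only, not how TBB or any real allocator is implemented): each thread keeps, in TLS, the head of its own free list of small blocks, and only falls back to the locked global allocator when that list is empty:

// Toy per-thread free list: the only per-thread state is a TLS head pointer;
// the blocks themselves live in the ordinary shared heap.
#include <cstdlib>

struct Block { Block* next; };

thread_local Block* free_list = nullptr;   // per-thread head pointer (TLS)

void* my_alloc(std::size_t size) {
    if (size <= 64) {
        if (free_list) {                   // fast path: pop from this thread's list, no lock
            Block* b = free_list;
            free_list = b->next;
            return b;
        }
        return std::malloc(64);            // slow path: the global (locked) allocator
    }
    return std::malloc(size);              // large requests always go to the global allocator
}

void my_free(void* p, std::size_t size) {
    if (size <= 64) {                      // small block: push onto this thread's list, no lock
        Block* b = static_cast<Block*>(p);
        b->next = free_list;
        free_list = b;
    } else {
        std::free(p);
    }
}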
IBM has something rather similar for AIX 7.1, but a bit more complex. You can tell its (default) allocator to use a fixed number of heaps for multi-threaded applications, e.g. MALLOCOPTIONS=multiheap:3. AIX 7.1 also has another option (which can be combined with multiheap), MALLOCOPTIONS=threadcache, which appears somewhat similar to what Intel TBB does, in that it keeps a per-thread cache of deallocated regions, from which future allocation requests can be serviced with less global heap contention. Besides those options for the default allocator, AIX 7.1 also has a (non-default) "Watson2" allocator which "uses a thread-specific mechanism that uses a varying number of heap structures, which depend on the behavior of the program. Therefore no configuration options are required." (But you do need to select this allocator explicitly with MALLOCTYPE=Watson2.) Watson2's operation sounds even closer to what the Intel TBB allocator does.
The two examples detailed above (Intel TBB and AIX) are just meant as concrete examples, and shouldn't be understood as holding some exclusive secret sauce. The idea of a per-thread or per-CPU heap cache/arena/magazine is fairly widespread. The BSDCan jemalloc paper cites a 1998 MS Research paper as the first to have systematically evaluated arenas for this purpose. That MS paper does cite the ptmalloc web page as "visited on May 11, 1998" and summarizes ptmalloc's working as follows: "It uses a linked list of subheaps where each subheap has a lock, 128 free lists, and some memory to manage. When a thread needs to allocate a block, it scans the list of subheaps and grabs the first unlocked one, allocates the required block, and returns. If it can't find an unlocked subheap, it creates a new one and adds it to the list. In this way, a thread never waits on a locked subheap."
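Sketched in code, the subheap-scanning strategy described in that quote looks roughly like this (my own paraphrase of the quoted description, not ptmalloc's actual source; the list handling is simplified and not fully race-free):

// Scan the arena list, take the first arena whose lock is free,
// otherwise create a new arena. std::malloc stands in for
// "allocate from this arena's free lists".
#include <pthread.h>
#include <cstdlib>

struct Arena {
    pthread_mutex_t lock;
    Arena* next;
};

static Arena* arena_list = nullptr;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

static Arena* new_arena() {                           // add a fresh arena to the list
    Arena* a = static_cast<Arena*>(std::calloc(1, sizeof(Arena)));
    pthread_mutex_init(&a->lock, nullptr);
    pthread_mutex_lock(&list_lock);
    a->next = arena_list;
    arena_list = a;
    pthread_mutex_unlock(&list_lock);
    return a;
}

void* my_malloc(std::size_t size) {
    for (Arena* a = arena_list; a != nullptr; a = a->next) {
        if (pthread_mutex_trylock(&a->lock) == 0) {   // first unlocked arena wins
            void* p = std::malloc(size);
            pthread_mutex_unlock(&a->lock);
            return p;
        }
    }
    Arena* a = new_arena();                           // all arenas busy: create another
    pthread_mutex_lock(&a->lock);
    void* p = std::malloc(size);
    pthread_mutex_unlock(&a->lock);
    return p;
}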

How safe is pthread robust mutex?

I'm thinking of using POSIX robust mutexes to protect a shared resource among different processes (on Linux). However, I have some doubts about safety in different scenarios. I have the following questions:
Are robust mutexes implemented in the kernel or in user code?
If the latter, what would happen if a process happens to crash while in a call to pthread_mutex_lock or pthread_mutex_unlock, while the shared pthread_mutex data structure is being updated?
I understand that if a process locked the mutex and dies, a thread in another process will be awakened and get EOWNERDEAD. However, what would happen if the process dies (in the unlikely case) exactly when the pthread_mutex data structure (in shared memory) is being updated? Will the mutex get corrupted in that case? What would happen to another process that is mapped to the same shared memory if it were to call a pthread_mutex function?
Can the mutex still be recovered in this case?
This question applies to any pthread object with the PTHREAD_PROCESS_SHARED attribute. Is it safe to call functions like pthread_mutex_lock, pthread_mutex_unlock, pthread_cond_signal, etc. concurrently on the same object from different processes? Are they thread-safe across different processes?
From the man-page for pthreads:
Over time, two threading implementations have been provided by the
GNU C library on Linux:
LinuxThreads
This is the original Pthreads implementation. Since glibc
2.4, this implementation is no longer supported.
NPTL (Native POSIX Threads Library)
This is the modern Pthreads implementation. By comparison
with LinuxThreads, NPTL provides closer conformance to the
requirements of the POSIX.1 specification and better
performance when creating large numbers of threads. NPTL is
available since glibc 2.3.2, and requires features that are
present in the Linux 2.6 kernel.
Both of these are so-called 1:1 implementations, meaning that each
thread maps to a kernel scheduling entity. Both threading
implementations employ the Linux clone(2) system call. In NPTL,
thread synchronization primitives (mutexes, thread joining, and so
on) are implemented using the Linux futex(2) system call.
And from man futex(7):
In its bare form, a futex is an aligned integer which is touched only
by atomic assembler instructions. Processes can share this integer
using mmap(2), via shared memory segments or because they share
memory space, in which case the application is commonly called
multithreaded.
An additional remark found here:
(In case you’re wondering how they work in shared memory: Futexes are keyed upon their physical address)
Summarizing, Linux decided to implement pthreads on top of their "native" futex primitive, which indeed lives in the user process address space. For shared synchronization primitives, this would be shared memory, and the other processes will still be able to see it after one process dies.
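To illustrate what an aligned integer "touched only by atomic instructions", living in (possibly shared) user memory, means in practice, here is a minimal sketch of the raw futex(2) syscall used on an integer in a MAP_SHARED mapping (simplified; no error handling, and a real lock needs to handle races and spurious wakeups more carefully):

// A 4-byte integer in shared memory, slept on and woken via futex(2).
#include <linux/futex.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <atomic>
#include <new>

static long futex(std::atomic<int>* addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

int main() {
    // Anonymous shared mapping: still shared with the child after fork().
    void* mem = mmap(nullptr, sizeof(std::atomic<int>), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    auto* flag = new (mem) std::atomic<int>(0);

    if (fork() == 0) {                       // child: sleep until flag becomes 1
        while (flag->load() == 0)
            futex(flag, FUTEX_WAIT, 0);      // kernel re-checks *flag == 0 before sleeping
        _exit(0);
    }

    sleep(1);
    flag->store(1);                          // parent: publish the new value...
    futex(flag, FUTEX_WAKE, 1);              // ...and wake one waiter
    wait(nullptr);
}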
What happens in case of process termination? Ingo Molnar wrote an article called Robust Futexes about just that. The relevant quote:
Robust Futexes
There is one race possible though: since adding to and removing from the
list is done after the futex is acquired by glibc, there is a few
instructions window for the thread (or process) to die there, leaving
the futex hung. To protect against this possibility, userspace (glibc)
also maintains a simple per-thread 'list_op_pending' field, to allow the
kernel to clean up if the thread dies after acquiring the lock, but just
before it could have added itself to the list. Glibc sets this
list_op_pending field before it tries to acquire the futex, and clears
it after the list-add (or list-remove) has finished
Summary
Where this leaves you on other platforms is open-ended. Suffice it to say that the Linux implementation, at least, has taken great care to meet our common-sense expectation of robustness.
Seeing that other operating systems usually resort to kernel-based synchronization primitives in the first place, it makes sense to assume their implementations would be even more naturally robust.
Following the documentation from here: http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_getrobust.html, it does read that in a fully POSIX compliant OS, shared mutex with the robust flag will behave in the way you'd expect.
The problem obviously is that not all OS are fully POSIX compliant. Not even those claiming to be. Process shared mutexes and in particular robust ones are among those finer points that are often not part of an OS's implementation of POSIX.
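For reference, here is a minimal sketch of how such a robust, process-shared mutex is typically set up and how EOWNERDEAD is handled on a conforming implementation (standard POSIX calls; the shared-memory setup and error handling are simplified):

// A robust, process-shared mutex in a shared mapping, with
// EOWNERDEAD recovery via pthread_mutex_consistent().
#include <pthread.h>
#include <sys/mman.h>
#include <cerrno>
#include <cstdio>

int main() {
    // Place the mutex in memory other processes can map as well
    // (here: an anonymous shared mapping inherited across fork()).
    auto* mtx = static_cast<pthread_mutex_t*>(
        mmap(nullptr, sizeof(pthread_mutex_t), PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0));

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(mtx, &attr);
    pthread_mutexattr_destroy(&attr);

    int rc = pthread_mutex_lock(mtx);
    if (rc == EOWNERDEAD) {
        // The previous owner died while holding the lock: the data it
        // protected may be inconsistent. Repair it, then mark the mutex usable.
        std::puts("previous owner died; recovering");
        pthread_mutex_consistent(mtx);
    }
    // ... critical section ...
    pthread_mutex_unlock(mtx);
}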

Malloc performance in a multithreaded environment

I've been running some experiments with the OpenMP framework and found some odd results I'm not sure I know how to explain.
My goal is to create this huge matrix and then fill it with values. I made some parts of my code parallel loops in order to gain performance from my multithreaded environment. I'm running this on a machine with 2 quad-core Xeon processors, so I can safely put up to 8 concurrent threads in there.
Everything works as expected, but for some reason the for loop actually allocating the rows of my matrix has an odd performance peak when running with only 3 threads. From there on, adding more threads just makes my loop take longer. With 8 threads it actually takes more time than it would need with only one.
This is my parallel loop:
#include <vector>
using std::vector;

int main() {
    int width = 11;
    int height = 39916800;
    int chunk = 100;   // chunk size for the dynamic schedule (value not shown in the question)
    int i;
    vector<vector<int> > matrix;
    matrix.resize(height);

    #pragma omp parallel shared(matrix, width, height) private(i) num_threads(3)
    {
        #pragma omp for schedule(dynamic, chunk)
        for (i = 0; i < height; i++) {
            matrix[i].resize(width);
        }
    } /* End of parallel block */
}
This made me wonder: is there a known performance problem when calling malloc (which I suppose is what the resize method of the vector template class is actually calling) in a multithreaded environment? I found some articles about performance loss when freeing heap space in a multithreaded environment, but nothing specific about allocating new space as in this case.
Just to give you an example, I'm placing below a graph of the time it takes for the loop to finish as a function of the number of threads for both the allocation loop, and a normal loop that just reads data from this huge matrix later on.
Both times were measured using the gettimeofday function and seem to return very similar and accurate results across different execution instances. So, does anyone have a good explanation?
You are right about vector::resize() internally calling malloc. Implementation-wise malloc is fairly complicated. I can see multiple places where malloc can lead to contention in a multi-threaded environment.
malloc probably keeps a global data structure in userspace to manage the user's heap address space. This global data structure would need to be protected against concurrent access and modification. Some allocators have optimizations to reduce the number of times this global data structure is accessed... I don't know how far Ubuntu has come along.
malloc allocates address space. So when you actually begin to touch the allocated memory you would go through a "soft page fault" which is a page fault which allows the OS kernel to allocate the backing RAM for the allocated address space. This can be expensive because of the trip to the kernel and would require the kernel to take some global locks to access its own global RAM resource data structures.
the user space allocator probably keeps some allocated space to give out new allocations from. However, once those allocations run out the allocator would need to go back to the kernel and allocate some more address space from the kernel. This is also expensive and would require a trip to the kernel and the kernel taking some global locks to access its global address space management related data structures.
Bottom line: these interactions can be fairly complicated. If you are running into these bottlenecks, I would suggest that you simply "pre-allocate" your memory. This would involve allocating it and then touching all of it (all from a single thread) so that you can use that memory later from all your threads without running into lock contention at the user or kernel level.
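A minimal sketch of that pre-allocation idea (my own illustration; the sizes are just taken from the question above): allocate one big block and touch all of it from a single thread, so that the worker threads later use memory that is already backed by RAM and needs no further calls into the allocator or the kernel:

// Allocate and pre-touch memory up front, single-threaded, so worker
// threads avoid allocator locks and soft page faults later.
#include <cstdlib>
#include <cstring>

int main() {
    const std::size_t height = 39916800;
    const std::size_t width  = 11;
    const std::size_t bytes  = height * width * sizeof(int);

    // One big allocation instead of millions of small ones.
    int* matrix = static_cast<int*>(std::malloc(bytes));

    // Touch every page once so the kernel backs the whole range with RAM
    // before the parallel work starts.
    std::memset(matrix, 0, bytes);

    // Worker threads can now use matrix[row * width + col] without
    // hitting allocator locks or page faults.
    std::free(matrix);
}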
Memory allocators are definitely a possible contention point for multiple threads.
Fundamentally, the heap is a shared data structure, since it is possible to allocate memory on one thread, and de-allocate it on another. In fact, your example does exactly that - the "resize" will free memory on each of the worker threads, which was initially allocated elsewhere.
Typical implementations of malloc included with gcc and other compilers use a shared global lock and work reasonably well across threads if memory allocation pressure is relatively low. Above a certain allocation level, however, threads will begin to serialize on the lock, you'll get excessive context switching and cache thrashing, and performance will degrade. Your program is an example of something which is allocation heavy, with an alloc + dealloc in the inner loop.
I'm surprised that an OpenMP-compatible compiler doesn't come with a better threaded malloc implementation. They certainly exist; take a look at this question for a list.
Technically, the STL vector uses the std::allocator which eventually calls new. new in its turn calls the libc's malloc (for your Linux system).
This malloc implementation is quite efficient as a general-purpose allocator and is thread-safe; however, it is not scalable (the GNU libc's malloc derives from Doug Lea's dlmalloc). There are numerous allocators and papers that improve upon dlmalloc to provide scalable allocation.
I would suggest that you take a look at Hoard from Dr. Emery Berger, tcmalloc from Google and Intel Threading Building Blocks scalable allocator.
