1) I tried searching how memory would be allocated when we use threads in program but couldn't find the answer. Here What and where are the stack and heap? is how stack and heap works when a single program is called. But what happens when it comes to program with threads?
2)Using OpenMP parallel region creates threads and parallel code would be executed concurrently in each thread. Does this allocate more space in the memory than the memory occupied by same code with sequential execution?
In general, yes, [user-space] stacks are one per thread, whereas the heap is usually shared by all threads. See for example this Linux question. However, on some operating systems (OS), on Windows in particular, even a single threaded app may use more than one heap. Using OpenMP for threading doesn't change these basics, which are mostly dependant on the operating system. So unless you narrow your question to a specific OS, more can't be said at this level of generality.
Since I'm too lazy to draw this myself, here's the comparative illustration from PThreads Programming by Nichols et al. (1996)
A somewhat more detailed (and alas potentially a bit more confusing) diagram is found in the free LLNL POSIX Threads Programming tutorial by B. Barney.
And yes, as you correctly suspected, running more threads does consume more stack memory. You can actually exhaust the virtual address space of a process just with thread stacks if you make enough of them. Various implementations of OpenMP have a STACKSIZE environment variable (or thereabout) that controls how much stack OpenMP allocates for a thread.
Regarding Z boson's question/suggestion about Thread Local Storage (TLS): roughly (i.e. conceptually) speaking, Thread Local Storage is a per-thread heap. There are differences from the per-process heap in the API used to manipulate it, at the very least because each thread needs its own separate pointer to its own TLS, but basically you have a heap-like chunk of the process address space that's reserved to each thread. TLS is optional, you don't have to use it. OpenMP provides its own abstraction/directive for TLS-like persistent per-thread data, called THREADPRIVATE. It's not necessary that the OpenMP THREADPRIVATE uses the operating system's TLS support, however there's a Linux-focused paper which says that such an implementation gave the best performance, at least in that environment.
And here is a subtlety (or why I said "roughly speaking" when I compared TLS to per-thread heaps): assume you want a per-thread heap, say, in order to reduce locking contention to the main heap. You don't actually have to store an entire per-thread heap in each thread's TLS. It suffices to store in each thread's TLS a different head pointer to heaps allocated in the shared per-process space. Identifying and automatically using per-thread heaps in a program (in order to reduce locking contention on the main heap) is a farily difficult CS problem. Heap allocators which do this automatically are called scalable/parallel[izing] heap allocators or thereabout. For example, Intel TBB provides one such allocator, and it can be used in your program even if you use nothing else from TBB. Although some people seem to believe Intel's TBB allocator contains black magic, it's in fact not really different from the aforementioned basic idea of using TLS to point to some thread-local heap, which in turn is made of several doubly-linked lists segregated by block/object-size, as the following diagrams from the Intel paper on TBB illustrate:
IBM has something rather similar for AIX 7.1, but a bit more complex. You can tell its (default) allocator to use a fixed number of heaps for multi-threaded applications, e.g. MALLOCOPTIONS=multiheap:3. AIX 7.1 also has another option (which can be combined the multiheap) MALLOCOPTIONS=threadcache, which appears somewhat similar to what Intel TBB does, in that it keeps a per-thread cache of deallocated regions, from which future allocation requests can be serviced with less global heap contention. Besides those options for the default allocator, AIX 7.1 also has a (non-default) "Watson2" allocator which "uses a thread-specific mechanism that uses a varying number of heap structures, which depend on the behavior of the program. Therefore no configuration options are required." (But you do need to select this allocator explicitly with MALLOCTYPE=Watson2.) Watson2's operation sounds even closer to what the Intel TBB allocator does.
The aforementioned two examples (Intel TBB and AIX) detailed above just meant as concrete examples, but shouldn't be understood as holding some exclusive sauce. The idea of per-thread or per-CPU heap cache/arena/magazine is fairly widespread. The BSDcan jemalloc paper cites a 1998 MS Research paper as the first to have systematically evaluated arenas for this purpose. The aforementioned MS paper does cite the ptmalloc web page as "visited on May 11, 1998" and summarizes ptmalloc's working as follows: "It uses a linked list of subheaps where each subheap has a lock, 128 free lists, and some memory to manage. When a thread needs to allocate a block, it scans the list of subheaps and grabs the first unlocked one, allocates the required block, and returns. If it can't find an unlocked subheap, it creates a new one and adds it to the list. In this way, a thread never waits on a locked subheap."
Related
A multi-thread process written in C exhausts almost all of system memory. To find out the thread which is consuming most of the memory, I made a core file using gcore [pid] to check the memory usage per threads, but I can't find the way to do that.
ps -eLFlm and top command with -H option shows the total memory consumption, but not per thread.
Is there any useful tip to solve the problem?
OS : Centos6
A multi-thread process written in C exhausts almost all of system memory. To find out the thread which is consuming most of the memory....
That question does not make sense. By definition, all threads of the same process share the same virtual address space. You could query it programmatically using proc(5) (e.g. reading /proc/self/maps from your program).
It is possible (and quite common) that some heap memory is allocated (e.g. with malloc) in thread A, and would be released (e.g. free-d) later in some other thread B (often the main thread, just before exiting).
The C dynamic memory management heap is, by definition, a whole program property.
A typical example is the last arg argument to pthread_create(3). It generally should be heap-allocated. You could document and adopt the convention that the calling thread (the one using pthread_create) would malloc it, but that the created thread should free it (you could require that each start_routine passed to pthread_create should free that arg).
Is there any useful tip to solve the problem?
Perhaps valgrind might help you finding your memory leaks. You'll better compile all your program (and perhaps some relevant libraries) with DWARF debug information (e.g. compile with gcc -g) then restart your program. But such bugs are difficult to find, so be prepared to spend several weeks on them.
From the conceptual point of view, the "theory" of garbage collection (and also smart pointers, RAII, perhaps reference counting, etc...) could be helpful. So read the GC handbook (it is introducing the good concepts and terminology, and it explains that memory management is a whole program issue). A lot of concepts there are even relevant for programs in non-GC-ed languages like C or C++.
You need to define and follow some good enough whole program conventions regarding memory management (and that is difficult).
As far as I know, in multithreaded applications each thread has its own individual stack and shares the heap with others. So, I was wondering if this is a language-, compiler- or OS-depenadent feature? Are all multithreaded applications like this?
My second question is if there is a shared variable among threads, is it created in heap or not? Is there any other possible way to share a variable among the threads except through heap?
Is an OS-dependent feature but all OSes agree on the same basic principle of having a common heap space and separate stacks
Shared variables are stored in the heap memory. Threads share variables using the heap only.. obviously that's not the only way to share data, there are signals, callbacks and many other techniques...
Each thread will get its own stack memory that is feature of OS.
All multi-threaded applications likes and follows same feature.
My second question is if there is a shared variable among threads, is
it created in heap or not? Is there any other possible way to share a
variable among the threads except through heap?
According to Java there is no other way to share variable but through heap.
I would also say the stack is OS-dependent
However the heap (not the unified memory abstraction) is language-dependent as some languages maintain one heap per thread. This does not mean that the data allocated on this heap cannot be seen by other threads, though. It's just that a language does not have to use a shared concurrent data structure for allocations or a shared pages pool, which would be pretty stupid on a computing cluster for example.
Now there are also languages that enforce the isolation of some memory addresses (and maybe memory pages although it would be for performances rather than consistency), but this is yet another matter not related to allocations themselves.
I m thinking to use Posix robust mutexes to protect shared resource among different processes (on Linux). However there are some doubts about safety in difference scenarios. I have the following questions:
Are robust mutexes implemented in the kernel or in user code?
If latter, what would happen if a process happens to crash while in a call to pthread_mutex_lock or pthread_mutex_unlock and while a shared pthread_mutex datastructure is getting updated?
I understand that if a process locked the mutex and dies, a thread in another process will be awaken and return EOWNERDEAD. However, what would happen if the process dies (in unlikely case) exactly when the pthread_mutex datastructure (in shared memory) is being updated? Will the mutex get corrupted in that case? What would happen to another process that is mapped to the same shared memory if it were to call a pthread_mutex function?
Can the mutex still be recovered in this case?
This question applies to any pthread object with PTHREAD_PROCESS_SHARED attribute. Is it safe to call functions like pthread_mutex_lock, pthread_mutex_unlock, pthread_cond_signal, etc. concurrently on the same object from different processes? Are they thread-safe across different processes?
From the man-page for pthreads:
Over time, two threading implementations have been provided by the
GNU C library on Linux:
LinuxThreads
This is the original Pthreads implementation. Since glibc
2.4, this implementation is no longer supported.
NPTL (Native POSIX Threads Library)
This is the modern Pthreads implementation. By comparison
with LinuxThreads, NPTL provides closer conformance to the
requirements of the POSIX.1 specification and better
performance when creating large numbers of threads. NPTL is
available since glibc 2.3.2, and requires features that are
present in the Linux 2.6 kernel.
Both of these are so-called 1:1 implementations, meaning that each
thread maps to a kernel scheduling entity. Both threading
implementations employ the Linux clone(2) system call. In NPTL,
thread synchronization primitives (mutexes, thread joining, and so
on) are implemented using the Linux futex(2) system call.
And from man futex(7):
In its bare form, a futex is an aligned integer which is touched only
by atomic assembler instructions. Processes can share this integer
using mmap(2), via shared memory segments or because they share
memory space, in which case the application is commonly called
multithreaded.
An additional remark found here:
(In case you’re wondering how they work in shared memory: Futexes are keyed upon their physical address)
Summarizing, Linux decided to implement pthreads on top of their "native" futex primitive, which indeed lives in the user process address space. For shared synchronization primitives, this would be shared memory and the other processes will still be able to see it, after one process dies.
What happens in case of process termination? Ingo Molnar wrote an article called Robust Futexes about just that. The relevant quote:
Robust Futexes
There is one race possible though: since adding to and removing from the
list is done after the futex is acquired by glibc, there is a few
instructions window for the thread (or process) to die there, leaving
the futex hung. To protect against this possibility, userspace (glibc)
also maintains a simple per-thread 'list_op_pending' field, to allow the
kernel to clean up if the thread dies after acquiring the lock, but just
before it could have added itself to the list. Glibc sets this
list_op_pending field before it tries to acquire the futex, and clears
it after the list-add (or list-remove) has finished
Summary
Where this leaves you for other platforms, is open-ended. Suffice it to say that the Linux implementation, at least, has taken great care to meet our common-sense expectation of robustness.
Seeing that other operating systems usually resort to Kernel-based synchronization primitives in the first place, it makes sense to me to assume their implementations would be even more naturally robust.
Following the documentation from here: http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_getrobust.html, it does read that in a fully POSIX compliant OS, shared mutex with the robust flag will behave in the way you'd expect.
The problem obviously is that not all OS are fully POSIX compliant. Not even those claiming to be. Process shared mutexes and in particular robust ones are among those finer points that are often not part of an OS's implementation of POSIX.
I've been running some experiments with the openmp framework and found some odd results I'm not sure I know how to explain.
My goal is to create this huge matrix and then fill it with values. I made some parts of my code like parallel loops in order to gain performance from my multithreaded enviroment. I'm running this in a machine with 2 quad-core xeon processors, so I can safely put up to 8 concurrent threads in there.
Everything works as expected, but for some reason the for loop actually allocating the rows of my matrix have an odd peak performance when running with only 3 threads. From there on, adding some more threads just makes my loop take longer. With 8 threads taking actually more time that it would need with only one.
This is my parallel loop:
int width = 11;
int height = 39916800;
vector<vector<int> > matrix;
matrix.resize(height);
#pragma omp parallel shared(matrix,width,height) private(i) num_threads(3)
{
#pragma omp for schedule(dynamic,chunk)
for(i = 0; i < height; i++){
matrix[i].resize(width);
}
} /* End of parallel block */
This made me wonder: is there a known performance problem when calling malloc (which I suppose is what the resize method of the vector template class is actually calling) in a multithreaded enviroment? I found some articles saying something about performance loss in freeing heap space in a mutithreaded enviroment, but nothing specific about allocating new space as in this case.
Just to give you an example, I'm placing below a graph of the time it takes for the loop to finish as a function of the number of threads for both the allocation loop, and a normal loop that just reads data from this huge matrix later on.
Both times where measured using the gettimeofday function and seem to return very similar and accurate results across different execution instances. So, anyone has a good explanation?
You are right about vector::resize() internally calling malloc. Implementation-wise malloc is fairly complicated. I can see multiple places where malloc can lead to contention in a multi-threaded environment.
malloc probably keeps a global data structure in userspace to manage the user's heap address space. This global data structure would need to be protected against concurrent access and modification. Some allocators have optimizations to alleviate the number of times this global data structure is accessed... I don't know how far has Ubuntu come along.
malloc allocates address space. So when you actually begin to touch the allocated memory you would go through a "soft page fault" which is a page fault which allows the OS kernel to allocate the backing RAM for the allocated address space. This can be expensive because of the trip to the kernel and would require the kernel to take some global locks to access its own global RAM resource data structures.
the user space allocator probably keeps some allocated space to give out new allocations from. However, once those allocations run out the allocator would need to go back to the kernel and allocate some more address space from the kernel. This is also expensive and would require a trip to the kernel and the kernel taking some global locks to access its global address space management related data structures.
Bottomline, these interactions could be fairly complicated. If you are running into these bottlenecks I would suggest that you simply "pre-allocate" your memory. This would involve allocating it and then touching all of it (all from a single thread) so that you can use that memory later from all your threads without running into lock contention at user or kernel level.
Memory allocators are definitely a possible contention point for multiple threads.
Fundamentally, the heap is a shared data structure, since it is possible to allocate memory on one thread, and de-allocate it on another. In fact, your example does exactly that - the "resize" will free memory on each of the worker threads, which was initially allocated elsewhere.
Typical implementations of malloc included with gcc and other compilers use a shared global lock and work reasonably well across threads if memory allocation pressure is relatively low. Above a certain allocation level, however, threads will begin to serialize on the lock, you'll get excessive context switching and cache trashing, and performance will degrade. Your program is an example of something which is allocation heavy, with an alloc + dealloc in the inner loop.
I'm surprised that an OpenMP compatible compiler doesn't have a better threaded malloc implementation? They certainly exist - take a look at this question for a list.
Technically, the STL vector uses the std::allocator which eventually calls new. new in its turn calls the libc's malloc (for your Linux system).
This malloc implementation is quite efficient as a general purpose allocator, is thread-safe, however it is not scalable (the GNU libc's malloc derives from Doug Lea's dlmalloc). There are numerous allocators and papers that improve upon dlmalloc to provide scalable allocation.
I would suggest that you take a look at Hoard from Dr. Emery Berger, tcmalloc from Google and Intel Threading Building Blocks scalable allocator.
Okay, so we support per-process memory paging/protection today. I've been wondering for years what sort of benefit is gained by offering page-level protections to what is arguably the smallest execution unit our OSes support today: threads. This question on Software Transactional Memory brought it back to the forefront for me.
Benefits to having page-level thread-ownership
OS support for locking the page when accessed
In theory, protection against memory corruption if the OS had a mechanism to take ownership for the lifetime of a thread.
Downsides:
Deadlock detection with standard
locking techniques is already
difficult enough
debugger/OS
support for determining page-level
ownership
Any other downsides, upsides that you can see from supporting such a model?
This kind of programming model is already possible with processes and shared memory. It isn't used much, for good reason: interprocess message passing is far safer and easier to reason about.
Per-thread per-page memory protection can be used to efficiently implement parallel garbage collection.
The problem to be solved is that in order to collect a region of memory, the garbage collector needs exclusive access to that region, otherwise other threads (so-called "mutator" threads) would be able to read and write objects that are not in a consistent state (for example, halfway through being copied from oldspace to newspace).
With per-thread memory protection, the garbage collector can control access to the region of memory so that only the collector thread can access it; attempts by other threads to access the region of memory will result in segmentation faults that can be handled by the collector (for example, by blocking the thread until the collector is finished with that region).