How to identify memory consumption per thread in a process?

How to identify memory consumption per thread in a process? - multithreading

A multi-thread process written in C exhausts almost all of system memory. To find out the thread which is consuming most of the memory, I made a core file using gcore [pid] to check the memory usage per threads, but I can't find the way to do that.
ps -eLFlm and top command with -H option shows the total memory consumption, but not per thread.
Is there any useful tip to solve the problem?
OS : Centos6

A multi-thread process written in C exhausts almost all of system memory. To find out the thread which is consuming most of the memory....
That question does not make sense. By definition, all threads of the same process share the same virtual address space. You could query it programmatically using proc(5) (e.g. reading /proc/self/maps from your program).
It is possible (and quite common) that some heap memory is allocated (e.g. with malloc) in thread A, and would be released (e.g. free-d) later in some other thread B (often the main thread, just before exiting).
The C dynamic memory management heap is, by definition, a whole program property.
A typical example is the last arg argument to pthread_create(3). It generally should be heap-allocated. You could document and adopt the convention that the calling thread (the one using pthread_create) would malloc it, but that the created thread should free it (you could require that each start_routine passed to pthread_create should free that arg).
Is there any useful tip to solve the problem?
Perhaps valgrind might help you finding your memory leaks. You'll better compile all your program (and perhaps some relevant libraries) with DWARF debug information (e.g. compile with gcc -g) then restart your program. But such bugs are difficult to find, so be prepared to spend several weeks on them.
From the conceptual point of view, the "theory" of garbage collection (and also smart pointers, RAII, perhaps reference counting, etc...) could be helpful. So read the GC handbook (it is introducing the good concepts and terminology, and it explains that memory management is a whole program issue). A lot of concepts there are even relevant for programs in non-GC-ed languages like C or C++.
You need to define and follow some good enough whole program conventions regarding memory management (and that is difficult).

Related

exec without replacing current process, posix_spawn kernel implementation

Although the kernel marks pages (and page tables) as copy on write to make the fork syscall work efficiently, the creation and tear-down of page tables and related structures is still an expensive task.
Thus I wonder why the linux community has never managed to implement posix_spawn as a real kernel syscall that just spawns a new process, eliminating the need to call fork beforehand.
Instead, posix_spawn is just a poor glibc wrapper around fork and exec.
The performance gains would be significantly for workloads that have to spawn thousands of new processes every second. The latency for launching new processes would be improved as well.

That's basically what posix_spawn is for. It is also a more flexible API. The real bug is that the Linux exec man page still doesn't include a cross-reference for it.

Fork with copy-on-write is very expensive. To illustrate this, you might want to read the implementation of classic vfork semantics in NetBSD. The mail provides some hard numbers for a real world use case, building software. COW for very large programs is also an easily measurable penalty. A friend of mine wrote his own spawn daemon for his Java application, because forking+exec from a 8GB+ JVM took way too long.
The main problem with vfork in the modern world is that it can interact badly with multi-threading. I.e. consider that the post-vfork code has to reference a function that hasn't been resolved by the dynamic linker yet. The dynamic linker now has to lock itself. This can result in dead locks with the original program for example.

Memory management while using threads

1) I tried searching how memory would be allocated when we use threads in program but couldn't find the answer. Here What and where are the stack and heap? is how stack and heap works when a single program is called. But what happens when it comes to program with threads?
2)Using OpenMP parallel region creates threads and parallel code would be executed concurrently in each thread. Does this allocate more space in the memory than the memory occupied by same code with sequential execution?

In general, yes, [user-space] stacks are one per thread, whereas the heap is usually shared by all threads. See for example this Linux question. However, on some operating systems (OS), on Windows in particular, even a single threaded app may use more than one heap. Using OpenMP for threading doesn't change these basics, which are mostly dependant on the operating system. So unless you narrow your question to a specific OS, more can't be said at this level of generality.
Since I'm too lazy to draw this myself, here's the comparative illustration from PThreads Programming by Nichols et al. (1996)
A somewhat more detailed (and alas potentially a bit more confusing) diagram is found in the free LLNL POSIX Threads Programming tutorial by B. Barney.
And yes, as you correctly suspected, running more threads does consume more stack memory. You can actually exhaust the virtual address space of a process just with thread stacks if you make enough of them. Various implementations of OpenMP have a STACKSIZE environment variable (or thereabout) that controls how much stack OpenMP allocates for a thread.
Regarding Z boson's question/suggestion about Thread Local Storage (TLS): roughly (i.e. conceptually) speaking, Thread Local Storage is a per-thread heap. There are differences from the per-process heap in the API used to manipulate it, at the very least because each thread needs its own separate pointer to its own TLS, but basically you have a heap-like chunk of the process address space that's reserved to each thread. TLS is optional, you don't have to use it. OpenMP provides its own abstraction/directive for TLS-like persistent per-thread data, called THREADPRIVATE. It's not necessary that the OpenMP THREADPRIVATE uses the operating system's TLS support, however there's a Linux-focused paper which says that such an implementation gave the best performance, at least in that environment.
And here is a subtlety (or why I said "roughly speaking" when I compared TLS to per-thread heaps): assume you want a per-thread heap, say, in order to reduce locking contention to the main heap. You don't actually have to store an entire per-thread heap in each thread's TLS. It suffices to store in each thread's TLS a different head pointer to heaps allocated in the shared per-process space. Identifying and automatically using per-thread heaps in a program (in order to reduce locking contention on the main heap) is a farily difficult CS problem. Heap allocators which do this automatically are called scalable/parallel[izing] heap allocators or thereabout. For example, Intel TBB provides one such allocator, and it can be used in your program even if you use nothing else from TBB. Although some people seem to believe Intel's TBB allocator contains black magic, it's in fact not really different from the aforementioned basic idea of using TLS to point to some thread-local heap, which in turn is made of several doubly-linked lists segregated by block/object-size, as the following diagrams from the Intel paper on TBB illustrate:
IBM has something rather similar for AIX 7.1, but a bit more complex. You can tell its (default) allocator to use a fixed number of heaps for multi-threaded applications, e.g. MALLOCOPTIONS=multiheap:3. AIX 7.1 also has another option (which can be combined the multiheap) MALLOCOPTIONS=threadcache, which appears somewhat similar to what Intel TBB does, in that it keeps a per-thread cache of deallocated regions, from which future allocation requests can be serviced with less global heap contention. Besides those options for the default allocator, AIX 7.1 also has a (non-default) "Watson2" allocator which "uses a thread-specific mechanism that uses a varying number of heap structures, which depend on the behavior of the program. Therefore no configuration options are required." (But you do need to select this allocator explicitly with MALLOCTYPE=Watson2.) Watson2's operation sounds even closer to what the Intel TBB allocator does.
The aforementioned two examples (Intel TBB and AIX) detailed above just meant as concrete examples, but shouldn't be understood as holding some exclusive sauce. The idea of per-thread or per-CPU heap cache/arena/magazine is fairly widespread. The BSDcan jemalloc paper cites a 1998 MS Research paper as the first to have systematically evaluated arenas for this purpose. The aforementioned MS paper does cite the ptmalloc web page as "visited on May 11, 1998" and summarizes ptmalloc's working as follows: "It uses a linked list of subheaps where each subheap has a lock, 128 free lists, and some memory to manage. When a thread needs to allocate a block, it scans the list of subheaps and grabs the first unlocked one, allocates the required block, and returns. If it can't find an unlocked subheap, it creates a new one and adds it to the list. In this way, a thread never waits on a locked subheap."

Linux system call for creating process and thread

I read in a paper that the underlying system call to create processes and threads is actually the same, and thus the cost of creating processes over threads is not that great.
First, I wanna know what is the system call that creates
processes/threads (possibly a sample code or a link?)
Second, is
the author correct to assume that creating processes instead of
threads is inexpensive?
EDIT:
Quoting article:
Replacing pthreads with processes is surprisingly inexpensive,
especially on Linux where both pthreads and processes are invoked
using the same underlying system call.

Processes are usually created with fork, threads (lightweight processes) are usually created with clone nowadays. However, anecdotically, there exist 1:N thread models, too, which don't do either.
Both fork and clone map to the same kernel function do_fork internally. This function can create a lightweight process that shares the address space with the old one, or a separate process (and many other options), depending on what flags you feed to it. The clone syscall is more or less a direct forwarding of that kernel function (and used by the higher level threading libraries) whereas fork wraps do_fork into the functionality of the 50 year old traditional Unix function.
The important difference is that fork guarantees that a complete, separate copy of the address space is made. This, as Basil points out correctly, is done with copy-on-write nowadays and therefore is not nearly as expensive as one would think.
When you create a thread, it just reuses the original address space and the same memory.
However, one should not assume that creating processes is generally "lightweight" on unix-like systems because of copy-on-write. It is somewhat less heavy than for example under Windows, but it's nowhere near free.
One reason is that although the actual pages are not copied, the new process still needs a copy of the page table. This can be several kilobytes to megabytes of memory for processes that use larger amounts of memory.
Another reason is that although copy-on-write is invisible and a clever optimization, it is not free, and it cannot do magic. When data is modified by either process, which inevitably happens, the affected pages fault.
Redis is a good example where you can see that fork is everything but lightweight (it uses fork to do background saves).

The underlying system call to create threads is clone(2) (it is Linux specific). BTW, the list of Linux system calls is on syscalls(2), and you could use the strace(1) command to understand the syscalls done by some process or command. Processes are usually created with fork(2) (or vfork(2), which is not much useful these days). However, you could (and some C standard libraries might do that) create them with some particular form of clone. I guess that the kernel is sharing some code to implement clone, fork etc... (since some functionalities, e.g. management of the virtual address space, are common).
Indeed, process creation (and also thread creation) is generally quite fast on most Unix systems (because they use copy-on-write machinery for the virtual memory), typically a small fraction of a millisecond. But you could have pathological cases (e.g. thrashing) which makes that much longer.
Since most C standard library implementations are free software on Linux, you could study the source code of the one on your system (often GNU glibc, but sometimes musl-libc or something else).

Tracking threads memory and CPU consumption

I'm writing a Linux application which observes other applications and tracks consumption of resources . I'm planning work with Java, but programming language isn't important for me. The Goal is important, so I can switch to another technology or use modules. My application runs any selected third party application as child process. Mostly child software solves some algorithm like graphs, string search, etc. Observer program tracks child's resources while it ends the job.
If child application is multi-threaded, maybe somehow is possible to track how much resources consumes each thread? Application could be written using any not distributive-memory threads technology: Java threads, Boost threads, POSIX threads, OpenMP, any other.

In modern Linux systems (2.6), each thread has a separate identifier that has nearly the same treatment as the pid. It is shown in the process table (at least, in htop program) and it also has its separate /proc entry, i.e. /proc/<tid>/stat.
Check man 5 proc and pay particular attention to stat, statm, status etc. You should find the information you're interested in there.
An only obstacle is to obtain this thread identifier. It is different with the process id! I.e. getpid() calls in all threads return the same value. To get the actual thread identifier, you should use (within a C program):
pid_t tid = syscall(SYS_gettid);
By the way, java virtual machine (at least, its OpenJDK Linux implementation) does that internally and uses it for debugging purposes in its back-end, but doesn't expose it to the java interface.

Memory is not allocated to threads, and often shared across threads. This makes it generally impossible and at least meaningless to talk about the memory consumption of a thread.
An example could be a program with 11 threads; 1 creating objects and 10 using those objects. Most of the work is done on those 10 threads, but all memory was allocated on the one thread that created the objects. Now how does one account for that?

If you're willing to use Perl take a look at this: Sys-Statistics-Linux
I used it together with some of the GD graphing packages to generate system resource usage graphs for various processes.
One thing to watch out for - you'll really need to read up on /proc and understand jiffies - last time I looked they're not documented correctly in the man pages, you'll need to read kernel source probably:
http://lxr.linux.no/#linux+v2.6.18/include/linux/jiffies.h
Also, remember that in Linux the only difference between a thread and process is that threads share memory - other than that they're identical in how the kernel implements them.

Threads vs Processes in Linux [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed last year.
The community reviewed whether to reopen this question last year and left it closed:
Original close reason(s) were not resolved
Improve this question
I've recently heard a few people say that in Linux, it is almost always better to use processes instead of threads, since Linux is very efficient in handling processes, and because there are so many problems (such as locking) associated with threads. However, I am suspicious, because it seems like threads could give a pretty big performance gain in some situations.
So my question is, when faced with a situation that threads and processes could both handle pretty well, should I use processes or threads? For example, if I were writing a web server, should I use processes or threads (or a combination)?

Linux uses a 1-1 threading model, with (to the kernel) no distinction between processes and threads -- everything is simply a runnable task. *
On Linux, the system call clone clones a task, with a configurable level of sharing, among which are:
CLONE_FILES: share the same file descriptor table (instead of creating a copy)
CLONE_PARENT: don't set up a parent-child relationship between the new task and the old (otherwise, child's getppid() = parent's getpid())
CLONE_VM: share the same memory space (instead of creating a COW copy)
fork() calls clone(least sharing) and pthread_create() calls clone(most sharing). **
forking costs a tiny bit more than pthread_createing because of copying tables and creating COW mappings for memory, but the Linux kernel developers have tried (and succeeded) at minimizing those costs.
Switching between tasks, if they share the same memory space and various tables, will be a tiny bit cheaper than if they aren't shared, because the data may already be loaded in cache. However, switching tasks is still very fast even if nothing is shared -- this is something else that Linux kernel developers try to ensure (and succeed at ensuring).
In fact, if you are on a multi-processor system, not sharing may actually be beneficial to performance: if each task is running on a different processor, synchronizing shared memory is expensive.
* Simplified. CLONE_THREAD causes signals delivery to be shared (which needs CLONE_SIGHAND, which shares the signal handler table).
** Simplified. There exist both SYS_fork and SYS_clone syscalls, but in the kernel, the sys_fork and sys_clone are both very thin wrappers around the same do_fork function, which itself is a thin wrapper around copy_process. Yes, the terms process, thread, and task are used rather interchangeably in the Linux kernel...

Linux (and indeed Unix) gives you a third option.
Option 1 - processes
Create a standalone executable which handles some part (or all parts) of your application, and invoke it separately for each process, e.g. the program runs copies of itself to delegate tasks to.
Option 2 - threads
Create a standalone executable which starts up with a single thread and create additional threads to do some tasks
Option 3 - fork
Only available under Linux/Unix, this is a bit different. A forked process really is its own process with its own address space - there is nothing that the child can do (normally) to affect its parent's or siblings address space (unlike a thread) - so you get added robustness.
However, the memory pages are not copied, they are copy-on-write, so less memory is usually used than you might imagine.
Consider a web server program which consists of two steps:
Read configuration and runtime data
Serve page requests
If you used threads, step 1 would be done once, and step 2 done in multiple threads. If you used "traditional" processes, steps 1 and 2 would need to be repeated for each process, and the memory to store the configuration and runtime data duplicated. If you used fork(), then you can do step 1 once, and then fork(), leaving the runtime data and configuration in memory, untouched, not copied.
So there are really three choices.

That depends on a lot of factors. Processes are more heavy-weight than threads, and have a higher startup and shutdown cost. Interprocess communication (IPC) is also harder and slower than interthread communication.
Conversely, processes are safer and more secure than threads, because each process runs in its own virtual address space. If one process crashes or has a buffer overrun, it does not affect any other process at all, whereas if a thread crashes, it takes down all of the other threads in the process, and if a thread has a buffer overrun, it opens up a security hole in all of the threads.
So, if your application's modules can run mostly independently with little communication, you should probably use processes if you can afford the startup and shutdown costs. The performance hit of IPC will be minimal, and you'll be slightly safer against bugs and security holes. If you need every bit of performance you can get or have a lot of shared data (such as complex data structures), go with threads.

Others have discussed the considerations.
Perhaps the important difference is that in Windows processes are heavy and expensive compared to threads, and in Linux the difference is much smaller, so the equation balances at a different point.

Once upon a time there was Unix and in this good old Unix there was lots of overhead for processes, so what some clever people did was to create threads, which would share the same address space with the parent process and they only needed a reduced context switch, which would make the context switch more efficient.
In a contemporary Linux (2.6.x) there is not much difference in performance between a context switch of a process compared to a thread (only the MMU stuff is additional for the thread).
There is the issue with the shared address space, which means that a faulty pointer in a thread can corrupt memory of the parent process or another thread within the same address space.
A process is protected by the MMU, so a faulty pointer will just cause a signal 11 and no corruption.
I would in general use processes (not much context switch overhead in Linux, but memory protection due to MMU), but pthreads if I would need a real-time scheduler class, which is a different cup of tea all together.
Why do you think threads are have such a big performance gain on Linux? Do you have any data for this, or is it just a myth?

I think everyone has done a great job responding to your question. I'm just adding more information about thread versus process in Linux to clarify and summarize some of the previous responses in context of kernel. So, my response is in regarding to kernel specific code in Linux. According to Linux Kernel documentation, there is no clear distinction between thread versus process except thread uses shared virtual address space unlike process. Also note, the Linux Kernel uses the term "task" to refer to process and thread in general.
"There are no internal structures implementing processes or threads, instead there is a struct task_struct that describe an abstract scheduling unit called task"
Also according to Linus Torvalds, you should NOT think about process versus thread at all and because it's too limiting and the only difference is COE or Context of Execution in terms of "separate the address space from the parent " or shared address space. In fact he uses a web server example to make his point here (which highly recommend reading).
Full credit to linux kernel documentation

If you want to create a pure a process as possible, you would use clone() and set all the clone flags. (Or save yourself the typing effort and call fork())
If you want to create a pure a thread as possible, you would use clone() and clear all the clone flags (Or save yourself the typing effort and call pthread_create())
There are 28 flags that dictate the level of resource sharing. This means that there are over 268 million flavours of tasks that you can create, depending on what you want to share.
This is what we mean when we say that Linux does not distinguish between a process and a thread, but rather alludes to any flow of control within a program as a task. The rationale for not distinguishing between the two is, well, not uniquely defining over 268 million flavours!
Therefore, making the "perfect decision" of whether to use a process or thread is really about deciding which of the 28 resources to clone.

How tightly coupled are your tasks?
If they can live independently of each other, then use processes. If they rely on each other, then use threads. That way you can kill and restart a bad process without interfering with the operation of the other tasks.

To complicate matters further, there is such a thing as thread-local storage, and Unix shared memory.
Thread-local storage allows each thread to have a separate instance of global objects. The only time I've used it was when constructing an emulation environment on linux/windows, for application code that ran in an RTOS. In the RTOS each task was a process with it's own address space, in the emulation environment, each task was a thread (with a shared address space). By using TLS for things like singletons, we were able to have a separate instance for each thread, just like under the 'real' RTOS environment.
Shared memory can (obviously) give you the performance benefits of having multiple processes access the same memory, but at the cost/risk of having to synchronize the processes properly. One way to do that is have one process create a data structure in shared memory, and then send a handle to that structure via traditional inter-process communication (like a named pipe).

In my recent work with LINUX is one thing to be aware of is libraries. If you are using threads make sure any libraries you may use across threads are thread-safe. This burned me a couple of times. Notably libxml2 is not thread-safe out of the box. It can be compiled with thread safe but that is not what you get with aptitude install.

I'd have to agree with what you've been hearing. When we benchmark our cluster (xhpl and such), we always get significantly better performance with processes over threads. </anecdote>

The decision between thread/process depends a little bit on what you will be using it to.
One of the benefits with a process is that it has a PID and can be killed without also terminating the parent.
For a real world example of a web server, apache 1.3 used to only support multiple processes, but in in 2.0 they added an abstraction so that you can swtch between either. Comments seems to agree that processes are more robust but threads can give a little bit better performance (except for windows where performance for processes sucks and you only want to use threads).

For most cases i would prefer processes over threads.
threads can be useful when you have a relatively smaller task (process overhead >> time taken by each divided task unit) and there is a need of memory sharing between them. Think a large array.
Also (offtopic), note that if your CPU utilization is 100 percent or close to it, there is going to be no benefit out of multithreading or processing. (in fact it will worsen)

Threads -- > Threads shares a memory space,it is an abstraction of the CPU,it is lightweight.
Processes --> Processes have their own memory space,it is an abstraction of a computer.
To parallelise task you need to abstract a CPU.
However the advantages of using a process over a thread is security,stability while a thread uses lesser memory than process and offers lesser latency.
An example in terms of web would be chrome and firefox.
In case of Chrome each tab is a new process hence memory usage of chrome is higher than firefox ,while the security and stability provided is better than firefox.
The security here provided by chrome is better,since each tab is a new process different tab cannot snoop into the memory space of a given process.

Multi-threading is for masochists. :)
If you are concerned about an environment where you are constantly creating threads/forks, perhaps like a web server handling requests, you can pre-fork processes, hundreds if necessary. Since they are Copy on Write and use the same memory until a write occurs, it's very fast. They can all block, listening on the same socket and the first one to accept an incoming TCP connection gets to run with it. With g++ you can also assign functions and variables to be closely placed in memory (hot segments) to ensure when you do write to memory, and cause an entire page to be copied at least subsequent write activity will occur on the same page. You really have to use a profiler to verify that kind of stuff but if you are concerned about performance, you should be doing that anyway.
Development time of threaded apps is 3x to 10x times longer due to the subtle interaction on shared objects, threading "gotchas" you didn't think of, and very hard to debug because you cannot reproduce thread interaction problems at will. You may have to do all sort of performance killing checks like having invariants in all your classes that are checked before and after every function and you halt the process and load the debugger if something isn't right. Most often it's embarrassing crashes that occur during production and you have to pore through a core dump trying to figure out which threads did what. Frankly, it's not worth the headache when forking processes is just as fast and implicitly thread safe unless you explicitly share something. At least with explicit sharing you know exactly where to look if a threading style problem occurs.
If performance is that important, add another computer and load balance. For the developer cost of debugging a multi-threaded app, even one written by an experienced multi-threader, you could probably buy 4 40 core Intel motherboards with 64gigs of memory each.
That being said, there are asymmetric cases where parallel processing isn't appropriate, like, you want a foreground thread to accept user input and show button presses immediately, without waiting for some clunky back end GUI to keep up. Sexy use of threads where multiprocessing isn't geometrically appropriate. Many things like that just variables or pointers. They aren't "handles" that can be shared in a fork. You have to use threads. Even if you did fork, you'd be sharing the same resource and subject to threading style issues.

If you need to share resources, you really should use threads.
Also consider the fact that context switches between threads are much less expensive than context switches between processes.
I see no reason to explicitly go with separate processes unless you have a good reason to do so (security, proven performance tests, etc...)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string