Memory consumption for threads

Memory consumption for threads - multithreading

How will the memory consumption vary in these cases, there is a Process containing 1 thread and there is a Process containing 0 thread.

While this question is too broad to answer, a thread typically need to save its stack, which is anywhere from a few KBs up to 1MB depending on the exact code executed inside the thread.
This is talking only about native C threads, other implementations might save extra data but it's usually safe to assume a thread will stay below 1 MB for most programming languages, and when in doubt just profile the program at hand.
Edit: according to Jérôme, the max memory of the thread stack is 2 MB on linux systems instead of 1 MB for windows.

Related

why is GHC thread extremely light weight?

reading Simon Marlow's tutorial on parallel programming in haskell.
a thread typically costs less than a hundred bytes plus the space for the stack, so the runtime supports literally millions of them, limited only by the available memory, unlike OS threads ....
I had different impression on thread costs. kernel structure used for a single thread exceeds 4MB (thread stack). 32bit process space can spawn no more than 1000 threads, not literally millions
I think he is referring to the part that GHC controlled, but if OS has already maintained stack for thread why would GHC need to do that again?

Haskell is using "green threads" (managed by the Haskell runtime) in addition / on top of operating system threads (you still need those to make use of multiple CPU cores).
if OS has already maintained stack for thread why would GHC need to do that again?
Precisely for the reason you mention: An OS thread is heavy. A green thread can be very lightweight.
If you are familar with Java, this is roughly equivalent to using Thread versus submitting a task to an ExecutorService backed by a threadpool.

Memory management while using threads

1) I tried searching how memory would be allocated when we use threads in program but couldn't find the answer. Here What and where are the stack and heap? is how stack and heap works when a single program is called. But what happens when it comes to program with threads?
2)Using OpenMP parallel region creates threads and parallel code would be executed concurrently in each thread. Does this allocate more space in the memory than the memory occupied by same code with sequential execution?

In general, yes, [user-space] stacks are one per thread, whereas the heap is usually shared by all threads. See for example this Linux question. However, on some operating systems (OS), on Windows in particular, even a single threaded app may use more than one heap. Using OpenMP for threading doesn't change these basics, which are mostly dependant on the operating system. So unless you narrow your question to a specific OS, more can't be said at this level of generality.
Since I'm too lazy to draw this myself, here's the comparative illustration from PThreads Programming by Nichols et al. (1996)
A somewhat more detailed (and alas potentially a bit more confusing) diagram is found in the free LLNL POSIX Threads Programming tutorial by B. Barney.
And yes, as you correctly suspected, running more threads does consume more stack memory. You can actually exhaust the virtual address space of a process just with thread stacks if you make enough of them. Various implementations of OpenMP have a STACKSIZE environment variable (or thereabout) that controls how much stack OpenMP allocates for a thread.
Regarding Z boson's question/suggestion about Thread Local Storage (TLS): roughly (i.e. conceptually) speaking, Thread Local Storage is a per-thread heap. There are differences from the per-process heap in the API used to manipulate it, at the very least because each thread needs its own separate pointer to its own TLS, but basically you have a heap-like chunk of the process address space that's reserved to each thread. TLS is optional, you don't have to use it. OpenMP provides its own abstraction/directive for TLS-like persistent per-thread data, called THREADPRIVATE. It's not necessary that the OpenMP THREADPRIVATE uses the operating system's TLS support, however there's a Linux-focused paper which says that such an implementation gave the best performance, at least in that environment.
And here is a subtlety (or why I said "roughly speaking" when I compared TLS to per-thread heaps): assume you want a per-thread heap, say, in order to reduce locking contention to the main heap. You don't actually have to store an entire per-thread heap in each thread's TLS. It suffices to store in each thread's TLS a different head pointer to heaps allocated in the shared per-process space. Identifying and automatically using per-thread heaps in a program (in order to reduce locking contention on the main heap) is a farily difficult CS problem. Heap allocators which do this automatically are called scalable/parallel[izing] heap allocators or thereabout. For example, Intel TBB provides one such allocator, and it can be used in your program even if you use nothing else from TBB. Although some people seem to believe Intel's TBB allocator contains black magic, it's in fact not really different from the aforementioned basic idea of using TLS to point to some thread-local heap, which in turn is made of several doubly-linked lists segregated by block/object-size, as the following diagrams from the Intel paper on TBB illustrate:
IBM has something rather similar for AIX 7.1, but a bit more complex. You can tell its (default) allocator to use a fixed number of heaps for multi-threaded applications, e.g. MALLOCOPTIONS=multiheap:3. AIX 7.1 also has another option (which can be combined the multiheap) MALLOCOPTIONS=threadcache, which appears somewhat similar to what Intel TBB does, in that it keeps a per-thread cache of deallocated regions, from which future allocation requests can be serviced with less global heap contention. Besides those options for the default allocator, AIX 7.1 also has a (non-default) "Watson2" allocator which "uses a thread-specific mechanism that uses a varying number of heap structures, which depend on the behavior of the program. Therefore no configuration options are required." (But you do need to select this allocator explicitly with MALLOCTYPE=Watson2.) Watson2's operation sounds even closer to what the Intel TBB allocator does.
The aforementioned two examples (Intel TBB and AIX) detailed above just meant as concrete examples, but shouldn't be understood as holding some exclusive sauce. The idea of per-thread or per-CPU heap cache/arena/magazine is fairly widespread. The BSDcan jemalloc paper cites a 1998 MS Research paper as the first to have systematically evaluated arenas for this purpose. The aforementioned MS paper does cite the ptmalloc web page as "visited on May 11, 1998" and summarizes ptmalloc's working as follows: "It uses a linked list of subheaps where each subheap has a lock, 128 free lists, and some memory to manage. When a thread needs to allocate a block, it scans the list of subheaps and grabs the first unlocked one, allocates the required block, and returns. If it can't find an unlocked subheap, it creates a new one and adds it to the list. In this way, a thread never waits on a locked subheap."

Is spawning threads based on application memory usage an overkill?

I have a system that uses threads to do various jobs.
Each thread uses from enough to too much memory, so there are times that the PC gets out of memory.
Each thread works from 8sec to 40sec max. approximatelly.
Is using Process.WorkingSet64 before spawing a new thread (to check for memory usage) an overkill ?
Basically, I am trying to prevent out-of-memory situations.
Is using Process.WorkingSet64 too heavy for calling it that often (let's say once every 4 seconds)?

If 256 threads give better performance than 8 have I likely got the wrong approach?

I've just started programming with POSIX threads on dual-core x86_64 Linux system. It seems that 256 threads is about the optimum for performance with the way I've done it. I'm wondering how this could be? And if it could mean that my approach is wrong and a better approach would require far fewer threads and be just as fast or faster?
For further background (the program in question is a skeleton for a multi-threaded M-set image generator) see the following questions I've asked already:
Using threads, how should I deal with something which ideally should happen in sequential order?
How can my threaded image generating app get it’s data to the gui?
Perhaps I should mention that the skeleton (in which I've reproduced minimal functionality for testing and comparison) is now displaying the image, and the actual calculations are done almost twice as fast as the non-threaded program.
So if 256 threads running faster than 8 threads is not indicative of a poor approach to threading, how come 256 threads does outperform 8 threads?
The speed test case is a portion of the Mandelbrot Set located at:
xmin -0.76243636067708333333333328
xmax -0.7624335575810185185185186
ymax 0.077996663411458333333333929
calculated to a maximum of 30000 iterations.
On the non-threaded version rendering time on my system is around 15 seconds. On the threaded version, averages speed for 8 threads is 7.8 seconds, while 256 threads is 7.6 seconds.

Well, probably yes, you're doing something wrong.
However, there are circumstances where 256 threads would run better than 8 without you necessarily having a bad threading model. One must remember that having 8 threads does not mean all 8 threads are actually running all the time. Anytime one thread makes a blocking syscall to the operating system, the thread will stop running and wait for the result. In the meantime, another thread can often do work.
There's this myth that one can't usefully use more threads than contexts on the CPU, but that's just not true. If your threads block on a syscall, it can be critical to have another thread available to do more work. (In practice when threads block there tends to be less work to do, but this is not always the case.)
It's all very dependent on work-load and there's no one right number of threads for any particular application. Generally you never want less threads available than the OS will run, and that's the only true rule. (Unfortunately this can be very hard to find out and so people tend to just fire up as many threads as contexts and then use non-blocking syscalls where possible.)

Could it be your app is io bound? How is the image data generated?

A performance improvement gained by allocating more threads than cores suggests that the CPU is not the bottleneck. If I/O access such as disk, memory or even network access are involved your results make perfect sense.

You are probably benefitting from Simultaneous Multithreading (SMT). Your operating system schedules more threads than cores available, and will swap in and out the threads that are not stalled waiting for resources (such as a memory load). This can very effectively hide the latencies of your memory system from your program and is the technique used to great effect for massive parallelization in CUDA for general purpose GPU programming.

If you are seeing a performance increase with the jump to 256 threads, then what you are probably dealing with is a resource bottleneck. At some point, your code is waiting for some slow device (a hard disk or a network connection, for example) in order to continue. With multiple threads, waiting on this slow device isn't a problem because instead of sitting idle and twiddling its electronic thumbs, the CPU can process another thread while the first thread is waiting on the slow device. The more parallel threads that are running, the more work the CPU can do while it is waiting on something else.
If you are seeing performance improve all the way up to 256 threads, I am tempted to say that you have a major performance bottleneck somewhere and it's not the CPU. To test this, try to see if you can measure the idle time of individual threads. I suspect that you will see your threads are stuck in a "blocked" or "waiting" state for a longer portion of their lifetime than they spend in the "running" or "active" state. Some debuggers or function profiling tools will let you do this, and I think there are also Linux tools to do this on the command line.

Regarding threads in Linux

How many threads a single process can handle in Linux(RHEL-5)? Once threads are created how much stack an each thread can get ?

Maximum number of threads: Maximum number of threads per process in Linux?
Stack size:
Even if pthread_attr_setstacksize() and pthread_attr_setstackaddr() are now provided, we still recommend that you do not use them unless you really have strong reasons for doing so. The default stack allocation strategy for LinuxThreads is nearly optimal: stacks start small (4k) and automatically grow on demand to a fairly large limit (2M). Moreover, there is no portable way to estimate the stack requirements of a thread, so setting the stack size yourself makes your program less reliable and non-portable.
(from http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html)

There is not a maximum number of threads by process.
There is however limit of the total active thread. This value can be retrieved by typing :
cat /proc/sys/kernel/threads-max
you can also change this value :
echo 99999 > /proc/sys/kernel/threads-max
Hope this helps.

If you're on a 32-bit machine, then the thread stacks will consume the address space eventually, depending on the size, probably at < 10,000 threads.
10k threads is certainly feasible and some people do run production servers with that many, but you really want to be sure that's the best way of doing what you're doing.
If you're thinking of having 10k threads, you probably have 64-bit machines anyway, and lots and lots of ram.

Thread stack size is configurable, using pthread_attr_setstack method. Amount of thread is imho limited only by resources you have, more than 2K threads work in an application i know.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string