Regarding threads in Linux

How many threads can a single process handle in Linux (RHEL 5)? Once the threads are created, how much stack does each thread get?

Maximum number of threads: Maximum number of threads per process in Linux?
Stack size:
Even if pthread_attr_setstacksize() and pthread_attr_setstackaddr() are now provided, we still recommend that you do not use them unless you really have strong reasons for doing so. The default stack allocation strategy for LinuxThreads is nearly optimal: stacks start small (4k) and automatically grow on demand to a fairly large limit (2M). Moreover, there is no portable way to estimate the stack requirements of a thread, so setting the stack size yourself makes your program less reliable and non-portable.
(from http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html)

There is no maximum number of threads per process.
There is, however, a limit on the total number of active threads system-wide. This value can be retrieved by typing:
cat /proc/sys/kernel/threads-max
You can also change this value:
echo 99999 > /proc/sys/kernel/threads-max
Hope this helps.

If you're on a 32-bit machine, the thread stacks will eventually consume the address space; depending on their size, that will probably happen at fewer than 10,000 threads.
10k threads is certainly feasible and some people do run production servers with that many, but you really want to be sure that's the best way of doing what you're doing.
If you're thinking of having 10k threads, you probably have 64-bit machines anyway, and lots and lots of ram.

Thread stack size is configurable, using the pthread_attr_setstack() function. The number of threads is, in my opinion, limited only by the resources you have; I know of an application in which more than 2,000 threads work.
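For illustration, here is a minimal sketch (not from the original answer) that sets an explicit per-thread stack size with the companion call pthread_attr_setstacksize(), which only sets the size and avoids having to allocate the stack yourself:

#include <limits.h>
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    puts("hello from a thread with a custom stack size");
    return NULL;
}

int main(void) {
    pthread_attr_t attr;
    pthread_t tid;
    size_t stack_size = 256 * 1024;          /* request a 256 KiB stack */

    pthread_attr_init(&attr);
    if (stack_size < PTHREAD_STACK_MIN)      /* must be at least the minimum */
        stack_size = PTHREAD_STACK_MIN;
    pthread_attr_setstacksize(&attr, stack_size);

    int rc = pthread_create(&tid, &attr, worker, NULL);
    if (rc != 0) {
        fprintf(stderr, "pthread_create failed: %d\n", rc);
        return 1;
    }
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}

Compile with -pthread; whether the requested size is honoured exactly is implementation-dependent.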

Related

Cost of a thread

I understand how to create a thread in my chosen language and I understand about mutexes and the dangers of shared data, etc., but I'm not sure about how the OS manages threads and the cost of each thread. I have a series of questions that all relate, and the clearest way to show the limit of my understanding is probably via these questions.
What is the cost of spawning a thread? Is it even worth worrying about when designing software? One of the costs of creating a thread must be its own stack pointer and program counter, then space to copy all of the working registers to as it is moved on and off of a core by the scheduler, but what else?
Is the amount of stack available for one program split equally between the threads of a process, or handed out on a first-come, first-served basis?
Can I somehow check the hardware on start-up (of the program) for the number of cores? If I am running on a machine with N cores, should I keep the number of threads to N-1?
then space to copy all of the working registers to as it is moved on
and off of a core by the scheduler, but what else?
One less evident cost is the strain imposed on the scheduler, which may start to choke if it needs to juggle thousands of threads. The memory isn't really the issue. With the right tweaking you can get a "thread" to occupy very little memory, little more than its stack. This tweaking can be difficult (e.g. using clone(2) directly under Linux), but it can be done.
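As a very rough illustration of the clone(2) route, here is a sketch (Linux-specific and deliberately simplified: a real "thread" created this way needs far more care around signals, thread-local storage, synchronisation and cleanup):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int child_fn(void *arg) {
    (void)arg;
    /* This "thread" shares our address space but owns only the small stack
       passed to clone() below. */
    return 0;
}

int main(void) {
    size_t stack_size = 64 * 1024;           /* deliberately small stack */
    char *stack = malloc(stack_size);
    if (!stack) { perror("malloc"); return 1; }

    /* The stack grows downwards on x86, so pass the top of the allocation. */
    pid_t pid = clone(child_fn, stack + stack_size,
                      CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); free(stack); return 1; }

    waitpid(pid, NULL, 0);                   /* reap it before reusing the stack */
    free(stack);
    return 0;
}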
Is the amount of stack available for one program split equally between
threads of a process or on a first come first served
Each thread gets its own stack, and typically you can control its size.
If I am running on a machine with N cores, should I keep the number of
threads to N-1
Checking the number of cores is easy, but environment-specific. However, limiting the number of threads to the number of cores only makes sense if your workload consists of CPU-intensive operations, with little I/O. If I/O is involved you may want to have many more threads than cores.
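To illustrate the "easy but environment-specific" part: on Linux/glibc the online core count can be queried like this (a sketch; _SC_NPROCESSORS_ONLN is a widely available extension, not something POSIX guarantees):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    long cores = sysconf(_SC_NPROCESSORS_ONLN);  /* processors currently online */
    if (cores < 1)
        cores = 1;                               /* fall back if unsupported */
    printf("online cores: %ld\n", cores);
    return 0;
}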
You should be as thoughtful as possible in everything you design and implement.
I know that a Java thread stack takes up about 1 MB each time you create a thread, so they add up.
Threads make sense for asynchronous tasks that allow long-running activities to happen without preventing all other users/processes from making progress.
Threads are managed by the operating system. There are lots of schemes, all under the control of the operating system (e.g. round robin, first come first served, etc.)
It makes perfect sense to me to assign one thread per core for some activities (e.g. computationally intensive calculations, graphics, math, etc.), but that need not be the deciding factor. One app I develop uses roughly 100 active threads in production; it's not a 100 core machine.
To add to the other excellent posts:
'What is the cost of spawning a thread? Is it worth even worrying about when designing software?'
It is if one of your design choices is doing such a thing often. A good way of avoiding this issue is to create threads once, at app startup, by using pools and/or app-lifetime threads dedicated to operations. Inter-thread signaling is much quicker than continual thread creation/termination/destruction and also much safer/easier.
The number of posts concerning problems with thread stopping, terminating, destroying, thread count runaway, OOM failure etc. is legendary. If you can avoid doing it at all, great.
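A minimal sketch of that "create once, signal forever" idea with plain POSIX threads (the names worker_main, work_pending and so on are made up for illustration): one worker is started at app startup and woken with a condition variable instead of being created and destroyed per task.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int work_pending = 0;     /* crude "queue": just a counter */
static int shutting_down = 0;

static void *worker_main(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    for (;;) {
        while (!work_pending && !shutting_down)
            pthread_cond_wait(&cond, &lock);     /* sleep until signalled */
        if (shutting_down && !work_pending)
            break;
        work_pending--;
        pthread_mutex_unlock(&lock);
        puts("handling one unit of work");       /* real work would go here */
        pthread_mutex_lock(&lock);
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t worker;
    pthread_create(&worker, NULL, worker_main, NULL);  /* created exactly once */

    for (int i = 0; i < 3; i++) {                      /* submit three tasks */
        pthread_mutex_lock(&lock);
        work_pending++;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

    pthread_mutex_lock(&lock);                         /* ask the worker to exit */
    shutting_down = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);

    pthread_join(worker, NULL);
    return 0;
}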

Selecting number of threads in a multiprocess multiprocessor environment

I went through a few questions such as POSIX Threads on a Multiprocessor System and Concurrency of posix threads in multiprocessor machine and Threads & Processes Vs MultiThreading & Multi-Core/MultiProcessor : How they are mapped?
Based on these and a few other Wiki articles, I believe that for a system doing three basic kinds of work, viz. input, processing and output:
For CPU-bound processing, the number of CPU-intensive threads (number of applications * threads per application) should be approximately 1 to 1.5 times the number of processor cores.
Input and output threads must be sufficiently numerous to remove any bottlenecks. For example, for a communication system based on a query/query-ack and response/response-ack model, time must not be wasted in I/O wait states.
If there is a large requirement for dynamic memory, it's better to go with a greater number of processes than threads (to avoid memory synchronisation overhead).
Are these arguments fairly consistent when determining the number of threads to have in our application? Do we need to look into any other parameters?
'1 to 1.5 times the number of cores' - this appears to be OS/language dependent. On Windows/C++, for example, with large numbers of CPU-intensive tasks, the optimum seems to be much more than twice the number of cores, with a very small spread in performance. In such environments, it seems you may as well just allocate 64 threads in a pool and not bother with the number of cores.
'query/query-ack and response/response-ack model, time must not be wasted in I/O waiting states' - this is unavoidable with such protocols, given the high latency of most networks. The delay is enforced by the 'ping-pong' protocol, so there will inevitably be an I/O wait. Async I/O just moves this wait into the kernel - it's still there!
'large requirement for dynamic memory, it's better to go with a greater number of processes than threads' - not really. A large requirement for dynamic memory usually means that large data buffers are going to be moved about. Large buffers can only be moved around efficiently by reference. This is very easy and quick between threads because of the shared memory space. With processes, you are stuck with awkward and slow inter-process comms.
'Determining number of threads to have in our application' - well, this is difficult on several fronts. Given an unknown architecture, design, language and OS, the only advice I have is to try and make everything as flexible and configurable as you reasonably can. If you have a thread pool, make its size a run-time parameter you can tweak. If you have an object pool, try to design it so that you can change its depth. Have some default values that work on your test boxes and then, at installation or while running, you can make any specific changes and tweaks for a particular system.
The other thing about flexible/configurable designs is that, at test time, you can tweak away and fix many of the incorrect decisions, assumptions and guesstimates made by architects, designers, developers and, most of all, customers.
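As a tiny illustration of making the pool size a run-time tweakable (the environment variable name here is invented for the example):

#include <stdlib.h>

/* Pick a worker-thread count from the environment, falling back to a
   default when the variable is unset or nonsensical. */
static int worker_count(int fallback) {
    const char *s = getenv("APP_WORKER_THREADS");   /* made-up variable name */
    if (s) {
        int n = atoi(s);
        if (n > 0 && n < 10000)
            return n;
    }
    return fallback;
}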

How to find the max no of threads spawned by one system?

Is there a way / program to find out the maximum number of threads a system can spawn? I am creating an application and I am in a dilemma whether to go with an event-loop model or a multi-threaded model, so I wanted to test the system's capability for how many threads it can handle.
The "maximum number of threads" is not as useful a metric as you might think:
There is usually a system-wide maximum number of threads imposed by either the operating system or the available hardware resources.
The per-process maximum number of threads is often configurable and can even change on-the-fly.
In most cases the actual restriction comes from your hardware resources rather than any imposed limit. Much like any other resource (e.g. memory) you have to check whether you were successful, rather than rely on some kind of limit.
In general, multi-threading has only two advantages when compared to event loops:
It can utilise more than one processor. Depending on the operating system you can also use multiple processes (rather than the more lightweight threads) to do that.
Depending on the operating system, it may offer some degree of privilege separation.
Other than that, multi-threading is usually more expensive in both memory and processing resources. A large number of threads can bring your system to a halt regardless of whether what they are doing is resource-intensive or not.
In most cases the best solution is a hybrid of both models i.e. a number of threads with an event loop in each one.
EDIT:
On modern Linux systems, the /proc/sys/kernel/threads-max file provides a system-wide limit for the number of threads. The root user can change that value if they wish to:
echo 100000 > /proc/sys/kernel/threads-max
As far as I know, the kernel does not specifically impose a per-process limit on the number of threads.
sysconf() can be used to query system limits. There are some semi-documented thread-related query variables defined in /usr/include/bits/confname.h (the _SC_THREAD* variables).
getrlimit() can be used to query per-session limits - in this case the RLIMIT_NPROC resource is related to threads.
The threading implementation in glibc may also impose its own limits on a per-process basis.
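A small sketch of querying those limits programmatically (note that _SC_THREAD_THREADS_MAX frequently returns -1, meaning there is no determinate limit, and the RLIMIT_NPROC values may be RLIM_INFINITY):

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
    /* Per-process thread limit as reported by sysconf(); -1 means
       "no determinate limit". */
    long max_threads = sysconf(_SC_THREAD_THREADS_MAX);
    printf("_SC_THREAD_THREADS_MAX: %ld\n", max_threads);

    /* RLIMIT_NPROC caps processes/threads per user; values may be infinite. */
    struct rlimit rl;
    if (getrlimit(RLIMIT_NPROC, &rl) == 0)
        printf("RLIMIT_NPROC: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    return 0;
}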
Keep in mind that, depending on your hardware and software configuration, none of these limits may be of use. On Linux a main limiting factor on the number of threads comes from the fact that each thread requires memory in the stack - if you start launching threads you can easily come upon this limit before any others.
If you really want to find the actual limit, then the only way is to start launching threads until you cannot do so any more. Even that will only give you a rough limit that is only valid at the time you run the program. It can easily change if e.g. your threads start doing actual work and increase their resource usage.
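A rough sketch of that brute-force probe is below; the number it prints applies only to idle threads, at that moment, under that configuration:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *idle_thread(void *arg) {
    (void)arg;
    pause();                   /* block forever, holding the stack and TCB */
    return NULL;
}

int main(void) {
    unsigned long count = 0;
    pthread_t tid;
    /* Keep creating idle threads until pthread_create() refuses. */
    while (pthread_create(&tid, NULL, idle_thread, NULL) == 0)
        count++;
    printf("created %lu threads before failure\n", count);
    return 0;
}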
In my opinion if you are launching more than 3-4 threads per processor you should reconsider your design.
On Linux, you can find the total number of threads running by checking /proc/loadavg
# cat /proc/loadavg
0.02 0.03 0.04 1/389 7017
In the above, 389 is the total number of threads.
It can handle as many threads as the OS provides, and you never know the limit.
But as a general measure, a normal LOB application having more than 25 threads at a time can lead to problems and points to a serious design issue.
The current upper global limit is 4 million threads, because once reached, you have run out of PIDs (see futex.h).

How many threads can I spawn before efficiency drops?

Is there any formula, maybe involving RAM & number of CPUs, which can give me a rough idea of how many threads I can spawn before it starts to be inefficient and slows the PC?
I want to load test another machine, so I want to send requests as quickly as possible. But there's no point in spawning a million threads if they will just get in each other's way.
Edit: The threads are making Remote Procedure Calls (SOAP), so will be blocking waiting for the call to return.
It depends on what the threads are doing. If they're doing calculations, then the number will be lower. If they're waiting on I/O, then you can have more.
However, if they're waiting on I/O then you may achieve better results using async I/O APIs than using multiple threads.
If all threads are active and not blocking waiting for something then basically one thread per CPU (core really). Any more than that and you're relying on the operating system to context switch between the threads on a given CPU.
But it all depends on what the threads are doing. If they're sleeping most of the time or waiting on asynchronous IO operations, then you mostly just need to worry about the memory used for the stack which defaults to about 1MB per thread I believe.
The other answers are of course correct; "it depends". If the threads are busy doing CPU-intensive work, there's no point having more than the number of cores available. But assuming they are waiting on external results, it can vary widely.
I often find that this question is answered by the architecture and requirements of an application; you need as many threads as you need.
But if you potentially have an unlimited number of threads you might end up spawning, I think that probably sounds like a task for the ThreadPool myself; let it decide how many threads to actually have running.
First of all starting a thread may be quite a slow operation itself. When you start a thread stack space must be allocated, entry points in DLLs may be called etc. If you have a lot more threads than available cores then the majority of your threads will not be running at any given moment. I.e. they use resources and contribute nothing.
It is hard to give an exact number of threads for optimal performance, because it depends on what the threads are doing, but generally you shouldn't go way above the number of available cores. Keep in mind that you cannot have more threads executing at any instant than the number of available cores.

If 256 threads give better performance than 8 have I likely got the wrong approach?

I've just started programming with POSIX threads on dual-core x86_64 Linux system. It seems that 256 threads is about the optimum for performance with the way I've done it. I'm wondering how this could be? And if it could mean that my approach is wrong and a better approach would require far fewer threads and be just as fast or faster?
For further background (the program in question is a skeleton for a multi-threaded M-set image generator) see the following questions I've asked already:
Using threads, how should I deal with something which ideally should happen in sequential order?
How can my threaded image generating app get it’s data to the gui?
Perhaps I should mention that the skeleton (in which I've reproduced minimal functionality for testing and comparison) is now displaying the image, and the actual calculations are done almost twice as fast as the non-threaded program.
So if 256 threads running faster than 8 threads is not indicative of a poor approach to threading, how come 256 threads do outperform 8 threads?
The speed test case is a portion of the Mandelbrot Set located at:
xmin -0.76243636067708333333333328
xmax -0.7624335575810185185185186
ymax 0.077996663411458333333333929
calculated to a maximum of 30000 iterations.
On the non-threaded version, rendering time on my system is around 15 seconds. On the threaded version, the average time with 8 threads is 7.8 seconds, while with 256 threads it is 7.6 seconds.
Well, probably yes, you're doing something wrong.
However, there are circumstances where 256 threads would run better than 8 without you necessarily having a bad threading model. One must remember that having 8 threads does not mean all 8 threads are actually running all the time. Anytime one thread makes a blocking syscall to the operating system, the thread will stop running and wait for the result. In the meantime, another thread can often do work.
There's this myth that one can't usefully use more threads than contexts on the CPU, but that's just not true. If your threads block on a syscall, it can be critical to have another thread available to do more work. (In practice when threads block there tends to be less work to do, but this is not always the case.)
It's all very dependent on work-load and there's no one right number of threads for any particular application. Generally you never want fewer threads available than the OS will run, and that's the only true rule. (Unfortunately this can be very hard to find out and so people tend to just fire up as many threads as contexts and then use non-blocking syscalls where possible.)
Could it be your app is I/O bound? How is the image data generated?
A performance improvement gained by allocating more threads than cores suggests that the CPU is not the bottleneck. If I/O access such as disk, memory or even network access are involved your results make perfect sense.
You are probably benefitting from Simultaneous Multithreading (SMT). Your operating system schedules more threads than cores available, and will swap in and out the threads that are not stalled waiting for resources (such as a memory load). This can very effectively hide the latencies of your memory system from your program and is the technique used to great effect for massive parallelization in CUDA for general purpose GPU programming.
If you are seeing a performance increase with the jump to 256 threads, then what you are probably dealing with is a resource bottleneck. At some point, your code is waiting for some slow device (a hard disk or a network connection, for example) in order to continue. With multiple threads, waiting on this slow device isn't a problem because instead of sitting idle and twiddling its electronic thumbs, the CPU can process another thread while the first thread is waiting on the slow device. The more parallel threads that are running, the more work the CPU can do while it is waiting on something else.
If you are seeing performance improve all the way up to 256 threads, I am tempted to say that you have a major performance bottleneck somewhere and it's not the CPU. To test this, try to see if you can measure the idle time of individual threads. I suspect that you will see your threads are stuck in a "blocked" or "waiting" state for a longer portion of their lifetime than they spend in the "running" or "active" state. Some debuggers or function profiling tools will let you do this, and I think there are also Linux tools to do this on the command line.
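One low-tech way to take such a snapshot on Linux (an assumption: this relies on the /proc layout rather than any portable API) is to read each thread's state letter from /proc/self/task/<tid>/stat - R means running, S sleeping, D uninterruptible wait:

#include <dirent.h>
#include <stdio.h>

/* Sketch: walk /proc/self/task and print each thread's one-letter state. */
void dump_thread_states(void) {
    DIR *d = opendir("/proc/self/task");
    if (!d)
        return;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;
        char path[64], comm[64], state;
        int tid;
        snprintf(path, sizeof path, "/proc/self/task/%s/stat", e->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        /* The stat line starts with: tid (comm) state ... */
        if (fscanf(f, "%d (%63[^)]) %c", &tid, comm, &state) == 3)
            printf("tid %d (%s): %c\n", tid, comm, state);
        fclose(f);
    }
    closedir(d);
}

Calling this periodically from a monitoring thread gives a crude picture of how many threads are actually runnable versus blocked.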
