Haskell: large number of long-running threads taking up big STACK space

A server program keeps long-running TCP connections with many clients. Each client connection is served by a thread created with forkIO. The server takes up a lot of memory when running, so naturally I did some profiling to hunt down possible space leaks. However, with around 10k clients (hence 10k threads), the result shows that a major portion of the heap is actually STACK allocated by threads. If I understand correctly, this is not surprising, since a thread's stack starts at 1k by default and grows in 32k chunks. As these are long-running threads, this memory won't be GCed.
My question: STACK is taking up too much space. Is there a way to reduce it?
I had some thoughts on this: previously I could have used the event notification APIs from GHC to write the program without using threads, but that option no longer seems possible, as GHC has stopped exporting some of the event-handling functions such as loop. On the other hand, such a change would mean a major shift in concurrency model (threads vs. events), which is very undesirable since Haskell threads are simply so enjoyable to work with. Another idea that came to mind is to split/rewrite the threads so that one thread does all the handshaking and authentication, creates a new thread, and then exits. The new thread, which will keep looping, hopefully doesn't require more STACK space (sketched below). However, I'm not sure whether this idea is correct or doable.
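For reference, two concrete things that might help: if I recall the RTS flags correctly, -ki and -kc control the initial thread stack size and the stack chunk size (so something like +RTS -kc8k -RTS shrinks the 32k chunks), and the split-thread idea can be sketched roughly as below. Here handshake and serveLoop are hypothetical stand-ins for the real per-connection logic, so treat this as an illustration of the idea rather than a drop-in solution:

    import Control.Concurrent (forkIO)
    import Control.Monad (forever, void)

    -- Hypothetical stand-ins for the real per-connection logic.
    handshake :: conn -> IO ()
    handshake _ = return ()

    serveLoop :: conn -> IO ()
    serveLoop _ = return ()

    -- The accepting thread does the (potentially stack-hungry) handshake and
    -- authentication, then hands the connection to a fresh thread that starts
    -- again from the small initial stack, and exits.
    accepted :: conn -> IO ()
    accepted conn = void . forkIO $ do
      handshake conn
      void . forkIO $ forever (serveLoop conn)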

Related

Why is the thread-per-multiple-connections model considered better than the thread-per-connection model?

Most of the time, you will hear that the thread-per-multiple-connections model (non-blocking IO) is much better than the thread-per-connection model (blocking IO). And the reasoning sounds like "The thread-per-connection approach creates too many threads, and a lot of overhead is associated with maintaining so many threads." But this overhead is not explained.
A common misconception is that scheduling overhead is proportional to the number of all threads. But that is not true: scheduling overhead is proportional to the number of runnable threads. So in a typical IO-bound application, most of the threads will actually be blocked on IO and only a few of them will be runnable, which is no different from the thread-per-multiple-connections model.
As for context-switching overhead, I expect there should be no difference, because when data arrives the kernel has to wake up a thread either way: the selector thread or the connection thread.
The problem may lie in the IO system calls: the kernel might handle kqueue/epoll calls better than blocking IO calls. However, this does not sound plausible, because it should not be a problem to implement an O(1) algorithm for selecting the blocked thread when data arrives.
If you have many short-lived connections, you will have many short-lived threads, and spawning a new thread is an expensive operation (is it?). To solve this problem, you may create a thread pool and still use blocking I/O.
There might be OS limits on the number of threads that can be spawned; however, these can usually be changed with configuration parameters.
In a multicore system, suppose different sessions access the same shared data. In the connection-per-thread model, this might cause a lot of cache-coherency traffic and slow down the system. However, why not schedule all these threads on a single core if only one of them is runnable at any given point in time? If more than one of them is runnable, they should be scheduled on different cores anyway. To achieve the same performance in the thread-per-multiple-connections model, we would need several selectors, and they would be scheduled on different cores and would access the same shared data. So I don't see a difference from the cache perspective.
In a GC environment (take Java for example), the garbage collector has to determine which objects are reachable by traversing the object graph starting from the GC roots. GC roots include thread stacks, so there is more work for the GC to do at the first level of this graph. However, the total number of live nodes in this graph should be the same for both approaches, so there is no overhead from the GC's point of view.
The only argument I agree with is that each thread consumes memory for its stack. But even in this case, we can limit the size of the stacks for these threads if they don't use recursive calls.
What are your thoughts?
There are two overheads:
Stack memory. Non-blocking IO (in whatever form you are using it) saves the stack memory. An IO is just a small data structure now.
Reduction in context switching and kernel transitions when load is high. Then, a single switch can be used to process multiple completed IOs.
Most servers are not under high load because that would leave little safety margin against load spikes. So point (2) is relevant mostly for artificial loads such as benchmarks (meant to prove a point...).
The stack savings are the 99% reason this is being done.
Whether you want to trade off dev time and code complexity for memory savings depends on how many connections you have. At 10 connections this is not a concern. At 10000 connections a thread-based model becomes infeasible.
The points that you state in the question are correct.
Maybe you are confused by the fact that the "common wisdom" is to always use non-blocking socket IO? Indeed, this (false) propaganda is being communicated everywhere on the web. Propaganda works by repeating the same simple statement over and over, and it works.
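To make point (1) above a little more concrete in Haskell-ish terms (the type below is invented purely for illustration, not a real API): when IO is non-blocking, a connection that is waiting for data does not need a whole thread stack, only a small record that the event loop keeps around.

    -- Toy illustration only: a connection blocked on a read costs roughly
    -- this much state in an event-driven server, instead of a thread stack.
    data PendingRead = PendingRead
      { prSocketFd :: Int             -- the socket we are waiting on
      , prContinue :: String -> IO () -- what to run when data arrives
      }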

Are thread pools needed for pure Haskell code?

In Real World Haskell, Chapter 28, Software transactional memory, a concurrent web link checker is developed. It fetches all the links in a webpage and hits every one of them with a HEAD request to figure out whether the link is active. A concurrent approach is taken to build this program, and the following statement is made:
We can't simply create one thread per URL, because that may overburden either our CPU or our network connection if (as we expect) most of the links are live and responsive. Instead, we use a fixed number of worker threads, which fetch URLs to download from a queue.
I do not fully understand why this pool of threads is needed instead of using forkIO for each link. AFAIK, the Haskell runtime maintains a pool of threads and schedules them appropriately so I do not see the CPU being overloaded. Furthermore, in a discussion about concurrency on the Haskell mailing list, I found the following statement going in the same direction:
The one paradigm that makes no sense in Haskell is worker threads (since the RTS does that
for us); instead of fetching a worker, just forkIO instead.
Is the pool of threads only required for the network part or there is a CPU reason for it too?
The core issue, I imagine, is the network side. If you have 10,000 links and forkIO for each link, then you potentially have 10,000 sockets you're attempting to open at once, which, depending on how your OS is configured, probably won't even be possible, much less efficient.
However, the fact that we have green threads that get "virtually" scheduled across multiple OS threads (which ideally are stuck to individual cores) doesn't mean that we can just distribute work randomly without regard to CPU usage either. The issue here isn't so much that the scheduling of the CPU itself won't be handled for us, but rather that context switches (even green ones) cost cycles. Each thread, if it's working on different data, will need to pull that data into the CPU. If there's enough data, that means pulling things in and out of the CPU cache. Even absent that, it means pulling things from the cache to registers, etc.
Even if a problem is trivially parallel, it is virtually never the right idea to just break it up as small as possible and attempt to do it "all at once".
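To make the socket-limiting point concrete in Haskell terms: instead of a fixed pool of worker threads, you can keep one forkIO'd thread per URL and use a semaphore to bound how many of them are inside the network call at once. This is only a sketch, and checkLink is a hypothetical stand-in for the book's HEAD-request code:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.QSem (newQSem, waitQSem, signalQSem)
    import Control.Exception (bracket_)
    import Control.Monad (forM_, void)

    -- Hypothetical stand-in for the book's HEAD-request check.
    checkLink :: String -> IO ()
    checkLink url = putStrLn ("HEAD " ++ url)

    -- One forkIO'd thread per URL, but at most maxInFlight of them may be
    -- inside checkLink (i.e. holding a socket) at any one time. A real
    -- program would also wait for the children to finish.
    checkAll :: Int -> [String] -> IO ()
    checkAll maxInFlight urls = do
      sem <- newQSem maxInFlight
      forM_ urls $ \url ->
        void . forkIO $ bracket_ (waitQSem sem) (signalQSem sem) (checkLink url)

The semaphore here plays the role the bounded work queue plays in the book's version: it limits open sockets, while still creating one (cheap) green thread per URL.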

Cost of a thread

I understand how to create a thread in my chosen language, and I understand about mutexes and the dangers of shared data etc., but I'm not sure about how the OS manages threads and what the cost of each thread is. I have a series of questions that all relate to this, and the clearest way to show the limits of my understanding is probably via these questions.
What is the cost of spawning a thread? Is it even worth worrying about when designing software? One of the costs of creating a thread must be its own stack pointer and program counter, plus space to copy all of the working registers to as it is moved on and off a core by the scheduler, but what else?
Is the amount of stack available for one program split equally between threads of a process or on a first-come-first-served basis?
Can I somehow check the hardware on start-up (of the program) for the number of cores? If I am running on a machine with N cores, should I keep the number of threads to N-1?
then space to copy all of the working registers to as it is moved on
and off of a core by the scheduler, but what else?
One less evident cost is the strain imposed on the scheduler, which may start to choke if it needs to juggle thousands of threads. The memory isn't really the issue: with the right tweaking you can get a "thread" to occupy very little memory, little more than its stack. The tweaking can be difficult (e.g. using clone(2) directly under Linux), but it can be done.
Is the amount of stack available for one program split equally between
threads of a process or on a first-come-first-served basis?
Each thread gets its own stack, and typically you can control its size.
If I am running on a machine with N cores, should I keep the number of
threads to N-1?
Checking the number of cores is easy, but environment-specific. However, limiting the number of threads to the number of cores only makes sense if your workload consists of CPU-intensive operations, with little I/O. If I/O is involved you may want to have many more threads than cores.
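In GHC Haskell, for instance, the core count can be read at start-up with standard base/GHC functions; how you size your thread count from it is up to you. A minimal sketch:

    import Control.Concurrent (setNumCapabilities)
    import GHC.Conc (getNumProcessors)

    -- Read the machine's core count at program start-up and tell the RTS to
    -- use that many capabilities; a CPU-bound worker pool could be sized the
    -- same way (e.g. cores, or cores - 1 to leave one core free).
    main :: IO ()
    main = do
      cores <- getNumProcessors
      setNumCapabilities cores
      putStrLn ("Using " ++ show cores ++ " cores")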
You should be as thoughtful as possible in everything you design and implement.
I know that a Java thread stack takes up about 1 MB each time you create a thread, so they add up.
Threads make sense for asynchronous tasks that allow long-running activities to happen without preventing all other users/processes from making progress.
Threads are managed by the operating system. There are lots of schemes, all under the control of the operating system (e.g. round robin, first come first served, etc.)
It makes perfect sense to me to assign one thread per core for some activities (e.g. computationally intensive calculations, graphics, math, etc.), but that need not be the deciding factor. One app I develop uses roughly 100 active threads in production; it's not a 100 core machine.
To add to the other excellent posts:
'What is the cost of spawning a thread? Is it worth even worrying about when designing software?'
It is if one of your design choices is doing such a thing often. A good way of avoiding this issue is to create threads once, at app startup, by using pools and/or app-lifetime threads dedicated to operations. Inter-thread signaling is much quicker than continual thread creation/termination/destruction and also much safer/easier.
The number of posts concerning problems with thread stopping, terminating, destroying, thread-count runaway, OOM failure, etc. is legendary. If you can avoid doing it at all, great.
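As a Haskell-flavoured sketch of the create-once advice (the names are made up for illustration): workers are forked once at startup, and work is handed over through a channel, so submitting a job later is just a channel write rather than a thread create/terminate/destroy cycle.

    import Control.Concurrent (forkIO)
    import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
    import Control.Monad (forever, replicateM_)

    -- Fork n app-lifetime worker threads once; each one loops pulling jobs
    -- off a shared channel for the lifetime of the program.
    startWorkers :: Int -> (job -> IO ()) -> IO (Chan job)
    startWorkers n handle = do
      jobs <- newChan
      replicateM_ n (forkIO (forever (readChan jobs >>= handle)))
      return jobs

    -- Usage sketch: queue <- startWorkers 4 processJob; writeChan queue someJob

Here writeChan is the inter-thread signaling mentioned above, and the workers never terminate, which sidesteps the stopping/destroying problems just described.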

Efficient multi-threaded server implementation in Qt

I'm planning a multithreaded server written in Qt. Each connection would be handled in a separate thread. Each of those threads would run its own event loop and use asynchronous sockets. I would like to dispatch a const value (for instance, a QString containing an event string) from the main thread to all the client threads in the most efficient way possible. The value should obviously be deleted once all the client threads have read it.
If I simply pass the data in a queued signal/slot connection, would this introduce a considerable overhead? Would it be more efficient to pass a QSharedPointer<QString>? What about passing a const QString* together with a QAtomicInt* for the reference counting and letting the thread decrease it and delete it when the reference counter reaches 0?
Somewhat off-topic, but please be aware that the one-thread-per-connection model could enable anyone able to connect to mount a highly effective denial-of-service attack against the system running the server, since the maximum number of threads that can be created on any system is limited. Also, on a 32-bit system you can starve the address space, since each thread gets its own stack. The default stack size varies across systems. On Win32 it's 1 MB, IIRC, so 2048 connections kept open and alive will eat 2 GB, i.e. the entire address space reserved for userspace (you can bump it up to 3 GB, but that doesn't help much).
For more details, check The C10K Problem, specifically the I/O Strategies -> Serve one client with each server thread chapter.
According to the documentation:
Behind the scenes, QString uses implicit sharing (copy-on-write) to reduce memory usage and to avoid the needless copying of data.
Based on this, you shouldn't have any more overhead sending copies of strings through the queued signal/slot connections than you would with your other proposed solutions. So I wouldn't worry about it until and unless it is a demonstrable performance problem.

Delphi 2010: Advantage of running multiple threads if I cannot allocate memory to create objects for calculation in each thread

My Previous Question
From the above answer, it seems that if my threads create objects, I will face a memory allocation/deallocation bottleneck, and as a result running multiple threads may be slower than, or show no obvious time difference from, running without threads. What are the advantages of running multiple threads in the application if I cannot allocate memory to create the objects for calculations in my threads?
What are the advantages of running multiple threads in the application if I cannot allocate memory to create the objects for calculations in my threads?
It depends on where your bottlenecks are. If your bottleneck is the amount of memory available, then creating more threads won't help. Or, if I/O is the bottleneck, trying to parallelize will just slightly slow everything down because of context switching. It's like trying to make an underpowered car faster by putting wider tyres on it: fixing the wrong thing doesn't help.
Threads are useful when the bottleneck is the processor and there are several processors available.
Well, if you allocate chunks of memory in a loop, things will slow down.
If you can create your objects once at the beginning of TThread.execute, the overhead will be smaller.
Threads can also be beneficial if you have to wait for IO operations, or if you have expensive calculations to do on a machine with more than one physical core.
If you have memory-intensive threads (many memory allocations/deallocations), you are better off using TopMM instead of FastMM:
http://www.topsoftwaresite.nl/
FastMM uses a lock which blocks all other threads; TopMM does not, so it scales much better on multiple cores/CPUs!
When it comes to multithreading, shared-resource issues will always arise (with current technology). All resources that may need serialization (RAM, disk, etc.) are a possible bottleneck. Multithreading is not a magic solution that turns a slow app into a fast one, and it does not always result in better speed. Done the wrong way, it can actually result in worse speed. The application should be analyzed to find possible bottlenecks, and some parts may need to be rewritten to minimize them using different techniques (e.g. preallocating memory, using async I/O, etc.). Anyway, performance is only one of the reasons to use more than one thread. There are several other reasons, for example letting the user interact with the application while background threads perform operations (e.g. printing, checking data, etc.) without "locking" the user out. That way the application may seem "faster" (the user can keep on using it without waiting) even if it is actually slower (it takes more time to finish the operations than if it performed them serially).
