Limiting memory usage per thread in GHC - multithreading

I'm wondering, is it possible to limit the amount of memory a thread uses? I'm looking at running a server where untrusted user code is submitted and run. I can use SafeHaskell to ensure that it doesn't perform any unauthorized IO, but I need to make sure that a user's code doesn't crash the entire server, e.g. by causing a stack overflow or exhausting the heap.
Is there a way to limit the amount of memory each individual thread can use, or some way to ensure that if one thread consumes a massive amount of memory, only that thread is terminated?
Alternatively, when any thread encounters an out-of-memory error, is there a way to catch the exception and choose which thread dies?
I'm talking about concurrency in the sense of forkIO and STM threads, rather than parallelism with par and seq.
Note: this is very similar to this question, but it never received an answer to the general problem; the answers dealt only with the question's specific scenario. Additionally, it's possible that something has changed since 2011, perhaps in GHC 7.8 with the new IO manager?

I don't know about Haskell, but in general, the answer to your question is no. In all programming languages/runtimes/operating systems/etc. that I know of, threads are nothing more than different paths of execution through the same code. The important thing in this case is that threads always share the same virtual address space.
That being said, there is no technical reason why a memory allocator in your particular language and runtime system could not use a thread-specific variable to track how much has been allocated by any given thread, and impose an arbitrary limit.
No technical reason why it couldn't do that, but if thread A allocates an object which is subsequently accessed by threads B, C, D, and so on, then what sense does it make to penalize thread A for having allocated it? In the general case there is no practical way to track the "ownership" of an object that is accessed by many threads, which is why none of the languages/runtimes/OSes that I know of attempts to do it.
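A minimal sketch of what such per-thread accounting could look like in C (GHC's real allocator is far more involved); xmalloc, the thread-local counter, and the limit are all hypothetical, not an existing API:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical per-thread allocation budget; not a real runtime API. */
    static _Thread_local size_t allocated_bytes = 0;
    static _Thread_local size_t allocation_limit = 64 * 1024 * 1024;  /* 64 MiB */

    /* Every allocation made by this thread is charged against its budget. */
    void *xmalloc(size_t n)
    {
        if (allocated_bytes + n > allocation_limit) {
            /* A managed runtime would raise an exception in the offending
               thread here instead of aborting the whole process. */
            fprintf(stderr, "thread exceeded its allocation limit\n");
            abort();
        }
        allocated_bytes += n;
        return malloc(n);
    }

Note that, per the caveat above, this charges the allocating thread even when the object is later used entirely by other threads.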

Related

Why is the thread-per-multiple-connections model considered better than the thread-per-connection model?

Most of the time, you will hear that the thread-per-multiple-connections model (non-blocking IO) is much better than the thread-per-connection model (blocking IO). The reasoning sounds like: "the thread-per-connection approach creates too many threads, and a lot of overhead is associated with maintaining so many threads". But this overhead is never explained.
A common misconception is that scheduling overhead is proportional to the total number of threads. It is not: scheduling overhead is proportional to the number of runnable threads. So in a typical IO-bound application, most of the threads will actually be blocked on IO and only a few of them will be runnable - which is no different from the thread-per-multiple-connections model.
As for context-switching overhead, I expect no difference, because when data arrives the kernel has to wake up a thread either way - the selector thread or the connection thread.
The problem may lie in the IO system calls - the kernel might handle kqueue/epoll calls better than blocking IO calls. However, this does not sound plausible, because it should not be hard to implement an O(1) algorithm for selecting which blocked thread to wake when data arrives.
If you have many short-lived connections, you will have many short-lived threads, and spawning a new thread is an expensive operation (is it?). To solve this problem, you can create a thread pool and still use blocking IO.
There may be OS limits on the number of threads that can be spawned; however, those can usually be raised with configuration parameters.
In a multicore system, suppose different sessions access the same shared data. With the connection-per-thread model, this might cause a lot of cache-coherency traffic and slow the system down. However, why not schedule all of these threads on a single core if only one of them is runnable at any given point in time? If more than one is runnable, they should be scheduled on different cores anyway. And to achieve the same performance in the thread-per-multiple-connections model, we would need several selectors, which would be scheduled on different cores and access the same shared data. So I don't see a difference from the cache perspective.
In a GC environment (take Java, for example), the garbage collector has to determine which objects are reachable by traversing the object graph starting from the GC roots, and the GC roots include thread stacks. So there is more work for the GC to do on the first level of this graph. However, the total number of live nodes in the graph should be the same for both approaches, so there is no overhead from the GC's point of view.
The only argument I agree with is that each thread consumes memory for its stack. But even here, we can limit the stack size of these threads if they don't use recursive calls.
What are your thoughts?
There are two overheads:
(1) Stack memory. Non-blocking IO (in whatever form you are using it) saves stack memory: an IO becomes just a small data structure.
(2) Reduction in context switching and kernel transitions when load is high, because a single switch can then process multiple completed IOs.
Most servers are not under high load because that would leave little safety margin against load spikes. So point (2) is relevant mostly for artificial loads such as benchmarks (meant to prove a point...).
The stack savings are 99% of the reason this is done.
Whether you want to trade off dev time and code complexity for memory savings depends on how many connections you have. At 10 connections this is not a concern. At 10000 connections a thread-based model becomes infeasible.
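To make the stack cost concrete, POSIX lets you shrink per-thread stacks, which is the mitigation mentioned in the question; a sketch, where the 64 KiB figure is illustrative and would be too small for deeply recursive handlers:

    #include <pthread.h>
    #include <stddef.h>

    void *handle_connection(void *arg)
    {
        /* ... serve one connection ... */
        return NULL;
    }

    int spawn_worker(pthread_t *t, void *conn)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        /* Default stacks are often 1-8 MiB; 64 KiB is an illustrative
           lower figure for non-recursive handlers. */
        pthread_attr_setstacksize(&attr, 64 * 1024);
        int rc = pthread_create(t, &attr, handle_connection, conn);
        pthread_attr_destroy(&attr);
        return rc;
    }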
The points that you state in the question are correct.
Maybe you are confused by the fact that the "common wisdom" is to always use non-blocking socket IO? Indeed, this (false) claim is repeated everywhere on the web, and repeating the same simple statement over and over is exactly how propaganda works.

Why are thread locks resources?

I recently read that thread locks are system resources, therefore they have to be properly released "just like memory". I realised I wasn't really aware of this.
Can someone offer some further elaboration on this fact or point to a good reference? More specifically: how can I think about the implementation of locks at a deeper system level? What are the possible consequences of leaking locks? Is there a maximum number of locks available in the system?
All that means is that you have to be careful that anything you lock gets released, similar to how you would be careful to close network connections, files, graphics device contexts, and so on. If you write code that is not careful about that, you risk having the program deadlock, or be unable to make progress when it can't get access to something that's locked. (The point of locking is to make sure multiple threads can access something safely, so if one thread leaves something locked, other threads that need it are shut out.)
The program will have severe performance issues a long time before it runs out of physical locks, so typically you shouldn't have to worry about the number of available locks.
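A sketch of the failure mode in C with pthreads: if the early return is ever taken, the mutex is never released and every later caller blocks forever.

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_counter;

    int update(int delta)
    {
        pthread_mutex_lock(&lock);
        if (delta == 0)
            return -1;  /* BUG: leaks the lock; every other thread that
                           calls update() from now on blocks forever */
        shared_counter += delta;
        pthread_mutex_unlock(&lock);
        return 0;
    }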

Can accessing shared memory without synchronization halt a process or thread?

Summary: there is a reader thread and a writer thread accessing the same memory without synchronization. Is there any run-time error (NOT a logical error) or a risk of breaking the process or the threads?
I'm trying to make a simple task scheduler.
There are some worker threads and they have their own task queue.
The task scheduler push tasks to the workers' queue.
I want the scheduler to know the least busy thread - the one whose queue is shortest.
So I need some shared integer variables to store each queue's length.
Each worker thread writes the length of its own queue to a specific variable.
And the scheduler reads those variables to find the shortest queue.
So each variable has one reader and one writer.
This is a reader-writer problem, and I would need a mutex.
But I don't want any locking overhead, and I don't need to know the exact length of the queues.
So I want to let the threads access the shared values without synchronization.
Is there any problem, other than the values being inaccurate?
There's no law that says, for all platforms, all languages, and all standards what must happen when two threads access the same memory space without synchronization. Some platforms may allow it and add the synchronization themselves. That's not impossible. Some may test for it in all cases and guarantee that the process will crash. That's not impossible either.
As generic as this question is, the answer could be anything. You might as well ask "What happens if I do something I'm not supposed to do?"
POSIX, for example, allows the process to crash. Win32, for aligned 32-bit accesses, requires it to, at worst, give stale values. There's no universal law.
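That said, if all the asker needs is a cheap, possibly stale length hint, C11 atomics make the accesses well-defined without a mutex; a sketch, where NUM_WORKERS and the one-slot-per-worker layout are illustrative:

    #include <stdatomic.h>

    enum { NUM_WORKERS = 4 };

    /* One slot per worker; each worker writes only its own slot. */
    static atomic_int queue_len[NUM_WORKERS];

    /* Called by worker i after it pushes or pops a task. */
    void publish_len(int i, int len)
    {
        atomic_store_explicit(&queue_len[i], len, memory_order_relaxed);
    }

    /* Called by the scheduler. The values may be slightly stale, but the
       reads themselves are data-race-free, unlike plain unsynchronized
       accesses, which are undefined behavior in C11. */
    int least_busy(void)
    {
        int best = 0;
        int best_len = atomic_load_explicit(&queue_len[0], memory_order_relaxed);
        for (int i = 1; i < NUM_WORKERS; i++) {
            int len = atomic_load_explicit(&queue_len[i], memory_order_relaxed);
            if (len < best_len) { best = i; best_len = len; }
        }
        return best;
    }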

sched_yield slows down other threads

We have code that calls sched_yield inside a loop. When we do this, we seem to get slower performance in other threads, in particular those making kernel calls (like IO and mutex/event handling). I'm trying to determine the exact cause of this behaviour.
Can excessive calls to sched_yield lead to a bottleneck in the kernel?
My suspicion is that if we keep asking the kernel to check its process list, other threads will suffer, since key kernel data structures may be continually locked -- whereas if we didn't call sched_yield, those kernel locks would tend to be uncontested. Does this make sense, or should it be totally okay to call sched_yield repeatedly?
Have a look at the sched_yield man page for Linux:
    Avoid calling sched_yield() unnecessarily or inappropriately (e.g., when resources needed by other schedulable threads are still held by the caller), since doing so will result in unnecessary context switches, which will degrade system performance.
Calling it in a tight loop will cause problems. Reduce the rate at which you're calling it.
(And check that you need to call it in the first place. The scheduler often does the Right Thing all by itself.)
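If the loop is really waiting for another thread to produce something, a condition variable avoids the need for sched_yield entirely: the waiter sleeps until it is woken rather than hammering the scheduler. A sketch:

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
    static bool ready = false;

    /* Instead of:  while (!ready) sched_yield();  */
    void wait_for_work(void)
    {
        pthread_mutex_lock(&m);
        while (!ready)               /* loop guards against spurious wakeups */
            pthread_cond_wait(&cv, &m);
        ready = false;
        pthread_mutex_unlock(&m);
    }

    void signal_work(void)
    {
        pthread_mutex_lock(&m);
        ready = true;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
    }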
Other options worth investigating if you have a low-priority thread:
sched_setscheduler - with SCHED_IDLE or SCHED_BATCH maybe (affects the whole process)
pthread_setschedparam - per thread, but there may be restrictions on which policies you can use (I can't find the details right now).
Or the good old nice command of course.

fork in multi-threaded program

I've heard that mixing forking and threading in a program can be very problematic, often resulting in mysterious behavior, especially when dealing with shared resources such as locks, pipes, and file descriptors. But I never fully understood what exactly the dangers are and when they can happen. It would be great if someone with expertise in this area could explain in a bit more detail what the pitfalls are and what needs care when programming in such an environment.
For example, if I want to write a server that collects data from various different resources, one solution I've thought of is to have the server spawn a set of threads, each of which uses popen to call out to another program to do the actual work, reading the data back from the child over a pipe. Each of these threads is responsible for its own work, with no data interchange between them, and when the data is collected, the main thread has a queue and these worker threads just put their results in it. What could go wrong with this solution?
Please do not narrow your answer by just "answering" my example scenario. Any suggestions, alternative solutions, or experiences that are not related to the example but helpful to provide a clean design would be great! Thanks!
The problem with forking when you do have some threads running is that the fork only copies the CPU state of the one thread that called it. It's as if all of the other threads just died, instantly, wherever they may be.
The result is that locks aren't released, and shared data (such as the malloc heap) may be left corrupted.
pthread does offer a pthread_atfork function - in theory, you could take every lock in the program before forking, release them after, and maybe make it out alive - but it's risky, because you could always miss one. And, of course, the stacks of the other threads won't be freed.
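A sketch of that pthread_atfork pattern for a single lock; a real program would have to register handlers for every lock it owns, which is exactly why this is so easy to get wrong:

    #include <pthread.h>

    static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

    static void prepare(void) { pthread_mutex_lock(&heap_lock); }    /* before fork */
    static void parent(void)  { pthread_mutex_unlock(&heap_lock); }  /* after, in parent */
    static void child(void)   { pthread_mutex_unlock(&heap_lock); }  /* after, in child */

    void install_fork_handlers(void)
    {
        /* Taking the lock in prepare() guarantees it is in a known,
           consistent state in both processes when fork() returns. */
        pthread_atfork(prepare, parent, child);
    }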
It is really quite simple: the problems with multiple threads and processes always arise from shared data. If there is no shared data, then no issues can arise.
In your example the shared data is the queue owned by the main thread - any potential contention or race conditions will arise here. Typical methods for "solving" these issues involve locking schemes - a worker thread will lock the queue before inserting any data, and the main thread will lock the queue before removing it.
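A minimal sketch of that locking scheme in C, using a fixed-size ring buffer for the result queue (the structure itself is illustrative):

    #include <pthread.h>

    enum { QCAP = 256 };

    struct queue {
        void *items[QCAP];
        int head, tail, count;
        pthread_mutex_t lock;      /* protects all fields above */
        pthread_cond_t nonempty;   /* signaled when an item is added */
    };

    /* Worker threads: lock, insert, signal the main thread. */
    int queue_push(struct queue *q, void *item)
    {
        pthread_mutex_lock(&q->lock);
        if (q->count == QCAP) {
            pthread_mutex_unlock(&q->lock);
            return -1;  /* full */
        }
        q->items[q->tail] = item;
        q->tail = (q->tail + 1) % QCAP;
        q->count++;
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->lock);
        return 0;
    }

    /* Main thread: lock, wait until a result is available, remove it. */
    void *queue_pop(struct queue *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->nonempty, &q->lock);
        void *item = q->items[q->head];
        q->head = (q->head + 1) % QCAP;
        q->count--;
        pthread_mutex_unlock(&q->lock);
        return item;
    }

The mutex and condition variable would need to be initialized (pthread_mutex_init/pthread_cond_init, or the static initializers) before the queue is used.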
