What are the thread limitations when working on Linux compared to processes for network/IO-bound apps?

What are the thread limitations when working on Linux compared to processes for network/IO-bound apps? - linux

I've heard that under linux on multicore server it would be impossible to reach top performance when you have just 1 process but multiple threads because Linux have some limitations on the IO, so that 1 process with 8 threads on 8-core server might be slower than 8 processes.
Any comments? Are there other limitation which might slow the applications?
The applications is a network C++ application, serving 100s of clients, with some disk IO.
Update: I am concerned that there are some more IO-related issues other than the locking I implement myself... Aren't there any issues doing simultanious network/disk IO in several threads?

Drawbacks of Threads
Threads:
Serialize on memory operations. That is the kernel, and in turn the MMU must service operations such as mmap() that perform page allocations.
Share the same file descriptor table. There is locking involved making changes and performing lookups in this table, which stores stuff like file offsets, and other flags. Every system call made that uses this table such as open(), accept(), fcntl() must lock it to translate fd to internal file handle, and when make changes.
Share some scheduling attributes. Processes are constantly evaluated to determine the load they're putting on the system, and scheduled accordingly. Lots of threads implies a higher CPU load, which the scheduler typically dislikes, and it will increase the response time on events for that process (such as reading incoming data on a socket).
May share some writable memory. Any memory being written to by multiple threads (especially slow if it requires fancy locking), will generate all kinds of cache contention and convoying issues. For example heap operations such as malloc() and free() operate on a global data structure (that can to some degree be worked around). There are other global structures also.
Share credentials, this might be an issue for service-type processes.
Share signal handling, these will interrupt the entire process while they're handled.
Processes or Threads?
If you want to make debugging easier, use threads.
If you are on Windows, use threads. (Processes are extremely heavyweight in Windows).
If stability is a huge concern, try to use processes. (One SIGSEGV/PIPE is all it takes...).
If threads aren't available, use processes. (Not so common now, but it did happen).
If your threads share resources that can't be use from multiple processes, use threads. (Or provide an IPC mechanism to allow communicating with the "owner" thread of the resource).
If you use resources that are only available on a one-per-process basis (and you one per context), obviously use processes.
If your processing contexts share absolutely nothing (such as a socket server that spawns and forgets connections as it accept()s them), and CPU is a bottleneck, use processes and single-threaded runtimes (which are devoid of all kinds of intense locking such as on the heap and other places).
One of the biggest differences between threads and processes is this: Threads use software constructs to protect data structures, processes use hardware (which is significantly faster).
Links
pthreads(7)
About Processes and Threads (MSDN)
Threads vs. Processes

it really should make no difference but is probably about design.
A multi process app may have to do less locking but may use more memory. Sharing data between processes may be harder.
On the other hand multi process can be more robust. You can call exit() and quit the child safely mostly without affecting others.
It depends how dependent the clients are. I usually recommend the simplest solution.

Related

Run threads in each core in Delphi

I'm working with a Delphi application and I have created two threads to sync with different databases, one to read and other to write. I would like to know if Delphi is actually using all potential of each core (running on an i5 with 4 cores for example) or if I need to write a specific code to distribute the threads to each core.
I have no idea how to find this.

There's nothing you need to do. The operating system schedules ready-to-run threads on available cores.

There is nothing to do. The OS will choose the best place to run each of your threads taking into account a large number of factors completely beyond your control. The OS manages your threads in conjunction with all other threads in all other processes on the system.
Don't forget that if your threads aren't particularly busy, there will be absolutely no need to run them on different cores.
Sometimes moving code to a separate core can introduce unexpected inefficiencies. Remember CPU's have high speed memory caches; and if certain data is not available in the cache of one core, moving to it could incur relatively slower RAM access.
The point I'm trying to make here, is that you trying to second-guess all these scenarios and permutations is premature optimisation. Rather let the OS do the work for you. You have other things you should rather focus on as indicated below.
However, that said any interaction between your threads can significantly affect the OS's ability to run them on separate cores. E.g.
At one extreme: if each of your threads do a lot of work through a shared lock (perhaps the reader thread places data in a shared location that the writer consumes, so a lock is used to avoid race conditions), then it's likely that both threads will run on the same core.
The best case scenario would be when there is zero interaction between the threads. In this case the OS can easily run the threads on separate cores.
One thing to be aware of is that the threads can interact even if you didn't explicitly code anything to do so. The default memory manger is shared between all threads. So if you do a lot of dynamic memory allocation in each thread, you can experience contention limiting scalability across large numbers of cores.
So the important thing for you to focus on is getting your design "correct":
Ensure a "clean" separation of concerns.
Eliminate unnecessary interaction between threads.
Ensure whatever interaction is needed uses the most appropriate technique for your requirements.
Get the above right, and the OS will schedule your threads as efficiently as it can.

How do user level threads (ULTs) and kernel level threads (KLTs) differ with regards to concurrent execution?

Here's what I understand; please correct/add to it:
In pure ULTs, the multithreaded process itself does the thread scheduling. So, the kernel essentially does not notice the difference and considers it a single-thread process. If one thread makes a blocking system call, the entire process is blocked. Even on a multicore processor, only one thread of the process would running at a time, unless the process is blocked. I'm not sure how ULTs are much help though.
In pure KLTs, even if a thread is blocked, the kernel schedules another (ready) thread of the same process. (In case of pure KLTs, I'm assuming the kernel creates all the threads of the process.)
Also, using a combination of ULTs and KLTs, how are ULTs mapped into KLTs?

Your analysis is correct. The OS kernel has no knowledge of user-level threads. From its perspective, a process is an opaque black box that occasionally makes system calls. Consequently, if that program has 100,000 user-level threads but only one kernel thread, then the process can only one run user-level thread at a time because there is only one kernel-level thread associated with it. On the other hand, if a process has multiple kernel-level threads, then it can execute multiple commands in parallel if there is a multicore machine.
A common compromise between these is to have a program request some fixed number of kernel-level threads, then have its own thread scheduler divvy up the user-level threads onto these kernel-level threads as appropriate. That way, multiple ULTs can execute in parallel, and the program can have fine-grained control over how threads execute.
As for how this mapping works - there are a bunch of different schemes. You could imagine that the user program uses any one of multiple different scheduling systems. In fact, if you do this substitution:
Kernel thread <---> Processor core
User thread <---> Kernel thread
Then any scheme the OS could use to map kernel threads onto cores could also be used to map user-level threads onto kernel-level threads.
Hope this helps!

Before anything else, templatetypedef's answer is beautiful; I simply wanted to extend his response a little.
There is one area which I felt the need for expanding a little: combinations of ULT's and KLT's. To understand the importance (what Wikipedia labels hybrid threading), consider the following examples:
Consider a multi-threaded program (multiple KLT's) where there are more KLT's than available logical cores. In order to efficiently use every core, as you mentioned, you want the scheduler to switch out KLT's that are blocking with ones that in a ready state and not blocking. This ensures the core is reducing its amount of idle time. Unfortunately, switching KLT's is expensive for the scheduler and it consumes a relatively large amount of CPU time.
This is one area where hybrid threading can be helpful. Consider a multi-threaded program with multiple KLT's and ULT's. Just as templatetypedef noted, only one ULT can be running at one time for each KLT. If a ULT is blocking, we still want to switch it out for one which is not blocking. Fortunately, ULT's are much more lightweight than KLT's, in the sense that there less resources assigned to a ULT and they require no interaction with the kernel scheduler. Essentially, it is almost always quicker to switch out ULT's than it is to switch out KLT's. As a result, we are able to significantly reduce a cores idle time relative to the first example.
Now, of course, all of this depends on the threading library being used for implementing ULT's. There are two ways (which I can come up with) for "mapping" ULT's to KLT's.
A collection of ULT's for all KLT's
This situation is ideal on a shared memory system. There is essentially a "pool" of ULT's to which each KLT has access. Ideally, the threading library scheduler would assign ULT's to each KLT upon request as opposed to the KLT's accessing the pool individually. The later could cause race conditions or deadlocks if not implemented with locks or something similar.
A collection of ULT's for each KLT (Qthreads)
This situation is ideal on a distributed memory system. Each KLT would have a collection of ULT's to run. The draw back is that the user (or the threading library) would have to divide the ULT's between the KLT's. This could result in load imbalance since it is not guaranteed that all ULT's will have the same amount of work to complete and complete roughly the same amount of time. The solution to this is allowing for ULT migration; that is, migrating ULT's between KLT's.

What the difference between lightweight concurrency and heavyweight concurrency?

I just learn multiple threading programming, but the question here is a very basic concept need to be clarified first of all.
As I searched from internet, what i understand is Heavyweight is regarding to "process", and Lightweight maps to "thread". However, why process is heavyweight? because of non-sharing memory or something else?

"Heavyweight" concurrency is where each of the concurrent executors is expensive to start and/or has large overheads.
"Lightweight" concurrency is where each of the concurrent executors is cheap to start and/or has small overheads.
Processes are generally more expensive to manage for the OS than threads, since each process needs an independent address space and various management structures, whereas threads within a process share these structures.
Consequently, processes are considered heavyweight, whereas threads are lightweight.
However, in some contexts, threads are considered heavyweight, and the "lightweight" concurrency facility is some kind of "task". In these contexts, the runtime will typically execute these tasks on a pool of threads, suspending them when they block, and reusing the threads for other tasks.

Nowadays the "heavy" classification no longer carries the same weight as it used to while the advantage of process separation has lost none of its potency ;-)
This is all thanks to the copy-on-write semantics; during a fork() the pages from the parent are no longer blindly copied for the child process. Both processes can operate using shared memory until the child process starts to write into one of the shared memory pages.
Of course, creating more processes has a higher tendency of being limited by the operating system as process ids are a more limited resource than threads.

Concurrency: Processes vs Threads

What are the main advantages of using a model for concurrency based on processes over one
based on threads and in what contexts is the latter appropriate?

Fault-tolerance and scalability are the main advantages of using Processes vs. Threads.
A system that relies on shared memory or some other kind of technology that is only available when using threads, will be useless when you want to run the system on multiple machines. Sooner or later you will need to communicate between different processes.
When using processes you are forced to deal with communication via messages, for example, this is the way Erlang handles communication. Data is not shared, so there is no risk of data corruption.
Another advantage of processes is that they can crash and you can feel relatively safe in the knowledge that you can just restart them (even across network hosts). However, if a thread crashes, it may crash the entire process, which may bring down your entire application. To illustrate: If an Erlang process crashes, you will only lose that phone call, or that webrequest, etc. Not the whole application.
In saying all this, OS processes also have many drawbacks that can make them harder to use, like the fact that it takes forever to spawn a new process. However, Erlang has it's own notion of processes, which are extremely lightweight.
With that said, this discussion is really a topic of research. If you want to get into more of the details, you can give Joe Armstrong's paper on fault-tolerant systems]1 a read, it explains a lot about Erlang and the philosophy that drives it.

The disadvantage of using a process-based model is that it will be slower. You will have to copy data between the concurrent parts of your program.
The disadvantage of using a thread-based model is that you will probably get it wrong. It may sound mean, but it's true-- show me code based on threads and I'll show you a bug. I've found bugs in threaded code that has run "correctly" for 10 years.
The advantages of using a process-based model are numerous. The separation forces you to think in terms of protocols and formal communication patterns, which means its far more likely that you will get it right. Processes communicating with each other are easier to scale out across multiple machines. Multiple concurrent processes allows one process to crash without necessarily crashing the others.
The advantage of using a thread-based model is that it is fast.
It may be obvious which of the two I prefer, but in case it isn't: processes, every day of the week and twice on Sunday. Threads are too hard: I haven't ever met anybody who could write correct multi-threaded code; those that claim to be able to usually don't know enough about the space yet.

In this case Processes are more independent of eachother, while Threads shares some resources e.g. memory. But in a general case Threads are more light-weight than Processes.
Erlang Processes is not the same thing as OS Processes. Erlang Processes are very light-weight and Erlang can have many Erlang Processes within the same OS Thread. See Technically why is processes in Erlang more efficient than OS threads?

First and foremost, processes differ from threads mostly in the way their memory is handled:
Process = n*Thread + memory region (n>=1)
Processes have their own isolated memory.
Processes can have multiple threads.
Processes are isolated from each other on the operating system level.
Threads share their memory with their peers in the process.
(This is often undesirable. There are libraries and methods out there to remedy this, but that is usually an artificial layer over operating system threads.)
The memory thing is the most important discerning factor, as it has certain implications:
Exchanging data between processes is slower than between threads. Breaking the process isolation always requires some involvement of kernel calls and memory remapping.
Threads are more lightweight than processes. The operating system has to allocate resources and do memory management for each process.
Using processes gives you memory isolation and synchronization. Common problems with access to memory shared between threads do not concern you. Since you have to make a special effort to share data between processes, you will most likely sync automatically with that.
Using processes gives you good (or ultimate) encapsulation. Since inter process communication needs special effort, you will be forced to define a clean interface. It is a good idea to break certain parts of your application out of the main executable. Maybe you can split dependencies like that.
e.g. Process_RobotAi <-> Process_RobotControl
The AI will have vastly different dependencies compared to the control component. The interface might be simple: Process_RobotAI --DriveXY--> Process_RobotControl.
Maybe you change the robot platform. You only have to implement a new RobotControl executable with that simple interface. You don't have to touch or even recompile anything in your AI component.
It will also, for the same reasons, speed up compilation in most cases.
Edit: Just for completeness I will shamelessly add what the others have reminded me of :
A crashing process does not (necessarily) crash your whole application.
In General:
Want to create something highly concurrent or synchronuous, like an algorithm with n>>1 instances running in parallel and sharing data, use threads.
Have a system with multiple components that do not need to share data or algorithms, nor do they exchange data too often, use processes. If you use a RPC library for the inter process communication, you get a network-distributable solution at no extra cost.
1 and 2 are the extreme and no-brainer scenarios, everything in between must be decided individually.
For a good (or awesome) example of a system that uses IPC/RPC heavily, have a look at ros.

Threads vs Processes in Linux [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed last year.
The community reviewed whether to reopen this question last year and left it closed:
Original close reason(s) were not resolved
Improve this question
I've recently heard a few people say that in Linux, it is almost always better to use processes instead of threads, since Linux is very efficient in handling processes, and because there are so many problems (such as locking) associated with threads. However, I am suspicious, because it seems like threads could give a pretty big performance gain in some situations.
So my question is, when faced with a situation that threads and processes could both handle pretty well, should I use processes or threads? For example, if I were writing a web server, should I use processes or threads (or a combination)?

Linux uses a 1-1 threading model, with (to the kernel) no distinction between processes and threads -- everything is simply a runnable task. *
On Linux, the system call clone clones a task, with a configurable level of sharing, among which are:
CLONE_FILES: share the same file descriptor table (instead of creating a copy)
CLONE_PARENT: don't set up a parent-child relationship between the new task and the old (otherwise, child's getppid() = parent's getpid())
CLONE_VM: share the same memory space (instead of creating a COW copy)
fork() calls clone(least sharing) and pthread_create() calls clone(most sharing). **
forking costs a tiny bit more than pthread_createing because of copying tables and creating COW mappings for memory, but the Linux kernel developers have tried (and succeeded) at minimizing those costs.
Switching between tasks, if they share the same memory space and various tables, will be a tiny bit cheaper than if they aren't shared, because the data may already be loaded in cache. However, switching tasks is still very fast even if nothing is shared -- this is something else that Linux kernel developers try to ensure (and succeed at ensuring).
In fact, if you are on a multi-processor system, not sharing may actually be beneficial to performance: if each task is running on a different processor, synchronizing shared memory is expensive.
* Simplified. CLONE_THREAD causes signals delivery to be shared (which needs CLONE_SIGHAND, which shares the signal handler table).
** Simplified. There exist both SYS_fork and SYS_clone syscalls, but in the kernel, the sys_fork and sys_clone are both very thin wrappers around the same do_fork function, which itself is a thin wrapper around copy_process. Yes, the terms process, thread, and task are used rather interchangeably in the Linux kernel...

Linux (and indeed Unix) gives you a third option.
Option 1 - processes
Create a standalone executable which handles some part (or all parts) of your application, and invoke it separately for each process, e.g. the program runs copies of itself to delegate tasks to.
Option 2 - threads
Create a standalone executable which starts up with a single thread and create additional threads to do some tasks
Option 3 - fork
Only available under Linux/Unix, this is a bit different. A forked process really is its own process with its own address space - there is nothing that the child can do (normally) to affect its parent's or siblings address space (unlike a thread) - so you get added robustness.
However, the memory pages are not copied, they are copy-on-write, so less memory is usually used than you might imagine.
Consider a web server program which consists of two steps:
Read configuration and runtime data
Serve page requests
If you used threads, step 1 would be done once, and step 2 done in multiple threads. If you used "traditional" processes, steps 1 and 2 would need to be repeated for each process, and the memory to store the configuration and runtime data duplicated. If you used fork(), then you can do step 1 once, and then fork(), leaving the runtime data and configuration in memory, untouched, not copied.
So there are really three choices.

That depends on a lot of factors. Processes are more heavy-weight than threads, and have a higher startup and shutdown cost. Interprocess communication (IPC) is also harder and slower than interthread communication.
Conversely, processes are safer and more secure than threads, because each process runs in its own virtual address space. If one process crashes or has a buffer overrun, it does not affect any other process at all, whereas if a thread crashes, it takes down all of the other threads in the process, and if a thread has a buffer overrun, it opens up a security hole in all of the threads.
So, if your application's modules can run mostly independently with little communication, you should probably use processes if you can afford the startup and shutdown costs. The performance hit of IPC will be minimal, and you'll be slightly safer against bugs and security holes. If you need every bit of performance you can get or have a lot of shared data (such as complex data structures), go with threads.

Others have discussed the considerations.
Perhaps the important difference is that in Windows processes are heavy and expensive compared to threads, and in Linux the difference is much smaller, so the equation balances at a different point.

Once upon a time there was Unix and in this good old Unix there was lots of overhead for processes, so what some clever people did was to create threads, which would share the same address space with the parent process and they only needed a reduced context switch, which would make the context switch more efficient.
In a contemporary Linux (2.6.x) there is not much difference in performance between a context switch of a process compared to a thread (only the MMU stuff is additional for the thread).
There is the issue with the shared address space, which means that a faulty pointer in a thread can corrupt memory of the parent process or another thread within the same address space.
A process is protected by the MMU, so a faulty pointer will just cause a signal 11 and no corruption.
I would in general use processes (not much context switch overhead in Linux, but memory protection due to MMU), but pthreads if I would need a real-time scheduler class, which is a different cup of tea all together.
Why do you think threads are have such a big performance gain on Linux? Do you have any data for this, or is it just a myth?

I think everyone has done a great job responding to your question. I'm just adding more information about thread versus process in Linux to clarify and summarize some of the previous responses in context of kernel. So, my response is in regarding to kernel specific code in Linux. According to Linux Kernel documentation, there is no clear distinction between thread versus process except thread uses shared virtual address space unlike process. Also note, the Linux Kernel uses the term "task" to refer to process and thread in general.
"There are no internal structures implementing processes or threads, instead there is a struct task_struct that describe an abstract scheduling unit called task"
Also according to Linus Torvalds, you should NOT think about process versus thread at all and because it's too limiting and the only difference is COE or Context of Execution in terms of "separate the address space from the parent " or shared address space. In fact he uses a web server example to make his point here (which highly recommend reading).
Full credit to linux kernel documentation

If you want to create a pure a process as possible, you would use clone() and set all the clone flags. (Or save yourself the typing effort and call fork())
If you want to create a pure a thread as possible, you would use clone() and clear all the clone flags (Or save yourself the typing effort and call pthread_create())
There are 28 flags that dictate the level of resource sharing. This means that there are over 268 million flavours of tasks that you can create, depending on what you want to share.
This is what we mean when we say that Linux does not distinguish between a process and a thread, but rather alludes to any flow of control within a program as a task. The rationale for not distinguishing between the two is, well, not uniquely defining over 268 million flavours!
Therefore, making the "perfect decision" of whether to use a process or thread is really about deciding which of the 28 resources to clone.

How tightly coupled are your tasks?
If they can live independently of each other, then use processes. If they rely on each other, then use threads. That way you can kill and restart a bad process without interfering with the operation of the other tasks.

To complicate matters further, there is such a thing as thread-local storage, and Unix shared memory.
Thread-local storage allows each thread to have a separate instance of global objects. The only time I've used it was when constructing an emulation environment on linux/windows, for application code that ran in an RTOS. In the RTOS each task was a process with it's own address space, in the emulation environment, each task was a thread (with a shared address space). By using TLS for things like singletons, we were able to have a separate instance for each thread, just like under the 'real' RTOS environment.
Shared memory can (obviously) give you the performance benefits of having multiple processes access the same memory, but at the cost/risk of having to synchronize the processes properly. One way to do that is have one process create a data structure in shared memory, and then send a handle to that structure via traditional inter-process communication (like a named pipe).

In my recent work with LINUX is one thing to be aware of is libraries. If you are using threads make sure any libraries you may use across threads are thread-safe. This burned me a couple of times. Notably libxml2 is not thread-safe out of the box. It can be compiled with thread safe but that is not what you get with aptitude install.

I'd have to agree with what you've been hearing. When we benchmark our cluster (xhpl and such), we always get significantly better performance with processes over threads. </anecdote>

The decision between thread/process depends a little bit on what you will be using it to.
One of the benefits with a process is that it has a PID and can be killed without also terminating the parent.
For a real world example of a web server, apache 1.3 used to only support multiple processes, but in in 2.0 they added an abstraction so that you can swtch between either. Comments seems to agree that processes are more robust but threads can give a little bit better performance (except for windows where performance for processes sucks and you only want to use threads).

For most cases i would prefer processes over threads.
threads can be useful when you have a relatively smaller task (process overhead >> time taken by each divided task unit) and there is a need of memory sharing between them. Think a large array.
Also (offtopic), note that if your CPU utilization is 100 percent or close to it, there is going to be no benefit out of multithreading or processing. (in fact it will worsen)

Threads -- > Threads shares a memory space,it is an abstraction of the CPU,it is lightweight.
Processes --> Processes have their own memory space,it is an abstraction of a computer.
To parallelise task you need to abstract a CPU.
However the advantages of using a process over a thread is security,stability while a thread uses lesser memory than process and offers lesser latency.
An example in terms of web would be chrome and firefox.
In case of Chrome each tab is a new process hence memory usage of chrome is higher than firefox ,while the security and stability provided is better than firefox.
The security here provided by chrome is better,since each tab is a new process different tab cannot snoop into the memory space of a given process.

Multi-threading is for masochists. :)
If you are concerned about an environment where you are constantly creating threads/forks, perhaps like a web server handling requests, you can pre-fork processes, hundreds if necessary. Since they are Copy on Write and use the same memory until a write occurs, it's very fast. They can all block, listening on the same socket and the first one to accept an incoming TCP connection gets to run with it. With g++ you can also assign functions and variables to be closely placed in memory (hot segments) to ensure when you do write to memory, and cause an entire page to be copied at least subsequent write activity will occur on the same page. You really have to use a profiler to verify that kind of stuff but if you are concerned about performance, you should be doing that anyway.
Development time of threaded apps is 3x to 10x times longer due to the subtle interaction on shared objects, threading "gotchas" you didn't think of, and very hard to debug because you cannot reproduce thread interaction problems at will. You may have to do all sort of performance killing checks like having invariants in all your classes that are checked before and after every function and you halt the process and load the debugger if something isn't right. Most often it's embarrassing crashes that occur during production and you have to pore through a core dump trying to figure out which threads did what. Frankly, it's not worth the headache when forking processes is just as fast and implicitly thread safe unless you explicitly share something. At least with explicit sharing you know exactly where to look if a threading style problem occurs.
If performance is that important, add another computer and load balance. For the developer cost of debugging a multi-threaded app, even one written by an experienced multi-threader, you could probably buy 4 40 core Intel motherboards with 64gigs of memory each.
That being said, there are asymmetric cases where parallel processing isn't appropriate, like, you want a foreground thread to accept user input and show button presses immediately, without waiting for some clunky back end GUI to keep up. Sexy use of threads where multiprocessing isn't geometrically appropriate. Many things like that just variables or pointers. They aren't "handles" that can be shared in a fork. You have to use threads. Even if you did fork, you'd be sharing the same resource and subject to threading style issues.

If you need to share resources, you really should use threads.
Also consider the fact that context switches between threads are much less expensive than context switches between processes.
I see no reason to explicitly go with separate processes unless you have a good reason to do so (security, proven performance tests, etc...)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string