Can thread creation within OS internals run concurrently? - multithreading

Suppose we have a dual-core machine with a mainstream, modern OS capable to utilize both the cores.
If I have two threads, P1 and Q1 within the same process, and they happen to commence creating child threads, say, P2 and Q2, at approximately the same machine cycle, will OS perform the thread creation concurrently?
I heard thread creation is expensive, so the question came forth...
Thanks in advance.

Any reasonably well designed OS can have multiple processors executing kernel code at the same time. Therefore some of the tasks involved in a thread creation can be happening concurrently. But there will be some necessary serialization to manipulate some shared data structures (e.g. allocating memory, inserting a newly created threat structure into a global list). The processors could contend for the same lock thereby reducing concurrency.
Systems/applications which make new threads so often that the overhead of thread creation actually matters are probably designed wrong (doing too little useful work in a thread relative to the startup time, and not taking advantage of the obvious optimization of reusing short-lived threads from a pool).

It will be sorta-concurrently. There are aspects of thread-creation that cannot proceed in parallel - it would be unfortunate if the kernel memory-manager allocated both threads the same stack!
Thread creation is sufficiently expensive that it's worth while avoiding doing it at all during an app. run, hence the popularity of thread pools. Long-running tasks that block can be threaded off and left for the life of the app - often this means that explicit thread termination, (awkward at best, almost impossible at worst, from user code), is not necessary.
I think developers continually start and stop threads because they like to think of them as 'functions', where you 'pass parameters' in at the start and 'return' results when the thread ends. Ths is not the best way of conceptualizing threads.

Related

How do user level threads (ULTs) and kernel level threads (KLTs) differ with regards to concurrent execution?

Here's what I understand; please correct/add to it:
In pure ULTs, the multithreaded process itself does the thread scheduling. So, the kernel essentially does not notice the difference and considers it a single-thread process. If one thread makes a blocking system call, the entire process is blocked. Even on a multicore processor, only one thread of the process would running at a time, unless the process is blocked. I'm not sure how ULTs are much help though.
In pure KLTs, even if a thread is blocked, the kernel schedules another (ready) thread of the same process. (In case of pure KLTs, I'm assuming the kernel creates all the threads of the process.)
Also, using a combination of ULTs and KLTs, how are ULTs mapped into KLTs?
Your analysis is correct. The OS kernel has no knowledge of user-level threads. From its perspective, a process is an opaque black box that occasionally makes system calls. Consequently, if that program has 100,000 user-level threads but only one kernel thread, then the process can only one run user-level thread at a time because there is only one kernel-level thread associated with it. On the other hand, if a process has multiple kernel-level threads, then it can execute multiple commands in parallel if there is a multicore machine.
A common compromise between these is to have a program request some fixed number of kernel-level threads, then have its own thread scheduler divvy up the user-level threads onto these kernel-level threads as appropriate. That way, multiple ULTs can execute in parallel, and the program can have fine-grained control over how threads execute.
As for how this mapping works - there are a bunch of different schemes. You could imagine that the user program uses any one of multiple different scheduling systems. In fact, if you do this substitution:
Kernel thread <---> Processor core
User thread <---> Kernel thread
Then any scheme the OS could use to map kernel threads onto cores could also be used to map user-level threads onto kernel-level threads.
Hope this helps!
Before anything else, templatetypedef's answer is beautiful; I simply wanted to extend his response a little.
There is one area which I felt the need for expanding a little: combinations of ULT's and KLT's. To understand the importance (what Wikipedia labels hybrid threading), consider the following examples:
Consider a multi-threaded program (multiple KLT's) where there are more KLT's than available logical cores. In order to efficiently use every core, as you mentioned, you want the scheduler to switch out KLT's that are blocking with ones that in a ready state and not blocking. This ensures the core is reducing its amount of idle time. Unfortunately, switching KLT's is expensive for the scheduler and it consumes a relatively large amount of CPU time.
This is one area where hybrid threading can be helpful. Consider a multi-threaded program with multiple KLT's and ULT's. Just as templatetypedef noted, only one ULT can be running at one time for each KLT. If a ULT is blocking, we still want to switch it out for one which is not blocking. Fortunately, ULT's are much more lightweight than KLT's, in the sense that there less resources assigned to a ULT and they require no interaction with the kernel scheduler. Essentially, it is almost always quicker to switch out ULT's than it is to switch out KLT's. As a result, we are able to significantly reduce a cores idle time relative to the first example.
Now, of course, all of this depends on the threading library being used for implementing ULT's. There are two ways (which I can come up with) for "mapping" ULT's to KLT's.
A collection of ULT's for all KLT's
This situation is ideal on a shared memory system. There is essentially a "pool" of ULT's to which each KLT has access. Ideally, the threading library scheduler would assign ULT's to each KLT upon request as opposed to the KLT's accessing the pool individually. The later could cause race conditions or deadlocks if not implemented with locks or something similar.
A collection of ULT's for each KLT (Qthreads)
This situation is ideal on a distributed memory system. Each KLT would have a collection of ULT's to run. The draw back is that the user (or the threading library) would have to divide the ULT's between the KLT's. This could result in load imbalance since it is not guaranteed that all ULT's will have the same amount of work to complete and complete roughly the same amount of time. The solution to this is allowing for ULT migration; that is, migrating ULT's between KLT's.

What is the difference between threads and lightweight threads?

I don't quite understand the difference between threads and lightweight threads. From an API perspective both types of threads are identical so where exactly does the difference come in. Is it at the implementation level where a lightweight thread is managed by a higher level runtime than the OS thread scheduler or is it something else? Also, is there set of heuristics that people use to decide which type of thread to use in specific scenarios?
In what context, lightweight threads could represent threads which are implemented by a library, for example threads can be simulated in a library by switching between lightweight threads at an event handling layer, these lightweight threads are queued up and processed by a singe OS thread, the advantage of this is that since context switching is handled in the library switching can occur when the processing of data is complete and so the data does not need to be loaded back into the CPU's cache next time this lightweight thread becomes active.
Lightweight threads could also refer to co-operative threads (or fibers), these are threads where you have to explicitly yield to give other lightweight threads a chance, this has the same advantage in that the context switching can occur at a place you know you have finished processing some data and so you know it will not be need again.
Alternativly Lightweight threads could mean normal OS threads and the non-lightweight threads could mean processes, process have at least one thread within them and also have there own memory and other resources, they are more expensive than threads because you can not share data between thread easily and it can be a more expensive operation for the OS to create processes.

Why are OS threads considered expensive?

There are many solutions geared toward implementing "user-space" threads. Be it golang.org goroutines, python's green threads, C#'s async, erlang's processes etc. The idea is to allow concurrent programming even with a single or limited number of threads.
What I don't understand is, why are the OS threads so expensive? As I see it, either way you have to save the stack of the task (OS thread, or userland thread), which is a few tens of kilobytes, and you need a scheduler to move between two tasks.
The OS provides both of this functions for free. Why should OS threads be more expensive than "green" threads? What's the reason for the assumed performance degradation caused by having a dedicated OS thread for each "task"?
I want to amend Tudors answer which is a good starting point. There are two main overheads of threads:
Starting and stopping them. Involves creating a stack and kernel objects. Involves kernel transitions and global kernel locks.
Keeping their stack around.
(1) is only a problem if you are creating and stopping them all the time. This is solved commonly using thread pools. I consider this problem to be practically solved. Scheduling a task on a thread pool usually does not involve a trip to the kernel which makes it very fast. The overhead is on the order of a few interlocked memory operations and a few allocations.
(2) This becomes important only if you have many threads (> 100 or so). In this case async IO is a means to get rid of the threads. I found that if you don't have insane amounts of threads synchronous IO including blocking is slightly faster than async IO (you read that right: sync IO is faster).
Saving the stack is trivial, no matter what its size - the stack pointer needs to be saved in the Thread Info Block in the kernel, (so usualy saving most of the registers as well since they will have been pushed by whatever soft/hard interrupt caused the OS to be entered).
One issue is that a protection level ring-cycle is required to enter the kernel from user. This is an essential, but annoying, overhead. Then the driver or system call has to do whatever was requested by the interrupt and then the scheduling/dispatching of threads onto processors. If this results in the preemption of a thread from one process by a thread from another, a load of extra process context has to be swapped as well. Even more overhead is added if the OS decides that a thread that is running on another processor core than the one handling the interrupt mut be preempted - the other core must be hardware-interrupted, (this is on top of the hard/soft interrupt that entred the OS in the first place.
So, a scheduling run may be quite a complex operation.
'Green threads' or 'fibers' are, (usually), scheduled from user code. A context-change is much easier and cheaper than an OS interrupt etc. because no Wagnerian ring-cycle is required on every context-change, process-context does not change and the OS thread running the green thread group does not change.
Since something-for-nothing does not exist, there are problems with green threads. They ar run by 'real' OS threads. This means that if one 'green' thread in a group run by one OS thread makes an OS call that blocks, all green threads in the group are blocked. This means that simple calls like sleep() have to be 'emulated' by a state-machine that yields to other green threads, (yes, just like re-implementing the OS). Similarly, any inter-thread signalling.
Also, of course, green threads cannot directly respond to IO signaling, so somewhat defeating the point of having any threads in the first place.
There are many solutions geared toward implementing "user-space" threads. Be it golang.org goroutines, python's green threads, C#'s async, erlang's processes etc. The idea is to allow concurrent programming even with a single or limited number of threads.
It's an abstraction layer. It's easier for many people to grasp this concept and use it more effectively in many scenarios. It's also easier for many machines (assuming a good abstraction), since the model moves from width to pull in many cases. With pthreads (as an example), you have all the control. With other threading models, the idea is to reuse threads, for the process of creating a concurrent task to be inexpensive, and to use a completely different threading model. It's far easier to digest this model; there's less to learn and measure, and the results are generally good.
What I don't understand is, why are the OS threads so expensive? As I see it, either way you have to save the stack of the task (OS thread, or userland thread), which is a few tens of kilobytes, and you need a scheduler to move between two tasks.
Creating a thread is expensive, and the stack requires memory. As well, if your process is using many threads, then context switching can kill performance. So lightweight threading models became useful for a number of reasons. Creating an OS thread became a good solution for medium to large tasks, ideally in low numbers. That's restrictive, and quite time consuming to maintain.
A task/thread pool/userland thread does not need to worry about much of the context switching or thread creation. It's often "reuse the resource when it becomes available, if it's not ready now -- also, determine the number of active threads for this machine".
More commmonly (IMO), OS level threads are expensive because they are not used correctly by the engineers - either there are too many and there is a ton of context switching, there is competition for the same set of resources, the tasks are too small. It takes much more time to understand how to use OS threads correctly, and how to apply that best to the context of a program's execution.
The OS provides both of this functions for free.
They're available, but they are not free. They are complex, and very important to good performance. When you create an OS thread, it's given time 'soon' -- all the process' time is divided among the threads. That's not the common case with user threads. The task is often enqueued when the resource is not available. This reduces context switching, memory, and the total number of threads which must be created. When the task exits, the thread is given another.
Consider this analogy of time distribution:
Assume you are at a casino. There are a number people who want cards.
You have a fixed number of dealers. There are fewer dealers than people who want cards.
There is not always enough cards for every person at any given time.
People need all cards to complete their game/hand. They return their cards to the dealer when their game/hand is complete.
How would you ask the dealers to distribute cards?
Under the OS scheduler, that would be based on (thread) priority. Every person would be given one card at a time (CPU time), and priority would be evaluated continually.
The people represent the task or thread's work. The cards represent time and resources. The dealers represent threads and resources.
How would you deal fastest if there were 2 dealers and 3 people? and if there were 5 dealers and 500 people? How could you minimize running out of cards to deal? With threads, adding cards and adding dealers is not a solution you can deliver 'on demand'. Adding CPUs is equivalent to adding dealers. Adding threads is equivalent to dealers dealing cards to more people at a time (increases context switching). There are a number of strategies to deal cards more quickly, especially after you eliminate the people's need for cards in a certain amount of time. Would it not be faster to go to a table and deal to a person or people until their game is complete if the dealer to people ratio were 1/50? Compare this to visiting every table based on priority, and coordinating visitation among all dealers (the OS approach). That's not to imply the OS is stupid -- it implies that creating an OS thread is an engineer adding more people and more tables, potentially more than the dealers can reasonably handle. Fortunately, the constraints may be lifted in many cases by using other multithreading models and higher abstractions.
Why should OS threads be more expensive than "green" threads? What's the reason for the assumed performance degradation caused by having a dedicated OS thread for each "task"?
If you developed a performance critical low level threading library (e.g. upon pthreads), you would recognize the importance of reuse (and implement it in your library as a model available for users). From that angle, the importance of higher level multithreading models is a simple and obvious solution/optimization based on real world usage as well as the ideal that the entry bar for adopting and effectively utilizing multithreading can be lowered.
It's not that they are expensive -- the lightweight threads' model and pool is a better solution for many problems, and a more appropriate abstraction for engineers who do not understand threads well. The complexity of multithreading is greatly simplified (and often more performant in real world usage) under this model. With OS threads, you do have more control, but several more considerations must be made to use them as effectively as possible -- heeding these consideration can dramatically reflow a program's execution/implementation. With higher level abstractions, many of these complexities are minimized by completely altering the flow of task execution (width vs pull).
The problem with starting kernel threads for each small task is that it incurs a non-negligible overhead to start and stop, coupled with the stack size it needs.
This is the first important point: thread pools exist so that you can recycle threads, in order to avoid wasting time starting them as well as wasting memory for their stacks.
Secondly, if you fire off threads to do asynchronous I/O, they will spend most of their time blocked waiting for the I/O to complete, thus effectively not doing any work and wasting memory. A much better option is to have a single worker handle multiple async calls (through some under-the-hood scheduling technique, such as multiplexing), thus again saving memory and time.
One thing that makes "green" threads faster than kernel threads is that they are user-space objects, managed by a virtual machine. Starting them is a user space call, while starting a thread is a kernel-space call that is much slower.
A person in Google shows an interesting approach.
According to him, kernel mode switching itself is not the bottleneck, and the core cost happen on SMP scheduler. And he claims M:N schedule assisted by kernel wouldn't be expensive, and this makes me to expect general M:N threading to be available on every languages.
Because the OS. Imagine that instead of asking you to clean the house your grandmother has to call the social service that does some paperwork and a week after assigns a social worker for helping her. The worker can be called off at any time and replaced with another one, which again takes several days.
That's pretty ineffective and slow, huh?
In this metaphor you are a userland coroutine scheduler, the social service is an OS with its kernel-level thread scheduler, and a social worker is a fully-fledged thread.
I think the two things are in different levels.
Thread or Process is an instance of the program which is being executed. In a process/thread there is much more things in it. Execution stack, opening files, signals, processors status, and a many other things.
Greentlet is different, it is runs in vm. It supplies a light-weight thread. Many of them supply a pseudo-concurrently (typically in a single or a few OS-level threads). And often they supply a lock-free method by data-transmission instead of data sharing.
So, the two things focus different, so the weight are different.
And In my mind, the greenlet should be finished in the VM not the OS.

when to use process v/s thread?

I know the theoretical difference between the thread and process. But in practical when to use the thread and process because both will does the same work.
In general (and it varies by operating system):
Threads are usually lighter-weight than processes
Processes provide better isolation between actions
Threads provide simpler data sharing and coordination within the process
Typically the middle point is the kicker for me - if you really, really don't want two actions to interfere with each other, to the extent that one process going belly-up doesn't affect the other action, use separate processes. Otherwise I personally go for threads.
(I'm assuming that both models are available - if you want to run a separate executable, that's going to be pretty hard to do within an existing thread, at least in most environments I'm aware of.)
Thread is a subtotal of a process. Hereby the main difference is memory allocation and CPU time scheduling:
operating system handles memory per process and schedules execution time for processes
you allocate memory (within the bounds allowed per process) and you schedule execution time (within given execution timeframe per process) for threads
Other than that there's a lot of minor defining differences, like hardware allocation (threads can share hardware locked by their process), communication (depending on the platform/language/runtime, threads can share variables, processes need a pipe to share information) etc. There's much more in this distinction if you think of a thread as of an atomic entity, whilst process in that case would be the way to group these entities.

Thread Pool vs Thread Spawning

Can someone list some comparison points between Thread Spawning vs Thread Pooling, which one is better? Please consider the .NET framework as a reference implementation that supports both.
Thread pool threads are much cheaper than a regular Thread, they pool the system resources required for threads. But they have a number of limitations that may make them unfit:
You cannot abort a threadpool thread
There is no easy way to detect that a threadpool completed, no Thread.Join()
There is no easy way to marshal exceptions from a threadpool thread
You cannot display any kind of UI on a threadpool thread beyond a message box
A threadpool thread should not run longer than a few seconds
A threadpool thread should not block for a long time
The latter two constraints are a side-effect of the threadpool scheduler, it tries to limit the number of active threads to the number of cores your CPU has available. This can cause long delays if you schedule many long running threads that block often.
Many other threadpool implementations have similar constraints, give or take.
A "pool" contains a list of available "threads" ready to be used whereas "spawning" refers to actually creating a new thread.
The usefulness of "Thread Pooling" lies in "lower time-to-use": creation time overhead is avoided.
In terms of "which one is better": it depends. If the creation-time overhead is a problem use Thread-pooling. This is a common problem in environments where lots of "short-lived tasks" need to be performed.
As pointed out by other folks, there is a "management overhead" for Thread-Pooling: this is minimal if properly implemented. E.g. limiting the number of threads in the pool is trivial.
For some definition of "better", you generally want to go with a thread pool. Without knowing what your use case is, consider that with a thread pool, you have a fixed number of threads which can all be created at startup or can be created on demand (but the number of threads cannot exceed the size of the pool). If a task is submitted and no thread is available, it is put into a queue until there is a thread free to handle it.
If you are spawning threads in response to requests or some other kind of trigger, you run the risk of depleting all your resources as there is nothing to cap the amount of threads created.
Another benefit to thread pooling is reuse - the same threads are used over and over to handle different tasks, rather than having to create a new thread each time.
As pointed out by others, if you have a small number of tasks that will run for a long time, this would negate the benefits gained by avoiding frequent thread creation (since you would not need to create a ton of threads anyway).
My feeling is that you should start just by creating a thread as needed... If the performance of this is OK, then you're done. If at some point, you detect that you need lower latency around thread creation you can generally drop in a thread pool without breaking anything...
All depends on your scenario. Creating new threads is resource intensive and an expensive operation. Most very short asynchronous operations (less than a few seconds max) could make use of the thread pool.
For longer running operations that you want to run in the background, you'd typically create (spawn) your own thread. (Ab)using a platform/runtime built-in threadpool for long running operations could lead to nasty forms of deadlocks etc.
Thread pooling is usually considered better, because the threads are created up front, and used as required. Therefore, if you are using a lot of threads for relatively short tasks, it can be a lot faster. This is because they are saved for future use and are not destroyed and later re-created.
In contrast, if you only need 2-3 threads and they will only be created once, then this will be better. This is because you do not gain from caching existing threads for future use, and you are not creating extra threads which might not be used.
It depends on what you want to execute on the other thread.
For short task it is better to use a thread pool, for long task it may be better to spawn a new thread as it could starve the thread pool for other tasks.
The main difference is that a ThreadPool maintains a set of threads that are already spun-up and available for use, because starting a new thread can be expensive processor-wise.
Note however that even a ThreadPool needs to "spawn" threads... it usually depends on workload - if there is a lot of work to be done, a good threadpool will spin up new threads to handle the load based on configuration and system resources.
There is little extra time required for creating/spawning thread, where as thread poll already contains created threads which are ready to be used.
This answer is a good summary but just in case, here is the link to Wikipedia:
http://en.wikipedia.org/wiki/Thread_pool_pattern
For Multi threaded execution combined with getting return values from the execution, or an easy way to detect that a threadpool has completed, java Callables could be used.
See https://blogs.oracle.com/CoreJavaTechTips/entry/get_netbeans_6 for more info.
Assuming C# and Windows 7 and up...
When you create a thread using new Thread(), you create a managed thread that becomes backed by a native OS thread when you call Start – a one to one relationship. It is important to know only one thread runs on a CPU core at any given time.
An easier way is to call ThreadPool.QueueUserWorkItem (i.e. background thread), which in essence does the same thing, except those background threads aren’t forever tied to a single native thread. The .NET scheduler will simulate multitasking between managed threads on a single native thread. With say 4 cores, you’ll have 4 native threads each running multiple managed threads, determined by .NET. This offers lighter-weight multitasking since switching between managed threads happens within the .NET VM not in the kernel. There is some overhead associated with crossing from user mode to kernel mode, and the .NET scheduler minimizes such crossing.
It may be important to note that heavy multitasking might benefit from pure native OS threads in a well-designed multithreading framework. However, the performance benefits aren’t that much.
With using the ThreadPool, just make sure the minimum worker thread count is high enough or ThreadPool.QueueUserWorkItem will be slower than new Thread(). In a benchmark test looping 512 times calling new Thread() left ThreadPool.QueueUserWorkItem in the dust with default minimums. However, first setting the minimum worker thread count to 512, in this test, made new Thread() and ThreadPool.QueueUserWorkItem perform similarly.
A side effective of setting a high worker thread count is that new Task() (or Task.Factory.StartNew) also performed similarly as new Thread() and ThreadPool.QueueUserWorkItem.

Resources