When should I consider changing thread priority?

I was once asked to increase thread priority to fix a problem. I refused, saying that changing it was dangerous and that it would not address the root cause of the problem.
My question is: under what circumstances should I consider changing the priority of threads?

When you've made a list of the threads you're using and defined a priority order for them which makes sense in terms of the work they do.
If you nudge threads up here and there in order to bodge your way out of a problem, eventually they'll all be high priority and you're back where you started. Don't assume you can fix a race condition with prioritisation when really it needs locking, because chances are you've only fixed it in friendly conditions. There may still be cases where it can fail, such as when the lower-priority thread has undergone priority inheritance because another high-priority thread is waiting on another lock it's holding.
If you classify threads along the lines of "these threads fill the audio buffer", "these threads make my app responsive to system events", "these threads make my app responsive to the user", "these threads are getting on with some business and will report when they're good and ready", then the threads ought to be prioritised accordingly.
Finally, it depends on the OS. If thread priority is completely secondary to process priority, then it shouldn't be "dangerous" to prioritise threads: the only thing you can starve of CPU is yourself. But if your high-priority threads run in preference to the normal-priority threads of other, unrelated applications, then you have a broader responsibility. You should only be raising the priorities of threads which do small amounts of urgent work. The definition of "small" depends on what kind of device you're on - with a 3GHz multi-core processor you can get away with a lot, but a mobile device might have pseudo real-time expectations that user-level apps can break.
Keeping the audio buffer serviced is the canonical example of when to be high priority, though, since even small under-runs usually cause nasty crackling. Long downloads (or other slow I/O) are the canonical example of when to be low priority, since there's no urgency in processing this chunk of data if the next one won't be along for ages anyway. If you're ever writing a device driver you'll need to make more complex decisions about how to play nicely with others.
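For illustration, here is a minimal C# sketch of that kind of role-based prioritisation. The thread roles and workloads are hypothetical placeholders; the exact levels should come from the priority order you defined up front:

    using System;
    using System.Threading;

    class RoleBasedPriorities
    {
        static void Main()
        {
            // Hypothetical workloads standing in for the roles described above.
            var audioThread = new Thread(FillAudioBuffer)
            {
                IsBackground = true,
                Priority = ThreadPriority.Highest      // small, urgent work: keep the buffer serviced
            };
            var uiEventThread = new Thread(PumpUiEvents)
            {
                IsBackground = true,
                Priority = ThreadPriority.AboveNormal  // keep the app responsive to the user
            };
            var downloadThread = new Thread(DownloadLargeFile)
            {
                IsBackground = true,
                Priority = ThreadPriority.BelowNormal  // slow I/O: no urgency
            };

            audioThread.Start();
            uiEventThread.Start();
            downloadThread.Start();
            Console.ReadLine(); // keep the process alive while the background threads run
        }

        static void FillAudioBuffer()   { /* ... service the audio buffer ... */ }
        static void PumpUiEvents()      { /* ... respond to user input ... */ }
        static void DownloadLargeFile() { /* ... long-running, low-urgency I/O ... */ }
    }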

Not many. The only time I've ever had to change thread priorities in a positive direction was with a user interface thread. UIs must be extremely snappy for the app to feel right, so a lot of the time it is best to prioritize painting threads higher than others. For example, the Swing Event Dispatch Thread runs at priority 6 (one higher than the default of 5).
I do push threads down in priority quite a bit. Again, this is usually to keep the UI responsive while some long-running background process does its thing. However, this also sometimes applies to polling daemons and the like, which I know I don't want interfering with anything, however minimal the interference might be.

Our app uses a background thread to download data and we didn't want that interfering with the UI thread on single-core machines, so we deliberately prioritized that lower.

I think it depends on the direction you're looking at changing the priority.
Normally you shouldn't ever increase thread priority unless you have a very good reason. Increasing thread priority can cause your app's thread to start taking away time from other applications, which probably isn't what the user wants. If your thread is using up a significant amount of CPU it can make the machine hard to use, as some standard UI threads may start to starve.
I'd say the only time you should increase priority above normal is if the user explicitly told your app to do so, and even then you want to protect "clueless" users from themselves. Perhaps if your app doesn't use much CPU normally but has brief bursts of really important activity, an increased priority could be OK, as it wouldn't normally detract from the user's general experience.
Decreasing priority is another matter. If your app is doing something that takes a LOT of CPU and runs for a long time, yet isn't critical, then lowering the priority can be good. By lowering the priority you allow the CPU to be used for other things when it's needed, which helps keep the system responding quickly. As long as the system is mostly idle other than your app, you'll still get most of the CPU time, but won't take away from tasks that need it more than you. An example of this would be a thread that indexes the hard drive (think Google Desktop).
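As a rough sketch of the "lower it, don't raise it" approach: a background indexer (the indexing work here is just a placeholder) can demote its own process so that interactive applications always win the CPU.

    using System;
    using System.Diagnostics;

    class BackgroundIndexer
    {
        static void Main()
        {
            // Drop the whole process below normal so interactive apps always get the CPU first.
            // (BelowNormal is usually enough; Idle risks starving the job on a busy machine.)
            Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.BelowNormal;

            // Hypothetical long-running, CPU-heavy but non-critical work.
            for (int i = 0; i < 100000; i++)
            {
                IndexNextFile(i);
            }
        }

        static void IndexNextFile(int i)
        {
            // Placeholder for real indexing work.
        }
    }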

I would say when your original design assumptions about the threads are no longer valid.
Thread priority is mostly a design decision about what work is most important. So, some examples of when to reconsider: if you add a new feature that might require its own thread and that work becomes more important, then reconsider thread priorities. If some requirements change and force you to reconsider the priorities of the work you are doing, then reconsider. Or, if you do performance testing and realize that the "high priority work" specified in your design does not get the required performance, then tweak priorities.
Otherwise, it's often a hack.

Related

Cost of a thread

I understand how to create a thread in my chosen language and I understand about mutexes and the dangers of shared data etc., but I'm not sure about how the OS manages threads and the cost of each thread. I have a series of questions that all relate, and the clearest way to show the limit of my understanding is probably via these questions.
What is the cost of spawning a thread? Is it worth even worrying about when designing software? One of the costs of creating a thread must be its own stack pointer and program counter, then space to copy all of the working registers to as it is moved on and off of a core by the scheduler, but what else?
Is the amount of stack available for one program split equally between threads of a process, or allocated on a first come, first served basis?
Can I somehow check the hardware on start up (of the program) for the number of cores? If I am running on a machine with N cores, should I keep the number of threads to N-1?
then space to copy all of the working registers to as it is moved on and off of a core by the scheduler, but what else?
One less evident cost is the strain imposed on the scheduler which may start to choke if it needs to juggle thousands of threads. The memory isn't really the issue. With the right tweaking you can get a "thread" to occupy very little memory, little more than its stack. This tweaking could be difficult (i.e. using clone(2) directly under linux etc) but it can be done.
Is the amount of stack available for one program split equally between threads of a process, or allocated on a first come, first served basis?
Each thread gets its own stack, and typically you can control its size.
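In .NET, for example, you can request a stack size when constructing a thread; a small sketch (the 256 KB figure is just an illustrative choice, and the runtime may round it up to a platform minimum):

    using System;
    using System.Threading;

    class StackSizeSketch
    {
        static void Main()
        {
            // Request a smaller-than-default stack (the default is typically 1 MB on Windows).
            var worker = new Thread(DoWork, maxStackSize: 256 * 1024);
            worker.Start();
            worker.Join();
        }

        static void DoWork()
        {
            Console.WriteLine("Running on a thread with a reduced stack reservation.");
        }
    }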
If I am running on a machine with N cores, should I keep the number of threads to N-1?
Checking the number of cores is easy, but environment-specific. However, limiting the number of threads to the number of cores only makes sense if your workload consists of CPU-intensive operations, with little I/O. If I/O is involved you may want to have many more threads than cores.
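For instance, in .NET the logical core count is a one-liner, and you might size a CPU-bound worker pool from it (a sketch; for I/O-heavy work you would typically allow far more concurrency):

    using System;
    using System.Threading.Tasks;

    class CoreCountSketch
    {
        static void Main()
        {
            int cores = Environment.ProcessorCount; // logical processors visible to this process

            // For CPU-bound work, roughly one worker per core is a sensible starting point.
            var options = new ParallelOptions { MaxDegreeOfParallelism = cores };

            Parallel.For(0, 100, options, i =>
            {
                // ... CPU-intensive chunk of work ...
            });

            Console.WriteLine("Ran with up to " + cores + " workers.");
        }
    }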
You should be as thoughtful as possible in everything you design and implement.
I know that a Java thread stack takes up about 1 MB each time you create a thread, so they add up.
Threads make sense for asynchronous tasks that allow long-running activities to happen without preventing all other users/processes from making progress.
Threads are managed by the operating system. There are lots of schemes, all under the control of the operating system (e.g. round robin, first come first served, etc.)
It makes perfect sense to me to assign one thread per core for some activities (e.g. computationally intensive calculations, graphics, math, etc.), but that need not be the deciding factor. One app I develop uses roughly 100 active threads in production; it's not a 100 core machine.
To add to the other excellent posts:
'What is the cost of spawning a thread? Is it worth even worrying about when designing software?'
It is if one of your design choices is doing such a thing often. A good way of avoiding this issue is to create threads once, at app startup, by using pools and/or app-lifetime threads dedicated to operations. Inter-thread signaling is much quicker than continual thread creation/termination/destruction and also much safer/easier.
The number of posts concerning problems with thread stopping, terminating, destroying, thread count runaway, OOM failure etc. is legendary. If you can avoid doing it at all, great.
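A hedged C# sketch of that pattern: one app-lifetime worker thread fed through a blocking queue, so work is signalled to an existing thread instead of spun up as a new thread each time (the work-item type here is just an illustrative delegate):

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    class LongLivedWorker
    {
        // A blocking queue lets producers hand work to an already-running thread.
        static readonly BlockingCollection<Action> Work = new BlockingCollection<Action>();

        static void Main()
        {
            // Created once at startup and kept for the life of the app.
            var worker = new Thread(() =>
            {
                foreach (var job in Work.GetConsumingEnumerable())
                {
                    job(); // signalled work: no thread creation or destruction per item
                }
            });
            worker.Start();

            // Producers just enqueue; far cheaper than new Thread(...).Start() per task.
            for (int i = 0; i < 5; i++)
            {
                int n = i;
                Work.Add(() => Console.WriteLine("Handled job " + n + " on the long-lived worker."));
            }

            Work.CompleteAdding(); // no more jobs: the worker drains the queue and exits its loop
            worker.Join();
        }
    }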

Why are OS threads considered expensive?

There are many solutions geared toward implementing "user-space" threads. Be it golang.org goroutines, python's green threads, C#'s async, erlang's processes etc. The idea is to allow concurrent programming even with a single or limited number of threads.
What I don't understand is, why are the OS threads so expensive? As I see it, either way you have to save the stack of the task (OS thread, or userland thread), which is a few tens of kilobytes, and you need a scheduler to move between two tasks.
The OS provides both of these functions for free. Why should OS threads be more expensive than "green" threads? What's the reason for the assumed performance degradation caused by having a dedicated OS thread for each "task"?
I want to amend Tudor's answer, which is a good starting point. There are two main overheads of threads:
1. Starting and stopping them. Involves creating a stack and kernel objects. Involves kernel transitions and global kernel locks.
2. Keeping their stack around.
(1) is only a problem if you are creating and stopping them all the time. This is solved commonly using thread pools. I consider this problem to be practically solved. Scheduling a task on a thread pool usually does not involve a trip to the kernel which makes it very fast. The overhead is on the order of a few interlocked memory operations and a few allocations.
(2) This becomes important only if you have many threads (> 100 or so). In this case async IO is a means to get rid of the threads. I found that if you don't have insane amounts of threads, synchronous IO including blocking is slightly faster than async IO (you read that right: sync IO is faster).
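A quick sketch of point 1 in C#, assuming the built-in thread pool: the task is queued to threads that already exist rather than paying for a new kernel thread per item.

    using System;
    using System.Threading.Tasks;

    class PoolVersusThread
    {
        static void Main()
        {
            // Expensive pattern (commented out): a fresh OS thread per small task,
            // paying for a stack and kernel objects every time.
            // var t = new System.Threading.Thread(() => Crunch(0)); t.Start(); t.Join();

            // Cheap pattern: queue the work to already-running pool threads.
            var tasks = new Task[8];
            for (int i = 0; i < tasks.Length; i++)
            {
                int n = i;
                tasks[n] = Task.Run(() => Crunch(n)); // usually no kernel round-trip per task
            }
            Task.WaitAll(tasks);
        }

        static void Crunch(int n)
        {
            Console.WriteLine("Task " + n + " ran on a reused pool thread.");
        }
    }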
Saving the stack is trivial, no matter what its size - the stack pointer needs to be saved in the Thread Info Block in the kernel (so usually saving most of the registers as well, since they will have been pushed by whatever soft/hard interrupt caused the OS to be entered).
One issue is that a protection level ring-cycle is required to enter the kernel from user space. This is an essential, but annoying, overhead. Then the driver or system call has to do whatever was requested by the interrupt, and then the scheduling/dispatching of threads onto processors. If this results in the preemption of a thread from one process by a thread from another, a load of extra process context has to be swapped as well. Even more overhead is added if the OS decides that a thread running on another processor core than the one handling the interrupt must be preempted - the other core must be hardware-interrupted (this is on top of the hard/soft interrupt that entered the OS in the first place).
So, a scheduling run may be quite a complex operation.
'Green threads' or 'fibers' are, (usually), scheduled from user code. A context-change is much easier and cheaper than an OS interrupt etc. because no Wagnerian ring-cycle is required on every context-change, process-context does not change and the OS thread running the green thread group does not change.
Since something-for-nothing does not exist, there are problems with green threads. They are run by 'real' OS threads. This means that if one 'green' thread in a group run by one OS thread makes an OS call that blocks, all green threads in the group are blocked. This means that simple calls like sleep() have to be 'emulated' by a state machine that yields to other green threads (yes, just like re-implementing the OS). Similarly, any inter-thread signalling.
Also, of course, green threads cannot directly respond to IO signaling, which somewhat defeats the point of having any threads in the first place.
There are many solutions geared toward implementing "user-space" threads. Be it golang.org goroutines, python's green threads, C#'s async, erlang's processes etc. The idea is to allow concurrent programming even with a single or limited number of threads.
It's an abstraction layer. It's easier for many people to grasp this concept and use it more effectively in many scenarios. It's also easier for many machines (assuming a good abstraction), since the model moves from width to pull in many cases. With pthreads (as an example), you have all the control. With other threading models, the idea is to reuse threads, for the process of creating a concurrent task to be inexpensive, and to use a completely different threading model. It's far easier to digest this model; there's less to learn and measure, and the results are generally good.
What I don't understand is, why are the OS threads so expensive? As I see it, either way you have to save the stack of the task (OS thread, or userland thread), which is a few tens of kilobytes, and you need a scheduler to move between two tasks.
Creating a thread is expensive, and the stack requires memory. As well, if your process is using many threads, then context switching can kill performance. So lightweight threading models became useful for a number of reasons. Creating an OS thread became a good solution for medium to large tasks, ideally in low numbers. That's restrictive, and quite time consuming to maintain.
A task/thread pool/userland thread does not need to worry about much of the context switching or thread creation. It's often "reuse the resource when it becomes available, if it's not ready now -- also, determine the number of active threads for this machine".
More commonly (IMO), OS-level threads are expensive because they are not used correctly by the engineers - either there are too many and there is a ton of context switching, there is competition for the same set of resources, or the tasks are too small. It takes much more time to understand how to use OS threads correctly, and how to apply that best to the context of a program's execution.
The OS provides both of these functions for free.
They're available, but they are not free. They are complex, and very important to good performance. When you create an OS thread, it's given time 'soon' -- all the process' time is divided among the threads. That's not the common case with user threads. The task is often enqueued when the resource is not available. This reduces context switching, memory, and the total number of threads which must be created. When the task exits, the thread is given another.
Consider this analogy of time distribution:
Assume you are at a casino. There are a number of people who want cards.
You have a fixed number of dealers. There are fewer dealers than people who want cards.
There are not always enough cards for every person at any given time.
People need all cards to complete their game/hand. They return their cards to the dealer when their game/hand is complete.
How would you ask the dealers to distribute cards?
Under the OS scheduler, that would be based on (thread) priority. Every person would be given one card at a time (CPU time), and priority would be evaluated continually.
The people represent the task or thread's work. The cards represent time and resources. The dealers represent threads and resources.
How would you deal fastest if there were 2 dealers and 3 people? and if there were 5 dealers and 500 people? How could you minimize running out of cards to deal? With threads, adding cards and adding dealers is not a solution you can deliver 'on demand'. Adding CPUs is equivalent to adding dealers. Adding threads is equivalent to dealers dealing cards to more people at a time (increases context switching). There are a number of strategies to deal cards more quickly, especially after you eliminate the people's need for cards in a certain amount of time. Would it not be faster to go to a table and deal to a person or people until their game is complete if the dealer to people ratio were 1/50? Compare this to visiting every table based on priority, and coordinating visitation among all dealers (the OS approach). That's not to imply the OS is stupid -- it implies that creating an OS thread is an engineer adding more people and more tables, potentially more than the dealers can reasonably handle. Fortunately, the constraints may be lifted in many cases by using other multithreading models and higher abstractions.
Why should OS threads be more expensive than "green" threads? What's the reason for the assumed performance degradation caused by having a dedicated OS thread for each "task"?
If you developed a performance critical low level threading library (e.g. upon pthreads), you would recognize the importance of reuse (and implement it in your library as a model available for users). From that angle, the importance of higher level multithreading models is a simple and obvious solution/optimization based on real world usage as well as the ideal that the entry bar for adopting and effectively utilizing multithreading can be lowered.
It's not that they are expensive -- the lightweight threads' model and pool is a better solution for many problems, and a more appropriate abstraction for engineers who do not understand threads well. The complexity of multithreading is greatly simplified (and often more performant in real world usage) under this model. With OS threads, you do have more control, but several more considerations must be made to use them as effectively as possible -- heeding these considerations can dramatically reflow a program's execution/implementation. With higher level abstractions, many of these complexities are minimized by completely altering the flow of task execution (width vs pull).
The problem with starting kernel threads for each small task is that it incurs a non-negligible overhead to start and stop, coupled with the stack size it needs.
This is the first important point: thread pools exist so that you can recycle threads, in order to avoid wasting time starting them as well as wasting memory for their stacks.
Secondly, if you fire off threads to do asynchronous I/O, they will spend most of their time blocked waiting for the I/O to complete, thus effectively not doing any work and wasting memory. A much better option is to have a single worker handle multiple async calls (through some under-the-hood scheduling technique, such as multiplexing), thus again saving memory and time.
One thing that makes "green" threads faster than kernel threads is that they are user-space objects, managed by a virtual machine. Starting them is a user-space call, while starting a kernel thread requires a kernel-space call, which is much slower.
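A hedged sketch of the "one worker handles many async calls" idea, assuming .NET's async I/O (the example.com URLs are placeholders): all three downloads are in flight at once, and no thread sits blocked per request.

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class AsyncIoSketch
    {
        static async Task Main()
        {
            // Placeholder endpoints, purely illustrative.
            string[] urls =
            {
                "https://example.com/a",
                "https://example.com/b",
                "https://example.com/c"
            };

            using (var client = new HttpClient())
            {
                // Threads are only borrowed briefly to run continuations;
                // nothing blocks while the network I/O is pending.
                Task<string>[] downloads = Array.ConvertAll(urls, u => client.GetStringAsync(u));
                string[] bodies = await Task.WhenAll(downloads);

                foreach (var body in bodies)
                    Console.WriteLine("Received " + body.Length + " characters.");
            }
        }
    }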
Someone at Google presents an interesting approach.
According to him, kernel-mode switching itself is not the bottleneck; the core cost lies in the SMP scheduler. He claims that M:N scheduling assisted by the kernel would not be expensive, and this makes me expect general M:N threading to become available in every language.
Because of the OS. Imagine that instead of asking you to clean the house, your grandmother has to call the social service, which does some paperwork and a week later assigns a social worker to help her. The worker can be called off at any time and replaced with another one, which again takes several days.
That's pretty ineffective and slow, huh?
In this metaphor you are a userland coroutine scheduler, the social service is an OS with its kernel-level thread scheduler, and a social worker is a fully-fledged thread.
I think the two things are at different levels.
A thread or process is an instance of a program being executed, and it carries a lot of state: an execution stack, open files, signals, processor status, and many other things.
A greenlet is different: it runs inside a VM and supplies a lightweight thread. Many such threads provide pseudo-concurrency (typically within a single OS-level thread, or a few of them), and they often favour a lock-free style based on passing data rather than sharing it.
So the two things have different focuses, and their weights are different.
In my mind, greenlets should be handled in the VM, not the OS.

Why would I have to use multiple threads for one processing task if I can turn up the priority of the program?

Earlier I asked about processing a data stream, and someone suggested putting the data in a queue and processing it on a different thread. If that was too slow, I should use multiple threads.
However, I'm using a system that has one core.
So my question is: why not raise the priority of my app, so it gets more CPU time from the OS?
I'm writing a server-based app and it will be the only big thing running there.
What would be the pros and cons of raising the priority? :)
If you have only one core, then the only way that multi-threading can help you is if chunks of that work depend on something other than CPU, so one thread can get some work done while another is waiting for data from a disk or network connection.
If your application has a GUI, then it can benefit from multi-threading in that while it would be no quicker to do the processing (slower in fact, though probably negligibly so if the task is very long), it can still react to user input in the meantime.
If you have two or more cores, then you can also gain in CPU-bound operations though doing so varies from trivial to impossible depending on just what that operation is. This is irrelevant to your case, but worth considering generally if code you write could later be run on a multi-core system.
Upping the priority is probably a bad idea though, especially if you have only one core (one advantage of multi-core systems is that people who up priorities can't do as much damage).
Every thread has an effective priority, which is a function of both its process's priority and its priority within that process. A low-priority thread in a high-priority process trumps a high-priority thread in a low-priority process.
The scheduler doles out CPU slices in a round-robin fashion to the highest priority threads that have work to do. If there are CPUs left over (which in your case means if there are zero threads at that priority that need to run), then it doles out slices to the next lowest priority, and so on.
Most of the time, most threads aren't doing much anyway, which can be seen from the fact that most of the time CPU usage on most systems is below the 100% mark (hyperthreading skews this; the internal scheduling within the cores means a hyperthreaded system can be fully saturated yet seem to be running at as little as 70%). Anyway, generally stuff gets done, and a thread that suddenly has lots to do will do so at normal priority in pretty much the same time it would at a higher one.
However, while the benefit to that busy thread of higher priority is generally little or nothing, the detriment to everything else is great. Since it's the only thread that gets any CPU time, all other threads are stuck. All other processes therefore hang for a while. Eventually the scheduler notices that they've all been waiting for around 3 seconds, and fixes this by boosting them all to highest priority and giving them larger slices than normal. Now we have a burst of activity as threads that got no time are all suddenly highest-priority threads that all want CPU time. There's a spurt of every thread except the high-priority one running, which stops the system from keeling over, though there are likely still a lot of applications showing "Not Responding" in their title bars. It's far from ideal, but it is an effective way to deal with a thread of higher than usual priority grabbing the core for so long.
The threads gradually drop down in priority, and eventually we're back to the situation where the single higher priority thread is the only one that can work.
For extra fun, if our high priority thread in any way depended upon services provided by the lower priority threads, it would have ended up being stuck waiting on them. Hopefully in a way that made it block and stopped itself from doing any damage, but probably not.
In all, thread priorities are to be approached with great caution, and process priorities even more so. They're only really valid if they'll yield quickly and are either essential to the workings of other threads (e.g. some OS processes will be done at a higher priority, finaliser threads in .NET will be higher than the rest of the process, etc) or if sub-millisecond delays can mess things up (some intensive media work requires this).
If you have multiple cores/processors in your system, upping the priority of a single threaded program will not improve your performance by much, because the other cores would still be unused.
The only way to take advantage of multiple processing units is to write your program using multiple threads/processes.
Having said this, setting your multithreaded application to very high priority may lead to some performance improvement, but I really never saw it to be significant, at least in my own tests.
Edit: I see now that you are using only one core. Basically your program will be able to run more often on the CPU than the rest of the processes that are of lower priority. This may bring you a marginal improvement, but not a dramatic one. Since we cannot know what other applications are running at the same time on your system, the golden rule here is to try it yourself with various priority levels and see what happens. It's the only valid way to see if things will be faster or not.
It all depends on why the data processing is slow.
If the data processing is slow because it is a genuinely cpu intensive operation then splitting it out into multiple threads on a single core system is not going to get you any benefit. In this case increasing the task priority would provide some benefit, assuming that there is (user) cpu time being used by other processes.
However, if the data processing operation is slow because of some non-cpu restriction (eg. if it is I/O bound, or relying on another process), then:
Increasing the task priority is going to have negligible impact. Task priority won't affect I/O times and if there is a dependency on another process on the system you may actually harm performance.
Splitting the data processing out into multiple threads can allow the cpu intensive areas to continue processing while waiting for the non-cpu intensive (eg. I/O) areas to complete.
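A rough sketch of that split, assuming a hypothetical input file named data.bin: a reader thread spends most of its time blocked on I/O while the main thread keeps the single core busy with the processing.

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading;

    class OverlapIoAndCpu
    {
        static void Main()
        {
            // Bounded queue between the I/O thread and the CPU-bound processing loop.
            var chunks = new BlockingCollection<byte[]>(boundedCapacity: 8);

            var reader = new Thread(() =>
            {
                using (var stream = File.OpenRead("data.bin")) // hypothetical input
                {
                    var buffer = new byte[64 * 1024];
                    int read;
                    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        var chunk = new byte[read];
                        Array.Copy(buffer, chunk, read);
                        chunks.Add(chunk); // blocks if the processor falls behind
                    }
                }
                chunks.CompleteAdding();
            });
            reader.Start();

            // Meanwhile the main thread does the CPU-intensive part.
            long checksum = 0;
            foreach (var chunk in chunks.GetConsumingEnumerable())
                foreach (var b in chunk)
                    checksum += b; // placeholder for the real processing

            Console.WriteLine("Processed, checksum = " + checksum);
        }
    }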
Increasing the priority of a single-threaded process just gives you more (or bigger) time slices on the one core the process is running on. The core can still only do one thing at a time.
If you spin off a thread to handle the data processing, it can run on a different processor core (assuming a multi-core system), and it and your main thread are actually executing at the same time. Much more efficient.
If you use only one thread your server app will only be able to service one request at a time, no matter what its priority. If you use multiple threads you could service many at the same time.

These days, what are good reasons for setting thread affinity rather than leaving it to the OS?

Searching answers here for "thread affinity", I see a lot of interest in doing it, but little justification for it save possibly getting stable QueryPerformanceCounter results.
Assuming a modern OS and a modern 2-4 socket workstation/server class machine with modern 4-6 core CPUs, what good reasons would anyone have for thinking they know better than their OS's scheduler? Are there any real-world situations where taking more control of thread affinity is the right thing to do? What sort of performance benefits can be demonstrated?
The last time I saw a really good case for setting thread affinity somewhere (as in, it was backed up by concrete results showing genuine and significant improvements in system performance), it was some obscure thing to do with Win2K device drivers. But I haven't seen anything like that in years so when someone tells me they need to control thread affinity (but not why) these days I am deeply sceptical... but curious to be shown otherwise.
The primary reason is if you have something that depends heavily upon caching. The OS scheduler doesn't necessarily take that into account to the degree you might like.
I use it to assign threads to cores; for example in a simulation you do the physics entirely on one core, and allow the rest of the computation to be executed on another one. It makes sense to be able to control this, if you're on a tight environment where you know the hardware.
Of course, configuring this needs to be done per system, so by default I let the OS decide the cores on which to run, but keep the option of restricting core usage.
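For what it's worth, a minimal C# sketch of restricting core usage, assuming Windows (the two-core mask is just an illustrative choice and would need adjusting per machine):

    using System;
    using System.Diagnostics;

    class AffinitySketch
    {
        static void Main()
        {
            // Restrict the whole process to cores 0 and 1 (bitmask 0x3).
            Process current = Process.GetCurrentProcess();
            current.ProcessorAffinity = (IntPtr)0x3;

            Console.WriteLine("Now pinned to affinity mask " + current.ProcessorAffinity + ".");

            // Per-thread pinning is also possible (ProcessThread.ProcessorAffinity, or
            // SetThreadAffinityMask via P/Invoke), but mapping managed threads to OS threads
            // takes extra care, so this sketch stays at the process level.
        }
    }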
In the OS kernel and sometimes in kernel mode drivers you need to perform the same action on every CPU (e.g. update a system register). You can do that in a loop in a single thread, changing the affinity on each iteration.
For desktops it's quite unnecessary.
But I can see some applications where it would help. For example, a core's cache stays warm if the code running on it doesn't keep changing.
Another possibility is you have a critical task - you give it an entire CPU, and the other tasks use the rest of the CPUs.
Or the opposite: You have some low priority tasks, you put them all on one CPU, then leave the others free for more important tasks (using process priority will give you most of this benefit without having affinity, but I can imagine some memory heavy cases where it wouldn't).
I would agree it's best to leave this to the OS to figure out in most situations. However, the most common reason I have seen for setting thread affinity is when you need good cache behaviour. In multiple-CPU systems, when a particular CPU caches something for itself and the same data is cached by another CPU, the copy can get invalidated on the other CPU. So if a particular thread keeps changing the CPU on which it executes, the cache hit rate will be much lower. In this case the programmer can be a better judge of the CPU affinities.
I also think Ariel's point above about making sure a critical task constantly gets a CPU, without throttling other low-priority processes, makes sense.

C#: How to maximize the chance that a particular section of code will have no context switch?

I have a time-critical piece of code in my app. I made the thread which runs it Highest priority - that's the most I could do.
Are there any suggestions on how to ensure the code running on this thread is interrupted as rarely as possible (fewer context switches)?
The code is not complicated. I replaced all the method calls with inlined code and I don't use anything high level (like LINQ). Most of the operations are arithmetic. There is only one comparison of strings (and I am thinking of ways to get rid of it). Half of the maths is with ints and half with doubles.
The code is x86 .NET 4 C#. It runs on a single Xeon X3450 under Windows Server 2008 R2, on a dedicated app server.
(Unfortunately the data is coming from a 3rd-party API which doesn't support x64 (hate it!))
I'd appreciate grown-up discussion with experienced developers.
P.S. The server has no paging file, so hard page faults won't happen either (no unwanted IO operations).
The only thing you need to worry about in terms of context switches is blocking your thread. So there should be no problem with using LINQ (that is, LINQ-to-objects; obviously LINQ-to-SQL or whatever would involve blocking!). Any sort of arithmetic or calling methods and so on will also not block the thread and so have no impact on context switches.
The other thing that affects context switching is, as you noted, priority. But not just thread priority, also your process's priority. You can use SetPriorityClass to increase your process's priority to ABOVE_NORMAL_PRIORITY_CLASS (I wouldn't bother putting it higher than that) and then set your thread's priority to Above Normal as well.
However, in general, priorities are really only useful when it's a matter of timing (that is, making sure your process responds to external input (network, user input, disk I/O) as fast as possible). It will actually have very little impact on your thread's actual throughput, unless you have other processes that are also CPU-bound running at the same time. But if that's the case, then fiddling with priorities is not going to be a viable long-term solution anyway. This is because you'll find that by setting one of the processes to a higher priority, it'll completely starve the other processes and they'll never run.
So anyway, I would carefully consider things before adjusting thread and process priorities. And, as always, test, test, test!
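A small C# sketch of that combination (hedged: the exact levels are a judgment call, and SetPriorityClass corresponds to Process.PriorityClass in .NET):

    using System;
    using System.Diagnostics;
    using System.Threading;

    class PriorityBoostSketch
    {
        static void Main()
        {
            // Raise the process one notch; anything higher tends to starve the rest of the machine.
            Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.AboveNormal;

            var critical = new Thread(TimeCriticalLoop)
            {
                Priority = ThreadPriority.AboveNormal // combined with the process class above
            };
            critical.Start();
            critical.Join();
        }

        static void TimeCriticalLoop()
        {
            // Placeholder for the arithmetic-heavy, non-blocking work described in the question.
            double acc = 0;
            for (int i = 1; i < 10000000; i++) acc += 1.0 / i;
            Console.WriteLine("Done: " + acc);
        }
    }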
If you make that unmanaged WINAPI code instead, the SetThreadPriority function also supports a THREAD_PRIORITY_TIME_CRITICAL (higher than THREAD_PRIORITY_HIGHEST).
It's also worth boosting the priority of the process in which the thread is running (actual priority depends on a combination of thread and process priority).
You should also avoid making I/O calls on the thread (which could block). Taking it to a perhaps-ridiculous extreme you could also avoid making I/O calls on other threads (which could temporarily boost the priority of those threads).
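If you did want the unmanaged route from C#, a rough P/Invoke sketch (THREAD_PRIORITY_TIME_CRITICAL is 15 in the Windows headers; note it only affects the calling OS thread, so this assumes the managed thread stays on that OS thread):

    using System;
    using System.Runtime.InteropServices;

    class TimeCriticalPInvoke
    {
        [DllImport("kernel32.dll")]
        static extern IntPtr GetCurrentThread(); // pseudo-handle for the calling thread

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool SetThreadPriority(IntPtr hThread, int nPriority);

        const int THREAD_PRIORITY_TIME_CRITICAL = 15;

        static void Main()
        {
            // Bump only the calling OS thread; the rest of the process keeps its normal priority.
            if (!SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL))
                Console.WriteLine("SetThreadPriority failed: " + Marshal.GetLastWin32Error());

            // ... time-critical, non-blocking work goes here ...
        }
    }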
