Linux 2.6.31 Scheduler and Multithreaded Jobs - linux

I run massively parallel scientific computing jobs on a shared Linux computer with 24 cores. Most of the time my jobs are capable of scaling to 24 cores when nothing else is running on this computer. However, it seems like when even one single-threaded job that isn't mine is running, my 24-thread jobs (which I set for high nice values) only manage to get ~1800% CPU (using Linux notation). Meanwhile, about 500% of the CPU cycles (again, using Linux notation) are idle. Can anyone explain this behavior and what I can do about it to get all of the 23 cores that aren't being used by someone else?
Notes:
In case it's relevant, I have observed this on slightly different kernel versions, though I can't remember which off the top of my head.
The CPU architecture is x64. Is it at all possible that the fact that my 24-core jobs are 32-bit and the other jobs I'm competing w/ are 64-bit is relevant?
Edit: One thing I just noticed is that going up to 30 threads seems to alleviate the problem to some degree. It gets me up to ~2100% CPU.

It is possible that this is caused by the scheduler trying to keep each of your tasks running on the same CPU that it was previously running on (it does this because the task has likely brought its working set into that CPU's cache - it's "cache hot").
Here's a few ideas you can try:
Run twice as many threads as you have cores;
Run one or two less threads than you have cores;
Reduce the value of /proc/sys/kernel/sched_migration_cost (perhaps down to zero);
Reduce the value of /proc/sys/kernel/sched_domain/.../imbalance_pct down closer to 100.

Do your threads have to synchronize? If so, you might have the following problem:
Assume you have a 4-cpu system, and a 4-thread job. When run alone, threads fan out to use all 4 cores and total usage is near perfect (We'll call this 400%).
If you add one single-threaded interfering job, the scheduler might place 2 of your threads on the same cpu. This means that 2 of your threads are now running at effectively half their normal pace (dramatic simplification), and if your threads need to synchronize periodically, the progress of your job can be limited by the slowest thread, which in this case is running at half normal speed. You would see utilization of only 200% (from your job running 4x 50%) plus 100% (the interfering job) = 300%.
Similarly, if you assume that the interfering job only uses 25% of one processor's time, you might see one of your threads and the interferer on the same CPU. In that case the slowest thread is running at 3/4 normal speed, causing the total utilization to be 300% (4x 75%) + 25% = 325%. Play with these numbers and it's not hard to come up with something similar to what you're seeing.
If that's the problem, you can certainly play with priorities to give unwelcome tasks only tiny fractions of available CPU (I'm assuming I/O delays aren't a factor). Or, as you've found, try to increase threads so that each CPU has, say, 2 threads, minus a few to allow for system tasks. In this way a 24 core system might run best with, say, 46 threads (which always leaves half of 2 cores' time available for system tasks).

Do your threads communicates with each other?
Try to manually bind every thread to cpu, with sched_setaffinity or pthread_setaffinity_np. Scheduler can be rather dumb when working with a lot of relating threads.

It might be worthwhile to use mpstat (part of the sysstat package) to figure out if you have entire CPUs sitting idle while others are fully utilized. It should give you a more detailed view of the utilization than top or vmstat: run mpstat -P ALL to see 1 line per CPU.
As an experiment, you might try setting the CPU affinity on each thread such that each is bound to an individual CPU; this would let you see what performance is like if you don't let the kernel scheduler decide which CPU a task is scheduled on. It's not a good permanent solution, but if it helps a lot it gives you an idea of where the scheduler is falling short.

Do you think the bottleneck is in your application or the kernel's scheduling algorithm? Before you start tweaking scheduling parameters, I suggest you try running a simple multi-threaded application to see if it exhibits the same behavior as your application.
// COMPILE WITH: gcc threads.c -lpthread -o thread
#include <pthread.h>
#define NUM_CORES 24
void* loop_forever(void* argument) {
int a;
while(1) a++;
}
void main() {
int i;
pthread_t threads[NUM_CORES];
for (i = 0; i < NUM_CORES; i++)
pthread_create(&threads[i], 0, loop_forever, 0);
for (i = 0; i < NUM_CORES; i++)
pthread_join(threads[i], 0);
}

Related

Matlab: -maxNumCompThreads, hyperthreading, and parpool

I'm running Matlab R2014a on a node in a Linux cluster that has 20 cores and hyperthreading enabled. I know this has been discussed before, but I'm looking for some clarification. Here's what my understanding is of the threads vs. cores issue in Matlab:
Matlab has inherent multithreading capabilities, and will utilize extra cores on a multicore machine.
Matlab runs its threads in such a way that putting multiple Matlab threads on the same core (i.e. hyperthreading) isn't useful. So by default, the maximum number of threads that Matlab will create is the number of cores on your system.
When using parpool(), regardless of the number of workers you create, each worker will use only one physical core, as mentioned in this thread.
However, I've also read that using the (deprecated) function maxNumCompThreads(), you can either decrease or increase the number of threads that Matlab or one of the workers will generate. This can be useful in several scenarios:
You want to utilize Matlab's implicit multithreading capabilities to run some code on a cluster node without allocating the entire node. It would be nice if there was some other way to do this if maxNumCompThreads ever gets removed.
You want to do a parameter sweep but have less parameters than the number of cores on your machine. In this case you might want to increase the number of threads per worker so that all of your cores are utilized. This was suggested recently in this thread. However, in my experience, while the individual workers seem quite happy to use maxNumCompThreads() to increase their thread count, inspecting the actual CPU usage using the "top" command suggests that it doesn't have any effect, i.e. each worker still only gets to use one core. It's possible that what is happening is that the individual Matlab processes spawned by the parpool are run with the argument -singleCompThread. I've confirmed that if the parent Matlab process is run with -singleCompThread, the command maxNumCompThreads(n), where n > 1 throws an error due to the fact that Matlab is running in single threaded mode. So the result seems to be that (at least in 2014a), you can't increase the number of computational threads on the parallel pool workers. Related to this is that I can't seem to get the Parent matlab process to to start more threads than there are cores, even though the computer itself has hyperthreading enabled. Again, it will happily run maxNumCompThreads(n), where n > # physical cores, but the fact that top is showing CPU utilization to be 50% suggests otherwise. So what is happening, or what am I misunderstanding?
Edit: to lay out my questions more explicitly:
Within a parfor loop, why doesn't setting maxNumCompThreads(n), when n > 1 seem to work? If it's because the worker process is started with -singleCompThread, why doesn't maxNumCompThreads() return an error like it does in the parent process started with -singleCompThread?
In the parent process, why doesn't using maxNumCompThreads(n), where n > # physical cores, do anything?
Note: I posted this previously on Matlab answers, but haven't received any feedback.
Edit2: It looks like the problem in (1) was an issue with the test code I was using.
That's quite a long question, but I think the straightforward answer is that yes, as I understand it, MATLAB workers are started with -singleCompThread.
First, a few quick tests to confirm our understanding:
> matlab.exe -singleCompThread
>> warning('off', 'MATLAB:maxNumCompThreads:Deprecated')
>> maxNumCompThreads
ans =
1
>> maxNumCompThreads(2)
Error using feature
MATLAB has computational multithreading disabled.
To enable multithreading please restart MATLAB without singleCompThread option.
Error in maxNumCompThreadsHelper (line 37)
Error in maxNumCompThreads (line 27)
lastn = maxNumCompThreadsHelper(varargin{:});
As indicated, when MATLAB is started with the -singleCompThread option, we cannot override it using maxNumCompThreads.
> matlab.exe
>> parpool(2); % local pool
>> spmd, n = maxNumCompThreads, end
Lab 1:
n =
1
Lab 2:
n =
1
We can see that each worker is by default limited to a single computation thread. This is a good thing because we want to avoid over-subscription and unnecessary context switches, which occurs when the number of threads trying to run exceeds the number of available physical/logical cores. So in theory, the best way to maximize CPU utilization is to start as many single-threaded workers as we have cores.
No by looking at the local worker processes running in background, we see that each is launched as:
matlab.exe -dmlworker -noFigureWindows [...]
I believe the undocumented -dmlworker option does something similar to -singleCompThread, but probably a bit different. For one, I was able to override it using maxNumCompThreads(2) without it throwing an error like before..
Remember that even if a MATLAB session is running in single-threaded computation mode, it doesn't mean the computational thread is exclusively restricted to one CPU core only (the thread could jump around between cores assigned by the OS scheduler). You'll have to set the affinity of the worker processes if you want to control that..
So I did some profiling using Intel VTune Amplifier. Basically I ran some linear algebra code, and performed hotspots analysis by attaching to the MATLAB process and filtering on the mkl.dll module (this is the Intel MKL library that MATLAB uses as an optimized BLAS/LAPACK implementation). Here are my results:
- Serial mode
I used the following code: eig(rand(500));
Starting MATLAB normally, computation spawns 4 threads (that's the default automatic value chosen seeing that I have a quad-core i7 Intel CPU).
starting MATLAB normally, but calling maxNumCompThreads(1) before the computation. As expected, only 1 thread is used by the computation.
starting MATLAB with -singleCompThread option, again only 1 thread is used.
- Parallel mode (parpool)
I used the following code: parpool(2); spmd, eig(rand(500)); end. In both cases below, MATLAB is started normally
when running code on the workers with the defaults settings, each worker is limited to one computation thread
when I override the settings on the workers using maxNumCompThreads(2), then each worker will use 2 threads
Here is a screenshot of what VTune reports:
Hope that answers your questions :)
I was wrong about maxNumCompThreads not working on parpool workers. It looks like the problem was that the code I was using:
parfor j = 1:2
tic
maxNumCompThreads(2);
workersCompThreads(j) = maxNumCompThreads;
i = 1;
while toc < 200
a = randn(10^i)*randn(10^i);
i = i + 1;
end
end
used so much memory by the time I checked CPU utilization that the bottleneck was I/O and the extra threads were already shut down. When I did the following:
parfor j = 1:2
tic
maxNumCompThreads(2);
workersCompThreads(j) = maxNumCompThreads;
i = 4;
while toc < 200
a = randn(10^i)*randn(10^i);
end
end
The extra threads started and stayed running.
As for the second issue, I got a confirmation from the Mathworks that the parent Matlab process won't start more threads than the number of physical cores, even if you explicitly raise the limit beyond that. So in the documentation, the sentence:
"Currently, the maximum number of computational threads is equal to the number of computational cores on your machine."
should say:
"Currently, the maximum number of computational threads is equal to the number of physical cores on your machine."

How can I utilize multithread CPU most in Matlab?

I just bought the Matlab Parallel Computing toolbox.
The command matlabpool open opens parallel workers with the number of the cores in my CPU.
But each of my CPU core has two threads. According to Windows Task Manager, each worker can only use half performance of one CPU core, which seems could be interpreted as one worker = one thread = "half core".
Therefore, after all workers opened, still half of the total power of CPU could be utilized.
Is there any other command could help with that?
By default, the local cluster type for matlabpool considers only "real" cores when choosing the default number of workers to launch. This is because for MATLAB workloads, hyperthreading often does not provide much benefit. However, this value is only a default - you can edit the cluster type and run anything up to 12 local workers.
You need to understand HyperThreading to answer this question.
Matlab launches a worker thread for every CPU. Suppose you now use a directive like parfor to distribute computation over multiple threads. Every thread will now be crunching numbers happily.
Suppose you are doing a sum of a large vector of numbers. What actually happens is the following:
sum = sum + a[0]
array a is not in my CPU cache yet
I will fetch a small part of a from main memory and put it in the CPU cache
sum = sum + a[1]
sum = sum + a[2]
...
During the fetch of a, the CPU stalls, waiting for the system memory. This is called a pipeline bubble, and it is not good for performance. Sometimes, a part of the array a was swapped out to the hard drive. The operating system will need to access the drive to put that part into main memory, after which it will be transferred to the CPU cache. When this happens, your operating system will not let the CPU wait for +200 ms. It will use that time to execute another task instead (like the backup running on your system, or refreshing your screen, or ...).
Switching tasks on a CPU results in a performance penalty. To switch to a different task, the operating system must save the CPU registers in main memory, and load the CPU registers of the other task back into the CPU first. This takes time.
With HyperThreading, the number of registers per CPU is doubled. This means that two processes can 'occupy' the CPU. Only one can be executed, but during a stall, the operating system can switch to the second process without any performance penalty.
Forget how Microsoft Windows reports CPU usage. It's wrong. CPU usage is a lot more complicated than only a simple 47%. The real question is rather: should matlab register two threads per core, or only one?
Arguments pro:
During a stall, the CPU can quickly switch to the other thread and continue executing.
Arguments contra:
There are more threads, and the problem is divided in smaller pieces. This may actually reduce performance, as you need to put more pieces together to get the final result.
A context switch will still 'poison' the L1 and L2 cache, loading in pieces of memory that are of no use to the other thread on the CPU.
If there are no stalls, you have more overhead.
On a desktop, the operating system will also want to run: redrawing the screen, moving your mouse, etc. When all logical cpu's are in use, the operating system is required to do an actual (costly) context switch.
Your problem will only be complete if all pieces of the problem have been calculated. Using all the cores / threads increases the risk of one thread taking more time.
My guess is that the Matlab developers considered the arguments contra to be more important than the arguments pro. My own performance tests certainly suggest that there is little performance gain from HyperThreading for cpu-intensive calculations.

Reserve a processor for only one process (with already the max priority)

I have used this piece of code for trying to set the -same- high priority while executing a program :
CPU_SET(CPU_NUM, &cmask);
if (pthread_setaffinity_np(pid, sizeof(cmask), &cmask) < 0) {
LOG_ERROR("Could not set cpu affinity to core %d", CPU_NUM); goto exit_err;
}
errno = 0;
setpriority(PRIO_PROCESS, 0, -19);
The purpose of the program is to perform a computation for a constant bunch (every 80 bytes) of input.
But when executing the program, the time elapsed for this computation varies from 30% to 150%.
When plotting the computation time values, I was waiting for a -quite- smooth graph were the deviation would be something like 10%-15%, but instead there is more than 40% !!!
So I would like to ask, if the CPU is interfering the execution of the program with an other, and if so could I force the CPU to run ONLY a specific program?
Thanks in advance !
P.S. I haven't found a thread that could answer to my question yet...
The most relevant is :) :
Linux reserve a processor for a group of processes (dynamically)
To try and reduce jitter some of the things you can do are:
Ensure sure you've turned off CPU scaling.
Set scheduling policy to SCHED_FIFO for that program.
Try and pin your process to a single processor if you have more than one.
Try and run as few other processes at the same time while you're measuring your program.
Don't trigger sources of time related non-determinism (e.g. disk I/O).
It is probably useful to skim through How to build a Linux RT application because accurate measurement is the same domain - it's possible to be more extreme though:
Ensure your program doesn't use dynamic memory allocations.
Use a realtime Linux kernel.
Prevent Linux from scheduling non-specific userspace programs on a given CPU.
Even disable timer ticks on a given CPU (CONFIG_TASK_ISOLATION).
Modern desktop/server processors are so complicated that trying to precisely measure a single program's execution time with low variance is extremely hard. Things like the various caches and pipeline starting states can perturb execution times in any number of ways so there are always going to be limits.

Considerate, dynamic CPU load management

I am writing a CPU-intensive image processing library. To make best use of available CPU, I can detect the total number of cores on my machine and have my library run with that number of threads. When my library to allocate one thread for each core it performs optimally using 100% available processor time.
The above approach works fine when mine is the only CPU-heavy process running. If another CPU-intensive process is running, or even another instance of my own code, then the OS allocates us only a fraction of the available cores and my library then has too many threads running which is both inefficient and inconsiderate to other processes.
So I would like to find a way to determine the "fair share" number of threads to run given a specific load. For example, if two instances of my process are running on an 8-core machine, each would run with 4 threads. Each would need a way to adapt thread count dynamically according to fluctuations in machine load.
So, my question:
Is there any OS feature or third-party library which allows my process to adapt thread count dynamically to use its fair share of the CPU?
My focus is Windows but interested in non-Windows solutions too.
Edit: to be clear, this is about optimization. I am trying to achieve peak efficiency by running the optimal number of threads appropriate to my fair share of the CPU.
In my eyes, the application shouldnt decide how many threads to spawn. This is an information, that the caller should know. In linux, the "-j" or "--jobs" parameter is widely used (Default: 1).
What about also setting the priority of the processing tasks. So if the caller knows, the processing is mission-critical, he can increase the prio (with the knowledge of maybe blocking the (whole) system). Your processing lib would never know, how important the processing of this image would be.
If the caller doesnt care, then the default low-prio is used, which shouldnt affect the rest of the system. If it does, you should look to what is exactly blocking the system (maybe writing image files to the hdd, reduce ram size to prevent swapping, ...). If you figured out that, you can optimize exactly that point.
If you start the processing with (cpu-cores)*2 on low till normal priority, your system should be useable. No one would expect, that this will kill the system.
Just my 2 cents.
Actually it's not a problem of multithreading but a problem of executing many programs simultaneously. This is hard on most PC's operating systems because it conflicts to the idea of time-sharing.
Let's assume some workflow.
Suppose we have 8 cores and we create 8 threads to feed them; ok, that's easy. Next we choose to monitor core loading to summary how many tasks running on a certain core; well, that needs some statistical assumptions, e.g on Linux you can get a 1/5/15-mins load average chart, but that could be done. The statistical chart is clear and now we get a plot about how many CPU-bound processes are running, say, seeing other 3 CPU-intensive processes.
Then we come to the point: we have to make 3 redundant threads to sleep, but which 3?
Usually we choose 3 threads arbitrarily because the scheduler arranges the other 8 CPU-bound threads automatically. In some cases, we explicitly put threads on high load cores to sleep, assign other threads to certain low load cores, and let the scheduler do the rest things. Most scheduling policies also try to "keep CPU cache hot", which means they tend to forbid transferring threads between cores. We reasonably expect our CPU-intensive threads can utilize the core cache since other processes are scheduled to the 3 crowded cores. Everything looks good.
However this could fail in tightly synchronized computation. In this scenario we need to run our 5 threads simultaneously. Simultaneity here means the 5 threads have to gain CPU and run at almost the same time. I don't know if there's any scheduler on PC could do this for us. In most low-load cases, things still work fine because costs to wait for simultaneity is trivial. But when the load of a core is high and even 1 of our 5 threads is disturbed, occasionally we'll find we spend many life cycles in waiting.
It may help to schedule your program as a real-time program but it's not a perfect solution. Statistically it leads to a wider time window for simultaneity when it gains more CPU control priority. I have to say, it's not guaranteed.

Multicore + Hyperthreading - how are threads distributed?

I was reading a review of the new Intel Atom 330, where they noted that Task Manager shows 4 cores - two physical cores, plus two more simulated by Hyperthreading.
Suppose you have a program with two threads. Suppose also that these are the only threads doing any work on the PC, everything else is idle. What is the probability that the OS will put both threads on the same core? This has huge implications for program throughput.
If the answer is anything other than 0%, are there any mitigation strategies other than creating more threads?
I expect there will be different answers for Windows, Linux, and Mac OS X.
Using sk's answer as Google fodder, then following the links, I found the GetLogicalProcessorInformation function in Windows. It speaks of "logical processors that share resources. An example of this type of resource sharing would be hyperthreading scenarios." This implies that jalf is correct, but it's not quite a definitive answer.
Linux has quite a sophisticated thread scheduler which is HT aware. Some of its strategies include:
Passive Loadbalancing: If a physical CPU is running more than one task the scheduler will attempt to run any new tasks on a second physical processor.
Active Loadbalancing: If there are 3 tasks, 2 on one physical cpu and 1 on the other when the second physical processor goes idle the scheduler will attempt to migrate one of the tasks to it.
It does this while attempting to keep thread affinity because when a thread migrates to another physical processor it will have to refill all levels of cache from main memory causing a stall in the task.
So to answer your question (on Linux at least); given 2 threads on a dual core hyperthreaded machine, each thread will run on its own physical core.
A sane OS will try to schedule computationally intensive tasks on their own cores, but problems arise when you start context switching them. Modern OS's still have a tendency to schedule things on cores where there is no work at scheduling time, but this can result in processes in parallel applications getting swapped from core to core fairly liberally. For parallel apps, you do not want this, because you lose data the process might've been using in the caches on its core. People use processor affinity to control for this, but on Linux, the semantics of sched_affinity() can vary a lot between distros/kernels/vendors, etc.
If you're on Linux, you can portably control processor affinity with the Portable Linux Processor Affinity Library (PLPA). This is what OpenMPI uses internally to make sure processes get scheduled to their own cores in multicore and multisocket systems; they've just spun off the module as a standalone project. OpenMPI is used at Los Alamos among a number of other places, so this is well-tested code. I'm not sure what the equivalent is under Windows.
I have been looking for some answers on thread scheduling on Windows, and have some empirical information that I'll post here for anyone who may stumble across this post in the future.
I wrote a simple C# program that launches two threads. On my quad core Windows 7 box, I saw some surprising results.
When I did not force affinity, Windows spread the workload of the two threads across all four cores. There are two lines of code that are commented out - one that binds a thread to a CPU, and one that suggests an ideal CPU. The suggestion seemed to have no effect, but setting thread affinity did cause Windows to run each thread on their own core.
To see the results best, compile this code using the freely available compiler csc.exe that comes with the .NET Framework 4.0 client, and run it on a machine with multiple cores. With the processor affinity line commented out, Task Manager showed the threads spread across all four cores, each running at about 50%. With affinity set, the two threads maxed out two cores at 100%, with the other two cores idling (which is what I expected to see before I ran this test).
EDIT:
I initially found some differences in performance with these two configurations. However, I haven't been able to reproduce them, so I edited this post to reflect that. I still found the thread affinity interesting since it wasn't what I expected.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading.Tasks;
class Program
{
[DllImport("kernel32")]
static extern int GetCurrentThreadId();
static void Main(string[] args)
{
Task task1 = Task.Factory.StartNew(() => ThreadFunc(1));
Task task2 = Task.Factory.StartNew(() => ThreadFunc(2));
Stopwatch time = Stopwatch.StartNew();
Task.WaitAll(task1, task2);
Console.WriteLine(time.Elapsed);
}
static void ThreadFunc(int cpu)
{
int cur = GetCurrentThreadId();
var me = Process.GetCurrentProcess().Threads.Cast<ProcessThread>().Where(t => t.Id == cur).Single();
//me.ProcessorAffinity = (IntPtr)cpu; //using this line of code binds a thread to each core
//me.IdealProcessor = cpu; //seems to have no effect
//do some CPU / memory bound work
List<int> ls = new List<int>();
ls.Add(10);
for (int j = 1; j != 30000; ++j)
{
ls.Add((int)ls.Average());
}
}
}
The probability is essentially 0% that the OS won't utilize as many physical cores as possible. Your OS isn't stupid. Its job is to schedule everything, and it knows full well what cores it has available. If it sees two CPU-intensive threads, it will make sure they run on two physical cores.
Edit
Just to elaborate a bit, for high-performance stuff, once you get into MPI or other serious parallelization frameworks, you definitely want to control what runs on each core.
The OS will make a sort of best-effort attempt to utilize all cores, but it doesn't have the long-term information that you do, that "this thread is going to run for a very long time", or that "we're going to have this many threads executing in parallel". So it can't make perfect decisions, which means that your thread will get assigned to a new core from time to time, which means you'll run into cache misses and similar, which costs a bit of time. For most purposes, it's good enough, and you won't even notice the performance difference. And it also plays nice with the rest of the system, if that matters. (On someone's desktop system, that's probably fairly important. In a grid with a few thousand CPU's dedicated to this task, you don't particularly want to play nice, you just want to use every clock cycle available).
So for large-scale HPC stuff, yes, you'll want each thread to stay on one core, fixed. But for most smaller tasks, it won't really matter, and you can trust the OS's scheduler.
This is a very good and relevant question. As we all know, a hyper-threaded core is not a real CPU/core. Instead, it is a virtual CPU/core (from now on I'll say core). The Windows CPU scheduler as of Windows XP is supposed to be able to distinguish hyperthreaded (virtual) cores from real cores. You might imagine then that in this perfect world it handles them 'just right' and it is not an issue. You would be wrong.
Microsoft's own recommendation for optimizing a Windows 2008 BizTalk server recommends disabling HyperThreading. This suggests, to me, that the handling of hyper-threaded cores isn't perfect and sometimes threads get a time slice on a hyper-threaded core and suffer the penalty (a fraction of the performance of a real core, 10% I'd guess, and Microsoft guesses 20-30%).
Microsoft article reference where they suggest disabling HyperThreading to improve server efficiency: http://msdn.microsoft.com/en-us/library/cc615012(BTS.10).aspx
It is the SECOND recommendation after BIOS update, that is how important they consider it. They say:
FROM MICROSOFT:
"Disable hyper-threading on BizTalk
Server and SQL Server computers
It is critical hyper-threading be
turned off for BizTalk Server
computers. This is a BIOS setting,
typically found in the Processor
settings of the BIOS setup.
Hyper-threading makes the server
appear to have more
processors/processor cores than it
actually does; however hyper-threaded
processors typically provide between
20 and 30% of the performance of a
physical processor/processor core.
When BizTalk Server counts the number
of processors to adjust its
self-tuning algorithms; the
hyper-threaded processors cause these
adjustments to be skewed which is
detrimental to overall performance. "
Now, they do say it is due to it throwing off the self-tuning algorithms, but then go on to mention contention problems (suggesting it is a larger scheduling issue, at least to me). Read it as you will, but I think it says it all. HyperThreading was a good idea when were with single CPU systems, but is now just a complication that can hurt performance in this multi-core world.
Instead of completely disabling HyperThreading, you can use programs like Process Lasso (free) to set default CPU affinities for critical processes, so that their threads never get allocated to virtual CPUs.
So.... I don't think anyone really knows just how well the Windows CPU Scheduler handles virtual CPUs, but I think it is safe to say that XP handles it worst, and they've gradually improved it since then, but it still isn't perfect. In fact, it may NEVER be perfect because the OS doesn't have any knowledge of what threads are best to put on these slower virtual cores. That may be the issue there, and why Microsoft recommends disabling HyperThreading in server environments.
Also remember even WITHOUT HyperThreading, there is the issue of 'core thrashing'. If you can keep a thread on a single core, that's a good thing, as it reduces the core change penalties.
You can make sure both threads get scheduled for the same execution units by giving them a processor affinity. This can be done in either windows or unix, via either an API (so the program can ask for it) or via administrative interfaces (so an administrator can set it). E.g. in WinXP you can use the Task Manager to limit which logical processor(s) a process can execute on.
Otherwise, the scheduling will be essentially random and you can expect a 25% usage on each logical processor.
I don't know about the other platforms, but in the case of Intel, they publish a lot of info on threading on their Intel Software Network. They also have a free newsletter (The Intel Software Dispatch) you can subscribe via email and has had a lot of such articles lately.
The chance that the OS will dispatch 2 active threads to the same core is zero unless the threads were tied to a specific core (thread affinity).
The reasons behind this are mostly HW related:
The OS (and the CPU) wants to use as little power as possible so it will run the tasks as efficient as possible in order to enter a low power-state ASAP.
Running everything on the same core will cause it to heat up much faster. In pathological conditions, the processor may overheat and reduce its clock to cool down. Excessive heat also cause CPU fans to spin faster (think laptops) and create more noise.
The system is never actually idle. ISRs and DPCs run every ms (on most modern OSes).
Performance degradation due to threads hopping from core to core are negligible in 99.99% of the workloads.
In all modern processors the last level cache is shared thus switching cores isn't so bad.
For Multi-socket systems (Numa), the OS will minimize hopping from socket to socket so a process stays "near" its memory controller. This is a complex domain when optimizing for such systems (tens/hundreds of cores).
BTW, the way the OS knows the CPU topology is via ACPI - an interface provided by the BIOS.
To sum things up, it all boils down to system power considerations (battery life, power bill, noise from cooling solution).

Resources