Why do threads starve even on a preemptive multitasking OS (Windows 7)?

I wrote a Win32 application (in Delphi 7, which is 32-bit, using the TThread class) that creates 100 threads. Each thread, when resumed, continuously increments (in a loop) a 64-bit counter associated with its thread object, so there is no locking or sharing of data.
If you let the system run for 10 to 15 seconds and then stop it, you would expect to see roughly the same count in each thread. But what I observed was that 81 threads ran under 400 million loops while the remaining ones looped more than 950 million times. The slowest thread managed only 230 million, compared to 2,111 million for the fastest.
According to MSDN, preemptive multitasking is at the thread level (not the process level), so each of my threads should have received its time slice in round-robin fashion. What am I missing here, and why this discrepancy?
Edit 1: Machine configuration: Intel i7 quad-core 3.4 GHz with hyper-threading turned on (8 active threads at a time), running Windows 7 64-bit Professional (the test application is 32-bit).
Edit 2 (thread code): The test application is built with optimization turned on and without any debug info, and is run outside of the IDE.
type
  TMyThread = class(TThread)
  protected
    FCount: Int64;
  public
    constructor Create;
    procedure Execute; override;
    property Count: Int64 read FCount;
  end;

{ TMyThread }

constructor TMyThread.Create;
begin
  inherited Create(True);
  FCount := 0;
end;

procedure TMyThread.Execute;
begin
  inherited;
  while not Terminated do
  begin
    Inc(FCount);
  end;
end;

Round-robin scheduling is an obvious strategy for a kernel. That is, however, not the way the Windows scheduler works. It used to be, back in the Windows 9x days: that scheduler was very capable of giving the various VMs equal time. But not in the NT branch, started by Dave Cutler's group, where scheduling is purely based on priority.
Whichever thread has the highest priority gets the CPU. There is another chunk of code in Windows that tinkers with a thread's priority, modifying it from the default it was given when the thread was created. That code is aware of things like a thread owning a window that is in the foreground, or a thread waiting on a synchronization object that got signaled. Or the more bizarre scheduling fix that tries to solve priority inversion problems: randomly giving a thread a chance to run even though it wasn't its turn.
Focus on writing sane code first. Starting a hundred threads isn't a very sane thing to do. You are trying to consume resources that the machine doesn't actually have available; nobody has a machine with a hundred cores. Yet. Powers of two: get a machine with 128 cores first.
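To see the priority effect directly, here is a minimal sketch (my own illustration, not the poster's code): two identical busy threads pinned to the same logical CPU, one of them boosted. Since the NT scheduler is strictly priority-driven, the boosted thread should starve the other almost completely.
program PriorityDemo;
{$APPTYPE CONSOLE}
uses Windows, Classes, SysUtils;
type
  TBusyThread = class(TThread)
  public
    FCount: Int64;
    procedure Execute; override;
  end;
procedure TBusyThread.Execute;
begin
  while not Terminated do
    Inc(FCount);
end;
var
  A, B: TBusyThread;
begin
  A := TBusyThread.Create(True);
  B := TBusyThread.Create(True);
  // Pin both threads to the same logical CPU so only priority decides.
  SetThreadAffinityMask(A.Handle, 1);
  SetThreadAffinityMask(B.Handle, 1);
  B.Priority := tpHighest; // B now outranks A on their shared CPU
  A.Resume;
  B.Resume;
  Sleep(2000);
  A.Terminate;
  B.Terminate;
  B.WaitFor;
  A.WaitFor;
  WriteLn(Format('normal: %d  highest: %d', [A.FCount, B.FCount]));
end.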

I have reproduced your results and can confirm them. Additionally, disabling thread priority boost doesn't change the distribution. GetThreadTimes reports that threads with higher Values consumed more UserTime, and vice versa, while KernelTime seems to have no correlation with the Values.
Thread 97: 1081,5928 Ke:0 Us:25116161
Thread 98: 1153,8029 Ke:0 Us:26988173
Thread 99: 704,6996 Ke:0 Us:16848108
Clearly, some threads really get to run more often than others.
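For reference, this is roughly how such numbers can be collected (a sketch, not the exact harness; ReportThread is my own helper name, and TMyThread is the poster's class):
procedure ReportThread(T: TMyThread; Index: Integer);
var
  CreationT, ExitT, KernelT, UserT: TFileTime;
  KernelTicks, UserTicks: Int64;
begin
  if GetThreadTimes(T.Handle, CreationT, ExitT, KernelT, UserT) then
  begin
    // FILETIME values are 100-nanosecond ticks; reassemble into Int64.
    KernelTicks := (Int64(KernelT.dwHighDateTime) shl 32) or KernelT.dwLowDateTime;
    UserTicks := (Int64(UserT.dwHighDateTime) shl 32) or UserT.dwLowDateTime;
    WriteLn(Format('Thread %d: %d Ke:%d Us:%d', [Index, T.Count, KernelTicks, UserTicks]));
  end;
end;
Disabling the dynamic boost is a single call per thread, made before resuming it:
SetThreadPriorityBoost(T.Handle, True); // True means "disable boosting"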
I haven't graphed the results, but I suppose what we're seeing is a normal distribution, which means the results depend on a number of factors, some of which are random.
I tried disabling hyper-threading (this somewhat smoothed the results), then assigning each thread a single physical processor (using SetThreadAffinityMask). In the second case the Values were much closer to each other.
SetThreadAffinityMask(Self.Handle, 1 shl (FIndex mod 4));
I can sort of understand how running on a hyper-threaded system can make some threads "unlucky": they are scheduled to compete with other threads on the same physical processor, and because of "soft affinity" to this virtual core they get to run on it again and again, thus scoring lower than others.
But as to why binding each thread to a fixed core helps on a non-hyperthreaded system, I don't know.
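For completeness, here is how that call might sit in the poster's thread class (a sketch; the FIndex field and constructor parameter are my assumption about how each thread learns its ordinal):
type
  TMyThread = class(TThread)
  private
    FIndex: Integer; // assumed: the thread's ordinal, passed in by the creator
    FCount: Int64;
  public
    constructor Create(AIndex: Integer);
    procedure Execute; override;
  end;
constructor TMyThread.Create(AIndex: Integer);
begin
  inherited Create(True);
  FIndex := AIndex;
end;
procedure TMyThread.Execute;
begin
  // Pin this thread to one of the 4 physical cores before the busy loop.
  // The mask is a bit field of allowed logical CPUs.
  SetThreadAffinityMask(Self.Handle, 1 shl (FIndex mod 4));
  while not Terminated do
    Inc(FCount);
end;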
There are probably other random things involved, such as activity on the cores from other processes. A thread can get "unlucky" if some other process's thread associated with the same core suddenly wakes up and starts doing some (relatively) heavy work.
All of this is guessing though.

Windows 7 is designed for user land. When your first thread wants to do work, the OS gives it a time slice; you, the user, just started it, after all. By the time the 50th thread in succession (from the same process!) wants to do work, higher-priority threads (background processes controlled by Windows 7 itself) step in. This happens in such a fashion as to make some threads luckier.
You and I don't really want a personal OS that hands out CPU time based on the whims of user-land processes. I would be curious to see how Windows Server 2008 R2 handled this. You might also play around with the Advanced tab setting "Choose how to allocate processor resources".

Some good reasoning here, but there are some features to take into consideration.
Windows is trying to do multitasking with software.
Your hardware isn't multitasking; it's using power to do what a parallel-processing system would do.
Under Windows, priority is given in many ways, and it's confusing.
Let me explain it this way.
I have a small program that watches my cores and their use.
When Windows loads, you would think that ALL the cores would get used. Nope.
As Windows loads, the other cores start to get used.
Then you would think that, as Windows loads, loading would accelerate because it has access to the cores. It doesn't accelerate. It doesn't use the cores at FULL speed to load faster.
Even if Windows shoved programs onto one core EACH as they were loading and running, it WAITS for them to finish. If it used ALL the cores to process each program, it would be using software (about 100 times slower than hardware) to assemble the parts at the other end.
Long ago, Intel wanted to change the hardware to parallel processing, and MS said no, as their software wasn't designed for it. Now they are trying to push the serial-based hardware design to the Nth degree. Even after MS bought the NT software, they have lately forgotten to use much of its design.
There need to be hardware changes. There need to be programming-language changes (MS created the programming language), and the core of Windows needs to be designed again. NOT changed; it needs to go back and start from scratch. Good luck with that.
To tell you how old this idea is... VIVA la Amiga.

Related

Maximum number of threads and multithreading

I'm tripping up on the multithreading concept.
For example, my processor has 2 cores and, with hyper-threading, 2 threads per core, totaling 4 threads. So does this mean my CPU can execute four separate instructions simultaneously? Is each thread capable of being multi-threaded?
So does this mean my CPU can execute four separate instructions simultaneously? Is each thread capable of being multi-threaded?
In short: yes to both.
A CPU can only execute a single instruction per phase of a clock cycle. Due to certain factors like pipelining, a CPU might be able to pass multiple instructions through the different phases in a single clock cycle, and the clock frequency might be extremely fast, but it's still only 1 instruction at a time.
As an example, NOP is an x86 assembly instruction which the CPU interprets as "no operation this cycle". That's 1 instruction out of the hundreds or thousands (and more) that are executed by something even as simple as:
int main(void)
{
    while (1) { /* eat CPU */ }
    return 0;
}
A CPU thread of execution is one in which a series of instructions (a thread of instructions) is being executed. It does not matter which "application" the instructions come from; a CPU does not know about high-level concepts (like applications), as that's a function of the OS.
So if you have a computer with 2 (or 4/8/128/etc.) CPUs that share the same memory (cache/RAM), then you can have 2 (or more) CPUs running 2 (or more) instructions at (literally) the exact same time. Keep in mind that these are machine instructions that are running at the same time (i.e., the physical side of the software).
An OS-level thread is something a bit different. While the CPU handles the physical side of execution, the OS handles the logical side. The above code breaks down into more than 1 instruction and, when executed, may actually run on more than 1 CPU over its lifetime (in a multi-CPU-aware environment), even though it's a single "thread" (at the OS level); the OS schedules when to run the next instructions and on which CPU (based on the OS's thread-scheduling policy, which differs among the various OSes). So the above code will eat up 100% CPU usage per given "time slice" on whichever CPU it's running on.
This "slicing" of "time" (also known as preemptive multitasking) is why an OS can run multiple applications "at the same time". It's not literally(1) at the same time, since a CPU can only handle 1 instruction at a time, but to a human (who can barely comprehend the length of 1 second) it appears to be "at the same time".
1) Except in the case of a multi-CPU setup, where it might literally be the same time.
When an application is run, the kernel (the OS) actually spawns a separate thread (a kernel thread) to run the application on. The application can additionally request another external thread (i.e., spawning another process or forking), or create an internal thread by calling the OS's (or programming language's) API, which in turn calls lower-level kernel routines that spawn and maintain the context switching of the spawned thread. Any created thread is also capable of calling the same APIs to spawn other separate threads (thus a thread is capable of being "multi-threaded").
Multi-threading (in the sense of applications and operating systems) is not necessarily portable, so while you might learn Java or C# and use their APIs (i.e., Thread.Start or Runnable), utilizing the actual OS APIs as provided (i.e., CreateThread or pthread_create and the slew of other concurrency functions) opens a different door for design decisions (i.e., "does platform X support thread library Y?"); just something to keep in mind as you explore the different APIs.
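To illustrate the "actual OS API" route, here is a minimal sketch calling the raw Win32 CreateThread from Delphi, the language of the first question above. (In real Delphi code you would prefer BeginThread or TThread, which set up the RTL's per-thread state; this is only to show the OS-level call.)
program RawThread;
{$APPTYPE CONSOLE}
uses Windows;
function WorkerProc(Param: Pointer): DWORD; stdcall;
begin
  // ... the thread's series of instructions goes here ...
  Result := 0;
end;
var
  ThreadId: DWORD;
  H: THandle;
begin
  H := CreateThread(nil, 0, @WorkerProc, nil, 0, ThreadId);
  if H <> 0 then
  begin
    WaitForSingleObject(H, INFINITE); // join the thread
    CloseHandle(H);
  end;
end.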
I hope that can help add some clarity.
I actually researched this very topic in my Operating Systems class.
When using threads, a good rule of thumb for increased performance of CPU-bound processes is to use an equal number of threads as cores, except on a hyper-threaded system, where one should use twice as many threads (one per logical core). The other rule of thumb is for I/O-bound processes: quadruple the number of threads per core (again, counting logical cores on a hyper-threaded system).
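A sketch of applying that rule of thumb on Windows (in Delphi, to match the first question): dwNumberOfProcessors reports logical CPUs, so on a hyper-threaded box the doubling comes for free.
function SuggestedThreadCount(IOBound: Boolean): Integer;
var
  Info: TSystemInfo;
begin
  GetSystemInfo(Info);
  if IOBound then
    Result := Integer(Info.dwNumberOfProcessors) * 4 // I/O-bound: oversubscribe
  else
    Result := Integer(Info.dwNumberOfProcessors);    // CPU-bound: one per logical CPU
end;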

OmniThreadLibrary memory leak (consumption) on pipeline running from another thread

I'm running a pipeline (a thread pipeline from OmniThreadLibrary) from another thread and get a memory leak, or rather memory consumption. But when the application closes, everything is fine and there is no memory-leak report (ReportMemoryLeaksOnShutdown := True;).
Here is an example: click the button 10 times and the test app will consume ~600 MB of memory. Windows 7 x64, Delphi XE6, latest Omni source.
Is it a bug? Or do I need to use different code?
uses
  OtlParallel,
  OtlCommon;

procedure TForm75.Button1Click(Sender: TObject);
begin
  // run empty pipelines from other threads
  Parallel.&For(1, 100).Execute(
    procedure(value: integer)
    var
      pipe: IOmniPipeline;
    begin
      pipe := Parallel.Pipeline
        .Stage(procedure(const input: TOmniValue; var output: TOmniValue) begin end)
        .Run;
      pipe.Cancel;
      pipe.WaitFor(100000);
      pipe := nil;
    end
  );
end;
Edit 1:
Tested that code with Process Explorer and found that the thread count at runtime is constant, but the handle count grows. If I insert Application.ProcessMessages; at the end of the for loop (after the pipe's code), then the test app runs fine, handles get closed, and memory consumption stays constant. I don't know why.
How many threads does it create?
Check it in SysInternals Process Explorer for example.
Or in Delphi IDE (View -> Debug Windows -> Threads)
I think that because you block each For worker with quite a long WaitFor, your application creates many worker threads for every button click, and when you click it 10 times it consequently creates 10 times as many threads.
And yes, in general-purpose operating systems like Windows, threads are expensive! Google for "windows thread memory footprint" and multiply that by the number of threads created by the 10 parallel-for loops you spawn.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms686774.aspx
https://blogs.technet.microsoft.com/markrussinovich/2009/07/05/pushing-the-limits-of-windows-processes-and-threads/
This fact was the reason that, for making highly parallel server applications, special approaches were developed to create lightweight application-level threads and bypass OS threads. To name a few:
Make a special language that can spawn dozens of thousands of cooperative threads and enforce cross-thread memory safety by strict language rules: https://www.erlang.org/docs
Make a library which cannot enforce those regulations but at least can ask the programmer to follow them voluntarily: https://en.wikipedia.org/wiki/Actor_model
Fibers, i.e. no-protection threads within threads: What is the difference between a thread and a fiber?
However, OTL, being a generic library for generic threads, imposes few restrictions and relies on OS-provided native threads, and those are expensive both in the CPU time needed to create/release Windows threads (mitigated by the thread-pool concept) and in the memory footprint the OS needs to maintain each Windows thread (which is unavoidable, and whose manifestation you are seeing).
Of course, later, when all those loops have worked through, their threads get closed and released, together with the memory that was used to maintain them. So no memory leak indeed: once you wait long enough for all your threads to be terminated, they are, along with all the temporarily allocated memory they used as their workspaces.
UPD. How to check that hypothesis? The easiest way would be to change how many threads are spawned by every instance of the For loop (by every button click).
See the .NumTasks option of your Parallel.For object:
http://otl.17slon.com/book/chap04.html#leanpub-auto-iomniparallelsimpleloop-interface
By default every button click should spawn one thread for every CPU core, but you can enforce your own thread-pool size. Add a .NumTasks(1) call and check memory consumption, then change it to .NumTasks(10) and check again. If the memory consumption grows roughly tenfold after that, then that is it.
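Applied to the code from the question, the experiment would look something like this sketch (only the .NumTasks call is new):
Parallel.&For(1, 100)
  .NumTasks(1) // try 1, then 10, and compare the memory consumption
  .Execute(
    procedure(value: integer)
    begin
      // ... the pipeline code from the question, unchanged ...
    end
  );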

Who schedules threads?

I have a question about scheduling threads. On the one hand, I learned that in Linux threads are scheduled and treated like processes, meaning they get scheduled like any other process using conventional methods (for example, the Completely Fair Scheduler in Linux).
On the other hand, I also know that the CPU might switch between threads using methods like switch-on-event or fine-grained multithreading; for example, on a cache-miss event the CPU can switch threads. But what if the scheduler doesn't want to switch the thread? How do they agree on one action?
I'm really confused between the two: who schedules a thread, the OS or the CPU?
Thanks a lot :)
The answer is both.
What happens is really fairly simple: on a CPU that supports multiple threads per core (e.g., an Intel with Hyperthreading) the CPU appears to the OS as having some number of virtual cores. For example, an Intel i7 has 4 actual cores, but looks to the OS like 8 cores.
The OS schedules 8 threads onto those 8 (virtual) cores. When it's time to do a task switch, the OS's scheduler looks through the threads and finds the 8 that are...the most eligible to run (taking into account things like thread priority, time since they last ran, etc.)
The CPU has only 4 real cores, but those cores support executing multiple instructions simultaneously (and out of order, in the absence of dependencies). Incoming instructions get decoded and thrown into a "pool". Each clock cycle, the CPU tries to find some instructions in that pool that don't depend on the results of some previous instruction.
With multiple threads per core, what happens is basically that each actual core has two input streams to put into the "pool" of instructions it might be able to execute in a given clock cycle. Each clock cycle it still looks for instructions from that pool that don't depend on the results of previous instructions that haven't finished executing yet. If it finds some, it puts them into the execution units and executes them. The only major change is that each instruction now needs some sort of tag attached to designate which "virtual core" will be used to store results into--that is, each of the two threads has (for example) its own set of registers, and instructions from each thread have to write to the registers for that virtual core.
It is possible, however, for a CPU to support some degree of thread priority so that (for example) if the pool of available instructions includes some instructions from both input threads (or all N input threads, if there are more than two) it will prefer to choose instructions from one thread over instructions from another thread in any given clock cycle. This can be absolute, so it runs thread A as fast as possible, and thread B only with cycles A can't use, or it can be a "milder" preference, such as attempting to maintain a 2:1 ratio of instructions executed (or, of course, essentially any other ratio preferred).
Of course, there are other ways of setting up priorities as well (such as partitioning execution resources), but the general idea remains the same.
An OS that's aware of shared cores like this can also modify its scheduling to suit, such as scheduling only one thread on a pair of cores if that thread has higher priority.
The OS handles scheduling and dispatching of ready threads (those that require the CPU) onto cores, managing CPU execution much as it manages other resources. A cache miss is no reason to swap out a thread. A page fault, where the desired page is not loaded in RAM at all, may cause a thread to be blocked until the page gets loaded from disk; the memory-management hardware does that by generating a hardware interrupt to an OS driver that handles the page fault.

How long is the time frame between context switches on Windows?

Reading CLR via C# 2.0 (I don't have 3.0 with me at the moment).
Is this still the case:
If there is only one CPU in a computer, only one thread can run at any one time. Windows has to keep track of the thread objects, and every so often, Windows has to decide which thread to schedule next to go to the CPU. This is additional code that has to execute once every 20 milliseconds or so. When Windows makes a CPU stop executing one thread's code and start executing another thread's code, we call this a context switch. A context switch is fairly expensive because the operating system has to:
So, circa CLR via C# 2.0, let's say we are on a Pentium 4 2.4 GHz, single core, non-HT, running XP. Every 20 milliseconds? Given that a CLR thread or Java thread is mapped to an OS thread, does that mean a maximum of only 50 threads per second get a chance to run?
I've read here on SO that context switching itself is very fast (microseconds), but roughly how often (order-of-magnitude guesses) will, say, a modest 5-year-old server (Windows 2003, Pentium Xeon, single core) give the OS the opportunity to context switch? Is 20 ms in the right area?
I don't need exact figures; I just want to be sure that's in the right area. It seems rather long to me.
The quantum, as it's called, depends on a few things, including performance tweaks the operating system makes as it goes along; for instance, the foreground process is given a higher priority and can be given a quantum 3 times longer than the default. There is also a difference between server and client SKUs: typically a client has a default quantum of 30 ms, where a server has 180 ms.
So a foreground process that wants as much CPU as it can get may receive a quantum of 90 ms before a context switch... and then the OS may decide it doesn't need to switch and let the quantum continue.
Your "50 threads per second" math is wrong. You assume that each of those threads is in a 100% CPU state. Most threads are in fact asleep, waiting for IO or other events. Even then, most threads don't use their entire 20 ms before going into IO mode or otherwise giving up their slice.
Try this: write an app with an infinite loop (it eats its entire CPU window), run 50 instances of it, and see how Windows reacts.
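A sketch of that experiment as a single Delphi console app (50 busy threads in one process instead of 50 instances; counters above zero tell you which threads ever ran):
program FiftyBusy;
{$APPTYPE CONSOLE}
uses Windows, Classes, SysUtils;
type
  TBusy = class(TThread)
  public
    FCount: Int64;
    procedure Execute; override;
  end;
procedure TBusy.Execute;
begin
  while not Terminated do
    Inc(FCount);
end;
var
  T: array[0..49] of TBusy;
  i, Ran: Integer;
begin
  for i := 0 to 49 do
    T[i] := TBusy.Create(False); // start running immediately
  Sleep(1000);                   // one second of contention
  Ran := 0;
  for i := 0 to 49 do
  begin
    T[i].Terminate;
    T[i].WaitFor;
    if T[i].FCount > 0 then
      Inc(Ran);
    T[i].Free;
  end;
  WriteLn(Format('%d of 50 threads got a share of the CPU in 1 second', [Ran]));
end.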
I just did a test: I got 43 threads each seeing their share within a second (after warming up), which makes Richter's statement pretty accurate (allowing for overhead), I'd say. Quad core/Win7/64-bit. Yes, these were 100% CPU threads, so obviously they didn't give their slice back before their 20 ms was up. Interesting.

Multicore + Hyperthreading - how are threads distributed?

I was reading a review of the new Intel Atom 330, where they noted that Task Manager shows 4 cores - two physical cores, plus two more simulated by Hyperthreading.
Suppose you have a program with two threads. Suppose also that these are the only threads doing any work on the PC, everything else is idle. What is the probability that the OS will put both threads on the same core? This has huge implications for program throughput.
If the answer is anything other than 0%, are there any mitigation strategies other than creating more threads?
I expect there will be different answers for Windows, Linux, and Mac OS X.
Using sk's answer as Google fodder, then following the links, I found the GetLogicalProcessorInformation function in Windows. It speaks of "logical processors that share resources. An example of this type of resource sharing would be hyperthreading scenarios." This implies that jalf is correct, but it's not quite a definitive answer.
Linux has quite a sophisticated thread scheduler which is HT aware. Some of its strategies include:
Passive Loadbalancing: If a physical CPU is running more than one task the scheduler will attempt to run any new tasks on a second physical processor.
Active Loadbalancing: If there are 3 tasks, 2 on one physical cpu and 1 on the other when the second physical processor goes idle the scheduler will attempt to migrate one of the tasks to it.
It does this while attempting to keep thread affinity because when a thread migrates to another physical processor it will have to refill all levels of cache from main memory causing a stall in the task.
So to answer your question (on Linux at least); given 2 threads on a dual core hyperthreaded machine, each thread will run on its own physical core.
A sane OS will try to schedule computationally intensive tasks on their own cores, but problems arise when you start context switching them. Modern OSes still have a tendency to schedule things on cores where there is no work at scheduling time, but this can result in processes in parallel applications getting swapped from core to core fairly liberally. For parallel apps, you do not want this, because you lose data the process might have been using in the caches on its core. People use processor affinity to control for this, but on Linux the semantics of sched_setaffinity() can vary a lot between distros/kernels/vendors, etc.
If you're on Linux, you can portably control processor affinity with the Portable Linux Processor Affinity Library (PLPA). This is what OpenMPI uses internally to make sure processes get scheduled to their own cores in multicore and multisocket systems; they've just spun off the module as a standalone project. OpenMPI is used at Los Alamos among a number of other places, so this is well-tested code. I'm not sure what the equivalent is under Windows.
I have been looking for some answers on thread scheduling on Windows, and have some empirical information that I'll post here for anyone who may stumble across this post in the future.
I wrote a simple C# program that launches two threads. On my quad core Windows 7 box, I saw some surprising results.
When I did not force affinity, Windows spread the workload of the two threads across all four cores. There are two lines of code that are commented out - one that binds a thread to a CPU, and one that suggests an ideal CPU. The suggestion seemed to have no effect, but setting thread affinity did cause Windows to run each thread on their own core.
To see the results best, compile this code using the freely available compiler csc.exe that comes with the .NET Framework 4.0 client, and run it on a machine with multiple cores. With the processor affinity line commented out, Task Manager showed the threads spread across all four cores, each running at about 50%. With affinity set, the two threads maxed out two cores at 100%, with the other two cores idling (which is what I expected to see before I ran this test).
EDIT:
I initially found some differences in performance with these two configurations. However, I haven't been able to reproduce them, so I edited this post to reflect that. I still found the thread affinity interesting since it wasn't what I expected.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

class Program
{
    [DllImport("kernel32")]
    static extern int GetCurrentThreadId();

    static void Main(string[] args)
    {
        Task task1 = Task.Factory.StartNew(() => ThreadFunc(1));
        Task task2 = Task.Factory.StartNew(() => ThreadFunc(2));
        Stopwatch time = Stopwatch.StartNew();
        Task.WaitAll(task1, task2);
        Console.WriteLine(time.Elapsed);
    }

    static void ThreadFunc(int cpu)
    {
        int cur = GetCurrentThreadId();
        var me = Process.GetCurrentProcess().Threads.Cast<ProcessThread>().Where(t => t.Id == cur).Single();
        //me.ProcessorAffinity = (IntPtr)cpu; // using this line of code binds a thread to each core
        //me.IdealProcessor = cpu;            // seems to have no effect

        // do some CPU / memory bound work
        List<int> ls = new List<int>();
        ls.Add(10);
        for (int j = 1; j != 30000; ++j)
        {
            ls.Add((int)ls.Average());
        }
    }
}
The probability is essentially 0% that the OS won't utilize as many physical cores as possible. Your OS isn't stupid. Its job is to schedule everything, and it knows full well what cores it has available. If it sees two CPU-intensive threads, it will make sure they run on two physical cores.
Edit
Just to elaborate a bit, for high-performance stuff, once you get into MPI or other serious parallelization frameworks, you definitely want to control what runs on each core.
The OS will make a sort of best-effort attempt to utilize all cores, but it doesn't have the long-term information that you do, such as "this thread is going to run for a very long time" or "we're going to have this many threads executing in parallel". So it can't make perfect decisions, which means your thread will get assigned to a new core from time to time, which means you'll run into cache misses and the like, which costs a bit of time. For most purposes it's good enough, and you won't even notice the performance difference. It also plays nice with the rest of the system, if that matters. (On someone's desktop system, that's probably fairly important. In a grid with a few thousand CPUs dedicated to this task, you don't particularly want to play nice; you just want to use every clock cycle available.)
So for large-scale HPC stuff, yes, you'll want each thread to stay on one core, fixed. But for most smaller tasks, it won't really matter, and you can trust the OS's scheduler.
This is a very good and relevant question. As we all know, a hyper-threaded core is not a real CPU/core. Instead, it is a virtual CPU/core (from now on I'll say core). The Windows CPU scheduler as of Windows XP is supposed to be able to distinguish hyperthreaded (virtual) cores from real cores. You might imagine then that in this perfect world it handles them 'just right' and it is not an issue. You would be wrong.
Microsoft's own recommendation for optimizing a Windows 2008 BizTalk server is to disable HyperThreading. This suggests, to me, that the handling of hyper-threaded cores isn't perfect: sometimes threads get a time slice on a hyper-threaded core and suffer the penalty (a fraction of the performance of a real core; 10% I'd guess, and Microsoft estimates 20-30%).
Microsoft article reference where they suggest disabling HyperThreading to improve server efficiency: http://msdn.microsoft.com/en-us/library/cc615012(BTS.10).aspx
It is the SECOND recommendation, right after a BIOS update; that is how important they consider it. They say:
FROM MICROSOFT:
"Disable hyper-threading on BizTalk Server and SQL Server computers. It is critical hyper-threading be turned off for BizTalk Server computers. This is a BIOS setting, typically found in the Processor settings of the BIOS setup. Hyper-threading makes the server appear to have more processors/processor cores than it actually does; however hyper-threaded processors typically provide between 20 and 30% of the performance of a physical processor/processor core. When BizTalk Server counts the number of processors to adjust its self-tuning algorithms; the hyper-threaded processors cause these adjustments to be skewed which is detrimental to overall performance."
Now, they do say it is due to it throwing off the self-tuning algorithms, but then they go on to mention contention problems (suggesting it is a larger scheduling issue, at least to me). Read it as you will, but I think it says it all. HyperThreading was a good idea when we had single-CPU systems, but in today's multi-core world it is just a complication that can hurt performance.
Instead of completely disabling HyperThreading, you can use programs like Process Lasso (free) to set default CPU affinities for critical processes, so that their threads never get allocated to virtual CPUs.
So... I don't think anyone really knows just how well the Windows CPU scheduler handles virtual CPUs, but I think it is safe to say that XP handled it worst, and it has been gradually improved since then, though it still isn't perfect. In fact, it may NEVER be perfect, because the OS has no knowledge of which threads are best to put on these slower virtual cores. That may be the issue, and why Microsoft recommends disabling HyperThreading in server environments.
Also remember, even WITHOUT HyperThreading there is the issue of "core thrashing". If you can keep a thread on a single core, that's a good thing, as it reduces the core-change penalties.
You can make sure both threads get scheduled onto the same execution units by giving them a processor affinity. This can be done in either Windows or Unix, via either an API (so the program can ask for it) or via administrative interfaces (so an administrator can set it). E.g., in WinXP you can use the Task Manager to limit which logical processor(s) a process can execute on.
Otherwise, the scheduling will be essentially random, and you can expect 25% usage on each logical processor.
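For the API route on Windows, a minimal sketch (in Delphi, to match the earlier questions; note that older 32-bit Delphi declares the masks as DWORDs, while newer versions use DWORD_PTR): restrict the whole process to the first two logical CPUs, which mirrors what Task Manager's "Set Affinity" dialog does.
program AffinityDemo;
{$APPTYPE CONSOLE}
uses Windows;
var
  ProcMask, SysMask: DWORD;
begin
  if GetProcessAffinityMask(GetCurrentProcess, ProcMask, SysMask) then
    // Bits 0 and 1 select logical CPUs 0 and 1.
    SetProcessAffinityMask(GetCurrentProcess, ProcMask and 3);
end.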
I don't know about the other platforms, but in the case of Intel, they publish a lot of info on threading on their Intel Software Network. They also have a free newsletter (The Intel Software Dispatch) you can subscribe via email and has had a lot of such articles lately.
The chance that the OS will dispatch 2 active threads to the same core is zero unless the threads were tied to a specific core (thread affinity).
The reasons behind this are mostly HW related:
The OS (and the CPU) wants to use as little power as possible, so it will run tasks as efficiently as possible in order to enter a low-power state ASAP.
Running everything on the same core will cause it to heat up much faster. In pathological conditions the processor may overheat and reduce its clock to cool down. Excessive heat also causes CPU fans to spin faster (think laptops) and creates more noise.
The system is never actually idle. ISRs and DPCs run every millisecond (on most modern OSes).
Performance degradation due to threads hopping from core to core is negligible in 99.99% of workloads.
In all modern processors the last-level cache is shared, so switching cores isn't so bad.
For multi-socket (NUMA) systems, the OS will minimize hopping from socket to socket so a process stays "near" its memory controller. This is a complex domain when optimizing for such systems (tens/hundreds of cores).
BTW, the way the OS knows the CPU topology is via ACPI, an interface provided by the BIOS.
To sum things up, it all boils down to system power considerations (battery life, power bill, noise from the cooling solution).
