Developing Kernels to support Multiple CPUs - linux

I am looking to get into operating system kernel development and figured my contribution would be to extend the SANOS operating system to support multi-core machines. I have been reading books on operating systems (Tanenbaum) as well as studying how BSD and Linux have tackled this challenge, but I am still stuck on several concepts.
Does SANOS need more sophisticated scheduling algorithms when it runs on multiple CPUs, or will what is currently in place work fine?
I know that it is a good idea for threads to have affinity to the core they were started on, but is this handled via scheduling or by changing the implementation of how threads are created?
What would need to be considered so that SANOS could run on a machine with hundreds of cores? From what I can tell, BSD and Linux at best support a maximum of a dozen cores.

Your reading material is good, so no problems there. Also take a peek at Stanford's downloadable CS lectures on operating system design.
The scheduling algorithm may need to be more sophisticated. This depends on the types of applications running and how greedy they are: do they yield on their own, or do they have to be preempted? That kind of thing. It is more a question of what your processes want, or expect. An RTOS will have more complex scheduling than a desktop system.
Threads should have an affinity to one core, because two threads in one process can execute in parallel, but not at the same real time on the same core. Putting them on different cores allows them to really run in parallel, and caching can also be optimized for core affinity. This is really a mix of your thread implementation and your scheduler. The scheduler may want to ensure that a process's threads are started at the same time on different cores, rather than ad hoc, to reduce the amount of time threads spend waiting on each other. If your thread library is user-space, maybe it assigns the core itself, or lets the scheduler decide based on capacity or recently exited threads.
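As a concrete illustration (a minimal sketch assuming Linux and glibc, not how SANOS would do it), here is one way a user-space thread library could pin a newly created thread to a core. pthread_attr_setaffinity_np and sched_getcpu are glibc extensions, and core 1 is an arbitrary choice:

```c
/* Hypothetical sketch, assuming Linux + glibc: a user-space thread library
 * deciding the core itself by pinning the new thread before it starts.
 * pthread_attr_setaffinity_np is a non-portable glibc extension. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    /* sched_getcpu() reports which core the thread actually landed on */
    printf("worker running on core %d\n", sched_getcpu());
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t t;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(1, &set);                  /* arbitrary example: core 1 only */

    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

    pthread_create(&t, &attr, worker, NULL);
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}
```

Build with gcc -pthread. In a kernel like SANOS the equivalent decision would live in the scheduler or the thread-creation path rather than in a library call.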
Scalability is often a kernel limit (which can be arbitrary). In Linux, if I recall, the limits are due to static sizing of arrays that hold CPU information structs in the scheduler. Hence they are a fixed size. This can be changed by recompiling the kernel. Most good scheduling algorithms will support a very large number of cores. As your core or processor count gets higher, you need to be careful that you don't fragment a processes execution too much. If a program has 2 threads, try and schedule them in close-time-proximity because causation may exist (through shared data) between them.
You also need to decide how your threads are implemented, and how a process is represented in the kernel (heavyweight or lightweight). Are threads kernel-managed or user-space managed? These things all have an impact on scheduler design. Look at how POSIX threads are implemented in various operating systems. There is just so much for you to think about :)
In short, there are not really any clear-cut answers to where the logic does, or should, reside. It all comes down to design, application expectations, time constraints (on the programs), and so on.
Hope this helps; I am not an expert here, however.

Related

Controlling the process allocation to a processor

Does fork always create a process in a separate processor?
Is there a way I could control the forking to a particular processor? For example, if I have 2 processors and want the fork to create a parallel process, but on the same processor that contains the parent. Does NodeJS provide any method for this? I am looking for control over the allocation of the processes. ... Is this even a good idea?
Also, what is the maximum number of processes that could be forked, and why?
I've no Node.js wisdom to impart, simply some info on what OSes generally do.
Any modern OS will schedule processes and threads on CPUs and cores according to the prevailing load on the machine. The whole point is that they're very good at this, so you would have to try very hard to come up with scheduling or core-affinity decisions that beat the OS. Almost no one bothers. Unless you're running on very specific hardware (which, perhaps, you might get to understand very well), you would have to make a lot of complex decisions for every single different machine the code runs on.
If you do want to try, then I'm assuming you'll have to dig below Node.js and make calls to the underlying C library. Most OSes (including Linux) provide a means for a process to control its core affinity (it's exposed in Linux's glibc).
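For instance, here is a hedged sketch in C (assuming Linux and glibc; CPU 0 is an arbitrary choice): the affinity mask set with sched_setaffinity is inherited across fork, so pinning the parent first keeps the child on the same processor.

```c
/* Hedged sketch, assuming Linux/glibc: pin this process to CPU 0, then fork.
 * The child inherits the parent's affinity mask, so both stay on CPU 0. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                    /* CPU 0 only */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        printf("child running on CPU %d\n", sched_getcpu());
        _exit(0);
    }
    printf("parent running on CPU %d\n", sched_getcpu());
    wait(NULL);
    return 0;
}
```

From Node.js you would have to reach this through a native add-on or an external helper; as far as I know, Node itself does not expose an affinity API.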

Are Tcl threads multi process/multi core

I'm new to using threads in Tcl, but I thought it was a nice way to solve a problem I'm having.
I was trying to read through the Tcl thread documentation, but I can't quite figure out whether Tcl threads are spread across multiple CPU cores or are all kept within the CPU core from which the master process was started.
Tcl's threads are threads as supported by the operating system's standard libraries (e.g., they're normal POSIX threads on Linux and OSX), and so are entirely capable of running over as many cores as the OS allows.
Tcl takes care to limit the use of locks in its implementation as much as possible, so as to make multi-core operation as efficient as possible; this came from experience supporting high-performance application servers in the 1990s, where it turned out that reducing the sharing of resources was a big win as hardware scaled up the number of cores.
It also means that you've got a non-shared-memory model based on structured message passing; it scales well, but it was very different from what most programmers knew at the time. It's a little more mainstream now, because shared-memory parallelism remains annoyingly troublesome on modern hardware.
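If you want to picture that model outside Tcl, here is a rough, hypothetical analogue in C: the worker thread keeps all of its state private and only receives data through a small mailbox protected by a mutex and condition variable (the one-slot queue and the payload value are just for illustration, not how Tcl implements it).

```c
/* Hypothetical analogue of a message-passing thread model: no application
 * data is shared directly; work arrives only through a tiny mailbox. */
#include <pthread.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int             msg;        /* one-slot "mailbox" for simplicity */
    int             full;
} mailbox_t;

static mailbox_t box = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0
};

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&box.mu);
    while (!box.full)
        pthread_cond_wait(&box.cv, &box.mu);
    int m = box.msg;            /* copy the message out */
    box.full = 0;
    pthread_mutex_unlock(&box.mu);

    printf("worker received message %d\n", m);  /* worker-private state only */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    pthread_mutex_lock(&box.mu);
    box.msg = 42;               /* the only data that crosses threads */
    box.full = 1;
    pthread_cond_signal(&box.cv);
    pthread_mutex_unlock(&box.mu);

    pthread_join(t, NULL);
    return 0;
}
```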

Linux: Processes and Threads in a Multi-core CPU

Is it true that threads, compared to processes, are less likely to benefit from a multi-core processor? In other words, would the kernel make the decision of executing threads on a single core rather than on multiple cores?
I'm talking about threads belonging to the same process.
I don't know how the (various) Linux schedulers handle this, but inter-thread communication gets more expensive when the threads are running on different cores.
So the scheduler may decide to run the threads of a process on the same CPU if there are other processes needing CPU time.
E.g., with a dual-core CPU, if there are two processes with two threads each and all of them use all the CPU time they get, it is better to run the two threads of the first process on the first core and the two threads of the other process on the second core.
That's news to me. Linux in particular makes little distinction between threads and processes: they are really just processes that share their address space.
Multiple single-threaded processes are more expensive for the system than a single multi-threaded one, but they benefit from a multicore CPU with the same efficiency. Plus, inter-thread communication is much cheaper than inter-process communication. If these threads really form a single application, I vote for multithreading.
Shared-memory multithreading imposes huge complexity costs on everything from your tool chain to developing, debugging, reasoning about, and testing your code. NEVER use shared-memory multithreading where you can reasonably use a multi-process design.
@Marcelo is right: any decent OS will treat threads and processes very similarly. Some CPU affinity for threads may reduce the multi-processor usage of a multi-threaded process, but you should see that with any two processes that share a common .text segment as well.
Pick threads vs. processes based on complexity and architectural design constraints, speed will almost never come into it.
It actually all depends on the scheduler, type of multiprocessing, and current running environment.
Assume nothing, test, test, test!
If you're the only multi-threaded process on the system, multi-threading is generally a good idea.
However, from the perspective of the ease of development, sometimes you want separate address spaces and shared data, especially in NUMA systems.
One thing for sure: If it's a 'HyperThreaded' system, threads are much more efficient by virtue of close memory sharing.
If it is regular multi-core processing, it should be similar.
If it is a NUMA system, you're better off keeping data shared and code separate. Again, it's all architecture dependent, and it doesn't matter performance-wise unless you're in the HPC business.
If you are in the HPC (supercomputing) business, TEST! It's all machine dependent (and the benefit is 10-25% on average; that matters when you're talking about days of difference).
Whereas Windows uses fibres and threads I sometimes think Linux uses processes and twine.
I've found that you really have to be rigorous, pedantic, disciplined, and bloody-minded in designing multi-threaded processes in order for them to achieve a balanced benefit from whatever number of cores is available on the machine the process runs on.
Is it true, on Linux, that threads, compared to processes, are less likely to benefit from a multi-core processor? No one knows.

With modern OS schedulers, does it still make sense to manually lock processes to specific CPUs/cores?

I recently learned that sometimes people will lock specific processes or threads to specific processors or cores, and it's thought that this manual tuning will best distribute the load. This is a bit counter-intuitive to me -- I would think the OS scheduler would be able to make a better decision than a human about how to spread the load. I could see it being true for older operating systems that perhaps weren't aware of issues like there being more latency between specific pairs of cores, or shared cache between one pair of cores but not another pair. But I assume 'modern' OSes like Linux, Solaris 10, OS X, and Vista should have schedulers that know this information. Am I mistaken about their capabilities? Am I mistaken that it's a problem the OS can actually solve? I'm particularly interested in the answer for Solaris and Linux.
The consequence is whether or not I need to inform users of my (multithreaded) software of how they might consider balancing on their box.
First of all, 'lock' is not the correct term to describe it; 'affinity' is a more suitable term.
In most cases, you don't need to care about it. However, in some cases, manually setting CPU/process/thread affinity can be beneficial.
Operating systems are usually oblivious to the details of modern multicore architecture. For example, say we have a 2-socket machine with quad-core processors, and the processors support SMT (HyperThreading). In this case, we have 2 processors, 8 cores, and 16 hardware threads, so the OS will see 16 logical processors. If an OS does not recognize this hierarchy, it is highly likely to lose some performance gains. The reasons are:
Caches: in our example, the two processors (installed in two different sockets) do not share any on-chip caches. Say an application has 4 busy threads and a lot of data is shared among them. If the OS schedules the threads across the two processors, we may lose cache locality, resulting in a performance loss. However, if the threads are not sharing much data (each having a distinct working set), then separating them onto different physical processors is better, because it increases the effective cache capacity. Trickier scenarios can also occur that are very hard for the OS to be aware of.
Resource conflict: consider the SMT (HyperThreading) case. SMT shares many important CPU resources, such as caches, the TLB, and execution units, between logical processors. Say there are only two busy threads. An OS may naively schedule these two threads onto two logical processors of the same physical core, in which case significant resources are contended between the two threads.
One good example is Windows 7. Windows 7 supports a smart scheduling policy that considers SMT (related article). Windows 7 actually avoids the second case above. Here is a snapshot of Task Manager on Windows 7 with a 20% load on a Core i7 (quad-core with HyperThreading = 8 logical processors):
[Task Manager CPU usage screenshot] (source: egloos.com)
The CPU usage history is very interesting, isn't it? :) You can see that only one CPU of each pair is utilized, meaning Windows 7 avoids scheduling two threads on the same core whenever possible. This policy definitely reduces the negative effects of SMT such as resource conflicts.
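If your OS does not do this for you, you can approximate the same policy from user space on Linux. The sketch below is hypothetical and assumes the usual sysfs layout (/sys/devices/system/cpu/cpuN/topology/thread_siblings_list); it builds an affinity mask containing only the first logical CPU of each physical core and applies it to the current process.

```c
/* Hedged sketch, assuming the Linux sysfs topology files: keep only one
 * logical CPU per physical core in the affinity mask, so two busy threads
 * are never forced to share the execution units of a hyper-threaded core. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    cpu_set_t set;
    CPU_ZERO(&set);

    for (long cpu = 0; cpu < ncpu && cpu < CPU_SETSIZE; cpu++) {
        char path[128];
        int first = -1;
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%ld/topology/thread_siblings_list",
                 cpu);
        FILE *f = fopen(path, "r");
        /* the first number in the list is the lowest-numbered SMT sibling */
        if (f && fscanf(f, "%d", &first) == 1 && first == cpu)
            CPU_SET(cpu, &set);        /* keep only one sibling per core */
        if (f)
            fclose(f);
    }

    if (CPU_COUNT(&set) > 0)
        sched_setaffinity(0, sizeof(set), &set);   /* apply to this process */
    printf("restricted to %d logical CPUs (one per core)\n", CPU_COUNT(&set));
    return 0;
}
```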
I'd like to say that operating systems are not very smart about modern multicore architectures, with their many caches, shared last-level caches, SMT, and even NUMA. So there can be good reasons to manually set CPU/process/thread affinity.
However, I won't say this is really needed. Only try it once you fully understand your workload patterns and your system architecture, and then check whether your change is actually effective.
For general-purpose applications, there is no reason to set the CPU affinity; you should just allow the OS scheduler to choose which CPU runs the process or thread. However, there are instances where it is necessary to set the CPU affinity, for example in real-time systems, where the cost of migrating a thread from one core to another (which can happen at any time if the CPU affinity has not been set) can introduce unpredictable delays that cause tasks to miss their deadlines and preclude real-time guarantees.
You can take a look at this article about a multi-core aware implementation of real-time CORBA that, among other things, had to set the CPU affinity so that CPU migration could not result in missed deadlines.
The paper is: Real-Time Performance and Middleware for Multiprocessor and Multicore Linux Platforms
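To make the real-time case concrete, here is a hedged sketch (Linux; it needs root or CAP_SYS_NICE, and core 3 and priority 80 are arbitrary choices, not anything from the paper) that pins the process to a single core and switches it to SCHED_FIFO so the scheduler cannot migrate it in the middle of a deadline.

```c
/* Hypothetical sketch: pin this process to core 3 and run it as SCHED_FIFO,
 * so migrations cannot add unpredictable latency to its deadlines. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    struct sched_param sp = { .sched_priority = 80 };   /* arbitrary RT priority */

    CPU_ZERO(&set);
    CPU_SET(3, &set);                                   /* core 3 only */

    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler (needs CAP_SYS_NICE)");

    /* ... the time-critical loop would run here, never migrated off core 3 ... */
    printf("running on core %d\n", sched_getcpu());
    return 0;
}
```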
For applications designed with parallelism and multiple cores in mind, OS-default thread affinity is sometimes not enough. There are many approaches to parallelism, but so far all require involvement of the programmer and knowledge - at some level at least - of the architecture on which the solution will be mapped. This includes the machines, CPUs, and threads that are involved.
This is an actively researched subject, and there is an excellent course on MIT's OpenCourseWare that delves into these issues: http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-189January--IAP--2007/CourseHome/
Something many people haven't mentioned here is the idea of forbidding two processes from running on the same processor (socket). It might be worth helping the system by binding different heavily used processes to different processors. This can avoid contention if the scheduler is not clever enough to figure it out itself.
But this is more a system-administration task than one for programmers. I have seen optimizations like this for a few high-performance database servers.
Most modern operating systems will do an effective job of allocating work between cores. They also attempt to keep threads running on the same core, to get the cache benefits you mentioned.
In general, you should never set your thread affinity unless you have a very good reason to. You don't have as good an insight as the OS into the other work that threads on the system are doing. Kernels are constantly being updated to account for new processor technology (from a single CPU per socket, to hyper-threading, to multiple cores per socket). Any attempt by you to set hard affinity may backfire on future platforms.
This article from MSDN Magazine, Using concurrency for scalability, gives a good overview of multithreading on Win32. Regarding CPU affinity:
"Windows automatically employs so-called ideal processor affinity in an attempt to maximize cache efficiency. For example, a thread running on CPU 1 that gets context switched out will prefer to run again on CPU 1 in the hope that some of its data will still reside in cache. But if CPU 1 is busy and CPU 2 is not, the thread could be scheduled on CPU 2 instead, with all the negative cache effects that implies."
The article also warns that CPU affinity shouldn't be manipulated without a deep understanding of the problem. Based on this information, my answer to your question would be No, except for very specific, well-understood scenarios.
I am not even sure you can pin processes to a specific CPU on Linux. So my answer is "NO" - let the OS handle it; it's smarter than you most of the time.
Edit:
It seems that on Win32 you have some control over which CPU family this process will run on. Now I am just waiting for someone to prove me wrong on Linux/POSIX as well ...

How good is the Linux kernel on the new quad-core processors when running multithreaded applications

Is there anyone here with experience with the Linux thread scheduler running multithreaded applications on the new quad-core processors? If so, can you please describe how well the kernel manages the different threads? Have you experienced any thread starvation, or starvation of one of the cores?
Thank you.
Given that kernel developers like Christoph Lameter (and Ingo Molnar on the scheduler) have tuned the kernel to work well on 4096 processors, and given the amount of optimization Intel itself has invested in the issue, with multicore-specific tuning both for performance and for energy saving, I bet the kernel is by far more optimized than anything any of us could write in userspace.
Same about the threading library; there is currently only one thread library, NPTL for Linux 2.6. LinuxThreads was removed from glibc in the 2.4 release, and NPTL was produced before the 2.6 release. And it's really fast.
Just make sure to avoid using an old kernel; the latest release from your distro, or from kernel.org, is best. Before deploying in production, make sure to measure the performance difference, and consider whether it is worth the additional support costs (if any).
Linux itself supports using many processors very well. If I remember correctly, with SMP Linux supports 4096 processors. What will really make a difference is whether your applications are written to take advantage of multiple processors.
It works very well on a twin quad-core system (V8) we have in production... bloody fast.
But be very wary of Linux's propensity for thread starvation when locks (mutexes) are heavily contended. Imagine a scenario in which 10 threads work with one lock, the lock is needed very frequently but held for very short amounts of time, and the work done outside the lock at any given point is less than a timeslice. Linux will very much tend to hand the lock to one thread almost every time, to the exclusion of all the others.
This also depends on the specific threading package bound into the kernel - I believe there are several.
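You can make the hand-off imbalance described above visible with a small, hypothetical test program. The thread and iteration counts are arbitrary, but with a tiny critical section and almost no work outside the lock, the per-thread acquisition counts often come out very skewed.

```c
/* Hedged sketch: 10 threads hammer one mutex; each acquisition is counted
 * per thread, so an unfair hand-off (one thread re-taking the lock over and
 * over) shows up directly in the printed totals. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 10
#define TOTAL    1000000        /* total acquisitions shared by all threads */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long total;
static long counts[NTHREADS];

static void *worker(void *arg)
{
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        if (total >= TOTAL) {
            pthread_mutex_unlock(&lock);
            break;
        }
        total++;                /* the "work" is done while holding the lock */
        counts[id]++;
        pthread_mutex_unlock(&lock);
        /* deliberately no work here, so the lock is contended constantly */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        printf("thread %d acquired the lock %ld times\n", i, counts[i]);
    return 0;
}
```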
I have gotten absolutely stunning results on our Intel Q6600s, both for parallel make and for some other parallel applications, but I have been careful to avoid excessive parallelism: I generally fork between four and eight threads, so there's not too much contention. If you fork enough threads, you will see noticeable overhead, especially if they contend for the same semaphores. My guess is that thousands of threads are probably too many and dozens are probably OK, but that's just a guess; if you want to know, you'll have to find someone who has measured it, or do the experiment yourself.
But for a dozen threads our results have been fabulous.

Resources