Kernel Scheduling for 1024 CPUs - multithreading

Azul Systems has an appliance that supports thousands of cache coherent CPUs. I would love insight into what changes would need to occur to an operating system in order to schedule thousands of simultaneously running threads.

Scheduling thousands of threads is not a big deal, but scheduling them on hundreds of CPUs is. What you need, first and foremost, is very fine-grained locking, or, better yet, lock-free data structures and algorithms. You just can't afford to let 200 CPUs waiting while one CPU executes a critical section.

You're asking for possible changes to the OS, so I presume there's a significant engineering team behind this effort.
There are also a few pieces of clarififying info that would help define the problem parameters:
How much IPC (inter process communication) do you need?
Do they really have to be threads, or can they be processes?
If they're processes, is it okay if the have to talk to each other through sockets, and not by using shared memory?
What is the memory architecture? Are you straight SMP with 1024 cores, or is there some other NUMA (Non-Uniform Memory Architecture) or MMP going on here? What are your page tables like?
Knowing only the very smallest of info about Azul systems, I would guess that you have very little IPC, and that a simple "run one kernel per core" model might actually work out just fine. If processes need to talk to each other, then they can create sockets and transfer data that way. Does your hardware support this model? (You would likely end up needing one IP address per core as well, and at 1024 IP addrs, this might be troublesome, although they could all be NAT'd, and maybe it's not such a big deal). If course, this model would lead to some inefficiencies, like extra page tables, and a fair bit of RAM overhead, and may even not be supported by your hardware system.
Even if "1 kernel per core" doesn't work, you could likely run 1024/8 kernels, and be just fine, letting each kernel control 8 physical CPUs.
That said, if you wanted to run 1 thread per core in a traditional SMP machine with 1024 cores (and only a few physical CPUs) then I would expect that the old fashioned O(1) scheduler is what you'd want. It's likely that your CPU[0] will end up nearly 100% in kernel and doing interrupt handling, but that's just fine for this use case, unless you need more than 1 core to handle your workload.

Making Linux scale has been a long and ongoing project. The first multiprocessor capable Linux kernel had a single lock protecting the entire kernel (the Big Kernel Lock, BKL), which was simple, but limited scalability.
Subsequently the locking has been made more fine-grained, i.e. there are many locks (thousands?), each covering only a small portion of data. However, there are limits to how far this can be taken, as fine-grained locking tends to be complicated, and the locking overhead starts to eat up the performance benefit, especially considering that most multi-CPU Linux systems have relatively few CPU's.
Another thing, is that as far as possible the kernel uses per-cpu data structures. This is very important, as it avoids the cache coherency performance issues with shared data, and of course there is no locking overhead. E.g. every CPU runs its own process scheduler, requiring only occasional global synchronization.
Also, some algorithms are chosen with scalability in mind. E.g. some read-mostly data is protected by Read-Copy-Update (RCU) instead of traditional mutexes; this allows readers to proceed during a concurrent update.
As for memory, Linux tries hard to allocate memory from the same NUMA node as where the process is running. This provides better memory bandwidth and latency for the applications.

My uneducated guess would be that there is a run-queue per processor and a work-stealing algorithm when a processor is idle. I could see this working in an M:N model, where there is a single process per cpu and light-weight processes as the work items. This would then feel similar to a work-stealing threadpool, such as the one in Java-7's fork-join library.
If you really want to know, go pick up Solaris Internals or dig into the Solaris kernel code. I'm still reading Design & Impl of FreeBSD, with Solaris Internals being the next on my list, so all I can do is make wild guesses atm.

I am pretty sure that the SGI Altix we have at work, (which does ccNUMA) uses special hardware for cache coherency.
There is a huge overhead connected to hold 4mb cache per core coherent. It's unlikely to happen in software only.
in an array of 256 cpus you would need 768mb ram just to hold the cache-invalidation bits.
12mb cache / 128 bytes per cache line * 256² cores.

Modifying the OS is one thing, but using unchanged application code is a waste of hardware. When going over some limit (depending on the hardware), the effort to keep coherency and synchronization in order to execute generic code is simply too much. You can do it, but it will be very expensive.
From the OS side you'll need complex affinity model, i.e. not to jump CPUs just because yours is busy. Scheduling threads based on hardware topology - cooperating threads on CPUs that are "close" to minimize penalties. Simple work stealing is not a good solution, you must consider topology. One solution is hierarchical work stealing - steal work by distance, divide topology to sectors and try to steal from closest first.
Touching a bit the lock issue; you'll still use spin-locks nd such, but using totally different implementations. This is probably the most patented field in CS these days.
But, again, you will need to program specifically for such massive scale. Or you'll simply under-use it. No automatic "parallelizers" will do it for you.

The easiest way to do this is to bind each process/thread to a few CPUS, and then only those CPUs would have to compete for a lock on that thread. Obviously, there would need to be some way to move threads around to even out the load, but on a NUMA architecture, you have to minimize this as much as possible.

Even on dual-core intel systems, I'm pretty sure that Linux can already handle "Thousands" of threads with native posix threads.
(Glibc and the kernel both need to be configured to support this, however, but I believe most systems these days have that by default now).

Related

Using loopback for synchronous IPC when using NUMA architecture

(For a Linux platform) Is it feasible (from a performance point of view) to try to communicate (in a synchronous way) via loopback interface between processes on different NUMA nodes?
What about if the processes reside on the same NUMA node?
I know it's possible to memory bind a process and/or set CPU affinity to a node (using libnuma). I don't know if this true also for the network interface.
Later edit. If loopback interface is just a memory buffer used by kernel, is there a way to be sure that buffer is on the same NUMA node in order for two processes to communicate without the cross node overhead?
Network interfaces don't reside on a node; they're a device - virtual or real - shared across the whole machine. The loopback interface is just a memory buffer somewhere or other, and some kernel code. The code that runs to support that device is likely bouncing round the CPU cores, just like any other thread in the system.
You talk of NUMA nodes, and tagged the question with Linux. Linux doens't run on pure NUMA architectures, it runs on SMP architectures. Modern CPUs from, say, Intel, AMD, ARM all synthesise an SMP hardware environment using separate cores, varying degrees of cache / memory interface unification, and high speed serial links between cores or CPUs. Effectively it's not possible for the operating system or software running on top to see the underlying NUMA architecture; it thinks it's running on a classical SMP architecture.
Intel / AMD / everyone else have done this because, back in the day, successful multiple CPU machines really were SMP; they had multiple CPUs all sharing the same memory bus, and had equal access to the RAM at the other end of the bus. Software got written to take advantage of that (Linux, Windows, etc).
Then the CPU manufacturers realised that SMP architectures suck so far as speed improvements are concerned. AMD blinked first, and ditched SMP in favour of Hypertransport, and were successful. Intel persisted with pure SMP for longer, but soon gave up too and started using QPI between CPUs.
But to give the old software (Linux, Windows, etc) backward compatibility, the CPU designers had to create a synthetic SMP hardware environment on top of Hypertransport and QPI. In principal they might have, at that point in time, decided that SMP was dead and delivered us pure NUMA architectures. But that would likely have been commercial suicide; it would have taken coorindation of the entire hardware and software industries to agree to go that way, but by then it was already far too late to rewrite everything from scratch.
Thinks like network sockets (including via the loopback interface), pipes, serial ports are not synchronous. They're stream carriers, and the sender and receiver are not synchronised by the act of transferring data. That is, the sender can write() data and think that that has completed, but the data is in reality still stuck in some network buffer somewhere and hasn't yet made it into the read() that the destination process will have to call to receive the data.
What Linux will do with processes and threads is endeavour to run them all at once, up to the limit of the number of CPU cores in the machine. By and large that will result in your processes running simultaneously on separate cores. I think Linux will also use knowledge of which physical CPU's memory holds the bulk of a process's data, and will try to run the process on that CPU; memory latency will be a tiny bit better that way.
If your processes try to communicate via socket, pipe or similar, it results in data being copied out of one process's memory space into a memory buffer controlled by the kernel (that's what write() is doing under the hood), and then being copied out of that into the receiving process's memory space (that's what read() does). Where that intermediate kernel buffer actually is doesn't really matter because the transactions taking place at the microelectronic level (below the SMP level) are pretty much the same regardless. Memory allocations and processes can be bound to specific CPU cores, but you can't influence whereabouts the kernel puts its memory buffers through which the exchanged data must pass.
Regarding memory and process core affinity - it's really, really hard to do this to any measurable benefit. The OSes are so good nowadays at understanding the behaviour of CPUs that it's almost always best to simply let the OS run your processes and cores whereever it chooses. Companies like Intel make large code contributions to the Linux project, specifically to ensure that Linux does this as well as possible on the latest and greatest chips.
==EDIT==
Additions in the light of engaging comments!
By "pure NUMA" I really mean systems where one CPU core cannot directly address memory physically attached to another CPU core. Such systems include Transputers, and even the Cell processor found in the Sony PS3. These aren't SMP, there's nothing in the silicon that unifies the separate memories into a single address space, so the question of cache coherency doesn't come into it.
With Transputer systems the only way to access memory attached to another transputer was to have the application software send the data over via a serial link; what made it CSP was that the sending application would finish sending until the receiving application had read the last byte.
For the Cell processor, there were 8 maths cores each with 256kbyte of RAM. That was the only RAM the maths cores could address. To use them the application had to move data and code into that 256k of RAM, tell the core to run, and then move the results out (possibly back out to RAM, or onto another maths core).
There are some supercomputers today that aren't disimilar to this. The K machine (Riken, Kobe in Japan) has an awful lot of cores, a very complex on-chip interconnect fabric, and OpenMPI is used by applications to move data around between nodes; nodes cannot directly address memory on other nodes.
The point is that on the PS3 it was up to application software to decide what data was in what memory and when, whereas modern x86 implementations from Intel and AMD make all data in all memories (no matter if they're shared via an L3 cache or are remote at the other end of a hypertransport or QPI link) accessible from any cores (that's what SMP means afterall).
The all out performance of code written on the Cell process was truly astounding for the Watts and transistor count. Trouble was in a world where programemrs are trained in writing for SMP environments, it takes a brain transplant to get to grips with one that isn't.
Newer languages like Rust and Go have reintroduced the concept of communicating sequential processes, which is all one had with Transputers back in the 1980s, early 1990s. CSP is almost ideal for multicore systems as the hardware does not need to implement an SMP environment. In principle this saves an awful lot of silicon.
CSP implemented on top of today's cache coherent SMP chips in languages like C generally involves a thread writing data into a bufffer, and that being copied into a buffer belonging to another thread (Rust can do it a little differently because Rust knows about memory ownership, and so can transfer ownership instead of copying memory. I'm not familiar with Go - maybe it can do the same).
Looked at at the microelectronic level, copying data from one buffer to another is not really any different to what happens if the data is shared by 2 cores instead of copied (especially in AMD's hypertransport CPUs where each has its own memory system). To share data, the remote core has to use hypertransport to request data from another core's memory, and more traffic to maintain cache coherency. That's about the same amount of hypertransport traffic as if the data where copied from one core to the other, but then there's no subsequent cache coherency traffic.

lock contention in memory allocation - multi-threaded vs. multi-process

We have developed a big C++ application that is running satisfactorily at several sites on big Linux and Solaris boxes (up to 160 CPU cores or even more). It's a heavily multi-threaded (1000+ threads), single-process architecture, consuming huge amounts of memory (200 GB+). We are LD_PRELOADing Google Perftool's tcmalloc (or libumem/mtmalloc on Solaris) to avoid memory allocation performance bottlenecks with generally good results. However, we are starting to see adverse effects of lock contention during memory allocation/deallocation on some bigger installations, especially after the process has been running for a while (which hints to aging/fragmentation effects of the allocator).
We are considering changing to a multi-process/shared memory architecture (the heavy allocation/deallocation will not happen in shared memory, rather on the regular heap).
So, finally, here's our question: can we assume that the virtual memory manager of modern Linux kernels is capable of efficiently handing out memory to hundreds of concurent processes? Or do we have to expect running into the same kind of problems with memory allocation contention that we see in our single-process/multi-threading environment? I tend to hope for a better overall system performance, as we would no longer be limited to a single address space, and that having several independent address spaces would require less locking on the part of the virtual memory manager. Anyone have any actual experience or performance data comparing multi-threaded vs. multi-process memory allocation?
I tend to hope for a better overall system performance, as we would no longer be limited to a single address space, and that having several independent address spaces would require less locking on the part of the virtual memory manager.
There is no reason to expect this. Unless your code is so badly designed that it constantly goes back to the OS to allocate memory, it won't make any significant difference. Your application should only need to go back to the OS's virtual memory manager when it needs more virtual memory, which should not occur significantly once the process reaches its stable size.
If you are constantly allocating and freeing all the way back to the OS, you should stop doing that. If you're not, then you can keep multiple pools of already-allocated memory that can be used by multiple threads without contention. And, as a benefit, your context switches will be cheaper because TLB's don't have to be flushed.
Only if you can't reduce the frequency of address space changes (for example, if you must map and unmap files) or if you have to change other shared resources (like file descriptors) should you look at multiprocess options.

FreeRTOS vs Linux against single event upsets

I am working on the on-board computer for a CubeSat. Our computer will be vulnerable to radiation, hence single event upsets, e.g. bit flips are likely to occur. Would a lighter, smaller OS like FreeRTOS bring more stability, robustness and a lower probability of failure over a full-blown Linux operating system?
The probability of a bit error in RAM is a function of time, memory size and radiation density, so a larger memory has a greater probability, and you can fit a FreeRTOS system in much less memory (like 10kb instead of 4Mb). However the usage rate of the smaller memory is likely much higher - i.e. in a FreeRTOS application, most of the code and data are accessed relatively frequently, while in a Linux deployment, much of it is redundant and if corrupted may never be accessed in any case.
However the question makes little sense for a number of reasons, such as:
The effect of a bit-flip event is entirely non-deterministic, any single event it may be benign or catastrophic. It is impossible to say that a system can tolerate 1 error when you don't know when or where the error will occur.
If your system can be implemented on FreeRTOS, why would you even consider Linux? They are chalk and cheese. If you need the extensive networking, filesystem, memory management, POSIX API and device support etc. provided by Linux, FreeRTOS is not suited to your application in any case, as you would have to add all that yourself from your own or additional third-party code. FreeRTOS is only a scheduling kernel, with threading, synchronisation and IPC support and little else. Conversely if you need hard real-time deterministic behaviour, Linux is unsuited to your application.
Where you might benefit from using an RTOS kernel like FreeRTOS is that it will execute from ROM which may be less prone to the bit-flipping cosmic ray issue - (although the availability of ECC/radiation hardened Flash memory may indicate otherwise). You still need RAM for R/W data, but at least the code itself will be robust. A typical FreeRTOS system might run in SRAM (possibly in on-chip RAM on a microcontroller) - I don't know whether low density SRAM is less prone to bit-flipping than high-density SDRAM, but I am willing to believe it is. It is also possible to source radiation hardened SRAM in any case.
The solution for a system using SDRAM in such an environment is to use ECC RAM which may largely overcome the problem of data corruption from radiation and non-deterministic system behaviour. However I would not imagine that even that would be sufficient for space or high-atmosphere applications.
In short the solution is not in the software, it has to be in the hardware, and the lengths you need to go to will depend on the radiation environment your system will be subjected to. However the selection of a small RTOS kernel allows the selection of hardware to be potentially much wider since it will run on a much wider range of architectures in much smaller memory, perform deterministically, respond to events in fewer cycles and is ROMable.

Does pinning a process to a CPU core or an SMP node help reduce cache coherency traffic?

It is possible to pin a process to a specific set of CPU cores using sched_setaffinity() call. The manual page says:
Restricting a process to run on a single CPU also avoids the
performance cost caused by the cache invalidation that occurs when a process
ceases to execute on one CPU and then recommences execution on a different
CPU.
Which is almost an obvious thing (or not?). What is not that obvious to me is this -
Does pinning LWPs to a specific CPU or an SMP node reduces a cache coherency bus traffic? For example, since a process is running pinned, other CPUs should not modify its private memory, thus only CPUs that are part of the same SMP node should stay cache-coherent.
There should be no CPU socket-to-socket coherency traffic for the pinned process case you describe. Modern Xeon platforms implement snoop filtering in the chipset. The snoop filter indicates when a remote socket cannot have the cache line in question, thus avoiding the need to send cache invalidate messages to that socket.
You can measure this for yourself. Xeon processors implement a large variety of cache statistic counters. You can read the counters in your own code with the rdpmc instruction or just use a product like VTune. FYI, using rdpmc is very precise, but a little tricky since you have to initially set a bit in CR4 to allow using this instruction in user mode.
-- EDIT --
My answer above is outdated for the 55xx series of CPUs which use QPI links. These links interconnect CPU sockets directly without an intervening chipset, as in:
http://ark.intel.com/products/37111/Intel-Xeon-Processor-X5570-%288M-Cache-2_93-GHz-6_40-GTs-Intel-QPI%29
However, since the L3 cache in each CPU is inclusive, snoops over the QPI links only occur when the local L3 cache indicates the line is nowhere in the local socket. Likewise, the remote socket's L3 can quickly respond to a cross-snoop without bothering the cores, assuming the line isn't there either.
So, the inclusive L3 caches should minimize inter-socket coherency overhead, it's just not due to a chipset snoop filter in your case.
If you run on a NUMA system (like, Opteron server or Itanium), it makes sense, but you must be sure to bind a process to the same NUMA node that it allocates memory from. Otherwise, this is an anti-optimization. It should be noted that any NUMA-aware operating system will try to keep execution and memory in the same node anyway, if you don't tell it anything at all, to the best of its abilities (some elderly versions of Windows are rather poor at this, but I wouldn't expect that to be the case with recent Linux).
If you don't run on a NUMA system, binding a process to a particular core is the one biggest stupid thing you can do. The OS will not make processes bounce between CPUs for fun, and if a process must be moved to another CPU, then that is not ideal, but the world does not end, either. It happens rarely, and when it does, you will hardly be able to tell.
On the other hand, if the process is bound to a CPU and another CPU is idle, the OS cannot use it... that is 100% available processing power gone down the drain.

Do you expect that future CPU generations are not cache coherent?

I'm designing a program and i found that assuming implicit cache coherency make the design much much easier. For example my single writer (always the same thread) multiple reader (always other threads) scenarios are not using any mutexes.
It's not a problem for current Intel CPU's. But i want this program to generate income for at least the next ten years (a short time for software) so i wonder if you think this could be a problem for future cpu architectures.
I suspect that future CPU generations will still handle cache coherence for you. Without this, most mainstream programming methodologies would fail. I doubt any CPU architecture that will be used widely in the next ten years will invalidate the current programming model - it may extend it, but it's difficult to drop something so widely assumed.
That being said, programming with the assumption of implicit cache coherency is not always a good idea. There are many issues with false sharing that can easily be avoided if you purposefully try to isolate your data. Handling this properly can lead to huge performance boosts (rather, a lack of huge performance losses) on current generation CPUs. Granted, it's more work in the design, but it is often required.
We are already there. Computers claim cache coherency but at the same time they have a temporary store buffer for writes, reads can be completed via this buffer instead of the cache (ie the store buffer has just become a incoherent cache) and invalidate requests are also queued allowing the processor to temporarily use cache lines it knows are stale.
X86 doesn't use many of these techniques, but it does use some. As long as memory stays significantly slower than the CPU, expect to see more of these techniques and others yet devised to be used. Even itanium, failed as it is, uses many of these ideas, so expect intel to migrate them into x86 over time.
As for avoiding locks, etc: it is always hard to guage people's level of expertise over the Internet so either you are misguided with what you think might work, or you are on the cutting edge of lockfree programming. Hard to tell.
Do you understand the MESI protocol, memory barriers and visibility? Have you read stuff from Paul McKenney, etc?
I don't know per se. But I'd like to see a trend toward non-cache coherent modes.
The conceptual mind shift is significant (can't just pass data in a method call, must pass it through a queue to an async method), but it's required as we move more and more into a multicore world anyway. The closer we get to one processor per memory bank the better. Because then we're working in a world of network message routing, where data is just not available rather than having threads that can silently stomp on data.
However, as Reed Copsey points out, the whole x86 world of computing is built on the assumption of cache coherency (which is even bigger than Microsoft's market share!). So it won't go away any time soon!
Here is a paper from reputed authors in computer architecture area which argues that cache coherence is here to stay.
http://acg.cis.upenn.edu/papers/cacm12_why_coherence.pdf
"Why On-Chip Cache Coherence Is Here to Stay" -By Martin, Hill and Sorin
You are making a strange request. You are asking for our (the SO community) assumptions about future CPU architectures - a very dangerous proposition. Are you willing to put your money where our mouth is? Because if we're wrong and your application will fail it will be you who's not making any money..
Anyway, I would suspect things are not going to change that dramatically because of all the legacy code that was written for single threaded execution but that's just my opinion.
The question seems misleading to me. The CPU architecture is not that important, what is important is the memory model of the platform you are working for.
You are developing the application is some environment, with some defined memory model. E.g. if you are currently targeting x86, you can be pretty sure any future platform will implement the same memory model when it is running x86 code. The same is true for Java or .NET VMs and other execution platforms.
If you expect to port your current application at some other platforms, if the platform memory model will be different, you will have to adjust for it, but in such case you are the one doing the port and you have the complete control over how you do it. This is however true even for current platforms, e.g. PowerPC memory model allows much more reorderings to happen than the x86 one.

Resources