(For a Linux platform) Is it feasible (from a performance point of view) to try to communicate (in a synchronous way) via loopback interface between processes on different NUMA nodes?
What about if the processes reside on the same NUMA node?
I know it's possible to memory bind a process and/or set CPU affinity to a node (using libnuma). I don't know if this true also for the network interface.
Later edit. If loopback interface is just a memory buffer used by kernel, is there a way to be sure that buffer is on the same NUMA node in order for two processes to communicate without the cross node overhead?
Network interfaces don't reside on a node; they're a device - virtual or real - shared across the whole machine. The loopback interface is just a memory buffer somewhere or other, and some kernel code. The code that runs to support that device is likely bouncing round the CPU cores, just like any other thread in the system.
You talk of NUMA nodes, and tagged the question with Linux. Linux doens't run on pure NUMA architectures, it runs on SMP architectures. Modern CPUs from, say, Intel, AMD, ARM all synthesise an SMP hardware environment using separate cores, varying degrees of cache / memory interface unification, and high speed serial links between cores or CPUs. Effectively it's not possible for the operating system or software running on top to see the underlying NUMA architecture; it thinks it's running on a classical SMP architecture.
Intel / AMD / everyone else have done this because, back in the day, successful multiple CPU machines really were SMP; they had multiple CPUs all sharing the same memory bus, and had equal access to the RAM at the other end of the bus. Software got written to take advantage of that (Linux, Windows, etc).
Then the CPU manufacturers realised that SMP architectures suck so far as speed improvements are concerned. AMD blinked first, and ditched SMP in favour of Hypertransport, and were successful. Intel persisted with pure SMP for longer, but soon gave up too and started using QPI between CPUs.
But to give the old software (Linux, Windows, etc) backward compatibility, the CPU designers had to create a synthetic SMP hardware environment on top of Hypertransport and QPI. In principal they might have, at that point in time, decided that SMP was dead and delivered us pure NUMA architectures. But that would likely have been commercial suicide; it would have taken coorindation of the entire hardware and software industries to agree to go that way, but by then it was already far too late to rewrite everything from scratch.
Thinks like network sockets (including via the loopback interface), pipes, serial ports are not synchronous. They're stream carriers, and the sender and receiver are not synchronised by the act of transferring data. That is, the sender can write() data and think that that has completed, but the data is in reality still stuck in some network buffer somewhere and hasn't yet made it into the read() that the destination process will have to call to receive the data.
What Linux will do with processes and threads is endeavour to run them all at once, up to the limit of the number of CPU cores in the machine. By and large that will result in your processes running simultaneously on separate cores. I think Linux will also use knowledge of which physical CPU's memory holds the bulk of a process's data, and will try to run the process on that CPU; memory latency will be a tiny bit better that way.
If your processes try to communicate via socket, pipe or similar, it results in data being copied out of one process's memory space into a memory buffer controlled by the kernel (that's what write() is doing under the hood), and then being copied out of that into the receiving process's memory space (that's what read() does). Where that intermediate kernel buffer actually is doesn't really matter because the transactions taking place at the microelectronic level (below the SMP level) are pretty much the same regardless. Memory allocations and processes can be bound to specific CPU cores, but you can't influence whereabouts the kernel puts its memory buffers through which the exchanged data must pass.
Regarding memory and process core affinity - it's really, really hard to do this to any measurable benefit. The OSes are so good nowadays at understanding the behaviour of CPUs that it's almost always best to simply let the OS run your processes and cores whereever it chooses. Companies like Intel make large code contributions to the Linux project, specifically to ensure that Linux does this as well as possible on the latest and greatest chips.
==EDIT==
Additions in the light of engaging comments!
By "pure NUMA" I really mean systems where one CPU core cannot directly address memory physically attached to another CPU core. Such systems include Transputers, and even the Cell processor found in the Sony PS3. These aren't SMP, there's nothing in the silicon that unifies the separate memories into a single address space, so the question of cache coherency doesn't come into it.
With Transputer systems the only way to access memory attached to another transputer was to have the application software send the data over via a serial link; what made it CSP was that the sending application would finish sending until the receiving application had read the last byte.
For the Cell processor, there were 8 maths cores each with 256kbyte of RAM. That was the only RAM the maths cores could address. To use them the application had to move data and code into that 256k of RAM, tell the core to run, and then move the results out (possibly back out to RAM, or onto another maths core).
There are some supercomputers today that aren't disimilar to this. The K machine (Riken, Kobe in Japan) has an awful lot of cores, a very complex on-chip interconnect fabric, and OpenMPI is used by applications to move data around between nodes; nodes cannot directly address memory on other nodes.
The point is that on the PS3 it was up to application software to decide what data was in what memory and when, whereas modern x86 implementations from Intel and AMD make all data in all memories (no matter if they're shared via an L3 cache or are remote at the other end of a hypertransport or QPI link) accessible from any cores (that's what SMP means afterall).
The all out performance of code written on the Cell process was truly astounding for the Watts and transistor count. Trouble was in a world where programemrs are trained in writing for SMP environments, it takes a brain transplant to get to grips with one that isn't.
Newer languages like Rust and Go have reintroduced the concept of communicating sequential processes, which is all one had with Transputers back in the 1980s, early 1990s. CSP is almost ideal for multicore systems as the hardware does not need to implement an SMP environment. In principle this saves an awful lot of silicon.
CSP implemented on top of today's cache coherent SMP chips in languages like C generally involves a thread writing data into a bufffer, and that being copied into a buffer belonging to another thread (Rust can do it a little differently because Rust knows about memory ownership, and so can transfer ownership instead of copying memory. I'm not familiar with Go - maybe it can do the same).
Looked at at the microelectronic level, copying data from one buffer to another is not really any different to what happens if the data is shared by 2 cores instead of copied (especially in AMD's hypertransport CPUs where each has its own memory system). To share data, the remote core has to use hypertransport to request data from another core's memory, and more traffic to maintain cache coherency. That's about the same amount of hypertransport traffic as if the data where copied from one core to the other, but then there's no subsequent cache coherency traffic.
Related
My goal is to read in stale and outdated values of memory without cache-coherence. I have attempted to use prefetchnta to perform a non-temporal load, but it failed to fetch outdated values. I am looking into performing some kind of Streaming Memory-to-Memory Direct-Memory-Access, but am having a little trouble due to the overwhelming amount of background knowledge required to proceed with my current project. Currently I am attempting to mess around with udmabuf but even that is going slowly. It should be noted that ideally I would like to ignore the contents of all CPU caches, including the current CPU.
To provide my reasoning as to why: I am developing software that can be used to prove correctness of programs written for non-volatile memory. As the CPU Cache is volatile, the CPU's write-back cache will still be volatile and the arbitrary nature of how they are written back to memory needs to be observed.
I would sincerely appreciate it if someone could give me some pointers of how to proceed. I do not mind digging into the Linux kernel, as in fact I am doing that now, nor do I mind modifying it, I just need a little guidance in the right direction.
I haven't played around with this, but my understanding from the docs is that for loads (unlike NT stores) nothing can bypass cache or override the strong ordering of memory types like the normal WB (write-back). And even NT stores evict already-cached data, so they can't break coherence for this or another core that has cached data for the line you're writing.
You can do weakly-ordered loads from WC (write-combining) memory regions (with prefetchnta or SSE4 movntdqa), but they're probably still coherent at the physical address level.
#MargaretBloom commented
IIRC Intel warns the developer about multiple mapping with different cache types, which may indeed be good in this case.
so maybe you could actually bypass cache coherence with multiple virtual mappings of the same physical page.
I don't know if it's possible to do non-coherent DMA with a PCI / PCIe device, but that might be your only hope for getting actual DRAM contents without going through cache.
Normally (always?) DMA on modern x86 systems is cache-coherent, which is good for performance. To maintain backwards compat with 386 and earlier CPUs without caches, the first x86 CPUs with caches had cache-coherent DMA, not introducing cache-control instructions until later generations, since existing OSes didn't use them. In modern systems, memory controllers are built-in to the CPU. So on Intel CPUs, the system agent can snoop L3 tags to see if a line is cached anywhere on-chip in parallel with sending the request to the memory controller. Or a Xeon can DMA right into L3 cache without data having to bounce through DRAM, good for high bandwidth NICs.
There's an INVD instruction which invalidates all caches without doing write-back first, but I think that includes the shared L3 cache, and probably the private caches of all other cores. So you can't practically use it on a Linux system where other cores are potentially in the middle of doing stuff; you'd potentially corrupt kernel data structures by using it, as well as simulating power failure on a machine with NVDIMMs for the process you were interested in.
Maybe if you somehow offlined all the other CPU cores, and disabled interrupts on the one core that was still up
you could wbinvd (write-back+invalidate) to flush all caches
then run some code under test
then invd and see what made it to DRAM
Then re-enable interrupts. Interrupt handlers could end up with some kernel data cached and some in memory, or get device drivers out of sync with hardware, if any interrupts are handled between the wbinvd and the invd.
Update: someone did actually attempt this:
How to run "invd" instruction with disabled SMP support?
How to explicitly load a structure into L1d cache? Weird results with INVD with CR0.CD = 1 on isolated core with/without hyperthreading - invd worked so well it nuked some of the stores done by printk in the mis-designed attempt to log something about it.
Assume x86 multi-core PC architecture...
Lets say there are 2 cores (capable of executing 2 separate streams of instructions) and that the interface between the CPU and RAM is a memory bus.
Can 2 instructions (which access some memory) that are scheduled on the 2 different cores truly be simultaneous on such a machine?
I'm not talking about a case where the 2 instructions are accessing the same memory location. Even in the case where the 2 instructions are accessing completely different memory locations (and lets also assume that the memory contents for these locations are not in any cache), I would think that the single memory bus sitting in between the CPU and RAM (which is very common) would cause these 2 instructions to be serialized by the bus arbitration circuitry:
CPU0 CPU1
mov eax,[1000] mov ebx,[2000]
Is this true? If so, what is the advantage of having multiple cores if the software you will run is multi-threaded but has lots of memory accesses? Wouldn't these instructions all be serialized at the end?
Also, if this is true, whats the point of the LOCK prefix in x86 which is used for making a memory-access instruction atomic?
You need to check a few concepts of x86 architecture to answer that:
speculative execution (and out of order)
load store buffer
MESI protocol
load forwarding
memory barriers
NUMA
basically, my guess is your instructions will be absolutely parallel executed but the result in memory will be one or the other of the thread and the election will be decided by MESI hardware.
to extend on the answer, when you have multiple flow and single data (http://en.wikipedia.org/wiki/MISD) you need to expect serialization. Note that this can be mitigated if you access different memory adresses, notably on NUMA systems.
Opterons and new i7 has NUMA hardware, but the OS need to activate them, and its not by default. if you have NUMA, you can use the advantage of one bus to connect one core to one memory zone. however the core must be the owner of that zone, which should be verified if the core allocated its zone itself.
In all other hardware there will be serialization, but if the memory addresses are different they will not hinder on the write performance (no wait before end of write) thanks to the store buffer, and L2 intermediate caching. L2 content is commited to RAM later and L2 is by core so serialization happens but do not hinder CPU instructions that can continue on ahead.
EDIT about the LOCK question:
lock x86 instruction is about flushing load store buffers so that other cores can obtain visibility on the current values operated on in the instruction pipeline. this is much closer to the CPU than the RAM writing problem. LOCK allows that cores are not working on their local view of some variable content because without it, the CPU assumes any optimization it can considering only one thread, meaning it will often keep everything in registers and not rely on cache. It can ever go slightly ahead of that, when you consider load fowarding, or more preciselly called store to load forwarding.
It is possible to pin a process to a specific set of CPU cores using sched_setaffinity() call. The manual page says:
Restricting a process to run on a single CPU also avoids the
performance cost caused by the cache invalidation that occurs when a process
ceases to execute on one CPU and then recommences execution on a different
CPU.
Which is almost an obvious thing (or not?). What is not that obvious to me is this -
Does pinning LWPs to a specific CPU or an SMP node reduces a cache coherency bus traffic? For example, since a process is running pinned, other CPUs should not modify its private memory, thus only CPUs that are part of the same SMP node should stay cache-coherent.
There should be no CPU socket-to-socket coherency traffic for the pinned process case you describe. Modern Xeon platforms implement snoop filtering in the chipset. The snoop filter indicates when a remote socket cannot have the cache line in question, thus avoiding the need to send cache invalidate messages to that socket.
You can measure this for yourself. Xeon processors implement a large variety of cache statistic counters. You can read the counters in your own code with the rdpmc instruction or just use a product like VTune. FYI, using rdpmc is very precise, but a little tricky since you have to initially set a bit in CR4 to allow using this instruction in user mode.
-- EDIT --
My answer above is outdated for the 55xx series of CPUs which use QPI links. These links interconnect CPU sockets directly without an intervening chipset, as in:
http://ark.intel.com/products/37111/Intel-Xeon-Processor-X5570-%288M-Cache-2_93-GHz-6_40-GTs-Intel-QPI%29
However, since the L3 cache in each CPU is inclusive, snoops over the QPI links only occur when the local L3 cache indicates the line is nowhere in the local socket. Likewise, the remote socket's L3 can quickly respond to a cross-snoop without bothering the cores, assuming the line isn't there either.
So, the inclusive L3 caches should minimize inter-socket coherency overhead, it's just not due to a chipset snoop filter in your case.
If you run on a NUMA system (like, Opteron server or Itanium), it makes sense, but you must be sure to bind a process to the same NUMA node that it allocates memory from. Otherwise, this is an anti-optimization. It should be noted that any NUMA-aware operating system will try to keep execution and memory in the same node anyway, if you don't tell it anything at all, to the best of its abilities (some elderly versions of Windows are rather poor at this, but I wouldn't expect that to be the case with recent Linux).
If you don't run on a NUMA system, binding a process to a particular core is the one biggest stupid thing you can do. The OS will not make processes bounce between CPUs for fun, and if a process must be moved to another CPU, then that is not ideal, but the world does not end, either. It happens rarely, and when it does, you will hardly be able to tell.
On the other hand, if the process is bound to a CPU and another CPU is idle, the OS cannot use it... that is 100% available processing power gone down the drain.
How to give single common address space for all tasks. IF its happening like this can we avoid virtual to physical memory mapping.
I f all task sharing common address space then how can we avoid virtual to physical memory mapping.
There are a few modern (research) OS's that do this, like Singularity and there are performance benefits, primarily because it no longer needs to do context changes and the file/symbol loader no longer needs to do address translation for global caches and kernel functions.
You do need to be a bit more specific about what you're looking for, tho'. You tagged your post as OSX and Linux, both of which require virtual memory. When running on systems without a MMU (and thus no virtual memory) it emulates it, which I'm fairly certain you can't circumvent. I'm not an expert by any means.
uClinux is an implementation of Linux that runs on processors that lack an MMU (such as ARM7), so by definition must have a single address space for all tasks.
So one answer to "how" is "use uClinux".
You tagged this VxWorks, and there is another answer; VxWorks supports a flat memory. In fact when I last used it the MMU protection was an (expensive) add on. Many other RTOS designed for micro controllers similarly do not support an MMU, such as eCOS, and FreeRTOS.
Of RTOS's that do support an MMU, QNX is probably amongst the most robust and mature, while still maintaining high performance.
I'm not sure why you would want to disable virtual memory mapping - it's a built in function of the cpu, and pretty much essential when running an OS to properly isolate processes from each other.
Most operating systems allow you to disable virtual memory, so that your memory capacity is limited by physical memory. However, A processes address space is still virtual, and virtual to physical mapping is still happening.
A way to get what you want is to run an operating system that executes in Real Mode, such as DOS or Windows 3.0, or write your own.
The advantages of virtual memory far outweigh the disadvantages. Why do you want to avoid virtual memory.
This is how some older operating systems and even how some modern operating systems that lack VM still work. It has many disadvantages for things like desktop and server applications but it can be useful in an embedded and/or real-time context, or where you have minimal hardware.
The VxWorks AE(Advanced Edition supports) deviates from the concept of Common address space for all tasks.So it can effectively be used in both systems with MMU and without MMU .The common address space for all tasks is called flat memory model and the separate address space for different tasks is called over lapped memory model or segmented memory model.You should not confuse the memory model with the memory lay out as seen in object files which divides data in to Code Segment ,Data Segment ,BSS etc .Both are entirely different things :).
This link in stack overflow will help better
Difference between flat memory model and protected memory model?
Azul Systems has an appliance that supports thousands of cache coherent CPUs. I would love insight into what changes would need to occur to an operating system in order to schedule thousands of simultaneously running threads.
Scheduling thousands of threads is not a big deal, but scheduling them on hundreds of CPUs is. What you need, first and foremost, is very fine-grained locking, or, better yet, lock-free data structures and algorithms. You just can't afford to let 200 CPUs waiting while one CPU executes a critical section.
You're asking for possible changes to the OS, so I presume there's a significant engineering team behind this effort.
There are also a few pieces of clarififying info that would help define the problem parameters:
How much IPC (inter process communication) do you need?
Do they really have to be threads, or can they be processes?
If they're processes, is it okay if the have to talk to each other through sockets, and not by using shared memory?
What is the memory architecture? Are you straight SMP with 1024 cores, or is there some other NUMA (Non-Uniform Memory Architecture) or MMP going on here? What are your page tables like?
Knowing only the very smallest of info about Azul systems, I would guess that you have very little IPC, and that a simple "run one kernel per core" model might actually work out just fine. If processes need to talk to each other, then they can create sockets and transfer data that way. Does your hardware support this model? (You would likely end up needing one IP address per core as well, and at 1024 IP addrs, this might be troublesome, although they could all be NAT'd, and maybe it's not such a big deal). If course, this model would lead to some inefficiencies, like extra page tables, and a fair bit of RAM overhead, and may even not be supported by your hardware system.
Even if "1 kernel per core" doesn't work, you could likely run 1024/8 kernels, and be just fine, letting each kernel control 8 physical CPUs.
That said, if you wanted to run 1 thread per core in a traditional SMP machine with 1024 cores (and only a few physical CPUs) then I would expect that the old fashioned O(1) scheduler is what you'd want. It's likely that your CPU[0] will end up nearly 100% in kernel and doing interrupt handling, but that's just fine for this use case, unless you need more than 1 core to handle your workload.
Making Linux scale has been a long and ongoing project. The first multiprocessor capable Linux kernel had a single lock protecting the entire kernel (the Big Kernel Lock, BKL), which was simple, but limited scalability.
Subsequently the locking has been made more fine-grained, i.e. there are many locks (thousands?), each covering only a small portion of data. However, there are limits to how far this can be taken, as fine-grained locking tends to be complicated, and the locking overhead starts to eat up the performance benefit, especially considering that most multi-CPU Linux systems have relatively few CPU's.
Another thing, is that as far as possible the kernel uses per-cpu data structures. This is very important, as it avoids the cache coherency performance issues with shared data, and of course there is no locking overhead. E.g. every CPU runs its own process scheduler, requiring only occasional global synchronization.
Also, some algorithms are chosen with scalability in mind. E.g. some read-mostly data is protected by Read-Copy-Update (RCU) instead of traditional mutexes; this allows readers to proceed during a concurrent update.
As for memory, Linux tries hard to allocate memory from the same NUMA node as where the process is running. This provides better memory bandwidth and latency for the applications.
My uneducated guess would be that there is a run-queue per processor and a work-stealing algorithm when a processor is idle. I could see this working in an M:N model, where there is a single process per cpu and light-weight processes as the work items. This would then feel similar to a work-stealing threadpool, such as the one in Java-7's fork-join library.
If you really want to know, go pick up Solaris Internals or dig into the Solaris kernel code. I'm still reading Design & Impl of FreeBSD, with Solaris Internals being the next on my list, so all I can do is make wild guesses atm.
I am pretty sure that the SGI Altix we have at work, (which does ccNUMA) uses special hardware for cache coherency.
There is a huge overhead connected to hold 4mb cache per core coherent. It's unlikely to happen in software only.
in an array of 256 cpus you would need 768mb ram just to hold the cache-invalidation bits.
12mb cache / 128 bytes per cache line * 256² cores.
Modifying the OS is one thing, but using unchanged application code is a waste of hardware. When going over some limit (depending on the hardware), the effort to keep coherency and synchronization in order to execute generic code is simply too much. You can do it, but it will be very expensive.
From the OS side you'll need complex affinity model, i.e. not to jump CPUs just because yours is busy. Scheduling threads based on hardware topology - cooperating threads on CPUs that are "close" to minimize penalties. Simple work stealing is not a good solution, you must consider topology. One solution is hierarchical work stealing - steal work by distance, divide topology to sectors and try to steal from closest first.
Touching a bit the lock issue; you'll still use spin-locks nd such, but using totally different implementations. This is probably the most patented field in CS these days.
But, again, you will need to program specifically for such massive scale. Or you'll simply under-use it. No automatic "parallelizers" will do it for you.
The easiest way to do this is to bind each process/thread to a few CPUS, and then only those CPUs would have to compete for a lock on that thread. Obviously, there would need to be some way to move threads around to even out the load, but on a NUMA architecture, you have to minimize this as much as possible.
Even on dual-core intel systems, I'm pretty sure that Linux can already handle "Thousands" of threads with native posix threads.
(Glibc and the kernel both need to be configured to support this, however, but I believe most systems these days have that by default now).