I learnt that multi-core processors have more than one processing unit (i.e. the main execution units: ALU etc.) and that they perform better. I want to know how they share physical memory. I'll take the following example to make my question clearer. Say there is a memory location M in physical memory and two threads, T1 and T2, running on different cores. Is it possible for T1 and T2 to access M at the same instant, or do they have to wait for one another to complete the access? That is, do they share the same memory bus, so that each has to wait for the other, or can they read M at the same instant over two different memory buses? If the former is the case, there is not much performance gain, right, since they have to wait for the memory bus to be free?
Summarising: are memory operations independent across cores, or can each core only access physical memory when the memory bus is free?
Memory access is arbitrated on the memory side, not by the cores themselves: your CPU cores can request access, but which request is served first is decided by the memory controller in front of the RAM.
Related
In a 32-core system, a process (A) fully consumes 4 cores (400% CPU usage in top). The rest of the cores are available. Does it impact the performance of another process (B)? Would process B run better if process A were not running, and if so, why?
Process B uses Boost and multiple threads (say 24).
I was expecting the performance of process B not to be impacted by process A, as there are 32 cores.
In general, yes, running a process can slow down others even though not all cores are active. In practice, the impact is strongly dependent on the code being executed.
This can happen because some hardware resources are shared. The most common ones are storage devices, the network, the RAM and the LLC cache (typically an L3). For example, a few cores are generally enough to saturate the RAM bandwidth, so using more than 8 cores is generally not significantly faster if the two processes are memory bound. HDD storage devices tend not to be faster in parallel, so when 2 processes try to use one heavily at the same time they are often significantly slower. In practice, they can be more than 2 times slower because HDDs have a high seek time, and a process doing many random accesses can drastically slow down a process reading/writing large contiguous files.
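The bandwidth saturation described above can be illustrated with a deliberately simple model. This is only a sketch: the function name and the per-core and RAM bandwidth figures are hypothetical, and real memory systems do not saturate this sharply.

```python
def aggregate_throughput(cores, per_core_gbps, ram_gbps):
    """Total memory throughput of `cores` memory-bound cores.

    Toy model: each core can pull `per_core_gbps` on its own, but the
    sum can never exceed what the RAM can deliver (`ram_gbps`).
    """
    return min(cores * per_core_gbps, ram_gbps)


PER_CORE_GBPS = 10.0  # hypothetical figure, not a measurement
RAM_GBPS = 40.0       # hypothetical figure, not a measurement

throughput = {
    cores: aggregate_throughput(cores, PER_CORE_GBPS, RAM_GBPS)
    for cores in (1, 2, 4, 8, 16)
}
for cores, gbps in throughput.items():
    print(f"{cores:2d} cores -> {gbps:.0f} GB/s")
```

With these made-up numbers, throughput stops growing at 4 cores. Any further memory-bound threads, whether from the same process or another one, only divide the same total.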
On NUMA systems, things can be a bit more complex, since 2 processes operating on the same NUMA node can be slower than 2 processes running on different NUMA nodes, due to saturation of the RAM of the target node and to the NUMA allocation policy. In some rare cases, 2 processes running on different NUMA nodes can be slower than 2 processes running on the same NUMA node. This is true if the processes communicate with each other (due to the higher latency between cores belonging to different NUMA nodes) or if the processes communicate with hardware resources bound to specific NUMA nodes other than the ones where the processes are running (e.g. a GPU with a high-performance interconnect, a high-performance InfiniBand device, etc.).
Note that some software resources can also be shared. The operating system can lock them to ease the maintenance of some parts of its code, or simply because the resource fundamentally cannot be used in parallel in a way that scales. Historically, some OSes used a giant lock that prevented nearly all system calls from scaling. Such locks have been progressively replaced with finer-grained locks or no lock at all (e.g. atomics) as multi-core processors became widespread. Note that even atomic data structures do not scale very well on most processors, so system calls operating on the same data structure tend to impact other running processes on many-core systems. Still, the biggest issue is generally the saturation of shared hardware resources.
More than once I have come across the terms "non-coherent" and "coherent" memory in technical papers related to graphics programming. I have been searching for a simple and clear explanation, but have found mostly 'hardcore' papers of this type. I would be glad to receive a layman's-style answer on what coherent memory actually is on GPU architectures and how it compares to other (probably non-coherent) memory types.
Memory is memory. But different things can access that memory. The GPU can access memory, the CPU can access memory, maybe other hardware bits, whatever.
A particular thing has "coherent" access to memory if changes made by others to that memory are visible to the reader. Now, you might think this is foolishness. After all, if the memory has been changed, how could someone possibly be unable to see it?
Simply put, caches.
It turns out that changing memory is expensive. So we do everything possible to avoid changing memory unless we absolutely have to. When you write a single byte from the CPU to a pointer in memory, the CPU doesn't write that byte yet. Or at least, not to memory. It writes it to a local copy of that memory called a "cache."
The reason for this is that, generally speaking, applications do not write (or read) single bytes. They are more likely to write (and read) lots of bytes, in small chunks. So if you're going to perform an expensive operation like a memory load or store, you should load or store a large chunk of memory. So you store all of the changes you're going to make to a chunk of memory in a cache, then make a single write of that cached chunk to actual memory at some point in the future.
But if you have two separate devices that use the same memory, you need some way to be certain that writes one device makes are visible to other devices. Most GPUs can't read the CPU cache. And most CPU languages don't have language-level support to say "hey, that stuff I wrote to memory? I really mean for you to write it to memory now." So you usually need something to ensure visibility of changes.
In Vulkan, memory which is labeled by VK_MEMORY_PROPERTY_HOST_COHERENT_BIT means that, if you read/write that memory (via a mapped pointer, since that's the only way Vulkan lets you directly write to memory), you don't need to use functions vkInvalidateMappedMemoryRanges/vkFlushMappedMemoryRanges to make sure the CPU/GPU can see those changes. The visibility of any changes is guaranteed in both directions. If that flag isn't available on the memory, then you must use the aforementioned functions to ensure the coherency of the specific regions of data you want to access.
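To make the flush requirement concrete, here is a toy model of a non-coherent mapping. It is a sketch under stated assumptions, not Vulkan code: the class `NonCoherentMapping` and its methods are hypothetical names, and `flush()` only plays the role that vkFlushMappedMemoryRanges plays for host writes.

```python
class NonCoherentMapping:
    """Toy model: host writes sit in a host-side cache until flushed."""

    def __init__(self, size):
        self.device_memory = bytearray(size)  # what the device would read
        self.host_cache = {}                  # pending host writes: offset -> byte

    def host_write(self, offset, value):
        # The write lands in the host cache only; the device cannot see it yet.
        self.host_cache[offset] = value

    def flush(self):
        # Stand-in for the role of vkFlushMappedMemoryRanges in this model:
        # push all pending host writes out to device-visible memory.
        for offset, value in self.host_cache.items():
            self.device_memory[offset] = value
        self.host_cache.clear()

    def device_read(self, offset):
        return self.device_memory[offset]


mapping = NonCoherentMapping(16)
mapping.host_write(0, 42)
before_flush = mapping.device_read(0)  # device still sees the old value
mapping.flush()
after_flush = mapping.device_read(0)   # device now sees the host's write
print(before_flush, after_flush)
```

In this model, coherent memory would correspond to `host_write` storing straight into `device_memory`, so no flush step would be needed.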
With coherent memory, one of two things is going on in terms of hardware. Either CPU access to the memory is not cached in any of the CPU's caches, or the GPU has direct access to the CPU's caches (perhaps due to being on the same die as the CPU(s)). You can usually tell that the latter is happening, because on-die GPU implementations of Vulkan don't bother to offer non-coherent memory options.
If memory is coherent then all threads accessing that memory must agree on the state of the memory at all times, e.g.: if thread 0 reads memory location A and thread 1 reads the same location at the same time, both threads should always read the same value.
But if memory is not coherent then threads 0 and 1 might read back different values. Thread 0 could think that location A contains a 1, while thread 1 thinks that the same location contains a 2. The different threads would have an incoherent view of the memory.
Coherence is hard to achieve with a high number of cores. Often every core must be aware of memory accesses from all other cores. In a quad-core CPU, coherence is not that hard to achieve, as every core must be informed about the memory accesses of only 3 other cores; but in a GPU with 16 cores, every core must be made aware of the memory accesses of 15 other cores. The cores exchange data about the contents of their caches using so-called "cache coherence protocols".
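The growth can be put in numbers with a small calculation. This sketch assumes the simplest scheme, where every core notifies every other core directly; the function names are made up for illustration, and real protocols use various techniques to reduce this traffic.

```python
def peers_per_core(cores):
    """Number of other cores each core has to keep informed."""
    return cores - 1


def core_to_core_links(cores):
    """Ordered (sender, receiver) pairs if every core notifies every other."""
    return cores * (cores - 1)


links = {cores: core_to_core_links(cores) for cores in (4, 16)}
for cores, count in links.items():
    print(f"{cores} cores: {peers_per_core(cores)} peers each, {count} links")
```

Going from 4 to 16 cores multiplies the number of cores by 4 but the number of sender/receiver pairs by 20, from 12 to 240.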
This is why GPUs often only support limited forms of coherency. If some memory locations are read only or are only accessed by a single thread, then no coherence is required. If caches are small and coherence is not always required but only at specific instructions of the program, then it is possible to achieve correct behavior of the program using cache flushes before or after specific memory accesses.
If your hardware offers both coherent and non-coherent memory types, then you can expect that non-coherent memory will be faster, but if you try to run parallel algorithms using this memory they will fail in really weird ways.
This question is about DRAM speeds and memory interleaving. I have a very specific problem. I am using a Power-architecture-based board (minus the AltiVec) and I wish to copy a large segment of memory (virtually contiguous) between two regions within my process' address space. To offset the slowness of my core, I pinned two threads to two CPUs, and that made the copy a lot faster.
However, that was still not fast enough, so I added a third thread, and it made no difference to copy times whatsoever. I did more research on this and found that my board was equipped with a single DDR3 RAM module (speed 1600 MB/s) and that it was already pretty close to the maximum attainable speed.
[ Some explanation here: With just 2 threads, I am copying, say 5500 pages of size 4K in around 16.5 milliseconds. If you do a simple calculation, it would seem that the minimum time in theory that you could clock (bar all prefetches and stuff) is 13.75 milliseconds. ]
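For reference, the 13.75 ms figure can be reproduced as follows. The sketch assumes decimal units throughout (4K taken as 4 KB, 1 MB as 1000 KB), which is what yields the number quoted above; the constant names are only illustrative.

```python
PAGES = 5500
PAGE_SIZE_KB = 4            # "4K" pages, treated as 4 KB (assumption)
BANDWIDTH_MB_PER_S = 1600   # quoted speed of the single DDR3 module
MEASURED_MS = 16.5          # measured copy time with 2 threads

total_mb = PAGES * PAGE_SIZE_KB / 1000               # 22 MB to move
min_time_ms = total_mb / BANDWIDTH_MB_PER_S * 1000   # time at full bandwidth
efficiency = min_time_ms / MEASURED_MS               # fraction of peak reached

# Same bound with 4096-byte pages and 1 MB = 10**6 bytes, for comparison.
min_time_4096_ms = PAGES * 4096 / (BANDWIDTH_MB_PER_S * 10**6) * 1000

print(f"theoretical minimum: {min_time_ms:.2f} ms")
print(f"with 4096-byte pages: {min_time_4096_ms:.2f} ms")
print(f"measured copy reaches {efficiency:.0%} of the quoted bandwidth")
```

Either way, the two-thread copy already runs at more than 80% of the quoted bandwidth, which is consistent with the third thread making no difference.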
I discovered that I could add an extra RAM module to my board, which I could possibly get my company to fund by telling them I also intend to halve the size of each stick of memory. But how can I get the kernel to allocate me memory that is guaranteed to be evenly distributed across both modules?
Thanks a lot for answering!
P.S. I am using Linux kernel version 2.6.34.
See if your Linux / board combination supports the NUMA (Non-uniform memory access) extensions. You can specify interleaving policies through libnuma:
The libnuma library offers a simple programming interface to the NUMA
(Non Uniform Memory Access) policy supported by the Linux kernel. On a
NUMA architecture some memory areas have different latency or
bandwidth than others.
Available policies are page interleaving (i.e., allocate in a
round-robin fashion from all, or a subset, of the nodes on the
system), preferred node allocation (i.e., preferably allocate on a
particular node), local allocation (i.e., allocate on the node on
which the task is currently executing), or allocation only on specific
nodes (i.e., allocate on some subset of the available nodes). It is
also possible to bind tasks to specific nodes.
Assume x86 multi-core PC architecture...
Let's say there are 2 cores (capable of executing 2 separate streams of instructions) and that the interface between the CPU and RAM is a memory bus.
Can 2 instructions (which access some memory) that are scheduled on the 2 different cores truly be simultaneous on such a machine?
I'm not talking about a case where the 2 instructions are accessing the same memory location. Even in the case where the 2 instructions are accessing completely different memory locations (and let's also assume that the memory contents for these locations are not in any cache), I would think that the single memory bus sitting between the CPU and RAM (which is very common) would cause these 2 instructions to be serialized by the bus arbitration circuitry:
CPU0                 CPU1
mov eax,[1000]       mov ebx,[2000]
Is this true? If so, what is the advantage of having multiple cores if the software you will run is multi-threaded but has lots of memory accesses? Wouldn't these instructions all be serialized at the end?
Also, if this is true, what's the point of the LOCK prefix in x86, which is used for making a memory-access instruction atomic?
You need to check a few concepts of x86 architecture to answer that:
speculative execution (and out of order)
load store buffer
MESI protocol
load forwarding
memory barriers
NUMA
Basically, my guess is that your instructions will be executed absolutely in parallel, but the result in memory will be that of one thread or the other, and the choice will be made by the MESI hardware.
To extend the answer: when you have multiple instruction streams and a single piece of data (http://en.wikipedia.org/wiki/MISD), you need to expect serialization. Note that this can be mitigated if you access different memory addresses, notably on NUMA systems.
Opterons and newer i7s have NUMA hardware, but the OS needs to activate it, and it is not enabled by default. If you have NUMA, you can take advantage of one bus connecting one core to one memory zone. However, the core must be the owner of that zone, which should be the case if the core allocated the zone itself.
On all other hardware there will be serialization, but if the memory addresses are different it will not hinder write performance (no waiting for the write to finish) thanks to the store buffer and intermediate L2 caching. L2 content is committed to RAM later and L2 is per core, so serialization happens but does not hinder CPU instructions, which can continue ahead.
EDIT about the LOCK question:
The x86 LOCK prefix is about flushing the load/store buffers so that other cores can see the current values being operated on in the instruction pipeline. This is much closer to the CPU than the RAM-writing problem. LOCK ensures that cores are not working on their local view of some variable's content; without it, the CPU applies any optimization it can while considering only one thread, meaning it will often keep everything in registers and not rely on the cache. It can even go slightly further than that when you consider load forwarding, more precisely called store-to-load forwarding.
I've read a lot on this topic already both here (e.g., stackoverflow.com/questions/1713554/threads-processes-vs-multithreading-multi-core-multiprocessor-how-they-are or multi-CPU, multi-core and hyper-thread) and elsewhere (e.g., ixbtlabs.com/articles2/cpu/rmmt-l2-cache.html or software.intel.com/en-us/articles/multi-core-introduction/), but I still am not sure about a couple things that seem very straightforward. So I thought I'd just ask.
(1) Is a multi-core processor in which each core has dedicated cache effectively the same as a multiprocessor system (balanced of course for processor speed, cache size, and so on)?
(2) Let's say I have some images to analyze (i.e., computer vision), and I have these images loaded into RAM. My app spawns a thread for each image that needs to be analyzed. Will this app on a shared cache multi-core processor run slower than on a dedicated cache multi-core processor, and would the latter run at the same speed as on an equivalent single-core multiprocessor machine?
Thank you for the help!
The size of the cache is important. For the sake of this answer I'm assuming x86 processors and considering only the L2 cache, which is shared on dual-core processors.
If you are comparing 2 single-core processors with 1 dual-core processor, and each single-core processor has the same amount of data cache as the dual-core one (running at the same speed), then the two single-core processors give you more cache in total. More portions of the images can fit into cache, and if the processing of the image data has to load and/or store this data repeatedly, it is very likely to go more quickly at the same clock speeds.
If you are comparing 2 single-core processors with 1 dual-core processor whose data cache is twice the size of each single-core processor's data cache, then about half of the data cache will be used for each core's work. It is quite likely that, in addition to the image data that each independent thread has to use, there will be some shared data. If this shared data is stored in the shared data cache, it can be shared between the two cores more easily than on the 2x single-core setup. On the 2x single-core setup, for each chunk of shared data one of the caches would store it, and there would be a little bit of overhead when the other processor needed to use that data.
Dual core machines also make it easier for threads to migrate from one core to another on the same processor module, because the cache of the thread's new processor does not need to be filled while the other has data that it doesn't need anymore taking up space.
Whatever you end up with, I'd suggest that you experiment with limiting the number of threads to 3 to 10 per core at any time for general use. The threads will all be competing with each other for that cache space, so too many will make it so that all of the data from 1 thread is pushed out before that thread is rescheduled. Also, if each thread can loop over a few image files, you gain a little by encouraging each thread's stack space to stay in cache, because you have fewer stacks. You also reduce the amount of memory that the OS has to use to keep up with threads.
Your biggest win is when you can overlap processing with slow access, such as disk, network, or human interaction, so just enough threads to keep the CPUs busy processing is what you need.
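A bounded pool in which each worker thread loops over several images, instead of one thread per image, could be sketched like this. The `analyze` function is a hypothetical stand-in for the real analysis, and `threads_per_core=3` simply takes the low end of the range suggested above. Note that in standard CPython the global interpreter lock keeps pure-Python CPU work from running in parallel, so this shows only the structure, not the speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def analyze(image):
    """Hypothetical stand-in for real image analysis: sums the pixel bytes."""
    return sum(image)


def analyze_all(images, threads_per_core=3):
    # Cap the pool at a few threads per core, and never more than the
    # number of images, so each worker handles several images in turn.
    workers = (os.cpu_count() or 1) * threads_per_core
    workers = max(1, min(workers, len(images)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze, images))


# Eight fake "images" of 1000 bytes each, already loaded in memory.
images = [bytes([value] * 1000) for value in range(8)]
results = analyze_all(images)
print(results)
```

Because `pool.map` preserves input order, `results[i]` corresponds to `images[i]` regardless of which worker processed it.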