Linux has two different ways to manage shared memory: shm_open()/mmap() and shmget()/shmat(). What are the pros and cons of each? How do I decide which one to choose for my application?
I am asking myself the same question, and I don't know the final answer, but I thought I would share my benchmarks.
I have found that, out of the box, the POSIX shm_open framework is faster than the System V shmget.
In my benchmark, I write 32 GB of memory from one process, then read and verify the same 32 GB in another. I use ZeroMQ to pass ownership tokens from the writer to the reader to keep things in sync. The memory block size is actually pretty small, 32 KB, but I've found this doesn't seem to be a rate-limiting factor, nor does the presence of ZeroMQ make a big difference.
I've found that the net throughput using POSIX shared memory is about 40% faster than System V shared memory. Specifically, the net operation (write + read) operates at a sustained rate of 3.5 GB/s for the POSIX shared memory and 2.5 GB/s for the System V shared memory.
As to why this is true, I don't know. I've also found these benchmarks a bit slippery if you don't pin down process and memory affinities, although the speed difference appears to be present regardless of any combination of CPU binding and memory binding (using numactl).
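For reference, here is a minimal sketch of how the two setup paths differ on the writer side. It is not the actual benchmark harness (which also involves ZeroMQ token passing and verification); the name "/bench_shm", the key 0x1234 and the 32 KB block size are arbitrary placeholders:

```c
/* Minimal sketch of the two setup paths (writer side), not the full benchmark.
 * Compile with: gcc shm_compare.c -lrt */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLOCK_SIZE (32 * 1024)   /* 32 KB block, as in the benchmark */

int main(void)
{
    /* --- POSIX: shm_open() + ftruncate() + mmap() --- */
    int fd = shm_open("/bench_shm", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, BLOCK_SIZE) < 0) { perror("ftruncate"); return 1; }
    void *posix_blk = mmap(NULL, BLOCK_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
    if (posix_blk == MAP_FAILED) { perror("mmap"); return 1; }
    memset(posix_blk, 0xAB, BLOCK_SIZE);          /* writer fills the block */

    /* --- System V: shmget() + shmat() --- */
    int shmid = shmget((key_t)0x1234, BLOCK_SIZE, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }
    void *sysv_blk = shmat(shmid, NULL, 0);
    if (sysv_blk == (void *)-1) { perror("shmat"); return 1; }
    memset(sysv_blk, 0xAB, BLOCK_SIZE);

    /* Cleanup (a real reader would attach with the same name/key first). */
    munmap(posix_blk, BLOCK_SIZE);
    shm_unlink("/bench_shm");
    shmdt(sysv_blk);
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}
```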
Above is a picture summarizing my understanding of the memoryHeaps and their memoryTypes reported by Vulkan for a given system setup. Thanks to the answers on this topic shared by #NicolBolas (1, 2, 3) and an answer by #krOoze (4).
Still, I have a few outstanding questions that I would like help on; I have indicated them in red and elaborated on them below, per the comment of #NicolBolas.
Questions
1. Why are there 9 memoryTypes in sysRam when there are only 4x RAMs?
2. What is the physical meaning of each memoryType? How do I use each of these memoryTypes?
3. Why are there 2 memory types for GPU RAM? Does this mean each memoryType of the GPU RAM is 6144 MB / 2 = 3072 MB?
4. Is there a size limit to each memoryType? If yes, how do I discover their limits?
5. Why are the free memory values reported by Vulkan and cat /proc/meminfo different?
Thanks for your help in advance.
Why are there 9 memoryTypes in sysRam when there are only 4x RAMs? What is the physical meaning of each memoryType? How do I use each of these memoryTypes?
Why are there 2 memory types for GPU RAM?
I don't know what you mean by "4x RAMs"; I suspect you're talking about how many physical memory sticks are in your machine. Memory types (or heaps for that matter) don't care about such things.
As for the rest, it is always important to remember how memory works in Vulkan. Heaps represent actual physical RAM to one degree or another. Memory types represent ways of allocating that memory. But uses of memory have their own memory type restrictions.
For example, if an image has the color attachment usage parameter, the implementation can force you to use a specific memory type for the memory backing that image. And images that don't have the color attachment usage can be restricted to using other memory types, but not that one. And so forth.
Apparently, NVIDIA does this for certain combinations of usage and formats. Simply querying the available memory types isn't enough to know how to go about allocating memory. You have to figure out what buffers and images (complete with format and usage parameters) you will use. And then you have to query what restrictions the implementation imposes on them.
Your application must adapt to these restrictions.
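As a rough illustration of that querying step, here is a hedged sketch (assuming a VkDevice `device`, a VkPhysicalDevice `physicalDevice` and an already-created VkImage `image` with your chosen format and usage) of asking the implementation for the image's restrictions and picking a compatible memory type:

```c
/* Sketch: query the restrictions the implementation imposes on a particular
 * image, then pick a memory type that satisfies them.  `device`,
 * `physicalDevice` and `image` are assumed to exist already. */
VkMemoryRequirements req;
vkGetImageMemoryRequirements(device, image, &req);

VkPhysicalDeviceMemoryProperties props;
vkGetPhysicalDeviceMemoryProperties(physicalDevice, &props);

/* req.memoryTypeBits has bit i set if memory type i may back this image. */
uint32_t typeIndex = UINT32_MAX;
for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
    VkMemoryPropertyFlags wanted = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT;
    if ((req.memoryTypeBits & (1u << i)) &&
        (props.memoryTypes[i].propertyFlags & wanted) == wanted) {
        typeIndex = i;
        break;
    }
}
/* typeIndex (if found) goes into VkMemoryAllocateInfo::memoryTypeIndex. */
```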
Is there a size limit to each memoryType?
It wouldn't make sense for there to be such a thing. Memory types define how memory is allocated, not how much storage is available. The latter is the job of memory heaps.
Why are the free memory values reported by Vulkan and cat /proc/meminfo different?
Vulkan has no API to report free memory, only total memory. Asking for the amount of free memory is folly. Memory (or at least, the set of virtual pages in your application) is shared by all threads in your application. And GPU memory especially is shared among all processes on the machine. By the time you get an answer back, the amount of memory may have changed. So when you go to allocate memory based on what you were told was available, it may not be available anymore.
Better to allocate first and deal with failure to allocate if it happens.
You can ask for the total memory so that you can decide how you want to allocate chunks of memory. But allocating and handling failure is how you determine what is and is not available, not querying a size.
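A minimal sketch of that allocate-and-handle-failure approach might look like the following; the chunk sizes and the `device`/`typeIndex` handles are assumptions for illustration, not anything mandated by Vulkan:

```c
/* Sketch: try to allocate a chunk and fall back to a smaller one on
 * VK_ERROR_OUT_OF_DEVICE_MEMORY, rather than trusting any "free memory" query.
 * `device` and `typeIndex` are assumed to come from earlier setup. */
VkDeviceMemory memory = VK_NULL_HANDLE;
VkDeviceSize chunkSize = 256ull * 1024 * 1024;   /* start with 256 MB */

while (chunkSize >= 16ull * 1024 * 1024) {
    VkMemoryAllocateInfo info = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .allocationSize = chunkSize,
        .memoryTypeIndex = typeIndex,
    };
    VkResult r = vkAllocateMemory(device, &info, NULL, &memory);
    if (r == VK_SUCCESS)
        break;                       /* got a chunk of this size */
    if (r == VK_ERROR_OUT_OF_DEVICE_MEMORY)
        chunkSize /= 2;              /* retry with a smaller chunk */
    else
        break;                       /* some other error: give up */
}
```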
[metaquestion] Why is X in Vulkan?
Because it is allowed by the Vulkan specification. The rest is implementation detail, and only the implementer/vendor knows for sure; it may depend on how well they slept.
Why are there 9 memoryTypes in sysRam when there are only 4x RAMs? What is the physical meaning of each memoryType? How do I use each of these memoryTypes?
Answered in Why does vkGetPhysicalDeviceMemoryProperties return multiple identical memory types?. One for VkBuffers, one for VkImages, and one per depth format (i.e. 7). Equals 9; mystery solved.
Why are there 2 memory types for GPU RAM? Does this mean each memoryType of the GPU RAM is 6144MB/2 = 3072MB?
Likely a similar reason as in 1. I speculate one for VkBuffers, one for VkImages. Someone with NVIDIA hardware could test with vkGetXMemoryRequirements.
It does not necessarily mean RAM/2. It is not completely out of the question, but then again the implementer should instead expose a separate Heap if that were so.
Is there a size limit to each memoryType? If yes, how do I discover their limits?
Roughly the Heap size. You may get significantly less due to fragmentation, and due to other processes sharing the same heap. Your implementation may also allocate some of it for its own internal needs.
You discover the limit when you get VK_ERROR_OUT_OF_DEVICE_MEMORY. (BTW, this mostly works the same as on the CPU side, where you get bad_alloc.)
There is a limit to the size of a single allocation (allocating > 4 GB is not recommended), and to the count of allocations too (maxMemoryAllocationCount).
Why are the free memory values reported by Vulkan and cat /proc/meminfo different?
AFAIK Vulkan does not report free memory. The VkMemoryHeap shows total memory:
size is the total memory size in bytes in the heap.
You don't know anything about the memory types in Vulkan until you ask the driver.
I think the biggest misunderstanding you have is that the memory types are physically separate. As shown, you have two memory heaps, assume 0 is CPU memory, 1 is GPU. Within those heaps, you have different memory types. Each memory type occupies space within its own heap, and can use all the heap space or share it with other types. For each type you'll have different internal allocation methods with different alignment requirements and different allowed uses. There are multiple queries related to memory types including vkGetBufferMemoryRequirements, vkGetImageMemoryRequirements, and others. It all depends on what you're using the memory for.
Also, those memory types are driver dependent, and will vary between vendors (that looks like the current nVidia layout).
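For completeness, the layout in the question comes from a query like the following sketch (assuming a valid `physicalDevice`); it simply prints whatever heaps and types the driver chooses to expose:

```c
/* Sketch: ask the driver what heaps and types it exposes.
 * `physicalDevice` is assumed to be a valid VkPhysicalDevice. */
VkPhysicalDeviceMemoryProperties props;
vkGetPhysicalDeviceMemoryProperties(physicalDevice, &props);

for (uint32_t h = 0; h < props.memoryHeapCount; ++h)
    printf("heap %u: %llu MB%s\n", h,
           (unsigned long long)(props.memoryHeaps[h].size >> 20),
           (props.memoryHeaps[h].flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT)
               ? " (device local)" : "");

for (uint32_t t = 0; t < props.memoryTypeCount; ++t)
    printf("type %u: heap %u, flags 0x%x\n", t,
           props.memoryTypes[t].heapIndex,
           (unsigned)props.memoryTypes[t].propertyFlags);
```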
I'm reading Operating System: Internals and Design Principles by William Stallings, 7th edition. In section 8.4 Linux Memory Management, when talking about kernel memory management, it goes like:
The foundation of kernel memory allocation for Linux is the page allocation
mechanism used for user virtual memory management. As in the virtual memory
scheme, a buddy algorithm is used so that memory for the kernel can be allocated
and deallocated in units of one or more pages. Because the minimum amount of
memory that can be allocated in this fashion is one page, the page allocator alone
would be inefficient because the kernel requires small short-term memory chunks
in odd sizes.
I can understand the discussion of paging, but why does the author say that the kernel requires "small short-term memory chunks in odd sizes"? In particular, why in odd sizes?
Because most programs require small allocations, for relatively short periods, in a variety of sizes? That's why malloc and friends exist: To subdivide the larger allocations from the OS into smaller pieces with sub-page-size granularity. Want a linked list (commonly needed in OS kernels)? You need to be able to allocate small nodes that contain the value and a pointer to the next node (and possibly a reverse pointer too).
I suspect by "odd sizes" they just mean "arbitrary sizes"; I don't expect the kernel to be unusually heavy on 1, 3, 5, 7, etc. byte allocations, but the allocation sizes are, in many cases, not likely to be consistent enough that a fixed block allocator is broadly applicable. Writing a special block allocator for each possible linked list node size (let alone every other possible size needed for dynamically allocated memory) isn't worth it unless that linked list is absolutely performance critical after all.
I have stumbled more than once on the terms "non-coherent" and "coherent" memory in tech papers related to graphics programming. I have been searching for a simple and clear explanation, but found mostly 'hardcore' papers of this type. I would be glad to receive a layman's-style answer on what coherent memory actually is on GPU architectures and how it compares to other (probably non-coherent) memory types.
Memory is memory. But different things can access that memory. The GPU can access memory, the CPU can access memory, maybe other hardware bits, whatever.
A particular thing has "coherent" access to memory if changes made by others to that memory are visible to the reader. Now, you might think this is foolishness. After all, if the memory has been changed, how could someone possibly be unable to see it?
Simply put, caches.
It turns out that changing memory is expensive. So we do everything possible to avoid changing memory unless we absolutely have to. When you write a single byte from the CPU to a pointer in memory, the CPU doesn't write that byte yet. Or at least, not to memory. It writes it to a local copy of that memory called a "cache."
The reason for this is that, generally speaking, applications do not write (or read) single bytes. They are more likely to write (and read) lots of bytes, in small chunks. So if you're going to perform an expensive operation like a memory load or store, you should load or store a large chunk of memory. So you store all of the changes you're going to make to a chunk of memory in a cache, then make a single write of that cached chunk to actual memory at some point in the future.
But if you have two separate devices that use the same memory, you need some way to be certain that writes one device makes are visible to other devices. Most GPUs can't read the CPU cache. And most CPU languages don't have language-level support to say "hey, that stuff I wrote to memory? I really mean for you to write it to memory now." So you usually need something to ensure visibility of changes.
In Vulkan, memory which is labeled by VK_MEMORY_PROPERTY_HOST_COHERENT_BIT means that, if you read/write that memory (via a mapped pointer, since that's the only way Vulkan lets you directly write to memory), you don't need to use functions vkInvalidateMappedMemoryRanges/vkFlushMappedMemoryRanges to make sure the CPU/GPU can see those changes. The visibility of any changes is guaranteed in both directions. If that flag isn't available on the memory, then you must use the aforementioned functions to ensure the coherency of the specific regions of data you want to access.
With coherent memory, one of two things is going on in terms of hardware. Either CPU access to the memory is not cached in any of the CPU's caches, or the GPU has direct access to the CPU's caches (perhaps due to being on the same die as the CPU(s)). You can usually tell that the latter is happening, because on-die GPU implementations of Vulkan don't bother to offer non-coherent memory options.
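As a concrete illustration of the non-coherent case, here is a hedged sketch (assuming a host-visible VkDeviceMemory `memory`, a `device`, and host data `srcData`/`dataSize`) of writing through a mapped pointer and then flushing so the GPU is guaranteed to see the write:

```c
/* Sketch: writing through a mapped pointer to NON-coherent memory and then
 * flushing.  `device`, `memory`, `srcData` and `dataSize` are assumed;
 * using VK_WHOLE_SIZE sidesteps the nonCoherentAtomSize rounding rules. */
void *ptr = NULL;
vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &ptr);
memcpy(ptr, srcData, dataSize);                 /* CPU writes land in its cache */

VkMappedMemoryRange range = {
    .sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
    .memory = memory,
    .offset = 0,
    .size = VK_WHOLE_SIZE,
};
vkFlushMappedMemoryRanges(device, 1, &range);   /* make the write visible to the GPU */
/* To read back GPU writes you would call vkInvalidateMappedMemoryRanges
 * instead.  With VK_MEMORY_PROPERTY_HOST_COHERENT_BIT neither call is needed. */
```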
If memory is coherent then all threads accessing that memory must agree on the state of the memory at all times, e.g.: if thread 0 reads memory location A and thread 1 reads the same location at the same time, both threads should always read the same value.
But if memory is not coherent, then threads 0 and 1 might read back different values. Thread 0 could think that location A contains a 1, while thread 1 thinks that that location contains a 2. The different threads would have an incoherent view of the memory.
Coherence is hard to achieve with a high number of cores. Often every core must be aware of memory accesses from all other cores. So if you have 4 cores in a quad-core CPU, coherence is not that hard to achieve, as every core must be informed about the memory accesses of 3 other cores, but in a GPU with 16 cores, every core must be made aware of the memory accesses of 15 other cores. The cores exchange data about the content of their caches using so-called "cache coherence protocols".
This is why GPUs often only support limited forms of coherency. If some memory locations are read only or are only accessed by a single thread, then no coherence is required. If caches are small and coherence is not always required but only at specific instructions of the program, then it is possible to achieve correct behavior of the program using cache flushes before or after specific memory accesses.
If your hardware offers both coherent and non-coherent memory types, then you can expect that non-coherent memory will be faster, but if you try to run parallel algorithms using this memory they will fail in really weird ways.
This question is about DRAM speeds and memory interleaving. I have a very specific problem. I am using a Power-based architecture board (minus the AltiVec), and I wish to copy a large segment of memory (virtually contiguous) between two regions within my process' address space. To offset the slowness of my core, I pinned two threads to two CPUs, and that made the copy a lot faster.
However, that was still not fast enough, so I added a third thread, and it made no difference to copy times whatsoever. I did more research on this and found that my board was equipped with a single DDR3 RAM (speed 1600 MB/s) and that I was already pretty close to the maximum attainable speed.
[ Some explanation here: With just 2 threads, I am copying, say 5500 pages of size 4K in around 16.5 milliseconds. If you do a simple calculation, it would seem that the minimum time in theory that you could clock (bar all prefetches and stuff) is 13.75 milliseconds. ]
I discovered that I could add an extra RAM stick to my board, which I could possibly get my company to fund by telling them I also intend to halve the size of each stick of memory. But how can I get the kernel to allocate me memory that is guaranteed to be evenly distributed across both sticks?
Thanks a lot for answering!
P.s. I am using linux kernel version 2.6.34.
See if your Linux / board combination supports the NUMA (Non-uniform memory access) extensions. You can specify interleaving policies through libnuma:
The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.
Available policies are page interleaving (i.e., allocate in a round-robin fashion from all, or a subset, of the nodes on the system), preferred node allocation (i.e., preferably allocate on a particular node), local allocation (i.e., allocate on the node on which the task is currently executing), or allocation only on specific nodes (i.e., allocate on some subset of the available nodes). It is also possible to bind tasks to specific nodes.
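A minimal sketch of requesting page-interleaved memory through libnuma might look like this (the 64 MB size is arbitrary; compile with -lnuma):

```c
/* Sketch: libnuma page interleaving across all NUMA nodes.
 * Requires a kernel with NUMA support and the numactl/libnuma packages. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    size_t len = 64ull * 1024 * 1024;           /* 64 MB buffer */
    /* Pages are allocated round-robin across all NUMA nodes. */
    void *buf = numa_alloc_interleaved(len);
    if (!buf) return 1;

    memset(buf, 0, len);                        /* touch it so pages are placed */
    numa_free(buf, len);
    return 0;
}
```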
Linux noob question:
If I have 500MB of RAM, and 500MB of swap space, can the OS and processes then use 1GB of memory?
In other words, is the total amount of memory available to programs and the OS the total of the physical memory size and swap size?
I'm trying to figure out which SNMP counters to query, but need to understand how Linux uses virtual memory a little better first.
Thanks
Actually, it IS essentially correct, but your "virtual" memory does NOT reside beside your "physical memory" (as Matthew Scharley stated).
Your "virtual memory" is an abstraction layer covering both "physical" (as in RAM) and "swap" (as in hard-disk, which is of course as much physical as RAM is) memory.
Virtual memory is in essence an abstraction layer. Your program always addresses a "virtual" address, which your OS translates to an address in RAM or on disk (which needs to be loaded into RAM first) depending on where the data resides. So your program never has to worry about a lack of memory.
Nothing is ever quite so simple anymore...
Memory pages are lazily allocated. A process can malloc() a large quantity of memory and never use it. So on your 500MB_RAM + 500MB_SWAP system, I could -- at least in theory -- allocate 2 gig of memory off the heap and things will run merrily along until I try to use too much of that memory. (At which point whatever process couldn't acquire more memory pages gets nuked. Hopefully it's my process. But not always.)
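A small sketch of that lazy-allocation behaviour; how far the big malloc() gets depends on the kernel's overcommit settings (/proc/sys/vm/overcommit_memory), so treat it as illustrative rather than guaranteed:

```c
/* Sketch: asking for far more memory than RAM + swap can succeed, because
 * pages are only backed when touched. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t len = 2ull * 1024 * 1024 * 1024;     /* ask for 2 GB */
    char *p = malloc(len);
    if (!p) {
        puts("malloc refused up front");
        return 1;
    }
    puts("malloc succeeded; no physical pages used yet");

    /* Touching pages is what actually consumes RAM/swap; touch too many on
     * a small machine and the OOM killer may "nuke" some process. */
    memset(p, 0, 16 * 4096);                    /* touch only a few pages */
    free(p);
    return 0;
}
```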
Individual processes may be limited to 4 gig as a hard address limitation on 32-bit systems. Even when you have more than 4 gig of RAM on the machine and you're using that bizarre segmented 36-bit atrocity from hell addressing scheme, individual processes are still limited to only 4 gigs. Some of that 4 gigs has to go for shared libraries and program code. So yer down to 2-3 gigs of stack+heap as an ADDRESSING limitation.
You can mmap files in, effectively giving you more memory. It basically acts as extra swap. I.e. Rather than loading a program's binary code data into memory and then swapping it out to the swapfile, the file is just mmapped. As needed, pages are swapped into RAM directly from the file.
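A minimal mmap sketch of that idea ("data.bin" is just a placeholder file name):

```c
/* Sketch: mapping a file so its pages are paged in from the file on demand,
 * rather than going through the swapfile. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { perror("fstat"); return 1; }

    /* Pages come straight from the file as they are touched. */
    const char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("first byte: 0x%02x\n", (unsigned char)p[0]);
    munmap((void *)p, st.st_size);
    close(fd);
    return 0;
}
```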
You can get into some interesting stuff with sparse data and mmapped sparse files. I've seen X-windows claim enormous memory usage when in fact it was only using up a tiny bit.
BTW: "free" might help you. As might "cat /proc/meminfo" or the Vm lines in /proc/$PID/status. (Especially VmData and VmStk.) Or perhaps "ps up $PID"
Although mostly it's true, it's not entirely correct. For a particular process, the environment you run it in may limit the memory available to your process. Check the output of ulimit -v as well.
Yes, this is essentially correct. The actual numbers might be (very) marginally lower, but for all intents and purposes, if you have x physical memory and y virtual memory (swap in linux), then you have x + y memory available to the operating system and any programs running underneath the OS.
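Since the question mentions SNMP counters, here is a small sketch of reading the underlying figures from /proc/meminfo and summing them; the sum is the x + y total described above:

```c
/* Sketch: read MemTotal and SwapTotal from /proc/meminfo and sum them. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    unsigned long mem_kb = 0, swap_kb = 0;
    while (fgets(line, sizeof line, f)) {
        sscanf(line, "MemTotal: %lu kB", &mem_kb);
        sscanf(line, "SwapTotal: %lu kB", &swap_kb);
    }
    fclose(f);

    printf("MemTotal:  %lu kB\n", mem_kb);
    printf("SwapTotal: %lu kB\n", swap_kb);
    printf("Total:     %lu kB\n", mem_kb + swap_kb);
    return 0;
}
```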