Vulkan memoryHeaps and their memoryTypes - linux

Above is a picture summarizing my understanding of memoryHeaps and their memoryTypes as generated by Vulkan for a given system setup. Thanks to the answers on this topic shared by @NicolBolas (1, 2, 3) and an answer by @krOoze (4).
Still, I have a few outstanding questions that I would like help with. I have indicated them in red and elaborated on them below, per a comment by @NicolBolas.
Questions
Why are there 9 memoryType in sysRam when there are only 4x RAMs?
What is the physical meaning of each memoryType? How to use each of these memoryType?
Why are there 2 memory types for GPU RAM? Does this mean each memoryType of the GPU RAM is 6144MB/2 = 3072MB?
Is there a size limit to each memoryTypes? If yes, how to discover their limits?
Why are the free memory reported by Vulkan and cat /proc/meminfo different?
Thanks for your help in advance.

Why are there 9 memoryType in sysRam when there are only 4x RAMs? What is the physical meaning of each memoryType? How to use each of these memoryType?
Why are there 2 memory types for GPU RAM?
I don't know what you mean by "4x RAMs"; I suspect you're talking about how many physical memory sticks are in your machine. Memory types (or heaps for that matter) don't care about such things.
As for the rest, it is always important to remember how memory works in Vulkan. Heaps represent actual physical RAM to one degree or another. Memory types represent ways of allocating that memory. But uses of memory have their own memory type restrictions.
For example, if an image has the color attachment usage parameter, the implementation can force you to use a specific memory type for the memory backing that image. And images that don't have color attachment can be restricted to using other memory types, but not that one. And so forth.
Apparently, NVIDIA does this for certain combinations of usage and formats. Simply querying the available memory types isn't enough to know how to go about allocating memory. You have to figure out what buffers and images (complete with format and usage parameters) you will use. And then you have to query what restrictions the implementation imposes on them.
Your application must adapt to these restrictions.
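A minimal sketch of the selection logic this implies, written in plain Python rather than against the real API. The flag values match the Vulkan headers, but the memory type table and the type-bit masks are made-up illustration data; in a real program they would come from vkGetPhysicalDeviceMemoryProperties and from vkGetBufferMemoryRequirements / vkGetImageMemoryRequirements.

```python
# Property flag values as defined in the Vulkan headers.
DEVICE_LOCAL = 0x1   # VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
HOST_VISIBLE = 0x2   # VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
HOST_COHERENT = 0x4  # VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

# Hypothetical table: (heap index, property flags) for each memory type.
MEMORY_TYPES = [
    (0, 0),                             # type 0: system RAM, no flags
    (0, HOST_VISIBLE | HOST_COHERENT),  # type 1: system RAM, mappable
    (1, DEVICE_LOCAL),                  # type 2: GPU RAM
    (1, DEVICE_LOCAL),                  # type 3: GPU RAM, same flags as type 2
]

def find_memory_type(memory_type_bits, required_flags, memory_types=MEMORY_TYPES):
    """Return the first type index that the resource's mask allows and that
    has all required flags, or None if no type qualifies."""
    for index, (_heap, flags) in enumerate(memory_types):
        allowed = memory_type_bits & (1 << index)
        if allowed and (flags & required_flags) == required_flags:
            return index
    return None

# Hypothetical image that the implementation only allows in type 3.
image_type_bits = 0b1000
# Hypothetical buffer that is allowed in types 1 and 2.
buffer_type_bits = 0b0110

print(find_memory_type(image_type_bits, DEVICE_LOCAL))
print(find_memory_type(buffer_type_bits, DEVICE_LOCAL))
```

Note how types 2 and 3 have identical flags in this made-up table: the only thing that tells them apart is the per-resource mask, which is why querying the type list alone is not enough.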
Is there a size limit to each memoryTypes?
It wouldn't make sense for there to be such a thing. Memory types define how memory is allocated, not how much storage is available. The latter is the job of memory heaps.
Why are the free memory reported by Vulkan and cat /proc/meminfo different?
Core Vulkan has no API to report free memory, only total memory (the VK_EXT_memory_budget extension can report a per-heap budget estimate, but the caveats below apply to it as well). Asking for the amount of free memory is folly. Memory (or at least, the virtual pages in your application) is shared by all threads in your application. And GPU memory especially is shared among all processes on the machine. By the time you get an answer back, the amount of memory may have changed. So when you go to allocate memory based on what you were told was available, it may not be available anymore.
Better to allocate first and deal with failure to allocate if it happens.
You can ask for the total memory so that you can decide how you want to allocate chunks of memory. But attempting the allocation is how you determine what is and is not available, not querying a size.
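The "allocate first, handle failure" approach can be sketched with a toy model. Everything here is hypothetical: `FakeHeap` and `OutOfDeviceMemory` merely stand in for a real heap and for VK_ERROR_OUT_OF_DEVICE_MEMORY, and halving the request is just one possible fallback policy.

```python
class OutOfDeviceMemory(Exception):
    """Stands in for VK_ERROR_OUT_OF_DEVICE_MEMORY in this model."""

class FakeHeap:
    """Toy heap that only tracks a byte budget."""
    def __init__(self, free_bytes):
        self.free_bytes = free_bytes

    def allocate(self, size):
        if size > self.free_bytes:
            raise OutOfDeviceMemory(size)
        self.free_bytes -= size
        return size  # a real API would return a memory handle

def allocate_with_fallback(heap, wanted, minimum):
    """Try the wanted size; on failure halve the request until it either
    succeeds or drops below the minimum useful size."""
    size = wanted
    while size >= minimum:
        try:
            return heap.allocate(size)
        except OutOfDeviceMemory:
            size //= 2
    return None

heap = FakeHeap(free_bytes=300)
print(allocate_with_fallback(heap, wanted=1024, minimum=64))
```

The point of the design is that the code never asks how much is free; it learns the answer from the allocation attempt itself.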

[metaquestion] Why is X in Vulkan?
Because it is allowed by the Vulkan specification. The rest is implementation detail; only the implementer/vendor knows for sure, and it may depend on how well they slept.
Why are there 9 memoryType in sysRam when there are only 4x RAMs? What is the physical meaning of each memoryType? How to use each of these memoryType?
Answered in "Why does vkGetPhysicalDeviceMemoryProperties return multiple identical memory types?". One for VkBuffers, one for VkImages, and one per depth format (i.e. 7). That is 1 + 1 + 7 = 9; mystery solved.
Why are there 2 memory types for GPU RAM? Does this mean each memoryType of the GPU RAM is 6144MB/2 = 3072MB?
Likely a similar reason as in 1. I speculate one for VkBuffers and one for VkImages. Someone with NVIDIA hardware could test this with vkGetBufferMemoryRequirements and vkGetImageMemoryRequirements.
It does not necessarily mean RAM/2. That is not completely out of the question, but in that case the implementer should expose a separate heap instead.
Is there a size limit to each memoryTypes? If yes, how to discover their limits?
Roughly the heap size. You may get significantly less due to fragmentation, and because other processes share the same heap. Your implementation may also allocate some for its own internal needs.
You discover the limit when you get VK_ERROR_OUT_OF_DEVICE_MEMORY. (BTW, this mostly works the same as on the CPU side, where you get std::bad_alloc.)
There is also a limit on the size of a single allocation (allocating more than 4 GB at once is not recommended), and on the number of allocations (maxMemoryAllocationCount).
Why are the free memory reported by Vulkan and cat /proc/meminfo different?
AFAIK Vulkan does not report free memory. The VkMemoryHeap shows total memory:
size is the total memory size in bytes in the heap.

You don't know anything about the memory types in Vulkan until you ask the driver.
I think the biggest misunderstanding you have is that the memory types are physically separate. As shown, you have two memory heaps, assume 0 is CPU memory, 1 is GPU. Within those heaps, you have different memory types. Each memory type occupies space within its own heap, and can use all the heap space or share it with other types. For each type you'll have different internal allocation methods with different alignment requirements and different allowed uses. There are multiple queries related to memory types including vkGetBufferMemoryRequirements, vkGetImageMemoryRequirements, and others. It all depends on what you're using the memory for.
Also, those memory types are driver dependent and will vary between vendors (that looks like the current NVIDIA layout).
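To make the heap/type relationship concrete, here is a small Python model. The layout is only loosely based on the question (9 types in one heap, 2 types in a 6144 MB GPU heap); the system RAM size is an invented number, and the dictionaries are stand-ins for what vkGetPhysicalDeviceMemoryProperties would return.

```python
MB = 1024 * 1024

# Hypothetical heaps: heap 0 is system RAM (size invented), heap 1 is GPU RAM.
heaps = [{"size": 16384 * MB}, {"size": 6144 * MB}]

# Each memory type only stores a heap index (like VkMemoryType.heapIndex);
# it has no size of its own.
type_heap_index = [0] * 9 + [1] * 2

def types_per_heap(type_heap_index, heap_count):
    """Group memory type indices by the heap they allocate from."""
    groups = [[] for _ in range(heap_count)]
    for type_index, heap_index in enumerate(type_heap_index):
        groups[heap_index].append(type_index)
    return groups

def upper_bound_for_type(type_index):
    """The most a type could ever hand out is its whole heap,
    which it shares with every other type of that heap."""
    return heaps[type_heap_index[type_index]]["size"]

print(types_per_heap(type_heap_index, len(heaps)))
print(upper_bound_for_type(9) // MB, upper_bound_for_type(10) // MB)
```

In this model both GPU types report the full heap as their upper bound; nothing splits the heap in half between them.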

Related

How fast or slow is the Constant memory that Numba allows a device to allocate, when compared to local and shared memories?

I can't find any clear information on the performance of the so-called constant memory referred to in the Numba documentation:
https://numba.pydata.org/numba-doc/dev/cuda/memory.html#constant-memory
I am curious as to what are the size limits for this memory, how fast/slow it is when compared to other memory types and if there are any pitfalls using it.
Thank you!
This is more of a general question regarding the constant memory in a CUDA-capable device. You can find info in the official CUDA programming guide and here in which it says:
There is a total of 64 KB constant memory on a device. The constant
memory space is cached. As a result, a read from constant memory costs
one memory read from device memory only on a cache miss; otherwise, it
just costs one read from the constant cache. Accesses to different
addresses by threads within a warp are serialized, thus the cost
scales linearly with the number of unique addresses read by all
threads within a warp. As such, the constant cache is best when
threads in the same warp accesses only a few distinct locations. If
all threads of a warp access the same location, then constant memory
can be as fast as a register access.
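The serialization rule in that quote can be illustrated with a tiny cost model in plain Python. This is not GPU code and it ignores cache misses; it only counts distinct addresses per warp, which is what the quote says the cost scales with. The function name and the access patterns are made up for illustration.

```python
WARP_SIZE = 32

def serialized_constant_reads(addresses):
    """Number of serialized constant-cache reads for one warp:
    one per distinct address requested by its threads."""
    if len(addresses) != WARP_SIZE:
        raise ValueError("expected one address per thread in the warp")
    return len(set(addresses))

broadcast = [0] * WARP_SIZE                # every thread reads the same element
strided = list(range(WARP_SIZE))           # every thread reads a different element
few = [t % 4 for t in range(WARP_SIZE)]    # only four distinct elements

for name, pattern in [("broadcast", broadcast), ("few", few), ("strided", strided)]:
    print(name, serialized_constant_reads(pattern))
```

The broadcast pattern is the case where constant memory shines; the strided pattern is the pitfall, since each thread's read is handled one after another.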
Regarding how this compares to other memory types, here is my short answer. You may want to read this page for further details:
Registers: Thread private on-chip read + write memory which can be considered as the fastest memory space on a GPU.
Local memory: Thread private off-chip read + write memory which, despite its misleading name, is physically the same location as global memory. Hence, its high latency.
Global memory: The largest memory with a high latency and a global scope which is also off-chip with read + write permissions.
Constant memory: Off-chip cached read-only memory limited to 64 KB which could be accessed by threads as fast as registers, if all threads of a warp access the same location.
Shared memory: On-chip, low-latency, read + write with limited space per multiprocessor (48 KB to 164 KB depending on the compute capability of your device).
Texture memory: Off-chip read-only memory, cached on-chip in the texture cache, optimized for 2D spatial locality and supporting unique features like hardware filtering.
Pinned (page-locked) memory: Not an explicit device memory. Accessible directly by both CPU and GPU codes, used to maximize and overlap data transfer between CPU/GPU.
These memories have different scopes, lifetimes and usages. The Numba page that you mentioned in your question explains the basics, but the official CUDA programming guide has a lot more detail. At the end of the day, the answer to the question of when to use each memory is to a large degree application-dependent.

Why does the Linux kernel require small short-term memory chunks in odd sizes?

I'm reading Operating System: Internals and Design Principles by William Stallings, 7th edition. In section 8.4 Linux Memory Management, when talking about kernel memory management, it goes like:
The foundation of kernel memory allocation for Linux is the page allocation
mechanism used for user virtual memory management. As in the virtual memory
scheme, a buddy algorithm is used so that memory for the kernel can be allocated
and deallocated in units of one or more pages. Because the minimum amount of
memory that can be allocated in this fashion is one page, the page allocator alone
would be inefficient because the kernel requires small short-term memory chunks
in odd sizes.
I could understand the discussion on paging, but why does the author say that the kernel requires small short-term memory chunks in odd sizes? Especially, why in odd sizes?
Because most programs require small allocations, for relatively short periods, in a variety of sizes? That's why malloc and friends exist: To subdivide the larger allocations from the OS into smaller pieces with sub-page-size granularity. Want a linked list (commonly needed in OS kernels)? You need to be able to allocate small nodes that contain the value and a pointer to the next node (and possibly a reverse pointer too).
I suspect by "odd sizes" they just mean "arbitrary sizes"; I don't expect the kernel to be unusually heavy on 1, 3, 5, 7, etc. byte allocations, but the allocation sizes are, in many cases, not likely to be consistent enough that a fixed block allocator is broadly applicable. Writing a special block allocator for each possible linked list node size (let alone every other possible size needed for dynamically allocated memory) isn't worth it unless that linked list is absolutely performance critical after all.
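A rough worked example of why a page allocator alone is inefficient for such requests. The 4096-byte page size and the 24-byte node size are assumptions chosen for illustration, and the "packed" case is a simplified stand-in for a sub-page allocator, not a model of any specific kernel allocator.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

def pages_if_one_object_per_page(count):
    """Page allocator alone: every object, however small, takes a whole page."""
    return count

def pages_if_packed(count, object_size, page_size=PAGE_SIZE):
    """Sub-page allocator: pack as many equally sized objects as fit into each page."""
    per_page = page_size // object_size
    return -(-count // per_page)  # ceiling division

def wasted_bytes(pages, count, object_size, page_size=PAGE_SIZE):
    """Bytes reserved but not occupied by any object."""
    return pages * page_size - count * object_size

nodes, node_size = 1000, 24  # e.g. small linked-list nodes
naive = pages_if_one_object_per_page(nodes)
packed = pages_if_packed(nodes, node_size)
print(naive, wasted_bytes(naive, nodes, node_size))
print(packed, wasted_bytes(packed, nodes, node_size))
```

With these numbers the page-only approach reserves 1000 pages for 24000 bytes of payload, while packing needs 6 pages.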

What is coherent memory on GPU?

More than once I have stumbled upon the terms "non-coherent" and "coherent" memory in tech papers related to graphics programming. I have been searching for a simple and clear explanation, but found mostly 'hardcore' papers of this type. I would be glad to receive a layman's-style answer on what coherent memory actually is on GPU architectures and how it compares to other (probably non-coherent) memory types.
Memory is memory. But different things can access that memory. The GPU can access memory, the CPU can access memory, maybe other hardware bits, whatever.
A particular thing has "coherent" access to memory if changes made by others to that memory are visible to the reader. Now, you might think this is foolishness. After all, if the memory has been changed, how could someone possibly be unable to see it?
Simply put, caches.
It turns out that changing memory is expensive. So we do everything possible to avoid changing memory unless we absolutely have to. When you write a single byte from the CPU to a pointer in memory, the CPU doesn't write that byte yet. Or at least, not to memory. It writes it to a local copy of that memory called a "cache."
The reason for this is that, generally speaking, applications do not write (or read) single bytes. They are more likely to write (and read) lots of bytes, in small chunks. So if you're going to perform an expensive operation like a memory load or store, you should load or store a large chunk of memory. So you store all of the changes you're going to make to a chunk of memory in a cache, then make a single write of that cached chunk to actual memory at some point in the future.
But if you have two separate devices that use the same memory, you need some way to be certain that writes one device makes are visible to other devices. Most GPUs can't read the CPU cache. And most CPU languages don't have language-level support to say "hey, that stuff I wrote to memory? I really mean for you to write it to memory now." So you usually need something to ensure visibility of changes.
In Vulkan, memory which is labeled by VK_MEMORY_PROPERTY_HOST_COHERENT_BIT means that, if you read/write that memory (via a mapped pointer, since that's the only way Vulkan lets you directly write to memory), you don't need to use functions vkInvalidateMappedMemoryRanges/vkFlushMappedMemoryRanges to make sure the CPU/GPU can see those changes. The visibility of any changes is guaranteed in both directions. If that flag isn't available on the memory, then you must use the aforementioned functions to ensure the coherency of the specific regions of data you want to access.
With coherent memory, one of two things is going on in terms of hardware. Either CPU access to the memory is not cached in any of the CPU's caches, or the GPU has direct access to the CPU's caches (perhaps due to being on the same die as the CPU(s)). You can usually tell that the latter is happening, because on-die GPU implementations of Vulkan don't bother to offer non-coherent memory options.
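The visibility problem can be mimicked with a toy write-back cache in Python. This is only a model: `WriteBackCache` and `device_read` are invented names, and `flush` merely plays the role that vkFlushMappedMemoryRanges plays for non-coherent memory in the explanation above.

```python
class WriteBackCache:
    """Toy model of a CPU cache in front of memory that another device reads directly."""
    def __init__(self, memory, coherent):
        self.memory = memory      # shared backing store, modelled as a dict
        self.coherent = coherent  # True models HOST_COHERENT-style behaviour
        self.dirty = {}           # writes not yet visible in memory

    def write(self, address, value):
        if self.coherent:
            self.memory[address] = value  # immediately visible to the other device
        else:
            self.dirty[address] = value   # stays in the cache until flushed

    def flush(self):
        """Push all cached writes out to memory."""
        self.memory.update(self.dirty)
        self.dirty.clear()

def device_read(memory, address):
    """The other device bypasses the CPU cache and sees only what is in memory."""
    return memory.get(address, 0)

memory = {}
cpu = WriteBackCache(memory, coherent=False)
cpu.write(16, 7)
print(device_read(memory, 16))  # stale: the write is still in the cache
cpu.flush()
print(device_read(memory, 16))  # now visible
```

Constructing the cache with `coherent=True` makes the first read return the new value without any flush, which is the guarantee the coherent flag describes.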
If memory is coherent, then all threads accessing that memory must agree on the state of the memory at all times. E.g., if thread 0 reads memory location A and thread 1 reads the same location at the same time, both threads should always read the same value.
But if memory is not coherent, then threads 0 and 1 might read back different values. Thread 0 could think that location A contains a 1, while thread 1 thinks that the location contains a 2. The different threads would have an incoherent view of the memory.
Coherence is hard to achieve with a high number of cores. Often every core must be aware of memory accesses from all other cores. So if you have 4 cores in a quad-core CPU, coherence is not that hard to achieve, as every core must be informed about the addresses accessed by 3 other cores. But in a GPU with 16 cores, every core must be made aware of the memory accesses by 15 other cores. The cores exchange data about the content of their caches using so-called "cache coherence protocols".
This is why GPUs often only support limited forms of coherency. If some memory locations are read only or are only accessed by a single thread, then no coherence is required. If caches are small and coherence is not always required but only at specific instructions of the program, then it is possible to achieve correct behavior of the program using cache flushes before or after specific memory accesses.
If your hardware offers both coherent and non-coherent memory types, then you can expect that non-coherent memory will be faster, but if you try to run parallel algorithms using this memory they will fail in really weird ways.

VxWorks memory allocation failure even though there is enough memory

I am rather new to VxWorks, and I am building an RTP application which needs to allocate some memory dynamically. I have configured the kernel for a memory size of 750MB.
I am allocating 10 blocks of 32MB each at the very beginning of the program, but after the 5th or 6th block allocation, I get an allocation failure with the message memPartAlloc: block too big 15912260 bytes (0x10 aligned) in partition 0xe004608 on the console.
How could memory allocation be failing when there is enough memory available? I do not think memory had fragmented enough for allocation to fail right in the beginning of my program and as per output of memShow(), there is indeed enough free memory to satisfy the request.
If memory has indeed fragmented for some strange reason, is there some way to compact free space and continue in VxWorks?
This is an old question, so this answer may be moot now, and is to an extent based on speculation based on the limited information in the question.
Whilst the kernel may be configured to support 750MB, this will be the total memory available. Some of this will be used by the OS image, although we won't expect much, and we can assume that at least 700MB should be available for use.
Some extra memory will be used to provide the stacks for each task. How much is very application dependent, as it is specified in the taskSpawn call. You can check this, but again, it is unlikely to make a significant difference.
Let's be generous, and assume that you really only have 650MB. This should, in theory, be plenty.
And yet we have this error:
memPartAlloc: block too big 15912260 bytes (0x10 aligned) in partition 0xe004608
What can be happening? And what does this mean?
This error tells you that the memory allocator could not allocate memory, as the request was too large. Interestingly, the request is 15912260, which is not 32MB, it is actually a shade over 15MB. So it would be worth checking what you are actually requesting.
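The arithmetic behind that observation, as a short Python check. It assumes "MB" in the question means MiB (1024 * 1024 bytes); the byte count is taken from the error message.

```python
MIB = 1024 * 1024

requested_bytes = 15912260   # size reported in the memPartAlloc message
expected_block = 32 * MIB    # block size the question says it asks for

requested_mib = requested_bytes / MIB
total_for_ten_blocks_mib = 10 * expected_block / MIB

print(round(requested_mib, 2))       # size of the failing request in MiB
print(total_for_ten_blocks_mib)      # what ten 32 MiB blocks add up to in MiB
print(requested_bytes == expected_block // 2)  # is it simply half a block?
```

So the failing request is neither a full 32 MiB block nor exactly half of one, which is why it is worth checking what size the code actually passes to the allocator.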
Secondly, this error message is coming from memPartAlloc. Are you allocating memory using malloc() or memPartAlloc()? The distinction matters, since malloc will allocate memory from the system memory partition, whereas memPartAlloc allocates memory from a user-specified, user-created partition.
If you are using memPartAlloc, ensure that you are allocating memory from the correct partition, and that it has been created with enough memory to fulfill the request.
EDIT:
As it appears that this was an RTP, you should also confirm that the RTP has a large enough heap allocated. This is specified via an environment variable, as this answer describes.

Calculating % memory used on Linux

Linux noob question:
If I have 500MB of RAM, and 500MB of swap space, can the OS and processes then use 1GB of memory?
In other words, is the total amount of memory available to programs and the OS the total of the physical memory size and swap size?
I'm trying to figure out which SNMP counters to query, but need to understand how Linux uses virtual memory a little better first.
Thanks
Actually, it IS essentially correct, but your "virtual" memory does NOT reside beside your "physical memory" (as Matthew Scharley stated).
Your "virtual memory" is an abstraction layer covering both "physical" (as in RAM) and "swap" (as in hard-disk, which is of course as much physical as RAM is) memory.
Virtual memory is in essence an abstraction layer. Your program always addresses a "virtual" address, which your OS translates to an address in RAM or on disk (which needs to be loaded into RAM first), depending on where the data resides. So your program never has to worry about lack of memory.
Nothing is ever quite so simple anymore...
Memory pages are lazily allocated. A process can malloc() a large quantity of memory and never use it. So on your 500MB_RAM + 500MB_SWAP system, I could -- at least in theory -- allocate 2 gig of memory off the heap and things will run merrily along until I try to use too much of that memory. (At which point whatever process couldn't acquire more memory pages gets nuked. Hopefully it's my process. But not always.)
Individual processes may be limited to 4 gig as a hard address limitation on 32-bit systems. Even when you have more than 4 gig of RAM on the machine and you're using that bizarre segmented 36-bit atrocity from hell addressing scheme, individual processes are still limited to only 4 gigs. Some of that 4 gigs has to go for shared libraries and program code. So yer down to 2-3 gigs of stack+heap as an ADDRESSING limitation.
You can mmap files in, effectively giving you more memory. It basically acts as extra swap. I.e. Rather than loading a program's binary code data into memory and then swapping it out to the swapfile, the file is just mmapped. As needed, pages are swapped into RAM directly from the file.
You can get into some interesting stuff with sparse data and mmapped sparse files. I've seen X-windows claim enormous memory usage when in fact it was only using up a tiny bit.
BTW: "free" might help you. As might "cat /proc/meminfo" or the Vm lines in /proc/$PID/status. (Especially VmData and VmStk.) Or perhaps "ps up $PID"
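As a rough sketch of turning /proc/meminfo into a single "percent used" figure, assuming the simple RAM + swap view from the question. The sample text is made up for a 500 MB RAM + 500 MB swap machine; MemTotal, MemFree, SwapTotal and SwapFree are real /proc/meminfo fields, while the helper names are invented here.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:   value kB' lines into a dict of integers."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields

def percent_used(fields):
    """Percent of RAM plus swap in use, counting only MemFree and SwapFree as free."""
    total = fields["MemTotal"] + fields["SwapTotal"]
    free = fields["MemFree"] + fields["SwapFree"]
    return 100.0 * (total - free) / total

# Made-up sample values in kB.
SAMPLE = """MemTotal:  512000 kB
MemFree:   128000 kB
SwapTotal: 512000 kB
SwapFree:  384000 kB
"""

fields = parse_meminfo(SAMPLE)
print(percent_used(fields))
```

On a real system the same functions could be fed `open("/proc/meminfo").read()`. Keep in mind that MemFree does not count page cache that the kernel could reclaim, so this figure tends to overstate memory pressure; where the MemAvailable field exists it is usually the more useful number.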
Although mostly it's true, it's not entirely correct. For a particular process, the environment you run it in may limit the memory available to your process. Check the output of ulimit -v as well.
Yes, this is essentially correct. The actual numbers might be (very) marginally lower, but for all intents and purposes, if you have x physical memory and y swap space (which is what backs virtual memory on Linux), then you have x + y memory available to the operating system and any programs running underneath the OS.

Resources