What is possible relationship between states and utilized memory in model-checking. Can we estimate total memory that can possibly be utilized by our model (state space) before actually implementing it into the model-checker. can we roughly estimate minimum possible memory before actual implementation. On which factor memory utilization depend other than transitions and states.
ThankYou
Related
I can't find any clarity as to what is the performance of the so called Constant memory referred to in the Numba documentation:
https://numba.pydata.org/numba-doc/dev/cuda/memory.html#constant-memory
I am curious as to what are the size limits for this memory, how fast/slow it is when compared to other memory types and if there are any pitfalls using it.
Thank you!
This is more of a general question regarding the constant memory in a CUDA-capable device. You can find info in the official CUDA programming guide and here in which it says:
There is a total of 64 KB constant memory on a device. The constant
memory space is cached. As a result, a read from constant memory costs
one memory read from device memory only on a cache miss; otherwise, it
just costs one read from the constant cache. Accesses to different
addresses by threads within a warp are serialized, thus the cost
scales linearly with the number of unique addresses read by all
threads within a warp. As such, the constant cache is best when
threads in the same warp accesses only a few distinct locations. If
all threads of a warp access the same location, then constant memory
can be as fast as a register access.
Regarding how this compares to other memory types, here is my short answer. You may want to read this page for further details:
Registers: Thread private on-chip read + write memory which can be considered as the fastest memory space on a GPU.
Local memory: Thread private off-chip read + write memory which, despite its misleading name, is physically the same location as global memory. Hence, its high latency.
Global memory: The largest memory with a high latency and a global scope which is also off-chip with read + write permissions.
Constant memory: Off-chip cached read-only memory limited to 64 KB which could be accessed by threads as fast as registers, if all threads of a warp access the same location.
Shared memory: On-chip, low-latency, read + write with limited space per multiprocessor (48 KB to 164 KB depending on the compute capability of your device).
Texture memory: On-chip cached read-only memory optimized for 2D spatial locality that supports unique features like hardware filtering.
Pinned (page-locked) memory: Not an explicit device memory. Accessible directly by both CPU and GPU codes, used to maximize and overlap data transfer between CPU/GPU.
These memories have different scopes, life-times and usages. The Numba page that you have mentioned in your question explains the basics but the official CUDA programming guide has a lot more details. At the end of the day, the answer to the question of when to use each memory is to a large degree application-dependent.
Above is a picture summarizing my understanding on memoryHeap and their memoryTypes generated by Vulkan for a given system setup. Thanks to the answers on this topics shared by #NicolBolas 1, 2, 3 and an answer by #krOoze 4.
Still, I have a few outstanding questions that I like help on and I have indicated them in red and elaborated below per comment of #NicolBolas.
Questions
Why are there 9 memoryType in sysRam when there are only 4x RAMs?
What is the physical meaning of each memoryType? How to use each of
these memoryType?
Why are there 2 memory types for GPU RAM? Does this mean each
memoryType of the GPU RAM is 6144MB/2 = 3072MB?
Is there a size limit to each memoryTypes? If yes, how to discover
their limits?
Why are the free memory reported by Vulkan and cat /proc/meminfo
different?
Thanks for your help in advance.
Why are there 9 memoryType in sysRam when there are only 4x RAMs? What is the physical meaning of each memoryType? How to use each of these memoryType?
Why are there 2 memory types for GPU RAM?
I don't know what you mean by "4x RAMs"; I suspect you're talking about how many physical memory sticks are in your machine. Memory types (or heaps for that matter) don't care about such things.
As for the rest, it is always important to remember how memory works in Vulkan. Heaps represent actual physical RAM to one degree or another. Memory types represent ways of allocating that memory. But uses of memory have their own memory type restrictions.
For example, if an image has the color attachment usage parameter, the implementation can force you to use a specific memory type for the memory backing that image. And images that don't have color attachment can be restricted to using other memory types, but not that one. And so forth.
Apparently, NVIDIA does this for certain combinations of usage and formats. Simply querying the available memory types isn't enough to know how to go about allocating memory. You have to figure out what buffers and images (complete with format and usage parameters) you will use. And then you have to query what restrictions the implementation imposes on them.
Your application must adapt to these restrictions.
Is there a size limit to each memoryTypes?
It wouldn't make sense for there to be such a thing. Memory types define how memory is allocated, not how much storage is available. The latter is the job of memory heaps.
Why are the free memory reported by Vulkan and cat /proc/meminfo different?
Vulkan has no API to report free memory, only total memory. Asking for the amount of free memory is folly. Memory (or at least, virtual pages in your application) are shared by all threads in your application. And GPU memory especially is shared among all processes on the machine. By the time you get an answer back, the amount of memory may have changed. So when you go to allocate memory based on what you were told was available, it may not be available anymore.
Better to allocate first and deal with failure to allocate if it happens.
You can ask for the total memory so that you can decide on how you want to allocate chunks of memory. But that's how you determine what is and is not available, not by querying a size.
[metaquestion] Why is X in Vulkan?
Because it is allowed by the Vulkan specification. Rest is implementation detail, and only the implementer\vendor knows for sure, and may depend on how well he slept.
Why are there 9 memoryType in sysRam when there are only 4x RAMs? What is the physical meaning of each memoryType? How to use each of these memoryType?
Answered in Why does vkGetPhysicalDeviceMemoryProperties return multiple identical memory types?. One for VkBuffers, one for VkImages, and one per depth format (i.e. 7). Equals 9; mystery solved.
Why are there 2 memory types for GPU RAM? Does this mean each memoryType of the GPU RAM is 6144MB/2 = 3072MB?
Likely similar reason as 1. I speculate one for VkBuffers, one for VkImages. Someone with NVIDIA could test with vkGetXMemoryRequirements.
It does not neccessarily mean RAM/2. It is not completely out of the question, but then again implementer should instead expose separate Heap if that is so.
Is there a size limit to each memoryTypes? If yes, how to discover their limits?
Roughly the Heap size. You may get significantly less due to fragmentation. And due to other processes sharing the same. Your impl may also allocate some itself for its internal needs.
You discover the limit when you get VK_ERROR_OUT_OF_DEVICE_MEMORY. (BTW mostly works the same as on CPU side, where you get bad_alloc).
There is limit to size of single allocation (not recommended to allocate > 4 GB), and to the count of allocations too (maxMemoryAllocationCount).
Why are the free memory reported by Vulkan and cat /proc/meminfo different?
AFAIK Vulkan does not report free memory. The VkMemoryHeap shows total memory:
size is the total memory size in bytes in the heap.
You don't know anything about the memory types in Vulkan until you ask the driver.
I think the biggest misunderstanding you have is that the memory types are physically separate. As shown, you have two memory heaps, assume 0 is CPU memory, 1 is GPU. Within those heaps, you have different memory types. Each memory type occupies space within its own heap, and can use all the heap space or share it with other types. For each type you'll have different internal allocation methods with different alignment requirements and different allowed uses. There are multiple queries related to memory types including vkGetBufferMemoryRequirements, vkGetImageMemoryRequirements, and others. It all depends on what you're using the memory for.
Also, those memory types are driver dependent, and will vary between vendors (that looks like the current nVidia layout).
From the MallocInternals section of the wiki:
As pressure from thread collisions increases, additional arenas are created via mmap to relieve the pressure. The number of arenas is capped at eight times the number of CPUs in the system (unless the user specifies otherwise, see mallopt), which means a heavily threaded application will still see some contention, but the trade-off is that there will be less fragmentation.
Why does increasing the number of arenas increase fragmentation? Anecdotally, I've been able to reduce resident set size (not just virtual set size) by almost 50% simply by severely restricting memory arenas to just 2 (via MALLOC_ARENA_MAX).
How could a high number of per-thread memory arenas in malloc lead to memory fragmentation and an increase in RSS?
Say you have 300MB to divide up. If you divide it into 8 arenas, you wind up with more "odds and ends" memory, none of it big enough to satisfy a large memory request. The arenas don't dynamically grow or shrink, as a rule. So if you allocate an arena, use up 90% of it and wind up with 10% as fragmentation (nominally waiting to serve small allocations) it doesn't help with your big allocations.
The more arenas you divide it into, the more "odds and ends" memory you wind up with. So: more arenas equals more internal fragmentation.
The arenas have an internal mutex to lock the memory access from different threads at the same time.
If you have only one arena, only one memory operation is allowed at a time, which decreases fragmentation as new allocations/free are performed.
If you increase the number of arenas, then it means you allow multiple threads to allocate/free the memory at the same time, which creates fragmentation because it is not an atomic operation (see the sections "Malloc Algorithm" and "Free Algorithm")
The 8xCPU rule is there just to leave enough thread concurrency but limit it somehow. 2 arenas may (as this is not a fatality) lead to more fragmentation than only 1, so the more arenas, the more fragmentation because more memory operations happen at the same time.
Why does concurrent memory operations lead to fragmentation (oversimplified) : because it becomes harder and harder to allocate contiguous large blocks and freeing little blocks all over does not leave enough space for new large blocks to be allocated there.
So if you restrict the number of arenas, the maintenance job (allocation/free algorithm) is not entering in competition with other arenas running at the same time.
So then, why 8xCPU ? and why not just 2 arenas (as you tested) or even only one ?
This is totally possible to run only 1 arena, but then it means that only one thread is allowed to use the memory at once, hence you loose some performance. If you restrict to 2 arenas, then only 2 threads can use the memory at the same time (one per arena).
And if you have a 4-CPU machine only 2 arenas, only 2 (this is the arena limit) threads can manipulate (via glibc) the memory at the same time which is a waste of resources because only 2 of the CPU are used.
This question is about DRAM speeds and memory interleaving. I have a very specific problem. I am using a power based architecture board (minus the AltiVec) and I wish to copy a large segment of memory (virtual contiguous) between two regions within my process' address space. To offset the slowness of my core, I affixed two threads to two cpu's and that made copy a lot faster.
However that was still not fast enough. so I added a third thread, and it made no difference to copy times whatsoever. I did more research on this and found that my board was equipped with a single DDR3 RAM (speed 1600 MB/s) and it was pretty close to max attainable speeds already.
[ Some explanation here: With just 2 threads, I am copying, say 5500 pages of size 4K in around 16.5 milliseconds. If you do a simple calculation, it would seem that the minimum time in theory that you could clock (bar all prefetches and stuff) is 13.75 milliseconds. ]
I discovered that I could add an extra RAM to my board. Which I could possibly get my co. to fund by telling them I also intend to halve the size of each stick of memory, but how can I get the kernel to allocate me memory that is guaranteed to be evenly distributed across both memories?
Thanks a lot for answering!
P.s. I am using linux kernel version 2.6.34.
See if your Linux / board combination supports the NUMA (Non-uniform memory access) extensions. You can specify interleaving policies through libnuma:
The libnuma library offers a simple programming interface to the NUMA
(Non Uniform Memory Access) policy supported by the Linux kernel. On a
NUMA architecture some memory areas have different latency or
bandwidth than others.
Available policies are page interleaving (i.e., allocate in a
round-robin fashion from all, or a subset, of the nodes on the
system), preferred node allocation (i.e., preferably allocate on a
particular node), local allocation (i.e., allocate on the node on
which the task is currently executing), or allocation only on specific
nodes (i.e., allocate on some subset of the available nodes). It is
also possible to bind tasks to specific nodes.
As we all may know, there are lots of different ways to implement DGEMV in parallel (column or block -wise etc) resulting in different communication overheads. I have been looking through both the MKL and all the reference manuals to BLAS to try and figure out which style is in general being called in by cblas_dgemv from MKL(v.11) without success. If anyone has a reference that documents which algorithm or the overheads for the algorithm that is being used, I would be very happy.
MKL ref manuals keep DGEMV as well as other routines as black boxes.
But I think there is still some way to estimate the overhead/efficiency.
As we know, DGEMV is a mem bandwidth bounded operation.
For y += A*x you could measure its speed by the mem bandwidth achieved:
measure the running time for one DGEMV call as t;
compute total mem read/write size: m = 2*len(y)+len(x)+len(A);
actual bandwidth bw = m/t;
check out the peak bandwidth of the total system RAM bw0;
Then bw/bw0*100% can be seen as the actual efficiency of the algorithm.
Please note you may want a large enough matrix/vector to do the measurement. Also if you want repeat the measurement to get more accurate result, you may need to keep the cache cold before starting a new iteration.