While debugging my program on large numbers of cores, I ran into a very strange error about insufficient virtual memory. My investigation led me to the piece of code where the master sends small messages to each slave. Then I wrote a small test program in which one master simply sends 10 integers with MPI_SEND and all the slaves receive them with MPI_RECV. Comparing the contents of /proc/self/status before and after MPI_SEND showed that the difference in memory sizes is huge! The most interesting thing (which crashes my program) is that this memory is not deallocated after MPI_Send and still takes up a huge amount of space.
Any ideas?
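A minimal sketch of such a test program (reconstructed here, error checking omitted): rank 0 sends 10 integers to every other rank and dumps /proc/self/status before and after the sends.

#include <mpi.h>
#include <stdio.h>

/* Dump /proc/self/status so the Vm* fields can be compared. */
static void print_status(const char *label, int rank)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return;
    printf("System memory usage %s, rank %d\n", label, rank);
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
}

int main(int argc, char **argv)
{
    int rank, size, buf[10] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        print_status("before MPI_Send", rank);
        for (int dest = 1; dest < size; dest++)
            MPI_Send(buf, 10, MPI_INT, dest, 0, MPI_COMM_WORLD);
        print_status("after MPI_Send", rank);
    } else {
        MPI_Recv(buf, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}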
System memory usage before MPI_Send, rank: 0
Name: test_send_size
State: R (running)
Pid: 7825
Groups: 2840
VmPeak: 251400 kB
VmSize: 186628 kB
VmLck: 72 kB
VmHWM: 4068 kB
VmRSS: 4068 kB
VmData: 71076 kB
VmStk: 92 kB
VmExe: 604 kB
VmLib: 6588 kB
VmPTE: 148 kB
VmSwap: 0 kB
Threads: 3
System memory usage after MPI_Send, rank 0
Name: test_send_size
State: R (running)
Pid: 7825
Groups: 2840
VmPeak: 456880 kB
VmSize: 456872 kB
VmLck: 257884 kB
VmHWM: 274612 kB
VmRSS: 274612 kB
VmData: 341320 kB
VmStk: 92 kB
VmExe: 604 kB
VmLib: 6588 kB
VmPTE: 676 kB
VmSwap: 0 kB
Threads: 3
This is expected behaviour from almost any MPI implementation that runs over InfiniBand. The IB RDMA mechanisms require that data buffers be registered, i.e. they are first locked into a fixed position in physical memory and then the driver tells the InfiniBand HCA how to map virtual addresses to physical memory. Registering memory for use by the IB HCA is a very complex and hence very slow process, which is why most MPI implementations never unregister memory that was once registered, in the hope that the same memory will later be used as the source or target of a transfer again. If the registered memory was heap memory, it is never returned to the operating system, and that is why your data segment only grows in size.
Reuse send and receive buffers as much as possible. Keep in mind that communication over InfiniBand incurs a high memory overhead. Most people don't really think about this and it is usually poorly documented, but InfiniBand uses a lot of special data structures (queues) which are allocated in the memory of the process, and those queues grow significantly with the number of processes. In some fully connected cases the amount of queue memory can be so large that no memory is actually left for the application.
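For illustration, here is a rough sketch (hypothetical helper, error checking omitted) of what buffer reuse means in practice: allocate the communication buffer once and keep handing the same region to MPI, so the library only ever has to register that one region.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helper: the send buffer is allocated once and reused for
   every message, so the library can keep reusing a single registered
   (pinned) region instead of registering a fresh buffer for every send. */
void send_rounds(int nrounds, int count, int nprocs)
{
    int *buf = malloc(count * sizeof(int));   /* allocated (and registered) once */

    for (int r = 0; r < nrounds; r++) {
        /* ... fill buf with the payload for this round ... */
        for (int dest = 1; dest < nprocs; dest++)
            MPI_Send(buf, count, MPI_INT, dest, 0, MPI_COMM_WORLD);
    }

    free(buf);   /* released only after all communication is done */
}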
There are certain parameters that control the IB queues used by Intel MPI. The most important one in your case is I_MPI_DAPL_BUFFER_NUM, which controls the amount of preallocated and preregistered memory. Its default value is 16, so you might want to decrease it. Be aware of the possible performance implications though. You can also try dynamic preallocated buffer sizes by setting I_MPI_DAPL_BUFFER_ENLARGEMENT to 1. With this option enabled, Intel MPI would initially register small buffers and would later grow them if needed. Note also that IMPI opens connections lazily, which is why you see the huge increase in used memory only after the call to MPI_Send.
If you are not using the DAPL transport, e.g. if you are using the ofa transport instead, there is not much that you can do. You can enable XRC queues by setting I_MPI_OFA_USE_XRC to 1. This should decrease the memory used somewhat. Enabling dynamic queue pair creation by setting I_MPI_OFA_DYNAMIC_QPS to 1 might also decrease memory usage if the communication graph of your program is not fully connected (a fully connected program is one in which each rank talks to all other ranks).
Hristo's answer is mostly right, but since you are using small messages there is a bit of a difference. The messages end up on the eager path: they are first copied to an already-registered buffer, then that buffer is used for the transfer, and the receiver copies the message out of an eager buffer on its end. Reusing buffers in your code will only help with large messages.
This is done precisely to avoid the slowness of registering the user-supplied buffer. For large messages the copy takes longer than the registration would, so the rendezvous protocol is used instead.
These eager buffers are somewhat wasteful. For example, they are 16 kB by default with Intel MPI over OF verbs. Unless message aggregation is used, each 10-int-sized message eats four 4 kB pages. But aggregation won't help when talking to multiple receivers anyway.
So what can you do? Reduce the size of the eager buffers. This is controlled by the eager/rendezvous threshold (the I_MPI_RDMA_EAGER_THRESHOLD environment variable). Try 2048 or even smaller. Note that this can result in a latency increase. Alternatively, change the I_MPI_DAPL_BUFFER_NUM variable to control the number of these buffers, or try the dynamic resizing feature that Hristo suggested. This assumes your IMPI is using DAPL (the default). If you are using OF verbs directly, the DAPL variables won't work.
Edit: So the final solution for getting this to run was setting I_MPI_DAPL_UD=enable. I can speculate on the origin of the magic, but I don't have access to Intel's code to actually confirm this.
IB can have different transport modes, two of which are RC (Reliable Connected) and UD (Unreliable Datagram). RC requires an explicit connection between hosts (like TCP), and some memory is spent per connection. More importantly, each connection has those eager buffers tied to it, and this really adds up. This is what you get with Intel's default settings.
There is an optimization possible: sharing the eager buffers between connections (this is called SRQ - Shared Receive Queue). There's a further Mellanox-only extension called XRC (eXtended RC) that takes the queue sharing further: between the processes that are on the same node.
By default Intel's MPI accesses the IB device through DAPL, and not directly through OF verbs. My guess is this precludes these optimizations (I don't have experience with DAPL). It is possible to enable XRC support by setting I_MPI_FABRICS=shm:ofa and I_MPI_OFA_USE_XRC=1 (making Intel MPI use the OFA interface instead of DAPL).
When you switch to the UD transport you get a further optimization on top of buffer sharing: there is no longer a need to track connections. The buffer sharing is natural in this model: since there are no connections, all the internal buffers are in a shared pool, just like with SRQ. So there are further memory savings, but at a cost: datagram delivery can potentially fail, and it is up to the software, not the IB hardware to handle retransmissions. This is all transparent to the application code using MPI, of course.
I have a memory problem. A small app (a server on a few ports, TMS+Indy, Delphi 10.3) is using about 2 GB of memory (for comparison, only 6 MB under Windows).
So, here is a small test, an almost empty console app:
program ThreadTest;
uses System.Classes;
var x: TThread;
begin
  x := TThread.Create(True); // break here and check pmap
end.
After Create (looking internally, after pthread_create), "pmap" on Ubuntu says that a few memory blocks were added: 4 kB + 132 kB + 8192 kB + 65404 kB.
4 kB + 132 kB - not too large, OK.
8 MB - looks like the default stack size for a thread, OK.
But what is the 64 MB????
The pmap entry is not the consumed memory, it is the reserved/mapped memory.
RAM is first reserved/mapped, but it is actually consumed page by page (typically 4 KB pages), the first time each page is accessed by the program.
IIRC, under Linux the Delphi memory manager is just a wrapper around the libc malloc/free API calls. It is typical for this MM to reserve some per-thread memory pages for the thread-specific memory arena. I guess 64 MB is a plausible value.
So don't be afraid of those numbers. They don't imply that your RAM chips are being consumed eagerly. You can have a process reserving GBs of memory while actually only using a few hundred MB of hardware RAM - see the writeable/private value in pmap. And note that a Delphi app will typically consume less memory than any GC-based program. :)
Update/Proposal after comments:
If you need to reduce memory, don't use threads; use non-blocking inet libraries instead, or a thread pool and a reverse proxy (e.g. nginx) to maintain the inet connections. You can have your server handle HTTP/1.0 in a thread pool, and your nginx proxy maintain the HTTP/1.1 or HTTP/2 long-standing connections. If you use POSIX sockets between nginx and the service on the loopback, it is very efficient. We use this configuration with our mORMot services on heavily loaded production Linux systems, with great stability.
Update/More Info:
This 64 MB mapping does indeed seem to be the per-thread arena memory reserved/mapped by malloc. Check https://codearcana.com/posts/2016/07/11/arena-leak-in-glibc.html and MALLOC_ARENA_MAX as detailed in https://siddhesh.in/posts/malloc-per-thread-arenas-in-glibc.html
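For anyone who wants to reproduce this outside of Delphi, here is a small illustrative C sketch (assuming glibc on Linux, not tied to Delphi at all) that makes the per-thread arena mappings visible with pmap; mallopt(M_ARENA_MAX, n) or the MALLOC_ARENA_MAX environment variable caps the number of arenas.

#include <malloc.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Each thread that calls malloc may get its own glibc arena, which reserves
   a large (~64 MB) block of address space up front. */
static void *worker(void *arg)
{
    (void)arg;
    void *p = malloc(1024);   /* first allocation in this thread may create a new arena */
    sleep(30);                /* keep the thread alive so pmap can be inspected */
    free(p);
    return NULL;
}

int main(void)
{
    /* mallopt(M_ARENA_MAX, 1); */   /* or run with MALLOC_ARENA_MAX=1 to cap the arenas */

    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);

    printf("inspect with: pmap %d\n", (int)getpid());
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}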
I can't find any clarity as to the performance of the so-called constant memory referred to in the Numba documentation:
https://numba.pydata.org/numba-doc/dev/cuda/memory.html#constant-memory
I am curious as to what are the size limits for this memory, how fast/slow it is when compared to other memory types and if there are any pitfalls using it.
Thank you!
This is more of a general question regarding constant memory in a CUDA-capable device. You can find info in the official CUDA programming guide, which says:
There is a total of 64 KB constant memory on a device. The constant memory space is cached. As a result, a read from constant memory costs one memory read from device memory only on a cache miss; otherwise, it just costs one read from the constant cache. Accesses to different addresses by threads within a warp are serialized, thus the cost scales linearly with the number of unique addresses read by all threads within a warp. As such, the constant cache is best when threads in the same warp accesses only a few distinct locations. If all threads of a warp access the same location, then constant memory can be as fast as a register access.
Regarding how this compares to other memory types, here is my short answer. You may want to read this page for further details:
Registers: Thread private on-chip read + write memory which can be considered as the fastest memory space on a GPU.
Local memory: Thread private off-chip read + write memory which, despite its misleading name, is physically the same location as global memory. Hence, its high latency.
Global memory: The largest memory with a high latency and a global scope which is also off-chip with read + write permissions.
Constant memory: Off-chip cached read-only memory limited to 64 KB which could be accessed by threads as fast as registers, if all threads of a warp access the same location.
Shared memory: On-chip, low-latency, read + write with limited space per multiprocessor (48 KB to 164 KB depending on the compute capability of your device).
Texture memory: On-chip cached read-only memory optimized for 2D spatial locality that supports unique features like hardware filtering.
Pinned (page-locked) memory: Not an explicit device memory. Accessible directly by both CPU and GPU codes, used to maximize and overlap data transfer between CPU/GPU.
These memories have different scopes, life-times and usages. The Numba page that you have mentioned in your question explains the basics but the official CUDA programming guide has a lot more details. At the end of the day, the answer to the question of when to use each memory is to a large degree application-dependent.
I searched for my problem on Google and on this site, but I still don't understand the solution.
I have a piece of an MPI program which RECVs some data. The program crashes on big arrays with an error of insufficient virtual memory, so I started looking at the /proc/self/status file.
Before MPI_RECV it was:
Name: model.exe
VmPeak: 841640 kB
VmSize: 841640 kB
VmHWM: 15100 kB
VmRSS: 15100 kB
VmData: 760692 kB
And after:
Name: model.exe
VmPeak: 841640 kB
VmSize: 841640 kB
VmHWM: 719980 kB
VmRSS: 719980 kB
VmData: 760692 kB
I tested it on Ubuntu, and through the System Monitor I saw this memory increase. But I was confused that there were no changes in the VmSize (and VmPeak) parameters.
So the question is: what is the indicator of real memory usage?
Does it mean that the true indicator is VmRSS? (And that VmSize is memory that is only allocated but not yet used?)
(The possible solution to your problem is in the last paragraph.)
Memory allocation on most modern operating systems with virtual memory is a two-phase process. First, a portion of the virtual address space of the process is reserved and the virtual memory size of the process (VmSize) increases accordingly. This creates entries in the so-called process page table. Pages are initially not associated with physical memory frames, i.e. no physical memory is actually used. Whenever some part of this allocated portion is actually read from or written to, a page fault occurs and the operating system installs (maps) a free page from the physical memory. This increases the resident set size of the process (VmRSS). When some other process needs memory, the OS might store the content of some infrequently used page (the definition of "infrequently used page" is highly implementation-dependent) to some persistent storage (the hard drive in most cases, or generally the swap device) and then unmap it. This process decreases the RSS but leaves VmSize intact. If this page is later accessed, a page fault occurs again and it is brought back. The virtual memory size only decreases when virtual memory allocations are freed. Note that VmSize also accounts for memory-mapped files (i.e. the executable file and all shared libraries it links to, or other explicitly mapped files) and shared memory blocks.
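To make the two phases visible, here is a small illustrative C sketch (Linux-only, hypothetical test program): after the allocation VmSize jumps, but VmRSS only grows once the pages are actually touched.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print the VmSize and VmRSS lines from /proc/self/status. */
static void show(const char *when)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return;
    printf("--- %s ---\n", when);
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmSize", 6) == 0 || strncmp(line, "VmRSS", 5) == 0)
            fputs(line, stdout);
    fclose(f);
}

int main(void)
{
    const size_t sz = 512UL * 1024 * 1024;   /* 512 MiB */

    show("before allocation");
    char *p = malloc(sz);                    /* VmSize grows, VmRSS barely changes */
    if (!p)
        return 1;
    show("after malloc");
    memset(p, 1, sz);                        /* touching the pages makes VmRSS grow */
    show("after touching the pages");

    free(p);
    return 0;
}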
There are two generic types of memory in a process - statically allocated memory and heap memory. The statically allocated memory keeps all constants and global/static variables. It is part of the data segment, whose size is shown by the VmData metric. The data segment also hosts part of the program heap, where dynamic memory is allocated. The data segment is contiguous, i.e. it starts at a certain location and grows upwards towards the stack (which starts at a very high address and then grows downwards). The problem with the heap in the data segment is that it is managed by a special heap allocator that takes care of subdividing the contiguous data segment into smaller memory chunks. On the other hand, on Linux dynamic memory can also be allocated by directly mapping virtual memory. This is usually done only for large allocations in order to conserve memory, since it only allows memory to be allocated in multiples of the page size (usually 4 KiB).
The stack is also an important source of heavy memory usage, especially if big arrays are allocated in the automatic (stack) storage. The stack starts near the very top of the usable virtual address space and grows downwards. In some cases it could reach the top of the data segment or it could reach the end of some other virtual allocation. Bad things happen then. The stack size is accounted in the VmStack metric and also in the VmSize.
One can summarise it like so:
VmSize accounts for all virtual memory allocations (file mappings, shared memory, heap memory, whatever memory) and grows almost every time new memory is allocated. Almost, because if the new heap memory allocation is made in the place of a freed old allocation in the data segment, no new virtual memory would be allocated. It decreases whenever virtual allocations are freed. VmPeak tracks the maximum value of VmSize - it can only increase over time.
VmRSS grows as memory is being accessed and decreases as memory is paged out to the swap device.
VmData grows as the data segment part of the heap is being utilised. It almost never shrinks as current heap allocators keep the freed memory in case future allocations need it.
If you are running on a cluster with InfiniBand or other RDMA-based fabrics, another kind of memory comes into play - the locked (registered) memory (VmLck). This is memory which is not allowed to be paged out. How it grows and shrinks depends on the MPI implementation. Some never unregister an already registered block (the technical details about why are too complex to be described here), others do so in order to play better with the virtual memory manager.
In your case you say that you are running into a virtual memory size limit. This could mean that this limit is set too low or that you are running into an OS-imposed limit. First, Linux (and most Unixes) have means to impose artificial restrictions through the ulimit mechanism. Running ulimit -v in the shell tells you what the limit on the virtual memory size is, in KiB. You can set the limit using ulimit -v <value in KiB>. This only applies to processes spawned by the current shell and to their children, grandchildren and so on. You need to instruct mpiexec (or mpirun) to propagate this value to all other processes, if they are to be launched on remote nodes. If you are running your program under the control of some workload manager like LSF, Sun/Oracle Grid Engine, Torque/PBS, etc., there are job parameters which control the virtual memory size limit. And last but not least, 32-bit processes are usually restricted to 2 GiB of usable virtual memory.
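For reference, the limit that ulimit -v controls corresponds to RLIMIT_AS, so a process can also check the limit it is actually running with; a minimal sketch:

#include <stdio.h>
#include <sys/resource.h>

/* Print the virtual address space limit - the value that "ulimit -v" controls. */
int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_AS, &rl) == 0) {
        if (rl.rlim_cur == RLIM_INFINITY)
            printf("virtual memory: unlimited\n");
        else
            printf("virtual memory limit: %llu KiB\n",
                   (unsigned long long)(rl.rlim_cur / 1024));
    }
    return 0;
}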
We use a tool (WhatsUp Gold) to monitor memory usage on a Linux box.
We see memory usage (graphs) related to:
Physical, Real, Swap, Virtual Memory and ALL Memory (which is an average of all of these).
The 'ALL Memory' graph shows low memory usage of about 10%.
But Physical memory shows as 95% used.
Swap memory shows as 2% used.
So, do I need more memory on this Linux box?
In other words, should I go by:
the ALL Memory graph (which says the memory situation is good) OR
the Physical Memory graph (which says the memory situation is bad)?
Real and Physical
Physical memory is the amount of DRAM which is currently in use. Real memory shows how much DRAM your applications are actually using; it is usually lower than physical memory. The Linux system caches some disk data, and this caching is the difference between physical and real memory. In fact, whenever you have free memory, Linux uses it for caching. Do not worry: as your applications demand memory, they get the cached space back.
Swap and Virtual
Swap is additional space on top of your actual DRAM. This space is borrowed from disk, and once your applications fill up the entire DRAM, Linux transfers some unused memory to swap to let all applications stay alive. The total of swap and physical memory is the virtual memory.
Do you need extra memory?
In answer to your question, you need to check real memory. If your real memory is full, you need to get more RAM. Use the free command to check the amount of actually free memory. For example, on my system free says:
$ free
             total       used       free     shared    buffers     cached
Mem:      16324640    9314120    7010520          0     433096    8066048
-/+ buffers/cache:     814976   15509664
Swap:      2047992          0    2047992
You need to check the buffers/cache line. As shown above, there is really about 15 GB of free DRAM (second line) on my system. Check this on your system and find out whether you need more memory or not. The lines correspond to physical, real, and swap memory, respectively.
free -m
As for analysing a lack of memory in Linux with the free tool, I have an opinion proven by experiments (practice):
~# free -m
              total        used        free      shared  buff/cache   available
Mem:           2000         164         144        1605        1691         103
You should sum 'used' + 'shared' and compare that with 'total'.
The other columns are useless; they just confuse things and nothing more.
I would say that [ total - (used + shared) ] should always be at least > 200 MB.
You can also get almost the same number if you check MemAvailable in /proc/meminfo:
# cat /proc/meminfo
MemAvailable: 107304 kB
MemAvailable is how much memory Linux thinks is really free right now, before active swapping would start.
So right now you can consume at most 107304 kB; if you consume more than that, heavy swapping starts happening.
MemAvailable also correlates well with real-world practice.
Due to some issues with modules, on boot I have to set mem=4096M. When this happens, though, this is the available memory:
MemTotal: 3354504 kB
SwapTotal: 1670724 kB
as opposed to
MemTotal: 4057728 kB
SwapTotal: 1670724 kB
Why is the amount of RAM dropping so much? Shouldn't it just stay at 4057728kB or pretend to have more?
Memory-mapped I/O such as that for video, sound, disks, etc. takes up a certain range of physical addresses. Normally the RAM behind those addresses is remapped somewhere else, but since you've artificially limited the number of physical addresses available to the OS, there is no way for the OS to actually reach that RAM.