I have a memory problem. A small app (a server listening on a few ports, TMS+Indy, Delphi 10.3) is using about 2 GB of memory under Linux (for comparison: only 6 MB under Windows).
So, a small test, an almost empty console app:
var x: TThread;
begin
x := TThread.Create(true);
After Create (looking internally: after pthread_create), "pmap" on Ubuntu shows that a few memory blocks were added: 4 kB + 132 kB + 8192 kB + 65404 kB.
4 kB + 132 kB - not too large, OK.
8 MB - looks like the default stack size for a thread, OK.
But what is the 64 MB?
The pmap entry is not the consumed memory, it is the reserved/mapped memory.
RAM is first reserved/mapped, but it is actually consumed per page (typically 4 KB), the first time a page is accessed by the program.
IIRC, under Linux the Delphi memory manager is just a wrapper around the libc malloc/free API calls. It is typical for this MM to reserve some per-thread memory pages for the thread-specific memory arena. I guess 64 MB is a plausible value.
So don't be afraid of those numbers. They don't mean that your RAM chips are being consumed eagerly. You can have a process reserving GB of memory while actually using only a few hundred MB of hardware RAM - the writeable/private value in pmap. And note that a Delphi app will typically consume less memory than any GC-based program. :)
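To see the reserved-vs-consumed difference outside of Delphi, here is a minimal C sketch (the 64 MB size is only chosen to mirror the mapping in question): it maps 64 MB anonymously and prints VmRSS before and after touching the pages, showing that residency grows only when pages are actually written.

/* reserve_vs_resident.c - map 64 MB anonymously and watch VmRSS grow
   only when pages are actually touched. Build: gcc reserve_vs_resident.c */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Print the VmRSS line from /proc/self/status (Linux-specific). */
static void print_rss(const char *label)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    while (f && fgets(line, sizeof line, f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%s %s", label, line);
    if (f) fclose(f);
}

int main(void)
{
    const size_t size = 64 * 1024 * 1024;   /* 64 MB, like the mystery mapping */

    print_rss("before mmap:   ");
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    print_rss("after mmap:    ");           /* mapped, but RSS barely changes */

    for (size_t i = 0; i < size; i += 4096)
        p[i] = 1;                            /* touch every page */

    print_rss("after touching:");            /* now the 64 MB is resident */
    munmap(p, size);
    return 0;
}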
Update/Proposal after comments:
If you need to reduce memory, don't use one thread per connection; use non-blocking inet libraries instead, or a thread pool plus a reverse proxy (e.g. nginx) to maintain the inet connections. You can have your server handle HTTP/1.0 in a thread pool, and let your nginx proxy maintain the HTTP/1.1 or HTTP/2 long-standing connections. If you use POSIX sockets between nginx and the service on the loopback, it is very efficient. We use this configuration with our mORMot services on high-load production Linux systems, with great stability.
Update/More Info:
This 64 MB mapping does indeed seem to be the per-thread arena memory reserved/mapped by malloc. See https://codearcana.com/posts/2016/07/11/arena-leak-in-glibc.html and the MALLOC_ARENA_MAX tunable as detailed in https://siddhesh.in/posts/malloc-per-thread-arenas-in-glibc.html
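If you want to confirm or limit the arena behaviour, glibc exposes it both through the MALLOC_ARENA_MAX environment variable and programmatically through mallopt(); a minimal C sketch of the latter (the cap of 2 and the thread count are arbitrary example values):

/* arena_cap.c - cap the number of glibc malloc arenas to 2, which limits
   how many 64 MB per-thread heap mappings can appear in pmap.
   The same effect: MALLOC_ARENA_MAX=2 in the environment.
   Build: gcc arena_cap.c -pthread */
#include <malloc.h>    /* mallopt, M_ARENA_MAX (glibc-specific) */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *worker(void *arg)
{
    (void)arg;
    void *p = malloc(1024);   /* the first malloc in a thread may create an arena */
    free(p);
    return NULL;
}

int main(void)
{
    /* Must run before threads start allocating to be fully effective. */
    if (mallopt(M_ARENA_MAX, 2) == 0)
        fprintf(stderr, "mallopt(M_ARENA_MAX) not supported here\n");

    pthread_t t[8];
    for (int i = 0; i < 8; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 8; i++)
        pthread_join(t[i], NULL);

    puts("attach pmap now to count the 64 MB arena mappings, then press Enter");
    getchar();
    return 0;
}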
Related
I am building data models via an app called Sisense on Linux. Lately the process fails with an out-of-memory error. Running free -h, I see that the failure occurs when free memory is low, but before it actually reaches zero, and even though there is still plenty of available memory.
Here is the exception:
Failed to build custom table: Rule_pre; BE#521691 SQL error: SafeModeException:
Safe-Mode triggered due to memory pressure. Pod physical memory: 5.31 GB available, 2.87 GB
used, 8.19 GB total. Server physical memory: 4.86 GB available, 28.67 GB used,
33.54 GB total. Application total virtual memory: 2.54 GB. The server exceeded 85% capacity
(28.67/33.54). Possible ways to reduce memory pressure: increase server memory, adjust data
modelling (M2M, un-indexed string fields, etc.), reduce number of simultaneous queries
And here is the output of free -h where you can see the declining memory in the center "free" column. Once free memory got below 235 MB I saw the above exception.
The free util man page has these definitions for free and available memory:
free        Unused memory (MemFree and SwapFree in /proc/meminfo)
available   Estimation of how much memory is available for starting new applications, without swapping. Unlike the data provided by the cache or free fields, this field takes into account page cache and also that not all reclaimable memory slabs will be reclaimed due to items being in use (MemAvailable in /proc/meminfo, available on kernels 3.14, emulated on kernels 2.6.27+, otherwise the same as free)
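(Both columns come straight from /proc/meminfo. Purely for illustration, a minimal C sketch that prints the two underlying fields:)

/* meminfo_watch.c - print MemFree and MemAvailable from /proc/meminfo,
   the fields behind free(1)'s "free" and "available" columns. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("/proc/meminfo"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "MemFree:", 8) == 0 ||
            strncmp(line, "MemAvailable:", 13) == 0)
            fputs(line, stdout);    /* values are reported in kB */
    }
    fclose(f);
    return 0;
}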
As I read around on the internet, there seems to be a casualness about low free memory - that it is not an issue. But the failure coincides with free memory getting too low. If I understand the man page, available memory is for starting new applications. I am assuming, then, that available memory is not available to the existing application that fails, and that free memory is indeed what matters. But any confirmation from others or additional explanation would be appreciated. I'd also be curious about opinions on whether this may constitute a memory leak, or whether I should simply allocate more memory somehow, perhaps at the Linux layer.
I think I have enough understanding here now. Free memory never goes below 200 MB, whether a build fails or succeeds, so it does not appear to be an indicator of the issue; a successful build also shows a drop in free memory to 200 MB.
HW/OS: Linux 4.9, 64 GB RAM.
16 daemons are running. Each reads random short (100 byte) pieces of a 5 GiB file, accessing it as memory mapped via mmap() at daemon startup. Each daemon reads its own file, so there are 16 5 GiB files in total.
Each daemon makes maybe 10 reads per second. Not too much; the disk load is rather small.
Sometimes (one event in 5 minutes, with no period, totally random) a random daemon gets stuck in kernel code with the following stack (see below) for 300 milliseconds. This does not correlate with major faults: major faults occur at a constant rate of about 100-200 per second. Disk reads are also constant.
What can cause this?
Text of the image (kernel stack):
__list_del_entry
isolate_lru_pages.isra.48
shrink_inactive_list
shrink_node_memcg
shrink_node
node_reclaim
get_page_from_freelist
enqueue_task_fair
sched_clock
__alloc_pages_nodemask
alloc_pages_vma
handle_mm_fault
__do_page_fault
page_fault
You have the shrink_node and node_reclaim functions in your stack. They are called to free memory (which is shown as buff/cache by the free command-line tool): https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html#reclaim
The process of freeing the reclaimable physical memory pages and repurposing them is called (surprise!) reclaim. Linux can reclaim pages either asynchronously or synchronously, depending on the state of the system. When the system is not loaded, most of the memory is free and allocation requests will be satisfied immediately from the free pages supply. As the load increases, the amount of the free pages goes down and when it reaches a certain threshold (high watermark), an allocation request will awaken the kswapd daemon. It will asynchronously scan memory pages and either just free them if the data they contain is available elsewhere, or evict to the backing storage device (remember those dirty pages?). As memory usage increases even more and reaches another threshold - min watermark - an allocation will trigger direct reclaim. In this case allocation is stalled until enough memory pages are reclaimed to satisfy the request.
So your 64 GB RAM system gets into a situation where there is no free memory left. That amount of memory is enough to hold a copy of only 12 files of 5 GiB each, and your daemons use 16 files. Linux may also read more data from the files than the application requested, via the readahead mechanism ("Linux readahead: less tricks for more", OLS 2007, pp. 273-284; man 2 readahead). MADV_SEQUENTIAL may also turn this mechanism on; see https://man7.org/linux/man-pages/man2/madvise.2.html
MADV_SEQUENTIAL
Expect page references in sequential order. (Hence, pages in
the given range can be aggressively read ahead, and may be
freed soon after they are accessed.)
MADV_RANDOM
Expect page references in random order. (Hence, read ahead
may be less useful than normally.)
I'm not sure how your daemons open and read the files, and whether MADV_SEQUENTIAL was active for them or not (or whether this flag was added by glibc or some other library). There can also be some effect from THP - Transparent Huge Pages (https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html). The vanilla 4.9 kernel is from 2016, and THP support for filesystems was still only planned in 2019 (https://lwn.net/Articles/789159/), but if you use RHEL/CentOS, some features may be backported into their fork of the 4.9 kernel.
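If the daemons map the files themselves, one way to take readahead out of the equation is to request random access explicitly; a rough C sketch under that assumption (the file name is a placeholder):

/* map_random.c - map a data file and hint the kernel that access is random,
   so readahead does not pull in far more pages than the 100-byte reads need. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);              /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    if (st.st_size < 100) { fprintf(stderr, "file too small\n"); return 1; }

    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* Disable aggressive readahead for this mapping. */
    if (madvise(map, st.st_size, MADV_RANDOM) != 0)
        perror("madvise");

    /* One short random read, as the daemons do. */
    char buf[100];
    off_t off = rand() % (st.st_size - 99);
    memcpy(buf, map + off, sizeof buf);
    printf("read %zu bytes at offset %lld\n", sizeof buf, (long long)off);

    munmap(map, st.st_size);
    close(fd);
    return 0;
}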
You should check the output of free and cat /proc/meminfo periodically to see how your daemons and the Linux kernel readahead use memory.
I'm running node.js on a Raspberry Pi 3 B with the following free memory:
free -m
             total       used       free     shared    buffers     cached
Mem:           973        230        742          6         14        135
-/+ buffers/cache:          80        892
Swap:           99          0         99
How can I configure node (v7) to not use all the free memory? To prolong the SD card life, I would like to prevent it from going to swap.
I am aware of --max_old_space_size:
node --v8-options | grep -A 5 max_old
--max_old_space_size (max size of the old space (in Mbytes))
type: int default: 0
I know some of the answer is application-specific; however, what are some general tips to limit node.js memory consumption to prevent swapping? Any other tips to squeeze more free RAM out of the Pi would also be appreciated.
I have already set the memory split so that the GPU has the minimum 16 MB of RAM allocated.
The only bulletproof way to prevent swapping is to turn off swapping in the operating system (delete or comment out any swap lines in /etc/fstab for permanent settings, or use swapoff -a to turn off all swap devices for the current session). Note that the kernel is forced to kill random processes when there is no free memory available (this is true both with and without swap).
In node.js, what you can limit is the size of V8's managed heap, and the --max-old-space-size flag you already mentioned is the primary way for doing that. A value around 400-500 (megabytes) probably makes sense for your Raspberry. There's also --max-semi-space-size which should be small and you can probably just stick with the default, and --max-executable-size for generated code (how much you need depends on the app you run; I'd just stick with the default).
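For example, an invocation on the Pi might then look like node --max-old-space-size=450 app.js (where app.js is just a placeholder for your entry script).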
That said, there's no way to limit the overall memory usage of the process, because there are other memory consumers outside the managed heap (e.g. node.js itself, V8's parser and compiler), and there is no way to set limits on those kinds of memory usage. (Because what would such a limit do? Crash when memory is needed but not available? The kernel will take care of that anyway.)
When running a Java app (in YARN) with native memory tracking enabled (-XX:NativeMemoryTracking=detail; see https://docs.oracle.com/javase/8/docs/technotes/guides/vm/nmt-8.html and https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html), I can see how much memory the JVM is using in different categories.
My app on jdk 1.8.0_45 shows:
Native Memory Tracking:
Total: reserved=4023326KB, committed=2762382KB
- Java Heap (reserved=1331200KB, committed=1331200KB)
(mmap: reserved=1331200KB, committed=1331200KB)
- Class (reserved=1108143KB, committed=64559KB)
(classes #8621)
(malloc=6319KB #17371)
(mmap: reserved=1101824KB, committed=58240KB)
- Thread (reserved=1190668KB, committed=1190668KB)
(thread #1154)
(stack: reserved=1185284KB, committed=1185284KB)
(malloc=3809KB #5771)
(arena=1575KB #2306)
- Code (reserved=255744KB, committed=38384KB)
(malloc=6144KB #8858)
(mmap: reserved=249600KB, committed=32240KB)
- GC (reserved=54995KB, committed=54995KB)
(malloc=5775KB #217)
(mmap: reserved=49220KB, committed=49220KB)
- Compiler (reserved=267KB, committed=267KB)
(malloc=137KB #333)
(arena=131KB #3)
- Internal (reserved=65106KB, committed=65106KB)
(malloc=65074KB #29652)
(mmap: reserved=32KB, committed=32KB)
- Symbol (reserved=13622KB, committed=13622KB)
(malloc=12016KB #128199)
(arena=1606KB #1)
- Native Memory Tracking (reserved=3361KB, committed=3361KB)
(malloc=287KB #3994)
(tracking overhead=3075KB)
- Arena Chunk (reserved=220KB, committed=220KB)
(malloc=220KB)
This shows 2.7GB of committed memory, including 1.3GB of allocated heap and almost 1.2GB of allocated thread stacks (using many threads).
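(A rough sanity check on the stack figure: 1185284 KB across 1154 threads is about 1 MB per thread, consistent with the usual 64-bit default thread stack size of 1 MB, i.e. -Xss1m.)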
However, when running ps ax -o pid,rss | grep <mypid> or top it shows only 1.6GB of RES/rss resident memory. Checking swap says none in use:
free -m
             total       used       free     shared    buffers     cached
Mem:        129180      99348      29831          0       2689      73024
-/+ buffers/cache:      23633     105546
Swap:        15624          0      15624
Why does the JVM indicate 2.7GB memory is committed when only 1.6GB is resident? Where did the rest go?
I'm beginning to suspect that stack memory (unlike the JVM heap) is precommitted without becoming resident, and over time becomes resident only up to the high-water mark of actual stack usage.
Yes, at least on Linux, mmap is lazy unless told otherwise. Anonymous pages are only backed by physical memory once they're written to (reads are not sufficient, due to the zero-page optimization).
GC heap memory effectively gets touched by the copying collector or by pre-zeroing (-XX:+AlwaysPreTouch), so it'll always be resident. Thread stacks, on the other hand, aren't affected by this.
For further confirmation you can use pmap -x <java pid> and cross-reference the RSS of various address ranges with the output from the virtual memory map from NMT.
Reserved memory has been mmapped with PROT_NONE, which means the virtual address space ranges have entries in the kernel's vma structs and thus will not be used by other mmap/malloc calls. But accessing them still causes page faults, which are forwarded to the process as SIGSEGV, i.e. accessing them is an error.
This is important to have contiguous address ranges available for future use, which in turn simplifies pointer arithmetic.
Committed-but-not-backed-by-storage memory has been mapped with, for example, PROT_READ | PROT_WRITE, but accessing it still causes a page fault. That page fault is silently handled by the kernel, which backs the page with actual memory and returns to execution as if nothing had happened.
I.e. it's an implementation detail/optimization that won't be noticed by the process itself.
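The same reserve/commit dance can be imitated directly with mmap and mprotect; a minimal C sketch of the idea (sizes are arbitrary, and this is of course not the JVM's actual code):

/* reserve_commit.c - imitate the reserve/commit scheme:
   reserve a large range with PROT_NONE, then commit a slice of it
   by flipping the protection to read/write. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    const size_t reserved_size  = 1UL << 30;   /* reserve 1 GB of address space */
    const size_t committed_size = 1UL << 20;   /* commit only 1 MB of it */

    /* Reserved: the address range exists, but any access raises SIGSEGV. */
    char *base = mmap(NULL, reserved_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Committed: accessible now, but still not resident until touched. */
    if (mprotect(base, committed_size, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect");
        return 1;
    }

    memset(base, 0, committed_size);   /* touching it finally makes it resident */
    printf("reserved %zu MB, committed and touched %zu MB\n",
           reserved_size >> 20, committed_size >> 20);

    munmap(base, reserved_size);
    return 0;
}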
To give a breakdown of the concepts:
Used Heap: the amount of memory occupied by live objects according to the last GC
Committed: Address ranges that have been mapped with something other than PROT_NONE. They may or may not be backed by physical memory or swap, due to lazy allocation and paging.
Reserved: The total address range that has been pre-mapped via mmap for a particular memory pool.
The reserved − committed difference consists of PROT_NONE mappings, which are guaranteed not to be backed by physical memory.
Resident: Pages which are currently in physical RAM. This covers code, stacks and parts of the committed memory pools, but also portions of mmapped files which have recently been accessed, and allocations outside the control of the JVM.
Virtual: The sum of all virtual address mappings. It covers the committed and reserved memory pools, but also mapped files and shared memory. This number is rarely informative, since the JVM can reserve very large address ranges in advance or mmap large files.
While debugging my program on large numbers of cores, I ran into a very strange error about insufficient virtual memory. My investigation led to the piece of code where the master sends small messages to each slave. Then I wrote a small program in which one master simply sends 10 integers with MPI_Send and all slaves receive them with MPI_Recv. Comparing /proc/self/status before and after MPI_Send showed that the difference in memory size is huge! The most interesting thing (which crashes my program) is that this memory is not deallocated after MPI_Send and still takes up huge space.
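For reference, a minimal sketch of the kind of test program described (10 ints from rank 0 to every other rank; an illustration, not the original code):

/* mpi_send_test.c - rank 0 sends 10 integers to every other rank.
   Build with: mpicc mpi_send_test.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, data[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* master: one small message per slave */
        for (int dest = 1; dest < size; dest++)
            MPI_Send(data, 10, MPI_INT, dest, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(data, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* /proc/self/status can be dumped before and after the send loop */
    MPI_Finalize();
    return 0;
}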
Any ideas?
System memory usage before MPI_Send, rank: 0
Name: test_send_size
State: R (running)
Pid: 7825
Groups: 2840
VmPeak: 251400 kB
VmSize: 186628 kB
VmLck: 72 kB
VmHWM: 4068 kB
VmRSS: 4068 kB
VmData: 71076 kB
VmStk: 92 kB
VmExe: 604 kB
VmLib: 6588 kB
VmPTE: 148 kB
VmSwap: 0 kB
Threads: 3
System memory usage after MPI_Send, rank 0
Name: test_send_size
State: R (running)
Pid: 7825
Groups: 2840
VmPeak: 456880 kB
VmSize: 456872 kB
VmLck: 257884 kB
VmHWM: 274612 kB
VmRSS: 274612 kB
VmData: 341320 kB
VmStk: 92 kB
VmExe: 604 kB
VmLib: 6588 kB
VmPTE: 676 kB
VmSwap: 0 kB
Threads: 3
This is expected behaviour from almost any MPI implementation that runs over InfiniBand. The IB RDMA mechanisms require that data buffers be registered, i.e. they are first locked into a fixed position in physical memory and then the driver tells the InfiniBand HCA how to map virtual addresses to physical memory. Registering memory for use by the IB HCA is a very complex and hence very slow process, which is why most MPI implementations never unregister memory that was once registered, in the hope that the same memory will later be used as a source or target again. If the registered memory was heap memory, it is never returned to the operating system, and that's why your data segment only grows in size.
Reuse send and receive buffers as much as possible. Keep in mind that communication over InfiniBand incurs high memory overhead. Most people don't really think about this, and it is usually poorly documented, but InfiniBand uses a lot of special data structures (queues) which are allocated in the memory of the process, and those queues grow significantly with the number of processes. In some fully connected cases the amount of queue memory can be so large that no memory is actually left for the application.
There are certain parameters that control the IB queues used by Intel MPI. The most important in your case is I_MPI_DAPL_BUFFER_NUM, which controls the amount of preallocated and preregistered memory. Its default value is 16, so you might want to decrease it. Be aware of possible performance implications, though. You can also try to use dynamic preallocated buffer sizes by setting I_MPI_DAPL_BUFFER_ENLARGEMENT to 1. With this option enabled, Intel MPI initially registers small buffers and grows them later if needed. Note also that IMPI opens connections lazily, and that's why you see the huge increase in used memory only after the call to MPI_Send.
If you are not using the DAPL transport, e.g. if you use the ofa transport instead, there is not much that you can do. You can enable XRC queues by setting I_MPI_OFA_USE_XRC to 1; this should somewhat decrease the memory used. Also, enabling dynamic queue-pair creation by setting I_MPI_OFA_DYNAMIC_QPS to 1 might decrease memory usage if the communication graph of your program is not fully connected (a fully connected program being one in which each rank talks to all other ranks).
Hristo's answer is mostly right, but since you are using small messages there's a bit of a difference. The messages end up on the eager path: they first get copied to an already-registered buffer, then that buffer is used for the transfer, and the receiver copies the message out of an eager buffer on their end. Reusing buffers in your code will only help with large messages.
This is done precisely to avoid the slowness of registering the user-supplied buffer. For large messages the copy takes longer than the registration would, so the rendezvous protocol is used instead.
These eager buffers are somewhat wasteful. For example, they are 16kB by default on Intel MPI with OF verbs. Unless message aggregation is used, each 10-int-sized message is eating four 4kB pages. But aggregation won't help when talking to multiple receivers anyway.
So what to do? Reduce the size of the eager buffers. This is controlled by setting the eager/rendezvous threshold (I_MPI_RDMA_EAGER_THRESHOLD environment variable). Try 2048 or even smaller. Note that this can result in a latency increase. Or change the I_MPI_DAPL_BUFFER_NUM variable to control the number of these buffers, or try the dynamic resizing feature that Hristo suggested. This assumes your IMPI is using DAPL (the default). If you are using OF verbs directly, the DAPL variables won't work.
Edit: So the final solution for getting this to run was setting I_MPI_DAPL_UD=enable. I can speculate on the origin of the magic, but I don't have access to Intel's code to actually confirm this.
IB can have different transport modes, two of which are RC (Reliable Connected) and UD (Unreliable Datagram). RC requires an explicit connection between hosts (like TCP), and some memory is spent per connection. More importantly, each connection has those eager buffers tied to it, and this really adds up. This is what you get with Intel's default settings.
There is an optimization possible: sharing the eager buffers between connections (this is called SRQ - Shared Receive Queue). There's a further Mellanox-only extension called XRC (eXtended RC) that takes the queue sharing further: between the processes that are on the same node.
By default Intel's MPI accesses the IB device through DAPL, and not directly through OF verbs. My guess is this precludes these optimizations (I don't have experience with DAPL). It is possible to enable XRC support by setting I_MPI_FABRICS=shm:ofa and I_MPI_OFA_USE_XRC=1 (making Intel MPI use the OFA interface instead of DAPL).
When you switch to the UD transport you get a further optimization on top of buffer sharing: there is no longer a need to track connections. The buffer sharing is natural in this model: since there are no connections, all the internal buffers are in a shared pool, just like with SRQ. So there are further memory savings, but at a cost: datagram delivery can potentially fail, and it is up to the software, not the IB hardware to handle retransmissions. This is all transparent to the application code using MPI, of course.