memory usage discrepancy between pprof and ps - linux

I have been trying to profile the heap usage of a CLI tool built with cobra.
pprof shows the following:
Flat Flat% Sum% Cum Cum% Name Inlined?
1.58GB 49.98% 49.98% 1.58GB 49.98% os.ReadFile
1.58GB 49.98% 99.95% 1.58GB 50.02% github.com/bytedance/sonic.(*frozenConfig).Unmarshal
0 0.00% 99.95% 3.16GB 100.00% runtime.main
0 0.00% 99.95% 3.16GB 100.00% main.main
0 0.00% 99.95% 3.16GB 100.00% github.com/spf13/cobra.(*Command).execute
0 0.00% 99.95% 3.16GB 100.00% github.com/spf13/cobra.(*Command).ExecuteC
0 0.00% 99.95% 3.16GB 100.00% github.com/spf13/cobra.(*Command).Execute (inline)
0 0.00% 99.95% 3.16GB 100.00% github.com/mirantis/broker/misc.ParseUcpNodesInspect
0 0.00% 99.95% 3.16GB 100.00% github.com/mirantis/broker/cmd.glob..func3
0 0.00% 99.95% 3.16GB 100.00% github.com/mirantis/broker/cmd.getInfos
0 0.00% 99.95% 3.16GB 100.00% github.com/mirantis/broker/cmd.Execute
0 0.00% 99.95% 1.58GB 50.02% github.com/bytedance/sonic.Unmarshal
But ps shows that, by the end, it consumes almost 6752.23 MB (RSS).
Also, I am putting defer profile.Start(profile.MemProfileHeap).Stop() in the last function that gets executed. Putting the profiler in func main doesn't show anything, so I traced through the functions and found the considerable memory usage in the last one.
My question is: how do I find the missing ~3 GB of memory?

There are multiple problems (with your question):
ps (and top etc.) shows multiple memory readings. The only one of typical interest is usually called RES or RSS. You don't say which one you looked at.
Basically, looking at the reading typically named VIRT is not interesting.
As Volker said, pprof does not measure memory consumption, it measures (in the mode you have run it) memory allocation rate—in the sense of "how much", not "how frequently".
To understand what it means, consider how pprof works.
During profiling, the heap profiler does not watch your program's memory from the outside; rather, the runtime samples allocations as they happen: by default roughly one sample per 512 KiB allocated (controlled by runtime.MemProfileRate), and each sample records the call stack that performed the allocation.
This means that if your process calls, say, os.ReadFile, which by its contract allocates a slice of bytes long enough to hold the whole contents of the file, 100 times to read a 1 GiB file each time, the profile will attribute on the order of 100 GiB of allocation to os.ReadFile (it is sampling, so tiny allocations can be missed, but allocations this large will not be).
But if your program does not hold on to all of the slices returned by these calls, and instead does something with each slice and throws it away after processing, the slices from past calls will likely have been collected by the GC by the time the newer ones are allocated.
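As a hedged sketch (hypothetical file path and loop, not the asker's code): in the alloc_space view of a heap profile, a loop like the following gets charged with 100 times the file size, even though at most one buffer is ever live.

    package main

    import "os"

    // process stands in for whatever work is done with the data;
    // after it returns, the buffer is garbage and can be collected.
    func process(b []byte) { _ = len(b) }

    func main() {
        for i := 0; i < 100; i++ {
            data, err := os.ReadFile("/tmp/big.file") // hypothetical large file
            if err != nil {
                panic(err)
            }
            process(data) // no reference to data survives this iteration
        }
    }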
While not required by the spec, the two "standard" contemporary implementations of Go (the one originally dubbed "gc", which most people think of as the implementation, and the GCC frontend) feature a garbage collector that runs concurrently with your own code. The moments at which it actually collects the garbage produced by your process are governed by a set of complicated heuristics (start here if interested) which try to strike a balance between spending CPU time on GC and spending RAM by putting it off ;-) This means that for short-lived processes, the GC might not kick in even a single time, in which case your process will exit with all the generated garbage still floating around, and all that memory will be reclaimed by the OS in the usual way when the process ends.
When the GC collects garbage, the freed memory is not returned to the OS immediately. Instead, a two-stage process is involved:
First, the freed regions are returned to the memory manager which is part of the Go runtime powering your running program.
This is sensible because in a typical program memory churn is high enough that freed memory will likely be allocated again soon.
Second, pages which stay free long enough are marked to let the OS know it can use them for its own needs.
Basically this means that even after the GC frees some memory, you won't see it from outside the running Go process, as this memory is first returned to the process's own pool.
Different versions of Go (again, I mean the "gc" implementation) have implemented different policies for returning freed pages to the OS: originally they were marked with madvise(2) as MADV_DONTNEED, then (in Go 1.12) as MADV_FREE, and then (since Go 1.16) again as MADV_DONTNEED.
If you happen to use a version of Go whose runtime marks freed memory as MADV_FREE, the readings of RSS will be even less sensible, because memory marked that way still counts against the process's RSS even though the OS has been hinted it can reclaim that memory when needed.
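If you want to see the return-to-OS behaviour for yourself, the standard library exposes a blunt instrument: runtime/debug.FreeOSMemory forces a GC and then attempts to return freed pages to the OS. A minimal experimental sketch (the 1 GiB workload is made up):

    package main

    import (
        "runtime/debug"
        "time"
    )

    func main() {
        buf := make([]byte, 1<<30) // hypothetical ~1 GiB workload
        for i := range buf {
            buf[i] = 1 // touch every page so it really lands in RSS
        }
        buf = nil               // drop the only reference
        debug.FreeOSMemory()    // force a GC and return freed pages to the OS
        time.Sleep(time.Minute) // window to compare RSS with ps/top
        // Note: on runtimes that use MADV_FREE, RSS may still not drop
        // until the OS actually needs the pages back.
    }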
To recap: this topic is complex enough, and you appear to be drawing certain conclusions too fast ;-)
An update.
I've decided to expand on memory management a bit, because I feel certain bits and pieces may be missing from your big picture of this stuff, and because of this you might find the comments on your question moot and dismissive.
The reasoning behind the advice not to measure the memory consumption of Go programs using ps, top and friends is rooted in the fact that the memory management implemented in the runtime environments powering programs written in contemporary high-level programming languages is quite far removed from the down-to-the-metal memory management implemented in OS kernels and the hardware they run on.
Let's consider Linux, for concrete, tangible examples.
You certainly can ask the kernel directly to allocate memory for you: mmap(2) is the syscall which does that.
If you call it with MAP_PRIVATE (and usually also with MAP_ANONYMOUS), the kernel will make sure your process's page table has enough new entries to cover a contiguous region of as many bytes as you requested, and will return the address of the first page in the sequence.
At this point you might think the RSS of your process has grown by that number of bytes, but it has not: the memory was "reserved" but not actually allocated; for a memory page to really get allocated, the process has to "touch" any byte within the page, by reading or writing it. This generates a so-called "page fault" on the CPU, and the in-kernel handler asks the hardware to actually allocate a real "hardware" memory page; only then does the page count against the process's RSS.
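You can watch this happen from Go using the raw mmap wrapper in the syscall package; a sketch (the sleep durations are arbitrary; check RSS from another terminal with ps between the two phases):

    package main

    import (
        "fmt"
        "syscall"
        "time"
    )

    func main() {
        const size = 1 << 30 // ask for 1 GiB of address space
        mem, err := syscall.Mmap(-1, 0, size,
            syscall.PROT_READ|syscall.PROT_WRITE,
            syscall.MAP_PRIVATE|syscall.MAP_ANONYMOUS)
        if err != nil {
            panic(err)
        }
        fmt.Println("mapped: VIRT grew by 1 GiB, RSS did not")
        time.Sleep(30 * time.Second)
        for i := 0; i < size; i += 4096 { // 4 KiB pages on x86
            mem[i] = 1 // each first touch of a page triggers a page fault
        }
        fmt.Println("touched: now RSS grew by ~1 GiB too")
        time.Sleep(30 * time.Second)
    }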
OK, that's fun, but you can probably see a problem: it's not very convenient to operate on whole pages (which can be of different sizes on different systems; typically 4 KiB on the x86 lineage). When you program in a high-level language, you don't think about memory at such a low level; instead, you expect the running program to somehow materialize "objects" (I do not mean OOP here; just pieces of memory containing values of some language- or user-defined types) as you need them.
These objects may be of any size, most of the time way smaller than a single memory page, and, more importantly, most of the time you do not even think about how much space these objects consume when allocated.
Even when programming in a language like C, which these days is considered to be quite low-level, you're usually accustomed to using memory management functions in the malloc(3) family provided by the standard C library, which allow you to allocate regions of memory of arbitrary size.
A way to solve this is to have a higher-level memory manager on top of what the kernel can do for your program, and the fact is, every single general-purpose program written in a high-level language (even C and C++!) uses one: for interpreted languages (such as Perl, Tcl, Python, POSIX shell etc.) it is provided by the interpreter; for bytecode-compiled languages such as Java, it is provided by the process which executes the code (the JRE for Java); for languages which compile down to machine (CPU) code, such as the "stock" implementation of Go, it is provided by the "runtime" code included in the resulting executable image file or linked into the program dynamically when it's loaded into memory for execution.
Such memory managers are usually quite complicated as they have to deal with many complex problems such as memory fragmentation, and they usually have to avoid talking to the kernel as much as possible because syscalls are slow.
The latter requirement naturally means process-level memory managers try to cache the memory they have once taken from the kernel, and are reluctant to release it back.
All this means that, say, in a typical active Go program you might have crazy memory churn: hordes of small objects being allocated and deallocated all the time, and this has next to no effect on the value of RSS monitored "from the outside" of the process: all this churn is handled by the in-process memory manager and, in the case of the stock Go implementation, by the GC, which is naturally tightly integrated with the MM.
Because of that, to get a useful, actionable idea of what is happening in a long-running production-grade Go program, such a program usually provides a set of continuously updated metrics (delivering, collecting and monitoring them is called telemetry). For Go programs, the part of the program tasked with producing these metrics can either make periodic calls to runtime.ReadMemStats and runtime/debug.ReadGCStats or directly use what the runtime/metrics package has to offer. Looking at such metrics in a monitoring system such as Zabbix, Grafana etc. is quite instructive: you can literally see how the amount of free memory available to the in-process MM increases after each GC cycle while the RSS stays roughly the same.
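A minimal sketch of such a metrics producer using only the standard library (the 10-second interval is arbitrary; the fields are real runtime.MemStats fields):

    package main

    import (
        "fmt"
        "runtime"
        "time"
    )

    func main() {
        var m runtime.MemStats
        for range time.Tick(10 * time.Second) {
            runtime.ReadMemStats(&m)
            // HeapIdle minus HeapReleased is roughly what the in-process
            // MM holds on to without having returned it to the OS.
            fmt.Printf("live=%d MiB, fromOS=%d MiB, idle=%d MiB, released=%d MiB, GCs=%d\n",
                m.HeapAlloc>>20, m.HeapSys>>20, m.HeapIdle>>20, m.HeapReleased>>20, m.NumGC)
        }
    }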
Also note that you might consider running your Go program with various GC-related debugging settings in the special environment variable GODEBUG, described in the runtime package documentation: basically, you make the Go runtime powering your running program emit detailed information on how the GC is working. For example, running the program as GODEBUG=gctrace=1 ./yourprogram prints a summary line for each GC cycle.
Hope this will make you curious to explore these matters further ;-)
You might find this a good introduction to the memory management implemented by the Go runtime, in connection with the kernel and the hardware; a recommended read.

Related

What is coherent memory on GPU?

I have stumbled more than once on the terms "non-coherent" and "coherent" memory in tech papers related to graphics programming. I have been searching for a simple and clear explanation, but have found mostly "hardcore" papers of this type. I would be glad to receive a layman's-style answer on what coherent memory actually is on GPU architectures and how it compares to other (probably non-coherent) memory types.
Memory is memory. But different things can access that memory. The GPU can access memory, the CPU can access memory, maybe other hardware bits, whatever.
A particular thing has "coherent" access to memory if changes made by others to that memory are visible to the reader. Now, you might think this is foolishness. After all, if the memory has been changed, how could someone possibly be unable to see it?
Simply put, caches.
It turns out that changing memory is expensive. So we do everything possible to avoid changing memory unless we absolutely have to. When you write a single byte from the CPU to a pointer in memory, the CPU doesn't write that byte yet. Or at least, not to memory. It writes it to a local copy of that memory called a "cache."
The reason for this is that, generally speaking, applications do not write (or read) single bytes. They are more likely to write (and read) lots of bytes, in small chunks. So if you're going to perform an expensive operation like a memory load or store, you should load or store a large chunk of memory. So you store all of the changes you're going to make to a chunk of memory in a cache, then make a single write of that cached chunk to actual memory at some point in the future.
But if you have two separate devices that use the same memory, you need some way to be certain that writes one device makes are visible to other devices. Most GPUs can't read the CPU cache. And most CPU languages don't have language-level support to say "hey, that stuff I wrote to memory? I really mean for you to write it to memory now." So you usually need something to ensure visibility of changes.
In Vulkan, memory which is labeled by VK_MEMORY_PROPERTY_HOST_COHERENT_BIT means that, if you read/write that memory (via a mapped pointer, since that's the only way Vulkan lets you directly write to memory), you don't need to use functions vkInvalidateMappedMemoryRanges/vkFlushMappedMemoryRanges to make sure the CPU/GPU can see those changes. The visibility of any changes is guaranteed in both directions. If that flag isn't available on the memory, then you must use the aforementioned functions to ensure the coherency of the specific regions of data you want to access.
With coherent memory, one of two things is going on in terms of hardware. Either CPU access to the memory is not cached in any of the CPU's caches, or the GPU has direct access to the CPU's caches (perhaps due to being on the same die as the CPU(s)). You can usually tell that the latter is happening, because on-die GPU implementations of Vulkan don't bother to offer non-coherent memory options.
If memory is coherent then all threads accessing that memory must agree on the state of the memory at all times, e.g.: if thread 0 reads memory location A and thread 1 reads the same location at the same time, both threads should always read the same value.
But if memory is not coherent, then threads 0 and 1 might read back different values. Thread 0 could think that location A contains a 1, while thread 1 thinks that location contains a 2. The different threads would have an incoherent view of the memory.
Coherence is hard to achieve with a high number of cores. Often every core must be made aware of memory accesses from all other cores. If you have 4 cores in a quad-core CPU, coherence is not that hard to achieve, as every core must be informed about the memory accesses of 3 other cores, but in a GPU with 16 cores, every core must be made aware of the memory accesses of 15 other cores. The cores exchange data about the contents of their caches using so-called "cache coherence protocols".
This is why GPUs often only support limited forms of coherency. If some memory locations are read only or are only accessed by a single thread, then no coherence is required. If caches are small and coherence is not always required but only at specific instructions of the program, then it is possible to achieve correct behavior of the program using cache flushes before or after specific memory accesses.
If your hardware offers both coherent and non-coherent memory types, then you can expect that non-coherent memory will be faster, but if you try to run parallel algorithms using this memory they will fail in really weird ways.

vm/min_free_kbytes - Why Keep Minimum Reserved Memory?

According to this article:
/proc/sys/vm/min_free_kbytes: This controls the amount of memory that is kept free for use by special reserves including “atomic” allocations (those which cannot wait for reclaim)
My question is: what does it mean by "those which cannot wait for reclaim"? In other words, I would like to understand why there is a need to tell the system to always keep a certain minimum amount of memory free, and under what circumstances this memory will be used. [It must be used by something; I don't see the need otherwise.]
My second question: does setting this value to something higher than 4 MB (on my system) lead to better performance? We have a server which occasionally exhibits very poor shell performance (e.g. ls -l takes 10-15 seconds to execute) when certain processes get going. Would setting this number higher lead to better shell performance?
(link is dead, looks like it's now here)
That text is referring to atomic allocations, which are requests for memory that must be satisfied without giving up control (i.e. the current thread can not be suspended). This happens most often in interrupt routines, but it applies to all cases where memory is needed while holding an essential lock. These allocations must be immediate, as you can't afford to wait for the swapper to free up memory.
See Linux-MM for a more thorough explanation, but here is the memory allocation process in short:
_alloc_pages first iterates over each memory zone looking for the first one that contains eligible free pages
_alloc_pages then wakes up the kswapd task [..to..] tap into the reserve memory pools maintained for each zone.
If the memory allocation still does not succeed, _alloc_pages will either give up [..] In this process _alloc_pages executes a cond_resched() which may cause a sleep, which is why this branch is forbidden to allocations with GFP_ATOMIC.
min_free_kbytes is unlikely to help much with the described "ls -l takes 10-15 seconds to execute"; that is likely caused by general memory pressure and swapping rather than zone exhaustion. The min_free_kbytes setting only needs to allow enough free pages to handle immediate requests. As soon as normal operation is resumed, the swapper process can be run to rebalance the memory zones. The only time I've had to increase min_free_kbytes is after enabling jumbo frames on a network card that didn't support dma scattering.
To expand on your second question a bit, you will have better results tuning vm.swappiness and the dirty ratios mentioned in the linked article. However, be aware that optimizing for "ls -l" performance may cause other processes to become slower. Never optimize for a non-primary usecase.
All Linux systems will attempt to make use of all physical memory available to the system, often through the creation of a filesystem buffer cache, which, put simply, is an I/O buffer used to help improve system performance. Technically this memory is not in use, even though it is allocated for caching.
"wait for reclaim", in your question, refers to the process of reclaiming that cache memory that is "not in use" so that it can be allocated to a process. This is supposed to be transparent but in the real world there are many processes that do not wait for this memory to become available. Java is a good example, especially where a large minimum heap size has been set. The process tries to allocate the memory and if it is not instantly available in one large contiguous (atomic?) chunk, the process dies.
Reserving a certain amount of memory with min_free_kbytes allows this memory to be instantly available and reduces the memory pressure when new processes need to start, run and finish while there is a high memory load and a full buffer cache.
4MB does seem rather low because if the buffer cache is full, any process that wants an immediate allocation of more than 4MB will likely fail. The setting is very tunable and system-specific, but if you have a few GB of memory available it can't hurt to bump up the reserve memory to 128MB. I'm not sure what effect it will have on shell interactivity, but likely positive.
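For illustration, bumping the reserve boils down to writing to the proc file. A hedged Go sketch (the 128 MiB value is just the suggestion from above, and this needs root; it is equivalent to sysctl -w vm.min_free_kbytes=131072):

    package main

    import (
        "fmt"
        "os"
        "strings"
    )

    func main() {
        const path = "/proc/sys/vm/min_free_kbytes"
        cur, err := os.ReadFile(path)
        if err != nil {
            panic(err)
        }
        fmt.Printf("current reserve: %s KiB\n", strings.TrimSpace(string(cur)))
        // 128 MiB expressed in KiB, as the file name suggests.
        if err := os.WriteFile(path, []byte("131072"), 0o644); err != nil {
            panic(err)
        }
    }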
This memory is kept free from use by normal processes. As @Arno mentioned, the special processes that can use it include interrupt routines, which must run now (as it's an interrupt) and finish before any other processes can run (atomic). This can include things like swapping memory out to disk when memory is full.
If memory fills up, a kernel (memory management) task runs to swap some memory out to disk so it can free some for use by normal processes. But if vm.min_free_kbytes is too small for it to run, the system locks up: this task must run first to free memory so others can proceed, but it is stuck because it doesn't have enough reserved memory (vm.min_free_kbytes) to do its own work, resulting in a deadlock.
Also see:
https://www.linbit.com/en/kernel-min_free_kbytes/ and
https://askubuntu.com/questions/41778/computer-freezing-on-almost-full-ram-possibly-disk-cache-problem (where the memory management process has so little memory to work with it takes so long to swap little by little that it feels like a freeze.)

How to stop page cache for disk I/O in my linux system?

Here is my system, based on Linux 2.6.32.12:
1. It contains 20 processes which occupy a lot of user CPU.
2. It needs to write data to disk at a rate of 100 MB/s, and that data will not be reused soon.
What I expect:
It should run steadily, and disk I/O should not affect my system.
My problem:
At the beginning, the system ran as I expected. But as time passed, Linux cached more and more data for the disk I/O, eating into physical memory. Eventually there was not enough memory left, and Linux started swapping my processes in and out, causing an I/O problem where a lot of CPU time was spent waiting on I/O.
What I have tried:
I tried to solve the problem by calling fsync every time I write a large block, but physical memory keeps decreasing while the cache keeps growing.
How do I stop the page cache here? It's useless for me.
More information:
When top shows 46963m free, all is well: CPU %wa is low and vmstat shows no si or so activity.
When top shows 273m free, %wa is so high that it affects my processes, and vmstat shows a lot of si and so activity.
I'm not sure that changing something will affect overall performance.
Maybe you might use posix_fadvise(2) and sync_file_range(2) in your program (and more rarely fsync(2), fdatasync(2), sync(2), syncfs(2), ...); see the sketch after this answer. Also look at madvise(2), mlock(2) and munlock(2), and of course mmap(2) and munmap(2). Perhaps ionice(1) could help.
In the reader process, you might perhaps use readahead(2) (perhaps in a separate thread).
Upgrading your kernel (to a 3.6 or better) could certainly help: Linux has improved significantly on these points since 2.6.32 which is really old.
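Here is the sketch promised above: a hedged Go version of the writer-side advice, using golang.org/x/sys/unix for the two syscalls (the chunked-write shape of the program, the file path and the chunk size are assumptions, not taken from the question):

    package main

    import (
        "os"

        "golang.org/x/sys/unix"
    )

    // writeChunk writes one chunk, forces writeback of exactly that byte
    // range, then hints the kernel that its cached pages won't be reused.
    func writeChunk(f *os.File, off int64, buf []byte) error {
        if _, err := f.WriteAt(buf, off); err != nil {
            return err
        }
        n := int64(len(buf))
        // Start and wait for writeback of just this range.
        if err := unix.SyncFileRange(int(f.Fd()), off, n,
            unix.SYNC_FILE_RANGE_WRITE|unix.SYNC_FILE_RANGE_WAIT_AFTER); err != nil {
            return err
        }
        // Tell the kernel the cached pages for this range can be evicted.
        return unix.Fadvise(int(f.Fd()), off, n, unix.FADV_DONTNEED)
    }

    func main() {
        f, err := os.Create("/tmp/out.dat") // hypothetical output file
        if err != nil {
            panic(err)
        }
        defer f.Close()
        chunk := make([]byte, 4<<20) // 4 MiB per write, arbitrary
        for off := int64(0); off < 1<<30; off += int64(len(chunk)) {
            if err := writeChunk(f, off, chunk); err != nil {
                panic(err)
            }
        }
    }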
To drop the page cache you can do the following:
"echo 1 > /proc/sys/vm/drop_caches"
drop_caches is usually 0 and can be changed as needed (only clean pages are dropped, so running sync first lets dirty pages be written out and freed too). As you've identified yourself that you need to free the page cache, this is how to do it. You can also take a look at dirty_writeback_centisecs and its related tunables (http://lxr.linux.no/linux+*/Documentation/sysctl/vm.txt#L129) to make writeback happen sooner, but note it might have consequences, as it wakes up the kernel flusher threads to write out dirty pages. Also note the use of dirty_expire_centisecs, which defines how long data must sit dirty before it becomes eligible for writeout.

Memory of type "root-set" reallocation Error - Erlang

I have been running a crypto-intensive application that generates pseudo-random strings with special structure and mathematical requirements. It has generated around 1.7 million voucher numbers per node over the last 8 days. The generation process was CPU-intensive, with very low memory requirements.
Mnesia running on OTP-14B02 was the storage database, and the generation was done within each virtual machine. I had 3 nodes in the cluster, with all mnesia tables of type disc_only_copies. Suddenly, as activity on the Solaris boxes increased (other users logged on remotely and started web servers, ftp sessions and other tasks), my bash shell started reporting a fork: not enough space error.
My Erlang VMs also went down with the error below:
Crash dump was written to: erl_crash.dump
temp_alloc: Cannot reallocate 8388608 bytes of memory (of type "root_set").
Usually, we get memory allocation errors rather than memory reallocation errors, and normally memory of type "heap" is the problem. This time, the memory type reported is "root_set".
Qn 1. What is this "root_set" memory?
Qn 2. Does it have to do with CPU-intensive activity? (The reason I ask is that when I start the task, the machine's response to, say, mouse or keyboard interrupts is very slow, meaning either the CPU is too busy or there is some other problem I cannot explain for now.)
Qn 3. Can such an error be avoided? And how?
The fork: not enough space message suggests this is a problem with the operating system setup, but:
Q1 - The Root Set
The Root Set is what the garbage collector uses as a starting point when it searches for live data in the heap. It usually starts from the registers of the VM and from the stack, if the stack has references to heap data that still need to be live. There may be other roots in Erlang I am not aware of, but these are the basic places you start from.
That it is a reallocation error for exactly 8 megabytes could mean one of two things: either you don't have 8 megabytes free in the heap, or the heap is fragmented beyond recognition, so that while there are 8 megabytes free in total, there is no contiguous run of that size.
Q2 - CPU activity impact
The problem has nothing to do with the CPU per se. You are running out of memory. A large root set could indicate that you have some very deep recursions going on where you keep around a lot of pointers to data. You may be able to rewrite the code such that it is tail-calling and uses less memory while operating.
You should be more worried about the slow response times from the keyboard and mouse. That could indicate something is not right. Does vmstat 1, sysstat, htop, dstat or similar show anything odd while the process is running? You should also try to figure out whether the kernel or the C library is doing something odd here due to memory being constrained.
Q3 - How to fix
I don't know how to fix it without knowing more about what the application is doing. Since you have a crash dump, your first instinct should be to open it in the crash dump viewer and look for a process using a lot of memory, or one with a deep stack. From there, you can seek to limit the amount of memory that process uses, either by rewriting the code so it can give up memory earlier, by tuning the garbage collection setup for the process (see the spawn options in the Erlang man pages), or by adding more memory to the system.

Reclaim memory after program exit

Here is my problem: after running a suite of programs, free tells me that there is about 1 GB less memory free than before execution. After some searching I found SO: What really happens when you don't free after malloc, which (as I understand it) makes clear that missing memory deallocations should not be the problem... (is that correct?)
top does not show any processes that use significant amounts of memory.
How can I find out "what happened" to the memory, i.e. which program allocated it and why it is not free after the program's execution?
Where does free collect its information?
(I am running a recent Ubuntu version)
Yes, memory used by your program is freed after your program exits.
The statistics in "free" are confusing, but the fact is that the memory IS available to other programs:
http://kevinclosson.wordpress.com/2009/11/17/linux-free-memory-is-it-free-or-reclaimable-yes-when-i-want-free-memory-i-want-free-memory/
http://sourcefrog.net/weblog/software/linux-kernel/free-mem.html
Here's an even better link:
http://www.linuxatemyram.com/
free(1) is a misnomer; it should more correctly be called unused, because that's what it shows. Or maybe it should be called physicalfree (or, more precisely, the "free" column in the output should be named "unused").
You'll note that "buffers" and "cached" tends to go up as "free" goes down. Memory does not disappear, it just gets assigned to a different "bucket".
The difference between free memory and unused memory is that while both are "free", the unused memory is truly so (no physical memory in use) whereas the simply "free" memory is often moved into the buffer cache. That is for example the case for all executable images and libraries, anything that is read-only or read-execute. If the same file is loaded again later, the "free" page is mapped into the process again and no data must be loaded.
Note that "unused" is actually a bad thing, although it is not immediately obvious (it sounds good, doesn't it?). Free (but physically used) memory serves a purpose, whereas free (unused) memory means you could as well have saved on money for RAM. Therefore, having unused memory (e.g. by purging pages) is exactly what you don't want.
Stunningly, under Windows there exist a lot of "memory optimizer" tools which cost real money and do just that...
About reclaiming memory, the way this works is easy: The OS simply removes the references to all pages in the working set. If a page is shared with another process, nothing spectacular happens. If it belongs to a non-anonymous mapping and is not writeable (or writeable and not written), it goes into the buffer cache. Otherwise, it goes zap poof.
This removes any memory allocated with malloc as well as the memory used by executables and file mappings, and (since all memory is based on pages) everything else.
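If you want a number that actually answers "how much can new programs still get", modern kernels (3.14+) export MemAvailable in /proc/meminfo, which accounts for reclaimable cache. A small sketch (not from the answer) comparing it with the "unused" figure:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open("/proc/meminfo")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            line := sc.Text()
            // MemFree is the "unused" figure; MemAvailable estimates what
            // programs can actually obtain (free plus reclaimable cache).
            if strings.HasPrefix(line, "MemFree:") || strings.HasPrefix(line, "MemAvailable:") {
                fmt.Println(line)
            }
        }
    }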
It is probably your OS using up that space for its own purposes.
For example, many modern OS's will keep programs loaded in memory after they terminate, in case you want to start them up again. If their guess is right, it saves a lot of time at the cost of some memory that wasn't being used anyway. Some OS's will even speculatively load some commonly used programs.
CPU utilization works the same way. Often your OS will speculatively do some work when the CPU would otherwise be "idle".
