Get memory high water mark for a time interval - linux

I'm trying to get the max amount of memory used during brief intervals, for a long-running linux process. For example, something like:
resetmaxrss(); // hypothetical new command
void* foo = malloc(4096);
free(foo);
getrusage(...); // 'ru_maxrss' reports 4096 plus whatever else is alive
resetmaxrss();
void* bar = malloc(2048);
free(bar);
getrusage(...); // 'ru_maxrss' reports 2048 + whatever, *not* 4096
Options I've found and ruled out:
getrusage's max RSS can't be reset.
cgmemtime seems to use wait4 under the hood, so it isn't viable for querying a process while it's running.
tstime reports on exiting processes, so it is also not viable for querying a process while it's running.
Other options, none of which are good:
Polling. Prone to miss our brief allocations.
Instrumenting our code. We don't have access to all of the memory allocators being used, so this wouldn't be very elegant or straightforward. I'd also rather use values reported by the OS for accuracy.
Is there a way to do this, short of proposing a patch to the Linux kernel?

It turns out that since Linux 4.0, the peak RSS can be reset:
/proc/[pid]/clear_refs (since Linux 2.6.22)
This is a write-only file, writable only by owner of the
process.
The following values may be written to the file:
[snip]
5 (since Linux 4.0)
Reset the peak resident set size ("high water mark") to
the process's current resident set size value.
That HWM/peak RSS can be read out with /proc/[pid]/status -> VmHWM or getrusage().
Patch RFC
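As a minimal sketch (assuming Linux 4.0 or newer; the helper names below are mine, not an existing API), the reset-and-read cycle from C looks roughly like this:
/* Sketch only: reset the current process's peak RSS, do some work, then
 * read the new high-water mark back from /proc/self/status (VmHWM). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void reset_peak_rss(void)
{
    FILE *f = fopen("/proc/self/clear_refs", "w");
    if (f) {
        fputs("5", f);   /* 5 = reset peak RSS to current RSS (Linux >= 4.0) */
        fclose(f);
    }
}

static long read_peak_rss_kb(void)
{
    char line[256];
    long kb = -1;
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "VmHWM: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

int main(void)
{
    reset_peak_rss();
    char *foo = malloc(4096);
    memset(foo, 1, 4096);        /* touch the pages so they count toward RSS */
    free(foo);
    printf("peak RSS since reset: %ld kB\n", read_peak_rss_kb());
    return 0;
}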

Related

How to measure minor page fault cost?

I want to verify that transparent huge pages (THP) cause higher page fault latency, because Linux must zero pages before returning them to the user. A THP is 512 times larger than a 4 KB page and thus slower to clear. When memory is fragmented, the OS often has to compact memory to generate a THP.
So I want to measure minor page fault latency (cost), but I still have no idea how to do it.
Check the https://www.kernel.org/doc/Documentation/vm/transhuge.txt documentation and search LWN and Red Hat docs for THP latency and THP faults.
https://www.kernel.org/doc/Documentation/vm/transhuge.txt says the following about the huge zero page:
By default kernel tries to use huge zero page on read page fault to
anonymous mapping. It's possible to disable huge zero page by writing 0
or enable it back by writing 1:
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
You can vary the setting (introduced around 2012: https://lwn.net/Articles/517465/ "Adding a huge zero page") and measure page mapping and access latency. Just read some system time with rdtsc/rdtscp/CLOCK_MONOTONIC, access the page, reread the time, and record statistics about the time differences, like min/max/avg. You can also draw a histogram: count how many differences fell in the 0..100, 101..300, 301..600, ... ranges and how many were bigger than some huge value. The array used to count the histogram can be quite small.
You may also try mmap() with the MAP_POPULATE flag (http://d3s.mff.cuni.cz/teaching/advanced_operating_systems/slides/10_huge_pages.pdf, page 17).
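A rough measurement harness along those lines (my own hypothetical sketch, not taken from the linked material) could look like this: mmap an anonymous region, touch it one page at a time, and time each first touch with CLOCK_MONOTONIC. With THP enabled, roughly every 512th 4 KB touch should hit a fresh 2 MB huge page and show a noticeably larger latency.
/* Sketch: time the first write to each 4 KB page of an anonymous mapping.
 * Each first touch triggers a minor page fault; with THP, faults that
 * allocate and zero a whole 2 MB huge page should stand out as outliers. */
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>

#define PAGE   4096UL
#define NPAGES (512UL * 64)           /* 128 MB of anonymous memory */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
    volatile char *p = mmap(NULL, NPAGES * PAGE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    uint64_t min = UINT64_MAX, max = 0, sum = 0;
    for (size_t i = 0; i < NPAGES; i++) {
        uint64_t t0 = now_ns();
        p[i * PAGE] = 1;              /* first touch -> minor page fault */
        uint64_t dt = now_ns() - t0;
        if (dt < min) min = dt;
        if (dt > max) max = dt;
        sum += dt;
    }
    printf("fault latency ns: min=%llu max=%llu avg=%llu\n",
           (unsigned long long)min, (unsigned long long)max,
           (unsigned long long)(sum / NPAGES));
    return 0;
}
Run it once with THP enabled and once with it disabled (or with use_zero_page toggled) and compare the min/max/avg and the shape of the outliers.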
RedHat blog has post about THP & page fault latency (with help of their stap SystemTap tracing): https://developers.redhat.com/blog/2014/03/10/examining-huge-pages-or-transparent-huge-pages-performance/
To prevent information leakage from the previous user of the page the kernel writes zeros in the entire page. For a 4096 byte page this is a relatively short operation and will only take a couple of microseconds. The x86 hugepages are 2MB in size, 512 times larger than the normal page. Thus, the operation may take hundreds of microseconds and impact the operation of latency sensitive code. Below is a simple SystemTap command line script to show which applications have huge pages zeroed out and how long those operations take. It will run until cntl-c is pressed.
stap -e 'global huge_clear probe kernel.function("clear_huge_page").return {
huge_clear [execname(), pid()] <<< (gettimeofday_us() - @entry(gettimeofday_us()))}'
Also, I'm not sure about this, but in theory the Linux kernel may have a kernel thread that pre-zeroes huge pages before any application requires them.

Do I need to tune sysctl.conf under linux when running MongoDB?

We are seeing occasional huge writes to disk in the MongoDB log, effectively locking MongoDB for a long time. Many people are reporting similar issues on the net, but I have found no good answers so far.
Tue Mar 11 09:42:49.818 [DataFileSync] flushing mmaps took 75264ms for 46 files
The average mmap flush on my server is around 100 ms according to the mongo statistics.
A large percentage of our MongoDB data is updated within a few hours. This leads me to wonder whether we need to tune the Linux sysctl virtual memory parameters as described in the performance guide for Neo4j, another memory-mapped tool: http://docs.neo4j.org/chunked/stable/linux-performance-guide.html
There are a lot of blocks going out to IO, way more than expected for the write speed we
are seeing in the benchmark. Another observation that can be made is that the Linux kernel
has spawned a process called "flush-x:x" (run top) that seems to be consuming a lot of
resources.
The problem here is that the Linux kernel is trying to be smart and write out dirty pages
from the virtual memory. As the benchmark will memory map a 1GB file and do random writes
it is likely that this will result in 1/4 of the memory pages available on the system to
be marked as dirty. The Neo4j kernel is not sending any system calls to the Linux kernel to
write out these pages to disk however the Linux kernel decided to start doing so and it
is a very bad decision. The result is that instead of doing sequential like writes down
to disk (the logical log file) we are now doing random writes writing regions of the
memory mapped file to disk.
TOP shows that we indeed have a flush process that has been running a very long time, so this seems to match.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28352 mongod 20 0 153g 3.2g 3.1g S 3.3 42.3 299:18.36 mongod
3678 root 20 0 0 0 0 S 0.3 0.0 26:27.88 flush-253:1
The recommended Neo4J sysctl settings are
vm.dirty_background_ratio = 50
vm.dirty_ratio = 80
Do these settings have any relevance for a MongoDB installation at all?
The short answer is "yes". What values to choose depends very much on your write patterns. This gives background on exactly how MongoDB manages its mappings - it's not anything unexpected.
One wrinkle is that in a web-facing database application, you may care about latency more than throughput. vm.dirty_background_ratio gives the threshold at which the kernel starts writing dirty pages out in the background, and vm.dirty_ratio gives the point at which processes generating writes are blocked and must flush dirty data themselves before continuing.
If you are hammering a relatively small working set, you can be OK with setting both of those values fairly high, and relying on Mongo's (or the OS's) periodic time-based flush-to-disk to commit the writes.
If you're conducting a high volume of inserts and also some modifications, which sounds like it might be your situation, it's a balancing act that depends on the mix of inserts vs. rewrites: starting to flush too early will cause writes that will soon be rewritten, "wasting" I/O; starting to flush too late will result in pauses while huge backlogs are flushed.
If you're doing mostly inserts, then you may very well want a large dirty_ratio (to avoid blocking) and a relatively small dirty_background_ratio (small enough to always be writing as you're inserting to reduce latency, and just large enough to linearize some of the writes).
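For example, a starting point along those lines (illustrative values only, to be validated against your own workload) might be:
vm.dirty_background_ratio = 5
vm.dirty_ratio = 60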
The correct solution is to replay some dummy data with various options for those sysctl parameters, and optimize it by brute force, bearing in mind your average latency / total throughput objectives.

Memory Debugging

Currently I am analyzing a C++ application and its memory consumption. Checking the memory consumption of the process before and after a certain function call is possible. However, it seems that, for technical reasons or for better efficiency, the OS (Linux) assigns not only the required number of bytes but always a few more, which can be consumed later by the application. This makes it hard to analyze the memory behavior of the application.
Is there a workaround? Can one switch Linux to a mode where it assigns just the required number of bytes/pages?
If you use malloc/new, the allocator will always allocate a few more bytes than you requested, as it needs some room for its housekeeping, and it may also need to align allocations on page boundaries. The number of supplementary bytes allocated is implementation dependent.
You can consider using tools such as gperftools (from Google) to monitor the memory used.
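To see the over-allocation directly, here is a small illustration using the glibc-specific malloc_usable_size():
/* Print how many bytes the allocator actually reserved for each request;
 * it is usually somewhat more than was asked for. glibc-specific. */
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    for (size_t req = 1; req <= 4096; req *= 4) {
        void *p = malloc(req);
        printf("requested %zu bytes, usable %zu bytes\n",
               req, malloc_usable_size(p));
        free(p);
    }
    return 0;
}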
I wanted to check a process for memory leaks some years ago.
What I did was the following: I wrote a very small debugger (it is easier than it sounds) that simply set breakpoints on malloc(), free(), mmap(), ... and similar functions (I did that under Windows, but under Linux it is simpler - I did it on Linux for another purpose!).
Whenever a breakpoint was reached I logged the function arguments and continued program execution...
By processing the logfile (semi-automated) I could find memory leaks.
Disadvantage: It is not possible to debug the program using another debugger in parallel.
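For completeness, glibc also ships a much simpler built-in logging facility, mtrace(3), which covers the common case without writing a debugger (though it only sees glibc's own malloc, not mmap or custom allocators):
/* Sketch: glibc's mtrace() logs every malloc/realloc/free to the file named
 * by the MALLOC_TRACE environment variable; the mtrace(1) script then
 * summarizes any allocations that were never freed. */
#include <mcheck.h>
#include <stdlib.h>

int main(void)
{
    mtrace();                       /* start logging; needs MALLOC_TRACE set */
    void *leaked = malloc(128);     /* never freed: will show up in the log */
    void *freed  = malloc(256);
    free(freed);
    muntrace();                     /* stop logging */
    (void)leaked;
    return 0;
}
Run it as MALLOC_TRACE=trace.log ./a.out, then mtrace ./a.out trace.log to get a summary of unfreed blocks.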

How to stop page cache for disk I/O in my linux system?

Here is my system, based on Linux 2.6.32.12:
1. It contains 20 processes which consume a lot of user CPU.
2. It needs to write data to disk at a rate of 100M/s, and that data will not be used again soon.
What I expect:
It should run steadily, and disk I/O should not affect my system.
My problem:
At the beginning, the system ran as I expected. But as time passed, Linux cached a lot of data for the disk I/O, which reduced the available physical memory. Eventually there was not enough memory, so Linux started swapping my processes in and out. That caused an I/O problem in which a lot of CPU time was spent on I/O.
What I have tried:
I tried to solve the problem by calling fsync every time I write a large block, but physical memory keeps decreasing while the cache keeps growing.
How can I stop the page cache here? It's useless for me.
More information:
When top shows 46963m free, all is well: CPU %wa is low and vmstat shows no si or so activity.
When top shows 273m free, %wa is so high that it affects my processes, and vmstat shows a lot of si and so activity.
I'm not sure that changing something will affect overall performance.
Maybe you might use posix_fadvise(2) and sync_file_range(2) in your program (and more rarely fsync(2) or fdatasync(2) or sync(2) or syncfs(2), ...). Also look at madvise(2), mlock(2) and munlock(2), and of course mmap(2) and munmap(2). Perhaps ionice(1) could help.
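As a minimal sketch of the write-and-drop pattern those calls enable (assuming the data is written sequentially and never re-read; the helper name is mine):
/* After each large chunk, push it to disk with sync_file_range(), then drop
 * it from the page cache with posix_fadvise() so it cannot pile up and
 * squeeze out the working set. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Write one chunk at the given file offset, then flush and evict it. */
static ssize_t write_and_drop(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t n = pwrite(fd, buf, len, off);
    if (n > 0) {
        /* Start writeback for this range and wait for it to complete. */
        sync_file_range(fd, off, n,
                        SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
        /* Tell the kernel these pages will not be needed again. */
        posix_fadvise(fd, off, n, POSIX_FADV_DONTNEED);
    }
    return n;
}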
In the reading process, you might perhaps use readahead(2) (perhaps in a separate thread).
Upgrading your kernel (to a 3.6 or better) could certainly help: Linux has improved significantly on these points since 2.6.32 which is really old.
To drop pagecache you can do the following:
"echo 1 > /proc/sys/vm/drop_caches"
drop_caches is usually 0 and can be changed as needed. As you've identified yourself that you need to free the page cache, this is how to do it. You can also take a look at dirty_writeback_centisecs (and its related tunables) (http://lxr.linux.no/linux+*/Documentation/sysctl/vm.txt#L129) to make writeback happen sooner, but note that it might have consequences, as it wakes the kernel flusher threads to write out dirty pages. Also note the use of dirty_expire_centisecs, which defines how long dirty data must age before it becomes eligible for writeout.

Memory of type "root-set" reallocation Error - Erlang

I have been running a crypto-intensive application that generates pseudo-random strings with special structure and mathematical requirements. It has generated around 1.7 million voucher numbers per node over the last 8 days. The generation process is CPU intensive, with very low memory requirements.
Mnesia running on OTP-14B02 was the storage database and the generation was done within each virtual machine. I had 3 nodes in the cluster, with all mnesia tables of type disc_only_copies. Suddenly, as activity on the Solaris boxes increased (other users logged on remotely and were starting webservers, ftp sessions, and other tasks), my bash shell started reporting a fork: not enough space error.
My Erlang VMs also went down with the error below:
Crash dump was written to: erl_crash.dump
temp_alloc: Cannot reallocate 8388608 bytes of memory (of type "root_set").
Usually we get memory allocation errors rather than memory reallocation errors, and normally memory of type "heap" is the problem. This time, the memory type reported is "root-set".
Qn 1. What is this "root-set" memory?
Qn 2. Does it have anything to do with CPU-intensive activity? (I ask because when I start the task, the machine's response to mouse or keyboard interrupts becomes very slow, meaning either the CPU is too busy or there is some other problem I cannot explain for now.)
Qn 3. Can such an error be avoided, and how?
The fork: not enough space message suggests this is a problem with the operating system setup, but:
Q1 - The Root Set
The Root Set is what the garbage collector uses as a starting point when it searches for data that is live in the heap. It usually starts from the registers of the VM and from the stack, if the stack has references to heap data that still needs to stay live. There may be other roots in Erlang I am not aware of, but these are the basic ones you start from.
That it is a reallocation error of exactly 8 megabytes could mean one of two things: either you don't have 8 megabytes free in the heap, or the heap is fragmented beyond recognition, so that while there are 8 megabytes free in total, there is no contiguous run of that size.
Q2 - CPU activity impact
The problem has nothing to do with the CPU per se. You are running out of memory. A large root set could indicate that you have some very deep recursions going on where you keep around a lot of pointers to data. You may be able to rewrite the code such that it is tail-calling and uses less memory while operating.
You should be more worried about the slow response times from the keyboard and mouse. That could indicate something is not right. Does a vmstat 1, a sysstat, a htop, a dstat or similar show anything odd while the process is running? You are also on the hunt to figure out if the kernel or the C libc is doing something odd here due to memory being constrained.
Q3 - How to fix
I don't know how to fix it without knowing more about what the application is doing. Since you have a crash dump, your first instinct should be to open it in the crash dump viewer and look at it. The goal is to find a process using a lot of memory, or one with a deep stack. From there, you can seek to limit the amount of memory that process uses, either by rewriting the code so it can give memory up earlier, by tuning the garbage collection setup for the process (see the spawn options in the Erlang man pages), or by adding more memory to the system.

Resources