Can malloc_trim() release memory from the middle of the heap? - linux

I am confused about the behaviour of malloc_trim as implemented in the glibc.
man malloc_trim
[...]
malloc_trim - release free memory from the top of the heap
[...]
This function cannot release free memory located at places other than the top of the heap.
When I now look up the source of malloc_trim() (in malloc/malloc.c) I see that it calls mtrim() which is utilizing madvise(x, MADV_DONTNEED) to release memory back to the operating system.
So I wonder if the man-page is wrong or if I misinterpret the source in malloc/malloc.c.
Can malloc_trim() release memory from the middle of the heap?

There are two usages of madvise with MADV_DONTNEED in glibc now: http://code.metager.de/source/search?q=MADV_DONTNEED&path=%2Fgnu%2Fglibc%2Fmalloc%2F&project=gnu
H A D arena.c 643 __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
H A D malloc.c 4535 __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
There was https://sourceware.org/git/?p=glibc.git;a=commit;f=malloc/malloc.c;h=68631c8eb92ff38d9da1ae34f6aa048539b199cc commit by Ulrich Drepper on 16 Dec 2007 (part of glibc 2.9 and newer):
malloc/malloc.c (public_mTRIm): Iterate over all arenas and call
mTRIm for all of them.
(mTRIm): Additionally iterate over all free blocks and use madvise
to free memory for all those blocks which contain at least one
memory page.
mTRIm (now mtrim) implementation was changed. Unused parts of chunks, aligned on page size and having size more than page may be marked as MADV_DONTNEED:
/* See whether the chunk contains at least one unused page. */
char *paligned_mem = (char *) (((uintptr_t) p
+ sizeof (struct malloc_chunk)
+ psm1) & ~psm1);
assert ((char *) chunk2mem (p) + 4 * SIZE_SZ <= paligned_mem);
assert ((char *) p + size > paligned_mem);
/* This is the size we could potentially free. */
size -= paligned_mem - (char *) p;
if (size > psm1)
madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
Man page of malloc_trim is there: https://github.com/mkerrisk/man-pages/blob/master/man3/malloc_trim.3 and it was committed by kerrisk in 2012: https://github.com/mkerrisk/man-pages/commit/a15b0e60b297e29c825b7417582a33e6ca26bf65
As I can grep the glibc's git, there are no man pages in the glibc, and no commit to malloc_trim manpage to document this patch. The best and the only documentation of glibc malloc is its source code: https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c
Additional functions:
malloc_trim(size_t pad);
609 /*
610 malloc_trim(size_t pad);
611
612 If possible, gives memory back to the system (via negative
613 arguments to sbrk) if there is unused memory at the `high' end of
614 the malloc pool. You can call this after freeing large blocks of
615 memory to potentially reduce the system-level memory requirements
616 of a program. However, it cannot guarantee to reduce memory. Under
617 some allocation patterns, some large free blocks of memory will be
618 locked between two used chunks, so they cannot be given back to
619 the system.
620
621 The `pad' argument to malloc_trim represents the amount of free
622 trailing space to leave untrimmed. If this argument is zero,
623 only the minimum amount of memory to maintain internal data
624 structures will be left (one page or less). Non-zero arguments
625 can be supplied to maintain enough trailing space to service
626 future expected allocations without having to re-obtain memory
627 from the system.
628
629 Malloc_trim returns 1 if it actually released any memory, else 0.
630 On systems that do not support "negative sbrks", it will always
631 return 0.
632 */
633 int __malloc_trim(size_t);
634
Freeing from the middle of the chunk is not documented as text in malloc/malloc.c (and malloc_trim description in commend was not updated in 2007) and not documented in man-pages project. Man page from 2012 may be the first man page of the function, written not by authors of glibc. Info page of glibc only mentions M_TRIM_THRESHOLD of 128 KB:
https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html#Malloc-Tunable-Parameters and don't list malloc_trim function https://www.gnu.org/software/libc/manual/html_node/Summary-of-Malloc.html#Summary-of-Malloc (and it also don't document memusage/memusagestat/libmemusage.so).
You may ask Drepper and other glibc developers again as you already did in https://sourceware.org/ml/libc-help/2015-02/msg00022.html "malloc_trim() behaviour", but there is still no reply from them. (Only wrong answers from other users like https://sourceware.org/ml/libc-help/2015-05/msg00007.html https://sourceware.org/ml/libc-help/2015-05/msg00008.html)
Or you may test the malloc_trim with this simple C program (test_malloc_trim.c) and strace/ltrace:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <malloc.h>
int main()
{
int *m1,*m2,*m3,*m4;
printf("%s\n","Test started");
m1=(int*)malloc(20000);
m2=(int*)malloc(40000);
m3=(int*)malloc(80000);
m4=(int*)malloc(10000);
printf("1:%p 2:%p 3:%p 4:%p\n", m1, m2, m3, m4);
free(m2);
malloc_trim(0); // 20000, 2000000
sleep(1);
free(m1);
free(m3);
free(m4);
// malloc_stats(); malloc_info(0, stdout);
return 0;
}
gcc test_malloc_trim.c -o test_malloc_trim, strace ./test_malloc_trim
write(1, "Test started\n", 13Test started
) = 13
brk(0) = 0xcca000
brk(0xcef000) = 0xcef000
write(1, "1:0xcca010 2:0xccee40 3:0xcd8a90"..., 441:0xcca010 2:0xccee40 3:0xcd8a90 4:0xcec320
) = 44
madvise(0xccf000, 36864, MADV_DONTNEED) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffffafbfff0) = 0
brk(0xceb000) = 0xceb000
So, there is madvise with MADV_DONTNEED for 9 pages after malloc_trim(0) call, when there was hole of 40008 bytes in the middle of the heap.

... utilizing madvise(x, MADV_DONTNEED) to release memory back to the
operating system.
madvise(x, MADV_DONTNEED) does not release memory. man madvise:
MADV_DONTNEED
Do not expect access in the near future. (For the time being,
the application is finished with the given range, so the kernel
can free resources associated with it.) Subsequent accesses of
pages in this range will succeed, but will result either in
reloading of the memory contents from the underlying mapped file
(see mmap(2)) or zero-fill-on-demand pages for mappings without
an underlying file.
So, the usage of madvise(x, MADV_DONTNEED) does not contradict man malloc_trim's statement:
This function cannot release free memory located at places other than the top of the heap.

Related

NtQueryObject returns wrong insufficient required size via WOW64, why?

I am using the NT native API NtQueryObject()/ZwQueryObject() from user mode (and I am aware of the risks in general and I have written kernel mode drivers for Windows in the past in my professional capacity).
Generally when one uses the typical "query information" function (of which there are a few) the protocol is first to ask with a too small buffer to retrieve the required size with STATUS_INFO_LENGTH_MISMATCH, then allocate a buffer of said size and query again -- this time using the buffer and previously returned size.
In order to get the list of object types (67 on my build) on the system I am doing just that:
ULONG Size = 0;
NTSTATUS Status = NtQueryObject(NULL, ObjectTypesInformation, &Size, sizeof(Size), &Size);
And in Size I get 8280 (WOW64) and 8968 (x64). I then proceed to allocate the buffer with calloc() and query again:
ULONG Size2 = 0;
BYTE* Buf = (BYTE*)::calloc(1, Size);
Status = NtQueryObject(NULL, ObjectTypesInformation, Buf, Size, &Size2);
NB: ObjectTypesInformation is 3. It isn't declared in winternl.h, but Nebbett (as ObjectAllTypesInformation) and others describe it. Since I am not querying for a particular object's traits but the system-wide list of object types, I pass NULL for the object handle.
Curiously on WOW64, i.e. 32-bit, the value in Size2 upon return from the second query is 16 Bytes (= 8296) bigger than the previously returned required size.
As far as alignment is concerned, I'd expect at most 8 Bytes for this sort of thing and indeed neither 8280 nor 8296 are at a 16 Byte alignment boundary, but on an 8 Byte one.
Certainly I can add some slack space on top of the returned required size (e.g. ALIGN_UP to the next 32 Byte alignment boundary), but this seems highly irregular to be honest. And I'd rather want to understand what's going on than to implement a workaround that breaks, because I miss something crucial.
The practical issue for the code is that in Debug configurations it tells me there's a corrupted heap somewhere, upon freeing Buf. Which suggests that NtQueryObject() was indeed writing these extra 16 Bytes beyond the buffer I provided.
Question: Any idea why it is doing that?
As usual for NT native API the sources of information are scarce. The x64 version of the exact same code returns the exact number of bytes required. So my thinking here is that WOW64 is the issue. A somewhat cursory look into wow64.dll with IDA didn't reveal any immediate points for suspicion regarding what goes wrong in translating the results to 32-bit here.
PS: Windows 10 (10.0.19043, ntdll.dll "timestamp" 77755782)
PPS: this may be related: https://wj32.org/wp/2012/11/30/obquerytypeinfo-and-ntqueryobject-buffer-overrun-in-windows-8/ Tested it, by checking that OBJECT_TYPE_INFORMATION::TypeName.Length + sizeof(WCHAR) == OBJECT_TYPE_INFORMATION::TypeName.MaximumLength in all returned items, which was the case.
The only part of ObjectTypesInformation that's public is the first field defined in winternl.h header in the Windows SDK:
typedef struct __PUBLIC_OBJECT_TYPE_INFORMATION {
UNICODE_STRING TypeName;
ULONG Reserved [22]; // reserved for internal use
} PUBLIC_OBJECT_TYPE_INFORMATION, *PPUBLIC_OBJECT_TYPE_INFORMATION;
For x86 this is 96 bytes, and for x64 this is 104 bytes (assuming you have the right packing mode enabled). The difference is the pointer in UNICODE_STRING which changes the alignment in x64.
Any additional memory space should be related to the TypeName buffer.
UNICODE_STRING accounts for 8 bytes of the difference between 8280 and 8296. The function uses the sizeof(ULONG_PTR) for alignment of the returned string plus an extra WCHAR, so that could easily account for the remaining 8 bytes.
AFAIK: The public use of NtQueryObject is supposed to be limited to kernel-mode use which of course means it always matches the OS native bitness (x86 code can't run as kernel in x64 native OS), so it's probably just a quirk of using the NT functions via the WOW64 thunk.
Alright, I think I figured out the issue with the help of WinDbg and a thorough look at wow64.dll using IDA.
NB: the wow64.dll I have has the same build number, but differs slightly in data only (checksum, security directory entry, pieces from version resources). The code is identical, which was to be expected, given deterministic builds and how they affect the PE timestamp.
There's an internal function called whNtQueryObject_SpecialQueryCase (according to PDBs), which covers the ObjectTypesInformation class queries.
For the above wow64.dll I used the following points of interest in WinDbg, from a 32 bit program which calls NtQueryObject(NULL, ObjectTypesInformation, ...) (the program itself is irrelevant, though):
0:000> .load wow64exts
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B0E0
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B14E
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B1A7
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B24A
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B252
Explanation of the above points of interest:
+B0E0: computing length required for 64 bit query, based on passed length for 32 bit
+B14E: call to NtQueryObject()
+B1A7: loop body for copying 64 to 32 bit buffer contents, after successful NtQueryObject() call
+B24A: computing written length by subtracting current (last + 1) entry from base buffer address
+B252: downsizing returned (64 bit) required length to 32 bit
The logic of this function in regards to just ObjectTypesInformation is roughly as follows:
Common steps
Take the ObjectInformationLength (32 bit query!) argument and size it up to fit the 64 bit info
Align the retrieved size up to the next 16 byte boundary
If necessary allocate the resulting amount from some PEB::ProcessHeap and store in TLS slot 3; otherwise using this as a scratch space
Call NtQueryObject() passing the buffer and length from the two previous steps
The length passed to NtQueryObject() is the one from step 1, not the one aligned to a 16 byte boundary. There seems to be some sort of header to this scratch space, so perhaps that's where the 16 byte alignment comes from?
Case 1: buffer size too small (here: 4), just querying required length
The up-sized length in this case equals 4, which is too small and consequently NtQueryObject() returns STATUS_INFO_LENGTH_MISMATCH. Required size is reported as 8968.
Down-size from the 64 bit required length to 32 bit and end up 16 bytes too short
Return the status from NtQueryObject() and the down-sized required length form the previous step
Case 2: buffer size supposedly (!) sufficient
Copy OBJECT_TYPES_INFORMATION::NumberOfTypes from queried buffer to 32 bit one
Step to the first entry (OBJECT_TYPE_INFORMATION) of source (64 bit) and target (32 bit) buffer, 8 and 4 byte aligned respectively
For for each entry up to OBJECT_TYPES_INFORMATION::NumberOfTypes:
Copy UNICODE_STRING::Length and UNICODE_STRING::MaximumLength for TypeName member
memcpy() UNICODE_STRING::Length bytes from source to target UNICODE_STRING::Buffer (target entry + sizeof(OBJECT_TYPE_INFORMATION32)
Add terminating zero (WCHAR) past the memcpy'd string
Copy the individual members past the TypeName from 64 to 32 bit struct
Compute pointer of next entry by aligning UNICODE_STRING::MaximumLength up to an 8 byte boundary (i.e. the ULONG_PTR alignment mentioned in the other answer) + sizeof(OBJECT_TYPE_INFORMATION64) (already 8 byte aligned!)
The next target entry (32 bit) gets 4 byte aligned instead
At the end compute required (32 bit) length by subtracting the value we arrived at for the "next" entry (i.e. one past the last) from the base address of the buffer passed by the WOW64 program (32 bit) to NtQueryObject()
In my debugged scenario these were: 0x008ce050 - 0x008cbfe8 = 0x00002068 (= 8296), which is 16 bytes larger than the buffer length we were told during case 1 (8280)!
The issue
That crucial last step differs between merely querying and actually getting the buffer filled. There is no further bounds checking in that loop I described for case 2.
And this means it will just overrun the passed buffer and return a written length bigger than the buffer length passed to it.
Possible solutions and workarounds
I'll have to approach this mathematically after some sleep, the workaround is obviously to top up the required length returned from case 1 in order to avoid the buffer overrun. The easiest method is to use my up_size_from_32bit() from the example below and use that on the returned required size. This way you are allocating enough for the 64 bit buffer, while querying the 32 bit one. This should never overrun during the copy loop.
However, the fix in wow64.dll is a little more involved, I guess. While adding bounds checking to the loop would help avert the overrun, it would mean that the caller would have to query for the required size twice, because the first time around it lies to us.
Which means the query-only case (1) would have to allocate that internal buffer after querying the required length for 64 bit, then get it filled and then walk the entries (just like the copy loop), skipping over the last entry to compute the required length the same as it is now done after the copy loop.
Example program demonstrating the "static" computation by wow64.dll
Build for x64, just the way wow64.dll was!
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <cstdio>
typedef struct
{
ULONG JustPretending[24];
} OBJECT_TYPE_INFORMATION32;
typedef struct
{
ULONG JustPretending[26];
} OBJECT_TYPE_INFORMATION64;
constexpr ULONG size_delta_3264 = sizeof(OBJECT_TYPE_INFORMATION64) - sizeof(OBJECT_TYPE_INFORMATION32);
constexpr ULONG down_size_to_32bit(ULONG len)
{
return len - size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION64));
}
constexpr ULONG up_size_from_32bit(ULONG len)
{
return len + size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION32));
}
// Trying to mimic the wdm.h macro
constexpr size_t align_up_by(size_t address, size_t alignment)
{
return (address + (alignment - 1)) & ~(alignment - 1);
}
constexpr auto u32 = 8280UL;
constexpr auto u64 = 8968UL;
constexpr auto from_64 = down_size_to_32bit(u64);
constexpr auto from_32 = up_size_from_32bit(u32);
constexpr auto from_32_16_byte_aligned = (ULONG)align_up_by(from_32, 16);
int wmain()
{
wprintf(L"32 to 64 bit: %u -> %u -(16-byte-align)-> %u\n", u32, from_32, from_32_16_byte_aligned);
wprintf(L"64 to 32 bit: %u -> %u\n", u64, from_64);
return 0;
}
static_assert(sizeof(OBJECT_TYPE_INFORMATION32) == 96, "Size for 64 bit struct does not match.");
static_assert(sizeof(OBJECT_TYPE_INFORMATION64) == 104, "Size for 64 bit struct does not match.");
static_assert(u32 == from_64, "Must match (from 64 to 32 bit)");
static_assert(u64 == from_32, "Must match (from 32 to 64 bit)");
static_assert(from_32_16_byte_aligned % 16 == 0, "16 byte alignment failed");
static_assert(from_32_16_byte_aligned > from_32, "We're aligning up");
This does not mimic the computation that happens in case 2, though.

While running, is it possible to display the currently allocated buffers managed by LeakSanitizer?

I've a very large program (okay, only 13,000 lines of code according to cloc) which leaks. I know because over time, it uses more and more resident memory.
I have the sanitizer option turned on, but on a clean exit, all my C++ software will properly clean everything as expected. So I don't see anything growing in the sanitizer output.
What would be useful in this case, is a way to call a function which displays the (large) list of allocated buffers while running the code. I can then look at a diff of two such outputs and see what was allocated anew. The leaked buffers will be in there...
At this point, though, I just don't see any header with sanitizer functions I could call to see such a list. Does it exist?
Lsan interface is available in sanitizer/lsan_interface.h but AFAIK it has no API to print allocation info. The best you can get is compile your code with Asan (which includes Lsan as well) and use __asan_print_accumulated_stats to get basic allocation statistics:
$ cat tmp.c
#include <sanitizer/asan_interface.h>
#include <stdlib.h>
int main() {
malloc(100);
__asan_print_accumulated_stats();
return 0;
}
$ gcc -fsanitize=address -g tmp.c && ./a.out
Stats: 0M malloced (0M for red zones) by 2 calls
Stats: 0M realloced by 0 calls
Stats: 0M freed by 0 calls
Stats: 0M really freed by 0 calls
Stats: 0M (0M-0M) mmaped; 5 maps, 0 unmaps
mallocs by size class: 7:1; 11:1;
Stats: malloc large: 0
Stats: StackDepot: 2 ids; 0M allocated
Stats: SizeClassAllocator64: 0M mapped in 256 allocations; remains 256
07 (112): mapped: 64K allocs: 128 frees: 0 inuse: 128 num_freed_chunks 457 avail: 585 rss: 4K releases: 0
11 (176): mapped: 64K allocs: 128 frees: 0 inuse: 128 num_freed_chunks 244 avail: 372 rss: 4K releases: 0
Stats: LargeMmapAllocator: allocated 0 times, remains 0 (0 K) max 0 M; by size logs:
=================================================================
==15060==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 100 byte(s) in 1 object(s) allocated from:
#0 0x7fdf2194fb40 in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xdeb40)
#1 0x559ca08a7857 in main /home/yugr/tmp.c:5
#2 0x7fdf214a1bf6 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21bf6)
SUMMARY: AddressSanitizer: 100 byte(s) leaked in 1 allocation(s).
Unfortunately there is no way to print exact allocations.

Linux OS: /proc/[pid]/smaps vs /proc/[pid]/statm

I would like calculate the memory usage for single process. So after a little bit of research I came across over smaps and statm.
First of all what is smaps and statm? What is the difference?
statm has a field RSS and in smaps I sum up all RSS values. But those values are different for the same process. I know that statm measures in pages. For comparison purposes I converted that value in kb as in smaps. But those values are not equal.
Why do these two values differ, even though they represent the rss value for the same process?
statm
232214 80703 7168 27 0 161967 0 (measured in pages, pages size is 4096)
smaps
Rss 1956
My aim is to calculate the memory usage for a single process. I am interested in two values. USS and PSS.
Can I gain those two values by just using smaps? Is that value correct?
Also, I would like to return that value as percentage.
I think statm is an approximated simplification of smaps, which is more expensive to get. I came to this conclusion after I looked at the source:
smaps
The information you see in smaps is defined in /fs/proc/task_mmu.c:
static int show_smap(struct seq_file *m, void *v, int is_pid)
{
(...)
struct mm_walk smaps_walk = {
.pmd_entry = smaps_pte_range,
.mm = vma->vm_mm,
.private = &mss,
};
memset(&mss, 0, sizeof mss);
walk_page_vma(vma, &smaps_walk);
show_map_vma(m, vma, is_pid);
seq_printf(m,
(...)
"Rss: %8lu kB\n"
(...)
mss.resident >> 10,
The information in mss is used by walk_page_vma defined in /mm/pagewalk.c. However, the mss member resident is not filled in walk_page_vma - instead, walk_page_vma calls callback specified in smaps_walk:
.pmd_entry = smaps_pte_range,
.private = &mss,
like this:
if (walk->pmd_entry)
err = walk->pmd_entry(pmd, addr, next, walk);
So what does our callback, smaps_pte_range in /fs/proc/task_mmu.c, do?
It calls smaps_pte_entry and smaps_pmd_entry in some circumstances, out of which both call statm_account(), which in turn... upgrades resident size! All of these functions are defined in the already linked task_mmu.c so I didn't post relevant code snippets as they can be easily seen in the linked sources.
PTE stands for Page Table Entry and PMD is Page Middle Directory. So basically we iterate through the page entries associated with given process and update RAM usage depending on the circumstances.
statm
The information you see in statm is defined in /fs/proc/array.c:
int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
unsigned long size = 0, resident = 0, shared = 0, text = 0, data = 0;
struct mm_struct *mm = get_task_mm(task);
if (mm) {
size = task_statm(mm, &shared, &text, &data, &resident);
mmput(mm);
}
seq_put_decimal_ull(m, 0, size);
seq_put_decimal_ull(m, ' ', resident);
seq_put_decimal_ull(m, ' ', shared);
seq_put_decimal_ull(m, ' ', text);
seq_put_decimal_ull(m, ' ', 0);
seq_put_decimal_ull(m, ' ', data);
seq_put_decimal_ull(m, ' ', 0);
seq_putc(m, '\n');
return 0;
}
This time, resident is filled by task_statm. This one has two implementations, one in /fs/proc/task_mmu.c and second in /fs/proc/task_nomm.c. Since they're almost surely mutually exclusive, I'll focus on the implementation in task_mmu.c (which also contained task_smaps). In this implementation we see that
unsigned long task_statm(struct mm_struct *mm,
unsigned long *shared, unsigned long *text,
unsigned long *data, unsigned long *resident)
{
*shared = get_mm_counter(mm, MM_FILEPAGES);
(...)
*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
return mm->total_vm;
}
it queries some counters, namely, MM_FILEPAGES and MM_ANONPAGES. These counters are modified during different operations on memory such as do_wp_page defined at /mm/memory.c. All of the modifications seem to be done by the files located in /mm/ and there seem to be quite a lot of them, so I didn't include them here.
Conclusion
smaps does complicated iteration through all referenced memory regions and updates resident size using the collected information. statm uses data that was already calculated by someone else.
The most important part is that while smaps collects the data each time in an independent manner, statm uses counters that get incremented or decremented during process life cycle. There are a lot of places that need to do the bookkeeping, and perhaps some places don't upgrade the counters like they should. That's why IMO statm is inferior to smaps, even if it takes fewer CPU cycles to complete.
Please note that this is the conclusion I drew based on common sense, but I might be wrong - perhaps there are no internal inconsistencies in counter decrementing and incrementing, and instead, they might count some pages differently than smaps. At this point I believe it'd be wise to take it to some experienced kernel maintainers.

Getting pid of pages that are being swapped out

My goal is to find out process-id of pages which are being swapped out. The Linux Kernel function swap_writepage() takes a pointer to struct page as a part of formal argument while swapping a page on backing store. All swap-out operations are done by "kswapd" process. I need to find out pid(s) of the processes whose page is passed as argument in the swap_writepage() function. In order to get that, I was able to find all page table entries associated with that page using rmap structures.
How can I get pid from a pte or from struct page? I have used sytemtap to get the value of struct page pointer, received in swap_writepage() function as argument. Also, the pid() function prints the pid of current process running not the pid of process to which that page belongs which always gives kswapd process.
Here is the example of how reverse mapping used in modern Linux (copied from lxr):
1435 static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
1436 {
1437 struct anon_vma *anon_vma;
1438 struct anon_vma_chain *avc;
1439 int ret = SWAP_AGAIN;
1440
1441 anon_vma = page_lock_anon_vma(page);
1442 if (!anon_vma)
1443 return ret;
1444
1445 list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
1446 struct vm_area_struct *vma = avc->vma;
1447 unsigned long address;
1448
1449 /*
1450 * During exec, a temporary VMA is setup and later moved.
1451 * The VMA is moved under the anon_vma lock but not the
1452 * page tables leading to a race where migration cannot
1453 * find the migration ptes. Rather than increasing the
1454 * locking requirements of exec(), migration skips
1455 * temporary VMAs until after exec() completes.
1456 */
1457 if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
1458 is_vma_temporary_stack(vma))
1459 continue;
1460
1461 address = vma_address(page, vma);
1462 if (address == -EFAULT)
1463 continue;
1464 ret = try_to_unmap_one(page, vma, address, flags);
1465 if (ret != SWAP_AGAIN || !page_mapped(page))
1466 break;
1467 }
1468
1469 page_unlock_anon_vma(anon_vma);
1470 return ret;
1471 }
This example shows for rmap used for unmapping pages. So each anonymous page in ->mapping field holds anon_vma object. anon_vma holds a list of vma areas page is mapped to. Having vma you have mm, having mm you have a task_struct. that's it. If you have any doubts - here is the illustraction
Daniel P. Bovet, Marco Cesati Understanding Linux Kernel chapter 17.2

RES != CODE + DATA in the output information of the top command,why?

what 'man top' said is: RES = CODE + DATA
q: RES -- Resident size (kb)
The non-swapped physical memory a task has used.
RES = CODE + DATA.
r: CODE -- Code size (kb)
The amount of physical memory devoted to executable code, also known as the 'text resident set' size or TRS.
s: DATA -- Data+Stack size (kb)
The amount of physical memory devoted to other than executable code, also known as the 'data >resident set' size or DRS.
what when i run 'top -p 4258',i get the following:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ CODE DATA COMMAND
258 root 16 0 3160 1796 1328 S 0.0 0.3 0:00.10 476 416 bash
1796 != 476+416
why?
ps:
linux distribution:
linux-iguu:~ # lsb_release -a
LSB Version: core-2.0-noarch:core-3.0-noarch:core-2.0-ia32:core-3.0-ia32:desktop-3.1-ia32:desktop-3.1-noarch:graphics-2.0-ia32:graphics-2.0-noarch:graphics-3.1-ia32:graphics-3.1-noarch
Distributor ID: SUSE LINUX
Description: SUSE Linux Enterprise Server 9 (i586)
Release: 9
Codename: n/a
kernel version:
linux-iguu:~ # uname -a
Linux linux-iguu 2.6.16.60-0.21-default #1 Tue May 6 12:41:02 UTC 2008 i686 i686 i386 GNU/Linux
I'll explain this with the help of an example of what happens when a program allocates and uses memory. Specifically, this program:
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
int main(){
int *data, size, count, i;
printf( "fyi: your ints are %d bytes large\n", sizeof(int) );
printf( "Enter number of ints to malloc: " );
scanf( "%d", &size );
data = malloc( sizeof(int) * size );
if( !data ){
perror( "failed to malloc" );
exit( EXIT_FAILURE );
}
printf( "Enter number of ints to initialize: " );
scanf( "%d", &count );
for( i = 0; i < count; i++ ){
data[i] = 1337;
}
printf( "I'm going to hang out here until you hit <enter>" );
while( getchar() != '\n' );
while( getchar() != '\n' );
exit( EXIT_SUCCESS );
}
This is a simple program that asks you how many integers to allocate, allocates them, asks how many of those integers to initialize, and then initializes them. For a run where I allocate 1250000 integers and initialize 500000 of them:
$ ./a.out
fyi: your ints are 4 bytes large
Enter number of ints to malloc: 1250000
Enter number of ints to initialize: 500000
Top reports the following information:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP CODE DATA COMMAND
<program start>
11129 xxxxxxx 16 0 3628 408 336 S 0 0.0 0:00.00 3220 4 124 a.out
<allocate 1250000 ints>
11129 xxxxxxx 16 0 8512 476 392 S 0 0.0 0:00.00 8036 4 5008 a.out
<initialize 500000 ints>
11129 xxxxxxx 15 0 8512 2432 396 S 0 0.0 0:00.00 6080 4 5008 a.out
The relevant information is:
DATA CODE RES VIRT
before allocation: 124 4 408 3628
after 5MB allocation: 5008 4 476 8512
after 2MB initialization: 5008 4 2432 8512
After I malloc'd 5MB of data, both VIRT and DATA increased by ~5MB, but RES did not. RES did increase after I touched 2MB of the integers I allocated, but DATA and VIRT stayed the same.
VIRT is the total amount of virtual memory used by the process, including what is shared and what is over-committed. DATA is the amount of virtual memory used that isn't shared and that isn't code-text. I.e., it is the virtual stack and heap of the process. RES is not virtual: it is a measurment of how much memory the process is actually using at that specific time.
So in your case, the large inequality CODE+DATA < RES is likely the shared libraries included by the process. In my example (and yours), SHR+CODE+DATA is a closer approximation to RES.
Hope this helps.
There's a lot of hand-waving and voodoo associated with top and ps. There are many articles (rants?) online about the descrepancies. E.g., this and this.
This explanation is terrific to resolve my some queries. Thanks!
And meanwhile, trying to add something got during my understanding of linux memory management knowledge. If any misunderstand, please correct me!
Modern OS process concepts are based on virtual memory. Virtual memory system includes the RAM+SWAP;
So I think most of the memory concepts related with processes refer to the virtual memory, except that there are some supplement notes.
Any virtual address(page) allocated to a process is in below state:
a) allocated, but no mapping to any physical memory(something like COW)
b) allocated, already mapped to physical memory
c) allocated, already mapped to swapped memory.
The fields ouput of top command:
a) VIRT -- it refers to all virtual memory that the process have the right
to access, no matter it is already mapped to physical memory or swapped
memory, or even has no any mapping.
b) RES -- it refers to the virtual address already mapped to physical address and it still in RAM.
c) SWAP -- refers to the virtual address already mapped to physical address and it is swapped into SWAP space.
d) SHR -- it refers to the shared memory available to a process(VM?)
e) CODE + DATA -- CODE could be in a state of 2.b/2.c, and DATA could be in any of 3 state 2.a/2.b/3.c, and 3.b/3.c also have a fields name called "USED".
4) So the calculation maybe look like:
a) VIRT(VM) = RES(VM in memory) + SWAP(VM in swap) + VM unmapped(DATA, SHR?).
b) USED = RES + SWAP
c) SWAP = CODE(vm in memory) + DATA(vm in memory) + SHR(vm in memory?)
d) RES = CODE(vm in memory) + DATA(vm in memory) + SHR(vm in memory?)
At least DATA segment still have a "DATA(VM unmapped)", this could be observed from above malloc example. That's a little different from the manpage of top command which says "DATA: The amount of physical memory devoted to other than executable code, also known as the Data Resident Set size or DRS". Thanks again.
So amount of (CODE + DATA + SHR) usually larger than RES, because at least DATA(vm unmapped) actually calculated in "DATA", not like the manpge claiming.
Regards,

Resources