As we know, a struct page in Linux is associated with a physical 4KB page and mapped to a PFN. This forms the backbone of memory allocation in Linux.
The struct page, described in include/linux/mm_types.h, contains a variety of information about the page. I want to know: when are struct page structures allocated during boot, who initializes them, and where (in Linux)?
I was able to get some idea of where the pages are stored from this answer - https://stackoverflow.com/a/63893944/13286624 - but I'm unable to find how these structures are allocated and initialized.
Maybe try mm_init, called by start_kernel() in init/main.c, and the corresponding mem_init in arch/x86/mm/init_64.c:
void __init mem_init(void)
{
        pci_iommu_alloc();

        /* clear_bss() already clear the empty_zero_page */

        /* this will put all memory onto the freelists */
        memblock_free_all();
        after_bootmem = 1;
        x86_init.hyper.init_after_bootmem();

        /*
         * Must be done after boot memory is put on freelist, because here we
         * might set fields in deferred struct pages that have not yet been
         * initialized, and memblock_free_all() initializes all the reserved
         * deferred pages for us.
         */
        register_page_bootmem_info();
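Note that mem_init mostly hands already-initialized pages over to the buddy allocator; the struct page array itself is allocated and its entries filled in earlier, during setup_arch(). A rough sketch of the call chain for a v5.x-era x86-64 kernel (this is an assumption on my part; the exact functions move around between versions, so verify against your tree):

start_kernel()                            /* init/main.c */
    setup_arch()
        paging_init()                     /* arch/x86/mm/init_64.c */
            sparse_init()                 /* allocates the memmap, i.e. the
                                             array of struct page (vmemmap) */
            zone_sizes_init()
                free_area_init()          /* mm/page_alloc.c */
                    memmap_init()
                        __init_single_page()  /* initializes one struct page:
                                                 flags, refcount, lru, ... */
    mm_init()
        mem_init()                        /* shown above: memblock_free_all()
                                             feeds the initialized pages to
                                             the buddy allocator freelists */

With CONFIG_DEFERRED_STRUCT_PAGE_INIT, only the pages needed during boot are initialized here and the rest are done later in page_alloc_init_late().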
You should take a look at the include/linux/gfp.h file. It has a bunch of alloc_page type functions. From there, using cscope or your favorite IDE, you can see every function that calls these functions and that will give you a sense of how the functions are used.
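For instance, a minimal kernel module using that API might look like this (a sketch; demo_init/demo_exit are made-up names for illustration):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/module.h>

static struct page *page;

static int __init demo_init(void)
{
        /* GFP_KERNEL may sleep, so only use it in process context */
        page = alloc_page(GFP_KERNEL);
        if (!page)
                return -ENOMEM;
        pr_info("allocated pfn %lu\n", page_to_pfn(page));
        return 0;
}

static void __exit demo_exit(void)
{
        /* return the page to the buddy allocator */
        __free_page(page);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

Each struct page handed back by alloc_page here is one of the structures allocated and initialized at boot as described above.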
The manual page told me a lot, and through it I learned much of the background of glibc's memory management.
But I'm still confused: does malloc_trim(0) (note the zero argument) mean (1) all the memory in the heap section will be returned to the OS, or (2) just all the unused memory at the top of the heap will be returned to the OS?
If the answer is (1), what about the memory in the heap that is still in use? If the heap has used memory somewhere in the middle, will it be eliminated, or will the function fail?
And if the answer is (2), what about the "holes" below the top of the heap? They are unused memory now, but since the topmost region of the heap is still in use, will the call accomplish anything?
Thanks.
The man page of malloc_trim was committed here: https://github.com/mkerrisk/man-pages/blob/master/man3/malloc_trim.3 and, as I understand it, it was written from scratch by the man-pages project maintainer, kerrisk, in 2012: https://github.com/mkerrisk/man-pages/commit/a15b0e60b297e29c825b7417582a33e6ca26bf65
As far as I can tell from grepping glibc's git, there are no man pages in glibc, and no commit to a malloc_trim man page documenting this behavior. The best and only documentation of glibc malloc is its source code: https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c
These are the malloc_trim comments from malloc/malloc.c:
/*
  malloc_trim(size_t pad);

  If possible, gives memory back to the system (via negative
  arguments to sbrk) if there is unused memory at the `high' end of
  the malloc pool. You can call this after freeing large blocks of
  memory to potentially reduce the system-level memory requirements
  of a program. However, it cannot guarantee to reduce memory. Under
  some allocation patterns, some large free blocks of memory will be
  locked between two used chunks, so they cannot be given back to
  the system.

  The `pad' argument to malloc_trim represents the amount of free
  trailing space to leave untrimmed. If this argument is zero,
  only the minimum amount of memory to maintain internal data
  structures will be left (one page or less). Non-zero arguments
  can be supplied to maintain enough trailing space to service
  future expected allocations without having to re-obtain memory
  from the system.

  Malloc_trim returns 1 if it actually released any memory, else 0.
  On systems that do not support "negative sbrks", it will always
  return 0.
*/
int __malloc_trim(size_t);
Freeing from the middle of a chunk is not documented in the comment text in malloc/malloc.c and is not documented in the man-pages project. The man page from 2012 may be the first man page for the function, and it was not written by the authors of glibc. The info page of glibc only mentions the M_TRIM_THRESHOLD of 128 KB:
https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html#Malloc-Tunable-Parameters and doesn't list the malloc_trim function https://www.gnu.org/software/libc/manual/html_node/Summary-of-Malloc.html#Summary-of-Malloc (and it also doesn't document memusage/memusagestat/libmemusage.so).
In December 2007 there was commit https://sourceware.org/git/?p=glibc.git;a=commit;f=malloc/malloc.c;h=68631c8eb92ff38d9da1ae34f6aa048539b199cc by Ulrich Drepper (it is part of glibc 2.9 and newer) which changed the mtrim implementation (but it didn't change any documentation or man page, as there are no man pages in glibc):
malloc/malloc.c (public_mTRIm): Iterate over all arenas and call
mTRIm for all of them.
(mTRIm): Additionally iterate over all free blocks and use madvise
to free memory for all those blocks which contain at least one
memory page.
Unused parts of chunks (anywhere, including chunks in the middle of the heap) that are page-aligned and larger than a page may now be marked MADV_DONTNEED: https://sourceware.org/git/?p=glibc.git;a=blobdiff;f=malloc/malloc.c;h=c54c203cbf1f024e72493546221305b4fd5729b7;hp=1e716089a2b976d120c304ad75dd95c63737ad75;hb=68631c8eb92ff38d9da1ae34f6aa048539b199cc;hpb=52386be756e113f20502f181d780aecc38cbb66a
INTERNAL_SIZE_T size = chunksize (p);
if (size > psm1 + sizeof (struct malloc_chunk))
  {
    /* See whether the chunk contains at least one unused page.  */
    char *paligned_mem = (char *) (((uintptr_t) p
                                    + sizeof (struct malloc_chunk)
                                    + psm1) & ~psm1);

    assert ((char *) chunk2mem (p) + 4 * SIZE_SZ <= paligned_mem);
    assert ((char *) p + size > paligned_mem);

    /* This is the size we could potentially free.  */
    size -= paligned_mem - (char *) p;

    if (size > psm1)
      madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
  }
This is one of only two usages of madvise with MADV_DONTNEED in glibc now: one for the top part of heaps (shrink_heap) and the other for marking any chunk (mtrim): http://code.metager.de/source/search?q=MADV_DONTNEED&path=%2Fgnu%2Fglibc%2Fmalloc%2F&project=gnu
arena.c:643    __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
malloc.c:4535  __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
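For reference, the semantics of MADV_DONTNEED on anonymous memory are that the pages are discarded and read back as zero-filled on the next touch. A small standalone demonstration, independent of glibc internals:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4 * 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    memset(p, 0xaa, len);            /* fault all four pages in */
    madvise(p, len, MADV_DONTNEED);  /* hand the pages back to the kernel */
    printf("%d\n", p[0]);            /* prints 0: the old contents are gone */
    munmap(p, len);
    return 0;
}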
We can test malloc_trim with this simple C program (test_malloc_trim.c) and strace/ltrace:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <malloc.h>

int main()
{
    int *m1, *m2, *m3, *m4;

    printf("%s\n", "Test started");
    m1 = (int *) malloc(20000);
    m2 = (int *) malloc(40000);
    m3 = (int *) malloc(80000);
    m4 = (int *) malloc(10000);
    // check that all arrays are allocated on the heap and not with mmap
    printf("1:%p 2:%p 3:%p 4:%p\n", m1, m2, m3, m4);
    // free 40000 bytes in the middle
    free(m2);
    // call trim (same result with 2000 or 2000000 argument)
    malloc_trim(0);
    // call some syscall to find this point in the strace output
    sleep(1);
    free(m1);
    free(m3);
    free(m4);
    // malloc_stats(); malloc_info(0, stdout);
    return 0;
}
gcc test_malloc_trim.c -o test_malloc_trim
strace ./test_malloc_trim
write(1, "Test started\n", 13Test started
) = 13
brk(0) = 0xcca000
brk(0xcef000) = 0xcef000
write(1, "1:0xcca010 2:0xccee40 3:0xcd8a90"..., 441:0xcca010 2:0xccee40 3:0xcd8a90 4:0xcec320
) = 44
madvise(0xccf000, 36864, MADV_DONTNEED) = 0
...
nanosleep({1, 0}, 0x7ffffafbfff0) = 0
brk(0xceb000) = 0xceb000
So, there was an madvise with MADV_DONTNEED for 9 pages after the malloc_trim(0) call, when there was a hole of 40008 bytes in the middle of the heap (36864 = 9 * 4096 bytes, the part of the hole that is page-aligned at both ends).
The man page for malloc_trim says it releases free memory, so if there is allocated memory in the heap, it won't release the whole heap. The parameter is there if you know you're still going to need a certain amount of memory, so freeing more than that would cause glibc to have to do unnecessary work later.
As for holes, this is a standard problem with memory management and returning memory to the OS. The primary low-level heap management available to the program is brk and sbrk, and all they can do is extend or shrink the heap area by changing the top. So there's no way for them to return holes to the operating system; once the program has called sbrk to allocate more heap, that space can only be returned if the top of that space is free and can be handed back.
Note that there are other, more complex ways to allocate memory (with anonymous mmap, for example), which may have different constraints than sbrk-based allocation.
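For example, glibc itself serves requests above M_MMAP_THRESHOLD (128 KiB by default, tunable with mallopt) with an anonymous mmap, and free() then returns such a block to the OS immediately via munmap, so the hole problem does not arise for it. A sketch:

#include <stdlib.h>

int main(void)
{
    void *big = malloc(1024 * 1024); /* > 128 KiB: mmap'ed, not on the brk heap */
    free(big);                       /* munmap'ed immediately, returned to the OS */
    return 0;
}

Running this under strace should show an mmap/munmap pair for the large block instead of brk calls.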
In LDD3, Ch. 15, sections "Using remap_pfn_range" and "A Simple Implementation", pfn has been equated to the vm_pgoff field. I am confused by this. How can that be so?
Note that vm_pgoff is described as:
The offset of the area in the file, in pages. When a file or device is
mapped, this is the file position of the first page mapped in this
area.
Thus if the first page mapped corresponds to the first page of the file as well (which, I think, would be quite common), vm_pgoff would be 0, correct? If so, this doesn't seem to be the correct value for the pfn parameter of remap_pfn_range(). What am I missing here? What is the correct value? For ease of reference, I am reproducing the relevant code from LDD3 below (page 426):
static int simple_remap_mmap(struct file *filp, struct vm_area_struct *vma)
{
    if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                        vma->vm_end - vma->vm_start,
                        vma->vm_page_prot))
        return -EAGAIN;
    ...
}
The specific example you've provided implements a character device file that allows one to map physical memory, very similar to /dev/mem. By specifying the file offset you specify the physical memory address; hence the calculation that takes the byte offset and divides it by the page size to find the PFN, which is exactly the quotient the kernel has already stored in vm_pgoff.
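From user space that looks like the sketch below: the offset argument of mmap is the physical address, and the kernel stores offset / PAGE_SIZE into vma->vm_pgoff, so vm_pgoff already is the PFN. (The example maps the legacy VGA window at 0xa0000; it needs root, and CONFIG_STRICT_DEVMEM may forbid it.)

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return 1;
    /* offset = physical address; the kernel sets
     * vma->vm_pgoff = 0xa0000 / PAGE_SIZE */
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, 0xa0000);
    if (p == MAP_FAILED)
        return 1;
    /* ... poke at the mapped physical memory through p ... */
    munmap(p, 4096);
    close(fd);
    return 0;
}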
For a "real" device driver, you would normally have the physical address of the device memory mapped registers or RAM hard coded from the device specification, and use that to derive the PFN (by dividing by the page size).
I read /proc/<pid>/io to measure the IO activity of SQL queries, where <pid> is the PID of the database server. I read the values before and after each query to compute the difference and get the number of bytes the request caused to be read and/or written.
As far as I know the field READ_BYTES counts actual disk-IO, while RCHAR includes more, like reads that could be satisfied by the linux page cache (see Understanding the counters in /proc/[pid]/io for clarification).
This leads to the assumption that RCHAR should come up with a value greater than or equal to READ_BYTES, but my results contradict this assumption.
I could imagine some minor block or page overhead explaining the results I get for Infobright ICE (values are MB):
Query           RCHAR       READ_BYTES
tpch_q01.sql    34.44180    34.89453
tpch_q02.sql     2.89191     3.64453
tpch_q03.sql    32.58994    33.19531
tpch_q04.sql    17.78325    18.27344
But I completely fail to understand the IO-counters for MonetDB (values are MB):
Query           RCHAR       READ_BYTES
tpch_q01.sql     0.07501   220.58203
tpch_q02.sql     1.37840    18.16016
tpch_q03.sql     0.08272   162.38281
tpch_q04.sql     0.06604    83.25391
Am I wrong in the assumption that RCHAR includes READ_BYTES? Is there a way to trick the kernel's counters that MonetDB could be using? What is going on here?
I might add that I clear the page cache and restart the database server before each query.
I'm on Ubuntu 11.10, running kernel 3.0.0-15-generic.
I can only think of two things:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/proc.txt;hb=HEAD#l1305
1:
read_bytes
----------

I/O counter: bytes read
Attempt to count the number of bytes which this process really did cause to
be fetched from the storage layer.
I read "Caused to be fetched from the storage layer" to include readahead, whatever.
2:
rchar
-----

I/O counter: chars read
The number of bytes which this task has caused to be read from storage. This
is simply the sum of bytes which this process passed to read() and pread().
It includes things like tty IO and it is unaffected by whether or not actual
physical disk IO was required (the read might have been satisfied from
pagecache)
Note that this says nothing about disk access via memory-mapped files. I think this is the more likely explanation: MonetDB probably mmaps its database files and then does everything on them.
I'm not really sure how you could directly measure the bandwidth used through mmap, because of its nature.
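One indirect check, though: mmap a large file with a cold page cache, touch every page, and compare /proc/self/io before and after; rchar should stay nearly flat while read_bytes grows by roughly the file size. A sketch (/tmp/bigfile is an assumed multi-megabyte test file):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Print /proc/self/io; the read() here adds a few hundred
 * bytes to rchar itself, which is negligible. */
static void dump_io(const char *tag)
{
    char buf[512];
    int fd = open("/proc/self/io", O_RDONLY);
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("--- %s ---\n%s", tag, buf);
    }
    close(fd);
}

int main(void)
{
    struct stat st;
    int fd = open("/tmp/bigfile", O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    dump_io("before touching pages");
    volatile long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];  /* each fault does real disk IO on a cold cache */
    dump_io("after touching pages");

    munmap(p, st.st_size);
    close(fd);
    return 0;
}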
You can also read the Linux kernel source file include/linux/task_io_accounting.h:
struct task_io_accounting {
#ifdef CONFIG_TASK_XACCT
    /* bytes read */
    u64 rchar;
    /* bytes written */
    u64 wchar;
    /* # of read syscalls */
    u64 syscr;
    /* # of write syscalls */
    u64 syscw;
#endif /* CONFIG_TASK_XACCT */

#ifdef CONFIG_TASK_IO_ACCOUNTING
    /*
     * The number of bytes which this task has caused to be read from
     * storage.
     */
    u64 read_bytes;

    /*
     * The number of bytes which this task has caused, or shall cause to be
     * written to disk.
     */
    u64 write_bytes;

    /*
     * A task can cause "negative" IO too. If this task truncates some
     * dirty pagecache, some IO which another task has been accounted for
     * (in its write_bytes) will not be happening. We _could_ just
     * subtract that from the truncating task's write_bytes, but there is
     * information loss in doing that.
     */
    u64 cancelled_write_bytes;
#endif /* CONFIG_TASK_IO_ACCOUNTING */
};
I am trying to put a head tag and a foot tag inside struct malloc_chunk, whose header looks like this:

struct malloc_chunk {
    INTERNAL_SIZE_T prev_size;
    INTERNAL_SIZE_T size;
};
Here is what I did:

1.

struct malloc_chunk {
    INTERNAL_SIZE_T foot_tag;
    INTERNAL_SIZE_T prev_size;
    INTERNAL_SIZE_T size;
};
Before putting the head_tag inside malloc_chunk, I added only the foot_tag and compiled glibc. I made a small test program that mallocs 60 bytes from the system and then frees them. Though I can see that malloc returned properly, free complained, saying "invalid pointer". The pointer malloc returned was 0x9313010, which makes the pointer to the start of the malloc_chunk 0x9313004.
So when free is passed 0x9313010, it converts it to 0x9313004 (through mem2chunk) and then checks it for alignment. Since my word size is 4, this alignment check is where I am getting problems. Can you please tell me whether the mem pointer (returned by malloc) needs to be absolutely double-word aligned? (It may not satisfy that criterion here, as the difference between the pointer returned by malloc and the start of the chunk will be 12 bytes, not 8.)
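For reference, the alignment machinery involved looks roughly like this in malloc.c (quoting a recent glibc from memory, so verify against your version; older releases differ in detail). The mem pointer must be MALLOC_ALIGNMENT-aligned, which by default is 2 * SIZE_SZ, i.e. 8 bytes on a 32-bit build:

/* Check if m has acceptable alignment */
#define aligned_OK(m)  (((unsigned long)(m) & MALLOC_ALIGN_MASK) == 0)

#define misaligned_chunk(p) \
  ((uintptr_t)(MALLOC_ALIGNMENT == 2 * SIZE_SZ ? (p) : chunk2mem (p)) \
   & MALLOC_ALIGN_MASK)

/* ... in the free path ... */
if (__builtin_expect ((uintptr_t) p > (uintptr_t) -size, 0)
    || __builtin_expect (misaligned_chunk (p), 0))
  malloc_printerr ("free(): invalid pointer");

With a 12-byte header, the chunk pointer and the mem pointer can no longer both keep that alignment, which is exactly the failure described above.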
2.

To overcome issue 1, I just added a head_tag to the structure, so that it becomes:
struct malloc_chunk {
    INTERNAL_SIZE_T foot_tag;
    INTERNAL_SIZE_T prev_size;
    INTERNAL_SIZE_T size;
    INTERNAL_SIZE_T head_tag;
};
Now the difference between the pointer returned by malloc and the start of the chunk will be 16, which will always be double-word aligned. But here I am facing a bigger problem: when the first malloc is called, the arena is set up and the bins are initialised. The size of the victim does not come out to be zero, as it should in the normal case. The problem is that victim->size actually comes out to be the place where 'top' is stored rather than 'last_remainder'. I'd like to ask your opinion whether there is any other way/workaround/solution so that I can overcome this arena-initialization issue I am currently facing.
Thanks and Regards,
Kapil
From whatever I have learnt, try not to modify the structure malloc_chunk. Even if you have to, make sure that the top pointer is initially 0, otherwise the heap will never be formed. It's the sYSMALLOc() function called from _int_malloc() that first mmaps a bigger chunk from the kernel, and it will be invoked only if top is 0 initially.
Pertaining to the Linux kernel, do "kernel" pages ever get swapped out? Also, do user-space pages ever get to reside in ZONE_NORMAL?
No, kernel memory is unswappable.
Kernel pages are not swappable, but they can be freed.
User-space pages can reside in ZONE_NORMAL.
A Linux system can be configured either to use HIGHMEM or not.
If ZONE_HIGHMEM is configured, then user-space processes will get their memory from HIGHMEM; otherwise, user-space processes will get their memory from ZONE_NORMAL.
Yes, under normal circumstances kernel pages (i.e., memory residing in the kernel for kernel usage) are not swappable; in fact, once a fault on such memory is detected (see the page-fault handler source code), the kernel will explicitly crash itself.
See this:
http://lxr.free-electrons.com/source/arch/x86/mm/fault.c
and the function:
/*
 * This routine handles page faults. It determines the address,
 * and the problem, and then passes it off to one of the appropriate
 * routines.
 *
 * This function must have noinline because both callers
 * {,trace_}do_page_fault() have notrace on. Having this an actual function
 * guarantees there's a function trace entry.
 */
static noinline void
__do_page_fault(struct pt_regs *regs, unsigned long error_code,
                unsigned long address)
{
And the detection here:
1246 *
1247 * This verifies that the fault happens in kernel space
1248 * (error_code & 4) == 0, and that the fault was not a
1249 * protection error (error_code & 9) == 0.
1250 */
1251 if (unlikely(fault_in_kernel_space(address))) {
1252 if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
1253 if (vmalloc_fault(address) >= 0)
1254 return;
1255
1256 if (kmemcheck_fault(regs, address, error_code))
1257 return;
1258 }
But the same page-fault handler, which can detect a page fault arising from non-resident user-mode memory (all hardware page-fault handling is done in the kernel), will explicitly retrieve the data from swap space if it exists, or start a memory-allocation routine to give the process more memory.
OK, that said, the kernel does write kernel structures/memory/task lists etc. out to swap during software suspend and hibernation:
https://www.kernel.org/doc/Documentation/power/swsusp.txt
And during the resume phase it will restore the kernel memory from the swap file.