kpagecount of a present page is zero - linux

How is it possible, for a page whose present bit is set, to have a kpagecount equal to zero?
According to linux documentation:
/proc/kpagecount. This file contains a 64-bit count of the number
of times each page is mapped, indexed by PFN.
I wrote a toy application that prints out all pagemap entries for all VMAs of a program.
I know that kpagecount can be above 1 when a page is shared among different processes; typical examples are copy-on-write pages after fork(), or libraries used by multiple programs.
The case where it is zero, as I understand it, is when the last program using that particular page no longer needs it, so the kernel can reclaim the page.
Is that correct?
However, in the case of my toy app, I haven't issued any free() call yet. So it does not make sense for a heap page to be in RAM (present bit set) while kpagecount is 0. Is this counter accurate, or am I missing something else?
Cheers!

So, a week later, I've found the answer in the Understanding the Linux Kernel book.
The _count field of the page descriptor holds a reference counter for the page:
if it is -1, the page frame is free and can be reclaimed by the kernel
if it is >= 0, the page is assigned to one or more processes, or is used by the kernel itself
the page_count() function returns the value of _count increased by one, which is the number of users of the page
_count is also increased when:
the page is inserted into the swap cache
the PG_private flag in the page descriptor is set

Related

2 Questions about memory check pointing with linux kernel (custom implementation)

We were given a project where we implement memory checkpointing. The basic version just walks over pages and dumps the data found to a file (also checking info about each page: private, locked, etc.); the incremental version only looks at data that changed since the previous checkpoint and dumps that to a file. My understanding is that we are pretty much building a smaller-scale version of memory save states (I could be wrong, but that's what I'm getting from this). We are currently using a VMA-based approach to walk the given range (as long as it doesn't go below or above the user-space range, i.e. no kernel addresses) in order to report the data found in the pages we encounter. I know that struct vm_area_struct is used to access VMAs (with functions including find_vma()). My issue is that I'm not sure how to check the individual pages within the range of addresses the user gives us, starting from this vm_area_struct. I only know about struct page (and pretty much only that), but I'm still learning the kernel in detail, so I'm bound to miss things. Is there something I'm missing about vm_area_struct when accessing pages?
Second question is, what do we use to iterate through each individual page within the found vma (from given start and end address)?
VMAs contain the virtual addresses of their first and (one past their) last bytes:
struct vm_area_struct {
    /* The first cache line has the info for VMA tree walking. */
    unsigned long vm_start; /* Our start address within vm_mm. */
    unsigned long vm_end;   /* The first byte after our end address
                               within vm_mm. */
    ...
This means that in order to get the page's data you first need to figure out in what context your code is running.
If it's process context, then a simple copy_from_user() approach might be enough to get the actual data, plus a page walk (through the entire PGD/PUD/PMD/PTE hierarchy) to get the PFN, which you can then turn into a struct page. (Take care not to use the seductive virt_to_page(addr), as this only works on kernel addresses.)
In terms of iteration, you need only iterate in PAGE_SIZEs, over the virtual addresses you get from the VMAs.
Note that this assumes the pages are actually mapped. If not (!pte_present(pte)), you might need to map the page back in yourself to access the data.
If your check is running in some other context (such as a kthread/interrupt) you must remap the page from the swap before accessing it which is a whole different case. If you want the easy way, I'd look up here: https://www.kernel.org/doc/gorman/html/understand/understand014.html to understand how to handle swap lookup / retrieval.

Access "current_task" pointer of another cpu in a SMP based linux system

I'm writing a kernel driver where I need to check which thread is running on other cores at a certain point. My driver runs one kernel thread per core, and from time to time I need to sync some of the threads to do a certain task. What I observe in the debug logs is that sometimes one thread waits for other threads for too long. I wrote a patch that stores the __preempt_count of other cores, to check whether any softirq/hardirq or disabled preemption is delaying my thread.
I also used ftrace with the irqsoff and preemptirqsoff tracers to check the maximum duration of IRQs-off and preemption-disabled sections.
So far I was able to spot the kerneloops thread disabling interrupts for up to 20 ms, which I find too long.
A systemctl disable kerneloops got rid of this issue.
Now I seem to be dealing with some preemption-disabled windows. For further analysis of this driver I need a way to figure out which threads are executing at a particular point in time on other cores. I'm using ftrace mainly with events for IRQ entry/exit, and I also use trace_printk to push some debug output into the ftrace buffer so that everything is in one log, etc.
However, one thing that I would like to do is to access the current_task structure of other cores (the current ptr) and print the comm field which gives the name of the task (or pid value).
But I'm having hard times in getting this done.
For __preempt_count I had no issue:
int *p = per_cpu_ptr(&__preempt_count,cpu);
pr_info("Preempt count of cpu%u is 0x%08x\n",cpu,*p);
So far I had no issue declaring or accessing per-CPU variables, but for some reason accessing the current_task pointer triggers a page fault.
char buf[10];
struct task_struct *task = per_cpu_ptr(current_task,cpu);
snprintf(buf,8,"%s",task->comm);
pr_info("Task name: %s",buf);
The above code always triggers a page fault (NULL pointer dereference, etc.).
I couldn't find the reason so far. I tried to print the pointer value of task and got the same page fault.
Might this be because the address is not accessible from other cores? In kernel space that should not be the case, AFAIK. I also had no issue so far with per-CPU variables, and I've played a lot with them.
Bottom line: what would be the right approach to access the current_task of other cores and print the comm/pid fields?
Many Thanks,
Daniel
I finally figured out what was wrong.
The difference between __preempt_count and current_task is that the first is defined as an int variable, whereas the second is a pointer to a structure. In other words, the first is a plain variable and the second is a pointer.
Looking deeper into per-CPU variables: they are just variables placed in separate per-CPU memory areas, like an array with one slot per CPU. When per_cpu_ptr is called for a variable Foo, the macro computes something like &Foo[cpu]; that means per_cpu_ptr needs the actual base address of the variable (hence the &), so that it can compute the per-CPU address relative to it.
When writing foo = per_cpu_ptr(&__preempt_count, cpu), this base address is given: &__preempt_count.
When writing bar = per_cpu_ptr(current_task, cpu), the base address is not given, as the & is missing: current_task is a pointer, but not the base address of the per-CPU current_task area.
In both cases the argument to per_cpu_ptr is a pointer, and that's where my understanding was wrong: it wasn't clear to me which pointer I need to pass. Now it is: I have to pass the base address of the variable (whether it's a plain variable or a pointer doesn't matter), so that the macro can compute the per-CPU address for that CPU.
Therefore the right approaches, which work, are:
bar = per_cpu(current_task, cpu), which expands to *per_cpu_ptr(&current_task, cpu)
or directly
bar = *per_cpu_ptr(&current_task, cpu);

Repeated Minor Pagefaults at Same Address After Calling mlockall()

The Problem
In the course of attempting to reduce/eliminate the occurrence of minor pagefaults in an application, I discovered a confusing phenomenon; namely, I am repeatedly triggering minor pagefaults for writes to the same address, even though I thought I had taken sufficient steps to prevent pagefaults.
Background
As per the advice here, I called mlockall to lock all current and future pages into memory.
In my original use-case (which involved a rather large array) I also pre-faulted the data by writing to every element (or at least to every page) as per the advice here; though I realize the advice there is intended for users running a kernel with the RT patch, the general idea of forcing writes to thwart COW / demand paging should remain applicable.
I had thought that mlockall could be used to prevent minor page faults. While the man page only seems to guarantee that there will be no major faults, various other resources (e.g. above) state that it can be used to prevent minor page faults as well.
The kernel documentation seems to indicate this as well. For example, unevictable-lru.txt and pagemap.txt state that mlock()'ed pages are unevictable and therefore not suitable for reclamation.
In spite of this, I continued to trigger several minor pagefaults.
Example
I've created an extremely stripped down example to illustrate the problem:
#include <sys/mman.h> // mlockall
#include <stdlib.h>   // abort
int main(int, char **) {
    int x;
    if (mlockall(MCL_CURRENT | MCL_FUTURE)) abort();
    while (true) {
        asm volatile("" ::: "memory"); // So GCC won't optimize out the write
        x = 0x42;
    }
    return 0;
}
Here I repeatedly write to the same address. It is easy to see (e.g. via awk '{print $10}' /proc/[pid]/stat, which prints the minflt field) that I continue to take minor pagefaults long after the initialization is complete.
Running a modified version* of the pfaults.stp script included in systemtap-doc, I logged the time of each pagefault, address that triggered the fault, address of the instruction that triggered the fault, whether it was major/minor, and read/write. After the initial faults from startup and mlockall, all faults were identical: The attempt to write to x triggered a minor write fault.
The interval between successive pagefaults displays a striking pattern. For one particular run, the intervals were, in seconds:
2, 4, 4, 4.8, 8.16, 13.87, 23.588, 40.104, 60, 60, 60, 60, 60, 60, 60, 60, 60, ...
This appears to be (approximately) exponential back-off, with an absolute ceiling of 1 minute.
Running it on an isolated CPU has no impact; neither does running with a higher priority. However, running with a realtime priority eliminates the pagefaults.
The Questions
Is this behavior expected?
1a. What explains the timing?
Is it possible to prevent this?
Versions
I'm running Ubuntu 14.04, with kernel 3.13.0-24-generic and Systemtap version 2.3/0.156, Debian version 2.3-1ubuntu1 (trusty). Code compiled with gcc-4.8 with no extra flags, though optimization level doesn't seem to matter (provided the asm volatile directive is left in place; otherwise the write gets optimized out entirely)
I'm happy to include further details (e.g. exact stap script, original output, etc.) if they will prove relevant.
*Actually, the vm.pagefault probe was broken for my combination of kernel and systemtap, because it referenced a variable that no longer existed in the kernel's handle_mm_fault function, but the fix was trivial.
@fche's mention of Transparent Huge Pages put me onto the right track.
A less careless read of the kernel documentation I linked to in the question shows that mlock does not prevent the kernel from migrating the page to a new page frame; indeed, there's an entire section devoted to migrating mlocked pages. Thus, simply calling mlock() does not guarantee that you will not experience any minor pagefaults.
Somewhat belatedly, I see that this answer quotes the same passage and partially answers my question.
One of the reasons the kernel might move pages around is memory compaction, whereby the kernel frees up a large contiguous block of pages so a "huge page" can be allocated. Transparent huge pages can be easily disabled; see e.g. this answer.
My particular test case was the result of some NUMA balancing changes introduced in the 3.13 kernel.
Quoting the LWN article linked therein:
The scheduler will periodically scan through each process's address
space, revoking all access permissions to the pages that are currently
resident in RAM. The next time the affected process tries to access
that memory, a page fault will result. The scheduler will trap that
fault and restore access to the page in question...
This behavior of the scheduler can be disabled by setting the NUMA policy of the process to explicitly use a certain node. This can be done using numactl at the command line (e.g. numactl --membind=0) or a call to the libnuma library.
EDIT: The sysctl documentation explicitly states regarding NUMA balancing:
If the target workload is already bound to NUMA nodes then this feature should be disabled.
This can be done with sysctl -w kernel.numa_balancing=0
There may still be other causes for page migration, but this sufficed for my purposes.
Just speculating here, but perhaps what you're seeing is some normal kernel page-utilization-tracking (maybe even KSM or THP or cgroup), wherein it tries to ascertain how many pages are in active use. Probe the mark_page_accessed function e.g.

How to tell Linux that a mmap()'d page does not need to be written to swap if the backing physical page is needed?

Hopefully the title is clear. I have a chunk of memory obtained via mmap(). After some time, I have concluded that I no longer need the data within this range. I still wish to keep the range itself, however; that is, I do not want to call munmap(). I'm trying to be a good citizen and not make the system swap more than it needs to.
Is there a way to tell the Linux kernel that if the given page is backed by a physical page and if the kernel decides it needs that physical page, do not bother writing that page to swap?
I imagine under the hood this magical function call would destroy any mapping between the given virtual page and physical page, if present, without writing to swap first.
Your question (as stated) makes no sense.
Let's assume that there was a way for you to tell the kernel to do what you want.
Let's further assume that it did need the extra RAM, so it took away your page, and didn't swap it out.
Now your program tries to read that page (since you didn't want to munmap the data, presumably you might try to access it). What is the kernel to do? The choices I see:
it can give you a new page filled with 0s.
it can give you SIGSEGV
If you wanted choice 2, you could achieve the same result with munmap.
If you wanted choice 1, you could mremap over the existing mapping with MAP_ANON (or munmap followed by new mmap).
In either case, you can't depend on the old data being there when you need it.
The only way your question would make sense is if there were some additional mechanism for the kernel to let you know it is taking away your page (e.g. by sending you a special signal). But the situation you describe is likely too rare to warrant the additional complexity.
EDIT:
You might be looking for madvise(..., MADV_DONTNEED)
You could munmap the region, then mmap it again with MAP_NORESERVE
If you know at initial mapping time that swapping is not needed, use MAP_NORESERVE

Can I write-protect every page in the address space of a Linux process?

I'm wondering if there's a way to write-protect every page in a Linux
process' address space (from inside of the process itself, by way of
mprotect()). By "every page", I really mean every page of the
process's address space that might be written to by an ordinary
program running in user mode -- so, the program text, the constants,
the globals, and the heap -- but I would be happy with just constants,
globals, and heap. I don't want to write-protect the stack -- that
seems like a bad idea.
One problem is that I don't know where to start write-protecting
memory. Looking at /proc/pid/maps, which shows the sections of memory
in use for a given pid, they always seem to start with the address
0x08048000, with the program text. (In Linux, as far as I can tell,
the memory of a process is laid out with the program text at the
bottom, then constants above that, then globals, then the heap, then
an empty space of varying size depending on the size of the heap or
stack, and then the stack growing down from the top of memory at
virtual address 0xffffffff.) There's a way to tell where the top of
the heap is (by calling sbrk(0), which simply returns a pointer to the
current "break", i.e., the top of the heap), but not really a way to
tell where the heap begins.
If I try to protect all pages from 0x08048000 up to the break, I
eventually get an mprotect: Cannot allocate memory error. I don't know why mprotect would be
allocating memory anyway -- and Google is not very helpful. Any ideas?
By the way, the reason I want to do this is because I want to create a
list of all pages that are written to during a run of the program, and
the way that I can think of to do this is to write-protect all pages,
let any attempted writes cause a write fault, then implement a write
fault handler that will add the page to the list and then remove the write
protection. I think I know how to implement the handler, if only I could
figure out which pages to protect and how to do it.
Thanks!
You receive ENOMEM from mprotect() if you try to call it on pages that aren't mapped.
Your best bet is to open /proc/self/maps, and read it a line at a time with fgets() to find all the mappings in your process. For each writeable mapping (indicated in the second field) that isn't the stack (indicated in the last field), call mprotect() with the right base address and length (calculated from the start and end addresses in the first field).
Note that you'll need to have your fault handler already set up at this point, because the act of reading the maps file itself will likely cause writes within your address space.
Start simple. Write-protect a few pages and make sure your signal handler works for them. Then worry about expanding the scope of the protection. For example, you probably do not need to write-protect the code section: operating systems can enforce write-xor-execute (W^X) semantics on memory, which prevents code sections from ever being written to:
http://en.wikipedia.org/wiki/Self-modifying_code#Operating_systems
