Repeated Minor Pagefaults at Same Address After Calling mlockall()

The Problem
In the course of attempting to reduce or eliminate minor pagefaults in an application, I discovered a confusing phenomenon: I am repeatedly triggering minor pagefaults on writes to the same address, even though I thought I had taken sufficient steps to prevent pagefaults.
Background
As per the advice here, I called mlockall to lock all current and future pages into memory.
In my original use case (which involved a rather large array) I also pre-faulted the data by writing to every element (or at least to every page), as per the advice here; though I realize that advice is intended for users running a kernel with the RT patch, the general idea of forcing writes to thwart COW / demand paging should remain applicable.
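For reference, the pre-fault pass itself is just a write to each page; a minimal sketch (the buffer and its length are illustrative):
#include <stddef.h> // size_t
#include <unistd.h> // sysconf

// Touch one byte per page so every page is faulted in up front,
// before the time-critical section begins. The volatile qualifier
// keeps the compiler from optimizing the stores away.
static void prefault(volatile char *buf, size_t len) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    for (size_t i = 0; i < len; i += page)
        buf[i] = 0;
    if (len)
        buf[len - 1] = 0; // make sure the final page is touched too
}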
I had thought that mlockall could be used to prevent minor page faults. While the man page only seems to guarantee that there will be no major faults, various other resources (e.g. above) state that it can be used to prevent minor page faults as well.
The kernel documentation seems to indicate this as well. For example, unevictable-lru.txt and pagemap.txt state that mlock()'ed pages are unevictable and therefore not suitable for reclamation.
In spite of this, I continued to trigger several minor pagefaults.
Example
I've created an extremely stripped down example to illustrate the problem:
#include <sys/mman.h> // mlockall
#include <stdlib.h>   // abort
int main(int, char **) {
    int x;
    if (mlockall(MCL_CURRENT | MCL_FUTURE)) abort();
    while (true) {
        asm volatile("" ::: "memory"); // So GCC won't optimize out the write
        x = 0x42;
    }
    return 0;
}
Here I repeatedly write to the same address. It is easy to see (e.g. via cat /proc/[pid]/stat | awk '{print $10}', which prints the minor-fault count) that I continue to incur minor pagefaults long after the initialization is complete.
Running a modified version* of the pfaults.stp script included in systemtap-doc, I logged the time of each pagefault, the address that triggered the fault, the address of the instruction that triggered the fault, whether it was major or minor, and whether it was a read or a write. After the initial faults from startup and mlockall, all faults were identical: the attempt to write to x triggered a minor write fault.
The interval between successive pagefaults displays a striking pattern. For one particular run, the intervals were, in seconds:
2, 4, 4, 4.8, 8.16, 13.87, 23.588, 40.104, 60, 60, 60, 60, 60, 60, 60, 60, 60, ...
This appears to be (approximately) exponential back-off, with an absolute ceiling of 1 minute.
Running it on an isolated CPU has no impact; neither does running with a higher priority. However, running with a realtime priority eliminates the pagefaults.
The Questions
1. Is this behavior expected?
1a. What explains the timing?
2. Is it possible to prevent this?
Versions
I'm running Ubuntu 14.04, with kernel 3.13.0-24-generic and SystemTap version 2.3/0.156, Debian version 2.3-1ubuntu1 (trusty). The code was compiled with gcc-4.8 with no extra flags, though the optimization level doesn't seem to matter (provided the asm volatile directive is left in place; otherwise the write gets optimized out entirely).
I'm happy to include further details (e.g. exact stap script, original output, etc.) if they will prove relevant.
*Actually, the vm.pagefault probe was broken for my combination of kernel and SystemTap because it referenced a variable that no longer exists in the kernel's handle_mm_fault function, but the fix was trivial.

@fche's mention of Transparent Huge Pages put me onto the right track.
A less careless read of the kernel documentation I linked to in the question shows that mlock does not prevent the kernel from migrating the page to a new page frame; indeed, there's an entire section devoted to migrating mlocked pages. Thus, simply calling mlock() does not guarantee that you will not experience any minor pagefaults.
Somewhat belatedly, I see that this answer quotes the same passage and partially answers my question.
One of the reasons the kernel might move pages around is memory compaction, whereby the kernel frees up a large contiguous block of pages so a "huge page" can be allocated. Transparent huge pages can be easily disabled; see e.g. this answer.
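If disabling THP system-wide is too heavy-handed, individual mappings can be opted out instead; a minimal sketch, assuming a kernel built with transparent hugepage support (the function name and parameters are illustrative, and the range must be page-aligned):
#include <sys/mman.h> // madvise

// Ask the kernel not to back this range with transparent huge
// pages, so it is not a candidate for huge-page compaction.
static int opt_out_of_thp(void *addr, size_t length) {
    return madvise(addr, length, MADV_NOHUGEPAGE);
}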
My particular test case was the result of some NUMA balancing changes introduced in the 3.13 kernel.
Quoting the LWN article linked therein:
The scheduler will periodically scan through each process's address space, revoking all access permissions to the pages that are currently resident in RAM. The next time the affected process tries to access that memory, a page fault will result. The scheduler will trap that fault and restore access to the page in question...
This behavior of the scheduler can be disabled by setting the NUMA policy of the process to explicitly use a certain node. This can be done using numactl at the command line (e.g. numactl --membind=0) or a call to the libnuma library.
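For the libnuma route, a minimal sketch (compile with -lnuma; node 0 is just an example):
#include <numa.h>   // numa_available, numa_set_membind, ...
#include <stdlib.h> // abort

static void bind_to_node0(void) {
    if (numa_available() < 0)
        abort(); // no NUMA support on this system
    // Equivalent to `numactl --membind=0`: from now on, allocate
    // this process's memory only from node 0.
    struct bitmask *nodes = numa_parse_nodestring("0");
    numa_set_membind(nodes);
    numa_free_nodemask(nodes);
}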
EDIT: The sysctl documentation explicitly states regarding NUMA balancing:
If the target workload is already bound to NUMA nodes then this feature should be disabled.
This can be done with sysctl -w kernel.numa_balancing=0
There may still be other causes for page migration, but this sufficed for my purposes.

Just speculating here, but perhaps what you're seeing is some normal kernel page-utilization tracking (maybe even KSM or THP or cgroups), wherein the kernel tries to ascertain how many pages are in active use. You could probe the mark_page_accessed function, for example.

Related

Is there a way to disable CPU cache (L1/L2) on a Linux system?

I am profiling some code on a Linux system (running on an Intel Core i7 4500U) to measure ONLY the execution costs. The application is the demo mpeg2dec from libmpeg2. I am trying to obtain a probability distribution for the mpeg2 execution times. However, we want to see the raw execution cost with the cache switched off.
Is there a way I can disable the CPU cache of my system via a Linux command, or via a gcc flag? Or even set the CPU (L1/L2) cache size to 0 KB? Or even add some code changes to disable the cache? Of course, without modifying or rebuilding the kernel.
See this 2012 thread, where someone posted a tiny kernel module source to disable the cache through asm:
http://www.linuxquestions.org/questions/linux-kernel-70/disabling-cpu-caches-936077/
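Presumably the module in that thread works along these lines: set the CD (cache disable) bit in CR0 from ring 0, then flush. A rough sketch using the kernel's CR0 helpers; note it only affects the CPU it happens to run on, and it will make the whole machine crawl:
#include <linux/init.h>
#include <linux/module.h>
#include <asm/special_insns.h> // read_cr0, write_cr0, wbinvd

static int __init nocache_init(void)
{
    write_cr0(read_cr0() | (1UL << 30)); // set CR0.CD: disable caching
    wbinvd();                            // write back and invalidate caches
    pr_info("CPU cache disabled on this core\n");
    return 0;
}
module_init(nocache_init);
MODULE_LICENSE("GPL");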
If disabling the cache is really necessary, then so be it.
Otherwise, to know how much time a process takes in terms of user or system "cycles", I would recommend the getrusage() function.
#include <sys/resource.h> // getrusage

struct rusage usage;
getrusage(RUSAGE_SELF, &usage);
You can call it before and after your loop/test and subtract the values to get a good idea of how much time your process took, even if many other processes run in parallel on the same machine. The main problem you'd have is if your process starts swapping; in that case your timings will be off.
double user_usage = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1000000.0;
double system_usage = usage.ru_stime.tv_sec + usage.ru_stime.tv_usec / 1000000.0;
This is really precise in my own experience. To increase precision, you could run your test as root and give it a negative priority (-1 or -2 is enough). Then it won't be swapped out until you call a function that may require it.
Of course, you still get the effect of the cache... assuming you do not handle very large amounts of data with code that goes on and on (as opposed to having a loop).
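Putting the pieces together, a minimal before/after sketch (workload() stands in for the code being measured):
#include <stdio.h>
#include <sys/resource.h>

extern void workload(void); // the code being measured

static void measure(void)
{
    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    workload();
    getrusage(RUSAGE_SELF, &after);

    // Subtract the before-values from the after-values to get the
    // user and system CPU time consumed by workload() alone.
    double user = (after.ru_utime.tv_sec  - before.ru_utime.tv_sec)
                + (after.ru_utime.tv_usec - before.ru_utime.tv_usec) / 1000000.0;
    double sys  = (after.ru_stime.tv_sec  - before.ru_stime.tv_sec)
                + (after.ru_stime.tv_usec - before.ru_stime.tv_usec) / 1000000.0;
    printf("user: %.6f s, system: %.6f s\n", user, sys);
}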

safe unloading of kernel module

I have to write an LKM that intercepts some syscalls.
Solution is to:
1. Find the address of the sys_call_table symbol, and check that the address is correct (checking, for example, that sys_call_table[__NR_close] points to the address of sys_close)
2. Disable interrupts
3. Disable the WP bit in CR0
4. Change sys_call_table[__NR_close] to my own function
5. Enable the WP bit
6. Enable interrupts.
Loading of module works fine.
But, what about safe unloading of module?
Consider the situation where I restore sys_call_table to its original state and the module is unloaded - what if the kernel is still executing my module's code in the context of a syscall from another process on another CPU? I would get a page fault in kernel mode (because the pages holding the module's code segment are no longer available once the module is unloaded).
The shared resource is the entry in sys_call_table. If I could protect access to this entry with locks, then I could safely unload my module.
But since the kernel's system call handler doesn't take any such locks (e.g. arch/x86/kernel/entry_32.S), does that mean there is no safe way of unloading my module? Is that true?
UPDATE1
I need to get information about file accesses on old kernels (where fanotify(2) is not available), starting from kernel version 2.4. I need this information to perform on-access scanning through an antivirus engine.
You're correct that there is no safe way to unload your module once you've done this. This is one reason why replacing/wrapping system call table entries this way is frowned upon.
In most recent versions, sys_call_table is not an exported symbol -- at least in part to discourage this very thing.
It would be possible in theory to support a more robust system call replacement mechanism but the kernel maintainers believe that the whole concept is so fraught with the potential for errors and confusion that they have declined to support it. (A web search will show several long-ago debates about this subject on the linux kernel mailing list.)
(Speaking here as one who used exactly the same technique several years ago.)
You can of course do it anyway. Then you can either "just risk" unloading your module - and hence potentially cause a kernel panic (though it will likely work 99% of the time) - or not allow your module to be unloaded at all (requiring a reboot in order to upgrade or uninstall it).
At the end of the uninit (module exit) function in your kernel module, you can wait until all your custom hooks have finished.
This can be achieved using counters.
Increment the counter when your custom hook is entered, and decrement it right before it returns.
Only when the counter hits zero should you return from the uninit function.
You will also need locking (or atomic operations) on the counter variable.
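A minimal sketch of that approach, assuming orig_close was saved when the hook was installed and that the sys_call_table entry is restored before the wait; note there is still a tiny window between a thread entering the hook and incrementing the counter:
#include <linux/atomic.h>
#include <linux/delay.h>  // msleep
#include <linux/module.h>

static atomic_t hook_users = ATOMIC_INIT(0);
static asmlinkage long (*orig_close)(int fd); // saved original entry

static asmlinkage long my_close(int fd)
{
    long ret;
    atomic_inc(&hook_users); // mark the hook as in use
    ret = orig_close(fd);
    atomic_dec(&hook_users);
    return ret;
}

static void __exit hook_exit(void)
{
    /* restore sys_call_table[__NR_close] = orig_close here (not shown) */

    // Wait until no CPU is still executing my_close before the
    // module's code pages go away.
    while (atomic_read(&hook_users) > 0)
        msleep(10);
}
module_exit(hook_exit);
MODULE_LICENSE("GPL");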

Cyclictest for RT patched Linux Kernel

Hello, I patched the Linux kernel with the RT patch and tested it with cyclictest, which monitors latencies. The kernel isn't doing well: latencies are no better than with the vanilla kernel.
https://rt.wiki.kernel.org/index.php/Cyclictest
I checked the uname for RT, which looks fine.
So I checked the requirements for cyclictest, and it states that I have to make sure the following is configured in the kernel config:
CONFIG_PREEMPT_RT=y
CONFIG_WAKEUP_TIMING=y
CONFIG_LATENCY_TRACE=y
CONFIG_CRITICAL_PREEMPT_TIMING=y
CONFIG_CRITICAL_IRQSOFF_TIMING=y
The problem now arising is that the config doesn't contain these entries. Maybe they are old options that have been renamed in newer patch versions (3.8.14)?
I found options like:
CONFIG_PREEMPT_RT_FULL=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_RT_BASE=y
CONFIG_HIGH_RES_TIMERS=y
Is that enough in the 3.x kernel to provide the required configuration from above? Does anyone have a hint?
There's a lot that must be done to get hard realtime performance under PREEMPT_RT. Here are the things I am aware of; entries marked with an asterisk apply to your current situation.
Patch the kernel with PREEMPT_RT (as you already did), and enable CONFIG_PREEMPT_RT_FULL (which used to be called CONFIG_PREEMPT_RT, as you correctly derived).
Disable processor frequency scaling (either by removing it from the kernel configuration or by changing the governor or its settings). (*)
Reasoning: Changing a core's frequency takes a while, during which the core does no useful work. This causes high latencies.
To remove this, look under the ACPI options in the kernel settings.
If you don't want to remove this capability from the kernel, you can set the cpufreq governor to "performance" to lock it into its highest frequency.
Disable deep CPU sleep states
Reasoning: Like switching frequencies, waking the CPU from a deep sleep can take a while.
Cyclictest does this for you (look up /dev/cpu_dma_latency to see how to do it in your application; there is also a sketch after this list).
Alternatively, you can disable the "cpuidle" infrastructure in the kernel to prevent this from ever occurring.
Set a high priority for the realtime thread, above 50 (preferably 99) (*)
Reasoning: You need to place your priority above the majority of the kernel -- much of a PREEMPT_RT kernel (including IRQs) runs at a priority of 50.
For cyclictest, you can do this with the "-p#" option, e.g. "-p99".
Your application's memory must be locked. (*)
Reasoning: If your application's memory isn't locked, then the kernel may need to re-map some of your application's address space during execution, triggering high latencies.
For cyclictest, this may be done with the "-m" option.
To do this in your own application, see the RT_PREEMPT howto and the sketch after this list.
You must unload the nvidia, nouveau, and i915 modules if they are loaded (or not build them in the first place) (*)
Reasoning: These are known to cause high latencies. Hopefully you don't need them on a realtime system :P
Your realtime task must be coded to be realtime
For example, you cannot do file access or dynamic memory allocation via malloc(). Many system calls are off-limits (it's hard to find which ones are acceptable, IMO).
cyclictest is mostly already coded for realtime operation, as are many realtime audio applications. You do need to run it with the "-n" flag, however, or it will not use a realtime-safe sleep call.
The actual execution of cyclictest should have at least the following set of parameters:
sudo cyclictest -p99 -m -n
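For your own application, a minimal sketch of the programmatic steps above (latency target, realtime priority, locked memory), assuming root privileges; the /dev/cpu_dma_latency descriptor must stay open for as long as deep sleep states should be held off:
#include <fcntl.h>
#include <sched.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static int rt_setup(void)
{
    // Forbid deep CPU sleep states: request a 0-microsecond latency
    // target and keep the fd open (closing it restores the default).
    int32_t target = 0;
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);
    if (fd < 0 || write(fd, &target, sizeof target) != sizeof target)
        return -1;

    // Run above the kernel's IRQ threads, which default to priority 50.
    struct sched_param sp;
    memset(&sp, 0, sizeof sp);
    sp.sched_priority = 99;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return -1;

    // Lock current and future pages so nothing is ever paged out.
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}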

How to tell Linux that a mmap()'d page does not need to be written to swap if the backing physical page is needed?

Hopefully the title is clear. I have a chunk of memory obtained via mmap(). After some time, I have concluded that I no longer need the data within this range. I still wish to keep this range, however; that is, I do not want to call munmap(). I'm trying to be a good citizen and not make the system swap more than it needs to.
Is there a way to tell the Linux kernel that if the given page is backed by a physical page and if the kernel decides it needs that physical page, do not bother writing that page to swap?
I imagine under the hood this magical function call would destroy any mapping between the given virtual page and physical page, if present, without writing to swap first.
Your question (as stated) makes no sense.
Let's assume that there was a way for you to tell the kernel to do what you want.
Let's further assume that it did need the extra RAM, so it took away your page, and didn't swap it out.
Now your program tries to read that page (since you didn't want to munmap the data, presumably you might try to access it). What is the kernel to do? The choices I see:
1. It can give you a new page filled with 0s.
2. It can give you SIGSEGV.
If you wanted choice 2, you could achieve the same result with munmap.
If you wanted choice 1, you could mremap over the existing mapping with MAP_ANON (or munmap followed by new mmap).
In either case, you can't depend on the old data being there when you need it.
The only way your question would make sense is if there were some additional mechanism for the kernel to let you know that it is taking away your page (e.g. by sending you a special signal). But the situation you describe is likely too rare to warrant the additional complexity.
EDIT:
You might be looking for madvise(..., MADV_DONTNEED) (see the sketch below)
You could munmap the region, then mmap it again with MAP_NORESERVE
If you know at initial mapping time that swapping is not needed, use MAP_NORESERVE
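A minimal sketch of the madvise route (addr and length are illustrative); for a private anonymous mapping, the next access after this returns zero-filled pages rather than the old data:
#include <stddef.h>   // size_t
#include <sys/mman.h> // madvise

static int drop_contents(void *addr, size_t length)
{
    // The mapping stays valid, but the kernel may discard the
    // backing pages immediately, without writing them to swap.
    return madvise(addr, length, MADV_DONTNEED);
}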

Limiting the File System Usage Programmatically in Linux

I was assigned to write a system call for the Linux kernel, which (oddly) determines and reduces users' maximum transfer amount per minute (for file operations). This system call will be called lim_fs_usage and will take a parameter for the maximum number of bytes all users can access in a minute. In short, I am going to limit the bandwidth of all filesystem operations in Linux. The project also asks for choosing an appropriate method for distributing this restricted resource (file access) among the users, but I think this won't be a big problem.
I did a long, long search and scan but could not find a method for managing file system access programmatically. I thought of mapping the hard drive into memory with mmap() and managing the memory operations, but this turned out to be useless. I also tried to find an API for the virtual file system in order to monitor and limit it, but I could not find one. Any ideas, please... Any help is greatly appreciated. Thank you in advance...
I wonder if you could do this as an IO scheduler implementation.
The main difficulty of doing IO bandwidth limitation under Linux is that, by the time an IO request reaches anywhere near the device, the kernel has probably long since forgotten who caused it.
Likewise, you can get on some very tricky ground in determining who is responsible for a given piece of IO:
If a binary is demand-loaded, who owns the IO doing that?
A mapped section of memory (demand-loaded executable or otherwise) might be kicked out of memory because someone else used too much RAM, causing the kernel to evict those pages; this places an unfair burden on the quota of the user who then has to page it back in
IO operations can be combined, and might come from different users
A write operation might cause an IO sooner or later depending on how the kernel schedules it; a later schedule may mean that fewer IOs need to be done in the long run, as another write gets done to the same block in the interim; writing to an already dirty block in cache does not make it any dirtier.
If you understand all these and more caveats, and still want to, I imagine doing it as an IO scheduler is the way to go.
IO schedulers are pluggable under Linux (2.6) and can be changed dynamically - the kernel waits for all IO on the device (IO scheduler is switchable per block device) to end and then switches to the new one.
Since it's urgent, I'll give you an idea off the top of my head without doing any research on its feasibility: what about inserting a hook to monitor system calls that deal with file system access?
You might end up writing specialised kernel modules to handle the various filesystems (ext3, ext4, etc.), but as a proof of concept you can start with one. Do not forget that root has reserved blocks in memory, process space and disk for its own operations.
Managing memory operations does not sound related to what you're trying to do (but perhaps I am mistaken here).
After a long period of thinking and searching, I decided to use the "hooking" method proposed. I am thinking of creating a new system call which initializes and manages a global variable like hdd_bandwidth_limit. This variable will be used in the modified implementations of the read() and write() system calls (in place of the "count" variable). Then I will decide on the distribution of this resource, which is the real issue. Probably I will find out how many users are using the system at a certain moment and divide this resource equally; it will be a Round-Robin-like distribution. But still, I am open to suggestions on this distribution issue. Will it be SJF or FCFS or Round-Robin? Synchronization is another issue. How can I know whether a user's job is short or long? Or whether they are done with the operation or not?
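As a very rough sketch of what such a modified read() path could look like (all names here are hypothetical, and real accounting would also need a timer to reset the window plus per-user bookkeeping):
#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/kernel.h>

// Hypothetical global state managed by the new lim_fs_usage syscall.
static atomic64_t bytes_this_minute;   // reset once per minute by a timer
static long hdd_bandwidth_limit;       // bytes allowed per minute

static asmlinkage long (*orig_read)(unsigned int fd, char __user *buf, size_t count);

static asmlinkage long limited_read(unsigned int fd, char __user *buf, size_t count)
{
    // Charge the request against the per-minute budget up front;
    // refund and refuse if it would exceed the limit.
    if (atomic64_add_return(count, &bytes_this_minute) > hdd_bandwidth_limit) {
        atomic64_sub(count, &bytes_this_minute);
        return -EDQUOT; // over budget (could sleep until the window resets instead)
    }
    return orig_read(fd, buf, count);
}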
