How does the OS know a page is dirty in mapped memory? - linux

I mean when data is updated directly in memory, without using write().
On Linux I thought all the data in the range specified in an msync call was flushed.
But on Windows, the documentation for FlushViewOfFile says it triggers "writing of dirty pages", so somehow the OS knows which pages have been updated.
How does that work? Do we have to use WriteFile to update mapped memory?
If we use write() on Linux, does msync only sync dirty pages?

On most (perhaps all) modern-day computers running either Linux or Windows, the CPU keeps track of dirty pages on the operating system's behalf. This information is stored in the page table.
(See, for example, section 4.8 of the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 3A and section 5.4.2 in the AMD64 Architecture Programmer's Manual, Volume 2.)
If that functionality isn't available on a particular CPU, an operating system could instead use page faults to detect the first write to a page, as described in datenwolf's answer.

When flushing pages (i.e. cleaning them), the OS internally clears the "writable" flag. After that, when a program attempts to write to a memory location in such a page, the kernel's page-fault handler is invoked. The page-fault handler restores the write permission and marks the page dirty, then returns control to the program to let it perform the actual write.
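To see the effect from user space, here is a minimal sketch (the file name data.bin and the four-page mapping size are assumptions for illustration): a store through a MAP_SHARED mapping dirties one page, and msync() only has to write back the pages that were marked dirty.

```c
/* Minimal sketch: maps a pre-sized file, dirties one page by writing
 * through the mapping, then asks the kernel to flush the dirty pages
 * back to the file with msync(). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);          /* hypothetical file, at least 16KB long */
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 4096 * 4;                      /* four pages, for illustration */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* This store marks exactly one page dirty in the page table. */
    strcpy(p + 4096, "updated without write()");

    /* msync() walks the mapping; only pages marked dirty are queued for
     * write-back.  MS_SYNC waits for the I/O to finish. */
    if (msync(p, len, MS_SYNC) != 0)
        perror("msync");

    munmap(p, len);
    close(fd);
    return 0;
}
```

If the file were mapped MAP_PRIVATE instead, the dirtied pages would become anonymous copies and msync() would have nothing to write back to the file.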

Related

how to enable hugetlb on mips32

Here is the problem I have:
Packets are received/transmitted in a kernel driver, and a user-space program needs to access each of these packets, so there is a huge amount of data transferred between kernel and user space (data stream: kernel rx -> user-space process -> kernel tx).
Throughput is the KPI.
I decided to use shared memory/mmap to avoid copying the data. Although I haven't tested it yet, others have told me TLB misses will be a problem.
The system I use is:
mips32 system (mips74kc, single core)
default page size 4KB
kernel 2.6.32
A 4KB page only fits one data packet, so during the data transfer there will be many TLB misses, which hurt throughput.
I found that huge pages might be a solution, but it seems that only mips64 supports hugetlbfs currently.
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
https://www.linux-mips.org/archives/linux-mips/2009-05/msg00429.html
So, my question is: how can I use hugetlbfs on mips32? Or is there another way to solve the throughput problem? (I must do the data processing in user space.)
According to ddaney's patch,
Currently the patch only works for 64-bit kernels because the value of
PTRS_PER_PTE in 32-bit kernels is such that it is impossible to have a
valid PageMask. It is thought that by adjusting the page allocation
scheme, 32-bit kernels could be supported in the future.
It seems possible. Could someone give me a hint about what needs to be modified in order to enable hugetlb?
Thank you!
Does the documentation of your core list support for non-4KB pages in its TLB? If it is not supported, you would have to change the CPU (replace it with one that supports larger pages, or redesign the CPU and make a new chip).
But most probably you are on the wrong track: TLB misses have not yet been proven to be the problem (and a 2MB huge page is the wrong solution for 8KB or 15KB packets).
I would point you instead to "zero-copy" and/or user-space networking (netmap, snabb, PF_RING, DPDK, a network stack in user space), a user-space network driver, or a kernel-based data handler. But many of these tools exist only for newer kernels.
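For reference, on a platform where huge pages are available (e.g. mips64 or x86_64, not the asker's mips32), requesting them from user space looks roughly like this; MAP_HUGETLB and the 4MB size are assumptions for illustration, and huge pages must have been reserved beforehand.

```c
/* Sketch only: assumes a kernel with hugetlbfs support and reserved huge
 * pages (e.g. echo 16 > /proc/sys/vm/nr_hugepages).  Not possible on the
 * asker's mips32/2.6.32 setup without the kernel changes discussed above. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4UL * 1024 * 1024;   /* illustrative size, a multiple of the huge page size */

    /* MAP_HUGETLB asks for huge pages directly, without mounting hugetlbfs. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  /* fails if no huge pages are reserved */
        return 1;
    }

    /* One TLB entry now covers the whole huge page instead of many 4KB pages. */
    ((char *)p)[0] = 1;

    munmap(p, len);
    return 0;
}
```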

mprotect : how is memory protection implemented

I already know that the mprotect() syscall has 4 protection modes in BSD, but my question is how this protection is implemented (in hardware or in software)?
Let's say we set the protection of specific pages to PROT_NONE; does it really depend on the hardware I'm using, or is it some software trick that sets flags on the specified pages in the page table?
It seems that this protection depends on the MMU we have, but I'm not sure about it.
You can find more information about mprotect and paging here:
BSD man page
Paging - Wiki
Page protection is implemented in hardware with software assistance. Basically, you want to achieve the following:
Enter kernel context automatically when a user process tries to do something with a specific memory page (the hardware is responsible for this).
Let kernel code do something to the accessing process in order to uphold the mprotect guarantee (this happens in software, invoked from the hardware trap handler triggered in point 1).
And yes, without an MMU point 1 would not work, so on uClinux (a version of Linux designed to support processors without an MMU) mprotect is not implemented (as it would be impossible to invoke the code from point 2 transparently).
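A minimal sketch of how that split becomes visible from user space (the handler and messages are illustrative): mprotect() changes the page-table entry in software, the MMU raises the fault in hardware, and the kernel delivers it back to the process as SIGSEGV.

```c
/* Sketch: protect one anonymous page with PROT_NONE and watch the
 * hardware-generated fault arrive as SIGSEGV. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;                             /* the protected page */
static long  pagesize;

static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si; (void)ctx;
    /* The MMU raised the fault; the kernel turned it into this signal.
     * Restore access so the faulting instruction can be retried. */
    const char msg[] = "fault caught, restoring access\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    mprotect(page, pagesize, PROT_READ | PROT_WRITE);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    pagesize = sysconf(_SC_PAGESIZE);
    page = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) { perror("mmap"); return 1; }

    mprotect(page, pagesize, PROT_NONE);       /* page-table entry now forbids access */
    page[0] = 42;                              /* faults; handler restores access; retried */
    printf("wrote %d after the fault was handled\n", page[0]);
    return 0;
}
```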

Allocating "temporary" memory (in Linux)

I'm trying to find any system functionality that would allow a process to allocate "temporary" memory, i.e. memory that the process considers discardable and that the system can take away when memory is needed, while still letting the process benefit from available memory when possible. In other words, the process tells the system it's OK to sacrifice the block of memory when the process is not using it. Freeing the block is also preferable to swapping it out (swapping it out is as expensive as, or more expensive than, reconstituting its contents).
Systems (e.g. Linux) have things like this in the kernel, such as the filesystem page cache. I am looking for something similar, but available to user space.
I understand there are ways to do this from the program, but it's really more of a kernel job. To some extent, I'm asking the kernel:
if you need to reduce my, or another process's, residency, take these temporary pages away first
if you are taking these temporary pages away, don't swap them out, just unmap them
Specifically, I'm interested in a solution that would work on Linux, but I'd be interested to learn whether any exists for another OS.
UPDATE
An example of how I expect this to work:
map a page (swap-backed). No difference from what's available right now.
tell the kernel that the page is "temporary" (for lack of a better name), meaning that if this page goes away, I don't want it paged back in.
tell the kernel that I need the temporary page "back". If the page was unmapped since I marked it "temporary", I am told that this happened. If it wasn't, it starts behaving as a regular page again.
Here are the problems with doing this over the existing MM:
To keep pages from being paged back in, I have to allocate them over nothing. But then they can be paged out at any time, without notice. Testing with mincore() doesn't guarantee that the page will still be there by the time mincore() returns. Using mlock() requires elevated privileges.
So, the closest I can get to this is by using mlock() and anonymous pages. Following the expectations I outlined earlier, it would be:
map an anonymous, locked page (MAP_ANON|MAP_LOCKED|MAP_NORESERVE). Stamp the page with a magic value.
to make the page "temporary", unlock it
when the page is needed, lock it again. If the magic value is there, it's my data; otherwise it has been lost, and I need to reconstitute it. (A sketch of this scheme follows below.)
However, I don't really need the pages to be locked in RAM while I'm using them. Also, MAP_NORESERVE is problematic if memory is overcommitted.
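A minimal sketch of the scheme outlined above; the MAGIC value, the single-page size, and the reconstitute() helper are made up for illustration, and mlock() is assumed to be permitted by RLIMIT_MEMLOCK.

```c
/* Sketch of the mlock()-based "temporary page" scheme described above.
 * MAGIC and reconstitute() are illustrative assumptions. */
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAGIC 0xC0FFEEu

static void reconstitute(char *page, size_t len)
{
    /* Rebuild the cached contents from scratch (application-specific). */
    memset(page, 0, len);
    *(uint32_t *)page = MAGIC;
}

int main(void)
{
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    char *page = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANON | MAP_LOCKED | MAP_NORESERVE, -1, 0);
    if (page == MAP_FAILED) { perror("mmap"); return 1; }  /* may fail if RLIMIT_MEMLOCK is low */
    reconstitute(page, psz);

    /* Mark the page "temporary": the kernel may now reclaim it. */
    munlock(page, psz);

    /* ... later: we need the page back. */
    mlock(page, psz);
    if (*(uint32_t *)page != MAGIC) {
        /* The page was reclaimed and came back without our data. */
        reconstitute(page, psz);
    }
    puts("page is usable again");
    munmap(page, psz);
    return 0;
}
```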
This is what the VMware ESXi server, i.e. the Virtual Machine Monitor (VMM) layer, implements. It is used with virtual machines as a way to reclaim memory from the guests: virtual machines that have more memory allocated than they actually use are made to release/free it to the VMM so that it can be assigned back to the guests that need it.
This memory-reclamation technique is described in this paper: http://www.vmware.com/files/pdf/mem_mgmt_perf_vsphere5.pdf
Along similar lines, you could implement something like this in your kernel.
I'm not sure I understand exactly what you need. Remember that processes run in virtual memory (their address space is virtual), and that the kernel deals with virtual-to-physical address translation (using the MMU) and with paging, so a page fault can happen at any time. The kernel chooses when to page in or page out, and which pages to evict (only the kernel cares about RAM, and it can page out any physical RAM page at will). Perhaps you want the kernel to tell you when a page has genuinely been discarded. How would the kernel take temporary memory away from your process without your process being notified? The kernel could take memory away and later give some RAM back... (so you want to know when the given-back memory is fresh).
You might use mmap(2) with MAP_NORESERVE first, then again (on the same memory range) with MAP_FIXED|MAP_PRIVATE. See also mincore(2) and mlock(2).
You could also later use madvise(2) with MADV_DONTNEED or MADV_WILLNEED etc.
Perhaps you want to mmap some device like /dev/null, /dev/full, /dev/zero, or (more likely) write your own kernel module providing a similar device.
GNU Hurd has an external pager mechanism... You cannot yet get exactly that on Linux. (Perhaps consider mmap on some FUSE-mounted file.)
I don't understand what you want to happen when the kernel pages out your memory, and what you want to happen when the kernel pages it back in because your process accessed it. Do you want to get a zeroed page, or a SIGSEGV?
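As a rough illustration of the madvise(2) route mentioned above (a sketch, not exactly the semantics the asker wants): for private anonymous mappings, MADV_DONTNEED drops the pages immediately, and the next access gets a zero-filled page.

```c
/* Sketch: discard an anonymous page with madvise(MADV_DONTNEED) and
 * observe that it comes back zero-filled when touched again. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "cached result");
    printf("before: %s\n", p);

    /* Tell the kernel we no longer need the contents.  For private
     * anonymous mappings the page is dropped immediately, not swapped. */
    madvise(p, psz, MADV_DONTNEED);

    /* The next access gets a fresh zero-filled page. */
    printf("after:  '%s' (empty means the data was discarded)\n", p);

    munmap(p, psz);
    return 0;
}
```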

Trying to trap all memory reads/writes on a Linux machine

I have a Linux machine and I am trying to catch all the writes or reads to memory for a specific amount of time (I basically need the byte address and the value that is being written). Is there any tool that can help me do that, or do I have to change the OS code?
You mentioned that you only want to monitor memory reads and writes to a certain physical memory address. I'm going to assume that when you say memory reads/writes, you mean an assembly instruction that reads/writes data to memory and not an instruction fetch.
You would have to modify some paging code in the kernel so it page-faults when a certain address range is accessed. Then, in the page-fault handler, you could somehow log the access. You could extract the target address and data by decoding the instruction that caused the fault and reading the data off the registers. After logging, the page is configured not to fault and the instruction is retried. This is similar to the copy-on-write technique, except that you're logging each read/write to the region.
The other, hardware, method is to somehow install a bus sniffer or tap into a hardware debugging interface on your platform to monitor which regions of memory are being accessed, but I imagine you'll run into trouble with caches using this method.
As mentioned by another poster, you could also modify an emulator to capture certain memory accesses and run your code on that.
I'd say both methods are very platform specific and will take a lot of effort to do. Out of curiosity, what is it that you're hoping to achieve? There must be a better way to solve it than to monitor accesses to physical memory.
Self-introspection is suitable for some types of debugging. For a complete trace of memory access, it is not. How is the debug code supposed to store a trace without performing more memory access?
If you want to stay in software, your best bet is to run the code being traced inside an emulator. Not a virtual machine that uses the MMU to isolate the test code while still providing direct access, but a full emulator. Plenty exist for x86 and most other architectures you would care about.
Well, if you're just interested in memory reads and writes by a particular process (to part or all of that process's virtual memory space), you can use a combination of ptrace and mprotect (mprotect to make the memory inaccessible, and ptrace to run until the process accesses the memory and then single-step).
Sorry to say, it's just not possible to do what you want, even if you change OS code. Reads and writes to memory do not go through OS system calls.
The closest you could get would be to use accessor functions for the variables of interest. The accessors could be instrumented to put trace info in a separate buffer. Embedded debugging often does this to get a log of I/O register accesses.
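A minimal sketch of that accessor approach (the traced variable, buffer size, and log format are assumptions): every read and write goes through a function that records the address and value in a side buffer, which can be dumped later.

```c
/* Sketch: route all access to one traced variable through accessors that
 * log the address and value into a ring buffer (sizes/names are made up). */
#include <stdint.h>
#include <stdio.h>

#define TRACE_SLOTS 256

struct trace_entry { const void *addr; uint32_t value; char op; };

static struct trace_entry trace_buf[TRACE_SLOTS];
static unsigned trace_pos;

static uint32_t traced_var;            /* the variable being monitored */

static void trace(char op, uint32_t value)
{
    trace_buf[trace_pos % TRACE_SLOTS] =
        (struct trace_entry){ &traced_var, value, op };
    trace_pos++;
}

static uint32_t read_traced(void)      { trace('R', traced_var); return traced_var; }
static void write_traced(uint32_t v)   { trace('W', v); traced_var = v; }

int main(void)
{
    write_traced(7);
    uint32_t x = read_traced() + 1;
    write_traced(x);

    /* Dump the trace: operation, address, value. */
    for (unsigned i = 0; i < trace_pos && i < TRACE_SLOTS; i++)
        printf("%c %p = %u\n", trace_buf[i].op,
               (void *)trace_buf[i].addr, trace_buf[i].value);
    return 0;
}
```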

Nvidia Information Disclosure / Memory Vulnerability on Linux and General OS Memory Protection

I thought this was expected behavior?
From: http://classic.chem.msu.su/cgi-bin/ceilidh.exe/gran/gamess/forum/?C35e9ea936bHW-7675-1380-00.htm
Paraphrased summary: "Working on the Linux port we found that cudaHostAlloc/cuMemHostAlloc CUDA API calls return un-initialized pinned memory. This hole may potentially allow one to examine regions of memory previously used by other programs and Linux kernel. We recommend everybody to stop running CUDA drivers on any multiuser system."
My understanding was that "Normal" malloc returns un-initialized memory, so I don't see what the difference here is...
The way I understand how memory allocation works would allow the following to happen:
- userA runs a program on a system that crunches a bunch of sensitive information. When the calculations are done, the results are written to disk, the process exits, and userA logs off.
- userB logs in next. userB runs a program that requests all available memory in the system and writes the content of his un-initialized memory, which contains some of userA's sensitive information that was left in RAM, to disk.
I have to be missing something here. What is it? Is memory zero'd-out somewhere? Is kernel/pinned memory special in a relevant way?
Memory returned by malloc() may be nonzero, but only after being used and freed by other code in the same process. Never another process. The OS is supposed to rigorously enforce memory protections between processes, even after they have exited.
Kernel/pinned memory is only special in that it apparently gave a kernel mode driver the opportunity to break the OS's process protection guarantees.
So no, this is not expected behavior; yes, this was a bug. Kudos to NVIDIA for acting on it so quickly!
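A small sketch of the distinction drawn above (buffer sizes and strings are arbitrary): memory recycled by malloc() within one process may still hold that process's old data, whereas pages newly obtained from the kernel arrive zero-filled.

```c
/* Sketch: malloc() may recycle memory freed earlier in the same process,
 * but pages freshly handed out by the kernel (e.g. via mmap) are zeroed. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* Same-process reuse: the old contents may still be (partly) visible. */
    char *a = malloc(64);
    strcpy(a, "secret from earlier in this process");
    free(a);
    char *b = malloc(64);                          /* often returns the same chunk */
    printf("recycled heap chunk: '%.35s'\n", b);   /* may show remnants of the old string */
    free(b);

    /* Fresh memory from the kernel: always zero-filled, never another
     * process's data.  The NVIDIA bug broke this guarantee for pinned memory. */
    char *c = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    printf("fresh page first byte: %d\n", c[0]);   /* prints 0 */
    munmap(c, 4096);
    return 0;
}
```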
The only part of installing CUDA that requires root privileges is the NVIDIA driver. As a result, all operations done using the NVIDIA compiler and linker can be done using regular system calls and standard compilation (provided you have the proper information -lol-). If any security hole lies there, it remains, whether or not cudaHostAlloc/cuMemHostAlloc is modified.
I am dubious about the first answer on this post. The man page for malloc specifies that
the memory is not cleared. The man page for free does not mention any clearing of the memory either.
Clearing memory seems to be the responsibility of whoever codes a sensitive section -lol-, which leaves the problem of an unexpected (rare) exit. Apart from VMS (a good but not widely used OS), I don't think any OS accepts the performance cost of systematically clearing memory. I am not clear about how the system could track, within the heap of newly allocated memory, what was previously in the process's area and what was not.
My conclusion is: if you need a strict level of privacy, do not use a multi-user system
(or use VMS).
