Under Linux, how can I tell what specific process owns / is using a given address in physical memory?
I understand that this may require writing a kernel module to access some kernel data structure and return the results to a user - I need to know how it can be done, regardless of how complicated it is.
The pages in use by a process and their location in physical memory are not static pieces of information. However, the information you seek should be in the page tables. A change went into the kernel that might be almost exactly what you're looking for:
author Arjan van de Ven <arjan#linux.intel.com> 2008-04-17 15:40:45 (GMT)
committer Ingo Molnar <mingo#elte.hu> 2008-04-17 15:40:45 (GMT)
commit 926e5392ba8a388ae32ca0d2714cc2c73945c609 (patch)
tree 2718b50b8b66a3614f47d3246b080ee8511b299e
parent 2596e0fae094be9354b29ddb17e6326a18012e8c (diff)
x86: add code to dump the (kernel) page tables for visual inspection by kernel developers
This patch adds code to the kernel to have an (optional)
/proc/kernel_page_tables debug file that basically dumps the kernel
pagetables; this allows us kernel developers to verify that nothing
fishy is going on and that the various mappings are set up correctly.
This was quite useful in finding various change_page_attr() bugs, and
is very likely to be useful in the future as well.
Signed-off-by:Arjan van de Ven <arjan#linux.intel.com>
Cc: mingo#elte.hu
Cc: tglx#tglx.de
Cc: hpa#zytor.com
Signed-off-by: Ingo Molnar <mingo#elte.hu>
Signed-off-by: Thomas Gleixner <tglx#linutronix.de>
The added functionality is enabled by a new config option (X86_PTDUMP).
Might want to start here for a discusson of how process virtual memory is mapped to physical memory. That would give you a good place to start as far as figuring out where you would need to hook into the kernel to access the page table, etc. where that information is stored.
Well due to the way things are done under Linux, a process may own memory at one instance, and then will not anymore, due to paging.
http://en.wikipedia.org/wiki/Paging
Essentially this means that the computer switches out data it doesn't need at one moment so that the memory can be used for something else.
I'm not sure if this helped or not, but I'd advise you to look at page tables and directories, since you can use these to translate to physical addresses.
You might be able to use pmap -d [pid] for this... unfortunately you'd have to run it on all processes to see which one returned a result for the given memory address. Certainly not as efficient as a kernel module (and you might not even get a result, if the memory is paged out while you're looking for it).
Related
Need to save the page of process (the user part!) from removing to the swap.
I need to do it in the kernel, only. (language C I know)
(Maybe insert hook in shrink_page_list?)
I have IDs of processes, which need to save and threshold amount of physical memory in the system (We fill, while it isn't filled). IDs and threshold write in /proc, /dev or /sys.
How to approach this?
What files to look at?
What tutorials to read?
Maybe there are examples that are somehow are related with this task.
Info: I compilling kernel of Debian Lenny, use Qemu for start it on my Ubuntu.
See get_user_pages. http://www.makelinux.net/ldd3/chp-15-sect-3.
Use get_user_pages, you can get whatever page you want and keep it locked in memory.
Even better, look at the comments on the source at
http://lxr.free-electrons.com/source/mm/gup.c#L637
Hopefully the title is clear. I have a chunk of memory obtained via mmap(). After some time, I have concluded that I no longer need the data within this range. I still wish to keep this range, however. That is, I do not want to call mummap(). I'm trying to be a good citizen and not make the system swap more than it needs.
Is there a way to tell the Linux kernel that if the given page is backed by a physical page and if the kernel decides it needs that physical page, do not bother writing that page to swap?
I imagine under the hood this magical function call would destroy any mapping between the given virtual page and physical page, if present, without writing to swap first.
Your question (as stated) makes no sense.
Let's assume that there was a way for you to tell the kernel to do what you want.
Let's further assume that it did need the extra RAM, so it took away your page, and didn't swap it out.
Now your program tries to read that page (since you didn't want to munmap the data, presumably you might try to access it). What is the kernel to do? The choices I see:
it can give you a new page filled with 0s.
it can give you SIGSEGV
If you wanted choice 2, you could achieve the same result with munmap.
If you wanted choice 1, you could mremap over the existing mapping with MAP_ANON (or munmap followed by new mmap).
In either case, you can't depend on the old data being there when you need it.
The only way your question would make sense is if there was some additional mechanism for the kernel to let you know that it is taking away your page (e.g. send you a special signal). But the situation you described is likely rare enough to warrant additional complexity.
EDIT:
You might be looking for madvise(..., MADV_DONTNEED)
You could munmap the region, then mmap it again with MAP_NORESERVE
If you know at initial mapping time that swapping is not needed, use MAP_NORESERVE
I'm currently developing an application which needs a lot of system and process information, some of which is only available through /proc, and I have some general questions about accessing the structures.
The application will be run on Linux (kernel >= 2.6), not on any other Unix-flavored OS. It should have access to any data in /proc, I can't say what is necessary now as the specifications are not clear yet, but the whole /proc directory is relevant to the application.
First of all: Is there a good documentation which covers all the features added / removed from kernel version to kernel version? One thing I'm curious about in particular is the format of the individual files. Can I take that for granted? Does it change among kernel versions?
Hooking up the parsing process based on the kernel wouldn't be a problem at all, it's just that I couldn't find any good docs on what has changed from version to version which could help me catching parsing errors in beforehand.
In addition: Is there a definite list of features that can be activated / deactivated by kernel options (except of course the /proc-feature itself)? I'm looking for a list of files / directories that only exist with the appropriate options being set in the kernel.
As an example of what I'm thinking of, this is a link to the proc manpage (http://linux.die.net/man/5/proc) which includes a lot of good information, e.g. some options include the earliest kernel version they were available at, some include whether a module is necessary to be loaded. This does not describe the output format of all information though, which is something I need if I want to parse it (e.g. if it is consistent throughout all kernel versions or changed at some point).
The second thing I'm wondering about is what happens if the process queried dies while being queried. What is my time interval? For example if I'm going to fetch a list of processes reading all the structures, and parse them one after another, what happens if my process x dies before I get to read it? Even if I check if the directory exists, it could still be gone one application call later.
Last but not least: Is there any major distribution out there that is not mounting proc?
From what I understand, a lot of common tools are based on the /proc interface such as lsmod or free, so I'm guessing that I can expect /proc to exist almost always.
The /proc interfaces are pretty stable (unlike the /sys interfaces), even if nothing is guaranteed. Almost all changes are backwards compatible, at least if they've been around for a few versions. You should
stick to the documented interfaces to be safe. If a file exists, its format may be extended in later versions, but normally in a backwards compatible way, e.g. adding columns to a table. The parts that are most at risk of disappearing are parts concerning hardware susbystems such as ACPI or SCSI, which are migrating to /sys (with a long transition period when both exist).
Most of the information is architecture-independent, except for hardware information (e.g. /proc/cpuinfo has very different fields on different architectures).
The main documentation is Documentation/filesystems/proc.txt in the kernel source. Consider proc(5) to be the overview and proc.txt to be the fine details. The kernel documentation is often incomplete, so don't be surprised if you need to resort to reading the source sometimes.
Most optional parts of /proc are activated by default if the driver whose data it exposes is included in the kernel. The exceptions are mostly related to hardware features that rarely need to be accessed from outside the kernel; if you need to access these features, you're probably already expecting to need to dig deeper. Look through Kconfig files in the kernel source for detailed information.
Process data (or hardware data related to removable hardware or provided by unloadable modules) can disappear under your nose. Most files under /proc can be read atomically, with a single read call with a reasonably-sized buffer; if you perform multiple read calls in sequence, drivers are supposed to guarantee that you get well-formed data. There is no way to guarantee atomicity between reads of separate files; if you're reading information about a process, this process can die at any time, and in principle could even be replaced by another process with the same PID before you're finished.
As it says in the description of /proc, “everyone should say Y here”. All desktop/server Linux systems and most embedded Linux systems must have /proc; a lot of things, including ps and other process management commands, many filesystem and device-related tools, and module loading, require it. The only systems that might be able to dispense with /proc are very small single-purpose embedded systems that support a single hardware configuration and run a fixed set of programs. You can count on its being here.
I'm creating a web application running on a Linux server. The application is constantly accessing a 250K file - it loads it in memory, reads it and sends back some info to the user. Since this file is read all the time, my client is suggesting to use something like memcache to cache it to memory, presumably because it will make read operations faster.
However, I'm thinking that the Linux filesystem is probably already caching the file in memory since it's accessed frequently. Is that right? In your opinion, would memcache provide a real improvement? Or is it going to do the same thing that Linux is already doing?
I'm not really familiar with neither Linux nor memcache, so I would really appreciate if someone could clarify this.
Yes, if you do not modify the file each time you open it.
Linux will hold the file's information in copy-on-write pages in memory, and "loading" the file into memory should be very fast (page table swap at worst).
Edit: Though, as cdhowie points out, there is no 'linux filesystem'. However, I believe the relevant code is in linux's memory management, and is therefore independent of the filesystem in question. If you're curious, you can read in the linux source about handling vm_area_struct objects in linux/mm/mmap.c, mainly.
As people have mentioned, mmap is a good solution here.
But, one 250k file is very small. You might want to read it in and put it in some sort of memory structure that matches what you want to send back to the user on startup. Ie, if it is a text file an array of lines might be a good choice, etc.
The file should be cached, but make sure the noatime option is set on the mount, otherwise the access time will attempt to be saved to the file, invalidating the cache.
Yes, definitely. It will keep accessed files in memory indefinitely, unless something else needs the memory.
You can control this behaviour (to some extent) with the fadvise system call. See its "man" page for more details.
A read/write system call will still normally need to copy the data, so if you see a real bottleneck doing this, consider using mmap() which can avoid the copy, by mapping the cache pages directly into the process.
I guess putting that file into ramdisk (tmpfs) may make enough advantage without big modifications. Unless you are really serious about response time in microseconds unit.
Is it possible to 'hibernate' a process in linux?
Just like 'hibernate' in laptop, I would to write all the memory used by a process to disk, free up the RAM. And then later on, I can 'resume the process', i.e, reading all the data from memory and put it back to RAM and I can continue with my process?
I used to maintain CryoPID, which is a program that does exactly what you are talking about. It writes the contents of a program's address space, VDSO, file descriptor references and states to a file that can later be reconstructed. CryoPID started when there were no usable hooks in Linux itself and worked entirely from userspace (actually, it still does work, depending on your distro / kernel / security settings).
Problems were (indeed) sockets, pending RT signals, numerous X11 issues, the glibc caching getpid() implementation amongst many others. Randomization (especially VDSO) turned out to be insurmountable for the few of us working on it after Bernard walked away from it. However, it was fun and became the topic of several masters thesis.
If you are just contemplating a program that can save its running state and re-start directly into that state, its far .. far .. easier to just save that information from within the program itself, perhaps when servicing a signal.
I'd like to put a status update here, as of 2014.
The accepted answer suggests CryoPID as a tool to perform Checkpoint/Restore, but I found the project to be unmantained and impossible to compile with recent kernels.
Now, I found two actively mantained projects providing the application checkpointing feature.
The first, the one I suggest 'cause I have better luck running it, is CRIU
that performs checkpoint/restore mainly in userspace, and requires the kernel option CONFIG_CHECKPOINT_RESTORE enabled to work.
Checkpoint/Restore In Userspace, or CRIU (pronounced kree-oo, IPA: /krɪʊ/, Russian: криу), is a software tool for Linux operating system. Using this tool, you can freeze a running application (or part of it) and checkpoint it to a hard drive as a collection of files. You can then use the files to restore and run the application from the point it was frozen at. The distinctive feature of the CRIU project is that it is mainly implemented in user space.
The latter is DMTCP; quoting from their main page:
DMTCP (Distributed MultiThreaded Checkpointing) is a tool to transparently checkpoint the state of multiple simultaneous applications, including multi-threaded and distributed applications. It operates directly on the user binary executable, without any Linux kernel modules or other kernel modifications.
There is also a nice Wikipedia page on the argument: Application_checkpointing
The answers mentioning ctrl-z are really talking about stopping the process with a signal, in this case SIGTSTP. You can issue a stop signal with kill:
kill -STOP <pid>
That will suspend execution of the process. It won't immediately free the memory used by it, but as memory is required for other processes the memory used by the stopped process will be gradually swapped out.
When you want to wake it up again, use
kill -CONT <pid>
The more complicated solutions, like CryoPID, are really only needed if you want the stopped process to be able to survive a system shutdown/restart - it doesn't sound like you need that.
Linux Kernel has now partially implemented the checkpoint/restart futures:https://ckpt.wiki.kernel.org/, the status is here.
Some useful information are in the lwn(linux weekly net):
http://lwn.net/Articles/375855/ http://lwn.net/Articles/412749/ ......
So the answer is "YES"
The issue is restoring the streams - files and sockets - that the program has open.
When your whole OS hibernates, the local files and such can obviously be restored. Network connections don't, but then the code that accesses the internet is typically more error checking and such and survives the error conditions (or ought to).
If you did per-program hibernation (without application support), how would you handle open files? What if another process accesses those files in the interim? etc?
Maintaining state when the program is not loaded is going to be difficult.
Simply suspending the threads and letting it get swapped to disk would have much the same effect?
Or run the program in a virtual machine and let the VM handle suspension.
Short answer is "yes, but not always reliably". Check out CryoPID:
http://cryopid.berlios.de/
Open files will indeed be the most common problem. CryoPID states explicitly:
Open files and offsets are restored.
Temporary files that have been
unlinked and are not accessible on the
filesystem are always saved in the
image. Other files that do not exist
on resume are not yet restored.
Support for saving file contents for
such situations is planned.
The same issues will also affect TCP connections, though CryoPID supports tcpcp for connection resuming.
I extended Cryopid producing a package called Cryopid2 available from SourceForge. This can
migrate a process as well as hibernating it (along with any open files and sockets - data
in sockets/pipes is sucked into the process on hibernation and spat back into these when
process is restarted).
The reason I have not been active with this project is I am not a kernel developer - both
this (and/or the original cryopid) need to get someone on board who can get them running
with the lastest kernels (e.g. Linux 3.x).
The Cryopid method does work - and is probably the best solution to general purpose process
hibernation/migration in Linux I have come across.
The short answer is "yes." You might start by looking at this for some ideas: ELF executable reconstruction from a core image (http://vx.netlux.org/lib/vsc03.html)
As others have noted, it's difficult for the OS to provide this functionality, because the application needs to have some error checking builtin to handle broken streams.
However, on a side note, some programming languages and tools that use virtual machines explicitly support this functionality, such as the Self programming language.
This is sort of the ultimate goal of clustered operating system. Mathew Dillon puts a lot of effort to implement something like this in his Dragonfly BSD project.
adding another workaround: you can use virtualbox. run your applications in a regular virtual machine and simply "save the machine state" whenever you want.
I know this is not an answer, but I thought it could be useful when there are no real options.
if for any reason you don't like virtualbox, vmware and Qemu are as good.
Ctrl-Z increases the chances the process's pages will be swapped, but it doesn't free the process's resources completely. The problem with freeing a process's resources completely is that things like file handles, sockets are kernel resources the process gets to use, but doesn't know how to persist on its own. So Ctrl-Z is as good as it gets.
There was some research on checkpoint/restore for Linux back in 2.2 and 2.4 days, but it never made it past prototype. It is possible (with the caveats described in the other answers) for certain values of possible - I you can write a kernel module to do it, it is possible. But for the common value of possible (can I do it from the shell on a commercial Linux distribution), it is not yet possible.
There's ctrl+z in linux, but i'm not sure it offers the features you specified. I suspect you asked this question since it doesn't