Configure kern.log to give more info about a segfault - linux

Currently I can find in kern.log entries like this:
[6516247.445846] ex3.x[30901]: segfault at 0 ip 0000000000400564 sp 00007fff96ecb170 error 6 in ex3.x[400000+1000]
[6516254.095173] ex3.x[30907]: segfault at 0 ip 0000000000400564 sp 00007fff0001dcf0 error 6 in ex3.x[400000+1000]
[6516662.523395] ex3.x[31524]: segfault at 7fff80000000 ip 00007f2e11e4aa79 sp 00007fff807061a0 error 4 in libc-2.13.so[7f2e11dcf000+180000]
(You see, apps causing segfault are named ex3.x, means exercise 3 executable).
Is there a way to ask kern.log to log the complete path? Something like:
[6...] /home/user/cclass/ex3.x[3...]: segfault at 0 ip 0564 sp 07f70 error 6 in ex3.x[4...]
So I can easily figure out from who (user/student) this ex3.x is?
Thanks!
Beco

That log message comes from the kernel with a fixed format that only includes the first 16 letters of the executable excluding the path as per show_signal_msg, see other relevant lines for segmentation fault on non x86 architectures.
As mentioned by Makyen, without significant changes to the kernel and a recompile, the message given to klogd which is passed to syslog won't have the information you are requesting.
I am not aware of any log transformation or injection functionality in syslog or klogd which would allow you to take the name of the file and run either locate or file on the filesystem in order to find the full path.
The best way to get the information you are looking for is to use crash interception software like apport or abrt or corekeeper. These tools store the process metadata from the /proc filesystem including the process's commandline which would include the directory run from, assuming the binary was run with a full path, and wasn't already in path.
The other more generic way would be to enable core dumps, and then to set /proc/sys/kernel/core_pattern to include %E, in order to have the core file name including the path of the binary.

The short answer is: No, it is not possible without making code changes and recompiling the kernel. The normal solution to this problem is to instruct your students to name their executable <student user name>_ex3.x so that you can easily have this information.
However, it is possible to get the information you desire from other methods. Appleman1234 has provided some alternatives in his answer to this question.
How do we know the answer is "Not possible to the the full path in the kern.log segfault messages without recompiling the kernel":
We look in the kernel source code to find out how the message is produced and if there are any configuration options.
The files in question are part of the kernel source. You can download the entire kernel source as an rpm package (or other type of package) for whatever version of linux/debian you are running from a variety of places.
Specifically, the output that you are seeing is produced from whichever of the following files is for your architecture:
linux/arch/sparc/mm/fault_32.c
linux/arch/sparc/mm/fault_64.c
linux/arch/um/kernel/trap.c
linux/arch/x86/mm/fault.c
An example of the relevant function from one of the files(linux/arch/x86/mm/fault.c):
/*
* Print out info about fatal segfaults, if the show_unhandled_signals
* sysctl is set:
*/
static inline void
show_signal_msg(struct pt_regs *regs, unsigned long error_code,
unsigned long address, struct task_struct *tsk)
{
if (!unhandled_signal(tsk, SIGSEGV))
return;
if (!printk_ratelimit())
return;
printk("%s%s[%d]: segfault at %lx ip %p sp %p error %lx",
task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
tsk->comm, task_pid_nr(tsk), address,
(void *)regs->ip, (void *)regs->sp, error_code);
print_vma_addr(KERN_CONT " in ", regs->ip);
printk(KERN_CONT "\n");
}
From that we see that the variable passed to printout the process identifier is tsk->comm where struct task_struct *tsk and regs->ip where struct pt_regs *regs
Then from linux/include/linux/sched.h
struct task_struct {
...
char comm[TASK_COMM_LEN]; /* executable name excluding path
- access with [gs]et_task_comm (which lock
it with task_lock())
- initialized normally by setup_new_exec */
The comment makes it clear that the path for the executable is not stored in the structure.
For regs->ip where struct pt_regs *regs, it is defined in whichever of the following are appropriate for your architecture:
arch/arc/include/asm/ptrace.h
arch/arm/include/asm/ptrace.h
arch/arm64/include/asm/ptrace.h
arch/cris/include/arch-v10/arch/ptrace.h
arch/cris/include/arch-v32/arch/ptrace.h
arch/metag/include/asm/ptrace.h
arch/mips/include/asm/ptrace.h
arch/openrisc/include/asm/ptrace.h
arch/um/include/asm/ptrace-generic.h
arch/x86/include/asm/ptrace.h
arch/xtensa/include/asm/ptrace.h
From there we see that struct pt_regs is defining registers for the architecture. ip is just: unsigned long ip;
Thus, we have to look at what print_vma_addr() does. It is defined in mm/memory.c
/*
* Print the name of a VMA.
*/
void print_vma_addr(char *prefix, unsigned long ip)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
/*
* Do not print if we are in atomic
* contexts (in exception stacks, etc.):
*/
if (preempt_count())
return;
down_read(&mm->mmap_sem);
vma = find_vma(mm, ip);
if (vma && vma->vm_file) {
struct file *f = vma->vm_file;
char *buf = (char *)__get_free_page(GFP_KERNEL);
if (buf) {
char *p;
p = d_path(&f->f_path, buf, PAGE_SIZE);
if (IS_ERR(p))
p = "?";
printk("%s%s[%lx+%lx]", prefix, kbasename(p),
vma->vm_start,
vma->vm_end - vma->vm_start);
free_page((unsigned long)buf);
}
}
up_read(&mm->mmap_sem);
}
Which shows us that a path was available. We would need to check that it was the path, but looking a bit further in the code gives a hint that it might not matter. We need to see what kbasename() did with the path that is passed to it. kbasename() is defined in include/linux/string.h as:
/**
* kbasename - return the last part of a pathname.
*
* #path: path to extract the filename from.
*/
static inline const char *kbasename(const char *path)
{
const char *tail = strrchr(path, '/');
return tail ? tail + 1 : path;
}
Which, even if the full path is available prior to it, chops off everything except for the last part of a pathname, leaving the filename.
Thus, no amount of runtime configuration options will permit printing out the full pathname of the file in the segment fault messages you are seeing.
NOTE: I've changed all of the links to kernel source to be to archives, rather than the original locations. Those links will get close to the code as it was at the time I wrote this, 2104-09. As should be no surprise, the code does evolve over time, so the code which is current when you're reading this may or may not be similar or perform in the way which is described here.

Related

How to extend MIPS bus error handlers?

I'm writing a Linux driver in MIPS architecture.
there I implement read operation - read some registers content.
Here is the code:
static int proc_seq_show(struct seq_file *seq, void *v)
{
volatile unsigned reg;
//here read registers and pass to user using seq_printf
//1. read reg number 1
//2. read reg number 2
}
int proc_open(struct inode *inode, struct file *filp)
{
return single_open(filp,&proc_seq_show, NULL);
}
static struct file_operation proc_ops = {.read = &seq_read, .open=&seq_open};
My problem is that reading register content sometimes causing kernel oops - bus error, and read operation is prevented. I can't avoid it in advance.
Since this behavior is acceptable I would like to ignore this error and continue to read the other registers.
I saw bus error handler in the kernel (do_be in traps.c), there is an option to add my own entry to the __dbe_table. An entry looks like that:
struct exception_table_entry {unsigned long insn, nextinsn; };
insn is the instruction that cause the error. nextinsn is the next instruction to be performed after exception.
In the driver I declare an entry:
struct exception_table_entry e __attribute__ ((section("__dbe_table"))) = {0,0};
but I don't know how to initialize it. How can I get the instruction address of the risky line in C? how can I get the address of the fix line? I have tried something with labels and addresses of label - but didn't manage to set correctly the exception_table_entry .
The same infrastructure is available in x86, does someone know how they use it?
Any help will be appreciated.

How to access the /proc file system's iiterate function pointer

I am trying to create a simple linux rootkit that can be used to hide processess. The method I have chosen to try is to replace the pointer to "/proc" 's iterate function with a pointer to a custom one that will hide the processess I need. To do this I have to first save a pointer to it's originale iterate function so it can be replaced later. The '/proc' file system's iterate function can be accessed by accessing the 'iterate' function pointer which is a member of it's 'file_operations' structure.
I have tried the following two methods to access it, as can be seen in the code segment as label_1 and label_2, each tested with the other commented out.
static struct file *proc_filp;
static struct proc_dir_entry *test_proc;
static struct proc_dir_entry *proc_root;
static struct file_operations *fs_ops;
int (*proc_iterate) (struct file *, struct dir_context *);
static int __init module_init(void)
{
label_1:
proc_filp = filp_open("/proc", O_RDONLY | O_DIRECTORY, 0);
fs_ops = (struct file_operations *) proc_filp->f_op;
printk(KERN_INFO "file_operations is %p", fs_ops);
proc_iterate = fs_ops->iterate;
printk(KERN_INFO "proc_iterate is %p", proc_iterate);
filp_close(proc_filp, NULL);
label_2:
test_proc = proc_create("test_proc", 0, NULL, &proc_fops);
proc_root = test_proc->parent;
printk(KERN_INFO "proc_root is %p", proc_root);
fs_ops = (struct file_operations *) proc_root->proc_fops;
printk(KERN_INFO "file_operations is %p", fs_ops);
proc_iterate = fs_ops->iterate;
printk(KERN_INFO "proc_iterate is %p", proc_iterate);
remove_proc_entry("test_proc", NULL);
return 0;
}
As per method 1, I open '/proc' as a file, then follow the file pointer to access it's 'struct file_operations' (f_op), and then follow this pointer to try and locate 'iterate'. However, although I am able to successfully access the 'f_op' structure, somehow it's 'iterate' seems to point to NULL. The following dmesg log shows the output of this.
[ 47.707558] file_operations is 00000000b8d10f59
[ 47.707564] proc_iterate is (null)
As per method 2, I create a new proc directory entry, then try to access it's parent directory (which should point to '/proc' itself), and then try to access it's 'struct file_operations' (proc_fops), and then try to follow on to 'iterate'. However, using this method I am not even able to access the '/proc' directory, as 'proc_root = test_proc->parent;' seems to return NULL. This causes a 'kernel NULL pointer dereference' error in the code that follows. The following dmesg log shows the output of this.
[ 212.078552] proc_root is (null)
[ 212.078567] BUG: unable to handle kernel NULL pointer dereference at 000000000000003
Now, I know that things have changed in the linux kernel, and we are not allowed to write at various addresses (like change the iterate pointer to point at a custom function for example) and that this can be overcome by making those pages writable in the kernel, but that will come later in my attempt to create this rootkit. At present I can't even figure out how to even read the original 'iterate' pointer.
So, I have these questions :
[1] Is something wrong in the following code ? How to fix it ?
[2] Is there any other way to access the pointer to the /proc 's iterate function ?
Tested on Arch Linux running linux 4.15.7

How to use proc_pid_cmdline in kernel module

I am writing a kernel module to get the list of pids with their complete process name. The proc_pid_cmdline() gives the complete process name;using same function /proc/*/cmdline gets the complete process name. (struct task_struct) -> comm gives hint of what process it is, but not the complete path.
I have included the function name, but it gives error because it does not know where to find the function.
How to use proc_pid_cmdline() in a module ?
You are not supposed to call proc_pid_cmdline().
It is a non-public function in fs/proc/base.c:
static int proc_pid_cmdline(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
However, what it does is simple:
get_cmdline(task, m->buf, PAGE_SIZE);
That is not likely to return the full path though and it will not be possible to determine the full path in every case. The arg[0] value may be overwritten, the file could be deleted or moved, etc. A process may exec() in a way which obscures the original command line, and all kinds of other maladies.
A scan of my Fedora 20 system /proc/*/cmdline turns up all kinds of less-than-useful results:
-F
BUG:
WARNING: at
WARNING: CPU:
INFO: possible recursive locking detecte
ernel BUG at
list_del corruption
list_add corruption
do_IRQ: stack overflow:
ear stack overflow (cur:
eneral protection fault
nable to handle kernel
ouble fault:
RTNL: assertion failed
eek! page_mapcount(page) went negative!
adness at
NETDEV WATCHDOG
ysctl table check failed
: nobody cared
IRQ handler type mismatch
Machine Check Exception:
Machine check events logged
divide error:
bounds:
coprocessor segment overrun:
invalid TSS:
segment not present:
invalid opcode:
alignment check:
stack segment:
fpu exception:
simd exception:
iret exception:
/var/log/messages
--
/usr/bin/abrt-dump-oops
-xtD
I have managed to solve a version of this problem. I wanted to access the cmdline of all PIDs but within the kernel itself (as opposed to a kernel module as the question states), but perhaps these principles can be applied to kernel modules as well?
What I did was, I added the following function to fs/proc/base.c
int proc_get_cmdline(struct task_struct *task, char * buffer) {
int i;
int ret = proc_pid_cmdline(task, buffer);
for(i = 0; i < ret - 1; i++) {
if(buffer[i] == '\0')
buffer[i] = ' ';
}
return 0;
}
I then added the declaration in include/linux/proc_fs.h
int proc_get_cmdline(struct task_struct *, char *);
At this point, I could access the cmdline of all processes within the kernel.
To access the task_struct, perhaps you could refer to kernel: efficient way to find task_struct by pid?.
Once you have the task_struct, you should be able to do something like:
char cmdline[256];
proc_get_cmdline(task, cmdline);
if(strlen(cmdline) > 0)
printk(" cmdline :%s\n", cmdline);
else
printk(" cmdline :%s\n", task->comm);
I was able to obtain the commandline of all processes this way.
To get the full path of the binary behind a process.
char * exepathp;
struct file * exe_file;
struct mm_struct *mm;
char exe_path [1000];
//straight up stolen from get_mm_exe_file
mm = get_task_mm(current);
down_read(&mm->mmap_sem); //lock read
exe_file = mm->exe_file;
if (exe_file) get_file(exe_file);
up_read(&mm->mmap_sem); //unlock read
//reduce exe path to a string
exepathp = d_path( &(exe_file->f_path), exe_path, 1000*sizeof(char) );
Where current is the task struct for the process you are interested in. The variable exepathp gets the string of the full path. This is slightly different than the process cmd, this is the path of binary which was loaded to start the process. Combining this path with the process cmd should give you the full path.

Where are ioctl parameters (such as 0x1268 / BLKSSZGET) actually specified?

I am looking for a definitive specification describing the expected arguments and behavior of ioctl 0x1268 (BLKSSZGET).
This number is declared in many places (none of which contain a definitive reference source), such as linux/fs.h, but I can find no specification for it.
Surely, somebody at some point in the past decided that 0x1268 would get the physical sector size of a device and documented that somewhere. Where does this information come from and where can I find it?
Edit: I am not asking what BLKSSZGET does in general, nor am I asking what header it is defined in. I am looking for a definitive, standardized source that states what argument types it should take and what its behavior should be for any driver that implements it.
Specifically, I am asking because there appears to be a bug in blkdiscard in util-linux 2.23 (and 2.24) where the sector size is queried in to a uint64_t, but the high 32-bits are untouched since BLKSSZGET appears to expect a 32-bit integer, and this leads to an incorrect sector size, incorrect alignment calculations, and failures in blkdiscard when it should succeed. So before I submit a patch, I need to determine, with absolute certainty, if the problem is that blkdiscard should be using a 32-bit integer, or if the driver implementation in my kernel should be using a 64-bit integer.
Edit 2: Since we're on the topic, the proposed patch presuming blkdiscard is incorrect is:
--- sys-utils/blkdiscard.c-2.23 2013-11-01 18:28:19.270004947 -0400
+++ sys-utils/blkdiscard.c 2013-11-01 18:29:07.334002382 -0400
## -71,7 +71,8 ##
{
char *path;
int c, fd, verbose = 0, secure = 0;
- uint64_t end, blksize, secsize, range[2];
+ uint64_t end, blksize, range[2];
+ uint32_t secsize;
struct stat sb;
static const struct option longopts[] = {
## -146,8 +147,8 ##
err(EXIT_FAILURE, _("%s: BLKSSZGET ioctl failed"), path);
/* align range to the sector size */
- range[0] = (range[0] + secsize - 1) & ~(secsize - 1);
- range[1] &= ~(secsize - 1);
+ range[0] = (range[0] + (uint64_t)secsize - 1) & ~((uint64_t)secsize - 1);
+ range[1] &= ~((uint64_t)secsize - 1);
/* is the range end behind the end of the device ?*/
end = range[0] + range[1];
Applied to e.g. https://www.kernel.org/pub/linux/utils/util-linux/v2.23/.
The answer to "where is this specified?" does seem to be the kernel source.
I asked the question on the kernel mailing list here: https://lkml.org/lkml/2013/11/1/620
In response, Theodore Ts'o wrote (note: he mistakenly identified sys-utils/blkdiscard.c in his list but it's inconsequential):
BLKSSZGET returns an int. If you look at the sources of util-linux
v2.23, you'll see it passes an int to BLKSSZGET in
sys-utils/blkdiscard.c
lib/blkdev.c
E2fsprogs also expects BLKSSZGET to return an int, and if you look at
the kernel sources, it very clearly returns an int.
The one place it doesn't is in sys-utils/blkdiscard.c, where as you
have noted, it is passing in a uint64 to BLKSSZGET. This looks like
it's a bug in sys-util/blkdiscard.c.
He then went on to submit a patch¹ to blkdiscard at util-linux:
--- a/sys-utils/blkdiscard.c
+++ b/sys-utils/blkdiscard.c
## -70,8 +70,8 ## static void __attribute__((__noreturn__)) usage(FILE *out)
int main(int argc, char **argv)
{
char *path;
- int c, fd, verbose = 0, secure = 0;
- uint64_t end, blksize, secsize, range[2];
+ int c, fd, verbose = 0, secure = 0, secsize;
+ uint64_t end, blksize, range[2];
struct stat sb;
static const struct option longopts[] = {
I had been hesitant to mention the blkdiscard tool in both my mailing list post and the original version of this SO question specifically for this reason: I know what's in my kernel's source, it's already easy enough to modify blkdiscard to agree with the source, and this ended up distracting from the real question of "where is this documented?".
So, as for the specifics, somebody more official than me has also stated that the BLKSSZGET ioctl takes an int, but the general question regarding documentation remained. I then followed up with https://lkml.org/lkml/2013/11/3/125 and received another reply from Theodore Ts'o (wiki for credibility) answering the question. He wrote:
> There was a bigger question hidden behind the context there that I'm
> still wondering about: Are these ioctl interfaces specified and
> documented somewhere? From what I've seen, and from your response, the
> implication is that the kernel source *is* the specification, and not
> document exists that the kernel is expected to comply with; is this
> the case?
The kernel source is the specification. Some of these ioctl are
documented as part of the linux man pages, for which the project home
page is here:
https://www.kernel.org/doc/man-pages/
However, these document existing practice; if there is a discrepancy
between what is in the kernel has implemented and the Linux man pages,
it is the Linux man pages which are buggy and which will be changed.
That is man pages are descriptive, not perscriptive.
I also asked about the use of "int" in general for public kernel APIs, his response is there although that is off-topic here.
Answer: So, there you have it, the final answer is: The ioctl interfaces are specified by the kernel source itself; there is no document that the kernel adheres to. There is documentation to describe the kernel's implementations of various ioctls, but if there is a mismatch, it is an error in the documentation, not in the kernel.
¹ With all the above in mind, I want to point out that an important difference in the patch Theodore Ts'o submitted, compared to mine, is the use of "int" rather than "uint32_t" -- BLKSSZGET, as per kernel source, does indeed expect an argument that is whatever size "int" is on the platform, not a forced 32-bit value.

Is there any API for determining the physical address from virtual address in Linux?

Is there any API for determining the physical address from virtual address in Linux operating system?
Kernel and user space work with virtual addresses (also called linear addresses) that are mapped to physical addresses by the memory management hardware. This mapping is defined by page tables, set up by the operating system.
DMA devices use bus addresses. On an i386 PC, bus addresses are the same as physical addresses, but other architectures may have special address mapping hardware to convert bus addresses to physical addresses.
In Linux, you can use these functions from asm/io.h:
virt_to_phys(virt_addr);
phys_to_virt(phys_addr);
virt_to_bus(virt_addr);
bus_to_virt(bus_addr);
All this is about accessing ordinary memory. There is also "shared memory" on the PCI or ISA bus. It can be mapped inside a 32-bit address space using ioremap(), and then used via the readb(), writeb() (etc.) functions.
Life is complicated by the fact that there are various caches around, so that different ways to access the same physical address need not give the same result.
Also, the real physical address behind virtual address can change. Even more than that - there could be no address associated with a virtual address until you access that memory.
As for the user-land API, there are none that I am aware of.
/proc/<pid>/pagemap userland minimal runnable example
virt_to_phys_user.c
#define _XOPEN_SOURCE 700
#include <fcntl.h> /* open */
#include <stdint.h> /* uint64_t */
#include <stdio.h> /* printf */
#include <stdlib.h> /* size_t */
#include <unistd.h> /* pread, sysconf */
typedef struct {
uint64_t pfn : 55;
unsigned int soft_dirty : 1;
unsigned int file_page : 1;
unsigned int swapped : 1;
unsigned int present : 1;
} PagemapEntry;
/* Parse the pagemap entry for the given virtual address.
*
* #param[out] entry the parsed entry
* #param[in] pagemap_fd file descriptor to an open /proc/pid/pagemap file
* #param[in] vaddr virtual address to get entry for
* #return 0 for success, 1 for failure
*/
int pagemap_get_entry(PagemapEntry *entry, int pagemap_fd, uintptr_t vaddr)
{
size_t nread;
ssize_t ret;
uint64_t data;
uintptr_t vpn;
vpn = vaddr / sysconf(_SC_PAGE_SIZE);
nread = 0;
while (nread < sizeof(data)) {
ret = pread(pagemap_fd, ((uint8_t*)&data) + nread, sizeof(data) - nread,
vpn * sizeof(data) + nread);
nread += ret;
if (ret <= 0) {
return 1;
}
}
entry->pfn = data & (((uint64_t)1 << 55) - 1);
entry->soft_dirty = (data >> 55) & 1;
entry->file_page = (data >> 61) & 1;
entry->swapped = (data >> 62) & 1;
entry->present = (data >> 63) & 1;
return 0;
}
/* Convert the given virtual address to physical using /proc/PID/pagemap.
*
* #param[out] paddr physical address
* #param[in] pid process to convert for
* #param[in] vaddr virtual address to get entry for
* #return 0 for success, 1 for failure
*/
int virt_to_phys_user(uintptr_t *paddr, pid_t pid, uintptr_t vaddr)
{
char pagemap_file[BUFSIZ];
int pagemap_fd;
snprintf(pagemap_file, sizeof(pagemap_file), "/proc/%ju/pagemap", (uintmax_t)pid);
pagemap_fd = open(pagemap_file, O_RDONLY);
if (pagemap_fd < 0) {
return 1;
}
PagemapEntry entry;
if (pagemap_get_entry(&entry, pagemap_fd, vaddr)) {
return 1;
}
close(pagemap_fd);
*paddr = (entry.pfn * sysconf(_SC_PAGE_SIZE)) + (vaddr % sysconf(_SC_PAGE_SIZE));
return 0;
}
int main(int argc, char **argv)
{
pid_t pid;
uintptr_t vaddr, paddr = 0;
if (argc < 3) {
printf("Usage: %s pid vaddr\n", argv[0]);
return EXIT_FAILURE;
}
pid = strtoull(argv[1], NULL, 0);
vaddr = strtoull(argv[2], NULL, 0);
if (virt_to_phys_user(&paddr, pid, vaddr)) {
fprintf(stderr, "error: virt_to_phys_user\n");
return EXIT_FAILURE;
};
printf("0x%jx\n", (uintmax_t)paddr);
return EXIT_SUCCESS;
}
GitHub upstream.
Usage:
sudo ./virt_to_phys_user.out <pid> <virtual-address>
sudo is required to read /proc/<pid>/pagemap even if you have file permissions as explained at: https://unix.stackexchange.com/questions/345915/how-to-change-permission-of-proc-self-pagemap-file/383838#383838
As mentioned at: https://stackoverflow.com/a/46247716/895245 Linux allocates page tables lazily, so make sure that you read and write a byte to that address from the test program before using virt_to_phys_user.
How to test it out
Test program:
#define _XOPEN_SOURCE 700
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
enum { I0 = 0x12345678 };
static volatile uint32_t i = I0;
int main(void) {
printf("vaddr %p\n", (void *)&i);
printf("pid %ju\n", (uintmax_t)getpid());
while (i == I0) {
sleep(1);
}
printf("i %jx\n", (uintmax_t)i);
return EXIT_SUCCESS;
}
The test program outputs the address of a variable it owns, and its PID, e.g.:
vaddr 0x600800
pid 110
and then you can pass convert the virtual address with:
sudo ./virt_to_phys_user.out 110 0x600800
Finally, the conversion can be tested by using /dev/mem to observe / modify the memory, but you can't do this on Ubuntu 17.04 without recompiling the kernel as it requires: CONFIG_STRICT_DEVMEM=n, see also: How to access physical addresses from user space in Linux? Buildroot is an easy way to overcome that however.
Alternatively, you can use a Virtual machine like QEMU monitor's xp command: How to decode /proc/pid/pagemap entries in Linux?
See this to dump all pages: How to decode /proc/pid/pagemap entries in Linux?
Userland subset of this question: How to find the physical address of a variable from user-space in Linux?
Dump all process pages with /proc/<pid>/maps
/proc/<pid>/maps lists all the addresses ranges of the process, so we can walk that to translate all pages: /proc/[pid]/pagemaps and /proc/[pid]/maps | linux
Kerneland virt_to_phys() only works for kmalloc() addresses
From a kernel module, virt_to_phys(), has been mentioned.
However, it is import to highlight that it has this limitation.
E.g. it fails for module variables. arc/x86/include/asm/io.h documentation:
The returned physical address is the physical (CPU) mapping for
the memory address given. It is only valid to use this function on
addresses directly mapped or allocated via kmalloc().
Here is a kernel module that illustrates that together with an userland test.
So this is not a very general possibility. See: How to get the physical address from the logical one in a Linux kernel module? for kernel module methods exclusively.
As answered before, normal programs should not need to worry about physical addresses as they run in a virtual address space with all its conveniences. Furthermore, not every virtual address has a physical address, the may belong to mapped files or swapped pages. However, sometimes it may be interesting to see this mapping, even in userland.
For this purpose, the Linux kernel exposes its mapping to userland through a set of files in the /proc. The documentation can be found here. Short summary:
/proc/$pid/maps provides a list of mappings of virtual addresses together with additional information, such as the corresponding file for mapped files.
/proc/$pid/pagemap provides more information about each mapped page, including the physical address if it exists.
This website provides a C program that dumps the mappings of all running processes using this interface and an explanation of what it does.
The suggested C program above usually works, but it can return misleading results in (at least) two ways:
The page is not present (but the virtual addressed is mapped to a page!). This happens due to lazy mapping by the OS: it maps addresses only when they are actually accessed.
The returned PFN points to some possibly temporary physical page which could be changed soon after due to copy-on-write. For example: for memory mapped files, the PFN can point to the read-only copy. For anonymous mappings, the PFN of all pages in the mapping could be one specific read-only page full of 0s (from which all anonymous pages spawn when written to).
Bottom line is, to ensure a more reliable result: for read-only mappings, read from every page at least once before querying its PFN. For write-enabled pages, write into every page at least once before querying its PFN.
Of course, theoretically, even after obtaining a "stable" PFN, the mappings could always change arbitrarily at runtime (for example when moving pages into and out of swap) and should not be relied upon.
I wonder why there is no user-land API.
Because user land memory's physical address is unknown.
Linux uses demand paging for user land memory. Your user land object will not have physical memory until it is accessed. When the system is short of memory, your user land object may be swapped out and lose physical memory unless the page is locked for the process. When you access the object again, it is swapped in and given physical memory, but it is likely different physical memory from the previous one. You may take a snapshot of page mapping, but it is not guaranteed to be the same in the next moment.
So, looking for the physical address of a user land object is usually meaningless.

Resources