Difference between offset and RVA - exe

What is the difference between a relative virtual address and an offset from the base of a file?

The RVA is the relative virtual address, that is, the distance from the image base address. The preferred base address is stated in the PE header, and is the (preferred) virtual address of the start of the image once the executable is loaded into memory.
And the file offset is the number of bytes you have to read from the beginning of the PE file to arrive somewhere in the file. So, if you have a section, you will find both things in the section header: the RVA of the section and its offset in the file; you will also find two sizes, one for how much virtual memory the section will get once loaded and one that merely indicates the size of the section data in the PE file.
Many references inside a PE are given as RVAs. In such cases, you need to check in all the section headers (or have some sort of map) to get the offset in the PE file of the reference.
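The section-header lookup described above can be sketched in a few lines. This is only an illustration: the Section tuple and the sample values are hypothetical, and its fields mirror the VirtualAddress, VirtualSize, PointerToRawData and SizeOfRawData fields of a PE section header. It assumes the headers have already been parsed.

```python
from typing import NamedTuple, Optional, Sequence


class Section(NamedTuple):
    """Hypothetical, simplified view of one PE section header."""
    name: str
    virtual_address: int  # RVA of the section (VirtualAddress)
    virtual_size: int     # size once loaded (VirtualSize)
    raw_offset: int       # file offset of the data (PointerToRawData)
    raw_size: int         # size of the data in the file (SizeOfRawData)


def rva_to_offset(rva: int, sections: Sequence[Section]) -> Optional[int]:
    """Map an RVA to a file offset, or None if no file bytes back it."""
    for s in sections:
        delta = rva - s.virtual_address
        # Only RVAs that fall inside the raw data exist in the file;
        # the rest of the virtual range is zero-filled at load time.
        if 0 <= delta < s.raw_size:
            return s.raw_offset + delta
    return None


# Made-up section table, for illustration only.
SECTIONS = [
    Section(".text", 0x1000, 0x1234, 0x400, 0x1400),
    Section(".data", 0x3000, 0x2000, 0x1800, 0x200),
]
```

With these made-up sections, rva_to_offset(0x1010, SECTIONS) gives 0x410, while an RVA such as 0x3300 lies in the part of .data that has no bytes in the file and yields None.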

Related

Calculate the entry point of an ELF file as a physical address (offset from 0)

I am building a RISC-V emulator which basically loads a whole ELF file into memory.
Up to now, I used the pre-compiled test binaries that the RISC-V Foundation provided, which conveniently had an entry point exactly at the start of the .text section.
For example:
> riscv32-unknown-elf-objdump ../riscv32i-emulator/tests/simple -d
../riscv32i-emulator/tests/simple: file format elf32-littleriscv
Disassembly of section .text.init:
80000000 <_start>:
80000000: 0480006f j 80000048 <reset_vector>
...
Going into this project I didn't know much about ELF files so I just assumed that every ELF's entry point is exactly the same as the start of the .text section.
The problem arose when I compiled my own binaries: I found out that the actual entry point is not always the same as the start of the .text section, but might be anywhere inside it, like here:
> riscv32-unknown-elf-objdump a.out -d
a.out: file format elf32-littleriscv
Disassembly of section .text:
00010074 <register_fini>:
10074: 00000793 li a5,0
10078: 00078863 beqz a5,10088 <register_fini+0x14>
1007c: 00010537 lui a0,0x10
10080: 43850513 addi a0,a0,1080 # 10438 <__libc_fini_array>
10084: 3a00006f j 10424 <atexit>
10088: 00008067 ret
0001008c <_start>:
1008c: 00002197 auipc gp,0x2
10090: cec18193 addi gp,gp,-788 # 11d78 <__global_pointer$>
...
So, after reading more about ELF files, I found out that the actual entry point address is provided by the entry point field in the ELF header:
> riscv32-unknown-elf-readelf a.out -h | grep Entry
Entry point address: 0x1008c
The problem now becomes that this address is not the actual address on the file (offset from 0) but is a virtual address, so obviously if I set the program counter of my emulator to this address, the emulator would crash.
Reading a bit more, I heard people talk about calculations regarding offsets from program headers and whatnot, but no one had a concrete answer.
My question is: what is the actual "formula" of how exactly you get the entry point address of the _start procedure as an offset from byte 0?
Just to be clear my emulator doesn't support virtual memory and the binary is the only thing that is loaded into my emulator's memory, so I have no use for the abstraction of virtual memory. I just want every memory address as physical address on disk.
My question is: what is the actual "formula" of how exactly you get the entry point address of the _start procedure as an offset from byte 0?
First, forget about sections. Only segments matter at runtime.
Second, use readelf -Wl to look at segments. They tell you exactly which chunk of file ([.p_offset, .p_offset + .p_filesz)) goes into which in-memory region ([.p_vaddr, .p_vaddr + .p_memsz)).
The exact calculation of "at which offset in the file does _start reside" is:
Find Elf32_Phdr which "covers" the address contained in Elf32_Ehdr.e_entry.
Using that phdr, file offset of _start is: ehdr->e_entry - phdr->p_vaddr + phdr->p_offset.
Update:
So, am I always looking for the 1st program header?
No.
Also by "covers" you mean that the 1st phdr->p_vaddr is always equal to e_entry?
No.
You are looking for the program header (describing the relationship between in-memory and on-file data) which overlaps ehdr->e_entry in memory. That is, you are looking for the segment for which phdr->p_vaddr <= ehdr->e_entry && ehdr->e_entry < phdr->p_vaddr + phdr->p_memsz. This segment is often the first, but that is in no way guaranteed. See also this answer.
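The calculation above can be sketched as follows, assuming a little-endian ELF32 file (as produced for riscv32) that has been read fully into a bytes object. The function name entry_file_offset is made up for this example. The check uses p_filesz rather than p_memsz, because only the part of a segment that is stored in the file has a file offset.

```python
import struct
from typing import Optional

PT_LOAD = 1  # program header type of a loadable segment


def entry_file_offset(image: bytes) -> Optional[int]:
    """Return the file offset of e_entry, or None if no segment covers it."""
    # e_ident: magic, then EI_CLASS == 1 (32-bit), EI_DATA == 1 (little-endian)
    if image[:4] != b"\x7fELF" or image[4] != 1 or image[5] != 1:
        raise ValueError("expected a little-endian ELF32 file")
    # Elf32_Ehdr fields that follow the 16-byte e_ident, up to e_phnum.
    (_type, _machine, _version, e_entry, e_phoff, _shoff, _flags,
     _ehsize, e_phentsize, e_phnum) = struct.unpack_from("<HHIIIIIHHH", image, 16)
    for i in range(e_phnum):
        # Elf32_Phdr: eight 32-bit fields.
        (p_type, p_offset, p_vaddr, _paddr, p_filesz, _memsz,
         _pflags, _align) = struct.unpack_from("<8I", image, e_phoff + i * e_phentsize)
        if p_type != PT_LOAD:
            continue
        delta = e_entry - p_vaddr
        if 0 <= delta < p_filesz:
            # e_entry - p_vaddr + p_offset
            return p_offset + delta
    return None
```

As a purely hypothetical illustration: if the covering segment had p_vaddr 0x10000 and p_offset 0, an entry point of 0x1008c would land at file offset 0x8c.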

ELF program header virtual address and file offset

I know the relationship between the two:
virtual address mod page alignment == file offset mod page alignment
But can someone tell me in which direction are these two numbers computed?
Is virtual address computed from file offset according to the relationship above, or vice versa?
Update
Here is some more detail: when the linker writes the ELF file header, it sets the virtual address and file offset of the program headers (segments).
For example, here is the output of readelf -l someELFfile:
Elf file type is EXEC (Executable file)
Entry point 0x8048094
Program Headers:
Type      Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
LOAD      0x000000 0x08048000 0x08048000 0x00154 0x00154 R E 0x1000
LOAD      0x000154 0x08049154 0x08049154 0x00004 0x00004 RW  0x1000
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x10
We can see 2 LOAD segments.
The virtual address of the first LOAD ends at 0x8048154, while the second LOAD starts at 0x8049154.
In the ELF file, the second LOAD is right behind the first LOAD, at file offset 0x00154; however, when this ELF is loaded into memory, it starts 0x1000 bytes after the end of the first LOAD segment.
But why? If we have to consider memory page alignment, why doesn't the second LOAD segment start at 0x08049000? Why does it start 0x1000 bytes AFTER THE END of the first LOAD segment?
I know the virtual address of the second LOAD satisfies the relationship:
virtual address mod page alignment == file offset mod page alignment
But I don't know why this relationship must be satisfied.
Why does it start at 0x1000 bytes AFTER THE END of the first LOAD segment?
If it didn't, it would have to start at 0x08048154, but it can't: the two LOAD segments have different flags specified for their mapping (the first is mapped with PROT_READ|PROT_EXEC, the second with PROT_READ|PROT_WRITE). Protections (being part of the page table) can only apply to whole pages, not parts of a page. Therefore, mappings with different protections must belong to different pages.
virtual address mod page alignment == file offset mod page alignment
But I don't know why this relationship must be satisfied.
The LOAD segments are directly mmaped from file. The actual mapping of the second LOAD segment performed for your example will look something like this (you can run your program under strace and see that it does):
mmap(0x08049000, 0x158, PROT_READ|PROT_WRITE, MAP_PRIVATE, $fd, 0)
If you try to make the virtual address or the offset non-page-aligned, mmap will fail with EINVAL. The only way to make file data appear in virtual memory at the desired address is to make VirtAddr congruent to Offset modulo Align, and that is exactly what the static linker does.
Note that for such a small first LOAD segment, the entire first segment also appears at the beginning of the second mapping (with the wrong protections). But the program is not supposed to access anything in the [0x08049000,0x08049154) range. In general, it is almost always the case that there is some "junk" before the start of actual data in the second LOAD segment (unless you get really lucky and the first LOAD segment ends on a page boundary).
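To make the arithmetic concrete, here is a small sketch of how page-aligned mmap arguments can be derived from a LOAD segment. The function name mmap_args is made up for this example, and the rounding shown is an illustration of the congruence rule, not a copy of what any particular loader does.

```python
def mmap_args(p_offset: int, p_vaddr: int, p_filesz: int, page_size: int = 0x1000):
    """Return (addr, length, file_offset) for a page-aligned file mapping."""
    page_off = p_vaddr & (page_size - 1)
    # The mapping only works if VirtAddr and Offset agree modulo the page size.
    if page_off != (p_offset & (page_size - 1)):
        raise ValueError("p_vaddr and p_offset are not congruent modulo page size")
    # Round both the address and the file offset down to a page boundary and
    # grow the length by the same amount, so the data lands at p_vaddr.
    return (p_vaddr - page_off, p_filesz + page_off, p_offset - page_off)


# Second LOAD segment from the readelf output above.
addr, length, file_offset = mmap_args(0x154, 0x08049154, 0x4)
print(hex(addr), hex(length), hex(file_offset))
```

For the second LOAD segment this yields address 0x8049000, length 0x158 and file offset 0, which are the values in the mmap call shown above. The extra 0x154 bytes at the start of the mapping are the "junk" copy of the first segment.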
See also mmap man page.
virtual address mod page alignment == file offset mod page alignment
But can someone tell me in which direction are these two numbers computed?
I believe the virtual address is deliberately set up this way to follow the file offset. Files themselves should be compact to save disk space, so all segments are stored right next to each other, with their boundaries recorded in the ELF header.
virtual address mod page alignment == file offset mod page alignment
But I don't know why this relationship must be satisfied.
It need not be; the second segment here could be mapped to 0x08049000 without any problem. As long as segments with different flags are mapped to different virtual pages, everything is fine. But when loading the resulting ELF executable, the OS would have to allocate yet another physical page (usually 4 KB) for the mapping and copy the 4 bytes at file offset 0x154 to the start of that page, which is kind of wasteful.
If the relationship is satisfied, however, the OS can allocate just one physical page, copy the whole 0x158 (0x154 + 0x4) bytes of the file to that page, and map the physical page to both 0x08048000 and 0x08049000 with different flags. This saves physical memory and makes virtual memory techniques like demand paging easier to apply.

Getting physical address from /proc/[pid]/pagemap

I am told that I can find the physical address corresponding to a virtual address using /proc/[pid]/pagemap.
I read that this pagemap file is an array of 64-bit entries, with bits 0-54 corresponding to the page frame number. I don't know how to make the leap from this to translating virtual to physical. Partially, I don't know how to find the entry I want in this file; nobody seems to specify how they are indexed.
Also, I don't know if the PFN is virtual or physical. And I don't know what to do with the PFN, regardless. How can I proceed?
Thanks
Divide the VA by the page size (normally 4096) and use the result as an index into /proc/self/pagemap; since each entry is 64 bits, the byte offset to seek to is that index times 8. Then take the PFN from bits 0-54 of the entry you read, multiply it by the page size (4096), and add VA % 4096.
Larry
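Here is a rough sketch of that lookup. The helper names are made up for this example. It assumes the entry layout mentioned in the question (PFN in bits 0-54) plus bit 63 as the "page present" flag. Note that on recent kernels the PFN field reads as zero unless the process is sufficiently privileged, so expect to run this as root.

```python
import os
import struct
from typing import Optional

ENTRY_SIZE = 8               # each pagemap entry is one 64-bit word
PFN_MASK = (1 << 55) - 1     # bits 0-54 hold the page frame number
PRESENT_BIT = 1 << 63        # bit 63: page is present in RAM


def pagemap_offset(vaddr: int, page_size: int = 4096) -> int:
    """Byte offset in the pagemap file of the entry covering vaddr."""
    return (vaddr // page_size) * ENTRY_SIZE


def decode_entry(entry: int, vaddr: int, page_size: int = 4096) -> Optional[int]:
    """Turn a raw pagemap entry into a physical address, or None if not present."""
    if not entry & PRESENT_BIT:
        return None
    pfn = entry & PFN_MASK
    return pfn * page_size + vaddr % page_size


def virt_to_phys(vaddr: int, pid: str = "self") -> Optional[int]:
    """Look up vaddr in /proc/<pid>/pagemap (Linux only)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(pagemap_offset(vaddr, page_size))
        data = f.read(ENTRY_SIZE)
    if len(data) != ENTRY_SIZE:
        return None
    (entry,) = struct.unpack("=Q", data)  # native byte order
    return decode_entry(entry, vaddr, page_size)
```

So the file is indexed by virtual page number, not by virtual address: seeking to the raw VA lands on an unrelated entry.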

Using /proc/[pid]/pagemap

I am aware that there is a little information regarding the pagemap file here. But nobody seems to indicate how to reference entries in the file. Is it offset by virtual address? Can I take a virtual address VA and simply lseek to offset VA? Or is it by page? If so, how do I retrieve the page number, as maps simply lists them in order. I am trying to translate between virtual and physical addresses, and lseek'ing with the virtual address as the offset always returns the same number, no matter where I seek to.
Thanks
@leeduhem: Yes I have. Here's the relevant part:
3. Open /proc/pid/pagemap and seek to the pages you would like to examine.
4. Read a u64 for each page from pagemap.
That doesn't help me. It wants me to seek to the page, but how do I know where the entry for the page is?
There is a tool that will help you to get information you need from the pagemap file.
http://fivelinesofcode.blogspot.com/2014/03/how-to-translate-virtual-to-physical.html
You divide the virtual address by the page size (normally 0x1000, or 4096) and use that to index into /proc/self/pagemap. The result of the division is the virtual page number; the 64-bit entry stored at that index holds the PFN, or page frame number.
Larry

How does one determine the page frame number for device memory?

From LDD3/ Ch. 15/ sections "Using remap_pfn_range" and "A Simple Implementation", pfn has been equated to the vm_pgoff field. I am confused by this. How can that be so?
Note that vm_pgoff is described as:
The offset of the area in the file, in pages. When a file or device is
mapped, this is the file position of the first page mapped in this
area.
Thus, if the first page mapped corresponds to the first page of the file as well (which, I think, would be quite common), vm_pgoff would be 0. Correct? If so, this doesn't seem to be the correct value for the pfn parameter of remap_pfn_range(). What am I missing here? What is the correct value? For ease of reference, I am reproducing the relevant code from LDD3 below (page 426):
static int simple_remap_mmap(struct file *filp, struct vm_area_struct *vma)
{
    /* vm_pgoff (the mmap offset in pages) is passed as the PFN */
    if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                        vma->vm_end - vma->vm_start,
                        vma->vm_page_prot))
        return -EAGAIN;
    ...
}
The specific example you've provided implements a character device file that allows one to map physical memory, very similar to /dev/mem. By specifying the offset into the file, you specify the physical memory address. The offset divided by the page size is what ends up in vm_pgoff, and for this device that value is the PFN.
For a "real" device driver, you would normally have the physical address of the device's memory-mapped registers or RAM hard-coded from the device specification, and use that to derive the PFN (by dividing by the page size).
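The arithmetic behind both cases is the same shift. This sketch assumes 4 KiB pages (a page shift of 12); the helper names and the physical address 0xfe000000 are hypothetical values chosen for illustration.

```python
PAGE_SHIFT = 12               # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def pfn_from_phys(phys_addr: int) -> int:
    """Page frame number of a physical address."""
    return phys_addr >> PAGE_SHIFT


def pgoff_from_mmap_offset(offset: int) -> int:
    """Value that ends up in vm_pgoff for a given mmap() file offset."""
    if offset % PAGE_SIZE:
        raise ValueError("mmap offsets must be page-aligned")
    return offset >> PAGE_SHIFT


# Hypothetical device memory at physical address 0xfe000000.
DEVICE_PHYS = 0xFE000000
print(hex(pfn_from_phys(DEVICE_PHYS)))           # PFN a "real" driver would pass
print(hex(pgoff_from_mmap_offset(DEVICE_PHYS)))  # vm_pgoff in the /dev/mem-like case
```

Both lines print the same value, which is why the LDD3 example can pass vm_pgoff straight through: for that device, the file offset is defined to be the physical address.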

Resources