file offset vs virtual address in a shared library - shared-libraries

For a shared library file, how do I convert between the file offset and the virtual address of a symbol's definition?
The ELF specification says this about a symbol's st_value:
In executable and shared object files, st_value holds a virtual address. To make these files' symbols more useful for the dynamic linker, the section offset (file interpretation) gives way to a virtual address (memory interpretation) for which the section number is irrelevant.
But how can I get the corresponding offset in the file? Or, given an offset, how can I calculate the virtual address (file interpretation to memory interpretation)?
Imagine a scenario like this. During the execution of a process, suppose it is using a function implemented in a shared library, say libx.so, and that the library file is mapped into a region represented by vma.
// addr holds the value of PC
offset = (vma->vm_pgoff << PAGE_SHIFT) + addr - vma->vm_start;
As I understand it, offset now holds the offset of the instruction in the library file. Given this offset, I'd like to know the function name. One way is to calculate the virtual address corresponding to offset and compare it with the st_value fields in the symbol table. If the entries are sorted by st_value in ascending order, then st_value_1 <= virtual_address < st_value_2 means st_name_1 is what I'm looking for. So the problem lies in the conversion.
For reference, data structure of a symbol table entry is:
typedef struct {
    Elf32_Word    st_name;
    Elf32_Addr    st_value;
    Elf32_Word    st_size;
    unsigned char st_info;
    unsigned char st_other;
    Elf32_Half    st_shndx;
} Elf32_Sym;

The PT_LOAD entries of the program header table define how the loader/linker is expected to map parts of the ELF file into the virtual address space. Use them to convert between file offsets and (relative) virtual memory addresses:
~$ readelf -l /lib/i386-linux-gnu/libc-2.24.so
Elf file type is DYN (Shared object file)
Entry point 0x18400
There are 10 program headers, starting at offset 52
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000034 0x00000034 0x00000034 0x00140 0x00140 R E 0x4
INTERP 0x166374 0x00166374 0x00166374 0x00013 0x00013 R 0x4
[Requesting program interpreter: /lib/ld-linux.so.2]
LOAD 0x000000 0x00000000 0x00000000 0x1b01c8 0x1b01c8 R E 0x1000
LOAD 0x1b0260 0x001b1260 0x001b1260 0x02c74 0x0579c RW 0x1000
DYNAMIC 0x1b1db0 0x001b2db0 0x001b2db0 0x000f0 0x000f0 RW 0x4
NOTE 0x000174 0x00000174 0x00000174 0x00044 0x00044 R 0x4
TLS 0x1b0260 0x001b1260 0x001b1260 0x00008 0x00048 R 0x4
GNU_EH_FRAME 0x166388 0x00166388 0x00166388 0x061ec 0x061ec R 0x4
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x10
GNU_RELRO 0x1b0260 0x001b1260 0x001b1260 0x01da0 0x01da0 R 0x1
For example, consider this symbol:
Num: Value Size Type Bind Vis Ndx Name
188: 0005df80 35 FUNC GLOBAL DEFAULT 13 fopen@@GLIBC_2.1
Its (relative) virtual address is 0x0005df80. It belongs to the first PT_LOAD entry, which covers relative virtual addresses from 0x00000000 up to 0x00000000 + 0x1b01c8. Its offset within the segment is Value - VirtAddr = 0x0005df80 - 0x00000000 = 0x0005df80. Its offset within the file is thus Offset + (Value - VirtAddr) = 0x000000 + 0x0005df80 = 0x0005df80.
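The conversion described above can be sketched in C roughly as follows. This is only an illustration: struct load_seg and the two function names are hypothetical, and a real tool would read Elf32_Phdr or Elf64_Phdr entries from the file instead of using a hand-filled array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the PT_LOAD fields used here.
 * A real tool would read Elf32_Phdr / Elf64_Phdr entries (see <elf.h>). */
struct load_seg {
    uint64_t p_offset; /* "Offset" column   */
    uint64_t p_vaddr;  /* "VirtAddr" column */
    uint64_t p_filesz; /* "FileSiz" column  */
};

/* Translate a (relative) virtual address into a file offset.
 * Returns 0 on success, -1 if no segment's file-backed part contains vaddr. */
int vaddr_to_offset(const struct load_seg *segs, size_t n,
                    uint64_t vaddr, uint64_t *offset)
{
    assert(offset != NULL);
    for (size_t i = 0; i < n; i++) {
        if (vaddr >= segs[i].p_vaddr &&
            vaddr - segs[i].p_vaddr < segs[i].p_filesz) {
            *offset = segs[i].p_offset + (vaddr - segs[i].p_vaddr);
            return 0;
        }
    }
    return -1;
}

/* Translate a file offset into a (relative) virtual address.
 * Uses the first PT_LOAD entry whose file range contains the offset.
 * Returns 0 on success, -1 if the offset is not covered by any entry. */
int offset_to_vaddr(const struct load_seg *segs, size_t n,
                    uint64_t offset, uint64_t *vaddr)
{
    assert(vaddr != NULL);
    for (size_t i = 0; i < n; i++) {
        if (offset >= segs[i].p_offset &&
            offset - segs[i].p_offset < segs[i].p_filesz) {
            *vaddr = segs[i].p_vaddr + (offset - segs[i].p_offset);
            return 0;
        }
    }
    return -1;
}
```

Filled with the two LOAD rows from the readelf output above, this maps the address 0x5df80 of fopen to file offset 0x5df80, and the DYNAMIC address 0x1b2db0 to file offset 0x1b1db0, which matches the Offset column of the DYNAMIC row. Addresses that fall in the part of a segment beyond FileSiz (such as .bss) have no file offset, so the sketch reports them as not found.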

Related

Understanding ELF64 text/data segment layout/padding

I'm trying to brush up on UNIX viruses and one text I'm reading mentions that parasitic code can be inserted in the padding between the text and the data segment, supposedly up to 2MB in size on x86-64 systems. But when I compile a simple hello world program with gcc -no-pie...
#include <stdio.h>
int main()
{
printf("hello world\n");
}
...and inspect its segment headers with readelf -W -l I get:
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000040 0x0000000000400040 0x0000000000400040 0x0002d8 0x0002d8 R 0x8
INTERP 0x000318 0x0000000000400318 0x0000000000400318 0x00001c 0x00001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD 0x000000 0x0000000000400000 0x0000000000400000 0x000588 0x000588 R 0x1000
LOAD 0x001000 0x0000000000401000 0x0000000000401000 0x0001c5 0x0001c5 R E 0x1000
LOAD 0x002000 0x0000000000402000 0x0000000000402000 0x000138 0x000138 R 0x1000
LOAD 0x002e00 0x0000000000403e00 0x0000000000403e00 0x000230 0x000238 RW 0x1000
DYNAMIC 0x002e10 0x0000000000403e10 0x0000000000403e10 0x0001d0 0x0001d0 RW 0x8
...
I assume the segment starting at virtual address 0x401000 is the text segment and the one starting at 0x403e00 is the data segment. But what are the other two read-only LOAD segments? And how precisely does padding work here? There's no padding to 2MB boundaries to be seen, and even assuming padding to 4KB boundaries, why does the data segment not start at address 0x403000?
But what are the other two read-only LOAD segments?
See this answer.
There's no padding to 2MB boundaries
The BFD linker used to align segments on a 2MiB boundary because that's the maximum page size an x86_64 system can be configured with.
It no longer does this (not sure when the change was made).
The text you are reading is probably out of date.
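On the remaining question of why the data segment starts at 0x403e00 rather than 0x403000: the ELF specification requires p_vaddr and p_offset of a loadable segment to be congruent modulo the alignment, so that the file can be mapped page by page. The data segment sits at file offset 0x002e00, so its virtual address has to end in 0xe00 as well. A minimal sketch of that check (the function name is made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* Check the ELF rule that a loadable segment's file offset and virtual
 * address agree modulo its alignment. An alignment of 0 or 1 means
 * "no alignment required", so any pair is acceptable. */
bool congruent_mod_align(uint64_t p_offset, uint64_t p_vaddr, uint64_t p_align)
{
    if (p_align <= 1)
        return true;
    return (p_offset % p_align) == (p_vaddr % p_align);
}
```

With the values from the listing above, congruent_mod_align(0x2e00, 0x403e00, 0x1000) holds, whereas a data segment at 0x403000 with the same file offset would not satisfy the rule.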

why virtual address of LOAD program header and runtime virtual address shown by gdb is different?

I've been trying to understand the ELF file format. According to the ELF documentation, the VirtAddr of a LOAD header should be the virtual address of the loaded segment, but gdb's memory map shows the segments loaded at different virtual addresses.
$ readelf -l
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000560 0x0000000000000560 R 0x1000
LOAD 0x0000000000001000 0x0000000000001000 0x0000000000001000
0x00000000000001e5 0x00000000000001e5 R E 0x1000
LOAD 0x0000000000002000 0x0000000000002000 0x0000000000002000
0x0000000000000118 0x0000000000000118 R 0x1000
LOAD 0x0000000000002de8 0x0000000000003de8 0x0000000000003de8
0x0000000000000248 0x0000000000000250 RW 0x1000
gdb memmap
Entry point: 0x555555555040
0x00005555555542a8 - 0x00005555555542c4 is .interp
0x00005555555542c4 - 0x00005555555542e4 is .note.ABI-tag
0x00005555555542e4 - 0x0000555555554308 is .note.gnu.build-id
0x0000555555554308 - 0x0000555555554324 is .gnu.hash
0x0000555555554328 - 0x00005555555543d0 is .dynsym
0x00005555555543d0 - 0x0000555555554454 is .dynstr
0x0000555555554454 - 0x0000555555554462 is .gnu.version
0x0000555555554468 - 0x0000555555554488 is .gnu.version_r
0x0000555555554488 - 0x0000555555554548 is .rela.dyn
0x0000555555554548 - 0x0000555555554560 is .rela.plt
0x0000555555555000 - 0x000055555555501b is .init
0x0000555555555020 - 0x0000555555555040 is .plt
0x0000555555555040 - 0x00005555555551d5 is .text
0x00005555555551d8 - 0x00005555555551e5 is .fini
0x0000555555556000 - 0x000055555555600a is .rodata
0x000055555555600c - 0x0000555555556040 is .eh_frame_hdr
0x0000555555556040 - 0x0000555555556118 is .eh_frame
0x0000555555557de8 - 0x0000555555557df0 is .init_array
0x0000555555557df0 - 0x0000555555557df8 is .fini_array
0x0000555555557df8 - 0x0000555555557fd8 is .dynamic
0x0000555555557fd8 - 0x0000555555558000 is .got
0x0000555555558000 - 0x0000555555558020 is .got.plt
0x0000555555558020 - 0x0000555555558030 is .data
0x0000555555558030 - 0x0000555555558038 is .bss
VirtAddr of LOAD header should be the virtual address of the loaded segment.
This is only true for ELF images of type ET_EXEC.
But you have an ELF image of type ET_DYN (probably a position independent executable), and these are relocated at runtime to a different virtual address.
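The relocation amounts to adding one constant, often called the load bias, to every VirtAddr in the file. A minimal sketch of that arithmetic, with hypothetical function names, and assuming that .init sits at the very start of the R E segment in the listing above:

```c
#include <stdint.h>

/* Load bias: difference between where something ended up at runtime
 * and the virtual address recorded for it in the ELF file. */
uint64_t load_bias(uint64_t runtime_addr, uint64_t link_vaddr)
{
    return runtime_addr - link_vaddr;
}

/* Runtime address of anything whose file-recorded virtual address is known. */
uint64_t runtime_address(uint64_t bias, uint64_t link_vaddr)
{
    return bias + link_vaddr;
}
```

Under that assumption, .init at 0x0000555555555000 and the second LOAD entry's VirtAddr of 0x1000 give a bias of 0x0000555555554000. Adding that bias to the last LOAD entry's VirtAddr 0x3de8 yields 0x0000555555557de8, which is where gdb reports .init_array.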

How to modify the GNU linker to have separate 'RWE ' PT_LOAD segment

I have a program which, when built into a binary using the default options, gives me this:
>readelf -lW /tmp/sample
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x0000000000400000 0x0000000000400000 0x13e33f 0x13e33f R E 0x200000
LOAD 0x13e510 0x000000000073e510 0x000000000073e510 0x005160 0x007cc8 RW 0x200000
I want to have a separate LOAD segment with RWE permissions after the LOAD segment with RW (i.e. data segment) shown above. One approach to do this is to modify the custom GNU linker script to pick my new sections and put them in a separate segment after the bss segment. This will cause it to appear as third LOAD segment.
Adding this after bss end in linker script
.my_section = .;
.my_section : { *(.my_section)}
This is how it appears
>readelf -lW /tmp/sample
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x0000000000400000 0x0000000000400000 0x13e33f 0x13e33f R E 0x200000
LOAD 0x13e510 0x000000000073e510 0x000000000073e510 0x005160 0x007cc8 RW 0x200000
LOAD 0x13f000 0x000000000073f000 0x000000000073f000 0x0d6b00 0x0d6b00 RW 0x200000
How do I get executable permission on this segment as well? What changes do I need to make to the linker script?

nopage () method implementation

Does anyone know how a virtual address is translated to a physical address in the nopage method?
With reference to the Device Drivers book, the nopage method is given as:
struct page *simple_vma_nopage(struct vm_area_struct *vma,
                               unsigned long address, int *type)
{
    struct page *pageptr;
    unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
    unsigned long physaddr = address - vma->vm_start + offset;
    unsigned long pageframe = physaddr >> PAGE_SHIFT;
    if (!pfn_valid(pageframe))
        return NOPAGE_SIGBUS;
    pageptr = pfn_to_page(pageframe);
    get_page(pageptr);
    if (type)
        *type = VM_FAULT_MINOR;
    return pageptr;
}
PAGE_SHIFT is the number of bits used to represent the offset within a page, for both virtual and physical memory addresses.
But what is the offset variable?
How is a physical address calculated from arithmetic operations on virtual address variables like address and vm_start?
I feel the documentation of vm_pgoff is not very clear.
This is the offset of the first page of the memory region in RAM.
So if our RAM begins at 0x00000000, and our memory region begins
at 0x0000A000, then vm_pgoff = 10. If you consider/ revisit the mmap
system call then you can see that the "offset" which we pass is the offset
of the starting byte in the file from which "length" bytes will be mapped
onto the memory region. This offset can be converted to an address by left
shifting it by the PAGE_SHIFT value, which is 12 (i.e. a 4KB page size).
Now, irrespective of whether the cr3 register is used in linear address to
physical address translation or not, the expression "address - vm_start"
gives the size of the portion between the two addresses.
example:
vm_start = 0xc0080000
address = 0xc0090000
address - vm_start = 0x00010000
physaddr = (address - vma->vm_start) + offset;
= 0x00010000 + (10 << PAGE_SHIFT)
= offset_to_page_that_fault + start_addr_of_memoryRegion_in_RAM
= 0x00010000 + 0x0000A000
= 0x0001A000
Now since this is the physical address, we need to convert it to a page frame
number by right shifting by the PAGE_SHIFT value, i.e. 0x0001A000 >> 12 = 0x1A = 26 (decimal).
Therefore page frame 26 must be loaded with the data from the file which is being mapped.
Hence data is retrieved from the disk by using the inode's struct address_space,
which contains the information on the location of the page on the disk (swap space).
Once the data is brought in, we return the struct page which represents this data in the
page frame for which this page fault occurred. We return this to the user.
This is my understanding of the latest but I haven't tested it.
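The arithmetic in the example above can be checked in plain user-space C. This is only a sketch of the address calculation, not kernel code: EXAMPLE_PAGE_SHIFT and the function names are made up here, and 4KB pages are assumed.

```c
#include <stdint.h>

#define EXAMPLE_PAGE_SHIFT 12 /* assumption: 4KB pages */

/* Physical address of the faulting location: the distance of the fault
 * from the start of the region, plus the region's starting offset,
 * which vm_pgoff stores in units of pages. */
uint64_t fault_physaddr(uint64_t address, uint64_t vm_start, uint64_t vm_pgoff)
{
    uint64_t offset = vm_pgoff << EXAMPLE_PAGE_SHIFT;
    return (address - vm_start) + offset;
}

/* Page frame number that backs the faulting address. */
uint64_t fault_pfn(uint64_t address, uint64_t vm_start, uint64_t vm_pgoff)
{
    return fault_physaddr(address, vm_start, vm_pgoff) >> EXAMPLE_PAGE_SHIFT;
}
```

With address = 0xc0090000, vm_start = 0xc0080000 and vm_pgoff = 10, fault_physaddr gives 0x0001A000 and fault_pfn gives 26, matching the figures worked out above.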
No, the statement in the book is correct, because
As aforementioned, "physical" is just the address
Of starting of your region/portion that you want
To map out of the physical memory which starts
From "off" physical address till the "simple_region_size"
The "simple_region_size" value is decided by the user.
Similarly "simple_region_start" is decided by the user.
simple_region_start >= off
So the maximum physical memory that user can map
Is decided by: psize = simple_region_size - off
I.e from start of physical memory till end
of the portion.
But actually how much will be mapped with this memory
region is given by "vma->vm_end - vma->vm_start" and is
represented by vsize. Hence the need existed to perform
the sanity check since User can get more than what it
intended.
Kind regards,
Sanjeev Ranot
"simple_region_start" is the offset from the starting of
physical memory out of which our sub-region needs to be mapped
Example:
off = start of the physical memory (page aligned)= 0xd000 8000
simple_region_start = 0x1000
therefore the physical address of the start of the sub region
we want to map is = 0xd000 8000 + 0x1000
= 0xd000 9000
now virtual size is the portion that needs to be mapped from the
physical memory available. This must be defined properly by the user.
simple_region_size = physical address pointing to last of the
portion that we need to map.
So if we wanted 8KBs to be mapped from the physical memory available
then following is how the calculation goes
simple_region_size = physical address just beyond the last of our portion
simple_region_size = 0xd000 9000 + 0x2000 (for the 8KBs)
simple_region_size = 0xd000 B000
So our 8KB portion will cover the physical addresses from 0xd000 9000 up to (but not including) 0xd000 B000.
Hence physical size i.e. psize = 0x2000
We perform the sanity check i.e
If the size of our portion of physical memory is smaller
than what the user tries to map using the full length
virtual address range of this memory region, then we
raise an exception. i.e say for ex. vsize = 0x3000
Otherwise we use the API "remap_pfn_range" to map the
portion of the physical memory, passing in the
physical address and not the page frame number as
was done previously, since this is IO memory.
I feel the API "io_remap_page_range" should have been
used here instead.
So it will map the portion of physical memory starting
from the physical address 0xd000 9000 onto the user
linear address range of size vsize starting at vma->vm_start.
N.B As before I have yet to test this out !

ELF Program Headers: MemSiz vs. FileSiz

readelf -l /bin/bash gives me this:
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
PHDR 0x0000000000000040 0x0000000000400040 0x0000000000400040
0x00000000000001f8 0x00000000000001f8 R E 8
INTERP 0x0000000000000238 0x0000000000400238 0x0000000000400238
0x000000000000001a 0x000000000000001a R 1
[Requesting program interpreter: /lib/ld-linux-x86-64.so.2]
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x00000000000aeef4 0x00000000000aeef4 R E 200000
LOAD 0x00000000000afde0 0x00000000006afde0 0x00000000006afde0
0x0000000000003cec 0x000000000000d3c8 RW 200000
DYNAMIC 0x00000000000afdf8 0x00000000006afdf8 0x00000000006afdf8
0x0000000000000200 0x0000000000000200 RW 8
NOTE 0x0000000000000254 0x0000000000400254 0x0000000000400254
0x0000000000000044 0x0000000000000044 R 4
GNU_EH_FRAME 0x000000000009dbc0 0x000000000049dbc0 0x000000000049dbc0
0x0000000000002bb4 0x0000000000002bb4 R 4
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 8
GNU_RELRO 0x00000000000afde0 0x00000000006afde0 0x00000000006afde0
0x0000000000000220 0x0000000000000220 R 1
Why is MemSiz not equal to FileSiz for some LOAD segments? What should be done with the memory region included by MemSiz but not FileSiz?
The loadable segment in question appears to be the program's data segment.
The data segment in a program contains space for both initialized and
uninitialized program variables. Values for initialized variables are
stored in the program's executable. Uninitialized program variables do not
need to be stored anywhere; instead space is reserved for them in a
special section named ".bss", which occupies no space in the file.
The file size of an executable's data segment can thus be less than
its in-memory size.
To illustrate:
/*
* Space for the initialized variable 'x' would be reserved in the
* executable's ".data" section, along with its initial value.
*/
int x = 42;
/*
* Space for the uninitialized variable 'y' would be reserved in
* the ".bss" section; no file space would be allocated in the
* executable.
*/
int y;
On unix-like systems, the portion of the data segment mapped to the
".bss" section would be zero-filled at program load time.
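What a loader conceptually does with such a segment can be sketched as below. This is a simplification with a made-up function name: real loaders typically map the file with mmap and rely on zero-filled pages instead of copying bytes by hand.

```c
#include <stddef.h>
#include <string.h>

/* Populate a segment's memory image: the first filesz bytes come from
 * the file, the remaining memsz - filesz bytes are zero-filled.
 * dst must have room for memsz bytes and src must hold filesz bytes.
 * Returns 0 on success, -1 if filesz is larger than memsz. */
int load_segment(unsigned char *dst, const unsigned char *src,
                 size_t filesz, size_t memsz)
{
    if (filesz > memsz)
        return -1;
    memcpy(dst, src, filesz);
    memset(dst + filesz, 0, memsz - filesz);
    return 0;
}
```

For the RW LOAD segment of /bin/bash shown above, FileSiz is 0x3cec and MemSiz is 0xd3c8, so the trailing 0x96dc bytes would be zero-filled.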

Resources