How should I read this linker script example from the GNU ld manual?

I am studying the memory region alias example in the GNU ld linker script documentation.
It contains the following script snippet:
SECTIONS
{
    .text :
    {
        *(.text)
    } > REGION_TEXT
    .rodata :
    {
        *(.rodata)
        rodata_end = .;
    } > REGION_RODATA                /* PLACE 1 */
    .data : AT (rodata_end)          /* PLACE 2 */
    {
        data_start = .;
        *(.data)
    } > REGION_DATA                  /* PLACE 3 */
    data_size = SIZEOF(.data);
    data_load_start = LOADADDR(.data);
    .bss :
    {
        *(.bss)
    } > REGION_BSS
}
One possible memory region layout given in the manual (layout C in that example) is:
MEMORY
{
ROM : ORIGIN = 0, LENGTH = 2M /*0M ~ 2M*/
ROM2 : ORIGIN = 0x10000000, LENGTH = 1M /*256M ~ 257M*/
RAM : ORIGIN = 0x20000000, LENGTH = 1M /*512M ~ 513M*/
}
REGION_ALIAS("REGION_TEXT", ROM); /*0M ~ 2M*/
REGION_ALIAS("REGION_RODATA", ROM2); /*256M ~ 257M*/
REGION_ALIAS("REGION_DATA", RAM); /*512M ~ 513M*/
REGION_ALIAS("REGION_BSS", RAM); /*512M ~ 513M*/
So:
PLACE 1 says .rodata MUST go into REGION_RODATA, that is, 256M ~ 257M.
PLACE 2 seems to say that the .data section MUST be placed immediately after the .rodata section, so .data MUST start at 257M at the latest.
But PLACE 3 says the .data section MUST go into REGION_DATA, so .data MUST start at 512M or above.
How can both be true?

The key concepts to understand this example are those of Virtual Memory Address (VMA) and Load Memory Address (LMA).
The GNU Linker official documentation defines those two terms as follows.
Every loadable or allocatable output section has two addresses. The
first is the VMA, or virtual memory address. This is the address the
section will have when the output file is run. The second is the LMA,
or load memory address. This is the address at which the section will
be loaded.
In the example, for all output sections but .data, the VMA and the LMA are the same. For .data, the LMA is specified by AT (rodata_end), while the VMA is the first available address of the REGION_DATA memory region.
With this in mind, we can read again the example and see that it leads to the situation represented below.
ROM (alias REGION_TEXT)
+---------+------------------------------+
| .text | |
+---------+------------------------------+
ROM2 (alias REGION_RODATA)
+-----------+---------+--------+
| .rodata | .data | |
+-----------+---------+--------+
RAM (alias REGION_DATA)
+---------+--------+-----------+
| .data | .bss | |
+---------+--------+-----------+
The .data section appears twice: once in ROM2 and once in RAM. Its initial contents are stored at the load address (LMA) in ROM2; before the program relies on them, startup code must copy them to the virtual address (VMA) in RAM, because the linker only assigns the two addresses and does not perform the copy itself.
This is why, a few lines later in the documentation you mentioned, we can read that
It is possible to write a common system initialization routine to copy
the .data section from ROM or ROM2 into the RAM if necessary.
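To make the two addresses concrete, here is a minimal Python sketch that models the layout above. The section sizes and the byte pattern are made-up values for illustration (the manual gives none), and copy_data is a hypothetical stand-in for the initialization routine the manual mentions, not code taken from the manual.

```python
# Region origins from the MEMORY command in the question.
ROM2_ORIGIN = 0x10000000
RAM_ORIGIN = 0x20000000

# Hypothetical section sizes; the manual's example does not specify any.
RODATA_SIZE = 0x400
DATA_SIZE = 0x100

# .rodata: VMA == LMA, placed at the start of ROM2 ("> REGION_RODATA").
rodata_vma = rodata_lma = ROM2_ORIGIN
rodata_end = rodata_vma + RODATA_SIZE

# .data: the VMA comes from "> REGION_DATA", the LMA from "AT (rodata_end)".
data_vma = RAM_ORIGIN      # data_start in the script
data_lma = rodata_end      # data_load_start in the script

# Model memory as a sparse dict: address -> byte value.
# The initial values of .data live at the LMA, inside ROM2.
memory = {data_lma + i: i & 0xFF for i in range(DATA_SIZE)}


def copy_data(mem, load_start, start, size):
    """Copy `size` bytes from the load address to the run-time address."""
    for i in range(size):
        mem[start + i] = mem[load_start + i]


copy_data(memory, data_lma, data_vma, DATA_SIZE)
print(hex(data_lma), hex(data_vma))
```

Under these assumed sizes, .data is stored at 0x10000400 (right after .rodata in ROM2) but runs at 0x20000000 (the start of RAM), so PLACE 2 and PLACE 3 do not conflict.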

Related

Loaded glibc base address different for each function

I'm trying to calculate the base address of the library of a binary file.
I have the addresses of printf, puts, etc., and I subtract each function's offset to get the base address of the library.
I was doing this for printf, puts and signal, but every time I got a different base address.
I also tried to do the things in this post, but I couldn't get the right result either.
ASLR is disabled.
This is where I take the addresses of the library functions:
gdb-peda$ x/20wx 0x804b018
0x804b018 <signal@got.plt>: 0xf7e05720 0xf7e97010 0x080484e6 0x080484f6
0x804b028 <puts@got.plt>: 0xf7e3fb40 0x08048516 0x08048526 0xf7df0d90
0x804b038 <memset@got.plt>: 0xf7f18730 0x08048556 0x08048566 0x00000000
then I have:
gdb-peda$ info proc mapping
process 114562
Mapped address spaces:
Start Addr End Addr Size Offset objfile
0x8048000 0x804a000 0x2000 0x0 /home/ofey/CTF/Pwnable.tw/applestore/applestore
0x804a000 0x804b000 0x1000 0x1000 /home/ofey/CTF/Pwnable.tw/applestore/applestore
0x804b000 0x804c000 0x1000 0x2000 /home/ofey/CTF/Pwnable.tw/applestore/applestore
0x804c000 0x806e000 0x22000 0x0 [heap]
0xf7dd8000 0xf7fad000 0x1d5000 0x0 /lib/i386-linux-gnu/libc-2.27.so
0xf7fad000 0xf7fae000 0x1000 0x1d5000 /lib/i386-linux-gnu/libc-2.27.so
0xf7fae000 0xf7fb0000 0x2000 0x1d5000 /lib/i386-linux-gnu/libc-2.27.so
0xf7fb0000 0xf7fb1000 0x1000 0x1d7000 /lib/i386-linux-gnu/libc-2.27.so
0xf7fb1000 0xf7fb4000 0x3000 0x0
0xf7fd0000 0xf7fd2000 0x2000 0x0
0xf7fd2000 0xf7fd5000 0x3000 0x0 [vvar]
0xf7fd5000 0xf7fd6000 0x1000 0x0 [vdso]
0xf7fd6000 0xf7ffc000 0x26000 0x0 /lib/i386-linux-gnu/ld-2.27.so
0xf7ffc000 0xf7ffd000 0x1000 0x25000 /lib/i386-linux-gnu/ld-2.27.so
0xf7ffd000 0xf7ffe000 0x1000 0x26000 /lib/i386-linux-gnu/ld-2.27.so
0xfffdd000 0xffffe000 0x21000 0x0 [stack]
and :
gdb-peda$ info sharedlibrary
From To Syms Read Shared Object Library
0xf7fd6ab0 0xf7ff17fb Yes /lib/ld-linux.so.2
0xf7df0610 0xf7f3d386 Yes /lib/i386-linux-gnu/libc.so.6
I then found the offset of signal and puts to calculate the base libc address.
base_with_signal_offset = 0xf7e05720 - 0x3eda0 = 0xf7dc6980
base_with_puts_offset = 0xf7e3fb40 - 0x809c0 = 0xf7dbf180
I was expecting base_with_signal_offset = base_with_puts_offset = 0xf7dd8000, but that's not the case.
What am I doing wrong?
EDIT (to show where I got those offsets):
readelf -s /lib/x86_64-linux-gnu/libc-2.27.so | grep puts
I get :
191: 00000000000809c0 512 FUNC GLOBAL DEFAULT 13 _IO_puts@@GLIBC_2.2.5
422: 00000000000809c0 512 FUNC WEAK DEFAULT 13 puts@@GLIBC_2.2.5
496: 00000000001266c0 1240 FUNC GLOBAL DEFAULT 13 putspent@@GLIBC_2.2.5
678: 00000000001285d0 750 FUNC GLOBAL DEFAULT 13 putsgent@@GLIBC_2.10
1141: 000000000007f1f0 396 FUNC WEAK DEFAULT 13 fputs@@GLIBC_2.2.5
1677: 000000000007f1f0 396 FUNC GLOBAL DEFAULT 13 _IO_fputs@@GLIBC_2.2.5
2310: 000000000008a640 143 FUNC WEAK DEFAULT 13 fputs_unlocked@@GLIBC_2.2.5
I was expecting base_with_signal_offset = base_with_puts_offset = 0xf7dd8000
There are 3 numbers in your calculation:
&puts_at_runtime - symbol_value_from_readelf == &first_executable_pt_load_segment_libc.
The readelf output shows that you got one of these almost correct: the value of puts in 64-bit /lib/x86_64-linux-gnu/libc-2.27.so is indeed 0x809c0, but that is not the library you are actually using. You need to repeat the same on the actually used 32-bit library: /lib/i386-linux-gnu/libc-2.27.so.
For the first number, &puts_at_runtime, you are using the value from the puts@got.plt entry. That value is only guaranteed to have been resolved (to point to the actual puts in libc.so) if you have LD_BIND_NOW=1 set in the environment, or you linked your executable with the -z now linker flag, or you have already called puts.
It may be better to print &puts in GDB.
The last number, &first_executable_pt_load_segment_libc, is correct (because info shared shows that the libc.so.6 .text section starts at 0xf7df0610, which is between 0xf7dd8000 and 0xf7fad000).
So putting it all together, the only error was that you used the wrong version of libc.so to extract the symbol_value_from_readelf.
On my system:
#include <signal.h>
#include <stdio.h>
int main() {
puts("Hello");
signal(SIGINT, SIG_IGN);
return 0;
}
gcc -m32 t.c -fno-pie -no-pie
gdb -q a.out
... set breakpoint on exit from main
Breakpoint 1, 0x080491ae in main ()
(gdb) p &puts
$1 = (<text variable, no debug info> *) 0xf7e31300 <puts>
(gdb) p &signal
$2 = (<text variable, no debug info> *) 0xf7df7d20 <ssignal>
(gdb) info proc map
process 114065
Mapped address spaces:
Start Addr End Addr Size Offset objfile
0x8048000 0x8049000 0x1000 0x0 /tmp/a.out
...
0x804d000 0x806f000 0x22000 0x0 [heap]
0xf7dc5000 0xf7de2000 0x1d000 0x0 /lib/i386-linux-gnu/libc-2.29.so
...
(gdb) info shared
From To Syms Read Shared Object Library
0xf7fd5090 0xf7ff0553 Yes (*) /lib/ld-linux.so.2
0xf7de20e0 0xf7f2b8d6 Yes (*) /lib/i386-linux-gnu/libc.so.6
Given the above, we expect readelf -s to give us 0xf7e31300 - 0xf7dc5000 == 0x6c300 for puts and 0xf7df7d20 - 0xf7dc5000 == 0x32d20 for signal.
readelf -Ws /lib/i386-linux-gnu/libc-2.29.so | egrep ' (puts|signal)\W'
452: 00032d20 68 FUNC WEAK DEFAULT 14 signal@@GLIBC_2.0
458: 0006c300 400 FUNC WEAK DEFAULT 14 puts@@GLIBC_2.0
QED.
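The subtraction can be sanity-checked with a few lines of Python. This is only an arithmetic sketch using the addresses quoted in this thread. The helper names are made up, and the alignment test assumes 4 KiB pages and a library whose first mapping starts at file offset 0, as in the mappings shown above.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages


def libc_base(runtime_addr, symbol_value):
    """Base address = run-time address of a symbol minus its value from readelf."""
    return runtime_addr - symbol_value


def is_page_aligned(addr):
    """A load base must be page-aligned; a candidate that is not is certainly wrong."""
    return addr % PAGE_SIZE == 0


# Answer's numbers: 32-bit libc-2.29, offsets taken from the 32-bit library.
good = [libc_base(0xF7E31300, 0x6C300),   # puts
        libc_base(0xF7DF7D20, 0x32D20)]   # signal

# Question's numbers: 32-bit run-time addresses, offsets from the 64-bit library.
bad = [libc_base(0xF7E05720, 0x3EDA0),    # signal
       libc_base(0xF7E3FB40, 0x809C0)]    # puts

for base in good + bad:
    print(hex(base), "page-aligned" if is_page_aligned(base) else "NOT page-aligned")
```

Both correct pairs give the same page-aligned base, 0xf7dc5000. The mismatched pairs give two different values that are not page-aligned, which is a quick hint that the offsets came from the wrong library file.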

Understanding how $ works in assembly [duplicate]

len: equ 2
len: db 2
Are they the same, producing a label that can be used instead of 2? If not, then what is the advantage or disadvantage of each declaration form? Can they be used interchangeably?
The first is equate, similar to C's:
#define len 2
in that it doesn't actually allocate any space in the final code, it simply sets the len symbol to be equal to 2. Then, when you use len later on in your source code, it's the same as if you're using the constant 2.
The second is define byte, similar to C's:
int len = 2;
It does actually allocate space, one byte in memory, stores a 2 there, and sets len to be the address of that byte.
Here's some pseudo-assembler code that shows the distinction:
line  addr  code      label  instruction
----  ----  --------  -----  -----------
   1  0000                   org  1234h
   2  1234            elen   equ  2
   3  1234  02        dlen   db   2
   4  1235  44 02 00         mov  ax, elen
   5  1238  44 34 12         mov  ax, dlen
Line 1 simply sets the assembly address to be 1234h, to make it easier to explain what's happening.
In line 2, no code is generated, the assembler simply loads elen into the symbol table with the value 2. Since no code has been generated, the address does not change.
Then, when you use it on line 4, it loads that value into the register.
Line 3 shows that db is different, it actually allocates some space (one byte) and stores the value in that space. It then loads dlen into the symbol table but gives it the value of that address 1234h rather than the constant value 2.
When you later use dlen on line 5, you get the address, which you would have to dereference to get the actual value 2.
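The difference in what ends up in the symbol table can be mimicked with a toy model in Python. This is a sketch only: assemble is a hypothetical helper that understands just equ and db, and it is not how any real assembler is implemented.

```python
def assemble(lines, origin):
    """Toy model: return (symbol table, emitted bytes) for equ/db lines."""
    symbols = {}
    image = bytearray()
    addr = origin
    for label, directive, value in lines:
        if directive == "equ":
            # equ: the symbol IS the constant; nothing is emitted.
            symbols[label] = value
        elif directive == "db":
            # db: the symbol is the ADDRESS of the byte that gets emitted.
            symbols[label] = addr
            image.append(value)
            addr += 1
        else:
            raise ValueError("unsupported directive: " + directive)
    return symbols, image


symbols, image = assemble([("elen", "equ", 2), ("dlen", "db", 2)], origin=0x1234)
print({name: hex(value) for name, value in symbols.items()}, image.hex())
```

In this model elen maps to 2 and dlen maps to 0x1234, while the output image contains the single byte 02, matching lines 2 and 3 of the listing.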
Summary
NASM 2.10.09 ELF output:
db does not have any magic effects: it simply outputs bytes directly to the output object file.
If those bytes happen to be in front of a symbol, the symbol will point to that value when the program starts.
If you are on the text section, your bytes will get executed.
Whether you use db, dw, etc. does not determine the size of the symbol: the st_size field of the symbol table entry is not affected.
equ makes the symbol in the current line have st_shndx == SHN_ABS magic value in its symbol table entry.
Instead of outputting a byte to the current object file location, it outputs it to the st_value field of the symbol table entry.
All else follows from this.
To understand what that really means, you should first understand the basics of the ELF standard and relocation.
SHN_ABS theory
SHN_ABS tells the linker that:
relocation is not to be done on this symbol
the st_value field of the symbol entry is to be used as a value directly
Contrast this with "regular" symbols, in which the value of the symbol is a memory address instead, and must therefore go through relocation.
Since it does not point to memory, SHN_ABS symbols can be effectively removed from the executable by the linker by inlining them.
But they are still regular symbols on object files and do take up memory there, and could be shared amongst multiple files if global.
Sample usage
section .data
x: equ 1
y: db 2
section .text
global _start
_start:
mov al, x
; al == 1
mov al, [y]
; al == 2
Note that since the symbol x contains a literal value, it does not need to be dereferenced with [], unlike y.
If we wanted to use x from a C program, we'd need something like:
extern char x;
/* The "address" of x is the equ value itself; uintptr_t needs <stdint.h>. */
printf("%d\n", (int)(uintptr_t)&x);
and set on the asm:
global x
Empirical observation of generated output
We can observe what we've said before with:
nasm -felf32 -o equ.o equ.asm
ld -melf_i386 -o equ equ.o
Now:
readelf -s equ.o
contains:
Num: Value Size Type Bind Vis Ndx Name
4: 00000001 0 NOTYPE LOCAL DEFAULT ABS x
5: 00000000 0 NOTYPE LOCAL DEFAULT 1 y
Ndx is st_shndx, so we see that x is SHN_ABS while y is not.
Also see that Size is 0 for y: db in no way told y that it was a single byte wide. We could simply add two db directives to allocate 2 bytes there.
And then:
objdump -dr equ
gives:
08048080 <_start>:
8048080: b0 01 mov $0x1,%al
8048082: a0 88 90 04 08 mov 0x8049088,%al
So we see that 0x1 was inlined into instruction, while y got the value of a relocation address 0x8049088.
Tested on Ubuntu 14.04 AMD64.
Docs
http://www.nasm.us/doc/nasmdoc3.html#section-3.2.4:
EQU defines a symbol to a given constant value: when EQU is used, the source line must contain a label. The action of EQU is to define the given label name to the value of its (only) operand. This definition is absolute, and cannot change later. So, for example,
message db 'hello, world'
msglen equ $-message
defines msglen to be the constant 12. msglen may not then be redefined later. This is not a preprocessor definition either: the value of msglen is evaluated once, using the value of $ (see section 3.5 for an explanation of $) at the point of definition, rather than being evaluated wherever it is referenced and using the value of $ at the point of reference.
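The "evaluated once, at the point of definition" behaviour can be pictured with a short Python sketch that treats $ as the current output position. The variable names and the origin of 0 are assumptions made for illustration.

```python
origin = 0                           # hypothetical start address of the section
image = bytearray()

message = origin + len(image)        # label: address of the first byte of the string
image += b"hello, world"             # message db 'hello, world'

dollar = origin + len(image)         # value of `$` on the msglen line
msglen = dollar - message            # msglen equ $-message, computed once, here

image += b"more bytes emitted later"  # `$` keeps moving, msglen does not
print(msglen)
```

Emitting more bytes afterwards changes the current position but not msglen, which stays at 12.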
See also
Analogous question for GAS: Difference between .equ and .word in ARM Assembly? .equiv seems to be the closest GAS equivalent.
equ: assembly time, evaluated once at the point of definition. It is loosely analogous to #define, but most assemblers lack an #undef, and the right-hand side can only be a single constant of a fixed number of bytes, so floats, doubles, and lists are not supported by most assemblers' equ directive.
db: also assembly time, but it emits data. The value given to db is stored in the binary output by the assembler at a specific offset. equ lets you define constants that would otherwise have to be hardcoded or loaded with a mov. db lets you have data available in memory before the program even starts.
Here's a NASM listing demonstrating db:
; I am a 16 byte object at offset 0.
db '----------------'
; I am a 14 byte object at offset 16
; the label foo makes the assembler remember the current 'tell' of the
; binary being written.
foo:
db 'Hello, World!', 0
; I am a 2 byte filler at offset 30 to help readability in hex editor.
db ' .'
; I am a 4 byte object at offset 32 that holds the offset of foo, which is 16 (0x10).
dd foo
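The offsets in that listing can be recomputed by building the same bytes in Python. This sketch assumes a flat binary assembled at origin 0 and a little-endian 4-byte dd, as on x86.

```python
import struct

image = bytearray()

image += b"-" * 16                 # 16-byte object at offset 0
foo = len(image)                   # label foo: current output position
image += b"Hello, World!\x00"      # 13 characters plus the explicit 0 = 14 bytes
image += b" ."                     # 2-byte filler
dd_offset = len(image)             # where "dd foo" lands
image += struct.pack("<I", foo)    # dd foo: 4-byte little-endian value of foo

print(foo, dd_offset, len(image))
```

With these assumptions foo is 16, the dd lands at offset 32, and the whole image is 36 bytes long.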
An equ can only define a constant up to the largest size the assembler supports.
Here is an example of equ, along with a few of its common limitations:
; OK
ZERO equ 0
; OK(some assemblers won't recognize \r and will need to look up the ascii table to get the value of it).
CR equ 0xD
; OK(some assemblers won't recognize \n and will need to look up the ascii table to get the value of it).
LF equ 0xA
; error: bar.asm:2: warning: numeric constant 102919291299129192919293122 -
; does not fit in 64 bits
; LARGE_INTEGER equ 102919291299129192919293122
; bar.asm:5: error: expression syntax error
; assemblers often don't support float constants, despite fitting in
; reasonable number of bytes. This is one of the many things
; we take for granted in C, ability to precompile floats at compile time
; without the need to create your own assembly preprocessor/assembler.
; PI equ 3.1415926
; bar.asm:14: error: bad syntax for EQU
; assemblers often don't support list constants, this is something C
; does support using define, allowing you to define a macro that
; can be passed as a single argument to a function that takes multiple.
; eg
; #define RED 0xff, 0x00, 0x00, 0x00
; glVertex4f(RED);
; #undef RED
;RED equ 0xff, 0x00, 0x00, 0x00
The resulting binary has no bytes at all, because equ does not emit anything into the image; every reference to an equ is replaced by its right-hand side.

Where const strings are saved in assembly?

When I declare a string in assembly like this:
string DB "My string", 0
where is the string saved?
Can I determine where it will be saved when declaring it?
db assembles output bytes to the current position in the output file. You control exactly where they go.
There is no indirection or reference to any other location, it's like char string[] = "blah blah", not char *string = "blah blah" (but without the implicit zero byte at the end, that's why you have to use ,0 to add one explicitly.)
When targeting a modern OS (i.e. not making a boot-sector or something), your code + data will end up in an object file and then be linked into an executable or library.
On Linux (or other ELF platforms), put read-only constant data including strings in section .rodata. This section (along with section .text where you put code) becomes part of the text segment after linking.
Windows apparently uses section .rdata.
Different assemblers have different syntax for changing sections, but I think section .whatever works in most of the ones that use DB for data bytes.
;; NASM source for the x86-64 System V ABI.
section .rodata ; use section .rdata on Windows
string DB "My string", 0
section .data
static_storage_for_something: dd 123 ; one dword with value = 123
;; usually you don't need .data and can just use registers or the stack
section .bss ; zero-initialized memory, bytes not stored in the executable, just size
static_array: resd 12300000 ;; 12300000 dwords with value = 0
section .text
extern puts ; defined in libc
global main
main:
mov edi, string ; RDI = address of string = first function arg
;mov [rdi], 1234 ; would segfault because .rodata is mapped read-only
jmp puts ; tail-call puts(string)
peter@volta:/tmp$ cat > string.asm
(and paste the above, then press control-D)
peter@volta:/tmp$ nasm -f elf64 string.asm && gcc -no-pie string.o && ./a.out
My string
peter@volta:/tmp$ echo $?
10
10 characters is the return value from puts, which is the return value from main because we tail-called it, which becomes the exit status of our program. (Linux glibc puts apparently returns the character count in this case, but the manual only says it returns a non-negative number on success, so don't count on this.)
I used -no-pie because I used an absolute address for string with mov instead of a RIP-relative LEA.
You can use readelf -a a.out or nm to look at what went where in your executable.

Swap sections in ELF

Is there a way to force gcc or ld to place the code section at the end of the output ELF file?
Or can I force them not to produce any section other than .text if, for example, I don't have anything in .data, .rodata, .bss, or other sections?
The minimal version of script that worked for me looked like:
ENTRY(_start)
SECTIONS
{
.data : { *(.data) }
.bss : { *(.bss) *(COMMON) }
.text : { *(.text) }
}
But after doing some more research (docs here), I replaced this script with the default one (from ld --verbose). I then moved the code section to the very end of that script, and it worked perfectly.

Update linker variables after --gc-sections

I wrote a small binary for a Cortex-A9 board and defined a linker script like this:
SECTIONS
{
.text :
{
__text = . ;
*(.vector)
*(.text)
*(.text.*)
}
.rodata :
{
*(.rodata)
*(.rodata.*)
}
.data : {
__data_start = . ;
*(.data)
*(.data.*)
}
. = ALIGN(4);
__bss_start = . ;
.bss :
{
*(.bss)
*(.bss.*)
*(COMMON)
. = ALIGN(4);
}
__bss_end = .;
. = ALIGN(4);
__heap_start = .;
. = . + 0x1000;
. = ALIGN(4);
__heap_end = .;
_end = . ;
PROVIDE (end = .) ;
}
But it seems that after --gc-sections has removed the unused sections, __heap_start still has the value it had before garbage collection (I print it in my code, and I have checked the ld flags):
arm-linux-gnueabihf-gcc -mcpu=cortex-a7 -msoft-float -nostdlib
-Wl,--gc-sections -Wl,--print-gc-sections -Wl,-Ttext,0x04000000 -T csrvisor.lds -Wl,-Map,binary.map
Does anyone know how to get the correct value of __heap_start after --gc-sections has removed the unused sections?
Check your compiler flags: Do they really contain -ffunction-sections -fdata-sections?
The heap normally (and in your case as well) starts right after the .bss section, so as far as the start of the heap is concerned, your linker script looks fine.
Check whether the linker really removes unused variables: if it only removes unused text sections, the value of __heap_start won't change.
Code, read-only data, initialized data, etc. normally go into flash. If something is garbage-collected there, it won't affect your heap.
Data (initialized and uninitialized) will (eventually) turn up in the RAM. If something is garbage-collected there, it will affect your heap. So check if you really have variables which are removed by the garbage collection.
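The dependency described above can be sketched as location-counter arithmetic in Python. This assumes, as the points above do, that code and read-only data sit in a separate flash region and that only .data and .bss occupy RAM. The RAM origin and all sizes are made-up values, and heap_start is a hypothetical helper, not something ld provides.

```python
def align(addr, boundary):
    """Round addr up to the next multiple of boundary, like ALIGN() in ld."""
    return (addr + boundary - 1) // boundary * boundary


def heap_start(ram_origin, data_size, bss_size):
    """Follow the location counter through .data and .bss to __heap_start."""
    dot = ram_origin + data_size   # end of .data
    dot = align(dot, 4)            # ". = ALIGN(4);" before __bss_start
    dot += bss_size                # contents of .bss
    dot = align(dot, 4)            # ". = ALIGN(4);" before __heap_start
    return dot


RAM_ORIGIN = 0x20000000            # hypothetical

before = heap_start(RAM_ORIGIN, data_size=0x104, bss_size=0x33)
after = heap_start(RAM_ORIGIN, data_size=0xE4, bss_size=0x33)  # 0x20 bytes of .data collected
print(hex(before), hex(after))
```

In this model the size of .text does not appear at all, so collecting only functions leaves __heap_start unchanged, while collecting 0x20 bytes of variables moves it down by 0x20. With a script that places every section contiguously in one region, removing code would shift it as well.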
As for your linker script:
There is no KEEP statement. Normally, things like the reset handler, main, etc. must not be removed by the linker's garbage collection.
Your data section does not define the handling of initial values.
Your linker script does not contain region declarations (MEMORY). Check which defaults apply.
Your sections do not have a target region. Again, check which defaults apply in your case.
Examples with target regions:
.rodata :
{
*(.rodata)
*(.rodata.*)
} >rom
.data : {
__data_start = . ;
*(.data)
*(.data.*)
} >ram

Resources