How to find the main function's entry point in an ELF executable without any symbol information?

I developed a small C++ program on Ubuntu Linux 11.10.
Now I want to reverse engineer it. I am a beginner. I use these tools: GDB 7.0, the hte editor, and hexeditor.
The first time, it was pretty easy: with the help of the symbol information I found the address of the main function and did everything I needed.
Then I stripped the ELF executable (--strip-all) and ran into problems.
I know that the main function starts at 0x8960 in this program.
But I have no idea how I would find this address without that knowledge.
I tried stepping through my program with gdb, but it goes into __libc_start_main
and then into the dynamic loader (which finds and loads the shared libraries the program needs). I debugged it for about 10 minutes. Maybe in another 20 minutes I would reach the main function's entry point, but it seems there has to be an easier way.
What should I do to find the main function's entry point without any symbol information?
Could you recommend some good books, sites, or other sources on reverse engineering ELF files with gdb?
Any help would be appreciated.

Locating main() in a stripped Linux ELF binary is straightforward. No symbol information is required.
The prototype for __libc_start_main is
int __libc_start_main(int (*main) (int, char**, char**),
int argc,
char *__unbounded *__unbounded ubp_av,
void (*init) (void),
void (*fini) (void),
void (*rtld_fini) (void),
void (*__unbounded stack_end));
The runtime memory address of main() is the argument corresponding to the first parameter, int (*main) (int, char**, char**). On 32-bit x86, arguments are pushed onto the runtime stack in the reverse order of their parameters in the function definition, so the last address pushed onto the stack before the call to __libc_start_main is the address of main(). On x86-64, the first argument is passed in a register instead, so the address of main() is the value loaded into %rdi before the call.
One can enter main() in gdb in 4 steps:
1. Find the program entry point
2. Find where __libc_start_main is called
3. Set a breakpoint at the address passed as the first argument to __libc_start_main (the last value pushed before the call on 32-bit x86, the value loaded into %rdi on x86-64)
4. Let program execution continue until the breakpoint for main() is hit
The process is the same for both 32-bit and 64-bit ELF binaries.
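For the 32-bit case, the "last value pushed before the call" rule can also be applied statically to the bytes at the entry point. The following is a minimal sketch, not a general disassembler: it only knows the handful of opcodes that appear in the 32-bit _start listing in the gdb session that follows, and the byte string START_STUB was hand-assembled from that listing (an assumption, since the original only shows the disassembly). The function name find_main_in_i386_start is hypothetical. Other toolchains (for example position-independent executables) emit a different stub, for which this sketch would raise an error.

```python
import struct

# Lengths of the few i386 opcodes that appear in the _start stub of this
# answer. Any other opcode raises an error: this is not a real disassembler.
_STUB_OPCODE_LENGTHS = {
    0x31: 2,  # xor r/m32, r32 (register-to-register form)
    0x89: 2,  # mov r/m32, r32 (register-to-register form)
    0x83: 3,  # and r/m32, imm8 (register form)
    0x68: 5,  # push imm32
    0xE8: 5,  # call rel32
}
# 0x50-0x57 are push reg, 0x58-0x5f are pop reg; all are one byte long.
_STUB_OPCODE_LENGTHS.update({op: 1 for op in range(0x50, 0x60)})


def find_main_in_i386_start(code: bytes) -> int:
    """Return the last imm32 pushed before the first call in a _start stub."""
    offset = 0
    last_pushed = None
    while offset < len(code):
        opcode = code[offset]
        length = _STUB_OPCODE_LENGTHS.get(opcode)
        if length is None:
            raise ValueError(f"unexpected opcode {opcode:#x} at offset {offset}")
        if opcode == 0x68:
            # Remember the immediate; the last one before the call is main().
            (last_pushed,) = struct.unpack_from("<I", code, offset + 1)
        elif opcode == 0xE8:
            if last_pushed is None:
                raise ValueError("no push imm32 before the call")
            return last_pushed
        offset += length
    raise ValueError("no call instruction found")


# Hand-assembled bytes of the 32-bit _start listing (0x8048310 .. 0x8048330).
START_STUB = bytes.fromhex(
    "31ed" "5e" "89e1" "83e4f0" "50" "54" "52"
    "68a0840408" "6840840408" "51" "56"
    "680b840408" "e8bfffffff"
)

print(hex(find_main_in_i386_start(START_STUB)))  # prints 0x804840b
```

In practice the bytes would come from the file at the entry point's file offset; the gdb session is still the more robust route because gdb does the decoding.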
Entering main() in an example stripped 32-bit ELF binary called "test_32":
$ gdb -q -nh test_32
Reading symbols from test_32...(no debugging symbols found)...done.
(gdb) info file #step 1
Symbols from "/home/c/test_32".
Local exec file:
`/home/c/test_32', file type elf32-i386.
Entry point: 0x8048310
< output snipped >
(gdb) break *0x8048310
Breakpoint 1 at 0x8048310
(gdb) run
Starting program: /home/c/test_32
Breakpoint 1, 0x08048310 in ?? ()
(gdb) x/13i $eip #step 2
=> 0x8048310: xor %ebp,%ebp
0x8048312: pop %esi
0x8048313: mov %esp,%ecx
0x8048315: and $0xfffffff0,%esp
0x8048318: push %eax
0x8048319: push %esp
0x804831a: push %edx
0x804831b: push $0x80484a0
0x8048320: push $0x8048440
0x8048325: push %ecx
0x8048326: push %esi
0x8048327: push $0x804840b # address of main()
0x804832c: call 0x80482f0 <__libc_start_main@plt>
(gdb) break *0x804840b # step 3
Breakpoint 2 at 0x804840b
(gdb) continue # step 4
Breakpoint 2, 0x0804840b in ?? () # now in main()
(gdb) x/x $esp+4
0xffffd110: 0x00000001 # argc = 1
(gdb) x/s **(char ***) ($esp+8)
0xffffd35c: "/home/c/test_32" # argv[0]
Entering main() in an example stripped 64-bit ELF binary called "test_64":
$ gdb -q -nh test_64
Reading symbols from test_64...(no debugging symbols found)...done.
(gdb) info file # step 1
Symbols from "/home/c/test_64".
Local exec file:
`/home/c/test_64', file type elf64-x86-64.
Entry point: 0x400430
< output snipped >
(gdb) break *0x400430
Breakpoint 1 at 0x400430
(gdb) run
Starting program: /home/c/test_64
Breakpoint 1, 0x0000000000400430 in ?? ()
(gdb) x/11i $rip # step 2
=> 0x400430: xor %ebp,%ebp
0x400432: mov %rdx,%r9
0x400435: pop %rsi
0x400436: mov %rsp,%rdx
0x400439: and $0xfffffffffffffff0,%rsp
0x40043d: push %rax
0x40043e: push %rsp
0x40043f: mov $0x4005c0,%r8
0x400446: mov $0x400550,%rcx
0x40044d: mov $0x400526,%rdi # address of main()
0x400454: callq 0x400410 <__libc_start_main@plt>
(gdb) break *0x400526 # step 3
Breakpoint 2 at 0x400526
(gdb) continue # step 4
Breakpoint 2, 0x0000000000400526 in ?? () # now in main()
(gdb) print $rdi
$3 = 1 # argc = 1
(gdb) x/s **(char ***) ($rsp+16)
0x7fffffffe35c: "/home/c/test_64" # argv[0]
A detailed treatment of program initialization, what occurs before main() is called, and how to get to main() can be found in Patrick Horgan's tutorial "Linux x86 Program Start Up or - How the heck do we get to main()?"

If you have a heavily stripped binary, or even one that is packed (for example with UPX), you can do it the hard way in gdb. First get the entry point address:
$ readelf -h echo | grep Entry
Entry point address: 0x103120
And then you can break at it in GDB as:
$ gdb mybinary
(gdb) break * 0x103120
Breakpoint 1 at 0x103120
(gdb) r
Starting program: mybinary
Breakpoint 1, 0x0000000000103120 in ?? ()
and then, you can see the entry instructions:
(gdb) x/10i 0x0000000000103120
=> 0x103120: bl 0x103394
0x103124: dcbtst 0,r5
0x103128: mflr r13
0x10312c: cmplwi r7,2
0x103130: bne 0x103214
0x103134: stw r5,0(r6)
0x103138: add r4,r4,r3
0x10313c: lis r0,-32768
0x103140: lis r9,-32768
0x103144: addi r3,r3,-1
I hope it helps

As far as I know, once a program has been stripped, there is no straightforward way to locate the function that the symbol main would have otherwise referenced.
The value of the symbol main is not required for program start-up: in the ELF format, the start of the program is specified by the e_entry field of the ELF executable header. This field normally points to the C library's initialization code, and not directly to main.
While the C library's initialization code does call main() after it has set up the C run time environment, this call is a normal function call that gets fully resolved at link time.
In some cases, implementation-specific heuristics (i.e., the specific knowledge of the internals of the C runtime) could be used to determine the location of main in a stripped executable. However, I am not aware of a portable way to do so.
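The e_entry field mentioned above sits at a fixed offset in the ELF header, so it can be read without any tooling. Here is a minimal sketch, assuming a well-formed header; the function names elf_entry_point and read_entry_point are hypothetical, and the demo header is synthetic (built in memory using the 64-bit entry address from the earlier example) rather than read from a real file.

```python
import struct


def elf_entry_point(header: bytes) -> int:
    """Return e_entry given at least the first 32 bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = header[4], header[5]
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        raise ValueError("unknown ELF class or data encoding")
    byte_order = "<" if ei_data == 1 else ">"  # ELFDATA2LSB / ELFDATA2MSB
    width = "I" if ei_class == 1 else "Q"      # ELFCLASS32 / ELFCLASS64
    # e_entry follows e_ident[16], e_type (2), e_machine (2), e_version (4),
    # so it starts at byte offset 24 in both the 32-bit and 64-bit layouts.
    (entry,) = struct.unpack_from(byte_order + width, header, 24)
    return entry


def read_entry_point(path: str) -> int:
    """Read e_entry from the ELF file at the given path."""
    with open(path, "rb") as f:
        return elf_entry_point(f.read(64))


# Synthetic little-endian 64-bit header: e_ident, then type/machine/version/entry.
demo_header = (
    b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)
    + struct.pack("<HHIQ", 2, 62, 1, 0x400430)
)
print(hex(elf_entry_point(demo_header)))  # prints 0x400430
```

As the answer says, this address normally points at the C runtime's start-up code, not at main itself.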


Unclear output by riscv objdump -d

Now I am trying to understand the RISC-V ISA but I have an unclear point about the machine code and assembly.
I wrote this C code:
int main() {
    return 42;
}
Then, I produced the .s file by this command:
$ /opt/riscv/bin/riscv64-unknown-linux-gnu-gcc -S 42.c
The output was:
.file "42.c"
.option nopic
.align 1
.globl main
.type main, @function
main:
addi sp,sp,-16
sd s0,8(sp)
addi s0,sp,16
li a5,42
mv a0,a5
ld s0,8(sp)
addi sp,sp,16
jr ra
.size main, .-main
.ident "GCC: (g5964b5cd727) 11.1.0"
.section .note.GNU-stack,"",@progbits
Next, I ran the following command to produce an ELF file:
$ /opt/riscv/bin/riscv64-unknown-linux-gnu-gcc -nostdlib -o 42 42.s
So, a binary file is produced. I tried to read that by objdump like this:
$ /opt/riscv/bin/riscv64-unknown-linux-gnu-objdump -d 42
So the output was like this:
42: file format elf64-littleriscv
Disassembly of section .text:
00000000000100b0 <main>:
100b0: 1141 addi sp,sp,-16
100b2: e422 sd s0,8(sp)
100b4: 0800 addi s0,sp,16
100b6: 02a00793 li a5,42
100ba: 853e mv a0,a5
100bc: 6422 ld s0,8(sp)
100be: 0141 addi sp,sp,16
100c0: 8082 ret
What I don't understand is the meaning of the machine code in objdump output.
For example, the first instruction, addi, should be encoded as .....0010011 according to this page (although it is not an official spec). However, the dumped hex is 1141. 1141 is only 2 bytes, but the instruction should be 32 bits, i.e. 4 bytes.
I guess I am missing some points, but how should I read the output of objdump for riscv?
Those 2-byte encodings are compressed (16-bit) instructions. You can make objdump show them explicitly with -M no-aliases, like this:
riscv64-unknown-elf-objdump -d -M no-aliases
In that case, instructions starting with c. are compressed ones.
Unfortunately that will also disable some other aliases, making the asm less nice to read if you're used to them. You can just look at the number of bytes (2 vs. 4) in the hexdump to see if it's a compressed instruction or not.
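The 2-versus-4-byte check can also be done from the encoding itself: in the RISC-V base encoding scheme, an instruction whose two lowest bits are not 11 is a 16-bit compressed instruction. The sketch below applies that rule to values from the objdump output in the question and also decodes the c.addi form behind 1141. The function names are hypothetical, and only the 16-bit and 32-bit cases are handled.

```python
def rv_insn_length(first_halfword: int) -> int:
    """Return the instruction length in bytes from its first 16-bit parcel."""
    if first_halfword & 0b11 != 0b11:
        return 2  # compressed (C extension) instruction
    if (first_halfword >> 2) & 0b111 != 0b111:
        return 4  # standard 32-bit instruction
    raise ValueError("encodings longer than 32 bits are not handled here")


def decode_c_addi(insn: int):
    """Decode a 16-bit c.addi and return (rd, imm)."""
    # c.addi has op = 01 in bits [1:0] and funct3 = 000 in bits [15:13].
    if insn & 0b11 != 0b01 or (insn >> 13) & 0b111 != 0b000:
        raise ValueError("not a c.addi encoding")
    rd = (insn >> 7) & 0x1F
    # The 6-bit immediate is split: imm[5] in bit 12, imm[4:0] in bits [6:2].
    imm = (((insn >> 12) & 1) << 5) | ((insn >> 2) & 0x1F)
    if imm & 0x20:  # sign-extend from 6 bits
        imm -= 0x40
    return rd, imm


print(rv_insn_length(0x1141))               # prints 2
print(rv_insn_length(0x02A00793 & 0xFFFF))  # prints 4 (li a5,42)
print(decode_c_addi(0x1141))                # prints (2, -16): x2 is sp
```

So 1141 is c.addi sp,-16, the compressed form of addi sp,sp,-16, which is why it does not match the 32-bit addi pattern ending in 0010011.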

Where const strings are saved in assembly?

When I declare a string in assembly like this:
string DB "My string", 0
where is the string saved?
Can I determine where it will be saved when declaring it?
db assembles output bytes to the current position in the output file. You control exactly where they go.
There is no indirection or reference to any other location, it's like char string[] = "blah blah", not char *string = "blah blah" (but without the implicit zero byte at the end, that's why you have to use ,0 to add one explicitly.)
When targeting a modern OS (i.e. not making a boot-sector or something), your code + data will end up in an object file and then be linked into an executable or library.
On Linux (or other ELF platforms), put read-only constant data including strings in section .rodata. This section (along with section .text where you put code) becomes part of the text segment after linking.
Windows apparently uses section .rdata.
Different assemblers have different syntax for changing sections, but I think section .whatever works in most of the ones that use DB for data bytes.
;; NASM source for the x86-64 System V ABI.
section .rodata ; use section .rdata on Windows
string DB "My string", 0
section .data
static_storage_for_something: dd 123 ; one dword with value = 123
;; usually you don't need .data and can just use registers or the stack
section .bss ; zero-initialized memory, bytes not stored in the executable, just size
static_array: resd 12300000 ;; 12300000 dwords with value = 0
section .text
extern puts ; defined in libc
global main
main:
mov edi, string ; RDI = address of string = first function arg
;mov [rdi], 1234 ; would segfault because .rodata is mapped read-only
jmp puts ; tail-call puts(string)
peter@volta:/tmp$ cat > string.asm
(and paste the above, then press control-D)
peter@volta:/tmp$ nasm -f elf64 string.asm && gcc -no-pie string.o && ./a.out
My string
peter@volta:/tmp$ echo $?
10
10 characters is the return value from puts, which is the return value from main because we tail-called it, which becomes the exit status of our program. (Linux glibc puts apparently returns the character count in this case, but the manual only says it returns a non-negative number on success, so don't count on this.)
I used -no-pie because I used an absolute address for string with mov instead of a RIP-relative LEA.
You can use readelf -a a.out or nm to look at what went where in your executable.

C program stores function parameters from $rbp+4 in memory? My check failed

I was trying to learn how to use rbp/ebp to access function parameters and local variables on 64-bit Ubuntu 16.04. I've got a simple C file:
int main(int argc, char *argv[])
{
    return argc;
}
I compiled it with:
gcc -g my.c
Then debug it with argument parameters:
gdb --args my 01 02
Here I know the "argc" should be 3, so I tried to check:
(gdb) b main
Breakpoint 1 at 0x400535: file ret.c, line 5.
(gdb) r
Starting program: /home/a/cpp/my 01 02
Breakpoint 1, main (argc=3, argv=0x7fffffffde98) at ret.c:5
5 printf("hello\n");
(gdb) x $rbp+4
0x7fffffffddb4: 0x00000000
(gdb) x $rbp+8
0x7fffffffddb8: 0xf7a2e830
(gdb) x/1xw $rbp+8
0x7fffffffddb8: 0xf7a2e830
(gdb) x/1xw $rbp+4
0x7fffffffddb4: 0x00000000
(gdb) x/1xw $rbp
0x7fffffffddb0: 0x00400550
I don't find any clue that a dword of "3" is saved in any of bytes in $rbp+xBytes. Did I get anything wrong in my understanding or commands?
> I was trying to learn how to use rbp/ebp to visit function parameters and local variables
The x86_64 ABI does not use stack to pass parameters; they are passed in registers. Because of that, you wouldn't find them at any offset off $rbp (this is different from ix86 calling convention).
To find the parameters, you'll need to look at the $rdi and $rsi registers:
Breakpoint 1, main (argc=3, argv=0x7fffffffe3a8) at my.c:4
4 printf("hello\n");
(gdb) p/x $rdi
$1 = 0x3 # matches argc
(gdb) p/x $rsi
$2 = 0x7fffffffe3a8 # matches argv
> x $rbp+4
You almost certainly wouldn't find anything useful at $rbp+4, because on x86-64 the stack pointer is usually incremented or decremented by 8 so that each slot holds an entire 64-bit value.
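As a rough aid for deciding what to inspect in gdb, the register assignment for the x86-64 System V calling convention can be written down as a lookup. This is a simplified sketch that assumes every argument is an integer or pointer (no floating-point values, no structs passed by value) and describes the state at function entry, before any prologue runs. The function name integer_arg_location is hypothetical.

```python
# Registers used, in order, for the first six integer/pointer arguments
# under the x86-64 System V ABI.
INTEGER_ARG_REGISTERS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")


def integer_arg_location(index: int) -> str:
    """Return a gdb expression for the index-th (0-based) integer argument."""
    if index < 0:
        raise ValueError("argument index must be non-negative")
    if index < len(INTEGER_ARG_REGISTERS):
        return "$" + INTEGER_ARG_REGISTERS[index]
    # At function entry (%rsp) holds the return address, so the first
    # stack-passed argument is at 8(%rsp), the next at 16(%rsp), and so on.
    offset = 8 * (index - len(INTEGER_ARG_REGISTERS)) + 8
    return f"*(long *)($rsp + {offset})"


for name, index in (("argc", 0), ("argv", 1), ("a seventh argument", 6)):
    print(name, "->", integer_arg_location(index))
```

For main(int argc, char *argv[]) this gives $rdi and $rsi, matching the p/x commands above. Note that a breakpoint set with b main stops after the prologue, so stack offsets computed for function entry would need adjusting there.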

gdb catch syscall condition and string comparison

I would like to catch a system call (more specifically access) and set a condition on it based on string comparison (obviously for arguments that are strings).
Specific example: when debugging ls I would like to catch access syscalls for specific pathnames (the 1st argument)
int access(const char *pathname, int mode);
So far, I have succeeded in manually inspecting the pathname argument of access (see [1]).
I tried to use this blog post:
catch syscall access
condition 1 strcmp((char*)($rdi), "/etc/") == 0
but it failed (see [2]): gdb reported a segfault and said that Evaluation of the expression containing the function (strcmp@plt) will be abandoned. However, gdb suggested set unwindonsignal on.
Which I tried:
set unwindonsignal on
catch syscall access
condition 1 strcmp((char*)($rdi), "/etc/") == 0
but failed again (see [3]) with a similar error and the suggestion set unwindonsignal off...
I searched for the "The program being debugged was signaled while in a function called from GDB." error message, but (I think) I didn't find anything relevant.
Any help or ideas?
[1]
$ gdb ls
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Reading symbols from ls...(no debugging symbols found)...done.
(gdb) catch syscall access
Catchpoint 1 (syscall 'access' [21])
(gdb) r
Starting program: /bin/ls
Catchpoint 1 (call to syscall access), 0x00007ffff7df3537 in access () at ../sysdeps/unix/syscall-template.S:81
81 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) x /s $rdi
0x7ffff7df6911: "/etc/"
(gdb) c
Catchpoint 1 (returned from syscall access), 0x00007ffff7df3537 in access () at ../sysdeps/unix/syscall-template.S:81
81 in ../sysdeps/unix/syscall-template.S
(gdb) x /s $rdi
0x7ffff7df6911: "/etc/"
(gdb) c
Catchpoint 1 (call to syscall access), 0x00007ffff7df3537 in access () at ../sysdeps/unix/syscall-template.S:81
81 in ../sysdeps/unix/syscall-template.S
(gdb) x /s $rdi
0x7ffff7df9420 <preload_file.9747>: "/etc/"
[2]
$ gdb ls
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Reading symbols from ls...(no debugging symbols found)...done.
(gdb) catch syscall access
Catchpoint 1 (syscall 'access' [21])
(gdb) condition 1 strcmp((char*)($rdi), "/etc/") == 0
(gdb) info breakpoints
Num Type Disp Enb Address What
1 catchpoint keep y syscall "access"
stop only if strcmp((char*)($rdi), "/etc/") == 0
(gdb) r
Starting program: /bin/ls
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
Error in testing breakpoint condition:
The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(strcmp@plt) will be abandoned.
When the function is done executing, GDB will silently stop.
Catchpoint 1 (returned from syscall munmap), 0x0000000000000000 in ?? ()
[3]
$ gdb ls
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Reading symbols from ls...(no debugging symbols found)...done.
(gdb) set unwindonsignal on
(gdb) catch syscall access
Catchpoint 1 (syscall 'access' [21])
(gdb) condition 1 strcmp((char*)($rdi), "/etc/") == 0
(gdb) r
Starting program: /bin/ls
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
Error in testing breakpoint condition:
The program being debugged was signaled while in a function called from GDB.
GDB has restored the context to what it was before the call.
To change this behavior use "set unwindonsignal off".
Evaluation of the expression containing the function
(strcmp@plt) will be abandoned.
Catchpoint 1 (returned from syscall munmap), 0x00007ffff7df3537 in access () at ../sysdeps/unix/syscall-template.S:81
81 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) x /s $rdi
0x7ffff7df6911: "/etc/"
You can use the gdb internal function $_streq, which gdb evaluates itself instead of calling strcmp in the debugged process, like this:
(gdb) catch syscall access
Catchpoint 1 (syscall 'access' [21])
(gdb) condition 1 $_streq((char *)$rdi, "/etc/")
(gdb) ru
Starting program: /bin/ls
Catchpoint 1 (call to syscall access), 0x00007ffff7df3537 in access ()
at ../sysdeps/unix/syscall-template.S:81
81 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) p (char *)$rdi
$1 = 0x7ffff7df9420 <preload_file> "/etc/"

ARM inline asm: exit system call with value read from memory

I want to execute the exit system call in ARM using inline assembly on a Linux Android device, and I want the exit value to be read from a location in memory.
Without giving this extra argument, a macro for the call looks like:
#define ASM_EXIT() __asm__("mov %r0, #1\n\t" \
"mov %r7, #1\n\t" \
"swi #0")
This works well.
To accept an argument, I adjust it to:
#define ASM_EXIT(var) __asm__("mov %r0, %0\n\t" \
"mov %r7, #1\n\t" \
"swi #0" \
: \
: "r"(var))
and I call it using:
#define GET_STATUS() (*(int*)(some_address)) //gets an integer from an address
but the compiler reports this error:
invalid 'asm': operand number out of range
I can't explain why I get this error, as I use one input variable in the above snippet (%0/var). Also, I have tried with a regular variable, and still got the same error.
Extended-asm syntax requires writing %% to get a single % in the asm output. e.g. for x86:
asm("inc %eax") // bad: undeclared clobber
asm("inc %%eax" ::: "eax"); // safe but still useless :P
%r7 is treating r7 as an operand number. As commenters have pointed out, just omit the %s, because you don't need them for ARM, even with GNU as.
Unfortunately, there doesn't seem to be a way to request input operands in specific registers on ARM, the way you can for x86. (e.g. "a" constraint means eax specifically).
You can use register int var asm ("r7") to force a var to use a specific register, and then use an "r" constraint and assume it will be in that register. I'm not sure this is always safe, or a good idea, but it appears to work even after inlining. @Jeremy comments that this technique was recommended by the GCC team.
I did get some efficient code generated, which avoids wasting an instruction on a reg-reg move:
See it on the Godbolt Compiler Explorer:
__attribute__((noreturn)) static inline void ASM_EXIT(int status)
{
    register int status_r0 asm ("r0") = status;
    register int callno_r7 asm ("r7") = 1;
    asm volatile("swi #0\n"
        :   // no outputs
        : "r" (status_r0), "r" (callno_r7)
        : "memory" // any side-effects on shared memory need to be done before this, not delayed until after
    );
    // __builtin_unreachable(); // optionally let GCC know the inline asm doesn't "return"
}
#define GET_STATUS() (*(int*)(some_address)) //gets an integer from an address
void foo(void) { ASM_EXIT(12); }
push {r7} # # gcc is still saving r7 before use, even though it sees the "noreturn" and doesn't generate a return
movs r0, #12 # stat_r0,
movs r7, #1 # callno,
swi #0
# yes, it literally ends here, after the inlined noreturn
void bar(int status) { ASM_EXIT(status); }
push {r7} #
movs r7, #1 # callno,
swi #0 # doesn't touch r0: already there as bar()'s first arg.
Since you always want the value read from memory, you could use an "m" constraint and include a ldr in your inline asm. Then you wouldn't need the register int var asm("r0") trick to avoid a wasted mov for that operand.
The mov r7, #1 might not always be needed either, which is why I used the register asm() syntax for it, too. If gcc wants a 1 constant in a register somewhere else in a function, it can do it in r7 so it's already there for the ASM_EXIT.
Any time the first or last instructions of a GNU C inline asm statement are mov instructions, there's probably a way to remove them with better constraints.
