I am trying to create a simple 64-bit ELF file on Linux without using a compiler, just out of curiosity. For simplicity, I only need it to NOP (0x90).
My current file can be read by "readelf", and its output seems OK. When I execute it, however, I am greeted with a segfault.
The file in question (unformatted so it can be copy-pasted easier):
7F 45 4C 46 02 01 01 00 00 00 00 00 00 00 00 00 02 00 3E 00 01 00 00 00 78 00 40 00 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 40 00 38 00 01 00 00 00 00 00 00 00 01 00 00 00 05 00 00 00 78 00 00 00 00 00 00 00 78 00 40 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 80 00 00 00 00 00 90
GDB says segfault at address 0x400079, although the entry point is at 0x400078 and size of the file is 1 byte for the NOP, so why does that non-reserved memory get executed?
Bonus: why do a lot of executables start their code at large virtual addresses, and not at 0x0?
When you let a compiler handle the low level operations, you can build a do nothing program. But if you write it directly in machine code (or even in assembly language), you must handle the gory details, like properly ending the program.
You ask why the non reserved memory is executed. Well, you load something (a NOP code) at 0x400078 and declare that to be the start of program. The loader initializes the memory at 0x400078, loads the instruction pointer with that address, and the processor obediently starts the execution there. It successfully executes the NOP, and as it was not instructed to do anything else, it tries the execute the following instruction at 0x400079. Which segfaults thanks to the memory isolation and the virtual memory manager. In kernel or real mode, it would have executed whatever would be at that memory location with unlimited side effects...
So you must not execute a simple NOP but the system call that tells the OS that the program is ended. According to exit (system call) on Wikipedia, it should be on Linux 64 bits:
mov eax, 60 ; sys_exit syscall number: 60
xor edi, edi ; set exit status to 0 (`xor edi, edi` is equal to `mov edi, 0` )
syscall ; call it
If your program starts at mov eax, 60 then it will cleanly exit and even return a 0 status to its caller.
Related
I am trying to wrap my head around the workings of the ELF file format and the way objects get linked and executed.
In order to learn more I tried to analyze the output of the following assembler code:
#----------------------------------------------------------------------------------------
# Exits immediately using code 3. Runs on 64-bit Linux
# gcc -c basic.s && ld basic.o && ./a.out
#----------------------------------------------------------------------------------------
.global _start
.text
_start:
# _exit(3)
mov $60, %rax # system call 60 is exit
mov $3, %rdi # we want return code 3
syscall # invoke operating system to exit
I am using readelf and hd to analyze the output a.out.
Here are the relevant outputs:
readelf -l a.out:
Elf file type is EXEC (Executable file)
Entry point 0x401000
There are 2 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x00000000000000b0 0x00000000000000b0 R 0x1000
LOAD 0x0000000000001000 0x0000000000401000 0x0000000000401000
0x0000000000000010 0x0000000000000010 R E 0x1000
Section to Segment mapping:
Segment Sections...
00
01 .text
Excerpts of hd a.out:
00000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
00000010 02 00 3e 00 01 00 00 00 00 10 40 00 00 00 00 00 |..>.......#.....|
00000020 40 00 00 00 00 00 00 00 e0 10 00 00 00 00 00 00 |#...............|
00000030 00 00 00 00 40 00 38 00 02 00 40 00 05 00 04 00 |....#.8...#.....|
00000040 01 00 00 00 04 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 00 00 40 00 00 00 00 00 00 00 40 00 00 00 00 00 |..#.......#.....|
00000060 b0 00 00 00 00 00 00 00 b0 00 00 00 00 00 00 00 |................|
00000070 00 10 00 00 00 00 00 00 01 00 00 00 05 00 00 00 |................|
00000080 00 10 00 00 00 00 00 00 00 10 40 00 00 00 00 00 |..........#.....|
00000090 00 10 40 00 00 00 00 00 10 00 00 00 00 00 00 00 |..#.............|
000000a0 10 00 00 00 00 00 00 00 00 10 00 00 00 00 00 00 |................|
000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
[...]
and
[...]
00001000 48 c7 c0 3c 00 00 00 48 c7 c7 03 00 00 00 0f 05 |H..<...H........|
[...]
As you can see from this analysis, there is a program section called '.text' which is 10 bytes long and contains the instructions "MOV, MOV, SYSCALL" at file offset 0x1000 - all as expected. However, there is also another program section, this one without name, defined between offsets 0x0000 and 0x00b0, which is exactly the space occupied by the ELF header + program header 0 + program header 1. I have tried a minimal C program, and gcc creates a similar program section there as well.
The question: Why? To what end? Why is it necessary to define this program section and what does this section have to do with the execution of this program?
Bonus question: Why is the actual machine code put at offset 0x1000; Wouldn't it have been more efficient to put it at 0x00b0, right after the other one?
I am using Ubuntu 20.04 with gcc 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04) on an Intel x86_64 cpu.
I wrote a basic helloworld.exe with C with the simple line printf("helloworld!\n");
Then I used UltraEdit to view the bytes of the EXE file and used also PE Explorer to see the header values. When it comes to Address of Entry Point, PE Explorer displays 0x004012c0.
Magic 010Bh PE32
Linker Version 1902h 2.25
Size of Code 00008000h
Size of Initialized Data 0000B000h
Size of Uninitialized Data 00000C00h
Address of Entry Point 004012C0h
Base of Code 00001000h
Base of Data 00009000h
Image Base 00400000h
But in UltraEdit I see 0x000012c0 after counting 16 bytes after magic 0x010B.
3F 02 00 00 E0 00 07 03 0B 01 02 19 00 80 00 00
00 B0 00 00 00 0C 00 00 C0 12 00 00 00 10 00 00
00 90 00 00 00 00 40 00 00 10 00 00 00 02 00 00
04 00 00 00 01 00 00 00 04 00 00 00 00 00 00 00
00 10 01 00 00 04 00 00 91 F6 00 00 03 00 00 00
00 00 20 00 00 10 00 00 00 00 10 00 00 10 00 00
00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00
00 E0 00 00 C0 06 00 00 00 00 00 00 00 00 00 00
Which one is correct?
simply read about IMAGE_OPTIONAL_HEADER structure
AddressOfEntryPoint
A pointer to the entry point function, relative to the image base
address. For executable files, this is the starting address. For
device drivers, this is the address of the initialization function.
The entry point function is optional for DLLs. When no entry point is
present, this member is zero.
so absolute address of EntryPoint is AddressOfEntryPoint ? ImageBase + AddressOfEntryPoint : 0
in your case AddressOfEntryPoint == 12c0 and ImageBase == 400000
as result absolute address of EntryPoint is 12c0+400000==4012c0
So I've got really big .bin files with bytes written in them. They have 96-bit numbers written in them as two's complement numbers (still no ASCII, only bytes). Now I have to write an assembly program to sort the numbers in this files and save it to another file (don't ask why, assembly class). I've done it for a file with 32k numbers, like this:
./main < inSort32Kx96b.bin > XD.bin
diff outSort32Kx96b.bin XD.bin
The file outSort32Kx96b.bin is given to me by my teacher. So now diff doesn't output anything, they are identical (I can check that with hexdump or mcview). But I got another file inSort1Kx96b.bin. And I also sort it. But then diff says:
Binary files outSort1Kx96b.bin and XD.bin differ
#Edit:
cmp gave the number of byte where those 2 files differ. Now I can see the difference:
So the difference appears from 0x000017C0. Then I don't know now. If they are written as two's complement numbers in little endian order, then which one is bigger, f.e
00 00 00 00 00 00 00 81 00 00 00 00
or
00 00 00 00 00 00 00 7F 00 00 00 00
?
A hexdump shows you the single bytes in big endian order. If a bunch of bytes have to be interpreted as a number, only the byte order is reversed.
little endian big endian (C notation)
AB CD EF = 0xEFCDAB
01 02 03 04 05 06 07 08 09 10 11 12 = 0x121110090807060504030201
Let's translate your examples to big endian order:
0x000000008100000000000000
0x000000007F00000000000000
You can see that the first number is bigger.
"Two's complement number" is not a very clear expression. Better is "signed integer". The sign which shows whether the number is positive or negativ, is the very first bit of the number. This bit can be found at the begginning of a big endian number and the end of a little endian number.
Positive:
00 00 00 00 00 00 00 81 00 00 00 00 = 0x000000008100000000000000
00 00 00 00 00 00 00 81 00 00 00 10 = 0x100000008100000000000000
00 00 00 00 00 00 00 81 00 00 00 7F = 0x7F0000008100000000000000
Negativ:
00 00 00 00 00 00 00 81 00 00 00 80 = 0x800000008100000000000000
00 00 00 00 00 00 00 81 00 00 00 CD = 0xCD0000008100000000000000
00 00 00 00 00 00 00 81 00 00 00 F0 = 0xF00000008100000000000000
I'm analyzing this tiny ELF file:
00000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
00000010 02 00 3e 00 01 00 00 00 78 00 40 00 00 00 00 00 |..>.....x.#.....|
00000020 40 00 00 00 00 00 00 00 98 00 00 00 00 00 00 00 |#...............|
00000030 00 00 00 00 40 00 38 00 01 00 40 00 03 00 02 00 |....#.8...#.....|
00000040 01 00 00 00 05 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 00 00 40 00 00 00 00 00 00 00 40 00 00 00 00 00 |..#.......#.....|
00000060 7e 00 00 00 00 00 00 00 7e 00 00 00 00 00 00 00 |~.......~.......|
00000070 00 00 20 00 00 00 00 00 31 c0 ff c0 cd 80 00 2e |.. .....1.......|
00000080 73 68 73 74 72 74 61 62 00 2e 74 65 78 74 00 00 |shstrtab..text..|
00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000000d0 00 00 00 00 00 00 00 00 0b 00 00 00 01 00 00 00 |................|
000000e0 06 00 00 00 00 00 00 00 78 00 40 00 00 00 00 00 |........x.#.....|
000000f0 78 00 00 00 00 00 00 00 06 00 00 00 00 00 00 00 |x...............|
00000100 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 |................|
00000110 00 00 00 00 00 00 00 00 01 00 00 00 03 00 00 00 |................|
00000120 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000130 7e 00 00 00 00 00 00 00 11 00 00 00 00 00 00 00 |~...............|
00000140 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 |................|
00000150 00 00 00 00 00 00 00 00 |........|
00000158
I found documentation on the ELF header and the program header and decoded both of those, but I'm having problems decoding what's after this (starting with 31 c0 ff c0 cd 80 00 2e). Judging by the "shstrtab" text, I am looking at the section table, but what does 31 c0 ff c0 cd 80 00 2e mean? Where is this part documented?
OK, judging by the information in the first 16 bytes of the header:
00000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
E L F | | '--- Pudding :) ---'
| '--- Little-endian (ELFDATA2LSB)
'------ 64-bit (ELFCLASS64)
we're dealing with a 64-bit ELF with little-endian encoding of multi-byte numbers. So the ELF header is the first 4 rows in the hex editor. We're interested in these fields in the last two rows of it:
Prog Hdr Tab offset Sect Hdr Tab offset
.----------^----------. .----------^----------.
00000020 40 00 00 00 00 00 00 00 98 00 00 00 00 00 00 00 |#...............|
00000030 00 00 00 00 40 00 38 00 01 00 40 00 03 00 02 00 |....#.8...#.....|
'-.-' '-.-' '-.-' '-.-' '-.-'
PHT entry size ---' | | | '-- Sect names in #2
PHT num entries ----------' | '-- SHT num entries
'-------- SHT entry size
So we know that the Program Headers Table starts at offset 0x40 in the file (right after this header) and contains 1 entry of size 0x38 (56 bytes). So it ends at offset 0x40 + 1*0x38 = 0x78 (this is the first byte after this table, and this is also where your "mysterious data" begins, so keep this in mind).
The Section Headers Table starts at offset 0x98 in the file and contains 3 entries of size 0x40 (64 bytes), that is, each entry in SHT takes 4 consecutive rows in a hex editor, and the entire table is 3*4 = 12 such rows, so the offset 0x158 is the first byte after this table. But this is just the end of the file, so there's nothing more after the SHT.
The SHT entry at index 2 (the third=last one) should be a string table that contains the names for the sections.
So let's look at those sections now, shall we?
Section #2
Let's start with section #2, since it is supposed to contain the string table with the names for all the sections, so it will be very useful in further analysis. Here's its header (the last one in the table):
Name index Type=SHT_STRTAB (bingo!)
Flags .----^----. .----^----.
00000118 .----------^----------. 01 00 00 00 03 00 00 00 |........|
00000120 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000130 7e 00 00 00 00 00 00 00 11 00 00 00 00 00 00 00 |~...............|
'----------.----------' '----------.----------'
Starting offset Size
00000140 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 |................|
00000150 00 00 00 00 00 00 00 00 |........|
00000158
So this is indeed a string table (0x03 = SHT_STRTAB). It starts from offset 0x7E in the file and takes 0x11 (17) consecutive bytes. The first byte after the string table is therefore 0x8F. This byte is not a part of any section (garbage).
The string table
So let's see what's in the section containing the string table, so that we could name our sections:
0000007E 00 2e |..|
00000080 73 68 73 74 72 74 61 62 00 2e 74 65 78 74 00 |shstrtab..text.|
0000008F
Here's the string table, with addresses relative to its beginning:
+0 +1 +2 +3 +4 +5 +6 +7 +8 +9 +A +B +C +D +E +F
00: 00 2E 73 68 73 74 72 74 61 62 00 2e 74 65 78 74
10: 00
or the same in ASCII, with the NULL characters marked as ∎:
+0 +1 +2 +3 +4 +5 +6 +7 +8 +9 +A +B +C +D +E +F
00: ∎ . s h s t r t a b ∎ . t e x t
10: ∎
So we have just 3 full string in it, with the following relative offsets:
00: "" (Just the empty string)
01: ".shstrtab" (Name for this section)
0B: ".text" (Name for the section that contains the executable code)
(Keep in mind, though, that sections can also address substrings inside those strings, if they share the common ending.)
We can now verify that this section (#2) is indeed named .shstrtab: its name index was 0x01 after all, wasn't it? ;)
Section #1
Now let's take apart section #1's header:
Name index Type=SHT_PROGBITS
Flags .----^----. .----^----.
000000d8 .----------^----------. 0b 00 00 00 01 00 00 00 |........|
000000e0 06 00 00 00 00 00 00 00 78 00 40 00 00 00 00 00 |........x.#.....|
000000f0 78 00 00 00 00 00 00 00 06 00 00 00 00 00 00 00 |x...............|
'----------.----------' '----------.----------'
Starting offset Size
00000100 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 |................|
00000110 00 00 00 00 00 00 00 00 |........|
00000118
So this section is named .text (note the name index 0x0B) and it is of type SHT_PROGBITS, so it contains some program-defined data; the executable code in this case. It starts from the offset 0x78 in the file and takes the next 6 bytes, so the first byte after this section is at offset 0x7E (where the string table begins). Here's its contents:
00000070 31 c0 ff c0 cd 80 |1.....|
0000007E
But wait! Remember where your "mysterious data" starts? Yes! It's the 0x78 offset! :) So this "mysterious data" is actually your executable payload :) After decoding it as Intel x86-64 opcodes we get this tiny little program:
31 C0 xor %eax,%eax ; Clear the EAX register to 0 (the short way).
FF C0 inc %eax ; Increase the EAX, so now it contains 1.
CD 80 int $0x80 ; Interrupt 0x80 is the system call on Linux.
which is basically equivalent to calling exit(0) in C ;) because the syscall interrupt expects the operation number in EAX, which in this case is sys_exit (operation number 1).
So yeah, mystery solved :) But let's continue anyway, to learn something more, and this way we'll find out where this piece of code will be loaded in memory.
Section #0
And finally section #0. It has some part missing, but I assume it was all 0s, since the first section is always a NULL section after all. Here's its (butchered) header:
00000098 00 00 00 00 00 00 00 00 | ........|
*
000000d0 00 00 00 00 00 00 00 00
But it's of no use to us. Nothing interesting here.
Program Headers Table
The last thing what's left to decode is the Program Headers Table, which – according to the information from the ELF header – starts from the offset 0x40 and takes 56 bytes, the first byte after it being at offset 0x78. Here's the dump:
Type=PHT_EXEC Flags=RX Starting offset in file
.----^----. .----^----. .----------^----------.
00000040 01 00 00 00 05 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 00 00 40 00 00 00 00 00 00 00 40 00 00 00 00 00 |..#.......#.....|
'----------.----------' '----------.----------'
Virtual address Physical address
Size in file Size in memory
.----------^----------. .----------^----------.
00000060 7e 00 00 00 00 00 00 00 7e 00 00 00 00 00 00 00 |~.......~.......|
00000070 00 00 20 00 00 00 00 00
00000078 '----------.----------'
Alignment
So it says that we load the first 126 (0x7E) bytes of the file into a memory segment of the same size, and the memory segment is supposed to start from the virtual address 0x400000. Our code starts from the offset 0x78 in the file and the first byte after it has the offset 0x7E, so it basically loads the entire beginning of the file, with the ELF header and the program header table into memory, as well as our executable payload at the end of it, and stops loading afterwards, ignoring the rest of the file.
So if the beginning of the file is loaded at address 0x400000, and our program starts 120 (0x78) bytes from its beginning, it will be located at the address 0x400078 in memory :>
Now let's see what entry point is specified in the ELF header for our program:
Executable x86-64 Version=1 Program's entry point
.-^-. .-^-. .----^----. .----------^----------.
00000010 02 00 3e 00 01 00 00 00 78 00 40 00 00 00 00 00 |..>.....x.#.....|
Bingo! :> It's 0x400078, so it points at the start of our little piece of code in the memory image.
And that's all, folks! ;)
I'm looking for an easy way to convert a simple binary file into a text-representation of its binary, where encoding doesn't matter. I know that the programmatic solution is straightforward, but I feel that there must be some arcane string of unix commands to accomplish this.
Am I off base? Is there a simpler solution than the programmatic?
base64 -e filename>xxx
on the other side
base64 -d xxx>filename
Use od. For example:
$ od -t x1 -An /bin/ls | head
7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
02 00 3e 00 01 00 00 00 e0 26 40 00 00 00 00 00
40 00 00 00 00 00 00 00 30 b6 01 00 00 00 00 00
00 00 00 00 40 00 38 00 09 00 40 00 1d 00 1c 00
06 00 00 00 05 00 00 00 40 00 00 00 00 00 00 00
40 00 40 00 00 00 00 00 40 00 40 00 00 00 00 00
f8 01 00 00 00 00 00 00 f8 01 00 00 00 00 00 00
08 00 00 00 00 00 00 00 03 00 00 00 04 00 00 00
38 02 00 00 00 00 00 00 38 02 40 00 00 00 00 00
38 02 40 00 00 00 00 00 1c 00 00 00 00 00 00 00
for example, to display a binary file as a sequence of hex codes:
od -t x1 file|cut -c8-
uuencode and uudecode were made for transferring binary content as ASCII characters. See the wikipedia entry.
max#upsight:~$ openssl base64 < /dev/urandom | head -10
qnISxigXTjgON+tkSDtRJ6fRNczsejY2bEC5D1W8fscy+6mopiGfVLvZ/bu99SrT
qdTRaeRXO8fgEejXsbTy4XP9MmCbAsBCSEvDpq5bfR/Sd7EjJLUxcRwzEMlhIrYT
m6J+20aR9M4g7pbT+hjjBE/gsHKxFfZQFgxT/tm1pEg6zMvQywjsrc7d+PSJQOHw
vzYXfWkyLO1nJm9g+Pw3rBI/UuV0+lmrIflhlj5CDWuaxDJUXJiWdsD6cGKLclfz
Mlh17mHwteqMLLSrTZ0QA0ygxISqiCf2sDtPgUToM7ZT2EbaNck5auxbhU+7OcxI
vBZRKozRZtfsZA0IUzMlIQmFanBdjOeGepQjgCDruq5hqEbNc1A+HhXqTtAr8Aic
4iNf36xZifDvASYy27hTVrlI/5kTeRZURqquaxHqum15VD5IC3J/sH+AwPpN1/qi
0YM8xt+LliVje7Oo7QiTona+VMjA//a715/0J8yeryLxTLSnT8JsXUpR0CiOgAcH
tQk9nzHCfMmFzb02nrhFJ0MjLCFgNJOiI1vT0AhNnMh449dcIkDDwyMpkRV4KZ1l
CSL+K4vXhMz3LhPKSihKbYLY6aJSnlPe/GiIOfl1g1VlbtoxJ7ZclpcOp4KWSKHV
...and so on
If the reason you're doing it is to see strings inside the binary data then there's a command called "strings" that will print all the strings in a file for you.
you can also use hexdump. Look at the man page for more options
$ hexdump binaryfile
Yes, you are off-base, this is nontrivial in the general case. Some commercial solutions exist, one we use is Autonomy Keyview.
I am assuming you mean including (e.g.) MSOffice and PDFs.