I'm trying to gather statistics about the percentage of library code that is used vs. executed. To do this I'm invoking QEMU user-mode emulation with the -d in_asm flag. I log this to a file and get a sizeable file listing the translated instructions that looks like this:
----------------
IN:
0x4001a0f1e9: 48 83 c4 30 addq $0x30, %rsp
0x4001a0f1ed: 85 c0 testl %eax, %eax
0x4001a0f1ef: 74 b7 je 0x4001a0f1a8
----------------
IN:
0x4001a0f1f1: 49 8b 0c 24 movq (%r12), %rcx
0x4001a0f1f5: 48 83 7c 24 50 00 cmpq $0, 0x50(%rsp)
0x4001a0f1fb: 0f 84 37 01 00 00 je 0x4001a0f338
----------------
To map blocks to their associated files, I extract /proc/pid/maps for the QEMU process and compare the addresses of executed instructions to the address ranges of files within the guest program. This appears to work reasonably well; however, the majority of the executed instructions appear to fall outside any of the files contained in the maps file. The bottom of the guest address space is listed as follows:
.
.
.
40020a0000-4002111000 r--p 00000000 103:02 2622381 /lib/x86_64-linux-gnu/libpcre.so.3.13.3
4002111000-4002112000 r--p 00070000 103:02 2622381 /lib/x86_64-linux-gnu/libpcre.so.3.13.3
4002112000-4002113000 rw-p 00071000 103:02 2622381 /lib/x86_64-linux-gnu/libpcre.so.3.13.3
4002113000-4002115000 rw-p 00000000 00:00 0
555555554000-5555555a1000 r--p 00000000 103:02 12462104 /home/name/Downloads/qemu-5.2.0/exe/bin/qemu-x86_64
The guest program appears to end at 0x4002115000, with a sizeable gap between the guest and QEMU, which begins at 0x555555554000. I can match instructions in the libraries to the actual binaries, so the approach isn't entirely faulty. However, there are almost 60,000 executed blocks whose origin is between 0x400aa20000 and 0x407c8ae138. This region of memory is nominally unmapped, yet QEMU seems to be translating, and successfully executing, code here. The program appears to run correctly, so I am unsure where these instructions originate. I had initially thought it might be the vDSO, but the range appears to be much too large, and there are too many separate addresses. I looked at the preceding code for a couple of these blocks and it was in ld.so, but I can't say whether all the calls are generated there. I think it's possible that this is kernel code, but I'm not sure how to validate whether or not that is true. I'm at a loss as to how to approach this problem.
Is there a way to trace the provenance of these instructions, perhaps using the gdb stub or some other logging functionality?
By the time you read /proc/pid/maps, the corresponding modules may already have been unloaded. Running LD_DEBUG=files <your qemu command line> will print module loading info, including each module's load address and size. Search there for the missing code addresses.
Related
I have this program:
static int aux() {
return 1;
}
int _start(){
int a = aux();
return a;
}
When I compile it using GCC with flags -nostdlib -m32 -fpie and generate an ELF binary, I get the following assembly code:
00001000 <aux>:
1000: f3 0f 1e fb endbr32
1004: 55 push %ebp
1005: 89 e5 mov %esp,%ebp
1007: e8 2d 00 00 00 call 1039 <__x86.get_pc_thunk.ax>
100c: 05 e8 2f 00 00 add $0x2fe8,%eax
1011: b8 01 00 00 00 mov $0x1,%eax
1016: 5d pop %ebp
1017: c3 ret
00001018 <_start>:
1018: f3 0f 1e fb endbr32
101c: 55 push %ebp
101d: 89 e5 mov %esp,%ebp
101f: 83 ec 10 sub $0x10,%esp
1022: e8 12 00 00 00 call 1039 <__x86.get_pc_thunk.ax>
1027: 05 cd 2f 00 00 add $0x2fcd,%eax
102c: e8 cf ff ff ff call 1000 <aux>
1031: 89 45 fc mov %eax,-0x4(%ebp)
1034: 8b 45 fc mov -0x4(%ebp),%eax
1037: c9 leave
1038: c3 ret
00001039 <__x86.get_pc_thunk.ax>:
1039: 8b 04 24 mov (%esp),%eax
103c: c3 ret
I know that the get_pc_thunk function is used to implement position-independent code in x86, but in this case I can't understand why it is being used. My questions are:
The function returns the address of the next instruction in the eax register and, in both usages, an add instruction is used to make eax point to the GOT. Normally (at least when accessing global variables), this eax register would then immediately be used to access a global variable through the table. In this case, however, eax is completely ignored. What is going on?
I also don't understand why the get_pc_thunk is even present in the code, since both call instructions are using relative addresses. Since the addresses are relative, shouldn't they already be position-independent out of the box?
Thanks!
You haven't enabled optimisation, so GCC emits function prologues without regard to whether they are useful in the function in question.
To see the result of get_pc_thunk actually used, access a global variable.
To remove the useless calls to get_pc_thunk, enable optimisation, for example by adding -O2 to the GCC command line.
If, however, I move the aux() function to another compilation unit, get_pc_thunk is still called, even with -O2, and, again, its return value is ignored.
IIRC, an EBX=GOT pointer is assumed/required by the PLT itself, and the call has to go through the PLT because it's not known, when compiling this compilation unit, that an aux definition will be statically linked with it. (https://godbolt.org/z/Yere9o shows that effect for main with just a prototype for aux(), not a definition it can inline.)
With a "hidden" ELF visibility attribute, we can get that to go away, because the compiler then knows it doesn't need to indirect through the PLT: the target of a call rel32 will be known at static link time, without needing runtime relocation: https://godbolt.org/z/73dGKq
__attribute__((visibility("hidden"))) int aux(void);
int _start(){
int a = aux();
return a;
}
gcc10.1 -O2 -m32 -fpie
_start:
jmp aux
IMO it makes sense to have the call in object files generated for compilation units that call external functions, but I don't understand why the linker (or the 'flow') doesn't remove them in the final binary.
@felipeek: Good question. The linker doesn't know when it can relax a call foo@plt to call foo, because doing so also disables symbol interposition. Even if there is a definition of foo in this ELF shared object, a definition in one loaded earlier could override it / take precedence. I think this "problem" is due to the fact that PIE executables evolved out of a kind of hack: put an entry point in a shared object and the dynamic linker will be willing to run it. i.e. at an ELF level, PIE executables are the same as .so files, and -fpie and -fPIC look the same to the linker.
The linker can go the other way, though: if making a normal non-PIE executable (ELF type = EXEC), it can turn call foo into call foo@plt, but that PLT itself doesn't have to be PIE/PIC, so it doesn't need EBX=GOT.
Are we saying that all calls to other compilation units will invoke a totally unnecessary call in the final binary when PIE is required?
No, only ones in 32-bit PIE code where you fail to tell the compiler that it's an "internal" symbol using ELF "hidden" visibility. You can even have 2 names for the same symbol, one with hidden visibility, so you can make a function that libraries can resolve by name, but that you can still call from within the executable using simple call rel32 instead of clunky indirect calls via the PLT.
This is one of the downsides of PIE. Even in 64-bit code, without the attribute you get jmp aux#PLT. (Or with -fno-plt, an indirect call using RIP-relative addressing for the GOT entry.)
32-bit PIE really hurts performance, by something like 15% on average (measured a while ago on CPUs of the time; it could be somewhat different now). The effect is much smaller on x86-64, where RIP-relative addressing is available, like a couple of %. 32-bit absolute addresses no longer allowed in x86-64 Linux? has some links to more details.
I have the following code in .s file:
pushq $afterjmp
nop
afterjmp:
movl %eax, %edx
Its object file has the following:
20: 68 00 00 00 00 pushq $0x0
25: 90 nop
0000000000000026 <afterjmp>:
26: 89 c2 mov %eax,%edx
After linking, it becomes:
400572: 68 78 05 40 00 pushq $0x400578
400577: 90 nop
400578: 89 c2 mov %eax,%edx
How does the argument 0x0 to pushq at byte 20 of the object file get converted to 0x400578 in the final executable?
Which section of the object file contains this information?
You answered your own question: After linking....
Here is a good article:
Linkers and Loaders
In particular, note the section about "symbol relocation":
Relocation. Compilers and assemblers generate the object code for each
input module with a starting address of zero. Relocation is the
process of assigning load addresses to different parts of the program
by merging all sections of the same type into one section. The code
and data section also are adjusted so they point to the correct
runtime addresses.
There's no way to know the program address of "afterjmp" when a single object file is assembled. It's only when all the object files are linked into an executable image that the addresses (relative to offset "0") can be computed. That's one of the jobs of the linker: to keep track of "symbol references" (like "afterjmp") and compute the machine address ("symbol relocation").
I want to gather statistics on the memory bytes accessed by programs running on Linux (x86-64 architecture). I use the perf tool to dump a file like this:
: ffffffff81484700 <load2+0x484700>:
2.86 : ffffffff8148473b: 41 8b 57 04 mov 0x4(%r15),%edx
5.71 : ffffffff81484800: 65 8b 3c 25 1c b0 00 mov %gs:0xb01c,%edi
22.86 : ffffffff814848a0: 42 8b b4 39 80 00 00 mov 0x80(%rcx,%r15,1),%esi
25.71 : ffffffff814848d8: 42 8b b4 39 80 00 00 mov 0x80(%rcx,%r15,1),%esi
2.86 : ffffffff81484947: 80 bb b0 00 00 00 00 cmpb $0x0,0xb0(%rbx)
2.86 : ffffffff81484954: 83 bb 88 03 00 00 01 cmpl $0x1,0x388(%rbx)
5.71 : ffffffff81484978: 80 79 40 00 cmpb $0x0,0x40(%rcx)
2.86 : ffffffff8148497e: 48 8b 7c 24 08 mov 0x8(%rsp),%rdi
5.71 : ffffffff8148499b: 8b 71 34 mov 0x34(%rcx),%esi
5.71 : ffffffff814849a4: 0f af 34 24 imul (%rsp),%esi
My current method is to analyze the file and pick out all memory-access instructions, such as mov, cmp, etc., then add up the bytes accessed by each instruction; for example, mov 0x4(%r15),%edx adds 4 bytes.
I want to know whether it is possible to calculate this from the machine code instead, e.g. by analyzing "41 8b 57 04" I could also add 4 bytes. Because I am not familiar with x86-64 machine code, could anyone give any clues? Or is there a better way to gather these statistics? Thanks in advance!
See https://stackoverflow.com/a/20319753/120163 for information about decoding Intel instructions; in fact, you really need to refer to Intel reference manuals: http://download.intel.com/design/intarch/manuals/24319101.pdf If you only want to do this manually for a few instructions, you can just look up the data in these manuals.
If you want to automate the computation of instruction total-memory-access, you will need a function that maps instructions to the amount of data accessed. Since the instruction set is complex, the corresponding function will be complex and take you a long time to write from scratch.
My SO answer https://stackoverflow.com/a/23843450/120163 provides C code that maps x86-32 instructions to their length, given a buffer that contains a block of binary code. Such code is necessary if one is to start at some point in the object code buffer and simply enumerate the instructions being used. (This code has been used in production; it is pretty solid.) This routine was built basically by reading the Intel reference manual very carefully. For OP, it would have to be extended to x86-64, which shouldn't be very hard; mostly you have to account for the REX prefix bytes and some other differences from x86-32.
To solve OP's problem, one would also modify this routine to also return the number of byte reads by each individual instruction. This latter data also has to be extracted by careful inspection from the Intel reference manuals.
OP also has to worry about where he gets the object code from; if he doesn't run this routine in the address space of the object code itself, he will need to somehow get the object code from the .exe file. For that, he needs to build or run the equivalent of the Windows loader, and I'll bet that has a bunch of dark corners. Check out the format of object code files.
I have a machine instruction's address from the EIP register. This instruction writes to a certain area of memory, and I want to determine the address it writes to, but cannot.
Of course, I could read the bytes at the instruction's address, but the content is machine code like 0x8b0c4d8b..., which is unreadable to me (I cannot use debugging tools like gdb).
How do I get the address that a machine instruction will write to?
If you know the machine code EIP points to and you just want to disassemble it, do something like this (I took your example of 0x8b0c4d8b):
#create binary file
$ echo -en "\x8b\x0c\x4d\x8b" > foo.bin
#disassemble it
$ objdump -D -b binary -m i386 foo.bin
foo.bin: file format binary
Disassembly of section .data:
00000000 :
0: 8b .byte 0x8b
1: 0c 4d or $0x4d,%al
3: 8b .byte 0x8b
So, in this case, it doesn't change any memory location but if it did, you can easily see it from the assembly code.
Edit: It seems from the comments that you want to do this programmatically. Take a look at udis86. It allows examining operands of instructions. For ARM, see disarm.
Consider a sparse file with 1s written to a portion of the file.
I want to reclaim the actual space on disk for these 1s as I no longer need that portion of the sparse file. The portion of the file containing these 1s should become a "hole" as it was before the 1s were themselves written.
To do this, I cleared the region to 0s. This does not reclaim the blocks on disk.
How do I actually make the sparse file, well, sparse again?
This question is similar to this one but there is no accepted answer for that question.
Consider the following sequence of events run on a stock Linux server:
$ cat /tmp/test.c
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <string.h>
int main(int argc, char **argv) {
int fd;
char c[1024];
memset(c,argc==1,1024);
fd = open("test",O_CREAT|O_WRONLY,0777);
lseek(fd,10000,SEEK_SET);
write(fd,c,1024);
close(fd);
return 0;
}
$ gcc -o /tmp/test /tmp/test.c
$ /tmp/test
$ hexdump -C ./test
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00002710 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 |................|
*
00002b10
$ du -B1 test; du -B1 --apparent-size test
4096 test
11024 test
$ /tmp/test clear
$ hexdump -C ./test
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00002b10
$ du -B1 test; du -B1 --apparent-size test
4096 test
11024 test
# NO CHANGE IN SIZE.... HMM....
EDIT -
Let me further qualify that I don't want to rewrite files, copy files, etc. If it is not possible to somehow free previously allocated blocks in situ, so be it, but I'd like to determine if such is actually possible or not. It seems like "no, it is not" at this point. I suppose I'm looking for sys_punchhole for Linux (discussions of which I just stumbled upon).
It appears that Linux has added a syscall called fallocate for "punching holes" in files. The implementations in individual filesystems seem to focus on the ability to use this for pre-allocating a larger contiguous run of blocks.
There is also the posix_fallocate call, which covers only that pre-allocation case and is not usable for hole punching.
Right now it appears that only NTFS supports hole punching. This has historically been a problem across most filesystems. As far as I know, POSIX does not define an OS interface to punch holes, so none of the standard Linux filesystems support it. NetApp supports hole punching through Windows in its WAFL filesystem. There is a nice blog post about this here.
For your problem, as others have indicated, the only solution is to move the file leaving out blocks containing zeroes. Yeah its going to be slow. Or write an extension for your filesystem on Linux that does this and submit a patch to the good folks in the Linux kernel team. ;)
Edit: Looks like XFS supports hole-punching. Check this thread.
Another really twisted option can be to use a filesystem debugger to go and punch holes in all indirect blocks which point to zeroed out blocks in your file (maybe you can script that). Then run fsck which will correct all associated block counts, collect all orphaned blocks (the zeroed out ones) and put them in the lost+found directory (you can delete them to reclaim space) and correct other properties in the filesystem. Scary, huh?
Disclaimer: Do this at your own risk. I am not responsible for any data loss you incur. ;)
Ron Yorston offers several solutions, but they all involve either mounting the FS read-only (or unmounting it) while the sparsifying takes place, or making a new sparse file, copying across those chunks of the original that aren't just 0s, and then replacing the original file with the newly sparsified one.
It really depends on your filesystem though. We've already seen that NTFS handles this. I imagine that any of the other filesystems Wikipedia lists as handling transparent compression would do exactly the same - this is, after all, equivalent to transparently compressing the file.
After you have "zeroed" some region of the file, you have to tell the file system that this new region is intended to be a sparse region. So in the case of NTFS you have to call DeviceIoControl() for that region again. At least that's how I do it in my utility, "sparse_checker".
For me the bigger problem is how to unset the sparse region back :).
Regards
This way is cheap, but it works. :-P
Read in all the data past the hole you want, into memory (or another file, or whatever).
Truncate the file to the start of the hole (ftruncate is your friend).
Seek to the end of the hole.
Write the data back in.
Unmount your filesystem and edit the filesystem directly, in a way similar to debugfs or fsck. Usually you need a driver for each filesystem used.
Writing zeros (as in the referenced question) to the part you're done with seems like a logical thing to try. Here is a link to an MSDN article about NTFS sparse files that does just that to "release" the "unused" part. YMMV.
http://msdn.microsoft.com/en-us/library/ms810500.aspx