Lately, I've been writing some x86 assembly injections for the purpose of a game mod, but since most of my workflow has involved writing and assembling custom routines by hand, I've been looking to move towards a more robust solution.
Microsoft's inline assembler seems like a nice choice, but I've run into something of a limitation it seems to have.
Whenever I write an instruction that involves an immediate memory address (the game uses a fixed address space layout), the assembler silently converts it to an immediate value instead.
For example, in MASM:
mov ecx, 0xCCCCCCCC => B9 CC CC CC CC
mov ecx, [0xCCCCCCCC] => B9 CC CC CC CC*
* Should be: 8B 0D CC CC CC CC
In this case, both assembled instructions load ecx with the immediate value 0xCCCCCCCC, although the second one should fetch the value from the immediate address 0xCCCCCCCC.
Note that it's possible to use a named variable in this manner:
mov ecx, [myInt]
Which will assemble to an 8B 0D memory fetch instruction, but it also adds the operand to the module's relocation table and doesn't allow the specification of arbitrary addresses.
Trying to trick the assembler with something like
mov ecx, [myInt-myInt+0xCCCCCCCC]
Also results in the address being treated as an immediate value.
It is possible to go with:
mov ecx, 0xCCCCCCCC
mov ecx, [ecx]
Which will assemble properly and exhibit the correct behavior, but now I've bloated my injection with an extra 2-byte instruction (7 bytes total instead of the 6-byte memory load). Because I'm working under some rather tight spatial constraints, this isn't acceptable, and I'd rather not use a code cave where I don't have to.
The funny thing is that something in C like:
register int x;
x = *(int*)(0xCCCCCCCC);
Happily compiles to
mov ecx, [0xCCCCCCCC] => 8B 0D CC CC CC CC
It's a little odd to see a lower level language have more limitations placed on it than a higher level language. What I'm trying to do seems pretty reasonable to me, so does anyone know if MASM has some hidden way of using fixed immediate memory addresses when reading from memory?
I'm not sure if this works outside of the flat memory model, but I found that MASM distinguishes immediate addresses and immediate values as follows:
mov ecx, 0xCCCCCCCC => B9 CC CC CC CC
mov ecx, [0xCCCCCCCC] => B9 CC CC CC CC
mov ecx, ds:[0xCCCCCCCC] => 8B 0D CC CC CC CC
The first 2 instructions load ecx with the immediate value 0xCCCCCCCC. The last instruction loads ecx with the value at address 0xCCCCCCCC.
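For completeness, the same ds: override should also work in the store direction; a minimal sketch in MASM syntax (the address is just a placeholder, encodings in comments):
mov ecx, ds:[0CCCCCCCCh]   ; 8B 0D CC CC CC CC - load dword from the absolute address
mov ds:[0CCCCCCCCh], ecx   ; 89 0D CC CC CC CC - store dword to the absolute address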
In NASM syntax,
mov ecx, 0xCCCCCCCC ; B9 CC CC CC CC mov ecx, imm32
mov ecx, [0xCCCCCCCC] ; 8b 0d cc cc cc cc mov r32, [disp32]
mov ecx, [_start-_start + 0xCCCCCCCC]; 8b 0d cc cc cc cc same
Tested with nasm -felf and yasm -felf, on my GNU/Linux desktop.
I wonder if it's a bug that MASM assembles [0xCCCCCCCC] to an immediate instead of an effective address. Does it do the same when it's an operand to other instructions? e.g. is it an error with LEA?
Related
I believe I understand how the linux x86-64 ABI uses registers and stack to pass parameters to a function (cf. previous ABI discussion). What I'm confused about is if/what registers are expected to be preserved across a function call. That is, what registers are guaranteed not to get clobbered?
Here's the complete table of registers and their use from the documentation [PDF Link]:
r12, r13, r14, r15, rbx, rsp, rbp are the callee-saved registers - they have a "Yes" in the "Preserved across function calls" column.
Experimental approach: disassemble GCC code
Mostly for fun, but also as a quick verification that you understood the ABI right.
Let's try to clobber all registers with inline assembly to force GCC to save and restore them:
main.c
#include <inttypes.h>
uint64_t inc(uint64_t i) {
    __asm__ __volatile__(
        ""
        : "+m" (i)
        :
        : "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp",
          "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15",
          "ymm0",  "ymm1",  "ymm2",  "ymm3",  "ymm4",  "ymm5",  "ymm6",  "ymm7",
          "ymm8",  "ymm9",  "ymm10", "ymm11", "ymm12", "ymm13", "ymm14", "ymm15"
    );
    return i + 1;
}

int main(int argc, char **argv) {
    (void)argv;
    return inc(argc);
}
GitHub upstream.
Compile and disassemble:
gcc -std=gnu99 -O3 -ggdb3 -Wall -Wextra -pedantic -o main.out main.c
objdump -d main.out
Disassembly contains:
00000000000011a0 <inc>:
11a0: 55 push %rbp
11a1: 48 89 e5 mov %rsp,%rbp
11a4: 41 57 push %r15
11a6: 41 56 push %r14
11a8: 41 55 push %r13
11aa: 41 54 push %r12
11ac: 53 push %rbx
11ad: 48 83 ec 08 sub $0x8,%rsp
11b1: 48 89 7d d0 mov %rdi,-0x30(%rbp)
11b5: 48 8b 45 d0 mov -0x30(%rbp),%rax
11b9: 48 8d 65 d8 lea -0x28(%rbp),%rsp
11bd: 5b pop %rbx
11be: 41 5c pop %r12
11c0: 48 83 c0 01 add $0x1,%rax
11c4: 41 5d pop %r13
11c6: 41 5e pop %r14
11c8: 41 5f pop %r15
11ca: 5d pop %rbp
11cb: c3 retq
11cc: 0f 1f 40 00 nopl 0x0(%rax)
and so we clearly see that the following are pushed and popped:
rbx
r12
r13
r14
r15
rbp
The only missing one from the spec is rsp, but we expect the stack to be restored of course. Careful reading of the assembly confirms that it is maintained in this case:
sub $0x8, %rsp: allocates 8 bytes of stack space so that mov %rdi,-0x30(%rbp) can spill %rdi to memory, which is required by the inline assembly's +m constraint
lea -0x28(%rbp), %rsp restores %rsp to where it was after the 5 pushes that follow mov %rsp,%rbp, i.e. it undoes the sub so the matching pops line up
there are 6 pushes and 6 corresponding pops
no other instructions touch %rsp
Tested in Ubuntu 18.10, GCC 8.2.0.
The ABI specifies what a piece of standard-conforming software is allowed to expect. It is written primarily for authors of compilers, linkers and other language processing software. These authors want their compiler to produce code that will work properly with code that is compiled by the same (or a different) compiler. They all have to agree to a set of rules: how are formal arguments to functions passed from caller to callee, how are function return values passed back from callee to caller, which registers are preserved/scratch/undefined across the call boundary, and so on.
For example, one rule states that the generated assembly code for a function must save the value of a preserved register before changing the value, and that the code must restore the saved value before returning to its caller. For a scratch register, the generated code is not required to save and restore the register value; it can do so if it wants, but standard-conforming software is not allowed to depend upon this behavior (if it does it is not standard-conforming software).
If you are writing assembly code, you are responsible for playing by these same rules (you are playing the role of the compiler). That is, if your code changes a callee-preserved register, you are responsible for inserting instructions that save and restore the original register value. If your assembly code calls an external function, your code must pass arguments in the standard-conforming way, and it can depend upon the fact that, when the callee returns, preserved register values are in fact preserved.
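As a concrete illustration of that responsibility, here is a minimal sketch (NASM syntax, x86-64 System V; the function name and arguments are made up) of a hand-written routine that uses rbx and r12 and therefore saves and restores them itself:
global sum_two              ; hypothetical function: returns rdi + rsi
sum_two:
    push rbx                ; rbx is callee-saved: preserve it before use
    push r12                ; so is r12
    mov  rbx, rdi           ; first integer argument arrives in rdi
    mov  r12, rsi           ; second integer argument arrives in rsi
    lea  rax, [rbx + r12]   ; return value goes in rax
    pop  r12                ; restore in reverse order
    pop  rbx
    ret                     ; the caller sees rbx/r12 unchanged, result in rax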
The rules define how standards-conforming software can get along. However, it is perfectly legal to write (or generate) code that does not play by these rules! Compilers do this all the time, because they know that the rules don't need to be followed under certain circumstances.
For example, consider a C function named foo that is declared as follows, and never has its address taken:
static int foo(int x);
At compile-time, the compiler is 100% certain that this function can only be called by other code in the file(s) it is currently compiling. Function foo cannot be called by anything else, ever, given the definition of what it means to be static. Because the compiler knows all of the callers of foo at compile time, the compiler is free to use whatever calling sequence it wants (up to and including not making a call at all, that is, inlining the code for foo into the callers of foo).
As an author of assembly code, you can do this too. That is, you can implement a "private agreement" between two or more routines, as long as that agreement doesn't interfere with or violate the expectations of standards-conforming software.
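To make the "private agreement" idea concrete, here is a minimal sketch (NASM syntax, x86-64; the names are made up) of an internal helper that takes its argument in r11, which the System V ABI treats as a scratch register. This is only legal because every caller lives in the same file and knows about the convention:
helper_double:              ; private convention: input and output in r11
    add  r11, r11
    ret

caller:
    mov  r11, 21
    call helper_double
    mov  rax, r11           ; rax = 42, returned to standard-conforming callers the normal way
    ret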
I'm writing a C++ program, and decided that it would be more efficient to write a specific function in x86 assembly, due to its use of the carry flag. In the disassembly, I noticed that some instructions had been changed, causing my program to throw the exception: "Access violation reading location". Why are the instructions changing, and how can I prevent this?
Here is a snippet of my code:
XOR EBX, EBX ; 31 DB
MOV BL, DH ; 88 F3
MOV AH, BYTE PTR [ECX] ; 8A 21
The disassembler shows this:
xor bx, bx ; 66 31 DB
mov bl, dh ; 88 F3
mov ah, byte ptr [bx+di] ; 67 8A 21
You assembled in 16-bit mode and disassembled in 32-bit mode, making everything the opposite of what it should be. MASM isn't "changing" instructions, just assembling them for the mode you told it to assemble for.
e.g. in 16-bit mode, [ECX] needs a 67 address-size override prefix to encode a 32-bit addressing mode. When decoded in 32-bit mode, that same prefix overrides it to mean 16-bit. (And that bit-pattern means [bx+di]; 16-bit addressing modes can't use a SIB byte so all the different register combos are packed into the ModRM byte. [cx] isn't encodeable.)
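A quick way to see this is to assemble the same instruction for each mode; a minimal NASM sketch (encodings shown in comments):
bits 16
mov ah, byte [ecx]   ; 67 8A 21 - the 0x67 prefix is required to use a 32-bit addressing mode in 16-bit code
bits 32
mov ah, byte [ecx]   ;    8A 21 - no prefix needed; here a 0x67 prefix would instead select [bx+di]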
Also, if you think xor + mov is the most efficient way to zero-extend DH into EBX, have a look at movzx. That's more efficient on modern CPUs. (See https://agner.org/optimize/ and https://uops.info/).
Generally you want to avoid writing high-8 registers like AH; Haswell/Skylake have merging penalties when reading the full register, and AMD CPUs have a false dependency. Read the info and links on Why doesn't GCC use partial registers? carefully before you use them.
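Under those guidelines, a rough sketch of the same snippet for 32-bit mode might look like this (assuming the byte load doesn't really need to land in AH specifically):
movzx ebx, dh            ; 0F B6 DE - zero-extends DH into EBX, replacing the XOR + MOV pair
movzx eax, byte [ecx]    ; 0F B6 01 - loads the byte into a full register instead of writing AH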
Something has had me confused in x86 assembly for a while: how and when can NASM infer the size of an operation? Here's an example:
mov ebx, [eax]
Here we are moving the 4 bytes stored at the address held in eax into ebx. The size of the operation is inferred as 4 bytes because the register is 32 bits.
However, this operation doesn't get inferred and throws a compile error:
mov [eax], 123456
Of course the solution is this:
mov dword [eax], 123456
Which will move the 32-bit representation of the number 123456 into the 4 bytes at the address held in eax.
But this confuses me: surely it can see that eax is 32-bit, so shouldn't it assume I want to store a 32-bit value without me having to specify dword after the mov?
Surely if I wanted to put the 16-bit representation of 12345 (a smaller number, so it fits in 16 bits) into eax, I would do this:
mov ax, 12345
The operand-size would be ambiguous (and so must be specified) for any instruction with a memory destination and an immediate source. (Neither operand actually being a register, even if using one or more in an addressing mode.)
Address-size and operand-size are separate attributes of an instruction.
Quoting what you said in a comment on another answer, since I think this gets at the core of your confusion:
I would expect mov [eax], 1 to set the 4 bytes held in memory address eax to the 32 bit representation of 1
The BYTE/WORD/DWORD [PTR] annotation is not about the size of the memory address; it's about the size of the variable in memory at that address. Assuming flat 32-bit addressing, addresses are always four bytes long, and therefore must go in Exx registers. So, when the source operand is an immediate value, the dword (or whatever) annotation on the destination operand is the only way the assembler can know whether it's supposed to modify 1, 2, or 4 bytes of RAM.
Perhaps it will help if I demonstrate the effect of these annotations on machine code:
$ objdump -d -Mintel test.o
...
0: c6 00 01 mov BYTE PTR [eax], 0x1
3: 66 c7 00 01 00 mov WORD PTR [eax], 0x1
8: c7 00 01 00 00 00 mov DWORD PTR [eax], 0x1
(I've adjusted the spacing a bit compared to how objdump actually prints it.)
Take note of two things: (1) the three different size specifiers produce three different machine instructions, and (2) using a different specifier changes the length of the immediate operand as emitted into the machine code.
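For reference, here is a minimal NASM source (a hypothetical test.asm, assembled with nasm -felf32) that reproduces the listing above:
bits 32
mov byte  [eax], 1
mov word  [eax], 1
mov dword [eax], 1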
mov [eax], 123456
This instruction would use immediate addressing for the source operand and indirect addressing for the destination operand, i.e. place the decimal 123456 into the memory address stored in register eax, as you pointed out. But the memory location to which eax points does not itself have to be 32 bits in size, so NASM cannot infer the size of the destination operand. The size of the pointer in register eax is 32 bits; the size of the thing it points to is unknown.
Address-size and operand-size are totally separate attributes of an instruction.
Surely if I wanted to put the 16 bit representation of 12345 into eax I would do this:
mov ax, 12345
Yes, but here you are using immediate addressing for the source operand and register addressing for the destination operand. The assembler can infer the amount of data you wish to move from the size of the destination register (16 bits in the case of the AX register, leaving the upper 2 bytes of the full EAX unmodified, so you're not actually setting 32-bit EAX to that value).
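A small sketch (NASM syntax, 32-bit mode) showing the difference in both encoding and effect:
mov ax, 12345        ; 66 B8 39 30       - writes only AX; bits 31..16 of EAX keep their old value
mov eax, 12345       ; B8 39 30 00 00    - writes the whole of EAX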
compile error
I think you meant assembly error :)
In your first case it can determine it without problems, since EBX is a 32-bit register. But in the second one you're using EAX as an address, not as a destination register, so the nasm developers took the safe route and made the developer choose the size.
If you did mov [eax], 1, what could nasm determine from that? Do you want to set a byte, a 16-bit word, or a 32-bit dword at that address to 1? It is totally unknown. This is why it's better to force the developer to state the size.
It would be entirely different if you said mov eax, 123456 since then the destination is a register.
In this disassembly from VC++, a function call is being made. The compiler MOVs the local pointers into registers before pushing them:
memcpy( nodeNewLocation, pNode, sizeCurrentNode );
0041A5DA 8B 45 F8 mov eax,dword ptr [ebp-8]
0041A5DD 50 push eax
0041A5DE 8B 4D 0C mov ecx,dword ptr [ebp+0Ch]
0041A5E1 51 push ecx
0041A5E2 8B 55 D4 mov edx,dword ptr [ebp-2Ch]
0041A5E5 52 push edx
0041A5E6 E8 67 92 FF FF call 00413852
0041A5EB 83 C4 0C add esp,0Ch
Why not just push them directly? ie
push dword ptr [ebp-8]
Also, if you are going to use a separate instruction anyway, why not do the store manually? In other words, instead of doing "push eax" above, do
mov [esp], eax
Etc. The advantage of this is that after doing the 3 movs you can do a single subtract to set the new stack pointer, instead of implicitly subtracting three times with the pushes.
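To spell that out, here is a sketch in Intel syntax of the mov-based alternative, reserving the space up front rather than after the stores (roughly the shape that GCC's -maccumulate-outgoing-args produces):
sub  esp, 12          ; one explicit stack-pointer adjustment
mov  [esp+8], eax
mov  [esp+4], ecx
mov  [esp],   edx
call _memcpy
add  esp, 12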
UPDATE---Release version
This is the same code compiled for release:
; 741 : memcpy( nodeNewLocation, pNode, sizeCurrentNode );
00087 8b 45 f8 mov eax, DWORD PTR _sizeCurrentNode$[ebp]
0008a 8b 7b 04 mov edi, DWORD PTR [ebx+4]
0008d 50 push eax
0008e 56 push esi
0008f 57 push edi
00090 e8 00 00 00 00 call _memcpy
00095 83 c4 0c add esp, 12 ; 0000000cH
Definitely more efficient than the debug version, but it is still doing a MOV/PUSH combo.
This is an optimization. It is explicitly mentioned in the Intel processor manuals, volume 4, section 12.3.3.6:
In Intel Atom microarchitecture, using PUSH/POP instructions to manage stack space and address adjustment between function calls/returns will be more optimal than using ENTER/LEAVE alternatives. This is because PUSH/POP will not need MSROM flows and stack pointer address update is done at AGU.

When a callee function need to return to the caller, the callee could issue POP instruction to restore data and restore the stack pointer from the EBP.

Assembly/Compiler Coding Rule 19. (MH impact, M generality) For Intel Atom processors, favor register form of PUSH/POP and avoid using LEAVE; Use LEA to adjust ESP instead of ADD/SUB.
The rest of the manual isn't that clear about the reason, but it does mention a possible 3 cycle AGU stall on implicit ESP adjustments.
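For what it's worth, the LEA suggestion from that coding rule looks like this (sketch):
lea  esp, [esp+12]    ; adjusts ESP at the AGU and leaves EFLAGS untouched
; instead of:
add  esp, 12          ; same adjustment, but executes as a normal ALU op and writes the flags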
I suspect it only does it in debug builds, or possibly in some situations where it's warranted by pipelining or other considerations (e.g. it could put a parameter into esi and use it after the call). I've looked into some binaries, and MSVC definitely does use such pushes:
push ebx ; mthd
push dword ptr [ebp+place+4]
push dword ptr [ebp+place] ; pos
push [ebp+filedes] ; fh
call __lseeki64_nolock
(code from the CRT)
As for the second question, instructions addressing esp are longer than pushes: "push eax" is one byte while "mov [esp-8], eax" is four bytes. In fact, this approach (mov instead of push) is used by GCC by default since a couple versions ago (option -maccumulate-outgoing-args) and it has led to notable increases in code size. Supposedly it makes code faster but I'm unconvinced.
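To put numbers on the size difference (Intel-syntax sketch, encodings in comments):
push eax              ; 50          - 1 byte, and it adjusts ESP implicitly
mov  [esp+8], eax     ; 89 44 24 08 - 4 bytes, and still needs a separate sub/add of ESP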
I actually figured out the reason for it. It has to do with the way instructions are pipelined on the Pentium MMX. There are two pipelines, U and V, which allows MMX processors to process 2 instructions at a time IF they are pairable. PUSHs are not pairable with one another, but they are pairable with MOVs. So, if you write:
mov eax, [indirect]
mov esi, [indirect]
push eax
push esi
then, what happens is that instructions #1 and #3 get paired and #2 and #4 get paired, so, effectively, these four instructions run in the same number of cycles as a single mov/push, and a single mov/push is faster than two push [indirect]s. This exact case is described in detail in Section 4.3, p. 41, Examples 4.11a and 4.11b, of the microarchitecture optimization guide by Agner Fog, which is widely available on the internet.