How to use LEA with brackets? - visual-c++

I have this ASM function that takes 4 arguments. The first two arguments are passed by value, the last two are passed by reference. So I'm using this:
PUSH EBP
MOV EBP, ESP
SUB ESP, 20
MOV EAX, [EBP+8]
MOV EBX, [EBP+12]
LEA ECX, [EBP+16]
LEA EDX, [EBP+20]
PUSH EDX
PUSH ECX
PUSH EBX
PUSH EAX
CALL Function
LEAVE
RETN 20
(Note that I'm using this code inside C++ using the VC's __asm statement).
But while searching about the use of LEA to pass arguments as pointers (aka by reference) I found:
[...] Note there are NO brackets in this line. Putting the square brackets around
something means "get the contents of", so you were effectively defeating the
LEA op. [...]
I want to pass both arguments at EBP+16 and EBP+20 by reference, but how can I do that if I can't use brackets? If I leave them out, the compiler reports an error (C2424).
Thanks in advance.

Try
MOV ECX, EBP+16
MOV EDX, EBP+20

lea performs only one kind of operation (it computes an address and never reads memory), so use whatever syntax your assembler accepts (for example, fasm always requires brackets, while masm requires omitting them around label arguments).
Note that:
Windows calling conventions require you to preserve ebx across calls;
I doubt that taking the addresses of the 3rd and 4th arguments was really your intention;
You can use the push dword [ebp+xxx] instruction.
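To make the difference between loading a value and computing an address concrete, here is a minimal sketch in Python that models a stack frame as a dictionary. The frame address, the argument values, and the helper names mov_load and lea are all made up for illustration; only the distinction between the two instructions comes from the discussion above.

```python
# Toy model of a 32-bit stack frame: address -> dword value.
# EBP and all stored values are hypothetical, chosen only for illustration.
EBP = 0x0012FF00
memory = {
    EBP + 8: 10,    # 1st argument (passed by value)
    EBP + 12: 20,   # 2nd argument (passed by value)
    EBP + 16: 30,   # 3rd argument slot
    EBP + 20: 40,   # 4th argument slot
}

def mov_load(address):
    """Models MOV reg, [address]: reads the dword stored at the address."""
    return memory[address]

def lea(address):
    """Models LEA reg, [address]: yields the address itself, no memory read."""
    return address

eax = mov_load(EBP + 8)   # MOV EAX, [EBP+8]  -> the argument's value
ecx = lea(EBP + 16)       # LEA ECX, [EBP+16] -> the address of the slot

print(eax)
print(hex(ecx))
print(memory[ecx])        # dereferencing ECX later yields the 3rd argument
```

In this model the brackets in the LEA form only describe how the address is computed; nothing is fetched from memory.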

Related

Incrementing one to a variable in IA32 Linux Assembly

I'm trying to add 1 to a variable in IA-32 assembly on Linux:
section .data
num: dd 0x1
section .text
global _start
_start:
add dword [num], 1
mov edx, 1
mov ecx, [num]
mov ebx,1
mov eax,4
int 0x80
mov eax,1
int 0x80
Not sure if it's possible to do.
Elsewhere I saw the following code:
mov eax, num
inc eax
mov num, eax
Is it possible to increment a value to a var without moving to a register?
If so, do I have any advantage moving the value to a register?
Is it possible to increment a value to a var without moving to a register?
Certainly: inc dword [num].
Like practically all x86 instructions, inc can take either a register or memory operand. See the instruction description at http://felixcloutier.com/x86/inc; the form inc r/m32 indicates that you can give an operand which is either a 32-bit register or 32-bit memory operand (effective address).
If you're interested in micro-optimizations, it turns out that add dword [num], 1 may still be somewhat faster, though one byte larger, on certain CPUs. The specifics are pretty complicated and you can find a very extensive discussion at "INC instruction vs ADD 1: Does it matter?". This is partly related to the slight difference in effect between the two: add sets or clears the carry flag according to whether a carry occurs, while inc always leaves the carry flag unchanged.
If so, do I have any advantage moving the value to a register?
No. That would make your code larger and probably slower.
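The carry-flag difference between add and inc mentioned above can be sketched with a small Python model. The function names add32 and inc32 are hypothetical, and the model tracks only the result and CF, not the other flags.

```python
MASK32 = 0xFFFFFFFF

def add32(value, operand):
    """Models ADD on a 32-bit operand: returns (result, new carry flag)."""
    total = value + operand
    return total & MASK32, total > MASK32

def inc32(value, carry_flag):
    """Models INC on a 32-bit operand: CF is passed through unchanged."""
    return (value + 1) & MASK32, carry_flag

# Both wrap 0xFFFFFFFF around to 0, but only ADD reports the carry.
add_result = add32(0xFFFFFFFF, 1)
inc_result = inc32(0xFFFFFFFF, carry_flag=False)

print(add_result)
print(inc_result)
```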

Should %rsp be aligned to 16-byte boundary before calling a function in NASM?

I saw the following rule in a NASM document:
The stack pointer %rsp must be aligned to a 16-byte boundary before making a call. Fine, but the process of making a call pushes the return address (8 bytes) on the stack, so when a function gets control, %rsp is not aligned. You have to make that extra space yourself, by pushing something or subtracting 8 from %rsp.
In my code, %rsp is only on an 8-byte boundary just before I call the function "inc" from "_start", which violates the rule described in that document. But everything actually works fine. So how should I understand this?
I built this under Ubuntu 20.04 LTS (x86_64).
Here is my snippet of NASM assembly code:
global _start
section .data
init:
db 0x2
section .rodata
codes:
db '0123456789abcdef'
section .text
inc:
mov rax, [rsp+8] ; read param from the stack;
add rax, 0x1
ret
print:
lea rsi, [codes + rax]
mov rax, 1
mov rdi, 1
mov rdx, 1
syscall
ret
_start:
; enable AC check;
pushf
or dword [rsp], 1<<18
popf
mov rdi, [init] ; move the first 8 bytes of init to %rdi;
push rdi ; %rsp -> 8 bytes;
call inc
pop r11 ; clean stack by the caller;
call print
mov rax, 60
xor rdi, rdi
syscall
The ABI is a set of rules for how functions should behave to be interoperable with each other. Each rule on one side is paired with an allowed assumption on the other. In this case, the rule about stack alignment for the caller is an allowed assumption about stack alignment for the callee. Since your inc function doesn't depend on 16-byte stack alignment, it's fine to call that particular function with a stack that's only 8-byte aligned.
If you're wondering why it didn't break when you enabled AC, that's because you're only loading 8-byte values from the stack, and the stack is still 8-byte aligned. If you did sub rsp, 4 or something to break 8-byte alignment too, then you would get a bus error.
Where the ABI becomes important is when the situation isn't one function you wrote yourself in assembly calling another function you wrote yourself in assembly. A function in someone else's library (including the C standard library), or one that you compiled from C instead of writing in assembly, is within its rights to do movaps [rsp - 24], xmm0 or something, which would break if you didn't properly align the stack before calling it.
Side note: the ABI also says how you're supposed to pass parameters (the calling convention), but you're just kind of passing them wherever. Again, that's fine from your own assembly, but your functions will definitely break if you try to call them from C.
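The alignment arithmetic behind this answer can be checked with a few lines of Python. This is only a sketch: the starting %rsp value is hypothetical, and it assumes %rsp is 16-byte aligned on entry to _start, as the x86-64 System V ABI specifies for process entry.

```python
def is_aligned(address, boundary):
    """True if address is a multiple of boundary."""
    return address % boundary == 0

rsp_at_start = 0x7FFFFFFFE000        # hypothetical, 16-byte aligned value
rsp_before_call = rsp_at_start - 8   # after "push rdi"
rsp_in_inc = rsp_before_call - 8     # "call inc" pushes the return address
param_address = rsp_in_inc + 8       # the "[rsp+8]" that inc reads

print(is_aligned(rsp_before_call, 16))    # 16-byte rule at the call site
print(is_aligned(param_address, 8))       # the 8-byte load that AC checks
print(is_aligned(param_address - 4, 8))   # what an extra "sub rsp, 4" would do
```

The call site breaks the 16-byte rule, yet the only stack access in inc is an 8-byte load from an 8-byte-aligned address, which is why the alignment check does not fire.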

Is there advantage of reading data without using pop operation?

According to this PDF document (Page-66), the following bunch of statements
mov eax, DWORD PTR SS:[esp]
mov eax, DWORD PTR SS:[esp + 4]
mov eax, DWORD PTR SS:[esp + 8]
are equivalent to the following bunch of statements:
pop eax
pop eax
pop eax
Is there any advantage of the former over the latter?
mov leaves the data on the stack, pop removes it so you can only read it once, and only in order. Data below ESP has to be considered "lost", unless you're using a calling convention / ABI that includes a red-zone below the stack pointer.
Data is usually still there below ESP, but asynchronous stuff like signal handlers, or a debugger evaluating a call fflush(0) in the context of your process, can step on it.
Also, pop modifies ESP, so each pop requires stack-unwind metadata (see footnote 1) in another section of the executable/library, for it to be fully ABI compliant with SEH on Windows or the i386 / x86-64 System V ABI on other OSes (which specifies that all functions need unwind metadata, even if they're not C++ functions that actually support propagating exceptions).
But if you're reading data for the last time, and you actually need it all, then yes pop is an efficient way to read it on modern CPUs (like Pentium-M and later, with a stack engine to handle the ESP updates without a separate uop.)
On older CPUs, like Pentium III, pop was actually slower than 3x mov + add esp,12 and compilers did generate code the way Brendan's answer shows.
void foo() {
asm("" ::: "ebx", "esi", "edi");
}
This function forces the compiler to save/restore 3 call-preserved registers (by declaring clobbers on them.) It doesn't actually touch them; the inline asm string is empty. But this makes it easy to see what compilers will do for saving/restoring. (Which is the only time they'll use pop normally.)
GCC's default (tune=generic) code-gen, or with -march=skylake for example, is like this (from the Godbolt compiler explorer)
foo: # gcc8.3 -O3 -m32
push edi
push esi
push ebx
pop ebx
pop esi
pop edi
ret
But telling it to tune for an old CPU without a stack engine makes it do this:
foo: # gcc8.3 -march=pentium3 -O3 -m32
sub esp, 12
mov DWORD PTR [esp], ebx
mov DWORD PTR [esp+4], esi
mov DWORD PTR [esp+8], edi
mov ebx, DWORD PTR [esp]
mov esi, DWORD PTR [esp+4]
mov edi, DWORD PTR [esp+8]
add esp, 12
ret
gcc thinks -march=pentium-m doesn't have a stack engine, or at least chooses not to use push/pop there. I think that's a mistake, because Agner Fog's microarch pdf definitely describes the stack engine as being present in Pentium-M.
On P-M and later, push/pop are single-uop instructions, with the ESP update handled outside the out-of-order backend, and for push the store-address+store-data uops are micro-fused.
On Pentium 3, they're 2 or 3 uops each. (Again, see Agner Fog's instruction tables.)
On in-order P5 Pentium, push and pop are actually fine. (But memory-destination instructions like add [mem], reg were generally avoided, because P5 didn't split them into uops to pipeline better.)
Mixing pop with direct references to [esp] will actually be potentially slower than just one or the other, on modern Intel CPUs, because it costs extra stack-sync uops.
Obviously writing EAX 3 times back to back means the first 2 loads are useless in both sequences.
See Extreme Fibonacci for an example of pop (1 uop, or like 1.1 uop with the stack sync uops amortized) being more efficient than lodsd (2 uops on Skylake) for reading through an array. (In evil code that assumes a large red-zone because it doesn't install signal handlers. Don't actually do this unless you know exactly what you're doing and when it will break; this is more of a silly computer tricks / extreme optimization for code-golf than anything that's practically useful.)
Footnote 1: The Godbolt compiler explorer normally filters out extra assembler directives, but if you uncheck that box you can see gcc's function that uses push/pop has .cfi_def_cfa_offset 12 after every push/pop.
pop ebx
.cfi_restore 3
.cfi_def_cfa_offset 12
pop esi
.cfi_restore 6
.cfi_def_cfa_offset 8
pop edi
.cfi_restore 7
.cfi_def_cfa_offset 4
The .cfi_restore 7 metadata directives have to be there regardless of push/pop vs. mov, because that lets stack unwinding restore call-preserved registers as it unwinds. (7 is the register number).
But for other uses of push/pop inside a function (like pushing args to a function call, or a dummy pop to remove it from the stack), you wouldn't have .cfi_restore, only metadata for the stack pointer changing relative to the stack frame.
Normally you don't worry about this in hand-written asm, but compilers have to get this right so there's a small extra cost to using push/pop in terms of total executable size. But only in parts of the file that aren't mapped into memory normally, and not mixed with code.
This:
pop eax
pop ebx
pop ecx
.. is sort of equivalent to this:
mov eax,[esp]
add esp,4
mov ebx,[esp]
add esp,4
mov ecx,[esp]
add esp,4
..which can be like this:
mov eax,[esp]  ;Do this instruction
add esp,4      ; ...and this instruction in parallel
               ;Stall until the previous instruction completes (and the value
mov ebx,[esp]  ; in ESP becomes known); then do this instruction
add esp,4      ; ...and this instruction in parallel
               ;Stall until the previous instruction completes (and the value
mov ecx,[esp]  ; in ESP becomes known); then do this instruction
add esp,4      ; ...and this instruction in parallel
For this code:
mov eax, [esp]
mov ebx, [esp + 4]
mov ecx, [esp + 8]
add esp,12
.. all of the instructions can happen in parallel (in theory).
Note: In practice all of the above depends on which CPU, etc.
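Leaving timing aside, the semantic difference between the two sequences in this thread can be sketched with a toy stack in Python. The addresses, values, and helper names are invented for illustration; the model says nothing about uops or stalls.

```python
# Toy stack: address -> dword. Addresses and values are invented.
stack = {0x1000: 11, 0x1004: 22, 0x1008: 33}

def pop(esp):
    """Models POP reg: load the dword at [esp], then advance ESP by 4."""
    return stack[esp], esp + 4

def mov_load(esp, offset):
    """Models MOV reg, [esp + offset]: load only, ESP is left unchanged."""
    return stack[esp + offset]

# Sequence 1: three loads at fixed offsets from an unchanged ESP.
esp = 0x1000
mov_values = [mov_load(esp, offset) for offset in (0, 4, 8)]
esp_after_mov = esp

# Sequence 2: three pops, each one depending on the ESP left by the previous.
esp = 0x1000
pop_values = []
for _ in range(3):
    value, esp = pop(esp)
    pop_values.append(value)
esp_after_pop = esp

print(mov_values, hex(esp_after_mov))
print(pop_values, hex(esp_after_pop))
```

Both sequences read the same three values, but only the pop sequence moves ESP past the data, so the values can no longer be treated as live stack contents afterwards.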

Manually Add Newline To Stack Variable In x86 Linux Assembly

I wrote a simple assembly program that gets two integers from the user via a prompt, multiplies them together and prints that out. I wanted to do this directly with sys_read and not scanf so I could manually convert the input to an integer after removing the LF.
Here's the full source: http://pastebin.com/utnjTvNZ
In particular, what I want to do now is manually add a newline to the result of the multiplication, which has been converted back to its ASCII char equivalent. Initially, I thought I could just left shift 16 bits and add 0xA, leaving me with, for example, 0x0034000A on the stack for 2*2 (0x0034 is "4" in ASCII chars), followed by a null terminator and a LF. However, the LF is printing before the result. I figured this was an endianness thing, so I tried the reverse (0x000A0034) and that just printed some other ASCII char instead.
So, simply, how do I properly push a newline to the stack so that this is printed with a newline following the number when using sys_write? What I'm missing is how strings are stored on the stack... which I can't test because normally you just create a variable and push the address onto the stack.
I'm aware some things in here could be done better, cleaner and up-to-standards and whatnot. I understand things intuitively so it's something I just need to do to better understand the stack and Linux system calls in general.
Okay, so to answer my own question, thanks to Jester's help: to add a newline to the 32-bit word I was displaying from memory, I had to understand endianness. Since I built for 32-bit, my program works on 32-bit words. Each word's bytes are written to memory "backwards" (least significant byte first), while the words themselves are still stored in "normal" order. For example, 0x0A290028 0x0A293928 prints (NULL)LF(9)LF. The bytes are backwards but the words are not. sys_write, since it just takes a void *buf and isn't aware of strings, simply reads bytes from the buffer in memory order and writes them out.
What I was able to do was simply left-shift my single-digit product, for example 0x00000034, by 8 bits. This left me with 0x00003400. To that, I could add 0x000A0000. This would result in 0x000A3400, and the number "4" being printed followed by a newline.
So, the new procedure looks like this:
multprint:
mov eax, sys_write ;4
mov ebx, stdout ;1
mov ecx, resultstr
mov edx, resultstrLen
dec edx
int 0x80
pop eax ;multiplier
pop ebx ;multiplicand
mul ebx
add eax, '0'
shl eax, 8 ;make room for () and LF
add eax, 0x0A290028
push eax
mov ecx, esp
;mov [num], eax ;use these two lines if I don't want to use the stack
;mov ecx, num
mov eax, sys_write
mov ebx, stdout
mov edx, 4
int 0x80
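The byte order that sys_write sees can be reproduced in Python with struct.pack. This is a sketch of the arithmetic in the procedure above for the 2*2 case; the variable names are mine, and the '<I' format stands in for pushing a little-endian dword onto the stack.

```python
import struct

product = 2 * 2                         # result of "mul ebx" for inputs 2 and 2

eax = product + ord('0')                # add eax, '0'        -> 0x00000034
eax = (eax << 8) & 0xFFFFFFFF           # shl eax, 8          -> 0x00003400
eax = (eax + 0x0A290028) & 0xFFFFFFFF   # add eax, 0x0A290028 -> 0x0A293428

# "push eax" stores the dword least significant byte first.
buffer = struct.pack('<I', eax)

print(hex(eax))
print(buffer)                           # the 4 bytes sys_write sends, in order
```

Reading the packed bytes from lowest address to highest gives '(', '4', ')', and LF, which is the order in which they are written out.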

In NASM, is MOV EBX, AX a valid instruction?

In NASM, is MOV EBX, AX a valid instruction?
Basically, can you move the contents of a small register into a register bigger than it?
That's not valid anywhere. To get the effect you want, do
MOVZX EBX, AX, or
MOVSX EBX, AX
depending on whether you want to zero or sign extend the source operand.
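The two extensions differ only when the top bit of the 16-bit source is set. A minimal Python sketch of the arithmetic follows; the function names are hypothetical, and the values are treated as unsigned bit patterns.

```python
def movzx_16_to_32(ax):
    """Models MOVZX r32, r16: the upper 16 bits become zero."""
    return ax & 0xFFFF

def movsx_16_to_32(ax):
    """Models MOVSX r32, r16: the upper 16 bits copy bit 15 of the source."""
    ax &= 0xFFFF
    if ax & 0x8000:
        return ax | 0xFFFF0000
    return ax

ax = 0x8001   # as a signed 16-bit value this is -32767
print(hex(movzx_16_to_32(ax)))
print(hex(movsx_16_to_32(ax)))
```

For a source such as 0x1234, where bit 15 is clear, both functions return the same value.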
