I'm writing a C++ program, and decided that it would be more efficient to write a specific function in x86 assembly, due to its use of the carry flag. In the disassembly, I noticed that some instructions had been changed, causing my program to throw the exception: "Access violation reading location". Why are instructions changing, and how can I prevent this?
Here is a snippet of my code:
XOR EBX, EBX ; 31 DB
MOV BL, DH ; 88 F3
MOV AH, BYTE PTR [ECX] ; 8A 21
The disassembler shows this:
xor bx, bx ; 66 31 DB
mov bl, dh ; 88 F3
mov ah, byte ptr [bx+di] ; 67 8A 21
You assembled in 16-bit mode and disassembled in 32-bit mode, making everything the opposite of what it should be. MASM isn't "changing" instructions, just assembling them for the mode you told it to assemble for.
For example, in 16-bit mode, [ECX] needs a 67 address-size override prefix to encode a 32-bit addressing mode. When decoded in 32-bit mode, that same prefix overrides it to mean 16-bit. (And that bit-pattern means [bx+di]; 16-bit addressing modes can't use a SIB byte, so all the different register combos are packed into the ModRM byte. [cx] isn't encodable.)
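To make the prefix flip concrete, here is a minimal Python sketch (not a real disassembler) of how one ModRM byte decodes under the two address sizes. The function names are made up for illustration, and only memory operands with mod = 00 are handled.

```python
# ModRM r/m tables for mod == 00 (no displacement, apart from the
# disp16/disp32 special cases). Names here are illustrative only.
RM16 = ["bx+si", "bx+di", "bp+si", "bp+di", "si", "di", "disp16", "bx"]
RM32 = ["eax", "ecx", "edx", "ebx", "SIB", "disp32", "esi", "edi"]


def effective_address_size(mode_bits, has_67_prefix):
    """The 67 prefix selects the *other* address size for the current mode."""
    if mode_bits not in (16, 32):
        raise ValueError("this sketch only models 16- and 32-bit modes")
    if has_67_prefix:
        return 32 if mode_bits == 16 else 16
    return mode_bits


def decode_mem_operand(modrm, mode_bits, has_67_prefix=False):
    """Decode the memory operand of a ModRM byte whose mod field is 00."""
    if modrm >> 6 != 0:
        raise ValueError("this sketch only handles mod == 00")
    size = effective_address_size(mode_bits, has_67_prefix)
    table = RM16 if size == 16 else RM32
    return "[" + table[modrm & 7] + "]"


# The bytes 67 8A 21 from the question: prefix 67, opcode 8A, ModRM 21.
print(decode_mem_operand(0x21, mode_bits=16, has_67_prefix=True))  # as assembled
print(decode_mem_operand(0x21, mode_bits=32, has_67_prefix=True))  # as executed
```

Read as 16-bit code, the operand is [ecx]; read as 32-bit code, the very same bytes name [bx+di].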
Also, if you think xor + mov is the most efficient way to zero-extend DH into EBX, have a look at movzx. That's more efficient on modern CPUs. (See https://agner.org/optimize/ and https://uops.info/).
Generally you want to avoid writing high-8 registers like AH; Haswell/Skylake have merging penalties when reading the full register, and AMD CPUs have a false dependency. Read the info and links on Why doesn't GCC use partial registers? carefully before you use them.
How can I load a single byte from address? I thought it would be something like this:
mov rax, byte[rdi]
mov al, [rdi]
This merges a byte into the low byte of RAX, leaving the upper bytes unmodified.
Or better, avoid a false dependency on the old value of RAX by zero-extending into a 32-bit register (and thus implicitly to 64 bits) with MOVZX:
movzx eax, byte [rdi] ; most efficient way to load one byte on modern x86
Or if you want sign-extension into a wider register, use MOVSX.
movsx eax, byte [rdi] ; sign extend to 32-bit, zero-extend to 64
movsx rax, byte [rdi] ; sign extend to 64-bit
(On some CPUs, MOVSX is just as efficient as MOVZX, handled right in a load port without even needing an ALU uop. https://uops.info. But there are some where MOVZX loads are cheaper than MOVSX, so prefer MOVZX if you don't care about the upper bytes and really just want to avoid partial-register shenanigans.)
The MASM equivalent replaces byte with byte ptr.
A mov load doesn't need a size specifier (al destination implies byte operand-size). movzx always does for a memory source because a 32-bit destination doesn't disambiguate between 8 vs. 16-bit memory sources.
The AT&T equivalent is movzbl (%rdi), %eax (with movzb specifying that we zero-extend a byte, the l specifying 32-bit destination size.)
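As a rough illustration of what each load leaves in RAX, here is a small Python model of the four instructions above. The helper names are hypothetical, and the model only tracks the register value, not timing or dependencies.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def _sign_extend_byte(byte):
    """Interpret the low 8 bits as a signed value (Python int)."""
    value = byte & 0xFF
    return value - 0x100 if value & 0x80 else value


def mov_al(old_rax, byte):
    """mov al, [mem]: replace only the low 8 bits; the rest of RAX is kept."""
    return (old_rax & ~0xFF & MASK64) | (byte & 0xFF)


def movzx_eax(byte):
    """movzx eax, byte [mem]: zero-extend; writing EAX zeroes bits 32..63."""
    return byte & 0xFF


def movsx_eax(byte):
    """movsx eax, byte [mem]: sign-extend to 32 bits; bits 32..63 are zeroed."""
    return _sign_extend_byte(byte) & MASK32


def movsx_rax(byte):
    """movsx rax, byte [mem]: sign-extend all the way to 64 bits."""
    return _sign_extend_byte(byte) & MASK64


for name, value in [
    ("mov al", mov_al(0x1122334455667788, 0xFF)),
    ("movzx eax", movzx_eax(0xFF)),
    ("movsx eax", movsx_eax(0xFF)),
    ("movsx rax", movsx_rax(0xFF)),
]:
    print(f"{name:10s} -> {value:#018x}")
```

Only the mov al case depends on the old value of RAX, which is the false dependency that the MOVZX form avoids.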
I'm trying to add 1 to a variable in IA-32 assembly on Linux:
section .data
num: dd 0x1
section .text
global _start
_start:
add dword [num], 1
mov edx, 1
mov ecx, [num]
mov ebx,1
mov eax,4
int 0x80
mov eax,1
int 0x80
Not sure if it's possible to do.
Elsewhere I saw the following code:
mov eax, num
inc eax
mov num, eax
Is it possible to increment a value to a var without moving to a register?
If so, do I have any advantage moving the value to a register?
Is it possible to increment a value to a var without moving to a register?
Certainly: inc dword [num].
Like practically all x86 instructions, inc can take either a register or memory operand. See the instruction description at http://felixcloutier.com/x86/inc; the form inc r/m32 indicates that you can give an operand which is either a 32-bit register or 32-bit memory operand (effective address).
If you're interested in micro-optimizations, it turns out that add dword [num], 1 may still be somewhat faster, though one byte larger, on certain CPUs. The specifics are pretty complicated and you can find a very extensive discussion at INC instruction vs ADD 1: Does it matter?. This is partly related to the slight difference in effect between the two, which is that add will set or clear the carry flag according to whether a carry occurs, while inc always leaves the carry flag unchanged.
If so, do I have any advantage moving the value to a register?
No. That would make your code larger and probably slower.
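The carry-flag difference between add and inc mentioned above can be sketched in a few lines of Python. This is only a model of the result and of CF for 32-bit operands, with hypothetical function names; the other flags are ignored.

```python
MASK32 = 0xFFFFFFFF


def add32(value, addend):
    """Model of add: returns (result, CF). CF is rewritten from the carry-out."""
    total = (value & MASK32) + (addend & MASK32)
    return total & MASK32, total > MASK32


def inc32(value, carry_flag):
    """Model of inc: returns (result, CF). CF passes through unchanged."""
    return (value + 1) & MASK32, carry_flag


# Wrapping 0xFFFFFFFF around to 0 sets CF with add, but not with inc.
print(add32(0xFFFFFFFF, 1))
print(inc32(0xFFFFFFFF, carry_flag=False))
```

So code that later reads CF (adc, jc and so on) has to use add, while code that wants to keep an earlier carry alive across the increment can use inc.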
Yesterday I was looking at some 32 bit code generated by VC++ 2010 (most probably; don't know about the specific options, sorry) and I was intrigued by a curious recurring detail: in many functions, it zeroed out ebx in the prologue, and it always used it like a "zero register" (think $zero on MIPS). In particular, it often:
used it to zero out memory; this is not unusual, as the encoding for a mov mem,imm is 1 to 4 bytes bigger than mov mem,reg (the full immediate value size has to be encoded even for 0), but usually (gcc) the necessary register is zeroed out "on demand", and kept for more useful purposes otherwise;
used it for compares against zero - as in cmp reg,ebx. This is what struck me as really unusual, as it should be exactly the same as test reg,reg, but adds a dependency on an extra register. Now, keep in mind that this happened in non-leaf functions, with ebx being often pushed (by the callee) on and off the stack, so I would not trust this dependency to be always completely free. It also used test reg,reg in the exact same fashion (test/cmp => jg).
Most importantly, registers on "classic" x86 are a scarce resource; if you start having to spill registers, you waste a lot of time for no good reason. Why waste one throughout the function just to keep a zero in it? (Still, thinking about it, I don't remember seeing much register spillage in functions that used this "zero-register" pattern.)
So: what am I missing? Is it a compiler blooper or some incredibly smart optimization that was particularly interesting in 2010?
Here's an excerpt:
; standard prologue: ebp/esp, SEH, overflow protection, ... then:
xor ebx, ebx
mov [ebp+4], ebx ; zero out some locals
mov [ebp], ebx
call function_1
xor ecx, ecx ; ebx _not_ used to zero registers
cmp eax, ebx ; ... but used for compares?! why not test eax,eax?
setnz cl ; what? it goes through cl to check if eax is not zero?
cmp ecx, ebx ; still, why not test ecx,ecx?
jnz function_body
push 123456
call throw_something
function_body:
mov edx, [eax]
mov ecx, eax ; it's not like it was interested in ecx anyway...
mov eax, [edx+0Ch]
call eax ; virtual method call; ebx is preserved but possibly pushed/popped
lea esi, [eax+10h]
mov [ebp+0Ch], esi
mov eax, [ebp+10h]
mov ecx, [eax-0Ch]
xor edi, edi ; again, registers are zeroed as usual
mov byte ptr [ebp+4], 1
mov [ebp+8], ecx
cmp ecx, ebx ; why not test ecx,ecx?
jg somewhere
label1:
lea eax, [esi-10h]
mov byte ptr [ebp+4], bl ; ok, uses bl to write a zero to memory
lea ecx, [eax+0Ch]
or edx, 0FFFFFFFFh
lock xadd [ecx], edx
dec edx
test edx, edx ; now it's using the regular test reg,reg!
jg somewhere_else
Notice: an earlier version of this question said that it used mov reg,ebx instead of xor ebx,ebx; this was just me not remembering stuff correctly. Sorry if anybody put too much thought trying to understand that.
Everything you commented on as odd looks sub-optimal to me. test eax,eax sets all flags (except AF) the same as cmp against zero, and is preferred for performance and code-size.
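As a sanity check on that claim, here is a small Python model of the flags written by cmp (a subtraction) and test (an AND) on 32-bit operands. The function names are mine, and AF is left out, since it is the one flag where the two differ.

```python
MASK32 = 0xFFFFFFFF


def _result_flags(result):
    """ZF, SF and PF depend only on the 32-bit result."""
    return {
        "ZF": result == 0,
        "SF": bool(result >> 31),
        "PF": bin(result & 0xFF).count("1") % 2 == 0,  # even parity of low byte
    }


def flags_after_cmp(a, b):
    """Flags after cmp a, b (computes a - b and discards the result)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    flags = _result_flags(result)
    flags["CF"] = a < b  # unsigned borrow
    flags["OF"] = bool(((a ^ b) & (a ^ result)) >> 31)  # signed overflow
    return flags


def flags_after_test(a, b):
    """Flags after test a, b (computes a & b and discards the result)."""
    flags = _result_flags(a & b & MASK32)
    flags["CF"] = False  # test always clears CF and OF
    flags["OF"] = False
    return flags


samples = [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]
print(all(flags_after_cmp(x, 0) == flags_after_test(x, x) for x in samples))
```

For every sample, cmp x, 0 and test x, x agree on ZF, SF, PF, CF and OF, so any conditional jump that follows behaves the same.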
On P6 (PPro through Nehalem), reading long-dead registers is bad because it can lead to register-read stalls. P6 cores can only read 2 or 3 not-recently-modified architectural registers from the permanent register file per clock (to fetch operands for the issue stage: the ROB holds operands for uops, unlike on SnB-family where it only holds references to the physical register file).
Since this is from VS2010, Sandybridge wasn't released yet, so it should have put a lot of weight on tuning for Pentium II/III, Pentium-M, Core2, and Nehalem where reading "cold" registers is a possible bottleneck.
IDK if anything like this ever made sense for integer regs, but I don't know much about optimizing for CPUs older than P6.
The cmp / setnz / cmp / jnz sequence looks particularly braindead. Maybe it came from a compiler-internal canned sequence for producing a boolean value from something, and it failed to optimize a test of the boolean back into just using the flags directly? That still doesn't explain the use of ebx as a zero-register, which is also completely useless there.
Is it possible that some of that was from inline-asm that returned a boolean integer (using a silly sequence that wanted a zero in a register)?
Or maybe the source code was comparing two unknown values, and it was only after inlining and constant-propagation that it turned into a compare against zero? Which MSVC failed to optimize fully, so it still kept 0 as a constant in a register instead of using test?
(the rest of this was written before the question included code).
Sounds weird, or like a case of CSE / constant-hoisting run amok. i.e. treating 0 like any other constant that you might want to load once and then reg-reg copy throughout the function.
Your analysis of the data-dependency behaviour is correct: moving from a register that was zeroed a while ago essentially starts a new dependency chain.
When gcc wants two zeroed registers, it often xor-zeroes one and then uses a mov or movdqa to copy to the other.
This is sub-optimal on Sandybridge where xor-zeroing doesn't need an execution port, but a possible win on Bulldozer-family where mov can run on the AGU or ALU, but xor-zeroing still needs an ALU port.
For vector moves, it's a clear win on Bulldozer: handled in register rename with no execution unit. But xor-zeroing of an XMM or YMM register still needs an execution port on Bulldozer-family (or two for ymm, so always use xmm with implicit zero-extension).
Still, I don't think that justifies tying up a register for the duration of a whole function, especially not if it costs extra saves/restores. And not for P6-family CPUs where register-read stalls are a thing.
Something has confused me in x86 assembly for a while: how and when can NASM infer the size of an operation? Here's an example:
mov ebx, [eax]
Here we are moving the 4 bytes stored at the address held in eax into ebx. The size of the operation is inferred as 4 bytes because the register is 32 bits.
However, this operation doesn't get inferred and throws a compile error:
mov [eax], 123456
Of course the solution is this:
mov dword [eax], 123456
Which will move the 32-bit representation of the number 123456 into the bytes at the address held in eax.
But this confuses me, surely it can see eax is 32 bit, so shouldn't it assume I want to store it as a 32 bit value without me having to specify dword after the mov?
Surely if I wanted to put the 16 bit representation of 12345 (smaller number to fit in 16 bits) into eax I would do this:
mov ax, 12345
The operand-size would be ambiguous (and so must be specified) for any instruction with a memory destination and an immediate source. (Neither operand actually being a register, even if using one or more in an addressing mode.)
Address-size and operand-size are separate attributes of an instruction.
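The rule can be sketched as a toy size-inference function in Python. This is not how NASM is implemented; the register table is deliberately tiny, the operand syntax is simplified, and instructions like movzx that have two different operand sizes are out of scope.

```python
# Hypothetical, heavily simplified model of operand-size inference.
REGISTER_BITS = {
    "al": 8, "ah": 8, "bl": 8, "dh": 8,
    "ax": 16, "bx": 16,
    "eax": 32, "ebx": 32, "ecx": 32,
}
SIZE_KEYWORDS = {"byte": 8, "word": 16, "dword": 32}


def infer_operand_size(dst, src):
    """Return the operand size in bits, or raise if nothing pins it down."""
    for operand in (dst, src):
        parts = operand.strip().lower().split()
        if not parts:
            continue
        if len(parts) == 1 and parts[0] in REGISTER_BITS:
            return REGISTER_BITS[parts[0]]  # a register operand fixes the size
        if parts[0] in SIZE_KEYWORDS:
            return SIZE_KEYWORDS[parts[0]]  # explicit size on a memory operand
    # Registers inside [...] only describe the address, not the data.
    raise ValueError("operand size is ambiguous: " + dst + ", " + src)


print(infer_operand_size("ebx", "[eax]"))           # register destination
print(infer_operand_size("dword [eax]", "123456"))  # explicit size keyword
print(infer_operand_size("ax", "12345"))            # 16-bit register
```

Calling infer_operand_size("[eax]", "123456") raises, because eax appears only inside the addressing mode and says nothing about how many bytes to store.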
Quoting what you said in a comment on another answer, since I think this gets at the core of your confusion:
I would expect mov [eax], 1 to set the 4 bytes held in memory address eax to the 32 bit representation of 1
The BYTE/WORD/DWORD [PTR] annotation is not about the size of the memory address; it's about the size of the variable in memory at that address. Assuming flat 32-bit addressing, addresses are always four bytes long, and therefore must go in Exx registers. So, when the source operand is an immediate value, the dword (or whatever) annotation on the destination operand is the only way the assembler can know whether it's supposed to modify 1, 2, or 4 bytes of RAM.
Perhaps it will help if I demonstrate the effect of these annotations on machine code:
$ objdump -d -Mintel test.o
...
0: c6 00 01 mov BYTE PTR [eax], 0x1
3: 66 c7 00 01 00 mov WORD PTR [eax], 0x1
8: c7 00 01 00 00 00 mov DWORD PTR [eax], 0x1
(I've adjusted the spacing a bit compared to how objdump actually prints it.)
Take note of two things: (1) the three different size annotations produce three different machine instructions, and (2) changing the annotation changes the length of the immediate source operand as emitted into the machine code.
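For readers who want to reproduce those bytes, here is a minimal Python sketch that emits the three encodings shown in the objdump listing. It assumes 32-bit mode, an unsigned immediate, and the fixed destination [eax]; the function name is hypothetical.

```python
def encode_mov_mem_eax_imm(size_bits, imm):
    """Encode `mov <size> [eax], imm` as 32-bit code (unsigned immediates only)."""
    modrm = 0x00  # mod=00, reg=/0, rm=000 -> [eax]
    if size_bits == 8:
        return bytes([0xC6, modrm]) + imm.to_bytes(1, "little")
    if size_bits == 16:
        # 66 is the operand-size prefix: 16-bit operand in 32-bit mode.
        return bytes([0x66, 0xC7, modrm]) + imm.to_bytes(2, "little")
    if size_bits == 32:
        return bytes([0xC7, modrm]) + imm.to_bytes(4, "little")
    raise ValueError("size_bits must be 8, 16 or 32")


for bits in (8, 16, 32):
    print(bits, encode_mov_mem_eax_imm(bits, 1).hex(" "))
```

The immediate grows from one to two to four bytes, which is exactly the information the assembler cannot guess from [eax] alone.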
mov [eax], 123456
This instruction would use immediate addressing for the source operand and indirect addressing for the destination operand, i.e. it would place the decimal value 123456 into memory at the address stored in register eax, as you pointed out. But the memory location that eax points to does not itself have to be 32 bits in size, so NASM cannot infer the size of the destination operand. Only the pointer in register eax is known to be 32 bits.
Address-size and operand-size are totally separate attributes of an instruction.
Surely if I wanted to put the 16 bit representation of 12345 into eax I would do this:
mov ax, 12345
Yes but here you are using immediate addressing for the source operand and register addressing for the destination operand. The assembler can infer the amount of data you wish to move from the size of the destination register (16 bits in the case of the AX register, leaving the upper 2 bytes of the full EAX unmodified so you're not actually setting 32-bit EAX to that value).
compile error
I think you meant assembly error :)
In your first case it can determine the size without problems, since EBX is a 32-bit register. But in the second one you're using EAX as an address, not as a destination register, so the NASM developers took the safe route and made the developer choose the size.
If you did mov [eax], 1, what could nasm determine from that? Do you want to set the byte, 16bit or 32bit block of memory to 1? It is totally unknown. This is why it's better to force the developer to state the size.
It would be entirely different if you said mov eax, 123456 since then the destination is a register.
In this disassembly from VC++ a function call is being made. The compiler MOVs the local pointers to a register before pushing them:
memcpy( nodeNewLocation, pNode, sizeCurrentNode );
0041A5DA 8B 45 F8 mov eax,dword ptr [ebp-8]
0041A5DD 50 push eax
0041A5DE 8B 4D 0C mov ecx,dword ptr [ebp+0Ch]
0041A5E1 51 push ecx
0041A5E2 8B 55 D4 mov edx,dword ptr [ebp-2Ch]
0041A5E5 52 push edx
0041A5E6 E8 67 92 FF FF call 00413852
0041A5EB 83 C4 0C add esp,0Ch
Why not just push them directly? ie
push dword ptr [ebp-8]
Also, if you are going to do a separate push, why not do it manually? In other words, instead of doing "push eax" above, do
mov [esp], eax
etc. The advantage of this is that after doing the 3 movs you can do a single subtract to set the new stack pointer, instead of implicitly subtracting three times with the pushes.
UPDATE---Release version
This is the same code compiled for release:
; 741 : memcpy( nodeNewLocation, pNode, sizeCurrentNode );
00087 8b 45 f8 mov eax, DWORD PTR _sizeCurrentNode$[ebp]
0008a 8b 7b 04 mov edi, DWORD PTR [ebx+4]
0008d 50 push eax
0008e 56 push esi
0008f 57 push edi
00090 e8 00 00 00 00 call _memcpy
00095 83 c4 0c add esp, 12 ; 0000000cH
Definitely more efficient than the debug version, but it is still doing a MOV/PUSH combo.
This is an optimization. It is explicitly mentioned in the Intel processor manuals, volume 4, section 12.3.3.6:
In Intel Atom microarchitecture, using PUSH/POP instructions to manage stack space
and address adjustment between function calls/returns will be more optimal than
using ENTER/LEAVE alternatives. This is because PUSH/POP will not need MSROM
flows and stack pointer address update is done at AGU.
When a callee function need to return to the caller, the callee could issue POP instruction
to restore data and restore the stack pointer from the EBP.
Assembly/Compiler Coding Rule 19. (MH impact, M generality) For Intel
Atom processors, favor register form of PUSH/POP and avoid using LEAVE; Use LEA
to adjust ESP instead of ADD/SUB.
The rest of the manual isn't that clear about the reason, but it does mention a possible 3 cycle AGU stall on implicit ESP adjustments.
I suspect it only does it in debug builds, or possibly in some situations where it's warranted by pipelining or other considerations (e.g. it could put a parameter into esi and use it after the call). I've looked into some binaries, and MSVC definitely does use such pushes:
push ebx ; mthd
push dword ptr [ebp+place+4]
push dword ptr [ebp+place] ; pos
push [ebp+filedes] ; fh
call __lseeki64_nolock
(code from the CRT)
As for the second question, instructions addressing esp are longer than pushes: "push eax" is one byte while "mov [esp-8], eax" is four bytes. In fact, this approach (mov instead of push) is used by GCC by default since a couple versions ago (option -maccumulate-outgoing-args) and it has led to notable increases in code size. Supposedly it makes code faster but I'm unconvinced.
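To put numbers on the code-size argument, here is a small Python table of encodings for the variants discussed in this thread. Apart from mov eax, [ebp-8], which is copied from the question's disassembly, the byte strings are hand-assembled 32-bit encodings and should be treated as my assumption rather than compiler output.

```python
# 32-bit encodings in hex; see the note above about where they come from.
ENCODINGS = {
    "mov eax, [ebp-8]": "8B 45 F8",     # from the question's disassembly
    "push eax": "50",
    "push dword [ebp-8]": "FF 75 F8",   # FF /6 with [ebp+disp8]
    "mov [esp], eax": "89 04 24",       # needs a SIB byte for esp
    "mov [esp-8], eax": "89 44 24 F8",  # SIB byte plus disp8
}


def code_size(*instructions):
    """Total encoded length in bytes of the given instructions."""
    return sum(len(bytes.fromhex(ENCODINGS[name])) for name in instructions)


print(code_size("mov eax, [ebp-8]", "push eax"))  # load, then push register
print(code_size("push dword [ebp-8]"))            # push straight from memory
print(code_size("mov [esp-8], eax"))              # store instead of push
```

With these encodings the mov/push pair is one byte longer than pushing straight from memory, so whatever motivates the compiler's choice, it is not code size.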
I actually figured out the reason for it. It has to do with the way instructions are pipelined on the Pentium MMX. There are two pipelines, U and V, which allows MMX processors to process 2 instructions at a time IF they are pairable. PUSHs are not pairable with one another, but they are pairable with MOVs. So, if you write:
mov eax, [indirect]
mov esi, [indirect]
push eax
push esi
then what happens is that instructions #1 and #3 get paired and #2 and #4 get paired, so, effectively, these four instructions run in the same number of cycles as a single mov/push, and a single mov/push is faster than two push [indirect]s. This exact case is described in detail in Section 4.3, p. 41, Examples 4.11a and 4.11b, of the Microarchitecture optimization guide by Agner Fog, which is widely available on the internet.