Subtracting Decimal Values (Assembly) - linux

How do I subtract a decimal value in assembly (IA-32, Linux)?
1: mov edx, 1/2
sub ecx, ebx
sub ecx, edx
mov bal2, ecx
I tried this, but it somehow skips the subtraction with the decimal.
And if I enter .5 it gives me an error.
error:junk `.5' after expression

As lightbulbone correctly says in his answer, you cannot work on floating point values when using general purpose registers (eax etc.). You could use the FPU, as lightbulbone suggests, but that is comparatively tedious.
Alternatively - unless your code needs to run on ancient CPUs - you can use the SSE instructions for this task, which is much more straightforward. Code similar to what you are showing might look like the following:
.data
OneHalf dd 0.5
bal2 dd ?
.code
movss xmm0, OneHalf ; instead of mov edx, 1/2
subss xmm1, xmm2 ; instead of sub ecx, ebx
subss xmm1, xmm0 ; instead of sub ecx, edx
movss bal2, xmm1 ; instead of mov bal2, ecx
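For comparison, here is a rough C sketch of the same arithmetic. The function names and sample values are made up for illustration. It also shows the most likely reason the original subtraction seemed to be skipped: assemblers generally evaluate a constant expression such as 1/2 with integer arithmetic, so the value moved into edx is 0.

```c
/* Integer version: mirrors mov edx, 1/2 ; sub ecx, ebx ; sub ecx, edx.
   The constant expression 1/2 is an integer division, so it is 0. */
int subtract_int(int balance, int amount)
{
    int half = 1 / 2;              /* evaluates to 0, not 0.5 */
    return balance - amount - half;
}

/* Floating-point version: mirrors the movss/subss sequence above. */
float subtract_float(float balance, float amount)
{
    const float one_half = 0.5f;   /* 0.5 is exactly representable in binary */
    return balance - amount - one_half;
}
```

With a balance of 10 and an amount of 3, the integer version yields 7 (the half is silently dropped), while the floating-point version yields 6.5.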

The code you provided is not capable of subtracting a "decimal value". I am assuming that by "decimal value" you really mean a floating-point value, in which case you cannot use the integer arithmetic facilities as you've done.
I highly recommend you download the Intel 64 and IA-32 Architectures Software Developer's Manual and read through the sections that explain how to use the x87 floating-point facilities. Specifically, look at:
Volume 1, Section 5.2
Volume 1, Chapter 8
Volume 2A, Section 3.2
As a word of caution, my experience with the floating-point facilities is quite minimal. However, my understanding is that it is a stack-based environment in which you load (push) an item onto the stack and then operate on it. The instructions I would look at are FLD, FSUB, and FST (all are listed in Volume 2A).
As an example, here is a short program that loads +1 and the constant pi into the floating-point stack then performs the operation pi-1 and pops the result.
/* floating point subtraction */
.text
.globl _main
_main:
pushq %rbp
movq %rsp, %rbp
fld1 /* load +1.0 */
fldpi /* load pi (3.14159...) */
fsubp /* 3.14159 - 1.0, pop result */
movq $0x0, %rax
leave
ret

Related

Incrementing one to a variable in IA32 Linux Assembly

I'm trying to add 1 to a variable in IA32 assembly on Linux:
section .data
num: dd 0x1
section .text
global _start
_start:
add dword [num], 1
mov edx, 1
mov ecx, [num]
mov ebx,1
mov eax,4
int 0x80
mov eax,1
int 0x80
I'm not sure if this is possible.
Elsewhere I saw the following code:
mov eax, num
inc eax
mov num, eax
Is it possible to increment a value to a var without moving to a register?
If so, do I have any advantage moving the value to a register?
Is it possible to increment a value to a var without moving to a register?
Certainly: inc dword [num].
Like practically all x86 instructions, inc can take either a register or memory operand. See the instruction description at http://felixcloutier.com/x86/inc; the form inc r/m32 indicates that you can give an operand which is either a 32-bit register or 32-bit memory operand (effective address).
If you're interested in micro-optimizations, it turns out that add dword [num], 1 may still be somewhat faster, though one byte larger, on certain CPUs. The specifics are pretty complicated and you can find a very extensive discussion at INC instruction vs ADD 1: Does it matter?. This is partly related to the slight difference in effect between the two, which is that add will set or clear the carry flag according to whether a carry occurs, while inc always leaves the carry flag unchanged.
If so, do I have any advantage moving the value to a register?
No. That would make your code larger and probably slower.
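The carry-flag difference between add and inc mentioned above can be modelled in a few lines of C. This is only a toy model: the Cell type and the model_* names are invented for illustration, and the carry field stands in for the CF bit of EFLAGS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a memory operand plus the carry flag. */
typedef struct {
    uint32_t value;
    bool carry;   /* stands in for CF in EFLAGS */
} Cell;

/* add dword [num], imm: CF is set when the unsigned result wraps around,
   and cleared otherwise. */
void model_add(Cell *c, uint32_t imm)
{
    uint32_t old = c->value;
    c->value = old + imm;
    c->carry = c->value < old;
}

/* inc dword [num]: same arithmetic result as adding 1, but CF is untouched. */
void model_inc(Cell *c)
{
    c->value += 1;
}
```

Starting from 0xFFFFFFFF with the carry clear, model_add(&c, 1) leaves the value 0 with the carry set, while model_inc(&c) leaves the value 0 with the carry still clear.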

Infinite loops seemingly not working in NASM?

I'm trying to make a DOS program in NASM that uses interrupt 10h to display a pixel cycling through the 16 available colors in the top left corner. I also use interrupt 21h to only make the program run every 1/100 seconds (100 fps).
segment .data
pixelcolor: db 0
pixelx: dw 100
pixely: dw 100
timeaux: db 0 ; used later on to force the program to run at 100fps
segment .text
global _start
_start:
mov ah,00h
mov al,0dh
int 10h
mov ah,0bh
mov bh,00h
mov bl,00h
int 10h
.infinite:
mov ah,2ch
int 21h ; get system time
cmp dl,timeaux ; if 1/100 seconds haven't passed yet...
je .infinite ; ...skip current frame
; else, continue normally
mov byte[timeaux],dl
mov ah,00h
mov al,0dh
int 10h
mov ah,0bh
mov bh,00h
mov bl,00h
int 10h
mov ah,0ch
mov al,pixelcolor
mov cx,pixelx
mov dx,pixely
int 10h
inc byte[pixelcolor]
jmp .infinite
However, when I actually run the program in DOSBox, the pixel just stays red. Does anyone know why my infinite loops aren't working? (Note: I'm very new to NASM, so honestly I'm not even surprised my programs only work 15% of the time.)
The problem isn't actually the loop itself. What the loop is doing each iteration is the problem. Some issues and observations I have are:
Since this is a DOS COM program you will need an org 100h at the top since a COM program is loaded by the DOS loader to offset 100h of the current program segment. Without this the offsets of your data will be incorrect leading to data being read/written to from the wrong memory locations.
You have a problem with mov al,pixelcolor. It needs to be mov al,[pixelcolor]. Without square brackets [1], the offset of pixelcolor is moved to AL, not what is stored at that offset. The same goes for pixelx and pixely, and for cmp dl,timeaux in the timing check, which should be cmp dl,[timeaux]. Your code prints the same pixel color (red in your case) to the wrong place [2] on the screen repeatedly. This code:
mov ah,0ch
mov al,pixelcolor
mov cx,pixelx
mov dx,pixely
int 10h
inc byte[pixelcolor]
should be:
mov ah,0ch
mov al,[pixelcolor]
mov cx,[pixelx]
mov dx,[pixely]
int 10h
inc byte[pixelcolor]
It should be noted that the resolution of the timer by default will only be 18.2 times a second (~55ms). This is less resolution than the 1/100 of a second you are aiming for.
Some versions of DOS may always return 0 for the 1/100 of a second value.
Use of the BIOS to write pixels to the screen may make coding simpler (it abstracts away differences in the video modes) but will be quite slow compared to writing pixels directly to memory.
I would recommend Borland's Turbo Debugger (TD) for debugging DOS software. Turbo Debugger is included in a number of Borland's DOS C/C++ compiler suites.
Footnotes
[1] The use of brackets [] in NASM differs from MASM/TASM/JWASM.
[2] Although your question says you want to write to the upper left of the screen, the code suggests you really intended to write the pixel at coordinate 100,100.
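The first two points (the missing org 100h and the missing brackets) can be pictured with a small C model of a 64 KiB segment. Everything here is an illustrative assumption: the function names, the file position 50h used in the usage note, and the zeroed bytes below 100h (in a real DOS process that area holds the PSP).

```c
#include <stdint.h>
#include <string.h>

enum { LOAD_OFFSET = 0x100 };   /* DOS loads a COM image at offset 100h */

/* Copy the file image to offset 100h of a zeroed 64 KiB segment. */
void load_com(uint8_t *segment, const uint8_t *image, size_t size)
{
    memset(segment, 0, 0x10000);
    memcpy(segment + LOAD_OFFSET, image, size);
}

/* Address the assembler assigns to a label located at file position pos. */
uint16_t label_address(uint16_t pos, uint16_t org)
{
    return (uint16_t)(org + pos);
}

/* mov al, label: AL receives the low byte of the address itself. */
uint8_t mov_al_label(uint16_t address)
{
    return (uint8_t)address;
}

/* mov al, [label]: AL receives the byte stored at that address. */
uint8_t mov_al_mem(const uint8_t *segment, uint16_t address)
{
    return segment[address];
}
```

If pixelcolor sits at file position 50h and holds 7, then with org 100h the label resolves to 150h and mov_al_mem returns 7. Without the org, the label resolves to 50h and the read lands in the wrong place. mov_al_label returns 50h in both cases, which has nothing to do with the stored color.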

Should %rsp be aligned to 16-byte boundary before calling a function in NASM?

I saw the following rules from NASM's document:
The stack pointer %rsp must be aligned to a 16-byte boundary before making a call. Fine, but the process of making a call pushes the return address (8 bytes) on the stack, so when a function gets control, %rsp is not aligned. You have to make that extra space yourself, by pushing something or subtracting 8 from %rsp.
And I have a snippet of NASM assembly code as below:
Just before I call the function "inc" in "_start", %rsp is only 8-byte aligned, which violates the rule described in NASM's document. Yet everything works fine. How should I understand this?
I built this under Ubuntu 20.04 LTS (x86_64).
global _start
section .data
init:
db 0x2
section .rodata
codes:
db '0123456789abcdef'
section .text
inc:
mov rax, [rsp+8] ; read param from the stack;
add rax, 0x1
ret
print:
lea rsi, [codes + rax]
mov rax, 1
mov rdi, 1
mov rdx, 1
syscall
ret
_start:
; enable AC check;
pushf
or dword [rsp], 1<<18
popf
mov rdi, [init] ; move the first 8 bytes of init to %rdi;
push rdi ; %rsp -> 8 bytes;
call inc
pop r11 ; clean stack by the caller;
call print
mov rax, 60
xor rdi, rdi
syscall
The ABI is a set of rules for how functions should behave to be interoperable with each other. Each of the rules on one side are paired with allowed assumptions on the other. In this case, the rule about stack alignment for the caller is an allowed assumption about stack alignment for the callee. Since your inc function doesn't depend on 16-byte stack alignment, it's fine to call that particular function with a stack that's only 8-byte aligned.
If you're wondering why it didn't break when you enabled AC, that's because you're only loading 8-byte values from the stack, and the stack is still 8-byte aligned. If you did sub rsp, 4 or something to break 8-byte alignment too, then you would get a bus error.
Where the ABI becomes important is when the situation isn't one function you wrote yourself in assembly calling another function you wrote yourself in assembly. A function in someone else's library (including the C standard library), or one that you compiled from C instead of writing in assembly, is within its rights to do movaps [rsp - 24], xmm0 or something, which would break if you didn't properly align the stack before calling it.
Side note: the ABI also says how you're supposed to pass parameters (the calling convention), but you're just kind of passing them wherever. Again, fine from your own assembly, but they'll definitely break if you try to call them from C.
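To make the alignment bookkeeping concrete, here is a small C sketch that tracks the stack pointer through the push and the call in _start. It assumes %rsp is 16-byte aligned at process entry, as the x86-64 System V ABI specifies; the helper names and the starting address in the usage note are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when sp is a multiple of boundary. */
bool is_aligned(uint64_t sp, uint64_t boundary)
{
    return sp % boundary == 0;
}

/* push r64 moves the stack pointer down by 8 bytes. */
uint64_t after_push(uint64_t sp)
{
    return sp - 8;
}

/* call pushes an 8-byte return address, so it also moves it down by 8. */
uint64_t after_call(uint64_t sp)
{
    return sp - 8;
}
```

Starting from a hypothetical 0x7ffc00001000, after_push gives a value that is 8-byte aligned but not 16-byte aligned, which is the state at the call inc instruction. The stack stays 8-byte aligned throughout, which is why the 8-byte loads do not fault with AC enabled. Subtracting 4 instead would break even that.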

Is there advantage of reading data without using pop operation?

According to this PDF document (Page-66), the following bunch of statements
mov eax, DWORD PTR SS:[esp]
mov eax, DWORD PTR SS:[esp + 4]
mov eax, DWORD PTR SS:[esp + 8]
are equivalent to the following bunch of statements:
pop eax
pop eax
pop eax
Is there any advantage of the former over the latter?
mov leaves the data on the stack, pop removes it so you can only read it once, and only in order. Data below ESP has to be considered "lost", unless you're using a calling convention / ABI that includes a red-zone below the stack pointer.
Data is usually still there below ESP, but asynchronous stuff like signal handlers, or a debugger evaluating a call fflush(0) in the context of your process, can step on it.
Also, pop modifies ESP, so each pop requires stack-unwind metadata (see footnote 1) in another section of the executable/library, for it to be fully ABI compliant with SEH on Windows or the i386 / x86-64 System V ABI on other OSes (which specifies that all functions need unwind metadata, even if they're not C++ functions that actually support propagating exceptions).
But if you're reading data for the last time, and you actually need it all, then yes pop is an efficient way to read it on modern CPUs (like Pentium-M and later, with a stack engine to handle the ESP updates without a separate uop.)
On older CPUs, like Pentium III, pop was actually slower than 3x mov + add esp,12 and compilers did generate code the way Brendan's answer shows.
void foo() {
asm("" ::: "ebx", "esi", "edi");
}
This function forces the compiler to save/restore 3 call-preserved registers (by declaring clobbers on them.) It doesn't actually touch them; the inline asm string is empty. But this makes it easy to see what compilers will do for saving/restoring. (Which is the only time they'll use pop normally.)
GCC's default (-mtune=generic) code-gen, or with -march=skylake for example, is like this (from the Godbolt compiler explorer):
foo: # gcc8.3 -O3 -m32
push edi
push esi
push ebx
pop ebx
pop esi
pop edi
ret
But telling it to tune for an old CPU without a stack engine makes it do this:
foo: # gcc8.3 -march=pentium3 -O3 -m32
sub esp, 12
mov DWORD PTR [esp], ebx
mov DWORD PTR [esp+4], esi
mov DWORD PTR [esp+8], edi
mov ebx, DWORD PTR [esp]
mov esi, DWORD PTR [esp+4]
mov edi, DWORD PTR [esp+8]
add esp, 12
ret
gcc thinks -march=pentium-m doesn't have a stack engine, or at least chooses not to use push/pop there. I think that's a mistake, because Agner Fog's microarch pdf definitely describes the stack engine as being present in Pentium-M.
On P-M and later, push/pop are single-uop instructions, with the ESP update handled outside the out-of-order backend, and for push the store-address+store-data uops are micro-fused.
On Pentium 3, they're 2 or 3 uops each. (Again, see Agner Fog's instruction tables.)
On in-order P5 Pentium, push and pop are actually fine. (But memory-destination instructions like add [mem], reg were generally avoided, because P5 didn't split them into uops to pipeline better.)
Mixing pop with direct references to [esp] will actually be potentially slower than just one or the other, on modern Intel CPUs, because it costs extra stack-sync uops.
Obviously writing EAX 3 times back to back means the first 2 loads are useless in both sequences.
See Extreme Fibonacci for an example of pop (1 uop, or like 1.1 uop with the stack sync uops amortized) being more efficient than lodsd (2 uops on Skylake) for reading through an array. (In evil code that assumes a large red-zone because it doesn't install signal handlers. Don't actually do this unless you know exactly what you're doing and when it will break; this is more of a silly computer tricks / extreme optimization for code-golf than anything that's practically useful.)
Footnote 1: The Godbolt compiler explorer normally filters out extra assembler directives, but if you uncheck that box you can see gcc's function that uses push/pop has .cfi_def_cfa_offset 12 after every push/pop.
pop ebx
.cfi_restore 3
.cfi_def_cfa_offset 12
pop esi
.cfi_restore 6
.cfi_def_cfa_offset 8
pop edi
.cfi_restore 7
.cfi_def_cfa_offset 4
The .cfi_restore 7 metadata directives have to be there regardless of push/pop vs. mov, because that lets stack unwinding restore call-preserved registers as it unwinds. (7 is the register number).
But for other uses of push/pop inside a function (like pushing args to a function call, or a dummy pop to remove it from the stack), you wouldn't have .cfi_restore, only metadata for the stack pointer changing relative to the stack frame.
Normally you don't worry about this in hand-written asm, but compilers have to get this right so there's a small extra cost to using push/pop in terms of total executable size. But only in parts of the file that aren't mapped into memory normally, and not mixed with code.
This:
pop eax
pop ebx
pop ecx
.. is sort of equivalent to this:
mov eax,[esp]
add esp,4
mov ebx,[esp]
add esp,4
mov ecx,[esp]
add esp,4
..which can be like this:
mov eax,[esp] ;Do this instruction
add esp,4 ; ...and this instruction in parallel
;Stall until the previous instruction completes (and the value
mov ebx,[esp] ;in ESP becomes known); then do this instruction
add esp,4 ; ...and this instruction in parallel
;Stall until the previous instruction completes (and the value
mov ecx,[esp] ;in ESP becomes known); then do this instruction
add esp,4 ; ...and this instruction in parallel
For this code:
mov eax, [esp]
mov ebx, [esp + 4]
mov ecx, [esp + 8]
add esp,12
.. all of the instructions can happen in parallel (in theory).
Note: In practice all of the above depends on which CPU, etc.
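As a rough illustration of the semantic difference (not the performance difference), the two sequences can be modelled in C with ESP as a pointer to 32-bit slots. The Stack type and the model_* names are invented for this sketch.

```c
#include <stdint.h>

typedef struct {
    const uint32_t *esp;   /* models ESP as a pointer to 32-bit stack slots */
} Stack;

/* pop reg: read the top slot, then advance ESP by one slot (4 bytes). */
uint32_t model_pop(Stack *s)
{
    return *s->esp++;
}

/* mov reg, [esp + 4*index]: read a slot without changing ESP. */
uint32_t model_peek(const Stack *s, unsigned index)
{
    return s->esp[index];
}

/* add esp, 4*slots: discard several slots in one step. */
void model_drop(Stack *s, unsigned slots)
{
    s->esp += slots;
}
```

Three calls to model_pop return the same values as model_peek with indices 0, 1 and 2, and after a final model_drop(&s, 3) both stacks point at the same slot. The difference is that the peeked values can be read again in any order until the drop, while the popped ones must be treated as gone.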

Why does VC++ 2010 often use ebx as a "zero register"?

Yesterday I was looking at some 32 bit code generated by VC++ 2010 (most probably; don't know about the specific options, sorry) and I was intrigued by a curious recurring detail: in many functions, it zeroed out ebx in the prologue, and it always used it like a "zero register" (think $zero on MIPS). In particular, it often:
used it to zero out memory; this is not unusual, as the encoding for a mov mem,imm is 1 to 4 bytes bigger than mov mem,reg (the full immediate value size has to be encoded even for 0), but usually (gcc) the necessary register is zeroed out "on demand", and kept for more useful purposes otherwise;
used it for compares against zero - as in cmp reg,ebx. This is what struck me as really unusual, as it should be exactly the same as test reg,reg, but adds a dependency on an extra register. Now, keep in mind that this happened in non-leaf functions, with ebx often being pushed (by the callee) on and off the stack, so I would not trust this dependency to always be completely free. It also used test reg,reg in the exact same fashion (test/cmp => jg).
Most importantly, registers on "classic" x86 are a scarce resource, if you start having to spill registers you waste a lot of time for no good reason; why waste one through all the function just to keep a zero in it? (still, thinking about it, I don't remember seeing much register spillage in functions that used this "zero-register" pattern).
So: what am I missing? Is it a compiler blooper or some incredibly smart optimization that was particularly interesting in 2010?
Here's an excerpt:
; standard prologue: ebp/esp, SEH, overflow protection, ... then:
xor ebx, ebx
mov [ebp+4], ebx ; zero out some locals
mov [ebp], ebx
call function_1
xor ecx, ecx ; ebx _not_ used to zero registers
cmp eax, ebx ; ... but used for compares?! why not test eax,eax?
setnz cl ; what? it goes through cl to check if eax is not zero?
cmp ecx, ebx ; still, why not test ecx,ecx?
jnz function_body
push 123456
call throw_something
function_body:
mov edx, [eax]
mov ecx, eax ; it's not like it was interested in ecx anyway...
mov eax, [edx+0Ch]
call eax ; virtual method call; ebx is preserved but possibly pushed/popped
lea esi, [eax+10h]
mov [ebp+0Ch], esi
mov eax, [ebp+10h]
mov ecx, [eax-0Ch]
xor edi, edi ; again, registers are zeroed as usual
mov byte ptr [ebp+4], 1
mov [ebp+8], ecx
cmp ecx, ebx ; why not test ecx,ecx?
jg somewhere
label1:
lea eax, [esi-10h]
mov byte ptr [ebp+4], bl ; ok, uses bl to write a zero to memory
lea ecx, [eax+0Ch]
or edx, 0FFFFFFFFh
lock xadd [ecx], edx
dec edx
test edx, edx ; now it's using the regular test reg,reg!
jg somewhere_else
Notice: an earlier version of this question said that it used mov reg,ebx instead of xor ebx,ebx; this was just me not remembering stuff correctly. Sorry if anybody put too much thought trying to understand that.
Everything you commented on as odd looks sub-optimal to me. test eax,eax sets all flags (except AF) the same as cmp against zero, and is preferred for performance and code-size.
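The flag equivalence can be checked with a small C model of the two instructions. This is a sketch under the usual flag definitions for 32-bit operands; the Flags type and function names are made up, and AF is left out on purpose because test does not define it.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool zf, sf, cf, of, pf; } Flags;   /* AF deliberately omitted */

/* PF is set when the low byte of the result has an even number of 1 bits. */
static bool parity_even(uint8_t b)
{
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;
    return (b & 1) == 0;
}

/* cmp a, b: flags of the 32-bit subtraction a - b. */
Flags flags_cmp(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    Flags f = { r == 0, (r >> 31) != 0, a < b,
                (((a ^ b) & (a ^ r)) >> 31) != 0, parity_even((uint8_t)r) };
    return f;
}

/* test a, b: flags of a & b, with CF and OF always cleared. */
Flags flags_test(uint32_t a, uint32_t b)
{
    uint32_t r = a & b;
    Flags f = { r == 0, (r >> 31) != 0, false, false, parity_even((uint8_t)r) };
    return f;
}

/* Field-by-field comparison of two flag sets. */
bool flags_equal(Flags x, Flags y)
{
    return x.zf == y.zf && x.sf == y.sf && x.cf == y.cf
        && x.of == y.of && x.pf == y.pf;
}
```

For any x, flags_cmp(x, 0) and flags_test(x, x) agree on every modelled flag, so a following jg (which looks at ZF, SF and OF) behaves the same after either instruction.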
On P6 (PPro through Nehalem), reading long-dead registers is bad because it can lead to register-read stalls. P6 cores can only read 2 or 3 not-recently-modified architectural registers from the permanent register file per clock (to fetch operands for the issue stage: the ROB holds operands for uops, unlike on SnB-family where it only holds references to the physical register file).
Since this is from VS2010, Sandybridge wasn't released yet, so it should have put a lot of weight on tuning for Pentium II/III, Pentium-M, Core2, and Nehalem where reading "cold" registers is a possible bottleneck.
IDK if anything like this ever made sense for integer regs, but I don't know much about optimizing for CPUs older than P6.
The cmp / setnz / cmp / jnz sequence looks particularly braindead. Maybe it came from a compiler-internal canned sequence for producing a boolean value from something, and it failed to optimize a test of the boolean back into just using the flags directly? That still doesn't explain the use of ebx as a zero-register, which is also completely useless there.
Is it possible that some of that was from inline-asm that returned a boolean integer (using a silly sequence that wanted a zero in a register)?
Or maybe the source code was comparing two unknown values, and it was only after inlining and constant-propagation that it turned into a compare against zero? Which MSVC failed to optimize fully, so it still kept 0 as a constant in a register instead of using test?
(the rest of this was written before the question included code).
Sounds weird, or like a case of CSE / constant-hoisting run amok. i.e. treating 0 like any other constant that you might want to load once and then reg-reg copy throughout the function.
Your analysis of the data-dependency behaviour is correct: moving from a register that was zeroed a while ago essentially starts a new dependency chain.
When gcc wants two zeroed registers, it often xor-zeroes one and then uses a mov or movdqa to copy to the other.
This is sub-optimal on Sandybridge where xor-zeroing doesn't need an execution port, but a possible win on Bulldozer-family where mov can run on the AGU or ALU, but xor-zeroing still needs an ALU port.
For vector moves, it's a clear win on Bulldozer: handled in register rename with no execution unit. But xor-zeroing of an XMM or YMM register still needs an execution port on Bulldozer-family (or two for ymm, so always use xmm with implicit zero-extension).
Still, I don't think that justifies tying up a register for the duration of a whole function, especially not if it costs extra saves/restores. And not for P6-family CPUs where register-read stalls are a thing.

Resources