NASM local label differentiation

NASM local label differentiation - nasm

From what I understand, NASM (like all good assemblers) allows you to define local labels by prefixing them with a period, and it allows later definitions to override previous ones.
The code I've seen demonstrating this looks like:
part1 mov ax, 10
.loop ; do something
dec ax
jnz .loop
part2 mov ax, 50
.loop ; do something
dec ax
jnz .loop
In this case, the later definition overrides the earlier one so that the correct label is selected.
However, I can't see how this works in the following scenario.
part1 mov ax, 10
.loop jz .fin
; do something else
dec ax
jmp .loop
.fin
part2 mov ax, 50
.loop jz .fin
; do something else
dec ax
jmp .loop
.fin
At the point where jz .fin in the second loop is assembled, surely the earlier instance of .fin would still be active and it would jump to the wrong location.
Or is NASM smarter than that and uses some other method to decide which label is active at any given time?

Actually, that understanding isn't quite correct. The local labels aren't stand-alone items, they actually associate themselves with the most recent non-local label.
NASM manual section 3.9 Local Labels
So the second code example effectively becomes:
part1 mov ax, 10
part1.loop jz part1.fin
; do something else
dec ax
jmp part1.loop
part1.fin
part2 mov ax, 50
part2.loop jz part2.fin
; do something else
dec ax
jmp part2.loop
part2.fin
and the problem then disappears.
In fact, you can actually refer to the local labels as non-local ones. In other words, if you wanted to leave part1 and totally skip part2 completely, you could just use something like:
jmp part2.fin
from somewhere within part1. Using .fin from within part1 wouldn't work since that would select part1.fin, but the fully qualified part2.fin would do the trick.

At the point where jz .fin in the second loop is assembled, surely the earlier instance of .fin would still be active and it would jump to the wrong location.
Or is NASM smarter than that and uses some other method to decide which label is active at any given time?
Let's find out!
C:\nasm>nasm -f bin -o locals.com locals.asm && ndisasm locals.com
00000000 B80A00 mov ax,0xa ; part1: mov ax, 10
00000003 7403 jz 0x8 ; .loop: jz .fin
00000005 48 dec ax ; dec ax
00000006 EBFB jmp short 0x3 ; jmp .loop
00000008 B83200 mov ax,0x32 ; .fin: part2: mov ax, 50
0000000B 7403 jz 0x10 ; .loop: jz .fin
0000000D 48 dec ax ; dec ax
0000000E EBFB jmp short 0xb ; jmp .loop
; .fin:
So the first jz .fin jumps to the first instance of .fin, and the second jz .fin jumps to the second instance of .fin.

According to YASM's docs on NASM local labels, a local label is associated with a global label. So I think your example would actually have a conflict if there wasn't a global label separating the two instances of .fin.

Related

Boot sector printing the wrong thing when I %include my print function

I am having an issue with some assembly code. I am trying to print out a string using a function from a different assembly file. But it doesn't output the string but instead an "S ". How do I fix this? I would like to add that I use the NASM assembler.
code:
string.asm
print_string:
pusha
mov ah, 0x0e
loop:
mov al, [bx]
cmp al, 0
je return
int 0x10
inc bx
jmp loop
return:
popa
ret
boot_sector.asm -
[org 0x7c00]
%include "string.asm"
mov bx, [my_string]
call print_string
my_string:
db 'hello world', 0
times 510 - ($ - $$) db 0
dw 0xaa55

Execution of a boot sector begins with the first byte. In this case, the first instruction is the top of your function, because you put it first.
The code assembles exactly the same as if you had included it manually before assembling. So your boot sector is really:
[org 0x7c00]
print_string:
pusha
...
ret
mov bx, [my_string] ; BX = load first 2 bytes of my_string.
; should have been
; mov bx, my_string ; BX = address of my_string. mov bx, imm16
call print_string
It should be pretty obvious why that doesn't work, and you would have noticed this if you single-stepped your code with the debugger built-in to BOCHS (or any other way of debugging a boot sector). Even just looking at disassembly might have clued you in.
Solution: put the %include after your other code, and avoid having execution fall into it. e.g. put this after the call:
cli ; disable interrupts
hlt ; halt until the next interrupt. (except for NMI)
(If NMI is possible, you can put the hlt inside an infinite loop with jmp.)
This is not your only bug. As #MichaelPetch points out, you were loading 2 bytes from the string instead of putting its address into BX.

How can i copy an array in nasm x86 assembly for Linux, porting 16-bit DOS code?

I have to write a program which copy an array in other array, using x86 assembler
The original code is written in MSDOS' TASM for 8086 processor, but I want port this to Linux NASM using i386 processor
The code in TASM is this:
.MODEL SMALL
.DATA
TABLE_A DB 10, 5, 1
TABLE_B DB 0, 0, 0
.CODE
MOV AX, SEG TABLE_B
MOV DS, AX
MOV SI, 0
LOOP:
MOV AL, TABLE_A[SI]
MOV TABLE_B[SI], AL
INC SI
CMP SI, 2
JBE LOOP
MOV AH, 4Ch
INT 21h
END
I'm trying to rewrite this in nasm, but I don't get to sit in the correct array position, similar to TABLE_A[SI] instruction
How can I do it?

The final code in nasm is this
section .text
global _start
cpu 386
_start:
MOV ESI, TABLE_A
MOV EDI, TABLE_B
MOV CX, 3
COPY_LOOP:
MOV AL, [ESI]
MOV [EDI], AL
INC SI
INC DI
LOOP COPY_LOOP
MOV AX,1
INT 80h
section .data
TABLE_A DB 10, 5, 1
TABLE_B DB 0, 0, 0

How could I do it?
(question from comments on self-answer)
Well, first you read Instruction reference guide to understand what the instruction does, and then you can use it, if it fits your purpose. This is the important step, keep re-reading instruction details every so often, to verify it does modify registers and flags in a way you expect it. Especially if in debugger you see the CPU state of change you didn't expect.
As you are in linux, the ds/es segment registers are very likely already set to reasonably values (covering .data section), so after setting eSi to Source address, eDi to Destination address, and eCx to Count, you write instead of COPY_LOOP: just rep movsb ... and then exit trough int 80h (eax=1). (notice the emphasized letters in register names, Intel picked those intentionally to make it easy to recall)
BTW, just now I noticed, you wrote in your code sort of bugs:
inc si/di should be inc esi/edi, because you use esi/edi to address. If you would be copying array over 64k memory boundary, inc si would wrap around on it.
set ecx to 3, in 32b mode the loop instruction does use whole 32b ecx, not 16b part cx only. If the code ahead of copy would use some large number in ecx setting some of upper 16 bits, your loop would copy many more bytes than only 3.
ahead of calling int 80h again you must set whole eax with the function number, otherwise you risk to have some garbage in upper 16 bits of eax from previous code, requesting invalid function.
So after applying these your code may look like this:
section .text
global _start
cpu 386
_start:
MOV ESI, TABLE_A
MOV EDI, TABLE_B
MOV ECX, 3
REP MOVSB ; copy ECX bytes from DS:ESI to ES:EDI
MOV EAX,1 ; call sys_exit, again FIXED to EAX!
INT 80h
section .data
TABLE_A DB 10, 5, 1
TABLE_B DB 0, 0, 0
If you did read the docs about registers, you should already understand what is difference between eax and ax. In Linux you are in 32b mode (when you link the binary as 32b elf, nowadays the 64b may be default on 64b system, which differs a bit from 32b mode), so by default use the 32b register variants. Unless you really want the 16b/8b variant for particular reason, and you make sure the code doesn't work later with 32b register while you set only less of it (like loop, rep movsb and int 80h do).
Also it makes the code usually faster, as using 16b ax in 32b mode requires additional opcode byte ahead of instruction, for example mov eax,ebx is 2 bytes opcode 89 D8, mov ax,bx is 3 bytes opcode 66 89 D8.

In response to marc
I tried this form, without successful result:
MOV SI, 0
MOV AX, 0
LOOP:
MOV AX, [TABLE_A + SI]
MOV [TABLE_B + SI], AX
INC SI
CMP SI, 2
JBE LOOP

Use pointers (SI, DI) to the arrays and CX as counter :
MOV SI, Table_A ;POINTER TO TABLE_A.
MOV DI, Table_B ;POINTER TO TABLE_B.
MOV CX, 3 ;ARRAY LENGTH.
REPEAT:
MOV AL, [SI]
MOV [DI], AL
INC SI
INC DI
LOOP REPEAT ;CX-1. IF CX>0 JUMP TO REPEAT.

Print register value to console

I want to print the value in %RCX directly to the console, let's say an ASCII value. I've searched through some wise books and tutorials, but all use buffers to pass anything. Is it possible to print anything without creating special buffer for that purpose?
lets say i am here (all this answers are fat too complicated to me and use different syntax):
movq $5, %rax
...???(print %rax)
Output on console:
\>5
in example, to print buffer i use code:
SYSWRITE = 4
STDOUT = 1
EXIT_SUCCESS = 0
.text
buff: .ascii "Anything to print\n"
buff_len = . - buff
movq $SYSWRITE, %eax
mov $STDOUT, %ebx
mov $buff, %ecx
mov $buff_len, %edx
NO C CODE OR DIFFERENT ASS SYNTAX ALLOWED!!!

In order to print a register (in hex representation or numeric) the routine (write to stdout, stderr, etc.) expects ASCII characters. Just writing a register will cause the routine to try an display the ascii equivalent of the value in the register. You may get lucky sometimes if each of the bytes in the register happen to fall into the printable character range.
You will need to convert it vis-a-vis routines that convert to decimal or hex. Here is an example of converting a 64 bit register to the hex representation (using intel syntax w/nasm):
section .rodata
hex_xlat: db "0123456789abcdef"
section .text
; Called with RDI is the register to convert and
; RSI for the buffer to fill
;
register_to_hex:
push rsi ; Save for return
xor eax,eax
mov ecx, 16 ; looper
lea rdx, [rel hex_xlat] ; position-independent code can't index a static array directly
ALIGN 16
.loop:
rol rdi, 4 ; dil now has high bit nibble
mov al, dil ; capture low nibble
and al, 0x0f
mov al, byte [rdx+rax] ; look up the ASCII encoding for the hex digit
; rax is an 'index' with range 0x0 - 0xf.
; The upper bytes of rax are still zero from xor
mov byte [rsi], al ; store in print buffer
inc rsi ; position next pointer
dec ecx
jnz .loop
.exit:
pop rax ; Get original buffer pointer
ret

This answer is an addendum to the answer given by Frank, and utilizes the mechanism used there to do the conversion.
You mention the register %RCX in your question. This suggests you are looking at 64-bit code and that your environment is likely GCC/GAS (GNU Assembler) based since % is usually the AT&T style prefix for registers.
With that in mind I've created a quick and dirty macro that can be used inline anywhere you need to print a 64-bit register, 64-bit memory operand, or a 32-bit immediate value in GNU Assembly. This version was a proof of concept and could be amended to support 64 bit immediate values. All the registers that are used are preserved, and the code will also account for the Linux 64-bit System V ABI red zone.
The code below is commented to point out what is occurring at each step.
printmac.inc:
.macro memreg_to_hex src # Macro takes one input
# src = memory operand, register,
# or 32 bit constant to print
# Define the translation table only once for the current object
.ifndef MEMREG_TO_HEX_NOT_FIRST
.set MEMREG_TO_HEX_NOT_FIRST, 1
.PushSection .rodata
hex_xlat: .ascii "0123456789abcdef"
.PopSection
.endif
add $-128,%rsp # Avoid 128 byte red zone
push %rsi # Save all registers that will be used
push %rdi
push %rdx
push %rcx
push %rbx
push %rax
push %r11 # R11 is destroyed by SYSCALL
mov \src, %rdi # Move src value to RDI for processing
# Output buffer on stack at ESP-16 to ESP-1
lea -16(%rsp),%rsi # RSI = output buffer on stack
lea hex_xlat(%rip), %rdx # RDX = translation buffer address
xor %eax,%eax # RAX = Index into translation array
mov $16,%ecx # 16 nibbles to print
.align 16
1:
rol $4,%rdi # rotate high nibble to low nibble
mov %dil,%al # dil now has previous high nibble
and $0xf,%al # mask off all but low nibble
mov (%rdx,%rax,1),%al # Lookup in translation table
mov %al,(%rsi) # Store in output buffer
inc %rsi # Update output buffer address
dec %ecx
jne 1b # Loop until counter is 0
mov $1,%eax # Syscall 1 = sys_write
mov %eax,%edi # EDI = 1 = STDIN
mov $16,%edx # EDX = Number of chars to print
sub %rdx,%rsi # RSI = beginning of output buffer
syscall
pop %r11 # Restore all registers used
pop %rax
pop %rbx
pop %rcx
pop %rdx
pop %rdi
pop %rsi
sub $-128,%rsp # Restore stack
.endm
printtest.s
.include "printmac.inc"
.global main
.text
main:
mov $0x123456789abcdef,%rcx
memreg_to_hex %rcx # Print the 64-bit value 0x123456789abcdef
memreg_to_hex %rsp # Print address containing ret pointer
memreg_to_hex (%rsp) # Print return pointer
memreg_to_hex $0x402 # Doesn't support 64-bit immediates
# but can print anything that fits a DWORD
retq
This can be compiled and linked with:
gcc -m64 printtest.s -o printtest
The macro doesn't print an end of line character so the output of the test program looks like:
0123456789abcdef00007fff5283d74000007f5c4a080a500000000000000402
The memory addresses will be be different.
Since the macros are inlined, each time you invoke the macro the entire code will be emitted. The code is space inefficient. The bulk of the code could be moved to an object file you can include at link time. Then a stub macro could wrap a CALL to the main printing function.
The code doesn't use printf because at some point I thought I saw a comment that you couldn't use the C library. If that's not the case this can be simplified greatly by calling printf to format the output to print a 64-bit hexadecimal value.

Just for fun, here are a couple other sequences for storing a hex string from a register. Printing the buffer is not the interesting part, IMO; copy that part from Michael's excellent answer if needed.
I tested some of these. I've included a main that calls one of these functions and then uses printf("%s\n%lx\n", result, test_value); to make it easy to spot problems.
Test main():
extern printf
global main
main:
push rbx
mov rdi, 0x1230ff56dcba9911
mov rbx, rdi
sub rsp, 32
mov rsi, rsp
mov byte [rsi+16], 0
call register_to_hex_ssse3
mov rdx, rbx
mov edi, fmt
mov rsi, rsp
xor eax,eax
call printf
add rsp, 32
pop rbx
ret
section .rodata
fmt: db `%s\n%lx\n`, 0 ; YASM doesn't support `string with escapes`, so this only assembles with NASM.
; NASM needs
; %use smartalign
; ALIGNMODE p6, 32
; or similar, to stop it using braindead repeated single-byte NOPs for ALIGN
SSSE3 pshufb for the LUT
This version doesn't need a loop, but the code size is much larger than the rotate-loop versions because SSE instructions are longer.
section .rodata
ALIGN 16
hex_digits:
hex_xlat: db "0123456789abcdef"
section .text
;; rdi = val rsi = buffer
ALIGN 16
global register_to_hex_ssse3
register_to_hex_ssse3: ;;;; 0x39 bytes of code
;; use PSHUFB to do 16 nibble->ASCII LUT lookups in parallel
movaps xmm5, [rel hex_digits]
;; x86 is little-endian, but we want the hex digit for the high nibble to be the first character in the string
;; so reverse the bytes, and later unpack nibbles like [ LO HI ... LO HI ]
bswap rdi
movq xmm1, rdi
;; generate a constant on the fly, rather than loading
;; this is a bit silly: we already load the LUT, might as well load another 16B from the same cache line, a memory operand for PAND since we manage to only use it once
pcmpeqw xmm4,xmm4
psrlw xmm4, 12
packuswb xmm4,xmm4 ; [ 0x0f 0x0f 0x0f ... ] mask for low-nibble of each byte
movdqa xmm0, xmm1 ; xmm0 = low nibbles at the bottom of each byte
psrlw xmm1, 4 ; xmm1 = high nibbles at the bottom of each byte (with garbage from next byte)
punpcklbw xmm1, xmm0 ; unpacked nibbles (with garbage in the high 4b of some bytes)
pand xmm1, xmm4 ; mask off the garbage bits because pshufb reacts to the MSB of each element. Delaying until after interleaving the hi and lo nibbles means we only need one
pshufb xmm5, xmm1 ; xmm5 = the hex digit for the corresponding nibble in xmm0
movups [rsi], xmm5
ret
AVX2: you can do two integers at once, with something like
int64x2_to_hex_avx2: ; (const char buf[32], uint64_t first, uint64_t second)
bswap rsi ; We could replace the two bswaps with one 256b vpshufb, but that would require a mask
vmovq xmm1, rsi
bswap rdx
vpinsrq xmm1, xmm1, rdx, 1
vpmovzxbw ymm1, xmm1 ; upper lane = rdx, lower lane = rsi, with each byte zero-extended to a word element
vpsllw ymm1, ymm1, 12 ; shift the high nibbles out, leaving the low nibbles at the top of each word
vpor ymm0, ymm0, ymm1 ; merge while hi and lo elements both need the same shift
vpsrlw ymm1, ymm1, 4 ; low nibbles in elems 1, 3, 5, ...
; high nibbles in elems 0, 2, 4, ...
pshufb / store ymm0 / ret
Using pmovzx and shifts to avoid pand is a win compared to generating the constant on the fly, I think, but probably not otherwise. It takes 2 extra shifts and a por. It's an option for the 16B non-AVX version, but it's SSE4.1.
Optimized for code-size (fits in 32 (0x20) bytes)
(Derived from Frank's loop)
Using cmov instead of the LUT to handle 0-9 vs. a-f might take fewer than 16B of extra code size. That might be fun: edits welcome.
The ways to get a nibble from the bottom of rsi into an otherwise-zeroed rax include:
mov al, sil (3B (REX required for sil)) / and al, 0x0f (2B special encoding for and al, imm8).
mov eax, esi (2B) / and eax, 0x0f (3B): same size and doesn't require an xor beforehand to zero the upper bytes of rax.
Would be smaller if the args were reversed, so the dest buffer was already in rdi. stosb is a tiny instruction (but slower than mov [rdi], al / inc rdi), so it actually saved overall bytes to use xchg rdi, rsi to set up for it. changing the function signature could save 5 bytes: void reg_to_hex(char buf[16], uint64_t val) would save two bytes from not having to return buf in rax, and 3 bytes from dropping the xchg. The caller will probably use 16B of stack, and having the caller do a mov rdx, rsp instead of mov rdx, rax before calling another function / syscall on the buffer doesn't save anything.
The next function is probably going to ALIGN 16, though, so shrinking the function to even smaller than 32B isn't as useful as getting it inside half a cache-line.
Absolute addressing for the LUT (hex_xlat) would save a few bytes
(use mov al, byte [hex_xlat + rax] instead of needing the lea).
global register_to_hex_size
register_to_hex_size:
push rsi ; pushing/popping return value (instead of mov rax, rsi) frees up rax for stosb
xchg rdi, rsi ; allows stosb. Better: remove this and change the function signature
mov cl, 16 ; 3B shorter than mov ecx, 16
lea rdx, [rel hex_xlat]
;ALIGN 16
.loop:
rol rsi, 4
mov eax, esi ; mov al, sil to allow 2B AND AL,0xf requires a 2B xor eax,eax
and eax, 0x0f
mov al, byte [rdx+rax]
stosb
;; loop .loop ; setting up ecx instead of cl takes more bytes than loop saves
dec cl
jne .loop
pop rax ; get the return value back off the stack
ret
Using xlat costs 2B (to save/restore rbx), but saves 3B, for a net savings of 1B. It's a 3-uop instruction, with 7c latency, one per 2c throughput (Intel Skylake). The latency and throughput aren't a problem here, since each iteration is a separate dependency chain, and there's too much overhead for this to run at one clock per iteration anyway. So the main problem is that it's 3 uops, making it less uop-cache-friendly. With xlat, the loop becomes 10 uops instead of 8 (using stosb), so that sucks.
112: 89 f0 mov eax,esi
114: 24 0f and al,0xf
116: d7 xlat BYTE PTR ds:[rbx]
117: aa stos BYTE PTR es:[rdi],al
vs.
f1: 89 f0 mov eax,esi
f3: 83 e0 0f and eax,0xf
f6: 8a 04 02 mov al,BYTE PTR [rdx+rax*1]
f9: aa stos BYTE PTR es:[rdi],al
Interestingly, this still has no partial-register stalls, because we never read a wide register after writing only part of it. mov eax, esi is write-only, so it cleans up the partial-reg-ness from the load into al. So there would be no advantage to using movzx eax, byte [rdx+rax]. Even when we return to the caller, the pop rax doesn't leave the caller succeptible to partial-reg problems.
(If we don't bother returning the input pointer in rax, then the caller could have a problem. Except in that case it shouldn't be reading rax at all. Usually it only matters if you call with call-preserved registers in a partial-reg state, because the called function might push them. Or more obviously, with arg-passing / return-value registers.
Efficient version (uop-cache friendly)
Looping backwards didn't turn out to save any instructions or bytes, but I've included this version because it's more different from the version in Frank's answer.
ALIGN 16
global register_to_hex_countdown
register_to_hex_countdown:
;;; work backwards in the buffer, starting with the least-significant nibble as the last char
mov rax, rsi ; return value, and loop bound
add rsi, 15 ; last char of the buffer
lea rcx, [rel hex_xlat] ; position-independent code
ALIGN 16
.loop:
mov edx, edi
and edx, 0x0f ; isolate low nibble
mov dl, byte [rcx+rdx] ; look up the ascii encoding for the hex digit
; rdx is an 'index' with range 0x0 - 0xf
; non-PIC version: mov dl, [hex_digits + rdx]
mov byte [rsi], dl
shr rdi, 4
dec rsi
cmp rsi, rax
jae .loop ; rsi counts backwards down to its initial value
ret
The whole thing is only 12 insns (11 uops with macro-fusion, or 12 including the NOP for alignment). Some CPUs can fuse cmp/jcc but not dec/jcc (e.g. AMD, and Nehalem)
Another option for looping backwards was mov ecx, 15, and store with mov [rsi+rcx], dl, but two-register addressing modes can't micro-fuse. Still, that would only bring the loop up to 8 uops, so it would be fine.
Instead of always storing 16 digits, this version could use rdi becoming zero as the loop condition to avoid printing leading zeros. i.e.
add rsi, 16
...
.loop:
...
dec rsi
mov byte [rsi], dl
shr rdi, 4
jnz .loop
; lea rax, [rsi+1] ; correction not needed because of adjustments to how rsi is managed
mov rax, rsi
ret
printing from rax to the end of the buffer gives just the significant digits of the integer.

Clearing out a string variable

I have writen this little experiement bootstrap that has a getline and print_string "functions". The boot stuff is taken from MikeOS tutorial but the rest I have writen myself. I compile this with NASM and run it in QEMU.
So the actual question: I've declared this variable curInpLn on line 6. What ever the user types is saved on that variable and then after enter is hit it is displayed to the user with some additional messages. What I'd like to do is to clear the contents of curInpLn each time the getline function is called but for some reason I can't manage to do that. I'm quite the beginner with Assmebly at the moment.
You can compile the code to bin format and then create a floppy image of it with: "dd status=noxfer conv=notrunc if=FILENAME.bin of=FILENAME.flp" and run it in qemu with: "qemu -fda FILENAME.flp"
BITS 16
jmp start
welcomeSTR: db 'Welcome!',0
promptSTR: db 'Please prompt something: ',0
responseSTR: db 'You prompted: ',0
curInpLn: times 80 db 0 ;this is a variable to hold the input 'command'
curCharCnt: dw 0
curLnNum: dw 1
start:
mov ax, 07C0h ; Set up 4K stack space after this bootloader
add ax, 288 ; (4096 + 512) / 16 bytes per paragraph
mov ss, ax
mov sp, 4096
mov ax, 07C0h ; Set data segment to where we're loaded
mov ds, ax
call clear_screen
lea bx, [welcomeSTR] ; Put string position into SI
call print_string
call new_line
.waitCMD:
lea bx, [promptSTR]
call print_string
call getLine ; Call our string-printing routine
jmp .waitCMD
getLine:
cld
mov cx, 80 ;number of loops for loopne
mov di, 0 ;offset to bx
lea bx, [curInpLn] ;the address of our string
.gtlLoop:
mov ah, 00h ;This is an bios interrupt to
int 16h ;wait for a keypress and save it to al
cmp al, 08h ;see if backspace was pressed
je .gtlRemChar ;if so, jump
mov [bx+di], al ;effective address of our curInpLn string
inc di ;is saved in bx, di is an offset where we will
;insert our char in al
cmp al, 0Dh ;see if character typed is car-return (enter)
je .gtlDone ;if so, jump
mov ah, 0Eh ;bios interrupt to show the char in al
int 10h
.gtlCont:
loopne .gtlLoop ;loopne loops until cx is zero
jmp .gtlDone
.gtlRemChar:
;mov [bx][di-1], 0 ;this needs to be solved. NASM gives error on this.
dec di
jmp .gtlCont
.gtlDone:
call new_line
lea bx, [responseSTR]
call print_string
mov [curCharCnt], di ;save the amount of chars entered to a var
lea bx, [curInpLn]
call print_string
call new_line
ret
print_string: ; Routine: output string in SI to screen
mov si, bx
mov ah, 0Eh ; int 10h 'print char' function
.repeat:
lodsb ; Get character from string
cmp al, 0
je .done ; If char is zero, end of string
int 10h ; Otherwise, print it
jmp .repeat
.done:
ret
new_line:
mov ax, [curLnNum]
inc ax
mov [curLnNum], ax
mov ah, 02h
mov dl, 0
mov dh, [curLnNum]
int 10h
ret
clear_screen:
push ax
mov ax, 3
int 10h
pop ax
ret
times 510-($-$$) db 0 ; Pad remainder of boot sector with 0s
dw 0xAA55 ; The standard PC boot signature

I haven't written code in Assembly for 20 years (!), but it looks like you need to use the 'stosw' instruction (or 'stosb'). STOSB loads the value held in AL to the byte pointed to by ES:DI, whereas STOSSW loads the value held in AX to the word pointed to by ES:DI. The instruction automatically advances the pointer. As your variable curInpLn is 80 bytes long, you can clear it with 40 iterations of STOSW. Something like
xor ax, ax ; ax = 0
mov es, ds ; point es to our data segment
mov di, offset curInpLn ; point di at the variable
mov cx, 40 ; how many repetitions
rep stosw ; zap the variable
This method is probably the quickest method of clearing the variable as it doesn't require the CPU to retrieve any instructions from the pre-fetch queue. In fact, it allows the pre-fetch queue to fill up, thus allowing any following instructions to execute as quickly as possible.

How to update array after BubbleSort - asm in vc++

I'm learning the basics of bubblesort in Assembly right now and what I've coded makes sense in my head, but apparently not to the compiler. When I print out the entire array after running this subroutine, why does the array look the same as before it was sorted?
bubblesort proc
bubbleSort Proc
mov ecx, NUMS_LENGTH
dec ecx
mov edi, 0
.WHILE ecx > 0
mov ax, NUMS[di]
.IF ax > NUMS[di]+2
xchg ax, NUMS[di]+2
call WriteInt
mov [NUMS[di]+2], ax
inc di
.ENDIF
dec ecx
.ENDW
bubbleSort endp
Advice and insight are greatly appreciated. Many thanks in advance!
EDIT
bubbleSort Proc
mov ecx, NUMS_LENGTH
dec ecx
mov edi, 0
.WHILE ecx > 0
mov di, NUMS_LENGTH-1
.WHILE di < NUMS_LENGTH-1
mov ax, NUMS[di]
.IF ax > NUMS[di]+2
xchg ax, NUMS[di]+2
mov [NUMS[di]+2], ax
inc di
.ELSE
inc di
.ENDIF
.ENDW
dec ecx
.ENDW
bubbleSort endp
*EDIT 2 *
bubbleSort Proc
mov ecx, NUMS_LENGTH
dec ecx
mov di, 0
.WHILE ecx > 0
mov ax, NUMS[di]
.WHILE di != NUMS_LENGTH-1
.IF ax > NUMS[di]+2
xchg ax, NUMS[di]+2
mov NUMS[di]+2, ax
inc di
.ELSE
inc di
mov NUMS[di]+2, ax
.ENDIF
.ENDW
dec ecx
.ENDW
bubbleSort endp

At a guess, it doesn't look quite the same, but it's not sorted either.
A bubble sort requires two nested loops, but you seem to only have one loop. That means it makes only one pass through the array. At the end of one pass, the last element in the array will be in the correct position, but the rest won't be sorted yet.
Edit: The edited code probably isn't an improvement. In particular:
mov di, NUMS_LENGTH-1
.WHILE di < NUMS_LENGTH-1
We're setting di equal to NUMS_LENGTH-1, then executing a while loop that can only execute if di is less than NUMS_LENGTH-1, so that inner loop clearly won't (ever!) execute.
Edit2: Though a bubble sort in assembly language strikes me as just about the worst wast of time possible, I suppose if you insist, the obvious way to start would be to consider how a bubble sort looks in something like C:
for (int i=array_len; i!=0; --i)
for (int j=0; j<i; j++)
if (array[j] > array[j+1]) {
temp = array[j];
array[j] = array[j+1];
array[j+1] = temp;
}
Writing something on the same general order in assembly language shouldn't be terribly difficult.
Edit4: Fixed code (I think):
mov ecx, (NUMS_LEN-1)*4
outer_loop:
xor edi, edi
inner_loop:
mov eax, nums[edi]
cmp eax, nums[edi+4]
jl noswap
xchg nums[edi+4], eax
mov nums[edi], eax
noswap:
add edi, 4
cmp edi, ecx
jl inner_loop
sub ecx, 4
jnz outer_loop
Sorry -- I've never really gotten used to Microsoft's "high level" control-flow macros for assembly language. A couple of points to consider: at least assuming we don't allow a zero-length array, for this task loops that test the condition at the bottom are a lot simpler. In general, loops that test at the bottom are cleaner in assembly language anyway. When the algorithm demands a test at the beginning, it's still often cleaner to put the test at the bottom, and build a structure like:
initalization
jmp loop_test
loop_top:
loop body
loop_test:
update loop variable(s)
if (more iterations)
jmp loop_top

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string