Nasm - Change variable in different threads - multithreading

I have a program that has a main thread and a second thread. The second thread modifies a global variable which then will be used in the main thread. But somehow the changes I make in the second thread are not shown in the main thread.
section .bss USE32
global var
var resd 1
section .text USE32
..start:
push 0
push 0
push 0
push .second
push 0
push 0
call [CreateThread]
mov eax, 1
cmp [var], eax ; --> the content of var and '1' are not the same. Which is confusing since I set the content of var to '1' in the second thread
;the other code here is not important
.second:
mov eax, 1
mov [var], eax
ret
(This is a simplification of my real program which creates threads in a loop; I haven't tested this exact code.)

You don't join the new thread (wait for it to exit); there's no reason to assume that it's finished (or even fully started) when CreateThread returns to the main thread.
You could spin-wait until you see a non-zero value in [var], and count how many iterations that takes, if you want to benchmark thread-startup overhead + inter-core latency.
...
call [CreateThread]
mov edi, 1
cmp [var], edi
je .zero_latency ; if var already changed
rdtsc ; could put an lfence before and/or after this to serialize execution
mov ecx, eax ; save low half of EDX:EAX cycle count; should be short enough that the interval fits in 32 bits
xor esi, esi
.spin:
inc esi ; ++spin_count
pause ; optional, but avoids memory-order mis-speculation when var changes
cmp [var], edi
jne .spin
rdtsc
sub eax, ecx ; reference cycles since CreateThread returned
...
.zero_latency: ; jump here if the value already changed before the first iteration
Note that rdtsc measures in reference cycles, not core clock cycles, so turbo matters. Only doing the low 32 bits of the 64-bit subtraction is fine if the interval is less than 2^32 (e.g. about 1 second on a CPU with a reference frequency of 4.2 GHz, vastly longer than we'd expect here).
esi is the spin count. With pause in the loop, you'll do about one check per 100 cycles on Skylake and later, or about one check per 5 cycles on earlier Intel. Otherwise about one check per core clock cycle.

Related

Scan an integer and print the interval (1, integer) in NASM

I am trying to learn the assembly language from a Linux Ubuntu 16.04 x64.
For now I have the following problem:
- scan an integer n and print the numbers from 1 to n.
For n = 5 I should have 1 2 3 4 5.
I tried to do it with scanf and printf but after I input the number, it exits.
The code is:
;nasm -felf64 code.asm && gcc code.o && ./a.out
SECTION .data
message1: db "Enter the number: ",0
message1Len: equ $-message1
message2: db "The numbers are:", 0
formatin: db "%d",0
formatout: db "%d",10,0 ; newline, nul
integer: times 4 db 0 ; 32-bits integer = 4 bytes
SECTION .text
global main
extern scanf
extern printf
main:
mov eax, 4
mov ebx, 1
mov ecx, message1
mov edx, message1Len
int 80h
mov rdi, formatin
mov rsi, integer
mov al, 0
call scanf
int 80h
mov rax, integer
loop:
push rax
push formatout
call printf
add esp, 8
dec rax
jnz loop
mov rax,0
ret
I am aware that in this loop I would have the inverse output (5 4 3 2 1 0), but I did not know how to set the condition.
The command I'm using is the following:
nasm -felf64 code.asm && gcc code.o && ./a.out
Can you please help me find where I'm going wrong?
There are several problems:
1. The parameters to printf, as discussed in the comments. In x86-64, the first few parameters are passed in registers.
2. printf does not preserve the value of eax.
3. The stack is misaligned.
4. rbx is used without saving the caller's value.
5. The address of integer is being loaded instead of its value.
6. Since printf is a varargs function, eax needs to be set to 0 before the call.
7. Spurious int 80h after the call to scanf.
I'll repeat the entire function in order to show the necessary changes in context.
main:
push rbx ; This fixes problems 3 and 4.
mov eax, 4
mov ebx, 1
mov ecx, message1
mov edx, message1Len
int 80h
mov rdi, formatin
mov rsi, integer
mov al, 0
call scanf
mov ebx, [integer] ; fix problems 2 and 5
loop:
mov rdi, formatout ; fix problem 1
mov esi, ebx
xor eax, eax ; fix problem 6
call printf
dec ebx
jnz loop
pop rbx ; restore caller's value
mov rax,0
ret
P.S. To make it count up instead of down, change the loop like this:
mov ebx, 1
loop:
<call printf>
inc ebx
cmp ebx, [integer]
jle loop
You are calling scanf correctly, using the x86-64 System V calling convention. It leaves its return value in eax. After successful conversion of one operand (%d), it returns with eax = 1.
... correct setup for scanf, including zeroing AL.
call scanf ; correct
int 80h ; insane: system call with eax = scanf return value
Then you run int 80h, which makes a 32-bit legacy-ABI system call using eax=1 as the code to determine which system call. (see What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?).
eax=1 / int 80h is sys_exit on Linux. (unistd_32.h has __NR_exit = 1). Use a debugger; that would have shown you which instruction was making your program exit.
Your title (before I corrected it) said you got a segmentation fault, but I tested on my x86-64 desktop and that's not the case. It exits cleanly using an int 80h exit system call. (But in code that does segfault, use a debugger to find out which instruction.) strace decodes int 0x80 system calls incorrectly in 64-bit processes, using the 64-bit syscall call numbers from unistd_64.h, not the 32-bit unistd_32.h call numbers.
Your code was close to working: you use the int 0x80 32-bit ABI correctly for sys_write, and only pass it 32-bit args. (The pointer args fit in 32 bits because static code/data is always placed in the low 2GiB of virtual address space in the default code model on x86-64. Exactly for this reason, so you can use compact instructions like mov edi, formatin to put addresses in registers, or use them as immediates or rel32 signed displacements.)
OTOH I think you were doing that for the wrong reason. And as #prl points out, you forgot to maintain 16-byte stack alignment.
Also, mixing system calls with C stdio functions is usually a bad idea. Stdio uses internal buffers instead of always making a system call on every function call, so things can appear out of order, or a read can be waiting for user input when there's already data in the stdio buffer for stdin.
Your loop is broken in several ways, too. You seem to be trying to call printf with the 32-bit calling convention (args on the stack).
Even in 32-bit code, this is broken, because printf's return vale is in eax. So your loop is infinite, because printf returns the number of characters printed. That's at least two from the %d\n format string, so dec rax / jnz will always jump.
In the x86-64 SysV ABI, you need to zero al before calling printf (with xor eax,eax), if you didn't pass any FP args in XMM registers. You also have to pass args in rdi, rsi, ..., like for scanf.
You also add rsp, 8 after pushing two 8-byte values, so the stack grows forever. (But you never return, so the eventual segfault will be on stack overflow, not on trying to return with rsp not pointing to the return address.)
Decide whether you're making 32-bit or 64-bit code, and only copy/paste from examples for the mode and OS you're targeting. (Note that 64-bit code can and often does use mostly 32-bit registers, though.)
See also Assembling 32-bit binaries on a 64-bit system (GNU toolchain) (which does include a NASM section with a handy asm-link script that assembles and links into a static binary). But since you're writing main instead of _start and are using libc functions, you should just link with gcc -m32 (if you decide to use 32-bit code instead of replacing the 32-bit parts of your program with 64-bit function-calling and system-call conventions).
See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64.

Making a for loop in Assembly x86

I am new to Assembly and is wondering how I could write a for loop that will iterate 15 times. In the for loop i need to have an if condition that tests wether or not an integer k is greater than 7 as well. How would this be written in code?
Usually this would be written as a comment, but it is a little bit too complex. The code in the other answer isn't as efficient as it could be:
MOV ECX, 15 ; number of iterations
.loop:
...
CMP EAX, 8 ; compare k to 7
DEC ECX ; partial-flag merge needed after this, slow on some CPUs
JA .loop ; check for loop exit condition
B: ; code after the loop
...
;; check CF here to see if k >= 8 or not, i.e. k > 7
EDIT: To have k in memory as 32-bit integer as before, the CMP instruction looks as the following:
CMP DWORD [k], 8
EDIT2: Save one conditional jump. The OP didn't mention to leave or to stay when k is greater than 7. The above code leaves the loop when it ran 15 times or k isn't greater than 7.
Note that this trick with combining the comparisons is only likely to be good on CPUs like Intel Sandybridge-family or Silvermont which don't have partial-flag stalls for reading CF after a dec. On Core2 / Nehalem, this will stall for ~7 cycles to merge flags, like you'd get in an adc BigInteger loop using dec. See Agner Fog's microarch PDF.
Unless you're sure it's good on all the CPUs your code will run on, use cmp / ja or jae separately from dec / jnz.
EDIT3: The OP asks for incrementing/decrementing an integer (here edx) when eax is greater/smaller than 7:
.loop:
...
CMP EAX, 7
DEC EDX
JB .end
INC EDX
JE .end
INC EDX
.end:
DEC ECX
JZ .loop
(probably there'll be someone optimizing this further)
I suggest learning how to use compiled code to teach you such things. Write the code you want in C, then compile it with flags to output the assembly. For example, in gcc this is done with the -S flag. You can then read the machine-generated assembly code to see how it was accomplished.
The general strategy for a for-loop could be something like this:
MOV ECX, 0 ; clear ECX
A:
CMP [k], 7 ; compare k to 7
JGT B ; jump to B if k is greater than 7
CMP ECX, 15 ; check for loop exit condition
JE C ; jump to C if we have iterated 15 times
INC ECX
JMP A
B: ; this happens if K is greater than 7
...
C: ; this happens if we reach the end of the loop

Optimizing a loop that prints a counter

I have a very small loop program that prints the numbers from 5000000 to 1. I want to make it run the fastest possible.
I'm learning linux x86-64 assembly with NASM.
global main
extern printf
main:
push rbx
mov rax,5000000d
print:
push rax
push rcx
mov rdi, format
mov rsi, rax
call printf
pop rcx
pop rax
dec rax
jnz print
pop rbx
ret
format:
db "%ld", 10, 0
The call to printf completely dominates the run-time of even that highly inefficient loop. (Did you notice that you push/pop rcx even though you never use it anywhere? Perhaps that's left over from using the slow LOOP instruction).
To learn more about writing efficient x86 asm, see Agner Fog's Optimizing Assembly guide. (And his microarchitecture guide, too, if you want to really get into the details of specific CPUs and how they're different: What's optimal on one uarch CPU might not be on another. e.g. IMUL r64 has much better throughput and latency on Intel CPUs than on AMD, but CMOV and ADC are 2 uops on Intel pre-Broadwell, with 2 cycle latency. vs. 1 on AMD, since 3-input ALU m-ops (FLAGS + both registers) aren't a problem for AMD.) Also see other links in the x86 tag wiki.
Purely optimizing the loop without changing the 5M calls to printf is useful only as an example of how to write a loop properly, not to actually speed up this code. But let's start with that:
; trivial fixes to loop efficiently while calling the same slow function
global main
extern printf
main:
push rbx
mov ebx, 5000000 ; don't waste a REX prefix for constants that fit in 32 bits
.print:
;; removed the push/pops from inside the loop.
; Use call-preserved regs instead of saving/restoring stuff inside a loop yourself.
mov edi, format ; static data / code always has a 32-bit address
mov esi, ebx
xor eax, eax ; The x86-64 SysV ABI requires al = number of FP args passed in FP registers for variadic functions
call printf
dec ebx
jnz .print
pop rbx ; restore rbx, the one call-preserved reg we actually used.
xor eax,eax ; successful exit status.
ret
section .rodata ; it's usually best to put constant data in a separate section of the text segment, not right next to code.
format:
db "%ld", 10, 0
To speed this up, we should take advantage of the redundancy in converting consecutive integers to strings. Since "5000000\n" is only 8 bytes long (including the newline), the string representation fits in a 64-bit register.
We can store that string into a buffer and increment a pointer by the string length. (Since it will get shorter for smaller numbers, just keep the current string length in a register, which you can update in the special-case branch where it changes.)
We can decrement the string representation in-place to avoid ever (re)doing the process of dividing by 10 to turn an integer into a decimal string.
Since carry/borrow doesn't naturally propagate inside a register, and the AAS instruction isn't available in 64-bit mode (and only worked on AX, not even EAX, and is slow), we have to do it ourselves. We're decrementing by 1 every time, so we know what's going to happen. We can handle the least-significant digit by unrolling 10 times, so there's no branching to handle it.
Also note that since we want to the digits in printing order, carry goes the wrong direction anyway, since x86 is little-endian. If there was a good way to take advantage of having our string in the other byte order, we could maybe use BSWAP or MOVBE. (But note that MOVBE r64 is 3 fused-domain uops on Skylake, 2 of them ALU uops. BSWAP r64 is also 2 uops.)
Perhaps we should be doing odd/even counters in parallel, in two halves of an XMM vector register. But that stops working well once the string is shorter than 8B. Storing one number-string at a time, we can easily overlap. Still, we could do the carry-propagation stuff in a vector reg and store the two halves separately with MOVQ and MOVHPS. Or since 4/5th of the numbers from 0 to 5M are 7 digits, it might be worth having code for the special case where we can store a whole 16B vector of two numbers.
A better way to handle shorter strings: SSSE3 PSHUFB to shuffle the two strings to left-packed in a vector register, then a single MOVUPS to store two at once. The shuffle mask only needs to be updated when the string length (number of digits) changes, so the infrequently-executed carry-handling special case code can do that, too.
Vectorization of the hot part of the loop should be very straightforward and cheap, and should just about double performance.
;;; Optimized version: keep the string data in a register and modify it
;;; instead of doing the whole int->string conversion every time.
section .bss
printbuf: resb 1024*128 + 4096 ; Buffer size ~= half L2 cache size on Intel SnB-family. Or use a giant buffer that we write() once. Or maybe vmsplice to give it away to the kernel, since we only run once.
global main
extern printf
main:
push rbx
; use some REX-only regs for values that we're always going to use a REX prefix with anyway for 64-bit operand size.
mov rdx, `5000000\n` ; (NASM string constants as integers work like little-endian, so AL = '5' = 0x35 and the high byte holds '\n' = 10). Note that YASM doesn't support back-ticks for C-style backslash processing.
mov r9, 1<<56 ; decrement by 1 in the 2nd-last byte: LSB of the decimal string
;xor r9d, r9d
;bts r9, 56 ; IDK if this code-size optimization outside the loop would help or not.
mov eax, 8 ; string length.
mov edi, printbuf
.storeloop:
;; rdx = "????x9\n". We compute the start value for the next iteration, i.e. counter -= 10 in rdx.
mov r8, rdx
;; r8 = rdx. We modify it to have each last digit from 9 down to 0 in sequence, and store those strings in the buffer.
;; The string could be any length, always with the first ASCII digit in the low byte; our other constants are adjusted correctly for it
;; narrower than 8B means that our stores overlap, but that's fine.
;; Starting from here to compute the next unrolled iteration's starting value takes the `sub r8, r9` instructions off the critical path, vs. if we started from r8 at the bottom of the loop. This gives out-of-order execution more to play with.
;; It means each loop iteration's sequence of subs and stores are a separate dependency chain (except for the store addresses, but OOO can get ahead on those because we only pointer-increment every 2 stores).
mov [rdi], r8
sub r8, r9 ; r8 = "xxx8\n"
mov [rdi + rax], r8 ; defer p += len by using a 2-reg addressing mode
sub r8, r9 ; r8 = "xxx7\n"
lea edi, [rdi + rax*2] ; if we had len*3 in another reg, we could defer this longer
;; our static buffer is guaranteed to be in the low 31 bits of address space so we can safely save a REX prefix on the LEA here. Normally you shouldn't truncate pointers to 32-bits, but you asked for the fastest possible. This won't hurt, and might help on some CPUs, especially with possible decode bottlenecks.
;; repeat that block 3 more times.
;; using a short inner loop for the 9..0 last digit might be a win on some CPUs (like maybe Core2), depending on their front-end loop-buffer capabilities if the frontend is a bottleneck at all here.
;; anyway, then for the last one:
mov [rdi], r8 ; r8 = "xxx1\n"
sub r8, r9
mov [rdi + rax], r8 ; r8 = "xxx0\n"
lea edi, [rdi + rax*2]
;; compute next iteration's RDX. It's probably a win to interleave some of this into the loop body, but out-of-order execution should do a reasonably good job here.
mov rcx, r9
shr rcx, 8 ; maybe hoist this constant out, too
; rcx = 1 in the second-lowest digit
sub rdx, rcx
; detect carry when '0' (0x30) - 1 = 0x2F by checking the low bit of the high nibble in that byte.
shl rcx, 5
test rdx, rcx
jz .carry_second_digit
; .carry_second_digit is some complicated code to propagate carry as far as it needs to go, up to the most-significant digit.
; when it's done, it re-enters the loop at the top, with eax and r9 set appropriately.
; it only runs once per 100 digits, so it doesn't have to be super-fast
; maybe only do buffer-length checks in the carry-handling branch,
; in which case the jz .carry can be jnz .storeloop
cmp edi, esi ; } while(p < endp)
jbe .storeloop
; write() system call on the buffer.
; Maybe need a loop around this instead of doing all 5M integer-strings in one giant buffer.
pop rbx
xor eax,eax ; successful exit status.
ret
This is not fully fleshed-out, but should give an idea of what might work well.
If vectorizing with SSE2, probably use a scalar integer register to keep track of when you need to break out and handle carry. i.e. a down-counter from 10.
Even this scalar version probably comes close to sustaining one store per clock, which saturates the store port. They're only 8B stores (and when the string gets shorter, the useful part is shorter than that), so we're definitely leaving performance on the table if we don't bottleneck on cache misses. But with a 3GHz CPU and dual channel DDR3-1600 (~25.6GB/s theoretical max bandwidth), 8B per clock is about enough to saturate main memory with a single core.
We could parallelize this, and break the 5M .. 1 range up into chunks. With some clever math, we can figure out what byte to write the first character of "2500000\n", or we could have each thread call write() itself in the correct order. (Or use the same clever math to have them call pwrite(2) independently with different file offsets, so the kernel takes care of all the synchronization for multiple writers to the same file.)
You're essentially printing a fixed string. I'd pre-generate that string into one long constant.
The program then becomes a single call to write (or a short loop to deal with incomplete writes).

Need some advice with NASM loop

I`m trying to make a while loop that prints from 0 through 10 but have some errors...
Compile with these:
nasm -f elf myprog.asm
gcc -m32 -o myprog myprog.o
Errors:
at output you can see 134513690.. lots of....
and at the last line a segmentation fault
This is the code:
SECTION .text
global main
extern printf
main:
xor eax,eax ; eax = 0
myloop:
cmp eax,10 ; eax = 10?
je finish ; If true finish
push eax ; Save eax value
push number ; push number value on stack
call printf
pop eax
inc eax ; eax + 1
add esp,80 ; Im not sure what is this
jmp myloop ; jump to myloop
number db "%d",10,0 ; This is how i print the numbers
finish:
mov eax,1
mov ebx,0
int 0x80
There's one real error in this code; the function call cleanup isn't quite right. I would change the myloop section to be like this:
myloop:
cmp eax,10 ; eax = 10?
je finish ; If true finish
push eax ; Save eax value
push number ; push number value on stack
call printf
add esp, 4 ; move past the `push number` line
pop eax
inc eax ; eax + 1
jmp myloop ; jump to myloop
The biggest difference is that instead of adding 80 to esp (and I'm not sure why you were doing that), you're only adding the size of the argument pushed. Also, previously the wrong value was getting popped as eax, but switching the order of the add and the pop fixes this.
A few problems, you need to push the "number" not as address, but as numeral.
push dword number
After you call printf, you need to restore the stack, ESP.
Basically when you "push" a register, it gets stored in the stack. Since you push twice (two arguments), you need to restore 8 bytes.
When you "pop eax", you're retrieving the top of the stack, which is "number", not the counter. Therefore, you just need to do
pop eax
pop eax
then there is no need to restore the ESP by adding since it is done by popping.
Basically, after the first iteration, eax points at an address, so it will never be equal to 10.
Further reading about Stack Pointer and Base Pointer:
Ebp, esp and stack frame in assembly with nasm

string comparison in 8086

I have problem with this question. I don't know what it wants from me.
Question : Write a procedure that compares a source string at DS:SI to a destination string at ES:DI and sets the flags accordingly. If the source is less than the destination, carry flag is set. if string are equal , the zero flag is set. if the source is greater than the destination , the zero and carry flags are both cleared.
My Answer :
MOV ESI , STRING1
MOV EDI, STRING2
MOV ECX, COUNT
CLD
REPE CMPSB
Still I am not sure about it. Is it true or should I try something else ?
p.s: I don't understand why people vote down this question. What is wrong with my question ? I think we are all here for learning. Or not ? Miss I something ?
If the problem statement says the pointers are already in SI and DI when you're called, you shouldn't clobber them.
16-bit code often doesn't stick to a single calling convention for all functions, and passing (the first few) args in registers is usually good (fewer instructions, and avoids store/reload). 32-bit x86 calling conventions usually use stack args, but that's obsolete. Both the Windows x64 and Linux/Mac x86-64 System V ABIs / calling conventions use register args.
The problem statement doesn't mention a count, though. So you're implementing strcmp for strings terminated by a zero-byte, rather than memcmp for known-length blocks of memory. You can't use a single rep instruction, since you need to check for non-equal and for end-of-string. If you just pass some large size, and the strings are equal, repe cmpsb would continue past the terminator.
repe cmpsb is usable if you know the length of either string. e.g. take a length arg in CX to avoid the problem of running past the terminator in both strings.
But for performance, repe cmpsb isn't fast anyway (like 2 to 3 cycles per compare, on Skylake vs. Ryzen. Or even 4 cycles per compare on Bulldozer-family). Only rep movs and rep stos are efficient on modern CPUs, with optimized microcode that copies or stores 16 (or 32 or 64) bytes at a time.
There are 2 major conventions for storing strings in memory: Explicit-length strings (pointer + length) like C++ std::string, and implicit length strings where you just have a pointer, and the end of string is marked by a sentinel / terminator. (Like C char* that uses a 0 byte, or DOS string-print functions that use '$' as a terminator.)
A useful observation is that you only need to check for the terminator in one of the strings. If the other string has a terminator and this one doesn't, it will be a mismatch.
So you want to load a byte into a register from one string, and check it for the teminator and against memory for the other string.
(If you need to actually use ES:DI instead of just DI with the default DS segment base, you can use cmp al, [es: bx + di] (NASM syntax, adjust as needed like maybe cmp al, es: [bx + di] I think). Probably the question intended for you to use lodsb and scasb, because scasb uses ES:DI.)
;; inputs: str1 pointer in DI, str2 pointer in SI
;; outputs: BX = mismatch index, or one-past-the-terminators.
;; FLAGS: ZF=1 for equal strings (je), ZF=0 for mismatch (jne)
;; clobbers: AL (holds str1's terminator or mismatching byte on return)
strcmp:
xor bx, bx
.innerloop: ; do {
mov al, [si + bx] ; load a source byte
cmp al, [di + bx] ; check it against the other string
jne .mismatch ; if (str1[i] != str2[i]) break;
inc bx ; index++
test al, al ; check for 0. Use cmp al, '$' for a $ terminator
jnz .innerloop ; }while(str1[i] != terminator);
; fall through (ZF=1) means we found the terminator
; in str1 *and* str2 at the same position, thus they match
.mismatch: ; we jump here with ZF=0 on mismatch
; sete al ; optionally create an integer in AL from FLAGS
ret
Usage: put pointers in SI/DI, call strcmp / je match, because the match / non-match state is in FLAGS. If you want to turn the condition into an integer, 386 and later CPUs allow sete al to create a 0 or 1 in AL according to the equals condition (ZF==1).
Using sub al, [mem] instead of cmp al, [mem], we'd get al = str1[i] - str2[i], giving us a 0 only if the strings matched. If your strings only hold ASCII values from 0..127, that can't cause signed overflow, so you can use it as a signed return value that actually tells you which string sorts before/after the other. (But if there could be high-ASCII 128..255 bytes in the string, we'd need to zero- or sign-extend to 16-bit first to avoid signed overflow for a case like (unsigned)5 - (unsigned)254 = (signed)+7 because of 8-bit wraparound.
Of course, with our FLAGS return value, the caller can already use ja or jb (unsigned compare results), or jg / jl if they want to treat the strings as holding signed char. This works regardless of the range of input bytes.
Or inline this loop so jne mismatch jumps somewhere useful directly.
16-bit addressing modes are limited, but BX can be a base, and SI and DI can both be indices. I used an index increment instead of inc si and inc di. Using lodsb would also be an option, and maybe even scasb to compare it to the other string. (And then check the terminator.)
Performance
Indexed addressing modes can be slower on some modern x86 CPUs, but this does save instructions in the loop (so it's good for true 8086 where code-size matters). Although to really tune for 8086, I think lodsb / scasb would be your best bet, replacing the mov load and cmp al, [mem], and also the inc bx. Just remember to use cld outside the loop if your calling convention doesn't guarantee that.
If you care about modern x86, use movzx eax, byte [si+bx] to break the false dependency on the old value of EAX, for CPUs that don't rename partial registers separately. (Breaking the false dep is especially important if you use sub al, [str2] because that would turn it into a 2-cycle loop-carried dependency chain through EAX, on CPUs other than PPro through Sandybridge. IvyBridge and later doesn't rename AL separately from EAX, so mov al, [mem] is a micro-fused load+merge uop.)
If the cmp al,[bx+di] micro-fuses the load, and macro-fuses with jne into one compare-and-branch uop, the whole loop could be only 4 uops total on Haswell, and could run at 1 iteration per clock for large inputs. The branch mispredict at the end will make small-input performance worse, unless branching goes the way way every time for a small enough input. See https://agner.org/optimize/. Recent Intel and AMD can do 2 loads per clock.
Unrolling could amortize the inc bx cost, but that's all. With a taken + not-taken branch inside the loop, no current CPUs can run this faster than 1 cycle per iteration. (See Why are loops always compiled into "do...while" style (tail jump)? for more about the do{}while loop structure). To go faster we'd need to check multiple bytes at once.
Even 1 byte / cycle is very slow compared to 16 bytes per 1 or 2 cycles with SSE2 (using some clever tricks to avoid reading memory that might fault).
See https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for more about using x86 SIMD for string compare, and also glibc's SSE2 and later optimized string functions.
GNU libc's fallback scalar strcmp implementation looks decent (translated from AT&T to Intel syntax, but with the C preprocessor macros and stuff left in. L() makes a local label).
It only uses this when SSE2 or better isn't available. There are bithacks for checking a whole 32-bit register for any zero bytes, which could let you go faster even without SIMD, but alignment is a problem. (If the terminator could be anywhere, you have to be careful when loading multiple bytes at once to not read from any memory pages that you aren't sure contain at least 1 byte of valid data, otherwise you could fault.)
strcmp:
mov ecx,DWORD PTR [esp+0x4]
mov edx,DWORD PTR [esp+0x8] # load pointer args
L(oop): mov al,BYTE PTR [ecx] # movzx eax, byte ptr [ecx] would be avoid a false dep
cmp al,BYTE PTR [edx]
jne L(neq)
inc ecx
inc edx
test al, al
jnz L(oop)
xorl eax, eax
/* when strings are equal, pointers rest one beyond
the end of the NUL terminators. */
ret
L(neq): mov eax, 1
mov ecx, -1
cmovb eax, ecx
ret

Resources