Related
Previously I asked how I could send only the first byte of a string to a record, I was told that reading only the first byte of the string, but I don't know how to do it, please help.
To move the first byte of the string into al (which is the lowest byte of eax), use
mov al, [string]
If you want to move the byte into the low byte of eax and zero out the rest of eax, you can do
movzx eax, byte [string]
ref
Ok, so I'm fairly new to assembly, infact, I'm very new to assembly. I wrote a piece of code which is simply meant to take numerical input from the user, multiply it by 10, and have the result expressed to the user via the programs exit status (by typing echo $? in terminal)
Problem is, it is not giving the correct number, 4x10 showed as 144. So then I figured the input would probably be as a character, rather than an integer. My question here is, how do I convert the character input to an integer so that it can be used in arithmetic calculations?
It would be great if someone could answer keeping in mind that I'm a beginner :)
Also, how can I convert said integer back to a character?
section .data
section .bss
input resb 4
section .text
global _start
_start:
mov eax, 3
mov ebx, 0
mov ecx, input
mov edx, 4
int 0x80
mov ebx, 10
imul ebx, ecx
mov eax, 1
int 0x80
Here's a couple of functions for converting strings to integers, and vice versa:
; Input:
; ESI = pointer to the string to convert
; ECX = number of digits in the string (must be > 0)
; Output:
; EAX = integer value
string_to_int:
xor ebx,ebx ; clear ebx
.next_digit:
movzx eax,byte[esi]
inc esi
sub al,'0' ; convert from ASCII to number
imul ebx,10
add ebx,eax ; ebx = ebx*10 + eax
loop .next_digit ; while (--ecx)
mov eax,ebx
ret
; Input:
; EAX = integer value to convert
; ESI = pointer to buffer to store the string in (must have room for at least 10 bytes)
; Output:
; EAX = pointer to the first character of the generated string
int_to_string:
add esi,9
mov byte [esi],STRING_TERMINATOR
mov ebx,10
.next_digit:
xor edx,edx ; Clear edx prior to dividing edx:eax by ebx
div ebx ; eax /= 10
add dl,'0' ; Convert the remainder to ASCII
dec esi ; store characters in reverse order
mov [esi],dl
test eax,eax
jnz .next_digit ; Repeat until eax==0
mov eax,esi
ret
And this is how you'd use them:
STRING_TERMINATOR equ 0
lea esi,[thestring]
mov ecx,4
call string_to_int
; EAX now contains 1234
; Convert it back to a string
lea esi,[buffer]
call int_to_string
; You now have a string pointer in EAX, which
; you can use with the sys_write system call
thestring: db "1234",0
buffer: resb 10
Note that I don't do much error checking in these routines (like checking if there are characters outside of the range '0' - '9'). Nor do the routines handle signed numbers. So if you need those things you'll have to add them yourself.
The basic algorith for string->digit is: total = total*10 + digit, starting from the MSD. (e.g. with digit = *p++ - '0' for an ASCII string of digits). So the left-most / Most-Significant / first digit (in memory, and in reading order) gets multiplied by 10 N times, where N is the total number of digits after it.
Doing it this way is generally more efficient than multiplying each digit by the right power of 10 before adding. That would need 2 multiplies; one to grow a power of 10, and another to apply it to the digit. (Or a table look-up with ascending powers of 10).
Of course, for efficiency you might use SSSE3 pmaddubsw and SSE2 pmaddwd to multiply digits by their place-value in parallel: see Is there a fast way to convert a string of 8 ASCII decimal digits into a binary number? and arbitrary-length How to implement atoi using SIMD?. But the latter probably isn't a win when numbers are typically short. A scalar loop is efficient when most numbers are only a couple digits long.
Adding on to #Michael's answer, it may be useful to have the int->string function stop at the first non-digit, instead of at a fixed length. This will catch problems like your string including a newline from when the user pressed return, as well as not turning 12xy34 into a very large number. (Treat it as 12, like C's atoi function). The stop character can also be the terminating 0 in a C implicit-length string.
I've also made some improvements:
Don't use the slow loop instruction unless you're optimizing for code-size. Just forget it exists and use dec / jnz in cases where counting down to zero is still what you want to do, instead of comparing a pointer or something else.
2 LEA instructions are significantly better than imul + add: lower latency.
accumulate the result in EAX where we want to return it anyway. (If you inline this instead of calling it, use whatever register you want the result in.)
I changed the registers so it follows the x86-64 System V ABI (First arg in RDI, return in EAX).
Porting to 32-bit: This doesn't depend on 64-bitness at all; it can be ported to 32-bit by just using 32-bit registers. (i.e. replace rdi with edi, rax with ecx, and rax with eax). Beware of C calling-convention differences between 32 and 64-bit, e.g. EDI is call-preserved and args are usually passed on the stack. But if your caller is asm, you can pass an arg in EDI.
; args: pointer in RDI to ASCII decimal digits, terminated by a non-digit
; clobbers: ECX
; returns: EAX = atoi(RDI) (base 10 unsigned)
; RDI = pointer to first non-digit
global base10string_to_int
base10string_to_int:
movzx eax, byte [rdi] ; start with the first digit
sub eax, '0' ; convert from ASCII to number
cmp al, 9 ; check that it's a decimal digit [0..9]
jbe .loop_entry ; too low -> wraps to high value, fails unsigned compare check
; else: bad first digit: return 0
xor eax,eax
ret
; rotate the loop so we can put the JCC at the bottom where it belongs
; but still check the digit before messing up our total
.next_digit: ; do {
lea eax, [rax*4 + rax] ; total *= 5
lea eax, [rax*2 + rcx] ; total = (total*5)*2 + digit
; imul eax, 10 / add eax, ecx
.loop_entry:
inc rdi
movzx ecx, byte [rdi]
sub ecx, '0'
cmp ecx, 9
jbe .next_digit ; } while( digit <= 9 )
ret ; return with total in eax
This stops converting on the first non-digit character. Often this will be the 0 byte that terminates an implicit-length string. You could check after the loop that it was a string-end, not some other non-digit character, by checking ecx == -'0' (which still holds the str[i] - '0' integer "digit" value that was out of range), if you want to detect trailing garbage.
If your input is an explicit-length string, you'd need to use a loop counter instead of checking a terminator (like #Michael's answer), because the next byte in memory might be another digit. Or it might be in an unmapped page.
Making the first iteration special and handling it before jumping into the main part of the loop is called loop peeling. Peeling the first iteration allows us to optimize it specially, because we know total=0 so there's no need to multiply anything by 10. It's like starting with sum = array[0]; i=1 instead of sum=0, i=0;.
To get nice loop structure (with the conditional branch at the bottom), I used the trick of jumping into the middle of the loop for the first iteration. This didn't even take an extra jmp because I was already branching in the peeled first iteration. Reordering a loop so an if()break in the middle becomes a loop branch at the bottom is called loop rotation, and can involve peeling the first part of the first iteration and the 2nd part of the last iteration.
The simple way to solve the problem of exiting the loop on a non-digit would be to have a jcc in the loop body, like an if() break; statement in C before the total = total*10 + digit. But then I'd need a jmp and have 2 total branch instructions in the loop, meaning more overhead.
If I didn't need the sub ecx, '0' result for the loop condition, I could have used lea eax, [rax*2 + rcx - '0'] to do it as part of the LEA as well. But that would have made the LEA latency 3 cycles instead of 1, on Sandybridge-family CPUs. (3-component LEA vs. 2 or less.) The two LEAs form a loop-carried dependency chain on eax (total), so (especially for large numbers) it would not be worth it on Intel. On CPUs where base + scaled-index is no faster than base + scaled-index + disp8 (Bulldozer-family / Ryzen), then sure, if you have an explicit length as your loop condition and don't want to check the digits at all.
I used movzx to load with zero extension in the first place, instead of doing that after converting the digit from ASCII to integer. (It has to be done at some point to add into 32-bit EAX). Often code that manipulates ASCII digits uses byte operand-size, like mov cl, [rdi]. But that would create a false dependency on the old value of RCX on most CPUs.
sub al,'0' saves 1 byte over sub eax,'0', but causes a partial-register stall on Nehalem/Core2 and even worse on PIII. Fine on all other CPU families, even Sandybridge: it's a RMW of AL, so it doesn't rename the partial reg separately from EAX. But cmp al, 9 doesn't cause a problem, because reading a byte register is always fine. It saves a byte (special encoding with no ModRM byte), so I used that at the top of the function.
For more optimization stuff, see http://agner.org/optimize, and other links in the x86 tag wiki.
The tag wiki also has beginner links, including an FAQ section with links to integer->string functions, and other common beginner questions.
Related:
How do I print an integer in Assembly Level Programming without printf from the c library? is the reverse of this question, integer -> base10string.
Is there a fast way to convert a string of 8 ASCII decimal digits into a binary number? highly optimized SSSE3 pmaddubsw / pmaddwd for 8-digit integers.
How to implement atoi using SIMD? using a shuffle to handle variable-length
Conversion of huge decimal numbers (128bit) formatted as ASCII to binary (hex) handles long strings, e.g. a 128-bit integer that takes 4x 32-bit registers. (It's not very efficient, and might be better to convert in multiple chunks and then do extended-precision multiplies by 1e9 or something.)
Convert from ascii to integer in AT&T Assembly Inefficient AT&T version of this.
Ok, so I'm fairly new to assembly, infact, I'm very new to assembly. I wrote a piece of code which is simply meant to take numerical input from the user, multiply it by 10, and have the result expressed to the user via the programs exit status (by typing echo $? in terminal)
Problem is, it is not giving the correct number, 4x10 showed as 144. So then I figured the input would probably be as a character, rather than an integer. My question here is, how do I convert the character input to an integer so that it can be used in arithmetic calculations?
It would be great if someone could answer keeping in mind that I'm a beginner :)
Also, how can I convert said integer back to a character?
section .data
section .bss
input resb 4
section .text
global _start
_start:
mov eax, 3
mov ebx, 0
mov ecx, input
mov edx, 4
int 0x80
mov ebx, 10
imul ebx, ecx
mov eax, 1
int 0x80
Here's a couple of functions for converting strings to integers, and vice versa:
; Input:
; ESI = pointer to the string to convert
; ECX = number of digits in the string (must be > 0)
; Output:
; EAX = integer value
string_to_int:
xor ebx,ebx ; clear ebx
.next_digit:
movzx eax,byte[esi]
inc esi
sub al,'0' ; convert from ASCII to number
imul ebx,10
add ebx,eax ; ebx = ebx*10 + eax
loop .next_digit ; while (--ecx)
mov eax,ebx
ret
; Input:
; EAX = integer value to convert
; ESI = pointer to buffer to store the string in (must have room for at least 10 bytes)
; Output:
; EAX = pointer to the first character of the generated string
int_to_string:
add esi,9
mov byte [esi],STRING_TERMINATOR
mov ebx,10
.next_digit:
xor edx,edx ; Clear edx prior to dividing edx:eax by ebx
div ebx ; eax /= 10
add dl,'0' ; Convert the remainder to ASCII
dec esi ; store characters in reverse order
mov [esi],dl
test eax,eax
jnz .next_digit ; Repeat until eax==0
mov eax,esi
ret
And this is how you'd use them:
STRING_TERMINATOR equ 0
lea esi,[thestring]
mov ecx,4
call string_to_int
; EAX now contains 1234
; Convert it back to a string
lea esi,[buffer]
call int_to_string
; You now have a string pointer in EAX, which
; you can use with the sys_write system call
thestring: db "1234",0
buffer: resb 10
Note that I don't do much error checking in these routines (like checking if there are characters outside of the range '0' - '9'). Nor do the routines handle signed numbers. So if you need those things you'll have to add them yourself.
The basic algorith for string->digit is: total = total*10 + digit, starting from the MSD. (e.g. with digit = *p++ - '0' for an ASCII string of digits). So the left-most / Most-Significant / first digit (in memory, and in reading order) gets multiplied by 10 N times, where N is the total number of digits after it.
Doing it this way is generally more efficient than multiplying each digit by the right power of 10 before adding. That would need 2 multiplies; one to grow a power of 10, and another to apply it to the digit. (Or a table look-up with ascending powers of 10).
Of course, for efficiency you might use SSSE3 pmaddubsw and SSE2 pmaddwd to multiply digits by their place-value in parallel: see Is there a fast way to convert a string of 8 ASCII decimal digits into a binary number? and arbitrary-length How to implement atoi using SIMD?. But the latter probably isn't a win when numbers are typically short. A scalar loop is efficient when most numbers are only a couple digits long.
Adding on to #Michael's answer, it may be useful to have the int->string function stop at the first non-digit, instead of at a fixed length. This will catch problems like your string including a newline from when the user pressed return, as well as not turning 12xy34 into a very large number. (Treat it as 12, like C's atoi function). The stop character can also be the terminating 0 in a C implicit-length string.
I've also made some improvements:
Don't use the slow loop instruction unless you're optimizing for code-size. Just forget it exists and use dec / jnz in cases where counting down to zero is still what you want to do, instead of comparing a pointer or something else.
2 LEA instructions are significantly better than imul + add: lower latency.
accumulate the result in EAX where we want to return it anyway. (If you inline this instead of calling it, use whatever register you want the result in.)
I changed the registers so it follows the x86-64 System V ABI (First arg in RDI, return in EAX).
Porting to 32-bit: This doesn't depend on 64-bitness at all; it can be ported to 32-bit by just using 32-bit registers. (i.e. replace rdi with edi, rax with ecx, and rax with eax). Beware of C calling-convention differences between 32 and 64-bit, e.g. EDI is call-preserved and args are usually passed on the stack. But if your caller is asm, you can pass an arg in EDI.
; args: pointer in RDI to ASCII decimal digits, terminated by a non-digit
; clobbers: ECX
; returns: EAX = atoi(RDI) (base 10 unsigned)
; RDI = pointer to first non-digit
global base10string_to_int
base10string_to_int:
movzx eax, byte [rdi] ; start with the first digit
sub eax, '0' ; convert from ASCII to number
cmp al, 9 ; check that it's a decimal digit [0..9]
jbe .loop_entry ; too low -> wraps to high value, fails unsigned compare check
; else: bad first digit: return 0
xor eax,eax
ret
; rotate the loop so we can put the JCC at the bottom where it belongs
; but still check the digit before messing up our total
.next_digit: ; do {
lea eax, [rax*4 + rax] ; total *= 5
lea eax, [rax*2 + rcx] ; total = (total*5)*2 + digit
; imul eax, 10 / add eax, ecx
.loop_entry:
inc rdi
movzx ecx, byte [rdi]
sub ecx, '0'
cmp ecx, 9
jbe .next_digit ; } while( digit <= 9 )
ret ; return with total in eax
This stops converting on the first non-digit character. Often this will be the 0 byte that terminates an implicit-length string. You could check after the loop that it was a string-end, not some other non-digit character, by checking ecx == -'0' (which still holds the str[i] - '0' integer "digit" value that was out of range), if you want to detect trailing garbage.
If your input is an explicit-length string, you'd need to use a loop counter instead of checking a terminator (like #Michael's answer), because the next byte in memory might be another digit. Or it might be in an unmapped page.
Making the first iteration special and handling it before jumping into the main part of the loop is called loop peeling. Peeling the first iteration allows us to optimize it specially, because we know total=0 so there's no need to multiply anything by 10. It's like starting with sum = array[0]; i=1 instead of sum=0, i=0;.
To get nice loop structure (with the conditional branch at the bottom), I used the trick of jumping into the middle of the loop for the first iteration. This didn't even take an extra jmp because I was already branching in the peeled first iteration. Reordering a loop so an if()break in the middle becomes a loop branch at the bottom is called loop rotation, and can involve peeling the first part of the first iteration and the 2nd part of the last iteration.
The simple way to solve the problem of exiting the loop on a non-digit would be to have a jcc in the loop body, like an if() break; statement in C before the total = total*10 + digit. But then I'd need a jmp and have 2 total branch instructions in the loop, meaning more overhead.
If I didn't need the sub ecx, '0' result for the loop condition, I could have used lea eax, [rax*2 + rcx - '0'] to do it as part of the LEA as well. But that would have made the LEA latency 3 cycles instead of 1, on Sandybridge-family CPUs. (3-component LEA vs. 2 or less.) The two LEAs form a loop-carried dependency chain on eax (total), so (especially for large numbers) it would not be worth it on Intel. On CPUs where base + scaled-index is no faster than base + scaled-index + disp8 (Bulldozer-family / Ryzen), then sure, if you have an explicit length as your loop condition and don't want to check the digits at all.
I used movzx to load with zero extension in the first place, instead of doing that after converting the digit from ASCII to integer. (It has to be done at some point to add into 32-bit EAX). Often code that manipulates ASCII digits uses byte operand-size, like mov cl, [rdi]. But that would create a false dependency on the old value of RCX on most CPUs.
sub al,'0' saves 1 byte over sub eax,'0', but causes a partial-register stall on Nehalem/Core2 and even worse on PIII. Fine on all other CPU families, even Sandybridge: it's a RMW of AL, so it doesn't rename the partial reg separately from EAX. But cmp al, 9 doesn't cause a problem, because reading a byte register is always fine. It saves a byte (special encoding with no ModRM byte), so I used that at the top of the function.
For more optimization stuff, see http://agner.org/optimize, and other links in the x86 tag wiki.
The tag wiki also has beginner links, including an FAQ section with links to integer->string functions, and other common beginner questions.
Related:
How do I print an integer in Assembly Level Programming without printf from the c library? is the reverse of this question, integer -> base10string.
Is there a fast way to convert a string of 8 ASCII decimal digits into a binary number? highly optimized SSSE3 pmaddubsw / pmaddwd for 8-digit integers.
How to implement atoi using SIMD? using a shuffle to handle variable-length
Conversion of huge decimal numbers (128bit) formatted as ASCII to binary (hex) handles long strings, e.g. a 128-bit integer that takes 4x 32-bit registers. (It's not very efficient, and might be better to convert in multiple chunks and then do extended-precision multiplies by 1e9 or something.)
Convert from ascii to integer in AT&T Assembly Inefficient AT&T version of this.
I am trying to concatenate two strings in assembly language.
mov esi, str1
mov eax, str1
mov edx, [str2]
call slen
mov [esi+eax-1], edx
Everything works exactly fine except that only 4 characters of the second string gets appended. I know the reason for its occurrence, but I can't seem to find any solution.
You cannot store any string in a register. It has to be equal or smaller than register size (assuming we're talking about ASCII-encoded strings) because a register has a fixed size.
I have problem with this question. I don't know what it wants from me.
Question : Write a procedure that compares a source string at DS:SI to a destination string at ES:DI and sets the flags accordingly. If the source is less than the destination, carry flag is set. if string are equal , the zero flag is set. if the source is greater than the destination , the zero and carry flags are both cleared.
My Answer :
MOV ESI , STRING1
MOV EDI, STRING2
MOV ECX, COUNT
CLD
REPE CMPSB
Still I am not sure about it. Is it true or should I try something else ?
p.s: I don't understand why people vote down this question. What is wrong with my question ? I think we are all here for learning. Or not ? Miss I something ?
If the problem statement says the pointers are already in SI and DI when you're called, you shouldn't clobber them.
16-bit code often doesn't stick to a single calling convention for all functions, and passing (the first few) args in registers is usually good (fewer instructions, and avoids store/reload). 32-bit x86 calling conventions usually use stack args, but that's obsolete. Both the Windows x64 and Linux/Mac x86-64 System V ABIs / calling conventions use register args.
The problem statement doesn't mention a count, though. So you're implementing strcmp for strings terminated by a zero-byte, rather than memcmp for known-length blocks of memory. You can't use a single rep instruction, since you need to check for non-equal and for end-of-string. If you just pass some large size, and the strings are equal, repe cmpsb would continue past the terminator.
repe cmpsb is usable if you know the length of either string. e.g. take a length arg in CX to avoid the problem of running past the terminator in both strings.
But for performance, repe cmpsb isn't fast anyway (like 2 to 3 cycles per compare, on Skylake vs. Ryzen. Or even 4 cycles per compare on Bulldozer-family). Only rep movs and rep stos are efficient on modern CPUs, with optimized microcode that copies or stores 16 (or 32 or 64) bytes at a time.
There are 2 major conventions for storing strings in memory: Explicit-length strings (pointer + length) like C++ std::string, and implicit length strings where you just have a pointer, and the end of string is marked by a sentinel / terminator. (Like C char* that uses a 0 byte, or DOS string-print functions that use '$' as a terminator.)
A useful observation is that you only need to check for the terminator in one of the strings. If the other string has a terminator and this one doesn't, it will be a mismatch.
So you want to load a byte into a register from one string, and check it for the teminator and against memory for the other string.
(If you need to actually use ES:DI instead of just DI with the default DS segment base, you can use cmp al, [es: bx + di] (NASM syntax, adjust as needed like maybe cmp al, es: [bx + di] I think). Probably the question intended for you to use lodsb and scasb, because scasb uses ES:DI.)
;; inputs: str1 pointer in DI, str2 pointer in SI
;; outputs: BX = mismatch index, or one-past-the-terminators.
;; FLAGS: ZF=1 for equal strings (je), ZF=0 for mismatch (jne)
;; clobbers: AL (holds str1's terminator or mismatching byte on return)
strcmp:
xor bx, bx
.innerloop: ; do {
mov al, [si + bx] ; load a source byte
cmp al, [di + bx] ; check it against the other string
jne .mismatch ; if (str1[i] != str2[i]) break;
inc bx ; index++
test al, al ; check for 0. Use cmp al, '$' for a $ terminator
jnz .innerloop ; }while(str1[i] != terminator);
; fall through (ZF=1) means we found the terminator
; in str1 *and* str2 at the same position, thus they match
.mismatch: ; we jump here with ZF=0 on mismatch
; sete al ; optionally create an integer in AL from FLAGS
ret
Usage: put pointers in SI/DI, call strcmp / je match, because the match / non-match state is in FLAGS. If you want to turn the condition into an integer, 386 and later CPUs allow sete al to create a 0 or 1 in AL according to the equals condition (ZF==1).
Using sub al, [mem] instead of cmp al, [mem], we'd get al = str1[i] - str2[i], giving us a 0 only if the strings matched. If your strings only hold ASCII values from 0..127, that can't cause signed overflow, so you can use it as a signed return value that actually tells you which string sorts before/after the other. (But if there could be high-ASCII 128..255 bytes in the string, we'd need to zero- or sign-extend to 16-bit first to avoid signed overflow for a case like (unsigned)5 - (unsigned)254 = (signed)+7 because of 8-bit wraparound.
Of course, with our FLAGS return value, the caller can already use ja or jb (unsigned compare results), or jg / jl if they want to treat the strings as holding signed char. This works regardless of the range of input bytes.
Or inline this loop so jne mismatch jumps somewhere useful directly.
16-bit addressing modes are limited, but BX can be a base, and SI and DI can both be indices. I used an index increment instead of inc si and inc di. Using lodsb would also be an option, and maybe even scasb to compare it to the other string. (And then check the terminator.)
Performance
Indexed addressing modes can be slower on some modern x86 CPUs, but this does save instructions in the loop (so it's good for true 8086 where code-size matters). Although to really tune for 8086, I think lodsb / scasb would be your best bet, replacing the mov load and cmp al, [mem], and also the inc bx. Just remember to use cld outside the loop if your calling convention doesn't guarantee that.
If you care about modern x86, use movzx eax, byte [si+bx] to break the false dependency on the old value of EAX, for CPUs that don't rename partial registers separately. (Breaking the false dep is especially important if you use sub al, [str2] because that would turn it into a 2-cycle loop-carried dependency chain through EAX, on CPUs other than PPro through Sandybridge. IvyBridge and later doesn't rename AL separately from EAX, so mov al, [mem] is a micro-fused load+merge uop.)
If the cmp al,[bx+di] micro-fuses the load, and macro-fuses with jne into one compare-and-branch uop, the whole loop could be only 4 uops total on Haswell, and could run at 1 iteration per clock for large inputs. The branch mispredict at the end will make small-input performance worse, unless branching goes the way way every time for a small enough input. See https://agner.org/optimize/. Recent Intel and AMD can do 2 loads per clock.
Unrolling could amortize the inc bx cost, but that's all. With a taken + not-taken branch inside the loop, no current CPUs can run this faster than 1 cycle per iteration. (See Why are loops always compiled into "do...while" style (tail jump)? for more about the do{}while loop structure). To go faster we'd need to check multiple bytes at once.
Even 1 byte / cycle is very slow compared to 16 bytes per 1 or 2 cycles with SSE2 (using some clever tricks to avoid reading memory that might fault).
See https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for more about using x86 SIMD for string compare, and also glibc's SSE2 and later optimized string functions.
GNU libc's fallback scalar strcmp implementation looks decent (translated from AT&T to Intel syntax, but with the C preprocessor macros and stuff left in. L() makes a local label).
It only uses this when SSE2 or better isn't available. There are bithacks for checking a whole 32-bit register for any zero bytes, which could let you go faster even without SIMD, but alignment is a problem. (If the terminator could be anywhere, you have to be careful when loading multiple bytes at once to not read from any memory pages that you aren't sure contain at least 1 byte of valid data, otherwise you could fault.)
strcmp:
mov ecx,DWORD PTR [esp+0x4]
mov edx,DWORD PTR [esp+0x8] # load pointer args
L(oop): mov al,BYTE PTR [ecx] # movzx eax, byte ptr [ecx] would be avoid a false dep
cmp al,BYTE PTR [edx]
jne L(neq)
inc ecx
inc edx
test al, al
jnz L(oop)
xorl eax, eax
/* when strings are equal, pointers rest one beyond
the end of the NUL terminators. */
ret
L(neq): mov eax, 1
mov ecx, -1
cmovb eax, ecx
ret