I am trying to write char *my_strcpy(char *dest, const char *source); in assembly (AT&T syntax) so that it behaves exactly like strcpy from C. My assembly file looks like this:
.globl my_strcpy
my_strcpy:
push %rbp
mov %rsp, %rbp
mov %rdi, %rax
jmp copy_loop
The jump is pointless.
copy_loop:
cmp $0, (%rsi)
You didn't specify whether this should be an 8-, 16-, 32- or 64-bit compare. When I assemble it, I get a 32-bit compare; i.e. it checks whether the 32-bit word at address %rsi equals zero. You need to change this to cmpb $0, (%rsi).
je end
mov %rsi, %rdi
As user 500 noted, this copies the address in the %rsi register into the %rdi register, overwriting it. This is not what you want. You probably intended something like movb (%rsi), (%rdi), but no such instruction actually exists: x86 does not have such a single instruction to move memory to memory (special exception: see the movsb instruction). So you'll need to first copy the byte at address %rsi into a register, and then copy it onward with another instruction, e.g. mov (%rsi), %cl ; mov %cl, (%rdi). Note the use of the 8-bit %cl register makes it unambiguous that these should be one-byte moves.
movzbl (%rsi), %ecx is a more efficient way to load a byte on modern x86. You still store it by reading CL with mov %cl, (%rdi), but overwriting the whole RCX instead of merging into RCX is better.
addq $1, %rsi
addq $1, %rdi
You might like to learn about the inc instruction, but add is fine.
je copy_loop
I think you mean jmp copy_loop, since the jump here should happen unconditionally. (Or you should rearrange your loop so the conditional branch can be at the bottom. Since you want to copy the terminating 0 byte, you can just copy and then check for 0, like do{}while(c != 0))
end:
leave
ret
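For reference, the rearranged loop suggested above (copy first, then test for 0) can be sketched in C. This is only an illustration of the loop shape, not the asker's code; the name my_strcpy_ref is made up for this example, and the comments map each step to the instructions discussed above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C reference for the do{}while loop structure. */
char *my_strcpy_ref(char *dest, const char *source)
{
    char *ret = dest;      /* strcpy returns the original dest (mov %rdi, %rax) */
    char c;

    assert(dest != NULL && source != NULL);
    do {
        c = *source++;     /* load one byte (movzbl (%rsi), %ecx) */
        *dest++ = c;       /* store it, including the final 0 (mov %cl, (%rdi)) */
    } while (c != 0);      /* one conditional branch at the bottom of the loop */
    return ret;
}
```

Because the terminating 0 is copied before it is tested, no separate store is needed after the loop.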
Related
I'm trying to write an x86 version of the 'cat' program as a training of syscall calls in assembly.
I'm struggling a lot with command line arguments.
I use the main symbol as an entry point, so I thought I would find the argc parameter in %rdi and the argv parameter in %rsi.
Actually argc is in %rdi as expected, but I keep segfaulting when trying to pass argv[1] to the open syscall.
Not sure of what I'm doing wrong, here is my assembly code:
main:
cmp $2, %rdi // If argc != 2 return 1
jne .err1
lea 8(%rsi), %rdi // Move argv[1] -> %rdi
xor %rsi, %rsi // 0 to %rsi -> O_RDONLY
xor %rdx, %rdx
mov $2, %rax // Open = syscall 2
syscall
cmp 0, %rax // If open returns <0 -> exit status 2
jl .err2
mov %rax, %rdi // Move fd to %rdi
call cat
ret
.err1:
mov $1, %rax
ret
.err2:
mov $2, %rax
ret
There are two issues with your code.
First, you use lea 8(%rsi), %rdi to retrieve the second argument. Note that rsi points to an array of pointers to command line arguments, so to retrieve the pointer to the second argument, you have to dereference 8(%rsi) using something like mov 8(%rsi), %rdi.
Second, you forgot the dollar sign in front of 0 in cmp 0, %rax. This causes an absolute address mode for address 0 to be selected, effectively dereferencing a null pointer. To fix this, add the missing dollar sign (cmp $0, %rax) to select an immediate operand.
When I fix both issues, your code as far as you posted it seems to work just fine.
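The difference between the two instructions is the same as the difference between &argv[1] and argv[1] in C. A small sketch, assuming 8-byte pointers as on x86-64; the function names are hypothetical and exist only for this illustration:

```c
#include <assert.h>
#include <stddef.h>

/* &argv[1]: the address of slot 1 inside the pointer array.
 * This is what lea 8(%rsi), %rdi computes. */
char **slot_address(char **argv)
{
    assert(argv != NULL);
    return &argv[1];
}

/* argv[1]: the pointer stored in that slot, i.e. the string itself.
 * This is what mov 8(%rsi), %rdi loads, and what open() needs. */
char *slot_value(char **argv)
{
    assert(argv != NULL);
    return argv[1];
}
```

Passing the result of slot_address to open would hand the kernel the bytes of a pointer rather than a file name.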
As a learning exercise, I've been handwriting assembly. I can't seem to figure out how to load the value of an address into a register.
Semantically, I want to do the following:
_start:
# read(0, buffer, 1)
mov $3, %eax # System call 3 is read
mov $0, %ebx # File handle 0 is stdin
mov $buffer, %ecx # Buffer to write to
mov $1, %edx # Length of buffer
int $0x80 # Invoke system call
lea (%ecx, %ecx), %edi # Pull the value at address into %edi
cmp $97, %edi # Compare to 'a'
je done
I've written a higher-level implementation in C:
char buffer[1];
int main()
{
read(0, buffer, 1);
char a = buffer[0];
return (a == 'a') ? 1 : 0;
}
But compiling with gcc -S produces assembly that doesn't port well into my implementation above.
I think lea is the right instruction I should be using to load the value at the given address stored in %ecx into %edi, but upon inspection in gdb, %edi contains a garbage value after this instruction is executed. Is this approach correct?
Instead of the lea instruction, what you need is:
movzbl (%ecx), %edi
That is, zero extending into the edi register the byte at the memory address contained in ecx.
_start:
# read(0, buffer, 1)
mov $3, %eax # System call 3 is read
mov $0, %ebx # File handle 0 is stdin
mov $buffer, %ecx # Buffer to write to
mov $1, %edx # Length of buffer
int $0x80 # Invoke system call
movzbl (%ecx), %edi # Pull the value at address ecx into edi
cmp $97, %edi # Compare to 'a'
je done
Some advice
You don't really need the movz instruction: you don't need a separate load operation, since you can compare the byte in memory pointed by ecx directly with cmp:
cmpb $97, (%ecx)
You may want to specify the character to be compared against (i.e., 'a') as $'a' instead of $97 in order to improve readability:
cmpb $'a', (%ecx)
Avoiding conditional branches is usually a good idea. Immediately after performing the system call, you could use the following code that uses cmov for determining the return value, which is stored in eax, instead of performing a conditional jump (i.e., the je instruction):
xor %eax, %eax # set eax to zero
cmpb $'a', (%ecx) # compare to 'a'
cmovz %edx, %eax # conditionally move edx(=1) into eax
ret # eax is either 0 or 1 at this point
edx was set to 1 prior to the system call. Therefore, this approach above relies on the fact that edx is preserved across the system call (i.e., the int 0x80 instruction).
Even better, you could use sete on al after the comparison instead of the cmov:
xor %eax, %eax # set eax to zero
cmpb $'a', (%ecx) # compare to 'a'
sete %al # conditionally set al
ret # eax is either 0 or 1 at this point
The register al, which was set to zero by means of xor %eax, %eax, will be set to 1 if the ZF flag was set by the cmp (i.e., if the byte pointed to by ecx is 'a'). With this approach you don't need to worry about whether the syscall preserves edx, since the outcome doesn't depend on edx.
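In C the same idea is simply returning the result of the comparison, which is already 0 or 1. Compilers typically turn this into a cmp followed by sete rather than a branch, though the exact output depends on the compiler and options. The helper name below is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the first byte of buf is 'a', else 0. */
int first_byte_is_a(const char *buf)
{
    assert(buf != NULL);
    return buf[0] == 'a';   /* the comparison result is already 0 or 1 */
}
```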
I have written an assembly program in AT&T syntax to display the factorial of a number, but it's not working. Here is my code:
.text
.globl _start
_start:
movq $5,%rcx
movq $5,%rax
Repeat: #function to calculate factorial
decq %rcx
cmp $0,%rcx
je print
imul %rcx,%rax
cmp $1,%rcx
jne Repeat
# Now result of factorial stored in rax
print:
xorq %rsi, %rsi
# function to print integer result digit by digit by pushing in
#stack
loop:
movq $0, %rdx
movq $10, %rbx
divq %rbx
addq $48, %rdx
pushq %rdx
incq %rsi
cmpq $0, %rax
jz next
jmp loop
next:
cmpq $0, %rsi
jz bye
popq %rcx
decq %rsi
movq $4, %rax
movq $1, %rbx
movq $1, %rdx
int $0x80
addq $4, %rsp
jmp next
bye:
movq $1,%rax
movq $0, %rbx
int $0x80
.data
num : .byte 5
This program prints nothing. I stepped through it in gdb: it works fine up to the loop label, but once it reaches next, seemingly random values start appearing in various registers. Please help me debug it so that it prints the factorial.
As @ped7g points out, you're doing several things wrong: using the int 0x80 32-bit ABI in 64-bit code, and passing character values instead of pointers to the write() system call.
Here's how to print an integer in x86-64 Linux, the simple and somewhat-efficient (see footnote 1) way, using the same repeated division / modulo by 10.
System calls are expensive (probably thousands of cycles for write(1, buf, 1)), and doing a syscall inside the loop steps on registers so it's inconvenient and clunky as well as inefficient. We should write the characters into a small buffer, in printing order (most-significant digit at the lowest address), and make a single write() system call on that.
But then we need a buffer. The maximum length of a 64-bit integer is only 20 decimal digits, so we can just use some stack space. In x86-64 Linux, we can use stack space below RSP (up to 128B) without "reserving" it by modifying RSP. This is called the red-zone. If you wanted to pass the buffer to another function instead of a syscall, you would have to reserve space with sub $24, %rsp or something.
Instead of hard-coding system-call numbers, using GAS makes it easy to use the constants defined in .h files. Note the mov $__NR_write, %eax near the end of the function. The x86-64 SystemV ABI passes system-call arguments in similar registers to the function-calling convention. (So it's totally different from the 32-bit int 0x80 ABI, which you shouldn't use in 64-bit code.)
// building with gcc foo.S will use CPP before GAS so we can use headers
#include <asm/unistd.h> // This is a standard Linux / glibc header file
// includes unistd_64.h or unistd_32.h depending on current mode
// Contains only #define constants (no C prototypes) so we can include it from asm without syntax errors.
.p2align 4
.globl print_uint64 # void print_uint64(uint64_t value)
print_uint64:
lea -1(%rsp), %rsi # We use the 128B red-zone as a buffer to hold the string
# a 64-bit integer is at most 20 digits long in base 10, so it fits.
movb $'\n', (%rsi) # store the trailing newline byte. (Right below the return address).
# If you need a null-terminated string, leave an extra byte of room and store '\n\0'. Or push $'\n'
mov $10, %ecx # same as mov $10, %rcx but 2 bytes shorter
# note that newline (\n) has ASCII code 10, so we could actually have stored the newline with movb %cl, (%rsi) to save code size.
mov %rdi, %rax # function arg arrives in RDI; we need it in RAX for div
.Ltoascii_digit: # do{
xor %edx, %edx
div %rcx # rax = rdx:rax / 10. rdx = remainder
# store digits in MSD-first printing order, working backwards from the end of the string
add $'0', %edx # integer to ASCII. %dl would work, too, since we know this is 0-9
dec %rsi
mov %dl, (%rsi) # *--p = (value%10) + '0';
test %rax, %rax
jnz .Ltoascii_digit # } while(value != 0)
# If we used a loop-counter to print a fixed number of digits, we would get leading zeros
# The do{}while() loop structure means the loop runs at least once, so we get "0\n" for input=0
# Then print the whole string with one system call
mov $__NR_write, %eax # call number from asm/unistd_64.h
mov $1, %edi # fd=1
# %rsi = start of the buffer
mov %rsp, %rdx
sub %rsi, %rdx # length = one_past_end - start
syscall # write(fd=1 /*rdi*/, buf /*rsi*/, length /*rdx*/); 64-bit ABI
# rax = return value (or -errno)
# rcx and r11 = garbage (destroyed by syscall/sysret)
# all other registers = unmodified (saved/restored by the kernel)
# we don't need to restore any registers, and we didn't modify RSP.
ret
To test this function, I put this in the same file to call it and exit:
.p2align 4
.globl _start
_start:
mov $10120123425329922, %rdi
# mov $0, %edi # Yes, it does work with input = 0
call print_uint64
xor %edi, %edi
mov $__NR_exit, %eax
syscall # sys_exit(0)
I built this into a static binary (with no libc):
$ gcc -Wall -static -nostdlib print-integer.S && ./a.out
10120123425329922
$ strace ./a.out > /dev/null
execve("./a.out", ["./a.out"], 0x7fffcb097340 /* 51 vars */) = 0
write(1, "10120123425329922\n", 18) = 18
exit(0) = ?
+++ exited with 0 +++
$ file ./a.out
./a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=69b865d1e535d5b174004ce08736e78fade37d84, not stripped
Footnote 1: See Why does GCC use multiplication by a strange number in implementing integer division? for avoiding div r64 for division by 10, because that's very slow (21 to 83 cycles on Intel Skylake). A multiplicative inverse would make this function actually efficient, not just "somewhat". (But of course there'd still be room for optimizations...)
Related: Linux x86-32 extended-precision loop that prints 9 decimal digits from each 32-bit "limb": see .toascii_digit: in my Extreme Fibonacci code-golf answer. It's optimized for code-size (even at the expense of speed), but well-commented.
It uses div like you do (because that's smaller than using a fast multiplicative inverse). It uses loop for the outer loop (over multiple integers for extended precision), again for code-size at the cost of speed.
It uses the 32-bit int 0x80 ABI, and prints into a buffer that was holding the "old" Fibonacci value, not the current.
Another way to get efficient asm is from a C compiler. For just the loop over digits, look at what gcc or clang produce for this C source (which is basically what the asm is doing). The Godbolt Compiler explorer makes it easy to try with different options and different compiler versions.
See gcc7.2 -O3 asm output which is nearly a drop-in replacement for the loop in print_uint64 (because I chose the args to go in the same registers):
void itoa_end(unsigned long val, char *p_end) {
const unsigned base = 10;
do {
*--p_end = (val % base) + '0';
val /= base;
} while(val);
// write(1, p_end, orig-current);
}
I tested performance on a Skylake i7-6700k by commenting out the syscall instruction and putting a repeat loop around the function call. The version with mul %rcx / shr $3, %rdx is about 5 times faster than the version with div %rcx for storing a long number-string (10120123425329922) into a buffer. The div version ran at 0.25 instructions per clock, while the mul version ran at 2.65 instructions per clock (although requiring many more instructions).
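As a rough C sketch of what the mul / shr $3 sequence computes: multiply by ceil(2^67 / 10), keep the high 64 bits of the 128-bit product, then shift right by 3. This assumes a compiler that provides unsigned __int128 (GCC and Clang do on x86-64); the names div10 and itoa_end_mul are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* n / 10 without a div instruction.
 * 0xCCCCCCCCCCCCCCCD == ceil(2^67 / 10), so (n * magic) >> 67 == n / 10
 * for every 64-bit n. Assumes unsigned __int128 is available. */
static inline uint64_t div10(uint64_t n)
{
    unsigned __int128 product = (unsigned __int128)n * 0xCCCCCCCCCCCCCCCDULL;
    return (uint64_t)(product >> 64) >> 3;   /* high half (mul), then shr $3 */
}

/* Same digit loop as itoa_end, but using div10.
 * Returns a pointer to the first digit written. */
char *itoa_end_mul(uint64_t val, char *p_end)
{
    assert(p_end != NULL);
    do {
        uint64_t q = div10(val);
        *--p_end = (char)('0' + (val - q * 10));   /* remainder = val - q*10 */
        val = q;
    } while (val);
    return p_end;
}
```

In practice you would just write val / 10 and let the compiler pick the constant; the explicit version is only here to show where the mul and the shift come from.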
It might be worth unrolling by 2, and doing a divide by 100 and splitting up the remainder of that into 2 digits. That would give a lot better instruction-level parallelism, in case the simpler version bottlenecks on mul + shr latency. The chain of multiply/shift operations that brings val to zero would be half as long, with more work in each short independent dependency chain to handle a 0-99 remainder.
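A minimal sketch of the divide-by-100 idea in C, not benchmarked here. The function name itoa_end_by100 is hypothetical; the only subtlety is the final step, which must avoid emitting a leading zero when fewer than two digits remain.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Writes the decimal digits of val backwards from p_end, two per iteration.
 * Returns a pointer to the first digit written. */
char *itoa_end_by100(uint64_t val, char *p_end)
{
    assert(p_end != NULL);
    while (val >= 100) {
        unsigned rem = (unsigned)(val % 100);   /* 0..99, independent of the next division */
        val /= 100;
        *--p_end = (char)('0' + rem % 10);      /* low digit of the pair */
        *--p_end = (char)('0' + rem / 10);      /* high digit of the pair */
    }
    /* 0..99 left: one or two digits, with no leading zero */
    if (val >= 10) {
        *--p_end = (char)('0' + val % 10);
        *--p_end = (char)('0' + val / 10);
    } else {
        *--p_end = (char)('0' + val);
    }
    return p_end;
}
```

Splitting rem into two digits only involves small values, so that work can overlap with the next division of val.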
Related:
NASM version of this answer, for x86-64 or i386 Linux How do I print an integer in Assembly Level Programming without printf from the c library?
How to convert a binary integer number to a hex string? - Base 16 is a power of 2, conversion is much simpler and doesn't require div.
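To illustrate why base 16 is simpler: each hex digit is exactly 4 bits, so a shift and a mask replace the division. A small sketch in C, with a made-up function name and a fixed-width (16-digit, zero-padded) output as an assumption of this example:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Writes val as 16 lowercase hex digits plus a terminating 0 into out. */
void to_hex64(uint64_t val, char out[17])
{
    static const char digits[] = "0123456789abcdef";

    assert(out != NULL);
    for (int i = 15; i >= 0; i--) {
        out[i] = digits[val & 0xF];   /* low 4 bits select one digit */
        val >>= 4;                    /* move on to the next nibble, no div needed */
    }
    out[16] = '\0';
}
```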
Several things:
0) I guess this is 64b linux environment, but you should have stated so (if it is not, some of my points will be invalid)
1) int 0x80 is 32b call, but you are using 64b registers, so you should use syscall (and different arguments)
2) int 0x80, eax=4 requires the ecx to contain address of memory, where the content is stored, while you give it the ASCII character in ecx = illegal memory access (the first call should return error, i.e. eax is negative value). Or using strace <your binary> should reveal the wrong arguments + error returned.
3) why addq $4, %rsp? Makes no sense to me, you are damaging rsp, so the next pop rcx will pop wrong value, and in the end you will run way "up" into the stack.
... maybe some more, I didn't debug it, this list is just by reading the source (so I may be even wrong about something, although that would be rare).
BTW your code is working. It just doesn't do what you expected. It works fine, precisely as the CPU is designed and precisely as you wrote it. Whether that achieves what you wanted, or makes sense, is a different topic, but don't blame the HW or the assembler.
... I can do a quick guess how the routine may be fixed (just partial hack-fix, still needs rewrite for syscall under 64b linux):
next:
cmpq $0, %rsi
jz bye
movq %rsp,%rcx # make rcx point to stack memory (with stored char)
# this will work only if you are lucky enough that rsp fits into 32b
# if it is beyond 4GiB logical address, then you have bad luck (syscall needed)
decq %rsi
movq $4, %rax
movq $1, %rbx
movq $1, %rdx
int $0x80
addq $8, %rsp # rsp += 8 is now needed, because there's no POP
jmp next
Again, I didn't try this myself, I'm just writing it from memory, so let me know how it changes the situation.
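Point 2 above is easier to see in C: write() takes the address of the data, not the data itself. A small sketch (the helper name put_char is hypothetical):

```c
#include <assert.h>
#include <unistd.h>

/* Writes a single character to fd. Note the &c: the second argument of
 * write() is a pointer to the bytes, never the character value. */
ssize_t put_char(int fd, char c)
{
    assert(fd >= 0);
    return write(fd, &c, 1);
}
```

The original loop did the equivalent of write(1, (void *)c, 1), which is why the system call fails with an error instead of printing anything.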
When I write a simple assembly language program, linked with the C library, using gcc 4.6.1 on Ubuntu, and I try to print an integer, it works fine:
.global main
.text
main:
mov $format, %rdi
mov $5, %rsi
mov $0, %rax
call printf
ret
format:
.asciz "%10d\n"
This prints 5, as expected.
But now if I make a small change, and try to print a floating point value:
.global main
.text
main:
mov $format, %rdi
movsd x, %xmm0
mov $1, %rax
call printf
ret
format:
.asciz "%10.4f\n"
x:
.double 15.5
This program seg faults without printing anything. Just a sad segfault.
But I can fix this by pushing and popping %rbp.
.global main
.text
main:
push %rbp
mov $format, %rdi
movsd x, %xmm0
mov $1, %rax
call printf
pop %rbp
ret
format:
.asciz "%10.4f\n"
x:
.double 15.5
Now it works, and prints 15.5000.
My question is: why did pushing and popping %rbp make the application work? According to the ABI, %rbp is one of the registers that the callee must preserve, and so printf cannot be messing it up. In fact, printf worked in the first program, when only an integer was passed to printf. So the problem must be elsewhere?
I suspect the problem doesn't have anything to do with %rbp, but rather has to do with stack alignment. To quote the ABI:
The ABI requires that stack frames be aligned on 16-byte boundaries. Specifically, the end of
the argument area (%rbp+16) must be a multiple of 16. This requirement means that the frame
size should be padded out to a multiple of 16 bytes.
The stack is 16-byte aligned just before main() is called. The call that enters main() pushes an 8-byte return address, so on entry to main() %rsp is 8 bytes off a 16-byte boundary, and it would still be misaligned at the point where you call printf(). You restore the alignment by pushing another eight bytes onto the stack (which happen to be %rbp but could just as easily be something else).
Here is the code that gcc generates (also on the Godbolt compiler explorer):
.LC1:
.ascii "%10.4f\12\0"
main:
leaq .LC1(%rip), %rdi # format string address
subq $8, %rsp ### align the stack by 16 before a CALL
movl $1, %eax ### 1 FP arg being passed in a register to a variadic function
movsd .LC0(%rip), %xmm0 # load the double itself
call printf
xorl %eax, %eax # return 0 from main
addq $8, %rsp
ret
As you can see, it deals with the alignment requirements by subtracting 8 from %rsp at the start, and adding it back at the end.
You could instead do a dummy push/pop of whatever register you like instead of manipulating %rsp directly; some compilers do use a dummy push to align the stack because this can actually be cheaper on modern CPUs, and saves code size.
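The arithmetic can be checked with a tiny model in C. This is only a sketch of the bookkeeping: the function name is made up, and the %rsp value in the usage note is an arbitrary example chosen to be 8 modulo 16, as it is on entry to any function whose caller followed the ABI.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if %rsp would be 16-byte aligned at a call instruction,
 * given %rsp at function entry and the bytes pushed/subtracted since then. */
int aligned_at_call(uint64_t rsp_at_entry, unsigned bytes_reserved)
{
    assert(bytes_reserved <= rsp_at_entry);
    return (rsp_at_entry - bytes_reserved) % 16 == 0;
}
```

With an entry value such as 0x7ffffffde008, reserving 0 bytes leaves the stack misaligned, reserving 8 (one push, or sub $8, %rsp) aligns it, 16 misaligns it again, and 24 aligns it.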