I am trying to write the char *my_strcpy(char *dest, const char *source); in assembly, at&t syntax, that should act exactly like strcpy from C. My c file looks like this:
.globl my_strcpy
my_strcpy:
push %rbp
mov %rsp, %rbp
mov %rdi, %rax
jmp copy_loop
The jump is pointless.
copy_loop:
cmp $0, (%rsi)
You didn't specify whether this should be an 8, 16, 32 or 64-bit compare. When I assemble it, I get a 32-bit compare; e.g. it sees whether the 32-bit word at address %rsi equals zero. You need to change this to cmpb $0, (%rsi).
je end
mov %rsi, %rdi
As user 500 noted, this copies the address in the %rsi register into the %rdi register, overwriting it. This is not what you want. You probably intended something like movb (%rsi), (%rdi), but no such instruction actually exists: x86 does not have such a single instruction to move memory to memory (special exception: see the movsb instruction). So you'll need to first copy the byte at address %rsi into a register, and then copy it onward with another instruction, e.g. mov (%rsi), %cl ; mov %cl, (%rdi). Note the use of the 8-bit %cl register makes it unambiguous that these should be one-byte moves.
movzbl (%rsi), %ecx is a more efficient way to load a byte on modern x86. You still store it by reading CL with mov %cl, (%rdi), but overwriting the whole RCX instead of merging into RCX is better.
addq $1, %rsi
addq $1, %rdi
You might like to learn about the inc instruction, but add is fine.
je copy_loop
I think you mean jmp copy_loop, since the jump here should happen unconditionally. (Or you should rearrange your loop so the conditional branch can be at the bottom. Since you want to copy the terminating 0 byte, you can just copy and then check for 0, like do{}while(c != 0))
end:
leave
ret
I have written a Assembly program to display the factorial of a number following AT&T syntax. But it's not working. Here is my code
.text
.globl _start
_start:
movq $5,%rcx
movq $5,%rax
Repeat: #function to calculate factorial
decq %rcx
cmp $0,%rcx
je print
imul %rcx,%rax
cmp $1,%rcx
jne Repeat
# Now result of factorial stored in rax
print:
xorq %rsi, %rsi
# function to print integer result digit by digit by pushing in
#stack
loop:
movq $0, %rdx
movq $10, %rbx
divq %rbx
addq $48, %rdx
pushq %rdx
incq %rsi
cmpq $0, %rax
jz next
jmp loop
next:
cmpq $0, %rsi
jz bye
popq %rcx
decq %rsi
movq $4, %rax
movq $1, %rbx
movq $1, %rdx
int $0x80
addq $4, %rsp
jmp next
bye:
movq $1,%rax
movq $0, %rbx
int $0x80
.data
num : .byte 5
This program is printing nothing, I also used gdb to visualize it work fine until loop function but when it comes in next some random value start entering in various register. Help me to debug so that it could print factorial.
As #ped7g points out, you're doing several things wrong: using the int 0x80 32-bit ABI in 64-bit code, and passing character values instead of pointers to the write() system call.
Here's how to print an integer in x8-64 Linux, the simple and somewhat-efficient1 way, using the same repeated division / modulo by 10.
System calls are expensive (probably thousands of cycles for write(1, buf, 1)), and doing a syscall inside the loop steps on registers so it's inconvenient and clunky as well as inefficient. We should write the characters into a small buffer, in printing order (most-significant digit at the lowest address), and make a single write() system call on that.
But then we need a buffer. The maximum length of a 64-bit integer is only 20 decimal digits, so we can just use some stack space. In x86-64 Linux, we can use stack space below RSP (up to 128B) without "reserving" it by modifying RSP. This is called the red-zone. If you wanted to pass the buffer to another function instead of a syscall, you would have to reserve space with sub $24, %rsp or something.
Instead of hard-coding system-call numbers, using GAS makes it easy to use the constants defined in .h files. Note the mov $__NR_write, %eax near the end of the function. The x86-64 SystemV ABI passes system-call arguments in similar registers to the function-calling convention. (So it's totally different from the 32-bit int 0x80 ABI, which you shouldn't use in 64-bit code.)
// building with gcc foo.S will use CPP before GAS so we can use headers
#include <asm/unistd.h> // This is a standard Linux / glibc header file
// includes unistd_64.h or unistd_32.h depending on current mode
// Contains only #define constants (no C prototypes) so we can include it from asm without syntax errors.
.p2align 4
.globl print_integer #void print_uint64(uint64_t value)
print_uint64:
lea -1(%rsp), %rsi # We use the 128B red-zone as a buffer to hold the string
# a 64-bit integer is at most 20 digits long in base 10, so it fits.
movb $'\n', (%rsi) # store the trailing newline byte. (Right below the return address).
# If you need a null-terminated string, leave an extra byte of room and store '\n\0'. Or push $'\n'
mov $10, %ecx # same as mov $10, %rcx but 2 bytes shorter
# note that newline (\n) has ASCII code 10, so we could actually have stored the newline with movb %cl, (%rsi) to save code size.
mov %rdi, %rax # function arg arrives in RDI; we need it in RAX for div
.Ltoascii_digit: # do{
xor %edx, %edx
div %rcx # rax = rdx:rax / 10. rdx = remainder
# store digits in MSD-first printing order, working backwards from the end of the string
add $'0', %edx # integer to ASCII. %dl would work, too, since we know this is 0-9
dec %rsi
mov %dl, (%rsi) # *--p = (value%10) + '0';
test %rax, %rax
jnz .Ltoascii_digit # } while(value != 0)
# If we used a loop-counter to print a fixed number of digits, we would get leading zeros
# The do{}while() loop structure means the loop runs at least once, so we get "0\n" for input=0
# Then print the whole string with one system call
mov $__NR_write, %eax # call number from asm/unistd_64.h
mov $1, %edi # fd=1
# %rsi = start of the buffer
mov %rsp, %rdx
sub %rsi, %rdx # length = one_past_end - start
syscall # write(fd=1 /*rdi*/, buf /*rsi*/, length /*rdx*/); 64-bit ABI
# rax = return value (or -errno)
# rcx and r11 = garbage (destroyed by syscall/sysret)
# all other registers = unmodified (saved/restored by the kernel)
# we don't need to restore any registers, and we didn't modify RSP.
ret
To test this function, I put this in the same file to call it and exit:
.p2align 4
.globl _start
_start:
mov $10120123425329922, %rdi
# mov $0, %edi # Yes, it does work with input = 0
call print_uint64
xor %edi, %edi
mov $__NR_exit, %eax
syscall # sys_exit(0)
I built this into a static binary (with no libc):
$ gcc -Wall -static -nostdlib print-integer.S && ./a.out
10120123425329922
$ strace ./a.out > /dev/null
execve("./a.out", ["./a.out"], 0x7fffcb097340 /* 51 vars */) = 0
write(1, "10120123425329922\n", 18) = 18
exit(0) = ?
+++ exited with 0 +++
$ file ./a.out
./a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=69b865d1e535d5b174004ce08736e78fade37d84, not stripped
Footnote 1: See Why does GCC use multiplication by a strange number in implementing integer division? for avoiding div r64 for division by 10, because that's very slow (21 to 83 cycles on Intel Skylake). A multiplicative inverse would make this function actually efficient, not just "somewhat". (But of course there'd still be room for optimizations...)
Related: Linux x86-32 extended-precision loop that prints 9 decimal digits from each 32-bit "limb": see .toascii_digit: in my Extreme Fibonacci code-golf answer. It's optimized for code-size (even at the expense of speed), but well-commented.
It uses div like you do, because that's smaller than using a fast multiplicative inverse). It uses loop for the outer loop (over multiple integer for extended precision), again for code-size at the cost of speed.
It uses the 32-bit int 0x80 ABI, and prints into a buffer that was holding the "old" Fibonacci value, not the current.
Another way to get efficient asm is from a C compiler. For just the loop over digits, look at what gcc or clang produce for this C source (which is basically what the asm is doing). The Godbolt Compiler explorer makes it easy to try with different options and different compiler versions.
See gcc7.2 -O3 asm output which is nearly a drop-in replacement for the loop in print_uint64 (because I chose the args to go in the same registers):
void itoa_end(unsigned long val, char *p_end) {
const unsigned base = 10;
do {
*--p_end = (val % base) + '0';
val /= base;
} while(val);
// write(1, p_end, orig-current);
}
I tested performance on a Skylake i7-6700k by commenting out the syscall instruction and putting a repeat loop around the function call. The version with mul %rcx / shr $3, %rdx is about 5 times faster than the version with div %rcx for storing a long number-string (10120123425329922) into a buffer. The div version ran at 0.25 instructions per clock, while the mul version ran at 2.65 instructions per clock (although requiring many more instructions).
It might be worth unrolling by 2, and doing a divide by 100 and splitting up the remainder of that into 2 digits. That would give a lot better instruction-level parallelism, in case the simpler version bottlenecks on mul + shr latency. The chain of multiply/shift operations that brings val to zero would be half as long, with more work in each short independent dependency chain to handle a 0-99 remainder.
Related:
NASM version of this answer, for x86-64 or i386 Linux How do I print an integer in Assembly Level Programming without printf from the c library?
How to convert a binary integer number to a hex string? - Base 16 is a power of 2, conversion is much simpler and doesn't require div.
Several things:
0) I guess this is 64b linux environment, but you should have stated so (if it is not, some of my points will be invalid)
1) int 0x80 is 32b call, but you are using 64b registers, so you should use syscall (and different arguments)
2) int 0x80, eax=4 requires the ecx to contain address of memory, where the content is stored, while you give it the ASCII character in ecx = illegal memory access (the first call should return error, i.e. eax is negative value). Or using strace <your binary> should reveal the wrong arguments + error returned.
3) why addq $4, %rsp? Makes no sense to me, you are damaging rsp, so the next pop rcx will pop wrong value, and in the end you will run way "up" into the stack.
... maybe some more, I didn't debug it, this list is just by reading the source (so I may be even wrong about something, although that would be rare).
BTW your code is working. It just doesn't do what you expected. But work fine, precisely as the CPU is designed and precisely what you wrote in the code. Whether that does achieve what you wanted, or makes sense, that's different topic, but don't blame the HW or assembler.
... I can do a quick guess how the routine may be fixed (just partial hack-fix, still needs rewrite for syscall under 64b linux):
next:
cmpq $0, %rsi
jz bye
movq %rsp,%rcx ; make ecx to point to stack memory (with stored char)
; this will work if you are lucky enough that rsp fits into 32b
; if it is beyond 4GiB logical address, then you have bad luck (syscall needed)
decq %rsi
movq $4, %rax
movq $1, %rbx
movq $1, %rdx
int $0x80
addq $8, %rsp ; now rsp += 8; is needed, because there's no POP
jmp next
Again didn't try myself, just writing it from head, so let me know how it changed situation.
I made a program in x64 assembly, using AT&T syntax, but I don't figure out why mov operator copies address variable to register. Here is my code:
.globl main
.text
main:
mov $a, %rax
mov $format, %rdi # set 1st parameter (format)
mov %rax, %rsi # set 2nd parameter (current_number)
mov $0, %rax # because printf is varargs
sub $8, %rsp # align stack pointer
call printf # printf(format, sum/count)
add $8, %rsp # restore stack pointer
ret
.data
a: .quad 123
format: .asciz "%d\n"
The program outputs 6295616 instead of 123. Please help me understand what I've done wrong.
Because you have indicated with the dollar sign that you want immediate mode. Remove it to get absolute mode.
When I write a simple assembly language program, linked with the C library, using gcc 4.6.1 on Ubuntu, and I try to print an integer, it works fine:
.global main
.text
main:
mov $format, %rdi
mov $5, %rsi
mov $0, %rax
call printf
ret
format:
.asciz "%10d\n"
This prints 5, as expected.
But now if I make a small change, and try to print a floating point value:
.global main
.text
main:
mov $format, %rdi
movsd x, %xmm0
mov $1, %rax
call printf
ret
format:
.asciz "%10.4f\n"
x:
.double 15.5
This program seg faults without printing anything. Just a sad segfault.
But I can fix this by pushing and popping %rbp.
.global main
.text
main:
push %rbp
mov $format, %rdi
movsd x, %xmm0
mov $1, %rax
call printf
pop %rbp
ret
format:
.asciz "%10.4f\n"
x:
.double 15.5
Now it works, and prints 15.5000.
My question is: why did pushing and popping %rbp make the application work? According to the ABI, %rbp is one of the registers that the callee must preserve, and so printf cannot be messing it up. In fact, printf worked in the first program, when only an integer was passed to printf. So the problem must be elsewhere?
I suspect the problem doesn't have anything to do with %rbp, but rather has to do with stack alignment. To quote the ABI:
The ABI requires that stack frames be aligned on 16-byte boundaries. Specifically, the end of
the argument area (%rbp+16) must be a multiple of 16. This requirement means that the frame
size should be padded out to a multiple of 16 bytes.
The stack is aligned when you enter main(). Calling printf() pushes the return address onto the stack, moving the stack pointer by 8 bytes. You restore the alignment by pushing another eight bytes onto the stack (which happen to be %rbp but could just as easily be something else).
Here is the code that gcc generates (also on the Godbolt compiler explorer):
.LC1:
.ascii "%10.4f\12\0"
main:
leaq .LC1(%rip), %rdi # format string address
subq $8, %rsp ### align the stack by 16 before a CALL
movl $1, %eax ### 1 FP arg being passed in a register to a variadic function
movsd .LC0(%rip), %xmm0 # load the double itself
call printf
xorl %eax, %eax # return 0 from main
addq $8, %rsp
ret
As you can see, it deals with the alignment requirements by subtracting 8 from %rsp at the start, and adding it back at the end.
You could instead do a dummy push/pop of whatever register you like instead of manipulating %rsp directly; some compilers do use a dummy push to align the stack because this can actually be cheaper on modern CPUs, and saves code size.
I'm trying to learn basic assembly. I wrote a simple program in C to translate to assembly:
void myFunc(int x, int y) {
int z;
}
int main() {
myFunc(20, 10);
return 0;
}
This is what I thought the correct translation of the function would be:
.text
.globl _start
.type myFunc, #function
myFunc:
pushl %ebp #Push old ebp register on to stack
movl %esp, %ebp #Move esp into ebp so we can reference vars
sub $4, %esp #Subtract 4 bytes from esp to make room for 'z' var
movl $2, -4(%ebp) #Move value 2 into 'z'
movl %ebp, %esp #Restore esp
popl %ebp #Set ebp to 0?
ret #Restore eip and jump to next instruction
_start:
pushl $10 #Push 10 onto stack for 'y' var
pushl $20 #Push 20 onto stack for 'x' var
call myFunc #Jump to myFunc (this pushes ret onto stack)
add $8, %esp #Restore esp to where it was before
movl $1, %eax #Exit syscall
movl $0, %ebx #Return 0
int $0x80 #Interrupt
Just to double check it I ran it in gdb and was confused by the results:
(gdb) disas myFunc
Dump of assembler code for function myFunc:
0x08048374 <myFunc+0>: push ebp
0x08048375 <myFunc+1>: mov ebp,esp
0x08048377 <myFunc+3>: sub esp,0x10
0x0804837a <myFunc+6>: leave
0x0804837b <myFunc+7>: ret
End of assembler dump.
Why at 0x08048377 did gcc subtract 0x10 (16 bytes) from the stack when an integer is 4 bytes in length?
Also, is the leave instruction equivalent to the following?
movl %ebp, %esp #Restore esp
popl %ebp #Set ebp to 0?
Using:
gcc version 4.3.2 (Debian 4.3.2-1.1)
GNU gdb 6.8-debian
Depending on the platform, GCC may choose different stack alignments; this can be overridden, but doing so can make the program slower or crash. The default -mpreferred-stack-boundary=4 keeps the stack aligned to 16-byte addresses. Assuming the stack pointer already suitably aligned at the start of the function, it will remain so aligned after sub %esp, $10.
leave is an x86 macro-instruction which is equivalent to mov %ebp, %esp; pop %ebp.
Your GDB is configured to print out Intel instead of AT&T assembly syntax - turn that off before it confuses you more than it already has.
The stack pointer (%esp) is required to always be aligned to a 16-byte boundary. That's probably where the sub esp,0x10 is coming from. (It's unnecessary, but GCC has historically been bad at noticing that stack adjustments are unnecessary.) Also, your function doesn't do anything interesting, so the body has been optimized out. You should have compiled this code:
int myFunc(int x, int y)
{
return x + y;
}
int main(void)
{
return myFunc(20, 30);
}
That'll produce assembly language that's easier to map back to the original C. GCC would still be allowed to produce
main:
movl $50,%eax
ret
and nothing else, but it probably won't unless you use -O3 -fwhole-program ;-)