How do I use "nanosleep" in x86 Assembly? - linux

I have some problems with Linux' nanosleep syscall. This code should wait 2 seconds before it exits, but it doesn't:
.globl _start
_start:
pushq %rbp
movq %rsp,%rbp
pushq $0 #0 nanoseconds
pushq $2 #2 seconds
leaq (%rbp),%rdi #the time structure on the stack
movq $35,%rax #nanosleep syscall
movq $0,%rsi #disable useless parameter
syscall

After pushing stuff on the stack, use mov %rsp, %rdi. RSP (the current stack pointer) is what's pointing to your newly-pushed struct, not RBP (the frame pointer). lea (%rsp), %rdi is a less-efficient way to write that, but would also work.
You're passing RBP as the pointer, but it still points to the saved RBP value from making a "stack frame". Note that this is _start, not a function, so you're really just terminating the linked list of saved-RBP values. The System V ABI recommends doing this by explicitly setting RBP to zero, but Linux zeros registers (other than RSP) on process startup so this works.
Anyway, at _start, (rsp) is argc, and then you push a 0 (the saved RBP) and set RBP to point there. So the struct you're passing to sys_nanosleep is {0, argc}. Or argc nanoseconds. (Test with strace to see if I got this right; I didn't try it.)
This is what you should do:
pushq $0 #0 nanoseconds
pushq $2 #2 seconds
### RSP instead of RBP in the next instruction:
mov %rsp, %rdi #the time structure we just pushed
mov $35, %eax #SYS_nanosleep
xor %esi, %esi #rem=NULL, we don't care if we wake early
syscall
# RSP is 16 bytes lower than it was before this fragment, in case that matters for later code.
I also optimized by not using 64-bit operand-size when you don't need it (because writing a 32-bit register zeros the upper 32 bits). I like letting register sizes imply operand size instead of using movq, like in Intel syntax. I also used the idiomatic way to zero a register, and improved the comments.
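For comparison, here is a minimal C sketch of the same call, assuming Linux and the syscall() wrapper from glibc. The helper name raw_nanosleep is made up for illustration. It mainly shows the struct layout that the two pushes build on x86-64: tv_sec at the lower address (pushed last), tv_nsec right after it (pushed first).

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: sleep through the raw nanosleep system call
 * (number 35 on x86-64). Returns 0 on success, -1 on error with errno
 * set by the syscall() wrapper. */
long raw_nanosleep(time_t sec, long nsec)
{
    /* Two fields, seconds first: this is what `pushq $0; pushq $2`
     * leaves at (%rsp), read from the lowest address upwards. */
    struct timespec req = { .tv_sec = sec, .tv_nsec = nsec };

    /* First argument (RDI) = &req, second (RSI) = NULL because we do
     * not ask for the remaining time if the sleep is interrupted. */
    return syscall(SYS_nanosleep, &req, NULL);
}
```

Calling raw_nanosleep(2, 0) corresponds to the fixed assembly fragment. A tv_nsec value outside 0..999999999 makes the kernel reject the request.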
Your proposed answer is broken: subq $16, %rbp before leave is a bad idea.
If you want to address your newly-pushed struct relative to your RBP stack frame, you could lea -16(%rbp), %rdi.
But modifying %rbp will make leave set RSP to the updated RBP and then pop the low qword of the struct into RBP, instead of the caller's saved RBP. RSP is left pointing to the high qword of your struct, rather than the function return address.
This probably only works because you just use sys_exit after leave, because you're not in a function so you couldn't ret anyway. It makes no sense to use leave in _start, because it's not a function. You have to just sys_exit or sys_exit_group.
But if you used this fragment inside an actual function, it would break the stack.
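To make the effect of leave after modifying RBP concrete, here is a toy C model (a sketch, not real machine state). The struct cpu type and the function names are invented for this illustration; the stack is modelled as an array of qwords, and "addresses" are array indices, so subtracting 16 bytes becomes subtracting 2 slots.

```c
#include <stdint.h>

enum { STACK_QWORDS = 8 };

/* Toy machine: stack[i] is the qword at "address" i. */
struct cpu {
    uint64_t stack[STACK_QWORDS];
    uint64_t rsp;   /* index of the current top of stack */
    uint64_t rbp;   /* frame pointer: an index, or whatever got popped */
};

static void push(struct cpu *c, uint64_t v) { c->stack[--c->rsp] = v; }
static uint64_t pop(struct cpu *c) { return c->stack[c->rsp++]; }

/* leave == mov %rbp,%rsp ; pop %rbp */
static void leave(struct cpu *c) { c->rsp = c->rbp; c->rbp = pop(c); }

/* Replays the sequence from the self-answer and returns the final state. */
struct cpu run_broken_sequence(uint64_t callers_rbp)
{
    struct cpu c = { .rsp = STACK_QWORDS, .rbp = callers_rbp };

    push(&c, c.rbp);   /* pushq %rbp                         */
    c.rbp = c.rsp;     /* movq %rsp,%rbp                     */
    push(&c, 0);       /* pushq $0   (tv_nsec)               */
    push(&c, 2);       /* pushq $2   (tv_sec)                */
    c.rbp -= 2;        /* subq $16,%rbp  (16 bytes = 2 slots) */
    leave(&c);         /* pops tv_sec into RBP               */
    return c;
}
```

In this model, run_broken_sequence(0xABCD) ends with rbp == 2 (the tv_sec value, not the caller's RBP) and rsp == 6, the slot that holds tv_nsec. A correct epilogue would end with rbp == 0xABCD and rsp == 8.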

I figured it out myself. This works:
#call nanosleep
movq $35,%rax
subq $16, %rbp
movq %rbp,%rdi
movq $0,%rsi


Returning from a signal handler without going through the kernel + userspace interrupts

This is a followup on my older question: Returning from a signal handler via setcontext
I tried to write the assembly for doing the context switch from a signal handler back to the interrupted context without the kernel assist (the sigreturn system call) for a signal handler where blocked signals don't need to be changed (SA_NODEFER).
It seems doable (mostly just reloading registers) up to a final point: setting %rsp and %rip back.
From what I understand, if there was no red zone, I could do something like:
//use %rax to hold the old %rsp, and %rcx as a scratch register
mov RSP_OFFSET(%rdi), %rax
// push old rip, rax, rcx, rdi using the scratch %rcx register
mov RIP_OFFSET(%rdi), %rcx
mov %rcx, -8(%rax)
mov RAX_OFFSET(%rdi), %rcx
mov %rcx, -16(%rax)
mov RCX_OFFSET(%rdi), %rcx
mov %rcx, -24(%rax)
mov RDI_OFFSET(%rdi), %rcx
mov %rcx, -32(%rax)
//make rax point to old_rsp - 32
lea -32(%rax), %rax
mov %rax, %rsp //restore rsp to that
//restore the pushed registers through actual popping
pop %rdi
pop %rcx
pop %rax
//pop and set %rip
ret
but with the redzone it seems impossible as there's (?) no way to atomically (with respect to signals) and without register use pop off a large amount from the stack and jump to an address stored deep in the popped off portion (or somewhere completely different as is the case with alternate signal stacks). I looked at hardware task switching but that is unsupported in 64-bit protected mode.
How will this work with the upcoming user-level interrupts? Will those handlers be able to somehow return without going through the kernel?

Should %rsp be aligned to 16-byte boundary before calling a function in NASM?

I saw the following rules from NASM's document:
The stack pointer %rsp must be aligned to a 16-byte boundary before making a call. Fine, but the process of making a call pushes the return address (8 bytes) on the stack, so when a function gets control, %rsp is not aligned. You have to make that extra space yourself, by pushing something or subtracting 8 from %rsp.
And I have a snippet of NASM assembly code as below:
Before I call the function "inc" in "_start", %rsp is only 8-byte aligned, which violates the rule quoted above. But everything actually works fine. So, how can I understand this?
I built this under Ubuntu 20.04 LTS (x86_64).
global _start
section .data
init:
db 0x2
section .rodata
codes:
db '0123456789abcdef'
section .text
inc:
mov rax, [rsp+8] ; read param from the stack;
add rax, 0x1
ret
print:
lea rsi, [codes + rax]
mov rax, 1
mov rdi, 1
mov rdx, 1
syscall
ret
_start:
; enable AC check;
pushfq
or dword [rsp], 1<<18
popfq
mov rdi, [init] ; move the first 8 bytes of init to %rdi;
push rdi ; %rsp -> 8 bytes;
call inc
pop r11 ; clean stack by the caller;
call print
mov rax, 60
xor rdi, rdi
syscall
The ABI is a set of rules for how functions should behave to be interoperable with each other. Each of the rules on one side are paired with allowed assumptions on the other. In this case, the rule about stack alignment for the caller is an allowed assumption about stack alignment for the callee. Since your inc function doesn't depend on 16-byte stack alignment, it's fine to call that particular function with a stack that's only 8-byte aligned.
If you're wondering why it didn't break when you enabled AC, that's because you're only loading 8-byte values from the stack, and the stack is still 8-byte aligned. If you did sub rsp, 4 or something to break 8-byte alignment too, then you would get a bus error.
Where the ABI becomes important is when the situation isn't one function you wrote yourself in assembly calling another function you wrote yourself in assembly. A function in someone else's library (including the C standard library), or one that you compiled from C instead of writing in assembly, is within its rights to do movaps [rsp - 24], xmm0 or something, which would break if you didn't properly align the stack before calling it.
Side note: the ABI also says how you're supposed to pass parameters (the calling convention), but you're just kind of passing them wherever. Again, fine from your own assembly, but they'll definitely break if you try to call them from C.
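The alignment arithmetic above can be checked with a few lines of C. This is only a sketch of the address math; the helper names and the example address are made up, and no real stack is involved.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a multiple of align (align must be a power of two). */
bool is_aligned(uint64_t addr, uint64_t align)
{
    return (addr & (align - 1)) == 0;
}

/* RSP as the callee sees it: `call` pushes an 8-byte return address. */
uint64_t rsp_after_call(uint64_t rsp_before_call)
{
    return rsp_before_call - 8;
}
```

Take a 16-byte aligned RSP such as 0x7ffd0000 at _start. After push rdi it is 0x7ffcfff8, which is 8-byte aligned (so 8-byte loads pass the AC check) but not 16-byte aligned. If the caller follows the ABI, the callee sees rsp_after_call(0x7ffd0000), and [rsp - 24] is 16-byte aligned, so movaps works there. If the caller was off by 8, the same [rsp - 24] is misaligned and movaps faults.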

Is there advantage of reading data without using pop operation?

According to this PDF document (Page-66), the following bunch of statements
mov eax, DWORD PTR SS:[esp]
mov eax, DWORD PTR SS:[esp + 4]
mov eax, DWORD PTR SS:[esp + 8]
are equivalent to the following bunch of statements:
pop eax
pop eax
pop eax
Is there any advantage of the former over the latter?
mov leaves the data on the stack, pop removes it so you can only read it once, and only in order. Data below ESP has to be considered "lost", unless you're using a calling convention / ABI that includes a red-zone below the stack pointer.
Data is usually still there below ESP, but asynchronous stuff like signal handlers, or a debugger evaluating a call fflush(0) in the context of your process, can step on it.
Also, pop modifies ESP, so each pop requires stack-unwind metadata (see footnote 1) in another section of the executable/library, for it to be fully ABI compliant with SEH on Windows or the i386 / x86-64 System V ABI on other OSes (which specifies that all functions need unwind metadata, even if they're not C++ functions that actually support propagating exceptions).
But if you're reading data for the last time, and you actually need it all, then yes pop is an efficient way to read it on modern CPUs (like Pentium-M and later, with a stack engine to handle the ESP updates without a separate uop.)
On older CPUs, like Pentium III, pop was actually slower than 3x mov + add esp,12 and compilers did generate code the way Brendan's answer shows.
void foo() {
asm("" ::: "ebx", "esi", "edi");
}
This function forces the compiler to save/restore 3 call-preserved registers (by declaring clobbers on them.) It doesn't actually touch them; the inline asm string is empty. But this makes it easy to see what compilers will do for saving/restoring. (Which is the only time they'll use pop normally.)
GCC's default (tune=generic) code-gen, or with -march=skylake for example, is like this (from the Godbolt compiler explorer)
foo: # gcc8.3 -O3 -m32
push edi
push esi
push ebx
pop ebx
pop esi
pop edi
ret
But telling it to tune for an old CPU without a stack engine makes it do this:
foo: # gcc8.3 -march=pentium3 -O3 -m32
sub esp, 12
mov DWORD PTR [esp], ebx
mov DWORD PTR [esp+4], esi
mov DWORD PTR [esp+8], edi
mov ebx, DWORD PTR [esp]
mov esi, DWORD PTR [esp+4]
mov edi, DWORD PTR [esp+8]
add esp, 12
ret
gcc thinks -march=pentium-m doesn't have a stack engine, or at least chooses not to use push/pop there. I think that's a mistake, because Agner Fog's microarch pdf definitely describes the stack engine as being present in Pentium-M.
On P-M and later, push/pop are single-uop instructions, with the ESP update handled outside the out-of-order backend, and for push the store-address+store-data uops are micro-fused.
On Pentium 3, they're 2 or 3 uops each. (Again, see Agner Fog's instruction tables.)
On in-order P5 Pentium, push and pop are actually fine. (But memory-destination instructions like add [mem], reg were generally avoided, because P5 didn't split them into uops to pipeline better.)
Mixing pop with direct references to [esp] will actually be potentially slower than just one or the other, on modern Intel CPUs, because it costs extra stack-sync uops.
Obviously writing EAX 3 times back to back means the first 2 loads are useless in both sequences.
See Extreme Fibonacci for an example of pop (1 uop, or like 1.1 uop with the stack sync uops amortized) being more efficient than lodsd (2 uops on Skylake) for reading through an array. (In evil code that assumes a large red-zone because it doesn't install signal handlers. Don't actually do this unless you know exactly what you're doing and when it will break; this is more of a silly computer tricks / extreme optimization for code-golf than anything that's practically useful.)
Footnote 1: The Godbolt compiler explorer normally filters out extra assembler directives, but if you uncheck that box you can see gcc's function that uses push/pop has a .cfi_def_cfa_offset directive after every push/pop.
pop ebx
.cfi_restore 3
.cfi_def_cfa_offset 12
pop esi
.cfi_restore 6
.cfi_def_cfa_offset 8
pop edi
.cfi_restore 7
.cfi_def_cfa_offset 4
The .cfi_restore 7 metadata directives have to be there regardless of push/pop vs. mov, because that lets stack unwinding restore call-preserved registers as it unwinds. (7 is the register number).
But for other uses of push/pop inside a function (like pushing args to a function call, or a dummy pop to remove it from the stack), you wouldn't have .cfi_restore, only metadata for the stack pointer changing relative to the stack frame.
Normally you don't worry about this in hand-written asm, but compilers have to get this right so there's a small extra cost to using push/pop in terms of total executable size. But only in parts of the file that aren't mapped into memory normally, and not mixed with code.
pop eax
pop ebx
pop ecx
.. is sort of equivalent to this:
mov eax,[esp]
add esp,4
mov ebx,[esp]
add esp,4
mov ecx,[esp]
add esp,4
..which can be like this:
mov eax,[esp] ;Do this instruction
add esp,4 ; ...and this instruction in parallel
;Stall until the previous instruction completes (and the value
mov ebx,[esp] ;in ESP becomes known); then do this instruction
add esp,4 ; ...and this instruction in parallel
;Stall until the previous instruction completes (and the value
mov ecx,[esp] ;in ESP becomes known); then do this instruction
add esp,4 ; ...and this instruction in parallel
For this code:
mov eax, [esp]
mov ebx, [esp + 4]
mov ecx, [esp + 8]
add esp,12
.. all of the instructions can happen in parallel (in theory).
Note: In practice all of the above depends on which CPU, etc.
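The claim that both sequences end in the same architectural state can be checked with a toy C model. This sketch says nothing about speed; it only models what ends up in the registers. The struct regs type and the function names are invented for the illustration, and memory is a plain array indexed by byte offset / 4.

```c
#include <stdint.h>

struct regs { uint32_t eax, ebx, ecx, esp; };

/* Load the dword at byte offset addr from a dword array. */
static uint32_t load(const uint32_t *mem, uint32_t addr)
{
    return mem[addr / 4];
}

/* pop eax ; pop ebx ; pop ecx  (each load depends on the updated ESP) */
struct regs with_pops(const uint32_t *mem, uint32_t esp)
{
    struct regs r = { .esp = esp };
    r.eax = load(mem, r.esp); r.esp += 4;
    r.ebx = load(mem, r.esp); r.esp += 4;
    r.ecx = load(mem, r.esp); r.esp += 4;
    return r;
}

/* mov eax,[esp] ; mov ebx,[esp+4] ; mov ecx,[esp+8] ; add esp,12
 * (all three loads use the same, unmodified ESP) */
struct regs with_movs(const uint32_t *mem, uint32_t esp)
{
    struct regs r = { .esp = esp };
    r.eax = load(mem, r.esp);
    r.ebx = load(mem, r.esp + 4);
    r.ecx = load(mem, r.esp + 8);
    r.esp += 12;
    return r;
}
```

The difference the answer describes is visible in the source: in with_pops every load address depends on the previous ESP update, while in with_movs the three addresses are all computed from the original ESP.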

Linux sbrk() as a syscall in assembly

So, as a challenge, and for performance, I'm writing a simple server in assembly. The only way I know of is via system calls (through int 0x80). Obviously, I'm going to need more memory than what is allocated at assembly time or at load time, so I read up and decided I wanted to use sbrk(), mainly because I don't understand mmap() :p
At any rate, Linux provides no system call for sbrk(), only brk().
So... how do I find the current program break to use brk()? I thought about using getrlimit(), but I don't know how to get a resource (the process id I'd guess) to pass to getrlimit(). Or should I find some other way to implement sbrk()?
The sbrk function can be implemented by getting the current break value and adding the desired increment manually. Some systems allow you to get the current value with brk(0), others keep track of it in a variable [which is initialized with the address of _end, which is set up by the linker to point to the initial break value].
This is a very platform-specific thing, so YMMV.
EDIT: On Linux, the brk(2) man page notes:
However, the actual Linux system call returns the new program break on success. On failure, the system call returns the current break. The glibc wrapper function does some work (i.e., checks whether the new break is less than addr) to provide the 0 and -1 return values described above.
So from assembly, you can call it with an absurd value like 0 or -1 to get the current value.
Be aware that you cannot "free" memory allocated via brk - you may want to just link in a malloc function written in C. Calling C functions from assembly isn't hard.
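As a rough sketch of that approach, here is an sbrk-like helper written in C on top of the raw brk system call, assuming Linux and glibc's syscall() wrapper. The names current_break and my_sbrk are made up for illustration; the same two system calls are what an assembly version would issue.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel for the current break. A request of 0 cannot be
 * satisfied, and on failure the raw system call returns the unchanged
 * current break. */
void *current_break(void)
{
    return (void *)syscall(SYS_brk, 0L);
}

/* Hypothetical sbrk() replacement. Returns the old break (the start of
 * the new memory when increment > 0), or (void *)-1 if the kernel did
 * not move the break to the requested address. */
void *my_sbrk(intptr_t increment)
{
    char *old_brk = current_break();
    char *wanted = old_brk + increment;
    char *new_brk = (char *)syscall(SYS_brk, wanted);

    if (new_brk != wanted)
        return (void *)-1;
    return old_brk;
}
```

One caveat with this sketch: glibc keeps its own cached copy of the break for its sbrk() and malloc(), so moving the break behind its back in a program that also uses malloc() is risky. In a libc-free assembly program that concern does not apply.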
#include <unistd.h>
#define SOME_NUMBER 8
int main() {
void *ptr = sbrk(SOME_NUMBER);
return 0;
}
Compile with the assembly-output option:
gcc -S -o test.S test.c
Then look at the ASM code
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $8, %eax
movl %eax, %edi
callq _sbrk
movq %rax, -16(%rbp)
movl $0, -8(%rbp)
movl -8(%rbp), %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %eax
addq $16, %rsp
popq %rbp
retq
There is no system call for sbrk() itself, but you can still call the library function this way.

How to manipulate strings with x86 assembly?

I'm in the process of writing an assembly program that takes two strings as input and concatenates them. Here's what I have: (using NASM syntax)
hello: db "Hello ",0
world: db "world!",0
; do the concatenation
Since I've never done any work with strings in x86 assembly before, I need to know how storing and manipulating strings work in the first place.
I'm guessing that once the length of each string is known, that concatenating would simply involve moving chunks of memory around. This part can be simplified by using libc. (I can use strlen() and strcat().)
My real problem is that I'm not familiar with the way strings are stored in x86 assembly. Do they just get added to the stack...? Do they go on a heap somewhere? Should I use malloc() (somehow)?
The strings in your example are stored the same way a global character array would be stored by a C program. They're just a series of bytes in the data section of your executable. If you want to concatenate them, you're going to need some space to do it - either do it on the stack, or call malloc() to get yourself some memory. As you say, you can just use strcat() if you are willing to call out to libc. Here's a quick example I made (AT&T syntax), using a global buffer to concatenate the strings, then print them out:
.data
hello:
.asciz "Hello "
world:
.asciz "world!"
buffer:
.space 100
.text
.globl _main
.globl _puts
.globl _strcat
_main:
push %rbp
mov %rsp, %rbp
leaq buffer(%rip), %rdi
leaq hello(%rip), %rsi
callq _strcat
leaq buffer(%rip), %rdi
leaq world(%rip), %rsi
callq _strcat
leaq buffer(%rip), %rdi
callq _puts
mov $0, %rax
pop %rbp
retq
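For reference, the assembly above corresponds roughly to the following C. This is a sketch; the function name concat_hello_world is made up, and the buffer is reset on each call so the function can be called more than once.

```c
#include <string.h>

static const char hello[] = "Hello ";
static const char world[] = "world!";
static char buffer[100];   /* zero-filled at startup, like `.space 100` */

/* Builds "Hello world!" in the global buffer and returns it. */
const char *concat_hello_world(void)
{
    buffer[0] = '\0';        /* start from an empty string */
    strcat(buffer, hello);   /* buffer = "Hello "          */
    strcat(buffer, world);   /* buffer = "Hello world!"    */
    return buffer;
}
```

The assembly version relies on the buffer starting out zero-filled, which is why the first strcat behaves like a copy. The 100-byte buffer is far larger than the 13 bytes needed here; strcat does no bounds checking, so longer inputs would need a length check first.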
