Linux debugger detection in a multi-threaded application using ptrace

I have to implement a debugger detection technique under Linux. The main idea is that my code creates a second thread via the clone syscall. The created thread is then supposed to check in a while loop whether a debugger is present, sleeping for a few seconds between checks. My question is how to implement debugger detection via ptrace in a multi-threaded environment inside an infinite loop. My problem is that the second call to ptrace(PTRACE_TRACEME, 0, 1, 0) reports that a debugger is detected (which is reasonable and correct, of course). So do I have to detach the tracer somehow at the end of the loop, or use ptrace in another way? Here is a piece of code:
new_thread:
; PTRACE
xor rdi, rdi
xor rsi, rsi
xor rdx, rdx
inc rdx
xor r10, r10
mov rax, 101 ; ptrace syscall
syscall
cmp rax, 0
jge __nondbg
call _dbg
db 'debugged!', 0xa, 0
_dbg:
mov rdi, 1
pop rsi
mov rdx, 10
mov rax, 1 ; syscall write
syscall
; exit_group call
mov rdi, 127
mov rax, 231 ; exit_group syscall
syscall
__nondbg:
call _nondbg
db 'non-debugged!', 0xa, 0
_nondbg:
mov rdi, 1
pop rsi
mov rdx, 14
mov rax, 1 ; syscall write
syscall
; ==========
; SLEEP.....
; ==========
push 0 ; value should be a parameter
push 5 ; value should be a parameter
mov rdi, rsp
xor rsi, rsi
mov rax, 35 ; syscall nanosleep
syscall ; syscall
pop rax
pop rax
jmp new_thread

I don't know if your design forces you to try loop-based detection. PTRACE_TRACEME is used by a tracee process to be traced by its parent (after fork). I admit I don't know for sure how this would work when the tracer is another thread in the same process, but I think it wouldn't work very well, as the ptrace mechanism is based on signals. Note also that PTRACE_TRACEME can only succeed once: after the first call the process is already marked as being traced, so the second call fails with EPERM regardless of whether a debugger is attached.
If you want to be sure that your (child) process has been attached to by the tracer, the common approach is to raise a stop signal so the tracer can attach. When execution resumes, you know the tracer is there.
raise(SIGSTOP);
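If the goal is to re-check for a debugger on every loop iteration, one approach sometimes used on Linux (not what the answer above describes, and untested here) is to fork a short-lived helper each iteration and let it try PTRACE_ATTACH on the parent: the attach is refused if another tracer (a debugger) is already present. A thread generally cannot attach within its own thread group, which is why this sketch forks instead of using a clone'd thread, and the default Yama ptrace_scope setting can also block a child attaching to its parent unless the parent allows it with prctl(PR_SET_PTRACER). All labels are illustrative; the constants are PTRACE_ATTACH = 16 and PTRACE_DETACH = 17.
check_debugger:
mov rax, 57 ; fork syscall
syscall
test rax, rax
jnz parent_side
; ---- child: try to attach to the parent ----
mov rax, 110 ; getppid syscall
syscall
mov r12, rax ; parent pid (r12 survives syscalls)
mov rdi, 16 ; PTRACE_ATTACH
mov rsi, r12
xor edx, edx
xor r10d, r10d
mov rax, 101 ; ptrace syscall
syscall
test rax, rax
js child_debugged ; attach refused -> a tracer is already there
mov rdi, r12 ; wait for the parent to enter its ptrace stop
xor esi, esi
xor edx, edx
xor r10d, r10d
mov rax, 61 ; wait4 syscall
syscall
mov rdi, 17 ; PTRACE_DETACH, data = 0 so no signal is delivered
mov rsi, r12
xor edx, edx
xor r10d, r10d
mov rax, 101 ; ptrace syscall
syscall
xor edi, edi ; exit(0): no debugger seen
jmp child_exit
child_debugged:
mov edi, 1 ; exit(1): debugger present
child_exit:
mov rax, 60 ; exit syscall
syscall
parent_side:
; reap the helper with wait4, read the verdict from its exit status,
; then sleep and jump back to check_debugger
The per-iteration helper keeps the ptrace state from accumulating across iterations the way repeated PTRACE_TRACEME calls do.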

Related

lock cmpxchg fails to execute threads in core order

The following 64-bit NASM code uses lock cmpxchg to take each core in core order, execute some code, then reset the core number variable using xchg so the next core can execute the code. The core number for each core is stored in rbx -- the four cores are numbered 0, 8, 16 and 24. The variable [spin_lock_core] starts at zero and when each core is finished it updates the core number by 8 at the final line xchg [spin_lock_core],rax.
Spin_lock:
xor rax,rax
lock cmpxchg [spin_lock_core],rbx
jnz Spin_lock
; Test
mov rbp,extra_test_array
mov [rbp+rbx],rbx
; Execute some code before looping out
mov rax,1234
mov rdx,23435
add rax,rbx
mov rcx,rax
;jmp label_899
mov rax,rbx
add rax,8
xchg [spin_lock_core],rax
But before the code reaches xchg [spin_lock_core],rax the first core loops out of the program (jmp label_899), which should cause the other threads to freeze because they would be waiting for the [spin_lock_core] var to be updated, which never happens. But instead all four cores are written to the output array extra_test_array, which is displayed on the terminal when the program exits. In other words, this fails to stop the cores until the core number is updated.
The full, minimal code is below (as minimal as NASM can be in this case). The code is written as a shared object, and the problem is reproducible with any input array (as written it doesn't matter whether the input array is int or float):
; Header Section
[BITS 64]
[default rel]
global Main_Entry_fn
extern pthread_create, pthread_join, pthread_exit, pthread_self, sched_getcpu
global FreeMem_fn
extern malloc, realloc, free
extern sprintf
section .data align=16
X_ctr: dq 0
data_master_ptr: dq 0
initial_dynamic_length: dq 0
XMM_Stack: dq 0, 0, 0, 0, 0, 0, 0
ThreadID: dq 0
X_ptr: dq 0
X_length: dq 0
X: dq 0
collect_ptr: dq 0
collect_length: dq 0
collect_ctr: dq 0
even_squares_list_ptrs: dq 0, 0, 0, 0
even_squares_list_ctr: dq 0
even_squares_list_length: dq 0
Number_Of_Cores: dq 32
pthread_attr_t: dq 0
pthread_arg: dq 0
Join_Ret_Val: dq 0
tcounter: dq 0
sched_getcpu_array: times 4 dq 0
ThreadIDLocked: dq 0
spin_lock_core: dq 0
extra_test_array: dq 0
; __________
section .text
Init_Cores_fn:
; _____
; Create Threads
label_0:
mov rdi,ThreadID ; ThreadCount
mov rsi,pthread_attr_t ; Thread Attributes
mov rdx,Test_fn ; Function Pointer
mov rcx,pthread_arg
call pthread_create wrt ..plt
mov rdi,[ThreadID] ; id to wait on
mov rsi,Join_Ret_Val ; return value
call pthread_join wrt ..plt
mov rax,[tcounter]
add rax,8
mov [tcounter],rax
mov rbx,[Number_Of_Cores]
cmp rax,rbx
jl label_0
; _____
jmp label_900 ; All threads return here, and exit
; ______________________________________
Test_fn:
; Get the core number
call sched_getcpu wrt ..plt
mov rbx,8 ; multiply by 8
mul rbx
push rax
pop rax
mov rbx,rax
push rax
Spin_lock:
lock cmpxchg [spin_lock_core],rbx
jnz Spin_lock
; Test
mov rbp,extra_test_array
mov [rbp+rbx],rbx
; Execute some code before looping out
mov rax,1234
mov rdx,23435
add rax,rbx
mov rcx,rax
jmp label_899
mov rax,rbx
add rax,8
xchg [spin_lock_core],rax
;__________
label_899:
pop rax
ret
; __________
label_900:
mov rdi,extra_test_array ;audit_array
mov rax,rdi
ret
;__________
;Free the memory
FreeMem_fn:
;The pointer is passed back in rcx (of course)
sub rsp,40
call free wrt ..plt
add rsp,40
ret
; __________
; Main Entry
Main_Entry_fn:
push rdi
push rbp
push rbx
push r15
xor r15,r15
push r14
xor r14,r14
push r13
xor r13,r13
push r12
xor r12,r12
push r11
xor r11,r11
push r10
xor r10,r10
push r9
xor r9,r9
push r8
xor r8,r8
movsd [XMM_Stack+0],xmm13
movsd [XMM_Stack+8],xmm12
movsd [XMM_Stack+16],xmm11
movsd [XMM_Stack+24],xmm15
movsd [XMM_Stack+32],xmm14
movsd [XMM_Stack+40],xmm10
mov [X_ptr],rdi
mov [data_master_ptr],rsi
; Now assign lengths
lea rdi,[data_master_ptr]
mov rbp,[rdi]
xor rcx,rcx
movsd xmm0,qword[rbp+rcx]
cvttsd2si rax,xmm0
mov [X_length],rax
add rcx,8
; __________
; Write variables to assigned registers
mov r15,0
lea rdi,[rel collect_ptr]
mov r14,qword[rdi]
mov r13,[collect_ctr]
mov r12,[collect_length]
lea rdi,[rel X_ptr]
mov r11,qword[rdi]
mov r10,[X_length]
; __________
call Init_Cores_fn
movsd xmm10,[XMM_Stack+0]
movsd xmm14,[XMM_Stack+8]
movsd xmm15,[XMM_Stack+16]
movsd xmm11,[XMM_Stack+24]
movsd xmm12,[XMM_Stack+32]
movsd xmm13,[XMM_Stack+40]
pop r8
pop r9
pop r10
pop r11
pop r12
pop r13
pop r14
pop r15
pop rbx
pop rbp
pop rdi
ret
The instruction "lock cmpxchg" should fail until the [spin_lock_core] variable is updated, but it doesn't do that.
Thanks for any help in understanding why lock cmpxchg doesn't prevent the cores after core zero from firing in this area of code.
UPDATE: other research shows that xor rax,rax is needed at the top of the Spin_lock: section. When I insert that line, it reads like this:
Spin_lock:
xor rax,rax
lock cmpxchg [spin_lock_core],rbx
jnz Spin_lock
With that change it freezes, as expected. But when I remove the line jmp label_899 it still freezes, but it shouldn't do that.
EDIT 122219:
Based on the comments on this question yesterday, I revised the spinlock code to (1) eliminate atomic operations in favor of faster mov and cmp instructions, (2) assign a unique memory location to each core, and (3) separate the memory locations by > 256 bytes to avoid memory on the same cache line.
Each core's memory location will be changed to 1 when the previous core is finished. When each core finishes, it sets its own memory location back to 0.
The code successfully executes core 0 IF I have all other cores loop out before the spinlock. When I let all four cores run through the spinlock, the program again hangs.
I've verified that each separate memory location is set to 1 when the previous core is finished.
Here's the updated spinlock section:
section .data
spin_lock_core: times 140 dq 0
spin_lock_core_offsets: dq 0,264,528,792
section .text
; Calculate the offset to spin_lock_core
mov rbp,spin_lock_core
mov rdi,spin_lock_core_offsets
mov rax,[rdi+rbx]
add rbp,rax
; ________
Spin_lock:
pause
cmp byte[rbp],1
jnz Spin_lock
xor rax,rax
mov [rbp],rax ; Set current memory location to zero
; Execute some code before looping out
mov rax,1234
mov rdx,23435
add rax,rdx
mov rcx,rax
; Loop out if this is the last core
mov rax,rbx
add rax,8
cmp rax,[Number_Of_Cores]
jge label_899
; Set next core to 1 by adding 264 to the base address
add rbp,264
mov rax,1
mov [rbp],rax
Why does this code still hang?
I don't think you should use cmpxchg for this at all. Try this:
Spin_lock:
pause
cmp [spin_lock_core],rbx
jnz Spin_lock
; Test
mov rbp,extra_test_array
mov [rbp+rbx],rbx
; Execute some code before looping out
mov rax,1234
mov rdx,23435
add rax,rbx
mov rcx,rax
;jmp label_899
lea rax,[rbx+8]
mov [spin_lock_core],rax
I solved this spinlock problem, but after Peter Cordes' comment below I see that it is not correct. I won't delete this answer because I hope it can lead to the solution.
I use lock cmpxchg [rbp+rbx],rbx, which assembles without error. I originally expected NASM to reject this with an "invalid combination of operands" error, but that was a misreading of the manual: the format really is CMPXCHG r/m64,r64 (see, for example, https://www.felixcloutier.com/x86/cmpxchg), where the explicit source can be any r64; rax is the implicit operand that is compared against the destination and that receives the old memory value when the comparison fails.
Without the "mov rax,rbx" line it works because on the first iteration rax happens to be 0, which matches the memory location, and because a failed cmpxchg loads the memory value into rax, so any retry compares equal and succeeds.
When I add "mov rax,rbx" -- which resets rax -- the program once again hangs. I would really appreciate any ideas on why this program hangs as written.
At the start of this block rbx is the core number:
section .data
spin_lock_core: times 4 dq 0
section .text
[ Code leading up to this spinlock section shown above ]
mov rbp,spin_lock_core
Spin_lock:
pause
mov rax,rbx
lock cmpxchg [rbp+rbx],rax
jnz Spin_lock
mov rax,rbx
add rax,8
cmp rax,[Number_Of_Cores]
jge spin_lock_out
xchg [rbp+rax],rax
spin_lock_out:
The differences from my original post are:
Each core spins on (and reads from) its own unique memory location.
I use the "pause" instruction on the spinlock.
Each unique memory location is updated in core order.
But it does not work when I include mov rax,rbx. Intuitively that should work, so I would really appreciate any ideas on why it doesn't in this case.
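For comparison, here is a minimal sketch (not from any of the posts above) of the usual lock cmpxchg spinlock idiom, which makes the implicit role of rax explicit; lock_var and the 0/1 encoding are illustrative assumptions:
section .data
lock_var: dq 0 ; 0 = free, 1 = taken
section .text
acquire:
pause
xor eax, eax ; rax = expected value (0 = free); must be reloaded every iteration
mov ecx, 1 ; rcx = desired value (1 = taken)
lock cmpxchg [lock_var], rcx ; if [lock_var] == rax, store rcx; otherwise load [lock_var] into rax
jnz acquire ; ZF clear means the compare failed, so spin again
; ... critical section ...
release:
mov qword [lock_var], 0 ; a plain store is enough to release a lock on x86
The key point is that a failed cmpxchg overwrites rax with the current memory value, which is why the expected value has to be reloaded at the top of the loop -- the same reason the xor rax,rax mentioned in the question's first update was needed.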

How to obtain the VSync Display refresh pulse in Linux?

I'm programming some routines in Linux NASM x86-64 assembly.
How can I obtain the display refresh pulse (VSync)?
I guess maybe I can reach the pulse via a syscall, but I'm open to other suggestions. Don't ask me why; I really need the pulse to avoid flicker on the display.
I know how to do it on Windows, as shown in the code below, but Linux doesn't support D3D.
;;;;; WINDOWS VERSION EXAMPLE!
;ENABLE VSYNC
Therraszeta3:
CMP BYTE [RY_X+0x1003],255
jnz .L1232321
mov rcx,0
mov rax, [GetDC__]
mov [D3DKMT_OPENADAPTERFROMHDC_hDc], rax
lea rcx, [D3DKMT_OPENADAPTERFROMHDC]
call [GetProcAddress_LoadLibrary_Gdi32_dll_D3DKMTOpenAdapterFromHdc_]
mov [D3DKMTOpenAdapterFromHdc__], rax
;;
mov eax, dword [D3DKMT_OPENADAPTERFROMHDC_hAdapter]
mov dword [D3DKMT_WAITFORVERTICALBLANKEVENT_hAdapter], eax
mov dword [D3DKMT_WAITFORVERTICALBLANKEVENT_hDevice],0
mov eax, dword [D3DKMT_OPENADAPTERFROMHDC_VidPnSourceId]
mov dword [D3DKMT_WAITFORVERTICALBLANKEVENT_VidPnSourceId], eax
lea rcx, [D3DKMT_WAITFORVERTICALBLANKEVENT]
call [GetProcAddress_LoadLibrary_Gdi32_dll_D3DKMTWaitForVerticalBlankEvent_]
mov [D3DKMTWaitForVerticalBlankEvent__], rax
.L1232321
;;;;;
I expect to obtain the pulse in an infinite loop, indicating the beginning of every frame.
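One way to get such a pulse on Linux is the DRM/KMS interface: drmWaitVBlank() from libdrm blocks until the next vertical blank on the card you opened. The sketch below is untested and makes several assumptions to be checked on your system (linking against libdrm, /dev/dri/card0 as the device, O_RDWR = 2, DRM_VBLANK_RELATIVE = 1, and the 4/4/8-byte layout of the request part of the drmVBlank union from xf86drm.h):
[default rel]
extern drmWaitVBlank ; from libdrm, link with -ldrm
section .data
card_path: db "/dev/dri/card0", 0
drm_fd: dq 0
vbl: times 24 db 0 ; union drmVBlank; request part: type (dd), sequence (dd), signal (dq)
section .text
open_card:
lea rdi, [card_path]
mov rsi, 2 ; O_RDWR
xor rdx, rdx
mov rax, 2 ; open syscall
syscall ; (error handling omitted)
mov [drm_fd], rax
vsync_loop:
mov dword [vbl], 1 ; request.type = DRM_VBLANK_RELATIVE
mov dword [vbl+4], 1 ; request.sequence = 1 -> wait for the next vblank
mov qword [vbl+8], 0 ; request.signal = 0
mov rdi, [drm_fd]
lea rsi, [vbl]
call drmWaitVBlank wrt ..plt
; a vertical blank has just started here: draw the frame, then loop
jmp vsync_loop
When a GL or EGL context is available, extensions such as GLX_OML_sync_control offer a similar wait, but the DRM path above needs no graphics context.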

push/pop segmentation fault in simple multiplication function

My teacher is doing a crash course in assembly with us, and I have no experience with it whatsoever. I am supposed to write a simple function that takes four variables, calculates (x*y) - (z*a), and then prints out the answer. I know it's a simple problem, but after hours of research I am getting nowhere; any push in the right direction would be very helpful! I do need to use the stack, as I have more things to add to the program once I get past this point and will have a lot of variables to store. I am assembling with NASM and linking with gcc, on Linux (x86-64).
(Side question: my '3' isn't showing up in register r10, but I am on Linux so this should be the correct register... any ideas?)
Here is my code so far:
global main
extern printf
segment .data
mulsub_str db "(%ld * %ld) - (%ld * %ld) = %ld",10,0
data dq 1, 2, 3, 4
segment .text
main:
call multiplyandsubtract
pop r9
mov rdi, mulsub_str
mov rsi, [data]
mov rdx, [data+8]
mov r10, [data+16]
mov r8, [data+24]
mov rax, 0
call printf
ret
multiplyandsubtract:
;;multiplies first function
mov rax, [data]
mov rdi, [data+8]
mul rdi
mov rbx, rdi
push rbx
;;multiplies second function
mov rax, [data+16]
mov rsi, [data+24]
mul rsi
mov rbx, rsi
push rbx
;;subtracts function 2 from function 1
pop rsi
pop rdi
sub rdi, rsi
push rdi
ret
push in the right direction
Nice pun!
Your problem is that you apparently don't realize that ret takes its return address from the stack. As such, push rdi; ret will just jump to the address in rdi and not return to your caller. Since that is unlikely to be a valid code address, you get a nice segfault.
To return values from functions just leave the result in a register, standard calling conventions normally use rax. Here is a possible version:
global main
extern printf
segment .data
mulsub_str db "(%ld * %ld) - (%ld * %ld) = %ld",10,0
data dq 1, 2, 3, 4
segment .text
main:
sub rsp, 8
call multiplyandsubtract
mov r9, rax
mov rdi, mulsub_str
mov rsi, [data]
mov rdx, [data+8]
mov rcx, [data+16] ; printf's fourth argument goes in rcx, not r10
mov r8, [data+24]
mov rax, 0
call printf
add rsp, 8
ret
multiplyandsubtract:
;;multiplies first function
mov rax, [data]
mov rdi, [data+8]
mul rdi
mov rbx, rax ; the product is in rax (mul leaves it in rdx:rax), not in rdi
push rbx
;;multiplies second function
mov rax, [data+16]
mov rsi, [data+24]
mul rsi
mov rbx, rax ; again, take the product from rax
push rbx
;;subtracts function 2 from function 1
pop rsi
pop rdi
sub rdi, rsi
mov rax, rdi
ret
PS: notice I have also fixed the stack alignment as per the ABI. printf is known to be picky about that too.
To return more than 64 bits from a subroutine (when rax is not enough), you can either drop the standard ABI convention altogether, or actually follow it (the ABI does define how to return larger values: a 128-bit result comes back in rdx:rax, and larger aggregates are written through a hidden pointer passed by the caller), and simply use other registers until you run out of them.
And once you run out of spare return registers (or when you really want to use stack memory), you can follow the way C++ compilers do it:
SUB rsp,<return_data_size + alignment>
CALL subroutine
...
MOV al,[rsp + <offset>] ; to access some value from returned data
; <offset> = 0 to return_data_size-1, as defined by you when defining
; the memory layout for returned data structure
...
ADD rsp,<return_data_size + alignment> ; restore stack pointer
subroutine:
MOV al,<result_value_1>
MOV [rsp + 8 + <offset>],al ; store it into allocated stack space
; the +8 is there to skip over the return address, which was pushed
; onto the stack by the "CALL" instruction. If you push more registers/data
; onto the stack inside the subroutine, you will either have to recalculate
; all the offsets in the following code, or use a classic C-style function prologue:
PUSH rbp
MOV rbp,rsp
MOV [rbp + 16 + <offset>],al ; now all offsets are constant relative to rbp
... other code ...
; epilogue code restoring stack
MOV rsp,rbp ; optional, when you did use RSP and didn't restore it yet
POP rbp
RET
So during executing the instructions of subroutine, the stack memory layout is like this:
rsp -> current_top_of_stack (some temporary push/pop as needed)
+x ...
rbp -> original rbp value (if prologue/epilogue code was used)
+8 return address to caller
+16 allocated space for returning values
+16+return_data_size
... padding to have rsp correctly aligned by ABI requirements ...
+16+return_data_size+alignment
... other caller stack data or its own stack frame/return address ...
I'm not going to check exactly how the ABI defines this, but I hope the answer explains the principle well enough that you can recognize how the ABI does it and adjust.
Then again, I would highly recommend using many shorter, simpler subroutines that return only a single value (in rax/eax/ax/al) whenever possible; try to follow the SRP (Single Responsibility Principle). The approach above forces you to define some return-data structure, which may be too much hassle if it's just something temporary that could be split into single-value subroutines instead (and if performance is a concern, inlining the whole subroutine will probably outperform even the logic of grouped return values and a single CALL).
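As a small illustration of the well-defined ABI route mentioned above, a second 64-bit result can simply travel back in rdx alongside rax. The divmod routine below is a hypothetical example, not taken from the question:
divmod: ; in: rdi = dividend, rsi = divisor
mov rax, rdi
xor edx, edx ; clear the high half of the dividend before dividing
div rsi ; rax = quotient, rdx = remainder
ret ; two results come back, in rax and rdx
caller:
mov rdi, 100
mov rsi, 7
call divmod ; afterwards rax = 14 and rdx = 2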

I'm getting a segmentation fault in my assembly program [duplicate]

The tutorial I am following is for x86 and was written using 32-bit assembly; I'm trying to follow along while learning x64 assembly in the process. This has been going very well up until this lesson, where I have the following simple program that simply tries to modify a single character in a string; it assembles fine but segfaults when run.
section .text
global _start ; Declare global entry point for ld
_start:
jmp short message ; Jump to where our message is so we can do a call to push its address onto the stack
code:
xor rax, rax ; Clean up the registers
xor rbx, rbx
xor rcx, rcx
xor rdx, rdx
; Try to change the N to a space
pop rsi ; Get address from stack
mov al, 0x20 ; Load 0x20 into RAX
mov [rsi], al; Why segfault?
xor rax, rax; Clear again
; write(rdi, rsi, rdx) = write(file_descriptor, buffer, length)
mov al, 0x01 ; write the command for 64bit Syscall Write (0x01) into the lower 8 bits of RAX
mov rdi, rax ; First Paramter, RDI = 0x01 which is STDOUT, we move rax to ensure the upper 56 bits of RDI are zero
;pop rsi ; Second Parameter, RSI = Popped address of message from stack
mov dl, 25 ; Third Parameter, RDX = Length of message
syscall ; Call Write
; exit(rdi) = exit(return value)
xor rax, rax ; write returns # of bytes written in rax, need to clean it up again
add rax, 0x3C ; 64bit syscall exit is 0x3C
xor rdi, rdi ; Return value is in rdi (First parameter), zero it to return 0
syscall ; Call Exit
message:
call code ; Pushes the address of the string onto the stack
db 'AAAABBBNAAAAAAAABBBBBBBB',0x0A
The culprit is this line:
mov [rsi], al; Why segfault?
If I comment it out, the program runs fine and outputs the message 'AAAABBBNAAAAAAAABBBBBBBB'. Why can't I modify the string?
The author's code is the following:
global _start
_start:
jmp short ender
starter:
pop ebx ;get the address of the string
xor eax, eax
mov al, 0x20
mov [ebx+7], al ;put a NULL where the N is in the string
mov al, 4 ;syscall write
mov bl, 1 ;stdout is 1
pop ecx ;get the address of the string from the stack
mov dl, 25 ;length of the string
int 0x80
xor eax, eax
mov al, 1 ;exit the shellcode
xor ebx,ebx
int 0x80
ender:
call starter
db 'AAAABBBNAAAAAAAABBBBBBBB',0x0A
And I've compiled that using:
nasm -f elf <infile> -o <outfile>
ld -m elf_i386 <infile> -o <outfile>
But even that causes a segfault. Images on the page show it working properly and changing the N into a space, but I seem to be stuck in segfault land :( Google isn't really being helpful in this case, so I turn to you, Stack Overflow; any pointers (no pun intended!) would be appreciated.
I would assume it's because you're trying to write to data that lives in the .text section. Usually you're not allowed to write to the code segment, for security. Modifiable data should be in the .data section (or .bss if zero-initialized).
For actual shellcode, where you don't want to use a separate section, see Segfault when writing to string allocated by db [assembly] for alternate workarounds.
Also, I would never suggest using the side effect of call (pushing the address of the data that follows it onto the stack) to get a pointer to that data, except in shellcode.
This is a common trick in shellcode (which must be position-independent); 32-bit mode needs a call to get EIP somehow. The call must have a backwards displacement to avoid 00 bytes in the machine code, so putting the call somewhere that creates a "return" address you specifically want saves an add or lea.
Even in 64-bit code where RIP-relative addressing is possible, jmp / call / pop is about as compact as jumping over the string for a RIP-relative LEA with a negative displacement.
Outside of the shellcode / constrained-machine-code use case, it's a terrible idea, and you should just lea reg, [rel buf] like a normal person, with the data in .data and the code in .text (or read-only data in .rodata). This way you're not trying to execute code next to data, or to put data next to code.
(Code-injection vulnerabilities that allow shellcode already imply the existence of a page with write and exec permission, but normal processes from modern toolchains don't have any W+X pages unless you do something to make that happen. W^X is a good security feature for this reason, so normal toolchain security features / defaults must be defeated to test shellcode.)
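For a non-shellcode version of the same exercise, a minimal sketch along those lines (label names are illustrative) keeps the string in .data and addresses it RIP-relatively:
section .data
msg: db 'AAAABBBNAAAAAAAABBBBBBBB', 0x0A
msg_len equ $ - msg
section .text
global _start
_start:
lea rsi, [rel msg] ; .data is writable, so the store below is fine
mov byte [rsi+7], 0x20 ; replace the 'N' with a space
mov eax, 1 ; sys_write
mov edi, 1 ; stdout
mov edx, msg_len ; length including the newline
syscall
mov eax, 60 ; sys_exit
xor edi, edi
syscall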

NASM x86_64 having trouble writing command line arguments, returning -14 in rax

I am using elf64 compilation and trying to take a parameter and write it out to the console.
I am invoking the program as ./test wooop
After stepping through with gdb there seems to be no problem; everything is set up OK:
rax: 0x4
rbx: 0x1
rcx: pointing to string, x/6cb $rcx gives 'w' 'o' 'o' 'o' 'p' 0x0
rdx: 0x5 <---correctly determining length
After the int 80h, rax contains -14 and nothing is printed to the console.
If I define a string in .data instead, it just works, and gdb shows the value of $rcx the same way.
Any ideas? Here is my full source:
%define LF 0Ah
%define stdout 1
%define sys_exit 1
%define sys_write 4
global _start
section .data
usagemsg: db "test {string}",LF,0
testmsg: db "wooop",0
section .text
_start:
pop rcx ;this is argc
cmp rcx, 2 ;one argument
jne usage
pop rcx
pop rcx ; argument now in rcx
test rcx,rcx
jz usage
;mov rcx, testmsg ;<-----uncomment this to print ok!
call print
jmp exit
usage:
mov rcx, usagemsg
call print
jmp exit
calclen:
push rdi
mov rdi, rcx
push rcx
xor rcx,rcx
not rcx
xor al,al
cld
repne scasb
not rcx
lea rdx, [rcx-1]
pop rcx
pop rdi
ret
print:
push rax
push rbx
push rdx
call calclen
mov rax, sys_write
mov rbx, stdout
int 80h
pop rdx
pop rbx
pop rax
ret
exit:
mov rax, sys_exit
mov rbx, 0
int 80h
Thanks
EDIT: After changing how I make my syscalls as below it works fine. Thanks all for your help!
sys_write is now 1
sys_exit is now 60
stdout now goes in rdi, not rbx
the string to write is now set in rsi, not rcx
int 80h is replaced by syscall
I'm still running 32-bit hardware, so this is a wild asmed guess! As you probably know, 64-bit system call numbers are completely different, and "syscall" is used instead of int 80h. However int 80h and 32-bit system call numbers can still be used, with 64-bit registers truncated to 32-bit. Your tests indicate that this works with addresses in .data, but with a "stack address", it returns -14 (-EFAULT - bad address). The only thing I can think of is that truncating rcx to ecx results in a "bad address" if it's on the stack. I don't know where the stack is in 64-bit code. Does this make sense?
I'd try it with "proper" 64-bit system call numbers and registers and "syscall", and see if that helps.
Best,
Frank
As you said, you're using ELF64 as the compilation target. This is, unfortunately, your first mistake: the "old" int 80h system call interface is the 32-bit one. Even when it is reachable from a 64-bit task (via the kernel's 32-bit emulation), it uses the 32-bit call numbers and truncates pointer arguments to 32 bits, which is why a stack address fails with -EFAULT while a low .data address happens to work. Obviously, you could simply assemble your source as ELF32, but then you're going to lose all the advantages of running in 64-bit mode, namely the extra registers and 64-bit operations.
In order to make system calls in 64-bit tasks, the "new" system call interface must be used. The system call itself is done with the syscall instruction. The kernel destroys registers rcx and r11. The number of the system call is specified in rax, while the arguments are passed in rdi, rsi, rdx, r10, r8 and r9. Keep in mind that the syscall numbers are different from the 32-bit ones. You can find them in unistd_64.h, which usually lives in /usr/include/asm or wherever your distribution stores it.
