CPU_ZERO "undefined symbol" using pthread_setaffinity_np in NASM

CPU_ZERO "undefined symbol" using pthread_setaffinity_np in NASM - linux

I am using the pthreads library in NASM under Ubuntu 18.04. Thread creation works correctly, but I want to assign each thread to a separate core with pthread_setaffinity_np.
Following is the section of code I use to initialize threads. It compiles as written but at run time I get "undefined symbol: CPU_ZERO."
Using examples from C, I inserted %define _GNU_SOURCE at the top of the program, but I still get the undefined symbol CPU_ZERO error.
section .data align=16
; For thread scheduling:
cpuset: times 4 dq 0
section .text
label_0:
mov rdi,ThreadID ; ThreadCount
mov rsi,pthread_attr_t ; Thread Attributes
mov rdx,Test_fn ; Function Pointer
mov rcx,pthread_arg
call pthread_create wrt ..plt
; Set affinity mask
mov rdi,cpuset
call CPU_ZERO wrt ..plt
call pthread_self wrt ..plt
push rax
mov rdi,rax
mov rsi,cpuset
call CPU_SET wrt ..plt
pop rax
mov rdi,rax
mov rsi,32
mov rdx,cpuset
call pthread_setaffinity_np wrt ..plt
; check the result with pthread_getaffinity_np
mov rax,[tcounter]
add rax,8
mov [tcounter],rax
mov rbx,[Number_Of_Cores]
cmp rax,rbx
jl label_0
My question is: how do I use CPU_ZERO and CPU_SET in NASM (or any other assembly language; I can translate to NASM).
Thanks for any help.

CPU_ZERO and CPU_SET are C macros, not functions which you can call.
You'll have to roll your own function to perform equivalent zeroing / setting.

Those are CPP macros, not function. You can tell from the all-caps names. And from the fact the man page calls them macros.
As usual, the notes section of the man page has details that are useful for asm:
Since CPU sets are bit masks allocated in units of long words, the
actual number of CPUs in a dynamically allocated CPU set will be
rounded up to the next multiple of sizeof(unsigned long). An
application should consider the contents of these extra bits to be
undefined.
Notwithstanding the similarity in the names, note that the constant
CPU_SETSIZE indicates the number of CPUs in the cpu_set_t data type
(thus, it is effectively a count of the bits in the bit mask), while
the setsize argument of the CPU_*_S() macros is a size in bytes.
On my system (Arch Linux, glibc 2.29-4)
/usr/include/bits/cpu-set.h says
...
#define __CPU_SETSIZE 1024
#define __NCPUBITS (8 * sizeof (__cpu_mask))
...
typedef __CPU_MASK_TYPE __cpu_mask; // ultimately unsigned long via some other headers
...
typedef struct
{
__cpu_mask __bits[__CPU_SETSIZE / __NCPUBITS];
} cpu_set_t;
So a cpu_set_t is 1024 bits = 128 bytes = times 16 dq 0 or resq 16, at least on my system with that kernel config.
CPU_ZERO is free in your case; your statically-allocated cpu_set_t is statically zero-initialized. For some reason you put it in .data instead of .bss, so the executable will have to actually contain those zeros, but same difference.
If you did want to zero one on the stack, for example, rep stosd is one easy way, or xorps xmm0, xmm0 and 8x movups stores would also work.
Since high performance is not essential (CPU affinity-setting code probably only runs once), bts is a very convenient way to set bits in a bitmap (CPU_SET). With a memory destination, it takes a bit-index that can go outside the dword selected by the addressing mode. bts mem, reg is slow and microcoded (like 10 uops on Skylake), but nice for code size. bts mem, imm is only 3 uops, but or byte [mem + i/8], 1<<(i%8) is only 2 uops.
or also lets you set more than 1 bit at once, or more simply just mov store some bytes that contain the desired pattern of zeros and ones.
But TL:DR: it's just a bitmap, manipulate it however you like using asm, or even statically initialize it with non-zero values.

My approach to solving this problem is summarized in my answer at Reproduce these C types in assembly?.

Related

Is it possible for a program to read itself?

Theoretical question. But let's say I have written an assembly program. I have "labelx:" I want the program to read at this memory address and only this size and print to stdout.
Would it be something like
jmp labelx
And then would i then use the Write syscall , making sure to read from the next instruction from labelx:
mov rsi,rip
mov rdi,0x01
mov rdx,?
mov rax,0x01
syscall
to then output to stdout.
However how would I obtain the size to read itself? Especially if there is a
label after the code i want to read or code after. Would I have to manually
count the lines?
mov rdx,rip+(bytes*lines)
And then syscall with populated registers for the syscall to write to from rsi to rdi. Being stdout.
Is this Even possible? Would i have to use the read syscall first, as the write system call requires rsi to be allocated memory buffer. However I assumed .text is already allocated memory and is read only. Would I have to allocate onto the stack or heap or a static buffer first before write, if it's even possible in the first place?
I'm using NASM syntax btw. And pretty new to assembly. And just a question.

Yes, the .text section is just bytes in memory, no different from section .rodata where you might normally put msg: db "hello", 10. x86 is a Von Neumann architecture (not Harvard), so there's no distinction between code pointers and data pointers, other than what you choose to do with them. Use objdump -drwC -Mintel on a linked executable to see the machine-code bytes, or GDB's x command in a running process, to see bytes anywhere.
You can get the assembler to calculate the size by putting labels at the start/end of the part you want, and using mov edx, prog_end - prog_start in the code at the point where you want that size in RDX.
See How does $ work in NASM, exactly? for more about subtracting two labels (in the same section) to get a size. (Where $ is an implicit label at the start of the current line, although $ isn't likely what you want here.)
To get the current address into a register, you need a RIP-relative LEA, not mov, because RIP isn't a general-purpose register and there's no special form of mov that reads it.
here:
lea rsi, [rel here] ; with DEFAULT REL you could just use [here]
mov edi, 1 ; stdout fileno
mov edx, .end - here ; assemble-time constant size calculation
mov eax, 1 ; __NR_write
syscall
.end:
This is fully position-independent, unlike if you used mov esi, here. (How to load address of function or label into register)
The LEA could use lea rsi, [rel $] to assemble to the same machine-code bytes, but you want a label there so you can subtract them.
I optimized your MOV instructions to use 32-bit operand-size, implicitly zero-extending into the full 64-bit RDX and RAX. (And RDI, but write(int fd, void *buf, size_t len) only looks at EDI anyway for the file descriptor).
Note that you can write any bytes of any section; there's nothing special about having a block of code write itself. In the above example, put the start/end labels anywhere. (e.g. foo: and .end:, and mov edx, foo.end - foo taking advantage of how NASM local labels work, by appending to the previous non-local label, so you can reference them from somewhere else. Or just give them both non-dot names.)

Read file from a specific position in x86

Is it possible to start reading a file from a specific line or byte. Currently I use this code to read 4 bytes of a file:
section .data
filename db "file.txt", 0
section .bss
read_data resb 4
section .text
global _start
_start:
mov rax, SYS_OPEN
mov rdi, filename
mov rsi, O_RDONLY
mov rdx, 0
syscall
push rax
mov rdi, rax
mov rax, SYS_READ
mov rsi, read_data
mov rdx, 4
syscall
mov rax, SYS_CLOSE
pop rdi
syscall
This code always reads the first 4 bytes, but I want to start reading from other parts of the file, like the middle for example. What do I need to add or change?

A freshly-opened file descriptor starts at position = 0. If you keep reading from the same fd in a loop, you'll get successive chunks. (Use a larger buffer like 8kiB and loop over dwords in user-space, though, using the value that read returned as an upper limit! A system call is very expensive in CPU time.)
Is it possible to start reading a file from a specific line or byte.
Byte: yes
Line: no. In Unix/Linux, the kernel doesn't have an index of line-start byte offsets or any other line-oriented API. The line handling in stdio fgets for example is purely done in user-space. There have been some historical OSes with record-based files, but Unix files are flat arrays of bytes. (They can have holes, unwritten extents, and extended attributes... But the kernel APIs for the main file contents only operate with by byte offsets).
If you want to do lines, read a big block and loop forward until you've seen some number of newlines. If you're not there yet, read another block; repeat until you find the start and end of the line number you want, or you hit EOF. x86-64 can efficiently search 16 bytes at a time with pcmpeqb / pmovmskb / popcnt (popcnt requires SSE4.2 or the specific popcnt feature bit).
Or with just SSE2, or when optimizing for large blocks, with pcmpeqb / psadbw (against all-zero) to hsum bytes to qwords / paddd. Then check how many lines you went every so often with some scalar code. Or keep it simple and branch on finding the first newline in a SIMD vector.
Obviously the slow and simple option is a byte-at-a-time loop that counts '\n' characters - if you know how to do strchr with SSE2 it should be straightforward to vectorize that search using the above suggestions.
But if you only want some specific byte positions, you have two main options:
seek with lseek(2) before read(2) (see #Nicolae Natea's answer)
Use POSIX/Linux pread(2) to read from a specified offset, without moving the fd's file offset for future read calls. The Linux system call name is pread64 (__NR_pread64 equ 17 from asm/unistd_64.h)
ssize_t pread(int fd, void *buf, size_t count, off_t offset); The only difference from read is the offset arg, the 4th arg thus passed in R10 (not RCX like the user-space function calling convention). off_t is a 64-bit type simply passed in a single register in 64-bit code.
Other than the pread64 name in the .h, there's nothing special about the asm interface compared to the C interface, it follows the standard system-calling convention. (It exists since Linux 2.1.60 ; before that glibc's wrapper emulated it with lseek.)
There are other things you can do like mmap, or a preadv system call, but pread is most exactly what you're looking for if you have a known position you want to read from.

Before performing the read you should perform a lseek, so that the file position is updated.
so something along the lines:
mov rdi, rax ; fd
mov rax, SYS_LSEEK
mov rsi, <whatever offset you want>
mov rdx, 0 ; keep 0 if the offset should be from the begining of the file
syscall
note: RDI will still hold the same fd value after a syscall so you don't need extra save/restore for the fd across lseek / read / close.
Tip:
It might be easier to write the code in c and compile it with gcc -g -S -fverbose-asm -Og -c main.c and then look at main.s. (How to remove "noise" from GCC/clang assembly output?). But that will only show the compiler making calls to libc wrapper functions, unless you use inline system call macros like MUSL libc provides.

CMPXCHG is not atomic in this NASM program

I have implemented a spinlock in NASM-64 on Windows. I'm using a spinlock to block a shared memory buffer so that each core will write to the shared buffer one core at a time in order (core 0 first, core 1 second, core 2 third, etc.) As far as I know, a mutex or semaphore will not allow me to do the buffer writes in core order, and a spinlock is preferable because it doesn't use an OS call.
Here is the code section at issue. This is not a complete example because the problem is in this small section of a much larger assembly program. I used cmpxchg for atomicity.
On entry to this section, rax contains the core number and rbx contains a memory variable called spinlock_core, which is set to zero (first core) on entry to this section. After each core is finished, spinlock_core is incremented to the next core number.
mov rdi,Test_Array
movq rbx,xmm11
mov [rdi+rax],rbx ; rax contains the core number offset (0, 8, 16, 24)
push rax
; Spin Lock
spin_lock_01:
lock cmpxchg [spinlock_core],rax ; spinlock_core is set to zero on first entry
jnz spin_lock_01
; To test the result:
mov rdi,Test_Array
mov [rdi+rax+32],rax
jmp out_of_here
The results of this test are in Test_Array, which is populated with the number of bytes to write for each core. On return it contains:
40, 40, 40, 16, 0, 8, 16, 24
showing that cores 0-2 each have 40 bytes to write and core 3 has 16 bytes to write. However, the last four elements of Test_Array contain the core offset for each of the four cores. If the spinlock was working correctly (cmpxchg rax,rbx), the last four elements should all contain zero, showing that only the first core was allowed through. But it shows that all four cores were allowed through.
I assume that my cmpxchg is not atomic, and that's why other cores leak through -- they should only be allowed in when spinlock_core is incremented to the next core, but that doesn't happen before we exit with jmp out_of_here. According to the docs, cmpxchg should be preceded by a "lock" prefix, as in lock cmpxchg rax,rbx, but when I do that the NASM assembler says "warning: instruction is not lockable [-w+lock]." Felix Cloutier's site says the lock prefix is only needed if a memory operand is involved, but when I write "cmpxchg rax,[spinlock_core]" I get "invalid combination of opcodes and operands."
To summarize, my questions are: why is cmpxchg not atomic as written above, and why does the NASM assembler not allow the use of the lock prefix?
There are a number of detailed posts on Stack Overflow and elsewhere on the issue of atomicity but I haven't found any that address this specific issue.
Thanks for any help.

Here is how I solved this, with the help of comments below from #Peter Cordes and #prl. The changes are (1) use bx, not ax as a destination register and (2) use only the low 8-bit registers ax and bx, not rax and rbx.
xor rbx,rbx
spin_lock_01:
pop rax ; the core number is stored in the stack
mov bl,al ; put core number into bx
mov rcx,rax ; preserve core number for later use
push rax ; push core number back to stack
lock cmpxchg [spinlock_core],bl ; spinlock_core is set to zero on first entry
jnz spin_lock_01
; now test it
mov rdi,Test_Array
mov rbx,[spinlock_core]
mov rdx,[order_of_execution]
add rdx,1
mov [order_of_execution],rdx
mov [rdi+rcx+0],rcx
mov [rdi+rcx+32],rdx
add rbx,8
mov [spinlock_core],rbx
jmp out_of_here
The output in Test_Array shows as: 0,8,16,24,1,2,3,4. That shows that each of the cores is passed through in core order (1,2,3,4). The memory location spinlock_core is incremented by each core when it completes, signalling the next core.

x86-64: int 0x80 will not print from memory [duplicate]

int 0x80 on Linux always invokes the 32-bit ABI, regardless of what mode it's called from: args in ebx, ecx, ... and syscall numbers from /usr/include/asm/unistd_32.h. (Or crashes on 64-bit kernels compiled without CONFIG_IA32_EMULATION).
64-bit code should use syscall, with call numbers from /usr/include/asm/unistd_64.h, and args in rdi, rsi, etc. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64. If your question was marked a duplicate of this, see that link for details on how you should make system calls in 32 or 64-bit code. If you want to understand what exactly happened, keep reading.
(For an example of 32-bit vs. 64-bit sys_write, see Using interrupt 0x80 on 64-bit Linux)
syscall system calls are faster than int 0x80 system calls, so use native 64-bit syscall unless you're writing polyglot machine code that runs the same when executed as 32 or 64 bit. (sysenter always returns in 32-bit mode, so it's not useful from 64-bit userspace, although it is a valid x86-64 instruction.)
Related: The Definitive Guide to Linux System Calls (on x86) for how to make int 0x80 or sysenter 32-bit system calls, or syscall 64-bit system calls, or calling the vDSO for "virtual" system calls like gettimeofday. Plus background on what system calls are all about.
Using int 0x80 makes it possible to write something that will assemble in 32 or 64-bit mode, so it's handy for an exit_group() at the end of a microbenchmark or something.
Current PDFs of the official i386 and x86-64 System V psABI documents that standardize function and syscall calling conventions are linked from https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI.
See the x86 tag wiki for beginner guides, x86 manuals, official documentation, and performance optimization guides / resources.
But since people keep posting questions with code that uses int 0x80 in 64-bit code, or accidentally building 64-bit binaries from source written for 32-bit, I wonder what exactly does happen on current Linux?
Does int 0x80 save/restore all the 64-bit registers? Does it truncate any registers to 32-bit? What happens if you pass pointer args that have non-zero upper halves?
Does it work if you pass it 32-bit pointers?

TL:DR: int 0x80 works when used correctly, as long as any pointers fit in 32 bits (stack pointers don't fit). But beware that strace decodes it wrong unless you have a very recent strace + kernel.
int 0x80 zeros r8-r11 for reasons, and preserves everything else. Use it exactly like you would in 32-bit code, with the 32-bit call numbers. (Or better, don't use it!)
Not all systems even support int 0x80: The Windows Subsystem for Linux version 1 (WSL1) is strictly 64-bit only: int 0x80 doesn't work at all. It's also possible to build Linux kernels without IA-32 emulation either. (No support for 32-bit executables, no support for 32-bit system calls). See this re: making sure your WSL is actually WSL2 (which uses an actual Linux kernel in a VM.)
The details: what's saved/restored, which parts of which regs the kernel uses
int 0x80 uses eax (not the full rax) as the system-call number, dispatching to the same table of function-pointers that 32-bit user-space int 0x80 uses. (These pointers are to sys_whatever implementations or wrappers for the native 64-bit implementation inside the kernel. System calls are really function calls across the user/kernel boundary.)
Only the low 32 bits of arg registers are passed. The upper halves of rbx-rbp are preserved, but ignored by int 0x80 system calls. Note that passing a bad pointer to a system call doesn't result in SIGSEGV; instead the system call returns -EFAULT. If you don't check error return values (with a debugger or tracing tool), it will appear to silently fail.
All registers (except eax of course) are saved/restored (including RFLAGS, and the upper 32 of integer regs), except that r8-r11 are zeroed. r12-r15 are call-preserved in the x86-64 SysV ABI's function calling convention, so the registers that get zeroed by int 0x80 in 64-bit are the call-clobbered subset of the "new" registers that AMD64 added.
This behaviour has been preserved over some internal changes to how register-saving was implemented inside the kernel, and comments in the kernel mention that it's usable from 64-bit, so this ABI is probably stable. (I.e. you can count on r8-r11 being zeroed, and everything else being preserved.)
The return value is sign-extended to fill 64-bit rax. (Linux declares 32-bit sys_ functions as returning signed long.) This means that pointer return values (like from void *mmap()) need to be zero-extended before use in 64-bit addressing modes
Unlike sysenter, it preserves the original value of cs, so it returns to user-space in the same mode that it was called in. (Using sysenter results in the kernel setting cs to $__USER32_CS, which selects a descriptor for a 32-bit code segment.)
Older strace decodes int 0x80 incorrectly for 64-bit processes. It decodes as if the process had used syscall instead of int 0x80. This can be very confusing. e.g. strace prints write(0, NULL, 12 <unfinished ... exit status 1> for eax=1 / int $0x80, which is actually _exit(ebx), not write(rdi, rsi, rdx).
I don't know the exact version where the PTRACE_GET_SYSCALL_INFO feature was added, but Linux kernel 5.5 / strace 5.5 handle it. It misleadingly says the process "runs in 32-bit mode" but does decode correctly. (Example).
int 0x80 works as long as all arguments (including pointers) fit in the low 32 of a register. This is the case for static code and data in the default code model ("small") in the x86-64 SysV ABI. (Section 3.5.1
: all symbols are known to be located in the virtual addresses in the range 0x00000000 to 0x7effffff, so you can do stuff like mov edi, hello (AT&T mov $hello, %edi) to get a pointer into a register with a 5 byte instruction).
But this is not the case for position-independent executables, which many Linux distros now configure gcc to make by default (and they enable ASLR for executables). For example, I compiled a hello.c on Arch Linux, and set a breakpoint at the start of main. The string constant passed to puts was at 0x555555554724, so a 32-bit ABI write system call would not work. (GDB disables ASLR by default, so you always see the same address from run to run, if you run from within GDB.)
Linux puts the stack near the "gap" between the upper and lower ranges of canonical addresses, i.e. with the top of the stack at 2^48-1. (Or somewhere random, with ASLR enabled). So rsp on entry to _start in a typical statically-linked executable is something like 0x7fffffffe550, depending on size of env vars and args. Truncating this pointer to esp does not point to any valid memory, so system calls with pointer inputs will typically return -EFAULT if you try to pass a truncated stack pointer. (And your program will crash if you truncate rsp to esp and then do anything with the stack, e.g. if you built 32-bit asm source as a 64-bit executable.)
How it works in the kernel:
In the Linux source code, arch/x86/entry/entry_64_compat.S defines
ENTRY(entry_INT80_compat). Both 32 and 64-bit processes use the same entry point when they execute int 0x80.
entry_64.S is defines native entry points for a 64-bit kernel, which includes interrupt / fault handlers and syscall native system calls from long mode (aka 64-bit mode) processes.
entry_64_compat.S defines system-call entry-points from compat mode into a 64-bit kernel, plus the special case of int 0x80 in a 64-bit process. (sysenter in a 64-bit process may go to that entry point as well, but it pushes $__USER32_CS, so it will always return in 32-bit mode.) There's a 32-bit version of the syscall instruction, supported on AMD CPUs, and Linux supports it too for fast 32-bit system calls from 32-bit processes.
I guess a possible use-case for int 0x80 in 64-bit mode is if you wanted to use a custom code-segment descriptor that you installed with modify_ldt. int 0x80 pushes segment registers itself for use with iret, and Linux always returns from int 0x80 system calls via iret. The 64-bit syscall entry point sets pt_regs->cs and ->ss to constants, __USER_CS and __USER_DS. (It's normal that SS and DS use the same segment descriptors. Permission differences are done with paging, not segmentation.)
entry_32.S defines entry points into a 32-bit kernel, and is not involved at all.
The int 0x80 entry point in Linux 4.12's entry_64_compat.S:
/*
* 32-bit legacy system call entry.
*
* 32-bit x86 Linux system calls traditionally used the INT $0x80
* instruction. INT $0x80 lands here.
*
* This entry point can be used by 32-bit and 64-bit programs to perform
* 32-bit system calls. Instances of INT $0x80 can be found inline in
* various programs and libraries. It is also used by the vDSO's
* __kernel_vsyscall fallback for hardware that doesn't support a faster
* entry method. Restarted 32-bit system calls also fall back to INT
* $0x80 regardless of what instruction was originally used to do the
* system call.
*
* This is considered a slow path. It is not used by most libc
* implementations on modern hardware except during process startup.
...
*/
ENTRY(entry_INT80_compat)
... (see the github URL for the full source)
The code zero-extends eax into rax, then pushes all the registers onto the kernel stack to form a struct pt_regs. This is where it will restore from when the system call returns. It's in a standard layout for saved user-space registers (for any entry point), so ptrace from other process (like gdb or strace) will read and/or write that memory if they use ptrace while this process is inside a system call. (ptrace modification of registers is one thing that makes return paths complicated for the other entry points. See comments.)
But it pushes $0 instead of r8/r9/r10/r11. (sysenter and AMD syscall32 entry points store zeros for r8-r15.)
I think this zeroing of r8-r11 is to match historical behaviour. Before the Set up full pt_regs for all compat syscalls commit, the entry point only saved the C call-clobbered registers. It dispatched directly from asm with call *ia32_sys_call_table(, %rax, 8), and those functions follow the calling convention, so they preserve rbx, rbp, rsp, and r12-r15. Zeroing r8-r11 instead of leaving them undefined was to avoid info leaks from a 64-bit kernel to 32-bit user-space (which could far jmp to a 64-bit code segment to read anything the kernel left there).
The current implementation (Linux 4.12) dispatches 32-bit-ABI system calls from C, reloading the saved ebx, ecx, etc. from pt_regs. (64-bit native system calls dispatch directly from asm, with only a mov %r10, %rcx needed to account for the small difference in calling convention between functions and syscall. Unfortunately it can't always use sysret, because CPU bugs make it unsafe with non-canonical addresses. It does try to, so the fast-path is pretty damn fast, although syscall itself still takes tens of cycles.)
Anyway, in current Linux, 32-bit syscalls (including int 0x80 from 64-bit) eventually end up indo_syscall_32_irqs_on(struct pt_regs *regs). It dispatches to a function pointer ia32_sys_call_table, with 6 zero-extended args. This maybe avoids needing a wrapper around the 64-bit native syscall function in more cases to preserve that behaviour, so more of the ia32 table entries can be the native system call implementation directly.
Linux 4.12 arch/x86/entry/common.c
if (likely(nr < IA32_NR_syscalls)) {
/*
* It's possible that a 32-bit syscall implementation
* takes a 64-bit parameter but nonetheless assumes that
* the high bits are zero. Make sure we zero-extend all
* of the args.
*/
regs->ax = ia32_sys_call_table[nr](
(unsigned int)regs->bx, (unsigned int)regs->cx,
(unsigned int)regs->dx, (unsigned int)regs->si,
(unsigned int)regs->di, (unsigned int)regs->bp);
}
syscall_return_slowpath(regs);
In older versions of Linux that dispatch 32-bit system calls from asm (like 64-bit still did until 4.151), the int80 entry point itself puts args in the right registers with mov and xchg instructions, using 32-bit registers. It even uses mov %edx,%edx to zero-extend EDX into RDX (because arg3 happen to use the same register in both conventions). code here. This code is duplicated in the sysenter and syscall32 entry points.
Footnote 1: Linux 4.15 (I think) introduced Spectre / Meltdown mitigations, and a major revamp of the entry points that made them them a trampoline for the meltdown case. It also sanitized the incoming registers to avoid user-space values other than actual args being in registers during the call (when some Spectre gadget might run), by storing them, zeroing everything, then calling to a C wrapper that reloads just the right widths of args from the struct saved on entry.
I'm planning to leave this answer describing the much simpler mechanism because the conceptually useful part here is that the kernel side of a syscall involves using EAX or RAX as an index into a table of function pointers, with other incoming register values copied going to the places where the calling convention wants args to go. i.e. syscall is just a way to make a call into the kernel, to its dispatch code.
Simple example / test program:
I wrote a simple Hello World (in NASM syntax) which sets all registers to have non-zero upper halves, then makes two write() system calls with int 0x80, one with a pointer to a string in .rodata (succeeds), the second with a pointer to the stack (fails with -EFAULT).
Then it uses the native 64-bit syscall ABI to write() the chars from the stack (64-bit pointer), and again to exit.
So all of these examples are using the ABIs correctly, except for the 2nd int 0x80 which tries to pass a 64-bit pointer and has it truncated.
If you built it as a position-independent executable, the first one would fail too. (You'd have to use a RIP-relative lea instead of mov to get the address of hello: into a register.)
I used gdb, but use whatever debugger you prefer. Use one that highlights changed registers since the last single-step. gdbgui works well for debugging asm source, but is not great for disassembly. Still, it does have a register pane that works well for integer regs at least, and it worked great on this example.
See the inline ;;; comments describing how register are changed by system calls
global _start
_start:
mov rax, 0x123456789abcdef
mov rbx, rax
mov rcx, rax
mov rdx, rax
mov rsi, rax
mov rdi, rax
mov rbp, rax
mov r8, rax
mov r9, rax
mov r10, rax
mov r11, rax
mov r12, rax
mov r13, rax
mov r14, rax
mov r15, rax
;; 32-bit ABI
mov rax, 0xffffffff00000004 ; high garbage + __NR_write (unistd_32.h)
mov rbx, 0xffffffff00000001 ; high garbage + fd=1
mov rcx, 0xffffffff00000000 + .hello
mov rdx, 0xffffffff00000000 + .hellolen
;std
after_setup: ; set a breakpoint here
int 0x80 ; write(1, hello, hellolen); 32-bit ABI
;; succeeds, writing to stdout
;;; changes to registers: r8-r11 = 0. rax=14 = return value
; ebx still = 1 = STDOUT_FILENO
push 'bye' + (0xa<<(3*8))
mov rcx, rsp ; rcx = 64-bit pointer that won't work if truncated
mov edx, 4
mov eax, 4 ; __NR_write (unistd_32.h)
int 0x80 ; write(ebx=1, ecx=truncated pointer, edx=4); 32-bit
;; fails, nothing printed
;;; changes to registers: rax=-14 = -EFAULT (from /usr/include/asm-generic/errno-base.h)
mov r10, rax ; save return value as exit status
mov r8, r15
mov r9, r15
mov r11, r15 ; make these regs non-zero again
;; 64-bit ABI
mov eax, 1 ; __NR_write (unistd_64.h)
mov edi, 1
mov rsi, rsp
mov edx, 4
syscall ; write(edi=1, rsi='bye\n' on the stack, rdx=4); 64-bit
;; succeeds: writes to stdout and returns 4 in rax
;;; changes to registers: rax=4 = length return value
;;; rcx = 0x400112 = RIP. r11 = 0x302 = eflags with an extra bit set.
;;; (This is not a coincidence, it's how sysret works. But don't depend on it, since iret could leave something else)
mov edi, r10d
;xor edi,edi
mov eax, 60 ; __NR_exit (unistd_64.h)
syscall ; _exit(edi = first int 0x80 result); 64-bit
;; succeeds, exit status = low byte of first int 0x80 result = 14
section .rodata
_start.hello: db "Hello World!", 0xa, 0
_start.hellolen equ $ - _start.hello
Build it into a 64-bit static binary with
yasm -felf64 -Worphan-labels -gdwarf2 abi32-from-64.asm
ld -o abi32-from-64 abi32-from-64.o
Run gdb ./abi32-from-64. In gdb, run set disassembly-flavor intel and layout reg if you don't have that in your ~/.gdbinit already. (GAS .intel_syntax is like MASM, not NASM, but they're close enough that it's easy to read if you like NASM syntax.)
(gdb) set disassembly-flavor intel
(gdb) layout reg
(gdb) b after_setup
(gdb) r
(gdb) si # step instruction
press return to repeat the last command, keep stepping
Press control-L when gdb's TUI mode gets messed up. This happens easily, even when programs don't print to stdout themselves.

Can't understand assembly mov instruction between register and a variable

I am using NASM assembler on linux 64 bit.
There is something with variables and registers I can't understand.
I create a variable named "msg":
msg db "hello, world"
Now when I want to write to the stdout I move the msg to rsi register, however I don't understand the mov instruction bitwise ... the rsi register consists of 64 bit , while the msg variable has 12 symbols which is 8 bits each , which means the msg variable has a size of 12 * 8 bits , which is greater than 64 bits obviously.
So how is this even possible to make an instruction like:
mov rsi, msg , without overflowing the memory allocated for rsi.
Or does the rsi register contain the memory location of the first symbol of the string and after writing 1 symbol it changes to the memory location of the next symbol?
Sorry if I wrote complete nonsense, I'm new to assembly and i just can't get the grasp of it for a while.

In NASM syntax (unlike MASM syntax) mov rsi, symbol puts the address of the symbol into RSI. (Inefficiently with a 64-bit absolute immediate; use a RIP-relative LEA or mov esi, symbol instead. How to load address of function or label into register in GNU Assembler)
mov rsi, [symbol] would load 8 bytes starting at symbol. It's up to you to choose a useful place to load 8 bytes from when you write an instruction like that.
mov rsi, msg ; rsi = address of msg. Use lea rsi, [rel msg] instead
movzx eax, byte [rsi+1] ; rax = 'e' (upper 7 bytes zeroed)
mov edx, [msg+6] ; rdx = ' wor' (upper 4 bytes zeroed)
Note that you can use mov esi, msg because symbol addresses always fit in 32 bits (in the default "small" code model, where all static code/data goes in the low 2GB of virtual address space). NASM makes this optimization for you with assemble-time constants (like mov rax, 1), but probably it can't with link-time constants. Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?
and after writing 1 symbol it changes to the memory location of the next symbol?
No, if you want that you have to inc rsi. There is no magic. Pointers are just integers that you manipulate like any other integers, and strings are just bytes in memory.
Accessing registers doesn't magically modify them.
There are instructions like lodsb and pop that load from memory and increment a pointer (rsi or rsp respectively), but x86 doesn't have any pre/post-increment/decrement addressing modes, so you can't get that behaviour with mov even if you want it. Use add/sub or inc/dec.

Disclaimer: I'm not familiar with the flavor of assembly that you're dealing with, so the following is more general. The particular flavor may have more features than what I'm used to. In general, assembly deals with single byte/word entities where the size depends on the processor. I've done quite a bit of work on 8 and 16-bit processors, so that is where my answer is coming from.
General statements about Assembly:
Assembly is just like a high level language, except you have to handle a lot more of the details. So if you're used to some operation in say C, you can start there and then break the operation down even further.
For instance, if you have declared two variables that you want to add, that's pretty easy in C:
x = a + b;
In assembly, you have to break that down further:
mov R1, a * get value from a into register R1
mov R2, b * get value from b into register R2
add R1,R2 * perform the addition (typically goes into a particular location I'll call it the accumulator
mov x, acc * store the result of the addition from the accumulator into x
Depending on the flavor of assembly and the processor, you may be able to directly refer to variables in the addition instruction, but like I said I would have to look at the specific flavor you're working with.
Comments on your specific question:
If you have a string of characters, then you would have to move each one individually using a loop of some sort. I would set up a register to contain the starting address of your string, and then increment that register after each character is moved. It acts like a pointer in C. You will need to have some sort of indication for the termination of the string or another value that tells the size of the string, so you know when to stop.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string