Which registers to use as temporaries when writing AMD64 SysV assembly? - linux

I'm implementing a function using cpuid in assembly according to AMD64 SysV ABI. I need 2 temporary registers to be used in the function itself: the first one to accumulate return value and the second one as a counter.
My function currently looks as:
;zero argument function
some_cpuid_fun:
push rbx
xor r10d, r10d ;counter initial value
xor r11d, r11d ;return value accumulator initial value
some_cpuid_fun_loop:
;...
cpuid
;update r10d and r11d according to the result
mov eax, r11d
pop rbx
ret
Since cpuid clobbers eax, ebx, ecx, edx I cannot use them across different cpuid executions. As documented in the AMD64 SysV ABI:
r10 temporary register, used for passing a function’s
static chain pointer
r11 temporary register
There is only one strictly temporary register r11, r10 seems to have a different purpose (and its usage as a loop counter is not one, I obviously did not pass any static chain pointers).
QUESTION: Is the some_cpuid_fun function implementation AMD64 SysV ABI compatible? If no how to rewrite it to stay compatible with the ABI?

After you have identified all the registers used for the arguments, none in this case, all you have to care about is if a register is volatile (not preserved across calls) or not.
In short, looks at the last column of the Figure 3.4 (where the usage for r10 come from): It is not preserved, so you can use it without restoring it.
The Usage column only tells you where to look for the arguments if expected and where to put the return value.
If you don't take a static chain pointer in input, you can overwrite r10 as soon as needed.
So yes, using r10 as a temporary register is ABI compatible.
For reference:
This subsection discusses usage of each register. Registers %rbp, %rbx and
%r12 through %r15 “belong” to the calling function and the called function is
required to preserve their values. In other words, a called function must preserve
these registers’ values for its caller. Remaining registers “belong” to the called
function. 5 If a calling function wants to preserve such a register value across a
function call, it must save the value in its local stack frame.

Related

Understanding Linux x86_64 Syscall Implementation in NASM

I'm using the linux kernel as the basis for my 'OS' (which is just a game). Obviously, I use system calls to interact with the kernel, e.g. input, sound, graphics, etc. As I don't have a linker on this system to use pre-defined wrapper in <sys/syscall.h>, I use this syscall wrapper for C in NASM asm:
global _syscall
_syscall:
mov rax, rdi
mov rdi, rsi
mov rsi, rdx
mov rdx, rcx
mov r10, r8
mov r8, r9
mov r9, [rsp + 8]
syscall
ret
I understand the registers used based on System V calling convention. However, I don't understand the last line mov r9, [rsp + 8]. I feel like it has something to do with return address, but I'm not sure.
The 7th arg is passed on the stack in x86-64 System V.
This is taking the first C arg and putting it in RAX, then copying the next 6 C args to the 6 arg-passing registers of the kernel's system-call calling convention. (Like functions, but with R10 instead of RCX).
The only reason the craptastic glibc syscall() function exists / is written that way is because there's no way to tell C compilers about a custom calling convention where an arg is also passed in RAX. That wrapper makes it look just like any other C function with a prototype.
It's fine for messing around with new system calls, but as you noted it's inefficient. If you wanted something better in C, use inline asm macros for your ISA, e.g. https://github.com/linux-on-ibm-z/linux-syscall-support/blob/master/linux_syscall_support.h. Inline asm is hard, and historically some syscall1 / syscall2 (per number of args) macros have been missing things like a "memory" clobber to tell the compiler that pointed-to memory could also be an input or output. That github project is safe and has code for various ISAs. (Some missed optimizations, like could use a dummy input operand instead of a full "memory" clobber... But that's irrelevant to asm)
Of course, you can do much better if you're writing in asm:
Just use the syscall instruction directly with args in the right registers (RDI, RSI, RDX, R10, R8, R9) instead of call _syscall with the function-calling convention. That's strictly worse than just inlining the syscall instruction: With syscall you know that registers are unmodified except for RAX (return value) and RCX/R11 (syscall itself uses them to save RIP and RFLAGS before kernel code runs.) And it would take just as much code to get args into registers for a function call as it would for syscall.
If you do want a wrapper function at all (e.g. to cmp rax, -4095 / jae handle_syscall_error afterwards and maybe set errno), use the same calling convention for it as the kernel expects, so the first instruction can be syscall, not all that stupid shuffling of args over by 1.
Functions in asm (that you only need to call from asm) can use whatever calling convention is convenient. It's a good idea to use a good standard one most of the time, but any "obviously special" function can certainly use a special convention.

How to comprehend the flow of this assembly code

I can' t understand how this works.
Here's a part of main() program disassembled by objdump and written in intel notation
0000000000000530 <main>:
530: lea rdx,[rip+0x37d] # 8b4 <_IO_stdin_used+0x4>
537: mov DWORD PTR [rsp-0xc],0x0
53f: movabs r10,0xedd5a792ef95fa9e
549: mov r9d,0xffffffcc
54f: nop
550: mov eax,DWORD PTR [rsp-0xc]
554: cmp eax,0xd
557: ja 57c <main+0x4c>
559: movsxd rax,DWORD PTR [rdx+rax*4]
55d: add rax,rdx
560: jmp rax
The rodata section dump:
.rodata
08b0 01000200 ecfdffff d4fdffff bcfdffff ................
08c0 9cfdffff 7cfdffff 6cfdffff 4cfdffff ....|...l...L...
08d0 3cfdffff 2cfdffff 0cfdffff ecfcffff <...,...........
08e0 d4fcffff b4fcffff 0cfeffff ............
In 530, rip is [537] so [rdx] = [537 + 37d] = 8b4.
First question is the value of rdx is how large? Is the valueis ec, or ecfdffff or something else? If it has DWORD, I can understand that has 'ecfdffff' (even this is wrong too?:() but this program don't declare it. How can I judge the value?
Then the program continues.
In 559, rax is first appeared.
The second question is this rax can interpret as a part of eax and in this time is the rax = 0? If rax is 0, in 559 means rax = DWORD[rdx] and the value of rax become ecfdffff and next [55d] do rax += rdx, and I think this value can't jamp. There must be something wrong, so tell me where, or how i make any wrongs.
I think I'll diverge from what Peter discussed (he provides good information) and get to the heart of some issues I think are causing you problems. When I first glanced at this question I assumed that the code was likely compiler generated and the jmp rax was likely the result of some control flow statement. The most likely way to generate such a code sequence is via a C switch. It isn't uncommon for a switch statement to be made of a jump table to say what code should execute depending on the control variable. As an example: the control variable for switch(a) is a.
This all made sense to me, and I wrote up a number of comments (now deleted) that ultimately resulted in bizarre memory addresses that jmp rax would go to. I had errands to run but when I returned I had the aha moment that you may have had the same confusion I did. This output from objdump using the -s option appeared as:
.rodata
08b0 01000200 ecfdffff d4fdffff bcfdffff ................
08c0 9cfdffff 7cfdffff 6cfdffff 4cfdffff ....|...l...L...
08d0 3cfdffff 2cfdffff 0cfdffff ecfcffff <...,...........
08e0 d4fcffff b4fcffff 0cfeffff ............
One of your questions seems to be about what values get loaded here. I never used the -s option to look at data in the sections and was unaware that although the dump splits the data out in groups of 4 bytes (32-bit values) they are shown in byte order as it appears in memory. I had at first assumed the output was displaying these values from Most Significant Byte to Least significant byte and objdump -s had done the conversion. That is not the case.
You have to manually reverse the bytes of each group of 4 bytes to get the real value that would be read from memory into a register.
ecfdffff in the output actually means ec fd ff ff. As a DWORD value (32-bit) you need to reverse the bytes to get the HEX value as you would expect when loaded from memory. ec fd ff ff reversed would be ff ff fd ec or the 32-bit value 0xfffffdec. Once you realize that then this makes a lot more sense. If you make this same adjustment for all the data in that table you'd get:
.rodata
08b0: 0x00020001 0xfffffdec 0xfffffdd4 0xfffffdbc
08c0: 0xfffffd9c 0xfffffd7c 0xfffffd6c 0xfffffd4c
08d0: 0xfffffd3c 0xfffffd2c 0xfffffd0c 0xfffffcec
08e0: 0xfffffcd4 0xfffffcb4 0xfffffe0c
Now if we look at the code you have it starts with:
530: lea rdx,[rip+0x37d] # 8b4 <_IO_stdin_used+0x4>
This doesn't load data from memory, it is computing the effective address of some data and places the address in RDX. The disassembly from OBJDUMP is displaying the code and data with the view that it is loaded in memory starting at 0x000000000000. When it is loaded into memory it may be placed at some other address. GCC in this case is producing position independent code (PIC). It is generated in such a way that the first byte of the program can start at an arbitrary address in memory.
The # 8b4 comment is the part we are concerned about (you can ignore the information after that). The disassembly is saying if the program was loaded at 0x0000000000000000 then the value loaded into RDX would be 0x8b4. How was that arrived at? This instruction starts at 0x530 but with RIP relative addressing the RIP (instruction pointer) is relative to the address just after the current instruction. The address the disassembler used was 0x537 (the byte after the current instruction is the address of the first byte of the next instruction). The instruction adds 0x37d to RIP and gets 0x537+0x37d=0x8b4. The address 0x8b4 happens to be in the .rodata section which you are given a dump of (as discussed above).
We now know that RDX contains the base of some data. The jmp rax suggests this is likely going to be a table of 32-bit values that are used to determine what memory location to jump to depending on the value in the control variable of a switch statement.
This statement appears to be storing the value 0 as a 32-bit value on the stack.
537: mov DWORD PTR [rsp-0xc],0x0
These appear to be variables that the compiler chose to store in registers (rather than memory).
53f: movabs r10,0xedd5a792ef95fa9e
549: mov r9d,0xffffffcc
R10 is being loaded with the 64-bit value 0xedd5a792ef95fa9e. R9D is the lower 32-bits of the 64-bit R9 register.The value 0xffffffcc is being loaded into the lower 32-bits of R9 but there is something else occurring. In 64-bit mode if the destination of an instruction is a 32-bit register the CPU automatically zero extends the value into the upper 32-bits of the register. The CPU is guaranteeing us that the upper 32-bits are zeroed.
This is a NOP and doesn't do anything except align the next instruction to memory address 0x550. 0x550 is a value that is 16-byte aligned. This has some value and may hint that the instruction at 0x550 may be the first instruction at the top of a loop. An optimizer may place NOPs into the code to align the first instruction at the top of a loop to a 16-byte aligned address in memory for performance reasons:
54f: nop
Earlier the 32-bit stack based variable at rsp-0xc was set to zero. This reads the value 0 from memory as a 32-bit value and stores it in EAX. Since EAX is a 32-bit register being used as the destination for the instruction the CPU automatically filled the upper 32-bits of RAX to 0. So all of RAX is zero.
550: mov eax,DWORD PTR [rsp-0xc]
EAX is now being compared to 0xd. If it is above (ja) it goes to the instruction at 0x57c.
554: cmp eax,0xd
557: ja 57c <main+0x4c>
We then have this instruction:
559: movsxd rax,DWORD PTR [rdx+rax*4]
The movsxd is an instruction that will take a 32-bit source operand (in this case the 32-bit value at memory address RDX+RAX*4) load it into the bottom 32-bits of RAX and then sign extend the value into the upper 32-bits of RAX. Effectively if the 32-bit value is negative (the most significant bit is 1) the upper 32-bits of RAX will be set to 1. If the 32-bit value is not negative the upper 32-bits of RAX will be set to 0.
When this code is first encountered RDX contains the base of some table at 0x8b4 from the beginning of the program loaded in memory. RAX is set to 0. Effectively the first 32-bits in the table are copied to RAX and sign extended. As seen earlier the value at offset 0xb84 is 0xfffffdec. That 32-bit value is negative so RAX contains 0xfffffffffffffdec.
Now to the meat of the situation:
55d: add rax,rdx
560: jmp rax
RDX still holds the address to the beginning of a table in memory. RAX is being added to that value and stored back in RAX (RAX = RAX+RDX). We then JMP to the address stored in RAX. So this code all seems to suggest we have a JUMP table with 32-bit values that we are using to determine where we should go. So then the obvious question. What are the 32-bit values in the table? The 32-bit values are the difference between the beginning of the table and the address of the instruction we want to jump to.
We know the table is 0x8b4 from the location our program is loaded in memory. The C compiler told the linker to compute the difference between 0x8b4 and the address where the instruction we want to execute resides. If the program had been loaded into memory at 0x0000000000000000 (hypothetically), RAX = RAX+RDX would have resulted in RAX being 0xfffffffffffffdec + 0x8b4 = 0x00000000000006a0. We then use jmp rax to jump to 0x6a0. You didn't show the entire dump of memory but there is going to be code at 0x6a0 that will execute when the value passed to the switch statement is 0. Each 32-bit value in the JUMP table will be a similar offset to the code that will execute depending on the control variable in the switch statement. If we add 0x8b4 to all the entries in the table we get:
08b0: 0x000006a0 0x00000688 0x00000670
08c0: 0x00000650 0x00000630 0x00000620 0x00000600
08d0: 0x000005F0 0x000005e0 0x000005c0 0x000005a0
08e0: 0x00000588 0x00000568 0x000006c0
You should find that in the code you haven't provided us that these addresses coincide with code that appears after the jmp rax.
Given that the memory address 0x550 was aligned, I have a hunch that this switch statement is inside a loop that keeps executing as some kind of state machine until the proper conditions are met for it to exit. Likely the value of the control variable used for the switch statement is changed by the code in the switch statement itself. Each time the switch statement is run the control variable has a different value and will do something different.
The control variable for the switch statement was originally checked for the value being above 0x0d (13). The table starting at 0x8b4 in the .rodata section has 14 entries. One can assume the switch statement probably has 14 different states (cases).
but this program don't declare it
You're looking at disassembly of machine code + data. It's all just bytes in memory. Any labels the disassembler does manage to show are ones that got left in the executable's symbol table. They're irrelevant to how the CPU runs the machine code.
(The ELF program headers tell the OS's program loader how to map it into memory, and where to jump to as an entry point. This has nothing to do with symbols, unless a shared library references some globals or functions defined in the executable.)
You can single-step the code in GDB and watch register values change.
In 559, rax is first appeared.
EAX is the low 32 bits of RAX. Writing to EAX zero-extends into RAX implicitly. From mov DWORD PTR [rsp-0xc],0x0 and the later reload, we know that RAX=0.
This must have been un-optimized compiler output (or volatile int idx = 0; to defeat constant propagation), otherwise it would know at compile time that RAX=0 and could optimize away everything else.
lea rdx,[rip+0x37d] # 8b4
A RIP-relative LEA puts the address of static into a register. It's not a load from memory. (That happens later when movsxd with an indexed addressing mode uses RDX as the base address.)
The disassembler worked out the address for you; it's RDX = 0x8b4. (Relative to the start of the file; when actually running the program would be mapped at a virtual address like 0x55555...000)
554: cmp eax,0xd
557: ja 57c <main+0x4c>
559: movsxd rax,DWORD PTR [rdx+rax*4]
55d: add rax,rdx
560: jmp rax
This is a jump table. First it checks for an out-of-bounds index with cmp eax,0xd, then it indexes a table of 32-bit signed offsets using EAX (movsxd with an addressing mode that scales RAX by 4), and adds that to the base address of the table to get a jump target.
GCC could just make a jump table of 64-bit absolute pointers, but chooses not to so that .rodata is position-independent as well and doesn't need load-time fixups in a PIE executable. (Even though Linux does support doing that.) See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 where this is discussed (although the main focus of that bug is that gcc -fPIE can't turn a switch into a table lookup of string addresses, and actually still uses a jump table)
The jump-offset table address is in RDX, this is what was set up with the earlier LEA.

Linux 64-abi, calling convention

I'm reading intel manual about calling convention and which register has which purpose. Here is what was specified in the Figure 3.4: Register Usage:
%rax temporary register; with variable arguments
passes information about the number of vector
registers used; 1st return register
But in linux api we use rax to pass the function number. Is it consistent with what was specified in the intel manual? Actually I expected that (according to the manual) we will pass the function number into rdi (it is used for the 1st argument). And so forth...
Can I use rax to pass the first function argument in my hand-written functions? E.g.
mov rax, [array_lenght_ptr]
mov rdi, array_start_ptr
callq _array_sum
That quote is talking about the function-calling convention, which is standardized by x86-64 System V ABI doc.
You're thinking of Linux's system-call calling convention, which is described in an appendix to the ABI doc, but that part isn't normative. Anyway, the system call ABI puts the call number in rax because it's not an arg to the system call. Alternatively, you can think of it like a 0th arg, the same way that variadic function calls pass the number of FP register-args in al. (Fun fact: that makes it possible for the caller to pass even the first FP arg on the stack if they want to.)
But more importantly, because call number in RAX makes a better ABI, and because of tradition: it's what the i386 system-call ABI does, too. And the i386 System V function-call ABI is totally different, using stack args exclusively.
This means system-call wrapper functions can just set eax and run syscall instead of needing to do something like
libc_write_wrapper_for_your_imagined_syscall_convention:
; copy all args to the next slot over
mov r10, rdx ; size_t count
mov rdx, rsi ; void *buf
mov esi, edi ; int fd
mov edi, 1 ; SYS_write
syscall
cmp rax, -4095
jae set_errno
ret
instead of
actual_libc_write_wrapper: ; glibc's actual code I think also checks for pthread cancellation points or something...
mov eax, 1 ; SYS_write
syscall
cmp rax, -4095
jae set_errno
ret
Note the use of r10 instead of rcx because syscall clobbers rcx and r11 with the saved RIP and RFLAGS, so it doesn't have to write any memory with return information, and doesn't force user-space to put it somewhere the kernel can read it (like 32-bit sysenter does).
So the system-calling convention couldn't be the same as the function-calling convention. (Or the function-calling convention would have had to choose different registers.)
For system calls with 4 or more args (or a generic wrapper that works for any system call) you do need a mov r10, rcx, but that's all. (Unlike the 32-bit convention where a wrapper has to load args from the stack, and save/restore ebx because the kernel's poorly-chosen ABI uses it for the first arg.)
Can I use rax to pass the first function argument in my hand-written functions?
Yes, do whatever you want for private helper functions that you don't need to call from C.
Choose arg registers to make things easier for the callers (or for the most important caller), or you'll be using any registers with fixed register choices (like div).
Note which registers are clobbered and which are preserved with a comment. Only bother to save/restore registers that your caller actually needs saving / restoring, and choose which tmp regs you use to minimize push/pop. Avoid a push/pop save/reload of registers that are part of a critical latency path in your caller, if your function is short.

Why does VC++ 2010 often use ebx as a "zero register"?

Yesterday I was looking at some 32 bit code generated by VC++ 2010 (most probably; don't know about the specific options, sorry) and I was intrigued by a curious recurring detail: in many functions, it zeroed out ebx in the prologue, and it always used it like a "zero register" (think $zero on MIPS). In particular, it often:
used it to zero out memory; this is not unusual, as the encoding for a mov mem,imm is 1 to 4 bytes bigger than mov mem,reg (the full immediate value size has to be encoded even for 0), but usually (gcc) the necessary register is zeroed out "on demand", and kept for more useful purposes otherwise;
used it for compares against zero - as in cmp reg,ebx. This is what stroke me as really unusual, as it should be exactly the same as test reg,reg, but adds a dependency to an extra register. Now, keep in mind that this happened in non-leaf functions, with ebx being often pushed (by the callee) on and off the stack, so I would not trust this dependency to be always completely free. Also, it also used test reg,reg in the exact same fashion (test/cmp => jg).
Most importantly, registers on "classic" x86 are a scarce resource, if you start having to spill registers you waste a lot of time for no good reason; why waste one through all the function just to keep a zero in it? (still, thinking about it, I don't remember seeing much register spillage in functions that used this "zero-register" pattern).
So: what am I missing? Is it a compiler blooper or some incredibly smart optimization that was particularly interesting in 2010?
Here's an excerpt:
; standard prologue: ebp/esp, SEH, overflow protection, ... then:
xor ebx, ebx
mov [ebp+4], ebx ; zero out some locals
mov [ebp], ebx
call function_1
xor ecx, ecx ; ebx _not_ used to zero registers
cmp eax, ebx ; ... but used for compares?! why not test eax,eax?
setnz cl ; what? it goes through cl to check if eax is not zero?
cmp ecx, ebx ; still, why not test ecx,ecx?
jnz function_body
push 123456
call throw_something
function_body:
mov edx, [eax]
mov ecx, eax ; it's not like it was interested in ecx anyway...
mov eax, [edx+0Ch]
call eax ; virtual method call; ebx is preserved but possibly pushed/popped
lea esi, [eax+10h]
mov [ebp+0Ch], esi
mov eax, [ebp+10h]
mov ecx, [eax-0Ch]
xor edi, edi ; ugain, registers are zeroed as usual
mov byte ptr [ebp+4], 1
mov [ebp+8], ecx
cmp ecx, ebx ; why not test ecx,ecx?
jg somewhere
label1:
lea eax, [esi-10h]
mov byte ptr [ebp+4], bl ; ok, uses bl to write a zero to memory
lea ecx, [eax+0Ch]
or edx, 0FFFFFFFFh
lock xadd [ecx], edx
dec edx
test edx, edx ; now it's using the regular test reg,reg!
jg somewhere_else
Notice: an earlier version of this question said that it used mov reg,ebx instead of xor ebx,ebx; this was just me not remembering stuff correctly. Sorry if anybody put too much thought trying to understand that.
Everything you commented on as odd looks sub-optimal to me. test eax,eax sets all flags (except AF) the same as cmp against zero, and is preferred for performance and code-size.
On P6 (PPro through Nehalem), reading long-dead registers is bad because it can lead to register-read stalls. P6 cores can only read 2 or 3 not-recently-modified architectural registers from the permanent register file per clock (to fetch operands for the issue stage: the ROB holds operands for uops, unlike on SnB-family where it only holds references to the physical register file).
Since this is from VS2010, Sandybridge wasn't released yet, so it should have put a lot of weight on tuning for Pentium II/III, Pentium-M, Core2, and Nehalem where reading "cold" registers is a possible bottleneck.
IDK if anything like this ever made sense for integer regs, but I don't know much about optimizing for CPUs older than P6.
The cmp / setz / cmp / jnz sequence looks particularly braindead. Maybe it came from a compiler-internal canned sequence for producing a boolean value from something, and it failed to optimize a test of the boolean back into just using the flags directly? That still doesn't explain the use of ebx as a zero-register, which is also completely useless there.
Is it possible that some of that was from inline-asm that returned a boolean integer (using a silly that wanted a zero in a register)?
Or maybe the source code was comparing two unknown values, and it was only after inlining and constant-propagation that it turned into a compare against zero? Which MSVC failed to optimize fully, so it still kept 0 as a constant in a register instead of using test?
(the rest of this was written before the question included code).
Sounds weird, or like a case of CSE / constant-hoisting run amok. i.e. treating 0 like any other constant that you might want to load once and then reg-reg copy throughout the function.
Your analysis of the data-dependency behaviour is correct: moving from a register that was zeroed a while ago essentially starts a new dependency chain.
When gcc wants two zeroed registers, it often xor-zeroes one and then uses a mov or movdqa to copy to the other.
This is sub-optimal on Sandybridge where xor-zeroing doesn't need an execution port, but a possible win on Bulldozer-family where mov can run on the AGU or ALU, but xor-zeroing still needs an ALU port.
For vector moves, it's a clear win on Bulldozer: handled in register rename with no execution unit. But xor-zeroing of an XMM or YMM register still needs an execution port on Bulldozer-family (or two for ymm, so always use xmm with implicit zero-extension).
Still, I don't think that justifies tying up a register for the duration of a whole function, especially not if it costs extra saves/restores. And not for P6-family CPUs where register-read stalls are a thing.

Linux x64: why does r10 come before r8 and r9 in syscalls?

I decided to take a crack at assembly the other day, and I've been playing around with really basic things like printing stuff from argv to stdout. I found this great list of linux syscall numbers with arguments and everything, and I'm curious why r10 is used for arguments before r8 and r9. I've found all kinds of weird conventions about what can be used what for what and when, like how loop counters go in rcx. Is there a particular reason why r10 was moved up? Was it more convenient?
I should probably also mention I'm interested in this out of curiosity, not because it's causing me problems.
Edit: I found this question which gets close, referencing the x64 ABI documentation on page 124, where it notes that user level applications use rdi, rsi, rdx, rcx, r8, r9. The kernel on the other hand uses r10 instead of rcx, and destroys rcx and r11. That might explain how r10 ended up there, but then why was it swapped in?
RCX, along with R11, is used by the syscall instruction, being immediately destroyed by it. Thus these registers are not only not saved after syscall, but they can't even be used for parameter passing. Thus R10 was chosen to replace unusable RCX to pass fourth parameter.
See also this answer for a bit more information on how syscall uses these registers.
Reference: Intel's Instruction Set Reference, look for SYSCALL.
see x86-64.orgs abi documentation page 124
User-level applications use as integer registers for passing the sequence
%rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi,
%rsi, %rdx, %r10, %r8 and %r9.
A system-call is done via the syscall instruction. The kernel destroys
registers %rcx and %r11.
This is saying that when you use the syscall instruction the kernel destroys %rcx so you need to use %r10 instead.
Also the comment from #technosaurus explains that the kernel is using %rcx to store the entry point in case of an interrupt during a syscall.

Resources