What is wrong with this emulation of CMPXCHG16B instruction? - multithreading

I'm trying to run a binary program that uses the CMPXCHG16B instruction in one place. Unfortunately, my Athlon 64 X2 3800+ doesn't support it, which is great, because I see it as a programming challenge. The instruction doesn't seem that hard to implement with a jump to a code cave, so that's what I did, but something didn't work: the program just froze in a loop. Maybe someone can tell me if I implemented my CMPXCHG16B wrong?
Firstly the actual piece of machine code that I'm trying to emulate is this:
f0 49 0f c7 08 lock cmpxchg16b OWORD PTR [r8]
Excerpt from Intel manual describing CMPXCHG16B:
Compare RDX:RAX with m128. If equal, set ZF and load RCX:RBX into m128.
Else, clear ZF and load m128 into RDX:RAX.
First I replace all 5 bytes of the instruction with a jump to a code cave containing my emulation procedure. Luckily, the jump takes up exactly 5 bytes! The jump is actually a call instruction (e8), but it could be a jmp (e9); both work.
e8 96 fb ff ff call 0xfffffb96 (-1130)
This is a relative jump with a 32-bit signed offset encoded in two's complement. The offset points to the code cave relative to the address of the next instruction.
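As a sanity check on the patch bytes, here is a small Python sketch (not part of the original patch) that decodes the rel32 operand and computes the branch target. The load address 0x401000 is a made-up value used only for illustration.

```python
import struct

def decode_rel32(insn: bytes) -> int:
    """Return the signed 32-bit displacement of a 5-byte call (e8) or jmp (e9)."""
    if len(insn) != 5 or insn[0] not in (0xE8, 0xE9):
        raise ValueError("expected a 5-byte call/jmp rel32")
    # "<i" means little-endian signed 32-bit integer
    (disp,) = struct.unpack("<i", insn[1:])
    return disp

def branch_target(insn_addr: int, insn: bytes) -> int:
    """Target = address of the NEXT instruction + displacement."""
    return insn_addr + len(insn) + decode_rel32(insn)

patch = bytes.fromhex("e896fbffff")
print(decode_rel32(patch))                   # -1130
# 0x401000 is a hypothetical address for the patched instruction
print(hex(branch_target(0x401000, patch)))   # 0x400b9b
```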
Next the emulation code I'm jumping to:
PUSH R10
PUSH R11
MOV r10, QWORD PTR [r8]
MOV r11, QWORD PTR [r8+8]
TEST R10, RAX
JNE ELSE
TEST R11, RDX
JNE ELSE
MOV QWORD PTR [r8], RBX
MOV QWORD PTR [r8+8], RCX
JMP END
ELSE:
MOV RAX, r10
MOV RDX, r11
END:
POP R11
POP R10
RET
Personally, I'm happy with it, and I think it matches the functional specification given in the manual. It restores the stack and the two registers r10 and r11 to their original state and then resumes execution. Alas, it does not work! That is, the code runs, but the program acts as if it's waiting for a tip and burning electricity, which indicates my emulation was not perfect and I inadvertently broke its loop. Do you see anything wrong with it?
I notice that this is the atomic variant, owing to the lock prefix. I'm hoping it's something other than contention that I did wrong. Or is there a way to emulate atomicity too?

It's not possible to emulate lock cmpxchg16b. It's sort of possible if all accesses to the target address are synchronised with a separate lock, but that includes all other instructions, including non-atomic stores to either half of the object, and atomic read-modify-writes (like xchg, lock cmpxchg, lock add, lock xadd) with one half (or other part) of the 16 byte object.
You can emulate cmpxchg16b (without lock) like you've done here, with the bugfixes from @Fifoernik's answer. That's an interesting learning exercise, but not very useful in practice, because real code that uses cmpxchg16b always uses it with a lock prefix.
A non-atomic replacement will work most of the time, because it's rare for a cache-line invalidate from another core to arrive in the small time window between two nearby instructions. This doesn't mean it's safe, it just means it's really hard to debug when it does occasionally fail. If you just want to get a game working for your own use, and can accept occasional lockups / errors, this might be useful. For anything where correctness is important, you're out of luck.
What about MFENCE? Seems to be what I need.
MFENCE before, after, or between the loads and stores won't prevent another thread from seeing a half-written value ("tearing"), or from modifying the data after your code has made the decision that the compare succeeded, but before it does the store. It might narrow the window of vulnerability, but it can't close it, because MFENCE only prevents reordering of the global visibility of our own stores and loads. It can't stop a store from another core from becoming visible to us after our loads but before our stores. That requires an atomic read-modify-write bus cycle, which is what locked instructions are for.
Doing two 8-byte atomic compare-exchanges would solve the window-of-vulnerability problem, but only for each half separately, leaving the "tearing" problem.
Atomic 16B loads/stores solve the tearing problem but not the atomicity problem between loads and stores. They're possible with SSE on some hardware, but not guaranteed to be atomic by the x86 ISA the way 8B naturally-aligned loads and stores are.
Xen's lock cmpxchg16b emulation:
The Xen virtual machine has an x86 emulator, I guess for the case where a VM starts on one machine and migrates to less-capable hardware. It emulates lock cmpxchg16b by taking a global lock, because there's no other way. If there was a way to emulate it "properly", I'm sure Xen would do that.
As discussed in this mailing list thread, Xen's solution still doesn't work when the emulated version on one core is accessing the same memory as the non-emulated instruction on another core. (The native version doesn't respect the global lock).
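To make the global-lock idea concrete, here is a minimal Python model. It is only a sketch, not Xen's code: the dict-based memory, the function names and the lock are all invented for illustration. It is correct only because every access in the model goes through the same lock, which is exactly the property a native instruction on another core would not respect.

```python
import threading

# Hypothetical: one lock shared by ALL emulated accesses to memory.
_global_lock = threading.Lock()

def emulated_lock_cmpxchg16b(mem, addr, expected, desired):
    """Model of `lock cmpxchg16b [addr]` on a dict-based memory.

    expected models RDX:RAX and desired models RCX:RBX (128-bit integers).
    Returns (zf, old), where old is the value RDX:RAX holds afterwards.
    """
    with _global_lock:
        old = mem[addr]
        if old == expected:
            mem[addr] = desired      # ZF = 1, RCX:RBX is stored
            return True, old
        return False, old            # ZF = 0, RDX:RAX receives the memory value

def atomic_add_one(mem, addr):
    """Typical compare-and-swap retry loop built on the emulated instruction."""
    expected = mem[addr]
    while True:
        ok, seen = emulated_lock_cmpxchg16b(
            mem, addr, expected, (expected + 1) % (1 << 128))
        if ok:
            return
        expected = seen              # retry with the freshly observed value

def run_demo(n_threads=4, iterations=500):
    """Several threads increment one 128-bit counter through the emulation."""
    mem = {0x1000: 0}

    def worker():
        for _ in range(iterations):
            atomic_add_one(mem, 0x1000)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mem[0x1000]

print(run_demo())  # 2000
```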
See also this patch on the Xen mailing list that changes the lock cmpxchg8b emulation to support both lock cmpxchg8b and lock cmpxchg16b.
I also found that KVM's x86 emulator doesn't support cmpxchg16b either, according to the search results for emulate cmpxchg16b.
I think all this is good evidence that my analysis is correct, and that it's not possible to emulate it safely.

I see these things wrong with your code to emulate the cmpxchg16b instruction:
You need to use cmp instead of test to get a correct comparison.
You need to save/restore all flags except ZF. The manual mentions:
The CF, PF, AF, SF, and OF flags are unaffected.
The manual contains the following:
IF (64-Bit Mode and OperandSize = 64)
    THEN
        TEMP128 ← DEST
        IF (RDX:RAX = TEMP128)
            THEN
                ZF ← 1;
                DEST ← RCX:RBX;
            ELSE
                ZF ← 0;
                RDX:RAX ← TEMP128;
                DEST ← TEMP128;
        FI;
FI
So to really write code that "matches the functional specification given in manual" a write to the m128 is required. Although this particular write is part of the locked version lock cmpxchg16b, it won't of course do any good to the atomicity of the emulation! A straightforward emulation of lock cmpxchg16b is thus not possible. See @PeterCordes' answer.
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison.
ELSE:
MOV RAX, r10
MOV RDX, r11
MOV QWORD PTR [r8], r10
MOV QWORD PTR [r8+8], r11
END:
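To see why test breaks the loop, here is a small Python model of the flag behaviour. It is a sketch under the assumption that only ZF matters here, and the function names are invented. test sets ZF when the bitwise AND is zero, whereas cmp sets ZF when the operands are equal, so the original code takes the "not equal" path for most matching values.

```python
MASK64 = (1 << 64) - 1

def zf_after_test(a, b):
    """ZF as set by `test a, b`: set when (a AND b) == 0."""
    return (a & b & MASK64) == 0

def zf_after_cmp(a, b):
    """ZF as set by `cmp a, b`: set when a - b == 0, i.e. when a == b."""
    return ((a - b) & MASK64) == 0

def cmpxchg16b_model(mem_lo, mem_hi, rax, rdx, rbx, rcx):
    """Non-atomic model of the manual's pseudocode.

    Returns (zf, new_mem_lo, new_mem_hi, new_rax, new_rdx).
    """
    if zf_after_cmp(mem_lo, rax) and zf_after_cmp(mem_hi, rdx):
        # Match: ZF = 1 and RCX:RBX is stored to memory.
        return True, rbx, rcx, rax, rdx
    # Mismatch: ZF = 0, RDX:RAX receives the memory value, and the
    # destination is written back with its old contents.
    return False, mem_lo, mem_hi, mem_lo, mem_hi

# Equal operands: cmp reports a match, test does not (5 AND 5 is nonzero).
print(zf_after_cmp(5, 5), zf_after_test(5, 5))   # True False
# Different operands with no common bits: test wrongly reports a match.
print(zf_after_cmp(1, 2), zf_after_test(1, 2))   # False True
```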

Related

How can I sample instructions executed or retired with Linux perf?

How can I get a reasonable sample of instructions executed or retired (see footnote 1) using perf?
Stuff like perf record -e instructions gives obviously wrong results. Obviously wrong here means they appear to sample slow instructions based on their execution time.
E.g., for the following loop:
.loop:
mov rdx, rsi
xor rax, rax
div rdi
dec rcx
jne .loop
You will get 99.5% to 100% of the samples on the div or dec instructions (depending on if you have 1-instruction skid, or not), yet obviously all these instructions must be retired at the same rate (it is a basic block).
Typical results for a "skid 1" event like inst_retired.any_p (3704 samples):
Source, build and execution instructions for this benchmark.
1 I mention both because the distinction is important in principle (the former would count speculatively executed, but never retired instructions), although I'm happy for an answer for either.

How to comprehend the flow of this assembly code

I can't understand how this works.
Here's part of the main() function, disassembled by objdump in Intel syntax:
0000000000000530 <main>:
530: lea rdx,[rip+0x37d] # 8b4 <_IO_stdin_used+0x4>
537: mov DWORD PTR [rsp-0xc],0x0
53f: movabs r10,0xedd5a792ef95fa9e
549: mov r9d,0xffffffcc
54f: nop
550: mov eax,DWORD PTR [rsp-0xc]
554: cmp eax,0xd
557: ja 57c <main+0x4c>
559: movsxd rax,DWORD PTR [rdx+rax*4]
55d: add rax,rdx
560: jmp rax
The rodata section dump:
.rodata
08b0 01000200 ecfdffff d4fdffff bcfdffff ................
08c0 9cfdffff 7cfdffff 6cfdffff 4cfdffff ....|...l...L...
08d0 3cfdffff 2cfdffff 0cfdffff ecfcffff <...,...........
08e0 d4fcffff b4fcffff 0cfeffff ............
At 530, rip is 537, so rdx = 537 + 37d = 8b4.
My first question: how large is the value in rdx? Is the value ec, or ecfdffff, or something else? If it were declared as a DWORD, I could understand it being 'ecfdffff' (or is even that wrong?), but this program doesn't declare it. How can I determine the value?
Then the program continues.
At 559, rax appears for the first time.
My second question: can this rax be interpreted as an extension of eax, so that rax = 0 at this point? If rax is 0, then 559 means rax = DWORD [rdx], so rax becomes ecfdffff, and next 55d does rax += rdx. I don't think that value can be jumped to. I must have made a mistake somewhere, so please tell me where I went wrong.
I think I'll diverge from what Peter discussed (he provides good information) and get to the heart of some issues I think are causing you problems. When I first glanced at this question I assumed that the code was likely compiler generated and the jmp rax was likely the result of some control flow statement. The most likely way to generate such a code sequence is via a C switch. It isn't uncommon for a switch statement to be made of a jump table to say what code should execute depending on the control variable. As an example: the control variable for switch(a) is a.
This all made sense to me, and I wrote up a number of comments (now deleted) that ultimately resulted in bizarre memory addresses that jmp rax would go to. I had errands to run but when I returned I had the aha moment that you may have had the same confusion I did. This output from objdump using the -s option appeared as:
.rodata
08b0 01000200 ecfdffff d4fdffff bcfdffff ................
08c0 9cfdffff 7cfdffff 6cfdffff 4cfdffff ....|...l...L...
08d0 3cfdffff 2cfdffff 0cfdffff ecfcffff <...,...........
08e0 d4fcffff b4fcffff 0cfeffff ............
One of your questions seems to be about what values get loaded here. I never used the -s option to look at data in the sections and was unaware that although the dump splits the data out in groups of 4 bytes (32-bit values) they are shown in byte order as it appears in memory. I had at first assumed the output was displaying these values from Most Significant Byte to Least significant byte and objdump -s had done the conversion. That is not the case.
You have to manually reverse the bytes of each group of 4 bytes to get the real value that would be read from memory into a register.
ecfdffff in the output actually means ec fd ff ff. As a DWORD value (32-bit) you need to reverse the bytes to get the HEX value as you would expect when loaded from memory. ec fd ff ff reversed would be ff ff fd ec or the 32-bit value 0xfffffdec. Once you realize that then this makes a lot more sense. If you make this same adjustment for all the data in that table you'd get:
.rodata
08b0: 0x00020001 0xfffffdec 0xfffffdd4 0xfffffdbc
08c0: 0xfffffd9c 0xfffffd7c 0xfffffd6c 0xfffffd4c
08d0: 0xfffffd3c 0xfffffd2c 0xfffffd0c 0xfffffcec
08e0: 0xfffffcd4 0xfffffcb4 0xfffffe0c
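If you don't want to reverse the bytes by hand, a few lines of Python do the same thing. This is just a sketch (the helper name is made up); it treats each group of 8 hex digits from objdump -s as 4 bytes in memory order and reads them as little-endian 32-bit values.

```python
import struct

def dwords_from_dump(hex_groups):
    """Convert a row of `objdump -s` hex groups to 32-bit little-endian values."""
    raw = bytes.fromhex("".join(hex_groups.split()))
    count = len(raw) // 4
    # "<I" is a little-endian unsigned 32-bit integer; repeat it `count` times
    return list(struct.unpack("<%dI" % count, raw))

row = "01000200 ecfdffff d4fdffff bcfdffff"
print([hex(v) for v in dwords_from_dump(row)])
# ['0x20001', '0xfffffdec', '0xfffffdd4', '0xfffffdbc']
```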
Now if we look at the code you have it starts with:
530: lea rdx,[rip+0x37d] # 8b4 <_IO_stdin_used+0x4>
This doesn't load data from memory, it is computing the effective address of some data and places the address in RDX. The disassembly from OBJDUMP is displaying the code and data with the view that it is loaded in memory starting at 0x000000000000. When it is loaded into memory it may be placed at some other address. GCC in this case is producing position independent code (PIC). It is generated in such a way that the first byte of the program can start at an arbitrary address in memory.
The # 8b4 comment is the part we are concerned about (you can ignore the information after that). The disassembly is saying if the program was loaded at 0x0000000000000000 then the value loaded into RDX would be 0x8b4. How was that arrived at? This instruction starts at 0x530 but with RIP relative addressing the RIP (instruction pointer) is relative to the address just after the current instruction. The address the disassembler used was 0x537 (the byte after the current instruction is the address of the first byte of the next instruction). The instruction adds 0x37d to RIP and gets 0x537+0x37d=0x8b4. The address 0x8b4 happens to be in the .rodata section which you are given a dump of (as discussed above).
We now know that RDX contains the base of some data. The jmp rax suggests this is likely going to be a table of 32-bit values that are used to determine what memory location to jump to depending on the value in the control variable of a switch statement.
This statement appears to be storing the value 0 as a 32-bit value on the stack.
537: mov DWORD PTR [rsp-0xc],0x0
These appear to be variables that the compiler chose to store in registers (rather than memory).
53f: movabs r10,0xedd5a792ef95fa9e
549: mov r9d,0xffffffcc
R10 is being loaded with the 64-bit value 0xedd5a792ef95fa9e. R9D is the lower 32 bits of the 64-bit R9 register. The value 0xffffffcc is being loaded into the lower 32 bits of R9, but there is something else occurring. In 64-bit mode, if the destination of an instruction is a 32-bit register, the CPU automatically zero-extends the value into the upper 32 bits of the register. The CPU is guaranteeing us that the upper 32 bits are zeroed.
This is a NOP and doesn't do anything except align the next instruction to memory address 0x550. 0x550 is a value that is 16-byte aligned. This has some value and may hint that the instruction at 0x550 may be the first instruction at the top of a loop. An optimizer may place NOPs into the code to align the first instruction at the top of a loop to a 16-byte aligned address in memory for performance reasons:
54f: nop
Earlier the 32-bit stack based variable at rsp-0xc was set to zero. This reads the value 0 from memory as a 32-bit value and stores it in EAX. Since EAX is a 32-bit register being used as the destination for the instruction the CPU automatically filled the upper 32-bits of RAX to 0. So all of RAX is zero.
550: mov eax,DWORD PTR [rsp-0xc]
EAX is now being compared to 0xd. If it is above (ja) it goes to the instruction at 0x57c.
554: cmp eax,0xd
557: ja 57c <main+0x4c>
We then have this instruction:
559: movsxd rax,DWORD PTR [rdx+rax*4]
The movsxd is an instruction that will take a 32-bit source operand (in this case the 32-bit value at memory address RDX+RAX*4), load it into the bottom 32 bits of RAX and then sign-extend the value into the upper 32 bits of RAX. Effectively, if the 32-bit value is negative (the most significant bit is 1) the upper 32 bits of RAX will all be set to 1. If the 32-bit value is not negative the upper 32 bits of RAX will all be set to 0.
When this code is first encountered RDX contains the base of some table at 0x8b4 from the beginning of the program loaded in memory. RAX is set to 0. Effectively the first 32 bits in the table are copied to RAX and sign-extended. As seen earlier, the value at offset 0x8b4 is 0xfffffdec. That 32-bit value is negative, so RAX contains 0xfffffffffffffdec.
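The difference between the two kinds of extension can be modelled in a few lines of Python. This is only a sketch with made-up helper names: the first function mimics what a 32-bit register write such as mov r9d, imm32 does to the full register, the second mimics movsxd.

```python
def zero_extend_32_to_64(value32):
    """Model of writing a 32-bit register: upper 32 bits become 0."""
    return value32 & 0xFFFFFFFF

def sign_extend_32_to_64(value32):
    """Model of movsxd: upper 32 bits are copies of bit 31."""
    value32 &= 0xFFFFFFFF
    if value32 & 0x80000000:
        value32 |= 0xFFFFFFFF00000000
    return value32

print(hex(zero_extend_32_to_64(0xffffffcc)))   # 0xffffffcc
print(hex(sign_extend_32_to_64(0xfffffdec)))   # 0xfffffffffffffdec
print(hex(sign_extend_32_to_64(0x00020001)))   # 0x20001
```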
Now to the meat of the situation:
55d: add rax,rdx
560: jmp rax
RDX still holds the address to the beginning of a table in memory. RAX is being added to that value and stored back in RAX (RAX = RAX+RDX). We then JMP to the address stored in RAX. So this code all seems to suggest we have a JUMP table with 32-bit values that we are using to determine where we should go. So then the obvious question. What are the 32-bit values in the table? The 32-bit values are the difference between the beginning of the table and the address of the instruction we want to jump to.
We know the table is 0x8b4 from the location our program is loaded in memory. The C compiler told the linker to compute the difference between 0x8b4 and the address where the instruction we want to execute resides. If the program had been loaded into memory at 0x0000000000000000 (hypothetically), RAX = RAX+RDX would have resulted in RAX being 0xfffffffffffffdec + 0x8b4 = 0x00000000000006a0. We then use jmp rax to jump to 0x6a0. You didn't show the entire dump of memory but there is going to be code at 0x6a0 that will execute when the value passed to the switch statement is 0. Each 32-bit value in the JUMP table will be a similar offset to the code that will execute depending on the control variable in the switch statement. If we add 0x8b4 to all the entries in the table we get:
08b0:            0x000006a0 0x00000688 0x00000670
08c0: 0x00000650 0x00000630 0x00000620 0x00000600
08d0: 0x000005f0 0x000005e0 0x000005c0 0x000005a0
08e0: 0x00000588 0x00000568 0x000006c0
You should find that in the code you haven't provided us that these addresses coincide with code that appears after the jmp rax.
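The same computation can be sketched in Python, assuming (hypothetically, as in the disassembly) that the program is loaded at address 0 so that the table base is 0x8b4. The bytes are the 14 table entries copied from the .rodata dump, starting at 0x8b4.

```python
import struct

TABLE_BASE = 0x8b4          # address placed in RDX by the lea (load base assumed 0)
MASK64 = (1 << 64) - 1

# The 14 jump-table entries as they appear in memory, from the .rodata dump
table_bytes = bytes.fromhex(
    "ecfdffff d4fdffff bcfdffff "
    "9cfdffff 7cfdffff 6cfdffff 4cfdffff "
    "3cfdffff 2cfdffff 0cfdffff ecfcffff "
    "d4fcffff b4fcffff 0cfeffff"
)

def jump_targets(base, raw):
    """Mimic movsxd + add: sign-extended 32-bit offset plus the table base."""
    offsets = struct.unpack("<%di" % (len(raw) // 4), raw)  # signed, little-endian
    return [(base + off) & MASK64 for off in offsets]

for index, target in enumerate(jump_targets(TABLE_BASE, table_bytes)):
    print("case %2d -> 0x%x" % (index, target))
```

The first line printed is case 0 -> 0x6a0, matching the value worked out by hand.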
Given that the memory address 0x550 was aligned, I have a hunch that this switch statement is inside a loop that keeps executing as some kind of state machine until the proper conditions are met for it to exit. Likely the value of the control variable used for the switch statement is changed by the code in the switch statement itself. Each time the switch statement is run the control variable has a different value and will do something different.
The control variable for the switch statement was originally checked for the value being above 0x0d (13). The table starting at 0x8b4 in the .rodata section has 14 entries. One can assume the switch statement probably has 14 different states (cases).
but this program doesn't declare it
You're looking at disassembly of machine code + data. It's all just bytes in memory. Any labels the disassembler does manage to show are ones that got left in the executable's symbol table. They're irrelevant to how the CPU runs the machine code.
(The ELF program headers tell the OS's program loader how to map it into memory, and where to jump to as an entry point. This has nothing to do with symbols, unless a shared library references some globals or functions defined in the executable.)
You can single-step the code in GDB and watch register values change.
At 559, rax appears for the first time.
EAX is the low 32 bits of RAX. Writing to EAX zero-extends into RAX implicitly. From mov DWORD PTR [rsp-0xc],0x0 and the later reload, we know that RAX=0.
This must have been un-optimized compiler output (or volatile int idx = 0; to defeat constant propagation), otherwise it would know at compile time that RAX=0 and could optimize away everything else.
lea rdx,[rip+0x37d] # 8b4
A RIP-relative LEA puts the address of static data into a register. It's not a load from memory. (That happens later when movsxd with an indexed addressing mode uses RDX as the base address.)
The disassembler worked out the address for you; it's RDX = 0x8b4. (Relative to the start of the file; when actually running, the program would be mapped at a virtual address like 0x55555...000.)
554: cmp eax,0xd
557: ja 57c <main+0x4c>
559: movsxd rax,DWORD PTR [rdx+rax*4]
55d: add rax,rdx
560: jmp rax
This is a jump table. First it checks for an out-of-bounds index with cmp eax,0xd, then it indexes a table of 32-bit signed offsets using EAX (movsxd with an addressing mode that scales RAX by 4), and adds that to the base address of the table to get a jump target.
GCC could just make a jump table of 64-bit absolute pointers, but chooses not to so that .rodata is position-independent as well and doesn't need load-time fixups in a PIE executable. (Even though Linux does support doing that.) See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 where this is discussed (although the main focus of that bug is that gcc -fPIE can't turn a switch into a table lookup of string addresses, and actually still uses a jump table)
The jump-offset table address is in RDX, this is what was set up with the earlier LEA.

NASM - Use labels for code loaded from disk

As a learning experience, I'm writing a boot-loader for BIOS in NASM on 16-bit real mode in my x86 emulator, Qemu.
BIOS loads your boot-sector at address 0x7C00. NASM assumes you start at 0x0, so your labels are useless unless you do something like specify the origin with [org 0x7C00] (or presumably other techniques). But, when you load the 2nd stage boot-loader, its RAM origin is different, which complicates the hell out of using labels in that newly loaded code.
What's the recommended way to deal with this? Is this linker territory? Should I be using segment registers instead of org?
Thanks in advance!
p.s. Here's the code that works right now:
[bits 16]
[org 0x7c00]
LOAD_ADDR: equ 0x9000 ; This is where I'm loading the 2nd stage in RAM.
start:
mov bp, 0x8000 ; set up the stack
mov sp, bp ; relatively out of the way
call disk_load ; load the new instructions
; at 0x9000
jmp LOAD_ADDR
%include "disk_load.asm"
times 510 - ($ - $$) db 0
dw 0xaa55 ;; end of bootsector
seg_two:
;; this is ridiculous. Better way?
mov cx, LOAD_ADDR + print_j - seg_two
jmp cx
jmp $
print_j:
mov ah, 0x0E
mov al, 'k'
int 0x10
jmp $
times 2048 db 0xf
You may be making this harder than it is (not that this is trivial by any means!)
Your labels work fine and will continue to work fine. Remember that, if you look under the hood at the machine code generated, your short jumps (everything after seg_two in what you've posted) are relative jumps. This means that the assembler doesn't actually need to compute the real address, it simply needs to calculate the offset from the current opcode. However, when you load your code into RAM at 0x9000, that is an entirely different story.
Personally, when writing precisely the kind of code that you are, I would separate the code. The boot sector stops at the dw 0xaa55 and the 2nd stage gets its own file with an ORG 0x9000 at the top.
When you assemble these, you simply need to concatenate the resulting binaries together. Essentially, that's what you're doing now except that you are getting the assembler to do it for you.
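The concatenation step could look roughly like this in Python. It's only a sketch: the function name is invented, and the sanity checks are my own addition rather than anything the BIOS or NASM requires you to script.

```python
def build_disk_image(boot_sector: bytes, second_stage: bytes,
                     sector_size: int = 512) -> bytes:
    """Concatenate a boot sector and a second stage into one disk image."""
    if len(boot_sector) != sector_size:
        raise ValueError("boot sector must be exactly one sector")
    # `dw 0xaa55` is stored little-endian, so the last two bytes are 55 aa
    if boot_sector[-2:] != b"\x55\xaa":
        raise ValueError("missing 0xaa55 boot signature")
    # Pad the second stage to a whole number of sectors so disk reads
    # never run past the end of the image.
    padding = -len(second_stage) % sector_size
    return boot_sector + second_stage + bytes(padding)

# Stand-ins for the two assembled binaries (real ones would come from the assembler)
fake_boot = bytes(510) + b"\x55\xaa"
fake_stage2 = b"\xeb\xfe"            # machine code for `jmp $`
image = build_disk_image(fake_boot, fake_stage2)
print(len(image))                    # 1024
```

In practice the two inputs would be read from files such as boot.bin and stage2.bin (hypothetical names) produced by nasm -f bin, with the second file assembled with its own org 0x9000.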
Hope this makes sense. :)

Assembly: what are semantic NOPs?

I was wondering what are "semantic NOPs" in assembly?
Code that isn't an actual nop but doesn't affect the behavior of the program.
In C, the following sequence could be thought of as a semantic NOP:
{
// Since none of these have side effects, they are effectively no-ops
int x = 5;
int y = x * x;
int z = y / x;
}
They are instructions that have no effect, like a NOP, but take more bytes. Useful to get code aligned to a cache line boundary. An instruction like lea edi,[edi+0] is an example, it would take 7 NOPs to fill the same number of bytes but takes only 1 cycle instead of 7.
A semantic NOP is a collection of machine language instructions that have no effect at all or almost no effect (most instructions change condition codes) whose only purpose is obfuscation of what the program is actually doing.
Code that executes but doesn't do anything meaningful. These are also called "opaque predicates," and are used most often by obfuscators.
A true "semantic nop" is an instruction which has no effect other than taking some time and advancing the program counter. Many machines where register-to-register moves do not affect flags, for example, have numerous instructions that will move a register to itself. On the 8088, for example, any of the following would be semantic NOPs:
mov al,al
mov bl,bl
mov cl,cl
...
mov ax,ax
mov bx,bx
mov cx,cx
...
xchg ax,ax
xchg bx,bx
xchg cx,cx
...
Note that all of the above except for "xchg ax,ax" are two-byte instructions. Intel has therefore declared that "xchg ax,ax" should be used when a one-byte NOP is required. Indeed, if one assembles "xchg ax,ax" and disassembles it, it will disassemble as "NOP".
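The encodings can be checked with a little Python. This is a sketch with invented helper names; it assumes 16-bit mode and the 89 /r form of mov (an assembler may pick the equivalent 8B /r form instead).

```python
# 16-bit register numbers as used in x86 instruction encodings
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]

def xchg_ax_opcode(reg):
    """One-byte `xchg ax, reg` is encoded as 0x90 + register number."""
    return bytes([0x90 + REG16.index(reg)])

def mov_reg_to_itself(reg):
    """`mov reg, reg` via opcode 89 /r with a register-direct ModRM byte."""
    n = REG16.index(reg)
    modrm = 0xC0 | (n << 3) | n      # mod=11, reg=n, r/m=n
    return bytes([0x89, modrm])

print(xchg_ax_opcode("ax").hex())     # 90  (the NOP opcode)
print(xchg_ax_opcode("bx").hex())     # 93
print(mov_reg_to_itself("ax").hex())  # 89c0
print(mov_reg_to_itself("bx").hex())  # 89db
```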
Note that in some cases an instruction or instruction sequence may have potential side-effects, but nonetheless be more desirable than the usual "nop". On the 6502, for example, if one needs a 7-cycle delay and the stack pointer is valid but the top-of-stack value is irrelevant, a PHP followed by a PLP will kill seven cycles using only two bytes of code. If the top-of-stack value isn't a spare byte of RAM, though, the sequence would fail.

How does a syscall actually happen on linux?

Inspired by this question
How can I force GDB to disassemble?
and related to this one
What is INT 21h?
How does a system call actually happen under Linux? What happens from the moment the call is performed until the actual kernel routine is invoked?
Assuming we're talking about x86:
1. The ID of the system call is deposited into the EAX register.
2. Any arguments required by the system call are deposited into registers. On 32-bit x86 Linux the first argument goes in EBX, followed by ECX, EDX, ESI, EDI and EBP.
3. An INT 0x80 interrupt is invoked.
4. The Linux kernel services the system call identified by the ID in the EAX register, depositing any results in pre-determined locations.
5. The calling code makes use of any results.
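As an illustration of the setup before INT 0x80, here is a small Python model of which register receives what on 32-bit x86 Linux. It doesn't perform a real system call; the function name and the buffer address are made up, and 4 is the i386 number for write.

```python
# Argument registers for `int 0x80` on 32-bit x86 Linux, in order
ARG_REGISTERS = ("ebx", "ecx", "edx", "esi", "edi", "ebp")

def marshal_int80(number, *args):
    """Return the register contents expected just before `int 0x80`."""
    if len(args) > len(ARG_REGISTERS):
        raise ValueError("int 0x80 passes at most six arguments")
    regs = {"eax": number}                 # system call ID goes in EAX
    regs.update(zip(ARG_REGISTERS, args))  # arguments fill EBX, ECX, EDX, ...
    return regs

# write(1, buf, 13), with 0x8049000 as a hypothetical buffer address
print(marshal_int80(4, 1, 0x8049000, 13))
# {'eax': 4, 'ebx': 1, 'ecx': 134516736, 'edx': 13}
```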
I may be a bit rusty at this, it's been a few years...
The given answers are correct but I would like to add that there are more mechanisms to enter kernel mode. Every recent kernel maps the "vsyscall" page in every process' address space. It contains little more than the most efficient syscall trap method.
For example on a regular 32 bit system it could contain:
0xffffe000: int $0x80
0xffffe002: ret
But on my 64-bit system I have access to the far more efficient method using the syscall/sysenter instructions:
0xffffe000: push %ecx
0xffffe001: push %edx
0xffffe002: push %ebp
0xffffe003: mov %esp,%ebp
0xffffe005: sysenter
0xffffe007: nop
0xffffe008: nop
0xffffe009: nop
0xffffe00a: nop
0xffffe00b: nop
0xffffe00c: nop
0xffffe00d: nop
0xffffe00e: jmp 0xffffe003
0xffffe010: pop %ebp
0xffffe011: pop %edx
0xffffe012: pop %ecx
0xffffe013: ret
This vsyscall page also maps some system calls that can be done without a context switch. I know for certain that gettimeofday, time and getcpu are mapped there, but I imagine getpid could fit in there just as well.
This is already answered at
How is the system call in Linux implemented?
Probably did not match with this question because of the differing "syscall" term usage.
Basically, it's very simple: somewhere in memory lies a table where the address of the handler for each syscall number is stored (see http://lxr.linux.no/linux+v2.6.30/arch/x86/kernel/syscall_table_32.S for the x86 version).
The INT 0x80 interrupt handler then just takes the arguments out of the registers, puts them on the (kernel) stack, and calls the appropriate syscall handler.
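The table lookup can be modelled in Python like this. It's a toy sketch, not kernel code: the handlers, their return values and the register dict are invented, and only the i386 numbers 4 (write) and 20 (getpid) are filled in.

```python
import errno

def sys_write(fd, buf, count):
    """Toy handler: pretend to write and report how many bytes were taken."""
    return min(count, len(buf))

def sys_getpid(_ebx, _ecx, _edx):
    """Toy handler: return a made-up process ID."""
    return 1234

# Table indexed by syscall number; unused slots stay None
SYSCALL_TABLE = [None] * 21
SYSCALL_TABLE[4] = sys_write
SYSCALL_TABLE[20] = sys_getpid

def do_int80(regs):
    """Model of the handler: pick the table entry for EAX and call it."""
    nr = regs["eax"]
    handler = SYSCALL_TABLE[nr] if 0 <= nr < len(SYSCALL_TABLE) else None
    if handler is None:
        return -errno.ENOSYS               # unknown system call
    # Arguments are taken out of the registers and passed to the handler
    return handler(regs.get("ebx", 0), regs.get("ecx", 0), regs.get("edx", 0))

print(do_int80({"eax": 20}))                                        # 1234
print(do_int80({"eax": 4, "ebx": 1, "ecx": b"hello", "edx": 5}))    # 5
```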
