x86 reserved EFLAGS bit 1 == 0: how can this happen? - multithreading

I'm using the Win32 API to stop/start/inspect/change thread state. Generally works pretty well. Sometimes it fails, and I'm trying to track down the cause.
I have one thread that is forcing context switches on other threads by:
thread stop
fetch processor state into windows context block
read thread registers from windows context block to my own context block
write thread registers from another context block into windows context block
restart thread
This works remarkably well... but ... very rarely, context switches seem to fail.
(Symptom: my multithreaded system blows sky high, executing at strange places with strange register contents.)
The context control is accomplished by:
if ((suspend_count=SuspendThread(WindowsThreadHandle))<0)
{ printf("TimeSlicer Suspend Thread failure");
...
}
...
Context.ContextFlags = (CONTEXT_INTEGER | CONTEXT_CONTROL | CONTEXT_FLOATING_POINT);
if (!GetThreadContext(WindowsThreadHandle,&Context))
{ printf("Context fetch failure");
...
}
ContextSwap(&Context); // does the context swap
if (ResumeThread(WindowsThreadHandle)<0)
{ printf("Thread resume failure");
...
}
None of the print statements ever get executed. I conclude that Windows thinks the context operations all happened reliably.
Oh, yes, I do know when a thread being stopped is not computing [e.g., in a system function] and won't attempt to stop/context switch it. I know this because each thread that does anything other-than-computing sets a thread specific "don't touch me" flag, while it is doing other-than-computing. (Device driver programmers will recognize this as the equivalent of "interrupt disable" instructions).
So, I wondered about the reliability of the content of the context block.
I added a variety of sanity tests on various register values pulled out of the context block; you can actually decide that ESP is OK (within bounds of the stack area defined in the TIB), PC is in the program that I expect or in a system call, etc. No surprises here.
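For reference, the ESP bounds test is roughly the following sketch (C; fetching the target thread's TIB is elided here, and tib is assumed to already point at a copy of it):

#include <windows.h>

// Sketch: is the captured ESP inside the stack area the TIB describes?
// The stack grows down, so a valid ESP lies in (StackLimit, StackBase].
static BOOL esp_within_stack(const NT_TIB *tib, DWORD esp)
{
    return (ULONG_PTR)esp >  (ULONG_PTR)tib->StackLimit
        && (ULONG_PTR)esp <= (ULONG_PTR)tib->StackBase;
}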
I decided to check that the condition code bits (EFLAGS) were being properly read out; if this were wrong, it would cause a switched task to take a "wrong branch" when its state was restored. So I added the following code to verify that the purported EFLAGS register contains values that actually look like EFLAGS according to the Intel reference manual (see http://en.wikipedia.org/wiki/FLAGS_register).
mov eax, Context.EFlags[ebx] ; ebx points to Windows Context block
mov ecx, eax ; check that we seem to have flag bits
and ecx, 0FFFEF32Ah ; where we expect constant flag bits to be
cmp ecx, 000000202h ; expected state of constant flag bits
je #f
breakpoint ; trap if unexpected flag bit status
##:
On my Win 7 AMD Phenom II X6 1090T (hex core), it traps occasionally with a breakpoint, with ECX = 0200h. It fails the same way on my Win 7 Intel i7 system. I would ignore this, except that it hints the EFLAGS aren't being stored correctly, as I suspected.
According to my reading of the Intel (and also the AMD) reference manuals, bit 1 is reserved and always has the value "1". Not what I see here.
Obviously, MS fills the context block by doing complicated things on a thread stop. I expect them to store the state accurately. This bit isn't stored correctly.
If they don't store this bit correctly, what else don't they store?
Any explanations for why the value of this bit could/should be zero sometimes?
EDIT: My code dumps the registers and the stack on catching a breakpoint.
The stack area contains the context block as a local variable.
Both EAX and the value in the stack at the proper offset for EFLAGS in the context block contain the value 0244h. So the value in the context block really is wrong.
EDIT2: I changed the mask and comparison values to
and ecx, 0FFFEF328h ; was 0FFFEF32Ah where we expect flag bits to be
cmp ecx, 000000200h
This seems to run reliably with no complaints. Apparently Win7 doesn't do bit 1 of eflags right, and it appears not to matter.
Still interested in an explanation, but apparently this is not the source of my occasional context switch crash.

Microsoft has a long history of squirreling away a few bits in places that aren't really used. Raymond Chen has given plenty of examples, e.g. using the lower bit(s) of a pointer that's not byte-aligned.
In this case, Windows might have needed to store some of its thread context in an existing CONTEXT structure, and decided to use an otherwise unused bit in EFLAGS. You couldn't do anything with that bit anyway, and Windows will get that bit back when you call SetThreadContext.
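If you want your sanity check to survive this quirk, mask bit 1 out instead of assuming the architectural always-1 value. A minimal sketch in C, using the same mask and expected value the question's EDIT2 settled on:

#include <windows.h>

// Sketch: validate the EFLAGS image from a captured CONTEXT while tolerating
// bit 1, which Windows sometimes reports as 0 despite the manuals defining it
// as an always-1 reserved bit. Mask and expected value match EDIT2 above.
static BOOL eflags_looks_sane(const CONTEXT *ctx)
{
    const DWORD mask     = 0xFFFEF328;  // fixed/reserved bits under test
    const DWORD expected = 0x00000200;  // IF=1, everything else in the mask 0
    return (ctx->EFlags & mask) == expected;
}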


What is wrong with this emulation of CMPXCHG16B instruction?

I'm trying to run a binary program that uses the CMPXCHG16B instruction in one place; unfortunately my Athlon 64 X2 3800+ doesn't support it. Which is great, because I see it as a programming challenge. The instruction doesn't seem to be that hard to implement with a cave jump, so that's what I did, but something didn't work: the program just froze in a loop. Maybe someone can tell me if I implemented my CMPXCHG16B wrong?
Firstly the actual piece of machine code that I'm trying to emulate is this:
f0 49 0f c7 08 lock cmpxchg16b OWORD PTR [r8]
Excerpt from Intel manual describing CMPXCHG16B:
Compare RDX:RAX with m128. If equal, set ZF and load RCX:RBX into m128.
Else, clear ZF and load m128 into RDX:RAX.
First I replace all 5 bytes of the instruction with a jump to a code cave containing my emulation procedure; luckily the jump takes up exactly 5 bytes! The jump is actually a call instruction (e8), but it could be a jmp (e9); both work.
e8 96 fb ff ff call 0xfffffb96 (-1130)
This is a relative jump with a 32-bit signed offset encoded in two's complement; the offset points to the code cave relative to the address of the next instruction.
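The patching arithmetic is just "target minus address of the next instruction"; as a sketch (the helper name is mine, and it assumes the cave is within rel32 reach, i.e. +/-2 GiB of the patch site):

#include <stdint.h>
#include <string.h>

// Sketch: overwrite the 5 bytes at `site` with a call into the cave.
static void write_call_rel32(uint8_t *site, const uint8_t *cave)
{
    int32_t rel = (int32_t)(cave - (site + 5)); // offset from next instruction
    site[0] = 0xE8;                             // call rel32 opcode
    memcpy(site + 1, &rel, sizeof rel);         // little-endian signed offset
}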
Next the emulation code I'm jumping to:
PUSH R10
PUSH R11
MOV r10, QWORD PTR [r8]
MOV r11, QWORD PTR [r8+8]
TEST R10, RAX
JNE ELSE
TEST R11, RDX
JNE ELSE
MOV QWORD PTR [r8], RBX
MOV QWORD PTR [r8+8], RCX
JMP END
ELSE:
MOV RAX, r10
MOV RDX, r11
END:
POP R11
POP R10
RET
Personally, I'm happy with it, and I think it matches the functional specification given in the manual. It restores the stack and the two registers r10 and r11 to their original state and then resumes execution. Alas, it does not work! That is, the code runs, but the program acts as if it's waiting for a tip and burning electricity. Which indicates my emulation was not perfect and I inadvertently broke its loop. Do you see anything wrong with it?
I notice that this is an atomic variant of it, owing to the lock prefix. I'm hoping it's something else besides contention that I did wrong. Or is there a way to emulate atomicity too?
It's not possible to emulate lock cmpxchg16b. It's sort of possible if all accesses to the target address are synchronised with a separate lock, but that includes all other instructions, including non-atomic stores to either half of the object, and atomic read-modify-writes (like xchg, lock cmpxchg, lock add, lock xadd) with one half (or other part) of the 16 byte object.
You can emulate cmpxchg16b (without lock) like you've done here, with the bugfixes from @Fifoernik's answer. That's an interesting learning exercise, but not very useful in practice, because real code that uses cmpxchg16b always uses it with a lock prefix.
A non-atomic replacement will work most of the time, because it's rare for a cache-line invalidate from another core to arrive in the small time window between two nearby instructions. This doesn't mean it's safe, it just means it's really hard to debug when it does occasionally fail. If you just want to get a game working for your own use, and can accept occasional lockups / errors, this might be useful. For anything where correctness is important, you're out of luck.
What about MFENCE? Seems to be what I need.
MFENCE before, after, or between the loads and stores won't prevent another thread from seeing a half-written value ("tearing"), or from modifying the data after your code has made the decision that the compare succeeded, but before it does the store. It might narrow the window of vulnerability, but it can't close it, because MFENCE only prevents reordering of the global visibility of our own stores and loads. It can't stop a store from another core from becoming visible to us after our loads but before our stores. That requires an atomic read-modify-write bus cycle, which is what locked instructions are for.
Doing two 8-byte atomic compare-exchanges would solve the window-of-vulnerability problem, but only for each half separately, leaving the "tearing" problem.
Atomic 16B loads/stores would solve the tearing problem, but not the atomicity problem between the load and the store. It's possible with SSE on some hardware, but not guaranteed to be atomic by the x86 ISA the way 8B naturally-aligned loads and stores are.
Xen's lock cmpxchg16b emulation:
The Xen virtual machine has an x86 emulator, I guess for the case where a VM starts on one machine and migrates to less-capable hardware. It emulates lock cmpxchg16b by taking a global lock, because there's no other way. If there was a way to emulate it "properly", I'm sure Xen would do that.
As discussed in this mailing list thread, Xen's solution still doesn't work when the emulated version on one core is accessing the same memory as the non-emulated instruction on another core. (The native version doesn't respect the global lock).
See also this patch on the Xen mailing list that changes the lock cmpxchg8b emulation to support both lock cmpxchg8b and lock cmpxchg16b.
I also found that KVM's x86 emulator doesn't support cmpxchg16b either, according to the search results for emulate cmpxchg16b.
I think all this is good evidence that my analysis is correct, and that it's not possible to emulate it safely.
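If you can rebuild the program instead of patching the binary, the safe route is to let the compiler emit the instruction itself. A sketch with GNU C's __atomic builtins on a 16-byte type (compile with -mcx16 on x86-64; older compilers may route this through libatomic instead of inlining it):

#include <stdbool.h>

typedef unsigned __int128 u128;

// Sketch: 16-byte strong compare-exchange; with -mcx16 the compiler can emit
// a real lock cmpxchg16b here instead of you having to write one by hand.
static bool cas16(volatile u128 *dst, u128 *expected, u128 desired)
{
    return __atomic_compare_exchange_n(dst, expected, desired,
                                       false,             // strong variant
                                       __ATOMIC_SEQ_CST,  // success ordering
                                       __ATOMIC_SEQ_CST); // failure ordering
}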
I see these things wrong with your code to emulate the cmpxchg16b instruction:
You need to use cmp instead of test to get a correct comparison.
You need to save/restore all flags except the ZF. The manual mentions:
The CF, PF, AF, SF, and OF flags are unaffected.
The manual contains the following:
IF (64-Bit Mode and OperandSize = 64)
THEN
TEMP128 ← DEST
IF (RDX:RAX = TEMP128)
THEN
ZF ← 1;
DEST ← RCX:RBX;
ELSE
ZF ← 0;
RDX:RAX ← TEMP128;
DEST ← TEMP128;
FI;
FI
So to really write code that "matches the functional specification given in the manual", a write to the m128 is required. Although this particular write is part of the locked version lock cmpxchg16b, it of course does nothing for the atomicity of the emulation! A straightforward emulation of lock cmpxchg16b is thus not possible. See @PeterCordes' answer.
The manual also notes:
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor's bus, the destination operand receives a write cycle without regard to the result of the comparison
ELSE:
MOV RAX, r10              ; return the old value in RDX:RAX ...
MOV RDX, r11
MOV QWORD PTR [r8], r10   ; ... and write it back, matching the real
MOV QWORD PTR [r8+8], r11 ; instruction's unconditional write cycle
END:

Is there a list of errors that will show up as `segfaults` when they are not really related to memory access?

In this question, I learned that attempting to run privileged instructions when not in ring 0 can cause what looks like a segfault in a user process, and I have two follow-up questions.
Is this true of all privileged instructions?
What other sorts of errors can cause a fake segfault but are not related to trying to read memory?
Read through the instruction set reference and see where #GP is listed for a non-memory issue. Incomplete list: CLI, CLTS, HLT, IN, INT (with an invalid vector), INVD, INVLPG, IRET (under circumstances), LDMXCSR (setting reserved bits), LGDT, LIDT, LLDT, LMSW, LTR, MONITOR (with ECX != 0), MOV (to CRx or DRx), MWAIT (with invalid ECX), OUT, RDMSR, RDPMC, SWAPGS, SYSEXIT, SYSRET, WBINVD, WRMSR, XGETBV (invalid ECX), XRSTOR, XSETBV
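For example, this hypothetical minimal test (GNU C, x86 Linux) dies with SIGSEGV even though it never touches a bad address:

#include <stdio.h>

int main(void)
{
    puts("about to execute cli in ring 3...");
    __asm__ volatile ("cli");  // privileged: raises #GP(0), delivered as SIGSEGV
    puts("never reached");
    return 0;
}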

64-bit windows VMware detection

I am trying to develop an application which detects if a program is running inside a virtual machine.
For 32-bit Windows, there are already methods explained in the following link:
http://www.codeproject.com/Articles/9823/Detect-if-your-program-is-running-inside-a-Virtual
I am trying to adapt the code for Virtual PC and VMware detection to a 64-bit Windows operating system. For VMware, the code detects successfully on Windows XP 64-bit. But the program crashes when I run it on a native system (Windows 7 64-bit).
I put the code in an .asm file and defined a custom build step with ml64.exe. The asm code for 64-bit Windows is:
IsInsideVM proc
push rdx
push rcx
push rbx
mov rax, 'VMXh'
mov rbx, 0 ; any value but not the MAGIC VALUE
mov rcx, 10 ; get VMWare version
mov rdx, 'VX' ; port number
in rax, dx ; read port
; on return EAX returns the VERSION
cmp rbx, 'VMXh'; is it a reply from VMWare?
setz al ; set return value
movzx rax,al
pop rbx
pop rcx
pop rdx
ret
IsInsideVM endp
I call this part in a cpp file like:
__try
{
returnValue = IsInsideVM();
}
__except(1)
{
returnValue = false;
}
Thanks in advance.
The old red pill from Joanna may work; see this random backup page of the invisiblethings.org blog:
Swallowing the Red Pill is more or less equivalent to the following code (returns non-zero when in the Matrix):
int swallow_redpill () {
  unsigned char m[2+4], rpill[] = "\x0f\x01\x0d\x00\x00\x00\x00\xc3";
  *((unsigned*)&rpill[3]) = (unsigned)m;  /* patch SIDT's destination to m (32-bit only) */
  ((void(*)())&rpill)();                  /* execute: sidt m; ret */
  return (m[5]>0xd0) ? 1 : 0;             /* check the high byte of the IDT base */
}
The heart of this code is actually the SIDT instruction (encoded as 0F010D[addr]), which stores the contents of the interrupt descriptor table register (IDTR) in the destination operand, which is actually a memory location. What is special and interesting about the SIDT instruction is that it can be executed in non-privileged mode (ring 3), yet it returns the contents of a sensitive register used internally by the operating system.
Because there is only one IDTR register, but there are at least two OSes running concurrently (i.e., the host and the guest OS), the VMM needs to relocate the guest's IDTR to a safe place, so that it will not conflict with the host's. Unfortunately, the VMM cannot know if (and when) a process running in the guest OS executes the SIDT instruction, since it is not privileged (and it doesn't generate an exception). Thus the process gets the relocated address of the IDT table. It was observed that on VMware the relocated address of the IDT is at address 0xffXXXXXX, whereas on Virtual PC it is 0xe8XXXXXX. This was tested on VMware Workstation 4 and Virtual PC 2004, both running on a Windows XP host OS.
Note: I haven't tested it myself but look that it uses an unprivileged approach. If it does not work at first for x64, some tweaking may help.
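For example, a 64-bit adaptation might look like this untested sketch (GNU C): in long mode SIDT stores a 10-byte descriptor (2-byte limit plus 8-byte base), so the 32-bit buffer layout above no longer applies.

#include <stdint.h>

struct __attribute__((packed)) idtr {
    uint16_t limit;
    uint64_t base;
};

// Sketch: read the IDT base from ring 3 and let the caller compare it against
// the relocation ranges mentioned above. Note that hardware-assisted
// virtualization makes this heuristic unreliable on modern hosts.
static uint64_t idt_base(void)
{
    struct idtr d;
    __asm__ volatile ("sidt %0" : "=m"(d));
    return d.base;
}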
Also, just found out a question with content that may help you: Detecting VMM on linux
My guess is that your function corrupts registers.
Running on real hardware (non-VM) will probably trigger an exception at "in rax, dx". If this happens, control is passed to your exception handler, which sets the result but does not restore registers. This behaviour is fully unexpected by the caller. For example, the caller can save something in the RBX register, then call your asm code; your asm code does "mov rbx, 0", executes, catches the exception, sets the result, returns, and then the caller suddenly realizes that its saved data isn't in RBX anymore! If there was a pointer stored in RBX, you're going to crash hard. Anything can happen.
Surely, your asm code saves/restores registers, but that happens only when no exception is raised, i.e., only when your code is running in a VM. Then your code takes its normal execution path, no exception is raised, and the registers are restored normally. But if there is an exception, your POPs will be skipped, because execution will be passed to the exception handler.
The correct code should restore the registers even on the exception path; the PUSH/POPs need to happen outside of the __try/__except block, not inside it.

What is %gs in Assembly

void return_input (void)
{
char array[30];
gets (array);
printf("%s\n", array);
}
After compiling it in gcc, this function is converted to the following Assembly code:
push %ebp
mov %esp,%ebp
sub $0x28,%esp
mov %gs:0x14,%eax
mov %eax,-0x4(%ebp)
xor %eax,%eax
lea -0x22(%ebp),%eax
mov %eax,(%esp)
call 0x8048374
lea -0x22(%ebp),%eax
mov %eax,(%esp)
call 0x80483a4
mov -0x4(%ebp),%eax
xor %gs:0x14,%eax
je 0x80484ac
call 0x8048394
leave
ret
I don't understand two lines:
mov %gs:0x14,%eax
xor %gs:0x14,%eax
What is %gs, and what exactly these two lines do?
This is compilation command:
cc -c -mpreferred-stack-boundary=2 -ggdb file.c
GS is a segment register; its use in Linux can be read up on here (it's basically used for per-thread data).
mov %gs:0x14,%eax
xor %gs:0x14,%eax
This code is used to validate that the stack hasn't exploded or been corrupted, using a canary value stored at GS+0x14; see this.
gcc's -fstack-protector-strong is on by default in many modern distros; you can use gcc -fno-stack-protector to not add those checks. (On x86, thread-local storage is cheap, so GCC keeps the randomized canary value there, making it somewhat harder to leak.)
In the AT&T style assembly languages, the percent sigil generally indicates a register. In x86 family processors from 386 onwards, GS is one of the so-called segment registers. However, in protected mode environments segment registers work as selector registers.
A virtual memory selector represents its own mapping of virtual address space together with its own access regime. In practical terms, %gs:0x14 can be thought of as a reference into an array whose origin is held in %gs (albeit the CPU does a bit of extra dereferencing). On modern GNU/Linux systems, %gs is usually used to point at the thread-local storage region. In the code you're asking about, however, only one item of the TLS matters — the stack canary.
The idea is to attempt to detect a buffer overflow error by placing a random but constant value — it's called a stack canary in memory of the canaries coal miners used to employ to signal an increase in levels of poisonous gases by dying — into the stack before gets() gets called, above its stack frame, and checking whether it is still there after gets() returns. gets() has no business overwriting this part of the stack: it is outside its own stack frame, and it is not given a pointer to it. So if the stack canary has died, something has gone wrong in a dangerous way. (C as a programming environment happens to be particularly prone to this kind of error, and security researchers have learnt to exploit many such bugs over the last twenty years or so. Also, gets() happens to be a function that is inherently at risk of overflowing its target buffer.)
You have not provided addresses with your code, but 0x80484ac is likely the address of leave, and the call 0x8048394 which is executed in case of a mismatch (that is, jumped over by je 0x80484ac in case of a match) is probably a call to __stack_chk_fail(), provided by libc to handle the stack corruption by fleeing the metaphorical poisonous mine.
The reason the canonical value of the stack canary is kept in the thread-local storage is that this way, every thread can have its own stack canary. Stacks themselves are normally not shared between threads, so it is natural to also not share the canary value.
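Purely to illustrate what %gs:0x14 refers to, here is a sketch (GNU C, 32-bit x86 Linux; not something production code should do) that reads the same slot the compiled prologue reads:

#include <stdio.h>

int main(void)
{
    unsigned long canary;
    // The same load the function prologue performs: this thread's canary.
    __asm__ ("mov %%gs:0x14, %0" : "=r"(canary));
    printf("stack canary: %#lx\n", canary);
    return 0;
}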
ES, FS, GS: Extra Segment Registers
Can be used as extra segment registers; also used in special instructions that span segments (like string copies).
Taken from here: http://www.hep.wisc.edu/~pinghc/x86AssmTutorial.htm. Hope it helps!

In Linux, on entry of a sys call, what is the value in %eax? (not orig_eax)

When a syscall returns, I get the syscall return value in %eax, however on entry I am getting -38, which is 0xFFFFFFDA in hex. This is for both write/read. What is this number? Can it be used to safely differentiate an entry from an exit?
The -38 in eax on syscall entry is apparently ENOSYS (Function not implemented), and is put there by syscall_trace_entry in arch/x86/kernel/entry_32.S. I suppose it's safe to assume that it will always be there on syscall entry, however it can also be there on syscall exit, if the syscall returns ENOSYS.
Personally, I have always just kept track of whether I'm in syscall entry or exit when using ptrace, although I have seen some code relying on the ENOSYS too. (I'm assuming you're using ptrace) I guess that won't work if the process happens to be inside a syscall when you attach to it, but I have been lucky enough to not bump into that problem.
I took a quick look at strace sources, and I guess it keeps track of the state too, since there was a comment saying "We are attaching to an already running process. Try to figure out the state of the process in syscalls, to handle the first event well." and slightly after that it said "The process is asleep in the middle of a syscall. Fake the syscall entry event.".
In short, the value can't be safely used to differentiate an entry from an exit. That said, I'm not sure that tracking it manually is the best method, since I haven't really got any source which would definitely tell you to use that technique, sorry. :)
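For what it's worth, the manual tracking I mean is just a flag flipped on every syscall-stop, roughly like this sketch (it assumes the tracee was started under PTRACE_TRACEME, so entry/exit stops pair up cleanly):

#include <stdbool.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

// Sketch: alternate entry/exit on each syscall-stop instead of peeking at eax.
static void trace_syscalls(pid_t child)
{
    bool in_syscall = false;
    int status;
    while (waitpid(child, &status, 0) > 0 && WIFSTOPPED(status)) {
        in_syscall = !in_syscall;  // odd stops = entry, even stops = exit
        // ... inspect registers here, knowing which side we are on ...
        ptrace(PTRACE_SYSCALL, child, 0, 0);
    }
}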
I still don't get when you see the -38 in eax, but when doing a syscall, eax contains the number that identifies the syscall (in a 2.6 kernel you can have a look at arch/x86/include/asm/unistd_64.h to see the numbers for each call).
So the sequence is the following:
your program
sets eax to the syscall number (depending on the call, also some other regs)
initiates the syscall (via int 0x80)
result of the syscall is in eax
your program again
Maybe your question isn't formulated this way, but if you are not writing kernel code or a driver, the easiest way to tell whether you are before syscall entry or after syscall exit is: TRUE whenever you are in your own code ;-). The entry/exit itself happens (more or less) instantly in one instruction, so either you are in the syscall (then you would know, because it must be kernel code or a blocking call) or you are not (which is almost every time when you debug your code).
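To make that register protocol concrete, a write(2) via int 0x80 looks like this in 32-bit GNU C (sketch; syscall number 4 is write on i386):

// Sketch: eax selects the syscall on entry, and the kernel's return value
// (or -errno) comes back in eax, exactly as the sequence above describes.
static long write_via_int80(int fd, const void *buf, unsigned long len)
{
    long ret;
    __asm__ volatile ("int $0x80"
                      : "=a"(ret)
                      : "a"(4), "b"(fd), "c"(buf), "d"(len)
                      : "memory");
    return ret;
}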
