What Does Assembly Instruction Shift Do? - shellcode

I came across a pretty interesting article that demonstrated how to remove nullbyte characters from shellcode. Among the techniques used, the assembly instructions shl and shr seemed to occupy a rather important role in the code.
I realize that the assembly instructions mov $0x3b, %rax and mov $59, %rax each generate the machine code 48 c7 c0 3b 00 00 00. To cope with this, the author instead uses mov $0x1111113b, %rax to fill the register with the system call number, which generates the machine code 48 c7 c0 3b 11 11 11 and so removes the nullbytes.
Unfortunately, the code still doesn't execute because syscall treats 3b 11 11 11 as an illegal instruction, or this causes the code to seg fault. So what the author then did was shift %rax back and forth 56 bits with the commands
shl $0x38, %rax
shr $0x38, %rax
After this shift, the code executes perfectly. What I want to know is how the shift instructions fix the 48 c7 c0 3b 11 11 11 issue and somehow make %rax proper and syscall'able. I know that shl/shr shift bits left and right, meaning that shifting left moves the bits up into higher positions and shifting right moves them lower again, because binary is read right to left. But how does this change the code at all and make it executable? Doesn't shifting back and forth essentially change nothing, putting the shifted bits exactly back where they were in the beginning?
My only theory is that shifting bits away leaves behind zeros. But I still don't see how shifting %rax forward and then back fixes the problem, because wouldn't it bring back the 11 11 11 section anyway?
Anyway, I thought this was interesting as I had never seen the shift instructions before today. Thanks in advance.

Shifting is a lossy operation - if bits are shifted outside of the register, they just disappear. (Sometimes one of them is stored in a carry flag, but that's not important here.) See http://en.wikibooks.org/wiki/X86_Assembly/Shift_and_Rotate#Logical_Shift_Instructions .
The shift left (shl) operation does this:
0x000000001111113b << 0x38 = 0x3b00000000000000
The 0x111111 part would have occupied bits 64 and above, but %rax is a 64-bit register, so those bits vanish. Then, the logical shift right (shr) operation does this:
0x3b00000000000000 >> 0x38 = 0x000000000000003b
Giving you the number that you want. And that's all there is to it.
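To check the arithmetic without assembling anything, here is a minimal C sketch that models %rax as a uint64_t. The function name clear_upper_bits is made up for this illustration, and the sketch assumes a shift count below 64 (C leaves larger counts undefined for a 64-bit value).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models "shl $count, %rax" followed by
   "shr $count, %rax" on a 64-bit register value. */
uint64_t clear_upper_bits(uint64_t rax, unsigned count)
{
    assert(count < 64);   /* shifting a 64-bit value by 64 or more is undefined in C */
    rax <<= count;        /* bits pushed past bit 63 are discarded */
    rax >>= count;        /* unsigned right shift: zeros enter from the top */
    return rax;
}
```

Calling clear_upper_bits(0x1111113b, 0x38) gives 0x3b: the left shift produces 0x3b00000000000000, and the right shift brings the surviving byte back down with zeros above it.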

Related

Is there a way to prevent MASM changing instructions?

I'm writing a C++ program, and decided that it would be more efficient to write a specific function in x86 assembly, due to its use of the carry flag. In the disassembly, I noticed that some instructions had been changed, causing my program to throw the exception: "Access violation reading location". Why are instructions changing, and how can I prevent this?
Here is a snippet of my code:
XOR EBX, EBX ; 31 DB
MOV BL, DH ; 88 F3
MOV AH, BYTE PTR [ECX] ; 8A 21
The disassembler shows this:
xor bx, bx ; 66 31 DB
mov bl, dh ; 88 F3
mov ah, byte ptr [bx+di] ; 67 8A 21
You assembled in 16-bit mode and disassembled in 32-bit mode, making everything the opposite of what it should be. MASM isn't "changing" instructions, just assembling them for the mode you told it to assemble for.
For example, in 16-bit mode, [ECX] needs a 67 address-size override prefix to encode a 32-bit addressing mode. When decoded in 32-bit mode, that same prefix overrides it to mean 16-bit. (And that bit-pattern means [bx+di]; 16-bit addressing modes can't use a SIB byte, so all the different register combos are packed into the ModRM byte. [cx] isn't encodable.)
Also, if you think xor + mov is the most efficient way to zero-extend DH into EBX, have a look at movzx. That's more efficient on modern CPUs. (See https://agner.org/optimize/ and https://uops.info/).
Generally you want to avoid writing high-8 registers like AH; Haswell/Skylake have merging penalties when reading the full register, and AMD CPUs have a false dependency. Read the info and links on Why doesn't GCC use partial registers? carefully before you use them.

How can I determine at runtime whether ASM code is running in x86 or x64 CPU? [duplicate]

This question already has an answer here:
x86-32 / x86-64 polyglot machine-code fragment that detects 64bit mode at run-time?
I want to write some assembly code that can find out whether it runs in an x86 or x64 binary (The reason I want to do such a weird thing is that I will inject this code in any given binary, and when the code runs, it will determine which kind of system call it should do and run that part of the code. Nothing malicious, just a "hello world" before passing to the actual entry point as an exercise).
Anyway, one 'solution' I thought of was as follows:
read the stack pointer to general-purpose register X
push 0
read the stack pointer to GP register Y
subtract Y from X (store result in X)
pop to Y (to fix the stack)
X has size of register, behave accordingly
This is the closest I could get:
0: 54 push rsp
1: 54 push rsp
2: 5b pop rbx
3: 58 pop rax
4: 48 29 d8 sub rax,rbx <---
7: 83 f8 08 cmp eax,0x8
a: 74 ?? je 64_bit_code_addr
This produces the same bytes for x86, except for that 0x48 at address 0x4. How can I write that instruction in an architecture-independent way? Or what other solution can I have to achieve this effect?
(Please do not present out-of-the-box solutions, such as "you can determine the class of an executable by checking EI_CLASS offset of an ELF file" etc.)
It can be much simpler, using the fact that the REX.W prefix byte (0x48) decodes as DEC EAX in 32-bit code:
48 90
Which in 64bit code is:
rex.w nop ; still a nop
and in 32bit code:
dec eax
nop
Put something like xor eax, eax before it, of course, so that afterwards eax is still zero in 64-bit code but 0xFFFFFFFF in 32-bit code.
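As a rough illustration of why the same bytes behave differently, here is a toy C model of the byte sequence 31 c0 48 90. It is only a sketch: run_probe is a hypothetical name, and the "decoder" understands nothing beyond the three encodings used in this trick.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy decoder (an illustration, not a real disassembler). It handles only:
     31 c0  xor eax, eax
     48     dec eax in 32-bit mode, REX.W prefix in 64-bit mode
     90     nop
   Returns the final value of EAX. */
uint32_t run_probe(const uint8_t *code, size_t len, int is_64bit)
{
    uint32_t eax = 0xdeadbeefu;            /* arbitrary starting value */
    size_t i = 0;
    while (i < len) {
        if (code[i] == 0x31) {
            assert(i + 1 < len && code[i + 1] == 0xc0);
            eax = 0;                       /* xor eax, eax */
            i += 2;
        } else if (code[i] == 0x48) {
            if (!is_64bit)
                eax -= 1;                  /* dec eax wraps 0 to 0xFFFFFFFF */
            /* in 64-bit mode the byte is only a prefix and changes nothing here */
            i += 1;
        } else {
            assert(code[i] == 0x90);       /* nop */
            i += 1;
        }
    }
    return eax;
}
```

With the four bytes above, the model returns 0 when is_64bit is nonzero and 0xFFFFFFFF otherwise, which is the difference a following test or cmp on eax can branch on.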

What is wrong with this emulation of CMPXCHG16B instruction?

I'm trying to run a binary program that uses CMPXCHG16B instruction at one place, unfortunately my Athlon 64 X2 3800+ doesn't support it. Which is great, because I see it as a programming challenge. The instruction doesn't seem to be that hard to implement with a cave jump, so that's what I did, but something didn't work, program just froze in a loop. Maybe someone can tell me if I implemented my CMPXCHG16B wrong?
Firstly the actual piece of machine code that I'm trying to emulate is this:
f0 49 0f c7 08 lock cmpxchg16b OWORD PTR [r8]
Excerpt from Intel manual describing CMPXCHG16B:
Compare RDX:RAX with m128. If equal, set ZF and load RCX:RBX into m128.
Else, clear ZF and load m128 into RDX:RAX.
First I replace all 5 bytes of the instruction with a jump to code cave with my emulation procedure, luckily the jump takes up exactly 5 bytes! The jump is actually a call instruction e8, but could be a jmp e9, both work.
e8 96 fb ff ff call 0xfffffb96(-1130)
This is a relative jump with a 32-bit signed offset encoded in two's complement; the offset points to the code cave relative to the address of the next instruction.
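The offset decoding can be checked with a short C sketch. The names decode_rel32 and call_target are hypothetical, and any instruction address passed in is only an example value, not taken from the binary in question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes the four little-endian bytes that follow an e8/e9 opcode
   into a signed 32-bit displacement. */
int32_t decode_rel32(const uint8_t bytes[4])
{
    assert(bytes != NULL);
    uint32_t raw = (uint32_t)bytes[0]
                 | ((uint32_t)bytes[1] << 8)
                 | ((uint32_t)bytes[2] << 16)
                 | ((uint32_t)bytes[3] << 24);
    int64_t value = raw;
    if (value >= INT64_C(0x80000000))      /* top bit set: negative in two's complement */
        value -= INT64_C(0x100000000);
    return (int32_t)value;
}

/* Target of a 5-byte call/jmp rel32 located at insn_addr:
   the displacement is relative to the address of the next instruction. */
uint64_t call_target(uint64_t insn_addr, const uint8_t bytes[4])
{
    uint64_t next = insn_addr + 5;
    return next + (uint64_t)(int64_t)decode_rel32(bytes);
}
```

For the bytes 96 fb ff ff, decode_rel32 returns -1130 (0xfffffb96 as a signed 32-bit value), so the code cave sits 1130 bytes before the instruction that follows the call.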
Next the emulation code I'm jumping to:
PUSH R10
PUSH R11
MOV r10, QWORD PTR [r8]
MOV r11, QWORD PTR [r8+8]
TEST R10, RAX
JNE ELSE
TEST R11, RDX
JNE ELSE
MOV QWORD PTR [r8], RBX
MOV QWORD PTR [r8+8], RCX
JMP END
ELSE:
MOV RAX, r10
MOV RDX, r11
END:
POP R11
POP R10
RET
Personally, I'm happy with it, and I think it matches the functional specification given in the manual. It restores the stack and the two registers r10 and r11 to their original values and then resumes execution. Alas, it does not work! That is, the code runs, but the program acts as if it's waiting for a tip and burning electricity, which indicates my emulation was not perfect and I inadvertently broke its loop. Do you see anything wrong with it?
I notice that this is the atomic variant of the instruction, owing to the lock prefix. I'm hoping it's something other than contention that I did wrong. Or is there a way to emulate atomicity too?
It's not possible to emulate lock cmpxchg16b. It's sort of possible if all accesses to the target address are synchronised with a separate lock, but that includes all other instructions, including non-atomic stores to either half of the object, and atomic read-modify-writes (like xchg, lock cmpxchg, lock add, lock xadd) with one half (or other part) of the 16 byte object.
You can emulate cmpxchg16b (without lock) like you've done here, with the bugfixes from @Fifoernik's answer. That's an interesting learning exercise, but not very useful in practice, because real code that uses cmpxchg16b always uses it with a lock prefix.
A non-atomic replacement will work most of the time, because it's rare for a cache-line invalidate from another core to arrive in the small time window between two nearby instructions. This doesn't mean it's safe, it just means it's really hard to debug when it does occasionally fail. If you just want to get a game working for your own use, and can accept occasional lockups / errors, this might be useful. For anything where correctness is important, you're out of luck.
What about MFENCE? Seems to be what I need.
MFENCE before, after, or between the loads and stores won't prevent another thread from seeing a half-written value ("tearing"), or from modifying the data after your code has made the decision that the compare succeeded, but before it does the store. It might narrow the window of vulnerability, but it can't close it, because MFENCE only prevents reordering of the global visibility of our own stores and loads. It can't stop a store from another core from becoming visible to us after our loads but before our stores. That requires an atomic read-modify-write bus cycle, which is what locked instructions are for.
Doing two 8-byte atomic compare-exchanges would solve the window-of-vulnerability problem, but only for each half separately, leaving the "tearing" problem.
Atomic 16B loads/stores solve the tearing problem but not the atomicity problem between loads and stores. They're possible with SSE on some hardware, but not guaranteed to be atomic by the x86 ISA the way 8B naturally-aligned loads and stores are.
Xen's lock cmpxchg16b emulation:
The Xen virtual machine has an x86 emulator, I guess for the case where a VM starts on one machine and migrates to less-capable hardware. It emulates lock cmpxchg16b by taking a global lock, because there's no other way. If there was a way to emulate it "properly", I'm sure Xen would do that.
As discussed in this mailing list thread, Xen's solution still doesn't work when the emulated version on one core is accessing the same memory as the non-emulated instruction on another core. (The native version doesn't respect the global lock).
See also this patch on the Xen mailing list that changes the lock cmpxchg8b emulation to support both lock cmpxchg8b and lock cmpxchg16b.
I also found that KVM's x86 emulator doesn't support cmpxchg16b either, according to the search results for emulate cmpxchg16b.
I think all this is good evidence that my analysis is correct, and that it's not possible to emulate it safely.
I see these things wrong with your code to emulate the cmpxchg16b instruction:
You need to use cmp instead of test to get a correct comparison.
You need to save/restore all flags except ZF. The manual mentions:
The CF, PF, AF, SF, and OF flags are unaffected.
The manual contains the following:
IF (64-Bit Mode and OperandSize = 64)
    THEN
        TEMP128 ← DEST
        IF (RDX:RAX = TEMP128)
            THEN
                ZF ← 1;
                DEST ← RCX:RBX;
            ELSE
                ZF ← 0;
                RDX:RAX ← TEMP128;
                DEST ← TEMP128;
        FI;
FI
So to really write code that "matches the functional specification given in manual" a write to the m128 is required. Although this particular write is part of the locked version lock cmpxchg16b, it won't of course do any good to the atomicity of the emulation! A straightforward emulation of lock cmpxchg16b is thus not possible. See @PeterCordes' answer.
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison
ELSE:
MOV RAX, r10
MOV RDX, r11
MOV QWORD PTR [r8], r10
MOV QWORD PTR [r8+8], r11
END:
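For comparison, the manual's pseudocode can be written out as a small C model. This is a sketch only and is NOT atomic, so it says nothing about the lock prefix; the type u128_pair and the function cmpxchg16b_model are names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Two 64-bit halves, low half first, matching how [r8] and [r8+8] are used above. */
typedef struct { uint64_t lo, hi; } u128_pair;

/* NON-ATOMIC model of cmpxchg16b following the manual's pseudocode.
   expected models RDX:RAX (hi:lo), desired models RCX:RBX (hi:lo).
   The return value models ZF. */
bool cmpxchg16b_model(u128_pair *dest, u128_pair *expected, u128_pair desired)
{
    assert(dest != NULL && expected != NULL);
    u128_pair temp = *dest;                          /* TEMP128 <- DEST */
    if (temp.lo == expected->lo && temp.hi == expected->hi) {
        *dest = desired;                             /* DEST <- RCX:RBX */
        return true;                                 /* ZF <- 1 */
    }
    *expected = temp;                                /* RDX:RAX <- TEMP128 */
    *dest = temp;                                    /* DEST <- TEMP128 (write-back) */
    return false;                                    /* ZF <- 0 */
}
```

The equality test here corresponds to cmp, not test: with memory holding {3, 4} and an expected value of {0, 0}, a bitwise AND would give zero for both halves and wrongly report a match, whereas the comparison above reports a mismatch and loads {3, 4} into the expected pair.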

Determining when NASM can infer the size of the mov operation

Something has had me confused in x86 assembly for a while: how and when can NASM infer the size of the operation? Here's an example:
mov ebx, [eax]
Here we are moving the 4 bytes stored at the address held in eax into ebx. The size of the operation is inferred as 4 bytes because the register is 32 bits.
However, this operation doesn't get inferred and throws a compile error:
mov [eax], 123456
Of course the solution is this:
mov dword [eax], 123456
Which will move the 32-bit representation of the number 123456 into the bytes at the address held in eax.
But this confuses me, surely it can see eax is 32 bit, so shouldn't it assume I want to store it as a 32 bit value without me having to specify dword after the mov?
Surely if I wanted to put the 16 bit representation of 12345 (smaller number to fit in 16 bits) into eax I would do this:
mov ax, 12345
The operand-size would be ambiguous (and so must be specified) for any instruction with a memory destination and an immediate source. (Neither operand actually being a register, even if using one or more in an addressing mode.)
Address-size and operand-size are separate attributes of an instruction.
Quoting what you said in a comment on another answer, since I think this gets at the core of your confusion:
I would expect mov [eax], 1 to set the 4 bytes held in memory address eax to the 32 bit representation of 1
The BYTE/WORD/DWORD [PTR] annotation is not about the size of the memory address; it's about the size of the variable in memory at that address. Assuming flat 32-bit addressing, addresses are always four bytes long, and therefore must go in Exx registers. So, when the source operand is an immediate value, the dword (or whatever) annotation on the destination operand is the only way the assembler can know whether it's supposed to modify 1, 2, or 4 bytes of RAM.
Perhaps it will help if I demonstrate the effect of these annotations on machine code:
$ objdump -d -Mintel test.o
...
0: c6 00 01 mov BYTE PTR [eax], 0x1
3: 66 c7 00 01 00 mov WORD PTR [eax], 0x1
8: c7 00 01 00 00 00 mov DWORD PTR [eax], 0x1
(I've adjusted the spacing a bit compared to how objdump actually prints it.)
Take note of two things: (1) the three different operand prefixes produce three different machine instructions, and (2) using a different prefix changes the length of the source operand as emitted into the machine code.
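The same idea can be sketched in C, where the pointer is always the same size but the store width is chosen separately. The helper names store_immediate and bytes_touched are hypothetical, and the counting trick assumes no byte of the stored value equals 0xFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stores value at addr using width bytes (1, 2, or 4), the way
   BYTE/WORD/DWORD select the store width in "mov <size> [eax], imm". */
void store_immediate(unsigned char *addr, uint32_t value, size_t width)
{
    assert(width == 1 || width == 2 || width == 4);
    if (width == 1) {
        uint8_t v = (uint8_t)value;
        memcpy(addr, &v, sizeof v);
    } else if (width == 2) {
        uint16_t v = (uint16_t)value;
        memcpy(addr, &v, sizeof v);
    } else {
        memcpy(addr, &value, sizeof value);
    }
}

/* Counts how many bytes of an 8-byte buffer pre-filled with 0xFF were
   changed by the store. Assumes no byte of the stored value is 0xFF. */
size_t bytes_touched(uint32_t value, size_t width)
{
    unsigned char buf[8];
    size_t changed = 0;
    memset(buf, 0xFF, sizeof buf);
    store_immediate(buf, value, width);
    for (size_t i = 0; i < sizeof buf; i++)
        if (buf[i] != 0xFF)
            changed++;
    return changed;
}
```

Storing the value 1 with widths 1, 2, and 4 changes 1, 2, and 4 bytes respectively, even though the address argument is identical in all three calls. That is the information the assembler cannot recover from [eax] alone.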
mov [eax], 123456
This instruction would use immediate addressing for the source operand and indirect addressing for the destination operand, i.e. it would place the decimal value 123456 at the memory address stored in register eax, as you pointed out. But the memory location that eax points to does not itself have to be 32 bits in size, so NASM cannot infer the size of the destination operand. Only the pointer in register eax is known to be 32 bits.
Address-size and operand-size are totally separate attributes of an instruction.
Surely if I wanted to put the 16 bit representation of 12345 into eax I would do this:
mov ax, 12345
Yes, but here you are using immediate addressing for the source operand and register addressing for the destination operand. The assembler can infer the amount of data you wish to move from the size of the destination register (16 bits in the case of the AX register; the upper 2 bytes of the full EAX are left unmodified, so you're not actually setting 32-bit EAX to that value).
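A one-function C sketch of that last point, modelling EAX as a uint32_t. The name mov_ax is made up for the example, and the immediate is assumed to fit in 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Models "mov ax, imm16": only bits 0..15 of EAX change,
   bits 16..31 keep whatever they held before. */
uint32_t mov_ax(uint32_t eax, uint32_t imm16)
{
    assert(imm16 <= 0xFFFFu);          /* the immediate must fit in 16 bits */
    return (eax & 0xFFFF0000u) | imm16;
}
```

If EAX held 0xAABBCCDD beforehand, mov_ax(0xAABBCCDD, 12345) gives 0xAABB3039: the low half becomes 0x3039 (12345) while the upper half 0xAABB survives.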
compile error
I think you meant assembly error :)
In your first case it can determine the size without problems, since EBX is a 32-bit register. But in the second one you're using EAX as an address, not as a destination register, so the NASM developers took the safe route and made the developer choose the size.
If you did mov [eax], 1, what could nasm determine from that? Do you want to set the byte, 16bit or 32bit block of memory to 1? It is totally unknown. This is why it's better to force the developer to state the size.
It would be entirely different if you said mov eax, 123456 since then the destination is a register.

Why does the VC++ compiler MOV+PUSH args instead of just PUSH them? x86

In this disassembly from VC++ a function call is being made. The compiler MOVs the local pointers to a register before pushing them:
memcpy( nodeNewLocation, pNode, sizeCurrentNode );
0041A5DA 8B 45 F8 mov eax,dword ptr [ebp-8]
0041A5DD 50 push eax
0041A5DE 8B 4D 0C mov ecx,dword ptr [ebp+0Ch]
0041A5E1 51 push ecx
0041A5E2 8B 55 D4 mov edx,dword ptr [ebp-2Ch]
0041A5E5 52 push edx
0041A5E6 E8 67 92 FF FF call 00413852
0041A5EB 83 C4 0C add esp,0Ch
Why not just push them directly? ie
push dword ptr [ebp-8]
Also, if you are going to do a separate push, why not do it manually? In other words, instead of doing "push eax" above, do
mov [esp], eax
and so on. The advantage of this is that after doing the 3 movs you can do a single subtract to set the new stack pointer, instead of implicitly subtracting three times with the pushes.
UPDATE---Release version
This is the same code compiled for release:
; 741 : memcpy( nodeNewLocation, pNode, sizeCurrentNode );
00087 8b 45 f8 mov eax, DWORD PTR _sizeCurrentNode$[ebp]
0008a 8b 7b 04 mov edi, DWORD PTR [ebx+4]
0008d 50 push eax
0008e 56 push esi
0008f 57 push edi
00090 e8 00 00 00 00 call _memcpy
00095 83 c4 0c add esp, 12 ; 0000000cH
Definitely more efficient than the debug version, but it is still doing a MOV/PUSH combo.
This is an optimization. It is explicitly mentioned in the Intel processor manuals, volume 4, section 12.3.3.6:
In Intel Atom microarchitecture, using PUSH/POP instructions to manage stack space
and address adjustment between function calls/returns will be more optimal than
using ENTER/LEAVE alternatives. This is because PUSH/POP will not need MSROM
flows and stack pointer address update is done at AGU.
When a callee function need to return to the caller, the callee could issue POP instruction
to restore data and restore the stack pointer from the EBP.
Assembly/Compiler Coding Rule 19. (MH impact, M generality) For Intel
Atom processors, favor register form of PUSH/POP and avoid using LEAVE; Use LEA
to adjust ESP instead of ADD/SUB.
The rest of the manual isn't that clear about the reason, but it does mention a possible 3 cycle AGU stall on implicit ESP adjustments.
I suspect it only does it in debug builds, or possibly in some situations where it's warranted by pipelining or other considerations (e.g. it could put a parameter into esi and use it after the call). I've looked into some binaries, and MSVC definitely does use such pushes:
push ebx ; mthd
push dword ptr [ebp+place+4]
push dword ptr [ebp+place] ; pos
push [ebp+filedes] ; fh
call __lseeki64_nolock
(code from the CRT)
As for the second question, instructions addressing esp are longer than pushes: "push eax" is one byte while "mov [esp-8], eax" is four bytes. In fact, this approach (mov instead of push) is used by GCC by default since a couple versions ago (option -maccumulate-outgoing-args) and it has led to notable increases in code size. Supposedly it makes code faster but I'm unconvinced.
I actually figured out the reason for it. It has to do with the way instructions are pipelined on the Pentium MMX. There are two pipelines, U and V, which allows MMX processors to process 2 instructions at a time IF they are pairable. PUSHs are not pairable with one another, but they are pairable with MOVs. So, if you write:
mov eax, [indirect]
mov esi, [indirect]
push eax
push esi
then what happens is that instructions #1 and #3 get paired and #2 and #4 get paired, so, effectively, these four instructions run in the same number of cycles as a single mov/push, and a single mov/push is faster than two push [indirect]s. This exact case is described in detail in Section 4.3, p. 41, Examples 4.11a and 4.11b, of the Microarchitecture optimization guide by Agner Fog, widely available on the internet.

Resources