Why calling detoured function via original pointer works? - detours

I'm using Microsoft Detours. I'm detouring CreateFileW() function.
Here is shortcut of my code that attaches a detour. Error handling etc omitted.
...
ptrTargetFunction = DetourFindFunction("kernel32.dll", "CreateFileW");
DetourTransactionBegin();
DetourUpdateThread(GetCurrentThread());
printf("[D] Before DetourAttach: ptrTargetFunction = %p\n", ptrTargetFunction); // 4
PDETOUR_TRAMPOLINE pRealTrampoline;
PVOID pRealTarget;
PVOID pRealDetour;
DetourAttachEx((PVOID*)&ptrTargetFunction, hook, &pRealTrampoline, &pRealTarget, &pRealDetour);
printf("[D] After DetourAttach: ptrTargetFunction = %p\n", ptrTargetFunction); // 9
printf("[D] \t pRealTrampoline = %p\n", pRealTrampoline);
printf("[D] \t pRealTarget = %p\n", pRealTarget);
printf("[D] \t pRealDetour = %p\n", pRealDetour);
...
Here is full code of hook() function. hook() is a detour function.
_TEXT SEGMENT
EXTERN InjectionFunction: PROC
EXTERN ptrTargetFunction: qword
hook PROC
push rsp
push rbx
push rcx
push rdx
push rsi
push rdi
push rbp
push r8
push r9
push r10
push r11
push r12
push r13
push r14
push r15
call InjectionFunction
pop r15
pop r14
pop r13
pop r12
pop r11
pop r10
pop r9
pop r8
pop rbp
pop rdi
pop rsi
pop rdx
pop rcx
pop rbx
pop rsp
mov rax, ptrTargetFunction // 36
push rax // 37
ret // 38
hook ENDP
_TEXT ENDS
END
As I understood, Microsoft Detours will change implementation of the target function (that will be detoured (CreateFileW)) by replacing some of first assembler instructions by jmp to detour (hook()) function. It also creates trampoline function that should be called if we want to call not detoured implementation of CreateFileW.
At lines 4 and 9 (singled out via comment) printed address of target function is the same. Variable ptrTargetFunction should contain address of changed implementation of CreateFileW.
When detoured function is called hook() is invoked. In the end of the hook() function we call target function by placing it's address on top of the stack and executing ret instruction (lines 36-37).
I suppose that execution flow should be something like: some_code->(detoured)CreateFileW()->hook()->(detoured)CreateFileW()->hook()->... infinite circle.
During debugging I realized that there is no infinite circle. Why?
Feel free to ask any questions or ask for rewriting question because of difficulties in understanding the problem:)

LONG DetourAttach(
_Inout_ PVOID * ppPointer,
_In_ PVOID pDetour
);
Detours replaces the first few instructions of the target function
with an unconditional jump to the user-provided detour function.
Instructions from the target function are placed in a trampoline. The
address of the trampoline is placed in a target pointer.
Source: https://github.com/microsoft/detours/wiki
So, as I understood, ptrTargetFunction will store pointer to trampoline.
I still don't realize why at lines 4 and 9 ptrTargetFunction has same values.

Related

Assembly LINUX, x86-64 - Calling convention

I am currently learning the basics of assembly language for LINUX OS, x84-64 processors, GCC compilers and I came across an example which tries to translate into assembly the following C function:
unsigned long fact(unsigned n)
{
if (n<=1)
return 1;
else
return n*fact(n-1);
}
This is the solution proposed:
.intel_syntax noprefix
.text
.global fact
.type fact, #function
fact: PUSH RBP
MOV RBP, RSP
CMP EDI, 1
JBE retour_un
PUSH RDI
DEC RDI
SUB RSP, 8
CALL fact
ADD RSP, 8
POP RDI
MUL RDI
JMP retour
retour_un: MOV RAX, 1
retour: POP RBP
RET
However, I don't exactly how this works. From what I have read the CALL instruction places on the stack the value of RIP and then jumps to wherever its argument shows. In this case, I understand up to the point where CALL fact is used. If this instruction is implemented everytime then everytime the program will start form the beginning until RDI==1 in which case it will jump to retour_un and everything that there is in between will never be executed. Could someone explain to me where I am mistaken and how this segment of assembly code actually works?

push/pop segmentation fault in simple multiplication function

my teacher is doing a crash course in assembly with us, and I have no experience in it whatsoever. I am supposed to write a simple function that takes four variables and calculates (x+y)-(z+a) and then prints out the answer. I know it's a simple problem, but after hours of research I am getting no where, any push in the right direction would be very helpful! I do need to use the stack, as I have more things to add to the program once I get past this point, and will have a lot of variables to store. I am compiling using nasm and gcc, in linux. (x86 64)
(side question, my '3' isn't showing up in register r10, but I am in linux so this should be the correct register... any ideas?)
Here is my code so far:
global main
extern printf
segment .data
mulsub_str db "(%ld * %ld) - (%ld * %ld) = %ld",10,0
data dq 1, 2, 3, 4
segment .text
main:
call multiplyandsubtract
pop r9
mov rdi, mulsub_str
mov rsi, [data]
mov rdx, [data+8]
mov r10, [data+16]
mov r8, [data+24]
mov rax, 0
call printf
ret
multiplyandsubtract:
;;multiplies first function
mov rax, [data]
mov rdi, [data+8]
mul rdi
mov rbx, rdi
push rbx
;;multiplies second function
mov rax, [data+16]
mov rsi, [data+24]
mul rsi
mov rbx, rsi
push rbx
;;subtracts function 2 from function 1
pop rsi
pop rdi
sub rdi, rsi
push rdi
ret
push in the right direction
Nice pun!
Your problem is that you apparently don't seem to know that ret is using the stack for the return address. As such push rdi; ret will just go to the address in rdi and not return to your caller. Since that is unlikely to be a valid code address, you get a nice segfault.
To return values from functions just leave the result in a register, standard calling conventions normally use rax. Here is a possible version:
global main
extern printf
segment .data
mulsub_str db "(%ld * %ld) - (%ld * %ld) = %ld",10,0
data dq 1, 2, 3, 4
segment .text
main:
sub rsp, 8
call multiplyandsubtract
mov r9, rax
mov rdi, mulsub_str
mov rsi, [data]
mov rdx, [data+8]
mov r10, [data+16]
mov r8, [data+24]
mov rax, 0
call printf
add rsp, 8
ret
multiplyandsubtract:
;;multiplies first function
mov rax, [data]
mov rdi, [data+8]
mul rdi
mov rbx, rdi
push rbx
;;multiplies second function
mov rax, [data+16]
mov rsi, [data+24]
mul rsi
mov rbx, rsi
push rbx
;;subtracts function 2 from function 1
pop rsi
pop rdi
sub rdi, rsi
mov rax, rdi
ret
PS: notice I have also fixed the stack alignment as per the ABI. printf is known to be picky about that too.
To return more than 64b from subroutine (rax is not enough), you can optionally drop the whole standard ABI convention (or actually follow it, there's surely a well defined way how to return more than 64b from subroutines), and use other registers until you ran out of them.
And once you ran out of spare return registers (or when you desperately want to use stack memory), you can follow the way C++ compilers do:
SUB rsp,<return_data_size + alignment>
CALL subroutine
...
MOV al,[rsp + <offset>] ; to access some value from returned data
; <offset> = 0 to return_data_size-1, as defined by you when defining
; the memory layout for returned data structure
...
ADD rsp,<return_data_size + alignment> ; restore stack pointer
subroutine:
MOV al,<result_value_1>
MOV [rsp + 8 + <offset>],al ; store it into allocated stack space
; the +8 is there to jump beyond return address, which was pushed
; at stack by "CALL" instruction. If you will push more registers/data
; at the stack inside the subroutine, you will have either to recalculate
; all offsets in following code, or use 32b C-like function prologue:
PUSH rbp
MOV rbp,rsp
MOV [rbp + 16 + <offset>],al ; now all offsets are constant relative to rbp
... other code ...
; epilogue code restoring stack
MOV rsp,rbp ; optional, when you did use RSP and didn't restore it yet
POP rbp
RET
So during executing the instructions of subroutine, the stack memory layout is like this:
rsp -> current_top_of_stack (some temporary push/pop as needed)
+x ...
rbp -> original rbp value (if prologue/epilogue code was used)
+8 return address to caller
+16 allocated space for returning values
+16+return_data_size
... padding to have rsp correctly aligned by ABI requirements ...
+16+return_data_size+alignment
... other caller stack data or it's own stack frame/return address ...
I'm not going to check how ABI defines it, because I'm too lazy, plus I hope this answer is understandable for you to explain the principle, so you will recognize which way the ABI works and adjust...
Then again, I would highly recommend to use rather many shorter simpler subroutines returning only single value (in rax/eax/ax/al), whenever possible, try to follow the SRP (Single Responsibility Principle). The above way will force you to define some return-data-structure, which may be too much hassle, if it's just some temporary thing and can be split into single-value subroutines instead (if performance is endangered, then probably inlining the whole subroutine will outperform even the logic of grouped returned values and single CALL).

nasm segmentation fault while using arg

extern puts
global main
section .text
main:
mov rax, rdi
label:
test rax, rax
je exit
push rsi
mov rdi, [rsi]
call puts
pop rsi
dec rax
add rsi, 8
jmp label
exit:
pop rsi
ret
I wrote nasm code like that. However segmentation fault occur in last. I can't understand why segmentation fault is occur.
rax is not guaranteed to be preserved across function calls, as it is used to return integer results from functions (in the case of puts "a nonnegative number on success, or EOF on error") You need to save the value of rax before calling puts, like you're doing with rsi, and restore it afterwards.
Obviously you want to get the command line parameters in a GCC environment on a 64-bit Linux, where they are passed according to the GCC calling convention which follows the Linux calling convention "System V AMD64 ABI".
Let's translate the program logic to C:
#include <stdio.h>
int main ( int argc, char** argv )
{
if (argc != 0)
{
do
{
puts (*argv);
argc--;
argv++;
} while (argc);
}
return;
}
The asm program doesn't return an exit code. That exit code should be in RAX when the function returns. BTW: argc is always >0 since the first string of argv holds the program name.
The main function is both "caller" (calls puts) and "callee" (returns to the GCC environment). As caller it has to preserve RAX and RSI before the call to puts and restore them when it needs them. A callee-saved register is not used. Don't forget to align the stack by 16.
This works:
extern puts
global main
section .text
main: ; RDI: argc, RSI: argv, stack is unaligned by 8
mov rax, rdi
label:
test rax, rax
je exit
push rbx ; Push 8 bytes to align the stack before the call
push rax ; Save it (caller-saved)
push rsi ; Save it (caller-saved)
mov rdi, [rsi] ; Argument for puts
call puts
pop rsi ; Restore it
pop rax ; Restore it
pop rbx ; "Unalign" the stack
dec rax
add rsi, 8
jmp label
exit:
; pop rsi ; Once too much
xor eax, eax ; RAX = 0 (return 0)
ret ; RAX: return value

Assembler: "Function Call" assembler code produced by VC++

I am by all means no assembler expert, and my knowledge on this topic is rather shallow, but I was curious on what the Microsoft VC++ Compiler does in a simple function call that does nothing else but returning a value.
Let us have the following function:
unsigned long __stdcall someFunction ( void * args) {
return 0;
}
Now, I know that with __stdcall calling convention the CALLEE is responsible for stack unwinding, and with __cdecl the CALLER of the function takes care of this. But for this example I would like to stick to the former.
With an non-optimized debug build I saw that the following output is being produced:
unsigned long __stdcall someFunction (void * args) {
00A31730 push ebp
00A31731 mov ebp,esp
00A31733 sub esp,0C0h
00A31739 push ebx
00A3173A push esi
00A3173B push edi
00A3173C lea edi,[ebp-0C0h]
00A31742 mov ecx,30h
00A31747 mov eax,0CCCCCCCCh
00A3174C rep stos dword ptr es:[edi]
return 0;
00A3174E xor eax,eax
}
00A31750 pop edi
00A31751 pop esi
00A31752 pop ebx
00A31753 mov esp,ebp
00A31755 pop ebp
00A31756 ret 4
I would thank anyone to explain this snippet of code for me if possible. I know that the xor statement actually resets the eax register to produce the zero return value. Also the ret 4 is self-explanatory to me. I think the edi, esi and ebx registers are pushed before and popped after to save the original state, so that the function can use them freely maybe. But for the rest - I have no clue.
Any answer is very much appreciated! :)
Thanks!
So you're asking what these lines do:
00A3173C lea edi,[ebp-0C0h]
00A31742 mov ecx,30h
00A31747 mov eax,0CCCCCCCCh
00A3174C rep stos dword ptr es:[edi]
In Visual C++ debugging runtime library, uninitialized stack memory is initialized to contain 0xCC bytes. This is what these instructions do.
At the beginning of the ASM code, there is the instruction sub esp,0C0h that allocates 0xC0 bytes for the stack. However, there is no local variables used in this function, so where does this come from? It's for Edit+Continue support: you're able to add local variables and continue debugging.
The 0xCC opcode means the INT 3 x86 assembly instruction, so if you try to execute that code (accidentally due to a bug), the program will throw an INT 3 exception which will be handled by the debugger or OS. So it's not just some random value.

Why so much stack space used for each recursion?

I have a simple recursive function RCompare() that calls a more complex function Compare() which returns before the recursive call. Each recursion level uses 248 bytes of stack space which seems like way more than it should. Here is the recursive function:
void CMList::RCompare(MP n1) // RECURSIVE and Looping compare function
{
auto MP ne=n1->mf;
while(StkAvl() && Compare(n1=ne->mb))
RCompare(n1); // Recursive call !
}
StkAvl() is a simple stack space check function that compares the address of an auto variable to the value of an address near the end of the stack stored in a static variable.
It seems to me that the only things added to the stack in each recursion are two pointer variables (MP is a pointer to a structure) and the stuff that one function call stores, a few saved registers, base pointer, return address, etc., all 32-bit (4 byte) values. There's no way that is 248 bytes is it?
I don't no how to actually look at the stack in a meaningful way in Visual Studio 2008.
Thanks
Added disassembly:
CMList::RCompare:
0043E000 push ebp
0043E001 mov ebp,esp
0043E003 sub esp,0E4h
0043E009 push ebx
0043E00A push esi
0043E00B push edi
0043E00C push ecx
0043E00D lea edi,[ebp-0E4h]
0043E013 mov ecx,39h
0043E018 mov eax,0CCCCCCCCh
0043E01D rep stos dword ptr es:[edi]
0043E01F pop ecx
0043E020 mov dword ptr [ebp-8],edx
0043E023 mov dword ptr [ebp-14h],ecx
0043E026 mov eax,dword ptr [n1]
0043E029 mov ecx,dword ptr [eax+20h]
0043E02C mov dword ptr [ne],ecx
0043E02F mov ecx,dword ptr [this]
0043E032 call CMList::StkAvl (41D46Fh)
0043E037 test eax,eax
0043E039 je CMList::RCompare+63h (43E063h)
0043E03B mov eax,dword ptr [ne]
0043E03E mov ecx,dword ptr [eax+1Ch]
0043E041 mov dword ptr [n1],ecx
0043E044 mov edx,dword ptr [n1]
0043E047 mov ecx,dword ptr [this]
0043E04A call CMList::Compare (41DA05h)
0043E04F movzx edx,al
0043E052 test edx,edx
0043E054 je CMList::RCompare+63h (43E063h)
0043E056 mov edx,dword ptr [n1]
0043E059 mov ecx,dword ptr [this]
0043E05C call CMList::RCompare (41EC9Dh)
0043E061 jmp CMList::RCompare+2Fh (43E02Fh)
0043E063 pop edi
0043E064 pop esi
0043E065 pop ebx
0043E066 add esp,0E4h
0043E06C cmp ebp,esp
0043E06E call #ILT+5295(__RTC_CheckEsp) (41E4B4h)
0043E073 mov esp,ebp
0043E075 pop ebp
0043E076 ret
Why 0E4h?
More Info:
class mch // match node structure
{
public:
T_FSZ c1,c2; // file indexes
T_MSZ sz; // match size
enum ntyp typ; // type of node
mch *mb,*mf; // pointers to next and previous match nodes
};
typedef mch * MP; // for use in casting (MP) x
Should be a plain old pointer right? The same pointers are in the structure itself and they are just normal 4 byte pointers.
Edit: Added:
#pragma check_stack(off)
void CMList::RCompare(MP n1) // RECURSIVE and Looping compare function
{
auto MP ne=n1->mf;
while(StkAvl() && Compare(n1=ne->mb))
RCompare(n1); // Recursive call !
} // end RCompare()
#pragma check_stack()
But it didn't change anything. :(
Now what?
Note that on debug mode the compiler binds many bytes from the stack, on every function,
to catch buffer overflow bugs.
0043E003 sub esp, 0E4h ; < -- bound 228 bytes
...
0043E00D lea edi,[ebp-0E4h]
0043E013 mov ecx, 39h
0043E018 mov eax, 0CCCCCCCCh ; <-- sentinel
0043E01D rep stos dword ptr es:[edi] ; <-- write sentinels
Edit: the OP Harvey found the pragma that turns on/off stack probes.
check_stack
Instructs the compiler to turn off
stack probes if off (or –) is
specified,
or to turn on stack probes
if on (or +) is specified.
#pragma check_stack([ {on | off}] )
#pragma check_stack{+ | –}
Update: well, probes is another story, as it appears.
Try this: /GZ (Enable Stack Frame Run-Time Error Checking)
That also depends on the compiler and the architecture you're running - e.g. it could be aligning to 256 bytes for faster execution, so that each level uses the 8 bytes of the variable + 248 padding.
I guess some space has to be allocated for exception handling. Did you look at the disassembly?
In Visual Studio you can look at the register "esp", the stack pointer, in a watch (or register) windows. Set a breakpoint in your function between one call and the next to see who much stack you consume.
On a pain function on debug mode in Visual Studio 2008 it is 16 byes per function call.

Resources