Understanding ASM. Why does this work in Windows? - linux

A couple of friends and I are puzzling over a very strange issue. We encountered a crash in our application inside a small assembler routine (used to speed up processing). The error was caused by modifying the stack pointer and not resetting it at the end. It looked like this:
push ebp
mov ebp, esp
; do stuff here including sub and add on esp
pop ebp
Written correctly, it should be:
push ebp
mov ebp, esp
; do stuff here including sub and add on esp
mov esp,ebp
pop ebp
What puzzles us is: why does this work on Windows? We only found the error when we ported the application to Linux, where it crashed. On neither Windows nor Android (using the NDK) did we encounter any issues, so there we would never have found this error. Is there some kind of stack pointer recovery? Is there a protection against misusing the stack pointer?

The ebp/esp pattern sets up what is called a stack frame. Its purpose is to allocate variables on the stack and afterwards have a quick way to restore the stack pointer before the ret instruction. x86 CPUs since the 80186 also provide the enter / leave instructions, which compress these sequences into single instructions.
esp is the actual stack pointer used by the CPU when doing push/pop/call/ret.
ebp is a general-purpose register that, by convention, more or less all compilers use as a frame (base) pointer for addressing local storage.
If the mov esp, ebp instruction is missing, the stack will misbehave if esp != ebp when the CPU reaches pop ebp, but only then.
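To make the "only if esp != ebp" condition concrete, here is a toy model in C. It is only a sketch: the Machine struct, push, pop and run_frame are names invented for this illustration, the "stack" is a small array, and the body is assumed to leave exactly one extra dword behind.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16

/* Toy model of a 32-bit stack: it grows downward, one slot per dword. */
typedef struct {
    uint32_t mem[STACK_WORDS];
    int esp;        /* index of the current top of stack */
    uint32_t ebp;   /* frame pointer register (holds a slot index here) */
} Machine;

static void push(Machine *m, uint32_t v)
{
    assert(m->esp > 0);
    m->mem[--m->esp] = v;
}

static uint32_t pop(Machine *m)
{
    assert(m->esp < STACK_WORDS);
    return m->mem[m->esp++];
}

/* Runs the prologue, an unbalanced body, and the epilogue.
   Returns the value that ends up in ebp after "pop ebp". */
uint32_t run_frame(uint32_t caller_ebp, int restore_esp)
{
    Machine m = { {0}, STACK_WORDS, caller_ebp };
    push(&m, m.ebp);            /* push ebp */
    m.ebp = (uint32_t)m.esp;    /* mov ebp, esp */
    push(&m, 0xDEADBEEF);       /* body leaves one extra dword on the stack */
    if (restore_esp)
        m.esp = (int)m.ebp;     /* mov esp, ebp */
    m.ebp = pop(&m);            /* pop ebp */
    return m.ebp;
}
```

With restore_esp set, the caller's ebp comes back intact; without it, pop ebp picks up the leftover dword instead, and a following ret would likewise read the wrong slot as the return address.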

It seems the compiler takes care of your stack on Windows. The only explanation I can imagine is this:
Microsoft Visual C takes special care of functions that are __stdcall. Since the number of parameters is known at compile time, the compiler encodes the parameter byte count in the symbol name itself.
The __stdcall convention is mainly used by the Windows API, and it's a bit more compact than __cdecl. The main difference is that any given function has a hard-coded set of parameters, which cannot vary from call to call as it can in C (no "variadic functions").
see:
http://unixwiz.net/techtips/win32-callconv-asm.html
and:
https://en.wikipedia.org/wiki/X86_calling_conventions
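As an illustration of the byte count being encoded in the symbol name: on 32-bit Windows a __stdcall function is decorated as an underscore, the name, an @ sign and the number of parameter bytes, so MessageBoxA with four 4-byte arguments becomes _MessageBoxA@16. The helper below is a sketch in C that only builds such a string; stdcall_decorated_name is a made-up name, not a compiler or linker API.

```c
#include <assert.h>
#include <stdio.h>

/* Writes the 32-bit __stdcall decorated name "_name@bytes" into out.
   Returns the length written, or -1 if the buffer is too small. */
int stdcall_decorated_name(char *out, size_t out_size,
                           const char *name, unsigned param_bytes)
{
    assert(out != NULL && name != NULL);
    int n = snprintf(out, out_size, "_%s@%u", name, param_bytes);
    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}
```

Because the callee knows this byte count, it can pop its own arguments with ret N, whereas a __cdecl caller has to adjust esp itself after the call.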

Related

x86 segfault in "call" to function

I am working on a toy compiler. I used to allocate all memory with malloc, but since I never call free, I think it will be sufficient (and faster) to allocate a GB or so on the stack and then slowly use that buffer.
But... now I am segfaulting before anything interesting happens. It happens on about 30% of my test cases (all test cases are the same in this section tho). Pasted from GDB:
(gdb) disas
Dump of assembler code for function main:
0x0000000000400bf1 <+0>: push rbp
0x0000000000400bf2 <+1>: mov rbp,rsp
0x0000000000400bf5 <+4>: mov QWORD PTR [rip+0x2014a4],rsp # 0x6020a0
0x0000000000400bfc <+11>: sub rsp,0x7735940
0x0000000000400c03 <+18>: sub rsp,0x7735940
0x0000000000400c0a <+25>: sub rsp,0x7735940
0x0000000000400c11 <+32>: sub rsp,0x7735940
=> 0x0000000000400c18 <+39>: call 0x400fec <new_Main>
0x0000000000400c1d <+44>: mov r15,rax
0x0000000000400c20 <+47>: mov rax,r15
0x0000000000400c23 <+50>: add rax,0x20
0x0000000000400c27 <+54>: mov rax,QWORD PTR [rax]
0x0000000000400c2a <+57>: add rax,0x48
0x0000000000400c2e <+61>: mov rax,QWORD PTR [rax]
0x0000000000400c31 <+64>: call rax
0x0000000000400c33 <+66>: mov rax,0x0
0x0000000000400c3a <+73>: mov rsp,rbp
0x0000000000400c3d <+76>: pop rbp
0x0000000000400c3e <+77>: ret
I originally did one big "sub rsp, 0x..." and I thought breaking it up a bit would help (it didn't -- the program crashes at call either way). The total should be 500MB in this case.
What really confuses me is why it fails on "call <>" instead of one of the subs. And why it only fails some of the time rather than always or never.
Disclosure: this is a school project, but asking for help with general issues regarding x86 is not against any rules.
Update: based on @swift's comment, I set ulimit -s unlimited... and it now segfaults randomly? It seems random. It's not coming close to using the whole 500 MB buffer, though. It only allocates about 400 bytes total.
Subtracting something from RSP won’t cause any issues by itself, since nothing accesses memory at that point. RSP is just a register holding a value; changing it doesn’t allocate anything. But CALL pushes the return address to the memory RSP points at, and that access can fault if the address lies outside the usable stack. The stack usually isn’t very big, so to your question “is there any reason you can’t take a GB of memory from the stack” the answer is “because the stack doesn’t have that much space to be used.”
As for speed: allocating a big buffer on the stack isn’t really faster. Allocating and releasing a single big block of memory isn’t slower on the heap. Only lots of small allocations and releases cost more on the heap than on the stack. So there’s not much point in doing it on the stack in this case.
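For the "allocate once, never free" pattern described in the question, a heap-backed bump allocator is one way to get the same effect without depending on the stack limit. This is a minimal sketch in C; Arena, arena_init, arena_alloc and arena_destroy are names made up for this illustration, and rounding every offset up to a multiple of 16 is an assumption about what the generated code needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bump allocator: one big malloc, never freed piecemeal. */
typedef struct {
    unsigned char *base;
    size_t capacity;
    size_t used;
} Arena;

/* Returns 1 on success, 0 if the underlying malloc failed. */
int arena_init(Arena *a, size_t capacity)
{
    assert(a != NULL);
    a->base = malloc(capacity);
    a->capacity = capacity;
    a->used = 0;
    return a->base != NULL;
}

/* Hands out the next chunk, starting at an offset that is a multiple
   of 16, or returns NULL when the arena is exhausted. */
void *arena_alloc(Arena *a, size_t size)
{
    size_t start = (a->used + 15) & ~(size_t)15;
    if (start > a->capacity || size > a->capacity - start)
        return NULL;
    a->used = start + size;
    return a->base + start;
}

void arena_destroy(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->capacity = a->used = 0;
}
```

Each arena_alloc is just an add and a compare, which is about as cheap as moving the stack pointer, and running out of space shows up as a NULL return instead of a segfault at the next call.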

Read and display float or double number in assembly in Microsoft Visual Studio

I am using Microsoft Visual Studio 2015 to learn inline assembly programming. I have read a lot of posts on Stack Overflow, including the most relevant one, but after I tried the methods the result is still 0.0000. I first used a float and stored the value to the FPU, but the result is the same; I also tried passing the value in eax, and that did not help either.
Here is my code:
#include "stdafx.h"
int _tmain(int argc, _TCHAR* argv[])
{
char input1[] = "Enter number: \n";
char input_format[] = "%lf";
double afloat;
char output[] = "The number is: %lf";
_asm {
lea eax, input1
push eax
call printf_s
add esp, 4
lea eax, input_format
push eax
lea eax, afloat
push eax
call scanf_s
add esp, 8
sub esp, 8
fstp [esp]
lea eax, output
push eax
call printf
add esp, 12
}
return 0;
}
The result: the program prints 0.0000 (screenshot not included).
You are attempting to print the wrong value. In fact, the code should just be causing nonsense to be printed to the terminal. You got quite lucky that you see 0.0. Let's look specifically at the part of the code that retrieves the floating point value, which is your call to scanf_s:
lea eax, input_format
push eax
lea eax, afloat
push eax
call scanf_s
add esp, 8
sub esp, 8
fstp [esp]
First of all, I don't see any logic in adding 8 to your stack pointer (esp) and then immediately subtracting 8 from it. Performed back-to-back, these two operations just cancel each other out. As such, these two instructions can be deleted.
Second, you are pushing the arguments on the stack in the wrong order. The cdecl calling convention (used by the CRT functions, including printf_s and scanf_s) passes arguments in reverse order, from right to left. Therefore, to call scanf_s, you would first push the address of the floating-point value into which the input should be stored, and then push the address of the format control string buffer. Your code is wrong, because it pushes the arguments from left to right. You get lucky with printf_s, because you're only passing one argument, but because you're passing two arguments to scanf_s, bad things happen.
The third problem is that you appear to be assuming that scanf_s returns the output directly.
If it did, and you had requested a floating-point value, you would be correct that the cdecl calling convention would have it returning that value at the top of the floating-point stack, in ST(0), and thus you would be correctly popping that value and storing it in a memory location (fstp [esp]).
While scanf_s does return a value (an integer value indicating the number of fields that were successfully converted and assigned), it does not return the value from the standard input stream. There is, in fact, no way that it could do this, since it supports arbitrary types of inputs. This is why it uses a pointer to a memory location to store the value. You probably knew this already, since you arranged to pass that pointer as a parameter to the function.
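The split between the return value and the out-parameter can be seen in plain C. The sketch below uses the standard sscanf rather than scanf_s so that it runs without console input, and read_double is a name invented for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the number of fields converted (1 on success, 0 if the text is
   not a number, EOF on empty input). The parsed value itself is written
   through 'out', never returned. */
int read_double(const char *text, double *out)
{
    assert(text != NULL && out != NULL);
    return sscanf(text, "%lf", out);
}
```

After a call such as read_double("3.5", &d), the integer result lands in eax under cdecl, while the double sits in memory at &d. Nothing is left on the x87 stack.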
Now, why did you get an output of 0.0 in the final call to printf? Because of the fstp [esp] instruction. This pops the top value off of the floating-point stack, and stores it at the memory address contained in esp. I have already pointed out that scanf_s does not place any value(s) on the floating-point stack, so technically, it contains meaningless/garbage data. But in your case, you were lucky enough that ST(0) actually contained 0.0, so that is what got printed. Why does ST(0) contain 0.0? I'm guessing because this is a debug build that you're running, and the CRT is zeroing the stack. Or maybe because that's what FSTP pops off the stack when the stack is empty. I don't know, and I don't see that documented anywhere. But it doesn't really matter what happens when you write incorrect code, because you should strive to only write correct code!
Here's what correct code might look like:
; Get address of 'input1' buffer, and push it onto the stack
; in preparation of a call to 'printf_s'. Then, make the call.
lea eax, DWORD PTR [input1]
push eax
call printf_s
add esp, 4 ; clean up the stack after calling 'printf_s'
; Call 'scanf_s' to retrieve the floating-point input.
; Note that in the __cdecl calling convention, arguments are always
; passed on the stack in *reverse* order!
lea eax, DWORD PTR [afloat]
push eax
lea eax, DWORD PTR [input_format]
push eax
call scanf_s
; (no need to clean up the stack here; we're about to reuse that space!)
; The 'scanf_s' function stored the floating-point value that the user entered
; in the 'afloat' variable. So, we need to load this value onto the top
; of the floating point stack in order to pass it to 'printf_s'
fld QWORD PTR [afloat]
fstp QWORD PTR [esp]
; Get the address of the 'output' buffer and push it onto the stack,
; and then call 'printf_s'. Again, this is pushed last because
; __cdecl demands that arguments are passed right-to-left.
lea eax, DWORD PTR [output]
push eax
call printf_s
add esp, 12 ; clean up the stack after 'scanf_s' and 'printf_s'
Note that you could optimize this code further by deferring the stack cleanup after the initial call to printf_s. You can just wait and do the cleanup later at the end of the function. Functionally, this is equivalent, but an optimizing compiler will often choose to defer stack cleanup to produce more efficient code because it can interleave it within other time-consuming instructions.
Also note that you technically do not need the DWORD PTR directives that I've used in the code because the inline assembler (and MASM syntax in general) tries to read your mind and assemble the code that you meant to write. However, I like to write it to be explicit. It just means that the value you're loading is DWORD-sized (32 bits). All pointers on 32-bit x86 are 32 bits in size, as are int values and single-precision float values. QWORD means 64 bits, like an __int64 value or a double value.
Warning: When I first tested this code in MSVC using inline assembly, I couldn't get it to run. It worked fine when assembled separately using MASM, but when written using inline assembly, I couldn't execute it without getting an "access violation" error. In fact, when I tried your original code, I got the same error. Initially, I couldn't figure out how you were able to get your code to run, either!
Finally, I was able to diagnose the problem: by default, my MSVC project had been configured to dynamically link to the C runtime (CRT) library. The implementation of printf/printf_s is apparently doing some type of debug check that is causing this code to fail. I still am not entirely sure what the purpose of this validation code is, or exactly how it works, but it appears to be checking a certain offset within the stack for a sentinel value. Anyway, when I switched to statically linking to the CRT, everything runs as expected.
At least, when I compiled the code as a "release" build. In a "debug" build, the compiler can't tell that you need floating-point support (since all the floating-point stuff is in the inline assembly, which it can't parse), so it fails to tell the linker to link in the floating-point support from the CRT. As a result, the application bombs as soon as you run it and try to use scanf_s with a floating-point value. There are various ways of fixing this, but the simplest way would be to simply explicitly initialize the afloat value (it doesn't matter what you initialize it to; 0.0 or any other value would work just fine).
I suppose you are statically linking the CRT and running a release build, which is why your original code was executing. In that case, the code I've shown will both execute and return the correct results. However, if you're trying to learn assembly language, I strongly recommend avoiding inline assembly and writing functions directly in MASM. This is supported from within the Visual Studio environment, too; for setup instructions, see here (same for all versions).
Have you used a debugger to look at registers / memory?
fstp [esp] is extremely suspicious, because ST0 should be empty at that point. scanf returns an integer, and the calling convention requires the x87 stack to be empty on call/return, except for FP return values.
I forget what happens when you FST while the x87 stack is empty. If you get zero, that would explain it, since this is what you're passing to printf.
add esp, 8 / sub esp, 8 is completely redundant. There's no point in doing that. You can just take it out. (Or comment it out but leave it there as a reminder that you've optimized by reusing the arg space on the stack instead of popping it and pushing new stuff.)
Since scanf writes the double at the address you passed, you could avoid copying it by getting it to write it near the bottom of the stack, and then adjust ESP to right below it. Push a format string and you're ready to call printf, with the double already where it needs to be on the stack as the second arg.
sub esp, 8 ; reserve 8 bytes for storing a `double` on the stack
push esp ; push a pointer to the bottom of those 8 bytes we just reserved
push OFFSET input_format ; (only works if it's in static storage, but it's a string constant so you should have done that instead of initializing a non-const array with a literal inside `main`.)
call scanf ; scanf("%lf", ESP_before_we_pushed_args)
add esp, 8
; TODO: check return value and handle error
; esp now points at the double stored by scanf
push OFFSET output_format
call printf
add esp, 12 ; "pop" the args: format string and the double we made space for before the scanf call.
If the calling convention / ABI you're using requires that you make function calls with ESP 16B-aligned, you can't shorten it quite this much: you'll need an lea eax, [esp+4] / push eax or something instead of push esp, because the double can't be right above scanf's second arg but also be printf's second arg if both calls have the stack 16B-aligned. So have scanf store the double at a higher address, and add esp, 12 to reach it.
IDK if MSVC-style inline-asm guarantees any kind of stack alignment in the first place. Inline-asm seems to make this problem more complicated in some ways than just writing the whole function in asm.
Thank you for all the help and caveats offered. There are still problems I need to configure and fix, but after reading about most of the floating-point instructions on this website, I finally came up with a solution to read and display a floating-point number with Microsoft Visual Studio. I am sorry for all the bad conventions I've used in this post and the lack of comments; I will learn to program in a better style. Thanks again!
Here is my working code (different from the original post, since I want to use a float here):
#include "stdafx.h"
int _tmain(int argc, _TCHAR* argv[])
{
char input1[] = "Enter number: \n";
char input_format[] = "%f";
char output_format[] = "%.2f\n"; // for better output
float afloat;
char output[] = "The number is:";
_asm {
lea eax, input1
push eax
call printf_s
add esp, 4
call flushall; // flush all streams and clear all buffers
mov afloat, 0; // give the variable afloat a default
lea eax, afloat;
push eax;
lea eax, input_format; // read the floating number user inputs
push eax;
call scanf_s;
add esp, 8;
//leave space for more operations.
lea eax, output;
push eax;
call printf; // print the output message
add esp, 4;
sub esp, 8; // reserve stack for a double in stack
fld dword ptr afloat; // load float
fstp qword ptr[esp]; // IMPORTANT: convert to double and store, because printf expects a double not a float
lea eax, output_format; // import the floating number format '%f'
push eax;
call printf; // print the variable afloat
add esp, 12;
}
return 0;
}
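The fld dword / fstp qword pair in the code above mirrors what a C compiler does automatically for variadic calls: a float passed through "..." is promoted to double, which is why printf's %f always reads 8 bytes. A small sketch in C, where first_vararg is a made-up helper:

```c
#include <assert.h>
#include <stdarg.h>

/* Reads the first variadic argument as a double.
   'count' only exists because a variadic function needs a named parameter. */
double first_vararg(int count, ...)
{
    assert(count >= 1);
    va_list ap;
    va_start(ap, count);
    double value = va_arg(ap, double);  /* a float argument arrives as double */
    va_end(ap);
    return value;
}
```

Calling first_vararg(1, some_float) works because of that promotion. Hand-written assembly has to perform the widening itself, as the working code does.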

Making a stack frame with push/lea offset/sub instead of push/mov/sub?

I'm analysing a C++ function compiled with VC++ (probably VS10), and I have never seen this prologue pattern before.
It seems to be stdcall, but the prologue is a little different.
A stdcall function usually starts with the following prologue pattern:
push ebp
mov ebp, esp
sub esp, const
However the prologue of this function I'm analysing is the following:
push ebp
lea ebp, [esp - 0x4C]
sub esp, 0x80
Analysing other functions in the same PE that use the same prologue/epilogue, it seems the RETN always comes after a LEAVE instruction, which is another thing I have never seen in a regular cdecl function.
I'm wondering why the compiler did that. It already opens space on the stack (by sub esp, const), so why does it also offset EBP with lea ebp, [esp - const]?
Does anyone know why the compiler does that? Is it a different calling convention from stdcall?
I did some research on the net and studied this specific assembly code, but couldn't find out why it is needed.
Thanks in advance!
EDIT: screenshots of the prologue, the epilogue, and a call to the function (images not included).
As no one in the comments wrote an answer, here we go...
The reason for the difference in the prologue/epilogue between the "usual" stdcall and the one I describe in the question is a compiler optimization for code density.
By offsetting EBP in the prologue, the compiler can shorten the instructions in the function that access stack variables. A single-byte displacement covers EBP - 128 to EBP + 127, so by placing EBP inside the frame rather than at its top, the compiler can reach a larger part of the function's stack memory with one-byte displacements (how much depends on how far the prologue offsets EBP). Instructions that access variables outside that window still need a 4-byte displacement, but overall the function's code comes out shorter with this trick.
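The arithmetic behind the one-byte window can be checked with a few lines of C. This is only a sketch of the encoding limit, not of the compiler's actual heuristics; fits_disp8 and biased_disp are invented names, and 0x4C is the offset from the prologue shown in the question.

```c
#include <assert.h>

/* True if 'disp' can be encoded as a sign-extended 8-bit displacement. */
int fits_disp8(int disp)
{
    return disp >= -128 && disp <= 127;
}

/* 'slot' is an offset from the saved-EBP slot (where a plain "mov ebp, esp"
   would point). With "lea ebp, [esp - bias]" EBP sits 'bias' bytes lower,
   so the displacement actually encoded grows by 'bias'. */
int biased_disp(int slot, int bias)
{
    assert(bias >= 0);
    return slot + bias;
}
```

With a bias of 0x4C (76), the slots reachable with one byte run from 204 bytes below the saved EBP to 51 bytes above it, instead of the usual 128 below to 127 above. The window slides down into the locals at the cost of some reach into the arguments.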

Some inline assembler questions

I already asked a similar question here, but I still get some errors, so I hope you can tell me what I am doing wrong. I do know assembler: I have done several projects in 8051 assembler, and even though it is not the same, it is close to x86 asm.
Here is the block of code I tried in VC++ 2010 Express (I am trying to get information from the CPUID instruction):
int main()
{
char a[17]; //containing array for the CPUID string
a[16] = '\0'; //null termination for the std::cout
void *b=&a[0];
int c=0; //predefined value which need to be loaded into eax before cpuid
_asm
{
mov eax,c;
cpuid;
mov [b],eax;
mov [b+4],ebx;
mov [b+8],ecx;
mov [b+12],edx;
}
std::cout<<a;
}
So, to quickly summarize: I tried to create a void pointer to the first element of the array, and then, using indirect addressing, just move the values from the registers. But this approach gives me a "stack around the variable b is corrupted" run-time error, and I don't know why.
Please help. Thanks. This is just for study purposes; I know there are functions for CPUID....
EDIT: Also, how can you use direct addressing in the x86 VC++ 2010 inline assembler? The common syntax for an immediate load on the 8051 is mov dest,#number, but in VC++ asm it's mov dest,number without the # sign. So how do you tell the compiler you want to access the memory cell at address x directly?
The reason your stack is corrupted is that you're storing the value of eax in b itself, then storing the value of ebx at the memory location 4 bytes past b, and so on. In the inline assembler, [b+4] names the memory at the address of b plus 4, roughly *(int*)((char*)&b + 4) in C++ terms; it does not dereference the pointer stored in b.
You can see this if you watch b and single-step. As soon as you execute mov [b],eax, the value of b changes.
One way to fix the problem is to load the value of b into an index register and use indexed addressing:
mov edi,[b]
mov [edi],eax
mov [edi+4],ebx
mov [edi+8],ecx
mov [edi+12],edx
You don't really need b at all, to hold a pointer to a. You can load the index register directly with the lea (load effective address) instruction:
lea edi,a
mov [edi],eax
... etc ...
If you're fiddling with inline assembler, it's a good idea to open the Registers window in the debugger and watch how things change when you're single stepping.
You can also directly address the memory:
mov dword ptr a,eax
mov dword ptr a+4,ebx
... etc ...
However, direct addressing like that requires more code bytes than the indexed addressing in the previous example.
I think the above, with the lea (Load Effective Address) instruction and the direct addressing I showed, answers your final question.
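The same distinction, written in C: the pointer's value has to be loaded first, and the stores then go to the memory it points at. This is a sketch only; store_regs is a made-up helper that takes the four register values as an array instead of reading them from CPUID.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copies four 32-bit register values to the 16 bytes that dst points at. */
void store_regs(void *dst, const uint32_t regs[4])
{
    assert(dst != NULL);
    unsigned char *p = dst;              /* like: mov edi, [b] */
    for (int i = 0; i < 4; i++)
        memcpy(p + 4 * i, &regs[i], sizeof regs[i]);  /* like: mov [edi+4*i], reg */
}
```

The buggy inline assembly corresponds to writing at (char*)&b + 4*i instead, which overwrites b and whatever lies next to it on the stack; that is what the run-time check reports.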
The advice to open the Registers window in the debugger and watch how things change is not going to work in VC++ 2010 Express.
You might be just as surprised as I was to find out that VC++ 2010 Express is missing the Registers window. This is especially surprising since stepping through the disassembly works.
The only workaround I know is to open a watch window and type the register names in the Name field. Type in EAX EBX ECX EDX ESI EDI EIP ESP EBP EFL and CS DS ES SS FS GS if you want
ST1 ST2 ST3 ST4 ST5 ST6 ST7 also work in the watch window.
You will probably also want to set the Value to Hexadecimal by right clicking in the Watch window and checking Hexadecimal Display.

Trying to understand gcc's complicated stack-alignment at the top of main that copies the return address

Hi, I have disassembled some programs (Linux) that I wrote to better understand how they work, and I noticed that the main function always begins with:
lea ecx,[esp+0x4] ; I assume this is for getting the address of the first argument of main... why?
and esp,0xfffffff0 ; ??? is the compiler trying to align the stack pointer on 16 bytes ???
push DWORD PTR [ecx-0x4] ; I understand this pushes the return address... why?
push ebp
mov ebp,esp
push ecx ; why is ecx pushed too??
So my question is: why is all this work done?
I only understand the use of:
push ebp
mov ebp,esp
the rest seems useless to me...
I've had a go at it:
;# As you have already noticed, the compiler wants to align the stack
;# pointer on a 16 byte boundary before it pushes anything. That's
;# because certain instructions' memory access needs to be aligned
;# that way.
;# So in order to first save the original offset of esp (+4), it
;# executes the first instruction:
lea ecx,[esp+0x4]
;# Now alignment can happen. Without the previous insn the next one
;# would have made the original esp unrecoverable:
and esp,0xfffffff0
;# Next it pushes the return addresss and creates a stack frame. I
;# assume it now wants to make the stack look like a normal
;# subroutine call:
push DWORD PTR [ecx-0x4]
push ebp
mov ebp,esp
;# Remember that ecx is still the only value that can restore the
;# original esp. Since ecx may be garbled by any subroutine calls,
;# it has to save it somewhere:
push ecx
This is done to keep the stack aligned to a 16-byte boundary. Some instructions require certain data types to be aligned on as much as a 16-byte boundary. In order to meet this requirement, GCC makes sure that the stack is initially 16-byte aligned, and allocates stack space in multiples of 16 bytes. This can be controlled using the option -mpreferred-stack-boundary=num. If you use -mpreferred-stack-boundary=2 (for 2^2 = 4-byte alignment), this alignment code will not be generated because the stack is always at least 4-byte aligned. However, you could then have trouble if your program uses any data types that require stronger alignment.
According to the gcc manual:
On Pentium and PentiumPro, double and long double values should be aligned to an 8 byte boundary (see -malign-double) or suffer significant run time performance penalties. On Pentium III, the Streaming SIMD Extension (SSE) data type __m128 may not work properly if it is not 16 byte aligned.
To ensure proper alignment of this values on the stack, the stack boundary must be as aligned as that required by any value stored on the stack. Further, every function must be generated such that it keeps the stack aligned. Thus calling a function compiled with a higher preferred stack boundary from a function compiled with a lower preferred stack boundary will most likely misalign the stack. It is recommended that libraries that use callbacks always use the default setting.
This extra alignment does consume extra stack space, and generally increases code size. Code that is sensitive to stack space usage, such as embedded systems and operating system kernels, may want to reduce the preferred alignment to -mpreferred-stack-boundary=2.
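Both the and esp,0xfffffff0 instruction and the num in -mpreferred-stack-boundary=num come down to power-of-two arithmetic, which a short C sketch can show. The function names are invented for this example, and the stack address used to try it out (0xbffff7ac) is an arbitrary made-up value.

```c
#include <assert.h>
#include <stdint.h>

/* Rounds 'addr' down to a multiple of 'alignment' (a power of two),
   which is what "and esp, 0xfffffff0" does for alignment == 16. */
uint32_t align_down(uint32_t addr, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return addr & ~(alignment - 1);
}

/* -mpreferred-stack-boundary=num asks for 2^num bytes. */
uint32_t boundary_bytes(unsigned num)
{
    assert(num < 32);
    return (uint32_t)1 << num;
}
```

For example, align_down(0xbffff7ac, 16) gives 0xbffff7a0: the stack pointer only ever moves down, by at most 15 bytes, which is why the original value has to be saved somewhere first.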
The lea loads the original stack pointer (from before the call to main) into ecx, since the stack pointer is about to be modified. This is used for two purposes:
to access the arguments to the main function, since they are relative to the original stack pointer
to restore the stack pointer to its original value when returning from main
Even if every instruction worked perfectly with no speed penalty despite arbitrarily aligned operands, alignment would still increase performance. Imagine a loop referencing a 16-byte quantity that just overlaps two cache lines. Now, to load that one small value into the cache, two entire cache lines have to be evicted, and what if you need them in the same loop? The cache is so much faster than RAM that cache performance is always critical.
Also, there usually is a speed penalty to shift misaligned operands into the registers.
Given that the stack is being realigned, we naturally have to save the old stack pointer in order to find the parameters in the caller's frame and to return.
ecx is a temporary register so it has to be saved. Also, depending on optimization level, some of the frame linkage ops that don't seem strictly necessary to run the program might well be important in order to set up a trace-ready chain of frames.
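The cache-line argument can be made concrete with a one-line check in C. This is a sketch: straddles_cache_line is a made-up name, and the 64-byte line size used when trying it out is an assumption that holds for many, but not all, x86 CPUs.

```c
#include <assert.h>
#include <stdint.h>

/* True if the 'size' bytes starting at 'addr' span more than one cache line.
   'line' is the cache line size in bytes. */
int straddles_cache_line(uintptr_t addr, uintptr_t size, uintptr_t line)
{
    assert(size > 0 && line > 0);
    return addr / line != (addr + size - 1) / line;
}
```

A 16-byte object at offset 56 within a 64-byte line occupies bytes 56 to 71 and so touches two lines, while the same object at any 16-byte-aligned offset never does.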

Resources