Faking ASM Return Address?

Would it be possible to fake the return address at ebp + 4?
I'm currently writing a DLL to inject into a game, where it calls game functions to do things. The functions I call check the return address against the program itself, and if it's outside their base, they detect it.
So basically, is there any way to fake the return address?
It works like this:
if ( (_BYTE *)retaddr - (_BYTE *)unusedPadding >= (unsigned int)&byte_A6132A )
{
dword_11E59F8 |= 0x200000u;
dword_162D06C = 0;
result = (void (*)())sub_51FEE0(dword_11E59FC, v5, (_BYTE *)retaddr - (_BYTE *)unusedPadding, ebx0, edi0, a1);
}

Better way:
push returnshere
push your_second_argument
push your_first_argument
push address_of_fragment_in_exe
jmp function_you_want_to_call
returnshere:
; more code
Where address_of_fragment_in_exe is the address of this fragment:
add esp, 8 ;4 * number of arguments pushed
ret
This has the advantage of not editing the game binary. I've seen more than one game that checksummed itself, so if you edited it, even in slack space, you were in trouble. If they went to so much trouble as to verify that calls come from the game binary, then they likely have defenses against the game binary being edited. Just be glad they don't trace the call graph.

You mean so on entry to the called function, the return address at [esp] is not the actual call site?
You can emulate call by pushing whatever you want and then jumping. Of course, the function you call this way will return to the return address you give it. There would also be a significant perf penalty for mismatched call/ret, because you'll break the CPU's return-address predictor stack.
Can you put some trampoline functions at an acceptable address range, and call through them? Although that's really inconvenient in a 32bit ABI which passes args on the stack. I guess you'd have to pass an extra arg to the trampoline function that it could use to stash the return address, instead of copying all the args.
So the trampoline could be something like:
mov [esp+20], esi ; save esi in a dummy arg slot
mov esi, [esp] ; save the orig return address in a call-preserved reg
add esp, 4 ; call with the original args
call lib_function
push esi ; restore the orig return address
mov esi, [esp+20] ; and restore esi
ret ; return to orig return address
That's not wonderful, and takes a lot of code for something that needs to be duplicated for every library function. (And it doesn't make a stack frame, so might hurt debugging / exception handling?) For functions without many args it might be shorter to do something like
push [esp+8] ; 2nd arg
push [esp+8] ; 1st arg
call lib_function
add esp, 8
ret
Using indirect calls would let you use the same trampoline for multiple functions, but at the cost of branch mispredicts if there isn't a simple short pattern.
And of course none of these trampolines can work unless you can stick them in memory at an address the library will accept calls from.

Related

Read and display float or double number in assembly in Microsoft Visual Studio

I am using Microsoft Visual Studio 2015 to learn inline assembly programming. I have read a lot of posts on Stack Overflow, including the most relevant one, but after trying the methods described there the result is still 0.0000. I first used a float and stored the value to the FPU, but the result was the same; I also tried passing the value in eax, which did not help either.
Here is my code:
#include "stdafx.h"
int _tmain(int argc, _TCHAR* argv[])
{
char input1[] = "Enter number: \n";
char input_format[] = "%lf";
double afloat;
char output[] = "The number is: %lf";
_asm {
lea eax, input1
push eax
call printf_s
add esp, 4
lea eax, input_format
push eax
lea eax, afloat
push eax
call scanf_s
add esp, 8
sub esp, 8
fstp [esp]
lea eax, output
push eax
call printf
add esp, 12
}
return 0;
}
The result:

You are attempting to print the wrong value. In fact, the code should just be causing nonsense to be printed to the terminal. You got quite lucky that you see 0.0. Let's look specifically at the part of the code that retrieves the floating point value, which is your call to scanf_s:
lea eax, input_format
push eax
lea eax, afloat
push eax
call scanf_s
add esp, 8
sub esp, 8
fstp [esp]
First of all, I don't see any logic in adding 8 to your stack pointer (esp) and then immediately subtracting 8 from the stack pointer. Performing these two operations back-to-back simply cancels them out. As such, these two instructions can be deleted.
Second, you are pushing the arguments on the stack in the wrong order. The cdecl calling convention (used by the CRT functions, including printf_s and scanf_s) passes arguments in reverse order, from right to left. Therefore, to call scanf_s, you would first push the address of the floating-point value into which the input should be stored, and then push the address of the format control string buffer. Your code is wrong, because it pushes the arguments from left to right. You get lucky with printf_s, because you're only passing one argument, but because you're passing two arguments to scanf_s, bad things happen.
The third problem is that you appear to be assuming that scanf_s returns the output directly.
If it did, and you had requested a floating-point value, you would be correct that the cdecl calling convention would have it returning that value at the top of the floating-point stack, in ST(0), and thus you would be correctly popping that value and storing it in a memory location (fstp [esp]).
While scanf_s does return a value (an integer value indicating the number of fields that were successfully converted and assigned), it does not return the value from the standard input stream. There is, in fact, no way that it could do this, since it supports arbitrary types of inputs. This is why it uses a pointer to a memory location to store the value. You probably knew this already, since you arranged to pass that pointer as a parameter to the function.
Now, why did you get an output of 0.0 in the final call to printf? Because of the fstp [esp] instruction. This pops the top value off of the floating-point stack, and stores it in the memory address contained in esp. I have already pointed out that scanf_s does not place any value(s) on the floating-point stack, so technically, it contains meaningless/garbage data. But in your case, you were lucky enough that ST(0) actually contained 0.0, so that is what got printed. Why does ST(0) contain 0.0? I'm guessing because this is a debug build that you're running, and the CRT is zeroing the stack. Or maybe because that's what FSTP pops off the stack when the stack is empty. I don't know, and I don't see that documented anywhere. But it doesn't really matter what happens when you write incorrect code, because you should strive to only write correct code!
Here's what correct code might look like:
; Get address of 'input1' buffer, and push it onto the stack
; in preparation of a call to 'printf_s'. Then, make the call.
lea eax, DWORD PTR [input1]
push eax
call printf_s
add esp, 4 ; clean up the stack after calling 'printf_s'
; Call 'scanf_s' to retrieve the floating-point input.
; Note that in the __cdecl calling convention, arguments are always
; passed on the stack in *reverse* order!
lea eax, DWORD PTR [afloat]
push eax
lea eax, DWORD PTR [input_format]
push eax
call scanf_s
; (no need to clean up the stack here; we're about to reuse that space!)
; The 'scanf_s' function stored the floating-point value that the user entered
; in the 'afloat' variable. So, we need to load this value onto the top
; of the floating point stack in order to pass it to 'printf_s'
fld QWORD PTR [afloat]
fstp QWORD PTR [esp]
; Get the address of the 'output' buffer and push it onto the stack,
; and then call 'printf_s'. Again, this is pushed last because
; __cdecl demands that arguments are passed right-to-left.
lea eax, DWORD PTR [output]
push eax
call printf_s
add esp, 12 ; clean up the stack after 'scanf_s' and 'printf_s'
Note that you could optimize this code further by deferring the stack cleanup after the initial call to printf_s. You can just wait and do the cleanup later at the end of the function. Functionally, this is equivalent, but an optimizing compiler will often choose to defer stack cleanup to produce more efficient code because it can interleave it within other time-consuming instructions.
Also note that you technically do not need the DWORD PTR directives that I've used in the code because the inline assembler (and MASM syntax in general) tries to read your mind and assemble the code that you meant to write. However, I like to write it to be explicit. It just means that the value you're loading is DWORD-sized (32 bits). All pointers on 32-bit x86 are 32 bits in size, as are int values and single-precision float values. QWORD means 64 bits, like an __int64 value or a double value.
Warning: When I first tested this code in MSVC using inline assembly, I couldn't get it to run. It worked fine when assembled separately using MASM, but when written using inline assembly, I couldn't execute it without getting an "access violation" error. In fact, when I tried your original code, I got the same error. Initially, I couldn't figure out how you were able to get your code to run, either!
Finally, I was able to diagnose the problem: by default, my MSVC project had been configured to dynamically link to the C runtime (CRT) library. The implementation of printf/printf_s is apparently doing some type of debug check that is causing this code to fail. I still am not entirely sure what the purpose of this validation code is, or exactly how it works, but it appears to be checking a certain offset within the stack for a sentinel value. Anyway, when I switched to statically linking to the CRT, everything runs as expected.
At least, when I compiled the code as a "release" build. In a "debug" build, the compiler can't tell that you need floating-point support (since all the floating-point stuff is in the inline assembly, which it can't parse), so it fails to tell the linker to link in the floating-point support from the CRT. As a result, the application bombs as soon as you run it and try to use scanf_s with a floating-point value. There are various ways of fixing this, but the simplest way would be to simply explicitly initialize the afloat value (it doesn't matter what you initialize it to; 0.0 or any other value would work just fine).
I suppose you are statically linking the CRT and running a release build, which is why your original code was executing. In that case, the code I've shown will both execute and return the correct results. However, if you're trying to learn assembly language, I strongly recommend avoiding inline assembly and writing functions directly in MASM. This is supported from within the Visual Studio environment, too; for setup instructions, see here (same for all versions).

Have you used a debugger to look at registers / memory?
fstp [esp] is extremely suspicious, because ST0 should be empty at that point. scanf returns an integer, and the calling convention requires the x87 stack to be empty on call/return, except for FP return values.
I forget what happens when you FST while the x87 stack is empty. If you get zero, that would explain it, since this is what you're passing to printf.
add esp, 8 / sub esp, 8 is completely redundant. There's no point in doing that. You can just take it out. (Or comment it out but leave it there as a reminder that you've optimized by reusing the arg space on the stack instead of popping it and pushing new stuff.)
Since scanf writes the double at the address you passed, you could avoid copying it by getting it to write it near the bottom of the stack, and then adjust ESP to right below it. Push a format string and you're ready to call printf, with the double already where it needs to be on the stack as the second arg.
sub esp, 8 ; reserve 8 bytes for storing a `double` on the stack
push esp ; push a pointer to the bottom of that 8B we just reserved
push OFFSET input_format ; (only works if it's in static storage, but it's a string constant so you should have done that instead of initializing a non-const array with a literal inside `main`.)
call scanf ; scanf("%lf", ESP_before_we_pushed_args)
add esp, 8
; TODO: check return value and handle error
; esp now points at the double stored by scanf
push OFFSET output_format
call printf
add esp, 12 ; "pop" the args: format string and the double we made space for before the scanf call.
If the calling convention / ABI you're using requires that you make function calls with ESP 16B-aligned, you can't shorten it quite this much: you'll need an lea eax, [esp+4] / push eax or something instead of push esp, because the double can't be right above scanf's second arg but also be printf's second arg if both calls have the stack 16B-aligned. So have scanf store the double at a higher address, and add esp, 12 to reach it.
IDK if MSVC-style inline-asm guarantees any kind of stack alignment in the first place. Inline-asm seems to make this problem more complicated in some ways than just writing the whole function in asm.

Thank you for all the help and caveats offered. There are still problems I need to sort out and fix, but after reading about most of the floating-point instructions on this website, I finally came up with a solution to read and display a floating-point number in Microsoft Visual Studio. I am sorry for all the bad conventions I used in this post and for the lack of comments; I will learn to program in a better style. Thanks again!
Here is my working code (different from the original post, since I want to use a single-precision float here):
#include "stdafx.h"
int _tmain(int argc, _TCHAR* argv[])
{
char input1[] = "Enter number: \n";
char input_format[] = "%f";
char output_format[] = "%.2f\n"; //for better output
float afloat;
char output[] = "The number is:";
_asm {
lea eax, input1
push eax
call printf_s
add esp, 4
call flushall; // flush all streams and clear all buffers
mov afloat, 0; // give the variable afloat a default
lea eax, afloat;
push eax;
lea eax, input_format; // read the floating number user inputs
push eax;
call scanf_s;
add esp, 8;
//leave space for more operations.
lea eax, output;
push eax;
call printf; // print the output message
add esp, 4;
sub esp, 8; // reserve stack for a double in stack
fld dword ptr afloat; // load float
fstp qword ptr[esp]; // IMPORTANT: convert to double and store, because printf expects a double not a float
lea eax, output_format; // import the floating number format '%f'
push eax;
call printf; // print the variable afloat
add esp, 12;
}
return 0;
}
The result looks like this:

Understanding assembly language _start label in a C program

I had written a simple c program and was trying to do use GDB to debug the program. I understand the use of following in main function:
On entry
push %ebp
mov %esp,%ebp
On exit
leave
ret
Then I tried gdb on _start and I got the following
xor %ebp,%ebp
pop %esi
mov %esp,%ecx
and $0xfffffff0,%esp
push %eax
push %esp
push %edx
push $0x80484d0
push $0x8048470
push %ecx
push %esi
push $0x8048414
call 0x8048328 <__libc_start_main@plt>
hlt
nop
nop
nop
nop
I am unable to understand these lines, and the logic behind this.
Can someone provide any guidance to help explain the code of _start?

Here is the well commented assembly source of the code you posted.
Summarized, it does the following things:
Establish a sentinel stack frame with ebp = 0 so code that walks the stack can find its end easily
Pop the number of command line arguments into esi so we can pass them to __libc_start_main
Align the stack pointer to a multiple of 16 bytes in order to comply with the ABI. This is not guaranteed to be the case in some versions of Linux so it has to be done manually just in case.
The addresses of __libc_csu_fini, __libc_csu_init, the argument vector, the number of arguments and the address of main are pushed as arguments to __libc_start_main
__libc_start_main is called. This function (source code here) sets up some glibc-internal variables and eventually calls main. It never returns.
If for any reason __libc_start_main should return, a hlt instruction is placed afterwards. This instruction is not allowed in user code and should cause the program to crash (hopefully).
The final series of nop instructions is padding inserted by the assembler so the next function starts at a multiple of 16 bytes for better performance. It is never reached in normal execution.

For GNU tools, the _start label is the entry point of the program. For the C language to work, you need a stack, you need some memory/variables zeroed, and you need some set to the values you chose:
int x = 5;
int y;
int fun ( void )
{
static int z;
}
All three of these variables, x, y and z, are essentially global; one is a local global. Since we wrote it that way, we assume that when our program starts x contains the value 5, and that y is zero. For those things to happen, some bootstrap code is required, and that is what happens (and more) between _start and main().
Other toolchains may use a different label to define the entry/start point, but GNU tools use _start. There may be other things your tools require before main() is called; C++, for example, requires more than C.

Why does Linux save %ebp when doing a context switch?

When doing a context switch, x86 Linux (very cleverly) avoids saving and restoring EAX, EBX, ECX, EDX, ESI, and EDI. Of course, the userland values are saved on the kernel stack when switching into kernel mode. But the values in the kernel code are not saved -- instead, GCC directives are used which tell the compiler not to keep any values which are needed in those registers at the point where the switch happens.
Naturally, ESP has to be saved and restored. But this is what I don't understand: before ESP is switched, EBP is pushed on the kernel stack. I would think that EBP was being used as a frame pointer, but in my kernel debugger, the values sure don't look like it:
(gdb) print $esp
$22 = (void *) 0xc0025ec0
(gdb) print $ebp
$23 = (void *) 0xcf827f3c
The difference is way too big for EBP to be a frame pointer here. A comment in the code says that "EBP is saved/restored explicitly for wchan access", but I'm searching the code and can't figure out how that is so. Google isn't helping either. Can some kernel wizard step in and help here?

The difference is way too big for EBP to be a frame pointer here.
Presumably you have compiled your kernel without frame pointers enabled. See the relevant config option:
config SCHED_OMIT_FRAME_POINTER
def_bool y
prompt "Single-depth WCHAN output"
depends on X86
---help---
Calculate simpler /proc/<PID>/wchan values. If this option
is disabled then wchan values will recurse back to the
caller function. This provides more accurate wchan values,
at the expense of slightly more scheduling overhead.
If in doubt, say "Y".
The function get_wchan will do a sanity check on the ebp value, and only use it if it seems to be a frame pointer.
I think it would be better to use the above config flag in both places, so that ebp would not be saved unnecessarily if it isn't a frame pointer, and get_wchan would not bother either if we knew there wouldn't be a frame pointer. That said, saving/restoring ebp adds very little overhead, so it's not tragic.

I have figured it out. EBP is a frame pointer, but at the point that I checked its value, ESP had already been switched to the new process' kernel stack, but EBP had not yet been restored (so it still had the value from the previous process). Sorry!!
The reason for storing the frame pointer is so that others can determine where in the kernel code a process went to sleep. Among other things, this is used by /proc/PID/wchan, which prints the name of the kernel function which made a process sleep.
The code which checks this is as follows (details removed for brevity):
unsigned long get_wchan(struct task_struct *p)
{
unsigned long sp, bp, ip;
int count = 0;
sp = p->thread.sp;
bp = *(unsigned long *) sp;
do {
ip = *(unsigned long *) (bp+4);
if (!in_sched_functions(ip))
return ip;
bp = *(unsigned long *) bp;
} while (count++ < 16);
return 0;
}
Since EBP is pushed right before switching kernel stacks, the stack pointer of a sleeping process will point to the saved EBP (frame pointer) value. That frame pointer points to the caller's saved frame pointer, which points to the previous caller's, which points to the previous caller's... in other words, the saved frame pointers form a linked list going back up the call stack.
The frame pointer is saved immediately on function entry, so the value just above it (4 bytes up) is the return address to the calling function.
The loop in get_wchan walks that "linked list" (bp = *bp), checking the return address above each saved frame pointer, until it finds an address within a function like ep_poll or futex_wait_queue_me.
get_wchan just returns an address inside a function; for display in /proc, lookup_symbol_name is used to convert that address into a function name.

Can anybody explain some simple assembly code?

I have just started to learn assembly. This is the dump from gdb for a simple program which prints hello ranjit.
Dump of assembler code for function main:
0x080483b4 <+0>: push %ebp
0x080483b5 <+1>: mov %esp,%ebp
0x080483b7 <+3>: sub $0x4,%esp
=> 0x080483ba <+6>: movl $0x8048490,(%esp)
0x080483c1 <+13>: call 0x80482f0 <puts@plt>
0x080483c6 <+18>: leave
0x080483c7 <+19>: ret
My questions are :
1. Why is ebp pushed onto the stack every time at the start of the program? What is in ebp that is necessary to run this program?
2. In the second line, why is ebp copied to esp?
3. I can't understand the third line at all. What I know about SUB syntax is "sub dest,source", but how can esp be subtracted from 4 and stored in 4?
4. What is this value "$0x8048490"? Why is it moved to esp, and why is esp enclosed in brackets this time? Does that denote something different from esp without brackets?
5. The next line is the call to a function, but what is this "0x80482f0"?
6. What are leave and ret (maybe ret means returning to libc)?
operating system : ubuntu 10, compiler : gcc

ebp is used as a frame pointer in Intel processors (assuming you're using a calling convention that uses frames).
It provides a known point of reference for locating passed-in parameters (on one side) and local variables (on the other) no matter what you do with the stack pointer while your function is active.
The sequence:
push %ebp # save caller's frame pointer
mov %esp,%ebp # create a new frame pointer
sub $N,%esp # make space for locals
saves the frame pointer for the previous stack frame (the caller), loads up a new frame pointer, then adjusts the stack to hold things for the current "stack level".
Since parameters would have been pushed before setting up the frame, they can be accessed with [ebp+N], where N is a suitable offset.
Similarly, because locals are created "under" the frame pointer, they can be accessed with [ebp-N].
The leave instruction is a single one which undoes that stack frame. You used to have to do it manually but Intel introduced a faster way of getting it done. It's functionally equivalent to:
mov %ebp, %esp # restore the old stack pointer
pop %ebp # and frame pointer
(the old, manual way).
Answering the questions one by one in case I've missed something:
1. To start a new frame. See above.
2. It isn't. esp is copied to ebp. This is AT&T notation (the %reg is a dead giveaway) where (among other things) source and destination operands are swapped relative to Intel notation.
3. See answer to (2) above. You're subtracting 4 from esp, not the other way around.
4. It's a parameter being passed to the function at 0x80482f0. It's not being loaded into esp but into the memory pointed at by esp. In other words, it's being pushed on the stack. Since the function being called is puts (see (5) below), it will be the address of the string you want putsed.
5. The function name is in the <> after the address. It's calling the puts function (probably the one in the standard library, though that's not guaranteed). For a description of what the PLT is, see here.
6. I've already explained leave above as unwinding the current stack frame before exiting. The ret simply returns from the current function. If the current function is main, it's going back to the C startup code.

In my career I learned several assembly languages, you didn't mention which but it appears Intel x86 (segmented memory model as PaxDiablo pointed out). However, I have not used assembly since last century (lucky me!). Here are some of your answers:
1. The EBP register is pushed onto the stack at the beginning because we need it further along in other operations of the routine. You don't want to just discard its original value, thus corrupting the integrity of the rest of the application.
2. If I remember correctly (I may be wrong, it's been a long time) it is the other way around: we are moving %esp INTO %ebp. Remember we saved it in the previous line? Now we are storing some new value without destroying the original one.
3. Actually they are SUBtracting the value of four (4) FROM the contents of the %esp register. The resulting value is not stored in "four" but in %esp. If %esp had 0xFFF8, after the SUB it will contain 0xFFF4. I think this is called "Immediate" if my memory serves me. What is happening here (I reckon) is the computation of a memory address (4 bytes less).
4. The value $0x8048490 I don't know. However, it is NOT being moved INTO %esp but rather INTO THE ADDRESS POINTED TO BY THE CONTENTS OF %esp. That is why the notation is (%esp) rather than %esp. This is a fairly common notation in all the assembly languages I came across in my career. If on the other hand the right operand was simply %esp, then the value would have been moved INTO the %esp register. Basically the %esp register's contents are being used for addressing.
5. It is a fixed value, and the string on the right makes me think that this value is actually the address of the puts() (Put String) compiler library routine.
6. "leave" is an instruction that is the equivalent of "mov %ebp,%esp" followed by "pop %ebp". Remember we saved the contents of %ebp at the beginning; now that we are done with the routine we are restoring it back into the register so that the caller gets back to its context. The "ret" instruction is the final instruction of the routine; it "returns" to the caller.

Trying to understand gcc's complicated stack-alignment at the top of main that copies the return address

Hi, I have disassembled some programs (Linux) I wrote to better understand how they work, and I noticed that the main function always begins with:
lea ecx,[esp+0x4] ; I assume this is for getting the address of the first argument of main...why?
and esp,0xfffffff0 ; ??? is the compiler trying to align the stack pointer on 16 bytes ???
push DWORD PTR [ecx-0x4] ; I understand the assembler is pushing the return address...why?
push ebp
mov ebp,esp
push ecx ; why is ecx pushed too??
So my question is: why is all this work done?
I only understand the use of:
push ebp
mov ebp,esp
the rest seems useless to me...

I've had a go at it:
;# As you have already noticed, the compiler wants to align the stack
;# pointer on a 16 byte boundary before it pushes anything. That's
;# because certain instructions' memory access needs to be aligned
;# that way.
;# So in order to first save the original offset of esp (+4), it
;# executes the first instruction:
lea ecx,[esp+0x4]
;# Now alignment can happen. Without the previous insn the next one
;# would have made the original esp unrecoverable:
and esp,0xfffffff0
;# Next it pushes the return address and creates a stack frame. I
;# assume it now wants to make the stack look like a normal
;# subroutine call:
push DWORD PTR [ecx-0x4]
push ebp
mov ebp,esp
;# Remember that ecx is still the only value that can restore the
;# original esp. Since ecx may be garbled by any subroutine calls,
;# it has to save it somewhere:
push ecx

This is done to keep the stack aligned to a 16-byte boundary. Some instructions require certain data types to be aligned on as much as a 16-byte boundary. In order to meet this requirement, GCC makes sure that the stack is initially 16-byte aligned, and allocates stack space in multiples of 16 bytes. This can be controlled using the option -mpreferred-stack-boundary=num. If you use -mpreferred-stack-boundary=2 (for a 2^2 = 4-byte alignment), this alignment code will not be generated because the stack is always at least 4-byte aligned. However you could then have trouble if your program uses any data types that require stronger alignment.
According to the gcc manual:
On Pentium and PentiumPro, double and long double values should be aligned to an 8 byte boundary (see -malign-double) or suffer significant run time performance penalties. On Pentium III, the Streaming SIMD Extension (SSE) data type __m128 may not work properly if it is not 16 byte aligned.
To ensure proper alignment of this values on the stack, the stack boundary must be as aligned as that required by any value stored on the stack. Further, every function must be generated such that it keeps the stack aligned. Thus calling a function compiled with a higher preferred stack boundary from a function compiled with a lower preferred stack boundary will most likely misalign the stack. It is recommended that libraries that use callbacks always use the default setting.
This extra alignment does consume extra stack space, and generally increases code size. Code that is sensitive to stack space usage, such as embedded systems and operating system kernels, may want to reduce the preferred alignment to -mpreferred-stack-boundary=2.
The lea loads the original stack pointer (from before the call to main) into ecx, since the stack pointer is about to be modified. This is used for two purposes:
to access the arguments to the main function, since they are relative to the original stack pointer
to restore the stack pointer to its original value when returning from main

lea ecx,[esp+0x4] ; I assume this is for getting the address of the first argument of main...why?
and esp,0xfffffff0 ; ??? is the compiler trying to align the stack pointer on 16 bytes ???
push DWORD PTR [ecx-0x4] ; I understand the assembler is pushing the return address...why?
push ebp
mov ebp,esp
push ecx ; why is ecx pushed too??
Even if every instruction worked perfectly with no speed penalty despite arbitrarily aligned operands, alignment would still increase performance. Imagine a loop referencing a 16-byte quantity that just overlaps two cache lines. Now, to load that little wchar into the cache, two entire cache lines have to be evicted, and what if you need them in the same loop? The cache is so tremendously faster than RAM that cache performance is always critical.
Also, there usually is a speed penalty to shift misaligned operands into the registers.
Given that the stack is being realigned, we naturally have to save the old alignment in order to traverse stack frames for parameters and returning.
ecx is a temporary register so it has to be saved. Also, depending on optimization level, some of the frame linkage ops that don't seem strictly necessary to run the program might well be important in order to set up a trace-ready chain of frames.
