Why so much stack space used for each recursion?

Why so much stack space used for each recursion? - visual-c++

I have a simple recursive function RCompare() that calls a more complex function Compare() which returns before the recursive call. Each recursion level uses 248 bytes of stack space which seems like way more than it should. Here is the recursive function:
void CMList::RCompare(MP n1) // RECURSIVE and Looping compare function
{
auto MP ne=n1->mf;
while(StkAvl() && Compare(n1=ne->mb))
RCompare(n1); // Recursive call !
}
StkAvl() is a simple stack space check function that compares the address of an auto variable to the value of an address near the end of the stack stored in a static variable.
It seems to me that the only things added to the stack in each recursion are two pointer variables (MP is a pointer to a structure) and the stuff that one function call stores, a few saved registers, base pointer, return address, etc., all 32-bit (4 byte) values. There's no way that is 248 bytes is it?
I don't no how to actually look at the stack in a meaningful way in Visual Studio 2008.
Thanks
Added disassembly:
CMList::RCompare:
0043E000 push ebp
0043E001 mov ebp,esp
0043E003 sub esp,0E4h
0043E009 push ebx
0043E00A push esi
0043E00B push edi
0043E00C push ecx
0043E00D lea edi,[ebp-0E4h]
0043E013 mov ecx,39h
0043E018 mov eax,0CCCCCCCCh
0043E01D rep stos dword ptr es:[edi]
0043E01F pop ecx
0043E020 mov dword ptr [ebp-8],edx
0043E023 mov dword ptr [ebp-14h],ecx
0043E026 mov eax,dword ptr [n1]
0043E029 mov ecx,dword ptr [eax+20h]
0043E02C mov dword ptr [ne],ecx
0043E02F mov ecx,dword ptr [this]
0043E032 call CMList::StkAvl (41D46Fh)
0043E037 test eax,eax
0043E039 je CMList::RCompare+63h (43E063h)
0043E03B mov eax,dword ptr [ne]
0043E03E mov ecx,dword ptr [eax+1Ch]
0043E041 mov dword ptr [n1],ecx
0043E044 mov edx,dword ptr [n1]
0043E047 mov ecx,dword ptr [this]
0043E04A call CMList::Compare (41DA05h)
0043E04F movzx edx,al
0043E052 test edx,edx
0043E054 je CMList::RCompare+63h (43E063h)
0043E056 mov edx,dword ptr [n1]
0043E059 mov ecx,dword ptr [this]
0043E05C call CMList::RCompare (41EC9Dh)
0043E061 jmp CMList::RCompare+2Fh (43E02Fh)
0043E063 pop edi
0043E064 pop esi
0043E065 pop ebx
0043E066 add esp,0E4h
0043E06C cmp ebp,esp
0043E06E call #ILT+5295(__RTC_CheckEsp) (41E4B4h)
0043E073 mov esp,ebp
0043E075 pop ebp
0043E076 ret
Why 0E4h?
More Info:
class mch // match node structure
{
public:
T_FSZ c1,c2; // file indexes
T_MSZ sz; // match size
enum ntyp typ; // type of node
mch *mb,*mf; // pointers to next and previous match nodes
};
typedef mch * MP; // for use in casting (MP) x
Should be a plain old pointer right? The same pointers are in the structure itself and they are just normal 4 byte pointers.
Edit: Added:
#pragma check_stack(off)
void CMList::RCompare(MP n1) // RECURSIVE and Looping compare function
{
auto MP ne=n1->mf;
while(StkAvl() && Compare(n1=ne->mb))
RCompare(n1); // Recursive call !
} // end RCompare()
#pragma check_stack()
But it didn't change anything. :(
Now what?

Note that on debug mode the compiler binds many bytes from the stack, on every function,
to catch buffer overflow bugs.
0043E003 sub esp, 0E4h ; < -- bound 228 bytes
...
0043E00D lea edi,[ebp-0E4h]
0043E013 mov ecx, 39h
0043E018 mov eax, 0CCCCCCCCh ; <-- sentinel
0043E01D rep stos dword ptr es:[edi] ; <-- write sentinels
Edit: the OP Harvey found the pragma that turns on/off stack probes.
check_stack
Instructs the compiler to turn off
stack probes if off (or –) is
specified,
or to turn on stack probes
if on (or +) is specified.
#pragma check_stack([ {on | off}] )
#pragma check_stack{+ | –}
Update: well, probes is another story, as it appears.
Try this: /GZ (Enable Stack Frame Run-Time Error Checking)

That also depends on the compiler and the architecture you're running - e.g. it could be aligning to 256 bytes for faster execution, so that each level uses the 8 bytes of the variable + 248 padding.

I guess some space has to be allocated for exception handling. Did you look at the disassembly?

In Visual Studio you can look at the register "esp", the stack pointer, in a watch (or register) windows. Set a breakpoint in your function between one call and the next to see who much stack you consume.
On a pain function on debug mode in Visual Studio 2008 it is 16 byes per function call.

Related

Top of a stack frame is not on the lowest address when examining $rsp [duplicate]

I have tried to understand this basing on a square function in c++ at godbolt.org . Clearly, return, parameters and local variables use “rbp - alignment” for this function.
Could someone please explain how this is possible?
What then would rbp + alignment do in this case?
int square(int num){
int n = 5;// just to test how locals are treated with frame pointer
return num * num;
}
Compiler (x86-64 gcc 11.1)
Generated Assembly:
square(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi. ;\\Both param and local var use rbp-*
mov DWORD PTR[rbp-4], 5. ;//
mov eax, DWORD PTR [rbp-20]
imul eax, eax
pop rbp
ret

This is one of those cases where it’s handy to distinguish between parameters and arguments. In short: arguments are the values given by the caller, while parameters are the variables holding them.
When square is called, the caller places the argument in the rdi register, in accordance with the standard x86-64 calling convention. square then allocates a local variable, the parameter, and places the argument in the parameter. This allows the parameter to be used like any other variable: be read, written into, having its address taken, and so on. Since in this case it’s the callee that allocated the memory for the parameter, it necessarily has to reside below the frame pointer.
With an ABI where arguments are passed on the stack, the callee would be able to reuse the stack slot containing the argument as the parameter. This is exactly what happens on x86-32 (pass -m32 to see yourself):
square(int): # #square(int)
push ebp
mov ebp, esp
push eax
mov eax, dword ptr [ebp + 8]
mov dword ptr [ebp - 4], 5
mov eax, dword ptr [ebp + 8]
imul eax, dword ptr [ebp + 8]
add esp, 4
pop ebp
ret
Of course, if you enabled optimisations, the compiler would not bother with allocating a parameter on the stack in the callee; it would just use the value in the register directly:
square(int): # #square(int)
mov eax, edi
imul eax, edi
ret

GCC allows "leaf" functions, those that don't call other functions, to not bother creating a stack frame. The free stack is fair game to do so as these fns wish.

push/pop segmentation fault in simple multiplication function

my teacher is doing a crash course in assembly with us, and I have no experience in it whatsoever. I am supposed to write a simple function that takes four variables and calculates (x+y)-(z+a) and then prints out the answer. I know it's a simple problem, but after hours of research I am getting no where, any push in the right direction would be very helpful! I do need to use the stack, as I have more things to add to the program once I get past this point, and will have a lot of variables to store. I am compiling using nasm and gcc, in linux. (x86 64)
(side question, my '3' isn't showing up in register r10, but I am in linux so this should be the correct register... any ideas?)
Here is my code so far:
global main
extern printf
segment .data
mulsub_str db "(%ld * %ld) - (%ld * %ld) = %ld",10,0
data dq 1, 2, 3, 4
segment .text
main:
call multiplyandsubtract
pop r9
mov rdi, mulsub_str
mov rsi, [data]
mov rdx, [data+8]
mov r10, [data+16]
mov r8, [data+24]
mov rax, 0
call printf
ret
multiplyandsubtract:
;;multiplies first function
mov rax, [data]
mov rdi, [data+8]
mul rdi
mov rbx, rdi
push rbx
;;multiplies second function
mov rax, [data+16]
mov rsi, [data+24]
mul rsi
mov rbx, rsi
push rbx
;;subtracts function 2 from function 1
pop rsi
pop rdi
sub rdi, rsi
push rdi
ret

push in the right direction
Nice pun!
Your problem is that you apparently don't seem to know that ret is using the stack for the return address. As such push rdi; ret will just go to the address in rdi and not return to your caller. Since that is unlikely to be a valid code address, you get a nice segfault.
To return values from functions just leave the result in a register, standard calling conventions normally use rax. Here is a possible version:
global main
extern printf
segment .data
mulsub_str db "(%ld * %ld) - (%ld * %ld) = %ld",10,0
data dq 1, 2, 3, 4
segment .text
main:
sub rsp, 8
call multiplyandsubtract
mov r9, rax
mov rdi, mulsub_str
mov rsi, [data]
mov rdx, [data+8]
mov r10, [data+16]
mov r8, [data+24]
mov rax, 0
call printf
add rsp, 8
ret
multiplyandsubtract:
;;multiplies first function
mov rax, [data]
mov rdi, [data+8]
mul rdi
mov rbx, rdi
push rbx
;;multiplies second function
mov rax, [data+16]
mov rsi, [data+24]
mul rsi
mov rbx, rsi
push rbx
;;subtracts function 2 from function 1
pop rsi
pop rdi
sub rdi, rsi
mov rax, rdi
ret
PS: notice I have also fixed the stack alignment as per the ABI. printf is known to be picky about that too.

To return more than 64b from subroutine (rax is not enough), you can optionally drop the whole standard ABI convention (or actually follow it, there's surely a well defined way how to return more than 64b from subroutines), and use other registers until you ran out of them.
And once you ran out of spare return registers (or when you desperately want to use stack memory), you can follow the way C++ compilers do:
SUB rsp,<return_data_size + alignment>
CALL subroutine
...
MOV al,[rsp + <offset>] ; to access some value from returned data
; <offset> = 0 to return_data_size-1, as defined by you when defining
; the memory layout for returned data structure
...
ADD rsp,<return_data_size + alignment> ; restore stack pointer
subroutine:
MOV al,<result_value_1>
MOV [rsp + 8 + <offset>],al ; store it into allocated stack space
; the +8 is there to jump beyond return address, which was pushed
; at stack by "CALL" instruction. If you will push more registers/data
; at the stack inside the subroutine, you will have either to recalculate
; all offsets in following code, or use 32b C-like function prologue:
PUSH rbp
MOV rbp,rsp
MOV [rbp + 16 + <offset>],al ; now all offsets are constant relative to rbp
... other code ...
; epilogue code restoring stack
MOV rsp,rbp ; optional, when you did use RSP and didn't restore it yet
POP rbp
RET
So during executing the instructions of subroutine, the stack memory layout is like this:
rsp -> current_top_of_stack (some temporary push/pop as needed)
+x ...
rbp -> original rbp value (if prologue/epilogue code was used)
+8 return address to caller
+16 allocated space for returning values
+16+return_data_size
... padding to have rsp correctly aligned by ABI requirements ...
+16+return_data_size+alignment
... other caller stack data or it's own stack frame/return address ...
I'm not going to check how ABI defines it, because I'm too lazy, plus I hope this answer is understandable for you to explain the principle, so you will recognize which way the ABI works and adjust...
Then again, I would highly recommend to use rather many shorter simpler subroutines returning only single value (in rax/eax/ax/al), whenever possible, try to follow the SRP (Single Responsibility Principle). The above way will force you to define some return-data-structure, which may be too much hassle, if it's just some temporary thing and can be split into single-value subroutines instead (if performance is endangered, then probably inlining the whole subroutine will outperform even the logic of grouped returned values and single CALL).

Assembler: "Function Call" assembler code produced by VC++

I am by all means no assembler expert, and my knowledge on this topic is rather shallow, but I was curious on what the Microsoft VC++ Compiler does in a simple function call that does nothing else but returning a value.
Let us have the following function:
unsigned long __stdcall someFunction ( void * args) {
return 0;
}
Now, I know that with __stdcall calling convention the CALLEE is responsible for stack unwinding, and with __cdecl the CALLER of the function takes care of this. But for this example I would like to stick to the former.
With an non-optimized debug build I saw that the following output is being produced:
unsigned long __stdcall someFunction (void * args) {
00A31730 push ebp
00A31731 mov ebp,esp
00A31733 sub esp,0C0h
00A31739 push ebx
00A3173A push esi
00A3173B push edi
00A3173C lea edi,[ebp-0C0h]
00A31742 mov ecx,30h
00A31747 mov eax,0CCCCCCCCh
00A3174C rep stos dword ptr es:[edi]
return 0;
00A3174E xor eax,eax
}
00A31750 pop edi
00A31751 pop esi
00A31752 pop ebx
00A31753 mov esp,ebp
00A31755 pop ebp
00A31756 ret 4
I would thank anyone to explain this snippet of code for me if possible. I know that the xor statement actually resets the eax register to produce the zero return value. Also the ret 4 is self-explanatory to me. I think the edi, esi and ebx registers are pushed before and popped after to save the original state, so that the function can use them freely maybe. But for the rest - I have no clue.
Any answer is very much appreciated! :)
Thanks!

So you're asking what these lines do:
00A3173C lea edi,[ebp-0C0h]
00A31742 mov ecx,30h
00A31747 mov eax,0CCCCCCCCh
00A3174C rep stos dword ptr es:[edi]
In Visual C++ debugging runtime library, uninitialized stack memory is initialized to contain 0xCC bytes. This is what these instructions do.
At the beginning of the ASM code, there is the instruction sub esp,0C0h that allocates 0xC0 bytes for the stack. However, there is no local variables used in this function, so where does this come from? It's for Edit+Continue support: you're able to add local variables and continue debugging.
The 0xCC opcode means the INT 3 x86 assembly instruction, so if you try to execute that code (accidentally due to a bug), the program will throw an INT 3 exception which will be handled by the debugger or OS. So it's not just some random value.

gcc x64 stack manipulation

I try to understand the way gcc x64 organize the stack, a small program generate this asm
(gdb) disassemble *main
Dump of assembler code for function main:
0x0000000000400534 <main+0>: push rbp
0x0000000000400535 <main+1>: mov rbp,rsp
0x0000000000400538 <main+4>: sub rsp,0x30
0x000000000040053c <main+8>: mov DWORD PTR [rbp-0x14],edi
0x000000000040053f <main+11>: mov QWORD PTR [rbp-0x20],rsi
0x0000000000400543 <main+15>: mov DWORD PTR [rsp],0x7
0x000000000040054a <main+22>: mov r9d,0x6
0x0000000000400550 <main+28>: mov r8d,0x5
0x0000000000400556 <main+34>: mov ecx,0x4
0x000000000040055b <main+39>: mov edx,0x3
0x0000000000400560 <main+44>: mov esi,0x2
0x0000000000400565 <main+49>: mov edi,0x1
0x000000000040056a <main+54>: call 0x4004c7 <addAll>
0x000000000040056f <main+59>: mov DWORD PTR [rbp-0x4],eax
0x0000000000400572 <main+62>: mov esi,DWORD PTR [rbp-0x4]
0x0000000000400575 <main+65>: mov edi,0x400688
0x000000000040057a <main+70>: mov eax,0x0
0x000000000040057f <main+75>: call 0x400398 <printf#plt>
0x0000000000400584 <main+80>: mov eax,0x0
0x0000000000400589 <main+85>: leave
0x000000000040058a <main+86>: ret
Why it reserve up to 0x30 bytes just to save edi and rsi
I don't see any where restore values of edi and rsi as required by ABI
edi and rsi save at position that has delta 0x20 - 0x14 = 0xC, not a continuous region, does it make sense?
follow is source code
int mix(int a,int b,int c,int d,int e,int f, int g){
return a | b | c | d | e | f |g;
}
int addAll(int a,int b,int c,int d,int e,int f, int g){
return a+b+c+d+e+f+g+mix(a,b,c,d,e,f,g);
}
int main(int argc,char **argv){
int total;
total = addAll(1,2,3,4,5,6,7);
printf("result is %d\n",total);
return 0;
}
Edit
It's seem that stack has stored esi,rdi, 7th parameter call to addAll and total , it should take 4x8 = 32 (0x20) bytes, it round up to 0x30 for some reasons.

I dont know your original code, but locals are also stored on the stack, and when you have some local variables that space is also "allocated". Also for alignment reason it can be, that he "rounded" to the next multiple of 16. I would guess you have a local for passing the result from your addAll to the printf, and that is stored at rbp-04.
I just had look in your linked ABI - where does it say that the callee has to restore rdi and rsi? It says already on page 15, footnote:
Note that in contrast to the Intel386 ABI, %rdi, and %rsi belong to the called function, not
the caller.
Afaik they are used for passing the first arguments to the callee.
0xC are 12. This comes also from alignment, as you can see, he just needs to store edi not rdi, for alignment purpose I assume that he aligns it on a 4 byte border, while si is rsi, which is 64 bit and aligned on 8 byte border.

2: The ABI explicitly says that rdi/rsi need NOT be saved by the called function; see page 15 and footnote 5.
1 and 3: unsure; probably stack alignment issues.

Microsoft inline assembly and references or Why does BYTE PTR [ByteRef] not work in this situation?

Okay so I have a C++ function in which I am trying to use inline assembly
void ToggleBit(unsigned char &Byte, unsigned int Bit)
{
/* In C:
* Byte ^= (1<<Bit);
*/
__asm
{
push edx
push ecx
mov ecx, Bit
xor edx, edx
mov edx, 1
sal dl, cl
xor BYTE PTR [Byte], dl
pop ecx
pop edx
}
}
This should work, right? Since Byte is a reference (which is essentially a constant pointer), it must be dereferenced to access the data... but it didn't work!
Upon debugging the following code:
mov edx, Byte
;edx = 0x0040f9d3
mov bl, BYTE PTR [Byte]
;bl = 0xd3
I don't understand why this would happen at all.

As you say, a reference is the same as a pointer in assembly. To access the reference/pointer, you must first read the pointer value, and then dereference it:
mov ecx, Byte ; Or mov ecx, [Byte] which is the same thing
xor [ecx], dl
When you access the value at BYTE PTR [Byte], it accesses the first byte of the pointer value (the pointed-to address) instead of the pointed-to value.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Why so much stack space used for each recursion? - visual-c++

That also depends on the compiler and the architecture you're running - e.g. it could be aligning to 256 bytes for faster execution, so that each level uses the 8 bytes of the variable + 248 padding.

I guess some space has to be allocated for exception handling. Did you look at the disassembly?

In Visual Studio you can look at the register "esp", the stack pointer, in a watch (or register) windows. Set a breakpoint in your function between one call and the next to see who much stack you consume. On a pain function on debug mode in Visual Studio 2008 it is 16 byes per function call.

Related

Top of a stack frame is not on the lowest address when examining $rsp [duplicate]

push/pop segmentation fault in simple multiplication function

Assembler: "Function Call" assembler code produced by VC++

gcc x64 stack manipulation

Microsoft inline assembly and references or Why does BYTE PTR [ByteRef] not work in this situation?

Categories

Resources