x86 segfault in "call" to function - linux

I am working on a toy compiler. I used to allocate all memory with malloc, but since I never call free, I think it will be sufficient (and faster) to allocate a GB or so on the stack and then slowly use that buffer.
But... now I am segfaulting before anything interesting happens. It happens on about 30% of my test cases (all test cases are the same in this section tho). Pasted from GDB:
(gdb) disas
Dump of assembler code for function main:
0x0000000000400bf1 <+0>: push rbp
0x0000000000400bf2 <+1>: mov rbp,rsp
0x0000000000400bf5 <+4>: mov QWORD PTR [rip+0x2014a4],rsp # 0x6020a0
0x0000000000400bfc <+11>: sub rsp,0x7735940
0x0000000000400c03 <+18>: sub rsp,0x7735940
0x0000000000400c0a <+25>: sub rsp,0x7735940
0x0000000000400c11 <+32>: sub rsp,0x7735940
=> 0x0000000000400c18 <+39>: call 0x400fec <new_Main>
0x0000000000400c1d <+44>: mov r15,rax
0x0000000000400c20 <+47>: mov rax,r15
0x0000000000400c23 <+50>: add rax,0x20
0x0000000000400c27 <+54>: mov rax,QWORD PTR [rax]
0x0000000000400c2a <+57>: add rax,0x48
0x0000000000400c2e <+61>: mov rax,QWORD PTR [rax]
0x0000000000400c31 <+64>: call rax
0x0000000000400c33 <+66>: mov rax,0x0
0x0000000000400c3a <+73>: mov rsp,rbp
0x0000000000400c3d <+76>: pop rbp
0x0000000000400c3e <+77>: ret
I originally did one big "sub rsp, 0x..." and I thought breaking it up a bit would help (it didn't -- the program crashes at call either way). The total should be 500MB in this case.
What really confuses me is why it fails on "call <>" instead of one of the subs. And why it only fails some of the time rather than always or never.
Disclosure: this is a school project, but asking for help with general issues regarding x86 is not against any rules.
Update: based on #swift's comment, I set ulimit -s unlimited... and it now segfaults randomly? It seems random. It's not coming close to using the whole 500 MB buffer tho. It only allocates about 400 bytes total.

Subtracting something from RSP won’t cause any issues since nothing uses it. It’s just a register with a value, it doesn’t allocate anything. But when you use CALL then memory pointed by RSP is accessed and issues may happen. The stack usually isn’t very big so to your question “is there any reason you can’t take a GB of memory from the stack” the answer is “because the stack doesn’t have that much space to be used.”
As for being faster to allocate a big buffer in the stack isn’t really a thing. Allocating and releasing a single big block of memory isn’t slower in the heap. Having lots of allocations and releases in heap is worse than in the stack. So there’s not much point in this case to do it in the stack.

Related

Writing large arrays to memory x86 assembly - segfaults using stack space

I'm trying to write an array to the stack in x86 assembly using the AT&T (GAS) syntax. I have ran into an issue whereby I can write ~8.37 million entries to the stack, and then I get a Segmentation fault if I try to write any more. I don't know why this is, here's the code I'm using:
mov %rsp, %rbp
mov $8378658, %rdx
writeDataLoop:
sub %rdx, %rbp
movb $0b1, (%rbp)
add %rdx, %rbp
sub $1, %rdx
cmp $0, %rdx
jg writeDataLoop
Another odd thing that I've found is that the limit at which I can write data up to changes very slightly with each run (it's roughly at 8378658, which is also nothing significant in hex (0x7fd922). Can anyone tell me how to write more data to the stack, and also potentially explain what this arbitrary stack write limit is? Thanks.
To start off with, the default stack size is 8MB, this is the stack limit I was reaching. This can be found with the ulimit -a command. However, one should not use the stack for large amounts of data (usually arrays). Instead, the .space directive should be used, which, using AT&T syntax, takes the amount of data to store in bytes: .space <buffersize>. This can be labelled, for example:
buffer: .space 1024
This allocates 1024 bytes of storage to the label buffer. This label should be in the .bss section of your program, allowing for read and write access.
Finally, to access this data (read or write), one can use buffer(%rax), where rax is the offset.
Edit:
Using the .bss section is more efficient file-size wise than using the .data section, you just have to manually initialize the array.

Why does VC++ 2010 often use ebx as a "zero register"?

Yesterday I was looking at some 32 bit code generated by VC++ 2010 (most probably; don't know about the specific options, sorry) and I was intrigued by a curious recurring detail: in many functions, it zeroed out ebx in the prologue, and it always used it like a "zero register" (think $zero on MIPS). In particular, it often:
used it to zero out memory; this is not unusual, as the encoding for a mov mem,imm is 1 to 4 bytes bigger than mov mem,reg (the full immediate value size has to be encoded even for 0), but usually (gcc) the necessary register is zeroed out "on demand", and kept for more useful purposes otherwise;
used it for compares against zero - as in cmp reg,ebx. This is what stroke me as really unusual, as it should be exactly the same as test reg,reg, but adds a dependency to an extra register. Now, keep in mind that this happened in non-leaf functions, with ebx being often pushed (by the callee) on and off the stack, so I would not trust this dependency to be always completely free. Also, it also used test reg,reg in the exact same fashion (test/cmp => jg).
Most importantly, registers on "classic" x86 are a scarce resource, if you start having to spill registers you waste a lot of time for no good reason; why waste one through all the function just to keep a zero in it? (still, thinking about it, I don't remember seeing much register spillage in functions that used this "zero-register" pattern).
So: what am I missing? Is it a compiler blooper or some incredibly smart optimization that was particularly interesting in 2010?
Here's an excerpt:
; standard prologue: ebp/esp, SEH, overflow protection, ... then:
xor ebx, ebx
mov [ebp+4], ebx ; zero out some locals
mov [ebp], ebx
call function_1
xor ecx, ecx ; ebx _not_ used to zero registers
cmp eax, ebx ; ... but used for compares?! why not test eax,eax?
setnz cl ; what? it goes through cl to check if eax is not zero?
cmp ecx, ebx ; still, why not test ecx,ecx?
jnz function_body
push 123456
call throw_something
function_body:
mov edx, [eax]
mov ecx, eax ; it's not like it was interested in ecx anyway...
mov eax, [edx+0Ch]
call eax ; virtual method call; ebx is preserved but possibly pushed/popped
lea esi, [eax+10h]
mov [ebp+0Ch], esi
mov eax, [ebp+10h]
mov ecx, [eax-0Ch]
xor edi, edi ; ugain, registers are zeroed as usual
mov byte ptr [ebp+4], 1
mov [ebp+8], ecx
cmp ecx, ebx ; why not test ecx,ecx?
jg somewhere
label1:
lea eax, [esi-10h]
mov byte ptr [ebp+4], bl ; ok, uses bl to write a zero to memory
lea ecx, [eax+0Ch]
or edx, 0FFFFFFFFh
lock xadd [ecx], edx
dec edx
test edx, edx ; now it's using the regular test reg,reg!
jg somewhere_else
Notice: an earlier version of this question said that it used mov reg,ebx instead of xor ebx,ebx; this was just me not remembering stuff correctly. Sorry if anybody put too much thought trying to understand that.
Everything you commented on as odd looks sub-optimal to me. test eax,eax sets all flags (except AF) the same as cmp against zero, and is preferred for performance and code-size.
On P6 (PPro through Nehalem), reading long-dead registers is bad because it can lead to register-read stalls. P6 cores can only read 2 or 3 not-recently-modified architectural registers from the permanent register file per clock (to fetch operands for the issue stage: the ROB holds operands for uops, unlike on SnB-family where it only holds references to the physical register file).
Since this is from VS2010, Sandybridge wasn't released yet, so it should have put a lot of weight on tuning for Pentium II/III, Pentium-M, Core2, and Nehalem where reading "cold" registers is a possible bottleneck.
IDK if anything like this ever made sense for integer regs, but I don't know much about optimizing for CPUs older than P6.
The cmp / setz / cmp / jnz sequence looks particularly braindead. Maybe it came from a compiler-internal canned sequence for producing a boolean value from something, and it failed to optimize a test of the boolean back into just using the flags directly? That still doesn't explain the use of ebx as a zero-register, which is also completely useless there.
Is it possible that some of that was from inline-asm that returned a boolean integer (using a silly that wanted a zero in a register)?
Or maybe the source code was comparing two unknown values, and it was only after inlining and constant-propagation that it turned into a compare against zero? Which MSVC failed to optimize fully, so it still kept 0 as a constant in a register instead of using test?
(the rest of this was written before the question included code).
Sounds weird, or like a case of CSE / constant-hoisting run amok. i.e. treating 0 like any other constant that you might want to load once and then reg-reg copy throughout the function.
Your analysis of the data-dependency behaviour is correct: moving from a register that was zeroed a while ago essentially starts a new dependency chain.
When gcc wants two zeroed registers, it often xor-zeroes one and then uses a mov or movdqa to copy to the other.
This is sub-optimal on Sandybridge where xor-zeroing doesn't need an execution port, but a possible win on Bulldozer-family where mov can run on the AGU or ALU, but xor-zeroing still needs an ALU port.
For vector moves, it's a clear win on Bulldozer: handled in register rename with no execution unit. But xor-zeroing of an XMM or YMM register still needs an execution port on Bulldozer-family (or two for ymm, so always use xmm with implicit zero-extension).
Still, I don't think that justifies tying up a register for the duration of a whole function, especially not if it costs extra saves/restores. And not for P6-family CPUs where register-read stalls are a thing.

Making a stack frame with push/lea offset/sub instead of push/mov/sub?

I'm analysing a c++ function compiled with vc++ (probably vs10) and I never saw this prologue pattern before.
It seems to be a stdcall but the prologue is a little bit different:
stdcall usually starts the function with the following prologue pattern:
push ebp
mov ebp, esp
sub esp, const
However the prologue of this function I'm analysing is the following:
push ebp
lea ebp, [esp - 0x4C]
sub esp, 0x80
Analysing other functions in the same PE that uses the same prologue/epilogue it seems the RETN always come after a LEAVE instruction, just another thing I never saw in a regular cdecl function.
I'm wondering about why the compiler did that. It appears to open space on ESP (by sub esp, const), so why it opens another block of stack by lea ebp, [esp - const]?
Does anyone know why the compiler does that? Is that a different call convention from stdcall?
I did some research on the net as well as studied this specific assembly code to find out but didn't discover the need of that.
Thanks in advance!
EDIT with screens of the prologue/epilogue:
Prologue
Epilogue
A call to the function
As no one in the comments wrote an answer here we go...
The reason of that difference in the prologue/epilogue between the "usual" stdcall and the one I talk in the topic is compiler optmization for code density.
Offsetting EBP in the prologue the compiler is able to shorten the instructions in the function that accesses some stack variables. It can now access a larger chunk of stack memory (depending on how long the prologue offset EBP) with a single byte displacement - using EBP + N and EBP - M to access local variables (where N and M are a const between -128 and + 127). Of course instructions that access variables beyond that EBP's offset will use 4 bytes displacement, but the overall code of that function will be shorter using this optimization trick.

What is wrong with this emulation of CMPXCHG16B instruction?

I'm trying to run a binary program that uses CMPXCHG16B instruction at one place, unfortunately my Athlon 64 X2 3800+ doesn't support it. Which is great, because I see it as a programming challenge. The instruction doesn't seem to be that hard to implement with a cave jump, so that's what I did, but something didn't work, program just froze in a loop. Maybe someone can tell me if I implemented my CMPXCHG16B wrong?
Firstly the actual piece of machine code that I'm trying to emulate is this:
f0 49 0f c7 08 lock cmpxchg16b OWORD PTR [r8]
Excerpt from Intel manual describing CMPXCHG16B:
Compare RDX:RAX with m128. If equal, set ZF and load RCX:RBX into m128.
Else, clear ZF and load m128 into RDX:RAX.
First I replace all 5 bytes of the instruction with a jump to code cave with my emulation procedure, luckily the jump takes up exactly 5 bytes! The jump is actually a call instruction e8, but could be a jmp e9, both work.
e8 96 fb ff ff call 0xfffffb96(-649)
This is a relative jump with a 32-bit signed offset encoded in two's complement, the offset points to a code cave relative to address of next instruction.
Next the emulation code I'm jumping to:
PUSH R10
PUSH R11
MOV r10, QWORD PTR [r8]
MOV r11, QWORD PTR [r8+8]
TEST R10, RAX
JNE ELSE
TEST R11, RDX
JNE ELSE
MOV QWORD PTR [r8], RBX
MOV QWORD PTR [r8+8], RCX
JMP END
ELSE:
MOV RAX, r10
MOV RDX, r11
END:
POP R11
POP R10
RET
Personally, I'm happy with it, and I think it matches the functional specification given in manual. It restores stack and two registers r10 and r11 to their original order and then resumes execution. Alas it does not work! That is the code works, but the program acts as if it's waiting for a tip and burning electricity. Which indicates my emulation was not perfect and I inadvertently broke it's loop. Do you see anything wrong with it?
I notice that this is an atomic variant of it—owning to the lock prefix. I'm hoping it's something else besides contention that I did wrong. Or is there a way to emulate atomicity too?
It's not possible to emulate lock cmpxchg16b. It's sort of possible if all accesses to the target address are synchronised with a separate lock, but that includes all other instructions, including non-atomic stores to either half of the object, and atomic read-modify-writes (like xchg, lock cmpxchg, lock add, lock xadd) with one half (or other part) of the 16 byte object.
You can emulate cmpxchg16b (without lock) like you've done here, with the bugfixes from #Fifoernik's answer. That's an interesting learning exercise, but not very useful in practice, because real code that uses cmpxchg16b always uses it with a lock prefix.
A non-atomic replacement will work most of the time, because it's rare for a cache-line invalidate from another core to arrive in the small time window between two nearby instructions. This doesn't mean it's safe, it just means it's really hard to debug when it does occasionally fail. If you just want to get a game working for your own use, and can accept occasional lockups / errors, this might be useful. For anything where correctness is important, you're out of luck.
What about MFENCE? Seems to be what I need.
MFENCE before, after, or between the loads and stores won't prevent another thread from seeing a half-written value ("tearing"), or from modifying the data after your code has made the decision that the compare succeeded, but before it does the store. It might narrow the window of vulnerability, but it can't close it, because MFENCE only prevents reordering of the global visibility of our own stores and loads. It can't stop a store from another core from becoming visible to us after our loads but before our stores. That requires an atomic read-modify-write bus cycle, which is what locked instructions are for.
Doing two 8-byte atomic compare-exchanges would solve the window-of-vulnerability problem, but only for each half separately, leaving the "tearing" problem.
Atomic 16B loads/stores solves the tearing problem but not the atomicity problem between loads and stores. It's possible with SSE on some hardware, but not guaranteed to be atomic by the x86 ISA the way 8B naturally-aligned loads and stores are.
Xen's lock cmpxchg16b emulation:
The Xen virtual machine has an x86 emulator, I guess for the case where a VM starts on one machine and migrates to less-capable hardware. It emulates lock cmpxchg16b by taking a global lock, because there's no other way. If there was a way to emulate it "properly", I'm sure Xen would do that.
As discussed in this mailing list thread, Xen's solution still doesn't work when the emulated version on one core is accessing the same memory as the non-emulated instruction on another core. (The native version doesn't respect the global lock).
See also this patch on the Xen mailing list that changes the lock cmpxchg8b emulation to support both lock cmpxchg8b and lock cmpxchg16b.
I also found that KVM's x86 emulator doesn't support cmpxchg16b either, according to the search results for emulate cmpxchg16b.
I think all this is good evidence that my analysis is correct, and that it's not possible to emulate it safely.
I see these things wrong with your code to emulate the cmpxchg16b instruction:
You need to use cmp in stead of test to get a correct comparison.
You need to save/restore all flags except the ZF. The manual mentions :
The CF, PF, AF, SF, and OF flags are unaffected.
The manual contains the following:
IF (64-Bit Mode and OperandSize = 64)
THEN
TEMP128 ← DEST
IF (RDX:RAX = TEMP128)
THEN
ZF ← 1;
DEST ← RCX:RBX;
ELSE
ZF ← 0;
RDX:RAX ← TEMP128;
DEST ← TEMP128;
FI;
FI
So to really write code that "matches the functional specification given in manual" a write to the m128 is required. Although this particular write is part of the locked version lock cmpxchg16b, it won't of course do any good to the atomicity of the emulation! A straightforward emulation of lock cmpxchg16b is thus not possible. See #PeterCordes' answer.
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison
ELSE:
MOV RAX, r10
MOV RDX, r11
MOV QWORD PTR [r8], r10
MOV QWORD PTR [r8+8], r11
END:

Understanding ASM. Why does this work in Windows?

Me and a couple of friends are fiddling with a very strange issue. We encountered a Crash in our application inside of a small assembler portion (used to speed up the process). The error was caused by fiddling with the stackpointer and not resetting it at the end, it looked like this:
push ebp
mov ebp, esp
; do stuff here including sub and add on esp
pop ebp
When correctly it should be written as:
push ebp
mov ebp, esp
; do stuff here including sub and add on esp
mov esp,ebp
pop ebp
Now what our mindbreak is: Why does this work in Windows? We found the error as we ported the application to Linux, where we encountered the crash. Neither in Windows or Android (using the NDK) we encountered any issues and would never have found this error. Is there any Stackpointer recovery? Is there a protection against misusing the stackpointer?
the ebp esp usage, is called a stack frame, and its purpose is to allocate variables on the stack, and afterward have a quick way to restore the stack back before the ret instruction. All new versions of x86 CPU can compress these instructions together using enter / leave instructions instead.
esp is the actual stack pointer used by the CPU when doing push/pop/call/ret.
ebp is a user-manipulated base pointer, more or less all compilers use this as a stack-pointer for local storage.
If the mov esp, ebp instruction is missing, the stack will misbehave if esp != ebp when the CPU reaches pop ebp, but only then.
it seems the compiler takes care of your stack in windows:
The only way I can imagine is:
Microsoft Visual C takes special care of functions that are B{__stdcall}. Since the number of parameters is known at compile time, the compiler encodes the parameter byte count in the symbol name itself.
The __stdcall convention is mainly used by the Windows API, and it's a bit more compact than __cdecl. The main difference is that any given function has a hard-coded set of parameters, and this cannot vary from call to call like it can in C (no "variadic functions").
see:
http://unixwiz.net/techtips/win32-callconv-asm.html
and:
https://en.wikipedia.org/wiki/X86_calling_conventions

Resources