Writing large arrays to memory x86 assembly - segfaults using stack space

Writing large arrays to memory x86 assembly - segfaults using stack space - linux

I'm trying to write an array to the stack in x86 assembly using the AT&T (GAS) syntax. I have ran into an issue whereby I can write ~8.37 million entries to the stack, and then I get a Segmentation fault if I try to write any more. I don't know why this is, here's the code I'm using:
mov %rsp, %rbp
mov $8378658, %rdx
writeDataLoop:
sub %rdx, %rbp
movb $0b1, (%rbp)
add %rdx, %rbp
sub $1, %rdx
cmp $0, %rdx
jg writeDataLoop
Another odd thing that I've found is that the limit at which I can write data up to changes very slightly with each run (it's roughly at 8378658, which is also nothing significant in hex (0x7fd922). Can anyone tell me how to write more data to the stack, and also potentially explain what this arbitrary stack write limit is? Thanks.

To start off with, the default stack size is 8MB, this is the stack limit I was reaching. This can be found with the ulimit -a command. However, one should not use the stack for large amounts of data (usually arrays). Instead, the .space directive should be used, which, using AT&T syntax, takes the amount of data to store in bytes: .space <buffersize>. This can be labelled, for example:
buffer: .space 1024
This allocates 1024 bytes of storage to the label buffer. This label should be in the .bss section of your program, allowing for read and write access.
Finally, to access this data (read or write), one can use buffer(%rax), where rax is the offset.
Edit:
Using the .bss section is more efficient file-size wise than using the .data section, you just have to manually initialize the array.

Related

Is it possible for a program to read itself?

Theoretical question. But let's say I have written an assembly program. I have "labelx:" I want the program to read at this memory address and only this size and print to stdout.
Would it be something like
jmp labelx
And then would i then use the Write syscall , making sure to read from the next instruction from labelx:
mov rsi,rip
mov rdi,0x01
mov rdx,?
mov rax,0x01
syscall
to then output to stdout.
However how would I obtain the size to read itself? Especially if there is a
label after the code i want to read or code after. Would I have to manually
count the lines?
mov rdx,rip+(bytes*lines)
And then syscall with populated registers for the syscall to write to from rsi to rdi. Being stdout.
Is this Even possible? Would i have to use the read syscall first, as the write system call requires rsi to be allocated memory buffer. However I assumed .text is already allocated memory and is read only. Would I have to allocate onto the stack or heap or a static buffer first before write, if it's even possible in the first place?
I'm using NASM syntax btw. And pretty new to assembly. And just a question.

Yes, the .text section is just bytes in memory, no different from section .rodata where you might normally put msg: db "hello", 10. x86 is a Von Neumann architecture (not Harvard), so there's no distinction between code pointers and data pointers, other than what you choose to do with them. Use objdump -drwC -Mintel on a linked executable to see the machine-code bytes, or GDB's x command in a running process, to see bytes anywhere.
You can get the assembler to calculate the size by putting labels at the start/end of the part you want, and using mov edx, prog_end - prog_start in the code at the point where you want that size in RDX.
See How does $ work in NASM, exactly? for more about subtracting two labels (in the same section) to get a size. (Where $ is an implicit label at the start of the current line, although $ isn't likely what you want here.)
To get the current address into a register, you need a RIP-relative LEA, not mov, because RIP isn't a general-purpose register and there's no special form of mov that reads it.
here:
lea rsi, [rel here] ; with DEFAULT REL you could just use [here]
mov edi, 1 ; stdout fileno
mov edx, .end - here ; assemble-time constant size calculation
mov eax, 1 ; __NR_write
syscall
.end:
This is fully position-independent, unlike if you used mov esi, here. (How to load address of function or label into register)
The LEA could use lea rsi, [rel $] to assemble to the same machine-code bytes, but you want a label there so you can subtract them.
I optimized your MOV instructions to use 32-bit operand-size, implicitly zero-extending into the full 64-bit RDX and RAX. (And RDI, but write(int fd, void *buf, size_t len) only looks at EDI anyway for the file descriptor).
Note that you can write any bytes of any section; there's nothing special about having a block of code write itself. In the above example, put the start/end labels anywhere. (e.g. foo: and .end:, and mov edx, foo.end - foo taking advantage of how NASM local labels work, by appending to the previous non-local label, so you can reference them from somewhere else. Or just give them both non-dot names.)

x86 segfault in "call" to function

I am working on a toy compiler. I used to allocate all memory with malloc, but since I never call free, I think it will be sufficient (and faster) to allocate a GB or so on the stack and then slowly use that buffer.
But... now I am segfaulting before anything interesting happens. It happens on about 30% of my test cases (all test cases are the same in this section tho). Pasted from GDB:
(gdb) disas
Dump of assembler code for function main:
0x0000000000400bf1 <+0>: push rbp
0x0000000000400bf2 <+1>: mov rbp,rsp
0x0000000000400bf5 <+4>: mov QWORD PTR [rip+0x2014a4],rsp # 0x6020a0
0x0000000000400bfc <+11>: sub rsp,0x7735940
0x0000000000400c03 <+18>: sub rsp,0x7735940
0x0000000000400c0a <+25>: sub rsp,0x7735940
0x0000000000400c11 <+32>: sub rsp,0x7735940
=> 0x0000000000400c18 <+39>: call 0x400fec <new_Main>
0x0000000000400c1d <+44>: mov r15,rax
0x0000000000400c20 <+47>: mov rax,r15
0x0000000000400c23 <+50>: add rax,0x20
0x0000000000400c27 <+54>: mov rax,QWORD PTR [rax]
0x0000000000400c2a <+57>: add rax,0x48
0x0000000000400c2e <+61>: mov rax,QWORD PTR [rax]
0x0000000000400c31 <+64>: call rax
0x0000000000400c33 <+66>: mov rax,0x0
0x0000000000400c3a <+73>: mov rsp,rbp
0x0000000000400c3d <+76>: pop rbp
0x0000000000400c3e <+77>: ret
I originally did one big "sub rsp, 0x..." and I thought breaking it up a bit would help (it didn't -- the program crashes at call either way). The total should be 500MB in this case.
What really confuses me is why it fails on "call <>" instead of one of the subs. And why it only fails some of the time rather than always or never.
Disclosure: this is a school project, but asking for help with general issues regarding x86 is not against any rules.
Update: based on #swift's comment, I set ulimit -s unlimited... and it now segfaults randomly? It seems random. It's not coming close to using the whole 500 MB buffer tho. It only allocates about 400 bytes total.

Subtracting something from RSP won’t cause any issues since nothing uses it. It’s just a register with a value, it doesn’t allocate anything. But when you use CALL then memory pointed by RSP is accessed and issues may happen. The stack usually isn’t very big so to your question “is there any reason you can’t take a GB of memory from the stack” the answer is “because the stack doesn’t have that much space to be used.”
As for being faster to allocate a big buffer in the stack isn’t really a thing. Allocating and releasing a single big block of memory isn’t slower in the heap. Having lots of allocations and releases in heap is worse than in the stack. So there’s not much point in this case to do it in the stack.

How to understand "cmpl $0x0, -0x30(%rbp)" / "je ..."

This is the assembly code from my bomb lap question, I am stuck in phase2;
The bomb lab require us to find out the correct input based on assembly code or it will exploded.
From <+20> I know that %rbp -0x30(48) == 0 or it will call <+32> and explode the bomb; so %rbp = 48(DEC)
After that(+26) %rbp - 0x2c(44) must equal 1 or it will explode the bomb...
But since %rbp = 48, the bomb will explode anywhere so I am confuse now...
I think I misunderstand the compl , je/jne or how to calculate these things...

-0x30(%ebp) doesn't mean to use the value %ebp - 0x30. It's a memory address to read from. The instruction (cmpl) has an l suffix, so it's dealing with a 4 byte quantity. So what's actually happening is that it reads a 4 byte number from the address %ebp - 0x30 and checks whether it's zero.
(The $ prefix means it's an immediate value, not an address. This is why 0x0 is taken literally and not dereferenced.)

You don't need to know value of rbp to defuse the bomb, it's used always in relative way to address certain portion of memory. It's set at the start by the sequence:
pushq %rbp
movq %rsp, %rbp
pushq %r12
pushq %rbx
subq $0x20, %rsp
Which does push old rbp value into stack (to preserve it), then sets rbp to current stack pointer value (rsp). So (%rbp) contains the old rbp now. Then another two registers r12 and rbx are preserved by pushing them into stack (which makes now rsp == rbp - 16). And then rsp is adjusted one more time by subtracting 32, i.e. rsp == rbp - 48.
This is common pattern how to allocate memory space for local variables, i.e. any further push instruction will use memory below rbp - 48. Memory from rbp - 48 up to rbp - 17 (inc.) is undefined, free to use as "local variables" memory. Then at rbp - 16 is stored in 8 bytes original rbx value, at rbp - 8 is stored original r12 value, and at rbp is stored old rbp. (I mean all the "rbp - x" to be used as memory addresses to address values in the memory)
Then your code calls input 6 numbers, which I guess means 32 bit integers (judging by the following code), so it provides it with address rbp - 48. 6*4 = 24 => the inputted values will be stored in memory from address rbp - 48 up to (incl.) rbp - 25.
Note the gap of "unused memory" from rbp - 24 up to rbp - 17, that's another 8 bytes of spare local storage, not used by the code you posted, that's very likely padding added by compiler to make the rsp correctly aligned before callq read_six_numbers.
So basically you don't need to know where exactly rsp/rbp points to, it's pointing to some valid stack memory (either that, or the code will crash, and the bomb will NOT explode ... a bit weird design :)))). You can pick any arbitrary value, like 0x8000 for rsp at start, and simulate the run with that. (i.e. the <+11> leaq -0x30,(%rbp), %rsi is then rsi = 0x7FC8; (0x8000 - 8 - 0x30).
Rest of what -x(%rbp) means in various instructions (pay attention to the lea vs <any other instruction> semantics difference of "memory operand" usage, the lea does only the memory address calculation, while other instructions just start by that, using the calculated address to access actual value stored in memory) is described in the other answer + comment. Plus use the x86 instruction reference guide and read through some tutorials few more times, until it will make sense.

Confused about AT&T Assembly Syntax for addressing modes vs. jmp and far jmp

In AT&T Assembly Syntax, literal values must be prefixed with a $ sign
But, in Memory Addressing, literal values do not have $ sign
for example:
mov %eax, -100(%eax)
and
jmp 100
jmp $100, $100
are different.
My question is why the $ prefix so confusing?

jmp 100 is a jump to absolute address 100, just like jmp my_label is a jump to the code at my_label. EIP = 100 or EIP = the address of my_label.
(jmp 100 assembles to a jmp rel32 with a R_386_PC32 relocation, asking the linker to fill in the right relative offset from the jmp instruction's own address to the absolute target.)
So in AT&T syntax, you can think of jmp x as sort of like an LEA into EIP.
Or another way to think of it is that code-fetch starts from the specified memory location. Requiring a $ for an immediate wouldn't really make sense, because the machine encoding for direct near jumps uses a relative displacement, not absolute. (http://felixcloutier.com/x86/JMP.html).
Also, indirect jumps use a different syntax (jmp *%eax register indirect or jmp *(%edi, %ecx, 4) memory indirect), so a distinction between immediate vs. memory isn't needed.
But far jump is a different story.
jmp ptr16:32 and jmp m16:32 are both available in 32-bit mode, so you do need to distinguish between ljmp *(%edi) vs. ljmp $100, $100.
Direct far jump (jmp far ptr16:32) does take an absolute segment:offset encoded into the instruction, just like add $123, %eax takes an immediate encoded into the instruction.

Question: My question is why the prefixed $ so confused ?
$ prefix is used to load/use the value as is.
example:
movl $5, %eax #copy value 5 to eax
addl $10,%eax # add 10 + 5 and store result in eax
$5, $10 are values (constants) and are not take from any external source like register or memory
In Memory addressing, Specifically "Direct addressing mode" we want to use the value stored in particular memory location.
example:
movl 20, %eax
The above would get the value stored in Memory location at 20.
Practially since memory locations are numbered in hexadecimal (0x00000000 to 0xfffffffff), it's difficult to specify the memory locations in hexadecimals in instructions. So we assign a symbol to the location
Example:
.section .data
mydata:
long 4
.section .text
.globl _start
_start
movl mydata, %eax
In the above code, mydata is symbolic representation given a particular memory location where value "4" is stored.
I hope the above clears your confusion.

Trying to understand gcc's complicated stack-alignment at the top of main that copies the return address

hi I have disassembled some programs (linux) I wrote to understand better how it works, and I noticed that the main function always begins with:
lea ecx,[esp+0x4] ; I assume this is for getting the adress of the first argument of the main...why ?
and esp,0xfffffff0 ; ??? is the compiler trying to align the stack pointer on 16 bytes ???
push DWORD PTR [ecx-0x4] ; I understand the assembler is pushing the return adress....why ?
push ebp
mov ebp,esp
push ecx ;why is ecx pushed too ??
so my question is: why all this work is done ??
I only understand the use of:
push ebp
mov ebp,esp
the rest seems useless to me...

I've had a go at it:
;# As you have already noticed, the compiler wants to align the stack
;# pointer on a 16 byte boundary before it pushes anything. That's
;# because certain instructions' memory access needs to be aligned
;# that way.
;# So in order to first save the original offset of esp (+4), it
;# executes the first instruction:
lea ecx,[esp+0x4]
;# Now alignment can happen. Without the previous insn the next one
;# would have made the original esp unrecoverable:
and esp,0xfffffff0
;# Next it pushes the return addresss and creates a stack frame. I
;# assume it now wants to make the stack look like a normal
;# subroutine call:
push DWORD PTR [ecx-0x4]
push ebp
mov ebp,esp
;# Remember that ecx is still the only value that can restore the
;# original esp. Since ecx may be garbled by any subroutine calls,
;# it has to save it somewhere:
push ecx

This is done to keep the stack aligned to a 16-byte boundary. Some instructions require certain data types to be aligned on as much as a 16-byte boundary. In order to meet this requirement, GCC makes sure that the stack is initially 16-byte aligned, and allocates stack space in multiples of 16 bytes. This can be controlled using the option -mpreferred-stack-boundary=num. If you use -mpreferred-stack-boundary=2 (for a 22=4-byte alignment), this alignment code will not be generated because the stack is always at least 4-byte aligned. However you could then have trouble if your program uses any data types that require stronger alignment.
According to the gcc manual:
On Pentium and PentiumPro, double and long double values should be aligned to an 8 byte boundary (see -malign-double) or suffer significant run time performance penalties. On Pentium III, the Streaming SIMD Extension (SSE) data type __m128 may not work properly if it is not 16 byte aligned.
To ensure proper alignment of this values on the stack, the stack boundary must be as aligned as that required by any value stored on the stack. Further, every function must be generated such that it keeps the stack aligned. Thus calling a function compiled with a higher preferred stack boundary from a function compiled with a lower preferred stack boundary will most likely misalign the stack. It is recommended that libraries that use callbacks always use the default setting.
This extra alignment does consume extra stack space, and generally increases code size. Code that is sensitive to stack space usage, such as embedded systems and operating system kernels, may want to reduce the preferred alignment to -mpreferred-stack-boundary=2.
The lea loads the original stack pointer (from before the call to main) into ecx, since the stack pointer is about to modified. This is used for two purposes:
to access the arguments to the main function, since they are relative to the original stack pointer
to restore the stack pointer to its original value when returning from main

lea ecx,[esp+0x4] ; I assume this is for getting the adress of the first argument of the main...why ?
and esp,0xfffffff0 ; ??? is the compiler trying to align the stack pointer on 16 bytes ???
push DWORD PTR [ecx-0x4] ; I understand the assembler is pushing the return adress....why ?
push ebp
mov ebp,esp
push ecx ;why is ecx pushed too ??
Even if every instruction worked perfectly with no speed penalty despite arbitrarily aligned operands, alignment would still increase performance. Imagine a loop referencing a 16-byte quantity that just overlaps two cache lines. Now, to load that little wchar into the cache, two entire cache lines have to be evicted, and what if you need them in the same loop? The cache is so tremendously faster than RAM that cache performance is always critical.
Also, there usually is a speed penalty to shift misaligned operands into the registers.
Given that the stack is being realigned, we naturally have to save the old alignment in order to traverse stack frames for parameters and returning.
ecx is a temporary register so it has to be saved. Also, depending on optimization level, some of the frame linkage ops that don't seem strictly necessary to run the program might well be important in order to set up a trace-ready chain of frames.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Writing large arrays to memory x86 assembly - segfaults using stack space - linux

Related

Is it possible for a program to read itself?

x86 segfault in "call" to function

How to understand "cmpl $0x0, -0x30(%rbp)" / "je ..."

Confused about AT&T Assembly Syntax for addressing modes vs. jmp and far jmp

Trying to understand gcc's complicated stack-alignment at the top of main that copies the return address

Categories

Resources