x86-64 GNU Assembly

While investigating a crash, I came across the following code snippet and immediately recognized that the mov instruction should actually be movq to get the correct 64-bit register operation.
#elif defined(__x86_64__)
unsigned long rbp;
__asm__ volatile ("mov %%rbp, %0" : "=r" (rbp));
sp = (void **) rbp;
#else
Further to this, I also found documentation that claims that the rbp register for x86-64 is general purpose and does not contain the address of the current frame. I have also found documentation that claims that rbp does contain the address of the current frame. Can someone clarify?

Regarding the first part of your question (movq instead of mov): the assembler (as, in this case) will recognize that your operand is 64 bits wide and emit the 64-bit move. A bare mov is not itself a sized instruction; it's a way to tell the assembler "use the right mov variant depending on the operands".
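A quick sketch of what that means in practice with GNU as (worth double-checking against your own toolchain): with register operands the size is inferred, and writing the suffix explicitly changes nothing:

mov  %rbp, %rax        # as infers the 64-bit form from the 64-bit registers
movq %rbp, %rax        # explicit q suffix; assembles to the identical encoding
mov  %ebp, %eax        # with 32-bit registers the same mnemonic becomes movl

In your inline asm, the "=r" constraint binds %0 to a 64-bit register because rbp is declared unsigned long, so the bare mov already produces the 64-bit operation.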
Regarding the second part, it's actually both: it's a general-purpose register, in the sense that it can hold any value, and it is also conventionally used as the stack-frame base pointer. The 'Stack Operation' section (2.4) of the AMD64 Architecture Programmer's Manual, Volume 1: Application Programming, says:
A stack is a portion of a stack segment in memory that is used to link
procedures. Software conventions typically define stacks using a
stack frame, which consists of two registers—a stack-frame base
pointer (rBP) and a stack pointer (rSP)—
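To illustrate the convention side of this (a sketch with a made-up function name; with -fomit-frame-pointer the compiler is free to treat %rbp as an ordinary scratch register instead):

func:
    pushq   %rbp            # save the caller's frame pointer
    movq    %rsp, %rbp      # %rbp now marks the base of this function's frame
    subq    $16, %rsp       # reserve space for locals
    movq    $42, -8(%rbp)   # locals are addressed relative to %rbp
    leave                   # equivalent to: movq %rbp, %rsp; popq %rbp
    ret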

When to use size directives in x86

When to use size directives in x86 seems a bit ambiguous. This x86 assembly guide says the following:
In general, the intended size of the data item at a given memory
address can be inferred from the assembly code instruction in which it is
referenced. For example, in all of the above instructions, the size of
the memory regions could be inferred from the size of the register
operand. When we were loading a 32-bit register, the assembler could
infer that the region of memory we were referring to was 4 bytes wide.
When we were storing the value of a one byte register to memory, the
assembler could infer that we wanted the address to refer to a single
byte in memory.
The examples they give are pretty trivial, such as mov'ing an immediate value into a register.
But what about more complex situations, such as the following:
mov QWORD PTR [rip+0x21b520], 0x1
In this case, isn't the QWORD PTR size directive redundant since, according to the above guide, it can be assumed that we want to move 8 bytes into the destination register due to the fact that RIP is 8 bytes? What are the definitive rules for size directives on the x86 architecture? I couldn't find an answer for this anywhere, thanks.
Update: As Ross pointed out, the destination in the above example isn't a register. Here's a more relevant example:
mov esi, DWORD PTR [rax*4+0x419260]
In this case, can't it be assumed that we want to move 4 bytes because ESI is 4 bytes, making the DWORD PTR directive redundant?
You're right; it is rather ambiguous. Assuming we're talking about Intel syntax, it is true that you can often get away with not using size directives. Any time the assembler can figure it out automatically, they are optional. For example, in the instruction
mov esi, DWORD PTR [rax*4+0x419260]
the DWORD PTR specifier is optional for exactly the reason you suppose: the assembler can figure out that it is to move a DWORD-sized value, since the value is being moved into a DWORD-sized register.
Similarly, in
mov rsi, QWORD PTR [rax*4+0x419260]
the QWORD PTR specifier is optional for the exact same reason.
But it is not always optional. Consider your first example:
mov QWORD PTR [rip+0x21b520], 0x1
Here, the QWORD PTR specifier is not optional. Without it, the assembler has no idea what size value you want to store starting at the address rip+0x21b520. Should 0x1 be stored as a BYTE? Extended to a WORD? A DWORD? A QWORD? Some assemblers might guess, but you can't be assured of the correct result without explicitly specifying what you want.
In other words, when one of the operands is a register, the size specifier is optional because the assembler can determine the operand size from the size of that register. However, if you're storing an immediate value directly to memory, there is no register to infer from, and the size specifier is required to ensure you get the result you want.
Personally, I prefer to always include the size when I write code. It's a couple of characters more typing, but it forces me to think about it and state explicitly what I want. If I screw up and code a mismatch, then the assembler will scream loudly at me, which has caught bugs more than once. I also think having it there enhances readability. So here I agree with old_timer, even though his perspective appears to be somewhat unpopular.
Disassemblers also tend to be verbose in their outputs, including the size specifiers even when they are optional. Hans Passant theorized in the comments this was to preserve backwards-compatibility with old-school assemblers that always needed these, but I'm not sure that's true. It might be part of it, but in my experience, disassemblers tend to be wordy in lots of different ways, and I think this is just to make it easier to analyze code with which you are unfamiliar.
Note that AT&T syntax takes a slightly different tack. Rather than writing the size as a prefix to the operand, it adds a suffix to the instruction mnemonic: b for byte, w for word, l for dword, and q for qword. So, the three previous examples become:
movl 0x419260(,%rax,4), %esi
movq 0x419260(,%rax,4), %rsi
movq $0x1, 0x21b520(%rip)
Again, on the first two instructions, the l and q suffixes are optional because the assembler can deduce the appropriate size; on the last instruction, just like in Intel syntax, the suffix is mandatory. So AT&T syntax follows the same rule as Intel syntax, it just writes the size specifier in a different place.
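You can see the rule in action by feeding GNU as the AT&T forms with and without a size (a sketch; counter is a made-up symbol):

mov  0x419260(,%rax,4), %esi    # accepted: %esi pins the size at 32 bits
movl $1, counter(%rip)          # accepted: the l suffix says store 4 bytes
mov  $1, counter(%rip)          # rejected as ambiguous: no register operand
                                # and no suffix, so the store has no size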
RIP, or any other register in the address, is only relevant to the addressing mode, not the width of the data transferred. The memory reference [rip+0x21b520] could be used with a 1, 2, 4, or 8-byte access, and the constant value 0x01 could also be 1 to 8 bytes wide (0x01 is the same as 0x00000001, etc.). So in this case, the operand size has to be stated explicitly.
With a register as the source or destination, the operand size would be implicit: if, say, EAX is used, the data is 32 bits or 4 bytes:
mov [rip+0x21b520],eax
And of course, in the awfully beautiful AT&T syntax, the operand size is marked as a suffix to the instruction mnemonic (the l here).
movl $1, 0x21b520(%rip)
It gets worse than that: an assembly language is defined by the assembler, the program that reads/interprets/parses it. This is especially true of x86, but as a general rule there is no technical reason for two assemblers targeting the same machine to use the same assembly language; they tend to be similar, but they don't have to be.
You have fallen into a couple of traps: first, the specific size-directive syntax depends on the assembler you are using; second, whether there is a default at all. My recommendation is to ALWAYS use the size directive (or a uniquely sized instruction mnemonic where one exists); then you never have to worry about it, right?
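For instance (a sketch in AT&T syntax with a made-up symbol flag), the suffix is the only thing that distinguishes four different stores to the same address:

movb $1, flag     # write 1 byte
movw $1, flag     # write 2 bytes
movl $1, flag     # write 4 bytes
movq $1, flag     # write 8 bytes (the 32-bit immediate is sign-extended)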

x64 memset core, is passed buffer address truncated?

1. Problem Background
Recently a core dump occurred on one of our on-line search servers. The crash happens in memset() due to an attempt to write to an invalid address, which raises SIGSEGV. The following information is from dmesg:
is_searcher_ser[17405]: segfault at 000000002c32a668 rip 0000003da0a7b006 rsp 0000000053abc790 error 6
The environment of our on-line servers goes as follows:
OS: RHEL 5.3
Kernel: 2.6.18-131.el5.custom, x86_64 (64-bit)
GCC: 4.1.2 20080704 (Red Hat 4.1.2-44)
Glibc: glibc-2.5-49.6
The following is the relevant code snippet:
CHashMap<…>::CHashMap(…)
{
    …
    typedef HashEntry *HashEntryPtr;
    m_ppEntry = new HashEntryPtr[m_nHashSize]; // m_nHashSize is 389 when core
    assert(m_ppEntry != NULL);
    memset(m_ppEntry, 0x0, m_nHashSize*sizeof(HashEntryPtr)); // Core in this memset() invocation
    …
}
The assembly generated for the above code is:
…
0x000000000091fe9e <+110>: callq 0x502638 <_Znam@plt> // new HashEntryPtr[m_nHashSize]
0x000000000091fea3 <+115>: mov 0xc(%rbx),%edx // Get the value of m_nHashSize
0x000000000091fea6 <+118>: mov %rax,%rdi // Put m_ppEntry pointer to %rdi for later memset invocation
0x000000000091fea9 <+121>: mov %rax,0x20(%rbx) // Store the pointer to m_ppEntry member variable(%rbx holds the this pointer)
0x000000000091fead <+125>: xor %esi,%esi // Generate 0
0x000000000091feaf <+127>: shl $0x3,%rdx // m_nHashSize*sizeof(HashEntryPtr)
0x000000000091feb3 <+131>: callq 0x502b38 <memset@plt> // Call the memset() function
…
In the core dump, the assembly of memset@plt is:
(gdb) disassemble 0x502b38
Dump of assembler code for function memset@plt:
0x0000000000502b38 <+0>: jmpq *0x771b92(%rip) # 0xc746d0 <memset@got.plt>
0x0000000000502b3e <+6>: pushq $0x53
0x0000000000502b43 <+11>: jmpq 0x5025f8
End of assembler dump.
(gdb) x/ag 0x0000000000502b3e+0x771b92
0xc746d0 <memset@got.plt>: 0x3da0a7acb0 <memset>
(gdb) disassemble 0x3da0a7acb0
Dump of assembler code for function memset:
0x0000003da0a7acb0 <+0>: cmp $0x1,%rdx
0x0000003da0a7acb4 <+4>: mov %rdi,%rax
…
From the above GDB analysis, we know that the address of memset() has already been resolved in the PLT/GOT. That is to say, the first jmpq *0x771b92(%rip) jumps directly to the first instruction of memset(). Besides, the program had been running on-line for nearly a day, so the relocation for memset() should have been resolved long before the crash.
2. Weird phenomenon
The core fired at the instruction => 0x0000003da0a7b006 <+854>: mov %rdx,-0x8(%rdi) inside memset(). This is the instruction in memset() that writes the zero bytes at the proper starting position of the buffer passed as its first parameter.
When it cored, in frame 0, the value of $rdi was 0x2c32a670 and $rax was 0x2c32a668. From the assembly analysis and off-line tests, $rax should hold the buffer passed to memset, i.e., its first parameter.
So, in our example, $rax should equal the value of m_ppEntry, which is stored into the this object (%rbx holds the this pointer) before the buffer is zeroed by memset. However, the value of m_ppEntry is 0x2ab02c32a668.
Checking with the info files GDB command, the address 0x2c32a668 is indeed invalid (not mapped), while 0x2ab02c32a668 is a valid address.
3. Why is it weird?
What is weird about this core is the following: if the real address of memset had already been resolved (very probably), then there are only a few instructions between storing the pointer value into m_ppEntry and the attempt to memset it, and the value of register $rax (holding the passed buffer address) is not changed at all by those instructions. So how can m_ppEntry not be equal to $rax?
What is even weirder: at the time of the crash, the value of $rax (0x2c32a668) is exactly the lower 4 bytes of m_ppEntry (0x2ab02c32a668). If there is indeed some relationship between the two values, is the buffer address passed to memset being truncated? However, the instructions involved all use %rax, not %eax. By the way, I cannot reproduce this issue offline.
So,
1) Which address is valid? If 0x2c32a668 is valid, was the heap corrupted within just those few instructions? And how do we explain that the value of m_ppEntry is 0x2ab02c32a668, and why are the low 4 bytes of the two values the same?
2) If 0x2ab02c32a668 is valid, why is the address truncated when passed into the 64-bit memset()? Under which conditions can this error occur? I cannot reproduce it offline. Is this a known bug? I didn't find anything through Google.
3) Or is it due to some hardware or power issue that zeroed the upper 4 bytes of the %rdi passed to memset? (I'm very reluctant to believe this.)
Finally, any comment on this core is appreciated.
Thanks,
Gary Hu
I'm assuming most of the time this code works fine, given your mention of one day's running.
I agree signals are worth inspecting; it does look suspiciously like pointer truncation is happening somewhere else.
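One way such truncation can happen without any of the instructions in your snippet naming %eax (a sketch, not your actual code): on x86-64, any write to a 32-bit register clears bits 63:32 of the full register, so a pointer that takes a detour through a 32-bit operation comes back as exactly its low 4 bytes, which matches the 0x2ab02c32a668 → 0x2c32a668 pattern you observed:

movq %rdi, %rax      # %rax = 0x2ab02c32a668 (full 64-bit pointer)
movl %eax, %eax      # any 32-bit register write zeroes the upper half:
                     # %rax is now 0x000000002c32a668

Anything in the process that round-trips the pointer through 32 bits (a signal handler, a bit of inline asm, a mismatched prototype) could produce this; the memset call site itself looks fine.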
The only other thing I can think of is an issue with the new. Is there any possibility that on occasion you could end up calling an overloaded operator new?
Also, for completeness, what is the declaration of m_ppEntry?
I'm assuming you're using a nothrow new; otherwise the assert(m_ppEntry != NULL); would be meaningless, since a plain new[] throws std::bad_alloc on failure rather than returning NULL.

What is %gs in Assembly

void return_input (void)
{
char array[30];
gets (array);
printf("%s\n", array);
}
After compiling it with gcc, this function is converted to the following assembly code:
push %ebp
mov %esp,%ebp
sub $0x28,%esp
mov %gs:0x14,%eax
mov %eax,-0x4(%ebp)
xor %eax,%eax
lea -0x22(%ebp),%eax
mov %eax,(%esp)
call 0x8048374
lea -0x22(%ebp),%eax
mov %eax,(%esp)
call 0x80483a4
mov -0x4(%ebp),%eax
xor %gs:0x14,%eax
je 0x80484ac
call 0x8048394
leave
ret
I don't understand two lines:
mov %gs:0x14,%eax
xor %gs:0x14,%eax
What is %gs, and what exactly these two lines do?
This is compilation command:
cc -c -mpreferred-stack-boundary=2 -ggdb file.c
GS is a segment register; its use in Linux can be read up on here (it's basically used for per-thread data).
mov %gs:0x14,%eax
xor %gs:0x14,%eax
This code is used to validate that the stack hasn't been overflowed or corrupted, using a canary value stored at %gs:0x14; see this.
gcc -fstack-protector-strong is on by default in many modern distros; you can use gcc -fno-stack-protector to not add those checks. (On x86, thread-local storage is cheap to access, so GCC keeps the randomized canary value there, making it somewhat harder to leak.)
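For comparison, 64-bit builds do the same thing but keep the canary at %fs:0x28 in thread-local storage, so you will typically see a sequence like this (a sketch of typical gcc output; the frame offset and label are illustrative):

movq %fs:0x28, %rax      # load the canary from TLS
movq %rax, -8(%rbp)      # stash a copy in this function's frame
# ... function body ...
movq -8(%rbp), %rax      # reload the copy before returning
xorq %fs:0x28, %rax      # compare it with the TLS value
je   .Lok                # intact: return normally
call __stack_chk_fail    # corrupted: abort via the libc handler
.Lok: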
In the AT&T style assembly languages, the percent sigil generally indicates a register. In x86 family processors from 386 onwards, GS is one of the so-called segment registers. However, in protected mode environments segment registers work as selector registers.
A virtual memory selector represents its own mapping of virtual address space together with its own access regime. In practical terms, %gs:0x14 can be thought of as a reference into an array whose origin is held in %gs (albeit the CPU does a bit of extra dereferencing). On modern GNU/Linux systems, %gs is usually used to point at the thread-local storage region. In the code you're asking about, however, only one item of the TLS matters — the stack canary.
The idea is to attempt to detect a buffer overflow error by placing a random but constant value — it's called a stack canary in memory of the canaries coal miners used to employ to signal rising levels of poisonous gases by dying — into the stack before gets() is called, above its stack frame, and checking whether it is still there after gets() has returned. gets() has no business overwriting this part of the stack — it is outside its own stack frame, and it is not given a pointer to it — so if the stack canary has died, something has gone wrong in a dangerous way. (C as a programming environment happens to be particularly prone to wrong-goings of this kind, and security researchers have learnt to exploit many of them over the last twenty years or so. Also, gets() happens to be a function that is inherently at risk of overflowing its target buffer.) You have not offered addresses with your code, but 0x80484ac is likely the address of leave, and the call 0x8048394, which is executed in case of a mismatch (that is, jumped over by je 0x80484ac in case of a match), is probably a call to __stack_chk_fail(), provided by libc to handle the stack corruption by fleeing the metaphorical poisonous mine.
The reason the canonical value of the stack canary is kept in the thread-local storage is that this way, every thread can have its own stack canary. Stacks themselves are normally not shared between threads, so it is natural to also not share the canary value.
ES, FS, GS: Extra Segment Registers
Can be used as extra segment registers; also used in special instructions that span segments (like string copies).
Taken from here:
http://www.hep.wisc.edu/~pinghc/x86AssmTutorial.htm
Hope it helps.

Initial state of program registers and stack on Linux ARM

I'm currently playing with ARM assembly on Linux as a learning exercise. I'm using 'bare' assembly, i.e. no libcrt or libgcc. Can anybody point me to information about what state the stack pointer and other registers will be in at the start of the program, before the first instruction is executed? Obviously pc/r15 points at _start, and the rest appear to be initialised to 0, with two exceptions: sp/r13 points to an address far outside my program, and r1 points to a slightly higher address.
So to some solid questions:
What is the value in r1?
Is the value in sp a legitimate stack allocated by the kernel?
If not, what is the preferred method of allocating a stack: using brk, or allocating a static .bss section?
Any pointers would be appreciated.
Since this is Linux, you can look at how it is implemented by the kernel.
The registers seem to be set by the call to start_thread at the end of load_elf_binary (if you are using a modern Linux system, it will almost always be using the ELF format). For ARM, the registers seem to be set as follows:
r0 = first word in the stack
r1 = second word in the stack
r2 = third word in the stack
sp = address of the stack
pc = binary entry point
cpsr = endianess, thumb mode, and address limit set as needed
Clearly you have a valid stack. I think the values of r0-r2 are junk, and you should instead read everything from the stack (you will see why I think this later). Now, let's look at what is on the stack. What you will read from the stack is filled by create_elf_tables.
One interesting thing to notice here is that this function is architecture-independent, so the same things (mostly) will be put on the stack on every ELF-based Linux architecture. The following is on the stack, in the order you would read it:
The number of parameters (this is argc in main()).
One pointer to a C string for each parameter, followed by a zero (this is the contents of argv in main(); argv would point to the first of these pointers).
One pointer to a C string for each environment variable, followed by a zero (this is the contents of the rarely-seen envp third parameter of main(); envp would point to the first of these pointers).
The "auxiliary vector", which is a sequence of pairs (a type followed by a value), terminated by a pair with a zero (AT_NULL) in the first element. This auxiliary vector has some interesting and useful information, which you can see (if you are using glibc) by running any dynamically-linked program with the LD_SHOW_AUXV environment variable set to 1 (for instance LD_SHOW_AUXV=1 /bin/true). This is also where things can vary a bit depending on the architecture.
Since this structure is the same for every architecture, you can look for instance at the drawing on page 54 of the SYSV 386 ABI to get a better idea of how things fit together (note, however, that the auxiliary vector type constants on that document are different from what Linux uses, so you should look at the Linux headers for them).
Now you can see why the contents of r0-r2 are garbage. The first word in the stack is argc, the second is a pointer to the program name (argv[0]), and the third probably was zero for you because you called the program with no arguments (it would be argv[1]). I guess they are set up this way for the older a.out binary format, which as you can see at create_aout_tables puts argc, argv, and envp in the stack (so they would end up in r0-r2 in the order expected for a call to main()).
Finally, why was r0 zero for you instead of one (argc should be one if you called the program with no arguments)? I am guessing something deep in the syscall machinery overwrote it with the return value of the system call (which would be zero since the exec succeeded). You can see in kernel_execve (which does not use the syscall machinery, since it is what the kernel calls when it wants to exec from kernel mode) that it deliberately overwrites r0 with the return value of do_execve.
Here's what I use to get a Linux/ARM program started with my compiler:
/** The initial entry point.
*/
asm(
" .text\n"
" .globl _start\n"
" .align 2\n"
"_start:\n"
" sub lr, lr, lr\n" // Clear the link register.
" ldr r0, [sp]\n" // Get argc...
" add r1, sp, #4\n" // ... and argv ...
" add r2, r1, r0, LSL #2\n" // ... and compute environ.
" bl _estart\n" // Let's go!
" b .\n" // Never gets here.
" .size _start, .-_start\n"
);
As you can see, I just get the argc, argv, and environ stuff from the stack at [sp].
A little clarification: the stack pointer points to a valid area in the process' memory. r0, r1, r2, and r3 hold the first four parameters of the function being called; I populate the first three with argc, argv, and environ, respectively.
Here's the uClibc crt. It seems to suggest that all registers are undefined except r0 (which contains a function pointer to be registered with atexit()) and sp which contains a valid stack address.
So, the value you see in r1 is probably not something you can rely on.
Some data are placed on the stack for you.
I've never used ARM Linux, but I suggest you either look at the source for the libcrt and see what they do, or use gdb to step into an existing executable. You shouldn't need the source code; just step through the assembly code.
Everything you need to find out should happen within the very first code executed by any binary executable.
Hope this helps.
Tony

Trying to understand gcc's complicated stack-alignment at the top of main that copies the return address

Hi, I have disassembled some programs (Linux) that I wrote in order to understand better how they work, and I noticed that the main function always begins with:
lea ecx,[esp+0x4] ; I assume this is for getting the address of the first argument of the main...why ?
and esp,0xfffffff0 ; ??? is the compiler trying to align the stack pointer on 16 bytes ???
push DWORD PTR [ecx-0x4] ; I understand the assembler is pushing the return address....why ?
push ebp
mov ebp,esp
push ecx ;why is ecx pushed too ??
So my question is: why is all this work done?
I only understand the use of:
push ebp
mov ebp,esp
the rest seems useless to me...
I've had a go at it:
;# As you have already noticed, the compiler wants to align the stack
;# pointer on a 16 byte boundary before it pushes anything. That's
;# because certain instructions' memory access needs to be aligned
;# that way.
;# So in order to first save the original offset of esp (+4), it
;# executes the first instruction:
lea ecx,[esp+0x4]
;# Now alignment can happen. Without the previous insn the next one
;# would have made the original esp unrecoverable:
and esp,0xfffffff0
;# Next it pushes the return addresss and creates a stack frame. I
;# assume it now wants to make the stack look like a normal
;# subroutine call:
push DWORD PTR [ecx-0x4]
push ebp
mov ebp,esp
;# Remember that ecx is still the only value that can restore the
;# original esp. Since ecx may be garbled by any subroutine calls,
;# it has to save it somewhere:
push ecx
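For completeness, here is the matching epilogue (a sketch of what gcc typically emits for this prologue; the exact sequence varies between compiler versions), which undoes the dance by restoring esp through the saved ecx:

;# (esp has already been brought back to the point where ecx was pushed)
pop ecx              ;# recover the "original esp + 4" saved by the prologue
pop ebp              ;# restore the caller's frame pointer
lea esp,[ecx-0x4]    ;# point esp back at the original return address
ret                  ;# return through the original, un-copied return address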
This is done to keep the stack aligned to a 16-byte boundary. Some instructions require certain data types to be aligned on as much as a 16-byte boundary. In order to meet this requirement, GCC makes sure that the stack is initially 16-byte aligned, and allocates stack space in multiples of 16 bytes. This can be controlled using the option -mpreferred-stack-boundary=num. If you use -mpreferred-stack-boundary=2 (for a 2^2 = 4-byte alignment), this alignment code will not be generated because the stack is always at least 4-byte aligned. However you could then have trouble if your program uses any data types that require stronger alignment.
According to the gcc manual:
On Pentium and PentiumPro, double and long double values should be aligned to an 8 byte boundary (see -malign-double) or suffer significant run time performance penalties. On Pentium III, the Streaming SIMD Extension (SSE) data type __m128 may not work properly if it is not 16 byte aligned.
To ensure proper alignment of this values on the stack, the stack boundary must be as aligned as that required by any value stored on the stack. Further, every function must be generated such that it keeps the stack aligned. Thus calling a function compiled with a higher preferred stack boundary from a function compiled with a lower preferred stack boundary will most likely misalign the stack. It is recommended that libraries that use callbacks always use the default setting.
This extra alignment does consume extra stack space, and generally increases code size. Code that is sensitive to stack space usage, such as embedded systems and operating system kernels, may want to reduce the preferred alignment to -mpreferred-stack-boundary=2.
The lea loads the original stack pointer (from before the call to main) into ecx, since the stack pointer is about to be modified. This is used for two purposes:
to access the arguments to the main function, since they are relative to the original stack pointer
to restore the stack pointer to its original value when returning from main
lea ecx,[esp+0x4] ; I assume this is for getting the address of the first argument of the main...why ?
and esp,0xfffffff0 ; ??? is the compiler trying to align the stack pointer on 16 bytes ???
push DWORD PTR [ecx-0x4] ; I understand the assembler is pushing the return address....why ?
push ebp
mov ebp,esp
push ecx ;why is ecx pushed too ??
Even if every instruction worked perfectly with no speed penalty despite arbitrarily aligned operands, alignment would still increase performance. Imagine a loop referencing a 16-byte quantity that straddles two cache lines. Now, to load that one value into the cache, two entire cache lines have to be evicted, and what if you need them again in the same loop? The cache is so tremendously faster than RAM that cache performance is always critical.
Also, there usually is a speed penalty to shift misaligned operands into the registers.
Given that the stack is being realigned, we naturally have to save the old stack pointer in order to traverse stack frames for parameters and for returning.
ecx is a temporary register so it has to be saved. Also, depending on optimization level, some of the frame linkage ops that don't seem strictly necessary to run the program might well be important in order to set up a trace-ready chain of frames.
