How to understand "cmpl $0x0, -0x30(%rbp)" / "je ..." - linux

This is the assembly code from my bomb lap question, I am stuck in phase2;
The bomb lab require us to find out the correct input based on assembly code or it will exploded.
From <+20> I know that %rbp -0x30(48) == 0 or it will call <+32> and explode the bomb; so %rbp = 48(DEC)
After that(+26) %rbp - 0x2c(44) must equal 1 or it will explode the bomb...
But since %rbp = 48, the bomb will explode anywhere so I am confuse now...
I think I misunderstand the compl , je/jne or how to calculate these things...

-0x30(%ebp) doesn't mean to use the value %ebp - 0x30. It's a memory address to read from. The instruction (cmpl) has an l suffix, so it's dealing with a 4 byte quantity. So what's actually happening is that it reads a 4 byte number from the address %ebp - 0x30 and checks whether it's zero.
(The $ prefix means it's an immediate value, not an address. This is why 0x0 is taken literally and not dereferenced.)

You don't need to know value of rbp to defuse the bomb, it's used always in relative way to address certain portion of memory. It's set at the start by the sequence:
pushq %rbp
movq %rsp, %rbp
pushq %r12
pushq %rbx
subq $0x20, %rsp
Which does push old rbp value into stack (to preserve it), then sets rbp to current stack pointer value (rsp). So (%rbp) contains the old rbp now. Then another two registers r12 and rbx are preserved by pushing them into stack (which makes now rsp == rbp - 16). And then rsp is adjusted one more time by subtracting 32, i.e. rsp == rbp - 48.
This is common pattern how to allocate memory space for local variables, i.e. any further push instruction will use memory below rbp - 48. Memory from rbp - 48 up to rbp - 17 (inc.) is undefined, free to use as "local variables" memory. Then at rbp - 16 is stored in 8 bytes original rbx value, at rbp - 8 is stored original r12 value, and at rbp is stored old rbp. (I mean all the "rbp - x" to be used as memory addresses to address values in the memory)
Then your code calls input 6 numbers, which I guess means 32 bit integers (judging by the following code), so it provides it with address rbp - 48. 6*4 = 24 => the inputted values will be stored in memory from address rbp - 48 up to (incl.) rbp - 25.
Note the gap of "unused memory" from rbp - 24 up to rbp - 17, that's another 8 bytes of spare local storage, not used by the code you posted, that's very likely padding added by compiler to make the rsp correctly aligned before callq read_six_numbers.
So basically you don't need to know where exactly rsp/rbp points to, it's pointing to some valid stack memory (either that, or the code will crash, and the bomb will NOT explode ... a bit weird design :)))). You can pick any arbitrary value, like 0x8000 for rsp at start, and simulate the run with that. (i.e. the <+11> leaq -0x30,(%rbp), %rsi is then rsi = 0x7FC8; (0x8000 - 8 - 0x30).
Rest of what -x(%rbp) means in various instructions (pay attention to the lea vs <any other instruction> semantics difference of "memory operand" usage, the lea does only the memory address calculation, while other instructions just start by that, using the calculated address to access actual value stored in memory) is described in the other answer + comment. Plus use the x86 instruction reference guide and read through some tutorials few more times, until it will make sense.

Related

how the stack frame works?

i am reading the csapp and some codes (x86-64) confuse me.
the book says "pushq %rbp" equals :
subq $8,%rsp
movq %rbp,(%rsp)
The c code is :
long P(long x,long y)
{
long u = Q(y);
long v = Q(x);
return u + v;
}
And the a part of the assembly code the book gives is :
pushq %rbp
pushq %rbx
subq $8,%rsp
The 'subq' confuse me.
Why it is like this?
Stack is a block of memory which grows down. There is a point in memory indicated by rsp/esp register which is a stack top. All memory above it is occupied by things placed on stack and all memory below it is free.
If you want to put something on stack you need to decrease rsp register (that is what sub instruction does) by number of bytes you need and rsp will point now to the newly reserver area you needed.
Lets look at this simple example:
rsp points to address 100. As said - whole memory above address 100 is used, and memory below 100 is free. So if you need 4 bytes you decrease rsp by 4 so it points to 96. As you have just decreased rsp you are aware that memory cells 96, 97, 98 and 99 are yours and you can use them. When you need more bytes on stack, then you again can decrease rsp to get it more.
There are two ways puting things on stack.
1. you can decrease rsp as shown above.
2. you can use push instruction which does exactly the same but in one step: push rax will decrease rsp by 8 bytes (size of rax register) and then will save its value in reserved area.
Sometimes also rbp register is being used to operate on stack.
If you need a bigger area on stack, for example for local variables, you reserve required amount on stack and then you save current rsp value into rbp. So rbp is a kind of bookmark remembering where your area is. Then you can push more things on stack, without loosing information where the allocated area was.
Before leaving function all things placed on stack need to be taken from it. It is done by pop instruction which is opposite to push - takes value from stack and moves it to register and then increases rsp. Or you can just increase rsp if you do not need to restore register values.

How to comprehend the flow of this assembly code

I can' t understand how this works.
Here's a part of main() program disassembled by objdump and written in intel notation
0000000000000530 <main>:
530: lea rdx,[rip+0x37d] # 8b4 <_IO_stdin_used+0x4>
537: mov DWORD PTR [rsp-0xc],0x0
53f: movabs r10,0xedd5a792ef95fa9e
549: mov r9d,0xffffffcc
54f: nop
550: mov eax,DWORD PTR [rsp-0xc]
554: cmp eax,0xd
557: ja 57c <main+0x4c>
559: movsxd rax,DWORD PTR [rdx+rax*4]
55d: add rax,rdx
560: jmp rax
The rodata section dump:
.rodata
08b0 01000200 ecfdffff d4fdffff bcfdffff ................
08c0 9cfdffff 7cfdffff 6cfdffff 4cfdffff ....|...l...L...
08d0 3cfdffff 2cfdffff 0cfdffff ecfcffff <...,...........
08e0 d4fcffff b4fcffff 0cfeffff ............
In 530, rip is [537] so [rdx] = [537 + 37d] = 8b4.
First question is the value of rdx is how large? Is the valueis ec, or ecfdffff or something else? If it has DWORD, I can understand that has 'ecfdffff' (even this is wrong too?:() but this program don't declare it. How can I judge the value?
Then the program continues.
In 559, rax is first appeared.
The second question is this rax can interpret as a part of eax and in this time is the rax = 0? If rax is 0, in 559 means rax = DWORD[rdx] and the value of rax become ecfdffff and next [55d] do rax += rdx, and I think this value can't jamp. There must be something wrong, so tell me where, or how i make any wrongs.
I think I'll diverge from what Peter discussed (he provides good information) and get to the heart of some issues I think are causing you problems. When I first glanced at this question I assumed that the code was likely compiler generated and the jmp rax was likely the result of some control flow statement. The most likely way to generate such a code sequence is via a C switch. It isn't uncommon for a switch statement to be made of a jump table to say what code should execute depending on the control variable. As an example: the control variable for switch(a) is a.
This all made sense to me, and I wrote up a number of comments (now deleted) that ultimately resulted in bizarre memory addresses that jmp rax would go to. I had errands to run but when I returned I had the aha moment that you may have had the same confusion I did. This output from objdump using the -s option appeared as:
.rodata
08b0 01000200 ecfdffff d4fdffff bcfdffff ................
08c0 9cfdffff 7cfdffff 6cfdffff 4cfdffff ....|...l...L...
08d0 3cfdffff 2cfdffff 0cfdffff ecfcffff <...,...........
08e0 d4fcffff b4fcffff 0cfeffff ............
One of your questions seems to be about what values get loaded here. I never used the -s option to look at data in the sections and was unaware that although the dump splits the data out in groups of 4 bytes (32-bit values) they are shown in byte order as it appears in memory. I had at first assumed the output was displaying these values from Most Significant Byte to Least significant byte and objdump -s had done the conversion. That is not the case.
You have to manually reverse the bytes of each group of 4 bytes to get the real value that would be read from memory into a register.
ecfdffff in the output actually means ec fd ff ff. As a DWORD value (32-bit) you need to reverse the bytes to get the HEX value as you would expect when loaded from memory. ec fd ff ff reversed would be ff ff fd ec or the 32-bit value 0xfffffdec. Once you realize that then this makes a lot more sense. If you make this same adjustment for all the data in that table you'd get:
.rodata
08b0: 0x00020001 0xfffffdec 0xfffffdd4 0xfffffdbc
08c0: 0xfffffd9c 0xfffffd7c 0xfffffd6c 0xfffffd4c
08d0: 0xfffffd3c 0xfffffd2c 0xfffffd0c 0xfffffcec
08e0: 0xfffffcd4 0xfffffcb4 0xfffffe0c
Now if we look at the code you have it starts with:
530: lea rdx,[rip+0x37d] # 8b4 <_IO_stdin_used+0x4>
This doesn't load data from memory, it is computing the effective address of some data and places the address in RDX. The disassembly from OBJDUMP is displaying the code and data with the view that it is loaded in memory starting at 0x000000000000. When it is loaded into memory it may be placed at some other address. GCC in this case is producing position independent code (PIC). It is generated in such a way that the first byte of the program can start at an arbitrary address in memory.
The # 8b4 comment is the part we are concerned about (you can ignore the information after that). The disassembly is saying if the program was loaded at 0x0000000000000000 then the value loaded into RDX would be 0x8b4. How was that arrived at? This instruction starts at 0x530 but with RIP relative addressing the RIP (instruction pointer) is relative to the address just after the current instruction. The address the disassembler used was 0x537 (the byte after the current instruction is the address of the first byte of the next instruction). The instruction adds 0x37d to RIP and gets 0x537+0x37d=0x8b4. The address 0x8b4 happens to be in the .rodata section which you are given a dump of (as discussed above).
We now know that RDX contains the base of some data. The jmp rax suggests this is likely going to be a table of 32-bit values that are used to determine what memory location to jump to depending on the value in the control variable of a switch statement.
This statement appears to be storing the value 0 as a 32-bit value on the stack.
537: mov DWORD PTR [rsp-0xc],0x0
These appear to be variables that the compiler chose to store in registers (rather than memory).
53f: movabs r10,0xedd5a792ef95fa9e
549: mov r9d,0xffffffcc
R10 is being loaded with the 64-bit value 0xedd5a792ef95fa9e. R9D is the lower 32-bits of the 64-bit R9 register.The value 0xffffffcc is being loaded into the lower 32-bits of R9 but there is something else occurring. In 64-bit mode if the destination of an instruction is a 32-bit register the CPU automatically zero extends the value into the upper 32-bits of the register. The CPU is guaranteeing us that the upper 32-bits are zeroed.
This is a NOP and doesn't do anything except align the next instruction to memory address 0x550. 0x550 is a value that is 16-byte aligned. This has some value and may hint that the instruction at 0x550 may be the first instruction at the top of a loop. An optimizer may place NOPs into the code to align the first instruction at the top of a loop to a 16-byte aligned address in memory for performance reasons:
54f: nop
Earlier the 32-bit stack based variable at rsp-0xc was set to zero. This reads the value 0 from memory as a 32-bit value and stores it in EAX. Since EAX is a 32-bit register being used as the destination for the instruction the CPU automatically filled the upper 32-bits of RAX to 0. So all of RAX is zero.
550: mov eax,DWORD PTR [rsp-0xc]
EAX is now being compared to 0xd. If it is above (ja) it goes to the instruction at 0x57c.
554: cmp eax,0xd
557: ja 57c <main+0x4c>
We then have this instruction:
559: movsxd rax,DWORD PTR [rdx+rax*4]
The movsxd is an instruction that will take a 32-bit source operand (in this case the 32-bit value at memory address RDX+RAX*4) load it into the bottom 32-bits of RAX and then sign extend the value into the upper 32-bits of RAX. Effectively if the 32-bit value is negative (the most significant bit is 1) the upper 32-bits of RAX will be set to 1. If the 32-bit value is not negative the upper 32-bits of RAX will be set to 0.
When this code is first encountered RDX contains the base of some table at 0x8b4 from the beginning of the program loaded in memory. RAX is set to 0. Effectively the first 32-bits in the table are copied to RAX and sign extended. As seen earlier the value at offset 0xb84 is 0xfffffdec. That 32-bit value is negative so RAX contains 0xfffffffffffffdec.
Now to the meat of the situation:
55d: add rax,rdx
560: jmp rax
RDX still holds the address to the beginning of a table in memory. RAX is being added to that value and stored back in RAX (RAX = RAX+RDX). We then JMP to the address stored in RAX. So this code all seems to suggest we have a JUMP table with 32-bit values that we are using to determine where we should go. So then the obvious question. What are the 32-bit values in the table? The 32-bit values are the difference between the beginning of the table and the address of the instruction we want to jump to.
We know the table is 0x8b4 from the location our program is loaded in memory. The C compiler told the linker to compute the difference between 0x8b4 and the address where the instruction we want to execute resides. If the program had been loaded into memory at 0x0000000000000000 (hypothetically), RAX = RAX+RDX would have resulted in RAX being 0xfffffffffffffdec + 0x8b4 = 0x00000000000006a0. We then use jmp rax to jump to 0x6a0. You didn't show the entire dump of memory but there is going to be code at 0x6a0 that will execute when the value passed to the switch statement is 0. Each 32-bit value in the JUMP table will be a similar offset to the code that will execute depending on the control variable in the switch statement. If we add 0x8b4 to all the entries in the table we get:
08b0: 0x000006a0 0x00000688 0x00000670
08c0: 0x00000650 0x00000630 0x00000620 0x00000600
08d0: 0x000005F0 0x000005e0 0x000005c0 0x000005a0
08e0: 0x00000588 0x00000568 0x000006c0
You should find that in the code you haven't provided us that these addresses coincide with code that appears after the jmp rax.
Given that the memory address 0x550 was aligned, I have a hunch that this switch statement is inside a loop that keeps executing as some kind of state machine until the proper conditions are met for it to exit. Likely the value of the control variable used for the switch statement is changed by the code in the switch statement itself. Each time the switch statement is run the control variable has a different value and will do something different.
The control variable for the switch statement was originally checked for the value being above 0x0d (13). The table starting at 0x8b4 in the .rodata section has 14 entries. One can assume the switch statement probably has 14 different states (cases).
but this program don't declare it
You're looking at disassembly of machine code + data. It's all just bytes in memory. Any labels the disassembler does manage to show are ones that got left in the executable's symbol table. They're irrelevant to how the CPU runs the machine code.
(The ELF program headers tell the OS's program loader how to map it into memory, and where to jump to as an entry point. This has nothing to do with symbols, unless a shared library references some globals or functions defined in the executable.)
You can single-step the code in GDB and watch register values change.
In 559, rax is first appeared.
EAX is the low 32 bits of RAX. Writing to EAX zero-extends into RAX implicitly. From mov DWORD PTR [rsp-0xc],0x0 and the later reload, we know that RAX=0.
This must have been un-optimized compiler output (or volatile int idx = 0; to defeat constant propagation), otherwise it would know at compile time that RAX=0 and could optimize away everything else.
lea rdx,[rip+0x37d] # 8b4
A RIP-relative LEA puts the address of static into a register. It's not a load from memory. (That happens later when movsxd with an indexed addressing mode uses RDX as the base address.)
The disassembler worked out the address for you; it's RDX = 0x8b4. (Relative to the start of the file; when actually running the program would be mapped at a virtual address like 0x55555...000)
554: cmp eax,0xd
557: ja 57c <main+0x4c>
559: movsxd rax,DWORD PTR [rdx+rax*4]
55d: add rax,rdx
560: jmp rax
This is a jump table. First it checks for an out-of-bounds index with cmp eax,0xd, then it indexes a table of 32-bit signed offsets using EAX (movsxd with an addressing mode that scales RAX by 4), and adds that to the base address of the table to get a jump target.
GCC could just make a jump table of 64-bit absolute pointers, but chooses not to so that .rodata is position-independent as well and doesn't need load-time fixups in a PIE executable. (Even though Linux does support doing that.) See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 where this is discussed (although the main focus of that bug is that gcc -fPIE can't turn a switch into a table lookup of string addresses, and actually still uses a jump table)
The jump-offset table address is in RDX, this is what was set up with the earlier LEA.

Why does %rbp point to nothing?

It is known that %rsp points to the top of the stack frame and %rbp points to the base of the stack frame. Then I can't understand why %rbp is 0x0 in this piece of code:
(gdb) x/4xg $rsp
0x7fffffffe170: 0x00000000004000dc 0x0000000000000010
0x7fffffffe180: 0x0000000000000001 0x00007fffffffe487
(gdb) disas HelloWorldProc
Dump of assembler code for function HelloWorldProc:
=> 0x00000000004000b0 <+0>: push %rbp
0x00000000004000b1 <+1>: mov %rsp,%rbp
0x00000000004000b4 <+4>: mov $0x1,%eax
0x00000000004000b9 <+9>: mov $0x1,%edi
0x00000000004000be <+14>: movabs $0x6000ec,%rsi
0x00000000004000c8 <+24>: mov $0xd,%edx
0x00000000004000cd <+29>: syscall
0x00000000004000cf <+31>: leaveq
0x00000000004000d0 <+32>: retq
End of assembler dump.
(gdb) x/xg $rbp
0x0: Cannot access memory at address 0x0
And why is it "saving" (pushing) %rbp to the stack if it points to nothing?
RBP is a general-purpose register, so it can contain any value that you (or your compiler) wants it to contain. It is only by convention that RBP is used to point to the procedure frame. According to this convention, the stack looks like this:
Low |====================|
addresses | Unused space |
| |
|====================| ← RSP points here
↑ | Function's |
↑ | local variables |
↑ | | ↑ RBP - x
direction |--------------------| ← RBP points here
of stack | Original/saved RBP | ↓ RBP + x
growth |--------------------|
↑ | Return pointer |
↑ |--------------------|
↑ | Function's |
| parameters |
| |
|====================|
| Parent |
| function's data |
|====================|
| Grandparent |
High | function's data |
addresses |====================|
As such, the boilerplate prologue code for a function is:
push %rbp
mov %rsp, %rbp
This first instruction saves the original value of RBP by pushing it onto the stack, and then the second instruction sets RBP to the original value of RSP. After this, the stack looks exactly like the one depicted above, in the beautiful ASCII art.
The function then does its thing, executing whatever code it wants to execute. As suggested in the drawing, it can access any parameters it was passed on the stack by using positive offsets from RBP (i.e., RBP+x), and it can access any local variables it has allocated space for on the stack by using negative offsets from RBP (i.e., RBP-x). If you understand that the stack grows downward in memory (addresses get smaller), then this offsetting scheme makes sense.
Finally, the boilerplate epilogue code to end a function is:
leaveq
or, equivalently:
mov %rbp, %rsp
pop %rbp
This first instruction sets RSP to the value of RBP (the working value used throughout the function's code), and the second instruction pops the "original/saved RBP" off the stack, into RBP. It is no coincidence that this is precisely the opposite of what was done in the prologue code we looked at above.
Note, though, that this is merely a convention. Unless required by the ABI, the compiler is free to use RBP as a general-purpose register, with no relation to the stack pointer. This works because the compiler can just calculate the required offsets from RSP at compile time, and it is a common optimization, known as "frame pointer elision" (or "frame pointer omission"). It is especially common in 32-bit mode, where the number of available general-purpose registers is extremely small, but you'll sometimes see it in 64-bit code, too. When the compiler has elided the frame pointer, it doesn't need the prologue and epilogue code to manipulate it, so this can be omitted, too.
The reason you see all of this frame-pointer book-keeping is because you're analyzing unoptimized code, where the frame pointer is never elided because having it around often makes debugging easier (and since execution speed is not a significant concern).
The reason why it RBP is 0 upon entry to your function appears to be a peculiarity of GDB, and not something that you really need to concern yourself with. As Shift_Left notes in the comments, GDB under Linux pre-initializes all registers (except RSP) to 0 before handing off control to an application. If you had run this program outside of the debugger, and simply printed the initial value of RBP to stdout, you'd see that it would be non-zero.
But, again, the exact value shouldn't matter to you. Understanding the schematic drawing of the call stack above is the key. Assuming that frame pointers have not been elided, the compiler has no idea when it generates the prologue and epilogue code what value RBP will have upon entry, because it doesn't know where on the call stack the function will end up being called.

Why does VC++ 2010 often use ebx as a "zero register"?

Yesterday I was looking at some 32 bit code generated by VC++ 2010 (most probably; don't know about the specific options, sorry) and I was intrigued by a curious recurring detail: in many functions, it zeroed out ebx in the prologue, and it always used it like a "zero register" (think $zero on MIPS). In particular, it often:
used it to zero out memory; this is not unusual, as the encoding for a mov mem,imm is 1 to 4 bytes bigger than mov mem,reg (the full immediate value size has to be encoded even for 0), but usually (gcc) the necessary register is zeroed out "on demand", and kept for more useful purposes otherwise;
used it for compares against zero - as in cmp reg,ebx. This is what stroke me as really unusual, as it should be exactly the same as test reg,reg, but adds a dependency to an extra register. Now, keep in mind that this happened in non-leaf functions, with ebx being often pushed (by the callee) on and off the stack, so I would not trust this dependency to be always completely free. Also, it also used test reg,reg in the exact same fashion (test/cmp => jg).
Most importantly, registers on "classic" x86 are a scarce resource, if you start having to spill registers you waste a lot of time for no good reason; why waste one through all the function just to keep a zero in it? (still, thinking about it, I don't remember seeing much register spillage in functions that used this "zero-register" pattern).
So: what am I missing? Is it a compiler blooper or some incredibly smart optimization that was particularly interesting in 2010?
Here's an excerpt:
; standard prologue: ebp/esp, SEH, overflow protection, ... then:
xor ebx, ebx
mov [ebp+4], ebx ; zero out some locals
mov [ebp], ebx
call function_1
xor ecx, ecx ; ebx _not_ used to zero registers
cmp eax, ebx ; ... but used for compares?! why not test eax,eax?
setnz cl ; what? it goes through cl to check if eax is not zero?
cmp ecx, ebx ; still, why not test ecx,ecx?
jnz function_body
push 123456
call throw_something
function_body:
mov edx, [eax]
mov ecx, eax ; it's not like it was interested in ecx anyway...
mov eax, [edx+0Ch]
call eax ; virtual method call; ebx is preserved but possibly pushed/popped
lea esi, [eax+10h]
mov [ebp+0Ch], esi
mov eax, [ebp+10h]
mov ecx, [eax-0Ch]
xor edi, edi ; ugain, registers are zeroed as usual
mov byte ptr [ebp+4], 1
mov [ebp+8], ecx
cmp ecx, ebx ; why not test ecx,ecx?
jg somewhere
label1:
lea eax, [esi-10h]
mov byte ptr [ebp+4], bl ; ok, uses bl to write a zero to memory
lea ecx, [eax+0Ch]
or edx, 0FFFFFFFFh
lock xadd [ecx], edx
dec edx
test edx, edx ; now it's using the regular test reg,reg!
jg somewhere_else
Notice: an earlier version of this question said that it used mov reg,ebx instead of xor ebx,ebx; this was just me not remembering stuff correctly. Sorry if anybody put too much thought trying to understand that.
Everything you commented on as odd looks sub-optimal to me. test eax,eax sets all flags (except AF) the same as cmp against zero, and is preferred for performance and code-size.
On P6 (PPro through Nehalem), reading long-dead registers is bad because it can lead to register-read stalls. P6 cores can only read 2 or 3 not-recently-modified architectural registers from the permanent register file per clock (to fetch operands for the issue stage: the ROB holds operands for uops, unlike on SnB-family where it only holds references to the physical register file).
Since this is from VS2010, Sandybridge wasn't released yet, so it should have put a lot of weight on tuning for Pentium II/III, Pentium-M, Core2, and Nehalem where reading "cold" registers is a possible bottleneck.
IDK if anything like this ever made sense for integer regs, but I don't know much about optimizing for CPUs older than P6.
The cmp / setz / cmp / jnz sequence looks particularly braindead. Maybe it came from a compiler-internal canned sequence for producing a boolean value from something, and it failed to optimize a test of the boolean back into just using the flags directly? That still doesn't explain the use of ebx as a zero-register, which is also completely useless there.
Is it possible that some of that was from inline-asm that returned a boolean integer (using a silly that wanted a zero in a register)?
Or maybe the source code was comparing two unknown values, and it was only after inlining and constant-propagation that it turned into a compare against zero? Which MSVC failed to optimize fully, so it still kept 0 as a constant in a register instead of using test?
(the rest of this was written before the question included code).
Sounds weird, or like a case of CSE / constant-hoisting run amok. i.e. treating 0 like any other constant that you might want to load once and then reg-reg copy throughout the function.
Your analysis of the data-dependency behaviour is correct: moving from a register that was zeroed a while ago essentially starts a new dependency chain.
When gcc wants two zeroed registers, it often xor-zeroes one and then uses a mov or movdqa to copy to the other.
This is sub-optimal on Sandybridge where xor-zeroing doesn't need an execution port, but a possible win on Bulldozer-family where mov can run on the AGU or ALU, but xor-zeroing still needs an ALU port.
For vector moves, it's a clear win on Bulldozer: handled in register rename with no execution unit. But xor-zeroing of an XMM or YMM register still needs an execution port on Bulldozer-family (or two for ymm, so always use xmm with implicit zero-extension).
Still, I don't think that justifies tying up a register for the duration of a whole function, especially not if it costs extra saves/restores. And not for P6-family CPUs where register-read stalls are a thing.

Confused about AT&T Assembly Syntax for addressing modes vs. jmp and far jmp

In AT&T Assembly Syntax, literal values must be prefixed with a $ sign
But, in Memory Addressing, literal values do not have $ sign
for example:
mov %eax, -100(%eax)
and
jmp 100
jmp $100, $100
are different.
My question is why the $ prefix so confusing?
jmp 100 is a jump to absolute address 100, just like jmp my_label is a jump to the code at my_label. EIP = 100 or EIP = the address of my_label.
(jmp 100 assembles to a jmp rel32 with a R_386_PC32 relocation, asking the linker to fill in the right relative offset from the jmp instruction's own address to the absolute target.)
So in AT&T syntax, you can think of jmp x as sort of like an LEA into EIP.
Or another way to think of it is that code-fetch starts from the specified memory location. Requiring a $ for an immediate wouldn't really make sense, because the machine encoding for direct near jumps uses a relative displacement, not absolute. (http://felixcloutier.com/x86/JMP.html).
Also, indirect jumps use a different syntax (jmp *%eax register indirect or jmp *(%edi, %ecx, 4) memory indirect), so a distinction between immediate vs. memory isn't needed.
But far jump is a different story.
jmp ptr16:32 and jmp m16:32 are both available in 32-bit mode, so you do need to distinguish between ljmp *(%edi) vs. ljmp $100, $100.
Direct far jump (jmp far ptr16:32) does take an absolute segment:offset encoded into the instruction, just like add $123, %eax takes an immediate encoded into the instruction.
Question: My question is why the prefixed $ so confused ?
$ prefix is used to load/use the value as is.
example:
movl $5, %eax #copy value 5 to eax
addl $10,%eax # add 10 + 5 and store result in eax
$5, $10 are values (constants) and are not take from any external source like register or memory
In Memory addressing, Specifically "Direct addressing mode" we want to use the value stored in particular memory location.
example:
movl 20, %eax
The above would get the value stored in Memory location at 20.
Practially since memory locations are numbered in hexadecimal (0x00000000 to 0xfffffffff), it's difficult to specify the memory locations in hexadecimals in instructions. So we assign a symbol to the location
Example:
.section .data
mydata:
long 4
.section .text
.globl _start
_start
movl mydata, %eax
In the above code, mydata is symbolic representation given a particular memory location where value "4" is stored.
I hope the above clears your confusion.

Resources