How does a syscall actually happen on linux?

How does a syscall actually happen on linux? - linux

Inspired by this question
How can I force GDB to disassemble?
and related to this one
What is INT 21h?
How does an actually system call happen under linux? what happens when the call is performed, until the actual kernel routine is invoked ?

Assuming we're talking about x86:
The ID of the system call is deposited into the EAX register
Any arguments required by the system call are deposited into the locations dictated by the system call. For example, some system calls expect their argument to reside in the EBX register. Others may expect their argument to be sitting on the top of the stack.
An INT 0x80 interrupt is invoked.
The Linux kernel services the system call identified by the ID in the EAX register, depositing any results in pre-determined locations.
The calling code makes use of any results.
I may be a bit rusty at this, it's been a few years...

The given answers are correct but I would like to add that there are more mechanisms to enter kernel mode. Every recent kernel maps the "vsyscall" page in every process' address space. It contains little more than the most efficient syscall trap method.
For example on a regular 32 bit system it could contain:
0xffffe000: int $0x80
0xffffe002: ret
But on my 64-bitsystem I have access to the way more efficient method using the syscall/sysenter instructions
0xffffe000: push %ecx
0xffffe001: push %edx
0xffffe002: push %ebp
0xffffe003: mov %esp,%ebp
0xffffe005: sysenter
0xffffe007: nop
0xffffe008: nop
0xffffe009: nop
0xffffe00a: nop
0xffffe00b: nop
0xffffe00c: nop
0xffffe00d: nop
0xffffe00e: jmp 0xffffe003
0xffffe010: pop %ebp
0xffffe011: pop %edx
0xffffe012: pop %ecx
0xffffe013: ret
This vsyscall page also maps some systemcalls that can be done without a context switch. I know certain gettimeofday, time and getcpu are mapped there, but I imagine getpid could fit in there just as well.

This is already answered at
How is the system call in Linux implemented?
Probably did not match with this question because of the differing "syscall" term usage.

Basically, its very simple: Somewhere in memory lies a table where each syscall number and the address of the corresponding handler is stored (see http://lxr.linux.no/linux+v2.6.30/arch/x86/kernel/syscall_table_32.S for the x86 version)
The INT 0x80 interrupt handler then just takes the arguments out of the registers, puts them on the (kernel) stack, and calls the appropriate syscall handler.

Related

On x64 Linux, what is the difference between syscall, int 0x80 and ret to exit a program?

I decided yesterday to learn assembly (NASM syntax) after years of C++ and Python and I'm already confused about the way to exit a program. It's mostly about ret because it's the suggested instruction on SASM IDE.
I'm speaking for main obviously. I don't care about x86 backward compatibility. Only the x64 Linux best way. I'm curious.

If you use printf or other libc functions, it's best to ret from main or call exit. (Which are equivalent; main's caller will call the libc exit function.)
If not, if you were only making other raw system calls like write with syscall, it's also appropriate and consistent to exit that way, but either way, or call exit are 100% fine in main.
If you want to work without libc at all, e.g. put your code under _start: instead of main: and link with ld or gcc -static -nostdlib, then you can't use ret. Use mov eax, 231 (__NR_exit_group) / syscall.
main is a real & normal function like any other (called with a valid return address), but _start (the process entry point) isn't. On entry to _start, the stack holds argc and argv, so trying to ret would set RIP=argc, and then code-fetch would segfault on that unmapped address. Nasm segmentation fault on RET in _start
System call vs. ret-from-main
Exiting via a system call is like calling _exit() in C - skip atexit() and libc cleanup, notably not flushing any buffered stdout output (line buffered on a terminal, full-buffered otherwise).
This leads to symptoms such as Using printf in assembly leads to empty output when piping, but works on the terminal (or if your output doesn't end with \n, even on a terminal.)
main is a function, called (indirectly) from CRT startup code. (Assuming you link your program normally, like you would a C program.) Your hand-written main works exactly like a compiler-generate C main function would. Its caller (__libc_start_main) really does do something like int result = main(argc, argv); exit(result);,
e.g. call rax (pointer passed by _start) / mov edi, eax / call exit.
So returning from main is exactly1 like calling exit.
Syscall implementation of exit() for a comparison of the relevant C functions, exit vs. _exit vs. exit_group and the underlying asm system calls.
C question: What is the difference between exit and return? is primarily about exit() vs. return, although there is mention of calling _exit() directly, i.e. just making a system call. It's applicable because C main compiles to an asm main just like you'd write by hand.
Footnote 1: You can invent a hypothetical intentionally weird case where it's different. e.g. you used stack space in main as your stdio buffer with sub rsp, 1024 / mov rsi, rsp / ... / call setvbuf. Then returning from main would involve putting RSP above that buffer, and __libc_start_main's call to exit could overwrite some of that buffer with return addresses and locals before execution reached the fflush cleanup. This mistake is more obvious in asm than C because you need leave or mov rsp, rbp or add rsp, 1024 or something to point RSP at your return address.
In C++, return from main runs destructors for its locals (before global/static exit stuff), exit doesn't. But that just means the compiler makes asm that does more stuff before actually running the ret, so it's all manual in asm, like in C.
The other difference is of course the asm / calling-convention details: exit status in EAX (return value) or EDI (first arg), and of course to ret you have to have RSP pointing at your return address, like it was on function entry. With call exit you don't, and you can even do a conditional tailcall of exit like jne exit. Since it's a noreturn function, you don't really need RSP pointing at a valid return address. (RSP should be aligned by 16 before a call, though, or RSP%16 = 8 before a tailcall, matching the alignment after call pushes a return address. It's unlikely that exit / fflush cleanup will do any alignment-required stores/loads to the stack, but it's a good habit to get this right.)
(This whole footnote is about ret vs. call exit, not syscall, so it's a bit of a tangent from the rest of the answer. You can also run syscall without caring where the stack-pointer points.)
SYS_exit vs. SYS_exit_group raw system calls
The raw SYS_exit system call is for exiting the current thread, like pthread_exit().
(eax=60 / syscall, or eax=1 / int 0x80).
SYS_exit_group is for exiting the whole program, like _exit.
(eax=231 / syscall, or eax=252 / int 0x80).
In a single-threaded program you can use either, but conceptually exit_group makes more sense to me if you're going to use raw system calls. glibc's _exit() wrapper function actually uses the exit_group system call (since glibc 2.3). See Syscall implementation of exit() for more details.
However, nearly all the hand-written asm you'll ever see uses SYS_exit1. It's not "wrong", and SYS_exit is perfectly acceptable for a program that didn't start more threads. Especially if you're trying to save code size with xor eax,eax / inc eax (3 bytes in 32-bit mode) or push 60 / pop rax (3 bytes in 64-bit mode), while push 231/pop rax would be even larger than mov eax,231 because it doesn't fit in a signed imm8.
Note 1: (Usually actually hard-coding the number, not using __NR_... constants from asm/unistd.h or their SYS_... names from sys/syscall.h)
And historically, it's all there was. Note that in unistd_32.h, __NR_exit has call number 1, but __NR_exit_group = 252 wasn't added until years later when the kernel gained support for tasks that share virtual address space with their parent, aka threads started by clone(2). This is when SYS_exit conceptually became "exit current thread". (But one could easily and convincingly argue that in a single-threaded program, SYS_exit does still mean exit the whole program, because it only differs from exit_group if there are multiple threads.)
To be honest, I've never used eax=252 / int 0x80 in anything, only ever eax=1. It's only in 64-bit code where I often use mov eax,231 instead of mov eax,60 because neither number is "simple" or memorable the way 1 is, so might as well be a cool guy and use the "modern" exit_group way in my single-threaded toy program / experiment / microbenchmark / SO answer. :P (If I didn't enjoy tilting at windmills, I wouldn't spend so much time on assembly, especially on SO.)
And BTW, I usually use NASM for one-off experiments so it's inconvenient to use pre-defined symbolic constants for call numbers; with GCC to preprocess a .S before running GAS you can make your code self-documenting with #include <sys/syscall.h> so you can use mov $SYS_exit_group, %eax (or $__NR_exit_group), or mov eax, __NR_exit_group with .intel_syntax noprefix.
Don't use the 32-bit int 0x80 ABI in 64-bit code:
What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? explains what happens if you use the COMPAT_IA32_EMULATION int 0x80 ABI in 64-bit code.
It's totally fine for just exiting, as long as your kernel has that support compiled in, otherwise it will segfault just like any other random int number like int 0x7f. (e.g. on WSL1, or people that built custom kernels and disabled that support.)
But the only reason you'd do it that way in asm would be so you could build the same source file with nasm -felf32 or nasm -felf64. (You can't use syscall in 32-bit code, except on some AMD CPUs which have a 32-bit version of syscall. And the 32-bit ABI uses different call numbers anyway so this wouldn't let the same source be useful for both modes.)
Related:
Why am I allowed to exit main using ret? (CRT startup code calls main, you're not returning directly to the kernel.)
Nasm segmentation fault on RET in _start - you can't ret from _start
Using printf in assembly leads to empty output when piping, but works on the terminal stdout buffer (not) flushing with raw system call exit
Syscall implementation of exit() call exit vs. mov eax,60/syscall (_exit) vs. mov eax,231/syscall (exit_group).
Can't call C standard library function on 64-bit Linux from assembly (yasm) code - modern Linux distros config GCC in a way that call exit or call puts won't link with nasm -felf64 foo.asm && gcc foo.o.
Is main() really start of a C++ program? - Ciro's answer is a deep dive into how glibc + its CRT startup code actually call main (including x86-64 asm disassembly in GDB), and shows the glibc source code for __libc_start_main.
Linux x86 Program Start Up
or - How the heck do we get to main()? 32-bit asm, and more detail than you'll probably want until you're a lot more comfortable with asm, but if you've ever wondered why CRT runs so much code before getting to main, that covers what's happening at a level that's a couple steps up from using GDB with starti (stop at the process entry point, e.g. in the dynamic linker's _start) and stepi until you get to your own _start or main.
https://stackoverflow.com/tags/x86/info lots of good links about this and everything else.

What is wrong with this emulation of CMPXCHG16B instruction?

I'm trying to run a binary program that uses CMPXCHG16B instruction at one place, unfortunately my Athlon 64 X2 3800+ doesn't support it. Which is great, because I see it as a programming challenge. The instruction doesn't seem to be that hard to implement with a cave jump, so that's what I did, but something didn't work, program just froze in a loop. Maybe someone can tell me if I implemented my CMPXCHG16B wrong?
Firstly the actual piece of machine code that I'm trying to emulate is this:
f0 49 0f c7 08 lock cmpxchg16b OWORD PTR [r8]
Excerpt from Intel manual describing CMPXCHG16B:
Compare RDX:RAX with m128. If equal, set ZF and load RCX:RBX into m128.
Else, clear ZF and load m128 into RDX:RAX.
First I replace all 5 bytes of the instruction with a jump to code cave with my emulation procedure, luckily the jump takes up exactly 5 bytes! The jump is actually a call instruction e8, but could be a jmp e9, both work.
e8 96 fb ff ff call 0xfffffb96(-649)
This is a relative jump with a 32-bit signed offset encoded in two's complement, the offset points to a code cave relative to address of next instruction.
Next the emulation code I'm jumping to:
PUSH R10
PUSH R11
MOV r10, QWORD PTR [r8]
MOV r11, QWORD PTR [r8+8]
TEST R10, RAX
JNE ELSE
TEST R11, RDX
JNE ELSE
MOV QWORD PTR [r8], RBX
MOV QWORD PTR [r8+8], RCX
JMP END
ELSE:
MOV RAX, r10
MOV RDX, r11
END:
POP R11
POP R10
RET
Personally, I'm happy with it, and I think it matches the functional specification given in manual. It restores stack and two registers r10 and r11 to their original order and then resumes execution. Alas it does not work! That is the code works, but the program acts as if it's waiting for a tip and burning electricity. Which indicates my emulation was not perfect and I inadvertently broke it's loop. Do you see anything wrong with it?
I notice that this is an atomic variant of it—owning to the lock prefix. I'm hoping it's something else besides contention that I did wrong. Or is there a way to emulate atomicity too?

It's not possible to emulate lock cmpxchg16b. It's sort of possible if all accesses to the target address are synchronised with a separate lock, but that includes all other instructions, including non-atomic stores to either half of the object, and atomic read-modify-writes (like xchg, lock cmpxchg, lock add, lock xadd) with one half (or other part) of the 16 byte object.
You can emulate cmpxchg16b (without lock) like you've done here, with the bugfixes from #Fifoernik's answer. That's an interesting learning exercise, but not very useful in practice, because real code that uses cmpxchg16b always uses it with a lock prefix.
A non-atomic replacement will work most of the time, because it's rare for a cache-line invalidate from another core to arrive in the small time window between two nearby instructions. This doesn't mean it's safe, it just means it's really hard to debug when it does occasionally fail. If you just want to get a game working for your own use, and can accept occasional lockups / errors, this might be useful. For anything where correctness is important, you're out of luck.
What about MFENCE? Seems to be what I need.
MFENCE before, after, or between the loads and stores won't prevent another thread from seeing a half-written value ("tearing"), or from modifying the data after your code has made the decision that the compare succeeded, but before it does the store. It might narrow the window of vulnerability, but it can't close it, because MFENCE only prevents reordering of the global visibility of our own stores and loads. It can't stop a store from another core from becoming visible to us after our loads but before our stores. That requires an atomic read-modify-write bus cycle, which is what locked instructions are for.
Doing two 8-byte atomic compare-exchanges would solve the window-of-vulnerability problem, but only for each half separately, leaving the "tearing" problem.
Atomic 16B loads/stores solves the tearing problem but not the atomicity problem between loads and stores. It's possible with SSE on some hardware, but not guaranteed to be atomic by the x86 ISA the way 8B naturally-aligned loads and stores are.
Xen's lock cmpxchg16b emulation:
The Xen virtual machine has an x86 emulator, I guess for the case where a VM starts on one machine and migrates to less-capable hardware. It emulates lock cmpxchg16b by taking a global lock, because there's no other way. If there was a way to emulate it "properly", I'm sure Xen would do that.
As discussed in this mailing list thread, Xen's solution still doesn't work when the emulated version on one core is accessing the same memory as the non-emulated instruction on another core. (The native version doesn't respect the global lock).
See also this patch on the Xen mailing list that changes the lock cmpxchg8b emulation to support both lock cmpxchg8b and lock cmpxchg16b.
I also found that KVM's x86 emulator doesn't support cmpxchg16b either, according to the search results for emulate cmpxchg16b.
I think all this is good evidence that my analysis is correct, and that it's not possible to emulate it safely.

I see these things wrong with your code to emulate the cmpxchg16b instruction:
You need to use cmp in stead of test to get a correct comparison.
You need to save/restore all flags except the ZF. The manual mentions :
The CF, PF, AF, SF, and OF flags are unaffected.
The manual contains the following:
IF (64-Bit Mode and OperandSize = 64)
THEN
TEMP128 ← DEST
IF (RDX:RAX = TEMP128)
THEN
ZF ← 1;
DEST ← RCX:RBX;
ELSE
ZF ← 0;
RDX:RAX ← TEMP128;
DEST ← TEMP128;
FI;
FI
So to really write code that "matches the functional specification given in manual" a write to the m128 is required. Although this particular write is part of the locked version lock cmpxchg16b, it won't of course do any good to the atomicity of the emulation! A straightforward emulation of lock cmpxchg16b is thus not possible. See #PeterCordes' answer.
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison
ELSE:
MOV RAX, r10
MOV RDX, r11
MOV QWORD PTR [r8], r10
MOV QWORD PTR [r8+8], r11
END:

How does the gettimeofday syscall wor‍k?

gettimeofday is a syscall of x86-86 according to this page(just search gettimeofday in the box):
int gettimeofday(struct timeval *tv, struct timezone *tz);
I thought the disassembly should be easy enough, just prepare the two pointers and call the related syscall, but its disassembly is doing much more:
(gdb) disas gettimeofday
Dump of assembler code for function gettimeofday:
0x00000034f408c2d0 <gettimeofday+0>: sub $0x8,%rsp
0x00000034f408c2d4 <gettimeofday+4>: mov $0xffffffffff600000,%rax
0x00000034f408c2db <gettimeofday+11>: callq *%rax
0x00000034f408c2dd <gettimeofday+13>: cmp $0xfffff001,%eax
0x00000034f408c2e2 <gettimeofday+18>: jae 0x34f408c2e9 <gettimeofday+25>
0x00000034f408c2e4 <gettimeofday+20>: add $0x8,%rsp
0x00000034f408c2e8 <gettimeofday+24>: retq
0x00000034f408c2e9 <gettimeofday+25>: mov 0x2c4cb8(%rip),%rcx # 0x34f4350fa8 <free+3356736>
0x00000034f408c2f0 <gettimeofday+32>: xor %edx,%edx
0x00000034f408c2f2 <gettimeofday+34>: sub %rax,%rdx
0x00000034f408c2f5 <gettimeofday+37>: mov %edx,%fs:(%rcx)
0x00000034f408c2f8 <gettimeofday+40>: or $0xffffffffffffffff,%rax
0x00000034f408c2fc <gettimeofday+44>: jmp 0x34f408c2e4 <gettimeofday+20>
End of assembler dump.
And I don't see syscall at all.
Can anyone explain how it works?

gettimeofday() on Linux is what's called a vsyscall and/or vdso. Hence you see the two lines:
0x00000034f408c2d4 : mov $0xffffffffff600000,%rax
0x00000034f408c2db : callq *%rax
in your disassembly. The address 0xffffffffff600000 is the vsyscall page (on x86_64).
The mechanism maps a specific kernel-created code page into user memory, so that a few "syscalls" can be made without the overhead of a user/kernel context switch, but rather as "ordinary" function call. The actual implementation is right here.

Syscalls generally create a lot of overhead, and given the abundance of gettimeofday() calls, one would prefer not to use a syscall. To that end, Linux kernel may map one or two special areas into each program, called vdso and vsyscall. Your implementation of gettimeofday() seems to be using vsyscall:
mov $0xffffffffff600000,%rax
This is the standard address of vsyscall map. Try cat /proc/self/maps to see that mapping. The idea behind vsyscall is that kernel provides fast user-space implementations of some functions, and libc just calls them.
Read this nice article for more details.

Can anybody explain some simple assembly code?

I have just started to learn assembly. This is the dump from gdb for a simple program which prints hello ranjit.
Dump of assembler code for function main:
0x080483b4 <+0>: push %ebp
0x080483b5 <+1>: mov %esp,%ebp
0x080483b7 <+3>: sub $0x4,%esp
=> 0x080483ba <+6>: movl $0x8048490,(%esp)
0x080483c1 <+13>: call 0x80482f0 <puts#plt>
0x080483c6 <+18>: leave
0x080483c7 <+19>: ret
My questions are :
Why every time ebp is pushed on to stack at start of the program? What is in the ebp which is necessary to run this program?
In second line why is ebp copied to esp?
I can't get the third line at all. what I know about SUB syntax is "sub dest,source", but here how can esp be subtracted from 4 and stored in 4?
What is this value "$0x8048490"? Why it is moved to esp, and why this time is esp closed in brackets? Does it denote something different than esp without brackets?
Next line is the call to function but what is this "0x80482f0"?
What is leave and ret (maybe ret means returning to lib c.)?
operating system : ubuntu 10, compiler : gcc

ebp is used as a frame pointer in Intel processors (assuming you're using a calling convention that uses frames).
It provides a known point of reference for locating passed-in parameters (on one side) and local variables (on the other) no matter what you do with the stack pointer while your function is active.
The sequence:
push %ebp ; save callers frame pointer
mov %esp,%ebp ; create a new frame pointer
sub $N,%esp ; make space for locals
saves the frame pointer for the previous stack frame (the caller), loads up a new frame pointer, then adjusts the stack to hold things for the current "stack level".
Since parameters would have been pushed before setting up the frame, they can be accessed with [bp+N] where N is a suitable offset.
Similarly, because locals are created "under" the frame pointer, they can be accessed with [bp-N].
The leave instruction is a single one which undoes that stack frame. You used to have to do it manually but Intel introduced a faster way of getting it done. It's functionally equivalent to:
mov %ebp, %esp ; restore the old stack pointer
pop %ebp ; and frame pointer
(the old, manual way).
Answering the questions one by one in case I've missed something:
To start a new frame. See above.
It isn't. esp is copied to ebp. This is AT&T notation (the %reg is a dead giveaway) where (among other thing) source and destination operands are swapped relative to Intel notation.
See answer to (2) above. You're subtracting 4 from esp, not the other way around.
It's a parameter being passed to the function at 0x80482f0. It's not being loaded into esp but into the memory pointed at by esp. In other words, it's being pushed on the stack. Since the function being called is puts (see (5) below), it will be the address of the string you want putsed.
The function name in the <> after the address. It's calling the puts function (probably the one in the standard library though that's not guaranteed). For a description of what the PLT is, see here.
I've already explained leave above as unwinding the current stack frame before exiting. The ret simply returns from the current function. If the current functtion is main, it's going back to the C startup code.

In my career I learned several assembly languages, you didn't mention which but it appears Intel x86 (segmented memory model as PaxDiablo pointed out). However, I have not used assembly since last century (lucky me!). Here are some of your answers:
The EBP register is pushed onto the stack at the beginning because we need it further along in other operations of the routine. You don't want to just discard its original value thus corrupting the integrity of the rest of the application.
If I remember correctly (I may be wrong, long time) it is the other way around, we are moving %esp INTO %ebp, remember we saved it in the previous line? now we are storing some new value without destroying the original one.
Actually they are SUBstracting the value of four (4) FROM the contents of the %esp register. The resulting value is not stored on "four" but on %esp. If %esp had 0xFFF8 after the SUB it will contain 0xFFF4. I think this is called "Immediate" if my memory serves me. What is happening here (I reckon) is the computation of a memory address (4 bytes less).
The value $0x8048490 I don't know. However, it is NOT being moved INTO %esp but rather INTO THE ADDRESS POINTED TO BY THE CONTENTS OF %esp. That is why the notation is (%esp) rather than %esp. This is kind of a common notation in all assembly languages I came about in my career. If on the other hand the right operand was simply %esp, then the value would have been moved INTO the %esp register. Basically the %esp register's contents are being used for addressing.
It is a fixed value and the string on the right makes me think that this value is actually the address of the puts() (Put String) compiler library routine.
"leave" is an instrution that is the equivalent of "pop %ebp". Remember we saved the contents of %ebp at the beginning, now that we are done with the routine we are restoring it back into the register so that the caller gets back to its context. The "ret" instruction is the final instruction of the routine, it "returns" to the caller.

Help with understanding a very basic main() disassembly in GDB

Heyo,
I have written this very basic main function to experiment with disassembly and also to see and hopefully understand what is going on at the lower level:
int main() {
return 6;
}
Using gdb to disas main produces this:
0x08048374 <main+0>: lea 0x4(%esp),%ecx
0x08048378 <main+4>: and $0xfffffff0,%esp
0x0804837b <main+7>: pushl -0x4(%ecx)
0x0804837e <main+10>: push %ebp
0x0804837f <main+11>: mov %esp,%ebp
0x08048381 <main+13>: push %ecx
0x08048382 <main+14>: mov $0x6,%eax
0x08048387 <main+19>: pop %ecx
0x08048388 <main+20>: pop %ebp
0x08048389 <main+21>: lea -0x4(%ecx),%esp
0x0804838c <main+24>: ret
Here is my best guess as to what I think is going on and what I need help with line-by-line:
lea 0x4(%esp),%ecx
Load the address of esp + 4 into ecx. Why do we add 4 to esp?
I read somewhere that this is the address of the command line arguments. But when I did x/d $ecx I get the value of argc. Where are the actual command line argument values stored?
and $0xfffffff0,%esp
Align stack
pushl -0x4(%ecx)
Push the address of where esp was originally onto the stack. What is the purpose of this?
push %ebp
Push the base pointer onto the stack
mov %esp,%ebp
Move the current stack pointer into the base pointer
push %ecx
Push the address of original esp + 4 on to stack. Why?
mov $0x6,%eax
I wanted to return 6 here so i'm guessing the return value is stored in eax?
pop %ecx
Restore ecx to value that is on the stack. Why would we want ecx to be esp + 4 when we return?
pop %ebp
Restore ebp to value that is on the stack
lea -0x4(%ecx),%esp
Restore esp to it's original value
ret
I am a n00b when it comes to assembly so any help would be great! Also if you see any false statements about what I think is going on please correct me.
Thanks a bunch! :]

Stack frames
The code at the beginning of the function body:
push %ebp
mov %esp, %ebp
is to create the so-called stack frame, which is a "solid ground" for referencing parameters and objects local to the procedure. The %ebp register is used (as its name indicates) as a base pointer, which points to the base (or bottom) of the local stack inside the procedure.
After entering the procedure, the stack pointer register (%esp) points to the return address stored on the stack by the call instruction (it is the address of the instruction just after the call). If you'd just invoke ret now, this address would be popped from the stack into the %eip (instruction pointer) and the code would execute further from that address (of the next instruction after the call). But we don't return yet, do we? ;-)
You then push %ebp register to save its previous value somewhere and not lose it, because you'll use it for something shortly. (BTW, it usually contains the base pointer of the caller function, and when you peek that value, you'll find a previously stored %ebp, which would be again a base pointer of the function one level higher, so you can trace the call stack that way.) When you save the %ebp, you can then store the current %esp (stack pointer) there, so that %ebp will point to the same address: the base of the current local stack. The %esp will move back and forth inside the procedure when you'll be pushing and popping values on the stack or reserving & freeing local variables. But %ebp will stay fixed, still pointing to the base of the local stack frame.
Accessing parameters
Parameters passed to the procedure by the caller are "burried just uner the ground" (that is, they have positive offsets relative to the base, because stack grows down). You have in %ebp the address of the base of the local stack, where lies the previous value of the %ebp. Below it (that is, at 4(%ebp) lies the return address. So the first parameter will be at 8(%ebp), the second at 12(%ebp) and so on.
Local variables
And local variables could be allocated on the stack above the base (that is, they'd have negative offsets relative to the base). Just subtract N to the %esp and you've just allocated N bytes on the stack for local variables, by moving the top of the stack above (or, precisely, below) this region :-) You can refer to this area by negative offsets relative to %ebp, i.e. -4(%ebp) is the first word, -8(%ebp) is second etc. Remember that (%ebp) points to the base of the local stack, where the previous %ebp value has been saved. So remember to restore the stack to the previous position before you try to restore the %ebp through pop %ebp at the end of the procedure. You can do it two ways:
1. You can free only the local variables by adding back the N to the %esp (stack pointer), that is, moving the top of the stack as if these local variables had never been there. (Well, their values will stay on the stack, but they'll be considered "freed" and could be overwritten by subsequent pushes, so it's no longer safe to refer them. They're dead bodies ;-J )
2. You can flush the stack down to the ground and free all local space by simply restoring the %esp from the %ebp which has been fixed earlier to the base of the stack. It'll restore the stack pointer to the state it has just after entering the procedure and saving the %esp into %ebp. It's like loading the previously saved game when you've messed something ;-)
Turning off frame pointers
It's possible to have a less messy assembly from gcc -S by adding a switch -fomit-frame-pointer. It tells GCC to not assemble any code for setting/resetting the stack frame until it's really needed for something. Just remember that it can confuse debuggers, because they usually depend on the stack frame being there to be able to track up the call stack. But it won't break anything if you don't need to debug this binary. It's perfectly fine for release targets and it saves some spacetime.
Call Frame Information
Sometimes you can meet some strange assembler directives starting from .cfi interleaved with the function header. This is a so-called Call Frame Information. It's used by debuggers to track the function calls. But it's also used for exception handling in high-level languages, which needs stack unwinding and other call-stack-based manipulations. You can turn it off too in your assembly, by adding a switch -fno-dwarf2-cfi-asm. This tells the GCC to use plain old labels instead of those strange .cfi directives, and it adds a special data structures at the end of your assembly, refering to those labels. This doesn't turn off the CFI, just changes the format to more "transparent" one: the CFI tables are then visible to the programmer.

You did pretty good with your interpretation. When a function is called, the return address is automatically pushed to the stack, which is why argc, the first argument, has been pushed back to 4(%esp). argv would start at 8(%esp), with a pointer for each argument, followed by a null pointer. This function pushes the old value of %esp to the stack so that it can contain the original, unaligned value upon returned. The value of %ecx at return doesn't matter, which is why it is used as temporary storage for the %esp reference. Other than that, you are correct with everything.

Regarding your first question (where are stored the command line arguments), arguments to functions are right before ebp. I must say, your "real" main begins at < main + 10 >, where it pushes ebp and moves esp to ebp. I think that gcc messes everything up with all that leas just to replace the usual operations (addictions and subtractions) on esp before and after functions call. Usually a routine looks like this (simple function I did as an example):
0x080483b4 <+0>: push %ebp
0x080483b5 <+1>: mov %esp,%ebp
0x080483b7 <+3>: sub $0x10,%esp # room for local variables
0x080483ba <+6>: mov 0xc(%ebp),%eax # get arg2
0x080483bd <+9>: mov 0x8(%ebp),%edx # and arg1
0x080483c0 <+12>: lea (%edx,%eax,1),%eax # just add them
0x080483c3 <+15>: mov %eax,-0x4(%ebp) # store in local var
0x080483c6 <+18>: mov -0x4(%ebp),%eax # and return the sum
0x080483c9 <+21>: leave
0x080483ca <+22>: ret
Perhaps you've enabled some optimizations, which could make the code trickier.
Finally yes, the return value is stored in eax. Your interpretation is quite correct anyway.

The only thing I think that's outstanding from your original questions is why the following statements exist in your code:
0x08048381 <main+13>: push %ecx
0x08048382 <main+14>: mov $0x6,%eax
0x08048387 <main+19>: pop %ecx
The push and pop of %ecx at <main+13> and <main+19> don't seem to make much sense - and they don't really do anything in this example, but consider the case where your code invokes function calls.
There's no way for the system to guarantee that the calls to other functions - which will set up their own stack activation frames - won't reset register values. In fact they probably will. The code therefore sets up a saved register section on the stack where any registers used by the code (other than %esp and %ebp which are already saved though the regular stack setup) are stored in the stack before possibly handing control over to function calls in the "meat" of the current code block.
When these potential calls return, the system then pops the values off the stack to restore the pre-call register values. If you were writing assembler directly rather than compiling, you'd be responsible for storing and retrieving these register values, yourself.
In the case of your example code, however, there are no function calls - only a single instruction at <main+14> where you're setting the return value, but the compiler can't know that, and preserves its registers as usual.
It would be interesting to see what would happen here if you added C statements which pushed other values onto the stack after <main+14>. If I'm right about this being a saved register section of the stack, you'd expect the compiler to insert automatic pop statements prior to <main+19> in order to clear these values.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string