I'm creating a driver for 32-bit and 64-bit Linux. One of the requirements is that all of the code must be self-contained, with no calls out to other code. On 64-bit I have no issues, but on 32-bit GCC seems to add a call instruction that targets the next byte. After searching a bit I found this link:
http://forum.soft32.com/linux/Strange-problem-disassembling-shared-lib-ftopict439936.html
Is there a way to disable this on 32-bit Linux?
Example:
32 bit disassembly:
<testfunc>:
0: push %ebp
1: mov %esp, %ebp
3: call 4 <test_func+0x4>
<...some operation on ebx as mentioned in the link above>
64 bit disassembly:
<testfunc>:
0: push %rbp
1: mov %rsp, %rbp
3: <...no call here>
There is no call in "testfunc" at all, so why is the 32-bit compiler adding these "call" instructions? Any help is appreciated.
What you're seeing in the 32-bit disassembly may be a way to make the code position-independent. Remember that call pushes the return address onto the stack, and that address equals eip plus a constant. 64-bit mode has rip-relative addressing; 32-bit mode doesn't. So this call may be simulating instruction-pointer-relative addressing.
This call instruction to the next byte comes from function profiling for the "gprof" tool. I was able to get rid of these "call" instructions by removing the "-pg" option from the compilation.
Since it was a driver, this was being picked up from Linux kernel config - CONFIG_FUNCTION_TRACER.
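For what it's worth, a "call to the next byte" is what a disassembler typically prints for a call in an unlinked object file: the rel32 field has not been relocated yet and only holds a placeholder addend (-4 for PC-relative relocations on i386). Here is a minimal Python sketch of that arithmetic. The raw bytes are an assumption, since the question doesn't show them, and call_rel32_target is a made-up helper name. Running objdump -dr on the object should show the relocation that names the real target.

```python
import struct

def call_rel32_target(code: bytes, offset: int) -> int:
    """Return the target a disassembler would print for an E8 call at `offset`."""
    if code[offset] != 0xE8:
        raise ValueError("not a call rel32")
    # rel32 is a signed little-endian displacement following the opcode
    (rel32,) = struct.unpack_from("<i", code, offset + 1)
    next_insn = offset + 5          # call rel32 is 5 bytes long
    return next_insn + rel32        # displacement is relative to the next instruction

# Assumed bytes: push %ebp; mov %esp,%ebp; call with an unrelocated rel32 of -4
code = bytes([0x55, 0x89, 0xE5, 0xE8, 0xFC, 0xFF, 0xFF, 0xFF])
target = call_rel32_target(code, 3)
print(hex(target))
```

With these assumed bytes the computed target is 4, i.e. one byte past the call opcode at offset 3, which matches the "call 4" in the listing.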
Related
I am trying to load the address of 'main' into a register (R10) in the GNU Assembler. I am unable to. Here is what I have and the error message I receive.
main:
lea main, %r10
I also tried the following syntax (this time using mov)
main:
movq $main, %r10
With both of the above I get the following error:
/usr/bin/ld: /tmp/ccxZ8pWr.o: relocation R_X86_64_32S against symbol `main' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
Compiling with -fPIC does not resolve the issue and just gives me the same exact error.
In x86-64, most immediates and displacements are still 32-bits because 64-bit would waste too much code size (I-cache footprint and fetch/decode bandwidth).
lea main, %reg is an absolute disp32 addressing mode which would stop load-time address randomization (ASLR) from choosing a random 64-bit (or 47-bit) address. So it's not supported on Linux except in position-dependent executables, or at all on MacOS where static code/data are always loaded outside the low 32 bits. (See the x86 tag wiki for links to docs and guides.) On Windows, you can build executables as "large address aware" or not. If you choose not, addresses will fit in 32 bits.
The standard efficient way to put a static address into a register is a RIP-relative LEA:
# RIP-relative LEA always works. Syntax for various assemblers:
lea main(%rip), %r10 # AT&T syntax
lea r10, [rip+main] # GAS .intel_syntax noprefix equivalent
lea r10, [rel main] ; NASM equivalent, or use default rel
lea r10, [main] ; FASM defaults to RIP-relative. MASM may also
See How do RIP-relative variable references like "[RIP + _a]" in x86-64 GAS Intel-syntax work? for an explanation of the 3 syntaxes, and Why are global variables in x86-64 accessed relative to the instruction pointer? (and this) for reasons why RIP-relative is the standard way to address static data.
This uses a 32-bit relative displacement from the end of the current instruction, like jmp/call. This can reach any static data in .data, .bss, .rodata, or function in .text, assuming the usual 2GiB total size limit for static code+data.
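To make the "relative to the end of the current instruction" part concrete, here is a rough Python sketch of how the displacement for lea main(%rip), %r10 would be computed and encoded. The address 0x401000 and both function names are made up for illustration; the opcode bytes (REX 0x4C, opcode 0x8D, ModRM 0x15) are the standard encoding for this instruction.

```python
import struct

def rip_relative_disp32(insn_addr: int, insn_len: int, target_addr: int) -> int:
    """Displacement from the end of the instruction to the target."""
    next_rip = insn_addr + insn_len
    disp = target_addr - next_rip
    if not -2**31 <= disp < 2**31:
        raise ValueError("target is out of rel32 range (+-2GiB)")
    return disp

def encode_lea_r10_rip(disp: int) -> bytes:
    """Encode lea disp32(%rip), %r10 (7 bytes)."""
    # 0x4C = REX.W + REX.R, 0x8D = LEA, 0x15 = ModRM mod=00 reg=r10&7 rm=101 (RIP-relative)
    return bytes([0x4C, 0x8D, 0x15]) + struct.pack("<i", disp)

main_addr = 0x401000                      # hypothetical address of main
disp = rip_relative_disp32(main_addr, 7, main_addr)   # the lea is main's first instruction
encoded = encode_lea_r10_rip(disp)
print(disp, encoded.hex())
```

Because the instruction is 7 bytes long and refers to its own start, the displacement comes out as -7 no matter where the code is loaded, which is the whole point of position independence.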
In position dependent code (built with gcc -fno-pie -no-pie for example) on Linux, you can take advantage of 32-bit absolute addressing to save code size. Also, mov r32, imm32 has slightly better throughput than RIP-relative LEA on Intel/AMD CPUs, so out-of-order execution may be able to overlap it better with the surrounding code. (Optimizing for code-size is usually less important than most other things, but when all else is equal pick the shorter instruction. In this case all else is at least equal, or also better with mov imm32.)
See 32-bit absolute addresses no longer allowed in x86-64 Linux? for more about how PIE executables are the default. (Which is why you got a link error about -fPIC with your use of a 32-bit absolute.)
# in a non-PIE executable, mov imm32 into a 32-bit register is even better
# same as you'd use in 32-bit code
## GAS AT&T syntax
mov $main, %r10d # 6 bytes
mov $main, %edi # 5 bytes: no REX prefix needed for a "legacy" register
## GAS .intel_syntax
mov edi, OFFSET main
;; mov edi, main ; NASM and FASM syntax
Note that writing any 32-bit register always zero-extends into the full 64-bit register (R10 and RDI).
lea main, %edi or lea main, %rdi would also work in a Linux non-PIE executable, but never use LEA with a [disp32] absolute addressing mode (even in 32-bit code where that doesn't require a SIB byte); mov is always at least as good.
The operand-size suffix is redundant when you have a register operand that uniquely determines it; I prefer to just write mov instead of movl or movq.
The stupid/bad way is a 10-byte 64-bit absolute address as an immediate:
# Inefficient, DON'T USE
movabs $main, %r10 # 10 bytes including the 64-bit absolute address
This is what you get in NASM if you use mov rdi, main instead of mov edi, main so many people end up doing this. Linux dynamic linking does actually support runtime fixups for 64-bit absolute addresses. But the use-case for that is for jump tables, not for absolute addresses as immediates.
movq $sign_extended_imm32, %reg (7 bytes) still uses a 32-bit absolute address, but wastes code bytes on a sign-extended mov to a 64-bit register, instead of implicit zero-extension to 64-bit from writing a 32-bit register.
By using movq, you're telling GAS you want a R_X86_64_32S relocation instead of a R_X86_64_64 64-bit absolute relocation.
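The byte counts quoted above (5, 6, 7 and 10) can be checked with a small Python sketch that builds the three mov encodings by hand. The function names and the address 0x401000 are made up; the register numbers (RDI = 7, R10 = 10) and opcodes are the standard x86-64 ones.

```python
import struct

RDI, R10 = 7, 10   # hardware register numbers

def mov_r32_imm32(reg: int, imm: int) -> bytes:
    """mov $imm32, %r32: B8+rd, plus a REX.B prefix for r8d-r15d."""
    prefix = b"\x41" if reg >= 8 else b""
    return prefix + bytes([0xB8 + (reg & 7)]) + struct.pack("<I", imm)

def mov_r64_simm32(reg: int, imm: int) -> bytes:
    """movq $sign_extended_imm32, %r64: REX.W C7 /0 id."""
    rex = 0x48 | (1 if reg >= 8 else 0)
    modrm = 0xC0 | (reg & 7)            # mod=11, /0, rm=reg
    return bytes([rex, 0xC7, modrm]) + struct.pack("<i", imm)

def movabs_r64_imm64(reg: int, imm: int) -> bytes:
    """movabs $imm64, %r64: REX.W B8+rd io."""
    rex = 0x48 | (1 if reg >= 8 else 0)
    return bytes([rex, 0xB8 + (reg & 7)]) + struct.pack("<Q", imm)

addr = 0x401000   # hypothetical absolute address of main in a non-PIE executable
for name, enc in [
    ("mov $main, %edi", mov_r32_imm32(RDI, addr)),
    ("mov $main, %r10d", mov_r32_imm32(R10, addr)),
    ("movq $main, %r10", mov_r64_simm32(R10, addr)),
    ("movabs $main, %r10", movabs_r64_imm64(R10, addr)),
]:
    print(f"{name:20} {len(enc):2} bytes  {enc.hex()}")
```

This only models the instruction bytes; in a real object file the immediate is filled in by the linker through the relocation types mentioned above.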
The only reason you'd ever want this encoding is for kernel code where static addresses are in the upper 2GiB of 64-bit virtual address space, instead of the lower 2GiB. mov has slight performance advantages over lea on some CPUs (e.g. running on more ports), but normally if you can use a 32-bit absolute it's in the low 2GiB of virtual address space where a mov r32, imm32 works.
(Related: Difference between movq and movabsq in x86-64)
PS: I intentionally left out any discussion of "large" or "huge" memory / code models, where RIP-relative +-2GiB addressing can't reach static data, or maybe can't even reach other code addresses. The above is for x86-64 System V ABI's "small" and/or "small-PIC" code models. You may need movabs $imm64 for medium and large models, but that's very rare.
I don't know if mov $imm32, %r32 works in Windows x64 executables or DLLs with runtime fixups, but RIP-relative LEA certainly does.
Semi-related: Call an absolute pointer in x86 machine code - if you're JITing, try to put the JIT buffer near existing code so you can call rel32, otherwise movabs a pointer into a register.
I have been messing around with Linux assembly on an x86 machine.
Basically my question is this: I pushed a couple of values onto the stack, moved the stack pointer into the base pointer, and moved the value 8 into a register to use as an offset to one of the pushed values. In the end I wanted to load that value into %ebx so it would be the exit status of the system call, but I get an error and have no clue why.
Error is: junk (%ebp) after register
Example:
.section .data
.section .text
.globl _start
_start:
pushl $50
pushl $20
movl %esp,%ebp
movl $8,%edx
movl %edx(%ebp),%ebx ## Supposed to be return value at system termination // PROBLEM HERE
movl $1,%eax ## System call
int $0x80 # Terminate program
I think part of the problem might be that in x86 the stack actually grows downwards, not up. You're adding to the base pointer, which is giving junk, where you have to subtract from it. I don't have an x86 machine so I can't test this, but have you tried something like movl -%edx(%ebp),%ebx?
Oops, I reversed the direction of the operands in my head. In this case, your stack looks like this:
1952 - ???
1948 - 50
1944 - 20 <- ebp <- esp
So when you take ebp+8, you aren't getting 50 or 20, you're getting whatever is at address 1952, and you don't know what that contains.
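If it helps, the two pushes can be modelled in a few lines of Python. This is only a toy model: the Stack32 class is invented for illustration, and the starting address 1952 is the same arbitrary number used in the diagram. Each pushl first subtracts 4 from esp and then stores the value there, so the last value pushed ends up at the lowest address.

```python
class Stack32:
    """Toy model of a 32-bit x86 stack that grows toward lower addresses."""

    def __init__(self, esp: int):
        self.esp = esp
        self.mem = {}            # address -> 32-bit value written by this program

    def push(self, value: int) -> None:
        self.esp -= 4            # pushl decrements esp first...
        self.mem[self.esp] = value   # ...then stores at the new top of stack

    def load(self, addr: int):
        # None means "this program never wrote here"
        return self.mem.get(addr)

s = Stack32(esp=1952)
s.push(50)                       # pushl $50
s.push(20)                       # pushl $20
ebp = s.esp                      # movl %esp, %ebp

for off in (0, 4, 8):
    print(f"{off}(%ebp) -> address {ebp + off}: {s.load(ebp + off)}")
```

In this model 0(%ebp) holds 20, 4(%ebp) holds 50, and 8(%ebp) is a slot the program never wrote.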
Check out the links in https://stackoverflow.com/tags/x86/info. I updated them recently, and added the info about using gdb to single-step asm.
What do you mean "get an error"? Segmentation fault? Syntax error? (The normal syntax is (%ebp, %edx). Only numeric-constant displacements go outside the parens, e.g. -4(%ebp, %edx))
Also, if you're going to use stack frame pointers at all, do the mov %esp, %ebp after pushing any registers you want to preserve, but before pushing args to any functions you're going to call. There's no need to use %ebp that way at all, though. gcc defaults to -fomit-frame-pointer since 4.4 I think. Still, a frame pointer can make it easier to keep track of where your local variables are, if you're pushing/popping stuff.
You might want to just start with 64bit asm, instead of messing around with the obsolete x86 args-on-the-stack ABI.
This just made me think of what's probably wrong with your code. You're probably getting a segfault (you didn't say whether it was that, a syntax error, or something else), because you probably built your code in 64bit mode. Build a 32bit binary, or change your code to use %rsp.
Here are the pop instructions that use the shortcut opcodes on page 1159 of the intel x64 manual:
58+ rw    POP r16    Pop top of stack into r16; increment stack pointer.
58+ rd    POP r64    Pop top of stack into r64; increment stack pointer.
Do these instructions use REX.R or REX.B to encode registers r8-r15, or is the register number just added to the opcode? Also, does the 64-bit version use REX.W? I've just never run into these register shortcut instructions before.
Instructions that encode a register operand as part of the opcode use the REX.B field to access registers r8 and so on.
64bit pushes and pops do not need a REX.W, they are 64bit by default and there is no way to make them 32bit. They can be made 16bit by using the 66h prefix.
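As a rough illustration of those two points, here is a small Python sketch that encodes pop for 64-bit mode. The function name encode_pop is made up; the byte values (0x58 base opcode, 0x41 for REX.B, 0x66 operand-size prefix) are the standard ones.

```python
def encode_pop(reg: int, bits: int = 64) -> bytes:
    """Encode `pop reg` for 64-bit mode; reg is the hardware register number 0-15."""
    if bits not in (16, 64):
        raise ValueError("in 64-bit mode pop is either 16-bit or 64-bit")
    if not 0 <= reg <= 15:
        raise ValueError("register number must be 0-15")
    out = bytearray()
    if bits == 16:
        out.append(0x66)             # operand-size prefix selects the 16-bit form
    if reg >= 8:
        out.append(0x41)             # REX with only the B bit set extends the opcode register field
    out.append(0x58 + (reg & 7))     # low 3 bits of the register go into the opcode
    return bytes(out)

print(encode_pop(0).hex())           # pop %rax
print(encode_pop(10).hex())          # pop %r10
print(encode_pop(10, bits=16).hex()) # pop %r10w
```

Note that no REX.W appears anywhere: the 64-bit form is the default, and the 0x66 prefix goes before the REX byte when both are present.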
My program is in 32bit mode running on x86_64 CPU (64bit OS, ubuntu 8.04). Is it possible to switch to 64bit mode (long mode) in user mode temporarily? If so, how?
Background story: I'm writing a library linked with a 32bit mode program, so it must be in 32bit mode at start. However, I'd like to use faster x86_64 instructions for better performance. So I want to switch to 64bit mode, do some pure computation (no OS interaction; no need for 64bit addressing) and come back to 32bit before returning to the caller.
I found there are some related but different questions. For example,
run 32 bit code in 64 bit program
run 64 bit code in 32 bit OS
My question is "run 64 bit code in 32 bit program, 64 bit OS"
Contrary to the other answers, I assert that in principle the short answer is YES. This is likely not supported officially in any way, but it appears to work. At the end of this answer I present a demo.
On Linux-x86_64, a 32 bit (and X32 too, according to GDB sources) process gets CS register equal to 0x23 — a selector of 32-bit ring 3 code segment defined in GDT (its base is 0). And 64 bit processes get another selector: 0x33 — a selector of long mode (i.e. 64 bit) ring 3 code segment (bases for ES, CS, SS, DS are treated unconditionally as zeros in 64 bit mode). Thus if we do far jump, far call or something similar with target segment selector of 0x33, we'll load the corresponding descriptor to the shadow part of CS and will end up in a 64 bit segment.
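To see what the two selector values mean, a segment selector can be split into its architectural fields: a descriptor-table index in bits 15..3, a table indicator in bit 2 (0 = GDT), and the requested privilege level in bits 1..0. A minimal Python sketch (decode_selector is a made-up helper name):

```python
def decode_selector(sel: int) -> dict:
    """Split an x86 segment selector into its three fields."""
    return {
        "index": sel >> 3,        # entry number in the descriptor table
        "ti": (sel >> 2) & 1,     # 0 = GDT, 1 = LDT
        "rpl": sel & 3,           # requested privilege level (3 = user mode)
    }

for sel in (0x23, 0x33):
    print(hex(sel), decode_selector(sel))
```

Both selectors refer to the GDT with RPL 3; they differ only in the index (4 versus 6), i.e. in which code segment descriptor gets loaded.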
The demo at the bottom of this answer uses jmp far instruction to jump to 64 bit code. Note that I've chosen a special constant to load into rax, so that for 32 bit code that instruction looks like
dec eax
mov eax, 0xfafafafa
ud2
cli ; these two are unnecessary, but leaving them here for fun :)
hlt
This must fail if we execute it having 32 bit descriptor in CS shadow part (will raise SIGILL on ud2 instruction).
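The trick with the constant can be checked with a tiny Python sketch that lays out the bytes of the 64-bit mov (REX.W prefix 0x48, opcode 0xB8, then the little-endian immediate) and decodes them the way a 32-bit code segment would. The decoder below is a toy that only knows the five opcodes involved; it is not a general disassembler, and decode32 is a made-up name.

```python
import struct

# mov rax, 0xf4fa0b0ffafafafa as encoded for 64-bit mode
insn = bytes([0x48, 0xB8]) + struct.pack("<Q", 0xF4FA0B0FFAFAFAFA)

def decode32(code: bytes) -> list:
    """Decode `code` as 32-bit instructions (handles only the opcodes used here)."""
    out, i = [], 0
    while i < len(code):
        b = code[i]
        if b == 0x48:                       # in 32-bit mode 0x48 is dec eax, not a REX prefix
            out.append("dec eax"); i += 1
        elif b == 0xB8:                     # mov eax, imm32
            (imm,) = struct.unpack_from("<I", code, i + 1)
            out.append(f"mov eax, {imm:#x}"); i += 5
        elif code[i:i + 2] == b"\x0f\x0b":
            out.append("ud2"); i += 2
        elif b == 0xFA:
            out.append("cli"); i += 1
        elif b == 0xF4:
            out.append("hlt"); i += 1
        else:
            raise ValueError(f"opcode {b:#x} not handled by this toy decoder")
    return out

print(insn.hex())
print(decode32(insn))
```

The same ten bytes are a single harmless load in 64-bit mode and a sequence ending in ud2 in 32-bit mode, so the process can only get past them if the far jump really switched modes.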
Now here's the demo (compile it with fasm).
format ELF executable
segment readable executable
SYS_EXIT_32BIT=1
SYS_EXIT_64BIT=60
SYS_WRITE=4
STDERR=2
entry $
mov ax,cs
cmp ax,0x23 ; 32 bit process on 64 bit kernel has this selector in CS
jne kernelIs32Bit
jmp 0x33:start64 ; switch to 64-bit segment
start64:
use64
mov rax, qword 0xf4fa0b0ffafafafa ; would crash inside this if executed as 32 bit code
xor rdi,rdi
mov eax, SYS_EXIT_64BIT
syscall
ud2
use32
kernelIs32Bit:
mov edx, msgLen
mov ecx, msg
mov ebx, STDERR
mov eax, SYS_WRITE
int 0x80
dec ebx
mov eax, SYS_EXIT_32BIT
int 0x80
msg:
db "Kernel appears to be 32 bit, can't jump to long mode segment",10
msgLen = $-msg
The answer is NO. Just because you are running 64bit code (presumably 64bit length datatypes, e.g. variables, etc.) you are not running in 64bit mode on a 32 bit box. Compilers have workarounds to provide 64bit data types on 32 bit machines. For example gcc has unsigned long long and uint64_t that are 8-byte datatypes on both x86 and x86_64 machines. Datatypes are portable between x86 & x86_64 for that reason. That does not mean you get 64bit address space on a 32bit box. It means the compiler can handle 64bit datatypes. You will run into instances where you cannot run some 64bit code on 32 bit boxes. In that case, you will need preprocessor directives to compile the correct 64bit code on x86_64 and the correct 32bit code on x86. A simple example is where different datatypes are explicitly required. In that case you can provide a preprocessor check to determine if the host computer is 64bit or 32bit with:
#if defined(__LP64__) || defined(_LP64)
# define BUILD_64 1
#endif
You can then provide conditionals to compile the correct code with the following:
#ifdef BUILD_64
printf (" x : %ld, hex: %lx,\nfmtbinstr_64 (d, 4, \"-\"): %s\n",
d, d, fmtbinstr_64 (d, 4, "-"));
#else
printf (" x : %lld, hex: %llx,\nfmtbinstr_64 (d, 4, \"-\"): %s\n",
d, d, fmtbinstr_64 (d, 4, "-"));
#endif
Hopefully this provides a starting point for you to work with. If you have more specific questions, please post more details.
Inspired by this question
How can I force GDB to disassemble?
and related to this one
What is INT 21h?
How does a system call actually happen under Linux? What happens when the call is performed, until the actual kernel routine is invoked?
Assuming we're talking about x86:
The ID of the system call is deposited into the EAX register
Any arguments required by the system call are deposited into registers. On 32-bit Linux the first through sixth arguments go in EBX, ECX, EDX, ESI, EDI and EBP, in that order.
An INT 0x80 interrupt is invoked.
The Linux kernel services the system call identified by the ID in the EAX register, depositing any results in pre-determined locations.
The calling code makes use of any results.
I may be a bit rusty at this, it's been a few years...
The given answers are correct but I would like to add that there are more mechanisms to enter kernel mode. Every recent kernel maps the "vsyscall" page in every process' address space. It contains little more than the most efficient syscall trap method.
For example on a regular 32 bit system it could contain:
0xffffe000: int $0x80
0xffffe002: ret
But on my 64-bit system I have access to the much more efficient method using the syscall/sysenter instructions:
0xffffe000: push %ecx
0xffffe001: push %edx
0xffffe002: push %ebp
0xffffe003: mov %esp,%ebp
0xffffe005: sysenter
0xffffe007: nop
0xffffe008: nop
0xffffe009: nop
0xffffe00a: nop
0xffffe00b: nop
0xffffe00c: nop
0xffffe00d: nop
0xffffe00e: jmp 0xffffe003
0xffffe010: pop %ebp
0xffffe011: pop %edx
0xffffe012: pop %ecx
0xffffe013: ret
This vsyscall page also maps some system calls that can be done without a context switch. I know for certain that gettimeofday, time and getcpu are mapped there, but I imagine getpid could fit in there just as well.
This is already answered at
How is the system call in Linux implemented?
Probably did not match with this question because of the differing "syscall" term usage.
Basically, it's very simple: somewhere in memory lies a table where each syscall number and the address of the corresponding handler is stored (see http://lxr.linux.no/linux+v2.6.30/arch/x86/kernel/syscall_table_32.S for the x86 version).
The INT 0x80 interrupt handler then just takes the arguments out of the registers, puts them on the (kernel) stack, and calls the appropriate syscall handler.
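The table lookup can be pictured with a toy Python model. This is not kernel code: the handlers are stand-ins invented for illustration (the getpid value is made up), and only the syscall numbers (4 = write, 20 = getpid on 32-bit x86), the argument register order and the ENOSYS value are taken from the real 32-bit Linux ABI.

```python
ENOSYS = 38   # Linux errno for "function not implemented"

# Stand-in handlers; the real ones live in the kernel
def sys_write(fd, buf, count):
    return min(count, len(buf))      # pretend every requested byte was written

def sys_getpid():
    return 1234                      # made-up process ID

# Syscall number -> handler, like the table in syscall_table_32.S
SYSCALL_TABLE = {4: sys_write, 20: sys_getpid}

# Registers that carry arguments 1..6 for int 0x80 on 32-bit Linux
ARG_REGS = ("ebx", "ecx", "edx", "esi", "edi", "ebp")

def int80(regs: dict):
    """Toy model of the int 0x80 entry path: look up eax, pass register args."""
    handler = SYSCALL_TABLE.get(regs["eax"])
    if handler is None:
        return -ENOSYS                        # unknown syscall number
    nargs = handler.__code__.co_argcount      # how many arguments this handler takes
    args = [regs.get(r, 0) for r in ARG_REGS[:nargs]]
    return handler(*args)                     # the result would come back in eax

print(int80({"eax": 4, "ebx": 2, "ecx": b"hi\n", "edx": 3}))
print(int80({"eax": 20}))
print(int80({"eax": 9999}))
```

The real table is a flat array of function pointers indexed by the syscall number rather than a dictionary, but the dispatch idea is the same.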