How to inform GCC to not use a particular register

Assume I have a very large code base and want the rdx register to stay completely unused during execution; i.e., when generating the assembly code, my compiler (GCC) should not use rdx at all.
NOTE: register rdx is just an example. I am OK with any available Intel x86 register.
I am even happy to update the source code of the compiler and use my custom GCC. But which changes to the source code are needed?

You tell GCC not to allocate a register via the -ffixed-reg option (gcc docs).
-ffixed-reg
Treat the register named reg as a fixed register; generated code should never refer to it (except perhaps as a stack pointer, frame pointer or in some other fixed role).
reg must be the name of a register. The register names accepted are machine-specific and are defined in the REGISTER_NAMES macro in the machine description macro file.
For example, gcc -ffixed-r13 will make GCC leave r13 alone entirely. Fixing registers that are part of the calling convention, or that certain instructions require, may be problematic.

You can bind a global variable to this register.
On an ARM CPU you can do it this way:
register volatile type *global_ptr asm ("r8");
This declaration reserves the general-purpose register "r8" to hold the value of the global_ptr pointer.
See the source in U-Boot for real-life example:
http://git.denx.de/?p=u-boot.git;a=blob;f=arch/arm/include/asm/global_data.h;h=4e3ea55e290a19c766017b59241615f7723531d5;hb=HEAD#l83
File arch/arm/include/asm/global_data.h (line ~83).
#define DECLARE_GLOBAL_DATA_PTR register volatile gd_t *gd asm ("r8")

I don't know whether there is a simple mechanism to tell that to gcc at run time. I would assume that you must recompile. From what I read I understand that there are description files for the different CPUs, e.g. this file, but what exactly needs to be changed in order to prevent gcc from using the register, and what potential side effects such a change could have, is beyond me.
I would ask on the gcc mailing list for assistance. Chances are that the modification is not so difficult per se, except that building gcc isn't trivial in my experience. In your case, if I analyze the situation correctly, a caveat applies. You are essentially cross-compiling, i.e., building for a different architecture. In particular, I understand that you have to build your system and the other libraries your program uses, because their code would normally use that register. If you intend to link dynamically, you would probably also have to build your own ld.so (the dynamic loader), because starting a dynamically linked executable actually starts that loader, which would use that register. (Therefore linking statically may be better.)

Consider the divq instruction - the dividend is represented by [rdx][rax], and, assuming the divisor (D) satisfies rdx < D, the quotient is stored in %rax and remainder in %rdx. There are no alternative registers that can be used here.
The same applies with the mul/mulq instructions, where the product is stored in [rdx][rax] - even the recent mulx instruction, while more flexible, still uses %rdx as a source register. (If memory serves)
More importantly, %rdx is used to pass parameters in the x86-64 ELF ABI. You could never call C library functions (or any other ELF library for that matter) - even kernel syscalls use %rdx to pass parameters - though the register use is not the same.
I'm not clear on your motivation - but the fact is, you won't be able to do anything practical on any x86[-64] platform (let alone an ELF/Linux platform) - at least in user-space.

Related

Why can't non-PIC code be fully randomized by ASLR using run-time fixups?

I understand that PIC makes ASLR more efficient and easier, since the code can be placed anywhere in memory without being changed. But if I understand the Wikipedia article on relocation correctly, the dynamic linker can make "fixups" at run time, so a symbol can be located even though the code is not position-independent. Yet according to many answers I have seen here, ASLR cannot randomize any section of non-PIC code except the stack (so it can't randomize the program entry point). If that is correct, what are run-time fixups used for, and why can't we just fix up all locations in the code at run time, before the program starts, to randomize the entry point?
TL:DR: Not all uses of absolute addresses will have relocation info in a non-PIE executable (ELF type EXEC, not DYN). Therefore the kernel's program-loader can't find them all to apply fixups.
Thus there's no way to retroactively enable ASLR for executables built as non-PIE. There's no way for a traditional executable to flag itself as having relocation metadata for every use of an absolute address, and no point in adding such a feature either since if you want text ASLR you'd just build a PIE.
Because ELF-type EXEC Linux executables are guaranteed to be loaded / mapped at the fixed base address chosen by the linker at link time, it would be a waste of space in the executable to make symbol-table entries for internal symbols. So toolchains didn't do that, and there's no reason to start. That's simply how traditional ELF executables were designed; Linux switched from a.out to ELF back in the mid 90s before stack ASLR was a thing, so it wasn't on people's radar.
e.g. the absolute address of static char buf[100] is probably embedded somewhere in the machine code that uses it (if we're talking about 32-bit code, or 64-bit code that puts the address in a register), but there's no way to know where or how many times.
Also, for x86-64 specifically, the default code model for non-PIE executables guarantees that static addresses (text / data / bss) will all be in the low 2GiB of virtual address space, so 32-bit absolute signed or unsigned addresses can work, and rel32 displacements can reach anything from anything. That's why non-PIE compiler output uses mov $symbol, %edi (5 bytes) to put an address in a register, instead of lea symbol(%rip), %rdi (7 bytes). https://godbolt.org/z/89PeK1
So even if you did know where every absolute address was, you could only ASLR it in the low 2GiB, limiting the number of bits of entropy you could introduce. (I think Windows has a mode for this: LargeAddressAware = no. But Linux doesn't; see "32-bit absolute addresses no longer allowed in x86-64 Linux?". Again, PIE is a better way to allow text ASLR, so people (distros) should just compile for that if they want its benefits.)
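As a rough illustration of the two instruction forms discussed above, the following C fragment is the kind of code whose compiled form embeds a static address. The names buf and get_buf are made up for this sketch. Building it with gcc -O2 -fno-pie -S and again with gcc -O2 -fpie -S should show the difference, although the exact output depends on the compiler version and its defaults.

```c
/* Static storage: in a non-PIE build its address is fixed at link time. */
static char buf[100];

/* Returns the address of buf. Per the answer above, non-PIE x86-64 output
 * can materialize it with a 32-bit absolute immediate (a mov $buf, %reg
 * form), while PIE output uses a RIP-relative lea that needs no fixup. */
char *get_buf(void)
{
    return buf;
}
```

In the non-PIE case, nothing in the linked executable has to record that the immediate is an address, which is the missing metadata the answer is talking about.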
Unlike Windows, Linux doesn't spend huge effort on things that can be handled better and more efficiently by recompiling binaries from source.
That being said, GNU/Linux does support fixup relocations for 64-bit absolute addresses even in PIC / PIE ELF shared objects. That's why beginner code like NASM mov rdi, BUFFER can work even in a shared library: use objdump -drwC -Mintel to see the relocation info on that use of the symbol in a mov reg, imm64 instruction. An lea rdi, [rel BUFFER] wouldn't need any relocation entry if BUFFER wasn't a global symbol. (Equivalent of C static.)
You might be wondering why metadata is essential:
There's no reliable way to search text/data for possible absolute addresses; false positives would be possible. e.g. /usr/bin/ld probably contains 0x401000 as the default start address for an x86-64 executable. You don't want ASLR of ld's code+data to also change its defaults. Or that integer value could have come up in any number of ways in many programs, e.g. as a bitmap. And of course x86-64 machine code is variable length so there's no reliable way to even distinguish opcodes from immediate operands in the most general case.
And also potentially false negatives. Not super likely that an x86 program would construct an absolute address in a register with multiple instructions, but it's certainly possible. However in non-x86 code, that would be common.
RISC machines with fixed-length instructions can't put a 32-bit address into a 32-bit instruction; there'd be no room left for anything else. So to load from static storage, the absolute addresses would have to be split across multiple instructions, like MIPS lui $t0, %hi(0x612300) / lw $t1, %lo(0x612300)($t0) to load from a static variable at absolute address 0x612300. (There would normally be a symbol name in the asm source, but it wouldn't appear in the final linked binary unless it was .globl, so I used numbers as a reminder.) Instructions like that don't have to come in pairs; the same high-half of the address could be reused by other accesses into the same array or struct in later instructions.
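The hi/lo split can be sketched in C. This is only a model of the arithmetic, with helper names (hi16, lo16, rejoin) invented for the illustration. It assumes the usual convention that the low half is treated as a signed 16-bit offset by the load, so the high half is rounded to compensate.

```c
#include <stdint.h>

/* Upper 16 bits as %hi() would compute them: rounded so that adding the
 * sign-extended low half restores the full address. */
static uint32_t hi16(uint32_t addr)
{
    return (addr + 0x8000u) >> 16;
}

/* Lower 16 bits, interpreted as the signed offset of a load or store. */
static int16_t lo16(uint32_t addr)
{
    return (int16_t)(addr & 0xFFFFu);
}

/* What lui + lw compute together: (hi << 16) + sign_extend(lo). */
static uint32_t rejoin(uint32_t hi, int16_t lo)
{
    return (hi << 16) + (uint32_t)(int32_t)lo;
}
```

For 0x612300 the halves are 0x61 and 0x2300. Neither half looks like an address on its own, which is why a loader scanning the binary could miss it.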
Let's first have a look at Windows before having a look at Linux:
Windows' .EXE files (programs) typically have a so-called "base relocation table" and they have an "image base".
The "image base" is the "desired" start address of the program; if Windows loads the program to that address, no relocation needs to be done.
The "base relocation table" contains a list of all values in a program which represent addresses. If the program is loaded to a different address than the "image base", Windows must add the difference to all values listed in that table.
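A minimal sketch of what applying such a table amounts to, assuming a simplified format in which the table is just a list of byte offsets of 32-bit address values. The real PE format is more involved, and the function name apply_base_relocs is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add `delta` (actual load address minus image base) to every 32-bit word
 * whose byte offset is listed in `relocs`. Words that are not listed are
 * left untouched, even if their value happens to look like an address. */
static void apply_base_relocs(uint8_t *image, size_t image_size,
                              const size_t *relocs, size_t nrelocs,
                              uint32_t delta)
{
    for (size_t i = 0; i < nrelocs; i++) {
        uint32_t word;
        assert(relocs[i] + sizeof word <= image_size);
        memcpy(&word, image + relocs[i], sizeof word);
        word += delta;
        memcpy(image + relocs[i], &word, sizeof word);
    }
}
```

Two words holding the same value can end up different after loading: only the one listed in the table is adjusted.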
If the .EXE file does not contain a "base relocation table" (as far as I know some 32-bit GCC versions generate such files), it is not possible to load the file to another address.
This is because the following C code statements will result in exactly the same machine code (binary code) if the variable someVariable is located at the address 12340000, and it is not possible to distinguish between them:
long myVariable = 12340000;
And:
int * myVariable = &someVariable;
In the first case, the value 12340000 must not be changed in any situation; in the second case, the address (which is 12340000) must be changed to the real address if the program is loaded to another address.
If the "base relocation table" is missing, there is no information if the value 12340000 is an integer value (which must not be changed) or an address (which must be changed).
So the program must be loaded to some fixed address.
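The ambiguity can be checked directly. The sketch below assumes a typical flat-address platform where a pointer and uintptr_t have the same size and representation; the function name same_bytes is made up for this example.

```c
#include <stdint.h>
#include <string.h>

static int someVariable;

/* Returns 1 if an integer holding the same number as the address of
 * someVariable is byte-for-byte identical to the stored pointer. */
static int same_bytes(void)
{
    int *as_pointer = &someVariable;
    uintptr_t as_integer = (uintptr_t)&someVariable;

    return sizeof as_pointer == sizeof as_integer &&
           memcmp(&as_pointer, &as_integer, sizeof as_pointer) == 0;
}
```

On such platforms the stored bytes carry no hint about which of the two the programmer meant, so only separate metadata can tell a loader what to adjust.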
I'm not sure about the latest 32-bit Linux releases, but at least in older 32-bit Linux versions there was nothing like a "base relocation table" and programs did not use PIC. This means that these programs had to be loaded to their "favorite" address.
I don't know about 64-bit Linux programs, but if a program is compiled the same way as the (older) 32-bit programs, they also must be loaded to a certain address and ASLR is not possible.

How to add/remove x86 instruction in linux executables without spoiling the alignment

I'm new to binaries and assembly, and I'm curious about how to edit binary executables directly. I tried to remove an instruction from a binary file (according to the disassembly provided by objdump), but after doing that the "executable" no longer seems to be in an executable format (segmentation fault when running; gdb cannot recognize it). I heard that this is due to an instruction alignment issue. (Is it?)
So, is it possible to add/remove single x86 instructions directly in Linux executables? If so, how? Thanks in advance.
If you remove a chunk of binary file without adjusting file headers accordingly, it will become invalid.
Fortunately, you can replace instructions with NOP without actually removing them. File size remains the same, and if there is no checksum or signature (or if it's not actually checked), there is nothing more to do.
There is no universal way to insert instructions, but generally you overwrite the original code with a JMP to another location, where you reproduce what the original code did, do your own things as you wanted, then JMP back. Finding room for your new code might be impossible without changing the size of the binary, so I would instead patch the code after the executable is loaded (perhaps using a special LD_PRELOADed library).
Yes. Just replace it with a NOP instruction (0x90) - or multiple ones if the instruction spans across multiple bytes. This is an old trick.
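A small sketch of the NOP trick applied to an in-memory copy of the code bytes. The helper name nop_out is hypothetical, and finding the right offset and length (for example from objdump output) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define X86_NOP 0x90  /* single-byte NOP opcode */

/* Overwrite `len` bytes starting at `offset` with NOPs. The buffer keeps
 * its size, so every later instruction stays at its original address. */
static void nop_out(uint8_t *code, size_t code_size, size_t offset, size_t len)
{
    assert(offset <= code_size && len <= code_size - offset);
    memset(code + offset, X86_NOP, len);
}
```

The whole instruction has to be covered: a 5-byte call, for instance, needs five NOPs, otherwise its leftover bytes would be decoded as something else.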

How are __addgs* used, and what is GS?

On Microsoft's site can be found some details of the
__addgsbyte ( offset, data )
__addgsword ( offset, data )
__addgsdword ( offset, data )
__addgsqword ( offset, data )
intrinsic functions. It is stated that offset is
the offset from the beginning of GS. I presume that GS refers to the processor register.
How does GS relate to the stack, if at all? Alternatively, how can I calculate an offset with respect to GS?
(And, are there any 'gotchas' relating to this and particular calling conventions, such as __fastcall?)
The GS register does not relate to the stack at all and therefore has no relation to calling conventions. On x64 versions of Windows it is used to point to operating system data:
From wikipedia:
Instead of FS segment descriptor on x86 versions of the Windows NT
family, GS segment descriptor is used to point to two operating system
defined structures: Thread Information Block (NT_TIB) in user mode and
Processor Control Region (KPCR) in kernel mode. Thus, for example, in
user mode GS:0 is the address of the first member of the Thread
Information Block. Maintaining this convention made the x86-64 port
easier, but required AMD to retain the function of the FS and GS
segments in long mode — even though segmented addressing per se is not
really used by any modern operating system.
Note that those intrinsics are only available in kernel mode (e.g. device drivers). To calculate an offset, you would need to know what segment of memory GS is pointing to. So in kernel mode you would need to know the layout of the Processor Control Region.
Personally I don't know what the use of these would be.
These intrinsics, along with their FS counterparts, have no real use except for accessing OS-specific data, so it's most likely that they were added purely to make the Windows developers' lives easier. (I've personally used this for inline TLS access.)
The doc you linked says:
Add a value to a memory location specified by an offset relative to the beginning of the GS segment.
That means it's an intrinsic for this asm instruction: add gs:[offset], data (a normal memory-destination add with a GS segment override) with your choice of operand-size.
The compiler can presumably pick any addressing mode for the offset part, and data can be a register or immediate, thus either or both of offset and data can be runtime variables or constants.
The actual linear (virtual) address accessed will be gs_base + offset, where gs_base is set via an MSR to any address (by the OS on its own, or by you making a system call).
In user-space at least, Windows normally uses GS for TLS (thread local storage). The answer claiming this intrinsic only works in kernel code is mistaken. It doesn't add to the GS base, it adds to memory at an address relative to the existing GS base.
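To make the distinction concrete, here is a portable model of the dword form. It is not the MSVC intrinsic: fake_segment and model_addgsdword are names invented for this sketch, and an ordinary pointer stands in for the GS base.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the memory that the GS base points at. */
struct fake_segment {
    uint8_t *base;   /* plays the role of gs_base */
};

/* Model of `add dword ptr gs:[offset], data`: a read-modify-write of the
 * 32-bit value located at base + offset. The base itself is not changed. */
static void model_addgsdword(struct fake_segment *gs, uint32_t offset,
                             uint32_t data)
{
    uint32_t value;

    memcpy(&value, gs->base + offset, sizeof value);
    value += data;
    memcpy(gs->base + offset, &value, sizeof value);
}
```

On real hardware the three steps are a single instruction, and the base comes from the segment register rather than from a struct field.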
MS only seems to document this intrinsic for x64, but it's a valid instruction in 32-bit mode as well. IDK why they'd bother to restrict it. (Except of course the qword form: 64-bit operand size isn't available in 32-bit mode.)
Perhaps the compiler doesn't know how to generally optimize __readgsdword / (operation on that data) / __writegsdword into a memory-destination instruction with the same gs:offset address. If so, this intrinsic would just be a code-size optimization.
But perhaps it's sometimes relevant to force the compiler to do it as a single instruction to make it atomic wrt. interrupts on this core (but not wrt. accesses by other CPU cores). IDK if that's an intended use case; this answer is just explaining what that one line of documentation means in terms of x86 asm and memory segmentation.

How to interpret segment register accesses on x86-64?

With this function:
mov 1069833(%rip),%rax # 0x2b5c1bf9ef90 <_fini+3250648>
add %fs:0x0,%rax
retq
How do I interpret the second instruction and find out what was added to RAX?
This code:
mov 1069833(%rip),%rax # 0x2b5c1bf9ef90 <_fini+3250648>
add %fs:0x0,%rax
retq
is returning the address of a thread-local variable. %fs:0x0 is the address of the TCB (Thread Control Block), and the value loaded from 1069833(%rip) is the offset from there to the variable, which is known since the variable resides either in the program or in some dynamic library loaded at the program's load time (libraries loaded at run time via dlopen() need different code).
This is explained in great detail in Ulrich Drepper's TLS document, especially §4.3 and §4.3.6.
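For reference, C source along the following lines is the sort of thing that compiles to such a function. The names counter and counter_address are made up, and the exact instruction sequence depends on the TLS model the compiler picks (shared library versus executable), so treat this as a sketch rather than a reconstruction of the original code.

```c
/* One instance of this variable exists per thread. */
static _Thread_local int counter;

/* Returns this thread's copy of counter. On x86-64 Linux the compiler
 * typically forms the address as (thread pointer read via %fs:0) + offset. */
int *counter_address(void)
{
    return &counter;
}
```

Within one thread the returned address is stable, while a different thread calling the same function would get the address of its own copy.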
I'm not sure they've been called segment register since the bad old days of segmented architecture. I believe the proper term is a selector (but I could be wrong).
However, I think you just need to look at the first quadword (64 bits) in the fs area.
The %fs:0x0 bit means the contents of the memory at fs:0. Since you've used the generic add (rather than addl for example), I think it will take the data width from the target %rax.
In terms of getting the actual value, it depends on whether you're in legacy or long mode.
In legacy mode, you'll have to get the fs value and look it up in the GDT (or possibly LDT) in order to get the base address.
In long mode, you'll need to look at the relevant model specific registers. If you're at this point, you've moved beyond my level of expertise unfortunately.

Can we modify the int 0x80 routine?

How does linux 2.6 differ from 2.4?
Can we modify the source kernel?
Can we modify the int 0x80 service routine?
UPDATE:
1. the 0x80 handler is essentially the same between 2.4 and 2.6, although the function called from the handler is called by the 'syscall' instruction handler for x86-64 in 2.6.
2. the 0x80 handler can be modified like the rest of the kernel.
3. You won't break anything by modifying it, unless you remove backwards compatibility. E.g., you can add your own trace or backdoor if you feel so inclined. The other post that says you will break your libs and toolchain if you modify the handler is incorrect. If you break the dispatch algorithm, or modify the dispatch table incorrectly, then you will break things.
3a. As I originally posted, the best way to extend the 0x80 service is to extend the system call handler.
As the kernel source says:
What: The kernel syscall interface
Description:
This interface matches much of the POSIX interface and is based
on it and other Unix based interfaces. It will only be added to
over time, and not have things removed from it.
Note that this interface is different for every architecture
that Linux supports. Please see the architecture-specific
documentation for details on the syscall numbers that are to be
mapped to each syscall.
The system call table entries for i386 are in:
arch/i386/kernel/syscall_table.S
Note that the table is a sequence of pointers, so if you want to maintain a degree of forward compatibility with the kernel maintainers, you'd need to pad the table before placement of your pointer.
The syscall vector number is defined in irq_vectors.h
Then traps.c sets the address of the system_call function via set_system_gate, which places the entry into the interrupt descriptor table. The system_call function itself is in entry.S, and calls the requested pointer from the system call table.
There are a few housekeeping details, which you can see reading the code, but direct modification of the 0x80 interrupt handler is accomplished in entry.S inside the system_call function. In a more sane fashion, you can modify the system call table, inserting your own function without modifying the dispatch mechanism.
In fact, having read the 2.6 source, it says directly that int 0x80 and x86-64 syscall use the same code, so far. So you can make portable changes for x86-32 and x86-64.
END Update
The INT 0x80 method invokes the system call table handler. This matches register arguments to a call table, invoking kernel functions based on the contents of the EAX register. You can easily extend the system call table to add custom kernel API functions.
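The dispatch idea can be modelled in user space with a table of function pointers. Everything here is a simplified stand-in: the three functions, the table, and dispatch are hypothetical names, not kernel code, and nr plays the role of the EAX register.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*syscall_fn)(long a, long b, long c);

static long sys_add(long a, long b, long c) { (void)c; return a + b; }
static long sys_neg(long a, long b, long c) { (void)b; (void)c; return -a; }
/* Hypothetical custom call, appended at the end of the table. */
static long sys_custom(long a, long b, long c) { return a * b + c; }

static const syscall_fn call_table[] = {
    sys_add,     /* number 0 */
    sys_neg,     /* number 1 */
    sys_custom,  /* number 2: new entry, existing numbers are unchanged */
};

#define NR_CALLS (sizeof call_table / sizeof call_table[0])

/* Look up the requested number and call through the table. */
static long dispatch(unsigned long nr, long a, long b, long c)
{
    if (nr >= NR_CALLS)
        return -ENOSYS;  /* unknown call number */
    return call_table[nr](a, b, c);
}
```

Appending an entry leaves the numbers of the existing calls alone, which is why extending the table is safe while reordering it is not.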
This may even work with the new syscall code on x86-64, as it uses the system call table, too.
If you alter the current system call table in any manner other than to extend it, you will break all dependent libraries and code, including libc, init, etc.
Here's the current Linux system call table: http://asm.sourceforge.net/syscall.html
It's an architectural overhaul. Everything has changed internally. SMP support is complete, the process scheduler is vastly improved, memory management got an overhaul, and many, many other things.
Yes. It's open-source software. If you do not have a copy of the source, you can get it from your vendor or from kernel.org.
Yes, but it's not advisable because it will break libc, it will break your baselayout, and it will break your toolchain if you change the sequence of existing syscalls, and nearly everything you might think you want to do should be done in userspace when at all possible.
