How to interpret segment register accesses on x86-64?

With this function:
mov 1069833(%rip),%rax # 0x2b5c1bf9ef90 <_fini+3250648>
add %fs:0x0,%rax
retq
How do I interpret the second instruction and find out what was added to RAX?

This code:
mov 1069833(%rip),%rax # 0x2b5c1bf9ef90 <_fini+3250648>
add %fs:0x0,%rax
retq
is returning the address of a thread-local variable. The quadword at %fs:0x0 holds the address of the TCB (Thread Control Block), and the value loaded from 1069833(%rip) is the offset from there to the variable. That offset is known at load time because the variable resides either in the program itself or in a dynamic library loaded at program startup (libraries loaded at runtime via dlopen() need different code).
This is explained in great detail in Ulrich Drepper's TLS document, especially §4.3 and §4.3.6.
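For illustration, here is a minimal C sketch of source code that can produce a function like the one above. The names tls_counter, tls_counter_address and tls_addresses_differ are hypothetical, and the exact instructions emitted depend on the compiler and the TLS model it picks.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical thread-local variable; each thread gets its own copy. */
static _Thread_local int tls_counter;

/* On x86-64 Linux, GCC or Clang may compile this to a load of the
   variable's offset followed by "add %fs:0x0,%rax", depending on the
   TLS model chosen. */
int *tls_counter_address(void)
{
    return &tls_counter;
}

/* Thread entry point: reports where this thread's copy lives. */
static void *report_address(void *out)
{
    *(int **)out = tls_counter_address();
    return NULL;
}

/* Returns 1 if a second thread sees a different address than the caller,
   0 if the addresses match, and -1 if the thread could not be started. */
int tls_addresses_differ(void)
{
    int *other = NULL;
    pthread_t thread;

    if (pthread_create(&thread, NULL, report_address, &other) != 0)
        return -1;
    pthread_join(thread, NULL);
    return other != tls_counter_address();
}
```

Because the same offset is added to a different per-thread base, the function returns a different address in each thread.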

I'm not sure they've been called segment register since the bad old days of segmented architecture. I believe the proper term is a selector (but I could be wrong).
However, I think you just need to look at the first quadword (64 bits) in the fs area.
The %fs:0x0 bit means the contents of the memory at fs:0. Since you've used the generic add (rather than addl for example), I think it will take the data width from the target %rax.
In terms of getting the actual value, it depends on whether you're in legacy or long mode.
In legacy mode, you'll have to get the fs value and look it up in the GDT (or possibly LDT) in order to get the base address.
In long mode, you'll need to look at the relevant model specific registers. If you're at this point, you've moved beyond my level of expertise unfortunately.
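On x86-64 Linux specifically, user-space code can ask the kernel for the FS base instead of reading the MSR itself. The sketch below assumes Linux, x86-64, GCC or Clang inline assembly, and a C library (such as glibc) that stores a pointer to the thread control block at offset 0 of that block. The function names are hypothetical.

```c
#define _GNU_SOURCE
#include <asm/prctl.h>     /* ARCH_GET_FS */
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_arch_prctl */
#include <unistd.h>        /* syscall() */

/* Asks the kernel for the current thread's FS base address. */
uint64_t fs_base_from_kernel(void)
{
    unsigned long base = 0;

    syscall(SYS_arch_prctl, ARCH_GET_FS, &base);
    return base;
}

/* Reads the quadword stored at %fs:0, the same memory operand that
   "add %fs:0x0,%rax" uses. */
uint64_t qword_at_fs_0(void)
{
    uint64_t value;

    __asm__ volatile("movq %%fs:0, %0" : "=r"(value));
    return value;
}
```

Under those assumptions both functions return the same value, which is why adding %fs:0x0 to an offset yields a usable linear address.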

why non-pic code can't be totally ASLR using run-time fixups?

I understand that PIC code makes ASLR more efficient and easier, since the code can be placed anywhere in memory without being changed. But if I understand correctly, according to Wikipedia's article on relocation, the dynamic linker can make "fixups" at runtime so that a symbol can be located even though the code is not position-independent. Yet according to many answers I have seen here, non-PIC code can't have ASLR applied to any section except the stack (so the program entry point can't be randomized). If that is correct, what are runtime fixups used for, and why can't we just fix up all locations in the code at runtime, before the program starts, to randomize the program entry point?

TL:DR: Not all uses of absolute address will have relocation info in a non-PIE executable (ELF type EXEC, not DYN). Therefore the kernel's program-loader can't find them all to apply fixups.
Thus there's no way to retroactively enable ASLR for executables built as non-PIE. There's no way for a traditional executable to flag itself as having relocation metadata for every use of an absolute address, and no point in adding such a feature either since if you want text ASLR you'd just build a PIE.
Because ELF-type EXEC Linux executables are guaranteed to be loaded / mapped at the fixed base address chosen by the linker at link time, it would be a waste of space in the executable to make symbol-table entries for internal symbols. So toolchains didn't do that, and there's no reason to start. That's simply how traditional ELF executables were designed; Linux switched from a.out to ELF back in the mid 90s before stack ASLR was a thing, so it wasn't on people's radar.
e.g. the absolute address of static char buf[100] is probably embedded somewhere in the machine code that uses it (if we're talking about 32-bit code, or 64-bit code that puts the address in a register), but there's no way to know where or how many times.
Also, for x86-64 specifically, the default code model for non-PIE executables guarantees that static addresses (text / data / bss) will all be in the low 2GiB of virtual address space, so 32-bit absolute signed or unsigned addresses can work, and rel32 displacements can reach anything from anything. That's why non-PIE compiler output uses mov $symbol, %edi (5 bytes) to put an address in a register, instead of lea symbol(%rip), %rdi (7 bytes). https://godbolt.org/z/89PeK1
So even if you did know where every absolute address was, you could only ASLR it in the low 2GiB, limiting the number of bits of entropy you could introduce. (I think Windows has a mode for this: LargeAddressAware = no. But Linux doesn't; see "32-bit absolute addresses no longer allowed in x86-64 Linux?". Again, PIE is a better way to allow text ASLR, so people (distros) should just compile for that if they want its benefits.)
Unlike Windows, Linux doesn't spend huge effort on things that can be handled better and more efficiently by recompiling binaries from source.
That being said, GNU/Linux does support fixup relocations for 64-bit absolute addresses even in PIC / PIE ELF shared objects. That's why beginner code like NASM mov rdi, BUFFER can work even in a shared library: use objdump -drwC -Mintel to see the relocation info on that use of the symbol in a mov reg, imm64 instruction. An lea rdi, [rel BUFFER] wouldn't need any relocation entry if BUFFER wasn't a global symbol. (Equivalent of C static.)
You might be wondering why metadata is essential:
There's no reliable way to search text/data for possible absolute addresses; false positives would be possible. e.g. /usr/bin/ld probably contains 0x401000 as the default start address for an x86-64 executable. You don't want ASLR of ld's code+data to also change its defaults. Or that integer value could have come up in any number of ways in many programs, e.g. as a bitmap. And of course x86-64 machine code is variable length so there's no reliable way to even distinguish opcodes from immediate operands in the most general case.
And also potentially false negatives. Not super likely that an x86 program would construct an absolute address in a register with multiple instructions, but it's certainly possible. However in non-x86 code, that would be common.
RISC machines with fixed-length instructions can't put a 32-bit address into a 32-bit instruction; there'd be no room left for anything else. So to load from static storage, the absolute addresses would have to be split across multiple instructions, like MIPS lui $t0, %hi(0x612300) / lw $t1, %lo(0x612300)($t0) to load from a static variable at absolute address 0x612300. (There would normally be a symbol name in the asm source, but it wouldn't appear in the final linked binary unless it was .globl, so I used numbers as a reminder.) Instructions like that don't have to come in pairs; the same high-half of the address could be reused by other accesses into the same array or struct in later instructions.
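To make the split concrete, here is a small C sketch of the arithmetic behind %hi and %lo. The type and function names are hypothetical. The low half is treated as a signed 16-bit displacement, as lw does, so the high half has to compensate when the low half is negative.

```c
#include <stdint.h>

typedef struct {
    uint16_t hi;  /* immediate for lui */
    int16_t lo;   /* signed 16-bit displacement for lw */
} hi_lo;

/* Splits a 32-bit absolute address into the two halves that the
   lui / lw pair would carry. */
hi_lo split_address(uint32_t addr)
{
    hi_lo parts;
    int32_t low = (int32_t)(addr & 0xFFFFu);

    if (low >= 0x8000)
        low -= 0x10000;                 /* sign-extend the low 16 bits */
    parts.lo = (int16_t)low;
    parts.hi = (uint16_t)((addr - (uint32_t)low) >> 16);
    return parts;
}

/* Recombines the halves the way the hardware does: (hi << 16) + lo. */
uint32_t rebuild_address(hi_lo parts)
{
    return ((uint32_t)parts.hi << 16) + (uint32_t)(int32_t)parts.lo;
}
```

For 0x612300 the halves are 0x61 and 0x2300. Neither half on its own looks like the address, which is why a loader scanning for 0x612300 would miss it.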

Let's first have a look at Windows before having a look at Linux:
Windows' .EXE files (programs) typically have a so-called "base relocation table" and they have an "image base".
The "image base" is the "desired" start address of the program; if Windows loads the program to that address, no relocation needs to be done.
The "base relocation table" contains a list of all values in a program which represent addresses. If the program is loaded to a different address than the "image base", Windows must add the difference to all values listed in that table.
If the .EXE file does not contain a "base relocation table" (as far as I know some 32-bit GCC versions generate such files), it is not possible to load the file to another address.
This is because the following C code statements will result in exactly the same machine code (binary code) if the variable someVariable is located at the address 12340000, and it is not possible to distinguish between them:
long myVariable = 12340000;
And:
int * myVariable = &someVariable;
In the first case, the value 12340000 must not be changed in any situation; in the second case, the address (which is 12340000) must be changed to the real address if the program is loaded to another address.
If the "base relocation table" is missing, there is no information if the value 12340000 is an integer value (which must not be changed) or an address (which must be changed).
So the program must be loaded to some fixed address.
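The ambiguity can be demonstrated with a short C sketch. It assumes a mainstream platform where pointers and uintptr_t have the same size and representation; some_variable and the function name are hypothetical.

```c
#include <stdint.h>
#include <string.h>

static int some_variable;

/* Returns 1 when a stored pointer and a stored integer with the same
   numeric value consist of exactly the same bytes in memory. */
int pointer_and_integer_bytes_match(void)
{
    int *as_pointer = &some_variable;
    /* Stand-in for a plain constant such as 12340000 that happens to
       equal the variable's address. */
    uintptr_t as_integer = (uintptr_t)&some_variable;

    return sizeof as_pointer == sizeof as_integer &&
           memcmp(&as_pointer, &as_integer, sizeof as_pointer) == 0;
}
```

Since the bytes are identical, only separate metadata such as a relocation table can tell a loader which of the two needs adjusting.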
I'm not sure about the latest 32-bit Linux releases, but at least in older 32-bit Linux versions there was nothing like a "base relocation table" and programs did not use PIC. This means that these programs had to be loaded to their "favorite" address.
I don't know about 64-bit Linux programs, but if a program is compiled the same way as the (older) 32-bit programs, they also must be loaded to a certain address and ASLR is not possible.

How to inform GCC to not use a particular register

Assume I have a very big source code and intend to make the rdx register totally unused during the execution, i.e., while generating the assembly code, all I want is to inform my compiler (GCC) that it should not use rdx at all.
NOTE: register rdx is just an example. I am OK with any available Intel x86 register.
I am even happy to update the source code of the compiler and use my custom GCC. But which changes to the source code are needed?

You tell GCC not to allocate a register via the -ffixed-reg option (gcc docs).
-ffixed-reg
Treat the register named reg as a fixed register; generated code should never refer to it (except perhaps as a stack pointer, frame pointer or in some other fixed role).
reg must be the name of a register. The register names accepted are machine-specific and are defined in the REGISTER_NAMES macro in the machine description macro file.
For example, gcc -ffixed-r13 will make gcc leave it alone entirely. Using registers that are part of the calling convention, or required for certain instructions, may be problematic.

You can put some global variable to this register.
For ARM CPU you can do it this way:
register volatile type *global_ptr asm ("r8");
This declaration reserves the general-purpose register "r8" to hold
the value of the global_ptr pointer.
See the source in U-Boot for real-life example:
http://git.denx.de/?p=u-boot.git;a=blob;f=arch/arm/include/asm/global_data.h;h=4e3ea55e290a19c766017b59241615f7723531d5;hb=HEAD#l83
File arch/arm/include/asm/global_data.h (line ~83).
#define DECLARE_GLOBAL_DATA_PTR register volatile gd_t *gd asm ("r8")

I don't know whether there is a simple mechanism to tell that to gcc at run time. I would assume that you must recompile. From what I read I understand that there are description files for the different CPUs, e.g. this file, but what exactly needs to be changed in order to prevent gcc from using the register, and what potential side effects such a change could have, is beyond me.
I would ask on the gcc mailing list for assistance. Chances are that the modification is not so difficult per se, except that building gcc isn't trivial in my experience. In your case, if I analyze the situation correctly, a caveat applies. You are essentially cross-compiling, i.e. building for a different architecture. In particular I understand that you have to build your system and other libraries which your program uses because their code would normally use that register. If you intend to link dynamically you probably would also have to build your own ld.so (the dynamic loader) because starting a dynamically linked executable actually starts that loader which would use that register. (Therefore maybe linking statically is better.)

Consider the divq instruction - the dividend is represented by [rdx][rax], and, assuming the divisor (D) satisfies rdx < D, the quotient is stored in %rax and remainder in %rdx. There are no alternative registers that can be used here.
The same applies with the mul/mulq instructions, where the product is stored in [rdx][rax] - even the recent mulx instruction, while more flexible, still uses %rdx as a source register. (If memory serves)
More importantly, %rdx is used to pass parameters in the x86-64 ELF ABI. You could never call C library functions (or any other ELF library for that matter) - even kernel syscalls use %rdx to pass parameters - though the register use is not the same.
I'm not clear on your motivation - but the fact is, you won't be able to do anything practical on any x86[-64] platform (let alone an ELF/Linux platform) - at least in user-space.
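As an illustration of how ordinary C ends up needing %rdx, consider a full 64x64 to 128-bit multiply. This sketch assumes GCC or Clang on a 64-bit target, since unsigned __int128 is a compiler extension; mul_high is a hypothetical name.

```c
#include <stdint.h>

/* Returns the upper 64 bits of the 128-bit product a * b. On x86-64 a
   compiler typically implements this with mul (or mulx), which delivers
   or consumes a value in %rdx. */
uint64_t mul_high(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
```

With %rdx unavailable, the compiler would have no direct instruction sequence for this operation.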

MOVDQU instruction + page boundary

I have a simple test program that loads an xmm register with the
movdqu instruction accessing data across a page boundary (OS = Linux).
If the following page is mapped, this works just fine. If it's not
mapped then I get a SIGSEGV, which is probably expected.
However this diminishes the usefulness of the unaligned loads quite
a bit. Additionally SSE4.2 instructions (like pcmpistri) which
allow for unaligned memory references appear to exhibit this behavior
as well.
That's all fine -- except there's many an implementation of strcmp
using pcmpistri that I've found that don't seem to address this issue
at all -- and I've been able to contrive trivial testcases that will
cause these implementations to fail, while the byte-at-a-time trivial
strcmp implementation will work just fine with the same data layout.
One more note -- it appears that the GNU C library implementation for
64-bit Linux has a __strcmp_sse42 variant that appears to use the
pcmpistri instruction in a more safe manner. The implementation of
this strcmp is fairly complex, but it appears to be carefully trying
to avoid the page boundary issue. I'm not sure if that's due to the
issue I describe above, or whether it's just a side-effect of trying to
get better performance by aligning the data.
Anyway the question I have is primarily -- where can I find out more
about this issue? I've typed in "movdqu crossing page boundary" and
every variant of that I can think of to Google, but haven't come across
anything particularly useful. If anyone can point me to further info
on this it would be greatly appreciated.

First, any algorithm which tries to access an unmapped address will cause a SegFault. If a non-AVX code flow used a 4 byte load to access the last byte of a page and the first 3 bytes of "the next page" which happened to not be mapped then it would also cause a SegFault. No? I believe that the "issue" is that the AVX(1/2/3) registers are so much bigger than "typical" that algorithms which were unsafe (but got away with it) get caught if they are trivially extended to the larger registers.
Aligned loads (MOVDQA) can never have this problem since they don't cross any boundaries of their own size or greater. Unaligned loads CAN have this problem (as you've noted) and "often" do. The reason for this is that the instruction is defined to load the full size of the target register. You need to look at the operand types in the instruction definitions quite carefully. It doesn't matter how much of the data you are interested in. It matters what the instruction is defined to do.
However...
AVX1 (Sandybridge) added a "masked move" capability which is slower than a movdqa or movdqu but will not (architecturally) access the unmapped page so long as the mask is not enabled for the portion of the access which would have fallen in that page. This is meant to address the issue. In general, moving forward, it appears that masked portions (See AVX512) of loads/stores will not cause access violations on IA either.
(It is a bummer about PCMPxSTRx behavior. Perhaps you could add 15 bytes of padding to your "string" objects?)
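Whether a particular 16-byte unaligned load can touch a second page is easy to check up front. The following C sketch assumes 4096-byte pages, which is the usual x86 page size but still an assumption here; the names are hypothetical.

```c
#include <stdint.h>

enum {
    PAGE_SIZE_ASSUMED = 4096,  /* assumption: 4 KiB pages */
    VECTOR_BYTES = 16          /* width of an XMM load such as movdqu */
};

/* Returns 1 if a 16-byte load starting at p would extend into the
   following page, and 0 if it stays within p's page. */
int load16_crosses_page(const void *p)
{
    uintptr_t offset_in_page = (uintptr_t)p & (PAGE_SIZE_ASSUMED - 1);

    return offset_in_page > PAGE_SIZE_ASSUMED - VECTOR_BYTES;
}
```

A string routine can use a check like this to fall back to a slower byte-wise path only for the loads that are actually at risk.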

Facing a similar problem with a library I was writing, I got some information from a very helpful contributor.
The core of the idea is to align the 16-byte reads to the end of the string, then handle the leftover bytes at the beginning. This works because the end of the string must live in an accessible page, and you are guaranteed that the 16-byte truncated starting address must also live in an accessible page.
Since we never read past the string we cannot potentially stray into a protected page.
To handle the initial set of bytes, I chose to use the PCMPxSTRM functions, which return the bitmask of matching bytes. Then it's simply a matter of shifting the result to ignore any mask bits that occur before the true beginning of the string.
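The address arithmetic behind this approach can be sketched in plain C. This is only the bookkeeping, not the SSE4.2 comparison itself, and all names are hypothetical. match_mask stands for a 16-bit mask with one bit per byte, as produced by a PCMPxSTRM-style comparison of the aligned block.

```c
#include <stdint.h>

/* Rounds an address down to the enclosing 16-byte boundary. An aligned
   16-byte load from there can never span two pages. */
uintptr_t align_down_16(uintptr_t address)
{
    return address & ~(uintptr_t)15;
}

/* Number of bytes between the aligned block start and the true start. */
unsigned misalignment_16(uintptr_t address)
{
    return (unsigned)(address & 15);
}

/* Shifts out the mask bits that belong to bytes located before the
   true beginning of the string. */
unsigned discard_leading_bytes(unsigned match_mask, uintptr_t start)
{
    return (match_mask & 0xFFFFu) >> misalignment_16(start);
}
```

For a string starting at 0x1003, the load happens at 0x1000 and the lowest three mask bits are dropped.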

Does eax always, and only, store the system call?

[I'm confused about the CPU registers and I haven't found any truly clear and coherent explanation of them across the whole internet. If anyone has a link to something useful I'd really appreciate it if you'd post it in a comment or answer.]
The primary reason I'm here now is because I have been looking at sample NASM programs in a [thus far vain] attempt to learn the language. The program always ends by placing a system call code in eax and then calling int 0x80 (which I would love if someone could explain as well). However, from what I understand, eax is a 32 bit register - why do you need 32 bits to store system calls (I'm sure there aren't 2^32 worth). Also, sometimes I see other values and strings moved into eax during the program itself. Does that mean eax only has a special use when you finally want to perform a system call but for the rest of the time you can do with it as you please?

All bits of eax are used because that's how the system call interface is implemented. It's true there aren't 2^32 system calls, not even 2^16. But that's how it is. It allows for easy extension of the set of the system calls. You don't need to think hard about it, just accept it as a fact and live on.
eax is a general purpose register and you can do with it anything you please. The fact that it's used to contain the system call ID is just an established convention and nothing else. eax is not anyhow forbidden for other uses.

The program always ends by placing a system call code in eax and then calling int 0x80 (which I would love if someone could explain as well).
This is because you're only looking at old 32-bit examples for Linux, and that is what the Linux developers felt like doing. There's no reason why they couldn't have used a different register, and not much reason they couldn't have used half a register (e.g. ax instead of eax, or bx or ..). In a similar way, there's no reason they couldn't have used a call gate or a different interrupt number. Of course once Linux developers made their decision ("kernel will expect function number in EAX and use int 0x80") everything that calls their kernel has to comply with their decision; and they can't easily change their decision without breaking all existing software (but can, and did, support alternatives - e.g. adding support for sysenter and syscall when those instructions got invented, while ensuring that int 0x80 still works the same).
However, from what I understand, eax is a 32 bit register - why do you need 32 bits to store system calls (I'm sure there aren't 2^32 worth)
They didn't "need" 32-bits; but you can expect that the function number will (after a "is the value too big" sanity check) end up being used inside a call [table+eax*4] instruction to call the selected function, and because that uses 32-bit addressing it needs to use a 32-bit register. Using half (or a quarter) of a register would've involved zero extension (e.g. an extra and eax,0x0000FFFF or movzx eax,ax instruction) to convert the 16-bit value into a 32-bit value. It's also typically faster to use all 32 bits for other reasons (e.g. a mov ax,123 that sets the lowest 16 bits of EAX and leaves the highest 16 bits unchanged will depend on the previous value of the highest 16 bits, and that can cause a "dependency stall" in the CPU if it needs to wait until the previous value of EAX is known).
Does that mean eax only has a special use when you finally want to perform a system call but for the rest of the time you can do with it as you please?
It means that when you call someone else's code, you have to comply with someone else's calling conventions, regardless of what they are. This can mean using other registers (ebx, ecx, etc) for whatever purpose they decided, and can mean using a specific stack layout (e.g. pushing things onto stack in a specific order).
Note that there are various instructions that do expect specific registers to be used in a specific way - mul, div, stosd, movsd, loop, in, out, enter, leave, etc; and there are "rare special cases" for every general purpose register. Despite this; they are still "general purpose registers" because they are not "specific purpose registers" (like eip or flags, which can only be used for one specific purpose and can never be used for anything else).
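The "sanity check, then indexed call" pattern described above can be sketched in C. This is a hypothetical toy table, not the kernel's actual code; the handlers and the -1 error value are invented for illustration.

```c
#include <stddef.h>

typedef long (*handler_fn)(long);

/* Hypothetical handlers standing in for real system call routines. */
static long handle_identity(long arg) { return arg; }
static long handle_double(long arg)   { return arg * 2; }

static const handler_fn handler_table[] = {
    handle_identity,  /* number 0 */
    handle_double,    /* number 1 */
};

/* Checks the number against the table size, then calls through the
   table, the C analogue of "call [table+eax*4]". */
long dispatch(unsigned int number, long arg)
{
    size_t count = sizeof handler_table / sizeof handler_table[0];

    if (number >= count)
        return -1;    /* rejected: no such entry */
    return handler_table[number](arg);
}
```

The caller's only obligation is to put the right number where the callee expects it, which is the whole of eax's "special" role.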

eax is a general purpose register, you can put whatever you want in it. int 0x80 is the interrupt for a system call... that interrupt looks at the value in eax and calls that system routine.

How are __addgs* used, and what is GS?

On Microsoft's site can be found some details of the
__addgsbyte ( offset, data )
__addgsword ( offset, data )
__addgsdword ( offset, data )
__addgsqword ( offset, data )
intrinsic functions. It is stated that offset is
the offset from the beginning of GS. I presume that GS refers to the processor register.
How does GS relate to the stack, if at all? Alternatively, how can I calculate an offset with respect to GS?
(And, are there any 'gotchas' relating to this and particular calling conventions, such as __fastcall?)

The GS register does not relate to the stack at all and therefore has no relation to calling conventions. On x64 versions of Windows it is used to point to operating system data:
From wikipedia:
Instead of FS segment descriptor on x86 versions of the Windows NT
family, GS segment descriptor is used to point to two operating system
defined structures: Thread Information Block (NT_TIB) in user mode and
Processor Control Region (KPCR) in kernel mode. Thus, for example, in
user mode GS:0 is the address of the first member of the Thread
Information Block. Maintaining this convention made the x86-64 port
easier, but required AMD to retain the function of the FS and GS
segments in long mode — even though segmented addressing per se is not
really used by any modern operating system.
Note that those intrinsics are only available in kernel mode (e.g. device drivers). To calculate an offset, you would need to know what segment of memory GS is pointing to. So in kernel mode you would need to know the layout of the Processor Control Region.
Personally I don't know what the use of these would be.

These intrinsics, along with their FS counterparts, have no real use except for accessing OS-specific data, so most likely they were added purely to make Windows developers' lives easier (I've personally used this for inline TLS access).

The doc you linked says:
Add a value to a memory location specified by an offset relative to the beginning of the GS segment.
That means it's an intrinsic for this asm instruction: add gs:[offset], data (a normal memory-destination add with a GS segment override) with your choice of operand-size.
The compiler can presumably pick any addressing mode for the offset part, and data can be a register or immediate, thus either or both of offset and data can be runtime variables or constants.
The actual linear (virtual) address accessed will be gs_base + offset, where gs_base is set via an MSR to any address (by the OS on its own, or by you making a system call).
In user-space at least, Windows normally uses GS for TLS (thread local storage). The answer claiming this intrinsic only works in kernel code is mistaken. It doesn't add to the GS base, it adds to memory at an address relative to the existing GS base.
MS only seems to document this intrinsic for x64, but it's a valid instruction in 32-bit mode as well. IDK why they'd bother to restrict it. (Except of course the qword form: 64-bit operand size isn't available in 32-bit mode.)
Perhaps the compiler doesn't know how to generally optimize __readgsdword / (operation on that data) / __writegsdword into a memory-destination instruction with the same gs:offset address. If so, this intrinsic would just be a code-size optimization.
But perhaps it's sometimes relevant to force the compiler to do it as a single instruction to make it atomic wrt. interrupts on this core (but not wrt. accesses by other CPU cores). IDK if that's an intended use case; this answer is just explaining what that one line of documentation means in terms of x86 asm and memory-segmentation.
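As a portable model of what the dword form does, the C sketch below treats an ordinary buffer as the segment. It does not use the real intrinsic or the GS register: segment_base stands in for gs_base, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of "add gs:[offset], data" with 32-bit operand size: adds data
   to the dword stored at segment_base + offset. Unlike the single
   instruction, this read-modify-write is split into separate steps. */
void add_dword_at(uint8_t *segment_base, size_t offset, uint32_t data)
{
    uint32_t value;

    memcpy(&value, segment_base + offset, sizeof value);
    value += data;
    memcpy(segment_base + offset, &value, sizeof value);
}

/* Reads back the dword at segment_base + offset. */
uint32_t read_dword_at(const uint8_t *segment_base, size_t offset)
{
    uint32_t value;

    memcpy(&value, segment_base + offset, sizeof value);
    return value;
}
```

The base itself never changes; only the memory at base plus offset is modified.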
