How are __addgs* used, and what is GS?

On Microsoft's site can be found some details of the
__addgsbyte ( offset, data )
__addgsword ( offset, data )
__addgsdword ( offset, data )
__addgsqword ( offset, data )
intrinsic functions. It is stated that offset is
the offset from the beginning of GS. I presume that GS refers to the processor register.
How does GS relate to the stack, if at all? Alternatively, how can I calculate an offset with respect to GS?
(And, are there any 'gotchas' relating to this and particular calling conventions, such as __fastcall?)

The GS register does not relate to the stack at all, and therefore has no relation to calling conventions. On x64 versions of Windows it is used to point to operating system data:
From wikipedia:
Instead of FS segment descriptor on x86 versions of the Windows NT
family, GS segment descriptor is used to point to two operating system
defined structures: Thread Information Block (NT_TIB) in user mode and
Processor Control Region (KPCR) in kernel mode. Thus, for example, in
user mode GS:0 is the address of the first member of the Thread
Information Block. Maintaining this convention made the x86-64 port
easier, but required AMD to retain the function of the FS and GS
segments in long mode — even though segmented addressing per se is not
really used by any modern operating system.
Note that those intrinsics are only available in kernel mode (e.g. device drivers). To calculate an offset, you would need to know what segment of memory GS is pointing to. So in kernel mode you would need to know the layout of the Processor Control Region.
Personally I don't know what the use of these would be.

These intrinsics, along with their FS counterparts, have no real use except for accessing OS-specific data, so they were most likely added purely to make life easier for Windows developers (I've personally used them for inline TLS access).

The doc you linked says:
Add a value to a memory location specified by an offset relative to the beginning of the GS segment.
That means it's an intrinsic for this asm instruction: add gs:[offset], data (a normal memory-destination add with a GS segment override) with your choice of operand-size.
The compiler can presumably pick any addressing mode for the offset part, and data can be a register or immediate, thus either or both of offset and data can be runtime variables or constants.
The actual linear (virtual) address accessed will be gs_base + offset, where gs_base is set via an MSR to any address (by the OS on its own, or by you making a system call).
In user-space at least, Windows normally uses GS for TLS (thread local storage). The answer claiming this intrinsic only works in kernel code is mistaken. It doesn't add to the GS base, it adds to memory at an address relative to the existing GS base.
MS only seems to document this intrinsic for x64, but it's a valid instruction in 32-bit mode as well. IDK why they'd bother to restrict it. (Except of course the qword form: 64-bit operand size isn't available in 32-bit mode.)
Perhaps the compiler doesn't know how to generally optimize __readgsdword / (operation on that data) / __writegsdword into a memory-destination instruction with the same gs:offset address. If so, this intrinsic would just be a code-size optimization.
But perhaps it's sometimes relevant to force the compiler to do it as a single instruction to make it atomic wrt. interrupts on this core (but not wrt. accesses by other CPU cores). IDK if that's an intended use case; this answer is just explaining what that one line of documentation means in terms of x86 asm and memory segmentation.

Related

In modern Linux x86-64 is it safe for userspace to overwrite the GS register?

In a 64-bit C program, using glibc, pthreads and so on (nothing exotic), is it safe to overwrite the GS register, without restoring it, on current kernel and glibc versions? I know that the FS register is used by pthreads/glibc for the thread-local storage block pointer, so messing with that will wreck anything that uses TLS, but I'm not sure about GS.
If not, is it safe to save the value, overwrite it, and then restore it, as long as the userland code doesn't do X while it is overwritten (and what is X)?
I don't know for sure; Jester says "Yes, but instead of doing that you will likely want arch_prctl(ARCH_SET_GS, foo);"
Or on a CPU with FSGSBASE, perhaps wrgsbase from user-space, if the kernel (such as Linux 5.9 or newer) enables that for user-space use. (CR4.FSGSBASE[bit 16] must be set or it faults with #UD).
I know that x86-64 switched to using FS for TLS (32-bit uses GS) because of how the syscall entry point uses swapgs to find the kernel stack.
I think that was just for consistency between user and kernel for kernel TLS / per-core stuff, because 32-bit processes under a 64-bit kernel do still use GS for TLS. Except 32-bit processes can't use syscall (except on AMD CPUs). That alone doesn't rule out some code that only executes for a 64-bit process that could do something with GS, but probably there's no problem.
swapgs only swaps the GS base, not the selector. I don't know if there are any kernel entry points that would rewrite GS with some default selector value (and then reload the saved GS base). I'd guess not.

Is it possible to force a range of virtual addresses?

I have an Ada program that was written for a specific (embedded, multi-processor, 32-bit) architecture. I'm attempting to use this same code in a simulation on 64-bit RHEL as a shared object (since there are multiple versions and I have a requirement to choose a version at runtime).
The problem I'm having is that there are several places in the code where the people who wrote it (not me...) have used Unchecked_Conversions to convert System.Addresses to 32-bit integers. Not only that, but there are multiple routines with hard-coded memory addresses. I can make minor changes to this code, but completely porting it to x86_64 isn't really an option. There are routines that handle interrupts, CPU task scheduling, etc.
This code has run fine in the past when it was statically-linked into a previous version of the simulation (consisting of Fortran/C/C++). Now, however, the main executable starts, then loads a shared object based on some inputs. This shared object then checks some other inputs and loads the appropriate Ada shared object.
Looking through the code, it's apparent that it should work fine if I can keep the logical memory addresses between 0 and 2,147,483,647 (32-bit signed int). Is there a way to either force the shared object loader to leave space in the lower ranges for the Ada code or perhaps make the Ada code "think" that its addresses are between 0 and 2,147,483,647?
Is there a way to either force the shared object loader to leave space in the lower ranges for the Ada code
The good news is that the loader will leave the lower ranges untouched.
The bad news is that it will not load any shared object there. There is no interface you could use to influence placement of shared objects.
That said, dlopen from memory (which we implemented in our private fork of glibc) would allow you to do that. But that's not available publicly.
Your other possible options are:
if you can fit the entire process into 32-bit address space, then your solution is trivial: just build everything with -m32.
use prelink to relocate the library to desired address. Since that address should almost always be available, the loader is very likely to load the library exactly there.
link the loader with a custom mmap implementation, which detects the library of interest through some kind of side channel and performs the mmap syscall with MAP_32BIT set, or
run the program in a ptrace sandbox. Such a sandbox can again intercept the mmap syscall and OR in MAP_32BIT when desirable.
or perhaps make the Ada code "think" that its addresses are between 0 and 2,147,483,647?
I don't see how that's possible. If the library stores an address of a function or a global in a 32-bit memory location, then loads that address and dereferences it ... it's going to get a 32-bit truncated address and a SIGSEGV on dereference.

Code sequences for TLS on ARM

The ELF Handling For Thread-Local Storage document gives assembly sequences for the various models (local exec/initial exec/general dynamic) for various architectures. But not ARM -- is there anywhere I can see such code sequences for ARM? I'm working on a compiler and want to generate code that will operate properly with the platform linkers (both program and dynamic).
For clarity, let's assume an ARMv7 CPU and a pretty new kernel and glibc (say 3.13+ / 2.19+), but I'd also be interested in what has to change for older hw/sw if that's easy to explain.
I don't exactly understand what you want. However, the assembler sequences (for ARMv6+ and a capable kernel) are,
mrc p15, 0, rX, c13, c0, 2 # get the user r/w register
This register is called TPIDRURW in some ARM manuals. Your TLS tables/structures must be reachable from this value (probably a pointer). Using the mrc directly is faster, but you can also call the kernel helper (see below) if HWCAP_TLS isn't set in your ELF auxiliary vector; the helper works on all ARM CPUs supported by Linux.
The intent of address 0xffff0fe8 seems to be that you can use those 4 bytes instead of the assembler above (with rX == r0), as the exact instruction may differ on some machine somewhere.
It is dependent on the CPU type. There is a helper in the vector page at 0xffff0fe0 in entry-armv.S; the value lives in the process/thread structure if the hardware doesn't support it. Documentation is in kernel_user_helpers.txt.
Usage example:
#include <stdio.h>

typedef void * (__kuser_get_tls_t)(void);
#define __kuser_get_tls (*(__kuser_get_tls_t *)0xffff0fe0)

void foo(void)
{
    void *tls = __kuser_get_tls();
    printf("TLS = %p\n", tls);
}
You do a syscall to set the TLS stuff. clone is a way to set up a thread context. The thread_info holds all registers for a thread; it may share an mm (memory management, i.e. the process memory view) with other task_structs. That is, thread_info has a tp_value for each created thread.
Here is a discussion of the ARM implementation. ELF/nptl/glibc and the Linux kernel are all involved (and/or search terms to investigate more). The syscall for get_tls() was probably too expensive, and the current mainline has a vector page helper (mapped by all threads/processes).
Some glibc source, tls-macros.h, tlsdesc.c, etc. Most likely a full/concise answer will depend on the version of,
Your ARM CPU.
Your Linux kernel.
Your glibc.
Your compiler (and flags!).

How to interpret segment register accesses on x86-64?

With this function:
mov 1069833(%rip),%rax # 0x2b5c1bf9ef90 <_fini+3250648>
add %fs:0x0,%rax
retq
How do I interpret the second instruction and find out what was added to RAX?
This code:
mov 1069833(%rip),%rax # 0x2b5c1bf9ef90 <_fini+3250648>
add %fs:0x0,%rax
retq
is returning the address of a thread-local variable. %fs:0x0 is the address of the TCB (Thread Control Block), and the mov loads, from a RIP-relative location, the offset from there to the variable. That offset is known at load time since the variable resides either in the program itself or in some dynamic library loaded at program load time (libraries loaded at runtime via dlopen() need different code).
This is explained in great detail in Ulrich Drepper's TLS document, especially §4.3 and §4.3.6.
I'm not sure they've been called segment registers since the bad old days of segmented architecture. I believe the proper term is a selector (but I could be wrong).
However, I think you just need to look at the first quadword (64 bits) in the FS area.
The %fs:0x0 bit means the contents of the memory at fs:0. Since the generic add was used (rather than addl, for example), I think it will take the data width from the destination, %rax.
In terms of getting the actual value, it depends on whether you're in legacy or long mode.
In legacy mode, you'll have to get the fs value and look it up in the GDT (or possibly LDT) in order to get the base address.
In long mode, you'll need to look at the relevant model specific registers. If you're at this point, you've moved beyond my level of expertise unfortunately.

ARM Linux: Why does Linux expect the register r0 to be set to zero?

The ARM Linux booting manual says that the register r0 should be zero. Why should the register r0 be zero?
http://www.arm.linux.org.uk/developer/booting.php
CPU register settings
r0 = 0.
r1 = machine type number discovered in (3) above.
r2 = physical address of tagged list in system RAM.
I browsed through the arch/arm/kernel/head.S but could not find the reason for that.
Though I can't find any references in the Linux kernel mailing lists or the Linux sources confirming this, I would speculate that the value is being used as an ABI version for future-proofing the ABI.
Future versions of the kernel may wish to modify what arguments are passed in from the boot-loader: perhaps a new argument is needed for some new CPU feature, or one of the existing arguments needs to be slightly tweaked.
This presents a serious problem when a new kernel is booted from an older boot-loader: how does the kernel know what arguments are being passed in? We could try to enforce that new kernels are only ever booted with new boot-loaders, but this would cause quite a few headaches during the transition period. (Boot-loaders are written by people outside the Linux kernel team, and are also frequently flashed deep into hardware, preventing them from being easily upgraded in some circumstances.)
A better solution is to reserve the register r0 to be the ABI version. For now, we insist that r0 is always 0. If the ABI ever changes, r0 can be bumped up by one. Future kernels could then inspect r0 to determine what version ABI it is being booted with, and hence how to interpret the values in the other registers.
Consistency and efficiency. Since setting a register to zero is a common operation, and ARM is typically used in constrained environments, there may be an improvement in code density. The instruction encoding for setting a register to an immediate value is longer than setting a register to the value of another register. Whether this makes much of a difference in practice is another question.
