Code sequences for TLS on ARM - linux

The ELF Handling For Thread-Local Storage document gives assembly sequences for the various models (local exec/initial exec/general dynamic) for various architectures. But not ARM -- is there anywhere I can see such code sequences for ARM? I'm working on a compiler and want to generate code that will operate properly with the platform linkers (both program and dynamic).
For clarity, let's assume an ARMv7 CPU and a pretty new kernel and glibc (say 3.13+ / 2.19+), but I'd also be interested in what has to change for older hw/sw if that's easy to explain.

I don't exactly understand what you want. However, the assembler sequence (for ARMv6+ and a capable kernel) is:
mrc p15, 0, rX, c13, c0, 3 @ read the user read-only thread ID register
This is called TPIDRURO in the ARM manuals (opcode 2 would select TPIDRURW, the user read/write register, which Linux does not use for TLS). Your TLS tables/structure must be parented from this value (probably a pointer). Using the mrc directly is faster, but you can also call the kernel helper (see below), which works on all ARM CPUs supported by Linux, if you cannot rely on HWCAP_TLS being set.
The intent of address 0xffff0fe8 seems to be that you can use those 4 bytes (the hardware TLS instruction inside the helper) instead of using the above assembler directly with rX == r0, in case it is different for some machine somewhere.
It is dependent on the CPU type. There is a helper in the vector page at 0xffff0fe0 (see entry-armv.S); the value is kept in the process/thread structure if the hardware doesn't support the register. Documentation is in kernel_user_helpers.txt.
Usage example:
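#include <stdio.h>

typedef void * (__kuser_get_tls_t)(void);
/* Kernel-provided helper at a fixed address in the ARM vector page. */
#define __kuser_get_tls (*(__kuser_get_tls_t *)0xffff0fe0)

void foo(void)
{
    void *tls = __kuser_get_tls();  /* current thread's TLS pointer */
    printf("TLS = %p\n", tls);
}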
You do a syscall to set the TLS pointer. clone is a way to set up a thread context. The thread_info holds all registers for a thread; it may share an mm (memory management, or process memory view) with other task_structs. I.e., the thread_info has a tp_value for each created thread.
Here is a discussion of the ARM implementation. ELF/nptl/glibc and the Linux kernel are all involved (and are search terms to investigate further). The syscall for get_tls() was probably too expensive, so the current mainline has a vector page helper (mapped into all threads/processes).
Some glibc sources: tls-macros.h, tlsdesc.c, etc. Most likely a full/concise answer will depend on the version of:
Your ARM CPU.
Your Linux kernel.
Your glibc.
Your compiler (and flags!).
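One practical way to see the sequences an existing toolchain emits is to pin each variable to a TLS model and read the generated assembly. The following is a minimal sketch, assuming GCC (or Clang) and its tls_model attribute; the variable and function names are made up for illustration. Compiling it for an ARM target with -S should show the per-model access code together with the R_ARM_TLS_LE32, R_ARM_TLS_IE32 and R_ARM_TLS_GD32 relocations that the linkers expect.

```c
/* Each variable is pinned to one TLS access model (names are illustrative). */
static __thread int counter_le __attribute__((tls_model("local-exec")));
__thread int counter_ie __attribute__((tls_model("initial-exec")));
__thread int counter_gd __attribute__((tls_model("global-dynamic")));

/* local-exec: offset from the thread pointer is a link-time constant. */
int bump_le(void) { return ++counter_le; }

/* initial-exec: offset is loaded from the GOT, then added to the thread pointer. */
int bump_ie(void) { return ++counter_ie; }

/* global-dynamic: address is obtained by calling __tls_get_addr at run time. */
int bump_gd(void) { return ++counter_gd; }
```

When this is linked into an executable the static linker may relax the more general models to cheaper ones, so the unlinked compiler output is the place to look for the full general-dynamic sequence.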

Related

Is it possible to force a range of virtual addresses?

I have an Ada program that was written for a specific (embedded, multi-processor, 32-bit) architecture. I'm attempting to use this same code in a simulation on 64-bit RHEL as a shared object (since there are multiple versions and I have a requirement to choose a version at runtime).
The problem I'm having is that there are several places in the code where the people who wrote it (not me...) have used Unchecked_Conversions to convert System.Addresses to 32-bit integers. Not only that, but there are multiple routines with hard-coded memory addresses. I can make minor changes to this code, but completely porting it to x86_64 isn't really an option. There are routines that handle interrupts, CPU task scheduling, etc.
This code has run fine in the past when it was statically-linked into a previous version of the simulation (consisting of Fortran/C/C++). Now, however, the main executable starts, then loads a shared object based on some inputs. This shared object then checks some other inputs and loads the appropriate Ada shared object.
Looking through the code, it's apparent that it should work fine if I can keep the logical memory addresses between 0 and 2,147,483,647 (32-bit signed int). Is there a way to either force the shared object loader to leave space in the lower ranges for the Ada code or perhaps make the Ada code "think" that its addresses are between 0 and 2,147,483,647?
Is there a way to either force the shared object loader to leave space in the lower ranges for the Ada code
The good news is that the loader will leave the lower ranges untouched.
The bad news is that it will not load any shared object there. There is no interface you could use to influence placement of shared objects.
That said, dlopen from memory (which we implemented in our private fork of glibc) would allow you to do that. But that's not available publicly.
Your other possible options are:
if you can fit the entire process into 32-bit address space, then your solution is trivial: just build everything with -m32.
use prelink to relocate the library to desired address. Since that address should almost always be available, the loader is very likely to load the library exactly there.
link the loader with a custom mmap implementation, which detects the library of interest through some kind of side channel, and does mmap syscall with MAP_32BIT set, or
run the program in a ptrace sandbox. Such sandbox can again intercept mmap syscall, and or-in MAP_32BIT when desirable.
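The last two options both come down to adding MAP_32BIT to an mmap call. As a rough sketch of what that flag does (this is not the loader patch itself, and map_low and fits_in_int32 are hypothetical helper names), an anonymous mapping requested with MAP_32BIT on x86-64 Linux lands in the low 2 GiB:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Map len bytes of anonymous memory in the low 2 GiB; NULL on failure. */
static void *map_low(size_t len)
{
#ifdef MAP_32BIT
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#else
    (void)len;
    return NULL;  /* MAP_32BIT is only defined on x86-64 */
#endif
}

/* True if the address can be stored in a signed 32-bit integer. */
static int fits_in_int32(const void *p)
{
    return (uintptr_t)p <= (uintptr_t)INT32_MAX;
}
```

A wrapper inside the loader or a ptrace sandbox would apply the same flag to the file-backed mmap calls made for the library of interest.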
or perhaps make the Ada code "think" that its addresses are between 0 and 2,147,483,647?
I don't see how that's possible. If the library stores an address of a function or a global in a 32-bit memory location, then loads that address and dereferences it ... it's going to get a 32-bit truncated address and a SIGSEGV on dereference.
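The truncation can be illustrated in a few lines of C. This is only a model of what a conversion of an address to a 32-bit integer does, and survives_u32_round_trip is a made-up name:

```c
#include <stdint.h>

/* Returns 1 if a 64-bit address is unchanged after being stored in 32 bits. */
static int survives_u32_round_trip(uint64_t address)
{
    uint32_t stored = (uint32_t)address;  /* high 32 bits are discarded here */
    return (uint64_t)stored == address;
}
```

An address such as 0x7fffffff survives, while a typical x86-64 shared-library address such as 0x7f1234560000 does not, which is why the library has to be placed low rather than made to believe it is.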

Linux syscall strategy through vsyscall page

I am reading about VM handling on Linux. Apparently, to perform a syscall there's a page at 0xFFFFF000 on x86, called the vsyscall page. In the past, the strategy to call a syscall was to use int 0x80. Is this vsyscall page strategy still using int 0x80 under the hood, or is it using a different call strategy (e.g. the syscall opcode)? Collateral question: is the int 0x80 method outdated?
If you run ldd on a modern Linux binary, you'll see that it's linked to a dynamic library called linux-vdso.so.1 (on amd64) or linux-gate.so.1 (on x86), which is located in that vsyscall page. This is a shared library provided by the kernel, mapped into every process's address space, which contains C functions that encapsulate the specifics of how to perform a system call.
The reason for this encapsulation is that the "preferred" way to perform a system call can differ from one machine to another. The interrupt 0x80 method should always work on x86, but recent processors support the sysenter (Intel) or syscall (AMD) instructions, which are much more efficient. You want your programs to use those when available, but you also want the same compiled binary to run on both Intel and AMD (and other) processors, so it shouldn't contain vendor-specific opcodes. The linux-vdso/linux-gate library hides these processor-specific decisions behind a consistent interface.
For more information, see this article.
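To check from inside a process that such a kernel-provided library is mapped, one option is to ask the auxiliary vector for its base address. A small sketch, assuming glibc 2.16 or later for getauxval; vdso_base and looks_like_elf are illustrative names:

```c
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Base address of the kernel-provided vDSO image, or NULL if none is mapped. */
static const void *vdso_base(void)
{
    return (const void *)getauxval(AT_SYSINFO_EHDR);
}

/* The vDSO is an ordinary ELF image, so it starts with the ELF magic bytes. */
static int looks_like_elf(const void *p)
{
    return p != NULL && memcmp(p, ELFMAG, SELFMAG) == 0;
}
```

The dynamic loader uses the same auxiliary-vector entry to find the library, which is why ldd lists it without a file path.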

Linux kernel assembly and logic

My question is somewhat weird but I will do my best to explain.
Looking at the languages the Linux kernel uses, I found C and assembly, even though I read a text that said "Second iteration of Unix is written completely in C".
I thought that was misleading, but once I saw that the kernel has assembly code, 2 questions came up right away:
What assembly files are in the kernel and what's their use?
Assembly is architecture dependent, so how can Linux be installed on more than one CPU architecture?
And if the Linux kernel is truly written completely in C, then how can it get the GCC needed for compiling?
I did a complete find / -name *.s
and just got one assembly file (asm-offset.s) somewhere in /usr/src/linux-headers-`uname -r`/
Somehow I don't think that is helping with getting GCC working, so how can Linux work without assembly, or, if it uses assembly, where is it and how can it be stable when it depends on the architecture?
Thanks in advance
1. Why is assembly used?
Because there are certain things that can be done only in assembly, and because assembly can result in faster code. E.g., "you can get access to unusual programming modes of your processor (e.g. 16 bit mode to interface startup, firmware, or legacy code on Intel PCs)".
Read here for more reasons.
2. What assembly files are used?
From: https://www.kernel.org/doc/Documentation/arm/README
"The initial entry into the kernel is via head.S, which uses machine
independent code. The machine is selected by the value of 'r1' on
entry, which must be kept unique."
From https://www.ibm.com/developerworks/library/l-linuxboot/
"When the bzImage (for an i386 image) is invoked, you begin at ./arch/i386/boot/head.S in the start assembly routine (see Figure 3 for the major flow). This routine does some basic hardware setup and invokes the startup_32 routine in ./arch/i386/boot/compressed/head.S. This routine sets up a basic environment (stack, etc.) and clears the Block Started by Symbol (BSS). The kernel is then decompressed through a call to a C function called decompress_kernel (located in ./arch/i386/boot/compressed/misc.c). When the kernel is decompressed into memory, it is called. This is yet another startup_32 function, but this function is in ./arch/i386/kernel/head.S."
Apart from these assembly files, a lot of Linux kernel code uses inline assembly.
3. Architecture dependence?
And you are right about it being architecture dependent; that's why the Linux kernel code is ported to each different architecture.
Linux porting guide
List of supported arch
Things written mainly in assembly in Linux:
Boot code: boots up the machine and sets it up in a state in which it can start executing C code (e.g: on some processors you may need to manually initialize caches and TLBs, on x86 you have to switch to protected mode, ...)
Interrupts/Exceptions/Traps entry points/returns: there you need to do very processor-specific things, e.g: saving registers and reenabling interrupts, and eventually restoring registers and properly returning to user mode. Some exceptions may be handled entirely in assembly.
Instruction emulation: some CPU models may not support certain instructions, may not support unaligned data access, or may not have an FPU. An option is using emulation when getting the corresponding exception.
VDSO: the VDSO is a virtual library that the kernel maps into userspace. It allows e.g: selecting the optimal syscall sequence for the current CPU (on x86 use sysenter/syscall instead of int 0x80 if available), and implementing certain system calls without requiring a context switch (e.g: gettimeofday()).
Atomic operations and locks: Maybe in the future some of these could be written using C11 support for atomic operations.
Copying memory from/to user mode: Besides using an optimized copy, these check for out-of-bounds access.
Optimized routines: the kernel has optimized version of some routines, e.g: crypto routines, memset, clear_page, csum_copy (checksum and copy to another place IP data in one pass), ...
Support for suspend/resume and other ACPI/EFI/firmware thingies
BPF JIT: newer kernels include a JIT compiler for BPF expressions (used for example by tcpdump, seccomp mode 2, ...)
...
To support different architectures, Linux has assembly code (re-)written for each architecture it supports (and sometimes, there are several implementations of some code for different platforms using the same CPU architecture). Just look at all the subdirectories under arch/
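On the "Atomic operations and locks" item in the list above: as a sketch of what a lock written with C11 atomics looks like, here is a user-space spinlock built on atomic_flag. It is only an illustration of the C11 facility; the demo_ names are invented, and the kernel's real locks are implemented differently.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag locked; } demo_spinlock;
#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try to take the lock once; returns true on success. */
static bool demo_spin_trylock(demo_spinlock *l)
{
    /* test_and_set returns the previous value: false means it was free. */
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Busy-wait until the lock is taken. */
static void demo_spin_lock(demo_spinlock *l)
{
    while (!demo_spin_trylock(l)) {
        /* spin */
    }
}

/* Release the lock so another thread can take it. */
static void demo_spin_unlock(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The compiler lowers these calls to the target's own atomic instructions, which is the job the hand-written per-architecture assembly does in the kernel.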
Assembly is needed for a couple of reasons.
There are many instructions that are needed for the operation of an operating system that have no C equivalent, at least on most processors. A good example on Intel x86/64 processors is the iret instruction, which returns from hardware/software interrupts. These interrupts are key to handling hardware events (like a keyboard press) and system calls from programs on older processors.
A computer does not start up in a state that is immediately ready for execution of C code. For an Intel example, when execution gets to the startup routine the processor may not be in 32-bit mode (or 64-bit mode), and the stack required by C also may not be ready. There are some other features present in some processors (like paging) which need to be turned on from assembly as well.
However, most of the Linux kernel is written in C, which interfaces with some platform specific C/assembly code through standardized interfaces. By separating the parts in this way, most of the logic of the Linux kernel can be shared between platforms. The build system simply compiles the platform independent and dependent parts together for specific platforms, which results in different executable kernel files for different platforms (and kernel configurations for that matter).
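The split between shared C logic and a per-architecture piece can be sketched with a single function whose body is selected at compile time. This is an illustration in the spirit of the kernel's cpu_relax(), not the kernel's code; demo_cpu_relax and demo_spin are made-up names, and GCC-style inline assembly is assumed.

```c
/* Architecture-specific part: one name, a different instruction per target. */
static inline void demo_cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause" ::: "memory");
#elif defined(__aarch64__)
    __asm__ __volatile__("yield" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");  /* compiler barrier only */
#endif
}

/* Architecture-independent part: plain C that only uses the common name. */
static unsigned demo_spin(unsigned iterations)
{
    unsigned done = 0;
    while (done < iterations) {
        demo_cpu_relax();
        done++;
    }
    return done;
}
```

In the kernel the per-architecture bodies live in separate files under arch/ rather than behind #if blocks, and the build system picks the right directory.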
Assembly code in the kernel is generally used for low-level hardware interaction that can't be done directly from C. It's like a platform-specific foundation that's used by higher-level parts of the kernel that are written in C.
The kernel source tree contains assembly code for a variety of systems. When you compile a kernel for a particular type of system (such as an x86 PC), only the appropriate assembly code for that platform is included in the build process.
Linux is not the second version of Unix (or Unix in general). It is Unix compatible, but Unix and Linux have separate histories and, in terms of code base (of their kernels), are completely separate. Linus Torvalds's idea was to write an open source Unix.
Some of the lower level things like some of the architecture dependent parts of memory management are done in assembly. The old (but still available) Linux kernel API for x86, int 0x80, is implemented in assembly. There are probably other places in the kernel that are implemented in assembly, but I don't know any others.
When you compile the kernel, you select an architecture to target. Depending on the target, the right assembly files for that architecture are included in the build.
The reason you don't find anything is because you're searching the headers, not the sources. Download a tar ball from kernel.org and search that.

System calls : difference between sys_exit(), SYS_exit and exit()

What is the difference between SYS_exit, sys_exit() and exit()?
What I understand :
The linux kernel provides system calls, which are listed in man 2 syscalls.
There are wrapper functions of those syscalls provided by glibc which have mostly similar names as the syscalls.
My question : In man 2 syscalls, there is no mention of SYS_exit and sys_exit(), for example. What are they?
Note : The syscall exit here is only an example. My question really is : What are SYS_xxx and sys_xxx()?
I'll use exit() as in your example although this applies to all system calls.
The functions of the form sys_exit() are the actual entry points to the kernel routine that implements the function you think of as exit(). These symbols are not even available to user-mode programmers. That is, unless you are hacking the kernel, you cannot link to these functions because their symbols are not available outside the kernel. If I wrote libmsw.a which had a file scope function like
static int msw_func() {}
defined in it, you would have no success trying to link to it because it is not exported in the libmsw symbol table; that is:
cc your_program.c libmsw.a
would yield an error like:
ld: cannot resolve symbol msw_func
because it isn't exported; the same applies for sys_exit() as contained in the kernel.
In order for a user program to get to kernel routines, the syscall(2) interface needs to be used to effect a switch from user mode to kernel mode. When that mode switch (sometimes called a trap) occurs, a small integer is used to look up the proper kernel routine in a kernel table that maps integers to kernel functions. An entry in the table has the form
{SYS_exit, sys_exit},
where SYS_exit is a preprocessor macro, which is
#define SYS_exit (1)
and has been 1 since before you were born because there hasn't been reason to change it. It also happens to be the first entry in the table of system calls which makes look up a simple array index.
As you note in your question, the proper way for a regular user-mode program to access sys_exit is through the thin wrapper in glibc (or similar core library). The only reason you'd ever need to mess with SYS_exit or sys_exit is if you were writing kernel code.
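The table lookup described above can be modelled in ordinary C as an array of function pointers indexed by the call number. This is a toy user-space model with invented demo_ names and a single long argument, not the kernel's actual table:

```c
#include <errno.h>
#include <stddef.h>

typedef long (*demo_syscall_fn)(long arg);

/* Call numbers, in the role of the SYS_xxx macros. */
enum { DEMO_SYS_exit = 1, DEMO_SYS_double = 2, DEMO_NR_SYSCALLS };

/* Handlers, in the role of the sys_xxx() kernel routines. */
static long demo_sys_exit(long status) { return status; }
static long demo_sys_double(long x)    { return 2 * x; }

/* The table: the call number is simply the array index. */
static const demo_syscall_fn demo_table[DEMO_NR_SYSCALLS] = {
    [DEMO_SYS_exit]   = demo_sys_exit,
    [DEMO_SYS_double] = demo_sys_double,
};

/* What happens after the trap: validate the number, then call through the table. */
static long demo_dispatch(long nr, long arg)
{
    if (nr < 0 || nr >= DEMO_NR_SYSCALLS || demo_table[nr] == NULL)
        return -ENOSYS;
    return demo_table[nr](arg);
}
```

Because the number is an index into the table, renumbering a call would break every existing binary, which is one reason the numbers stay fixed.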
This is now addressed in man syscall itself,
Roughly speaking, the code belonging to the system call with number __NR_xxx defined in /usr/include/asm/unistd.h can be found in the Linux kernel source in the routine sys_xxx(). (The dispatch table for i386 can be found in /usr/src/linux/arch/i386/kernel/entry.S.) There are many exceptions, however, mostly because older system calls were superseded by newer ones, and this has been treated somewhat unsystematically. On platforms with proprietary operating-system emulation, such as parisc, sparc, sparc64, and alpha, there are many additional system calls; mips64 also contains a full set of 32-bit system calls.
At least now /usr/include/asm/unistd.h is a preprocessor hack that links to either,
/usr/include/asm/unistd_32.h
/usr/include/asm/unistd_x32.h
/usr/include/asm/unistd_64.h
The C function exit() is defined in stdlib.h. Think of this as a high level event driven interface that allows you to register a callback with atexit()
/* Call all functions registered with `atexit' and `on_exit',
in the reverse of the order in which they were registered,
perform stdio cleanup, and terminate program execution with STATUS. */
extern void exit (int __status) __THROW __attribute__ ((__noreturn__));
So essentially the kernel exposes its system calls to user space as numbers, the __NR_xxx constants. The SYS_xxx names (such as SYS_exit) are preprocessor macros for those same numbers, and sys_xxx() (such as sys_exit()) is the kernel routine that the number selects. The exit() function is part of the standard C library (declared in stdlib.h) and has been ported to other operating systems that lack the Linux kernel ABI entirely (there may be no __NR_xxx numbers) and potentially don't have sys_* functions either (you could write exit() to issue the interrupt or use the vDSO in assembly).
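To see the three layers side by side from user space, here is a short sketch that uses getpid rather than exit so that the process keeps running. raw_getpid is a hypothetical helper name; syscall() and SYS_getpid come from glibc's unistd.h and sys/syscall.h.

```c
#include <sys/syscall.h>  /* SYS_xxx numbers, defined from the kernel's __NR_xxx */
#include <unistd.h>       /* syscall() and the getpid() wrapper */

/* Enter the kernel by number; the kernel then runs its sys_getpid() routine. */
static long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Calling raw_getpid() and the ordinary getpid() wrapper returns the same value, since both end up in the same kernel routine.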

How are __addgs* used, and what is GS?

On Microsoft's site can be found some details of the
__addgsbyte ( offset, data )
__addgsword ( offset, data )
__addgsdword ( offset, data )
__addgsqword ( offset, data )
intrinsic functions. It is stated that offset is
the offset from the beginning of GS. I presume that GS refers to the processor register.
How does GS relate to the stack, if at all? Alternatively, how can I calculate an offset with respect to GS?
(And, are there any 'gotchas' relating to this and particular calling conventions, such as __fastcall?)
The GS register does not relate to the stack at all and therefore has no relation to calling conventions. On x64 versions of Windows it is used to point to operating system data:
From wikipedia:
Instead of FS segment descriptor on x86 versions of the Windows NT
family, GS segment descriptor is used to point to two operating system
defined structures: Thread Information Block (NT_TIB) in user mode and
Processor Control Region (KPCR) in kernel mode. Thus, for example, in
user mode GS:0 is the address of the first member of the Thread
Information Block. Maintaining this convention made the x86-64 port
easier, but required AMD to retain the function of the FS and GS
segments in long mode — even though segmented addressing per se is not
really used by any modern operating system.
Note that those intrinsics are only available in kernel mode (e.g. device drivers). To calculate an offset, you would need to know what segment of memory GS is pointing to. So in kernel mode you would need to know the layout of the Processor Control Region.
Personally I don't know what the use of these would be.
These intrinsics, along with their FS counterparts, have no real use except for accessing OS-specific data; as such, it's most likely that these were added purely to make the Windows developers' lives easier (I've personally used this for inline TLS access).
The doc you linked says:
Add a value to a memory location specified by an offset relative to the beginning of the GS segment.
That means it's an intrinsic for this asm instruction: add gs:[offset], data (a normal memory-destination add with a GS segment override) with your choice of operand-size.
The compiler can presumably pick any addressing mode for the offset part, and data can be a register or immediate, thus either or both of offset and data can be runtime variables or constants.
The actual linear (virtual) address accessed will be gs_base + offset, where gs_base is set via an MSR to any address (by the OS on its own, or by you making a system call).
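The address arithmetic can be modelled in portable C by passing the segment base explicitly. This is only a model of the effect of __addgsdword, with a made-up name (model_addgsdword) and an ordinary buffer standing in for the GS segment; the real intrinsic compiles to one instruction and never exposes the base as a pointer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of: add dword ptr gs:[offset], data */
static void model_addgsdword(uint8_t *gs_base, size_t offset, uint32_t data)
{
    uint32_t value;
    memcpy(&value, gs_base + offset, sizeof value);  /* load from gs_base + offset */
    value += data;                                   /* the add itself */
    memcpy(gs_base + offset, &value, sizeof value);  /* store back to the same place */
}
```

Unlike this three-step model, the real memory-destination add is a single instruction.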
In user-space at least, Windows normally uses GS for TLS (thread local storage). The answer claiming this intrinsic only works in kernel code is mistaken. It doesn't add to the GS base, it adds to memory at an address relative to the existing GS base.
MS only seems to document this intrinsic for x64, but it's a valid instruction in 32-bit mode as well. IDK why they'd bother to restrict it. (Except of course the qword form: 64-bit operand size isn't available in 32-bit mode.)
Perhaps the compiler doesn't know how to generally optimize __readgsdword / (operation on that data) / __writegsdword into a memory-destination instruction with the same gs:offset address. If so, this intrinsic would just be a code-size optimization.
But perhaps it's sometimes relevant to force the compiler to do it as a single instruction to make it atomic wrt. interrupts on this core (but not wrt. accesses by other CPU cores). IDK if that's an intended use case; this answer is just explaining what that one line of documentation means in terms of x86 asm and memory segmentation.

Resources