Parameters of the syscall_32.tbl and syscall_64.tbl files when building the Linux kernel

I'm practicing building a new Linux kernel in a virtual machine, and I have some questions about the two files syscall_32.tbl and syscall_64.tbl, specifically about the step of adding a module's parameters to them.
I know that syscall_32.tbl has 5 columns, [number] [abi] [name] [entry point] [compat entry point], and that syscall_64.tbl has 4, without [compat entry point].
I have some questions that I can't find answers to:
[number]: What is the valid range for this column? I found that the numbers are unique and form an increasing sequence. If I now add a new entry with a large number (such as 10^6), is that OK?
[abi]: I know that in syscall_64.tbl the value of this column may be common, 64, or x32. What does each value mean? Why are they different? And why does a 64-bit machine have the value x32 in this column?
[name]: I know that [entry point] and [compat entry point] name the functions that run the syscall, and that when a user makes a system call we don't use the name; we only use the [number], and kernel space uses the [entry point] to run it. What is the reason for this column ([name])?
Thanks for your views and answers. Sorry for my bad English.

For different binaries to interact, they need to agree on a set of interfaces, e.g. the size of types and layout (padding) of structs. On amd64, GNU/Linux supports three ABIs natively:
i386: For compatibility with x86 32-bit binaries. System calls are defined in syscall_32.tbl
x86_64: Native 64-bit binaries. System calls are defined in syscall_64.tbl with abi=64
x32: ILP32 (32-bit int, long, and pointers), but with amd64 goodies: e.g. registers are 64 bits wide and there are more of them than in i386. System calls are defined in syscall_64.tbl with abi=x32
A binary's ABI is configured at compilation time (-m32, -m64 and -mx32 respectively for GCC), but the kernel runs in long mode in all three cases and sometimes conversions are necessary to account for ABI differences.
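To make the ABI differences concrete, here is a minimal sketch (assuming a multilib GCC so that -m32 and -mx32 are available) that prints the type sizes under each of the three ABIs:

#include <stdio.h>

int main(void)
{
    /* Typical results:
       gcc -m32:  int=4 long=4 ptr=4   (i386)
       gcc -mx32: int=4 long=4 ptr=4   (x32: ILP32 on amd64)
       gcc -m64:  int=4 long=8 ptr=8   (x86_64) */
    printf("int=%zu long=%zu ptr=%zu\n",
           sizeof(int), sizeof(long), sizeof(void *));
    return 0;
}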
Regarding your questions:
[number]: The size depends on the system call convention, e.g. with int 80h the system call number is passed in the 32-bit-wide eax register. Note also that the table is used to generate an array indexed by number, so a sparse, very large number (like 10^6) would bloat the generated table rather than just append one entry.
[abi]: "common" system calls can be used for both amd64 ABIs, but some, like those with pointers to structs, need special handling to account for ABI differences.
[name]: Linux provides headers with system call number definitions, e.g.
#define __NR_exit 1. The macro name is generated from the [name] column. See this answer for more information.
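As a minimal sketch of how user space uses those generated names (assuming glibc's syscall(2) wrapper):

#include <stdio.h>
#include <unistd.h>       /* syscall() */
#include <sys/syscall.h>  /* __NR_* constants, generated from the [name] column */

int main(void)
{
    /* getpid by raw number: the constant __NR_getpid exists because the
       table has a "getpid" row in its [name] column. */
    long pid = syscall(__NR_getpid);
    printf("pid = %ld\n", pid);
    return 0;
}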

Related

Why can't non-PIC code be fully ASLRed using run-time fixups?

I understand that PIC code makes ASLR randomization easier and more efficient, since the code can be placed anywhere in memory with no changes. But if I understand Wikipedia correctly, the dynamic linker can apply relocation "fixups" at run time, so a symbol can be located even though the code is not position-independent. Yet according to many answers I've seen here, non-PIC code can't have its sections ASLRed, except for the stack (so the program entry point can't be randomized). If that is correct, then what are run-time fixups used for, and why can't we just fix up all locations in the code at run time, before the program starts, to randomize the program entry point?
TL:DR: Not all uses of absolute address will have relocation info in a non-PIE executable (ELF type EXEC, not DYN). Therefore the kernel's program-loader can't find them all to apply fixups.
Thus there's no way to retroactively enable ASLR for executables built as non-PIE. There's no way for a traditional executable to flag itself as having relocation metadata for every use of an absolute address, and no point in adding such a feature either since if you want text ASLR you'd just build a PIE.
Because ELF-type EXEC Linux executables are guaranteed to be loaded / mapped at the fixed base address chosen by the linker at link time, it would be a waste of space in the executable to make symbol-table entries for internal symbols. So toolchains didn't do that, and there's no reason to start. That's simply how traditional ELF executables were designed; Linux switched from a.out to ELF back in the mid 90s before stack ASLR was a thing, so it wasn't on people's radar.
e.g. the absolute address of static char buf[100] is probably embedded somewhere in the machine code that uses it (if we're talking about 32-bit code, or 64-bit code that puts the address in a register), but there's no way to know where or how many times.
Also, for x86-64 specifically, the default code model for non-PIE executables guarantees that static addresses (text / data / bss) will all be in the low 2GiB of virtual address space, so 32-bit absolute signed or unsigned addresses can work, and rel32 displacements can reach anything from anything. That's why non-PIE compiler output uses mov $symbol, %edi (5 bytes) to put an address in a register, instead of lea symbol(%rip), %rdi (7 bytes). https://godbolt.org/z/89PeK1
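For illustration, a tiny C fragment (a sketch; exact codegen varies by compiler version and flags) showing where such an absolute address ends up:

static char buf[100];

char *get_buf(void)
{
    /* gcc -fno-pie typically emits:  mov $buf, %eax    (32-bit absolute,
       resolved at link time, with no load-time relocation recorded)
       gcc -fpie typically emits:     lea buf(%rip), %rax */
    return buf;
}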
So even if you did know where every absolute address was, you could only ASLR it in the low 2GiB, limiting the number of bits of entropy you could introduce. (I think Windows has a mode for this: LargeAddressAware = no. But Linux doesn't. 32-bit absolute addresses no longer allowed in x86-64 Linux? Again, PIE is a better way to allow text ASLR, so people (distros) should just compile for that if they want its benefits.)
Unlike Windows, Linux doesn't spend huge effort on things that can be handled better and more efficiently by recompiling binaries from source.
That being said, GNU/Linux does support fixup relocations for 64-bit absolute addresses even in PIC / PIE ELF shared objects. That's why beginner code like NASM mov rdi, BUFFER can work even in a shared library: use objdump -drwC -Mintel to see the relocation info on that use of the symbol in a mov reg, imm64 instruction. An lea rdi, [rel BUFFER] wouldn't need any relocation entry if BUFFER wasn't a global symbol. (Equivalent of C static.)
You might be wondering why metadata is essential:
There's no reliable way to search text/data for possible absolute addresses; false positives would be possible. e.g. /usr/bin/ld probably contains 0x401000 as the default start address for an x86-64 executable. You don't want ASLR of ld's code+data to also change its defaults. Or that integer value could have come up in any number of ways in many programs, e.g. as a bitmap. And of course x86-64 machine code is variable length so there's no reliable way to even distinguish opcodes from immediate operands in the most general case.
There are also potential false negatives: it's not super likely that an x86 program would construct an absolute address in a register with multiple instructions, but it's certainly possible. In non-x86 code, however, that would be common.
RISC machines with fixed-length instructions can't put a 32-bit address into a 32-bit instruction; there'd be no room left for anything else. So to load from static storage, the absolute addresses would have to be split across multiple instructions, like MIPS lui $t0, %hi(0x612300) / lw $t1, %lo(0x612300)($t0) to load from a static variable at absolute address 0x612300. (There would normally be a symbol name in the asm source, but it wouldn't appear in the final linked binary unless it was .globl, so I used numbers as a reminder.) Instructions like that don't have to come in pairs; the same high-half of the address could be reused by other accesses into the same array or struct in later instructions.
Let's first have a look at Windows before having a look at Linux:
Windows' .EXE files (programs) typically have a so-called "base relocation table" and they have an "image base".
The "image base" is the "desired" start address of the program; if Windows loads the program to that address, no relocation needs to be done.
The "base relocation table" contains a list of all values in a program which represent addresses. If the program is loaded to a different address than the "image base", Windows must add the difference to all values listed in that table.
If the .EXE file does not contain a "base relocation table" (as far as I know some 32-bit GCC versions generate such files), it is not possible to load the file to another address.
This is because the following C code statements will result in exactly the same machine code (binary code) if the variable someVariable is located at the address 12340000, and it is not possible to distinguish between them:
long myVariable = 12340000;
And:
int * myVariable = &someVariable;
In the first case, the value 12340000 must not be changed in any situation; in the second case, the address (which is 12340000) must be changed to the real address if the program is loaded to another address.
If the "base relocation table" is missing, there is no information if the value 12340000 is an integer value (which must not be changed) or an address (which must be changed).
So the program must be loaded to some fixed address.
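Putting the two statements above into one compilable sketch (the address 12340000 is hypothetical; the linker decides where someVariable actually lands):

int someVariable;

/* If the linker happens to place someVariable at address 12340000, both
   variables below end up holding the same bit pattern in the data
   section; only relocation metadata could tell the loader which one is
   an address that needs fixing up. */
long myVariable1 = 12340000;
int *myVariable2 = &someVariable;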
I'm not sure about the latest 32-bit Linux releases, but at least in older 32-bit Linux versions there was nothing like a "base relocation table" and programs did not use PIC. This means that these programs had to be loaded to their "favorite" address.
I don't know about 64-bit Linux programs, but if a program is compiled the same way as the (older) 32-bit programs, they also must be loaded to a certain address and ASLR is not possible.

Why does a toolchain name have separate OS and EABI fields?

For example: arm-unknown-linux-gnueabi
Now, once the OS, i.e. Linux, is fixed, the C library is fixed (glibc), and hence the calling convention and ABI being followed are fixed. What is the need for a separate 4th field, i.e. the ABI? Can a toolchain use a different ABI from the one used by the underlying OS and libc? In that case, how would libraries compiled by said toolchain run on the OS?
It's more or less a matter of historical reasons, a.k.a. the holy wars about the sacred operating system's name. What you call the "toolchain name" is actually called the target triplet, and as its name implies, it has three fields, no more and no less. In your example, the fields would be:
Machine/CPU: arm
Vendor: unknown
Operating System: linux-gnueabi
Take another reference example I've already faced: i686-elf-gcc, which is used for hobbyist operating system development:
Machine/CPU: i686
Vendor: unknown (implicit)
Operating System: elf, i.e. none (the compiler is a freestanding cross-compiler, used for the development of operating system kernels, so the code it outputs expects no underlying OS; the output code is the OS itself!)
This is just a matter of confusion originating from the fact that the fields may (and do) contain the - character, which is also used to separate the fields. In your case, the OS is considered to be linux-gnueabi, otherwise known as the GNU operating system with the Linux kernel, using the Embedded ARM ABI. The Linux kernel has historically been one of the most portable pieces of software in the world, so it's expected to be portable to other ARM ABIs, although I'm only aware of the EABI...

Linux kernel assembly and logic

My question is somewhat weird but I will do my best to explain.
Looking at the languages the Linux kernel is written in, I found C and assembly, even though I read a text that said [quote] the second iteration of Unix was written completely in C [/quote].
I thought that was misleading, but when I claimed the kernel has assembly code, I immediately got two questions:
What assembly files are in the kernel, and what is their use?
Assembly is architecture-dependent, so how can Linux be installed on more than one CPU architecture?
And if the Linux kernel is truly written completely in C, then how can GCC compile everything it needs?
I did a complete find / -name "*.s"
and got just one assembly file (asm-offset.s) somewhere under /usr/src/linux-headers-$(uname -r)/.
Somehow I don't think that one file is what makes GCC work, so how can Linux work without assembly? Or if it uses assembly, where is it, and how can the kernel be stable when that code depends on the architecture?
Thanks in advance
1. Why is assembly used?
Because there are certain things that can be done only in assembly, and because assembly results in faster code. For example, "you can get access to unusual programming modes of your processor (e.g. 16 bit mode to interface startup, firmware, or legacy code on Intel PCs)".
Read here for more reasons.
2. Which assembly files are used?
From: https://www.kernel.org/doc/Documentation/arm/README
"The initial entry into the kernel is via head.S, which uses machine
independent code. The machine is selected by the value of 'r1' on
entry, which must be kept unique."
From https://www.ibm.com/developerworks/library/l-linuxboot/
"When the bzImage (for an i386 image) is invoked, you begin at ./arch/i386/boot/head.S in the start assembly routine (see Figure 3 for the major flow). This routine does some basic hardware setup and invokes the startup_32 routine in ./arch/i386/boot/compressed/head.S. This routine sets up a basic environment (stack, etc.) and clears the Block Started by Symbol (BSS). The kernel is then decompressed through a call to a C function called decompress_kernel (located in ./arch/i386/boot/compressed/misc.c). When the kernel is decompressed into memory, it is called. This is yet another startup_32 function, but this function is in ./arch/i386/kernel/head.S."
Apart from these assembly files, a lot of Linux kernel code uses inline assembly.
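For instance, a minimal sketch of GCC inline assembly in the style the kernel uses for its x86 rdtsc helper (simplified; not the kernel's exact code):

#include <stdint.h>

/* Read the x86 time-stamp counter: rdtsc returns the low half in eax
   and the high half in edx, which the output constraints pin down. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}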
3. Architecture dependence?
And you are right about it being architecture-dependent; that's why the Linux kernel code is ported to different architectures.
Linux porting guide
List of supported arch
Things written mainly in assembly in Linux:
Boot code: boots up the machine and sets it up in a state in which it can start executing C code (e.g: on some processors you may need to manually initialize caches and TLBs, on x86 you have to switch to protected mode, ...)
Interrupts/Exceptions/Traps entry points/returns: there you need to do very processor-specific things, e.g: saving registers and reenabling interrupts, and eventually restoring registers and properly returning to user mode. Some exceptions may be handled entirely in assembly.
Instruction emulation: some CPU models may not support certain instructions, may not support unaligned data access, or may not have an FPU. An option is using emulation when getting the corresponding exception.
VDSO: the VDSO is a virtual library that the kernel maps into userspace. It allows e.g: selecting the optimal syscall sequence for the current CPU (on x86 use sysenter/syscall instead of int 0x80 if available), and implementing certain system calls without requiring a context switch (e.g: gettimeofday()).
Atomic operations and locks: maybe in the future some of these could be written using C11 support for atomic operations (see the sketch after this list).
Copying memory from/to user mode: Besides using an optimized copy, these check for out-of-bounds access.
Optimized routines: the kernel has optimized versions of some routines, e.g.: crypto routines, memset, clear_page, csum_copy (checksum and copy IP data to another place in one pass), ...
Support for suspend/resume and other ACPI/EFI/firmware thingies
BPF JIT: newer kernels include a JIT compiler for BPF expressions (used for example by tcpdump, seccomp mode 2, ...)
...
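As a sketch of the C11 direction mentioned in the atomics item above (illustrative only; the kernel has its own locking primitives and does not use <stdatomic.h>):

#include <stdatomic.h>

typedef struct { atomic_flag locked; } spinlock_t;  /* init with ATOMIC_FLAG_INIT */

static void spin_lock(spinlock_t *l)
{
    /* Spin until we are the one who flips the flag from clear to set. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;  /* busy-wait */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}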
To support different architectures, Linux has assembly code (re-)written for each architecture it supports (and sometimes, there are several implementations of some code for different platforms using the same CPU architecture). Just look at all the subdirectories under arch/
Assembly is needed for a couple of reasons.
There are many instructions that are needed for the operation of an operating system but have no C equivalent, at least on most processors. A good example on Intel x86/64 processors is the iret instruction, which returns from hardware/software interrupts. These interrupts are key to handling hardware events (like a keyboard press) and, on older processors, system calls from programs.
A computer does not start up in a state that is immediately ready for execution of C code. For an Intel example, when execution gets to the startup routine the processor may not be in 32-bit mode (or 64-bit mode), and the stack required by C also may not be ready. There are some other features present in some processors (like paging) which need to be turned on from assembly as well.
However, most of the Linux kernel is written in C, which interfaces with some platform specific C/assembly code through standardized interfaces. By separating the parts in this way, most of the logic of the Linux kernel can be shared between platforms. The build system simply compiles the platform independent and dependent parts together for specific platforms, which results in different executable kernel files for different platforms (and kernel configurations for that matter).
Assembly code in the kernel is generally used for low-level hardware interaction that can't be done directly from C. It's like a platform-specific foundation that's used by higher-level parts of the kernel that are written in C.
The kernel source tree contains assembly code for a variety of systems. When you compile a kernel for a particular type of system (such as an x86 PC), only the appropriate assembly code for that platform is included in the build process.
Linux is not the second version of Unix (or Unix in general). It is Unix-compatible, but Unix and Linux have separate histories and, in terms of their kernels' code bases, are completely separate. Linus Torvalds's idea was to write an open-source Unix.
Some of the lower level things like some of the architecture dependent parts of memory management are done in assembly. The old (but still available) Linux kernel API for x86, int 0x80, is implemented in assembly. There are probably other places in the kernel that are implemented in assembly, but I don't know any others.
When you compile the kernel, you select an architecture to target. Depending on the target, the right assembly files for that architecture are included in the build.
The reason you don't find anything is because you're searching the headers, not the sources. Download a tar ball from kernel.org and search that.

how come an x64 OS can run a code compiled for x86 machine

Basically, what I wonder is how an x86-64 OS can run code compiled for an x86 machine. I know that when the first x64 systems were introduced, none of them had this feature. After that, they somehow managed to add it.
Note that I know x86 assembly language is a subset of x86-64 assembly language and that the ISAs are designed to support backward compatibility. But what confuses me here are the stack calling conventions. These conventions differ a lot depending on the architecture. For example, in x86, to back up the frame pointer, a process pushes it onto the stack (RAM) and pops it when done. In x86-64, on the other hand, a process doesn't need to update the frame pointer at all, since all references are given via the stack pointer. And secondly, while in x86 arguments are passed to functions on the stack, in x86-64 registers are used for that purpose.
Maybe these differences between the calling conventions of x86 and x86-64 don't affect the way the program stack grows, as long as the two conventions are not used at the same time; and this is mostly the case, because 32-bit functions are called by other 32-bit functions, and the same holds for 64-bit. But at some point a function (probably a system function) will call a function whose code is compiled for an x86-64 machine with some arguments, and at this point I am curious how the OS (or some other control unit) gets this function to work.
Thanks in advance.
Part of the way that the i386/x86-64 architecture is designed is that the CS and other segment registers refer to entries in the GDT. The GDT entries have a few special bits besides the base and limit that describe the operating mode and privilege level of the current running task.
If the CS register refers to a 32-bit code segment, the processor will run in what is essentially i386 compatibility mode. Likewise 64-bit code requires a 64-bit code segment.
So, putting this all together.
When the OS wants to run a 32-bit task, during the task switch into it, it loads a value into CS which refers to a 32-bit code segment. Interrupt handlers also have segment registers associated with them, so when a system call occurs or an interrupt occurs, the handler will switch back to the OS's 64-bit code segment, (allowing the 64-bit OS code to run correctly) and the OS then can do its work and continue scheduling new tasks.
As a follow-up with regards to calling conventions: neither i386 nor x86-64 requires the use of frame pointers. The code is free to do as it pleases. In fact, many compilers (gcc, clang, VS) offer the ability to compile 32-bit code without frame pointers. What is important is that the calling convention is implemented consistently. If all the code expects arguments to be passed on the stack, that's fine, but the called code had better agree with that. Likewise, passing via registers is fine too; everyone just has to agree (at least at the library interface level; internal functions can generally do as they please).
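To make that concrete, the same C function compiled for each ABI picks up its arguments from different places (a sketch; exact codegen varies by compiler and options):

int add3(int a, int b, int c)
{
    /* gcc -m32 (cdecl):          a, b, c are read from the stack at
                                  4(%esp), 8(%esp), 12(%esp)
       gcc -m64 (System V AMD64): a, b, c arrive in %edi, %esi, %edx */
    return a + b + c;
}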
Beyond that, just keep in mind that the difference between the two isn't really an issue because every process gets its own private view of memory. A side consequence though is that 32-bit apps can't load 64-bit dlls, and 64-bit apps can't load 32-bit dlls, because a process either has a 32-bit code segment or a 64-bit code segment. It can't be both.
The processor is put into compatibility mode, but that requires everything executing at that time to be 32-bit code. This switching is handled by the OS.
Windows: It uses WoW64, which is responsible for changing the processor mode; it also provides the compatible DLL and registry functions.
Linux: Until recently, Linux used to (like Windows) shift the processor into compatibility mode whenever it started executing 32-bit code; you needed all the 32-bit glibc libraries installed, and it would break if it tried to work together with 64-bit code. Now they are implementing the x32 ABI, which should make everything run more smoothly and allow 32-bit applications to access x86-64 features like the increased number of registers. See this article on the x32 ABI.
PS : I am not very certain on the details of things, but it should give you a start.
Also, this answer combined with Evan Teran's answer probably give a rough picture of everything that is happening.

What's different between compiling in 32-bit mode and 64-bit mode on a 64-bit OS with respect to the ioctl function?

I have a 64-bit SUSE Linux Enterprise 11 system.
I have an application which opens a HIDRAW device and performs an ioctl on it to get raw info from the device, like below:
#include <fcntl.h>          /* open() */
#include <sys/ioctl.h>      /* ioctl() */
#include <linux/hidraw.h>   /* struct hidraw_devinfo, HIDIOCGRAWINFO */

struct hidraw_devinfo devinfo;
int fd = open("/dev/hidraw0", O_RDONLY);
int ret = ioctl(fd, HIDIOCGRAWINFO, &devinfo);
...
If I compile this program in 64-bit mode, there is no error and no problem, and when I execute the application the ioctl call works properly.
g++ main.cpp
If I compile this program in 32-bit mode, there is also no error and no problem, but when I execute the application the ioctl call returns an EINVAL error (errno = 22, Invalid argument).
g++ -m32 main.cpp
What's the problem?
Note:
struct hidraw_devinfo {
    __u32 bustype;
    __s16 vendor;
    __s16 product;
};
Linux ioctl definitions and compatibility layers are a fascinating topic I've just bashed my head against.
Typically, ioctl definitions use a family of macros (_IOW/_IOR et al.) that take your argument type-name as a reference, along with a magic number and an ordinal value that are munged together to give you your ioctl request value (e.g. HIDIOCGRAWINFO). The type-name is used to encode sizeof(arg_type) into the definition. This means that the type used in user space determines the value generated by the ioctl macro, i.e. HIDIOCGRAWINFO may vary based on include conditions.
Here is the first point where 32-bit and 64-bit differ: the sizeof may differ, depending on packing and the use of vague data sizes (e.g. long), but especially (and unavoidably) if you use a pointer argument. So in this case a 64-bit kernel module that wants to support 32-bit clients needs to define a compatibility argument type matching the layout of the 32-bit equivalent of the argument type, and thus a 32-bit-compatible ioctl. These 32-bit-equivalent definitions make use of a kernel facility/layer called compat.
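For reference, the hidraw request in question is defined roughly like this in linux/hidraw.h ('H' is hidraw's magic number, and sizeof(struct hidraw_devinfo) gets encoded into the resulting value):

#define HIDIOCGRAWINFO _IOR('H', 0x03, struct hidraw_devinfo)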
In your case the sizeof() is the same, so that is not the path you are taking - but it's important to understand the whole of what could be happening.
In addition, a kernel configuration may define CONFIG_COMPAT, which changes the syscall wrappers (especially the code surrounding the user/kernel interface with respect to ioctl) to ease the burden of supporting both 32-bit and 64-bit. Part of this is a compatibility ioctl callback called compat_ioctl.
What I've seen is that, with CONFIG_COMPAT defined, 32-bit programs deliver their ioctls to the compat_ioctl callback, even when they generate the same ioctl value as 64-bit code does (as in your case). So the driver writer needs to make sure that compat_ioctl handles both the special (different) 32-bit-compatible ioctl types and the normal "64-bit, or unchanged 32-bit" types.
So a kernel module designed and tested only on 32-bit or only on 64-bit systems (without CONFIG_COMPAT) may work for 32- and 64-bit programs respectively, but not on a kernel which supports both.
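In driver-source terms, the two entry points look like this (a sketch of the pattern; my_ioctl and my_compat_ioctl are hypothetical names):

static const struct file_operations my_fops = {
    .unlocked_ioctl = my_ioctl,        /* 64-bit callers (and 32-bit callers
                                          on kernels without CONFIG_COMPAT) */
#ifdef CONFIG_COMPAT
    .compat_ioctl   = my_compat_ioctl, /* 32-bit callers on a 64-bit kernel */
#endif
};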
So looking in HID I see this was added in 2.6.38:
http://lxr.linux.no/#linux+v2.6.38/drivers/hid/hidraw.c#L347
The problem is probably a mismatch between the devinfo structure your program passes to ioctl and the one the kernel module expects.
I guess you work on a 64-bit system. Thus your kernel runs in 64-bit mode, and the kernel module you are talking to (with ioctl) is also 64-bit.
When you compile your user program in 64-bit mode, the devinfo definition in the kernel module and in the user program are the same.
When you compile your user program in 32-bit mode, the devinfo definition in the kernel module can differ from its definition in your user program. Indeed, in 32-bit mode the size of some types changes: mainly long and pointers. Thus your program creates a structure of a certain size, and the kernel module interprets the data it receives differently. The kernel module probably doesn't understand the value you give it because it does not look for it at the position where you placed it.
The solution is to pay attention to the definition of the devinfo structure so that it has the same binary representation when compiled for 32 bits and for 64 bits.
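One general way to achieve that (a sketch of the technique, not a change to hidraw_devinfo, which already uses fixed-width kernel types) is to stick to fixed-width types and avoid long and pointers:

#include <stdint.h>

/* 8 bytes with identical layout on both i386 and x86-64: fixed-width
   members, no long, no pointers, no architecture-dependent padding. */
struct devinfo_stable {
    uint32_t bustype;
    int16_t  vendor;
    int16_t  product;
};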
