What Linux entity is responsible for generating Illegal Instruction Traps?

I am working on a custom version of Rocket Chip that features some extra instructions that I would like to be properly handled by Linux. Although bare-metal programs using these instructions run fine, Linux makes the same benchmarks crash with "Illegal Instruction" messages.
Does anyone know which software element of Linux - loader, disassembler, something else - is responsible for detecting illegal instructions?
My goal is to modify that piece of software so that Linux stops complaining about my instructions. If anyone knows about an easier way to suppress this kind of error, that would be very useful too.

The RISC-V implementation (the processor) raises an illegal instruction trap whenever it encounters an instruction it has not implemented. These illegal instruction traps are piped through to Linux (either via trap delegation or after being handled by the machine-mode software) and then go through the standard trap-handling path:
stvec points to handle_exception, which does a bunch of bookkeeping to avoid trashing userspace and then directs traps to the correct location.
For illegal instruction traps, you'll fall through to the excp_vect_table jump table, which handles all the boring traps.
This is indexed by scause, which in this case points to do_trap_insn_illegal.
do_trap_insn_illegal is just a generic Linux trap handler: it passes SIGILL to whatever caused the trap. This may raise a signal to a userspace task, a kernel task, or just panic the kernel directly.
There are a bunch of levels of indirection here that we're currently not doing anything with, but may serve to emulate unimplemented instructions in the future.
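If patching the kernel turns out to be more than you need, and the trap actually reaches your process as SIGILL (i.e. it isn't already emulated in machine-mode firmware), you can also handle it in userspace. A rough sketch, assuming RV64 Linux with glibc (where uc_mcontext.__gregs[REG_PC] holds the faulting PC) and a custom instruction with a standard 4-byte encoding; the opcode check below is purely illustrative:

/* Userspace workaround (sketch, not the kernel-side fix): catch SIGILL
 * in the benchmark itself and emulate or skip the custom instruction.
 * Assumes RV64 Linux/glibc layout and a 4-byte custom-0 encoding. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

static void sigill_handler(int sig, siginfo_t *info, void *uctx)
{
    ucontext_t *uc = uctx;
    uintptr_t pc = uc->uc_mcontext.__gregs[REG_PC];

    uint32_t insn;
    memcpy(&insn, (const void *)pc, sizeof insn);   /* fetch faulting insn */

    if ((insn & 0x7f) == 0x0b) {        /* hypothetical custom-0 opcode */
        /* ...emulate the instruction here, writing results into the
         * destination register via uc->uc_mcontext.__gregs[rd]... */
        uc->uc_mcontext.__gregs[REG_PC] = pc + 4;   /* step over it */
        return;
    }
    abort();                            /* genuinely illegal instruction */
}

int main(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = sigill_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGILL, &sa, NULL);
    /* ...run the benchmark... */
    return 0;
}

Modifying do_trap_insn_illegal gives you the same effect system-wide, but the signal handler is usually the quicker experiment.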

Related

Why is eBPF said to be safer than LKM?

When people talk about eBPF's advantages, they always mention that it is safer than an LKM (loadable kernel module).
I have read some documentation: eBPF ensures safety by verifying the code before it is loaded.
These are the checks the verifier performs:
loops
out of range jumps
unreachable instructions
invalid instructions
uninitialized register access
uninitialized stack access
misaligned stack access
out of range stack access
invalid calling convention
I can understand most of these checks, but are these really all the ways an LKM can cause a kernel panic? Do these checks ensure safety?
I have 120,000 servers in production, and this question is the only thing preventing me from migrating from a traditional HIDS to an eBPF-based HIDS. If it can cause a kernel panic at scale even once, our business will be over.
Yes, as far as I know, the BPF verifier is meant to prevent any sort of kernel crash. That however doesn't mean you can't break things unintentionally in production. You could for example freeze your system by attaching BPF programs to all kernel functions or lose all connectivity by dropping all received packets. In those cases, the verifier has no way to know that you didn't mean to perform those actions; it won't stop you.
That being said, any sort of verification is better than no verification as in traditional kernel modules. With kernel modules, not only can you shoot yourself in the foot as I've described above, but you could also crash the whole system because of a subtle bug somewhere in the code.
Regardless of what you're using, you should obviously test it extensively before deploying to production.
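To make that concrete, here is a minimal XDP program (a sketch in the restricted C that clang -target bpf compiles) that the verifier will happily load, even though attaching it to your interfaces would drop every received packet:

/* The verifier accepts this without complaint: it cannot crash or hang
 * the kernel. It can still take 120,000 servers off the network. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int drop_everything(struct xdp_md *ctx)
{
    return XDP_DROP;            /* unconditionally drop the packet */
}

char LICENSE[] SEC("license") = "GPL";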

Need an overview of debugging process from the hardware layer

I want a comprehensive overview of how the debugging process works on a typical x86 machine running the Linux operating system; let's say the program used for debugging is gdb. Question #1: is the process of debugging facilitated by the hardware (or is it implemented completely in software instead)? If so, what architectural features from the instruction set are involved?
The x86 ISA includes a single-byte int3 encoding (0xCC) that's intended for software breakpoints. GDB uses this (via ptrace) by default for breakpoints.
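For illustration, here is roughly how a debugger plants such a breakpoint through ptrace (a sketch only, assuming x86-64 Linux, a child that is already traced and stopped, and a hypothetical addr where we want to break):

/* Plant an int3 (0xCC) breakpoint, wait for the child to hit it, then
 * restore the original bytes and rewind RIP over the trap. */
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

void break_once_at(pid_t child, uintptr_t addr)
{
    /* Save the original word and patch its low byte to 0xCC (int3). */
    long orig = ptrace(PTRACE_PEEKTEXT, child, (void *)addr, NULL);
    long patched = (orig & ~0xffL) | 0xcc;
    ptrace(PTRACE_POKETEXT, child, (void *)addr, (void *)patched);

    /* Resume; the child stops with SIGTRAP when it executes the int3. */
    ptrace(PTRACE_CONT, child, NULL, NULL);
    int status;
    waitpid(child, &status, 0);

    /* Put the real instruction back and rewind RIP so it executes when
     * the child is resumed again. */
    ptrace(PTRACE_POKETEXT, child, (void *)addr, (void *)orig);
    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, child, NULL, &regs);
    regs.rip = addr;
    ptrace(PTRACE_SETREGS, child, NULL, &regs);
}

A real debugger like GDB additionally re-inserts the breakpoint after stepping over it, juggles multiple breakpoints, handles threads, and so on, but the core mechanism is the same.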
x86 also has a Trap Flag (TF) in EFLAGS for single-step mode (https://en.wikipedia.org/wiki/Trap_flag). See also: Why Single Stepping Instruction on X86? and Difference between trap flag (TF) and monitor trap flag?
There are even "debug registers" for setting hardware breakpoints, without modifying the machine code to be run. And also hardware support for watch points, to break on write to a certain address. This makes GDB watch points efficient, not requiring it to single-step and manually decode the instruction to see where it writes.
https://wiki.osdev.org/CPU_Registers_x86#Debug_Registers
The osdev forum thread "Implementing hardware breakpoints using x86 debug register" might be relevant.
Some other ISAs exist without nearly as much hardware support for debugging. For example, without a single-step flag, a debugger might have to always decode the current instruction (pointed to by the program counter) to find the next one to be executed, and set a software breakpoint there.
ARM Linux used to do that to implement ptrace single-step, but that disassembler code was removed from the kernel and now just returns -EIO. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=425fc47adb5bb69f76285be77a09a3341a30799e is the commit that removed it.
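On Linux, the TF-based mechanism is what debuggers reach through ptrace(PTRACE_SINGLESTEP, ...). A sketch (x86-64, with a child that is already traced and stopped) that steps the child and prints RIP after each instruction:

/* Single-step a traced child via the kernel's TF-based PTRACE_SINGLESTEP
 * and print the instruction pointer after every step. */
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

void step_and_trace(pid_t child, int steps)
{
    for (int i = 0; i < steps; i++) {
        /* The kernel sets TF in the child's EFLAGS and resumes it; the
         * CPU raises a debug exception after one instruction. */
        if (ptrace(PTRACE_SINGLESTEP, child, NULL, NULL) == -1)
            break;
        int status;
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;

        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        printf("step %d: rip = %llx\n", i, regs.rip);
    }
}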

"Switching from user mode to kernel mode" is an incorrect concept

I'm studying "Operating Systems" for the first time. In my book I found this sentence about "User Mode" and "Kernel Mode":
"Switch from user to kernel mode" instruction is executed only in kernel mode
I think that is an incorrect sentence, as in practice there is no "switch of kernel". In fact, when a user process needs to execute a privileged instruction, it simply asks the kernel to do something on its behalf. Is that correct?
In fact, when a user process needs to execute a privileged instruction, it simply asks the kernel to do something on its behalf.
But how does that happen? The details are processor (i.e. instruction set architecture) and OS specific (they are explained in the ABI specifications relevant to your system), but it usually involves some machine-code instruction like SYSENTER or SYSCALL (or SVC on mainframes) capable of atomically changing the CPU mode (that is, switching it in a controlled manner to kernel mode). The actual parameters of the system call (including the syscall number itself) are often passed in registers (but the details are ABI specific).
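To make that concrete, here is roughly what a libc write() wrapper boils down to on x86-64 Linux (a sketch; the syscall number and the register assignments are x86-64 specific):

/* Raw write(2): syscall number in rax, arguments in rdi/rsi/rdx, then
 * the `syscall` instruction switches to kernel mode in a controlled way. */
long raw_write(int fd, const void *buf, unsigned long len)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "0"(1L),                 /* __NR_write on x86-64 */
                        "D"((long)fd), "S"(buf), "d"(len)
                      : "rcx", "r11", "memory"); /* syscall clobbers these */
    return ret;
}

int main(void)
{
    raw_write(1, "hello from a raw syscall\n", 25);
    return 0;
}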
So I feel the concept of switching from user-mode to kernel-mode is relevant, and meaningful (so "correct").
BTW, user-mode code is forbidden (by the hardware) from executing privileged machine instructions, such as those interacting with I/O hardware devices (read about protection rings). If you try, you get a hardware exception (a bit similar to an interrupt). Hence your code (even if it is malicious) has to make system calls, which the kernel controls (it has lots of code related to permission checking), for e.g. all I/O.
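A two-line demonstration of that point: cli (disable interrupts) is a privileged x86 instruction, so executing it in user mode raises a general-protection fault, which Linux delivers to the process as SIGSEGV. Build and run this on x86/x86-64 Linux:

int main(void)
{
    __asm__ volatile ("cli");   /* dies here with "Segmentation fault" */
    return 0;
}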
Read also Operating Systems: Three Easy Pieces (freely downloadable). See also http://osdev.org/. Read the system call wikipage & syscalls(2), and the Assembler HowTo.
In real life, things are much more complex. Read about System Management Mode and about the (scary) Intel Management Engine.

Linux System Calls & Kernel Mode

I understand that system calls exist to provide access to capabilities that are disallowed in user space, such as accessing an HDD using the read() system call. I also understand that these are abstracted by a user-mode layer in the form of library calls such as fread(), to provide compatibility across hardware.
So from the application developer's point of view, we have something like:
//library //syscall //k_driver //device_driver
fread() -> read() -> k_read() -> d_read()
My question is: what is stopping me from inlining all the instructions in the fread() and read() functions directly into my program? The instructions are the same, so the CPU should behave in the same way? I have not tried it, but I assume this does not work for some reason I am missing. Otherwise any application could get arbitrary kernel-mode operation.
TL;DR: What allows system calls to 'enter' kernel mode that is not copy-able by an application?
System calls do not enter the kernel by themselves. More precisely, the read function you call is still, as far as your application is concerned, a library call. What read(2) does internally is invoke the actual system call, using an interrupt or a dedicated instruction such as syscall, depending on the CPU architecture and OS.
This is the only way for userland code to get privileged code executed, but it is an indirect way. The userland and kernel code execute in different contexts.
That means you cannot add the kernel source code to your userland code and expect it to do anything useful but crash. In particular, the kernel code has access to the physical memory addresses required to interact with the hardware. Userland code is limited to a virtual address space that lacks this capability. Also, the instructions userland code is allowed to execute are a subset of the ones the CPU supports. Several I/O-, interrupt- and virtualization-related instructions are examples of prohibited code. They are known as privileged instructions and require the CPU to be in a lower ring or supervisor mode, depending on the architecture.
You could inline them. You can issue system calls directly through syscall(2), but that soon gets messy. Note that the system call overhead (context switches back and forth, in-kernel checks, ...), not to mention the time the system call itself takes, makes any gain from inlining disappear in the noise (if there is any gain at all; more code means the cache isn't as useful, and performance suffers). Trust the libc/kernel folks to have studied the matter and done the inlining for you behind your back (in the relevant *.h file) if it really is a measurable gain.
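For completeness, "issuing a system call directly through syscall(2)" looks like this (a sketch; it still goes through the same kernel entry path as read()/fread(), which is exactly the boundary you cannot inline past):

/* Call read(2) via the generic syscall(2) wrapper instead of the libc
 * read() wrapper; the kernel-mode transition happens all the same. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    char buf[64];
    long n = syscall(SYS_read, 0, buf, sizeof buf);  /* read from stdin */
    printf("read %ld bytes\n", n);
    return 0;
}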

Get instruction pointer on segmentation fault or crash (for x86 JIT compiler project)?

I'm implementing the backend for a JavaScript JIT compiler that produces x86 code. Sometimes, as the result of bugs, I get segmentation faults. It can be quite difficult to trace back what caused them. Hence, I've been wondering if there would be some "easy" way to trap segmentation faults and other such crashes, and get the address of the instruction that caused the fault. This way, I could map the address back to compiled x86 assembly, or even back to source code.
This needs to work on Linux, but ideally on any POSIX compliant system. In the worst case, if I can't catch the seg fault and get the IP in my running JIT, I'd like to be able to trap it outside (kernel log?), and perhaps just have the compiler dump a big file with mappings of addresses to instructions, which I could match with a Python script or something.
Any ideas/suggestions are appreciated. Feel free to share your own debugging tips if you've ever worked on a compiler project of your own.
If you use sigaction, you can define a signal handler that takes 3 arguments:
void (*sa_sigaction)(int signum, siginfo_t *info, void *ucontext)
The third argument passed to the signal handler is a pointer to an OS- and architecture-specific data structure. On Linux, it's a ucontext_t, which is defined in the <sys/ucontext.h> header file. Within that, uc_mcontext is an mcontext_t (machine context), which for x86 contains all the registers at the time of the signal in gregs. So, after casting the void * to a ucontext_t *, you can access
((ucontext_t *)ucontext)->uc_mcontext.gregs[REG_EIP] (32-bit mode)
((ucontext_t *)ucontext)->uc_mcontext.gregs[REG_RIP] (64-bit mode)
to get the instruction pointer of the faulting instruction.
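Putting it together, a minimal handler might look like this (a sketch for x86-64 Linux; use REG_EIP on 32-bit, and note that fprintf is not async-signal-safe, so a real JIT would log more carefully):

/* Catch SIGSEGV with sigaction(SA_SIGINFO) and print the faulting
 * instruction pointer plus the address that was accessed. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <ucontext.h>
#include <unistd.h>

static void segv_handler(int signum, siginfo_t *info, void *ucontext)
{
    ucontext_t *uc = ucontext;
    void *ip = (void *)uc->uc_mcontext.gregs[REG_RIP];
    /* async-signal-safety is ignored here for brevity */
    fprintf(stderr, "SIGSEGV at ip=%p, fault address=%p\n", ip, info->si_addr);
    /* Map `ip` back to your JIT'ed code here, then bail out. */
    _exit(1);
}

int main(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    *(volatile int *)0 = 42;   /* deliberately fault to demo the handler */
    return 0;
}

From there you can look the reported address up in whatever address-to-instruction mapping your compiler dumps.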
