Rust Embedded panic destroys stack


I'm using Rust on a Cortex-M4 and using gdb with openocd to debug it.
From C(++) I'm used to looking at the call stack when an exception (like a hardfault) happens. It's really helpful to see which line caused the exception.
However, in Rust, when a panic happens, the call stack is almost empty. Why does this happen?
Is there a way to make Rust preserve the stack (only for the debugger, I don't need to print it)? Or can I insert a breakpoint somewhere where the call stack hasn't been destroyed yet?
Right now I have an unwrap somewhere that panics, but I can't find where unless I step through a whole lot of code.
EDIT: This is the stack trace I do get in the panic handler:
i stack
#0 rust_begin_unwind (info=0x2001f810) at src\main.rs:122
#1 0x080219dc in cortex_m::itm::write_fmt (port=0x2001f820, args=...) at C:\Users\d.dokter\.cargo\registry\src\github.com-1ecc6299db9ec823\cortex-m-0.6.1\src/itm.rs:128
#2 0x2001f894 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
It's also weird that the write_fmt function is on the stack as that is being called inside the handler to log the panic. I find that 0x2001f894 address very suspicious as well, because that's a RAM address.

The easiest solution is to make a panic call abort() instead of unwinding the stack. This can be done by adding the following to your Cargo.toml:
[profile.dev]
panic = "abort"
[profile.release]
panic = "abort"
With this setting, the panic handler will immediately call abort(), so gdb can still see the whole backtrace.
If you just want to print the stack trace on a hosted target that uses std, you can also set the environment variable RUST_BACKTRACE=1. A no_std bare-metal target has no environment to read it from.
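If the goal is only to find which unwrap panicked, the panic info itself carries the source location, so the call stack is not strictly needed. Below is a minimal sketch on a hosted target, assuming std is available so that it can run anywhere. The names LAST_PANIC_LOCATION, install_location_hook, failing_unwrap and last_panic_location are made up for illustration. On a bare-metal target the same location() call exists on core::panic::PanicInfo and could be used inside a #[panic_handler] function.

```rust
use std::panic;
use std::sync::Mutex;

// Hypothetical storage for the last recorded panic location (file, line).
static LAST_PANIC_LOCATION: Mutex<Option<(String, u32)>> = Mutex::new(None);

/// Installs a panic hook that records where the panic was raised.
pub fn install_location_hook() {
    panic::set_hook(Box::new(|info| {
        // location() points at the caller of unwrap(), because unwrap()
        // is marked #[track_caller] in the standard library.
        if let Some(loc) = info.location() {
            *LAST_PANIC_LOCATION.lock().unwrap() =
                Some((loc.file().to_string(), loc.line()));
        }
    }));
}

/// A stand-in for "an unwrap somewhere that panics".
pub fn failing_unwrap() {
    let missing: Option<u32> = None;
    let _ = missing.unwrap();
}

/// Returns the location recorded by the hook, if any.
pub fn last_panic_location() -> Option<(String, u32)> {
    LAST_PANIC_LOCATION.lock().unwrap().clone()
}
```

In a debugger session the same idea reduces to printing the info argument of the panic handler (frame #0 in the trace above) and reading its location field.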

Related

What happens when a panic hook panics?

What happens if the function I passed to std::panic::set_hook panics?
I can imagine many ways of reacting to this: consider this UB, abort the program like C++ does, invoke the panic handler again for the new panic, simply abort the execution of the hook... What exactly does Rust promise here?
Context. I'm writing a web app with Rust/WASM backend and I would like to make a panic hook that sends any errors to the server for debugging. This involves a network operation, which can itself fail. So I'm trying to figure out how I can ensure some reasonable behavior in this double-failure scenario.

It's not documented outside of the source code.
The source code for the panic entry point in std has this comment:
// If this is the third nested call (e.g., panics == 2, this is 0-indexed),
// the panic hook probably triggered the last panic, otherwise the
// double-panic check would have aborted the process. In this case abort the
// process real quickly as we don't want to try calling it again as it'll
// probably just panic again.
So the answer to your question is either "invoke the panic handler again for the new panic" or "abort the program" depending on how many times the hook already panicked.
This all assumes you aren't using #![no_std]. If you are then you're either disabling panicking altogether or you are implementing your own panic handler with #[panic_handler], in which case you get to decide what happens yourself.
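Since a panic inside the hook can end in an abort, one defensive option for the error-reporting use case is to treat every fallible step in the hook as a Result and never unwrap it. Here is a minimal sketch, assuming a hypothetical send_report function that stands in for the network call (it always fails here, to exercise the failure path). FAILED_REPORTS, install_reporting_hook and failed_reports are likewise illustrative names.

```rust
use std::panic;
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts reports that could not be delivered (illustrative only).
static FAILED_REPORTS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for the network call. It always fails so that
// the error path of the hook is the one being demonstrated.
fn send_report(_message: &str) -> Result<(), String> {
    Err("network unreachable".to_string())
}

/// Installs a hook that tries to report the panic and tolerates failure.
pub fn install_reporting_hook() {
    panic::set_hook(Box::new(|info| {
        let message = info.to_string();
        // The error is handled as a value; the hook itself never panics here.
        if send_report(&message).is_err() {
            FAILED_REPORTS.fetch_add(1, Ordering::SeqCst);
        }
    }));
}

/// Number of reports the hook failed to deliver so far.
pub fn failed_reports() -> usize {
    FAILED_REPORTS.load(Ordering::SeqCst)
}
```

With this shape a failed network operation only increments a counter, and the original panic proceeds as usual.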

Is it possible to hook a function call with kprobes?

According to https://docs.kernel.org/trace/kprobes.html it is possible to set the instruction pointer within a kprobe's pre_handler function.
Since kprobes can probe into a running kernel code, it can change the register set, including instruction pointer. This operation requires maximum care, such as keeping the stack frame, recovering the execution path etc. Since it operates on a running kernel and needs deep knowledge of computer architecture and concurrent computing, you can easily shoot your foot.
If you change the instruction pointer (and set up other related registers) in pre_handler, you must return !0 so that kprobes stops single stepping and just returns to the given address. This also means post_handler should not be called anymore.
The same type of question was asked here, https://linux-kernel.vger.kernel.narkive.com/et7AyFPm/kprobe-pre-handler-change-return-ip it appears that if the current kprobe is "cleaned up" and the pre_handler sets the new instruction pointer and then returns 1, then you can enter a function separate from the intended instruction.
I may be doing things wrong, but here is my kprobes pre_handler function:
int handler_pre(struct kprobe *kp, struct pt_regs *regs) {
    regs->ip = (unsigned long)mock_function;
    reset_current_kprobe();
    preempt_enable_no_resched();
    return 1;
}
First off, when I compile my module I get the error:
WARNING: "per_cpu__current_kprobe" undefined!
If I try to add the line:
EXPORT_PER_CPU_SYMBOL(current_kprobe);
After I define the kprobe, I still get the undefined warning above. Removing the reset_current_kprobe call removes the compiler warning and allows me to insert the module but, as you may have guessed, it completely crashes the kernel. Since the kernel crashes, I am unable to figure out what may be going wrong.
My understanding is that kprobes replace the first instruction at a probed address with a breakpoint instruction which triggers the pre_handler. So by the time the pre_handler is reached, a stack frame for the intended function shouldn't have been created. In my mind this removes the possibility that I could be somehow messing up the stack but I could be completely wrong.
Does anyone have any insight as to how I could go about fixing this issue or what I am doing wrong?

Why does a panic while panicking result in an illegal instruction?

Consider the following code that purposely causes a double panic:
use scopeguard::defer; // 1.1.0
fn main() {
    defer!{ panic!() };
    defer!{ panic!() };
}
I know this typically happens when a Drop implementation panics while unwinding from a previous panic, but why does it cause the program to issue an illegal instruction? That sounds like the code is corrupted or jumped somewhere unintended. I figure this might be system or code generation dependent but I tested on various platforms and they all issue similar errors with the same reason:
Linux:
thread panicked while panicking. aborting.
Illegal instruction (core dumped)
Windows (with cargo run):
thread panicked while panicking. aborting.
error: process didn't exit successfully: `target\debug\tests.exe` (exit code: 0xc000001d, STATUS_ILLEGAL_INSTRUCTION)
The Rust Playground:
thread panicked while panicking. aborting.
timeout: the monitored command dumped core
/playground/tools/entrypoint.sh: line 11: 8 Illegal instruction timeout --signal=KILL ${timeout} "$@"
What's going on? What causes this?

This behavior is intended.
From a comment by Jonas Schievink in Why does panicking in a Drop impl cause SIGILL?:
It calls intrinsics::abort(), which LLVM turns into a ub2 instruction, which is illegal, thus SIGILL
I couldn't find any documentation for how double panics are handled, but a paragraph for std::intrinsics::abort() lines up with this behavior:
The current implementation of intrinsics::abort is to invoke an invalid instruction, on most platforms. On Unix, the process will probably terminate with a signal like SIGABRT, SIGILL, SIGTRAP, SIGSEGV or SIGBUS. The precise behaviour is not guaranteed and not stable.
Curiously, this behavior is different from calling std::process::abort(), which always terminates with SIGABRT.
The illegal instruction of choice on x86 is UD2 (I think "ub2" in the comment above is a typo), an opcode that is paradoxically reserved and documented precisely so that it is never a valid instruction. So there is no corruption or invalid jump, just a quick and loud way to tell the OS that something has gone very wrong.
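Because the abort is triggered by a second panic raised during unwinding, a Drop implementation that wants to panic can first check std::thread::panicking() and stay quiet if an unwind is already in progress. This is a minimal sketch of that pattern, not something taken from the posts above. StrictGuard, DROPPED_DURING_UNWIND and the helper functions are hypothetical names, and the default unwinding panic strategy is assumed.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Records that the guard was dropped while a panic was unwinding.
static DROPPED_DURING_UNWIND: AtomicBool = AtomicBool::new(false);

/// A guard that insists on being finished before it is dropped.
pub struct StrictGuard {
    pub finished: bool,
}

impl Drop for StrictGuard {
    fn drop(&mut self) {
        if thread::panicking() {
            // Already unwinding: a second panic here would abort the process,
            // so only record the fact and return.
            DROPPED_DURING_UNWIND.store(true, Ordering::SeqCst);
            return;
        }
        if !self.finished {
            panic!("guard dropped before it was finished");
        }
    }
}

/// Panics while a guard is alive, so the guard is dropped during unwinding.
pub fn panic_with_guard_alive() {
    let _guard = StrictGuard { finished: false };
    panic!("first panic");
}

/// Whether a guard has been dropped during an unwind so far.
pub fn dropped_during_unwind() -> bool {
    DROPPED_DURING_UNWIND.load(Ordering::SeqCst)
}
```

Without the thread::panicking() check, panic_with_guard_alive would raise a second panic from drop and end in the same abort as the scopeguard example.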

What is the recommended way to propagate panics in tokio tasks?

Right now my panics are being swallowed. In my use case, I would like it to crash entire program and also print the stack trace. How should I configure it?

Panics are generally not swallowed; instead, they are returned as an error when awaiting the tokio::task::JoinHandle returned from tokio::task::spawn() or tokio::task::spawn_blocking(), and can be handled accordingly.
If a panic occurs within the Tokio runtime an error message is printed to stderr like this: "thread 'tokio-runtime-worker' panicked at 'Panicking...', src\main.rs:26:17". If you run the binary with the environment variable RUST_BACKTRACE set to 1 a stacktrace is printed as well.
As with all Rust programs you can set your own panic handler with std::panic::set_hook() to make it exit if any thread panics after printing the panic info like this:
let default_panic = std::panic::take_hook();
std::panic::set_hook(Box::new(move |info| {
    default_panic(info);
    std::process::exit(1);
}));
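The first point above, that a panic in a spawned task comes back as an error on its join handle rather than disappearing, can be seen without any dependencies. The sketch below deliberately uses std::thread instead of tokio so that it is self-contained; std's JoinHandle::join is only an analogy for awaiting a tokio JoinHandle, and run_worker_and_collect_panic is a hypothetical name.

```rust
use std::thread;

/// Spawns a worker that panics and returns the panic message, if any.
pub fn run_worker_and_collect_panic() -> Option<String> {
    let handle = thread::spawn(|| {
        panic!("worker failed");
    });

    match handle.join() {
        // The worker finished normally.
        Ok(()) => None,
        // The worker panicked: the payload is handed to the joiner.
        Err(payload) => {
            if let Some(s) = payload.downcast_ref::<&str>() {
                Some(s.to_string())
            } else if let Some(s) = payload.downcast_ref::<String>() {
                Some(s.clone())
            } else {
                Some("non-string panic payload".to_string())
            }
        }
    }
}
```

A caller that wants the whole program to stop can inspect this result and exit, which is the per-task alternative to the global hook shown above.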

setting a gdb exit breakpoint not working?

I've set breakpoints on exit and _exit and my program (multithreaded app, running on linux 2.6.16.46-0.12 sles10), is somehow still exiting in a way I can't locate
(gdb) c
...
[New Thread 47513671297344 (LWP 15279)]
[New Thread 47513667103040 (LWP 15280)]
[New Thread 47513662908736 (LWP 15281)]
Program exited with code 0177.
(gdb)
the exit functions reside in libc so there's no deferred load shared library issues. Anybody know of some other mysterious trigger for exit that can't be caught?
EDIT: the problem is now academic only. I tried binary search debugging, backing out a subset of my changes (the problem went away). After I applied them again in sequence, I can no longer repro the problem, even with things restored to the original state.
EDIT2: I found one reason for this sort of error recently, which may have been the original source for this problem. For historical reasons our product uses the evil linker flag -Bsymbolic. Among the side effects of this is that when a symbol is undefined but called, the GLIBC runtime linker will bomb in exactly this way, and you see it in the debugger as a process exited with 0177. When the runtime linker aborts this way, I'd guess it makes the syscall to _exit directly (rather than using the C runtime library exit() or _exit()). That would be consistent with the fact that I was unable to catch this with the exit breakpoints in the debugger.

There are two common reasons for _exit breakpoint to "miss" -- either GDB didn't set the breakpoint in the right place, or the program performs (a moral equivalent of) syscall(SYS_exit, ...)
What do info break and disassemble _exit say?
You might be able to convince GDB to set the breakpoint correctly with break *&_exit. Alternatively, GDB-7.0 supports catch syscall. Something like this should work (assuming Linux/x86_64; note that on ix86 the numbers will be different) regardless of how the program exits:
(gdb) catch syscall 60
Catchpoint 3 (syscall 'exit' [60])
(gdb) catch syscall 231
Catchpoint 4 (syscall 'exit_group' [231])
(gdb) c
Catchpoint 4 (call to syscall 'exit_group'), 0x00007ffff7912f3d in _exit () from /lib/libc.so.6
Update:
Your comment indicates that _exit breakpoint is set correctly, so it's likely that your process just doesn't execute _exit.
That leaves syscall(SYS_exit, ...) and one other possibility (which I missed before): all threads executing pthread_exit. You might want to set a breakpoint on pthread_exit as well (and execute info thread each time you hit it -- the last thread to do pthread_exit will cause the process to terminate).
Edit:
Also worth noting that you can use mnemonic names, rather than syscall numbers. You can also simultaneously add multiple syscalls to the catch list like so:
(gdb) catch syscall exit exit_group
Catchpoint 2 (syscalls 'exit' [1] 'exit_group' [252])

Setting the breakpoint on _exit was a good idea.
You might also try linking statically, just to take a stack of potential gdb complications off the table.
0177 is suspiciously like the wait status wait(2) returns for child stopped, but gdb is printing the exit status, which is a different thing, so that's probably a real exit argument.

It might be that some lazy references are unresolved in a shared library loaded into the process. I had exactly the same situation, where "someone somewhere" exited the process, and the cause turned out to be an unresolved reference.
Check your program with the "ldd -r" option.
It looks like ld.so (or whatever does the lazy resolving) binds such unresolved symbols to a common exit function (which should be abort IMHO).
My situation:
$ ldd ./program
undefined symbol: XXXX (/usr/lib/libYYY.so)
$ ./program
program: started!
...
<program is running regardless of undefined references>
The exit happened when I invoked a scenario that called the undefined function. It always exited with exit code 127, which gdb reported as 0177.
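The two numbers are the same value written in different bases: 0177 read as a C-style octal literal is 127 in decimal. Here is a small sketch of the conversion, assuming the leading 0 marks octal as the post implies. The helper names are made up for illustration.

```rust
/// Formats an exit code the way it appears in the gdb output above,
/// as an octal number with a leading 0.
pub fn format_c_octal(code: u32) -> String {
    format!("0{:o}", code)
}

/// Parses a C-style octal literal such as "0177" into its numeric value.
pub fn parse_c_octal(text: &str) -> Option<u32> {
    let digits = text.strip_prefix('0')?;
    if digits.is_empty() {
        // The literal was just "0".
        return Some(0);
    }
    u32::from_str_radix(digits, 8).ok()
}
```

So an exit breakpoint condition or a shell check should compare against 127, not 177.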
