Why does a panic while panicking result in an illegal instruction? - rust

Consider the following code that purposely causes a double panic:
use scopeguard::defer; // 1.1.0
fn main() {
defer!{ panic!() };
defer!{ panic!() };
}
I know this typically happens when a Drop implementation panics while unwinding from a previous panic, but why does it cause the program to issue an illegal instruction? That sounds like the code is corrupted or jumped somewhere unintended. I figure this might be system or code generation dependent but I tested on various platforms and they all issue similar errors with the same reason:
Linux:
thread panicked while panicking. aborting.
Illegal instruction (core dumped)
Windows (with cargo run):
thread panicked while panicking. aborting.
error: process didn't exit successfully: `target\debug\tests.exe` (exit code: 0xc000001d, STATUS_ILLEGAL_INSTRUCTION)
The Rust Playground:
thread panicked while panicking. aborting.
timeout: the monitored command dumped core
/playground/tools/entrypoint.sh: line 11: 8 Illegal instruction timeout --signal=KILL ${timeout} "$#"
What's going on? What causes this?

This behavior is intended.
From a comment by Jonas Schievink in Why does panicking in a Drop impl cause SIGILL?:
It calls intrinsics::abort(), which LLVM turns into a ub2 instruction, which is illegal, thus SIGILL
I couldn't find any documentation for how double panics are handled, but a paragraph for std::intrinsics::abort() lines up with this behavior:
The current implementation of intrinsics::abort is to invoke an invalid instruction, on most platforms. On Unix, the process will probably terminate with a signal like SIGABRT, SIGILL, SIGTRAP, SIGSEGV or SIGBUS. The precise behaviour is not guaranteed and not stable.
Curiously, this behavior is different from calling std::process::abort(), which always terminates with SIGABRT.
The illegal instruction of choice on x86 is UD2 (I think a typo in the comment above) a.k.a. an undefined instruction which is paradoxically reserved and documented to not be an instruction. So there is no corruption or invalid jump, just a quick and loud way to tell the OS that something has gone very wrong.

Related

What is the recommended way to propagate panics in tokio tasks?

Right now my panics are being swallowed. In my use case, I would like it to crash entire program and also print the stack trace. How should I configure it?
Panics are generally not swallowed, instead they are returned as an error when awaiting the tokio::task::JoinHandle returned from tokio::task::spawn() or tokio::task::spawn_blocking() and can be handled accordingly.
If a panic occurs within the Tokio runtime an error message is printed to stderr like this: "thread 'tokio-runtime-worker' panicked at 'Panicking...', src\main.rs:26:17". If you run the binary with the environment variable RUST_BACKTRACE set to 1 a stacktrace is printed as well.
As with all Rust programs you can set your own panic handler with std::panic::set_hook() to make it exit if any thread panics after printing the panic info like this:
let default_panic = std::panic::take_hook();
std::panic::set_hook(Box::new(move |info| {
default_panic(info);
std::process::exit(1);
}));

Rust Embedded panic destroys stack

I'm using Rust on a Cortex-M4 and using gdb with openocd to debug it.
From C(++) I'm used to looking at the call stack when an exception (like a hardfault) happens. It's really helpful to see which line caused the exception.
However, in Rust, when a panic happens, the call stack is almost empty. Why does this happen?
Is there a way to make Rust preserve the stack (only for the debugger, I don't need to print it)? Or can I insert a breakpoint somewhere where the call stack hasn't been destroyed yet?
Right now I have an unwrap somewhere that panics, but I can't find where unless I step through a whole lot of code.
EDIT: This is the stack trace I do get in the panic handler:
i stack
#0 rust_begin_unwind (info=0x2001f810) at src\main.rs:122
#1 0x080219dc in cortex_m::itm::write_fmt (port=0x2001f820, args=...) at C:\Users\d.dokter\.cargo\registry\src\github.com-1ecc6299db9ec823\cortex-m-0.6.1\src/itm.rs:128
#2 0x2001f894 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
It's also weird that the write_fmt function is on the stack as that is being called inside the handler to log the panic. I find that 0x2001f894 address very suspicious as well, because that's a RAM address.
The easiest solution is to set the panic handler to call abort() instead of unwinding the stack. This can be done by adding this to your Cargo.toml
[profile.dev]
panic = "abort"
[profile.release]
panic = "abort"
With this setting, the panic handler will immediately call abort(), so gdb can still see the whole backtrace.
If you just want to print the stack trace, you can also set the environment variable RUST_BACKTRACE=1.

What could cause trace/breakpoint trap (core dumped)? [duplicate]

I know this question has been asked before, but I have read all the threads and I didn't find an answer.
From the moment I execure run to start debugging my project, I get this : Program received signal SIGTRAP, Trace/breakpoint trap. [Switching to Thread 6]. When I do ctrl+c, gdb tells me : Program received signal SIGINT, Interrupt.
0x00000000 in ?? ()
Usually it'll tell me which file and which function it got interrupted at not 0x00000000 in ?? ()
GDB no longer hits breakpoints, and what makes matter crazier is the fact that a colleague and I, are sharing the same session (the debug is done using cygwin with a remote machine) and it works fine for them but not for me.
when I try to get info about the threads using info threads here's what I get :
[New Thread 20]
[New Thread 21]
[New Thread 22]
Id Target Id Frame
4 Thread 22 (ssp=0xa9004d5c) 0x00000000 in ?? ()
3 Thread 21 (ssp=0xa9002e64) 0x00000010 in ?? ()
2 Thread 20 (ssp=0xa9000ef4) 0x00000000 in ?? ()
The current thread <Thread ID 1> has terminated. See `help thread'
there's no thread 6, there's no * to indicate which thread gdb is using. And I don't even know if that's linked to the problem.
Can anyone please help me?
You are not providing nearly enough info to help you. Details matter, and you are withholding them. Versions of GDB and gdbserver matter, how you invoke GDB and gdbserver matter, what warnings you receive from GDB (if any) matter.
Now, this error message:
Program received signal SIGTRAP, Trace/breakpoint trap. [Switching to Thread 6]
usually means that gdbserver has not attached one of the threads of your process, and that thread has tried to execute breakpoint instruction (you do have breakpoints set before this happens, don't you?).
One of the reasons this may happen is when your GDB loads "wrong" libthread_db.so (one that doesn't match the target libc.so.6).
what makes matter crazier is the fact that a colleague and I, are sharing the same session (the debug is done using cygwin with a remote machine) and it works fine for them but not for me.
I am not sure what you mean by "same session", but it's probably not "when he types commands, they work; but when I type the same commands into the same GDB, they don't".
One difference between you and your colleague could be LD_LIBRATY_PATH environment variable setting. Another could be in ~/.gdbinit or in ./.gdbinit.
I suggest running gdb -nx to get rid of the latter, and unsetting LD_LIBRARY_PATH to get rid of the former.
The problem with the whole thing and for some reason no one seemed to notice it is this :
this is how I call gdb /usr/local/build/gdbx.y/gdb/gdb what I should be doing is this : /usr/local/build/gdbx.y/build/gdb/gdb
It was a path problem.

How can I tell whether SIGILL originated from an illegal instruction or from kill -ILL?

In a signal handler installed through void (*sa_sigaction)(int, siginfo_t *, void *);, how can I tell whether a SIGILL originated from an illegal instruction or from some process having sent SIGILL? I looked at si_pid of siginfo_t, but this seems to be uninitialized in case an illegal instruction was encountered, so that I can't base my decision on it. - Of course, I'm searching for a preferably simple and portable solution, rather than reading the instruction code at si_addr and trying to determine if it is legal.
A bona fide SIGILL will have an si_code of one of the ILL_ values (e.g., ILL_ILLADR). A user-requested SIGILL will have an si_code of one of the SI_ values (often SI_USER).
The relevant POSIX values are:
[Kernel-generated]
ILL_ILLOPC Illegal opcode.
ILL_ILLOPN Illegal operand.
ILL_ILLADR Illegal addressing mode.
ILL_ILLTRP Illegal trap.
ILL_PRVOPC Privileged opcode.
ILL_PRVREG Privileged register.
ILL_COPROC Coprocessor error.
ILL_BADSTK Internal stack error.
[User-requested]
SI_USER Signal sent by kill().
SI_QUEUE Signal sent by the sigqueue().
SI_TIMER Signal generated by expiration of a timer set by timer_settime().
SI_ASYNCIO Signal generated by completion of an asynchronous I/O request.
SI_MESGQ Signal generated by arrival of a message on an empty message queue.
For example, the recipe in this question gives me ILL_ILLOPN, whereas kill(1) and kill(2) gives me zero (SI_USER).
Of course, your implementation might add values to the POSIX list. Historically, user- or process-generated si_code values were <= 0, and this is still quite common. Your implementation might have a convenience macro to assist here, too. For example, Linux provides:
#define SI_FROMUSER(siptr) ((siptr)->si_code <= 0)
#define SI_FROMKERNEL(siptr) ((siptr)->si_code > 0)

setting a gdb exit breakpoint not working?

I've set breakpoints on exit and _exit and my program (multithreaded app, running on linux 2.6.16.46-0.12 sles10), is somehow still exiting in a way I can't locate
(gdb) c
...
[New Thread 47513671297344 (LWP 15279)]
[New Thread 47513667103040 (LWP 15280)]
[New Thread 47513662908736 (LWP 15281)]
Program exited with code 0177.
(gdb)
the exit functions reside in libc so there's no deferred load shared library issues. Anybody know of some other mysterious trigger for exit that can't be caught?
EDIT: the problem is now academic only. I tried binary search debugging, backing out a subset of my changes (the problem went away). After I applied them again in sequence, I can no longer repro the problem, even with things restored to the original state.
EDIT2: I found one reason for this sort of error recently, which may have been the original source for this problem. For historical reasons our product uses the evil linker flag -Bsymbolic. Among the side effects of this is that when a symbol is undefined but called, the GLIBC runtime linker will bomb in exactly this way, and you see it in the debugger as a process exited with 0177. When the runtime linker aborts this way, I'd guess it makes the syscall to _exit directly (rather than using the C runtime library exit() or _exit()). That would be consistent with the fact that I was unable to catch this with an the exit breakpoints in the debugger.
There are two common reasons for _exit breakpoint to "miss" -- either GDB didn't set the breakpoint in the right place, or the program performs (a moral equivalent of) syscall(SYS_exit, ...)
What do info break and disassemble _exit say?
You might be able to convince GDB to set the breakpoint correctly with break *&_exit. Alternatively, GDB-7.0 supports catch syscall. Something like this should work (assuming Linux/x86_64; note that on ix86 the numbers will be different) regardless of how the program exits:
(gdb) catch syscall 60
Catchpoint 3 (syscall 'exit' [60])
(gdb) catch syscall 231
Catchpoint 4 (syscall 'exit_group' [231])
(gdb) c
Catchpoint 4 (call to syscall 'exit_group'), 0x00007ffff7912f3d in _exit () from /lib/libc.so.6
Update:
Your comment indicates that _exit breakpoint is set correctly, so it's likely that your process just doesn't execute _exit.
That leaves syscall(SYS_exit, ...) and one other possibility (which I missed before): all threads executing pthread_exit. You might want to set a breakpoint on pthread_exit as well (and execute info thread each time you hit it -- the last thread to do pthread_exit will cause the process to terminate).
Edit:
Also worth noting that you can use mnemonic names, rather than syscall numbers. You can also simultaneously add multiple syscalls to the catch list like so:
(gdb) catch syscall exit exit_group
Catchpoint 2 (syscalls 'exit' [1] 'exit_group' [252])
Setting the breakpoint on _exit was a good idea.
You might also try linking statically, just to take a stack of potential gdb complications off the table.
0177 is suspiciously like the wait status wait(2) returns for child stopped, but gdb is printing the exit status, which is a different thing, so that's probably a real exit argument.
It might be that you have some lazy references unresolved in some shared library loaded into process. I have exactly the same situation that "someone somewhere" exited process and that appeared to be unresolved reference.
Check your process with "ldd -r" option.
Looks like ld.so or whatever does lazy resolving of some symbols to uniform exit function (which should be abort IMHO).
My situation:
$ ldd ./program
undefined symbol: XXXX (/usr/lib/libYYY.so)
$./program
program: started!
...
<program is running regardless of undefined references>
Now exit appeared when I've invoked some scenario that used function that was undefined. It always exited with exitcode=127 and gdb reported 0177.

Resources