ThreadSanitizer: signal handler spoils errno - how to avoid set of errno


I have a bit of code which handles POSIX signals, and as part of it (to be signal safe) it makes a sem_post() call, which http://man7.org/linux/man-pages/man3/sem_post.3.html documents as 'async signal safe'.
But when I run this code, very occasionally I get this ThreadSanitizer complaint:
SUMMARY: ThreadSanitizer: signal handler spoils errno /home/lewis/Sandbox/Stroika-Build-Dir-Ubuntu1804_x86_64/Library/Sources/Stroika/Foundation/Execution/SignalHandlers.cpp:497 in Stroika::Foundation::Execution::SignalHandlerRegistry::FirstPassSignalHandler_(int)
I believe this is due to a call to sem_post, which may INDEED overwrite errno.
And yes, this could indeed mess up whatever code the handler interrupted, if the signal arrived at just the right (wrong?) time.
I've always found the 'thread local' errno mechanism a convenient way to handle errors, but I'm just now realizing how dangerous it is for signal handling code.
Is there some way to call system calls WITHOUT overwriting errno? Something at least vaguely portable?
Even syscall(2) (http://man7.org/linux/man-pages/man2/syscall.2.html) says it stores its error result in errno.

On Linux you can use _syscall.
Another way would be to save errno at the beginning of the signal handler and restore it before returning.
If you are sure that your function is safe in this respect, you can also use attributes (supported by both GCC and Clang) to disable ThreadSanitizer instrumentation of your functions.
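Below is a minimal sketch of the save-and-restore approach. It assumes a hypothetical handler `first_pass_handler` that wakes a worker thread through a hypothetical semaphore `g_signal_sem`. Neither name comes from Stroika's actual code. The handler leaves errno exactly as it found it, which is what the ThreadSanitizer check looks for.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <semaphore.h>
#include <signal.h>
#include <string.h>

/* Hypothetical semaphore a worker thread sem_wait()s on.
 * Initialize it with sem_init() before installing the handler. */
static sem_t g_signal_sem;

/* Hypothetical first-pass handler: only async-signal-safe work in here. */
static void first_pass_handler(int signo)
{
    (void)signo;
    int saved_errno = errno;   /* errno value of the code we interrupted */
    sem_post(&g_signal_sem);   /* async-signal-safe, but may set errno on failure */
    errno = saved_errno;       /* give the interrupted code its errno back */
}

/* Install the handler for one signal; returns 0 on success, -1 on error. */
static int install_first_pass_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = first_pass_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}
```

The restore only hides the errno side effect from the interrupted code. It does not change what sem_post() itself does.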

Related

What happens when a panic hook panics?

What happens if the function I passed to std::panic::set_hook panics?
I can imagine many ways of reacting to this: consider this UB, abort the program like C++ does, invoke the panic handler again for the new panic, simply abort the execution of the hook... What exactly does Rust promise here?
Context: I'm writing a web app with Rust/WASM backend and I would like to make a panic hook that sends any errors to the server for debugging. This involves a network operation, which can itself fail. So I'm trying to figure out how I can ensure some reasonable behavior in this double-failure scenario.

It's not documented outside of the source code.
The source code for the panic entry point in std has this comment:
// If this is the third nested call (e.g., panics == 2, this is 0-indexed),
// the panic hook probably triggered the last panic, otherwise the
// double-panic check would have aborted the process. In this case abort the
// process real quickly as we don't want to try calling it again as it'll
// probably just panic again.
So the answer to your question is either "invoke the panic handler again for the new panic" or "abort the program" depending on how many times the hook already panicked.
This all assumes you aren't using #![no_std]. If you are then you're either disabling panicking altogether or you are implementing your own panic handler with #[panic_handler], in which case you get to decide what happens yourself.

errno in signal handler

I call write() in my SIGCHLD signal handler.
But write() may sometimes set errno. Will this break my program?
Should I save and then restore errno like the following?
void sigchld_handler(int sig)
{
    int old_errno = errno;
    write(...);
    errno = old_errno;
}

Writing one character to a pipe is a common way to communicate from a signal handler to a poll loop. Yes, you should[1] save errno and restore it in the signal handler. It is also a good idea to put the pipe (or other file descriptor) you are writing to in non-blocking mode so that the write call cannot block. Blocking could happen if the signal handler is called many times before some thread has a chance to read, filling the pipe buffer.
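Below is a sketch of that self-pipe pattern. It assumes hypothetical names `g_pipe` and `setup_sigchld_pipe()`. Both pipe ends are non-blocking, and the SIGCHLD handler saves and restores errno around write().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* g_pipe[0]: read end, polled by the main loop; g_pipe[1]: written by the handler. */
static int g_pipe[2] = { -1, -1 };

static void sigchld_handler(int signo)
{
    (void)signo;
    int saved_errno = errno;
    char byte = 'C';
    /* Non-blocking: if the pipe is already full, write() fails with EAGAIN,
     * which is harmless because a wake-up byte is already waiting. */
    ssize_t rc = write(g_pipe[1], &byte, 1);
    (void)rc;
    errno = saved_errno;
}

static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Create the pipe, make both ends non-blocking, then install the handler. */
static int setup_sigchld_pipe(void)
{
    if (pipe(g_pipe) == -1)
        return -1;
    if (set_nonblocking(g_pipe[0]) == -1 || set_nonblocking(g_pipe[1]) == -1)
        return -1;
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sigchld_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGCHLD, &sa, NULL);
}
```

The main loop would include `g_pipe[0]` in its poll() set. When it becomes readable, the loop drains it and then reaps children with waitpid() and WNOHANG.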
signalfd (Linux-specific) is another way to safely communicate a signal to a poll loop.
[1]https://www.gnu.org/software/libc/manual/html_node/POSIX-Safety-Concepts.html
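For comparison, here is a Linux-only sketch of the signalfd(2) alternative, using hypothetical helper names `open_sigchld_fd()` and `read_one_signal()`. SIGCHLD is blocked, so it is never delivered to a handler at all. The poll loop reads `struct signalfd_siginfo` records instead. Because no handler runs, errno is never at risk.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Block SIGCHLD and return a non-blocking signalfd for it, or -1 on error.
 * Call this before creating other threads so they inherit the blocked mask;
 * otherwise a thread with SIGCHLD unblocked could still receive it normally. */
static int open_sigchld_fd(void)
{
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
        return -1;
    return signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
}

/* Read one pending signal; returns its number, or -1 if none is pending. */
static int read_one_signal(int sfd)
{
    struct signalfd_siginfo info;
    ssize_t n = read(sfd, &info, sizeof info);
    if (n != (ssize_t)sizeof info)
        return -1;
    return (int)info.ssi_signo;
}
```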

Interrupting open() with SIGALRM

We have a legacy embedded system which uses SDL to read images and fonts from an NFS share.
If there's a network problem, TTF_OpenFont() and IMG_Load() hang essentially forever. A test application reveals that open() behaves in the same way.
It occurred to us that a quick fix would be to call alarm() before the calls which open files on the NFS share. The man pages weren't entirely clear whether open() would fail with EINTR when interrupted by SIGALRM, so we put together a test app to verify this approach. We set up a signal handler with sigaction::sa_flags set to zero to ensure that SA_RESTART was not set.
The signal handler was called, but open() was not interrupted. (We observed the same behaviour with SIGINT and SIGTERM.)
I suppose the system treats open() as a "fast" operation even on "slow" infrastructure such as NFS.
Is there any way to change this behaviour and allow open() to be interrupted by a signal?

> The man pages weren't entirely clear whether open() would fail with EINTR when interrupted by SIGALRM, so we put together a test app to verify this approach.
open(2) is a slow syscall only for some file types. Slow syscalls are those that can sleep forever and can be awakened when, and if, a signal is caught in the meantime. In general, opens that block the caller until some condition occurs are usually interruptible. Known examples include opening a FIFO (named pipe), or (back in the old days) opening a physical terminal device (it sleeps until the modem is dialed).
NFS-mounted filesystems probably don't cause open(2) to sleep in an interruptible state. After all, you are most likely opening a regular file, and in that case open(2) will not be interruptible.
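To see the interruptible case, here is a sketch that opens a FIFO with no writer. It assumes Linux and a writable /tmp, and the FIFO path and function names are made up. The open(2) call sleeps interruptibly, so a SIGALRM handler installed without SA_RESTART makes it fail with EINTR. Per the answer above, the same code pointed at a regular file on a hung NFS mount would likely not be interrupted.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void on_alarm(int signo)
{
    (void)signo;   /* nothing to do: the signal only has to interrupt open() */
}

/* Open a FIFO that has no writer, with an alarm after `seconds`.
 * Returns the errno open() failed with (EINTR expected), or 0 if it succeeded. */
static int open_fifo_with_alarm(unsigned seconds)
{
    char path[64];
    snprintf(path, sizeof path, "/tmp/fifo-demo-%ld", (long)getpid());
    if (mkfifo(path, 0600) == -1)
        return errno;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                /* no SA_RESTART, so open() is not restarted */
    sigaction(SIGALRM, &sa, NULL);

    alarm(seconds);
    int fd = open(path, O_RDONLY);  /* sleeps until a writer appears or SIGALRM */
    int err = (fd == -1) ? errno : 0;
    alarm(0);

    if (fd != -1)
        close(fd);
    unlink(path);
    return err;
}
```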
> Is there any way to change this behaviour and allow open() to be interrupted by a signal?
I don't think so, not without doing some (non-trivial) changes to the kernel.
I would explore the possibility of using setjmp(3) / longjmp(3) (see the manpage if you're not familiar; it's basically non-local gotos). You can initialize the environment buffer before calling open(2), and issue a longjmp(3) in the signal handler. Here's an example:
#include <stdio.h>
#include <stdlib.h>
#include <setjmp.h>
#include <unistd.h>
#include <signal.h>

/* sigsetjmp()/siglongjmp() are the variants of setjmp(3)/longjmp(3) that also
 * save and restore the signal mask. On glibc, plain setjmp() does not save it,
 * so after a longjmp() out of the handler SIGALRM could stay blocked. */
static sigjmp_buf jmp_env;

void sighandler(int signo) {
    (void)signo;
    siglongjmp(jmp_env, 1);
}

int main(void) {
    struct sigaction sigact;
    sigact.sa_handler = sighandler;
    sigact.sa_flags = 0;
    sigemptyset(&sigact.sa_mask);
    if (sigaction(SIGALRM, &sigact, NULL) < 0) {
        perror("sigaction(2) error");
        exit(EXIT_FAILURE);
    }
    if (sigsetjmp(jmp_env, 1) == 0) {
        /* First time through
         * This is where we would open the file
         */
        alarm(5);
        /* Simulate a blocked open() */
        while (1)
            ; /* Intentionally left blank */
        /* If open(2) is successful here, don't forget to unset
         * the alarm
         */
        alarm(0);
    } else {
        /* SIGALRM caught, open(2) canceled */
        printf("open(2) timed out\n");
    }
    return 0;
}
It works by saving the context environment with the help of setjmp(3) before calling open(2). setjmp(3) returns 0 the first time through, and returns whatever value was passed to longjmp(3) otherwise.
Please be aware that this solution is not perfect. Here are some points to keep in mind:
There is a window of time between the call to alarm(2) and the call to open(2) (simulated here with while (1) { ... }) where the process may be preempted for a long time, so there is a chance the alarm expires before we actually attempt to open the file. Sure, with a large timeout such as 2 or 3 seconds this will most likely not happen, but it's still a race condition.
Similarly, there is a window of time between successfully opening the file and canceling the alarm where, again, the process may be preempted for a long time and the alarm may expire before we get the chance to cancel it. This is slightly worse because we have already opened the file so we will "leak" the file descriptor. Again, in practice, with a large timeout this will likely never happen, but it's a race condition nevertheless.
If the code catches other signals, there may be another signal handler in the midst of execution when SIGALRM is caught. Using longjmp(3) inside the signal handler will destroy the execution context of these other signal handlers, and depending on what they were doing, very nasty things may happen (inconsistent state if the signal handlers were manipulating other data structures in the program, etc.). It's as if the interrupted handler started executing and suddenly crashed somewhere in the middle. You can fix it by: a) carefully setting up all signal handlers such that SIGALRM is blocked before they are invoked (this ensures that the SIGALRM handler does not begin execution until other handlers are done) and b) blocking these other signals while the SIGALRM handler runs. Both actions can be accomplished by setting the sa_mask field of struct sigaction with the necessary mask. The operating system atomically adds that mask, plus the signal being delivered, to the signal mask before the handler starts and restores the previous mask when the handler returns. OTOH, if the rest of the code doesn't catch signals, then this is not a problem.
sleep(3) may be implemented with alarm(2), and alarm(2) and setitimer(2) share the same timer; if other portions of the code make use of any of these functions, they will interfere and the result will be a huge mess.
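The sa_mask fix from the third point could look like the following sketch. It assumes the program's only other handled signal is SIGINT. The names `on_sigint`, `on_sigalrm` and `install_with_mask()` are hypothetical stand-ins for the real handlers. Each handler's sa_mask blocks the other signal while it runs.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t g_sigint_seen = 0;

static void on_sigint(int signo)  { (void)signo; g_sigint_seen = 1; }
static void on_sigalrm(int signo) { (void)signo; /* e.g. the siglongjmp() handler */ }

/* Install `handler` for `signo`, keeping `also_block` blocked while it runs. */
static int install_with_mask(int signo, void (*handler)(int), int also_block)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaddset(&sa.sa_mask, also_block);  /* added to the mask for the handler's duration */
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}

static int install_handlers(void)
{
    /* a) SIGALRM cannot start while the SIGINT handler is running... */
    if (install_with_mask(SIGINT, on_sigint, SIGALRM) == -1)
        return -1;
    /* b) ...and SIGINT cannot start while the SIGALRM handler is running. */
    return install_with_mask(SIGALRM, on_sigalrm, SIGINT);
}
```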
Just make sure you weigh these disadvantages before blindly using this approach. The use of setjmp(3) / longjmp(3) is usually discouraged and makes programs considerably harder to read, understand and maintain. It's not elegant, but right now I don't think you have a choice, unless you're willing to do some core refactoring in the project.
If you do end up using setjmp(3), then at the very least document these limitations.

Maybe there is a strategy of using a separate thread to do the open so the main thread is not held up longer than desired.
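One way to sketch that idea uses POSIX threads and a hypothetical helper `open_with_timeout()`. The helper starts a detached worker that calls open(2), then waits on a condition variable with pthread_cond_timedwait(). On timeout the caller gives up. Whenever open() finally returns, the worker closes the descriptor itself, so it does not leak. Note that the stuck thread lingers until the kernel gives up, so this bounds the caller's wait, not the resource usage.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

struct open_job {              /* shared by the caller and the worker thread */
    pthread_mutex_t lock;
    pthread_cond_t done_cv;
    char *path;                /* private copy: the caller may return first */
    int fd, err, done, abandoned;
};

static void free_job(struct open_job *job)
{
    pthread_mutex_destroy(&job->lock);
    pthread_cond_destroy(&job->done_cv);
    free(job->path);
    free(job);
}

static void *open_worker(void *arg)
{
    struct open_job *job = arg;
    int fd = open(job->path, O_RDONLY);   /* may hang, e.g. on an unreachable NFS server */
    int err = errno;

    pthread_mutex_lock(&job->lock);
    int abandoned = job->abandoned;
    job->fd = fd;
    job->err = err;
    job->done = 1;
    pthread_cond_signal(&job->done_cv);
    pthread_mutex_unlock(&job->lock);

    if (abandoned) {                      /* caller timed out: clean up here */
        if (fd != -1)
            close(fd);
        free_job(job);
    }
    return NULL;
}

/* Returns a descriptor, or -1 with errno set (ETIMEDOUT if open() took too long). */
static int open_with_timeout(const char *path, int timeout_sec)
{
    struct open_job *job = calloc(1, sizeof *job);
    if (job == NULL)
        return -1;
    if ((job->path = strdup(path)) == NULL) {
        free(job);
        return -1;
    }
    pthread_mutex_init(&job->lock, NULL);
    pthread_cond_init(&job->done_cv, NULL);   /* default clock: CLOCK_REALTIME */

    pthread_t tid;
    int rc = pthread_create(&tid, NULL, open_worker, job);
    if (rc != 0) {
        free_job(job);
        errno = rc;
        return -1;
    }
    pthread_detach(tid);

    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(&job->lock);
    while (!job->done &&
           pthread_cond_timedwait(&job->done_cv, &job->lock, &deadline) != ETIMEDOUT)
        ;                                     /* re-check after spurious wake-ups */
    if (!job->done) {
        job->abandoned = 1;                   /* the worker now owns `job` */
        pthread_mutex_unlock(&job->lock);
        errno = ETIMEDOUT;
        return -1;
    }
    int fd = job->fd, err = job->err;
    pthread_mutex_unlock(&job->lock);
    free_job(job);
    if (fd == -1)
        errno = err;
    return fd;
}
```

Usage is `int fd = open_with_timeout(path, 3);`, where `fd == -1 && errno == ETIMEDOUT` means the open is still hanging in the background.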

Is there an async-signal-safe way of reading a directory listing on Linux?

SUSv4 does not list opendir, readdir, closedir, etc. in its list of async-signal-safe functions.
Is there a safe way to read a directory listing from a signal handler?
For example, is it possible to 'open' the directory and somehow slurp out the raw directory listing? If so, what kind of data structure is returned by 'read'?
Or maybe on Linux there are certain system calls that are async-signal-safe even though SUSv4 / POSIX does not require it that could be used?

If you know in advance which directory you need to read, you could call opendir() outside the signal handler (opendir() calls malloc(), so you can't run it from within the handler) and keep the DIR* in a static variable somewhere. When your signal handler runs, you should be able to get away with calling readdir_r() on that handle as long as you can guarantee that only that one signal handler would use the DIR* handle at any moment. There is a lock field in the DIR that is taken by readdir() and readdir_r(), so if, say, you used the DIR* from two signal handlers, or you registered the same handler to handle multiple signals, you may end up with a deadlock due to the lock never being released by the interrupted handler.
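Below is a sketch of that approach, assuming glibc on Linux and hypothetical names `g_dir`, `prepare_dir()` and `g_entries_seen`. It relies on implementation details rather than POSIX guarantees, since readdir() is not on the async-signal-safe list. It calls readdir(), which the answer says takes the same lock, instead of readdir_r(), which is deprecated since glibc 2.24. The handler only counts entries.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <signal.h>
#include <string.h>

static DIR *g_dir = NULL;                      /* opened outside the handler */
static volatile sig_atomic_t g_entries_seen = 0;

/* Only this one handler, for one signal, may touch g_dir (see the lock caveat above). */
static void list_dir_handler(int signo)
{
    (void)signo;
    struct dirent *ent;
    while ((ent = readdir(g_dir)) != NULL) {
        if (ent->d_name[0] != '.')            /* skip ".", ".." and dotfiles */
            g_entries_seen++;
    }
}

/* Call from normal (non-handler) context: opendir() may call malloc(). */
static int prepare_dir(const char *path, int signo)
{
    g_dir = opendir(path);
    if (g_dir == NULL)
        return -1;
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = list_dir_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}
```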
A similar approach appears to also work to read a directory from a child process after calling fork() but before calling execve().

Using sigprocmask to implement locks

I'm implementing user threads in Linux kernel 2.4, and I'm using ualarm to invoke context switches between the threads.
We have a requirement that our thread library's functions should be uninterruptible by the context switching mechanism for threads, so I looked into blocking signals and learned that using sigprocmask is the standard way to do this.
However, it looks like I need to do quite a lot to implement this:
sigset_t new_set, old_set;
sigemptyset(&new_set);
sigaddset(&new_set, SIGALRM);
sigprocmask(SIG_BLOCK, &new_set, &old_set);
This blocks SIGALRM but it does this with 3 function invocations! A lot can happen in the time it takes for these functions to run, including the signal being sent.
The best idea I had to mitigate this was temporarily disabling ualarm, like this:
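sigset_t new_set, old_set;
useconds_t remaining = ualarm(0, 0); /* stop the timer, remember what was left */
sigemptyset(&new_set);
sigaddset(&new_set, SIGALRM);
sigprocmask(SIG_BLOCK, &new_set, &old_set);
ualarm(remaining, 0); /* restart it with the remaining time */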
Which is fine except that this feels verbose. Isn't there a better way to do this?

As WhirlWind points out, the signal set functions are quite lightweight and may even be implemented as macros; and you can also just keep around a signal set that contains only SIGALRM and re-use that.
Regardless, it doesn't actually matter if the signal happens during the sigaddset() or sigemptyset() calls: the new_set and old_set variables are (presumably) thread-local, and the critical section isn't entered until after sigprocmask() returns.

You'll find that sigemptyset() and sigaddset() in signal.h may be just macros or inline functions, so they execute inline in your code. Just use a stack variable when you call them.
However, why don't you do this in a single-threaded startup section of your code? I also doubt the function call to sigprocmask will be atomic. Blocking signals does not mean your code will be uninterruptible.
By the way, I'm not sure how you're using ualarm, but if you're not catching or ignoring SIGALRM when you call it the first time, you'll probably kill your process.

sigprocmask() is the only function that goes to kernel level and actually changes the signal masking status. The other functions are just manipulation functions for setting up the mask before calling sigprocmask or passing the set to another signal related function.

Develop Reference
