What are the possible threats that a waiting pthread_mutex might encounter? - multithreading

If a pthread is locking a shared resource.
Is there any threat that a waiting pthread_mutex might encounter?
Something like limitation of parallel pthreads, time limit, event, ...

As you can see in the specification, pthread_mutex_lock() has an int return value. Apart from the trivial/obvious error causes, such as "invalid argument", there is one which can actually be considered a "threat", especially to people who do not check return values.
This threat is the return value EAGAIN, which, if not handled properly, can make your program faulty: it goes on to access the resource the mutex is supposed to protect without having acquired the mutex. EAGAIN can happen, for example, if the process received a System V "signal" and the thread running this code is affected by it.
In general, using Unix System V constructs (such as signals) in parallel with Posix threads is at least dangerous. In Unix System V, threads did not exist and it was clear that the single main thread of a process was "interrupted" and used to handle the signal (using a stack-switch to the signal stack). Any kernel side blocking of the main thread got interrupted, the blocking function returns with EAGAIN and has to re-issue its call after handling the signal.
Hence, unfortunately the only fool-proof way of coding on Posix/Unix systems involves an abundance of while loops around anything which might block.
while( EAGAIN == pthread_mutex_lock(...) );
Not doing that would mean that your code can only be used in applications which clearly exert full control over signal behavior, for example by disabling all signals or by using other means to ensure that the thread executing this code will not be affected.
Apart from this, Mutexes are system resources (kernel objects) and the amount available is not infinite, yet not usually something to worry about. I hope for other answers to elaborate on such limits.
EDIT: It seems the documentation has changed in the past few years. It now states that EAGAIN is related to the limit on recursive locks and that EINTR shall not be returned. In the past, there were at least some systems and documents that conformed to my explanation above.
Also new (at least to me):
If a signal is delivered to a thread waiting for a mutex, upon return from the signal handler the thread shall resume waiting for the mutex as if it was not interrupted.
Well, maybe they learned something since I last was forced to work with such systems.

Related

How to detect if a linux thread is crashed

I have this problem: I need to find out whether a Linux thread has stopped running because of a crash rather than a normal exit. The reason is that I want to restart the thread without resetting or restarting the whole system.
pthread_join() does not seem to be a good option because I have several threads to monitor and the function returns only for one specific thread; it doesn't work "in parallel". At the moment I have a keep-alive signal from each thread to main, but I'm looking for some system call or thread attribute to understand the state.
Any suggestion?
Thread "crashes"
How to detect if a linux thread is crashed
if (0) //...
That is, the only way that a pthreads thread can terminate abnormally while other threads in the process continue to run is via thread cancellation,* which is not well described as a "crash". In particular, if a signal is received whose effect is abnormal termination then the whole process terminates, not just the thread that handled the signal. Other kinds of errors do not cause threads to terminate.
On the other hand, if by "crash" you mean normal termination in response to the thread detecting an error condition, then you have no limitation on what the thread can do prior to terminating to communicate about its state. For example,
it could update a shared object that tracks information about your threads
it could write to a pipe designated for the purpose
it could raise a signal
If you like, you can use pthread_cleanup_push() to register thread cleanup handlers to help with that.
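As a rough sketch of the cleanup-handler idea combined with the "shared object" option: the thread below registers a handler that sets a shared flag, then terminates through pthread_exit() on a pretend error path. The names worker_exited, note_exit, worker and run_worker_and_check are hypothetical, and a real program would record more than a single flag.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int worker_exited;   /* hypothetical shared status flag */

/* Cleanup handler: runs when the thread calls pthread_exit() or is cancelled. */
static void note_exit(void *arg)
{
    atomic_store((atomic_int *)arg, 1);
}

static void *worker(void *arg)
{
    pthread_cleanup_push(note_exit, arg);
    /* Pretend an error was detected: terminate the thread early.
     * pthread_exit() runs the registered cleanup handlers. */
    pthread_exit(NULL);
    /* Not reached, but push and pop must pair up in the same scope. */
    pthread_cleanup_pop(0);
    return NULL;
}

/* Starts the worker, waits for it, and returns the flag (1 if the handler ran). */
int run_worker_and_check(void)
{
    pthread_t t;

    atomic_store(&worker_exited, 0);
    if (pthread_create(&t, NULL, worker, &worker_exited) != 0)
        return -1;
    pthread_join(t, NULL);
    return atomic_load(&worker_exited);
}
```

Note that the handler does not run if the thread simply returns from its start function, so a thread that can exit either way needs to report on both paths.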
On the third hand, if you're asking about detecting live threads that are failing to make progress -- because they are deadlocked, for example -- then your best bet is probably to implement some form of heartbeat monitor. That would involve each thread you want to monitor periodically updating a shared object that tracks the time of each thread's last update. If a thread goes too long between beats then you can guess that it may be stalled. This requires you to instrument all the threads you want to monitor.
Thread cancellation
You should not use thread cancellation. But if you did, and if you include termination because of cancellation in your definition of "crash", then you still have all the options above available to you, but you must engage them by registering one or more cleanup handlers.
GNU-specific options
The main issues with using pthread_join() to check thread state are
it doesn't work for daemon threads, and
pthread_join() blocks until the specified thread terminates.
For daemon threads, you need one of the approaches already discussed, but for ordinary threads on GNU/Linux, Glibc provides non-standard pthread_tryjoin_np(), which performs a non-blocking attempt to join a thread, and also pthread_timedjoin_np(), which performs a join attempt with a timeout. If you are willing to rely on Glibc-specific functions then one of these might serve your purpose.
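For illustration, a non-blocking check with pthread_tryjoin_np() might look like the sketch below. It assumes Glibc on Linux and needs _GNU_SOURCE defined before any header is included. The helpers poll_thread, short_task and wait_by_polling are made-up names, and the 1 ms polling interval is arbitrary.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Returns 1 if the thread has terminated (and is now joined),
 * 0 if it is still running, -1 on any other error. */
int poll_thread(pthread_t t)
{
    int rc = pthread_tryjoin_np(t, NULL);

    if (rc == 0)
        return 1;
    if (rc == EBUSY)      /* thread had not yet terminated at the time of the call */
        return 0;
    return -1;
}

static void *short_task(void *arg)
{
    (void)arg;
    return NULL;
}

/* Demo: start a thread and poll it until it has finished. */
int wait_by_polling(void)
{
    pthread_t t;
    int state;
    struct timespec pause = { 0, 1000000 };   /* 1 ms between polls */

    if (pthread_create(&t, NULL, short_task, NULL) != 0)
        return -1;
    while ((state = poll_thread(t)) == 0)
        nanosleep(&pause, NULL);
    return state;
}
```

With several threads to watch, the monitor can call poll_thread() on each one in turn instead of blocking on a single pthread_join(). Once poll_thread() has returned 1, the thread ID must not be joined or polled again.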
Linux-specific options
The Linux kernel makes per-process thread status information available via the /proc filesystem. See How to check the state of Linux threads?, for example. Do be aware, however, that the details vary a bit from one kernel version to another. And if you're planning to do this a lot, then also be aware that even though /proc is a virtual filesystem (so no physical disk is involved), you still access it via slow-ish I/O interfaces.
Any of the other alternatives is probably better than reading files in /proc. I mention it only for completeness.
Overall
I'm looking for some system call or thread attribute to understand the state
The pthreads API does not provide a "have you terminated?" function or any other such state-inquiry function, unless you count pthread_join(). If you want that then you need to roll your own, which you can do by means of some of the facilities already discussed.
*Do not use thread cancellation.

How to ensure a signal handler never yields to a thread within the same process group?

This is a bit of a meta question since I think I have a solution that works for me, but it has its own downsides and upsides. I need to do a fairly common thing, catch SIGSEGV on a thread (no dedicated crash handling thread), dump some debug information and exit.
The catch here is the fact that upon crash, my application runs llvm-symbolizer which takes a while (relatively speaking) and causes a yield (either because of clone + execve or exceeding the time quanta for the thread, I've seen latter happen when doing symbolication myself in-process using libLLVM). The reason for doing all this is to get a stack trace with demangled symbols and with line/file information (stored in a separate DWP file). For obvious reasons I do not want a yield happening across my SIGSEGV handler since I intend to terminate the application (thread group) after it has executed and never return from the signal handler.
I'm not that familiar with Linux signal handling and with glibc's wrappers doing magic around them, though, I know the basic gotchas but there isn't much information on the specifics of handling signals like whether synchronous signal handlers get any kind of special priority in terms of scheduling.
Brainstorming, I had a few ideas and downsides to them:
pthread_kill(<every other thread>, SIGSTOP) - Cumbersome with more threads, interacts with signal handlers which seems like it could have unintended side effects. Also requires intercepting thread creation from other libraries to keep track of the thread list and an increasing chance of pre-emption with every system call. Possibly even change their contexts once they're stopped to point to a syscall exit stub or flat out use SIGKILL.
Global flag to serve as a cancellation point for all threads (kinda like pthread_cancel/pthread_testcancel). Safer, but it requires a lot of maintenance and across a large codebase it can be hellish, in addition to a mild performance overhead. A global flag could also cause the error to cascade: the program is already in an unpredictable state, so letting any other thread run at that point is already not great.
"Abusing" the scheduler which is my current pick, with my implementation as one of the answers. Switching to FIFO scheduling policy and raising priority therefore becoming the only runnable thread in that group.
Core dumps are not an option since the goal here was to avoid them in the first place. I would also prefer not to require a helper program aside from the symbolizer.
Environment is a typical glibc based Linux (4.4) distribution with NPTL.
I know that crash handlers are fairly common now, so I believe none of the ways I picked are that great, especially considering I've never seen the scheduler "hack" get used that way. So with that, does anyone have a better alternative that is cleaner and less risky than the scheduler "hack", and am I missing any important points in my general ideas about signals?
Edit: It seems that I haven't really considered MP in this equation (as per comments) and the fact that other threads are still runnable in an MP situation and can happily continue running alongside the FIFO thread on a different processor. I can however change the affinity of the process to only execute on the same core as the crashing thread, which basically will effectively freeze all other threads at schedule boundaries. However, that still leaves the "FIFO thread yielding due to blocking IO" scenario open.
It seems like the FIFO + SIGSTOP option is the best one, though I do wonder if there are any other tricks that can make a thread unschedulable short of using SIGSTOP. From the documentation it seems like it's not possible to set a thread's CPU affinity to zero (leaving it in a limbo state where it's technically runnable except no processors are available for it to run on).
upon crash, my application runs llvm-symbolizer
That is likely to cause deadlocks. I can't find any statement about llvm-symbolizer being async-signal safe. It's likely to call malloc, and if so will surely deadlock if the crash also happens inside malloc (e.g. due to heap corruption elsewhere).
Switching to FIFO scheduling policy and raising priority therefore becoming the only runnable thread in that group.
I believe you are mistaken: a SCHED_FIFO thread will run so long as it is runnable (i.e. does not issue any blocking system calls). If the thread does issue such a call (which it has to: to e.g. open the separate .dwp file), it will block and other threads will become runnable.
TL;DR: there is no easy way to achieve what you want, and it seems unnecessary anyway: what do you care that other threads continue running while the crashing thread finishes its business?
This is the best solution I could come up with (parts are omitted for brevity, but it shows the principle). My basic assumption is that in this situation the process runs as root. This approach can lead to resource starvation if things go really badly, and it requires privileges (if I understand the sched(7) man page correctly). I run the part of the signal handler that causes preemptions under the OSSplHigh guard and exit the scope as soon as I can. This is not strictly C++ related, since the same could be done in C or any other native language.
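// Fragment: spl_t (presumably a copy of the kernel's struct sched_attr) and
// os_assert come from the omitted parts; the functions live in namespace os.
void spl_get(spl_t& O)
{
    // Read the calling thread's (pid 0) current scheduling attributes into O.
    os_assert(syscall(__NR_sched_getattr,
                      0, &O, sizeof(spl_t), 0) == 0);
}
void spl_set(spl_t& N)
{
    // Apply the scheduling attributes in N to the calling thread.
    os_assert(syscall(__NR_sched_setattr,
                      0, &N, 0) == 0);
}
void splx(uint32_t PRI, spl_t& O) {
    spl_get(O);  // save the old level so the caller can restore it later
    spl_t PL = {0};
    PL.size = sizeof(PL);
    PL.sched_policy = SCHED_FIFO;
    PL.sched_priority = PRI;
    spl_set(PL);
}
// RAII guard: raises the priority for the lifetime of the object.
class OSSplHigh {
    os::spl_t OldPrioLevel;
public:
    OSSplHigh() {
        os::splx(2, OldPrioLevel);
    }
    ~OSSplHigh() {
        os::spl_set(OldPrioLevel);
    }
};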
The handler itself is quite trivial using sigaltstack and sigaction though I do not block SIGSEGV on any thread. Also oddly enough syscalls sched_setattr and sched_getattr or the struct definition weren't exposed through glibc contrary to the documentation.
Late Edit: The best solution involved sending SIGSTOP to all threads, with pthread_create intercepted via the linker's --wrap option to keep a ledger of all running threads. Thanks for the suggestion in the comments.

suspendThread in windows

Keeping my question short: I am writing a simulation of an RTOS. As usual, the main problem comes with simulating context switches. In the case of interrupts it is becoming really hard not to deviate from "good" coding guidelines.
Say Task A is running and the user application is calculating its harmless private stuff, which will run for a long time. During Task A, an interrupt X is supposed to occur (hint: Task A has nothing to do with triggering interrupt X). Now how do I perform a context switch from Task A to the interrupt X handler?
My current implementation is based on a context thread that waits until a context switch is requested; an interrupt controller thread that can generate interrupts if someone requests interrupt triggering; and a main thread that is running Task A. I use the interrupt controller thread to create a new thread for interrupt X and then ask the context thread to do the context switch. The context thread suspends the Task A main thread and resumes the interrupt X handler thread. At the end of the interrupt X handler thread, the Task A main thread is resumed.
[Edit] Just to clarify, I already know that suspending and terminating threads from outside is really bad. That is why I asked this question. Also, please don't recommend using events etc. to control Task A. It is user application code and I can't control it. The user can even write while(1){} if they want.
I suspect that you can't do what you want to do in that way.
You mentioned that suspending a thread from outside is really bad. The reason is that you have no idea what the thread is doing when you suspend it. It's impossible to know whether the thread currently owns a mutex; if it does then any other thread that tries to access the same mutex is going to deadlock.
You have the problem that the runtime being used by the threads that might be suspended is the same as the one being used by the supervisor. That means there are many potential such deadlocks between the supervisor and the other threads.
In a real environment (i.e. not a simulator), the operating system kernel can suspend threads because there are checks in place to ensure that these deadlocks can't happen. I don't know the details, but it probably involves masking interrupts at certain critical points, and probably not sharing the same mutexes between user-mode code and critical parts of the kernel scheduler. (In your case that would mean your scheduler could not use any of the same OS API functions, either directly or indirectly, as are allowed to be used by the user threads, in case they involve mutexes. This of course would be virtually impossible to achieve.)
The reason I asked in a comment whether you have any control over the user code compiler is that if you controlled the compiler then you could arrange for the user code to effectively mask interrupts for the duration of each instruction and only yield to another thread at well-defined points between instructions. This is how it is done in a control system that I work on.
The other aspect is platform dependence. In Linux and other unix-like operating systems, you have signals, which are like user-mode interrupts. You could potentially use signals to emulate context switching, although you would still have the same problem with mutexes. There is absolutely no equivalent on Windows (as far as I know) precisely because of the problem already stated. The nearest thing is an asynchronous procedure call, but this will run only when the thread has put itself into an alertable wait state (which means the thread is in a deterministic state and is now safe to interrupt).
I think you are going to have to re-think the whole concept so that your supervisory thread has the sort of privileged control above the user threads that the OS has in a non-emulated environment. That will probably involve replacing the compiler or the run-time libraries, or both, with something of your own making.

Mutex lock: what does "blocking" mean?

I've been reading up on multithreading and shared resources access and one of the many (for me) new concepts is the mutex lock. What I can't seem to find out is what is actually happening to the thread that finds a "critical section" is locked. It says in many places that the thread gets "blocked", but what does that mean? Is it suspended, and will it resume when the lock is lifted? Or will it try again in the next iteration of the "run loop"?
The reason I ask is that I want system-supplied events (mouse, keyboard, etc.), which (apparently) are delivered on the main thread, to be handled in a very specific part of the run loop of my secondary thread. So whatever event is delivered, I queue in my own data structure. Obviously, the data structure needs a mutex lock because it's being modified by both threads. The missing puzzle piece is: what happens when an event gets delivered in a function on the main thread, I want to queue it, but the queue is locked? Will the main thread be suspended, or will it just jump over the locked section and go out of scope (losing the event)?
Blocked means execution gets stuck there; generally, the thread is put to sleep by the system and yields the processor to another thread. When a thread is blocked trying to acquire a mutex, execution resumes when the mutex is released, though the thread might block again if another thread grabs the mutex before it can.
There is generally a try-lock operation that grabs the mutex if possible and, if not, returns an error. But you are eventually going to have to move the current event into that queue. Also, if you delay moving the events to the thread where they are handled, the application will become unresponsive regardless.
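To make the difference concrete, here is a small sketch in C with pthreads (the question does not say which language or library is in use, so that choice is an assumption). The queue type event_queue_t, its fixed capacity and both function names are hypothetical. blocking_enqueue() waits for the lock and never skips the event; try_enqueue() returns EBUSY right away if the lock is taken, and it is then up to the caller to retry later.

```c
#include <errno.h>
#include <pthread.h>

#define QUEUE_CAP 16

/* Hypothetical event queue shared by the two threads. */
typedef struct {
    pthread_mutex_t lock;
    int items[QUEUE_CAP];
    int count;
} event_queue_t;

/* Blocks until the lock is available, then appends the event.
 * Returns 0 on success or ENOSPC if the queue is full. */
int blocking_enqueue(event_queue_t *q, int ev)
{
    int rc = 0;

    pthread_mutex_lock(&q->lock);       /* the caller sleeps here if needed */
    if (q->count < QUEUE_CAP)
        q->items[q->count++] = ev;
    else
        rc = ENOSPC;
    pthread_mutex_unlock(&q->lock);
    return rc;
}

/* Never blocks: returns EBUSY immediately if another holder has the lock.
 * The event is NOT queued in that case; the caller must keep it and retry. */
int try_enqueue(event_queue_t *q, int ev)
{
    int rc = pthread_mutex_trylock(&q->lock);

    if (rc != 0)
        return rc;                      /* typically EBUSY */
    if (q->count < QUEUE_CAP)
        q->items[q->count++] = ev;
    else
        rc = ENOSPC;
    pthread_mutex_unlock(&q->lock);
    return rc;
}
```

In neither case does execution "jump over" the locked section silently: the blocking version waits, and the try version tells you it failed.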
A queue is actually one case where you can get away with not using a mutex. For example, Mac OS X (and possibly also iOS) provides the OSAtomicEnqueue() and OSAtomicDequeue() functions (see man atomic or <libkern/OSAtomic.h>) that exploit processor-specific atomic operations to avoid using a lock.
But, why not just process the events on the main thread as part of the main run loop?
The simplest way to think of it is that the blocked thread is put in a wait ("sleeping") state until the mutex is released by the thread holding it. At that point the operating system will "wake up" one of the threads waiting on the mutex and let it acquire it and continue. It's as if the OS simply puts the blocked thread on a shelf until it has the thing it needs to continue. Until the OS takes the thread off the shelf, it's not doing anything. The exact implementation -- which thread gets to go next, whether they all get woken up or they're queued -- will depend on your OS and what language/framework you are using.
Too late to answer but I may facilitate the understanding. I am talking more from implementation perspective rather than theoretical texts.
The word "blocking" is kind of technical homonym. People may use it for sleeping or mere waiting. The term has to be understood in context of usage.
Blocking means Waiting - Assume that on an SMP system a thread B wants to acquire a spinlock held by some other thread A. One mechanism is to disable preemption and keep spinning on the processor until B gets it. Another mechanism, probably a more efficient one, is to let other threads use the processor if B does not get the lock within a few attempts. In that case we schedule out thread B (as preemption is enabled) and give the processor to some other thread C. Thread B just waits in the scheduler's queue and comes back when its turn arrives. Understand that B is not sleeping; it is waiting passively instead of busy-waiting and burning processor cycles. On BSD and Solaris systems there are data structures like turnstiles to implement this situation.
Blocking means Sleeping - If thread B had instead made a system call like read(), waiting for data from a network socket, it cannot proceed until it gets the data. Therefore, some texts casually use the term blocking as in "... blocked for I/O" or "... in a blocking system call". Actually, thread B is sleeping. There are specific data structures known as sleep queues, much like luxury waiting rooms at airports :-). The thread will be woken up when the OS detects that data is available, much like an attendant of the waiting room would do.
Blocking means just that. It is blocked. It will not proceed until able. You don't say which language you're using, but most languages/libraries have lock objects where you can "attempt" to take the lock and then carry on and do something different depending on whether you succeeded or not.
But in, for example, Java synchronized blocks, your thread will stall until it is able to acquire the monitor (mutex, lock). The java.util.concurrent.locks.Lock interface describes lock objects which have more flexibility in terms of lock acquisition.

How do I determine if a detached pthread is alive?

How do I determine if a detached pthread is still alive ?
I have a communication channel with the thread (a uni-directional queue pointing outwards from the thread) but what happens if the thread dies without a gasp?
Should I resign myself to using process signals or can I probe for thread liveliness somehow?
For a joinable (i.e NOT detached) pthread you could use pthread_kill like this:
int ret = pthread_kill(YOUR_PTHREAD_ID, 0);
If you get an ESRCH value, it might be the case that your thread is dead.
However, this doesn't apply to a detached pthread, because after it has ended its thread ID can be reused for another thread.
From the comments:
The answer is wrong because if the thread is detached and is not
alive, the pthread_t is invalid. You can't pass it to pthread_kill. It
could, for example, be a pointer to a structure that was freed,
causing your program to crash. POSIX says, "A conforming
implementation is free to reuse a thread ID after its lifetime has
ended. If an application attempts to use a thread ID whose lifetime
has ended, the behavior is undefined." – Thanks #DavidSchwartz
This question assumes a design with an unavoidable race condition.
Presumably, you plan to do something like this:
(1) Check to see if thread is alive
(2) Wait for message from thread
The problem is that this sequence is not atomic and cannot be fixed. Specifically, what if the thread you are checking dies between step (1) and step (2)?
Race conditions are evil; rare race conditions doubly so. Papering over something 90% reliable with something 99.999% reliable is one of the worst decisions you can make.
The right answer to your question is "don't do that". Instead, fix your application so that threads do not die randomly.
If that is impossible, and some thread is prone to crashing, and you need to recover from that... Then your design is fundamentally flawed and you should not be using a thread. Put that unreliable thing in a different process and use a pipe to communicate with it instead. Process death closes file descriptors, and reading a pipe whose other end has been closed has well-defined, easily detected, race-free behavior.
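A minimal sketch of that process-plus-pipe arrangement, assuming a POSIX system: the child stands in for the unreliable worker, sends one byte and then dies abruptly, and the parent detects the death as end-of-file on the pipe. The function name watch_child and the one-byte "message" are invented for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks an "unreliable" child that sends one byte and then dies.
 * Returns the number of bytes received before EOF, or -1 on error. */
long watch_child(void)
{
    int fds[2];
    pid_t pid;
    char buf[16];
    ssize_t n;
    long total = 0;

    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: the worker. It never closes the pipe explicitly. */
        close(fds[0]);
        if (write(fds[1], "x", 1) != 1)
            _exit(2);
        _exit(1);                 /* simulated sudden death */
    }

    /* Parent: must close its own write end, or read() never sees EOF. */
    close(fds[1]);
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        total += n;               /* a real program would handle messages here */
    close(fds[0]);
    waitpid(pid, NULL, 0);        /* reap the child; status says how it died */

    /* n == 0 means EOF: every write end is closed, so the child is gone. */
    return n == 0 ? total : -1;
}
```

The same read() call that receives messages also reports the death, so there is no separate "is it alive?" check that could race with it.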
It is probably undefined behaviour when you send a signal to an already dead thread. Your application might crash.
see http://sourceware.org/bugzilla/show_bug.cgi?id=4509 and http://udrepper.livejournal.com/16844.html

Resources