How to detect if a linux thread is crashed - linux

I need to find out whether a Linux thread has stopped running because of a crash rather than a normal exit. The goal is to restart that thread without resetting or restarting the whole system.
pthread_join() does not seem like a good option because I have several threads to monitor and the function waits on one specific thread, so it does not work "in parallel". At the moment I have a keep-alive signal from each thread to main, but I'm looking for a system call or thread attribute that reports the thread's state.
Any suggestions?

Thread "crashes"
How to detect if a linux thread is crashed
if (0) //...
That is, the only way that a pthreads thread can terminate abnormally while other threads in the process continue to run is via thread cancellation,* which is not well described as a "crash". In particular, if a signal is received whose effect is abnormal termination then the whole process terminates, not just the thread that handled the signal. Other kinds of errors do not cause threads to terminate.
On the other hand, if by "crash" you mean normal termination in response to the thread detecting an error condition, then you have no limitation on what the thread can do prior to terminating to communicate about its state. For example,
it could update a shared object that tracks information about your threads
it could write to a pipe designated for the purpose
it could raise a signal
If you like, you can use pthread_cleanup_push() to register thread cleanup handlers to help with that.
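As a rough illustration of the cleanup-handler idea, the sketch below has a worker record its final state in a shared object. The names worker_status, worker_args, mark_failed and run_worker are hypothetical and exist only for this example; the simulated error is an assumption standing in for whatever condition your thread detects.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-thread status record; names are illustrative only. */
enum worker_state { WORKER_RUNNING, WORKER_EXITED_OK, WORKER_EXITED_ERROR };

struct worker_status {
    _Atomic int state;              /* one of enum worker_state */
};

struct worker_args {
    struct worker_status *status;
    int simulate_error;             /* assumption: stands in for a real failure */
};

/* Cleanup handler: runs if the thread is cancelled or calls pthread_exit()
 * while the handler is still registered. */
static void mark_failed(void *arg)
{
    struct worker_status *st = arg;
    atomic_store(&st->state, WORKER_EXITED_ERROR);
}

static void *worker(void *arg)
{
    struct worker_args *wa = arg;
    assert(wa != NULL && wa->status != NULL);
    atomic_store(&wa->status->state, WORKER_RUNNING);

    pthread_cleanup_push(mark_failed, wa->status);
    if (wa->simulate_error)
        pthread_exit(NULL);         /* handler runs: state becomes ERROR */
    pthread_cleanup_pop(0);         /* normal path: unregister without running */

    atomic_store(&wa->status->state, WORKER_EXITED_OK);
    return NULL;
}

/* Runs one worker to completion and returns the state it reported. */
static int run_worker(int simulate_error)
{
    struct worker_status st;
    struct worker_args wa = { &st, simulate_error };
    pthread_t tid;

    atomic_init(&st.state, WORKER_RUNNING);
    if (pthread_create(&tid, NULL, worker, &wa) != 0)
        return -1;
    pthread_join(tid, NULL);
    return atomic_load(&st.state);
}
```

In a real monitor the main thread would read the status object periodically instead of joining; the join here only keeps the example short.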
On the third hand, if you're asking about detecting live threads that are failing to make progress -- because they are deadlocked, for example -- then your best bet is probably to implement some form of heartbeat monitor. That would involve each thread you want to monitor periodically updating a shared object that tracks the time of each thread's last update. If a thread goes too long between beats then you can guess that it may be stalled. This requires you to instrument all the threads you want to monitor.
Thread cancellation
You should not use thread cancellation. But if you did, and if you include termination because of cancellation in your definition of "crash", then you still have all the options above available to you, but you must engage them by registering one or more cleanup handlers.
GNU-specific options
The main issues with using pthread_join() to check thread state are
it doesn't work for detached threads, and
pthread_join() blocks until the specified thread terminates.
For detached threads, you need one of the approaches already discussed, but for ordinary (joinable) threads on GNU/Linux, Glibc provides the non-standard pthread_tryjoin_np(), which performs a non-blocking attempt to join a thread, and also pthread_timedjoin_np(), which performs a join attempt with a timeout. If you are willing to rely on Glibc-specific functions then one of these might serve your purpose.
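As a sketch of the non-blocking variant, assuming Glibc, a monitor could poll each joinable thread like this. poll_thread, worker and the gate mutex are hypothetical names; the gate only exists so the example can hold the worker alive for a moment.

```c
#define _GNU_SOURCE          /* needed for pthread_tryjoin_np() */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Demo worker: blocks until the caller releases `gate`, then returns. */
static pthread_mutex_t gate = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&gate);
    pthread_mutex_unlock(&gate);
    return NULL;
}

/* Returns 1 if the thread has terminated (it is joined as a side effect),
 * 0 if it is still running, -1 on any other error. */
static int poll_thread(pthread_t tid)
{
    int rc = pthread_tryjoin_np(tid, NULL);
    if (rc == 0)
        return 1;
    if (rc == EBUSY)
        return 0;
    assert(rc > 0);          /* pthreads functions return positive error numbers */
    return -1;
}
```

Because a successful call joins the thread, poll_thread() must not be called again for the same thread once it has returned 1. Note also that this tells you the thread ended, not why it ended.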
Linux-specific options
The Linux kernel makes per-process thread status information available via the /proc filesystem. See How to check the state of Linux threads?, for example. Do be aware, however, that the details vary a bit from one kernel version to another. And if you're planning to do this a lot, then also be aware that even though /proc is a virtual filesystem (so no physical disk is involved), you still access it via slow-ish I/O interfaces.
Any of the other alternatives is probably better than reading files in /proc. I mention it only for completeness.
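For completeness too, here is a minimal sketch of a /proc-based check. It assumes that each thread has recorded its own kernel thread ID somewhere (for example via syscall(SYS_gettid)), since a pthread_t is not a kernel thread ID. thread_is_listed is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if kernel thread ID `tid` is currently listed as a thread of
 * the calling process under /proc/self/task, 0 otherwise. */
static int thread_is_listed(long tid)
{
    char path[64];

    assert(tid >= 0);
    snprintf(path, sizeof path, "/proc/self/task/%ld", tid);
    return access(path, F_OK) == 0;
}
```

Thread IDs can be reused by the kernel, so a positive answer long after the ID was recorded does not prove that it is still the same thread.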
Overall
I'm looking for some system call or thread attribute to understand the state
The pthreads API does not provide a "have you terminated?" function or any other such state-inquiry function, unless you count pthread_join(). If you want that then you need to roll your own, which you can do by means of some of the facilities already discussed.
*Do not use thread cancellation.

Related

Multithreading Models - One to Many model

I've been reading the dinosaur book and have been confused by this particular model.
The book says that for the one to many model "Thread management is done by the thread library in user space, so it is efficient; but the entire process will block if a thread makes a blocking system call. Also, because only one thread can access the kernel at a time, multiple threads are unable to run in parallel on multiprocessors"
What I'm confused about is what is meant by "the entire process will block if a thread makes a blocking system call". Does this mean that if I have a multi-threaded program and one of its threads blocks, then all of its threads will have to wait, effectively stalling the program?
If a running program blocks under this model, does it mean that another, separate program can't be swapped in to be executed because the kernel thread is blocking? If the answer is YES, another program (process) can be swapped in, then why couldn't a multi-threaded program simply execute another one of its threads while the blocking thread is forced to wait?
If you manage your threads at user level, it means that the switching is done by your application, not by the OS scheduler. Each thread must reach some point where it surrenders (or loses) control to the management mechanism, but that mechanism is also user-level, so if one of the threads is in the middle of a system call, your thread management system (and through that all the other threads) must wait until the kernel code is done.
The OS is still active all the time, and may still preempt the entire program, so other processes will not starve, only the internal "threads" you manage yourself. These threads can't get started during that block because the mechanism responsible for starting them is also blocked in the kernel.

C# When thread switching will most probably occur?

I was wondering when .Net would most probably switch from a thread to another?
I understand we can't predict exactly when this will happen, but is there any intelligence in this? For example, when a thread is executing, will the runtime try to wait for a method to return or a loop to finish before switching?
I'm not an expert on .NET, but in general scheduling is handled by the kernel. A switch typically happens for one of these reasons:
Your thread's timeslice has expired (threads/processes only get a certain amount of CPU time).
Your thread has blocked for I/O.
Some other, more obscure reason, such as waiting for an IPC message or a network packet.
Threads can be preempted at any point along their execution path, be it in a loop or returning from a function. This in general isn't handled by the underlying VM (.NET or JVM) but is controlled by the OS.
Of course there is 'intelligence', of a sort:). The set of running threads can only change upon an interrupt, either:
An actual hardware interrupt from a peripheral device, eg. disk, NIC, KB, mouse, timer.
A software interrupt, (ie. a system call), that can change the state of thread/s. This encompasses sleep calls and calls to wait/signal on inter-thread synchro objects, as well as I/O calls that request data that is not immediately available.
If there is no interrupt, the OS cannot change the set of running threads because it is not entered. The OS does not know or care about loops, function/methods calls, (except those that make system calls as above), gotos or any other user-level flow-control mechanisms.
I only read your question now, so it may not be relevant anymore, but after reading the above answers, I just want to make sure:
Threads are managed (as far as I know) by the process they belong to. The operating system is not involved (and that is the main reason why working with multiple threads is faster than working with multiple processes: threads share data, and switching between them is faster than the context switch between processes performed by the short-term scheduler).
(NOTE: There are two types of threads, user-mode threads and kernel-mode threads, and each OS can have both of them or just one of them. Anyway, a thread that works in a user application environment is considered a user-mode thread and is managed by the process it belongs to.)
Am I right?
Thanks!!!

How to exit a program when using blocking calls

I need to do a project where the application monitors incoming connections and applies rules defined in an XML document. The rules either filter (block or permit) connections or redirect traffic on a certain port. To do this, I use functions such as accept and recv (from Winsock). All of those functions are used on different threads. I'm wondering, though, how I am supposed to clean up the program before exiting, since all those blocking calls are in progress. Normally I'd either wait until the person closes the console through the X button or wait for the user to input a certain character in the main thread. The thing is, I'm not sure what happens if the application exits while there are still active threads, memory is still allocated, or sockets are in use. Are all destructors called? Are handles and sockets correctly closed? Or do I need to somehow do it myself?
Thanks
In general, I would say no. Do not try to explicitly clean up resources like sockets, fd's, handles, threads unless you are absolutely forced to.
Exact behaviour depends on OS and how you terminate your app.
All the common desktop OS will release resources allocated to a process by the OS when a process terminates. This includes sockets, file descriptors, memory.
On Windows/Linux, if you return from your C/C++ main() without any explicit cleanup, static dtors will get called by the crt code. Dtors for dynamically allocated objects in non-main threads are not run.
Executables written in other languages may behave differently.
If, instead of returning from main(), you call a 'ProcessExit()' API directly, static destructors will not get called because the OS has no concept of dtors - it has no idea, or interest, in what language was used to generate the executable.
In either case, the OS will be called to terminate your process. The OS does this, (simple 'Dummies' version:), by first changing the state of all process threads that are not running so that they never run again. Threads that are running on other cores are then stopped. Then OS resources like fd, sockets are closed, then released, then all process memory is freed, then OS kernel process/thread objects freed, then your process no longer exists.
If you absolutely need some, or all, C++/whatever dtors called when some thread needs to stop the app, you will have to explicitly signal other threads to stop so that dtors can be run. I tend to use a globally-accessible 'CloseRequested' bool that relevant blocking calls check immediately after returning. There remains the issue of persuading the blocking calls to return.
Some blocking calls can be coded up to wait on more than one signal, so allowing the call to return by a simple event/sema/condvar/whatever signal.
Some calls, like recv(), accept(), can be persuaded to return early by closing the fd/socket they are waiting on.
Some calls can be made to return by 'artificially' satisfying their wait condition - eg. creating a temp file just to make a folder-monitor call return so that the 'CloseRequested' bool can be checked.
If a blocking call is so annoyingly stubborn that it cannot be persuaded to return, you could redesign your app so that whatever the critical resource is that is released in the dtors can be released by another thread - maybe create the thing in another thread and pass it to the thread that blocks in a ctor parameter, something like that.
NOTE WELL: Thread shutdown code bodges, as listed above, are extra code that does not add to the normal functionality of your app. You should restrict explicit thread shutdown to those threads that hold resources that absolutely must be released by explicit user code - DB connections, say. If the OS can release the resource, it should be allowed to do so. The OS is very good at stopping all process threads before releasing resources they are using, user code is not.
Where possible, use blocking calls that take a timeout value, and have your threads loop. That gives you a place to check for a shutdown condition and exit the thread gracefully. Handles will generally be cleaned up by the system when the process exits. It is polite to shut down sockets gracefully, but not absolutely mandatory. The downside of not doing so is it can take a while for the kernel to clean up exclusive resources. For example, if you just kill a thread waiting to accept(), and then your app re-launches, it won't be able to successfully accept() on the same port until the kernel cleans up the old socket.
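The timeout-and-loop pattern can be sketched as follows. This example uses POSIX poll() rather than Winsock, so treat it as an illustration of the idea only; on Windows the analogous calls would be select() or WSAPoll(). The names close_requested and wait_readable_or_shutdown, and the 100 ms slice, are assumptions chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Set to true by whichever thread wants the application to shut down. */
static atomic_bool close_requested;

/* Waits for data on `fd` in 100 ms slices so the shutdown flag is checked
 * regularly. Returns 1 when `fd` is readable, 0 if shutdown was requested
 * first, -1 if poll() failed. */
static int wait_readable_or_shutdown(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    assert(fd >= 0);
    while (!atomic_load(&close_requested)) {
        int n = poll(&pfd, 1, 100);
        if (n > 0)
            return 1;               /* data (or EOF/error) is ready to read */
        if (n < 0 && errno != EINTR)
            return -1;              /* real failure */
        /* timeout or interrupted: loop and re-check the flag */
    }
    return 0;
}
```

A worker thread would call this before each recv() or accept() and exit its loop when it returns 0, which gives destructors and other cleanup code a chance to run.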

what is kernel thread dispatching?

Can someone give me an easy to understand definition of kernel thread dispatching or just thread dispatching if there's no difference between the two?
From what I understand it's just doing a context switch while the currently active thread waits on a lock from another thread, so the CPU goes and does something else while this thread is in blocking mode.
I might however have misunderstood.
It's basically the process by which the operating system determines which of the many active threads is sent (dispatched) to the CPU for processing at any given point.
Each operating system has its own implementation, but the basic concept is to keep a sorted list of threads by priority, and dispatch them as needed to the CPU. Time slicing is added to allow multiple programs to run concurrently, etc.

can you use multiple threads to ptrace an application?

I am writing a GUI oriented debugger which targets Linux primarily, but I plan ports to other OSes in the future. Because the GUI must stay interactive at all times, I have a few threads handling different things.
Primarily I have a "debug event" thread which simply loops waiting for waitpid to return and delivers the received events to the other threads. I do this because waitpid does not have a timeout, which makes it very hard to integrate it with other event loops and keep things responsive (waitpid can hang indefinitely!).
This strategy has worked wonderfully for the Linux builds so far. Lately I've been trying to make my debugger thread aware (as in the threads in the debugged application, not the debugger itself).
So I set the ptrace options to follow clone events and look for a status which has the upper 16 bits set to PTRACE_EVENT_CLONE. Then I use PTRACE_GETEVENTMSG to get the TID of the new thread. This all works nicely in my small test harness applications. But for some reason, it fails when I put that code in my actual debugger. (I get a "No such process" error code.)
The one thing that occurred to me is that Windows has a rule that only the thread which attached to an application can listen for debug events. Does Linux's ptrace have a similar limitation? If so, why does my code work for other debug events?
EDIT:
It seems that at the very least waitpid supports waiting from a different thread, the man page says:
Before Linux 2.4, a thread was just a special case of a process, and as a consequence one thread could not wait on the children of another thread, even when the latter belongs to the same thread group. However, POSIX prescribes such functionality, and since Linux 2.4 a thread can, and by default will, wait on children of other threads in the same thread group.
So at most this is a ptrace limitation.
I had the same issue (plus many others!) while implementing the Linux-specific part of the Maxine VM debugger. You are correct in your guess that only one thread in the debugger can use ptrace to control the debuggee. We accomplish this by making all calls to ptrace on a dedicated thread. You may find it useful to look at the LinuxTask.java, linuxTask.h and linuxTask.c files in the Maxine sources available at kenai.com/projects/maxine/sources/maxine/show
As far as I can tell, this is not allowed. A task cannot use ptrace on a task which it has not attached. Also, a task can be traced by at most one other task, so you can't simply attach it once in each thread. I think this is because when one task attaches to another task, the tracing task becomes the parent of the traced task, and each task can only have one parent.
It seems like multi-thread tracing ought to be allowed because the threads are part of the same process, but implementation-wise, there isn't actually much distinction between threads and processes in the Linux kernel. A thread is just a task that happens to share most of its resources with another task.
If you're interested, you can browse the source code for ptrace in the kernel. Specifically look at ptrace_check_attach, which is called by sys_ptrace for most requests. It returns -ESRCH (sounds like the error code you're getting) if the target task's parent is not the current task.

Resources