Deadlock while a multi-threaded process exits from a signal handler

There are two threads in a process. When the main thread receives SIGSEGV, the signal handler sends an internal signal to the other (auxiliary) thread using pthread_kill. The handler for that internal signal traps the auxiliary thread in a sleep loop, so that I can do the mandatory cleanup and dump a stack trace to a file from the main thread, treating the process as effectively single-threaded (the auxiliary thread is asleep).
However, I have now hit a case where the process does not exit while the main thread is exiting; it appears to be
deadlocked between the two threads.
Please help me understand why, and which part of the code is causing the deadlock.
Thanks in advance!!
Auxiliary Thread stack:
Thread 2 (Thread 0x7fc565b5b700 (LWP 13831)):
#0 0x00007fc5668e81fd in nanosleep () from /lib64/libc.so.6
#1 0x00007fc566915214 in usleep () from /lib64/libc.so.6
#2 0x00000000009699a2 in SignalHandFun() at ...........
#3 <signal handler called>
#4 0x00007fc56691820a in mmap64 () from /lib64/libc.so.6
#5 0x00007fc5668a5bfc in _IO_file_doallocate_internal () from /lib64/libc.so.6
#6 0x00007fc5668b386c in _IO_doallocbuf_internal () from /lib64/libc.so.6
#7 0x00007fc5668b215b in _IO_new_file_underflow () from /lib64/libc.so.6
#8 0x00007fc5668b38ae in _IO_default_uflow_internal () from /lib64/libc.so.6
#9 0x00007fc566894bad in _IO_vfscanf_internal () from /lib64/libc.so.6
#10 0x00007fc5668a2cd8 in fscanf () from /lib64/libc.so.6
.....
......
.....
#15 0x00007fc567259806 in start_thread () from /lib64/libpthread.so.0
#16 0x00007fc56691b64d in clone () from /lib64/libc.so.6
#17 0x0000000000000000 in ?? ()
Main Thread stack:
Thread 1 (Thread 0x7fc5679c0720 (LWP 13795)):
#0 0x00007fc56692878e in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007fc5668b504b in _L_lock_1309 () from /lib64/libc.so.6
#2 0x00007fc5668b3d9a in _IO_flush_all_lockp () from /lib64/libc.so.6
#3 0x00007fc5668b4181 in _IO_cleanup () from /lib64/libc.so.6
#4 0x00007fc566872630 in __run_exit_handlers () from /lib64/libc.so.6
#5 0x00007fc5668726b5 in exit () from /lib64/libc.so.6
#6 0x00000000009698e3 in SignalHandFun() at ....
#7 <signal handler called>
#8 0x000000b1000000b0 in ?? ()
#9 0x0000000000000000 in ?? ()

I assume that you send a signal to another thread because you want to do some work that cannot be done with async-signal-safe functions.
The problem is that if your signal handler is called on a thread that holds any locks (in your case, the internal libio list lock), then any thread that attempts to acquire the same lock will block indefinitely. You cannot return from your SIGSEGV handler, so the lock will never become available again, and no thread waiting on it will make progress. In your case, the exit function needs to acquire the libio list lock because it has to walk the list of all open file streams and flush them, while a thread opening a new file acquires the same lock while it puts the new file on the list.
While this is an implementation detail and could conceivably be addressed inside glibc at some (far) point in the future (the small improvements we have made relatively recently will not help in your case), the only way out is to call _exit, after the cleanup you need to do, so that the final process exit procedure in glibc is never entered. In your case, it may be possible to do so from an atexit handler registered as early as possible, but this depends on your application.
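A minimal sketch of that approach (hypothetical names; the handler installation and the cleanup logic are placeholders, and everything in the SIGSEGV path is restricted to async-signal-safe calls):

#include <signal.h>
#include <unistd.h>
#include <pthread.h>

static pthread_t aux_thread;            /* assumed: recorded when the auxiliary thread is created */

/* Handler for the internal signal: parks the auxiliary thread forever. */
static void park_handler(int sig)
{
    (void)sig;
    for (;;)
        pause();                        /* pause() is async-signal-safe */
}

/* SIGSEGV handler: quiesce the other thread, clean up, then _exit(). */
static void segv_handler(int sig)
{
    (void)sig;
    pthread_kill(aux_thread, SIGUSR1);  /* ask the auxiliary thread to park */
    /* ... mandatory cleanup and stack-trace dump here, using only
       async-signal-safe functions (write(2) rather than fprintf) ... */
    _exit(1);                           /* bypasses exit(): no atexit handlers,
                                           no _IO_flush_all_lockp, no libio lock */
}

int main(void)
{
    signal(SIGUSR1, park_handler);      /* internal parking signal */
    signal(SIGSEGV, segv_handler);
    /* ... create aux_thread and run the application ... */
    return 0;
}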
Regarding crash handlers, we published some advice here:
Using the fork function in signal handlers
The article focuses on fork, but the deadlock issues are pretty much the same in your case.

Related

Why pthread_mutex_lock is not marked as async-signal safe?

You see, sem_post is marked as async-signal-safe. But why is pthread_mutex_lock not marked as async-signal-safe, while the following program gives you the illusion that it actually is?
#include <stdio.h>
#include <signal.h>
#include <pthread.h>

void handle(int arg)
{
    printf("I wake up!\n");
}

int main()
{
    signal(SIGHUP, handle);
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&lock);
    printf("gonna be blocked\n");
    pthread_mutex_lock(&lock);   /* blocks forever: the mutex is already held */
    pthread_mutex_unlock(&lock);
    return 0;
}
kill -hup $pid will have it print something out. But the lock is still not acquired and the program stays blocked (I mean, it does not finish), which gives me the impression that pthread_mutex_lock is async-signal-safe.
You can consult the book Advanced Programming in the UNIX Environment, or man sigaction, for the list of async-signal-safe functions.
But why pthread_mutex_lock is not marked as async-signal safe
Because it isn't.
while the following program give you the illusion that it is actually async-signal safe?
Your program has nothing to do with async-signal safety. Any conclusion about async-signal safety you derived from this test program is plain wrong.
Async signal safety is about being able to call the function from an async signal handler.
To see that pthread_mutex_lock isn't async signal safe, write a program with 3 threads: one doing pthread_mutex_lock and pthread_mutex_unlock in a tight loop, one doing the same on the same mutex from a signal handler, and a third one that sends an unending stream of SIGHUPs to the process.
If pthread_mutex_lock were async signal safe, this program would run forever.
But I expect that what you would observe is that this program will either crash or deadlock after a while.
Even if it doesn't, that still wouldn't mean that pthread_mutex_lock is safe, only that you haven't yet proved that it is unsafe.
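A sketch of that test, assuming Linux/glibc (names are illustrative):

#include <pthread.h>
#include <signal.h>
#include <unistd.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Deliberately unsafe: mutex operations inside a signal handler. */
static void on_hup(int sig)
{
    (void)sig;
    pthread_mutex_lock(&m);
    pthread_mutex_unlock(&m);
}

/* Thread 1: lock/unlock the mutex in a tight loop. */
static void *locker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

/* Thread 2: send an unending stream of SIGHUPs to the process. */
static void *flooder(void *arg)
{
    (void)arg;
    for (;;)
        kill(getpid(), SIGHUP);
    return NULL;
}

int main(void)
{
    signal(SIGHUP, on_hup);

    pthread_t t1, t2;
    pthread_create(&t1, NULL, locker, NULL);
    pthread_create(&t2, NULL, flooder, NULL);

    /* If pthread_mutex_lock were async-signal-safe, this would run
       forever; in practice, expect a deadlock or crash after a while. */
    for (;;)
        pause();
}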

Node using all processors without clustering. How come?

I have a nodejs application that gets data from one server and pushes it into another. For testing, I sent 1000 requests to my node server and watched the system monitor, where I could see that all 4 processors were 100% occupied.
Now, from what I have read on nodejs, it by default uses only 1 thread (which means 1 processor?). But how come all my computer's processors were occupied? Is this load balancing happening at the OS level? (I am on Ubuntu 14.)
And in case the balancing was done by the OS, what is the difference between this automatic OS-level load balancing and explicitly using clusters to divide the load? What are the advantages/disadvantages of each?
Any help would be deeply appreciated :)
Though the application is driven by a single thread, there are helper threads inside node that facilitate execution within the runtime environment, for example the JIT compiler thread and the GC helper threads. Though they won't consume CPU in proportion to the application load, they are driven by characteristics internal to the virtual machine.
Hooking on with a live debugger shows how many threads there are and what they are doing:
(gdb) info threads
6 Thread 0x7ffff61d8700 (LWP 23181) 0x00000034d080d930 in sem_wait () from /lib64/libpthread.so.0
5 Thread 0x7ffff6bd9700 (LWP 23180) 0x00000034d080d930 in sem_wait () from /lib64/libpthread.so.0
4 Thread 0x7ffff75da700 (LWP 23179) 0x00000034d080d930 in sem_wait () from /lib64/libpthread.so.0
3 Thread 0x7ffff7fdb700 (LWP 23178) 0x00000034d080d930 in sem_wait () from /lib64/libpthread.so.0
2 Thread 0x7ffff7ffc700 (LWP 23177) 0x00000034d080d930 in sem_wait () from /lib64/libpthread.so.0
* 1 Thread 0x7ffff7fdd720 (LWP 23168) 0x00000034d04e5239 in syscall () from /lib64/libc.so.6
(gdb)

Race condition between wait_event and wake_up in Linux kernel

I'm a kernel newbie. This question came up while I was reading the source code.
In the implementation of wait_event(), the kernel does something like this:
...
prepare_to_wait(); /* enqueue current thread to the wait queue */
...
schedule(); /* invoke deactivate_task() inside, which will dequeue current thread from the runqueue */
...
In the implementation of wake_up(), the kernel does the following:
...
try_to_wake_up(); /* invoke activate_task() inside, which will enqueue the target thread into the runqueue */
...
In a concurrent execution, what if the above functions are invoked in the following order?
...
prepare_to_wait(); /* thread A adds itself to the wait queue */
...
try_to_wake_up(); /* thread B wakes up A and enqueues it into the runqueue */
...
schedule(); /* thread A dequeues itself from the runqueue and yields the CPU */
...
Thread A is not in either the runqueue or the wait queue. Does that mean we lost thread A? The kernel must have some mechanism to prevent this from happening. Could someone tell me what I missed here? Thanks!
I found the answer in the article Kernel Korner - Sleeping in the Kernel, in issue 137 of the Linux Journal (July 28, 2005) by Kedar Sovani.
In a nutshell, this is the lost-wakeup problem. The Linux kernel solves it by setting the task state to TASK_INTERRUPTIBLE in prepare_to_wait(), before the condition is checked. If a wake-up arrives between prepare_to_wait() and schedule(), try_to_wake_up() sets the state back to TASK_RUNNING, and schedule() then returns without descheduling the task, so the wake-up is not lost.
The wait_event*() macros are one kernel mechanism built on this pattern; they work as explained in Kernel Korner - Sleeping in the Kernel.
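For reference, a condensed sketch of that pattern, following the kernel's wait-queue API (simplified from the real wait_event() in include/linux/wait.h, and assuming a kernel-module context):

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(wq);
static int condition;

static void wait_for_condition(void)
{
    DEFINE_WAIT(wait);

    for (;;) {
        /* Set the state to TASK_INTERRUPTIBLE and enqueue on the wait
         * queue *before* checking the condition. */
        prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE);
        if (condition)
            break;
        /* If try_to_wake_up() already ran, the state is back to
         * TASK_RUNNING and schedule() returns without descheduling us. */
        schedule();
    }
    finish_wait(&wq, &wait);
}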

What are the general causes of abort signal?

I have a C++ application, running on Linux, which gets an abort signal on exit.
Before I go through the code to hunt down the problem, I need to know in which cases an application can get an abort signal from the kernel. That would give me the right direction for debugging.
Please mention every potential scenario in which an application could get an abort signal.
The specifics of the execution scenario are:
The process is in exit mode, i.e. the exit() routine has been called for a graceful shutdown of the process.
Consequently, all the global object destructors are called.
TIA
Compile it with -g.
Run it from a debugger.
When the application crashes, the debugger will give you the line and let you inspect threads, variables...
Another solution:
enable core dump generation with ulimit (ulimit -c unlimited),
then load the core dump in gdb post mortem.
The root cause can be one of several things: reading outside of your memory space, division by 0, dereferencing an invalid pointer...
I would try running under valgrind. There could be a memory error even before the abort and valgrind could notice that and tell you. If this is the case, you will find the error much easier than with a conventional debugger like gdb.
In general, the cause of an abort is an assertion failure.
For example:
(gdb) bt
#0 0x00000035fbc30265 in raise () from /lib64/libc.so.6
#1 0x00000035fbc31d10 in abort () from /lib64/libc.so.6
#2 0x00000035fbc296e6 in __assert_fail () from /lib64/libc.so.6
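A minimal example that produces such a backtrace (any failed assert() prints a diagnostic and calls abort(), which raises SIGABRT):

#include <assert.h>

int main(void)
{
    int x = 0;
    assert(x == 1);   /* fails: prints the expression, file and line,
                         then calls abort() */
    return 0;
}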

getting info about threads in gdb/ddd

I am debugging a multi-threaded application using ddd.
About once a second, I can see in the DDD console output that a new thread is created:
[New Thread 0x455fc940 (LWP 27373)]
and destroyed immediately afterwards:
[Thread 0x455fc940 (LWP 27373) exited]
After a few minutes the output looks like this:
[New Thread 0x455fc940 (LWP 27363)]
[Thread 0x455fc940 (LWP 27363) exited]
[New Thread 0x455fc940 (LWP 27367)]
[Thread 0x455fc940 (LWP 27367) exited]
[New Thread 0x455fc940 (LWP 27373)]
[Thread 0x455fc940 (LWP 27373) exited]
...and so on..
with the LWP number increasing all the time.
The threads come and go too fast to be displayed in the window I get by clicking on Status->Thread. Can you give me some direction on how to get information about those threads?
Do you know why this LWP keeps increasing?
More importantly, how do I find the function that is launched in that thread?
Thank you all
AFG
LWP is an acronym that stands for Light Weight Process. It is, in effect, the thread ID of each newly spawned thread.
As for what to do about those spawning and dying threads: you could try setting a breakpoint at clone, which is, I believe, the system call that starts a new thread at a given function.
Note: when breaking at clone you know from where the thread will be started, but you don't actually have a thread yet; you can, however, set breakpoints at the functions given as arguments to clone.
That is, start your program from gdb or ddd with the start command, which sets a temporary breakpoint at the program entry point (i.e. main), then set a breakpoint at clone, continue, and see what happens ;).
Update: setting a breakpoint at clone works for me, at least in my test. I should add that this is Linux-specific, and clone is actually what pthread_create uses.
Set a breakpoint at pthread_create.
(gdb) break pthread_create
Breakpoint 1 at 0x20c49ba5cabf44
Now when you run it, it will stop execution when the next call to create a thread happens, and you can type where to see who the caller was.
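For illustration, here is a small sketch that reproduces the pattern from the question, spawning and joining a short-lived thread about once a second (the worker function is hypothetical); typing where at the pthread_create breakpoint while this runs reveals the caller:

#include <pthread.h>
#include <unistd.h>

/* Hypothetical short-lived worker: the function you would discover
 * via the breakpoint on pthread_create (or clone). */
static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    for (;;) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);   /* "[New Thread ...]"    */
        pthread_join(t, NULL);                    /* "[Thread ... exited]" */
        sleep(1);
    }
}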
