pthread_join hangs indefinitely in __lll_lock_wait_private() - Linux

I have a multithreaded application where I spawn a few threads and call pthread_join() on them upon completion.
The main thread spawns the worker threads and waits in pthread_join() for them to finish. I am facing an issue where the main thread waits indefinitely in pthread_join() even though all the worker threads have exited, leaving the program hung.
I verified that all the worker threads have exited by running info threads in gdb, which lists only the main thread.
It is known that calling pthread_join() on an already-exited thread returns immediately, but this case seems different. This is the gdb stack trace:
#0 0x00007f45fefebeec in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007f45fef68a6f in _L_lock_5333 () from /lib64/libc.so.6
#2 0x00007f45fef62408 in _int_free () from /lib64/libc.so.6
#3 0x00007f45ffbe5088 in _dl_deallocate_tls () from /lib64/ld-linux-x86-64.so.2
#4 0x00007f45ff9bde67 in __free_stacks () from /lib64/libpthread.so.0
#5 0x00007f45ff9bdf7f in __deallocate_stack () from /lib64/libpthread.so.0
#6 0x00007f45ff9bff93 in pthread_join () from /lib64/libpthread.so.0
#7 0x00007f45f87a6fe1 in waitForWorkerThreadsToExit () at src/server.c:133
#8 ServerLoop (arg=<optimized out>) at src/server.c:662
#9 0x00007f45ff9bee25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f45fefde34d in clone () from /lib64/libc.so.6
I am on CentOS 7 with Linux kernel 3.10.
Can someone help? TIA
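For reference, the spawn-and-join pattern described above boils down to roughly the following (a minimal sketch; the worker count and body are placeholders, not the real code):

#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4

/* Placeholder worker: does its job and returns. */
static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_t workers[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, worker, NULL);

    /* Main thread blocks here until every worker has exited. */
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);

    puts("all workers joined");
    return 0;
}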

One of the other threads exited without relinquishing the lock. As suggested here, you can check the thread id of the mutex's owner to find out which thread is the culprit.
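For illustration, the failure mode being described is the classic "lock abandoned by a dead thread" pattern: if a thread terminates while still holding a lock, every later acquirer blocks forever. In the trace above the lock is glibc's internal malloc lock taken inside _int_free rather than a user mutex, but the mechanism is the same. A minimal sketch with made-up names:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Worker that terminates while still holding the mutex. */
static void *bad_worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    /* ... work ... */
    return NULL;               /* BUG: exits without unlocking */
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, bad_worker, NULL);
    pthread_join(t, NULL);     /* the worker is gone ... */

    pthread_mutex_lock(&lock); /* ... but this blocks forever */
    puts("never reached");
    return 0;
}

For an ordinary pthread_mutex_t you can print the mutex in gdb and look at its __data.__owner field to get the id of the owning thread.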

Related

How to daemonize a process which is linked to shared library?

We are trying to daemonize a binary that is linked against a shared library. After the fork(), the parent process exits, and its exit handlers unload (detach) the shared library.
Below is the stack:
#3 0x00007ffff7deb07a in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#4 0x00007ffff67aace9 in __run_exit_handlers () from /lib64/libc.so.6
#5 0x00007ffff67aad37 in exit () from /lib64/libc.so.6
#6 0x00007ffff7b35ce2 in daemonize () at
/home/CSDeveloper/CLLM420/1/src/mds.llmd/common/libllmd.c:677
#7 0x00007ffff7b35cfb in llm_waitForShutdown () at
/home/CSDeveloper/CLLM420/1/src/mds.llmd/common/libllmd.c:665
#8 0x0000000000400c1d in main (argc=<optimized out>, argv=0x7fffffffdea8)
at /home/CSDeveloper/CLLM420/1/src/mds.llmd/common/llmd.c:134
Can somebody let us know how to daemonize such a process that is linked to a shared library?
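No answer is shown here, but for illustration, a generic daemonize skeleton looks roughly like the sketch below (this is not the code from libllmd.c). One detail worth noting in this context: the parent usually terminates with _exit() rather than exit(), so that atexit handlers and shared-library destructors (_dl_fini, as in the trace above) do not run in the parent.

#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Generic daemonize skeleton (illustrative only). */
static int daemonize(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid > 0)
        _exit(0);          /* parent: _exit() skips atexit handlers and _dl_fini */

    if (setsid() < 0)      /* child: new session, detach from controlling tty */
        return -1;

    umask(0);
    if (chdir("/") < 0)
        return -1;

    /* Redirect stdio to /dev/null. */
    int fd = open("/dev/null", O_RDWR);
    if (fd >= 0) {
        dup2(fd, STDIN_FILENO);
        dup2(fd, STDOUT_FILENO);
        dup2(fd, STDERR_FILENO);
        if (fd > STDERR_FILENO)
            close(fd);
    }
    return 0;
}

int main(void)
{
    if (daemonize() < 0)
        return 1;
    /* ... daemon work would go here ... */
    pause();
    return 0;
}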

Why a SIGABRT permission denied error appeared during a zmq_poll()?

I am using zmq PUSH and PULL sockets, and I have recently started observing a SIGABRT crash in a zmq_poll() call.
The error/exit log is "Permission denied (src/tcp_connecter.cpp:361)".
(gdb) bt
#0 0x00007ffff76d053f in raise () from /lib64/libc.so.6
#1 0x00007ffff76ba895 in abort () from /lib64/libc.so.6
#2 0x00007ffff7f59ace in zmq::zmq_abort(char const*) () from /lib64/libzmq.so.5
#3 0x00007ffff7f9ef36 in zmq::tcp_connecter_t::connect() () from /lib64/libzmq.so.5
#4 0x00007ffff7f9f060 in zmq::tcp_connecter_t::out_event() () from /lib64/libzmq.so.5
#5 0x00007ffff7f6bc2c in zmq::epoll_t::loop() () from /lib64/libzmq.so.5
#6 0x00007ffff7f9ffba in thread_routine () from /lib64/libzmq.so.5
#7 0x00007ffff75d058e in start_thread () from /lib64/libpthread.so.0
#8 0x00007ffff77956a3 in clone () from /lib64/libc.so.6
Could anyone help me here?
The process is part of a container running in Kubernetes. This issue started occurring suddenly, and the process does not recover.
Thanks,
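For reference, the kind of PULL-plus-zmq_poll() setup described above is roughly the following minimal sketch (the endpoint and timeout are assumptions, not taken from the question):

#include <zmq.h>
#include <stdio.h>

int main(void)
{
    void *ctx  = zmq_ctx_new();
    void *pull = zmq_socket(ctx, ZMQ_PULL);

    /* Hypothetical endpoint on the remote host. */
    zmq_connect(pull, "tcp://hostB:5555");

    zmq_pollitem_t items[] = {
        { pull, 0, ZMQ_POLLIN, 0 },
    };

    for (;;) {
        int rc = zmq_poll(items, 1, 1000);   /* wait up to 1000 ms */
        if (rc < 0)
            break;                           /* interrupted or error */

        if (items[0].revents & ZMQ_POLLIN) {
            char buf[256];
            int n = zmq_recv(pull, buf, sizeof(buf) - 1, 0);
            if (n >= 0) {
                if (n > (int)sizeof(buf) - 1)
                    n = sizeof(buf) - 1;     /* message truncated to fit */
                buf[n] = '\0';
                printf("received: %s\n", buf);
            }
        }
    }

    zmq_close(pull);
    zmq_ctx_destroy(ctx);
    return 0;
}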
Meanwhile, I resolved the issue.
The zmq interface on Host A was trying to connect to Host B, and the above error was observed on Host A.
The issue started occurring after Host B was restarted. I noticed that an ip6tables rule had been added on Host B as part of its restart; the rule rejected traffic with "admin prohibited" in the INPUT and FORWARD chains on Host B. (I would have to search my notes for the exact rule.)
Because of this, the zmq client on Host A ended up in the crash shown above. I believe a crash (SIGABRT) should not be the result of hitting such a rule on the peer end, since a SIGABRT cannot be handled in application code.

Process terminated by signal 6, core shows kind of loop in libc

Analyzing the core of a process (terminated by signal 6) on Linux, the gdb backtrace shows:
Core was generated by `/opt/namsam/pac_rrc_qx_e1/bin/rrcprb'.
Program terminated with signal 6, Aborted.
#0 0x0000005555ffb004 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0 0x0000005555ffb004 in epoll_wait () from /lib64/libc.so.6
#1 0x0000005555ffafe8 in __epoll_wait_nocancel () from /lib64/libc.so.6
#2 0x0000005555ffafe8 in __epoll_wait_nocancel () from /lib64/libc.so.6
#3 0x0000005555ffafe8 in __epoll_wait_nocancel () from /lib64/libc.so.6
#4 0x0000005555ffafe8 in __epoll_wait_nocancel () from /lib64/libc.so.6
#5 0x0000005555ffafe8 in __epoll_wait_nocancel () from /lib64/libc.so.6
#6 0x0000005555ffafe8 in __epoll_wait_nocancel () from /lib64/libc.so.6
#7 0x0000005555ffafe8 in __epoll_wait_nocancel () from /lib64/libc.so.6
libc seems to have gone into some kind of loop. Did something go wrong with the application "rrcprb" here? Please help me debug this issue.
Since __epoll_wait_nocancel does not call itself, it's pretty clear that the stack trace you've got is bogus. The most likely cause is incorrect unwind descriptors in your libc.so.6.
It's also somewhat unlikely that you actually crashed in epoll_wait. Try thread apply all where, and see if there is a "more interesting" stack trace / thread for you to look at.

Not able to analyze a core dump for a multithreaded application (help required)

I am working on a multithreaded application. Whenever the process crashes it generates a core as shown below, and I am not able to understand where it is actually crashing.
GNU gdb Red Hat Linux (6.5-25.el5rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: exec file is newer than core file.
Core was generated by `multithreadprocess '.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000448f7a in std::ostream::operator<< ()
(gdb) where
#0 0x000000000044bd32 in std::ostream::operator<< ()
#1 0x0000000000450b21 in std::ostream::operator<< ()
#2 0x000000000042eda9 in std::string::operator= ()
#3 0x00000030582062e7 in start_thread () from /lib64/libpthread.so.0
#4 0x00000030576ce3bd in clone () from /lib64/libc.so.6
(gdb) thread apply all bt
Thread 6 (process 11674):
#0 0x000000305820a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000431140 in std::string::operator= ()
#2 0x00000030582062e7 in start_thread () from /lib64/libpthread.so.0
#3 0x00000030576ce3bd in clone () from /lib64/libc.so.6
Thread 5 (process 11683):
#0 0x000000305820cbfb in write () from /lib64/libpthread.so.0
#1 0x0000000000449151 in std::ostream::operator<< ()
#2 0x000000000043b74a in std::string::operator= ()
#3 0x000000000046c3f4 in std::string::substr ()
#4 0x000000000046e3c1 in std::string::substr ()
#5 0x00000000004305a4 in std::string::operator= ()
#6 0x00000030582062e7 in start_thread () from /lib64/libpthread.so.0
#7 0x00000030576ce3bd in clone () from /lib64/libc.so.6
Thread 4 (process 11744):
#0 0x00000030576c5896 in poll () from /lib64/libc.so.6
#1 0x0000000000474f1c in std::string::substr ()
#2 0x000000000043b889 in std::string::operator= ()
#3 0x0000000000474dbc in std::string::substr ()
#4 0x00000000004306a5 in std::string::operator= ()
#5 0x00000030582062e7 in start_thread () from /lib64/libpthread.so.0
#6 0x00000030576ce3bd in clone () from /lib64/libc.so.6
Thread 3 (process 11864):
#0 0x000000305820a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000431140 in std::string::operator= ()
#2 0x00000030582062e7 in start_thread () from /lib64/libpthread.so.0
#3 0x00000030576ce3bd in clone () from /lib64/libc.so.6
Thread 2 (process 11866):
#0 0x000000305820a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000431140 in std::string::operator= ()
#2 0x00000030582062e7 in start_thread () from /lib64/libpthread.so.0
#3 0x00000030576ce3bd in clone () from /lib64/libc.so.6
Thread 1 (process 11865):
#0 0x000000000044bd32 in std::ostream::operator<< ()
#1 0x0000000000450b21 in std::ostream::operator<< ()
#2 0x000000000042eda9 in std::string::operator= ()
#3 0x00000030582062e7 in start_thread () from /lib64/libpthread.so.0
#4 0x00000030576ce3bd in clone () from /lib64/libc.so.6
If I run bt full it shows this:
(gdb) bt full
#0 0x000000000044bd32 in std::ostream::operator<< ()
No symbol table info available.
#1 0x0000000000450b21 in std::ostream::operator<< ()
No symbol table info available.
#2 0x000000000042eda9 in std::string::operator= ()
No symbol table info available.
#3 0x00000030582062e7 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4 0x00000030576ce3bd in clone () from /lib64/libc.so.6
No symbol table info available.
GDB 6.5 is quite old. You will likely get significantly better stack traces from (current) GDB 7.0.1.
You also appear to be trying to debug optimized code built without the -g flag, and you may not be debugging the right executable (GDB warns that your executable is newer than your core).
Make sure that your executable and all the libraries listed in the info shared GDB output exactly match between the system where your core was produced and the system on which you are analyzing the core (if they are not the same). This is paramount: if there is a mismatch, you will likely get bogus stack traces (and the stack traces you've posted do look completely bogus to me).
Looks to me like you're using iostream inside a multithreaded application without the appropriate flags. See this. In particular, note that it says:
When you build an application that uses the iostream classes of the libC library to run in a multithreaded environment, compile and link the source code of the application using the -mt option. This option passes -D_REENTRANT to the preprocessor and -lthread to the linker.
This is for a particular platform; your requirements may vary.

gdb backtrace and pthread_cond_wait()

This is on a Red Hat EL5 machine with a 2.6.18-164.2.1.el5 x86_64 kernel, using gcc 4.1.2 and gdb 7.0.
When I run my application with gdb and break in while it's running, several of my threads show the following call stack when I do a backtrace:
#0 0x000000000051d7da in pthread_cond_wait ()
#1 0x0000000100000000 in ?? ()
#2 0x0000000000c1c3b0 in ?? ()
#3 0x0000000000c1c448 in ?? ()
#4 0x00000000000007dd in ?? ()
#5 0x000000000051d630 in ?? ()
#6 0x00007fffffffdc90 in ?? ()
#7 0x000000003b1ae84b in ?? ()
#8 0x00007fffffffdd50 in ?? ()
#9 0x0000000000000000 in ?? ()
Is this a symptom of a common problem?
Is there a known issue with viewing the call stack while waiting on a condition?
The problem is that pthread_cond_wait is written in hand-coded assembly and apparently doesn't have a proper unwind descriptor (required on x86_64 to unwind the stack) in your build of glibc. This problem may have recently been fixed here.
You can try to build and install the latest glibc (note: if you screw up installation, your machine will likely become unbootable; approach with extreme caution!), or just live with "bogus" stack traces from pthread_cond_wait.
Generally, synchronization is required when multiple threads share a single resource.
In such a case, when you interrupt the program, you'll see only 1 thread is running (i.e., accessing the resource) and other threads are waiting within pthread_cond_wait().
So I don't think pthread_cond_wait() itself is problematic.
If your program hangs with a deadlock or its performance doesn't scale, it might be caused by pthread_cond_wait().
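For context, the typical pattern behind those waiting threads looks roughly like the minimal sketch below (names are made up): the consumer sits inside pthread_cond_wait(), which is exactly where gdb shows it parked, until a producer signals the condition.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool data_ready = false;

/* Consumer: blocks in pthread_cond_wait() until signalled. */
static void *consumer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!data_ready)                      /* guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock);
    /* ... use the shared resource ... */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Producer: publishes the resource and wakes one waiter. */
static void producer(void)
{
    pthread_mutex_lock(&lock);
    data_ready = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);
    producer();
    pthread_join(t, NULL);
    return 0;
}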
That looks like a corrupt stack trace to me. For example:
#9 0x0000000000000000 in ?? ()
There shouldn't be code at NULL.
