I am using ZeroMQ PUSH and PULL sockets and recently started observing a SIGABRT crash during a zmq_poll() operation.
The error in the exit log is "Permission denied (src/tcp_connecter.cpp:361)".
(gdb) bt
#0 0x00007ffff76d053f in raise () from /lib64/libc.so.6
#1 0x00007ffff76ba895 in abort () from /lib64/libc.so.6
#2 0x00007ffff7f59ace in zmq::zmq_abort(char const*) () from /lib64/libzmq.so.5
#3 0x00007ffff7f9ef36 in zmq::tcp_connecter_t::connect() () from /lib64/libzmq.so.5
#4 0x00007ffff7f9f060 in zmq::tcp_connecter_t::out_event() () from /lib64/libzmq.so.5
#5 0x00007ffff7f6bc2c in zmq::epoll_t::loop() () from /lib64/libzmq.so.5
#6 0x00007ffff7f9ffba in thread_routine () from /lib64/libzmq.so.5
#7 0x00007ffff75d058e in start_thread () from /lib64/libpthread.so.0
#8 0x00007ffff77956a3 in clone () from /lib64/libc.so.6
Could anyone help me here?
The process is part of a container running in Kubernetes. This issue started occurring suddenly, and the process has not been able to recover.
Thanks,
In the meantime, I resolved the issue.
The zmq interface on Host A was trying to connect to Host B, and the error above was observed on Host A.
The issue started occurring after Host B was restarted. I noticed that an ip6tables rule had been added on Host B as part of its restart; the rule did "reject with admin prohibited" in the INPUT and FORWARD chains on Host B. (I would have to search my notes for the exact rule.)
Because of this, the zmq client on Host A ended up in the crash described above. I believe a crash (SIGABRT) should not be the outcome of hitting such a rule on the peer end, since a SIGABRT cannot be handled in application code.
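For reference, here is a minimal, hypothetical sketch of the PUSH side of such a setup (the endpoint and names are made up, not taken from the original code). It shows why the error cannot be caught in application code: zmq_connect() only queues the connection, and the actual TCP connect that hit "Permission denied" runs later inside libzmq's internal I/O thread.

// Hypothetical sketch of the PUSH side; the endpoint is an assumption.
// zmq_connect() returns immediately; the real TCP connect happens in
// libzmq's I/O thread, which is where the abort in the backtrace fires.
#include <zmq.h>
#include <cstdio>

int main() {
    void *ctx = zmq_ctx_new();
    void *push = zmq_socket(ctx, ZMQ_PUSH);
    if (zmq_connect(push, "tcp://hostB:5555") != 0) {  // hypothetical address
        std::printf("connect failed: %s\n", zmq_strerror(zmq_errno()));
    }
    zmq_send(push, "ping", 4, 0);  // queued; delivered once the connection is up
    zmq_close(push);
    zmq_ctx_term(ctx);
    return 0;
}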
Related
I have a multithreaded application where I spawn a few threads and call pthread_join() on them upon completion.
The main thread spawns the worker threads and waits in pthread_join() for them to finish. I am facing an issue where the main thread waits indefinitely in pthread_join() even though all the worker threads have exited, which leaves the program hung.
I confirmed that all worker threads have exited by running "info threads" in gdb, which lists only the main thread.
It is known that calling pthread_join() on an already-exited thread returns immediately, but this case seems different. This is the gdb stack trace:
#0 0x00007f45fefebeec in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007f45fef68a6f in _L_lock_5333 () from /lib64/libc.so.6
#2 0x00007f45fef62408 in _int_free () from /lib64/libc.so.6
#3 0x00007f45ffbe5088 in _dl_deallocate_tls () from /lib64/ld-linux-x86-64.so.2
#4 0x00007f45ff9bde67 in __free_stacks () from /lib64/libpthread.so.0
#5 0x00007f45ff9bdf7f in __deallocate_stack () from /lib64/libpthread.so.0
#6 0x00007f45ff9bff93 in pthread_join () from /lib64/libpthread.so.0
#7 0x00007f45f87a6fe1 in waitForWorkerThreadsToExit () at src/server.c:133
#8 ServerLoop (arg=<optimized out>) at src/server.c:662
#9 0x00007f45ff9bee25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f45fefde34d in clone () from /lib64/libc.so.6
I am on CentOS 7 with Linux kernel 3.10.
Can someone help? Thanks in advance.
One of the other threads is exiting without relinquishing the lock. As suggested here, you can check the thread ID of the owner of this mutex to find out which thread is the culprit.
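As an illustration only (hypothetical code, not the poster's), here is one way a worker can die while holding the allocator's internal lock: a thread with asynchronous cancellation enabled can be cancelled in the middle of malloc()/free(), and the heap lock it held is never released. The main thread then deadlocks when pthread_join() frees the dead thread's TLS and stack, much like the _int_free frame in the trace above.

// Hypothetical repro sketch: async cancellation while the worker is inside
// malloc()/free() can leave glibc's internal heap lock held forever.
#include <pthread.h>
#include <unistd.h>
#include <cstdlib>

static void *worker(void *) {
    // With asynchronous cancellation the thread can be killed at any
    // instruction, including while it holds malloc's internal lock.
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, nullptr);
    for (;;) {
        void *p = std::malloc(64);
        std::free(p);
    }
    return nullptr;
}

int main() {
    pthread_t tid;
    pthread_create(&tid, nullptr, worker, nullptr);
    usleep(1000);                  // let the worker start allocating
    pthread_cancel(tid);           // may land while the heap lock is held
    pthread_join(tid, nullptr);    // can then hang freeing the dead thread's
    return 0;                      // stack/TLS, as in the backtrace above
}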
I am using node 10.38 on Linux (Ubuntu 4.10, FC20, etc.).
I have some code at startup which looks like this:
process.on('SIGTERM', function() {
process.exit(1);
});
process.on('SIGINT', function() {
process.exit(1);
});
Somewhere else in the process, I have code like this:
dns.lookup("somehostname", function(err, addresses, family) {
// do something
});
Many times, if you send SIGTERM to the process, node will not quit. It hangs for as long as it takes to resolve DNS; if the DNS server does not respond, it can take up to 5 minutes to quit. If you attach gdb at this point, you see a stack trace like the one below, showing that the process is stuck trying to resolve the hostname.
I would have thought that gethostbyname could be interrupted by signals. Can someone shed some insight into this?
Thread 3 (process 18074):
#0 0x00007fabac3bed26 in poll () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fababcdce90 in __libc_res_nsend () from /lib64/libresolv.so.2
No symbol table info available.
#2 0x00007fababcdbcb6 in __libc_res_nquery () from /lib64/libresolv.so.2
No symbol table info available.
#3 0x00007fababcdbf27 in __libc_res_nquerydomain () from /lib64/libresolv.so.2
No symbol table info available.
#4 0x00007fababcdc14b in __libc_res_nsearch () from /lib64/libresolv.so.2
No symbol table info available.
#5 0x00007fababeeb8ef in _nss_dns_gethostbyname3_r () from /lib64/libnss_dns.so.2
No symbol table info available.
#6 0x00007fababeebb64 in _nss_dns_gethostbyname2_r () from /lib64/libnss_dns.so.2
No symbol table info available.
#7 0x00007fabac3b02bf in gaih_inet () from /lib64/libc.so.6
No symbol table info available.
#8 0x00007fabac3b178e in getaddrinfo () from /lib64/libc.so.6
No symbol table info available.
#9 0x0000000000a0cbb2 in uv_getaddrinfo ()
No symbol table info available.
#10 0x0000000000a127c4 in uv_queue_work ()
No symbol table info available.
#11 0x0000000000a08462 in uv_thread_create ()
No symbol table info available.
gethostbyname can indeed be interrupted by signals, but you'll see right at the bottom of the stack trace that the call is being made within a thread.
The SIGTERM that you send is only delivered to the main thread, and for reasons I haven't yet established, the process does not exit until all threads have completed their work.
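To illustrate this with a hedged, stand-alone sketch (hypothetical code, not node's internals): the SIGTERM handler runs, but it does not abort a getaddrinfo() call that is blocked in a worker thread, so the process keeps running until the lookup finishes or times out.

// Hypothetical illustration: handling SIGTERM does not cancel a DNS lookup
// that is blocked inside getaddrinfo() in another thread.
#include <csignal>
#include <cstdio>
#include <netdb.h>
#include <pthread.h>

static volatile std::sig_atomic_t got_sigterm = 0;

static void on_sigterm(int) { got_sigterm = 1; }

static void *resolver(void *) {
    struct addrinfo *res = nullptr;
    // Blocks here until the resolver answers or gives up; a signal handled
    // elsewhere in the process does not make this return early.
    getaddrinfo("somehostname", nullptr, nullptr, &res);  // hypothetical name
    if (res) freeaddrinfo(res);
    return nullptr;
}

int main() {
    std::signal(SIGTERM, on_sigterm);
    pthread_t tid;
    pthread_create(&tid, nullptr, resolver, nullptr);
    pthread_join(tid, nullptr);  // the process lives until the lookup ends
    std::printf("exiting, got_sigterm=%d\n", (int)got_sigterm);
    return 0;
}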
I'm building a shared library on Linux. The ".so" library was created successfully, but when I linked it to a test application (with an empty main) and ran the executable, I got a segmentation fault: "Segmentation fault (core dumped)".
When I tried to debug it with gdb and check the backtrace, I got this output:
Program received signal SIGSEGV, Segmentation fault.
0x0073d5df in std::_Rb_tree_decrement(std::_Rb_tree_node_base*) () from /usr/lib/libstdc++.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12.1-4.i686 libgcc-4.4.5-2.fc13.i686 libstdc++-4.4.5-2.fc13.i686 zlib-1.2.3-23.fc12.i686
(gdb) backtrace
#0 0x0073d5df in std::_Rb_tree_decrement(std::_Rb_tree_node_base*) () from /usr/lib/libstdc++.so.6
#1 0x0012d70c in ?? () from /opt/cuda/lib/libcudart.so.3
#2 0x0012df0c in ?? () from /opt/cuda/lib/libcudart.so.3
#3 0x0012c88a in ?? () from /opt/cuda/lib/libcudart.so.3
#4 0x00121435 in __cudaRegisterFatBinary () from /opt/cuda/lib/libcudart.so.3
#5 0x005d7bfd in __sti____cudaRegisterAll_55_tmpxft_00000fe6_00000000_26_MonteCarloPaeo_SM10_cpp1_ii_3a8af011() () from libsharedCUFP.so
#6 0x005db40d in __do_global_ctors_aux () from libsharedCUFP.so
#7 0x005a8748 in _init () from libsharedCUFP.so
#8 0x008abd00 in _dl_init_internal () from /lib/ld-linux.so.2
#9 0x0089d88f in _dl_start_user () from /lib/ld-linux.so.2
I'm not familiar with gdb debugging, and it's the first time I'm trying to build a shared library on Linux, but it seems to me that it has something to do with the library's dynamic linking.
If anyone has any idea about this error and could help me, I would be grateful.
It doesn't have anything to do with dynamic linking or shared libraries - one of the constructors in libsharedCUFP.so (I assume this is your shared library) is most probably passing an illegal address to a function in libcudart.so, which then crashes.
You simply need to debug your code.
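To make the mechanism concrete, here is a hedged, hypothetical sketch (names invented, not the real library): frames 5-8 of the backtrace show the dynamic loader running the library's static constructors at load time, so a bug in one of them crashes the program before the empty main() is ever reached.

// Hypothetical sketch: a global object in the shared library is constructed
// when the library is loaded (_dl_init -> _init -> __do_global_ctors_aux),
// so a bad pointer access in its constructor segfaults before main() runs.
#include <map>

struct RegisterAtLoad {
    RegisterAtLoad() {
        std::map<int, int> *table = nullptr;  // illegal address, for illustration
        (*table)[0] = 42;                     // crashes during library load
    }
};

// In the real case this global lives in the .so; the executable's main()
// can be completely empty and the process still dies at startup.
static RegisterAtLoad g_register_at_load;

int main() { return 0; }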
On analysis of the core of a process (terminated by signal 6) on Linux, the stack backtrace shows:
Core was generated by `/opt/namsam/pac_rrc_qx_e1/bin/rrcprb'.
Program terminated with signal 6, Aborted.
#0 0x0000005555ffb004 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0 0x0000005555ffb004 in epoll_wait () from /lib64/libc.so.6
#1 0x0000005555ffafe8 in __epoll_wait_nocancel () from /lib64/libc.so.6
#2 0x0000005555ffafe8 in __epoll_wait_nocancel () from /lib64/libc.so.6
#3 0x0000005555ffafe8 in __epoll_wait_nocancel () from /lib64/libc.so.6
#4 0x0000005555ffafe8 in __epoll_wait_nocancel () from /lib64/libc.so.6
#5 0x0000005555ffafe8 in __epoll_wait_nocancel () from /lib64/libc.so.6
#6 0x0000005555ffafe8 in __epoll_wait_nocancel () from /lib64/libc.so.6
#7 0x0000005555ffafe8 in __epoll_wait_nocancel () from /lib64/libc.so.6
libc seems to have gone into some kind of loop. Did something go wrong with the application "rrcprb" here? Please help me debug this issue.
Since __epoll_wait_nocancel does not call itself, it's pretty clear that the stack trace you've got is bogus. The most likely cause is incorrect unwind descriptors in your libc.so.6.
It's also somewhat unlikely that you actually crashed in epoll_wait. Try "thread apply all where" and see if there is a "more interesting" stack trace / thread for you to look at.
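For example, a session on the core might look like the following (assuming the core file is simply named "core"; the binary path is taken from the output above):

gdb /opt/namsam/pac_rrc_qx_e1/bin/rrcprb core
(gdb) info threads
(gdb) thread apply all where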
This is on a Red Hat EL5 machine with a 2.6.18-164.2.1.el5 x86_64 kernel, using gcc 4.1.2 and gdb 7.0.
When I run my application with gdb and break in while it's running, several of my threads show the following call stack when I do a backtrace:
#0 0x000000000051d7da in pthread_cond_wait ()
#1 0x0000000100000000 in ?? ()
#2 0x0000000000c1c3b0 in ?? ()
#3 0x0000000000c1c448 in ?? ()
#4 0x00000000000007dd in ?? ()
#5 0x000000000051d630 in ?? ()
#6 0x00007fffffffdc90 in ?? ()
#7 0x000000003b1ae84b in ?? ()
#8 0x00007fffffffdd50 in ?? ()
#9 0x0000000000000000 in ?? ()
Is this a symptom of a common problem?
Is there a known issue with viewing the call stack while waiting on a condition?
The problem is that pthread_cond_wait is written in hand-coded assembly and apparently doesn't have a proper unwind descriptor (required on x86_64 to unwind the stack) in your build of glibc. This problem may have recently been fixed here.
You can try to build and install the latest glibc (note: if you screw up installation, your machine will likely become unbootable; approach with extreme caution!), or just live with "bogus" stack traces from pthread_cond_wait.
Generally, synchronization is required when multiple threads share a single resource.
In such a case, when you interrupt the program, you'll often see that only one thread is running (i.e., accessing the resource) while the other threads are waiting inside pthread_cond_wait().
So I don't think pthread_cond_wait() itself is problematic.
If your program hangs with a deadlock or its performance doesn't scale, that might be related to how pthread_cond_wait() is used.
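For reference, a minimal hedged sketch of the usual pattern (hypothetical code): threads parked in pthread_cond_wait() are simply waiting, under the mutex, for a predicate to change - the benign state described above.

// Minimal sketch of the normal condition-variable pattern: waiters sit in
// pthread_cond_wait() (with the mutex released) until the predicate changes.
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;

static void *waiter(void *) {
    pthread_mutex_lock(&lock);
    while (!ready)                        // guards against spurious wakeups
        pthread_cond_wait(&cond, &lock);  // "idle" threads sit here in backtraces
    pthread_mutex_unlock(&lock);
    return nullptr;
}

int main() {
    pthread_t tid;
    pthread_create(&tid, nullptr, waiter, nullptr);

    pthread_mutex_lock(&lock);
    ready = true;                         // change the predicate...
    pthread_cond_signal(&cond);           // ...then wake the waiter
    pthread_mutex_unlock(&lock);

    pthread_join(tid, nullptr);
    return 0;
}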
That looks like a corrupt stack trace to me. For example:
#9 0x0000000000000000 in ?? ()
There shouldn't be any code at NULL.