I have a C# application (servicebus) that runs on a private web server. Its basic job is to accept web requests and spawn other processes to handle the data packages described in those requests. The processing is often ongoing and can take weeks.
The servicebus will, occasionally, start consuming great amounts of CPU. Normally it is idle, getting 1 or 2 seconds of CPU time per day; when it gets into this strange mode, it consumes 100+% CPU all the time. At that point, if a new request comes in, apache spawns a new instance of the servicebus, so I end up with two copies running (possibly both handling processing requests; I don't know).
This is the normal process (via ps -aef):
UID PID PPID C STIME TTY TIME CMD
apache 8978 1 0 11:51 ? 00:00:01 /opt/mono/bin/mono /opt/mono/lib/mono/4.0/mod-mono-server4.exe --filename /tmp/mod_mono_server_default --applications /:/opt/ov/vespa/servicebus --nonstop
As you can see, the application is a C# program (compiled with VS 2010 for .NET 4) running via mod-mono-server4 under mono. This is a Red Hat Enterprise Linux 6.5 system.
After running for a while, that process 'went crazy' and started consuming lots of CPU, and mod-mono-server created a new instance. As you can see, I didn't find it until Monday morning, after it had used over 2 days of CPU time. Here is the new ps -aef output:
UID PID PPID C STIME TTY TIME CMD
apache 8978 1 83 Sep19 ? 2-08:26:25 /opt/mono/bin/mono /opt/mono/lib/mono/4.0/mod-mono-server4.exe --filename /tmp/mod_mono_server_default --applications /:/opt/ov/vespa/servicebus --nonstop
apache 32538 1 0 Sep21 ? 00:00:00 /opt/mono/bin/mono /opt/mono/lib/mono/4.0/mod-mono-server4.exe --filename /tmp/mod_mono_server_default --applications /:/opt/ov/vespa/servicebus --nonstop
In case you need to see how the application is configured, here is the snippet from the conf.d file for the application:
# The user and group need to be set before mod_mono.conf is loaded.
User apache
Group apache
# Service Bus setup
Include /etc/httpd/conf/mod_mono.conf
Listen 8081
<VirtualHost *:8081>
DocumentRoot /opt/ov/vespa/servicebus
MonoServerPath default /opt/mono/bin/mod-mono-server4
MonoApplications "/:/opt/ov/vespa/servicebus"
<Location "/">
SetHandler mono
Allow from all
</Location>
</VirtualHost>
The basic question is: how do I go about debugging this and finding what is wrong with my application? That, however, is a bit vague. Normally, I would put mono into debug mode and then, when it gets into this strange mode, use kill -ABRT to get a core dump out of it. I assume I could then find the for/while loop that is stuck and fix my bug. So the real question is: how do I do that? Is process PID=8978 actually my application being interpreted by mono, or is it mono running mod-mono-server4.exe? Or is it mono interpreting mod-mono-server4.exe, which in turn is interpreting servicebus? And where in the apache configuration files do I put the arguments to mono so I can get the --debug I desire?
Normally to debug I would need a process like :
/opt/mono/bin/mono --debug /opt/test/testapp.exe
So, I need to get a --debug onto the command line and sort out which PID to actually kill. Then I can use the techniques from http://www.mono-project.com/docs/debug+profile/debug/ to debug the core file.
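For reference, the core-capture procedure I have in mind looks roughly like this (a sketch; the core_pattern path is an example, the PID placeholder must be replaced, and changing core_pattern requires root):

```shell
# Allow core files in the environment that starts httpd
ulimit -c unlimited
# Example core file location/name; adjust to taste (needs root)
echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
# Force the spinning process to dump core, then inspect it
kill -ABRT <pid-of-spinning-mono>
gdb /opt/mono/bin/mono /tmp/core.mono.<pid-of-spinning-mono>
```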
NOTE: I have tried putting MonoMaxCPUTime and MonoAutoRestartTime directives into the apache conf files to cure this. When everything is nominal, they work fine. Once the process gets into this bad state (consuming a ton of CPU), the restart fails. Or rather, it succeeds in creating a new process but fails to delete the old one (basically the state I am already in).
Debugging so far: I see my log file for PID=8979 stops on 9/21 at 03:27. Given that the process often shows 200% or 300% CPU or more, that could easily be the time of the 'crash'. Looking in the apache logs, I found an unusual event at that time. A dump of the log is below:
...
[Sun Sep 21 03:28:01 2014] [notice] SIGHUP received. Attempting to restart
mod-mono-server received a shutdown message
httpd: Could not reliably determine the server's fully qualified domain name, using localhost.localdomain for ServerName
Stacktrace:
Native stacktrace:
/opt/mono/bin/mono() [0x48cc26]
/lib64/libpthread.so.0() [0x32fca0f710]
/lib64/libpthread.so.0(pthread_cond_wait+0xcc) [0x32fca0b5bc]
/opt/mono/bin/mono() [0x5a6a9c]
/opt/mono/bin/mono() [0x5ad4e9]
/opt/mono/bin/mono() [0x5116d8]
/opt/mono/bin/mono(mono_thread_manage+0x1ad) [0x5161cd]
/opt/mono/bin/mono(mono_main+0x1401) [0x46a671]
/lib64/libc.so.6(__libc_start_main+0xfd) [0x32fc21ed1d]
/opt/mono/bin/mono() [0x4123a9]
Debug info from gdb:
warning: File "/opt/mono/bin/mono-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "/usr/share/gdb/auto-load:/usr/lib/debug:/usr/bin/mono-gdb.py".
To enable execution of this file add
add-auto-load-safe-path /opt/mono/bin/mono-gdb.py
line to your configuration file "$HOME/.gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "$HOME/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
[New LWP 9148]
[New LWP 9135]
[New LWP 9000]
[New LWP 8991]
[New LWP 8990]
[New LWP 8988]
[New LWP 8987]
[New LWP 8986]
[New LWP 8985]
[New LWP 8984]
[Thread debugging using libthread_db enabled]
0x00000032fca0e75d in read () from /lib64/libpthread.so.0
11 Thread 0x7f0d8bcaf700 (LWP 8984) 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
10 Thread 0x7f0d8b2ae700 (LWP 8985) 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
9 Thread 0x7f0d8a8ad700 (LWP 8986) 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
8 Thread 0x7f0d89eac700 (LWP 8987) 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
7 Thread 0x7f0d894ab700 (LWP 8988) 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
6 Thread 0x7f0d88aaa700 (LWP 8990) 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
5 Thread 0x7f0d880a9700 (LWP 8991) 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
4 Thread 0x7f0d8713c700 (LWP 9000) 0x00000032fca0d930 in sem_wait () from /lib64/libpthread.so.0
3 Thread 0x7f0d86157700 (LWP 9135) 0x00000032fc27a983 in malloc () from /lib64/libc.so.6
2 Thread 0x7f0d8568b700 (LWP 9148) 0x00000032fc2792f0 in _int_malloc () from /lib64/libc.so.6
* 1 Thread 0x7f0d8bcb0740 (LWP 8978) 0x00000032fca0e75d in read () from /lib64/libpthread.so.0
Thread 11 (Thread 0x7f0d8bcaf700 (LWP 8984)):
#0 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000005d59f7 in GC_wait_marker ()
#2 0x00000000005dbabd in GC_help_marker ()
#3 0x00000000005d4778 in GC_mark_thread ()
#4 0x00000032fca079d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000032fc2e8b5d in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7f0d8b2ae700 (LWP 8985)):
#0 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000005d59f7 in GC_wait_marker ()
#2 0x00000000005dbabd in GC_help_marker ()
#3 0x00000000005d4778 in GC_mark_thread ()
#4 0x00000032fca079d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000032fc2e8b5d in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7f0d8a8ad700 (LWP 8986)):
#0 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000005d59f7 in GC_wait_marker ()
#2 0x00000000005dbabd in GC_help_marker ()
#3 0x00000000005d4778 in GC_mark_thread ()
#4 0x00000032fca079d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000032fc2e8b5d in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7f0d89eac700 (LWP 8987)):
#0 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000005d59f7 in GC_wait_marker ()
#2 0x00000000005dbabd in GC_help_marker ()
#3 0x00000000005d4778 in GC_mark_thread ()
#4 0x00000032fca079d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000032fc2e8b5d in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f0d894ab700 (LWP 8988)):
#0 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000005d59f7 in GC_wait_marker ()
#2 0x00000000005dbabd in GC_help_marker ()
#3 0x00000000005d4778 in GC_mark_thread ()
#4 0x00000032fca079d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000032fc2e8b5d in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f0d88aaa700 (LWP 8990)):
#0 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000005d59f7 in GC_wait_marker ()
#2 0x00000000005dbabd in GC_help_marker ()
#3 0x00000000005d4778 in GC_mark_thread ()
#4 0x00000032fca079d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000032fc2e8b5d in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f0d880a9700 (LWP 8991)):
#0 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000005d59f7 in GC_wait_marker ()
#2 0x00000000005dbabd in GC_help_marker ()
#3 0x00000000005d4778 in GC_mark_thread ()
#4 0x00000032fca079d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000032fc2e8b5d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f0d8713c700 (LWP 9000)):
#0 0x00000032fca0d930 in sem_wait () from /lib64/libpthread.so.0
#1 0x00000000005bea28 in mono_sem_wait ()
#2 0x000000000053b2bb in finalizer_thread ()
#3 0x000000000051375b in start_wrapper ()
#4 0x00000000005a8214 in thread_start_routine ()
#5 0x00000000005d565a in GC_start_routine ()
#6 0x00000032fca079d1 in start_thread () from /lib64/libpthread.so.0
#7 0x00000032fc2e8b5d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f0d86157700 (LWP 9135)):
#0 0x00000032fc27a983 in malloc () from /lib64/libc.so.6
#1 0x00000000005cd0e6 in monoeg_malloc ()
#2 0x00000000005cbef1 in monoeg_g_hash_table_insert_replace ()
#3 0x00000000005acff5 in WaitForMultipleObjectsEx ()
#4 0x0000000000512694 in ves_icall_System_Threading_WaitHandle_WaitAny_internal ()
#5 0x00000000417b0270 in ?? ()
#6 0x00007f0d68000c21 in ?? ()
#7 0x00007f0d847c4b40 in ?? ()
#8 0x00007f0d68003e00 in ?? ()
#9 0x000000004023e890 in ?? ()
#10 0x00007f0d68003e00 in ?? ()
#11 0x00007f0d86156940 in ?? ()
#12 0x00007f0d861568a0 in ?? ()
#13 0x00007f0d8767d000 in ?? ()
#14 0xffffffffffffffff in ?? ()
#15 0x00007f0d86156cc0 in ?? ()
#16 0x00007f0d847c4b40 in ?? ()
#17 0x000000004023e268 in ?? ()
#18 0x0000000000000000 in ?? ()
Thread 2 (Thread 0x7f0d8568b700 (LWP 9148)):
#0 0x00000032fc2792f0 in _int_malloc () from /lib64/libc.so.6
#1 0x00000032fc27a636 in calloc () from /lib64/libc.so.6
#2 0x00000000005cd148 in monoeg_malloc0 ()
#3 0x00000000005cbb94 in monoeg_g_hash_table_new ()
#4 0x00000000005acf94 in WaitForMultipleObjectsEx ()
#5 0x0000000000512694 in ves_icall_System_Threading_WaitHandle_WaitAny_internal ()
#6 0x00000000417b0270 in ?? ()
#7 0x00007f0d60000c21 in ?? ()
#8 0x00007f0d8767d000 in ?? ()
#9 0xffffffffffffffff in ?? ()
#10 0x000000004023e890 in ?? ()
#11 0x00007f0d68003e00 in ?? ()
#12 0x00007f0d8568a940 in ?? ()
#13 0x00007f0d8568a8a0 in ?? ()
#14 0x00007f0d8767d000 in ?? ()
#15 0xffffffffffffffff in ?? ()
#16 0x00007f0d8568acc0 in ?? ()
#17 0x00007f0d864e2990 in ?? ()
#18 0x000000004023e268 in ?? ()
#19 0x0000000000000000 in ?? ()
Thread 1 (Thread 0x7f0d8bcb0740 (LWP 8978)):
#0 0x00000032fca0e75d in read () from /lib64/libpthread.so.0
#1 0x000000000048cdb6 in mono_handle_native_sigsegv ()
#2 <signal handler called>
#3 0x00000032fca0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#4 0x00000000005a6a9c in _wapi_handle_timedwait_signal_handle ()
#5 0x00000000005ad4e9 in WaitForMultipleObjectsEx ()
#6 0x00000000005116d8 in wait_for_tids ()
#7 0x00000000005161cd in mono_thread_manage ()
#8 0x000000000046a671 in mono_main ()
#9 0x00000032fc21ed1d in __libc_start_main () from /lib64/libc.so.6
#10 0x00000000004123a9 in _start ()
=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================
Which I think means the process had a seg fault and got stuck trying to dump core or something? Or did it get a SIGABRT while processing a SIGSEGV? In either case, that's a dump of mono itself, right? I did a find across the full file system and no core was generated, so I'm not sure how apache/gdb managed this.
In case it matters I have RedHat 6.5, mono 2.10.8, gcc 4.4.7, mod-mono-server4.exe 2.10.0.0
Basically this boils down to these questions.
How do I get --debug into the mono commands that apache issues?
How do I get apache to save the core files it encounters instead of automatically running gdb on them (as I need to issue more complex commands to get at the underlying c# code)?
What does the command line for my servicebus mean? That is, why/how come mod-mono-server4 isn't a completely separate process from my servicebus? How does the MMS fit into the mono-interpreting-servicebus processing chain?
Or am I totally wrong and will the answers to those questions not help me?
First of all: Mono 2.10 is very old; you may be running into a bug that is already fixed in the latest 3.8.
As for getting --debug into your app, you can set the environment variable MONO_OPTIONS=--debug, which has the same effect as specifying it on the command line.
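With mod_mono, that environment variable can be set in the vhost configuration; a sketch, assuming the mod_mono MonoSetEnv directive and the "default" alias from the config above:

```apache
# Pass --debug to the mono runtime that mod_mono spawns for this alias
MonoSetEnv default MONO_OPTIONS=--debug
```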
Related
I have compiled my kernel with the following kernel options enabled, which should be enough.
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=y
CONFIG_DEBUG_INFO=y
I want to implement a TCP socket server in kernel space. However, when I debug my kernel, gdb can't seem to resolve function symbols; question marks are shown instead.
#0 0xffffffffb92ef58a in ?? ()
#1 0xffffffffb92ef6dd in ?? ()
#2 0xffffb4a640c73c38 in ?? ()
#3 0xffff9b0c275587c0 in ?? ()
#4 0xffff9b0c5c9fbc00 in ?? ()
#5 0xffff9b0c7c3ec480 in ?? ()
#6 0xffffffffc063d000 in ?? ()
#7 0xffffffffc063b22e in myserver ()
at /home/river/Desktop/kernel-sock/server.c:75
#8 0xffffffffc063b285 in server_init ()
at /home/river/Desktop/kernel-sock/server.c:88
#9 0xffffffffb8e0218e in ?? ()
#10 0xffff9b0c7ffeb5c0 in ?? ()
#11 0x000000000000001f in ?? ()
#12 0x85ce74a569aec8a5 in ?? ()
The current kernel version is 4.9.82.
I disabled CONFIG_DEBUG_RODATA and CONFIG_RANDOMIZE_BASE.
CONFIG_RANDOMIZE_MEMORY randomizes the virtual addresses of memory sections, including physical memory mappings, vmalloc, and vmemmap.
I think memory address randomization is the key.
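If randomization is indeed the cause, one way to rule it out without rebuilding is the nokaslr kernel boot parameter, which disables both base and memory randomization at boot. A sketch of the change, assuming a grub2 setup (the "quiet" flag is just an example of an existing option):

```shell
# /etc/default/grub -- append nokaslr, then regenerate grub.cfg and reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet nokaslr"
```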
I'm looking at a program that crashes, leaving a seemingly useless core dump. I didn't write the program, but I'm trying to find the cause.
The first strange thing is that the core dump is named after QThread instead of my executable.
Then, inside the backtrace, there's no hint of line numbers from the program itself:
$ gdb acqui ../../appli/core.QThread.31667.1448795278
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./acqui'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fcf4a1ce107 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007fcf4a1ce107 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fcf4a1cf4e8 in __GI_abort () at abort.c:89
#2 0x00007fcf4aab9b3d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007fcf4aab7bb6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007fcf4aab7c01 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007fcf4aab7e69 in __cxa_rethrow () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007fcf4b8707db in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#7 0x00007fcf4b764e99 in QThread::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#8 0x00007fcf4b76770f in ?? () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#9 0x00007fcf4ad6c0a4 in start_thread (arg=0x7fcf0b7fe700) at pthread_create.c:309
#10 0x00007fcf4a27f04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) info threads
Id Target Id Frame
16 Thread 0x7fcf297fa700 (LWP 31676) 0x00007fcf4a27650d in poll () at ../sysdeps/unix/syscall-template.S:81
15 Thread 0x7fcf28ff9700 (LWP 60474) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
14 Thread 0x7fcf08ff9700 (LWP 60516) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
13 Thread 0x7fcf0bfff700 (LWP 60513) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
12 Thread 0x7fcf3932c700 (LWP 60494) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
11 Thread 0x7fcf29ffb700 (LWP 60444) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
10 Thread 0x7fcf39b2d700 (LWP 31668) 0x00007fcf4a27650d in poll () at ../sysdeps/unix/syscall-template.S:81
9 Thread 0x7fcf2affd700 (LWP 31673) 0x00007fcf4a27650d in poll () at ../sysdeps/unix/syscall-template.S:81
8 Thread 0x7fcf2bfff700 (LWP 31671) 0x00007fcf4a27650d in poll () at ../sysdeps/unix/syscall-template.S:81
7 Thread 0x7fcf38b2b700 (LWP 60432) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
6 Thread 0x7fcf2a7fc700 (LWP 31674) 0x00007fcf4a27650d in poll () at ../sysdeps/unix/syscall-template.S:81
5 Thread 0x7fcf4d4f9780 (LWP 31667) 0x00007fcf4a27650d in poll () at ../sysdeps/unix/syscall-template.S:81
4 Thread 0x7fcf097fa700 (LWP 60430) pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
3 Thread 0x7fcf09ffb700 (LWP 31682) 0x00007fcf4a27650d in poll () at ../sysdeps/unix/syscall-template.S:81
2 Thread 0x7fcf0affd700 (LWP 31680) 0x00007fcf4a27650d in poll () at ../sysdeps/unix/syscall-template.S:81
* 1 Thread 0x7fcf0b7fe700 (LWP 31679) 0x00007fcf4a1ce107 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
I'm at a loss as to where to start. Is it a problem with using QThread? Something else? How can I enable more (or better) debugging info? The program itself is compiled with -g -ggdb.
This part:
#4 0x00007fcf4aab7c01 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007fcf4aab7e69 in __cxa_rethrow () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
... means that the code in question is re-throwing an exception, but there is no exception handler for it. So, the runtime calls std::terminate.
This is a programming error, though exactly what to do depends on your libraries and program -- maybe not re-throw, maybe install an outermost exception handler and log a message, etc.
I'm working with libexpect, but if the read times out (expected return code EXP_TIMEOUT), I instead get a crash as follows.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f1366275bb9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007f1366275bb9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f1366278fc8 in __GI_abort () at abort.c:89
#2 0x00007f13662b2e14 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f13663bf06b "*** %s ***: %s terminated\n") at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007f136634a7dc in __GI___fortify_fail (msg=<optimized out>) at fortify_fail.c:37
#4 0x00007f136634a6ed in ____longjmp_chk () at ../sysdeps/unix/sysv/linux/x86_64/____longjmp_chk.S:100
#5 0x00007f136634a649 in __longjmp_chk (env=0x1, val=1) at ../setjmp/longjmp.c:38
#6 0x00007f1366ed2a95 in ?? () from /usr/lib/libexpect.so.5.45
#7 <signal handler called>
#8 0x00007f1367334b9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#9 0x000000000044cc13 in main (argc=3, argv=0x7fffca4013b8) at main_thread.c:6750
(gdb)
As you can see, I'm using nanosleep, which, unlike usleep and sleep, is not supposed to interact with signals (http://linux.die.net/man/2/nanosleep). As I understand it, libexpect uses SIGALRM to implement its timeout, but it's unclear to me how the two threads are interacting. If I had to guess, the expect call raises a SIGALRM that interrupts the nanosleep call, but beyond that I don't know what's going on.
Thread 1:
while (stuff)
{
    /* do things */
    struct timespec delay;
    delay.tv_sec = 0;           /* tv_sec is integral; the original 0.25 truncated to 0 */
    delay.tv_nsec = 250000000;  /* 250 ms */
    nanosleep(&delay, NULL);
}
Thread 2:
switch(exp_expectl(fd, exp_glob, (char*)user_prompt, OK, exp_end))
{
case OK:
DG_LOG_DEBUG("Received user prompt");
break;
case EXP_TIMEOUT:
DG_LOG_DEBUG("Expect timed out");
goto error;
default:
DG_LOG_DEBUG("Expect failed for unknown reasons");
goto error;
}
I have done some reading about signals and sleep, but I've used sleep in multiple threads on many occasions without difficulty until now. What am I missing?
edit: misc version info
ubuntu 14.04 3.13.0-44-generic
/usr/lib/libexpect.so.5.45
code is in C
compiler is gcc (-lexpect -ltcl)
#include <tcl8.6/expect.h>
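For what it's worth, even though the crash above originates in expect's own handler, an interrupted nanosleep itself is easy to make robust. A sketch (the helper name is illustrative) that restarts the sleep with the remaining time whenever a signal, such as expect's SIGALRM, causes EINTR:

```cpp
#include <cerrno>
#include <ctime>

// Sleep 250 ms, resuming after any signal interruption. When nanosleep()
// fails with EINTR it writes the unslept time into 'rem', so we loop on
// the remainder until the full interval has elapsed.
static void sleep_250ms() {
    struct timespec req = {0, 250000000L};
    struct timespec rem;
    while (nanosleep(&req, &rem) == -1 && errno == EINTR)
        req = rem;  // continue with whatever time was left
}
```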
I have a function that converts uids to usernames:
char *uid2name (uid_t uid)
{
struct passwd *pwd = getpwuid (uid);
if (pwd)
return strdup (pwd->pw_name);
return NULL;
}
And I call it with:
name = uid2name (atoi (blabal));
Sometimes it crashes with either ABRT or BUS error:
(I know there's a conversion between signed and unsigned integers, but I tested the getpwuid function with negative values and it doesn't crash.)
Program received signal SIGBUS, Bus error.
[Switching to Thread 1088039264 (LWP 7572)]
0x00007f504afaff10 in malloc_consolidate () from /lib64/tls/libc.so.6
#0 0x00007f504afaff10 in malloc_consolidate () from /lib64/tls/libc.so.6
#1 0x00007f504afb0f17 in _int_malloc () from /lib64/tls/libc.so.6
#2 0x00007f504afb2c52 in malloc () from /lib64/tls/libc.so.6
#3 0x00007f504afa26ba in __fopen_internal () from /lib64/tls/libc.so.6
#4 0x00007f50499ce04a in internal_setent () from /lib64/libnss_files.so.2
#5 0x00007f50499ce5b0 in _nss_files_getpwuid_r ()
from /lib64/libnss_files.so.2
#6 0x00007f504afd63fd in getpwuid_r@@GLIBC_2.2.5 () from /lib64/tls/libc.so.6
#7 0x00007f504afd5d5d in getpwuid () from /lib64/tls/libc.so.6
#8 0x00007f504b783579 in uid2name (uid=Variable "uid" is not available.
) at XX.c
Any ideas what could go wrong here?
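One observation: the backtrace shows an LWP, i.e. threads, and getpwuid returns a pointer into static storage and is not thread-safe. A crash inside malloc_consolidate usually means the heap was corrupted somewhere else, so this may not be the root cause, but a reentrant sketch using getpwuid_r (the helper name and buffer size are illustrative) looks like:

```cpp
#include <pwd.h>
#include <sys/types.h>
#include <cstdlib>
#include <cstring>

// Thread-safe variant: getpwuid_r fills a caller-supplied buffer instead
// of returning glibc's shared static passwd structure.
static char *uid2name_r(uid_t uid) {
    struct passwd pwd;
    struct passwd *result = nullptr;
    char buf[16384];  // generous; sysconf(_SC_GETPW_R_SIZE_MAX) gives a hint
    if (getpwuid_r(uid, &pwd, buf, sizeof buf, &result) == 0 && result)
        return strdup(result->pw_name);
    return nullptr;
}
```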
I am developing an application on Linux where I want a backtrace of all running threads at a particular frequency, so my user-defined SIGUSR1 signal handler (for all threads) calls backtrace().
I am getting a crash (SIGSEGV) in my signal handler, originating from the backtrace() call. I have passed the correct arguments to the function, as specified on most sites.
http://linux.die.net/man/3/backtrace.
What could make backtrace() crash in this case?
To add more details:
What makes me conclude that the crash is inside backtrace is frame 14 below. onMySignal is the SIGUSR1 signal handler, and it calls backtrace.
Sample code for onMySignal (copied from the Linux documentation of backtrace):
pthread_mutex_lock(&sig_mutex);
int j, nptrs;
#define SIZE 100
void *buffer[SIZE] = {NULL};
char **strings;
nptrs = backtrace(buffer, SIZE);
pthread_mutex_unlock(&sig_mutex);
(gdb) where
#0 0x00000037bac0e9dd in raise () from
#1 0x00002aaabda936b2 in skgesigOSCrash () from
#2 0x00002aaabdd31705 in kpeDbgSignalHandler ()
#3 0x00002aaabda938c2 in skgesig_sigactionHandler ()
#4 <signal handler called>
#5 0x00000037ba030265 in raise () from
#6 0x00000037ba031d10 in abort () from
#7 0x00002b6cef82efd7 in os::abort(bool) () from
#8 0x00002b6cef98205d in VMError::report_and_die() ()
#9 0x00002b6cef835655 in JVM_handle_linux_signal ()
#10 0x00002b6cef831bae in signalHandler(int, siginfo*, void*) ()
#11 <signal handler called>
#12 0x00000037be407638 in ?? ()
#13 0x00000037be4088bb in _Unwind_Backtrace ()
#14 0x00000037ba0e5fa8 in backtrace ()
#15 0x00002aaaaae3875f in onMySignal (signum=10,info=0x4088ec80, context=0x4088eb50)
#16 <signal handler called>
#17 0x00002aaab4aa8acb in mxSession::setPartition(int)
#18 0x0000000000000001 in ?? ()
#19 0x0000000000000000 in ?? ()
(gdb)
I hope this makes the issue clearer.
@janneb
I wrapped the signal handler implementation in a mutex lock for better synchronization.
@janneb
I did not find anything in the documentation specifying whether the backtrace/backtrace_symbols APIs are async-signal-safe, or whether they should be used in a signal handler.
Still, I removed backtrace_symbols from my signal handler and don't use it anywhere, but my actual problem of a crash in backtrace() persists, and I have no clue why it is crashing.
Edit 23/06/11: more details:
(gdb) where
#0 0x00000037bac0e9dd in raise () from
#1 0x00002aaab98a36b2 in skgesigOSCrash () from
#2 0x00002aaab9b41705 in kpeDbgSignalHandler () from
#3 0x00002aaab98a38c2 in skgesig_sigactionHandler () from
#4 <signal handler called>
#5 0x00000037ba030265 in raise () from
#6 0x00000037ba031d10 in abort () from
#7 0x00002ac003803fd7 in os::abort(bool) () from
#8 0x00002ac00395705d in VMError::report_and_die() () from
#9 0x00002ac00380a655 in JVM_handle_linux_signal () from
#10 0x00002ac003806bae in signalHandler(int, siginfo*, void*) () from
#11 <signal handler called>
#12 0x00000037be407638 in ?? () from libgcc_s.so.1
#13 0x00000037be4088bb in _Unwind_Backtrace () from libgcc_s.so.1
#14 0x00000037ba0e5fa8 in backtrace () from libc.so.6
#15 0x00002aaaaae3875f in onMyBacktrace (signum=10, info=0x415d0eb0, context=0x415d0d80)
#16 <signal handler called>
#17 0x00000037ba071fa8 in _int_free () from libc.so.6
#18 0x00000000000007e0 in ?? ()
#19 0x000000005aab01a0 in ?? ()
#20 0x000000000000006f in ?? ()
#21 0x00000037ba075292 in realloc () from libc.so.6
#22 0x00002aaab6248c4e in Memory::reallocMemory(void*, unsigned long, char const*, int) ()
The crash occurred while realloc was executing, and one of the addresses, 0x00000000000007e0, looks invalid.
The documentation for signal handling defines the list of functions that are safe to call from a signal handler; you must not use any other functions, including backtrace. (Search for async-signal-safe in that document.)
What you can do is write to a pipe you have previously setup, and have a thread waiting for that pipe, which then does the backtrace.
EDIT:
OK, so the backtrace function returns the current thread's stack, so it can't be used from another thread; my idea of using a separate thread to do the backtrace won't work.
Therefore: you could try backtrace_symbols_fd from your signal handler.
As an alternative you could use gdb to get the backtrace, without having to have code in your program - and gdb can handle multiple threads easily.
Shell script to run gdb and get back traces:
#!/bin/bash
PID="$1"
[ -d "/proc/$PID" ] || PID=$(pgrep "$1")
[ -d "/proc/$PID" ] || { echo "Can't find process: $PID" >&2 ; exit 1 ; }
[ -d "$TMPDIR" ] || TMPDIR=/tmp
BATCH=$(mktemp $TMPDIR/pstack.gdb.XXXXXXXXXXXXX)
echo "thread apply all bt" >"$BATCH"
echo "quit" >>"$BATCH"
gdb "/proc/$PID/exe" "$PID" -batch -x "$BATCH" </dev/null
rm "$BATCH"
As stated by Douglas Leeder, backtrace isn't on the list of signal-safe calls, though in this case I suspect the problem is the malloc done by backtrace_symbols. Try using backtrace_symbols_fd, which does not call malloc, only write. (And drop the mutex calls; signal handlers should not block.)
EDIT
From what I can tell from the source for backtrace, it should be signal safe itself, though it is possible that you are overrunning your stack.
You may want to look at glibc's implementation of libSegFault to see how it handles this case.