Need help resolving segfault in libc-2.23.so - linux

Need help debugging shared library with gdb.
I am trying to debug a shared library and in my case it is:
libc-2.23.so
The reason is that I get these lines in dmesg:
[10081.433266] compiz[11346]: segfault at 7f30a4100010 ip 00007f309c36f44b sp 00007ffdde303aa0 error 4 in libc-2.23.so[7f309c2f1000+1bf000]
[22005.764635] compiz[16149]: segfault at 7f30e3456db0 ip 00007f30db85044b sp 00007fffaab9c0a0 error 4 in libc-2.23.so[7f30db7d2000+1bf000]
[48777.031064] compiz[25203]: segfault at 7f0b8e23b050 ip 00007f0b87edf44b sp 00007ffd51d15740 error 4 in libc-2.23.so[7f0b87e61000+1bf000]
[78850.413793] compiz[4889]: segfault at 7f60ddbf2440 ip 00007f60d598944b sp 00007ffedc5e31b0 error 4 in libc-2.23.so[7f60d590b000+1bf000]
[84583.754783] compiz[8441]: segfault at 7f5f8c3930c0 ip 00007f5f871d544b sp 00007ffc436bb5a0 error 4 in libc-2.23.so[7f5f87157000+1bf000]
[100625.457854] compiz[15619]: segfault at 7ffffa967680 ip 00007ffff722844b sp 00007fffffffdad0 error 4 in libc-2.23.so[7ffff71aa000+1bf000]
[104234.596331] compiz[19076]: segfault at 7ffffa2dc540 ip 00007ffff722844b sp 00007fffffffd810 error 4 in libc-2.23.so[7ffff71aa000+1bf000]
[112314.238115] compiz[22152]: segfault at 7ffffe232760 ip 00007ffff722844b sp 00007fffffffd810 error 4 in libc-2.23.so[7ffff71aa000+1bf000]
[130828.195732] compiz[26013]: segfault at 7ffffa966180 ip 00007ffff722844b sp 00007fffffffdad0 error 4 in libc-2.23.so[7ffff71aa000+1bf000]
[225379.026592] compiz[19275]: segfault at 7ffff821b6d0 ip 00007ffff722844b sp 00007fffffffd7c0 error 4 in libc-2.23.so[7ffff71aa000+1bf000]
The address where libc-2.23.so is loaded does not change after time stamp 100625.457854 since I ran the command:
$ echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
In order to be able to load it under gdb.
What I have done so far is establish that the segfault always occurs at the same offset from the shared library's load address.
I calculated the offset by taking instruction pointer minus load address in python:
ld = ["7f309c2f1000", "7f30db7d2000", "7f0b87e61000", "7f60d590b000", "7f5f87157000", "7ffff71aa000"]
ip = ["7f309c36f44b", "7f30db85044b", "7f0b87edf44b", "7f60d598944b", "7f5f871d544b", "7ffff722844b"]
ld_val = [int(x,16) for x in ld]
ip_val=[int(x,16) for x in ip]
ip_off=[i-s for (i,s) in zip(ip_val,ld_val)]
ip_off
[517195, 517195, 517195, 517195, 517195, 517195]
So using this information I got the offending line from executing:
$ addr2line -e /lib/x86_64-linux-gnu/libc-2.23.so -fCi 0x7e44b
malloc_consolidate
/build/glibc-9tT8Do/glibc-2.23/malloc/malloc.c:4167
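The manual offset calculation above can also be automated. The following is a minimal sketch, assuming the dmesg lines have exactly the format shown in the question; the names SEGFAULT_RE and parse_segfault are hypothetical helpers, not part of any tool. It also assumes the bracketed base address is the start of the library's first mapping, so that ip minus base can be passed to addr2line directly, which is what the question does for libc.

```python
import re

# Pattern for the kernel's user-space segfault report as seen in dmesg.
# All numeric fields, including the error code, are printed in hex.
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\[\s]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return the fields of one dmesg segfault line, plus the ip offset."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line: %r" % line)
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "error": int(m.group("err"), 16),
        "object": m.group("obj"),
        # Offset of the faulting instruction inside the mapped object.
        "offset": ip - base,
    }


if __name__ == "__main__":
    sample = ("[10081.433266] compiz[11346]: segfault at 7f30a4100010 "
              "ip 00007f309c36f44b sp 00007ffdde303aa0 error 4 "
              "in libc-2.23.so[7f309c2f1000+1bf000]")
    info = parse_segfault(sample)
    print(info["object"], hex(info["offset"]))
```

The printed offset is the value to hand to addr2line with the -e option pointing at the same object file.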
Since I run Ubuntu 16.04 I installed the sources by issuing:
$ apt-get source glibc-source
Inspecting the offending line showed that it was just a comment.
malloc.c:4167
/* Slightly streamlined version of consolidation code in free() */
inside function:
static void malloc_consolidate(mstate av)
So I am assuming I am doing something wrong here.
Any pointer on how to capture this "segfault"?

So I am assuming I am doing something wrong here.
You aren't.
The symptoms you are looking at are almost certainly (99.999%) the result of heap corruption, and since this is happening in compiz, there is little you can do except file a bug report.
To make a useful bug report, it would help if you could run compiz under Valgrind. Running it under GDB will not help.
I had gdb loaded with the library and breakpoint on line 4167 but no break even if I got a new entry in dmesg.
That means you are debugging the wrong process. Perhaps compiz forks helper processes, and one of them dies?
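Since a helper process may be the one that dies, a Valgrind run would need to follow child processes. Here is a rough sketch of how such a command line could be assembled; valgrind_command and the log file prefix are hypothetical names, and starting compiz with "--replace" is an assumption about how it is launched on this system. The Valgrind options used (--trace-children, --track-origins, --num-callers, --log-file with %p for the process ID) are standard ones.

```python
import shlex


def valgrind_command(program_argv, log_prefix="valgrind-compiz"):
    """Build a Valgrind (memcheck) command line that follows child processes."""
    return [
        "valgrind",
        "--trace-children=yes",   # also instrument processes started via exec
        "--track-origins=yes",    # report where uninitialised values came from
        "--num-callers=40",       # deeper stack traces for the bug report
        "--log-file=%s.%%p.log" % log_prefix,  # %p expands to the process ID
        *program_argv,
    ]


if __name__ == "__main__":
    # Assumption: compiz is started as "compiz --replace" on this system.
    print(shlex.join(valgrind_command(["compiz", "--replace"])))
```

Each process then writes its own log file, which makes it easier to see which one reports the invalid reads or writes.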

Related

IoTivity Zigbee communication

I am using IoTivity 1.2.0 downloaded from https://www.iotivity.org/downloads
After connecting Telegesis Dongle to the USB port, when I use dmesg command in terminal I am getting following output:
[ 1468.177799] iotivityandzigb[3807]: segfault at 0 ip 0000000000406a1a sp 00007fff58916940 error 4 in iotivityandzigbeeserver[400000+a000]
[ 1477.694759] iotivityandzigb[3817]: segfault at 0 ip 0000000000406a1a sp 00007ffe520b7e90 error 4 in iotivityandzigbeeserver[400000+a000]
[ 1574.990272] iotivityandzigb[3879]: segfault at 0 ip 0000000000406a1a sp 00007fff36eef7f0 error 4 in iotivityandzigbeeserver[400000+a000]
[ 1600.509959] iotivityandzigb[3892]: segfault at 0 ip 0000000000406a1a sp 00007ffc0a01d770 error 4 in iotivityandzigbeeserver[400000+a000]
[ 1916.457932] iotivityandzigb[3936]: segfault at 0 ip 0000000000406a1a sp 00007ffe3fc5e030 error 4 in iotivityandzigbeeserver[400000+a000]
[ 1985.551459] iotivityandzigb[4001]: segfault at 0 ip 00000000004069ff sp 00007ffc431ca040 error 4 in iotivityandzigbeeserver[400000+a000]
[ 2202.975833] iotivityandzigb[4105]: segfault at 0 ip 00000000004069ff sp 00007fff9d810760 error 4 in iotivityandzigbeeserver[400000+a000]
I see there are a lot of segmentation faults. How can I solve these problems?
If this is still happening with more recent code (aka 1.2.1, or 1.2-rel git branch, or master git branch), filing an issue in the iotivity tracker at jira.iotivity.org might get more attention from people familiar with this area of the code. Sorry this isn't really an "answer" but hopefully it will at least provide a path to some resolution.

How can I debug this User Space application crash?

I'm running a Qt 5.4.0 application on my embedded Linux system (TI AM335x based). It stops running and I'm having a hard time debugging it. This is a QtWebKit QML example (youtubeview), but other QtWebKit examples behave the same way for me, so it's something WebKit-related on my system.
When I run the application, it runs for a second or so, then ends with no messages. There is nothing reported to the syslog or dmesg either. When I kick it off with strace I can see this futex message:
futex(0x2d990, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x2d9ac, FUTEX_WAIT_PRIVATE, 7, NULL <unfinished ...>
+++ exited with 1 +++
Then it stops. Not very helpful... My next thought was to debug this with GDB, however GDB crashes when I try to run this:
-sh-4.2# gdb youtubeview
GNU gdb (GDB) 7.5
Copyright (C) 2012 Free Software Foundation, Inc.
...
(gdb) run
Starting program: /usr/share/qt5/examples/webkitqml/youtubeview/youtubeview
/home/mike/ulf_qt_450/ulf/build-ulf/out/work/armv7ahf-vfp-neon-linux-gnueabihf/gdb/7.5-r0.0/gdb-7.5/gdb/utils.c:1081: internal-error: virtual memory exhausted: can't allocate 64652911 bytes.
A problem internal to GDB has been detected,
This happens even if I set a breakpoint at main first: as soon as the program starts running, GDB gets stuck and runs out of memory.
Are there other tools or techniques that can be used here to help isolate the issue?
Perhaps arguments to GDB to limit memory use or give some more information about why this Qt program made it crash?
Perhaps some FDs or system variables I could use to figure out why the FUTEX is being held and failing?
I'm not sure where to take this problem right now.
The Qt code itself is pretty simple, and I don't anticipate any issues in here:
#include <QGuiApplication>
#include <QQuickView>
int main(int argc, char* argv[])
{
    QGuiApplication app(argc, argv);
    QQuickView view;
    view.setSource(QUrl("qrc:///" QWEBKIT_EXAMPLE_NAME ".qml"));
    view.setResizeMode(QQuickView::SizeRootObjectToView);
    view.show();
    return app.exec();
}
Running gdb on the device, especially with a huge library such as WebKit, is bound to give you out-of-memory errors.
Instead, run gdbserver on the device, and connect to it via gdb running on the host machine, using the toolchain's cross-gdb for that. In that scenario, the gdb on the host loads all the debug information, while the gdbserver on the device needs almost no memory.
It is even possible to have the separate debug information available on the host and stripped libraries on the device.
Please note that parts of WebKit are always built in release mode, even if the rest of Qt was built in debug mode. If you are going to debug into WebKit, you might want to change that in the build system.
Here is a minimal example:
Device:
# gdbserver 192.168.1.2:12345 myapp
Process myapp created; pid = 989
Listening on port 12345
Host:
# arm-none-linux-gnueabi-gdb myapp
GNU gdb (Sourcery G++ Lite 2009q1-203) 6.8.50.20081022-cvs
(gdb) set solib-absolute-prefix /opt/targetroot
(gdb) target remote 192.168.1.42:12345
Remote debugging using 192.168.1.42:12345
(gdb) start
The "remote" target does not support "run". Try "help target" or "continue".
(gdb) break main
Breakpoint 1 at 0x1ab9c: file myapp/main.cpp, line 12.
(gdb) cont
Continuing.
Breakpoint 1, main (argc=1, argv=0xbecfedb4) at myapp/main.cpp:12
12 QApplication app(argc, argv, QApplication::GuiServer);
And you are right that it looks like a problem in QtWebKit itself, not in your application. Good luck!

How can I get GDB to tell me what address caused a segfault in a core dump file

I know that when gdb is attached to a process, I can use p $_siginfo._sifields._sigfault.si_addr to show which address caused a segfault.
But how do I do this with a core dump file?
I tried it on a core dump file:
(gdb) p $_siginfo._sifields._sigfault.si_addr
Unable to read siginfo

error for gdbserver

Does anybody know what this error message means?
gdbserver[949] segfault at 81c ip 0000081c sp bfeef918 error 4 in gdbserver [8048000+1c0000]
segmentation fault
Thanks,
The error message (presumably from /var/log/messages) means that gdbserver crashed (received SIGSEGV).
I think it happened at instruction 0x81c, which likely means that it called a function through a bad function pointer.
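The "error 4" field in that message is the x86 page-fault error code, a small bit mask. The sketch below decodes the five low bits as they are architecturally defined; decode_page_fault is a hypothetical helper name, and the wording of the descriptions is mine.

```python
# Low bits of the x86 page-fault error code: (mask, text if set, text if clear).
PF_BITS = (
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
)
# Bits that only add information when they are set.
PF_EXTRA = (
    (0x08, "reserved bit set in page table"),
    (0x10, "instruction fetch"),
)


def decode_page_fault(code):
    """Translate the 'error N' value from a dmesg segfault line into words."""
    parts = [on if code & mask else off for mask, on, off in PF_BITS]
    parts += [text for mask, text in PF_EXTRA if code & mask]
    return parts


if __name__ == "__main__":
    # The value is printed in hex in dmesg, so parse it with base 16.
    print(decode_page_fault(int("4", 16)))
```

For error 4 this yields a read from user mode of a page that is not mapped, which fits a jump or read through a bad pointer.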

how do you diagnose a kernel oops?

Given a Linux kernel oops, how do you go about diagnosing the problem? In the output I can see a stack trace which seems to give some clues. Are there any tools that would help find the problem? What basic procedures do you follow to track it down?
Unable to handle kernel paging request for data at address 0x33343a31
Faulting instruction address: 0xc50659ec
Oops: Kernel access of bad area, sig: 11 [#1]
tpsslr3
Modules linked in: datalog(P) manet(P) vnet wlan_wep wlan_scan_sta ath_rate_sample ath_pci wlan ath_hal(P)
NIP: c50659ec LR: c5065f04 CTR: c00192e8
REGS: c2aff920 TRAP: 0300 Tainted: P (2.6.25.16-dirty)
MSR: 00009032 CR: 22082444 XER: 20000000
DAR: 33343a31, DSISR: 20000000
TASK = c2e6e3f0[1486] 'datalogd' THREAD: c2afe000
GPR00: c5065f04 c2aff9d0 c2e6e3f0 00000000 00000001 00000001 00000000 0000b3f9
GPR08: 3a33340a c5069624 c5068d14 33343a31 82082482 1001f2b4 c1228000 c1230000
GPR16: c60f0000 000004a8 c59abbe6 0000002f c1228360 c340d6b0 c5070000 00000001
GPR24: c2aff9e0 c5070000 00000000 00000000 00000003 c2cc2780 c2affae8 0000000f
NIP [c50659ec] mesh_packet_in+0x3d8/0xdac [manet]
LR [c5065f04] mesh_packet_in+0x8f0/0xdac [manet]
Call Trace:
[c2aff9d0] [c5065f04] mesh_packet_in+0x8f0/0xdac [manet] (unreliable)
[c2affad0] [c5061ff8] IF_netif_rx+0xa0/0xb0 [manet]
[c2affae0] [c01925e4] netif_receive_skb+0x34/0x3c4
[c2affb10] [c60b5f74] netif_receive_skb_debug+0x2c/0x3c [wlan]
[c2affb20] [c60bc7a4] ieee80211_deliver_data+0x1b4/0x380 [wlan]
[c2affb60] [c60bd420] ieee80211_input+0xab0/0x1bec [wlan]
[c2affbf0] [c6105b04] ath_rx_poll+0x884/0xab8 [ath_pci]
[c2affc90] [c018ec20] net_rx_action+0xd8/0x1ac
[c2affcb0] [c00260b4] __do_softirq+0x7c/0xf4
[c2affce0] [c0005754] do_softirq+0x58/0x5c
[c2affcf0] [c0025eb4] irq_exit+0x48/0x58
[c2affd00] [c000627c] do_IRQ+0xa4/0xc4
[c2affd10] [c00106f8] ret_from_except+0x0/0x14
--- Exception: 501 at __delay+0x78/0x98
LR = cfi_amdstd_write_buffers+0x618/0x7ac
[c2affdd0] [c0163670] cfi_amdstd_write_buffers+0x504/0x7ac (unreliable)
[c2affe50] [c015a2d0] concat_write+0xe4/0x140
[c2affe80] [c0158ff4] part_write+0xd0/0xf0
[c2affe90] [c015bdf0] mtd_write+0x170/0x2a8
[c2affef0] [c0073898] vfs_write+0xcc/0x16c
[c2afff10] [c0073f2c] sys_write+0x4c/0x90
[c2afff40] [c0010060] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xfd98a50
LR = 0x10003840
Instruction dump:
419d02a0 98010009 800100a4 2f800003 419e0508 2f170000 419a0098 3d20c507
a0e1002e 81699624 39299624 7f8b4800 419e007c a0610016 7d264b78
Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 1 seconds..
An Oops gives a bunch of information useful in diagnosing a crash. It starts with the address of the crash, the reason ("access of bad area") and the contents of the registers. The call trace answers the question "how did we get here". The first item in the list happened most recently. Working backwards, an interrupt happened (do_IRQ) because the Atheros WiFi adapter received a packet (ath_rx_poll). The routine passed it to the generic WiFi code (ieee80211_input) which in turn passed it up to the network stack (netif_receive_skb).
To figure out the exact code causing the problem, you can run
gdb /usr/src/linux/vmlinux
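and then disassemble the function in question, which should be mesh_packet_in(): the Oops reports the faulting instruction (NIP 0xc50659ec) as mesh_packet_in+0x3d8 and the link register (LR 0xc5065f04) as mesh_packet_in+0x8f0, so both addresses fall inside that function. Because mesh_packet_in() belongs to the manet module rather than to vmlinux, gdb may need that module's object file to resolve it. You might also try the gdb command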
(gdb) info line 0xc50659ec
to figure out which function contains this address.
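The symbol+offset/size annotations that the Oops prints next to NIP and LR can be used to check which function an address belongs to. This is a small sketch, assuming the line format shown in the Oops above; SYM_RE and function_start are hypothetical names.

```python
import re

# Matches annotated addresses in an Oops, for example:
#   NIP [c50659ec] mesh_packet_in+0x3d8/0xdac [manet]
SYM_RE = re.compile(
    r"\[(?P<addr>[0-9a-f]{8,16})\]\s+(?P<sym>\w+)"
    r"\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def function_start(line):
    """Return (symbol, start address, function size) for one Oops line."""
    m = SYM_RE.search(line)
    if m is None:
        raise ValueError("no symbol+offset annotation in: %r" % line)
    addr = int(m.group("addr"), 16)
    off = int(m.group("off"), 16)
    # The annotation means addr == start + off, so start = addr - off.
    return m.group("sym"), addr - off, int(m.group("size"), 16)


if __name__ == "__main__":
    for line in ("NIP [c50659ec] mesh_packet_in+0x3d8/0xdac [manet]",
                 "LR [c5065f04] mesh_packet_in+0x8f0/0xdac [manet]"):
        sym, start, size = function_start(line)
        print(sym, hex(start), hex(start + size))
```

Both lines give the same start address, 0xc5065614, so NIP and LR point into the same function.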
You should first try to find the source of the code that has crashed. In this specific case, the Oops reports that the crash happened in mesh_packet_in of the manet module, at offset 0x3d8 (the NIP line; offset 0x8f0 belongs to the link register, LR). It also prints an instruction dump around that point (419d02a0 98010009 ...). So disassemble the module with "objdump -d" to confirm that the reported function and offset are correct. Then check the source to see what it is doing; you can use the register list to confirm again that you are looking at the right instruction.
When you know what C statement is faulting, you need to read the source to find out where the bogus data were coming from.
http://oss.sgi.com/projects/kdb/
Install this into your kernel; then, when it oopses, you'll be thrown into a gdb-like interface that you can poke around in. However, it looks like the manet module is dereferencing a bad pointer.
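One quick check on a suspicious pointer value is to read its bytes as text. The sketch below does that for the faulting address (DAR) and for GPR08 from the Oops above; address_as_ascii is a hypothetical helper, and treating the values as 32-bit big-endian words is an assumption based on the PowerPC-style register dump.

```python
def address_as_ascii(value, width=4):
    """Render the bytes of an address as ASCII, with '.' for non-printables."""
    raw = value.to_bytes(width, "big")  # assumption: big-endian 32-bit word
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


if __name__ == "__main__":
    print(repr(address_as_ascii(0x33343A31)))  # DAR, also in GPR11
    print(repr(address_as_ascii(0x3A33340A)))  # GPR08
```

The faulting address reads as "34:1", which suggests, but does not prove, that string data overwrote a pointer before the module dereferenced it.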
