how do you diagnose a kernel oops? - linux

Given a Linux kernel oops, how do you go about diagnosing the problem? In the output I can see a stack trace, which seems to give some clues. Are there any tools that would help find the problem? What basic procedures do you follow to track it down?
Unable to handle kernel paging request for data at address 0x33343a31
Faulting instruction address: 0xc50659ec
Oops: Kernel access of bad area, sig: 11 [#1]
tpsslr3
Modules linked in: datalog(P) manet(P) vnet wlan_wep wlan_scan_sta ath_rate_sample ath_pci wlan ath_hal(P)
NIP: c50659ec LR: c5065f04 CTR: c00192e8
REGS: c2aff920 TRAP: 0300 Tainted: P (2.6.25.16-dirty)
MSR: 00009032 CR: 22082444 XER: 20000000
DAR: 33343a31, DSISR: 20000000
TASK = c2e6e3f0[1486] 'datalogd' THREAD: c2afe000
GPR00: c5065f04 c2aff9d0 c2e6e3f0 00000000 00000001 00000001 00000000 0000b3f9
GPR08: 3a33340a c5069624 c5068d14 33343a31 82082482 1001f2b4 c1228000 c1230000
GPR16: c60f0000 000004a8 c59abbe6 0000002f c1228360 c340d6b0 c5070000 00000001
GPR24: c2aff9e0 c5070000 00000000 00000000 00000003 c2cc2780 c2affae8 0000000f
NIP [c50659ec] mesh_packet_in+0x3d8/0xdac [manet]
LR [c5065f04] mesh_packet_in+0x8f0/0xdac [manet]
Call Trace:
[c2aff9d0] [c5065f04] mesh_packet_in+0x8f0/0xdac [manet] (unreliable)
[c2affad0] [c5061ff8] IF_netif_rx+0xa0/0xb0 [manet]
[c2affae0] [c01925e4] netif_receive_skb+0x34/0x3c4
[c2affb10] [c60b5f74] netif_receive_skb_debug+0x2c/0x3c [wlan]
[c2affb20] [c60bc7a4] ieee80211_deliver_data+0x1b4/0x380 [wlan]
[c2affb60] [c60bd420] ieee80211_input+0xab0/0x1bec [wlan]
[c2affbf0] [c6105b04] ath_rx_poll+0x884/0xab8 [ath_pci]
[c2affc90] [c018ec20] net_rx_action+0xd8/0x1ac
[c2affcb0] [c00260b4] __do_softirq+0x7c/0xf4
[c2affce0] [c0005754] do_softirq+0x58/0x5c
[c2affcf0] [c0025eb4] irq_exit+0x48/0x58
[c2affd00] [c000627c] do_IRQ+0xa4/0xc4
[c2affd10] [c00106f8] ret_from_except+0x0/0x14
--- Exception: 501 at __delay+0x78/0x98
LR = cfi_amdstd_write_buffers+0x618/0x7ac
[c2affdd0] [c0163670] cfi_amdstd_write_buffers+0x504/0x7ac (unreliable)
[c2affe50] [c015a2d0] concat_write+0xe4/0x140
[c2affe80] [c0158ff4] part_write+0xd0/0xf0
[c2affe90] [c015bdf0] mtd_write+0x170/0x2a8
[c2affef0] [c0073898] vfs_write+0xcc/0x16c
[c2afff10] [c0073f2c] sys_write+0x4c/0x90
[c2afff40] [c0010060] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xfd98a50
LR = 0x10003840
Instruction dump:
419d02a0 98010009 800100a4 2f800003 419e0508 2f170000 419a0098 3d20c507
a0e1002e 81699624 39299624 7f8b4800 419e007c a0610016 7d264b78
Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 1 seconds..

An Oops gives a bunch of information useful in diagnosing a crash. It starts with the address of the crash, the reason ("access of bad area") and the contents of the registers. The call trace answers the question "how did we get here". The first item in the list happened most recently. Working backwards, an interrupt happened (do_IRQ) because the Atheros WiFi adapter received a packet (ath_rx_poll). The routine passed it to the generic WiFi code (ieee80211_input) which in turn passed it up to the network stack (netif_receive_skb).
To figure out the exact code causing the problem, you can run
gdb /usr/src/linux/vmlinux
and then disassemble the function in question, which is most likely mesh_packet_in(): the oops resolves the faulting address 0xc50659ec to mesh_packet_in+0x3d8 and the link register 0xc5065f04 to mesh_packet_in+0x8f0, but it is worth double-checking, since the top entry of the call trace is flagged as unreliable. You can also try the gdb command
(gdb) info line *0xc50659ec
to figure out which function really contains this address.
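Note that mesh_packet_in() lives in the manet module rather than in vmlinux, so gdb also needs the module object and its load address before either command will resolve module addresses. A minimal sketch, assuming the module was built with debug info; the module path and the 0xc5060000 value are hypothetical (the real .text address can be read from sysfs as root):
$ cat /sys/module/manet/sections/.text
0xc5060000
$ gdb /usr/src/linux/vmlinux
(gdb) add-symbol-file /path/to/manet.ko 0xc5060000
(gdb) info line *0xc50659ec
(gdb) disassemble mesh_packet_in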

You should first try to find the source of the code that has crashed. In this specific case, the oops reports that the crash happened in mesh_packet_in of the manet driver, at offset 0x3d8 (that is the NIP; the +0x8f0 entry at the top of the call trace is the link register and is marked unreliable). It also reports that the instructions around that point are 419d02a0 98010009 ... So inspect the module with "objdump -d" to confirm that the reported function/offset is correct. Then check the source to see what it is doing; you can use the register dump to confirm again that you are looking at the right instruction.
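A rough sketch of that objdump check (the module path is hypothetical, and -S only interleaves source if the module was built with debug info):
$ objdump -d -S /lib/modules/$(uname -r)/extra/manet.ko > manet.dis
$ grep -n "<mesh_packet_in>:" manet.dis
# add 0x3d8 to the function's start and compare the bytes there with the
# "Instruction dump:" words printed in the oops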
When you know what C statement is faulting, you need to read the source to find out where the bogus data were coming from.

http://oss.sgi.com/projects/kdb/
Install this into your kernel; then, when it oopses, you'll be dropped into a gdb-like interface that you can poke around in. In this case, though, it already looks like the manet module is dereferencing a bad pointer.

Related

Need help resolving segfault in libc-2.23.so

Need help debugging shared library with gdb.
I am trying to debug a shared library and in my case it is:
libc-2.23.so
The reason is that I get these lines in dmesg:
[10081.433266] compiz[11346]: segfault at 7f30a4100010 ip 00007f309c36f44b sp 00007ffdde303aa0 error 4 in libc-2.23.so[7f309c2f1000+1bf000]
[22005.764635] compiz[16149]: segfault at 7f30e3456db0 ip 00007f30db85044b sp 00007fffaab9c0a0 error 4 in libc-2.23.so[7f30db7d2000+1bf000]
[48777.031064] compiz[25203]: segfault at 7f0b8e23b050 ip 00007f0b87edf44b sp 00007ffd51d15740 error 4 in libc-2.23.so[7f0b87e61000+1bf000]
[78850.413793] compiz[4889]: segfault at 7f60ddbf2440 ip 00007f60d598944b sp 00007ffedc5e31b0 error 4 in libc-2.23.so[7f60d590b000+1bf000]
[84583.754783] compiz[8441]: segfault at 7f5f8c3930c0 ip 00007f5f871d544b sp 00007ffc436bb5a0 error 4 in libc-2.23.so[7f5f87157000+1bf000]
[100625.457854] compiz[15619]: segfault at 7ffffa967680 ip 00007ffff722844b sp 00007fffffffdad0 error 4 in libc-2.23.so[7ffff71aa000+1bf000]
[104234.596331] compiz[19076]: segfault at 7ffffa2dc540 ip 00007ffff722844b sp 00007fffffffd810 error 4 in libc-2.23.so[7ffff71aa000+1bf000]
[112314.238115] compiz[22152]: segfault at 7ffffe232760 ip 00007ffff722844b sp 00007fffffffd810 error 4 in libc-2.23.so[7ffff71aa000+1bf000]
[130828.195732] compiz[26013]: segfault at 7ffffa966180 ip 00007ffff722844b sp 00007fffffffdad0 error 4 in libc-2.23.so[7ffff71aa000+1bf000]
[225379.026592] compiz[19275]: segfault at 7ffff821b6d0 ip 00007ffff722844b sp 00007fffffffd7c0 error 4 in libc-2.23.so[7ffff71aa000+1bf000]
The address where libc-2.23.so is loaded does not change after time stamp 100625.457854 because at that point I ran the command:
$ echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
in order to be able to load it under gdb.
What I have done so far is to establish that the segfault always occurs at the same offset from the shared library's load address.
I calculated the offset by subtracting the load address from the instruction pointer in Python:
ld = ["7f309c2f1000", "7f30db7d2000", "7f0b87e61000", "7f60d590b000", "7f5f87157000", "7ffff71aa000"]
ip = ["7f309c36f44b", "7f30db85044b", "7f0b87edf44b", "7f60d598944b", "7f5f871d544b", "7ffff722844b"]
ld_val = [int(x,16) for x in ld]
ip_val=[int(x,16) for x in ip]
ip_off=[i-s for (i,s) in zip(ip_val,ld_val)]
ip_off
[517195, 517195, 517195, 517195, 517195, 517195]
Using this information, I got the offending line by executing:
$ addr2line -e /lib/x86_64-linux-gnu/libc-2.23.so -fCi 0x7e44b
malloc_consolidate
/build/glibc-9tT8Do/glibc-2.23/malloc/malloc.c:4167
Since I run Ubuntu 16.04 I installed the sources by issuing:
$ apt-get source glibc-source
Inspecting the offending line showed that it was just a comment.
malloc.c:4167
/* Slightly streamlined version of consolidation code in free() */
inside function:
static void malloc_consolidate(mstate av)
So I am assuming I am doing something wrong here.
Any pointers on how to capture this "segfault"?
So I am assuming I am doing something wrong here.
You aren't.
The symptoms you are looking at are, with 99.999% probability, the result of heap corruption, and since this is happening in compiz, there is little you can do except file a bug report.
To make a useful bug report, it would help if you could run compiz under Valgrind. Running it under GDB will not help.
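If you do try that, a minimal sketch of the idea (the exact way to restart compiz varies by desktop session, and it will run very slowly under Valgrind):
$ valgrind --tool=memcheck --track-origins=yes --log-file=compiz-vg.log compiz --replace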
I had gdb attached with the library loaded and a breakpoint on line 4167, but it never triggered even though I got a new entry in dmesg.
That means you are debugging the wrong process. Perhaps compiz forks helper processes, and one of them dies?

How to locate a thread's stack base from a core file?

I have a core dump that shows a thread dying from a SIGBUS signal while executing mov %r15d,0xa0(%rsp). That seems to tell me that it died because it ran out of thread stack.
But how can I prove it? I cannot seem to find a GDB command to display thread information besides thread backtraces. In this case there is no backtrace. It shows the current function and then 0x0000000000000000. Yet another indication of stack corruption, I think.
I don't have a copy of /proc/[pid]/maps from when the program died. Is there anything in GDB or in the core file I can look at to find the base of each thread stack?
That seems to tell me that it died because it ran out of thread stack.
Very likely.
But how can I prove it?
(gdb) p/x $rsp
$1 = 0x7fffc5791000
(gdb) info target
Symbols from "a.out".
Local core dump file:
`core', file type elf64-x86-64.
0x0000000000400000 - 0x0000000000401000 is load1
...
0x00007faaf2240000 - 0x00007faaf2241000 is load14
0x00007fffc5791000 - 0x00007fffc5f91000 is load15
0x00007fffc5faf000 - 0x00007fffc5fb0000 is load16
0xffffffffff600000 - 0xffffffffff600000 is load17
Local exec file:
...
Note how $rsp is at the (low) end of the load15 segment, and there is no mapping that "covers" $rsp-8.
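The same mapping list can also be read straight out of the core file without starting GDB; a sketch (the core file name is a placeholder):
$ readelf -l core | grep -A1 LOAD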

How to locate a bug from a panic

Hi all.
I'm a kernel newbie. I want to know how to get useful information from a panic, such as which line or which function is wrong.
For example, the following is a panic output about usb hiddev; how do I read it? Thanks.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffff813b4aa1>] free_async+0xa1/0x100
PGD 2326c9067 PUD 230f4c067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1d.2/usb8/8-2/speed
CPU 3
Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp l]
Pid: 2400, comm: lsusb Tainted: G I--------------- 2.6.32-296.el664fixes.3.x86_64 #1 Dell Inc. OptiPlN
RIP: 0010:[<ffffffff813b4aa1>] do_IRQ: 0.97 No irq handler for vector (irq -1)
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 186 is 53003c)
Mounting proc filesystem
Mounting sysfs filesystem
Creating /dev
Creating initial device nodes
setfont: KDFONTOP: Invalid argument
Free memory/Total memory (free %): 78672 / 114884 ( 68.4795 )
Loading dm-mod.ko module
Loading dm-log.ko module
Loading dm-region-hash.ko module
Loading dm-mirror.ko module
Loading dm-zero.ko module
Loading dm-snapshot.ko module
Loading freq_table.ko module
Loading mperf.ko module
Loading ipt_REJECT.ko module
Loading nf_defrag_ipv4.ko module
Loading ip_tables.ko module
Loading nf_conntrack.ko module
Loading ip6_tables.ko module
Loading ipv6.ko module
Loading fat.ko module
Loading macvlan.ko module
Loading tun.ko module
Loading kvm.ko module
Loading uinput.ko module
Loading parport.ko module
Loading dcdbas.ko module
Loading microcode.ko module
The panic itself is actually quite accurate as it is:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffff813b4aa1>] free_async+0xa1/0x100
This already tells you that the function where the problem happened is free_async, that the function is 0x100 bytes long, and that the crash happened at offset 0xa1 into it. You need to map that offset to the exact line of code, but how to do that depends a bit on your environment.
Sometimes a manual code review will already show which line does the pointer manipulation, so you may find it just by reading that function.
The next question then is: why do you have a NULL pointer there?
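For example, with a vmlinux that matches the running kernel, gdb can translate the offset directly; a sketch (the debuginfo path shown is the usual location for a RHEL-style kernel-debuginfo package, so treat it as an assumption):
$ gdb /usr/lib/debug/lib/modules/$(uname -r)/vmlinux
(gdb) list *(free_async+0xa1)
(gdb) disassemble free_async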

How to print the last received signal in GDB?

When loading a core dump into GDB, the reason why it crashed is displayed automatically. For example
Program terminated with signal 11, Segmentation fault.
Is there any way to get the information again?
The thing is that I'm writing a script which needs this information, but the signal is only printed when the core dump is loaded, and I can't access the information later on.
Is there really no command for such an important feature?
To print the information about the last signal, execute
p $_siginfo
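Since the question mentions scripting, the same value can also be read non-interactively; a sketch (the binary and core file names are placeholders, and $_siginfo is only populated on targets that support it, such as Linux), with the line of interest from the output shown below the command:
$ gdb -batch -ex 'p $_siginfo.si_signo' ./fault core.8577
$1 = 11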
If you know what the core file name is, you can issue the target core command which respecifies the target core file:
(gdb) target core core.8577
[New LWP 8577]
Core was generated by `./fault'.
Program terminated with signal 11, Segmentation fault.
#0 0x080483d5 in main () at fault.c:10
10 *ptr = '\123';
(gdb)
As for the implied question (is there something like an "info last signal" command?): there does not seem to be one.
The core file's name can be obtained from the command info target:
(gdb) info target
Symbols from "/home/wally/.bin/fault".
Local core dump file:
`/home/wally/.bin/core.8577', file type elf32-i386.
0x00da1000 - 0x00da2000 is load1
0x08048000 - 0x08049000 is load2
...
0xbfe8d000 - 0xbfeaf000 is load14
Local exec file:
`/home/wally/.bin/fault', file type elf32-i386.
Entry point: 0x8048300
0x08048134 - 0x08048147 is .interp
0x08048148 - 0x08048168 is .note.ABI-tag
0x08048168 - 0x0804818c is .note.gnu.build-id
0x0804818c - 0x080481ac is .gnu.hash
0x080481ac - 0x080481fc is .dynsym
0x080481fc - 0x08048246 is .dynstr
...

Getting SIGBUS (Bus error) # 0 (0) killed by SIGBUS (core dumped) in Red Hat

I have a process that works perfectly on the same machine under two accounts, but when I copy the process to another account and run it, I get a core dump.
When I run the process under strace, at the end I get:
--- SIGBUS (Bus error) # 0 (0) ---
+++ killed by SIGBUS (core dumped) +++
When I open the core dump I get:
#0 0x000000360046fed3 in malloc_consolidate () from /lib64/libc.so.6
#1 0x00000036004723fd in _int_malloc () from /lib64/libc.so.6
#2 0x000000360047402a in malloc () from /lib64/libc.so.6
#3 0x00000036004616ba in __fopen_internal () from /lib64/libc.so.6
#4 0x0000000000fe9652 in LogMngr::OpenFile (this=0x2aaaaad17010, iLogIndex=0) at LogMngr.c:801
I can see it is something to do with opening the file for logging, but why does it happen only under one account while the other is fine?
You can get a SIGBUS from an unaligned memory access. Are you using something like mmap, shared memory regions, or something similar?
Any core dump inside malloc always indicates heap corruption, and heap corruption in general is sneaky like that: it may never show up on machine A, sometimes show up on machine B, and always show up on machine C.
Valgrind will likely point you straight at the problem.
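If you can rerun the process yourself, two quick ways to catch the corruption closer to where it happens (the binary name is a placeholder; MALLOC_CHECK_=2 makes glibc abort as soon as its own heap consistency checks fail):
$ MALLOC_CHECK_=2 ./your_process
$ valgrind --tool=memcheck --track-origins=yes ./your_process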
