How to locale a bug from panic - linux

all.
I'm a kernel newbie. I want to know how to get useful infomations from painc such as which line or which function is wrong.
For example, following is a panic-output about usb hiddev, how to read it? Thanks.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffff813b4aa1>] free_async+0xa1/0x100
PGD 2326c9067 PUD 230f4c067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1d.2/usb8/8-2/speed
CPU 3
Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp l]
Pid: 2400, comm: lsusb Tainted: G I--------------- 2.6.32-296.el664fixes.3.x86_64 #1 Dell Inc. OptiPlN
RIP: 0010:[<ffffffff813b4aa1>] do_IRQ: 0.97 No irq handler for vector (irq -1)
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 186 is 53003c)
�Mounting proc filesystem
Mounting sysfs filesystem
Creating /dev
Creating initial device nodes
setfont: KDFONTOP: Invalid argument
Free memory/Total memory (free %): 78672 / 114884 ( 68.4795 )
Loading dm-mod.ko module
Loading dm-log.ko module
Loading dm-region-hash.ko module
Loading dm-mirror.ko module
Loading dm-zero.ko module
Loading dm-snapshot.ko module
Loading freq_table.ko module
Loading mperf.ko module
Loading ipt_REJECT.ko module
Loading nf_defrag_ipv4.ko module
Loading ip_tables.ko module
Loading nf_conntrack.ko module
Loading ip6_tables.ko module
Loading ipv6.ko module
Loading fat.ko module
Loading macvlan.ko module
Loading tun.ko module
Loading kvm.ko module
Loading uinput.ko module
Loading parport.ko module
Loading dcdbas.ko module
Loading microcode.ko module

The panic itself is actually quite accurate as is;
BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [] free_async+0xa1/0x100
Tells already that the function where the problem happened is free_async and that function is exactly 0x100 bytes long and the crash happened at offset 0xa1. You need to map that offset into the exact line of code, but that now depends a bit on your environment how to do it.
Sometimes manual code review already will show what line has pointer manipulations, so you can do it just by reviewing that function.
Then the next question is that, why do you have a NULL-pointer there?

Related

Error 'Unknown symbol __copy_to_user' during module load

I have NAS Terramaster F4-210 based on arm64 Realtek RTD1296 CPU. It has custom OpenWrt 15.05.1 based firmware with 4.4.18 linux kernel. I want to create kernel module to use my zigbee stick (cdc-acm - USB Modem (CDC ACM) support) and run homeassistant on it.
# uname -a
Linux TNAS-BA68 4.4.18-g8bcbd8a-dirty #1327 SMP Mon Aug 31 11:55:52 CST 2020 aarch64 GNU/Linux
I downloaded appropriate kernel, created some config and after installing my newly compiled module I get the following errors in kernel log:
# modprobe cdc-acm
1 module could not be probed
- cdc-acm
# dmesg
...
cdc_acm: Unknown symbol __copy_to_user (err 0)
cdc_acm: Unknown symbol __copy_from_user (err 0)
cdc_acm: Unknown symbol _mcount (err 0)
As far as I understand that means module expects copy_to_user, copy_from_user, mcount to be part of the kernel (or other loaded module). But kernel doesn't export these symbols:
# cat /proc/kallsyms | grep copy_to_user
ffffff80082923f0 T copy_to_user_page
ffffff80087d2600 T __arch_copy_to_user
ffffff80087e67b0 t kfifo_copy_to_user
ffffff8008871854 T my_copy_to_user
File arch/arm64/include/asm/uaccess.h has definition of copy_to_user:
extern unsigned long __must_check __copy_to_user(void __user *to, const void *from, unsigned long n);
...
static inline unsigned long __must_check copy_to_user(void __user *to, const void *from, unsigned long n)
File arch/arm64/lib/copy_to_user.S contains source code of __copy_to_user:
ENTRY(__copy_to_user)
ALTERNATIVE("nop", __stringify(SET_PSTATE_PAN(0)), ARM64_HAS_PAN, CONFIG_ARM64_PAN)
...
ENDPROC(__copy_to_user)
My initial idea was I have bad kernel configuration or bad toolchain. So I used toolchain and OpenWrt+kernel configuration from banana pi W2 board which has the same CPU. Without any luck but compile warning had disappeared.
So can somebody please confirm that the problem can't be solved by applying some different kernel configuration or proper toolchain. Instead kernel source code must be modified. E.g. instead of __copy_to_user one of __arch_copy_to_user or my_copy_to_user must be used.
So my assumption is: Terramaster took kernel sources, modified (probably used __arch_copy_to_user instead of __copy_to_user) and then compiled sources.
BTW: I also checked kernel sources and didn't find __arch_copy_to_user. Does that mean it was introduced by kernel sources modifications or it still can be present thereby usage of some nasty defs.

Running Linux 4.9 on Cortex-M4 STM32F4 (29I-DISC1)

I spend a few days trying to understand but I'm stuck.
I get no more than a "Starting kernel..." message after entering 'bootm 8100000' on my STM32F429I-DISC1 board.
Before I update uboot from 2011 to 2016 It was a "Starting Kernel..." + UNHANDED EXCEPTION HARDFAULT, but now I have just the "Starting Kernel..." message.
MCU is a stm32F429, 2MB Flash + ext. 8MB RAM.
Flash start addr is 0x08000000 (uboot addr) and I put my kernel on the start of the 2nd flash bank at 0x08100000.
Start of External 8MB RAM is 0xD0000000
u-boot-2016.11 seems to run pretty well on that board, bdi give me:
U-Boot > bdi
arch_number = 0x00000000
boot_params = 0xD0000100
DRAM bank = 0x00000000
-> start = 0xD0000000
-> size = 0x00800000
current eth = unknown
ip_addr = <NULL>
baudrate = 115200 bps
relocaddr = 0xD07D6000
reloc off = 0xC87D6000
irq_sp = 0xD05D3EE0
sp start = 0xD05D3ED0
Early malloc usage: e0 / 400
This is how I build the kernel:
make CROSS_COMPILE=arm-none-eabi- ARCH=arm uImage LOADADDR=08100000 -B
And this is the complete output of bootm command:
U-Boot > bootm 8100000
## Booting kernel from Legacy Image at 08100000 ...
Image Name: Linux-4.9.0
Image Type: ARM Linux Kernel Image (uncompressed)
Data Size: 839872 Bytes = 820.2 KiB
Load Address: 08100000
Entry Point: 08100000
Verifying Checksum ... OK
Loading Kernel Image ... OK
Starting kernel ...
With the 'robutest'/'emcraft' kernel/config files I got the same log, unless the Entry Point seems more correct because it's 08100001.
On the robutest/emcraft kernel I tried to activate the LCD screen of the board, but nothing happen.
In all my test I activate kernel config "early printk" and "DEBUG_LL_xxx" stuff.
This is a link to my .config file:
http://pastebin.com/gBNYx3Gs
PS: I made some try with uCLinux emcraft/robutest to try to find what's going on, but my main goal is to run Linux 4.9.
Thanks for reading me!!!
EDIT: I tried to pass the dtb "the old way", but with the same result:
U-Boot > bootm 08100000 - 08040000
## Booting kernel from Legacy Image at 08100000 ...
Image Name: Linux-4.9.0
Image Type: ARM Linux Kernel Image (uncompressed)
Data Size: 805744 Bytes = 786.9 KiB
Load Address: 08100000
Entry Point: 08100000
Verifying Checksum ... OK
## Flattened Device Tree blob at 08040000
Booting using the fdt blob at 0x8040000
Loading Kernel Image ... OK
Loading Device Tree to d05ce000, end d05d2a9f ... OK
Starting kernel ...
I'm desperate, any ideas is welcome :'(
EDIT2: I tried to uncompress the kernel with u-boot, it's the same:
U-Boot > bootm 8100000 - 8040000
## Booting kernel from Legacy Image at 08100000 ...
Image Name: uImage
Image Type: ARM Linux Kernel Image (gzip compressed)
Data Size: 940696 Bytes = 918.6 KiB
Load Address: d0008000
Entry Point: d0008001
Verifying Checksum ... OK
## Flattened Device Tree blob at 08040000
Booting using the fdt blob at 0x8040000
Uncompressing Kernel Image ... OK
Loading Device Tree to d05ce000, end d05d2a9f ... OK
Starting kernel ...
EDIT3: I checked the memory/USART1 address in the dtb, and it's ok. Why I have no message of the kernel?
EDIT4: With uxipImage:
U-Boot > bootm 08060000 - 08040000
## Booting kernel from Legacy Image at 08060000 ...
Image Name: uxipImage
Image Type: ARM Linux Kernel Image (uncompressed)
Data Size: 1497396 Bytes = 1.4 MiB
Load Address: 08060000
Entry Point: 08060041
Verifying Checksum ... OK
## Flattened Device Tree blob at 08040000
Booting using the fdt blob at 0x8040000
Loading Kernel Image ... OK
Loading Device Tree to d05ce000, end d05d2a9f ... OK
Starting kernel ...
I tried with different entry points, 08060000, 08060040 and 08060041.
I found!
Thanks alexander, the trick for the UART WORKS like a charm!!!!
Thanks to you, First time I try to hack the kernel in an embedded system and I've learnt so many things thanks to you.
For those who will have this problem, for me it was the XIP_PHYS_ADDR!
Don't forget the 64 bytes header!!
I was flashing the XIP kernel # 0x08060000 so the XIP_PHYS_ADDR (and the boot entry btw) it's 0x08060040!!!!
Thank you again alexander !!
I have had faced the same issue but the reason was a different one. The issue was in one of the u-boot structure field that stores the size of the uncompressed linux kernel. The u-boot is not populating this field with the uncompressed size, that the linux kernel uses later to resize its stack, thus putting the system into an undefined state.
Once u-boot prints the Starting kernel... message, the next message should be Uncompressing Linux... done, booting the kernel after u-boot transfers a SMM Handler for the kernel to take-over the execution, and maybe the kernel is looking for this field. If you are working on a x86 based system,the uncompression would be in the following file directories:
arch/x86/boot/uncompressed/head_32.S
arch/x86/boot/uncompressed/piggy.S
The solution is to use the latest u-boot foun in here: https://github.com/andy-shev/u-boot
However, if you are using a custom u-boot, you need to look for this field.

Tie System.map values to kernel addresses

I'm trying to boot a custom kernel on a BeagleBoneBlack. u-boot works, and loads stuff as follows:
U-Boot 2016.03 (Apr 26 2016 - 11:32:30 +0000)
Watchdog enabled
I2C: ready
DRAM: 512 MiB
MMC: OMAP SD/MMC: 0, OMAP SD/MMC: 1
*** Warning - bad CRC, using default environment
Net: <ethaddr> not set. Validating first E-fuse MAC
cpsw, usb_ether
Press SPACE to abort autoboot in 2 seconds
switch to partitions #0, OK
mmc0 is current device
Scanning mmc 0:1...
Found /boot/extlinux/extlinux.conf
Retrieving file: /boot/extlinux/extlinux.conf
278 bytes read in 39 ms (6.8 KiB/s)
1: Linux grsec
Retrieving file: /boot/initramfs-grsec
5875398 bytes read in 349 ms (16.1 MiB/s)
Retrieving file: /boot/vmlinuz-4.4.8-grsec
3140944 bytes read in 211 ms (14.2 MiB/s)
append: BOOT_IMAGE=/boot/vmlinuz-4.4.8-grsec modules=loop,squashfs,sd-mod,usb-storage modloop=/boot/modloop-grsec console=ttyO0,115200n8
Retrieving file: /boot/dtbs/am335x-boneblack.dtb
31516 bytes read in 426 ms (71.3 KiB/s)
Kernel image # 0x82000000 [ 0x000000 - 0x2fed50 ]
## Flattened Device Tree blob at 88000000
Booting using the fdt blob at 0x88000000
Loading Ramdisk to 8fa65000, end 8ffff6c6 ... OK
Loading Device Tree to 8fa5a000, end 8fa64b1b ... OK
Starting kernel ...
Everything looks good so far, I think. But the kernel fails to load. I can't get access to anything from the kernel with low level debugging enabled in the kernel options either.
I've attached a J-Link JTAG debugger and was hoping to trace through to the problem, but I'm having trouble tying the System.map through to the disassembly.
Here for example is the start of the System.Map:
00000000 t __vectors_start
00000024 A cpu_ca8_suspend_size
00000024 A cpu_v7_suspend_size
0000002c A cpu_ca9mp_suspend_size
00001000 t __stubs_start
00001004 t vector_rst
00001020 t vector_irq
000010a0 t vector_dabt
00001120 t vector_pabt
000011a0 t vector_und
00001220 t vector_addrexcptn
00001240 t vector_fiq
00001240 T vector_fiq_offset
80204000 A swapper_pg_dir
80208000 T _text
80208000 T stext
8020808c t __create_page_tables
8020813c t __turn_mmu_on_loc
80208148 t __fixup_smp
802081b0 t __fixup_smp_on_up
802081d4 t __fixup_pv_table
80208228 t __vet_atags
80208280 T __idmap_text_start
80208280 T __turn_mmu_on
80208280 T _stext
So taking __create_page_tables, I grep in the source code under ./arch/arm/kernel with:
.../arm/arm/kernel$ grep __create_page_tables -rn
Binary file head.o matches
head.S:128: bl __create_page_tables
head.S:180:__create_page_tables:
head.S:355:ENDPROC(__create_page_tables)
So we're looking for the following at the symbol address:
__create_page_tables:
pgtbl r4, r8 # page table address
But the disassembler shows something different at the address I'm translating too give the Kernel is loaded at 0x82000000:
How can I translate the kernel symbols to the debugger addresses?

why does the i2cdetect always gives UU on my RTC in embedded Linux

I'd like to communicate read from my RTC in C code rather than the "hwclock" shell command.
However, when I use i2cdetect, it shows 0x68(which is my RTC slave address) is having the status "UU", which means "Probing was skipped, because this address is currently in use by a driver". And after I tried the i2cget, its givng "could bot set address to 0x68: Device or resource busy".
So I'm thinking if there are some problem in my Linux kernel that will force to read from my RTC all the time, or some other reason.
Thanks
I am assuming that you are using DS-1307 RTC, or one of its variants (because of 0x68 slave address). Check if its driver is loaded by:
$ lsmod | grep rtc
If you seen an entry of rtc_ds1307, (like this -> rtc_ds1307 17394 0 ) in the output of above command then this driver might be in hold of that address.
If the driver is loaded in system then unload it using
$ rmmod rtc-ds1307
EDIT:
(In light of OP's feedback,) Please do the following
1) cat /sys/bus/i2c/devices/3-0068/modalias. This will give you the name of the kernel driver that is keeping this device busy. Copy the driver-name after the colon(:)
OP's output of the command tells us that its ds1337
2) Check if ds1337 is an alias for a driver, using
grep ds1337 /lib/modules/`uname -r`/modules.alias
Hopefully you will get the following output
alias i2c:ds1337 rtc_ds1307
This confirms our presumption that rtc_ds1307 is infact the driver in hold of the I2C address 0x68.
3) use rmmod rtc_ds1307 to unload the driver.
Note: This will only work if the driver is a Loadable Kernel Module, otherwise you will see the following error:
ERROR: Module rtc_ds1307 does not exist in /proc/modules
In that case you will have to recompile the kernel again with that driver disabled/modularized.
0x68 is being used by some driver,
Disable that driver in kernel source code and recompile source code.

how do you diagnose a kernel oops?

Given a linux kernel oops, how do you go about diagnosing the problem? In the output I can see a stack trace which seems to give some clues. Are there any tools that would help find the problem? What basic procedures do you follow to track it down?
Unable to handle kernel paging request for data at address 0x33343a31
Faulting instruction address: 0xc50659ec
Oops: Kernel access of bad area, sig: 11 [#1]
tpsslr3
Modules linked in: datalog(P) manet(P) vnet wlan_wep wlan_scan_sta ath_rate_sample ath_pci wlan ath_hal(P)
NIP: c50659ec LR: c5065f04 CTR: c00192e8
REGS: c2aff920 TRAP: 0300 Tainted: P (2.6.25.16-dirty)
MSR: 00009032 CR: 22082444 XER: 20000000
DAR: 33343a31, DSISR: 20000000
TASK = c2e6e3f0[1486] 'datalogd' THREAD: c2afe000
GPR00: c5065f04 c2aff9d0 c2e6e3f0 00000000 00000001 00000001 00000000 0000b3f9
GPR08: 3a33340a c5069624 c5068d14 33343a31 82082482 1001f2b4 c1228000 c1230000
GPR16: c60f0000 000004a8 c59abbe6 0000002f c1228360 c340d6b0 c5070000 00000001
GPR24: c2aff9e0 c5070000 00000000 00000000 00000003 c2cc2780 c2affae8 0000000f
NIP [c50659ec] mesh_packet_in+0x3d8/0xdac [manet]
LR [c5065f04] mesh_packet_in+0x8f0/0xdac [manet]
Call Trace:
[c2aff9d0] [c5065f04] mesh_packet_in+0x8f0/0xdac [manet] (unreliable)
[c2affad0] [c5061ff8] IF_netif_rx+0xa0/0xb0 [manet]
[c2affae0] [c01925e4] netif_receive_skb+0x34/0x3c4
[c2affb10] [c60b5f74] netif_receive_skb_debug+0x2c/0x3c [wlan]
[c2affb20] [c60bc7a4] ieee80211_deliver_data+0x1b4/0x380 [wlan]
[c2affb60] [c60bd420] ieee80211_input+0xab0/0x1bec [wlan]
[c2affbf0] [c6105b04] ath_rx_poll+0x884/0xab8 [ath_pci]
[c2affc90] [c018ec20] net_rx_action+0xd8/0x1ac
[c2affcb0] [c00260b4] __do_softirq+0x7c/0xf4
[c2affce0] [c0005754] do_softirq+0x58/0x5c
[c2affcf0] [c0025eb4] irq_exit+0x48/0x58
[c2affd00] [c000627c] do_IRQ+0xa4/0xc4
[c2affd10] [c00106f8] ret_from_except+0x0/0x14
--- Exception: 501 at __delay+0x78/0x98
LR = cfi_amdstd_write_buffers+0x618/0x7ac
[c2affdd0] [c0163670] cfi_amdstd_write_buffers+0x504/0x7ac (unreliable)
[c2affe50] [c015a2d0] concat_write+0xe4/0x140
[c2affe80] [c0158ff4] part_write+0xd0/0xf0
[c2affe90] [c015bdf0] mtd_write+0x170/0x2a8
[c2affef0] [c0073898] vfs_write+0xcc/0x16c
[c2afff10] [c0073f2c] sys_write+0x4c/0x90
[c2afff40] [c0010060] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xfd98a50
LR = 0x10003840
Instruction dump:
419d02a0 98010009 800100a4 2f800003 419e0508 2f170000 419a0098 3d20c507
a0e1002e 81699624 39299624 7f8b4800 419e007c a0610016 7d264b78
Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 1 seconds..
An Oops gives a bunch of information useful in diagnosing a crash. It starts with the address of the crash, the reason ("access of bad area") and the contents of the registers. The call trace answers the question "how did we get here". The first item in the list happened most recently. Working backwards, an interrupt happened (do_IRQ) because the Atheros WiFi adapter received a packet (ath_rx_poll). The routine passed it to the generic WiFi code (ieee80211_input) which in turn passed it up to the network stack (netif_receive_skb).
To figure out the exact code causing the problem, you can run
gdb /usr/src/linux/vmlinux
and then disassemble the function in question, which might be mesh_packet_in(). Might, because the faulting instruction (0xc50659ec) looks to be outside of mesh_packet_in() (0xc5065f04). You might also try the gdb command
(gdb) info line 0xc50659ec
to figure out which function contains this address.
You should first try to find the source of the code that has crashed. In the specific case, the analysis claims that the crash happened in mesh_packet_in of the manet driver, at offset 0x8f0. It also reports that the instructions at this point are 419d02a0 98010009 ... So inspect the module with "objdump -d", to confirm whether the function/offset reported is correct. Then check the source for what it is doing; you can use the registers list to confirm again that you are looking at the right instruction.
When you know what C statement is faulting, you need to read the source to find out where the bogus data were coming from.
http://oss.sgi.com/projects/kdb/
Install this into your kernel, then when it Oops's, you'll be thrown into a gdb-like interface that you can poke around with. However, it looks like the manet module is deref'ing a bad pointer.

Resources