How to debug a Linux kernel that freezes during boot?


I have a legacy device with a binary Linux 2.6.18 kernel that boots normally to its rootfs. However, when I compile this kernel from source, the resulting binary freezes during boot. I don't have the .config file that was used to build the kernel binary that currently boots normally.
The boot freezes without printing any error. Here is the boot log:
Linux version 2.6.18-6.2 (myuser@host) (gcc version 4.2.0 20070124 (prerelease) - BRCM 10ts-20080721) #10 SMP Sun Apr 28 18:25:24 BRT 2013
Fetching vars from bootloader... OK (E,d,B,C)
Detected 512 MB on MEMC0 (strap 0x23430310)
Board strapped at 512 MB, default is 256 MB
Options: sata=1 enet=1 emac_1=1 no_mdio=0 docsis=0 ebi_war=0 pci=1 smp=1
CPU revision is: 0002a044
FPU revision is: 00130001
Primary instruction cache 32kB, physically tagged, 2-way, linesize 64 bytes.
Primary data cache 64kB, 4-way, linesize 64 bytes.
<6>Synthesized TLB refill handler (23 instructions).
<6>Synthesized TLB load handler fastpath (37 instructions).
<6>Synthesized TLB store handler fastpath (37 instructions).
<6>Synthesized TLB modify handler fastpath (36 instructions).
Determined physical RAM map:
memory: 10000000 @ 00000000 (usable)
memory: 10000000 @ 20000000 (usable)
Using 32MB for memory, overwrite by passing mem=xx
User-defined physical RAM map:
node [00000000, 02000000: RAM]
node [02000000, 0e000000: RSVD]
node [20000000, 10000000: RAM]
<5>Reserving 224 MB upper memory starting at 02000000
<7>On node 0 totalpages: 65536
<7> DMA zone: 65536 pages, LIFO batch:15
<7>On node 1 totalpages: 65536
<7> Normal zone: 65536 pages, LIFO batch:15
Built 2 zonelists. Total pages: 131072
<5>Kernel command line: root=/dev/mtdblock3 rw rootfstype=jffs2 console=ttyS0,115200
PID hash table entries: 4096 (order: 12, 16384 bytes)
mips_counter_frequency = 202000000 from Calibration, = 202500000 from header(CPU_MHz/2)
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
Memory: 286336k/524288k available (2924k kernel code, 237760k reserved, 544k data, 164k init, 0k highmem)
Mount-cache hash table entries: 512
Checking for 'wait' instruction... available.
plat_prepare_cpus: ENABLING 2nd Thread...
TP0: prom_boot_secondary: Kick off 2nd CPU...
CPU revision is: 0002a044
FPU revision is: 00130001
Primary instruction cache 32kB, physically tagged, 2-way, linesize 64 bytes.
Primary data cache 64kB, 4-way, linesize 64 bytes.
Synthesized TLB refill handler (23 instructions).
Brought up 2 CPUs
migration_cost=1000
NET: Registered protocol family 16
registering PCI controller with io_map_base unset
registering PCI controller with io_map_base unset
SCSI subsystem initialized
usbcore: registered new driver usbfs
usbcore: registered new driver hub
NET: Registered protocol family 2
IP route cache hash table entries: 16384 (order: 4, 65536 bytes)
TCP established hash table entries: 65536 (order: 7, 524288 bytes)
TCP bind hash table entries: 32768 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 65536 bind 32768)
TCP reno registered
brcm-pm: disabling power to USB block
brcm-pm: disabling power to ENET block
brcm-pm: disabling power to SATA block
squashfs: version 3.2-r2 (2007/01/15) Phillip Lougher
JFFS2 version 2.2. (NAND) (SUMMARY) (C) 2001-2006 Red Hat, Inc.
io scheduler noop registered
io scheduler anticipatory registered (default)
io scheduler deadline registered
io scheduler cfq registered
Serial: 8250/16550 driver $Revision: 1.1.1.1 $ 3 ports, IRQ sharing disabled
serial8250: ttyS0 at MMIO 0x0 (irq = 22) is a 16550A
serial8250: ttyS1 at MMIO 0x0 (irq = 66) is a 16550A
serial8250: ttyS2 at MMIO 0x0 (irq = 67) is a 16550A
loop: loaded (max 8 devices)
brcm-pm: enabling power to ENET block
How do I go about debugging this? Any insights on possible solutions to the freeze are welcome as well.

One way to deal with this is to enable CONFIG_EARLY_PRINTK and add printk() statements to the kernel code you suspect is freezing (most likely some driver's configuration parameters are wrong).
Also, you might be able to recover the old kernel's config by looking at /boot/config-* or at /proc/config.gz on the device (the latter exists only if the old kernel was built with CONFIG_IKCONFIG_PROC enabled).
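If /proc/config.gz does exist on the device while it runs the working kernel, its contents can serve as the starting .config for the rebuild. Below is a minimal Python sketch for reading it. The function names are made up for illustration, and it assumes the file is an ordinary gzip stream of CONFIG_FOO=value lines.

```python
import gzip
import os


def parse_kernel_config(text):
    """Return {option: value} for every 'CONFIG_FOO=value' line."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Disabled options appear as '# CONFIG_FOO is not set' comments.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.startswith("CONFIG_"):
            options[name] = value
    return options


def read_kernel_config(path="/proc/config.gz"):
    """Decompress a gzip'd kernel config file and parse it."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return parse_kernel_config(fh.read())


if __name__ == "__main__":
    path = "/proc/config.gz"
    if os.path.exists(path):
        for name, value in sorted(read_kernel_config(path).items()):
            print(f"{name}={value}")
    else:
        print(f"{path} not found (CONFIG_IKCONFIG_PROC is probably not enabled)")
```

Comparing that dictionary against the .config you are building with is a quick way to spot driver options that differ between the two kernels.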

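Add initcall_debug to the kernel command line (CONFIG_CMDLINE) so that the kernel logs each initcall as it runs; the last one logged before the freeze is the likely culprit. For example:
CONFIG_CMDLINE="root=/dev/ram0 rw mem=512M@0x0 initrd=0x800000,16M console=ttyS0,38400n8 rootfstype=ext2 init=/bin/busybox init -s initcall_debug"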
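With initcall_debug enabled the serial log gets long, so it can help to pull out the last initcall that was announced before the freeze. Here is a rough Python sketch for a captured log. The regular expression is an assumption about the message wording, which differs between kernel versions, so check it against what your kernel actually prints. The function name is hypothetical.

```python
import re

# Assumption: every initcall is announced on a line containing the word
# "calling" (optionally followed by "initcall") and then the function
# address or name. Adjust the pattern to your kernel's real output.
CALL_RE = re.compile(r"calling\s+(?:initcall\s+)?(.*)", re.IGNORECASE)


def last_initcall(log_text):
    """Return the text of the last announced initcall, or None if none is found."""
    last = None
    for line in log_text.splitlines():
        match = CALL_RE.search(line)
        if match:
            last = match.group(1).strip()
    return last
```

Feed it the saved serial capture, for example last_initcall(open("boot.log").read()), and start adding printk() calls in whatever function it names.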

There are debugger options such as kdb and kgdb, but I've always found them flaky and temperamental, probably even more so if you can't get your machine to boot. I concur with the CONFIG_EARLY_PRINTK advice, and would advise you to make sure you get kernel output on boot (no "quiet"), but it seems you have this already.
The "GPIO" suggestion above could work, but it is very system-dependent and cumbersome. That said, I think you want a better answer than "start adding a lot of printk's". You can start with the offending Ethernet driver (brcm-pm?), or try removing it to see whether the freeze is related.
It'll take some investigation. Sorry, but there is no "magic bullet"! :-O

From the last line of the log,
brcm-pm: enabling power to ENET block
this looks like an issue with the power supply connected to the system: it may not be able to source enough power when the ENET block is switched on, and that is why the system freezes.

Related

u-boot hangs on soft reboot

I'm having this subtle issue where if I put my ARM device (U-boot + Linux) under soft reboot cycle (stress test), it fails after 100+ cycles. The serial output I capture in failed scenario is:
...
g_txrx_mode=1
g_profileid=1
id=0x1F11 board_type=0x0004 HAS_POE_SUPPORT=1
Not POE
read_rbf_header_from_ext4 - filename = e30.core.rbf filesize = 7317252
cff_from_mmc_ext4:writing e30.core.rbf length 13 num_files 0
Full Configuration Succeeded.
crestron_load_rbf: use core e30.core.rbf length 13 rval 1
Booting from primary
Writing to MMC(0)... done
dram_init: id 1f11 (id & 0x0001) 1 has_dsp/has_dante0
DDRCAL: Success
INFO : Skip relocation as SDRAM is non secure memory
Reserving 2048 Bytes for IRQ stack at: ffe2f708
DRAM : 512 MiB
On a successful reboot, next printed lines are:
WARNING: Caches not enabled
MMC: In: serial
Out: serial
Err: serial
It seems the boot failed between 'skip_relocation()' and 'enable_caches()'. But why only after 100+ attempts? Could it be a memory issue, such as memory timing? And how can I debug it?

Linux memory allocation - order changed by 1

I will try to describe the issue as well as I can, though I won't be able to post all the relevant code.
The case is as follows:
I made a few changes in the code, all of them in user space. I didn't change anything in the kernel code.
After compiling and working with this release for a while, I suddenly noticed that the ephemeral port range had changed.
After investigating, I found that this was caused by a change in the order (in memory allocation). But as I said before, no one touched this code.
Here are some of the Linux boot log messages from before and after this change. You can see the change in the order that I mentioned.
After the change:
[000000.000] Determined physical RAM map:
[000000.000] memory: 0000000007000000 @ 0000000000c10000 (usable)
[000000.015] reserve bootmem for memoops 0000000000020000 @ 0000000007bf0000
[000000.019] Primary instruction cache 32kB, virtually tagged, 4 way, 64 sets, linesize 128 bytes.
[000000.019] Primary data cache 16kB, 64-way, 2 sets, linesize 128 bytes.
[000000.020] PID hash table entries: 512 (order: 9, 16384 bytes)
[000000.020] Using 500.000 MHz high precision timer.
[000000.227] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
[000000.240] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
[000000.266] Memory: 112408k/114688k available (2062k kernel code, 2136k reserved, 533k data, 200k init, 0k highmem)
Before the change:
[000000.000] Determined physical RAM map:
[000000.000] memory: 0000000007400000 @ 0000000000c00000 (usable)
[000000.016] reserve bootmem for memoops 0000000000020000 @ 0000000007fe0000
[000000.020] Primary instruction cache 32kB, virtually tagged, 4 way, 64 sets, linesize 128 bytes.
[000000.020] Primary data cache 16kB, 64-way, 2 sets, linesize 128 bytes.
[000000.020] PID hash table entries: 1024 (order: 10, 32768 bytes)
[000000.228] Dentry cache hash table entries: 32768 (order: 6, 262144 bytes)
[000000.242] Inode-cache hash table entries: 16384 (order: 5, 131072 bytes)
[000000.269] Memory: 116280k/118784k available (2062k kernel code, 2344k reserved, 533k data, 200k init, 0k highmem)
Note: I already tried reverting my changes and recompiling, but the issue is still there for some reason.
Does anyone know what might cause this? How can this happen?

Application code has no influence on the kernel's boot, because it only starts running after the kernel has booted.
This issue is most likely caused by a change in the physical memory available to the kernel. You can verify this in the kernel source code:
mm/page_alloc.c
/* round applicable memory size up to nearest megabyte */
numentries = nr_kernel_pages;
numentries += (1UL << (20 - PAGE_SHIFT)) - 1;
numentries >>= 20 - PAGE_SHIFT;
numentries <<= 20 - PAGE_SHIFT;
....
log2qty = ilog2(numentries);
....
printk(KERN_INFO "%s hash table entries: %ld (order: %d, %lu bytes)\n",
tablename,
(1UL << log2qty),
ilog2(size) - PAGE_SHIFT,
size);
If the boot parameters are the same, you can check:
whether the hardware is still the same;
whether you booted with a different method, such as switching from legacy to UEFI mode;
if you have several targets with the same configuration (memory size, CPU, chipset, etc.), whether the problem also appears on another board, to rule out a hardware fault on this one.
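As a rough check of the quoted kernel excerpt against the two logs in the question, the arithmetic can be redone in Python. This sketch assumes 4 KiB pages (PAGE_SHIFT = 12), which the question does not state but which is consistent with the order/bytes pairs in both logs. The helper names are mine, and only the quoted lines are modelled, not the elided parts.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def ilog2(n):
    """Floor of log2(n) for a positive integer, like the kernel's ilog2()."""
    return n.bit_length() - 1


def round_pages_up_to_mb(nr_pages, page_shift=PAGE_SHIFT):
    """Mirror the three 'numentries' lines: round a page count up to a whole MB."""
    shift = 20 - page_shift
    nr_pages += (1 << shift) - 1
    nr_pages >>= shift
    nr_pages <<= shift
    return nr_pages


def printed_order(size_bytes, page_shift=PAGE_SHIFT):
    """The 'order' field of the printk: ilog2(size) - PAGE_SHIFT."""
    return ilog2(size_bytes) - page_shift


# Sizes of the usable region in the two "Determined physical RAM map" lines.
before, after = 0x7400000, 0x7000000
print("usable RAM shrank by", (before - after) >> 20, "MiB")

# Dentry cache table sizes (in bytes) from the two logs.
print("order before:", printed_order(262144), "after:", printed_order(131072))
```

It reports a 4 MiB drop in usable RAM and orders 6 and 5, matching the Dentry cache lines in the logs. Because the table size goes through ilog2(), a small change in available memory can halve every table at once.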

What's the difference between free's result and dmidecode's result in Linux?

I use two tools to collect my memory info, dmidecode and free, and the two show different results: dmidecode shows 4096 MB of memory, while free -m shows 3829. What is the difference, and why?
Handle 0x0083, DMI type 17, 27 bytes
Memory Device
Array Handle: 0x0082
Error Information Handle: No Error
Total Width: 32 bits
Data Width: 32 bits
Size: 4096 MB
Form Factor: DIMM
Set: None
Locator: RAM slot #0
Bank Locator: RAM slot #0
Type: DRAM
Type Detail: EDO
Speed: Unknown
Manufacturer: Not Specified
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
free -m output:
total used free shared buffers cached
Mem: 3829 3566 262 0 495 1779
-/+ buffers/cache: 1291 2537
Swap: 8191 0 8191

dmidecode uses BIOS facilities (SMBIOS in particular) to get the amount of memory physically present in a system. At boot, the BIOS determines that size from the SPD chips on the DIMM modules.
But during boot some memory is reserved by the BIOS itself (e.g. video RAM for an integrated graphics adapter), so the amount of memory presented to the OS is a bit smaller. That is what you see in the free output.
Usually you can check this in the dmesg output:
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009e000 (usable)
[ 0.000000] BIOS-e820: 000000000009e000 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000bd92a000 (usable)
[ 0.000000] BIOS-e820: 00000000bd92a000 - 00000000bd94c000 (ACPI NVS)
...
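To see how much of the map is actually handed to the OS, the usable ranges can be added up. The Python sketch below parses the "start - end (type)" format shown above. The function name is made up, the end address is treated as exclusive, and other kernel versions may format these lines differently, in which case the pattern needs adjusting.

```python
import re

# Matches lines in the "start - end (type)" e820 format quoted above.
E820_RE = re.compile(
    r"BIOS-e820:\s+([0-9a-fA-F]+)\s+-\s+([0-9a-fA-F]+)\s+\((.+?)\)"
)


def usable_bytes(dmesg_text):
    """Sum the sizes of all ranges marked (usable)."""
    total = 0
    for start, end, kind in E820_RE.findall(dmesg_text):
        if kind == "usable":
            total += int(end, 16) - int(start, 16)
    return total


# The lines quoted above; the real map continues past them.
SAMPLE = """\
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009e000 (usable)
[ 0.000000] BIOS-e820: 000000000009e000 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000bd92a000 (usable)
[ 0.000000] BIOS-e820: 00000000bd92a000 - 00000000bd94c000 (ACPI NVS)
"""

print(f"{usable_bytes(SAMPLE) / 2**20:.1f} MiB usable in the quoted lines")
```

Run it on the full dmesg output of the machine in question. The quoted sample is truncated, so its total is not that machine's figure.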

Understanding Linux load address for U-Boot process

I'm trying to understand embedded Linux principles and can't figure out the addresses in the U-Boot output.
For example, I have a UDOO board based on the i.MX6 quad processor, and I got the following output from U-Boot:
U-Boot 2013.10-rc3 (Jan 20 2014 - 13:33:34)
CPU: Freescale i.MX6Q rev1.2 at 792 MHz
Reset cause: POR
Board: UDOO
DRAM: 1 GiB
MMC: FSL_SDHC: 0
No panel detected: default to LDB-WVGA
Display: LDB-WVGA (800x480)
In: serial
Out: serial
Err: serial
Net: using phy at 6
FEC [PRIME]
Warning: FEC MAC addresses don't match:
Address in SROM is 00:c0:08:88:a5:e6
Address in environment is 00:c0:08:88:9c:ce
Hit any key to stop autoboot: 0
Booting from mmc ...
4788388 bytes read in 303 ms (15.1 MiB/s)
## Booting kernel from Legacy Image at 12000000 ...
Image Name: Linux-3.0.35
Image Type: ARM Linux Kernel Image (uncompressed)
Data Size: 4788324 Bytes = 4.6 MiB
Load Address: 10008000
Entry Point: 10008000
Verifying Checksum ... OK
Loading Kernel Image ... OK
Starting kernel ...
I don't understand the value of the load address, 0x10008000. According to the documentation for this particular processor, main memory is mapped in the address zone 0x10000000 - 0xffffffff. But what is the 0x8000 offset? I can't figure out the reason for this value.
I also don't understand the address 0x12000000, where the kernel image is loaded from. Is there a memory region mapped for the SD card?
Please, can you give me some explanation of these addresses or, even better, some references to resources on this topic? My goal is to learn how to port U-Boot and the Linux kernel to other boards.
Thank you!

If you check U-Boot's environment variables, you will find that the kernel image is copied from the boot device to a RAM location (here, 12000000) by a command such as fatload.
This is not the LOADADDRESS. You pass LOADADDRESS on the command line when compiling the kernel; this address is usually at a 32K offset from the start of RAM in the processor's physical address space.
Your RAM is mapped at 10000000 and the kernel LOADADDRESS is 10008000 (a 32K offset). The bootm command moves the kernel image from 12000000 to address 10008000 (decompressing it if the image is compressed; yours is marked uncompressed) and then calls the kernel entry point.
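The "Load Address" and "Entry Point" that bootm prints come from the 64-byte legacy image header that mkimage puts in front of the kernel. That also accounts for the 64-byte gap between "4788388 bytes read" and "Data Size: 4788324 Bytes" in the log. Here is a small Python sketch of reading that header and of the address arithmetic. The function and constant names are mine; the RAM base is the one given in the question.

```python
import struct

# Legacy U-Boot image header: 64 bytes, all integers big-endian.
# Fields: magic, header CRC, timestamp, data size, load address,
# entry point, data CRC, then os/arch/type/compression bytes and a
# 32-byte image name.
UIMAGE_HEADER = struct.Struct(">7I4B32s")
UIMAGE_MAGIC = 0x27051956


def parse_legacy_header(data):
    """Decode the fields bootm prints from the start of a legacy image."""
    (magic, _hcrc, _time, size, load, entry, _dcrc,
     _os, _arch, _type, comp, name) = UIMAGE_HEADER.unpack_from(data)
    if magic != UIMAGE_MAGIC:
        raise ValueError("not a legacy U-Boot image")
    return {
        "name": name.rstrip(b"\0").decode("ascii", "replace"),
        "data_size": size,
        "load": load,
        "entry": entry,
        "compression": comp,  # 0 means uncompressed
    }


# Values taken from the question.
RAM_BASE = 0x10000000      # start of main memory on this SoC
STAGING_ADDR = 0x12000000  # where the file was read into RAM
LOAD_ADDR = 0x10008000     # where bootm places the kernel

print(hex(LOAD_ADDR - RAM_BASE))        # kernel offset from the RAM base
print((STAGING_ADDR - RAM_BASE) >> 20)  # staging buffer offset in MiB
```

The two prints give 0x8000 and 32: the kernel sits 32 KiB above the RAM base, and the staging buffer is ordinary RAM 32 MiB above it, not an SD-card mapping.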

Check out the include/configs folder. It contains all the board definitions:
i.MX uboot include/configs
To port U-Boot to another board, start from a very similar board and modify from there.

RAM memory mapping - Need clarification

Total RAM size is 512 MB.
On my WEC7 device control panel, I'm seeing total memory as:
Storage memory: 53792 KB
Program memory: 376140 KB
So, total size is : 419MB.
My config.bib has following:
SECTION_BASE 80000000 00001000 RESERVED
ARGS 80001000 00001000 RESERVED
RSVD 80002000 001BA000 RESERVED
EMAC 801BC000 00009000 RESERVED
RSVD1 801C5000 0003B000 RESERVED
FBUFFER 95B00000 00200000 RESERVED
#define NK_START 80200000
#define NK_SIZE 05E00000
#define RAM_START 86000000
#define RAM_SIZE 0FB00000
According to this, RAM_SIZE is 251MB.
AFAIK, this is Program memory + Storage memory. Is my understanding correct? If yes, why is there this difference? If no, what is the correct explanation?
My image_cfg.h has following line:
#define STATIC_MAPPING_RAM_SIZE (384)
And oemaddrtab_cfg.inc file has:
g_oalAddressTable
DCD 0x80000000, 0x00100000, STATIC_MAPPING_RAM_SIZE ; RAM image mapping; 0x80000000+384MB=0x98000000
DCD 0x9B000000, 0xFC000000, 64 ; 64 MB Peripheral device space (As per datasheet)
DCD 0x9F100000, 0x00000000, 1 ;Mapping Boot region
DCD 0x00000000, 0x00000000, 0 ; Terminate table
NK size:
nk.bin: 51MB
nk.nb0: 94MB
Can anybody please explain why I am getting 419 MB of memory, and also explain more about these memory mappings?

In addition to the amount you specify in the config file, an additional RAM region can be passed to the system through the OEMGetExtendedDRAM function in your OAL. For BSPs that support devices with different amounts of RAM, it is common to configure the minimum amount in the bib file, detect whether additional RAM is available, and return it using the above function.
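To put numbers on the gap, here is the arithmetic for the values in the question as a short Python sketch. It only checks how the config.bib regions line up and how much the control panel reports beyond RAM_SIZE. Whether that surplus really comes from OEMGetExtendedDRAM on this BSP is an assumption to verify in the OAL source.

```python
MB = 1 << 20

# Values from the question's config.bib.
NK_START, NK_SIZE = 0x80200000, 0x05E00000
RAM_START, RAM_SIZE = 0x86000000, 0x0FB00000
FBUFFER_START = 0x95B00000

# Control panel figures from the question, in KB.
storage_kb, program_kb = 53792, 376140

reported_kb = storage_kb + program_kb
bib_ram_kb = RAM_SIZE // 1024

# The NK image region ends exactly where the RAM region starts,
# and the RAM region ends exactly where FBUFFER starts.
print("NK ends at RAM_START:", NK_START + NK_SIZE == RAM_START)
print("RAM ends at FBUFFER:", RAM_START + RAM_SIZE == FBUFFER_START)

print("RAM_SIZE in MB:", RAM_SIZE // MB)
print("reported by control panel, MB:", round(reported_kb / 1024, 1))
print("reported beyond RAM_SIZE, MB:", round((reported_kb - bib_ram_kb) / 1024, 1))
```

So RAM_SIZE is 251 MB, the control panel total is roughly 420 MB, and about 169 MB must come from somewhere other than the RAM line in config.bib.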
