How to troubleshoot missing printk output? - bluetooth

I have two identical Raspberry Pi 3 B+ devices, one running raspberrypi-kernel_1.20180313-1 and the other running raspberrypi-kernel_1.20180417-1. I'm watching Bluetooth events on both devices using "hcidump -R". One device shows Bluetooth events, the other does not. I've swapped the SD cards between the devices to confirm it is not a hardware issue: regardless of which device the card is in, the one running 20180313 shows the Bluetooth events and the one running 20180417 does not.
To debug this, I've been adding printk statements at various points in the Raspbian kernel source pulled from git:
https://github.com/raspberrypi/linux
I felt the most relevant place to start debugging was the Bluetooth RX code, e.g. printing something for each Bluetooth message sent to the line discipline. Specifically, in drivers/bluetooth/hci_ldisc.c, I added two lines to the beginning of the hci_uart_tty_receive function:
printk(KERN_ERR "AARON: In hci_uart_tty_receive with tty %p\n", tty);
dump_stack();
After rebuilding the kernels and rebooting the devices, on the Pi running 20180313 I saw the log messages/stack traces, and on the other Pi I saw nothing, indicating the Bluetooth RX code isn't being reached. To debug further, I looked at the stack trace, which was:
Jul 9 21:03:18 tiltpi kernel: [ 9.391137] Workqueue: events_unbound flush_to_ldisc
Jul 9 21:03:18 tiltpi kernel: [ 9.391166] [<8010f664>] (unwind_backtrace) from [<8010bd1c>] (show_stack+0x20/0x24)
Jul 9 21:03:18 tiltpi kernel: [ 9.391183] [<8010bd1c>] (show_stack) from [<80449c20>] (dump_stack+0xc8/0x114)
Jul 9 21:03:18 tiltpi kernel: [ 9.391221] [<80449c20>] (dump_stack) from [<7f4400cc>] (hci_uart_tty_receive+0x5c/0xac [hci_uart])
Jul 9 21:03:18 tiltpi kernel: [ 9.391254] [<7f4400cc>] (hci_uart_tty_receive [hci_uart]) from [<804b9bdc>] (tty_ldisc_receive_buf+0x64/0x6c)
Jul 9 21:03:18 tiltpi kernel: [ 9.391273] [<804b9bdc>] (tty_ldisc_receive_buf) from [<804ba15c>] (flush_to_ldisc+0xcc/0xe4)
Jul 9 21:03:18 tiltpi kernel: [ 9.391293] [<804ba15c>] (flush_to_ldisc) from [<80135934>] (process_one_work+0x144/0x438)
Jul 9 21:03:18 tiltpi kernel: [ 9.391311] [<80135934>] (process_one_work) from [<80135c68>] (worker_thread+0x40/0x574)
Jul 9 21:03:18 tiltpi kernel: [ 9.391328] [<80135c68>] (worker_thread) from [<8013b930>] (kthread+0x108/0x124)
Jul 9 21:03:18 tiltpi kernel: [ 9.391352] [<8013b930>] (kthread) from [<80107ed4>] (ret_from_fork+0x14/0x20)
I proceeded to add printk statements to flush_to_ldisc and tty_ldisc_receive_buf, recompile, and retest. However, while I continued to see the printk message I added in hci_uart_tty_receive, I did not see the messages I added to flush_to_ldisc or tty_ldisc_receive_buf.
Upon further inspection of the kernel source, I found the stack trace didn't even make sense, as the functions listed do not call each other directly. More specifically, in tty_buffer.c, flush_to_ldisc (toward the bottom of the stack) calls receive_buf, which then calls tty_ldisc_receive_buf, which in turn calls hci_uart_tty_receive in hci_ldisc.c. The stack trace has no entry for receive_buf and shows flush_to_ldisc calling tty_ldisc_receive_buf directly.
So I'm quite confused. I've searched through the kernel source and found no other declarations of "flush_to_ldisc" or "tty_ldisc_receive_buf" functions.
Why/how can dump_stack() be missing a stack entry? And why aren't the printk statements I placed in the functions toward the bottom of that stack showing up, while the printk statements I placed toward the top of the stack do show up?
EDIT:
A bit more searching shows that the Linux kernel relies on gcc to perform certain optimizations, including automatic inlining of some functions, and that is likely what happened to my stack trace. That would explain why I don't see those functions explicitly listed in the stack, but it doesn't explain why the printk output doesn't show up. Any thoughts on why printk statements would show up from functions at the top of the stack but not the bottom? The rsyslog.conf file is set up with:
*.err -/var/log/errors.log
And all printk statements I added are like "printk(KERN_ERR "string\n");"
EDIT2: Updated question title to reflect that it is not just about absent printk output.
EDIT3: I deleted my local copy of the kernel source, pulled it again, added my printk statements, and recompiled from scratch, and I now get the printk statements showing up. It seems the code I added wasn't recompiled or linked into the kernel build. I ran "make clean" before making the kernel, but still it seems something wasn't getting compiled/linked properly. But starting clean resolved the problem.
Summary: the Linux kernel makes use of gcc optimizations that can result in functions being compiled inline even when not explicitly marked inline in the source. And when you're "sure" you've recompiled the kernel with your changes, start over with a clean source/build directory and try a second time before taking your issue to Stack Exchange.

Related

The Linux (CentOS 7.9) kernel reported a bug. Is it harmful?

My runtime environment is CentOS 7.9 (kernel version 5.16.11) in a VMBox virtual machine, allocated 1 GB of memory and 8 CPU cores.
[root@dev236 ~]# uname -a
Linux dev236 5.16.11-1.el7.elrepo.x86_64 #1 SMP PREEMPT Tue Feb 22 10:22:37 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
I ran a computation-intensive program that used 8 threads to continuously use the CPU.
After some time, the operating system issues a bug alert, like this:
[root@dev236 src]# ./server --p:command-threads-count=8
[31274.179023] rcu: INFO: rcu_preempt self-detected stall on CPU
[31274.179039] watchdog: BUG: soft lockup - CPU#3 stuck for 210s! [server:1356]
[31274.179042] watchdog: BUG: soft lockup - CPU#1 stuck for 210s! [server:1350]
[31274.179070] watchdog: BUG: soft lockup - CPU#7 stuck for 210s! [server:1355]
[31274.179214] rcu: 0-...!: (1 GPs behind) idle=52f/1/0x4000000000000000 softirq=10073864/10073865 fqs=0
Message from syslogd@dev236 at Jan 25 18:59:49 ...
kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 210s! [server:1356]
Message from syslogd@dev236 at Jan 25 18:59:49 ...
kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 210s! [server:1350]
Message from syslogd@dev236 at Jan 25 18:59:49 ...
kernel:watchdog: BUG: soft lockup - CPU#7 stuck for 210s! [server:1355]
^C
[root@dev236 src]#
Then I looked at the program log; the log file was still being appended to, which indicated that my test program was still running.
Can I just ignore this bug warning?
Or do I have to do something? For example:
    Reduce the computational intensity of the program?
    Give the CPU a break once in a while?
    Reduce the number of threads started in the program?
Thank you all

Why does Chromium trigger DRM (Direct Rendering Manager) on startup on Linux?

I was wondering if anyone knows why Chromium-based browsers trigger the Direct Rendering Manager on startup on Linux, whereas Firefox, for example, doesn't seem to. This is what I see when starting Chromium:
Dec 19 11:02:30 hp kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Dec 19 11:02:31 hp kernel: [drm] UVD and UVD ENC initialized successfully.
Dec 19 11:02:31 hp kernel: [drm] VCE initialized successfully.

memcheck-amd64- killed by OOM

I'm using Valgrind to track down a segmentation fault in my code, but when the run reaches the segmentation fault point my process is killed.
Searching the /var/log/syslog file, I can see that the process memcheck-amd64- (Valgrind?) has been killed:
Sep 7 12:48:34 fabiano-HP kernel: [10154.654505] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-c2.scope,task=memcheck-amd64-,pid=3688,uid=1000
Sep 7 12:48:34 fabiano-HP kernel: [10154.654560] Out of memory: Killed process 3688 (memcheck-amd64-) total-vm:11539708kB, anon-rss:6503332kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:12952kB oom_score_adj:0
Sep 7 12:46:26 fabiano-HP org.freedesktop.thumbnails.Cache1[3661]: message repeated 3 times: [ libpng error: Read Error]
Sep 7 12:48:34 fabiano-HP systemd[1]: session-c2.scope: A process of this unit has been killed by the OOM killer.
Now, the problem is that Valgrind doesn't write its output file, so I can't understand what's going on. How can I avoid this? I mean, what is happening?
EDIT:
I'm running Valgrind with this command valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --log-file=valgrind-out.txt ./mainCPU -ellpack

Debugging cdc-acm kernel module

I am trying to fix a problem I am having on Ubuntu (I tried different versions, including the latest, 13.10) with a USB device that talks CDC/ACM on one of its interfaces. The kernel module handling this kind of device only reports
cdc_acm 6-2:1.1: This device cannot do calls on its own. It is not a modem.
cdc_acm: probe of 6-2:1.1 failed with error -22
in dmesg, and that is it. Nothing about "Zero length descriptor references" or similar issues that other people report on the web. So I wanted to find out what the problem might be. I followed the description at http://www.silly-science.co.uk/2012/06/23/lenovo-usb-modem-in-linux-ubuntu-10-04 to compile and load a custom cdc-acm module. First, I changed the two #undefs for debug to #defines in cdc-acm.c, but I am still not getting any additional output in dmesg.
Changing the version string in cdc-acm.c's DRIVER_VERSION define to something else, I can verify that my modified module is indeed loaded. Am I looking for the debug output in the wrong place?
I managed to get debug info from cdc_acm in dmesg. Even though I don't have anything special to share, these were my steps, using the latest kernel as of today, 4.2-rc5:
Change DEBUG and VERBOSE_DEBUG #undefs to #defines in cdc-acm.c.
make -C /lib/modules/$(uname -r)/build M=$(pwd)/drivers/usb/class modules
modprobe -r cdc_acm; insmod $(pwd)/drivers/usb/class/cdc-acm.ko
dmesg after plugging a compatible device
[...]
[14035.355036] cdc_acm 2-2:1.1: acm_tty_write - write 1
[14035.368040] cdc_acm 2-2:1.1: acm_softint
[14038.156445] cdc_acm 2-2:1.0: acm_tty_close
[14038.173054] cdc_acm 2-2:1.0: acm_ctrl_msg - rq 0x22, val 0x0, len 0x0, result 0
[14038.173059] cdc_acm 2-2:1.0: acm_port_shutdown
[14038.173640] cdc_acm 2-2:1.0: acm_ctrl_irq - urb shutting down with status: -2
[14038.174636] cdc_acm 2-2:1.1: acm_read_bulk_callback - urb 0, len 0
[...]

Ubuntu: segfault at 125 ip 00cd6df4 sp bfeef720 error 6 in libQtCore.so.4.7.4[b51000+2ca000]?

As per the title, I am receiving this error message while running my Qt application.
The application is built with Qt 4.7.4. It crashes randomly during different phases of operation. Reading /var/log/syslog, I found the following:
Aug 29 16:17:01 localhost CRON[1484]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 29 16:20:18 localhost kernel: [ 1472.204669] IAccessRemoteSc[1420]: segfault at ac4ecc4 ip 00ed71ef sp bfcdde2c error 4 in libc-2.10.1.so[e6c000+13e000]
Aug 29 17:17:01 localhost CRON[8814]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 29 17:28:33 localhost kernel: [ 5567.372835] IAccessRemoteSc[1894]: segfault at a55d77c ip 008481ef sp bfa9271c error 4 in libc-2.10.1.so[7dd000+13e000]
Aug 29 17:29:01 localhost kernel: [ 5595.452673] IAccessRemoteSc[10231]: segfault at 11064954 ip 086591ef sp bf8b0dec error 4 in libc-2.10.1.so[85ee000+13e000]
Aug 29 17:31:12 localhost kernel: [ 5726.055671] IAccessRemoteSc[10291]: segfault at a0beb84 ip 00cbf1ef sp bffdfb0c error 4 in libc-2.10.1.so[c54000+13e000]
Aug 29 18:15:44 localhost kernel: [ 8399.369686] IAccessRemoteSc[10602]: segfault at 125 ip 00cd6df4 sp bfeef720 error 6 in libQtCore.so.4.7.4[b51000+2ca000]
Aug 29 18:17:01 localhost CRON[12697]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
From the above messages, the errors come from libc-2.10.1.so and libQtCore.so.4.7.4.
I am using Ubuntu 9.10 (per our company standards).
I tried searching Google, but no clear reason or solution is mentioned.
So, does anyone have some idea about this error?
Any idea or suggestion would be a great help to me.
I found a good explanation of this error on Stack Overflow itself. Click Here
I am also pasting it below (slightly modified to match the error I am facing):
Error 6 here is not an errno value; it is the CPU's page-fault error code, and 6 decodes as a user-mode write to a page that isn't mapped. It may be that libQtCore is mishandling a bad pointer, or it may be that something else is going on.
If this were a program, not a shared library
Run
addr2line -e yourSegfaultingProgram 00cd6df4
(and repeat for the other instruction pointer values given) to see where the error is happening. Better, get a debug-instrumented build, and reproduce the problem under a debugger such as gdb.
Since it's a shared library
You're hosed, unfortunately; it's not possible to know where the libraries were placed in memory by the dynamic linker after-the-fact. Reproduce the problem under gdb.
What the error means
Here's the breakdown of the fields:
address - the location in memory the code is trying to access (a small value like 0x125 is likely an offset from a pointer that is expected to be valid but is actually NULL)
ip - instruction pointer, ie. where the code which is trying to do this lives
sp - stack pointer
error - the page-fault error code pushed by the CPU, a bit field: bit 0 set means a protection violation (clear means the page was not present), bit 1 set means a write (clear means a read), and bit 2 set means the fault happened in user mode. Note this is not an errno value; errno codes returned by failed syscalls are a separate thing, listed in:
/usr/include/asm/errno.h
Click Here to get the list of errors and the meaning of each one
