Issues trying to debug a kernel vmcore - linux

One of our clients called us saying they had had a kernel crash and asked us to investigate. They are running SLES12 SP2.
I copied the vmcore file from /var/crash (11 MB) off the production machine onto another machine also running SLES12 SP2. I also copied the kernel image /boot/vmlinux-4.4.120-92.70-default.gz and installed the kernel debuginfo package on that machine. However, I'm unable to run the crash utility on it:
$ strings vmcore |grep "4\.4\."
4.4.120-92.70-default
OSRELEASE=4.4.120-92.70-default
BOOT_IMAGE=/boot/vmlinuz-4.4.120-92.70-default root=[…]
$ strings ~/vmlinux-4.4.120-92.70-default |grep "4\.4\."
Linux version 4.4.120-92.70-default (geeko@buildhost) (gcc version 4.8.5 (SUSE Linux) ) #1 SMP Wed Mar 14 15:59:43 UTC 2018 (52a83de)
$ crash /usr/lib/debug/boot/vmlinux-4.4.120-92.70-default.debug ~/vmlinux-4.4.120-92.70-default vmcore
crash 7.1.5
[…]
GNU gdb (GDB) 7.6
[…]
This GDB was configured as "x86_64-unknown-linux-gnu"...
WARNING: could not find MAGIC_START!
WARNING: cannot read linux_banner string
crash: /usr/lib/debug/boot/vmlinux-4.4.120-92.70-default.debug and vmcore do not match!
Usage: […]
I think the strings invocations above prove that the kernel and the core do match, but I still get that error. What can I do next?
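The manual comparison above can be scripted. This is a minimal sketch, assuming `strings` output is piped in on stdin; the function names `core_release` and `banner_release` are made up for illustration. Matching release strings only rule out a version mix-up; they do not show that crash will accept the files.

```shell
#!/bin/sh
# Sketch: extract kernel release strings from `strings` output read on stdin.
# Function names are hypothetical helpers, not part of any tool.

# Prints the value of the first OSRELEASE= line (as seen in the vmcore above).
core_release() {
    sed -n 's/^OSRELEASE=//p' | head -n 1
}

# Prints the third word of the first "Linux version <release> ..." banner line.
banner_release() {
    awk '$1 == "Linux" && $2 == "version" { print $3; exit }'
}

# Example use (file names are placeholders):
#   core=$(strings vmcore | core_release)
#   kern=$(strings vmlinux | banner_release)
#   [ "$core" = "$kern" ] && echo "release strings match: $core"
```

Keeping the helpers as stdin filters means they can be tried on saved `strings` output without access to the production dump.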

Related

Runtime error "undefined symbol g_malloc0_n" - but only on old Debian5

We have a legacy Linux application, written in C, using GTK.
We compile it on 32-bit Ubuntu 14, where we can also launch it.
In production, we run it on 32-bit Debian 10.
We also have some very old hosts, which run Debian 5. The old version of the application runs on Debian 5, but when we compile a new one, we get a runtime error:
.../bin/gui: symbol lookup error: .../lib/libmkt.so: undefined symbol: g_malloc0_n
It's strange, because nm finds it:
$ nm .../lib/libmkt.so | grep g_malloc
U g_malloc0
U g_malloc0_n
Also the application has a good reference entry to the library:
$ ldd .../bin/gui | grep libmkt
libmkt.so => .../lib/libmkt.so (0xb7257000)
This is happening only on the old machine with Debian 5:
$ uname -a
Linux kiosk 2.6.26-2-686 #1 SMP Mon Aug 30 07:01:57 UTC 2010 i686 GNU/Linux
On the new machine, with Debian 10, the application starts:
$ uname -a
Linux kiosk 4.19.0-17-686-pae #1 SMP Debian 4.19.194-1 (2021-06-10) i686 GNU/Linux
On the developer machine with Ubuntu 14, the application starts:
$ uname -a
Linux ubuntu-build-server 4.4.0-142-generic #168~14.04.1-Ubuntu SMP Sat Jan 19 11:28:33 UTC 2019 i686 i686 i686 GNU/Linux
We start the application with LD_LIBRARY_PATH=.../lib .../bin/gui. We have a bunch of .so files in the .../lib directory, including .../lib/libmkt.so, which the error message names.
My guess is that the compiler uses some feature for libmkt.so that the 2.6 kernel does not like, but I haven't found such an issue on the internet.
UPDATE: .../lib/libmkt.so does not contain the missing g_malloc0_n symbol, but it refers to the GTK library, which does. What should I do in order to find such second-hop symbols?
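One way to chase such second-hop symbols is to list what a library leaves undefined and then check each library reported by ldd for a definition. This is a rough sketch, assuming GNU `nm -D` output is piped in; the helper names `undefined_syms` and `defines_sym` are hypothetical.

```shell
#!/bin/sh
# Sketch: helpers that filter `nm -D` output read on stdin.
# Both function names are made up for illustration.

# Prints the names of symbols marked "U" (needed but not defined here).
undefined_syms() {
    awk '$1 == "U" { print $2 }'
}

# Succeeds if the named symbol appears as a defined entry
# ("<address> <type> <name>", i.e. three fields).
defines_sym() {
    awk -v sym="$1" 'NF == 3 && $3 == sym { found = 1 } END { exit !found }'
}

# Example use (run on the machine where the error occurs; the second
# path is a placeholder for whatever library ldd reports):
#   nm -D .../lib/libmkt.so | undefined_syms
#   ldd .../lib/libmkt.so
#   nm -D /path/reported/by/ldd.so | defines_sym g_malloc0_n && echo defined
```

Running the last check against the libraries installed on the Debian 5 host, rather than on the build machine, shows whether that host's copy actually provides the symbol.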

nvcc fatal : Unsupported gpu architecture 'compute_20' while compiling matlab

(CentOS Linux release 7.3; CUDA 9.1; GPU: Tesla P100-PCIE)
I've installed MATLAB 2018a on a server, but when I tried to do this:
vl_compilenn('enableGpu', true);
I encountered this:
vl_compilenn: CUDA: MEX config file: '/data1/zhangdinghuai/gitrepo/explanatoryGraph/matconvnet-1.0-beta24/matlab/src/config/mex_CUDA_glnxa64.xml'
Building with 'nvcc'.
nvcc fatal : Unsupported gpu architecture 'compute_20'
and
Building with 'nvcc'.
Error using mex
nvcc fatal : Unsupported gpu architecture 'compute_20'
Error in vl_compilenn>mex_compile (line 529)
mex(mopts{:}) ;
Error in vl_compilenn (line 487)
mex_compile(opts, srcs{i}, objfile, flags.mexcu) ;
I have searched for similar questions, but none of the answers worked. Can anyone give me a hand?
PS: more information about the server is listed below:
[zhangdinghuai@gpu01 2018a]$ lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.3.1611 (Core)
Release: 7.3.1611
Codename: Core
[zhangdinghuai@gpu01 2018a]$ cat /etc/issue
\S
Kernel \r on an \m
[zhangdinghuai@gpu01 2018a]$ cat /proc/version
Linux version 3.10.0-514.26.1.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Thu Jun 29 16:05:25 UTC 2017
In a similar thread, "nvcc fatal : Unsupported gpu architecture 'compute_20' while cuda 9.1+caffe+openCV 3.4.0 is installed", and on Ask Ubuntu, it was recommended to edit Makefile.config and comment out the -gencode arch=compute_20 entry.
Can you also share the exact kernel version you are using, the exact PCI device with its PCI ID, and the driver versions, if any? This might give better insight into your environment and could help answer further questions.
My solution was modifying the file matconvnet/matlab/src/config/mex_CUDA_glnxa64.xml.
Change the line
`NVCCFLAGS="-D_FORCE_INLINES -gencode=arch=compute_20,code=sm_20 -gencode=arch=compute_30,code=\"sm_30,compute_30\" $NVCC_FLAGS"`
into
`NVCCFLAGS="-D_FORCE_INLINES -gencode=arch=compute_30,code=\"sm_30,compute_30\" $NVCC_FLAGS"`
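Assuming the intended change is to drop the compute_20 option that nvcc rejects, the edit can be sketched as a sed filter. The function name `drop_compute_20` is hypothetical, and the pattern matches only the exact option text quoted above.

```shell
#!/bin/sh
# Sketch: remove the compute_20 -gencode option from text read on stdin.
# Only the literal option string from the config line above is matched.
drop_compute_20() {
    sed 's/ -gencode=arch=compute_20,code=sm_20//'
}

# Example use (write to a new file first and inspect it before replacing):
#   drop_compute_20 < mex_CUDA_glnxa64.xml > mex_CUDA_glnxa64.xml.new
```

Writing to a separate file keeps the original config available if the result needs to be compared or reverted.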

systemtap profiling gc node.js

I installed node.js (0.9.4) via nvm, which according to the changelog has systemtap support.
I installed systemtap on my Fedora linux distro.
$ sudo yum install systemtap
I used this gist from Ben Noordhuis.
$ stap -l 'process("node")'
produces nothing.
$ sudo stap gc.stp -c 'node test.js'
semantic error: while resolving probe point: identifier 'process' at gc.stp:7:7
source: probe process("node").mark("gc__start")
^
semantic error: no match
semantic error: while resolving probe point: identifier 'process' at :12:7
source: probe process("node").mark("gc__done")
I have no experience at all with systemtap, but I would like to toy with it. What is possible with it? Can I see how much memory is consumed by code (http://stackoverflow.com/questions/13126808/whats-the-node-js-memory-breakdown)?
Update to answer comment.
$ readelf -n node
readelf: Error: 'node': No such file
$ which node
~/nvm/v0.9.4/bin/node
$ readelf -n ~/nvm/v0.9.4/bin/node
Notes at offset 0x0000021c with length 0x00000020:
Owner Data size Description
GNU 0x00000010 NT_GNU_ABI_TAG (ABI version tag)
OS: Linux, ABI: 2.6.32
Notes at offset 0x0000023c with length 0x00000024:
Owner Data size Description
GNU 0x00000014 NT_GNU_BUILD_ID (unique build ID bitstring)
Build ID: 294da933883eaeaf7e848073dc3db6bff6762fb4
[alfred@alfred81-AMILO-Pi-2515 gc-stap]$ uname -a
Linux alfred81-AMILO-Pi-2515.lan 3.6.3-1.fc17.x86_64 #1 SMP Mon Oct 22 15:32:35 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
$ stap -V
Systemtap translator/driver (version 2.0/0.154, rpm 2.0-1.fc17)
Copyright (C) 2005-2012 Red Hat, Inc. and others
This is free software; see the source for copying conditions.
enabled features: AVAHI LIBRPM LIBSQLITE3 NSS TR1_UNORDERED_MAP NLS
Your copy of node appears to be compiled without sys/sdt.h markers. If they were compiled in, readelf -n would show something like ...
stapsdt 0x00000040 NT_STAPSDT (SystemTap probe descriptors)
Provider: stap
Name: stap_system__spawn
Location: 0x000000000012e1b0, Base: 0x00000000001cb886, Semaphore: 0x0000000000000000
Arguments: -4@%ebx -4@%eax
Perhaps it was configured with --without-dtrace.
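The check described above can be scripted. This is a small sketch, assuming `readelf -n` output is piped in; the function name `count_sdt_notes` is made up for illustration.

```shell
#!/bin/sh
# Sketch: count SystemTap SDT probe notes in `readelf -n` output on stdin.
# Prints 0 when the binary carries no such notes.
count_sdt_notes() {
    # grep -c exits non-zero on zero matches; "|| true" keeps the helper
    # usable under "set -e" while still printing the count.
    grep -c 'NT_STAPSDT' || true
}

# Example use:
#   readelf -n ~/nvm/v0.9.4/bin/node | count_sdt_notes
```

A count of 0 matches the situation in the question, where only the ABI tag and build ID notes are present.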

make: can't find /usr/include/linux/ext3_fs.h

When I try to compile one of my old programs, which uses ext3 structures, on the new Fedora 16,
I get the message
# make
Compile main.c In file included from main.c:8:0:
giis.h:18:28: fatal error: linux/ext3_fs.h: No such file or directory
compilation terminated.
I ran yum install kernel-devel kernel-headers, but it still gives the above message.
# uname -a
Linux space 3.2.9-2.fc16.x86_64 #1 SMP Mon Mar 5 20:55:39 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
The Linux kernel does not export a header called ext3_fs.h, or no longer does so. Edit your giis.h to do without it. See commit v2.6.25-rc8~52: “Neither of the headers actually compiles when included from userpsace nor should it be made available as userspace tools should be using the libraries or at least headers from e2fsprogs.”

libstdc++.so.6: cannot handle TLS data

I have an application compiled on:
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
Linux debian 2.6.18-5-686 #1 SMP Fri Jun 1 00:47:00 UTC 2007 i686 GNU/Linux
and it runs well.
Now I want to run it on:
Linux 2.4.20_mvlcge31-tomas #7 Thu May 7 11:33:21 CEST 2009 i686 unknown
I get the following error:
libstdc++.so.6: cannot handle TLS data
On the web I saw someone suggest doing this: export LD_ASSUME_KERNEL=2.2.5
I tried it, but got even more errors:
ls: error while loading shared libraries: librt.so.1: cannot open shared object file: No such file or directory
Can anyone help me with it? Thanks.
You compiled the application against a much newer libc and kernel version. You can't compile a program on 2.6 with the newest libc and expect it to run on an old kernel.
Also where do you actually still use Linux 2.4?
