Analyzing a Xen core dump - linux

After a Xen guest domain hang, I took a dump using xm dump-core. Following the sparse documentation I found, I tried using the crash utility to analyze the dump.
Unfortunately, the kernel image (Debian lenny) is stripped, so I am forced to make use of the map file.
However,
crash /boot/System.map-2.6.26-2-xen-amd64 vmlinux-2.6.26-2-xen-amd64 /mnt/my-core-file
(with vmlinux-2.6.26-2-xen-amd64 being the gunzip'ed vmlinuz image) fails:
crash: vmlinux-2.6.26-2-xen-amd64: no debugging data available
Then I read that current Xen versions produce ELF-compatible dumps for guest domains. Indeed, this seems to be the case:
~$ sudo file my-core-dump
my-core-dump: ELF 64-bit LSB core file x86-64, version 1
However, gdb vmlinux-2.6.26-2-xen-amd64 my-core-dump fails, too:
...is not a core dump: File format not recognized
Any hints?

Have you tried attaching to the domU console?
xm create domU.conf -c
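If the domain is still defined and merely hung, you can also try attaching to its existing console instead of recreating it; a minimal example, assuming the domain is named domU:
xm console domU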
On the subject of the core-dump file, I found this:
http://lists.xensource.com/archives/html/xen-devel/2006-12/msg00456.html
I just want to check that you aren't under the impression that 'xm
dump-core' emits an Elf core file. It doesn't -- the format is custom and as
far as I know is only interpreted by a set of gdbserver patches that we ship
in our repository. Does the crash utility really support this special
format?
Edit: This might help to debug the core-dump: http://os-drive.com/files/docbook/xen-faq.html#setup_gdb
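For what it's worth, newer Xen releases than the one in the question ship gdbsx for attaching gdb to a live (or paused) domU over TCP; a rough sketch, assuming domain ID 5, port 9999 and a dom0 hostname of dom0-host (all made-up values):
# on dom0: export domid 5 (a 64-bit guest) on port 9999
gdbsx -a 5 64 9999
# on the debugging host: point gdb at the unstripped kernel and connect
gdb vmlinux-2.6.26-2-xen-amd64
(gdb) target remote dom0-host:9999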

Related

gdb can't resolve symbols for linux kernel

I have set up a Linux kernel debug environment with VMware Workstation. When I try to connect with gdb, the connection succeeds, but I can't set any breakpoint or examine any kernel symbol.
Target machine (debuggee), Ubuntu 18:
I have compiled Linux kernel 5.0-0 with the following directives:
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_INFO_REDUCED is not set
# CONFIG_DEBUG_INFO_SPLIT is not set
CONFIG_DEBUG_INFO_DWARF4=y
CONFIG_DEBUG_FS=y
# CONFIG_DEBUG_SECTION_MISMATCH is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
Also my VMX file configuration:
debugStub.listen.guest64 = "TRUE"
debugStub.listen.guest64.remote="TRUE"
After that I transferred vmlinux to the debugger machine and used gdb:
bash$ gdb vmlinux
gdb-peda$ target remote 10.251.31.28:8864
Remote debugging using 10.251.31.28:8864
Warning: not running or target is remote
0xffffffff9c623f36 in ?? ()
gdb-peda$ disas sys_open
No symbol "do_sys_open" in current context.
First you need to install kernel-debug-devel, kernel-debuginfo, and kernel-debuginfo-common for the corresponding kernel version.
Then you can use the crash utility to debug the kernel; it uses gdb internally.
The symbol name you're looking for is sometimes not exactly what you expect it to be. You can use readelf or similar tools to find the full name of the symbol in the kernel image. These names sometimes differ from the names in the code because of architecture-level differences and the related header and C definitions in the kernel code. For example, you might be able to disassemble the open() system call by using:
disas __x64_do_sys_open
if you've compiled it for the x86-64 architecture.
Also keep in mind that these naming conventions are subject to change across kernel versions.
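For instance, a quick way to list candidate symbol names (a minimal sketch; it assumes the vmlinux still carries a symbol table and that you have shell access to the debuggee for kallsyms):
nm vmlinux | grep sys_open          # search the vmlinux symbol table
grep sys_open /proc/kallsyms        # or ask the running kernel directly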

Linux application gets Killed

I have a "Seagate Central" NAS with an embedded linux on it
$ cat /etc/*release
MontaVista Linux 6, (.dev-snapshot-20130726)
When I try to run my own application on this NAS, it gets "Killed"
without any notification in dmesg or /var/log/messages
$ cat /proc/cpuinfo
Processor : ARMv6-compatible processor rev 4 (v6l)
BogoMIPS : 279.34
Features : swp half thumb fastmult vfp edsp java
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xb02
CPU revision : 4
Hardware : Cavium Networks CNS3420 Validation Board
Revision : 0000
Serial : 0000000000000000
My toolchain is
Sourcery_CodeBench_Lite_for_ARM_GNU_Linux/arm-none-linux-gnueabi
and my compile switches are
-march=armv6k -mcpu=mpcore -mfloat-abi=softfp -mfpu=vfp
How can I find out which process is killing my application, or what setting I have to change?
PS: I have created a simple HelloWorld application, which is also not working!
$ ldd Hello
not a dynamic executable
readelf -a Hello
=> http://pastebin.com/kT9FvkjE
readelf -a zip
=> http://pastebin.com/3V6kqA9b
UPDATE 1
I have compiled a new binary with hard float
Readelf output
http://pastebin.com/a87bKksY
But no success ;(
I guess it is really a "lock" topic that is blocking my application from executing. How can I find out which application kills mine?
Or how can I disable that kind of mechanism?
Use these compiler switches:
-march=armv6k -Wl,-z,max-page-size=0x10000,-z,common-page-size=0x10000,-Ttext-segment=0x10000
See also this link regarding the toolchain.
You can run readelf -a against one of the built-in binaries (e.g. /usr/bin/nano) to see the proper text-segment offset in the section headers and page size / alignment in the program headers. The above compiler flags make self-compiled programs match the structure of the built-in binaries, and have been tested to work. It seems the Seagate Central NAS uses a page size / offset of 0x10000 while the default for ARM gcc is 0x8000.
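A hedged way to do that comparison, using nano as the stock binary mentioned above (any built-in binary will do):
readelf -l /usr/bin/nano | grep -E 'LOAD|Align'   # alignment of a built-in binary
readelf -l Hello | grep -E 'LOAD|Align'           # compare with your own build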
Edit: I see you ran readelf already. Your pastebin shows
HelloWorld:[ 1] .interp PROGBITS 00008134 000134 000013 00 A 0 0 1
zip:[ 1] .interp PROGBITS 00010134 000134 000013 00 A 0 0 1
The value 10134-134=10000 (hex) yields the correct text-segment linker option. Further down (LOAD...) are the alignment specifiers, which are 0x8000 for your HelloWorld, but 0x10000 for the zip built-in. In my experience, soft-float has not caused problems.
Do you see any output at all?
Is your application dynamically linked?
If so, run the dynamic linker with the verbose option (you'll have to figure out the name of the dynamic linker on your platform; for Arch Linux, it is ldd):
ldd --verbose 'your_program_name'
That will tell you if you're missing any dependencies (shared libs etc)
Run readelf -a 'your_program_name'
Make sure the file mentioned in Requesting program interpreter: /lib/ld-linux.so.2 exists. In this case, that filename is /lib/ld-linux.so.2
If this fails to help you figure out the problem, post the complete output of ldd --verbose 'your_program_name' and readelf -a 'your_program_name' in your question.
Another issue may be that the NAS software just kills foreign programs. I'm not sure why it would, but we're talking about a big corporation here (Seagate) and they have odd ideas of how the world works at times...
Edit, after looking at the pastebin of readelf:
From what I see, your Hello executable differs in 2 ways from the zip executable:
It is not dynamically linked, so that throws out a whole load of problems to look for.
There's a difference in how the 2 programs are built. zip does not use softfloats and Hello does. I suspect the soft-float dependency is due to one or both of these compiler switches: -mfloat-abi=softfp -mfpu=vfp
Hello Flags: 0x5000202, has entry point, Version5 EABI, soft-float ABI
zip Flags: 0x5000002, has entry point, Version5 EABI
I'd start with either:
Removing the soft-float option from the Hello build, or
Making sure the soft-float emulation libraries are on the machine. I don't know which libs this would depend on, but I do remember MontaVista supplying them the last time I touched their software. It's been 8+ years since I touched MontaVista, so it's clouded in a bit of old-memory fog.
This is an old thread, but I just wanted to add that I succeeded in compiling a "hello world" for this old NAS today.
Running ld-linux.so.3 <app> told me that
ELF load command alignment not page-aligned
Googling this error, I found https://github.com/JuliaLang/julia/issues/33293, which pointed me to these linker options:
-Wl,-z,max-page-size=0x10000
Compiling with these options yielded an ELF that actually did work!
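Putting it together, a sketch of a build command that combines the question's original switches with the linker options suggested above (hello.c and the output name are placeholders):
arm-none-linux-gnueabi-gcc -march=armv6k -mfloat-abi=softfp -mfpu=vfp \
  -Wl,-z,max-page-size=0x10000,-z,common-page-size=0x10000,-Ttext-segment=0x10000 \
  hello.c -o Hello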
Are you sure your compilation options are correct?
Try the following:
strace your application (if strace is present on the NAS)
Download one of the NAS binaries and run arm-none-linux-gnueabi-readelf -a on it, do the same on your HelloWorld program, and see if the ABI tags differ (see the example below).
It looks like an illegal instruction issue, a floating point issue or an incompatible libc issue.
Edit: according to the readelf output, the NAS programs are compiled without soft float; you should try that.
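For example, to compare the ABI-related flags as suggested in the list above (the path to zip on the NAS is a guess; use whatever binaries it actually ships):
arm-none-linux-gnueabi-readelf -h Hello | grep Flags
arm-none-linux-gnueabi-readelf -h /usr/bin/zip | grep Flags
arm-none-linux-gnueabi-readelf -A Hello   # ARM attribute tags (float ABI, VFP use, etc.)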

core file not created on a segmentation fault on ARM

I have an ARM executable [(debug build) ELF 32-bit LSB executable, ARM, version (SYSV)] running on an ARM Cortex-A9 target with Linux (kernel 2.6.38.8).
The process has user ID root and group ID root.
Even though the process crashes after getting SIGSEGV, no core file is generated.
I have read this question and made sure the file system is writable, ulimit -c is unlimited, and the user is root with the required permissions, but still something is missing.
Here are the values of a few process and system settings related to core file creation:
cat /proc/<pid>/coredump_filter is 00000033
cat /proc/sys/kernel/core_pattern is core
cat /proc/sys/kernel/core_uses_pid is 0
I have tried everything, but I am stumped.
Could there be any kernel config/build option disabling the core creation?
Any other pointers?
EDIT:
I did a simple test as below and it created a core file, but my crashing process still does not dump one.
sleep 20 &
killall -SIGSEGV sleep
Could there be any kernel config/build option disabling the core creation?
It is hidden under General Setup|Embedded System or General Setup|Configure standard... depending on your Linux version. The config symbol is ELF_CORE and it lives in init/Kconfig. If it is not enabled, you will never get core dumps.
As suggested in a hidden comment under "why coredump file is not generated".
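A quick way to check whether the running kernel was built with it, assuming the kernel exposes its configuration in one of the usual places:
zcat /proc/config.gz | grep ELF_CORE
grep ELF_CORE /boot/config-$(uname -r)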

Find which program caused a core dump file

I've been going through intense program/package installation recently, so I can't tell for sure which of the newly installed programs (or old programs) caused the appearance of a core file in my home folder. It's a server, so I better find out any possible sources of instability on the machine.
You can simply use the file program to identify them:
E.g
# file /var/core/core
/var/core/core: ELF 64-bit MSB core file SPARCV9 Version 1, from 'crs_stat.bin'
Often using the file program on the core file will show the errant executable, as explained by @Benj in the accepted answer (code from Benj's answer):
# file /var/core/core
/var/core/core: ELF 64-bit MSB core file SPARCV9 Version 1, from 'crs_stat.bin'
However, sometimes you may get a complaint about "too many program header sections":
core.some-lib.nnnn.nnnn: ELF 64-bit LSB core file x86-64, version 1 (SYSV), too many program header sections (1850)
In this case, you can try some alternatives:
Tail the last several strings of the corefile (the app was about 25 back for me): strings core.some-lib.nnnn.nnnn | tail -50
Use gdb itself: gdb -c core.some-lib.nnnn.nnnn. This will often tell you something like this: Core was generated by '/usr/local/bin/some-executable'
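A hedged one-liner version of the gdb approach (the core file name is the placeholder used above):
gdb -batch -c core.some-lib.nnnn.nnnn 2>/dev/null | grep 'Core was generated'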
You can also navigate to the directory where the core.pid file is and run gdb -c core.pid

Core-dump file format

I have written a custom core-dump handling application for a project. I have changed /proc/sys/kernel/core_pattern to call my dump handler, and it is invoked successfully.
Now the issue is saving the core dump into a file that can be recognized by gdb. Currently my dump handler reads the dump from STDIN and saves it into a file 'core.dump'. When I try to load this core dump into gdb it gives me an error:
(gdb) ... is not a core dump: File format not recognized
When I run the 'file' command on a standard core dump it gives me the following:
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from './dump_gen'
And for the custom-generated dump, 'file' gives the following:
core.dump: data
Please, can anyone help me figure out how to write the core dump correctly so it can be used in gdb?
PS: I don't want to use the standard core dump file.
I think you somehow don't write all the data to the core file.
Create a simple script, make it executable and set the core pattern to the script.
#!/bin/sh
cat > /tmp/core.$$
Now generate a core file (for example run sleep 1243 and press ctrl+\) and it should work.
I just tested it myself on my system and it works without a problem.
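For reference, a pipe handler is registered by writing the absolute path prefixed with '|' to core_pattern; assuming the test script above was saved as /tmp/core-test.sh (a placeholder name):
chmod +x /tmp/core-test.sh
echo '|/tmp/core-test.sh' > /proc/sys/kernel/core_pattern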
The first thing to check that comes to mind is the ELF header type field, which indicates what kind of file it is. Its main values are relocatable object, shared object, executable, and core dump. That's most likely what's causing the gdb errors.
Also, try examining it with objdump; it can pull apart the entire ELF file to help you analyze which part of it is apparently not good.
You can find the ELF spec at https://refspecs.linuxbase.org/elf/elf.pdf
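As a quick sanity check of the header type field discussed above (core.dump is the file name from the question), a valid core should report Type: CORE (Core file):
readelf -h core.dump | grep -E 'Class|Type'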

Resources