__stack_chk_fail when executing copy_process() in _do_fork()? - linux

I'm working on an experimental project with the Linux kernel (4.4.52) on x86_64. One requirement is that the Write Protect bit in the CR0 register must always be enabled whenever control flow leaves certain functions. Roughly speaking, the pattern looks like this (the idea comes from the nested kernel, but that is not very relevant to my question):
DISABLE_CR0.WP_BIT
original_func()
ENABLE_CR0.WP_BIT
That way, the rest of the kernel always executes with CR0.WP enabled. I have wrapped the original native_set_pte and native_write_cr3 functions in the pattern above, and now the kernel crashes during boot.
Here is the log (this is the original output, although the ordering looks odd):
[ 1.403888] IP: [<ffff8800351ebbb0>] 0xffff8800351ebbb0
[ 1.403891] PGD 2876067 PUD 2877067 PMD 3500e063 PTE 80000000351eb163
[ 1.403892] Oops: 0011 [#2] SMP
[ 1.403898] Modules linked in: crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse pata_acpi floppy
[ 1.403901] CPU: 0 PID: 143 Comm: systemd-udevd Tainted: G D 4.4.52v1+ #2
[ 1.403902] Hardware name: Fedora Project OpenStack Nova, BIOS 0.5.1 01/01/2011
[ 1.403903] task: ffff8800351c0e00 ti: ffff8800351e8000 task.ti: ffff8800351e8000
[ 1.403905] RIP: 0010:[<ffff8800351ebbb0>] [<ffff8800351ebbb0>] 0xffff8800351ebbb0
[ 1.403906] RSP: 0018:ffff8800351ebba8 EFLAGS: 00211086
[ 1.403906] RAX: 000000000000000e RBX: ffff8800351ebcf8 RCX: 000000000000000e
[ 1.403907] RDX: 0000000000000000 RSI: 0000000000201092 RDI: 0000000000201092
[ 1.403908] RBP: 0000000000000003 R08: ffffffff82778d60 R09: ffff8800351ebb40
[ 1.403909] R10: 0000000000000030 R11: ffffc00000000fff R12: ffff8800351c0e00
[ 1.403909] R13: 0000000000000010 R14: 0000000000201046 R15: ffffffffffffffff
[ 1.403911] FS: 00007f1f021e38c0(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[ 1.403912] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.403912] CR2: ffff8800351ebbb0 CR3: 00000000351cd000 CR4: 00000000001406f0
[ 1.403916] Stack:
[ 1.403918] ffffffff810b62ae ffff8800351ebbc0 000000000000006c 0000000000000000
[ 1.403920] ffff8800351ebbd8 ffffffff810b62ae ffffffff8111ce51 00000000000364a4
[ 1.403921] ffffffff82783168 000000000000005c 000000000000000c ffffffff820583b0
[ 1.403922] Call Trace:
[ 1.403928] [<ffffffff810b62ae>] ? kvm_sched_clock_read+0x1e/0x30
[ 1.403930] [<ffffffff810b62ae>] ? kvm_sched_clock_read+0x1e/0x30
[ 1.403933] [<ffffffff8111ce51>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[ 1.403935] [<ffffffff8111c49e>] ? down_trylock+0x2e/0x40
[ 1.403937] [<ffffffff81129959>] ? console_trylock+0x19/0x60
[ 1.403938] [<ffffffff8112af2e>] ? vprintk_emit+0x29e/0x530
[ 1.403945] [<ffffffff8115fe8e>] ? crash_kexec+0x7e/0x140
[ 1.403953] [<ffffffff81440ae5>] ? find_next_bit+0x15/0x20
[ 1.403955] [<ffffffff814390bb>] ? __const_udelay+0x2b/0x30
[ 1.403958] [<ffffffff810a2a0c>] ? native_stop_other_cpus+0x8c/0x170
[ 1.403965] [<ffffffff811dde8f>] ? panic+0xeb/0x215
[ 1.403968] [<ffffffff810d12a7>] ? copy_process+0x727/0x1b20
[ 1.403970] [<ffffffff810d32f9>] ? __stack_chk_fail+0x19/0x20
[ 1.403972] [<ffffffff810d12a7>] ? copy_process+0x727/0x1b20
[ 1.403974] [<ffffffff810d2808>] ? _do_fork+0x78/0x360
[ 1.403975] [<ffffffff810d2b99>] ? SyS_clone+0x19/0x20
[ 1.403986] [<ffffffff818694f2>] ? entry_SYSCALL_64_fastpath+0x16/0x71
[ 1.404004] Code: 00 00 00 86 10 21 00 00 00 00 00 a8 bb 1e 35 00 88 ff ff 18 00 00 00 00 00 00 00 b0 bb 1e 35 00 88 ff ff ae 62 0b 81 ff ff ff ff <c0> bb 1e 35 00 88 ff ff 6c 00 00 00 00 00 00 00 00 00 00 00 00
[ 1.404005] RIP [<ffff8800351ebbb0>] 0xffff8800351ebbb0
[ 1.404006] RSP <ffff8800351ebba8>
[ 1.404006] CR2: ffff8800351ebbb0
[ 1.404008] ---[ end trace b62acacf75e0c54f ]---
[ 1.406415] Kernel Offset: disabled
[ 1.456105] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff810d12a7
[ 1.456105]
My guess is that something in copy_process causes an overflow; maybe it writes to read-only memory? But according to Intel's documentation, CR0.WP only affects supervisor-mode accesses, so does that mean the kernel is running in supervisor mode when executing copy_process?
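For reference, the CR0 value printed in the oops (0000000080050033) can be decoded by hand. The sketch below is a hypothetical userspace helper, not kernel code: the function names are made up for illustration, and the only assumption is the architectural bit layout of CR0 (PE is bit 0, WP is bit 16, PG is bit 31).

```c
#include <stdint.h>

/* Architectural CR0 bit positions on x86. */
#define CR0_PE (1ULL << 0)   /* protected mode enable */
#define CR0_WP (1ULL << 16)  /* write protect for supervisor-mode writes */
#define CR0_PG (1ULL << 31)  /* paging enable */

/* Returns 1 if the WP bit is set in a raw CR0 value, 0 otherwise. */
int cr0_wp_enabled(uint64_t cr0)
{
    return (cr0 & CR0_WP) != 0;
}

/* Returns 1 if paging is enabled in a raw CR0 value, 0 otherwise. */
int cr0_paging_enabled(uint64_t cr0)
{
    return (cr0 & CR0_PG) != 0;
}
```

For the value in the log, cr0_wp_enabled(0x80050033) returns 1, so WP was set at the moment the register dump was taken.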
I tried to disassemble the kernel, but got lost in the sheer number of assembly instructions, so I decided to investigate with QEMU instead. However, the kernel did NOT crash in QEMU! The command I used is:
qemu-system-x86_64 -m 1G -kernel arch/x86/boot/bzImage -initrd arch/x86/boot/linux4.4.52-rootfs.img -hda vdisk.img --append "root=/dev/sda rw console=ttyS0" -nographic
I had assumed that _do_fork is independent of specific devices and filesystems (correct me if I'm wrong), so whatever makes the kernel crash on my VPS should make it crash in QEMU as well, but it didn't.
Has anyone come across the same issue? I really need some help now.
P.S. I am doing this on my VPS, which runs Ubuntu 16.04.2, but I don't think that is the cause.

Please note that QEMU is not a fully architecture-accurate model and does not support every architectural feature. QEMU aims for emulation speed rather than accuracy, so it only provides a workable architecture profile that is good enough to run an OS.
QEMU handles some features precisely and correctly, such as paging and segmentation, but many others it does not: there are problems with CPUID and syscall, many instructions do not generate #GP, #SS, #PF or floating-point exceptions, some instructions are not implemented (AVX, AVX2, FMA), and so on.
To catch tricky cases like this one, use one of the x86 golden models instead: try Bochs, or one of the simulators provided by Intel itself.

Related

Additional serial ports are half working: why is that?

I have an embedded board with 2 serial ports, plus an additional PCI card with two serial ports and an LPT port, for a total of 4 serial ttys. However, the added ttyS2 and ttyS3 barely work.
The system runs Debian buster, with very few packages added to the minimal setup.
All the ports are recognized by the kernel:
[ +0,021002] 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[ +0,021110] 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
[ +0,036520] 0000:04:00.0: ttyS2 at I/O 0xc030 (irq = 19, base_baud = 115200) is a ST16650V2
[ +0,035081] 0000:04:00.1: ttyS3 at I/O 0xc020 (irq = 16, base_baud = 115200) is a ST16650V2
and a subsequent test with setserial gives the same result.
Note, however, that dpkg-reconfigure setserial does not write /etc/setserial.conf, and I have no idea why. I tried to work around this by copying the configuration by hand.
With applications such as minicom, opening the port and connecting from a remote terminal gives no result: nothing is sent and nothing is received.
A test application using librxtx-java appears to send data, but when data is received this happens:
[mar23 08:42] irq 16: nobody cared (try booting with the "irqpoll" option)
[ +0,000014] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.0-14-amd64 #1 Debian 4.19.171-2
[ +0,000003] Hardware name: /, BIOS 5.6.5 12/18/2018
[ +0,000002] Call Trace:
[ +0,000006] <IRQ>
[ +0,000012] dump_stack+0x66/0x81
[ +0,000008] __report_bad_irq+0x3a/0xb4
[ +0,000006] note_interrupt.cold.9+0xa/0x63
[ +0,000008] handle_irq_event_percpu+0x6d/0x80
[ +0,000006] handle_irq_event+0x3c/0x60
[ +0,000004] handle_fasteoi_irq+0xa3/0x160
[ +0,000007] handle_irq+0x1f/0x30
[ +0,000006] do_IRQ+0x49/0xe0
[ +0,000005] common_interrupt+0xf/0xf
[ +0,000003] </IRQ>
[ +0,000007] RIP: 0010:cpuidle_enter_state+0xb9/0x320
[ +0,000006] Code: e8 7c 85 b2 ff 80 7c 24 0b 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 3b 02 00 00
[ +0,000003] RSP: 0018:ffffb9e30020be90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd7
[ +0,000005] RAX: ffff9329f8122140 RBX: 000000ae53921a15 RCX: 000000000000001f
[ +0,000002] RDX: 000000ae53921a15 RSI: 0000000060062542 RDI: 0000000000000000
[ +0,000003] RBP: ffff9329f812a248 R08: 0000000000000002 R09: 0000000000021a00
[ +0,000002] R10: 000000ecc794a002 R11: ffff9329f8121128 R12: 0000000000000001
[ +0,000002] R13: ffffffffa98b70b8 R14: 0000000000000001 R15: 0000000000000000
[ +0,000011] do_idle+0x228/0x270
[ +0,000006] cpu_startup_entry+0x6f/0x80
[ +0,000005] start_secondary+0x1a4/0x200
[ +0,000006] secondary_startup_64+0xa4/0xb0
[ +0,000005] handlers:
[ +0,000009] [<00000000920e25ee>] serial8250_interrupt
[ +0,000005] Disabling IRQ #16
I read a few articles and skimmed the Serial-HOWTO, but it seems that once setserial has found the correct configuration everything should work as intended, so I have no clue what's going on.
Well, I (almost) resolved it: it all comes down to a hardware fault.
The board has 2 serial ports and 1 parallel port, all connected to an external device. I miswired a non-standard external connector and routed some RS232-level signals into the parallel port, which proved fatal.
The confusing part is that the controller still appeared to be working, when in fact it was not fully working. I'm waiting to get a new board...

How to run "invd" instruction with disabled SMP support?

I'm trying to execute the "invd" instruction from a kernel module. I previously asked a similar question, How to execute “invd” instruction?, and from @Peter Cordes's answer I understand that I can't safely run this instruction on an SMP system after boot. So shouldn't I be able to run it after boot without SMP support? Since no other core is running, there should be no chance of memory inconsistency. I have the following kernel module, compiled with the -O0 flag:
static int __init deviceDriver_init(void){
    unsigned long flags;
    int LEN=10;
    int STEP=1;
    int VALUE=1;
    int arr[LEN];
    int i;
    unsigned long dummy;
    printk(KERN_INFO "invd Driver loaded\n");
    //wbinvd();
    //asm volatile("cpuid\n":::);
    local_irq_disable();
    __asm__ __volatile__(
        "wbinvd\n"
        "loop:"
        "movq %%rdx, (%%rbx);"
        "leaq (%%rbx, %%rcx, 8), %%rbx;"
        "cmpq %%rbx, %%rax;"
        "jg loop;"
        "invd\n"
        : "=b"(dummy) // output
        : "b" (arr),
          "a" (arr+LEN),
          "c" (STEP),
          "d" (VALUE)
        : "cc", "memory"
    );
    local_irq_enable();
    //asm volatile("invd\n":::);
    printk(KERN_INFO "invd execute\n");
    return 0;
}
I still get an error when inserting the module: the terminal shows Segmentation fault (core dumped), and dmesg shows:
[ 2590.518614] invd Driver loaded
[ 2590.518840] general protection fault: 0000 [#5] SMP PTI
I have booted my kernel with nosmp, but I do not understand why dmesg still shows SMP PTI:
$cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.15.0-136-generic root=UUID=dbe747ff-a6a5-45cb-8553-c6db6d445d3d ro quiet splash nosmp vt.handoff=7
Update:
As I mentioned in the comments, after disabling SGX in the BIOS I was able to run invd without any error. However, when I run the same code on a different machine with the same kernel version, I still get the same error message. This is strange, and I can't explain why it happens. In the comments, @prl mentioned that the error may be coming from the instruction following invd, and I'm beginning to think that may be true, because the second-to-last line in dmesg is highlighted in red: [ 153.527386] RIP: loop+0xc/0xf22 [noSmp8] RSP: ffffb8d9450a7be0. So it seems the error is coming from inside the loop. I have updated the __init function code according to the suggestion. I'm not good at assembly, so can anyone tell me whether the inline assembly is correct and, if it is not, how to fix it? My whole dmesg trace is:
[ 153.514293] invd Driver loaded
[ 153.514547] general protection fault: 0000 [#1] SMP PTI
[ 153.514656] Modules linked in: noSmp8(OE+) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables ccm arc4 intel_rapl rt2800usb rt2x00usb x86_pkg_temp_thermal intel_powerclamp rt2800lib coretemp rt2x00lib mac80211 cfg80211 kvm_intel kvm irqbypass snd_hda_codec_realtek crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd intel_cstate intel_rapl_perf dell_smm_hwmon dell_wmi dell_smbios dcdbas intel_wmi_thunderbolt snd_hda_codec_generic dell_wmi_descriptor wmi_bmof snd_seq_midi snd_seq_midi_event
[ 153.515454] serio_raw snd_hda_intel snd_hda_codec snd_hda_core sparse_keymap snd_hwdep snd_rawmidi joydev input_leds snd_seq snd_pcm snd_seq_device snd_timer snd soundcore mei_me mei shpchp intel_pch_thermal mac_hid acpi_pad parport_pc ppdev lp parport autofs4 hid_generic usbhid hid nouveau mxm_wmi ttm drm_kms_helper psmouse syscopyarea sysfillrect sysimgblt igb e1000e dca i2c_algo_bit ptp pps_core ahci libahci fb_sys_fops drm wmi video
[ 153.516038] CPU: 0 PID: 4024 Comm: insmod Tainted: G OE 4.15.0-136-generic #140~16.04.1-Ubuntu
[ 153.516331] Hardware name: Dell Inc. BIOS 1.3.2 01/25/2016
[ 153.516626] RIP: 0010:loop+0xc/0xf22 [noSmp8]
[ 153.516917] RSP: 0018:ffffb8d9450a7be0 EFLAGS: 00010046
[ 153.517213] RAX: ffffb8d9450a7c08 RBX: ffffb8d9450a7c08 RCX: 0000000000000001
[ 153.517513] RDX: 0000000000000001 RSI: ffffb8d9450a7be0 RDI: ffff8edaadc16490
[ 153.517814] RBP: ffffb8d9450a7c60 R08: 0000000000012c40 R09: ffffffffb39624c4
[ 153.518119] R10: ffffb8d9450a7c78 R11: 000000000000038c R12: ffffb8d9450a7c10
[ 153.518427] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8eda4c6bd660
[ 153.518730] FS: 00007fd7f09cf700(0000) GS:ffff8edaadc00000(0000) knlGS:0000000000000000
[ 153.519036] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 153.519346] CR2: 00005634f95fde50 CR3: 000000040dd2c001 CR4: 00000000003606f0
[ 153.519656] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 153.519980] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 153.520289] Call Trace:
[ 153.520597] ? 0xffffffffc050d000
[ 153.520899] do_one_initcall+0x55/0x1ac
[ 153.521201] ? do_one_initcall+0x55/0x1ac
[ 153.521504] ? do_init_module+0x27/0x223
[ 153.521808] ? _cond_resched+0x32/0x50
[ 153.522107] ? kmem_cache_alloc_trace+0x165/0x1c0
[ 153.522408] do_init_module+0x5f/0x223
[ 153.522710] load_module+0x188c/0x1ea0
[ 153.523016] ? ima_post_read_file+0x83/0xa0
[ 153.523320] SYSC_finit_module+0xe5/0x120
[ 153.523623] ? SYSC_finit_module+0xe5/0x120
[ 153.523927] SyS_finit_module+0xe/0x10
[ 153.524231] do_syscall_64+0x73/0x130
[ 153.524534] entry_SYSCALL_64_after_hwframe+0x41/0xa6
[ 153.524838] RIP: 0033:0x7fd7f04fd599
[ 153.525144] RSP: 002b:00007ffda61c2968 EFLAGS: 00000202 ORIG_RAX: 0000000000000139
[ 153.525455] RAX: ffffffffffffffda RBX: 00005643631d7210 RCX: 00007fd7f04fd599
[ 153.525768] RDX: 0000000000000000 RSI: 0000564361c3226b RDI: 0000000000000003
[ 153.526084] RBP: 0000564361c3226b R08: 0000000000000000 R09: 00007fd7f07c2ea0
[ 153.526403] R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000000000
[ 153.526722] R13: 00005643631d7ca0 R14: 0000000000000000 R15: 0000000000000000
[ 153.527040] Code: 00 48 8b 75 c8 48 8b 45 c8 8b 55 b8 48 63 d2 48 c1 e2 02 48 01 d0 8b 4d b4 8b 55 bc 48 89 f3 48 89 13 48 8d 1c cb 48 39 d8 7f f4 <0f> 08 48 89 d8 48 89 45 d0 e8 40 ef 73 00 48 c7 c7 c7 d0 c4 c0
[ 153.527386] RIP: loop+0xc/0xf22 [noSmp8] RSP: ffffb8d9450a7be0
[ 153.530228] ---[ end trace cc9ea64985c9fe34 ]---
So, is it not possible to run invd even without SMP?
There are 2 questions here:
a) How to execute INVD (unsafely)
For this, you need to be running at CPL=0, and you have to make sure the CPU isn't using any "processor reserved memory protections" which are part of Intel's Software Guard Extensions (an extension to allow programs to have a shielded/private/encrypted space that the OS can't tamper with, often used for digital rights management schemes but possibly usable for enhancing security/confidentiality of other things).
Note that SGX is supported in recent versions of Linux, but I'm not sure when support was introduced or how old your kernel is, or if it's enabled/disabled.
If either of these isn't true (e.g. you're at CPL=3 or there are "processor reserved memory protections") you will get a general protection fault exception.
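Both conditions can be checked from values that are easy to obtain. The following is only a sketch with hypothetical helper names: it assumes the architectural facts that the low two bits of the CS selector hold the current privilege level, and that CPUID leaf 7 (sub-leaf 0) reports SGX support in EBX bit 2. That bit only says the processor supports SGX; it does not by itself say whether the firmware has enabled it.

```c
#include <stdint.h>

/* The low two bits of the CS selector are the current privilege level. */
unsigned cpl_from_cs(uint16_t cs)
{
    return cs & 0x3u;
}

/* CPUID.(EAX=7,ECX=0):EBX bit 2 is set when the processor supports SGX. */
int cpuid7_ebx_has_sgx(uint32_t ebx)
{
    return (int)((ebx >> 2) & 1u);
}
```

In the oops from the question, the kernel-side "RIP: 0010:loop+0xc" gives cpl_from_cs(0x0010) == 0, so the CPL requirement was met, while the user-side "RIP: 0033:" gives 3.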
b) How to execute INVD Safely
For this, you have to make sure that the caches (which includes "external caches" - e.g. possibly including things like eDRAM and caches built into non-volatile RAM) don't contain any modified data that will cause problems if lost. This includes data from:
IRQs. These can be disabled.
NMI and machine check exceptions. For a running OS it's mostly impossible to stop/disable these and if you can disable them then it's like crossing your fingers while ignoring critical hardware failures (an extremely bad idea).
The firmware's System Management Mode. This is a special CPU mode the firmware uses for various things (e.g. ECC scrubbing, some power management, emulation of legacy devices) that's beyond the control of the OS/kernel. It can't be disabled.
Writes done by the CPU itself. This includes updating the accessed/dirty flags in page tables (which can not be disabled), plus any performance monitoring or debugging features that store data in memory (which can be "not enabled").
With these restrictions (and not forgetting the performance problems) there are only 2 cases where INVD might be sane - early firmware code that needs to determine RAM chip sizes and configure memory controllers (where it's very likely to be useful/sane), and the instant before the computer is turned off (where it's likely to be pointless).
Guesswork
I'm guessing (based on my inability to think of any other plausible reason) that you want to construct a temporary shielded/private area of memory (to enhance security, e.g. so that the data you put in that area won't/can't leak into RAM). In this case (ironically) it's possible that the tool designed specifically for this job (SGX) is preventing you from doing it badly.

Kernel panic when loading a snapshot of a VM that uses an image generated with the i586_qemu config by ptxdist

I built the i586_qemu configuration (with some changes to the package selection) using ptxdist 2012.12.0. Everything works fine on my laptop (Ubuntu 12.04.2, Linux 3.5.0-23-generic, in VirtualBox running on an MPB). However, when I copied the images to a server (running Ubuntu 12.04.4, Linux 3.11.0-19-generic) and tried to use the savevm and loadvm commands, I got a kernel panic.
Here's the output:
(qemu) savevm vm0
(qemu) Clocksource tsc unstable (delta = 5441725078 ns)
Switching to clocksource jiffies
(qemu) info snapshots
ID TAG VM SIZE DATE VM CLOCK
1 vm0 16M 2014-04-19 00:36:32 00:04:12.923
savevm seems to take a little longer here than it does on my laptop. But when I restart the VM, the problem appears:
sudo kvm -nographic -m 256 -M pc -no-reboot -kernel ./images/linuximage -hda ./images/hd.img.qcow2 -device e1000,netdev=net0,mac='DE:AD:BE:EF:12:03' -netdev tap,id=net0,script=qemu-ifup.sh -append "root=/dev/sda1 rw console=ttyS0,115200 debug" -loadvm vm0
+ switch=br0
+ ovs-vsctl del-port br0 tap0
+ [ -n tap0 ]
+ whoami
+ /usr/bin/sudo /usr/sbin/tunctl -u root -t tap0
sudo: /usr/sbin/tunctl: command not found
+ /usr/bin/sudo /sbin/ip link set tap0 up
+ sleep 0.1s
+ /usr/bin/sudo ovs-vsctl add-port br0 tap0
+ exit 0
divide error: 0000 [#1] PREEMPT
Modules linked in:
Pid: 0, comm: swapper Not tainted 3.0.0-pengutronix #1 Bochs Bochs
EIP: 0060:[<c01067e8>] EFLAGS: 00000246 CPU: 0
EAX: 00000000 EBX: c02e6a74 ECX: 00000096 EDX: 00000003
ESI: 00020800 EDI: c02b4000 EBP: c02b3ff8 ESP: c02b3fe8
DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
Process swapper (pid: 0, ti=c02b2000 task=c02ba480 task.ti=c02b2000)
Stack:
c0101448 c02cc5a3 c02e6a74 00000800 0052b003 00000000
Call Trace:
[<c0101448>] ? 0xc0101448
[<c02cc5a3>] ? 0xc02cc5a3
Code: 0f 01 c8 e8 41 ff ff ff 85 c0 75 07 89 c1 fb 0f 01 c9 c3 fb c3 83 3d 98 c6 2f c0 00 75 1c 80 3d c5 9c 2c c0 00 74 13 eb 15 fb f4 <eb> 01 fb 89 e0 25 00 e0 ff ff 83 48 0c 04 c3 fb f3 90 c3 89 e0
EIP: [<c01067e8>] SS:ESP 0068:c02b3fe8
---[ end trace 6fe899157eb8f58b ]---
Kernel panic - not syncing: Attempted to kill the idle task!
Clocksource tsc unstable (delta = 5233522621 ns)
The most obvious thing to me is the "clocksource tsc unstable" warning. According to What does “clocksource tsc unstable” mean?, the problem could be a difference in TSC values between cores (the server I am using has 48). So, what should be done to stop the kernel panic? Or are there other causes?
The problem goes away when I use the TCG accelerator (which is the default accelerator on my laptop) instead of the KVM kernel module. The clocksource warning still occurs, but it seems to have no effect on the VM.

Analyzing CPU registers during kernel crash dump

I was debugging an issue and hit the kernel crash below, which also generated a crash dump. To some extent I know how to get to the exact line in the code where the issue occurred, using the gdb command l *(debug_fucntion+0x19).
<1>BUG: unable to handle kernel paging request at ffffc90028213000
<1>IP: [<ffffffffa0180279>] debug_fucntion+0x19/0x160 [dise]
<4>PGD 103febe067 PUD 103febf067 PMD fd54e1067 PTE 0
<4>Oops: 0000 [#1] SMP
<4>last sysfs file: /sys/kernel/mm/ksm/run
<4>CPU 7
<4>Modules linked in: dise(P)(U) ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge autofs4 8021q garp stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun kvm uinput ipmi_devintf power_meter microcode iTCO_wdt iTCO_vendor_support dcdbas sg ses enclosure serio_raw lpc_ich mfd_core i7core_edac edac_core bnx2 ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dise]
<4>
<4>Pid: 1126, comm: diseproc Tainted: P W --------------- 2.6.32-431.el6.x86_64 #1 Dell Inc. PowerEdge R710/0MD99X
<4>RIP: 0010:[<ffffffffa0180279>] [<ffffffffa0180279>] debug_fucntion+0x19/0x160 [dise]
<4>RSP: 0018:ffff880435fc5b88 EFLAGS: 00010282
<4>RAX: 0000000000000000 RBX: 0000000000010000 RCX: ffffc90028213000
<4>RDX: 0000000000010040 RSI: 0000000000010000 RDI: ffff880fe36a0000
<4>RBP: ffff880435fc5b88 R08: ffffffffa025d8a3 R09: 0000000000000000
<4>R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000010040
<4>R13: 000000000000b101 R14: ffffc90028213010 R15: ffff880fe36a0000
<4>FS: 00007fbe6040b700(0000) GS:ffff8800618e0000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>CR2: ffffc90028213000 CR3: 0000000fc965b000 CR4: 00000000000007e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process diseproc (pid: 1126, threadinfo ffff880435fc4000, task ffff8807f8be8ae0)
<4>Stack:
<4> ffff880435fc5be8 ffffffffa0180498 0000000081158f46 00000c200000fd26
<4><d> ffffc90028162000 0000fec635fc5bc8 0000000000000018 ffff881011d80000
<4><d> ffffc90028162000 ffff8802f18fe440 ffff880fc80b4000 ffff880435fc5cec
<4>Call Trace:
<4> [<ffffffffa0180498>] cmd_dump+0x1c8/0x360 [dise]
<4> [<ffffffffa01978e1>] debug_log_show+0x91/0x160 [dise]
<4> [<ffffffffa013afb9>] process_debug+0x5a9/0x990 [dise]
<4> [<ffffffff810792c7>] ? current_fs_time+0x27/0x30
<4> [<ffffffffa013bc38>] dise_ioctl+0xd8/0x300 [dise]
<4> [<ffffffff8105a501>] ? hotplug_hrtick+0x21/0x60
<4> [<ffffffff8119db42>] vfs_ioctl+0x22/0xa0
<4> [<ffffffff8119dce4>] do_vfs_ioctl+0x84/0x580
<4> [<ffffffff8119e261>] sys_ioctl+0x81/0xa0
<4> [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>Code: be c4 10 e1 48 8b 5d d8 44 01 f0 4c 8b 65 e0 4c 8b 6d e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 0f 1f 44 00 00 55 48 89 e5 0f 1f 44 00 00 <48> 8b 01 48 c1 e8 3c 83 f8 08 76 0b e8 f6 fb ff ff c9 c3 0f 1f
<1>RIP [<ffffffffa0180279>] debug_fucntion+0x19/0x160 [dise]
<4> RSP <ffff880435fc5b88>
<4>CR2: ffffc90028213000
My questions are:
1. Can the CPU register contents that are printed give more information? How do I decode them?
2. Can I find out, from the crash dump, the values of the variables or data structures that led to the crash?
3. What does the "Code: be c4 10 e1 48 8b 5d ..." line tell me here?
You must understand that you are inspecting (not debugging) at the assembly level (not the source code level). Keep this in mind whenever you inspect crash dumps.
Read the crash dump report carefully, line by line: it contains a lot of information, and it is all you have.
Once you have found the place where your code crashed, you have to figure out why it happened by reading the crash dump report and the disassembly.
The first line of your crash dump report tells you:
BUG: unable to handle kernel paging request at ffffc90028213000
That means your code accessed invalid memory.
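The "Oops: 0000" line in the same report carries the page-fault error code, which says a little more about the access. Here is a minimal sketch of decoding it, assuming only the architectural x86 bit layout of that code; decode_pf_error and struct pf_info are hypothetical names for illustration, not kernel APIs.

```c
#include <stdint.h>

/* Decoded view of an x86 page-fault error code. */
struct pf_info {
    int present;  /* bit 0: 0 = page not present, 1 = protection violation */
    int write;    /* bit 1: 0 = read access, 1 = write access */
    int user;     /* bit 2: 0 = fault in kernel mode, 1 = in user mode */
    int reserved; /* bit 3: a reserved bit was set in a paging entry */
    int ifetch;   /* bit 4: the fault was an instruction fetch */
};

/* Splits the hex number printed after "Oops:" into its flag bits. */
struct pf_info decode_pf_error(uint32_t code)
{
    struct pf_info info;
    info.present  = (code >> 0) & 1;
    info.write    = (code >> 1) & 1;
    info.user     = (code >> 2) & 1;
    info.reserved = (code >> 3) & 1;
    info.ifetch   = (code >> 4) & 1;
    return info;
}
```

For 0000 every field is zero: a read, from kernel mode, of a page that is not present. That agrees with the "PTE 0" at the end of the page-table line in the report.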
The line
Process diseproc (pid: 1126, threadinfo ffff880435fc4000, task ffff8807f8be8ae0)
tells you what was running in userspace at the time of the crash. It looks like the userspace process diseproc issued a command to your driver that caused the crash.
A very important line is
IP: [<ffffffffa0180279>] debug_fucntion+0x19/0x160 [dise]
Use the dis debug_fucntion command to disassemble the function (the symbol really is spelled debug_fucntion in this module), find debug_fucntion+25 (0x19 hex = 25 dec) and look around. Read it side by side with the C source code of the function. You can usually locate the crash site in the C code by comparing callq instructions: the disassembly shows the printable names of the called functions.
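The symbol+offset/size notation is regular enough to split mechanically. The helper below is a hypothetical sketch (parse_oops_location is not an existing tool or API); it relies on sscanf's %lx accepting an optional 0x prefix.

```c
#include <stdio.h>

/*
 * Parses a location such as "debug_fucntion+0x19/0x160" into its parts.
 * Returns 1 on success and 0 if the string does not have that shape.
 * sym must point to a buffer of at least 64 bytes.
 */
int parse_oops_location(const char *s, char *sym,
                        unsigned long *offset, unsigned long *size)
{
    /* %63[^+] copies everything up to the '+' into sym. */
    return sscanf(s, "%63[^+]+%lx/%lx", sym, offset, size) == 3;
}
```

The offset is the distance of the faulting instruction from the start of the function, and the size is the total length of the function, so here the fault is 25 bytes into a 352-byte function.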
Next, and most important, is the call trace:
Call Trace:
[<ffffffffa0180498>] cmd_dump+0x1c8/0x360 [dise]
[<ffffffffa01978e1>] debug_log_show+0x91/0x160 [dise]
[<ffffffffa013afb9>] process_debug+0x5a9/0x990 [dise]
[<ffffffff810792c7>] ? current_fs_time+0x27/0x30
[<ffffffffa013bc38>] dise_ioctl+0xd8/0x300 [dise]
[<ffffffff8105a501>] ? hotplug_hrtick+0x21/0x60
[<ffffffff8119db42>] vfs_ioctl+0x22/0xa0
[<ffffffff8119dce4>] do_vfs_ioctl+0x84/0x580
[<ffffffff8119e261>] sys_ioctl+0x81/0xa0
[<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
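Reading bottom to top: the kernel received an ioctl (from diseproc, obviously) and invoked the ioctl handler dise_ioctl in the dise module, which led to process_debug, debug_log_show and finally cmd_dump. Entries marked with "?" (here current_fs_time, hotplug_hrtick and __audit_syscall_exit) are addresses that were found on the stack but could not be confirmed as part of the call chain, so treat them as unreliable.
Now you know:
Code path: dise_ioctl -> process_debug -> debug_log_show -> cmd_dump -> somehow to debug_fucntion.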
The approximate place in the C code that caused the crash.
The reason for the crash: an access to invalid memory.
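When walking a call trace like the one above, it helps to separate confirmed frames from entries carrying the "?" marker, which are only addresses that happened to be on the stack. A small sketch, assuming the "[<address>] symbol" line format used in this report; is_reliable_frame is a hypothetical helper name.

```c
#include <string.h>

/*
 * Returns 1 if a call-trace line is a confirmed frame, and 0 if it is
 * marked with "?" or does not look like a "[<addr>] symbol" line at all.
 */
int is_reliable_frame(const char *line)
{
    const char *p = strstr(line, ">]");
    if (p == NULL)
        return 0;          /* no "[<addr>]" prefix on this line */
    p += 2;                /* step over ">]" */
    while (*p == ' ')
        p++;               /* skip the spaces before the symbol */
    return *p != '?' && *p != '\0';
}
```

Applied to the trace above, this keeps cmd_dump, debug_log_show, process_debug, dise_ioctl and the ioctl syscall path, and drops the three "?" entries.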
With this info you have to use your last and most powerful method: thinking. Try to understand which variables or structures caused the crash. Maybe some of them had already been freed by the time you arrived in debug_fucntion? Maybe you made a mistake in pointer arithmetic?
Answers to your questions:
1. Most of the time the CPU register values are of little use, because they have no direct relation to your C code: they are just values pointing to some memory. Some registers, such as RIP/EIP and RSP/ESP, are extremely useful, but most of the others lack the context needed to interpret them.
2. Very unlikely. You are not actually debugging; you are inspecting a dump, so you have no debugging context.
3. I agree with @user2699113 that it is just the memory content at the address RIP points to.
And remember: the best debugging tool is your brain.
See here: it has good documentation on how to debug kernel crashes. See the section on objdump.
What it tells you is that you can disassemble your kernel by running objdump on the vmlinux image. The command outputs a large text file containing the disassembly of your kernel. You can then grep that file for the EIP that caused the problem.
PS: I would recommend running objdump on vmlinux once and saving the output locally.
1. and 2.: It is rather hard to work out how the CPU registers relate to parameters and variable values.
3.: That is machine code. You can find it in the disassembly of your program and so work out where the problem occurred. Notice the <48> 8b 01 48 ...: as far as I know, the trap occurred at the instruction that starts with the byte in angle brackets. That means you need to debug it by disassembling your code. If you compile your program (module) with debugging symbols, you can find the line number where the problem occurred.
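Locating the marked byte by eye is error-prone in a long line, so here is a sketch of doing it in code. parse_code_bytes is a hypothetical helper; it takes the text after the "Code:" prefix and reports which byte was written as <xx>.

```c
#include <stdio.h>

/*
 * Parses the hex bytes of an oops "Code:" line (prefix already removed).
 * Stores at most max bytes in out and returns how many were stored.
 * *fault_index receives the index of the byte written as <xx>, or -1.
 */
int parse_code_bytes(const char *s, unsigned char *out, int max,
                     int *fault_index)
{
    int n = 0;
    *fault_index = -1;
    while (n < max) {
        unsigned int byte = 0;
        int used = 0;
        while (*s == ' ')
            s++;
        if (*s == '<') {
            /* used stays 0 if the closing '>' is missing */
            if (sscanf(s, "<%2x>%n", &byte, &used) != 1 || used == 0)
                break;
            *fault_index = n;
        } else if (sscanf(s, "%2x%n", &byte, &used) != 1) {
            break;  /* end of string or something that is not hex */
        }
        out[n++] = (unsigned char)byte;
        s += used;
    }
    return n;
}
```

In this dump the bytes from the marker onward are 48 8b 01, which decode to mov (%rcx),%rax, and RCX holds ffffc90028213000, the same address that is reported in CR2.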

Create a ethernet packet in a kernel module and send it

I need to create an Ethernet packet and send it from my kernel module. Can someone help me do this?
I think I need to create an skb using dev_alloc_skb, then write the Ethernet MAC header, insert the data, and send it using dev_queue_xmit.
But I'm not sure whether this works, or whether it is the right and easiest way to do it.
Best regards
EDIT1:
int sendpacket ()
{
    unsigned char dest[ETH_ALEN]={0x00,0x25,0x22,0x05,0xF3,0xF0};
    unsigned char src[ETH_ALEN] = {0x90,0xE6,0xBA,0x48,0x7C,0x87};
    struct sk_buff * skbt =alloc_skb(ETH_FRAME_LEN,GFP_KERNEL);
    //skb_reserve(skb,ETH_FRAME_LEN);
    dev_hard_header(skbt,dev_eth1,ETH_P_802_3,dest,src,dev_eth1->addr_len);
    if(dev_queue_xmit(skbt)!=NET_XMIT_SUCCESS)
    {
        printk("Not send!!\n");
    }
    kfree_skb(skbt);
    return 0;
}
Output of the dmesg command:
[ 677.826933] Hello:I'm the hook module!!!!
[ 677.826937] 2!!!!
[ 677.826941] skb_under_panic: text:c0723608 len:14 put:14 head:f1843800 data:f18437f2 tail:0xf1843800 end:0xf1843e00 dev:<NULL>
[ 677.826959] ------------[ cut here ]------------
[ 677.826961] kernel BUG at net/core/skbuff.c:146!
[ 677.826964] invalid opcode: 0000 [#1] SMP
[ 677.826967] last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
[ 677.826969] Modules linked in: sendpacket(+) bluetooth rfkill vfat fat fuse sunrpc cpufreq_ondemand acpi_cpufreq mperf ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 uinput snd_hda_codec_atihdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore atl1e snd_page_alloc iTCO_wdt iTCO_vendor_support r8169 mii i2c_i801 microcode asus_atk0110 pcspkr ata_generic pata_acpi usb_storage pata_marvell radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: sendpacket]
[ 677.827003]
[ 677.827003] Pid: 4780, comm: insmod Tainted: G W 2.6.35101 #7 P5QL PRO/P5QL PRO
[ 677.827003] EIP: 0060:[<c070a192>] EFLAGS: 00210246 CPU: 0
[ 677.827003] EIP is at skb_push+0x57/0x62
[ 677.827003] EAX: 00000088 EBX: c08f9fdc ECX: f156bf10 EDX: c093b4ca
[ 677.827003] ESI: 00000000 EDI: f51ca000 EBP: f156bf38 ESP: f156bf0c
[ 677.827003] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 677.827003] Process insmod (pid: 4780, ti=f156a000 task=f2b071a0 task.ti=f156a000)
[ 677.827003] Stack:
[ 677.827003] c093b4ca c0723608 0000000e 0000000e f1843800 f18437f2 f1843800 f1843e00
[ 677.827003] <0> c08f9fdc f156bf64 f156bf6a f156bf50 c0723608 00000001 c07235e5 f3b6c000
[ 677.827003] <0> 00835ff4 f156bf78 f7d640a8 f156bf6a f156bf64 00000006 48bae690 2500877c
[ 677.827003] Call Trace:
[ 677.827003] [<c0723608>] ? eth_header+0x23/0x93
[ 677.827003] [<c0723608>] ? eth_header+0x23/0x93
[ 677.827003] [<c07235e5>] ? eth_header+0x0/0x93
[ 677.827003] [<f7d640a8>] ? sendpacket+0x8f/0xb6 [sendpacket]
[ 677.827003] [<f7d67000>] ? hook_init+0x0/0x46 [sendpacket]
[ 677.827003] [<f7d67044>] ? hook_init+0x44/0x46 [sendpacket]
[ 677.827003] [<c0401246>] ? do_one_initcall+0x4f/0x139
[ 677.827003] [<c0451e29>] ? blocking_notifier_call_chain+0x11/0x13
[ 677.827003] [<c046210c>] ? sys_init_module+0x7f/0x19b
[ 677.827003] [<c040321f>] ? sysenter_do_call+0x12/0x28
[ 677.827003] Code: c0 85 f6 0f 45 de 53 ff b0 a8 00 00 00 ff b0 a4 00 00 00 51 ff b0 ac 00 00 00 52 ff 70 50 ff 75 04 68 ca b4 93 c0 e8 ad 4a 09 00 <0f> 0b 8d 65 f8 89 c8 5b 5e 5d c3 55 89 e5 56 53 0f 1f 44 00 00
[ 677.827116] EIP: [<c070a192>] skb_push+0x57/0x62 SS:ESP 0068:f156bf0c
[ 677.827154] ---[ end trace dee1e3278503a581 ]---
In your case you just want to use raw packets from user space instead of dealing with the complexities of kernel code.
This blog post details how to do everything you need.
At the risk of sounding like a broken record, you're learning why this should be done from user space.
Because you seem determined to make this mistake anyway, let's try to figure out what the problem is.
It's also a good illustration of how helpful it is to have source code. The exception log tells you the problem occurred on line 146 of net/core/skbuff.c.
That's within the function skb_under_panic(), which is only used in that file (it's static after all), from within skb_push().
The skb_push() function extends the data area of the skb toward the front. Basically, it creates room in the buffer for a new header by moving the internal data pointer back toward the head of the buffer.
In your case, the internal data pointer is still in its original location, at the very front of the skb, so there is nowhere to move it. The panic line shows the result: after a 14-byte push, data:f18437f2 lies below head:f1843800. You need to reserve some room at the front of the skb first. Use skb_reserve(), pretty much as you had it. Why did you comment that out?
Also, you need to check that the allocation of the skb succeeded. Kernel allocators can (and do) return NULL sometimes.
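The pointer arithmetic is easier to see in a toy model. The sketch below is plain userspace C and deliberately not the kernel API: struct toy_skb and the toy_* functions are hypothetical stand-ins that track the head, data, tail and end positions as offsets, and they return -1 in the situations where the real skb_push() would hit skb_under_panic().

```c
#include <stddef.h>

/* Illustrative sizes: a full Ethernet frame and its 14-byte header. */
enum { TOY_FRAME_LEN = 1514, TOY_HDR_LEN = 14 };

/* Offsets standing in for the four sk_buff pointers. */
struct toy_skb {
    size_t head; /* start of the allocated buffer */
    size_t data; /* start of the packet data */
    size_t tail; /* end of the packet data */
    size_t end;  /* end of the allocated buffer */
};

/* A fresh buffer has no headroom: head, data and tail coincide. */
void toy_alloc(struct toy_skb *skb, size_t size)
{
    skb->head = 0;
    skb->data = 0;
    skb->tail = 0;
    skb->end = size;
}

/* Models skb_reserve(): adds headroom to a buffer that is still empty. */
int toy_reserve(struct toy_skb *skb, size_t len)
{
    if (skb->data != skb->tail || len > skb->end - skb->tail)
        return -1;
    skb->data += len;
    skb->tail += len;
    return 0;
}

/* Models skb_push(): prepends len bytes, which needs len bytes of headroom. */
int toy_push(struct toy_skb *skb, size_t len)
{
    if (len > skb->data - skb->head)
        return -1; /* data would move below head: the skb_under_panic case */
    skb->data -= len;
    return 0;
}
```

With this model, toy_push() of TOY_HDR_LEN bytes fails on a freshly allocated buffer and succeeds once toy_reserve() has set aside the same amount of headroom, which mirrors the order of calls suggested above.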
