Soft lockup occured in VMX testing in Linux - linux

I am testing VMX in Linux (Ubuntu-16.04) in a VMware platform.
When the VM is running a long loop, the host Linux hit CPU softlock, as follows,
Jun 11 00:01:13 ubuntu kernel: [ 5624.196130] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [guest:8297]
Jun 11 00:01:13 ubuntu kernel: [ 5624.196134] Modules linked in: vmm(OE) rfcomm ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) bnep vmw_vsock_vmci_transport vsock crct10dif_pclmul vmw_balloon crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd joydev btusb btrtl btbcm btintel snd_ens1371 input_leds serio_raw bluetooth snd_ac97_codec gameport ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore i2c_piix4 shpchp vmw_vmci 8250_fintek mac_hid kvm_intel kvm irqbypass parport_pc ppdev lp parport autofs4 hid_generic usbhid hid vmwgfx psmouse ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mptspi mptscsih drm ahci mptbase libahci e1000 scsi_transport_spi pata_acpi fjes [last unloaded: vmm]
Jun 11 00:01:13 ubuntu kernel: [ 5624.196210] CPU: 0 PID: 8297 Comm: guest Tainted: G OEL 4.4.0-127-generic #153-Ubuntu
Jun 11 00:01:13 ubuntu kernel: [ 5624.196212] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
Jun 11 00:01:13 ubuntu kernel: [ 5624.196213] task: ffff880092788000 ti: ffff88007ffb8000 task.ti: ffff88007ffb8000
Jun 11 00:01:13 ubuntu kernel: [ 5624.196215] RIP: 0010:[<ffffffffc0648c7f>] [<ffffffffc0648c7f>] failInvalid+0x18/0xa9 [vmm]
Jun 11 00:01:13 ubuntu kernel: [ 5624.196221] RSP: 0018:ffff88007ffbbe60 EFLAGS: 00000246
Jun 11 00:01:13 ubuntu kernel: [ 5624.196223] RAX: 0000000000000000 RBX: 00000000006041c0 RCX: 0000000000000000
Jun 11 00:01:13 ubuntu kernel: [ 5624.196224] RDX: 0000000000001dfa RSI: 0000000000000000 RDI: 000000000000640a
Jun 11 00:01:13 ubuntu kernel: [ 5624.196225] RBP: ffff88007ffbbe70 R08: 000007ff00000017 R09: 0000000000000010
Jun 11 00:01:13 ubuntu kernel: [ 5624.196226] R10: c093ffffffff0000 R11: 0000000000100000 R12: 00000000006041c0
Jun 11 00:01:13 ubuntu kernel: [ 5624.196227] R13: ffff8800b37d2600 R14: 0000000040087401 R15: 00000000006041c0
Jun 11 00:01:13 ubuntu kernel: [ 5624.196229] FS: 00007f8793426700(0000) GS:ffff8800ba600000(0000) knlGS:0000000000000000
Jun 11 00:01:13 ubuntu kernel: [ 5624.196230] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 11 00:01:13 ubuntu kernel: [ 5624.196231] CR2: 0000000000910000 CR3: 00000000671f2000 CR4: 0000000000362670
Jun 11 00:01:13 ubuntu kernel: [ 5624.196237] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 11 00:01:13 ubuntu kernel: [ 5624.196238] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun 11 00:01:13 ubuntu kernel: [ 5624.196239] Stack:
Jun 11 00:01:13 ubuntu kernel: [ 5624.196240] 00000000006041c0 00000000006041c0 ffff88007ffbbe88 ffffffffc064a538
Jun 11 00:01:13 ubuntu kernel: [ 5624.196242] ffff8800907ab398 ffff88007ffbbe98 ffffffffc064a83c ffff88007ffbbf08
Jun 11 00:01:13 ubuntu kernel: [ 5624.196244] ffffffff81227f8f 0000000000000002 ffff8800b399e610 ffff88008cf50b10
Jun 11 00:01:13 ubuntu kernel: [ 5624.196246] Call Trace:
Jun 11 00:01:13 ubuntu kernel: [ 5624.196251] [<ffffffffc064a538>] vmm_vcpu_run+0x58/0x310 [vmm]
Jun 11 00:01:13 ubuntu kernel: [ 5624.196254] [<ffffffffc064a83c>] my_ioctl+0x4c/0x50 [vmm]
Jun 11 00:01:13 ubuntu kernel: [ 5624.196258] [<ffffffff81227f8f>] do_vfs_ioctl+0x2af/0x4b0
Jun 11 00:01:13 ubuntu kernel: [ 5624.196260] [<ffffffff81214329>] ? vfs_write+0x149/0x1a0
Jun 11 00:01:13 ubuntu kernel: [ 5624.196262] [<ffffffff81228209>] SyS_ioctl+0x79/0x90
Jun 11 00:01:13 ubuntu kernel: [ 5624.196265] [<ffffffff81850c88>] entry_SYSCALL_64_fastpath+0x1c/0xbb
Jun 11 00:01:13 ubuntu kernel: [ 5624.196267] Code: 00 48 8d 1c 25 74 19 65 c0 0f 78 03 8b 04 25 74 19 65 c0 41 5f 41 5e 41 5d 41 5c 41 5b 41 5a 41 59 41 58 5f 5e 5d 5a 59 5b 58 9d <48> c7 c3 68 01 65 c0 eb 19 4c 8b 23 e8 c0 c4 ff ff 41 89 04 24
VMM enables VMEXIT on Interrupt, and there is timer in VMM to check VM state.
If I called touch_softlockup_watchdog() in the timer function, there is NO such lockup happened.
But I think there is still NO chance for other threads/processes to scheduled on the pCPU where the VM is running.
So the question is, how can the VMM make host Linux scheduler schedule other entities on the pCPU, instead of letting the VM to take over the whole pCPU, calling schedule() or something else?
The VM code is loaded and started by a host application, which loads the VM image into a memory region allocated/setup by VMM. Then sets the vCPU context (like, GPRs, RIP, segment registers, etc), and asks the VMM to start the VM by calling 'VMLaunch' and 'VMResume'.

The easiest way to avoid this is to call schedule() in the handling loop of vmexit in kernel space.

I'm currently facing the exact same problem you described. I'm running a ubuntu 18 in VMWare and try to setup a kernel module than enable VMX. I'm able to run the initialization routine which setup control, guest and host before executing vmlaunch instruction.
vmlaunch execute sucessfully, as I quickly enter hypervisor mode because of wrmsr and rdmsr from the current OS.
However, it seems that I get stuck very quickly, my call trace mostly looks like yours. I don't undestand well what's happening but it seems that i'm trap in exit_to_usermode_loop(). Is it the problem you faced ?

Related

How to run "invd" instruction with disabled SMP support?

I'm trying to execute "invd" instruction from a kernel module. I have asked a similar question How to execute “invd” instruction? previously and from #Peter Cordes's answer, I understand I can't safely run this instruction on SMP system after system boot. So, shouldn't I be able to run this instruction after boot without SMP support? Because there is no other core running, therefore there is no change for memory inconsistency? I have the following kernel module compiled with -o0 flag,
static int __init deviceDriver_init(void){
unsigned long flags;
int LEN=10;
int STEP=1;
int VALUE=1;
int arr[LEN];
int i;
unsigned long dummy;
printk(KERN_INFO "invd Driver loaded\n");
//wbinvd();
//asm volatile("cpuid\n":::);
local_irq_disable();
__asm__ __volatile__(
"wbinvd\n"
"loop:"
"movq %%rdx, (%%rbx);"
"leaq (%%rbx, %%rcx, 8), %%rbx;"
"cmpq %%rbx, %%rax;"
"jg loop;"
"invd\n"
: "=b"(dummy) // output
: "b" (arr),
"a" (arr+LEN),
"c" (STEP),
"d" (VALUE)
: "cc", "memory"
);
local_irq_enable();
//asm volatile("invd\n":::);
printk(KERN_INFO "invd execute\n");
return 0;
}
I'm still getting the following error upon inserting the module I'm getting Segmentation fault (core dumped) in the terminal and the dmesg shows,
[ 2590.518614] invd Driver loaded
[ 2590.518840] general protection fault: 0000 [#5] SMP PTI
I have boot my kernel with nosmp but I do not understand why dmesg still shows SMP PTI
$cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.15.0-136-generic root=UUID=dbe747ff-a6a5-45cb-8553-c6db6d445d3d ro quiet splash nosmp vt.handoff=7
Update post:
As I mentioned in the comment section, After disabling, SGX from BIOS, I was able to run this invd without any error. However, when I try to run the same code on a different machine with the same kernel version, I still get the same error message. It is strange and I can't explain why this is happening. As in the comment section, #prl mentions that the error may be coming from the instruction following invd. I begin to think that maybe that is true. Because second from the last line in the dmesg is higlighted in RED [ 153.527386] RIP: loop+0xc/0xf22 [noSmp8] RSP: ffffb8d9450a7be0. So, seems like the error is coming from inside the loop. I have updated the __init function code according to the suggestion. I'm not good at assembly code, can anyone please tell me if the inline assembly code is correct or not? If this inline assembly code is not correct how to fix the code? My whole dmesg trace is,
[ 153.514293] invd Driver loaded
[ 153.514547] general protection fault: 0000 [#1] SMP PTI
[ 153.514656] Modules linked in: noSmp8(OE+) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables ccm arc4 intel_rapl rt2800usb rt2x00usb x86_pkg_temp_thermal intel_powerclamp rt2800lib coretemp rt2x00lib mac80211 cfg80211 kvm_intel kvm irqbypass snd_hda_codec_realtek crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd intel_cstate intel_rapl_perf dell_smm_hwmon dell_wmi dell_smbios dcdbas intel_wmi_thunderbolt snd_hda_codec_generic dell_wmi_descriptor wmi_bmof snd_seq_midi snd_seq_midi_event
[ 153.515454] serio_raw snd_hda_intel snd_hda_codec snd_hda_core sparse_keymap snd_hwdep snd_rawmidi joydev input_leds snd_seq snd_pcm snd_seq_device snd_timer snd soundcore mei_me mei shpchp intel_pch_thermal mac_hid acpi_pad parport_pc ppdev lp parport autofs4 hid_generic usbhid hid nouveau mxm_wmi ttm drm_kms_helper psmouse syscopyarea sysfillrect sysimgblt igb e1000e dca i2c_algo_bit ptp pps_core ahci libahci fb_sys_fops drm wmi video
[ 153.516038] CPU: 0 PID: 4024 Comm: insmod Tainted: G OE 4.15.0-136-generic #140~16.04.1-Ubuntu
[ 153.516331] Hardware name: Dell Inc. BIOS 1.3.2 01/25/2016
[ 153.516626] RIP: 0010:loop+0xc/0xf22 [noSmp8]
[ 153.516917] RSP: 0018:ffffb8d9450a7be0 EFLAGS: 00010046
[ 153.517213] RAX: ffffb8d9450a7c08 RBX: ffffb8d9450a7c08 RCX: 0000000000000001
[ 153.517513] RDX: 0000000000000001 RSI: ffffb8d9450a7be0 RDI: ffff8edaadc16490
[ 153.517814] RBP: ffffb8d9450a7c60 R08: 0000000000012c40 R09: ffffffffb39624c4
[ 153.518119] R10: ffffb8d9450a7c78 R11: 000000000000038c R12: ffffb8d9450a7c10
[ 153.518427] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8eda4c6bd660
[ 153.518730] FS: 00007fd7f09cf700(0000) GS:ffff8edaadc00000(0000) knlGS:0000000000000000
[ 153.519036] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 153.519346] CR2: 00005634f95fde50 CR3: 000000040dd2c001 CR4: 00000000003606f0
[ 153.519656] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 153.519980] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 153.520289] Call Trace:
[ 153.520597] ? 0xffffffffc050d000
[ 153.520899] do_one_initcall+0x55/0x1ac
[ 153.521201] ? do_one_initcall+0x55/0x1ac
[ 153.521504] ? do_init_module+0x27/0x223
[ 153.521808] ? _cond_resched+0x32/0x50
[ 153.522107] ? kmem_cache_alloc_trace+0x165/0x1c0
[ 153.522408] do_init_module+0x5f/0x223
[ 153.522710] load_module+0x188c/0x1ea0
[ 153.523016] ? ima_post_read_file+0x83/0xa0
[ 153.523320] SYSC_finit_module+0xe5/0x120
[ 153.523623] ? SYSC_finit_module+0xe5/0x120
[ 153.523927] SyS_finit_module+0xe/0x10
[ 153.524231] do_syscall_64+0x73/0x130
[ 153.524534] entry_SYSCALL_64_after_hwframe+0x41/0xa6
[ 153.524838] RIP: 0033:0x7fd7f04fd599
[ 153.525144] RSP: 002b:00007ffda61c2968 EFLAGS: 00000202 ORIG_RAX: 0000000000000139
[ 153.525455] RAX: ffffffffffffffda RBX: 00005643631d7210 RCX: 00007fd7f04fd599
[ 153.525768] RDX: 0000000000000000 RSI: 0000564361c3226b RDI: 0000000000000003
[ 153.526084] RBP: 0000564361c3226b R08: 0000000000000000 R09: 00007fd7f07c2ea0
[ 153.526403] R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000000000
[ 153.526722] R13: 00005643631d7ca0 R14: 0000000000000000 R15: 0000000000000000
[ 153.527040] Code: 00 48 8b 75 c8 48 8b 45 c8 8b 55 b8 48 63 d2 48 c1 e2 02 48 01 d0 8b 4d b4 8b 55 bc 48 89 f3 48 89 13 48 8d 1c cb 48 39 d8 7f f4 <0f> 08 48 89 d8 48 89 45 d0 e8 40 ef 73 00 48 c7 c7 c7 d0 c4 c0
[ 153.527386] RIP: loop+0xc/0xf22 [noSmp8] RSP: ffffb8d9450a7be0
[ 153.530228] ---[ end trace cc9ea64985c9fe34 ]---
So, it not possible to run invd even without SMP?
There's 2 questions here:
a) How to execute INVD (unsafely)
For this, you need to be running at CPL=0, and you have to make sure the CPU isn't using any "processor reserved memory protections" which are part of Intel's Software Guard Extensions (an extension to allow programs to have a shielded/private/encrypted space that the OS can't tamper with, often used for digital rights management schemes but possibly usable for enhancing security/confidentiality of other things).
Note that SGX is supported in recent versions of Linux, but I'm not sure when support was introduced or how old your kernel is, or if it's enabled/disabled.
If either of these isn't true (e.g. you're at CPL=3 or there are "processor reserved memory protections) you will get a general protection fault exception.
b) How to execute INVD Safely
For this, you have to make sure that the caches (which includes "external caches" - e.g. possibly including things like eDRAM and caches built into non-volatile RAM) don't contain any modified data that will cause problems if lost. This includes data from:
IRQs. These can be disabled.
NMI and machine check exceptions. For a running OS it's mostly impossible to stop/disable these and if you can disable them then it's like crossing your fingers while ignoring critical hardware failures (an extremely bad idea).
the firmware's System Management Mode. This is a special CPU mode the firmware uses for various things (e.g. ECC scrubbing, some power management, emulation of legacy devices) that't beyond the control of the OS/kernel. It can't be disabled.
writes done by the CPU itself. This includes updating the accessed/dirty flags in page tables (which can not be disabled), plus any performance monitoring or debugging features that store data in memory (which can be "not enabled").
With these restrictions (and not forgetting the performance problems) there are only 2 cases where INVD might be sane - early firmware code that needs to determine RAM chip sizes and configure memory controllers (where it's very likely to be useful/sane), and the instant before the computer is turned off (where it's likely to be pointless).
Guesswork
I'm guessing (based on my inability to think of any other plausible reason) that you want to construct temporary shielded/private area of memory (to enhance security - e.g. so that the data you put in that area won't/can't leak into RAM). In this case (ironically) it's possible that the tool designed specifically for this job (SGX) is preventing you from doing it badly.

__stack_chk_fail when executing copy_process() in _do_fork()?

I'm trying to do an experimental project about linux kernel(4.4.52) on x86_64, and one requirement of which is that whenever the control flow leaves specific function, the Write Protection bit in CR0 register would always be enabled. Generally speaking, it is like(the idea comes from nested kernel, but that is not very relevant to my question):
DISABLE_CR0.WP_BIT
original_func()
ENABLE_CR0.WP_BIT
By doing that, the whole kernel would be executing with CR0.WP enabled. I have replaced the original native_set_pte function and native_write_cr3 function with the format above, and now the kernel crashes when booting.
Here is the log(that's its original log, although the sequence seems weird):
[ 1.403888] IP: [<ffff8800351ebbb0>] 0xffff8800351ebbb0
[ 1.403891] PGD 2876067 PUD 2877067 PMD 3500e063 PTE 80000000351eb163
[ 1.403892] Oops: 0011 [#2] SMP
[ 1.403898] Modules linked in: crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse pata_acpi floppy
[ 1.403901] CPU: 0 PID: 143 Comm: systemd-udevd Tainted: G D 4.4.52v1+ #2
[ 1.403902] Hardware name: Fedora Project OpenStack Nova, BIOS 0.5.1 01/01/2011
[ 1.403903] task: ffff8800351c0e00 ti: ffff8800351e8000 task.ti: ffff8800351e8000
[ 1.403905] RIP: 0010:[<ffff8800351ebbb0>] [<ffff8800351ebbb0>] 0xffff8800351ebbb0
[ 1.403906] RSP: 0018:ffff8800351ebba8 EFLAGS: 00211086
[ 1.403906] RAX: 000000000000000e RBX: ffff8800351ebcf8 RCX: 000000000000000e
[ 1.403907] RDX: 0000000000000000 RSI: 0000000000201092 RDI: 0000000000201092
[ 1.403908] RBP: 0000000000000003 R08: ffffffff82778d60 R09: ffff8800351ebb40
[ 1.403909] R10: 0000000000000030 R11: ffffc00000000fff R12: ffff8800351c0e00
[ 1.403909] R13: 0000000000000010 R14: 0000000000201046 R15: ffffffffffffffff
[ 1.403911] FS: 00007f1f021e38c0(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[ 1.403912] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.403912] CR2: ffff8800351ebbb0 CR3: 00000000351cd000 CR4: 00000000001406f0
[ 1.403916] Stack:
[ 1.403918] ffffffff810b62ae ffff8800351ebbc0 000000000000006c 0000000000000000
[ 1.403920] ffff8800351ebbd8 ffffffff810b62ae ffffffff8111ce51 00000000000364a4
[ 1.403921] ffffffff82783168 000000000000005c 000000000000000c ffffffff820583b0
[ 1.403922] Call Trace:
[ 1.403928] [<ffffffff810b62ae>] ? kvm_sched_clock_read+0x1e/0x30
[ 1.403930] [<ffffffff810b62ae>] ? kvm_sched_clock_read+0x1e/0x30
[ 1.403933] [<ffffffff8111ce51>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[ 1.403935] [<ffffffff8111c49e>] ? down_trylock+0x2e/0x40
[ 1.403937] [<ffffffff81129959>] ? console_trylock+0x19/0x60
[ 1.403938] [<ffffffff8112af2e>] ? vprintk_emit+0x29e/0x530
[ 1.403945] [<ffffffff8115fe8e>] ? crash_kexec+0x7e/0x140
[ 1.403953] [<ffffffff81440ae5>] ? find_next_bit+0x15/0x20
[ 1.403955] [<ffffffff814390bb>] ? __const_udelay+0x2b/0x30
[ 1.403958] [<ffffffff810a2a0c>] ? native_stop_other_cpus+0x8c/0x170
[ 1.403965] [<ffffffff811dde8f>] ? panic+0xeb/0x215
[ 1.403968] [<ffffffff810d12a7>] ? copy_process+0x727/0x1b20
[ 1.403970] [<ffffffff810d32f9>] ? __stack_chk_fail+0x19/0x20
[ 1.403972] [<ffffffff810d12a7>] ? copy_process+0x727/0x1b20
[ 1.403974] [<ffffffff810d2808>] ? _do_fork+0x78/0x360
[ 1.403975] [<ffffffff810d2b99>] ? SyS_clone+0x19/0x20
[ 1.403986] [<ffffffff818694f2>] ? entry_SYSCALL_64_fastpath+0x16/0x71
[ 1.404004] Code: 00 00 00 86 10 21 00 00 00 00 00 a8 bb 1e 35 00 88 ff ff 18 00 00 00 00 00 00 00 b0 bb 1e 35 00 88 ff ff ae 62 0b 81 ff ff ff ff <c0> bb 1e 35 00 88 ff ff 6c 00 00 00 00 00 00 00 00 00 00 00 00
[ 1.404005] RIP [<ffff8800351ebbb0>] 0xffff8800351ebbb0
[ 1.404006] RSP <ffff8800351ebba8>
[ 1.404006] CR2: ffff8800351ebbb0
[ 1.404008] ---[ end trace b62acacf75e0c54f ]---
[ 1.406415] Kernel Offset: disabled
[ 1.456105] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff810d12a7
[ 1.456105]
I guess the problem is that something at copy_process causes an overflow, maybe it writes to some read-only memory? But CR0.WP bit should only affects the supervisor mode according to intel's document, so does that mean kernel is running in supervisor mode when executing copy_process?
I tried to disassemble the kernel, and got really upset about all those countless assembly instructions... So I decide to find it out with qemu. However, the kernel did NOT crash in qemu!! The command I use is that
qemu-system-x86_64 -m 1G -kernel arch/x86/boot/bzImage -initrd arch/x86/boot/linux4.4.52-rootfs.img -hda vdisk.img --append "root=/dev/sda rw console=ttyS0" -nographic
I used to think that _do_fork is independent of specific devices and filesystems(correct me if I'm wrong), so what causes the kernel to crash at my VPS should make it crash at qemu as well, which it didn't.
Has anyone come across the same issue? I really need some help now.
P.S. I do this at my VPS, ubuntu 16.04.2, but I think this is not the reason.
Please note that QEMU is not fully architecture accurate model, and QEMU do not provide full support of all architecture referenced features, because QEMU aim is emulation speed, not accuracy, and then it only provide some workable architecture profile to run OS.
Some features QEMU do precise and correct like paging and segmentation, but many are not : there are some troubles with CPUID, syscall, many instructions do not generate #GP, #SS, #PF, floating point errors, some instructions are not implemented (AVX, AVX2, FMA) and so on.
You should use one of x86 golden models to catch such tricky cases: try to use Bochs, or one of provided by Intel itself.

Is __init attribute used in loadable kernel modules?

The description at this - http://www.tldp.org/LDP/lkmpg/2.4/html/x281.htm - page (as well as some related answers on SO, for example the answer here - __init and __exit macros usage for built-in and loadable modules ) says
The __init macro causes the init function to be discarded and its memory freed once the init function finishes for built-in drivers, but not loadable modules.
However, I tried to insert the following module, in which I try to call the init functions with __init attribute from a non-init function (f2()), and I get error from the kernel, thus indicating that __init has an effect on loadable modules as well.
How and where can I find reliable information about this?
My (above mentioned) program:
#include<linux/module.h>
#include<linux/init.h>
static int __init f1(void){
printk(KERN_ALERT "hello \n");
return 0;
}
static void __exit f2(void){
f1();
printk(KERN_ALERT "bye N\n");
}
module_init(f1);
module_exit(f2);
Error from the kernel:
Jul 8 08:15:51 localhost kernel: hello NOTICE
Jul 8 08:15:54 localhost kernel: [303032.948188] BUG: unable to handle kernel paging request at f9b13000
Jul 8 08:15:54 localhost kernel: [303032.949003] IP: [<f9b13000>] 0xf9b12fff
Jul 8 08:15:54 localhost kernel: [303032.949003] *pdpt = 0000000000d3c001 *pde = 000000003100b067 *pte = 0000000000000000
Jul 8 08:15:54 localhost kernel: [303032.949003] Modules linked in: hello(POF-) tcp_lp lp wacom fuse bnep bluetooth ip6t_rpfilter ip6t_REJECT cfg80211 rfkill xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw joydev snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm coretemp kvm iTCO_wdt iTCO_vendor_support ppdev r8169 mii snd_page_alloc snd_timer snd soundcore microcode serio_raw i2c_i801 lpc_ich mfd_core parport_pc parport acpi_cpufreq mperf binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc i915 ata_generic i2c_algo_bit pata_acpi drm_kms_helper drm i2c_core video [last unloaded: hello]
Jul 8 08:15:54 localhost kernel: [303032.949003] CPU: 1 PID: 11924 Comm: rmmod Tainted: PF O 3.11.10-301.fc20.i686+PAE #1
Jul 8 08:15:54 localhost kernel: [303032.949003] Hardware name: /DG41RQ, BIOS RQG4110H.86A.0013.2009.1223.1136 12/23/2009
Jul 8 08:15:54 localhost kernel: [303032.949003] task: d1bad780 ti: c33a4000 task.ti: c33a4000
Jul 8 08:15:54 localhost kernel: [303032.949003] EIP: 0060:[<f9b13000>] EFLAGS: 00010282 CPU: 1
Jul 8 08:15:54 localhost kernel: [303032.949003] EIP is at 0xf9b13000
Jul 8 08:15:54 localhost kernel: [303032.949003] EAX: f9af6000 EBX: f9af8000 ECX: c0c77270 EDX: 00000000
Jul 8 08:15:54 localhost kernel: [303032.949003] ESI: 00000000 EDI: 00000000 EBP: c33a5f3c ESP: c33a5f30
Jul 8 08:15:54 localhost kernel: [303032.949003] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Jul 8 08:15:54 localhost kernel: [303032.949003] CR0: 8005003b CR2: f9b13000 CR3: 10e7b000 CR4: 000407f0
Jul 8 08:15:54 localhost kernel: [303032.949003] Stack:
Jul 8 08:15:54 localhost kernel: [303032.949003] f9af600b 00000000 00000000 c33a5fac c04b06a9 f4852ac0 c33a5f50 c057989d
Jul 8 08:15:54 localhost kernel: [303032.949003] 00000000 f9af8000 00000800 c33a5f50 6c6c6568 0000006f f4852ac0 f5312490
Jul 8 08:15:54 localhost kernel: [303032.949003] db5e7100 00000000 d1bad780 d1bada9c c33a5f88 c056160d c33a5f9c c046e9de
Jul 8 08:15:54 localhost kernel: [303032.949003] Call Trace:
Jul 8 08:15:54 localhost kernel: [303032.949003] [<f9af600b>] ? f2+0xb/0x1000 [hello]
Jul 8 08:15:54 localhost kernel: [303032.949003] [<c04b06a9>] SyS_delete_module+0x149/0x2a0
Jul 8 08:15:54 localhost kernel: [303032.949003] [<c057989d>] ? mntput+0x1d/0x30
Jul 8 08:15:54 localhost kernel: [303032.949003] [<c056160d>] ? ____fput+0xd/0x10
Jul 8 08:15:54 localhost kernel: [303032.949003] [<c046e9de>] ? task_work_run+0x7e/0xb0
Jul 8 08:15:54 localhost kernel: [303032.949003] [<c099ff0d>] sysenter_do_call+0x12/0x28
Jul 8 08:15:54 localhost kernel: [303032.949003] Code: Bad EIP value.
Jul 8 08:15:54 localhost kernel: [303032.949003] EIP: [<f9b13000>] 0xf9b13000 SS:ESP 0068:c33a5f30
Jul 8 08:15:54 localhost kernel: [303032.949003] CR2: 00000000f9b13000
Jul 8 08:15:54 localhost kernel: [303032.949003] ---[ end trace ca338922043618f4 ]---
Jul 8 08:15:54 localhost kernel: BUG: unable to handle kernel paging request at f9b13000
Jul 8 08:15:54 localhost kernel: IP: [<f9b13000>] 0xf9b12fff
Jul 8 08:15:54 localhost kernel: *pdpt = 0000000000d3c001 *pde = 000000003100b067 *pte = 0000000000000000
Actually, __init attribute affects on loadable modules code.
Probably, this is misprint in the book you refers.
BTW, you should get warning about sections mismatching when build given module.

BUG assertion triggered when replacing a physical page in a process

I modified the Linux kernel in a way to have it modify some of the memory pages of a specific process. In summary, the functions I wrote receive a process id and address in that process, they then replace the page at that specific address with another dummy page. Finally, one of the functions call __free_page() on the original page that was replaced.
The problem is I get this error from the Linux kernel when it tries to reuse the original page. So, what is that flag it is complaining about? and how to get rid of this error? here is the relevant lines from syslog.
Thanks.
Nov 14 19:15:23 localhost kernel: [ 1466.949451] BUG: Bad page state in process mytestapp pfn:7d309
Nov 14 19:15:23 localhost kernel: [ 1466.949452] page:ffffea0001f4c240 count:-1 mapcount:0 mapping: (null) index:0x7fd632179
Nov 14 19:15:23 localhost kernel: [ 1466.949453] page flags: 0x100000000000000()
Nov 14 19:15:23 localhost kernel: [ 1466.949453] Modules linked in: test_module(O) acpiphp bnep rfcomm bluetooth binfmt_misc joydev hid_generic usbhid hid snd_ens1371 gameport snd_ac97_codec ac97_bus snd_pcm ghash_clmulni_intel snd_seq_midi snd_rawmidi snd_seq_midi_event ppdev snd_seq aesni_intel ablk_helper cryptd aes_x86_64 snd_timer snd_seq_device psmouse microcode snd vmw_balloon acpi_memhotplug parport_pc soundcore snd_page_alloc vmwgfx ttm mac_hid drm i2c_piix4 serio_raw shpchp lp parport e1000 mptspi mptscsih mptbase floppy vmw_pvscsi vmxnet3
Nov 14 19:15:23 localhost kernel: [ 1466.949484] Pid: 15064, comm: mytestapp Tainted: G B O 3.6.11-elasticos-0.01 #31
Nov 14 19:15:23 localhost kernel: [ 1466.949485] Call Trace:
Nov 14 19:15:23 localhost kernel: [ 1466.949487] [<ffffffff8111941f>] bad_page+0xbf/0x110
Nov 14 19:15:23 localhost kernel: [ 1466.949505] [<ffffffff8111aac9>] get_page_from_freelist+0x6f9/0x810
Nov 14 19:15:23 localhost kernel: [ 1466.949508] [<ffffffff8111a702>] ? get_page_from_freelist+0x332/0x810
Nov 14 19:15:23 localhost kernel: [ 1466.949509] [<ffffffff8111b06e>] __alloc_pages_nodemask+0x48e/0x9b0
Nov 14 19:15:23 localhost kernel: [ 1466.949512] [<ffffffff8111f03a>] ? pagevec_lru_move_fn+0xea/0x110
Nov 14 19:15:23 localhost kernel: [ 1466.949514] [<ffffffff81154ec3>] alloc_pages_vma+0xb3/0x190
Nov 14 19:15:23 localhost kernel: [ 1466.949515] [<ffffffff811397cc>] handle_pte_fault+0x56c/0xb00
Nov 14 19:15:23 localhost kernel: [ 1466.949517] [<ffffffff810473f7>] ? pte_alloc_one+0x37/0x50
Nov 14 19:15:23 localhost kernel: [ 1466.949527] [<ffffffff8113afd9>] handle_mm_fault+0x259/0x340
Nov 14 19:15:23 localhost kernel: [ 1466.949538] [<ffffffff8107c218>] ? up_read+0x18/0x30
Nov 14 19:15:23 localhost kernel: [ 1466.949540] [<ffffffff816213d2>] do_page_fault+0x152/0x520
Nov 14 19:15:23 localhost kernel: [ 1466.949541] [<ffffffff8108c36d>] ? set_next_entity+0x9d/0xb0
Nov 14 19:15:23 localhost kernel: [ 1466.949543] [<ffffffff810135ca>] ? __switch_to+0x17a/0x410
Nov 14 19:15:23 localhost kernel: [ 1466.949545] [<ffffffff8161de65>] page_fault+0x25/0x30
This macro checks that appropriate page flags are unset. As I see in your case you have problems with PG_LOCKED flag set. It means that you freed locked page. See unlock_page to handle this or (probably) use free_page instead

Why does cryptsetup segfault?

While most of my use a chrooted Gentoo as supplement to another Linux distribution project works fine, unfortunately cryptsetup segfaults on both luksFormat and luksOpen. How to troubleshoot this best?
edit[ For testing purposes I downgraded the Firmware to an older version which runs a 2.6.31 Kernel instead of a 3.3 one, and there the segfault does not occur. Nonetheless I would like to get this working with the newer version, so any help on troubleshooting this is appreciated...
edit2 Following this hint, I compared the output of grep dm_crypt /proc/kallsyms and noticed that the newer Kernel lacks __initcall_dm_crypt_init6 - could this be reason of failure? How can that be fixed?
]
An strace | tail reveals
open("/dev/loop0", O_RDONLY|O_LARGEFILE) = 6
ioctl(6, BLKRAGET, 256) = 0
close(6) = 0
ioctl(3, DM_DEV_CREATE, 0x3cfb0) = 0
ioctl(3, DM_TABLE_LOAD <unfinished ...>
+++ killed by SIGSEGV +++
that ioctrl(3, DM_TABLE_LOAD seems to be responsible, where 3, according to a grep '= 3$' should stem from
open("/dev/mapper/control", O_RDWR|O_LARGEFILE) = 3
so there is some trouble with /dev/mapper/control, where /dev is bind-mounted from the host system into the chroot environment. But what exactly is the problem? How can one figure that out?
Additional output, the usefulness of which I couldn't determine yet:
cryptsetup --debug -v's output is not very informative:
# cryptsetup 1.4.3 processing "cryptsetup --debug -v luksFormat /dev/loop0"
# Running command luksFormat.
# Locking memory.
WARNING!
======== This will overwrite data on /dev/loop0 irrevocably.
Are you sure? (Type uppercase yes): YES
# Allocating crypt device /dev/loop0 context.
# Trying to open and read device /dev/loop0.
# Initialising device-mapper backend, UDEV is disabled.
# Detected dm-crypt version 1.5.1, dm-ioctl version 4.22.0.
# Timeout set to 0 miliseconds.
# Iteration time set to 1000 miliseconds.
# Interactive passphrase entry requested. Enter LUKS passphrase: Verify passphrase:
# Formatting device /dev/loop0 as type LUKS1.
# Crypto backend (gcrypt 1.5.3) initialized.
# Topology: IO (512/0), offset = 0; Required alignment is 1048576 bytes.
# Generating LUKS header version 1 using hash sha1, aes, cbc-essiv:sha256, MK 32 bytes
# PBKDF2: 43251 iterations per second using hash sha1.
# Data offset 4096, UUID 03dcff20-158e-4d16-b89c-90af7b176a80, digest iterations 5250
# Updating LUKS header of size 1024 on device /dev/loop0
# Reading LUKS header of size 1024 from device /dev/loop0
# Adding new keyslot -1 using volume key.
# Calculating data for key slot 0
# Key slot 0 use 21118 password iterations.
# Using hash sha1 for AF in key slot 0, 4000 stripes
# Updating key slot 0 [0x1000] area on device /dev/loop0.
# DM-UUID is CRYPT-TEMP-temporary-cryptsetup-12060
# dm create temporary-cryptsetup-12060 CRYPT-TEMP-temporary-cryptsetup-12060 OF [16384] (*1)
# dm reload temporary-cryptsetup-12060 OFW [16384] (*1) Segmentation fault
Finally, the systrace from the host's /var/log/messages reads
Nov 14 14:02:15 host kernel: Unable to handle kernel paging request at virtual address e58c20f0
Nov 14 14:02:15 host kernel: pgd = cab58000
Nov 14 14:02:15 host kernel: [e58c20f0] *pgd=00000000
Nov 14 14:02:15 host kernel: Internal error: Oops: 5 [#1]
Nov 14 14:02:15 host kernel: Modules linked in: usblp usb_storage uhci_hcd ohci_hcd ehci_hcd usbcore usb_common
Nov 14 14:02:15 host kernel: CPU: 0 Not tainted (3.3.4-88f6281 #1)
Nov 14 14:02:15 host kernel: PC is at async_encrypt+0x44/0x50
Nov 14 14:02:15 host kernel: LR is at async_encrypt+0x48/0x50
Nov 14 14:02:15 host kernel: pc : [<c01fa374>] lr : [<c01fa378>] psr: 20000013
Nov 14 14:02:15 host kernel: sp : c10bfd60 ip : cc372580 fp : c10bfd84
Nov 14 14:02:15 host kernel: r10: d0a75160 r9 : d0a75175 r8 : 00000000
Nov 14 14:02:15 host kernel: r7 : c2f78428 r6 : 00000020 r5 : c2f783c0 r4 : e58c2040
Nov 14 14:02:15 host kernel: r3 : c94a72a0 r2 : 00000040 r1 : 00000000 r0 : c10bfd64
Nov 14 14:02:15 host kernel: Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Nov 14 14:02:15 host kernel: Control: 0005397f Table: 0ab58000 DAC: 00000015
Nov 14 14:02:15 host kernel: Process cryptsetup (pid: 5994, stack limit = 0xc10be270)
Nov 14 14:02:15 host kernel: Stack: (0xc10bfd60 to 0xc10c0000)
Nov 14 14:02:15 host kernel: fd60: c023eddc c01fab50 00000010 c01f88f8 c030e4c8 00000000 c10bfdac c10bfd88
Nov 14 14:02:15 host kernel: fd80: c030e564 c01fa340 c2f783c0 00000040 c2f783c0 d0a75175 d0a7a020 d0a75164
Nov 14 14:02:15 host kernel: fda0: c10bfdcc c10bfdb0 c030e61c c030e50c d0a75168 00000000 c2f783c0 d0a75168
Nov 14 14:02:15 host kernel: fdc0: c10bfe1c c10bfdd0 c030fbcc c030e598 d0a75160 c00d262c c10bfe0c c7752a40
Nov 14 14:02:15 host kernel: fde0: c00dd198 00000000 d0a7516e 00000000 c030feb4 00000020 cebc9200 c2f783c0
Nov 14 14:02:15 host kernel: fe00: d0a7a020 00000005 00000000 cebc9500 c10bfe64 c10bfe20 c030fedc c030f9ec
Nov 14 14:02:15 host kernel: fe20: c10bfe64 c10bfe30 c0308148 c023d270 c10bfe64 00000040 d0a7a020 000000fa
Nov 14 14:02:15 host kernel: fe40: 000000fa d0a7a020 cebc9500 d0a75150 00000000 cebc9500 c10bfe9c c10bfe68
Nov 14 14:02:15 host kernel: fe60: c0308954 c030fe7c 00000000 00000000 cebc9200 00000005 000000fa 00000000
Nov 14 14:02:15 host kernel: fe80: 00000001 d0a75000 d0a79000 c10bfeb0 c10bfee4 c10bfea0 c030b57c c03087b0
Nov 14 14:02:15 host kernel: fea0: 000000fa 00000000 d0a75160 00008004 d0a75160 d0a75138 c0308b5c 00004000
Nov 14 14:02:15 host kernel: fec0: cb637c00 d0a75000 00000000 c030b604 c10be000 00008004 c10bff0c c10bfee8
Nov 14 14:02:15 host kernel: fee0: c030b678 c030b52c 00000016 cebc9500 c10be000 00000000 0003cf80 00004000
Nov 14 14:02:15 host kernel: ff00: c10bff3c c10bff10 c030c5c4 c030b614 c00fb8b4 d0a75000 cb4580a0 0003cf80
Nov 14 14:02:15 host kernel: ff20: c138fd09 cb4580a0 c000bca8 00000000 c10bff4c c10bff40 c030c65c c030c4c8
Nov 14 14:02:15 host kernel: ff40: c10bff5c c10bff50 c00f06b4 c030c654 c10bff7c c10bff60 c00f0e10 c00f0694
Nov 14 14:02:15 host kernel: ff60: c10bff8c c10bff70 00000003 0003cf80 c10bffa4 c10bff80 c00f0fbc c00f0db0
Nov 14 14:02:15 host kernel: ff80: c10bffa4 00000000 00000001 b6f93cc8 b6e22e80 00000036 00000000 c10bffa8
Nov 14 14:02:15 host kernel: ffa0: c000bb00 c00f0f8c 00000001 b6f93cc8 00000003 c138fd09 0003cf80 b6e31450
Nov 14 14:02:15 host kernel: ffc0: 00000001 b6f93cc8 b6e22e80 00000036 0003cfb0 0003cec0 b6e22e80 0003cf80
Nov 14 14:02:15 host kernel: ffe0: b6e2f0f4 bedcbbfc b6e1e880 b6f083bc 60000010 00000003 e2433001 e5863000
Nov 14 14:02:15 host kernel: Backtrace:
Nov 14 14:02:15 host kernel: [<c01fa330>] (async_encrypt+0x0/0x50) from [<c030e564>] (crypt_setkey_allcpus+0x68/0x8c)
Nov 14 14:02:15 host kernel: r4:00000000
Nov 14 14:02:15 host kernel: [<c030e4fc>] (crypt_setkey_allcpus+0x0/0x8c) from [<c030e61c>] (crypt_set_key+0x94/0xb8)
Nov 14 14:02:15 host kernel: r8:d0a75164 r7:d0a7a020 r6:d0a75175 r5:c2f783c0 r4:00000040
Nov 14 14:02:15 host kernel: [<c030e588>] (crypt_set_key+0x0/0xb8) from [<c030fbcc>] (crypt_ctr_cipher+0x1f0/0x490)
Nov 14 14:02:15 host kernel: r6:d0a75168 r5:c2f783c0 r4:00000000
Nov 14 14:02:15 host kernel: [<c030f9dc>] (crypt_ctr_cipher+0x0/0x490) from [<c030fedc>] (crypt_ctr+0x70/0x2fc)
Nov 14 14:02:15 host kernel: [<c030fe6c>] (crypt_ctr+0x0/0x2fc) from [<c0308954>] (dm_table_add_target+0x1b4/0x350)
Nov 14 14:02:15 host kernel: [<c03087a0>] (dm_table_add_target+0x0/0x350) from [<c030b57c>] (populate_table+0x60/0xe8)
Nov 14 14:02:15 host kernel: r9:c10bfeb0 r8:d0a79000 r7:d0a75000 r6:00000001 r5:00000000
Nov 14 14:02:15 host kernel: r4:000000fa
Nov 14 14:02:15 host kernel: [<c030b51c>] (populate_table+0x0/0xe8) from [<c030b678>] (table_load+0x74/0x1f4)
Nov 14 14:02:15 host kernel: [<c030b604>] (table_load+0x0/0x1f4) from [<c030c5c4>] (ctl_ioctl+0x10c/0x18c)
Nov 14 14:02:15 host kernel: r7:00004000 r6:0003cf80 r5:00000000 r4:c10be000
Nov 14 14:02:15 host kernel: [<c030c4b8>] (ctl_ioctl+0x0/0x18c) from [<c030c65c>] (dm_ctl_ioctl+0x18/0x1c)
Nov 14 14:02:15 host kernel: [<c030c644>] (dm_ctl_ioctl+0x0/0x1c) from [<c00f06b4>] (vfs_ioctl+0x30/0x44)
Nov 14 14:02:15 host kernel: [<c00f0684>] (vfs_ioctl+0x0/0x44) from [<c00f0e10>] (do_vfs_ioctl+0x70/0x1dc)
Nov 14 14:02:15 host kernel: [<c00f0da0>] (do_vfs_ioctl+0x0/0x1dc) from [<c00f0fbc>] (sys_ioctl+0x40/0x68)
Nov 14 14:02:15 host kernel: r5:0003cf80 r4:00000003
Nov 14 14:02:15 host kernel: [<c00f0f7c>] (sys_ioctl+0x0/0x68) from [<c000bb00>] (ret_fast_syscall+0x0/0x2c)
Nov 14 14:02:15 host kernel: r7:00000036 r6:b6e22e80 r5:b6f93cc8 r4:00000001
Nov 14 14:02:15 host kernel: Code: e24b0020 e59c1024 e59c2020 e1a0e00f (e594f0b0)
Nov 14 14:02:15 host kernel: ---[ end trace 3c58c565608fcd8c ]---
Ok, so it's an "Oops: 5", the meaning of which I don't know...

Resources