How to run "invd" instruction with disabled SMP support? - linux

I'm trying to execute the "invd" instruction from a kernel module. I previously asked a similar question, How to execute “invd” instruction?, and from @Peter Cordes's answer I understand that I can't safely run this instruction on an SMP system after boot. So shouldn't I be able to run it after boot without SMP support? With no other core running, there should be no chance of memory inconsistency. I have the following kernel module, compiled with the -O0 flag:
static int __init deviceDriver_init(void){
unsigned long flags;
int LEN=10;
int STEP=1;
int VALUE=1;
int arr[LEN];
int i;
unsigned long dummy;
printk(KERN_INFO "invd Driver loaded\n");
//wbinvd();
//asm volatile("cpuid\n":::);
local_irq_disable();
__asm__ __volatile__(
"wbinvd\n"
"loop:"
"movq %%rdx, (%%rbx);"
"leaq (%%rbx, %%rcx, 8), %%rbx;"
"cmpq %%rbx, %%rax;"
"jg loop;"
"invd\n"
: "=b"(dummy) // output
: "b" (arr),
"a" (arr+LEN),
"c" (STEP),
"d" (VALUE)
: "cc", "memory"
);
local_irq_enable();
//asm volatile("invd\n":::);
printk(KERN_INFO "invd execute\n");
return 0;
}
I'm still getting an error: on inserting the module I get Segmentation fault (core dumped) in the terminal, and dmesg shows:
[ 2590.518614] invd Driver loaded
[ 2590.518840] general protection fault: 0000 [#5] SMP PTI
I have booted my kernel with nosmp, but I do not understand why dmesg still shows SMP PTI:
$cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.15.0-136-generic root=UUID=dbe747ff-a6a5-45cb-8553-c6db6d445d3d ro quiet splash nosmp vt.handoff=7
Update post:
As I mentioned in the comment section, after disabling SGX in the BIOS I was able to run invd without any error. However, when I run the same code on a different machine with the same kernel version, I still get the same error message. It is strange and I can't explain why this is happening. In the comments, @prl mentions that the error may be coming from the instruction following invd. I am beginning to think that may be true, because the second-to-last line in dmesg is highlighted in red: [ 153.527386] RIP: loop+0xc/0xf22 [noSmp8] RSP: ffffb8d9450a7be0. So it seems the error is coming from inside the loop. I have updated the __init function according to the suggestion. I'm not good at assembly; can anyone please tell me whether the inline assembly is correct and, if not, how to fix it? My whole dmesg trace is:
[ 153.514293] invd Driver loaded
[ 153.514547] general protection fault: 0000 [#1] SMP PTI
[ 153.514656] Modules linked in: noSmp8(OE+) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables ccm arc4 intel_rapl rt2800usb rt2x00usb x86_pkg_temp_thermal intel_powerclamp rt2800lib coretemp rt2x00lib mac80211 cfg80211 kvm_intel kvm irqbypass snd_hda_codec_realtek crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd intel_cstate intel_rapl_perf dell_smm_hwmon dell_wmi dell_smbios dcdbas intel_wmi_thunderbolt snd_hda_codec_generic dell_wmi_descriptor wmi_bmof snd_seq_midi snd_seq_midi_event
[ 153.515454] serio_raw snd_hda_intel snd_hda_codec snd_hda_core sparse_keymap snd_hwdep snd_rawmidi joydev input_leds snd_seq snd_pcm snd_seq_device snd_timer snd soundcore mei_me mei shpchp intel_pch_thermal mac_hid acpi_pad parport_pc ppdev lp parport autofs4 hid_generic usbhid hid nouveau mxm_wmi ttm drm_kms_helper psmouse syscopyarea sysfillrect sysimgblt igb e1000e dca i2c_algo_bit ptp pps_core ahci libahci fb_sys_fops drm wmi video
[ 153.516038] CPU: 0 PID: 4024 Comm: insmod Tainted: G OE 4.15.0-136-generic #140~16.04.1-Ubuntu
[ 153.516331] Hardware name: Dell Inc. BIOS 1.3.2 01/25/2016
[ 153.516626] RIP: 0010:loop+0xc/0xf22 [noSmp8]
[ 153.516917] RSP: 0018:ffffb8d9450a7be0 EFLAGS: 00010046
[ 153.517213] RAX: ffffb8d9450a7c08 RBX: ffffb8d9450a7c08 RCX: 0000000000000001
[ 153.517513] RDX: 0000000000000001 RSI: ffffb8d9450a7be0 RDI: ffff8edaadc16490
[ 153.517814] RBP: ffffb8d9450a7c60 R08: 0000000000012c40 R09: ffffffffb39624c4
[ 153.518119] R10: ffffb8d9450a7c78 R11: 000000000000038c R12: ffffb8d9450a7c10
[ 153.518427] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8eda4c6bd660
[ 153.518730] FS: 00007fd7f09cf700(0000) GS:ffff8edaadc00000(0000) knlGS:0000000000000000
[ 153.519036] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 153.519346] CR2: 00005634f95fde50 CR3: 000000040dd2c001 CR4: 00000000003606f0
[ 153.519656] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 153.519980] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 153.520289] Call Trace:
[ 153.520597] ? 0xffffffffc050d000
[ 153.520899] do_one_initcall+0x55/0x1ac
[ 153.521201] ? do_one_initcall+0x55/0x1ac
[ 153.521504] ? do_init_module+0x27/0x223
[ 153.521808] ? _cond_resched+0x32/0x50
[ 153.522107] ? kmem_cache_alloc_trace+0x165/0x1c0
[ 153.522408] do_init_module+0x5f/0x223
[ 153.522710] load_module+0x188c/0x1ea0
[ 153.523016] ? ima_post_read_file+0x83/0xa0
[ 153.523320] SYSC_finit_module+0xe5/0x120
[ 153.523623] ? SYSC_finit_module+0xe5/0x120
[ 153.523927] SyS_finit_module+0xe/0x10
[ 153.524231] do_syscall_64+0x73/0x130
[ 153.524534] entry_SYSCALL_64_after_hwframe+0x41/0xa6
[ 153.524838] RIP: 0033:0x7fd7f04fd599
[ 153.525144] RSP: 002b:00007ffda61c2968 EFLAGS: 00000202 ORIG_RAX: 0000000000000139
[ 153.525455] RAX: ffffffffffffffda RBX: 00005643631d7210 RCX: 00007fd7f04fd599
[ 153.525768] RDX: 0000000000000000 RSI: 0000564361c3226b RDI: 0000000000000003
[ 153.526084] RBP: 0000564361c3226b R08: 0000000000000000 R09: 00007fd7f07c2ea0
[ 153.526403] R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000000000
[ 153.526722] R13: 00005643631d7ca0 R14: 0000000000000000 R15: 0000000000000000
[ 153.527040] Code: 00 48 8b 75 c8 48 8b 45 c8 8b 55 b8 48 63 d2 48 c1 e2 02 48 01 d0 8b 4d b4 8b 55 bc 48 89 f3 48 89 13 48 8d 1c cb 48 39 d8 7f f4 <0f> 08 48 89 d8 48 89 45 d0 e8 40 ef 73 00 48 c7 c7 c7 d0 c4 c0
[ 153.527386] RIP: loop+0xc/0xf22 [noSmp8] RSP: ffffb8d9450a7be0
[ 153.530228] ---[ end trace cc9ea64985c9fe34 ]---
So, is it not possible to run invd even without SMP?

There are 2 questions here:
a) How to execute INVD (unsafely)
For this, you need to be running at CPL=0, and you have to make sure the CPU isn't using any "processor reserved memory protections" which are part of Intel's Software Guard Extensions (an extension to allow programs to have a shielded/private/encrypted space that the OS can't tamper with, often used for digital rights management schemes but possibly usable for enhancing security/confidentiality of other things).
Note that SGX is supported in recent versions of Linux, but I'm not sure when support was introduced or how old your kernel is, or if it's enabled/disabled.
If either of these isn't true (e.g. you're at CPL=3 or there are "processor reserved memory protections") you will get a general protection fault exception.
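The Code: line of the posted oops can help tell which instruction actually raised the #GP: the byte in angle brackets is the one RIP pointed at. In that log the marked bytes are <0f> 08, which is the encoding of INVD (WBINVD is 0f 09), so the fault appears to be raised by INVD itself rather than by the loop before it; loop+0xc is just the nearest preceding symbol plus an offset. A minimal user-space sketch of that check follows. The function name faulting_insn is made up for illustration, and it only recognises these two opcodes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Classify the instruction at the faulting RIP from an oops "Code:" line.
 * The kernel wraps the first byte at RIP in angle brackets, e.g. "<0f> 08".
 * Hypothetical helper: it only knows the two cache-invalidate opcodes.
 */
const char *faulting_insn(const char *code_line)
{
    unsigned int b0, b1;
    const char *mark;

    assert(code_line != NULL);
    mark = strchr(code_line, '<');
    if (mark == NULL || sscanf(mark, "<%2x> %2x", &b0, &b1) != 2)
        return "unknown";
    if (b0 == 0x0f && b1 == 0x08)
        return "invd";      /* 0F 08 */
    if (b0 == 0x0f && b1 == 0x09)
        return "wbinvd";    /* 0F 09 */
    return "other";
}
```

Fed the tail of the posted Code: line ("... 7f f4 <0f> 08 48 89 d8 ..."), this reports invd, which is consistent with the protections described above still being active on the second machine.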
b) How to execute INVD Safely
For this, you have to make sure that the caches (which includes "external caches" - e.g. possibly including things like eDRAM and caches built into non-volatile RAM) don't contain any modified data that will cause problems if lost. This includes data from:
IRQs. These can be disabled.
NMI and machine check exceptions. For a running OS it's mostly impossible to stop/disable these and if you can disable them then it's like crossing your fingers while ignoring critical hardware failures (an extremely bad idea).
the firmware's System Management Mode. This is a special CPU mode the firmware uses for various things (e.g. ECC scrubbing, some power management, emulation of legacy devices) that's beyond the control of the OS/kernel. It can't be disabled.
writes done by the CPU itself. This includes updating the accessed/dirty flags in page tables (which can not be disabled), plus any performance monitoring or debugging features that store data in memory (which can be "not enabled").
With these restrictions (and not forgetting the performance problems) there are only 2 cases where INVD might be sane - early firmware code that needs to determine RAM chip sizes and configure memory controllers (where it's very likely to be useful/sane), and the instant before the computer is turned off (where it's likely to be pointless).
Guesswork
I'm guessing (based on my inability to think of any other plausible reason) that you want to construct a temporary shielded/private area of memory (to enhance security - e.g. so that the data you put in that area won't/can't leak into RAM). In this case (ironically) it's possible that the tool designed specifically for this job (SGX) is preventing you from doing it badly.
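On the question of whether the inline assembly itself is correct: the loop can be modelled in plain C to see how many bytes it touches. Each iteration stores 8 bytes (movq) and advances the pointer by STEP * 8 (leaq with scale 8), and it runs at least once because the comparison comes last. With int arr[10] (40 bytes, assuming a 4-byte int) and STEP = 1 that is five stores ending exactly at arr + LEN, so the posted values stay in bounds, but only because 40 happens to be a multiple of 8. The sketch below is a user-space model, not kernel code, and model_asm_loop is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * User-space model of the asm loop from the question.
 * Returns the number of 8-byte stores performed.
 * The caller must provide a buffer large enough for every store:
 * like the asm, this always stores at least once, and the last
 * store may run past len_bytes if it is not a multiple of step * 8.
 */
size_t model_asm_loop(unsigned char *buf, size_t len_bytes,
                      uint64_t step, uint64_t value)
{
    unsigned char *ptr = buf;                 /* rbx */
    unsigned char *end = buf + len_bytes;     /* rax */
    size_t stores = 0;

    assert(buf != NULL && step != 0);
    do {
        memcpy(ptr, &value, sizeof value);    /* movq %rdx, (%rbx)         */
        ptr += step * 8;                      /* leaq (%rbx,%rcx,8), %rbx  */
        stores++;
    } while (end > ptr);                      /* cmpq %rbx, %rax ; jg loop */
    return stores;
}
```

Two further points may be worth checking in the real module: loop is a global label, so a numeric local label such as 1: is the usual choice in inline assembly, and STEP and VALUE are ints passed in rcx and rdx, so the constraint itself does not guarantee the contents of the upper 32 bits of those registers.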

Related

Additional serial ports are half working: why is that?

I have an embedded board with 2 serial ports and an additional PCI dual serial port + LPT card, for a total of 4 serial ttys, but the added ttyS2 and ttyS3 barely work.
The system runs Debian buster, with very few packages added to a minimal setup.
All the ports are recognized by the kernel:
[ +0,021002] 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[ +0,021110] 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
[ +0,036520] 0000:04:00.0: ttyS2 at I/O 0xc030 (irq = 19, base_baud = 115200) is a ST16650V2
[ +0,035081] 0000:04:00.1: ttyS3 at I/O 0xc020 (irq = 16, base_baud = 115200) is a ST16650V2
and a subsequent test with setserial gives the same result.
Note however that dpkg-reconfigure setserial does not write the file /etc/setserial.conf, and I have no idea why; I tried to work around it by copying the configuration by hand.
From applications like minicom, opening the port and connecting from a remote terminal gives no result: nothing sent, nothing received.
A test application using librxtx-java appears to send data, but when data is received this happens:
[mar23 08:42] irq 16: nobody cared (try booting with the "irqpoll" option)
[ +0,000014] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.0-14-amd64 #1 Debian 4.19.171-2
[ +0,000003] Hardware name: /, BIOS 5.6.5 12/18/2018
[ +0,000002] Call Trace:
[ +0,000006] <IRQ>
[ +0,000012] dump_stack+0x66/0x81
[ +0,000008] __report_bad_irq+0x3a/0xb4
[ +0,000006] note_interrupt.cold.9+0xa/0x63
[ +0,000008] handle_irq_event_percpu+0x6d/0x80
[ +0,000006] handle_irq_event+0x3c/0x60
[ +0,000004] handle_fasteoi_irq+0xa3/0x160
[ +0,000007] handle_irq+0x1f/0x30
[ +0,000006] do_IRQ+0x49/0xe0
[ +0,000005] common_interrupt+0xf/0xf
[ +0,000003] </IRQ>
[ +0,000007] RIP: 0010:cpuidle_enter_state+0xb9/0x320
[ +0,000006] Code: e8 7c 85 b2 ff 80 7c 24 0b 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 3b 02 00 00
[ +0,000003] RSP: 0018:ffffb9e30020be90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd7
[ +0,000005] RAX: ffff9329f8122140 RBX: 000000ae53921a15 RCX: 000000000000001f
[ +0,000002] RDX: 000000ae53921a15 RSI: 0000000060062542 RDI: 0000000000000000
[ +0,000003] RBP: ffff9329f812a248 R08: 0000000000000002 R09: 0000000000021a00
[ +0,000002] R10: 000000ecc794a002 R11: ffff9329f8121128 R12: 0000000000000001
[ +0,000002] R13: ffffffffa98b70b8 R14: 0000000000000001 R15: 0000000000000000
[ +0,000011] do_idle+0x228/0x270
[ +0,000006] cpu_startup_entry+0x6f/0x80
[ +0,000005] start_secondary+0x1a4/0x200
[ +0,000006] secondary_startup_64+0xa4/0xb0
[ +0,000005] handlers:
[ +0,000009] [<00000000920e25ee>] serial8250_interrupt
[ +0,000005] Disabling IRQ #16
I read a few articles and skimmed the Serial-HOWTO, but it seems that once setserial has found the correct configuration everything should work as intended, so I have no clue what's going on.
---- EDIT
Well, I (almost) resolved it: the board has 2 serial ports and 1 parallel port, all connected to an external device. I mismatched a non-standard external connector and routed some RS232-level signals into the parallel port, which proved fatal.
The confusing part is that the controller still appeared to be working, while it is not fully doing so.
I'm waiting to get a new board...
Well, I (almost) resolved it: it's all down to a hardware fault.
The board has 2 serial ports and 1 parallel port, all connected to an external device. I mismatched a non-standard external connector and routed some RS232-level signals into the parallel port, which proved fatal.
The confusing part is that the controller still appeared to be working, while it is not fully doing so. I'm waiting to get a new board...

linux kernel panic unable to handle kernel NULL pointer dereference at

I'm facing a kernel panic, but I have no idea how to find which software exactly is causing it. I'm compiling some software on remote hosts using distcc, and the machines doing the compiling go down because of this issue.
Could you point me to where I should start looking? What could cause this issue? Which tools should I use?
Here is the kernel panic output:
[591792.656853] IP: [< (null)>] (null)
[591792.658710] PGD 800000032ca05067 PUD 327bc6067 PMD 0
[591792.660439] Oops: 0010 [#1] SMP
[591792.661562] Modules linked in: fuse nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache nls_utf8 isofs sunrpc dm_mirror dm_region_hash dm_log dm_mod sb_edac iosf_mbi kvm_intel ppdev kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd cirrus ttm joydev drm_kms_helper sg virtio_balloon syscopyarea sysfillrect sysimgblt fb_sys_fops drm parport_pc parport drm_panel_orientation_quirks pcspkr i2c_piix4 ip_tables xfs libcrc32c sr_mod cdrom virtio_blk virtio_net ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel serio_raw floppy ata_piix libata virtio_pci virtio_ring virtio
[591792.682098] CPU: 2 PID: 25548 Comm: cc1plus Not tainted 3.10.0-957.el7.x86_64 #1
[591792.684495] Hardware name: Red Hat OpenStack Compute, BIOS 1.11.0-2.el7 04/01/2014
[591792.686923] task: ffff8ebb65ea1040 ti: ffff8ebb6b250000 task.ti: ffff8ebb6b250000
[591792.689315] RIP: 0010:[<0000000000000000>] [< (null)>] (null)
[591792.691729] RSP: 0018:ffff8ebb6b253da0 EFLAGS: 00010246
[591792.693438] RAX: 0000000000000000 RBX: ffff8ebb6b253e40 RCX: ffff8ebb6b253fd8
[591792.695716] RDX: ffff8ebb38098840 RSI: ffff8ebb6b253e40 RDI: ffff8ebb38098840
[591792.697992] RBP: ffff8ebb6b253e30 R08: 0000000000000100 R09: 0000000000000001
[591792.700271] R10: ffff8ebb7fd1f080 R11: ffffd7da0beb9380 R12: ffff8eb8417af000
[591792.702547] R13: ffff8eb875d1b000 R14: ffff8ebb6b253f24 R15: 0000000000000000
[591792.704821] FS: 0000000000000000(0000) GS:ffff8ebb7fd00000(0063) knlGS:00000000f7524740
[591792.707397] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[591792.709242] CR2: 0000000000000000 CR3: 000000032eb0a000 CR4: 00000000003607e0
[591792.711519] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[591792.713814] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[591792.716100] Call Trace:
[591792.716927] [<ffffffff9165270b>] ? path_openat+0x3eb/0x640
[591792.718727] [<ffffffff91653dfd>] do_filp_open+0x4d/0xb0
[591792.720451] [<ffffffff91661504>] ? __alloc_fd+0xc4/0x170
[591792.722267] [<ffffffff9163ff27>] do_sys_open+0x137/0x240
[591792.724017] [<ffffffff916a1fab>] compat_SyS_open+0x1b/0x20
[591792.725820] [<ffffffff91b78bb0>] sysenter_dispatch+0xd/0x2b
[591792.727648] Code: Bad RIP value.
[591792.728795] RIP [< (null)>] (null)
[591792.730486] RSP <ffff8ebb6b253da0>
[591792.731625] CR2: 0000000000000000
[591792.734935] ---[ end trace ccfdca9d4733e7a5 ]---
[591792.736450] Kernel panic - not syncing: Fatal exception
[591792.738708] Kernel Offset: 0x10400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
It is quite difficult to tell what went wrong from just the piece of log you have pasted.
It seems an oops led to the kernel panic.
To help you find the real cause, I can point you to material for further dissection of the panic/crash:
link 1: analysing kernel panics
link 2: oops
Hope it helps you! :)
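One thing that can be read from the pasted log directly: for a page-fault oops, the number after Oops: is the x86 page-fault error code. Here 0010 means an instruction fetch from a page that is not present, in kernel mode, which together with RIP = 0 suggests a call through a NULL function pointer somewhere below do_filp_open. This reading applies only to page-fault oopses ("unable to handle kernel NULL pointer dereference" or "paging request"), not to general protection faults. A small decoder, with a made-up function name, might look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* x86 page-fault error code bits, as printed after "Oops:" */
#define PF_PROT  0x01u   /* 0: page not present, 1: protection violation */
#define PF_WRITE 0x02u   /* 0: read,             1: write                */
#define PF_USER  0x04u   /* 0: kernel mode,      1: user mode            */
#define PF_RSVD  0x08u   /* reserved bit set in a paging structure       */
#define PF_INSTR 0x10u   /* fault was an instruction fetch               */

/* Hypothetical helper: writes a one-line description of the code to out. */
int describe_oops_code(unsigned int code, char *out, size_t out_len)
{
    assert(out != NULL && out_len > 0);
    return snprintf(out, out_len, "%s, %s, %s mode%s",
                    (code & PF_PROT) ? "protection violation"
                                     : "page not present",
                    (code & PF_INSTR) ? "instruction fetch"
                    : (code & PF_WRITE) ? "write" : "read",
                    (code & PF_USER) ? "user" : "kernel",
                    (code & PF_RSVD) ? ", reserved bit set" : "");
}
```

For the log above, describe_oops_code(0x10, ...) yields "page not present, instruction fetch, kernel mode". Since the process is cc1plus calling open() through the 32-bit compat path, the filesystem or module handling that open is a reasonable place to start looking, though the log alone does not prove it.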

Syscall hijacking x64- unable to handle kernel paging request at ffffffff91000018

I wrote a kernel module which replaces a syscall, and I have a problem: the module can't be loaded because of some memory problem. I tried to fix it for 3 hours, but it still does not work. The code works when I choose a start address closer to sys_call_table (e.g. the linux_banner address from /proc/kallsyms), but even then it doesn't always work.
The problem usually occurs when the function that searches for the syscall table touches an address ending in 18 (e.g. ffffffff91000018, ffffffff81000018).
Why does it not work?
Code:
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/syscalls.h>
#include <linux/list.h>
#include <linux/unistd.h>
#include <linux/kobject.h>
#include <linux/init.h>
/* start of 64-bit kernel space is 0xffffffff80000000 */
#define END_MEM 0xffffffffffffffff /* end of 64-bit kernel */
#define START_MEM 0xffffffff81000000
unsigned long long **syscall_tab;
asmlinkage long (*orig_mkdir)(const char __user *pathname, umode_t mode);
asmlinkage long my_mkdir(const char __user *pathname, umode_t mode)
{
long ret;
ret = orig_mkdir(pathname, mode);
printk("Creating dir: %s", pathname);
return ret;
}
static void hide(void)
{
list_del(&THIS_MODULE->list);
kobject_del(&THIS_MODULE->mkobj.kobj);
}
static unsigned long long **find(void) {
unsigned long long **sctable;
unsigned long long i = START_MEM;
while (i < END_MEM) {
sctable = (unsigned long long **) i;
if ( sctable[__NR_close] == (unsigned long long *) sys_close) {
printk("syscall_tab %lx", syscall_tab);
return &sctable[0];
}
i += sizeof(void *);
}
return NULL;
}
static int __init init(void)
{
write_cr0(read_cr0() & (~0x10000));
if(!(syscall_tab = find())) {
return 0;
}
orig_mkdir = (void *) syscall_tab[__NR_mkdir];
printk("write_cr0");
syscall_tab[__NR_mkdir] = (unsigned long long*) my_mkdir;
printk("po podmiance");
write_cr0(read_cr0() | (~0x10000));
return 0;
}
static void __exit exitt(void)
{
write_cr0(read_cr0() & (~0x10000));
syscall_tab[__NR_mkdir] = (unsigned long long*) orig_mkdir;
write_cr0(read_cr0() | (~0x10000));
}
module_init(init);
module_exit(exitt);
MODULE_LICENSE("GPL");
Error:
[ 299.273838] BUG: unable to handle kernel paging request at ffffffff91000018
[ 299.273856] IP: init+0x23/0x1000 [hijack1]
[ 299.273860] PGD b6a0c067
[ 299.273861] P4D b6a0c067
[ 299.273863] PUD b6a0d063
[ 299.273866] PMD 0
[ 299.273872] Oops: 0000 [#1] PREEMPT SMP
[ 299.273877] Modules linked in: hijack1(O+) fuse rfcomm bnep nls_iso8859_1 nls_cp437 vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel joydev ppdev hp_wmi mousedev iTCO_wdt aes_x86_64 sparse_keymap iTCO_vendor_support mei_wdt crypto_simd psmouse glue_helper pcspkr evdev input_leds cryptd mac_hid intel_cstate intel_rapl_perf uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core btusb btrtl btbcm btintel bluetooth cdc_ether ecdh_generic usbnet videodev uas media mii hid_generic nouveau mxm_wmi ttm arc4 drm_kms_helper iwldvm drm syscopyarea sysfillrect mac80211 sysimgblt iwlwifi fb_sys_fops parport_pc parport snd_hda_codec_hdmi i2c_algo_bit snd_hda_codec_idt cfg80211
[ 299.273953] rfkill snd_hda_codec_generic hp_accel thermal lis3lv02d wmi input_polldev tpm_infineon video ac battery button snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm shpchp snd_timer e1000e snd ptp soundcore tpm_tis mei_me mei pps_core lpc_ich tpm_tis_core tpm sch_fq_codel vboxnetflt(O) vboxnetadp(O) pci_stub vboxpci(O) vboxdrv(O) sg ip_tables x_tables ext4 crc16 jbd2 fscrypto mbcache sr_mod sd_mod cdrom usb_storage usbhid hid serio_raw atkbd libps2 ahci libahci libata scsi_mod xhci_pci xhci_hcd ehci_pci sdhci_pci ehci_hcd sdhci firewire_ohci led_class firewire_core mmc_core crc_itu_t usbcore usb_common i8042 serio
[ 299.274005] CPU: 2 PID: 3384 Comm: insmod Tainted: G O 4.12.4-1-ARCH #1
[ 299.274009] Hardware name: Hewlett-Packard HP EliteBook 8560w/1631, BIOS 68SVD Ver. F.60 03/12/2015
[ 299.274014] task: ffff90127cc0c740 task.stack: ffffb72907298000
[ 299.274019] RIP: 0010:init+0x23/0x1000 [hijack1]
[ 299.274023] RSP: 0018:ffffb7290729bc88 EFLAGS: 00010206
[ 299.274027] RAX: 0000000080040033 RBX: ffffffff91000000 RCX: 0000000000000000
[ 299.274031] RDX: 00000000004bec82 RSI: 00000000004bec82 RDI: 0000000080040033
[ 299.274036] RBP: ffffb7290729bc90 R08: ffff901339003980 R09: ffffffffa018970a
[ 299.274040] R10: ffffe481c211ebc0 R11: 0000000000000000 R12: ffffffffc0030000
[ 299.274044] R13: ffff9012377965e0 R14: ffffffffc0a81050 R15: ffff90132e0eca80
[ 299.274049] FS: 00007f9a842a4b80(0000) GS:ffff90133dc80000(0000) knlGS:0000000000000000
[ 299.274053] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080040033
[ 299.274057] CR2: ffffffff91000018 CR3: 000000007cdb9000 CR4: 00000000000406e0
[ 299.274061] Call Trace:
[ 299.274068] do_one_initcall+0x50/0x190
[ 299.274073] ? do_init_module+0x27/0x1e6
[ 299.274077] do_init_module+0x5f/0x1e6
[ 299.274082] load_module+0x2610/0x2ab0
[ 299.274087] ? vfs_read+0x115/0x130
[ 299.274091] SYSC_finit_module+0xf6/0x110
[ 299.274095] ? SYSC_finit_module+0xf6/0x110
[ 299.274100] SyS_finit_module+0xe/0x10
[ 299.274105] entry_SYSCALL_64_fastpath+0x1a/0xa5
[ 299.274109] RIP: 0033:0x7f9a839b3bb9
[ 299.274111] RSP: 002b:00007ffd2386ee28 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
[ 299.274120] RAX: ffffffffffffffda RBX: 00007f9a83c74aa0 RCX: 00007f9a839b3bb9
[ 299.274124] RDX: 0000000000000000 RSI: 000000000041aada RDI: 0000000000000003
[ 299.274128] RBP: 00007f9a83c74af8 R08: 0000000000000000 R09: 00007f9a83c76e40
[ 299.274132] R10: 0000000000000003 R11: 0000000000000206 R12: 0000000000001020
[ 299.274136] R13: 0000000000001018 R14: 00007f9a83c74af8 R15: 0000000000000001
[ 299.274141] Code: <48> 81 7b 18 40 a8 21 a0 75 2d 48 8b 35 14 13 a5 00 48 c7 c7 35 00
[ 299.276347] RIP: init+0x23/0x1000 [hijack1] RSP: ffffb7290729bc88
[ 299.277333] CR2: ffffffff91000018
[ 299.283408] ---[ end trace 63ac9e1e3a0e12c3 ]---
Syscall hijacking x64- unable to handle kernel paging request at ffffffff91000018...
I write kernel module which replace syscall and have a problem. Module can't be loaded because is some problem in memory. I tried fix it for 3 hours, but it still not work...
The problem is, hijacking syscalls is not technically feasible. You can't do it with Linux. Linux does not have a layered design that supports this sort of thing (as opposed to Windows or other operating systems).
About the best you will be able to do is interpositioning, which redirects calls made through the PLT into your shared object. I believe this is the way Valgrind works when it replaces malloc and free.
Also note that some system calls are not routed through the PLT. See the discussion of Double-underscore names for public API functions on the glibc wiki.
Also see Query regarding kernel modules intercepting system call on the Kernel Newbies mailing list and Multiple kernel modules intercepting same system call and crash during unload on Stack Overflow. The first one is the question where the kernel developers tell the OP it is not possible. I'm just reiterating what the kernel devs have already stated.
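Leaving aside whether the approach is advisable, two things in the posted module and oops can be checked with plain arithmetic. First, the oops is a read fault (Oops: 0000, PMD 0) at RBX + 0x18, i.e. sctable[3], and 3 is __NR_close on x86_64: the scan dereferences an address that is not mapped, before any table entry is written. Second, the module clears CR0.WP with & (~0x10000) but tries to restore it with | (~0x10000), which sets every bit except WP; the restore presumably should be | 0x10000. The CR0 value in the oops, 0x80040033, is consistent with WP having already been cleared. A user-space sketch of the mask arithmetic, with made-up helper names:

```c
#include <assert.h>
#include <stdint.h>

#define CR0_WP UINT64_C(0x10000)   /* bit 16: write protect */

/* What the module does before patching the table. */
uint64_t cr0_clear_wp(uint64_t cr0)
{
    return cr0 & ~CR0_WP;
}

/* What the restore presumably should be. */
uint64_t cr0_set_wp(uint64_t cr0)
{
    uint64_t v = cr0 | CR0_WP;

    assert((v >> 32) == 0);   /* the upper half of CR0 is reserved */
    return v;
}

/* The expression as posted: read_cr0() | (~0x10000). */
uint64_t cr0_restore_as_posted(uint64_t cr0)
{
    /* ~0x10000 is a negative int; it sign-extends to 0xfffffffffffeffff */
    return cr0 | (uint64_t)(~0x10000);
}
```

With the values from the log, cr0_set_wp(0x80040033) gives 0x80050033, while cr0_restore_as_posted(0x80040033) gives 0xfffffffffffeffff: WP stays clear and the reserved upper bits are set, which a real write to CR0 would reject with a general protection fault.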

__stack_chk_fail when executing copy_process() in _do_fork()?

I'm working on an experimental project on the Linux kernel (4.4.52) on x86_64. One requirement is that whenever control flow leaves a specific function, the Write Protect bit in the CR0 register is enabled again. Generally speaking, it looks like this (the idea comes from the nested kernel, but that is not very relevant to my question):
DISABLE_CR0.WP_BIT
original_func()
ENABLE_CR0.WP_BIT
By doing that, the whole kernel would execute with CR0.WP enabled. I have wrapped the original native_set_pte and native_write_cr3 functions in the form above, and now the kernel crashes when booting.
Here is the log (this is the original log, although the sequence seems odd):
[ 1.403888] IP: [<ffff8800351ebbb0>] 0xffff8800351ebbb0
[ 1.403891] PGD 2876067 PUD 2877067 PMD 3500e063 PTE 80000000351eb163
[ 1.403892] Oops: 0011 [#2] SMP
[ 1.403898] Modules linked in: crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse pata_acpi floppy
[ 1.403901] CPU: 0 PID: 143 Comm: systemd-udevd Tainted: G D 4.4.52v1+ #2
[ 1.403902] Hardware name: Fedora Project OpenStack Nova, BIOS 0.5.1 01/01/2011
[ 1.403903] task: ffff8800351c0e00 ti: ffff8800351e8000 task.ti: ffff8800351e8000
[ 1.403905] RIP: 0010:[<ffff8800351ebbb0>] [<ffff8800351ebbb0>] 0xffff8800351ebbb0
[ 1.403906] RSP: 0018:ffff8800351ebba8 EFLAGS: 00211086
[ 1.403906] RAX: 000000000000000e RBX: ffff8800351ebcf8 RCX: 000000000000000e
[ 1.403907] RDX: 0000000000000000 RSI: 0000000000201092 RDI: 0000000000201092
[ 1.403908] RBP: 0000000000000003 R08: ffffffff82778d60 R09: ffff8800351ebb40
[ 1.403909] R10: 0000000000000030 R11: ffffc00000000fff R12: ffff8800351c0e00
[ 1.403909] R13: 0000000000000010 R14: 0000000000201046 R15: ffffffffffffffff
[ 1.403911] FS: 00007f1f021e38c0(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[ 1.403912] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.403912] CR2: ffff8800351ebbb0 CR3: 00000000351cd000 CR4: 00000000001406f0
[ 1.403916] Stack:
[ 1.403918] ffffffff810b62ae ffff8800351ebbc0 000000000000006c 0000000000000000
[ 1.403920] ffff8800351ebbd8 ffffffff810b62ae ffffffff8111ce51 00000000000364a4
[ 1.403921] ffffffff82783168 000000000000005c 000000000000000c ffffffff820583b0
[ 1.403922] Call Trace:
[ 1.403928] [<ffffffff810b62ae>] ? kvm_sched_clock_read+0x1e/0x30
[ 1.403930] [<ffffffff810b62ae>] ? kvm_sched_clock_read+0x1e/0x30
[ 1.403933] [<ffffffff8111ce51>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[ 1.403935] [<ffffffff8111c49e>] ? down_trylock+0x2e/0x40
[ 1.403937] [<ffffffff81129959>] ? console_trylock+0x19/0x60
[ 1.403938] [<ffffffff8112af2e>] ? vprintk_emit+0x29e/0x530
[ 1.403945] [<ffffffff8115fe8e>] ? crash_kexec+0x7e/0x140
[ 1.403953] [<ffffffff81440ae5>] ? find_next_bit+0x15/0x20
[ 1.403955] [<ffffffff814390bb>] ? __const_udelay+0x2b/0x30
[ 1.403958] [<ffffffff810a2a0c>] ? native_stop_other_cpus+0x8c/0x170
[ 1.403965] [<ffffffff811dde8f>] ? panic+0xeb/0x215
[ 1.403968] [<ffffffff810d12a7>] ? copy_process+0x727/0x1b20
[ 1.403970] [<ffffffff810d32f9>] ? __stack_chk_fail+0x19/0x20
[ 1.403972] [<ffffffff810d12a7>] ? copy_process+0x727/0x1b20
[ 1.403974] [<ffffffff810d2808>] ? _do_fork+0x78/0x360
[ 1.403975] [<ffffffff810d2b99>] ? SyS_clone+0x19/0x20
[ 1.403986] [<ffffffff818694f2>] ? entry_SYSCALL_64_fastpath+0x16/0x71
[ 1.404004] Code: 00 00 00 86 10 21 00 00 00 00 00 a8 bb 1e 35 00 88 ff ff 18 00 00 00 00 00 00 00 b0 bb 1e 35 00 88 ff ff ae 62 0b 81 ff ff ff ff <c0> bb 1e 35 00 88 ff ff 6c 00 00 00 00 00 00 00 00 00 00 00 00
[ 1.404005] RIP [<ffff8800351ebbb0>] 0xffff8800351ebbb0
[ 1.404006] RSP <ffff8800351ebba8>
[ 1.404006] CR2: ffff8800351ebbb0
[ 1.404008] ---[ end trace b62acacf75e0c54f ]---
[ 1.406415] Kernel Offset: disabled
[ 1.456105] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff810d12a7
[ 1.456105]
I guess the problem is that something in copy_process causes an overflow; maybe it writes to some read-only memory? But according to Intel's documentation the CR0.WP bit only affects supervisor mode, so does that mean the kernel is running in supervisor mode when executing copy_process?
I tried to disassemble the kernel and got overwhelmed by the countless assembly instructions, so I decided to investigate with qemu. However, the kernel did NOT crash in qemu! The command I used is:
qemu-system-x86_64 -m 1G -kernel arch/x86/boot/bzImage -initrd arch/x86/boot/linux4.4.52-rootfs.img -hda vdisk.img --append "root=/dev/sda rw console=ttyS0" -nographic
I assumed that _do_fork is independent of specific devices and filesystems (correct me if I'm wrong), so whatever makes the kernel crash on my VPS should make it crash in qemu as well, but it didn't.
Has anyone come across the same issue? I really need some help now.
P.S. I do this on my VPS (Ubuntu 16.04.2), but I don't think that is the reason.
Please note that QEMU is not a fully architecture-accurate model and does not fully support every architectural feature. QEMU aims for emulation speed rather than accuracy, so it provides only a workable architecture profile that is sufficient to run an OS.
QEMU handles some features precisely and correctly, such as paging and segmentation, but many others it does not: there are problems with CPUID and syscall, many instructions do not generate #GP, #SS, #PF or floating-point errors, some instructions are not implemented (AVX, AVX2, FMA), and so on.
To catch such tricky cases you should use one of the x86 golden models: try Bochs, or one of those provided by Intel itself.
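One detail can be read off the posted log itself: Oops: 0011 is a page-fault error code meaning an instruction fetch from a present page, and the faulting RIP (ffff8800351ebbb0) is only 8 bytes above RSP (ffff8800351ebba8). In other words the CPU seems to have been executing from the task's kernel stack by the time this oops was printed, which fits a corrupted return address better than a plain write to read-only memory. Note also that this is oops #2 on a kernel already tainted with D, so it may be a follow-on fault during panic handling. The check below is a user-space sketch; the 16 KiB stack size is an assumption about this kernel configuration, and addr_in_task_stack is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 16 KiB kernel stacks, aligned to their own size. */
#define ASSUMED_THREAD_SIZE UINT64_C(0x4000)

/*
 * Returns 1 if addr lies inside the kernel stack that starts at stack_base
 * (the "ti:" value in the oops), 0 otherwise.
 */
int addr_in_task_stack(uint64_t stack_base, uint64_t addr)
{
    assert(stack_base % ASSUMED_THREAD_SIZE == 0);
    return addr >= stack_base && addr < stack_base + ASSUMED_THREAD_SIZE;
}
```

With ti = ffff8800351e8000 from the log, the faulting RIP ffff8800351ebbb0 falls inside the stack, whereas the copy_process return address ffffffff810d12a7 named in the stack-protector message does not.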

Analyzing CPU registers during kernel crash dump

I was debugging an issue and hit the kernel crash below, along with a crash dump being generated. To some extent I know how to get to the exact line in the code where the issue occurred, using the gdb command l *(debug_fucntion+0x19).
<1>BUG: unable to handle kernel paging request at ffffc90028213000
<1>IP: [<ffffffffa0180279>] debug_fucntion+0x19/0x160 [dise]
<4>PGD 103febe067 PUD 103febf067 PMD fd54e1067 PTE 0
<4>Oops: 0000 [#1] SMP
<4>last sysfs file: /sys/kernel/mm/ksm/run
<4>CPU 7
<4>Modules linked in: dise(P)(U) ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge autofs4 8021q garp stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun kvm uinput ipmi_devintf power_meter microcode iTCO_wdt iTCO_vendor_support dcdbas sg ses enclosure serio_raw lpc_ich mfd_core i7core_edac edac_core bnx2 ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dise]
<4>
<4>Pid: 1126, comm: diseproc Tainted: P W --------------- 2.6.32-431.el6.x86_64 #1 Dell Inc. PowerEdge R710/0MD99X
<4>RIP: 0010:[<ffffffffa0180279>] [<ffffffffa0180279>] debug_fucntion+0x19/0x160 [dise]
<4>RSP: 0018:ffff880435fc5b88 EFLAGS: 00010282
<4>RAX: 0000000000000000 RBX: 0000000000010000 RCX: ffffc90028213000
<4>RDX: 0000000000010040 RSI: 0000000000010000 RDI: ffff880fe36a0000
<4>RBP: ffff880435fc5b88 R08: ffffffffa025d8a3 R09: 0000000000000000
<4>R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000010040
<4>R13: 000000000000b101 R14: ffffc90028213010 R15: ffff880fe36a0000
<4>FS: 00007fbe6040b700(0000) GS:ffff8800618e0000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>CR2: ffffc90028213000 CR3: 0000000fc965b000 CR4: 00000000000007e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process diseproc (pid: 1126, threadinfo ffff880435fc4000, task ffff8807f8be8ae0)
<4>Stack:
<4> ffff880435fc5be8 ffffffffa0180498 0000000081158f46 00000c200000fd26
<4><d> ffffc90028162000 0000fec635fc5bc8 0000000000000018 ffff881011d80000
<4><d> ffffc90028162000 ffff8802f18fe440 ffff880fc80b4000 ffff880435fc5cec
<4>Call Trace:
<4> [<ffffffffa0180498>] cmd_dump+0x1c8/0x360 [dise]
<4> [<ffffffffa01978e1>] debug_log_show+0x91/0x160 [dise]
<4> [<ffffffffa013afb9>] process_debug+0x5a9/0x990 [dise]
<4> [<ffffffff810792c7>] ? current_fs_time+0x27/0x30
<4> [<ffffffffa013bc38>] dise_ioctl+0xd8/0x300 [dise]
<4> [<ffffffff8105a501>] ? hotplug_hrtick+0x21/0x60
<4> [<ffffffff8119db42>] vfs_ioctl+0x22/0xa0
<4> [<ffffffff8119dce4>] do_vfs_ioctl+0x84/0x580
<4> [<ffffffff8119e261>] sys_ioctl+0x81/0xa0
<4> [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>Code: be c4 10 e1 48 8b 5d d8 44 01 f0 4c 8b 65 e0 4c 8b 6d e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 0f 1f 44 00 00 55 48 89 e5 0f 1f 44 00 00 <48> 8b 01 48 c1 e8 3c 83 f8 08 76 0b e8 f6 fb ff ff c9 c3 0f 1f
<1>RIP [<ffffffffa0180279>] debug_fucntion+0x19/0x160 [dise]
<4> RSP <ffff880435fc5b88>
<4>CR2: ffffc90028213000
My questions are:
1. Can the printed CPU register contents give more information? How do I decode them?
2. Can I recover, from the crash dump, the values of the variables or data structures that led to the crash?
3. What does "Code: be c4 10 e1 48 8b 5d ..." tell me here?
You must understand that you are inspecting (not debugging) at the assembly level (not the source level). Keep this in mind whenever you inspect crash dumps.
Read your crash dump report carefully, line by line: it contains a lot of information, and it is all you have.
Once you have found where your code crashed, you have to figure out why by reading the crash dump report and the disassembly.
The first line of your crash dump report tells you
BUG: unable to handle kernel paging request at ffffc90028213000
That means your code accessed an invalid memory address (here ffffc90028213000).
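On x86 the address that triggered a page fault is stored in CR2, which is why the same value shows up on the CR2: line of the dump. As a rough sketch (the helper names parse_registers and registers_holding are made up for illustration), you can check whether any general-purpose register holds that address, which hints at which pointer was bad:

```python
import re

# Register lines copied from the oops above (log-level prefixes removed).
OOPS_REGS = """\
RAX: 0000000000000000 RBX: 0000000000010000 RCX: ffffc90028213000
RDX: 0000000000010040 RSI: 0000000000010000 RDI: ffff880fe36a0000
RBP: ffff880435fc5b88 R08: ffffffffa025d8a3 R09: 0000000000000000
R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000010040
R13: 000000000000b101 R14: ffffc90028213010 R15: ffff880fe36a0000
CR2: ffffc90028213000 CR3: 0000000fc965b000 CR4: 00000000000007e0
"""

def parse_registers(text):
    """Map register names such as 'RCX' or 'CR2' to integer values."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}

def registers_holding(regs, address):
    """Names of general-purpose registers whose value equals address."""
    return [name for name, value in regs.items()
            if value == address and not name.startswith("CR")]

regs = parse_registers(OOPS_REGS)
print(registers_holding(regs, regs["CR2"]))
```

For this dump the only match is RCX, so the bad pointer was most likely sitting in RCX when the fault happened.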
The line
Process diseproc (pid: 1126, threadinfo ffff880435fc4000, task ffff8807f8be8ae0)
tells you what was running in userspace at the time of the crash. It looks like the userspace process diseproc issued some command to your driver that triggered the crash.
A very important line is
IP: [<ffffffffa0180279>] debug_fucntion+0x19/0x160 [dise]
Disassemble that function (for example with dis debug_fucntion in the crash utility; note that the symbol really is spelled debug_fucntion in your module), find debug_fucntion+25 (0x19 hex = 25 dec) and look around. Read it side by side with the C source code of the function. You can usually find the crash site in the C code by matching callq instructions, because the disassembly shows the printable names of the called functions.
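The symbol+offset/size notation is easy to pick apart mechanically. A minimal sketch (parse_location is a hypothetical helper, not part of any kernel tooling) that splits such a line into the symbol, the offset into the function, the function size and the module name:

```python
import re

def parse_location(text):
    """Split 'symbol+0xOFF/0xSIZE [module]' into its parts."""
    m = re.search(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?:\s+\[(\w+)\])?", text)
    if m is None:
        raise ValueError("no symbol+offset/size found")
    symbol, offset, size, module = m.groups()
    # Offsets and sizes are printed in hex; convert them to integers.
    return symbol, int(offset, 16), int(size, 16), module

line = "IP: [<ffffffffa0180279>] debug_fucntion+0x19/0x160 [dise]"
symbol, offset, size, module = parse_location(line)
print(f"{symbol}: byte {offset} of {size}, module {module}")
```

The offset is counted in bytes from the start of the function, so it is the number to look for in the disassembly listing.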
Next, and most important, is the call trace:
Call Trace:
[<ffffffffa0180498>] cmd_dump+0x1c8/0x360 [dise]
[<ffffffffa01978e1>] debug_log_show+0x91/0x160 [dise]
[<ffffffffa013afb9>] process_debug+0x5a9/0x990 [dise]
[<ffffffff810792c7>] ? current_fs_time+0x27/0x30
[<ffffffffa013bc38>] dise_ioctl+0xd8/0x300 [dise]
[<ffffffff8105a501>] ? hotplug_hrtick+0x21/0x60
[<ffffffff8119db42>] vfs_ioctl+0x22/0xa0
[<ffffffff8119dce4>] do_vfs_ioctl+0x84/0x580
[<ffffffff8119e261>] sys_ioctl+0x81/0xa0
[<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
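Reading bottom to top: the kernel received an ioctl (from diseproc, obviously) and invoked the ioctl handler dise_ioctl in the dise module, which called process_debug, then debug_log_show and finally cmd_dump. Entries marked with ? (current_fs_time, hotplug_hrtick, __audit_syscall_exit) are addresses that were found on the stack but could not be confirmed as part of the call chain, so treat them as possible leftovers from earlier calls.
Now you know:
Code path: dise_ioctl -> process_debug -> debug_log_show -> cmd_dump -> somehow to debug_fucntion.
The approximate place in the C code that caused the crash
The reason for the crash: access to invalid memory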
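Reading the call trace bottom to top can also be done mechanically. A small sketch (call_chain is an illustrative name) that drops the entries marked with ?, which the kernel prints for stack addresses it could not confirm as real callers, and returns the remaining frames with the outermost caller first:

```python
import re

# Call trace copied from the oops above.
TRACE = """\
 [<ffffffffa0180498>] cmd_dump+0x1c8/0x360 [dise]
 [<ffffffffa01978e1>] debug_log_show+0x91/0x160 [dise]
 [<ffffffffa013afb9>] process_debug+0x5a9/0x990 [dise]
 [<ffffffff810792c7>] ? current_fs_time+0x27/0x30
 [<ffffffffa013bc38>] dise_ioctl+0xd8/0x300 [dise]
 [<ffffffff8105a501>] ? hotplug_hrtick+0x21/0x60
 [<ffffffff8119db42>] vfs_ioctl+0x22/0xa0
 [<ffffffff8119dce4>] do_vfs_ioctl+0x84/0x580
 [<ffffffff8119e261>] sys_ioctl+0x81/0xa0
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
"""

def call_chain(trace):
    """Return the confirmed frames, outermost caller first."""
    frames = []
    for line in trace.splitlines():
        m = re.search(r"\[<[0-9a-f]+>\] (\? )?(\w+)\+0x", line)
        if m and not m.group(1):  # skip unconfirmed '?' entries
            frames.append(m.group(2))
    return frames[::-1]  # the trace lists the innermost frame first

print(" -> ".join(call_chain(TRACE)))
```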
With this information you have to use your last and most powerful method: thinking. Try to work out which variables or structures caused the crash. Maybe one of them had already been freed by the time you arrived in debug_fucntion? Maybe you made a mistake in pointer arithmetic?
Answers to your questions:
1. Most of the time the CPU register values are of little use on their own, because they do not map directly to your C code: they are just values, some of them pointing to memory. There are a few extremely useful registers, such as RIP/EIP and RSP/ESP, but most of the others lack context.
2. Very unlikely. You are not actually debugging, you are inspecting a dump, so you don't have any debugging context.
3. I agree with #user2699113 that it is just the raw memory content around the address in RIP.
And remember: the best debugging tool is your brain.
See here. It has good documentation on how to debug kernel crashes; see the section on objdump.
What it tells you is that you can disassemble your kernel image by running objdump on the vmlinux image. The command outputs a large text file containing the disassembly of your kernel. You can then grep that file for the EIP that caused the problem.
PS: I would recommend running objdump on vmlinux and saving the output locally.
1. and 2.: It is rather hard to find out how the CPU registers relate to parameters and variable values.
3.: That is machine code. You can find it in your disassembled program and so locate where the problem occurred. Notice the <48> 8b 01 48 ... part: AFAIK the trap occurs at the instruction that starts with the byte in angle brackets. So you need to debug it by disassembling your code. If you compile your program (module) with debugging symbols, you can find the source line number where the problem occurred.
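To make the Code: line a bit more concrete, here is a minimal sketch that splits it at the byte marked with angle brackets. The name split_code_line is only illustrative; the kernel source tree also ships a scripts/decodecode helper that goes further and disassembles the bytes for you.

```python
# The "Code:" line copied from the oops above.
CODE = ("be c4 10 e1 48 8b 5d d8 44 01 f0 4c 8b 65 e0 4c 8b 6d e8 4c 8b 75 f0 "
        "4c 8b 7d f8 c9 c3 0f 1f 44 00 00 55 48 89 e5 0f 1f 44 00 00 "
        "<48> 8b 01 48 c1 e8 3c 83 f8 08 76 0b e8 f6 fb ff ff c9 c3 0f 1f")

def split_code_line(code):
    """Return (bytes before RIP, bytes starting at RIP)."""
    tokens = code.split()
    # The byte at RIP is the one wrapped in <...>.
    marked = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    raw = bytes(int(t.strip("<>"), 16) for t in tokens)
    return raw[:marked], raw[marked:]

before, at_rip = split_code_line(CODE)
print(len(before), "bytes before RIP; instruction bytes start with", at_rip[:3].hex(" "))
```

Feeding the second part to a disassembler shows the faulting instruction. Here it starts with 48 8b 01, which on x86-64 is mov (%rcx),%rax, i.e. a load through the pointer in RCX.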

Resources