I used the following perf command to sample userspace read accesses to DRAM by evince:
perf record -d --call-graph dwarf -c 100 -e mem_load_uops_retired.l3_miss:uppp /opt/evince-3.28.4/bin/evince
As can be seen, I used the PEBS feature to increase the accuracy of sampling. But there are some non-memory accesses reported as memory ones. For example, this is a sampled event reported by perf script:
evince 20589 16079.401401: 100 mem_load_uops_retired.l3_miss:uppp: 555555860750 5080022 N/A|SNP N/A|TLB N/A|LCK N/A
555555579939 ev_history_can_go_back+0x19 (/opt/evince-3.28.4/bin/evince)
5555555862ef ev_window_update_actions_sensitivity+0xa1f (/opt/evince-3.28.4/bin/evince)
55555558ce4f ev_window_page_changed_cb+0xf (/opt/evince-3.28.4/bin/evince)
7ffff574510c g_closure_invoke+0x19c (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff575805d signal_emit_unlocked_R+0xf4d (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff5760714 g_signal_emit_valist+0xa74 (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff576112e g_signal_emit+0x8e (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff7140d76 emit_value_changed+0xf6 (inlined)
7ffff7140d76 adjustment_set_value+0xf6 (inlined)
7ffff7140d76 gtk_adjustment_set_value_internal+0xf6 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
7ffff574510c g_closure_invoke+0x19c (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff5757de7 signal_emit_unlocked_R+0xcd7 (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff575fc7f g_signal_emitv+0x27f (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff7153519 gtk_binding_entry_activate+0x289 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
7ffff71539ef binding_activate+0x5f (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
7ffff7153b7f gtk_bindings_activate_list+0x17f (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
7ffff7154cd8 gtk_bindings_activate_event+0x98 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
7ffff7973959 ev_view_key_press_event+0x59 (/opt/evince-3.28.4/lib/libevview3.so.3.0.0)
7ffff72698f6 _gtk_marshal_BOOLEAN__BOXEDv+0xa6 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
7ffff574524f _g_closure_invoke_va+0xbf (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff57603cc g_signal_emit_valist+0x72c (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff576112e g_signal_emit+0x8e (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff73b1533 gtk_widget_event_internal+0x163 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
7ffff73d1f0a gtk_window_propagate_key_event+0xfa (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
5555555894b1 ev_window_key_press_event+0x31 (/opt/evince-3.28.4/bin/evince)
7ffff72698f6 _gtk_marshal_BOOLEAN__BOXEDv+0xa6 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
7ffff5745345 _g_closure_invoke_va+0x1b5 (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff57603cc g_signal_emit_valist+0x72c (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff576112e g_signal_emit+0x8e (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff73b1533 gtk_widget_event_internal+0x163 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
7ffff726693e propagate_event+0x21e (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
7ffff7268947 gtk_main_do_event+0x7f7 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
7ffff6d79764 _gdk_event_emit+0x24 (/usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2200.30)
7ffff6da9f91 gdk_event_source_dispatch+0x21 (/usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2200.30)
7ffff546a416 g_main_dispatch+0x2e6 (inlined)
7ffff546a416 g_main_context_dispatch+0x2e6 (/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.5600.4)
7ffff546a64f g_main_context_iterate+0x1ff (inlined)
7ffff546a6db g_main_context_iteration+0x2b (/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.5600.4)
7ffff5a2be3c g_application_run+0x1fc (/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.5600.4)
555555573707 main+0x447 (/opt/evince-3.28.4/bin/evince)
7ffff4a91b96 __libc_start_main+0xe6 (/lib/x86_64-linux-gnu/libc-2.27.so)
555555573899 _start+0x29 (/opt/evince-3.28.4/bin/evince)
ffffffffffffffff [unknown] ([unknown])
This implies that there exists an access to address 0x555555860750 (which is located in [heap]) by an instruction at 0x555555579939 (which is located in evince text section at offset 0x19 of function ev_history_can_go_back()). This memory instrucion is the last line in the following code snippet:
0000000000025920 <ev_history_can_go_back>:
25920: 53 push %rbx
25921: 48 89 fb mov %rdi,%rbx
25924: e8 67 fa ff ff callq 25390 <ev_history_get_type>
25929: 48 85 db test %rbx,%rbx
2592c: 74 42 je 25970 <ev_history_can_go_back+0x50>
2592e: 48 8b 13 mov (%rbx),%rdx
25931: 48 85 d2 test %rdx,%rdx
25934: 74 05 je 2593b <ev_history_can_go_back+0x1b>
25936: 48 39 02 cmp %rax,(%rdx)
25939: 74 0f je 2594a <ev_history_can_go_back+0x2a>
This is a jump to ev_history_can_go_back+0x2a and, apparently, this is not an access to [heap] at address 0x555555860750. Is this perf report wrong?
UPDATE
How about the following backtrace?
11159097179866 0xfb80 [0x1778]: PERF_RECORD_SAMPLE(IP, 0x4002): 7309/7309: 0x7ffff6d6c310 period: 10000 addr: 0x7ffff7034e50
... FP chain: nr:0
... user regs: mask 0xff0fff ABI 64-bit
.... AX 0x555555b8b4c0
.... BX 0x555555c48e10
.... CX 0x1
.... DX 0x7fffffffd988
.... SI 0x7fffffffd980
.... DI 0x555555b8b4c0
.... BP 0x258
.... SP 0x7fffffffd978
.... IP 0x7ffff6d6c310
.... FLAGS 0x20e
.... CS 0x33
.... SS 0x2b
.... R8 0x27c
.... R9 0x24
.... R10 0x2a2
.... R11 0x0
.... R12 0x258
.... R13 0x555555b8b4c0
.... R14 0x3000
.... R15 0x7ffff5747000
... ustack: size 5768, offset 0xd8
. data_src: 0x5080022
... thread: evince:7309
...... dso: /usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2200.30
evince 7309 11159.097179: 10000 mem_load_uops_retired.l3_miss:uppp: 7ffff7034e50 5080022 N/A|SNP N/A|TLB N/A|LCK N/A
7ffff6d6c310 cairo_surface_get_device_scale#plt+0x0 (/usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2200.30)
7ffff6d91029 gdk_window_create_similar_surface+0xc9 (/usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2200.30)
7ffff6d95410 gdk_window_begin_paint_internal+0x350 (/usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2200.30)
7ffff6d956f1 gdk_window_begin_draw_frame+0xc1 (/usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2200.30)
7ffff73c4942 gtk_widget_render+0xd2 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
7ffff7268858 gtk_main_do_event+0x708 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
7ffff6d79764 _gdk_event_emit+0x24 (/usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2200.30)
7ffff6d897f4 _gdk_window_process_updates_recurse_helper+0x104 (/usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2200.30)
7ffff6d8a9f5 gdk_window_process_updates_internal+0x165 (/usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2200.30)
7ffff6d8abef gdk_window_process_updates_with_mode+0x11f (/usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2200.30)
7ffff574510c g_closure_invoke+0x19c (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff575805d signal_emit_unlocked_R+0xf4d (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff5760714 g_signal_emit_valist+0xa74 (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff576112e g_signal_emit+0x8e (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
7ffff6d82ac8 gdk_frame_clock_paint_idle+0x3c8 (/usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2200.30)
7ffff6d6e07f gdk_threads_dispatch+0x1f (/usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2200.30)
7ffff546ad02 g_timeout_dispatch+0x12 (/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.5600.4)
7ffff546a284 g_main_dispatch+0x154 (inlined)
7ffff546a284 g_main_context_dispatch+0x154 (/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.5600.4)
7ffff546a64f g_main_context_iterate+0x1ff (inlined)
7ffff546a6db g_main_context_iteration+0x2b (/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.5600.4)
7ffff5a2be3c g_application_run+0x1fc (/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.5600.4)
555555573707 main+0x447 (/opt/evince-3.28.4/bin/evince)
7ffff4a91b96 __libc_start_main+0xe6 (/lib/x86_64-linux-gnu/libc-2.27.so)
555555573899 _start+0x29 (/opt/evince-3.28.4/bin/evince)
The access point is at offset 0 of the following disassembly:
Dump of assembler code for function cairo_surface_get_device_scale#plt:
0x000000000002a310 <+0>: jmpq *0x2c8b3a(%rip) # 0x2f2e50
0x000000000002a316 <+6>: pushq $0x1c7
0x000000000002a31b <+11>: jmpq 0x28690
This is an unconditional jump which will not lead to macrofusion.
On Intel CPUs at least, cmp %rax,(%rdx) can macro-fuse with the following je, while also micro-fusing the load. https://agner.org/optimize/. Also related: Micro fusion and addressing modes (this is a non-indexed addressing mode so this can stay micro-fused even on Sandybridge/IvyBridge).
So in the fused domain (where retirement happens) you really do have single-uop compare-and-branch with a memory source. Note that mem_load_uops_retired.l3_miss:uppp counts uops, not instructions.
Even in the unfused domain, macro-fused compare/branch really executes as a single uop on a single execution unit, but the load does have to execute on a separate port. (Micro-fusion saves decode/issue front-end bandwidth, and uop cache space, but not back-end ports.)
Related
I am profiling a Rust binary using perf record and I observe unexpected stack frames in the output of perf script. As a result, the flamegraph is difficult to interpret. How can I avoid these unexpected stack frames?
Profiling commands:
$ perf record -F 997 --call-graph dwarf,65528 -g -e cpu-clock /path/to/bin
$ perf script
In the application, fn main calls fn tip_mine. Here is a record from perf script which shows the expected stack frame:
...
56275bcc4bee stacks_inspect::tip_mine+0xd23e (inlined)
56275bcc4bee stacks_inspect::main+0xd23e (/home/igor/stacks-blockchain/target/release/stacks-inspect)
ffffffffffffffff [unknown] ([unknown])
Here is another record from perf script from the same run which shows extra stack frames between fn main and fn tip_mine:
...
56275bcbe20d stacks_inspect::tip_mine+0x685d (inlined)
56275bcbe20d stacks_inspect::main+0x685d (/home/igor/stacks-blockchain/target/release/stacks-inspect)
56275bc9bc42 core::ops::function::FnOnce::call_once+0x2 (inlined)
56275bc9bc42 std::sys_common::backtrace::__rust_begin_short_backtrace+0x2 (/home/igor/stacks-blockchain/target/release/stacks-inspect)
56275bca5e68 std::rt::lang_start::{{closure}}+0x8 (inlined)
56275c35dc8d core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once+0x42d (inlined)
56275c35dc8d std::panicking::try::do_call+0x42d (inlined)
56275c35dc8d std::panicking::try+0x42d (inlined)
56275c35dc8d std::panic::catch_unwind+0x42d (inlined)
56275c35dc8d std::rt::lang_start_internal::{{closure}}+0x42d (inlined)
56275c35dc8d std::panicking::try::do_call+0x42d (inlined)
56275c35dc8d std::panicking::try+0x42d (inlined)
56275c35dc8d std::panic::catch_unwind+0x42d (inlined)
56275c35dc8d std::rt::lang_start_internal+0x42d (/home/igor/stacks-blockchain/target/release/stacks-inspect)
56275bcc64e7 main+0x27 (/home/igor/stacks-blockchain/target/release/stacks-inspect)
7fa0aa8b4d8f [unknown] (/usr/lib/x86_64-linux-gnu/libc.so.6)
7fa0aa8b4e3f __libc_start_main+0x7f (/usr/lib/x86_64-linux-gnu/libc.so.6)
56275bc5a1d4 _start+0x24 (/home/igor/stacks-blockchain/target/release/stacks-inspect)
I'm trying to execute "invd" instruction from a kernel module. I have asked a similar question How to execute “invd” instruction? previously and from #Peter Cordes's answer, I understand I can't safely run this instruction on SMP system after system boot. So, shouldn't I be able to run this instruction after boot without SMP support? Because there is no other core running, therefore there is no change for memory inconsistency? I have the following kernel module compiled with -o0 flag,
static int __init deviceDriver_init(void){
unsigned long flags;
int LEN=10;
int STEP=1;
int VALUE=1;
int arr[LEN];
int i;
unsigned long dummy;
printk(KERN_INFO "invd Driver loaded\n");
//wbinvd();
//asm volatile("cpuid\n":::);
local_irq_disable();
__asm__ __volatile__(
"wbinvd\n"
"loop:"
"movq %%rdx, (%%rbx);"
"leaq (%%rbx, %%rcx, 8), %%rbx;"
"cmpq %%rbx, %%rax;"
"jg loop;"
"invd\n"
: "=b"(dummy) // output
: "b" (arr),
"a" (arr+LEN),
"c" (STEP),
"d" (VALUE)
: "cc", "memory"
);
local_irq_enable();
//asm volatile("invd\n":::);
printk(KERN_INFO "invd execute\n");
return 0;
}
I'm still getting the following error upon inserting the module I'm getting Segmentation fault (core dumped) in the terminal and the dmesg shows,
[ 2590.518614] invd Driver loaded
[ 2590.518840] general protection fault: 0000 [#5] SMP PTI
I have boot my kernel with nosmp but I do not understand why dmesg still shows SMP PTI
$cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.15.0-136-generic root=UUID=dbe747ff-a6a5-45cb-8553-c6db6d445d3d ro quiet splash nosmp vt.handoff=7
Update post:
As I mentioned in the comment section, After disabling, SGX from BIOS, I was able to run this invd without any error. However, when I try to run the same code on a different machine with the same kernel version, I still get the same error message. It is strange and I can't explain why this is happening. As in the comment section, #prl mentions that the error may be coming from the instruction following invd. I begin to think that maybe that is true. Because second from the last line in the dmesg is higlighted in RED [ 153.527386] RIP: loop+0xc/0xf22 [noSmp8] RSP: ffffb8d9450a7be0. So, seems like the error is coming from inside the loop. I have updated the __init function code according to the suggestion. I'm not good at assembly code, can anyone please tell me if the inline assembly code is correct or not? If this inline assembly code is not correct how to fix the code? My whole dmesg trace is,
[ 153.514293] invd Driver loaded
[ 153.514547] general protection fault: 0000 [#1] SMP PTI
[ 153.514656] Modules linked in: noSmp8(OE+) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables ccm arc4 intel_rapl rt2800usb rt2x00usb x86_pkg_temp_thermal intel_powerclamp rt2800lib coretemp rt2x00lib mac80211 cfg80211 kvm_intel kvm irqbypass snd_hda_codec_realtek crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd intel_cstate intel_rapl_perf dell_smm_hwmon dell_wmi dell_smbios dcdbas intel_wmi_thunderbolt snd_hda_codec_generic dell_wmi_descriptor wmi_bmof snd_seq_midi snd_seq_midi_event
[ 153.515454] serio_raw snd_hda_intel snd_hda_codec snd_hda_core sparse_keymap snd_hwdep snd_rawmidi joydev input_leds snd_seq snd_pcm snd_seq_device snd_timer snd soundcore mei_me mei shpchp intel_pch_thermal mac_hid acpi_pad parport_pc ppdev lp parport autofs4 hid_generic usbhid hid nouveau mxm_wmi ttm drm_kms_helper psmouse syscopyarea sysfillrect sysimgblt igb e1000e dca i2c_algo_bit ptp pps_core ahci libahci fb_sys_fops drm wmi video
[ 153.516038] CPU: 0 PID: 4024 Comm: insmod Tainted: G OE 4.15.0-136-generic #140~16.04.1-Ubuntu
[ 153.516331] Hardware name: Dell Inc. BIOS 1.3.2 01/25/2016
[ 153.516626] RIP: 0010:loop+0xc/0xf22 [noSmp8]
[ 153.516917] RSP: 0018:ffffb8d9450a7be0 EFLAGS: 00010046
[ 153.517213] RAX: ffffb8d9450a7c08 RBX: ffffb8d9450a7c08 RCX: 0000000000000001
[ 153.517513] RDX: 0000000000000001 RSI: ffffb8d9450a7be0 RDI: ffff8edaadc16490
[ 153.517814] RBP: ffffb8d9450a7c60 R08: 0000000000012c40 R09: ffffffffb39624c4
[ 153.518119] R10: ffffb8d9450a7c78 R11: 000000000000038c R12: ffffb8d9450a7c10
[ 153.518427] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8eda4c6bd660
[ 153.518730] FS: 00007fd7f09cf700(0000) GS:ffff8edaadc00000(0000) knlGS:0000000000000000
[ 153.519036] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 153.519346] CR2: 00005634f95fde50 CR3: 000000040dd2c001 CR4: 00000000003606f0
[ 153.519656] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 153.519980] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 153.520289] Call Trace:
[ 153.520597] ? 0xffffffffc050d000
[ 153.520899] do_one_initcall+0x55/0x1ac
[ 153.521201] ? do_one_initcall+0x55/0x1ac
[ 153.521504] ? do_init_module+0x27/0x223
[ 153.521808] ? _cond_resched+0x32/0x50
[ 153.522107] ? kmem_cache_alloc_trace+0x165/0x1c0
[ 153.522408] do_init_module+0x5f/0x223
[ 153.522710] load_module+0x188c/0x1ea0
[ 153.523016] ? ima_post_read_file+0x83/0xa0
[ 153.523320] SYSC_finit_module+0xe5/0x120
[ 153.523623] ? SYSC_finit_module+0xe5/0x120
[ 153.523927] SyS_finit_module+0xe/0x10
[ 153.524231] do_syscall_64+0x73/0x130
[ 153.524534] entry_SYSCALL_64_after_hwframe+0x41/0xa6
[ 153.524838] RIP: 0033:0x7fd7f04fd599
[ 153.525144] RSP: 002b:00007ffda61c2968 EFLAGS: 00000202 ORIG_RAX: 0000000000000139
[ 153.525455] RAX: ffffffffffffffda RBX: 00005643631d7210 RCX: 00007fd7f04fd599
[ 153.525768] RDX: 0000000000000000 RSI: 0000564361c3226b RDI: 0000000000000003
[ 153.526084] RBP: 0000564361c3226b R08: 0000000000000000 R09: 00007fd7f07c2ea0
[ 153.526403] R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000000000
[ 153.526722] R13: 00005643631d7ca0 R14: 0000000000000000 R15: 0000000000000000
[ 153.527040] Code: 00 48 8b 75 c8 48 8b 45 c8 8b 55 b8 48 63 d2 48 c1 e2 02 48 01 d0 8b 4d b4 8b 55 bc 48 89 f3 48 89 13 48 8d 1c cb 48 39 d8 7f f4 <0f> 08 48 89 d8 48 89 45 d0 e8 40 ef 73 00 48 c7 c7 c7 d0 c4 c0
[ 153.527386] RIP: loop+0xc/0xf22 [noSmp8] RSP: ffffb8d9450a7be0
[ 153.530228] ---[ end trace cc9ea64985c9fe34 ]---
So, it not possible to run invd even without SMP?
There's 2 questions here:
a) How to execute INVD (unsafely)
For this, you need to be running at CPL=0, and you have to make sure the CPU isn't using any "processor reserved memory protections" which are part of Intel's Software Guard Extensions (an extension to allow programs to have a shielded/private/encrypted space that the OS can't tamper with, often used for digital rights management schemes but possibly usable for enhancing security/confidentiality of other things).
Note that SGX is supported in recent versions of Linux, but I'm not sure when support was introduced or how old your kernel is, or if it's enabled/disabled.
If either of these isn't true (e.g. you're at CPL=3 or there are "processor reserved memory protections) you will get a general protection fault exception.
b) How to execute INVD Safely
For this, you have to make sure that the caches (which includes "external caches" - e.g. possibly including things like eDRAM and caches built into non-volatile RAM) don't contain any modified data that will cause problems if lost. This includes data from:
IRQs. These can be disabled.
NMI and machine check exceptions. For a running OS it's mostly impossible to stop/disable these and if you can disable them then it's like crossing your fingers while ignoring critical hardware failures (an extremely bad idea).
the firmware's System Management Mode. This is a special CPU mode the firmware uses for various things (e.g. ECC scrubbing, some power management, emulation of legacy devices) that't beyond the control of the OS/kernel. It can't be disabled.
writes done by the CPU itself. This includes updating the accessed/dirty flags in page tables (which can not be disabled), plus any performance monitoring or debugging features that store data in memory (which can be "not enabled").
With these restrictions (and not forgetting the performance problems) there are only 2 cases where INVD might be sane - early firmware code that needs to determine RAM chip sizes and configure memory controllers (where it's very likely to be useful/sane), and the instant before the computer is turned off (where it's likely to be pointless).
Guesswork
I'm guessing (based on my inability to think of any other plausible reason) that you want to construct temporary shielded/private area of memory (to enhance security - e.g. so that the data you put in that area won't/can't leak into RAM). In this case (ironically) it's possible that the tool designed specifically for this job (SGX) is preventing you from doing it badly.
I'm facing issues with some kernel panic but I don't have any idea how to find which soft is exacly causing this issue. I'm trying to compile some soft on remote host using distcc software but my machines which are compiling are going down because of this issue.
Could you point me where shoud I start looking? What could cause this issue? Which tools should I use?
Here is kernel panic output:
[591792.656853] IP: [< (null)>] (null)
[591792.658710] PGD 800000032ca05067 PUD 327bc6067 PMD 0
[591792.660439] Oops: 0010 [#1] SMP
[591792.661562] Modules linked in: fuse nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache nls_utf8 isofs sunrpc dm_mirror dm_region_hash dm_log dm_mod sb_edac iosf_mbi kvm_intel ppdev kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd cirrus ttm joydev drm_kms_helper sg virtio_balloon syscopyarea sysfillrect sysimgblt fb_sys_fops drm parport_pc parport drm_panel_orientation_quirks pcspkr i2c_piix4 ip_tables xfs libcrc32c sr_mod cdrom virtio_blk virtio_net ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel serio_raw floppy ata_piix libata virtio_pci virtio_ring virtio
[591792.682098] CPU: 2 PID: 25548 Comm: cc1plus Not tainted 3.10.0-957.el7.x86_64 #1
[591792.684495] Hardware name: Red Hat OpenStack Compute, BIOS 1.11.0-2.el7 04/01/2014
[591792.686923] task: ffff8ebb65ea1040 ti: ffff8ebb6b250000 task.ti: ffff8ebb6b250000
[591792.689315] RIP: 0010:[<0000000000000000>] [< (null)>] (null)
[591792.691729] RSP: 0018:ffff8ebb6b253da0 EFLAGS: 00010246
[591792.693438] RAX: 0000000000000000 RBX: ffff8ebb6b253e40 RCX: ffff8ebb6b253fd8
[591792.695716] RDX: ffff8ebb38098840 RSI: ffff8ebb6b253e40 RDI: ffff8ebb38098840
[591792.697992] RBP: ffff8ebb6b253e30 R08: 0000000000000100 R09: 0000000000000001
[591792.700271] R10: ffff8ebb7fd1f080 R11: ffffd7da0beb9380 R12: ffff8eb8417af000
[591792.702547] R13: ffff8eb875d1b000 R14: ffff8ebb6b253f24 R15: 0000000000000000
[591792.704821] FS: 0000000000000000(0000) GS:ffff8ebb7fd00000(0063) knlGS:00000000f7524740
[591792.707397] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[591792.709242] CR2: 0000000000000000 CR3: 000000032eb0a000 CR4: 00000000003607e0
[591792.711519] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[591792.713814] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[591792.716100] Call Trace:
[591792.716927] [<ffffffff9165270b>] ? path_openat+0x3eb/0x640
[591792.718727] [<ffffffff91653dfd>] do_filp_open+0x4d/0xb0
[591792.720451] [<ffffffff91661504>] ? __alloc_fd+0xc4/0x170
[591792.722267] [<ffffffff9163ff27>] do_sys_open+0x137/0x240
[591792.724017] [<ffffffff916a1fab>] compat_SyS_open+0x1b/0x20
[591792.725820] [<ffffffff91b78bb0>] sysenter_dispatch+0xd/0x2b
[591792.727648] Code: Bad RIP value.
[591792.728795] RIP [< (null)>] (null)
[591792.730486] RSP <ffff8ebb6b253da0>
[591792.731625] CR2: 0000000000000000
[591792.734935] ---[ end trace ccfdca9d4733e7a5 ]---
[591792.736450] Kernel panic - not syncing: Fatal exception
[591792.738708] Kernel Offset: 0x10400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
It is quite difficult to tell what went wrong with this just piece of log you have pasted.
It seems like oops lead into kernel panic!.
Well,with helping you to find the real cause,I can help you with material to look into for further dissection of panic/crash.
link 1: analysing kernel panics
link 2 : oops
Hope it helps you! :)
I write kernel module to debug user processes' stack's. I found a way to get pointer to stack using mm field from task_struct structure, but when I try read value from stack adress my module is crashed.
Code (for existing process with pid 860):
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/pid.h>
#include <linux/proc_fs.h>
struct task_struct *ed_task;
long int *val;
void *stack_p;
int i = 0;
static int __init init(void)
{
ed_task = pid_task(find_vpid(860), PIDTYPE_PID);
stack_p = (void *) ed_task->mm->start_stack;
printk("stack %d: %p", i, stack_p);
val = stack_p - sizeof(void *);
printk("stack %d value: %ld", i, *val);
printk(">");
return 0;
}
static void __exit modex(void)
{
}
module_init(init);
module_exit(modex);
MODULE_LICENSE("GPL");
Error:
[ 2714.296489] stack 0: 00007fffbd922e30
[ 2714.296495] BUG: unable to handle kernel paging request at 00007fffbd922e28
[ 2714.297844] IP: init+0x65/0x1000 [main3]
[ 2714.299017] PGD 10140d067
[ 2714.299019] P4D 10140d067
[ 2714.300188] PUD 0
[ 2714.303364] Oops: 0000 [#4] PREEMPT SMP
[ 2714.306523] Modules linked in: main3(O+) main1(O+) main(O+) main2(O+) rndis_host cdc_ether usbnet mii fuse rfcomm bnep nls_iso8859_1 nls_cp437 vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm hp_wmi irqbypass iTCO_wdt mousedev sparse_keymap iTCO_vendor_support ppdev joydev mei_wdt btusb crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel btrtl pcbc btbcm aesni_intel btintel bluetooth evdev input_leds aes_x86_64 uvcvideo crypto_simd glue_helper cryptd videobuf2_vmalloc intel_cstate intel_rapl_perf videobuf2_memops ecdh_generic pcspkr videobuf2_v4l2 videobuf2_core videodev media psmouse mac_hid hid_generic arc4 nouveau iwldvm mac80211 mxm_wmi iwlwifi ttm cfg80211 drm_kms_helper drm rfkill syscopyarea sysfillrect snd_hda_codec_hdmi sysimgblt snd_hda_codec_idt
[ 2714.312788] snd_hda_codec_generic fb_sys_fops snd_hda_intel i2c_algo_bit snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore parport_pc thermal tpm_infineon hp_accel ac e1000e lis3lv02d parport wmi mei_me mei input_polldev battery video ptp pps_core button lpc_ich shpchp tpm_tis tpm_tis_core tpm sch_fq_codel vboxnetflt(O) vboxnetadp(O) pci_stub vboxpci(O) vboxdrv(O) sg ip_tables x_tables ext4 crc16 jbd2 fscrypto mbcache sr_mod cdrom sd_mod usbhid hid serio_raw atkbd libps2 ahci libahci libata scsi_mod firewire_ohci xhci_pci xhci_hcd sdhci_pci sdhci ehci_pci led_class ehci_hcd firewire_core mmc_core crc_itu_t usbcore usb_common i8042 serio [last unloaded: main]
[ 2714.318036] CPU: 3 PID: 6244 Comm: insmod Tainted: G D O 4.12.4-1-ARCH #1
[ 2714.319130] Hardware name: Hewlett-Packard HP EliteBook 8560w/1631, BIOS 68SVD Ver. F.60 03/12/2015
[ 2714.320246] task: ffff917ea5819c80 task.stack: ffff9f90455c8000
[ 2714.321406] RIP: 0010:init+0x65/0x1000 [main3]
[ 2714.322524] RSP: 0018:ffff9f90455cbc90 EFLAGS: 00010286
[ 2714.323684] RAX: 00007fffbd922e30 RBX: 0000000000000000 RCX: ffffffff8ba55a68
[ 2714.324813] RDX: 00007fffbd922e28 RSI: 0000000000000000 RDI: ffffffffc0f08031
[ 2714.325937] RBP: ffff9f90455cbc90 R08: 000000000000044b R09: ffffffff8bca0940
[ 2714.327040] R10: fffff2d0c2ecac40 R11: 0000000000000000 R12: ffffffffc0498000
[ 2714.328142] R13: ffff917e6ee4c480 R14: ffffffffc0f09050 R15: ffff917f06e24660
[ 2714.329279] FS: 00007f8999054b80(0000) GS:ffff917f3dcc0000(0000) knlGS:0000000000000000
[ 2714.330389] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2714.331516] CR2: 00007fffbd922e28 CR3: 000000010150f000 CR4: 00000000000406e0
[ 2714.332652] Call Trace:
[ 2714.333810] do_one_initcall+0x50/0x190
[ 2714.334945] ? do_init_module+0x27/0x1e6
[ 2714.336065] do_init_module+0x5f/0x1e6
[ 2714.337197] load_module+0x2610/0x2ab0
[ 2714.338349] ? vfs_read+0x115/0x130
[ 2714.339469] SYSC_finit_module+0xf6/0x110
[ 2714.340574] ? SYSC_finit_module+0xf6/0x110
[ 2714.341675] SyS_finit_module+0xe/0x10
[ 2714.342782] entry_SYSCALL_64_fastpath+0x1a/0xa5
[ 2714.343899] RIP: 0033:0x7f8998765bb9
[ 2714.344988] RSP: 002b:00007ffc068a0528 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
[ 2714.346111] RAX: ffffffffffffffda RBX: 00007f8998a26aa0 RCX: 00007f8998765bb9
[ 2714.347203] RDX: 0000000000000000 RSI: 000000000041aada RDI: 0000000000000003
[ 2714.348279] RBP: 00007f8998a26af8 R08: 0000000000000000 R09: 00007f8998a28e40
[ 2714.349368] R10: 0000000000000003 R11: 0000000000000206 R12: 0000000000001020
[ 2714.350441] R13: 0000000000001018 R14: 00007f8998a26af8 R15: 0000000000000001
[ 2714.351524] Code: 48 89 15 07 13 a7 00 e8 8d 11 cf ca 48 8b 05 fb 12 a7 00 8b 35 ed 12 a7 00 48 c7 c7 31 80 f0 c0 48 8d 50 f8 48 89 15 eb 12 a7 00 <48> 8b 50 f8 e8 65 11 cf ca 48 c7 c7 45 80 f0 c0 e8 59 11 cf ca
[ 2714.353786] RIP: init+0x65/0x1000 [main3] RSP: ffff9f90455cbc90
[ 2714.354914] CR2: 00007fffbd922e28
[ 2714.356130] ---[ end trace a150fd8aba7bd1e3 ]---
Why it doesn't work? Are stacks have any special memory protection or something wrong am I doing?
You cannot directly access userspace memory from the kernel space. In order to do so, you need to use the kernel provided API function copy_from_user().
Modifying your code to make use of this API should work:
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/pid.h>
#include <linux/proc_fs.h>
#include <linux/uaccess.h> // Header file corresponding to copy_from_user()
struct task_struct *ed_task;
long int *valp; // Pointer to destination kernel memory
long int val; // Destination kernel memory
void *stack_p;
int i = 0;
static int __init init(void)
{
ed_task = pid_task(find_vpid(2298), PIDTYPE_PID);
stack_p = (void *) ed_task->mm->start_stack;
printk("stack %d: %p\n", i, stack_p);
valp = stack_p - sizeof(void *);
copy_from_user(&val, valp, sizeof(val));
printk("stack %d value: %ld", i, val);
printk(">");
return 0;
}
static void __exit modex(void)
{
}
module_init(init);
module_exit(modex);
MODULE_LICENSE("GPL");
I write a kernel module which replaces syscall and have a problem. Module can't be loaded because is some problem in memory. I tried fix it for 3 hours, but it still not work. This code is working, when I choose memory closer sys_call_table (eg. linux_banner address from /proc/kallsyms), but it isn't always works.
Problem is usually, when function which search syscall table points to address which end is 18 (eg ffffffff91000018, ffffffff81000018).
Why it does not work?
Code:
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/syscalls.h>
#include <linux/list.h>
#include <linux/unistd.h>
#include <linux/kobject.h>
#include <linux/init.h>
/* start of 64-bit kernel space is 0xffffffff80000000 */
#define END_MEM 0xffffffffffffffff /* end of 64-bit kernel */
#define START_MEM 0xffffffff81000000
unsigned long long **syscall_tab;
asmlinkage long (*orig_mkdir)(const char __user *pathname, umode_t mode);
asmlinkage long my_mkdir(const char __user *pathname, umode_t mode)
{
long ret;
ret = orig_mkdir(pathname, mode);
printk("Creating dir: %s", pathname);
return ret;
}
static void hide(void)
{
list_del(&THIS_MODULE->list);
kobject_del(&THIS_MODULE->mkobj.kobj);
}
static unsigned long long **find(void) {
unsigned long long **sctable;
unsigned long long i = START_MEM;
while (i < END_MEM) {
sctable = (unsigned long long **) i;
if ( sctable[__NR_close] == (unsigned long long *) sys_close) {
printk("syscall_tab %lx", syscall_tab);
return &sctable[0];
}
i += sizeof(void *);
}
return NULL;
}
static int __init init(void)
{
write_cr0(read_cr0() & (~0x10000));
if(!(syscall_tab = find())) {
return 0;
}
orig_mkdir = (void *) syscall_tab[__NR_mkdir];
printk("write_cr0");
syscall_tab[__NR_mkdir] = (unsigned long long*) my_mkdir;
printk("po podmiance");
write_cr0(read_cr0() | (~0x10000));
return 0;
}
static void __exit exitt(void)
{
write_cr0(read_cr0() & (~0x10000));
syscall_tab[__NR_mkdir] = (unsigned long long*) orig_mkdir;
write_cr0(read_cr0() | (~0x10000));
}
module_init(init);
module_exit(exitt);
MODULE_LICENSE("GPL");
Error:
[ 299.273838] BUG: unable to handle kernel paging request at ffffffff91000018
[ 299.273856] IP: init+0x23/0x1000 [hijack1]
[ 299.273860] PGD b6a0c067
[ 299.273861] P4D b6a0c067
[ 299.273863] PUD b6a0d063
[ 299.273866] PMD 0
[ 299.273872] Oops: 0000 [#1] PREEMPT SMP
[ 299.273877] Modules linked in: hijack1(O+) fuse rfcomm bnep nls_iso8859_1 nls_cp437 vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel joydev ppdev hp_wmi mousedev iTCO_wdt aes_x86_64 sparse_keymap iTCO_vendor_support mei_wdt crypto_simd psmouse glue_helper pcspkr evdev input_leds cryptd mac_hid intel_cstate intel_rapl_perf uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core btusb btrtl btbcm btintel bluetooth cdc_ether ecdh_generic usbnet videodev uas media mii hid_generic nouveau mxm_wmi ttm arc4 drm_kms_helper iwldvm drm syscopyarea sysfillrect mac80211 sysimgblt iwlwifi fb_sys_fops parport_pc parport snd_hda_codec_hdmi i2c_algo_bit snd_hda_codec_idt cfg80211
[ 299.273953] rfkill snd_hda_codec_generic hp_accel thermal lis3lv02d wmi input_polldev tpm_infineon video ac battery button snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm shpchp snd_timer e1000e snd ptp soundcore tpm_tis mei_me mei pps_core lpc_ich tpm_tis_core tpm sch_fq_codel vboxnetflt(O) vboxnetadp(O) pci_stub vboxpci(O) vboxdrv(O) sg ip_tables x_tables ext4 crc16 jbd2 fscrypto mbcache sr_mod sd_mod cdrom usb_storage usbhid hid serio_raw atkbd libps2 ahci libahci libata scsi_mod xhci_pci xhci_hcd ehci_pci sdhci_pci ehci_hcd sdhci firewire_ohci led_class firewire_core mmc_core crc_itu_t usbcore usb_common i8042 serio
[ 299.274005] CPU: 2 PID: 3384 Comm: insmod Tainted: G O 4.12.4-1-ARCH #1
[ 299.274009] Hardware name: Hewlett-Packard HP EliteBook 8560w/1631, BIOS 68SVD Ver. F.60 03/12/2015
[ 299.274014] task: ffff90127cc0c740 task.stack: ffffb72907298000
[ 299.274019] RIP: 0010:init+0x23/0x1000 [hijack1]
[ 299.274023] RSP: 0018:ffffb7290729bc88 EFLAGS: 00010206
[ 299.274027] RAX: 0000000080040033 RBX: ffffffff91000000 RCX: 0000000000000000
[ 299.274031] RDX: 00000000004bec82 RSI: 00000000004bec82 RDI: 0000000080040033
[ 299.274036] RBP: ffffb7290729bc90 R08: ffff901339003980 R09: ffffffffa018970a
[ 299.274040] R10: ffffe481c211ebc0 R11: 0000000000000000 R12: ffffffffc0030000
[ 299.274044] R13: ffff9012377965e0 R14: ffffffffc0a81050 R15: ffff90132e0eca80
[ 299.274049] FS: 00007f9a842a4b80(0000) GS:ffff90133dc80000(0000) knlGS:0000000000000000
[ 299.274053] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080040033
[ 299.274057] CR2: ffffffff91000018 CR3: 000000007cdb9000 CR4: 00000000000406e0
[ 299.274061] Call Trace:
[ 299.274068] do_one_initcall+0x50/0x190
[ 299.274073] ? do_init_module+0x27/0x1e6
[ 299.274077] do_init_module+0x5f/0x1e6
[ 299.274082] load_module+0x2610/0x2ab0
[ 299.274087] ? vfs_read+0x115/0x130
[ 299.274091] SYSC_finit_module+0xf6/0x110
[ 299.274095] ? SYSC_finit_module+0xf6/0x110
[ 299.274100] SyS_finit_module+0xe/0x10
[ 299.274105] entry_SYSCALL_64_fastpath+0x1a/0xa5
[ 299.274109] RIP: 0033:0x7f9a839b3bb9
[ 299.274111] RSP: 002b:00007ffd2386ee28 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
[ 299.274120] RAX: ffffffffffffffda RBX: 00007f9a83c74aa0 RCX: 00007f9a839b3bb9
[ 299.274124] RDX: 0000000000000000 RSI: 000000000041aada RDI: 0000000000000003
[ 299.274128] RBP: 00007f9a83c74af8 R08: 0000000000000000 R09: 00007f9a83c76e40
[ 299.274132] R10: 0000000000000003 R11: 0000000000000206 R12: 0000000000001020
[ 299.274136] R13: 0000000000001018 R14: 00007f9a83c74af8 R15: 0000000000000001
[ 299.274141] Code: <48> 81 7b 18 40 a8 21 a0 75 2d 48 8b 35 14 13 a5 00 48 c7 c7 35 00
[ 299.276347] RIP: init+0x23/0x1000 [hijack1] RSP: ffffb7290729bc88
[ 299.277333] CR2: ffffffff91000018
[ 299.283408] ---[ end trace 63ac9e1e3a0e12c3 ]---
Syscall hijacking x64- unable to handle kernel paging request at ffffffff91000018...
I write kernel module which replace syscall and have a problem. Module can't be loaded because is some problem in memory. I tried fix it for 3 hours, but it still not work...
The problem is, hijacking syscalls is not technically feasible. You can't do it with Linux. Linux does not have a layered design that supports this sort of thing (as opposed to Windows or other operating systems).
About the best you will be able to do is interpositioning, which redirects calls made through the PLT into your shared object. I believe this is the way Valgrind works when it replaces malloc and free.
Also note that some system calls are not routed through the PLT. See the discussion of Double-underscore names for public API functions on the glibc wiki.
Also see Query regarding kernel modules intercepting system call on the Kernel Newbies mailing list and Multiple kernel modules intercepting same system call and crash during unload on Stack Overflow. The first one is the question where the kernel developers tell OP is not possible. I'm just reiterating what the kernel dev's have already stated.