I made a plugin for Nagios/Icinga that parses networking device logs for strings, but it's causing kernel panics in large environments. The full code can be found here. I've tried reinstalling the kernel and some packages, but the problem persists. The plugin also runs fine on another server, but that server monitors fewer hosts. How can I troubleshoot the Oops so that I can fix the code or repair the server?
The script uses Net::OpenSSH to connect to different networking devices and run "sh log". An example excerpt:
my $cisco_cmd = 'sh log ';
# SSH
if ($socket) {
    SSH();
    # Cisco SSH command
    my $ssh_session = $ssh->system({stdout_fh => $stdout_fh}, $cisco_cmd);
}
sub SSH {
    $ssh = Net::OpenSSH->new($host,
        user             => $username,
        password         => $password,
        timeout          => 30,
        master_stdout_fh => $stdout_fh,
        master_stderr_fh => $stdout_fh,
        master_opts      => [-o => "KexAlgorithms=+diffie-hellman-group1-sha1",
                             -o => "HostKeyAlgorithms=+ssh-dss",
                             -o => "StrictHostKeyChecking no"]);
    if ($ssh->error) {
        print "Unknown - Unable to connect to remote host: " . $ssh->error . "\n";
        exit 3;
    }
    return $ssh;
}
Over time, syslog starts logging Oopses, and eventually the server goes into a kernel panic.
The Oops is:
Oct 6 14:11:43 icinga1 kernel: [359392.196625] BUG: unable to handle kernel NULL pointer dereference at (null)
Oct 6 14:11:43 icinga1 kernel: [359392.196632] IP: [<ffffffff814fa7c5>] tty_ioctl+0x375/0xc40
Oct 6 14:11:43 icinga1 kernel: [359392.196640] PGD 0
Oct 6 14:11:43 icinga1 kernel: [359392.196642] Oops: 0000 [#1] SMP
Oct 6 14:11:43 icinga1 kernel: [359392.196645] Modules linked in: vmw_vsock_vmci_transport vsock coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd vmw_balloon input_leds joydev serio_raw shpchp i2c_piix4 vmw_vmci 8250_fintek mac_hid sunrpc parport_pc ppdev lp parport autofs4 vmw_pvscsi vmwgfx ttm drm_kms_helper syscopyarea psmouse sysfillrect sysimgblt mptspi fb_sys_fops mptscsih drm mptbase ahci vmxnet3 libahci scsi_transport_spi pata_acpi floppy fjes
Oct 6 14:11:43 icinga1 kernel: [359392.196672] CPU: 1 PID: 21451 Comm: ssh Not tainted 4.4.0-96-generic #119-Ubuntu
Oct 6 14:11:43 icinga1 kernel: [359392.196684] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
Oct 6 14:11:43 icinga1 kernel: [359392.196686] task: ffff88042d008000 ti: ffff88033194c000 task.ti: ffff88033194c000
Oct 6 14:11:43 icinga1 kernel: [359392.196688] RIP: 0010:[<ffffffff814fa7c5>] [<ffffffff814fa7c5>] tty_ioctl+0x375/0xc40
Oct 6 14:11:43 icinga1 kernel: [359392.196691] RSP: 0018:ffff88033194fdf0 EFLAGS: 00010246
Oct 6 14:11:43 icinga1 kernel: [359392.196692] RAX: 0000000000000000 RBX: ffff8803c8aec800 RCX: fffffffeffffffff
Oct 6 14:11:43 icinga1 kernel: [359392.196693] RDX: fffffffe00000001 RSI: 0000000000000000 RDI: ffff8803c8aec828
Oct 6 14:11:43 icinga1 kernel: [359392.196694] RBP: ffff88033194fe98 R08: 0000563d7cc698e0 R09: 0000563d7c0cc880
Oct 6 14:11:43 icinga1 kernel: [359392.196695] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000005401
Oct 6 14:11:43 icinga1 kernel: [359392.196697] R13: 00007ffef42a03b0 R14: ffff8803d1d9c500 R15: 0000000000000000
Oct 6 14:11:43 icinga1 kernel: [359392.196699] FS: 00002b64e6d8f000(0000) GS:ffff88043fc40000(0000) knlGS:0000000000000000
Oct 6 14:11:43 icinga1 kernel: [359392.196700] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 6 14:11:43 icinga1 kernel: [359392.196701] CR2: 0000000000000000 CR3: 00000000bac36000 CR4: 00000000000406e0
Oct 6 14:11:43 icinga1 kernel: [359392.196784] Stack:
Oct 6 14:11:43 icinga1 kernel: [359392.196786] ffff8802bf401600 ffff88033194fe20 ffffffff812276f2 ffff8802bf401600
Oct 6 14:11:43 icinga1 kernel: [359392.196788] ffff8802bf401658 ffff8802bf401600 ffff88033194fe58 ffffffff8122795e
Oct 6 14:11:43 icinga1 kernel: [359392.196790] ffff880428845000 0000000000000008 ffff8802cc420cb0 ffff88042d5d2920
Oct 6 14:11:43 icinga1 kernel: [359392.196792] Call Trace:
Oct 6 14:11:43 icinga1 kernel: [359392.196800] [<ffffffff812276f2>] ? __dentry_kill+0x162/0x1e0
Oct 6 14:11:43 icinga1 kernel: [359392.196802] [<ffffffff8122795e>] ? dput+0x1ee/0x220
Oct 6 14:11:43 icinga1 kernel: [359392.196818] [<ffffffff81231224>] ? mntput+0x24/0x40
Oct 6 14:11:43 icinga1 kernel: [359392.196822] [<ffffffff81211f00>] ? __fput+0x190/0x220
Oct 6 14:11:43 icinga1 kernel: [359392.196824] [<ffffffff81223faf>] do_vfs_ioctl+0x29f/0x490
Oct 6 14:11:43 icinga1 kernel: [359392.196826] [<ffffffff81211fce>] ? ____fput+0xe/0x10
Oct 6 14:11:43 icinga1 kernel: [359392.196830] [<ffffffff8109f116>] ? task_work_run+0x86/0xa0
Oct 6 14:11:43 icinga1 kernel: [359392.196832] [<ffffffff81224219>] SyS_ioctl+0x79/0x90
Oct 6 14:11:43 icinga1 kernel: [359392.196836] [<ffffffff81843272>] entry_SYSCALL_64_fastpath+0x16/0x71
Oct 6 14:11:43 icinga1 kernel: [359392.196838] Code: 18 48 8b 41 60 48 85 c0 74 16 4c 89 ea 44 89 e6 48 89 df ff d0 3d fd fd ff ff 0f 85 da fd ff ff 48 89 df e8 3e 74 00 00 49 89 c7 <48> 8b 00 4c 8b 40 48 48 c7 c0 ea ff ff ff 4d 85 c0 74 22 44 89
Oct 6 14:11:43 icinga1 kernel: [359392.196859] RIP [<ffffffff814fa7c5>] tty_ioctl+0x375/0xc40
Oct 6 14:11:43 icinga1 kernel: [359392.196861] RSP <ffff88033194fdf0>
Oct 6 14:11:43 icinga1 kernel: [359392.196862] CR2: 0000000000000000
Oct 6 14:11:43 icinga1 kernel: [359392.196865] ---[ end trace 72b7f0a8e26ab854 ]---
The bug report in the log says the fault is in tty_ioctl, so it is triggered when ioctl is called on a tty file descriptor. Net::OpenSSH uses a pseudo-tty when doing password authentication, so a possible workaround is to switch to an authentication mechanism that does not require a tty, such as public key authentication.
Also, the backtrace shows that the crash happens when manipulating the file system, so maybe your real problem is a corrupted file system triggering some kernel bug. You could try forcing a fsck on the machine's file systems, or simply recreate them.
You didn't say which Linux distribution or kernel version you are using. You could try switching to a different one.
In any case, this is the first time anybody has reported this problem, and Net::OpenSSH is frequently used with password authentication, so unless you are using a pretty rare kernel version, there must be something special about your system that causes this bug to show up.
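To check whether every Oops in syslog hits the same function from the same process, the relevant fields can be pulled out of the log. The sketch below is only an illustration: the function name oops_fields is made up, and the patterns assume the Oops layout shown in the question (an "IP: [<addr>] symbol" line followed later by a "Comm:" line). Other kernel versions print these lines differently.

```shell
# Hypothetical helper: reads syslog lines on stdin and prints
# "<process> <faulting symbol>" once per Oops.
# Assumes the Oops layout from the question's log.
oops_fields() {
    awk '
        # "IP: [<ffffffff814fa7c5>] tty_ioctl+0x375/0xc40": symbol is the last field
        / IP: \[</ { sym = $NF }
        # "CPU: 1 PID: 21451 Comm: ssh ...": process name follows "Comm:"
        / Comm: / && sym != "" {
            for (i = 1; i < NF; i++)
                if ($i == "Comm:") print $(i + 1), sym
            sym = ""
        }
    '
}

# Two lines copied from the Oops in the question
printf '%s\n' \
    'Oct 6 14:11:43 icinga1 kernel: [359392.196632] IP: [<ffffffff814fa7c5>] tty_ioctl+0x375/0xc40' \
    'Oct 6 14:11:43 icinga1 kernel: [359392.196672] CPU: 1 PID: 21451 Comm: ssh Not tainted 4.4.0-96-generic #119-Ubuntu' |
    oops_fields
```

Feeding it the real log, for example with grep 'kernel:' /var/log/syslog | oops_fields | sort | uniq -c, would count Oopses per process and faulting symbol.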
Related
After some time, the Python code that I run in an IPython shell freezes. I have found entries in the syslog file that I will need to investigate.
By freezing I mean:
the process in the shell stops
the shell is not responding to keystrokes
Does anyone have any advice or suggestions about what might be going on and where I should look? I understand it is a vague question, but maybe we can get to the bottom of the problem. Thank you!
Configuration of my system.
$ ipython3 --version
7.14.0
$ python3 --version
Python 3.6.9
$ ipython3
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.14.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: exit
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
I have looked at syslog and found this.
Aug 12 20:52:57 linux-box kernel: [3135203.066680] python3: page allocation failure: order:4, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
Aug 12 20:52:57 linux-box kernel: [3135203.066681] python3 cpuset=/ mems_allowed=0
Aug 12 20:52:57 linux-box kernel: [3135203.066684] CPU: 11 PID: 1241 Comm: python3 Not tainted 4.15.0-109-generic #110-Ubuntu
Aug 12 20:52:57 linux-box kernel: [3135203.066685] Hardware name: LENOVO 30C50045US/3138, BIOS M1VKT1BA 08/17/2018
Aug 12 20:52:57 linux-box kernel: [3135203.066686] Call Trace:
Aug 12 20:52:57 linux-box kernel: [3135203.066690] dump_stack+0x6d/0x8e
Aug 12 20:52:57 linux-box kernel: [3135203.066693] warn_alloc+0xff/0x1a0
Aug 12 20:52:57 linux-box kernel: [3135203.066694] ? __alloc_pages_direct_compact+0x51/0x100
Aug 12 20:52:57 linux-box kernel: [3135203.066695] __alloc_pages_slowpath+0xdc5/0xe00
Aug 12 20:52:57 linux-box kernel: [3135203.066697] __alloc_pages_nodemask+0x29a/0x2c0
Aug 12 20:52:57 linux-box kernel: [3135203.066699] alloc_pages_current+0x6a/0xe0
Aug 12 20:52:57 linux-box kernel: [3135203.066701] kmalloc_order+0x18/0x40
Aug 12 20:52:57 linux-box kernel: [3135203.066702] kmalloc_order_trace+0x24/0xb0
Aug 12 20:52:57 linux-box kernel: [3135203.066703] __kmalloc+0x1fe/0x210
Aug 12 20:52:57 linux-box kernel: [3135203.066706] proc_do_submiturb+0x4a3/0xd90
Aug 12 20:52:57 linux-box kernel: [3135203.066707] usbdev_do_ioctl+0xa38/0x1170
Aug 12 20:52:57 linux-box kernel: [3135203.066709] usbdev_ioctl+0xe/0x20
Aug 12 20:52:57 linux-box kernel: [3135203.066711] do_vfs_ioctl+0xa8/0x630
Aug 12 20:52:57 linux-box kernel: [3135203.066712] ? SyS_futex+0x13b/0x180
Aug 12 20:52:57 linux-box kernel: [3135203.066713] SyS_ioctl+0x79/0x90
Aug 12 20:52:57 linux-box kernel: [3135203.066714] do_syscall_64+0x73/0x130
Aug 12 20:52:57 linux-box kernel: [3135203.066716] entry_SYSCALL_64_after_hwframe+0x41/0xa6
Aug 12 20:52:57 linux-box kernel: [3135203.066717] RIP: 0033:0x7f8abc0b56d7
Aug 12 20:52:57 linux-box kernel: [3135203.066718] RSP: 002b:00007f89ca386a98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 12 20:52:57 linux-box kernel: [3135203.066719] RAX: ffffffffffffffda RBX: 0000000002057f90 RCX: 00007f8abc0b56d7
Aug 12 20:52:57 linux-box kernel: [3135203.066719] RDX: 00007f8a2c0077c0 RSI: 000000008038550a RDI: 0000000000000022
Aug 12 20:52:57 linux-box kernel: [3135203.066720] RBP: 00007f8a2c006e40 R08: 00007f8a2c0077c0 R09: 00007f8a2c008430
Aug 12 20:52:57 linux-box kernel: [3135203.066720] R10: 00000000ffffff00 R11: 0000000000000246 R12: 0000000000000000
Aug 12 20:52:57 linux-box kernel: [3135203.066721] R13: 00007f8a2c0077c0 R14: 0000000000000000 R15: 0000000000000000
Aug 12 20:52:57 linux-box kernel: [3135203.066722] warn_alloc_show_mem: 1 callbacks suppressed
Aug 12 20:52:57 linux-box kernel: [3135203.066722] Mem-Info:
Aug 12 20:52:57 linux-box kernel: [3135203.066724] active_anon:7072912 inactive_anon:510142 isolated_anon:0
Aug 12 20:52:57 linux-box kernel: [3135203.066724] active_file:7178154 inactive_file:1236198 isolated_file:0
Aug 12 20:52:57 linux-box kernel: [3135203.066724] unevictable:12 dirty:582646 writeback:328149 unstable:6144
Aug 12 20:52:57 linux-box kernel: [3135203.066724] slab_reclaimable:204738 slab_unreclaimable:31575
Aug 12 20:52:57 linux-box kernel: [3135203.066724] mapped:16191 shmem:10145 pagetables:27672 bounce:0
Aug 12 20:52:57 linux-box kernel: [3135203.066724] free:84911 free_pcp:15 free_cma:0
Aug 12 20:52:57 linux-box kernel: [3135203.066726] Node 0 active_anon:28291648kB inactive_anon:2040568kB active_file:28712940kB inactive_file:4944792kB u$
Aug 12 20:52:57 linux-box kernel: [3135203.066726] Node 0 DMA free:15888kB min:16kB low:28kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB $
Aug 12 20:52:57 linux-box kernel: [3135203.066728] lowmem_reserve[]: 0 2326 64192 64192 64192
Aug 12 20:52:57 linux-box kernel: [3135203.066729] Node 0 DMA32 free:249952kB min:2448kB low:4828kB high:7208kB active_anon:1548208kB inactive_anon:52kB $
Aug 12 20:52:57 linux-box kernel: [3135203.066731] lowmem_reserve[]: 0 0 61865 61865 61865
Aug 12 20:52:57 linux-box kernel: [3135203.066732] Node 0 Normal free:73804kB min:65116kB low:128464kB high:191812kB active_anon:26743080kB inactive_anon$
Aug 12 20:52:57 linux-box kernel: [3135203.066733] lowmem_reserve[]: 0 0 0 0 0
Aug 12 20:52:57 linux-box kernel: [3135203.066734] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*20$
Aug 12 20:52:57 linux-box kernel: [3135203.066737] Node 0 DMA32: 54*4kB (UME) 74*8kB (UME) 280*16kB (UME) 225*32kB (UME) 319*64kB (UME) 276*128kB (UME) 2$
Aug 12 20:52:57 linux-box kernel: [3135203.066740] Node 0 Normal: 6*4kB (UH) 233*8kB (UMH) 2952*16kB (UMEH) 723*32kB (UEH) 1*64kB (H) 3*128kB (H) 3*256kB$
Aug 12 20:52:57 linux-box kernel: [3135203.066744] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 12 20:52:57 linux-box kernel: [3135203.066744] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 12 20:52:57 linux-box kernel: [3135203.066744] 8446758 total pagecache pages
Aug 12 20:52:57 linux-box kernel: [3135203.066753] 21866 pages in swap cache
Aug 12 20:52:57 linux-box kernel: [3135203.066754] Swap cache stats: add 268953271, delete 268898216, find 41931219/53685984
Aug 12 20:52:57 linux-box kernel: [3135203.066754] Free swap = 65004764kB
Aug 12 20:52:57 linux-box kernel: [3135203.066754] Total swap = 67108860kB
Aug 12 20:52:57 linux-box kernel: [3135203.066755] 16733480 pages RAM
Aug 12 20:52:57 linux-box kernel: [3135203.066755] 0 pages HighMem/MovableOnly
Aug 12 20:52:57 linux-box kernel: [3135203.066755] 284837 pages reserved
Aug 12 20:52:57 linux-box kernel: [3135203.066755] 0 pages cma reserved
Aug 12 20:52:57 linux-box kernel: [3135203.066756] 0 pages hwpoisoned
A few words about what I do, which may or may not be useful.
I have code that communicates with a camera, reads images, and saves them. Each Python process has two threads: one reads images from the camera and one saves them to a file.
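For reference, the order:4 in the first syslog line gives the size of the failed request: the page allocator hands out blocks of 2^order physically contiguous pages. A small sketch of the arithmetic, assuming the usual 4 kB page size on x86-64 (the helper name order_to_kb is made up for illustration):

```shell
# Hypothetical helper: size in kB of a 2^order-page allocation.
# $1 = order from the log, $2 = page size in kB (assumed 4 if omitted)
order_to_kb() {
    order=$1
    page_kb=${2:-4}
    echo $(( (1 << order) * page_kb ))
}

order_to_kb 4   # order:4 from the page allocation failure above
```

Under that assumption, the request that failed in the trace (a __kmalloc call made from proc_do_submiturb on the usbdev ioctl path) was for 64 kB of contiguous memory.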
I've got a headless ARM-based Linux (v3.10.53-1.1.1) system with no swap space enabled, and I occasionally see processes get killed by the OOM-killer even though there is plenty of RAM available.
Running echo 1 > /proc/sys/vm/compact_memory periodically seems to keep the OOM-killer at bay, which makes me think that memory fragmentation is the culprit, but I don't understand why a user process would ever need physically-contiguous blocks anyway; as I understand it, even in the worst-case scenario (complete fragmentation, with only individual 4K blocks available), the kernel could simply allocate the necessary number of individual 4K blocks and then use the Magic of Virtual Memory (tm) to make them look contiguous to the user-process.
Can someone explain why the OOM-killer would be invoked in response to memory fragmentation? Is it just a buggy kernel or is there a genuine reason? (And even if the kernel did need to de-frag memory in order to satisfy a request, shouldn't it do that automatically rather than giving up and OOM'ing?)
I've pasted an example OOM-killer invocation below, in case it sheds any light on things. I can reproduce the fault at will; this invocation occurred while the computer still had ~120MB of RAM available (according to free), in response to my test-program allocating memory, 10000 400-byte allocations at a time.
May 28 18:51:34 g2 user.warn kernel: [ 4228.307769] cored invoked oom-killer: gfp_mask=0x2084d0, order=0, oom_score_adj=0
May 28 18:51:35 g2 user.warn kernel: [ 4228.315295] CPU: 2 PID: 19687 Comm: cored Tainted: G O 3.10.53-1.1.1_ga+gf57416a #1
May 28 18:51:35 g2 user.warn kernel: [ 4228.323843] Backtrace:
May 28 18:51:35 g2 user.warn kernel: [ 4228.326340] [<c0011c54>] (dump_backtrace+0x0/0x10c) from [<c0011e68>] (show_stack+0x18/0x1c)
May 28 18:51:35 g2 user.warn kernel: [ 4228.334804] r6:00000000 r5:00000000 r4:c9784000 r3:00000000
May 28 18:51:35 g2 user.warn kernel: [ 4228.340566] [<c0011e50>] (show_stack+0x0/0x1c) from [<c04d0dd8>] (dump_stack+0x24/0x28)
May 28 18:51:35 g2 user.warn kernel: [ 4228.348684] [<c04d0db4>] (dump_stack+0x0/0x28) from [<c009b474>] (dump_header.isra.10+0x84/0x19c)
May 28 18:51:35 g2 user.warn kernel: [ 4228.357616] [<c009b3f0>] (dump_header.isra.10+0x0/0x19c) from [<c009ba3c>] (oom_kill_process+0x288/0x3f4)
May 28 18:51:35 g2 user.warn kernel: [ 4228.367230] [<c009b7b4>] (oom_kill_process+0x0/0x3f4) from [<c009bf8c>] (out_of_memory+0x208/0x2cc)
May 28 18:51:35 g2 user.warn kernel: [ 4228.376323] [<c009bd84>] (out_of_memory+0x0/0x2cc) from [<c00a0278>] (__alloc_pages_nodemask+0x8f8/0x910)
May 28 18:51:35 g2 user.warn kernel: [ 4228.385921] [<c009f980>] (__alloc_pages_nodemask+0x0/0x910) from [<c00b6c34>] (__pte_alloc+0x2c/0x158)
May 28 18:51:35 g2 user.warn kernel: [ 4228.395263] [<c00b6c08>] (__pte_alloc+0x0/0x158) from [<c00b9fe0>] (handle_mm_fault+0xd4/0xfc)
May 28 18:51:35 g2 user.warn kernel: [ 4228.403914] r6:c981a5d8 r5:cc421a40 r4:10400000 r3:10400000
May 28 18:51:35 g2 user.warn kernel: [ 4228.409689] [<c00b9f0c>] (handle_mm_fault+0x0/0xfc) from [<c0019a00>] (do_page_fault+0x174/0x3dc)
May 28 18:51:35 g2 user.warn kernel: [ 4228.418575] [<c001988c>] (do_page_fault+0x0/0x3dc) from [<c0019dc0>] (do_translation_fault+0xb4/0xb8)
May 28 18:51:35 g2 user.warn kernel: [ 4228.427824] [<c0019d0c>] (do_translation_fault+0x0/0xb8) from [<c00083ac>] (do_DataAbort+0x40/0xa0)
May 28 18:51:35 g2 user.warn kernel: [ 4228.436896] r6:c0019d0c r5:00000805 r4:c06a33d0 r3:103ffea8
May 28 18:51:35 g2 user.warn kernel: [ 4228.442643] [<c000836c>] (do_DataAbort+0x0/0xa0) from [<c000e138>] (__dabt_usr+0x38/0x40)
May 28 18:51:35 g2 user.warn kernel: [ 4228.450850] Exception stack(0xc9785fb0 to 0xc9785ff8)
May 28 18:51:35 g2 user.warn kernel: [ 4228.455918] 5fa0: 103ffea8 00000000 b6d56708 00000199
May 28 18:51:35 g2 user.warn kernel: [ 4228.464116] 5fc0: 00000001 b6d557c0 0001ffc8 b6d557f0 103ffea0 b6d55228 10400038 00000064
May 28 18:51:35 g2 user.warn kernel: [ 4228.472327] 5fe0: 0001ffc9 beb04990 00000199 b6c95d84 600f0010 ffffffff
May 28 18:51:35 g2 user.warn kernel: [ 4228.478952] r8:103ffea0 r7:b6d557f0 r6:ffffffff r5:600f0010 r4:b6c95d84
May 28 18:51:35 g2 user.warn kernel: [ 4228.485759] Mem-info:
May 28 18:51:35 g2 user.warn kernel: [ 4228.488038] DMA per-cpu:
May 28 18:51:35 g2 user.warn kernel: [ 4228.490589] CPU 0: hi: 90, btch: 15 usd: 5
May 28 18:51:35 g2 user.warn kernel: [ 4228.495389] CPU 1: hi: 90, btch: 15 usd: 13
May 28 18:51:35 g2 user.warn kernel: [ 4228.500205] CPU 2: hi: 90, btch: 15 usd: 17
May 28 18:51:35 g2 user.warn kernel: [ 4228.505003] CPU 3: hi: 90, btch: 15 usd: 65
May 28 18:51:35 g2 user.warn kernel: [ 4228.509823] active_anon:92679 inactive_anon:47 isolated_anon:0
May 28 18:51:35 g2 user.warn kernel: [ 4228.509823] active_file:162 inactive_file:1436 isolated_file:0
May 28 18:51:35 g2 user.warn kernel: [ 4228.509823] unevictable:0 dirty:0 writeback:0 unstable:0
May 28 18:51:35 g2 user.warn kernel: [ 4228.509823] free:28999 slab_reclaimable:841 slab_unreclaimable:2103
May 28 18:51:35 g2 user.warn kernel: [ 4228.509823] mapped:343 shmem:89 pagetables:573 bounce:0
May 28 18:51:35 g2 user.warn kernel: [ 4228.509823] free_cma:29019
May 28 18:51:35 g2 user.warn kernel: [ 4228.541416] DMA free:115636kB min:1996kB low:2492kB high:2992kB active_anon:370716kB inactive_anon:188kB active_file:752kB inactive_file:6040kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:524288kB managed:2
May 28 18:51:35 g2 user.warn kernel: [ 4228.583833] lowmem_reserve[]: 0 0 0 0
May 28 18:51:35 g2 user.warn kernel: [ 4228.587577] DMA: 2335*4kB (UMC) 1266*8kB (UMC) 1034*16kB (UMC) 835*32kB (UC) 444*64kB (C) 28*128kB (C) 103*256kB (C) 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB = 121100kB
May 28 18:51:35 g2 user.warn kernel: [ 4228.604979] 502 total pagecache pages
May 28 18:51:35 g2 user.warn kernel: [ 4228.608649] 0 pages in swap cache
May 28 18:51:35 g2 user.warn kernel: [ 4228.611979] Swap cache stats: add 0, delete 0, find 0/0
May 28 18:51:35 g2 user.warn kernel: [ 4228.617210] Free swap = 0kB
May 28 18:51:35 g2 user.warn kernel: [ 4228.620110] Total swap = 0kB
May 28 18:51:35 g2 user.warn kernel: [ 4228.635245] 131072 pages of RAM
May 28 18:51:35 g2 user.warn kernel: [ 4228.638394] 30575 free pages
May 28 18:51:35 g2 user.warn kernel: [ 4228.641293] 3081 reserved pages
May 28 18:51:35 g2 user.warn kernel: [ 4228.644437] 1708 slab pages
May 28 18:51:35 g2 user.warn kernel: [ 4228.647239] 265328 pages shared
May 28 18:51:35 g2 user.warn kernel: [ 4228.650399] 0 pages swap cached
May 28 18:51:35 g2 user.info kernel: [ 4228.653546] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
May 28 18:51:35 g2 user.info kernel: [ 4228.661408] [ 115] 0 115 761 128 5 0 -1000 udevd
May 28 18:51:35 g2 user.info kernel: [ 4228.669347] [ 237] 0 237 731 98 5 0 -1000 udevd
May 28 18:51:35 g2 user.info kernel: [ 4228.677278] [ 238] 0 238 731 100 5 0 -1000 udevd
May 28 18:51:35 g2 user.info kernel: [ 4228.685224] [ 581] 0 581 1134 78 5 0 -1000 sshd
May 28 18:51:35 g2 user.info kernel: [ 4228.693074] [ 592] 0 592 662 15 4 0 0 syslogd
May 28 18:51:35 g2 user.info kernel: [ 4228.701184] [ 595] 0 595 662 19 4 0 0 klogd
May 28 18:51:35 g2 user.info kernel: [ 4228.709113] [ 633] 0 633 6413 212 12 0 0 g2d
May 28 18:51:35 g2 user.info kernel: [ 4228.716877] [ 641] 0 641 663 16 3 0 0 getty
May 28 18:51:35 g2 user.info kernel: [ 4228.724827] [ 642] 0 642 663 16 5 0 0 getty
May 28 18:51:35 g2 user.info kernel: [ 4228.732770] [ 646] 0 646 6413 215 12 0 0 g2d
May 28 18:51:35 g2 user.info kernel: [ 4228.740540] [ 650] 0 650 10791 572 10 0 0 avbd
May 28 18:51:35 g2 user.info kernel: [ 4228.748385] [ 651] 0 651 9432 2365 21 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.756322] [ 652] 0 652 52971 4547 42 0 0 g2d
May 28 18:51:35 g2 user.info kernel: [ 4228.764104] [ 712] 0 712 14135 2458 24 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.772053] [ 746] 0 746 1380 248 6 0 0 dhclient
May 28 18:51:35 g2 user.info kernel: [ 4228.780251] [ 779] 0 779 9419 2383 21 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.788187] [ 780] 0 780 9350 2348 21 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.796127] [ 781] 0 781 9349 2347 21 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.804074] [ 782] 0 782 9353 2354 21 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.812012] [ 783] 0 783 18807 2573 27 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.819955] [ 784] 0 784 17103 3233 28 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.827882] [ 785] 0 785 13990 2436 24 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.835819] [ 786] 0 786 9349 2350 21 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.843764] [ 807] 0 807 13255 4125 25 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.851702] [ 1492] 999 1492 512 27 5 0 0 avahi-autoipd
May 28 18:51:35 g2 user.info kernel: [ 4228.860334] [ 1493] 0 1493 433 14 5 0 0 avahi-autoipd
May 28 18:51:35 g2 user.info kernel: [ 4228.868955] [ 1494] 0 1494 1380 246 7 0 0 dhclient
May 28 18:51:35 g2 user.info kernel: [ 4228.877163] [19170] 0 19170 1175 131 6 0 0 sshd
May 28 18:51:35 g2 user.info kernel: [ 4228.885022] [19183] 0 19183 750 70 4 0 0 sh
May 28 18:51:35 g2 user.info kernel: [ 4228.892701] [19228] 0 19228 663 16 5 0 0 watch
May 28 18:51:35 g2 user.info kernel: [ 4228.900636] [19301] 0 19301 1175 131 5 0 0 sshd
May 28 18:51:35 g2 user.info kernel: [ 4228.908475] [19315] 0 19315 751 69 5 0 0 sh
May 28 18:51:35 g2 user.info kernel: [ 4228.916154] [19365] 0 19365 663 16 5 0 0 watch
May 28 18:51:35 g2 user.info kernel: [ 4228.924099] [19443] 0 19443 1175 153 5 0 0 sshd
May 28 18:51:35 g2 user.info kernel: [ 4228.931948] [19449] 0 19449 750 70 5 0 0 sh
May 28 18:51:35 g2 user.info kernel: [ 4228.939626] [19487] 0 19487 1175 132 5 0 0 sshd
May 28 18:51:35 g2 user.info kernel: [ 4228.947467] [19500] 0 19500 750 70 3 0 0 sh
May 28 18:51:35 g2 user.info kernel: [ 4228.955148] [19540] 0 19540 662 17 5 0 0 tail
May 28 18:51:35 g2 user.info kernel: [ 4228.963002] [19687] 0 19687 63719 56396 127 0 0 cored
May 28 18:51:35 g2 user.err kernel: [ 4228.970936] Out of memory: Kill process 19687 (cored) score 428 or sacrifice child
May 28 18:51:35 g2 user.err kernel: [ 4228.978513] Killed process 19687 (cored) total-vm:254876kB, anon-rss:225560kB, file-rss:24kB
Also, here is the test program I use to stress the system and invoke the OOM-killer. With the echo 1 > /proc/sys/vm/compact_memory command being run every so often, the OOM-killer appears when free reports system RAM close to zero, as expected; without it, the OOM-killer appears well before that, when free reports 130+MB of RAM available but after cat /proc/buddyinfo shows the RAM becoming fragmented:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>  // for memset()
int main(int argc, char ** argv)
{
    while(1)
    {
        printf("PRESS RETURN TO ALLOCATE BUFFERS\n");
        const int numBytes = 400;
        char buf[64]; fgets(buf, sizeof(buf), stdin);
        for (int i=0; i<10000; i++)
        {
            void * ptr = malloc(numBytes); // yes, a deliberate memory leak
            if (ptr)
            {
                memset(ptr, 'J', numBytes); // force the virtual memory system to actually allocate the RAM, and not only the address space
            }
            else printf("malloc() failed!\n");
        }
        fprintf(stderr, "Deliberately leaked 10000*%i bytes!\n", numBytes);
    }
    return 0;
}
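The per-order free list in the OOM report above (the line starting "DMA: 2335*4kB") shows how the free memory is split across block sizes. As a rough sketch, the helper below adds up the count*sizekB terms of such a line, optionally counting only blocks of at least a given size. The name free_kb_in_blocks is made up for illustration and is not an existing tool.

```shell
# Hypothetical helper: sum the "count*sizekB" terms of one free-list line
# from an OOM report.
# $1 = the line, $2 = minimum block size in kB to count (default 0 = all)
free_kb_in_blocks() {
    line=$1
    min_kb=${2:-0}
    printf '%s\n' "$line" | tr ' ' '\n' |
        awk -F'[*k]' -v min="$min_kb" '
            # a term such as 2335*4kB splits into count=$1 and size=$2
            /^[0-9]+\*[0-9]+kB$/ && ($2 + 0) >= (min + 0) { total += $1 * $2 }
            END { print total + 0 }'
}

# The DMA line copied from the OOM report above
line='DMA: 2335*4kB (UMC) 1266*8kB (UMC) 1034*16kB (UMC) 835*32kB (UC) 444*64kB (C) 28*128kB (C) 103*256kB (C) 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB = 121100kB'
free_kb_in_blocks "$line"        # all block sizes
free_kb_in_blocks "$line" 512    # only blocks of 512 kB or larger
```

For that line the first call gives 121100, matching the "= 121100kB" total the kernel printed, and the second gives 0: at the time of the report there was no free block of 512 kB or larger.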
You are on the right track, Jeremy. The identical thing happened to me on my CentOS desktop system. I am a computer consultant, and I have worked with Linux since 1995. And I pound my Linux systems mercilessly with many file downloads and all sorts of other activities that stretch them to their limits. After my main desktop had been up for about 4 days, it got real slow (like slower than 1/10 of normal speed), the OOM killer kicked in, and I was sitting there wondering why my system was acting that way. It had plenty of RAM, but the OOM killer was kicking in when it had no business doing so. So I rebooted it, and it acted fine...for about 4 days, then the problem returned. Bugged the snot out of me not knowing why.
So I put on my test engineer hat and ran all sorts of stress tests on the machine to see if I could reproduce the symptoms on purpose. After several months of this, I was able to recreate the problem at will and prove that my solution for it would work every time.
"Cache turnover" in this context is when a system has to tear down existing cache to create more cache space to support new file writes. Since the system is in a hurry to redeploy the RAM, it does not take the time to defragment the memory it is freeing. So over time, as more and more file writes occur, the cache turns over repeatedly. And the memory in which it resides keeps getting more and more fragmented. In my tests, I found that after the disk cache has turned over about 15 times, the memory becomes so fragmented that the system cannot tear down and then allocate the memory fast enough to keep the OOM killer from being triggered due to lack of free RAM in the system when a spike in memory demand occurs. Such a spike could be caused by executing something as simple as
find /dev /etc /home /opt /tmp /usr -xdev > /dev/null
On my system, that command creates a demand for about 50MB of new cache. That is what
free -mt
shows, anyway.
The solution for this problem involves expanding on what you already discovered.
/bin/echo 3 > /proc/sys/vm/drop_caches
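# The kernel must be built with CONFIG_COMPACTION=y for compact_memory to exist; it is a build-time option, so exporting it as a shell variable has no effect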
echo 1 > /proc/sys/vm/compact_memory
And yes, I totally agree that dropping cache will force your system to re-read some data from disk. But at a rate of once per day or even once per hour, the negative effect of dropping cache is absolutely negligible compared to everything else your system is doing, no matter what that might be. The negative effect is so small that I cannot even measure it, and I made my living as a test engineer for 5+ years figuring out how to measure things like that.
If you set up a cron job to execute those once a day, that should eliminate your OOM killer problem. If you still see problems with the OOM killer after that, consider executing them more frequently. It will vary depending on how much file writing you do compared to the amount of system RAM your unit has.
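If you would rather have cron call a script than paste the commands into the crontab, something like this Python sketch could do it. The function name and the vm_dir parameter are invented here. The os.sync() call is an addition that is not in the commands above; the kernel documentation suggests syncing first so that more of the cache can be dropped. It has to run as root.

```python
import os


def drop_and_compact(vm_dir="/proc/sys/vm"):
    """Flush dirty pages, drop clean caches, then request memory compaction.

    Returns the names of the control files that were written.
    """
    written = []
    os.sync()  # write dirty pages out first so more of the cache can be dropped
    # Same values as the echo commands: 3 = page cache plus dentries and inodes
    for name, value in (("drop_caches", "3"), ("compact_memory", "1")):
        path = os.path.join(vm_dir, name)
        if not os.path.exists(path):
            continue  # e.g. compact_memory is absent without CONFIG_COMPACTION
        with open(path, "w") as f:
            f.write(value + "\n")
        written.append(name)
    return written
```

A script that ends by calling drop_and_compact() can then be scheduled once a day from root's crontab, and more often if the OOM killer still shows up.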
A Cassandra node went down due to OOM, and checking /var/log/messages I see the following.
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java cpuset=/ mems_allowed=0
....
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA32: 1294*4kB (UM) 932*8kB (UEM) 897*16kB (UEM) 483*32kB (UEM) 224*64kB (UEM) 114*128kB (UEM) 41*256kB (UEM) 12*512kB (UEM) 7*1024kB (UEM) 2*2048kB (EM) 35*4096kB (UM) = 242632kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 Normal: 5319*4kB (UE) 3233*8kB (UEM) 960*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 62500kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 38109 total pagecache pages
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages in swap cache
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Swap cache stats: add 0, delete 0, find 0/0
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Free swap = 0kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Total swap = 0kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 16394647 pages RAM
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages HighMem/MovableOnly
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 310559 pages reserved
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2634] 0 2634 41614 326 82 0 0 systemd-journal
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2690] 0 2690 29793 541 27 0 0 lvmetad
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2710] 0 2710 11892 762 25 0 -1000 systemd-udevd
.....
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [13774] 0 13774 459778 97729 429 0 0 Scan Factory
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14506] 0 14506 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14586] 0 14586 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14588] 0 14588 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14589] 0 14589 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14598] 0 14598 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14599] 0 14599 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14600] 0 14600 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14601] 0 14601 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19679] 0 19679 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19680] 0 19680 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 9084] 1007 9084 2822449 260291 810 0 0 java
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 8509] 1007 8509 17223585 14908485 32510 0 0 java
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21877] 0 21877 461828 97716 318 0 0 ScanAction Mgr
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21884] 0 21884 496653 98605 340 0 0 OAS Manager
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [31718] 89 31718 25474 486 48 0 0 pickup
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4891] 1007 4891 26999 191 9 0 0 iostat
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4957] 1007 4957 26999 192 10 0 0 iostat
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Out of memory: Kill process 8509 (java) score 928 or sacrifice child
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Killed process 8509 (java) total-vm:68894340kB, anon-rss:59496344kB, file-rss:137596kB, shmem-rss:0kB
Nothing else runs on this host except DSE Cassandra with search and monitoring agents. Max heap size is set to 31g; the Cassandra java process seems to have been using ~57 GB (RAM is 62 GB) at the time of the error.
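As a sanity check on those figures: the rss and total_vm columns in the process table are counted in pages, while the "Killed process" line reports kB. A small Python sketch of the conversion, assuming 4 kB pages (the helper names are made up; the numbers are copied from the log above):

```python
PAGE_KB = 4  # assumption: 4 kB pages, the usual size on x86_64


def pages_to_kb(pages):
    return pages * PAGE_KB


def kb_to_gib(kb):
    return kb / (1024 * 1024)


# Values copied from the oom-killer output above
java_rss_pages = 14908485                      # rss column for pid 8509
anon_rss_kb, file_rss_kb = 59496344, 137596    # from the "Killed process" line
total_ram_pages = 16394647                     # "pages RAM"

java_rss_kb = pages_to_kb(java_rss_pages)
print(java_rss_kb == anon_rss_kb + file_rss_kb)            # True: both views agree
print(round(kb_to_gib(java_rss_kb), 2))                    # 56.87
print(round(kb_to_gib(pages_to_kb(total_ram_pages)), 2))   # 62.54
```

So the killed java process had roughly 56.9 GiB resident out of about 62.5 GiB of RAM, with no swap configured.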
So I am guessing the JVM started using a lot of memory and triggered the OOM error.
Is my understanding correct?
That is, did Linux kill the JVM because it was consuming more memory than was available?
So in this case the JVM was using its maximum of 31g of heap, and the remaining 26 GB it was using is non-heap memory. Normally this process takes around 42g, and since it was consuming 57g at the moment of the OOM, I suspect the java process is the culprit rather than the victim.
No heap dump was taken at the time of the issue; I have configured it now. But even if a heap dump had been taken, would it have helped figure out what was consuming the extra memory? A heap dump only covers the heap area, so what should be used to dump non-heap memory? Native Memory Tracking is one thing I came across.
Is there any way to have native memory dumped when an OOM occurs?
What's the best way to monitor JVM memory to diagnose OOM errors?
This may not be helpful, but:
You may not get a heap dump because the oom-killer is a kernel feature. The JVM has no chance to write a heap dump.
SIGKILL cannot be caught and does not generate a core dump (its default Unix action is to terminate the process).
http://programmergamer.blogspot.com/2013/05/clarification-on-sigint-sigterm-sigkill.html
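The "cannot be caught" part is easy to see from any language. Here is a small Python sketch (the function name is made up) that tries to register a handler; the kernel accepts it for SIGTERM but refuses it for SIGKILL, and the JVM is in the same position when the oom-killer sends SIGKILL.

```python
import signal


def can_install_handler(signum):
    """Return True if a handler can be registered for signum in this process."""
    try:
        previous = signal.signal(signum, lambda s, frame: None)
    except OSError:
        return False  # the kernel refuses handlers for SIGKILL and SIGSTOP
    if previous is not None:
        signal.signal(signum, previous)  # restore whatever was installed before
    return True
```

Note that Python only allows signal.signal() to be called from the main thread, so this is meant to be run as a plain script.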
I have a multi-threaded process for which I wanted to generate a core dump.
gcore ran, gdb started, the process went to "t" state according to ps.
However, it got stuck there.
As it was already being traced, I could not attach another gdb session to see what was going on. I dumped kernel threads, some process threads were in 'ptrace_stop' while others were working.
Any ideas on why gcore got stuck would be helpful. The system had enough free memory too (free reported 300MB, but also reported swap had kicked in - though not sure when).
The process is running inside a docker container on a virtualbox vm.
I ran:
gcore process_id
That seems to have spawned a gdb child process
root 26540 0.0 0.0 4500 660 ? Ss May20 0:00 /bin/sh /usr/bin/gcore -o /var/cores/core.proc_name 2539
root 26545 0.0 2.9 160896 105144 ? S May20 0:01 /usr/bin/gdb --nx --batch -ex set pagination off -ex set height 0 -ex set width 0 -ex attach 2539 -ex gcore /var/cores/core.proc_name.2539 -ex detach -ex quit
May 22 12:08:24 BaseVM kernel: Call Trace:
May 22 12:08:24 BaseVM kernel: [<ffffffff8163bb39>] schedule+0x29/0x70
May 22 12:08:24 BaseVM kernel: [<ffffffff81639829>] schedule_timeout+0x209/0x2d0
May 22 12:08:24 BaseVM kernel: [<ffffffff811c4268>] ? __kmalloc_node_track_caller+0x58/0x270
May 22 12:08:24 BaseVM kernel: [<ffffffff8151a7f6>] ? skb_release_data+0xd6/0x110
May 22 12:08:24 BaseVM kernel: [<ffffffff810a68a6>] ? prepare_to_wait+0x56/0x90
May 22 12:08:24 BaseVM kernel: [<ffffffff815d372f>] unix_stream_read_generic+0x30f/0x8e0
May 22 12:08:24 BaseVM kernel: [<ffffffff810a6b80>] ? wake_up_atomic_t+0x30/0x30
May 22 12:08:24 BaseVM kernel: [<ffffffff8152158d>] ? __scm_destroy+0x4d/0x60
May 22 12:08:24 BaseVM kernel: [<ffffffff815d29f3>] ? unix_stream_sendmsg+0x413/0x430
May 22 12:08:24 BaseVM kernel: [<ffffffff815d3df4>] unix_stream_recvmsg+0x54/0x70
May 22 12:08:24 BaseVM kernel: [<ffffffff815d0d90>] ? unix_state_double_unlock+0x50/0x50
May 22 12:08:24 BaseVM kernel: [<ffffffff8151199f>] sock_recvmsg+0xbf/0x100
May 22 12:08:24 BaseVM kernel: [<ffffffff81511cae>] ___sys_recvmsg+0x11e/0x2b0
May 22 12:08:24 BaseVM kernel: [<ffffffff811f05db>] ? do_filp_open+0x4b/0xb0
May 22 12:08:24 BaseVM kernel: [<ffffffff81512861>] __sys_recvmsg+0x51/0x90
May 22 12:08:24 BaseVM kernel: [<ffffffff815128b2>] SyS_recvmsg+0x12/0x20
May 22 12:08:24 BaseVM kernel: [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b
May 22 12:08:24 BaseVM kernel: proc_name t ffff880212ce5080 0 8081 8003 0x00000082
May 22 12:08:24 BaseVM kernel: ffff8800c97e3d50 0000000000000082 ffff880212ce5080 ffff8800c97e3fd8
May 22 12:08:24 BaseVM kernel: ffff8800c97e3fd8 ffff8800c97e3fd8 ffff880212ce5080 ffff880212ce5080
May 22 12:08:24 BaseVM kernel: ffff880212ce5080 0000000000000000 ffff880212ce5080 ffff880212ce5080
May 22 12:08:24 BaseVM kernel: Call Trace:
May 22 12:08:24 BaseVM kernel: [<ffffffff8163bb39>] schedule+0x29/0x70
May 22 12:08:24 BaseVM kernel: [<ffffffff810912fd>] ptrace_stop+0x16d/0x2b0
May 22 12:08:24 BaseVM kernel: [<ffffffff81092ebd>] get_signal_to_deliver+0x3dd/0x6d0
May 22 12:08:24 BaseVM kernel: [<ffffffff81014417>] do_signal+0x57/0x6c0
May 22 12:08:24 BaseVM kernel: [<ffffffff81014adf>] do_notify_resume+0x5f/0xb0
May 22 12:08:24 BaseVM kernel: [<ffffffff81646dfd>] int_signal+0x12/0x17
May 22 12:08:24 BaseVM kernel: proc-name-ust t ffff880213f87300 0 8103 8003 0x00000082
May 22 12:08:24 BaseVM kernel: ffff8800c1c2fd50 0000000000000082 ffff880213f87300 ffff8800c1c2ffd8
May 22 12:08:24 BaseVM kernel: ffff8800c1c2ffd8 ffff8800c1c2ffd8 ffff880213f87300 ffff880213f87300
May 22 12:08:24 BaseVM kernel: ffff880213f87300 0000000000000000 ffff880213f87300 ffff880213f87300
May 22 12:08:24 BaseVM kernel: Call Trace:
May 22 12:08:24 BaseVM kernel: [<ffffffff8163bb39>] schedule+0x29/0x70
May 22 12:08:24 BaseVM kernel: [<ffffffff810912fd>] ptrace_stop+0x16d/0x2b0
May 22 12:08:24 BaseVM kernel: [<ffffffff81092ebd>] get_signal_to_deliver+0x3dd/0x6d0
May 22 12:08:24 BaseVM kernel: [<ffffffff810e2860>] ? futex_wake+0x80/0x160
May 22 12:08:24 BaseVM kernel: [<ffffffff81014417>] do_signal+0x57/0x6c0
May 22 12:08:24 BaseVM kernel: [<ffffffff8108fe7b>] ? recalc_sigpending+0x1b/0x50
May 22 12:08:24 BaseVM kernel: [<ffffffff81014adf>] do_notify_resume+0x5f/0xb0
May 22 12:08:24 BaseVM kernel: [<ffffffff81646dfd>] int_signal+0x12/0x17
May 22 12:08:24 BaseVM kernel: proc-name-ust S ffffc90000dd5c00 0 8104 8003 0x00000080
May 22 12:08:24 BaseVM kernel: ffff8800c1c03cd0 0000000000000082 ffff880213f85080 ffff8800c1c03fd8
May 22 12:08:24 BaseVM kernel: ffff8800c1c03fd8 ffff8800c1c03fd8 ffff880213f85080 ffff880213f85080
May 22 12:08:24 BaseVM kernel: ffff880213f85080 0000000000000000 ffff8800c1c03de0 ffffc90000dd5c00
May 22 12:08:24 BaseVM kernel: Call Trace:
May 22 12:08:24 BaseVM kernel: [<ffffffff8163bb39>] schedule+0x29/0x70
May 22 12:08:24 BaseVM kernel: [<ffffffff810e2704>] futex_wait_queue_me+0xc4/0x120
May 22 12:08:24 BaseVM kernel: [<ffffffff810e3279>] futex_wait+0x179/0x280
May 22 12:08:24 BaseVM kernel: [<ffffffff8108ffa6>] ? dequeue_signal+0x86/0x180
May 22 12:08:24 BaseVM kernel: [<ffffffff81092bdf>] ? get_signal_to_deliver+0xff/0x6d0
May 22 12:08:24 BaseVM kernel: [<ffffffff810e530e>] do_futex+0xfe/0x5b0
May 22 12:08:24 BaseVM kernel: [<ffffffff8164205d>] ? __do_page_fault+0x16d/0x450
May 22 12:08:24 BaseVM kernel: [<ffffffff810e5840>] SyS_futex+0x80/0x180
May 22 12:08:24 BaseVM kernel: [<ffffffff81014afa>] ? do_notify_resume+0x7a/0xb0
May 22 12:08:24 BaseVM kernel: [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b
May 22 12:08:24 BaseVM kernel: proc-name t ffff880213f82280 0 8107 8003 0x00000082
May 22 12:08:24 BaseVM kernel: ffff8800c7dcfd50 0000000000000082 ffff880213f82280 ffff8800c7dcffd8
May 22 12:08:24 BaseVM kernel: ffff8800c7dcffd8 ffff8800c7dcffd8 ffff880213f82280 ffff880213f82280
May 22 12:08:24 BaseVM kernel: ffff880213f82280 0000000000000000 ffff880213f82280 ffff880213f82280
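Since a second gdb cannot attach while the first one holds the process, one way to see who is tracing each thread and what state it is in is to read the State and TracerPid fields from /proc/<pid>/task/<tid>/status. A minimal Python sketch, with made-up function names:

```python
def parse_proc_status(text):
    """Parse the 'Key: value' lines of a /proc/<pid>/status file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def tracer_of(status_text):
    """Return (state_letter, tracer_pid); tracer_pid is 0 when nothing is attached."""
    fields = parse_proc_status(status_text)
    state = fields.get("State", "?").split()[0]
    return state, int(fields.get("TracerPid", "0"))
```

Running tracer_of() over the status file of every thread under /proc/<pid>/task/ would show which threads are in "t" (tracing stop) and whether the TracerPid is the gdb that gcore spawned.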
I'm having trouble with unexpected characters being sent on a USB port with the cdc_acm driver. What makes this all the more perplexing is that the code runs fine on Ubuntu 12.04 (3.2 kernel) but fails (the subject of this question) on CentOS 6 (3.6 kernel).
The USB device is a Bluegiga BLED112 Bluetooth Smart dongle. Its embedded microcontroller will reset any time there is unexpected input on its USB interface.
The test code opens the port, writes 4 bytes (a hello message) and expects to read a response. The read never completes because the unexpected characters cause the device to reset which causes the hub to drop the device and re-enumerate.
To troubleshoot, here's what I've done:
Downloaded the source code for the cdc_acm driver. Added a bunch of printk debug messages and stack_dumps to follow what's going on.
I rmmod'd the "stock" cdc_acm and insmod'd my instrumented module. All the device enumeration works, right driver attached, etc.
Since the code works on Ubuntu 12.04/Linux 3.2, I grabbed the 3.2 cdc_acm code and compiled that module on the Centos 6 / Linux 3.6 platform. Using that 3.2 module instead of the 3.6 module did not make a difference. I reverted to the 3.6 module.
Turned on the debug file system with usbmon and watched the USB traffic. I can see that there are extra characters being sent on the USB interface.
To watch what's going on, on top of the printk's in the cdc_acm module, I've merged the output of usb mon (cat /sys/kernel/debug/usb/usbmon/3u | logger) and the output of the test application (scan_example /dev/ttyACM0 | logger -s) so I have a single stream of time correlated debug trail.
The spurious characters sent on the USB endpoint are x5E x40 x5E x40 x5E x40 x5E x40 x41 (in ASCII it's ^@^@^@^@A), which looks like some sort of probing or an attempt to get the attention of a modem. These characters are sent immediately after the application's write() causes the 4 hello bytes to be sent to the endpoint.
Since the cdc_acm device is supposed to be a modem, I tried to turn off the modem control by adding this to usb_device_id acm_ids[] in cdc_acm.c
/* bluegiga BLED112*/
{ USB_DEVICE(0x2458, 0x0001),
.driver_info = NOT_A_MODEM,
},
I recompiled and insmod'd the module, and the syslog shows that this was recognized (quirks is 8), but there was no change in behavior.
Neither NetworkManager nor modem-manager is running, but I still suspect that there is some sort of modem control function going on somewhere; I just don't know where.
Here's an annotated debug log (MDV prefixes the printk's that I added to cdc_acm):
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_bulk
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_done
Here are the 4 bytes sent by the application 00 00 00 01
Feb 13 18:14:32 localhost cpcenter: df046a80 3672670191 C Bi:3:006:4 0 4 = 00000001
Feb 13 18:14:32 localhost cpcenter: 1360797272.669690 write: data2: len=0 contains:
... and these additional characters show up unexpectedly 5e 40 5e 40 5e 40....
Feb 13 18:14:32 localhost cpcenter: df046a80 3672670232 S Bi:3:006:4 -115 128 <
Feb 13 18:14:32 localhost cpcenter: f3cc5740 3672670297 S Bo:3:006:4 -115 1 = 5e
Feb 13 18:14:32 localhost cpcenter: df2e1300 3672670332 S Bo:3:006:4 -115 1 = 40
Feb 13 18:14:32 localhost cpcenter: f3cc5740 3672670347 C Bo:3:006:4 0 1 >
Feb 13 18:14:32 localhost cpcenter: f3cc5740 3672670392 S Bo:3:006:4 -115 1 = 5e
Feb 13 18:14:32 localhost cpcenter: df2e1180 3672670426 S Bo:3:006:4 -115 1 = 40
Feb 13 18:14:32 localhost cpcenter: df2e1c00 3672670461 S Bo:3:006:4 -115 1 = 5e
Feb 13 18:14:32 localhost cpcenter: df2e1840 3672670496 S Bo:3:006:4 -115 1 = 40
Feb 13 18:14:32 localhost cpcenter: df2e1300 3672670591 C Bo:3:006:4 0 1 >
At this point we get a spontaneous disconnect.
Feb 13 18:14:32 localhost kernel: usb 3-1: USB disconnect, device number 6
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_bulk
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_done
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm read_bulk_callback
Feb 13 18:14:32 localhost kernel: MDV 1 acm_read_bulk_callback - urb 1, len 0
Feb 13 18:14:32 localhost kernel: MDV 3 acm_read_bulk_callback - non-zero urb status: -71
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_bulk
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_done
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm read_bulk_callback
Feb 13 18:14:32 localhost kernel: MDV 1 acm_read_bulk_callback - urb 1, len 0
Feb 13 18:14:32 localhost kernel: MDV 3 acm_read_bulk_callback - non-zero urb status: -71
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_bulk
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_done
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm read_bulk_callback
Feb 13 18:14:32 localhost kernel: MDV 1 acm_read_bulk_callback - urb 2, len 0
Feb 13 18:14:32 localhost cpcenter: df2e1d80 3672670629 S Bo:3:006:4 -115 1 = 5e
Feb 13 18:14:32 localhost kernel: MDV 3 acm_read_bulk_callback - non-zero urb status: -71
Feb 13 18:14:32 localhost cpcenter: df2e1300 3672670677 S Bo:3:006:4 -115 1 = 41
Feb 13 18:14:32 localhost cpcenter: f3cc5740 3672670802 C Bo:3:006:4 0 1 >
Feb 13 18:14:32 localhost cpcenter: df2e1180 3672671019 C Bo:3:006:4 0 1 >
Feb 13 18:14:32 localhost cpcenter: df2e1c00 3672671237 C Bo:3:006:4 0 1 >
Feb 13 18:14:32 localhost cpcenter: dfbf8c00 3672673193 C Ii:3:001:1 0:2048 1 = 02
Feb 13 18:14:32 localhost cpcenter: dfbf8c00 3672673207 S Ii:3:001:1 -115:2048 4 <
Feb 13 18:14:32 localhost cpcenter: f3c26c00 3672673221 S Ci:3:001:0 s a3 00 0000 0001 0004 4 <
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_disconnect
Feb 13 18:14:32 localhost kernel: Pid: 29, comm: khubd Tainted: G O 3.5.3-1.el6.elrepo.i686 #1
Stack trace at the time of disconnect
Feb 13 18:14:32 localhost kernel: Call Trace:
Feb 13 18:14:32 localhost kernel: [<f82dabc5>] acm_disconnect+0x35/0x1f0 [cdc_acm]
Feb 13 18:14:32 localhost kernel: [<c13835db>] usb_unbind_interface+0x4b/0x180
Feb 13 18:14:32 localhost cpcenter: f3c26c00 3672673239 C Ci:3:001:0 0 4 = 00010100
Feb 13 18:14:32 localhost kernel: [<c1318bfb>] __device_release_driver+0x5b/0xb0
Feb 13 18:14:32 localhost kernel: [<c1318d05>] device_release_driver+0x25/0x40
Feb 13 18:14:32 localhost kernel: [<c1317f0c>] bus_remove_device+0xcc/0x130
Feb 13 18:14:32 localhost kernel: [<c131612f>] ? device_remove_attrs+0x2f/0x90
Feb 13 18:14:32 localhost kernel: [<c1316275>] device_del+0xe5/0x180
Feb 13 18:14:32 localhost kernel: [<c1380326>] usb_disable_device+0x96/0x240
Feb 13 18:14:32 localhost kernel: [<c1379f91>] usb_disconnect+0x91/0x130
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_bulk
Feb 13 18:14:32 localhost kernel: [<c137a2c0>] hub_port_connect_change+0xb0/0xa60
Feb 13 18:14:32 localhost kernel: [<c1380f4e>] ? usb_control_msg+0xce/0xe0
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_done
Feb 13 18:14:32 localhost kernel: [<c137b296>] hub_events+0x536/0x810
Feb 13 18:14:32 localhost cpcenter: f3c26c00 3672673243 S Co:3:001:0 s 23 01 0010 0001 0000 0
Feb 13 18:14:32 localhost cpcenter: f3c26c00 3672673250 C Co:3:001:0 0 0
Feb 13 18:14:32 localhost kernel: [<c1065bdf>] ? finish_wait+0x4f/0x70
Feb 13 18:14:32 localhost kernel: [<c137b5aa>] hub_thread+0x3a/0x1d0
Feb 13 18:14:32 localhost cpcenter: df2e1840 3672673260 C Bo:3:006:4 -71 0
Feb 13 18:14:32 localhost kernel: [<c1065a70>] ? wake_up_bit+0x30/0x30
Feb 13 18:14:32 localhost kernel: [<c137b570>] ? hub_events+0x810/0x810
Feb 13 18:14:32 localhost kernel: [<c106564c>] kthread+0x7c/0x90
Feb 13 18:14:32 localhost cpcenter: f3c16c80 3672673292 C Bi:3:006:4 -71 0
Feb 13 18:14:32 localhost cpcenter: df2e1d80 3672673453 C Bo:3:006:4 -71 0
Feb 13 18:14:32 localhost cpcenter: f3c16d40 3672673553 C Bi:3:006:4 -71 0
Feb 13 18:14:32 localhost kernel: [<c10655d0>] ? kthread_freezable_should_stop+0x60/0x60
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm read_bulk_callback
Feb 13 18:14:32 localhost kernel: [<c14dedbe>] kernel_thread_helper+0x6/0x10
Feb 13 18:14:32 localhost kernel: MDV 1 acm_read_bulk_callback - urb 3, len 0
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm stop_data_traffic
Feb 13 18:14:32 localhost cpcenter: f3d19500 3672674474 C Ii:3:006:2 -108:64 0
Feb 13 18:14:32 localhost kernel: MDV 2 acm_read_bulk_callback - disconnected
Feb 13 18:14:32 localhost cpcenter: df2e1300 3672674636 C Bo:3:006:4 -71 0
Feb 13 18:14:32 localhost cpcenter: f3c16140 3672674753 C Bi:3:006:4 -71 0
The ^@^@^@^A string which is sent to your device is the result of echo performed by the terminal subsystem in the kernel in response to incoming bytes from your device.
This line in your log:
Feb 13 18:14:32 localhost cpcenter: df046a80 3672670191 C Bi:3:006:4 0 4 = 00000001
actually means that your device sent 4 bytes to the computer (Bi means “Bulk endpoint, input”). By default all terminal devices have echo enabled, so the kernel echoes those bytes back to the device, but because those bytes are in the control character range, they are echoed in escaped form: ^@^@^@^A. Those echoed bytes are also sent in separate 1-byte write calls, which correspond to the 1-byte bulk out URBs in the subsequent log.
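To make the mapping concrete, here is a small Python sketch of the caret notation used when echoing control characters. It is a simplification (the real line discipline does not do this for TAB, NL, START and STOP), and the function name is made up.

```python
def caret_echo(data):
    """Render bytes the way a tty echoes control characters in caret notation.

    Simplified: the real line discipline leaves TAB, NL, START and STOP
    alone; this sketch escapes every byte in 0x00-0x1f and 0x7f.
    """
    out = bytearray()
    for b in data:
        if b < 0x20:
            out += b"^" + bytes([b + 0x40])  # 0x00 -> "^@", 0x01 -> "^A"
        elif b == 0x7F:
            out += b"^?"
        else:
            out.append(b)
    return bytes(out)


echoed = caret_echo(bytes.fromhex("00000001"))
print(echoed)        # b'^@^@^@^A'
print(echoed.hex())  # 5e405e405e405e41
```

Those eight bytes are exactly the 5e 40 5e 40 5e 40 5e 41 sequence that shows up on the Bo endpoint in the usbmon trace.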
You need to fix your test program so that it turns off echo and other tty processing before trying to communicate with your device. The cfmakeraw() function can be used for this if your test program is in C/C++.
The program might be working in Ubuntu just because some other program happens to touch the port before your test program and change the port settings to turn off echo.
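As an illustration of the fix, here is what the same idea looks like in Python, where tty.setraw() from the standard library does roughly what cfmakeraw() plus tcsetattr() do in C. This is only a sketch: the function names are made up, and your real test program may well be in another language.

```python
import os
import termios
import tty


def make_raw(fd):
    """Put the terminal behind fd into raw mode (no echo, no line editing)."""
    tty.setraw(fd)  # clears ECHO, ICANON, ISIG, IEXTEN, OPOST, among others


def open_raw(path):
    """Open a serial device and switch it to raw mode before any I/O."""
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    make_raw(fd)
    return fd


def echo_enabled(fd):
    """Report whether the ECHO flag is still set on the terminal behind fd."""
    lflag = termios.tcgetattr(fd)[3]
    return bool(lflag & termios.ECHO)
```

With that, the program would call open_raw("/dev/ttyACM0") and only then write the 4 hello bytes, and echo_enabled() can be used to confirm that the port settings actually took effect.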