A Cassandra node went down due to an OOM kill, and checking /var/log/messages I see the following.
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java cpuset=/ mems_allowed=0
....
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA32: 1294*4kB (UM) 932*8kB (UEM) 897*16kB (UEM) 483*32kB (UEM) 224*64kB (UEM) 114*128kB (UEM) 41*256kB (UEM) 12*512kB (UEM) 7*1024kB (UEM) 2*2048kB (EM) 35*4096kB (UM) = 242632kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 Normal: 5319*4kB (UE) 3233*8kB (UEM) 960*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 62500kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 38109 total pagecache pages
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages in swap cache
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Swap cache stats: add 0, delete 0, find 0/0
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Free swap = 0kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Total swap = 0kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 16394647 pages RAM
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages HighMem/MovableOnly
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 310559 pages reserved
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2634] 0 2634 41614 326 82 0 0 systemd-journal
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2690] 0 2690 29793 541 27 0 0 lvmetad
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2710] 0 2710 11892 762 25 0 -1000 systemd-udevd
.....
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [13774] 0 13774 459778 97729 429 0 0 Scan Factory
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14506] 0 14506 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14586] 0 14586 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14588] 0 14588 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14589] 0 14589 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14598] 0 14598 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14599] 0 14599 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14600] 0 14600 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14601] 0 14601 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19679] 0 19679 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19680] 0 19680 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 9084] 1007 9084 2822449 260291 810 0 0 java
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 8509] 1007 8509 17223585 14908485 32510 0 0 java
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21877] 0 21877 461828 97716 318 0 0 ScanAction Mgr
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21884] 0 21884 496653 98605 340 0 0 OAS Manager
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [31718] 89 31718 25474 486 48 0 0 pickup
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4891] 1007 4891 26999 191 9 0 0 iostat
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4957] 1007 4957 26999 192 10 0 0 iostat
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Out of memory: Kill process 8509 (java) score 928 or sacrifice child
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Killed process 8509 (java) total-vm:68894340kB, anon-rss:59496344kB, file-rss:137596kB, shmem-rss:0kB
Nothing else runs on this host except DSE Cassandra (with search) and monitoring agents. The max heap size is set to 31 GB, and the Cassandra java process seems to have been using ~57 GB (RAM is 62 GB) at the time of the error.
So my guess is that the JVM started using lots of memory and triggered the OOM kill.
Is my understanding correct?
That this is a Linux-triggered kill of the JVM because the JVM was consuming more than the available memory?
In that case the JVM was using its 31 GB maximum of heap, and the remaining 26 GB it used is non-heap memory. This process normally takes around 42 GB, and given that at the moment of the OOM it was consuming 57 GB, I suspect the java process is the culprit rather than the victim.
No heap dump was taken at the time of the issue; I have configured that now. But even if a heap dump had been taken, would it have helped figure out what was consuming the memory? A heap dump only covers the heap area, so what should be used to dump the non-heap side? Native Memory Tracking is one thing I came across.
Is there any way to have native memory dumped when an OOM occurs?
What is the best way to monitor JVM memory to diagnose OOM errors?
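For reference, a minimal sketch of how Native Memory Tracking could be enabled and sampled (the flag and the jcmd subcommands are standard HotSpot features; where the JVM options live is install-specific):
# add to the JVM options (for DSE/Cassandra typically jvm.options or cassandra-env.sh):
-XX:NativeMemoryTracking=summary
# against the running process, take a baseline, then later diff against it:
jcmd <pid> VM.native_memory baseline
jcmd <pid> VM.native_memory summary.diff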
This may not be helpful, but: you will not get a heap dump here, because the oom-killer is a kernel feature and the JVM gets no chance to write one.
The kill is delivered as SIGKILL, which cannot be caught and, by Unix default, does not generate a core dump.
Note that -XX:+HeapDumpOnOutOfMemoryError only fires on a Java-level OutOfMemoryError, not on a kernel-level kill.
http://programmergamer.blogspot.com/2013/05/clarification-on-sigint-sigterm-sigkill.html
After some time, my Python code that I run in an IPython shell freezes. I have found related entries in the syslog file, which I need to investigate.
By freezing I mean:
the process in the shell stops
the shell is not responding to keystrokes
Does anyone have any advice or suggestions about what might be going on and where I should look? I understand it is a vague question, but maybe we can all get to the bottom of the problem. Thank you!
Configuration of my system:
$ ipython3 --version
7.14.0
$ python3 --version
Python 3.6.9
$ ipython3
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.14.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: exit
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
I have looked at syslog and found this.
Aug 12 20:52:57 linux-box kernel: [3135203.066680] python3: page allocation failure: order:4, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
Aug 12 20:52:57 linux-box kernel: [3135203.066681] python3 cpuset=/ mems_allowed=0
Aug 12 20:52:57 linux-box kernel: [3135203.066684] CPU: 11 PID: 1241 Comm: python3 Not tainted 4.15.0-109-generic #110-Ubuntu
Aug 12 20:52:57 linux-box kernel: [3135203.066685] Hardware name: LENOVO 30C50045US/3138, BIOS M1VKT1BA 08/17/2018
Aug 12 20:52:57 linux-box kernel: [3135203.066686] Call Trace:
Aug 12 20:52:57 linux-box kernel: [3135203.066690] dump_stack+0x6d/0x8e
Aug 12 20:52:57 linux-box kernel: [3135203.066693] warn_alloc+0xff/0x1a0
Aug 12 20:52:57 linux-box kernel: [3135203.066694] ? __alloc_pages_direct_compact+0x51/0x100
Aug 12 20:52:57 linux-box kernel: [3135203.066695] __alloc_pages_slowpath+0xdc5/0xe00
Aug 12 20:52:57 linux-box kernel: [3135203.066697] __alloc_pages_nodemask+0x29a/0x2c0
Aug 12 20:52:57 linux-box kernel: [3135203.066699] alloc_pages_current+0x6a/0xe0
Aug 12 20:52:57 linux-box kernel: [3135203.066701] kmalloc_order+0x18/0x40
Aug 12 20:52:57 linux-box kernel: [3135203.066702] kmalloc_order_trace+0x24/0xb0
Aug 12 20:52:57 linux-box kernel: [3135203.066703] __kmalloc+0x1fe/0x210
Aug 12 20:52:57 linux-box kernel: [3135203.066706] proc_do_submiturb+0x4a3/0xd90
Aug 12 20:52:57 linux-box kernel: [3135203.066707] usbdev_do_ioctl+0xa38/0x1170
Aug 12 20:52:57 linux-box kernel: [3135203.066709] usbdev_ioctl+0xe/0x20
Aug 12 20:52:57 linux-box kernel: [3135203.066711] do_vfs_ioctl+0xa8/0x630
Aug 12 20:52:57 linux-box kernel: [3135203.066712] ? SyS_futex+0x13b/0x180
Aug 12 20:52:57 linux-box kernel: [3135203.066713] SyS_ioctl+0x79/0x90
Aug 12 20:52:57 linux-box kernel: [3135203.066714] do_syscall_64+0x73/0x130
Aug 12 20:52:57 linux-box kernel: [3135203.066716] entry_SYSCALL_64_after_hwframe+0x41/0xa6
Aug 12 20:52:57 linux-box kernel: [3135203.066717] RIP: 0033:0x7f8abc0b56d7
Aug 12 20:52:57 linux-box kernel: [3135203.066718] RSP: 002b:00007f89ca386a98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 12 20:52:57 linux-box kernel: [3135203.066719] RAX: ffffffffffffffda RBX: 0000000002057f90 RCX: 00007f8abc0b56d7
Aug 12 20:52:57 linux-box kernel: [3135203.066719] RDX: 00007f8a2c0077c0 RSI: 000000008038550a RDI: 0000000000000022
Aug 12 20:52:57 linux-box kernel: [3135203.066720] RBP: 00007f8a2c006e40 R08: 00007f8a2c0077c0 R09: 00007f8a2c008430
Aug 12 20:52:57 linux-box kernel: [3135203.066720] R10: 00000000ffffff00 R11: 0000000000000246 R12: 0000000000000000
Aug 12 20:52:57 linux-box kernel: [3135203.066721] R13: 00007f8a2c0077c0 R14: 0000000000000000 R15: 0000000000000000
Aug 12 20:52:57 linux-box kernel: [3135203.066722] warn_alloc_show_mem: 1 callbacks suppressed
Aug 12 20:52:57 linux-box kernel: [3135203.066722] Mem-Info:
Aug 12 20:52:57 linux-box kernel: [3135203.066724] active_anon:7072912 inactive_anon:510142 isolated_anon:0
Aug 12 20:52:57 linux-box kernel: [3135203.066724] active_file:7178154 inactive_file:1236198 isolated_file:0
Aug 12 20:52:57 linux-box kernel: [3135203.066724] unevictable:12 dirty:582646 writeback:328149 unstable:6144
Aug 12 20:52:57 linux-box kernel: [3135203.066724] slab_reclaimable:204738 slab_unreclaimable:31575
Aug 12 20:52:57 linux-box kernel: [3135203.066724] mapped:16191 shmem:10145 pagetables:27672 bounce:0
Aug 12 20:52:57 linux-box kernel: [3135203.066724] free:84911 free_pcp:15 free_cma:0
Aug 12 20:52:57 linux-box kernel: [3135203.066726] Node 0 active_anon:28291648kB inactive_anon:2040568kB active_file:28712940kB inactive_file:4944792kB u$
Aug 12 20:52:57 linux-box kernel: [3135203.066726] Node 0 DMA free:15888kB min:16kB low:28kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB $
Aug 12 20:52:57 linux-box kernel: [3135203.066728] lowmem_reserve[]: 0 2326 64192 64192 64192
Aug 12 20:52:57 linux-box kernel: [3135203.066729] Node 0 DMA32 free:249952kB min:2448kB low:4828kB high:7208kB active_anon:1548208kB inactive_anon:52kB $
Aug 12 20:52:57 linux-box kernel: [3135203.066731] lowmem_reserve[]: 0 0 61865 61865 61865
Aug 12 20:52:57 linux-box kernel: [3135203.066732] Node 0 Normal free:73804kB min:65116kB low:128464kB high:191812kB active_anon:26743080kB inactive_anon$
Aug 12 20:52:57 linux-box kernel: [3135203.066733] lowmem_reserve[]: 0 0 0 0 0
Aug 12 20:52:57 linux-box kernel: [3135203.066734] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*20$
Aug 12 20:52:57 linux-box kernel: [3135203.066737] Node 0 DMA32: 54*4kB (UME) 74*8kB (UME) 280*16kB (UME) 225*32kB (UME) 319*64kB (UME) 276*128kB (UME) 2$
Aug 12 20:52:57 linux-box kernel: [3135203.066740] Node 0 Normal: 6*4kB (UH) 233*8kB (UMH) 2952*16kB (UMEH) 723*32kB (UEH) 1*64kB (H) 3*128kB (H) 3*256kB$
Aug 12 20:52:57 linux-box kernel: [3135203.066744] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 12 20:52:57 linux-box kernel: [3135203.066744] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 12 20:52:57 linux-box kernel: [3135203.066744] 8446758 total pagecache pages
Aug 12 20:52:57 linux-box kernel: [3135203.066753] 21866 pages in swap cache
Aug 12 20:52:57 linux-box kernel: [3135203.066754] Swap cache stats: add 268953271, delete 268898216, find 41931219/53685984
Aug 12 20:52:57 linux-box kernel: [3135203.066754] Free swap = 65004764kB
Aug 12 20:52:57 linux-box kernel: [3135203.066754] Total swap = 67108860kB
Aug 12 20:52:57 linux-box kernel: [3135203.066755] 16733480 pages RAM
Aug 12 20:52:57 linux-box kernel: [3135203.066755] 0 pages HighMem/MovableOnly
Aug 12 20:52:57 linux-box kernel: [3135203.066755] 284837 pages reserved
Aug 12 20:52:57 linux-box kernel: [3135203.066755] 0 pages cma reserved
Aug 12 20:52:57 linux-box kernel: [3135203.066756] 0 pages hwpoisoned
A few words about what I do; it might be useful, or it might not.
I have code that communicates with a camera, reads images from it, and saves them. There are two threads per Python process: one reads images from the camera, and one saves them to a file.
I've got a headless ARM-based Linux (v3.10.53-1.1.1) system with no swap space enabled, and I occasionally see processes get killed by the OOM-killer even though there is plenty of RAM available.
Running echo 1 > /proc/sys/vm/compact_memory periodically seems to keep the OOM-killer at bay, which makes me think that memory fragmentation is the culprit. But I don't understand why a user process would ever need physically-contiguous blocks in the first place; as I understand it, even in the worst-case scenario (complete fragmentation, with only individual 4K blocks available), the kernel could simply allocate the necessary number of individual 4K blocks and then use the Magic of Virtual Memory (tm) to make them look contiguous to the user process.
Can someone explain why the OOM-killer would be invoked in response to memory fragmentation? Is it just a buggy kernel or is there a genuine reason? (And even if the kernel did need to de-frag memory in order to satisfy a request, shouldn't it do that automatically rather than giving up and OOM'ing?)
I've pasted an example OOM-killer invocation below, in case it sheds any light on things. I can reproduce the fault at will; this invocation occurred while the computer still had ~120MB of RAM available (according to free), in response to my test-program allocating memory, 10000 400-byte allocations at a time.
May 28 18:51:34 g2 user.warn kernel: [ 4228.307769] cored invoked oom-killer: gfp_mask=0x2084d0, order=0, oom_score_adj=0
May 28 18:51:35 g2 user.warn kernel: [ 4228.315295] CPU: 2 PID: 19687 Comm: cored Tainted: G O 3.10.53-1.1.1_ga+gf57416a #1
May 28 18:51:35 g2 user.warn kernel: [ 4228.323843] Backtrace:
May 28 18:51:35 g2 user.warn kernel: [ 4228.326340] [<c0011c54>] (dump_backtrace+0x0/0x10c) from [<c0011e68>] (show_stack+0x18/0x1c)
May 28 18:51:35 g2 user.warn kernel: [ 4228.334804] r6:00000000 r5:00000000 r4:c9784000 r3:00000000
May 28 18:51:35 g2 user.warn kernel: [ 4228.340566] [<c0011e50>] (show_stack+0x0/0x1c) from [<c04d0dd8>] (dump_stack+0x24/0x28)
May 28 18:51:35 g2 user.warn kernel: [ 4228.348684] [<c04d0db4>] (dump_stack+0x0/0x28) from [<c009b474>] (dump_header.isra.10+0x84/0x19c)
May 28 18:51:35 g2 user.warn kernel: [ 4228.357616] [<c009b3f0>] (dump_header.isra.10+0x0/0x19c) from [<c009ba3c>] (oom_kill_process+0x288/0x3f4)
May 28 18:51:35 g2 user.warn kernel: [ 4228.367230] [<c009b7b4>] (oom_kill_process+0x0/0x3f4) from [<c009bf8c>] (out_of_memory+0x208/0x2cc)
May 28 18:51:35 g2 user.warn kernel: [ 4228.376323] [<c009bd84>] (out_of_memory+0x0/0x2cc) from [<c00a0278>] (__alloc_pages_nodemask+0x8f8/0x910)
May 28 18:51:35 g2 user.warn kernel: [ 4228.385921] [<c009f980>] (__alloc_pages_nodemask+0x0/0x910) from [<c00b6c34>] (__pte_alloc+0x2c/0x158)
May 28 18:51:35 g2 user.warn kernel: [ 4228.395263] [<c00b6c08>] (__pte_alloc+0x0/0x158) from [<c00b9fe0>] (handle_mm_fault+0xd4/0xfc)
May 28 18:51:35 g2 user.warn kernel: [ 4228.403914] r6:c981a5d8 r5:cc421a40 r4:10400000 r3:10400000
May 28 18:51:35 g2 user.warn kernel: [ 4228.409689] [<c00b9f0c>] (handle_mm_fault+0x0/0xfc) from [<c0019a00>] (do_page_fault+0x174/0x3dc)
May 28 18:51:35 g2 user.warn kernel: [ 4228.418575] [<c001988c>] (do_page_fault+0x0/0x3dc) from [<c0019dc0>] (do_translation_fault+0xb4/0xb8)
May 28 18:51:35 g2 user.warn kernel: [ 4228.427824] [<c0019d0c>] (do_translation_fault+0x0/0xb8) from [<c00083ac>] (do_DataAbort+0x40/0xa0)
May 28 18:51:35 g2 user.warn kernel: [ 4228.436896] r6:c0019d0c r5:00000805 r4:c06a33d0 r3:103ffea8
May 28 18:51:35 g2 user.warn kernel: [ 4228.442643] [<c000836c>] (do_DataAbort+0x0/0xa0) from [<c000e138>] (__dabt_usr+0x38/0x40)
May 28 18:51:35 g2 user.warn kernel: [ 4228.450850] Exception stack(0xc9785fb0 to 0xc9785ff8)
May 28 18:51:35 g2 user.warn kernel: [ 4228.455918] 5fa0: 103ffea8 00000000 b6d56708 00000199
May 28 18:51:35 g2 user.warn kernel: [ 4228.464116] 5fc0: 00000001 b6d557c0 0001ffc8 b6d557f0 103ffea0 b6d55228 10400038 00000064
May 28 18:51:35 g2 user.warn kernel: [ 4228.472327] 5fe0: 0001ffc9 beb04990 00000199 b6c95d84 600f0010 ffffffff
May 28 18:51:35 g2 user.warn kernel: [ 4228.478952] r8:103ffea0 r7:b6d557f0 r6:ffffffff r5:600f0010 r4:b6c95d84
May 28 18:51:35 g2 user.warn kernel: [ 4228.485759] Mem-info:
May 28 18:51:35 g2 user.warn kernel: [ 4228.488038] DMA per-cpu:
May 28 18:51:35 g2 user.warn kernel: [ 4228.490589] CPU 0: hi: 90, btch: 15 usd: 5
May 28 18:51:35 g2 user.warn kernel: [ 4228.495389] CPU 1: hi: 90, btch: 15 usd: 13
May 28 18:51:35 g2 user.warn kernel: [ 4228.500205] CPU 2: hi: 90, btch: 15 usd: 17
May 28 18:51:35 g2 user.warn kernel: [ 4228.505003] CPU 3: hi: 90, btch: 15 usd: 65
May 28 18:51:35 g2 user.warn kernel: [ 4228.509823] active_anon:92679 inactive_anon:47 isolated_anon:0
May 28 18:51:35 g2 user.warn kernel: [ 4228.509823] active_file:162 inactive_file:1436 isolated_file:0
May 28 18:51:35 g2 user.warn kernel: [ 4228.509823] unevictable:0 dirty:0 writeback:0 unstable:0
May 28 18:51:35 g2 user.warn kernel: [ 4228.509823] free:28999 slab_reclaimable:841 slab_unreclaimable:2103
May 28 18:51:35 g2 user.warn kernel: [ 4228.509823] mapped:343 shmem:89 pagetables:573 bounce:0
May 28 18:51:35 g2 user.warn kernel: [ 4228.509823] free_cma:29019
May 28 18:51:35 g2 user.warn kernel: [ 4228.541416] DMA free:115636kB min:1996kB low:2492kB high:2992kB active_anon:370716kB inactive_anon:188kB active_file:752kB inactive_file:6040kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:524288kB managed:2
May 28 18:51:35 g2 user.warn kernel: [ 4228.583833] lowmem_reserve[]: 0 0 0 0
May 28 18:51:35 g2 user.warn kernel: [ 4228.587577] DMA: 2335*4kB (UMC) 1266*8kB (UMC) 1034*16kB (UMC) 835*32kB (UC) 444*64kB (C) 28*128kB (C) 103*256kB (C) 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB = 121100kB
May 28 18:51:35 g2 user.warn kernel: [ 4228.604979] 502 total pagecache pages
May 28 18:51:35 g2 user.warn kernel: [ 4228.608649] 0 pages in swap cache
May 28 18:51:35 g2 user.warn kernel: [ 4228.611979] Swap cache stats: add 0, delete 0, find 0/0
May 28 18:51:35 g2 user.warn kernel: [ 4228.617210] Free swap = 0kB
May 28 18:51:35 g2 user.warn kernel: [ 4228.620110] Total swap = 0kB
May 28 18:51:35 g2 user.warn kernel: [ 4228.635245] 131072 pages of RAM
May 28 18:51:35 g2 user.warn kernel: [ 4228.638394] 30575 free pages
May 28 18:51:35 g2 user.warn kernel: [ 4228.641293] 3081 reserved pages
May 28 18:51:35 g2 user.warn kernel: [ 4228.644437] 1708 slab pages
May 28 18:51:35 g2 user.warn kernel: [ 4228.647239] 265328 pages shared
May 28 18:51:35 g2 user.warn kernel: [ 4228.650399] 0 pages swap cached
May 28 18:51:35 g2 user.info kernel: [ 4228.653546] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
May 28 18:51:35 g2 user.info kernel: [ 4228.661408] [ 115] 0 115 761 128 5 0 -1000 udevd
May 28 18:51:35 g2 user.info kernel: [ 4228.669347] [ 237] 0 237 731 98 5 0 -1000 udevd
May 28 18:51:35 g2 user.info kernel: [ 4228.677278] [ 238] 0 238 731 100 5 0 -1000 udevd
May 28 18:51:35 g2 user.info kernel: [ 4228.685224] [ 581] 0 581 1134 78 5 0 -1000 sshd
May 28 18:51:35 g2 user.info kernel: [ 4228.693074] [ 592] 0 592 662 15 4 0 0 syslogd
May 28 18:51:35 g2 user.info kernel: [ 4228.701184] [ 595] 0 595 662 19 4 0 0 klogd
May 28 18:51:35 g2 user.info kernel: [ 4228.709113] [ 633] 0 633 6413 212 12 0 0 g2d
May 28 18:51:35 g2 user.info kernel: [ 4228.716877] [ 641] 0 641 663 16 3 0 0 getty
May 28 18:51:35 g2 user.info kernel: [ 4228.724827] [ 642] 0 642 663 16 5 0 0 getty
May 28 18:51:35 g2 user.info kernel: [ 4228.732770] [ 646] 0 646 6413 215 12 0 0 g2d
May 28 18:51:35 g2 user.info kernel: [ 4228.740540] [ 650] 0 650 10791 572 10 0 0 avbd
May 28 18:51:35 g2 user.info kernel: [ 4228.748385] [ 651] 0 651 9432 2365 21 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.756322] [ 652] 0 652 52971 4547 42 0 0 g2d
May 28 18:51:35 g2 user.info kernel: [ 4228.764104] [ 712] 0 712 14135 2458 24 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.772053] [ 746] 0 746 1380 248 6 0 0 dhclient
May 28 18:51:35 g2 user.info kernel: [ 4228.780251] [ 779] 0 779 9419 2383 21 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.788187] [ 780] 0 780 9350 2348 21 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.796127] [ 781] 0 781 9349 2347 21 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.804074] [ 782] 0 782 9353 2354 21 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.812012] [ 783] 0 783 18807 2573 27 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.819955] [ 784] 0 784 17103 3233 28 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.827882] [ 785] 0 785 13990 2436 24 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.835819] [ 786] 0 786 9349 2350 21 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.843764] [ 807] 0 807 13255 4125 25 0 0 cored
May 28 18:51:35 g2 user.info kernel: [ 4228.851702] [ 1492] 999 1492 512 27 5 0 0 avahi-autoipd
May 28 18:51:35 g2 user.info kernel: [ 4228.860334] [ 1493] 0 1493 433 14 5 0 0 avahi-autoipd
May 28 18:51:35 g2 user.info kernel: [ 4228.868955] [ 1494] 0 1494 1380 246 7 0 0 dhclient
May 28 18:51:35 g2 user.info kernel: [ 4228.877163] [19170] 0 19170 1175 131 6 0 0 sshd
May 28 18:51:35 g2 user.info kernel: [ 4228.885022] [19183] 0 19183 750 70 4 0 0 sh
May 28 18:51:35 g2 user.info kernel: [ 4228.892701] [19228] 0 19228 663 16 5 0 0 watch
May 28 18:51:35 g2 user.info kernel: [ 4228.900636] [19301] 0 19301 1175 131 5 0 0 sshd
May 28 18:51:35 g2 user.info kernel: [ 4228.908475] [19315] 0 19315 751 69 5 0 0 sh
May 28 18:51:35 g2 user.info kernel: [ 4228.916154] [19365] 0 19365 663 16 5 0 0 watch
May 28 18:51:35 g2 user.info kernel: [ 4228.924099] [19443] 0 19443 1175 153 5 0 0 sshd
May 28 18:51:35 g2 user.info kernel: [ 4228.931948] [19449] 0 19449 750 70 5 0 0 sh
May 28 18:51:35 g2 user.info kernel: [ 4228.939626] [19487] 0 19487 1175 132 5 0 0 sshd
May 28 18:51:35 g2 user.info kernel: [ 4228.947467] [19500] 0 19500 750 70 3 0 0 sh
May 28 18:51:35 g2 user.info kernel: [ 4228.955148] [19540] 0 19540 662 17 5 0 0 tail
May 28 18:51:35 g2 user.info kernel: [ 4228.963002] [19687] 0 19687 63719 56396 127 0 0 cored
May 28 18:51:35 g2 user.err kernel: [ 4228.970936] Out of memory: Kill process 19687 (cored) score 428 or sacrifice child
May 28 18:51:35 g2 user.err kernel: [ 4228.978513] Killed process 19687 (cored) total-vm:254876kB, anon-rss:225560kB, file-rss:24kB
Also, here is the test program I use to stress the system and invoke the OOM-killer. With the echo 1 > /proc/sys/vm/compact_memory command being run every so often, the OOM-killer appears when free reports system RAM close to zero, as expected; without it, the OOM-killer appears well before that, when free still reports 130+ MB of RAM available but cat /proc/buddyinfo shows the RAM becoming fragmented:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>  /* for memset() */

int main(int argc, char ** argv)
{
   while(1)
   {
      printf("PRESS RETURN TO ALLOCATE BUFFERS\n");
      const int numBytes = 400;
      char buf[64]; fgets(buf, sizeof(buf), stdin);
      for (int i=0; i<10000; i++)
      {
         void * ptr = malloc(numBytes); // yes, a deliberate memory leak
         if (ptr)
         {
            memset(ptr, 'J', numBytes); // force the virtual memory system to actually allocate the RAM, and not only the address space
         }
         else printf("malloc() failed!\n");
      }
      fprintf(stderr, "Deliberately leaked 10000*%i bytes!\n", numBytes);
   }
   return 0;
}
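For completeness, a sketch of how this can be built and observed (the binary name is arbitrary; -std=c99 is needed for the loop-scoped declaration on older compilers):
gcc -std=c99 -o leaker leaker.c
./leaker
# in another shell, watch fragmentation and free RAM while pressing return in the first:
watch -n1 cat /proc/buddyinfo
free -m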
You are on the right track, Jeremy. The identical thing happened to me on my CentOS desktop system. I am a computer consultant, and I have worked with Linux since 1995. And I pound my Linux systems mercilessly with many file downloads and all sorts of other activities that stretch them to their limits. After my main desktop had been up for about 4 days, it got really slow (less than 1/10 of normal speed), the OOM killer kicked in, and I was sitting there wondering why my system was acting that way. It had plenty of RAM, but the OOM killer was kicking in when it had no business doing so. So I rebooted it, and it acted fine... for about 4 days, then the problem returned. Bugged the snot out of me not knowing why.
So I put on my test engineer hat and ran all sorts of stress tests on the machine to see if I could reproduce the symptoms on purpose. After several months of this, I was able to recreate the problem at will and prove that my solution for it would work every time.
"Cache turnover" in this context is when a system has to tear down existing cache to create more cache space to support new file writes. Since the system is in a hurry to redeploy the RAM, it does not take the time to defragment the memory it is freeing. So over time, as more and more file writes occur, the cache turns over repeatedly. And the memory in which it resides keeps getting more and more fragmented. In my tests, I found that after the disk cache has turned over about 15 times, the memory becomes so fragmented that the system cannot tear down and then allocate the memory fast enough to keep the OOM killer from being triggered due to lack of free RAM in the system when a spike in memory demand occurs. Such a spike could be caused by executing something as simple as
find /dev /etc /home /opt /tmp /usr -xdev > /dev/null
On my system, that command creates a demand for about 50MB of new cache. That was what
free -mt
showed, anyway.
The solution for this problem involves expanding on what you already discovered.
/bin/echo 3 > /proc/sys/vm/drop_caches
(make sure the kernel was built with CONFIG_COMPACTION=y; otherwise /proc/sys/vm/compact_memory does not exist)
echo 1 > /proc/sys/vm/compact_memory
And yes, I totally agree that dropping cache will force your system to re-read some data from disk. But at a rate of once per day or even once per hour, the negative effect of dropping cache is absolutely negligible compared to everything else your system is doing, no matter what that might be. The negative effect is so small that I cannot even measure it, and I made my living as a test engineer for 5+ years figuring out how to measure things like that.
If you set up a cron job to execute those once a day, that should eliminate your OOM killer problem. If you still see problems with the OOM killer after that, consider executing them more frequently. It will vary depending on how much file writing you do compared to the amount of system RAM your unit has.
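A minimal sketch of such a cron job (the schedule and file path are examples only):
# /etc/cron.d/compact-memory: drop caches and compact memory once a day at 03:00
0 3 * * * root /bin/echo 3 > /proc/sys/vm/drop_caches; /bin/echo 1 > /proc/sys/vm/compact_memory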
Is there any way to determine the virtual memory size of a process at the time it is killed by the Linux oom-killer?
I can't find any parameter in /var/log/messages that gives the total VM size of the process being killed. There is lots of other information in /var/log/messages, but not the total VM size.
This is a CentOS 5.7 x64 machine.
Following are the contents of /var/log/messages:
Mar 1 18:51:45 c42 kernel: NameService invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Mar 1 18:51:45 c42 kernel:
Mar 1 18:51:46 c42 kernel: Call Trace:
Mar 1 18:51:46 c42 kernel: [<ffffffff800c9d3a>] out_of_memory+0x8e/0x2f3
Mar 1 18:51:46 c42 kernel: [<ffffffff8002dfd7>] __wake_up+0x38/0x4f
Mar 1 18:51:46 c42 kernel: [<ffffffff8000f677>] __alloc_pages+0x27f/0x308
Mar 1 18:51:46 c42 kernel: [<ffffffff80013034>] __do_page_cache_readahead+0x96/0x17b
Mar 1 18:51:46 c42 kernel: [<ffffffff80013971>] filemap_nopage+0x14c/0x360
Mar 1 18:51:46 c42 kernel: [<ffffffff8000896c>] __handle_mm_fault+0x1fd/0x103b
Mar 1 18:51:46 c42 kernel: [<ffffffff800671f2>] do_page_fault+0x499/0x842
Mar 1 18:51:46 c42 kernel: [<ffffffff80031143>] do_fork+0x148/0x1c1
Mar 1 18:51:46 c42 kernel: [<ffffffff8005dde9>] error_exit+0x0/0x84
Mar 1 18:51:46 c42 kernel:
Mar 1 18:51:46 c42 kernel: Mem-info:
Mar 1 18:51:47 c42 kernel: Node 0 DMA per-cpu:
Mar 1 18:51:48 c42 kernel: cpu 0 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 0 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 1 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 1 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 2 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 2 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 3 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 3 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 4 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 4 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 5 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 5 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 6 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 6 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 7 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 7 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 8 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 8 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 9 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 9 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 10 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 10 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 11 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 11 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 12 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 12 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 13 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 13 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 14 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 14 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 15 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 15 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: Node 0 DMA32 per-cpu:
Mar 1 18:51:49 c42 kernel: cpu 0 hot: high 186, batch 31 used:31
Mar 1 18:51:49 c42 kernel: cpu 0 cold: high 62, batch 15 used:35
............
Mar 1 18:51:58 c42 kernel: cpu 14 cold: high 62, batch 15 used:18
Mar 1 18:51:58 c42 kernel: cpu 15 hot: high 186, batch 31 used:6
Mar 1 18:51:59 c42 kernel: cpu 15 cold: high 62, batch 15 used:14
Mar 1 18:51:59 c42 kernel: Node 1 HighMem per-cpu: empty
Mar 1 18:51:59 c42 kernel: Free pages: 50396kB (0kB HighMem)
Mar 1 18:51:59 c42 kernel: Active:1559270 inactive:2490421 dirty:0 writeback:0 unstable:0 free:12599 slab:8740 mapped-file:1186 mapped-anon:4051463 pagetables:16277
Mar 1 18:51:59 c42 kernel: Node 0 DMA free:10068kB min:8kB low:8kB high:12kB active:0kB inactive:0kB present:9660kB pages_scanned:0 all_unreclaimable? yes
Mar 1 18:51:59 c42 kernel: lowmem_reserve[]: 0 1965 8025 8025
Mar 1 18:51:59 c42 kernel: Node 0 DMA32 free:26176kB min:1980kB low:2472kB high:2968kB active:1020328kB inactive:922224kB present:2012496kB pages_scanned:4075359 all_unreclaimable? yes
Mar 1 18:51:59 c42 kernel: lowmem_reserve[]: 0 0 6060 6060
Mar 1 18:51:59 c42 kernel: Node 0 Normal free:6060kB min:6108kB low:7632kB high:9160kB active:490800kB inactive:5569172kB present:6205440kB pages_scanned:21679912 all_unreclaimable? yes
Mar 1 18:51:59 c42 kernel: lowmem_reserve[]: 0 0 0 0
Mar 1 18:51:59 c42 kernel: Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Mar 1 18:52:00 c42 kernel: lowmem_reserve[]: 0 0 0 0
Mar 1 18:52:00 c42 kernel: Node 1 DMA free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Mar 1 18:52:00 c42 kernel: lowmem_reserve[]: 0 0 8080 8080
Mar 1 18:52:00 c42 kernel: Node 1 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Mar 1 18:52:00 c42 kernel: lowmem_reserve[]: 0 0 8080 8080
Mar 1 18:52:00 c42 kernel: Node 1 Normal free:8092kB min:8144kB low:10180kB high:12216kB active:4725952kB inactive:3470288kB present:8273920kB pages_scanned:15611005 all_unreclaimable? yes
Mar 1 18:52:00 c42 kernel: lowmem_reserve[]: 0 0 0 0
Mar 1 18:52:00 c42 kernel: Node 1 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Mar 1 18:52:01 c42 kernel: lowmem_reserve[]: 0 0 0 0
Mar 1 18:52:02 c42 kernel: Node 0 DMA: 5*4kB 2*8kB 5*16kB 5*32kB 5*64kB 2*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 2*4096kB = 10068kB
Mar 1 18:52:02 c42 kernel: Node 0 DMA32: 30*4kB 1*8kB 0*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 6*4096kB = 26176kB
Mar 1 18:52:02 c42 kernel: Node 0 Normal: 9*4kB 7*8kB 3*16kB 1*32kB 0*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 6060kB
Mar 1 18:52:02 c42 kernel: Node 0 HighMem: empty
Mar 1 18:52:03 c42 kernel: Node 1 DMA: empty
Mar 1 18:52:03 c42 kernel: Node 1 DMA32: empty
Mar 1 18:52:03 c42 kernel: Node 1 Normal: 49*4kB 3*8kB 0*16kB 0*32kB 1*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8092kB
Mar 1 18:52:03 c42 kernel: Node 1 HighMem: empty
Mar 1 18:52:03 c42 kernel: 1624 pagecache pages
Mar 1 18:52:04 c42 kernel: Swap cache: add 2581210, delete 2580953, find 6957/9192, race 0+16
Mar 1 18:52:04 c42 kernel: Free swap = 0kB
Mar 1 18:52:04 c42 kernel: Total swap = 10241428kB
Mar 1 18:52:04 c42 kernel: Free swap: 0kB
Mar 1 18:52:06 c42 kernel: 4718592 pages of RAM
Mar 1 18:52:06 c42 kernel: 616057 reserved pages
Mar 1 18:52:07 c42 kernel: 17381 pages shared
Mar 1 18:52:08 c42 kernel: 260 pages swap cached
Mar 1 18:52:09 c42 kernel: Out of memory: Killed process 16727, UID 501, (ApplicationMoni).
As per Linux, total memory is the sum of physical memory and swap space, i.e. RAM + swap.
Whenever a process gets killed, the score of the killed process is written to the kernel log.
By observing the top command and the oom_score of processes, I figured out that
the oom_score roughly tracks the process's share of total memory (score / 10 ≈ percent used)
For example: my system has 16 GB of RAM and 1 GB of swap, so total memory is 17 GB.
A Tomcat process was killed with an oom score of 602; Tomcat's usage was therefore greater than or roughly equal to 60.2% of total memory, i.e. it occupied 10.23+ GB.
Here is another example:
a score of 249 means memory usage of 24.9+ %
This is reported in dmesg, after the stack trace of the allocation request that triggered the kill.
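On newer kernels the "Killed process" line itself reports total-vm/anon-rss/file-rss (as in the other logs above); a quick way to pull those records out (the log path varies by distro) might be:
dmesg | grep -iE 'out of memory|killed process'
grep -i 'killed process' /var/log/messages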
My server triggered the OOM killer and I am trying to understand why. The system has a lot of RAM (128 GB), and it looks like only around 70 GB of it was actually used. Reading through previous questions about OOM, it looks like this might be a case of memory fragmentation. See the syslog output:
Jun 23 17:20:10 server1 kernel: [517262.504589] gmond invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Jun 23 17:20:10 server1 kernel: [517262.504593] gmond cpuset=/ mems_allowed=0-1
Jun 23 17:20:10 server1 kernel: [517262.504598] CPU: 4 PID: 1522 Comm: gmond Tainted: P OE 3.15.1-031501-lowlatency #201406161841
Jun 23 17:20:10 server1 kernel: [517262.504599] Hardware name: Dell Inc. PowerEdge R420/0K29HN, BIOS 2.3.3 07/10/2014
Jun 23 17:20:10 server1 kernel: [517262.504601] 0000000000000000 ffff880fce2ab848 ffffffff817746ec 0000000000000007
Jun 23 17:20:10 server1 kernel: [517262.504603] ffff880f74691950 ffff880fce2ab898 ffffffff8176a980 ffff880f00000000
Jun 23 17:20:10 server1 kernel: [517262.504605] 000201da81383df8 ffff881470376540 ffff881dcf7ab2a0 0000000000000000
Jun 23 17:20:10 server1 kernel: [517262.504607] Call Trace:
Jun 23 17:20:10 server1 kernel: [517262.504615] [<ffffffff817746ec>] dump_stack+0x4e/0x71
Jun 23 17:20:10 server1 kernel: [517262.504618] [<ffffffff8176a980>] dump_header+0x7e/0xbd
Jun 23 17:20:10 server1 kernel: [517262.504620] [<ffffffff8176aa16>] oom_kill_process.part.6+0x57/0x30a
Jun 23 17:20:10 server1 kernel: [517262.504623] [<ffffffff811654e7>] oom_kill_process+0x47/0x50
Jun 23 17:20:10 server1 kernel: [517262.504625] [<ffffffff81165825>] out_of_memory+0x145/0x1d0
Jun 23 17:20:10 server1 kernel: [517262.504628] [<ffffffff8116c1ba>] __alloc_pages_nodemask+0xb1a/0xc40
Jun 23 17:20:10 server1 kernel: [517262.504634] [<ffffffff811adba3>] alloc_pages_current+0xb3/0x180
Jun 23 17:20:10 server1 kernel: [517262.504636] [<ffffffff81161737>] __page_cache_alloc+0xb7/0xd0
Jun 23 17:20:10 server1 kernel: [517262.504638] [<ffffffff81163f80>] filemap_fault+0x280/0x430
Jun 23 17:20:10 server1 kernel: [517262.504642] [<ffffffff8118a0d9>] __do_fault+0x39/0x90
Jun 23 17:20:10 server1 kernel: [517262.504644] [<ffffffff8118e31e>] do_read_fault.isra.59+0x10e/0x1d0
Jun 23 17:20:10 server1 kernel: [517262.504646] [<ffffffff8118e870>] do_linear_fault.isra.61+0x70/0x80
Jun 23 17:20:10 server1 kernel: [517262.504647] [<ffffffff8118e986>] handle_pte_fault+0x76/0x1b0
Jun 23 17:20:10 server1 kernel: [517262.504652] [<ffffffff81095fe0>] ? lock_hrtimer_base.isra.25+0x30/0x60
Jun 23 17:20:10 server1 kernel: [517262.504654] [<ffffffff8118eea4>] __handle_mm_fault+0x1b4/0x360
Jun 23 17:20:10 server1 kernel: [517262.504655] [<ffffffff8118f101>] handle_mm_fault+0xb1/0x160
Jun 23 17:20:10 server1 kernel: [517262.504658] [<ffffffff81784667>] ? __do_page_fault+0x2b7/0x5a0
Jun 23 17:20:10 server1 kernel: [517262.504660] [<ffffffff81784522>] __do_page_fault+0x172/0x5a0
Jun 23 17:20:10 server1 kernel: [517262.504664] [<ffffffff8111fdec>] ? acct_account_cputime+0x1c/0x20
Jun 23 17:20:10 server1 kernel: [517262.504667] [<ffffffff810a73a9>] ? account_user_time+0x99/0xb0
Jun 23 17:20:10 server1 kernel: [517262.504669] [<ffffffff810a79dd>] ? vtime_account_user+0x5d/0x70
Jun 23 17:20:10 server1 kernel: [517262.504671] [<ffffffff8178498e>] do_page_fault+0x3e/0x80
Jun 23 17:20:10 server1 kernel: [517262.504673] [<ffffffff817811f8>] page_fault+0x28/0x30
Jun 23 17:20:10 server1 kernel: [517262.504674] Mem-Info:
Jun 23 17:20:10 server1 kernel: [517262.504675] Node 0 DMA per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504677] CPU 0: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504678] CPU 1: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504679] CPU 2: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504680] CPU 3: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504681] CPU 4: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504682] CPU 5: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504683] CPU 6: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504684] CPU 7: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504685] CPU 8: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504686] CPU 9: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504687] CPU 10: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504687] CPU 11: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504688] CPU 12: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504689] CPU 13: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504690] CPU 14: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504691] CPU 15: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504692] CPU 16: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504693] CPU 17: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504694] CPU 18: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504695] CPU 19: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504696] CPU 20: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504697] CPU 21: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504698] CPU 22: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504698] CPU 23: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504699] Node 0 DMA32 per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504701] CPU 0: hi: 186, btch: 31 usd: 30
Jun 23 17:20:10 server1 kernel: [517262.504702] CPU 1: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504703] CPU 2: hi: 186, btch: 31 usd: 34
Jun 23 17:20:10 server1 kernel: [517262.504704] CPU 3: hi: 186, btch: 31 usd: 27
Jun 23 17:20:10 server1 kernel: [517262.504705] CPU 4: hi: 186, btch: 31 usd: 30
Jun 23 17:20:10 server1 kernel: [517262.504705] CPU 5: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504706] CPU 6: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504707] CPU 7: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504708] CPU 8: hi: 186, btch: 31 usd: 173
Jun 23 17:20:10 server1 kernel: [517262.504709] CPU 9: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504710] CPU 10: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504711] CPU 11: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504712] CPU 12: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504713] CPU 13: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504714] CPU 14: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504715] CPU 15: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504716] CPU 16: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504717] CPU 17: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504718] CPU 18: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504719] CPU 19: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504720] CPU 20: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504721] CPU 21: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504722] CPU 22: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504722] CPU 23: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504723] Node 0 Normal per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504724] CPU 0: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504725] CPU 1: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504726] CPU 2: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504727] CPU 3: hi: 186, btch: 31 usd: 14
Jun 23 17:20:10 server1 kernel: [517262.504728] CPU 4: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504729] CPU 5: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504730] CPU 6: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504731] CPU 7: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504732] CPU 8: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504733] CPU 9: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504734] CPU 10: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504735] CPU 11: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504736] CPU 12: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504737] CPU 13: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504738] CPU 14: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504739] CPU 15: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504740] CPU 16: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504740] CPU 17: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504741] CPU 18: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504742] CPU 19: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504743] CPU 20: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504744] CPU 21: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504745] CPU 22: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504746] CPU 23: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504747] Node 1 Normal per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504748] CPU 0: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504749] CPU 1: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504750] CPU 2: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504751] CPU 3: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504752] CPU 4: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504753] CPU 5: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504754] CPU 6: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504755] CPU 7: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504756] CPU 8: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504757] CPU 9: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504758] CPU 10: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504758] CPU 11: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504759] CPU 12: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504760] CPU 13: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504761] CPU 14: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504762] CPU 15: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504763] CPU 16: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504764] CPU 17: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504765] CPU 18: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504766] CPU 19: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504767] CPU 20: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504768] CPU 21: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504769] CPU 22: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504770] CPU 23: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504773] active_anon:17833290 inactive_anon:2465707 isolated_anon:0
Jun 23 17:20:10 server1 kernel: [517262.504773] active_file:573 inactive_file:595 isolated_file:36
Jun 23 17:20:10 server1 kernel: [517262.504773] unevictable:0 dirty:4 writeback:0 unstable:0
Jun 23 17:20:10 server1 kernel: [517262.504773] free:82698 slab_reclaimable:43224 slab_unreclaimable:11476749
Jun 23 17:20:10 server1 kernel: [517262.504773] mapped:2465518 shmem:2465767 pagetables:66385 bounce:0
Jun 23 17:20:10 server1 kernel: [517262.504773] free_cma:0
Jun 23 17:20:10 server1 kernel: [517262.504776] Node 0 DMA free:14804kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15968kB managed:15828kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504779] lowmem_reserve[]: 0 2933 64370 64370
Jun 23 17:20:10 server1 kernel: [517262.504782] Node 0 DMA32 free:247776kB min:2048kB low:2560kB high:3072kB active_anon:1774744kB inactive_anon:607052kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3083200kB managed:3003592kB mlocked:0kB dirty:16kB writeback:0kB mapped:607068kB shmem:607068kB slab_reclaimable:25524kB slab_unreclaimable:302060kB kernel_stack:4928kB pagetables:3100kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2660 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504785] lowmem_reserve[]: 0 0 61436 61436
Jun 23 17:20:10 server1 kernel: [517262.504787] Node 0 Normal free:34728kB min:42952kB low:53688kB high:64428kB active_anon:30286072kB inactive_anon:9255576kB active_file:236kB inactive_file:640kB unevictable:0kB isolated(anon):0kB isolated(file):16kB present:63963136kB managed:62911420kB mlocked:0kB dirty:0kB writeback:0kB mapped:9255000kB shmem:9255724kB slab_reclaimable:86416kB slab_unreclaimable:22165372kB kernel_stack:21072kB pagetables:121112kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:13936 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504791] lowmem_reserve[]: 0 0 0 0
Jun 23 17:20:10 server1 kernel: [517262.504793] Node 1 Normal free:33484kB min:45096kB low:56368kB high:67644kB active_anon:39272344kB inactive_anon:200kB active_file:2112kB inactive_file:1752kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:67108864kB managed:66056916kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:276kB slab_reclaimable:60956kB slab_unreclaimable:23439564kB kernel_stack:13536kB pagetables:141328kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:18448 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504797] lowmem_reserve[]: 0 0 0 0
Jun 23 17:20:10 server1 kernel: [517262.504799] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 1*2048kB (R) 3*4096kB (M) = 14804kB
Jun 23 17:20:10 server1 kernel: [517262.504807] Node 0 DMA32: 4660*4kB (UEM) 2172*8kB (EM) 1739*16kB (EM) 1046*32kB (UEM) 629*64kB (EM) 344*128kB (UEM) 155*256kB (E) 46*512kB (UE) 3*1024kB (E) 0*2048kB 0*4096kB = 247904kB
Jun 23 17:20:10 server1 kernel: [517262.504816] Node 0 Normal: 9038*4kB (M) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36152kB
Jun 23 17:20:10 server1 kernel: [517262.504822] Node 1 Normal: 9055*4kB (UM) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36220kB
Jun 23 17:20:10 server1 kernel: [517262.504829] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 23 17:20:10 server1 kernel: [517262.504830] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 23 17:20:10 server1 kernel: [517262.504831] 2467056 total pagecache pages
Jun 23 17:20:10 server1 kernel: [517262.504832] 0 pages in swap cache
Jun 23 17:20:10 server1 kernel: [517262.504833] Swap cache stats: add 0, delete 0, find 0/0
Jun 23 17:20:10 server1 kernel: [517262.504834] Free swap = 0kB
Jun 23 17:20:10 server1 kernel: [517262.504834] Total swap = 0kB
Jun 23 17:20:10 server1 kernel: [517262.504835] 33542792 pages RAM
Jun 23 17:20:10 server1 kernel: [517262.504836] 0 pages HighMem/MovableOnly
Jun 23 17:20:10 server1 kernel: [517262.504837] 262987 pages reserved
Jun 23 17:20:10 server1 kernel: [517262.504838] 0 pages hwpoisoned
Jun 23 17:20:10 server1 kernel: [517262.504839] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jun 23 17:20:10 server1 kernel: [517262.504866] [ 569] 0 569 4997 144 13 0 0 upstart-udev-br
Jun 23 17:20:10 server1 kernel: [517262.504868] [ 578] 0 578 12891 187 29 0 -1000 systemd-udevd
Jun 23 17:20:10 server1 kernel: [517262.504873] [ 692] 101 692 80659 2295 59 0 0 rsyslogd
Jun 23 17:20:10 server1 kernel: [517262.504875] [ 750] 0 750 4084 331 13 0 0 upstart-file-br
Jun 23 17:20:10 server1 kernel: [517262.504877] [ 792] 0 792 3815 53 13 0 0 upstart-socket-
Jun 23 17:20:10 server1 kernel: [517262.504879] [ 842] 111 842 27001 275 53 0 0 dbus-daemon
Jun 23 17:20:10 server1 kernel: [517262.504880] [ 851] 0 851 8834 101 22 0 0 systemd-logind
Jun 23 17:20:10 server1 kernel: [517262.504886] [ 1232] 0 1232 2558 572 8 0 0 dhclient
Jun 23 17:20:10 server1 kernel: [517262.504888] [ 1342] 104 1342 24484 281 49 0 0 ntpd
Jun 23 17:20:10 server1 kernel: [517262.504890] [ 1440] 0 1440 3955 41 12 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504891] [ 1443] 0 1443 3955 41 12 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504893] [ 1448] 0 1448 3955 39 13 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504895] [ 1450] 0 1450 3955 41 13 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504896] [ 1452] 0 1452 3955 42 13 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504898] [ 1469] 0 1469 4785 40 13 0 0 atd
Jun 23 17:20:10 server1 kernel: [517262.504900] [ 1470] 0 1470 15341 168 32 0 -1000 sshd
Jun 23 17:20:10 server1 kernel: [517262.504902] [ 1472] 0 1472 5914 65 17 0 0 cron
Jun 23 17:20:10 server1 kernel: [517262.504904] [ 1478] 999 1478 16020 3710 31 0 0 gmond
Jun 23 17:20:10 server1 kernel: [517262.504905] [ 1486] 0 1486 4821 65 14 0 0 irqbalance
Jun 23 17:20:10 server1 kernel: [517262.504907] [ 1500] 0 1500 343627 1730 85 0 0 nscd
Jun 23 17:20:10 server1 kernel: [517262.504909] [ 1559] 0 1559 1092 37 8 0 0 acpid
Jun 23 17:20:10 server1 kernel: [517262.504911] [ 1641] 0 1641 4978 71 13 0 0 master
Jun 23 17:20:10 server1 kernel: [517262.504913] [ 1650] 103 1650 5427 72 14 0 0 qmgr
Jun 23 17:20:10 server1 kernel: [517262.504917] [ 1895] 0 1895 1900 30 9 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504919] [ 1906] 1000 1906 2854329 2610 2594 0 0 thttpd
Jun 23 17:20:10 server1 kernel: [517262.504927] [ 3163] 1000 3163 2432 39 10 0 0 searchd
Jun 23 17:20:10 server1 kernel: [517262.504928] [ 3167] 1000 3167 2727221 2467025 4863 0 0 sphinx-daemon
Jun 23 17:20:10 server1 kernel: [517262.504931] [47622] 1000 47622 17834794 17329575 33989 0 0 MyExec
<.................Trimmed bunch of processes with low mem usage.......................................>
Jun 23 17:20:10 server1 kernel: [517262.508350] Out of memory: Kill process 47622 (MyExec) score 526 or sacrifice child
Jun 23 17:20:10 server1 kernel: [517262.508375] Killed process 47622 (MyExec) total-vm:71339176kB, anon-rss:69318300kB, file-rss:0kB
Looking at the following lines, it seems like the issue is fragmentation.
Jun 23 17:20:10 server1 kernel: [517262.504816] Node 0 Normal: 9038*4kB (M) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36152kB
Jun 23 17:20:10 server1 kernel: [517262.504822] Node 1 Normal: 9055*4kB (UM) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36220kB
I have no idea why the system would be so badly fragmented; it had only been running for 5 days when this happened. Also, looking at the process that invoked the OOM killer (gmond invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0), it seems it was only requesting order-0 (4 KB) blocks, and there are plenty of those available.
Is my understanding of fragmentation correct in this case?
How can I figure why the memory got so fragmented?
What can I do to avoid getting into this situation?
One thing you may notice is that I have completely turned off swap and set swappiness to 0, because the system has more than enough RAM and should never need to swap. I am planning to enable swap and set swappiness to 10, though I am not sure whether that helps in this case.
Thanks for your input.
Your understanding of fragmentation is incorrect here. The OOM killer was invoked because the free-memory watermarks were breached. Take a look at this:
Node 0 Normal free:34728kB min:42952kB low:53688kB
Node 1 Normal free:33484kB min:45096kB low:56368kB
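(For reference, these per-zone watermarks are derived from vm.min_free_kbytes, and you can inspect them on a live system with, for example: grep -A4 Normal /proc/zoneinfo)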
Both nodes are below their min watermark, the point at which allocations fail unless the kernel can reclaim memory; and the zone dump shows all_unreclaimable? yes, so the OOM killer ran. From the last few lines of the log you can see the kernel reports a total-vm usage of 71339176 kB (~71 GB) for the killed process; total-vm covers everything the process has mapped, which can be backed by both physical memory and swap. The log also shows a resident set (anon-rss) of about 69 GB.
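As a sanity check, these figures line up with the task-dump table above, which is denominated in 4 kB pages:
17834794 pages x 4 kB/page = 71339176 kB (total-vm)
17329575 pages x 4 kB/page = 69318300 kB (anon-rss)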
Is my understanding of fragmentation correct in this case?
If you were capturing system diagnostics (or an sosreport) around the time the issue occurred, check /proc/buddyinfo for memory fragmentation: it lists the number of free blocks at each allocation order, per node and zone. It's best to write a script that snapshots this file periodically if you are planning to reproduce the issue.
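A minimal sketch of such a snapshot loop (the log path is arbitrary):
while true; do date >> /var/tmp/buddyinfo.log; cat /proc/buddyinfo >> /var/tmp/buddyinfo.log; sleep 60; done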
How can I figure why the memory got so fragmented?
What can I do to avoid getting into this situation?
Sometimes applications commit more memory than the system is able to honour, which can lead to OOM. You may want to review the relevant kernel tunables (sysctl -a shows the current values) and try disabling memory overcommit:
vm.overcommit_memory=2
vm.overcommit_ratio=80
Note: after adding the above lines to /etc/sysctl.conf, it's best to restart the system.
vm.overcommit_memory: some applications allocate more virtual memory than is actually available on the system. The tunable takes three values:
0 - a heuristic overcommit algorithm is used (most likely your server is set to 0 or 1).
1 - always overcommit, regardless of whether memory is available or not.
2 - disallow overcommit: the kernel only lets applications commit up to swap plus overcommit_ratio percent of RAM (80% with the setting above), so commitments beyond swap + 80% of RAM fail.
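Note that these particular tunables can also be applied at runtime, without a reboot, to experiment before making them permanent:
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80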
Updating with slabinfo. This is after the node was rebooted:
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kvm_async_pf 0 0 136 30 1 : tunables 0 0 0 : slabdata 0 0 0
kvm_vcpu 0 0 16256 2 8 : tunables 0 0 0 : slabdata 0 0 0
kvm_mmu_page_header 0 0 168 48 2 : tunables 0 0 0 : slabdata 0 0 0
fusion_ioctx 5005 5005 296 55 4 : tunables 0 0 0 : slabdata 91 91 0
fusion_user_ll_request 0 0 3960 8 8 : tunables 0 0 0 : slabdata 0 0 0
ext4_groupinfo_4k 131670 131670 136 30 1 : tunables 0 0 0 : slabdata 4389 4389 0
ip6_dst_cache 1260 1260 384 42 4 : tunables 0 0 0 : slabdata 30 30 0
UDPLITEv6 0 0 1088 30 8 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 330 330 1088 30 8 : tunables 0 0 0 : slabdata 11 11 0
tw_sock_TCPv6 128 128 256 32 2 : tunables 0 0 0 : slabdata 4 4 0
TCPv6 288 288 1984 16 8 : tunables 0 0 0 : slabdata 18 18 0
kcopyd_job 0 0 3312 9 8 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2632 12 8 : tunables 0 0 0 : slabdata 0 0 0
cfq_queue 0 0 232 35 2 : tunables 0 0 0 : slabdata 0 0 0
bsg_cmd 0 0 312 52 4 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
fuse_request 0 0 416 39 4 : tunables 0 0 0 : slabdata 0 0 0
fuse_inode 0 0 768 42 8 : tunables 0 0 0 : slabdata 0 0 0
ecryptfs_key_record_cache 0 0 576 28 4 : tunables 0 0 0 : slabdata 0 0 0
ecryptfs_inode_cache 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
fat_inode_cache 0 0 712 46 8 : tunables 0 0 0 : slabdata 0 0 0
fat_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
hugetlbfs_inode_cache 54 54 600 54 8 : tunables 0 0 0 : slabdata 1 1 0
jbd2_journal_handle 2040 2040 48 85 1 : tunables 0 0 0 : slabdata 24 24 0
jbd2_journal_head 5071 5364 112 36 1 : tunables 0 0 0 : slabdata 149 149 0
jbd2_revoke_table_s 1792 1792 16 256 1 : tunables 0 0 0 : slabdata 7 7 0
jbd2_revoke_record_s 1536 1536 32 128 1 : tunables 0 0 0 : slabdata 12 12 0
ext4_inode_cache 75129 78771 984 33 8 : tunables 0 0 0 : slabdata 2387 2387 0
ext4_free_data 5952 6656 64 64 1 : tunables 0 0 0 : slabdata 104 104 0
ext4_allocation_context 768 768 128 32 1 : tunables 0 0 0 : slabdata 24 24 0
ext4_io_end 1344 1344 72 56 1 : tunables 0 0 0 : slabdata 24 24 0
ext4_extent_status 37921 38352 40 102 1 : tunables 0 0 0 : slabdata 376 376 0
dquot 768 768 256 32 2 : tunables 0 0 0 : slabdata 24 24 0
dnotify_mark 782 782 120 34 1 : tunables 0 0 0 : slabdata 23 23 0
pid_namespace 0 0 2192 14 8 : tunables 0 0 0 : slabdata 0 0 0
posix_timers_cache 0 0 248 33 2 : tunables 0 0 0 : slabdata 0 0 0
UDP-Lite 0 0 896 36 8 : tunables 0 0 0 : slabdata 0 0 0
xfrm_dst_cache 0 0 448 36 4 : tunables 0 0 0 : slabdata 0 0 0
ip_fib_trie 146 146 56 73 1 : tunables 0 0 0 : slabdata 2 2 0
UDP 828 828 896 36 8 : tunables 0 0 0 : slabdata 23 23 0
tw_sock_TCP 992 1152 256 32 2 : tunables 0 0 0 : slabdata 36 36 0
TCP 450 450 1792 18 8 : tunables 0 0 0 : slabdata 25 25 0
blkdev_queue 120 136 1896 17 8 : tunables 0 0 0 : slabdata 8 8 0
blkdev_requests 3358 3569 376 43 4 : tunables 0 0 0 : slabdata 83 83 0
blkdev_ioc 964 1287 104 39 1 : tunables 0 0 0 : slabdata 33 33 0
user_namespace 0 0 264 31 2 : tunables 0 0 0 : slabdata 0 0 0
sock_inode_cache 1377 1377 640 51 8 : tunables 0 0 0 : slabdata 27 27 0
net_namespace 0 0 4736 6 8 : tunables 0 0 0 : slabdata 0 0 0
shmem_inode_cache 2112 2112 672 48 8 : tunables 0 0 0 : slabdata 44 44 0
ftrace_event_file 1196 1196 88 46 1 : tunables 0 0 0 : slabdata 26 26 0
taskstats 196 196 328 49 4 : tunables 0 0 0 : slabdata 4 4 0
proc_inode_cache 63037 63250 648 50 8 : tunables 0 0 0 : slabdata 1265 1265 0
sigqueue 1224 1224 160 51 2 : tunables 0 0 0 : slabdata 24 24 0
bdev_cache 819 819 832 39 8 : tunables 0 0 0 : slabdata 21 21 0
kernfs_node_cache 54360 54360 112 36 1 : tunables 0 0 0 : slabdata 1510 1510 0
mnt_cache 510 510 320 51 4 : tunables 0 0 0 : slabdata 10 10 0
inode_cache 16813 19712 584 28 4 : tunables 0 0 0 : slabdata 704 704 0
dentry 144206 144606 192 42 2 : tunables 0 0 0 : slabdata 3443 3443 0
iint_cache 0 0 72 56 1 : tunables 0 0 0 : slabdata 0 0 0
buffer_head 6905641 6922305 104 39 1 : tunables 0 0 0 : slabdata 177495 177495 0
vm_area_struct 16764 16764 184 44 2 : tunables 0 0 0 : slabdata 381 381 0
mm_struct 1008 1008 896 36 8 : tunables 0 0 0 : slabdata 28 28 0
files_cache 1377 1377 640 51 8 : tunables 0 0 0 : slabdata 27 27 0
signal_cache 1380 1380 1088 30 8 : tunables 0 0 0 : slabdata 46 46 0
sighand_cache 1020 1020 2112 15 8 : tunables 0 0 0 : slabdata 68 68 0
task_xstate 1638 1638 832 39 8 : tunables 0 0 0 : slabdata 42 42 0
task_struct 837 855 6480 5 8 : tunables 0 0 0 : slabdata 171 171 0
Acpi-ParseExt 2968 2968 72 56 1 : tunables 0 0 0 : slabdata 53 53 0
Acpi-State 561 561 80 51 1 : tunables 0 0 0 : slabdata 11 11 0
Acpi-Namespace 3162 3162 40 102 1 : tunables 0 0 0 : slabdata 31 31 0
anon_vma 19313 19584 64 64 1 : tunables 0 0 0 : slabdata 306 306 0
shared_policy_node 7735 7735 48 85 1 : tunables 0 0 0 : slabdata 91 91 0
numa_policy 170 170 24 170 1 : tunables 0 0 0 : slabdata 1 1 0
radix_tree_node 2870899 2871624 584 28 4 : tunables 0 0 0 : slabdata 102558 102558 0
idr_layer_cache 555 555 2112 15 8 : tunables 0 0 0 : slabdata 37 37 0
dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-1024 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-512 0 0 512 32 4 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-256 0 0 256 32 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-192 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-8192 180 180 8192 4 8 : tunables 0 0 0 : slabdata 45 45 0
kmalloc-4096 636 720 4096 8 8 : tunables 0 0 0 : slabdata 90 90 0
kmalloc-2048 6498 6688 2048 16 8 : tunables 0 0 0 : slabdata 418 418 0
kmalloc-1024 4677 4800 1024 32 8 : tunables 0 0 0 : slabdata 150 150 0
kmalloc-512 9029 9056 512 32 4 : tunables 0 0 0 : slabdata 283 283 0
kmalloc-256 31542 31840 256 32 2 : tunables 0 0 0 : slabdata 995 995 0
kmalloc-192 16548 16548 192 42 2 : tunables 0 0 0 : slabdata 394 394 0
kmalloc-128 8449 8544 128 32 1 : tunables 0 0 0 : slabdata 267 267 0
kmalloc-96 20607 21462 96 42 1 : tunables 0 0 0 : slabdata 511 511 0
kmalloc-64 71408 75968 64 64 1 : tunables 0 0 0 : slabdata 1187 1187 0
kmalloc-32 5760 5760 32 128 1 : tunables 0 0 0 : slabdata 45 45 0
kmalloc-16 13824 13824 16 256 1 : tunables 0 0 0 : slabdata 54 54 0
kmalloc-8 45056 45056 8 512 1 : tunables 0 0 0 : slabdata 88 88 0
kmem_cache_node 551 576 64 64 1 : tunables 0 0 0 : slabdata 9 9 0
kmem_cache 256 256 256 32 2 : tunables 0 0 0 : slabdata 8 8 0
I'm having trouble with unexpected characters being sent on a USB port with the cdc_acm driver. What makes this all the more perplexing is that the code runs fine on Ubuntu 12.04 (3.2 kernel) but fails (the subject of this question) on CentOS 6 (3.6 kernel).
The USB device is a Bluegiga BLED112 Bluetooth Smart dongle. Its embedded microcontroller will reset any time there is unexpected input on its USB interface.
The test code opens the port, writes 4 bytes (a hello message), and expects to read a response. The read never completes because the unexpected characters cause the device to reset, which causes the hub to drop the device and re-enumerate.
To troubleshoot, here's what I've done:
Downloaded the source code for the cdc_acm driver and added a bunch of printk debug messages and stack dumps to follow what's going on.
I rmmod'd the "stock" cdc_acm and insmod'd my instrumented module. All the device enumeration works, the right driver attaches, etc.
Since the code works on Ubuntu 12.04 / Linux 3.2, I grabbed the 3.2 cdc_acm code and compiled that module on the CentOS 6 / Linux 3.6 platform. Using the 3.2 module instead of the 3.6 module did not make a difference, so I reverted to the 3.6 module.
Turned on the debug file system with usbmon and watched the USB traffic. I can see that there are extra characters being sent on the USB interface.
To watch what's going on, on top of the printk's in the cdc_acm module, I've merged the output of usbmon (cat /sys/kernel/debug/usb/usbmon/3u | logger) and the output of the test application (scan_example /dev/ttyACM0 | logger -s) so I have a single, time-correlated debug trail.
The spurious characters sent on the USB endpoint are x5E x40 x5E x40 x5E x40 x5E x41 (in ASCII, ^@^@^@^A), which looks like some sort of probing, or an attempt to get the attention of a modem. These characters are sent immediately after the application's write() causes the 4 hello bytes to be sent to the endpoint.
Since the cdc_acm device is supposed to be a modem, I tried to turn off the modem control by adding this to usb_device_id acm_ids[] in cdc_acm.c:
/* Bluegiga BLED112 */
{ USB_DEVICE(0x2458, 0x0001),
  .driver_info = NOT_A_MODEM,
},
Recompiled and insmod'd, and syslog shows that this was recognized (quirks is 8), but there was no change in behavior.
Neither NetworkManager nor modem-manager is running, but I still suspect that there is some sort of modem control function going on somewhere; I just don't know where.
Here's an annotated debug log (MDV prefixes the printk's that I added to cdc_acm):
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_bulk
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_done
Here are the 4 bytes sent by the application: 00 00 00 01
Feb 13 18:14:32 localhost cpcenter: df046a80 3672670191 C Bi:3:006:4 0 4 = 00000001
Feb 13 18:14:32 localhost cpcenter: 1360797272.669690 write: data2: len=0 contains:
... and these additional characters show up unexpectedly: 5e 40 5e 40 5e 40 ...
Feb 13 18:14:32 localhost cpcenter: df046a80 3672670232 S Bi:3:006:4 -115 128 <
Feb 13 18:14:32 localhost cpcenter: f3cc5740 3672670297 S Bo:3:006:4 -115 1 = 5e
Feb 13 18:14:32 localhost cpcenter: df2e1300 3672670332 S Bo:3:006:4 -115 1 = 40
Feb 13 18:14:32 localhost cpcenter: f3cc5740 3672670347 C Bo:3:006:4 0 1 >
Feb 13 18:14:32 localhost cpcenter: f3cc5740 3672670392 S Bo:3:006:4 -115 1 = 5e
Feb 13 18:14:32 localhost cpcenter: df2e1180 3672670426 S Bo:3:006:4 -115 1 = 40
Feb 13 18:14:32 localhost cpcenter: df2e1c00 3672670461 S Bo:3:006:4 -115 1 = 5e
Feb 13 18:14:32 localhost cpcenter: df2e1840 3672670496 S Bo:3:006:4 -115 1 = 40
Feb 13 18:14:32 localhost cpcenter: df2e1300 3672670591 C Bo:3:006:4 0 1 >
At this point we get a spontaneous disconnect.
Feb 13 18:14:32 localhost kernel: usb 3-1: USB disconnect, device number 6
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_bulk
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_done
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm read_bulk_callback
Feb 13 18:14:32 localhost kernel: MDV 1 acm_read_bulk_callback - urb 1, len 0
Feb 13 18:14:32 localhost kernel: MDV 3 acm_read_bulk_callback - non-zero urb status: -71
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_bulk
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_done
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm read_bulk_callback
Feb 13 18:14:32 localhost kernel: MDV 1 acm_read_bulk_callback - urb 1, len 0
Feb 13 18:14:32 localhost kernel: MDV 3 acm_read_bulk_callback - non-zero urb status: -71
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_bulk
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_done
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm read_bulk_callback
Feb 13 18:14:32 localhost kernel: MDV 1 acm_read_bulk_callback - urb 2, len 0
Feb 13 18:14:32 localhost cpcenter: df2e1d80 3672670629 S Bo:3:006:4 -115 1 = 5e
Feb 13 18:14:32 localhost kernel: MDV 3 acm_read_bulk_callback - non-zero urb status: -71
Feb 13 18:14:32 localhost cpcenter: df2e1300 3672670677 S Bo:3:006:4 -115 1 = 41
Feb 13 18:14:32 localhost cpcenter: f3cc5740 3672670802 C Bo:3:006:4 0 1 >
Feb 13 18:14:32 localhost cpcenter: df2e1180 3672671019 C Bo:3:006:4 0 1 >
Feb 13 18:14:32 localhost cpcenter: df2e1c00 3672671237 C Bo:3:006:4 0 1 >
Feb 13 18:14:32 localhost cpcenter: dfbf8c00 3672673193 C Ii:3:001:1 0:2048 1 = 02
Feb 13 18:14:32 localhost cpcenter: dfbf8c00 3672673207 S Ii:3:001:1 -115:2048 4 <
Feb 13 18:14:32 localhost cpcenter: f3c26c00 3672673221 S Ci:3:001:0 s a3 00 0000 0001 0004 4 <
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_disconnect
Feb 13 18:14:32 localhost kernel: Pid: 29, comm: khubd Tainted: G O 3.5.3-1.el6.elrepo.i686 #1
Stack trace at the time of disconnect
Feb 13 18:14:32 localhost kernel: Call Trace:
Feb 13 18:14:32 localhost kernel: [<f82dabc5>] acm_disconnect+0x35/0x1f0 [cdc_acm]
Feb 13 18:14:32 localhost kernel: [<c13835db>] usb_unbind_interface+0x4b/0x180
Feb 13 18:14:32 localhost cpcenter: f3c26c00 3672673239 C Ci:3:001:0 0 4 = 00010100
Feb 13 18:14:32 localhost kernel: [<c1318bfb>] __device_release_driver+0x5b/0xb0
Feb 13 18:14:32 localhost kernel: [<c1318d05>] device_release_driver+0x25/0x40
Feb 13 18:14:32 localhost kernel: [<c1317f0c>] bus_remove_device+0xcc/0x130
Feb 13 18:14:32 localhost kernel: [<c131612f>] ? device_remove_attrs+0x2f/0x90
Feb 13 18:14:32 localhost kernel: [<c1316275>] device_del+0xe5/0x180
Feb 13 18:14:32 localhost kernel: [<c1380326>] usb_disable_device+0x96/0x240
Feb 13 18:14:32 localhost kernel: [<c1379f91>] usb_disconnect+0x91/0x130
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_bulk
Feb 13 18:14:32 localhost kernel: [<c137a2c0>] hub_port_connect_change+0xb0/0xa60
Feb 13 18:14:32 localhost kernel: [<c1380f4e>] ? usb_control_msg+0xce/0xe0
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm acm_write_done
Feb 13 18:14:32 localhost kernel: [<c137b296>] hub_events+0x536/0x810
Feb 13 18:14:32 localhost cpcenter: f3c26c00 3672673243 S Co:3:001:0 s 23 01 0010 0001 0000 0
Feb 13 18:14:32 localhost cpcenter: f3c26c00 3672673250 C Co:3:001:0 0 0
Feb 13 18:14:32 localhost kernel: [<c1065bdf>] ? finish_wait+0x4f/0x70
Feb 13 18:14:32 localhost kernel: [<c137b5aa>] hub_thread+0x3a/0x1d0
Feb 13 18:14:32 localhost cpcenter: df2e1840 3672673260 C Bo:3:006:4 -71 0
Feb 13 18:14:32 localhost kernel: [<c1065a70>] ? wake_up_bit+0x30/0x30
Feb 13 18:14:32 localhost kernel: [<c137b570>] ? hub_events+0x810/0x810
Feb 13 18:14:32 localhost kernel: [<c106564c>] kthread+0x7c/0x90
Feb 13 18:14:32 localhost cpcenter: f3c16c80 3672673292 C Bi:3:006:4 -71 0
Feb 13 18:14:32 localhost cpcenter: df2e1d80 3672673453 C Bo:3:006:4 -71 0
Feb 13 18:14:32 localhost cpcenter: f3c16d40 3672673553 C Bi:3:006:4 -71 0
Feb 13 18:14:32 localhost kernel: [<c10655d0>] ? kthread_freezable_should_stop+0x60/0x60
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm read_bulk_callback
Feb 13 18:14:32 localhost kernel: [<c14dedbe>] kernel_thread_helper+0x6/0x10
Feb 13 18:14:32 localhost kernel: MDV 1 acm_read_bulk_callback - urb 3, len 0
Feb 13 18:14:32 localhost kernel: MDV:cdc-acm stop_data_traffic
Feb 13 18:14:32 localhost cpcenter: f3d19500 3672674474 C Ii:3:006:2 -108:64 0
Feb 13 18:14:32 localhost kernel: MDV 2 acm_read_bulk_callback - disconnected
Feb 13 18:14:32 localhost cpcenter: df2e1300 3672674636 C Bo:3:006:4 -71 0
Feb 13 18:14:32 localhost cpcenter: f3c16140 3672674753 C Bi:3:006:4 -71 0
The ^@^@^@^A string which is sent to your device is the result of echo performed by the terminal subsystem in the kernel, in response to incoming bytes from your device.
This line in your log:
Feb 13 18:14:32 localhost cpcenter: df046a80 3672670191 C Bi:3:006:4 0 4 = 00000001
actually means that your device sent 4 bytes to the computer (Bi means "Bulk endpoint, input"). By default all terminal devices have echo enabled, so the kernel echoes those bytes back to the device; but because those bytes are in the control-character range, they are echoed in caret-escaped form: ^@^@^@^A. The echoed bytes are also sent in separate 1-byte write calls, which correspond to the 1-byte bulk-out URBs in the subsequent log.
You need to fix your test program so that it turns off echo and other tty processing before trying to communicate with your device. The cfmakeraw() function can be used for this if your test program is in C/C++.
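For illustration, here is a minimal sketch in C of opening the port in raw mode before talking to the device. The device path /dev/ttyACM0 comes from your test command line; the hello bytes match your log, and the rest (buffer size, minimal error handling) is assumed for the example:
#define _DEFAULT_SOURCE   /* for the cfmakeraw() declaration on newer glibc; harmless elsewhere */
#include <fcntl.h>
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/ttyACM0", O_RDWR | O_NOCTTY);
    if (fd < 0) { perror("open"); return 1; }

    struct termios tio;
    if (tcgetattr(fd, &tio) < 0) { perror("tcgetattr"); return 1; }

    cfmakeraw(&tio);      /* clears ECHO, ICANON, and other input/output processing */
    tio.c_cc[VMIN] = 1;   /* read() blocks until at least one byte arrives */
    tio.c_cc[VTIME] = 0;

    if (tcsetattr(fd, TCSANOW, &tio) < 0) { perror("tcsetattr"); return 1; }

    /* only now is it safe to send the 4-byte hello and wait for the response */
    unsigned char hello[4] = { 0x00, 0x00, 0x00, 0x01 };
    if (write(fd, hello, sizeof(hello)) != (ssize_t)sizeof(hello)) perror("write");

    unsigned char buf[64];
    ssize_t n = read(fd, buf, sizeof(buf));
    printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}
The key point is that tcsetattr() runs before the first byte can arrive from the device; otherwise the kernel may already have echoed it back and triggered the reset.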
The program might be working on Ubuntu simply because some other program happens to touch the port before your test program runs and changes the port settings to turn off echo.
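One way to check that theory is to dump the port state on both systems immediately before running the test, for example:
stty -F /dev/ttyACM0 -a
If echo is enabled, the output will show "echo" rather than "-echo".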