Understanding odd OOM behaviour - linux

My server triggered the OOM killer and I am trying to understand why. The system has plenty of RAM (128 GB) and it looks like only around 70 GB of it was actually in use. Reading through previous questions about OOM, it looks like this might be a case of memory fragmentation. See the syslog output:
Jun 23 17:20:10 server1 kernel: [517262.504589] gmond invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Jun 23 17:20:10 server1 kernel: [517262.504593] gmond cpuset=/ mems_allowed=0-1
Jun 23 17:20:10 server1 kernel: [517262.504598] CPU: 4 PID: 1522 Comm: gmond Tainted: P OE 3.15.1-031501-lowlatency #201406161841
Jun 23 17:20:10 server1 kernel: [517262.504599] Hardware name: Dell Inc. PowerEdge R420/0K29HN, BIOS 2.3.3 07/10/2014
Jun 23 17:20:10 server1 kernel: [517262.504601] 0000000000000000 ffff880fce2ab848 ffffffff817746ec 0000000000000007
Jun 23 17:20:10 server1 kernel: [517262.504603] ffff880f74691950 ffff880fce2ab898 ffffffff8176a980 ffff880f00000000
Jun 23 17:20:10 server1 kernel: [517262.504605] 000201da81383df8 ffff881470376540 ffff881dcf7ab2a0 0000000000000000
Jun 23 17:20:10 server1 kernel: [517262.504607] Call Trace:
Jun 23 17:20:10 server1 kernel: [517262.504615] [<ffffffff817746ec>] dump_stack+0x4e/0x71
Jun 23 17:20:10 server1 kernel: [517262.504618] [<ffffffff8176a980>] dump_header+0x7e/0xbd
Jun 23 17:20:10 server1 kernel: [517262.504620] [<ffffffff8176aa16>] oom_kill_process.part.6+0x57/0x30a
Jun 23 17:20:10 server1 kernel: [517262.504623] [<ffffffff811654e7>] oom_kill_process+0x47/0x50
Jun 23 17:20:10 server1 kernel: [517262.504625] [<ffffffff81165825>] out_of_memory+0x145/0x1d0
Jun 23 17:20:10 server1 kernel: [517262.504628] [<ffffffff8116c1ba>] __alloc_pages_nodemask+0xb1a/0xc40
Jun 23 17:20:10 server1 kernel: [517262.504634] [<ffffffff811adba3>] alloc_pages_current+0xb3/0x180
Jun 23 17:20:10 server1 kernel: [517262.504636] [<ffffffff81161737>] __page_cache_alloc+0xb7/0xd0
Jun 23 17:20:10 server1 kernel: [517262.504638] [<ffffffff81163f80>] filemap_fault+0x280/0x430
Jun 23 17:20:10 server1 kernel: [517262.504642] [<ffffffff8118a0d9>] __do_fault+0x39/0x90
Jun 23 17:20:10 server1 kernel: [517262.504644] [<ffffffff8118e31e>] do_read_fault.isra.59+0x10e/0x1d0
Jun 23 17:20:10 server1 kernel: [517262.504646] [<ffffffff8118e870>] do_linear_fault.isra.61+0x70/0x80
Jun 23 17:20:10 server1 kernel: [517262.504647] [<ffffffff8118e986>] handle_pte_fault+0x76/0x1b0
Jun 23 17:20:10 server1 kernel: [517262.504652] [<ffffffff81095fe0>] ? lock_hrtimer_base.isra.25+0x30/0x60
Jun 23 17:20:10 server1 kernel: [517262.504654] [<ffffffff8118eea4>] __handle_mm_fault+0x1b4/0x360
Jun 23 17:20:10 server1 kernel: [517262.504655] [<ffffffff8118f101>] handle_mm_fault+0xb1/0x160
Jun 23 17:20:10 server1 kernel: [517262.504658] [<ffffffff81784667>] ? __do_page_fault+0x2b7/0x5a0
Jun 23 17:20:10 server1 kernel: [517262.504660] [<ffffffff81784522>] __do_page_fault+0x172/0x5a0
Jun 23 17:20:10 server1 kernel: [517262.504664] [<ffffffff8111fdec>] ? acct_account_cputime+0x1c/0x20
Jun 23 17:20:10 server1 kernel: [517262.504667] [<ffffffff810a73a9>] ? account_user_time+0x99/0xb0
Jun 23 17:20:10 server1 kernel: [517262.504669] [<ffffffff810a79dd>] ? vtime_account_user+0x5d/0x70
Jun 23 17:20:10 server1 kernel: [517262.504671] [<ffffffff8178498e>] do_page_fault+0x3e/0x80
Jun 23 17:20:10 server1 kernel: [517262.504673] [<ffffffff817811f8>] page_fault+0x28/0x30
Jun 23 17:20:10 server1 kernel: [517262.504674] Mem-Info:
Jun 23 17:20:10 server1 kernel: [517262.504675] Node 0 DMA per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504677] CPU 0: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504678] CPU 1: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504679] CPU 2: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504680] CPU 3: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504681] CPU 4: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504682] CPU 5: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504683] CPU 6: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504684] CPU 7: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504685] CPU 8: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504686] CPU 9: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504687] CPU 10: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504687] CPU 11: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504688] CPU 12: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504689] CPU 13: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504690] CPU 14: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504691] CPU 15: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504692] CPU 16: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504693] CPU 17: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504694] CPU 18: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504695] CPU 19: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504696] CPU 20: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504697] CPU 21: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504698] CPU 22: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504698] CPU 23: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504699] Node 0 DMA32 per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504701] CPU 0: hi: 186, btch: 31 usd: 30
Jun 23 17:20:10 server1 kernel: [517262.504702] CPU 1: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504703] CPU 2: hi: 186, btch: 31 usd: 34
Jun 23 17:20:10 server1 kernel: [517262.504704] CPU 3: hi: 186, btch: 31 usd: 27
Jun 23 17:20:10 server1 kernel: [517262.504705] CPU 4: hi: 186, btch: 31 usd: 30
Jun 23 17:20:10 server1 kernel: [517262.504705] CPU 5: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504706] CPU 6: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504707] CPU 7: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504708] CPU 8: hi: 186, btch: 31 usd: 173
Jun 23 17:20:10 server1 kernel: [517262.504709] CPU 9: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504710] CPU 10: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504711] CPU 11: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504712] CPU 12: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504713] CPU 13: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504714] CPU 14: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504715] CPU 15: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504716] CPU 16: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504717] CPU 17: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504718] CPU 18: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504719] CPU 19: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504720] CPU 20: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504721] CPU 21: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504722] CPU 22: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504722] CPU 23: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504723] Node 0 Normal per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504724] CPU 0: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504725] CPU 1: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504726] CPU 2: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504727] CPU 3: hi: 186, btch: 31 usd: 14
Jun 23 17:20:10 server1 kernel: [517262.504728] CPU 4: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504729] CPU 5: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504730] CPU 6: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504731] CPU 7: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504732] CPU 8: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504733] CPU 9: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504734] CPU 10: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504735] CPU 11: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504736] CPU 12: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504737] CPU 13: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504738] CPU 14: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504739] CPU 15: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504740] CPU 16: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504740] CPU 17: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504741] CPU 18: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504742] CPU 19: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504743] CPU 20: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504744] CPU 21: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504745] CPU 22: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504746] CPU 23: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504747] Node 1 Normal per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504748] CPU 0: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504749] CPU 1: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504750] CPU 2: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504751] CPU 3: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504752] CPU 4: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504753] CPU 5: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504754] CPU 6: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504755] CPU 7: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504756] CPU 8: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504757] CPU 9: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504758] CPU 10: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504758] CPU 11: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504759] CPU 12: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504760] CPU 13: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504761] CPU 14: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504762] CPU 15: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504763] CPU 16: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504764] CPU 17: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504765] CPU 18: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504766] CPU 19: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504767] CPU 20: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504768] CPU 21: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504769] CPU 22: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504770] CPU 23: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504773] active_anon:17833290 inactive_anon:2465707 isolated_anon:0
Jun 23 17:20:10 server1 kernel: [517262.504773] active_file:573 inactive_file:595 isolated_file:36
Jun 23 17:20:10 server1 kernel: [517262.504773] unevictable:0 dirty:4 writeback:0 unstable:0
Jun 23 17:20:10 server1 kernel: [517262.504773] free:82698 slab_reclaimable:43224 slab_unreclaimable:11476749
Jun 23 17:20:10 server1 kernel: [517262.504773] mapped:2465518 shmem:2465767 pagetables:66385 bounce:0
Jun 23 17:20:10 server1 kernel: [517262.504773] free_cma:0
Jun 23 17:20:10 server1 kernel: [517262.504776] Node 0 DMA free:14804kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15968kB managed:15828kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504779] lowmem_reserve[]: 0 2933 64370 64370
Jun 23 17:20:10 server1 kernel: [517262.504782] Node 0 DMA32 free:247776kB min:2048kB low:2560kB high:3072kB active_anon:1774744kB inactive_anon:607052kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3083200kB managed:3003592kB mlocked:0kB dirty:16kB writeback:0kB mapped:607068kB shmem:607068kB slab_reclaimable:25524kB slab_unreclaimable:302060kB kernel_stack:4928kB pagetables:3100kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2660 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504785] lowmem_reserve[]: 0 0 61436 61436
Jun 23 17:20:10 server1 kernel: [517262.504787] Node 0 Normal free:34728kB min:42952kB low:53688kB high:64428kB active_anon:30286072kB inactive_anon:9255576kB active_file:236kB inactive_file:640kB unevictable:0kB isolated(anon):0kB isolated(file):16kB present:63963136kB managed:62911420kB mlocked:0kB dirty:0kB writeback:0kB mapped:9255000kB shmem:9255724kB slab_reclaimable:86416kB slab_unreclaimable:22165372kB kernel_stack:21072kB pagetables:121112kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:13936 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504791] lowmem_reserve[]: 0 0 0 0
Jun 23 17:20:10 server1 kernel: [517262.504793] Node 1 Normal free:33484kB min:45096kB low:56368kB high:67644kB active_anon:39272344kB inactive_anon:200kB active_file:2112kB inactive_file:1752kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:67108864kB managed:66056916kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:276kB slab_reclaimable:60956kB slab_unreclaimable:23439564kB kernel_stack:13536kB pagetables:141328kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:18448 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504797] lowmem_reserve[]: 0 0 0 0
Jun 23 17:20:10 server1 kernel: [517262.504799] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 1*2048kB (R) 3*4096kB (M) = 14804kB
Jun 23 17:20:10 server1 kernel: [517262.504807] Node 0 DMA32: 4660*4kB (UEM) 2172*8kB (EM) 1739*16kB (EM) 1046*32kB (UEM) 629*64kB (EM) 344*128kB (UEM) 155*256kB (E) 46*512kB (UE) 3*1024kB (E) 0*2048kB 0*4096kB = 247904kB
Jun 23 17:20:10 server1 kernel: [517262.504816] Node 0 Normal: 9038*4kB (M) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36152kB
Jun 23 17:20:10 server1 kernel: [517262.504822] Node 1 Normal: 9055*4kB (UM) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36220kB
Jun 23 17:20:10 server1 kernel: [517262.504829] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 23 17:20:10 server1 kernel: [517262.504830] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 23 17:20:10 server1 kernel: [517262.504831] 2467056 total pagecache pages
Jun 23 17:20:10 server1 kernel: [517262.504832] 0 pages in swap cache
Jun 23 17:20:10 server1 kernel: [517262.504833] Swap cache stats: add 0, delete 0, find 0/0
Jun 23 17:20:10 server1 kernel: [517262.504834] Free swap = 0kB
Jun 23 17:20:10 server1 kernel: [517262.504834] Total swap = 0kB
Jun 23 17:20:10 server1 kernel: [517262.504835] 33542792 pages RAM
Jun 23 17:20:10 server1 kernel: [517262.504836] 0 pages HighMem/MovableOnly
Jun 23 17:20:10 server1 kernel: [517262.504837] 262987 pages reserved
Jun 23 17:20:10 server1 kernel: [517262.504838] 0 pages hwpoisoned
Jun 23 17:20:10 server1 kernel: [517262.504839] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jun 23 17:20:10 server1 kernel: [517262.504866] [ 569] 0 569 4997 144 13 0 0 upstart-udev-br
Jun 23 17:20:10 server1 kernel: [517262.504868] [ 578] 0 578 12891 187 29 0 -1000 systemd-udevd
Jun 23 17:20:10 server1 kernel: [517262.504873] [ 692] 101 692 80659 2295 59 0 0 rsyslogd
Jun 23 17:20:10 server1 kernel: [517262.504875] [ 750] 0 750 4084 331 13 0 0 upstart-file-br
Jun 23 17:20:10 server1 kernel: [517262.504877] [ 792] 0 792 3815 53 13 0 0 upstart-socket-
Jun 23 17:20:10 server1 kernel: [517262.504879] [ 842] 111 842 27001 275 53 0 0 dbus-daemon
Jun 23 17:20:10 server1 kernel: [517262.504880] [ 851] 0 851 8834 101 22 0 0 systemd-logind
Jun 23 17:20:10 server1 kernel: [517262.504886] [ 1232] 0 1232 2558 572 8 0 0 dhclient
Jun 23 17:20:10 server1 kernel: [517262.504888] [ 1342] 104 1342 24484 281 49 0 0 ntpd
Jun 23 17:20:10 server1 kernel: [517262.504890] [ 1440] 0 1440 3955 41 12 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504891] [ 1443] 0 1443 3955 41 12 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504893] [ 1448] 0 1448 3955 39 13 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504895] [ 1450] 0 1450 3955 41 13 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504896] [ 1452] 0 1452 3955 42 13 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504898] [ 1469] 0 1469 4785 40 13 0 0 atd
Jun 23 17:20:10 server1 kernel: [517262.504900] [ 1470] 0 1470 15341 168 32 0 -1000 sshd
Jun 23 17:20:10 server1 kernel: [517262.504902] [ 1472] 0 1472 5914 65 17 0 0 cron
Jun 23 17:20:10 server1 kernel: [517262.504904] [ 1478] 999 1478 16020 3710 31 0 0 gmond
Jun 23 17:20:10 server1 kernel: [517262.504905] [ 1486] 0 1486 4821 65 14 0 0 irqbalance
Jun 23 17:20:10 server1 kernel: [517262.504907] [ 1500] 0 1500 343627 1730 85 0 0 nscd
Jun 23 17:20:10 server1 kernel: [517262.504909] [ 1559] 0 1559 1092 37 8 0 0 acpid
Jun 23 17:20:10 server1 kernel: [517262.504911] [ 1641] 0 1641 4978 71 13 0 0 master
Jun 23 17:20:10 server1 kernel: [517262.504913] [ 1650] 103 1650 5427 72 14 0 0 qmgr
Jun 23 17:20:10 server1 kernel: [517262.504917] [ 1895] 0 1895 1900 30 9 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504919] [ 1906] 1000 1906 2854329 2610 2594 0 0 thttpd
Jun 23 17:20:10 server1 kernel: [517262.504927] [ 3163] 1000 3163 2432 39 10 0 0 searchd
Jun 23 17:20:10 server1 kernel: [517262.504928] [ 3167] 1000 3167 2727221 2467025 4863 0 0 sphinx-daemon
Jun 23 17:20:10 server1 kernel: [517262.504931] [47622] 1000 47622 17834794 17329575 33989 0 0 MyExec
<.................Trimmed bunch of processes with low mem usage.......................................>
Jun 23 17:20:10 server1 kernel: [517262.508350] Out of memory: Kill process 47622 (MyExec) score 526 or sacrifice child
Jun 23 17:20:10 server1 kernel: [517262.508375] Killed process 47622 (MyExec) total-vm:71339176kB, anon-rss:69318300kB, file-rss:0kB
Looking at the following lines, it seems like the issue is fragmentation:
Jun 23 17:20:10 server1 kernel: [517262.504816] Node 0 Normal: 9038*4kB (M) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36152kB
Jun 23 17:20:10 server1 kernel: [517262.504822] Node 1 Normal: 9055*4kB (UM) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36220kB
I have no idea why the system would be so badly fragmented. It had only been running for 5 days when this happened. Also, looking at the process that invoked the OOM killer (gmond invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0), it seems like it was only requesting 4K blocks, and there are a bunch of those available.
Is my understanding of fragmentation correct in this case?
How can I figure out why the memory got so fragmented?
What can I do to avoid getting into this situation?
One thing you may notice is that I have completely turned off swap and have swappiness set to 0. The reason is that my system has more than enough RAM and should never hit swap. I am planning to enable swap and set swappiness to 10, but I am not sure whether that helps in this case.
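For reference, a minimal sketch of the change described above (standard sysctl usage; paths are the usual defaults):
sysctl -w vm.swappiness=10                       # apply now
echo 'vm.swappiness=10' >> /etc/sysctl.conf      # persist across reboots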
Thanks for your input.

Your understanding of fragmentation is incorrect. The OOM was issued because the memory watermarks were breached. Take a look at this:
Node 0 Normal free:34728kB min:42952kB low:53688kB
Node 1 Normal free:33484kB min:45096kB low:56368kB

From the last few lines of the log you can see the kernel reports a total-vm usage of 71339176 kB (~71 GB) for the killed process; total-vm covers everything the process has mapped, backed by either physical memory or swap. Your log also shows a resident set of about 69 GB (anon-rss).
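The min/low/high figures above are the per-zone watermarks derived from the vm.min_free_kbytes tunable; with free below min on both Normal zones, even an order-0 allocation ends up in the OOM path, so fragmentation is not needed to explain this. You can check the tunable with:
sysctl vm.min_free_kbytes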
Is my understanding of fragmentation correct in this case?
If you are capturing system diagnostics around the time the issue occurs (or have an sosreport), check the /proc/buddyinfo file for memory fragmentation. It's best to write a script that backs this info up if you are planning to reproduce the problem.
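A minimal sketch of such a script (the output path and interval are arbitrary choices):
#!/bin/bash
# Append a timestamped snapshot of the buddy allocator state every 60 seconds,
# so fragmentation can be correlated with the time of the next OOM event.
OUT=/var/log/buddyinfo.log
while true; do
    echo "==== $(date) ====" >> "$OUT"
    cat /proc/buddyinfo      >> "$OUT"
    sleep 60
done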
How can I figure out why the memory got so fragmented?
What can I do to avoid getting into this situation?
Sometimes applications commit more memory than the system is able to honour, potentially leading to OOM. You may want to check the relevant kernel tunables (sysctl -a shows the current values) and try disabling memory overcommit:
vm.overcommit_memory=2
vm.overcommit_ratio=80
Note: after adding the above lines to /etc/sysctl.conf, run sysctl -p or restart the system for them to take effect.
vm.overcommit_memory: some applications need to allocate more virtual memory than is actually available on the system. It takes the following values:
0 - a heuristic overcommit algorithm is used (the default; your server is most likely set to 0 or 1)
1 - always overcommit, regardless of whether memory is available or not
2 - do not overcommit: the kernel only allows commits up to swap + vm.overcommit_ratio% of RAM, so vm.overcommit_ratio should also be set (e.g. to 80%). Anything beyond swap plus 80% of RAM is then refused at allocation time.
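With overcommit disabled (mode 2) you can verify the resulting limit from /proc/meminfo; CommitLimit should equal swap + overcommit_ratio% of RAM, and Committed_AS shows what is currently committed:
grep -E 'CommitLimit|Committed_AS' /proc/meminfo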

Update: adding slabinfo. This is after the node was rebooted.
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kvm_async_pf 0 0 136 30 1 : tunables 0 0 0 : slabdata 0 0 0
kvm_vcpu 0 0 16256 2 8 : tunables 0 0 0 : slabdata 0 0 0
kvm_mmu_page_header 0 0 168 48 2 : tunables 0 0 0 : slabdata 0 0 0
fusion_ioctx 5005 5005 296 55 4 : tunables 0 0 0 : slabdata 91 91 0
fusion_user_ll_request 0 0 3960 8 8 : tunables 0 0 0 : slabdata 0 0 0
ext4_groupinfo_4k 131670 131670 136 30 1 : tunables 0 0 0 : slabdata 4389 4389 0
ip6_dst_cache 1260 1260 384 42 4 : tunables 0 0 0 : slabdata 30 30 0
UDPLITEv6 0 0 1088 30 8 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 330 330 1088 30 8 : tunables 0 0 0 : slabdata 11 11 0
tw_sock_TCPv6 128 128 256 32 2 : tunables 0 0 0 : slabdata 4 4 0
TCPv6 288 288 1984 16 8 : tunables 0 0 0 : slabdata 18 18 0
kcopyd_job 0 0 3312 9 8 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2632 12 8 : tunables 0 0 0 : slabdata 0 0 0
cfq_queue 0 0 232 35 2 : tunables 0 0 0 : slabdata 0 0 0
bsg_cmd 0 0 312 52 4 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
fuse_request 0 0 416 39 4 : tunables 0 0 0 : slabdata 0 0 0
fuse_inode 0 0 768 42 8 : tunables 0 0 0 : slabdata 0 0 0
ecryptfs_key_record_cache 0 0 576 28 4 : tunables 0 0 0 : slabdata 0 0 0
ecryptfs_inode_cache 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
fat_inode_cache 0 0 712 46 8 : tunables 0 0 0 : slabdata 0 0 0
fat_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
hugetlbfs_inode_cache 54 54 600 54 8 : tunables 0 0 0 : slabdata 1 1 0
jbd2_journal_handle 2040 2040 48 85 1 : tunables 0 0 0 : slabdata 24 24 0
jbd2_journal_head 5071 5364 112 36 1 : tunables 0 0 0 : slabdata 149 149 0
jbd2_revoke_table_s 1792 1792 16 256 1 : tunables 0 0 0 : slabdata 7 7 0
jbd2_revoke_record_s 1536 1536 32 128 1 : tunables 0 0 0 : slabdata 12 12 0
ext4_inode_cache 75129 78771 984 33 8 : tunables 0 0 0 : slabdata 2387 2387 0
ext4_free_data 5952 6656 64 64 1 : tunables 0 0 0 : slabdata 104 104 0
ext4_allocation_context 768 768 128 32 1 : tunables 0 0 0 : slabdata 24 24 0
ext4_io_end 1344 1344 72 56 1 : tunables 0 0 0 : slabdata 24 24 0
ext4_extent_status 37921 38352 40 102 1 : tunables 0 0 0 : slabdata 376 376 0
dquot 768 768 256 32 2 : tunables 0 0 0 : slabdata 24 24 0
dnotify_mark 782 782 120 34 1 : tunables 0 0 0 : slabdata 23 23 0
pid_namespace 0 0 2192 14 8 : tunables 0 0 0 : slabdata 0 0 0
posix_timers_cache 0 0 248 33 2 : tunables 0 0 0 : slabdata 0 0 0
UDP-Lite 0 0 896 36 8 : tunables 0 0 0 : slabdata 0 0 0
xfrm_dst_cache 0 0 448 36 4 : tunables 0 0 0 : slabdata 0 0 0
ip_fib_trie 146 146 56 73 1 : tunables 0 0 0 : slabdata 2 2 0
UDP 828 828 896 36 8 : tunables 0 0 0 : slabdata 23 23 0
tw_sock_TCP 992 1152 256 32 2 : tunables 0 0 0 : slabdata 36 36 0
TCP 450 450 1792 18 8 : tunables 0 0 0 : slabdata 25 25 0
blkdev_queue 120 136 1896 17 8 : tunables 0 0 0 : slabdata 8 8 0
blkdev_requests 3358 3569 376 43 4 : tunables 0 0 0 : slabdata 83 83 0
blkdev_ioc 964 1287 104 39 1 : tunables 0 0 0 : slabdata 33 33 0
user_namespace 0 0 264 31 2 : tunables 0 0 0 : slabdata 0 0 0
sock_inode_cache 1377 1377 640 51 8 : tunables 0 0 0 : slabdata 27 27 0
net_namespace 0 0 4736 6 8 : tunables 0 0 0 : slabdata 0 0 0
shmem_inode_cache 2112 2112 672 48 8 : tunables 0 0 0 : slabdata 44 44 0
ftrace_event_file 1196 1196 88 46 1 : tunables 0 0 0 : slabdata 26 26 0
taskstats 196 196 328 49 4 : tunables 0 0 0 : slabdata 4 4 0
proc_inode_cache 63037 63250 648 50 8 : tunables 0 0 0 : slabdata 1265 1265 0
sigqueue 1224 1224 160 51 2 : tunables 0 0 0 : slabdata 24 24 0
bdev_cache 819 819 832 39 8 : tunables 0 0 0 : slabdata 21 21 0
kernfs_node_cache 54360 54360 112 36 1 : tunables 0 0 0 : slabdata 1510 1510 0
mnt_cache 510 510 320 51 4 : tunables 0 0 0 : slabdata 10 10 0
inode_cache 16813 19712 584 28 4 : tunables 0 0 0 : slabdata 704 704 0
dentry 144206 144606 192 42 2 : tunables 0 0 0 : slabdata 3443 3443 0
iint_cache 0 0 72 56 1 : tunables 0 0 0 : slabdata 0 0 0
buffer_head 6905641 6922305 104 39 1 : tunables 0 0 0 : slabdata 177495 177495 0
vm_area_struct 16764 16764 184 44 2 : tunables 0 0 0 : slabdata 381 381 0
mm_struct 1008 1008 896 36 8 : tunables 0 0 0 : slabdata 28 28 0
files_cache 1377 1377 640 51 8 : tunables 0 0 0 : slabdata 27 27 0
signal_cache 1380 1380 1088 30 8 : tunables 0 0 0 : slabdata 46 46 0
sighand_cache 1020 1020 2112 15 8 : tunables 0 0 0 : slabdata 68 68 0
task_xstate 1638 1638 832 39 8 : tunables 0 0 0 : slabdata 42 42 0
task_struct 837 855 6480 5 8 : tunables 0 0 0 : slabdata 171 171 0
Acpi-ParseExt 2968 2968 72 56 1 : tunables 0 0 0 : slabdata 53 53 0
Acpi-State 561 561 80 51 1 : tunables 0 0 0 : slabdata 11 11 0
Acpi-Namespace 3162 3162 40 102 1 : tunables 0 0 0 : slabdata 31 31 0
anon_vma 19313 19584 64 64 1 : tunables 0 0 0 : slabdata 306 306 0
shared_policy_node 7735 7735 48 85 1 : tunables 0 0 0 : slabdata 91 91 0
numa_policy 170 170 24 170 1 : tunables 0 0 0 : slabdata 1 1 0
radix_tree_node 2870899 2871624 584 28 4 : tunables 0 0 0 : slabdata 102558 102558 0
idr_layer_cache 555 555 2112 15 8 : tunables 0 0 0 : slabdata 37 37 0
dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-1024 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-512 0 0 512 32 4 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-256 0 0 256 32 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-192 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-8192 180 180 8192 4 8 : tunables 0 0 0 : slabdata 45 45 0
kmalloc-4096 636 720 4096 8 8 : tunables 0 0 0 : slabdata 90 90 0
kmalloc-2048 6498 6688 2048 16 8 : tunables 0 0 0 : slabdata 418 418 0
kmalloc-1024 4677 4800 1024 32 8 : tunables 0 0 0 : slabdata 150 150 0
kmalloc-512 9029 9056 512 32 4 : tunables 0 0 0 : slabdata 283 283 0
kmalloc-256 31542 31840 256 32 2 : tunables 0 0 0 : slabdata 995 995 0
kmalloc-192 16548 16548 192 42 2 : tunables 0 0 0 : slabdata 394 394 0
kmalloc-128 8449 8544 128 32 1 : tunables 0 0 0 : slabdata 267 267 0
kmalloc-96 20607 21462 96 42 1 : tunables 0 0 0 : slabdata 511 511 0
kmalloc-64 71408 75968 64 64 1 : tunables 0 0 0 : slabdata 1187 1187 0
kmalloc-32 5760 5760 32 128 1 : tunables 0 0 0 : slabdata 45 45 0
kmalloc-16 13824 13824 16 256 1 : tunables 0 0 0 : slabdata 54 54 0
kmalloc-8 45056 45056 8 512 1 : tunables 0 0 0 : slabdata 88 88 0
kmem_cache_node 551 576 64 64 1 : tunables 0 0 0 : slabdata 9 9 0
kmem_cache 256 256 256 32 2 : tunables 0 0 0 : slabdata 8 8 0

Related

Identify Container Memory (MEM) consumption (what uses all the memory?)

I have a container (registry.access.redhat.com/ubi8/ubi-minimal) which runs a bash script (moving files) in an indefinite loop.
This test runs with 900 files every minute, and each file is ~1 KB (just a small XML).
Here is part of the YAML file, including the command executed by the pod:
command: ["/bin/sh", "-c", "shopt -s nullglob && while true ; do for f in $vfsourcefolder/*.xml ; do randomNum=$(shuf -i $FolderStartNumber-$FolderEndNumber -n 1) ; mkdir -p $vfsourcefolder/$vfsubfolderprefix$randomNum ; mv $f $_ ; done ; done"]
livenessProbe:
  exec:
    command: ["/bin/sh", "-c", "test $checkFiles -gt $(ls -f $vfsourcefolder | wc -l)"]
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 3
resources:
  requests:
    memory: 256Mi
    cpu: 25m
  limits:
    memory: 4Gi
    cpu: 2
After running for ~3 days it consumes 3 GB of memory (according to kubectl top):
tilo@myserver:/$ kubectl top pod fs-probe-spreader1-0
NAME CPU(cores) MEMORY(bytes)
fs-probe-spreader1-0 217m 3207Mi
But I can't find out what takes all the memory.
The slabinfo shows lots of objects in cifs_inode_cache and dentry (see the rough arithmetic after the slabinfo output below).
Here are stats from the pod:
ps aux
top -b
df -TPh
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/memory.stat
cat /sys/fs/cgroup/memory/memory.kmem.slabinfo
[root@fs-probe-spreader1-0 /]# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 6.5 0.0 13148 3136 ? Ds Jan18 279:58 /bin/sh -c shopt -s nullglob && while true ; do for f in $vfsourcefolder/*.xml ; do randomNum=$(shuf -i $FolderStar
root 1266813 0.0 0.0 19352 3764 pts/0 Ss 23:12 0:00 bash
root 1372717 0.0 0.0 1092036 9720 ? Rsl 23:45 0:00 /usr/bin/runc init
root 1372719 0.0 0.0 51860 3676 pts/0 R+ 23:45 0:00 ps aux
[root@fs-probe-spreader1-0 /]# top -b
top - 23:53:56 up 4 days, 2:52, 0 users, load average: 2.46, 2.31, 2.26
Tasks: 3 total, 1 running, 2 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.0 us, 0.0 sy, 0.0 ni, 95.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 16009.0 total, 507.8 free, 6127.2 used, 9374.0 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 9551.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 13148 3136 2544 D 6.7 0.0 280:56.60 sh
1398222 root 20 0 19352 3716 3148 S 0.0 0.0 0:00.01 bash
1401883 root 20 0 56192 4208 3660 R 0.0 0.0 0:00.00 top
[root@fs-probe-spreader1-0 /]# df -TPh
Filesystem Type Size Used Avail Use% Mounted on
overlay overlay 97G 23G 75G 24% /
tmpfs tmpfs 64M 0 64M 0% /dev
tmpfs tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
//myshare1.file.core.windows.net/mainfs cifs 100G 28G 73G 28% /trex/root
/dev/sda1 ext4 97G 23G 75G 24% /etc/hosts
shm tmpfs 64M 0 64M 0% /dev/shm
tmpfs tmpfs 4.0G 12K 4.0G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs tmpfs 7.9G 0 7.9G 0% /proc/acpi
tmpfs tmpfs 7.9G 0 7.9G 0% /proc/scsi
tmpfs tmpfs 7.9G 0 7.9G 0% /sys/firmware
[root@fs-probe-spreader1-0 /]# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
3374436352
[root@fs-probe-spreader1-0 /]# cat /sys/fs/cgroup/memory/memory.stat
cache 19505152
rss 1482752
rss_huge 0
shmem 0
mapped_file 0
dirty 135168
writeback 0
pgpgin 989469294
pgpgout 989464142
pgfault 2149218225
pgmajfault 0
inactive_anon 0
active_anon 1368064
inactive_file 6352896
active_file 13246464
unevictable 0
hierarchical_memory_limit 4294967296
total_cache 19505152
total_rss 1482752
total_rss_huge 0
total_shmem 0
total_mapped_file 0
total_dirty 135168
total_writeback 0
total_pgpgin 989469294
total_pgpgout 989464142
total_pgfault 2149218225
total_pgmajfault 0
total_inactive_anon 0
total_active_anon 1368064
total_inactive_file 6352896
total_active_file 13246464
total_unevictable 0
[root@fs-probe-spreader1-0 /]# cat /sys/fs/cgroup/memory/memory.kmem.slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc-rcl-128 64 64 128 32 1 : tunables 0 0 0 : slabdata 2 2 0
TCP 42 42 2240 14 8 : tunables 0 0 0 : slabdata 3 3 0
kmalloc-rcl-64 320 320 64 64 1 : tunables 0 0 0 : slabdata 5 5 0
kmalloc-rcl-96 126 126 96 42 1 : tunables 0 0 0 : slabdata 3 3 0
radix_tree_node 252 252 584 28 4 : tunables 0 0 0 : slabdata 9 9 0
UDPv6 96 96 1344 24 8 : tunables 0 0 0 : slabdata 4 4 0
kmalloc-96 168 168 96 42 1 : tunables 0 0 0 : slabdata 4 4 0
kmalloc-2k 64 64 2048 16 8 : tunables 0 0 0 : slabdata 4 4 0
cifs_inode_cache 3454395 3454395 776 21 4 : tunables 0 0 0 : slabdata 164495 164495 0
kmalloc-8 2048 2048 8 512 1 : tunables 0 0 0 : slabdata 4 4 0
buffer_head 5460 5460 104 39 1 : tunables 0 0 0 : slabdata 140 140 0
ext4_inode_cache 290 290 1096 29 8 : tunables 0 0 0 : slabdata 10 10 0
shmem_inode_cache 66 66 720 22 4 : tunables 0 0 0 : slabdata 3 3 0
ovl_inode 736 736 688 23 4 : tunables 0 0 0 : slabdata 32 32 0
pde_opener 408 408 40 102 1 : tunables 0 0 0 : slabdata 4 4 0
eventpoll_pwq 224 224 72 56 1 : tunables 0 0 0 : slabdata 4 4 0
kmalloc-1k 64 64 1024 16 4 : tunables 0 0 0 : slabdata 4 4 0
kmalloc-32 512 512 32 128 1 : tunables 0 0 0 : slabdata 4 4 0
kmalloc-4k 32 32 4096 8 8 : tunables 0 0 0 : slabdata 4 4 0
kmalloc-512 64 64 512 16 2 : tunables 0 0 0 : slabdata 4 4 0
skbuff_head_cache 64 64 256 16 1 : tunables 0 0 0 : slabdata 4 4 0
kmalloc-192 84 84 192 21 1 : tunables 0 0 0 : slabdata 4 4 0
inode_cache 104 104 608 26 4 : tunables 0 0 0 : slabdata 4 4 0
pid 128 128 128 32 1 : tunables 0 0 0 : slabdata 4 4 0
anon_vma 2028 2028 104 39 1 : tunables 0 0 0 : slabdata 52 52 0
vm_area_struct 837 912 208 19 1 : tunables 0 0 0 : slabdata 48 48 0
mm_struct 120 120 1088 30 8 : tunables 0 0 0 : slabdata 4 4 0
signal_cache 112 112 1152 28 8 : tunables 0 0 0 : slabdata 4 4 0
sighand_cache 60 60 2112 15 8 : tunables 0 0 0 : slabdata 4 4 0
anon_vma_chain 1957 2368 64 64 1 : tunables 0 0 0 : slabdata 37 37 0
files_cache 92 92 704 23 4 : tunables 0 0 0 : slabdata 4 4 0
task_delay_info 204 204 80 51 1 : tunables 0 0 0 : slabdata 4 4 0
kmalloc-64 3264 3264 64 64 1 : tunables 0 0 0 : slabdata 51 51 0
cred_jar 1323 1323 192 21 1 : tunables 0 0 0 : slabdata 63 63 0
task_struct 33 52 7680 4 8 : tunables 0 0 0 : slabdata 13 13 0
PING 64 64 1024 16 4 : tunables 0 0 0 : slabdata 4 4 0
sock_inode_cache 76 76 832 19 4 : tunables 0 0 0 : slabdata 4 4 0
proc_inode_cache 432 432 680 24 4 : tunables 0 0 0 : slabdata 18 18 0
dentry 3346497 3346497 192 21 1 : tunables 0 0 0 : slabdata 159357 159357 0
filp 576 576 256 16 1 : tunables 0 0 0 : slabdata 36 36 0
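A rough back-of-the-envelope check against the slabinfo above, using the objsize column:
cifs_inode_cache: 3454395 objects * 776 bytes ≈ 2.68 GB
dentry:           3346497 objects * 192 bytes ≈ 0.64 GB
Together that is ≈ 3.3 GB, which lines up with memory.usage_in_bytes (3374436352 bytes) far better than the ~21 MB of cache + rss reported in memory.stat, i.e. the bulk of the usage is kernel slab (CIFS inodes and dentries) charged to the cgroup.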

cassandra node was put down with oom error

A Cassandra node went down due to OOM, and checking /var/log/messages I see the below.
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java cpuset=/ mems_allowed=0
....
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA32: 1294*4kB (UM) 932*8kB (UEM) 897*16kB (UEM) 483*32kB (UEM) 224*64kB (UEM) 114*128kB (UEM) 41*256kB (UEM) 12*512kB (UEM) 7*1024kB (UEM) 2*2048kB (EM) 35*4096kB (UM) = 242632kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 Normal: 5319*4kB (UE) 3233*8kB (UEM) 960*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 62500kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 38109 total pagecache pages
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages in swap cache
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Swap cache stats: add 0, delete 0, find 0/0
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Free swap = 0kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Total swap = 0kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 16394647 pages RAM
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages HighMem/MovableOnly
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 310559 pages reserved
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2634] 0 2634 41614 326 82 0 0 systemd-journal
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2690] 0 2690 29793 541 27 0 0 lvmetad
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2710] 0 2710 11892 762 25 0 -1000 systemd-udevd
.....
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [13774] 0 13774 459778 97729 429 0 0 Scan Factory
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14506] 0 14506 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14586] 0 14586 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14588] 0 14588 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14589] 0 14589 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14598] 0 14598 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14599] 0 14599 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14600] 0 14600 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14601] 0 14601 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19679] 0 19679 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19680] 0 19680 21628 5340 24 0 0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 9084] 1007 9084 2822449 260291 810 0 0 java
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 8509] 1007 8509 17223585 14908485 32510 0 0 java
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21877] 0 21877 461828 97716 318 0 0 ScanAction Mgr
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21884] 0 21884 496653 98605 340 0 0 OAS Manager
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [31718] 89 31718 25474 486 48 0 0 pickup
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4891] 1007 4891 26999 191 9 0 0 iostat
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4957] 1007 4957 26999 192 10 0 0 iostat
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Out of memory: Kill process 8509 (java) score 928 or sacrifice child
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Killed process 8509 (java) total-vm:68894340kB, anon-rss:59496344kB, file-rss:137596kB, shmem-rss:0kB
Nothing else runs on this host except DSE Cassandra with search, plus monitoring agents. The max heap size is set to 31g, and the Cassandra java process seems to have been using ~57 GB (RAM is 62 GB) at the time of the error.
So I am guessing the JVM started using a lot of memory and triggered the OOM error.
Is my understanding correct?
That is, this is a Linux-triggered kill of the JVM because the JVM was consuming more than the available memory?
In that case the JVM was using at most 31g of heap, and the remaining ~26 GB it was using is non-heap memory. Normally this process takes around 42 GB, and the fact that at the moment of the OOM it was consuming 57 GB makes me suspect the java process is the culprit rather than the victim.
At the time of the issue no heap dump was taken; I have configured it now. But even if a heap dump had been taken, would it have helped figure out what is consuming the extra memory? A heap dump only covers the heap area, so what should be used to dump the non-heap memory? Native Memory Tracking is one thing I came across.
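For reference, a minimal sketch of Native Memory Tracking on a HotSpot JVM (<pid>, the extra flags and the main class are placeholders; the flag must be set at JVM startup and adds some overhead):
# start the JVM with NMT enabled
java -XX:NativeMemoryTracking=summary -Xmx31g <other flags> <main class>
# while the process is alive, snapshot native allocations by category
jcmd <pid> VM.native_memory summary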
Is there any way to have native memory dumped when the OOM occurs?
What's the best way to monitor JVM memory to diagnose OOM errors?
This may not be helpful..
You may not get a heap dump, because the oom-killer is a kernel feature; the JVM has no chance to write one.
And SIGKILL cannot be caught and does not generate a core dump (the Unix default action).
http://programmergamer.blogspot.com/2013/05/clarification-on-sigint-sigterm-sigkill.html

How to check where the memory went when the oom-killer is invoked in Linux

In our system, the oom-killer is invoked when we have high throughput. We believe that most of the memory is being consumed by a kernel driver, but we can't find the dedicated consumer; any suggestions would be really appreciated.
Below is the detailed dmesg log:
[14839.077171] passkey-agent invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[14839.077187] CPU: 0 PID: 3443 Comm: passkey-agent Tainted: G O 4.1.35-rt41 #1
[14839.077190] Hardware name: LS1043A RDB Board (DT)
[14839.077193] Call trace:
[14839.079644] [<ffff8000000898f4>] dump_backtrace+0x0/0x154
[14839.079650] [<ffff800000089a5c>] show_stack+0x14/0x1c
[14839.079656] [<ffff8000008f3174>] dump_stack+0x90/0xb0
[14839.079663] [<ffff80000013eea4>] dump_header.isra.10+0x88/0x1b8
[14839.079668] [<ffff80000013f5f8>] oom_kill_process+0x210/0x3d4
[14839.079672] [<ffff80000013faec>] __out_of_memory.isra.15+0x330/0x374
[14839.079676] [<ffff80000013fd5c>] out_of_memory+0x5c/0x80
[14839.079682] [<ffff8000001442d8>] __alloc_pages_nodemask+0x55c/0x7c4
[14839.079687] [<ffff80000013e0cc>] filemap_fault+0x188/0x400
[14839.079693] [<ffff80000015f424>] __do_fault+0x3c/0x98
[14839.079698] [<ffff8000001641c8>] handle_mm_fault+0xc28/0x14f8
[14839.079704] [<ffff800000094c04>] do_page_fault+0x224/0x2b4
[14839.079709] [<ffff8000000822a0>] do_mem_abort+0x40/0xa0
[14839.079713] Exception stack(0xffff80001e47be20 to 0xffff80001e47bf50)
[14839.079719] be20: 00000000 00000000 000001f4 00000000 ffffffff ffffffff a6f90990 0000ffff
[14839.079725] be40: ffffffff ffffffff 3b969772 00000000 dbbcc280 0000ffff 00085db0 ffff8000
[14839.079730] be60: 00000000 00000000 000001f4 00000000 ffffffff ffffffff a6f90990 0000ffff
[14839.079736] be80: 1e47bea0 ffff8000 000895f8 ffff8000 00000008 00000000 00085b90 ffff8000
[14839.079742] bea0: dbbcc280 0000ffff 00085c9c ffff8000 00000000 00000000 0ee34088 00000000
[14839.079747] bec0: 00000000 00000000 00000001 00000000 dbbcc2b0 0000ffff 00000000 00000000
[14839.079752] bee0: 00000000 00000000 00000000 00000000 000f4240 00000000 00000000 00000000
[14839.079758] bf00: 00000049 00000000 0000001c 00000000 0000011b 00000000 00000013 00000000
[14839.079763] bf20: 00000028 00000000 00000000 00000000 a71c1c20 0000ffff 00000000 003b9aca
[14839.079767] bf40: a720a990 0000ffff a6f90918 0000ffff
[14839.082683] Mem-Info:
[14839.082700] active_anon:16910 inactive_anon:6202 isolated_anon:0
active_file:15 inactive_file:0 isolated_file:26
unevictable:62887 dirty:0 writeback:0 unstable:0
slab_reclaimable:944 slab_unreclaimable:8027
mapped:5421 shmem:2349 pagetables:527 bounce:0
free:5120 free_pcp:627 free_cma:0
[14839.082719] DMA free:20480kB min:22528kB low:28160kB high:33792kB active_anon:67640kB inactive_anon:24808kB active_file:60kB inactive_file:0kB unevictable:251548kB isolated(anon):0kB isolated(file):104kB present:1046528kB managed:890652kB mlocked:251548kB dirty:0kB writeback:0kB mapped:21684kB shmem:9396kB slab_reclaimable:3776kB slab_unreclaimable:32108kB kernel_stack:6064kB pagetables:2108kB unstable:0kB bounce:0kB free_pcp:2508kB local_pcp:424kB free_cma:0kB writeback_tmp:0kB pages_scanned:208 all_unreclaimable? no
[14839.082723] lowmem_reserve[]: 0 0 0
[14839.082729] DMA: 755*4kB (EM) 486*8kB (UEM) 617*16kB (UEM) 2*32kB (M) 1*64kB (R) 2*128kB (R) 1*256kB (R) 0*512kB 1*1024kB (R) 1*2048kB (R) 0*4096kB = 20492kB
[14839.082752] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[14839.082755] 7756 total pagecache pages
[14839.082760] 0 pages in swap cache
[14839.082763] Swap cache stats: add 0, delete 0, find 0/0
[14839.082765] Free swap = 0kB
[14839.082768] Total swap = 0kB
[14839.082856] 261632 pages RAM
[14839.082858] 0 pages HighMem/MovableOnly
[14839.082861] 34873 pages reserved
[14839.082863] 4096 pages cma reserved
[14839.082867] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[14839.082890] [ 1353] 0 1353 876 161 7 3 0 -1000 udevd
[14839.082899] [ 1863] 999 1863 695 48 5 3 0 0 dbus-daemon
[14839.082906] [ 1944] 0 1944 833 23 5 3 0 0 syslogd
[14839.082913] [ 1947] 0 1947 833 18 5 3 0 0 klogd
[14839.082919] [ 1990] 0 1990 2307 686 8 2 0 0 php-fpm
[14839.082925] [ 1991] 65534 1991 2307 857 8 2 0 0 php-fpm
[14839.082932] [ 1992] 65534 1992 2307 857 8 2 0 0 php-fpm
[14839.082938] [ 1999] 0 1999 720 31 5 3 0 0 bash
[14839.083042] [ 2001] 0 2001 1083 393 6 3 0 0 start_appli
[14839.083049] [ 2010] 0 2010 849 26 5 3 0 0 getty
[14839.083055] [ 2115] 0 2115 1262 96 6 4 0 -1000 sshd
[14839.083062] [ 3051] 0 3051 2709 210 6 2 0 0 optf_write
[14839.083068] [ 3052] 0 3052 1719 686 7 2 0 0 launcher
[14839.083074] [ 3055] 0 3055 5056 4196 13 2 0 0 globMW0
[14839.083081] [ 3066] 0 3066 10430 6805 27 2 0 0 confd
[14839.083088] [ 3085] 0 3085 9735 7449 23 2 0 0 hal0
[14839.083095] [ 3086] 0 3086 7781 6642 19 2 0 0 SystemMgr
[14839.083102] [ 3087] 0 3087 7455 6372 20 2 0 0 HWMgr
[14839.083108] [ 3088] 0 3088 8319 7118 20 2 0 0 SWMgr
[14839.083115] [ 3089] 0 3089 7824 6696 19 2 0 0 FaultMgr
[14839.083121] [ 3090] 0 3090 7488 6359 20 2 0 0 TSMgr
[14839.083127] [ 3091] 0 3091 7009 6144 20 2 0 0 SecurityMgr
[14839.083133] [ 3092] 0 3092 7736 6337 20 2 0 0 DHCPRelayMgr
[14839.083225] [ 3093] 0 3093 8747 6555 21 2 0 0 ItfMgr
[14839.083232] [ 3094] 0 3094 8192 6686 21 2 0 0 WlanItfMgr
[14839.083239] [ 3095] 0 3095 7602 6518 20 2 0 0 L2Mgr
[14839.083246] [ 3096] 0 3096 7399 6017 20 2 0 0 QoSMgr
[14839.083252] [ 3097] 0 3097 8647 6486 21 2 0 0 L3Mgr
[14839.083258] [ 3098] 0 3098 7482 6356 17 2 0 0 MulticastMgr
[14839.083264] [ 3099] 0 3099 7783 6609 21 2 0 0 DHCPMgr
[14839.083271] [ 3100] 0 3100 6864 6409 16 2 0 0 CallHomeMgr
[14839.083279] [ 3422] 0 3422 472 23 4 3 0 0 hciattach
[14839.083286] [ 3426] 0 3426 1035 50 6 3 0 0 bluetoothd
[14839.083292] [ 3443] 0 3443 2039 112 8 3 0 0 passkey-agent
[14839.083298] [ 3462] 0 3462 3852 2368 11 3 0 0 dhcpd
[14839.083304] [ 3517] 0 3517 860 161 7 3 0 -1000 udevd
[14839.083393] [ 3518] 0 3518 860 161 7 3 0 -1000 udevd
[14839.083400] [ 3650] 0 3650 1629 132 6 3 0 0 wpa_supplicant
[14839.083406] [ 3720] 0 3720 3134 1711 10 3 0 0 dhclient
[14839.083412] [ 3747] 0 3747 891 149 6 3 0 0 zebra
[14839.083419] [ 3751] 0 3751 834 132 7 3 0 0 ripd
[14839.083425] [ 3949] 0 3949 1037 67 6 4 0 0 ntpd
[14839.083431] [ 8000] 0 8000 721 33 5 3 0 0 sh
[14839.083436] Out of memory: Kill process 3085 (hal0) score 32 or sacrifice child
[14839.083447] Killed process 3085 (hal0) total-vm:38940kB, anon-rss:15236kB, file-rss:14560kB
We have 1 GB of memory in total, and I can see that slab occupies about 35 MB (3776 kB + 32108 kB), the kernel stack is 6064 kB, and active_anon + inactive_anon is about 92 MB (67640 kB + 24808 kB); user-space memory consumption is normal, as usual.
So where did the rest of the memory go? How can I check it?
For example, how can I check how much memory is consumed by a particular driver, such as the driver for a PCIe network card?
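A few generic starting points for inspecting kernel-side memory (a sketch; vmallocinfo attributes allocations to the calling function, which often points at the responsible module):
cat /proc/meminfo          # overall picture of where memory sits
slabtop -o                 # slab caches sorted by size, one-shot output
cat /proc/vmallocinfo      # vmalloc allocations with their callers (needs root)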

determine vm size of process killed by oom-killer

Is there any way to determine the virtual memory size of a process at the time it is killed by the Linux oom-killer?
I can't find any field in /var/log/messages that tells me the total VM size of the process being killed. There is lots of other information available in /var/log/messages, but not the total VM size of the process.
This is a CentOS 5.7 x64 machine.
Following are the contents of /var/log/messages:
Mar 1 18:51:45 c42 kernel: NameService invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Mar 1 18:51:45 c42 kernel:
Mar 1 18:51:46 c42 kernel: Call Trace:
Mar 1 18:51:46 c42 kernel: [<ffffffff800c9d3a>] out_of_memory+0x8e/0x2f3
Mar 1 18:51:46 c42 kernel: [<ffffffff8002dfd7>] __wake_up+0x38/0x4f
Mar 1 18:51:46 c42 kernel: [<ffffffff8000f677>] __alloc_pages+0x27f/0x308
Mar 1 18:51:46 c42 kernel: [<ffffffff80013034>] __do_page_cache_readahead+0x96/0x17b
Mar 1 18:51:46 c42 kernel: [<ffffffff80013971>] filemap_nopage+0x14c/0x360
Mar 1 18:51:46 c42 kernel: [<ffffffff8000896c>] __handle_mm_fault+0x1fd/0x103b
Mar 1 18:51:46 c42 kernel: [<ffffffff800671f2>] do_page_fault+0x499/0x842
Mar 1 18:51:46 c42 kernel: [<ffffffff80031143>] do_fork+0x148/0x1c1
Mar 1 18:51:46 c42 kernel: [<ffffffff8005dde9>] error_exit+0x0/0x84
Mar 1 18:51:46 c42 kernel:
Mar 1 18:51:46 c42 kernel: Mem-info:
Mar 1 18:51:47 c42 kernel: Node 0 DMA per-cpu:
Mar 1 18:51:48 c42 kernel: cpu 0 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 0 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 1 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 1 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 2 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 2 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 3 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 3 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 4 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 4 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 5 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 5 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 6 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 6 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 7 hot: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 7 cold: high 0, batch 1 used:0
Mar 1 18:51:48 c42 kernel: cpu 8 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 8 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 9 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 9 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 10 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 10 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 11 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 11 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 12 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 12 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 13 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 13 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 14 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 14 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 15 hot: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: cpu 15 cold: high 0, batch 1 used:0
Mar 1 18:51:49 c42 kernel: Node 0 DMA32 per-cpu:
Mar 1 18:51:49 c42 kernel: cpu 0 hot: high 186, batch 31 used:31
Mar 1 18:51:49 c42 kernel: cpu 0 cold: high 62, batch 15 used:35
............
Mar 1 18:51:58 c42 kernel: cpu 14 cold: high 62, batch 15 used:18
Mar 1 18:51:58 c42 kernel: cpu 15 hot: high 186, batch 31 used:6
Mar 1 18:51:59 c42 kernel: cpu 15 cold: high 62, batch 15 used:14
Mar 1 18:51:59 c42 kernel: Node 1 HighMem per-cpu: empty
Mar 1 18:51:59 c42 kernel: Free pages: 50396kB (0kB HighMem)
Mar 1 18:51:59 c42 kernel: Active:1559270 inactive:2490421 dirty:0 writeback:0 unstable:0 free:12599 slab:8740 mapped-file:1186 mapped-anon:4051463 pagetables:16277
Mar 1 18:51:59 c42 kernel: Node 0 DMA free:10068kB min:8kB low:8kB high:12kB active:0kB inactive:0kB present:9660kB pages_scanned:0 all_unreclaimable? yes
Mar 1 18:51:59 c42 kernel: lowmem_reserve[]: 0 1965 8025 8025
Mar 1 18:51:59 c42 kernel: Node 0 DMA32 free:26176kB min:1980kB low:2472kB high:2968kB active:1020328kB inactive:922224kB present:2012496kB pages_scanned:4075359 all_unreclaimable? yes
Mar 1 18:51:59 c42 kernel: lowmem_reserve[]: 0 0 6060 6060
Mar 1 18:51:59 c42 kernel: Node 0 Normal free:6060kB min:6108kB low:7632kB high:9160kB active:490800kB inactive:5569172kB present:6205440kB pages_scanned:21679912 all_unreclaimable? yes
Mar 1 18:51:59 c42 kernel: lowmem_reserve[]: 0 0 0 0
Mar 1 18:51:59 c42 kernel: Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Mar 1 18:52:00 c42 kernel: lowmem_reserve[]: 0 0 0 0
Mar 1 18:52:00 c42 kernel: Node 1 DMA free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Mar 1 18:52:00 c42 kernel: lowmem_reserve[]: 0 0 8080 8080
Mar 1 18:52:00 c42 kernel: Node 1 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Mar 1 18:52:00 c42 kernel: lowmem_reserve[]: 0 0 8080 8080
Mar 1 18:52:00 c42 kernel: Node 1 Normal free:8092kB min:8144kB low:10180kB high:12216kB active:4725952kB inactive:3470288kB present:8273920kB pages_scanned:15611005 all_unreclaimable? yes
Mar 1 18:52:00 c42 kernel: lowmem_reserve[]: 0 0 0 0
Mar 1 18:52:00 c42 kernel: Node 1 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Mar 1 18:52:01 c42 kernel: lowmem_reserve[]: 0 0 0 0
Mar 1 18:52:02 c42 kernel: Node 0 DMA: 5*4kB 2*8kB 5*16kB 5*32kB 5*64kB 2*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 2*4096kB = 10068kB
Mar 1 18:52:02 c42 kernel: Node 0 DMA32: 30*4kB 1*8kB 0*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 6*4096kB = 26176kB
Mar 1 18:52:02 c42 kernel: Node 0 Normal: 9*4kB 7*8kB 3*16kB 1*32kB 0*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 6060kB
Mar 1 18:52:02 c42 kernel: Node 0 HighMem: empty
Mar 1 18:52:03 c42 kernel: Node 1 DMA: empty
Mar 1 18:52:03 c42 kernel: Node 1 DMA32: empty
Mar 1 18:52:03 c42 kernel: Node 1 Normal: 49*4kB 3*8kB 0*16kB 0*32kB 1*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8092kB
Mar 1 18:52:03 c42 kernel: Node 1 HighMem: empty
Mar 1 18:52:03 c42 kernel: 1624 pagecache pages
Mar 1 18:52:04 c42 kernel: Swap cache: add 2581210, delete 2580953, find 6957/9192, race 0+16
Mar 1 18:52:04 c42 kernel: Free swap = 0kB
Mar 1 18:52:04 c42 kernel: Total swap = 10241428kB
Mar 1 18:52:04 c42 kernel: Free swap: 0kB
Mar 1 18:52:06 c42 kernel: 4718592 pages of RAM
Mar 1 18:52:06 c42 kernel: 616057 reserved pages
Mar 1 18:52:07 c42 kernel: 17381 pages shared
Mar 1 18:52:08 c42 kernel: 260 pages swap cached
Mar 1 18:52:09 c42 kernel: Out of memory: Killed process 16727, UID 501, (ApplicationMoni).
In Linux, total memory is the sum of physical memory and swap space, i.e. RAM + swap.
Whenever a process gets killed, you get the OOM score of the killed process in the kernel log.
By observing the top command and the oom_score of the process, I figured out that:
oom_score roughly corresponds to the share of total memory the process was using, expressed out of 1000 (so score / 10 ≈ percent of RAM + swap).
For example: my system has 16 GB of RAM and 1 GB of swap, so total memory is 17 GB.
A Tomcat process got killed with an oom score of '602'; then Tomcat's usage was greater than or roughly equal to 60.2% of total memory, i.e. Tomcat was occupying 10.23+ GB.
Here is another example:
a score of 249 means memory usage of 24.9+ %.
This is reported in dmesg, after the stack trace that caused the crash (usually a memory allocation request)
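For a live process you can read the same score directly from procfs (a quick sketch; <pid> is a placeholder):
cat /proc/<pid>/oom_score
cat /proc/<pid>/oom_adj        # oom_score_adj on newer kernels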

Understanding the Linux oom-killer's logs

My app was killed by the oom-killer. It is Ubuntu 11.10 running from a live USB with no swap, and the PC has 1 GB of RAM. The only app running (other than all the built-in Ubuntu stuff) is my program flasherav. Note that /tmp is memory mapped and at the time of the crash had about 200 MB of files in it (so was taking up ~200 MB of RAM).
I'm trying to understand how to analyze the oom-killer log so that I can see where exactly all the memory is being used, i.e. what are the different chunks that add up to ~1 GB and caused the oom-killer to kick in? Once I understand that, I can work on reducing the offender's usage so the app will run on a machine with 1 GB of RAM. My specific questions are:
To try to analyze the situation, I summed up the "total_vm" column and I only get 609342 KB (which, when added to the 200 MB in /tmp, is still only 809 MB). Maybe I'm wrong about what the "total_vm" column is: does it include allocated-but-unused memory plus shared memory? If yes, then shouldn't it far overstate the memory actually used (and therefore I shouldn't be out of memory), right? Are there other chunks of memory in use that aren't accounted for in the list below?
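One unit detail worth checking when doing this sum: in the kernel's OOM process table the total_vm and rss columns are counted in pages rather than kB, so with 4 kB pages a total_vm entry of, say, 10000 corresponds to 10000 * 4 kB = 40 MB.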
[11686.040460] flasherav invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
[11686.040467] flasherav cpuset=/ mems_allowed=0
[11686.040472] Pid: 2859, comm: flasherav Not tainted 3.0.0-12-generic #20-Ubuntu
[11686.040476] Call Trace:
[11686.040488] [<c10e1c15>] dump_header.isra.7+0x85/0xc0
[11686.040493] [<c10e1e6c>] oom_kill_process+0x5c/0x80
[11686.040498] [<c10e225f>] out_of_memory+0xbf/0x1d0
[11686.040503] [<c10e6123>] __alloc_pages_nodemask+0x6c3/0x6e0
[11686.040509] [<c10e78d3>] ? __do_page_cache_readahead+0xe3/0x170
[11686.040514] [<c10e0fc8>] filemap_fault+0x218/0x390
[11686.040519] [<c1001c24>] ? __switch_to+0x94/0x1a0
[11686.040525] [<c10fb5ee>] __do_fault+0x3e/0x4b0
[11686.040530] [<c1069971>] ? enqueue_hrtimer+0x21/0x80
[11686.040535] [<c10fec2c>] handle_pte_fault+0xec/0x220
[11686.040540] [<c10fee68>] handle_mm_fault+0x108/0x210
[11686.040546] [<c152fa00>] ? vmalloc_fault+0xee/0xee
[11686.040551] [<c152fb5b>] do_page_fault+0x15b/0x4a0
[11686.040555] [<c1069a90>] ? update_rmtp+0x80/0x80
[11686.040560] [<c106a7b6>] ? hrtimer_start_range_ns+0x26/0x30
[11686.040565] [<c106aeaf>] ? sys_nanosleep+0x4f/0x60
[11686.040569] [<c152fa00>] ? vmalloc_fault+0xee/0xee
[11686.040574] [<c152cfcf>] error_code+0x67/0x6c
[11686.040580] [<c1520000>] ? reserve_backup_gdb.isra.11+0x26d/0x2c0
[11686.040583] Mem-Info:
[11686.040585] DMA per-cpu:
[11686.040588] CPU 0: hi: 0, btch: 1 usd: 0
[11686.040592] CPU 1: hi: 0, btch: 1 usd: 0
[11686.040594] Normal per-cpu:
[11686.040597] CPU 0: hi: 186, btch: 31 usd: 5
[11686.040600] CPU 1: hi: 186, btch: 31 usd: 30
[11686.040603] HighMem per-cpu:
[11686.040605] CPU 0: hi: 42, btch: 7 usd: 7
[11686.040608] CPU 1: hi: 42, btch: 7 usd: 22
[11686.040613] active_anon:113150 inactive_anon:113378 isolated_anon:0
[11686.040615] active_file:86 inactive_file:1964 isolated_file:0
[11686.040616] unevictable:0 dirty:0 writeback:0 unstable:0
[11686.040618] free:13274 slab_reclaimable:2239 slab_unreclaimable:2594
[11686.040619] mapped:1387 shmem:4380 pagetables:1375 bounce:0
[11686.040627] DMA free:4776kB min:784kB low:980kB high:1176kB active_anon:5116kB inactive_anon:5472kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15804kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:80kB slab_unreclaimable:168kB kernel_stack:96kB pagetables:64kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:6 all_unreclaimable? yes
[11686.040634] lowmem_reserve[]: 0 865 1000 1000
[11686.040644] Normal free:48212kB min:44012kB low:55012kB high:66016kB active_anon:383196kB inactive_anon:383704kB active_file:344kB inactive_file:7884kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:885944kB mlocked:0kB dirty:0kB writeback:0kB mapped:5548kB shmem:17520kB slab_reclaimable:8876kB slab_unreclaimable:10208kB kernel_stack:1960kB pagetables:3976kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:930 all_unreclaimable? yes
[11686.040652] lowmem_reserve[]: 0 0 1078 1078
[11686.040662] HighMem free:108kB min:132kB low:1844kB high:3560kB active_anon:64288kB inactive_anon:64336kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:138072kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:1460kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:61 all_unreclaimable? yes
[11686.040669] lowmem_reserve[]: 0 0 0 0
[11686.040675] DMA: 20*4kB 24*8kB 34*16kB 26*32kB 19*64kB 13*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4784kB
[11686.040690] Normal: 819*4kB 607*8kB 357*16kB 176*32kB 99*64kB 49*128kB 23*256kB 4*512kB 0*1024kB 0*2048kB 2*4096kB = 48212kB
[11686.040704] HighMem: 16*4kB 0*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 80kB
[11686.040718] 14680 total pagecache pages
[11686.040721] 8202 pages in swap cache
[11686.040724] Swap cache stats: add 2191074, delete 2182872, find 1247325/1327415
[11686.040727] Free swap = 0kB
[11686.040729] Total swap = 524284kB
[11686.043240] 262100 pages RAM
[11686.043244] 34790 pages HighMem
[11686.043246] 5610 pages reserved
[11686.043248] 2335 pages shared
[11686.043250] 240875 pages non-shared
[11686.043253] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[11686.043266] [ 1084] 0 1084 662 1 0 0 0 upstart-udev-br
[11686.043271] [ 1094] 0 1094 743 79 0 -17 -1000 udevd
[11686.043276] [ 1104] 101 1104 7232 42 0 0 0 rsyslogd
[11686.043281] [ 1149] 103 1149 1066 188 1 0 0 dbus-daemon
[11686.043286] [ 1165] 0 1165 1716 66 0 0 0 modem-manager
[11686.043291] [ 1220] 106 1220 861 42 0 0 0 avahi-daemon
[11686.043296] [ 1221] 106 1221 829 0 1 0 0 avahi-daemon
[11686.043301] [ 1255] 0 1255 6880 117 0 0 0 NetworkManager
[11686.043306] [ 1308] 0 1308 5988 144 0 0 0 polkitd
[11686.043311] [ 1334] 0 1334 723 85 0 -17 -1000 udevd
[11686.043316] [ 1335] 0 1335 730 108 0 -17 -1000 udevd
[11686.043320] [ 1375] 0 1375 663 37 0 0 0 upstart-socket-
[11686.043325] [ 1464] 0 1464 1333 120 1 0 0 login
[11686.043330] [ 1467] 0 1467 1333 135 1 0 0 login
[11686.043335] [ 1486] 0 1486 1333 135 1 0 0 login
[11686.043339] [ 1487] 0 1487 1333 136 1 0 0 login
[11686.043344] [ 1493] 0 1493 1333 134 1 0 0 login
[11686.043349] [ 1528] 0 1528 496 45 0 0 0 acpid
[11686.043354] [ 1529] 0 1529 607 46 1 0 0 cron
[11686.043359] [ 1549] 0 1549 10660 100 0 0 0 lightdm
[11686.043363] [ 1550] 0 1550 570 28 0 0 0 atd
[11686.043368] [ 1584] 0 1584 855 35 0 0 0 irqbalance
[11686.043373] [ 1703] 0 1703 17939 9653 0 0 0 Xorg
[11686.043378] [ 1874] 0 1874 7013 174 0 0 0 console-kit-dae
[11686.043382] [ 1958] 0 1958 1124 52 1 0 0 bluetoothd
[11686.043388] [ 2048] 999 2048 2435 641 1 0 0 bash
[11686.043392] [ 2049] 999 2049 2435 595 0 0 0 bash
[11686.043397] [ 2050] 999 2050 2435 587 1 0 0 bash
[11686.043402] [ 2051] 999 2051 2435 634 1 0 0 bash
[11686.043406] [ 2054] 999 2054 2435 569 0 0 0 bash
[11686.043411] [ 2155] 0 2155 1333 128 0 0 0 login
[11686.043416] [ 2222] 0 2222 684 67 1 0 0 dhclient
[11686.043420] [ 2240] 999 2240 2435 415 0 0 0 bash
[11686.043425] [ 2244] 0 2244 3631 58 0 0 0 accounts-daemon
[11686.043430] [ 2258] 999 2258 11683 277 0 0 0 gnome-session
[11686.043435] [ 2407] 999 2407 964 24 0 0 0 ssh-agent
[11686.043440] [ 2410] 999 2410 937 53 0 0 0 dbus-launch
[11686.043444] [ 2411] 999 2411 1319 300 1 0 0 dbus-daemon
[11686.043449] [ 2413] 999 2413 2287 88 0 0 0 gvfsd
[11686.043454] [ 2418] 999 2418 7867 123 1 0 0 gvfs-fuse-daemo
[11686.043459] [ 2427] 999 2427 32720 804 0 0 0 gnome-settings-
[11686.043463] [ 2437] 999 2437 10750 124 0 0 0 gnome-keyring-d
[11686.043468] [ 2442] 999 2442 2321 244 1 0 0 gconfd-2
[11686.043473] [ 2447] 0 2447 6490 156 0 0 0 upowerd
[11686.043478] [ 2467] 999 2467 7590 87 0 0 0 dconf-service
[11686.043482] [ 2529] 999 2529 11807 211 0 0 0 gsd-printer
[11686.043487] [ 2531] 999 2531 12162 587 0 0 0 metacity
[11686.043492] [ 2535] 999 2535 19175 960 0 0 0 unity-2d-panel
[11686.043496] [ 2536] 999 2536 19408 1012 0 0 0 unity-2d-launch
[11686.043502] [ 2539] 999 2539 16154 1120 1 0 0 nautilus
[11686.043506] [ 2540] 999 2540 17888 534 0 0 0 nm-applet
[11686.043511] [ 2541] 999 2541 7005 253 0 0 0 polkit-gnome-au
[11686.043516] [ 2544] 999 2544 8930 430 0 0 0 bamfdaemon
[11686.043521] [ 2545] 999 2545 11217 442 1 0 0 bluetooth-apple
[11686.043525] [ 2547] 999 2547 510 16 0 0 0 sh
[11686.043530] [ 2548] 999 2548 11205 301 1 0 0 gnome-fallback-
[11686.043535] [ 2565] 999 2565 6614 179 1 0 0 gvfs-gdu-volume
[11686.043539] [ 2567] 0 2567 5812 164 1 0 0 udisks-daemon
[11686.043544] [ 2571] 0 2571 1580 69 0 0 0 udisks-daemon
[11686.043549] [ 2579] 999 2579 16354 1035 0 0 0 unity-panel-ser
[11686.043554] [ 2602] 0 2602 1188 47 0 0 0 sudo
[11686.043559] [ 2603] 0 2603 374634 181503 0 0 0 flasherav
[11686.043564] [ 2607] 999 2607 12673 189 0 0 0 indicator-appli
[11686.043569] [ 2609] 999 2609 19313 311 1 0 0 indicator-datet
[11686.043573] [ 2611] 999 2611 15738 225 0 0 0 indicator-messa
[11686.043578] [ 2615] 999 2615 17433 237 1 0 0 indicator-sessi
[11686.043583] [ 2627] 999 2627 2393 132 0 0 0 gvfsd-trash
[11686.043588] [ 2640] 999 2640 1933 85 0 0 0 geoclue-master
[11686.043592] [ 2650] 0 2650 2498 1136 1 0 0 mount.ntfs
[11686.043598] [ 2657] 999 2657 6624 128 1 0 0 telepathy-indic
[11686.043602] [ 2659] 999 2659 2246 112 0 0 0 mission-control
[11686.043607] [ 2662] 999 2662 5431 346 1 0 0 gdu-notificatio
[11686.043612] [ 2664] 0 2664 3716 2392 0 0 0 mount.ntfs
[11686.043617] [ 2679] 999 2679 12453 197 1 0 0 zeitgeist-datah
[11686.043621] [ 2685] 999 2685 5196 1581 1 0 0 zeitgeist-daemo
[11686.043626] [ 2934] 999 2934 16305 710 0 0 0 gnome-terminal
[11686.043631] [ 2938] 999 2938 553 0 0 0 0 gnome-pty-helpe
[11686.043636] [ 2939] 999 2939 1814 406 0 0 0 bash
[11686.043641] Out of memory: Kill process 2603 (flasherav) score 761 or sacrifice child
[11686.043647] Killed process 2603 (flasherav) total-vm:1498536kB, anon-rss:721784kB, file-rss:4228kB
Memory management in Linux is a bit tricky to understand, and I can't say I fully understand it yet, but I'll try to share a little of my experience and knowledge.
Short answer to your question: yes, there is other memory in use besides what's in the list.
What's shown in your list are the applications running in userspace. The kernel also uses memory for itself and its modules, and on top of that it maintains a lower limit of free memory that you can't go below. Once that level is reached it tries to free up resources, and when it can't do that anymore you end up with an OOM condition.
From the last line of your list you can read that the kernel reports a total-vm usage of 1498536kB (~1.5GB) for the killed process, where total-vm spans both physical RAM and swap space. You stated you don't have any swap, but the kernel seems to think otherwise, since your swap space is reported to be full (Total swap = 524284kB, Free swap = 0kB) and the total virtual size is ~1.5GB.
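If you want to double-check what the kernel actually sees as swap (a minimal sketch; on a live USB session the swap may come from an auto-activated partition on an attached disk), the usual places to look are:

free -k            # MemTotal / SwapTotal as the kernel sees them
swapon -s          # lists active swap devices, if any
cat /proc/swaps    # same information, straight from the kernel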
Another thing that can complicate matters is memory fragmentation. You can hit the OOM killer when the kernel tries to allocate, say, 4096kB of contiguous memory and no free block of that size is available.
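You can get a rough picture of fragmentation from the buddy allocator's free lists; each column is the number of free blocks of a given order (order 0 = 4kB, order 1 = 8kB, and so on), so rows that taper to zeros on the right mean there are no large contiguous blocks left:

cat /proc/buddyinfo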
Now, that alone probably won't help you solve the actual problem. I don't know whether it's normal for your program to require that amount of memory, but I would recommend trying a static code analyzer like cppcheck to check for memory leaks or file descriptor leaks. You could also run it under Valgrind to get more information about its memory usage.
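For example (a minimal sketch; flasherav stands in for whatever binary you are debugging, and the options are just a common starting point):

cppcheck --enable=all src/                      # static analysis of the sources
valgrind --leak-check=full ./flasherav          # report unfreed heap blocks at exit
valgrind --tool=massif ./flasherav              # heap profile over time
ms_print massif.out.*                           # render the massif output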
The sum of the total_vm column is 847170 and the sum of rss is 214726; both values are counted in 4kB pages, which means that when the oom-killer ran you had 214726*4kB = 858904kB in use across physical memory and swap.
Since your physical memory is 1GB and ~200MB of it was taken by the memory-mapped /tmp, it's reasonable that the oom-killer was invoked once 858904kB was in use.
The rss for process 2603 is 181503 pages, i.e. 181503*4kB = 726012kB, which equals the sum of its anon-rss and file-rss:
[11686.043647] Killed process 2603 (flasherav) total-vm:1498536kB,
anon-rss:721784kB, file-rss:4228kB
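If you want to redo that arithmetic straight from dmesg output like the one above, something like the following works as a rough sketch; the field offsets, counted from the end of each line, assume the column layout shown above (... total_vm rss cpu oom_adj oom_score_adj name), so adjust them for other kernel versions:

# sum total_vm and rss (both in 4kB pages) over the oom-killer process table
dmesg | grep -E '^\[[0-9.]+\] \[ *[0-9]+\]' \
  | awk '{ vm += $(NF-5); rss += $(NF-4) }
         END { printf "total_vm: %d pages (%d kB)  rss: %d pages (%d kB)\n", vm, vm*4, rss, rss*4 }'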
This webpage has an explanation and a solution.
The solution is:
To fix this problem the behaviour of the kernel has to be changed, so that it no longer overcommits memory for application requests. I finally included the values below in the /etc/sysctl.conf file, so they get applied automatically on start-up:
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
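To apply the same settings on a running system without a reboot (a minimal sketch; with overcommit_memory=2 the committable limit becomes swap + overcommit_ratio% of RAM, so tune the ratio to your memory layout before relying on it):

sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80
sysctl -p                          # re-read /etc/sysctl.conf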
You can parse the different columns yourself; here is an example one-liner and its output:
root#device:~# cat /var/log/syslog | grep kernel | rev | cut -d"]" -f1 | rev | awk '{ print $3, $4, $5, $8 }' | grep '^[0-9].*[a-Z][a-Z]' | perl -MData::Dumper -p -e 'BEGIN { $db = {}; } ($total_vm, $rss, $pgtables_bytes, $name) = split; $db->{$name}->{total_vm} += $total_vm; $db->{$name}->{rss} += $rss; $db->{$name}->{pgtables_bytes} += $pgtables_bytes; $_=undef; END { map { printf("%.1fG %s\n", ($db->{$_}->{rss} * 4096)/(1024*1024*1024), $_) } sort { $db->{$a}->{rss} <=> $db->{$b}->{rss} } keys %{$db}; }' | tail -n 10 | tac
8.1G mysql
5.2G php5.6
0.7G nothing-server
0.2G apache2
0.1G systemd-journal
0.1G python3.7
0.1G nginx
0.1G stats
0.0G php-login
0.0G python3

Resources