Large number of dead threads in .Net memory dump - 64-bit

During the analysis of a memory dump for a .NET 4.5 WCF w3wp process, I encountered many threads identified as dead. !threads shows 68 out of 107 threads as dead, which seems quite high. I was wondering whether these threads could hold a large amount of memory, since the process eventually grows to 20 GB+ and its memory usage never seems to go down.
How can I inspect such threads and see the objects/memory held by them? Is it normal to have so many?
0:000> !threads
ThreadCount: 107
UnstartedThread: 0
BackgroundThread: 35
PendingThread: 0
DeadThread: 68
Hosted Runtime: no
ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
7 1 16fc 0000009d253a36e0 28220 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn
14 2 a64 000000a1702d7560 2b220 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Finalizer)
XXXX 3 0 000000a1702f9390 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Completion Port)
XXXX 4 0 000000a1702fa270 8038820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
16 6 21c8 000000a17031f310 102a220 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
17 7 2af4 000000a170327ef0 21220 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn
19 9 1b50 000000a1703cccd0 1020220 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
21 10 85c 000000a170416570 202b020 Preemptive 000000A0945502B8:000000A094550FD0 000000a1703360c0 0 MTA
25 11 13cc 000000a1711823f0 202b020 Preemptive 000000A094554D60:000000A094554FD0 000000a1703360c0 0 MTA
26 12 2044 000000a1711921d0 3029220 Preemptive 0000000000000000:0000000000000000 000000a1703360c0 0 MTA (Threadpool Worker)
XXXX 16 0 000000a17128a690 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Completion Port)
XXXX 17 0 000000a1712bd610 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Completion Port)
XXXX 18 0 000000a1712c5e30 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Completion Port)
XXXX 19 0 000000a1712c4e90 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Completion Port)
2 20 8a4 000000a1712c6600 20220 Preemptive 0000009E8B81C238:0000009E8B81DFD0 0000009d25385d70 0 Ukn
18 21 28f8 000000a1712c3720 20220 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn
22 22 bfc 000000a1712c3ef0 20220 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn
20 23 257c 000000a1712c5660 20220 Preemptive 000000A09457AC30:000000A09457AFD0 0000009d25385d70 0 Ukn
23 24 13e0 000000a1712c6dd0 20220 Preemptive 0000009F87F0B5C8:0000009F87F0CFD0 0000009d25385d70 0 Ukn
XXXX 26 0 000000a1713d8fb0 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
28 27 2aac 000000a1713dbe90 a029220 Preemptive 0000000000000000:0000000000000000 000000a1703360c0 0 MTA (Threadpool Completion Port)
XXXX 29 0 000000a1713dc660 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Completion Port)
29 30 284c 000000a1713d9f50 202b220 Preemptive 0000000000000000:0000000000000000 000000a1703360c0 0 MTA
XXXX 31 0 000000a1713da720 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
XXXX 32 0 000000a1713db6c0 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Completion Port)
XXXX 33 0 000000a174347600 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
XXXX 34 0 000000a174344720 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
XXXX 35 0 000000a174345e90 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
XXXX 36 0 000000a174346660 39820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn
XXXX 37 0 000000a174346e30 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
XXXX 38 0 000000a1743456c0 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
XXXX 39 0 000000a1741b9d10 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
XXXX 40 0 000000a1741bc420 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
XXXX 41 0 000000a1741bcbf0 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
XXXX 42 0 000000a1741ba4e0 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
XXXX 43 0 000000a1741be360 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
3 44 1e94 000000a1741bd3c0 20220 Preemptive 0000009F87E511F8:0000009F87E52FD0 0000009d25385d70 0 Ukn
XXXX 45 0 000000a1741bdb90 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
35 46 12dc 000000a1741bacb0 20220 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA
XXXX 47 0 000000a1741beb30 30820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn
XXXX 48 0 000000a1741bf300 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
XXXX 49 0 000000a171171f40 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
36 50 2bb4 000000a171173e80 202b020 Preemptive 0000000000000000:0000000000000000 000000a1703360c0 0 MTA
37 51 9e4 000000a171177530 202b020 Preemptive 000000A0945528D0:000000A094552FD0 000000a1703360c0 0 MTA
39 53 6d0 000000a171174e20 21220 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn
40 54 f34 000000a171172ee0 21220 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn
41 55 f74 000000a1711755f0 21220 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn
42 56 2198 000000a171174650 21220 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn
XXXX 57 0 000000a171175dc0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 60 0 000000a171176590 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 62 0 000000a171177d00 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 64 0 000000a171178ca0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 65 0 000000a1741bfad0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 70 0 000000a174344ef0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 71 0 000000a1713d9780 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 69 0 000000a171171770 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 68 0 000000a1711736b0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 67 0 000000a171172710 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 66 0 000000a171176d60 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 59 0 000000a1711784d0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 58 0 000000a1741bbc50 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 63 0 000000a1741c1240 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 61 0 000000a1741c02a0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 28 0 000000a1741c0a70 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 25 0 000000a1712c46c0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 15 0 000000a1713daef0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 14 0 000000a174347dd0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 13 0 000000a16744b400 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 52 0 000000a167448520 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 8 0 000000a16744bbd0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 72 0 000000a16744ac30 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Completion Port)
XXXX 73 0 000000a16744a460 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 74 0 000000a171268f50 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 75 0 000000a1712658a0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 76 0 000000a171269720 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 77 0 000000a171266070 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 78 0 000000a1712677e0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 79 0 000000a171269ef0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 80 0 000000a171266840 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 81 0 000000a17126a6c0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 82 0 000000a171267010 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 83 0 000000a17126ae90 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
XXXX 5 0 000000a171268780 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Completion Port)
43 84 dcc 000000a17126b660 8029220 Preemptive 0000009D9D1B3B88:0000009D9D1B3FD0 0000009d25385d70 0 MTA (Threadpool Completion Port)
XXXX 85 0 000000a171267fb0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 86 0 000000a17126be30 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
46 87 1e54 000000a17126c600 1029220 Preemptive 000000A094575068:000000A094576FD0 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 88 0 000000a17126cdd0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
45 89 1db8 000000a16744c3a0 1029220 Preemptive 000000A094577250:000000A094578FD0 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 90 0 000000a167448cf0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
XXXX 91 0 000000a16744cb70 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 92 0 000000a1674494c0 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 93 0 000000a16744d340 1039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Worker)
50 94 15a4 000000a16744db10 1029220 Preemptive 000000A09456AF80:000000A09456AFD0 0000009d25385d70 0 MTA (Threadpool Worker)
47 95 29c8 000000a167449c90 1029220 Preemptive 000000A094573D08:000000A094574FD0 0000009d25385d70 0 MTA (Threadpool Worker)
48 96 28c4 000000a16744e2e0 1029220 Preemptive 000000A094548ED8:000000A094548FD0 0000009d25385d70 0 MTA (Threadpool Worker)
49 97 69c 000000a16744eab0 1029220 Preemptive 0000009D9D1863F0:0000009D9D187FD0 0000009d25385d70 0 MTA (Threadpool Worker)
XXXX 98 0 000000a16744fa50 8039820 Preemptive 0000000000000000:0000000000000000 0000009d25385d70 0 Ukn (Threadpool Completion Port)
51 99 2bac 000000a16744f280 8029220 Preemptive 0000009F87F32660:0000009F87F32FD0 0000009d25385d70 0 MTA (Threadpool Completion Port)
52 101 c40 000000a174599040 1029220 Preemptive 0000009D9D178538:0000009D9D179FD0 0000009d25385d70 0 MTA (Threadpool Worker)
54 102 1e5c 000000a174598870 1029220 Preemptive 0000009F87F51578:0000009F87F52FD0 0000009d25385d70 0 MTA (Threadpool Worker)
56 103 2b68 000000a174596930 1029220 Preemptive 0000009D9D188E70:0000009D9D189FD0 0000009d25385d70 0 MTA (Threadpool Worker)
55 104 2924 000000a174595990 1029220 Preemptive 0000009D9D18C290:0000009D9D18DFD0 0000009d25385d70 0 MTA (Threadpool Worker)
53 105 2f0 000000a174599810 1029220 Preemptive 0000009E8B89EFD0:0000009E8B89FFD0 0000009d25385d70 0 MTA (Threadpool Worker)
57 106 f5c 000000a174596160 1029220 Preemptive 0000009E8B894828:0000009E8B895FD0 0000009d25385d70 0 MTA (Threadpool Worker)
58 107 20c 000000a174599fe0 1029220 Preemptive 0000009F87F53258:0000009F87F54FD0 0000009d25385d70 0 MTA (Threadpool Worker)
60 100 1f60 000000a17459a7b0 8029220 Preemptive 0000009F87F7B1A8:0000009F87F7CFD0 0000009d25385d70 0 MTA (Threadpool Completion Port)

I was wondering if these threads could hold large amount of memory
Remember the following rule: a process provides memory, a thread consumes CPU time. The inverse is also true: a process does not run, and a thread does not hold memory. If someone says "my process still runs", that's a simplification of "my process has at least one thread that still runs".
A dead thread (marked with XXXX) means that there is a .NET Thread object in memory and the "real" thread (the kernel object maintained by the operating system) is gone.
The following is an MCVE for that situation:
using System;
using System.Collections.Generic;
using System.Threading;

namespace DeadThreadExample
{
    class Program
    {
        // Keeping a reference to every Thread object prevents the GC
        // from collecting them after the OS threads have exited.
        static List<Thread> AllThreadsIEverStarted = new List<Thread>();

        static void Main()
        {
            for (int i = 0; i < 1000; i++)
            {
                Thread t = new Thread(DoNothing);
                t.Start();
                AllThreadsIEverStarted.Add(t);
                t.Join(); // wait until the OS thread has terminated
            }
            Console.WriteLine("There should be 1000 dead threads now. Debug it with WinDbg and SOS !threads");
            Console.ReadLine();
        }

        private static void DoNothing()
        {
            // Just nothing
        }
    }
}
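If you want to reproduce the session below, remember that the SOS extension has to be loaded first; for .NET 4.x this is typically done with (a reminder, not part of the original session):
0:000> .loadby sos clr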
The debugging session is:
0:006> !threads
PDB symbol for clr.dll not loaded
ThreadCount: 1002
UnstartedThread: 0
BackgroundThread: 1
PendingThread: 0
DeadThread: 1000
Hosted Runtime: no
[...]
could hold large amount of memory
0:006> !dumpheap -stat
Statistics:
MT Count TotalSize Class Name
[...]
53dde9b0 1000 20000 System.Threading.ThreadHelper
53d66bf0 1000 44000 System.Threading.ExecutionContext
53d62e10 1001 52052 System.Threading.Thread
53dad5cc 2000 64000 System.Threading.ThreadStart
So, yes, there is a "memory leak", if you call the static collection a leak. Maybe it's not a leak, because you need that information at some point in time. Once the collection is cleared, it's no longer a leak.
1000 dead threads are equivalent to a ~180 kB "memory leak" (20,000 + 44,000 + 52,052 + 64,000 bytes from the statistics above). I wouldn't call that a "large amount". Even if you pass an object as an argument (using ParameterizedThreadStart), it seems that the m_ThreadStartArg property of the Thread object is not set, so I can hardly see how a larger amount of memory would be leaked.
If you don't like that situation, use a memory profiler and find out which GC roots still hold a reference to those threads.
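If you only have WinDbg at hand, SOS's !gcroot does a similar job: take a Thread address from the !dumpheap output shown below and list its roots. With the example address from this session, the chain should end at the static AllThreadsIEverStarted field:
0:006> !gcroot 02ec247c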
Is it normal to have so many?
Maybe you were just unlucky. They might all be gone with the next garbage collection.
How can I inspect such threads and see the objects/memory held by these?
Use !dumpheap -stat -type, then !dumpheap -mt and then !do:
0:006> !dumpheap -stat -type Thread
Statistics:
MT Count TotalSize Class Name
[...]
53d62e10 1001 52052 System.Threading.Thread
0:006> !dumpheap -mt 53d62e10
Address MT Size
02ec247c 53d62e10 52
02ec2504 53d62e10 52
[...]
Statistics:
MT Count TotalSize Class Name
53d62e10 1001 52052 System.Threading.Thread
Total 1001 objects
0:006> !do 02ec247c
Name: System.Threading.Thread
MethodTable: 53d62e10
EEClass: 53e679a4
Size: 52(0x34) bytes
File: C:\WINDOWS\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
MT Field Offset Type VT Attr Value Name
53d6cd68 400192d 4 ....Contexts.Context 0 instance 00000000 m_Context
53d66bf0 400192e 8 ....ExecutionContext 0 instance 00000000 m_ExecutionContext
53d624e4 400192f c System.String 0 instance 00000000 m_Name
53d63c70 4001930 10 System.Delegate 0 instance 00000000 m_Delegate
53d65074 4001931 14 ...ation.CultureInfo 0 instance 00000000 m_CurrentCulture
53d65074 4001932 18 ...ation.CultureInfo 0 instance 00000000 m_CurrentUICulture
53d62734 4001933 1c System.Object 0 instance 00000000 m_ThreadStartArg
53d67b18 4001934 20 System.IntPtr 1 instance 11519f8 DONT_USE_InternalThread
53d642a8 4001935 24 System.Int32 1 instance 2 m_Priority
53d642a8 4001936 28 System.Int32 1 instance 3 m_ManagedThreadId
53d6878c 4001937 2c System.Boolean 1 instance 0 m_ExecutionContextBelongsToOuterScope
[ ... static ... ]

Related

How can I select only the rows in file 1 that match column values in file 2?

I have multiple measurements per 'Subject' in file 1. I only want to use the highest-quality, singular measurement per Subject. In my second file I have the exact list of which measurement is the best for each Subject. This information is contained in the column 'seriesnumber': the number in the 'seriesnumber' column in file 2 corresponds to the best measurement for a Subject. I need to extract only these rows from my file 1.
I have tried to use awk, join, and merge to try and accomplish this but came up with errors and strange incomplete files.
join code:
join -j2 file1 file2
awk code:
awk ' FILENAME=="file1" {arr[$2]=$0; next}
FILENAME=="file2" {print arr[$2]} ' file1 file2 > newfile
File 1 Example
Subject Seriesnumber
19-1-1001 2 8655 661 15250 60747 8005 3919 7393 2264 1479 1663 22968 4180 1712 689 781 4255 90 1260 7233 154 15643 63421 7361 4384 6932 2062 4526 1742 686 4575 100 1684 0 1194 0 0 5 0 0 147 699 315 305 317 565 1361200 1338210 1338690 304258 308180 612438 250614 255920 506534 66645 802424 1206450 1187010 1185180 1816840 1 1 21 17 38 1765590
19-1-1001 10 8992 507 15722 64032 8728 3929 7208 2075 1529 1529 22503 3993 1819 710 764 3870 87 1247 7361 65 16128 66226 8165 4384 6669 1805 4405 1752 779 4039 103 1705 0 1280 0 0 10 0 0 186 685 300 318 320 598 1370490 1347160 1347520 306588 307188 613775 251704 256521 508225 65808 808802 1208880 1189150 1187450 1827880 1 1 22 26 48 1778960
19-1-1103 2 3303 317 12146 57569 7008 3617 6910 2018 811 1593 18708 4708 1429 408 668 3279 14 1289 2351 85 13730 60206 6731 4137 7034 2038 4407 1483 749 3576 85 1668 0 948 0 0 7 0 0 129 602 288 291 285 748 1250030 1238540 1238820 301810 301062 602872 215029 218080 433108 61555 781150 1107360 1098510 1097220 1635560 1 1 32 47 79 1555850
19-1-1103 9 3236 286 12490 59477 7000 3558 6782 2113 894 1752 19338 4818 1724 387 649 3345 56 1314 2077 133 13885 60414 6628 4078 7063 2031 4269 1709 610 3707 112 1947 0 990 0 0 8 0 0 245 604 279 280 284 693 1269820 1258050 1258320 306856 309614 616469 215658 220876 436534 61859 796760 1124870 1115990 1114510 1630740 1 1 32 42 74 1556790
19-10-1010 2 3344 608 14744 59165 8389 4427 6962 2008 716 1496 21980 4008 1474 769 652 3715 61 1400 3049 1072 15767 61919 8325 4824 7117 1936 4001 1546 684 3935 103 1434 0 1624 0 0 3 0 0 316 834 413 520 517 833 1350760 1337040 1336840 311985 312592 624577 246800 251133 497933 65699 809736 1200320 1189410 1188280 1731270 1 1 17 13 30 1606700
19-10-1010 6 3242 616 15205 61330 8019 4520 6791 2093 735 1558 22824 3981 1546 653 614 3672 96 1227 2992 1070 16450 64189 8489 4407 6953 2099 4096 1668 680 4116 99 1449 0 2161 0 0 19 0 0 263 848 387 525 528 824 1339090 1325830 1325780 309464 311916 621380 239958 244616 484574 65493 810887 1183120 1172600 1171430 1720000 1 1 16 26 42 1587100
File 2 Example
Subject seriesnumber
19-10-1010 2
19-10-1166 2
19-102-10005 2
19-102-10006 2
19-103-10009 2
19-103-10010 2
19-104-10013 11
19-104-10014 2
19-105-10017 6
19-105-10018 6
The desired output would look something like this, where I no longer have duplicate entries per Subject (the second column will look different because the preferred series number differs per Subject):
19-10-1010 2 3344 608 14744 59165 8389 4427 6962 2008 716 1496 21980 4008 1474 769 652 3715 61 1400 3049 1072 15767 61919 8325 4824 7117 1936 4001 1546 684 3935 103 1434 0 1624 0 0 3 0 0 316 834 413 520 517 833 1350760 1337040 1336840 311985 312592 624577 246800 251133 497933 65699 809736 1200320 1189410 1188280 1731270 1 1 17 13 30 1606700
19-10-1166 2 3699 312 15373 61787 8026 4248 6385 1955 608 2194 21394 4260 1563 886 609 3420 25 1101 3415 417 16909 63040 7236 4264 5933 1852 4156 1213 654 4007 53 1336 5 1597 0 0 18 0 0 110 821 300 514 466 854 1193020 1179470 1179420 282241 273236 555477 204883 203228 408111 61343 740736 1036210 1026080 1024910 1563950 1 1 39 40 79 1415890
19-102-10005 2 8733 514 13024 50735 7729 3775 4955 1575 1045 1141 20415 3924 1537 990 651 3515 134 1259 8571 232 13487 51374 7150 4169 5192 1664 3760 1620 596 3919 189 1958 0 1479 0 0 36 0 0 203 837 459 409 439 1072 1224350 1200010 1200120 287659 290445 578104 216976 220545 437521 57457 737161 1095770 1074440 1073050 1637570 1 1 31 22 53 1618600
19-102-10006 2 8347 604 13735 42231 7266 3836 6473 2057 1099 1007 18478 3769 1351 978 639 3332 125 1197 8207 454 13774 43750 6758 4274 6148 1921 3732 1584 614 3521 180 1611 0 1241 0 0 25 0 0 254 813 410 352 372 833 1092800 1069450 1069190 244104 245787 489891 202201 205897 408098 59170 634640 978807 958350 957462 1485600 1 1 19 19 38 1472020
19-103-10009 2 4222 596 14702 52038 7428 4065 6598 2166 835 1854 22613 3397 1387 879 568 3729 93 1315 3414 222 14580 52639 7316 3997 6447 1986 4067 1529 596 3778 113 1689 0 2097 0 0 23 0 0 260 761 326 400 359 772 1204670 1190100 1189780 256560 260381 516941 237316 243326 480642 60653 681040 1070620 1059370 1058440 1605990 1 1 25 23 48 1593730
19-103-10010 2 5254 435 14688 47120 7772 3130 5414 1711 741 1912 20643 3594 1449 882 717 3663 41 999 6465 605 14820 49390 6361 3826 5527 1523 3513 1537 639 3596 80 1261 0 1475 0 0 18 0 0 283 827 383 414 297 627 1135490 1117320 1116990 243367 245896 489263 221809 227084 448893 55338 639719 1009370 994519 993639 1568140 1 1 14 11 25 1542210
19-104-10013 2 7276 341 11836 53018 7912 3942 6105 2334 795 2532 21239 4551 1258 1176 430 3636 83 1184 8811 396 12760 53092 7224 4361 6306 1853 4184 1278 543 3921 175 1814 0 2187 0 0 8 0 0 266 783 381 382 357 793 1011640 987712 987042 206633 228397 435031 170375 191222 361597 61814 601948 879229 859619 859103 1586150 1 1 224 162 386 1557120
19-104-10014 2 5964 355 13297 55439 8599 4081 5628 1730 970 1308 20196 4519 1363 992 697 3474 62 1232 6830 472 14729 59478 7006 4443 6156 1825 4492 1726 827 4017 122 1804 0 1412 0 0 17 0 0 259 672 299 305 319 779 1308470 1288970 1288910 284018 285985 570003 258525 257355 515880 62485 746108 1166160 1149700 1148340 1826660 1 1 33 24 57 1630580
19-105-10017 2 7018 307 13848 53855 8345 3734 6001 2095 899 1932 20712 4196 1349 645 823 4212 72 1475 3346 1119 13970 55202 7411 3975 5672 1737 3778 1490 657 4089 132 1689 0 1318 0 0 23 0 0 234 745 474 367 378 760 1122360 1104380 1104520 235806 233881 469687 217939 220736 438675 61471 639143 985718 970903 969619 1583800 1 1 51 51 102 1558470
19-105-10018 2 16454 1098 12569 52521 8215 3788 5858 1805 788 1147 21028 3496 1492 665 634 3796 39 1614 10700 617 12813 52098 8091 3901 5367 1646 3544 1388 723 3938 47 1819 0 1464 0 0 42 0 0 330 832 301 319 400 788 1148940 1114080 1113560 225179 227218 452397 237056 237295 474351 59172 614884 1019300 986820 986144 1607900 1 1 19 28 47 1591480
19-105-10020 2 4096 451 13042 48597 7601 3228 5665 1582 778 1670 19769 3612 1187 717 617 3672 103 962 2627 467 13208 48466 6619 3461 5217 1360 3575 1388 718 3783 90 1370 0 862 0 0 6 0 0 216 673 386 439 401 682 1081580 1068850 1068890 233290 235396 468686 209666 214472 424139 54781 619447 958522 948737 947554 1493740 1 1 16 11 27 1452900
For file1 containing (I shortened the uselessly long lines):
Subject Seriesnumber
19-1-1001 2 8655 661 15250 60747 800
19-1-1001 10 8992 507 15722 64032 872
19-1-1103 2 3303 317 12146 57569 700
19-1-1103 9 3236 286 12490 59477 700
19-10-1010 2 3344 608 14744 59165 838
19-10-1010 6 3242 616 15205 61330 801
and file2 containing:
Subject seriesnumber
19-10-1010 2
19-10-1166 2
19-102-10005 2
19-102-10006 2
19-103-10009 2
19-103-10010 2
19-104-10013 11
19-104-10014 2
19-105-10017 6
19-105-10018 6
The following awk will output:
$ awk 'NR==FNR{a[$1, $2];next} ($1, $2) in a' file2 file1
19-10-1010 2 3344 608 14744 59165 838
Note that the first file argument to awk is file2 not file1 (small optimization)! How it works:
NR == FNR - true only while the overall record number equals the per-file record number, i.e. only for the first file passed to awk.
a[$1, $2] - remember index $1,$2 in associative array a (awk concatenates the two fields with its SUBSEP character).
next - do not run the rest of the script and restart with the next line.
($1, $2) in a - check if $1, $2 is in associative array a.
Because of next, this is evaluated only for the second file passed to awk.
If this expression returns true, the line will be printed (a pattern without an action prints the current line - this is how awk works).
Alternatively you could do the following, but it will store the whole of file1 in memory, which is... memory-consuming; the code above only stores the $1, $2 indexes in memory. Note the added membership test, which avoids printing empty lines for rows of file2 that have no match:
awk 'NR==FNR{arr[$1, $2]=$0; next} ($1, $2) in arr{print arr[$1, $2]}' file1 file2
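Either way, to produce the newfile the question asked for, the command only needs a redirect (file names as in the question):
awk 'NR==FNR{a[$1, $2];next} ($1, $2) in a' file2 file1 > newfile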

Is there a way to find out the total number of bytes actually written on each node per second in a Cassandra Cluster

I see over 100 MB/s being written to the commit log, but the data that was actually sent was only a couple of MB (< 4 MB). Not sure why I am seeing such stats.
Here is the dstat output for my commit log disk:
date/time |usr sys idl wai hiq siq| 1m 5m 15m | read writ| read writ|util| recv send
23-03 12:08:06| 27 4 66 2 0 0|13.8 6.14 3.50| 0 110M| 0 893 |66.8| 73M 79M
23-03 12:08:07| 29 5 64 2 0 0|13.8 6.14 3.50| 0 119M| 0 970 |58.8| 84M 81M
23-03 12:08:08| 29 4 64 3 0 0|13.8 6.14 3.50| 0 114M| 0 925 |70.4| 76M 75M
23-03 12:08:09| 30 6 63 2 0 0|13.2 6.13 3.52| 0 104M| 0 852 |58.0| 84M 73M
23-03 12:08:10| 30 5 63 2 0 0|13.2 6.13 3.52| 0 147M| 0 1190 |62.4| 92M 93M
23-03 12:08:11| 30 4 64 2 0 0|13.2 6.13 3.52| 0 113M| 0 923 |61.6| 77M 74M
23-03 12:08:12| 26 4 67 2 0 0|13.2 6.13 3.52| 0 134M| 0 1094 |56.0| 94M 90M
23-03 12:08:13| 39 5 54 1 0 0|13.2 6.13 3.52| 0 121M| 0 986 |54.4| 98M 88M
23-03 12:08:14| 25 4 68 3 0 0|12.7 6.15 3.53| 0 121M| 0 979 |71.2| 99M 87M
23-03 12:08:15| 36 6 55 3 0 0|12.7 6.15 3.53| 0 123M| 0 993 |62.0| 90M 93M
23-03 12:08:16| 31 6 60 2 0 0|12.7 6.15 3.53| 0 106M| 0 854 |54.8| 98M 104M
23-03 12:08:17| 37 6 54 2 0 1|12.7 6.15 3.53| 0 133M| 0 1067 |59.2| 92M 93M
23-03 12:08:18| 27 4 66 3 0 0|12.7 6.15 3.53| 0 116M| 0 936 |64.8| 97M 96M
23-03 12:08:19| 33 6 59 2 0 0|

Understanding OOM odd behaviour?

My server triggered the OOM killer and I am trying to understand why. The system has a lot of RAM (128 GB), and it looks like around 70 GB of it was actually in use. Reading through previous questions about OOM, it looks like this might be a case of memory fragmentation. See the syslog output:
Jun 23 17:20:10 server1 kernel: [517262.504589] gmond invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Jun 23 17:20:10 server1 kernel: [517262.504593] gmond cpuset=/ mems_allowed=0-1
Jun 23 17:20:10 server1 kernel: [517262.504598] CPU: 4 PID: 1522 Comm: gmond Tainted: P OE 3.15.1-031501-lowlatency #201406161841
Jun 23 17:20:10 server1 kernel: [517262.504599] Hardware name: Dell Inc. PowerEdge R420/0K29HN, BIOS 2.3.3 07/10/2014
Jun 23 17:20:10 server1 kernel: [517262.504601] 0000000000000000 ffff880fce2ab848 ffffffff817746ec 0000000000000007
Jun 23 17:20:10 server1 kernel: [517262.504603] ffff880f74691950 ffff880fce2ab898 ffffffff8176a980 ffff880f00000000
Jun 23 17:20:10 server1 kernel: [517262.504605] 000201da81383df8 ffff881470376540 ffff881dcf7ab2a0 0000000000000000
Jun 23 17:20:10 server1 kernel: [517262.504607] Call Trace:
Jun 23 17:20:10 server1 kernel: [517262.504615] [<ffffffff817746ec>] dump_stack+0x4e/0x71
Jun 23 17:20:10 server1 kernel: [517262.504618] [<ffffffff8176a980>] dump_header+0x7e/0xbd
Jun 23 17:20:10 server1 kernel: [517262.504620] [<ffffffff8176aa16>] oom_kill_process.part.6+0x57/0x30a
Jun 23 17:20:10 server1 kernel: [517262.504623] [<ffffffff811654e7>] oom_kill_process+0x47/0x50
Jun 23 17:20:10 server1 kernel: [517262.504625] [<ffffffff81165825>] out_of_memory+0x145/0x1d0
Jun 23 17:20:10 server1 kernel: [517262.504628] [<ffffffff8116c1ba>] __alloc_pages_nodemask+0xb1a/0xc40
Jun 23 17:20:10 server1 kernel: [517262.504634] [<ffffffff811adba3>] alloc_pages_current+0xb3/0x180
Jun 23 17:20:10 server1 kernel: [517262.504636] [<ffffffff81161737>] __page_cache_alloc+0xb7/0xd0
Jun 23 17:20:10 server1 kernel: [517262.504638] [<ffffffff81163f80>] filemap_fault+0x280/0x430
Jun 23 17:20:10 server1 kernel: [517262.504642] [<ffffffff8118a0d9>] __do_fault+0x39/0x90
Jun 23 17:20:10 server1 kernel: [517262.504644] [<ffffffff8118e31e>] do_read_fault.isra.59+0x10e/0x1d0
Jun 23 17:20:10 server1 kernel: [517262.504646] [<ffffffff8118e870>] do_linear_fault.isra.61+0x70/0x80
Jun 23 17:20:10 server1 kernel: [517262.504647] [<ffffffff8118e986>] handle_pte_fault+0x76/0x1b0
Jun 23 17:20:10 server1 kernel: [517262.504652] [<ffffffff81095fe0>] ? lock_hrtimer_base.isra.25+0x30/0x60
Jun 23 17:20:10 server1 kernel: [517262.504654] [<ffffffff8118eea4>] __handle_mm_fault+0x1b4/0x360
Jun 23 17:20:10 server1 kernel: [517262.504655] [<ffffffff8118f101>] handle_mm_fault+0xb1/0x160
Jun 23 17:20:10 server1 kernel: [517262.504658] [<ffffffff81784667>] ? __do_page_fault+0x2b7/0x5a0
Jun 23 17:20:10 server1 kernel: [517262.504660] [<ffffffff81784522>] __do_page_fault+0x172/0x5a0
Jun 23 17:20:10 server1 kernel: [517262.504664] [<ffffffff8111fdec>] ? acct_account_cputime+0x1c/0x20
Jun 23 17:20:10 server1 kernel: [517262.504667] [<ffffffff810a73a9>] ? account_user_time+0x99/0xb0
Jun 23 17:20:10 server1 kernel: [517262.504669] [<ffffffff810a79dd>] ? vtime_account_user+0x5d/0x70
Jun 23 17:20:10 server1 kernel: [517262.504671] [<ffffffff8178498e>] do_page_fault+0x3e/0x80
Jun 23 17:20:10 server1 kernel: [517262.504673] [<ffffffff817811f8>] page_fault+0x28/0x30
Jun 23 17:20:10 server1 kernel: [517262.504674] Mem-Info:
Jun 23 17:20:10 server1 kernel: [517262.504675] Node 0 DMA per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504677] CPU 0: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504678] CPU 1: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504679] CPU 2: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504680] CPU 3: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504681] CPU 4: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504682] CPU 5: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504683] CPU 6: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504684] CPU 7: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504685] CPU 8: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504686] CPU 9: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504687] CPU 10: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504687] CPU 11: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504688] CPU 12: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504689] CPU 13: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504690] CPU 14: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504691] CPU 15: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504692] CPU 16: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504693] CPU 17: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504694] CPU 18: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504695] CPU 19: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504696] CPU 20: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504697] CPU 21: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504698] CPU 22: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504698] CPU 23: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504699] Node 0 DMA32 per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504701] CPU 0: hi: 186, btch: 31 usd: 30
Jun 23 17:20:10 server1 kernel: [517262.504702] CPU 1: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504703] CPU 2: hi: 186, btch: 31 usd: 34
Jun 23 17:20:10 server1 kernel: [517262.504704] CPU 3: hi: 186, btch: 31 usd: 27
Jun 23 17:20:10 server1 kernel: [517262.504705] CPU 4: hi: 186, btch: 31 usd: 30
Jun 23 17:20:10 server1 kernel: [517262.504705] CPU 5: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504706] CPU 6: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504707] CPU 7: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504708] CPU 8: hi: 186, btch: 31 usd: 173
Jun 23 17:20:10 server1 kernel: [517262.504709] CPU 9: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504710] CPU 10: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504711] CPU 11: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504712] CPU 12: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504713] CPU 13: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504714] CPU 14: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504715] CPU 15: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504716] CPU 16: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504717] CPU 17: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504718] CPU 18: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504719] CPU 19: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504720] CPU 20: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504721] CPU 21: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504722] CPU 22: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504722] CPU 23: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504723] Node 0 Normal per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504724] CPU 0: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504725] CPU 1: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504726] CPU 2: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504727] CPU 3: hi: 186, btch: 31 usd: 14
Jun 23 17:20:10 server1 kernel: [517262.504728] CPU 4: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504729] CPU 5: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504730] CPU 6: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504731] CPU 7: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504732] CPU 8: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504733] CPU 9: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504734] CPU 10: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504735] CPU 11: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504736] CPU 12: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504737] CPU 13: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504738] CPU 14: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504739] CPU 15: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504740] CPU 16: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504740] CPU 17: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504741] CPU 18: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504742] CPU 19: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504743] CPU 20: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504744] CPU 21: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504745] CPU 22: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504746] CPU 23: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504747] Node 1 Normal per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504748] CPU 0: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504749] CPU 1: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504750] CPU 2: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504751] CPU 3: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504752] CPU 4: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504753] CPU 5: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504754] CPU 6: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504755] CPU 7: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504756] CPU 8: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504757] CPU 9: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504758] CPU 10: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504758] CPU 11: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504759] CPU 12: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504760] CPU 13: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504761] CPU 14: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504762] CPU 15: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504763] CPU 16: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504764] CPU 17: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504765] CPU 18: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504766] CPU 19: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504767] CPU 20: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504768] CPU 21: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504769] CPU 22: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504770] CPU 23: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504773] active_anon:17833290 inactive_anon:2465707 isolated_anon:0
Jun 23 17:20:10 server1 kernel: [517262.504773] active_file:573 inactive_file:595 isolated_file:36
Jun 23 17:20:10 server1 kernel: [517262.504773] unevictable:0 dirty:4 writeback:0 unstable:0
Jun 23 17:20:10 server1 kernel: [517262.504773] free:82698 slab_reclaimable:43224 slab_unreclaimable:11476749
Jun 23 17:20:10 server1 kernel: [517262.504773] mapped:2465518 shmem:2465767 pagetables:66385 bounce:0
Jun 23 17:20:10 server1 kernel: [517262.504773] free_cma:0
Jun 23 17:20:10 server1 kernel: [517262.504776] Node 0 DMA free:14804kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15968kB managed:15828kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504779] lowmem_reserve[]: 0 2933 64370 64370
Jun 23 17:20:10 server1 kernel: [517262.504782] Node 0 DMA32 free:247776kB min:2048kB low:2560kB high:3072kB active_anon:1774744kB inactive_anon:607052kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3083200kB managed:3003592kB mlocked:0kB dirty:16kB writeback:0kB mapped:607068kB shmem:607068kB slab_reclaimable:25524kB slab_unreclaimable:302060kB kernel_stack:4928kB pagetables:3100kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2660 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504785] lowmem_reserve[]: 0 0 61436 61436
Jun 23 17:20:10 server1 kernel: [517262.504787] Node 0 Normal free:34728kB min:42952kB low:53688kB high:64428kB active_anon:30286072kB inactive_anon:9255576kB active_file:236kB inactive_file:640kB unevictable:0kB isolated(anon):0kB isolated(file):16kB present:63963136kB managed:62911420kB mlocked:0kB dirty:0kB writeback:0kB mapped:9255000kB shmem:9255724kB slab_reclaimable:86416kB slab_unreclaimable:22165372kB kernel_stack:21072kB pagetables:121112kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:13936 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504791] lowmem_reserve[]: 0 0 0 0
Jun 23 17:20:10 server1 kernel: [517262.504793] Node 1 Normal free:33484kB min:45096kB low:56368kB high:67644kB active_anon:39272344kB inactive_anon:200kB active_file:2112kB inactive_file:1752kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:67108864kB managed:66056916kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:276kB slab_reclaimable:60956kB slab_unreclaimable:23439564kB kernel_stack:13536kB pagetables:141328kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:18448 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504797] lowmem_reserve[]: 0 0 0 0
Jun 23 17:20:10 server1 kernel: [517262.504799] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 1*2048kB (R) 3*4096kB (M) = 14804kB
Jun 23 17:20:10 server1 kernel: [517262.504807] Node 0 DMA32: 4660*4kB (UEM) 2172*8kB (EM) 1739*16kB (EM) 1046*32kB (UEM) 629*64kB (EM) 344*128kB (UEM) 155*256kB (E) 46*512kB (UE) 3*1024kB (E) 0*2048kB 0*4096kB = 247904kB
Jun 23 17:20:10 server1 kernel: [517262.504816] Node 0 Normal: 9038*4kB (M) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36152kB
Jun 23 17:20:10 server1 kernel: [517262.504822] Node 1 Normal: 9055*4kB (UM) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36220kB
Jun 23 17:20:10 server1 kernel: [517262.504829] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 23 17:20:10 server1 kernel: [517262.504830] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 23 17:20:10 server1 kernel: [517262.504831] 2467056 total pagecache pages
Jun 23 17:20:10 server1 kernel: [517262.504832] 0 pages in swap cache
Jun 23 17:20:10 server1 kernel: [517262.504833] Swap cache stats: add 0, delete 0, find 0/0
Jun 23 17:20:10 server1 kernel: [517262.504834] Free swap = 0kB
Jun 23 17:20:10 server1 kernel: [517262.504834] Total swap = 0kB
Jun 23 17:20:10 server1 kernel: [517262.504835] 33542792 pages RAM
Jun 23 17:20:10 server1 kernel: [517262.504836] 0 pages HighMem/MovableOnly
Jun 23 17:20:10 server1 kernel: [517262.504837] 262987 pages reserved
Jun 23 17:20:10 server1 kernel: [517262.504838] 0 pages hwpoisoned
Jun 23 17:20:10 server1 kernel: [517262.504839] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jun 23 17:20:10 server1 kernel: [517262.504866] [ 569] 0 569 4997 144 13 0 0 upstart-udev-br
Jun 23 17:20:10 server1 kernel: [517262.504868] [ 578] 0 578 12891 187 29 0 -1000 systemd-udevd
Jun 23 17:20:10 server1 kernel: [517262.504873] [ 692] 101 692 80659 2295 59 0 0 rsyslogd
Jun 23 17:20:10 server1 kernel: [517262.504875] [ 750] 0 750 4084 331 13 0 0 upstart-file-br
Jun 23 17:20:10 server1 kernel: [517262.504877] [ 792] 0 792 3815 53 13 0 0 upstart-socket-
Jun 23 17:20:10 server1 kernel: [517262.504879] [ 842] 111 842 27001 275 53 0 0 dbus-daemon
Jun 23 17:20:10 server1 kernel: [517262.504880] [ 851] 0 851 8834 101 22 0 0 systemd-logind
Jun 23 17:20:10 server1 kernel: [517262.504886] [ 1232] 0 1232 2558 572 8 0 0 dhclient
Jun 23 17:20:10 server1 kernel: [517262.504888] [ 1342] 104 1342 24484 281 49 0 0 ntpd
Jun 23 17:20:10 server1 kernel: [517262.504890] [ 1440] 0 1440 3955 41 12 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504891] [ 1443] 0 1443 3955 41 12 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504893] [ 1448] 0 1448 3955 39 13 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504895] [ 1450] 0 1450 3955 41 13 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504896] [ 1452] 0 1452 3955 42 13 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504898] [ 1469] 0 1469 4785 40 13 0 0 atd
Jun 23 17:20:10 server1 kernel: [517262.504900] [ 1470] 0 1470 15341 168 32 0 -1000 sshd
Jun 23 17:20:10 server1 kernel: [517262.504902] [ 1472] 0 1472 5914 65 17 0 0 cron
Jun 23 17:20:10 server1 kernel: [517262.504904] [ 1478] 999 1478 16020 3710 31 0 0 gmond
Jun 23 17:20:10 server1 kernel: [517262.504905] [ 1486] 0 1486 4821 65 14 0 0 irqbalance
Jun 23 17:20:10 server1 kernel: [517262.504907] [ 1500] 0 1500 343627 1730 85 0 0 nscd
Jun 23 17:20:10 server1 kernel: [517262.504909] [ 1559] 0 1559 1092 37 8 0 0 acpid
Jun 23 17:20:10 server1 kernel: [517262.504911] [ 1641] 0 1641 4978 71 13 0 0 master
Jun 23 17:20:10 server1 kernel: [517262.504913] [ 1650] 103 1650 5427 72 14 0 0 qmgr
Jun 23 17:20:10 server1 kernel: [517262.504917] [ 1895] 0 1895 1900 30 9 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504919] [ 1906] 1000 1906 2854329 2610 2594 0 0 thttpd
Jun 23 17:20:10 server1 kernel: [517262.504927] [ 3163] 1000 3163 2432 39 10 0 0 searchd
Jun 23 17:20:10 server1 kernel: [517262.504928] [ 3167] 1000 3167 2727221 2467025 4863 0 0 sphinx-daemon
Jun 23 17:20:10 server1 kernel: [517262.504931] [47622] 1000 47622 17834794 17329575 33989 0 0 MyExec
<.................Trimmed bunch of processes with low mem usage.......................................>
Jun 23 17:20:10 server1 kernel: [517262.508350] Out of memory: Kill process 47622 (MyExec) score 526 or sacrifice child
Jun 23 17:20:10 server1 kernel: [517262.508375] Killed process 47622 (MyExec) total-vm:71339176kB, anon-rss:69318300kB, file-rss:0kB
Looking at the following lines, it seems like the issue is fragmentation.
Jun 23 17:20:10 server1 kernel: [517262.504816] Node 0 Normal: 9038*4kB (M) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36152kB
Jun 23 17:20:10 server1 kernel: [517262.504822] Node 1 Normal: 9055*4kB (UM) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36220kB
I have no idea why the system would be so badly fragmented. It had only been running for 5 days when this happened. Also, looking at the process that invoked the OOM killer (gmond invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0), it seems it was only requesting 4K blocks (order=0), and there are a bunch of those available.
Is my understanding of fragmentation correct in this case?
How can I figure out why the memory got so fragmented?
What can I do to avoid getting into this situation?
One thing you may notice is that I have completely turned off swap and have swappiness set to 0. The reason is that my system has more than enough RAM and should never hit swap. I am planning to enable it and set swappiness to 10, though I am not sure if that helps in this case.
Thanks for your input.
Your understanding of fragmentation is incorrect. The OOM was issued because the memory watermarks were breached: free is below min on both Normal zones. Take a look at this:
Node 0 Normal free:34728kB min:42952kB low:53688kB
Node 1 Normal free:33484kB min:45096kB low:56368kB
From the last few lines of the log you can see the kernel reports a total-vm usage of 71339176 kB (~71 GB) for the killed process; total vm includes both your physical memory and swap space. Your log also shows a resident set of about 69 GB (anon-rss).
Is my understanding of fragmentation correct in this case?
If you are capturing system diagnostics (or an sosreport) at the time the issue occurs, check the /proc/buddyinfo file for memory fragmentation. It is best to write a script that backs up this information if you are planning to reproduce the problem.
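A minimal sketch of such a script (log path and interval are arbitrary placeholders):
#!/bin/sh
# Append a timestamped snapshot of the buddy allocator state every 60 seconds
while true; do
    echo "=== $(date) ===" >> /var/log/buddyinfo.log
    cat /proc/buddyinfo >> /var/log/buddyinfo.log
    sleep 60
done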
How can I figure out why the memory got so fragmented?
What can I do to avoid getting into this situation?
Sometimes applications overcommit memory, which the system is then unable to honour, potentially leading to OOM. You may want to disable memory overcommitting by setting the kernel tunables below (use sysctl -a to read the currently set values):
vm.overcommit_memory=2
vm.overcommit_ratio=80
Note: after adding the above lines to /etc/sysctl.conf, apply them with sysctl -p or restart the system.
vm.overcommit_memory: some applications need to allocate more virtual memory for the program than is available on the system. The tunable takes different values:
0 - a heuristic overcommit algorithm is used (your server is most likely set to 0 or 1).
1 - always overcommit, regardless of whether memory is available or not.
2 - never overcommit: the kernel allows applications to commit only swap + a percentage of RAM, so the ratio below should also be set (e.g. 80%). This disallows committing memory beyond the available swap space plus 80% of RAM.
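A hedged example of applying and persisting these settings (values as suggested above):
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80
echo 'vm.overcommit_memory=2' >> /etc/sysctl.conf
echo 'vm.overcommit_ratio=80' >> /etc/sysctl.conf
sysctl -p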
Update with slabinfo. This is after the node was rebooted:
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kvm_async_pf 0 0 136 30 1 : tunables 0 0 0 : slabdata 0 0 0
kvm_vcpu 0 0 16256 2 8 : tunables 0 0 0 : slabdata 0 0 0
kvm_mmu_page_header 0 0 168 48 2 : tunables 0 0 0 : slabdata 0 0 0
fusion_ioctx 5005 5005 296 55 4 : tunables 0 0 0 : slabdata 91 91 0
fusion_user_ll_request 0 0 3960 8 8 : tunables 0 0 0 : slabdata 0 0 0
ext4_groupinfo_4k 131670 131670 136 30 1 : tunables 0 0 0 : slabdata 4389 4389 0
ip6_dst_cache 1260 1260 384 42 4 : tunables 0 0 0 : slabdata 30 30 0
UDPLITEv6 0 0 1088 30 8 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 330 330 1088 30 8 : tunables 0 0 0 : slabdata 11 11 0
tw_sock_TCPv6 128 128 256 32 2 : tunables 0 0 0 : slabdata 4 4 0
TCPv6 288 288 1984 16 8 : tunables 0 0 0 : slabdata 18 18 0
kcopyd_job 0 0 3312 9 8 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2632 12 8 : tunables 0 0 0 : slabdata 0 0 0
cfq_queue 0 0 232 35 2 : tunables 0 0 0 : slabdata 0 0 0
bsg_cmd 0 0 312 52 4 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
fuse_request 0 0 416 39 4 : tunables 0 0 0 : slabdata 0 0 0
fuse_inode 0 0 768 42 8 : tunables 0 0 0 : slabdata 0 0 0
ecryptfs_key_record_cache 0 0 576 28 4 : tunables 0 0 0 : slabdata 0 0 0
ecryptfs_inode_cache 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
fat_inode_cache 0 0 712 46 8 : tunables 0 0 0 : slabdata 0 0 0
fat_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
hugetlbfs_inode_cache 54 54 600 54 8 : tunables 0 0 0 : slabdata 1 1 0
jbd2_journal_handle 2040 2040 48 85 1 : tunables 0 0 0 : slabdata 24 24 0
jbd2_journal_head 5071 5364 112 36 1 : tunables 0 0 0 : slabdata 149 149 0
jbd2_revoke_table_s 1792 1792 16 256 1 : tunables 0 0 0 : slabdata 7 7 0
jbd2_revoke_record_s 1536 1536 32 128 1 : tunables 0 0 0 : slabdata 12 12 0
ext4_inode_cache 75129 78771 984 33 8 : tunables 0 0 0 : slabdata 2387 2387 0
ext4_free_data 5952 6656 64 64 1 : tunables 0 0 0 : slabdata 104 104 0
ext4_allocation_context 768 768 128 32 1 : tunables 0 0 0 : slabdata 24 24 0
ext4_io_end 1344 1344 72 56 1 : tunables 0 0 0 : slabdata 24 24 0
ext4_extent_status 37921 38352 40 102 1 : tunables 0 0 0 : slabdata 376 376 0
dquot 768 768 256 32 2 : tunables 0 0 0 : slabdata 24 24 0
dnotify_mark 782 782 120 34 1 : tunables 0 0 0 : slabdata 23 23 0
pid_namespace 0 0 2192 14 8 : tunables 0 0 0 : slabdata 0 0 0
posix_timers_cache 0 0 248 33 2 : tunables 0 0 0 : slabdata 0 0 0
UDP-Lite 0 0 896 36 8 : tunables 0 0 0 : slabdata 0 0 0
xfrm_dst_cache 0 0 448 36 4 : tunables 0 0 0 : slabdata 0 0 0
ip_fib_trie 146 146 56 73 1 : tunables 0 0 0 : slabdata 2 2 0
UDP 828 828 896 36 8 : tunables 0 0 0 : slabdata 23 23 0
tw_sock_TCP 992 1152 256 32 2 : tunables 0 0 0 : slabdata 36 36 0
TCP 450 450 1792 18 8 : tunables 0 0 0 : slabdata 25 25 0
blkdev_queue 120 136 1896 17 8 : tunables 0 0 0 : slabdata 8 8 0
blkdev_requests 3358 3569 376 43 4 : tunables 0 0 0 : slabdata 83 83 0
blkdev_ioc 964 1287 104 39 1 : tunables 0 0 0 : slabdata 33 33 0
user_namespace 0 0 264 31 2 : tunables 0 0 0 : slabdata 0 0 0
sock_inode_cache 1377 1377 640 51 8 : tunables 0 0 0 : slabdata 27 27 0
net_namespace 0 0 4736 6 8 : tunables 0 0 0 : slabdata 0 0 0
shmem_inode_cache 2112 2112 672 48 8 : tunables 0 0 0 : slabdata 44 44 0
ftrace_event_file 1196 1196 88 46 1 : tunables 0 0 0 : slabdata 26 26 0
taskstats 196 196 328 49 4 : tunables 0 0 0 : slabdata 4 4 0
proc_inode_cache 63037 63250 648 50 8 : tunables 0 0 0 : slabdata 1265 1265 0
sigqueue 1224 1224 160 51 2 : tunables 0 0 0 : slabdata 24 24 0
bdev_cache 819 819 832 39 8 : tunables 0 0 0 : slabdata 21 21 0
kernfs_node_cache 54360 54360 112 36 1 : tunables 0 0 0 : slabdata 1510 1510 0
mnt_cache 510 510 320 51 4 : tunables 0 0 0 : slabdata 10 10 0
inode_cache 16813 19712 584 28 4 : tunables 0 0 0 : slabdata 704 704 0
dentry 144206 144606 192 42 2 : tunables 0 0 0 : slabdata 3443 3443 0
iint_cache 0 0 72 56 1 : tunables 0 0 0 : slabdata 0 0 0
buffer_head 6905641 6922305 104 39 1 : tunables 0 0 0 : slabdata 177495 177495 0
vm_area_struct 16764 16764 184 44 2 : tunables 0 0 0 : slabdata 381 381 0
mm_struct 1008 1008 896 36 8 : tunables 0 0 0 : slabdata 28 28 0
files_cache 1377 1377 640 51 8 : tunables 0 0 0 : slabdata 27 27 0
signal_cache 1380 1380 1088 30 8 : tunables 0 0 0 : slabdata 46 46 0
sighand_cache 1020 1020 2112 15 8 : tunables 0 0 0 : slabdata 68 68 0
task_xstate 1638 1638 832 39 8 : tunables 0 0 0 : slabdata 42 42 0
task_struct 837 855 6480 5 8 : tunables 0 0 0 : slabdata 171 171 0
Acpi-ParseExt 2968 2968 72 56 1 : tunables 0 0 0 : slabdata 53 53 0
Acpi-State 561 561 80 51 1 : tunables 0 0 0 : slabdata 11 11 0
Acpi-Namespace 3162 3162 40 102 1 : tunables 0 0 0 : slabdata 31 31 0
anon_vma 19313 19584 64 64 1 : tunables 0 0 0 : slabdata 306 306 0
shared_policy_node 7735 7735 48 85 1 : tunables 0 0 0 : slabdata 91 91 0
numa_policy 170 170 24 170 1 : tunables 0 0 0 : slabdata 1 1 0
radix_tree_node 2870899 2871624 584 28 4 : tunables 0 0 0 : slabdata 102558 102558 0
idr_layer_cache 555 555 2112 15 8 : tunables 0 0 0 : slabdata 37 37 0
dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-1024 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-512 0 0 512 32 4 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-256 0 0 256 32 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-192 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-8192 180 180 8192 4 8 : tunables 0 0 0 : slabdata 45 45 0
kmalloc-4096 636 720 4096 8 8 : tunables 0 0 0 : slabdata 90 90 0
kmalloc-2048 6498 6688 2048 16 8 : tunables 0 0 0 : slabdata 418 418 0
kmalloc-1024 4677 4800 1024 32 8 : tunables 0 0 0 : slabdata 150 150 0
kmalloc-512 9029 9056 512 32 4 : tunables 0 0 0 : slabdata 283 283 0
kmalloc-256 31542 31840 256 32 2 : tunables 0 0 0 : slabdata 995 995 0
kmalloc-192 16548 16548 192 42 2 : tunables 0 0 0 : slabdata 394 394 0
kmalloc-128 8449 8544 128 32 1 : tunables 0 0 0 : slabdata 267 267 0
kmalloc-96 20607 21462 96 42 1 : tunables 0 0 0 : slabdata 511 511 0
kmalloc-64 71408 75968 64 64 1 : tunables 0 0 0 : slabdata 1187 1187 0
kmalloc-32 5760 5760 32 128 1 : tunables 0 0 0 : slabdata 45 45 0
kmalloc-16 13824 13824 16 256 1 : tunables 0 0 0 : slabdata 54 54 0
kmalloc-8 45056 45056 8 512 1 : tunables 0 0 0 : slabdata 88 88 0
kmem_cache_node 551 576 64 64 1 : tunables 0 0 0 : slabdata 9 9 0
kmem_cache 256 256 256 32 2 : tunables 0 0 0 : slabdata 8 8 0

Benchmarking CPU and File IO for an application running on Linux

I wrote two programs to run on Linux, each using a different algorithm, and I want to find a way (preferably using benchmarking software) to compare the CPU usage and I/O operations between these two programs.
Is there such a thing? And if yes, where can I find it? Thanks.
You can try hardinfo; a sketch of running it follows below.
Or, if measuring the system while your application runs serves your purpose, there are any number of tools that measure system performance.
And you can also check this thread.
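If you go the hardinfo route, it can dump a report (including its CPU benchmark results) from the command line; a sketch, with the flag written from memory, so verify it against hardinfo --help first:

hardinfo -r > hardinfo_report.txt

Note that hardinfo benchmarks the system rather than your specific program, so it is most useful as a baseline alongside per-run measurements like the vmstat approach below.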
You might try the vmstat command:
vmstat 2 20 > vmstat.txt
That takes 20 samples at 2-second intervals.
bi = KB in, bo = KB out; wa = CPU time spent waiting for I/O.
I/O can also increase cache demands.
%CPU utilisation = us (user) + sy (system).
Sample output around a test run (a small wrapper sketch follows the output):
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 277504 17060 82732 0 0 91 87 1432 236 11 3 84 1
0 0 0 277372 17068 82732 0 0 0 24 1361 399 23 8 59 10
test start
0 1 0 275240 17068 82732 0 0 0 512 1342 305 24 4 69 4
2 1 0 275232 17068 82780 0 0 24 10752 4176 216 7 8 0 85
1 1 0 275240 17076 82732 0 0 12288 2590 5295 243 15 8 0 77
0 1 0 275240 17076 82748 0 0 8 11264 4329 214 6 12 0 82
0 1 0 275240 17076 82780 0 0 16 11264 4278 233 15 10 0 75
0 1 0 275240 17084 82780 0 0 19456 542 6563 255 10 7 0 83
0 1 0 275108 17084 82748 0 0 5128 3072 3501 265 16 37 0 47
3 1 0 275108 17084 82748 0 0 924 5120 8369 3845 12 33 0 55
0 1 0 275116 17092 82748 0 0 1576 85 11483 6645 5 50 0 45
1 1 0 275116 17092 82748 0 0 0 136 2304 689 3 9 0 88
2 1 0 275084 17100 82732 0 0 0 352 2374 800 14 26 0 61
0 0 0 275076 17100 82732 0 0 546 118 2408 1014 35 17 47 1
0 1 0 275076 17104 82732 0 0 0 62 1324 76 3 2 89 7
1 1 0 275076 17108 82732 0 0 0 452 1879 442 8 13 66 12
0 0 0 275116 17108 82732 0 0 800 352 2456 1195 19 17 56 8
0 1 0 275116 17112 82732 0 0 0 54 1325 76 4 1 88 8
test end
1 1 0 275116 17116 82732 0 0 0 510 1717 286 6 10 72 11
1 0 0 275076 17116 82732 0 0 1600 1152 3087 1344 23 29 41 7
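If you want the sampling tied to the run of the program under test, a small shell wrapper helps; a minimal sketch, where ./myprog is a placeholder for your program:

# Sample vmstat every 2 seconds in the background while the program runs.
vmstat 2 > vmstat_myprog.txt &
SAMPLER=$!
# GNU time (-v) adds per-process CPU time and page-fault counters.
/usr/bin/time -v ./myprog
kill $SAMPLER

Run this once per program and compare the us/sy/wa and bi/bo columns between the two output files.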

OutOfMemory and Memory Fragmentation in SharePoint 2007 32-bit

For some weeks I have been struggling with an OutOfMemory issue on our SharePoint 2007 (published intranet with many customizations) WFEs (SP2 and Windows 2003 32-bit servers). After analyzing a memory dump taken at a crash I found out that we have a memory fragmentation issue. For the dump analysis I used the following two tools: DebugDiag and WinDbg with sos.dll.
Result: 96,74% Free Memory Fragmentation
DebugDiag (Memory Pressure Analyzers)
Virtual Memory Summary
Size of largest free VM block 23,63 MBytes
Free memory fragmentation 96,74%
Free Memory 725,52 MBytes (35,43% of Total Memory)
Reserved Memory 406,88 MBytes (19,87% of Total Memory)
Committed Memory 915,54 MBytes (44,71% of Total Memory)
Total Memory 2,00 GBytes
Largest free block at 0x00000000`4b0b0000
DebugDiag (SharePoint Analyzers)
Undisposed SPRequest objects: 9
Disposed SPRequest objects: 187
Undisposed SPWeb objects: 185
Disposed SPWeb objects: 34
Undisposed SPSite objects: 8
Disposed SPSite objects: 22
undisposed special purpose (AllowCleanupWhenThreadEnds = false) SPRequest object found at: 0x02320dd0.
undisposed special purpose (AllowCleanupWhenThreadEnds = false) SPRequest object found at: 0x4e8296f8.
undisposed special purpose (AllowCleanupWhenThreadEnds = false) SPRequest object found at: 0x4e869a20.
undisposed SPWeb object 0x02701168 references a disposed or invalid SPRequest object: 0x0270137c
undisposed SPWeb object 0x027013cc references a disposed or invalid SPRequest object: 0x027015e0
undisposed SPWeb object 0x02720824 references a disposed or invalid SPRequest object: 0x02720a20
undisposed SPWeb object 0x02d2aa74 references a disposed or invalid SPRequest object: 0x02d2ac70
...
Undisposed SPRequest Objects per managed Thread:
Thread ID: 6714, Undisposed SPRequest: 4
Thread ID: 4c68, Undisposed SPRequest: 3
Thread ID: 5e8c, Undisposed SPRequest: 1
Thread ID: 6180, Undisposed SPRequest: 1
So now I would like to understand what causes the memory fragmentation. I hope you can help me. These are the steps I took to gather the relevant information.
Windbg with sos.dll
!address summary
-------------------- Usage SUMMARY --------------------------
TotSize ( KB) Pct(Tots) Pct(Busy) Usage
2c748000 ( 728352) : 34.73% 53.79% : RegionUsageIsVAD
2d584000 ( 742928) : 35.43% 00.00% : RegionUsageFree
f987000 ( 255516) : 12.18% 18.87% : RegionUsageImage
10fc000 ( 17392) : 00.83% 01.28% : RegionUsageStack
44000 ( 272) : 00.01% 00.02% : RegionUsageTeb
1585a000 ( 352616) : 16.81% 26.04% : RegionUsageHeap
0 ( 0) : 00.00% 00.00% : RegionUsagePageHeap
1000 ( 4) : 00.00% 00.00% : RegionUsagePeb
1000 ( 4) : 00.00% 00.00% : RegionUsageProcessParametrs
1000 ( 4) : 00.00% 00.00% : RegionUsageEnvironmentBlock
Tot: 7fff0000 (2097088 KB) Busy: 52a6c000 (1354160 KB)
-------------------- Type SUMMARY --------------------------
TotSize ( KB) Pct(Tots) Usage
2d584000 ( 742928) : 35.43% : <free>
154e3000 ( 349068) : 16.65% : MEM_IMAGE
127c000 ( 18928) : 00.90% : MEM_MAPPED
3c30d000 ( 986164) : 47.03% : MEM_PRIVATE
-------------------- State SUMMARY --------------------------
TotSize ( KB) Pct(Tots) Usage
3938b000 ( 937516) : 44.71% : MEM_COMMIT
2d584000 ( 742928) : 35.43% : MEM_FREE
196e1000 ( 416644) : 19.87% : MEM_RESERVE
Largest free region: Base 4b0b0000 - Size 017a0000 (24192 KB)
It seems there is enough free memory overall (742928 KB), but the largest free chunk is only 24192 KB. Again: free memory fragmentation!
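To see how the free address space is actually carved up, rather than just the largest block, you can list the free regions themselves. A sketch; the -f filter is newer WinDbg syntax, so on older debugger builds you may have to scan the plain !address output instead:

0:000> !address -summary
0:000> !address -f:Free

If the listing shows many mid-sized holes separated by small busy regions, that usually points at scattered allocations (pinned objects, DLLs loaded into the middle of free ranges) rather than one large consumer.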
!threads
ThreadCount: 38
UnstartedThread: 0
BackgroundThread: 37
PendingThread: 0
DeadThread: 0
Hosted Runtime: no
PreEmptive GC Alloc Lock
ID OSID ThreadOBJ State GC Context Domain Count APT Exception
14 1 5358 0010f718 1808220 Enabled 00000000:00000000 000dd1b8 0 MTA (Threadpool Worker)
18 2 61a4 001118d0 b220 Enabled 00000000:00000000 000dd1b8 0 MTA (Finalizer)
19 3 6060 0012a3f8 80a220 Enabled 00000000:00000000 000dd1b8 0 MTA (Threadpool Completion Port)
20 4 64c8 0012df90 1220 Enabled 00000000:00000000 000dd1b8 0 Ukn
12 5 57f4 00147e80 880a220 Enabled 00000000:00000000 000dd1b8 0 MTA (Threadpool Completion Port)
23 7 6714 0eb89f08 180b220 Enabled 00000000:00000000 0012e6d0 1 MTA (Threadpool Worker)
24 8 66b8 0eb91970 180b220 Enabled 00000000:00000000 000dd1b8 0 MTA (Threadpool Worker)
25 b 6320 0eb942f0 180b220 Disabled 00000000:00000000 0012e6d0 0 MTA (Threadpool Worker)
26 d 2004 0eb97120 180b220 Enabled 00000000:00000000 000dd1b8 0 MTA (Threadpool Worker)
27 e 5bb0 0eb9a438 180b220 Enabled 00000000:00000000 000dd1b8 0 MTA (Threadpool Worker)
28 f 61a8 0eb9dee8 380b220 Enabled 00000000:00000000 0012e6d0 1 MTA (Threadpool Worker)
29 14 3b88 0ebba688 180b220 Enabled 00000000:00000000 000dd1b8 0 MTA (Threadpool Worker)
30 15 5d74 0ebc4840 380b220 Enabled 00000000:00000000 0012e6d0 1 MTA (Threadpool Worker)
31 16 422c 0ebc91b0 180b220 Enabled 00000000:00000000 000dd1b8 0 MTA (Threadpool Worker)
32 18 6544 125242c8 180b220 Enabled 00000000:00000000 000dd1b8 0 MTA (Threadpool Worker)
33 1a 4c68 12534bc8 180b220 Disabled 4e875ac4:4e875d30 0012e6d0 1 MTA (Threadpool Worker)
34 1b 66d4 12539c80 180b220 Enabled 00000000:00000000 000dd1b8 0 MTA (Threadpool Worker)
35 1c 5e8c 12542e58 180b220 Enabled 00000000:00000000 0012e6d0 1 MTA (Threadpool Worker)
36 1d 62f0 1254be90 180b220 Enabled 4e875d84:4e877d30 0012e6d0 2 MTA (Threadpool Worker) System.OutOfMemoryException (4e875d3c)
39 1e 6558 0ec16d28 80a220 Enabled 00000000:00000000 000dd1b8 0 MTA (Threadpool Completion Port)
40 1f 6180 0ec14b70 200b020 Enabled 00000000:00000000 0012e6d0 0 MTA
43 20 592c 0ebd7a00 220 Enabled 00000000:00000000 000dd1b8 0 MTA
45 21 624c 1261a060 220 Enabled 00000000:00000000 000dd1b8 0 MTA
8 22 5c78 125499f8 220 Enabled 00000000:00000000 000dd1b8 0 Ukn
6 23 3c68 126b6e90 220 Enabled 00000000:00000000 000dd1b8 0 Ukn
7 24 6458 36414400 220 Enabled 00000000:00000000 000dd1b8 0 Ukn
44 25 5e60 36675440 220 Enabled 00000000:00000000 000dd1b8 0 MTA
5 26 55d8 364214a0 220 Enabled 00000000:00000000 000dd1b8 0 Ukn
57 27 6534 36622948 220 Enabled 00000000:00000000 000dd1b8 0 Ukn
58 28 59bc 0016f810 220 Enabled 00000000:00000000 000dd1b8 0 Ukn
56 29 3ee0 250fa6d8 220 Enabled 00000000:00000000 000dd1b8 0 Ukn
60 2a 63fc 252da068 200b220 Enabled 00000000:00000000 0012e6d0 0 MTA
59 2b 5fdc 24fc0be8 220 Enabled 00000000:00000000 000dd1b8 0 Ukn
61 2c 4154 25052008 220 Enabled 00000000:00000000 000dd1b8 0 Ukn
62 2d 60fc 250093a8 220 Enabled 00000000:00000000 000dd1b8 0 Ukn
42 2e 1a38 1c99b5d0 220 Enabled 00000000:00000000 000dd1b8 0 MTA
63 13 59e0 0ebb5d48 220 Enabled 00000000:00000000 000dd1b8 0 Ukn
64 19 6420 0ebac6a0 880b220 Enabled 00000000:00000000 000dd1b8 0 MTA (Threadpool Completion Port)
!eeheap -gc
Number of GC Heaps: 2
------------------------------
Heap 0 (000e2418)
generation 0 starts at 0x432d0c54
generation 1 starts at 0x432c0038
generation 2 starts at 0x02060038
ephemeral segment allocation context: none
segment begin allocated size
02060000 02060038 03d0234c 0x01ca2314(30024468)
432c0000 432c0038 441e82f4 0x00f282bc(15893180)
Large object heap starts at 0x0a060038
segment begin allocated size
0a060000 0a060038 0bd57530 0x01cf74f8(30373112)
5b920000 5b920038 5c9bf338 0x0109f300(17429248)
Heap Size 0x5960dc8(93720008)
------------------------------
Heap 1 (00110750)
generation 0 starts at 0x4e869d30
generation 1 starts at 0x4e820038
generation 2 starts at 0x06060038
ephemeral segment allocation context: none
segment begin allocated size
06060000 06060038 07a0af98 0x019aaf60(26914656)
4e820000 4e820038 4e8eaf38 0x000caf00(831232)
Large object heap starts at 0x0c060038
segment begin allocated size
0c060000 0c060038 0c9ec998 0x0098c960(10013024)
6e020000 6e020038 6f90f0a8 0x018ef070(26144880)
Heap Size 0x3cf1830(63903792)
------------------------------
GC Heap Size 0x96525f8(157623800)
!dumpheap (extract from Heap 0's large object heap, which starts at 0x0a060038 per the !eeheap output above)
0a060038 000e1a98 16 Free
0a060048 793042f4 4096
0a061048 000e1a98 16 Free
0a061058 793042f4 528
0a061268 000e1a98 16 Free
0a061278 793042f4 4096
0a062278 000e1a98 16 Free
0a062288 793042f4 5112
0a063680 000e1a98 16 Free
0a063690 793042f4 4096
0a064690 000e1a98 16 Free
0a0646a0 793042f4 4096
0a0656a0 000e1a98 16 Free
0a0656b0 793042f4 5112
0a066aa8 000e1a98 16 Free
0a066ab8 793042f4 4096
0a067ab8 000e1a98 16 Free
0a067ac8 793042f4 4096
0a068ac8 000e1a98 16 Free
0a068ad8 793042f4 4096
0a069ad8 793042f4 528
0a069ce8 000e1a98 16 Free
0a069cf8 793042f4 528
0a069f08 793042f4 528
0a06a118 000e1a98 16 Free
0a06a128 793042f4 528
0a06a338 000e1a98 260096 Free
0a0a9b38 793042f4 4096
0a0aab38 000e1a98 16 Free
0a0aab48 793042f4 5784
0a0ac1e0 000e1a98 16 Free
0a0ac1f0 793042f4 4096
0a0ad1f0 000e1a98 16 Free
0a0ad200 793042f4 528
0a0ad410 000e1a98 16 Free
0a0ad420 793042f4 4096
0a0ae420 000e1a98 16 Free
0a0ae430 793042f4 528
0a0ae640 000e1a98 16 Free
0a0ae650 793042f4 4096
0a0af650 000e1a98 16 Free
0a0af660 793042f4 528
0a0af870 000e1a98 16 Free
0a0af880 793042f4 528
0a0afa90 000e1a98 131120 Free
0a0cfac0 793042f4 528
0a0cfcd0 000e1a98 16 Free
0a0cfce0 793042f4 4096
0a0d0ce0 000e1a98 16 Free
0a0d0cf0 793042f4 528
0a0d0f00 000e1a98 16 Free
0a0d0f10 793042f4 528
0a0d1120 000e1a98 16 Free
0a0d1130 793042f4 528
0a0d1340 000e1a98 16 Free
0a0d1350 793042f4 4096
0a0d2350 000e1a98 16 Free
0a0d2360 793042f4 5784
0a0d39f8 000e1a98 16 Free
0a0d3a08 793042f4 4096
0a0d4a08 000e1a98 348200 Free
0a129a30 793042f4 528
0a129c40 000e1a98 16 Free
0a129c50 793042f4 528
0a129e60 000e1a98 361224 Free
0a182168 793042f4 528
0a182378 000e1a98 16 Free
0a182388 793042f4 7016
0a183ef0 000e1a98 16 Free
0a183f00 793042f4 7016
...
63859d80 14762 413336 System.Xml.XmlElement
6385a090 12103 435708 System.Xml.XmlName
79332b54 21020 504480 System.Collections.ArrayList
6385798c 32932 658640 System.Xml.NameTable+Entry
6385c76c 35215 704300 System.Xml.XmlAttribute
79331754 505 706416 System.Char[]
7932dd5c 12751 714056 System.Reflection.RuntimePropertyInfo
6385a284 36665 733300 System.Xml.XmlText
79332cc0 5530 791644 System.Int32[]
7932fde0 22824 1278144 System.Reflection.RuntimeMethodInfo
79333274 6758 1733808 System.Collections.Hashtable+bucket[]
793042f4 54360 5051132 System.Object[]
79333594 4772 29304312 System.Byte[]
79330b24 225539 33121896 System.String
000e1a98 239 72089072 Free
Total 711343 objects
Fragmented blocks larger than 0.5 MB:
Addr     Size   Followed by
4331b1d8 14.8MB 441e8270 System.Threading.Overlapped
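The object sitting right after the 14.8 MB hole is a System.Threading.Overlapped, which typically wraps a buffer pinned for async I/O; the GC cannot compact across pinned objects, so they hold such holes open. A sketch of how one might chase it, reusing the addresses from the output above:

0:000> !do 441e8270
0:000> !gcroot 441e8270
0:000> !dumpheap -type Free -min 524288

The last command lists every free block of 512 KB or more; looking at the object that follows each block (as the "Followed by" column above does) tells you what is keeping the holes open.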
I looked inside some of the addresses between the "Free" segments but unfortunately I can't find any information about the source that caused the issue.
!do 0a182388
Name: System.Object[]
MethodTable: 793042f4
EEClass: 790eda64
Size: 7012(0x1b64) bytes
Array: Rank 1, Number of elements 1749, Type CLASS
Element Type: System.Object
Fields:
None
!gcroot 0a182388
Note: Roots found on stacks may be false positives. Run "!help gcroot" for
more info.
Scan Thread 14 OSTHread 5358
Scan Thread 18 OSTHread 61a4
Scan Thread 19 OSTHread 6060
Scan Thread 20 OSTHread 64c8
Scan Thread 12 OSTHread 57f4
Scan Thread 23 OSTHread 6714
Scan Thread 24 OSTHread 66b8
Scan Thread 25 OSTHread 6320
Scan Thread 26 OSTHread 2004
Scan Thread 27 OSTHread 5bb0
Scan Thread 28 OSTHread 61a8
Scan Thread 29 OSTHread 3b88
Scan Thread 30 OSTHread 5d74
Scan Thread 31 OSTHread 422c
Scan Thread 32 OSTHread 6544
Scan Thread 33 OSTHread 4c68
Scan Thread 34 OSTHread 66d4
Scan Thread 35 OSTHread 5e8c
Scan Thread 36 OSTHread 62f0
Scan Thread 39 OSTHread 6558
Scan Thread 40 OSTHread 6180
Scan Thread 43 OSTHread 592c
Scan Thread 45 OSTHread 624c
Scan Thread 8 OSTHread 5c78
Scan Thread 6 OSTHread 3c68
Scan Thread 7 OSTHread 6458
Scan Thread 44 OSTHread 5e60
Scan Thread 5 OSTHread 55d8
Scan Thread 57 OSTHread 6534
Scan Thread 58 OSTHread 59bc
Scan Thread 56 OSTHread 3ee0
Scan Thread 60 OSTHread 63fc
Scan Thread 59 OSTHread 5fdc
Scan Thread 61 OSTHread 4154
Scan Thread 62 OSTHread 60fc
Scan Thread 42 OSTHread 1a38
Scan Thread 63 OSTHread 59e0
Scan Thread 64 OSTHread 6420
DOMAIN(0012E6D0):HANDLE(Pinned):e4613d4:Root:0a182388(System.Object[])
!gcroot 0a129a30
...
Scan Thread 64 OSTHread 6420
DOMAIN(000DD1B8):HANDLE(Pinned):1fc11b8:Root:0a129a30(System.Object[])
!gcroot 0a061278
...
DOMAIN(000DD1B8):HANDLE(Pinned):1fb13f0:Root:0a061278(System.Object[])
!gchandles
GC Handle Statistics:
Strong Handles: 1007
Pinned Handles: 474
Async Pinned Handles: 6
Ref Count Handles: 5
Weak Long Handles: 681
Weak Short Handles: 56
Other Handles: 0
...
661485ec 68 2176 System.Web.NativeFileChangeNotification
66153774 93 2976 System.Web.Hosting.ISAPIAsyncCompletionCallback
793310f8 68 3808 System.Threading.Thread
793141f0 162 6480 System.Reflection.Emit.DynamicResolver
79332070 279 6696 System.Reflection.Assembly
7932f19c 228 10944 System.Reflection.Module
793327e8 328 11808 System.Security.PermissionSet
7932f25c 386 29336 System.RuntimeType+RuntimeTypeCache
793042f4 185 294456 System.Object[]
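474 pinned handles is a lot for a 32-bit process, and the 185 pinned System.Object[] instances match the small pinned arrays in the heap extract above. Some SOS builds for .NET 2.0 ship a helper that cross-checks the handle table for leaks; a sketch, hedged because the command is not present in every SOS version:

0:000> !gchandleleaks

Alternatively, diffing !gchandles output between two dumps taken some minutes apart shows whether the pinned-handle count grows over time, which would point to a component pinning buffers and never releasing them.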
In the DebugDiag SharePoint analysis some undisposed SPWeb objects were reported.
So I'm trying to find the cause here...
Report: "undisposed SPWeb object 0x02701168 references a disposed or invalid SPRequest object: 0x0270137c"
!do 0x02701168
Name: Microsoft.SharePoint.SPWeb
MethodTable: 1325ed80
EEClass: 1669cd80
Size: 508(0x1fc) bytes
(C:\WINDOWS\assembly\GAC_MSIL\Microsoft.SharePoint\12.0.0.0__71e9bce111e9429c\Microsoft.SharePoint.dll)
!gcroot 0x02701168
Note: Roots found on stacks may be false positives. Run "!help gcroot" for
more info.
Scan Thread 14 OSTHread 5358
Scan Thread 18 OSTHread 61a4
Scan Thread 19 OSTHread 6060
Scan Thread 20 OSTHread 64c8
Scan Thread 12 OSTHread 57f4
Scan Thread 23 OSTHread 6714
ESP:efbe92c:Root:07a2cbc8(System.Collections.Hashtable+bucket[])->
023f1044(Microsoft.SharePoint.Publishing.CacheManager)->
023f3df4(Microsoft.SharePoint.Publishing.CachedObjectFactory)->
023f3e7c(Microsoft.SharePoint.Publishing.WssObjectCache)->
023f3f30(System.Collections.Hashtable)->
03b0f448(System.Collections.Hashtable+bucket[])->
0733b918(Microsoft.SharePoint.Publishing.ThreadSafeCache`2+CacheEntry`2[[System.String, mscorlib],[Microsoft.SharePoint.Publishing.CachedObjectWrapper, Microsoft.SharePoint.Publishing],[System.String, mscorlib],[Microsoft.SharePoint.Publishing.CachedObjectWrapper, Microsoft.SharePoint.Publishing]])->
0733b868(Microsoft.SharePoint.Publishing.CachedObjectWrapper)->
035f25b4(Microsoft.SharePoint.Publishing.CachedPage)->
035f2718(System.Collections.Generic.Dictionary`2[[Microsoft.SharePoint.Publishing.Navigation.PortalSiteMapProvider, Microsoft.SharePoint.Publishing],[Microsoft.SharePoint.Publishing.Navigation.PortalSiteMapNode, Microsoft.SharePoint.Publishing]])->
0733ba90(System.Collections.Generic.Dictionary`2+Entry[[Microsoft.SharePoint.Publishing.Navigation.PortalSiteMapProvider, Microsoft.SharePoint.Publishing],[Microsoft.SharePoint.Publishing.Navigation.PortalSiteMapNode, Microsoft.SharePoint.Publishing]][])->
0733b92c(Microsoft.SharePoint.Publishing.Navigation.PortalListItemSiteMapNode)->
069b2f88(Microsoft.SharePoint.Publishing.Navigation.PortalWebSiteMapNode)->
069bfee4(System.Collections.Generic.Dictionary`2[[System.Guid, mscorlib],[Microsoft.SharePoint.Publishing.Navigation.ProxySiteMapNode, Microsoft.SharePoint.Publishing]])->
069d8fe8(System.Collections.Generic.Dictionary`2+Entry[[System.Guid, mscorlib],[Microsoft.SharePoint.Publishing.Navigation.ProxySiteMapNode, Microsoft.SharePoint.Publishing]][])->
069e0934(Microsoft.SharePoint.Publishing.Navigation.ProxySiteMapNode)->
069e04a8(Microsoft.SharePoint.Navigation.SPNavigationNode)->
069bfe28(Microsoft.SharePoint.Navigation.SPNavigation)->
069bf7dc(Microsoft.SharePoint.SPWeb)->
02700dcc(Microsoft.SharePoint.SPSite)->
02701364(System.Collections.Generic.List`1[[Microsoft.SharePoint.SPWeb, Microsoft.SharePoint]])->
The MS dispose checker didn't find any issues either.
So now I don't know how to proceed to find the (custom) component that causes the memory fragmentation. I hope that someone could give me some hints, tool suggestions, or a checklist of components that may cause the fragmentation (antivirus, caching, etc.). The problem occurs only in the prod environment, and the only thing that we do now is iisreset - sometimes 5 times a day…
Thank you in advance and best regards,
Anton
Your crash logs might contain the faulting object, but more likely the assembly that's executing changes with every crash and seems random. They might just be innocent bystanders that got left holding the bag when all the memory was gone.
First - can't you configure the app pool to automatically recycle when a certain memory threshold is reached? This might alleviate the need to constantly monitor and be ready for an IISRESET. Otherwise you might want to schedule regular recycles to keep the memory tidy for the time being; a sketch of setting the thresholds follows.
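On IIS 6 / Windows 2003 the memory-based recycle thresholds can be set from the command line as well as through IIS Manager; a sketch using the stock metabase script, where SharePointPool is a placeholder for your app pool name and both values are in KB (roughly 1.4 GB virtual and 800 MB private here):

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs SET W3SVC/AppPools/SharePointPool/PeriodicRestartMemory 1500000
cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs SET W3SVC/AppPools/SharePointPool/PeriodicRestartPrivateMemory 800000

Since fragmentation bites well before physical memory runs out on 32-bit, the virtual-memory threshold is the one that matters most here.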
Next, try to identify when the crashes began and check your deployment logs to see what was installed. (You DO keep logs of software package installs, right?)
Are the custom components developed in-house? You can offload some of the work initially by having the developers check all their deployed projects with the SharePoint Dispose Checker Tool (is this what you were referring to at the end of your question?); a run sketch follows. Undisposed SPWeb and SPSite objects seem to be the biggest cause of this kind of fragmentation.
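For reference, the tool (SPDisposeCheck) is a command-line scanner that runs against compiled assemblies; a minimal sketch, with the bin path as a placeholder for wherever your custom assemblies are deployed:

SPDisposeCheck.exe "C:\Inetpub\wwwroot\wss\VirtualDirectories\80\bin" > spdisposecheck.txt

It reports each call site that leaks an SPWeb, SPSite, or SPRequest so the developers can add the missing Dispose calls.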
Another avenue to explore is this MSDN question I ran into while looking for something else. It appears the Navigation bar on a publishing page was to blame. There is a hotfix for that issue but you have to request it directly from Microsoft.
I've been developing for SharePoint for a long time, but it's always been someone else's job to find these problems! These tidbits are what I've gleaned over time; hopefully something will be useful.

Resources