Excessive Gen 2 Free Blocks in Crash Dump - memory-leaks

On inspecting a crash dump file for an out of memory exception reported by a client the results of !DumpHeap -stat showed that 575MB of memory is being taken up by 45,000 objects of type "Free" most of which I assume would have to reside in Gen 2 due to the size.
The first places I looked for problems were the large object heap (LOH) and pinned objects. The large object heap with free space included was only 70MB so that wasn't the issue and running !gchandles showed:
GC Handle Statistics:
Strong Handles: 155
Pinned Handles: 265
Async Pinned Handles: 8
Ref Count Handles: 163
Weak Long Handles: 0
Weak Short Handles: 0
Other Handles: 0
which is a very small number of handles (around 600) compared to the number of free objects (45,000). To me this rules out the free blocks being caused by pinning.
I also looked into the free blocks themselves to see if maybe they had a consistent size, but on inspection the sizes varied widely and went from just short of 5MB to only around 12 bytes or so.
Any help would be appreciated! I am at a loss since there is fragmentation but no signs of it being cause by the two places that I know to look which are the large object heap (LOH) and pinned handles.

In which generation are the free objects?
I assume would have to reside in Gen 2 due to the size
Size is not related to generations. To find out in which generation the free blocks reside, you can follow these steps:
From !dumpheap -stat -type Free get the method table:
0:003> !dumpheap -stat -type Free
total 7 objects
Statistics:
MT Count TotalSize Class Name
00723538 7 100 Free
Total 7 objects
From !eeheap -gc, get the start addresses of the generations.
0:003> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x026a1018
generation 1 starts at 0x026a100c
generation 2 starts at 0x026a1000
ephemeral segment allocation context: none
segment begin allocated size
026a0000 026a1000 02731ff4 0x00090ff4(593908)
Large object heap starts at 0x036a1000
segment begin allocated size
036a0000 036a1000 036a3250 0x00002250(8784)
Total Size 0x93244(602692)
------------------------------
GC Heap Size 0x93244(602692)
Then dump the free objects only of a specific generation by passing the start and end address (e.g. of generation 2 here):
0:003> !dumpheap -mt 00723538 0x026a1000 0x026a100c
Address MT Size
026a1000 00723538 12 Free
026a100c 00723538 12 Free
total 2 objects
Statistics:
MT Count TotalSize Class Name
00723538 2 24 Free
Total 2 objects
So in my simple case, there are 2 free objects in generation 2.
Pinned handles
600 pinned objects should not cause 45.000 free memory blocks. Still from my experience, 600 pinned handles are a lot. But first, check in which generation the free memory blocks reside.

Related

Write high bandwidth real-time data to SSD in Linux

I have a real-time process that receives 16 kB of data every 200 us for about 1 hr. I need to store this data.
I have a 240 GB SSD on a SATA III channel and I thought I could use it as a plain storage device without any filesystem on it. I am running 5.4.0-109-generic kernel with 8 GB or ram.
Here is what I have done so far:
I set up a shared memory shm, dimension of 1 GB, where I write the data and I use a semaphore to tell a logger process when data is available.
In the logger process:
I open the SSD:
fd = open("/dev/sdb",O_WRONLY|O_LARGEFILE);
I wait for the data to be available in the shm and then I write a chunk, writing_size, of it to the SSD:
written_size = write(fd,local_buffer,writing_size);
I checked and written_size is always equal to writing_size;
performs an fsync() after N cycles:
if(written_cycles > N)
ret = fsync(fd);
I checked and fsync never returns -1.
I did set the I/O scheduler of /dev/sdb as noop and I did experiment with different writing_size and N. The final values I came up with are writing_size = 64 kB and N = 16.
The behavior that I am seeing is this:
the whole process works very well up until 17 GB have been written. At that point the logger process is being put to "uninterruptible sleep (D)" quite often and for quite some time, 1 or 2 seconds. This is still fine as the shared memory buffer will fill up in ~13 sec. When the data written reaches 20 GB, the logger process is being put to sleep for way longer, until it reaches 13 sec and I start to lose data. The threshold when the process is starting to being put to sleep is quite repeatable, 16 - 17 GB, but the maximum amount of data I can save before I lose it is random.
This is the best I can achieve so far with my method and the writing_size and N tuning mentioned previously.
I tried to set the logger process nice to -20 with no improvements.
It also looks like the noop I/O scheduler does not support the ionice so I tried the CFQ scheduler with maximum ionice but still got worse performances.
I bet the logger process is being put to sleep for I/O access but I do not understand why it happens after a certain number of bytes have been written. iotop shows that the I/O bandwidth of the logger process is stable around 85 MB/s.
I welcome any suggestions.
PS: I did try to mmap the the SSD and do memcpy instead of write()+fsync() but mmap is slower and results are worse.

How is memory used value derived in check_snmp_mem.pl?

I was configuring icinga2 to get memory used information from one linux client using script at check_snmp_mem.pl . Any idea how the memory used is derived in this script ?
Here is free command output
# free
total used free shared buff/cache available
Mem: 500016 59160 89564 3036 351292 408972
Swap: 1048572 4092 1044480
where as the performance data shown in icinga dashboard is
Label Value Max Warning Critical
ram_used 137,700.00 500,016.00 470,015.00 490,016.00
swap_used 4,092.00 1,048,572.00 524,286.00 838,858.00
Looking through the source code, it mentions ram_used for example in this line:
$n_output .= " | ram_used=" . ($$resultat{$nets_ram_total}-$$resultat{$nets_ram_free}-$$resultat{$nets_ram_cache}).";";
This strongly suggests that ram_used is calculated as the difference of the total RAM and the free RAM and the RAM used for cache. These values are retrieved via the following SNMP ids:
my $nets_ram_free = "1.3.6.1.4.1.2021.4.6.0"; # Real memory free
my $nets_ram_total = "1.3.6.1.4.1.2021.4.5.0"; # Real memory total
my $nets_ram_cache = "1.3.6.1.4.1.2021.4.15.0"; # Real memory cached
I don't know how they correlate to the output of free. The difference in free memory reported by free and to Icinga is 48136, so maybe you can find that number somewhere.

Stress-ng: RAM testing commands

Stress-ng: Can we test RAM using stress-ng? What are the commands used to test RAM on a MIPS 32 device?
There are many memory based stressors in stress-ng:
stress-ng --class memory?
class 'memory' stressors: atomic bsearch context full heapsort hsearch
lockbus lsearch malloc matrix membarrier memcpy memfd memrate memthrash
mergesort mincore null numa oom-pipe pipe qsort radixsort remap
resources rmap stack stackmmap str stream tlb-shootdown tmpfs tsearch
vm vm-rw wcs zero zlib
Alternatively, one can also use VM based stressors too:
stress-ng --class vm?
class 'vm' stressors: bigheap brk madvise malloc mlock mmap mmapfork mmapmany
mremap msync shm shm-sysv stack stackmmap tmpfs userfaultfd vm vm-rw
vm-splice
I suggest looking at the vm stressor first as this contains a large range of stressor methods that exercise memory patterns and can possibly find broken memory:
-m N, --vm N
start N workers continuously calling mmap(2)/munmap(2) and writ‐
ing to the allocated memory. Note that this can cause systems to
trip the kernel OOM killer on Linux systems if not enough physi‐
cal memory and swap is not available.
--vm-bytes N
mmap N bytes per vm worker, the default is 256MB. One can spec‐
ify the size as % of total available memory or in units of
Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.
--vm-ops N
stop vm workers after N bogo operations.
--vm-hang N
sleep N seconds before unmapping memory, the default is zero
seconds. Specifying 0 will do an infinite wait.
--vm-keep
do not continually unmap and map memory, just keep on re-writing
to it.
--vm-locked
Lock the pages of the mapped region into memory using mmap
MAP_LOCKED (since Linux 2.5.37). This is similar to locking
memory as described in mlock(2).
--vm-madvise advice
Specify the madvise 'advice' option used on the memory mapped
regions used in the vm stressor. Non-linux systems will only
have the 'normal' madvise advice, linux systems support 'dont‐
need', 'hugepage', 'mergeable' , 'nohugepage', 'normal', 'ran‐
dom', 'sequential', 'unmergeable' and 'willneed' advice. If this
option is not used then the default is to pick random madvise
advice for each mmap call. See madvise(2) for more details.
--vm-method m
specify a vm stress method. By default, all the stress methods
are exercised sequentially, however one can specify just one
method to be used if required. Each of the vm workers have 3
phases:
1. Initialised. The anonymously memory mapped region is set to a
known pattern.
2. Exercised. Memory is modified in a known predictable way.
Some vm workers alter memory sequentially, some use small or
large strides to step along memory.
3. Checked. The modified memory is checked to see if it matches
the expected result.
The vm methods containing 'prime' in their name have a stride of
the largest prime less than 2^64, allowing to them to thoroughly
step through memory and touch all locations just once while also
doing without touching memory cells next to each other. This
strategy exercises the cache and page non-locality.
Since the memory being exercised is virtually mapped then there
is no guarantee of touching page addresses in any particular
physical order. These workers should not be used to test that
all the system's memory is working correctly either, use tools
such as memtest86 instead.
The vm stress methods are intended to exercise memory in ways to
possibly find memory issues and to try to force thermal errors.
Available vm stress methods are described as follows:
Method Description
all iterate over all the vm stress methods
as listed below.
flip sequentially work through memory 8
times, each time just one bit in memory
flipped (inverted). This will effec‐
tively invert each byte in 8 passes.
galpat-0 galloping pattern zeros. This sets all
bits to 0 and flips just 1 in 4096 bits
to 1. It then checks to see if the 1s
are pulled down to 0 by their neighbours
or of the neighbours have been pulled up
to 1.
galpat-1 galloping pattern ones. This sets all
bits to 1 and flips just 1 in 4096 bits
to 0. It then checks to see if the 0s
are pulled up to 1 by their neighbours
or of the neighbours have been pulled
down to 0.
gray fill the memory with sequential gray
codes (these only change 1 bit at a time
between adjacent bytes) and then check
if they are set correctly.
incdec work sequentially through memory twice,
the first pass increments each byte by a
specific value and the second pass
decrements each byte back to the origi‐
nal start value. The increment/decrement
value changes on each invocation of the
stressor.
inc-nybble initialise memory to a set value (that
changes on each invocation of the stres‐
sor) and then sequentially work through
each byte incrementing the bottom 4 bits
by 1 and the top 4 bits by 15.
rand-set sequentially work through memory in 64
bit chunks setting bytes in the chunk to
the same 8 bit random value. The random
value changes on each chunk. Check that
the values have not changed.
rand-sum sequentially set all memory to random
values and then summate the number of
bits that have changed from the original
set values.
read64 sequentially read memory using 32 x 64
bit reads per bogo loop. Each loop
equates to one bogo operation. This
exercises raw memory reads.
ror fill memory with a random pattern and
then sequentially rotate 64 bits of mem‐
ory right by one bit, then check the
final load/rotate/stored values.
swap fill memory in 64 byte chunks with ran‐
dom patterns. Then swap each 64 chunk
with a randomly chosen chunk. Finally,
reverse the swap to put the chunks back
to their original place and check if the
data is correct. This exercises adjacent
and random memory load/stores.
move-inv sequentially fill memory 64 bits of mem‐
ory at a time with random values, and
then check if the memory is set cor‐
rectly. Next, sequentially invert each
64 bit pattern and again check if the
memory is set as expected.
modulo-x fill memory over 23 iterations. Each
iteration starts one byte further along
from the start of the memory and steps
along in 23 byte strides. In each
stride, the first byte is set to a ran‐
dom pattern and all other bytes are set
to the inverse. Then it checks see if
the first byte contains the expected
random pattern. This exercises cache
store/reads as well as seeing if neigh‐
bouring cells influence each other.
prime-0 iterate 8 times by stepping through mem‐
ory in very large prime strides clearing
just on bit at a time in every byte.
Then check to see if all bits are set to
zero.
prime-1 iterate 8 times by stepping through mem‐
ory in very large prime strides setting
just on bit at a time in every byte.
Then check to see if all bits are set to
one.
prime-gray-0 first step through memory in very large
prime strides clearing just on bit
(based on a gray code) in every byte.
Next, repeat this but clear the other 7
bits. Then check to see if all bits are
set to zero.
prime-gray-1 first step through memory in very large
prime strides setting just on bit (based
on a gray code) in every byte. Next,
repeat this but set the other 7 bits.
Then check to see if all bits are set to
one.
rowhammer try to force memory corruption using the
rowhammer memory stressor. This fetches
two 32 bit integers from memory and
forces a cache flush on the two
addresses multiple times. This has been
known to force bit flipping on some
hardware, especially with lower fre‐
quency memory refresh cycles.
walk-0d for each byte in memory, walk through
each data line setting them to low (and
the others are set high) and check that
the written value is as expected. This
checks if any data lines are stuck.
walk-1d for each byte in memory, walk through
each data line setting them to high (and
the others are set low) and check that
the written value is as expected. This
checks if any data lines are stuck.
walk-0a in the given memory mapping, work
through a range of specially chosen
addresses working through address lines
to see if any address lines are stuck
low. This works best with physical mem‐
ory addressing, however, exercising
these virtual addresses has some value
too.
walk-1a in the given memory mapping, work
through a range of specially chosen
addresses working through address lines
to see if any address lines are stuck
high. This works best with physical mem‐
ory addressing, however, exercising
these virtual addresses has some value
too.
write64 sequentially write memory using 32 x 64
bit writes per bogo loop. Each loop
equates to one bogo operation. This
exercises raw memory writes. Note that
memory writes are not checked at the end
of each test iteration.
zero-one set all memory bits to zero and then
check if any bits are not zero. Next,
set all the memory bits to one and check
if any bits are not one.
--vm-populate
populate (prefault) page tables for the memory mappings; this
can stress swapping. Only available on systems that support
MAP_POPULATE (since Linux 2.5.46).
So to run 1 vm stressor that uses 75% of memory using all the vm stressors with verification for 10 minutes with verbose mode enabled, use:
stress-ng --vm 1 --vm-bytes 75% --vm-method all --verify -t 10m -v

Why is my process taking higher resident memory as compared to virtual memory?

'top' logs of my linux process show that its resident memory is around 6 times of the virtual memory. I have researched a lot but couldn't find any reason for such a behavior. Ideally VIRT is always higher than RES due to linux kernel's memory management. Top output is below -
13743 root 20 0 15.234g 0.010t 4372 R 13.4 4.0 7:43.41 q
Not quite.
The g suffix indicates Gibibyte(s), and t indicates Tebibyte(s).
Let's do the conversion of 0.010t to g (GiB):
zsh% print $((0.010 * 1024))g
10.24g
And 10.24g < 15.234g, so yor assumption is not correct i.e. top is correctly showing the correct values for virtual set size (VSZ) and resident set size (RSS) -- just in different units (need to take a peek at the source for why).

Making sense from GHC profiler

I'm trying to make sense from GHC profiler. There is a rather simple app, which uses werq and lens-aeson libraries, and while learning about GHC profiling, I decided to play with it a bit.
Using different options (time tool, +RTS -p -RTS and +RTS -p -h) I acquired entirely different numbers of my memory usage. Having all those numbers, I'm now completely lost trying to understand what is going on, and how much memory the app actually uses.
This situation reminds me the phrase by Arthur Bloch: "A man with a watch knows what time it is. A man with two watches is never sure."
Can you, please, suggest me, how I can read all those numbers, and what is the meaning of each of them.
Here are the numbers:
time -l reports around 19M
#/usr/bin/time -l ./simple-wreq
...
3.02 real 0.39 user 0.17 sys
19070976 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
21040 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
71 messages sent
71 messages received
2991 signals received
43 voluntary context switches
6490 involuntary context switches
Using +RTS -p -RTS flag reports around 92M. Although it says "total alloc" it seems strange to me, that a simple app like this one can allocate and release 91M
# ./simple-wreq +RTS -p -RTS
# cat simple-wreq.prof
Fri Oct 14 15:08 2016 Time and Allocation Profiling Report (Final)
simple-wreq +RTS -N -p -RTS
total time = 0.07 secs (69 ticks # 1000 us, 1 processor)
total alloc = 91,905,888 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
main.g Main 60.9 88.8
MAIN MAIN 24.6 2.5
decodeLenient/look Data.ByteString.Base64.Internal 5.8 2.6
decodeLenientWithTable/fill Data.ByteString.Base64.Internal 2.9 0.1
decodeLenientWithTable.\.\.fill Data.ByteString.Base64.Internal 1.4 0.0
decodeLenientWithTable.\.\.fill.\ Data.ByteString.Base64.Internal 1.4 0.1
decodeLenientWithTable.\.\.fill.\.\.\.\ Data.ByteString.Base64.Internal 1.4 3.3
decodeLenient Data.ByteString.Base64.Lazy 1.4 1.4
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 443 0 24.6 2.5 100.0 100.0
main Main 887 0 0.0 0.0 75.4 97.4
main.g Main 889 0 60.9 88.8 75.4 97.4
object_ Data.Aeson.Parser.Internal 925 0 0.0 0.0 0.0 0.2
jstring_ Data.Aeson.Parser.Internal 927 50 0.0 0.2 0.0 0.2
unstream/resize Data.Text.Internal.Fusion 923 600 0.0 0.3 0.0 0.3
decodeLenient Data.ByteString.Base64.Lazy 891 0 1.4 1.4 14.5 8.1
decodeLenient Data.ByteString.Base64 897 500 0.0 0.0 13.0 6.7
....
+RTS -p -h and hp2ps show me the following picture and two numbers: 114K in the header and something around 1.8Mb on the graph.
And, just in case, here is the app:
module Main where
import Network.Wreq
import Control.Lens
import Data.Aeson.Lens
import Control.Monad
main :: IO ()
main = replicateM_ 10 g
where
g = do
r <- get "http://httpbin.org/get"
print $ r ^. responseBody
. key "headers"
. key "User-Agent"
. _String
UPDATE 1: Thank everyone for incredible good responses. As was suggested, I add +RTS -s output, so the entire picture builds up for everyone who read it.
#./simple-wreq +RTS -s
...
128,875,432 bytes allocated in the heap
32,414,616 bytes copied during GC
2,394,888 bytes maximum residency (16 sample(s))
355,192 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 194 colls, 0 par 0.018s 0.022s 0.0001s 0.0022s
Gen 1 16 colls, 0 par 0.027s 0.031s 0.0019s 0.0042s
UPDATE 2: The size of the executable:
#du -h simple-wreq
63M simple-wreq
A man with a watch knows what time it is. A man with two watches is never sure.
Ah, but what do does two watches show? Are both meant to show the current time in UTC? Or is one of them supposed to show the time in UTC, and the other one the time on a certain point on Mars? As long as they are in sync, the second scenario wouldn't be a problem, right?
And that is exactly what is happening here. You compare different memory measurements:
the maximum residency
the total amount of allocated memory
The maximum residency is the highest amount of memory your program ever uses at a given time. That's 19MB. However, the total amount of allocated memory is a lot more, since that's how GHC works: it "allocates" memory for objects that are garbage collected, which is almost everything that's not unpacked.
Let us inspect a C example for this:
int main() {
int i;
char * mem;
for(i = 0; i < 5; ++i) {
mem = malloc(19 * 1000 * 1000);
free(mem);
}
return 0;
}
Whenever we use malloc, we will allocate 19 megabytes of memory. However, we free the memory immediately after. The highest amount of memory we ever have at one point is therefore 19 megabytes (and a little bit more for the stack and the program itself).
However, in total, we allocate 5 * 19M, 95M total. Still, we could run our little program with just 20 megs of RAM fine. That's the difference between total allocated memory and maximum residency. Note that the residency reported by time is always at least du <executable>, since that has to reside in memory too.
That being said, the easiest way to generate statistics is -s, which will show how what was the maximum residency from the Haskell's program point of view. In your case, it will be the 1.9M, the number in your heap profile (or double the amount due to profiling). And yeah, Haskell executables tend to get extremely large, since libraries are statically linked.
time -l is displaying the (resident, i.e. not swapped out) size of the process as seen by the operating system (obviously). This includes twice the maximum size of the Haskell heap (due to the way that GHC's GC works), plus anything else allocated by the RTS or other C libraries, plus the code of your executable itself plus the libraries it depends on, etc. I'm guessing in this case the primary contributor to the 19M is the size of your exectuable.
total alloc is the total amount allocated onto the Haskell heap. It is not at all a measure of maximum heap size (which is what people usually mean by "how much memory is my program using"). Allocation is very cheap and allocation rates of around 1GB/s are typical for a Haskell program.
The number in the header of the hp2ps output "114,272 bytes x seconds" is something completely different again: it is the integral of the graph, and is measured in bytes * seconds, not in bytes. For example if your program holds onto a 10 MB structure for 4 seconds then that will cause this number to increase by 40 MB*s.
The number around 1.8 MB shown in the graph is the actual maximum size of the Haskell heap, which is probably the number you're most interested in.
You've omitted the most useful source of numbers about your program's execution, which is running it with +RTS -s (this doesn't even require it to have been built with profiling).

Resources