I have an linux based embedded device on which I am running a QT GUI application as well as a second application controlling some hardware. The two communicate with each other via TCP.
I have recently run a system test where I stimulate the QT application using Squish for an entire week. At the start and end of my test I extract the smap and pmap files for each of my two processes. Likewise I extract the meminfo file.
How might I compare the before and after files to get a rough idea as to whether I have a memory leak problem for the device as a whole? Also, if a leak were detected, how might I make a rough, rough estimate as to when the device will stop functioning correctly?
How might I compare the before and after files to get a rough idea as
to whether I have a memory leak problem for the device as a whole?
Answer:
I think you might need to know how much memory device driver used in kernel space so slab info is the place which it is worthwhile to look into.
Initially,you can do(running a script):
cat /proc/meminfo | grep -i Slab
to monitor the value(any increase after long time running).
If it does, you can compare the leaked one with the one before leaking with the help of cat /proc/slabinfo by looking into each entry.
If you familiar the name of each entry, (if you are sure it must be a leak in this driver and have a rough idea of how it allocate memory) you can do a reasonable guess.
for example
size-4194304 0 0 4194304 1
kmem_cache 136 180 128 30
if "kmem_cache entry" is leaking, then your driver might call kmem_cache_create and kmem_cache_alloc() but not free
if it is "size-4194304", you driver might call get_zeroed_page or _ _get_free_pages but not free.
how might I make a rough, rough estimate as to when the device will
stop functioning correctly
Answer:
if your user space doesn't leak any memory and the leak only happens in your driver, you can use "free" command to get how much free system memory is available.
Then you can roughly calculate how long it will be used up:
(Total free memory(kbyte)/memory leak_speed(kbyte/s)) = number of seconds.
,however kernel will complain before that.
Related
Currently I analyze a C++ application and its memory consumption. Checking the memory consumption of the process before and after a certain function call is possible. However, it seems that, for technical reasons or for better efficiency the OS (Linux) assigns not only the required number of bytes but always a few more which can be consumed later by the application. This makes it hard to analyze the memory behavior of the application.
Is there a workaround? Can one switch Linux to a mode where it assigns just the required number of bytes/pages?
if you use malloc/new, the allocator will always alloc a little more bytes than you requested , as it needs some room to do its housekeeping, also it may need to align the bytes on pages boundaries. The amount of supplementary bytes allocated is implementation dependent.
you can consider to use tools such as gperftools (google) to monitor the memory used.
I wanted to check a process for memory leeks some years ago.
What I did was the following: I wrote a very small debugger (it is easier than it sounds) that simply set breakpoints to malloc(), free(), mmap(), ... and similar functions (I did that under Windows but under Linux it is simpler - I did it in Linux for another purpose!).
Whenever a breakpoint was reached I logged the function arguments and continued program execution...
By processing the logfile (semi-automated) I could find memory leaks.
Disadvantage: It is not possible to debug the program using another debugger in parallel.
Here is my system based on Linux2.6.32.12:
1 It contains 20 processes which occupy a lot of usr cpu
2 It needs to write data on rate 100M/s to disk and those data would not be used recently.
What I expect:
It can run steadily and disk I/O would not affect my system.
My problem:
At the beginning, the system run as I thought. But as the time passed, Linux would cache a lot data for the disk I/O, that lead to physical memory reducing. At last, there will be not enough memory, then Linux will swap in/out my processes. It will cause I/O problem that a lot cpu time was used to I/O.
What I have try:
I try to solved the problem, by "fsync" everytime I write a large block.But the physical memory is still decreasing while cached increasing.
How to stop page cache here, it's useless for me
More infomation:
When Top show free 46963m, all is well including cpu %wa is low and vmstat shows no si or so.
When Top show free 273m, %wa is so high which affect my processes and vmstat shows a lot si and so.
I'm not sure that changing something will affect overall performance.
Maybe you might use posix_fadvise(2) and sync_file_range(2) in your program (and more rarely fsync(2) or fdatasync(2) or sync(2) or syncfs(2), ...). Also look at madvise(2), mlock(2) and munlock(2), and of course mmap(2) and munmap(2). Perhaps ionice(1) could help.
In the reader process, you might perhaps use readhahead(2) (perhaps in a separate thread).
Upgrading your kernel (to a 3.6 or better) could certainly help: Linux has improved significantly on these points since 2.6.32 which is really old.
To drop pagecache you can do the following:
"echo 1 > /proc/sys/vm/drop_caches"
drop_caches are usually 0. And, can be changed as per need. As you've identified yourself, that you need to free pagecache, so this is how to do it. You can also take a look at dirty_writeback_centisecs (and it's related tunables)(http://lxr.linux.no/linux+*/Documentation/sysctl/vm.txt#L129) to make quick writeback, but note it might have consequences, as it calls up kernel flasher thread to write out dirty pages. Also, note the uses of dirty_expire_centices, which defines how much time some data needs to be eligible for writeout.
I have been running a crypto-intensive application that was generating pseudo-random strings, with special structure and mathematical requirements. It has generated around 1.7 million voucher numbers per node in over the last 8 days. The generation process was CPU intensive, with very low memory requirements.
Mnesia running on OTP-14B02 was the storage database and the generation was done within each virtual machine. I had 3 nodes in the cluster with all mnesia tables disc_only_copies type. Suddenly, as activity on the Solaris boxes increased (other users logged on remotely and were starting webservers, ftp sessions, and other tasks), my bash shell started reporting afork: not enough space error.
My erlang Vms also, went down with this error below:
Crash dump was written to: erl_crash.dump
temp_alloc: Cannot reallocate 8388608 bytes of memory (of type "root_set").
Usually, we get memory allocation errors and not memory re-location errors and normally memory of type "heap" is the problem. This time, the memory type reported is type "root-set".
Qn 1. What is this "root-set" memory?
Qn 2. Has it got to do with CPU intensive activities ? (why am asking this is that when i start the task, the Machine reponse to say mouse or Keyboard interrupts is too slow meaning either CPU is too busy or its some other problem i cannot explain for now)
Qn 3. Can such an error be avoided? and how ?
The fork: not enough space message suggests this is a problem with the operating system setup, but:
Q1 - The Root Set
The Root Set is what the garbage collector uses as a starting point when it searches for data that is live in the heap. It usually starts off from the registers of the VM and off from the stack, if the stack has references to heap data that still needs to be live. There may be other roots in Erlang I am not aware of, but these are the basic stuff you start off from.
That it is a reallocation error of exactly 8 Megabyte space could mean one of two things. Either you don't have 8 Megabyte free in the heap, or that the heap is fragmented beyond recognition, so while there are 8 megabytes in it, there are no contiguous such space.
Q2 - CPU activity impact
The problem has nothing to do with the CPU per se. You are running out of memory. A large root set could indicate that you have some very deep recursions going on where you keep around a lot of pointers to data. You may be able to rewrite the code such that it is tail-calling and uses less memory while operating.
You should be more worried about the slow response times from the keyboard and mouse. That could indicate something is not right. Does a vmstat 1, a sysstat, a htop, a dstat or similar show anything odd while the process is running? You are also on the hunt to figure out if the kernel or the C libc is doing something odd here due to memory being constrained.
Q3 - How to fix
I don't know how to fix it without knowing more about what the application is doing. Since you have a crash dump, your first instinct should be to take the crash dump viewer and look at the dump. The goal is to find a process using a lot of memory, or one that has a deep stack. From there on, you can seek to limit the amount of memory that process is using. either by rewriting the code so it can give memory up earlier, by tuning the garbage collection setup for the process (see the spawn options in the erlang man pages for this), or by adding more memory to the system.
I'm trying to run a code on ssh that works perfect for a smaller mesh , but since the new mesh is much bigger i used ifort command to compile it,
ifort -mcmodel=medium -i-dynamic -otest.out*.f
and it complies but when i run it , the output is:
killed
i know that problem is from memory, does anyone know if there's any way to run it?
how can i understand where in code cause memory problem?
Thanks
shadi
From the ifort command line, I think you are running on Linux.
Seeing "killed" as output is generally the result of Linux's Out Of Memory killer (OOM) getting involved to prevent an impending crash (because it's common practice for applications to ask for more memory then they need requests for more memory than is currently available are accepted - check for "Out of Memory: Killed process [PID] [process name]" in the system log files). The OOM killer is generally pretty good at disposing of the application responsible for using all the memory, so the place to start is your applications memory usage.
The first thing to do is try and estimate (even if it's only roughly) how much memory you expect your application to use. One approach is to guestimate the size of the major arrays and multiply them by the number of bits needed per element. Another approach is to think about how you would expect the memory use to grow with mesh size. You can study this by experiment (run with different mesh sizes, measure the memory use and extrapolate) or from one measurement and knowledge of how the major array scale. It may be that you are asking for much more memory then you have on the machine: and the solution to this is probably to get a access to bigger computer. (Or you could try and find an alternative algorithm which uses less memory.)
If their is a memory leak you should see more memory use than expected, even for the smaller mesh size. If this is the case, valgrind should help. Moving from static to dynamic storage probably isn't going to help here - I would expect to see a segmentation fault if you were just exceeding the available space on the stack.
try using valgrind. i tried it to find memory leaks in my fortran code with good success.
http://valgrind.org/
I've got a memory leak somewhere, but it doesn't appear to be related to my program. I'm making this bold statement based on the fact that once my program terminates, either by seg-faulting, exitting, or aborting, the memory isn't recovered. If my program were the culprit, I would assume the MMU would recover everything, but this doesn't appear to be the case.
My question is:
On a small Linux system (64 Mb Ram) running a program that uses only stack memory and a few calls to malloc(), what causes can I look too see memory being run down and stay down once my program terminates?
A related question is here:
This all started when after code in question was directing its stdout, stderr to a file. After a few hours it aborted with a "Segmentation Fault". A quick (naive?) look at /proc/meminfo showed that there wasn't much available memory, so I assumed something was leaking.
It appears I don't have a memory leak (see here) but it does lead me to some new questions...
It turns out that writing to block devices can use a quite a pile of physical memory; in my system there was only 64 Meg, so writing hundreds of Megs to a USB drive was increasing the cached, active and inactive memory pools quite a bit.
These memory pools are immediately released to the Free memory pool when the device is dismounted.
The exact cause of my segmentation fault remains a small mystery, but I know it's occurence can be reduced by understanding the virtual memory resources better, particularly around the use of Block devices.