How to debug an out-of-memory crash when the OOM Killer kills the process? - linux

Posted as Q&A after finding a solution.
Working on a simulation code base on Linux, allocating memory succeeds, but later the process gets killed by an external signal. Adding a signal handler does not prevent this, so it is presumably a SIGKILL, which cannot be caught. Since the process is killed, a debugger cannot provide a backtrace.
Judging by the signs and the preceding high memory usage, it is probably related to the OOM Killer. Outright disabling the OOM Killer by turning off memory overcommit with
sudo sh -c "echo 2 > /proc/sys/vm/overcommit_memory"
resulted in many programs crashing.
What can be done to find the source of the issue, e.g. to get a backtrace indicating where too much memory is being used?

I observed this issue on openSUSE 15.2 when debugging a crash in a Fortran program. It was clear from the tester's description that it was an out-of-memory issue, but on my system I would simply see
>>> ./run-simulation
[1] Killed
on the terminal, with no traceback being emitted.
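Checking the kernel log is a quick way to confirm that the OOM Killer was responsible; something along these lines should show the kill record (the exact wording varies by kernel version):
>>> dmesg -T | grep -iE "out of memory|oom-killer|killed process"
>>> journalctl -k | grep -i "killed process"   # on systemd-based systems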
On my system, the source of the issue turned out to be that virtual memory was set to "unlimited", as seen by
>>> ulimit -a
Setting a maximum for the virtual memory, e.g.
>>> ulimit -v 24000000 # in kB -> 24 GB on a 32 GB RAM system
resolved the issue by making the simulation program return an error code from ALLOCATE, or crash with a backtrace* for unhandled failed allocations (e.g. from temporary variables in the expression cmplx(transpose(some_larger_than_expected_matrix))).
* Assuming that the executable was compiled with support for backtraces (compiler-dependent), run through a debugger, ...
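If the compiler flags alone do not produce a traceback, running the program under a debugger is another option; a rough sketch (assuming the executable was built with -g):
>>> gdb -ex run -ex "thread apply all bt" --args ./run-simulation
gdb starts the program, and once it stops on the fatal error the second command prints a backtrace for every thread.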

Related

How to set memory limit for OOM Killer for chrome?

chrome invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=300
I'm getting the above error while testing with a headless Chrome browser + Selenium.
This error message...
chrome invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=300
...implies that the ChromeDriver-controlled browsing context, i.e. the Chrome browser, triggered the OOM Killer because of an out-of-memory condition.
Out of Memory
Out of Memory error messages can appear when you attempt to start new programs or you try to use programs that are already running, even though you still have plenty of physical and pagefile memory available.
OOM Killer
The OOM Killer, or Out Of Memory Killer, is a mechanism the Linux kernel employs when the system is critically low on memory. This situation occurs because the Linux kernel over-allocates memory to its processes. When a process starts, it requests a block of memory from the kernel. This initial request is usually large, and the process will not immediately, or indeed ever, use all of it. Aware of this tendency of processes to request more memory than they use, the kernel over-allocates the system memory. This means that when the system has, for example, 2 GB of RAM, the kernel may allocate 2.5 GB to processes. It maximises the use of system memory by ensuring that the memory allocated to processes is actively used. Now, if enough processes begin to use all of their requested memory blocks, there will not be enough physical memory to support them all: the running processes require more memory than is physically available. This is exactly when the Linux kernel invokes the OOM Killer to review all running processes and kill one or more of them in order to free up system memory and keep the system running.
Chrome the first victim of the OOM Killer
Surprisingly, the Chrome browser often ends up as the first victim of the OOM Killer. Because the Linux OOM Killer kills the process with the highest score (roughly, memory usage/RSS adjusted by oom_score_adj), Chrome tabs are killed first: their renderer processes carry an oom_score_adj of at least 300 (kLowestRendererOomScore = 300 in chrome_constants.cc), as follows:
#if defined(OS_LINUX)
const int kLowestRendererOomScore = 300;
const int kHighestRendererOomScore = 1000;
#endif
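If you want to verify this on a live system, the kernel exposes both the adjustment and the resulting score per process under /proc; a small sketch (matching processes by the name chrome is an assumption about your setup):
# Print PID, oom_score_adj and oom_score for every chrome process
for pid in $(pgrep -f chrome); do
  echo "$pid $(cat /proc/$pid/oom_score_adj) $(cat /proc/$pid/oom_score)"
done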
Details
This is a known issue and can be easily reproduced. It has been discussed at length in oom_score_adj too high - chrome is always the first victim of the oom killer. The goal there was to adjust the OOM scoring in Chrome OS so that the most recently opened tab isn't killed, since the OOM Killer appears to prefer recent processes by default. But on other Linux distros that change is not reflected, and you get the undesirable behavior where Chrome processes get killed ahead of other processes that probably should have been killed instead.
Solution
Some details from the error stack trace would have helped in suggesting changes in terms of:
total-vm usage
Physical memory
Swap memory
You can find a couple of relevant discussions in:
Understanding the Linux oom-killer's logs
what does anon-rss and total-vm mean
determine vm size of process killed by oom-killer
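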
However, there was a code review to address this issue, but the discussion still seems to be in status Assigned with Priority: 2 within:
Linux: Adjust /proc/pid/oom_adj to sacrifice plugin and renderer processes to the OOM killer
tl;dr
java.lang.OutOfMemoryError: unable to create new native thread error using ChromeDriver and Chrome through Selenium in Spring boot
Outro
Chromium OS - Design Documents - Out of memory handling
Despite 32 GB of RAM, this Chromium OOM is still happening with its latest release!
Because this issue can fully freeze Xorg, the SysRq key combinations can help to recover a console terminal.
ALT + SysRq + K kills all processes on the current virtual console (Chromium included).
Think about adding sysrq_always_enabled to the kernel boot command line.
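To have the keys usable already before the next freeze, the runtime switch is the kernel.sysrq sysctl; a small sketch:
# Enable all SysRq functions for the current boot
sudo sysctl kernel.sysrq=1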

Can a memory leak cause my process to get killed?

This is a short description of my problem:
Context:
Hardware: Toradex Colibri VF61
Distribution: Angstrom v2014.12
Kernel release: 4.0.2-v2
Software language: Qt/C++
Problem:
I am developing an application which needs to run for at least 2 weeks on an embedded product. My problem is that my process runs for 5 days with a small memory leak, which I monitor with top, and then it gets killed.
My process had turned into a zombie, according to top.
Attempt number 1:
I tried to track down the memory leak with Valgrind, but some of the "probable" leaks are in libraries my program uses (many via malloc). Understanding all of those libraries would be a huge amount of work, and that is not the goal.
I think the memory leak is about 1% of memory lost per day, so 15% lost over 2 weeks. This kind of leak is acceptable to me, because the process will not run beyond 2 weeks and the embedded system is dedicated to this process; I don't have any other big process running on the machine. RAM monitoring shows that the process currently takes 30% of memory, so an estimated 45% two weeks later.
Attempt number 2:
I read up on memory management under Linux and learned about the OOM Killer. I deduced that the OOM Killer had probably decided that my process had been running for too long with a memory leak and killed it.
So I set my process's oom_score_adj to -1000 to prevent the OOM Killer from killing it, and I tried another long run with the memory leak still present.
But this time my process went into a "sleeping" state: not killed, but unusable. The sleeping state was accompanied by the error message "Error in './app': malloc(): memory corruption (fast) : 0x72518ddf". To be clear, there are no direct malloc calls in my code, only in the libraries I use.
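For reference, this is roughly how I applied that adjustment (app is the executable name from the error message above; the exact invocation is a sketch and needs root):
echo -1000 | sudo tee /proc/$(pidof app)/oom_score_adj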
Questions:
Do you think it's possible that something like the OOM Killer could turn my process into a zombie because I have a memory leak and my program has been running for a long time?
Do you think it's possible that Linux turned my process into the sleeping state because the leak has filled up the memory allocated to the process?
Concerning your first question: when very little memory is left for the system, the OOM Killer kills one or more processes chosen by their oom_score (high memory consumption, low importance to the system, ...). A killed process only shows up as a zombie while its parent has not yet reaped it with wait(); so if the OOM Killer killed your process (or one of its children), the zombie you saw in top is the entry that remains until the parent collects its exit status.
To your second question: Linux puts a process into the sleeping state when a resource the process is waiting for is not available. But in your case, if there is a memory leak and the process consumes a lot of memory, it would rather be killed than put to sleep.
Are you using UART in your application?
By the way, there is also a Toradex Community, where engineers can answer your questions directly.
Best regards, Jaski

How to detect if an application leaks memory?

I have quite a complex system, with 30 applications running. One quite complex C++ application was leaking memory, and I think I fixed it.
What I've done so far is:
I executed the application using valgrind's memcheck, and it detected no problems.
I monitored the application using htop, and I noticed that virtual and resident memory are not increasing
I am planning to run valgrind's massif and see if it allocates new memory
The question is, how can I make sure there are no leaks? I thought if virtual memory stopped increasing, then I could be sure there are no leaks. When I test my application, I trigger the loop where the memory is allocated and deallocated several times just to make sure.
You can't be sure unless you know exactly all the conditions under which the application will allocate new memory. If you can't induce all of these conditions, neither valgrind nor htop will guarantee that your application doesn't leak memory under all circumstances.
Still, you should at least make sure that the application doesn't leak memory under normal conditions.
If valgrind doesn't report leaks, there aren't leaks in the sense of memory areas that are no longer accessible (during the runs you checked). That doesn't mean the program doesn't allocate memory, use it, and then never free it even though it will never use it again (while it is still reachable). Think, e.g., of a typical to-do stack: you place new items on top, work on the item on top, then push another one. You never go back to the old ones, so the memory used for them is wasted, but technically it isn't a leak.
What you can do is to monitor the memory usage by the process. If it steadily increases, you might have a problem there (either a bona fide leak, or some data structure that grows without need).
If this isn't really pressing, it might be cheaper in the long run just letting it be...
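As suggested above, a crude way to monitor the memory usage of the process is to sample its size periodically; a minimal sketch (the PID variable and the interval are placeholders):
# Log resident and virtual size (in kB) once a minute
while sleep 60; do
  echo "$(date +%T) $(ps -o rss=,vsz= -p "$PID")"
done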
You need to use a tool called Valgrind. It is a memory debugging, memory leak detection, and profiling tool for the Linux and Mac OS X operating systems. Valgrind is a flexible program for debugging and profiling Linux executables.
Follow these steps:
Just install valgrind
To run your program normally:
./a.out arg1 arg2
Now use this command line to turn on the detailed memory leak detector:
valgrind --leak-check=yes ./a.out arg1 arg2
valgrind --leak-check=yes /path/to/myapp arg1 arg2
Or
You can also write the output to a log file:
valgrind --log-file=output.file --leak-check=yes --tool=memcheck ./a.out arg1 arg2
You can then check the log file for memory leak errors:
cat output.file
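If the basic run comes back clean but you want more detail, a stricter invocation is possible; these are standard memcheck options, though they slow the run down considerably:
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes ./a.out arg1 arg2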

Linux memory usage history

I had a problem in which my server began failing some of its normal processes and checks because the server's memory was completely used up.
I looked in the log history and found that the processes it killed were some Java processes.
I used the top command to see which processes were taking up the most memory right now (after the issue was fixed), and it was a Java process. So in essence, I can tell which processes are taking up the most memory right now.
What I want to know is if there is a way to see what processes were taking up the most memory at the time when the failures started happening? Perhaps Linux keeps track or a log of the memory usage at particular times? I really have no idea but it would be great if I could see that kind of detail.
@Andy has answered your question. However, I'd like to add that for future reference you should use a monitoring tool; it will show you what happened during a crash, since you obviously cannot watch all your servers all the time. Hope it helps.
Are you saying the kernel OOM killer went off? What does the log in dmesg say? Note that you can constrain a JVM to use a fixed heap size, which means it will fail affirmatively when full instead of letting the kernel kill something else. But the general answer to your question is no: there's no way to reliably run anything at the time of an OOM failure, because the system is out of memory! At best, you can use a separate process to poll the process table and log process sizes to catch memory leak conditions, etc...
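As a sketch of the JVM-pinning suggestion above (heap size and jar name are examples; -XX:+ExitOnOutOfMemoryError needs a reasonably recent JVM):
# Fixed 512 MB heap; the JVM exits on OutOfMemoryError instead of growing until the kernel intervenes
java -Xms512m -Xmx512m -XX:+ExitOnOutOfMemoryError -jar my-service.jar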
There is no history of memory usage in Linux by default, but you can get one with a simple command-line tool like sar.
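For example, with the sysstat package installed and its collector enabled, sar can both sample live and replay what was recorded around the time of a failure; a small sketch (the log file path varies by distribution):
# Live: report memory usage every 60 seconds, 10 times
sar -r 60 10
# Historical: replay the memory data recorded for the 5th day of the month
sar -r -f /var/log/sa/sa05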
Regarding your problem with memory:
If it was OOM-killer that did some mess on machine, then you have one great option to ensure it won't happen again (of course after reducing JVM heap size).
By default, the Linux kernel allocates more memory than it really has. In some cases this can lead to the OOM killer killing the most memory-hungry process when there is no memory left for kernel tasks.
This behavior is controlled by the vm.overcommit_memory sysctl parameter.
So, you can try setting vm.overcommit_memory = 2 in sysctl.conf and then run sysctl -p.
This will forbid overcommitting and make it very unlikely that the OOM killer does anything nasty. You can also think about adding a little swap space (if you don't have it already) and setting vm.swappiness to some really low value (5, for example; the default is 60), so that in a normal workload your application won't go into swap, but if you are really short on memory it will start using swap temporarily, and you will be able to see that with free.
WARNING: this can lead to processes receiving a "Cannot allocate memory" error if your server is overloaded in terms of memory. In that case:
Try to restrict the memory usage of your applications
Move some of them to another machine
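Put together, the suggested settings might look like this in /etc/sysctl.conf (values are examples to adapt to your workload; vm.overcommit_ratio is not discussed above but commonly goes hand in hand with mode 2; apply with sysctl -p):
vm.overcommit_memory = 2   # strict accounting, no overcommit
vm.overcommit_ratio = 80   # share of RAM counted towards the commit limit
vm.swappiness = 5          # keep the application out of swap under normal load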

Why do I get a segfault since my application was compiled for 64 bits?

I am running an application (compiled on a 64-bit machine) on a 64-bit Linux system (RHEL 5.5). This application crashes every 40-50 minutes. I am surprised to see this, as the same code ran completely fine on a 32-bit machine.
One possible cause I found is that free memory on the problematic system was only 50 MB, so I assumed it is crashing because of low memory. But I also saw that the system has around 5 GB of cached memory, and I assumed that this cache should be available for my memory requests. Am I correct in this assumption, or should I free this cache after a while to solve the problem?
In the system Log I saw following message when my application is crashing:
kernel: MyApplicationName[20655]: segfault at 0000000030363938 rip 0000000000b35c7e rsp 00000000f322a3a0 error 4
Can anyone point out what the problem could be here? What does this "error 4" mean?
The "error 4" in that kernel line is not an errno value (errno 4 would be EINTR, "Interrupted system call", from /usr/include/asm-generic/errno-base.h); it is the page-fault error code, a bit field in which the value 4 means a user-mode read of an address that is not mapped.
Either way, your problem is not that code, it's the segfault itself. It is most probably a bug that surfaced because the code of your application was not ready for 64-bit systems.
A segmentation fault occurs when an application tries to access memory it cannot or is not allowed to use. In that case, the kernel often has no other choice but to stop it.
To get more information about this error, you can compile your application with debug symbols, attach gdb to your process, and ask for a full trace when the segfault occurs with this command in the gdb shell: thread apply all bt.
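If attaching gdb and waiting 40-50 minutes is impractical, an alternative is to let the crash write a core dump and take the backtrace afterwards; a rough sketch (the core file name and location depend on kernel.core_pattern):
# In the shell that launches the application
ulimit -c unlimited
./MyApplicationName
# After the crash, load the core file and print all thread backtraces
gdb -ex "thread apply all bt" ./MyApplicationName core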
The cache will be freed when needed. Your issue is far more likely to be down to poor code practices, for example code assuming that a pointer fits into a 32-bit int.
