My app is running slow after a few days. Using the unix "top" command it seems there is not a lot of free memory. See below. Even if I stop the application about the same memory shows used. Any ideas why? Does this amount of memory look normal with no app running on a small gear application? How can I reboot the virtual machine?
Below is the output of the "top" command with no app running. It shows 7513700k total, 7327484k used, 186216k free:
top - 22:06:26 up 14 days, 5:42, 0 users, load average: 1.83, 2.82, 3.21
Tasks: 3 total, 1 running, 2 sleeping, 0 stopped, 0 zombie
Cpu(s): 10.2%us, 26.6%sy, 1.6%ni, 57.4%id, 4.0%wa, 0.0%hi, 0.0%si, 0.2%st
Mem: 7513700k total, 7327484k used, 186216k free, 170244k buffers
Swap: 6249464k total, 4210036k used, 2039428k free, 925320k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
48736 3558 20 0 14908 1176 944 R 0.7 0.0 0:00.04 top
48374 3558 20 0 102m 2684 848 S 0.0 0.0 0:00.00 sshd
48383 3558 20 0 106m 2072 1436 S 0.0 0.0 0:00.19 bash
What type of app are you running? Also, since OpenShift uses cgroups, you'll want to see what your usage is within your cgroup (the top output shows the whole system). Try including the output from
for i in $(oo-cgroup-read all); do echo "oo-cgroup-read $i" && oo-cgroup-read $i; done
and pay close attention to your memory limits.
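As a rough illustration of what checking usage against the memory limit means, the sketch below computes how close a cgroup is to its limit. It assumes the cgroup v1 memory controller, which exposes memory.usage_in_bytes and memory.limit_in_bytes. The function names are made up for this example, and the directory to pass in is something you would have to look up for your own gear.

```python
from pathlib import Path


def usage_percent(usage_bytes: int, limit_bytes: int) -> float:
    """Return memory usage as a percentage of the cgroup limit."""
    if limit_bytes <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * usage_bytes / limit_bytes


def read_cgroup_memory(cgroup_dir: str) -> float:
    """Read usage and limit from a cgroup v1 memory directory.

    cgroup_dir is an assumption: pass the directory that belongs to
    your own cgroup; no particular path is implied here.
    """
    base = Path(cgroup_dir)
    usage = int((base / "memory.usage_in_bytes").read_text())
    limit = int((base / "memory.limit_in_bytes").read_text())
    return usage_percent(usage, limit)


# Illustrative numbers only, not taken from the question.
print(usage_percent(400 * 1024**2, 512 * 1024**2))  # 78.125
```

A value that stays close to 100 would suggest the application is hitting its own limit regardless of what the system-wide top output shows.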
Memory occupied by unknown (VMware/CentOS)
Hello.
We have a server whose memory is almost fully used, but we cannot find what is consuming it.
Memory usage increased a few days ago from 40% to nearly 100% and has stayed there since.
We'd like to kill whatever is eating the memory.
[Env]
# cat /etc/redhat-release
CentOS release 6.5 (Final)
# arch
x86_64
[status]
# free
             total       used       free     shared    buffers     cached
Mem:      16334148   15682368     651780          0      10168     398956
-/+ buffers/cache:   15273244    1060904
Swap:      8388600     129948    8258652
Output of top (some information is masked with ???):
# top -a
top - 10:19:14 up 49 days, 11:13, 1 user, load average: 1.05, 1.05, 1.10
Tasks: 145 total, 1 running, 143 sleeping, 0 stopped, 1 zombie
Cpu(s): 11.1%us, 18.4%sy, 0.0%ni, 69.5%id, 0.8%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 16334148k total, 15684824k used, 649324k free, 9988k buffers
Swap: 8388600k total, 129948k used, 8258652k free, 387824k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17940 ??? 20 0 7461m 6.5g 6364 S 16.6 41.5 1113:27 java
4982 ??? 20 0 941m 531m 5756 S 2.7 3.3 611:22.48 java
3213 root 20 0 2057m 354m 2084 S 99.8 2.2 988:43.79 python
28270 ??? 20 0 835m 157m 5464 S 0.0 1.0 106:48.55 java
1648 root 20 0 197m 10m 1452 S 0.0 0.1 42:35.95 python
1200 root 20 0 246m 7452 808 S 0.0 0.0 2:37.42 rsyslogd
Processes that are using memory (some information is masked with ???):
# ps aux --sort rss
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1200 0.0 0.0 251968 7452 ? Sl Sep12 2:37 /sbin/rsyslogd -i /var/run/syslogd.pid -c 5
root 1648 0.0 0.0 202268 10604 ? Ss Sep12 42:36 /usr/lib64/???
??? 28270 0.1 0.9 855932 161092 ? Sl Sep14 106:49 /usr/java/???
root 3213 96.1 2.0 2107704 332932 ? Ssl Oct31 992:25 /usr/lib64/???
??? 4982 0.8 3.3 964096 544328 ? Sl Sep12 611:25 /usr/java/???
??? 17940 6.6 41.5 7649356 6781076 ? Sl Oct20 1113:49 /usr/java/???
Memory is almost 100% used, but with ps and top we can only find processes that account for about half of it.
We have checked the slab cache, but it is not the cause.
Slab is only 90444 kB.
Nothing was found in syslog either.
Does anyone have an idea how to find out what is consuming the memory?
Thank you in advance.
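Run free -m and compare the figures. On recent versions of free, the available column shows the memory that is really free; on older versions, such as the one shipped with CentOS 6, read the free value on the -/+ buffers/cache line instead.
Also take a look at https://www.linuxatemyram.com/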
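To make the gap in the question concrete, here is a small sketch that redoes the arithmetic with the numbers from the free and ps output quoted above. The helper names are made up for illustration. The figures are copied from the question, and only the six processes shown there are summed, so the result is an upper bound on what is unexplained.

```python
# Values in kB, copied from the `free` output in the question.
MEM_USED = 15682368
BUFFERS = 10168
CACHED = 398956

# RSS in kB for the processes listed in the `ps aux --sort rss` output.
RSS_KB = {
    1200: 7452,
    1648: 10604,
    28270: 161092,
    3213: 332932,
    4982: 544328,
    17940: 6781076,
}


def used_without_cache(used_kb: int, buffers_kb: int, cached_kb: int) -> int:
    """Memory in use once reclaimable buffers and page cache are subtracted."""
    return used_kb - buffers_kb - cached_kb


def unaccounted(used_kb: int, rss_by_pid: dict) -> int:
    """Used memory that the listed processes' RSS does not explain."""
    return used_kb - sum(rss_by_pid.values())


real_used = used_without_cache(MEM_USED, BUFFERS, CACHED)
print(real_used)                       # 15273244, the -/+ buffers/cache "used" value
print(sum(RSS_KB.values()))            # 7837484
print(unaccounted(real_used, RSS_KB))  # 7435760
```

Buffers and cache come to only about 400 MB here, so the page cache does not explain the usage: roughly 7 GB is not attributed to any of the listed processes.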
We restarted the server, which resolved the issue.
I ran top -H -p for a process, which listed its threads (LWPs).
When I sorted the results with the smallest PID first, I noticed that TIME+ for the first thread stays constant while it keeps changing for the other threads. Why is TIME+ different?
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16989 root 20 0 106m 28m 2448 S 0.0 0.2 0:22.31 glusterfs
16990 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
16992 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
16993 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
16997 root 20 0 106m 28m 2448 S 0.0 0.2 0:11.71 glusterfs
17010 root 20 0 106m 28m 2448 S 0.0 0.2 0:21.07 glusterfs
17061 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
Why is TIME+ different?
Because different threads are doing different percentages of the work. There could be a number of reasons for this [1], but the most likely is that the application (glusterfs) is not attempting to distribute work evenly across the worker threads.
It is not something to worry about. It doesn't matter which thread does the work if the work level (see the %CPU) is negligible.
[1] If someone had the time and inclination, they could look at the source code of glusterfs to try to understand its behavior. However, I don't think the effort is warranted.
Because the TIME column refers to the CPU time consumed by a task. When a task's time does not change, it probably means that the task is sleeping or simply waiting for another task to finish, although there could be many other reasons.
http://linux.about.com/od/commands/l/blcmdl1_top.htm
TIME:
Total CPU time the task has used since it started. If cumulative mode
is on, this also includes the CPU time used by the process's children
which have died. You can set cumulative mode with the S command line
option or toggle it with the interactive command S. The header line
will then be changed to CTIME.
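Since TIME+ is accumulated CPU time per thread, the values can be compared directly. The sketch below converts the TIME+ strings from the glusterfs listing into seconds and shows which threads have done any work. The function name is made up for this example, and the parser only handles the minutes:seconds[.hundredths] layout seen in the output above.

```python
def time_plus_to_seconds(value: str) -> float:
    """Convert a top TIME+ value such as '0:22.31' or '1113:27' to seconds.

    Assumes the minutes:seconds[.hundredths] layout seen in this thread.
    """
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + float(seconds)


# TIME+ per thread (LWP), copied from the glusterfs output in the question.
THREAD_TIMES = {
    16989: "0:22.31",
    16990: "0:00.00",
    16992: "0:00.00",
    16993: "0:00.00",
    16997: "0:11.71",
    17010: "0:21.07",
    17061: "0:00.00",
}

total = sum(time_plus_to_seconds(t) for t in THREAD_TIMES.values())
busy = [tid for tid, t in THREAD_TIMES.items() if time_plus_to_seconds(t) > 0]
print(round(total, 2))  # 55.09
print(busy)             # [16989, 16997, 17010]
```

Only three of the seven threads have accumulated any CPU time, which fits the explanation that the work is not spread evenly.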
I have an embedded system. When I perform user I/O operations, the system just stalls and carries out the action only after a long time. The system is quite complex and has many processes running. My question is: how can I identify what is making the system stall? It literally does nothing for 5 minutes, and only after those 5 minutes do I see the outcome. I really don't know what is stalling the system. Any input on how to debug this issue would be appreciated. I have run top on the system, but it doesn't point to any problem. As shown here, jup_render is taking just 30% of CPU, which is not enough to stall the system, so I am not sure whether top is useful here.
~ # top
top - 12:01:05 up 21 min, 1 user, load average: 1.49, 1.26, 0.87
Tasks: 116 total, 2 running, 114 sleeping, 0 stopped, 0 zombie
Cpu(s): 44.4%us, 13.9%sy, 0.0%ni, 40.3%id, 0.0%wa, 0.0%hi, 1.4%si, 0.0%st
Mem: 822572k total, 389640k used, 432932k free, 1980k buffers
Swap: 0k total, 0k used, 0k free, 227324k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
850 root 20 0 309m 32m 16m S 30 4.0 3:10.88 jup_render
870 root 20 0 221m 13m 10m S 27 1.7 2:28.78 jup_render
688 root 20 0 1156m 4092 3688 S 11 0.5 1:25.49 rxserver
9 root 20 0 0 0 0 S 2 0.0 0:06.81 ksoftirqd/1
16 root 20 0 0 0 0 S 1 0.0 0:06.87 ksoftirqd/3
9294 root 20 0 1904 616 508 R 1 0.1 0:00.10 top
812 root 20 0 865m 85m 46m S 1 10.7 1:21.17 lippo_main
13 root 20 0 0 0 0 S 1 0.0 0:06.59 ksoftirqd/2
800 root 20 0 223m 8316 6268 S 1 1.0 0:08.30 rat-cadaemon
3 root 20 0 0 0 0 S 1 0.0 0:05.94 ksoftirqd/0
1456 root 20 0 80060 10m 8208 S 1 1.2 0:04.82 jup_render
1330 root 20 0 202m 10m 8456 S 0 1.3 0:06.08 jup_render
8905 root 20 0 1868 556 424 S 0 0.1 0:02.91 dropbear
1561 root 20 0 80084 10m 8204 S 0 1.2 0:04.92 jup_render
753 root 20 0 61500 7376 6184 S 0 0.9 0:04.06 ale_app
1329 root 20 0 79908 9m 8208 S 0 1.2 0:04.77 jup_render
631 dbus 20 0 3248 1636 676 S 0 0.2 0:13.10 dbus-daemon
1654 root 20 0 80068 10m 8204 S 0 1.2 0:04.82 jup_render
760 root 20 0 116m 15m 12m S 0 1.9 0:10.19 jup_server
8 root 20 0 0 0 0 S 0 0.0 0:00.00 kworker/1:0
2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
7 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/1
170 root 0 -20 0 0 0 S 0 0.0 0:00.00 kblockd
6 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/0
167 root 20 0 0 0 0 S 0 0.0 0:00.00 sync_supers
281 root 0 -20 0 0 0 S 0 0.0 0:00.00 nfsiod
For an embedded system with many processes running, there can be a multitude of reasons, so you may need to investigate from several angles.
Check the code for race conditions and deadlocks. The kernel might be busy-looping on some condition. Your application might also be waiting in a select call or blocked on a read. CPU exhaustion is another possibility, but the top output you shared rules it out.
If you are performing blocking I/O, the process is placed on a wait queue and returns to the ready queue only after the request completes. That is, it is taken off the scheduler's run queue and marked with a special sleep state, and it is put back on the run queue only when it is woken or when the resource it is waiting for becomes available.
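One way to see whether a process is stuck in such a wait is to look at its state in /proc. The sketch below is only an illustration and assumes a Linux procfs and a Python interpreter on the target; the function names are made up. It lists processes in uninterruptible sleep (state D), which is where a task blocked on I/O usually sits. Individual threads live under /proc/<pid>/task and are not scanned here.

```python
import os


def parse_stat(line: str):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split around the last ')'.
    """
    pid_part, rest = line.split(" (", 1)
    comm, after = rest.rsplit(") ", 1)
    return int(pid_part), comm, after.split()[0]


def blocked_tasks(proc_root: str = "/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state D)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # the process exited, or the entry is not readable
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling blocked_tasks() every second or so during a stall and noting which names keep reappearing would give a first hint about where to point strace.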
The immediate step is to try strace. It intercepts and records the system calls made by a process and the signals the process receives. It shows the order of events and where each call returns or resumes, which can take you close to the problem area.
There are many other handy tools to try, depending on your development environment and setup. The key ones are:
'iotop' - Shows a table of current I/O usage by processes or threads on the system, based on the I/O accounting information provided by the kernel.
'LTTng' - Makes it possible to trace race conditions and interrupt cascades. It is the successor to LTT and combines kprobes, tracepoint and perf functionality.
'Ftrace' - The Linux kernel's internal tracer, with which you can analyze and debug latency and performance issues.
If your system is based on a TI processor, CCS (Trace Analyzer) can perform non-intrusive debugging and analysis of system activity. Depending on your setup, you may need to use the tool relevant to your platform.
I came across a few more ideas:
The magic SysRq key is another option on Linux. If a driver is stuck, SysRq p dumps the current registers and flags, which can point you to the routine that is causing the problem.
Profiling can tell you exactly where the kernel is spending its time. There are a couple of tools, such as Readprofile and Oprofile. Oprofile can be enabled by building the kernel with CONFIG_PROFILING and CONFIG_OPROFILE. Another option is to rebuild the kernel with the profiling option enabled, boot with profile=2 on the kernel command line, and read the profile counters with the Readprofile utility.
mpstat reports 'the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request' in its %iowait column.
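If mpstat is not installed on the target, the same iowait figure can be approximated from two samples of the aggregate cpu line in /proc/stat. This is a minimal sketch with made-up function names and made-up sample values; it assumes the usual column order of user, nice, system, idle, iowait, irq, softirq, steal.

```python
def parse_cpu_line(line: str):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def iowait_percent(before: str, after: str) -> float:
    """Percentage of CPU time spent in iowait between two samples.

    Only the first eight columns are totalled, because guest time is
    already included in user and nice.
    """
    a = parse_cpu_line(before)[:8]
    b = parse_cpu_line(after)[:8]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


# Made-up samples for illustration, as if read a few seconds apart.
first = "cpu  100 0 50 800 50 0 0 0 0 0"
second = "cpu  150 0 70 1100 180 0 0 0 0 0"
print(iowait_percent(first, second))  # 26.0
```

On a real system the two strings would be the first line of /proc/stat read before and during the stall. A high value would point toward storage, while a value near zero (as in the 0.0%wa shown by top in the question) would point elsewhere.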
You said you ran top. Did you find out which program gets the most CPU time, and what percentage it uses?
When you run top you should see a further screen of output, which you did not provide, and you did not mention a CPU load percentage (or other relevant information).
I advise you to include whatever you find interesting, relevant, or suspicious in the top output. If you have already done so, make it stand out more clearly in your question, because at the moment it is not obvious what the maximum CPU load is.
I was checking my server resource usage and noticed that the "cma" process is using a lot of RAM.
top - 15:04:54 up 127 days, 21:00, 1 user, load average: 0.27, 0.33, 0.24
Tasks: 157 total, 1 running, 156 sleeping, 0 stopped, 0 zombie
Cpu(s): 6.9%us, 0.3%sy, 0.0%ni, 92.6%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 4043700k total, 4006616k used, 37084k free, 146968k buffers
Swap: 1052248k total, 1052240k used, 8k free, 1351364k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4308 root 16 0 2080m 977m 4708 S 0.0 24.8 0:00.02 cma
4396 root 15 0 2080m 977m 4708 S 0.0 24.8 0:00.10 cma
4397 root 16 0 2080m 977m 4708 S 0.0 24.8 3:47.36 cma
4398 root 15 0 2080m 977m 4708 S 0.0 24.8 2:31.40 cma
4399 root 15 0 2080m 977m 4708 S 0.0 24.8 0:00.34 cma
4400 root 18 0 2080m 977m 4708 S 0.0 24.8 0:00.00 cma
4403 root 15 0 2080m 977m 4708 S 0.0 24.8 0:47.36 cma
4404 root 18 0 2080m 977m 4708 S 0.0 24.8 0:00.07 cma
4405 root 18 0 2080m 977m 4708 S 0.0 24.8 0:00.04 cma
4406 root 15 0 2080m 977m 4708 S 0.0 24.8 0:12.14 cma
4408 root 19 0 2080m 977m 4708 S 0.0 24.8 0:00.00 cma
I found this forum post from last year and apparently these processes have to do with McAfee virus scanning.
I ran pmap on one of the processes and this is the last line of output:
mapped: 2130892K writeable/private: 2113632K shared: 40K
Is this process really using 2.1GB of memory? Is top reporting the memory usage accurately?
Thanks!
The VIRT column tells you the total size of the virtual memory segments mapped into the process - this includes the executable itself, libraries, data segments, stack, heap, memory mapped files, etc. In a sense, it is the total amount of memory that the process currently has permission to touch in one way or another (read, write, execute). The process is not necessarily using all of that, which is one of several reasons that the RES column reports a smaller number.
RES is the size of the subset of VIRT that is currently in physical memory. It is a better (but still not great) measure of how much memory the process is actually using - the fact that it is in memory indicates that it has been or is currently being actively used. However, if your system has lots of memory, a portion of that RES number may have been touched 3 days ago, and not since, so it may not be actively in use. Conversely, if you are short on memory, the process may be trying to actively use more than RES currently indicates, which will result in paging/swapping activity and performance issues.
Then there's the tendency for some types of memory (executables, libraries) to be shared between multiple instances of a program, the existence of IPC-type shared memory, and several other things that all factor into "how much memory is this process using?"...
In other words, it's not as simple a question as you might imagine...
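For a per-process view that does not depend on how top formats things, the kernel reports the same figures in /proc/<pid>/status. The sketch below parses the Vm* fields from that file. The sample text is made up and merely shaped like the cma figures above, the function name is hypothetical, and VmSwap is only present on newer kernels. The identical VIRT, RES and SHR values on every cma row suggest those rows are threads sharing one address space, so the 977m would be counted once rather than eleven times; that is an inference from the output, not something the question confirms.

```python
def parse_status_kb(text: str) -> dict:
    """Extract the Vm* fields (in kB) from /proc/<pid>/status contents."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if key.startswith("Vm") and len(parts) == 2 and parts[1] == "kB":
            values[key] = int(parts[0])
    return values


# Made-up sample, shaped like the cma figures above (2080m VIRT, 977m RES).
SAMPLE = """Name:\tcma
VmSize:\t 2129920 kB
VmRSS:\t 1000448 kB
VmSwap:\t  512000 kB
Threads:\t11
"""

info = parse_status_kb(SAMPLE)
print(info["VmSize"] // 1024, info["VmRSS"] // 1024)  # 2080 977
```

VmSize corresponds to VIRT and VmRSS to RES, so the two views should agree for the same process.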
The top command results:
Mem: 3991840k total, 1496328k used, 2495512k free, 156752k buffers
Swap: 3905528k total, 3980k used, 3901548k free, 447860k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP COMMAND
28250 www-data 20 0 430m 210m 21m R 63 5.4 0:07.29 219m apache2
28266 www-data 20 0 256m 40m 21m S 30 1.0 0:01.94 216m apache2
28206 www-data 20 0 260m 44m 21m S 27 1.1 0:10.27 215m apache2
28259 www-data 20 0 256m 40m 21m S 26 1.0 0:02.21 216m apache2
The process list shows a group of apache2 processes each using about 210m+ in the SWAP column, but the summary reports that only 3980k of swap is used. The total of the SWAP column is much greater than the figure in the summary. Do the two swap values refer to the same thing?
Quoted from http://www.linuxforums.org/articles/using-top-more-efficiently_89.html :
VIRT=RES+SWAP
As explained previously, VIRT includes anything inside task's address space, no matter it is in RAM, swapped out or still not loaded from disk. While RES represents total RAM consumed by this task. So, SWAP here means it represents the total amount of data being swapped out OR still not loaded from disk. Don't be fooled by the name, it doesn't just represent the swapped out data.
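The relation quoted above can be checked against the apache2 rows from the question. This is a small worked example, with a made-up function name, that applies only to the top version shown in this thread, where the per-process SWAP column is derived rather than measured.

```python
# (VIRT, RES, SWAP) in MiB as displayed for the four apache2 processes above.
ROWS = {
    28250: (430, 210, 219),
    28266: (256, 40, 216),
    28206: (260, 44, 215),
    28259: (256, 40, 216),
}


def derived_swap(virt_mib: int, res_mib: int) -> int:
    """SWAP as this version of top derives it: VIRT minus RES."""
    return virt_mib - res_mib


for pid, (virt, res, shown) in ROWS.items():
    # The displayed columns are rounded, so allow 1 MiB of slack.
    assert abs(derived_swap(virt, res) - shown) <= 1
    print(pid, derived_swap(virt, res), shown)
```

Every row matches to within rounding, which supports the reading that the per-process SWAP column here is just the non-resident part of the address space, while the 3980k in the summary line is the swap space actually in use.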