How can a daemon stay active while using no memory? - linux

I came up with this question while using the ps aux command.
Here I can see that a few processes are at 0% CPU, 0% MEM, 0 VSZ and 0 RSS.
If a daemon is not using any memory, why and how can it be displayed in the first place? I kind of understand that 0% CPU usage means the process is not currently in use, but wouldn't 0% MEM mean no process at all?
I wanted to check whether this was somehow specific to system daemons, so I made a simple C program with an infinite loop and no variables.
int main(void)
{
    while (1) {}    /* spin forever, allocating nothing */
}
This time VSZ and RSS have actual values, while %MEM stays at 0%.
What is happening here?

%MEM is probably not fully documented on your system. The AIX manual for the ps command says:
%MEM
Calculated as the sum of the number of working segment and code
segment pages in memory times 4 (that is, the RSS value), divided by
the size of the real memory in use, in the machine in KB, times 100,
rounded to the nearest full percentage point. This value attempts to
convey the percentage of real memory being used by the process.
Unfortunately, like RSS, it tends to exaggerate the cost of a process
that is sharing program text with other processes. Further, the
rounding to the nearest percentage point causes all of the processes
in the system that have RSS values under 0.005 times real memory size
to have a %MEM of 0.0.
As you may have suspected from the output, some rounding has been applied, so if the value is too low, 0.0 is printed.
Also, this measures a percentage of real memory usage, which means it doesn't reflect the size of the process but only the part of the process that is actually mapped to real memory.
In your first case, 0.0% for CPU just means that the process exists but is doing nothing; it is probably in a waiting state (or consuming a very small fraction of the processing power), not that it "is not currently in use". In your second case, your process is active, in fact very busy (this is what a %CPU close to 100 reflects), but what it does is pointless (an infinite loop doing nothing).
To understand all of this, you may read about process state, process scheduling and virtual memory.
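To make the rounding concrete, here is a small sketch of that arithmetic, assuming %MEM is roughly RSS divided by total physical RAM times 100, as the quoted manual describes. It reads the current process's resident pages from /proc/self/statm and the machine's RAM size via sysconf(); it is illustrative only, not a reimplementation of ps.
/* Rough sketch of the %MEM arithmetic described above, assuming it is
 * approximately RSS / total physical RAM * 100. Resident pages are read
 * from /proc/self/statm (field 2); error handling is minimal. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long resident_pages = 0;
    FILE *f = fopen("/proc/self/statm", "r");

    if (f) {
        fscanf(f, "%*ld %ld", &resident_pages);   /* field 2: resident pages */
        fclose(f);
    }

    long page_kb  = sysconf(_SC_PAGESIZE) / 1024;
    long rss_kb   = resident_pages * page_kb;
    long total_kb = sysconf(_SC_PHYS_PAGES) * page_kb;

    printf("RSS %ld kB of %ld kB RAM -> %%MEM ~ %.3f\n",
           rss_kb, total_kb, 100.0 * rss_kb / total_kb);
    return 0;
}
For a small daemon on a machine with a few GB of RAM the result is a tiny fraction of a percent, which falls below the rounding threshold and is displayed as 0.0.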

While Jean-Baptiste's answer is correct as far as it goes, I believe it's more significant in this case that all of the processes you're noting with 0 in all three memory fields are kernel threads. Their memory is all kernel memory, and doesn't show up in top or ps. You can tell a process is a kernel thread on Linux both by the command name being wrapped in brackets and by the process showing no memory in the VSZ column. (That's the column that represents basically everything that could be considered the process's memory; it's only 0 for kernel threads, and only because they don't report their memory.)
Also note that with a start time in 2018 and no more than 1 minute 41 seconds of CPU time consumed, none of those jobs are really very active.

Related

Get memory high water mark for a time interval

I'm trying to get the max amount of memory used during brief intervals, for a long-running linux process. For example, something like:
resetmaxrss(); // hypothetical new command
void* foo = malloc(4096);
free(foo);
getrusage(...); // 'ru_maxrss' reports 4096 plus whatever else is alive
resetmaxrss();
void* bar = malloc(2048);
free(bar);
getrusage(...); // 'ru_maxrss' reports 2048 + whatever, *not* 4096
Options I've found and ruled out:
getrusage's max RSS can't be reset.
cgmemtime seems to use wait4 under the hood, so it isn't viable for querying a process while it's running.
tstime only reports on exiting processes, so it is also not viable for querying a process while it's running.
Other options, none of which are good:
Polling. Prone to miss our brief allocations.
Instrumenting our code. We don't have access to all of the memory allocators being used, so this wouldn't be very elegant or straightforward. I'd also rather use values reported by the OS for accuracy.
Is there a way to do this, short of proposing a patch to the Linux kernel?
It turns out that since Linux 4.0, the peak RSS can be reset:
/proc/[pid]/clear_refs (since Linux 2.6.22)
This is a write-only file, writable only by owner of the
process.
The following values may be written to the file:
[snip]
5 (since Linux 4.0)
Reset the peak resident set size ("high water mark") to
the process's current resident set size value.
That HWM/peak RSS can be read out with /proc/[pid]/status -> VmHWM or getrusage().
Patch RFC
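Putting those pieces together, here is a rough sketch (assuming Linux 4.0 or later, with error handling kept minimal): write 5 to /proc/self/clear_refs to reset the peak, then read VmHWM back from /proc/self/status around the interval of interest.
/* Sketch: reset the peak RSS (Linux >= 4.0), then read VmHWM back.
 * Paths and field names follow proc(5); parsing is illustrative only. */
#include <stdio.h>

static void reset_peak_rss(void)
{
    FILE *f = fopen("/proc/self/clear_refs", "w");
    if (f) {
        fputs("5", f);          /* 5 = reset peak RSS ("high water mark") */
        fclose(f);
    }
}

static long read_vm_hwm_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "VmHWM: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

int main(void)
{
    reset_peak_rss();
    /* ... allocate and touch memory in the interval of interest ... */
    printf("peak RSS since reset: %ld kB\n", read_vm_hwm_kb());
    return 0;
}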

Allocate memory to a process that other processes cannot use in Linux

To limit the memory available to a particular process we can use ulimit as well as cgroups.
I want to understand what happens in this case: using cgroups, I have allocated say ~700 MB of memory to process A on a system having 1 GB of RAM, and some other process, say B, requires ~400 MB of memory. What will happen?
If process A is allocated ~750 MB of memory but is using only 200 MB, will process B be able to use the memory allocated to A?
If not, then how can I achieve the scenario where "a fixed amount of memory is assigned to a process that other processes cannot use"?
EDIT
Is it possible to lock physical memory for a process? Or can only virtual memory be locked so that no other process can access it?
There is one multimedia application that must remain alive and may use the maximum system resources in terms of memory; I need to achieve this.
Thanks.
Processes use virtual memory (not RAM), so they have a virtual address space. See also setrlimit(2) (called by the ulimit shell builtin). Perhaps RLIMIT_RSS & RLIMIT_MEMLOCK are relevant. Of course, you could limit some other process, e.g. using RLIMIT_AS or RLIMIT_DATA, perhaps through pam_limits(8) & limits.conf(5).
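For illustration, a minimal sketch of a process capping its own virtual address space with setrlimit(2); the ~700 MB figure just mirrors the question, and note that RLIMIT_AS limits address space, not resident RAM.
/* Minimal sketch: cap this process's address space at ~700 MB. Later
 * allocations beyond the limit fail (malloc returns NULL, mmap fails). */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl = {
        .rlim_cur = 700UL * 1024 * 1024,   /* soft limit */
        .rlim_max = 700UL * 1024 * 1024    /* hard limit */
    };

    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    /* ... run the rest of the program under this limit ... */
    return 0;
}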
You could lock some virtual memory into RAM using mlock(2); this ensures that the RAM is kept for the calling process.
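To address the later EDIT about locking physical memory, here is a minimal mlock(2) sketch; it assumes the process has CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK, and the 64 MiB size is arbitrary.
/* Minimal sketch: pin a buffer into RAM with mlock(2) so its pages
 * cannot be swapped out. Requires CAP_IPC_LOCK or a sufficient
 * RLIMIT_MEMLOCK; the size here is illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64 * 1024 * 1024;          /* 64 MiB, arbitrary */
    void *buf = malloc(len);

    if (!buf)
        return 1;
    if (mlock(buf, len) != 0) {             /* pin the pages into RAM */
        perror("mlock");
        return 1;
    }
    /* ... use buf; its pages stay resident until munlock()/exit ... */
    munlock(buf, len);
    free(buf);
    return 0;
}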
If you want to improve performance, you might also use madvise(2) & posix_fadvise(2).
See also ionice(1) & renice(1)
BTW, you might consider using hypervisors like Xen, they are able to reserve RAM.
Finally, you might be wrong in believing that your manual tuning could do better than a carefully configured kernel scheduler.
What other processes will run on the same system, and what do you want to happen if the multimedia program needs memory that other processes are using?
You could weight the multimedia process so the OOM killer only picks it as a last choice, after every other non-essential process. You might see a dropped frame if the kernel takes some time killing something to free up memory.
According to this article, you can adjust the OOM-killer weight of a process by writing to /proc/pid/oom_adj, e.g. with
echo -17 > /proc/2592/oom_adj

How to determine the real memory usage of a single process?

How can I calculate the real memory usage of a single process? I am not talking about virtual memory, because it just keeps growing. For instance, there are proc files like smaps, where you can get the mappings of a process, but this is virtual memory and the values in that file just keep growing for a running process. I would like to reflect the real memory usage of a process: if you plot the memory usage of a process, it should represent both the allocation and the freeing of memory, so the plot should move up and down instead of being a line that just keeps growing for a running process.
So, how could I calculate the real memory usage? I would appreciate any helpful answer.
It's actually kind of a complicated question. The two most common metrics for a program's memory usage at the OS level are the virtual size and the resident set size. (These show up in the output of ps aux as the VSZ and RSS columns.) Roughly speaking, these tell you the total memory the program has assigned to it, versus how much it is currently actively using.
Further complicating the question is that when you use malloc (or the C++ new operator) to allocate memory, it is allocated from a pool in your process which is built up by occasionally requesting memory from the operating system. When you free memory, it goes back into this pool, but it is typically not returned to the OS. So as your program allocates and frees memory, you typically will not see its memory footprint go up and down. (However, if it frees a lot of memory and then doesn't allocate any more, eventually you may see its RSS go down.)
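One rough way to observe this yourself: the sketch below allocates many small blocks, frees every other one (so the heap stays fragmented and cannot be trimmed), and prints VmRSS from /proc/self/status at each step. The counts and sizes are arbitrary, and the exact behaviour depends on the allocator in use.
/* Sketch to watch RSS across malloc/free. With a typical glibc malloc,
 * the RSS stays high after the frees because the freed chunks go back
 * into malloc's pool, not to the kernel. Error handling is minimal. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long vm_rss_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

int main(void)
{
    enum { N = 100000, SZ = 1000 };   /* small blocks, below the mmap threshold */
    static char *blocks[N];
    int i;

    printf("at start:          %ld kB\n", vm_rss_kb());
    for (i = 0; i < N; i++) {
        blocks[i] = malloc(SZ);
        if (blocks[i])
            memset(blocks[i], 1, SZ); /* touch the pages so they count in RSS */
    }
    printf("after allocations: %ld kB\n", vm_rss_kb());
    for (i = 0; i < N; i += 2)        /* free every other block */
        free(blocks[i]);
    printf("after frees:       %ld kB\n", vm_rss_kb());
    return 0;
}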

How do the proc stat files work?

I have done a lot of reading and testing of the proc directory in OS's using the Linux kernel. I have been using Linux myself for many years now, but I needed to get into more details for a small private project. Specifically how the stat files work. I knew the basics, but not enough to create actual calculations with the data in them.
The problem is that the files in proc do not seem to contain what they should, at least not according to what I have read versus my own tests.
For example: the cpu line in the root stat file should contain the total uptime of the CPU times the number of cores (and/or CPUs), in jiffies. So to get the system uptime, you would add up each number in the row, divide by the number of cores/CPUs, and divide again by whatever a jiffy is defined to be on that particular system. At least this is the formula that I keep finding when searching this subject. If this were true, then the result should be equal to the first number in /proc/uptime? But this is not the case, and I have tested it on several machines with different numbers of cores, both 32-bit and 64-bit systems. I can never get these two to match up.
Also, the stat file for each PID has an uptime part (field 21, I think it was). But I cannot figure out what this number should be matched against to calculate a process's uptime in seconds. From what I have read, it should contain the total CPU jiffies as they were when the process was started. If that is true, one would simply subtract it from the current total CPU jiffies and divide by whatever a jiffy is on that system? But again, I cannot seem to get this to add up to reality.
Then there is the problem of finding out what a jiffy is. I found a formula where /proc/stat was used together with /proc/uptime and some division by the number of cores/CPUs to get that number, but it does not work. And I would not expect it to when the values of those two files do not add up, as mentioned in my first problem above. I did, however, come up with a different approach: simply read the first line of /proc/stat twice within one second. Then I can compare how many jiffies the system added in that second and divide that by the number of cores. This works on normal Linux systems, but it fails on Android in most cases. Android is constantly attaching/detaching cores depending on need, which changes how much you have to divide by. It is no problem as long as the core count matches across both reads, but if a core goes active during the second read, it does not work.
And lastly, I do not quite get the part about dividing by the number of cores. If each core writes all of its work time and idle time to the total line in /proc/stat, then it would make sense, as that line would actually contain the total uptime times the number of cores. But if this were true, each of the cpuN lines would add up to the same number, and they don't. This means that dividing by the number of cores should give an incorrect result. But that would also mean that CPU monitoring tools are making calculation errors, as they all seem to use this method.
Example:
/proc/stat
cpu 20455737 116285 4584497 104527701 1388173 366 102373 0 0 0
cpu0 4833292 5490 1413887 91023934 1264884 358 94250 0 0 0
cpu1 5785289 47944 1278053 4439797 45015 1 4235 0 0 0
cpu2 4748431 20922 926839 4552724 33455 2 2745 0 0 0
cpu3 5088724 41928 965717 4511246 44819 3 1141 0 0 0
The lines cpu0, cpu1, cpu2 and cpu3 do not add up to the same total. This means that dividing the total of the general cpu line by 4 should be incorrect.
/proc/uptime
1503361.21 3706840.53
All of the above output was taken from a system that should be using a clock tick of 100. Now if you take the sum of the general cpu line, divide it by 100 and then by 4 (the number of cores), you will not get the value in the uptime file.
And if you take the sum of the general cpu line, divide it by the uptime from /proc/uptime and then by 4 (the number of cores), you will not get the 100 that is this kernel's clock tick.
So why is nothing adding up as it should? How do I get the clock tick of a kernel, even on systems that attach/detach cores constantly? How do I get the total real uptime of a process? How do I get the real uptime from /proc/stat?
(This answer is based on the 4.0 kernel.)
The first number on each line of /proc/stat is the total time which each CPU has spent executing non-"nice" tasks in user mode. The second is the total time spent executing "nice" tasks in user mode.
Naturally, there will be random variations across CPUs. For example, the processes running on one CPU may make more syscalls, or slower syscalls, than those running on another CPU. Or one CPU may happen to run more "nice" tasks than another.
The first number in /proc/uptime is the "monotonic boot time" -- the amount of time which has passed since the system was last booted (including time which passed while the system was suspended). The second number is the total amount of time which all CPUs have spent idling.
There is also a task-specific stat file for each PID in the corresponding subdirectory under /proc. This one starts with a PID number, a command name in parentheses, and a state code (represented by a character). The 19th number after that is the start time of the process, in ticks.
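For illustration, a rough sketch of turning that start time into a process uptime in seconds, using the first number of /proc/uptime and sysconf(_SC_CLK_TCK); the field counting follows proc(5), and error handling is deliberately minimal.
/* Sketch: process uptime = (seconds since boot) - (starttime / CLK_TCK).
 * starttime is field 22 of /proc/[pid]/stat, in clock ticks since boot. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    long pid = (argc > 1) ? atol(argv[1]) : getpid();
    char path[64], buf[4096];
    double uptime = 0.0;
    char *p;
    int field;
    FILE *f;

    /* First number of /proc/uptime: seconds since boot. */
    f = fopen("/proc/uptime", "r");
    if (!f || fscanf(f, "%lf", &uptime) != 1)
        return 1;
    fclose(f);

    /* Read the whole /proc/[pid]/stat line. */
    snprintf(path, sizeof path, "/proc/%ld/stat", pid);
    f = fopen(path, "r");
    if (!f || !fgets(buf, sizeof buf, f))
        return 1;
    fclose(f);

    /* Skip "pid (comm) " and walk forward to field 22 (starttime). */
    p = strrchr(buf, ')') + 2;
    for (field = 3; field < 22; field++)
        p = strchr(p, ' ') + 1;

    double start = atof(p) / sysconf(_SC_CLK_TCK);  /* ticks -> seconds */
    printf("process %ld has been running for %.2f seconds\n", pid, uptime - start);
    return 0;
}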
All of this information is not very hard to find simply by browsing the Linux source code. I recommend you clone a local copy of Linus' repo and use grep to find the details you need. As a tip, the process-specific files in /proc are implemented in fs/proc/base.c. The /proc/stat file which you asked about is implemented in fs/proc/stat.c, and /proc/uptime is implemented in fs/proc/uptime.c.
man proc is also a good source of information.

vm/min_free_kbytes - Why Keep Minimum Reserved Memory?

According to this article:
/proc/sys/vm/min_free_kbytes: This controls the amount of memory that is kept free for use by special reserves including “atomic” allocations (those which cannot wait for reclaim)
My question is: what is meant by "those which cannot wait for reclaim"? In other words, I would like to understand why there is a need to tell the system to always keep a certain minimum amount of memory free, and under what circumstances this memory will be used. [It must be used by something; I don't see the need otherwise.]
My second question: does setting this to something higher than 4 MB (on my system) lead to better performance? We have a server which occasionally exhibits very poor shell performance (e.g. ls -l takes 10-15 seconds to execute) when certain processes get going; would setting this number higher lead to better shell performance?
(link is dead, looks like it's now here)
That text is referring to atomic allocations, which are requests for memory that must be satisfied without giving up control (i.e. the current thread can not be suspended). This happens most often in interrupt routines, but it applies to all cases where memory is needed while holding an essential lock. These allocations must be immediate, as you can't afford to wait for the swapper to free up memory.
See Linux-MM for a more thorough explanation, but here is the memory allocation process in short:
_alloc_pages first iterates over each memory zone looking for the first one that contains eligible free pages
_alloc_pages then wakes up the kswapd task [..to..] tap into the reserve memory pools maintained for each zone.
If the memory allocation still does not succeed, _alloc_pages will either give up [..] In this process _alloc_pages executes a cond_resched() which may cause a sleep, which is why this branch is forbidden to allocations with GFP_ATOMIC.
min_free_kbytes is unlikely to help much with the described "ls -l takes 10-15 seconds to execute"; that is likely caused by general memory pressure and swapping rather than zone exhaustion. The min_free_kbytes setting only needs to allow enough free pages to handle immediate requests. As soon as normal operation is resumed, the swapper process can be run to rebalance the memory zones. The only time I've had to increase min_free_kbytes is after enabling jumbo frames on a network card that didn't support dma scattering.
To expand on your second question a bit, you will have better results tuning vm.swappiness and the dirty ratios mentioned in the linked article. However, be aware that optimizing for "ls -l" performance may cause other processes to become slower. Never optimize for a non-primary usecase.
All Linux systems will attempt to make use of all physical memory available to the system, often through the creation of a filesystem buffer cache, which, put simply, is an I/O buffer to help improve system performance. Technically this memory is not in use, even though it is allocated for caching.
"wait for reclaim", in your question, refers to the process of reclaiming that cache memory that is "not in use" so that it can be allocated to a process. This is supposed to be transparent but in the real world there are many processes that do not wait for this memory to become available. Java is a good example, especially where a large minimum heap size has been set. The process tries to allocate the memory and if it is not instantly available in one large contiguous (atomic?) chunk, the process dies.
Reserving a certain amount of memory with min_free_kbytes allows this memory to be instantly available and reduces the memory pressure when new processes need to start, run and finish while there is a high memory load and a full buffer cache.
4MB does seem rather low because if the buffer cache is full, any process that wants an immediate allocation of more than 4MB will likely fail. The setting is very tunable and system-specific, but if you have a few GB of memory available it can't hurt to bump up the reserve memory to 128MB. I'm not sure what effect it will have on shell interactivity, but likely positive.
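If you do decide to raise it, the knob is /proc/sys/vm/min_free_kbytes (the vm.min_free_kbytes sysctl). The sketch below reads the current value and writes 131072 (128 MB); writing requires root, and the number simply mirrors the suggestion above rather than a general recommendation.
/* Illustrative only: read the current vm.min_free_kbytes and raise it
 * to 131072 kB (128 MB). Writing needs root; error handling is minimal. */
#include <stdio.h>

int main(void)
{
    long cur = -1;
    FILE *f = fopen("/proc/sys/vm/min_free_kbytes", "r");

    if (f) {
        fscanf(f, "%ld", &cur);
        fclose(f);
    }
    printf("current min_free_kbytes: %ld\n", cur);

    f = fopen("/proc/sys/vm/min_free_kbytes", "w");
    if (!f) {
        perror("open for write");          /* typically needs root */
        return 1;
    }
    fprintf(f, "%d", 131072);              /* 128 MB reserve */
    fclose(f);
    return 0;
}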
This memory is kept free from use by normal processes. As #Arno mentioned, the special processes that can run include interrupt routines, which must be run now (as it's an interrupt), and finish before any other processes can run (atomic). This can include things like swapping out memory to disk when memory is full.
If memory fills up, an interrupt (memory-management) process runs to swap some memory out to disk so it can free memory for use by normal processes. But if vm.min_free_kbytes is too small for it to run, the system locks up: the interrupt process must run first to free memory so others can run, but it is stuck because it doesn't have enough reserved memory (vm.min_free_kbytes) to do its task, resulting in a deadlock.
Also see:
https://www.linbit.com/en/kernel-min_free_kbytes/ and
https://askubuntu.com/questions/41778/computer-freezing-on-almost-full-ram-possibly-disk-cache-problem (where the memory management process has so little memory to work with it takes so long to swap little by little that it feels like a freeze.)
