Swappiness in JVM - Linux

I stumbled upon an interesting problem last week when I noticed that one of our production servers, which runs Apache HTTP Server in front of Tomcat, had stopped, reporting an HTTP outage.
On further investigation, the issue appeared to be the JVM causing memory pages to be swapped out quickly. This resulted in the swap space filling up completely, causing a memory error the next time a page was moved to swap.
Investigating further, it seems that there is a JVM swappiness factor that is set to 60 by default on some of our Linux distributions. Based on some research, this could be too high a value for a web service with heavy traffic. Our swap space was set to 2 GB.
The swap details are:
Filename    Type        Size     Used     Priority
/dev/sda3   partition   2096472  1261420  -1
From /proc/meminfo:
SwapCached:    944668 kB
Our JVM properties are as follows:
-Xmx6g -Xms4g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:PermSize=512M -XX:MaxPermSize=1024M -XX:NewSize=2g -XX:MaxNewSize=2g -XX:ParallelGCThreads=8
The server runs with 12 GB RAM.
Swappiness does not go well with the JVM's GC process, so I tried reducing the swappiness to 0, but that did not change anything. We still see cases where the entire swap space is consumed, resulting in an OutOfMemory error.
How can I tune the JVM performance?

The swappiness factor is actually a system-wide setting in Linux, not JVM-specific.
My personal experience with some kinds of applications is that they need swap to be turned off in order to work without hiccups, period. While I can't say for sure if your application belongs to this category, I have seen apps being swapped out despite enough free RAM being available. As you pointed out, GC and swap don't mix well. Swappiness is just an indication to the OS, so its impact on how much gets swapped out may be larger or smaller depending on circumstances. My suggestion would be to try turning swap completely off.
In order to do this, you need to be root or have sudo access. Comment out the line describing swap in /etc/fstab, which will prevent swap from being turned on after a reboot. Then, in order to turn swap off for the current server run, run swapoff -a. This may take up to a few minutes if there's data in swap that needs to be pulled back into RAM. Then, check the last line of the output of free to ensure that the total size of available swap is 0. After this, observe your app in order to determine if turning swap off solved your issue.
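For reference, a sketch of those steps, using the /dev/sda3 swap partition from the question (your fstab line may differ):
# In /etc/fstab, comment out the swap entry so it stays off after reboot:
#   /dev/sda3  none  swap  sw  0  0
$ sudo swapoff -a     # turn swap off now; may take minutes if swap is populated
$ free | tail -n 1    # the Swap line should now show 0 total
Swap:            0           0           0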

If your physical RAM is 12 GB and your heap is set to only 6 GB (max), then you have plenty of RAM. Is this server dedicated to the Tomcat instance?
Take a look at the verbose GC log files to see what the memory usage is over time. Check your access log files to confirm whether the access requests have increased over time.
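If verbose GC logging isn't already enabled, flags along these lines will produce it (standard HotSpot flags for the Java 7-era JVM implied by the PermSize settings above; the log path is illustrative):
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/tomcat/gc.log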

Related

Multiple Cassandra node goes down

We have a 12-node Cassandra cluster across 2 different datacenters. We are migrating data from a SQL DB to Cassandra through a .NET application, and there is another .NET app that reads data from Cassandra. Recently we have been seeing one node or another go down (nodetool status shows DN and the service is stopped on it). Below is the output of nodetool status. We have to start the service to get it working again, but it then stops again.
https://ibb.co/4P1T453
Path to the log: https://pastebin.com/FeN6uDGv
So in looking through your pastebin, I see a few things that can be adjusted.
First I'm reasonably sure that this is your primary issue:
Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out,
especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.
From GNU Error Codes:
Macro: int ENOMEM
“Cannot allocate memory.” The system cannot allocate more virtual
memory because its capacity is full.
-Xms12G, -Xmx12G, -Xmn3000M,
How much RAM is on your instance? From what I'm seeing your node is dying from an OOM (Out of Memory error). My guess is that you're designating too much RAM to the heap, and there isn't enough for the OS/page-cache. In fact, I wouldn't designate much more than 50%-60% of RAM to the heap.
For example, I mostly build instances on 16GB of RAM, and I've found that a 10GB max heap is about as high as you'd want to go on that.
-XX:+UseParNewGC, -XX:+UseConcMarkSweepGC
In fact, as you're using CMS GC, I wouldn't go higher than 8GB for max heap size.
Maximum number of memory map areas per process (vm.max_map_count) 65530 is too low,
recommended value: 1048575, you can change it with sysctl.
This means you haven't adjusted your limits.conf or sysctl.conf. Check through the guide (DSE 6.0 - Recommended Production Settings), but generally it's a good idea to add the following to these files:
/etc/security/limits.conf
* - memlock unlimited
* - nofile 100000
* - nproc 32768
* - as unlimited
/etc/sysctl.conf
vm.max_map_count = 1048575
Note: After adjusting sysctl.conf, you'll want to run a sudo sysctl -p or reboot.
Is swap disabled? : false,
You will want to disable swap. If Cassandra starts swapping contents of RAM to disk, things will get really slow. Run a swapoff -a and then edit /etc/fstab and remove any swap entries.
tl;dr; Summary
Set your initial and max heap sizes to 8GB (heap new size is fine); see the sketch after this list.
Modify your limits.conf and sysctl.conf files appropriately.
Disable swap.
It's also a good idea to get on the latest version of 3.11 (3.11.4).
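For reference, a sketch of where those heap settings live in Cassandra 3.11's conf/jvm.options (the values reflect the 8GB recommendation above; Cassandra computes its own defaults if they are left commented out):
-Xms8G
-Xmx8G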
Hope this helps!

Docker not reporting memory usage correctly?

Through some longevity testing with Docker (Docker 1.5 and 1.6 with no memory limit) on CentOS 7 / RHEL 7, and observing the systemd-cgtop stats for the running containers, I noticed what appeared to be very high memory use. Typically the particular application, running in a non-containerized state, only utilizes around 200-300 MB of memory. Over a 3-day period I ended up seeing systemd-cgtop report that my container was up to 13 GB of memory used. While I am not an expert Linux admin by any means, I started digging into this, which pointed me to the following articles:
https://unix.stackexchange.com/questions/34795/correctly-determining-memory-usage-in-linux
http://corlewsolutions.com/articles/article-6-understanding-the-free-command-in-ubuntu-and-linux
So basically what I understand is this: to determine the actual free memory within the system, one should look at the "-/+ buffers/cache:" line of "free -m", not the top line. I noticed that the top line of "free -m" constantly shows increasing used memory and decreasing free memory, just like what I am observing for my container through systemd-cgtop. If I observe the "-/+ buffers/cache:" line, I see the actual, stable amounts of memory being used / free. Also, if I observe the actual process within top on the host, I can see that the process itself only ever uses less than 1% of memory (0.8% of 32 GB).
I am a bit confused as to what's going on here. If I set a memory limit of 500-1000 MB for a container (I believe it would effectively be twice as much due to the default swap allowance), would my process eventually be stopped when I reach the memory limit, even though the process itself is not using anywhere near that much memory? If anybody out there has any feedback on the above, that would be great. Thanks!
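For reference, this is the kind of limit being described (the image name is illustrative; with -m set but --memory-swap unset, Docker of this era allowed memory plus swap to total twice the -m value):
$ docker run -m 512m my-image    # effective memory + swap limit: 1024 MB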
I used Docker on CentOS 7 for a while and was confused by the same numbers. Checking the GitHub issue linked below, it looks like docker stats in this release is misleading:
https://github.com/docker/docker/issues/10824
So I just ignored the memory usage reported by docker stats.
A year since you asked, but adding an answer here for anyone else interested. If you set a memory limit, I think the container would not be killed unless the kernel fails to reclaim unused memory. The cgroup metrics, and consequently docker stats, show page cache + RSS. You could look at the detailed cgroup metrics to see the breakdown, as in the sketch below.
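A minimal sketch of reading those cgroup (v1) metrics directly. The path assumes Docker's cgroupfs layout of that era (under the systemd driver it may be system.slice/docker-<ID>.scope instead), and the numbers are illustrative:
$ CID=$(docker ps -q --no-trunc | head -n 1)
$ grep -E '^total_(cache|rss) ' /sys/fs/cgroup/memory/docker/$CID/memory.stat
total_cache 13201408000    # reclaimable page cache
total_rss 262144000        # memory the processes actually hold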
I had a similar issue, and when I tested with a memory limit I saw that the container is not killed. Rather, the memory is reclaimed and reused.

Linux CPU cache slowdown

We're getting overnight lockups on our embedded (Arm) linux product but are having trouble pinning it down. It usually takes 12-16 hours from power on for the problem to manifest itself. I've installed sysstat so I can run sar logging, and I've got a bunch of data, but I'm having trouble interpreting the results.
The targets only have 512 MB RAM (we have other models which have 1 GB, but they see this issue much less often), and have no disk swap files, to avoid wearing the eMMCs.
Some kind of paging / virtual memory event is initiating the problem. In the sar logs, pgpgin/s, pgscand/s, pgsteal/s, and majflt/s all increase steadily before snowballing to crazy levels. This pushes the CPU up to correspondingly high levels (load of 30-60 on dual-core ARM chips). At the same time, the frmpg/s values go very negative, whilst campg/s goes highly positive. The upshot is that the system is trying to allocate a large number of cache pages all at once. I don't understand why this would be.
The target then essentially locks up until it's rebooted or someone kills the main GUI process or it crashes and is restarted (We have a monolithic GUI application that runs all the time and generally does all the serious work on the product). The network shuts down, telnet blocks forever, as do /proc filesystem queries and things that rely on it like top. The memory allocation profile of the main application in this test is dominated by reading data in from file and caching it as textures in video memory (shared with main RAM) in an LRU using OpenGL ES 2.0. Most of the time it'll be accessing a single file (they are about 50Mb in size), but I guess it could be triggered by having to suddenly use a new file and trying to cache all 50Mb of it all in one go. I haven't done the test (putting more logging in) to correlate this event with these system effects yet.
The odd thing is that the actual free and cached RAM levels don't show an obvious lack of memory (I have seen the oom-killer swoop in and kill the main application with >100 MB free and 40 MB of cache RAM). The main application's memory usage seems reasonably well behaved, with a VmRSS value that is pretty stable. Valgrind hasn't found any progressive leaks that would occur during operation.
The behaviour seems like that of a system frantically swapping out to disk and making everything run dog slow as a result, but I don't know if this is a known effect in a free<->cache RAM exchange system.
My problem is superficially similar to the question linux high kernel cpu usage on memory initialization, but that issue seemed to be driven by swap file management. However, dirty page flushing does seem plausible for my issue.
I haven't tried playing with the various vm files under /proc/sys/vm yet. vfs_cache_pressure and possibly swappiness would seem good candidates for some tuning, but I'd like some insight into good values to try here. vfs_cache_pressure seems ill-defined as to what the difference between setting it to 200 as opposed to 10000 would be quantitatively.
The other interesting fact is that it is a progressive problem. It might take 12 hours for the effect to happen the first time. If the main app is killed and restarted, it seems to happen every 3 hours after that fact. A full cache purge might push this back out, though.
Here's a link to the log data, with two files: sar1.log, the complete output of sar -A, and overview.log, an extract of free/cache memory, CPU load, MainGuiApp memory stats, and the -B and -R sar outputs for the interesting period between midnight and 3:40 am:
https://drive.google.com/folderview?id=0B615EGF3fosPZ2kwUDlURk1XNFE&usp=sharing
So, to sum up, what's my best plan here? Tune vm to tend to recycle pages more often to make it less bursty? Are my assumptions about what's happening even valid given the log data? Is there a cleverer way of dealing with this memory usage model?
Thanks for your help.
Update, 5th June 2013:
I've tried the brute-force approach and set up a script that echoes 3 to /proc/sys/vm/drop_caches every hour (see the sketch below). This seems to be maintaining the steady state of the system right now, and the sar -B stats stay on the flat portion, with very few major faults and 0.0 pgscand/s. However, I don't understand why keeping the cache RAM very low mitigates a problem where the kernel is trying to add the universe to cache RAM.
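A sketch of that hourly workaround, assuming a cron-based setup (the cron.hourly path is one common place for it; adjust for your init system):
#!/bin/sh
# /etc/cron.hourly/drop-caches - brute-force page cache purge
sync                               # flush dirty pages first
echo 3 > /proc/sys/vm/drop_caches  # drop page cache, dentries and inodes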

How can I limit the cache used by copying so there is still memory available for other caches?

Basic situation:
I am copying some NTFS disks in openSUSE. Each one is 2 TB. When I do this, the system runs slow.
My guesses:
I believe it is likely due to caching. Linux decides to discard useful caches (for example, KDE 4 bloat, virtual machine disks, LibreOffice binaries, Thunderbird binaries, etc.) and instead fill all available memory (24 GB total) with stuff from the copying disks, which will be read only once, then written and never used again. So then any time I use these applications (or KDE 4), the disk needs to be read again, and reading the bloat off the disk again makes things freeze/hiccup.
Due to the cache being gone and the fact that these bloated applications need lots of cache, this makes the system horribly slow.
Since it is USB, the disk and disk controller are not the bottleneck, so using ionice does not make it faster.
I believe it is the cache rather than just the motherboard going too slow, because if I stop everything copying, it still runs choppy for a while until it recaches everything.
And if I restart the copying, it takes a minute before it becomes choppy again. Also, I can limit the copy to around 40 MB/s, and then everything runs faster again (not because the right things are cached, but because the motherboard buses have plenty of spare bandwidth for the system disks). I can fully accept a performance loss when my motherboard's I/O capability is completely consumed (100% used means 0% wasted power, which makes me happy), but I can't accept that this caching mechanism performs so terribly in this specific use case.
# free
             total       used       free     shared    buffers     cached
Mem:      24731556   24531876     199680          0    8834056   12998916
-/+ buffers/cache:    2698904   22032652
Swap:      4194300      24764    4169536
I also tried the same thing on Ubuntu, which causes a total system hang instead. ;)
And to clarify, I am not asking how to leave memory free for the "system", but for "cache". I know that cache memory is automatically given back to the system when needed, but my problem is that it is not reserved for caching of specific things.
Is there some way to tell these copy operations to limit memory usage so some important things remain cached, and therefore any slowdowns are a result of normal disk usage and not rereading the same commonly used files? For example, is there a setting of max memory per process/user/file system allowed to be used as cache/buffers?
The nocache command is the general answer to this problem! It is also in Debian and Ubuntu 13.10 (Saucy Salamander).
Thanks, Peter, for alerting us to the --drop-cache option in rsync. But that was rejected upstream (Bug 9560 - drop-cache option) in favor of a more general solution: the new nocache command, based on the rsync work with fadvise.
You just prepend "nocache" to any command you want. It also has nice utilities for describing and modifying the cache status of files. For example, here are the effects with and without nocache:
$ ./cachestats ~/file.mp3
pages in cache: 154/1945 (7.9%) [filesize=7776.2K, pagesize=4K]
$ ./nocache cp ~/file.mp3 /tmp
$ ./cachestats ~/file.mp3
pages in cache: 154/1945 (7.9%) [filesize=7776.2K, pagesize=4K]
$ cp ~/file.mp3 /tmp
$ ./cachestats ~/file.mp3
pages in cache: 1945/1945 (100.0%) [filesize=7776.2K, pagesize=4K]
So hopefully that will work for other backup programs (rsnapshot, duplicity, rdiff-backup, amanda, s3sync, s3ql, tar, etc.) and other commands that you don't want trashing your cache.
Kristof Provost was very close, but in my situation, I didn't want to use dd or write my own software, so the solution was to use the "--drop-cache" option in rsync.
I have used this many times since asking this question, and it seems to fix the problem completely. The one exception was when using rsync to copy from a FreeBSD machine, which doesn't support "--drop-cache". So I wrote a wrapper to replace the /usr/local/bin/rsync command and strip out that option, and now it works when copying from there too.
It still uses a huge amount of memory for buffers and seems to keep almost no cache, but it works smoothly anyway.
$ free
             total       used       free     shared    buffers     cached
Mem:      24731544   24531576     199968          0   15349680     850624
-/+ buffers/cache:    8331272   16400272
Swap:      4194300     602648    3591652
You have practically two choices:
Limit the maximum disk buffer size: the problem you're seeing is probably caused by the default kernel configuration, which allows a huge slice of RAM to be used for disk buffering. When you try to write lots of data to a really slow device, you end up dedicating lots of your precious RAM to disk caching for that slow device.
The kernel does this because it assumes the processes can keep doing work while they are not slowed down by the slow device, and that RAM can be automatically freed if needed by simply writing the pages out to storage (the slow USB stick). But the kernel doesn't consider the actual performance of that device. The quick fix:
# Wake up background writing process if there's more than 50 MB of dirty memory
echo 50000000 > /proc/sys/vm/dirty_background_bytes
# Limit background dirty bytes to 200 MB (source: http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages)
echo 200000000 > /proc/sys/vm/dirty_bytes
Adjust the numbers to match the RAM you're willing to spend on the disk write cache. A sensible value depends on your actual write performance, not on the amount of RAM you have. You should aim to have barely enough RAM for caching to allow full write performance for your devices. Note that this is a global setting, so you have to set it according to the slowest device you're using.
Reserve a minimum memory size for each task you want to keep going fast. In practice this means creating cgroups for the stuff you care about and defining the minimum memory you want for each such group; the kernel can then use the remaining memory as it sees fit (see the sketch below). For details, see this presentation: SREcon19 Asia/Pacific - Linux Memory Management at Scale: Under the Hood
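A minimal sketch of that cgroup approach, assuming systemd on a cgroup-v2 system; MemoryLow is the "minimum memory" reservation described above, and the command name and value are illustrative:
$ sudo systemd-run --scope -p MemoryLow=2G my-important-app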
Update, 2022:
You can also try creating a new file /etc/udev/rules.d/90-set-default-bdi-max_ratio-and-min_ratio.rules with the following contents:
# For every BDI device, set max cache usage to 30% and min reserved cache to 2% of the whole cache
# https://unix.stackexchange.com/a/481356/20336
ACTION=="add|change", SUBSYSTEM=="bdi", ATTR{max_ratio}="30", ATTR{min_ratio}="2"
The idea is to set a per-device limit on maximum cache utilization. With the above limit (30%) you can have two totally stalled devices and still have 40% of the disk cache available for the rest of the system. If you have 4 or more stalled devices in parallel, even this workaround cannot help alone. That's why I have also added a minimum cache space of 2% for every device, but I don't know how to check whether this is actually effective. I've been running with this config for about half a year, and I think it's working nicely.
See https://unix.stackexchange.com/a/481356/20336 for details.
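One way to at least confirm that the rule was applied is to read the attributes back from sysfs (the bdi device numbers below are illustrative):
$ grep . /sys/class/bdi/*/max_ratio
/sys/class/bdi/8:0/max_ratio:30
/sys/class/bdi/8:16/max_ratio:30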
The kernel cannot know that you won't use the cached data from the copy again. This is your information advantage.
But you could set the swappiness to 0: sudo sysctl vm.swappiness=0. This will cause Linux to drop the cache before libraries, etc., are written to swap.
It works nicely for me too; it is especially performant in combination with a huge amount of RAM (16-32 GB).
It's not possible if you're using plain old cp, but if you're willing to reimplement or patch it yourself, setting posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE) on both input and output file will probably help.
posix_fadvise() tells the kernel about your intended access pattern. In this case, you'd only use the data once, so there isn't any point in caching it.
The Linux kernel honours these flags, so it shouldn't be caching the data any more.
Try using dd instead of cp.
Or mount the filesystem with the sync flag.
I'm not completely sure whether these methods bypass the page cache, but they may be worth a try; a sketch of the dd route follows.
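A sketch of the dd route, assuming GNU dd: the direct flags open the files with O_DIRECT, which bypasses the page cache entirely (the paths are illustrative, and O_DIRECT requires an aligned block size such as the 1M used here):
$ dd if=/dev/sdb of=/backup/disk.img bs=1M iflag=direct oflag=direct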
I am copying some NTFS disks [...] the system runs slow. [...]
Since it is USB [...]
The slowdown is a known memory management issue.
Use a newer Linux Kernel. The older ones have a problem with USB data and "Transparent Huge Pages". See this LWN article. Very recently this issue was addressed - see "Memory Management" in LinuxChanges.
OK, now that I know you're using rsync, I could dig a bit more:
It seems that rsync is inefficient when used with tons of files at the same time. There's an entry in their FAQ, and it's not a Linux/cache problem; it's an rsync problem, eating too much RAM.
Googling around, I found recommendations to split the syncing into multiple rsync invocations.

How to clean caches used by the Linux kernel

I want to force the Linux kernel to allocate more memory to applications after the cache starts taking up too much memory (as can be seen by the output of 'free').
I've run
sudo sync; sudo sysctl -w vm.drop_caches=3; free
(to free both the dentry/inode caches and the page cache), and I see that only about half of the used cache was freed; the rest remains. How can I tell what is taking up the rest of the cache and force it to be freed?
You may want to increase vfs_cache_pressure as well as set swappiness to 0.
Doing that will make the kernel reclaim cache faster, while giving processes equal or more favor when deciding what gets paged out.
You may only want to do this if processes you care about do very little disk I/O.
If a network I/O bound process has to swap in to serve requests, that's a problem and the real solution is to put it on a less competitive server.
With the default swappiness setting, the kernel is almost always going to favour keeping FS related cache in real memory.
As such, if you increase the cache pressure, be sure to adjust swappiness as well, as in the sketch below.
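A sketch of that combination (the values are illustrative starting points; persist them in /etc/sysctl.conf once you've found settings that help):
$ sudo sysctl -w vm.vfs_cache_pressure=200
$ sudo sysctl -w vm.swappiness=0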
The contents of /proc/meminfo tell you what the kernel uses RAM for.
You can use /proc/sys/vm/vfs_cache_pressure to force the kernel to reclaim memory that is used for filesystem-related caches more lazily or eagerly.
Note that your application may only benefit from tuning this parameter if it does little or no disk I/O.
You might find John Nilsson's answer to my Question useful for purging the cache in order to test whether that is related to your problem:
sync && echo 1 > /proc/sys/vm/drop_caches
Though I'm guessing the only real difference is 1 vs 3
