Can I tell Linux not to swap out a particular process's memory?


Is there a way to tell Linux that it shouldn't swap out a particular process's memory to disk?
It's a Java app, so ideally I'm hoping for a way to do this from the command line.
I'm aware that you can set the global swappiness to 0, but is this wise?

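You can do this via the mlockall(2) system call under Linux; it works for the whole process, but do read about the flags argument you need to pass (MCL_CURRENT, MCL_FUTURE).
Do you really need to pull the whole thing in-core? If it's a Java app, you would presumably lock the whole JVM in-core. I don't know of a command-line method for doing this. A trivial wrapper that calls fork, then mlockall, then exec will not work, because memory locks are not inherited across fork(2) and are removed on execve(2); the call has to be made from inside the JVM process itself, for example through a small native (JNI) helper.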
You might also look to see if one of the access pattern notifications in madvise(2) meets your needs. Advising the VM subsystem about a better paging strategy might work out better if it's applicable for you.
Note that a long time ago now under SunOS, there was a mechanism similar to madvise called vadvise(2).
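
To make the mlockall(2) suggestion concrete, here is a minimal sketch that calls it from inside the process that should stay resident. It is written in Python with ctypes purely for illustration; the helper names try_lock_all and unlock_all are made up, and the flag values are the ones used on x86 and ARM Linux, so check <sys/mman.h> on other architectures.

```python
import ctypes

# Linux values from <sys/mman.h> on x86 and ARM; a few architectures
# (e.g. PowerPC, SPARC) use different numbers, so treat these as assumptions.
MCL_CURRENT = 1  # lock pages that are mapped right now
MCL_FUTURE = 2   # also lock pages mapped later (heap growth, new stacks)


def try_lock_all(flags=MCL_CURRENT | MCL_FUTURE):
    """Call mlockall(2) on the current process.

    Returns (True, 0) on success, or (False, errno) on failure.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.mlockall(flags) == 0:
        return True, 0
    return False, ctypes.get_errno()


def unlock_all():
    """Undo a previous mlockall(2) with munlockall(2)."""
    ctypes.CDLL(None, use_errno=True).munlockall()
```

Without CAP_IPC_LOCK or a generous RLIMIT_MEMLOCK the call will typically fail with ENOMEM or EPERM, which try_lock_all reports as the second element of its return value.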

If you wish to change the swappiness for a single process, add it to a cgroup and set the value for that cgroup:
https://unix.stackexchange.com/questions/10214/per-process-swapiness-for-linux#10227
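
A rough sketch of what that looks like, assuming a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory (cgroup v2 does not expose a per-group memory.swappiness file). The function name and the group name in the usage note are hypothetical, and it needs root to run against the real hierarchy.

```python
import os


def set_cgroup_swappiness(group, pid, swappiness, root="/sys/fs/cgroup/memory"):
    """Create cgroup `group`, set its swappiness, and move `pid` into it.

    Assumes a cgroup v1 memory controller mounted at `root`.
    """
    path = os.path.join(root, group)
    os.makedirs(path, exist_ok=True)
    # Per-cgroup equivalent of the global vm.swappiness setting.
    with open(os.path.join(path, "memory.swappiness"), "w") as f:
        f.write(str(swappiness))
    # Writing a PID to cgroup.procs moves the whole process into the group.
    with open(os.path.join(path, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    return path
```

For example, set_cgroup_swappiness("noswap", 1234, 0) would put the (made-up) PID 1234 into a group called noswap with swappiness 0.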

There exists a class of applications that you never want to swap. One such class is databases. Databases use memory as caches and buffers for their disk areas, and it makes absolutely no sense for these ever to be put in swap. A particular piece of memory may hold some relevant data that is not needed for a week, until one day a client asks for it. Without the caching/swapping, the database would simply find the relevant record on disk, which would be quite fast; but with swapping, your service might suddenly take a long time to respond.
mysqld includes code to call mlockall (enabled with its --memlock option). On Linux, since at least 2.6.9, this system call works for non-root processes that have the CAP_IPC_LOCK capability [1]. When using --memlock, the process must still work within the bounds of the RLIMIT_MEMLOCK limit (LimitMEMLOCK in systemd) [2]. One of the (few) good things about systemd is that you can grant the mysqld process these capabilities without requiring a special program. It can also set the rlimits, as you would with ulimit. Here is an override file for mysqld that does the requisite steps, including a few others that you might need for a process such as a database:
[Service]
# Prevent mysql from swapping
CapabilityBoundingSet=CAP_IPC_LOCK
# Let mysqld lock all memory to core (don't swap)
LimitMEMLOCK=-1
# Do not kill this process if low on memory
OOMScoreAdjust=-900
# Use a higher I/O scheduling class
IOSchedulingClass=realtime
Type=simple
ExecStart=
ExecStart=/usr/sbin/mysqld --memlock $MYSQLD_OPTS
Note: The standard community MySQL package currently ships with Type=forking and adds --daemonize to the ExecStart line of the service. This is inherently less stable than the above method.
UPDATE: I am not 100% happy with this solution. After several days of runtime, I noticed the process still had enormous amounts of swap! Examining /proc/XXXX/smaps, I note the following:
The largest contributor of swap is a stack segment: 437 MB and fluctuating. This presents obvious performance issues. It also indicates a stack-based memory leak.
There are zero Locked pages. This indicates that the memlock option in MySQL (or Linux) is broken. In this case, it wouldn't matter much, because MySQL can't memlock the stack.
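
For anyone who wants to repeat that smaps check, here is a small sketch that totals the Swap and Locked fields per mapping. The function name is made up, and it assumes the usual smaps layout of one header line per mapping followed by "Field: N kB" lines.

```python
import re
from collections import defaultdict

# A mapping header looks like:
# 7ffd1c0e5000-7ffd1c106000 rw-p 00000000 00:00 0    [stack]
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+\S+\s+\S+\s+\d+\s*(.*)$")
FIELD = re.compile(r"^(Swap|Locked):\s+(\d+) kB$")


def swap_and_locked(smaps_text):
    """Sum the Swap and Locked fields (in kB) per mapping name."""
    totals = defaultdict(lambda: {"Swap": 0, "Locked": 0})
    name = "[anon]"
    for line in smaps_text.splitlines():
        header = HEADER.match(line)
        if header:
            # Mappings without a path or tag are grouped as "[anon]".
            name = header.group(1) or "[anon]"
            continue
        field = FIELD.match(line)
        if field:
            totals[name][field.group(1)] += int(field.group(2))
    return dict(totals)
```

Feed it the contents of /proc/XXXX/smaps (read as the process owner or root) and sort the result by the "Swap" value to see which mapping contributes most.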

You can do that with the mlock family of syscalls. I'm not sure, however, whether you can do it for a different process.

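As superuser you can 'nice' it to the highest priority level, -20, and hope that's enough to keep it from being swapped out. It usually is. Bear in mind, though, that niceness controls CPU scheduling rather than page reclaim, so any effect on swapping is indirect. Positive numbers lower the scheduling priority. Normal users cannot nice upwards (to negative numbers).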

Except in extremely unusual circumstances, asking this question means that You're Doing It Wrong(tm).
Seriously, if Linux wants to swap and you're trying to keep your process in memory then you're putting an unreasonable demand on the OS. If your app is that important then 1) buy more memory, 2) remove other apps/daemons from the machine, or dedicate a machine to your app, and/or 3) invest in a really fast disk subsystem. These steps are reasonable for an important app. If you can't justify them, then you probably can't justify wiring memory and starving other processes either.

Why do you want to do this?
If you are trying to increase the performance of this app then you are probably on the wrong track. The OS will swap out a process to free memory for disk cache, even if there is free RAM; the kernel knows best (actually, the smart guys that wrote the scheduler know best).
If you have a process that needs responsiveness (it's swapped out while not in use and you need it to restart quickly), then nicing it to a high priority, using mlock, or using a real-time kernel might help.

Related

Limit Number of Core Dumps by Process Name

QUESTION Is there an easy, established and accepted way to limit the number of core dumps for a given process on Linux?
WHAT I WANT My ideal solution would be a one-line command to set the per-application limit of x core dumps for all applications. Alternatively, I would be happy with a method to set the limit for each application individually.
WHAT I DON'T WANT I know I can already set a limit for the size of the core dumps using ulimit. I don't want to limit the size of the dumps, just the number of them. I also know I could modify the apport script to get any functionality I desire, but I would like to avoid this if there is a less intrusive solution.
MOTIVATION I am working on a system that is sensitive to excessive disk usage. If a given application cores, I want to keep the core file so that I can debug the problem. If it cores again, which is highly likely since several applications are restarted by a watcher if they die, I don't want to keep the core file because it is unlikely to contain new information and it will just take up disk space.

A process can dump core only once; then it is killed. I presume you meant programs, as in the rest of the question.
There is nothing of the sort in stock kernels, but things like grsecurity at least used to offer a relevant feature to hamper brute-forcing of ASLR.
What do you need this for?

Limiting the memory usage of a program in Linux

I'm new to Linux and the terminal (or whatever kind of command prompt it uses), and I want to control the amount of RAM a process can use. I have already looked for hours to find an easy-to-use guide. I have a few requirements for limiting it:
Multiple instances of the program will be running, but I only want to limit some of the instances.
I do not want the process to crash once it exceeds the limit. I want it to use HDD page swap.
The program will run under WINE, and is a .exe.
So can somebody please help with the command to limit the RAM usage on a process in Linux?

The fact that you’re using Wine makes no difference in this particular context, which leaves requirements 1 and 2. Requirement 2 –
I do not want the process to crash once it exceeds the limit. I want it to use HDD page swap.
– is known as limiting the resident set size or rss of the process, and it’s actually rather nontrivial to do on Linux, as is demonstrated by a question asked in 2010. You’ll need to set up Linux control groups (cgroups). Fortunately, Justin L.’s answer gives a brief rundown on how to do so. Note that
instead of jlebar, you should use your own Unix user name, and
instead of your/program, you should use wine /path/to/Windows/program.exe.
Using cgroups will also satisfy your other requirements – you can start as many instances of the program as you wish, but only those which you start with cgexec -g memory:limited will be limited.
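
As a small illustration of that last point, the launcher below starts some instances through cgexec and others without it. It assumes a memory cgroup named limited has already been created as described in the linked answer; the helper names are hypothetical.

```python
import subprocess


def cgexec_argv(program_argv, group="limited", controller="memory"):
    """Build the argument list that runs a program inside a cgroup."""
    return ["cgexec", "-g", f"{controller}:{group}"] + list(program_argv)


def run_limited(program_argv, group="limited"):
    """Start one instance inside the memory-limited cgroup."""
    return subprocess.Popen(cgexec_argv(program_argv, group))


def run_unlimited(program_argv):
    """Start one instance with no cgroup limit applied."""
    return subprocess.Popen(list(program_argv))
```

Calling run_limited(["wine", "/path/to/Windows/program.exe"]) gives a capped instance, while run_unlimited with the same arguments gives an uncapped one.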

Limiting the File System Usage Programmatically in Linux

I was assigned to write a system call for the Linux kernel which, oddly, determines (and reduces) users' maximum transfer amount per minute (for file operations). This system call will be called lim_fs_usage and will take a parameter for the maximum number of bytes all users can access in a minute. In short, I am going to limit the bandwidth of all filesystem operations in Linux. The project also asks for choosing an appropriate method for distributing this restricted resource (file access) among the users, but I think this won't be a big problem.
I did a long search but could not find a method for managing file system access programmatically. I thought of mapping the hard drive to memory with mmap() and managing memory operations, but this turned out to be useless. I also tried to find an API for the virtual file system in order to monitor and limit it, but I could not find one. Any ideas, please? Any help is greatly appreciated. Thank you in advance.

I wonder if you could do this as an IO scheduler implementation.
The main difficulty of doing IO bandwidth limitation under Linux is, by the time it reaches anywhere near the device, the kernel has probably long since forgotten who caused it.
Likewise, you can get on some very tricky ground in determining who is responsible for a given piece of IO:
If a binary is demand-loaded, who owns the IO doing that?
A mapped section of memory (demand-loaded executable or otherwise) might be kicked out of memory because someone else used too much ram, thus causing the kernel to choose to evict those pages, which places an unfair burden on the quota of the other user to then page it back in
IO operations can be combined, and might come from different users
A write operation might cause an IO sooner or later depending on how the kernel schedules it; a later schedule may mean that fewer IOs need to be done in the long run, as another write gets done to the same block in the interim; writing to an already dirty block in cache does not make it any dirtier.
If you understand all these and more caveats, and still want to, I imagine doing it as an IO scheduler is the way to go.
IO schedulers are pluggable under Linux (2.6) and can be changed dynamically - the kernel waits for all IO on the device (IO scheduler is switchable per block device) to end and then switches to the new one.

Since it's urgent I'll give you an idea off the top of my head, without doing any research on the feasibility: what about inserting a hook to monitor the system calls that deal with file system access?
You might end up writing specialised kernel modules to handle the various filesystems (ext3, ext4, etc) but as a proof-of-concept you can start with one. Do not forget that root has reserved blocks in memory, process space and disk for his own operations.
Managing memory operations does not sound related to what you're trying to do (but perhaps I am mistaken here).

After a long period of thinking and searching, I decided to use the "hooking" method proposed. I am thinking of creating a new system call which initializes and manages a global variable like hdd_bandwith_limit. This variable will be used in modified implementations of the read() and write() system calls (instead of the "count" variable). Then I will decide on the distribution of this resource, which is the real issue. Probably I will find out how many users are using the system at a given moment and divide the resource equally among them, a Round-Robin-like distribution. But I am still open to suggestions on this distribution issue. Should it be SJF, FCFS or Round-Robin? Synchronization is another issue. How can I know whether a user's job is short or long? Or whether he is done with the operation or not?
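
The equal-share idea can be prototyped in user space before touching the kernel. The sketch below is only a simulation of the accounting: the class name, the uid keys and the injectable clock are all made up, and a real implementation would live in the read()/write() paths and need locking.

```python
import time


class FsUsageLimiter:
    """Per-minute byte budget shared equally among active users."""

    def __init__(self, bytes_per_minute, clock=time.monotonic):
        self.limit = bytes_per_minute
        self.clock = clock
        self.window_start = clock()
        self.used = {}  # user id -> bytes granted in the current window

    def _roll_window(self):
        # Start a fresh accounting window once a minute has passed.
        if self.clock() - self.window_start >= 60:
            self.window_start = self.clock()
            self.used.clear()

    def request(self, uid, count):
        """Return how many of `count` bytes the user may transfer now."""
        self._roll_window()
        self.used.setdefault(uid, 0)  # register the user as active
        share = self.limit // len(self.used)
        remaining = self.limit - sum(self.used.values())
        # Cap by the request, the user's equal share, and the global budget.
        granted = max(0, min(count, share - self.used[uid], remaining))
        self.used[uid] += granted
        return granted
```

One weakness shows up immediately: a user who arrives late in the window may find the budget already consumed by earlier users, which is part of why the distribution policy is the hard part.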

How to make Linux GUI "usable" when lots of disk activity is happening

If I start copying a huge file tree from one location to another, or if some other process starts doing lots of disk activity, the foreground app (GUI) slows way down. For example, take a 2 GB file tree with 100k files in it. Open a console and do cp -r bigtree bigtree2. Then go to Firefox and start browsing. Firefox is almost unusable. Even if I set Firefox's nice level to really high priority (-20), it's still super slow, with huge delays.
I remember some years ago when I worked on a Solaris box, the system behaved much better in similar circumstances.
My HD is using DMA, not PIO. It's SATA. Not mounted with the atime flag.

Linux has long had a problem with programs that hog all the system's "dirty" cache memory. What is happening is that the copy process is filling the write cache with the file data it is copying and it is doing it very quickly. So when Firefox comes along and needs to write it must first wait for dirty buffer space or an available disk queue write slot. While waiting it is competing with the copy process and the kernel's pdflush thread, which moves data from dirty buffers to the disk write queue.
Firefox has yet another problem in this scenario. It uses SQLite to store its bookmarks, history and other things. SQLite is an ACID-compliant database, and it uses a transaction system whose writes are flushed to disk. So not only does it have to wait for buffer space, it must also wait for the disk queue, which is full of copied file data, to clear out before it can acknowledge a successful write.
There has been a lot of tweaking done to the Linux disk queuing and buffering system. There are changes in almost every kernel release. Try one of the newer releases. You can also try tweaking the sysctl values. I sort of like these:
vm.dirty_writeback_centisecs = 100
vm.dirty_expire_centisecs = 9000
vm.dirty_background_ratio = 4
vm.dirty_ratio = 80
You can also try tweaking the number of slots in the disk queue. This value is in /sys/block/sda/queue/nr_requests. You need to substitute sda with whatever your drive really is. More slots means more chances to merge IO requests and the CFQ IO scheduler can do a better job with priorities. Fewer slots usually means a shorter wait to get written to disk for synchronous IO like SQLite's transactions. Fewer slots also means a shorter wait to get read IO into the disk queue if a write-heavy process completely stuffs the queue with write IO.
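
Before changing anything it helps to see what the current values are. Here is a small sketch that reads them; the function names are made up, and it relies on dotted sysctl names mapping onto files under /proc/sys.

```python
import os


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name such as vm.dirty_ratio to its /proc/sys file."""
    return os.path.join(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl as a stripped string."""
    with open(sysctl_path(name, root)) as f:
        return f.read().strip()


def read_nr_requests(device, root="/sys/block"):
    """Return the number of request slots in the device's disk queue."""
    with open(os.path.join(root, device, "queue", "nr_requests")) as f:
        return int(f.read())
```

For example, read_sysctl("vm.dirty_ratio") and read_nr_requests("sda") show the starting point, again substituting your real device for sda.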

Try ionice-ing or nice-ing the copy process. The issue is that IO gets the same priority as the GUI, which on a desktop hurts perceived responsiveness.
There's an Ubuntu brainstorm about this currently.
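
A sketch of the ionice/nice suggestion, wrapping the copy so it runs in the idle I/O class (ionice -c 3) at the lowest CPU priority. The helper names are hypothetical, and the I/O class only has an effect with an IO scheduler that honours priorities, such as CFQ.

```python
import subprocess


def low_priority_argv(command_argv, niceness=19):
    """Wrap a command so it runs in the idle I/O class at low CPU priority."""
    return ["ionice", "-c", "3", "nice", "-n", str(niceness)] + list(command_argv)


def copy_tree_politely(src, dst):
    """Run `cp -r src dst` without competing with interactive programs."""
    return subprocess.run(low_priority_argv(["cp", "-r", src, dst]), check=True)
```

With the example from the question, that would be copy_tree_politely("bigtree", "bigtree2").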

You're not the first to notice this problem. Former kernel developer Con Kolivas (http://en.wikipedia.org/wiki/Con_Kolivas) found that a lot of companies are paying to improve Linux server performance at the expense of desktop performance. Con had an impressive set of patches for making the desktop more responsive. Unfortunately there was some sort of code war and eventually Con dropped out.
I would love to know how to petition the Linux kernel developers for better desktop performance. In the meantime, if you are willing to run kernel 2.6.22, you can run with the -ck patch set.

Make sure that DMA is enabled on all your drives that support it. Depending on your distribution this may not be the default. Read man hdparm, and look into your system's init mechanism.

Using "top" in Linux as semi-permanent instrumentation

I'm trying to find the best way to use 'top' as semi-permanent instrumentation in the development of a box running embedded Linux. (The instrumentation will be removed from the final-test and production releases.)
My first pass is to simply add this to init.d:
top -b -d 15 >/tmp/toploop.out &
This runs top in "batch" mode every 15 seconds. Let's assume that /tmp has plenty of space…
Questions:
Is 15 seconds a good value to choose for general-purpose monitoring?
Other than disk space, how seriously is this perturbing the state of the system?
What other (perhaps better) tools could be used like this?

Look at collectd. It's a very light weight system monitoring framework coded for performance.

We use sysstat to monitor things like this.

You might find that vmstat and iostat with a delay and no repeat counter is a better option.

I suspect 15 seconds would be more than adequate unless you actually want to watch what's happening in real time, but that doesn't appear to be the case here.
As far as load goes, on an idling PIII 900 MHz with 768 MB of RAM running Ubuntu (not sure which version, but not more than a year old) I have top updating every 0.5 seconds and it's at about 2% CPU utilization. At 15-second updates, I'm seeing 0.1% CPU utilization.
Depending upon what exactly you want, you could use the output of uptime, free, and ps to get most, if not all, of top's information.

If you are looking for overall load, uptime is probably sufficient. However, if you want specific information about processes, you are adventurous, and you have the /proc filesystem enabled, you may want to write your own tools. The primary benefit in this environment is that you can focus on exactly what you want and minimize the load introduced to the system.
The proc file system gives your application read access to the kernel memory that keeps track of many of the interesting variables. Reading from /proc is one of the lightest ways to get this information. Additionally, you may be able to get more information than provided by top. I've done this in the past to get amount of time spent in user and system by this process. Additionally, you can use this to get information about the number of file descriptors open by the process. You might also use this to get detailed information about how the network system is working.
Much of this information is pre-processed by other applications, which can be used if they give you the information you need. However, it is rather straightforward to read the raw information. Do a man proc for more information.
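
As a rough example of reading the raw information, the sketch below pulls user and system CPU time from /proc/<pid>/stat and counts open file descriptors from /proc/<pid>/fd. The function names are made up; the field positions follow the proc man page (utime and stime are fields 14 and 15).

```python
import os


def parse_cpu_ticks(stat_text):
    """Return (utime, stime) in clock ticks from a /proc/<pid>/stat line."""
    # The command name (field 2) is parenthesised and may contain spaces,
    # so split only the text after the last closing parenthesis.
    rest = stat_text.rsplit(")", 1)[1].split()
    # `rest` starts at field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]), int(rest[12])


def cpu_seconds(pid="self"):
    """User and system CPU time consumed by a process, in seconds."""
    with open(f"/proc/{pid}/stat") as f:
        utime, stime = parse_cpu_ticks(f.read())
    ticks = os.sysconf("SC_CLK_TCK")
    return utime / ticks, stime / ticks


def open_fd_count(pid="self"):
    """Number of file descriptors the process currently has open."""
    return len(os.listdir(f"/proc/{pid}/fd"))
```

Sampling cpu_seconds(pid) every 15 seconds and logging the difference gives per-process CPU usage without running top at all.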

Pity you haven't said what you are monitoring for.
You should decide whether 15 seconds is OK or not. Feel free to drop it way lower if you wish (and have a fast HDD).
No worries, unless you are running a soft real-time system.
Have a look at the tools suggested in the other answers. I'll add another suggestion: "iotop", for answering "who is thrashing the HDD" questions.

At work for system monitoring during stress tests we use a tool called nmon.
What I love about nmon is it has the ability to export to XLS and generate beautiful graphs for you.
It generates statistics for:
Memory Usage
CPU Usage
Network Usage
Disk I/O
Good luck :)
