Mysterious find command hogging memory on Linux Mint - linux

I'm running Linux Mint 17 and I notice that every so often my computer slows to a crawl. When I look at top I see "/usr/bin/find / -ignore_readdir_race (..." etc. sucking up most of my memory. It runs for a really long time (several hours) and my guess is that it's an automated indexing process for my hard drive.
I'm working on a project that requires me to have over 6 million audio files on a mounted SSD so another guess is that the filesystem manager is trying to index all these files for quick search. Is that the case? Is there any way to turn it off for the SSD?

The locate command reads its data from a database that is rebuilt by a regular cron task. You can exclude directories from the database, making the task run more quickly. According to updatedb.conf(5):
PRUNEPATHS
A whitespace-separated list of path names of directories which should not be scanned by updatedb(8). Each path name must be exactly in the form in which the directory would be reported by locate(1).
By default, no paths are skipped.
On my Debian machine for instance, /etc/updatedb.conf contains this line:
PRUNEPATHS="/tmp /var/spool /media"
You could modify your /etc/updatedb.conf to add the directories which you want to ignore. Only the top-level directory of a directory tree need be listed; subdirectories are ignored when the parent is ignored.
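For example, assuming the SSD is mounted at /mnt/audio-ssd (substitute your actual mount point), the edited line could look like this:
PRUNEPATHS="/tmp /var/spool /media /mnt/audio-ssd"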
Further reading:
Tip of the day: Speed up `locate`
How do I get mlocate to only index certain directories?

It's a daily cron job that updates databases used by the locate command. See updatedb(8) if you want to learn more. Having six million audio files will likely cause this process to eat up a lot of CPU as it's trying to index your local filesystems.
If you don't use locate, I'd recommend simply disabling updatedb, something like this:
sudo kill -9 <PID>
sudo chmod -x /etc/cron.daily/mlocate
sudo mv /var/lib/mlocate/mlocate.db /var/lib/mlocate/mlocate.db.bak
If all else fails just remove the package.
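On Mint/Ubuntu the package is typically named mlocate (an assumption; dpkg -l | grep locate will show what is actually installed), so removing it could look like:
sudo apt-get remove mlocate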

Related

Automatically creating symlinks for files

I am in a rather unique predicament.
Let's say that I am on a Linux-based computer. It could be anything, really. The important part is that I have two partitions on my device: one that is around 1 GB and another that is around 15 GB.
The 1 GB partition (mounted on /) is reserved for system use, and the rest (mounted on /home) is for the user (me) to use.
Suppose I am running low on free space in my system partition. However, I want to install some command line utilities (which, of course, install to the system).
In the meantime, I create a folder in /home called stash. More on this later.
So, I download a tool, for example, bash. Bash is a .deb which I end up extracting to /home/stash. Let's assume bash is too big for me to install it to the system. That's okay, I can just create a symlink at /bin/bash that redirects to /home/stash/bin/bash.
However, I'd like not only to symlink /bin/bash, but all of the other directories in the /home/stash folder. Is there a way that I could automate this symlink process?
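A rough sketch of one way this could be automated, assuming the tree under /home/stash mirrors the layout of / (the paths and loop are illustrative, not a tested answer):
# For each top-level directory in /home/stash, symlink its contents into
# the matching system directory; entries that already exist are left alone.
for dir in /home/stash/*/; do
  name=$(basename "$dir")
  mkdir -p "/$name"
  ln -s -t "/$name" "$dir"* 2>/dev/null
done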

Optimal number of files per directory vs number of directories for EXT4

I have a program that produces a large number of small files (say, 10,000 files). After they are created, another script accesses them and processes them one by one.
Questions:
Does it matter, in terms of performance, how the files are organized (all in one directory or in multiple directories)?
If so, then what is the optimal number of directories and files per directory?
I run Debian with the ext4 file system.
Related
Maximum number of files/folders on Linux?
https://serverfault.com/questions/104986/what-is-the-maximum-number-of-files-a-file-system-can-contain
10k files inside a single folder is not a problem on Ext4. It should have the dir_index option enabled by default, which indexes directory contents using a btree-like structure to prevent performance issues.
To sum up, unless you create millions of files or use ext2/ext3, you shouldn't have to worry about system or FS performance issues.
That being said, shell tools and commands don't like to be called with a lot of files as parameters (rm *, for example) and may return an error message saying something like 'argument list too long'. Look at this answer for what happens then.
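A common workaround (a sketch; the path and pattern are placeholders) is to let find hand the file names to the command in batches rather than expanding a shell glob:
# delete matching files in batches, sidestepping the argument-length limit
find /path/to/dir -maxdepth 1 -type f -name '*.tmp' -print0 | xargs -0 rm -f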

Artificially modify server load in Ubuntu

I am curious if it is possible to artificially modify the server load in Ubuntu or, more generally, Linux. I am working on an application that reacts to the server load, and in order to test it, it would be nice if I could change the server load easily.
I am currently running an over-active program that will literally generate load, but I'd prefer to not continue overheating my laptop (it's getting hot!).
One of the most important things to know about Linux (or Unix) systems is that everything is just a file. Since you are just reading from /proc/loadavg, the easiest way for you to accomplish what you are after is to simply make a text file that contains a line of text like the one you would see when running cat /proc/loadavg. Then have your program read from that file you created instead of /proc/loadavg, and it will be none the wiser. If you want to test under different "artificial" situations, just change the text in this file and save. When your testing is done, simply change your program back to reading from /proc/loadavg and you can be sure it will work as expected.
Note, you can make this text file anywhere you want...in your home directory, in the program directory, wherever. However, you shouldn't make it in /proc. That directory is reserved for system objects.
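For example (the numbers are invented; the format just mirrors the real /proc/loadavg: three load averages, running/total processes, and the last PID used):
echo "2.50 1.75 0.80 3/512 12345" > /tmp/loadavg
cat /tmp/loadavg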
You can use the stress command, see http://weather.ou.edu/~apw/projects/stress/
A tool to impose load on and stress test a computer system
sudo apt-get install stress
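A typical invocation (the worker counts and duration here are just an example) spins up CPU, I/O, and memory workers for a fixed time:
stress --cpu 2 --io 1 --vm 1 --vm-bytes 128M --timeout 60s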
To avoid warming up the CPU, you can install a virtual machine with small CPU capacity. VirtualBox and qemu-kvm are free.
Use chroot to run the various pieces of software you're testing with a specified directory as the root directory. Set up a manufactured/modified /proc/loadavg relative to that new root directory, too.
chroot will let you create a dummy file that appears to have /proc/loadavg as its path, so the software will observe your manufactured values even if you can't change your code to look for load data in a different location.
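A rough sketch of the idea (the path is illustrative, and the new root must also contain your program plus the libraries it needs, copied or bind-mounted in):
mkdir -p /srv/fakeroot/proc
echo "4.00 3.50 3.00 5/600 23456" > /srv/fakeroot/proc/loadavg
# ...populate /srv/fakeroot with the application and its dependencies...
chroot /srv/fakeroot /path/to/your/app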
Since you don't want to actually/literally stress the machine, something like stress is not what you are after.
As stated, /proc/loadavg would be the place to set system load averages (faux loads).
But if that's also not the meat of what you're after, I would absolutely suggest
getloadavg
watchdog
and possibly even Munin plugins
There are two methods.
Method 1: hack /proc/loadavg. The machine is not overstressed, and your program reads its load values from a file. To do: hack Linux to report a fake load value.
Method 2: modify your program. The machine is not overstressed, and your program reads its load values from a file. To do: change four characters in your program: replace /proc/loadavg with /tmp/loadavg.
You can decide now. Calculate the costs ;)

What happens if there are too many files under a single directory in Linux?

If there are like 1,000,000 individual files (mostly 100k in size) in a single directory, flatly (no other directories and files in them), are there going to be any compromises in efficiency or disadvantages in any other ways?
ARG_MAX is going to take issue with that... for instance, rm -rf * (while in the directory) is going to fail with "argument list too long". Utilities that want to do some kind of globbing (or a shell) will have some functionality break.
If that directory is available to the public (let's say via FTP or a web server) you may encounter additional problems.
The effect on any given file system depends entirely on that file system. How frequently are these files accessed, what is the file system? Remember, Linux (by default) prefers keeping recently accessed files in memory while putting processes into swap, depending on your settings. Is this directory served via http? Is Google going to see and crawl it? If so, you might need to adjust VFS cache pressure and swappiness.
Edit:
ARG_MAX is a system-wide limit on the size of the argument list that can be passed to a program's entry point. So, let's take 'rm' and the example "rm -rf *": the shell is going to expand '*' into a list of every file in the directory, which in turn becomes the argument list passed to 'rm'.
The same thing is going to happen with ls, and several other tools. For instance, ls foo* might break if too many files start with 'foo'.
I'd advise (no matter what fs is in use) to break it up into smaller directory chunks, just for that reason alone.
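You can check how big the ARG_MAX limit actually is on your system (the value varies with the kernel and configuration):
getconf ARG_MAX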
My experience with large directories on ext3 and dir_index enabled:
If you know the name of the file you want to access, there is almost no penalty
If you want to do operations that need to read in the whole directory entry (like a simple ls on that directory) it will take several minutes for the first time. Then the directory will stay in the kernel cache and there will be no penalty anymore
If the number of files gets too high, you run into ARG_MAX et al problems. That basically means that wildcarding (*) does not always work as expected anymore. This is only if you really want to perform an operation on all the files at once
Without dir_index however, you are really screwed :-D
Most distros use Ext3 by default, which can use b-tree indexing for large directories.
Some distros have this dir_index feature enabled by default; in others you'd have to enable it yourself. If you enable it, there's no slowdown even for millions of files.
To see if dir_index feature is activated do (as root):
tune2fs -l /dev/sdaX | grep features
To activate dir_index feature (as root):
tune2fs -O dir_index /dev/sdaX
e2fsck -D /dev/sdaX
Replace /dev/sdaX with the partition for which you want to activate it.
When you accidentally execute "ls" in that directory, or use tab completion, or want to execute "rm *", you'll be in big trouble. In addition, there may be performance issues depending on your file system.
It's considered good practice to group your files into directories which are named by the first 2 or 3 characters of the filenames, e.g.
aaa/
aaavnj78t93ufjw4390
aaavoj78trewrwrwrwenjk983
aaaz84390842092njk423
...
abc/
abckhr89032423
abcnjjkth29085242nw
...
...
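A sketch of a script that sorts a flat directory into such prefix buckets (both paths are placeholders):
# move each file into a subdirectory named after the first three characters of its name
for f in /path/to/flat_dir/*; do
  name=$(basename "$f")
  prefix=${name:0:3}
  mkdir -p "/path/to/sorted/$prefix"
  mv "$f" "/path/to/sorted/$prefix/"
done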
The obvious answer is that the folder will be extremely difficult for humans to use long before any technical limit (the time taken to read the output from ls, for one; there are dozens of other reasons). Is there a good reason why you can't split it into sub folders?
Not every filesystem supports that many files.
On some of them (ext2, ext3, ext4) it's very easy to hit the inode limit.
I've got a host with 10M files in a directory. (don't ask)
The filesystem is ext4.
It takes about 5 minutes to run ls on it.
One limitation I've found is that my shell script to read the files (because AWS snapshot restore is a lie and files aren't present till first read) wasn't able to handle the argument list, so I needed to do two passes. First, construct a file list with find (-wholename in case you want to do partial matches):
find /path/to_dir/ -wholename '*.ldb' | tee filenames.txt
Second, read from the file containing the filenames and process each file (with limited parallelism):
while read -r line; do
  if test "$(jobs | wc -l)" -ge 10; then
    wait -n
  fi
  {
    # do something with "$line" here; reading it is enough to force the restore
    cat "$line" > /dev/null
  } &
done < filenames.txt
Posting here in case anyone finds the specific work-around useful when working with too many files.

Can /tmp in Linux ever fill up?

I'm putting some files in /tmp on a web server that are being used by a web application for a limited amount of time. If the files get left in the server's /tmp after the user quits using the application and this happens repeatedly, should I be concerned about the directory filling up? I read online that rebooting cleans out the /tmp directory, but this box doesn't get rebooted very much.
Tom
Yes, it will fill up. Consider implementing a cron job that will delete old files after a while.
Something like this should do the trick:
/usr/bin/find /tmp/mydata -type f -atime +1 -exec rm -f {} \;
This will delete files that have an access time that's more than a day old.
Or as a crontab entry:
# run five minutes after midnight, every day
5 0 * * * /usr/bin/find /tmp/mydata -type f -atime +1 -exec rm -f {} \;
where /tmp/mydata is a subdirectory where your application stores its temporary files. (Simply deleting old files under /tmp would be a very bad idea, as someone else pointed out here.)
Look at the crontab and find man pages for the details. Don't go running scripts that delete files on your filesystem without understanding all the details - that's how bad things happen to good servers. :)
Of course, if you can just modify your application to delete temporary files when it's done with them, that would be a far better solution, generally.
You can't just blindly delete everything that hasn't been modified for a certain amount of time. A lot of programs store sockets in there, which never get modified but are still an integral part of the program working. Take for instance mysql from one of my servers:
srwxrwxrwx 1 mysql mysql 0 Sep 11 04:01 mysql.sock=
That's a valid, working "file" in /tmp. It just looks old because mysql hasn't been restarted in a while. Either limit your find with '-type f' or '-atime', or use one of the distro-provided tools others have mentioned.
The only thing you can write to without worrying it will fill up is /dev/null. Everything else will eventually run out of space if you keep dumping things in it.
One simple approach would be to have a cron job clean up all your /tmp files that are older than, say, a few days.
Yep, it will be linked to one of your disks/partitions and can fill up.
It gets deleted on a reboot.
When the user quits the application you should clean the files up after them.
In which language is your web application written? A lot of languages provide temp-file facilities:
C
python
php
...
Check whether your language offers such a feature.
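The same idea is available from the shell side too: mktemp gives you a safely created temporary file that the application can remove when it is finished (a minimal sketch):
tmpfile=$(mktemp)            # creates something like /tmp/tmp.XXXXXXXXXX
echo "scratch data" > "$tmpfile"
rm -f "$tmpfile"             # clean up when done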
Just a warning: not all Linux installations clean the /tmp directory after each reboot.
Some Linux distros have a package that will clean up old files in /tmp for you. It isn't hard to implement your own, as mentioned above. One thing to look out for is long-running processes, especially "zombies", which are ones that have died but haven't finished cleaning up after themselves. If a process has a file open, just deleting it from /tmp won't actually reclaim its space - you have to kill the process or somehow coerce it to close the file. Many programs that write log or temporary files are designed to catch a signal (often SIGUSR1) and close and re-open any log or temporary files for that reason.
Many Linux distributions include something named 'tmpwatch', or similar, which runs via cron and deletes things on a pre-defined schedule. Some are smart enough to go by the owner of the file: stuff that is owned by daemon users gets cleaned out faster than stuff owned by regular users. Check on the mailing lists for your distro of choice to find out.
Still, you should have SNMP or some other kind of monitor watching how much room is available; if it fills up, services like Apache aren't going to be happy. For instance, e-accelerator for PHP will need plenty of room, some mail scanners don't clean up properly, etc.