Can /tmp in Linux ever fill up? - linux

I'm putting some files in /tmp on a web server that are being used by a web application for a limited amount of time. If the files get left in the server's /tmp after the user quits using the application, and this happens repeatedly, should I be concerned about the directory filling up? I read online that rebooting cleans out the /tmp directory, but this box doesn't get rebooted very often.
Tom

Yes, it will fill up. Consider implementing a cron job that will delete old files after a while.
Something like this should do the trick:
/usr/bin/find /tmp/mydata -type f -atime +1 -exec rm -f {} \;
This will delete files whose last access time is more than a day old (use -mtime instead if you care about modification time).
Or as a crontab entry:
# run five minutes after midnight, every day
5 0 * * * /usr/bin/find /tmp/mydata -type f -atime +1 -exec rm -f {} \;
where /tmp/mydata is a subdirectory where your application stores its temporary files. (Simply deleting old files under /tmp would be a very bad idea, as someone else pointed out here.)
Look at the crontab and find man pages for the details. Don't go running scripts that delete files on your filesystem without understanding all the details - that's how bad things happen to good servers. :)
Of course, if you can just modify your application to delete temporary files when it's done with them, that would be a far better solution, generally.

You can't just blindly delete everything that hasn't been modified for a certain amount of time. A lot of programs store sockets in there, which never get modified but are still an integral part of the program working. Take for instance mysql from one of my servers:
srwxrwxrwx 1 mysql mysql 0 Sep 11 04:01 mysql.sock=
That's a valid, working "file" in /tmp. It just looks old because mysql hasn't been restarted in a while. Either limit your find with '-type f' or '-atime', or use one of the distro-provided tools others have mentioned.

The only thing you can write to without worrying it will fill up is /dev/null. Everything else will eventually run out of space if you keep dumping things in it.
One simple approach would be to have a cron job clean up all your /tmp files that are older than, say, a few days.

Yep, it will be linked to one of your disks/partitions and can fill up.
It typically gets cleaned out on a reboot.
When the user quits the application, you should clean the files up after them.

Which language is your web application written in? A lot of languages provide temp-file facilities:
C
python
php
...
Check whether your language has such a feature; a shell sketch of the same idea follows below.
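For instance, from a shell script you can get the same effect with mktemp and a trap. A rough sketch (the file name pattern is just an example, not anything your application uses):

#!/bin/sh
# create a unique temporary file under /tmp (mktemp prints the path it created)
tmpfile=$(mktemp /tmp/myapp.XXXXXX)

# remove it automatically when the script exits, even on error
trap 'rm -f "$tmpfile"' EXIT

# ... use "$tmpfile" here ...
echo "working data" > "$tmpfile"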

Just a warning: not all Linux installations clean the /tmp directory after each reboot.

Some Linux distros have a package that will clean up old files in /tmp for you. It isn't hard to implement your own, as mentioned above. One thing to look out for is long-running processes, especially "zombies", which are ones that have died but haven't finished cleaning up after themselves. If a process has a file open, just deleting it from /tmp won't actually reclaim its space - you have to kill the process or somehow coerce it to close the file. Many programs that write log or temporary files are designed to catch a signal (often SIGUSR1) and close and re-open any log or temporary files for that reason.
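If you suspect that's happening, lsof (where installed) can show files that have been deleted but are still held open, for example:

# list open files with a link count below 1, i.e. deleted but still held open
lsof +L1
# or see which processes have files open under /tmp
lsof +D /tmp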

Many Linux distributions include something named 'tmpwatch' or similar, which runs via cron and deletes things on a pre-defined gradient. Some are smart enough to go by the owner of the file: stuff that is owned by daemon users gets cleaned out faster than stuff owned by regular users. Check the mailing lists for your distro of choice to find out.
Still, you should have SNMP or some other kind of monitoring watching how much room is available; if it fills up, services like Apache aren't going to be happy. For instance, eAccelerator for PHP will need plenty of room, some mail scanners don't clean up properly, etc.
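If a full monitoring stack is overkill, even a small cron'd check along these lines can give you a heads-up. A sketch, where the 90% threshold, the mail command and the address are placeholders to adapt:

#!/bin/sh
# warn when the filesystem holding /tmp is more than 90% full
usage=$(df -P /tmp | awk 'NR==2 {sub("%", "", $5); print $5}')
if [ "$usage" -gt 90 ]; then
    echo "/tmp is ${usage}% full on $(hostname)" | mail -s "/tmp filling up" admin@example.com
fi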

Related

Mysterious find command hogging memory on Linux Mint

I'm running Linux Mint 17 and I notice that every so often my computer slows to a crawl. When I look at top I see "/usr/bin/find / -ignore_readdir_race (..." etc. sucking up most of my memory. It runs for a really long time (several hours) and my guess is that it's an automated indexing process for my hard drive.
I'm working on a project that requires me to have over 6 million audio files on a mounted SSD so another guess is that the filesystem manager is trying to index all these files for quick search. Is that the case? Is there any way to turn it off for the SSD?
The locate command reports data collected for its database by a regular cron task. You can exclude directories from the database, making the task run more quickly. According to updatedb.conf(5)
PRUNEPATHS
A whitespace-separated list of path names of directories which should not be scanned by updatedb(8). Each path name must be exactly in the form in which the directory would be reported by locate(1).
By default, no paths are skipped.
On my Debian machine for instance, /etc/updatedb.conf contains this line:
PRUNEPATHS="/tmp /var/spool /media"
You could modify your /etc/updatedb.conf to add the directories which you want to ignore. Only the top-level directory of a directory tree need be listed; subdirectories are ignored when the parent is ignored.
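For example, if the SSD were mounted at /mnt/ssd (a hypothetical path; use your actual mount point), the line could become:

PRUNEPATHS="/tmp /var/spool /media /mnt/ssd"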
Further reading:
Tip of the day: Speed up `locate`
How do I get mlocate to only index certain directories?
It's a daily cron job that updates databases used by the locate command. See updatedb(8) if you want to learn more. Having six million audio files will likely cause this process to eat up a lot of CPU as it's trying to index your local filesystems.
If you don't use locate, I'd recommend simply disabling updatedb, something like this:
sudo kill -9 <PID>
sudo chmod -x /etc/cron.daily/mlocate
sudo mv /var/lib/mlocate/mlocate.db /var/lib/mlocate/mlocate.db.bak
If all else fails just remove the package.

Linux ~/.bashrc export most recent directory

I have several environment variables in my ~/.bashrc that point to different directories. I am running a program that creates a new folder every time it runs and puts a time stamp in the directory name, for example baseline_2015_11_10_15_40_31-model-stride_1-type_1. Is there a way of making a variable that always points to the most recently created directory, so that I can do:
cd $CURRENT_DIR
Your mileage may vary a lot depending on what exactly you need to accomplish. However, in almost all cases I would advise against doing something as weird and unreliable as what's described below; revise your architecture so you don't have to hunt for directories.
Method 1
If your program creates a subdirectory inside the current directory, you always know that nothing else happens in that directory, and you want the subdirectory with the latest timestamp, then you can do something like:
your_complex_program_that_creates_dir
TARGET_DIR=$(ls -t1 --group-directories-first | head -n1)
cd "$TARGET_DIR"
Method 2
If a lot of stuff happens on the system, then you'll end up monitoring what your program does with the filesystem and reacting when it creates a directory. There are two ways to do that, strace and inotify, and both are relatively complex. Here's the way to do it with strace:
strace -o some_temp_file.strace your_complex_program_that_creates_dir
TARGET_DIR=$(sed -ne '/^mkdir(/ { s/^mkdir("\(.*\)", .*).*$/\1/; p }' some_temp_file.strace)
cd "$TARGET_DIR"
This snippet runs your_complex_program_that_creates_dir under the control of strace, which essentially logs every system call your program makes into a file. Afterwards, this file is analyzed to find a line like
mkdir("target_dir", 0777) = 0
and extract the value "target_dir" into a variable. Note that:
if your program creates more than one directory (even for temporary purposes, deleting them afterwards, or whatever), there's really no way to determine which of them to grab
running a program with strace is much slower than normal due to the huge overhead of logging all the syscalls
it's super non-portable: facilities like strace exist on most modern OSes, but implementations vary a lot
A solution with inotify works in the same way but uses a different mechanism: an OS hook reports every operation the process performs on the filesystem, and the script reacts to the relevant one (the created directory).
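For completeness, a rough sketch of the inotify variant using inotifywait from the inotify-tools package, with the same caveats (it assumes the program creates exactly one new entry in the current directory, and that entry is the directory you want):

# watch the current directory for the next "create" event (one-shot);
# note: in a real script you'd want to be sure the watch is in place
# before the program starts, otherwise there's a race
inotifywait -q -e create --format '%f' . > /tmp/created_name.$$ &
watch_pid=$!

your_complex_program_that_creates_dir

# pick up the name that inotifywait captured
wait "$watch_pid"
TARGET_DIR=$(cat "/tmp/created_name.$$")
rm -f "/tmp/created_name.$$"
cd "$TARGET_DIR"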
However, I repeat, I'd strongly suggest against using any of these solutions beyond research interest.

Where should a well-behaved daemon store auxiliary files?

I have a daemon that backs up some system files before it does anything else and restores them afterwards. What is the right place to put these backups? I'm thinking somewhere in /var or /var/opt, since I don't want to pollute /etc with a bunch of backup files that aren't really doing anything.
If it matters, I'm specifically looking at Ubuntu 10.04+.
If they don't need to be maintained after a reboot or between invocations of the program, why not use /tmp?
This directory contains mostly files that are required temporarily. Many programs use this to create lock files and for temporary storage of data.

detect if something is modified in directory, and if so, backup - otherwise do nothing

I have a "Data" directory, that I rsync to a remote NAS periodically via a shell script.
However, I'd like to make this more efficient. I'd like to detect if something has changed in "Data" before running rsync. This is so that I don't wake up the drives on the NAS unnecessarily.
I was thinking of modifying the shell script to get the latest modified time of the files in Data (by using a recursive find), and write that to a file every time Data is rsynced.
Before every sync, the shell script can compare the current timestamp of "Data" with the previous timestamp when "Data" was sync'd. If the current timestamp is newer, then rsync, otherwise do nothing.
My question is, is there a more efficient way to figure out if the "Data" directory is modified since the last rsync? Note that Data has many, many, layers of sub-directories.
If I understand correctly, you just want to see if any files have been modified so you can figure out whether to proceed to the rsync portion of your script?
It's a pretty simple task to figure out when the data was last synced, especially if you do this nightly. As soon as you find one file with mtime greater than the time of the last sync, you know you have to proceed to the full rsync.
find has this functionality built in:
# find all files modified in the last 24 hours
find . -mtime -1
Rsync already does this. There is no on-demand solution that doesn't require checking the mtime and ctime properties of the inodes.
However you could create a daemon that uses inotify to track changes as they occur, and fire rsync at intervals, or whenever you feel sufficient events have occurred to justify calling rsync.
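A minimal sketch of that idea, using inotifywait from inotify-tools (the paths and NAS destination are placeholders, and a real daemon would batch events rather than sync on every single one):

# re-sync whenever anything changes under Data
inotifywait -m -r -e modify,create,delete,move /path/to/Data |
while read -r _event; do
    rsync -a /path/to/Data/ nas:/backup/Data/
done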
I would use the find command, but do it this way: When the rsync runs, touch a file, like "rsyncranflag". Then you can run
find Data -newer rsyncranflag
That will say definitively whether any files were changed since the last rsync (subject to the accuracy of mtime).
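Put together, the flag-file approach could look something like this (a sketch; the paths, flag location and rsync options are placeholders):

#!/bin/sh
# sync only when something under Data changed since the last successful run
DATA=/path/to/Data
FLAG=/var/tmp/rsyncranflag

if [ ! -e "$FLAG" ] || [ -n "$(find "$DATA" -newer "$FLAG" -print -quit)" ]; then
    touch "$FLAG.new"                  # mark the start of this run
    if rsync -a "$DATA/" nas:/backup/Data/; then
        mv "$FLAG.new" "$FLAG"         # record the successful sync
    fi
fi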

What happens if there are too many files under a single directory in Linux?

If there are like 1,000,000 individual files (mostly 100k in size) in a single directory, flatly (no other directories and files in them), are there going to be any compromises in efficiency, or disadvantages in any other possible ways?
ARG_MAX is going to take issue with that... for instance, rm -rf * (while in the directory) is going to fail with "Argument list too long". Utilities that want to do some kind of globbing (or a shell) will have some of their functionality break.
If that directory is available to the public (let's say via FTP or a web server) you may encounter additional problems.
The effect on any given file system depends entirely on that file system. How frequently are these files accessed, and what is the file system? Remember, Linux (by default) prefers keeping recently accessed files in memory while putting processes into swap, depending on your settings. Is this directory served via HTTP? Is Google going to see and crawl it? If so, you might need to adjust VFS cache pressure and swappiness.
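If you do end up tuning those knobs, sysctl is the place to look; the values below are purely illustrative, not recommendations:

# inspect the current values
sysctl vm.vfs_cache_pressure vm.swappiness

# try different values at runtime (persist them in /etc/sysctl.conf if they help)
sudo sysctl -w vm.vfs_cache_pressure=50
sudo sysctl -w vm.swappiness=10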
Edit:
ARG_MAX is a system-wide limit on the total size of the arguments that can be passed to a program's entry point. So, let's take 'rm' and the example "rm -rf *": the shell is going to expand '*' into the list of matching file names, which in turn become the arguments to 'rm'.
The same thing is going to happen with ls and several other tools. For instance, ls foo* might break if too many files start with 'foo'.
I'd advise (no matter what filesystem is in use) breaking it up into smaller directory chunks, for that reason alone.
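You can check the limit and sidestep it by not letting the shell expand the glob at all, for example:

# see how much argument + environment space the kernel allows per exec
getconf ARG_MAX

# delete matching files without ever building a huge argument list
find . -maxdepth 1 -type f -name 'foo*' -delete

# or let xargs split the file names into batches that fit under the limit
find . -maxdepth 1 -type f -print0 | xargs -0 rm -f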
My experience with large directories on ext3 and dir_index enabled:
If you know the name of the file you want to access, there is almost no penalty
If you want to do operations that need to read the whole directory (like a simple ls on that directory), it will take several minutes the first time. Then the directory will stay in the kernel cache and there will be no penalty anymore
If the number of files gets too high, you run into ARG_MAX et al problems. That basically means that wildcarding (*) does not always work as expected anymore. This is only if you really want to perform an operation on all the files at once
Without dir_index however, you are really screwed :-D
Most distros use Ext3 by default, which can use b-tree indexing for large directories.
Some distros have the dir_index feature enabled by default; in others you'd have to enable it yourself. If you enable it, there's no noticeable slowdown even for millions of files.
To see if the dir_index feature is activated, run (as root):
tune2fs -l /dev/sdaX | grep features
To activate dir_index feature (as root):
tune2fs -O dir_index /dev/sdaX
e2fsck -D /dev/sdaX
Replace /dev/sdaX with the partition for which you want to activate it.
When you accidentally execute "ls" in that directory, or use tab completion, or want to execute "rm *", you'll be in big trouble. In addition, there may be performance issues depending on your file system.
It's considered good practice to group your files into directories named after the first 2 or 3 characters of the filenames (a migration sketch follows the example below), e.g.
aaa/
aaavnj78t93ufjw4390
aaavoj78trewrwrwrwenjk983
aaaz84390842092njk423
...
abc/
abckhr89032423
abcnjjkth29085242nw
...
...
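If you ever need to migrate an existing flat directory into such a layout, a small script along these lines could do it (a sketch; it assumes bash for the ${f:0:3} substring and file names at least three characters long):

#!/bin/bash
# run from inside the flat directory; moves each file into a subdirectory
# named after its first three characters (globbing inside the shell is not
# subject to ARG_MAX, since no external command sees the whole list)
for f in *; do
    [ -f "$f" ] || continue      # skip anything that isn't a regular file
    prefix=${f:0:3}
    mkdir -p "$prefix"
    mv -- "$f" "$prefix/"
done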
The obvious answer is that the folder will be extremely difficult for humans to use long before any technical limit (the time taken to read the output of ls, for one; there are dozens of other reasons). Is there a good reason why you can't split it into sub-folders?
Not every filesystem supports that many files.
On some of them (ext2, ext3, ext4) it's very easy to hit the inode limit.
I've got a host with 10M files in a directory. (don't ask)
The filesystem is ext4.
It takes about 5 minutes to run ls.
One limitation I've found is that my shell script to read the files (because AWS snapshot restore is a lie and files aren't present till first read) wasn't able to handle the argument list, so I needed to do two passes. First, construct a file list with find (using -wholename in case you want to do partial matches):
find /path/to_dir/ -wholename '*.ldb' | tee filenames.txt
Then read the filenames from that file and process each one (with limited parallelism):
while read -r line; do
    # keep at most 10 background jobs running at once
    if test "$(jobs | wc -l)" -ge 10; then
        wait -n
    fi
    {
        # do something with 10x fanout, e.g. read the file to trigger the restore
        cat "$line" > /dev/null
    } &
done < filenames.txt
Posting here in case anyone finds the specific work-around useful when working with too many files.
