Maximum number of inodes in a directory? [closed] - linux

Is there a maximum number of inodes in a single directory?
I have a directory with over 2 million files and can't get the ls command to work against it. So now I'm wondering if I've exceeded a limit on inodes in Linux. Is there a limit below the 2^64 numerical limit?

df -i should tell you the number of inodes used and free on the file system.
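For example, to check only the filesystem that holds the problem directory (the path here is just a placeholder):
df -i /path/to/that/directory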

Try ls -U or ls -f.
ls, by default, sorts the files alphabetically. If you have 2 million files, that sort can take a long time. With ls -U (or perhaps ls -f), the file names will be printed immediately, without sorting.
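A rough way to see how much of the elapsed time is the sort itself (a sketch; expect the first run to be dominated by disk I/O rather than sorting):
time ls > /dev/null
time ls -U > /dev/null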

No. Inode limits are per-filesystem, and decided at filesystem creation time. You could be hitting another limit, or maybe 'ls' just doesn't perform that well.
Try this:
tune2fs -l /dev/DEVICE | grep -i inode
It should tell you all sorts of inode related info.
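On ext2/3/4 the matching lines typically look something like this (the values here are purely illustrative):
Inode count:              6553600
Free inodes:              6542301
Inodes per group:         8192
Inode size:               256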

What you hit is an internal limit of ls. Here is an article which explains it quite well:
http://www.olark.com/spw/2011/08/you-can-list-a-directory-with-8-million-files-but-not-with-ls/

Maximum directory size is filesystem-dependent, and thus the exact limit varies. However, having very large directories is a bad practice.
You should consider making your directories smaller by sorting files into subdirectories. One common scheme is to use the first two characters for a first-level subdirectory, as follows:
${topdir}/aa/aardvark
${topdir}/ai/airplane
This works particularly well if you use UUIDs, GUIDs, or content-hash values for naming.
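A minimal shell sketch of that reorganization, assuming the flat directory is in $topdir and every file name is at least two characters long:
cd "$topdir" || exit 1
for f in *; do
    [ -f "$f" ] || continue      # skip anything that is not a regular file
    prefix=${f:0:2}              # first two characters of the name
    mkdir -p "$prefix"
    mv -- "$f" "$prefix/"
done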

As noted by Rob Adams, ls is sorting the files before displaying them. Note that if you are using NFS, the NFS server will be sorting the directory before sending it, and 2 million entries may well take longer than the NFS timeout. That makes the directory unlistable via NFS, even with the -f flag.
This may be true for other network file systems as well.
While there's no enforced limit on the number of entries in a directory, it's good practice to plan for some limit on the number of entries you anticipate.

Can you get a real count of the number of files? Does it fall very near a 2^n boundary? Could you simply be running out of RAM to hold all the file names?
I know that on Windows, at least, filesystem performance would drop dramatically as the number of files in the folder went up, but I thought that Linux didn't suffer from this issue, at least from the command line. God help you if you try to get something like Nautilus to open a folder with that many files.
I'm also wondering where these files come from. Are you able to calculate the file names programmatically? If that's the case, you might be able to write a small program to sort them into a number of subfolders. Often listing the name of a specific file will grant you access where trying to look up the name will fail. For example, I have a folder on Windows with about 85,000 files where this works.
If this technique is successful, you might try finding a way to make this sort permanent, even if it's just running this small program as a cron job. It'll work especially well if you can sort the files by date somewhere.

Unless you are getting an error message, ls is working but very slowly. You can try looking at just the first ten files like this:
ls -f | head -10
If you're going to need to look at the file details for a while, you can put them in a file first. You probably want to send the output to a different directory than the one you are listing at the moment!
ls > ~/lots-of-files.txt
If you want to do something to the files, you can use xargs. If you decide to write a script of some kind to do the work, make sure that your script will process the list of files as a stream rather than all at once. Here's an example of moving all the files.
ls | xargs -I thefilename mv thefilename ~/some/other/directory
You could combine that with head to move a smaller number of the files.
ls | head -10000 | xargs -I x mv x /first/ten/thousand/files/go/here
You can probably combine ls | head into a shell script that will split up the files into a bunch of directories with a manageable number of files in each, along the lines of the sketch below.
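Here is a rough sketch of that idea, assuming batches of 10,000 files under ~/split (both the location and the batch size are arbitrary):
i=0
ls -f | while read -r f; do
    [ -f "$f" ] || continue          # skip . and .. and any subdirectories
    dir=~/split/$(( i / 10000 ))     # 0, 1, 2, ... one directory per batch
    mkdir -p "$dir"
    mv -- "$f" "$dir/"
    i=$(( i + 1 ))
done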

For NetBackup, the binaries that analyze the directories on clients perform a type of listing that times out because of the enormous quantity of files in each folder (about one million per folder, in an SAP work directory).
My solution was (as Charles Duffy wrote in this thread) to reorganize the folders into subfolders with fewer files.

Another option is find:
find . -name '*' -exec somecommand {} \;
(Quote the pattern so the shell does not expand * itself.) {} is replaced with the path of each file that find encounters.
The advantage/disadvantage is that the files are processed one after another.
find . -name '*' > ls.txt
would write all filenames to ls.txt.
find . -name '*' -exec ls -l {} \; > ls.txt
would write the full information from ls for each file to ls.txt.
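If starting one process per file is too slow, note that find can also batch many paths into each invocation by terminating the -exec clause with + instead of \;, for example:
find . -type f -exec ls -l {} + > ls.txt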

Related

Bash script, combining directory size for all users in the system

Okay, so basically right now I have a directory which is meant to collect files.
However, I need to find the total usage of the directory for all users in the system and it should be in bytes.
I have been stuck on this for a long time and would appreciate it if someone could help me a bit.
If you are looking for the users' home folders, then try this:
du -sb /home
(-s prints only the grand total, -b reports the apparent size in bytes.)
Depending on what exactly you mean by "users in the system", you may need to add the size of the /root folder as well, though that is usually small.
To discover such utilities, take a look at the apropos command on unixoid systems; here, apropos disk space reveals, among many other things:
du (1) - estimate file space usage
As a next step, the utility's manual page explains its usage and options:
man du
If the /home folder is not what you are interested in, you can implement a simple script that iterates over all home folders found in /etc/passwd and sums those values up, as sketched below.
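A minimal sketch of such a script, assuming local accounts in /etc/passwd and home directories readable by whoever runs it:
total=0
while IFS=: read -r user _ uid _ _ home _; do
    [ -d "$home" ] || continue                     # skip accounts without a home directory
    # optionally skip system accounts, e.g.: [ "$uid" -ge 1000 ] || continue
    size=$(du -sb "$home" | cut -f1)               # apparent size of the home folder in bytes
    total=$(( total + size ))
done < /etc/passwd
echo "$total bytes"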

What is the best way to speed up a find command on a huge directory tree using GNU parallel?

I've been using GNU parallel for a while, mostly to grep large files or run the same command for various arguments when each command/arg instance is slow and needs to be spread out across cores/hosts.
One thing which would be great to do across multiple cores and hosts as well would be to find a file on a large directory subtree. For example, something like this:
find /some/path -name 'regex'
will take a very long time if /some/path contains many files and other directories with many files. I'm not sure if this is as easy to speed up. For example:
ls -R -1 /some/path | parallel --profile manyhosts --pipe egrep regex
Something like that comes to mind, but ls would be very slow to come up with the files to search. What's a good way, then, to speed up such a find?
If you have N hundred immediate subdirs, you can use:
parallel --gnu -n 10 find {} -name 'regex' ::: *
to run find in parallel on each of them, ten at a time.
Note however that listing a directory recursively like this is an I/O-bound task, and the speedup you can get will depend on the backing medium. On a hard disk drive it'll probably just be slower (if you test this, beware of disk caching).

Linux - Recursively list all files from folder with performance

I want to list the absolute filenames of almost 80 thousand files, recursively from folders, into a text file.
Does anybody know which command will give me the best performance?
I know that I can use commands like ls, tree, and ll, but which one will perform best?
Also note that these files are on a NAS, and the NAS is mapped using a symlink.
Simply find -type f -print0 for zero-delimited output. I don't think you can get significantly faster than that.
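For example, to get absolute paths into a text file even though the NAS is reached through a symlink (the path below is a placeholder):
find "$(readlink -f /path/to/nas/folder)" -type f > filelist.txt
If file names may contain newlines, stick with -print0 and a tool that understands NUL-delimited input.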

What happens if there are too many files under a single directory in Linux?

If there are, say, 1,000,000 individual files (mostly 100k in size) in a single directory, flatly (no other directories or files inside them), are there going to be any compromises in efficiency or disadvantages in any other possible ways?
ARG_MAX is going to take issue with that... for instance, rm -rf * (while in the directory) is going to fail with "Argument list too long". Utilities that want to do some kind of globbing (or a shell) will have some functionality break.
If that directory is available to the public (let's say via FTP or a web server) you may encounter additional problems.
The effect on any given file system depends entirely on that file system. How frequently are these files accessed, what is the file system? Remember, Linux (by default) prefers keeping recently accessed files in memory while putting processes into swap, depending on your settings. Is this directory served via http? Is Google going to see and crawl it? If so, you might need to adjust VFS cache pressure and swappiness.
Edit:
ARG_MAX is a system-wide limit on how much argument data can be passed to a program when it is executed. So, let's take rm and the example rm -rf *: the shell expands '*' into a list of file names, which in turn become the arguments to rm.
The same thing is going to happen with ls and several other tools. For instance, ls foo* might break if too many files start with 'foo'.
I'd advise (no matter what filesystem is in use) breaking it up into smaller directory chunks, just for that reason alone.
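To see the actual limit on a given system:
getconf ARG_MAX
And when you do need to operate on everything in an oversized directory, here is a sketch of two common workarounds (-maxdepth and -delete are GNU find extensions):
find . -maxdepth 1 -type f -delete
find . -maxdepth 1 -type f -print0 | xargs -0 rm --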
My experience with large directories on ext3 and dir_index enabled:
If you know the name of the file you want to access, there is almost no penalty
If you want to do operations that need to read in the whole directory entry (like a simple ls on that directory) it will take several minutes for the first time. Then the directory will stay in the kernel cache and there will be no penalty anymore
If the number of files gets too high, you run into ARG_MAX et al problems. That basically means that wildcarding (*) does not always work as expected anymore. This is only if you really want to perform an operation on all the files at once
Without dir_index however, you are really screwed :-D
Most distros use Ext3 by default, which can use b-tree indexing for large directories.
Some distros have the dir_index feature enabled by default; in others you'd have to enable it yourself. If you enable it, there's no significant slowdown even for millions of files.
To see if dir_index feature is activated do (as root):
tune2fs -l /dev/sdaX | grep features
To activate dir_index feature (as root):
tune2fs -O dir_index /dev/sdaX
e2fsck -D /dev/sdaX
Replace /dev/sdaX with the partition for which you want to activate it. Note that e2fsck -D should be run on an unmounted (or read-only-mounted) filesystem.
When you accidentally execute "ls" in that directory, or use tab completion, or want to execute "rm *", you'll be in big trouble. In addition, there may be performance issues depending on your file system.
It's considered good practice to group your files into directories which are named by the first 2 or 3 characters of the filenames, e.g.
aaa/
    aaavnj78t93ufjw4390
    aaavoj78trewrwrwrwenjk983
    aaaz84390842092njk423
    ...
abc/
    abckhr89032423
    abcnjjkth29085242nw
    ...
...
The obvious answer is that the folder will be extremely difficult for humans to use long before any technical limit (the time taken to read the output from ls, for one; there are dozens of other reasons). Is there a good reason why you can't split it into subfolders?
Not every filesystem supports that many files.
On some of them (ext2, ext3, ext4) it's very easy to hit the inode limit.
I've got a host with 10M files in a directory. (don't ask)
The filesystem is ext4.
It takes about 5 minutes to run ls.
One limitation I've found is that my shell script to read the files (because AWS snapshot restore is a lie and files aren't present till first read) wasn't able to handle the argument list, so I needed to do two passes. First, construct a file list with find (-wholename in case you want to do partial matches):
find /path/to_dir/ -wholename '*.ldb' | tee filenames.txt
Then read from the file containing the filenames and process each file, with limited parallelism:
while read -r line; do
    if test "$(jobs | wc -l)" -ge 10; then
        wait -n
    fi
    {
        # do something with 10x fanout;
        # in this case, read the file to force the snapshot restore
        cat "$line" > /dev/null
    } &
done < filenames.txt
Posting here in case anyone finds the specific work-around useful when working with too many files.

Can /tmp in Linux ever fill up?

I'm putting some files in /tmp on a web server that are used by a web application for a limited amount of time. If the files get left in the server's /tmp after the user quits using the application, and this happens repeatedly, should I be concerned about the directory filling up? I read online that rebooting cleans out the /tmp directory, but this box doesn't get rebooted very often.
Tom
Yes, it will fill up. Consider implementing a cron job that will delete old files after a while.
Something like this should do the trick:
/usr/bin/find /tmp/mydata -type f -atime +1 -exec rm -f {} \;
This will delete files whose access time (-atime) is more than a day old; use -mtime instead if you want to go by modification time.
Or as a crontab entry:
# run five minutes after midnight, every day
5 0 * * * /usr/bin/find /tmp/mydata -type f -atime +1 -exec rm -f {} \;
where /tmp/mydata is a subdirectory where your application stores its temporary files. (Simply deleting old files under /tmp would be a very bad idea, as someone else pointed out here.)
Look at the crontab and find man pages for the details. Don't go running scripts that delete files on your filesystem without understanding all the details - that's how bad things happen to good servers. :)
Of course, if you can just modify your application to delete temporary files when it's done with them, that would be a far better solution, generally.
You can't just blindly delete everything that hasn't been modified for a certain amount of time. A lot of programs store sockets in there, which never get modified but are still an integral part of the program working. Take for instance mysql from one of my servers:
srwxrwxrwx 1 mysql mysql 0 Sep 11 04:01 mysql.sock=
That's a valid, working "file" in /tmp. It just looks old because mysql hasn't been restarted in a while. Either limit your find with '-type f' or '-atime', or use one of the distro-provided tools others have mentioned.
The only thing you can write to without worrying it will fill up is /dev/null. Everything else will eventually run out of space if you keep dumping things in it.
One simple approach would be to have a cron job clean up all your /tmp files that are older than, say, a few days.
Yep. It will live on one of your disks/partitions and can fill up.
It gets deleted on a reboot.
When the user quits the application you should clean the files up after them.
In which language is your web application written? A lot of languages provide temporary-file facilities:
C
python
php
...
Check whether your language offers such a feature; a shell equivalent is sketched below.
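For comparison, a minimal shell sketch of the same idea using mktemp (the file name pattern is arbitrary):
tmpfile=$(mktemp /tmp/myapp.XXXXXX)     # create a unique temporary file safely
trap 'rm -f "$tmpfile"' EXIT            # clean it up when the script exits
echo "scratch data" > "$tmpfile"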
Just a warning: not all Linux installations clean the /tmp directory after each reboot.
Some linux distros have a package that will clean up old files in /tmp for you. It isn't hard to implement your own, as mentioned above. One thing to look out for are long running processes, especially "zombies", which are ones that have died but which haven't finished cleaning up after themselves. If a process has a file open, just deleting it from /tmp won't actually reclaim its space - you have to kill the process or somehow coerce it to close the file. Many programs that write log or temporary files are designed to catch a signal (often SIGUSR1) and close and re-open any log or temporary files for that reason.
Many Linux distributions include something named 'tmpwatch', or similar which runs via cron and deletes things on a pre-defined gradient. Some are smart enough to go by the owner of the file .. stuff that is owned by daemon users gets cleaned out faster than stuff owned by regular users. Check on the mailing lists for your distro of choice to find out.
Still, you should have SNMP or some other kind of monitoring watching how much room is available; if /tmp fills up, services like Apache aren't going to be happy. For instance, eAccelerator for PHP will need plenty of room, some mail scanners don't clean up properly, etc.
