What is the best way to speed up a find command on a huge directory tree using GNU parallel? - linux

I've been using GNU parallel for a while, mostly to grep large files or run the same command for various arguments when each command/arg instance is slow and needs to be spread out across cores/hosts.
One thing which would be great to do across multiple cores and hosts as well would be to find a file on a large directory subtree. For example, something like this:
find /some/path -name 'regex'
will take a very long time if /some/path contains many files and other directories with many files. I'm not sure if this is as easy to speed up. For example:
ls -R -1 /some/path | parallel --profile manyhosts --pipe egrep regex
something like that comes to mind but ls would be very slow to come up with the files to search. What's a good way then to speed up such a find?

If you have N hundred immediate subdirs, you can use:
parallel --gnu -n 10 find {} -name 'regex' ::: *
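to run find in parallel on them, passing ten directories to each find invocation. (Note that -name takes a shell glob pattern, not a regular expression; use -regex if you really need a regex.)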
Note however that listing a directory recursively like this is an IO bound task, and the speedup you can get will depend on the backing medium. On a hard disk drive, it'll probably just be slower (if testing, beware disk caching).
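If GNU parallel is not available, the same idea can be sketched with find and xargs -P. This is only an illustration under a few assumptions: the function name parallel_find and its arguments are made up here, and it relies on the -mindepth/-maxdepth/-print0 and -0/-P extensions found in GNU and BSD find/xargs.

```shell
# parallel_find ROOT PATTERN [JOBS]   (hypothetical helper, not part of any tool)
# Runs one find per immediate subdirectory of ROOT, at most JOBS at a time.
parallel_find() {
    root=$1
    pattern=$2
    njobs=${3:-4}
    # Entries that sit directly in ROOT (subdirectories are handled below).
    find "$root" -mindepth 1 -maxdepth 1 ! -type d -name "$pattern"
    # Start one find per immediate subdirectory, NJOBS in parallel.
    find "$root" -mindepth 1 -maxdepth 1 -type d -print0 |
        xargs -0 -P "$njobs" -I {} find {} -name "$pattern"
}
```

Usage would look like parallel_find /some/path '*.log' 8. The order of the results is not deterministic, and with very large result sets the output of concurrent finds can interleave; GNU parallel groups output per job by default, which avoids that.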

Related

Why are file systems so much slower than a database?

I have a lot of files on my computer (who doesn't?).
They are split between hard drives.
I realized a long time ago that find takes a whole lot of time scanning a whole hard disk: minutes for one, and for all drives it might take over an hour.
That is why I got used to running du -ba / >> ~/du."$(date +%F)" on a regular basis. Then I would just grep 'WHATEVER' ~/du | sed 's#^ \+[0-9]\+ ##' | xargs -d\\n command
I understand why that is faster than find.
Now I have set up a MySQL database that has a complete, refreshable index of all files. Directories are a simple tree with just a foreign key to the parent row (or whatever you call a foreign key that references not a foreign table but the primary key of a different row in the same table).
Although it is just as complex, it is still much faster than using the filesystem.
Why is that? Am I missing some tools that could search the TOC faster than the normal POSIX calls to the kernel?
How long should it take to print all files of a hard drive to stdout, without a DB or text file cache?
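The text-file cache described above can be sketched as two small functions. The names build_file_index and search_index are hypothetical, as is the index location; the point is only that a search becomes a sequential read of one file instead of a walk over every directory. Tools such as locate/updatedb are built on the same idea.

```shell
# build_file_index ROOT INDEX   (hypothetical helper)
# Walks ROOT once and stores one path per line in INDEX.
build_file_index() {
    root=$1
    index=$2
    # -xdev keeps find on the filesystem that ROOT lives on.
    # Write to a temporary file first so a reader never sees a partial index.
    find "$root" -xdev -print > "$index.tmp"
    mv -f -- "$index.tmp" "$index"
}

# search_index PATTERN INDEX   (hypothetical helper)
# Prints every indexed path that matches the grep pattern.
search_index() {
    grep -- "$1" "$2"
}
```

The index is only as fresh as its last rebuild, so it would typically be regenerated from cron. Paths containing newlines would be split across lines in this simple format.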

Linux - Recursively list all files from folder with performance

I want to write the absolute filenames of almost 80 thousand files, found recursively under some folders, to a text file.
Does anybody know which command will give me the best performance?
I know that I can use commands like ls, tree, or ll, but which one will perform best?
Also note that these files are on a NAS, and the NAS is mapped using a symlink.
Simply find -type f -print0 for zero-delimited output. I don't think you can get significantly faster than that.
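Since the question asks for absolute names and mentions a symlinked mount, here is a small sketch of how that could look. The function name list_abs_files is made up for illustration. It resolves the starting directory first, because find does not follow a symlink given as its starting point unless told to (for example with -H or a trailing slash).

```shell
# list_abs_files DIR OUTFILE   (hypothetical helper)
# Writes the absolute path of every regular file under DIR to OUTFILE.
list_abs_files() {
    # pwd -P prints the physical path, so a symlinked DIR is resolved here.
    root=$(cd "$1" && pwd -P) || return 1
    # Starting from an absolute path makes every printed path absolute.
    find "$root" -type f > "$2"
}
```

This prints one path per line; switch to -print0 if the names may contain newlines.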

Need ideas for running srm on large amount of files

For security, I need to use srm (secure delete) rather than rm to delete some files: http://en.wikipedia.org/wiki/Srm_%28Unix%29
I currently have srm set up to run 3 passes over any data that I need to delete. The problem I'm having is that srm runs extremely slowly on large numbers of files. For example, there is a 150GB directory I tried to delete, and I found that it had only deleted 10GB after 1 week.
I know that srm will run slowly with multiple small files, but does directory depth matter as well? For most of the data I need to delete on a weekly basis, the actual files themselves are nested in various deep subdirectories. Would it help out if I flattened the directory structure before running srm?
Here are two workarounds I am looking at (maybe a combination of both), though I don't know how much they would help out:
Flatten all the directory structures before running srm. That way, all the files that need to be wiped out are in the same target directory.
Archive the entire directory before running srm. That way, the target file would be one big tar.gz file. Zipping up the data will likely take a while, but not as long as the srm would have taken.
Does anybody have any other suggestions on what I could do? Some others have used shred as well, but the results were similar and we ended up switching over to srm.
I don't know much about srm, but this might be worth trying:
find "$mydir" -type f -exec srm {} \;
find "$mydir" -type d -exec srm {} \;
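A variation on the two commands above, as a sketch only: it batches many files into each invocation with -exec ... {} + instead of starting one process per file, and then removes the emptied directories bottom-up. The function name wipe_tree is hypothetical, the wipe command is a parameter (defaulting to srm) so the sketch can be tried with a harmless command first, and using rmdir for the directories is an assumption of this sketch, not something srm requires.

```shell
# wipe_tree DIR [WIPE_CMD]   (hypothetical helper)
# Wipes every regular file under DIR, then removes the empty directories.
wipe_tree() {
    dir=$1
    wipe=${2:-srm}
    # {} + passes as many file names as fit to each WIPE_CMD invocation.
    find "$dir" -type f -exec "$wipe" {} +
    # -depth lists children before their parents, so rmdir sees
    # each directory only after its subdirectories are gone.
    find "$dir" -depth -type d -exec rmdir {} +
}
```

Whether this is noticeably faster depends on where the time goes; if the overwrite passes dominate, fewer process launches will not change much. Anything that is not a regular file (symlinks, sockets) is left behind, and rmdir will then refuse to remove its parent directory.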

What happens if there are too many files under a single directory in Linux?

If there are something like 1,000,000 individual files (mostly 100k in size) in a single flat directory (no other directories with files in them), are there going to be any compromises in efficiency, or disadvantages in any other way?
ARG_MAX is going to take issue with that... for instance, rm -rf * (while in the directory) is going to fail with "Argument list too long". Utilities that want to do some kind of globbing (or a shell) will have some functionality break.
If that directory is available to the public (let's say via ftp, or a web server) you may encounter additional problems.
The effect on any given file system depends entirely on that file system. How frequently are these files accessed, what is the file system? Remember, Linux (by default) prefers keeping recently accessed files in memory while putting processes into swap, depending on your settings. Is this directory served via http? Is Google going to see and crawl it? If so, you might need to adjust VFS cache pressure and swappiness.
Edit:
ARG_MAX is a system-wide limit on the total size of the arguments (together with the environment) that can be passed to a program's entry point. So, let's take 'rm', and the example "rm -rf *" - the shell is going to turn '*' into a list of file names, which in turn become the arguments to 'rm'.
The same thing is going to happen with ls, and several other tools. For instance, ls foo* might break if too many files start with 'foo'.
I'd advise (no matter what fs is in use) to break it up into smaller directory chunks, just for that reason alone.
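To make the ARG_MAX point concrete, the sketch below prints the limit and shows one way to delete by pattern without letting the shell expand the glob. The function name remove_matching is hypothetical; find matches the pattern itself and splits the names into batches that fit under the limit.

```shell
# Show the limit, in bytes, on the combined size of arguments and environment.
getconf ARG_MAX

# remove_matching DIR PATTERN   (hypothetical helper)
# Deletes regular files directly inside DIR whose names match PATTERN.
remove_matching() {
    # The quoted pattern reaches find unexpanded, so the shell never
    # builds one huge argument list; {} + batches the names for rm.
    find "$1" -maxdepth 1 -type f -name "$2" -exec rm -f -- {} +
}
```

For example, remove_matching . 'foo*' does what rm -f foo* would do in a directory where the glob expansion is too long to pass to rm.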
My experience with large directories on ext3 and dir_index enabled:
If you know the name of the file you want to access, there is almost no penalty
If you want to do operations that need to read in the whole directory entry (like a simple ls on that directory) it will take several minutes for the first time. Then the directory will stay in the kernel cache and there will be no penalty anymore
If the number of files gets too high, you run into ARG_MAX et al problems. That basically means that wildcarding (*) does not always work as expected anymore. This is only if you really want to perform an operation on all the files at once
Without dir_index however, you are really screwed :-D
Most distros use Ext3 by default, which can use b-tree indexing for large directories.
Some distros have this dir_index feature enabled by default; in others you'd have to enable it yourself. If you enable it, there's no slowdown even for millions of files.
To see if dir_index feature is activated do (as root):
tune2fs -l /dev/sdaX | grep features
To activate dir_index feature (as root):
tune2fs -O dir_index /dev/sdaX
e2fsck -D /dev/sdaX
Replace /dev/sdaX with partition for which you want to activate it.
When you accidentally execute "ls" in that directory, or use tab completion, or want to execute "rm *", you'll be in big trouble. In addition, there may be performance issues depending on your file system.
It's considered good practice to group your files into directories which are named by the first 2 or 3 characters of the filenames, e.g.
aaa/
   aaavnj78t93ufjw4390
   aaavoj78trewrwrwrwenjk983
   aaaz84390842092njk423
   ...
abc/
   abckhr89032423
   abcnjjkth29085242nw
   ...
...
The obvious answer is that the folder will be extremely difficult for humans to use long before any technical limit is reached (the time taken to read the output from ls, for one; there are dozens of other reasons). Is there a good reason why you can't split it into subfolders?
Not every filesystem supports that many files.
On some of them (ext2, ext3, ext4) it's very easy to hit the inode limit.
I've got a host with 10M files in a directory. (don't ask)
The filesystem is ext4.
It takes about 5 minutes to
ls
One limitation I've found is that my shell script to read the files (because AWS snapshot restore is a lie and files aren't present till first read) wasn't able to handle the argument list so I needed to do two passes. Firstly construct a file list with find (wholename in case you want to do partial matches)
find /path/to_dir/ -wholename '*.ldb'| tee filenames.txt
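Then, secondly, read from the file containing the filenames and read all the files (with limited parallelism):
while IFS= read -r line; do
    # Never keep more than 10 background jobs running.
    if test "$(jobs | wc -l)" -ge 10; then
        wait -n
    fi
    {
        # do something with 10x fanout, e.g. read the file once
        cat "$line" > /dev/null
    } &
done < filenames.txt
wait  # let the remaining background jobs finish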
Posting here in case anyone finds the specific work-around useful when working with too many files.

Maximum number of inodes in a directory? [closed]

Is there a maximum number of inodes in a single directory?
I have a directory of over 2 million files and can't get the ls command to work against that directory. So now I'm wondering if I've exceeded a limit on inodes in Linux. Is there a limit before a 2^64 numerical limit?
df -i should tell you the number of inodes used and free on the file system.
Try ls -U or ls -f.
ls, by default, sorts the files alphabetically. If you have 2 million files, that sort can take a long time. If you use ls -U (or perhaps ls -f), the file names will be printed immediately.
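As a small illustration of the unsorted listing, the sketch below counts the entries of a directory with ls -f. The function name count_entries is hypothetical. Note that -f also turns on -a, so the listing includes "." and "..", which the function subtracts.

```shell
# count_entries DIR   (hypothetical helper)
# Counts the entries in DIR (hidden ones included) without sorting them.
count_entries() {
    # ls -f prints entries in directory order, including . and ..
    total=$(ls -f -- "$1" | wc -l)
    echo $((total - 2))
}
```

The count is off if a file name contains a newline, since each line is counted as one entry.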
No. Inode limits are per-filesystem, and decided at filesystem creation time. You could be hitting another limit, or maybe 'ls' just doesn't perform that well.
Try this:
tune2fs -l /dev/DEVICE | grep -i inode
It should tell you all sorts of inode related info.
What you hit is an internal limit of ls. Here is an article which explains it quite well:
http://www.olark.com/spw/2011/08/you-can-list-a-directory-with-8-million-files-but-not-with-ls/
Maximum directory size is filesystem-dependent, and thus the exact limit varies. However, having very large directories is a bad practice.
You should consider making your directories smaller by sorting files into subdirectories. One common scheme is to use the first two characters for a first-level subdirectory, as follows:
${topdir}/aa/aardvark
${topdir}/ai/airplane
This works particularly well if you use UUIDs, GUIDs or content hash values for naming.
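The prefix scheme above could be applied to an existing flat directory with something like the following sketch. The function name shard_by_prefix and its arguments are made up for illustration, and it assumes file names without newlines.

```shell
# shard_by_prefix DIR [N]   (hypothetical helper)
# Moves each regular file in DIR into DIR/<first N characters of its name>/.
shard_by_prefix() {
    dir=$1
    n=${2:-2}
    # Stream the names from find rather than expanding a glob in the shell.
    find "$dir" -maxdepth 1 -type f | while IFS= read -r path; do
        name=${path##*/}
        # %.Ns truncates the name to its first N characters.
        prefix=$(printf "%.${n}s" "$name")
        mkdir -p "$dir/$prefix" && mv -- "$path" "$dir/$prefix/"
    done
}
```

Running shard_by_prefix "$topdir" 2 would move aardvark to aa/aardvark and airplane to ai/airplane. A file whose name is no longer than N characters stays where it is, because a directory of the same name cannot be created next to it.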
As noted by Rob Adams, ls is sorting the files before displaying them. Note that if you are using NFS, the NFS server will be sorting the directory before sending it, and 2 million entries may well take longer than the NFS timeout. That makes the directory unlistable via NFS, even with the -f flag.
This may be true for other network file systems as well.
While there's no enforced limit to the number of entries in a directory, it's good practice to have some limit to the entries you anticipate.
Can you get a real count of the number of files? Does it fall very near a 2^n boundary? Could you simply be running out of RAM to hold all the file names?
I know that in Windows, at least, file system performance would drop dramatically as the number of files in the folder went up, but I thought that Linux didn't suffer from this issue, at least if you were using a command prompt. God help you if you try to get something like Nautilus to open a folder with that many files.
I'm also wondering where these files come from. Are you able to calculate file names programmatically? If that's the case, you might be able to write a small program to sort them into a number of sub-folders. Often listing the name of a specific file will grant you access where trying to look up the name will fail. For example, I have a folder in windows with about 85,000 files where this works.
If this technique is successful, you might try finding a way to make this sort permanent, even if it's just running this small program as a cron job. It'll work especially well if you can sort the files by date somewhere.
Unless you are getting an error message, ls is working but very slowly. You can try looking at just the first ten files like this:
ls -f | head -10
If you're going to need to look at the file details for a while, you can put them in a file first. You probably want to send the output to a different directory than the one you are listing at the moment!
ls > ~/lots-of-files.txt
If you want to do something to the files, you can use xargs. If you decide to write a script of some kind to do the work, make sure that your script will process the list of files as a stream rather than all at once. Here's an example of moving all the files.
ls | xargs -I thefilename mv thefilename ~/some/other/directory
You could combine that with head to move a smaller number of the files.
ls | head -10000 | xargs -I x mv x /first/ten/thousand/files/go/here
You can probably combine ls | head into a shell script that will split up the files into a bunch of directories with a manageable number of files in each.
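One possible shape for such a script is sketched below. It uses find instead of ls | head so that the names are streamed once, and it moves the files into numbered directories under a separate destination. The function name split_into_chunks, the chunk_N directory names and the default size are all assumptions made for this example, and file names containing newlines are not handled.

```shell
# split_into_chunks SRC DEST [SIZE]   (hypothetical helper)
# Moves the regular files in SRC into DEST/chunk_0, DEST/chunk_1, ...
# with at most SIZE files in each.
split_into_chunks() {
    src=$1
    dest=$2
    size=${3:-10000}
    i=0
    find "$src" -maxdepth 1 -type f | while IFS= read -r path; do
        chunk=$((i / size))
        # Create the target directory when a new chunk starts.
        if [ $((i % size)) -eq 0 ]; then
            mkdir -p "$dest/chunk_$chunk"
        fi
        mv -- "$path" "$dest/chunk_$chunk/"
        i=$((i + 1))
    done
}
```

Starting one mv per file is slow for millions of files, but it processes the list as a stream and never builds a long argument list.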
For NetBackup, the binaries that analyze the directories on clients perform some type of listing that times out because of the enormous quantity of files in every folder (about one million per folder, SAP work directory).
My solution was (as Charles Duffy wrote in this thread) to reorganize the folders into subfolders with fewer files.
Another option is find:
find . -name '*' -exec somecommands {} \;
{} is replaced by the path of each file, relative to the starting point (so here it begins with ./).
The advantage/disadvantage is that the files are processed one after the other.
find . -name '*' > ls.txt
would print all filenames to ls.txt
find . -name '*' -exec ls -l {} \; > ls.txt
would print all the information from ls for each file to ls.txt
