Why are file systems so much slower than a database?

I have a lot of files on my computer (who doesn't?).
They are split between several hard drives.
I realized a long time ago that find takes a long time to scan a whole hard disk: minutes per disk, and for all drives it might take over an hour.
That is why I got used to running du -ba / >> ~/du."$(date +%F)" on a regular basis. Then I would just grep 'WHATEVER' ~/du | sed 's#^ \+[0-9]\+ ##' | xargs -d\\n command
I understand why that is faster than find.
Now I have set up a MySQL database that holds a complete, refreshable index of all files. Directories are a simple tree with just a foreign key to the parent row (or whatever you call a foreign key that references not a foreign table but the primary key of a different row in the same table).
Although it is just as complex, it is still much faster than using the filesystem.
Why is that? Am I missing some tools that could search the TOC faster than the normal POSIX calls to the kernel?
How long should it take to print all files of a hard drive to stdout, without a DB or text file cache?
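The cached-listing approach described in the question can be sketched roughly as below. This is only an illustration: build_index and search_index are hypothetical helper names, and the index is a plain text file of paths rather than the du output or MySQL table the question uses.

```shell
# Hypothetical helper: walk the tree once and store every path in a text file.
# The index file should live outside the tree being indexed.
build_index() {
    root=$1
    index_file=$2
    find "$root" -print > "$index_file"
}

# Hypothetical helper: search the cached listing instead of walking the disk again.
# -F treats the pattern as a fixed string, not a regular expression.
search_index() {
    pattern=$1
    index_file=$2
    grep -F -- "$pattern" "$index_file"
}
```

Searching the cache only reads one sequential file, whereas find has to visit every directory on disk; the price is that the listing is stale until build_index is run again.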

Related

Why vim search is much slower than "cat fileName | grep targetText"?

I have a 1.4 GB text file named test.txt and I want to search a string inside the file.
I'd like to know why vim search (vim test.txt, then type /targetText to search the string) performs much slower than cat test.txt | grep targetText?
On my machine, the vim search takes several minutes to complete, while cat test.txt | grep targetText takes only several seconds.

Vim is an editor. It will try to load the file in memory then you can do edits on it. Vim can edit huge files, but is not optimized for it.
On the other hand, cat and grep do not need to read the whole file into memory.
BTW you can just do grep search file without using cat.
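As a small sketch of that last point, grep can be pointed at the file directly. The function name search_file is made up for illustration; -n and -F are standard grep options (print line numbers, treat the pattern as a fixed string).

```shell
# Hypothetical helper: print matching lines with their line numbers,
# letting grep read the file itself instead of piping it through cat.
search_file() {
    pattern=$1
    file=$2
    grep -n -F -- "$pattern" "$file"
}
```

For the question's example this would be called as search_file targetText test.txt.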

If targetText is short, the delay is probably caused by numerous loads from disk (necessary to search through the whole text). Note that Vim is an interactive tool and is not designed for fast processing of gigabytes. Of course, if we knew in advance that the match lies many megabytes downstream from the current screen, we could read huge chunks from disk and be fast. But in real life Vim doesn't know how much data is worth reading at once: if the pattern is expected to be found a short distance away, say three lines below (a much more likely situation), there is no reason to read huge amounts of data from disk; that would waste time and bandwidth. Since Vim cannot know a priori how much data to read at once, it uses a trade-off that happens not to be optimal in your case.
By contrast, the pipeline "cat | ..." boldly works with very large chunks of data, limited only by the memory available to the process (ideally, once it has found the file, it reads data non-stop and sends it down the pipeline), because cat "knows" that the whole file content is needed and there is no reason to read it in small pages.
Thus, although Vim and cat read the same amount of data, the latter seeks on disk far fewer times, which results in a dramatic efficiency increase.
If a prefix of our pattern occurs very frequently in the file being scanned, we may also see an efficiency advantage from grep's search technique, which is based on the Aho–Corasick string matching algorithm.

What is the best way to speed up a find command on a huge directory tree using GNU parallel?

I've been using GNU parallel for a while, mostly to grep large files or run the same command for various arguments when each command/arg instance is slow and needs to be spread out across cores/hosts.
One thing which would be great to do across multiple cores and hosts as well would be to find a file on a large directory subtree. For example, something like this:
find /some/path -name 'regex'
will take a very long time if /some/path contains many files and other directories with many files. I'm not sure if this is as easy to speed up. For example:
ls -R -1 /some/path | parallel --profile manyhosts --pipe egrep regex
something like that comes to mind but ls would be very slow to come up with the files to search. What's a good way then to speed up such a find?

If you have N hundred immediate subdirs, you can use:
parallel --gnu -n 10 find {} -name 'regex' ::: *
to run find in parallel on each of them, ten at a time.
Note however that listing a directory recursively like this is an IO bound task, and the speedup you can get will depend on the backing medium. On a hard disk drive, it'll probably just be slower (if testing, beware disk caching).
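If GNU parallel is not available, a similar fan-out can be sketched with xargs -P instead (a different tool from the one used in the answer). The function name parallel_find is hypothetical; -mindepth, -maxdepth, -print0, -0 and -P are GNU/BSD extensions, and files sitting directly in the top-level directory are not searched by this sketch.

```shell
# Hypothetical helper: run one find per immediate subdirectory of root,
# with up to njobs of them running at the same time.
parallel_find() {
    root=$1
    pattern=$2          # a -name glob such as '*.log'
    njobs=${3:-4}
    find "$root" -mindepth 1 -maxdepth 1 -type d -print0 |
        xargs -0 -P "$njobs" -I{} find {} -name "$pattern"
}
```

The same caveat applies: the work is IO bound, so whether this helps depends on the storage medium.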

unix : how does a "./process | sort" work?

To debug some map/reduce jobs I often test them using a simple unix command that basically reads
cat data/* | mapper | sort | reduce > out
Now everything works just fine, but I'm wondering what really happens with the map | sort command.
More precisely :
Does someone know how the RAM/CPU is loaded by sort?
Is the sort command sorting data on the fly, or does it wait for the map job to be finished (note that the mapper uses STDOUT and does not wait for the end of the computation to output data)?
Using quite a large amount of input data does not seem to load the RAM as I would expect (I rather observe peaks of CPU, but I'm not really measuring this very precisely). Is it possible for the process to use less RAM than the amount of output information?
Thanks for your answers :)

In Linux, sort uses a merge sort algorithm (from http://en.wikipedia.org/wiki/Sort_(Unix) ). A merge sort can store some parts in temporary files on disk (and sort does so). So the process uses a reasonable amount of RAM (you can specify how much RAM is used via the --buffer-size option).
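A minimal sketch of capping sort's memory in such a pipeline, assuming a sort that supports -S (buffer size) and -T (temporary directory), as GNU sort does. The wrapper name sort_bounded and the default values are made up for illustration. Note that sort cannot print its first line until it has read its last input line, so it consumes the mapper's output as it arrives but only emits results once the mapper has finished.

```shell
# Hypothetical wrapper: sort stdin with a bounded in-memory buffer.
# Once the buffer is full, sorted runs are spilled to tmp_dir and merged at the end.
sort_bounded() {
    buffer=${1:-64M}
    tmp_dir=${2:-/tmp}
    sort -S "$buffer" -T "$tmp_dir"
}
```

In the pipeline from the question this would read: cat data/* | mapper | sort_bounded 256M /var/tmp | reduce > out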

Is it OK (performance-wise) to have hundreds or thousands of files in the same Linux directory?

It's well known that in Windows a directory with too many files will have a terrible performance when you try to open one of them. I have a program that is to execute only in Linux (currently it's on Debian-Lenny, but I don't want to be specific about this distro) and writes many files to the same directory (which acts somewhat as a repository). By "many" I mean tens each day, meaning that after one year I expect to have something like 5000-10000 files. They are meant to be kept (once a file is created, it's never deleted) and it is assumed that the hard disk has the required capacity (if not, it should be upgraded). Those files have a wide range of sizes, from a few KB to tens of MB (but not much more than that). The names are always numeric values, incrementally generated.
I'm worried about long-term performance degradation, so I'd ask:
Is it OK to write all to the same directory? Or should I think about creating a set of subdirectories for every X files?
Should I require a specific filesystem to be used for such directory?
What would be the more robust alternative? Specialized filesystem? Which?
Any other considerations/recommendations?

It depends very much on the file system.
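ext2 and ext3 have a hard limit of about 32,000 subdirectories per directory (the limit comes from the inode link count, so it applies to subdirectories rather than regular files). Your file count is not directly capped by this, but it is worth knowing if you later split the files into subdirectories. Also, without the dir_index feature, ext2 and ext3 will perform a linear scan every time you access a file by name in the directory.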
ext4 supposedly fixes these problems, but I cannot vouch for it personally.
XFS was designed for this sort of thing from the beginning and will work well even if you put millions of files in the directory.
So if you really need a huge number of files, I would use XFS or maybe ext4.
Note that no file system will make "ls" run fast if you have an enormous number of files (unless you use "ls -f"), since "ls" will read the entire directory and then sort the names. A few tens of thousands is probably not a big deal, but a good design should scale beyond what you think you need at first glance...
For the application you describe, I would probably create a hierarchy instead, since it is hardly any additional coding or mental effort for someone looking at it. Specifically, you can name your first file "00/00/01" instead of "000001".
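The "00/00/01" naming could be derived from the numeric file name roughly like this. The function name id_to_path is hypothetical, and the sketch assumes ids below one million (six digits).

```shell
# Hypothetical helper: map a numeric id to a nested path, e.g. 1 -> 00/00/01
id_to_path() {
    # Strip leading zeros so printf does not interpret the value as octal
    n=$(printf '%s' "$1" | sed 's/^0*//')
    if [ -z "$n" ]; then
        n=0
    fi
    # Zero-pad to six digits, then split into three two-digit components
    printf '%06d\n' "$n" | sed 's#^\(..\)\(..\)\(..\)$#\1/\2/\3#'
}
```

The writer would then create the parent directory with mkdir -p "$(dirname "$(id_to_path "$id")")" before writing the file, which keeps every directory at no more than 100 entries.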

If you use a filesystem without directory-indexing, then it is a very bad idea to have lots of files in one directory (say, > 5000).
However, if you've got directory indexing (which is enabled by default on more recent distros in ext3), then it's not such a problem.
However, it does break quite a few tools to have many files in one directory (For example, "ls" will stat() all the files, which takes a long time). You can probably easily split it into subdirectories.
But don't overdo it. Don't use many levels of nested subdirectories unnecessarily; this just uses lots of inodes and makes metadata operations slower.
I've seen more cases of "too many levels of nested directories" than I've seen of "too many files per directory".

The best solution I have for you (rather than quoting some values from a micro-filesystem-benchmark) is to test it yourself.
Just use the file system of your choice. Create some random test data for 100, 1000 and 10000 entries. Then, measure the time it takes your system to perform the action you are concerned about time-wise (opening a file, reading 100 random files, etc).
Then, you compare the times and use the best solution (put them all into one directory; put each year into a new directory; put each month of each year into a new directory).
I do not know in detail what you are using, but creating a directory is a one time (and probably quite easy) operation, so why not do it instead of changing filesystems or trying some other more time-consuming stuff?
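A rough sketch of such a test setup follows. The helper name make_test_files and the /tmp/bench paths are placeholders, and the files are empty, so this only exercises directory lookups, not data reads.

```shell
# Hypothetical helper: create n empty files named 1..n in dir
make_test_files() {
    dir=$1
    n=$2
    mkdir -p "$dir"
    i=1
    while [ "$i" -le "$n" ]; do
        : > "$dir/$i"
        i=$((i + 1))
    done
}

# Example run (paths are placeholders):
#   for n in 100 1000 10000; do
#       make_test_files "/tmp/bench/$n" "$n"
#       time cat "/tmp/bench/$n/$((n / 2))" > /dev/null
#   done
```

Repeated runs will mostly hit the kernel's cache, so the first access after a fresh mount or reboot is the more telling number.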

In addition to the other answers, if the huge directory is managed by a known application or library, you could consider replacing it with something else, e.g.:
a GDBM index file; GDBM is a very common library providing indexed files, which associate an arbitrary key (a sequence of bytes) with an arbitrary value (another sequence of bytes).
perhaps a table inside a database like MySQL or PostgreSQL. Be careful about indexing.
some other way to index data
The advantages of the above approaches include:
space performance for a large collection of small items (less than a kilobyte each). A filesystem needs an inode for each item. Indexed systems may have much finer granularity
time performance: you don't access the filesystem for every item
scalability: indexed approaches are designed to fit large needs: either a GDBM index file or a database can handle many millions of items. I'm not sure your directory approach will scale as easily.
The disadvantage of such approaches is that the items don't show up as files. But as MarkR's answer reminds you, ls behaves quite poorly on huge directories.
If you stick to a filesystem approach, many programs that use a large number of files organize them in subdirectories like aa/ ab/ ac/ ...ay/ az/ ba/ ... bz/ ...

Is it OK to write all to the same directory? Or should I think about creating a set of subdirectories for every X files?
In my experience, the only slowdown a directory with many files causes is when you do things such as getting a listing with ls. But that is mostly the fault of ls; there are faster ways of listing the contents of a directory using tools such as echo and find (see below).
Should I require a specific filesystem to be used for such directory?
I don't think so with regards to amount of files in one directory. I am sure some filesystems perform better with many small files in one dir whilst others do a better job on huge files. It's also a matter of personal taste, akin to vi vs. emacs. I prefer to use the XFS filesystem so that'd be my advice. :-)
What would be the more robust alternative? Specialized filesystem? Which?
XFS is definitely robust and fast. I use it in many places: as a boot partition, for Oracle tablespaces, as space for source control, you name it. It lacks a bit on delete performance, but otherwise it's a safe bet. Plus it supports growing the size whilst it is still mounted (that's a requirement actually). That is, you just delete the partition, recreate it at the same starting block and with whatever ending block is larger than the original partition, then you run xfs_growfs on it with the filesystem mounted.
Any other considerations/recommendations?
See above. With the addition that having 5000 to 10000 files in one directory should not be a problem. In practice it doesn't arbitrarily slow down the filesystem as far as I know, except for utilities such as "ls" and "rm". But you could do:
find . -type f -print0 | xargs -0 echo
find . -type f -print0 | xargs -0 rm
The benefit that a directory tree with files, such as directory "a" for file names starting with an "a" etc., will give you is that of looks, it looks more organised. But then you have less of an overview... So what you're trying to do should be fine. :-)
I neglected to say you could consider using something called "sparse files" http://en.wikipedia.org/wiki/Sparse_file

It is bad for performance to have a huge number of files in one directory. Checking for the existence of a file will typically require an O(n) scan of the directory. Creating a new file will require that same scan with the directory locked to prevent the directory state changing before the new file is created. Some file systems may be smarter about this (using B-trees or whatever), but the fewer ties your implementation has to the filesystem's strengths and weaknesses the better for long term maintenance. Assume someone might decide to run the app on a network filesystem (storage appliance or even cloud storage) someday. Huge directories are a terrible idea when using network storage.

What happens if there are too many files under a single directory in Linux?

If there are something like 1,000,000 individual files (mostly 100k in size) in a single flat directory (no subdirectories or files inside them), are there going to be any compromises in efficiency or disadvantages in any other way?

ARG_MAX is going to take issue with that... for instance, rm -rf * (while in the directory) is going to fail with "Argument list too long". Utilities that rely on some kind of globbing (or a shell) will have some functionality break.
If that directory is available to the public (let's say via ftp or a web server) you may encounter additional problems.
The effect on any given file system depends entirely on that file system. How frequently are these files accessed, what is the file system? Remember, Linux (by default) prefers keeping recently accessed files in memory while putting processes into swap, depending on your settings. Is this directory served via http? Is Google going to see and crawl it? If so, you might need to adjust VFS cache pressure and swappiness.
Edit:
ARG_MAX is a system-wide limit on the total size of the arguments (and environment) that can be passed to a program's entry point. So, let's take 'rm' and the example "rm -rf *": the shell is going to turn '*' into a list of file names, which in turn become the arguments to 'rm'.
The same thing is going to happen with ls, and several other tools. For instance, ls foo* might break if too many files start with 'foo'.
I'd advise (no matter what fs is in use) to break it up into smaller directory chunks, just for that reason alone.
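One way around the limit, sketched under the assumption of a find that supports -maxdepth (GNU and BSD do), is to let find build the argument lists itself. The helper name remove_matching is made up. The current limit can be printed with getconf ARG_MAX.

```shell
# Hypothetical helper: delete the files in dir whose names match a glob,
# without the shell ever expanding the glob into one huge argument list.
remove_matching() {
    dir=$1
    pattern=$2          # quoted glob such as 'foo*'
    # "-exec ... {} +" batches file names into as many rm calls as needed
    find "$dir" -maxdepth 1 -type f -name "$pattern" -exec rm -- {} +
}
```

Because the pattern is passed quoted, it reaches find as a single argument and is matched inside find rather than by the shell.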

My experience with large directories on ext3 and dir_index enabled:
If you know the name of the file you want to access, there is almost no penalty
If you want to do operations that need to read in the whole directory entry (like a simple ls on that directory) it will take several minutes for the first time. Then the directory will stay in the kernel cache and there will be no penalty anymore
If the number of files gets too high, you run into ARG_MAX et al problems. That basically means that wildcarding (*) does not always work as expected anymore. This is only if you really want to perform an operation on all the files at once
Without dir_index however, you are really screwed :-D

Most distros use Ext3 by default, which can use b-tree indexing for large directories.
Some distros have the dir_index feature enabled by default; in others you have to enable it yourself. If you enable it, there's no slowdown even for millions of files.
To see if dir_index feature is activated do (as root):
tune2fs -l /dev/sdaX | grep features
To activate dir_index feature (as root):
tune2fs -O dir_index /dev/sdaX
e2fsck -D /dev/sdaX
Replace /dev/sdaX with partition for which you want to activate it.

When you accidentally execute "ls" in that directory, or use tab completion, or want to execute "rm *", you'll be in big trouble. In addition, there may be performance issues depending on your file system.
It's considered good practice to group your files into directories which are named by the first 2 or 3 characters of the filenames, e.g.
aaa/
    aaavnj78t93ufjw4390
    aaavoj78trewrwrwrwenjk983
    aaaz84390842092njk423
    ...
abc/
    abckhr89032423
    abcnjjkth29085242nw
    ...
...
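The prefix layout above could be computed like this. It is a sketch only: shard_path is a hypothetical name and the three-character prefix length is just the example's choice.

```shell
# Hypothetical helper: place a file under a directory named after
# the first three characters of its name, e.g. abc/abckhr89032423
shard_path() {
    name=$1
    prefix=$(printf '%s' "$name" | cut -c1-3)
    printf '%s/%s\n' "$prefix" "$name"
}
```

This works best when the leading characters are evenly distributed (hashes, random ids); sequential names would pile up in a few prefixes.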

The obvious answer is that the folder will be extremely difficult for humans to use long before any technical limit is reached (the time taken to read the output from ls for one; there are dozens of other reasons). Is there a good reason why you can't split it into subfolders?

Not every filesystem supports that many files.
On some of them (ext2, ext3, ext4) it's very easy to hit the inode limit.

I've got a host with 10M files in a directory. (don't ask)
The filesystem is ext4.
It takes about 5 minutes to
ls
One limitation I've found is that my shell script to read the files (because AWS snapshot restore is a lie and files aren't present until first read) wasn't able to handle the argument list, so I needed to do two passes. First, construct a file list with find (-wholename in case you want to do partial matches):
find /path/to_dir/ -wholename '*.ldb' | tee filenames.txt
Second, read from the file containing the filenames and read all the files (with limited parallelism):
while read -r line; do
    # Throttle: once 10 background jobs are running, wait for one to finish (bash 4.3+)
    if test "$(jobs -r | wc -l)" -ge 10; then
        wait -n
    fi
    {
        # do something with 10x fanout, e.g. read the file so it gets loaded
        cat "$line" > /dev/null
    } &
done < filenames.txt
wait    # let the remaining background jobs finish
Posting here in case anyone finds the specific work-around useful when working with too many files.
