Optimal directory structure for saving a large number of files - Linux

A piece of software we developed generates more and more files, currently about 70,000 per day at 3-5 MB each. We store these files on a Linux server with an ext3 file system. The software creates a new directory every day and writes that day's files into it. Writing and reading such a large number of files is getting slower and slower (per file, I mean), so one of my colleagues suggested creating a subdirectory for every hour. We will test whether this makes the system faster, but the problem can be generalized:
Has anyone measured the speed of writing and reading files, as a function of the number of files in the target directory? Is there an optimal file count above which it's faster to put the files into subdirectories? What are the important parameters which may influence the optimum?
Thank you in advance.
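For illustration, the per-hour layout my colleague suggested would be produced along these lines; the /data path and the $newfile variable are only placeholders:

    dir="/data/$(date +%Y-%m-%d)/$(date +%H)"   # one directory per day, one subdirectory per hour
    mkdir -p "$dir"
    mv "$newfile" "$dir/"                       # each generated file goes into the current hour's bucket

With ~70,000 files per day this keeps each leaf directory at roughly 3,000 entries.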

Related

Disk read performance - does splitting 100k+ files into subdirectories help to read them faster?

I have 100k+ small JSON data files in one directory (not by choice). When accessing each of them, does a flat vs. pyramid directory structure make any difference? Does it help Node.js/Nginx/the filesystem retrieve them faster if the files are grouped by, e.g., first letter in corresponding directories?
In other words, is it faster to get baaaa.json from /json/b/ (only b*.json here) than to get it from /json/ (all files), when it is safe to assume that each subdirectory contains 33 times fewer files? Does it make finding each file 33x faster? Or is there any disk-read difference at all?
EDIT (in response to jfriend00's comment): I am not sure yet what the underlying filesystem will be. But let's assume an S3 bucket.
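For concreteness, on a local filesystem a one-off migration to the letter-bucketed layout described above could look like this (paths follow the question's example; on S3 the "directories" are just key prefixes, so only the naming changes):

    # move json/baaaa.json to json/b/baaaa.json, and likewise for every other file
    for f in json/*.json; do
        b="$(basename "$f")"
        d="json/${b:0:1}"       # bucket = first character of the file name
        mkdir -p "$d"
        mv "$f" "$d/"
    done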

How to extract data from a par (Parchive) file?

There seems to be quite some confusion about PAR files, and I'm struggling to find an answer to this.
I have several PAR files, each containing several GB of data. Considering PAR is a type of archive file (similar to tar, I assume), I would like to extract its contents on Linux. However, I can't seem to find out how to do this; I can only find how to repair files or create a PAR file.
I am trying to use the par2 command line tool to do this.
Any help would be appreciated
TLDR: They're not really like .tar archives - they are generally created to support other files (including archives) to protect against data damage/loss. Without any of the original data, I think it is very unlikely any data can be recovered from these files.
.par files are (if they are genuinely PAR2 files) error recovery files for supporting a set of data stored separately. PAR files are useful, because they can protect the whole of the source data without needing a complete second copy.
For example, you might choose to protect 1GB of data using 100MB of .par files in the form of 10x 10MB files. This means that if any part of the original data (up to 100MB) is damaged or lost, it can be recalculated and repaired using the .par records.
This will still work if some of the .par files are lost, but the amount of data that can be recovered cannot exceed what .par files remain.
So...given that it is rare to create par files constituting 100% of the size of the original data, unless you have some of the original data as well, you probably won't be able to recover anything from the files.
http://www.techsono.com/usenet/files/par2
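For reference, typical par2cmdline usage looks roughly like this; the file names and the 10% redundancy figure are only examples:

    par2 create -r10 backup.par2 data/*.bin   # create ~10% recovery data for the listed files
    par2 verify backup.par2                   # check the original files against the recovery set
    par2 repair backup.par2                   # rebuild damaged/missing parts, if enough recovery blocks survive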

Maximum number of files per drive

I'm facing a problem adding more files to a partition that already contains too many of them. I currently have approximately 10 million files on a Linux file system. When I try to add more files, it keeps saying that there is not enough space (even though I have 30+ GB left). Any idea why that is happening, and can it be resolved?
The most common cause is that there are too many files in a single directory - directories can hold only a finite number of files. If that's not the problem, there are other metadata structures (such as the inode table) which can also limit the total number of files on the disk.
You can differentiate between these two problems by checking if you can add files to another directory.
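A quick way to tell the two cases apart from the shell (the mount point and directory names are placeholders):

    df -h /mnt/data                   # free bytes - can look fine even when no more files fit
    df -i /mnt/data                   # free inodes - if IFree is 0, the file-count limit is the problem
    ls -f /mnt/data/busydir | wc -l   # rough entry count for one suspect directory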

File copying time variations

Does it take longer to copy many small files compared to one large file, both totaling the same size? Is it just because of the overhead of copying each file's details?
Yes, it takes more time to copy many small files as compared to copying the same amount of data in one large file.
And yes, the overhead comes from having to manage all the file system entries and metadata.
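A rough way to see the effect yourself; the sizes, counts, and destination path below are arbitrary:

    mkdir -p small
    for i in $(seq 1 10000); do head -c 10240 /dev/urandom > "small/$i"; done   # 10,000 x 10 KB
    head -c 102400000 /dev/urandom > big.bin                                    # one ~100 MB file
    time cp -r small /mnt/dest/   # many small files: one create/open/close plus metadata per file
    time cp big.bin /mnt/dest/    # one large file of the same total size: mostly sequential data transfer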

handling lots of temporary small files

I have a web server which saves cache files and keeps them for 7 days. The file names are md5 hashes, i.e. exactly 32 hex characters long, and are being kept in a tree structure that looks like this:
    00/
        00/
            00000ae9355e59a3d8a314a5470753d8
            ...
        01/
            ...
    ...
You get the idea.
My problem is that deleting old files is taking a really long time. I have a daily cron job that runs
find cache/ -mtime +7 -type f -delete
which takes more than half a day to complete. I worry about scalability and the effect this has on the performance of the server. Additionally, the cache directory is now a black hole in my system, trapping the occasional innocent du or find.
The standard solution to LRU cache is some sort of a heap. Is there a way to scale this to the filesystem level?
Is there some other way to implement this in a way which makes it easier to manage?
Here are ideas I considered:
Create 7 top-level directories, one for each day of the week, and empty one directory every day. This increases the seek time for a cache file 7-fold, makes it really complicated when a file is overwritten, and I'm not sure what it will do to the deletion time.
Save the files as blobs in a MySQL table with indexes on name and date. This seemed promising, but in practice it's always been much slower than FS. Maybe I'm not doing it right.
Any ideas?
When you store a file, make a symbolic link to a second directory structure that is organized by date, not by name.
Retrieve your files using the "name" structure, delete them using the "date" structure.
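A sketch of the write path under this scheme; the "by-date" directory name is invented, the cache path is the example from the question:

    f="cache/00/00/00000ae9355e59a3d8a314a5470753d8"   # stored under its name-based path, as before
    day="by-date/$(date +%Y-%m-%d)"                    # second view of the same file, keyed by creation date
    mkdir -p "$day"
    ln -s "$PWD/$f" "$day/"                            # one extra symlink per cached file

Expiry then only has to walk one small per-day directory instead of the whole cache tree.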
Assuming this is ext2/ext3, have you tried enabling indexed directories? When you have a large number of files in any particular directory, the lookup will be painfully slow when deleting something.
Use tune2fs -O dir_index to enable the dir_index feature.
When mounting the file system, make sure to use the noatime option, which stops the OS from updating access-time information for the directories (otherwise every read still has to modify them).
Looking at the original post it seems as though you only have 2 levels of indirection to the files, which means that you can have a huge number of files in the leaf directories. When there are more than a million entries in these you will find that searches and changes are terribly slow. An alternative is to use a deeper hierarchy of directories, reducing the number of items in any particular directory, therefore reducing the cost of search and updates to the particular individual directory.
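On ext2/ext3 the above would look something like the following; the device and mount point are placeholders, and the filesystem must be unmounted for the e2fsck step:

    tune2fs -O dir_index /dev/sdb1                # enable hashed (HTree) directory indexes
    umount /var/cache/app
    e2fsck -fD /dev/sdb1                          # -D rebuilds/optimizes existing directory indexes
    mount -o noatime /dev/sdb1 /var/cache/app     # skip access-time updates on reads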
Reiserfs is relatively efficient at handling small files. Did you try different Linux file systems? I'm not sure about delete performance - you can consider formatting (mkfs) as a substitute for individual file deletion. For example, you can create a different file system (cache1, cache2, ...) for each weekday.
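A sketch of the per-weekday filesystem idea; the LVM volume and mount-point names are invented:

    dow=$(date +%u)                                        # day of week, 1..7
    umount "/cache/$dow" 2>/dev/null || true
    mkfs.ext3 -q "/dev/vg0/cache$dow"                      # recreating the fs replaces millions of unlinks
    mount -o noatime "/dev/vg0/cache$dow" "/cache/$dow"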
How about this:
Have another folder called, say, "ToDelete"
When you add a new item, get today's date and look for a subfolder in "ToDelete" that has a name indicative of the current date
If it's not there, create it
Add a symbolic link to the item you've created in today's folder
Create a cron job that goes to the folder in "ToDelete" for the expired date and deletes all the files that are linked from it (a sketch of this step follows the list).
Delete the folder which contained all the links.
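A possible shape for that daily cron step, assuming the per-day folders under "ToDelete" are named by date:

    old="ToDelete/$(date -d '8 days ago' +%Y-%m-%d)"
    [ -d "$old" ] || exit 0
    for link in "$old"/*; do
        rm -f "$(readlink -f "$link")"   # remove the cached file the link points to
    done
    rm -rf "$old"                        # then remove the per-day folder of links itself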
How about having a table in your database that uses the hash as the key. The other field would then be the name of the file. That way the file can be stored in a date-related fashion for fast deletion, and the database can be used for finding that file's location based on the hash in a fast fashion.
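A minimal illustration of such a table; sqlite3 is used here only to keep the example self-contained (the same schema works in MySQL), and all names are hypothetical:

    sqlite3 cache.db "CREATE TABLE IF NOT EXISTS cache_files (
        hash    TEXT PRIMARY KEY,   -- the md5 name the web server asks for
        path    TEXT NOT NULL,      -- where the file actually lives (date-based layout)
        created INTEGER NOT NULL    -- unix timestamp, used for expiry
    );"
    sqlite3 cache.db "SELECT path FROM cache_files WHERE hash = '00000ae9355e59a3d8a314a5470753d8';"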
