Handling lots of temporary small files - Linux

I have a web server which saves cache files and keeps them for 7 days. The file names are md5 hashes, i.e. exactly 32 hex characters long, and are being kept in a tree structure that looks like this:
00/
  00/
    00000ae9355e59a3d8a314a5470753d8
    .
    .
  01/
    .
    .
You get the idea.
My problem is that deleting old files is taking a really long time. I have a daily cron job that runs
find cache/ -mtime +7 -type f -delete
which takes more than half a day to complete. I worry about scalability and the effect this has on the performance of the server. Additionally, the cache directory is now a black hole in my system, trapping the occasional innocent du or find.
The standard solution to an LRU cache is some sort of heap. Is there a way to scale this to the filesystem level?
Is there some other way to implement this in a way which makes it easier to manage?
Here are ideas I considered:
Create 7 top directories, one for each weekday, and empty one directory every day. This increases the seek time for a cache file 7-fold, makes it really complicated when a file is overwritten, and I'm not sure what it will do to the deletion time.
Save the files as blobs in a MySQL table with indexes on name and date. This seemed promising, but in practice it's always been much slower than FS. Maybe I'm not doing it right.
Any ideas?

When you store a file, also create a symbolic link to it in a second directory structure that is organized by date, not by name.
Retrieve your files using the "name" structure, delete them using the "date" structure.
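A rough bash sketch of that scheme; the cache/name and cache/date locations and the 8-day cutoff are assumptions, not part of the answer:

# store: write the file into the name tree, then link it into today's date bucket
hash=00000ae9355e59a3d8a314a5470753d8
name_dir="cache/name/${hash:0:2}/${hash:2:2}"
date_dir="cache/date/$(date +%F)"
mkdir -p "$name_dir" "$date_dir"
mv /tmp/payload "$name_dir/$hash"                       # retrieval keeps using the name tree
ln -s "../../name/${hash:0:2}/${hash:2:2}/$hash" "$date_dir/$hash"

# daily cron: everything linked from the bucket that is now 8 days old gets removed
old="cache/date/$(date -d '8 days ago' +%F)"
if [ -d "$old" ]; then
    for link in "$old"/*; do
        rm -f -- "$(readlink -f -- "$link")"            # the real cache file
    done
    rm -rf -- "$old"                                    # then the bucket of links itself
fi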

Assuming this is ext2/3 have you tried adding in the indexed directories? When you have a large number of files in any particular directory the lookup will be painfully slow to delete something.
Use tune2fs -O dir_index to enable the dir_index feature.
When mounting the file system, make sure to use the noatime option, which stops the OS from updating access-time information for the directories (which it otherwise still has to modify on reads).
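On ext2/3 that looks roughly like this; the device and mount point are placeholders, and e2fsck -D has to run on an unmounted filesystem to rebuild indexes for directories that already exist:

tune2fs -O dir_index /dev/sdXN              # turn on hashed directory indexes
umount /dev/sdXN
e2fsck -fD /dev/sdXN                        # reindex directories that already exist
mount -o noatime /dev/sdXN /path/to/cache   # or add noatime to the fstab entry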
Looking at the original post it seems as though you only have 2 levels of indirection to the files, which means that you can have a huge number of files in the leaf directories. When there are more than a million entries in these you will find that searches and changes are terribly slow. An alternative is to use a deeper hierarchy of directories, reducing the number of items in any particular directory and thereby reducing the cost of searches and updates within each individual directory.
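Each extra two-character level divides the expected number of files per leaf directory by 256. A sketch of the deeper path construction (cache/ as the root is an assumption):

h=00000ae9355e59a3d8a314a5470753d8
deep="cache/${h:0:2}/${h:2:2}/${h:4:2}/$h"   # three levels instead of two
mkdir -p "$(dirname "$deep")"
mv /tmp/payload "$deep"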

Reiserfs is relatively efficient at handling small files. Did you try different Linux file systems? I'm not sure about delete performance - you can consider formatting (mkfs) as a substitute for individual file deletion. For example, you can create a different file system (cache1, cache2, ...) for each weekday.
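A sketch of the one-filesystem-per-weekday idea; the LVM-style device names, mount points and the ext3 choice are all assumptions:

# nightly: today's slot currently holds the files written 7 days ago, so wipe it before new writes
old="cache$(date +%u)"                        # cache1 .. cache7
umount "/var/cache/$old"
mkfs.ext3 -q "/dev/vg0/$old"                  # one mkfs instead of millions of unlinks
mount -o noatime "/dev/vg0/$old" "/var/cache/$old"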

How about this:
Have another folder called, say, "ToDelete"
When you add a new item, get today's date and look for a subfolder in "ToDelete" that has a name indicative of the current date
If it's not there, create it
Add a symbolic link to the item you've created in today's folder
Create a cron job that goes to the folder in "ToDelete" which is of the correct date and deletes all the items that are linked from it (see the sketch below).
Then delete the folder which contained all the links.
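A sketch of that cron job, assuming the dated folders live under cache/ToDelete/YYYY-MM-DD and anything older than 7 days should go:

#!/bin/bash
shopt -s nullglob
cutoff=$(date -d '7 days ago' +%F)
for dir in cache/ToDelete/*/; do
    day=$(basename "$dir")
    [[ "$day" < "$cutoff" ]] || continue       # ISO dates compare correctly as strings
    for link in "$dir"*; do
        rm -f -- "$(readlink -f -- "$link")"   # delete the linked cache file
    done
    rm -rf -- "$dir"                           # then the folder of links
done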

How about having a table in your database that uses the hash as the key? The other field would then be the name of the file. That way the file can be stored in a date-related fashion for fast deletion, and the database can be used to quickly find the file's location from its hash.
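A sketch of that layout, assuming MySQL; the database name (cachedb), the table and column names, and the cache/YYYY-MM-DD directory scheme are all illustrative:

mysql cachedb <<'SQL'
CREATE TABLE cache_entries (
    hash    CHAR(32)     NOT NULL PRIMARY KEY,   -- lookup key
    path    VARCHAR(255) NOT NULL,               -- where the file actually lives (date-based dir)
    created DATE         NOT NULL,
    KEY idx_created (created)                    -- makes the daily purge an index range scan
) ENGINE=InnoDB;
SQL

# daily purge: remove the expired date directories, then the matching rows
mysql -N cachedb -e "SELECT DISTINCT created FROM cache_entries WHERE created < CURDATE() - INTERVAL 7 DAY" |
while read -r day; do
    rm -rf -- "cache/$day"
done
mysql cachedb -e "DELETE FROM cache_entries WHERE created < CURDATE() - INTERVAL 7 DAY"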

Related

Disk read performance - Does splitting 100k+ files into subdirectories help read them faster?

I have 100K+ small JSON data files in one directory (not by choice). When accessing each of them, does a flat vs. pyramid directory structure make any difference? Does it help Node.js/Nginx/the filesystem retrieve them faster if the files are grouped, e.g. by first letter, into corresponding directories?
In other words, is it faster to get baaaa.json from /json/b/ (only b*.json here) than to get it from /json/ (all files), when it is safe to assume that each subdirectory contains roughly 33 times fewer files? Does it make finding each file 33x faster? Or is there any disk read difference at all?
EDIT (in response to jfriend00's comment): I am not sure what the underlying filesystem will be yet, but let's assume an S3 bucket.
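For what it's worth, bucketing an existing flat directory by first letter takes only a few lines of shell; a sketch that assumes a local filesystem rather than S3:

cd json || exit 1
for f in *.json; do
    b=${f:0:1}          # first letter becomes the bucket directory
    mkdir -p "$b"
    mv -- "$f" "$b/"
done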

Cloning only the filled portion of a read only raw data HDD source (without source partition resizing)

I often copy raw data from HDDs with FAT32 partitions at the file level. I would like to switch to bitwise cloning of this raw data, which consists of thousands of 10MiB files written sequentially across a single FAT32 partition.
The idea is, on the large archival HDD, to have a small partition which contains a shadow directory structure with symbolic links to separate raw-data image partitions. Each additional partition holds the aforementioned raw data, but is sized to only the space consumed on the source drive. The number of raw data files on each source drive can range from tens up to tens of thousands.
i.e.: [[sdx1][--sdx2--][-------------sdx3------------][--------sdx4--------][-sdx5-][...]]
Where 'sdx1' = directory of symlinks to sdx2, sdx3, sdx4, ... such that the user can browse to multiple partitions but it appears to them as if they're just in subfolders.
Optimally I'd like to find both a Linux and a Windows solution. If the process can be scripted or a software solution that exists can step through a standard workflow, that'd be best. The process is almost always 1) Insert 4 HDD's with raw data 2) Copy whatever's in them 3) Repeat. Always the same drive slots and process.
AFAIK, in order to clone a source partition without cloning all the free space, one conventionally must resize the source HDD partition first. Since I can't alter the source HDD in any way, how can I get around that?
One way would be to clone the entire source partition (incl. free space) and resize the target backup partition afterward, but that's not going to work out because of all the additional time that would take.
The goal is to retain bitwise accuracy and to save time (dd runs at about 200MiB/s whereas rsync runs at about 130MiB/s, but having to copy a ton of blank space every time makes the whole perk moot). I'd also like to run with some kind of --rescue flag so that when bad clusters are hit on the source drive it behaves like Clonezilla and simply writes ???????? in place of the bad clusters. I know I said "retain bitwise accuracy", but a bad cluster's a bad cluster.
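For the bad-cluster behaviour specifically, GNU ddrescue (a separate tool from plain dd) already works roughly like that: unreadable areas are skipped, recorded in a map file, and left unfilled (zeros) in a fresh image. A minimal sketch with placeholder device and file names; it addresses only the rescue part, not the free-space problem:

ddrescue -b 512 /dev/sdx3 sdx3.img sdx3.map      # image the partition, logging bad areas in the map file
ddrescue -b 512 -r3 /dev/sdx3 sdx3.img sdx3.map  # optional second run: retry the bad areas up to 3 times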
If you think one of the COTS or GOTS software packages like EaseUS, AOMEI, Paragon and whatnot can clone partitions as I've described, please point me in the right direction. If you think there's some way I can dd it up with a script that sizes up the source, makes a target partition of the right size, then modifies the target FAT to its correct size, chime in; I'd love many options, and so would future people with a similar use case who stumble on this thread :)
Not sure if this will fit your needs, but it is very simple.
Syncthing https://syncthing.net/ will sync the content of 2 or more folders, works on Linux and Windows.

Fastest way to sort very large files, preferably with progress

I have a 200GB flat file (one word per line) and I want to sort the file, then remove the duplicates and create one clean final TXT file out of it.
I tried sort with --parallel but it ran for 3 days and I got frustrated and killed the process, as I didn't see any changes to the chunk files it created in /tmp.
I need to see the progress somehow and make sure it's not stuck and it's working. What's the best way to do so? Are there any Linux tools or open source projects dedicated to something like this?
I don't use Linux, but if this is GNU sort, you should be able to see the temporary files it creates from another window to monitor progress. The parallel feature only helps during the initial pass that sorts and creates the initial list of temporary files. After that, the default is a 16-way merge.
Say for example the first pass is creating temp files around 1GB in size. In this case, GNU sort will end up creating 200 of these 1GB temp files before starting the merge phase. The 16-way merge means that 16 of those temp files will be merged at a time, creating temp files of size 16GB, and so on.
So one way to monitor progress is to monitor the creation of those temporary files.
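A sketch of both points, assuming GNU sort and the pv tool are available; file and directory names are placeholders:

# sort + dedupe in one pass, with a dedicated temp dir you can watch
mkdir -p /data/sort-tmp
LC_ALL=C sort -u --parallel=8 -S 40% -T /data/sort-tmp words.txt -o words.sorted.txt

# progress, option 1: pipe through pv (shows throughput for the initial read, not the merge phase)
pv words.txt | LC_ALL=C sort -u --parallel=8 -S 40% -T /data/sort-tmp -o words.sorted.txt

# progress, option 2: watch the temp files grow from another terminal
watch -n 60 'du -sh /data/sort-tmp'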

Script for selectively deleting directories by date

I'm running an sh script on my Debian machine. It runs nightly and uses rsync to create incremental backups. It saves each backup in directories named by date, so I have:
2015-07-01
2015-07-02
2015-07-03
2015-07-04
and so forth.
What I would like to be able to do is delete old copies as the list grows. Preferably I'd like to keep daily backups for the past week, and weekly backups for as long as I have space.
Which means I need to do two things:
Check the date of each folder name. If the date is not a Saturday and is older than 7 days, delete it.
Check the amount of used space on this partition (/dev/sdb1) and delete the oldest folder if the disk usage is above 75%.
I'm thinking that step 2 would perhaps need to be in a loop, so that it can delete one backup at a time, recheck the space available, and delete another folder if we're still above 75%.
I'm assuming all this is possible with bash scripts. I'm still very new to them, but from what I've found whilst googling around it should be pretty straightforward for someone who knows what they are doing. I'm just having trouble figuring out how to piece the elements together.
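A minimal sketch of those two steps; the 75% figure, the keep-Saturdays rule and /dev/sdb1 come from the question, while the /backups root and date format are assumptions:

#!/bin/bash
cd /backups || exit 1

# 1) drop daily backups older than 7 days unless they were taken on a Saturday
today=$(date +%s)
for dir in 20??-??-??; do
    [[ -d $dir ]] || continue
    age_days=$(( (today - $(date -d "$dir" +%s)) / 86400 ))
    dow=$(date -d "$dir" +%u)                 # %u: 6 = Saturday
    (( age_days > 7 && dow != 6 )) && rm -rf -- "$dir"
done

# 2) while the partition is more than 75% full, delete the oldest remaining backup
while (( $(df --output=pcent /dev/sdb1 | tail -1 | tr -dc '0-9') > 75 )); do
    oldest=$(ls -d 20??-??-?? 2>/dev/null | sort | head -1)
    [[ -n $oldest ]] || break
    rm -rf -- "$oldest"
done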
Here is my old script, which I no longer use since I migrated to rsnapshot. It has some hardcoded strings, but I hope you can modify it for your needs. Also, it measures free space in gigabytes, not percent. With rsnapshot I do not need it anymore for my purposes.

Windows Azure Cloud Storage - Impact of huge number of files in root

Sorry if I get any of the terminology wrong here, but hopefully you will get what I mean.
I am using Windows Azure Cloud Storage to store a vast quantity of small files (images, 20Kb each).
At the minute, these files are all stored in the root directory. I understand it's not a normal file system, so maybe root isn't the correct term.
I've tried to find information on the long-term effects of this plan but with no luck so if any one can give me some information I'd be grateful.
Basically, am I going to run into problems if the numbers of files stored in this root end up in the hundreds of thousands/millions?
Thanks,
Steven
I've been in a similar situation where we were storing ~10M small files in one blob container. Accessing individual files through code was fine and there weren't any performance problems.
Where we did have problems was with managing that many files outside of code. If you're using a storage explorer (either the one that comes with VS2010 or any one of the others), the ones I've encountered don't support the "return files by prefix" API; you can only list the first 5K, then the next 5K, and so on. You can see how this might be a problem when you want to look at the 125,000th file in the container.
The other problem is that there is no easy way of finding out how many files are in your container (which can be important for knowing exactly how much all of that blob storage is costing you) without writing something that simply iterates over all the blobs and counts them.
This was an easy problem to solve for us, as our blobs had sequential numeric names, so we simply partitioned them into folders of 1K items each. Depending on how many items you've got, you can group 1K of these folders into subfolders.
http://social.msdn.microsoft.com/Forums/en-US/windowsazure/thread/d569a5bb-c4d4-4495-9e77-00bd100beaef
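For what it's worth, the same prefix partitioning also helps on the management side with today's tooling. A sketch that assumes the modern Azure CLI (which postdates this answer) and placeholder account/container names:

# list just one 1K "folder" of blobs instead of paging through the whole container
az storage blob list \
    --account-name mystore \
    --container-name images \
    --prefix "000125/" \
    --output table
    # (authentication via --account-key, --connection-string or a prior az login is omitted here)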
Short Answer: No
Medium Answer: Kindof?
Long Answer: No, but if you query for a file list it will only return 5000. You'll need to re-query every 5K to get a full listing, according to that MSDN page.
Edit: Root works fine for describing it. 99.99% of people will grok what you're trying to say.
