How to traverse an ext3 disk with 500GB of data - linux

There are ~10 million files on a disk (not under the same directory).
I want to get [(file_name, file_size, file_atime)] of all files.
But the command
find /data -type f -printf "%p\t%A@\t%s\n"
is hopelessly slow and drives I/O %util to ~100%.
Any advice?

Not much you can do.
Check if you are using directory indexes (dir_index).
If you are desperate you can use debugfs and read the data raw, but I would not recommend it.
You can also buy an SSD - the slowness is probably from seeking, so if you do this often an SSD will speed things up quite a bit.
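As a quick check, here is a sketch (it assumes the filesystem lives on /dev/sda1; adjust to your device) that verifies dir_index is enabled and reruns the scan at idle I/O priority so it does not starve other workloads:
sudo tune2fs -l /dev/sda1 | grep -o dir_index
ionice -c3 nice find /data -type f -printf "%p\t%A@\t%s\n" > filelist.tsv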

Related

Should content-addressable paths be used in ext4 or btrfs for directories?

I tested this by comparing the speed of reading a file from a directory with 500,000 and a directory with just 100 files.
The result: Both were equally fast.
Test details:
I created a directory with 500,000 files (for x in {1..500000}; do touch $x; done), ran time cat test-dir/some-file, and compared this to another directory with just 100 files.
Both executed equally fast, but maybe under heavy load there's a difference, or are ext4 and btrfs clever enough that we don't need content-addressable paths anymore?
With content-addressable paths I could distribute the 500,000 files into multiple subdirectories like this:
/www/images/persons/a/1/john.png
/www/images/persons/a/2/henrick.png
....
/www/images/persons/b/c/frederick.png
...
The 500,000 files are served via nginx to user agents, so I want to avoid latency, but maybe that is no longer relevant with ext4 or btrfs?
Discussing this question elsewhere, the answer seems to be that for read operations you don't need to implement content-addressable storage, because modern filesystems don't iterate over the directory's lookup table; the filesystem finds the file's location directly.
With ext4, the only limitation is the number of inodes.
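For what it's worth, if you do decide to fan files out, the bucket can be derived from a hash of the name. A minimal sketch (the paths and the two-level scheme are just examples):
name=john.png
hash=$(printf '%s' "$name" | md5sum | cut -c1-2)   # first two hex characters of the md5
dir=/www/images/persons/${hash:0:1}/${hash:1:1}
mkdir -p "$dir" && mv "$name" "$dir/"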

caching on ramdisk - finding stalest file to delete

I have a nice caching system in linux that uses a ramdisk to cache both image files and the HTML output of various pages of my website.
My website is rather large and the ramdisk space required to cache everything exceeds 15GB (excluding image output) and I only have 2GB available for the cache.
Writing to and reading from the cache is relatively fast, but the problem is figuring out how to quickly find the stalest file(s) when I run out of space, in order to make room for a new file. I believe running "ls -R" and scanning the large output is a slow process.
My only other option which is inefficient to me is to flush the entire cache frequently in order to never run out of ramdisk space.
My cache allows my website to load many pages with a time to first byte (TTFB) of under 200ms, which is what Google likes, so I want to keep 200ms as the maximum TTFB when loading a file from cache, even if files have to be deleted because the ramdisk runs out of space.
I thought of accessing memory directly via pointers for the cache, but because the cached output varies in size, I feel that option would waste memory at best or use a lot of CPU to find the next free memory location.
Anyone got an idea on how I can quickly seek and then remove the stalest file from my cache?
ls -latr should not be slow when working with a ramdisk, but this may be closer to what you are looking for:
find -type f -printf '%T+ %p\n' | sort | head -1
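Building on that, a rough eviction loop (a sketch; it assumes the ramdisk is mounted at /mnt/ramcache and that GNU find is available) that deletes the stalest files until at least 100 MB is free:
while [ "$(df -Pm /mnt/ramcache | awk 'NR==2{print $4}')" -lt 100 ]; do
    # oldest file by modification time, epoch-sorted
    oldest=$(find /mnt/ramcache -type f -printf '%T@ %p\n' | sort -n | head -1 | cut -d' ' -f2-)
    [ -n "$oldest" ] || break    # cache is empty, nothing left to evict
    rm -f -- "$oldest"
done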

In Linux, how to (quickly) get a list of all files in a directory, with their file sizes

I need to get a list of all the files, along with their sizes, in Linux.
The filesystem is ext4, running over USB on a machine with very little RAM.
The functions I'm using are these - is there a better technique?
a) opendir()
b) readdir()
c) stat()
I believe I'm getting hit pretty hard by the stat() call; I don't have much RAM and the HD is USB-connected.
Is there a way to say
"give me all the files in the directory, along with the file sizes"? My guess is that I'm being hurt because stat() needs to query the inode for the size, leading to lots of seeks.
No, not really. If you don't want to hit the disk, you would need to have the inodes cached in memory. That's the tradeoff in this case.
You could try tuning inode_readahead_blks and vfs_cache_pressure though.
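If you want to experiment with those knobs, here is a sketch (the device name sdb1 is just an example; the exact sysfs path depends on your ext4 partition):
# read more inode table blocks per metadata miss
echo 64 | sudo tee /sys/fs/ext4/sdb1/inode_readahead_blks
# make the kernel hold on to inode/dentry caches longer (default is 100)
sudo sysctl vm.vfs_cache_pressure=50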
Use cd to get to the directory you want to get the size of the files and directories for, then type:
du -sh *
This will give you all the sizes of the files and directories.
The -s gives a total for each argument, -h makes the output human-readable. The * means it will include all the contents of that directory.
Type man du to find out more about the du command (q or Ctrl-c to exit)

How to speed up reading of a fixed set of small files on linux?

I have 100,000 1kB files, and a program that reads them - it is really slow.
My best idea for improving performance is to put them on ramdisk.
But this is a fragile solution; every restart requires setting up the ramdisk again.
(and copying the files is slow as well)
My second best idea is to concatenate the files and work with that. But it is not trivial.
Is there a better solution?
Note: I need to avoid dependencies in the program, even Boost.
You can optimize by storing the files contiguously on disk.
On a disk with ample free room, the easiest way would be to read a tar archive instead.
Other than that, there is/used to be a debian package for 'readahead'.
You can use that tool to
profile a normal run of your software and
edit the list of files accessed (detected by readahead).
You can then call readahead with that file list (it will order the files in disk order, so throughput is maximized and seek times are minimized).
Unfortunately, it has been a while since I used these, so I hope you can google your way to the respective packages.
This is what I seem to have found now:
sudo apt-get install readahead-fedora
Good luck
If your files are static, I agree: just tar them up and place the archive on a RAM disk. It would probably be faster to read directly out of the tar file, but you can test that.
Edit: instead of tar, you could also try creating a squashfs volume.
If you don't want to do that, or still need more performance then:
put your data on an SSD.
start running some FS performance tests, starting with ext4, XFS, etc.
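A sketch of the tar/squashfs approaches (paths are examples; squashfs-tools must be installed):
# pack once, then mount the image read-only - the kernel caches it efficiently
mksquashfs /path/to/small-files files.squashfs
sudo mount -o loop -t squashfs files.squashfs /mnt/files
# or stream everything through one sequential tar pass into tmpfs
mkdir -p /dev/shm/files
tar -cf - -C /path/to/small-files . | tar -xf - -C /dev/shm/files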

Storing & accessing up to 10 million files in Linux

I'm writing an app that needs to store lots of files up to approx 10 million.
They are presently named with a UUID and are going to be around 4MB each but always the same size. Reading and writing from/to these files will always be sequential.
2 main questions I am seeking answers for:
1) Which filesystem would be best for this. XFS or ext4?
2) Would it be necessary to store the files beneath subdirectories in order to reduce the numbers of files within a single directory?
For question 2, I note that people have attempted to discover the limit on the number of files XFS can store in a single directory and haven't found one; it exceeds millions, and they noted no performance problems. What about under ext4?
Googling around for people doing similar things, some suggested storing the inode number as a link to the file instead of the filename, for performance (this would be in a database index, which I'm also using). However, I don't see a usable API for opening a file by inode number. That seemed to be more of a suggestion for improving performance under ext3, which I am not intending to use anyway.
What are the ext4 and XFS limits? What performance benefits are there from one over the other and could you see a reason to use ext4 over XFS in my case?
You should definitely store the files in subdirectories.
EXT4 and XFS both use efficient lookup methods for file names, but if you ever need to run tools over the directories, such as ls or find, you will be very glad to have the files in manageable chunks of 1,000 - 10,000 files.
The inode number thing is to improve the sequential access performance of the EXT filesystems. The metadata is stored in inodes and if you access these inodes out of order then the metadata accesses are randomized. By reading your files in inode order you make the metadata access sequential too.
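As a concrete illustration of the inode-order trick, a sketch (the directory path and the processing step are hypothetical, and it assumes file names contain no spaces):
ls -i /data/files | sort -n | awk '{print $2}' | while read -r f; do
    process_file "/data/files/$f"   # hypothetical per-file work, done in inode order
done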
Modern filesystems will let you store 10 million files all in the same directory if you like. But tools (ls and its friends) will not work well.
I'd recommend a single level of directories, a fixed number, perhaps 1,000 of them, and putting the files in there (10,000 files per directory is tolerable to the shell and to ls).
I've seen systems which create many levels of directories; this is truly unnecessary, increases inode consumption, and makes traversal slower.
10M files should not really be a problem either, unless you need to do bulk operations on them.
I expect you will need to prune old files, but something like "tmpwatch" will probably work just fine with 10M files.
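For the directory layout itself, one simple scheme is to bucket by a prefix of the UUID. A sketch (the store root and prefix length are just examples):
uuid=$(uuidgen)
bucket=${uuid:0:3}              # 3 hex chars = 4,096 buckets, ~2,500 files each at 10M files
mkdir -p "/data/store/$bucket"
cp big-object.bin "/data/store/$bucket/$uuid"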

Resources