Estimation or measurement of amount of iops to create a file - linux

I'd like to know how many I/O operations (IOPs) it takes to create an empty file. I am interested in Linux and the GFS file system, but information about other file systems is also very welcome.
Suggestions on how to measure this accurately would also be very welcome.
Real scenario (requested by answers):
Linux
GFS file system (if you can estimate for another - pls do)
create a new file in existing directory, the file does not exist,
using the following code
Assume directory is in cache and directory depth is D
Code:
int fd = open("/my_dir/1/2/3/new_file", O_CREAT | O_WRONLY, S_IRWXU);
// assuming fd is valid
fsync(fd);

For an artificial measurement:
Create a blank filesystem on its own block device (e.g. a VMware virtual SCSI disk).
Mount it, call sync(), then record the I/O count reported for that block device.
Run your test program against the filesystem, and do no further operations (not even "ls").
Wait until all unflushed blocks have been flushed, say about 1 minute.
Snapshot the I/O count again; the difference is the cost of the operation.
Of course this is highly unrealistic, because if you created two files rather than one, you'd probably find that it took fewer than twice as many operations.
Also creating empty or blank files is unrealistic - as they don't do anything useful.
Directory structures (how deep the directories are, how many entries) might contribute, but also how fragmented it is and other arbitrary factors.
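The before/after snapshot described above could be taken from the per-device counters Linux exposes in /sys/block/<dev>/stat. The following is a minimal sketch, assuming that file's usual layout in which the first field is completed reads and the fifth is completed writes; the names io_counts, parse_block_stat and io_delta are made up for illustration.

```c
#include <stdio.h>

/* Cumulative I/O counters for one block device. */
struct io_counts {
    unsigned long long reads;   /* read requests completed */
    unsigned long long writes;  /* write requests completed */
};

/*
 * Parse one line in the format of /sys/block/<dev>/stat.
 * Fields 1 and 5 are completed reads and completed writes.
 * Returns 0 on success, -1 if the line does not match.
 */
int parse_block_stat(const char *line, struct io_counts *out)
{
    unsigned long long f[5];

    if (sscanf(line, "%llu %llu %llu %llu %llu",
               &f[0], &f[1], &f[2], &f[3], &f[4]) != 5)
        return -1;
    out->reads = f[0];
    out->writes = f[4];
    return 0;
}

/* Total requests issued between two snapshots. */
unsigned long long io_delta(const struct io_counts *before,
                            const struct io_counts *after)
{
    return (after->reads - before->reads) +
           (after->writes - before->writes);
}
```

Read the stat file once after the initial sync(), once after the flush delay, and pass both parsed snapshots to io_delta.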

The nature of the answer to this question is: best case, normal case, worst case. There is no single answer, because the number of IOPs required varies with the current state of the file system. (A pristine file system is a highly unrealistic scenario.)
Taking FAT32 as an example, the best case is 1. The normal case depends on the degree of file system fragmentation and the directory depth of the pathname for the new file. The worst case is unbounded (except by the size of the file system, which imposes a limit on the maximum possible number of IOPs to create a file).
Really, the question is not answerable, unless you define a particular file system scenario.

I did the following measurement: we wrote an application that creates N files as described in the question.
We ran this application on a disk devoted to this application only, and measured the number of I/O operations using iostat -x 1.
The result, on GFS and Linux kernel 2.6.18, is 2 I/O operations per file creation.
This answer is based on MarkR's answer.
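The test application itself is not shown; a minimal sketch of what such a program might do, assuming the open-then-fsync pattern from the question, is below. The function name create_empty_files and the file_<i> naming scheme are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create n empty files named file_0 .. file_<n-1> in dir, calling
 * fsync() on each so the creation reaches the disk before moving on.
 * Returns the number of files created and synced.
 */
int create_empty_files(const char *dir, int n)
{
    char path[4096];
    int created = 0;

    for (int i = 0; i < n; i++) {
        snprintf(path, sizeof path, "%s/file_%d", dir, i);
        /* O_EXCL makes the call fail if the file already exists. */
        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, S_IRWXU);
        if (fd < 0)
            continue;
        if (fsync(fd) == 0)
            created++;
        close(fd);
    }
    return created;
}
```

Dividing the I/O count observed during the run by the return value gives a per-file figure comparable to the one reported above.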

Related

Ext4 on magnetic disk: Is it possible to process an arbitrary list of files in a seek-optimized manner?

I have a deduplicated storage of some million files in a two-level hashed directory structure. The filesystem is an ext4 partition on a magnetic disk. The path of a file is computed by its MD5 hash like this:
e93ac67def11bbef905a7519efbe3aa7 -> e9/3a/e93ac67def11bbef905a7519efbe3aa7
When processing* a list of files sequentially (selected by metadata stored in a separate database), I can literally hear the noise produced by the seeks ("randomized" by the hashed directory layout as I assume).
My actual question is: Is there a (generic) way to process a potentially long list of potentially small files in a seek-optimized manner, given they are stored on an ext4 partition on a magnetic disk (implying the use of linux)?
Such optimization is of course only useful if there is a sufficient share of small files. So please don't care too much about the size distribution of files. Without loss of generality, you may actually assume that there are only small files in each list.
As a potential solution, I was thinking of sorting the files by their physical disk locations or by other (heuristic) criteria that can be related to the total amount and length of the seek operations needed to process the entire list.
A note on file types and use cases for illustration (if need be)
The files are a deduplicated backup of several desktop machines. So any file you would typically find on a personal computer will be included on the partition. The processing however will affect only a subset of interest that is selected via the database.
Here are some use cases for illustration (list is not exhaustive):
extract metadata from media files (ID3, EXIF etc.) (files may be large, but only some small parts of the files are read, so they become effectively smaller)
compute smaller versions of all JPEG images to process them with a classifier
reading portions of the storage for compression and/or encryption (e.g. put all files newer than X and smaller than Y in a tar archive)
extract the headlines of all Word documents
recompute all MD5 hashes to verify data integrity
While researching for this question, I learned of the FIBMAP ioctl command (e.g. mentioned here) which may be worth a shot, because the files will not be moved around and the results may be stored along the metadata. But I suppose that will only work as sort criterion if the location of a file's inode correlates somewhat with the location of the contents. Is that true for ext4?
*) i.e. opening each file and reading the head of the file (arbitrary number of bytes) or the entire file into memory.
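The sorting idea could be sketched roughly as follows, assuming the first data block of each file is a good enough proxy for its position on disk. FIBMAP normally requires root (CAP_SYS_RAWIO) and reports the block number in an int, so very large filesystems may not be representable; the struct and function names are invented for this illustration.

```c
#include <fcntl.h>
#include <linux/fs.h>   /* FIBMAP */
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* A file path paired with the physical block of its first data block. */
struct located_file {
    const char *path;
    long block;   /* -1 when the location could not be determined */
};

/*
 * Ask the filesystem where logical block 0 of the file lives.
 * Returns the physical block number, or -1 on any failure.
 */
long first_physical_block(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int blk = 0;                       /* in: logical block, out: physical */
    int rc = ioctl(fd, FIBMAP, &blk);
    close(fd);
    return rc == 0 ? (long)blk : -1;
}

/* qsort comparator: ascending physical block. */
static int by_block(const void *a, const void *b)
{
    long x = ((const struct located_file *)a)->block;
    long y = ((const struct located_file *)b)->block;
    return (x > y) - (x < y);
}

/* Order the list by ascending physical block before processing it. */
void sort_by_location(struct located_file *files, size_t n)
{
    qsort(files, n, sizeof files[0], by_block);
}
```

Since the files are never moved, the block numbers could be looked up once and stored next to the other metadata in the database.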
A file (especially when it is large enough) is scattered over several blocks on the disk (see e.g. the figure on the ext2 Wikipedia page; it is still somewhat relevant for ext4, even if the details differ). More importantly, it could be in the page cache (and so would not require any disk access). So "sorting the file list by disk location" usually does not make any sense.
I recommend instead improving the code accessing these files. Look into system calls like posix_fadvise(2) and readahead(2).
If the files are really small (only hundreds of bytes each), it is probable that using something else (e.g. sqlite or some real RDBMS like PostgreSQL, or gdbm ...) could be faster.
BTW, adding more RAM would enlarge the page cache and so improve the overall experience. Replacing your HDD with an SSD would also help.
(see also linuxatemyram)
Is it possible to sort a list of files to optimize read speed / minimize seek times?
That is not really possible. File system fragmentation is not (in practice) important with ext4. Of course, backing up all your file system (e.g. in some tar or cpio archive) and restoring it sequentially (after making a fresh file system with mkfs) might slightly lower fragmentation, but not that much.
You might optimize your file system settings (block size, cluster size, etc... e.g. various arguments to mke2fs(8)). See also ext4(5).
Is there a (generic) way to process a potentially long list of potentially small files in a seek-optimized manner.
If the list is not too long (otherwise, split it in chunks of several hundred files each), you might open(2) each file there and use readahead(2) on each such file descriptor (and then close(2) it). This would somehow prefill your page cache (and the kernel could reorder the required IO operations).
(I don't know how effective that is in your case; you need to benchmark.)
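A minimal sketch of that prefetching pass, using posix_fadvise(2) with POSIX_FADV_WILLNEED rather than readahead(2) because it is the portable of the two calls mentioned earlier. The function name prefetch_files is made up, and whether the hint actually reduces seeking is something only a benchmark can tell.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hint to the kernel that each file in paths will be read soon.
 * POSIX_FADV_WILLNEED with offset 0 and length 0 covers the whole file.
 * Returns the number of files for which the hint was accepted.
 */
size_t prefetch_files(const char *const *paths, size_t n)
{
    size_t hinted = 0;

    for (size_t i = 0; i < n; i++) {
        int fd = open(paths[i], O_RDONLY);
        if (fd < 0)
            continue;               /* skip unreadable files */
        /* posix_fadvise returns 0 on success, an error number otherwise. */
        if (posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED) == 0)
            hinted++;
        close(fd);
    }
    return hinted;
}
```

Call it on one chunk of the list, process that chunk, then move on to the next chunk so the prefetched pages are still cached when they are read.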
I am not sure there is a software solution to your issue. Your problem is likely IO-bound, so the bottleneck is probably the hardware.
Notice that on most current hard disks, the block addressing used by the kernel (LBA, historically CHS) is a "logical" addressing handled by the disk controller and is not much related to physical geometry any more. Read about LBA, TCQ, NCQ: today, the kernel has no direct influence on the actual mechanical movements of a hard disk head. Much of the I/O scheduling happens in the hard disk itself rather than in the kernel.

If the size of the file exceeds the maximum size of the file system, what happens?

For example, on a FAT32 partition the maximum file size is 4GB, but I was able to create a 5GB file with vim. I saved the file and opened it again, and the console output was broken like a staircase. I have three questions.
If the size of the file exceeds the maximum size of the file system, what happens?
In my case, why did the output break?
In Unix, the stat() system call can succeed for files up to 2GB (2^31 - 1). Does this have anything to do with the file system? Is there a relationship between the limits of data in stat() and the limits of each feature in the file system?
If the size of the file exceeds the maximum size of the file system, what happens?
By definition, that can never happen. What really happens is that some system call (probably write(2) ...) fails, and the code making that call should handle the failure.
Notice that FAT32 filesystems restrict the maximal size of a file to 4 gigabytes (minus one byte). Use a better file system on your USB key if you want more (or split(1) large files into smaller chunks before copying them to your FAT32-formatted USB key).
If using <stdio.h> notice that fflush(3), fprintf(3), fclose(3) (and most other standard functions) can fail (e.g. because they will do some failing write(2)).
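A minimal sketch of what "taking care of that case" might look like at the write(2) level; write_all is a hypothetical helper name. When a write would push a file past the maximum size, write(2) fails with errno set to EFBIG.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write all len bytes of buf to fd, retrying on short writes and EINTR.
 * Returns 0 on success, or -1 with errno set (for example EFBIG when the
 * file would exceed the maximum size, or ENOSPC when the disk is full).
 */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted, try again */
            return -1;             /* real failure: caller must handle it */
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

A program that ignores the -1 return value can appear to save a file successfully even though the data never reached the disk.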
the console output was broken like a staircase
probably because your pseudoterminal was in some broken state. See stty(1), reset(1), termios(3) and read the tty demystified.
In Unix system call, stat() can succeed up to a 2GB(2^31 - 1)
You are misunderstanding stat(2). Read its documentation again.
Read Advanced Linux Programming then syscalls(2).
I was able to create a 5GB file with vim
To understand the behavior of vim read first its documentation then study its source code (it is free software, and you can and perhaps should study its code).
You could also use strace(1) to understand what system calls are done by some command or process.

linux 2.6.43, ext3, 10K RPM SAS disk: two sequential direct I/O writes to different files behave like random writes

I recently got stuck on this problem:
"Two sequential writes (direct I/O, 4KB-aligned blocks) to different files behave like random writes, which yields poor write performance on a 10K RPM SAS disk."
What confuses me most: I have a batch of servers, all equipped with the same kind of disk (RAID 1 with two 300GB 10K RPM disks), but they respond differently.
Several servers seem fine with this write pattern; the disk happily accepts 50+ MB/s
(same kernel version, same filesystem, different lib (libc 2.4)).
Others not so much; about 100 op/s seems to be the limit of the underlying disk, which matches the disk's random-write performance
(same kernel version, same filesystem, different lib (libc 2.12)).
[NOTE: I checked the "pwrite" code of the different libc versions, which is nothing but a simple syscall wrapper.]
I have managed to rule out these possibilities:
1. A software bug in my own program:
I reproduced it with a simple daemon (compiled without dynamic linking) that does sequential direct I/O writes.
2. A disk problem:
I switched between 2 different versions of the Linux system on one test machine, which performed well with my direct I/O write pattern; a couple of days after switching to the old lib version, the bad random-write behaviour appeared.
I tried to compare:
/sys/block/sda/queue/*, which may differ between the two;
filefrag, which shows nothing but the physical block IDs of the two files growing sequentially and interleaved.
There must be some kind of write strategy leading to this problem, but I don't know where to start:
A different kernel setting? Maybe something related to how ext3 allocates disk blocks?
The RAID cache (write-back) or the disk cache write strategy?
Or the underlying disk's strategy for mapping logical blocks to real physical blocks?
I'd really appreciate any help.
THE ANSWER IS:
It is because of the /sys/block/sda/queue/scheduler setting:
MACHINE A: the displayed scheduler is cfq, but underneath it is actually deadline;
MACHINE B: the scheduler really is cfq, consistent with what is displayed.
//=>
Since my server is a database server, deadline is my best option.
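For comparing machines, the displayed scheduler can be extracted from the contents of that sysfs file, which lists the available schedulers on one line and marks the active one with square brackets. This is a minimal sketch with a hypothetical function name; note that it only reports what the file displays, which on machine A did not match the real behaviour.

```c
#include <string.h>

/*
 * The scheduler file lists the available schedulers on one line and
 * marks the active one with brackets, e.g. "noop deadline [cfq]".
 * Copy the active name into out; return 0 on success, -1 otherwise.
 */
int active_scheduler(const char *line, char *out, size_t outlen)
{
    const char *lb = strchr(line, '[');
    if (lb == NULL)
        return -1;
    const char *rb = strchr(lb, ']');
    if (rb == NULL)
        return -1;
    size_t len = (size_t)(rb - lb - 1);
    if (len + 1 > outlen)
        return -1;                 /* name does not fit in out */
    memcpy(out, lb + 1, len);
    out[len] = '\0';
    return 0;
}
```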

Is it OK (performance-wise) to have hundreds or thousands of files in the same Linux directory?

It's well known that in Windows a directory with too many files will have a terrible performance when you try to open one of them. I have a program that is to execute only in Linux (currently it's on Debian-Lenny, but I don't want to be specific about this distro) and writes many files to the same directory (which acts somewhat as a repository). By "many" I mean tens each day, meaning that after one year I expect to have something like 5000-10000 files. They are meant to be kept (once a file is created, it's never deleted) and it is assumed that the hard disk has the required capacity (if not, it should be upgraded). Those files have a wide range of sizes, from a few KB to tens of MB (but not much more than that). The names are always numeric values, incrementally generated.
I'm worried about long-term performance degradation, so I'd ask:
Is it OK to write all to the same directory? Or should I think about creating a set of subdirectories for every X files?
Should I require a specific filesystem to be used for such directory?
What would be the more robust alternative? Specialized filesystem? Which?
Any other considerations/recommendations?
It depends very much on the file system.
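ext2 and ext3 have a hard limit of 32,000 links per inode, which caps a directory at roughly 32,000 subdirectories (regular files are not counted against it). That is more than you are asking about, but I would still not design right up against such limits. Also, unless directory indexing (dir_index) is enabled, ext2 and ext3 perform a linear scan every time you access a file by name in the directory.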
ext4 supposedly fixes these problems, but I cannot vouch for it personally.
XFS was designed for this sort of thing from the beginning and will work well even if you put millions of files in the directory.
So if you really need a huge number of files, I would use XFS or maybe ext4.
Note that no file system will make "ls" run fast if you have an enormous number of files (unless you use "ls -f"), since "ls" will read the entire directory and then sort the names. A few tens of thousands is probably not a big deal, but a good design should scale beyond what you think you need at first glance...
For the application you describe, I would probably create a hierarchy instead, since it is hardly any additional coding or mental effort for someone looking at it. Specifically, you can name your first file "00/00/01" instead of "000001".
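The naming scheme could be sketched as below, assuming six-digit ids split into three two-digit levels so that no directory holds more than 100 entries. The function name id_to_path is made up, and the intermediate directories still have to be created before the file is opened.

```c
#include <stdio.h>

/*
 * Map a numeric id below 1,000,000 to a three-level path such as
 * "00/00/01": two decimal digits per level, at most 100 entries each.
 * Returns 0 on success, -1 if the id or the buffer is out of range.
 */
int id_to_path(unsigned long id, char *out, size_t outlen)
{
    if (id >= 1000000UL)
        return -1;
    int n = snprintf(out, outlen, "%02lu/%02lu/%02lu",
                     id / 10000, (id / 100) % 100, id % 100);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```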
If you use a filesystem without directory-indexing, then it is a very bad idea to have lots of files in one directory (say, > 5000).
However, if you've got directory indexing (which is enabled by default on more recent distros in ext3), then it's not such a problem.
However, it does break quite a few tools to have many files in one directory (For example, "ls" will stat() all the files, which takes a long time). You can probably easily split it into subdirectories.
But don't overdo it. Don't use many levels of nested subdirectory unnecessarily, this just uses lots of inodes and makes metadata operations slower.
I've seen more cases of "too many levels of nested directories" than I've seen of "too many files per directory".
The best solution I have for you (rather than quoting some values from a micro-filesystem-benchmark) is to test it yourself.
Just use the file system of your choice. Create some random test data for 100, 1000 and 10000 entries. Then, measure the time it takes your system to perform the action you are concerned about time-wise (opening a file, reading 100 random files, etc).
Then, you compare the times and use the best solution (put them all into one directory; put each year into a new directory; put each month of each year into a new directory).
I do not know in detail what you are using, but creating a directory is a one time (and probably quite easy) operation, so why not do it instead of changing filesystems or trying some other more time-consuming stuff?
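One way to take such a measurement is sketched below: it times a batch of stat() lookups by name in a directory that has already been filled with test files. The file_<i> naming scheme and the function name are assumptions of this sketch, and repeated runs will mostly hit the kernel's caches, so the first (cold) run is the interesting one.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Time n_lookups stat() calls on files named file_0 .. file_<n_files-1>
 * inside dir (the naming scheme is an assumption of this sketch).
 * Returns the elapsed wall-clock time in seconds, or -1.0 on error.
 */
double time_lookups(const char *dir, int n_files, int n_lookups)
{
    char path[4096];
    struct stat st;
    struct timespec t0, t1;

    if (n_files <= 0 || clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;
    for (int i = 0; i < n_lookups; i++) {
        /* Stride through the names so lookups are not sequential. */
        int k = (int)(((long long)i * 7919) % n_files);
        snprintf(path, sizeof path, "%s/file_%d", dir, k);
        if (stat(path, &st) != 0)
            return -1.0;
    }
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Run it against directories holding 100, 1000 and 10000 entries and compare the times per lookup.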
In addition to the other answers, if the huge directory is managed by a known application or library, you could consider replacing it by something else, e.g:
a GDBM index file; GDBM is a very common library providing indexed files, which associate an arbitrary key (a sequence of bytes) with an arbitrary value (another sequence of bytes).
perhaps a table inside a database like MySQL or PostgreSQL. Be careful about indexing.
some other way to index data
The advantages of the above approaches include:
space performance for a large collection of small items (less than a kilobyte each). A filesystem needs an inode for each item. Indexed systems may have a much finer granularity
time performance: you don't access the filesystem for every item
scalability: indexed approaches are designed to fit large needs: either a GDBM index file, or a database can handle many millions of items. I'm not sure your directory approach will scale as easily.
The disadvantage of such approaches is that the items don't show up as files. But as MarkR's answer reminds you, ls behaves quite poorly on huge directories.
If you stick to a filesystem approach, many programs that use a large number of files organize them in subdirectories like aa/ ab/ ac/ ...ay/ az/ ba/ ... bz/ ...
Is it OK to write all to the same directory? Or should I think about creating a set of subdirectories for every X files?
In my experience the only slow down a directory with many files will give is if you do things such as getting a listing with ls. But that mostly is the fault of ls, there are faster ways of listing the contents of a directory using tools such as echo and find (see below).
Should I require a specific filesystem to be used for such directory?
I don't think so with regards to amount of files in one directory. I am sure some filesystems perform better with many small files in one dir whilst others do a better job on huge files. It's also a matter of personal taste, akin to vi vs. emacs. I prefer to use the XFS filesystem so that'd be my advice. :-)
What would be the more robust alternative? Specialized filesystem? Which?
XFS is definitely robust and fast. I use it in many places: as boot partition, for Oracle tablespaces, as space for source control, you name it. It lacks a bit on delete performance, but otherwise it's a safe bet. Plus it supports growing the size whilst it is still mounted (that's a requirement actually). That is, you just delete the partition, recreate it at the same starting block with an ending block that's larger than that of the original partition, then you run xfs_growfs on it with the filesystem mounted.
Any other considerations/recommendations?
See above. With the addition that having 5000 to 10000 files in one directory should not be a problem. In practice it doesn't arbitrarily slow down the filesystem as far as I know, except for utilities such as "ls" and "rm". But you could do:
find . -type f -print0 | xargs -0 echo
find . -type f -print0 | xargs -0 rm
The benefit that a directory tree with files, such as directory "a" for file names starting with an "a" etc., will give you is that of looks, it looks more organised. But then you have less of an overview... So what you're trying to do should be fine. :-)
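For a program that needs to walk a huge directory itself, reading the entries directly avoids the sorting that makes ls slow. The following is a minimal sketch that counts entries in whatever order the filesystem returns them and makes no stat() call per name; count_entries is a hypothetical name.

```c
#include <dirent.h>
#include <string.h>

/*
 * Count the entries of a directory in the order the filesystem
 * returns them, without sorting and without stat() on each name.
 * Returns the number of entries excluding "." and "..", or -1 on error.
 */
long count_entries(const char *dir)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    long count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        count++;
    }
    closedir(d);
    return count;
}
```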
I neglected to say you could consider using something called "sparse files" http://en.wikipedia.org/wiki/Sparse_file
It is bad for performance to have a huge number of files in one directory. Checking for the existence of a file will typically require an O(n) scan of the directory. Creating a new file will require that same scan with the directory locked to prevent the directory state changing before the new file is created. Some file systems may be smarter about this (using B-trees or whatever), but the fewer ties your implementation has to the filesystem's strengths and weaknesses the better for long term maintenance. Assume someone might decide to run the app on a network filesystem (storage appliance or even cloud storage) someday. Huge directories are a terrible idea when using network storage.

how does kernel handle new file creation

I wish to understand how the kernel works when a user/app tries to create a file in a directory.
The background: we have a Java application which consumes messages over JMS, processes them and then writes the XML to an outbound queue plus a local directory. Yesterday we observed unusual delays in writing to the directory. Running 'ls|wc -l' we found more than 300,000 files in there. I did a quick strace on the process and found it full of mutex calls (more than 3/4 of the calls in the strace were mutex calls).
So I thought that new file creation was taking time because the system has to check certain things every time (e.g. the names of the existing files, to make sure that a new file with a specific name can be created) amongst 300,000 files before it creates the file.
I cleared the directory and the application resumed normal service levels.
My questions
Was my analysis correct? (It seems so, because the app started working fine after the clean-up.)
More important, how does the kernel work when you try to create a new file in a directory?
Can the abnormal number of mutex calls be attributed to the high number of files in the directory?
Many thanks
J
Please read about the Linux Filesystem, i-nodes and d-nodes.
http://en.wikipedia.org/wiki/Inode_pointer_structure
The file system is organized into fixed-sized blocks. If your directory is relatively small, it fits in the direct blocks and things are fast. If your directory is not too big, it fits in the direct blocks and some indirect blocks, and is still reasonably fast. If your directory becomes too big, it spills into double indirect blocks and becomes slow.
Actual sizes depend on file system and kernel configuration.
Rule of thumb is to keep the directory under 12 blocks, depending on your block size. Many systems use 8K blocks; a fast directory is under 98,304 bytes.
A file entry is something like 16*4 bytes in size (IIRC), so plan on no more than 1500 files per directory as a practical upper limit.
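The arithmetic behind that rule of thumb, using the answer's own figures (12 direct blocks of 8K each, 64-byte entries), works out to 98,304 bytes and 1536 entries. The sketch below just restates the calculation; the function name is made up, and real directory entries are variable-sized on most filesystems, so treat the result as a rough bound only.

```c
/*
 * Number of fixed-size directory entries that fit in the directly
 * addressed blocks of an inode.
 */
unsigned long max_direct_entries(unsigned long block_size,
                                 unsigned long direct_blocks,
                                 unsigned long entry_size)
{
    if (entry_size == 0)
        return 0;
    return (block_size * direct_blocks) / entry_size;
}
```

With 4K blocks the same calculation gives 768 entries, which shows how strongly the estimate depends on the block size.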
Directories with large numbers of entries are often slow - how slow depends on the underlying filesystem.
The common solution is to create a hierarchy of directories, so each dir only has a few hundred entries.
Mutex system calls are a result of the application (probably something in the JVM or the Java libraries) making mutex calls.
Synchronisation internal to the kernel you will not see via strace, as this only examines system calls themselves.
A directory with lots of files should not become inefficient if you are using a filesystem which uses directory indexes; most now do (ext3 does optionally but it's normally enabled nowadays).
Non-indexed directories (like those used on the bad old filesystems - ext2, vfat etc) get really bad with lots of files, and you'll see the "open" system call taking a lot longer.

Resources