how does kernel handle new file creation - linux

I wish to understand how the kernel works when a user/application tries to create a file in a directory.
The background - We have a Java application which consumes messages over JMS, processes them and then writes the XML to an outbound queue plus a local directory. Yesterday we observed unusual delays in writing to the directory. On 'ls|wc -l' we found >300,000 files in there. A quick strace on the process showed it was full of mutex calls (more than 3/4 of the calls in the strace were mutex).
So I thought that new file creation was taking time because the system has to check certain things every time (e.g. file names, to make sure a new file with a specific name can be created) among 300,000 files before it can create the file.
I cleared the directory and the application resumed normal service levels.
My questions
Was my analysis correct? (It seems so, because the app started working fine after the clear-down.)
More important, how does the kernel work when you try to create a new file in a directory?
Can the abnormal number of mutex calls be attributed to the high number of files in the directory?
Many thanks
J

Please read about the Linux filesystem, inodes and directory entries (dentries).
http://en.wikipedia.org/wiki/Inode_pointer_structure
The file system is organized into fixed-sized blocks. If your directory is relatively small, it fits in the direct blocks and things are fast. If your directory is not too big, it fits in the direct blocks and some indirect blocks, and is still reasonably fast. If your directory becomes too big, it spills into double indirect blocks and becomes slow.
Actual sizes depend on file system and kernel configuration.
A rule of thumb is to keep the directory under 12 blocks, depending on your block size. Many systems use 8K blocks; a fast directory is then under 98,304 bytes.
A file entry is something like 16*4 bytes in size (IIRC), so plan on no more than about 1,500 files per directory as a practical upper limit.

Directories with large numbers of entries are often slow - how slow depends on the underlying filesystem.
The common solution is to create a hierarchy of directories, so each dir only has a few hundred entries.
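As an illustration of that approach, here is a minimal C sketch (the two-level fan-out, the hash and the file name are my own choices, not anything from the question) that maps a file name onto a two-level subdirectory path so that no single directory accumulates more than a few hundred entries:

#include <stdio.h>

/* Map a file name onto a two-level bucket path such as "xx/yy/<name>".
 * The 256 x 256 fan-out and the djb2 hash are illustrative choices only. */
static void bucket_path(const char *name, char *out, size_t outlen)
{
    unsigned int h = 5381;                     /* djb2 string hash */
    for (const char *p = name; *p; ++p)
        h = h * 33 + (unsigned char)*p;
    snprintf(out, outlen, "%02x/%02x/%s", (h >> 8) & 0xff, h & 0xff, name);
}

int main(void)
{
    char path[512];
    bucket_path("msg-000123.xml", path, sizeof path);   /* placeholder name */
    printf("%s\n", path);                               /* prints "xx/yy/msg-000123.xml" */
    return 0;
}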

The mutex calls are a result of the application (probably something in the JVM or the Java libraries) making them; on Linux they show up in strace as futex system calls.
Synchronisation internal to the kernel is not visible via strace, which only shows the system calls themselves.
A directory with lots of files should not become inefficient if you are using a filesystem which uses directory indexes; most now do (ext3 does optionally but it's normally enabled nowadays).
Non-indexed directories (like those used on the bad old filesystems - ext2, vfat etc) get really bad with lots of files, and you'll see the "open" system call taking a lot longer.

Related

In a kernel module, how to know whether given inode belongs to a specific directory?

One possible way is to compare the given inode with the list of inodes in that directory. The list of inodes could be predetermined or it could be computed at run time; both ways have their own problems:
Predetermined list: the list can change during the operation, i.e. files could be added to or removed from that directory.
Run-time list: if the directory has too many files, building the list is too much overhead for every access of any file in the system.
Is there any efficient solution/way to do this? I have tried comparing the file by its path, which was a really bad idea.
Doing it in kernel mode gives you no advantage over user mode. To see whether an inode is indeed in some directory you have to read that directory, because files are normally located in directories as a linear list. This can cause your process to block while the directory blocks are read in if they are not cached, and in that time the directory contents can be modified. Keeping the directory inode locked while you perform the operation would help, but it can add severe performance restrictions to your operating system. Another issue is that each filesystem is free to implement its directory contents in its own format. In userland you get a uniform directory format, but in kernel mode you have to deal with the different approaches of the different filesystem types. Why do you need to know this? I can't imagine a scenario where it would be needed. Perhaps you can redesign your algorithm so that the directory contents are unnecessary.
By the way, dealing with complete paths or searching directories has obscure race conditions that can leave your system blocked in some way. What happens if, in the middle of your search, somebody tries to unlink the inode you are searching for, or the directory contents must be modified, or some other process is using namei() to traverse your directory upwards or downwards? Have you thought about all these possibilities?
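For contrast, here is what that linear scan looks like from user space. This is only a sketch (the directory name and inode number are placeholders), and it illustrates exactly the O(n) walk and the staleness problem described above rather than anything you could safely rely on in a kernel module:

#include <dirent.h>
#include <stdio.h>
#include <sys/types.h>

/* Return 1 if an entry with inode number `ino` is currently listed in `dirpath`,
 * 0 if not, -1 on error. The result can be stale the moment we return, because
 * the directory may be modified concurrently - the race discussed above. */
static int dir_contains_inode(const char *dirpath, ino_t ino)
{
    DIR *d = opendir(dirpath);
    if (!d)
        return -1;
    struct dirent *e;
    int found = 0;
    while ((e = readdir(d)) != NULL) {
        if (e->d_ino == ino) {      /* linear scan: cost grows with the entry count */
            found = 1;
            break;
        }
    }
    closedir(d);
    return found;
}

int main(void)
{
    /* "/tmp" and 123456 are placeholder values for illustration only */
    printf("%d\n", dir_contains_inode("/tmp", (ino_t)123456));
    return 0;
}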

How does chroot affect dynamic libraries memory use?

Although there is another question on a similar topic, it does not cover the memory used by shared libraries in chroot jails.
Let's say we have a few similar chroots. To be more specific: exactly the same sets of binary files and shared libraries, which are actually hard links to the master copies to conserve disk space (and the file system is mounted read-only to prevent any possibility of the files being altered).
How is the memory use affected in such a setup?
As described in the chroot system call:
This call changes an ingredient in the pathname resolution process and does nothing else.
So, the shared library will be loaded in the same way as if it were outside the chroot jail (share read only pages, duplicate data, etc.)
http://man7.org/linux/man-pages/man2/chroot.2.html
Because hardlinks share the same underlying inode, the kernel treats them as the same item when it comes to caching/mapping.
You'll see filesystem cache savings by using hardlinks, as well as disk-space savings.
The biggest issue I'd have with this is that if someone manages to subvert the read-only nature of one of the chroot environments, they could subvert all of them by modifying any of the hardlinked files.
When I set this up, I copied the shared libraries per chroot instead of linking to a read-only mount. With separate files, the text segments were not shared. It's likely that the same inode will map to the same read-only text segment, but this may vary with available memory management hardware and similar architectural details.
Try this experiment on your system: write a small program that makes some minimal use of a large shared library. Run twenty or thirty chroot jails as you describe, each with a running copy of the program. Check overall memory usage before & during running, and dissect one instance to get a good text/data segment breakdown. If memory use increases by the full size of the map for each instance, the segments are not shared. Conversely, if memory use goes up by a fraction of the map, the segments are shared.
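A minimal candidate for such a test program might look like the sketch below; the library path is only an example of a "large shared library", and any big library you already ship in the jails would do. It loads the library and then sleeps, so memory use can be inspected from outside (for example via /proc/<pid>/smaps) while many chrooted copies are running.

#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

/* Usage: ./maptest /path/to/large-library.so
 * Loads the given shared library, then sleeps so memory use can be measured
 * from outside while several chrooted copies of this program are running. */
int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s /path/to/large-library.so\n", argv[0]);
        return 1;
    }
    void *h = dlopen(argv[1], RTLD_NOW);
    if (!h) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    sleep(600);      /* keep the mapping alive while you measure */
    dlclose(h);
    return 0;
}

Compile it with something like cc maptest.c -ldl, run one copy per jail, and compare the shared/private figures reported for the library mapping.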

Is it OK (performance-wise) to have hundreds or thousands of files in the same Linux directory?

It's well known that in Windows a directory with too many files will have a terrible performance when you try to open one of them. I have a program that is to execute only in Linux (currently it's on Debian-Lenny, but I don't want to be specific about this distro) and writes many files to the same directory (which acts somewhat as a repository). By "many" I mean tens each day, meaning that after one year I expect to have something like 5000-10000 files. They are meant to be kept (once a file is created, it's never deleted) and it is assumed that the hard disk has the required capacity (if not, it should be upgraded). Those files have a wide range of sizes, from a few KB to tens of MB (but not much more than that). The names are always numeric values, incrementally generated.
I'm worried about long-term performance degradation, so I'd ask:
Is it OK to write all to the same directory? Or should I think about creating a set of subdirectories for every X files?
Should I require a specific filesystem to be used for such directory?
What would be the more robust alternative? Specialized filesystem? Which?
Any other considerations/recommendations?
It depends very much on the file system.
ext2 and ext3 have a hard limit of 32,000 links per directory inode, which in practice caps the number of subdirectories rather than plain files. That is somewhat more than you are asking about, but close enough that I would not risk it. Also, ext2 and ext3 will perform a linear scan every time you access a file by name in the directory.
ext4 supposedly fixes these problems, but I cannot vouch for it personally.
XFS was designed for this sort of thing from the beginning and will work well even if you put millions of files in the directory.
So if you really need a huge number of files, I would use XFS or maybe ext4.
Note that no file system will make "ls" run fast if you have an enormous number of files (unless you use "ls -f"), since "ls" will read the entire directory and then sort the names. A few tens of thousands is probably not a big deal, but a good design should scale beyond what you think you need at first glance...
For the application you describe, I would probably create a hierarchy instead, since it is hardly any additional coding or mental effort for someone looking at it. Specifically, you can name your first file "00/00/01" instead of "000001".
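As a purely illustrative sketch of that naming scheme (the two-digit grouping simply matches the "00/00/01" example above), the following turns a sequential numeric id into a three-level path and creates the intermediate directories on the way:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Turn a sequential id such as 1 into "00/00/01" and 123456 into "12/34/56",
 * creating the two intermediate directories if they do not exist yet.
 * This toy scheme only covers ids up to 999,999. */
static void id_to_path(unsigned long id, char *out, size_t outlen)
{
    snprintf(out, outlen, "%02lu/%02lu/%02lu",
             (id / 10000) % 100, (id / 100) % 100, id % 100);
    char dir[32];
    snprintf(dir, sizeof dir, "%02lu", (id / 10000) % 100);
    (void)mkdir(dir, 0755);                            /* ok if it already exists */
    snprintf(dir, sizeof dir, "%02lu/%02lu", (id / 10000) % 100, (id / 100) % 100);
    (void)mkdir(dir, 0755);
}

int main(void)
{
    char path[32];
    id_to_path(123456, path, sizeof path);
    printf("%s\n", path);                              /* prints "12/34/56" */
    return 0;
}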
If you use a filesystem without directory-indexing, then it is a very bad idea to have lots of files in one directory (say, > 5000).
However, if you've got directory indexing (which is enabled by default on more recent distros in ext3), then it's not such a problem.
However, it does break quite a few tools to have many files in one directory (For example, "ls" will stat() all the files, which takes a long time). You can probably easily split it into subdirectories.
But don't overdo it. Don't use many levels of nested subdirectory unnecessarily, this just uses lots of inodes and makes metadata operations slower.
I've seen more cases of "too many levels of nested directories" than I've seen of "too many files per directory".
The best solution I have for you (rather than quoting some values from a micro-filesystem-benchmark) is to test it yourself.
Just use the file system of your choice. Create some random test data for 100, 1000 and 10000 entries. Then, measure the time it takes your system to perform the action you are concerned about time-wise (opening a file, reading 100 random files, etc).
Then, you compare the times and use the best solution (put them all into one directory; put each year into a new directory; put each month of each year into a new directory).
I do not know in detail what you are using, but creating a directory is a one time (and probably quite easy) operation, so why not do it instead of changing filesystems or trying some other more time-consuming stuff?
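The shape of such a test might be something like the sketch below (the file count, the names and the timed operation are all placeholders to replace with whatever you actually care about): it creates N files in the current directory and then times how long opening a sample of them by name takes.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Crude benchmark sketch: create `count` empty files in the current directory,
 * then time how long opening 100 of them by name takes. Run it with 100,
 * 1000 and 10000 and compare. */
int main(int argc, char **argv)
{
    int count = (argc > 1) ? atoi(argv[1]) : 1000;
    char name[64];

    for (int i = 0; i < count; i++) {
        snprintf(name, sizeof name, "testfile-%06d", i);
        int fd = open(name, O_CREAT | O_WRONLY, 0644);
        if (fd >= 0)
            close(fd);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 100; i++) {
        snprintf(name, sizeof name, "testfile-%06d", rand() % count);
        int fd = open(name, O_RDONLY);
        if (fd >= 0)
            close(fd);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("100 opens among %d files: %.3f ms\n", count, ms);
    return 0;
}

Keep in mind that the dentry and page caches will dominate a micro-test like this, so run it on the filesystem and hardware you actually plan to use.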
In addition to the other answers, if the huge directory is managed by a known application or library, you could consider replacing it with something else, e.g.:
a GDBM index file; GDBM is a very common library providing indexed files, which associate an arbitrary value (a sequence of bytes) with an arbitrary key (another sequence of bytes).
perhaps a table inside a database like MySQL or PostgreSQL. Be careful about indexing.
some other way to index data
The advantages of the above approaches include:
space performance for a large collection of small items (less than a kilobyte each): a filesystem needs an inode for each item, whereas indexed systems may have much finer granularity
time performance: you don't access the filesystem for every item
scalability: indexed approaches are designed to fit large needs; either a GDBM index file or a database can handle many millions of items. I'm not sure your directory approach will scale as easily.
The disadvantage of such approaches is that the items don't show up as files. But as MarkR's answer reminds you, ls behaves quite poorly on huge directories anyway.
If you stick to a filesystem approach, much software that uses large numbers of files organizes them in subdirectories like aa/ ab/ ac/ ... ay/ az/ ba/ ... bz/ ...
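To make the GDBM suggestion concrete, here is a minimal sketch (the database file name, the key and the payload are made up for illustration) that stores and fetches one small record under a numeric key instead of creating one file per record:

#include <gdbm.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Store one small record in a single GDBM file instead of one file per item.
 * "repository.gdbm" and the key/value contents are illustrative only. */
int main(void)
{
    GDBM_FILE db = gdbm_open("repository.gdbm", 0, GDBM_WRCREAT, 0644, NULL);
    if (!db) {
        fprintf(stderr, "gdbm_open failed\n");
        return 1;
    }

    char keybuf[32];
    snprintf(keybuf, sizeof keybuf, "%06d", 123);          /* numeric key, like the file names */
    const char *payload = "<xml>...</xml>";

    datum key   = { keybuf, (int)strlen(keybuf) };
    datum value = { (char *)payload, (int)strlen(payload) };

    if (gdbm_store(db, key, value, GDBM_REPLACE) != 0)
        fprintf(stderr, "gdbm_store failed\n");

    datum back = gdbm_fetch(db, key);                      /* caller must free back.dptr */
    if (back.dptr) {
        printf("fetched %.*s\n", back.dsize, back.dptr);
        free(back.dptr);
    }

    gdbm_close(db);
    return 0;
}

Build with -lgdbm; the point is simply that one indexed file replaces thousands of directory entries.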
Is it OK to write all to the same directory? Or should I think about creating a set of subdirectories for every X files?
In my experience the only slowdown a directory with many files will give you is with operations such as getting a listing with ls. But that is mostly the fault of ls; there are faster ways of listing the contents of a directory using tools such as echo and find (see below).
Should I require a specific filesystem to be used for such directory?
I don't think so with regard to the number of files in one directory. I am sure some filesystems perform better with many small files in one directory whilst others do a better job with huge files. It's also a matter of personal taste, akin to vi vs. emacs. I prefer to use the XFS filesystem, so that'd be my advice. :-)
What would be the more robust alternative? Specialized filesystem? Which?
XFS is definitely robust and fast; I use it in many places: as a boot partition, for Oracle tablespaces, for source control space, you name it. It lacks a bit in delete performance, but otherwise it's a safe bet. Plus it supports growing the filesystem while it is still mounted (that's actually a requirement): you just delete the partition, recreate it at the same starting block with an ending block beyond the original partition, then run xfs_growfs on it with the filesystem mounted.
Any other considerations/recomendations?
See above. With the addition that having 5000 to 10000 files in one directory should not be a problem. In practice it doesn't arbitrarily slow down the filesystem as far as I know, except for utilities such as "ls" and "rm". But you could do:
find * | xargs echo
find * | xargs rm
The benefit that a directory tree with files (such as a directory "a" for file names starting with "a", and so on) will give you is mostly cosmetic: it looks more organised. But then you have less of an overview... So what you're trying to do should be fine. :-)
I neglected to say you could consider using something called "sparse files" http://en.wikipedia.org/wiki/Sparse_file
It is bad for performance to have a huge number of files in one directory. Checking for the existence of a file will typically require an O(n) scan of the directory. Creating a new file will require that same scan with the directory locked to prevent the directory state changing before the new file is created. Some file systems may be smarter about this (using B-trees or whatever), but the fewer ties your implementation has to the filesystem's strengths and weaknesses the better for long term maintenance. Assume someone might decide to run the app on a network filesystem (storage appliance or even cloud storage) someday. Huge directories are a terrible idea when using network storage.

Disadvantages to creating/removing many hard links?

I need to create hundreds to thousands of temporary hard or symbolic links that will be deleted shortly after creation. For my purposes both types of links will work (i.e. the target is not a directory and it always exists on the same file system)
As I understand it, a symbolic link creates a small file that contains the path to the original file, whereas a hard link creates another directory entry referring to the same inode. So if I am going to be creating/deleting thousands of these links, is it better to be creating and deleting thousands of tiny files (symlinks) or thousands of these references (hardlinks)? It seems like one taxes the hard drive (maybe fragmentation) while the other might tax the file system itself. Where are inode references stored? Do I risk corrupting the file system by making so many hard links? What about speed?
Thanks for your expertise!
This is a workaround to be able to use ffmpeg to encode a movie out of an arbitrary subset of images from a directory. Since ffmpeg requires that the files be named properly (e.g. frame%04d.jpg), I realized I can just create hard/sym links to the subset of files and name the links appropriately. This avoids renaming the original files and having to actually copy the data. It works great, but it requires creating and deleting many thousands of links, repeatedly.
Sort of addresses this problem too I believe:
convert image sequence using ffmpeg
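A sketch of that workaround (the source file names below are placeholders; only the frame%04d.jpg pattern comes from the question): it creates sequentially named hard links to a subset of source images and unlinks them again afterwards.

#include <stdio.h>
#include <unistd.h>

/* Create sequentially named hard links (frame0001.jpg, frame0002.jpg, ...)
 * pointing at an arbitrary subset of source images, so ffmpeg can consume them.
 * The source names here are placeholders. */
int main(void)
{
    const char *subset[] = { "img_0007.jpg", "img_0042.jpg", "img_0099.jpg" };
    char linkname[64];

    for (int i = 0; i < 3; i++) {
        snprintf(linkname, sizeof linkname, "frame%04d.jpg", i + 1);
        if (link(subset[i], linkname) != 0)     /* use symlink(...) here if preferred */
            perror(linkname);
    }

    /* ... run ffmpeg on frame%04d.jpg here ... */

    for (int i = 0; i < 3; i++) {
        snprintf(linkname, sizeof linkname, "frame%04d.jpg", i + 1);
        unlink(linkname);                       /* cheap: only the link is removed */
    }
    return 0;
}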
If this activity breaks your file system, then your file system is at fault, not you. File systems are generally pretty reliable, so don't worry about that.
Both options require adding an entry in the directory. The symbolic link requires creating a file as well. When you access the file the hard link jumps directly to the content, while accessing a symlink requires finding the symlink file, reading it, finding the directory with the content, finding where the content is, and then accessing that. Therefore symlinks are more work for the filesystem all around.
But the difference is minute when compared to the work of actually reading the data in the files. Therefore I would not worry about it, and just go with whichever one best gives you the semantics you want.
Since you are not trying to create hundreds of thousands of links to the same file, hard links perform marginally better.
However, symbolic links in /tmp, if /tmp is tmpfs, perform better still.
Oh, and symlinks are too small to cause fragmentation issues.
Both options require the addition of a file entry in the directory inode, and the directory structure may grow by allocating new blocks.
But a symbolic link also requires the allocation of an inode, and the filesystem has a limit on inodes. Your hundreds of thousands of symlinks may hit that limit, and you may get an out-of-space error ("No space left on device") even with gigabytes free.
By default, the filesystem creation tool chooses the maximum number of inodes according to the physical partition size. For instance, for Linux ext2/3/4, mkfs.ext3 uses a bytes-per-inode ratio that you can find in /etc/mke2fs.conf.
For an existing filesystem, here is a command to get information about inodes:
# dumpe2fs /dev/sda1 | grep -i inode | less
Inode count: 979200
Free inodes: 742304
Inodes per group: 16320
Inode blocks per group: 510
First inode: 11
Inode size: 128
Journal inode: 8
First orphan inode: 441066
Journal backup: inode blocks
In conclusion, you should prefer hard links, mainly for resource consumption on disk and in memory (VFS structures in caches).
Another piece of advice: do not create too many files in the same directory; 2,000 files is a reasonable limit to avoid performance issues.

Estimation or measurement of amount of iops to create a file

I'd like to know how many I/O operations (IOPS) it takes to create an empty file. I am interested in Linux and the GFS file system; however, information about other file systems is also very welcome.
Suggestions on how to measure this accurately would also be very welcome.
Real scenario (requested by answers):
Linux
GFS file system (if you can estimate for another - pls do)
create a new file in existing directory, the file does not exist,
using the following code
Assume directory is in cache and directory depth is D
Code:
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
int fd = open("/my_dir/1/2/3/new_file", O_CREAT | O_WRONLY, S_IRWXU); // mode goes in the third argument
// assuming fd is valid
fsync(fd);
For an artificial measurement:
Create a blank filesystem on its own block device (e.g. vmware scsi etc)
Mount it, call sync(), then record the number of IOPS present on that block dev.
Run your test program against the filesystem, and do no further operations (not even "ls").
Wait until all unflushed blocks have been flushed - say about 1 minute or so
Snapshot the iops count again
Of course this is highly unrealistic, because if you created two files rather than one, you'd probably find there were fewer than twice as many operations.
Also, creating empty files is unrealistic, as they don't do anything useful.
Directory structure (how deep the directories are, how many entries they hold) might contribute, but so do fragmentation and other arbitrary factors.
The nature of the answer to this question is: best case, normal case, worst case. There is no single answer, because the number of IOPS required will vary according to the current state of the file system. (A pristine file system is a highly unrealistic scenario.)
Taking FAT32 as an example, the best case is 1. The normal case depends on the degree of file system fragmentation and the directory depth of the pathname for the new file. The worst case is unbounded (except by the size of the file system, which imposes a limit on the maximum possible number of IOPS to create a file).
Really, the question is not answerable, unless you define a particular file system scenario.
I did the following measurement: we wrote an application that creates N files as described in the question.
We ran this application on a disk devoted to it alone, and measured the number of I/O operations using iostat -x 1.
The result, on GFS and Linux kernel 2.6.18, was 2 IOPS per file creation.
This answer is based on MarkR's answer.
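For reference, the file-creation loop used for such a measurement could look roughly like the sketch below (N and the file names are placeholders); run it against the dedicated disk while iostat -x 1 is watching, as described above.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create N empty files, each followed by fsync(), while iostat -x 1 is
 * watching the dedicated block device. N and the file names are placeholders. */
int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1000;
    char name[64];

    for (int i = 0; i < n; i++) {
        snprintf(name, sizeof name, "file_%06d", i);
        int fd = open(name, O_CREAT | O_WRONLY, S_IRWXU);
        if (fd < 0) {
            perror(name);
            return 1;
        }
        fsync(fd);
        close(fd);
    }
    return 0;
}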
