Level 2 I/O in Linux using readdir() possible?

I am trying to traverse a directory structure and open every file in that structure. To traverse, I am using opendir() and readdir(). Since I already have the entity, it seems stupid to build a path and open the file -- that presumably forces Linux to find the directory and file I just traversed.
The level 2 I/O calls (open, creat, read, write) require a path to get at a file. Is there any call that either opens a file by name inside an already-open directory, or opens a file given an inode?

You probably should use nftw(3) to recursively traverse a file tree.
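For illustration, a minimal sketch of such a walk (assuming a starting directory named somedir; error handling kept to a minimum):

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* called by nftw(3) for every entry found under the tree */
static int visit(const char *path, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf)
{
    if (typeflag == FTW_F) {               /* a regular file */
        int fd = open(path, O_RDONLY);     /* nftw hands us a ready-made path */
        if (fd >= 0) {
            /* ... read from fd ... */
            close(fd);
        }
    }
    return 0;                              /* 0 means: keep walking */
}

int main(void)
{
    return nftw("somedir", visit, 20, FTW_PHYS) == 0 ? 0 : 1;
}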
Otherwise, construct your directory + filename path portably, using e.g.
snprintf(pathbuf, sizeof(pathbuf), "%s/%s", dirname, filename);
(or perhaps using asprintf(3) but don't forget to later free the result)
And to answer your question about opening a file in a directory: you could use openat(2), which is specified by POSIX.1-2008 and available on Linux. But I believe that you should really use nftw or construct your path as suggested above. Read also about O_PATH and O_TMPFILE in open(2).
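To make the openat(2) route concrete, here is a minimal sketch (the directory name somedir is an assumption; error handling omitted): open the directory once, then open each entry relative to the directory's own descriptor obtained with dirfd(3), without rebuilding any path string.

#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    DIR *d = opendir("somedir");
    if (!d) return 1;
    int dfd = dirfd(d);                            /* fd of the directory itself */
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.') continue;         /* skip "." and ".." (and dotfiles) */
        int fd = openat(dfd, e->d_name, O_RDONLY); /* open relative to the directory fd */
        if (fd >= 0) {
            /* ... read from fd ... */
            close(fd);
        }
    }
    closedir(d);
    return 0;
}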
BTW, the kernel has to access the directory several times anyway (in practice the metadata is cached by file system kernel code), if only because another process could have written inside it while you are traversing it.
Don't even think of opening a file through its inode number: it would violate several file system abstractions! (It might barely be possible through insane and disgusting tricks, e.g. debugfs - and that could very badly damage your filesystem!)
Remember that files are really inodes, and an inode can have zero (a process opened and then unlink(2)-ed a file while keeping the open file descriptor), one (the usual case), or several (e.g. /foo/bar1 and /gee/bar2 hard-linked with link(2)) file names.
Some file systems (e.g. FAT ...) don't have real inodes. The kernel fakes something in that case.

Related

Can POSIX/Linux unlink file entries completely race free?

POSIX famously lets processes rename and unlink file entries with no regard for the effects on other processes using them, whereas Windows by default raises an error if you even try to touch the timestamps of a directory which has a file handle open somewhere deep inside it.
However Windows doesn't need to be so conservative. If you open all your file handles with FILE_FLAG_BACKUP_SEMANTICS and FILE_SHARE_DELETE and take care to rename files to random names just before flagging deletion, you get POSIX semantics including lack of restriction on manipulating file paths containing open file handles.
One very nifty thing Windows can do is perform renames, deletes and hard links using only an open file handle, so you can delete a file without worrying about whether another process has renamed it, or renamed any of the directories preceding it in the path. This facility lets you perform completely race-free file deletions: once you have an open handle to the right file, you can stop caring about what other processes are doing to the filing system, at least for deletion (which is the most important case, as it implicitly destroys data).
This raises the question: what about POSIX? On POSIX, unlink() takes a path, and between retrieving the current path of a file descriptor (via /proc/self/fd/x or F_GETPATH) and calling unlink(), someone may have changed that path, potentially leading to the wrong file being unlinked and data lost.
A considerably safer solution is this:
Get one of the current paths of the open file descriptor using /proc/self/fd/x or F_GETPATH etc.
Open its containing directory.
Do an fstatat() on the containing directory for the leafname of the open file descriptor, checking that the device ids and inodes match.
If they match, do an unlinkat() to remove the leafname.
This is race safe from the parent directory upwards, though the hard link you delete may not be the one you expected. However, it is not race safe if, within the containing directory, a third-party process renames your file to something else and renames another file to your leafname between the inode-equivalence check and the unlinkat(). The wrong file could then be deleted, and data lost.
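A sketch of those four steps in C (the containing-directory path and leafname are assumed to have already been recovered from /proc/self/fd or F_GETPATH; error handling is minimal, and the residual rename race described above still applies):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int unlink_by_fd_best_effort(int fd, const char *dirpath, const char *leaf)
{
    struct stat fs, ls;
    if (fstat(fd, &fs) < 0) return -1;                    /* identity of the open file */
    int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);      /* step 2: open the containing directory */
    if (dfd < 0) return -1;
    if (fstatat(dfd, leaf, &ls, AT_SYMLINK_NOFOLLOW) < 0  /* step 3: stat the leafname */
        || ls.st_dev != fs.st_dev || ls.st_ino != fs.st_ino) {
        close(dfd);
        return -1;                                        /* leaf no longer names our file */
    }
    int r = unlinkat(dfd, leaf, 0);                       /* step 4: remove the leafname */
    close(dfd);
    return r;
}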
I therefore ask the question: can POSIX, or any specific POSIX implementation such as Linux, allow programs to unlink file entries completely race free? One solution could be to unlink a file entry by open file descriptor, another could be to unlink a file entry by inode, however google has not turned up solutions for either of those. Interestingly, NTFS does let you delete by a choice of inode or GUID (yes NTFS does provide inodes, you can fetch them from the NT kernel) in addition to deletion via open file handle, but that isn't much help here.
In case this seems like too esoteric a question, this problem affects proposed Boost.AFIO where I need to determine what filing system races I can mitigate and what I cannot as part of its documented hard behaviour guarantees.
Edit: Clarified that there is no canonical current path of an open file descriptor, and that in this use case we don't care - we just want to unlink some one of the links for the file.
No replies to this question, and I have spent several days trawling through Linux source code. I believe the answer is "currently you can't unlink a file race free", so I have opened a feature request at https://bugzilla.kernel.org/show_bug.cgi?id=93441 to have Linux extend unlinkat() with the AT_EMPTY_PATH Linux extension flag. If they accept that idea, I'll mark this answer as the correct one.

Syncing a file system that has no file on it

Say I want to synchronize the data buffers of a file system to disk (in my case that of a USB stick partition) on a Linux box.
While searching for a function to do that I found the following
DESCRIPTION
sync() causes all buffered modifications to file metadata and data to be written to the underlying file systems.
syncfs(int fd) is like sync(), but synchronizes just the file system containing the file referred to by the open file descriptor fd.
But what if the file system has no file on it that I can open and pass to syncfs? Can I "abuse" the dot file? Does it appear on all file systems?
Is there another function that does what I want? Perhaps by providing a device file with major / minor numbers or some such?
Yes, I think you can do that. The root directory of the file system has an inode of its own, so you can open the mount point's "." entry and pass the resulting descriptor to syncfs. Play around with ls -i to see the inode numbers.
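As a sketch, assuming the stick is mounted at /mnt/usb (an example path), opening the mount point's root directory is enough to give syncfs(2) a descriptor on the right file system:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/usb", O_RDONLY);   /* the file system's root directory */
    if (fd < 0) return 1;
    int r = syncfs(fd);                    /* flush just this file system */
    close(fd);
    return r == 0 ? 0 : 1;
}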
Could you avoid the problem by mounting your file system with the sync option? Or do performance issues get in the way? Did you have a look at remounting? In particular cases that can sync your file system as well.
I do not know what your application is, but I have had problems synchronizing files to a USB stick with the FAT32 file system: it resulted in weird read and write errors. Beyond such cases, I cannot imagine a valid reason to sync an empty file system.
From man 8 sync description:
"sync writes any data buffered in memory out to disk. This can include (but is not
limited to) modified superblocks, modified inodes, and delayed reads and writes. This
must be implemented by the kernel; The sync program does nothing but exercise the sync(2)
system call."
So note that it is all about modifications (modified inodes, superblocks, etc.). If you have not made any modifications, there is nothing to sync.

I/O Performance in Linux

If file A is in a directory containing 10000 files, and file B is in a directory containing 10 files, would reading/writing file A be slower than file B?
Would this be affected by the choice of journaling file system?
No.
Browsing the directory and opening a file will be slower (whether or not that's noticeable in practice depends on the filesystem). Input/output on the file is exactly the same.
EDIT:
To clarify, the "file" in the directory is not really the file, but a link ("hard link", as opposed to symbolic link), which is merely a kind of name with some metadata, but otherwise unrelated to what you'd consider "the file". That's also the historical reason why deleting a file is done via the unlink syscall, not via a hypothetical deletefile call. unlink removes the link, and if that was the last link (but only then!), the file.
It is perfectly legal for one file to have a hundred links in different directories, and it is perfectly legal to open a file and then move it to a different place or even unlink it (while it remains open!). It does not affect your ability to read/write on the file descriptor in any way, even when a file (to your knowledge) does not even exist any more.
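A minimal sketch of that last point (the filename tmpfile is just an example; error handling omitted):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("tmpfile", O_RDWR | O_CREAT, 0600);
    if (fd < 0) return 1;
    unlink("tmpfile");                  /* last link gone; the inode lives on while fd is open */
    write(fd, "still here", 10);        /* I/O through the fd is unaffected */
    lseek(fd, 0, SEEK_SET);
    char buf[10];
    read(fd, buf, sizeof buf);
    close(fd);                          /* only now is the inode actually freed */
    return 0;
}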
In general, once a file has been opened and you have a handle to it, the performance of accessing that file will be the same no matter how many other files are in the same directory. You may be able to detect a small difference in the time it takes to open the file, as the OS will have to search for the file name in the directory.
Journaling aims to reduce the recovery time after file system crashes; IMHO, it will not affect the read/write speed of files. See also: Journaling ext2.

Regarding Hard Link

Can somebody please explain me why the kernel doesn't allow us to make a hard link to a directory. Whether it is because it breaks the rule of directed acyclic graph structure of the file-system or it is because of some other reason. What other complications come if it allows that?
Back in the days of 7th Edition (or Version 7) UNIX, there were no system calls mkdir(2) and rmdir(2). The mkdir(1) program was SUID root, and used the mknod(2) system call to create the directory and the link(2) system call to make the entries for . and .. in the new directory. The link(2) system call only allowed root to do that. Consequently, way back then (circa 1978), it was possible for the superuser to create links to directories, but only the superuser was permitted to do so to ensure that there were no problems with cycles or other missing links. There were diagnostic programs to pick up the pieces if the system crashed while a directory was partly created, for example.
You can find the Unix 7th Edition manuals at Bell Labs. Sections 2 and 3 are devoid of mkdir(2) and rmdir(2). You used the mknod(2) system call to make the directory:
NAME
mknod – make a directory or a special file
SYNOPSIS
mknod(name, mode, addr)
char *name;
DESCRIPTION
Mknod creates a new file whose name is the null-terminated string pointed to by name. The mode of the new file (including directory and special file bits) is initialized from mode. (The protection part of the mode is modified by the process's mode mask; see umask(2)). The first block pointer of the i-node is initialized from addr. For ordinary files and directories addr is normally zero. In the case of a special file, addr specifies which special file.
Mknod may be invoked only by the super-user.
SEE ALSO
mkdir(1), mknod(1), filsys(5)
DIAGNOSTICS
Zero is returned if the file has been made; -1 if the file already exists or if the user is not the super-user.
The entry for link(2) states:
DIAGNOSTICS
Zero is returned when a link is made; -1 is returned when name1 cannot be found; when name2 already exists; when the directory of name2 cannot be written; when an attempt is made to link to a directory by a user other than the super-user; when an attempt is made to link to a file on another file system; when a file has too many links.
The entry for unlink(2) states:
DIAGNOSTICS
Zero is normally returned; -1 indicates that the file does not exist, that its directory cannot be written, or that the file contains pure procedure text that is currently in use. Write permission is not required on the file itself. It is also illegal to unlink a directory (except for the super-user).
The manual page for the ln(1) command noted:
It is forbidden to link to a directory or to link across file systems.
The manual page for the mkdir(1) command notes:
Standard entries, '.', for the directory itself, and '..' for its parent, are made automatically.
This would not be worthy of comment were it not that it was possible to create directories without those links.
Nowadays, the mkdir(2) and rmdir(2) system calls are standard and permit any user to create and remove directories, preserving the correct semantics. There is no longer a need to permit users to create hard links to directories. This is doubly true since symbolic links were introduced - they were not in 7th Edition UNIX, but were in the BSD versions of UNIX from quite early on.
With normal directories, the .. entry unambiguously links back to the (single, solitary) parent directory. If you have two hard links (two names) for the same directory in different directories, where does the .. entry point? Presumably, to the original parent directory - and presumably there is no way to get to the 'other' parent directory from the linked directory. That's an asymmetry that can cause trouble. Normally, if you do:
chdir("./subdir");
chdir("..");
(where ./subdir is not a symbolic link), then you will be back in the directory you started from. If ./subdir is a hard link to a directory somewhere else, then you will be in a different directory from where you started after the second chdir(). You'd have to show that with a pair of stat() calls before and after the chdir() operations shown.
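Something like the following sketch would show it (./subdir is an example name; error handling omitted):

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat before, after;
    stat(".", &before);
    chdir("./subdir");
    chdir("..");
    stat(".", &after);
    printf("%s\n", (before.st_dev == after.st_dev && before.st_ino == after.st_ino)
                   ? "back where we started"
                   : "ended up somewhere else");
    return 0;
}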
This is entirely because allowing hard links to directories allows for potential loops and cycles in the directory graph without adding much value.
In addition to the possibility of getting cycles (much like with symlinks, by the way, but these are easier to detect and handle), there is a second reason I can think of.
On UNIX, there is a common assumption made by many programs that every directory has a link count of 2 + the number of child directories. This is due to the POSIX standard directory entries '.' and '..', which link to the directory itself and to its parent.
(After verification, I can say that the root (/) is not an exception).
This is especially useful as a performance optimization to detect leaf directories when recursing, but many applications exist that have found other uses for it.
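For illustration, a sketch of that leaf-directory check (the function name is made up; error handling omitted):

#include <stdbool.h>
#include <sys/stat.h>

/* a directory whose link count is exactly 2 ("." plus the parent's entry)
   has no subdirectories, so a recursive scan can skip descending into it */
bool is_leaf_dir(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return false;
    return S_ISDIR(st.st_mode) && st.st_nlink == 2;
}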
Clarifying
By allowing 'user-defined' hard links to directories, these invariants would, so to speak, no longer hold, and any application that depends on them might stop working correctly.
That element of surprise is why you need root permissions (and some good design (re)thinking) in order to force creation of directory hard links.
Because then the directory tree will cease to be a directory tree. One directory could have multiple parents.
Cyclic references will break garbage collection by reference counting. Wikipedia describes the problem:
There are a variety of ways of handling the problem of detecting and collecting reference cycles. One is that a system may explicitly forbid reference cycles.
That is the way Linux does it.

Why can't files be manipulated by inode?

Why is it that you cannot access a file when you only know its inode, without searching for a file that links to that inode? A hard link to the file contains nothing but a name and a number telling you where to find the inode with all the real information about the file. I was surprised when I was told that there was no usermode way to use the inode number directly to open a file.
This seems like such a harmless and useful capability for the system to provide. Why is it not provided?
Security reasons -- to access a file you need permission on the file AS WELL AS permission to search all the directories from the root needed to get at the file. If you could access a file by inode, you could bypass the checks on the containing directories.
This allows you to create a file that can be accessed by a set of users (or a set of groups) and no one else -- create directories that are only accessible by those users (one dir per user), and then hard-link the file into all of those directories -- the file itself is accessible by anyone, but can actually be reached only by someone who has search permission on one of the directories it is linked into.
Some operating systems do have that facility. For example, OS X needs it to support the Carbon File Manager, and on Linux you can use debugfs. Of course, you can do it on any UNIX from the command line via find -inum, but the real reason you can't access files by inode is that it isn't particularly useful. It does kind of circumvent file permissions, because if there is a file you can read inside a directory you can't read or execute, then opening it by inode lets you discover it.
The reason it isn't very useful is that you need to find an inode number via a *stat() call, at which point you already have the filename (or an open fd)...or you need to guess the inum.
In response to your comment: To "pass a file", you can use fd passing over AF_LOCAL sockets by means of SCM_RIGHTS (see man 7 unix).
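A minimal sketch of that fd passing (sock is assumed to be an already-connected AF_LOCAL/AF_UNIX socket; error handling omitted):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_fd(int sock, int fd)
{
    char dummy = 'x';                                       /* at least one byte of real data must be sent */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;                               /* ensures proper alignment of the buffer */
    } u;
    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;                           /* the fd travels as ancillary data */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}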
Btrfs does have an ioctl for that (BTRFS_IOC_INO_PATHS, added in this patch), however it makes no attempt to check permissions along the path and is simply restricted to root.
Surely if you've already looked up a file via a path, you shouldn't have to do it again and again?
stat(f, &s); i = open(f, O_RDONLY);
involves two trawls through the directory structure. That wastes CPU cycles on unnecessary string operations. Yes, a well-designed file system cache will hide most of this inefficiency from a casual end user, but repeating work for no reason is ugly, if not plain silly.
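A sketch of the obvious alternative, resolving the path only once and taking the metadata from the open descriptor (error handling omitted):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int open_and_stat(const char *f, struct stat *s)
{
    int fd = open(f, O_RDONLY);   /* one trawl through the directory structure */
    if (fd < 0)
        return -1;
    fstat(fd, s);                 /* metadata comes from the fd, no second lookup */
    return fd;
}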

Resources