Redundant Linux Kernel System Calls

Redundant Linux Kernel System Calls - linux

I'm currently working on a project that hooks into various system calls and writes things to a log, depending on which one was called. So, for example, when I change the permissions of a file, I write a little entry to a log file that tracks the old permission and new permission. However, I'm having some trouble pinning down exactly where I should be watching. For the above example, strace tells me that the "chmod" command uses the system call sys_fchmodat(). However, there's also a sys_chmod() and a sys_fchmod().
I'm sure the kernel developers know what they're doing, but I wonder: what is the point of all these (seemingly) redundant system calls, and is there any rule on which ones are used for what? (i.e. are the "at" syscalls or ones prefixed with "f" meant to do something specific?)

History :-)
Once a system call has been created it can't ever be changed, therefore when new functionality is required a new system call is created. (Of course this means there's a very high bar before a new system call is created).

Yes, there are some naming rules.
chmod takes a filename, while fchmod takes a file descriptor. Same for stat vs fstat.
fchmodat takes a file descriptor/filename pair (file descriptor for the directory and filename for the file name within the directory). Same for other *at calls; see the NOTES section of http://kerneltrap.org/man/linux/man2/openat.2 for an explanation.

Related

Linux seems to have acquired a lot of *at calls? What is the motivation for these?

I've noticed that Linux now has renameat, fstatat, openat and a variety of other calls that allow you to specify paths relative to a file descriptor instead of having them be interpreted relative to the process' current working directory as is normally the case.
Why have these calls been added? There seems to be at versions of most system calls that have path name arguments, so there must be a pretty compelling use-case for this. But I can't think what it is.

All these *at routines were introduced as part POSIX. If you dig into the rationale section of the openat routine, you will find the following paragraph:
The purpose of the openat() function is to enable opening files in directories other than the current working directory without exposure to race conditions. Any part of the path of a file could be changed in parallel to a call to open(), resulting in unspecified behavior. By opening a file descriptor for the target directory and using the openat() function it can be guaranteed that the opened file is located relative to the desired directory.
In other words, it's a security measure.

Can POSIX/Linux unlink file entries completely race free?

POSIX famously lets processes rename and unlink file entries with no regard as to the effects on others using them, whilst Windows by default raises an error if you even try to touch the timestamps of a directory which has a file handle open somewhere deep inside inside.
However Windows doesn't need to be so conservative. If you open all your file handles with FILE_FLAG_BACKUP_SEMANTICS and FILE_SHARE_DELETE and take care to rename files to random names just before flagging deletion, you get POSIX semantics including lack of restriction on manipulating file paths containing open file handles.
One very nifty thing Windows can do is to perform renames and deletes and hard links only using an open file descriptor, and therefore you can delete a file without having to worry about whether another process has renamed it or any of the directories in the path preceding the file's location. This facility lets you perform completely race free file deletions - once you have an open file handle to the right file, you can stop caring about what other processes are doing to the filing system, at least for deletion (which is the most important as it implicitly involves destroying data).
This raises the question of what about POSIX? On POSIX unlink() takes a path, and between retrieving the current path of a file descriptor using /proc/self/fd/x or F_GETPATH and calling unlink() someone may have changed that path, thus potentially leading to the wrong file being unlinked and data lost.
A considerably safer solution is this:
Get one of the current paths of the open file descriptor using /proc/self/fd/x or F_GETPATH etc.
Open its containing directory.
Do a statat() on the containing directory for the leafname of the open file descriptor, checking if the device ids and inodes match.
If they match, do an unlinkat() to remove the leafname.
This is race safe from the parent directory upwards, though the hard link you delete may not be the one expected. However, it is not race safe if within the containing directory a third party process were to rename your file to something else and rename another file to your leafname between you checking for inode equivalence and calling the unlinkat(). Here the wrong file could be deleted, and data lost.
I therefore ask the question: can POSIX, or any specific POSIX implementation such as Linux, allow programs to unlink file entries completely race free? One solution could be to unlink a file entry by open file descriptor, another could be to unlink a file entry by inode, however google has not turned up solutions for either of those. Interestingly, NTFS does let you delete by a choice of inode or GUID (yes NTFS does provide inodes, you can fetch them from the NT kernel) in addition to deletion via open file handle, but that isn't much help here.
In case this seems like too esoteric a question, this problem affects proposed Boost.AFIO where I need to determine what filing system races I can mitigate and what I cannot as part of its documented hard behaviour guarantees.
Edit: Clarified that there is no canonical current path of an open file descriptor, and that in this use case we don't care - we just want to unlink some one of the links for the file.

No replies to this question, and I have spent several days trawling through Linux source code. I believe the answer is "currently you can't unlink a file race free", so I have opened a feature request at https://bugzilla.kernel.org/show_bug.cgi?id=93441 to have Linux extend unlinkat() with the AT_EMPTY_PATH Linux extension flag. If they accept that idea, I'll mark this answer as the correct one.

How to implement futimes in terms of utimes?

Given that in Linux utimes(2) is a system call and futimes(3) is a library function, I would think that futimes is implemented in terms of utimes. However, utimes takes a pathname, whereas futimes takes a file descriptor.
Since, it is "not possible" to determine a pathname from the file descriptor or i-node number I wonder how this can be done? Does the "real" system call always work on i-node numbers?

First, you likely wrongly mentioned Posix because the latter doesn't differ system calls and library functions. The putting of futimes() to library calls is Linux specific. In glibc (file sysdeps/unix/sysv/linux/futimes.c), there is the comment:
/* Change the access time of the file associated with FD to TVP[0] and
the modification time of FILE to TVP[1].
Starting with 2.6.22 the Linux kernel has the utimensat syscall which
can be used to implement futimes. Earlier kernels have no futimes()
syscall so we use the /proc filesystem. */
So, this is done using utimensat() with the specified descriptor as the reference one as for all *at() calls. Previously, this worked using utimes() for the path /proc/${pid}/fd/${fd} (too cumbersome and only if /proc is mounted). This is a reply to your second question: despite it isn't generally possible to detect a file name from its descriptor, the file still could be accessed separately. (BTW, the initial path used to open the file is sometimes stored; see /proc/$pid/{cwd,exe} for a Linux process.)
To compare with, FreeBSD provides explicit futimes() and futimesat() syscalls (but I wonder why the latter isn't named "utimesat").

System Call and File Operation in Linux

I am reading about device drivers and I have a question related to the UNIX philosophy regarding everything as file.
When a user issues a command say for eg, opening a file then what comes into action - System call or File Operation?
sys_open is a system call and open is a file operation. Can you please elaborate on the topic.
Thanks in advance.

Quick answer, I hope it'll help:
All system calls work the same way. The system call number is stored somewhere (e.g. in a register) together with the system call parameters. In case of open system calls parameters are: pointer to the filename and permissions string. Then the open function raises a software interruption using the adequate intruction (syscall, int ..., it depends on the HW).
As for any interruption, the kernel is invoked (in kernel mode) to handle the interruption. The system detects that the interruption was caused by a system call, then read the system call number in the register sees it is a open system call, create the file descriptor in the kernel memory and proceed to actually open the file by calling the driver open function. The file descriptor id is then stored back into a register and returns to user mode.
The file descriptor is then retrieved from the register and returned by open().

"Each open file (represented internally by a file
structure, which we will examine shortly) is associated with its own set of functions
(by including a field called f_op that points to a file_operations structure). The
operations are mostly in charge of implementing the system calls and are therefore,
named open, read, and so on."
This is from LDD chapter Character Driver. can anyone please elaborate that what does the last line mean.

intercepting file system system calls

I am writing an application for which I need to intercept some filesystem system calls eg. unlink. I would like to save some file say abc. If user deletes the file then I need to copy it to some other place. So I need unlink to call my code before deleting abc so that I could save it. I have gone through threads related to intercepting system calls but methods like LD_PRELOAD it wont work in my case because I want this to be secure and implemented in kernel so this method wont be useful. inotify notifies after the event so I could not be able to save it. Could you suggest any such method. I would like to implement this in a kernel module instead of modifying kernel code itself.
Another method as suggested by Graham Lee, I had thought of this method but it has some problems ,I need hardlink mirror of all the files it consumes no space but still could be problematic as I have to repeatedly mirror drive to keep my mirror up to date, also it won't work cross partition and on partition not supporting link so I want a solution through which I could attach hooks to the files/directories and then watch for changes instead of repeated scanning.
I would also like to add support for write of modified file for which I cannot use hard links.
I would like to intercept system calls by replacing system calls but I have not been able to find any method of doing that in linux > 3.0. Please suggest some method of doing that.

As far as hooking into the kernel and intercepting system calls go, this is something I do in a security module I wrote:
https://github.com/cormander/tpe-lkm
Look at hijacks.c and symbols.c for the code; how they're used is in the hijack_syscalls function inside security.c. I haven't tried this on linux > 3.0 yet, but the same basic concept should still work.
It's a bit tricky, and you may have to write a good deal of kernel code to do the file copy before the unlink, but it's possible here.

One suggestion could be Filesystems in Userspace (FUSE.) That is, write a FUSE module (which is, granted, in userspace) which intercepts filesystem-related syscalls, performs whatever tasks you want, and possibly calls the "default" syscall afterwards.
You could then mount certain directories with your FUSE filesystem and, for most of your cases, it seems like the default syscall behavior would not need to be overridden.

You can watch unlink events with inotify, though this might happen too late for your purposes (I don't know because I don't know your purposes, and you should experiment to find out). The in-kernel alternatives based on LSM (by which I mean SMACK, TOMOYO and friends) are really for Mandatory Access Control so may not be suitable for your purposes.

If you want to handle deletions only, you could keep a "shadow" directory of hardlinks (created via link) to the files being watched (via inotify, as suggested by Graham Lee).
If the original is now unlinked, you still have the shadow file to handle as you want to, without using a kernel module.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string