System Call and File Operation in Linux

I am reading about device drivers and have a question related to the UNIX philosophy of treating everything as a file.
When a user issues a command, for example to open a file, what comes into action: a system call or a file operation?
sys_open is a system call and open is a file operation. Could you please elaborate on the topic?
Thanks in advance.

Quick answer, I hope it helps:
All system calls work the same way. The system call number is stored somewhere (e.g. in a register) together with the system call parameters. For the open system call the parameters are a pointer to the filename, the flags, and (when a file may be created) the permission mode. The open function then raises a software interrupt using the appropriate instruction (syscall, int ..., depending on the hardware).
As with any interrupt, the kernel is invoked (in kernel mode) to handle it. The kernel detects that the interrupt was caused by a system call, reads the system call number from the register, sees that it is an open system call, creates the file structure in kernel memory, and proceeds to actually open the file by calling the driver's open function. The file descriptor number is then stored back into a register and execution returns to user mode.
The file descriptor is then retrieved from the register and returned by open().
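To make the "number plus parameters" idea concrete, here is a minimal user-space sketch, assuming a Linux system with glibc. It issues the same request twice: once through the generic syscall() entry point with an explicit system call number, and once through the ordinary open() wrapper. The function names raw_open_readonly and libc_open_readonly are made up for this illustration. SYS_openat is used because, as far as I know, not every architecture still provides a plain open system call.

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* open(), O_RDONLY, AT_FDCWD */
#include <sys/syscall.h>  /* SYS_openat */
#include <unistd.h>       /* syscall(), close() */

/* Hypothetical helper: open a file read-only by issuing the system call
 * directly. SYS_openat is the system call number, the remaining arguments
 * are its parameters. syscall() returns -1 and sets errno on failure. */
int raw_open_readonly(const char *path)
{
    return (int)syscall(SYS_openat, AT_FDCWD, path, O_RDONLY);
}

/* Hypothetical helper: the same request through the usual libc wrapper,
 * which loads the registers and traps into the kernel on our behalf. */
int libc_open_readonly(const char *path)
{
    return open(path, O_RDONLY);
}
```

Both helpers should hand back a valid descriptor for an existing path such as "/", and -1 for a path that does not exist.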

"Each open file (represented internally by a file
structure, which we will examine shortly) is associated with its own set of functions
(by including a field called f_op that points to a file_operations structure). The
operations are mostly in charge of implementing the system calls and are therefore,
named open, read, and so on."
This is from the Character Drivers chapter of LDD. Can anyone please elaborate on what the last line means?
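One way to read that last line: the generic system call code does not know how to read from any particular device, it just calls whatever function the open file's f_op table points at. The following is a user-space simulation of that idea, not kernel code. All the sim_ and hello_ names are invented for this sketch, and the structures are heavily trimmed compared to the real struct file and struct file_operations.

```c
#include <stddef.h>
#include <string.h>

struct sim_file;  /* forward declaration */

/* Trimmed-down, hypothetical analogue of struct file_operations. */
struct sim_file_operations {
    int  (*open)(struct sim_file *f);
    long (*read)(struct sim_file *f, char *buf, size_t len);
};

/* Analogue of struct file: one per open file, carrying an f_op pointer. */
struct sim_file {
    const struct sim_file_operations *f_op;
    int open_count;
};

/* "Driver" side: the functions that implement open and read for one device. */
int hello_open(struct sim_file *f)
{
    f->open_count++;
    return 0;
}

long hello_read(struct sim_file *f, char *buf, size_t len)
{
    static const char msg[] = "hello";
    size_t n = sizeof msg - 1;

    (void)f;
    if (len < n)
        n = len;          /* never copy more than the caller asked for */
    memcpy(buf, msg, n);
    return (long)n;
}

const struct sim_file_operations hello_fops = {
    .open = hello_open,
    .read = hello_read,
};

/* Generic side, standing in for the read system call: it only dispatches
 * through the table and has no device-specific knowledge. */
long sim_sys_read(struct sim_file *f, char *buf, size_t len)
{
    if (f->f_op == NULL || f->f_op->read == NULL)
        return -1;
    return f->f_op->read(f, buf, len);
}
```

So "open" and "read" in the table are named after the system calls because each one is the device-specific half of the corresponding call.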

Related

Working of ioctl(2)

Please explain to me how the ioctl(2) system call works.
The manual page and Wikipedia are neither very informative nor detailed.
What should the file descriptor that is passed as the first argument of ioctl(2) point to?
You have to open the device you want to manipulate before calling ioctl. Then you pass the file descriptor for the device you want to manipulate as the first parameter. You would only call ioctl when there's some particular device that has some particular manipulation that you want to perform on it.
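As a small illustration of "get a descriptor first, then pass it to ioctl", here is a sketch that uses the FIONREAD request to ask how many bytes are waiting to be read. It assumes Linux, where as far as I know FIONREAD is also supported on pipes, so the example can be tried on a pipe instead of a real device node. The helper name bytes_pending is made up for this example.

```c
#include <sys/ioctl.h>  /* ioctl(), FIONREAD */
#include <unistd.h>     /* pipe(), write(), close() */

/* Hypothetical helper: return the number of unread bytes queued on fd,
 * or -1 if the ioctl fails (bad descriptor, request not supported, ...). */
int bytes_pending(int fd)
{
    int n = 0;

    /* First argument: an already-open descriptor. Second: the request
     * code. Third: request-specific, here a pointer to the result. */
    if (ioctl(fd, FIONREAD, &n) == -1)
        return -1;
    return n;
}
```

After writing five bytes into a pipe, calling bytes_pending on the read end should report 5. The request code and the meaning of the third argument are defined by whatever sits behind the descriptor, which is why ioctl only makes sense for a specific device and a specific operation.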

How does Linux identify a particular file system to execute a system call

Can you please summarize the events/steps that happen when I execute a read()/write() system call? How does the kernel know which file system to issue these commands to?
Let's say a process calls write().
Then it will call sys_write().
Now, since sys_write() is executed on behalf of the current process, it can probably access the struct task_struct, and hence the struct files_struct and struct fs_struct, which contain file system information.
But after that I do not see how this fs_struct helps to identify the file system.
Edit: Now that Alex has described the flow, I still have a doubt about how read/write get routed to a file system. Since the VFS does not do it, it must be happening somewhere else. Also, how do the underlying block device and finally the hardware protocol (PCI/USB) get attached?
A simple flow chart involving the actual data structures would be helpful.
Please help.
This answer is based on kernel version 4.0. I traced out some of the code which handles a read syscall. I recommend you clone the Linux source repo and follow along in the source code.
The syscall handler for read, at fs/read_write.c:620, is called. It receives a file descriptor (an integer) as an argument and calls fdget_pos to convert it to a struct fd.
fdget_pos calls __fdget_pos, which calls __fdget, which calls __fget_light. __fget_light uses current->files, the file descriptor table for the current process, to look up the struct file which corresponds to the passed file descriptor number.
Back in the syscall handler, the file struct is passed to vfs_read, at fs/read_write.c:478.
vfs_read calls __vfs_read, which calls file->f_op->read. From here on, you are in filesystem-specific code.
So the VFS doesn't really bother "identifying" the filesystem which a file lives on; it simply uses the table of "file operation" function pointers which is stored in its struct file. When that struct file is initialized, it is given the correct f_op function pointer table which implements all the filesystem-specific operations for its filesystem.
Each filesystem registers itself with the VFS. When a filesystem is mounted, its superblock is read and the VFS superblock is populated with this information. The function pointer table for this filesystem is also populated at this time. When the file->f_op->read call happens, the function registered by the filesystem is what actually gets called. You can refer to the text at http://www.science.unitn.it/~fiorella/guidelinux/tlk/node102.html
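The flow described in these answers can be sketched as a toy user-space model: the filesystem is chosen once, at open time, and from then on a read only needs the descriptor number. This is a simulation under simplifying assumptions, not kernel code. Every toy_ and fs_ name is invented, the "mount table" is a fixed array of path prefixes, and the global array stands in for current->files.

```c
#include <stddef.h>
#include <string.h>

struct toy_file;

struct toy_file_ops {
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_ops *f_op;  /* set once, when the file is opened */
};

/* Two hypothetical filesystems, each with its own read implementation. */
long fs_a_read(struct toy_file *f, char *buf, size_t len)
{
    (void)f;
    if (len < 1)
        return 0;
    buf[0] = 'A';
    return 1;
}

long fs_b_read(struct toy_file *f, char *buf, size_t len)
{
    (void)f;
    if (len < 1)
        return 0;
    buf[0] = 'B';
    return 1;
}

const struct toy_file_ops fs_a_ops = { .read = fs_a_read };
const struct toy_file_ops fs_b_ops = { .read = fs_b_read };

/* Toy mount table: which operations table serves which path prefix. */
struct toy_mount {
    const char *prefix;
    const struct toy_file_ops *ops;
};

const struct toy_mount toy_mounts[] = {
    { "/mnt/a/", &fs_a_ops },
    { "/mnt/b/", &fs_b_ops },
};

#define TOY_MAX_FDS 8
struct toy_file toy_fd_table[TOY_MAX_FDS];  /* stands in for current->files */

/* open: find the filesystem for the path and record its ops in a free slot. */
int toy_open(const char *path)
{
    size_t m, fd;

    for (m = 0; m < sizeof toy_mounts / sizeof toy_mounts[0]; m++) {
        const char *prefix = toy_mounts[m].prefix;

        if (strncmp(path, prefix, strlen(prefix)) != 0)
            continue;
        for (fd = 0; fd < TOY_MAX_FDS; fd++) {
            if (toy_fd_table[fd].f_op == NULL) {
                toy_fd_table[fd].f_op = toy_mounts[m].ops;
                return (int)fd;
            }
        }
        return -1;  /* descriptor table is full */
    }
    return -1;      /* no filesystem mounted for this path */
}

/* read: no path and no filesystem lookup, just fd -> file -> f_op->read. */
long toy_read(int fd, char *buf, size_t len)
{
    if (fd < 0 || fd >= TOY_MAX_FDS || toy_fd_table[fd].f_op == NULL)
        return -1;
    return toy_fd_table[fd].f_op->read(&toy_fd_table[fd], buf, len);
}
```

Opening "/mnt/a/x" and "/mnt/b/y" gives two descriptors whose reads land in different functions, even though toy_read itself never looks at a path or a filesystem type.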

How to implement futimes in terms of utimes?

Given that in Linux utimes(2) is a system call and futimes(3) is a library function, I would think that futimes is implemented in terms of utimes. However, utimes takes a pathname, whereas futimes takes a file descriptor.
Since it is "not possible" to determine a pathname from a file descriptor or i-node number, I wonder how this can be done. Does the "real" system call always work on i-node numbers?
First, you were probably wrong to mention POSIX, because POSIX does not distinguish between system calls and library functions. Placing futimes() among the library calls is Linux-specific. In glibc (file sysdeps/unix/sysv/linux/futimes.c), there is the comment:
/* Change the access time of the file associated with FD to TVP[0] and
the modification time of FILE to TVP[1].
Starting with 2.6.22 the Linux kernel has the utimensat syscall which
can be used to implement futimes. Earlier kernels have no futimes()
syscall so we use the /proc filesystem. */
So, this is done using utimensat() with the specified descriptor as the reference one, as for all *at() calls. Previously, this worked by calling utimes() on the path /proc/${pid}/fd/${fd} (cumbersome, and only possible if /proc is mounted). This answers your second question: although it is generally not possible to determine a file name from its descriptor, the file can still be accessed by other means. (BTW, the initial path used to open the file is sometimes stored; see /proc/$pid/{cwd,exe} for a Linux process.)
For comparison, FreeBSD provides explicit futimes() and futimesat() syscalls (but I wonder why the latter isn't named "utimesat").
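Here is a rough sketch of the utimensat() approach, assuming a 64-bit Linux system where the C library's struct timespec has the layout the kernel expects. It is not glibc's actual code. The system call is issued through syscall() with a NULL pathname, because my understanding is that the glibc utimensat() wrapper rejects a NULL path. The names my_futimes and demo_roundtrip are invented for this example.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdlib.h>       /* mkstemp() */
#include <sys/stat.h>     /* fstat() */
#include <sys/syscall.h>  /* SYS_utimensat */
#include <sys/time.h>     /* struct timeval */
#include <time.h>         /* struct timespec */
#include <unistd.h>       /* syscall(), unlink(), close() */

/* Hypothetical stand-in for futimes(): tv[0] is the access time,
 * tv[1] the modification time. Returns 0 on success, -1 on error. */
int my_futimes(int fd, const struct timeval tv[2])
{
    struct timespec ts[2];
    int i;

    for (i = 0; i < 2; i++) {
        ts[i].tv_sec = tv[i].tv_sec;
        ts[i].tv_nsec = tv[i].tv_usec * 1000L;  /* microseconds to nanoseconds */
    }
    /* NULL pathname: operate on the file that fd itself refers to. */
    return (int)syscall(SYS_utimensat, fd, NULL, ts, 0);
}

/* Hypothetical demo: set the times on a file that no longer has any
 * pathname, then read them back through the descriptor. */
int demo_roundtrip(void)
{
    char path[] = "/tmp/futimes_demo_XXXXXX";
    struct timeval tv[2] = { { 1000, 0 }, { 2000, 0 } };
    struct stat st;
    int ok;
    int fd = mkstemp(path);

    if (fd == -1)
        return -1;
    unlink(path);  /* the descriptor stays valid after the name is gone */
    ok = my_futimes(fd, tv) == 0
         && fstat(fd, &st) == 0
         && st.st_atime == 1000
         && st.st_mtime == 2000;
    close(fd);
    return ok ? 0 : -1;
}
```

The unlink() in the demo is there to underline the point of the question: no pathname is needed, the descriptor alone is enough for the kernel to find the file.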

How to read/write from/to a linux /proc file from kernel space?

I am writing a program consisting of a user program and a kernel module. The kernel module needs to gather data that it will then "send" to the user program. This has to be done via a /proc file. I can create the file and everything is fine so far, but I have spent ages searching the internet for an answer and still cannot find one. How do you read/write a /proc file from kernel space? The write_proc and read_proc callbacks supplied to the proc file are used to read and write data from USER space, whereas I need the module to be able to write to the /proc file itself.
That's not how it works. When a userspace program opens the files, they are generated on the fly on a case-by-case basis. Most of them are readonly and generated by a common mechanism:
Register an entry with create_proc_read_entry
Supply a callback function (called read_proc by convention) which is called when the file is read
This callback function should populate a supplied buffer and (typically) call proc_calc_metrics to update the file pointer etc supplied to userspace.
You (from the kernel) do not "write" to procfs files, you supply the results dynamically when userspace requests them.
One approach to getting data across to user space is the seq_file interface. To configure (write) kernel parameters, you may want to consider sysfs nodes.
Thanks,
Vijay

Redundant Linux Kernel System Calls

I'm currently working on a project that hooks into various system calls and writes things to a log, depending on which one was called. So, for example, when I change the permissions of a file, I write a little entry to a log file that tracks the old permission and new permission. However, I'm having some trouble pinning down exactly where I should be watching. For the above example, strace tells me that the "chmod" command uses the system call sys_fchmodat(). However, there's also a sys_chmod() and a sys_fchmod().
I'm sure the kernel developers know what they're doing, but I wonder: what is the point of all these (seemingly) redundant system calls, and is there any rule on which ones are used for what? (i.e. are the "at" syscalls or ones prefixed with "f" meant to do something specific?)
History :-)
Once a system call has been created it can't ever be changed, therefore when new functionality is required a new system call is created. (Of course this means there's a very high bar before a new system call is created).
Yes, there are some naming rules.
chmod takes a filename, while fchmod takes a file descriptor. Same for stat vs fstat.
fchmodat takes a file descriptor/filename pair (file descriptor for the directory and filename for the file name within the directory). Same for other *at calls; see the NOTES section of http://kerneltrap.org/man/linux/man2/openat.2 for an explanation.
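The three flavours can be compared side by side in a short sketch. It assumes a writable /tmp, and the file name chmod_demo.tmp as well as the helper names mode_of and chmod_three_ways are invented for the example. Each call changes the permissions of the same file, addressed by full pathname, by open descriptor, and by directory descriptor plus relative name.

```c
#define _GNU_SOURCE
#include <fcntl.h>     /* open(), openat(), O_* flags */
#include <sys/stat.h>  /* chmod(), fchmod(), fchmodat(), fstat() */
#include <unistd.h>    /* unlinkat(), close() */

/* Hypothetical helper: permission bits of an open file, or -1 on error. */
int mode_of(int fd)
{
    struct stat st;

    if (fstat(fd, &st) == -1)
        return -1;
    return (int)(st.st_mode & 0777);
}

/* Hypothetical demo: change the mode of one file in three ways and record
 * the mode observed after each step. Returns 0 if all three succeeded. */
int chmod_three_ways(int modes[3])
{
    const char *name = "chmod_demo.tmp";       /* made-up file name */
    const char *full = "/tmp/chmod_demo.tmp";  /* same file, full path */
    int rc = -1;
    int fd;
    int dirfd = open("/tmp", O_RDONLY | O_DIRECTORY);

    if (dirfd == -1)
        return -1;
    fd = openat(dirfd, name, O_CREAT | O_RDWR, 0600);
    if (fd == -1) {
        close(dirfd);
        return -1;
    }

    if (chmod(full, 0640) == 0) {                      /* by pathname */
        modes[0] = mode_of(fd);
        if (fchmod(fd, 0600) == 0) {                   /* by descriptor */
            modes[1] = mode_of(fd);
            if (fchmodat(dirfd, name, 0644, 0) == 0) { /* by dirfd + name */
                modes[2] = mode_of(fd);
                rc = 0;
            }
        }
    }

    unlinkat(dirfd, name, 0);  /* clean up the demo file */
    close(fd);
    close(dirfd);
    return rc;
}
```

For the logging project in the question, this suggests that a hook on only one of the three would miss changes made through the other two.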

Resources