Working of ioctl(2) - linux

Please explain to me the working of the ioctl(2) system call.
The manual page and wikipedia are neither very informative nor detailed.
What should the file descriptor that is passed as the first argument of ioctl(2) be pointing to?

You have to open the device you want to manipulate before calling ioctl. Then you pass the file descriptor for the device you want to manipulate as the first parameter. You would only call ioctl when there's some particular device that has some particular manipulation that you want to perform on it.

Related

fuse: pass through custom open flags from the "open(2)" call `.open` handler in fuse implementation

I have implemented a fuse driver for cache management. I would like to give some hints from the application side to fuse driver using file system calls like open(2), e.g., sequential read pattern.
I thought I could use the flags argument in open(2)1 and thought it would be passed to the fuse handler 2. I have done some experiments and found the my custom bits are not shown in the fuse's .open handler in the struct fuse_file_info *.
I would like to know why but I failed to find any documentation about the flags validation/stripping rules in the linux kernel documentation and libuse documentation. Can anyone point me to the specification or related source code?
Besides, I'm wondering if passing custom (not specified in fcntl.h) is not possible, is three other approaches where I can pass some hints from the application to the fuse handler?
Thanks in advance!
What are the flags you've tried? because doc of open says:
Creation (O_CREAT, O_EXCL, O_NOCTTY) flags will be filtered out / handled by the kernel.
As for alternatives, the standard way for any file system, not necessarily fuse, would be a dedicated ioctl. In case the hint is per file and not per specific opened descriptor, setxattr is also an option.

How linux identify a particular file system to execute system call

Can please summarize the events/steps that happen when I try to execute a read()/write() system call. How does the kernel know which file system to issue these commands.
Lets say a process calls write().
Then It will call sys_write().
Now probably, since sys_write() is executed on behalf of the current process, it can access the struct task_struct and hence it can access the struct files_struct and struct fs_struct which contains file system information.
But after that I am not seeing, how this fs_struct is helping to identify the file system.
Edit: Now that Alex has described the flow...I have still doubt how the read/write are getting routed to a FS, since the VFS does not do it, then it must be happening somewhere else, Also how is the underlying block device and then finally the hardware protocol PCI/USB getting attached.
A simple flow chart involving actual data structures would be helpful
Please help.
This answer is based on kernel version 4.0. I traced out some of the code which handles a read syscall. I recommend you clone the Linux source repo and follow along in the source code.
Syscall handler for read, at fs/read_write.c:620 is called. It receives a file descriptor (integer) as an argument, and calls fdget_pos to convert it to a struct fd.
fdget_pos calls __fdget_pos calls __fdget calls __fget_light. __fget_light uses current->files, the file descriptor table for the current process, to look up the struct file which corresponds to the passed file descriptor number.
Back in the syscall handler, the file struct is passed to vfs_read, at fs/read_write.c:478.
vfs_read calls __vfs_read, which calls file->f_op->read. From here on, you are in filesystem-specific code.
So the VFS doesn't really bother "identifying" the filesystem which a file lives on; it simply uses the table of "file operation" function pointers which is stored in its struct file. When that struct file is initialized, it is given the correct f_op function pointer table which implements all the filesystem-specific operations for its filesystem.
Each filesystem registers itself to VFS. When a filesystem is mounted, its superblock is read and VFS superblock is populated with this information. Function pointer table for this filesystem is also populated at this time. when file->f_op->read call happens, registered function from the filesystem is actually called. You can refer to text in http://www.science.unitn.it/~fiorella/guidelinux/tlk/node102.html

How to implement futimes in terms of utimes?

Given that in Linux utimes(2) is a system call and futimes(3) is a library function, I would think that futimes is implemented in terms of utimes. However, utimes takes a pathname, whereas futimes takes a file descriptor.
Since, it is "not possible" to determine a pathname from the file descriptor or i-node number I wonder how this can be done? Does the "real" system call always work on i-node numbers?
First, you likely wrongly mentioned Posix because the latter doesn't differ system calls and library functions. The putting of futimes() to library calls is Linux specific. In glibc (file sysdeps/unix/sysv/linux/futimes.c), there is the comment:
/* Change the access time of the file associated with FD to TVP[0] and
the modification time of FILE to TVP[1].
Starting with 2.6.22 the Linux kernel has the utimensat syscall which
can be used to implement futimes. Earlier kernels have no futimes()
syscall so we use the /proc filesystem. */
So, this is done using utimensat() with the specified descriptor as the reference one as for all *at() calls. Previously, this worked using utimes() for the path /proc/${pid}/fd/${fd} (too cumbersome and only if /proc is mounted). This is a reply to your second question: despite it isn't generally possible to detect a file name from its descriptor, the file still could be accessed separately. (BTW, the initial path used to open the file is sometimes stored; see /proc/$pid/{cwd,exe} for a Linux process.)
To compare with, FreeBSD provides explicit futimes() and futimesat() syscalls (but I wonder why the latter isn't named "utimesat").

System Call and File Operation in Linux

I am reading about device drivers and I have a question related to the UNIX philosophy regarding everything as file.
When a user issues a command say for eg, opening a file then what comes into action - System call or File Operation?
sys_open is a system call and open is a file operation. Can you please elaborate on the topic.
Thanks in advance.
Quick answer, I hope it'll help:
All system calls work the same way. The system call number is stored somewhere (e.g. in a register) together with the system call parameters. In case of open system calls parameters are: pointer to the filename and permissions string. Then the open function raises a software interruption using the adequate intruction (syscall, int ..., it depends on the HW).
As for any interruption, the kernel is invoked (in kernel mode) to handle the interruption. The system detects that the interruption was caused by a system call, then read the system call number in the register sees it is a open system call, create the file descriptor in the kernel memory and proceed to actually open the file by calling the driver open function. The file descriptor id is then stored back into a register and returns to user mode.
The file descriptor is then retrieved from the register and returned by open().
"Each open file (represented internally by a file
structure, which we will examine shortly) is associated with its own set of functions
(by including a field called f_op that points to a file_operations structure). The
operations are mostly in charge of implementing the system calls and are therefore,
named open, read, and so on."
This is from LDD chapter Character Driver. can anyone please elaborate that what does the last line mean.

How to read/write from/to a linux /proc file from kernel space?

I am writing a program consisting of user program and a kernel module. The kernel module needs to gather data that it will then "send" to the user program. This has to be done via a /proc file. Now, I create the file, everything is fine, and spent ages reading the internet for answer and still cannot find one. How do you read/write a /proc file from the kernel space ? The write_proc and read_proc supplied to the procfile are used to read and write data from USER space, whereas I need the module to be able to write the /proc file itself.
That's not how it works. When a userspace program opens the files, they are generated on the fly on a case-by-case basis. Most of them are readonly and generated by a common mechanism:
Register an entry with create_proc_read_entry
Supply a callback function (called read_proc by convention) which is called when the file is read
This callback function should populate a supplied buffer and (typically) call proc_calc_metrics to update the file pointer etc supplied to userspace.
You (from the kernel) do not "write" to procfs files, you supply the results dynamically when userspace requests them.
One of the approaches to get data across to the user space would be seq_files. In order to configure (write) kernel parameters you may want to consider sys-fs nodes.
Thanks,
Vijay

Resources