I need to know how to write a system call that locks and unlocks a file (inode) or a partition (super_block) for the read and write functions.
Example: these functions are in fs.h:
lock_super(struct super_block *);
unlock_super(struct super_block *);
How do I obtain the super_block (for /dev/sda1, for example)?
The lock_super and unlock_super calls are not meant to be called directly by user-level processes. They are only meant to be called by the VFS layer when an operation (an inode operation) on the filesystem is invoked by a user process. If you still wish to do that, you have to write your own device driver and expose the desired functionality (locking and unlocking of the inode) to user level.
There are no existing system calls that would allow you to lock and unlock inodes. There are many reasons why it is not wise to implement a new system call without due consideration. But if you wish to do that, you would need to write the handler for your own system call in the kernel. It seems you want fine-grained control over the filesystem; perhaps you are implementing a user-level filesystem.
As for how to get the super_block: every filesystem module registers itself with the VFS (Virtual File System). The VFS acts as an intermediate layer between the user and the actual filesystem, so it is the VFS that knows about the function pointers to the lock_super and unlock_super methods. The VFS superblock contains the device info and a set of pointers to the filesystem superblock; you can get those pointers from there and call them. But remember: because the actual filesystem is managed by the VFS, you would potentially be corrupting its data.
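If you are going to experiment anyway, here is a rough sketch of how a module on a 2.6-era kernel (where lock_super()/unlock_super() still exist) could resolve a device path to its super_block. These helpers were changed or removed in later kernels, so treat this as an outline, not a recipe:

    /* Sketch for a 2.6-era kernel; lock_super()/unlock_super() were
     * removed later, and helper availability varies by version. */
    #include <linux/fs.h>
    #include <linux/module.h>

    static int lock_sb_of(const char *path)      /* e.g. "/dev/sda1" */
    {
            struct block_device *bdev;
            struct super_block *sb;

            bdev = lookup_bdev(path);            /* device node -> block_device */
            if (IS_ERR(bdev))
                    return PTR_ERR(bdev);

            sb = get_super(bdev);                /* takes a reference on the sb */
            bdput(bdev);
            if (!sb)
                    return -ENODEV;              /* the device is not mounted */

            lock_super(sb);
            /* ... inspect or modify the superblock here ... */
            unlock_super(sb);

            drop_super(sb);                      /* release the reference */
            return 0;
    }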
Related
When writing to a file opened with O_SYNC, the data (and metadata) is guaranteed to be written to persistent storage when the write call returns, and no explicit fsync call is needed.
Is the same true for ftruncate? Or do I still need to call fsync after ftruncate even with O_SYNC?
Not all filesystems are capable of dealing with holes. There will be some filesystems that actually have to physically write 0's when you call ftruncate().
So logically, ftruncate() should be treated like a write() and be subject to O_SYNC.
The POSIX definition of O_SYNC says:
Write I/O operations on the file descriptor shall complete as defined by synchronized I/O file integrity completion.
And the POSIX definition for "synchronized I/O file integrity completion":
Identical to a synchronized I/O data integrity completion with the addition that all file attributes relative to the I/O operation [...] are successfully transferred prior to returning to the calling process.
And the definition for "Synchronized I/O Data Integrity Completion":
[...] The write is complete only when the data specified in the write request is successfully transferred and all file system information required to retrieve the data is successfully transferred.
That includes the file size.
But notably it only applies to "writes" (and "reads").
However, neither POSIX nor the Linux man pages define what a "write" or "write I/O" is, and in particular, whether ftruncate() counts as one.
So if you want to get lawyerly about it, it is not strictly guaranteed anywhere, although I think that's a bug in the specification.
In practice, though, I doubt any file system that implements O_SYNC and ftruncate() would require you to call fsync() after ftruncate() of a file opened with O_SYNC.
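If you want to be safe in portable code, the belt-and-braces pattern is cheap anyway; a minimal sketch (the file name and size are placeholders):

    /* Even with O_SYNC, follow ftruncate() with fsync() if you need a
     * hard guarantee that the new size has reached the disk. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("data.bin", O_WRONLY | O_CREAT | O_SYNC, 0644);
            if (fd < 0) { perror("open"); return 1; }

            if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

            /* POSIX does not clearly say ftruncate() is a "write I/O
             * operation", so force the metadata out explicitly. */
            if (fsync(fd) < 0) { perror("fsync"); return 1; }

            close(fd);
            return 0;
    }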
Can someone please summarize the events/steps that happen when I execute a read()/write() system call? How does the kernel know which file system to issue these commands to?
Let's say a process calls write().
It will then call sys_write().
Now, since sys_write() is executed on behalf of the current process, it can presumably access the struct task_struct, and hence the struct files_struct and struct fs_struct, which contain file system information.
But after that I don't see how this fs_struct helps to identify the file system.
Edit: Now that Alex has described the flow, I still have a doubt about how the read/write calls get routed to a FS. Since the VFS does not do it, it must be happening somewhere else. Also, how do the underlying block device and then finally the hardware protocol (PCI/USB) get attached?
A simple flow chart involving actual data structures would be helpful
Please help.
This answer is based on kernel version 4.0. I traced out some of the code which handles a read syscall. I recommend you clone the Linux source repo and follow along in the source code.
The syscall handler for read, at fs/read_write.c:620, is called. It receives a file descriptor (an integer) as an argument and calls fdget_pos to convert it to a struct fd.
fdget_pos calls __fdget_pos calls __fdget calls __fget_light. __fget_light uses current->files, the file descriptor table for the current process, to look up the struct file which corresponds to the passed file descriptor number.
Back in the syscall handler, the file struct is passed to vfs_read, at fs/read_write.c:478.
vfs_read calls __vfs_read, which calls file->f_op->read. From here on, you are in filesystem-specific code.
So the VFS doesn't really bother "identifying" the filesystem which a file lives on; it simply uses the table of "file operation" function pointers which is stored in its struct file. When that struct file is initialized, it is given the correct f_op function pointer table which implements all the filesystem-specific operations for its filesystem.
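For reference, the dispatch itself is only a few lines; this is condensed from __vfs_read in fs/read_write.c of the 4.0 source:

    /* Condensed from fs/read_write.c (Linux 4.0): the VFS simply
     * indirects through the function-pointer table attached to the
     * struct file. */
    ssize_t __vfs_read(struct file *file, char __user *buf, size_t count,
                       loff_t *pos)
    {
            if (file->f_op->read)
                    return file->f_op->read(file, buf, count, pos);
            else if (file->f_op->read_iter)
                    return new_sync_read(file, buf, count, pos);
            else
                    return -EINVAL;
    }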
Each filesystem registers itself with the VFS. When a filesystem is mounted, its superblock is read and the VFS superblock is populated with this information. The function-pointer table for this filesystem is also populated at this time. When the file->f_op->read call happens, the registered function from the filesystem is what actually gets called. You can refer to the text at http://www.science.unitn.it/~fiorella/guidelinux/tlk/node102.html
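To make that concrete, here is the kind of per-filesystem table that ends up in file->f_op when a file is opened; the myfs_* names are hypothetical stand-ins for a real filesystem's implementations:

    #include <linux/fs.h>
    #include <linux/module.h>

    /* Hypothetical filesystem-specific implementations (stubs). */
    static ssize_t myfs_read(struct file *filp, char __user *buf,
                             size_t len, loff_t *ppos)
    {
            return 0;   /* a real FS would copy file data to `buf` here */
    }

    static ssize_t myfs_write(struct file *filp, const char __user *buf,
                              size_t len, loff_t *ppos)
    {
            return len; /* a real FS would write the data to disk here */
    }

    static const struct file_operations myfs_file_ops = {
            .owner  = THIS_MODULE,
            .llseek = generic_file_llseek,
            .read   = myfs_read,    /* what file->f_op->read resolves to */
            .write  = myfs_write,
            .open   = generic_file_open,
    };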
Whenever we fire a command on a Linux terminal, the process thus created traverses to the VFS layer, where it is decided which file system's functions are to be called, be it ext4, ext3, or any other filesystem. So my question is: how does the VFS differentiate between the filesystems? From where does the VFS get the filesystem information? Is it the fs_struct in task_struct that tells the VFS?
As part of an FS implementation you need to implement the file, inode, and superblock operations, which registers the underlying FS ops (e.g. ext3_open()) with the VFS layer. Depending on the path to the file provided to open(), the VFS will invoke the appropriate filesystem-specific implementation of the syscall.
Let's say you have already mounted a file system. When you mount a file system, you register your FS for specific operations with the VFS layer during module initialization. During this step, two handlers are registered: get_sb() and kill_sb(). get_sb() is called at the time of mounting the file system; kill_sb() is called at the time of unmounting it.
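A minimal sketch of that registration, written against the older API this answer names (file_system_type with get_sb()/kill_sb(), i.e. pre-2.6.39 kernels; "myfs" and myfs_fill_super are invented names):

    #include <linux/fs.h>
    #include <linux/module.h>

    static int myfs_fill_super(struct super_block *sb, void *data, int silent)
    {
            /* read the on-disk superblock and populate sb->s_op etc. */
            return -ENOSYS;                       /* stub */
    }

    /* called at mount time */
    static int myfs_get_sb(struct file_system_type *fs_type, int flags,
                           const char *dev_name, void *data,
                           struct vfsmount *mnt)
    {
            return get_sb_bdev(fs_type, flags, dev_name, data,
                               myfs_fill_super, mnt);
    }

    static struct file_system_type myfs_type = {
            .owner   = THIS_MODULE,
            .name    = "myfs",
            .get_sb  = myfs_get_sb,               /* mount */
            .kill_sb = kill_block_super,          /* unmount */
    };

    static int __init myfs_init(void)
    {
            return register_filesystem(&myfs_type); /* "mount -t myfs" now works */
    }

    static void __exit myfs_exit(void)
    {
            unregister_filesystem(&myfs_type);
    }

    module_init(myfs_init);
    module_exit(myfs_exit);
    MODULE_LICENSE("GPL");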
For more information, refer to RKFS and look into how the file operations are implemented, along with the data flow diagrams.
I'm periodically reading from a file and checking the readout to decide subsequent action. As this file may be modified by some mechanism which bypasses the block I/O layer in the Linux kernel, I need to ensure the read operation reads data from the real underlying device instead of from the kernel buffer.
I know fsync() can make sure all I/O write operations have completed, with all data written to the real device, but it does nothing for I/O read operations.
The file has to be kept opened.
So could anyone please tell me how to meet this requirement on a Linux system? Is there an API similar to fsync() that can be called?
Really appreciate your help!
I believe that you want to use the O_DIRECT flag to open().
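A minimal sketch of that approach; note that O_DIRECT requires the buffer, file offset, and transfer size to be suitably aligned (the 4096-byte alignment here is an assumption; the real requirement is the device's logical block size):

    /* Sketch: read bypassing the page cache with O_DIRECT.
     * "datafile" and the 4096-byte alignment are placeholders. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
            void *buf;
            int fd = open("datafile", O_RDONLY | O_DIRECT);
            if (fd < 0) { perror("open"); return 1; }

            /* O_DIRECT needs an aligned buffer; posix_memalign provides one */
            if (posix_memalign(&buf, 4096, 4096) != 0) {
                    fprintf(stderr, "posix_memalign failed\n");
                    return 1;
            }

            ssize_t n = pread(fd, buf, 4096, 0);  /* served by the device, not the cache */
            if (n < 0)
                    perror("pread");

            free(buf);
            close(fd);
            return 0;
    }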
I think memory mapping in combination with madvise() and/or posix_fadvise() should satisfy your requirements... Linus contrasts this with O_DIRECT at http://kerneltrap.org/node/7563 ;-).
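Another best-effort variant, if O_DIRECT is unavailable, is to ask the kernel to drop its cached pages for the file before each read. POSIX_FADV_DONTNEED is advisory, so this is a hint rather than a guarantee:

    /* Sketch: evict cached pages before re-reading so the next read is
     * (likely) served from the device. Advisory only; the fd can stay
     * open across iterations, per the question's constraint. */
    #include <fcntl.h>
    #include <unistd.h>

    static ssize_t fresh_read(int fd, void *buf, size_t len, off_t off)
    {
            posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED); /* drop page cache for fd */
            return pread(fd, buf, len, off);
    }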
You are going to be in trouble if another device is writing to the block device at the same time as the kernel.
The kernel assumes that the block device won't be written to by any party other than itself. This is true even if the filesystem is mounted read-only.
Even if you used direct IO, the kernel may cache filesystem metadata, so a change in the location of those blocks of the file may result in incorrect behaviour.
So in short - don't do that.
If you wanted, you could access the block device directly - which might be a more successful scheme, but still potentially allowing harmful race-conditions (you cannot guarantee the order of the metadata and data updates by the other device). These could cause you to end up reading junk from the device (if the metadata were updated before the data). You'd better have a mechanism of detecting junk reads in this case.
I am, of course, assuming some very simple, braindead filesystem such as FAT, which might reasonably be implemented in userspace (mtools, for instance, does).
I am writing a program consisting of a user program and a kernel module. The kernel module needs to gather data that it will then "send" to the user program. This has to be done via a /proc file. Now, I create the file and everything is fine, but I have spent ages searching the internet for an answer and still cannot find one. How do you read/write a /proc file from kernel space? The write_proc and read_proc handlers supplied to the proc file are used to read and write data from USER space, whereas I need the module to be able to write to the /proc file itself.
That's not how it works. When a userspace program opens one of the files, its contents are generated on the fly on a case-by-case basis. Most of them are read-only and generated by a common mechanism:
Register an entry with create_proc_read_entry
Supply a callback function (called read_proc by convention) which is called when the file is read
This callback function should populate a supplied buffer and (typically) call proc_calc_metrics to update the file pointer etc supplied to userspace.
You (from the kernel) do not "write" to procfs files, you supply the results dynamically when userspace requests them.
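Putting those steps together, a minimal module might look like this. It is written against the old procfs API this answer describes (create_proc_read_entry was removed in Linux 3.10), and since proc_calc_metrics is not exported to modules, its logic is inlined here; the "mymod" names are invented:

    #include <linux/module.h>
    #include <linux/proc_fs.h>

    static int mydata = 42;   /* whatever the module has gathered */

    /* read_proc callback: fills `page` each time userspace reads the
     * file. The tail of the function mirrors proc_calc_metrics. */
    static int mymod_read_proc(char *page, char **start, off_t off,
                               int count, int *eof, void *data)
    {
            int len = sprintf(page, "%d\n", mydata);

            if (len <= off + count)
                    *eof = 1;
            *start = page + off;
            len -= off;
            if (len > count)
                    len = count;
            if (len < 0)
                    len = 0;
            return len;
    }

    static int __init mymod_init(void)
    {
            /* register /proc/mymod; the callback runs on each read */
            if (!create_proc_read_entry("mymod", 0, NULL,
                                        mymod_read_proc, NULL))
                    return -ENOMEM;
            return 0;
    }

    static void __exit mymod_exit(void)
    {
            remove_proc_entry("mymod", NULL);
    }

    module_init(mymod_init);
    module_exit(mymod_exit);
    MODULE_LICENSE("GPL");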
One approach to getting data across to user space would be seq_files. In order to configure (write) kernel parameters, you may want to consider sysfs nodes.
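A minimal seq_file sketch using the single_open helper (the mymod names are invented; proc_create is the registration call available since 2.6.26):

    #include <linux/module.h>
    #include <linux/proc_fs.h>
    #include <linux/seq_file.h>

    static int mymod_show(struct seq_file *m, void *v)
    {
            seq_printf(m, "value=%d\n", 42);  /* emit the gathered data */
            return 0;
    }

    static int mymod_open(struct inode *inode, struct file *file)
    {
            return single_open(file, mymod_show, NULL);
    }

    static const struct file_operations mymod_fops = {
            .owner   = THIS_MODULE,
            .open    = mymod_open,
            .read    = seq_read,
            .llseek  = seq_lseek,
            .release = single_release,
    };

    static int __init mymod_init(void)
    {
            proc_create("mymod", 0, NULL, &mymod_fops);
            return 0;
    }

    static void __exit mymod_exit(void)
    {
            remove_proc_entry("mymod", NULL);
    }

    module_init(mymod_init);
    module_exit(mymod_exit);
    MODULE_LICENSE("GPL");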