Linux API - EXT3 file information

I'm writing backup software. I want to programmatically determine whether a file has been modified since the last backup. Is there a flag or something like that on files under the EXT3 filesystem?

Sure. Just call stat() on the file, and inspect the st_mtime member:
struct stat {
    /* ... snip ... */
    time_t st_atime;  /* time of last access */
    time_t st_mtime;  /* time of last modification */
    time_t st_ctime;  /* time of last status change */
};
If your application records a timestamp of when the last backup was made, you can compare it directly against st_mtime.
Note, though, that not all filesystems reliably update the modification time, since doing so is somewhat expensive. You seem to be aware of this risk.
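For instance, a minimal sketch (last_backup_time is a value your application would have saved after the previous run):
#include <sys/stat.h>
#include <time.h>

/* Returns 1 if 'path' was modified after 'last_backup_time',
 * 0 if it wasn't, and -1 on error. */
int modified_since(const char *path, time_t last_backup_time)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return -1;
    return sb.st_mtime > last_backup_time;
}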

I think you are looking for stat()


Do I need to close a file before calling syncfs()

On my embedded system, I want to make sure that the data is safely written when I close a file - if the system reports that the data was saved, the user should be able to remove power immediately.
I know that the proper way to do this is fsync(), fclose(), and then fsync() on the directory (cf. this blog entry). However, it's a bit tricky to get a file descriptor for the directory in my case (I'd have to go through /proc/self/fd to recover the filename and derive the directory from there). It would be much simpler to just call syncfs() on the entire filesystem - I know this is the only open file on the filesystem anyway.
Now my question is:
Is it sufficient to do syncfs()?
Do I need to fclose() the FILE * first (for the directory entry to be up-to-date)? Or is fflush() sufficient?
If it needs to be closed, is it useful to dup() the file descriptor before closing so I can use it directly for syncfs()?
First of all, don't mix standard library <stdio.h> calls (like fprintf(3) or fopen(3)) with system calls (like open(2), close(2), or sync(2)). The former are library routines that buffer data in in-process buffers the system knows nothing about, while the latter are operating system interfaces that make the system responsible for the data from that point onwards. You can distinguish them easily: the former operate on FILE * stream pointers, the latter on plain int file descriptors.
So if you use a system call to ensure your data is properly synced to disk, it is absolutely necessary to fflush(3) your process's buffered data first, before making the sync(2) or fsync(2) call.
No sync(2) is guaranteed to happen at fclose(3) or even close(2) time, nor in the atexit() callbacks your process runs before exit().
The operating system delays writes for performance reasons, and close(2) is not an event that triggers a flush. Just think of how many processes can be reading and writing the same file at the same time: having every close(2) trigger a filesystem flush would be painful. Instead, the operating system flushes at regular intervals, on umount(2), on system shutdown, and on explicit sync(2) and fsync(2) calls.
If you need to keep the FILE *fp stream open, just call fflush(fp) on it to ensure that the operating system has received all the data you have fwrite(3)d or fprintf(3)ed so far.
So finally, if you are using <stdio.h> functions, first call fflush() on every FILE * stream you have written to, or call fflush(NULL); to tell stdio to flush them all in one call. Then do the sync(2) or fsync(2) call to ensure all your data is physically on disk. No need to close anything.
FILE *fp;
...
fflush(fp);            /* move the stdio buffer into the kernel */
fsync(fileno(fp));     /* ask the kernel to push it to disk */
/* here you know that everything written so far, via write(2)
 * or fwrite(3), has been synced to disk */
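And if you do later have a path to the containing directory, the fuller belt-and-braces sequence from the blog entry linked in the question can be sketched as follows (dirpath is a hypothetical, known directory path; error checking omitted):
int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);

fflush(fp);           /* push stdio buffers into the kernel */
fsync(fileno(fp));    /* flush the file's data and metadata to disk */
fsync(dirfd);         /* flush the directory entry itself */
close(dirfd);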
By the way, your approach of going through /dev/fd/<number> to recover the descriptor you previously had is flawed for two reasons:
Once you close your descriptor, /dev/fd/<number> no longer refers to the file you want. Normally it doesn't even exist. Just try this:
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>

int main()
{
    int fd;
    char fn[] = "/dev/fd/1";

    close(1);                   /* close standard output */
    fd = open(fn, O_RDONLY);    /* try to reopen from /dev/fd */
    if (fd < 0) {
        fprintf(stderr,
                "%s: %s (errno=%d)\n",
                fn,
                strerror(errno),
                errno);
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
} /* main */
You cannot get the directory an open file belongs to from the file descriptor alone. For a file with multiple hard links, there can be thousands of directories pointing to it, and nothing in the inode (or in the open file structure) records the path that was used to open the file. A common pattern for temporary files is to create them and immediately unlink(2) them, so nobody else can open them; as long as you keep the file open you have access to it, but no path points to it anymore.
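That pattern, as a minimal fragment:
int fd = open("/tmp/scratch", O_RDWR | O_CREAT | O_EXCL, 0600);
unlink("/tmp/scratch");   /* the name is gone; nobody else can open it */
/* ... read and write through fd as usual ... */
close(fd);                /* only now is the inode actually freed */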
Enable the "sync" flag in your filesystem (/etc/fstab), default is "async" (disabled) . When this flag is enabled, all changes to the according filesystem are inmediately flushed to disk. This makes your entire filesystem slow, but depending on your embedded system requirements, this can be a great option to consider.

How to monitor which files consume IOPS?

I need to understand which files consume the IOPS of my hard disk. Just using strace will not solve my problem: I want to know which files are really written to disk, not just to the page cache. I tried systemtap, but I cannot figure out how to determine which files (filenames or inodes) consume my IOPS. Are there any tools that will solve my problem?
Yes, you can definitely use SystemTap to trace that. When an upper layer (usually the VFS subsystem) wants to issue an I/O operation, it calls the submit_bio and generic_make_request functions. Note that these do not necessarily correspond to a single physical I/O operation; for example, writes to adjacent sectors can be merged by the I/O scheduler.
The trick is how to determine the file's path name in generic_make_request. That is quite simple for reads, since the function is called in the same context as the read() call. Writes are usually asynchronous, so write() merely updates the page cache entry and marks it dirty, while submit_bio is later issued by one of the writeback kernel threads, which has no information about the original calling process.
Writes can still be traced back by looking at the page referenced from the bio structure: its mapping field points to a struct address_space. The struct file corresponding to an open file contains f_mapping, which points to the same address_space instance, and its f_path leads to the dentry holding the file's name (which can be resolved with task_dentry_path).
So we need two probes: one to capture read/write attempts and save the path and address_space into an associative array, and a second to capture generic_make_request calls (via the ioblock.request probe).
Here is an example script which counts IOPS:
// maps struct address_space to path name
global paths;
// IOPS per file
global iops;

// Capture attempts to read and write by VFS
probe kernel.function("vfs_read"),
      kernel.function("vfs_write") {
    mapping = $file->f_mapping;
    // Assemble full path name for running task (task_current())
    // from open file "$file" of type "struct file"
    path = task_dentry_path(task_current(), $file->f_path->dentry,
                            $file->f_path->mnt);
    paths[mapping] = path;
}

// Attach to generic_make_request()
probe ioblock.request {
    for (i = 0; i < $bio->bi_vcnt; i++) {
        // Each BIO request may have more than one page to write
        page = $bio->bi_io_vec[i]->bv_page;
        mapping = @cast(page, "struct page")->mapping;
        iops[paths[mapping], rw] <<< 1;
    }
}

// Once per second, drain the IOPS statistics
probe timer.s(1) {
    println(ctime());
    foreach ([path+, rw] in iops) {
        printf("%3d %s %s\n", @count(iops[path, rw]),
               bio_rw_str(rw), path);
    }
    delete iops;
}
This example script works for XFS, but would need to be updated to support AIO and volume managers (including btrfs). I'm also not sure how it will handle metadata reads and writes, but it is a good start ;)
If you want to know more on SystemTap you can check out my book: http://myaut.github.io/dtrace-stap-book/kernel/async.html
Alternatively, iotop can give you a hint about which processes are doing I/O, which in turn gives you an idea of the files involved:
iotop --only
The --only option shows only the processes or threads actually doing I/O, instead of showing all of them.

FUSE's write sequence guarantees

Should write() implementations assume random access, or can some assumptions be made, e.g. that writes will only ever be performed sequentially, at increasing offsets?
You'll get extra points for a link to the part of a POSIX or SUS specification that describes the VFS interface.
Random, for certain. There's a reason why the read and write interfaces take both a size and an offset. You'll notice there isn't a seek field in the fuse_operations struct: when a user program calls seek/lseek on a FUSE file, the offset in the kernel file descriptor is updated, but the FUSE filesystem isn't notified at all. Later reads and writes simply arrive with a different offset, and you should be able to handle that. If something about your implementation makes that impossible, you should probably return -EIO on the writes you can't satisfy.
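For illustration, here is a minimal sketch of a write handler for the FUSE 2.x high-level API that refuses offsets its backing store cannot satisfy; my_backend_write() is a hypothetical helper standing in for whatever storage your filesystem actually uses:
#include <errno.h>
#include <sys/types.h>
#include <fuse.h>

/* Hypothetical backend: returns bytes stored at 'offset', or -1
 * if it cannot write at that position. */
extern ssize_t my_backend_write(const char *path, const char *buf,
                                size_t size, off_t offset);

static int my_write(const char *path, const char *buf, size_t size,
                    off_t offset, struct fuse_file_info *fi)
{
    ssize_t written = my_backend_write(path, buf, size, offset);
    if (written < 0)
        return -EIO;         /* a write we can't satisfy */
    return (int)written;     /* FUSE expects the byte count on success */
}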
Unless there is something unusual about your FUSE filesystem that would prevent an existing file from being opened for write, your implementation of the write operation must support writes to any offset — an application can write to any location in a file by lseek()-ing around in the file while it's open, e.g.
fd = open("file", O_WRONLY);
lseek(fd, SEEK_SET, 100);
write(fd, ...);
lseek(fd, SEEK_SET, 0);
write(fd, ...);

How to tell which file was created first?

On a Linux system (the one in front of me is Ubuntu 10.04, but that shouldn't matter), how can I tell which of two files created within the same second was created first? Neither file is created by the process I control; in all other respects the ctime would, I think, do the trick, but the 1-second resolution is a problem.
For background, I'm trying to reliably determine whether a potentially stale pidfile refers to the current process with that pid. If there's a better way to do that, I'm all ears.
Actually, on modern Unices with modern filesystems, the file modification time is stored in a timespec. Details:
The standard says struct stat looks like this with respect to times:
struct timespec st_atim   /* Last data access timestamp. */
struct timespec st_mtim   /* Last data modification timestamp. */
struct timespec st_ctim   /* Last file status change timestamp. */
And a timespec is:
time_t tv_sec    /* seconds */
long   tv_nsec   /* nanoseconds */
So, doing a stat on my Linux 2.6.39:
Access: 2011-07-14 15:38:20.016666721 +0300
Modify: 2011-06-10 03:06:12.000000000 +0300
Change: 2011-06-17 11:01:35.416667110 +0300
In conclusion, I think you've got enough precision there if the hardware is supplying it.
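For instance, a minimal sketch of a comparison at full timespec resolution (using the POSIX.1-2008 st_ctim member; older systems may only expose the truncated st_ctime seconds field):
#include <sys/stat.h>

/* Returns -1 if a's status-change time is earlier than b's,
 * 1 if it is later, and 0 if they are identical. */
int cmp_ctime(const struct stat *a, const struct stat *b)
{
    if (a->st_ctim.tv_sec != b->st_ctim.tv_sec)
        return a->st_ctim.tv_sec < b->st_ctim.tv_sec ? -1 : 1;
    if (a->st_ctim.tv_nsec != b->st_ctim.tv_nsec)
        return a->st_ctim.tv_nsec < b->st_ctim.tv_nsec ? -1 : 1;
    return 0;
}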
You can try ls -rt to sort the files by time, in the hope that the file metadata carries more precision than the default listing format displays. But if the filesystem doesn't record the information, there is no way to do this.
Other options? You could embed an ID in the file and always increment it, but as soon as you try to load this ID from the filesystem (when you create a new process), you'll run into locking problems.
So how can you make sure the PID file is not stale? Answer: Use the daemon script. It runs a process in the background and makes sure the PID file gets deleted as soon as the process exits.

How are inode numbers generated in Linux tmpfs?

It seems to me that tmpfs does not re-use inode numbers, but instead creates a new inode number via a +1 sequence every time it needs a free inode.
Do you know how this is implemented? Can you point me to some source code where I could check the algorithm that tmpfs uses?
I need to understand this in order to work around a limitation in a caching system that uses the inode number as its cache key (which leads to rare, but real, collisions when inodes are re-used too often). tmpfs could save my day if I can prove that it keeps creating unique inode numbers.
Thank you for your help,
Jerome Wagner
I won't directly answer your question, so I apologize in advance for that.
The tmpfs idea is good, but I wouldn't make my program depend on a more or less obscure implementation detail for generating keys. Why not try another method, such as combining the inode number with some other piece of information? The modification date, for example: it's practically impossible for two files to get the same inode number AND the same modification date at key-generation time, unless the system date changes.
Cheers!
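As a rough illustration of that suggestion, a hypothetical composite cache key might simply mix the inode number with the modification time; a collision would then require both inode reuse AND an identical mtime:
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical composite key: inode number mixed with mtime. */
uint64_t cache_key(const struct stat *sb)
{
    return (uint64_t)sb->st_ino ^ ((uint64_t)sb->st_mtime << 32);
}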
The bulk of the tmpfs code is in mm/shmem.c. New inodes are created by
static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir,
                                     int mode, dev_t dev, unsigned long flags)
but it delegates almost everything to the generic filesystem code.
In particular, the field i_ino is filled in in fs/inode.c:
/**
 * new_inode - obtain an inode
 * @sb: superblock
 *
 * Allocates a new inode for given superblock. The default gfp_mask
 * for allocations related to inode->i_mapping is GFP_HIGHUSER_MOVABLE.
 * If HIGHMEM pages are unsuitable or it is known that pages allocated
 * for the page cache are not reclaimable or migratable,
 * mapping_set_gfp_mask() must be called with suitable flags on the
 * newly created inode's mapping
 *
 */
struct inode *new_inode(struct super_block *sb)
{
    /*
     * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
     * error if st_ino won't fit in target struct field. Use 32bit counter
     * here to attempt to avoid that.
     */
    static unsigned int last_ino;
    struct inode *inode;

    spin_lock_prefetch(&inode_lock);

    inode = alloc_inode(sb);
    if (inode) {
        spin_lock(&inode_lock);
        __inode_add_to_lists(sb, NULL, inode);
        inode->i_ino = ++last_ino;
        inode->i_state = 0;
        spin_unlock(&inode_lock);
    }
    return inode;
}
And it does indeed just use an incrementing counter (last_ino).
Most other filesystems use information from the on-disk files to later override the i_ino field.
Note that it's perfectly possible for this counter to wrap all the way around. The kernel also has a "generation" field that is filled in various ways; mm/shmem.c uses the current time.
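If you want to check the behaviour empirically rather than trust the source, here is a quick sketch that creates and deletes a few files on a tmpfs mount and prints their inode numbers (it assumes /dev/shm is tmpfs, as on most distributions; on the kernel above, the numbers should keep increasing even though the files are unlinked):
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char path[64];
    struct stat sb;
    int i;

    for (i = 0; i < 4; i++) {
        snprintf(path, sizeof(path), "/dev/shm/ino-test-%d", i);
        int fd = open(path, O_CREAT | O_WRONLY, 0600);
        if (fd < 0 || fstat(fd, &sb) != 0)
            exit(EXIT_FAILURE);
        printf("%s -> inode %llu\n", path,
               (unsigned long long)sb.st_ino);
        close(fd);
        unlink(path);   /* inode is freed, yet the counter moves on */
    }
    return 0;
}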
