How are inode numbers generated in Linux tmpfs?

It seems to me that tmpfs does not re-use inode numbers, but instead generates a new inode number via a +1 sequence every time it needs a free inode.
Do you know how this is implemented? Can you point me to the source code where I could check the algorithm tmpfs uses?
I need to understand this in order to work around a limitation in a caching system that uses the inode number as its cache key (which leads to rare, but real, collisions when inodes are re-used too often). tmpfs could save my day if I can prove that it keeps creating unique inode numbers.
Thank you for your help,
Jerome Wagner

I won't directly answer your question, so I apologize in advance for that.
The tmpfs idea is good, but I wouldn't make my program depend on a more or less obscure implementation detail for generating keys. Why not try another method, such as combining the inode number with some other piece of information? The modification date is a candidate: two files cannot have the same inode number AND the same modification date at key-generation time, unless the system date changes.
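For illustration, here is a minimal user-space sketch of that idea in C; the key layout (inode number shifted left, low bits of st_mtime mixed in) is an arbitrary choice for the example, not something any particular caching system uses:

#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/* Build a cache key from the inode number and the modification time.
 * Illustrative sketch only: the 48/16 bit split is arbitrary. */
static uint64_t make_cache_key(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return 0; /* a real caller would handle the error properly */

    return ((uint64_t)st.st_ino << 16) ^ (uint64_t)(st.st_mtime & 0xffff);
}

int main(int argc, char **argv)
{
    if (argc > 1)
        printf("key=%llx\n", (unsigned long long)make_cache_key(argv[1]));
    return 0;
}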
Cheers!

The bulk of the tmpfs code is in mm/shmem.c. New inodes are created by
static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir,
                                     int mode, dev_t dev, unsigned long flags)
but it delegates almost everything to the generic filesystem code.
In particular, the i_ino field is filled in by the generic new_inode() in fs/inode.c:
/**
 * new_inode - obtain an inode
 * @sb: superblock
 *
 * Allocates a new inode for given superblock. The default gfp_mask
 * for allocations related to inode->i_mapping is GFP_HIGHUSER_MOVABLE.
 * If HIGHMEM pages are unsuitable or it is known that pages allocated
 * for the page cache are not reclaimable or migratable,
 * mapping_set_gfp_mask() must be called with suitable flags on the
 * newly created inode's mapping
 *
 */
struct inode *new_inode(struct super_block *sb)
{
    /*
     * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
     * error if st_ino won't fit in target struct field. Use 32bit counter
     * here to attempt to avoid that.
     */
    static unsigned int last_ino;
    struct inode *inode;

    spin_lock_prefetch(&inode_lock);

    inode = alloc_inode(sb);
    if (inode) {
        spin_lock(&inode_lock);
        __inode_add_to_lists(sb, NULL, inode);
        inode->i_ino = ++last_ino;
        inode->i_state = 0;
        spin_unlock(&inode_lock);
    }
    return inode;
}
And it does indeed just use an incrementing counter (last_ino).
Most other filesystems use information from the on-disk files to later override the i_ino field.
Note that it's perfectly possible for this counter to wrap all the way around. The kernel also has a "generation" field (i_generation) that gets filled in various ways; mm/shmem.c uses the current time.
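If you just want to observe the behaviour rather than read the source, a small user-space test against a tmpfs mount shows the counter effect; this only demonstrates the kernel you run it on, it is not a guarantee, and the /dev/shm path is an assumption you may need to adjust:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create and delete a file on a tmpfs mount a few times and print the
 * inode numbers. On kernels that hand out tmpfs inode numbers from a
 * simple counter, the numbers keep increasing even though the file is
 * removed in between. */
int main(void)
{
    const char *path = "/dev/shm/ino-test";   /* assumed tmpfs mount point */
    struct stat st;

    for (int i = 0; i < 5; i++) {
        int fd = open(path, O_CREAT | O_WRONLY, 0600);
        if (fd < 0 || fstat(fd, &st) != 0) {
            perror("open/fstat");
            return 1;
        }
        printf("iteration %d: inode %llu\n", i, (unsigned long long)st.st_ino);
        close(fd);
        unlink(path);
    }
    return 0;
}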

Related

what is the difference between I_DIRTY and I_DIRTY_SYNC

In the fs code I see the mark_inode_dirty() function being passed the parameters I_DIRTY and I_DIRTY_SYNC.
What is the difference between the two? I guess both will mark the inode as dirty and commit the changes to disk.
See here: http://ehc.ac/p/mrvopensource/linux-ppc-2.6/ci/1c0eeaf5698597146ed9b873e2f9e0961edcf0f9/tree/include/linux/fs.h?barediff=2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b
I_DIRTY is a superset of I_DIRTY_SYNC:
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
Which are documented as:
I_DIRTY_SYNC      Inode itself is dirty.
I_DIRTY_DATASYNC  Data-related inode changes pending.
I_DIRTY_PAGES     Inode has dirty pages. Inode itself may be clean.
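To make the superset relationship concrete, here is a small stand-alone sketch that mirrors those definitions in user space (the bit values follow the classic definitions in include/linux/fs.h, but check your kernel version before relying on them):

#include <stdio.h>

/* User-space mirror of the kernel flags, for illustration only. */
#define I_DIRTY_SYNC     (1 << 0)   /* inode structure itself is dirty */
#define I_DIRTY_DATASYNC (1 << 1)   /* data-related inode changes pending */
#define I_DIRTY_PAGES    (1 << 2)   /* inode has dirty pages */
#define I_DIRTY          (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

int main(void)
{
    unsigned int i_state = I_DIRTY_PAGES;   /* only data pages are dirty */

    /* I_DIRTY matches because it is the union of all three bits ... */
    printf("dirty in any way:   %s\n", (i_state & I_DIRTY) ? "yes" : "no");
    /* ... while I_DIRTY_SYNC alone does not: the inode itself is clean. */
    printf("inode itself dirty: %s\n", (i_state & I_DIRTY_SYNC) ? "yes" : "no");
    return 0;
}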

what is the difference between inode bitmap and inode table

I am trying to understand the difference between the inode bitmap and the inode table (from the ext2 filesystem documentation), but I am not getting it. Can anyone explain?
The bitmap occupies just one block and is a sequence of 0s and 1s, where 0 means that the corresponding inode in the inode table is free and 1 means that it is used.
The inode table is where the actual information about the inode is written and it occupies more than one block on the filesystem.
The bitmap technique is useful to quickly find free locations on the inode table (or data blocks) when modifying the filesystem.
On the hard drive, those sections would look like this:
inode bitmap:
11100011010010101...
inode table:
struct inode | struct inode | struct inode | struct inode | ...
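To see why the bitmap makes allocation fast, here is a stand-alone sketch of scanning such a bitmap for a free slot (illustrative code, not the actual ext2 driver logic):

#include <stdio.h>

/* Scan an inode bitmap (one bit per inode, 0 = free, 1 = used) and return
 * the index of the first free inode, or -1 if all are in use. The real
 * ext2 code performs the same kind of search, just on on-disk bitmap
 * blocks and with optimized bit operations. */
static int find_free_inode(const unsigned char *bitmap, int n_inodes)
{
    for (int i = 0; i < n_inodes; i++) {
        if (!(bitmap[i / 8] & (1 << (i % 8))))
            return i;
    }
    return -1;
}

int main(void)
{
    /* First byte 0xC7 = bits 0-2 and 6-7 set: inodes 0-2 and 6-7 are in
     * use, so inode 3 is the first free one. */
    unsigned char bitmap[] = { 0xC7, 0x52 };

    printf("first free inode index: %d\n", find_free_inode(bitmap, 16));
    return 0;
}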

Linux: Difference between inode and file_inode(file)?

in source/arch/x86/kernel/msr.c, the msr_open callback for the character device uses the following construct to extract the minor number of the character device file used:
static int msr_open(struct inode *inode, struct file *file)
{
    unsigned int cpu = iminor(file_inode(file));
    [...]
}
My question is:
Why not directly call iminor with the first argument of the function, like:
unsigned int cpu = iminor(inode);
The construct is used in other callbacks (e.g. read and write) as well, where the inode is not passed as an argument, so I guess this is due to copy/paste. Or is there a deeper meaning to it?
An inode is a data structure on a traditional Unix-style file system such as UFS or ext3. An inode stores basic information about a regular file, directory, or other file system object.
- http://www.cyberciti.biz/tips/understanding-unixlinux-filesystem-inodes.html
Same deal.
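For reference, in current kernels file_inode() is a trivial helper along these lines (paraphrased from include/linux/fs.h; check your kernel version for the exact definition), so in msr_open() both expressions refer to the same inode:

static inline struct inode *file_inode(const struct file *f)
{
    return f->f_inode;
}

In the read and write callbacks no inode argument is available, which is why the file_inode(file) form is needed there; keeping the same form in msr_open() is then plausibly just consistency or copy/paste, as you guessed.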

How to create a large file on a VFAT partition efficiently in embedded Linux

I'm trying to create a large empty file on a VFAT partition by using the `dd' command in an embedded linux box:
dd if=/dev/zero of=/mnt/flash/file bs=1M count=1 seek=1023
The intention was to skip the first 1023 blocks and write only 1 block at the end of the file, which should be very quick on a native EXT3 partition, and it indeed is. However, this operation turned out to be quite slow on a VFAT partition, along with the following message:
lowmem_shrink:: nr_to_scan=128, gfp_mask=d0, other_free=6971, min_adj=16
// ... more `lowmem_shrink' messages
Another attempt was to fopen() a file on the VFAT partition and then fseek() to the end to write the data, which has also proved slow, along with the same messages from the kernel.
So basically, is there a quick way to create the file on the VFAT partition (without traversing the first 1023 blocks)?
Thanks.
Why are VFAT "skipping" writes so slow?
Unless the VFAT filesystem driver were made to "cheat" in this respect, creating large files on FAT-type filesystems will always take a long time. The driver, to comply with FAT specification, will have to allocate all data blocks and zero-initialize them, even if you "skip" the writes. That's because of the "cluster chaining" FAT does.
The reason for that behaviour is FAT's inability to support either:
UN*X-style "holes" in files (aka "sparse files")
That is what you are creating on ext3 with your testcase: a file with no data blocks allocated for the first 1GB-1MB of it, and a single 1MB chunk of actually committed, zero-initialized blocks at the end.
NTFS-style "valid data length" information
On NTFS, a file can have uninitialized blocks allocated to it, but the file's metadata keeps two size fields: one for the total size of the file, and another for the number of bytes actually written to it (from the beginning of the file).
Without support for either technique, the filesystem always has to allocate and zero-fill all "intermediate" data blocks when you skip a range.
Also remember that on ext3, the technique you used does not actually allocate blocks to the file (apart from the last 1MB). If you require the blocks preallocated (not just the size of the file set large), you'll have to perform a full write there as well.
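For reference, the dd invocation above is equivalent to creating the sparse file by hand, as in this minimal sketch (same target path as in the question):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Create a ~1GB file the way "dd ... seek=1023" does: seek past a 1023MB
 * hole and write one 1MB block at the end. On ext3 this is nearly instant
 * because the skipped blocks are never allocated; on VFAT the driver has
 * to allocate and zero-fill every intermediate cluster, which is what
 * makes it slow. */
static char block[1024 * 1024];   /* one 1MB block of zeroes */

int main(void)
{
    int fd = open("/mnt/flash/file", O_CREAT | O_WRONLY | O_TRUNC, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (lseek(fd, 1023LL * 1024 * 1024, SEEK_SET) < 0 ||
        write(fd, block, sizeof(block)) != (ssize_t)sizeof(block)) {
        perror("lseek/write");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}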
How could the VFAT driver be modified to deal with this?
At the moment, the driver uses the Linux kernel function cont_write_begin() to start even an asynchronous write to a file; this function looks like:
/*
 * For moronic filesystems that do not allow holes in file.
 * We may have to extend the file.
 */
int cont_write_begin(struct file *file, struct address_space *mapping,
                     loff_t pos, unsigned len, unsigned flags,
                     struct page **pagep, void **fsdata,
                     get_block_t *get_block, loff_t *bytes)
{
    struct inode *inode = mapping->host;
    unsigned blocksize = 1 << inode->i_blkbits;
    unsigned zerofrom;
    int err;

    err = cont_expand_zero(file, mapping, pos, bytes);
    if (err)
        return err;

    zerofrom = *bytes & ~PAGE_CACHE_MASK;
    if (pos + len > *bytes && zerofrom & (blocksize - 1)) {
        *bytes |= (blocksize - 1);
        (*bytes)++;
    }

    return block_write_begin(mapping, pos, len, flags, pagep, get_block);
}
That is a simple strategy, but it is also a page cache thrasher (your log messages are a consequence of the call to cont_expand_zero(), which does all the work and is not asynchronous). If the filesystem were to split the two operations - one task to do the "real" write, and another to do the zero filling - it would appear snappier.
The way this could be achieved while still using the default Linux filesystem utility interfaces would be to internally create two "virtual" files - one for the to-be-zerofilled area, and another for the actually-to-be-written data. The real file's directory entry and FAT cluster chain would only be updated once the background task is actually complete, by linking its last cluster with the first one of the "zerofill file" and the last cluster of that one with the first one of the "actual write file". One would also want to go for a direct-IO write to do the zero filling, in order to avoid thrashing the page cache.
Note: while all this is technically possible for sure, the question is how worthwhile it would be to do such a change. Who needs this operation all the time? What would the side effects be?
The existing (simple) code is perfectly acceptable for smaller skipping writes; you won't really notice its presence if you create a 1MB file and write a single byte at the end. It will bite you only if you go for file sizes on the order of the limits of what the FAT filesystem allows you to do.
Other options ...
In some situations, the task at hand involves two (or more) steps:
freshly format (e.g.) an SD card with FAT
put one or more big files onto it to "pre-fill" the card
(app-dependent, optional)
pre-populate the files, or
put a loopback filesystem image into them
In one of the cases I've worked on, we folded the first two - i.e. modified mkdosfs to pre-allocate/pre-create files when making the (FAT32) filesystem. That's pretty simple: when writing the FAT tables, just create allocated cluster chains instead of clusters filled with the "free" marker. It also has the advantage that the data blocks are guaranteed to be contiguous, in case your app benefits from this.
You can also decide to make mkdosfs not clear the previous contents of the data blocks. If you know, for example, that one of your preparation steps involves writing the entire data anyway, or doing ext3-in-file-on-FAT (a pretty common thing: Linux appliance, SD card for data exchange with a Windows app/GUI), then there's no need to zero out anything / double-write (once with zeroes, once with whatever else).
If your use case fits this (i.e. formatting the card is a useful/normal step of the "initialize it for use" process anyway), then try it out; a suitably-modified mkdosfs is part of TomTom's dosfsutils sources - see mkdosfs.c and search for the -N command line option handling.
When talking about preallocation, there's also posix_fallocate(), as mentioned. Currently on Linux with FAT, this does essentially the same as a manual dd ..., i.e. it waits for the zerofill. But the specification of the function doesn't mandate that it be synchronous. The block allocation (FAT cluster chain generation) would have to be done synchronously, but the VFAT on-disk dirent size update and the data block zerofills could be backgrounded / delayed (i.e. either done at low priority in the background, or only done if explicitly requested via fdatasync() / sync(), so that the app could e.g. allocate the blocks and then write the contents with non-zeroes itself). That's technique / design though; I'm not aware of anyone having done that kernel modification yet, even just for experimenting.
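For reference, this is what the posix_fallocate() route looks like from user space; on VFAT today it will still block until the zero-fill is done, as described above (the path and size below just mirror the question):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Preallocate a 1GB file. Note that posix_fallocate() returns an error
 * number directly instead of setting errno. */
int main(void)
{
    const char *path = "/mnt/flash/file";
    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    int err;

    if (fd < 0) {
        perror("open");
        return 1;
    }
    err = posix_fallocate(fd, 0, 1024LL * 1024 * 1024);
    if (err)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
    close(fd);
    return err ? 1 : 0;
}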

Linux API - EXT3 file information

I'm writing backup software. I want to programmatically determine whether a file has been modified since last time. Is there a flag or something like that on files under the EXT3 filesystem?
Sure. Just call stat() on the file, and inspect the st_mtime member:
struct stat {
    /* ... snip ... */
    time_t st_atime;   /* time of last access */
    time_t st_mtime;   /* time of last modification */
    time_t st_ctime;   /* time of last status change */
};
If your application keeps a timestamp of when the last backup was made, you can compare against it directly.
Note though that not all filesystems really update the modified time, as doing so is kind of expensive. You seem to be aware of this risk.
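A minimal sketch of that check (where the last-backup timestamp would come from wherever your application stores its state):

#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/* Return 1 if the file was modified after the given backup time,
 * 0 if not, and -1 on error. */
static int modified_since(const char *path, time_t last_backup)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        perror("stat");
        return -1;
    }
    return st.st_mtime > last_backup ? 1 : 0;
}

int main(void)
{
    time_t last_backup = 0;   /* placeholder: load this from your backup state */

    if (modified_since("/etc/hostname", last_backup) == 1)
        printf("file changed since last backup\n");
    return 0;
}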
I think you are looking for stat()
