What is the difference between I_DIRTY and I_DIRTY_SYNC in Linux?

In the fs code I see the mark_inode_dirty() function being called with the parameters I_DIRTY and I_DIRTY_SYNC.
What is the difference between the two? I guess both mark the inode as dirty and commit the changes to disk.

See here: http://ehc.ac/p/mrvopensource/linux-ppc-2.6/ci/1c0eeaf5698597146ed9b873e2f9e0961edcf0f9/tree/include/linux/fs.h?barediff=2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b
I_DIRTY is a superset of I_DIRTY_SYNC:
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
Which are documented as:
I_DIRTY_SYNC     Inode itself is dirty.
I_DIRTY_DATASYNC Data-related inode changes pending.
I_DIRTY_PAGES    Inode has dirty pages. Inode itself may be clean.
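For context, filesystems mark an inode dirty through __mark_inode_dirty(); the convenience wrappers in that 2.6-era include/linux/fs.h look roughly like this (paraphrased from the header linked above, comments added):
static inline void mark_inode_dirty(struct inode *inode)
{
        __mark_inode_dirty(inode, I_DIRTY);        /* inode metadata, data-related metadata and pages */
}

static inline void mark_inode_dirty_sync(struct inode *inode)
{
        __mark_inode_dirty(inode, I_DIRTY_SYNC);   /* only the inode itself */
}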

Related

Does Linux fsync also sync a file's xattrs?

From man fsync(2), it will sync the file's metadata, which I think means the fields listed by stat.
What about the file's xattrs? Do they belong to the metadata?
We did a test:
write a file and set 6 xattrs, then do fsync (which takes 0.2s);
then change 1 xattr value and do fsync again.
We expected the second fsync to be fast, but it isn't (it takes 0.16s).
My colleague says this is reasonable, because the minimum size of a disk operation is a sector, usually 512 bytes, so there is no difference between updating 1 xattr and 6 xattrs.
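A rough C version of that test (a sketch; the file name testfile and the user.* attribute names are placeholders, and fsetxattr(2) needs a filesystem with user xattrs enabled):
#include <fcntl.h>
#include <stdio.h>
#include <sys/xattr.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    write(fd, "hello", 5);

    /* set six extended attributes, then sync everything out */
    char name[32];
    for (int i = 0; i < 6; i++) {
        snprintf(name, sizeof(name), "user.attr%d", i);
        fsetxattr(fd, name, "value", 5, 0);
    }
    fsync(fd);

    /* change one xattr and time the second fsync */
    fsetxattr(fd, "user.attr0", "other", 5, 0);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fsync(fd);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("second fsync took %.3f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    close(fd);
    return 0;
}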

How can I show the size of files in /proc? It should not be size zero

From the following output, we know that there are two characters in the file /proc/sys/net/ipv4/ip_forward, so why does ls show this file as size zero?
I know this is not a file on disk but a file in memory, so is there any command with which I can see the real size of the files in /proc?
root@OpenWrt:/proc/sys/net/ipv4# cat ip_forward | wc -c
2
root@OpenWrt:/proc/sys/net/ipv4# ls -l ip_forward
-rw-r--r-- 1 root root 0 Sep 3 00:20 ip_forward
root@OpenWrt:/proc/sys/net/ipv4# pwd
/proc/sys/net/ipv4
Those are not really files on disk (as you mention) but they are also not files in memory - the names in /proc correspond to calls into the running kernel in the operating system, and the contents are generated on the fly.
The system doesn't know how large the files would be without generating them, but if you read the "file" twice there's no guarantee you get the same data because the system may have changed.
You might be looking for the program
sysctl -a
instead.
Things in /proc are not really files. In most cases, they're not even files in memory. When you access these files, the proc filesystem driver performs a system call that gets data appropriate for the file, and then formats it for output. This is usually dynamic data that's constructed on the fly. An example of this is /proc/net/arp, which contains the current ARP cache.
Getting the size of these things can only be done by formatting the entire output, so it's not done just when listing the file. If you want the sizes, use wc -c as you did.
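A small illustration of that point, using the ip_forward file from the question (a sketch; stat(2) reports the size the inode claims, while read(2) returns the content generated on the fly):
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/proc/sys/net/ipv4/ip_forward";
    struct stat st;
    char buf[4096];
    ssize_t n, total = 0;

    /* stat() reports what the inode claims: 0 bytes */
    if (stat(path, &st) == 0)
        printf("st_size = %lld\n", (long long)st.st_size);

    /* the real length is only known after generating the content */
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            total += n;
        close(fd);
    }
    printf("bytes actually read = %lld\n", (long long)total);
    return 0;
}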
The /proc/ filesystem is an "illusion" maintained by the kernel, which does not bother giving the size of (most of) its pseudo-files (since computing that "real" size would usually involve having built the entire textual pseudo-file's content), and expects most [pseudo-] textual files from /proc/ to be read in sequence from first to last byte (i.e. till EOF), in reasonably sized (e.g. 1K) blocks. See proc(5) man page for details.
So there is no way to get the true size of a file like /proc/self/maps or /proc/sys/net/ipv4/ip_forward in a single syscall such as stat(2): it would report a size of 0, which is what the stat(1) and ls(1) commands show. A typical way of reading these textual files might be:
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    // or some other textual /proc file,
    // e.g. /proc/sys/net/ipv4/ip_forward
    if (f)
    {
        do {
            // you could use getline(3) instead of fgets
            char line[256];
            memset(line, 0, sizeof(line));
            if (NULL == fgets(line, sizeof(line), f))
                break;
            // do something with line, for example:
            fputs(line, stdout);
        } while (!feof(f));
        fclose(f);
    }
    return 0;
}
Of course, some files (e.g. /proc/self/cmdline) are documented as possibly containing NUL bytes. You'll need fread for those.
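A minimal sketch of the fread() variant for such a file (it assumes the whole pseudo-file fits in one 4 KB buffer, which is fine for /proc/self/cmdline):
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/self/cmdline", "r");
    if (f) {
        char buf[4096];
        /* fread returns a byte count, so the embedded NUL separators
         * do not hide the rest of the data */
        size_t n = fread(buf, 1, sizeof(buf), f);
        /* arguments are separated by '\0'; print one per line */
        for (size_t i = 0; i < n; i += strlen(buf + i) + 1)
            puts(buf + i);
        fclose(f);
    }
    return 0;
}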
It's not really a file in memory; it's an interface between user space and the kernel.

How are inode numbers generated in Linux tmpfs?

It seems to me that tmpfs does not re-use inode numbers, but instead creates a new inode number via a +1 sequence every time it needs a free inode.
Do you know how this is implemented? Can you point me to some source code where I could check the algorithm that tmpfs uses?
I need to understand this in order to bypass a limitation in a caching system that uses the inode number as its cache key (which leads to rare, but real, collisions when inodes are re-used too often). tmpfs could save my day if I can prove that it keeps creating unique inode numbers.
Thank you for your help,
Jerome Wagner
I won't directly answer your question, so I apologize in advance for that.
The tmpfs idea is good, but I wouldn't have my program depend on a more or less obscure implementation detail for generating keys. Why don't you try another method, such as combining the inode number with some other piece of information? Maybe the modification date: it's impossible for two files to get the same inode number AND modification date at key-generation time, unless the system date changes (see the sketch below).
Cheers!
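A minimal sketch of that combined-key idea (cache_key() is a hypothetical helper and the path in main() is just an example; it assumes a stat(2) call is acceptable on the cache path and that second-resolution mtime is good enough):
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/* hypothetical helper: widen the cache key by mixing the inode number
 * with the modification time, so a re-used inode number only collides
 * if the old and new file were modified in the very same second */
static uint64_t cache_key(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return 0;
    return (uint64_t)st.st_ino ^ ((uint64_t)st.st_mtime << 32);
}

int main(void)
{
    printf("%llu\n", (unsigned long long)cache_key("/tmp/somefile"));
    return 0;
}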
The bulk of the tmpfs code is in mm/shmem.c. New inodes are created by
static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir,
                                     int mode, dev_t dev, unsigned long flags)
but it delegates almost everything to the generic filesystem code.
In particular, the field i_ino is filled in in fs/inode.c:
/**
 * new_inode - obtain an inode
 * @sb: superblock
 *
 * Allocates a new inode for given superblock. The default gfp_mask
 * for allocations related to inode->i_mapping is GFP_HIGHUSER_MOVABLE.
 * If HIGHMEM pages are unsuitable or it is known that pages allocated
 * for the page cache are not reclaimable or migratable,
 * mapping_set_gfp_mask() must be called with suitable flags on the
 * newly created inode's mapping
 *
 */
struct inode *new_inode(struct super_block *sb)
{
        /*
         * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
         * error if st_ino won't fit in target struct field. Use 32bit counter
         * here to attempt to avoid that.
         */
        static unsigned int last_ino;
        struct inode *inode;

        spin_lock_prefetch(&inode_lock);

        inode = alloc_inode(sb);
        if (inode) {
                spin_lock(&inode_lock);
                __inode_add_to_lists(sb, NULL, inode);
                inode->i_ino = ++last_ino;
                inode->i_state = 0;
                spin_unlock(&inode_lock);
        }
        return inode;
}
And it does indeed just use an incrementing counter (last_ino).
Most other filesystems use information from the on-disk files to later override the i_ino field.
Note that it's perfectly possible for this to wrap all the way around. The kernel also has a "generation" field that gets filled in various ways. mm/shmem.c uses the current time.

How to create a large file on a VFAT partition efficiently in embedded Linux

I'm trying to create a large empty file on a VFAT partition by using the `dd' command on an embedded Linux box:
dd if=/dev/zero of=/mnt/flash/file bs=1M count=1 seek=1023
The intention was to skip the first 1023 blocks and write only 1 block at the end of the file, which should be very quick on a native EXT3 partition, and it indeed is. However, this operation turned out to be quite slow on a VFAT partition, along with the following message:
lowmem_shrink:: nr_to_scan=128, gfp_mask=d0, other_free=6971, min_adj=16
// ... more `lowmem_shrink' messages
Another attempt was to fopen() a file on the VFAT partition and then fseek() to the end to write the data (sketched in C below); this also proved slow, with the same messages from the kernel.
So basically, is there a quick way to create the file on the VFAT partition (without traversing the first 1023 blocks)?
Thanks.
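For reference, the fseek()/lseek() variant mentioned above boils down to something like this (a sketch; the path and size are taken from the dd command, and on VFAT the write() still blocks until all preceding clusters have been allocated and zero-filled):
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/flash/file", O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return 1;
    /* seek 1 GiB - 1 byte past the start and write a single byte;
     * on ext3 this leaves a hole, on VFAT the driver must allocate
     * and zero every intermediate cluster */
    if (lseek(fd, (off_t)1024 * 1024 * 1024 - 1, SEEK_SET) < 0)
        return 1;
    write(fd, "", 1);
    close(fd);
    return 0;
}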
Why are VFAT "skipping" writes so slow?
Unless the VFAT filesystem driver were made to "cheat" in this respect, creating large files on FAT-type filesystems will always take a long time. The driver, to comply with FAT specification, will have to allocate all data blocks and zero-initialize them, even if you "skip" the writes. That's because of the "cluster chaining" FAT does.
The reason for that behaviour is FAT's inability to support either:
- UN*X-style "holes" in files (aka "sparse files"): that's what you're creating on ext3 with your testcase - a file with no data blocks allocated to the first 1GB-1MB of it, and a single 1MB chunk of actually committed, zero-initialized blocks at the end.
- NTFS-style "valid data length" information: on NTFS, a file can have uninitialized blocks allocated to it, but the file's metadata keeps two size fields - one for the total size of the file, another for the number of bytes actually written to it (from the beginning of the file).
Without a specification supporting either technique, the filesystem would always have to allocate and zerofill all "intermediate" data blocks if you skip a range.
Also remember that on ext3, the technique you used does not actually allocate blocks to the file (apart from the last 1MB). If you require the blocks preallocated (not just the size of the file set large), you'll have to perform a full write there as well.
How could the VFAT driver be modified to deal with this?
At the moment, the driver uses the Linux kernel function cont_write_begin() to start even an asynchronous write to a file; this function looks like:
/*
 * For moronic filesystems that do not allow holes in file.
 * We may have to extend the file.
 */
int cont_write_begin(struct file *file, struct address_space *mapping,
                        loff_t pos, unsigned len, unsigned flags,
                        struct page **pagep, void **fsdata,
                        get_block_t *get_block, loff_t *bytes)
{
        struct inode *inode = mapping->host;
        unsigned blocksize = 1 << inode->i_blkbits;
        unsigned zerofrom;
        int err;

        err = cont_expand_zero(file, mapping, pos, bytes);
        if (err)
                return err;

        zerofrom = *bytes & ~PAGE_CACHE_MASK;
        if (pos+len > *bytes && zerofrom & (blocksize-1)) {
                *bytes |= (blocksize-1);
                (*bytes)++;
        }

        return block_write_begin(mapping, pos, len, flags, pagep, get_block);
}
That is a simple strategy but also a pagecache trasher (your log messages are a consequence of the call to cont_expand_zero(), which does all the work and is not asynchronous). If the filesystem were to split this into two operations - one task doing the "real" write and another doing the zero filling - it would appear snappier.
The way this could be achieved while still using the default Linux filesystem utility interfaces would be to internally create two "virtual" files - one for the to-be-zerofilled area and another for the actually-to-be-written data. The real file's directory entry and FAT cluster chain would only be updated once the background task is actually complete, by linking its last cluster with the first one of the "zerofill file" and the last cluster of that one with the first one of the "actual write file". One would also want to go for a directio write to do the zerofilling, in order to avoid trashing the pagecache.
Note: While all this is certainly technically possible, the question is how worthwhile such a change would be. Who needs this operation all the time? What would the side effects be?
The existing (simple) code is perfectly acceptable for smaller skipping writes; you won't really notice its presence if you create a 1MB file and write a single byte at the end. It will bite you only if you go for file sizes on the order of the limits of what the FAT filesystem allows you to do.
Other options ...
In some situations, the task at hand involves two (or more) steps:
- freshly format (e.g.) an SD card with FAT
- put one or more big files onto it to "pre-fill" the card
- (app-dependent, optional) pre-populate the files, or put a loopback filesystem image into them
In one of the cases I've worked on, we folded the first two steps - i.e. we modified mkdosfs to pre-allocate/pre-create files when making the (FAT32) filesystem. That's pretty simple: when writing the FAT tables, just create allocated cluster chains instead of clusters filled with the "free" marker. It also has the advantage that the data blocks are guaranteed to be contiguous, in case your app benefits from this. And you can decide to make mkdosfs not clear the previous contents of the data blocks.
If you know, for example, that one of your preparation steps involves writing the entire data anyway, or doing ext3-in-file-on-FAT (a pretty common thing - Linux appliance, SD card for data exchange with a Windows app/GUI), then there's no need to zero out anything / double-write (once with zeroes, once with whatever else).
If your use case fits this (i.e. formatting the card is a useful / normal step of the "initialize it for use" process anyway), then try it out; a suitably-modified mkdosfs is part of TomTom's dosfsutils sources - see mkdosfs.c and search for the -N command line option handling.
When talking about preallocation, as mentioned, there's also posix_fallocate(). Currently on Linux with FAT, this does essentially the same as a manual dd ..., i.e. it waits for the zerofill. But the specification of the function doesn't mandate that it be synchronous. The block allocation (FAT cluster chain generation) would have to be done synchronously, but the VFAT on-disk dirent size update and the data block zerofills could be backgrounded / delayed (i.e. either done at low priority in the background, or only done if explicitly requested via fdatasync() / sync(), so that the app can e.g. allocate blocks and then write the contents with non-zeroes itself ...). That's technique / design though; I'm not aware of anyone having done that kernel modification yet, even just for experimenting.
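If preallocation rather than a sparse file is what you need, the posix_fallocate() route mentioned above looks roughly like this (a sketch; as noted, on current Linux VFAT it still zero-fills synchronously, so don't expect it to be faster than dd there):
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/flash/file", O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return 1;
    /* reserve 1 GiB; on extent-based filesystems (ext4, xfs) this is
     * near-instant, on FAT the kernel falls back to writing zeroes */
    int err = posix_fallocate(fd, 0, (off_t)1024 * 1024 * 1024);
    if (err)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
    close(fd);
    return 0;
}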

Fast way to find the number of files in one directory on Linux

I am looking for a fast way to find the number of files in a directory on Linux.
Any solution that takes linear time in the number of files in the directory is NOT acceptable (e.g. "ls | wc -l" and similar things) because it would take a prohibitively long amount of time (there are tens or maybe hundreds of millions of files in the directory).
I'm sure the number of files in the directory must be stored as a simple number somewhere in the filesystem structure (inode perhaps?), as part of the data structure used to store the directory entries - how can I get to this number?
Edit: The filesystem is ext3. If there is no portable way of doing this, I am willing to do something specific to ext3.
Why should the data structure contain the number? A tree doesn't need to know its size in O(1) unless that's a requirement (and providing it could require more locking and possibly a performance bottleneck).
By tree I don't mean including subdir contents, but files with -maxdepth 1 -- supposing they are not really stored as a list...
Edit: ext2 stored them as a linked list;
modern ext3 implements hashed B-trees.
Having said that, /bin/ls does a lot more than counting, and actually scans all the inodes. Write your own C program or script using opendir() and readdir().
From here:
#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>

int main(void)
{
    int count = 0;
    DIR *d;

    if ((d = opendir(".")) != NULL) {
        /* note: the count includes the "." and ".." entries */
        for (count = 0; readdir(d) != NULL; count++)
            ;
        closedir(d);
    }
    printf("%d\n", count);
    return 0;
}
You can use inotify to track and record file create and unlink events in the monitored directory. It would distribute the total time required to maintain file count and allow you to retrieve the current file count instantaneously.
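A sketch of that approach (the directory path is hypothetical; a real program would seed count with one initial readdir() scan and handle IN_Q_OVERFLOW):
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
    long count = 0;   /* seed with one initial scan in practice */
    int fd = inotify_init();
    if (fd < 0)
        return 1;
    if (inotify_add_watch(fd, "/some/huge/dir",
                          IN_CREATE | IN_DELETE | IN_MOVED_FROM | IN_MOVED_TO) < 0)
        return 1;

    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len <= 0)
            break;
        /* a single read() may return several packed events */
        for (char *p = buf; p < buf + len; ) {
            const struct inotify_event *ev = (const struct inotify_event *)p;
            if (ev->mask & (IN_CREATE | IN_MOVED_TO))
                count++;
            if (ev->mask & (IN_DELETE | IN_MOVED_FROM))
                count--;
            p += sizeof(*ev) + ev->len;
        }
        printf("current file count: %ld\n", count);
    }
    return 0;
}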
The inode for the directory does not store the number of files in it, since usually the file count is not needed separately from the list of names in the directory. The directory inode's link count does indirectly give the number of sub-directories (st_nlink is number of sub-dirs plus two).
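For the sub-directory part of that, the link-count trick is a one-liner around stat(2) (a sketch; the path is hypothetical):
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    /* st_nlink of a directory = number of sub-directories + 2
     * ("." plus its entry in the parent) on ext3 and most classic
     * filesystems; it says nothing about regular files */
    if (stat("/some/huge/dir", &st) == 0)
        printf("subdirectories: %ld\n", (long)st.st_nlink - 2);
    return 0;
}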
I think you have no choice except to read through the whole list of files in the directory. find might or might not be faster than ls.
This is an example of why large directories are a problem, even when the directory is implemented using a B-tree.
There's no portable way to do this. The low-level file primitives, i.e. readdir, work as if the directory were a linear list. Clearly, that's an abstraction, and some filesystems might store a count. However, accessing it is inherently filesystem-specific.
If you are willing to jump through hoops you may have each directory in a different filesystem, use quotas, and get the info with the "repquota" command.
