How are ext4 directory entries stored in the i-nodes? - linux

I am doing some experimentation with the internals of the ext4 file system, and I stumbled upon this issue while trying to implement reading a file by path.
The root directory i-node, number 2 as per the Kernel documentation's special i-node table, is easily found in the i-node table per the pointers in the block group descriptors and superblock.
As far as I understand it, the process of looking up a file by path is:
1. Find the root directory i-node.
2. Traverse its directory entries until we find the name of the sub-directory we're looking for.
3. Take the i-node number that the directory entry we found points to.
4. Go to step 2 and repeat until we have found the file.
5. Read the file by parsing the extent tree.
Is this correct?
If so, how are the struct ext4_dir_entry records stored/referenced from the i-node? I assume i_node.i_block[] has something to do with that, but I am not entirely clear on how to read the directory entries from there. Are they stored in the i-node? Or does the array contain pointers?
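For reference, a minimal sketch of walking the classic (unhashed) directory-entry format, assuming a directory data block has already been read via the inode's extent tree or block map; the field layout follows the kernel's on-disk format documentation, and a little-endian host is assumed for brevity:

```c
#include <stdint.h>
#include <stdio.h>

/* Classic (linear, unhashed) ext4 directory entry layout, per the kernel's
 * on-disk format docs. Assumes a little-endian host for brevity. */
struct ext4_dir_entry_2 {
    uint32_t inode;      /* i-node number this entry points to (0 = unused) */
    uint16_t rec_len;    /* length of this record, including padding */
    uint8_t  name_len;   /* length of the name that follows */
    uint8_t  file_type;  /* EXT4_FT_REG_FILE, EXT4_FT_DIR, ... */
    char     name[];     /* name, NOT NUL-terminated */
};

/* Walk one directory data block (already read from disk via the inode's
 * extent tree or block map) and print name -> inode for every live entry. */
void walk_dir_block(const uint8_t *block, size_t block_size)
{
    size_t off = 0;

    while (off + 8 <= block_size) {
        const struct ext4_dir_entry_2 *de =
            (const struct ext4_dir_entry_2 *)(block + off);

        if (de->rec_len < 8 || off + de->rec_len > block_size)
            break;                      /* corrupt entry or end of block */

        if (de->inode != 0)             /* inode 0 marks a deleted/unused slot */
            printf("%.*s -> inode %u\n", de->name_len, de->name, de->inode);

        off += de->rec_len;             /* rec_len chains entries across the block */
    }
}
```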

Related

How to prove that directory is a file in Linux

"Everything is a file in Linux". How can i prove that directories are represented as files in linux. Also the physical hardware devices everything creates and is represented as files in Linux. But how can i prove this concept with supporting examples to someone.
Viewing the Directory and other physical hardwares as files in Liniux.( POC)
The "Everything is a file in Linux" statement is a bit of an oversimplification. There are many things in Linux that appear as files, but don't quite 'act' as you think they would in a conventional sense.
Block files (e.g. /dev/loop0) are a great example of this as they are used as a way of communicating with device drivers.
That said, directories are their own 'special' kind of file that contain inode ids pointing to each file's inode. I suppose a simple 'proof' of sorts would be to ls -l any directory: you will notice that most (if not all) of them have a listed file size of 4096 bytes rather than the collective size of their contents.
4096 bytes is a single block on most filesystems (the common default block size) and is usually more than enough to fit all the information (names and inode ids) of a directory. So rather than holding direct information/access to its files, a directory holds metadata about them.
Alternatively, using stat on any directory will display its own inode number (as well as the number of links it has).
EDIT: Directory files contain the inode id (a pointer to a file's inode) not the inode itself. I have edited the answer.
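As a small illustration of the stat-based check described above, a minimal C sketch (using /tmp purely as an example directory):

```c
#include <stdio.h>
#include <sys/stat.h>

/* stat() a directory like any other file and print the metadata the answer
 * above refers to: inode number, link count, and the (typically 4096-byte) size. */
int main(void)
{
    struct stat st;

    if (stat("/tmp", &st) != 0) {       /* "/tmp" is just an example path */
        perror("stat");
        return 1;
    }

    printf("inode:      %lu\n", (unsigned long)st.st_ino);
    printf("hard links: %lu\n", (unsigned long)st.st_nlink);
    printf("size:       %lld bytes\n", (long long)st.st_size); /* often 4096 on ext4 */
    printf("is dir:     %s\n", S_ISDIR(st.st_mode) ? "yes" : "no");
    return 0;
}
```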

Resolving symbolic links algorithm

What should the algorithm for resolving symlinks on Linux look like?
Something like:
Split path to steps /usr/bin/hello -> ['usr', 'bin', 'hello']
First resolve /usr -> /something1
Add next step and resolve /something1/bin -> /something2
Add next step and resolve /something2/hello -> /something3
Will that work?
What you are actually looking for is the readlink command, which relies on POSIX realpath. Its algorithm is available here.
As written in one book, the idea is this:
All path type resolution (checking) processing uses the presence or absence of a leading slash (/) to indicate whether the path is an absolute or relative path. If the slash is present, the first qualifier after the slash is compared against the MVS prefix to determine if it matches the prefix. If so, then the path type will be considered to be explicitly resolved via the prefix. If no match is found, or no slash was present, the implicit path type resolution heuristic is used.
Some details are also available here
Basically, when you request an I/O, the kernel has to go through a series of steps. The kernel needs to search directories for the requested file; this isn't a problem, because the kernel always knows where to start: the root directory has a constant inode number, inode 2 in the ext family of filesystems. The kernel then converts the filename to an inode number once it locates the filename in a directory. Because each directory is just a special kind of file holding entries of the form (filename, inode), by searching directories the kernel is able to locate the file's inode.
Once the kernel finds the inode of a file, that inode holds the block addresses for a regular file and thus is used to locate the data stored in that file. The block addresses of a file hold the actual data stored in the file. The difference between a regular file and a symlink is that the symlink points to another location, and thus the kernel has to perform the same series of steps twice: when the inode of a symlink is found, the kernel has to redo the same operation for the path the symlink points to, i.e. search directories and find a matching filename in order to get the inode number. This obviously adds overhead.
A recursive (a.k.a cyclic) symlink is an invalid symlink.
Not sure if I've answered your question, but that's what generally happens. You also have the VFS layer on top, and below that the physical filesystem. Some filesystems, like vfat, don't even support symlinks.
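A minimal sketch of the userspace view of this, using readlink(2) to read a single link target and realpath(3) for the full component-by-component resolution (the path below is just an example):

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Let libc do the full component-by-component resolution with realpath(3),
 * and peek at a single link's target with readlink(2). */
int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/usr/bin/hello"; /* example path */
    char target[PATH_MAX];
    char resolved[PATH_MAX];
    ssize_t n;

    /* readlink() returns the raw target of one symlink, without a NUL terminator. */
    n = readlink(path, target, sizeof(target) - 1);
    if (n >= 0) {
        target[n] = '\0';
        printf("%s is a symlink to %s\n", path, target);
    }

    /* realpath() resolves every symlink and "."/".." in every component. */
    if (realpath(path, resolved))
        printf("fully resolved: %s\n", resolved);
    else
        perror("realpath");

    return 0;
}
```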

In a kernel module, how to know whether given inode belongs to a specific directory?

One possible way is to compare the given inode with the list of inodes in that directory. The list of inodes could be predetermined, or it could be calculated at run time; both ways have their own problems:
Predetermined list: the list can change during this operation, i.e. files could be added to or removed from that directory.
Run-time list: if the directory has too many files, it is too much overhead for each access of any file in the system.
Is there any efficient solution/way for this? I have tried comparing the file by its path, which was really a bad idea.
Doing it in kernel mode rather than user mode offers no particular advantage. To see if an inode is indeed in some directory you have to read that directory, as files are normally located in directories as a linear list. This can leave your process blocked waiting for directory blocks that are not cached, and in that time the directory contents can be modified. Keeping the directory inode locked while doing that operation would help, but it can add severe performance restrictions to your operating system. Another issue is that each filesystem is free to implement directory contents in its own format. In userland you get a uniform directory format, but in kernel mode you have to deal with the different approaches of different filesystem types. Why do you need to know that? I can't imagine a scenario where this would be needed. Perhaps you can redesign your algorithm so that the directory contents are unnecessary.
By the way, dealing with complete paths or searching directories involves obscure race conditions that can leave your system blocked in some way. What happens if, in the middle of your search, somebody tries to unlink the inode you are searching for, or the directory contents must be modified, or some other process is using namei() to traverse your directory upwards, or downwards? Have you thought about all these possibilities?
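For comparison, a hedged userspace sketch of the "run time list" idea from the question: scan one directory with readdir(3) and check whether a given inode number appears in it. In-kernel code would instead have to read the directory through each filesystem's own entry format, as the answer points out.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Scan one directory and report whether a given inode number appears in it. */
static int dir_contains_inode(const char *dirpath, ino_t target)
{
    DIR *dir = opendir(dirpath);
    struct dirent *de;
    int found = 0;

    if (!dir)
        return -1;

    while ((de = readdir(dir)) != NULL) {
        if (de->d_ino == target) {      /* d_ino: inode number of this entry */
            found = 1;
            break;
        }
    }

    closedir(dir);
    return found;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <dir> <inode>\n", argv[0]);
        return 1;
    }
    int r = dir_contains_inode(argv[1], (ino_t)strtoull(argv[2], NULL, 10));
    printf(r > 0 ? "found\n" : "not found\n");
    return r < 0;
}
```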

What happens internally when deleting an opened file in linux

I came across this and this question on deleting opened files in Linux.
However, I'm still confused about what happens in RAM when a process (call it A) deletes a file that is opened by another process B.
What baffles me is this (my analysis could be wrong, please correct me if so):
When a process opens a file, a new entry for that file in the UFDT is created.
When a process deletes a file, all the links to the file are gone; in particular, we have no reference to its inode, thus it gets removed from the GFDT.
However, when modifying the file (say, writing to it), it must be updated on disk (since its pages get modified/dirty), but it has no reference in the GFDT because of the earlier delete, so we don't know its inode.
The question is: why is the "deleted" file still accessible by the process which opened it? And how is that done by the operating system?
EDIT: By UFDT I mean the file descriptor table of the process, which holds the file descriptors of the files opened by that process (each process has its own UFDT), and the GFDT is the global file descriptor table; there is only one GFDT in the system (in RAM, in our case).
I never really heard of those UFDT and GFDT acronyms, but your view of the system sounds mostly right. I think your description lacks some detail on how open files are managed by the kernel, and perhaps this is where your confusion comes from. I'll try to give a more detailed description.
First, there are three data structures used to keep track of and manage open files:
Each process has a table of file descriptors. Each entry in this table stores the file descriptor flags (as of now, the only one is the close-on-exec flag, FD_CLOEXEC) and a pointer to an entry in the file table, which I cover next. The integer returned by open(2) and family is usually an index into this file descriptor table; each process has its own table, which is why open(2) and family may return the same value for different processes opening different files.
There is one opened files table in the entire system. Each file descriptor table entry of each process references one of these entries in the opened files table. There is one entry in this table for each opened file: if two processes open the same file, two entries in this global table are created, even though it's the same file. Each entry in the files table stores the file status flags (opened for reading, writing, appending, etc), and the current file offset. This is why different processes can read from and write to different offsets in the same file concurrently as long as each of them opens the file.
Each entry in the file table also references an entry in the vnode table. The vnode table is a global table that has one entry for each unique file. If processes A, B, and C open file D, there will be only one vnode table entry, referenced by all three file table entries (in Linux there is really no vnode; rather, there is an inode, but let's keep this description generic and conceptual). The vnode entry contains pretty much the same information as the traditional inode (file size, other attributes, etc.), but it also contains other information useful for opened files, such as which file locks are active, who owns them, which portions of the file they lock, etc. This vnode entry also stores pointers to the file's data blocks on disk.
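A small sketch that makes the middle table visible from userspace: two separate open(2) calls on the same file get independent offsets, while dup(2) makes two descriptors share one open-file entry. The path is just an example of a small readable file.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[1];
    int a = open("/etc/hostname", O_RDONLY);
    int b = open("/etc/hostname", O_RDONLY);  /* second open: its own offset */
    int c = dup(a);                           /* shares a's open-file entry */

    if (a < 0 || b < 0 || c < 0) {
        perror("open/dup");
        return 1;
    }

    read(a, buf, 1);  /* advances the offset shared by a and c */

    printf("offset via a: %lld\n", (long long)lseek(a, 0, SEEK_CUR)); /* 1 */
    printf("offset via c: %lld\n", (long long)lseek(c, 0, SEEK_CUR)); /* 1: dup shares the entry */
    printf("offset via b: %lld\n", (long long)lseek(b, 0, SEEK_CUR)); /* 0: separate open, separate offset */

    close(a); close(b); close(c);
    return 0;
}
```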
Deleting a file consists of calling unlink(2). This function unlinks a file from a directory. Each file inode on disk has a count of the number of links pointing to it; the file is only really removed if the link count reaches 0 (or 2 in the case of directories, since a directory references itself and is also referenced by its parent) and it is not opened. In fact, the manpage for unlink(2) is very specific about this behavior:
unlink - delete a name and possibly the file it refers to
So, instead of looking at unlinking as deleting a file, look at it as deleting a file name, and maybe the file it refers to.
When unlink(2) detects that there is an active vnode table entry referring to this file, it doesn't delete the file from the filesystem. Nothing happens. Yes, you can't find the file on your filesystem anymore. find(1) won't find it. You can't open it in new processes.
But the file is still there. It just doesn't appear in any directory entry.
For example, if it's a huge file, and if you run df or du, you will see that space usage is the same. The file is still there, on disk, you just can't reach it.
So, any reads or writes take place as usual - the file data blocks are accessible through the vnode table entry. You can still know the file size. And the owner. And the permissions. All of it. Everything's there.
When the process terminates or explicitly closes the file, the operating system checks the inode. If the number of links pointing to the inode is 0 and this was the last process that opened the file (which is also indicated by storing a link count in the vnode table entry), then the file is purged.
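A minimal demonstration of this behaviour, assuming a writable /tmp for the example path:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Unlink a file while a descriptor is still open: the name disappears from
 * the directory immediately, but reads and writes through the descriptor
 * keep working until the last close, exactly as described above. */
int main(void)
{
    const char *path = "/tmp/unlink-demo";   /* example temp path */
    char buf[32] = {0};
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0) { perror("open"); return 1; }

    unlink(path);                       /* name removed; inode kept alive by fd */

    write(fd, "still here", 10);        /* data blocks still reachable via fd */
    lseek(fd, 0, SEEK_SET);
    read(fd, buf, sizeof(buf) - 1);
    printf("read back after unlink: %s\n", buf);

    close(fd);                          /* last close: inode is finally freed */
    return 0;
}
```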
When a process opens a file, a new entry for that file in the UFDT is created.
What is this weird acronym? I take it you mean the process in question has a file descriptor.
When a process deletes a file, all the links to the file are gone; in particular, we have no reference to its inode, thus it gets removed from the GFDT.
What on earth is GFDT?
However, when modifying the file (say, writing to it), it must be updated on disk (since its pages get modified/dirty), but it has no reference in the GFDT because of the earlier delete, so we don't know its inode.
I am guessing whatever this GFDT is has something to do with being "global" and "file descriptors".
So, all this shows serious misconceptions.
As was outlined by your own question, the file is a different thing from the name. Next, when you open something from a filesystem, an in-memory representation of its inode is created and a struct file object is allocated, which points to the in-memory inode. Finally, the file descriptor table of the relevant thread is updated to store a pointer to the struct file object at a given offset. That offset is known as a file descriptor.
So there. The number of names associated with an inode has no bearing on the kernel's ability to issue reads/writes affecting the inode (or the blocks of the file it represents), as long as the file was opened before the last name was removed.
The inode may or may not be trashed once there are no names left and the kernel no longer uses it.
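A hedged sketch of that point: fstat(2) on the open descriptor shows the name count (st_nlink) dropping to zero after unlink(2), while writes through the descriptor keep working. The temp path is just an example.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/nlink-demo";     /* example temp path */
    struct stat st;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0) { perror("open"); return 1; }

    fstat(fd, &st);
    printf("links before unlink: %lu\n", (unsigned long)st.st_nlink);  /* 1 */

    unlink(path);                       /* last name removed */

    fstat(fd, &st);
    printf("links after unlink:  %lu\n", (unsigned long)st.st_nlink);  /* 0 */
    printf("write still works:   %zd bytes\n", write(fd, "x", 1));     /* 1 */

    close(fd);   /* inode reclaimed here, since no names remain */
    return 0;
}
```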

Can inode and crtime be used as a unique file identifier?

I have a file indexing database on Linux. Currently I use file path as an identifier.
But if a file is moved/renamed, its path is changed and I cannot match my DB record to the new file and have to delete/recreate the record. Even worse, if a directory is moved/renamed, then I have to delete/recreate records for all files and nested directories.
I would like to use the inode number as a unique file identifier, but an inode number can be reused if a file is deleted and another file is created.
So, I wonder whether I can use a pair of {inode,crtime} as a unique file identifier.
I hope to use i_crtime on ext4 and creation_time on NTFS.
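For reference, a minimal sketch of how the {inode, crtime} pair can be read on Linux with statx(2) (glibc 2.28+, kernel 4.11+); whether stx_btime is filled in depends on the filesystem, so the code checks stx_mask:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/etc/hostname";  /* example path */
    struct statx stx;

    /* Ask only for the inode number and the birth (creation) time. */
    if (statx(AT_FDCWD, path, 0, STATX_INO | STATX_BTIME, &stx) != 0) {
        perror("statx");
        return 1;
    }

    printf("inode: %llu\n", (unsigned long long)stx.stx_ino);

    if (stx.stx_mask & STATX_BTIME)     /* birth time is optional per filesystem */
        printf("crtime: %lld.%09u\n",
               (long long)stx.stx_btime.tv_sec, stx.stx_btime.tv_nsec);
    else
        printf("crtime: not reported by this filesystem\n");

    return 0;
}
```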
In my limited testing (with ext4) inode and crtime do, indeed, remain unchanged when renaming or moving files or directories within the same file system.
So, the question is whether there are cases when inode or crtime of a file may change.
For example, can fsck or defragmentation or partition resizing change the inode or crtime of a file?
Interesting that
http://msdn.microsoft.com/en-us/library/aa363788%28VS.85%29.aspx says:
"In the NTFS file system, a file keeps the same file ID until it is deleted."
but also:
"In some cases, the file ID for a file can change over time."
So, what are those cases they mentioned?
Note that I studied similar questions:
How to determine the uniqueness of a file in linux?
Executing 'mv A B': Will the 'inode' be changed?
Best approach to detecting a move or rename to a file in Linux?
but they do not answer my question.
{device_nr,inode_nr} are a unique identifier for an inode within a system
moving a file to a different directory does not change its inode_nr
the linux inotify interface enables you to monitor changes to inodes (either files or directories)
Extra notes:
moving files across filesystems is handled differently (it is in fact copy+delete)
networked filesystems (or a mounted NTFS) cannot always guarantee the stability of inode numbers
Microsoft is not a unix vendor, its documentation does not cover Unix or its filesystems, and should be ignored (except for NTFS's internals)
Extra text: the old Unix adage "everything is a file" should in fact be "everything is an inode". The inode carries all the meta-information about a file (or directory, or special file) except its name. The filename is in fact only a directory entry that happens to link to the particular inode. Moving a file means creating a new link to the same inode and deleting the old directory entry that linked to it.
The inode metadata can be obtained with the stat(), fstat(), and lstat() system calls.
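A small sketch tying these points together: {st_dev, st_ino} identifies the inode, and a rename(2) within the same filesystem only rewires directory entries, so the pair is unchanged afterwards. The /tmp paths are examples only.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat before, after;
    int fd = open("/tmp/id-demo", O_CREAT | O_WRONLY, 0600);  /* example file */
    if (fd < 0) { perror("open"); return 1; }
    close(fd);

    stat("/tmp/id-demo", &before);

    if (rename("/tmp/id-demo", "/tmp/id-demo-renamed") != 0) { /* same filesystem */
        perror("rename");
        return 1;
    }

    stat("/tmp/id-demo-renamed", &after);

    printf("device: %lu -> %lu\n",
           (unsigned long)before.st_dev, (unsigned long)after.st_dev);
    printf("inode:  %lu -> %lu\n",
           (unsigned long)before.st_ino, (unsigned long)after.st_ino);
    /* Both pairs match: only the name moved; the inode did not change. */

    unlink("/tmp/id-demo-renamed");
    return 0;
}
```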
The allocation and management of i-nodes in Unix is dependent upon the filesystem. So, for each filesystem, the answer may vary.
For the Ext3 filesystem (the most popular), i-nodes are reused, and thus cannot be used as a unique file identifier, nor does reuse occur according to any predictable pattern.
In Ext3, i-nodes are tracked in a bit vector, each bit representing a single i-node number. When an i-node is freed, its bit is set to zero. When a new i-node is needed, the bit vector is searched for the first zero bit and that i-node number (which may have been previously allocated to another file) is reused.
This may lead to the naive conclusion that the lowest-numbered available i-node will be the one reused. However, the Ext3 file system is complex and highly optimised, so no assumptions should be made about when and how i-node numbers can be reused, even though they clearly will be.
From the source code for ialloc.c, where i-nodes are allocated:
There are two policies for allocating an inode. If the new inode is a
directory, then a forward search is made for a block group with both
free space and a low directory-to-inode ratio; if that fails, then of
the groups with above-average free space, that group with the fewest
directories already is chosen. For other inodes, search forward from
the parent directory's block group to find a free inode.
The source code that manages this for Ext3 is called ialloc and the definitive version is here: https://github.com/torvalds/linux/blob/master/fs/ext3/ialloc.c
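A deliberately simplified sketch of the bit-vector search described above, not the real ext3/ext4 allocator:

```c
#include <stdint.h>
#include <stdio.h>

/* Return the index of the first zero bit (a free i-node slot) in a bitmap,
 * or -1 if every bit is set. Purely illustrative. */
static long find_first_zero_bit(const uint8_t *bitmap, long nbits)
{
    for (long i = 0; i < nbits; i++) {
        if (!(bitmap[i / 8] & (1u << (i % 8))))
            return i;                   /* bit i is 0: this slot is free */
    }
    return -1;                          /* bitmap full: no free i-node here */
}

int main(void)
{
    /* Example bitmap: slots 0..10 in use, slot 11 free. */
    uint8_t bitmap[2] = { 0xFF, 0x07 };

    printf("first free i-node slot: %ld\n", find_first_zero_bit(bitmap, 16));
    return 0;
}
```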
I guess the DB application would need to consider the case where the file is subject to restoration from backup, which would preserve the file's crtime but not its inode number.
