Just wondering if the key to shared memory is the file name or the inode.
I have a file called .last, which is just a hard link to a file named YYYYMMDDHHMMSS.
A directory looks like this:
20110101143000
.last
.last is just a hard link to 20110101143000.
Some time later, a new file is created, and the directory looks like this:
20110101143000
20110622083000
.last
We then delete .last, and recreate it to refer to the new file.
Our software, which is continuously running during these updates, mmaps the .last file with MAP_SHARED. When done with a file, the software might cache it for several minutes rather than unmap it. On a physical server, there are 12-24 instances of the software running at the same time. Different instances often mmap the same file at about the same time. My question is:
Does Linux use the file name as the key to the shared memory, or does it use the inode?
Given this scenario:
proc A mmaps .last, and does not unmap
a new file is written, .last is deleted, and a new .last is created as a link to the new file
proc B mmaps the new .last, and does not unmap
If Linux used the inode, then procs A and B would see different blocks of memory mapped to different files, which is what we want. If Linux used the filename, then both A and B would see the same block of memory mapped to the new file. B is fine, but A crashes when the memory in the shared block changes.
Anyone know how it actually works? I'm going to test, but if it turns out to be name-based, I am screwed unless someone knows a trick.
Thanks!
It's the inode, at least effectively. That is to say that once you have mapped some pages from a file they will continue to refer to that file and won't change just because the mapping of names to files changes in the filesystem.
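A quick way to convince yourself is a small sketch along these lines (done in a single script for brevity, but the same holds when the first mapping belongs to another process; the file names and contents are just the ones from the example above):

import mmap, os

# Create the first data file and point .last at it with a hard link.
with open("20110101143000", "wb") as f:
    f.write(b"old contents....")
os.link("20110101143000", ".last")

# "Proc A": map .last with MAP_SHARED (the default) and keep the mapping.
fd_a = os.open(".last", os.O_RDONLY)
map_a = mmap.mmap(fd_a, 0, prot=mmap.PROT_READ)

# The updater: write the new file, delete .last, recreate it as a link to the new file.
with open("20110622083000", "wb") as f:
    f.write(b"new contents....")
os.unlink(".last")
os.link("20110622083000", ".last")

# "Proc B": a fresh mapping of the new .last.
fd_b = os.open(".last", os.O_RDONLY)
map_b = mmap.mmap(fd_b, 0, prot=mmap.PROT_READ)

print(map_a[:12])   # b'old contents' - the established mapping still follows the old inode
print(map_b[:12])   # b'new contents'

The mapping created before the relink keeps referring to the inode it was created from; only mappings created afterwards see the new file.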
I'm writing a file that takes minutes to write. External software monitors for this file to appear, but unfortunately it doesn't watch for inotify IN_CLOSE_WRITE events; rather, it periodically checks whether "the file is there" and then starts to process it, which will fail if the file is incomplete. I cannot fix the external software. A workaround I've been using so far is to write a temporary file and then rename it when it's finished, but this workaround complicates my workflow for reasons beyond the scope of this question¹.
Files are not directory entries. Using hardlinks, there can be multiple pointers to the same file. When I open a file for writing, both the inode and the directory entry are created immediately. Can I prevent this? Can I postpone the creation of the directory entry until the file is closed, rather than when the file is opened for writing?
Example Python-code, but the question is not specific to Python:
fp = open(dest, 'w') # currently both inode and directory entry are created here
fp.write(...)
fp.write(...)
fp.write(...)
fp.close() # I would like to create the directory entry only here
Reading everything into memory and then writing it all in one go is not a good solution, because writing will still take time and the file might not fit into memory.
I found the related question Is it possible to create an unlinked file on a selected filesystem?, but I would want to first create an anonymous/unnamed file, then naming it when I'm done writing (I agree with the answer there that creating an inode is unavoidable, but that's fine; I just want to postpone naming it).
Tagging this as linux, because I suspect the answer might be different between Linux and Windows and I only need a solution on Linux.
¹Many files are produced in parallel within dask graphs, and injecting a "move as soon as finished" task in our system would be complicated, so we're really renaming 50 files when 50 files have been written, which causes delays.
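One Linux-specific way to get exactly this behaviour (not from the question; a hedged sketch assuming the target filesystem supports O_TMPFILE, e.g. ext4) is to create the file anonymously and only give it a name once it is complete:

import os

dest = "output.dat"                       # hypothetical final name
dest_dir = os.path.dirname(dest) or "."

# Open an anonymous file on the destination filesystem: the inode exists,
# but there is no directory entry yet.
fd = os.open(dest_dir, os.O_TMPFILE | os.O_WRONLY, 0o644)
try:
    os.write(fd, b"... data written over many minutes ...")
    os.fsync(fd)
    # Create the directory entry only now; this fails with EEXIST if dest
    # already exists, so the monitoring software never sees a partial file.
    os.link(f"/proc/self/fd/{fd}", dest)
finally:
    os.close(fd)

If O_TMPFILE is unavailable, the write-to-a-temporary-name-and-rename workaround already described above remains the usual fallback.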
According to the mmap() manpage:
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
Question: How to prevent changes to the underlying file after mmap()-ing a file from being visible to my program?
Background: I am designing a data structure for a text editor intended to allow editing huge text files efficiently. The data structure is akin to an on-disk rope, but with the actual strings being pointers to mmap()-ed ranges from the original file.
Since the file could be very large, there are a few restrictions around the design:
Must not load the entire file into RAM as the file may be larger than available physical RAM
Must not copy files on opening as this will make opening new files really slow
Must work on filesystems like ext4 that do not support copy-on-write (cp --reflink/ioctl_ficlone)
Must not rely on mandatory file locking, as this is deprecated and requires the specific mount option -o mand on the filesystem
As long as the changes aren't visible in my mmap(), it's ok for the underlying file to change on the filesystem
Only need to support recent Linux and using Linux-specific system APIs are ok
The data structure I'm designing would keep track of a list of unedited and edited ranges in the file by storing the start and end indices of the ranges in the mmap()-ed buffer. While the user is browsing through the file, ranges of text that have never been modified by the user would be read directly from a mmap() of the original file, while a swap file will store the ranges of text that have been edited by the user but not yet saved.
When the user saves a file, the data structure would use copy_file_range to splice the swap file and the original file to assemble the new file. For this splicing to work, the original file as seen by my program must remain unchanged throughout the entire editing session.
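As a concrete illustration of that splice (a sketch only; the file names, range boundaries, and bookkeeping are invented, not the actual data structure):

import os

# Hypothetical save plan: the output is assembled from ranges taken either
# from the original file (unedited text) or from the swap file (edited text).
plan = [
    ("orig", 0,    4096),   # unedited prefix
    ("swap", 0,    512),    # an edited region stored in the swap file
    ("orig", 4608, 8192),   # unedited tail of the original
]

orig_fd = os.open("original.txt", os.O_RDONLY)
swap_fd = os.open("swap.bin", os.O_RDONLY)
out_fd  = os.open("saved.txt", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)

for source, start, end in plan:
    src_fd = orig_fd if source == "orig" else swap_fd
    offset, remaining = start, end - start
    while remaining > 0:
        # copy_file_range may copy less than requested, so loop until done.
        n = os.copy_file_range(src_fd, out_fd, remaining, offset_src=offset)
        if n == 0:
            break
        offset += n
        remaining -= n

for fd in (orig_fd, swap_fd, out_fd):
    os.close(fd)

This is exactly the step that breaks if the ranges in the original file no longer mean what they meant when the edits were made, which is the problem described next.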
Problem: The user may concurrently have other programs modifying the same file, possibly other text editors or some other programs that modified the text file in-place, after making unsaved changes in my text editor.
In such situation, the editor can detect such external change using inotify, and then I want to give the user two options on how to continue from this:
discard all unsaved changes and re-read the file from disk, implementing this option is fairly straightforward
allow the user to continue editing the file and later on the user should be able to save the unsaved changes in a new location or to overwrite the changes that had been made by the other program, but implementing this seems tricky
Since my editor did not make a copy of the file when it opened it, when the other program overwrites the file, the text ranges that my data structure is tracking may become invalid, because the data on disk have changed and these changes are now visible through my mmap(). This means that if my editor tries to write unsaved changes after the file has been modified by another process, it could end up splicing text ranges of the old file using data from the new file, producing a corrupt file when saving the unsaved changes.
I don't think advisory locks would save the situation here in all cases, as other programs may not honor advisory locks.
My ideal solution would be to make it so that when another program overwrites the file, the system transparently copies it, so that my program keeps seeing the old version while the other program finishes its write to disk and makes its version visible in the filesystem. I think ioctl_ficlone could have made this possible, but to my understanding this only works on a copy-on-write filesystem like btrfs.
Is such a thing possible?
Any other suggestions to solve this problem would also be welcome.
What you want to do isn't possible with mmap, and I'm not sure if it's possible at all with your constraints.
When you map a region, the kernel may or may not actually load all of it into memory. The region of memory that lacks data will actually contain an invalid page, so when you access it, the kernel takes a page fault and maps that region into memory. That region will likely contain whatever is in that portion of the file at the time the page fault occurs. There is an option, MAP_LOCKED, which tries to prefault all of the pages in, but doesn't guarantee it, so you can't rely on it working.
In general, you cannot prevent other processes from changing a file out from under you. Some tools (including editors) will write a new file to the side, calling rename to overwrite the file, and some will rewrite the file in place. The former is what you want, but many editors choose to do the latter, since it preserves characteristics such as ACLs and permissions you can't restore.
Furthermore, you really don't want to use mmap on any file you can't totally control, because if another process truncates the file and you try to access that portion of the buffer, your process will die with SIGBUS. Recovering from this signal in a handler is essentially undefined behavior, and the only sane thing to do is die. (Also, it can be sent in other situations, such as unaligned access, and you'll have a hard time distinguishing between them.)
Ultimately, if you're not interested in copying the file, you can't guarantee someone won't change underneath you, and you'll need to be prepared for that to occur.
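For reference, the write-a-new-file-to-the-side pattern mentioned above looks roughly like this (names are made up):

import os

tmp = "document.txt.tmp"
with open(tmp, "w") as f:
    f.write("new contents\n")
    f.flush()
    os.fsync(f.fileno())

# rename() atomically replaces document.txt. Readers see either the old
# file or the new one, never a half-written mixture, and anyone who still
# has the old inode open or mapped keeps seeing the old data.
os.rename(tmp, "document.txt")

A program rewriting the file in place, by contrast, modifies the very inode an existing mmap() refers to, which is the case the editor has to detect and handle.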
I came across this question and this one on deleting opened files in Linux.
However, I'm still confused about what happens in RAM when a process (call it A) deletes a file that is open in another process (call it B).
What baffles me is this (my analysis could be wrong, please correct me if so):
When a process opens a file, a new entry for that file in the UFDT is created.
When a process deletes a file, all the links to the file are gone; in particular, we have no reference to its inode, so it gets removed from the GFDT.
However, when the file is modified (say, written to), it must be updated on disk (since its pages get modified/dirty), but it has no reference in the GFDT because of the earlier delete, so we don't know its inode.
The question is: why is the "deleted" file still accessible by the process that opened it? And how does the operating system accomplish that?
EDIT: By UFDT I mean the file descriptor table of the process, which holds the file descriptors of the files opened by that process (each process has its own UFDT), and by GFDT the global file descriptor table; there is only one GFDT in the system (in RAM, in our case).
I never really heard about those UFDT and GFDT acronyms, but your view of the system sounds mostly right. I think your description of how open files are managed by the kernel lacks some detail, and perhaps this is where your confusion comes from. I'll try to give a more detailed description.
First, there are three data structures used to keep track of and manage open files:
Each process has a table of file descriptors. Each entry in this table stores a file descriptor and the file descriptor status flags (as of now, the only such flag is the close-on-exec flag, FD_CLOEXEC). The file descriptor is just a pointer to an entry in the file table, which I cover next. The integer returned by open(2) and family is usually an index into this file descriptor table; each process has its own table, which is why open(2) and family may return the same value for different processes opening different files.
There is one opened files table in the entire system. Each file descriptor table entry of each process references one of these entries in the opened files table. There is one entry in this table for each opened file: if two processes open the same file, two entries in this global table are created, even though it's the same file. Each entry in the files table stores the file status flags (opened for reading, writing, appending, etc), and the current file offset. This is why different processes can read from and write to different offsets in the same file concurrently as long as each of them opens the file.
Each entry in the file table also references an entry in the vnode table. The vnode table is a global table that has one entry for each unique file. If processes A, B, and C open file D, there will be only one vnode table entry, referenced by all 3 of the file table entries (in Linux, there is really no vnode; rather there is an inode, but let's keep this description generic and conceptual). The vnode entry contains pretty much the same information as the traditional inode (file size, other attributes, etc.), but it also contains other information useful for opened files, such as which file locks are active, who owns them, which portions of the file they lock, etc. This vnode entry also stores pointers to the file's data blocks on disk.
Deleting a file consists of calling unlink(2). This function unlinks a file from a directory. Each file inode in disk has a count of the number of links pointing to it; the file is only really removed if the link count reaches 0 and it is not opened (or 2 in the case of directories, since a directory references itself and is also referenced by its parent). In fact, the manpage for unlink(2) is very specific about this behavior:
unlink - delete a name and possibly the file it refers to
So, instead of looking at unlinking as deleting a file, look at it as deleting a file name, and maybe the file it refers to.
When unlink(2) detects that there is an active vnode table entry referring to this file, it doesn't delete the file from the filesystem. Nothing else happens. Yes, you can't find the file on your filesystem anymore. find(1) won't find it. You can't open it in new processes.
But the file is still there. It just doesn't appear in any directory entry.
For example, if it's a huge file, and if you run df or du, you will see that space usage is the same. The file is still there, on disk, you just can't reach it.
So, any reads or writes take place as usual - the file data blocks are accessible through the vnode table entry. You can still know the file size. And the owner. And the permissions. All of it. Everything's there.
When the process terminates or explicitly closes the file, the operating system checks the inode. If the number of links pointing to the inode is 0 and this was the last process that had the file open (which is tracked by a reference count in the vnode table entry), then the file is purged.
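The whole sequence is easy to observe from a single process (a sketch; the filename is arbitrary):

import os

fd = os.open("victim.txt", os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd, b"hello")

os.unlink("victim.txt")          # the name is gone: ls and find won't show it
print(os.fstat(fd).st_nlink)     # 0 - no links left, but the file still exists

# The open descriptor still reaches the same inode and data blocks.
os.write(fd, b", world")
os.lseek(fd, 0, os.SEEK_SET)
print(os.read(fd, 64))           # b'hello, world'

os.close(fd)                     # only now can the space actually be freed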
When a process opens a file, a new entry for that file in the UFDT is created.
What is this weird acronym? I take it you mean the process in question has a file descriptor.
When a process deletes a file, all the links to the file are gone; in particular, we have no reference to its inode, so it gets removed from the GFDT.
What on earth is GFDT?
However, when the file is modified (say, written to), it must be updated on disk (since its pages get modified/dirty), but it has no reference in the GFDT because of the earlier delete, so we don't know its inode.
I am guessing whatever this GFDT is has something to do with being "global" and "file descriptors".
So, all this shows serious misconceptions.
As was outlined by your own question, the file is a different thingy from the name. Next, when you open something from a filesystem, it gets an in-memory representation of the inode, and a struct file object is allocated, which then points to the in-memory inode. Finally, the file descriptor table of the relevant thread is updated to store the pointer to the struct file object at a given offset. The offset is known as a file descriptor.
So there. The number of names associated with an inode has zero relation to the kernel's ability to issue reads/writes affecting the inode (or the blocks of the file it represents), as long as the file was opened before the last name got removed.
The file may or may not be trashed once there are no names left and the kernel does not use it anymore.
When you copy files in Linux (using the context-menu copy command), does Linux create hard links to the files?
Also, what happens if you delete the original file and then the hard link? Does the file still persist, with just its pointers removed?
I have trouble understanding a few things about memory here.
To free disk space, you need to delete both files, right?
Does a hard link point to the memory location of the original file? I keep seeing the term inode, but I'm not quite sure what an inode really is.
The inode is all the file data except the content.
A directory contains a set of names and numbers: "This directory contains file foo, which is file number 3 on this drive, bar, which is file number 4, quux, 17, viz, 123 and lastly ohmygod, 77321341". Inode number 3 contains "This file was created on January 1, 1970, last modified on January 1, 1990 and last read on January 2, 1990. It is 722 bytes large, and those bytes are in 4k block number 768123 on the drive" and a few more things.
The stat() system call shows how many blocks are needed, and almost everything else related to the inode.
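In Python, for instance, the same information is exposed through os.stat() (a sketch; the values of course depend on the file):

import os

st = os.stat("foo")
print(st.st_ino)      # the file's inode number on this filesystem
print(st.st_nlink)    # how many directory entries (hard links) point at it
print(st.st_size)     # size in bytes
print(st.st_blocks)   # number of 512-byte blocks allocated on disk
print(st.st_mtime)    # last modification time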
Copying does not create hard links, that would be broken behavior. A hard link is just an additional first-class name to the same file; modify the file via one name (and not by saving under a temp name and then moving it, as some editors do), and you will see the change in the file when accessed under the other name, too. Not what I’d expect from a copy.
Note that there is nothing special about the first name a file had. All hard links are simply pointing at the same file.
Once the last directory entry pointing to a file is removed, there may still be open file handles pointing to it (from programs that opened the file). As long as one of those exists, the file is still there and can be used. It just can no longer be opened by processes that haven't already done so, since it has no name any more.
When there is no more directory entry pointing to a file and no program has an open handle to the file any more, it can never be reached again. Therefore, the operating system frees the space on the disk.
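The points above can be seen in a few lines (a sketch with made-up file names):

import os

with open("a.txt", "w") as f:
    f.write("original contents\n")

os.link("a.txt", "b.txt")        # a second first-class name for the same file
print(os.stat("a.txt").st_ino == os.stat("b.txt").st_ino)   # True: same inode

with open("b.txt", "a") as f:    # modify via one name...
    f.write("more\n")
with open("a.txt") as f:
    print(f.read())              # ...and the change is visible via the other

os.unlink("a.txt")               # remove one name; the file is still reachable
with open("b.txt") as f:
    print(f.read())
os.unlink("b.txt")               # last name gone: the space can now be freed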
I have a file indexing database on Linux. Currently I use file path as an identifier.
But if a file is moved/renamed, its path is changed and I cannot match my DB record to the new file and have to delete/recreate the record. Even worse, if a directory is moved/renamed, then I have to delete/recreate records for all files and nested directories.
I would like to use inode number as a unique file identifier, but inode number can be reused if file is deleted and another file created.
So, I wonder whether I can use a pair of {inode,crtime} as a unique file identifier.
I hope to use i_crtime on ext4 and creation_time on NTFS.
In my limited testing (with ext4) inode and crtime do, indeed, remain unchanged when renaming or moving files or directories within the same file system.
So, the question is whether there are cases when inode or crtime of a file may change.
For example, can fsck or defragmentation or partition resizing change the inode or crtime of a file?
Interesting that
http://msdn.microsoft.com/en-us/library/aa363788%28VS.85%29.aspx says:
"In the NTFS file system, a file keeps the same file ID until it is deleted."
but also:
"In some cases, the file ID for a file can change over time."
So, what are those cases they mentioned?
Note that I studied similar questions:
How to determine the uniqueness of a file in linux?
Executing 'mv A B': Will the 'inode' be changed?
Best approach to detecting a move or rename to a file in Linux?
but they do not answer my question.
{device_nr,inode_nr} are a unique identifier for an inode within a system
moving a file to a different directory does not change its inode_nr
the linux inotify interface enables you to monitor changes to inodes (either files or directories)
Extra notes:
moving files across filesystems is handled differently (it is in fact copy+delete)
networked filesystems (or a mounted NTFS) cannot always guarantee the stability of inode numbers
Microsoft is not a unix vendor, its documentation does not cover Unix or its filesystems, and should be ignored (except for NTFS's internals)
Extra text: the old Unix adage "everything is a file" should in fact be: "everything is an inode". The inode carries all the meta-information about a file (or directory, or a special file) except the name. The filename is in fact only a directory entry that happens to link to the particular inode. Moving a file implies creating a new link to the same inode and deleting the old directory entry that linked to it.
The inode metadata can be obtained with the stat(), fstat(), and lstat() system calls.
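A sketch of the identifier this answer suggests, using stat() from Python (the paths are made up; the rename must stay within one filesystem):

import os

def file_id(path):
    # Identify a file by (device number, inode number).
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

before = file_id("docs/report.txt")
os.rename("docs/report.txt", "archive/report-2011.txt")  # move/rename on the same fs
after = file_id("archive/report-2011.txt")
print(before == after)    # True: the inode, and thus the identifier, is unchanged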
The allocation and management of i-nodes in Unix is dependent upon the filesystem. So, for each filesystem, the answer may vary.
For the Ext3 filesystem (the most popular), i-nodes are reused, and thus cannot be used as a unique file identifier, nor does reuse occur according to any predictable pattern.
In Ext3, i-nodes are tracked in a bit vector, each bit representing a single i-node number. When an i-node is freed, its bit is set to zero. When a new i-node is needed, the bit vector is searched for the first zero bit, and that i-node number (which may have been previously allocated to another file) is reused.
This may lead to the naive conclusion that the lowest-numbered available i-node will be the one reused. However, the Ext3 file system is complex and highly optimised, so no assumptions should be made about when and how i-node numbers can be reused, even though they clearly will be.
From the source code for ialloc.c, where i-nodes are allocated:
There are two policies for allocating an inode. If the new inode is a directory, then a forward search is made for a block group with both free space and a low directory-to-inode ratio; if that fails, then of the groups with above-average free space, that group with the fewest directories already is chosen. For other inodes, search forward from the parent directory's block group to find a free inode.
The source code that manages this for Ext3 is called ialloc and the definitive version is here: https://github.com/torvalds/linux/blob/master/fs/ext3/ialloc.c
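Reuse is easy to observe in practice (a sketch; whether the very same number comes back depends on the filesystem and allocator state, as stressed above):

import os

with open("first.tmp", "w") as f:
    f.write("x")
ino_first = os.stat("first.tmp").st_ino
os.unlink("first.tmp")            # inode is freed: no links and nothing has it open

with open("second.tmp", "w") as f:
    f.write("y")
ino_second = os.stat("second.tmp").st_ino
os.unlink("second.tmp")

# On many filesystems the freed number is handed straight back, which is
# why an inode number on its own is not a durable identifier.
print(ino_first, ino_second, ino_first == ino_second)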
I guess the DB application would need to consider the case where the file is restored from backup, which would preserve the file's crtime but not its inode number.