Uniquely identify files - Linux

I'd like to index files in a local database, but I do not understand how I can identify each individual file. For example, if I store the file path in the database, then the entry will no longer be valid if the file is moved or deleted. I imagine there is some way of uniquely identifying files no matter what happens to them, but I have had no success with Google.
This will be for *nix/Linux, and ext4 in particular, so please nothing specific to Windows or NTFS or anything like that.

In addition to the excellent hashing suggestion below, you might consider using the inode number of the file, viewable in a shell with ls -i.
Using index.php on one of my boxes:
ls -i
yields
196237 index.php
I then rename the file using mv index.php index1.php, after which the same ls -i yields:
196237 index1.php
(Note the inode number is the same)

Try using a hashing scheme such as MD5, SHA-1, or SHA-2; these will let you match the files up by content.
Basically, when you first create the index, hash every file you wish to add. The digest is a reliable way of telling whether two files are the same or different. Then, when you need to see whether a file is already in the index, hash it and compare the generated hash against your table of known hashes.
EDIT: As was said in the comments, it is a good idea to combine both pieces of data (the inode number and the content hash) so that you can track changes more accurately.
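A minimal sketch of that workflow, assuming Python with hashlib and an in-memory table; the root directory is illustrative:

    import hashlib
    import os

    def md5_of(path):
        # Reads the whole file at once; fine for small files.
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()

    # Build the index: content hash -> last known path.
    index = {}
    for dirpath, _, filenames in os.walk("/srv/data"):  # illustrative root
        for name in filenames:
            path = os.path.join(dirpath, name)
            index[md5_of(path)] = path

    # Later: has this file been seen before, even if moved or renamed?
    def is_indexed(path):
        return md5_of(path) in index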

If you do not consider files with the same content to be the same file, and only want to track a moved or renamed file as the same one, then the inode number will do. Otherwise you will have to hash the content.

The only fly in the ointment with inodes is that they can be reused after a delete (depending on the platform), so you need to record the file's creation timestamp as well as the device id to be 100% sure. It's easier on Windows, with its user file attributes.
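A sketch in Python of recording all three pieces via os.stat(). One caveat: on Linux, st_ctime is the inode change time, not a true creation time (ext4 does record a creation time, but plain stat() does not expose it); it still helps detect inode reuse after a delete:

    import os

    def file_identity(path):
        # (device, inode) is unique at a point in time; the ctime
        # guards against the inode being reused after a delete.
        st = os.stat(path)
        return (st.st_dev, st.st_ino, st.st_ctime)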

Related

How exactly is md5 vulnerable and how to reproduce it?

All the online articles and blogs say MD5 is vulnerable, but how exactly does it produce the same hash from two different inputs? I use an application called fdupes, which recursively calculates the MD5 hash of files in a directory and its subdirectories and deletes the duplicates if the delete option is specified. In this case, what are the chances that it produces the same hash for two different files? I would also like to know how to reproduce such behavior.

Unique File Id?

I am making an application that will save information about certain files, and I was wondering what the best way to keep track of them is. I was thinking of using a file's absolute path, but that could change if the file is renamed. I found that if you run ls -i, each file has an id beside it that appears to be unique. Is that OK to use as a unique file id?
The inode is unique per device, but I would not recommend relying on it: imagine your box crashes and you move all the files to a new filesystem; now all your files have new ids.
It really depends on your language of choice, but almost all of them include a library for generating UUIDs. While collisions are theoretically possible, they are a veritable non-issue. Generate the UUID, prepend it to the front of your file, and you are in business. As your implementation grows, it will also let you build a hash-table index of your files for quick lookups later.
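For example, in Python (the path is illustrative, and note that prepending the id does modify the file's content):

    import uuid

    file_id = str(uuid.uuid4())  # random version-4 UUID

    # Prepend the id to the file, as suggested above.
    path = "/srv/data/report.txt"  # illustrative; the file must exist
    with open(path, "r+") as f:
        body = f.read()
        f.seek(0)
        f.write(file_id + "\n" + body)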
The question is, "unique across what?"
If you need something unique on a given machine at a given point in time, then yes, the inode number + device number is nearly always unique - these can be obtained from stat() or similar in C, or os.stat() in Python. However, if you delete a file and create another, the inode number may be reused. Also, two different hosts may have completely different ideas of what the (device, inode) pairs are.
If you need something to describe the file's content (so two files with the same content have the same id), you might look into one of the SHA or RIPEMD functions. This will be pretty unique - the odds of an accidental collision are astronomically low.
If you need some other form of uniqueness, please elaborate.
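For the content-digest route, hashing in chunks keeps memory use flat even for very large files; a minimal sketch with SHA-256 (the chunk size is an arbitrary choice):

    import hashlib

    def content_id(path, chunk_size=1 << 20):
        # Hash in 1 MiB chunks so large files never sit in memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()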

Embedding hidden encoded bits in plain text file

I'm designing a system to process plain text files; one of its features will be to move processed files to an archive server once they've been completely processed through the system. What I want to do is tag a text file once it has been completely processed, i.e. give it a system seal of approval or marker. The reason for this is that I want the same system to be able to analyze the text file later, search for this hidden marker, and identify the file as having been processed in the past. At the same time, I want this marker to be ignored by any other system that might handle the file.
I was thinking of having a unique key that only this system uses and has access to, and creating a procedure for hashing and salting that key and placing the result within the text file before it is moved to its final destination. I'm curious about any other techniques for creating a hidden seal or marker. So to summarize:
Can I create a set or string of encoded bits and place them in a text file?
Can these bits be hidden within the text file such that they are ignored by any other system that might handle this text file?
I'd appreciate any insight or feedback.
Personally, I would avoid modifying the original content; an ASCII text file (to my knowledge) can't be signed in a way that would prevent all applications from seeing the signature.
Instead, I would take the MD5 of the file and maintain the "processed" ones separately from the ones that have not yet been processed.
A Map<MD5, FileName> is a structure to consider. You should be able to write code that retrieves entries by either MD5 or file name.
Hope it helps.
Hiding data inside another file is called steganography. It can be done with ASCII files, but it is usually more easily done with data or image files; one classic plain-text trick, encoding bits in trailing whitespace, is sketched below.
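A hedged sketch of that trick in Python: a space at the end of a line encodes a 0 bit and a tab encodes a 1. It assumes the original text carries no trailing whitespace of its own, assumes there are at least as many lines as bits, and it does change the file's bytes:

    def embed_bits(lines, bits):
        # Append one hidden bit per line: ' ' encodes 0, '\t' encodes 1.
        out = []
        for line, bit in zip(lines, bits):
            out.append(line.rstrip("\n") + (" " if bit == "0" else "\t") + "\n")
        return out + lines[len(bits):]

    def extract_bits(lines):
        # Lines without trailing whitespace carry no bit and are skipped.
        bits = []
        for line in lines:
            body = line.rstrip("\n")
            if body.endswith("\t"):
                bits.append("1")
            elif body.endswith(" "):
                bits.append("0")
        return "".join(bits)

Most editors and text tools display marked and unmarked files identically, which is the "ignored by other systems" property asked for, but anything that strips trailing whitespace will silently destroy the marker.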
In your particular case, a parallel register, or metadata store, of processed files would seem to be a better fit. Using a good hash, MD5 or better, is fine as long as you do not expect deliberate malicious attacks; in that case you would need to use HMAC-MD5 or HMAC-SHA-256, since a malicious attacker can easily calculate the correct plain hash value for an altered file.
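A sketch of that parallel register with HMAC-SHA-256 in Python; the key and paths are illustrative, and a real system would persist the register rather than keep it in memory:

    import hashlib
    import hmac

    SECRET_KEY = b"replace-with-a-real-secret"  # illustrative; store securely

    def seal(path):
        # Keyed digest: an attacker without the key cannot forge it.
        mac = hmac.new(SECRET_KEY, digestmod=hashlib.sha256)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                mac.update(chunk)
        return mac.hexdigest()

    processed = {}  # seal -> path, kept outside the files themselves
    processed[seal("/archive/batch1.txt")] = "/archive/batch1.txt"

    def already_processed(path):
        return seal(path) in processed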

Using diff and suppressing entirely different files

I have two copies of an application's source code. One copy is encoded, while the other is not. There are config files scattered throughout the application's directory structure, and these are what I would like to compare.
Is there a way to use diff whereby I can ignore the wildly different files (i.e. an encrypted file and its unencrypted counterpart) and only report the differences in the similar-yet-different files (the configs)?
You could write a script that uses find to locate the files based on name or other criteria, and file to determine whether two candidates have the same type of contents (e.g. one compressed, one not).
For me to be more specific, I would need more details: are these parallel directory structures (files and directories appear in the same places in the two trees), and do the files you are looking for have names that distinguish them from the files you want to ignore? Any additional information you can provide would help even more.
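A sketch of that approach in Python rather than shell; the binary-detection heuristic and the parallel-tree assumption are mine, not the asker's:

    import difflib
    import os

    def looks_binary(path, probe=4096):
        # Heuristic: NUL bytes or mostly non-printable content.
        with open(path, "rb") as f:
            sample = f.read(probe)
        if b"\x00" in sample:
            return True
        printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in sample)
        return bool(sample) and printable / len(sample) < 0.8

    def diff_trees(a_root, b_root):
        for dirpath, _, filenames in os.walk(a_root):
            for name in filenames:
                a = os.path.join(dirpath, name)
                b = os.path.join(b_root, os.path.relpath(a, a_root))
                if not os.path.isfile(b):
                    continue
                if looks_binary(a) or looks_binary(b):
                    continue  # skip the wildly different pairs
                with open(a, errors="replace") as fa, open(b, errors="replace") as fb:
                    for line in difflib.unified_diff(
                            fa.readlines(), fb.readlines(), fromfile=a, tofile=b):
                        print(line, end="")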

Searching for files from a single folder (knowing the prefix) versus searching for files from multiple folders (knowing the folder name)

I've got a system in which users select a couple of options and receive an image based on those options. I'm trying to combine multiple generated images (corresponding to those options) into the requested picture. I'm trying to optimize this so that if an image already exists for a certain option (i.e. the file exists), there's no need to compute it and we can move on to the next step.
Should I store these images in different folders, where each folder is an option name? Should I store them in the same folder, adding a prefix corresponding to the option to each image? Or should I store the filenames in a database and check there? Which way is fastest for checking whether a file exists?
I'm using PHP on Linux, but I'm also interested in whether the answer varies if I change the programming language or the OS.
If you're going to be producing a lot of these images, it doesn't seem very scalable to keep them all in one flat directory. I would go with a hierarchy, which will make it a lot easier to manage.
Checking a database is generally going to be quicker than testing whether a file exists on disk, though, so if speed is the primary concern, use a hierarchical folder structure and keep all the filenames in a database as well.
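A hedged sketch of that combination (the asker is on PHP, but the shape is the same in Python; the cache location is illustrative). The database variant would replace the os.path.exists() call with an indexed lookup:

    import os

    CACHE_ROOT = "/var/cache/images"  # illustrative

    def cached_path(option, image_name):
        # One directory per option keeps any single directory small.
        return os.path.join(CACHE_ROOT, option, image_name)

    def get_or_render(option, image_name, render):
        path = cached_path(option, image_name)
        if os.path.exists(path):  # a single stat() call
            return path
        os.makedirs(os.path.dirname(path), exist_ok=True)
        render(path)  # compute only on a cache miss
        return path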
