Unique File Id? - linux

I am making an application that will save information for certain files, and I was wondering what the best way to keep track of files is. I was thinking of using the absolute path of a file, but that changes if the file is renamed or moved. I found that if you run ls -i, each file has an id beside it that is unique(?). Is that OK to use as a unique file id?

The inode number is unique per device, but I would not recommend using it: imagine your box crashes and you move all the files to a new file system. Now all your files have new ids.
It really depends on your language of choice, but almost all of them include a library for generating UUIDs. While collisions are theoretically possible, it's a veritable non-issue. Generate the UUID, prepend it to the front of your file, and you are in business. As your implementation grows, it will also allow you to create a hash table index of your files for quick lookups later.
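A minimal Python sketch of the UUID idea, reading "prepend it to the front of your file" as a file name prefix (that reading is an assumption, and the answer may equally mean the file's contents). The helper names tag_with_uuid and build_index are hypothetical.

```python
import os
import uuid

def tag_with_uuid(path):
    """Rename path to '<uuid>_<original name>'; return (uuid string, new path)."""
    file_id = str(uuid.uuid4())  # random 128-bit identifier, 36 characters
    directory, name = os.path.split(path)
    new_path = os.path.join(directory, f"{file_id}_{name}")
    os.rename(path, new_path)
    return file_id, new_path

def build_index(directory):
    """Map each UUID prefix found in directory to the file's current path."""
    index = {}
    for name in os.listdir(directory):
        prefix, sep, _rest = name.partition("_")
        if sep and len(prefix) == 36:  # crude check for a UUID-shaped prefix
            index[prefix] = os.path.join(directory, name)
    return index
```

Because the id travels with the file name, it survives a copy to a new file system, which is the case where inode numbers change.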

The question is, "unique across what?"
If you need something unique on a given machine at a given point in time, then yes, the inode number + device number is nearly always unique. These can be obtained from stat() or similar in C, or os.stat() in Python. However, if you delete a file and create another, the inode number may be reused. Also, two different hosts may have completely different ideas of what the (device, inode number) pairs are.
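As a small illustration of the os.stat() route mentioned above, the sketch below returns the device and inode numbers as a pair. The function name file_key is hypothetical.

```python
import os

def file_key(path):
    """Return (device number, inode number) for path on this machine."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)
```

The pair stays the same when the file is renamed within one file system, but it is only meaningful on the host that produced it and may be reused after the file is deleted.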
If you need something to describe the file's content (so two files with the same content have the same id), you might look into one of the SHA or RIPEMD functions. This will be pretty unique - the odds of an accidental collision are astronomically low.
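A content-based id along these lines could be sketched with Python's hashlib. SHA-256 is chosen here only as one member of the SHA family, and content_id is a hypothetical name.

```python
import hashlib

def content_id(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file's bytes, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Two files with identical bytes get the same id regardless of name or location, and any edit to the content changes the id.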
If you need some other form of uniqueness, please elaborate.

Related

Turning unique filepath into unique integer

I often use file paths to provide some sort of unique id for a software system. Is there any way to take a file path and turn it into a unique integer in a relatively quick (computationally) way?
I am ok with larger integers. This would have to be a pretty nifty algorithm as far as I can tell, but would be very useful in some cases.
Anybody know if such a thing exists?
You could try the inode number:
fs.statSync(filename).ino
@djones's suggestion of the inode number is good if the program is only running on one machine and you don't care about a new file duplicating the id of an old, deleted one. Inode numbers are reused.
Another simple approach is hashing the path to a big integer space. E.g. using a 128-bit murmurhash (in Java I'd use the Guava Hashing class; there are several JS ports), the chance of any collision among a billion paths is still only about 1/2^69 (by the birthday bound, roughly n^2/2^129 for n = 2^30 paths). If you're really paranoid, keep a set of the hash values you've already used and rehash on collision.
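A Python sketch of hashing a path into a 128-bit integer, with the "rehash on collision" fallback. Murmurhash is not in the Python standard library, so this swaps in hashlib.blake2b with a 16-byte digest as a stand-in. The names path_to_int and assign_id, and the use of a salt counter for rehashing, are assumptions made for illustration.

```python
import hashlib

def path_to_int(path, salt=0):
    """Hash a path string to an integer in [0, 2**128)."""
    digest = hashlib.blake2b(
        path.encode("utf-8"),
        digest_size=16,                 # 16 bytes = 128 bits
        salt=salt.to_bytes(8, "big"),   # changing the salt gives a new hash
    ).digest()
    return int.from_bytes(digest, "big")

def assign_id(path, used):
    """Return an id for path; `used` maps ids already handed out to their paths."""
    salt = 0
    while True:
        value = path_to_int(path, salt)
        owner = used.setdefault(value, path)
        if owner == path:
            return value
        salt += 1  # id belongs to a different path: rehash with a new salt
```

The id depends only on the path string, so it is stable across runs and machines, but it changes whenever the file is renamed or moved.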
This is just my comment turned to an answer.
If everything runs in memory, you can use one of the standard hash maps in your language. This works not just for file names, but for any similar situation. Hash maps in most programming languages resolve collisions with buckets, so the hash value together with the entry's position in its bucket provides a unique id.
By the way, it is not hard to write your own hash map, so that you have control over the underlying structure (e.g. to retrieve that number).

Uniquely identify files

I'd like to index files in a local database, but I do not understand how I can identify each individual file. For example, if I store the file path in the database, then the entry will no longer be valid if the file is moved or deleted. I imagine there is some way of uniquely identifying files no matter what happens to them, but I have had no success with Google.
This will be for *nix/Linux and ext4 in particular, so please nothing specific to windows or ntfs or anything like that.
In addition to the excellent suggestion above, you might consider using the inode number property of the files, viewable in a shell with ls -i.
Using index.php on one of my boxes:
ls -i
yields
196237 index.php
I then rename the file using mv index.php index1.php, after which the same ls -i yields:
196237 index1.php
(Note the inode number is the same)
Try using a hashing scheme such as MD5, SHA-1, or SHA-2. These will allow you to match the files up by content.
Basically when you first create the index, you will hash all the files that you wish to add. This string is pretty good at telling if two files are different or the same. Then when you need to see if one of the files is already in the index, hash it and then compare the generated hash to your table of known hashes.
EDIT: As was said in the comments, it is a good idea to record both the inode number and the content hash, so that you can track changes more accurately.
If you do not want files with the same content to count as the same file, and only want to recognize moved or renamed files, then the inode number will do. Otherwise you will have to hash the content.
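One way to combine the two kinds of data is sketched below in Python: each index record stores both the (device, inode) pair and a SHA-256 content hash, and a lookup tries the inode first and the hash second. The record layout, the result labels, and the names snapshot and find_match are all hypothetical.

```python
import hashlib
import os

def snapshot(path):
    """Record the path, (device, inode) pair, and content hash of a file."""
    st = os.stat(path)
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()  # reads the whole file
    return {"path": path, "key": (st.st_dev, st.st_ino), "sha256": digest}

def find_match(record, index):
    """Compare a fresh snapshot against a list of earlier snapshots."""
    for old in index:
        if old["key"] == record["key"]:
            if old["sha256"] == record["sha256"]:
                return "same file", old
            return "same file, content changed", old
    for old in index:
        if old["sha256"] == record["sha256"]:
            return "same content, different file", old
    return "unknown", None
```

A renamed file matches on its inode even though the path differs, while a copy matches only on the hash. Inode reuse after deletion can still produce a false "same file, content changed" result.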
The only fly in the ointment with inodes is that they can be reassigned after a delete (depending on the platform). You need to record the file creation timestamp as well as the device id to be 100% sure. It's easier on Windows with its user file attributes.

Using diff and suppressing entirely different files

I have two copies of an application's source code. One copy is encoded, while the other is not. There are config files scattered throughout the application's directory structure, and these are what I would like to compare.
Is there a way to use diff so that it ignores the wildly different files (i.e. an encrypted file and an unencrypted file) and only reports the differences between the similar-yet-different files (the configs)?
You could write a script that uses find to find the files based on name or other criteria and file to determine whether they have the same type of contents (i.e. one compressed, one not).
For me to be more specific I would need you to give more details about whether these are parallel directory structures (files and directories appear in the same places in the two trees) and whether the files you are looking for have names that distinguish them from files you want to ignore. Any additional information you can provide might help even more.
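A rough Python sketch of such a script, assuming parallel directory trees and config files that can be picked out by a name pattern (both assumptions, since the question does not say). A NUL-byte check stands in for the file command as a crude text/binary test, and the names looks_like_text and diff_configs and the default "*.ini" pattern are hypothetical.

```python
import difflib
import fnmatch
import os

def looks_like_text(path, probe=4096):
    """Heuristic: treat a file as text if its first bytes contain no NUL."""
    with open(path, "rb") as f:
        return b"\0" not in f.read(probe)

def diff_configs(tree_a, tree_b, pattern="*.ini"):
    """Yield (relative path, diff lines) for matching text files that differ."""
    for root, _dirs, files in os.walk(tree_a):
        for name in fnmatch.filter(files, pattern):
            a = os.path.join(root, name)
            rel = os.path.relpath(a, tree_a)
            b = os.path.join(tree_b, rel)
            if not os.path.isfile(b):
                continue  # no counterpart in the second tree
            if not (looks_like_text(a) and looks_like_text(b)):
                continue  # skip encoded or binary files
            with open(a, errors="replace") as fa, open(b, errors="replace") as fb:
                diff = list(difflib.unified_diff(fa.readlines(), fb.readlines(), a, b))
            if diff:
                yield rel, diff
```

Encoded files that happen to contain no NUL bytes would slip through this check, so a stricter test may be needed for a particular encoder.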

searching for files from a single folder (knowing prefix) versus searching for files from multiple folders (knowing folder name)

I've got a system in which users select a couple of options and receive an image based on those options. I'm trying to combine multiple generated images (corresponding to those options) into the requested picture. I'm trying to optimize this so that if an image already exists for a certain option (i.e. the file exists), there's no need to compute it and we move on to the next step.
Should I store these images in different folders, where each folder is an option name? Should I store them in the same folder, adding a prefix corresponding to the option to each image? Should I store the filenames in a database and check there? Which way is faster to check a file for existence?
I'm using PHP on Linux, but I'm also interested if the answer varies if I change the programming language or the OS.
If you're going to be producing a lot of these images, it doesn't seem very scalable to keep them all in one flat directory. I would go with a hierarchy, which will make it a lot easier to manage.
It's always going to be quicker to check in a database than to check if a file exists though, so if speed is the primary concern, use a hierarchical folder structure and keep all the filenames in a database.
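The hierarchical layout with an existence check might look like the Python sketch below (the question uses PHP, but the idea carries over). One folder per option is assumed, and render is a hypothetical callback that returns the image bytes for an option. The database lookup is left out.

```python
import os

def cached_image_path(cache_root, option, image_name):
    """One sub-folder per option: <cache_root>/<option>/<image_name>."""
    return os.path.join(cache_root, option, image_name)

def get_or_create(cache_root, option, image_name, render):
    """Return the cached image path, rendering the image only if it is missing."""
    path = cached_image_path(cache_root, option, image_name)
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(render(option))
    return path
```

Keeping each option in its own folder keeps individual directories small, and the existence check costs a single stat call per image.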

Deleting files securely in delphi7

I need to delete my input file securely once I have finished with it. At the moment I'm overwriting all the data with zeros, but this is messy because my temp folder becomes full of old files, and the names of the files are also a security issue.
Rather than moving them to the recycle bin, I would like them to skip it and just disappear, in addition to being wiped byte-wise, since data recovery software can recover items even after they leave the recycle bin. As the name is also important, I need to rename them before I delete them.
This is a progressive problem. What is "secure" for one application is insecure for another. If security is really important and you find yourself asking these kinds of questions on Stack Overflow, then you most likely need to contract with an external security consultant. Examples of really important include financial information, medical records, or anything else where there is a law or contract requiring the securing of the data. I don't say this to be mean or imply that you are incapable of solving the problem, but to point out that this is a rather complex and evolving problem.
Basically to accomplish what you want to accomplish:
Once the overwrite code you wrote finishes, set the file size to zero. This makes recovery more difficult because the original file size is lost.
Then rename the file (RenameFile) to a different name.
Finally, delete the file using DeleteFile, which does not move the file to the recycle bin.
Make sure you maintain an exclusive handle on the files the whole time they are on the disk too, or they can just be copied before they are deleted.
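The three steps above can be sketched as follows. This is Python rather than Delphi, so os.rename and os.remove stand in for RenameFile and DeleteFile, and scrub_and_delete is a hypothetical name. The sketch does not hold an exclusive handle and does not overwrite the data, so it carries all the weaknesses the answer describes.

```python
import os
import secrets

def scrub_and_delete(path):
    """Truncate, rename to a random name, then delete (no recycle bin involved)."""
    directory = os.path.dirname(path) or "."
    # Step 1: truncate to zero bytes so the original size is no longer recorded.
    with open(path, "r+b") as f:
        f.truncate(0)
    # Step 2: rename to a meaningless name so the original name is not left behind.
    anonymous = os.path.join(directory, secrets.token_hex(16))
    os.rename(path, anonymous)
    # Step 3: remove the directory entry.
    os.remove(anonymous)
```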
As I said, this is a progressive problem. This is a really basic solution, and is subject to a number of vulnerabilities. So depending on the level of security needed you might consider never letting the file be written to disk, or using multiple pass overwrites. If security is really important, then actually burning the hard drive platter at a high temperature, and then smashing it is the only way to be sure.
Edit: It appears you removed your code sample.
There are third-party utilities to do this kind of thing from the command line. I found that PGP Command Line has this feature, and if you search around you can probably find a free app that does the same. You could then just call the command from your app in order to securely delete the file.
I would say that if you insist on writing your own code to do this, then instead of using all 0's, write random bytes to the disk. And don't use the built-in C++ rand function; use a more secure random number generator.
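For illustration, a single-pass overwrite with random bytes might look like this in Python, using os.urandom (the operating system's cryptographic random source) in place of a rand-style generator. The name overwrite_with_random is hypothetical, and one pass is an arbitrary choice.

```python
import os

def overwrite_with_random(path, chunk_size=65536):
    """Overwrite the file's existing bytes in place with random data."""
    remaining = os.path.getsize(path)
    with open(path, "r+b") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(os.urandom(n))  # cryptographically strong random bytes
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
```

Whether the new bytes actually land on the same physical blocks depends on the file system and the storage device, so this is not a guarantee of erasure.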
As Jim McKeeth said, this is not something you want to do yourself if there are serious legal repercussions for getting it wrong.
Jim has described well the issues with solving your problem in code. The problem is indeed progressive, and any solution you implement will only approximate complete security without ever attaining it. So one thing to do is to decide exactly what you need to protect the file against (snooping family members? co-workers? corporate espionage? totalitarian governments?), then design your solution accordingly and document its limitations.
I have a somewhat orthogonal suggestion though. Instead of, or in addition to, implementing secure wiping in code, you can require cooperation from users. For example, you can suggest (or require) that input files be stored on an encrypted volume. In corporate environments PGP Disk might be preferred, since it's a recognizable brand, while home users would be well served by the free and well-tested TrueCrypt. Both products support creating virtual encrypted volumes as well as encrypting whole partitions. This would go a long way toward keeping the names and contents of input files secure, even before you write a single line of code.
Deleting a file can be touchy subject...
Depending on the needs of your customer, I would like to point to the data remanence phenomenon, which is the residual data left after a simple overwrite. Data erasure is a method of destroying the residual data.
There are a few standards on how to erase the residual data, DoD 5220.22-M is mostly referred to by "secure file delete" applications, but apparently the rules have changed.
"As of the June 2007 edition of the DSS C&SM, overwriting is no longer acceptable for sanitization of magnetic media; only degaussing or physical destruction is acceptable."
So what I'm saying is, try to get the rules which your customer has to follow.
Beware of "wear leveling" algorithms used with flash storage. To promote even wear, files are moved around on the drive, and this is invisible to your app and even to the operating system. So you can "secure delete" the file all you want, and you will only affect the most recent copy of the file. Prior copies remain recoverable/discoverable with recovery software. So the only way to solve that is to encrypt the file contents.

Resources