I am trying to verify that two ZIP files have the same content. I assumed that if I unzip and immediately rezip a group of files (using the same zip function) the resulting zip files should be identical. They are not: their sha256sums differ. Yet when I unzip both of these new zip files, their contents are identical.
Why would the zip files not be the same for two identical sets of files/directories?
As you have discovered, that is an entirely incorrect assumption. You should compare the uncompressed contents if you want to know if the contents are the same. You could also just look at the entry names, lengths, and CRCs to get a high probability verification, without having to decompress.
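For the name/length/CRC check, a minimal sketch using Python's zipfile module (the archive paths are placeholders):

import zipfile

def zip_entries(path):
    # (name, uncompressed size, CRC-32) for every entry, order-independent
    with zipfile.ZipFile(path) as zf:
        return {(info.filename, info.file_size, info.CRC) for info in zf.infolist()}

# a.zip and b.zip stand in for the two archives being compared
print(zip_entries("a.zip") == zip_entries("b.zip"))

This ignores everything that can legitimately differ between archives (entry order, timestamps, compression settings) and only compares what was stored.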
The zip file can be different for many reasons. The order of the files in the zip file does not need to be the same. The modification dates of the files are stored in the zip file, so if they are not restored they will be different. A different compression level or different compression software (even just a different version of the same software) could be used, resulting in different output. Different compression methods could have been chosen. Different file metadata or other extra fields could be included in one zip file but not the other, e.g. Unix permissions. Different variants of the zip format might be used, e.g. Zip64. The list goes on.
The zip format supports different compression levels and might store file attributes (like file timestamps), so this has to be expected.
Related
I'd like to index files in a local database but I do not understand how I can identify each individual file. For example, if I store the file path in the database then the entry will no longer be valid if the file is moved or deleted. I imagine there is some way of uniquely identifying files no matter what happens to them, but I have had no success with Google.
This will be for *nix/Linux and ext4 in particular, so please nothing specific to Windows or NTFS or anything like that.
In addition to the excellent suggestion above, you might consider using the inode number property of the files, viewable in a shell with ls -i.
Using index.php on one of my boxes:
ls -i
yields
196237 index.php
I then rename the file using mv index.php index1.php, after which the same ls -i yields:
196237 index1.php
(Note the inode number is the same)
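If you want to read the inode number programmatically rather than with ls -i, a minimal sketch in Python (index.php is just the placeholder file from the example above):

import os

print(os.stat("index.php").st_ino)   # same number ls -i prints, before and after the rename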
Try using a hashing scheme such as MD5, SHA-1, or SHA-2; these will allow you to match the files up by content.
Basically, when you first create the index, you will hash all the files that you wish to add. The resulting hash string is a reliable way to tell whether two files are different or the same. Then, when you need to see if one of the files is already in the index, hash it and compare the generated hash to your table of known hashes.
EDIT: As was said in the comments, it is a good idea to incorporate both pieces of data (the inode number and the content hash) so that you can more accurately track changes.
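A minimal sketch of that approach in Python, assuming SHA-256 and an in-memory dict standing in for the database table of known hashes (the inode number from the edit could be stored alongside each hash; the paths are placeholders):

import hashlib
import os

def file_hash(path, chunk_size=1 << 20):
    # SHA-256 of the file content, read in chunks so large files fit in memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_index(root):
    # map content hash -> list of paths for every file under root
    index = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            index.setdefault(file_hash(path), []).append(path)
    return index

# later: has this file been seen before, even if it was moved or renamed?
index = build_index("/some/root")
already_known = file_hash("/some/root/a.txt") in index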
If you do not consider files with the same content to be the same file, and only want to track moved/renamed files as the same, then using the inode number will do. Otherwise you will have to hash the content.
The only fly in the ointment with inodes is that they can be reassigned after a delete (depending on the platform), so you need to record the file creation timestamp as well as the device ID to be 100% sure. It's easier on Windows with its user file attributes.
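A sketch of recording that trio with os.stat. Note this is an approximation: ext4 does not expose a true creation time through stat(), so st_ctime (the inode-change time) stands in for it here.

import os

def file_identity(path):
    # device ID + inode number + timestamp, as suggested above
    st = os.stat(path)
    return (st.st_dev, st.st_ino, st.st_ctime)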
I want to create a torrent file for the same set of files, but it may end up being created on two or more computers. I'm wondering: if I use the same tracker URLs and the same piece sizes, will these two torrent files match up exactly? More specifically, if I have a group of computers and some have the torrent file from host A and others have the torrent file from host B, will the tracker recognize that they are all using the same torrent file and thus enable them to share together, or will the torrent files be unique for each source they are generated on and thus segment the peers?
If you create the torrents from the same set of files, they should be identical. The torrent generated on both computers will be the same; just make sure you use the same software to generate the torrents.
Make sure you take a hash of both torrent files to confirm they are the same.
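A quick way to do that check, sketched in Python (the .torrent file names are placeholders):

import hashlib
from pathlib import Path

def sha256_of(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

print(sha256_of("host_a.torrent") == sha256_of("host_b.torrent"))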
I am making an application that will save information for certain files. I was wondering what the best way to keep track of files is. I was thinking of using the absolute path for a file, but that could change if the file is renamed. I found that if you run ls -i each file has an ID beside it that is unique(?). Is that OK to use as a unique file ID?
The inode is unique per device, but I would not recommend relying on it: imagine your box crashes and you move all the files to a new file system; now all your files have new IDs.
It really depends on your language of choice, but almost all of them include a library for generating UUIDs. While collisions are theoretically possible, they are a veritable non-issue. Generate the UUID, prepend it to the front of your file, and you are in business. As your implementation grows, it will also allow you to build a hash-table index of your files for quick lookups later.
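A minimal sketch of that idea in Python; it assumes text files whose format tolerates an extra first line, which is a strong assumption:

import uuid

def tag_file(path):
    # generate a UUID and prepend it as the first line of the file
    file_id = str(uuid.uuid4())
    with open(path, "r+", encoding="utf-8") as f:
        body = f.read()
        f.seek(0)
        f.write(file_id + "\n" + body)
    return file_id

def read_tag(path):
    # read the UUID back from the first line
    with open(path, encoding="utf-8") as f:
        return f.readline().strip()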
The question is, "unique across what?"
If you need something unique on a given machine at a given point in time, then yes, the inode number + device number is nearly always unique - these can be obtained from stat() or similar in C, or os.stat() in Python. However, if you delete a file and create another, the inode number may be reused. Also, two different hosts may have a completely different idea of what the (device, inode) pairs are.
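For example, in Python (the path is a placeholder):

import os

st = os.stat("some_file")
key = (st.st_dev, st.st_ino)   # unique on this host right now, but not across hosts or after delete/recreate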
If you need something to describe the file's content (so two files with the same content have the same id), you might look into one of the SHA or RIPEMD functions. This will be pretty unique - the odds of an accidental collision are astronomically low.
If you need some other form of uniqueness, please elaborate.
I have two copies of an application's source code. One copy is encoded, while the other is not. There are config files scattered throughout the application's directory structure, and this is what I would like to compare.
Is there a way to use diff whereby I can ignore the wildly different files (i.e. an encrypted file and an unencrypted file), and only report the differences on the similar-yet-different files (the configs)?
You could write a script that uses find to locate the files based on name or other criteria, and file to determine whether they have the same type of contents (i.e. one compressed, one not).
For me to be more specific I would need you to give more details about whether these are parallel directory structures (files and directories appear in the same places in the two trees) and whether the files you are looking for have names that distinguish them from files you want to ignore. Any additional information you can provide might help even more.
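Assuming the two trees are parallel and the configs can be picked out by a name pattern, a rough sketch in Python (the *.conf pattern and the directory names are assumptions; adjust them to whatever actually distinguishes your config files):

import difflib
import fnmatch
import os

def diff_configs(tree_a, tree_b, pattern="*.conf"):
    # diff files matching `pattern` that exist at the same relative path in both trees
    for dirpath, _, filenames in os.walk(tree_a):
        for name in fnmatch.filter(filenames, pattern):
            rel = os.path.relpath(os.path.join(dirpath, name), tree_a)
            path_a, path_b = os.path.join(tree_a, rel), os.path.join(tree_b, rel)
            if not os.path.isfile(path_b):
                continue
            with open(path_a, errors="replace") as fa, open(path_b, errors="replace") as fb:
                diff = list(difflib.unified_diff(fa.readlines(), fb.readlines(),
                                                 fromfile=path_a, tofile=path_b))
            if diff:
                print("".join(diff))

diff_configs("plain_copy", "encoded_copy")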
I've got a system in which users select a couple of options and receive an image based on those options. I'm trying to combine multiple generated images (corresponding to those options) into the requested picture. I'm trying to optimize this so that if an image exists for a certain option (i.e. the file exists), then there's no need to compute it and we move on to the next step.
Should I store these images in different folders, where each folder is an option name? Should I store them in the same folder, adding a prefix corresponding to the option to each image? Should I store the filenames in a database and check there? Which way is the fastest for checking whether a file exists?
I'm using PHP on Linux, but I'm also interested if the answer varies if I change the programming language or the OS.
If you're going to be producing a lot of these images, it doesn't seem very scalable to keep them all in one flat directory. I would go with a hierarchy, which will make it a lot easier to manage.
It's always going to be quicker to check in a database than to check if a file exists though, so if speed is the primary concern, use a hierarchical folder structure and keep all the filenames in a database.
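For the filesystem side of that, a sketch of deriving the path from the option and checking for it, written in Python for consistency with the rest of this page (in PHP the equivalent check is file_exists()). The cache location and layout are assumptions:

import os

CACHE_DIR = "/var/cache/generated-images"   # hypothetical location

def image_path(option):
    # one sub-folder per option, one file per generated image (hypothetical layout)
    return os.path.join(CACHE_DIR, option, "image.png")

def get_or_generate(option, generate):
    path = image_path(option)
    if os.path.exists(path):              # the actual existence check
        return path
    os.makedirs(os.path.dirname(path), exist_ok=True)
    generate(path)                        # caller-supplied function that renders and saves the image
    return path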