Compare 2 identical excel/ppt/csv files in nodejs - node.js

I need to compare two Excel/PPT/CSV files that may have exactly the same content but were created at different points in time.
I want to compare only the file contents, in whatever manner possible, using any Node.js package.
I couldn't find an easy way to do it: neither stream comparison nor buffer comparison helped.
I've done more research without much success, and I'm wondering how to ignore things such as timestamps and other metadata during the comparison so that only the contents are matched.
I've tried stream-compare, stream-equal, file-compare, buff1.equals(buff2) and a few others, but none of them worked for my requirement.
I didn't find any Node package on the web that does what I am looking for.
Any insights or suggestions on how this can be achieved?
Thanks in advance; any help would be appreciated.

Use a module that computes a hash of the document, for example crypto: calculate a hash (sha256) of each of the two documents and compare them. If the hashes match, the document content is the same (there is still a chance of a hash collision, depending on the hash algorithm you are using, but sha256 gives you decent confidence that the documents are identical). Check this thread for more details: Obtaining the hash of a file using the stream capabilities of crypto module (ie: without hash.update and hash.digest)
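A minimal sketch of the hashing approach with Node's built-in crypto and fs modules. The function names sha256OfFile and sameFileBytes are made up for illustration. Note that this hashes the raw bytes, so two .xlsx or .pptx files whose embedded metadata (for example creation timestamps) differs will still produce different digests; for those the content would have to be extracted first and hashed on its own.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Stream a file through SHA-256 and resolve with the hex digest.
function sha256OfFile(filePath) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    fs.createReadStream(filePath)
      .on('error', reject)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')));
  });
}

// Resolve to true when both files produce the same digest (same bytes).
async function sameFileBytes(pathA, pathB) {
  const [a, b] = await Promise.all([sha256OfFile(pathA), sha256OfFile(pathB)]);
  return a === b;
}
```

Usage would look like `sameFileBytes('a.csv', 'b.csv').then(console.log)`, where the two file names are placeholders.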

Related

Turning unique filepath into unique integer

I often use filepaths to provide some sort of unique id for a software system. Is there any way to take a filepath and turn it into a unique integer in a relatively quick (computationally) way?
I am ok with larger integers. This would have to be a pretty nifty algorithm as far as I can tell, but would be very useful in some cases.
Anybody know if such a thing exists?
You could try the inode number:
fs.statSync(filename).ino
@djones's suggestion of the inode number is good if the program is only running on one machine and you don't care about a new file duplicating the id of an old, deleted one. Inode numbers are re-used.
Another simple approach is hashing the path to a big integer space. E.g. using a 128-bit murmurhash (in Java I'd use the Guava Hashing class; there are several JS ports), the chance of any collision among a billion paths is still only about 1/2^69 by the birthday bound (roughly n^2/2 pairs, each colliding with probability 1/2^128). If you're really paranoid, keep a set of the hash values you've already used and rehash on collision.
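A rough sketch of the path-hashing idea in Node. Node has no built-in murmurhash, so this swaps in SHA-256 from the crypto module and keeps the first 128 bits; that substitution, the name pathToBigInt, and the choice to resolve the path to an absolute one first are assumptions for illustration, not part of the suggestion above.

```javascript
const crypto = require('crypto');
const path = require('path');

// Map a file path to a non-negative 128-bit BigInt.
function pathToBigInt(filePath) {
  // Resolve to an absolute path so './a.txt' and 'a.txt' get the same id.
  const normalized = path.resolve(filePath);
  const hex = crypto.createHash('sha256').update(normalized, 'utf8').digest('hex');
  // The first 32 hex characters are the first 16 bytes, i.e. 128 bits.
  return BigInt('0x' + hex.slice(0, 32));
}
```

The result is a BigInt rather than a Number because 128 bits do not fit in a JavaScript Number without losing precision.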
This is just my comment turned to an answer.
If you run it in memory, you can use one of the standard hashmaps in your language. Not just for file names, but for any similar situation. Normally, hashmaps in different programming languages resolve collisions with buckets, so the hash number and the corresponding bucket number will provide a unique id.
Btw, it is not hard to write your own hashmap, such that you have control on the underlying structure (e.g. to retrieve the number etc).

How do AV engines search files for known signatures so efficiently?

Data in the form of search strings continue to grow as new virus variants are released, which prompts my question - how do AV engines search files for known signatures so efficiently? If I download a new file, my AV scanner rapidly identifies the file as being a threat or not, based on its signatures, but how can it do this so quickly? I'm sure by this point there are hundreds of thousands of signatures.
UPDATE: As tripleee pointed out, the Aho-Corasick algorithm seems very relevant to virus scanners. Here is some stuff to read:
http://www.dais.unive.it/~calpar/AA07-08/aho-corasick.pdf
http://www.researchgate.net/publication/4276168_Generalized_Aho-Corasick_Algorithm_for_Signature_Based_Anti-Virus_Applications/file/d912f50bd440de76b0.pdf
http://jason.spashett.com/av/index.htm
Aho-Corasick-like algorithm for use in anti-malware code
Below is my old answer. It's still relevant for easily detecting malware like worms which simply make copies of themselves:
I'll just write some of my thoughts on how AVs might work. I don't know for sure. If someone thinks the information is incorrect, please notify me.
There are many ways in which AVs detect possible threats. One way is signature-based detection.
A signature is just a unique fingerprint of a file (which is just a sequence of bytes). In terms of computer science, it can be called a hash. A single hash could take about 4/8/16 bytes. Assuming a size of 4 bytes (for example, CRC32), about 67 million signatures could be stored in 256MB.
All these hashes can be stored in a signature database. This database could be implemented with a balanced tree structure, so that insertion, deletion and search operations can be done in O(log n) time, which is pretty fast even for large values of n (n is the number of entries). Or else, if a lot of memory is available, a hashtable can be used, which gives O(1) insertion, deletion and search. This can be faster as n grows bigger, provided a good hashing technique is used.
So what an antivirus does roughly is that it calculates the hash of the file or just its critical sections (where malicious injections are possible), and searches its signature database for it. As explained above, the search is very fast, which enables scanning huge amounts of files in a short amount of time. If it is found, the file is categorized as malicious.
Similarly, the database can be updated quickly since insertion and deletion are fast too.
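The hashtable variant described above could be sketched like this in Node. It is only an illustration of whole-file hash lookup, not how any real AV engine is implemented; the class name SignatureDb and the use of SHA-256 instead of a 4-byte hash like CRC32 are assumptions.

```javascript
const crypto = require('crypto');

// Fingerprint a buffer of file bytes as a hex SHA-256 digest.
function fingerprint(bytes) {
  return crypto.createHash('sha256').update(bytes).digest('hex');
}

// Signature database backed by a Set, which is a hash table underneath,
// so add, remove and lookup are O(1) on average.
class SignatureDb {
  constructor() {
    this.hashes = new Set();
  }
  add(bytes) {
    this.hashes.add(fingerprint(bytes));
  }
  remove(bytes) {
    this.hashes.delete(fingerprint(bytes));
  }
  isKnown(bytes) {
    return this.hashes.has(fingerprint(bytes));
  }
}
```

A scan is then one hash computation plus one Set lookup per file, regardless of how many signatures are stored.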
You could read these pages to get some more insight.
Which is faster, Hash lookup or Binary search?
https://security.stackexchange.com/questions/379/what-are-rainbow-tables-and-how-are-they-used
Many signatures are anchored to a specific offset, or a specific section in the binary structure of the file. You can skip the parts of a binary which contain data sections with display strings, initialization data for internal structures, etc.
Many present-day worms are stand-alone files for which a whole-file signature (SHA1 hash or similar) is adequate.
The general question of how to scan for a large number of patterns in a file is best answered with a pointer to the Aho-Corasick algorithm.
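For reference, a compact sketch of Aho-Corasick in JavaScript. It works on strings to keep it readable; real signatures would be byte sequences, and the function names buildAutomaton and findAll are made up for this example. Patterns are assumed to be non-empty.

```javascript
// Build an Aho-Corasick automaton for a list of non-empty patterns.
function buildAutomaton(patterns) {
  // Each node: child edges, failure link, indexes of patterns ending here.
  const nodes = [{ next: new Map(), fail: 0, out: [] }];
  patterns.forEach((pattern, index) => {
    let state = 0;
    for (const ch of pattern) {
      if (!nodes[state].next.has(ch)) {
        nodes.push({ next: new Map(), fail: 0, out: [] });
        nodes[state].next.set(ch, nodes.length - 1);
      }
      state = nodes[state].next.get(ch);
    }
    nodes[state].out.push(index);
  });

  // Breadth-first pass: the failure link of a node points to the longest
  // proper suffix of its path that is also a path in the trie.
  const queue = [...nodes[0].next.values()];
  while (queue.length > 0) {
    const state = queue.shift();
    for (const [ch, child] of nodes[state].next) {
      let f = nodes[state].fail;
      while (f !== 0 && !nodes[f].next.has(ch)) f = nodes[f].fail;
      nodes[child].fail = nodes[f].next.get(ch) ?? 0;
      // Inherit matches that end at the suffix node.
      nodes[child].out.push(...nodes[nodes[child].fail].out);
      queue.push(child);
    }
  }
  return nodes;
}

// Scan the text once and report every pattern occurrence.
function findAll(nodes, patterns, text) {
  const hits = [];
  let state = 0;
  let pos = 0;
  for (const ch of text) {
    while (state !== 0 && !nodes[state].next.has(ch)) state = nodes[state].fail;
    state = nodes[state].next.get(ch) ?? 0;
    pos += ch.length;
    for (const index of nodes[state].out) {
      hits.push({ pattern: patterns[index], end: pos }); // end is exclusive
    }
  }
  return hits;
}
```

The text is read once from left to right, so the scan time depends on the text length and the number of matches rather than on how many patterns are loaded.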
I don't know how a practical AV works, but I think the question is related to finding words from a given dictionary in a long text.
For that problem, data structures like a trie make it very fast: when the trie is extended with failure links as in Aho-Corasick, scanning a text of length N against a dictionary of K words takes about O(N) time plus the number of matches, independent of K.

Unique File Id?

I am making an application that will save information for certain files. I was wondering what the best way to keep track of files is. I was thinking of using the absolute path for a file, but that could change if the file is renamed. I found that if you run ls -i each file has an id beside it that is unique(?). Is that ok to use for a unique file id?
The inode is unique per device, but I would not recommend using it: imagine your box crashes and you move all the files to a new file system. Now all your files have new ids.
It really depends on your language of choice, but almost all of them include a library for generating UUIDs. While collisions are theoretically possible, it's a veritable non-issue. Generate the UUID, prepend it to the front of your file, and you are in business. As your implementation grows it will also allow you to create a HashTable index of your files for quick lookups later.
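As one possible illustration in Node, crypto.randomUUID() generates a random UUID, and a Map can serve as the hashtable index. The names filesById, registerFile and renameFile are hypothetical, and this sketch only tracks paths in memory; it does not touch the files themselves.

```javascript
const crypto = require('crypto');

// In-memory index from generated id to the file's current path.
const filesById = new Map();

// Assign a fresh random UUID to a file path and remember it.
function registerFile(filePath) {
  const id = crypto.randomUUID();
  filesById.set(id, filePath);
  return id;
}

// Record a rename: the id stays the same, only the stored path changes.
function renameFile(id, newPath) {
  if (!filesById.has(id)) throw new Error(`unknown file id: ${id}`);
  filesById.set(id, newPath);
}
```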
The question is, "unique across what?"
If you need something unique on a given machine at a given point in time, then yes, the inode number + device number is nearly always unique - these can be obtained from stat() or similar in C, or os.stat() in Python. However, if you delete a file and create another, the inode number may be reused. Also, two different hosts may have a completely different idea of what the (device, inode number) pairs are.
If you need something to describe the file's content (so two files with the same content have the same id), you might look into one of the SHA or RIPEMD functions. This will be pretty unique - the odds of an accidental collision are astronomically low.
If you need some other form of uniqueness, please elaborate.
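For the device number + inode number case, a small sketch in Node (the function name fileIdentity is made up). fs.statSync accepts a bigint option so that large inode numbers are not rounded, and the same caveats about reuse and different hosts apply.

```javascript
const fs = require('fs');

// Return "device:inode" for a path, valid only on this machine and only
// while the file exists.
function fileIdentity(filePath) {
  const stats = fs.statSync(filePath, { bigint: true });
  return `${stats.dev}:${stats.ino}`;
}
```

On a typical Unix file system the value stays the same when the file is renamed within that file system, which is what makes it more stable than the path.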

Need a secure way to publicly display hash values

I am building a Windows application to store backups of sensitive files. The purpose of my application is to store a copy of a file with its hash. The program or user will then display the hash publicly in case the user needs to prove they had the backup of the sensitive file at a certain time.
Motivation:
Some situations where this might be useful are:
Someone has a job at a company where they think they might be accused of doing something illegal. If they were accused of changing some data over time, it would be convenient to have copies of sensitive files related to their case over a period of time.
A politician might take notes about things they did each day, many of them about classified or sensitive subjects, and then want to be able to disclose their files at a later date if they are accused of something (for instance, if the CIA said they were briefed on torture…). Not absolute proof, but it would be hard to create fake backup files for every potential scenario, especially several years into the future.
Just to be clear, this application is mostly just an excuse for me to practice my coding skills. I don’t recommend using any type of cryptographic software that hasn’t been scrutinized by several professionals.
Possible Solutions:
For my application, I need to find a good place to publicly store the hash values. Here are my ideas so far:
Send the hash values to a group of people through email. (disadvantage: could annoy people, but would create a traceable record)
Publish the hash values on a public blog (disadvantage: if I ever got in serious legal trouble someone with resources could try to attack the free service I used and erase my data)
Publish the hash values using some online security service that stores documents but does not allow you to delete them. (I am not sure something like this exists.)
What is the most secure and convenient way to publicly display my hash values?
Hash your set of hashes so that you have only one hash to record. Then publish this hash in the classifieds of a widely archived newspaper.
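One way the "hash of hashes" step might look in Node, assuming the individual file hashes are already available as hex strings. The name combineHashes is hypothetical, and sorting the list first is an added assumption so that the result does not depend on the order in which files were hashed.

```javascript
const crypto = require('crypto');

// Reduce a list of hex digests to a single SHA-256 digest to publish.
function combineHashes(hexHashes) {
  const sorted = [...hexHashes].sort(); // copy, so the input is untouched
  const outer = crypto.createHash('sha256');
  for (const h of sorted) {
    outer.update(h + '\n'); // newline separates entries unambiguously
  }
  return outer.digest('hex');
}
```

To prove a backup later, the full list of individual hashes has to be kept as well, since the published value can only be recomputed from all of them.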
Truly secure? Print out the hashes on a piece of paper along with a legal text to the effect of, "On this day XX/XX/XXXX I affirm these hashes to be accurately identifying these files with these dates." (not a lawyer, get one to verify this), then have it notarized. Then, save that piece of paper in a secure location.

How do sites like tinyurl generate urls?

I looked at tinyurl, tinypic, imgur and youtube! I thought they would use a text-safe representation of an index and use it as a primary ID in their DB. However, trying to put the keys into Convert.FromBase64String("key") yields no results and throws an exception. So these sites don't use a base64 array. What are they using? What might I want to use if I were to build a YouTube-like site or a tinyurl?
I'm guessing they have developed their own encoding which is simply an alphanumeric equivalent of the ids in their database. I'm sure they don't generate random strings, simply because this will cause catastrophic overflows at a certain point.
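An "alphanumeric equivalent" of a numeric id could be as simple as writing the number in base 62. This is only a guess at the general idea, not what any of these sites actually do; the alphabet order and the names encodeId and decodeId are arbitrary choices for the sketch.

```javascript
const ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// Turn a non-negative integer id into a short base-62 key.
function encodeId(id) {
  if (!Number.isSafeInteger(id) || id < 0) {
    throw new RangeError('id must be a non-negative safe integer');
  }
  if (id === 0) return ALPHABET[0];
  let key = '';
  while (id > 0) {
    key = ALPHABET[id % 62] + key;
    id = Math.floor(id / 62);
  }
  return key;
}

// Turn a base-62 key back into the integer id.
function decodeId(key) {
  let id = 0;
  for (const ch of key) {
    const digit = ALPHABET.indexOf(ch);
    if (digit === -1) throw new RangeError(`invalid character: ${ch}`);
    id = id * 62 + digit;
  }
  return id;
}
```

Keys of this kind are not valid base64 in general, which would be consistent with Convert.FromBase64String rejecting them.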
I don't know about TinyURL, Tinypic, etc. but shorl.com uses something called koremutake.
If I were to develop such a system, I guess some sort of short hash or plain random strings could be possible choices.
My guess is that they simply generate a random string and use that as the primary key. I don't really see a reason to do anything else.
