I am working on a small project to keep my skills from completely rusting.
I am generating a lot of hashes (in this case MD5) and I need to check whether I've seen a given hash before, so I want to keep them in some collection.
What's the best way to store them so that I can check whether a hash already exists prior to doing the calculations?
The hash itself is already a key of sorts. Your best bet is a hash table. In a properly implemented hash table, you can check for the existence of a key in constant time. Common hash table implementations with this feature are C# Dictionaries, Python's dict type, PHP arrays (which are actually maps, not arrays), Perl's hashes (%) and Ruby's Hash. If you included details of what language you're working in, an example wouldn't be too hard to look up.
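For example, in Python (the helper name `is_new` is mine; the same pattern works with C#'s HashSet or Ruby's Hash), a set of hex digests gives constant-time membership checks:

```python
import hashlib

seen = set()  # set membership checks are O(1) on average

def is_new(data: bytes) -> bool:
    """Return True the first time a given payload's MD5 digest is seen."""
    digest = hashlib.md5(data).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True
```

Call `is_new(payload)` before doing the expensive calculation; it records the digest as a side effect, so the second check for the same payload is free.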
Related
I oftentimes use file paths to provide some sort of unique id for some software system. Is there any way to take a file path and turn it into a unique integer in a relatively quick (computationally) way?
I am ok with larger integers. This would have to be a pretty nifty algorithm as far as I can tell, but would be very useful in some cases.
Anybody know if such a thing exists?
You could try the inode number (shown here via Node.js's fs module):
fs.statSync(filename).ino
@djones's suggestion of the inode number is good if the program is only running on one machine and you don't care about a new file duplicating the id of an old, deleted one. Inode numbers are re-used.
Another simple approach is hashing the path to a big integer space. E.g. using a 128-bit MurmurHash (in Java I'd use the Guava Hashing class; there are several JS ports), the chance of a collision among a billion paths is still 1/2^96. If you're really paranoid, keep a set of the hash values you've already used and rehash on collision.
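A minimal sketch of this idea in Python; since MurmurHash isn't in the standard library, stdlib MD5 stands in as the 128-bit hash (the function name `path_id` is mine):

```python
import hashlib

def path_id(path: str) -> int:
    """Map a file path to a 128-bit integer id."""
    # MD5 here is a stand-in for a 128-bit murmurhash; any well-mixed
    # 128-bit hash gives the same collision odds
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")
```

The result is deterministic, so the same path always yields the same integer, and it fits in 128 bits.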
This is just my comment turned to an answer.
If you run it in memory, you can use one of the standard hashmaps in your language of choice. Not just for file names, but for any similar situation. Normally, hashmaps in different programming languages resolve collisions with buckets, so the hash number together with the corresponding bucket number will provide a unique id.
By the way, it is not hard to write your own hashmap, so that you have control over the underlying structure (e.g. to retrieve the number, etc.).
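A toy sketch of such a hand-rolled hashmap in Python, using separate chaining by buckets (class and method names are illustrative, not from any library):

```python
class BucketMap:
    """Minimal hashmap with separate chaining: each bucket is a list of pairs."""

    def __init__(self, n_buckets: int = 64):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # bucket index = hash value modulo bucket count
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:          # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default
```

Because you own the structure, you can expose internals (bucket index, stored hash, and so on) that a built-in dict hides.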
I want to know if there is any way by which I can access entries stored in MapDB BTreeMap in reverse order. I know I can use descendingMap() but it is very slow and it involves a lot of CPU operations. Is there any other faster way? The Key Value pairs are non primitive java types.
Got the following reply from Jan Kotek, creator of MapDB:
There is a bug open for
BTreeMap.descendingMap() iteration
performance. It will be fixed in MapDB 2.
Is it good practice to use CRC32 for hashing Strings? If not, what are better alternatives?
I have objects whose uniqueness is defined by 2 strings. I would like to add these objects to Mongo database, add calculated hash as an object's field, create index on that field and then search DB for object when I have the 2 strings (and can calculate the hash).
Thanks.
It would work, but a CRC is not the best choice for hashing. Many hash functions have been developed to be both fast and to minimize several different kinds of collision threats.
A very good example is the CityHash family of algorithms.
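CityHash isn't in the Python standard library, so this sketch uses stdlib SHA-1 to show the key-derivation idea; the important detail is length-prefixing the two strings so that different pairs can never produce the same hash input (`object_key` is a hypothetical name):

```python
import hashlib

def object_key(s1: str, s2: str) -> str:
    """Derive one indexable hash field from the two strings defining uniqueness."""
    h = hashlib.sha1()
    for part in (s1, s2):
        data = part.encode("utf-8")
        # length prefix prevents ("ab", "c") and ("a", "bc")
        # from hashing the same concatenated bytes
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()
```

Store the result as the object's field and put the Mongo index on it; given the two strings, you recompute the key and query by equality.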
So I have the code for a hashing function, and from the looks of it, there's no way to simply unhash it (lots of bitwise ANDs, ORs, Shifts, etc). My question is, if I need to find out the original value before being hashed, is there a more efficient way than just brute forcing a set of possible values?
Thanks!
EDIT: I should add that in my case, the original message will never be longer than several characters, for my purposes.
EDIT2: Out of curiosity, are there any ways to do this on the run, without precomputed tables?
Yes; rainbow table attacks. This is especially true for hashes of shorter strings, i.e. hashes of small strings like 'true', 'false', 'etc' can be stored in a dictionary and used as a comparison table. This speeds up the cracking process considerably. Also, if the hash size is short (i.e. MD5), the algorithm becomes especially easy to crack. Of course, the way around this issue is combining 'cryptographic salts' with passwords before hashing them.
There are two very good sources of info on the matter: Coding Horror: Rainbow Hash Cracking and
Wikipedia: Rainbow table
Edit: Rainbow tables can take tens of gigabytes, so downloading (or reproducing) them may take weeks just to make simple tests. Instead, there seem to be some online tools for reversing simple hashes: http://www.onlinehashcrack.com/ (e.g. try to reverse 463C8A7593A8A79078CB5C119424E62A, which is the MD5 hash of the word 'crack')
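To the EDIT2 question: for messages of only a few characters, a plain on-the-fly brute force (no precomputed tables) is entirely feasible. A minimal Python sketch, assuming a lowercase MD5 hex digest and a known alphabet (the function name is mine):

```python
import hashlib
import string
from itertools import product

def brute_force_md5(target_hex: str, max_len: int = 4,
                    alphabet: str = string.ascii_lowercase):
    """Try every candidate string up to max_len chars; return the preimage or None."""
    for n in range(1, max_len + 1):
        for chars in product(alphabet, repeat=n):
            candidate = "".join(chars)
            if hashlib.md5(candidate.encode()).hexdigest() == target_hex:
                return candidate
    return None
```

The search space grows as |alphabet|^n, which is exactly why this only works when the original message is known to be very short.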
"Unhashing" is called a "preimage attack": given a hash output, find a corresponding input.
If the hash function is "secure" then there is no better attack than trying possible inputs until a hit is found; for a hash function with an n-bit output, the average number of hash function invocations will be about 2^n, i.e. Way Too Much for current earth-based technology if n is greater than 180 or so. To state it otherwise: if an attack method faster than this brute force method is found, for a given hash function, then the hash function is deemed irreparably broken.
MD5 is considered broken, but for other weaknesses (there is a published method for preimages with cost 2^123.4, which is thus about 2^4.6 ≈ 24 times faster than the brute force cost of 2^128 -- but it is still so far into the technologically unfeasible that it cannot be confirmed).
When the hash function input is known to be part of a relatively small space (e.g. it is a "password", so it could fit in the brain of a human user), then one can optimize preimage attacks by using precomputed tables: the attacker still has to pay the search cost once, but he can reuse his tables to attack multiple instances. Rainbow tables are precomputed tables with a space-efficient compressed representation: with rainbow tables, the bottleneck for the attacker is CPU power, not the size of his hard disks.
Assuming the "normal case", the original message will be many times longer than the hash. Therefore, it is in principle absolutely impossible to derive the message from the hash, simply because you cannot calculate information that is not there.
However, you can guess what's probably the right message, and there exist techniques to accelerate this process for common messages (such as passwords), for example rainbow tables. If the hash matches, it is very likely that something that looks sensible is the right message.
Finally, it may not be necessary at all to find the original message, as long as some message can be found which will pass. This is the subject of a known collision attack on MD5, which lets you construct two different messages that give the same hash.
Whether this is a security problem or not depends on what exactly you use the hash for.
This may sound trivial, but if you have the code to the hashing function, you could always override a hash table container class's hash() function (or similar, depending on your programming language and environment). That way you can hash all strings of, say, 3 characters or less, and store each hash as a key by which you retrieve the original string, which appears to be exactly what you want. Use this method to construct your own rainbow table, I suppose. If you have the code to the program environment in which you want to find these values out, you could always modify it to store hashes in the hash table.
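The hash-to-original lookup table described above can be sketched like this in Python, using MD5 as a stand-in for whatever the actual hashing function is (`build_lookup` is a made-up name):

```python
import hashlib
import string
from itertools import product

def build_lookup(max_len: int = 3, alphabet: str = string.ascii_lowercase):
    """Precompute digest -> original string for every string up to max_len chars."""
    table = {}
    for n in range(1, max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            table[hashlib.md5(s.encode()).hexdigest()] = s
    return table
```

After the one-time build cost, reversing any hash of a short string is a single dictionary lookup.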
I have a 'large' set of line delimited full sentences that I'm processing with Hadoop. I've developed a mapper that applies some of my favorite NLP techniques to it. There are several different techniques that I'm mapping over the original set of sentences, and my goal during the reducing phase is to collect these results into groups such that all members in a group share the same original sentence.
I feel that using the entire sentence as a key is a bad idea. I felt that generating some hash value of the sentence may not work because of a limited number of keys (an unjustified belief).
Can anyone recommend the best idea/practice for generating unique keys for each sentence? Ideally, I would like to preserve order. However, this isn't a main requirement.
Goodbye,
Standard hashing should work fine. Most hash algorithms have a value space far greater than the number of sentences you're likely to be working with, and thus the likelihood of a collision will still be extremely low.
Despite the answer that I've already given you about what a proper hash function might be, I would really suggest you just use the sentences themselves as the keys unless you have a specific reason why this is problematic.
You might want to avoid simple hash functions (for example, any half-baked idea you could think up quickly), because they might not mix up the sentence data enough to avoid collisions in the first place. One of the standard cryptographic hash functions would be quite suitable, for example MD5, SHA-1, or SHA-256.
You can use MD5 for this, even though collisions have been found and the algorithm is considered unsafe for security-sensitive purposes. This isn't a security-critical application, and the collisions that have been found arose through carefully constructed data and probably won't arise randomly in your own NLP sentence data. (See, for example, Johannes Schindelin's explanation of why it's probably unnecessary to change git to use SHA-256 hashes, so that you can appreciate the reasoning behind this.)
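As a sketch of this standard-hashing suggestion in Python (the function name is mine), the MD5 digest makes a compact, stable reducer key:

```python
import hashlib

def sentence_key(sentence: str) -> str:
    """Stable 32-hex-char key: identical sentences always map to the same key.

    Leading/trailing whitespace is stripped so trivially different copies
    of the same sentence still group together.
    """
    return hashlib.md5(sentence.strip().encode("utf-8")).hexdigest()
```

Emitting `(sentence_key(sentence), result)` from each mapper groups all the per-technique results for one sentence at the same reducer, at the cost of losing the original ordering.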