Is it possible to insert values into and get them from a HashMap directly with a hash provided, so that I can cache the hashes?
I want to do something like this:
map.insert(key, "value");
let hashed_key = {
    let mut hasher = map.hasher().build_hasher();
    key.hash(&mut hasher);
    hasher.finish()
};
// `get_by_hash` does not exist; it is the API I am after:
assert_eq!(map.get(&key).unwrap(), map.get_by_hash(hashed_key).unwrap());
No.
This is fundamentally impossible at the algorithmic level.
By design, a hash operation is lossy, not injective: multiple elements may hash to the same value. Therefore, any HashMap implementation can only use the hash as a hint and must then use a full equality comparison to check that the element found via the hint is the right element (or not).
At best, a get_by_hash method would return an Iterator over all possible elements matching the given hash.
For a degenerate case, consider a hashing algorithm which always returns 4 (obtained by the roll of a fair die). Which element would you expect it to return?
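To see why the equality check is indispensable, here is a minimal Rust sketch of that degenerate case (FourHasher is an illustrative name, not a real library type): every key hashes to 4, yet lookups still succeed, because the map compares keys for equality after following the hash hint.

use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A degenerate hasher: every key hashes to 4.
#[derive(Default)]
struct FourHasher;

impl Hasher for FourHasher {
    fn finish(&self) -> u64 {
        4
    }
    fn write(&mut self, _bytes: &[u8]) {}
}

fn main() {
    let mut map: HashMap<&str, i32, BuildHasherDefault<FourHasher>> =
        HashMap::default();
    map.insert("a", 1);
    map.insert("b", 2);
    // Every key shares the hash 4, yet lookups still work: after
    // following the hash hint, the map compares keys for equality.
    // (Every entry lands in one bucket, so lookups degrade to a
    // linear scan - correct, just slow.)
    assert_eq!(map.get("a"), Some(&1));
    assert_eq!(map.get("b"), Some(&2));
}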
Work-around
If caching is what you are after, the trick in languages with no HashBuilder is to pre-hash (and cache) the hash inside the key itself.
It requires caching the full key (because of the equality check), but hashing is then a very simple operation (return the cached value).
It does not, however, speed up the equality check, which depending on the value may be quite expensive.
You could adapt the pattern to Rust, although you would lose the advantage of using a HashBuilder.
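For illustration, a minimal sketch of the pre-hashing pattern in Rust (PreHashed is a hypothetical wrapper name; note how it hard-codes DefaultHasher, which is exactly the HashBuilder flexibility you would lose):

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hypothetical wrapper that caches its key's hash on construction.
#[derive(PartialEq, Eq)]
struct PreHashed<K> {
    key: K,
    cached_hash: u64,
}

impl<K: Hash> PreHashed<K> {
    fn new(key: K) -> Self {
        // The hasher is hard-coded here: this is where the
        // HashBuilder flexibility is lost.
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let cached_hash = hasher.finish();
        PreHashed { key, cached_hash }
    }
}

impl<K> Hash for PreHashed<K> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hashing is now just replaying the cached value.
        state.write_u64(self.cached_hash);
    }
}

fn main() {
    let mut map = HashMap::new();
    map.insert(PreHashed::new("key".to_string()), "value");
    // The full key is still stored and compared on lookup.
    assert_eq!(map.get(&PreHashed::new("key".to_string())), Some(&"value"));
}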
I'm trying to implement a Lexer. Since lexers emit tokens, I suppose that we can perceive a Lexer as a special iterator that maps certain chunks of chars to Tokens. I therefore expect Lexer to store an Iterator<Item=char> and manipulate that iterator instead of a &str to enable maximum flexibility.
struct Lexer<T: Iterator<Item = char>> {
    source: T,
}
Yet I find it hard to manipulate the iterator, since almost all iterator adaptors take ownership, and with generics I cannot change the type of T at runtime, unless I switch to Box.
// `take_while` takes the iterator by value, so this moves out of the field:
self.source.take_while(|x| x.is_whitespace())
A possible workaround is to require that the iterator implement Clone, clone it every time I want to transform it, remember how many characters were consumed, and call next that many times on the original. I find that too clumsy.
I wonder whether there is an idiomatic way to write iterators that map an uncertain number of input items (in this case, chars) into another object (in this case, Tokens)?
The most elegant way I have come up with so far is to use while let and the like, which does not read as fluently. I inspected the implementation of GroupBy in itertools and found that it uses the while let approach too.
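For reference, the while let approach mentioned above looks roughly like this (a sketch; Peekable keeps ownership of the underlying iterator, and skip_whitespace is an illustrative method name):

use std::iter::Peekable;

struct Lexer<T: Iterator<Item = char>> {
    source: Peekable<T>,
}

impl<T: Iterator<Item = char>> Lexer<T> {
    // Consume a run of whitespace without giving up ownership
    // of the underlying iterator.
    fn skip_whitespace(&mut self) {
        // `peek` looks ahead without consuming; `next` advances
        // only while the predicate holds.
        while let Some(c) = self.source.peek() {
            if c.is_whitespace() {
                self.source.next();
            } else {
                break;
            }
        }
    }
}

fn main() {
    let mut lexer = Lexer { source: "   foo".chars().peekable() };
    lexer.skip_whitespace();
    assert_eq!(lexer.source.collect::<String>(), "foo");
}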
When dealing with hash maps, I have seen a few strategies to deal with hash collisions, but we have come up with something different.
I was wondering if this is something new or not.
This version of a hash map only works if the hash function and the data structures being hashed are saltable.
(This is the case for hashable in Haskell, where we suggested implementing this approach.)
The idea is that, instead of storing a list or array in each cell of the hash map, you store a recursive hash map. The only difference in this recursive hash map is that you use a different salt.
This way the hash collisions on one level of the hash map are most likely not hash collisions on the next level.
As a result, insertion into such a hash map is no longer O(number of collisions on this hash) but O(number of levels at which the collisions keep recurring), which is most likely better.
A more detailed explanation and an implementation can be found here:
https://github.com/tibbe/unordered-containers/pull/217/files/58af4519ace34c5f7d3c1359907ff75e27b9cdb8#diff-ba23e0f18c79cb873ac5375367524cfaR114
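For illustration only, here is a rough Rust sketch of the same idea (the names, bucket count, and use of DefaultHasher are mine, not from the linked Haskell implementation; keys are assumed distinct per insert): a colliding bucket is replaced by a child map using the next salt.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const NUM_BUCKETS: usize = 16;

// Each level hashes with its own salt; a colliding bucket
// recurses one level down with the next salt.
enum Bucket<K, V> {
    Empty,
    One(K, V),
    Deeper(Box<SaltedMap<K, V>>),
}

struct SaltedMap<K, V> {
    salt: u64,
    buckets: Vec<Bucket<K, V>>,
}

fn salted_hash<K: Hash>(key: &K, salt: u64) -> usize {
    let mut h = DefaultHasher::new();
    salt.hash(&mut h);
    key.hash(&mut h);
    (h.finish() as usize) % NUM_BUCKETS
}

impl<K: Hash + Eq, V> SaltedMap<K, V> {
    fn new(salt: u64) -> Self {
        let buckets = (0..NUM_BUCKETS).map(|_| Bucket::Empty).collect();
        SaltedMap { salt, buckets }
    }

    fn insert(&mut self, key: K, value: V) {
        let i = salted_hash(&key, self.salt);
        let slot = std::mem::replace(&mut self.buckets[i], Bucket::Empty);
        self.buckets[i] = match slot {
            Bucket::Empty => Bucket::One(key, value),
            // Same key: overwrite in place.
            Bucket::One(k, _) if k == key => Bucket::One(key, value),
            // A true collision: push both entries one level down,
            // where the different salt likely separates them.
            Bucket::One(k, v) => {
                let mut child = Box::new(SaltedMap::new(self.salt + 1));
                child.insert(k, v);
                child.insert(key, value);
                Bucket::Deeper(child)
            }
            Bucket::Deeper(mut child) => {
                child.insert(key, value);
                Bucket::Deeper(child)
            }
        };
    }
}

fn main() {
    let mut map = SaltedMap::new(0);
    for i in 0..100 {
        map.insert(i, i * 2);
    }
}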
Your idea seems to be effectively the same as the one suggested in the Fredman, Komlós & Szemerédi paper from 1984. As Wikipedia summarizes it:
FKS Hashing makes use of a hash table with two levels in which the top level contains n buckets which each contain their own hash table.
In contrast to your idea, the local hash maps aren't recursive; instead, each of them chooses a salt that makes it a perfect hash for its bucket. In practice, this will (as you say) usually be achieved by the first salt you try, so it's asymptotically constant-time.
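A rough Rust sketch of that two-level construction (the function names, table sizing, and use of DefaultHasher are my illustrative choices, not the paper's; keys are assumed distinct): each bucket of size k gets a local table of size k² and retries salts until its keys land collision-free.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn salted(key: &str, salt: u64, m: usize) -> usize {
    let mut h = DefaultHasher::new();
    salt.hash(&mut h);
    key.hash(&mut h);
    (h.finish() as usize) % m
}

// Two-level FKS-style construction: n top-level buckets; each bucket
// of size k gets a local table of size k*k and retries salts until
// its keys land collision-free (a perfect local hash).
fn build_fks(keys: &[&str], n: usize) -> Vec<(u64, Vec<Option<String>>)> {
    let mut buckets: Vec<Vec<&str>> = vec![Vec::new(); n];
    for &key in keys {
        buckets[salted(key, 0, n)].push(key);
    }
    buckets
        .into_iter()
        .map(|bucket| {
            let m = (bucket.len() * bucket.len()).max(1);
            'salts: for salt in 1.. {
                let mut table = vec![None; m];
                for &key in &bucket {
                    let i = salted(key, salt, m);
                    if table[i].is_some() {
                        continue 'salts; // collision: try the next salt
                    }
                    table[i] = Some(key.to_string());
                }
                return (salt, table);
            }
            unreachable!()
        })
        .collect()
}

fn main() {
    let tables = build_fks(&["ab", "cd", "ef", "gh"], 4);
    assert_eq!(tables.len(), 4);
}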
During my most recent job interview for a software engineer position, I was asked this question: what are the differences between a hashtable and a hashmap? I asked the interviewer if he was being specific about Java, since in Java Hashtable is synchronized and HashMap is not (and there are actually tons of comparisons of Hashtable vs HashMap in Java after googling, so that's not the answer I am looking for), but he said no and wanted me to explain the difference between the two in general.
I was really puzzled and shocked (actually still puzzled now) by this question. IMO, hashtable vs hashmap is simply a matter of terminology. Actually, only Java has both terms; other languages like C++ don't even have the term hashtable. During the interview, I just explained the principle of hashing and said that hashmap and hashtable should both be implemented based on this principle, and that I didn't know of any difference between the two. The interviewer was definitely not convinced and was looking for another answer, and of course I was rejected after that round.
So back to the topic, what could possibly be the differences between hashmap and hashtable in general (not specific to Java) if there is any?
In computer science there is a difference due to the wording.
A HashTable is some kind of lookup table that uses key hashes to look up the corresponding value in a table-like data structure. That's only one kind of key-value mapping. There are different implementations, as you are probably aware: different hash functions, hash collision resolution strategies, table growing strategies, and more under the hood. It's only interesting if you need to build your own hash table for whatever reason.
A HashMap is some kind of mapping of key-value pairs with a hashed key. Mapping is abstract as such and it may not be a table. Balanced trees or tries or other data structures/mappings are possible too.
You could simplify and say that a HashTable is the underlying data structure and the HashMap may be utilizing a HashTable.
A Dictionary is yet another abstraction level, since it may not use hashes at all - for example, it could use binary search over sorted keys or other comparison-based lookups. This is all you can get out of the words without considering specific programming languages.
--
Before thinking too much about it: can you say - with certainty - that your interviewer had a clue what he/she was talking about? Did you discuss technical details, or did they just listen/ask and sometimes comment? Sometimes interviewers just come up with the most ridiculous answers to problems they don't really understand in the first place.
Like you wrote yourself, in general it's just terminology. Software developers often use the terms interchangeably, except perhaps in languages like Java where there really are differences.
The interviewer may have been looking for the insight that...
a hash table is a lower-level concept that doesn't imply or necessarily support any distinction or separation of keys and values (i.e. you can implement a hash set of values using a hash table), while
a hash map must support distinct keys and values, as there's to be a mapping/association from keys to values; the two are distinct, even if in some implementations they're always stored side by side in memory, e.g. members of the same structure / std::pair<>.
Example: a (bad) hash table implementation preventing use as a hash map.
Consider:
template <typename T>
class Hash_Table
{
    ...
    bool insert(const T& t)
    {
        // work out which bucket t hashes to...
        size_t bucket = hash_bytes((void*)&t, sizeof t) % num_buckets_;

        // see if t is already stored in the bucket...
        if (memcmp((void*)&t, (void*)&buckets_[bucket], sizeof t) == 0)
            ...
        ... handle collisions etc. ...
    }
    ...
};
Above, the hard-coded calls to a hash function that treats the value being inserted as a binary blob, and the memcmp of the entire t, mean you can't make T, say, a std::pair<int, std::string> and use the hash table as a hash map from ints to strings. So, it's an example of a hash table that's not usable as a hash map.
You might or might not also consider a hash table that simply doesn't provide any convenience features for map-style use not to be a hash map. For example, if the API was designed as if dealing only in values - h.insert(t); h.erase(t); auto i = h.find(t); - but it allowed the caller to specify arbitrary custom comparison and hashing functions that could restrict their operations to only the key part of t, then the hash table could be (ab)used as a functional hash map.
To clarify how this relates to makadev's existing answer, I disagree with:
"A HashTable [uses] key hashes to lookup the corresponding value"; wrong because it assumes a key->value mapping.
"A HashMap [...]. Mapping is abstract as such and it may not be a table. Balanced trees or tries or other data structures/mappings are possible too."; wrong because the primary mechanism of a hash map is still hashing of the key to a bucket (index) in the table/array: some hash tables/maps may use other data structures (arrays, linked lists, trees...) to store elements that collide at the same bucket, but that's a different issue and not part of the difference between hash tables and hash maps.
Actually, Hashtable has become obsolete and HashMap is the better choice, because Hashtable is synchronized. If a thread-safe implementation is not needed, it is recommended to use HashMap in place of Hashtable. If a thread-safe, highly concurrent implementation is desired, then it is recommended to use java.util.concurrent.ConcurrentHashMap in place of Hashtable.
A second difference is that HashMap implements the Map interface, whereas Hashtable extends the legacy Dictionary class.
Whatever order I use here
let mut tm = TreeMap::new();
tm.insert("aaa".to_string(), "val1".to_json());
tm.insert("zzz".to_string(), "val2".to_json());
// or:
// tm.insert("zzz".to_string(), "val2".to_json());
// tm.insert("aaa".to_string(), "val1".to_json());
let a = json::Object(tm);
println!("Json is {}", a);
the resulting JSON is always the same:
Json is {"aaa":"val1","zzz":"val2"}
But I want the order to be the same as it is in insert operations. How?
Generally it's a very bad idea to rely on the order of keys in JSON. Usually the underlying data structure is a hash table, which does not preserve order (the standard does not require it, and a hash map turns out to be the most efficient way of implementing such an unordered map). There are some implementations of JSON parsers/generators which preserve order (and some even allow duplicates), but you can never rely on this behavior.
So the best way to achieve the result you want is to use an array of pairs (a pair can be either an array or a map). The order of elements within an array is preserved.
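For example, a sketch using the same serialization API as in the question (depending on your library version, the array variant may be spelled json::Array or json::List):

// Encode an ordered list of key/value pairs as a JSON array
// instead of an object:
let pairs = json::Array(vec![
    json::Array(vec!["zzz".to_json(), "val2".to_json()]),
    json::Array(vec!["aaa".to_json(), "val1".to_json()]),
]);
// Prints: [["zzz","val2"],["aaa","val1"]] - insertion order preserved.
println!("Json is {}", pairs);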
Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are Dictionary<> in .NET and the {} storage structure in Python. Other languages certainly support such a mechanism: C++ has its map, LISP has an equivalent, as do most other modern languages.
It was contended in the answers to that question that hash algorithms on strings can be conducted in constant time, with one SO member of 25 years' programming experience claiming that anything can be hashed in constant time. My personal contention is that this is not true, unless your particular application places a bound on the string length, i.e. some constant K dictates the maximal length of a string.
I am familiar with the Rabin-Karp algorithm which uses a hashing function for its operation, but this algorithm does not dictate a specific hash function to use, and the one the authors suggested is O(m), where m is the length of the hashed string.
I see some other pages such as this one (http://www.cse.yorku.ca/~oz/hash.html) that display some hash algorithms, but it seems that each of them iterates over the entire length of the string to arrive at its value.
From my comparatively limited reading on the subject, it appears that most associative arrays for string types are actually created using a hashing function that operates with a tree of some sort under the hood. This may be an AVL tree or red/black tree that points to the location of the value element in the key/value pair.
Even with this tree structure, if we are to remain on the order of Θ(log n), with n being the number of elements in the tree, we need a constant-time hash algorithm. Otherwise, we have the additive penalty of iterating over the string. Even though Θ(m) would be eclipsed by Θ(log n) for indexes containing many strings, we cannot ignore it if we are in a domain where the texts we search against will be very large.
I am aware that suffix trees/arrays and Aho-Corasick can bring searching down to Θ(m) at a greater expense in memory, but what I am asking specifically is whether a constant-time hash method exists for strings of arbitrary length, as was claimed by the other SO member.
Thanks.
A hash function doesn't have to (and can't) return a unique value for every string.
You could use the first 10 characters to initialize a random number generator and then use that to pull out 100 random characters from the string, and hash that. This would be constant time.
You could also just return the constant value 1. Strictly speaking, this is still a hash function, although not a very useful one.
In general, I believe that any complete string hash must use every character of the string and therefore would need to grow as O(n) for n characters. However I think for practical string hashes you can use approximate hashes that can easily be O(1).
Consider a string hash that always uses min(n, 20) characters to compute a standard hash. Obviously this grows as O(1) with string size. Will it work reliably? It depends on your domain...
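A sketch of such an approximate hash in Rust (prefix_hash is an illustrative name; it hashes at most the first limit bytes of the string):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// An approximate O(1) string hash that never looks at more than
// the first `limit` bytes, per the min(n, 20) idea above.
fn prefix_hash(s: &str, limit: usize) -> u64 {
    let n = s.len().min(limit);
    let mut h = DefaultHasher::new();
    s.as_bytes()[..n].hash(&mut h);
    h.finish()
}

fn main() {
    // Equal prefixes collide, by design: the trade-off discussed above.
    assert_eq!(
        prefix_hash("http://example.com/a", 6),
        prefix_hash("http://example.com/b", 6)
    );
}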
You cannot easily achieve a general constant time hashing algorithm for strings without risking severe cases of hash collisions.
For it to be constant time, you will not be able to access every character in the string. As a simple example, suppose we take the first six characters. Then along comes someone trying to hash an array of URLs. The hash function will see "http:/" for every single string.
Similar scenarios may occur for other character-selection schemes. You could pick characters pseudo-randomly based on the value of the previous character, but you still run the risk of failing spectacularly if the strings for some reason have the "wrong" pattern and many end up with the same hash value.
You can hope for asymptotically less than linear hashing time if you use ropes instead of strings and have sharing that allows you to skip some computations. But obviously a hash function cannot separate inputs that it has not read, so I wouldn't take the claim that "everything can be hashed in constant time" too seriously.
Anything is possible in the compromise between the hash function's quality and the amount of computation it takes, and a hash function over long strings must have collisions anyway.
You have to determine if the strings that are likely to occur in your algorithm will collide too often if the hash function only looks at a prefix.
Although I cannot imagine a fixed-time hash function for unlimited length strings, there is really no need for it.
The idea behind using a hash function is to generate a distribution of the hash values that makes it unlikely that many strings would collide - for the domain under consideration. This key would allow direct access into a data store. These two combined result in a constant time lookup - on average.
If such a collision ever occurs, the lookup algorithm falls back on a more flexible lookup sub-strategy.
Certainly this is doable, so long as you ensure all your strings are 'interned', before you pass them to something requiring hashing. Interning is the process of inserting the string into a string table, such that all interned strings with the same value are in fact the same object. Then, you can simply hash the (fixed length) pointer to the interned string, instead of hashing the string itself.
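A minimal sketch of interning in Rust (Interner is an illustrative type, not a standard one): all interned strings with equal contents share one Rc allocation, whose pointer identity can then serve as a fixed-size, O(1)-hashable key.

use std::collections::HashSet;
use std::rc::Rc;

// Every interned string with the same contents is the same
// Rc allocation, so the pointer itself identifies the string.
struct Interner {
    strings: HashSet<Rc<str>>,
}

impl Interner {
    fn new() -> Self {
        Interner { strings: HashSet::new() }
    }

    fn intern(&mut self, s: &str) -> Rc<str> {
        if let Some(existing) = self.strings.get(s) {
            return Rc::clone(existing);
        }
        let rc: Rc<str> = Rc::from(s);
        self.strings.insert(Rc::clone(&rc));
        rc
    }
}

fn main() {
    let mut interner = Interner::new();
    let a = interner.intern("hello");
    let b = interner.intern("hello");
    // Same allocation: hashing the (fixed-size) pointer is O(1).
    assert!(Rc::ptr_eq(&a, &b));
}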
You may be interested in the following mathematical result I came up with last year.
Consider the problem of hashing an infinite number of keys - such as the set of all strings of any length - to the set of numbers in {1, 2, ..., b}. Random hashing proceeds by first picking at random a hash function h from a family of H functions.
I will show that there is always an infinite number of keys that are certain to collide over all H functions, that is, they always have the same hash value for all hash functions.
Pick any hash function h: there is at least one hash value y such that the set A = {s : h(s) = y} is infinite (by the pigeonhole principle, since infinitely many keys map to only b possible values); that is, you have infinitely many strings colliding. Pick any other hash function h′ and hash the keys in the set A. There is at least one hash value y′ such that the set A′ = {s in A : h′(s) = y′} is infinite; that is, there are infinitely many strings colliding on two hash functions. You can repeat this argument any number of times. Repeat it H times. Then you have an infinite set of strings where all strings collide over all of your H hash functions. QED.
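The same induction in symbols (a compact restatement, with $A_0$ the infinite set of all keys):

$$A_k = \{\, s \in A_{k-1} : h_k(s) = y_k \,\}, \qquad k = 1, \dots, H,$$

where each $y_k$ is chosen so that $A_k$ stays infinite (possible because $A_{k-1}$ is infinite and $h_k$ takes at most $b$ values); every pair of strings in $A_H$ then collides under all of $h_1, \dots, h_H$.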
Further reading:
Sensible hashing of variable-length strings is impossible
http://lemire.me/blog/archives/2009/10/02/sensible-hashing-of-variable-length-strings-is-impossible/