Can I write a hash function say `hashit(a, b, c)` taking no more than O(N) space - hashmap

Given integers a,b,c such that
-N<=a<=N,
0<=b<=N,
0<=c<=10
Can I write a hash function say hashit(a, b, c) taking no more than O(N) adrdress space.
My naive thought was to write it as,
a+2N*b+10*2N*N*c
thats like O(20N*N) space, so it wont suffice my need.
let me elaborate my usecase, I want tuple (a,b,c) as key of a hashmap . Basically a,b,c are arguments to my function which I want to memorise. in python #lru_cache perfectly does it without any issue for N=1e6 but when I try to write hash function myself I get memory overflow. So how do python do it ?
I am working wih N of the order of 10^6
This code work
#lru_cache(maxsize=None)
def myfn(a,b,c):
//some logic
return 100
But if i write the hash function myself like this, it doesn't . So how do python do it.
def hashit(a,b,c):
return a+2*N*b+2*N*N*c
def myfn(a,b,c):
if hashit(a,b,c) in myhashtable:
return myhashtable[hashit(a,b,c)]
//some logic
myhashtable[hashit(a,b,c)] = 100;
return myhashtable[hashit(a,b,c)]

To directly answer your question of whether it is possible to find an injective hash function from a set of size Θ(N^2) to a set of size O(N): it isn't. The very existence of an injective function from a finite set A to a set B implies that |B| >= |A|. This is similar to trying to give a unique number out of {1, 2, 3} to each member of a group of 20 people.
However, do note that hash functions do oftentimes have collisions; the hash tables that employ them simply have a method for resolving those collisions. As one simple example for clarification, you could for instance hold an array such that every possible output of your hash function is mapped to an index of this array, and at each index you have a list of elements (so an array of lists where the array is of size O(N)), and then in the case of a collision simply go over all elements in the matching list and compare them (not their hashes) until you find what you're looking for. This is known as chain hashing or chaining. Some rare manipulations (re-hashing) on the hash table based on how populated it is (measured through its load factor) could ensure an amortized time complexity of O(1) for element access, but this could over time increase your space complexity if you actually try to hold ω(N) values, though do note that this is unavoidable: you can't use less space than Θ(X) to hold Θ(X) values without any extra information (for instance: if you hold 1000000 unordered elements where each is a natural number between 1 and 10, then you could simply hold ten counters; but in your case you describe a whole possible set of elements of size 11*(N+1)*(2N+1), so Θ(N^2)).
This method would, however, ensure a space complexity of O(N+K) (equivalent to O(max{N,K})) where K is the amount of elements you're holding; so long as you aren't trying to simultaneously hold Θ(N^2) (or however many you deem to be too many) elements, it would probably suffice for your needs.

Related

Data Structure Question: Is there a link between the size of a list in a chaining implementation of hash maps and its load factor?

For example, if I have n keys and m slots in the hash map, the average size of a linked list starting from a slot would be n/m. Am I correct in thinking this? Again, I'm talking about an average. Thanks in advance!
I'm trying to learn data structures.
As you say, the average size of a single list is generally going to be the table's load factor; but this is assuming that the "Simple Uniform Hashing Assumption" holds with your hash table (more specifically, with its hash function(s) and expected input keys): simply put, we assume that the hash function distributes elements to buckets uniformly, as well as independently of one another.
To expand a little, and in different words:
We assume that if we choose a new item randomly (imagine sampling an item from the probability distribution that characterizes our inputs), then there is an equal chance that the item we end up with will be mapped to any of the m buckets. (A chance of 1/m.)
Furthermore, that this probability is unaffected given the presence (or absence) of any other elements in any of the buckets.
This is helpful because from this we can conclude that the probability for an item to be sorted into a given bucket is always 1/m, regardless of any other circumstances; from this it directly follows that the expected (average) length of a single bucket's list will be n/m (we insert n elements into the table, and for each one, sort it into this given list at a probability of 1/m).
To see that this is important, we might imagine a case in which it doesn't hold: for instance, if we're facing some kind of "attack" and our inputs are engineered to all hash into the same bucket, or even just with a high probability. In this case SUHA no longer holds, and clearly neither does the link you've asked about between the length of a list and the load factor.
This is part of the reason that it is important to choose a good hash function for your use case: without it, the assumption may not hold which could have a harmful effect on your lookup times.

Generating all permutations excluding cyclic rotations in Python

I need to create a (practice) program for currency arbitrage that detects profitable "loops" given a series of exchange rates. So there might be different values for USD->JPY, JPY->USD, USD->EUR, and so on. In order to detect profitability, however, I first need to enumerate all possible loops -- USD->JPY->EUR->USD is one example, but USD->EUR->JPY->USD is a distinct example using the same currencies since it may hit different exchange rates.
If I ignore the last part of the loop, which will always be the same as the origin, it seems to be the case that every currency can only exist at most once in the "best" loop, as if a currency exists more than once it would actually be two different loops (at least one of which would still be profitable).
Similarly, I can ignore loops that are just translations of already tested loops: USD->JPY->ASD is the same as JPY->ASD->USD.
So, given input like [USD,JPY,EUR,ASD] I need something that would return:
(USD,JPY,EUR,ASD)
(USD,JPY,ASD,EUR)
(USD,EUR,ASD,JPY)
(USD,EUR,JPY,ASD)
(USD,ASD,EUR,JPY)
(USD,ASD,JPY,EUR)
This solution uses the yield from syntax introduced in Python 3.3. Like the built-in itertools.permutations(), this:
Is a generator and does not require storing anything
Will yield an empty tuple if passed length 0
Assumes every item in the permuted object is itself unique
from itertools import permutations
def unique_cyclic_permutations(thing, length):
if length == 0:
yield (); return
for x in permutations(thing[1:], length - 1):
yield (thing[0],) + x
if length < len(thing):
yield from unique_cyclic_permutations(thing[1:], length)
The algorithm works by choosing a pivot, fixing it at the beginning, and then permuting the rest of the objects. In the case of a non-full length permutation, there will also be some permutations that don't include the pivot object at all. In this case, the generator recursively calls itself while excluding the original pivot.

hashmap remove complexity

So a lot of sources say the hashmap remove function is O(1), but I don't see how this could be unless a hashmap were backed by a linkedlist because list removals are O(n). Could someone explain?
You can view a Hasmap as an array. Imagine, you want to store objects of all humans on earth somewhere. You could just get an unique number for everyone and use an array with a dimension of 10*10^20.
If someone is born, she/he gets the next free number and is added to the end. If someone dies, her/his number is used and the array entry is set to null.
You can easily see, to add some or to remove someone, you need only constant time. calculate array address, done (if you have random access memory).
What is added by the Hashmap? There are 2 motivations. On the one side, you do not want to have such a big array. If you only want to store 10 people from all over the world, nearly all entries of the array are free. On the other side, not all data you want to store somewhere have an unique number. Sometimes there are multiple times the same number, some numbers do now show overall and sometimes you do not have any number. Therefore, you define a function, which uses the big numbers from the input and reduce them to numbers in a smaller range. This reduction should be in a way, that the resulting number is most likely unique for different inputs.
Example: Lets say you want to store 10 numbers from 1 to 100000000. You could use an array with 100000000 indices. Or you could use an array with 100 indices and the function f(x) = x % 100. If you have the number 1234, f(1234) = 34. Mark 34 as assigned.
Now you could ask, what happens if you have the number 2234? We have a collision then. You need some strategy then to handle this, there are several. Study some literature or ask specific questions for this.
If you want to store a string, you could imagine to use the length or the sum of the ascii value from every characters.
As you see, we can easily store something, and easily access it again. What we have to do? Calculate the hash from the function (constant time for a good function), access the array (constant time), store or remove (constant time).
In real world, a good hash function is not that easy. Try to stick with the included ones in java.
If you want to read more details, the wikipedia article about hash table is a good starting point: http://en.wikipedia.org/wiki/Hash_table
I don't think the remove(key) complexity is O(1). If we have a big hash table with many collisions, then it would be O(n) in worst case. It very rare to get the worst case but we can't neglect the fact that O(1) is not guaranteed.
If your HashMap is backed by a LinkedList buckets array
The worst case of the remove function will be O(n)
If your HashMap is backed by a Balanced Binary Tree buckets array
The worst case of the remove function will be O(log n)
The best case and the average case (amortized complexity) of the remove function is O(1)

searching and sorting

If the list has 1024 items (lg1024 = 10) at what point (the number of searches) does sorting the list first and using binary search pay off? How does your answer change if the list has 2048 items? instead of using sequential search
Where the "linear access" curve crosses the "binary search" curve depends on how long it takes to access/insert a single item versus how many items there are. This will be different for every combination of compiler, memory and cpu architecture, type of data/node in the list, the distribution of data values, what sort and insertion algorithms you use, etc... But with a "large enough" set of items, the running time can be described by mentioning how its upper bound grows with increasing number of items, even though that "Big-O" bound may not precisely describe any particular run.
You can figure out precisely if you can know the specific algorithm you will insert or search with, and determine the actual instructions that make up your list accesses, and find out how many clock cycles they take to execute, etc etc...
Then you can say for sure which one is faster, and at which point. And if you know you data values, you can model it. But if you don't know, you have to assume (for example, what if your inserted data values are already ordered? how does that affect your sort or insertion function?)
For example, a single item retrieval may take 1us. Comparing two items may take 0.5us. Doing a sorted list insertion with 100 items in the list might require X number of retrievals, Y number of compares, and Z number of updates/writes.... Whereas an unordered list might require more or less depending on what's already there and what you're inserting.
if your list is unsorted it will take O(n) to find it. Sort with quicksort costs O(n*log n), then binary search is O(log n). Lets assume that x is number of searchs. x * n = x * logn + n * logn . by putting different values you can estimate the dynamics. my rough estimate tells that if n = 1024 and number searches is greater then ~10, it is more efficitent to sort first. put 1024 instead of n and try.

Constant-time hash for strings?

Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are dictionary<> in .NET and the {} storage structure in Python. Other languages certainly support such a mechanism. C++ has its map, LISP has an equivalent, as do most other modern languages.
It was contended in the answers to the question that hash algorithms on strings can be conducted in constant timem with one SO member who has 25 years experience in programming claiming that anything can be hashed in constant time. My personal contention is that this is not true, unless your particular application places a boundary on the string length. This means that some constant K would dictate the maximal length of a string.
I am familiar with the Rabin-Karp algorithm which uses a hashing function for its operation, but this algorithm does not dictate a specific hash function to use, and the one the authors suggested is O(m), where m is the length of the hashed string.
I see some other pages such as this one (http://www.cse.yorku.ca/~oz/hash.html) that display some hash algorithms, but it seems that each of them iterates over the entire length of the string to arrive at its value.
From my comparatively limited reading on the subject, it appears that most associative arrays for string types are actually created using a hashing function that operates with a tree of some sort under the hood. This may be an AVL tree or red/black tree that points to the location of the value element in the key/value pair.
Even with this tree structure, if we are to remain on the order of theta(log(n)), with n being the number of elements in the tree, we need to have a constant-time hash algorithm. Otherwise, we have the additive penalty of iterating over the string. Even though theta(m) would be eclipsed by theta(log(n)) for indexes containing many strings, we cannot ignore it if we are in such a domain that the texts we search against will be very large.
I am aware that suffix trees/arrays and Aho-Corasick can bring the search down to theta(m) for a greater expense in memory, but what I am asking specifically if a constant-time hash method exists for strings of arbitrary lengths as was claimed by the other SO member.
Thanks.
A hash function doesn't have to (and can't) return a unique value for every string.
You could use the first 10 characters to initialize a random number generator and then use that to pull out 100 random characters from the string, and hash that. This would be constant time.
You could also just return the constant value 1. Strictly speaking, this is still a hash function, although not a very useful one.
In general, I believe that any complete string hash must use every character of the string and therefore would need to grow as O(n) for n characters. However I think for practical string hashes you can use approximate hashes that can easily be O(1).
Consider a string hash that always uses Min(n, 20) characters to compute a standard hash. Obviously this grows as O(1) with string size. Will it work reliably? It depends on your domain...
You cannot easily achieve a general constant time hashing algorithm for strings without risking severe cases of hash collisions.
For it to be constant time, you will not be able to access every character in the string. As a simple example, suppose we take the first 6 characters. Then comes someone and tries to hash an array of URLs. The has function will see "http:/" for every single string.
Similar scenarios may occur for other characters selections schemes. You could pick characters pseudo-randomly based on the value of the previous character, but you still run the risk of failing spectacularly if the strings for some reason have the "wrong" pattern and many end up with the same hash value.
You can hope for asymptotically less than linear hashing time if you use ropes instead of strings and have sharing that allows you to skip some computations. But obviously a hash function can not separate inputs that it has not read, so I wouldn't take the "everything can be hashed in constant time" too seriously.
Anything is possible in the compromise between the hash function's quality and the amount of computation it takes, and a hash function over long strings must have collisions anyway.
You have to determine if the strings that are likely to occur in your algorithm will collide too often if the hash function only looks at a prefix.
Although I cannot imagine a fixed-time hash function for unlimited length strings, there is really no need for it.
The idea behind using a hash function is to generate a distribution of the hash values that makes it unlikely that many strings would collide - for the domain under consideration. This key would allow direct access into a data store. These two combined result in a constant time lookup - on average.
If ever such collision occurs, the lookup algorithm falls back on a more flexible lookup sub-strategy.
Certainly this is doable, so long as you ensure all your strings are 'interned', before you pass them to something requiring hashing. Interning is the process of inserting the string into a string table, such that all interned strings with the same value are in fact the same object. Then, you can simply hash the (fixed length) pointer to the interned string, instead of hashing the string itself.
You may be interested in the following mathematical result I came up with last year.
Consider the problem of hashing an infinite number of keys—such as the set of all strings of any length—to the set of numbers in {1,2,…,b}. Random hashing proceeds by first picking at random a hash function h in a family of H functions.
I will show that there is always an infinite number of keys that are certain to collide over all H functions, that is, they always have the same hash value for all hash functions.
Pick any hash function h: there is at least one hash value y such that the set A={s:h(s)=y} is infinite, that is, you have infinitely many strings colliding. Pick any other hash function h‘ and hash the keys in the set A. There is at least one hash value y‘ such that the set A‘={s is in A: h‘(s)=y‘} is infinite, that is, there are infinitely many strings colliding on two hash functions. You can repeat this argument any number of times. Repeat it H times. Then you have an infinite set of strings where all strings collide over all of your H hash functions. CQFD.
Further reading:
Sensible hashing of variable-length strings is impossible
http://lemire.me/blog/archives/2009/10/02/sensible-hashing-of-variable-length-strings-is-impossible/

Resources