Numpy reduce multiple operations for hashing - python-3.x

I'm trying to implement some hashing functions for NumPy arrays so I can quickly look them up in a large list of arrays, but almost every hashing function I find needs a reduce with more than one operation per element, for example:
import numpy as np

# 64-bit FNV-1 constants
FNV_offset_basis = 0xcbf29ce484222325
FNV_prime = 0x100000001b3

def fnv_hash(arr):
    result = FNV_offset_basis
    for v in arr.view(dtype=np.uint8):
        result *= FNV_prime
        result ^= v
    return result
Each loop iteration applies two operations to the result variable, which (I think) cannot be expressed with a single reduce call on a NumPy ufunc (i.e. numpy.ufunc.reduce).
I want to avoid plain Python loops, since they do not treat NumPy arrays as contiguous memory regions (which is slow), and I don't want to use hashlib functions. Also, wrapping a function with numpy.vectorize or similar (which, as the documentation says, is essentially a for loop) does not help performance.
Unfortunately I cannot use numba.jit because, as I'm working with large arrays, I need to run my code on a cluster which doesn't have numba installed. The same goes for xxhash.
My solution so far is to use a simple hashing function such as:
def my_hash(arr):
    indices = np.arange(arr.shape[0])
    return int((arr * ((1 << (indices * 5)) - indices)).sum())
This is reasonably fast (this isn't the actual function code, I made some optimizations in my script, but I can assure you the output is the same), but it produces some unwanted collisions.
In short: I want to implement a good hashing function using only numpy operations, as my arrays and my search space are enormous.
Thanks in advance.

Since your arrays are enormous, but you don't have many of them, did you try hashing just a part of each array? For example, even if arr is huge, hash(tuple(arr[:10**6])) takes ~60ms on my machine and is probably unique enough to distinguish 10k different arrays, depending on how they were generated.
If this won't solve your problem, any additional context you can give on your problem would be helpful. How and when are these arrays generated? Why are you trying to hash them?
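For illustration, a minimal sketch of that prefix-hashing idea (the prefix_hash name and the default prefix length are just placeholders; tobytes() is used instead of tuple() since a bytes object is hashable and avoids building a huge tuple):
import numpy as np

def prefix_hash(arr, n=10**6):
    # Hash only the first n elements; collisions are possible but unlikely
    # if the arrays tend to differ early on.
    return hash(arr[:n].tobytes())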

Related

Can I write a hash function say `hashit(a, b, c)` taking no more than O(N) space

Given integers a, b, c such that
-N <= a <= N,
0 <= b <= N,
0 <= c <= 10,
can I write a hash function, say hashit(a, b, c), taking no more than O(N) address space?
My naive thought was to write it as
a + 2*N*b + 10*2*N*N*c
but that is like O(20*N*N) space, so it won't suffice for my need.
Let me elaborate my use case: I want the tuple (a, b, c) as the key of a hashmap. Basically a, b, c are arguments to a function whose results I want to memoize. In Python, @lru_cache does this perfectly without any issue for N = 1e6, but when I try to write the hash function myself I get a memory overflow. So how does Python do it?
I am working with N of the order of 10^6.
This code works:
from functools import lru_cache

@lru_cache(maxsize=None)
def myfn(a, b, c):
    # some logic
    return 100
But if I write the hash function myself like this, it doesn't. So how does Python do it?
def hashit(a, b, c):
    return a + 2*N*b + 2*N*N*c

def myfn(a, b, c):
    if hashit(a, b, c) in myhashtable:
        return myhashtable[hashit(a, b, c)]
    # some logic
    myhashtable[hashit(a, b, c)] = 100
    return myhashtable[hashit(a, b, c)]
To directly answer your question of whether it is possible to find an injective hash function from a set of size Θ(N^2) to a set of size O(N): it isn't. The very existence of an injective function from a finite set A to a set B implies that |B| >= |A|. This is similar to trying to give a unique number out of {1, 2, 3} to each member of a group of 20 people.
However, do note that hash functions do oftentimes have collisions; the hash tables that employ them simply have a method for resolving those collisions. As one simple example for clarification, you could hold an array such that every possible output of your hash function is mapped to an index of this array, and at each index you have a list of elements (so an array of lists, where the array is of size O(N)); in the case of a collision you simply go over all elements in the matching list and compare them (not their hashes) until you find what you're looking for. This is known as chain hashing or chaining.
Some rare manipulations (re-hashing) on the hash table based on how populated it is (measured through its load factor) can ensure an amortized time complexity of O(1) for element access, but this could over time increase your space complexity if you actually try to hold ω(N) values. Note that this is unavoidable: you can't use less space than Θ(X) to hold Θ(X) values without any extra information (for instance: if you hold 1000000 unordered elements where each is a natural number between 1 and 10, you could simply hold ten counters; but in your case you describe a whole possible set of elements of size 11*(N+1)*(2N+1), so Θ(N^2)).
This method would, however, ensure a space complexity of O(N+K) (equivalent to O(max{N,K})) where K is the amount of elements you're holding; so long as you aren't trying to simultaneously hold Θ(N^2) (or however many you deem to be too many) elements, it would probably suffice for your needs.
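A minimal sketch of chaining in Python, just to make the idea concrete (the class name, bucket count, and method names are illustrative, not part of the question):
class ChainedHashTable:
    def __init__(self, n_buckets):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # hash() values may collide after the modulo; the chain resolves that
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._bucket(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:  # compare the keys themselves, not their hashes
                return v
        raise KeyError(key)
As long as the number of stored tuples stays O(N) and the bucket array has Θ(N) slots, the average chain length is O(1), which is what gives the amortized O(1) access mentioned above.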

Way to use bisect module for sets in python

I was looking for something similar to the lower_bound() function for sets in Python, as we have in C++.
The task is to have a data structure which inserts elements in sorted order, stores only a single instance of each distinct value, and returns the left neighbor of a given value, with both operations in O(log n) worst-case time in Python.
Something similar to the bisect module for lists, but with efficient insertion, might work.
Sets are unordered, and the standard library does not offer tree structures.
Maybe you could look at sortedcontainers (a 3rd-party lib): http://www.grantjenks.com/docs/sortedcontainers/ it might offer a good approach to your problem.
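For example, a minimal sketch with sortedcontainers.SortedSet (assuming the third-party package is installed; the left_neighbor helper is just an illustration):
from sortedcontainers import SortedSet

s = SortedSet()             # sorted, distinct values
for x in [5, 1, 9, 5, 3]:
    s.add(x)                # duplicates are ignored

def left_neighbor(ss, value):
    # index of the first element >= value, like C++ lower_bound
    i = ss.bisect_left(value)
    return ss[i - 1] if i > 0 else None

print(list(s))              # [1, 3, 5, 9]
print(left_neighbor(s, 5))  # 3
SortedSet keeps one instance of each value and supports positional indexing, so both the insertion and the predecessor query run in roughly O(log n).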

Use frozen rv distribution objects to speed up a program with many function calls in which the same distribution is used?

Problem
Assume there is a program that needs to generate a lot of random variates from the same distribution in various functions and classes at various times.
It seems that using a frozen rv distribution object (and dragging that object through all functions) is a lot faster than re-creating the distribution in every function before drawing random variates.
To give some evidence, consider that this code:
import scipy.stats
import time
def gen_rvs(dist):
    dist.rvs(1000)

time1 = time.time()
d = scipy.stats.bernoulli(0.75)
for i in range(100000):
    gen_rvs(d)
print(time.time() - time1)
runs (on my machine) almost 10 times faster than this (7.4 sec vs. 68.6 sec):
import scipy.stats
import time
def gen_rvs():
    scipy.stats.bernoulli(0.75).rvs(1000)

time1 = time.time()
for i in range(100000):
    gen_rvs()
print(time.time() - time1)
Potential Solutions
Dragging distribution objects around through all functions etc.? Problem: This seems very messy, will require more arguments (if you perhaps need multiple distributions), makes function calls less easy to understand and will make errors more likely.
Having the frozen rv distribution object as a global variable? Problem: Will not work as easily if the program is spread out across multiple files. Parallelization would create more problems.
Pass the frozen rv distribution to all classes that need the rv generator at some point and save it locally everywhere? Problem: Still seems messy. (A small sketch of this option follows after this list.)
Pass generated random variates instead of the distribution to functions? Problem: Even more messy if there are longer call stacks. And it would have to be known in advance how many random variates are needed in a particular function.
Run the random number generator in a separate process and push random variates into a queue from where they are collected by other processes? Problem: While this sounds fancy and not messy, implementing it and efficiently governing how many random variates the process should generate might become messy.
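A minimal sketch of the third option, passing the frozen distribution into a class and storing it locally (the Simulation class and its names are hypothetical):
import scipy.stats

class Simulation:
    def __init__(self, rv_dist):
        # store the frozen distribution once, reuse it for every draw
        self.rv_dist = rv_dist

    def step(self):
        return self.rv_dist.rvs(1000)

sim = Simulation(scipy.stats.bernoulli(0.75))
for _ in range(10):
    sim.step()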
What is the preferred way to deal with that?
Edit (06/24/2016): the software versions with which the above scripts were executed and timed are Python 3.5.2 and Scipy 0.18.1.

Why is it not possible to get dictionary values in O(1) time?

Can we write a data structure which will search directly by taking the values in O(1) time?
For example, in this code in python3, we can get morse code by taking the keys and output the values.
morse = {'A': '.-',   'B': '-...', 'C': '-.-.', 'D': '-..',   'E': '.',
         'F': '..-.', 'G': '--.',  'H': '....', 'I': '..',    'J': '.---',
         'K': '-.-',  'L': '.-..', 'M': '--',   'N': '-.',    'O': '---',
         'P': '.--.', 'Q': '--.-', 'R': '.-.',  'S': '...',   'T': '-',
         'U': '..-',  'V': '...-', 'W': '.--',  'X': '-..-',  'Y': '-.--',
         'Z': '--..', '1': '.----', '2': '..---', '3': '...--', '4': '....-',
         '5': '.....', '6': '-....', '7': '--...', '8': '---..', '9': '----.',
         '0': '-----'}
n = input()
n = ''.join(i.upper() for i in n if i != ' ')
for i in n:
    print(morse[i], end=' ')
This gives the output:
>>>
S O S
... --- ...
If we want to search by taking the morse code as input and giving the string as output:
>>>
... --- ...
S O S
How do we do that without making another dictionary of Morse code?
Please provide proper reasoning and the limitations, if any.
Python dictionaries are hashmaps behind the scenes. The keys are hashed to achieve O(1) lookups. The same is not done for values for a few reasons, one of which is the reason @CrakC mentioned: the dict doesn't have to have unique values. Maintaining an automatic reverse lookup would be inconsistent at best. Another reason could be that fundamental data structures are best kept to a minimal set of operations for predictability reasons.
Hence the correct & common pattern is to maintain a separate dict with key-value pairs reversed if you want to have reverse lookups in O(1). If you cannot do that, you'll have to settle for greater time complexities.
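For example, a minimal sketch of such a reverse dictionary for the morse table above (this assumes the Morse codes are unique values; otherwise later entries overwrite earlier ones):
# Build the reverse mapping once; each lookup afterwards is O(1)
inverse_morse = {code: letter for letter, code in morse.items()}

message = input()   # e.g. "... --- ..."
print(''.join(inverse_morse[c] for c in message.split()))
This costs O(n) extra memory once, after which each reverse lookup is O(1).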
Yes, getting the key from its value in a dictionary is not possible in Python. The reason for this is quite obvious. The keys in a dictionary are unique in nature, i.e., there cannot be two entries in the dictionary for the same key. But the inverse is not always true: unique keys might have non-unique values. It should be noted here that the hashable, immutable nature of the keys actually defines the structure of the dictionary. Since keys are unique, they can be indexed by their hash, and so fetching the value of a given key executes in O(1) average time. The inverse, as explained above, cannot be realized in O(1) time and will always take an average time of O(n). The most important point you should know here is that Python dictionaries are not meant to be used this way.
Further reading: http://stupidpythonideas.blogspot.in/2014/07/reverse-dictionary-lookup-and-more-on.html
Can we write a data structure which will search directly by taking the values in O(1) time?
The answer to that question would be yes, and it's a HashMap or HashTable.
Following your example, what actually happens there is that Python dictionaries are implemented as HashMaps. It follows that search complexity is O(1), but, as I understand it, your real problem is how to search the key by the value in O(1) too. Well, since dictionaries are implemented as hashmaps, even if Python provided that reverse-searching functionality (I am not 100% sure it doesn't), it wouldn't be O(1), because HashMaps are not designed to provide it.
This can be seen by looking at how HashMaps work: you would need a hashing function that maps the key and the value to the same index in the array, which, if not impossible, is pretty hard to do.
I guess your best option is to define the inverse dictionary. It's not that uncommon to sacrifice memory to achieve better times.
As CrakC has correctly stated, it is not possible to get the key from its value in O(1) time; you would need to traverse the dictionary once, in O(n) time, to search for the value. Since you do not want to create another dictionary, this is your only option.

First occurrence search in CUDA

My application does some work in device code and generates an array inside the kernel.
I need to search for the first occurrence of an element in this array. How can I perform this on the GPU? If I copy the array to the CPU and do the work there, it will generate a lot of memory traffic, because this piece of code is called many times.
There is most probably a more sophisticated solution, but for a start and especially if the number of occurrences of the element is very small, a simple brute-force atomic-min might be a viable solution:
template<typename T> __global__ void find(T *data, T value, int *min_idx)
{
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    if(data[idx] == value)
        atomicMin(min_idx, idx);
}
If the number of occurrences is really small and thus nearly all threads don't even attempt to access the atomic, this might actually be not that bad a solution. Otherwise (if the searched element is not so rare) you would have much more intra-warp divergence and, even worse, a much higher probability of conflicting atomic operations.
EDIT: For a more sophisticated approach (but maybe still not the best), you could also, in a pre-step, create an int array where the value at index idx is set to idx if the input array's element at that index equals the searched element, and to INT_MAX if it doesn't:
indices[idx] = (data[idx]==value) ? idx : INT_MAX;
and then do a "classical" minimum-reduction on that index array to get the first matching index.
One approach is to use atomic operations which prevent other threads from accessing editable data until the one currently processing it is done.
Here's an example of finding first occurrence of a word:
http://supercomputingblog.com/cuda/search-algorithm-with-cuda/
The atomicMin function is used in that example. In addition, there's also a performance comparison between GPU and CPU in the article.
Another way to find first occurrence is to use a method known as parallel reduction. There is an example of parallel sum in the CUDA SDK (the sample calculates sum of all values in an array). Parallel reduction is a good option especially if you use hardware with older compute capability version and if you need high precision.
To use parallel reduction to find first occurrence, you firstly check if the value in the array equals to what you want to find. If it does, you save its index. Then, you perform one or many min operations (not atomic min) where you compare the indices you saved in the previous step. You can implement this search by editing the parallel sum example of CUDA SDK.
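To illustrate the logic only (this is plain NumPy on the host, not GPU code; on the device the min step would be a parallel reduction over the index array):
import numpy as np

INT_MAX = np.iinfo(np.int64).max

def first_occurrence(data, value):
    # Step 1: matching positions keep their index, the rest become INT_MAX
    indices = np.where(data == value, np.arange(data.size), INT_MAX)
    # Step 2: a min-reduction over the index array yields the first match
    first = indices.min()
    return int(first) if first != INT_MAX else -1

print(first_occurrence(np.array([7, 3, 5, 3, 9]), 3))  # 1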
This site has some information about reduction and atomic operations. It also includes binary tree reduction and workaround atomic functions that I haven't talked about here.
The atomic vs. reduction issue has also been discussed on Stack Overflow.
