Spark Hashing TF power of two feature dimension recommendation reasoning - apache-spark

According to https://spark.apache.org/docs/2.3.0/ml-features.html#tf-idf:
"HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3."
...
"Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the vector indices."
I tried to understand why using a power of two as the feature dimension will map words evenly and tried find some helpful documentation on the internet to understand it, but both attempts were not successful.
Does somebody know or have useful sources on why using the power two maps words evenly to vector indices?

The output of a hash function is b-bit, i.e., there are 2^b possible values to which a feature can be hashed. Additionally, we assume that the 2^b possible values appear uniformly at random.
If d is the feature dimension, an index for a feature f is determined as hash(f) MOD d. Again, hash(f) takes on 2^b possible values. It is easy to see that d has to be a power of two (i.e., a divisor of 2^b) itself in order for uniformity to be maintained.
For a counter-example, consider a 2-bit hash function and a 3-dimensional feature space. As per our assumptions, the hash function outputs 0, 1, 2, or 3 with probability 1/4 each. However, taking mod 3 results in 0 with probability 1/2, or 1 or 2 with probability 1/4 each. Therefore, uniformity is not maintained. On the other hand; if the feature space were 2-dimensional, it is easy to see that the result would be 0 or 1 with probability 1/2 each.

Related

How to generate a random stochastic matrix or ndarray?

I was looking for a crate that would allow me to easily and randomly generate probability vectors, stochastic matrices or, in general, ndarrays that are stochastic. For people not familiar with these concepts, a probability vector v is defined as follows
0 <= v[i] <= 1, for all i
sum(v[i]) = 1
Similarly, a stochastic matrix is a matrix where each column (or row) satisfies the conditions above.
More generally, a ndarray A would be stochastic if
0 <= A[i, j, k, ..., h] <= 1, for all indices
sum(A[i, j, k, ..., :]) = 1, for all indices
Here, ... just means other indices between k and the last index h. : is a notation to indicate all elements of that dimension.
Is there a crate that does this easily (i.e. you just need to call a function or something like that)? If not, how would you do it? I suppose one could just generate a random ndarray and then change the array by dividing the last dimension by the sum of the elements in that dimension, so, for a 1d array (a probability vector), we could do something like this
use ndarray::Array1;
use ndarray_rand::RandomExt;
use ndarray_rand::rand_distr::Uniform;
fn main() {
let mut a = Array1::random(10, Uniform::new(0.0, 1.0));
a = &a / a.sum();
println!("The sum is {:?}", a.sum());
}
But how would you do it for higher dimensional arrays? We could use a for loop an iterate over all indices, but that doesn't look like it would be efficient. I suppose there must be a way to do this operation in a vectorized form. Is there a function (in the standard library, in the ndarray crate or some other crate) that does this for us? Could we use ndarray-rand to do this without having to divide by the sum?
Requirements
Efficiency is not strictly necessary, but it would be nice.
I am more looking for a simple solution (no more complicated than what I wrote above).
Numerical stability would also be great (e.g. generating random integers and then dividing by the sum may be a better idea than generating random floats and then do the same thing).
I would like to use ndarrays and the related crate, but it's ok if you share also other solutions (which may be useful to others that don't use ndarrays)
I would argue that sampling with whatever distribution you have on hands (U(0,1), Exponential, abs Normal, ...) and then dividing by sum is the wrong way to go.
Start with distribution which has property values being in the [0...1] range and sum of values being equal to 1.
Fortunately, there is such distribution - Dirichlet distribution.
And, apparently, there is a Rust lib to do Dirichlet sampling. Cannot say anything about lib quality.
https://docs.rs/rand_distr/latest/rand_distr/struct.Dirichlet.html
UPDATE
Wrt sampling and then normalizing, problem is, noone knows what would be distribution of the RVs
U(0,1)/(U(0,1) + U(0,1) + ... + U(0,1))
Mean value? Median? Variance? Anything to say at all?
You could even construct it like
[U(0,1);Exp(2);|N(0,1)|;U(0,88);Exp(4.5);...] and as soon as you divide it by sum, values in the vector would be between 0 and 1 and summed to 1. Even less to say about properties of such RVs.
I assume you want to generate random vector/matrices for some purpose, like Monte Carlo etc. Dealing with known distribution with well-defined properties, mean values, variance looks like right way to go.
If I understand correctly, the Dirichlet distribution allows you to generate a probability vector, where the probabilities depend on the initial parameters that you pass, but you would still need to pass these parameters (manually)
Yes, concentration parameters. By default all ones, which makes RVs uniformly distributed in the simplex.
So, are you suggesting the Dirichlet distribution because it was designed on purpose to generate probability vectors?
I'm suggesting Dirichlet because by default it will produce uniformly in-the-simplex distributed RVs, summed to 1 and with well-known statistical properties, starting with PDF, CDF, mean, median, variance, ...
UPDATE II
For Dirichlet
PDF=Prod(xiai-1)/B(a)
So for the case where all ai=1
PDF = 1/B(a)
so given the constrains defining simplex Sum(xi)=1 this is as uniform as it gets.

Random Index from a Tensor (Sampling with Replacement from a Tensor)

I'm trying to manipulate individual weights of different neural nets to see how their performance degrades. As part of these experiments, I'm required to sample randomly from their weight tensors, which I've come to understand as sampling with replacement (in the statistical sense). However, since it's high-dimensional, I've been stumped by how to do this in a fair manner. Here are the approaches and research I've put into considering this problem:
This was previously implemented by selecting a random layer and then selecting a random weight in that layer (ignore the implementation of picking a random weight). Since layers are different sizes, we discovered that weights were being sampled unevenly.
I considered what would happen if we sampled according to the numpy.shape of the tensor; however, I realize now that this encounters the same problem as above.
Consider what happens to a rank 2 tensor like this:
[[*, *, *],
[*, *, *, *]]
Selecting a row randomly and then a value from that row results in an unfair selection. This method could work if you're able to assert that this scenario never occurs, but it's far from a general solution.
Note that this possible duplicate actually implements it in this fashion.
I found people suggesting flattening the tensor and use numpy.random.choice to select randomly from a 1D array. That's a simple solution, except I have no idea how to invert the flattened tensor back into its original shape. Further, flattening millions of weights would be a somewhat slow implementation.
I found someone discussing tf.random.multinomial here, but I don't understand enough of it to know whether it's applicable or not.
I ran into this paper about resevoir sampling, but again, it went over my head.
I found another paper which specifically discusses tensors and sampling techniques, but it went even further over my head.
A teammate found this other paper which talks about random sampling from a tensor, but it's only for rank 3 tensors.
Any help understanding how to do this? I'm working in Python with Keras, but I'll take an algorithm in any form that it exists. Thank you in advance.
Before I forget to document the solution we arrived at, I'll talk about the two different paths I see for implementing this:
Use a total ordering on scalar elements of the tensor. This is effectively enumerating your elements, i.e. flattening them. However, you can do this while maintaining the original shape. Consider this pseudocode (in Python-like syntax):
def sample_tensor(tensor, chosen_index: int) -> Tuple[int]:
"""Maps a chosen random number to its index in the given tensor.
Args:
tensor: A ragged-array n-tensor.
chosen_index: An integer in [0, num_scalar_elements_in_tensor).
Returns:
The index that accesses this element in the tensor.
NOTE: Entirely untested, expect it to be fundamentally flawed.
"""
remaining = chosen_index
for (i, sub_list) in enumerate(tensor):
if type(sub_list) is an iterable:
if |sub_list| > remaining:
remaining -= |sub_list|
else:
return i joined with sample_tensor(sub_list, remaining)
else:
if len(sub_list) <= remaining:
return tuple(remaining)
First of all, I'm aware this isn't a sound algorithm. The idea is to count down until you reach your element, with bookkeeping for indices.
We need to make crucial assumptions here. 1) All lists will eventually contain only scalars. 2) By direct consequence, if a list contains lists, assume that it also doesn't contain scalars at the same level. (Stop and convince yourself for (2).)
We also need to make a critical note here too: We are unable to measure the number of scalars in any given list, unless the list is homogeneously consisting of scalars. In order to avoid measuring this magnitude at every point, my algorithm above should be refactored to descend first, and subtract later.
This algorithm has some consequences:
It's the fastest in its entire style of approaching the problem. If you want to write a function f: [0, total_elems) -> Tuple[int], you must know the number of preceding scalar elements along the total ordering of the tensor. This is effectively bound at Theta(l) where l is the number of lists in the tensor (since we can call len on a list of scalars).
It's slow. It's too slow compared to sampling nicer tensors that have a defined shape to them.
It begs the question: can we do better? See the next solution.
Use a probability distribution in conjunction with numpy.random.choice. The idea here is that if we know ahead of time what the distribution of scalars is already like, we can sample fairly at each level of descending the tensor. The hard problem here is building this distribution.
I won't write pseudocode for this, but lay out some objectives:
This can be called only once to build the data structure.
The algorithm needs to combine iterative and recursive techniques to a) build distributions for sibling lists and b) build distributions for descendants, respectively.
The algorithm will need to map indices to a probability distribution respective to sibling lists (note the assumptions discussed above). This does require knowing the number of elements in an arbitrary sub-tensor.
At lower levels where lists contain only scalars, we can simplify by just storing the number of elements in said list (as opposed to storing probabilities of selecting scalars randomly from a 1D array).
You will likely need 2-3 functions: one that utilizes the probability distribution to return an index, a function that builds the distribution object, and possibly a function that just counts elements to help build the distribution.
This is also faster at O(n) where n is the rank of the tensor. I'm convinced this is the fastest possible algorithm, but I lack the time to try to prove it.
You might choose to store the distribution as an ordered dictionary that maps a probability to either another dictionary or the number of elements in a 1D array. I think this might be the most sensible structure.
Note that (2) is truly the same as (1), but we pre-compute knowledge about the densities of the tensor.
I hope this helps.

After computing the hash, what is the significance of keeping the only the last byte of the hash

Problem: To generate Test and train to improve on Generalization error.
possible solutions:
1. Split instances into train 80% and test 20%, train your model on trainset and tests on testset. But repeating above again and again will somehow let the model cram the data as in multiple time splits will select 1st time chosen instances of the testset into trainset(random sampling.)
The above approach might fail when we fetch an updated dataset.
Another approach is to select each instance's most stable feature/s(combination can be) to create a unique & immutable identifier that will remain robust even after the dataset updates.After selecting one, we could compute a hash of each instance's identifier, keep only the last two bytes of the hash, and put the instance in the test set if the value is <= 256 * test_ratio.}. This will ensure that testset will remain consistent across multiple runs, even if the dataset is refreshed.
Question: What is the significance of just taking last two bytes of the computed hash?
-----Thanks to Aurélien Géron-------
We need a solution to sample a unique test-set even after fetching a updated dataset.
SOLUTION: to use each instance's identifier to decide whether or not it should go to test_set.{Assuming that the instances have a unique and immutable identifier.
we could compute a hash of each instance's identifier, keep only the last bytes of the hash, and put the instance in the test set if value is <= 256*test_ratio i.e 51}
This ensures that the test-set will remain consistent across multiple runs, even if you refresh the dataset. The new test_set will contain 20% of the new instances, but it will not contain any instance that was previosly in the train_set.
First, a quick recap on hash functions:
A hash function f(x) is deterministic, such that if a==b, then f(a)==f(b).
Moreover, if a!=b, then with a very high probability f(a)!=f(b).
With this definition, a function such as f(x)=x%12345678 (where % is the modulo operator) meets the criterion above, so it is technically a hash function.However, most hash functions go beyond this definition, and they act more or less like pseudo-random number generators, so if you compute f(1), f(2), f(3),..., the output will look very much like a random sequence of (usually very large) numbers.
We can use such a "random-looking" hash function to split a dataset into a train set and a test set.
Let's take the MD5 hash function, for example. It is a random-looking hash function, but it outputs rather large numbers (128 bits), such as 136159519883784104948368321992814755841.
For a given instance in the dataset, there is 50% chance that its MD5 hash will be smaller than 2^127 (assuming the hashes are unsigned integers), and a 25% chance that it will be smaller than 2^126, and a 12.5% chance that it will be smaller than 2^125. So if I want to split the dataset into a train set and a test set, with 87.5% of the instances in the train set, and 12.5% in the test set, then all I need to do is to compute the MD5 hash of some unchanging features of the instances, and put the instances whose MD5 hash is smaller than 2^125 into the test set.
If I want precisely 10% of the instances to go into the test set, then I need to checkMD5 < 2^128 * 10 / 100.
This would work fine, and you can definitely implement it this way if you want. However, it means manipulating large integers, which is not always very convenient, especially given that Python's hashlib.md5() function outputs byte arrays, not long integers. So it's simpler to just take one or two bytes in the hash (anywhere you wish), and convert them to a regular integer. If you just take one byte, it will look like a random number from 0 to 255.
If you want to have 10% of the instances in the test set, you just need to check that the byte is smaller or equal to 25. It won't be exactly 10%, but actually 26/256=10.15625%, but that's close enough. If you want a higher precision, you can take 2 or more bytes.

approximate histogram for streaming string values (card catalog algorithm?)

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal values for the counts, varying the bin width as necessary to keep the counts even. In the literature, these are sometimes called equi-height, or equi-depth histograms.
I'm not looking to do the usual word-count bar chart, I'm looking for something more like an old fashioned library card catalog where you have a set of drawers (bins), and one might hold SAM - SOLD,and the next bin SOLE-STE, while all of Y-ZZZ fits in a single bin. I want to calculate where to put the cutoffs for each bin.
Is there (A) a known algorithm for this, similar to approximate histograms for numeric values? or (B) suggestions on how to encode the strings in a way that a standard numeric histogram algorithm would work. The algorithm should not require prior knowledge of string population.
The best way I can think to do it so far is to simply wait until I have some reasonable amount of data, then form logical bins by:
number_of_strings / bin_count = number_of_strings_in_each_bin
Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.
This has two weaknesses for my use-case. First, it requires two iterations over a potentially very large number of strings, one for the count, one to find the endpoints. More importantly, a good histogram implementation can give an estimate of where in a bin a value falls, and this would be really useful.
Thanks.
If we can't make any assumptions about the data, you are going to have to make a pass to determine bin size.
This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
Of course, if you can make some assumptions about the data, here are some that might help:
For example, you might not know the exact size, but you might know that the value will fall in some interval [a, b]. If you want at most n bins, make the bin size == a/n.
Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m elements on your pass and dump it into an array, where m is something reasonable based on context.
Then, to find the bin endpoints, you'd find the element at size/n/m in your array.
The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size, from a population of an unknown size. See Wikipedia for more details. Reservoir sampling provides a random sample regardless of whether the stream is ordered or not.
We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.
For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges to provide an even number of results in each range (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall within two values: If there are 100 equal sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population falls between those values.
This approach has the advantage of working for any Comparable data type.

Near-perfect hash on known strings

Given a set of 125,000 strings, table size of 250,000 (so load factor .5), and also given that these strings never change, what is a good process for finding a better hash function?
Strings are 1-59 characters long, contain 72 unique characters (typical ascii values), average length and median length is 7 characters.
Approaches tried so far (hash always eventually mod table size)
(suggested by someone) md5 with linear probing (48)
Python built-in hash (max 40 probes per search)
Custom hash with quadratic probing (25)
Polynomial with prime coefficient, double hash with different prime coefficient, search primes 1-1000 for optimal pair (13)
Do previous 5 probes deep, then generate an array of size 256 that contains largest contiguous blocks of free space left in table, then use those mod 256 with linear probing (11)
Cuckoo hashing with three independent hash functions, but haven't found any combination of hash functions to avoid infinite loops
Given that the load factor is .5, is there some theoretical limit on how well the hash function can work? Can it ever be perfect without a very massive additional lookup table?
I have read that minimal pefect hashing requires ~1.6 bits/key, and current best results are ~2.5 bits/key. But this is for minimal (table size = # keys). Surely in my situation we can get very close to perfect, if not perfect, with quite a small lookup table?
Speed of hash function is immaterial in this case by the way.
Have you thought about using two independent hash functions? Variants of cuckoo hashing can build hash tables with surprisingly high load factors using only two hash functions.
Unmodified cuckoo hashing (each item hashes to exactly one of its two locations) attains a load factor of .5 with constant probability. If you modify it to use buckets of size two (so each item hashes to one of two buckets, so one of four locations, and you evict the oldest element of a bucket), I believe you can get load factors of around 0.8 or 0.9 without unreasonably long worst-case insertion times.
In your question as posed, there are 250000^125000 possible mappings from strings to table cells. 250000*249999*...*125001 of them are injective ("perfect hash functions"). Approximate the latter number using Stirling; taking the difference of the logs of these two numbers, you see that a randomly-chosen function will be a perfect hash with probability about 2^(-55000). Meaning that (with astonishingly high probability) there exists a 55-kilobit table that specifies a perfect hash function whose size is "only" 55 kilobits and also there isn't anything substantially smaller. (Finding this table is another matter. Also, note that this information-theoretic approach assumes that no probing whatsoever is done.)

Resources