How to represent the Kademlia distance metric as an integer

I'm new to P2P networking and am currently trying to understand some of the basics specified by the Kademlia papers. The main thing I cannot understand is the Kademlia distance metric. All the papers define the distance as the XOR of two IDs. The ID size is 160 bits, so the result also has 160 bits.
The question: what is a convenient way to represent this distance as an integer?
Some implementations that I checked use the following:
distance = 160 - prefix length (where prefix length is the number of leading zero bits).
Is that a correct approach?

Some implementations that I checked use the following: distance = 160 - prefix length (where prefix length is the number of leading zero bits). Is that a correct approach?
That approach is based on an early revision of the Kademlia paper and is insufficient for implementing some of the later chapters of the final paper.
A full-fledged implementation should use a tree-like routing table that orders buckets by their absolute position in the keyspace, where buckets can be resized as bucket splitting happens.
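For illustration, here is a minimal Python sketch of such a routing table: buckets partition the keyspace, stay ordered by their position in it, and the bucket covering the node's own ID is split when it overflows. The bucket capacity of 20 and all names are assumptions for the sketch, not taken from any particular implementation.

```python
ID_BITS = 160

class Bucket:
    def __init__(self, lo, hi, capacity=20):
        self.lo, self.hi = lo, hi           # this bucket covers IDs in [lo, hi)
        self.capacity = capacity            # "k" in the paper; 20 is an assumed value
        self.nodes = []                     # node IDs currently stored in the bucket

    def covers(self, node_id):
        return self.lo <= node_id < self.hi

class RoutingTable:
    def __init__(self, own_id):
        self.own_id = own_id
        self.buckets = [Bucket(0, 1 << ID_BITS)]   # start with one bucket for the whole keyspace

    def _bucket_for(self, node_id):
        return next(b for b in self.buckets if b.covers(node_id))

    def insert(self, node_id):
        bucket = self._bucket_for(node_id)
        if node_id in bucket.nodes:
            return
        if len(bucket.nodes) < bucket.capacity:
            bucket.nodes.append(node_id)
        elif bucket.covers(self.own_id):
            self._split(bucket)             # only the bucket containing our own ID is split
            self.insert(node_id)            # retry after the split
        # else: bucket is full and far from us -> drop (or ping/replace, per the paper)

    def _split(self, bucket):
        mid = (bucket.lo + bucket.hi) // 2
        low = Bucket(bucket.lo, mid, bucket.capacity)
        high = Bucket(mid, bucket.hi, bucket.capacity)
        for n in bucket.nodes:
            (low if low.covers(n) else high).nodes.append(n)
        i = self.buckets.index(bucket)
        self.buckets[i:i + 1] = [low, high]  # buckets remain ordered by keyspace position
```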
The ID size is 160 bits, so the result also has 160 bits. The question: what is a convenient way to represent this distance as an integer?
The distances themselves are 160-bit integers. You can use a big-integer class or roll your own based on arrays. To get shared-prefix bit counts you just have to count the leading zeroes; those counts scale logarithmically with the network size and should normally fit in much smaller integers once you're done.
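In Python, for example, arbitrary-precision integers make both the distance and the prefix length one-liners. The sketch below derives 160-bit IDs from SHA-1 purely for illustration; the function names are made up.

```python
import hashlib

def node_id(data: bytes) -> int:
    # 160-bit ID, here simply the SHA-1 of some node identifier (illustrative choice)
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

def distance(a: int, b: int) -> int:
    return a ^ b                          # the XOR metric itself: a 160-bit integer

def prefix_length(a: int, b: int, id_bits: int = 160) -> int:
    # leading zero bits of the XOR distance = length of the shared prefix
    return id_bits - (a ^ b).bit_length()

a, b = node_id(b"node-a"), node_id(b"node-b")
print(distance(a, b))        # the full 160-bit distance, as an arbitrary-precision int
print(prefix_length(a, b))   # typically small; for your closest neighbours it grows ~log2(network size)
```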

Related

Kademlia: the relationship between distance and the height of the smallest subtree

I was looking at the Kademlia paper, and there is something I couldn't understand.
In a fully-populated binary tree of 160-bit IDs, the magnitude of the distance between two IDs is the height of the smallest subtree containing them both.
d(101, 010) = 5 ^ 2 = 7 (binary 111),
but the height of their lowest common ancestor (the root) is 4 if you count levels from one, or 3 if you count from zero.
This result is obviously wrong, so I must have misunderstood something. How should I interpret this sentence?
I am looking forward to your reply. Thank you.
Pseudo Reliable Broadcast in the Kademlia P2P System
Kademlia, in turn, organizes its nodes into a binary tree. (For an in-depth discussion of the internal mechanisms of Kademlia, please refer to [2].) Distance between nodes is calculated using the XOR (exclusive or) function, which essentially captures the idea of the binary-tree topology. For any nodes A and B, the magnitude of their distance d(A,B) = A ⊕ B, i.e. the most significant nonzero bit of d, is the height of the smallest subtree containing both of them.
Kademlia: A Peer-to-peer Information System Based on the XOR Metric
We next note that XOR captures the notion of distance implicit in our binary-tree-based sketch of the system. In a fully-populated binary tree of 160-bit IDs, the magnitude of the distance between two IDs is the height of the smallest subtree containing them both. When a tree is not fully populated, the closest leaf to an ID x is the leaf whose ID shares the longest common prefix of x. If there are empty branches in the tree, there might be more than one leaf with the longest common prefix. In that case, the closest leaf to x will be the closest leaf to ID x̃ produced by flipping the bits in x corresponding to the empty branches of the tree.
That sentence is talking about the magnitude of the distance, not the exact distance. The exact distance is simply the XOR of both addresses.
In the particular case of 101 and 010 the distance is 111, the maximal possible distance, so they share no common subtree other than the whole tree itself; the magnitude is therefore 3 bits (assuming a 3-bit keyspace), which is also the maximal height. The equivalent in CIDR subnetting would be the /0 mask, i.e. 0 shared prefix bits.
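Spelled out as a quick Python check (nothing implementation-specific, just the arithmetic from the example above):

```python
a, b = 0b101, 0b010
d = a ^ b                        # exact XOR distance: 0b111 == 7
print(bin(d), d.bit_length())    # '0b111' 3 -> magnitude is 3 bits, the height of the whole tree
```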

What do negative vectors mean on word2vec?

I am doing research on travel reviews and used word2vec to analyze the reviews. However, when I showed my output to my adviser, he said that I have a lot of words with negative vector values and that only words with positive values are considered logical.
What could these negative values mean? Is there a way to ensure that all the vector values I get in my analysis are positive?
While some other word-modeling algorithms do in fact model words into spaces whose dimensions are zero or positive, and where the individual positive dimensions might be clearly meaningful to humans, that is not the case with the original, canonical 'word2vec' algorithm.
The sign of any word2vec word-vector – in a particular dimension, or in net magnitude – has no strong meaning. Meaningful words will be spread out in every direction from the origin point. Directions or neighborhoods in this space that loosely correlate to recognizable categories may appear anywhere, and skew with respect to any of the dimensional axes.
(Here's a related algorithm that does use non-negative constraints – https://www.cs.cmu.edu/~bmurphy/NNSE/. But most references to 'word2vec' mean the classic approach where dimensions usefully range over all reals.)
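If you want to see this for yourself, here is a small sketch (assuming the gensim 4.x API; the toy corpus is made up) that trains a model and counts negative components – typically roughly half of the dimensions come out negative:

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "hotel", "room", "was", "clean", "and", "quiet"],
    ["great", "food", "friendly", "staff", "lovely", "view"],
    ["the", "beach", "was", "crowded", "but", "the", "view", "was", "great"],
] * 50                                  # repeat so the tiny corpus gives the model something to train on

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, seed=1, workers=1, epochs=20)
vec = model.wv["hotel"]
print((vec < 0).sum(), "of", len(vec), "dimensions are negative")
```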

Spark Hashing TF power of two feature dimension recommendation reasoning

According to https://spark.apache.org/docs/2.3.0/ml-features.html#tf-idf:
"HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3."
...
"Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the vector indices."
I tried to understand why using a power of two as the feature dimension maps words evenly, and tried to find some helpful documentation on the internet, but both attempts were unsuccessful.
Does somebody know, or have useful sources on, why using a power of two maps words evenly to vector indices?
The output of a hash function is b-bit, i.e., there are 2^b possible values to which a feature can be hashed. Additionally, we assume that the 2^b possible values appear uniformly at random.
If d is the feature dimension, an index for a feature f is determined as hash(f) MOD d. Again, hash(f) takes on 2^b possible values. It is easy to see that d has to be a power of two (i.e., a divisor of 2^b) itself in order for uniformity to be maintained.
For a counter-example, consider a 2-bit hash function and a 3-dimensional feature space. As per our assumptions, the hash function outputs 0, 1, 2, or 3 with probability 1/4 each. However, taking mod 3 results in 0 with probability 1/2, and 1 or 2 with probability 1/4 each. Therefore, uniformity is not maintained. On the other hand, if the feature space were 2-dimensional, it is easy to see that the result would be 0 or 1 with probability 1/2 each.
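A quick pure-Python check of that counter-example; the uniform 2-bit hash output is simply simulated as the values 0–3:

```python
from collections import Counter

values = range(2 ** 2)              # a uniform 2-bit hash: 0, 1, 2, 3
for d in (3, 4):                    # d = 3 is not a divisor of 2^b, d = 4 is
    counts = Counter(v % d for v in values)
    print(d, {idx: c / 4 for idx, c in sorted(counts.items())})
# d=3 -> {0: 0.5, 1: 0.25, 2: 0.25}           (index 0 over-represented)
# d=4 -> {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
```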

approximate histogram for streaming string values (card catalog algorithm?)

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal values for the counts, varying the bin width as necessary to keep the counts even. In the literature, these are sometimes called equi-height, or equi-depth histograms.
I'm not looking to do the usual word-count bar chart; I'm looking for something more like an old-fashioned library card catalog, where you have a set of drawers (bins), and one might hold SAM - SOLD, the next SOLE - STE, while all of Y - ZZZ fits in a single bin. I want to calculate where to put the cutoffs for each bin.
Is there (A) a known algorithm for this, similar to approximate histograms for numeric values? Or (B) suggestions on how to encode the strings in a way that a standard numeric histogram algorithm would work? The algorithm should not require prior knowledge of the string population.
The best way I can think to do it so far is to simply wait until I have some reasonable amount of data, then form logical bins by:
number_of_strings / bin_count = number_of_strings_in_each_bin
Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.
This has two weaknesses for my use-case. First, it requires two iterations over a potentially very large number of strings, one for the count, one to find the endpoints. More importantly, a good histogram implementation can give an estimate of where in a bin a value falls, and this would be really useful.
Thanks.
If you can't make any assumptions about the data, you are going to have to make a pass over it to determine bin size.
This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
Of course, if you can make some assumptions about the data, here are some that might help:
For example, you might not know the exact size, but you might know that the values will fall in some interval [a, b]. If you want at most n bins, make the bin size (b - a) / n.
Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m-th element as you go and dumping it into an array, where m is something reasonable based on context.
Then, to find the bin endpoints, you'd take the elements at multiples of size / (n * m) in that array (see the sketch below).
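A rough sketch of that one-pass idea in Python; the defaults for m and n, and the function name, are illustrative choices:

```python
def approximate_cutoffs(sorted_stream, n=4, m=100):
    """One pass: keep every m-th string, then take every (len/n)-th sample as a cutoff."""
    sample = [s for i, s in enumerate(sorted_stream) if i % m == 0]
    step = max(len(sample) // n, 1)
    return [sample[i] for i in range(step, len(sample), step)][: n - 1]
```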
The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size from a population of unknown size. See Wikipedia for more details. Reservoir sampling provides a random sample regardless of whether the stream is ordered or not.
We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.
For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges to provide an equal number of results in each range (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall between two values: if there are 100 equal-sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population falls between those values.
This approach has the advantage of working for any Comparable data type.
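For reference, here is roughly the same approach sketched in Python instead of with Guava's RangeMap: Algorithm R for the reservoir, then a sorted list of cutoffs plus bisect for the lookups. All names are illustrative.

```python
import bisect
import random

def reservoir_sample(stream, k, rng=random):
    """Classic Algorithm R: uniform sample of size k from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)        # inclusive; keeps each item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

def build_cutoffs(sample, bins):
    """Pick bins-1 cutoffs so each bin holds roughly len(sample)/bins sampled values."""
    ordered = sorted(sample)             # assumes len(sample) >= bins
    step = len(ordered) // bins
    return [ordered[i * step] for i in range(1, bins)]

def bin_index(cutoffs, value):
    return bisect.bisect_right(cutoffs, value)    # 0 .. bins-1

# Estimating the share of the population between two values, as in the text above:
# (bin_index(cutoffs, hi) - bin_index(cutoffs, lo)) / bins
```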

Near-perfect hash on known strings

Given a set of 125,000 strings, table size of 250,000 (so load factor .5), and also given that these strings never change, what is a good process for finding a better hash function?
Strings are 1-59 characters long, contain 72 unique characters (typical ascii values), average length and median length is 7 characters.
Approaches tried so far (the hash is always eventually taken mod the table size); the number in parentheses is the worst-case probe count:
(suggested by someone) md5 with linear probing (48)
Python built-in hash (max 40 probes per search)
Custom hash with quadratic probing (25)
Polynomial with prime coefficient, double hash with different prime coefficient, search primes 1-1000 for optimal pair (13)
Do the previous approach 5 probes deep, then generate an array of size 256 containing the largest contiguous blocks of free space left in the table, and use the hash mod 256 to index into that array, with linear probing (11)
Cuckoo hashing with three independent hash functions, but I haven't found any combination of hash functions that avoids infinite loops
Given that the load factor is .5, is there some theoretical limit on how well the hash function can work? Can it ever be perfect without a very massive additional lookup table?
I have read that minimal perfect hashing requires ~1.6 bits/key, and that the current best results are ~2.5 bits/key. But this is for minimal hashing (table size = number of keys). Surely in my situation we can get very close to perfect, if not perfect, with quite a small lookup table?
Speed of hash function is immaterial in this case by the way.
Have you thought about using two independent hash functions? Variants of cuckoo hashing can build hash tables with surprisingly high load factors using only two hash functions.
Unmodified cuckoo hashing (each item is stored in one of its two hash locations) attains a load factor of .5 with constant probability. If you modify it to use buckets of size two (so each item hashes to one of two buckets, i.e. one of four locations, and you evict the oldest element of a bucket), I believe you can get load factors of around 0.8 or 0.9 without unreasonably long worst-case insertion times.
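Here is a compact sketch (not a tuned implementation) of that bucketized variant: two hash functions, buckets of size two, random-walk eviction. The salts, the kick limit, and the use of Python's built-in hash are arbitrary choices for illustration.

```python
import random

class CuckooTable:
    def __init__(self, n_buckets, bucket_size=2, max_kicks=500):
        self.n = n_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(n_buckets)]
        self.seeds = (0x9E3779B97F4A7C15, 0xC2B2AE3D27D4EB4F)   # arbitrary salts

    def _positions(self, key):
        # two bucket choices derived from Python's built-in hash plus a salt
        return tuple(hash((seed, key)) % self.n for seed in self.seeds)

    def insert(self, key):
        for _ in range(self.max_kicks):
            for pos in self._positions(key):
                if len(self.buckets[pos]) < self.bucket_size:
                    self.buckets[pos].append(key)
                    return True
            # both candidate buckets are full: evict a random resident and re-insert it
            pos = random.choice(self._positions(key))
            victim = random.randrange(self.bucket_size)
            key, self.buckets[pos][victim] = self.buckets[pos][victim], key
        return False   # gave up; a real implementation would rehash with new salts or resize

    def contains(self, key):
        return any(key in self.buckets[pos] for pos in self._positions(key))
```

With 125,000 fixed strings and a target load factor of 0.8, that would be something like CuckooTable(78_125) (156,250 slots), rehashing with different salts in the event that insert ever returns False.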
In your question as posed, there are 250000^125000 possible mappings from strings to table cells. 250000*249999*...*125001 of them are injective ("perfect hash functions"). Approximate the latter number using Stirling; taking the difference of the logs of these two numbers, you see that a randomly-chosen function will be a perfect hash with probability about 2^(-55000). Meaning that (with astonishingly high probability) a table of about 55 kilobits is enough to specify a perfect hash function for your set, and there isn't anything substantially smaller. (Finding this table is another matter. Also, note that this information-theoretic approach assumes that no probing whatsoever is done.)
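The Stirling estimate is easy to double-check numerically; the snippet below just redoes the arithmetic of the paragraph above with log-gamma, using the figures from the question:

```python
from math import lgamma, log

cells, keys = 250_000, 125_000
log_injective = lgamma(cells + 1) - lgamma(cells - keys + 1)   # ln(250000 * 249999 * ... * 125001)
log_total = keys * log(cells)                                  # ln(250000 ** 125000)
print((log_injective - log_total) / log(2))                    # ~ -55000, i.e. P(perfect) ~ 2^-55000
```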
