I was reading the Kademlia paper and ran into a sentence I couldn't understand:
In a fully-populated binary tree of 160-bit IDs, the magnitude of the distance between two IDs is the height of the smallest subtree containing them both.
d(101,010) = 5 ^ 2 = 7
but the height of the lowest common ancestor is 4 (counting from one) or 3 (counting from zero at the root).
These numbers don't match, so I must be misreading something. How should I interpret this sentence?
I am looking forward to your reply. Thank you
Pseudo Reliable Broadcast in the Kademlia P2P System
Kademlia, in turn, organizes its nodes to a binary tree.
(For an in-depth discussion of the internal mechanisms of
Kademlia, please refer to [2].) Distance between nodes is
calculated using the XOR (exclusive or) function, which
essentially captures the idea of the binary tree topology. For
any nodes A and B, the magnitude of their distance
d(A,B)=A⊕B, e.g. the most significant nonzero bit of d is the
height of the smallest subtree containing both of them.
Kademlia: A Peer-to-peer Information System Based on the XOR Metric
We next note that XOR captures the notion of distance implicit in our binary-tree-based sketch of the system. In a fully-populated binary tree of 160-bit IDs,
the magnitude of the distance between two IDs is the height of the smallest
subtree containing them both. When a tree is not fully populated, the closest
leaf to an ID x is the leaf whose ID shares the longest common prefix of x. If
there are empty branches in the tree, there might be more than one leaf with the
longest common prefix. In that case, the closest leaf to x will be the closest leaf
to ID x̃ produced by flipping the bits in x corresponding to the empty branches
of the tree.
That sentence is talking about the magnitude of the distance, not the exact distance. The exact distance is simply the XOR between both addresses.
In the particular case of 101 and 010 the distance is 111, the maximal possible distance. They therefore share no common subtree other than the whole tree itself, so the magnitude is 3 bits (assuming a 3-bit keyspace), which is also the maximal height. The equivalent in CIDR subnetting would be the /0 mask, i.e. 0 shared prefix bits.
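As a rough illustration of the difference between the exact distance and its magnitude, here is a minimal Python sketch. The function names are made up for this example, and the "height" is taken to be the 1-based position of the most significant nonzero bit of the XOR, which is one reading of the paper's sentence.

```python
def xor_distance(a: int, b: int) -> int:
    """Exact Kademlia distance: bitwise XOR of the two IDs."""
    return a ^ b


def subtree_height(a: int, b: int) -> int:
    """Magnitude of the distance: 1-based position of the most
    significant differing bit, i.e. the height of the smallest
    subtree containing both IDs (0 if the IDs are equal)."""
    return (a ^ b).bit_length()


print(xor_distance(0b101, 0b010))    # 7 (binary 111)
print(subtree_height(0b101, 0b010))  # 3: only the whole 3-bit tree contains both
print(subtree_height(0b110, 0b111))  # 1: the two IDs are sibling leaves
```

With 3-bit IDs the only possible heights are 0 to 3, so 101 and 010 sit at the maximum.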
Given a set P of n points in 2D, for any point x in P, what is the fastest way to find out the farthest neighbor of x? By farthest neighbor, we mean a point in P which has the maximum Euclidean distance to x.
To the best of my knowledge, the current standard kNN search algorithm for various trees (R-Trees, quadtrees, kd-trees) was developed by:
G. R. Hjaltason and H. Samet., "Distance browsing in spatial
databases.", ACM TODS 24(2):265--318. 1999
It traverses the tree based on a priority queue of nearest nodes/entries. One key insight is that the algorithm also works for farthest neighbor search.
The basic algorithm uses a priority queue. The queue can contain tree nodes as well as data entries, all sorted by their distance to your search point.
As an initial step, it adds the root node to the priority queue. Then repeat the following until k entries have been found:
1. Take the first element from the queue. If it is an entry, return it. If it is a node, add all elements in the node to the priority queue.
2. Repeat step 1.
The paper describes an implementation for R-Trees, but they claim it can be applied to most tree-like structures. I have implemented the nearest neighbor version myself for R-Trees and PH-Trees (a special type of quadtree), both in Java. I think I know how to do it efficiently for KD-Trees but I believe it is somewhat complicated.
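The queue-based traversal described above can be sketched for the farthest-neighbor case as follows. This is only a toy, not the paper's R-Tree implementation: the `Node` class is a hypothetical bounding-box tree built by median splits, and the priority of a node is the maximum possible distance from the query to its bounding box (an upper bound on the distance of anything inside it).

```python
import heapq
import itertools
import math


class Node:
    """Hypothetical 2D bounding-box tree; leaf_size must be >= 1."""

    def __init__(self, points, leaf_size=2, depth=0):
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        self.box = (min(xs), min(ys), max(xs), max(ys))
        self.points, self.children = [], []
        if len(points) <= leaf_size:
            self.points = list(points)
        else:
            axis = depth % 2  # alternate the split axis, kd-tree style
            pts = sorted(points, key=lambda p: p[axis])
            mid = len(pts) // 2
            self.children = [Node(pts[:mid], leaf_size, depth + 1),
                             Node(pts[mid:], leaf_size, depth + 1)]


def max_dist_to_box(q, box):
    """Largest distance from q to any location inside the box."""
    dx = max(abs(q[0] - box[0]), abs(q[0] - box[2]))
    dy = max(abs(q[1] - box[1]), abs(q[1] - box[3]))
    return math.hypot(dx, dy)


def farthest_first(root, q):
    """Yield (point, distance) pairs in order of decreasing distance."""
    tie = itertools.count()  # tie-breaker so the heap never compares items
    heap = [(-max_dist_to_box(q, root.box), next(tie), root)]
    while heap:
        neg_d, _, item = heapq.heappop(heap)
        if isinstance(item, Node):
            # Expand the node: children by upper bound, points by true distance.
            for child in item.children:
                heapq.heappush(
                    heap, (-max_dist_to_box(q, child.box), next(tie), child))
            for p in item.points:
                heapq.heappush(heap, (-math.dist(q, p), next(tie), p))
        else:
            yield item, -neg_d


points = [(0, 0), (1, 5), (4, 4), (-3, 2), (6, -1), (2, 2)]
root = Node(points)
print(next(farthest_first(root, (0, 0)))[0])  # (6, -1)
```

Because the generator is lazy, taking only the first result expands just the nodes whose upper bound exceeds the best true distance seen so far; taking k results gives the k farthest neighbors.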
Does the opposite of Kruskal's algorithm for minimum spanning tree work for it? I mean, choosing the max weight (edge) every step?
Any other idea to find maximum spanning tree?
Yes, it does.
One method for computing the maximum weight spanning tree of a network G –
due to Kruskal – can be summarized as follows.
1. Sort the edges of G into decreasing order by weight. Let T be the set of edges comprising the maximum weight spanning tree. Set T = ∅.
2. Add the first edge to T.
3. Add the next edge to T if and only if it does not form a cycle in T. If there are no remaining edges exit and report G to be disconnected.
4. If T has n−1 edges (where n is the number of vertices in G) stop and output T. Otherwise go to step 3.
Source: https://web.archive.org/web/20141114045919/http://www.stats.ox.ac.uk/~konis/Rcourse/exercise1.pdf.
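The steps above can be sketched in Python roughly as follows. The function name and the `(weight, u, v)` edge format are assumptions for this example, and a union-find structure stands in for the "does not form a cycle" check.

```python
def maximum_spanning_tree(n, edges):
    """Kruskal with edges sorted in decreasing order of weight.

    n:     number of vertices, labelled 0..n-1
    edges: list of (weight, u, v) tuples of an undirected graph
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so no cycle
            parent[ru] = rv
            tree.append((w, u, v))
            if len(tree) == n - 1:
                break
    if len(tree) != n - 1:
        raise ValueError("graph is disconnected")
    return tree


edges = [(1, 0, 1), (4, 1, 2), (3, 0, 2), (2, 2, 3), (5, 0, 3)]
tree = maximum_spanning_tree(4, edges)
print(sum(w for w, _, _ in tree))  # 12
```

The only change from the minimum-spanning-tree version is `reverse=True` in the sort.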
From Maximum Spanning Tree at Wolfram MathWorld:
"A maximum spanning tree is a spanning tree of a weighted graph having maximum weight. It can be computed by negating the weights for each edge and applying Kruskal's algorithm (Pemmaraju and Skiena, 2003, p. 336)."
If you invert the weight on every edge and minimize, do you get the maximum spanning tree? If that works you can use the same algorithm. Zero weights will be a problem, of course.
Although this thread is old, I have another approach for finding the maximum spanning tree (MST) in a graph G=(V,E).
We can apply a variant of Prim's algorithm to find the MST. For that I have to define the cut property for the maximum-weight edge.
Cut property: Suppose that at some point we have a set S containing the vertices that are already in the MST (assume for now it has been computed somehow). Now consider the set V \ S (the vertices not yet in the MST):
Claim: The maximum-weight edge from S to V \ S is in every MST (assuming it is strictly heavier than every other edge crossing the cut; with ties it is in at least one MST).
Proof: Suppose that, while we are adding vertices to our set S, the maximum-weight edge from S to V \ S is e=(u,v), where u is in S and v is in V \ S. Now consider an MST which does not contain e. Add the edge e to that MST. It will create a cycle. Traverse the cycle and find the vertices u' in S and v' in V \ S such that u' is the last vertex in S before we enter V \ S and v' is the first vertex in V \ S on the path in the cycle from u to v.
Remove the edge e'=(u',v'). The resulting graph is still connected, and the weight of e is greater than that of e' [since e is the maximum-weight edge from S to V \ S at this point], so this yields a spanning tree whose total weight is greater than that of the original MST. This is a contradiction, so e must be in every MST.
Algorithm to find MST:
Start from S={s} //s is the start vertex
while S does not contain all vertices
do
{
among all edges e=(u,v) with u in S and v not in S, pick the one with maximum weight
add v to S and e to the MST
}
end while
Implementation:
We can implement this using a max-heap/priority queue where the key is the maximum weight of an edge from a vertex in S to a vertex outside S, and the value is the vertex itself. Adding a vertex to S corresponds to Extract_Max on the heap, and after every Extract_Max we change the keys of the vertices adjacent to the vertex just added.
So it takes m Change_Key operations and n Extract_Max operations.
Extract_Max and Change_Key can both be implemented in O(log n), where n is the number of vertices.
So this takes O(m log n) time, where m is the number of edges in the graph.
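A minimal Python sketch of this Prim-style approach is shown below. It differs from the description in one respect: Python's `heapq` has no Change_Key, so the sketch pushes candidate edges with negated weights and skips stale entries ("lazy deletion"), which gives O(m log m) = O(m log n) time. The function name and the adjacency-list format are assumptions for this example.

```python
import heapq


def prim_maximum_spanning_tree(adj, start):
    """adj: {vertex: [(weight, neighbor), ...]} for a connected
    undirected graph. Returns a list of (u, v, weight) tree edges."""
    in_tree = {start}
    # heapq is a min-heap, so store negated weights to pop the heaviest edge.
    heap = [(-w, start, v) for w, v in adj[start]]
    heapq.heapify(heap)
    tree = []
    while heap and len(in_tree) < len(adj):
        neg_w, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue  # stale entry: v was already reached by a heavier edge
        in_tree.add(v)
        tree.append((u, v, -neg_w))
        for w, x in adj[v]:
            if x not in in_tree:
                heapq.heappush(heap, (-w, v, x))
    return tree


adj = {
    0: [(1, 1), (3, 2), (5, 3)],
    1: [(1, 0), (4, 2)],
    2: [(4, 1), (3, 0), (2, 3)],
    3: [(2, 2), (5, 0)],
}
tree = prim_maximum_spanning_tree(adj, 0)
print(tree)  # [(0, 3, 5), (0, 2, 3), (2, 1, 4)]
```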
Let me provide an improvement algorithm:
First, construct an arbitrary spanning tree (using BFS or DFS).
Then pick an edge outside the tree and add it to the tree; this forms a cycle. Drop the smallest-weight edge in that cycle.
Continue doing this until all the remaining edges have been considered.
Thus, we'll get the maximum spanning tree.
The resulting tree satisfies the following condition: any edge outside the tree, if added, forms a cycle, and its weight is <= the weight of every other edge in that cycle.
In fact, this is a necessary and sufficient condition for a spanning tree to be maximum spanning tree.
Pf.
Necessary: It's obvious that this is necessary, or we could swap edge to make a tree with a larger sum of edge weights.
Sufficient: Suppose tree T1 satisfies this condition, and T2 is the maximum spanning tree.
The edges of T1 ∪ T2 fall into three groups: T1-only edges, T2-only edges, and T1 ∩ T2 edges. If we add a T1-only edge (x1, xk) to T2, it forms a cycle, and we claim that this cycle must contain a T2-only edge with the same weight as (x1, xk). Exchanging these two edges produces a tree that has one more edge in common with T2 and the same total weight. Repeating this, we eventually obtain T2, so T1 is also a maximum spanning tree.
Proof of the claim:
Suppose it is not true. The cycle must contain a T2-only edge, since T1 is a tree. If none of the T2-only edges has a weight equal to that of (x1, xk), then each of the T2-only edges makes a loop with tree T1, so T1 has a loop, which is a contradiction.
This algorithm taken from UTD professor R. Chandrasekaran's notes. You can refer here: Single Commodity Multi-terminal Flows
Negating the weights of the original graph and computing the minimum spanning tree of the negated graph gives the right answer. Here is why: for the same spanning tree in both graphs, the total weight in one graph is the negation of the total weight in the other. So the minimum spanning tree of the negated graph is the maximum spanning tree of the original one.
Simply reversing the sort order and choosing a heavy edge across a vertex cut does not guarantee a maximum spanning forest (Kruskal's algorithm generates a forest, not a tree). If all edges have negative weights, the maximum spanning forest obtained from the reversed Kruskal would still be a negative-weight path. However, the ideal answer is a forest of disconnected vertices, i.e. a forest of |V| singleton trees, or |V| components with a total weight of 0 (not the least negative).
Reverse the order of the weights (you can achieve this by negating each weight and adding a large number, whose purpose is to keep the weights non-negative). Then run your familiar greedy minimum spanning tree algorithm.
I'm new to P2P networking and I'm currently trying to understand some basic things specified in the Kademlia papers. The main thing that I cannot understand is the Kademlia distance metric. All papers define the distance as the XOR of two IDs. The ID size is 160 bits, so the result also has 160 bits.
The question: what is a convenient way to represent this distance as integer?
Some implementations, that I checked, use the following:
distance = 160 - prefix length (where prefix length is number of leading zeros).
Is this a correct approach?
Some implementations, that I checked, use the following: distance = 160 - prefix length (where prefix length is number of leading zeros). Is this a correct approach?
That approach is based on an early revision of the Kademlia paper and is insufficient to implement some of the later chapters of the final paper.
A full-fledged implementation should use a tree-like routing table that orders buckets by their absolute position in the keyspace which can be resized when bucket splitting happens.
The ID size is 160 bits, so the result also has 160 bits. The question: what is a convenient way to represent this distance as integer?
The distance metrics are 160-bit integers. You can use a big-integer class or roll your own based on arrays. To get the shared prefix bit count you just have to count the leading zeroes; that count scales logarithmically with the network size and should normally fit in a much smaller integer once you're done.
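As an illustration, in a language with built-in arbitrary-precision integers such as Python the full 160-bit distance can be held in a plain `int`. The sketch below assumes IDs are 20-byte big-endian values; the function names are made up for this example.

```python
ID_BITS = 160  # Kademlia ID size in bits (20 bytes)


def xor_distance(a: bytes, b: bytes) -> int:
    """Full XOR distance as an arbitrary-precision integer."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


def shared_prefix_bits(a: bytes, b: bytes) -> int:
    """Number of leading bits the two IDs have in common,
    i.e. the number of leading zeros of the XOR distance."""
    return ID_BITS - xor_distance(a, b).bit_length()


zero = bytes(20)
last_bit = bytes(19) + b"\x01"   # differs from zero only in the lowest bit
first_bit = b"\x80" + bytes(19)  # differs from zero in the highest bit

print(shared_prefix_bits(zero, last_bit))   # 159
print(shared_prefix_bits(zero, first_bit))  # 0

# The full distance, not the prefix length, is what orders peers:
peers = [first_bit, last_bit]
print(sorted(peers, key=lambda p: xor_distance(zero, p))[0] == last_bit)  # True
```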
I'm using a static KD-Tree for nearest neighbor search in 3D space. However, the client's specifications have now changed so that I'll need a weighted nearest neighbor search instead. For example, in 1D space, I have a point A with weight 5 at 0, and a point B with weight 2 at 4; the search should return A if the query point is from -5 to 5, and should return B if the query point is from 5 to 6. In other words, the higher-weighted point takes precedence within its radius.
Google hasn't been any help - all I get is information on the K-nearest neighbors algorithm.
I can simply remove points that are completely subsumed by a higher-weighted point, but this generally isn't the case (usually a lower-weighted point is only partially subsumed, like in the 1D example above). I could use a range tree to query all points in an NxNxN cube centered on the query point and determine the one with the greatest weight, but the naive implementation of this is wasteful - I'll need to set N to the point with the maximum weight in the entire tree, even though there may not be a point with that weight within the cube, e.g. let's say the point with the maximum weight in the tree is 25, then I'll need to set N to 25 even though the point with the highest weight for any given cube probably has a much lower weight; in the 1D case, if I have a point located at 100 with weight 25 then my naive algorithm would need to set N to 25 even if I'm outside of the point's radius.
To sum up, I'm looking for a way that I can query the KD tree (or some alternative/variant) such that I can quickly determine the highest-weighted point whose radius covers the query point.
FWIW, I'm coding this in Java.
It would also be nice if I could dynamically change a point's weight without incurring too high of a cost - at present this isn't a requirement, but I'm expecting that it may be a requirement down the road.
Edit: I found a paper on a priority range tree, but this doesn't exactly address the same problem in that it doesn't account for higher-priority points having a greater radius.
Use an extra dimension for the weight. A point (x,y,z) with weight w is placed at (N-w,x,y,z), where N is the maximum weight.
Distances in 4D are defined by…
d((a, b, c, d), (e, f, g, h)) = |a - e| + d((b, c, d), (f, g, h))
…where the second d is whatever your 3D distance was.
To find all potential results for (x,y,z), query a ball of radius N about (0,x,y,z).
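To see why this works: the lifted distance from (0,x,y,z) to a stored point is (N-w) plus the 3D distance, so it is at most N exactly when the 3D distance is at most w. The Python sketch below checks this on the 1D example from the question (embedded in 3D). A brute-force scan stands in for the tree's ball query, and all names are hypothetical.

```python
import math


def lift(point, weight, max_weight):
    """Place a weighted 3D point at (N - w, x, y, z)."""
    return (max_weight - weight, *point)


def dist4(p, q):
    """|a - e| plus the ordinary (here Euclidean) 3D distance."""
    return abs(p[0] - q[0]) + math.dist(p[1:], q[1:])


def best_covering(points, query, max_weight):
    """points: list of ((x, y, z), weight). Returns the (weight, point)
    pair with the highest weight whose radius covers the query, or None.
    The linear scan is a stand-in for a ball query on the 4D tree."""
    q4 = (0, *query)
    hits = [(w, pt) for pt, w in points
            if dist4(lift(pt, w, max_weight), q4) <= max_weight]
    return max(hits, default=None)


# A has weight 5 at x=0, B has weight 2 at x=4; N = 5.
points = [((0, 0, 0), 5), ((4, 0, 0), 2)]
print(best_covering(points, (3, 0, 0), 5))    # (5, (0, 0, 0)): both cover, A wins
print(best_covering(points, (5.5, 0, 0), 5))  # (2, (4, 0, 0)): only B covers
print(best_covering(points, (7, 0, 0), 5))    # None
```

The ball query only returns the candidates; picking the highest-weighted one among them is a separate final step.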
I think I've found a solution: the nested interval tree, which is an implementation of a 3D interval tree. Rather than storing points with an associated radius that I then need to query, I instead store and query the radii directly. This has the added benefit that each dimension does not need to have the same weight (so that the radius is a rectangular box instead of a cubic box), which is not presently a project requirement but may become one in the future (the client only recently added the "weighted points" requirement, who knows what else he'll come up with).
Wikipedia entry says:
Each node has a "weight" equal to the length of its string plus the sum of all the weights in its left subtree. Thus a node with two children divides the whole string into two parts: the left subtree stores the first part of the string. The right subtree stores the second part and its weight is the sum of the two parts.
I'm a bit confused. It first says that a node's weight is the length of its string plus the sum of all the weights in its left subtree. Then it says that if a node has two children (and thus a left and a right subtree), the weight is the sum of both parts, and not just the left subtree. Looking at the diagram makes sense (the 9 directly below the 22 is a 9 and not larger because the right child/subtree of 7 does not contribute to the weight), but the phrasing seems off to me. Or am I misunderstanding something?
Yeah, the phrasing is off. The "weight" is the partition point, so it only includes the left substring (or the included string, if that's what you have instead).
You don't need to store the total length of a node, but modifying the rope requires that all parent nodes be notified of the change (which should be O(log n), so that's ok.)
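A small Python sketch of the "weight as partition point" idea follows. It is a toy rope with hypothetical `Leaf` and `Concat` classes rather than any particular library's implementation: a leaf's weight is the length of its string, and an inner node's weight is the total length of its left subtree only.

```python
class Leaf:
    def __init__(self, text):
        self.text = text
        self.weight = len(text)  # a leaf's weight is its own string length


class Concat:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.weight = length(left)  # total length of the LEFT subtree only


def length(node):
    """Total length: follow the right spine, summing the weights."""
    if isinstance(node, Leaf):
        return node.weight
    return node.weight + length(node.right)


def char_at(node, i):
    """Index lookup: the weight tells us which side index i falls on."""
    while isinstance(node, Concat):
        if i < node.weight:
            node = node.left
        else:
            i -= node.weight
            node = node.right
    return node.text[i]


rope = Concat(Concat(Leaf("Hello_"), Leaf("my_")), Leaf("name"))
print(rope.weight)       # 9: length of "Hello_my_", the right child is excluded
print(char_at(rope, 9))  # n
print(length(rope))      # 13
```

Because the right subtree never contributes to a node's weight, an edit in a left subtree has to update the weights of the ancestors that contain it on their left side, which is the parent notification mentioned above.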