Adding new nodes to Kademlia, building Kademlia routing tables - p2p

I can't quite wrap my brain around the joining process of Kademlia DHTs. I've seen a few tutorials and presentations online, but they all seem to say things the same way, and the pseudocode etc. is the same in most of them (actual copy/paste).
Can somebody give a high-level walkthrough of this?

I'm assuming you've read the Kademlia paper. Here's an excerpt from my article, An Introduction to Kademlia DHT & How It Works.
Some background information:
When you have a Kademlia network running, there should always be a node that every other node knows about so that new nodes can join the network; let's call this the bootstrap node BN.
K is a Kademlia constant that determines the size of the buckets in a node's routing table as well as the number of nodes a piece of data should be stored on.
Joining Process:
A new Node NN is created with a NodeId (assigned by some method) and an IP Address (the IP of the computer it's hosted on).
NN sends a LookupRequest(NN.NodeId) to BN. A lookup request asks the receiving node for the K closest nodes it knows to a given NodeId; in this case, BN will return the K closest nodes it knows to NN.
BN will now add NN to its routing table, so NN is now in the network.
NN receives the list of the K closest nodes to itself from BN. NN adds BN to its routing table.
NN now pings these K nodes received from BN, and the ones that reply are added to its routing table in the appropriate buckets based on distance. By pinging these nodes, they also learn of NN's existence and add NN to their routing tables.
NN is now connected to the network and is known by nodes on the network.
NN now loops through each of its K-buckets:
foreach (KB in NN's K-Buckets)
    1. NN generates a random NodeId `RNID`   // a NodeId that falls within KB's range
    2. NN sends LookupRequest(RNID) to the K closest nodes it knows to RNID.
    3. Each response contains the K closest nodes to RNID that the recipient knows.
    4. NN uses these responses to fill KB.
NN does this for each of its buckets in order to fill them.
After this operation, NN has a better idea of the nodes on the network at different distances away from itself.
Note: this step is not mandatory; however, I did it in My Implementation of Kademlia so that each node has better knowledge of the network when it joins. A rough sketch of the whole join procedure is given below.
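To make the steps above concrete, here is a rough sketch of the join procedure in Python. The node objects and their methods (lookup_request, ping, routing_table.add, closest_to) are placeholders I invented for illustration, not part of any particular Kademlia library, and a real implementation would perform the lookups iteratively and handle timeouts.

```python
import random

K = 20          # bucket size / replication constant
ID_BITS = 160   # size of the NodeId space in bits

def random_id_in_bucket(node_id, bucket_index):
    # Pick a NodeId whose XOR distance from node_id lies in
    # [2^i, 2^(i+1)), i.e. an ID that belongs to bucket i.
    distance = random.randrange(2 ** bucket_index, 2 ** (bucket_index + 1))
    return node_id ^ distance

def join(nn, bn):
    # 1. Ask the bootstrap node BN for the K closest nodes to NN's own NodeId.
    #    (BN adds NN to its routing table while handling this request.)
    closest = bn.lookup_request(nn.node_id)

    # 2. NN adds BN to its own routing table.
    nn.routing_table.add(bn)

    # 3. Ping the returned contacts; the ones that reply go into the proper
    #    bucket, and they in turn learn of NN and add it to their tables.
    for contact in closest:
        if nn.ping(contact):
            nn.routing_table.add(contact)

    # 4. Optional bucket refresh: for every bucket, look up a random NodeId
    #    that falls inside that bucket's range and add whatever comes back.
    for i in range(ID_BITS):
        rnid = random_id_in_bucket(nn.node_id, i)
        for contact in nn.routing_table.closest_to(rnid, K):
            for found in contact.lookup_request(rnid):
                nn.routing_table.add(found)
```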
I wrote a full introduction to Kademlia at An Introduction to Kademlia DHT & How It Works

My guess is that it uses some super nodes and geospatial information to compute a minimum spanning tree. It could also compute a Voronoi diagram, or its dual the Delaunay triangulation, from the super nodes and use it to run a proximity search. Here is an example: http://www.mathworks.de/de/help/matlab/math/spatial-searching.html.
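The linked MATLAB page is about spatial searching; for what it's worth, a rough Python equivalent of such a proximity search (just an illustration of the geometric idea, not anything Kademlia-specific) could use scipy.spatial; the coordinates below are made up:

```python
import numpy as np
from scipy.spatial import Delaunay, cKDTree

# Hypothetical 2-D coordinates of some "super nodes".
points = np.random.rand(50, 2)

# Delaunay triangulation (the dual of the Voronoi diagram).
tri = Delaunay(points)

# k-d tree for fast nearest-neighbour (proximity) queries.
tree = cKDTree(points)
query = np.array([0.5, 0.5])
distance, index = tree.query(query)            # closest super node to `query`
print("nearest super node:", index, "at distance", distance)

# Which triangle of the triangulation contains the query point?
print("containing simplex:", tri.find_simplex(query))
```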

Related

Spatstat: Export estimated cluster center coordinate points in fitted ThomasCluster model

If you fit, e.g., a Thomas cluster model (using kppm, for example), it will fit a model with some number X of clusters. Is there a way to extract where the centre of each of the X clusters is estimated to be? E.g., if the best-fit model on a ppp with 500 points has a mean of 250 points per cluster, we would expect 2 clusters to be estimated from the data. What are the centre coordinates of these two clusters?
Many thanks
kppm does not estimate the number of cluster centres or the locations of the cluster centres. It fits a clustered point process model to the point pattern, essentially by matching the K function of the model to the K function of the data. The fitted model only describes the probability distribution of the cluster centres (number and location) and the probability distribution of offspring points relative to their parents.
Estimation/prediction of the actual cluster locations is a much harder task (belonging to the class of missing data problems). You could try the R package mclust for this purpose. You can expect it to take a much longer time to compute.
The fitted model parameters obtained from kppm could be used to determine the cluster parameters in the mclust package to simplify the task.
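mclust itself is an R package; as a very rough Python analogue of what it does (model-based clustering with a Gaussian mixture, selecting the number of components by BIC), one could sketch something like the following with scikit-learn's GaussianMixture. This is a swapped-in technique for illustration, not mclust, and the coordinates are made up:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical point pattern: an (n, 2) array of (x, y) coordinates,
# e.g. exported from a spatstat ppp object.
xy = np.random.rand(500, 2)

# Fit mixtures with different numbers of components and keep the best BIC,
# which is roughly the model-selection strategy mclust uses.
best = min(
    (GaussianMixture(n_components=k, n_init=5).fit(xy) for k in range(1, 10)),
    key=lambda m: m.bic(xy),
)
print("estimated number of clusters:", best.n_components)
print("estimated cluster centres:\n", best.means_)
```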

Find the outliers or anomaly in gps data (time, latitude, longitude, altitude)

I have GPS data (time, latitude, longitude, altitude). Based on this data, I want to determine the typical routes the device takes during a full week.
After determining the baseline routes, or the typical area frequented by the device, we can start detecting an anomaly whenever the device travels outside its frequent route/area.
Action: the process will then send an "alert" to the system when the device is traveling outside its frequent area/route.
Please suggest which machine learning algorithm would be useful; I am going to start with a clustering algorithm. Also, which Python libraries are useful for applying machine learning algorithms?
First of all, if you use Python, then use scikit-learn.
For this problem, there are multiple possibilities.
One way is indeed to use a clustering algorithm. To get the anomalies as well, you can use DBSCAN, an algorithm designed to find clusters and flag outliers.
Another way (assuming you have all the positions for each device) would be a more exotic approach: run a clustering algorithm on all the positions to find the important places, then run LDA (latent Dirichlet allocation) to extract the main topics (here the "words" would be the cluster indices, the "documents" would be the list of positions of each device, and the "topics" would be the main "routes").
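As a minimal sketch of the DBSCAN suggestion (the coordinates, eps and min_samples values below are made up and would need tuning for real GPS data):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical GPS fixes: rows of (latitude, longitude, altitude).
gps = np.array([
    [48.8566, 2.3522, 35.0],
    [48.8570, 2.3525, 36.0],
    [48.8568, 2.3520, 34.0],
    [40.7128, -74.0060, 10.0],   # a fix far away from the usual area
])

# Scale the features so that no single coordinate dominates the distance.
X = StandardScaler().fit_transform(gps)

# DBSCAN groups dense sets of positions; label -1 marks outliers,
# i.e. positions outside the frequently visited areas.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # e.g. [0 0 0 -1]
```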

Bayesian Network Key Benefit

I have some trouble fully understanding the benefits of Bayesian networks.
Am I correct that the key benefit of the network is that one does not need to use the chain rule of probability in order to calculate joint distributions?
So using the chain rule:
P(A_1, ..., A_n) = P(A_1) * P(A_2 | A_1) * ... * P(A_n | A_1, ..., A_{n-1})
leads to the same result as the following (assuming the nodes are structured as a Bayesian network)?
P(A_1, ..., A_n) = P(A_1 | parents(A_1)) * ... * P(A_n | parents(A_n))
The benefit of using the Bayesian network is exactly so that we can use the chain rule. This network can be thought of as representing a huge lookup table that tells you the probability of all possible joint events that the network represents. It is because some events are conditionally independent of other events that we don't need to store this huge lookup table but can distribute it to the node level on the network.
If you consider the nodes of a Bayesian network to be stored as a probability lookup table (i.e., storing the probability of observing this event, represented by the node, given the possible values for its parent nodes), this table is fairly small in comparison to the size of the network as a whole. The entire network then just consists of these small tables that are linked by the parent-child relationships. When you perform a calculation to obtain a joint probability (i.e., P(A_1 ... A_n) from above) you can efficiently iterate (using the chain rule) to calculate the probability of seeing the observation given the structure of the network.
Note that it is the structure of the network that provides this saving. In your example, the "parents(A_1)" clause is just a subset of the entire set of nodes. The structure implicitly tells us that A_1 is conditionally independent of the other nodes in the network, given the values of its parents. So we only apply the chain rule to the small set of nodes that can affect the node in question.
This small amount of computation is generally a small price to pay for the huge space saving that you obtain by using this structure.
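A minimal sketch of this factorisation on a tiny made-up network (Rain -> Sprinkler, and both -> GrassWet; the probabilities are invented purely for illustration):

```python
# Each node stores only a small conditional probability table (CPT)
# over its parents, instead of one huge joint table over all variables.
P_rain = {True: 0.2, False: 0.8}

P_sprinkler = {                    # P(Sprinkler | Rain)
    True:  {True: 0.01, False: 0.99},
    False: {True: 0.40, False: 0.60},
}

P_grass_wet = {                    # P(GrassWet | Rain, Sprinkler)
    (True, True):   {True: 0.99, False: 0.01},
    (True, False):  {True: 0.80, False: 0.20},
    (False, True):  {True: 0.90, False: 0.10},
    (False, False): {True: 0.00, False: 1.00},
}

def joint(rain, sprinkler, grass_wet):
    """P(Rain, Sprinkler, GrassWet) via the chain rule restricted to each
    node's parents: P(R) * P(S | R) * P(W | R, S)."""
    return (P_rain[rain]
            * P_sprinkler[rain][sprinkler]
            * P_grass_wet[(rain, sprinkler)][grass_wet])

# Any entry of the full joint table can be reconstructed on demand
# from the three small per-node tables.
print(joint(rain=True, sprinkler=False, grass_wet=True))   # 0.2 * 0.99 * 0.8
```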

Clustering of facebook-users with k-means

I got a list of Facebook user IDs from the following page:
Stanford Facebook-Data
If you look at the facebook_combined data, you can see that it is a list of user connections (edges). So, for instance, user 0 is connected to users 1, 2, 3 and so on.
Now my task is to find clusters in the dataset.
In the first step I used Node.js to read the file and save the data in an array like this:
array=[[0,1],[0,2], ...]
In the second step I used a k-means plugin for Node.js to cluster the data:
Cluster-Plugin
But I don't know if the result is right, because now I get clusters of edges and not clusters of users.
UPDATE:
I am trying out a Markov clustering implementation for Node.js. The Markov plugin, however, needs an adjacency matrix to build clusters. I implemented an algorithm in Java to save the matrix to a file.
Maybe you have other suggestions for how I could get clusters out of edges.
K-means assumes your input data is in an R^d vector space.
In fact, it requires the data to be this way, because it computes means as cluster centers, hence the name k-means.
So if you want to use k-means, then you need:
One row per data point (not an edge list)
A fixed-dimensionality data space where the mean is a useful center (usually you should have continuous attributes; on binary data the mean does not make much sense) and where least-squares is a meaningful optimization criterion (again, on binary data, least-squares does not have strong theoretical support)
On your Facebook data, you could try some embedding, but I'd have doubts about the trustworthiness of the result.
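If you still want k-means-style clusters of users, one hedged sketch of the embedding idea (with a made-up toy edge list, and no guarantee that the resulting clusters are meaningful on the real Facebook graph) is to build a user-by-user adjacency matrix from the edge list, embed each user as a short row vector, and cluster those rows, so you get clusters of users rather than clusters of edges:

```python
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Toy edge list in the same shape as the question's array: [user_a, user_b].
edges = np.array([[0, 1], [0, 2], [1, 2], [3, 4], [4, 5], [3, 5]])

# Symmetric adjacency matrix: one row (and column) per user.
n = edges.max() + 1
rows = np.concatenate([edges[:, 0], edges[:, 1]])
cols = np.concatenate([edges[:, 1], edges[:, 0]])
adj = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

# Embed every user as a short vector (one row per data point, as k-means
# requires), then cluster the users rather than the edges.
embedding = TruncatedSVD(n_components=2).fit_transform(adj)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embedding)
print(labels)   # e.g. [0 0 0 1 1 1]
```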

scikit-learn AgglomerativeClustering and connectivity

I am trying to use AgglomerativeClustering from scikit-learn to cluster points on a plane. The points are defined by coordinates (X, Y) stored in _XY.
Clusters are limited to a few neighbours through the connectivity matrix defined by
_C = kneighbors_graph(_XY, n_neighbors = 20).
I want some points not to be part of the same cluster, even if they are neighbours, so I modified the connectivity matrix to put 0 between these points.
The algorithm runs smoothly but, at the end, some clusters contain points that should not be together, i.e. some pairs for which I imposed _C = 0.
From the children, I can see that the problem arises when a cluster of two points (i, j) is already formed and k joins (i, j) even if _C[i,k] = 0.
So I was wondering how the connectivity constraint is propagated when the size of some clusters is larger than 2, since _C is not defined in that case.
Thanks!
So what seems to be happening in your case is that, despite your actively disconnecting points you do not want in one cluster, these points are still part of the same connected component, and the data associated with them still implies that they should be joined into the same cluster from a certain level up.
In general, AgglomerativeClustering works as follows: At the beginning, all data points are separate clusters. Then, at each iteration, two adjacent clusters are merged, such that the overall increase in discrepancy with the original data is minimal if we compare the original data with cluster means in L2 distance.
Hence, although you sever the direct link between two nodes, they can be clustered together one level higher by an intermediate node.
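A small sketch that reproduces this behaviour (the toy coordinates and parameter values are made up): even after the direct entry between two particular points is zeroed out, the two points can still end up in the same cluster once intermediate points have merged them into one group.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Toy stand-in for _XY: four closely spaced points and two far away.
_XY = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
                [10.0, 0.0], [11.0, 0.0]])

# Connectivity restricted to a few neighbours, as in the question.
_C = kneighbors_graph(_XY, n_neighbors=2)

# Try to forbid a direct link between points 0 and 2.
_C = _C.tolil()            # LIL format allows cheap single-entry edits
_C[0, 2] = 0
_C[2, 0] = 0

model = AgglomerativeClustering(n_clusters=2, connectivity=_C)
labels = model.fit_predict(_XY)
print(labels)   # points 0 and 2 typically still share a label, merged via point 1
```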
