I am trying to use AgglomerativeClustering from scikit-learn to cluster points on a plane. Points are defined by coordinates (X, Y) stored in _XY.
Clusters are limited to a few neighbours through the connectivity matrix defined by
_C = kneighbors_graph(_XY, n_neighbors=20)
I want some points not to be part of the same cluster, even if they are neighbours, so I modified the connectivity matrix and set the entries between these points to 0.
The algorithm runs smoothly but, at the end, some clusters contain points that should not be together, i.e. some pairs for which I imposed _C = 0.
From the children, I can see that the problem arises when a cluster of two points (i, j) has already formed and k joins (i, j), even though _C[i, k] = 0.
So I was wondering how the connectivity constraint is propagated when the size of some clusters grows beyond 2, since _C is not defined in that case.
Thanks!
So what seems to be happening in your case is that despite your explicit disconnection of the points you do not want in the same cluster, these points are still part of the same connected component, and the data associated with them still implies that they should be merged into the same cluster from a certain level up.
In general, AgglomerativeClustering works as follows: at the beginning, every data point is its own cluster. Then, at each iteration, two adjacent clusters are merged such that the overall increase in discrepancy with the original data is minimal, where the discrepancy compares the original data with the cluster means in L2 distance.
Hence, although you sever the direct link between two nodes, they can still be clustered together one level higher via an intermediate node.
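As a minimal sketch of this effect (with made-up collinear points, not your data): severing the direct edge between points 0 and 2 does not prevent them from being merged, because both remain connected to point 1:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import AgglomerativeClustering

_XY = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])

# Full connectivity, then sever the direct edge between points 0 and 2.
_C = np.ones((3, 3))
_C[0, 2] = _C[2, 0] = 0

model = AgglomerativeClustering(n_clusters=1, connectivity=csr_matrix(_C))
labels = model.fit_predict(_XY)
print(labels)  # [0 0 0]: points 0 and 2 end up together via point 1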
I am using the networkx package to analyse IMDb data and compute centrality (closeness and betweenness). The problem is that the graph has two types of nodes, namely actors and movies. I want to calculate centrality with respect to only the actors, not the graph overall.
The code -
import networkx as nx

T = nx.Graph()
T.add_nodes_from(demo_df.primaryName, bipartite=1)
T.add_nodes_from(demo_df.primaryTitle, bipartite=0)
# Add the edges to T rather than reassigning it with from_pandas_edgelist,
# which would discard the bipartite node attributes set above.
T.add_edges_from(zip(demo_df.primaryName, demo_df.primaryTitle))
nx.closeness_centrality(T)
nx.betweenness_centrality(T)
I don't want it to calculate/display the betweenness and closeness of the movies (Wings of Desire, Dopey Dicks, Studio Stoops); I want them calculated only for the actors.
For bipartite graphs, you have the networkx.algorithms.bipartite.centrality counterpart. For instance, for closeness_centrality the result will be a dictionary keyed by node with bipartite closeness centrality as the value. In the nodes argument, specify the nodes in one bipartite node set:
from networkx.algorithms import bipartite
part0_nodes, part1_nodes = bipartite.sets(T)
cs_partition0 = bipartite.centrality.closeness_centrality(T, part0_nodes)
For disconnected graphs, you may try obtaining the nodes from a given partition with:
partition = nx.get_node_attributes(T, 'bipartite')
part0_nodes = [node for node, p in partition.items() if p==0]
Note that the returned dictionary will still contain all nodes, even though you've specified the nodes from one partition in nodes, so you can filter the result down to the nodes in part0_nodes. This is mentioned in the notes section:
The nodes input parameter must contain all nodes in one bipartite node set, but the dictionary returned contains all nodes from both bipartite node sets. See the bipartite documentation (networkx.algorithms.bipartite) for further details on how bipartite graphs are handled in NetworkX.
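As a minimal sketch (with a made-up actor/movie graph standing in for your IMDb data), you can filter the returned dictionary down to one node set afterwards:

import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
B.add_nodes_from(["actor_a", "actor_b"], bipartite=1)
B.add_nodes_from(["movie_x", "movie_y"], bipartite=0)
B.add_edges_from([("actor_a", "movie_x"), ("actor_b", "movie_x"), ("actor_b", "movie_y")])

# Collect the actor-side nodes from the 'bipartite' attribute.
actors = {n for n, p in nx.get_node_attributes(B, "bipartite").items() if p == 1}
closeness = bipartite.centrality.closeness_centrality(B, actors)

# The returned dictionary covers both node sets; keep only the actors.
actor_closeness = {n: c for n, c in closeness.items() if n in actors}
print(actor_closeness)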
I'm using mean-shift clustering (https://scikit-learn.org/stable/modules/clustering.html#mean-shift), in which the cluster labels are obtained via nearest neighbors: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html
However, it's not clear how the cluster labels (0, 1, ...) are generated. Apparently, label 0 is the cluster with the most elements. Is this a general rule?
How do other algorithms work? Are labels assigned in a "random" sense, or do the algorithms assign label 0 to the largest cluster?
Thanks!
PS: it's easy to order the labels according to this rule; my question is more theoretical.
In many cases, the cluster order depends on the initialization. If you provide the initial values, then this order will be preserved.
If you do not provide such initial values, the order will usually be based on the data order. The first item is likely to belong to the first cluster, for example (setting aside noise points in some algorithms, such as DBSCAN).
Now quantity (cluster size) has an interesting effect: assuming that your data is randomly ordered (and not, for example, ordered by some synthetic data generation process), the first element is more likely to belong to the largest cluster, so this cluster is likely to come first even with a "random" order.
Now, in sklearn's mean-shift (which in my opinion contains an error in the final assignment rule), the authors apparently decided to sort by "intensity", but I don't remember any such rule in the original papers. https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/cluster/mean_shift_.py#L222
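If you do want label 0 to be the largest cluster, a minimal sketch (made-up 1-D data; not a rule any sklearn estimator promises) is to relabel by descending cluster size:

import numpy as np
from sklearn.cluster import MeanShift

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
labels = MeanShift(bandwidth=1.0).fit_predict(X)

# Map each old label to its rank when clusters are sorted by descending size.
order = np.argsort(-np.bincount(labels))
remap = np.empty_like(order)
remap[order] = np.arange(len(order))
labels_by_size = remap[labels]
print(labels_by_size)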
I want to have a measure of similarity between two points in a cluster.
Would the similarity calculated this way be an acceptable measure of similarity between the two datapoints?
Say I have two vectors, A and B, that are in the same cluster. I have trained a clustering model, denoted model, and model.computeCost() computes the squared distance between the input point and the corresponding cluster center.
(I am using Apache Spark MLlib)
val costA = model.computeCost(A)
val costB = model.computeCost(B)
val dissimilarity = math.abs(costA - costB)
Dissimilarity, i.e. the higher the value, the more unlike each other they are.
If you are just asking whether this is a valid metric, then the answer is almost: it is a valid pseudometric, provided that .computeCost is deterministic.
For simplicity, I denote f(A) := model.computeCost(A) and d(A, B) := |f(A) - f(B)|.
Short proof: d is the L1 distance applied to the image of some function, and thus is a pseudometric itself; it is a metric if f is injective (which, in general, yours is not).
Long(er) proof:
d(A,B) >= 0: yes, since |f(A) - f(B)| >= 0
d(A,B) = d(B,A): yes, since |f(A) - f(B)| = |f(B) - f(A)|
d(A,B) = 0 iff A = B: no, and this is why it is a pseudometric, since you can have many A != B with f(A) = f(B)
d(A,C) <= d(A,B) + d(B,C): yes, directly from the triangle inequality for absolute values.
If you are asking whether it will work for your problem, then the answer is: it might; it depends on the problem. There is no way to answer this without analysing your problem and data. As shown above, this is a valid pseudometric, so it will measure something decently behaved from a mathematical perspective. Whether it will work for your particular case is a completely different story. The good thing is that most algorithms which work for metrics will work with pseudometrics as well. The only difference is that you simply "glue together" points which have the same image (f(A) = f(B)); if this is not an issue for your problem, then you can apply this kind of pseudometric in any metric-based reasoning without problems. In practice, that means that if your f
computes the sum of squared distances between the input point and the corresponding cluster center
then this is actually the squared distance to the closest center (there is no summation involved when you consider a single point). This means that two points in two separate clusters are considered identical when they are equally far from their respective cluster centers. Consequently, your measure captures "how different the relations of the points to their respective clusters are". This is a well-defined, indirect dissimilarity computation; however, you have to be fully aware of what is happening before applying it (since it will have specific consequences).
Your "cost" is actually the distance to the center.
Points that have the same distance to the center are considered identical (distance 0), which creates a really odd pseudometric, because it ignores where on the circle of that radius the points lie.
It's not very likely this will work on your problem.
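A minimal sketch in plain Python (made-up centers; f mimics computeCost evaluated on a single point) makes this concrete: two distinct points on the same circle around a center get dissimilarity 0.

centers = [(0.0, 0.0), (10.0, 0.0)]

def f(p):
    # Squared distance to the closest center, mimicking computeCost
    # evaluated on a single point.
    return min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)

def d(a, b):
    return abs(f(a) - f(b))

A = (1.0, 0.0)  # distance 1 from the first center
B = (0.0, 1.0)  # a different point, also distance 1 from the first center
print(d(A, B))  # 0.0, although A != B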
What's the best way to achieve what follows:
(1)
My input data consists of three columns: Object, Category, Value. I need to cluster Objects based on Value, but the clusters need to be Category-specific, i.e. I need a cluster for every Category. It's impractical to split the file and load category-specific data individually.
Initially I thought it was simple (I was already able to cluster Objects for one specific Category) and loaded the data into a pair RDD where the key was the Category value. However, KMeans.train accepts an RDD, and I got stuck trying to make an RDD of the values for each key of the original RDD.
(2)
Is there a clustering method that returns the optimal number of clusters, other than starting with a low K and repeatedly retraining as K increases until the Within Set Sum of Squared Errors stabilizes?
(3)
Is there a clustering method where the sizes of the clusters can be controlled (the goal being to produce more balanced cluster sizes)?
Thank you.
Why is it impractical to split your data set?
this will not take longer than a single k-means iteration (one pass over the data set)
it will untangle the multiple problems you have, so some subsets can converge earlier, speeding up the overall process (see the sketch below).
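As a minimal sketch of the split-then-cluster idea (pandas and scikit-learn with made-up column names; the question itself uses Spark, so this only illustrates the approach):

import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "object":   ["a", "b", "c", "d", "e", "f"],
    "category": ["x", "x", "x", "y", "y", "y"],
    "value":    [1.0, 1.1, 5.0, 2.0, 2.1, 9.0],
})

# Fit one independent model per category.
models = {}
for category, group in df.groupby("category"):
    km = KMeans(n_clusters=2, n_init=10).fit(group[["value"]])
    models[category] = km
    df.loc[group.index, "cluster"] = km.labels_
print(df)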
Note that k-means works best on multivariate data. On 1-dimensional data it is much more efficient to sort the data and then do kernel density estimation (or even simply build histograms and let the user decide intuitively). Then you can easily handle all these "extras", such as ensuring a minimum cluster size.
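A minimal sketch of the 1-dimensional alternative (made-up data, arbitrary bandwidth): estimate the density and cut clusters at its local minima:

import numpy as np
from scipy.signal import argrelextrema
from sklearn.neighbors import KernelDensity

values = np.sort(np.array([1.0, 1.1, 1.2, 5.0, 5.1, 9.0]))
grid = np.linspace(values.min(), values.max(), 200)[:, None]

kde = KernelDensity(bandwidth=0.5).fit(values[:, None])
density = np.exp(kde.score_samples(grid))

# Cluster boundaries sit at the local minima of the estimated density.
cuts = grid[argrelextrema(density, np.less)[0], 0]
labels = np.searchsorted(cuts, values)
print(labels)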
I've implemented k-means in Java and have a bit of a head-scratcher. I select my initial centroids by choosing a random value in each dimension within the range of the data points' values. I've run into cases where this results in one or more of these centroids not ending up being the closest centroid to any data point. So what do I do for the next iteration? Just leave it at its original randomized value? Pick a new random value? Compute it as an average of the other centroids? It seems like this isn't accounted for in the original algorithm, but probably I've just missed something.
Most implementations of k-means define the initial centroids using actual data points, not random points in the bounding box spanned by the variables. However, some suggestions for solving your actual problem are below.
You could take another data-point at random and make it a new cluster centroid. This is very simple and fast to implement, and shouldn't affect the algorithm adversely.
You could also try making a smarter initial selection of cluster centroids using k-means++. This algorithm chooses the first centroid uniformly at random and draws each of the remaining K-1 centroids with probability proportional to its squared distance from the nearest centroid chosen so far, which tends to spread the centroids apart. By picking smarter centroids, you are much less likely to encounter the problem of a centroid being assigned zero data points.
If you wanted to be slightly more clever, you could also use the k-means++ seeding rule to choose a new centroid whenever an existing centroid gets assigned zero data points.
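A minimal sketch of k-means++ seeding (NumPy with made-up data; a Python stand-in, not your Java code):

import numpy as np

def kmeanspp_init(X, k, rng):
    # First centroid: a uniformly random data point.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid.
        d2 = np.min(np.linalg.norm(X[:, None] - np.array(centroids)[None], axis=2) ** 2, axis=1)
        # Draw the next centroid with probability proportional to d2.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
print(kmeanspp_init(X, 3, rng))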
The way I've used it, the initial values were taken as random points from the data set, not random points in the spanned space. That means each cluster initially has at least one point in it. You could still get unlucky with outliers, but with any luck you'll be able to detect this and restart with different points. (Provided "K clusters of points" is an adequate description of your data.)
Instead of picking random values (which can be pretty meaningless if the space of possible values is large in comparison to the clusters), many implementations pick random points from the dataset as the initial centroids.
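Putting the suggestions together, a minimal sketch (NumPy with made-up data; a Python stand-in for the Java implementation) of iterations that seed centroids from data points and re-seed any centroid left without assigned points:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
k = 3

# Initialize centroids from actual data points, not from the bounding box.
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(10):
    # Assign each point to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)
    for j in range(k):
        members = X[assignment == j]
        if len(members) == 0:
            # Empty cluster: re-seed from a random data point.
            centroids[j] = X[rng.integers(len(X))]
        else:
            centroids[j] = members.mean(axis=0)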