How can I put cluster labels in the Gaussian Mixture Model (GMM)? - scikit-learn

I used GMM to cluster my data. It has 3 dimensions with 3 clusters. The GMM ran very well, and I got the mean and covariance matrix for each cluster.
The problem is I don't know which label belongs to which cluster. In addition, the cluster labels change every time I re-run the fit.
For example, in the first run the cluster labels were 0, 1 and 2; in the second run they were 2, 1 and 0, and so on.
Can I fix the cluster labels? If yes, how do I do it?
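A common remedy (a sketch, not from the thread, assuming scikit-learn's GaussianMixture): fix the random seed for reproducibility, and relabel the fitted components by sorting their means so the labels are stable across runs:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 1-D data with three well-separated groups.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 0.3, 100),
                    rng.normal(5, 0.3, 100),
                    rng.normal(10, 0.3, 100)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Map each fitted component to its rank by mean: the component with the
# smallest mean always becomes label 0, the next one label 1, and so on.
order = np.argsort(gmm.means_.ravel())
relabel = np.empty_like(order)
relabel[order] = np.arange(len(order))
stable_labels = relabel[gmm.predict(X)]
```

With 3-dimensional data you would sort by one fixed coordinate of the means (or any other fixed rule) instead; the point is just to pick a canonical ordering of the components.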

Related

Improving my clustering for my dataset and changing my clusters to better reflect my data

Image of clusters
So I think this is somewhat self-explanatory, but my clusters don't really make sense. I am new to clustering and just want these data points to be grouped together. The problem I keep seeing is that the black cluster spans 2 distinct lines, and I want them to be separate. I am plotting weather radar over the United States and just want each line of storms to be a unique cluster, and eventually to remove any random, very small groups that are not near the lines. I tried to just add more clusters and adjust the algorithm, but nothing has been working.
Here is my code:
nc = 8
features = X.columns[1:3]  # use the same feature columns for fitting and predicting
kmeans = KMeans(n_clusters=nc, init='k-means++', n_init=1, algorithm='full')
X['cluster_label'] = kmeans.fit_predict(X[features])  # fit and label in one step
centers = kmeans.cluster_centers_  # coordinates of cluster centers
labels = X['cluster_label'].values  # label of each point
maxcluster = X['cluster_label'].max()

How can I speed up max pooling clusters of different sizes and shapes of an image?

I have clustered the pixels of an image into clusters of different sizes and shapes. I want to max pool each cluster as fast as possible because the max pooling happens in one layer of my CNN.
To clarify:
Input is a batch of images with the following shape [batch_size, height of image, width of image, number of channels]. I have clustered each image before I start training my CNN. So for each image I have a ndarray of labels with shape [height of image, width of image].
How can I max pool over all pixels of an image that have the same label, for every label? I understand how to do it with a for loop, but that is painfully slow. I am searching for a fast solution that can ideally max pool over every cluster of each image in less than a second.
For implementation, I use Python3.7 and PyTorch.
I figured it out: torch_scatter.scatter_max(img, cluster_labels) outputs the max element of each cluster and removes the for loop from my code.
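For readers without torch_scatter, the same scatter-max idea can be sketched in plain NumPy with np.maximum.at, which applies an unbuffered elementwise max per cluster label (toy data, names hypothetical):

```python
import numpy as np

# A 4x4 single-channel "image" and a cluster label for each pixel.
img = np.array([[1., 5., 2., 0.],
                [3., 4., 7., 1.],
                [0., 2., 9., 6.],
                [8., 1., 3., 2.]])
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 3, 3],
                   [2, 2, 3, 3]])

n_clusters = labels.max() + 1
pooled = np.full(n_clusters, -np.inf)
# For each pixel, update the running max of its cluster - no Python loop.
np.maximum.at(pooled, labels.ravel(), img.ravel())
print(pooled)  # [5. 7. 8. 9.] - one max per cluster
```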

Spatstat: Export estimated cluster center coordinate points in fitted ThomasCluster model

If you fit e.g. a Thomas cluster model (using kppm, for example), it will fit a model with X number of clusters. Is there a way to extract where the center of each of the X clusters is estimated to be? E.g. if the best-fit model on a ppp with 500 points has a mean of 250 points per cluster, we would expect 2 clusters to be estimated from the data. What are the center coordinates of these two clusters?
Many thanks
kppm does not estimate the number of cluster centres or the locations of the cluster centres. It fits a clustered point process model to the point pattern, essentially by matching the K function of the model to the K function of the data. The fitted model only describes the probability distribution of the cluster centres (number and location) and the probability distribution of offspring points relative to their parents.
Estimation/prediction of the actual cluster locations is a much harder task (belonging to the class of missing data problems). You could try the R package mclust for this purpose. You can expect it to take a much longer time to compute.
The fitted model parameters obtained from kppm could be used to determine the cluster parameters in the mclust package to simplify the task.

how to correlate noise data of sklearn-DBSCAN result with other clusters?

I am using sklearn-DBSCAN to cluster my text data.
I used GoogleNews-vectors-negative300.bin to create a 300-dimensional sentence vector for each document, giving a matrix of size 10000×300.
When I passed this matrix to DBSCAN with several values of eps (0.2 to 3) and min_samples (5 to 100), keeping the other parameters at their defaults, I got between 10 and 200 clusters.
Across all these runs, the noise points amount to approx. 75-80% of my data.
Is there any way to reduce the noise, or to use some other parameters (distances) to reduce it?
I even found two vectors at Euclidean distance 0.6 that end up in different clusters; how can I bring them into the same cluster?
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler  # assumed; `scaler` was not defined in the snippet

scaler = StandardScaler()
X_scaled = scaler.fit_transform(sentence_vectors)
ep = 0.3
min_sam = 10
for itr in range(1, 11):
    dbscan = DBSCAN(eps=ep, min_samples=min_sam * itr)
    clusters = dbscan.fit_predict(X_scaled)
If you want two points at distance 0.6 to be in the same cluster, then you may need to use a larger epsilon (which is a distance threshold). At 0.6 they should be in the same cluster.
Since word2vec is trained with dot products, it would likely make more sense to use the dot product as similarity and/or cosine distance.
But in general I doubt you'll be able to get good results. The way sentence vectors are built by averaging word2vec vectors kills too much signal and adds too much noise. And since the data is high-dimensional, all that noise is a problem.
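To illustrate the cosine suggestion: scikit-learn's DBSCAN accepts metric='cosine', so eps becomes a cosine-distance threshold rather than a Euclidean one. A toy sketch with made-up vectors:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two bundles of nearly parallel directions; cosine distance within a
# bundle is tiny, while across bundles it is close to 1.
X = np.array([[1.0, 0.0], [0.99, 0.02], [1.0, -0.02],
              [0.0, 1.0], [0.02, 0.99], [-0.02, 1.0]])

db = DBSCAN(eps=0.1, min_samples=2, metric='cosine')
labels = db.fit_predict(X)
print(labels)  # [0 0 0 1 1 1] - two clusters, no noise
```

On unit-normalized vectors this is equivalent to clustering by direction only, which matches how word2vec similarity is usually measured.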

scikit-learn AgglomerativeClustering and connectivity

I am trying to use AgglomerativeClustering from scikit-learn to cluster points in a plane. Points are defined by coordinates (X, Y) stored in _XY.
Clusters are limited to a few neighbours through the connectivity matrix defined by
_C = kneighbors_graph(_XY, n_neighbors=20).
I want some points not to be part of the same cluster, even if they are neighbours, so I modified the connectivity matrix to put 0 between these points.
The algorithm runs smoothly but, at the end, some clusters contain points that should not be together, i.e. some pairs for which I imposed _C = 0.
From the children, I can see that the problem arises when a cluster of two points (i, j) is already formed and k joins (i, j) even though _C[i, k] = 0.
So I was wondering how the connectivity constraint is propagated when the size of some clusters is larger than 2, _C not being defined in that case.
Thanks !
So what seems to be happening in your case is that, despite your actively disconnecting the points you do not want in the same cluster, those points are still part of the same connected component, and the data associated with them still imply that they should be merged into the same cluster from a certain level up.
In general, AgglomerativeClustering works as follows: At the beginning, all data points are separate clusters. Then, at each iteration, two adjacent clusters are merged, such that the overall increase in discrepancy with the original data is minimal if we compare the original data with cluster means in L2 distance.
Hence, although you sever the direct link between two nodes, they can be clustered together one level higher by an intermediate node.
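This behaviour can be seen in a minimal sketch (synthetic data; assuming scikit-learn's kneighbors_graph and AgglomerativeClustering): the connectivity matrix only restricts which merges are considered at each step, so two points whose direct edge is zeroed can still end up together via intermediate merges.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Two compact, well-separated groups of 2-D points.
rng = np.random.default_rng(0)
XY = np.vstack([rng.normal(0, 0.1, (20, 2)),
                rng.normal(5, 0.1, (20, 2))])

# Connectivity graph: each point may only merge with its 5 nearest neighbours.
C = kneighbors_graph(XY, n_neighbors=5, include_self=False)
model = AgglomerativeClustering(n_clusters=2, connectivity=C).fit(XY)
labels = model.labels_
```

Zeroing a single entry _C[i, k] removes one candidate edge, but once i and k belong to the same connected component, the hierarchy can still join them through a chain of neighbours; a hard separation requires placing them in different connected components altogether.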
