Perform clustering on high dimensional data

Recently I trained a BYOL model on a set of images to learn an embedding space where similar images map to nearby vectors. The performance was fantastic when I performed approximate k-nearest-neighbour search.
The next task, where I am facing a problem, is to find a clustering algorithm that uncovers clusters in the embedding vectors produced by the BYOL-trained feature extractor (each vector has dimension 1024, and there are 1 million vectors). I have no a priori information about the number of classes, i.e. clusters, in my dataset and thus cannot use k-means. Is there any scalable clustering algorithm that can help me uncover such clusters? I tried FISHDBC, but the repository does not have good documentation.
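FISHDBC is essentially a scalable, incremental take on HDBSCAN, so a plain HDBSCAN baseline may be worth trying before anything exotic. Below is a minimal sketch, assuming scikit-learn >= 1.3 (which ships sklearn.cluster.HDBSCAN and does not need a preset number of clusters); at 1 million 1024-dimensional vectors it will be slow, so subsampling or reducing the dimensionality first is probably necessary. The array here is random stand-in data.

import numpy as np
from sklearn.cluster import HDBSCAN

# Stand-in for the (n_samples, 1024) float32 embeddings from the BYOL feature extractor.
embeddings = np.random.rand(10_000, 1024).astype("float32")

clusterer = HDBSCAN(min_cluster_size=50, metric="euclidean")
labels = clusterer.fit_predict(embeddings)  # label -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"found {n_clusters} clusters")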

Related

How to find out what a cluster represents on a PCA biplot?

I am building a k-means algorithm and have multiple variables to feed into it. Because of this, I am using PCA to transform the data to two dimensions. When I display the PCA biplot, I don't understand what similarities the data has to be grouped into a specific cluster. I am using a customer segmentation dataset. E.g., I want to be able to see that a specific cluster groups customers who have a low income but spend a lot of money on products.
Since you are using k-means:
Compute the mean of each cluster on the original data. Now you can compare these attributes.
Alternatively: don't use PCA in the first place if it hurts your analysis; k-means is as good as PCA at coping with several dozen variables.
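A minimal sketch of that idea; the DataFrame and its column names are placeholders for whatever the segmentation dataset actually contains:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for the original (un-transformed) customer features.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, 500),
    "spending_score": rng.uniform(0, 100, 500),
})

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(df)

# Mean of each original feature per cluster: these profiles describe what a
# cluster "represents" (e.g. low income but high spending).
cluster_profiles = df.assign(cluster=kmeans.labels_).groupby("cluster").mean()
print(cluster_profiles)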

Predicting correct cluster for unseen data using a trained K-Means model

I know that k-means is a lazy learner and will have to be retrained from scratch with new points, but I would still like to know if there is any workaround to use a trained model to predict on new, unseen data.
I'm using the k-means algorithm to cluster a medical corpus. I'm creating a term-document matrix to represent this corpus. Before feeding the data to the k-means algorithm, I perform truncated singular value decomposition on the data for dimensionality reduction. I've been wondering if there's a way to cluster a new unseen document without retraining the entire model.
To get the vector representation of the new document and predict its cluster using the trained model, I need to ensure that it has the same vocabulary as the trained model and maintains the same column order in the term-document matrix. This can be done, considering that these documents have a similar kind of vocabulary. But how do I get the SVD representation of this document? Here is where my understanding gets a little shaky, so correct me if I'm wrong: to perform SVD on this vector representation, I would need to append it to the original term-document matrix. If I append the new document to the original term-document matrix and perform SVD on it to get a representation with a limited number of features (100 in this case), I'm not sure how things will change. Will the new features selected by the SVD correspond semantically to the original ones? It won't make sense to measure the distance of the new document from the cluster centroids (with 100 features) if the corresponding features capture different concepts.
Is there a way to use a trained kmeans model for new text data? Or any other better-suited clustering approach for this task?
Your problem isn't k-means: a simple nearest-neighbor classifier that uses the cluster means as its reference data will work.
Your problem is SVD, which is not stable. Adding new data can give you entirely different results.
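One concrete reading of this: if the fitted vectorizer, TruncatedSVD and k-means objects are all kept, a new document only needs to go through their transform/predict methods, so the SVD basis never changes and distances to the learned centroids stay comparable. A minimal sketch with hypothetical toy documents:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["patient shows fever and cough",
        "mri scan of the knee",
        "fever treated with antibiotics"]
new_doc = "patient reports high fever"

# Fit the vectorizer, the SVD projection and k-means on the training corpus only.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)  # 100 components in the question
X_reduced = svd.fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_reduced)

# For a new document, only transform (no refitting): same vocabulary, same SVD basis.
new_vec = svd.transform(vectorizer.transform([new_doc]))
print(kmeans.predict(new_vec))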

How to use clustering to group sentences with similar intents?

I'm trying to develop a program in Python that can process raw chat data and cluster sentences with similar intents so they can be used as training examples to build a new chatbot. The goal is to make it as quick and automatic (i.e. no parameters to enter manually) as possible.
1- For feature extraction, I tokenize each sentence, stem its words and vectorize it using Sklearn's TfidfVectorizer.
2- Then I perform clustering on those sentence vectors with Sklearn's DBSCAN. I chose this clustering algorithm because it doesn't require the user to specify the desired number of clusters (like the k parameter in k-means). It throws away a lot of sentences (considering them as outliers), but at least its clusters are homogeneous.
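A minimal sketch of these two steps; the stemmer and the toy sentences are placeholders (NLTK's SnowballStemmer is assumed here, and eps/min_samples still need tuning):

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

stemmer = SnowballStemmer("english")

def stem_tokens(text):
    # Naive whitespace tokenization plus stemming; replace with your own tokenizer.
    return [stemmer.stem(tok) for tok in text.split()]

sentences = ["hi there", "hello there",
             "i want to cancel my order", "please cancel the order"]

# 1- TF-IDF vectors over stemmed tokens
vectorizer = TfidfVectorizer(tokenizer=stem_tokens, lowercase=True)
X = vectorizer.fit_transform(sentences)

# 2- Density-based clustering; -1 labels are the discarded outliers
labels = DBSCAN(eps=0.9, min_samples=2, metric="cosine").fit_predict(X)
print(labels)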
The overall algorithm works on relatively small datasets (10000 sentences) and generates meaningful clusters, but there are a few issues:
On large datasets (e.g. 800000 sentences), DBSCAN fails because it requires too much memory, even with parallel processing on a powerful machine in the cloud. I need a less computationally-expensive method, but I can't find another algorithm that doesn't make weird and heterogeneous sentence clusters. What other options are there? What algorithm can handle large amounts of high-dimensional data?
The clusters that are generated by DBSCAN are sentences that have similar wording (due to my feature extraction method), but the targeted words don't always represent intents. How can I improve my feature extraction so it better captures the intent of a sentence? I tried Doc2vec but it didn't seem to work well with small datasets made of documents that are the size of a sentence...
A standard implementation of DBSCAN is supposed to need only O(n) memory. You cannot get lower than this memory requirement. But I read somewhere that sklearn's DBSCAN actually uses O(n²) memory, so it is not the optimal implementation. You may need to implement this yourself then, to use less memory.
Don't expect these methods to be able to cluster "by intent". There is no way an unsupervised algorithm can infer what is intended. Most likely, the clusters will just be based on a few key words. But this could be whether people say "hi" or "hello". From an unsupervised point of view, this distinction gives two nice clusters (and some noise, maybe also another cluster "hola").
I suggest training a supervised feature extractor on a subset where you label the "intent".
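One way to read that suggestion (a sketch under assumptions; the answer does not prescribe a particular model): label a small subset with intents, fit a classifier on it, and use the classifier's class probabilities as compact, intent-oriented features for the unlabeled remainder.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled = ["hi", "hello", "cancel my order", "i want a refund"]
intents = ["greeting", "greeting", "cancel", "refund"]
unlabeled = ["hey there", "please cancel the subscription"]

vectorizer = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(labeled), intents)

# Class probabilities as a low-dimensional, intent-oriented representation
# that can be clustered instead of the raw TF-IDF vectors.
features = clf.predict_proba(vectorizer.transform(unlabeled))
print(features.shape)  # (n_sentences, n_intents)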

How to use scikit-learn to calculate k-means feature importance

I use scikit-learn to do clustering by k-means:
from sklearn import cluster
k = 4
kmeans = cluster.KMeans(n_clusters=k)
But another question is: how do I use scikit-learn to calculate k-means feature importance?
Unfortunately, to my knowledge there is no such thing as "feature importance" in the context of a k-means algorithm - at least in the understanding that feature importance means "automatic relevance determination" (as in the link below).
In fact, the k-means algorithm treats all features equally, since the clustering procedure depends on the (unweighted) Euclidean distances between data points and cluster centers.
More generally, there exist clustering algorithms which perform automatic feature selection or automatic relevance determination, or generic feature selection methods for clustering. A specific (and arbitrary) example is
Roth and Lange, Feature Selection in Clustering Problems, NIPS 2003
I have answered this on StackExchange: you can partially estimate the most important features, not for the whole clustering problem, but for each individual cluster. Here is the answer:
I faced this problem before and developed two possible methods to find the most important features responsible for each cluster in a (sub-optimal) k-means solution.
Focusing on each centroid’s position and the dimensions responsible for the highest Within-Cluster Sum of Squares minimization
Converting the problem into classification settings (Inspired by the paper: "A Supervised Methodology to Measure the Variables Contribution to a Clustering").
I have written a detailed article about this here: Interpretable K-Means: Clusters Feature Importances. A GitHub link is included as well if you want to try it.
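A rough sketch of the second idea (turning the clustering into a classification problem), not the article's exact implementation: fit k-means, then train a one-vs-rest classifier per cluster and read off its feature importances.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Synthetic data just for illustration.
X, _ = make_blobs(n_samples=500, n_features=5, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# One-vs-rest: which features best separate each cluster from all the others?
for c in sorted(set(labels)):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, labels == c)
    ranking = np.argsort(clf.feature_importances_)[::-1]
    print(f"cluster {c}: most important feature indices -> {ranking[:3]}")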

How does PCA give centers for the k-means algorithm in scikit-learn

I'm looking at the example code from the scikit-learn k-means digits example.
There is the following code in this script:
# in this case the seeding of the centers is deterministic, hence we run the
# kmeans algorithm only once with n_init=1
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
              name="PCA-based",
              data=data)
Why are the eigenvectors used as initial centers, and is there any intuition for this?
There is a StackExchange link here, and also some discussion on the Wikipedia page for PCA.
There is also an informative mailing list discussion about the creation of this example.
All of these threads point back to this paper, among others. In brief, the paper says that there is a strong relationship between the subspace found by SVD (as used in PCA) and the optimal cluster centers we seek in k-means, along with associated proofs. The key sentence comes in the lower right of the first page: "We prove that principal components are actually the continuous solution of the cluster membership indicators in the K-means clustering method, i.e., the PCA dimension reduction automatically performs data clustering according to the K-means objective function".
What this amounts to is that SVD/PCA eigenvectors should be very good initializers for K-Means. The authors of this paper actually take things a step further, and project the data into the eigenspace for both of their experiments, then cluster there.
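For anyone who wants to run the quoted snippet directly, here is a self-contained version on the digits data (load_digits stands in for the example's own data loading):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

data, labels = load_digits(return_X_y=True)
n_digits = len(set(labels))

# The PCA components (eigenvectors of the data covariance) serve directly as
# the initial cluster centers, so a single deterministic run (n_init=1) suffices.
pca = PCA(n_components=n_digits).fit(data)
kmeans = KMeans(init=pca.components_, n_clusters=n_digits, n_init=1)
kmeans.fit(data)
print(kmeans.inertia_)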
