In order to measure the "goodness" of the clustering that k-means has found, I need to calculate the BSS/TSS ratio (between-cluster sum of squares over total sum of squares), which should approach 1 if the clustering has the properties of internal cohesion and external separation. I was wondering whether Spark has built-in functions to compute BSS/TSS for me, similar to R's kmeans clustering package, so that I can leverage the parallelism of the Spark cluster.
Or is there a cost-effective way of computing the BSS/TSS ratio through other means?
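For reference, one cost-effective route is to derive the ratio from quantities Spark already exposes: since BSS = TSS - WSS, BSS/TSS = 1 - WSS/TSS. Below is a rough sketch, assuming Spark >= 2.4 (for summary.trainingCost) and a DataFrame df with a "features" vector column; the DataFrame name, column name and k value are assumptions, not part of the question.

from pyspark.ml.clustering import KMeans
from pyspark.ml.stat import Summarizer
from pyspark.ml.linalg import Vectors
from pyspark.sql import functions as F

# Fit k-means; the training summary exposes the within-cluster sum of squares (WSS).
model = KMeans(k=5, seed=1).fit(df)
wss = model.summary.trainingCost

# Total sum of squares: squared Euclidean distance of every point to the global mean.
global_mean = df.select(Summarizer.mean(F.col("features"))).first()[0]
sq_dist = F.udf(lambda v: float(Vectors.squared_distance(v, global_mean)), "double")
tss = df.select(F.sum(sq_dist("features"))).first()[0]

bss_tss = 1.0 - wss / tss   # BSS/TSS = (TSS - WSS) / TSS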
Recently I trained a BYOL model on a set of images to learn an embedding space where similar vectors are close to each other. The performance was fantastic when I performed approximate k-nearest-neighbour search.
The next task, where I am facing a problem, is to find a clustering algorithm that uncovers a set of clusters using the embedding vectors generated by the BYOL-trained feature extractor (the vectors have dimension 1024 and there are 1 million of them). I have no a priori information about the number of classes, i.e. clusters, in my dataset and thus cannot use k-means. Is there any scalable clustering algorithm that can help me uncover such clusters? I tried to use FISHDBC, but the repository does not have good documentation.
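For context, a density-based method such as HDBSCAN is a common suggestion when the number of clusters is unknown. A minimal sketch with the hdbscan package follows; the embeddings array here is a small random placeholder standing in for the real 1M x 1024 BYOL embeddings, and min_cluster_size is an assumed parameter that would need tuning.

import numpy as np
import hdbscan

# Placeholder data; in practice this would be the (1_000_000, 1024) BYOL embedding matrix.
embeddings = np.random.default_rng(0).normal(size=(10_000, 128)).astype(np.float32)

clusterer = hdbscan.HDBSCAN(min_cluster_size=50, metric="euclidean", core_dist_n_jobs=-1)
labels = clusterer.fit_predict(embeddings)   # label -1 marks points treated as noise

print("clusters found:", labels.max() + 1)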
I have a very large dataset (about 3.5M text messages). I am using a tf-idf vector to represent each message in this dataset. I want to cluster messages on the same topic together, and I don't know the actual clusters or even the number of them.
So I searched a little and found that OPTICS, DBSCAN, or HDBSCAN could do this job, but there is no implementation of them in Spark ML or MLlib. According to this, Spark MLlib has implementations of k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), bisecting k-means, and streaming k-means.
So my problem is that all of them need K as an input and I don't have it. Is there any clustering algorithm implemented in Spark that finds the number of clusters on its own?
This got a bit too long for a comment, so I'll try to explain it here.
Do you have data on which topic each message belongs to? Then you can simply group by that topic to gather all the messages with similar topics.
That's one thing. If instead you are trying to derive the topics (K) from the dataset itself, then you need a little more statistics to build a sound feature set to cluster them. You can then settle on K by varying it and finding the K with minimal error; a popular technique for this is the elbow method (a rough sketch follows below).
Check this out: https://www.geeksforgeeks.org/ml-determine-the-optimal-value-of-k-in-k-means-clustering/#:~:text=There%20is%20a%20popular%20method,fewer%20elements%20in%20the%20cluster.
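A rough sketch of that elbow loop with Spark ML, assuming a DataFrame df with a "features" vector column (the DataFrame name, the range of k and the seed are assumptions):

from pyspark.ml.clustering import KMeans

costs = {}
for k in range(2, 21):
    model = KMeans(k=k, seed=1).fit(df)
    costs[k] = model.summary.trainingCost   # within-cluster sum of squared errors for this k

# Plot the costs against k and pick the "elbow" where the curve stops dropping sharply.
for k, cost in sorted(costs.items()):
    print(k, cost)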
I started using Bisecting K-Means clustering in PySpark and I am wondering what the division rule is during clustering.
I know that k-means is run at each step, but how is the next cluster to divide selected? I have seen that there are a couple of possible rules (e.g. the biggest cluster is divided, or the cluster with the least internal similarity), but I can't find which division rule is implemented in Spark ML.
Thank you for your help.
According to the PySpark ML documentation (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans), the bisecting k-means algorithm is based on the paper “A comparison of document clustering techniques” by Steinbach, Karypis, and Kumar (https://www.cs.cmu.edu/~dunja/KDDpapers/Steinbach_IR.pdf).
In section 3:
We found little difference between the possible methods for selecting
a cluster to split and chose to split the largest remaining
cluster.
Modifications were made for PySpark. According to the PySpark docs:
The algorithm starts from a single cluster that contains all points.
Iteratively it finds divisible clusters on the bottom level and
bisects each of them using k-means, until there are k leaf clusters in
total or no leaf clusters are divisible. The bisecting steps of
clusters on the same level are grouped together to increase
parallelism. If bisecting all divisible clusters on the bottom level
would result more than k leaf clusters, larger clusters get higher
priority.
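For completeness, a minimal usage sketch of pyspark.ml.clustering.BisectingKMeans; the DataFrame df with a "features" column and the parameter values are assumptions.

from pyspark.ml.clustering import BisectingKMeans

bkm = BisectingKMeans(k=4, minDivisibleClusterSize=1.0, seed=1)
model = bkm.fit(df)

print(model.clusterCenters())        # centers of the k leaf clusters
predictions = model.transform(df)    # adds a "prediction" column with the cluster ids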
I have a piece of code given below that fits a Gaussian mixture model and samples data from it:
import pandas as pd
from sklearn.mixture import GaussianMixture as GMM

def sample_like_input():
    data = pd.read_csv("..\\data\\input.txt", sep=",", header=None).values
    gmm = GMM(n_components=5, random_state=42).fit(data)
    sampled, _ = gmm.sample(data.shape[0])   # sample() returns (samples, component_labels)
    original_label = gmm.predict(data)       # hard component assignments for the input
    generated_label = gmm.predict(sampled)   # hard component assignments for the samples
    return sampled
When I checked original_label and generated_label, the number of samples in each cluster was different.
The number of elements in original_label:
Cluster 1:0
Cluster 2:1761
Cluster 3:2024
Cluster 4:769
Cluster 5:0
The number of elements in generated_label:
Cluster 1:0
Cluster 2:1273
Cluster 3:739
Cluster 4:1140
Cluster 5:1402
I want to sample data from the GMM with the same distribution as the original input. Here there is a big difference between the clusters of the sampled and the original data. Can you please help me fix this?
Gaussian Mixture Models are a soft clustering approach. Every object belongs to every cluster, just to a varying degree.
If you sum the soft cluster densities instead, they should match much more closely (I suggest you verify this with the check sketched below; the huge difference in cluster 5 may indicate a problem in sklearn).
Generating a hard clustering that satisfies both the GMM density model and the predicted hard labels is usually impossible because of cluster overlap. This demonstrates that the 'hard' labeling is not true to the underlying assumptions of a GMM.
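One way to verify this, continuing from the code in the question: compare the summed soft responsibilities (predict_proba) per component instead of the hard label counts. Here gmm, data and sampled are assumed to be the objects fitted in the question's code.

import numpy as np

soft_original = gmm.predict_proba(data).sum(axis=0)      # expected counts per component
soft_sampled = gmm.predict_proba(sampled).sum(axis=0)

print(np.round(soft_original))
print(np.round(soft_sampled))
# These totals should match each other (and n * gmm.weights_) far more closely
# than the hard-label counts listed above.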
I have a set of multivariate (2D) Gaussian distributions (represented by mean and variance) and would like to perform clustering on these distributions in a way that maintains the probabilistic Gaussian information (perhaps using the overlap of variances?).
I have done some research into clustering methods and found that DBSCAN clustering is more appropriate than k-means, as I don't know how many clusters I expect to find. However, DBSCAN uses a Euclidean distance epsilon value to find clusters instead of using the variances of each distribution. I have also looked into Gaussian mixture model methods, but they fit a set of points to a set of K Gaussian clusters, rather than fitting clusters to a set of Gaussian distributions.
Does anyone know of any additional clustering methods that might be appropriate to my needs?
Thanks!
DBSCAN can be used with arbitrary distances. It is not limited to Euclidean distance. You could employ a divergence measure, e.g. how much your Gaussians overlap.
However, I would suggest hierarchical clustering or Gaussian Mixture Modeling (EM).
DBSCAN is designed to allow banana-shaped clusters, which are not well approximated by Gaussians. Your objective appears to be to merge similar Gaussians, and that is better achieved by hierarchical clustering; a sketch follows below.
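A sketch of that hierarchical route: pairwise distances between the Gaussians are taken here as a symmetrised KL divergence (one possible overlap measure, not the only choice), and the toy means and covariances are placeholders standing in for the real 2D distributions.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def sym_kl(m1, S1, m2, S2):
    # Symmetrised KL divergence between N(m1, S1) and N(m2, S2).
    def kl(ma, Sa, mb, Sb):
        Sb_inv = np.linalg.inv(Sb)
        d = mb - ma
        return 0.5 * (np.trace(Sb_inv @ Sa) + d @ Sb_inv @ d
                      - len(ma) + np.log(np.linalg.det(Sb) / np.linalg.det(Sa)))
    return kl(m1, S1, m2, S2) + kl(m2, S2, m1, S1)

# Toy placeholders for the real list of (mean, covariance) pairs.
rng = np.random.default_rng(0)
means = rng.normal(scale=5.0, size=(12, 2))
covs = [np.eye(2) * rng.uniform(0.5, 2.0) for _ in range(12)]

# Pairwise divergence matrix, fed to average-linkage hierarchical clustering.
n = len(means)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = sym_kl(means[i], covs[i], means[j], covs[j])

Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=10.0, criterion="distance")   # the cutoff 10.0 is a placeholder
print(labels)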