I started using Bisecting K-Means clustering in PySpark and I am wondering what the division rule is during clustering.
I know that K-Means is run at each step, but how is the next cluster selected for the next division? I have seen that there are a couple of possible methods (e.g. the biggest cluster is divided, or the cluster with the least internal similarity), but I can't find which division rule is implemented in Spark ML.
Thank you for your help.
According to the PySpark ML documentation (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans), the bisecting k-means algorithm is based on the paper “A comparison of document clustering techniques” by Steinbach, Karypis, and Kumar (https://www.cs.cmu.edu/~dunja/KDDpapers/Steinbach_IR.pdf).
In section 3:
We found little difference between the possible methods for selecting
a cluster to split and chose to split the largest remaining
cluster.
Modifications were made for PySpark. According to the PySpark documentation:
The algorithm starts from a single cluster that contains all points.
Iteratively it finds divisible clusters on the bottom level and
bisects each of them using k-means, until there are k leaf clusters in
total or no leaf clusters are divisible. The bisecting steps of
clusters on the same level are grouped together to increase
parallelism. If bisecting all divisible clusters on the bottom level
would result more than k leaf clusters, larger clusters get higher
priority.
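For reference, a minimal PySpark usage sketch (the DataFrame name df, the "features" column and k=4 are illustrative assumptions, not part of the question):

from pyspark.ml.clustering import BisectingKMeans

# Assumes `df` is a DataFrame with a vector column named "features".
bkm = BisectingKMeans(k=4, seed=1)      # k = desired number of leaf clusters
model = bkm.fit(df)
print(model.clusterCenters())           # centers of the leaf clusters
predictions = model.transform(df)       # adds a "prediction" column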
Related
I have a very large dataset (about 3.5M) of text messages. I am using tf-idf vectors to represent each message in this dataset. I want to cluster the messages of the same topic together, and I don't know the actual clusters or even how many there are.
So I searched a little and found that OPTICS, DBSCAN, or HDBSCAN could do this job, but there is no implementation of them in Spark ML or MLlib. According to this, Spark MLlib has implementations of K-means, Gaussian mixture, Power iteration clustering (PIC), Latent Dirichlet allocation (LDA), Bisecting k-means and Streaming k-means.
So my problem is that all of them need K as an input and I don't have it. Is there any clustering algorithm implemented in Spark that finds the number of clusters on its own?
Got a little bit too long for a comment. I'll try to explain it here.
Do you have data on which topic each message belongs to? Then you can simply group by that topic to collect all the messages on the same topic.
That's one thing. If you are trying to derive the topics (K) from the dataset itself, then you need a bit more statistics to build a sound feature set to cluster them. You can then settle on K by varying it and finding the value with minimal error. There is a well-known approach for this called the elbow method; a sketch follows the link below.
Check this out. https://www.geeksforgeeks.org/ml-determine-the-optimal-value-of-k-in-k-means-clustering/#:~:text=There%20is%20a%20popular%20method,fewer%20elements%20in%20the%20cluster.
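For illustration, a rough elbow-method sketch with Spark's own KMeans (assumptions: a DataFrame df with a "features" vector column, and Spark 2.4+ for summary.trainingCost):

from pyspark.ml.clustering import KMeans

# Fit k-means for a range of k and record the within-cluster sum of squared distances.
costs = {}
for k in range(2, 11):
    model = KMeans(k=k, seed=1).fit(df)
    costs[k] = model.summary.trainingCost
# Plot costs against k and pick the k where the curve flattens out (the "elbow").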
I have to compute the smallest-magnitude eigenvalue and its associated eigenvector of a non-symmetric matrix using PySpark libraries.
The size of the matrix is very large and I want the computation to be distributed among the cluster's workers.
The problem is that I didn't find any API to compute eigenvalues in the PySpark 2.3 documentation.
I have identified two paths, but I want to avoid them:
to reimplement eigenvalue decomposition through the QR algorithm, using the QRDecomposition available in the PySpark API
to compute the eigenvalue decomposition through the Scala version of the class, as described in this question on Stack Overflow
Is there a simpler or better way than these last two?
I already know about the existence of this post, but the problems are conceptually different.
I have a piece of code, given below, that fits a Gaussian mixture model and samples data from it:
import pandas as pd
from sklearn.mixture import GaussianMixture

input = pd.read_csv("..\\data\\input.txt", sep=",", header=None).values
gmm = GaussianMixture(n_components=5, random_state=42).fit(input)
# sample() returns a (samples, component_labels) tuple
sampled, _ = gmm.sample(input.shape[0])
original_label = gmm.predict(input)
generated_label = gmm.predict(sampled)
return sampled
When I checked the original_label and generated_label, the number of samples in each cluster is different.
The number of elements in original_label:
Cluster 1:0
Cluster 2:1761
Cluster 3:2024
Cluster 4:769
Cluster 5:0
The number of elements in generated_label:
Cluster 1:0
Cluster 2:1273
Cluster 3:739
Cluster 4:1140
Cluster 5:1402
I want to sample data from the GMM with the same distribution as the original input. Here, there is a big difference between the clusters of the sampled and the original data. Can you please help me fix it?
Gaussian Mixture Models are a soft clustering approach. Every object belongs to every cluster, just to a varying degree.
If you sum the soft cluster densities, they should match much more closely. (I suggest you verify this; the huge difference in cluster 5 may indicate a problem in sklearn.)
Generating a hard clustering that satisfies both the GMM density model and the predicted hard labels will usually be unsatisfiable because of cluster overlap. This demonstrates that the 'hard' labeling is not true to the underlying assumptions of GMM.
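As a way to verify this, here is a hedged sketch of the check suggested above, reusing the names from the question's snippet (gmm, input); it compares the hard-label counts with the summed soft responsibilities from predict_proba:

import numpy as np

hard_counts = np.bincount(gmm.predict(input), minlength=gmm.n_components)
soft_counts = gmm.predict_proba(input).sum(axis=0)   # expected size of each component

print("hard:", hard_counts)
print("soft:", np.round(soft_counts, 1))
# The soft counts should be close to input.shape[0] * gmm.weights_,
# and therefore to the component sizes you see when sampling from the model.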
In order to measure the "goodness" of the clustering k-means has found, I need to calculate the ratio of the between sum of squares (BSS) to the total sum of squares (TSS), which should approach 1 if the clustering has the properties of internal cohesion and external separation. I was wondering whether Spark has built-in functions to compute BSS/TSS for me, similar to the R kmeans clustering package, in order to leverage the parallelism of the Spark cluster.
Or is there a cost-effective way of computing the BSS/TSS ratio by other means?
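As far as I know Spark does not ship a BSS/TSS helper, but the ratio can be assembled from the prediction output in a distributed way. A minimal sketch, assuming a fitted pyspark.ml.clustering.KMeans model and a DataFrame df with a "features" vector column:

import numpy as np

predictions = model.transform(df)
centers = model.clusterCenters()                       # list of numpy arrays
pts = predictions.select("features", "prediction").rdd \
                 .map(lambda r: (r["features"].toArray(), r["prediction"])) \
                 .cache()

n = pts.count()
overall_mean = pts.map(lambda t: t[0]).reduce(lambda a, b: a + b) / n
tss = pts.map(lambda t: float(np.sum((t[0] - overall_mean) ** 2))).sum()   # total sum of squares
wss = pts.map(lambda t: float(np.sum((t[0] - centers[t[1]]) ** 2))).sum()  # within-cluster sum of squares
bss = tss - wss
print("BSS/TSS =", bss / tss)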
SPSS: K-means analysis. What criteria can I use to justify my choice of the final number of clusters? Using a hierarchical cluster analysis, I started with 2 clusters in my k-means analysis. However, after running many other k-means analyses with different numbers of clusters, I don't know how to choose which one is better. Is there a general method of choosing the number of clusters that is scientifically sound?
Are you using SPSS Modeler or SPSS Statistics? I ask because I created an extension to determine the optimal number of clusters.
It is based on R: Cluster analysis in R: determine the optimal number of clusters