Gaussian mixture model sampling from specified cluster python - python-3.x

I have a piece of code, given below, that fits a Gaussian mixture model and samples data from it:
import pandas as pd
from sklearn.mixture import GaussianMixture

def generate_samples():
    # read the input as a numpy array
    data = pd.read_csv("..\\data\\input.txt", sep=",", header=None).values
    gmm = GaussianMixture(n_components=5, random_state=42).fit(data)
    sampled, _ = gmm.sample(data.shape[0])  # sample() returns (samples, component labels)
    original_label = gmm.predict(data)
    generated_label = gmm.predict(sampled)
    return sampled
When I checked original_label and generated_label, the number of samples in each cluster was different.
The number of elements in original_label:
Cluster 1: 0
Cluster 2: 1761
Cluster 3: 2024
Cluster 4: 769
Cluster 5: 0
The number of elements in generated_label:
Cluster 1: 0
Cluster 2: 1273
Cluster 3: 739
Cluster 4: 1140
Cluster 5: 1402
I want to sample data from the GMM with the same distribution as the original input, but here there is a big difference between the cluster assignments of the sampled and the original data. Can you please help me fix it?

Gaussian Mixture Models are a soft clustering approach. Every object belongs to every cluster, just to a varying degree.
If you sum the soft cluster densities, they should match much more closely. (I suggest you verify this; the huge difference in cluster 5 may indicate a problem in sklearn.)
Generating a hard clustering that satisfies both the GMM density model and the predicted hard labels will usually be unsatisfiable because of cluster overlap. This demonstrates that the 'hard' labeling is not true to the underlying assumptions of GMM.
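A minimal sketch of that check, assuming the fitted gmm together with the data and sampled arrays from the snippet above: compare the hard label counts with the summed soft responsibilities from predict_proba.

import numpy as np

# Hard assignments: argmax over components
hard_original = np.bincount(gmm.predict(data), minlength=5)
hard_sampled = np.bincount(gmm.predict(sampled), minlength=5)

# Soft assignments: expected number of members per component
soft_original = gmm.predict_proba(data).sum(axis=0)
soft_sampled = gmm.predict_proba(sampled).sum(axis=0)

print("hard counts:", hard_original, hard_sampled)
print("soft sums:  ", soft_original.round(1), soft_sampled.round(1))

The soft sums (and gmm.weights_ scaled by the number of samples) should agree between the two datasets far better than the argmax-based counts do.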

Related

Clustering with unknown number of clusters in Spark

I have a very large dataset (about 3.5M) of text messages. I am using tf-idf vectors to represent each message in this dataset. I want to cluster messages on the same topic together, and I don't know the actual clusters or even how many there are.
So I searched a little and found that OPTICS, DBSCAN, or HDBSCAN could do this job, but there is no implementation of them in Spark ML or MLlib. According to the Spark MLlib documentation, MLlib has implementations of K-means, Gaussian mixture, Power iteration clustering (PIC), Latent Dirichlet allocation (LDA), Bisecting k-means and Streaming k-means.
So my problem is that all of them need K as an input, and I don't have it. Is there any clustering algorithm implemented in Spark that finds the number of clusters on its own?
This got a little too long for a comment, so I'll try to explain it here.
Do you have data on which topic each message belongs to? Then you can simply group by that topic to collect all the messages on similar topics.
That's one thing. If you are trying to derive the topics (and hence K) from the dataset itself, you need a bit more statistics to build a sound feature set to cluster on. You can then settle on K by varying it and picking the value with the minimal error; a well-known approach for this is the elbow method.
Check this out. https://www.geeksforgeeks.org/ml-determine-the-optimal-value-of-k-in-k-means-clustering/#:~:text=There%20is%20a%20popular%20method,fewer%20elements%20in%20the%20cluster.
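A minimal elbow-method sketch with scikit-learn (not Spark-specific; X here is a synthetic placeholder for your tf-idf feature matrix):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1000, 20)  # placeholder for your feature matrix
for k in range(2, 15):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares; look for the k
    # where the drop in inertia flattens out (the "elbow")
    print(k, round(km.inertia_, 1))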

How to find out what a cluster represents on a PCA biplot?

I am building a K-means model and have multiple variables to feed into it, so I am using PCA to reduce the data to two dimensions. When I display the PCA biplot, I don't understand what similarities the data points share that cause them to be grouped into a specific cluster. I am using a customer segmentation dataset. For example, I want to be able to say that a specific cluster is a cluster because its customers have a low income but spend a lot of money on products.
Since you are using k-means:
Compute the mean of each cluster on the original data. Now you can compare these attributes.
Alternatively: don't use PCA in the first place if it gets in the way of your analysis; k-means is as good as PCA at coping with several dozen variables.
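A minimal sketch of that cluster-profiling step; the customer columns here are made up for illustration:

import pandas as pd
from sklearn.cluster import KMeans

# Toy customer data standing in for the segmentation dataset
df = pd.DataFrame({
    "income":   [15, 16, 80, 82, 40, 42],
    "spending": [90, 85, 20, 25, 50, 55],
})
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df)

# Mean of each original attribute per cluster, on the untransformed data
profile = df.assign(cluster=labels).groupby("cluster").mean()
print(profile)  # e.g. one cluster shows low mean income and high mean spending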

Bisecting K-Means spark ml - what is the division rule?

I started using Bisecting K-Means clustering in PySpark and I am wondering what the division rule is during clustering.
I know that K-Means is run internally, but how is the cluster for the next division selected? I have seen that there are a couple of possible rules (e.g., the biggest cluster is divided, or the cluster with the least internal similarity), but I can't find which division rule is implemented in Spark ML.
Thank you for your help.
According to the PySpark ML documentation (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans), the Bisecting K-Means algorithm is based on the paper “A comparison of document clustering techniques” by Steinbach, Karypis, and Kumar (https://www.cs.cmu.edu/~dunja/KDDpapers/Steinbach_IR.pdf).
In section 3:
We found little difference between the possible methods for selecting
a cluster to split and chose to split the largest remaining
cluster.
Modifications were made for PySpark. According to the PySpark docs:
The algorithm starts from a single cluster that contains all points.
Iteratively it finds divisible clusters on the bottom level and
bisects each of them using k-means, until there are k leaf clusters in
total or no leaf clusters are divisible. The bisecting steps of
clusters on the same level are grouped together to increase
parallelism. If bisecting all divisible clusters on the bottom level
would result more than k leaf clusters, larger clusters get higher
priority.
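For reference, a minimal PySpark usage sketch on toy data; k is the stopping criterion you set, while the largest-cluster priority described above is handled internally by the implementation:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import BisectingKMeans

spark = SparkSession.builder.appName("bkm-demo").getOrCreate()

# Two well-separated groups of points
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

model = BisectingKMeans(k=2, seed=1).fit(df)
print(model.clusterCenters())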

How to avoid Kmean local optima when using sklearn Kmeans

I want to use scikit-learn's KMeans in a production deployment and would like to keep the default setting init='k-means++'. My question is: what are the chances that k-means will fall into a local optimum when it initializes the cluster centroids?
The documentation notes say: "'k-means++': selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details."
Is there any data on the probability of ending up in a local optimum?
If so, should I run the algorithm several times and keep the result with the minimal cost function?
The probability of getting trapped in a local optimum depends mostly on the nature of your data. If it is clearly grouped, the initial clusters might not have much of an impact on the final clustering results.
In spite of the above, for high-dimensional datasets it is preferable to try 10 or more runs with different initial clusters and choose the one with the best performance (one possible performance metric is the silhouette coefficient).
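A minimal sketch of that re-run-and-compare idea on synthetic data; scikit-learn's n_init already repeats the k-means++ initialization and keeps the run with the lowest inertia, and the silhouette coefficient can serve as an extra check:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# n_init=10 runs ten k-means++ initializations and keeps the best (lowest inertia)
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
print("inertia:   ", km.inertia_)
print("silhouette:", silhouette_score(X, km.labels_))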

How to interpret k-means output in SPSS

SPSS: K-means analysis. What criteria can I use to justify my choice of the final number of clusters? Using a hierarchical cluster analysis, I started with 2 clusters in my K-means analysis. However, after running several other k-means analyses with different numbers of clusters, I don't know how to choose which one is better. Is there a general, scientifically sound method for choosing the number of clusters?
Are you using SPSS Modeler or SPSS Statistics? I ask because I created an extension to determine the optimal number of clusters.
It is based on R: Cluster analysis in R: determine the optimal number of clusters
