How to interpret k-means output in SPSS

SPSS: k-means analysis. What criteria can I use to justify my choice of the final number of clusters? Based on a hierarchical cluster analysis, I started with 2 clusters in my k-means analysis. However, after running several other k-means solutions with different numbers of clusters, I don't know how to decide which one is better. Is there a general, scientifically sound method for choosing the number of clusters?

Are you using SPSS Modeler or SPSS Statistics? I ask because I created an extension that determines the optimal number of clusters.
It is based on R; see: Cluster analysis in R: determine the optimal number of clusters

Related

Clustering with unknown number of clusters in Spark

I have a very large dataset (about 3.5M text messages). I am using TF-IDF vectors to represent each message in this dataset. I want to cluster messages on the same topic together, but I don't know the actual clusters or even how many there are.
I searched a little and found that OPTICS, DBSCAN, or HDBSCAN could do this job, but there is no implementation of them in Spark ML or MLlib. According to this, Spark MLlib provides implementations of K-means, Gaussian mixture, Power iteration clustering (PIC), Latent Dirichlet allocation (LDA), Bisecting k-means and Streaming k-means.
So my problem is that all of them need K as an input, and I don't have it. Is there any clustering algorithm implemented in Spark that finds the number of clusters on its own?
This got a bit too long for a comment, so I'll try to explain it here.
Do you have data on which topic each message belongs to? If so, you can simply group by that topic to gather all the messages on similar topics.
That's one thing. If instead you are trying to derive the topics (K) from the dataset itself, then you need a bit more work to build a sound feature set before clustering. You can then settle on K by varying it and picking the value with minimal error; the well-known elbow method does exactly this, as in the sketch below.
Check this out. https://www.geeksforgeeks.org/ml-determine-the-optimal-value-of-k-in-k-means-clustering/#:~:text=There%20is%20a%20popular%20method,fewer%20elements%20in%20the%20cluster.
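
A minimal sketch of the elbow method with scikit-learn, assuming a dense feature matrix; the toy blobs below stand in for the TF-IDF vectors (for ~3.5M messages you would likely sample or use MiniBatchKMeans instead):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for the message features (assumption: a dense 2-D array).
X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)

# Fit k-means for a range of k and record the within-cluster sum of squares (inertia).
ks = range(2, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is the k after which inertia stops dropping sharply.
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))

In practice you would plot inertia against k and pick the value at the bend of the curve.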

How to find out what a cluster represents on a PCA biplot?

I am building a k-means model and have multiple variables to feed into it. Because of this, I am using PCA to reduce the data to two dimensions. When I display the PCA biplot, I don't understand what similarities the data points share that cause them to be grouped into a specific cluster. I am using a customer segmentation dataset. For example, I want to be able to say that a specific cluster is a cluster because those customers have a low income but spend a lot of money on products.
Since you are using k-means:
Compute the mean of each cluster on the original data, as sketched below. Now you can compare these attributes directly.
Alternatively, don't use PCA in the first place if it hurts your analysis: k-means copes with several dozen variables about as well as PCA does.
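
A minimal sketch of that suggestion, assuming the original (pre-PCA) variables live in a pandas DataFrame; the column names and values are made up for the example:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer-segmentation features (stand-ins for the real columns).
df = pd.DataFrame({
    "income": [15, 16, 80, 85, 40, 42],
    "spending_score": [80, 75, 20, 25, 50, 55],
})

# Cluster on the scaled original variables.
X = StandardScaler().fit_transform(df)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean of each original variable per cluster: this is what each cluster "represents".
print(df.groupby(labels).mean())

A cluster whose mean income is low while its mean spending score is high is exactly the kind of description the question asks for.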

Bisecting K-Means spark ml - what is the division rule?

I started using Bisecting K-Means clustering in PySpark and I am wondering what the division rule is during clustering.
I know that k-means is applied at each step, but how is the next cluster to divide selected? I have seen that there are a couple of possible rules (e.g., divide the biggest cluster, or divide the cluster with the least internal similarity), but I can't find which division rule is implemented in Spark ML.
Thank you for your help.
According to the PySpark ML documentation (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans), the Bisecting K-Means algorithm is based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar (https://www.cs.cmu.edu/~dunja/KDDpapers/Steinbach_IR.pdf).
In section 3:
We found little difference between the possible methods for selecting a cluster to split and chose to split the largest remaining cluster.
Modifications were made for PySpark. According to the PySpark docs:
The algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result more than k leaf clusters, larger clusters get higher priority.
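
For reference, a minimal PySpark sketch of running BisectingKMeans; the toy DataFrame, column names, and k are made up for the example:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import BisectingKMeans

spark = SparkSession.builder.appName("bisecting-kmeans-demo").getOrCreate()

# Toy two-column dataset; in practice this would be your feature DataFrame.
df = spark.createDataFrame(
    [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)], ["x", "y"]
)
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

# k is the target number of leaf clusters; the largest clusters are bisected first.
bkm = BisectingKMeans(k=2, featuresCol="features", seed=1)
model = bkm.fit(features)
model.transform(features).show()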

How to avoid k-means local optima when using sklearn KMeans

I want to use scikit-learn's KMeans in a production deployment and would like to use the default setting init='k-means++'. My question is: what are the chances that k-means will fall into a local optimum when it initializes the cluster centroids?
Notes says that "‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details".
Is there any data on the probability of getting stuck in a local optimum?
If so, should I iterate to find the minimal cost function?
The probability of getting trapped in a local optimum depends mainly on the nature of your data. If the data are clearly grouped, the initial centroids may not have much impact on the final clustering results.
Regardless, for a high-dimensional dataset it is preferable to run 10 or more initializations with different initial centroids and choose the one with the best performance (one possible performance metric is the silhouette coefficient), as in the sketch below.
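
A minimal sketch of that advice with scikit-learn, on made-up blob data: n_init reruns k-means++ from several seeds and keeps the lowest-inertia run, and the silhouette coefficient is used here to compare candidate models:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data standing in for the production feature matrix.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

best_model, best_sil = None, -1.0
for k in (3, 4, 5):
    # n_init=10 runs k-means++ initialization 10 times and keeps the run with the
    # lowest within-cluster sum of squares, reducing the risk of a bad local optimum.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"k={k}: silhouette={sil:.3f}")
    if sil > best_sil:
        best_model, best_sil = km, sil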

General principles behind Spark MLlib parallelism

I'm new to Spark (and to cluster computing frameworks in general) and I'm wondering about the general principles behind the parallel algorithms used for machine learning (MLlib). Are they essentially faster because Spark distributes the training data over multiple nodes? If so, I suppose all nodes share the same set of parameters, and they have to combine (e.g., by summing) the intermediate calculations (e.g., the gradients) on a regular basis. Am I wrong?
Secondly, suppose I want to fit my data with an ensemble of models (e.g., 10). Wouldn't it be simpler, in this particular context, to run my good old machine-learning program independently on 10 machines instead of having to write complicated code (for me, at least!) for training on a Spark cluster?
A corollary question: is Spark (or another cluster computing framework) useful only for big-data applications, where we cannot afford to train more than one model and where training on a single machine would take too long?
You are correct about the general principle. A typical MLlib algorithm is an iterative procedure with a local phase and a data-exchange phase.
MLlib algorithms are not necessarily faster. They try to solve two problems:
disk latency,
memory limitations on a single machine.
If you can process the data on a single node, that can be orders of magnitude faster than using ML / MLlib.
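
As a toy illustration of that local-phase / data-exchange pattern (not the actual MLlib internals), each partition can reduce its data to a small summary and only those summaries are combined; the RDD contents and statistic below are made up for the example:

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("local-phase-demo").getOrCreate()
sc = spark.sparkContext

# Toy data distributed over 4 partitions.
rdd = sc.parallelize(np.random.randn(10_000, 3).tolist(), numSlices=4)

def partial_stats(rows):
    # Local phase: each worker reduces its partition to a small summary.
    arr = np.array(list(rows))
    yield (arr.sum(axis=0), len(arr))

# Data exchange: only the small per-partition summaries are combined.
total, count = rdd.mapPartitions(partial_stats).reduce(
    lambda a, b: (a[0] + b[0], a[1] + b[1])
)
print("global mean:", total / count)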
The last question is hard to answer, but:
It is not complicated to train ensembles:
import numpy as np

def train_model(iter):
    # Local phase: materialize this partition's rows as an array.
    items = np.array(list(iter))
    model = ...  # placeholder: fit your model of choice on `items` here
    yield model  # mapPartitions expects an iterable, so yield the fitted model

# rdd is assumed to be an existing RDD of feature rows.
rdd.mapPartitions(train_model)
There are projects that already do this (https://github.com/databricks/spark-sklearn).
