Customize Distance Formula of K-means in Apache Spark Python - apache-spark

I'm currently using K-means for clustering, following this tutorial and the API.
But I want to use a custom formula to calculate the distances. So how can I pass a custom distance function to k-means in PySpark?

In general, using a different distance measure doesn't make sense, because k-means (unlike k-medoids) is well defined only for Euclidean distances.
See Why does k-means clustering algorithm use only Euclidean distance metric? for an explanation.
Moreover, MLlib algorithms are implemented in Scala, and PySpark provides only the wrappers required to execute that Scala code. Therefore, providing a custom metric as a Python function wouldn't be technically possible without significant changes to the API.
Please note that since Spark 2.4 there are two built-in measures that can be used with pyspark.ml.clustering.KMeans and pyspark.ml.clustering.BisectingKMeans (see the distanceMeasure Param):
euclidean for Euclidean distance.
cosine for cosine distance.
Use at your own risk.
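For example, with Spark 2.4+, selecting the built-in cosine measure looks roughly like this (a minimal sketch; k=3 and the DataFrame name assembled_df are placeholders, and the DataFrame is assumed to already have a "features" vector column):

from pyspark.ml.clustering import KMeans

# distanceMeasure accepts "euclidean" (the default) or "cosine" as of Spark 2.4
kmeans = KMeans(k=3, featuresCol="features", distanceMeasure="cosine")
model = kmeans.fit(assembled_df)  # assembled_df is assumed to hold a "features" vector column
centers = model.clusterCenters()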

Related

How to compute eigenvector system of a matrix using Apache PySpark 2.3

I have to compute the smallest-magnitude eigenvalue and its associated eigenvector of a non-symmetric matrix using PySpark libraries.
The matrix is very large, and I want the computation to be distributed among the cluster's workers.
The problem is that I couldn't find any API to compute eigenvalues in the PySpark 2.3 documentation.
I have identified two paths, but I want to avoid them:
reimplementing eigenvalue decomposition via the QR algorithm, using the QRDecomposition available in the PySpark API
computing the eigenvalue decomposition through the Scala class, as described in this question on Stack Overflow
Is there a simpler or better way than these last two?
I already know about the existence of this post, but it is conceptually different.

Is there a way to perform clustering for a set of multivariate Gaussian distributions?

I have a set of multivariate (2D) Gaussian distributions (represented by mean and variance) and would like to perform clustering on these distributions in a way that maintains the probabilistic Gaussian information (perhaps using the overlap of variances?).
I have done some research into clustering methods and found that DBSCAN clustering is more appropriate than K-means, as I don't know how many clusters I expect to find. However, DBSCAN makes use of a Euclidean distance epsilon value to find clusters instead of using the variances of each distribution. I have also looked into Gaussian Mixture Model methods, but they fit a set of points to a set of K Gaussian clusters, rather than fitting clusters to a set of Gaussian distributions.
Does anyone know of any additional clustering methods that might be appropriate to my needs?
Thanks!
DBSCAN can be used with arbitrary distances. It is not limited to Euclidean distance. You could employ a divergence measure, e.g. how much your Gaussians overlap.
However, I would suggest hierarchical clustering or Gaussian Mixture Modeling (EM).
DBSCAN is designed to allow banana-shaped clusters, which are not well approximated by Gaussians. Your objective appears to be to merge similar Gaussians, and that is better achieved by hierarchical clustering.
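To make the suggestion concrete, here is a minimal sketch (my own, not part of the original answer) of hierarchical clustering over a set of 2D Gaussians, using the Bhattacharyya distance as the overlap-based divergence; the toy Gaussians and the clustering threshold are assumptions:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def bhattacharyya(mu1, cov1, mu2, cov2):
    # Bhattacharyya distance between two Gaussians; larger means less overlap
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

# toy input: each distribution is given as (mean, covariance)
gaussians = [(np.array([0.0, 0.0]), np.eye(2)),
             (np.array([0.2, 0.1]), 0.5 * np.eye(2)),
             (np.array([5.0, 5.0]), np.eye(2))]

n = len(gaussians)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = bhattacharyya(*gaussians[i], *gaussians[j])

# average-linkage hierarchical clustering on the precomputed divergences
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=1.0, criterion="distance")  # the threshold t=1.0 is arbitrary
print(labels)  # the two overlapping Gaussians end up in the same cluster

The same precomputed matrix D could also be fed to DBSCAN with metric="precomputed" if you prefer the density-based approach.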

How to use scikit-learn to calculate k-means feature importance

I use scikit-learn to do clustering by k-means:
from sklearn import cluster
k = 4
kmeans = cluster.KMeans(n_clusters=k)
But another question is:
How to use scikit learn to calculate the k-means feature importance?
Unfortunately, to my knowledge there is no such thing as "feature importance" in the context of a k-means algorithm - at least in the understanding that feature importance means "automatic relevance determination" (as in the link below).
In fact, the k-means algorithm treats all features equally, since the clustering procedure depends on the (unweighted) Euclidean distances between data points and cluster centers.
More generally, there exist clustering algorithms which perform automatic feature selection or automatic relevance determination, or generic feature selection methods for clustering. A specific (and arbitrary) example is
Roth and Lange, Feature Selection in Clustering Problems, NIPS 2003
I have answered this on StackExchange: you can't estimate feature importance for the whole clustering problem, but you can partially estimate the most important features of each cluster. Here is the answer:
I faced this problem before and developed two possible methods to find the features most responsible for each cluster in the (sub-optimal) K-Means solution:
Focusing on each centroid’s position and the dimensions responsible for the highest Within-Cluster Sum of Squares minimization
Converting the problem into classification settings (Inspired by the paper: "A Supervised Methodology to Measure the Variables Contribution to a Clustering").
I have written a detailed article about it here: Interpretable K-Means: Clusters Feature Importances. A GitHub link is included as well if you want to try it.
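As a rough illustration of the classification-based idea (a sketch of my own, not the article's code), one can refit a classifier on the cluster labels and read off its feature importances; the Iris data and the random forest are arbitrary choices:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, _ = load_iris(return_X_y=True)

# cluster first, then treat the cluster labels as classification targets
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)

# features that best separate the clusters get the highest importances
for idx in np.argsort(clf.feature_importances_)[::-1]:
    print(idx, round(clf.feature_importances_[idx], 3))

Note that these importances describe which features separate the clusters that k-means found, not which features k-means "used"; as explained above, k-means itself weights all features equally.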

scikit-learn clustering: predict(X) vs. fit_predict(X)

In scikit-learn, some clustering algorithms have both predict(X) and fit_predict(X) methods, like KMeans and MeanShift, while others only have the latter, like SpectralClustering. According to the doc:
fit_predict(X[, y]): Performs clustering on X and returns cluster labels.
predict(X): Predict the closest cluster each sample in X belongs to.
I don't really understand the difference between the two; they seem equivalent to me.
In order to use predict() you must call fit() first, so using fit() and then predict() on the same data is the same as using fit_predict(). However, using fit() alone can be useful when you need to inspect the fitted model's parameters, whereas fit_predict() only gives you the labels from running the model on the data.
fit_predict is usually used with transductive estimators in unsupervised machine learning.
Basically, fit_predict(X) is equivalent to fit(X).predict(X).
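A quick sketch with KMeans on toy data (the array and the parameters are mine) showing the two call patterns side by side:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)

# fit() followed by predict() on the same data...
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels_a = km.fit(X).predict(X)

# ...gives the same labels as a single fit_predict() call
labels_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels_a, labels_b)

# unlike fit_predict(), the fitted model can also label new points later
print(km.predict(np.array([[9.0, 9.0]])))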
This might be a very late answer, but it may benefit someone in the future.
The reason I can see for having predict() in KMeans but only fit_predict() in DBSCAN is:
In KMeans you get centroids based on the number of clusters you ask for. So once you have trained on your data points using fit(), you can use predict() to assign a new, single data point to a specific cluster.
In DBSCAN there are no centroids; clusters are formed based on the min_samples and eps (the minimum distance between two points for them to be considered neighbors) that you define. The algorithm returns cluster labels for all the data points, which explains why there is no predict() method for a single data point. The difference between fit() and fit_predict() was already explained by another user.
Another spatial clustering algorithm, HDBSCAN, offers an option to predict using approximate_predict(); it's worth exploring.
Again, this is my understanding based on the source code I explored; experts are welcome to point out any differences.

How to use a precomputed distance matrix in Scikit KMeans?

I'm new to scikit.
I can't find an example using a precomputed distance matrix in Scikit KMeans.
Could anybody shed some light on this, preferably with an example?
Scikit-learn's KMeans does not allow you to pass in a custom (precomputed) distance matrix. It can precompute the Euclidean distance matrix itself to speed up the process, but there is no way to use your own without hacking the source.
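If a precomputed matrix is a hard requirement and k-means itself is not, one workaround (my suggestion, not part of the original answer) is to switch to a scikit-learn estimator that accepts one, such as DBSCAN:

import numpy as np
from sklearn.cluster import DBSCAN

# D is a full, symmetric pairwise distance matrix computed with any metric you like
D = np.array([[0.0, 0.1, 5.0],
              [0.1, 0.0, 5.2],
              [5.0, 5.2, 0.0]])

# DBSCAN (unlike KMeans) accepts a precomputed distance matrix directly
labels = DBSCAN(eps=0.5, min_samples=1, metric="precomputed").fit_predict(D)
print(labels)  # [0 0 1] for this toy matrix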
