scikit-learn clustering: predict(X) vs. fit_predict(X) - python-3.x

In scikit-learn, some clustering algorithms have both predict(X) and fit_predict(X) methods, like KMeans and MeanShift, while others only have the latter, like SpectralClustering. According to the doc:
fit_predict(X[, y]): Performs clustering on X and returns cluster labels.
predict(X): Predict the closest cluster each sample in X belongs to.
I don't really understand the difference between the two; they seem equivalent to me.

To use predict() you must call fit() first. For an estimator that has both methods, such as KMeans, calling fit(X) and then predict(X) on the same data gives the same labels as calling fit_predict(X). The benefit of calling fit() on its own is that you can work with the fitted model itself: inspect what it learned (for example the cluster centers) and reuse it to label other data. fit_predict() is the shortcut for when all you want back is the labels for the data you passed in.

fit_predict is usually used with unsupervised, transductive estimators, i.e. estimators that can only label the data they were fitted on.
For estimators that also provide predict, fit_predict(X) is basically equivalent to fit(X).predict(X).
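As a rough check of that equivalence, here is a minimal sketch with KMeans. The toy data from make_blobs, the fixed random_state, and the variable names are my own assumptions for illustration; they are not from the question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: three blobs (illustrative only, not from the question).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Two-step version: fit the model, then label the same data.
km_two_step = KMeans(n_clusters=3, n_init=10, random_state=0)
labels_two_step = km_two_step.fit(X).predict(X)

# One-step version on a fresh estimator with the same seed.
km_one_step = KMeans(n_clusters=3, n_init=10, random_state=0)
labels_one_step = km_one_step.fit_predict(X)

same_labels = np.array_equal(labels_two_step, labels_one_step)
print("identical labels:", same_labels)

# A fitted KMeans keeps its centroids, so it can also label unseen points.
new_points = np.array([[0.0, 0.0], [5.0, 5.0]])
new_labels = km_one_step.predict(new_points)
print("cluster centers shape:", km_one_step.cluster_centers_.shape)
print("labels for new points:", new_labels)
```

The same seed is used for both estimators so that the two fits are comparable; with different seeds the cluster numbering could differ even when the grouping is the same.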

This might be very late to add an answer here, but someone might benefit from it in the future.
The reason I can see for having predict in KMeans and only fit_predict in DBSCAN is this:
In KMeans you get centroids, based on the number of clusters you ask for. So once you have trained on your datapoints using fit(), you can use predict() to assign a new single datapoint to a specific cluster.
In DBSCAN you don't have centroids; clusters are formed based on the min_samples and eps (the maximum distance between two points for them to be considered neighbors) that you define. The algorithm returns cluster labels for all the datapoints it was fitted on. This behavior explains why there is no predict() method to label a single new datapoint. The difference between fit() and fit_predict() was already explained by another user.
Another spatial clustering algorithm, hdbscan, gives us an option to predict using approximate_predict(). It's worth exploring.
Again, this is my understanding based on the source code I explored. Experts are welcome to point out anything I got wrong.
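To make the contrast concrete, here is a small sketch that checks which methods the two scikit-learn estimators expose. The toy data and the eps/min_samples values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Toy data (illustrative only).
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.4, random_state=1)
new_point = np.array([[0.0, 0.0]])

# KMeans stores centroids, so a new point can be assigned to the nearest one.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
kmeans_label = kmeans.predict(new_point)

# DBSCAN only labels the data it was fitted on; -1 marks noise points.
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

print("KMeans has predict:", hasattr(kmeans, "predict"))
print("DBSCAN has predict:", hasattr(dbscan, "predict"))
print("DBSCAN labels found:", np.unique(dbscan_labels))
```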

Related

Can I use Sklearn EllipticEnvelope for univariate data?

Sklearn EllipticEnvelope estimates the covariance between two or more features and uses it to flag outliers. Instead of using two features, I created one new feature by dividing the first by the second. When I apply EllipticEnvelope to just this one new feature, it works well. But is this a correct way to do it, given that the model relies on the covariance of two or more features?
I found the answer. It works for both univariate and multivariate data. I would still love to see more answers about how it works with a single feature.
“EllipticEnvelope is a function that tries to figure out the key parameters of your data's general distribution by assuming that your entire data is an expression of an underlying multivariate Gaussian distribution. That's an assumption that cannot hold true for all datasets, yet when it does, it proves an effective method indeed for spotting outliers. Simplifying the complex estimations working behind the algorithm as much as possible, we can say that it checks the distance of each observation with respect to a grand mean that takes into account all the variables in your dataset. For this reason, it is able to spot both univariate and multivariate outliers.”
Source: Alberto Boschetti. “Python Data Science Essentials.”
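For anyone who wants to try the single-feature case, here is a minimal sketch. The synthetic "ratio" feature, the planted outlier, and the contamination value are assumptions made up for the example; the only real requirement is that the input is a 2D array with one column.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)

# Hypothetical single feature (e.g. a ratio of two columns), roughly Gaussian.
ratio = rng.normal(loc=1.0, scale=0.1, size=200)
ratio[0] = 5.0  # planted outlier, far away from the rest

# scikit-learn expects shape (n_samples, n_features), so use one column.
X = ratio.reshape(-1, 1)

detector = EllipticEnvelope(contamination=0.02, random_state=0)
flags = detector.fit_predict(X)  # +1 for inliers, -1 for outliers

print("flag for planted outlier:", flags[0])
print("number of flagged points:", int((flags == -1).sum()))
```

With one feature the "covariance" is simply the variance of that feature, so the distance being thresholded is a robust version of the usual distance from the mean in standard deviations.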

Do I need a test-train split for K-means clustering even if I'm not looking to predict anything?

I have a set of 2000 points, which are the x,y coordinates of pass origins from association football. I want to run a k-means clustering algorithm on them to find the 10 most common pass origins (k=10). However, I don't want to predict anything for future values; I simply want to work with the existing data. Do I still need to split it into training and test sets? I assume that is only done when we want to train the model on one set and then apply it to future values (?)
I'm new to clustering (and Python as a whole) so any help would be appreciated.
No, in clustering (i.e. unsupervised learning) you do not need to split the data.
I disagree with the answer. Clustering has accuracy as a metric. If you do not split the data into train and test sets, you will most likely overfit the model. See these similar questions: 1, 2, 3. Please note that splitting data into train/test sets is unrelated to whether the problem is supervised or unsupervised.
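For reference, the no-split route from the first answer could look roughly like the sketch below. The random coordinates are a stand-in for the real pass data, and the 105 x 68 pitch size is an assumption; it does not settle the disagreement above, it only shows the mechanics of clustering all the points at once.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Stand-in for the 2000 pass origins (x, y); replace with the real data.
# The 105 x 68 pitch dimensions are an assumption for illustration.
passes = np.column_stack([
    rng.uniform(0, 105, size=2000),
    rng.uniform(0, 68, size=2000),
])

# Cluster every point; no train/test split is involved here.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(passes)

# Rank the clusters by how many passes fall into each one.
counts = np.bincount(labels, minlength=10)
order = np.argsort(counts)[::-1]
for cluster in order:
    cx, cy = kmeans.cluster_centers_[cluster]
    print(f"cluster {cluster}: {counts[cluster]} passes, center=({cx:.1f}, {cy:.1f})")
```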

Is there any support for BiPlots when using PCA in spark.ml?

I have used k-means and PCA to try to visualise high-dimensional k-means clusters in two dimensions, but I have lost the meaning of the clusters in 2D.
Is there any way to project the features onto the 2D plot to recover some interpretability?
Any non-linear dimensionality reduction method might work better (also called "manifold learning"; e.g. see sklearn's suite). The t-SNE method is generally quite popular for this.
However, these do not take your cluster labels into account. If you wanted to do that (although generally you do not), you could add a penalty to the manifold learning technique that forces same-cluster points to be close together, for example.
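A minimal sketch of the t-SNE suggestion, using scikit-learn rather than spark.ml (so it assumes the data fits in memory on one machine). The iris dataset, the number of clusters, and the perplexity are placeholders chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

# Placeholder data; in practice this would be the high-dimensional features.
X = load_iris().data

# Cluster in the original feature space.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Embed into 2D for plotting. t-SNE never sees the cluster labels.
embedding = TSNE(
    n_components=2, perplexity=30.0, init="pca", random_state=0
).fit_transform(X)

print("embedding shape:", embedding.shape)
# The embedding can then be plotted and colored by `labels`, e.g. with
# matplotlib: plt.scatter(embedding[:, 0], embedding[:, 1], c=labels)
```

Note that the t-SNE axes have no direct meaning in terms of the original features, so this helps with seeing cluster separation rather than with a biplot-style reading of feature directions.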

How does PCA give centers for the KMeans algorithm in scikit-learn

I'm looking at the example code in the scikit-learn KMeans digits example.
The script contains the following code:
# in this case the seeding of the centers is deterministic, hence we run the
# kmeans algorithm only once with n_init=1
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
              name="PCA-based",
              data=data)
Why are the eigenvectors used as initial centers, and is there any intuition for this?
There is a stackexchange link here, and also some discussion on the PCA wikipedia.
There is also an informative mailing list discussion about the creation of this example.
All of these threads point back to this paper, among others. In brief, the paper says that there is a strong relationship between the subspace found by SVD (as seen in PCA) and the optimal cluster centers we seek in K-means, along with associated proofs. The key sentence comes in the lower right of the first page: "We prove that principal components are actually the continuous solution of the cluster membership indicators in the K-means clustering method, i.e., the PCA dimension reduction automatically performs data clustering according to the K-means objective function".
What this amounts to is that SVD/PCA eigenvectors should be very good initializers for K-Means. The authors of this paper actually take things a step further, and project the data into the eigenspace for both of their experiments, then cluster there.
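A self-contained version of the PCA-seeded initialization might look like the sketch below. It leaves out the bench_k_means helper from the scikit-learn example, and standardizing the digits with StandardScaler first is my own assumption for the sketch.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the 8x8 digit images as 64-dimensional vectors and standardize them.
raw_data, targets = load_digits(return_X_y=True)
data = StandardScaler().fit_transform(raw_data)
n_digits = 10

# Each row of pca.components_ is one principal direction (an eigenvector of
# the covariance matrix), so the array has shape (n_digits, n_features).
pca = PCA(n_components=n_digits).fit(data)

# Use those rows as the initial centers. The seeding is deterministic,
# so a single run (n_init=1) is enough.
kmeans = KMeans(init=pca.components_, n_clusters=n_digits, n_init=1).fit(data)

print("initial centers shape:", pca.components_.shape)
print("final inertia:", kmeans.inertia_)
```

Passing an array as init only works when its shape is (n_clusters, n_features), which is why the number of PCA components is set equal to the number of clusters.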

Setting feature weights for KNN

I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
set the feature weights for each feature when "training" the KNN learner.
learn what the optimal weight values are with or without pre-processing the data.
On a related note, I understand that KNN generally does not require training, but since sklearn implements it using KDTrees, the tree must be generated from the training data. However, this sounds like it's turning KNN into a binary tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than others", it usually means that a difference in feature two is worth, say, 10x the same difference in the other coords. A simple way to achieve this is to multiply coord #2 by its weight. So you put into the tree not the original coords but the coords multiplied by their respective weights.
In case your features are combinations of the coords, you might need to apply an appropriate matrix transform to your coords before applying the weights; see PCA (principal component analysis). PCA is likely to help you with question 2.
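The column-scaling idea can be sketched as follows. The weights, the random data, and the variable names are hypothetical; the brute-force part is only there to check that scaling the columns gives the same neighbors as a weighted Euclidean distance.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))      # toy training data (illustrative)
query = rng.normal(size=(1, 3))    # toy query point

# Hypothetical weights: a difference in feature two counts 10x as much.
weights = np.array([1.0, 10.0, 1.0])

# Build the index on the weighted coordinates, and weight the query too.
nn = NearestNeighbors(n_neighbors=3).fit(X * weights)
_, idx = nn.kneighbors(query * weights)

# Brute-force check with an explicit weighted Euclidean distance.
weighted_dist = np.sqrt((((X - query) * weights) ** 2).sum(axis=1))
brute_idx = np.argsort(weighted_dist)[:3]

print("neighbors via scaled columns:", idx[0])
print("neighbors via weighted distance:", brute_idx)
```

The same scaling works with KNeighborsClassifier or KNeighborsRegressor, as long as the identical weights are applied to both the training data and anything passed to predict.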
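The answer to question two is called "metric learning". It was not implemented in scikit-learn when this answer was written; newer releases include NeighborhoodComponentsAnalysis in sklearn.neighbors. Using the popular Mahalanobis distance with a diagonal covariance matrix amounts to rescaling the data using StandardScaler; with a full covariance matrix it also accounts for correlations between features. Ideally you would want your metric to take the labels into account.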
