How does the `cosine` metric work in sklearn's clustering algorithms? - scikit-learn

I'm puzzled about how the cosine metric works in sklearn's clustering algorithms.
For example, DBSCAN has a parameter eps, which specifies the maximum distance when clustering. However, a bigger cosine similarity means two vectors are closer, which is just the opposite of our concept of distance.
I found that there are cosine_similarity and cosine_distances (just 1 - cos()) in sklearn.metrics.pairwise, and that when we specify the metric as cosine we use cosine_similarity.
So, when clustering, how does DBSCAN compare the cosine_similarity against the parameter eps to decide whether two vectors have the same label?
An example:
import numpy as np
from sklearn.cluster import DBSCAN
samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
clf = DBSCAN(metric='cosine', eps=0.1)
result = clf.fit_predict(samples)
print(result)
It outputs [-1 -1 -1 -1], which I took to mean these four points are in the same cluster.
However:
For the point pair [1, 1], [2, 2],
the cosine similarity is 4/4 = 1,
so the cosine distance is 1 - 1 = 0 and they are in the same cluster.
For the point pair [1, 1], [1, 0],
the cosine similarity is 1/sqrt(2),
so the cosine distance is 1 - 1/sqrt(2) = 0.29289321881345254. This distance is bigger than our eps of 0.1, so why did DBSCAN cluster them into the same cluster?
Thanks to @Stanislas Morbieu's answer, I finally understand that the cosine metric means cosine distance, which is 1 - cosine_similarity.
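For reference, here is a quick check (a sketch) that the two pairwise functions differ only by that 1 - … relationship:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

samples = np.array([[1, 0], [0, 1], [1, 1], [2, 2]])
# cosine_distances is 1 - cosine_similarity, so it is a proper "smaller is closer" measure
print(np.allclose(cosine_distances(samples), 1 - cosine_similarity(samples)))  # True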

The implementation of DBSCAN in scikit-learn relies on NearestNeighbors (see the implementation of DBSCAN).
Here is an example to see how it works with the cosine metric:
import numpy as np
from sklearn.neighbors import NearestNeighbors
samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
neigh = NearestNeighbors(radius=0.1, metric='cosine')
neigh.fit(samples)
rng = neigh.radius_neighbors([[1, 1]])
print([samples[i] for i in rng[1][0]])
It outputs [[1, 1], [2, 2]], i.e. the points closest to [1, 1] within a radius of 0.1 (in cosine distance).
So points which have a cosine distance smaller than eps in DBSCAN tend to be in the same cluster.
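The distances returned by radius_neighbors confirm this (a small check, reusing neigh from above):
# radius_neighbors also returns the distances; both [1, 1] and [2, 2] are at
# cosine distance 0 from [1, 1], well within the radius of 0.1.
dist, ind = neigh.radius_neighbors([[1, 1]])
print(dist[0])  # [0. 0.]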
The parameter min_samples of DBSCAN plays an important role here. Since it is set to 5 by default, no point in this example can be considered a core point.
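To see why, one can count each point's neighbours within eps (a sketch reusing the fitted NearestNeighbors from above; note that each point counts itself):
# No point has 5 or more neighbours within a cosine distance of 0.1,
# so with the default min_samples=5 there are no core points.
counts = [len(ind) for ind in neigh.radius_neighbors(samples)[1]]
print(counts)  # [1, 1, 2, 2]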
Setting it to 1, the example code:
import numpy as np
from sklearn.cluster import DBSCAN
samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
clf = DBSCAN(metric='cosine', eps=0.1, min_samples=1)
result = clf.fit_predict(samples)
print(result)
outputs [0 1 2 2] which means that [1, 1] and [2, 2] are in the same cluster (numbered 2).
By the way, the output [-1, -1, -1, -1] doesn't mean that the points are in the same cluster, but that all points are labelled as noise, i.e. they belong to no cluster.

Related

Voronoi Diagram Explanation

I am trying to generate Voronoi split polygons and am not able to understand the parameter furthest_site=True in SciPy's Voronoi implementation.
import numpy as np
from scipy.spatial import Voronoi, voronoi_plot_2d
import matplotlib.pyplot as plt

points = np.array([[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2],
                   [2, 0], [2, 1], [2, 2]])
vor = Voronoi(points, furthest_site=True)
fig = voronoi_plot_2d(vor)
plt.show()
This gives me an odd-looking plot. What is the explanation for the parameter furthest_site=True?
SciPy says it uses Qhull to compute Voronoi diagrams, and the Qhull documentation has this:
The furthest-site Voronoi diagram is the furthest-neighbor map for a set of points. Each region contains those points that are further from one input site than any other input site.
Furthest-site (or "farthest-site") diagrams are described in plenty of other places, including example diagrams; for example, in other Stack Exchange posts: 1, 2.
Your plot looks odd because your point set is somewhat degenerate: only the four corner points ever serve as the furthest reference point.
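To compare the two modes side by side, a plotting sketch (the subplot arrangement is just for illustration):
import numpy as np
from scipy.spatial import Voronoi, voronoi_plot_2d
import matplotlib.pyplot as plt

points = np.array([[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2],
                   [2, 0], [2, 1], [2, 2]])
# Left: ordinary (nearest-site) diagram; right: furthest-site diagram.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
voronoi_plot_2d(Voronoi(points), ax=ax1)
voronoi_plot_2d(Voronoi(points, furthest_site=True), ax=ax2)
ax1.set_title('furthest_site=False')
ax2.set_title('furthest_site=True')
plt.show()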

What does the ordering/index of cluster_centers_ represent in KMeans clustering in sklearn?

I have implemented the following code:
k_mean = KMeans(n_clusters=5, init=centroids, n_init=1, random_state=SEED).fit(X_input)
k_mean.cluster_centers_.shape
>>
(5, 50)
I have 5 clusters of the data.
How are the clusters ordered? Do the indices of the cluster centres represent the labels?
That is, does the cluster centre at index 0 represent the label 0 or not?
In the docs you have a similar example:
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])
Yes, the indexes are ordered: the row index in cluster_centers_ corresponds to the cluster label. By the way, with k_mean.cluster_centers_.shape you only return the shape of the array, not the values. So in your case you have 5 clusters, and the dimension of your features is 50.
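To verify, one can index cluster_centers_ with labels_ on the docs example above (a quick sketch):
# Row i of cluster_centers_ is the centre of the cluster labelled i,
# so indexing by labels_ recovers the centre assigned to each sample.
print(kmeans.cluster_centers_[kmeans.labels_])
# -> [[ 1.  2.]
#     [ 1.  2.]
#     [ 1.  2.]
#     [10.  2.]
#     [10.  2.]
#     [10.  2.]]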
To get the nearest point, you can have a look here.
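One possible approach (a sketch; the linked answer may differ) uses pairwise_distances_argmin from sklearn.metrics:
from sklearn.metrics import pairwise_distances_argmin

# For each centre, the index of the nearest sample in X.
nearest = pairwise_distances_argmin(kmeans.cluster_centers_, X)
print(nearest)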

LDA covariance matrix not match calculated covariance matrix

I'm looking to better understand the covariance_ attribute returned by scikit-learn's LDA object.
I'm sure I'm missing something, but I expect it to be the covariance matrix associated with the input data. However, when I compare .covariance_ against the covariance matrix returned by numpy.cov(), I get different results.
Can anyone help me understand what I am missing? Thanks and happy to provide any additional information.
Please find a simple example illustrating the discrepancy below.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Sample Data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 0, 0, 0])
# Covariance matrix via np.cov
print(np.cov(X.T))
# Covariance matrix via LDA
clf = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(clf.covariance_)
In sklearn.discriminant_analysis.LinearDiscriminantAnalysis, the covariance is computed as follows:
import numpy as np

cov = np.zeros(shape=(X.shape[1], X.shape[1]))
for c in np.unique(y):
    Xg = X[y == c, :]
    cov += np.count_nonzero(y == c) / len(y) * np.cov(Xg.T, bias=1)
print(cov)
which outputs:
[[0.66666667 0.33333333]
 [0.33333333 0.22222222]]
So it corresponds to the sum of the covariance matrices of each individual class, each weighted by a prior, which is the class frequency. Note that this prior is a parameter of LDA.
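As a quick consistency check (reusing X, y and clf from above), the hand-computed matrix should match the attribute:
# The weighted within-class covariance should reproduce clf.covariance_.
manual = sum(
    np.count_nonzero(y == c) / len(y) * np.cov(X[y == c].T, bias=1)
    for c in np.unique(y)
)
print(np.allclose(manual, clf.covariance_))  # expected: True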

Train/fit a Linear Regression in sklearn with only one feature/variable

So I am trying to understand lasso regression, and I don't understand why it needs two input values per sample to predict another value when it's just a 2-dimensional regression.
It says in the documentation that
clf.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
which I don't understand. Why is it [0,0] or [1,1] and not just [0] or [1]?
[[0, 0], [1, 1], [2, 2]]
means that you have 3 samples/observations, each characterised by 2 features/variables (2-dimensional).
Indeed, you could have these 3 samples with only 1 feature/variable each and still be able to fit a model.
Example using 1 feature.
from sklearn import datasets
from sklearn import linear_model
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :1]  # we only take the first feature
y = iris.target
clf = linear_model.Lasso(alpha=0.1)
clf.fit(X,y)
print(clf.coef_)
print(clf.intercept_)
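Likewise, the toy example from the documentation can be rewritten with a single feature per sample (a minimal sketch):
from sklearn import linear_model

# Same targets as the docs example, but one feature per sample instead of two.
clf = linear_model.Lasso(alpha=0.1)
clf.fit([[0], [1], [2]], [0, 1, 2])
print(clf.coef_)       # a single coefficient, one per feature
print(clf.intercept_)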

scikit-learn: Get selected features for prediction data

I have a training set of data. The Python script that creates the model also computes the attributes into a numpy array (it's a bit vector). I then use VarianceThreshold to eliminate all features that have zero variance (e.g. all 0s or all 1s), and run get_support(indices=True) to get the indices of the selected columns.
My issue now is how to keep only the selected features for the data I want to predict on. I first calculate all features and then use array indexing, but it does not work:
x_predict_all = getAllFeatures(suppl_predict)
x_predict = x_predict_all[indices] #only selected features
indices is a numpy array.
The returned array x_predict has the correct length len(x_predict), but the wrong shape: x_predict.shape[1] is still the original number of features. My classifier then throws an error due to the wrong shape:
prediction = gbc.predict(x_predict)
  File "C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py", line 1032, in _init_decision_function
    self.n_features, X.shape[1]))
ValueError: X.shape[1] should be 1855, not 2090.
How can I solve this issue?
You can do it like this:
Test data:
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])
selector = VarianceThreshold()
Alternative 1
>>> selector.fit(X)
>>> idxs = selector.get_support(indices=True)
>>> X[:, idxs]
array([[2, 0],
       [1, 4],
       [1, 1]])
Alternative 2
>>> selector.fit_transform(X)
array([[2, 0],
       [1, 4],
       [1, 1]])
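Applied to the original problem: x_predict_all[indices] selects rows, not columns, which is why the second axis keeps its original size. A sketch of the fix, reusing the question's names (x_predict_all, indices) and a selector fitted on the training features:
# Select columns explicitly with 2-D indexing...
x_predict = x_predict_all[:, indices]
# ...or, preferably, reuse the selector that was fitted on the training data:
x_predict = selector.transform(x_predict_all)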
