Affinity Propagation (sklearn) - strange behavior

Trying to use affinity propagation for a simple clustering task:
from sklearn.cluster import AffinityPropagation
c = [[0], [0], [0], [0], [0], [0], [0], [0]]
af = AffinityPropagation(affinity='euclidean').fit(c)
print (af.labels_)
I get this strange result:
[0 1 0 1 2 1 1 0]
I would expect to have all samples in the same cluster, like in this case:
c = [[0], [0], [0]]
af = AffinityPropagation(affinity='euclidean').fit(c)
print (af.labels_)
which indeed puts all samples in the same cluster:
[0 0 0]
What am I missing?
Thanks

I believe this is because your problem is essentially ill-posed (you pass lots of the same point to an algorithm which is trying to find similarity between different points). AffinityPropagation is doing matrix math under the hood, and your similarity matrix (which is all zeros) is nastily degenerate. In order to not error out, the implementation adds a small random matrix to the similarity matrix, preventing the algorithm from quitting when it encounters two of the same point.
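A quick way to see the degeneracy, as a minimal sketch (random_state was only added to AffinityPropagation in newer scikit-learn releases): with identical points every pairwise distance is zero, so the resulting labels depend almost entirely on the random tie-breaking noise.
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import euclidean_distances
c = np.zeros((8, 1))
print(euclidean_distances(c).max())  # 0.0 - every pairwise distance is zero
for seed in (0, 1, 2):
    af = AffinityPropagation(affinity='euclidean', random_state=seed).fit(c)
    print(seed, af.labels_)  # labels may change from seed to seed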

Related

Multiplying two sub-matrices in NumPy without copying the matrices

I have two matrices A and B and I'm interested in multiplying a sub-matrix of A (defined by some of its rows) and a sub-matrix of B (defined by some of its columns) as shown in the example below:
import numpy as np
# create two matrices
A = np.ones((5, 3))
B = np.ones((3, 5))
# define sub-matrices
rows_idx = [0, 2, 3]
cols_idx = [1, 2, 4]
# print sub-matrices
print(A[rows_idx])
print(B[:, cols_idx])
I'm aware that this can be achieved directly through
A[rows_idx, :] @ B[:, cols_idx]
However, the latter comes with the drawback of making copies of the rows of A and columns of B since rows_idx and cols_idx are lists, as mentioned here.
This is less efficient in terms of time and memory.
Also, I refrained from creating a Python function to perform the multiplication as Python loops are slow. Numpy array operations are much faster. Here are some timing comparisons.
Is there a way to compute A[rows_idx, :] @ B[:, cols_idx] without copying?
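For context on the copying claim above, here is a minimal check (using np.shares_memory, which is not part of the original question): basic slices are views, while fancy indexing with a list allocates new memory.
import numpy as np
A = np.ones((5, 3))
view = A[0:3]        # basic slicing returns a view of A
copy = A[[0, 2, 3]]  # fancy indexing with a list returns a copy
print(np.shares_memory(view, A))  # True - no data was copied
print(np.shares_memory(copy, A))  # False - new memory was allocated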

What is the role of using the OneVsRestClassifier wrapper around XGBClassifier?

I have a multiclass classification problem with 3 classes.
0 - on a given day (24h) my laptop battery did not die
1 - on a given day my laptop battery died before 12AM
2 - on a given day my laptop battery died at or after 12AM
(Note that these categories are mutually exclusive. The battery is not recharged once it died)
I am interested in the predicted probability for each of the 3 classes. More specifically, I intend to derive two types of warning:
If the prediction for class 1 is higher than a threshold x: 'Your battery is at risk of dying in the morning.'
If the prediction for class 2 is higher than a threshold y: 'Your battery is at risk of dying in the afternoon.'
I can generate the probabilities by using xgboost.XGBClassifier with the appropriate parameters for a multiclass problem.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from xgboost import XGBClassifier
X = np.array([
[10, 10],
[8, 10],
[-5, 5.5],
[-5.4, 5.5],
[-20, -20],
[-15, -20]
])
y = np.array([0, 1, 1, 1, 2, 2])
clf1 = XGBClassifier(objective = 'multi:softprob', num_class = 3, seed = 42)
clf1.fit(X, y)
clf1.predict_proba([[-19, -20]])
Results:
array([[0.15134096, 0.3304505 , 0.51820856]], dtype=float32)
But I can also wrap this with sklearn.multiclass.OneVsRestClassifier. Which then produces slightly different results:
clf2 = OneVsRestClassifier(XGBClassifier(objective = 'multi:softprob', num_class = 3, seed = 42))
clf2.fit(X, y)
clf2.predict_proba([[-19, -20]])
Results:
array([[0.10356173, 0.34510303, 0.5513352 ]], dtype=float32)
I was expecting the two approaches to produce the same results. My understanding was that XGBClassifier is also based on a one-vs-rest approach in a multiclass case, since there are 3 probabilities in the output and they sum up to 1.
Can you tell me where the difference comes from, and how the respective results should be interpreted? And most importantly, which approach is better suited to solve my problem?
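One way to see where the difference comes from (a sketch, not a definitive answer): OneVsRestClassifier trains an independent binary model per class and, in single-label problems, rescales their per-class probabilities so they sum to 1, whereas objective='multi:softprob' trains a single joint model whose softmax couples all three classes. Inspecting the fitted objects makes the structural difference visible:
# run after fitting clf1 and clf2 as above
print(type(clf1).__name__)    # a single XGBClassifier (one joint softmax model)
print(len(clf2.estimators_))  # 3 independent binary sub-models
for est in clf2.estimators_:
    print(est.classes_)       # each sub-model only sees "this class" (1) vs "rest" (0)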

How can I determine neighborhood from pairwise distance matrix efficiently?

I have an M × N pairwise distance matrix between M points from group A and N points from group B.
I want to get the list of neighboring points from group B for each point in group A.
Is there an efficient way to do this in PyTorch, instead of multiple for loops?
Thanks
You can use sort:
import torch
# fake pairwise distance matrix, M=3, N=4
x = torch.rand((3,4))
print(x)
# tensor([[0.7667, 0.6847, 0.3779, 0.3007],
# [0.9881, 0.9909, 0.3180, 0.5389],
# [0.6341, 0.8095, 0.4214, 0.7216]])
closest = torch.sort(x, dim=-1) # default is -1, but I prefer to be clear
# let's say you want the k=2 closest points
k=2
closest_k_values = closest[0][:, :k]
closest_k_indices = closest[1][:, :k]
print(closest_k_values)
# tensor([[0.3007, 0.3779],
# [0.3180, 0.5389],
# [0.4214, 0.6341]])
print(closest_k_indices)
# tensor([[3, 2],
# [2, 3],
# [2, 0]])
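As a side note (assuming a reasonably recent PyTorch), torch.topk with largest=False returns the k nearest values and indices directly, without sorting entire rows:
import torch
x = torch.rand((3, 4))  # fake M x N pairwise distance matrix
k = 2
values, indices = torch.topk(x, k, dim=-1, largest=False)
print(values)   # the k smallest distances in each row
print(indices)  # the matching column indices (neighbors in group B)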

python3: find most k nearest vectors from a list?

Say I have a vector v1 and a list of vectors l1. I want to find the k vectors from l1 that are closest (most similar) to v1, in descending order of similarity.
I have a function sim_score(v1,v2) that will return a similarity score between 0 and 1 for any two input vectors.
Indeed, a naive way is to write a for loop over l1, compute the distance for each element, store the distances in another list, and then sort that list. But is there a more Pythonic way to do the task?
Thanks
import numpy as np
# distances of the 3 nearest vectors (assumes v1 and the elements of l1 are numpy arrays)
np.sort([np.sqrt(np.sum((l - v1) * (l - v1))) for l in l1])[:3]
Consider using scipy.spatial.distance module for distance computations. It supports the most common metrics.
import numpy as np
from scipy.spatial import distance
v1 = [[1, 2, 3]]
l1 = [[11, 3, 5],
[ 2, 1, 9],
[.1, 3, 2]]
# compute distances
dists = distance.cdist(v1, l1, metric='euclidean')
# sorted distances
sd = np.sort(dists)
Note that each parameter to cdist must be two-dimensional. Hence, v1 must be a nested list, or a 2d numpy array.
You may also use your homegrown metric like:
def my_metric(a, b, **kwargs):
    # some logic - for example, a squared Euclidean distance
    return np.sum((a - b) ** 2)

dists = distance.cdist(v1, l1, metric=my_metric)
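If you want to keep the sim_score function from the question (where higher means more similar), heapq.nlargest gives the k most similar vectors in descending order without sorting the whole list; a small sketch, assuming sim_score behaves as described:
import heapq
def top_k_similar(v1, l1, k, sim_score):
    # the k vectors from l1 with the highest similarity to v1, most similar first
    return heapq.nlargest(k, l1, key=lambda v: sim_score(v1, v))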

AgglomerativeClustering scikit learn connectivity

I have a matrix x=
[[0,1,1,1,0,0,0,0],
[1,0,1,1,0,0,0,0],
[1,1,0,1,0,0,0,0],
[1,1,1,0,0,0,0,0],
[0,0,0,0,0,1,1,1],
[0,0,0,0,1,0,1,1],
[0,0,0,0,1,1,0,1],
[0,0,0,0,1,1,1,0],]
After calling AgglomerativeClustering I was expecting the data to be divided into 2 clusters, (0-3) and (4-7), i.e. labels_ = [0,0,0,0,1,1,1,1], but instead the labels_ list is [0, 0, 0, 1, 0, 0, 0, 1].
My code is as follows:
s = AgglomerativeClustering(affinity='precomputed', n_clusters=2, linkage='complete')
s.fit(x)
Does the code contain any error? Why is the clustering not as expected?
It appears to me, after playing around with a few examples, that AgglomerativeClustering interprets the 'affinity' matrix as a distance matrix, although I can't find this specified anywhere. This means that your 0's and 1's should be switched.
Also it only seems to consider the upper triangular portion of the matrix (everything else being redundant).
I believe that defining x as:
x =
[[0,0,0,0,1,1,1,1],
 [0,0,0,0,1,1,1,1],
 [0,0,0,0,1,1,1,1],
 [0,0,0,0,1,1,1,1],
 [0,0,0,0,0,0,0,0],
 [0,0,0,0,0,0,0,0],
 [0,0,0,0,0,0,0,0],
 [0,0,0,0,0,0,0,0]]
will give you the expected results.
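The same 0/1 switch can also be written programmatically; a small sketch (note that newer scikit-learn releases replace the affinity= argument with metric= for precomputed inputs):
import numpy as np
from sklearn.cluster import AgglomerativeClustering
x = np.array(x)
d = 1 - x               # linked points (1) become distance 0, unlinked points become 1
np.fill_diagonal(d, 0)  # a point is at distance 0 from itself
# use metric='precomputed' instead of affinity='precomputed' on scikit-learn >= 1.4
s = AgglomerativeClustering(affinity='precomputed', n_clusters=2, linkage='complete')
print(s.fit(d).labels_)  # e.g. [0 0 0 0 1 1 1 1] - the two expected groups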
The error is in how you are specifying the connectivity matrix. From your description, I assume that your matrix indicates linkage between points, where [0/1] indicate [no link/link]. However, the algorithm treats this as a matrix of pairwise distances, which is why you get unexpected results.
You could convert your affinity matrix to a sort of distance matrix with a simple transform; e.g.
>>> x = np.array(x)
>>> s.fit(np.exp(-x))
>>> s.labels_
array([1, 1, 1, 1, 0, 0, 0, 0])
Better would be to use an actual distance metric on the data used to generate this affinity matrix.
