I have two matrices A and B and I'm interested in multiplying a sub-matrix of A (defined by some of its rows) and a sub-matrix of B (defined by some of its columns) as shown in the example below:
import numpy as np
# create two matrices
A = np.ones((5, 3))
B = np.ones((3, 5))
# define sub-matrices
rows_idx = [0, 2, 3]
cols_idx = [1, 2, 4]
# print sub-matrices
print(A[rows_idx])
print(B[:, cols_idx])
I'm aware that this can be achieved directly through
A[rows_idx, :] @ B[:, cols_idx]
However, the latter comes with the drawback of making copies of the rows of A and columns of B since rows_idx and cols_idx are lists, as mentioned here.
This is less efficient in terms of time and memory.
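To illustrate the copy issue, a quick check with np.shares_memory (using the arrays defined above) shows that indexing with a list of rows returns a copy, while a plain slice returns a view:
print(np.shares_memory(A, A[rows_idx]))   # expected: False, fancy indexing with a list copies
print(np.shares_memory(A, A[0:3]))        # expected: True, basic slicing returns a view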
Also, I refrained from writing the multiplication myself with an explicit Python loop, since Python loops are slow and NumPy array operations are much faster. Here are some timing comparisons.
Is there a way to compute A[rows_idx, :] @ B[:, cols_idx] without copying?
I have a multiclass classification problem with 3 classes.
0 - on a given day (24h) my laptop battery did not die
1 - on a given day my laptop battery died before noon (12 PM)
2 - on a given day my laptop battery died at or after noon (12 PM)
(Note that these categories are mutually exclusive. The battery is not recharged once it died)
I am interested in the predicted probability for each of the 3 classes. More specifically, I intend to derive 2 types of warning:
If the prediction for class 1 is higher than a threshold x: 'Your battery is at risk of dying in the morning.'
If the prediction for class 2 is higher than a threshold y: 'Your battery is at risk of dying in the afternoon.'
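For context, this is roughly how I plan to turn the predicted probabilities into these warnings (a sketch; clf, X_new, x and y stand for the fitted classifier, a new sample, and the two thresholds):
proba = clf.predict_proba(X_new)[0]   # [p(class 0), p(class 1), p(class 2)]
if proba[1] > x:
    print('Your battery is at risk of dying in the morning.')
if proba[2] > y:
    print('Your battery is at risk of dying in the afternoon.')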
I can generate the probabilities by using xgboost.XGBClassifier with the appropriate parameters for a multiclass problem.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from xgboost import XGBClassifier
X = np.array([
    [10, 10],
    [8, 10],
    [-5, 5.5],
    [-5.4, 5.5],
    [-20, -20],
    [-15, -20]
])
y = np.array([0, 1, 1, 1, 2, 2])
clf1 = XGBClassifier(objective = 'multi:softprob', num_class = 3, seed = 42)
clf1.fit(X, y)
clf1.predict_proba([[-19, -20]])
Results:
array([[0.15134096, 0.3304505 , 0.51820856]], dtype=float32)
But I can also wrap this with sklearn.multiclass.OneVsRestClassifier, which then produces slightly different results:
clf2 = OneVsRestClassifier(XGBClassifier(objective = 'multi:softprob', num_class = 3, seed = 42))
clf2.fit(X, y)
clf2.predict_proba([[-19, -20]])
Results:
array([[0.10356173, 0.34510303, 0.5513352 ]], dtype=float32)
I was expecting the two approaches to produce the same results. My understanding was that XGBClassifier is also based on a one-vs-rest approach in a multiclass case, since there are 3 probabilities in the output and they sum up to 1.
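Both outputs above do sum to 1, which is easy to verify with the fitted classifiers:
print(clf1.predict_proba([[-19, -20]]).sum())   # ~1.0
print(clf2.predict_proba([[-19, -20]]).sum())   # ~1.0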
Can you tell me where the difference comes from, and how the respective results should be interpreted? And most importantly, which approach is better suited to solve my problem?
I have an M * N pairwise distance matrix between M points from group A and N points from group B.
I want to get the list of neighbor points from group B for each point from group A.
Is there an efficient way to do this in PyTorch, instead of multiple for loops?
Thanks
You can use sort:
import torch
# fake pairwise distance matrix, M=3, N=4
x = torch.rand((3,4))
print(x)
# tensor([[0.7667, 0.6847, 0.3779, 0.3007],
# [0.9881, 0.9909, 0.3180, 0.5389],
# [0.6341, 0.8095, 0.4214, 0.7216]])
closest = torch.sort(x, dim=-1) # default is -1, but I prefer to be clear
# let's say you want the k=2 closest points
k=2
closest_k_values = closest[0][:, :k]
closest_k_indices = closest[1][:, :k]
print(closest_k_values)
# tensor([[0.3007, 0.3779],
# [0.3180, 0.5389],
# [0.4214, 0.6341]])
print(closest_k_indices)
# tensor([[3, 2],
# [2, 3],
# [2, 0]])
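If you only need the k nearest neighbors, torch.topk with largest=False should give the same values and indices without keeping the full sort (a small sketch reusing x and k from above):
values, indices = torch.topk(x, k=k, dim=-1, largest=False)
# values:  the k smallest distances per row (one row per point in group A)
# indices: the matching column indices, i.e. the neighbor points in group B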
Say I have a vector v1 and a list of vectors l1. I want to find the k vectors from l1 that are closest (most similar) to v1, in descending order of similarity.
I have a function sim_score(v1,v2) that will return a similarity score between 0 and 1 for any two input vectors.
Indeed, a naive way is to write a for loop over l1, calculate the distances, store them in another list, and then sort that list. But is there a more Pythonic way to do the task?
Thanks
import numpy as np
# Euclidean distances from v1 to each vector in l1, sorted ascending; here the 3 smallest are kept
np.sort([np.sqrt(np.sum((l - v1) * (l - v1))) for l in l1])[:3]
Consider using the scipy.spatial.distance module for distance computations. It supports the most common metrics.
import numpy as np
from scipy.spatial import distance
v1 = [[1, 2, 3]]
l1 = [[11, 3, 5],
      [ 2, 1, 9],
      [.1, 3, 2]]
# compute distances
dists = distance.cdist(v1, l1, metric='euclidean')
# sorted distances
sd = np.sort(dists)
Note that each parameter to cdist must be two-dimensional. Hence, v1 must be a nested list, or a 2d numpy array.
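Since the question asks for the k closest vectors themselves rather than just the sorted distances, np.argsort on the same dists array gives the matching indices; a small sketch, assuming the arrays above and k = 2:
k = 2
idx = np.argsort(dists[0])[:k]     # indices of the k smallest distances
closest = [l1[i] for i in idx]     # the k closest vectors from l1, nearest first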
You may also use your homegrown metric like:
def my_metric(a, b, **kwargs):
    # some logic; must return a single distance value for vectors a and b,
    # e.g. (placeholder) the Manhattan distance:
    return np.abs(a - b).sum()
dists = distance.cdist(v1, l1, metric=my_metric)
I have a matrix x=
[[0,1,1,1,0,0,0,0],
[1,0,1,1,0,0,0,0],
[1,1,0,1,0,0,0,0],
[1,1,1,0,0,0,0,0],
[0,0,0,0,0,1,1,1],
[0,0,0,0,1,0,1,1],
[0,0,0,0,1,1,0,1],
[0,0,0,0,1,1,1,0],]
After calling AgglomerativeClustering I was expecting the data to be divided into 2 clusters, (0-3) and (4-7), i.e. labels_ = [0,0,0,0,1,1,1,1], but instead the labels_ list is [0, 0, 0, 1, 0, 0, 0, 1].
My code is as follows:
from sklearn.cluster import AgglomerativeClustering
s = AgglomerativeClustering(affinity='precomputed', n_clusters=2, linkage='complete')
s.fit(x)
Does the code contain any error? Why is the clustering not as expected?
It appears to me, after playing around with a few examples, that AgglomerativeClustering interprets the 'affinity' matrix as a distance matrix, although I can't find this specified anywhere. This means that your 0's and 1's should be switched.
Also it only seems to consider the upper triangular portion of the matrix (everything else being redundant).
I believe that defining x as:
x = [[0,0,0,0,1,1,1,1],
     [0,0,0,0,1,1,1,1],
     [0,0,0,0,1,1,1,1],
     [0,0,0,0,1,1,1,1],
     [0,0,0,0,0,0,0,0],
     [0,0,0,0,0,0,0,0],
     [0,0,0,0,0,0,0,0],
     [0,0,0,0,0,0,0,0]]
will give you the expected results.
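A quick way to test this without typing the matrix out is to swap the 0's and 1's of the original x directly (a sketch):
import numpy as np
from sklearn.cluster import AgglomerativeClustering

x_dist = 1 - np.array(x)   # swap 0's and 1's so that linked points become 'close'
s = AgglomerativeClustering(affinity='precomputed', n_clusters=2, linkage='complete')
print(s.fit(x_dist).labels_)   # should now group rows 0-3 together and rows 4-7 together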
The error is in how you are specifying the affinity matrix. From your description, I assume that your matrix indicates linkage between points, where 0/1 indicate no link/link. However, the algorithm treats this as a matrix of pairwise distances, which is why you get unexpected results.
You could convert your affinity matrix to a sort of distance matrix with a simple transform; e.g.
>>> import numpy as np
>>> x = np.array(x)
>>> s.fit(np.exp(-x))
>>> s.labels_
array([1, 1, 1, 1, 0, 0, 0, 0])
Better would be to use an actual distance metric on the data used to generate this affinity matrix.
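For example, a minimal sketch, assuming the original feature vectors that produced this affinity matrix are available in an array called points (not shown in the question):
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering

D = squareform(pdist(points))   # 'points' is a hypothetical (n_samples, n_features) array
s = AgglomerativeClustering(affinity='precomputed', n_clusters=2, linkage='complete')
print(s.fit(D).labels_)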