How to find all distances between points in a matrix without duplicates? - python-3.x

I have a Nx3 matrix that contains the x,y,z coordinates of N points in 3D space. I'd like to find the absolute distances between all points without duplicates.
I tried using scipy.spatial.distance.cdist() [see documentation here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html ]. However, the output matrix contains duplicates of distances. For example, the distance between the points P1 and P2 is calculated twice: once as the distance from P1 to P2 and again as the distance from P2 to P1. See the code output:
>>> from scipy.spatial import distance
>>> points = [[1, 2, 3],
... [4, 5, 6],
... [7, 8, 9]]
>>> distances = distance.cdist(points, points, 'euclidean')
>>> print(distances)
[[ 0. 5.19615242 10.39230485]
[ 5.19615242 0. 5.19615242]
[10.39230485 5.19615242 0. ]]
I'd like the output to be without duplicates. For example, find the distance between the first point and all other points, then the second point and the remaining points (excluding the first point), and so on. Ideally, in an efficient and scalable manner that preserves the order of the points. That is, once I find the distances, I'd like to query them, e.g. finding distances within a certain range, and be able to output the points that correspond to these distances.

Looks like in general you want a KDTree implementation, with query_pairs.
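from scipy.spatial import KDTree
points_tree = KDTree(points)
# radius is the maximum pair distance you are interested in
points_in_radius = points_tree.query_pairs(radius)
This returns each index pair (i, j) with i < j only once, and it will be much faster than computing all of the distances and then applying a tolerance.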
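If the distances themselves are needed, each unordered pair can be computed exactly once by indexing the upper triangle of the pair grid. The following is a minimal NumPy-only sketch, not taken from the answer above; the helper names `unique_pair_distances` and `pairs_in_range` are made up for illustration. As far as I know, `scipy.spatial.distance.pdist` returns the same values in the same pair order.

```python
import numpy as np

def unique_pair_distances(points):
    """Return (i, j, d): index arrays with i < j and the matching Euclidean distances."""
    pts = np.asarray(points, dtype=float)
    i, j = np.triu_indices(len(pts), k=1)  # every unordered pair exactly once
    d = np.linalg.norm(pts[i] - pts[j], axis=1)
    return i, j, d

def pairs_in_range(points, lo, hi):
    """Return [(i, j, distance), ...] for pairs whose distance lies in [lo, hi]."""
    i, j, d = unique_pair_distances(points)
    mask = (d >= lo) & (d <= hi)
    return [(int(a), int(b), float(c)) for a, b, c in zip(i[mask], j[mask], d[mask])]

points = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
i, j, d = unique_pair_distances(points)
print(list(zip(i.tolist(), j.tolist())))  # [(0, 1), (0, 2), (1, 2)]
print(d)                                  # one distance per pair, same order
print(pairs_in_range(points, 5.0, 6.0))   # pairs (0, 1) and (1, 2) qualify
```

The index arrays keep the link back to the original points, so a range query can report which points are involved. This sketch stores all N(N-1)/2 distances, so for large N the KDTree approach is likely the better fit.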

Related

How to use "cv2.estimateAffine3D" correctly, to align two 3d coordinate systems?

I have to align two 3d coordinate systems and I have 32 coordinates of points in the real world for each coordinate system.
I already know that I have to calculate the translation and the rotation, and I think I also understand what the calculation is. Because some smart people have already solved this problem, I just want to use OpenCV for it, so I asked the aiChatBot how to do this calculation both with and without OpenCV.
I think using "cv2.estimateAffine3D" looks right.
Maybe someone can give me an example of how to use it correctly?
At the moment, the example code I have from the aiChatBot looks like this:
import cv2
import numpy as np
# Define the coordinates of the three points in the first coordinate system
p1 = np.array([[1, 2, 3]], dtype=np.float32)
p2 = np.array([[4, 5, 6]], dtype=np.float32)
p3 = np.array([[7, 8, 9]], dtype=np.float32)
# Define the coordinates of the three points in the second coordinate system
q1 = np.array([[1, 2, 3]], dtype=np.float32)
q2 = np.array([[5, 6, 7]], dtype=np.float32)
q3 = np.array([[9, 10, 11]], dtype=np.float32)
# Calculate the affine transformation that aligns the two coordinate systems
T = cv2.estimateAffine3D(p1, p2, p3)
# Reshape the point arrays to 2D or 3D
q1_reshaped = q1.reshape(1, -1, 3)
q2_reshaped = q2.reshape(1, -1, 3)
q3_reshaped = q3.reshape(1, -1, 3)
# Apply the transformation to the second coordinate system
q1_aligned = cv2.transform(q1_reshaped, T)
q2_aligned = cv2.transform(q2_reshaped, T)
q3_aligned = cv2.transform(q3_reshaped, T)
# Print the transformed points
print(q1_aligned)
print(q2_aligned)
print(q3_aligned)
The error I get is:
error: OpenCV(4.6.0) :-1: error: (-5:Bad argument) in function 'transform'
Overload resolution failed:
m is not a numerical tuple
Expected Ptr<cv::UMat> for argument 'm
The answer from the aiChatBot to solve this is:
"It looks like the error you are experiencing is caused by passing an incorrect argument to the cv2.transform function. The cv2.transform function expects a 2D or 3D point array as the first argument and a 3x3 or 3x4 transformation matrix as the second argument, but you are passing a 1D point array and a 3x3 or 3x4 transformation matrix as arguments.
To fix this error, you need to pass a 2D or 3D point array as the first argument to the cv2.transform function. You can do this by reshaping the 1D point arrays q1, q2, and q3 using the np.reshape function, as follows:"
But that puts us in a loop, because this "solution" is already in the code.
I can't really find any information about this. Can anybody help me with this?
cheers!
I want to align two coordinate systems.
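No answer is recorded for this question. One likely cause of the error (my reading, not verified against OpenCV 4.6.0) is that cv2.estimateAffine3D expects a source point set and a destination point set, each holding all N points, and returns a tuple rather than a bare matrix, so the `T` in the posted code is not something cv2.transform accepts. Since the stated goal is a rotation plus a translation, the sketch below swaps in a different technique, the Kabsch algorithm, written with NumPy only. `rigid_align` is a hypothetical helper name and the sample points are synthetic.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rotation R and translation t with dst approximately src @ R.T + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (determinant -1)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t

# Synthetic check: rotate 32 random points by 30 degrees about z, then shift them
rng = np.random.default_rng(0)
src = rng.normal(size=(32, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
dst = src @ R_true.T + t_true

R, t = rigid_align(src, dst)
aligned = src @ R.T + t
print(np.abs(aligned - dst).max())  # close to zero for noise-free data
```

With measured data the residual will not be zero, and the points must not all lie on one line (as the three sample points in the posted code do), otherwise the rotation is not uniquely determined.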

Perform matrix multiplication with cosine similarity function

I have two lists:
list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
['scent', 'scents', 'aroma', 'smell', 'odor'],
['mental_illness', 'mental_disorders','bipolar_disorder'],
['romance', 'romances', 'romantic', 'budding_romance']]
list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
['also', 'like', 'buy', 'perfumes'],
['suffer', 'from', 'clinical', 'depression'],
['really', 'love', 'my', 'wife']]
I would like to compute the cosine similarity between the two lists above in such a way where the cosine similarity between the first sub-list in list1 and all sublists of list 2 are measured against each other. Then the same thing but with the second sub-list in list 1 and all sub-lists in list 2, etc.
The goal is to create a len(list_2) by len(list_1) matrix, and each entry in that matrix is a cosine similarity score. Currently I've done this the following way:
import gensim
import numpy as np
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)
However, I'd like to implement this with matrix multiplication and no for loops.
My two questions are:
Is there a way to do some sort of element-wise matrix multiplication but with gensim's n_similarity() method to generate the required matrix?
Would it be more efficient and faster using the current method or matrix multiplication?
I hope my question was clear enough, please let me know if I can clarify even further.
Here's an approach, but it's not clear from the question whether you understand the underlying mechanics of the calculation, which might be causing the block.
I've changed the input strings to give more exact word matches, and given the two strings different dimensions to make it a bit clearer:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
['scent', 'my', 'aroma', 'smell', 'odor'],
['mental_illness', 'mental_disorders','bipolar_disorder'],
['romance', 'romances', 'romantic', 'budding_romance']]
list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
['also', 'like', 'buy', 'perfumes'],
['suffer', 'from', 'clinical', 'depression'],
['really', 'love', 'my', 'wife'],
['flavor', 'taste', 'romantic', 'aroma', 'what']]
cnt = CountVectorizer()
# Combine each sublist into single str, and join everything into corpus
combined_lists = ([' '.join(item) for item in list_1] +
[' '.join(item) for item in list_2])
count_matrix = cnt.fit_transform(combined_lists).toarray()
# Split them again into list_1 and list_2 word counts
count_matrix_1 = count_matrix[:len(list_1),]
count_matrix_2 = count_matrix[len(list_1):,]
match_matrix = np.matmul(count_matrix_1, count_matrix_2.T)
Output of match_matrix:
array([[0, 0, 0, 0, 2],
[0, 0, 0, 1, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]], dtype=int64)
You can see that the 1st string in list_1 has 2 matches with the 5th string in list_2, and so on.
So the first part of the calculation (the dot product) has been calculated. Now we need the magnitudes:
magnitudes = np.array([np.linalg.norm(count_matrix[i,:])
for i in range(len(count_matrix))])
Now we can use matrix multiplication to turn that into a matrix of divisors (we need to reshape magnitudes into n x 1 and 1 x n matrices for this to produce an n x n matrix):
divisor_matrix = np.matmul(magnitudes.reshape(len(magnitudes),1),
magnitudes.reshape(1,len(magnitudes)))
Now since we didn't compare every single sublist, but only the list_1 with the list_2 sublists, we need to take a subsection of this divisor matrix to get the right magnitudes:
divisor_matrix = divisor_matrix[:len(list_1), len(list_1):]
Output:
array([[4.89897949, 4. , 4. , 4. , 4.47213595],
[5.47722558, 4.47213595, 4.47213595, 4.47213595, 5. ],
[4.24264069, 3.46410162, 3.46410162, 3.46410162, 3.87298335],
[4.89897949, 4. , 4. , 4. , 4.47213595]])
Now we can calculate the final matrix of cosine similarity scores:
cos_sim = match_matrix / divisor_matrix
Output:
array([[0. , 0. , 0. , 0. , 0.4472136],
[0. , 0. , 0. , 0.2236068, 0.2 ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.2236068]])
Note these scores differ from the example given, since in the example every cosine similarity score would be 0.
There are two problems in the code, in the second-to-last and last lines.
import gensim
import numpy as np
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)
Answers to your questions:
1. You are already using a function that directly calculates the similarity between two sentences (L1 and L2): each is first converted to a vector, and then the cosine similarity of those two vectors is calculated. Everything is done inside n_similarity(), so you can't apply any kind of matrix multiplication to it.
If you want to do your own matrix multiplication, then instead of calling n_similarity() directly, calculate the vectors of the sentences yourself and apply matrix multiplication when computing the cosine similarity.
2. As I said in (1), everything is done in n_similarity(), and the creators of gensim take care of efficiency when writing the library, so any other multiplication method will most likely not make a difference.
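The second suggestion in (1) can be sketched as follows, assuming (as I understand gensim's behaviour) that n_similarity is the cosine between the mean vectors of the two word lists. The `toy_vectors` dictionary of random 8-dimensional vectors is a made-up stand-in for the word2vec model, the shortened word lists are for the demo only, and `mean_unit_vectors` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
list_1 = [['flavor', 'taste'], ['scent', 'aroma', 'smell']]
list_2 = [['love', 'spicy', 'noodles'], ['buy', 'perfumes'], ['love', 'wife']]

# Stand-in for the real model: one random vector per word (assumption for the demo)
vocab = sorted({w for sub in list_1 + list_2 for w in sub})
toy_vectors = {w: rng.normal(size=8) for w in vocab}

def mean_unit_vectors(sublists, vectors):
    """Average the word vectors of each sublist, then scale each row to unit length."""
    means = np.array([np.mean([vectors[w] for w in sub], axis=0) for sub in sublists])
    return means / np.linalg.norm(means, axis=1, keepdims=True)

A = mean_unit_vectors(list_2, toy_vectors)  # shape (len(list_2), dim)
B = mean_unit_vectors(list_1, toy_vectors)  # shape (len(list_1), dim)
similarity_mat = A @ B.T                    # shape (len(list_2), len(list_1))
print(similarity_mat.shape)  # (3, 2)
```

With the real model, `model[w]` would take the place of `toy_vectors[w]`. Only the pairwise double loop is replaced by a single matrix product; averaging the words of each sublist still iterates over the sublists.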

python3: find most k nearest vectors from a list?

Say I have a vector v1 and a list of vectors l1. I want to find the k vectors from l1 that are closest (most similar) to v1, in descending order of similarity.
I have a function sim_score(v1,v2) that will return a similarity score between 0 and 1 for any two input vectors.
Indeed, a naive way is to write a for loop over l1, calculate distance and store them into another list, then sort the output list. But is there a Pythonic way to do the task?
Thanks
import numpy as np
# assumes v1 and the entries of l1 are NumPy arrays; gives the 3 smallest distances
np.sort([np.sqrt(np.sum((l - v1) * (l - v1))) for l in l1])[:3]
Consider using scipy.spatial.distance module for distance computations. It supports the most common metrics.
import numpy as np
from scipy.spatial import distance
v1 = [[1, 2, 3]]
l1 = [[11, 3, 5],
[ 2, 1, 9],
[.1, 3, 2]]
# compute distances
dists = distance.cdist(v1, l1, metric='euclidean')
# sorted distances
sd = np.sort(dists)
Note that each parameter to cdist must be two-dimensional. Hence, v1 must be a nested list, or a 2d numpy array.
You may also use your homegrown metric like:
def my_metric(a, b, **kwargs):
    # a and b are 1-D arrays; replace this placeholder body with your own logic
    return np.abs(a - b).sum()
dists = distance.cdist(v1, l1, metric=my_metric)
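If the similarity function from the question is to be used directly, the standard library's heapq.nlargest returns the k best items already ordered from most to least similar. In this sketch `sim_score` is a placeholder I made up (cosine similarity rescaled to [0, 1]), and `k_most_similar` is a hypothetical helper name; the real function from the question would replace the placeholder.

```python
import heapq
import math

def sim_score(a, b):
    """Placeholder similarity in [0, 1]: cosine similarity rescaled from [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.5 * (1.0 + dot / norm)

def k_most_similar(v1, l1, k):
    """Return the k vectors of l1 most similar to v1, most similar first."""
    return heapq.nlargest(k, l1, key=lambda v: sim_score(v1, v))

v1 = [1, 2, 3]
l1 = [[11, 3, 5], [2, 1, 9], [0.1, 3, 2], [2, 4, 6]]
print(k_most_similar(v1, l1, k=2))  # [[2, 4, 6], [0.1, 3, 2]]
```

This avoids sorting the whole list when k is small compared with len(l1), and it works with any scoring function because only the key is evaluated.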

KNN algorithm that return 2 or more nearest neighbours

For example, I have a vector x, and a is its nearest neighbour. Then b is its next nearest neighbour. Is there any package in Python or R that outputs something like [a, b], meaning that a is its nearest neighbour (maybe by majority vote), while b is its second nearest neighbour?
This is exactly what metric trees are built for.
Your question reads as if you are asking for something as simple as the following, using sklearn's KDTree (consider BallTree, depending on the metric in play):
import numpy as np
from sklearn.neighbors import KDTree
X = np.array([[1,1],[2,2], [3,3]]) # 3 points in 2 dimensions
tree = KDTree(X)
dist, ind = tree.query([[1.25, 1.35]], k=2)
print(ind) # indices of 2 closest neighbors
print(dist) # distances to 2 closest neighbors
Out:
[[0 1]]
[[ 0.43011626 0.99247166]]
And just to be clear: KNN usually refers to some pre-built algorithm based on metric trees (KDTree, BallTree) for the task of classification. Often those data structures are the only thing one is interested in.
Edit
If I interpret your comment correctly, you want to use the manhattan / taxicab / l1 metric.
See the scikit-learn documentation for the lists of metrics each spatial tree supports.
You would just use it like this:
X = np.array([[1,1],[2,2], [3,3]]) # 3 points in 2 dimensions
tree = KDTree(X, metric='l1') # !!!
dist, ind = tree.query([[1.25, 1.35]], k=2)
print(ind) # indices of 2 closest neighbors
print(dist) # distances to 2 closest neighbors
Out:
[[0 1]]
[[ 0.6 1.4]]
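For a handful of points, the tree results above can be cross-checked with a brute-force search. This is a NumPy-only sketch; `k_nearest_brute` is a hypothetical helper name, and because it scans every point it is not a substitute for the tree on large data.

```python
import numpy as np

def k_nearest_brute(X, query, k, p=2):
    """Indices and Minkowski-p distances of the k points in X closest to query."""
    diffs = np.abs(np.asarray(X, dtype=float) - np.asarray(query, dtype=float))
    dist = (diffs ** p).sum(axis=1) ** (1.0 / p)
    order = np.argsort(dist)[:k]  # nearest first
    return order, dist[order]

X = np.array([[1, 1], [2, 2], [3, 3]])

# Euclidean (p=2): indices 0 and 1, distances about 0.430 and 0.992
ind, dist = k_nearest_brute(X, [1.25, 1.35], k=2)
print(ind, dist)

# Manhattan / l1 (p=1): indices 0 and 1, distances about 0.6 and 1.4
ind_l1, dist_l1 = k_nearest_brute(X, [1.25, 1.35], k=2, p=1)
print(ind_l1, dist_l1)
```

Both results agree with the KDTree outputs shown above, which is a quick way to confirm that the chosen metric is the one actually being applied.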

AgglomerativeClustering scikit learn connectivity

I have a matrix x=
[[0,1,1,1,0,0,0,0],
[1,0,1,1,0,0,0,0],
[1,1,0,1,0,0,0,0],
[1,1,1,0,0,0,0,0],
[0,0,0,0,0,1,1,1],
[0,0,0,0,1,0,1,1],
[0,0,0,0,1,1,0,1],
[0,0,0,0,1,1,1,0],]
After calling AgglomerativeClustering I was expecting the data to be divided into 2 clusters (0-3) and (4-7), i.e. labels_=[0,0,0,0,1,1,1,1], but instead the labels_ list is [0, 0, 0, 1, 0, 0, 0, 1].
My code is as follows:
s=AgglomerativeClustering(affinity='precomputed',n_clusters=2,linkage='complete')
s.fit(x)
Does the code contain any error? Why is the clustering not as expected?
It appears to me, after playing around with a few examples, that AgglomerativeClustering interprets the 'affinity' matrix as a distance matrix, although I can't find this specified anywhere. This means that your 0's and 1's should be switched.
Also it only seems to consider the upper triangular portion of the matrix (everything else being redundant).
I believe that defining x as:
x = [[0,0,0,0,1,1,1,1],
     [0,0,0,0,1,1,1,1],
     [0,0,0,0,1,1,1,1],
     [0,0,0,0,1,1,1,1],
     [0,0,0,0,0,0,0,0],
     [0,0,0,0,0,0,0,0],
     [0,0,0,0,0,0,0,0],
     [0,0,0,0,0,0,0,0]]
will give you the expected results.
The error is in how you are specifying the connectivity matrix. From your description, I assume that your matrix indicates linkage between points, where [0/1] indicate [no link/link]. However, the algorithm treats this as a matrix of pairwise distances, which is why you get unexpected results.
You could convert your affinity matrix to a sort of distance matrix with a simple transform; e.g.
>>> x = np.array(x)
>>> s.fit(np.exp(-x))
>>> s.labels_
array([1, 1, 1, 1, 0, 0, 0, 0])
Better would be to use an actual distance metric on the data used to generate this affinity matrix.
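One simple alternative transform, offered as a sketch under my own assumptions rather than as the answerer's method: treat linked points as distance 0 and unlinked points as distance 1, and keep the diagonal at 0 so that every point has zero distance to itself. The helper name `adjacency_to_distance` is made up for illustration.

```python
import numpy as np

def adjacency_to_distance(adj):
    """Map a 0/1 link matrix to distances: linked -> 0, not linked -> 1, self -> 0."""
    dist = 1.0 - np.asarray(adj, dtype=float)
    np.fill_diagonal(dist, 0.0)
    return dist

# Rebuild the matrix from the question: two fully linked blocks of four points
x = np.zeros((8, 8), dtype=int)
x[:4, :4] = 1
x[4:, 4:] = 1
np.fill_diagonal(x, 0)

d = adjacency_to_distance(x)
print(d[0])  # [0. 0. 0. 0. 1. 1. 1. 1.]
```

The resulting `d` is what would be passed to fit in place of x. As far as I know, recent scikit-learn releases name the `affinity` parameter `metric`, so the constructor call may need adjusting depending on the installed version.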
