python3: find most k nearest vectors from a list? - python-3.x

Say I have a vector v1 and a list of vectors l1. I want to find the k vectors from l1 that are closest (most similar) to v1, in descending order of similarity.
I have a function sim_score(v1,v2) that will return a similarity score between 0 and 1 for any two input vectors.
A naive way is to loop over l1, compute each score, store the scores in another list, and then sort that list. But is there a more Pythonic way to do the task?
Thanks

import numpy as np
# assumes v1 and the elements of l1 are numpy arrays
np.sort([np.sqrt(np.sum((l - v1) * (l - v1))) for l in l1])[:3]

Consider using the scipy.spatial.distance module for distance computations. It supports the most common metrics.
import numpy as np
from scipy.spatial import distance
v1 = [[1, 2, 3]]
l1 = [[11, 3, 5],
[ 2, 1, 9],
[.1, 3, 2]]
# compute distances
dists = distance.cdist(v1, l1, metric='euclidean')
# sorted distances
sd = np.sort(dists)
Note that each parameter to cdist must be two-dimensional. Hence, v1 must be a nested list, or a 2d numpy array.
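Sorting the distances alone loses track of which vector each distance belongs to. A minimal sketch of one way around that, assuming Euclidean distance and an illustrative k = 2 (both are my choices, not part of the question), is to sort the indices with np.argsort instead:

```python
import numpy as np
from scipy.spatial import distance

v1 = np.array([[1, 2, 3]])
l1 = np.array([[11, 3, 5],
               [2, 1, 9],
               [0.1, 3, 2]])
k = 2  # illustrative value

# cdist returns shape (1, len(l1)); take the single row
dists = distance.cdist(v1, l1, metric='euclidean')[0]
# indices of the k smallest distances, nearest first
nearest_idx = np.argsort(dists)[:k]
nearest_vectors = l1[nearest_idx]
print(nearest_idx, nearest_vectors)
```

With a similarity score instead of a distance, the same idea should work by sorting in the opposite direction.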
You may also use your homegrown metric like:
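def my_metric(a, b, **kwargs):
    # some logic; must return a single number for the 1-D arrays a and b
    return np.abs(a - b).sum()  # placeholder (Manhattan distance), replace with your own
dists = distance.cdist(v1, l1, metric=my_metric)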
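If you want to stay with the sim_score(v1, v2) function from the question rather than a distance, heapq.nlargest may be enough. The sketch below is only an illustration: the body of sim_score is a stand-in I made up (it maps Euclidean distance into the range 0 to 1), and k_most_similar is a hypothetical helper name.

```python
import heapq
import numpy as np

def sim_score(v1, v2):
    # Stand-in for the asker's function (assumption): 1 for identical
    # vectors, approaching 0 as the vectors move apart.
    diff = np.asarray(v1, dtype=float) - np.asarray(v2, dtype=float)
    return 1.0 / (1.0 + np.linalg.norm(diff))

def k_most_similar(v1, l1, k):
    # nlargest returns the k items with the highest score,
    # ordered from most to least similar.
    return heapq.nlargest(k, l1, key=lambda v: sim_score(v1, v))

v1 = [1, 2, 3]
l1 = [[11, 3, 5], [2, 1, 9], [0.1, 3, 2]]
top2 = k_most_similar(v1, l1, 2)
print(top2)
```

This avoids sorting the whole list when k is small compared to len(l1).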

Related

Multiplying two sub-matrices in NumPy without copying the matrices

I have two matrices A and B and I'm interested in multiplying a sub-matrix of A (defined by some of its rows) and a sub-matrix of B (defined by some of its columns) as shown in the example below:
import numpy as np
# create two matrices
A = np.ones((5, 3))
B = np.ones((3, 5))
# define sub-matrices
rows_idx = [0, 2, 3]
cols_idx = [1, 2, 4]
# print sub-matrices
print(A[rows_idx])
print(B[:, cols_idx])
I'm aware that this can be achieved directly through
A[rows_idx, :] @ B[:, cols_idx]
However, the latter comes with the drawback of making copies of the rows of A and columns of B since rows_idx and cols_idx are lists, as mentioned here.
This is less efficient in terms of time and memory.
Also, I refrained from creating a Python function to perform the multiplication as Python loops are slow. Numpy array operations are much faster. Here are some timing comparisons.
Is there a way to compute A[rows_idx, :] @ B[:, cols_idx] without copying?
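For reference, the copy-versus-view behaviour described above can be checked with np.shares_memory. This is only a small sketch of the check, not an answer to the question; the slice 0:3 is an arbitrary example of basic slicing that I picked for comparison.

```python
import numpy as np

A = np.ones((5, 3))
B = np.ones((3, 5))
rows_idx = [0, 2, 3]
cols_idx = [1, 2, 4]

A_sub = A[rows_idx, :]   # indexing with a list (advanced indexing) copies
A_view = A[0:3, :]       # basic slicing returns a view

print(np.shares_memory(A, A_sub))   # no shared buffer with A
print(np.shares_memory(A, A_view))  # shares A's buffer

product = A_sub @ B[:, cols_idx]    # the product from the question
print(product.shape)
```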

Finding points in radius of each point in same GeoDataFrame

I have a GeoDataFrame:
df = gpd.GeoDataFrame([[0, 'A', Point(10,12)],
[1, 'B', Point(14,8)],
[2, 'C', Point(100,2)],
[3, 'D' ,Point(20,10)]],
columns=['ID','Value','geometry'])
Is it possible to find, for each point, all points within a given radius (for example 10) and add their "Value" and 'geometry' to the GeoDataFrame, so that the output would look like:
['ID','Value','geometry','value_of_point_in_range_1','geometry_of_point_in_range_1','value_of_point_in_range_2','geometry_of_point_in_range_2' etc.]
Previously I found the nearest neighbor of each point and then checked whether it was in range, but now I need all of the points within the radius and I don't know which tool to use.
Although the output in your example has a predictable number of columns, this is not true in general. I would therefore create a single column in the dataframe that holds a list of the indices/values/geometries of the nearby points.
For a small dataset like the one you provided, simple arithmetic in Python will suffice. For large datasets, however, you will want a spatial tree to query the nearby points. I suggest using scipy's KDTree like this:
import geopandas as gpd
import numpy as np
from shapely.geometry import Point
from scipy.spatial import KDTree
df = gpd.GeoDataFrame([[0, 'A', Point(10,12)],
[1, 'B', Point(14,8)],
[2, 'C', Point(100,2)],
[3, 'D' ,Point(20,10)]],
columns=['ID','Value','geometry'])
tree = KDTree(list(zip(df.geometry.x, df.geometry.y)))
pairs = tree.query_pairs(10)
df['ValueOfNearbyPoints'] = np.empty((len(df), 0)).tolist()
n = df.columns.get_loc("ValueOfNearbyPoints")
m = df.columns.get_loc("Value")
for (i, j) in pairs:
    df.iloc[i, n].append(df.iloc[j, m])
    df.iloc[j, n].append(df.iloc[i, m])
This yields the following dataframe:
   ID Value                    geometry ValueOfNearbyPoints
0   0     A   POINT (10.00000 12.00000)                 [B]
1   1     B    POINT (14.00000 8.00000)              [A, D]
2   2     C   POINT (100.00000 2.00000)                  []
3   3     D   POINT (20.00000 10.00000)                 [B]
To verify the results, you may find it useful to plot them:
import matplotlib.pyplot as plt
ax = plt.subplot()
df.plot(ax=ax)
for (i, j) in pairs:
    plt.plot([df.iloc[i].geometry.x, df.iloc[j].geometry.x],
             [df.iloc[i].geometry.y, df.iloc[j].geometry.y], "-r")
plt.show()
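As a possible alternative to looping over query_pairs, KDTree.query_ball_point returns the neighbor indices per point directly. The sketch below works on plain coordinate and value lists copied from the example (the names coords, values and nearby_values are mine) so that it runs without geopandas. Assigning the result back to a dataframe column is left out.

```python
import numpy as np
from scipy.spatial import KDTree

# coordinates and values taken from the example GeoDataFrame
coords = np.array([[10, 12], [14, 8], [100, 2], [20, 10]])
values = ['A', 'B', 'C', 'D']

tree = KDTree(coords)
# for every point, the indices of all points within distance 10
# (each point's own index is included in its result)
neighbors = tree.query_ball_point(coords, r=10)

# drop the point itself and translate indices to values
nearby_values = [[values[j] for j in sorted(idx) if j != i]
                 for i, idx in enumerate(neighbors)]
print(nearby_values)
```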

How can I determine neighborhood from pairwise distance matrix efficiently?

I have an M * N pairwise distance matrix between M points from group A and N points from group B.
I want to get the list of neighbor points from group B for each point from group A.
Is there an efficient way to do this in PyTorch, instead of using multiple 'for' loops?
Thanks
You can use sort:
import torch
# fake pairwise distance matrix, M=3, N=4
x = torch.rand((3,4))
print(x)
# tensor([[0.7667, 0.6847, 0.3779, 0.3007],
# [0.9881, 0.9909, 0.3180, 0.5389],
# [0.6341, 0.8095, 0.4214, 0.7216]])
closest = torch.sort(x, dim=-1) # default is -1, but I prefer to be clear
# let's say you want the k=2 closest points
k=2
closest_k_values = closest[0][:, :k]
closest_k_indices = closest[1][:, :k]
print(closest_k_values)
# tensor([[0.3007, 0.3779],
# [0.3180, 0.5389],
# [0.4214, 0.6341]])
print(closest_k_indices)
# tensor([[3, 2],
# [2, 3],
# [2, 0]])
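If only the k closest points are needed, torch.topk with largest=False should give the same values and indices without sorting every row completely. A small sketch, reusing the matrix printed above as a fixed tensor so the result is reproducible:

```python
import torch

# the example distance matrix from above, M=3, N=4
x = torch.tensor([[0.7667, 0.6847, 0.3779, 0.3007],
                  [0.9881, 0.9909, 0.3180, 0.5389],
                  [0.6341, 0.8095, 0.4214, 0.7216]])
k = 2

# largest=False selects the k smallest entries per row;
# results are returned in ascending order (sorted=True is the default)
values, indices = torch.topk(x, k, dim=-1, largest=False)
print(values)
print(indices)
```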

How to randomly select from list of lists

I have a list of lists in python as follows:
a = [[1,1,2], [2,3,4], [5,5,5], [7,6,5], [1,5,6]]
For example, how would I randomly select 3 of the 5 lists?
I tried numpy's random.choice, but it does not work on a list of lists.
Any suggestion?
numpy's random.choice doesn't work on a 2-D array, so one alternative is to use the length of the array to draw random indices and then fetch the elements at those indices. See the example below.
import numpy as np
random_count = 3 # number of random elements to find
a = [[1,1,2], [2,3,4], [5,5,5], [7,6,5], [1,5,6]] # 2-d data list
alist = np.array(a) # convert 2-d data list to numpy array
random_numbers = np.random.choice(len(alist), random_count) # draw random row indices (with replacement by default; pass replace=False for distinct rows)
for item in random_numbers: # iterate over the random indices
    print(alist[item]) # print the row at each random index
You can use the random library like this:
a = [[1,1,2], [2,3,4], [5,5,5], [7,6,5], [1,5,6]]
import random
random.choices(a, k=3)
>>> [[1, 5, 6], [2, 3, 4], [7, 6, 5]]
You can read more about the random library at this official page https://docs.python.org/3/library/random.html.
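Note that random.choices draws with replacement, so the same sublist can come back more than once. If the three picks should be distinct, random.sample is probably what you want. A short sketch (the seed is only there to make the run repeatable):

```python
import random

a = [[1, 1, 2], [2, 3, 4], [5, 5, 5], [7, 6, 5], [1, 5, 6]]

random.seed(0)  # optional, for a repeatable draw
# sample picks k distinct positions from the list, without replacement
picked = random.sample(a, k=3)
print(picked)
```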

How to find all distances between points in a matrix without duplicates?

I have an Nx3 matrix that contains the x,y,z coordinates of N points in 3D space. I'd like to find the absolute distances between all points without duplicates.
I tried using scipy.spatial.distance.cdist()
[see documentation here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html ]. However, the output matrix contains duplicate distances. For example, the distance between the points P1 and P2 is calculated twice: once as the distance from P1 to P2 and again as the distance from P2 to P1. See the code output:
>>> from scipy.spatial import distance
>>> points = [[1, 2, 3],
... [4, 5, 6],
... [7, 8, 9]]
>>> distances = distance.cdist(points, points, 'euclidean')
>>> print(distances)
[[ 0.          5.19615242 10.39230485]
 [ 5.19615242  0.          5.19615242]
 [10.39230485  5.19615242  0.        ]]
I'd like the output to be without duplicates. For example, find the distances between the first point and all other points, then between the second point and the remaining points (excluding the first point), and so on. Ideally this would be done in an efficient and scalable manner that preserves the order of the points. That is, once I have the distances, I'd like to query them, e.g. find the distances within a certain range and output the points that correspond to those distances.
It looks like what you want in general is a KDTree with query_pairs.
from scipy.spatial import KDTree
radius = 6.0  # example threshold (assumption), use your own maximum distance
points_tree = KDTree(points)
points_in_radius = points_tree.query_pairs(radius)
This will be much faster than actually computing all of the distances and applying a tolerance.
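If you do need every distance exactly once rather than only the pairs within a radius, scipy.spatial.distance.pdist returns the upper triangle in condensed form. The sketch below pairs it with np.triu_indices to recover which two points each distance belongs to. The range bounds lo and hi are illustrative values I chose, not something from the question.

```python
import numpy as np
from scipy.spatial.distance import pdist

points = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# one distance per unordered pair, ordered (0,1), (0,2), (1,2), ...
condensed = pdist(points, metric='euclidean')
# row/column indices of the upper triangle, in the same order as pdist
i, j = np.triu_indices(len(points), k=1)

lo, hi = 5.0, 6.0  # illustrative query range (assumption)
mask = (condensed >= lo) & (condensed <= hi)
pairs_in_range = list(zip(i[mask].tolist(), j[mask].tolist()))
print(condensed)
print(pairs_in_range)
```

Keep in mind that this still computes all N*(N-1)/2 distances, so for large N with a small radius the KDTree route is likely the better fit.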
