Efficient way to get nearest point [duplicate] - python-3.x

let's say I have the following numpy matrix (simplified):
matrix = np.array([[1, 1],
[2, 2],
[5, 5],
[6, 6]]
)
And now I want to get the vector from the matrix closest to a "search" vector:
search_vec = np.array([3, 3])
What I have done is the following:
min_dist = None
result_vec = None
for ref_vec in matrix:
distance = np.linalg.norm(search_vec-ref_vec)
distance = abs(distance)
print(ref_vec, distance)
if min_dist == None or min_dist > distance:
min_dist = distance
result_vec = ref_vec
The result works, but is there a native numpy solution to do it more efficient?
My problem is, that the bigger the matrix becomes, the slower the entire process will be.
Are there other solutions that handle these problems in a more elegant and efficient way?

Approach #1
We can use Cython-powered kd-tree for quick nearest-neighbor lookup, which is very efficient both memory-wise and with performance -
In [276]: from scipy.spatial import cKDTree
In [277]: matrix[cKDTree(matrix).query(search_vec, k=1)[1]]
Out[277]: array([2, 2])
Approach #2
With SciPy's cdist -
In [286]: from scipy.spatial.distance import cdist
In [287]: matrix[cdist(matrix, np.atleast_2d(search_vec)).argmin()]
Out[287]: array([2, 2])
Approach #3
With Scikit-learn's Nearest Neighbors -
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=1).fit(matrix)
closest_vec = matrix[nbrs.kneighbors(np.atleast_2d(search_vec))[1][0,0]]
Approach #4
With Scikit-learn's kdtree -
from sklearn.neighbors import KDTree
kdt = KDTree(matrix, metric='euclidean')
cv = matrix[kdt.query(np.atleast_2d(search_vec), k=1, return_distance=False)[0,0]]
Approach #5
From eucl_dist package (disclaimer: I am its author) and following the wiki contents, we could leverage matrix-multiplication -
M = matrix.dot(search_vec)
d = np.einsum('ij,ij->i',matrix,matrix) + np.inner(search_vec,search_vec) -2*M
closest_vec = matrix[d.argmin()]

Related

Matrix Multiplication Python

I am trying to multiply the matrix and came across below code, can someone please help me understand the logic for second 'for' loop, why is it range(len(B[0])). I am quite a newbie to programming world so unable to understand the logic. Please help.
for i in range(r1):
print("i=",i)
for j in range(len(B[0])):
print("j=",j)
for k in range(r2):
print("k=",k)
result[i][j] += A[i][k] * B[k][j]
return(result)
Here r1 and r2 are lengths of two matrix
A simple way to do matrix multiplication is to use numpy dot product:
import numpy as np
result = np.dot([[2, 5], [5, 8]],[[2, 1], [5, 9]])
#result = np.dot(matrix1, matrix2)

python3: find most k nearest vectors from a list?

Say I have a vector v1 and a list of vector l1. I want to find k vectors from l1 that are most closed (similar) to v1 in descending order.
I have a function sim_score(v1,v2) that will return a similarity score between 0 and 1 for any two input vectors.
Indeed, a naive way is to write a for loop over l1, calculate distance and store them into another list, then sort the output list. But is there a Pythonic way to do the task?
Thanks
import numpy as np
np.sort([np.sqrt(np.sum(( l-v1)*(l-v1))) For l in l1])[:3]
Consider using scipy.spatial.distance module for distance computations. It supports the most common metrics.
import numpy as np
from scipy.spatial import distance
v1 = [[1, 2, 3]]
l1 = [[11, 3, 5],
[ 2, 1, 9],
[.1, 3, 2]]
# compute distances
dists = distance.cdist(v1, l1, metric='euclidean')
# sorted distances
sd = np.sort(dists)
Note that each parameter to cdist must be two-dimensional. Hence, v1 must be a nested list, or a 2d numpy array.
You may also use your homegrown metric like:
def my_metric(a, b, **kwargs):
# some logic
dists = distance.cdist(v1, l1, metric=my_metric)

how does sklearn compute the Accuracy score step by step?

I was reading about the metrics used in sklearn but I find pretty confused the following:
In the documentation sklearn provides a example of its usage as follows:
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
0.5
I understood that sklearns computes that metric as follows:
I am not sure about the process, I would like to appreciate if some one could explain more this result step by step since I was studying it but I found hard to understand, In order to understand more I tried the following case:
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3,0]
y_true = [0, 1, 2, 3,0]
print(accuracy_score(y_true, y_pred))
0.6
And I supposed that the correct computation would be the following:
but I am not sure about it, I would like to see if someone could support me with the computation rather than copy and paste the sklearn's documentation.
I have the doubt if the i in the sumatory is the same as the i in the formula inside the parenthesis, it is unclear to me, I don't know if the number of elements in the sumatory is related just to the number of elements in the sample of if it depends on also by the number of classes.
The indicator function equals one only if the variables in its arguments are equal, else it’s value is zero. Therefor when y is equal to yhat the indicator function produces a one counting as a correct classification. There is a code example in python and numerical example below.
import numpy as np
yhat=np.array([0,2,1,3])
y=np.array([0,1,2,3])
acc=np.mean(y==yhat)
print( acc)
example
A simple way to understand the calculation of the accuracy is:
Given two lists, y_pred and y_true, for every position index i, compare the i-th element of y_pred with the i-th element of y_true and perform the following calculation:
Count the number of matches
Divide it by the number of samples
So using your own example:
y_pred = [0, 2, 1, 3, 0]
y_true = [0, 1, 2, 3, 0]
We see matches on indices 0, 3 and 4. Thus:
number of matches = 3
number of samples = 5
Finally, the accuracy calculation:
accuracy = matches/samples
accuracy = 3/5
accuracy = 0.6
And for your question about the i index, it is the sample index, so it is the same for both the summation index and the Y/Yhat index.

Spark Matrix multiplication with python

I am trying to do matrix multiplication using Apache Spark and Python.
Here is my data
from pyspark.mllib.linalg.distributed import RowMatrix
My RDD of vectors
rows_1 = sc.parallelize([[1, 2], [4, 5], [7, 8]])
rows_2 = sc.parallelize([[1, 2], [4, 5]])
My maxtrix
mat1 = RowMatrix(rows_1)
mat2 = RowMatrix(rows_2)
I would like to do something like this:
mat = mat1 * mat2
I wrote a function to process the matrix multiplication but I'm afraid to have a long processing time. Here is my function:
def matrix_multiply(df1, df2):
nb_row = df1.count()
mat=[]
for i in range(0, nb_row):
row=list(df1.filter(df1['index']==i).take(1)[0])
row_out = []
for r in range(0, len(row)):
r_value = 0
col = df2.select(df2[list_col[r]]).collect()
col = [list(c)[0] for c in col]
for c in range(0, len(col)):
r_value += row[c] * col[c]
row_out.append(r_value)
mat.append(row_out)
return mat
My function make a lot of spark actions (take, collect, etc.). Does the function will take a lot of processing time?
If someone have another idea it will be helpful for me.
You cannot. Since RowMatrix has no meaningful row indices it cannot be used for multiplications. Even ignoring that the only distributed matrix which supports multiplication with another distributed structure is BlockMatrix.
from pyspark.mllib.linalg.distributed import *
def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
return IndexedRowMatrix(
rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
).toBlockMatrix(rowsPerBlock, colsPerBlock)
as_block_matrix(rows_1).multiply(as_block_matrix(rows_2))

How to project a new point to new basis using 'components_' attribute of PCA from sklearn.decomposition package?

I have some data points with 3 co-ordinates and using PCA function I converted it into a points having 2 co-ordinates by doing this
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1, -3], [-2, -1, -1], [-3, -2, -2], [1, 1, 1], [2, 1, 5], [3, 2, 6]]) #data
pca = PCA(n_components=2)
pca.fit(X)
PCA(copy=True, n_components=2, whiten=False)
XT = pca.fit_transform(X)
print XT
#output obtained
#[[-4.04510516 -1.24556106]
#[-2.92607624 0.61239898]
#[-4.55000611 1.13825234]
#[ 0.81687144 -1.11632484]
#[ 4.5401931 0.56854397]
#[ 6.16412297 0.04269061]]
The I got principal axes in feature space, representing the directions of maximum variance in the data using 'components_' attribute
W = (pca.components_)
print W
# output obtained
#[[ 0.49508794 0.3217835 0.80705843]
# [-0.67701709 -0.43930775 0.59047148]]
Now I wanted to project the first point [-1, -1, -3] (which is first point in X) onto 2D subspace using 'components_' attribute by doing this
projectedXT_0 = np.dot(W,X[0])
print projectedXT_0
#output obtained
#[-3.23804673 -0.65508959]
#expected output
#[-4.04510516 -1.24556106]
I am not getting what I expected so, obviously I am doing something wrong while calculating projectedPoint using 'components_' attribute. Kindly demonstrate the use of 'components_' attribute to get projection of a point.
NOTE: I know 'transform' function does this but I want do using 'components_' attribute.
You forgot to substract the mean.
See the source of pca transform:
if self.mean_ is not None:
X = X - self.mean_
X_transformed = fast_dot(X, self.components_.T)
if self.whiten:
X_transformed /= np.sqrt(self.explained_variance_)
return X_transformed
see Projecting new samples into existing PCA space?
it is pca.transform(...)
Also the two lines you have:
pca.fit(X)
PCA(copy=True, n_components=2, whiten=False)
don't do anything and you should use fit_transform()

Resources