I have a kNN classification project that needs to compute the Euclidean distance with TensorFlow for comparison.
The original code without tensorflow is like this:
def euclidean_distance(self, x1, x2):
    distance = 0.0
    for i in range(len(x1)):
        distance += pow(x1[i] - x2[i], 2)
    return math.sqrt(distance)
and with tensorflow is like this:
distance = 0.0
for i in range(len(x1)):
    distance = tf.negative(tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(x1, x2)))))
return distance
Is this right? With this code, distance becomes a tensor, and I need a way to convert that tensor into a normal matrix.
Any help is appreciated, thanks!

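To get an ndarray (matrix) you need to run the graph in a session, as below.
You have to change your code to:
distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(x1, x2))))
# TensorFlow 1.x graph mode; pass a feed_dict to run() if x1 and x2 are placeholders
with tf.Session() as sess:
    nd_distance = sess.run(distance)
print(nd_distance)
return nd_distance
I don't see the need for the tf.negative function or the for loop: tf.reduce_sum already sums over every element of the difference.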
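As a sanity check that the single reduction replaces the loop, here is a minimal sketch that compares the element-by-element version with a NumPy analogue of the TensorFlow expression. It uses NumPy rather than TensorFlow so it runs without a session. The function names and sample vectors are made up for illustration.

```python
import math

import numpy as np


def euclidean_distance_loop(x1, x2):
    """Element-by-element version, as in the original non-TensorFlow code."""
    distance = 0.0
    for i in range(len(x1)):
        distance += pow(x1[i] - x2[i], 2)
    return math.sqrt(distance)


def euclidean_distance_vectorized(x1, x2):
    """NumPy analogue of tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(x1, x2))))."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # The sum runs over all elements at once, so no Python loop is needed.
    return float(np.sqrt(np.sum(np.square(np.subtract(x1, x2)))))


# Hypothetical sample vectors, chosen only for illustration.
a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
loop_result = euclidean_distance_loop(a, b)
vectorized_result = euclidean_distance_vectorized(a, b)
print(loop_result, vectorized_result)
```

Both functions return a plain Python float, which is what a kNN comparison needs.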


Custom metric for NearestNeighbors sklearn

Hello, I'm working on a project where I use 512-bit hashes to create clusters. I'm using a custom metric, the bitwise Hamming distance. But when I compare two hashes with this function, I get different distances than when using NearestNeighbors.
Extending this to DBSCAN with eps=5, the clusters are created with some consistency and the points appear to be clustered correctly. But when I check the distance between points from the same cluster, I get enormous distances. Here is an example.
This is a list of points from 2 clusters created by DBSCAN. As you can see, computing the distance with the function gives numbers bigger than 30, but NearestNeighbors gives results consistent with eps=5.
from sklearn.neighbors import NearestNeighbors
hash_list_1 = [2711636196460699638441853508983975450613573844625556129377064665736210167114069990407028214648954985399518205946842968661290371575620508000646896480583712,
hash_list_2 = [1677246762479319235863065539858628614044010438213592493389244703420353559152336301659250128835190166728647823546464421558167523127086351613289685036466208,
def custom_metric(x, y):
    return bin(int(x[0]) ^ int(y[0])).count('1')
objective_hash = hash_list_1[0]
complete_list = hash_list_1 + hash_list_2
distance = [custom_metric([objective_hash], [hash_point]) for hash_point in complete_list]
print("Function iteration distance:")
neighbors_model = NearestNeighbors(radius=100, algorithm='ball_tree',
X = [[x] for x in complete_list]
distance, neighborhoods = neighbors_model.radius_neighbors(objective_hash, 100, return_distance=True)
print("Nearest Neighbors distance:")
print("Nearest Neighbors index:")
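The problem:
NumPy can't represent integers this large exactly. They are converted to 64-bit floats, which keep only 53 significant bits, so most of the 512 bits are lost before the custom metric is ever called.
The solution:
Precompute all pairwise distances with your custom metric and feed the resulting distance matrix to the DBSCAN algorithm.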
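A minimal sketch of that precomputation, using only Python integers so that no precision is lost. The helper names and the four sample hashes are hypothetical; they are not the hashes from the question.

```python
def hamming(a: int, b: int) -> int:
    """Bitwise Hamming distance between two arbitrary-size Python integers."""
    return bin(a ^ b).count("1")


def pairwise_hamming(hashes):
    """Build the full symmetric distance matrix as a list of lists."""
    n = len(hashes)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = hamming(hashes[i], hashes[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Hypothetical 512-bit values that differ only in a few bits.
base = (1 << 511) | 0b1011
hashes = [base, base ^ 0b1, base ^ 0b110, base ^ (1 << 300)]

dist_matrix = pairwise_hamming(hashes)

# The same values after a float conversion: the differing bits are
# rounded away, so distinct hashes become indistinguishable.
as_floats = {float(h) for h in hashes}

print(dist_matrix)
print(len(as_floats))
```

The resulting matrix can then be handed to scikit-learn with `DBSCAN(eps=5, metric="precomputed").fit(dist_matrix)`, so the library never sees the raw hashes.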

Trilateration and nonlinear least squares

I am trying to locate a user from their distances to known beacon positions using trilateration. I have a mean squared error function that I am trying to minimize with non-linear least squares, but I am not getting correct results. Any help is appreciated. The code is below.
def mse(self, user_pos, positions, distances):
    mse = 0.0
    for pos, dist in zip(positions, distances):
        distance = great_circle((user_pos[0], user_pos[1]), (pos[0], pos[1])).meters
        mse += math.pow(distance - dist, 2.0)
    return mse/len(positions)

def least_squares_func(self, positions, distances):
    # Returns users coordinates
    return least_squares(self.mse, [0,0], args=(positions, distances)).x
The starting position in least_squares is [0,0], but changing it does not give me much different results.
Input example:
positions = [(5.0, -6.0), (13.0, -15.0), (21.0, -3.0)]
distances = [8.06, 13.97, 23.32]
great_circle is meant for GPS (latitude/longitude) coordinates, where distances are measured along the Earth's curved surface. For beacons in a local planar coordinate system you must use the simple Euclidean metric to calculate the distance between the user and each beacon.
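A minimal sketch of the Euclidean variant, assuming planar beacon coordinates. Note that `scipy.optimize.least_squares` squares and sums the residuals itself, so this sketch returns one residual per beacon instead of a single averaged error. The names `residuals`, `locate`, and `true_pos` are made up for illustration, and the distances are generated from a hypothetical user position rather than taken from the question.

```python
import numpy as np
from scipy.optimize import least_squares


def residuals(user_pos, positions, distances):
    """One residual per beacon: predicted Euclidean distance minus measured distance."""
    predicted = np.linalg.norm(positions - user_pos, axis=1)
    return predicted - distances


def locate(positions, distances, start=None):
    """Estimate the user's (x, y) coordinates from beacon positions and distances."""
    positions = np.asarray(positions, dtype=float)
    distances = np.asarray(distances, dtype=float)
    if start is None:
        # Assumption: the centroid of the beacons is a reasonable first guess.
        start = positions.mean(axis=0)
    return least_squares(residuals, start, args=(positions, distances)).x


beacons = [(5.0, -6.0), (13.0, -15.0), (21.0, -3.0)]
true_pos = np.array([10.0, -8.0])  # hypothetical user position
measured = np.linalg.norm(np.asarray(beacons) - true_pos, axis=1)

estimate = locate(beacons, measured)
print(estimate)
```

With noise-free distances the estimate should land on the hypothetical position; with real measurements it is only the best fit in the least-squares sense.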

scipy.cluster.hierarchy: labels seem not to be in the right order, and confused by the values on the vertical axis

I know that scipy.cluster.hierarchy focuses on dealing with a distance matrix. But now I have a similarity matrix... After I plot it using dendrogram, something weird happens.
Here is the code:
similarityMatrix = np.array(([1,0.75,0.75,0,0,0,0],
here is the linkage method
Z_sim = sch.linkage(similarityMatrix)
But here is the outcome:
My question is:
Why is the label for this dendrogram not right?
I am giving a similarity matrix to the linkage method, but I cannot fully understand what the vertical axis means. For example, as the maximum similarity is 1, why is the maximum value on the vertical axis almost 1.6?
Thank you very much for your help!
linkage expects "distances", not "similarities". To convert your matrix to something like a distance matrix, you can subtract it from 1:
dist = 1 - similarityMatrix
linkage does not accept a square distance matrix. It expects the distance data to be in "condensed" form. You can get that using scipy.spatial.distance.squareform:
from scipy.spatial.distance import squareform
dist = 1 - similarityMatrix
condensed_dist = squareform(dist)
Z_sim = sch.linkage(condensed_dist)
(When you pass a two-dimensional array with shape (m, n) to linkage, it treats the rows as points in n-dimensional space, and computes the distances internally.)
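To make the difference concrete, here is a small sketch that runs linkage both ways on a hypothetical 4x4 similarity matrix (not the matrix from the question). It shows why the vertical axis can exceed 1 when the square matrix is passed directly.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

# Hypothetical symmetric similarity matrix with ones on the diagonal.
similarity = np.array([
    [1.0, 0.75, 0.1, 0.0],
    [0.75, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
])

# Intended usage: convert to distances, then to condensed form.
dist = 1.0 - similarity
condensed = squareform(dist)
Z = sch.linkage(condensed)

# What happens when the square matrix is passed directly: each row is
# treated as a point in 4-dimensional space and Euclidean distances
# between rows are computed internally.
Z_rows = sch.linkage(similarity)

print(Z[:, 2])       # merge heights from the real distances
print(Z_rows[:, 2])  # merge heights from row-to-row distances
```

The third column of a linkage matrix holds the merge heights, which is what the dendrogram draws on its vertical axis. In the first call they stay within the range of 1 - similarity; in the second they are distances between rows and are not bounded by 1.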

how did mllib calculate gradient

Need an mllib expert to help explain the linear regression code. In LeastSquaresGradient.compute
override def compute(
    data: Vector,
    label: Double,
    weights: Vector,
    cumGradient: Vector): Double = {
  val diff = dot(data, weights) - label
  axpy(diff, data, cumGradient)
  diff * diff / 2.0
}
cumGradient is computed using axpy, which is simply y += a * x, or here
cumGradient += diff * data
I thought about this for a long time but cannot make the connection to the gradient calculation as defined in the gradient descent documentation. In theory, the gradient is the slope of the loss with respect to a change in one particular weight parameter. I don't see anything in this axpy implementation that remotely resembles that.
Can someone shed some light?
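It is not really a programming question, but to give you some idea of what is going on: the cost function for least squares regression is defined as (written here without a 1/m normalisation, to match the diff * diff / 2.0 term in the code)
$$J(\theta) = \frac{1}{2} \sum_{i} \left(\theta^T x^{(i)} - y^{(i)}\right)^2$$
where theta is the weights vector.
The partial derivatives of the above cost function are:
$$\frac{\partial J}{\partial \theta_j} = \sum_{i} \left(\theta^T x^{(i)} - y^{(i)}\right) x_j^{(i)}$$
and if computed over all theta:
$$\nabla_\theta J = \sum_{i} \left(\theta^T x^{(i)} - y^{(i)}\right) x^{(i)}$$
It should be clear that the above is equivalent to cumGradient += diff * data computed for all data points, and to quote Wikipedia: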
in a rectangular coordinate system, the gradient is the vector field whose components are the partial derivatives of f
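A quick numerical check of this claim, as a plain Python sketch rather than MLlib code: accumulate diff * data over a few points and compare the result with a finite-difference estimate of the slope of the summed loss with respect to each weight. The function names and the sample data are hypothetical.

```python
def accumulate_gradient(data_points, labels, weights):
    """Mirror of the compute() logic: cum_gradient += diff * data, loss += diff^2 / 2."""
    cum_gradient = [0.0] * len(weights)
    loss = 0.0
    for x, y in zip(data_points, labels):
        diff = sum(xi * wi for xi, wi in zip(x, weights)) - y
        for j, xj in enumerate(x):
            cum_gradient[j] += diff * xj  # the axpy step
        loss += diff * diff / 2.0
    return cum_gradient, loss


def numeric_gradient(data_points, labels, weights, eps=1e-6):
    """Central finite difference of the summed loss, one weight at a time."""
    grad = []
    for j in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[j] += eps
        down[j] -= eps
        _, loss_up = accumulate_gradient(data_points, labels, up)
        _, loss_down = accumulate_gradient(data_points, labels, down)
        grad.append((loss_up - loss_down) / (2 * eps))
    return grad


# Hypothetical data, chosen only for illustration.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [1.0, 2.0, 3.0]
w = [0.5, -0.25]

analytic, total_loss = accumulate_gradient(X, y, w)
numeric = numeric_gradient(X, y, w)
print(analytic, numeric)
```

The two vectors agree up to rounding error, which is the "slope of the loss against one weight" reading of the gradient that the question asks about.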

scikit-learn: projecting SVM weights of Principal Components to original image space

I did a PCA on my 3D image datasets and used the first n PCs as my features in a linear SVM. I have SVM weights for each PC. Now I want to project the PC weights into the original image space to find which regions of the image were more discriminative in the classification process. I used the inverse_transform PCA method on the weight vector. However, the resulting image only has positive values, whereas the SVM weights were both positive and negative. This makes me wonder whether my approach is valid. Does anybody have any suggestions?
Thanks in advance.
I have a program that does this projection in image space. The thing to realise is that the weights themselves do not define the 'discrimination' weights (as also termed in this paper). You need the sum of the inputs weighted by their kernel coefficients.
Consider this toy example:
Class A has 2 vectors: a1=(1,1) and a2=(2,2)
Class B has 2 vectors: b1=(2,4) and b2=(4,2).
If you draw this, you can construct the decision boundary by hand: it's the line of points (x,y) where x+y == 5. My SVM program finds the solution where w_a1 == 0 (not a support vector), w_a2 == -1, w_b1 == w_b2 == 1/2, and bias == -5.
Now you can construct the projection vector p = a2*w_a2 + b1*w_b1 + b2*w_b2 = -1*(2,2) + 1/2*(2,4) + 1/2*(4,2) = (1,1).
In other words, every point should be projected onto the line y == x, and for a new vector v the inner product <v,p> is below 5 for class A vectors, and above 5 for class B vectors. You can centre the result around 0 by adding the bias.
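The toy example can be reproduced in a few lines of plain Python. This is only a sketch of the arithmetic above; the dictionary layout and function names are made up, and the coefficients are the ones quoted in the text rather than the output of an actual SVM solver.

```python
# Support vectors and their coefficients from the toy example (a1 has weight 0).
support_vectors = {"a2": (2.0, 2.0), "b1": (2.0, 4.0), "b2": (4.0, 2.0)}
coefficients = {"a2": -1.0, "b1": 0.5, "b2": 0.5}
bias = -5.0

# Projection vector: sum of the support vectors weighted by their coefficients.
p = tuple(
    sum(coefficients[name] * vec[d] for name, vec in support_vectors.items())
    for d in range(2)
)


def decision(v):
    """Inner product <v, p> plus the bias; negative for class A, positive for class B."""
    return v[0] * p[0] + v[1] * p[1] + bias


class_a = [(1.0, 1.0), (2.0, 2.0)]
class_b = [(2.0, 4.0), (4.0, 2.0)]

print(p)
print([decision(v) for v in class_a])
print([decision(v) for v in class_b])
```

The same weighted sum applies when the inputs are PCA-reduced images: the projection vector lives in the feature space of the inputs and can have both positive and negative entries.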
