Scikit-learn pairwise_distances is imprecise?

The scikit-learn function pairwise_distances computes the distance matrix from an array X.
However, for some inputs the results seem imprecise.
Example:
from sklearn.metrics.pairwise import pairwise_distances
X = [[-0.903858372568, -0.5521578], [-0.903858372568, -0.55215782]]
print(pairwise_distances(X))
Gives the following output:
[[0. 0.]
 [0. 0.]]
Yet the true distance is 0.00000002 (2e-8).
2nd Example:
X = [[-0.903858372568, -0.5521578], [-0.903858372568, -0.552157821]]
gives
[[0.00000000e+00 2.10734243e-08]
 [2.10734243e-08 0.00000000e+00]]
Here a distance is reported, but it is only accurate to the first digit.
For my application it is undesirable if the output can be zero although there is a distance.
Is there a good way to increase the precision?

I didn't dig into why scikit-learn gives such an imprecise result, but scipy seems to give better precision. Try this:
from scipy.spatial.distance import pdist, squareform
squareform(pdist(X))
For example,
X = [[-0.903858372568, -0.5521578], [-0.903858372568, -0.552157821]]
gives
array([[0.00000000e+00, 2.10000000e-08],
       [2.10000000e-08, 0.00000000e+00]])

Related

Using the colon operator to slice columns in numpy when it might be a vector or might be a matrix

I have two general functions, Estability3 and Lstability3, where I would like to evaluate both two-dimensional slices of arrays and one-dimensional ranges of vectors. I have explored the error outside the functions in a Jupyter notebook with some of the arguments to the functions.
These functions compute energy and angular momentum. The position and velocity data needed to compute them are stored in a two-dimensional matrix called xvec, where the position and velocity run along a row and the three rows represent the three stars. xvec0 is the initial data for the simulation (timestep 0).
xvec0
array([[-5.00000000e+00,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00, -2.23606798e+00,  0.00000000e+00],
       [ 5.00000000e+00,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  2.23606798e+00,  0.00000000e+00],
       [ 9.95024876e+02,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  4.46099737e-01,  0.00000000e+00]])
I select the first star of the zeroth timestep by selecting the first row of this matrix. If I were looping over thousands of timesteps, as usual, I would build thousands of matrices like these, append them to a list, and then convert that to a numpy array with thousands of columns (so xvec1_0 would have thousands of columns instead of one).
xvec1_0=xvec0[0]
Since xvec1_0 has only one column, here I am trying to force numpy to recognize it as a matrix. It doesn't work.
np.reshape(xvec1_0,(1,6))
array([[-5.        ,  0.        ,  0.        , -0.        , -2.23606798,
         0.        ]])
I see that it has two outer brackets, which implies that it is a matrix. But when I try to use the colon index over the one column like I normally do over the 1000s of columns, I get an error.
xvec1_0[:,0:3]
IndexError Traceback (most recent call last)
<ipython-input-115-79d26475ac10> in <module>
----> 1 xvec1_0[:,0:3]
IndexError: too many indices for array
Why can't I use the : operator to obtain the first row of this two-dimensional array? How can I do that in this more general code that also applies to matrices?
Thanks,
Steven
I think I misread the documentation for reshape. I thought it changed the array in place. It doesn't, so I needed to assign the output, like this:
xvec1_0 = np.reshape(xvec1_0, (1, 6))
xvec1_0[:,0:3]
array([[-5., 0., 0.]])
xvec1_0
array([[-5.        ,  0.        ,  0.        , -0.        , -2.23606798,
         0.        ]])
xvec1_0.shape
(1, 6)
Thanks to a friend's help, I discovered that the following works just fine.
import numpy as np

x = np.zeros((1, 6))
print(x.shape)        # (1, 6)
print(x[:, 0:3])      # [[0. 0. 0.]]
x[:, 0:3]             # array([[0., 0., 0.]])

x = np.zeros((6,))
print(x.shape)        # (6,)
x = np.reshape(x, (1, 6))
print(x[:, 0:3])      # [[0. 0. 0.]]
x[:, 0:3]             # array([[0., 0., 0.]])
Probably I should have thought of some of these tests myself, but I thought I had already run the most basic test when I looked at the output of np.reshape. I really appreciate my friend's help, and I hope my question did not waste anyone's time too badly.
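As an aside, the reshape isn't strictly necessary: indexing with a slice keeps the array two-dimensional, and np.atleast_2d handles both cases. A small sketch with a stand-in array (not the real simulation data):
import numpy as np
xvec0 = np.arange(18.0).reshape(3, 6)  # stand-in for the simulation matrix
row_1d = xvec0[0]    # integer index drops the axis -> shape (6,)
row_2d = xvec0[0:1]  # slice keeps the axis -> shape (1, 6)
print(row_2d[:, 0:3])                 # works directly, no reshape needed
print(np.atleast_2d(row_1d)[:, 0:3])  # promotes a bare vector to (1, n)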

Different values on Tensorflow matrix multiplication vs. manual calculation

I am working on an optimization in TensorFlow where matrix multiplication gives different values compared to a manual calculation. The difference is only in the sixth decimal place, and I know that's tiny, but as the epochs go on I get quite different ELBO values.
Here is a small example:
import tensorflow as tf
import numpy as np
a = np.array([[0.2751678 , 0.00671141, 0.39597315, 0.4966443 , 0.17449665,
               0.00671141, 0.32214764, 0.02013423, 1.        , 0.40939596,
               0.        , 0.9597315 , 0.4161074 , 0.        , 0.2147651 ,
               0.22147651, 0.5771812 , 0.70469797, 0.44966444, 0.36241612]], dtype=np.float32)
b = np.array([[2.6560298e-04, 0.0000000e+00, 7.9084152e-01, 8.2393251e-03,
               0.0000000e+00, 9.8140877e-01, 6.5296537e-01, 2.6107374e-01,
               1.2936005e-03, 5.2952105e-01, 2.2449312e-01, 9.9892569e-01,
               8.4370503e-04, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
               0.0000000e+00, 0.0000000e+00, 9.5679509e-03, 0.0000000e+00]], dtype=np.float32)
a_t = tf.constant(a)
b_t = tf.constant(b.T)
Matrix multiplication
tf.matmul(a_t,b_t)
<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[1.7209427]], dtype=float32)>
Manual calculation
tf.reduce_sum(tf.transpose(a_t)*b_t)
<tf.Tensor: shape=(), dtype=float32, numpy=1.7209429>
What is the reason for this difference? Is there a fix for it?
You are comparing the results of different algorithms that rely on float arithmetic. It's completely normal to get different results in the last significant decimal digit. Actually, that is the best-case scenario; sometimes the difference will be even higher.
For example, you may try different values for n in the following code:
import numpy as np

n = 10
for i in range(10000):
    a = np.random.rand(1, n).astype('float32')
    b = np.random.rand(n, 1).astype('float32')
    c = np.matmul(a, b).item()     # (1, n) @ (n, 1) matmul, as a scalar
    d = np.multiply(a, b.T).sum()  # elementwise product, then sum
    e = c - d
    if abs(e) > 0:
        print("%.16f" % c)
        print("%.16f" % d)
        print("%e" % e)
        break
Anyway, the single-precision floating-point format (float32) gives 6 to 9 significant decimal digits of precision. If you need more precision, you still have double precision (float64).
More information can be found in Floating Point Arithmetic: Issues and Limitations in the Python documentation.
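As a quick check of the float64 suggestion, here is a sketch with random stand-in data showing the two formulations agreeing to far more digits in double precision:
import numpy as np
import tensorflow as tf
rng = np.random.default_rng(0)
a_t = tf.constant(rng.random((1, 20)))  # numpy float64 -> tf.float64 tensor
b_t = tf.constant(rng.random((20, 1)))
# Each intermediate now carries ~15-16 significant digits, so any remaining
# discrepancy sits far below the level that would perturb an ELBO.
print(tf.matmul(a_t, b_t).numpy().item())
print(tf.reduce_sum(tf.transpose(a_t) * b_t).numpy().item())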

sklearn.preprocessing.MinMaxScaler() only returns 0 or 1 and not float

For whatever reason, this only returns 0 or 1 instead of floats between them.
from sklearn import preprocessing
X = [[1.3, 1.6, 1.4, 1.45, 12.3, 63.01,],
[1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]
minmaxscaler = preprocessing.MinMaxScaler()
X_scale = minmaxscaler.fit_transform(X)
print(X_scale) # returns [[0. 1. 0. 0. 1. 0.] [1. 0. 1. 1. 0. 0.]]
MinMaxScaler cannot work with a list of lists; it needs a numpy array, for example (or a DataFrame).
You can convert it to a numpy array. As given, that yields 6 features with 2 samples, which I guess is not what you mean, so you also need a reshape.
import numpy
X = numpy.array([[1.3, 1.6, 1.4, 1.45, 12.3, 63.01],
                 [1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]).reshape(-1, 1)
Results after MinMaxScaler:
[[0.02047619]
[0.0252381 ]
[0.02206349]
[0.02285714]
[0.19507937]
[1. ]
[0.03 ]
[0. ]
[0.06809524]
[0.72047619]
[0.04761905]
[1. ]]
I'm not exactly sure whether you want to min-max scale each list separately or all together.
The answer you got from MinMaxScaler is the expected one.
When you have only two data points, you will get only 0s and 1s. See the example below for a three-data-point scenario.
You need to understand that it converts the lowest value in each column to 0 and the highest to 1. When you have more data points, the remaining ones are scaled within the range (max - min) using the formula X_scaled = (X - X_min) / (X_max - X_min).
Also, MinMaxScaler accepts 2D data, which means a list of lists is acceptable. That's why you did not get any error.
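To make that concrete, here is a three-sample sketch (arbitrary values) where the middle value of each column lands strictly between 0 and 1:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [4.0, 50.0]])
# Column-wise: X_scaled = (X - X_min) / (X_max - X_min)
print(MinMaxScaler().fit_transform(X))
# [[0.         0.  ]
#  [0.33333333 0.5 ]
#  [1.         1.  ]]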

Confused about numpy's returned eigenvectors

I've been playing around with numpy's linalg module and wanted to get the eigenvectors for the following matrix:
import numpy as np
matrix = np.array([[4,0,-1],[0,3,0],[1,0,2]])
w,v = np.linalg.eig(matrix)
print(v)
array([[0.70710678, 0.70710678, 0.        ],
       [0.        , 0.        , 1.        ],
       [0.70710678, 0.70710678, 0.        ]])
Calculating the eigenvectors by hand gives me only two independent vectors, [1,0,1] and [0,1,0]. I know that numpy normalizes the vectors, which is fine, but the problem arises when I check whether the first and second columns are equal:
v[:,0] == v[:,1]
array([False, True, False])
This gives me the impression that these are two different vectors (so I would have a total of 3 eigenvectors) when I already know I'll only get two.
Can someone please explain what's going on here?
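For what it's worth, a sketch of what appears to be happening: the matrix has the single eigenvalue 3 with algebraic multiplicity 3 but only a two-dimensional eigenspace (it is defective), so np.linalg.eig has no third independent eigenvector to return, and two of its columns coincide up to floating-point noise. Printing the difference, rather than testing exact equality, makes that visible:
import numpy as np
matrix = np.array([[4, 0, -1], [0, 3, 0], [1, 0, 2]])
w, v = np.linalg.eig(matrix)
print(w)                  # all three eigenvalues are (numerically) 3
print(v[:, 0] - v[:, 1])  # entries of order 1e-8: noise, not a new eigenvector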

Sklearn KNN + mahalanobis on python

I am trying to use NearestNeighbors from sklearn. I wrote an example to understand what's happening in this function.
from sklearn.neighbors import NearestNeighbors
samples = [[0.2, 0], [0.5, 0.1], [0.4,0.4]]
neigh = NearestNeighbors(n_neighbors=2,metric='mahalanobis')
neigh.fit(samples)
print(neigh.kneighbors([[272,7522752]])) # use any point to test
The above code works well and correctly computes the 2 nearest points.
But when I try to use my own dataset, an error occurs. The dataset is a 9959 x 384 matrix, printed below and assigned to the variable training_data:
[[ 0.069915 0.020142 0.070054 ..., 0.333937 0.477351 0.055993]
[ 0.131826 0.038203 0.131573 ..., 0.353589 0.426197 0.048557]
[ 0.130338 0.02595 0.130351 ..., 0.315951 0.32355 0.098884]
...,
[ 0.053331 0.023395 0.0534 ..., 0.366064 0.404756 0.066217]
[ 0.063554 0.021197 0.063671 ..., 0.235945 0.439595 0.105366]
[ 0.123632 0.045492 0.12322 ..., 0.308702 0.437344 0.040144]]
When I plug training_data into the code above (just replacing samples with training_data), I get an error:
LinAlgError: 0-dimensional array given. Array must be at least two-dimensional
Please help me solve this. Thanks a lot!
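The thread has no accepted answer, but one thing worth checking (an assumption on my part, not a confirmed diagnosis): the Mahalanobis metric needs the covariance of the data, which sklearn lets you pass explicitly through metric_params. A sketch with random stand-in data:
import numpy as np
from sklearn.neighbors import NearestNeighbors
rng = np.random.default_rng(0)
training_data = rng.random((500, 20))  # stand-in for the 9959 x 384 matrix
# 'VI' (the inverse covariance) is the parameter sklearn's DistanceMetric
# accepts for metric='mahalanobis'.
VI = np.linalg.inv(np.cov(training_data.T))
neigh = NearestNeighbors(n_neighbors=2, metric='mahalanobis',
                         metric_params={'VI': VI})
neigh.fit(training_data)
print(neigh.kneighbors(training_data[:1]))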
