Different values on Tensorflow matrix multiplication vs. manual calculation - python-3.x

I am working on an optimization in TensorFlow where matrix multiplication gives different values compared to a manual calculation. The difference is only in the 6th decimal place, and I know it is very tiny, but as the epochs go on I get quite different ELBO values.
Here is a small example:
import tensorflow as tf
import numpy as np
a = np.array([[0.2751678 , 0.00671141, 0.39597315, 0.4966443 , 0.17449665,
0.00671141, 0.32214764, 0.02013423, 1. , 0.40939596,
0. , 0.9597315 , 0.4161074 , 0. , 0.2147651 ,
0.22147651, 0.5771812 , 0.70469797, 0.44966444, 0.36241612]],dtype=np.float32)
b = np.array([[2.6560298e-04, 0.0000000e+00, 7.9084152e-01, 8.2393251e-03,
0.0000000e+00, 9.8140877e-01, 6.5296537e-01, 2.6107374e-01,
1.2936005e-03, 5.2952105e-01, 2.2449312e-01, 9.9892569e-01,
8.4370503e-04, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 9.5679509e-03, 0.0000000e+00]],dtype=np.float32)
a_t = tf.constant(a)
b_t = tf.constant(b.T)
Matrix multiplication
tf.matmul(a_t,b_t)
<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[1.7209427]], dtype=float32)>
Manual calculation
tf.reduce_sum(tf.transpose(a_t)*b_t)
<tf.Tensor: shape=(), dtype=float32, numpy=1.7209429>
What is the reason for this difference? Is there a fix for it?

You are comparing the results of different algorithms that rely on float arithmetic. It is completely normal to get different results in the last significant decimal digit. Actually, that is the best-case scenario; sometimes the difference will be even larger.
For example, you may try different values for n in the following code:
from numpy import random as np_random
from numpy import matmul as np_matmul
from numpy import multiply as np_multiply
from numpy import transpose as np_transpose
from numpy import sum as np_sum

n = 10
for i in range(10000):
    a = np_random.rand(1, n).astype('float32')
    b = np_random.rand(n, 1).astype('float32')
    c = np_matmul(a, b)                                   # (1, 1) matrix product
    d = np_sum(np_multiply(a, np_transpose(b)), axis=1)   # elementwise product, then sum
    e = c - d
    if abs(e) > 0:
        print("%.16f" % c)
        print("%.16f" % d)
        print("%e" % e)
        break
Anyway, the single-precision floating-point format (float32) gives 6 to 9 significant decimal digits of precision. If you need more precision, you still have double precision (float64).
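As a quick illustration (a minimal NumPy sketch of the same comparison; the variable names are mine), casting the operands to float64 before reducing typically shrinks the discrepancy by several orders of magnitude:
import numpy as np

rng = np.random.default_rng(0)
a32 = rng.random((1, 20), dtype=np.float32)
b32 = rng.random((20, 1), dtype=np.float32)

# float32: the two reductions may disagree in the last significant digit
print(np.matmul(a32, b32).item(), np.sum(a32 * b32.T))

# float64: the same two reductions typically agree to ~15 significant digits
a64, b64 = a32.astype(np.float64), b32.astype(np.float64)
print(np.matmul(a64, b64).item(), np.sum(a64 * b64.T))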
More information can be found in Floating Point Arithmetic: Issues and Limitations from Python Documentation.

Related

sklearn's precision_recall_curve incorrect on small example

Here is a very small example using precision_recall_curve():
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
y_true = [0, 1]
y_predict_proba = [0.25,0.75]
precision, recall, thresholds = precision_recall_curve(y_true, y_predict_proba)
precision, recall
which results in:
(array([1., 1.]), array([1., 0.]))
The above does not match the "manual" calculation which follows.
There are three possible class vectors depending on the threshold: [0,0] (when the threshold is > 0.75), [0,1] (when the threshold is between 0.25 and 0.75), and [1,1] (when the threshold is < 0.25). We have to discard [0,0] because it gives an undefined precision (divide by zero). So, applying precision_score() and recall_score() to the other two:
y_predict_class=[0,1]
precision_score(y_true, y_predict_class), recall_score(y_true, y_predict_class)
which gives:
(1.0, 1.0)
and
y_predict_class=[1,1]
precision_score(y_true, y_predict_class), recall_score(y_true, y_predict_class)
which gives
(0.5, 1.0)
This seems not to match the output of precision_recall_curve() (which for example did not produce a 0.5 precision value).
Am I missing something?
I know I am late, but I had the same doubt as you, which I eventually resolved.
The main point here is that precision_recall_curve() stops outputting precision and recall values once full recall is reached for the first time; moreover, it appends a 1 to the precision array and a 0 to the recall array so that the curve starts at the y-axis.
In your specific example, you effectively get two arrays that look like this (they are ordered the other way around because of sklearn's specific implementation):
precision, recall
(array([1., 0.5]), array([1., 1.]))
Then, the entries of the two arrays corresponding to the second occurrence of full recall are omitted, and the values 1 and 0 (for precision and recall, respectively) are appended as described above:
precision, recall
(array([1., 1.]), array([1., 0.]))
I have tried to explain it here in full detail; another useful link is certainly this one.
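To see this behaviour directly, here is a small sketch (assuming a recent scikit-learn) that reproduces the example from the question; note that thresholds stops at the score where full recall is first reached:
from sklearn.metrics import precision_recall_curve

y_true = [0, 1]
y_scores = [0.25, 0.75]

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
# The point (precision=0.5, recall=1.0) for threshold 0.25 is dropped, since
# full recall was already reached at threshold 0.75; then the point (1, 0)
# is appended so the curve starts at the y-axis.
print(precision, recall, thresholds)
# [1. 1.] [1. 0.] [0.75]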

How to calculate geometric mean in a differentiable way?

How to calculate the geometric mean along a dimension using PyTorch? Some numbers can be negative. The function must be differentiable.
A known (reasonably) numerically-stable version of the geometric mean is:
import torch
def gmean(input_x, dim):
    log_x = torch.log(input_x)
    return torch.exp(torch.mean(log_x, dim=dim))
x = torch.Tensor([2.0] * 1000).requires_grad_(True)
print(gmean(x, dim=0))
# tensor(2.0000, grad_fn=<ExpBackward>)
This kind of implementation can be found, for example, in SciPy (see here), which is a quite mature and stable library.
The implementation above does not handle zeros and negative numbers. Some will argue that the geometric mean with negative numbers is not well-defined, at least when not all of them are negative.
torch.prod() helps:
import torch
x = torch.FloatTensor(3).uniform_().requires_grad_(True)
print(x)
y = x.prod() ** (1.0/x.shape[0])
print(y)
y.backward()
print(x.grad)
# tensor([0.5692, 0.7495, 0.1702], requires_grad=True)
# tensor(0.4172, grad_fn=<PowBackward0>)
# tensor([0.2443, 0.1856, 0.8169])
EDIT: what about
y = (x.abs() ** (1.0/x.shape[0]) * x.sign() ).prod()
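For what it's worth, here is a minimal check of that sign-aware variant (my own sketch; gmean_signed is a name I made up, not an established function). It runs and stays differentiable for nonzero inputs; note that sign() has zero gradient, so all of the gradient flows through the abs() factor:
import torch

def gmean_signed(x):
    # product over i of |x_i|^(1/n) * sign(x_i)
    return (x.abs() ** (1.0 / x.shape[0]) * x.sign()).prod()

x = torch.tensor([2.0, -3.0, 4.0], requires_grad=True)
y = gmean_signed(x)
y.backward()
print(y)       # ≈ tensor(-2.8845), i.e. -(2*3*4)**(1/3)
print(x.grad)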

KMeans clustering - Value error: n_samples=1 should be >= n_cluster

I am experimenting with three time-series datasets with different characteristics, whose format is as follows.
0.086206438,10
0.086425551,12
0.089227066,20
0.089262508,24
0.089744425,30
0.090036815,40
0.090054172,28
0.090377569,28
0.090514071,28
0.090762872,28
0.090912691,27
The first column is a timestamp. For reproducibility reasons, I am sharing the data here. From column 2, I want to read the current row and compare it with the value of the previous row. If it is greater, I keep comparing. If the current value is smaller than the previous row's value, I divide the current value (smaller) by the previous value (larger). Accordingly, here is the code:
import numpy as np
import matplotlib.pyplot as plt
protocols = {}
types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
for protname, fname in types.items():
    col_time, col_window = np.loadtxt(fname, delimiter=',').T
    trailing_window = col_window[:-1]  # "past" values at a given index
    leading_window = col_window[1:]    # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds] / trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]
    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }
    plt.figure(); plt.clf()
    plt.plot(quotient_times, quotient, ".", label=protname, color="blue")
    plt.ylim(0, 1.0001)
    plt.title(protname)
    plt.xlabel("time")
    plt.ylabel("quotient")
    plt.legend()
    plt.show()
And this produces the following three plots - one for each dataset I shared.
As we can see from the points in the plots produced by the code above, data1 is pretty consistent, with values around 1; data2 has two quotients (with values concentrated around either 0.5 or 0.8); and the values of data3 are concentrated around two values (either around 0.5 or 0.7). This way, given a new data point (with quotient and quotient_times), I want to know which cluster it belongs to, building each dataset by stacking these two transformed features quotient and quotient_times. I am trying KMeans clustering as follows:
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(quotient)
But this is giving me an error: ValueError: n_samples=1 should be >= n_clusters=3. How can we fix this error?
Update: sample quotient data = array([ 0.7 , 0.7 , 0.4973262 , 0.7008547 , 0.71287129,
0.704 , 0.49723757, 0.49723757, 0.70676692, 0.5 ,
0.5 , 0.70754717, 0.5 , 0.49723757, 0.70322581,
0.5 , 0.49723757, 0.49723757, 0.5 , 0.49723757])
As it is, your quotient variable is a single sample; here I get a different error message, probably due to a different Python/scikit-learn version, but the essence is the same:
import numpy as np
quotient = np.array([ 0.7 , 0.7 , 0.4973262 , 0.7008547 , 0.71287129, 0.704 , 0.49723757, 0.49723757, 0.70676692, 0.5 , 0.5 , 0.70754717, 0.5 , 0.49723757, 0.70322581, 0.5 , 0.49723757, 0.49723757, 0.5 , 0.49723757])
quotient.shape
# (20,)
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(quotient)
This gives the following error:
ValueError: Expected 2D array, got 1D array instead:
array=[0.7 0.7 0.4973262 0.7008547 0.71287129 0.704
0.49723757 0.49723757 0.70676692 0.5 0.5 0.70754717
0.5 0.49723757 0.70322581 0.5 0.49723757 0.49723757
0.5 0.49723757].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
which, despite the different wording, is not different from yours - essentially it says that your data look like a single sample.
Following the first suggestion (i.e., considering that quotient contains a single feature (column)) resolves the issue:
k_means.fit(quotient.reshape(-1,1))
# result
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=0, tol=0.0001, verbose=0)
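Since the question mentions stacking quotient and quotient_times as two features, here is a minimal sketch of that idea (the sample values below are illustrative placeholders, not the real data):
import numpy as np
from sklearn.cluster import KMeans

quotient = np.array([0.7, 0.5, 0.49723757, 0.70754717, 0.71287129, 0.5])  # illustrative
quotient_times = np.array([0.086, 0.089, 0.090, 0.091, 0.092, 0.093])     # illustrative

X = np.column_stack((quotient_times, quotient))  # shape (n_samples, 2)
k_means = KMeans(n_clusters=3, random_state=0).fit(X)
print(k_means.labels_)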
Please try the code below. A brief explanation on what I've done:
First I built the dataset sample = np.vstack((quotient_times, quotient)).T and standardized it, so it would become easier to cluster. Then I applied DBSCAN with multiple hyperparameters (eps and min_samples) until I found the ones that separated the points best. Finally, I plotted the data with its respective labels; since you are working with 2-dimensional data, it is easy to visualize how good the clustering is.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
dataset = np.empty((0, 2))
for protname, fname in types.items():
    col_time, col_window = np.loadtxt(fname, delimiter=',').T
    trailing_window = col_window[:-1]  # "past" values at a given index
    leading_window = col_window[1:]    # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds] / trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]
    sample = np.vstack((quotient_times, quotient)).T
    dataset = np.append(dataset, sample, axis=0)
scaler = StandardScaler()
dataset = scaler.fit_transform(dataset)
k_means = DBSCAN(eps=0.6, min_samples=1)
k_means.fit(dataset)
colors = [i for i in k_means.labels_]
plt.figure();
plt.title('Dataset 1,2,3')
plt.xlabel("time")
plt.ylabel("quotient")
plt.scatter(dataset[:, 0], dataset[:, 1], c=colors)
plt.legend()
plt.show()
You are trying to make 3 clusters while you have only 1 np.array, i.e. n_samples=1. Try:
increasing the number of arrays (samples),
decreasing the number of clusters, or
reshaping the array (not sure).

Equivalent codes, different results (Python, Mathematica)

Below are two programs, one written in Python 3 and the other in Wolfram Mathematica. The codes are equivalent, so the results (plots) should be the same, yet they produce different plots. Here are the codes.
The Python code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import k0, k1, i0, i1
k=100.0
x = 0.0103406
B = 80.0
def fdens(f):
    return (1/2*(1 - f**2)**2 + f**4/2
            + 1/2*B*k*x**2*f**2*(1 - f**2)*np.log(1 + 2/(B*k*x**2))
            + (B*f**2*(1 + B*k*x**2))/((k*(2 + B*k*x**2))**2)
            - f**4/(2 + B*k*x**2)
            + (B*f)/(k*x) *
            (k0(f*x)*i1(f*np.sqrt(2/(k*B) + x**2))
             + i0(f*x)*k1(f*np.sqrt(2/(k*B) + x**2))) /
            (k1(f*x)*i1(f*np.sqrt(2/(k*B) + x**2))
             - i1(f*x)*k1(f*np.sqrt(2/(k*B) + x**2)))
            )
plt.figure(figsize=(10, 8), dpi=70)
X = np.linspace(0, 1, 100, endpoint=True)
C = fdens(X)
plt.plot(X, C, color="blue", linewidth=2.0, linestyle="-")
plt.show()
[Plot: the Python result]
The Mathematica code:
k=100.;B=80.;
x=0.0103406;
func[f_]:=1/2*(1-f^2)^2+1/2*B*k*x^2*f^2*(1-f^2)*Log[1+2/(B*k*x^2)]+f^4/2-f^4/(2+B*k*x^2)+B*f^2*(1+B*k*x^2)/(k*(2+B*k*x^2)^2)+(B*f)/(k*x)*(BesselI[1, (f*Sqrt[2/(B*k) + x^2])]*BesselK[0, f*x] + BesselI[0, f*x]*BesselK[1, (f*Sqrt[2/(B*k) + x^2])])/(BesselI[1, (f*Sqrt[2/(B*k) + x^2])]*BesselK[1,f*x] - BesselI[1,f*x]*BesselK[1, (f*Sqrt[2/(B*k) + x^2])]);
Plot[func[f],{f,0,1}]
[Plot: the Mathematica result (the correct one)]
The results are different. Does someone know why?
From my tests it looks like the first-order Bessel functions give different results. Both evaluate Bessel(f * 0.0188925) initially, but the scipy version gives me a range from 0 to 9.4e-3, whereas WolframAlpha (which uses a Mathematica backend) gives 0 to 1.4. I would dig a little deeper into this.
Additionally, Python uses standard C floating-point numbers, while Mathematica can work with symbolic expressions. SymPy tries to mimic such symbolic operations in Python.
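If you want to check scipy's Bessel functions independently, here is a small sketch (it assumes the optional mpmath package, which evaluates Bessel functions in arbitrary precision; it is not part of the original question):
import numpy as np
from scipy.special import i1
import mpmath

# Cross-check scipy's first-order modified Bessel function of the first kind
# against an arbitrary-precision reference at a few sample points.
for f in np.linspace(0.1, 1.0, 4):
    z = f * 0.0188925
    print(z, i1(z), mpmath.besseli(1, z))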

Sci-kit learn pairwise_distances is imprecise?

The scikit-learn function pairwise_distances provides the distance matrix from an array X.
However, for some inputs the results seem imprecise.
Example:
from sklearn.metrics.pairwise import pairwise_distances
X = [[-0.903858372568, -0.5521578], [-0.903858372568, -0.55215782]]
print(pairwise_distances(X))
Gives the following output:
[[ 0. 0.]
[ 0. 0.]]
However, there actually is a distance of 0.00000002.
2nd Example:
X = [[-0.903858372568, -0.5521578], [-0.903858372568, -0.552157821]]
gives
[[ 0.00000000e+00 2.10734243e-08]
[ 2.10734243e-08 0.00000000e+00]]
Here a distance is reported, but it is only correct to the first digit.
For my application it is undesirable if the output can be zero although there is a distance.
Is there a good way to increase the precision?
I didn't dig into why scikit-learn gives such an imprecise result, but it seems SciPy gives better precision. Try this:
from scipy.spatial.distance import pdist, squareform
squareform(pdist(X))
For example,
X = [[-0.903858372568, -0.5521578], [-0.903858372568, -0.552157821]]
gives
array([[ 0.00000000e+00, 2.10000000e-08],
[ 2.10000000e-08, 0.00000000e+00]])
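A likely explanation (my assumption; worth verifying against the scikit-learn source for your version): euclidean_distances in scikit-learn uses the expansion ||x - y||^2 = ||x||^2 - 2*x.y + ||y||^2, which suffers catastrophic cancellation when x and y are nearly identical, whereas pdist subtracts the coordinates first. Subtracting first keeps full float64 precision, as this sketch shows:
import numpy as np

X = np.array([[-0.903858372568, -0.5521578],
              [-0.903858372568, -0.552157821]])

# Subtract first, then take the norm: no cancellation of large, nearly equal terms
d = np.linalg.norm(X[0] - X[1])
print("%.12e" % d)  # ~2.1e-08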
