unbiased variance in Theano - theano

In NumPy we can set ddof=1 to get the unbiased variance. How can this be done in Theano?
I've looked at this page, and it seems the theano.tensor.var function does not support such an option.

theano.tensor.var returns the biased sample variance. I'm not aware of a builtin function that returns the unbiased sample variance, but you can obtain it as follows:
Given a vector x, use Theano's builtin var(), but change the 1/n divisor to 1/(n-1):
v = x.var() * x.size / (x.size - 1)
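The rescaling above can be checked numerically. The sketch below uses NumPy rather than Theano (an assumption made only to keep the example easy to run), and the sample values are made up for illustration. It compares the rescaled biased variance with NumPy's own ddof=1 result.

```python
import numpy as np

# Illustrative sample (hypothetical values, not from the question)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

biased = x.var()                          # divides by n (ddof=0)
unbiased_rescaled = biased * n / (n - 1)  # swap the 1/n divisor for 1/(n-1)
unbiased_direct = x.var(ddof=1)           # NumPy's built-in unbiased estimate

print(biased, unbiased_rescaled, unbiased_direct)
```

Note that the rescaling divides by n - 1, so it is only defined for vectors with at least two elements.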

Related

Why is the variance of data normalized by sklearn not equal to 1?

I'm using preprocessing from package sklearn to normalize data as follows:
import pandas as pd
import urllib3
from sklearn import preprocessing
decathlon = pd.read_csv("https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/decathlon.txt", sep='\t')
decathlon.describe()
nor_df = decathlon.copy()
nor_df.iloc[:, 0:10] = preprocessing.scale(decathlon.iloc[:, 0:10])
nor_df.describe()
The result is
The mean is -1.516402e-16, which is almost 0. The variance, by contrast, is 1.012423e+00, i.e. 1.012423, which I would not consider close to 1.
Could you please elaborate on this phenomenon?
In this instance sklearn and pandas calculate std differently.
sklearn.preprocessing.scale:
We use a biased estimator for the standard deviation, equivalent to
numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to
affect model performance.
pandas.DataFrame.describe uses pandas.core.series.Series.std, where:
Normalized by N-1 by default. This can be changed using the ddof argument
...
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
where N represents the number of elements.
It should be noted that, as of 2020-10-28, pandas.DataFrame.describe does not have a ddof parameter, so the default of ddof=1 is always used for Series.
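The size of the discrepancy follows directly from the two divisors: data standardized with ddof=0 has a ddof=1 standard deviation of sqrt(N / (N - 1)). The sketch below uses NumPy only and synthetic data. The sample size N = 41 is an assumption, chosen because sqrt(41 / 40) is about 1.012423, which matches the value reported in the question; the question itself does not state the number of rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 41  # assumed sample size (not stated in the question)
x = rng.normal(loc=10.0, scale=3.0, size=n)  # synthetic stand-in data

# Standardize with the biased estimator, as the quoted scale() docstring describes
z = (x - x.mean()) / x.std(ddof=0)

std_ddof0 = z.std(ddof=0)              # what the scaling targets
std_ddof1 = z.std(ddof=1)              # what a ddof=1 summary reports
expected_ratio = np.sqrt(n / (n - 1))  # analytic relation between the two

print(std_ddof0, std_ddof1, expected_ratio)
```

The gap shrinks as N grows, so for large datasets the two conventions are practically indistinguishable.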

matrix multiplication for complex numbers in PyTorch

I am trying to multiply two complex matrices in PyTorch, and it seems torch.matmul has not yet been added to the PyTorch library for complex numbers.
Do you have any recommendation or is there another method to multiply complex matrices in PyTorch?
Currently torch.matmul is not supported for complex tensors such as ComplexFloatTensor but you could do something as compact as the following code:
def matmul_complex(t1,t2):
    return torch.view_as_complex(torch.stack((t1.real @ t2.real - t1.imag @ t2.imag, t1.real @ t2.imag + t1.imag @ t2.real),dim=2))
When possible avoid using for loops as these will result in much slower implementations.
Vectorization is achieved by using built-in methods as demonstrated in the code I have attached.
For example, your code takes roughly 6.1s on CPU while the vectorized version takes only 101ms (~60 times faster) for 2 random complex matrices with dimensions 1000 X 1000.
Update:
Since PyTorch 1.7.0 (as @EduardoReis mentioned) you can do matrix multiplication between complex matrices similarly to real-valued matrices as follows:
t1 @ t2
(for t1, t2 complex matrices).
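The pre-1.7.0 workaround in this answer relies on the identity (A + iB)(C + iD) = (AC - BD) + i(AD + BC), which needs only real-valued matrix products. The sketch below checks that identity with NumPy instead of PyTorch (an assumption made to keep the example lightweight). The helper name matmul_complex_parts and the matrix shapes are illustrative.

```python
import numpy as np

def matmul_complex_parts(a, b):
    """Multiply complex matrices using only real-valued matrix products."""
    real = a.real @ b.real - a.imag @ b.imag
    imag = a.real @ b.imag + a.imag @ b.real
    return real + 1j * imag

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 4)) + 1j * rng.normal(size=(3, 4))
b = rng.normal(size=(4, 2)) + 1j * rng.normal(size=(4, 2))

manual = matmul_complex_parts(a, b)
builtin = a @ b  # native complex matrix product for comparison

print(np.allclose(manual, builtin))
```

The decomposition costs four real matrix products instead of one complex product, so the native operator is preferable wherever it is available.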
I implemented a matmul function for complex numbers using torch.mv, and it is working fine for the time being:
def matmul_complex(t1, t2):
    n = t2.size(1)  # number of columns in t2
    # Multiply t1 by each column of t2; each product is one column of the result.
    columns = []
    for i in range(n):
        columns.append(torch.mv(t1, t2[:, i]))
    # Stack along dim=1 so that the i-th product becomes the i-th column.
    return torch.stack(columns, dim=1)
I am new to PyTorch, so please correct me if I am wrong.

How do you generate positive definite matrix in pytorch?

I am trying to define Multivariate Gaussian distribution with randomly generated covariance matrix:
psi = torch.zeros(512).normal_(0., 1.).requires_grad_(True)
# Generate random matrix
Sigma_k = torch.rand(512, 512)
# Make it symmetric positive
Sigma_k = Sigma_k * Sigma_k.t()
# Make it definite
Sigma_k.add_(0.001, torch.eye(512)).requires_grad_(True)
multivariate_normal.MultivariateNormal(psi, Sigma_k)
But I end up with getting an exception:
RuntimeError: Lapack Error in potrf : the leading minor of order 2 is not positive definite at /Users/soumith/mc3build/conda-bld/pytorch_1549597882250/work/aten/src/TH/generic/THTensorLapack.cpp:658
What is the proper way to generate positive definite square matrix?
The answer is to take the matrix product of a matrix A and its transpose (A.t()), which yields a positive semi-definite matrix. The last step is to ensure that it is definite (all eigenvalues strictly greater than zero), here by adding the identity matrix.
With Pytorch:
Sigma_k = torch.rand(512, 512)
Sigma_k = torch.mm(Sigma_k, Sigma_k.t())
Sigma_k.add_(torch.eye(512))
Formal algorithm is described here.
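The construction can be sanity-checked by looking at the eigenvalues and attempting a Cholesky factorization, which is the step that failed in the question. The sketch below uses NumPy instead of PyTorch and a small 8 x 8 matrix instead of 512 x 512; both are assumptions made to keep the example quick to run.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # small size for illustration; the question uses 512
a = rng.random((n, n))

# A @ A.T is positive semi-definite; adding the identity shifts every
# eigenvalue up by 1, which makes the result strictly positive definite.
sigma = a @ a.T + np.eye(n)

eigenvalues = np.linalg.eigvalsh(sigma)  # eigenvalues of a symmetric matrix
chol = np.linalg.cholesky(sigma)         # raises LinAlgError if not positive definite

# For contrast: the element-wise product used in the question is symmetric,
# but it is not guaranteed to be positive semi-definite.
elementwise = a * a.T

print(eigenvalues.min(), np.linalg.eigvalsh(elementwise).min())
```

Because A @ A.T has non-negative eigenvalues, every eigenvalue of sigma is at least 1, so the factorization succeeds regardless of the random draw.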
In the "make it definite" step, tensor.add() does not change the tensor; it only returns a changed version.
You want to use tensor.add_() instead.

How can I use scipy optimization to find the minimum chi-squared for 3 parameters and a list of data points?

I have a histogram of sorted random numbers and a Gaussian overlay. The histogram represents observed values per bin (applying this base case to a much larger dataset) and the Gaussian is an attempt to fit the data. Clearly, this Gaussian does not represent the best fit to the histogram. The code below is the formula for a Gaussian.
normc, mu, sigma = 30.845, 50.5, 7 # normalization constant, avg, stdev
gauss = lambda x: normc * exp( (-1) * (x - mu)**2 / ( 2 * (sigma **2) ) )
I calculated the expectation values per bin (area under the curve) and calculated the number of observed values per bin. There are several methods to find the 'best' fit. I am concerned with the best fit possible by minimizing Chi-Squared. In this formula for Chi-Squared, the expectation value is the area under the curve per bin and the observed value is the number of occurrences of sorted data values per bin. So I want to fluctuate normc, mu, and sigma near their given values to find the right combination of normc, mu, and sigma that produce the smallest Chi-Square, as these will be the parameters I can plug into the code above to overlay the best fit Gaussian on my histogram. I am trying to use the scipy module to minimize my Chi-Square as done in this example. Since I need to fluctuate parameters, I will use the function gauss (defined above) to plot the Gaussian overlay, and will define a new function to find the minimum Chi-Squared.
def gaussmin(var,data):
    # var[0] = normc
    # var[1] = mu
    # var[2] = sigma
    # data is the sorted random numbers, represents unbinned observed values
    for index in range(len(data)):
        return var[0] * exp( (-1) * (data[index] - var[1])**2 / ( 2 * (var[2] **2) ) )
    # I realize this will return a new value for each index of data, any guidelines to fix?
After this, I am stuck. How can I fluctuate the parameters to find the normc, mu, sigma that produced the best fit? My last attempt at a solution is below:
var = [normc, mu, sigma]
result = opt.minimize(chi2, [normc,mu,sigma])
# chi2 is the chisquare value obtained via scipy
# chisquare input (a,b)
# where a is number of occurrences per bin, b is expected value per bin
# b is dependent upon normc, mu, sigma
print(result)
# data is a list, can I keep it as a constant and only fluctuate parameters in var?
There are plenty of examples online for scalar functions but I cannot find any for variable functions.
PS - I can post my full code so far but it's bit lengthy. If you would like to see it, just ask and I can post it here or provide a googledrive link.
A Gaussian distribution is completely characterized by its mean and variance (or std deviation). Under the hypothesis that your data are normally distributed, the best fit will be obtained by using x-bar as the mean and s-squared as the variance. But before doing so, I'd check whether normality is plausible using, e.g., a q-q plot.
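If a chi-squared minimization is still wanted, one possible sketch is shown below. It starts from the sample mean and standard deviation suggested above and passes the fixed data to scipy.optimize.minimize through its args parameter, so that only the three parameters are varied. Everything specific here is an assumption: the data are synthetic, the expected count per bin is taken as the area under normc * exp(-(x - mu)^2 / (2 sigma^2)) over that bin (as the question describes), and Nelder-Mead is just one reasonable choice of method.

```python
import numpy as np
from math import erf, sqrt, pi
from scipy.optimize import minimize

# Synthetic stand-in for the asker's data (assumed roughly normal)
rng = np.random.default_rng(0)
data = rng.normal(loc=50.5, scale=7.0, size=5000)
counts, edges = np.histogram(data, bins=30)

_erf = np.vectorize(erf)  # element-wise error function without extra dependencies

def expected_counts(params, edges):
    """Area under normc * exp(-(x - mu)^2 / (2 sigma^2)) over each bin."""
    normc, mu, sigma = params
    cdf = 0.5 * (1.0 + _erf((edges - mu) / (sigma * sqrt(2.0))))
    return normc * sigma * sqrt(2.0 * pi) * np.diff(cdf)

def chi2(params, counts, edges):
    """Pearson chi-squared between observed and expected counts per bin."""
    normc, mu, sigma = params
    if normc <= 0 or sigma <= 0:
        return np.inf  # keep the search away from invalid parameters
    expected = np.maximum(expected_counts(params, edges), 1e-9)
    return float(np.sum((counts - expected) ** 2 / expected))

# Starting point: sample mean and standard deviation, with normc chosen so
# that the total area under the curve equals the number of observations.
s = data.std(ddof=1)
x0 = [counts.sum() / (s * sqrt(2.0 * pi)), data.mean(), s]

# counts and edges stay fixed via args; only normc, mu, sigma are varied.
result = minimize(chi2, x0, args=(counts, edges), method="Nelder-Mead")
normc_fit, mu_fit, sigma_fit = result.x
print(result.x, result.fun)
```

With normally distributed data the minimizer should move only slightly away from the moment-based starting point, which is consistent with the answer above.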

curve fitting with integer inputs Python 3.3

I am using scipy's curve_fit function to fit a function and wanted to know if there is a way to tell it that the only possible entries are integers, not real numbers. Any ideas as to another way of doing this?
In its general form, an integer programming problem is NP-hard (see here). There are some efficient heuristic or approximate algorithms for this problem, but none guarantees an exact optimal solution.
In scipy you may implement a grid search over the integer coefficients and use, say, curve_fit over the real parameters for the given integer coefficients. For the grid search, scipy has the brute function.
For example if y = a * x + b * x^2 + some-noise where a has to be integer this may work:
Generate some test data with a = 5 and b = -1.5:
import numpy as np
coef, n = [5, -1.5], 50
xs = np.linspace(0, 10, n)[:,np.newaxis]
xs = np.hstack([xs, xs**2])
noise = 2 * np.random.randn(n)
ys = np.dot(xs, coef) + noise
A function which given the integer coefficients fits the real coefficient using curve_fit method:
def optfloat(intcoef, xs, ys):
    from scipy.optimize import curve_fit
    def poly(xs, floatcoef):
        return np.dot(xs, [intcoef, floatcoef])
    popt, pcov = curve_fit(poly, xs, ys)
    errsqr = np.linalg.norm(poly(xs, popt) - ys)
    return dict(errsqr=errsqr, floatcoef=popt)
A function which given the integer coefficients, uses the above function to optimize the float coefficient and returns the error:
def errfun(intcoef, *args):
    xs, ys = args
    return optfloat(intcoef, xs, ys)['errsqr']
Minimize errfun using scipy.optimize.brute to find optimal integer coefficient and call optfloat with the optimized integer coefficient to find the optimal real coefficient:
from scipy.optimize import brute
grid = [slice(1, 10, 1)] # grid search over 1, 2, ..., 9
# it is important to specify finish=None in below
intcoef = brute(errfun, grid, args=(xs, ys,), finish=None)
floatcoef = optfloat(intcoef, xs, ys)['floatcoef'][0]
Using this method I obtain [5.0, -1.50577] for the optimal coefficients, which is exact for the integer coefficient, and close enough for the real coefficient.
In general, the answer is No: scipy.optimize.curve_fit() and leastsq() that it is based on, and (AFAIK) all the other solvers in scipy.optimize work strictly on floating point numbers.
You could try increasing the value of epsfcn (which has a default value of numpy.finfo('double').eps ~ 2.e-16), which would be used as the initial step to all variables in the problem. The basic issue is that the fitting algorithm will adjust a floating point number, and if you do
int_var = int(float_var)
and the algorithm changes float_var from 1.0 to 1.00000001, it will see no difference in the result and decide that that value does not actually alter the fit metric.
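The flat response described above is easy to reproduce. The sketch below uses a hypothetical one-parameter model y = int_var * x (an assumption made purely for illustration) and shows that a tiny change in the floating point parameter leaves the error unchanged, so a finite-difference gradient comes out as zero.

```python
def sum_sq_error(float_var, xs, ys):
    """Sum of squared residuals for y = int_var * x, truncating the parameter."""
    int_var = int(float_var)  # truncation hides small changes in float_var
    return sum((int_var * x - y) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 3.0, 6.0, 9.0]  # generated with a true integer value of 3

base = sum_sq_error(1.0, xs, ys)
nudged = sum_sq_error(1.00000001, xs, ys)
finite_diff = (nudged - base) / 1e-8  # what a gradient-based solver would estimate

print(base, nudged, finite_diff)
```

Since the estimated slope is exactly zero, a gradient-based solver has no reason to move the parameter, even though the error at the true value of 3 would be zero.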
Another approach would be to have a floating point parameter 'tmp_float_var' that is freely adjusted by the fitting algorithm but then in your objective function use
int_var = int(tmp_float_var / numpy.finfo('double').eps)
as the value for your integer variable. That might need a little tweaking, and might be a little unstable, but ought to work.

Resources