Can I use scipy.optimize with numpy.where? - python-3.x

I am trying to fit a box-like function with scipy.optimize, and that function is defined with a numpy.where call, but the result is a covariance matrix filled with "inf" entries.
The function I am trying to fit is box-like: it is equal to F0 outside an "event" and F0-A inside an "event":
np.where(np.absolute(miu-x)<=omega/2.0,F0-A,F0)
I have tried defining it both as a piecewise ("branch") function and via numpy.where. In both cases the result is not what I intend, and the fit always emits the following warning:
OptimizeWarning: Covariance of the parameters could not be estimated
def func(x, F0, A, miu, omega):
    return np.where(np.absolute(miu - x) <= omega / 2.0, F0 - A, F0)

popt, pcov = curve_fit(func, xdata, ydata, p0=pri_values, sigma=sigmas)
# xdata, ydata and p0 are obtained from other files
I expect as output an array of 4 elements with the best-fit value of each parameter (popt), along with a 4x4 covariance matrix (pcov). But the result does not converge to the expected values, and the covariance matrix is filled with "inf".
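For anyone hitting the same warning, one likely cause is worth spelling out: the box model is piecewise constant in miu and omega, so the numerical derivatives curve_fit relies on vanish almost everywhere and the covariance cannot be estimated. Below is a minimal self-contained sketch; the synthetic xdata/ydata stand in for the values read from files, and the logistic-edge smoothing via scipy.special.expit is an assumed workaround, not part of the original post:

import numpy as np
from scipy.optimize import curve_fit
from scipy.special import expit  # logistic function, used here to smooth the box edges

# synthetic stand-ins for the data loaded from files
rng = np.random.default_rng(0)
xdata = np.linspace(0, 10, 200)
F0_true, A_true, miu_true, omega_true = 10.0, 3.0, 5.0, 2.0
ydata = np.where(np.absolute(miu_true - xdata) <= omega_true / 2.0, F0_true - A_true, F0_true)
ydata = ydata + rng.normal(0, 0.1, xdata.size)

def smooth_box(x, F0, A, miu, omega, s=20.0):
    # same box shape, but with logistic edges of steepness s, so the model
    # has nonzero derivatives with respect to miu and omega
    left = miu - omega / 2.0
    right = miu + omega / 2.0
    return F0 - A * expit(s * (x - left)) * expit(s * (right - x))

popt, pcov = curve_fit(smooth_box, xdata, ydata, p0=[9.0, 2.5, 4.5, 1.5])
print(popt)  # with smooth edges, a finite covariance matrix becomes attainable
print(pcov)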

Related

Fixing two parameters in Gaussian fit using curve_fit

I am using a Gaussian model and would like to fix two of its parameters, mu1 and sigma1.
def gauss_cnst(x, A1, mu1, sigma1, const):
    return A1 * np.exp(-(x - mu1)**2 / (2. * sigma1**2)) + const
This is the fitting code I'm using:
p0_zr = [A, t0, sig, cnst]
dd = np.linspace(min(mjd), max(mjd_alloid_zr_brt_cad), 2000)
coeff_zr, var_matrix_zr = curve_fit(gauss_cnst, mjd, mag, p0=p0_zr,
                                    sigma=err, method='lm', maxfev=900000)
coeff_zr_err = np.sqrt(np.diag(var_matrix_zr))
gauss_fit_zr = gauss_cnst(dd, *coeff_zr)
where mjd, mag, and err are my data x, y, and error. In the p0_zr array, A and cnst are free parameters to be estimated by the fit, but I would like the Gaussian model to hold t0 and sig fixed at, e.g., t0=59800 and sig=2.5.
How can I do that?
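One standard way to do this (a sketch, not from the original thread) is to wrap gauss_cnst in a small helper that hard-codes the fixed values, so curve_fit only ever sees the free parameters. The fixed values below are the examples from the question:

T0_FIXED, SIG_FIXED = 59800.0, 2.5  # values to hold constant

def gauss_cnst_fixed(x, A1, const):
    # only A1 and const are exposed to curve_fit; mu1 and sigma1 stay pinned
    return gauss_cnst(x, A1, T0_FIXED, SIG_FIXED, const)

p0_free = [A, cnst]  # initial guesses for the remaining free parameters
coeff_zr, var_matrix_zr = curve_fit(gauss_cnst_fixed, mjd, mag, p0=p0_free,
                                    sigma=err, method='lm', maxfev=900000)

The returned covariance matrix is then 2x2, covering only the free parameters.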

Calculate Batch Pairwise Sinkhorn Distance in PyTorch

I have two tensors of the same shape, and I want to calculate the pairwise Sinkhorn distance between them using GeomLoss.
What I have tried:
import torch
import geomloss # pip install git+https://github.com/jeanfeydy/geomloss
a = torch.rand((8,4))
b = torch.rand((8,4))
geomloss.SamplesLoss('sinkhorn')(a,b)
# ^ input shape [batch, feature_dim]
# will return a scalar value
geomloss.SamplesLoss('sinkhorn')(a.unsqueeze(1),b.unsqueeze(1))
# ^ input shape [batch, n_points, feature_dim]
# will return a tensor of size [batch] of distances between a[i] and b[i] for each i
However, I would like to compute pairwise distances, so the resulting tensor should be of size [batch, batch]. To achieve this, I tried the following, hoping broadcasting would apply:
geomloss.SamplesLoss('sinkhorn')(a.unsqueeze(0), b.unsqueeze(1))
But I got this error message:
ValueError: Samples x and y should have the same batchsize.
The documentation doesn't give examples of how to use the distance's forward function directly, so here's a way to do it; it requires calling the distance function batch times.
We construct the distance matrix row by row. Row i holds the distances a[i]<->b[0], a[i]<->b[1], through to a[i]<->b[batch-1]. To do so we construct, for each row i, an (8x4) tensor that repeats a[i].
This will do:
a_i = torch.stack(8*[a[i]], dim=0)
Then we calculate the distance between a[i] and every sample in b:
dist(a_i.unsqueeze(1), b.unsqueeze(1))
Stacking the batch rows then gives the final tensor.
Here's the complete code:
batch = a.shape[0]
dist = geomloss.SamplesLoss('sinkhorn')
# row i: distances between a[i] (repeated batch times) and every b[j]
distances = [dist(torch.stack(batch * [a[i]]).unsqueeze(1), b.unsqueeze(1)) for i in range(batch)]
D = torch.stack(distances)  # shape [batch, batch]
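A quick sanity check (assuming the a and b defined above): D[i, j] is the Sinkhorn distance between a[i] and b[j], so the diagonal of D should match the per-pair distances from the batched call shown earlier.

assert D.shape == (batch, batch)
diag = geomloss.SamplesLoss('sinkhorn')(a.unsqueeze(1), b.unsqueeze(1))
print(torch.allclose(torch.diagonal(D), diag, atol=1e-5))  # expected: True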

Fit sigmoid function ("S" shape curve) to data using Python

I'm trying to fit a sigmoid function to some data I have, but I keep getting: ValueError: Unable to determine number of fit parameters.
My data looks like this: [scatter plot omitted]
My code is:
from scipy.optimize import curve_fit

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

popt, pcov = curve_fit(sigmoid, xdata, ydata, method='dogbox')
Then I get:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-78540a3a23df> in <module>
2 return (1/(1+np.exp(-x)))
3
----> 4 popt, pcov = curve_fit(sigmoid, xdata, ydata, method='dogbox')
~\Anaconda3\lib\site-packages\scipy\optimize\minpack.py in curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma, check_finite, bounds, method, jac, **kwargs)
685 args, varargs, varkw, defaults = _getargspec(f)
686 if len(args) < 2:
--> 687 raise ValueError("Unable to determine number of fit parameters.")
688 n = len(args) - 1
689 else:
ValueError: Unable to determine number of fit parameters.
I'm not sure why this does not work; it seems like a trivial task: fit a curve to some points. The desired curve would look like this: [hand-drawn sketch omitted]
Sorry for the graphics; I did it in PowerPoint...
How can I find the best sigmoid ("S" shape) curve?
UPDATE
Thanks to @Brenlla I've changed my code to:
def sigmoid(x, k, x0):
    return 1 / (1 + np.exp(-k * (x - x0)))
popt, pcov = curve_fit(sigmoid, xdata, ydata, method='dogbox')
Now I do not get an error, but the curve is not as desired:
x = np.linspace(0, 1600, 1000)
y = sigmoid(x, *popt)
plt.plot(xdata, ydata, 'o', label='data')
plt.plot(x, y, label='fit')
plt.ylim(0, 1.3)
plt.legend(loc='best')
and the result is:
How can I improve it so it will fit the data better?
UPDATE2
The code is now:
def sigmoid(x, L, x0, k, b):
    y = L / (1 + np.exp(-k * (x - x0))) + b
    return y
But the result is still...
After great help from @Brenlla the code was modified to:
def sigmoid(x, L, x0, k, b):
    y = L / (1 + np.exp(-k * (x - x0))) + b
    return y

p0 = [max(ydata), np.median(xdata), 1, min(ydata)]  # this is a mandatory initial guess
popt, pcov = curve_fit(sigmoid, xdata, ydata, p0, method='dogbox')
The optimized parameters are L, x0, k, and b; their initial values are given in p0, the point where the optimization starts.
L scales the output range from [0,1] to [0,L].
b adds a bias to the output, shifting its range from [0,L] to [b,L+b].
k scales the input, which remains in (-inf,inf).
x0 is the midpoint of the sigmoid, i.e. the point where it should originally output 1/2 (since if x=x0, we get 1/(1+exp(0)) = 1/2).
And the result:
Note: there were some questions about initial estimates earlier. My data is particularly messy, and the solution above worked most of the time, but would occasionally miss entirely. This was remedied by changing the method from 'dogbox' to 'lm':
p0 = [max(ydata), np.median(xdata), 1, min(ydata)]  # this is a mandatory initial guess
popt, pcov = curve_fit(sigmoid, xdata, ydata, p0, method='lm')  # updated method from 'dogbox' to 'lm', 9.30.2021
Over about 50 fitted curves, it didn't change the ones that worked well at all, but completely addressed the challenge cases.
Point is, in all cases you and your data are a special snowflake, so don't be afraid to dig in and poke around at the parameters of a function you copy from the internet.
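For completeness, here is a self-contained sketch of the final recipe on synthetic data (the data below is made up purely for illustration, and its x range is scaled down so the example converges out of the box; the asker's real data is not shown):

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def sigmoid(x, L, x0, k, b):
    return L / (1 + np.exp(-k * (x - x0))) + b

# synthetic stand-in data
rng = np.random.default_rng(1)
xdata = np.linspace(0, 10, 60)
ydata = sigmoid(xdata, 1.2, 5.0, 1.5, 0.2) + rng.normal(0, 0.02, xdata.size)

p0 = [max(ydata), np.median(xdata), 1, min(ydata)]  # the recipe's initial guess
popt, pcov = curve_fit(sigmoid, xdata, ydata, p0, method='lm')
print(popt)  # should land near the true values [1.2, 5.0, 1.5, 0.2]

x = np.linspace(0, 10, 1000)
plt.plot(xdata, ydata, 'o', label='data')
plt.plot(x, sigmoid(x, *popt), label='fit')
plt.legend(loc='best')
plt.show()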

Plotting residuals of masked values with `statsmodels`

I'm using statsmodels.api to compute the statistical parameters for an OLS fit between two variables:
def computeStats(x, y, yName):
    '''
    Takes as an argument an array, and a string for the array name.
    Uses Ordinary Least Squares to compute the statistical parameters for the
    array against log(z), and determines the equation for the line of best fit.
    Returns the results summary, residuals, statistical parameters in a list,
    and the best fit equation.
    '''
    # Mask NaN values in both axes
    mask = ~np.isnan(y) & ~np.isnan(x)
    # Compute model parameters
    model = sm.OLS(y, sm.add_constant(x), missing='drop')
    results = model.fit()
    residuals = results.resid
    # Compute fit parameters
    params = stats.linregress(x[mask], y[mask])
    fit = params[0] * x + params[1]
    fitEquation = '$(%s)=(%.4g \\pm %.4g) \\times redshift+%.4g$' % (yName,
                  params[0],  # slope
                  params[4],  # stderr in slope
                  params[1])  # y-intercept
    return results, residuals, params, fit, fitEquation
The second part of the function (using stats.linregress) plays nicely with the masked values, but statsmodels does not. When I try to plot the residuals against the x values with plt.scatter(x, resids), the dimensions do not match:
ValueError: x and y must be the same size
because there are 29007 x-values, and 11763 residuals (that's how many y-values made it through the masking process). I tried changing the model variable to
model = sm.OLS(y[mask], sm.add_constant(x[mask]), missing='drop')
but this had no effect.
How can I scatter-plot the residuals against the x-values they match with?
Hi @jim421616, since statsmodels dropped the missing values internally, you should use the model's exog attribute to plot the scatter, as shown:
plt.scatter(model.model.exog[:,1], model.resid)
For reference, a complete dummy example:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# generate data
x = np.random.rand(1000)
y = np.sin(x * 25) + 0.1 * np.random.rand(1000)
# make some values NaN
y[np.random.choice(np.arange(1000), size=100)] = np.nan
x[np.random.choice(np.arange(1000), size=80)] = np.nan

# fit model
model = sm.OLS(y, sm.add_constant(x), missing='drop').fit()
print(model.summary())

# plot the residuals against the surviving x-values
plt.scatter(model.model.exog[:, 1], model.resid)
plt.show()
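An equivalent check, assuming the dummy data above: because missing='drop' removes exactly the rows where either variable is NaN, a hand-built mask reproduces the same pairing, which is what the original computeStats function was aiming for.

mask = ~np.isnan(x) & ~np.isnan(y)
print(model.resid.shape[0] == mask.sum())  # True: residuals align with surviving rows
plt.scatter(x[mask], model.resid)          # same scatter, using the explicit mask
plt.show()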

How does one input a subrange-dependent Jacobian using minimize from scipy.optimize, given that the objective function returns a sum over a range?

Suppose one is trying to minimize a chi-square function for given expectation values per bin and observed counts per bin. In Python syntax (assuming the lists of expectation values and observed counts have equal lengths), the chi-square function is
chisquare = sum([((obs[i] - exp[i])**2 / exp[i]) for i in range(len(obs))])
where obs[i] are the observed counts in the i-th bin and exp[i] are the expectation values for the i-th bin (obtained by integrating the distribution function over the bin bounds).
On paper, I know how to calculate the partial derivatives of chi square (with respect to the observed counts per bin and with respect to the expectation values per bin) to compute a Jacobian:
jaco = [(2 * (obs[i] - exp[i]) / exp[i]), ((exp[i]**2 - obs[i]**2) / exp[i]**2)]
For the i-th bin, the Jacobian terms contributed by all other bins are zero.
According to the SCIPY docs,
jac : bool or callable, optional
Jacobian (gradient) of objective function. Only for CG, BFGS, Newton-CG, L-BFGS-B, TNC, SLSQP, dogleg, trust-ncg. If jac is a Boolean and is True, fun is assumed to return the gradient along with the objective function. If False, the gradient will be estimated numerically. jac can also be a callable returning the gradient of the objective. In this case, it must accept the same arguments as fun.
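To make the two calling conventions concrete, here is a minimal illustration on a toy quadratic (my own example, not from the question):

import numpy as np
from scipy.optimize import minimize

def f(p):
    return np.sum(p**2)

def grad_f(p):
    return 2 * p  # gradient with respect to the parameters, same shape as p

# jac as a separate callable...
res = minimize(f, x0=np.array([3.0, -1.0]), method='L-BFGS-B', jac=grad_f)

# ...or jac=True, with fun returning (value, gradient)
def f_and_grad(p):
    return np.sum(p**2), 2 * p

res2 = minimize(f_and_grad, x0=np.array([3.0, -1.0]), method='L-BFGS-B', jac=True)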
My code so far is (jaco here is the per-bin list from above, which is the crux of my question):
from scipy.stats import chisquare
from scipy.optimize import minimize

def argmin(args):
    # args is the parameter vector being optimized: args = [mu, sigma]
    obsperbin = countperbin(numbins=numbins)  # pre-defined function (not shown)
    expperbin = expectperbin(numbins, args)   # pre-defined function (not shown)
    # note: scipy.stats.chisquare returns (statistic, pvalue); minimize needs the scalar statistic
    return chisquare(obsperbin, expperbin)[0]

def miniz():
    parameterguess = initial_params()  # pre-defined function (not shown) that returns [mu, sigma]
    # [mu, sigma] of the distribution used to generate the initial guess of the chi-square value;
    # minimize then looks for the values of [mu, sigma] that minimize chi square
    bnds = ((0.01, 10), (0.01, 1))  # ((bounds of mu), (bounds of sigma))
    globmin = minimize(argmin, parameterguess, method='L-BFGS-B', bounds=bnds, jac=jaco)
    return globmin

res = miniz()
print(res)
My trouble is that the Jacobian is bin-dependent while the chi-square value is a sum over all of the bins. Does this mean the Jacobian passed to the minimizer should be the sum of the per-bin Jacobians? Or an array at each iteration? How am I to input the Jacobian as an argument in the minimizing function?
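For what it's worth, here is a sketch of the shape minimize expects, on a toy stand-in (countperbin/expectperbin are not shown in the question, so a Gaussian-histogram model is invented here purely for illustration). The key point: jac must return the gradient of the scalar objective with respect to the parameters [mu, sigma], so the per-bin partials are summed via the chain rule rather than passed per bin.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# toy stand-in setup: N samples binned on a fixed grid, model = Gaussian(mu, sigma)
edges = np.linspace(0, 10, 21)  # 20 bins
N = 1000
rng = np.random.default_rng(0)
obs, _ = np.histogram(rng.normal(5.0, 0.8, N), bins=edges)

def expperbin(args):
    mu, sigma = args
    p = np.diff(norm.cdf(edges, mu, sigma))  # probability mass in each bin
    return N * p + 1e-9                      # small floor avoids division by zero

def chi2(args):
    e = expperbin(args)
    return np.sum((obs - e)**2 / e)

def chi2_jac(args):
    # gradient with respect to the parameters [mu, sigma]: the per-bin partials
    # d(chi2)/d(e_i) = (e_i**2 - obs_i**2) / e_i**2 are summed via the chain rule
    mu, sigma = args
    e = expperbin(args)
    dchi2_de = (e**2 - obs**2) / e**2
    pdf = norm.pdf(edges, mu, sigma)
    z = (edges - mu) / sigma
    de_dmu = N * np.diff(-pdf)         # d/dmu of norm.cdf is -norm.pdf
    de_dsigma = N * np.diff(-z * pdf)  # d/dsigma of norm.cdf is -z * norm.pdf
    return np.array([dchi2_de @ de_dmu, dchi2_de @ de_dsigma])

res = minimize(chi2, x0=[4.0, 1.5], method='L-BFGS-B',
               bounds=((0.01, 10), (0.01, 5)), jac=chi2_jac)
print(res.x)  # should land near the true values [5.0, 0.8]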
