Fit sigmoid function ("S" shape curve) to data using Python - python-3.x

I'm trying to fit a sigmoid function to some data I have but I keep getting:ValueError: Unable to determine number of fit parameters.
My data looks like this:
My code is:
from scipy.optimize import curve_fit
def sigmoid(x):
return (1/(1+np.exp(-x)))
popt, pcov = curve_fit(sigmoid, xdata, ydata, method='dogbox')
Then I get:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-78540a3a23df> in <module>
2 return (1/(1+np.exp(-x)))
3
----> 4 popt, pcov = curve_fit(sigmoid, xdata, ydata, method='dogbox')
~\Anaconda3\lib\site-packages\scipy\optimize\minpack.py in curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma, check_finite, bounds, method, jac, **kwargs)
685 args, varargs, varkw, defaults = _getargspec(f)
686 if len(args) < 2:
--> 687 raise ValueError("Unable to determine number of fit parameters.")
688 n = len(args) - 1
689 else:
ValueError: Unable to determine number of fit parameters.
I'm not sure why this does not work, it seems like a trivial action--> fit a curve to some point. The desired curve would look like this:
Sorry for the graphics.. I did it in PowerPoint...
How can I find the best sigmoid ("S" shape) curve?
UPDATE
Thanks to #Brenlla I've changed my code to:
def sigmoid(k,x,x0):
return (1 / (1 + np.exp(-k*(x-x0))))
popt, pcov = curve_fit(sigmoid, xdata, ydata, method='dogbox')
Now I do not get an error, but the curve is not as desired:
x = np.linspace(0, 1600, 1000)
y = sigmoid(x, *popt)
plt.plot(xdata, ydata, 'o', label='data')
plt.plot(x,y, label='fit')
plt.ylim(0, 1.3)
plt.legend(loc='best')
and the result is:
How can I improve it so it will fit the data better?
UPDATE2
The code is now:
def sigmoid(x, L,x0, k, b):
y = L / (1 + np.exp(-k*(x-x0)))+b
But the result is still...

After great help from #Brenlla the code was modified to:
def sigmoid(x, L ,x0, k, b):
y = L / (1 + np.exp(-k*(x-x0))) + b
return (y)
p0 = [max(ydata), np.median(xdata),1,min(ydata)] # this is an mandatory initial guess
popt, pcov = curve_fit(sigmoid, xdata, ydata,p0, method='dogbox')
The parameters optimized are L, x0, k, b, who are initially assigned in p0, the point the optimization starts from.
L is responsible for scaling the output range from [0,1] to [0,L]
b adds bias to the output and changes its range from [0,L] to [b,L+b]
k is responsible for scaling the input, which remains in (-inf,inf)
x0 is the point in the middle of the Sigmoid, i.e. the point where Sigmoid should originally output the value 1/2 [since if x=x0, we get 1/(1+exp(0)) = 1/2].
And the result:

Note - there were some questions about initial estimates earlier. My data is particularly messy, and the solution above worked most of the time, but would occasionally miss entirely. This was remedied by changing the method from 'dogbox' to 'lm':
p0 = [max(ydata), np.median(xdata),1,min(ydata)] # this is an mandatory initial guess
popt, pcov = curve_fit(sigmoid, xdata, ydata,p0, method='lm') ## Updated method from 'dogbox' to 'lm' 9.30.2021
Over about 50 fitted curves, it didn't change the ones that worked well at all, but completely addressed the challenge cases.
Point is, in all cases you and your data are a special snowflake so don't be afraid to dig in and poke around at the parameters of a function you copy from the internet.

Related

Fitting logistic function giving strainght line in python?

I am trying to fit a logistic function to the data below. I am not able to understand why I get a straight line with fitted parameters? Can someone please help me with this?
x = [473,523,573,623,673]
y = [104,103,95,79,83]
x = np.array(x)
y = np.array(y)
def f(x,a,b,c,d):
return a/(1+np.exp(-c*(x-d))) + b
popt, pcov = opt.curve_fit(f,x,y,method="trf")
y_fit = f(x,*popt)
plt.plot(x,y,'o')
plt.plot(x,y_fit,'-')
plt.show()
A good choice of initial values can make the difference between a successful optimization and one that fails. For this particular problem I would suggest the initial values that you can find in the code below (p0 list). These values allows to properly estimate the covariance.
Moreover, you don't need to use the trf algorithm since you are not providing any bound (see docs). The Levenberg–Marquardt algorithm is the right tool for this case.
import numpy as np
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit
x = [473, 523, 573, 623, 673]
y = [104, 103, 95, 79, 83]
x = np.array(x)
y = np.array(y)
def f(x, a, b, c, d):
return a / (1 + np.exp(-c * (x - d))) + b
p0 = [20.0, 0.0, 0.0, 0.0]
popt, pcov = curve_fit(f, x, y, p0=p0)
y_fit = f(x, *popt)
plt.plot(x, y, "o")
plt.plot(x, y_fit, "-")
plt.show()
If possible I would also suggest to increase the number of observations.

Python: Fitting a piecewise polynomial

I am trying to fit a piecewise polynomial function
Code:
import numpy as np
import scipy
from scipy.interpolate import UnivariateSpline, splrep
from scipy.optimize import curve_fit
from matplotlib import pyplot as plt
def piecewise_func(x, X, Y):
"""
cond_l: condition list
func_l: function list
"""
spl = UnivariateSpline(X, Y, k=3, s=0.5)
tck = (spl._data[8], spl._data[9], 3) # tck = (knots, coefficients, degree)
p = scipy.interpolate.PPoly.from_spline(tck)
cond_l = []
func_l = []
for idx, i in enumerate(range(3, len(spl.get_knots()) + 3 - 1)):
cond_l.append([(x >= p.x[i] & x < p.x[i + 1])])
func_l.append([lambda x: p.c[3, i] + p.c[2, i] * x + p.c[1, i] * x ** 2 + p.c[0, i] * x ** 3])
return np.piecewise(x, cond_l, func_l)
if __name__ == '__main__':
xdata = [0.28190937, 0.63429607, 0.91620544, 1.68793236, 2.32350115, 2.95215219, 4.5,
4.78103382, 7.2, 7.53430054, 8.03627018, 9., 9.86212529, 11.25951191, 11.62658532, 11.65598578, 13.90295926]
ydata = [0.36273168, 0.81614628, 1.17887796, 1.4475374, 5.52692706, 2.17548169, 3.55313396, 3.80326533, 7.75556311, 8.30176616, 10.72117182, 11.2499386,
11.72296513, 11.02146624, 14.51260631, 20.59365525, 21.77847853]
spl = UnivariateSpline(xdata, ydata, k=3, s=1)
plt.plot(xdata, ydata, '*')
plt.plot(xdata, spl(xdata))
plt.show()
p, e = curve_fit(piecewise_func, xdata, ydata)
# x_plot = np.linspace(0., 0.15, len(x))
# plt.plot(x, y, "+")
# plt.plot(x, (piecewise_func(x_plot, *p)), 'C3-', lw=3)
I tried the UnivariateSpline function to interpolate, I see the following result
However, I don't want the polynomial curve to pass through all data points. I tried varying the smoothing factor but I am not able to obtain something like the one below.
Expected output:
I'm trying curve fitting (Use UnivariateSpline to fit data tightly) to get the expected output and I have the following issues.
piecewise_func in the code posted returns the piecewise polynomial.
Passing this to curve_fit(piecewise_func, xdata, ydata) returns an error
Error:
res = leastsq(func, p0, Dfun=jac, full_output=1, **kwargs)
ValueError: diff requires input that is at least one dimensional
I am not sure what is wrong.
Suggestions on how to get the expected fit will be
of great help.
I would recommend having a closer look at the parameter s in the UnivariateSpline documentation:
s : float or None, optional
Positive smoothing factor used to choose the number of knots. Number of knots will be increased until the smoothing condition is satisfied:
sum((w[i] * (y[i]-spl(x[i])))**2, axis=0) <= s
If s is None, s = len(w) which should be a good value if 1/w[i] is an estimate of the standard deviation of y[i]. If 0, spline will interpolate through all data points. Default is None.
Since you do not set w, this is just a complicated way of saying that s is the least squares error that you allow, i.e., squared errors summed over all the data points. Your value of 1 does not lead to interpolation but it is quite tight compared to what you want to achieve.
Taking
spl = UnivariateSpline(xdata, ydata, k=3, s=10)
you get the following:
Yet closer to your goal is s=100:
So my recommendation is to play around with s and if that proves insufficient, to ask a new question describing what you need more precisely. I haven't had a proper look at the problem with piecewise_func.

Python Scipy Curvefit to Linear Quadratic Curve

I'm trying to fit a linear quadratic model curve to experiment data. The Y axis values reduce from 1 to 10^-5. When I use the following code, the resulting curve often seems to not fit the data at higher X values. I have a suspicion that because the Y values at high X values are so small, the resulting difference between the experiment value and model value is small. But I would like the model curve to pass as close to the higher X value points as possible (even if it means the low values are not as well fitted). I haven't found anything about weighting in scipy.optimize.curve_fit, other than using standard deviations (which I don't have). How can I improve my model fit at high X values?
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
def lq(x, a, b):
#y(x) = exp[-(ax+bx²)]
y = []
for i in x:
x2=i**2
ax = a*i
bx2 = b*x2
y.append(np.exp(-(ax+bx2)))
return y
#x and y are from experiment
x=[0,1.778,2.921,3.302,6.317,9.524,10.54]
y=[1,0.831763771,0.598411595,0.656145266,0.207014135,0.016218101,0.004102041]
(a,b), pcov = curve_fit(lq, x, y, p0=[0.05,0.05])
#make the model curve using a and b
xmodel = list(range(0,20))
ymodel = lq(xmodel, a, b)
fig, ax1 = plt.subplots()
ax1.set_yscale('log')
ax1.plot(x,y, "ro", label="Experiment")
ax1.plot(xmodel,ymodel, "r--", label="Model")
plt.show()
I agree with your assessment that the fit is not very sensitive to small misfits for the small values of y. Since you are plotting the data and fit on a semi-log plot, I think that what you really want is to fit in the log-space as well. That is, you could fit log(y) to a quadratic function. As an aside (but an important one if you're going to be doing numerical work with Python), you should not loop over lists but rather use numpy arrays: this will make everything faster and simpler. With such changes, your script might look like
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
def lq(x, a, b):
return -(a*x+b*x*x)
x = np.array([0,1.778,2.921,3.302,6.317,9.524,10.54])
y = np.array([1,0.831763771,0.598411595,0.656145266,0.207014135,0.016218101,0.004102041])
(a,b), pcov = curve_fit(lq, x, np.log(y), p0=[0.05,0.05])
xmodel = np.arange(20) # Note: use numpy!
ymodel = np.exp(lq(xmodel, a, b)) # Note: take exp() as inverse log()
fig, ax1 = plt.subplots()
ax1.set_yscale('log')
ax1.plot(x, y, "ro", label="Experiment")
ax1.plot(xmodel,ymodel, "r--", label="Model")
plt.show()
Note that the model function is changed to just be the ax+bx^2 you wanted to write in the first place and that this is now fitting np.log(y), not y. This will give a much more satisfying fit at the smaller y values.
You might also find lmfit (https://lmfit.github.io/lmfit-py/) helpful for this problem (disclaimer: I am a lead author). With this, your fit script could become
from lmfit import Model
model = Model(lq)
params = model.make_params(a=0.05, b=0.05)
result = model.fit(np.log(y), params, x=x)
print(result.fit_report())
xmodel = np.arange(20)
ymodel = np.exp(result.eval(x=xmodel))
plt.plot(x, y, "ro", label="Experiment")
plt.plot(xmodel, ymodel, "r--", label="Model")
plt.yscale('log')
plt.legend()
plt.show()
This will print out a report including fit statistics and interpretable uncertainties and correlations between variables:
[[Model]]
Model(lq)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 7
# data points = 7
# variables = 2
chi-square = 0.16149397
reduced chi-square = 0.03229879
Akaike info crit = -22.3843833
Bayesian info crit = -22.4925630
[[Variables]]
a: -0.05212688 +/- 0.04406602 (84.54%) (init = 0.05)
b: 0.05274458 +/- 0.00479056 (9.08%) (init = 0.05)
[[Correlations]] (unreported correlations are < 0.100)
C(a, b) = -0.968
and give a plot of
Note that lmfit Parameters can be fixed or bounded and that lmfit comes with many built-in models.
Finally, if you were to include a constant term in the quadratic model, you would not really need an iterative method but could use polynomial regression, as with numpy.polyfit.
Here is a graphical Python fitter using your data with a Gompertz type of sigmoidal equation. This code uses scipy's Differential Evolution genetic algorithm module to determine initial parameter estimates for scipy's non-linear curve_fit() routine. That scipy module uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, requiring bounds within which to search. In this example, I made all of the parameter search bounds from -2.0 to 2.0, and that seems to work in this case. Note that it is much easier to provide ranges for the initial parameter estimates than specific values, and those parameter ranges can be generous.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings
#x and y are from experiment
x=[0,1.778,2.921,3.302,6.317,9.524,10.54]
y=[1,0.831763771,0.598411595,0.656145266,0.207014135,0.016218101,0.004102041]
# alias data to match previous example code
xData = numpy.array(x, dtype=float)
yData = numpy.array(y, dtype=float)
def func(x, a, b, c): # Sigmoidal Gompertz C from zunzun.com
return a * numpy.exp(b * numpy.exp(c*x))
# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
warnings.filterwarnings("ignore") # do not print warnings by genetic algorithm
val = func(xData, *parameterTuple)
return numpy.sum((yData - val) ** 2.0)
def generate_Initial_Parameters():
parameterBounds = []
parameterBounds.append([-2.0, 2.0]) # search bounds for a
parameterBounds.append([-2.0, 2.0]) # search bounds for b
parameterBounds.append([-2.0, 2.0]) # search bounds for c
# "seed" the numpy random number generator for repeatable results
result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
return result.x
# by default, differential_evolution completes by calling curve_fit() using parameter bounds
geneticParameters = generate_Initial_Parameters()
# now call curve_fit without passing bounds from the genetic algorithm,
# just in case the best fit parameters are aoutside those bounds
fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Fitted parameters:', fittedParameters)
print()
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# plot wuth log Y axis scaling
plt.yscale('log')
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plot
xModel = numpy.linspace(min(xData), max(xData))
yModel = func(xModel, *fittedParameters)
# now the model as a line plot
axes.plot(xModel, yModel)
axes.set_xlabel('X Data') # X axis data label
axes.set_ylabel('Y Data') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)

curve_fit with polynomials of variable length

I'm new to python (and programming in general) and want to make a polynomial fit using curve_fit, where the order of the polynomials (or the number of fit parameters) is variable.
I made this code which is working for a fixed number of 3 parameters a,b,c
# fit function
def fit_func(x, a,b,c):
p = np.polyval([a,b,c], x)
return p
# do the fitting
popt, pcov = curve_fit(fit_func, x_data, y_data)
But now I'd like to have my fit function to only depend on a number N of parameters instead of a,b,c,....
I'm guessing that's not a very hard thing to do, but because of my limited knowledge I can't get it work.
I've already looked at this question, but I wasn't able to apply it to my problem.
You can define the function to be fit to your data like this:
def fit_func(x, *coeffs):
y = np.polyval(coeffs, x)
return y
Then, when you call curve_fit, set the argument p0 to the initial guess of the polynomial coefficients. For example, this plot is generated by the script that follows.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
# Generate a sample input dataset for the demonstration.
x = np.arange(12)
y = np.cos(0.4*x)
def fit_func(x, *coeffs):
y = np.polyval(coeffs, x)
return y
fit_results = []
for n in range(2, 6):
# The initial guess of the parameters to be found by curve_fit.
# Warning: in general, an array of ones might not be a good enough
# guess for `curve_fit`, but in this example, it works.
p0 = np.ones(n)
popt, pcov = curve_fit(fit_func, x, y, p0=p0)
# XXX Should check pcov here, but in this example, curve_fit converges.
fit_results.append(popt)
plt.plot(x, y, 'k.', label='data')
xx = np.linspace(x.min(), x.max(), 100)
for p in fit_results:
yy = fit_func(xx, *p)
plt.plot(xx, yy, alpha=0.6, label='n = %d' % len(p))
plt.legend(framealpha=1, shadow=True)
plt.grid(True)
plt.xlabel('x')
plt.show()
The parameters of polyval specify p is an array of coefficients from the highest to lowest. With x being a number or array of numbers to evaluate the polynomial at. It says, the following.
If p is of length N, this function returns the value:
p[0]*x**(N-1) + p[1]*x**(N-2) + ... + p[N-2]*x + p[N-1]
def fit_func(p,x):
z = np.polyval(p,x)
return z
e.g.
t= np.array([3,4,5,3])
y = fit_func(t,5)
503
which is if you do the math here is right.

Linear Regression algorithm works with one data-set but not on another, similar data-set. Why?

I created a linear regression algorithm following a tutorial and applied it to the data-set provided and it works fine. However the same algorithm does not work on another similar data-set. Can somebody tell me why this happens?
def computeCost(X, y, theta):
inner = np.power(((X * theta.T) - y), 2)
return np.sum(inner) / (2 * len(X))
def gradientDescent(X, y, theta, alpha, iters):
temp = np.matrix(np.zeros(theta.shape))
params = int(theta.ravel().shape[1])
cost = np.zeros(iters)
for i in range(iters):
err = (X * theta.T) - y
for j in range(params):
term = np.multiply(err, X[:,j])
temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))
theta = temp
cost[i] = computeCost(X, y, theta)
return theta, cost
alpha = 0.01
iters = 1000
g, cost = gradientDescent(X, y, theta, alpha, iters)
print(g)
On running the algo through this dataset I get the output as matrix([[ nan, nan]]) and the following errors:
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: RuntimeWarning: overflow encountered in power
from ipykernel import kernelapp as app
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:11: RuntimeWarning: invalid value encountered in double_scalars
However this data set works just fine and outputs matrix([[-3.24140214, 1.1272942 ]])
Both the datasets are similar, I have been over it many times but can't seem to figure out why it works on one dataset but not on other. Any help is welcome.
Edit: Thanks Mark_M for editing tips :-)
[Much better question, btw]
It's hard to know exactly what's going on here, but basically your cost is going the wrong direction and spiraling out of control, which results in an overflow when you try to square the value.
I think in your case it boils down to your step size (alpha) being too big which can cause gradient descent to go the wrong way. You need to watch the cost in gradient descent and makes sure it's always going down, if it's not either something is broken or alpha is to large.
Personally, I would reevaluate the code and try to get rid of the loops. It's a matter of preference, but I find it easier to work with X and Y as column vectors. Here is a minimal example:
from numpy import genfromtxt
# this is your 'bad' data set from github
my_data = genfromtxt('testdata.csv', delimiter=',')
def computeCost(X, y, theta):
inner = np.power(((X # theta.T) - y), 2)
return np.sum(inner) / (2 * len(X))
def gradientDescent(X, y, theta, alpha, iters):
for i in range(iters):
# you don't need the extra loop - this can be vectorize
# making it much faster and simpler
theta = theta - (alpha/len(X)) * np.sum((X # theta.T - y) * X, axis=0)
cost = computeCost(X, y, theta)
if i % 10 == 0: # just look at cost every ten loops for debugging
print(cost)
return (theta, cost)
# notice small alpha value
alpha = 0.0001
iters = 100
# here x is columns
X = my_data[:, 0].reshape(-1,1)
ones = np.ones([X.shape[0], 1])
X = np.hstack([ones, X])
# theta is a row vector
theta = np.array([[1.0, 1.0]])
# y is a columns vector
y = my_data[:, 1].reshape(-1,1)
g, cost = gradientDescent(X, y, theta, alpha, iters)
print(g, cost)
Another useful technique is to normalize your data before doing regression. This is especially useful when you have more than one feature you're trying to minimize.
As a side note - if you're step size is right you shouldn't get overflows no matter how many iterations you do because the cost will will decrease with every iteration and the rate of decrease will slow.
After 1000 iterations I arrived at a theta and cost of:
[[ 1.03533399 1.45914293]] 56.041973778
after 100:
[[ 1.01166889 1.45960806]] 56.0481988054
You can use this to look at the fit in an iPython notebook:
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(my_data[:, 0].reshape(-1,1), y)
axes = plt.gca()
x_vals = np.array(axes.get_xlim())
y_vals = g[0][0] + g[0][1]* x_vals
plt.plot(x_vals, y_vals, '--')

Resources