How to fit a curve to this data using scipy curve_fit - python-3.x

I am hoping someone can help me see where I'm going wrong with fitting a curve to this data. I am using the method in this link and so have the following code:
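import numpy as np
from scipy.optimize import curve_fit
def sigmoid(x, L, x0, k, b):
    # L: curve height, x0: midpoint, k: steepness, b: vertical offset
    y = L / (1 + np.exp(-k*(x-x0)))+b
    return y
# x2 and y1 are the measured data arrays
p0 = [max(y1), np.median(x2), 1, min(y1)]
popt, pcov = curve_fit(sigmoid, xdata=x2, ydata=y1, p0=p0, method='dogbox')
predictions = sigmoid(x2, *popt)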
And my plotted "curve" looks like so:
But I am expecting a more s-shaped curve. I have experimented with different p0 values but not getting the required output (and if I'm honest I'm not sure how I'm supposed to find the ideal starting parameters).
Using p0 = [max(y1), np.median(x2), 0.4, 1] and method='trf' I did get the following, which is closer but still missing the curve in the middle?
Any help greatly appreciated!

That is because your y-axis is a log scale. If you change the y-axis to a linear one, you'll see that the fit is actually quite good.
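As a rough illustration of the point above, here is a minimal sketch that fits the same sigmoid to synthetic data. The data, the noise level, and the "true" parameter values are all made up for the example; only the sigmoid function and the data-driven style of p0 come from the question.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, L, x0, k, b):
    return L / (1 + np.exp(-k * (x - x0))) + b

# Synthetic stand-in for the asker's data (all values are assumptions).
rng = np.random.default_rng(0)
x = np.linspace(0, 20, 60)
true_params = (1000.0, 10.0, 0.8, 5.0)
y = sigmoid(x, *true_params) + rng.normal(0, 10, x.size)

# Starting guesses read off the data: height, midpoint, slope, offset.
p0 = [y.max() - y.min(), np.median(x), 1.0, y.min()]
popt, pcov = curve_fit(sigmoid, x, y, p0=p0)
fit = sigmoid(x, *popt)
print(popt)
```

Plotting x against fit on a linear y-axis should show the S shape. Calling plt.yscale('log') on the same plot stretches the lower plateau and compresses the upper one, which is why a good fit can look wrong on a log axis.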

Related

How to plot data like points and draw line of linear fit in python?

I have data plotted on a graph. How can I draw a linear fit line on the same graph?
I would be grateful if you could suggest a solution. Thank you!
You can use NumPy's built-in polyfit function to get the least-squares linear fit for your data.
However, I am not sure how to keep the aspect ratio of the inner grid at 1:1.
angle = np.polyfit(x, y, 1)
y_line = angle[1] + angle[0] * x
fig, ax = plt.subplots(figsize=(8,8))
ax.scatter(x,y)
ax.plot(x,y_line, 'r')
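For reference, a small self-contained sketch of the same least-squares line on made-up data. The arrays and the noise level are assumptions; np.polyfit returns the coefficients highest power first, so for degree 1 that is the slope followed by the intercept.

```python
import numpy as np

# Hypothetical data: a noisy straight line.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)

# Degree-1 fit: coefficients come back as [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
y_line = np.polyval([slope, intercept], x)
print(slope, intercept)
```

For the 1:1 grid, calling ax.set_aspect('equal') on the Matplotlib Axes is one option worth trying, since it makes one data unit the same length on both axes.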

Failing a simple Cosine fit in Python

Here's how I generate my data and the tried fit:
import matplotlib.pyplot as plt
from scipy import optimize
import numpy as np
def f(t,a,b):
    return a*np.cos(b*t)
v = 0
x = 0.03
t = 0
dt = 0.001
time = []
pos = []
while t<3:
    a = (-5*x)/0.1
    v = v + a*dt
    x = x + v*dt
    time.append(t)
    pos.append(x)
    t = t+dt
pop, pcov = optimize.curve_fit(f,time,pos)
print(pop)
Even when I indicate initial values for the parameters (such as 0.03 for a and 7 for b), the resulting fit is still way off (see below; the dashed line is the fit function).
Am I using the wrong library, or have I made an obvious blunder?
Thanks for any hints.
As Tyberius noted, you need to provide better initial values.
Why is that? optimize.curve_fit runs a local least-squares optimizer (leastsq by default, or least_squares when bounds are given), which only finds a local minimum of the cost function.
I believe in your case you are stuck in such a local minimum (that is not the global minimum). If you look at your diagram, your fit is approximately y=0. (It is a bit wavy because it is a cosine)
If you were to increase a a bit the error would go up, so a stays close to zero. And if you were to increase b to fit the frequency of the data, the cost function would go up as well so that one stays low as well.
If you don't provide initial values, the parameters start at 1 each so it looks like this:
plt.plot(time, pos, 'black', label="data")
a,b = 1,1
init = [a*np.cos(b*t) for t in time]
plt.plot(time, init, 'b', label="a,b=1,1")
plt.legend()
plt.show()
a will go down and b will stay behind. I believe the scale is an additional problem. If you normalized your data to have an amplitude of 1 the humps might be more pronounced and easier to fit.
If you start with a convenient value for a, b can find its way from an initial value as low as 5:
plt.plot(time, pos, 'black', label="data")
for i in [1, 4.8, 4.9, 5]:
    pop, pcov = optimize.curve_fit(f,time,pos, p0=(0.035,i))
    a,b = pop
    fit = [a*np.cos(b*t) for t in time]
    plt.plot(time, fit, label=f"$b_0 = {i}$")
plt.legend()
plt.show()
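One way to get a starting point without guessing is a coarse scan over b. This is a sketch of that idea and not part of the answer above: for each candidate b the best amplitude a has a closed form (the model is linear in a), so the scan is cheap, and curve_fit then refines the best candidate. The clean cosine data, the helper name scan_start, and the grid limits are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def f(t, a, b):
    return a * np.cos(b * t)

# Synthetic stand-in for the simulated oscillator (assumed values).
t = np.arange(0.0, 3.0, 0.001)
pos = 0.03 * np.cos(np.sqrt(50.0) * t)

def scan_start(t, y, b_grid):
    """Return (a, b) with the lowest squared error over a grid of b values."""
    best = None
    for b in b_grid:
        c = np.cos(b * t)
        a = np.dot(y, c) / np.dot(c, c)   # least-squares amplitude for this b
        sse = np.sum((y - a * c) ** 2)
        if best is None or sse < best[0]:
            best = (sse, a, b)
    return best[1], best[2]

# Hypothetical search range for the angular frequency.
a0, b0 = scan_start(t, pos, np.linspace(0.5, 20.0, 400))
popt, pcov = curve_fit(f, t, pos, p0=(a0, b0))
a_fit, b_fit = popt
print(a0, b0, a_fit, b_fit)
```

The grid only needs to be fine enough to land inside the basin around the true frequency; the local optimizer does the rest.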

Problems with numpy polyfit

For some reason, my polyfit is way way off, and I cannot figure out why that is. My scatter plot seems normal.
Scatter Plot
PolyFit Plot
How can I fix this? here is my code:
def plot(data, x_axis, y_axis, title):
    x = data[0]
    y = data[1]
    ## Plot data
    plt.figure(figsize=(8,4))
    plt.scatter(x, y)
    idx = np.isfinite(x) & np.isfinite(y)
    plt.plot(np.poly1d(np.polyfit(x[idx], y[idx], 3)))
    ## Format graph
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    plt.gca().xaxis.set_major_locator(mdates.YearLocator(3))
    plt.gcf().autofmt_xdate()
    ## Define labels
    plt.xlabel(x_axis)
    plt.ylabel(y_axis)
    plt.title(title)
    ## Graph data
    plt.show()
If I need to link my data, then I can. There's too much of it to post here.
Inspecting
print(x[idx])
print(y[idx])
Shows the correct values and nothing seems off.
x[idx] and y[idx] plot
EDIT:
I have figured out my solution. I was not using polyfit correctly.
idx = np.isfinite(x) & np.isfinite(y)
avgTrend = np.poly1d(np.polyfit(x[idx], y[idx], 3))
plt.plot(x, avgTrend(x), color='red')
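To make the difference concrete, here is a small sketch on made-up data (the arrays are assumptions). A poly1d object converts to an array of its coefficients, so handing it directly to a plotting call gives only degree + 1 values; calling it on x gives one fitted value per data point.

```python
import numpy as np

# Hypothetical data lying exactly on a cubic.
x = np.linspace(0, 10, 50)
y = 0.5 * x**3 - 2.0 * x + 1.0

trend = np.poly1d(np.polyfit(x, y, 3))

coeffs_as_array = np.asarray(trend)  # what an array conversion of the object yields
fitted = trend(x)                    # the polynomial evaluated at every x

print(coeffs_as_array.shape, fitted.shape)
```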
The problem seems to be with the degree of the polynomial. For so many data points it may be simply impossible to fit a good degree 3 polynomial. You could try a higher degree (unlikely to work in the way you want it) or you can try a spline function.
For example you could try the csaps package that implements smoothing splines and that I can recommend.
Hope this helps
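As a sketch of the spline suggestion, the following uses scipy.interpolate.UnivariateSpline instead of csaps, purely because its interface is widely available; it is a substitute and not the package recommended above. The noisy sine data and the smoothing factor s are assumptions for the example.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Hypothetical data: two periods of a sine with noise.
rng = np.random.default_rng(2)
x = np.linspace(0, 4 * np.pi, 200)
truth = np.sin(x)
sigma = 0.1
y = truth + rng.normal(0, sigma, x.size)

# Degree-3 polynomial versus a cubic smoothing spline.
cubic = np.poly1d(np.polyfit(x, y, 3))
# s bounds the sum of squared residuals; n * sigma**2 is a common starting choice.
spline = UnivariateSpline(x, y, s=x.size * sigma**2)

poly_rmse = np.sqrt(np.mean((cubic(x) - truth) ** 2))
spline_rmse = np.sqrt(np.mean((spline(x) - truth) ** 2))
print(poly_rmse, spline_rmse)
```

A single cubic cannot follow several oscillations, while the spline adds knots until the residual target is met, so it tracks the underlying shape more closely.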

Interpolating using a cubic function gives a negative value for probability

I have a set of data which correspond to ages (in steps of 0.1) along the x axis, and probabilities along the y axis. I'm trying to interpolate the data so I can find the maximum and a range of ages which covers 95% of the probability.
I've tried a simple interpolation using the code below, taken from the SciPy help pages, and it produces good results (I change the x and y variables to read my data), except for one feature.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
x = np.linspace(72, 100, num=29, endpoint=True)
y = df.iloc[:,0].values  # df is the DataFrame holding the probabilities
f = interp1d(x, y)
f2 = interp1d(x, y, kind='cubic')
# xnew must stay inside the range of x, otherwise interp1d raises a ValueError
xnew = np.linspace(72, 100, num=281, endpoint=True)
plt.plot(x, y, 'o', xnew, f(xnew), '-', xnew, f2(xnew), '--')
plt.legend(['data', 'linear', 'cubic'], loc='best')
plt.show()
The problem is, the cubic function works best, with the smoothest fit. However, it gives negative values for some parts of the probability curve, which is obviously not acceptable. Is there some way of setting a floor at y=0? I thought maybe switching to a quadratic kind would fix it, but it doesn't seem to. The linear fit does, but it's not smoothed, so is not a very good match.
I'm also not sure how to perform the second part of what I'm trying to do. It's probably very simple, but I don't know how to find the mean when I don't have a frequency table, but a grid of interpolated points which form a function. If I knew the function, I could integrate it, but I'm not sure how to do that in Python.
EDIT to include some data:
This is what my y data looks like:
array([3.41528917e-08, 7.81041275e-05, 9.60711716e-04, 5.75868934e-05,
6.50260297e-05, 2.95556411e-05, 2.37331370e-05, 9.11990619e-05,
1.08003254e-04, 4.16800419e-05, 6.63673113e-05, 2.57934035e-04,
3.42235937e-03, 5.07534495e-03, 1.76603165e-02, 1.69535370e-01,
2.67624254e-01, 4.29420872e-01, 8.25165926e-02, 2.08367339e-02,
2.01227453e-03, 1.15405995e-04, 5.40163098e-07, 1.66905537e-10,
8.31862858e-18, 4.14093219e-23, 8.32103362e-29, 5.65637769e-34,
7.93547444e-40])
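One possible approach, sketched here under assumptions, is to swap the cubic interpolant for scipy.interpolate.PchipInterpolator. PCHIP is a shape-preserving cubic that does not overshoot the data, so with non-negative inputs it should stay non-negative. The 95% range below is taken as the central interval between the 2.5% and 97.5% points of the cumulative curve, which is only one reading of the requirement, and the 0.1 grid spacing is an assumption.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

x = np.linspace(72, 100, num=29)
y = np.array([3.41528917e-08, 7.81041275e-05, 9.60711716e-04, 5.75868934e-05,
              6.50260297e-05, 2.95556411e-05, 2.37331370e-05, 9.11990619e-05,
              1.08003254e-04, 4.16800419e-05, 6.63673113e-05, 2.57934035e-04,
              3.42235937e-03, 5.07534495e-03, 1.76603165e-02, 1.69535370e-01,
              2.67624254e-01, 4.29420872e-01, 8.25165926e-02, 2.08367339e-02,
              2.01227453e-03, 1.15405995e-04, 5.40163098e-07, 1.66905537e-10,
              8.31862858e-18, 4.14093219e-23, 8.32103362e-29, 5.65637769e-34,
              7.93547444e-40])

# Shape-preserving interpolation on a finer grid (0.1 spacing assumed).
xnew = np.linspace(72, 100, num=281)
p = PchipInterpolator(x, y)(xnew)

# Trapezoid-rule cumulative integral, normalised to end at 1.
dx = np.diff(xnew)
areas = 0.5 * (p[1:] + p[:-1]) * dx
total = areas.sum()
cdf = np.concatenate([[0.0], np.cumsum(areas)]) / total

mode = xnew[np.argmax(p)]                 # age with the highest probability
lower = np.interp(0.025, cdf, xnew)       # invert the CDF at 2.5%
upper = np.interp(0.975, cdf, xnew)       # and at 97.5%
mean = np.sum(0.5 * (xnew[1:] * p[1:] + xnew[:-1] * p[:-1]) * dx) / total
print(mode, lower, upper, mean)
```

The mean is the integral of x times the density divided by the total area, computed with the same trapezoid rule, so no frequency table is needed.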

Linear Regression algorithm works with one data-set but not on another, similar data-set. Why?

I created a linear regression algorithm following a tutorial and applied it to the data-set provided and it works fine. However the same algorithm does not work on another similar data-set. Can somebody tell me why this happens?
def computeCost(X, y, theta):
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))
def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))
    params = int(theta.ravel().shape[1])
    cost = np.zeros(iters)
    for i in range(iters):
        err = (X * theta.T) - y
        for j in range(params):
            term = np.multiply(err, X[:,j])
            temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))
        theta = temp
        cost[i] = computeCost(X, y, theta)
    return theta, cost
alpha = 0.01
iters = 1000
g, cost = gradientDescent(X, y, theta, alpha, iters)
print(g)
On running the algo through this dataset I get the output as matrix([[ nan, nan]]) and the following errors:
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: RuntimeWarning: overflow encountered in power
from ipykernel import kernelapp as app
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:11: RuntimeWarning: invalid value encountered in double_scalars
However this data set works just fine and outputs matrix([[-3.24140214, 1.1272942 ]])
Both the datasets are similar, I have been over it many times but can't seem to figure out why it works on one dataset but not on other. Any help is welcome.
Edit: Thanks Mark_M for editing tips :-)
[Much better question, btw]
It's hard to know exactly what's going on here, but basically your cost is going the wrong direction and spiraling out of control, which results in an overflow when you try to square the value.
I think in your case it boils down to your step size (alpha) being too big, which can cause gradient descent to overshoot and go the wrong way. You need to watch the cost during gradient descent and make sure it's always going down; if it's not, either something is broken or alpha is too large.
Personally, I would reevaluate the code and try to get rid of the loops. It's a matter of preference, but I find it easier to work with X and Y as column vectors. Here is a minimal example:
import numpy as np
from numpy import genfromtxt
# this is your 'bad' data set from github
my_data = genfromtxt('testdata.csv', delimiter=',')
def computeCost(X, y, theta):
    inner = np.power(((X @ theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))
def gradientDescent(X, y, theta, alpha, iters):
    for i in range(iters):
        # you don't need the extra loop - this can be vectorized,
        # making it much faster and simpler
        theta = theta - (alpha/len(X)) * np.sum((X @ theta.T - y) * X, axis=0)
        cost = computeCost(X, y, theta)
        if i % 10 == 0: # just look at cost every ten loops for debugging
            print(cost)
    return (theta, cost)
# notice small alpha value
alpha = 0.0001
iters = 100
# here X holds the features as columns
X = my_data[:, 0].reshape(-1,1)
ones = np.ones([X.shape[0], 1])
X = np.hstack([ones, X])
# theta is a row vector
theta = np.array([[1.0, 1.0]])
# y is a column vector
y = my_data[:, 1].reshape(-1,1)
g, cost = gradientDescent(X, y, theta, alpha, iters)
print(g, cost)
Another useful technique is to normalize your data before doing regression. This is especially useful when you have more than one feature in the model.
As a side note - if your step size is right you shouldn't get overflows no matter how many iterations you do, because the cost will decrease with every iteration and the rate of decrease will slow.
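A minimal sketch of both ideas, on made-up data (the feature range, noise level, alpha, and the function name gradient_descent are assumptions): the feature is standardized before gradient descent, the cost is recorded at every step so it can be checked, and the parameters are mapped back to the original units at the end.

```python
import numpy as np

# Hypothetical data with a large-scale feature.
rng = np.random.default_rng(0)
x_raw = rng.uniform(0, 100, size=200)
y = 3.0 + 1.5 * x_raw + rng.normal(0, 5, size=200)

# Standardize the feature to zero mean and unit variance.
mu, sigma = x_raw.mean(), x_raw.std()
x_std = (x_raw - mu) / sigma
X = np.column_stack([np.ones_like(x_std), x_std])

def gradient_descent(X, y, alpha, iters):
    theta = np.zeros(X.shape[1])
    costs = []
    for _ in range(iters):
        err = X @ theta - y
        theta = theta - (alpha / len(y)) * (X.T @ err)
        costs.append(np.sum((X @ theta - y) ** 2) / (2 * len(y)))
    return theta, np.array(costs)

theta, costs = gradient_descent(X, y, alpha=0.1, iters=500)

# Convert back to a slope and intercept in the original units of x.
slope = theta[1] / sigma
intercept = theta[0] - theta[1] * mu / sigma
print(slope, intercept, costs[0], costs[-1])
```

With the raw feature, the same alpha of 0.1 would be far too large for values of this size; after standardizing, a much larger step is stable and the cost falls steadily.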
After 1000 iterations I arrived at a theta and cost of:
[[ 1.03533399 1.45914293]] 56.041973778
after 100:
[[ 1.01166889 1.45960806]] 56.0481988054
You can use this to look at the fit in an iPython notebook:
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(my_data[:, 0].reshape(-1,1), y)
axes = plt.gca()
x_vals = np.array(axes.get_xlim())
y_vals = g[0][0] + g[0][1]* x_vals
plt.plot(x_vals, y_vals, '--')
