I have data that looks like a sigmoid curve, but mirrored about a vertical line.
The plot comes from 1D data rather than from a function.
My goal is to find the x value at which y is 50%. As you can see, there is no data point where y is exactly 50%.
Interpolation comes to mind, but I'm not sure whether it lets me find the x value at which y is 50%. So my questions are: 1) can interpolation be used to find x when y is 50%, or 2) do I need to fit the data to some sort of function?
Below is what I currently have in my code:
import numpy as np
import matplotlib.pyplot as plt
my_x = [4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66]
my_y_raw=np.array([0.99470977497817203, 0.99434995886145172, 0.98974611323163653, 0.961630837657524, 0.99327633558441175, 0.99338952769251909, 0.99428263292577534, 0.98690514212711611, 0.99111667721533181, 0.99149418924880861, 0.99133773062680464, 0.99143506380003499, 0.99151080464011454, 0.99268261743308517, 0.99289757252812316, 0.99100207861144063, 0.99157171773324027, 0.99112571824824358, 0.99031608691035722, 0.98978104266076905, 0.989782674787969, 0.98897835092187614, 0.98517540405423909, 0.98308943666187076, 0.96081810781994603, 0.85563541881892147, 0.61570811548079107, 0.33076276040577052, 0.14655134838124245, 0.076853147122142126, 0.035831324928136087, 0.021344669212790181])
my_y=my_y_raw/np.max(my_y_raw)
plt.plot(my_x, my_y,color='k', markersize=40)
plt.scatter(my_x,my_y,marker='*',label="myplot", color='k', edgecolor='k', linewidth=1,facecolors='none',s=50)
plt.legend(loc="lower left")
plt.xlim([4,102])
plt.show()
Using SciPy
The most straightforward way to do the interpolation is to use SciPy's interpolate.interp1d function. SciPy is closely related to NumPy, and you may already have it installed. The advantage of interp1d is that it can sort the data for you, at the cost of somewhat funky syntax. Many interpolation functions assume that you are interpolating a y value from an x value, and they generally need the "x" values to be monotonically increasing. In your case, we swap the usual roles of x and y. The y values contain an outlier, as @Abhishek Mishra has pointed out. With your data you are lucky: the outlier lies far from y = 0.5, so you can get away with leaving it in.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
my_x = [4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,
48,50,52,54,56,58,60,62,64,66]
my_y_raw=np.array([0.99470977497817203, 0.99434995886145172,
0.98974611323163653, 0.961630837657524, 0.99327633558441175,
0.99338952769251909, 0.99428263292577534, 0.98690514212711611,
0.99111667721533181, 0.99149418924880861, 0.99133773062680464,
0.99143506380003499, 0.99151080464011454, 0.99268261743308517,
0.99289757252812316, 0.99100207861144063, 0.99157171773324027,
0.99112571824824358, 0.99031608691035722, 0.98978104266076905,
0.989782674787969, 0.98897835092187614, 0.98517540405423909,
0.98308943666187076, 0.96081810781994603, 0.85563541881892147,
0.61570811548079107, 0.33076276040577052, 0.14655134838124245,
0.076853147122142126, 0.035831324928136087, 0.021344669212790181])
# assume_sorted=False makes interp1d sort the (y, x) pairs by y before interpolating
f = interp1d(my_y_raw, my_x, assume_sorted = False)
xnew = f(0.5)
print('interpolated value is ', xnew)
plt.plot(my_x, my_y_raw,'x-', markersize=10)
plt.plot(xnew, 0.5, 'x', color = 'r', markersize=20)
plt.plot((0, xnew), (0.5,0.5), ':')
plt.grid(True)
plt.show()
which gives
interpolated value is 56.81214249272691
Using NumPy
NumPy also has an interp function, but it doesn't do the sort for you. And if you don't sort, you'll be sorry. The documentation warns:
Does not check that the x-coordinate sequence xp is increasing. If xp
is not increasing, the results are nonsense.
The only way I could get np.interp to work was to shove the data in to a structured array.
import numpy as np
import matplotlib.pyplot as plt
my_x = np.array([4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,
48,50,52,54,56,58,60,62,64,66], dtype=float)
my_y_raw=np.array([0.99470977497817203, 0.99434995886145172,
0.98974611323163653, 0.961630837657524, 0.99327633558441175,
0.99338952769251909, 0.99428263292577534, 0.98690514212711611,
0.99111667721533181, 0.99149418924880861, 0.99133773062680464,
0.99143506380003499, 0.99151080464011454, 0.99268261743308517,
0.99289757252812316, 0.99100207861144063, 0.99157171773324027,
0.99112571824824358, 0.99031608691035722, 0.98978104266076905,
0.989782674787969, 0.98897835092187614, 0.98517540405423909,
0.98308943666187076, 0.96081810781994603, 0.85563541881892147,
0.61570811548079107, 0.33076276040577052, 0.14655134838124245,
0.076853147122142126, 0.035831324928136087, 0.021344669212790181],
dtype=float)
# np.float was removed in NumPy 1.24; the builtin float is equivalent
dt = np.dtype([('x', float), ('y', float)])
data = np.zeros(len(my_x), dtype=dt)
data['x'] = my_x
data['y'] = my_y_raw
data.sort(order = 'y') # sort data in place by y values
print('numpy interp gives ', np.interp(0.5, data['y'], data['x']))
which gives
numpy interp gives 56.81214249272691
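As a hedged alternative to the structured array, sorting both arrays with the index order from np.argsort should give the same result with less ceremony. This is a minimal sketch that reuses the data from the question; the names order and x_half are only illustrative.

```python
import numpy as np

# x runs from 4 to 66 in steps of 2, as in the question
my_x = np.arange(4, 68, 2, dtype=float)
my_y_raw = np.array([
    0.99470977497817203, 0.99434995886145172, 0.98974611323163653,
    0.961630837657524, 0.99327633558441175, 0.99338952769251909,
    0.99428263292577534, 0.98690514212711611, 0.99111667721533181,
    0.99149418924880861, 0.99133773062680464, 0.99143506380003499,
    0.99151080464011454, 0.99268261743308517, 0.99289757252812316,
    0.99100207861144063, 0.99157171773324027, 0.99112571824824358,
    0.99031608691035722, 0.98978104266076905, 0.989782674787969,
    0.98897835092187614, 0.98517540405423909, 0.98308943666187076,
    0.96081810781994603, 0.85563541881892147, 0.61570811548079107,
    0.33076276040577052, 0.14655134838124245, 0.076853147122142126,
    0.035831324928136087, 0.021344669212790181])

# Indices that put the y values in increasing order
order = np.argsort(my_y_raw)

# np.interp needs increasing sample points, so pass the sorted y values
# as the "x" argument and the matching x values as the "y" argument
x_half = np.interp(0.5, my_y_raw[order], my_x[order])
print('argsort + np.interp gives ', x_half)
```

The result should agree with the two values printed above, since all three approaches interpolate linearly between the same pair of neighbouring points.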
As you said, your data looks like a flipped sigmoid. Can we assume that the underlying function is strictly decreasing? If so, we can try the following:
Remove all the points where the data is not strictly decreasing. For your data, the offending points are near the start of the x range.
Use binary search to find the position where y=0.5 would be inserted.
Now you know the two (x, y) pairs between which your desired y=0.5 lies.
You can use simple linear interpolation if the two (x, y) pairs are very close.
Otherwise, you can look at how a sigmoid is approximated near those pairs.
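The steps above can be sketched roughly as follows. This is only one possible reading of them: the helper names strictly_decreasing_mask and crossing_by_bisection are made up for illustration, the data is a shortened, rounded excerpt of the question's values, and the filter is a simple greedy one that keeps a point only if it lies below every point kept before it.

```python
import numpy as np

def strictly_decreasing_mask(y):
    """Greedy filter: keep a point only if it is below all earlier kept points."""
    keep = np.zeros(len(y), dtype=bool)
    current_min = np.inf
    for i, value in enumerate(y):
        if value < current_min:
            keep[i] = True
            current_min = value
    return keep

def crossing_by_bisection(x, y, level=0.5):
    """Locate the bracketing pair by binary search, then interpolate linearly."""
    mask = strictly_decreasing_mask(y)
    # np.searchsorted (a binary search) needs ascending data, so reverse
    xs_asc = x[mask][::-1]
    ys_asc = y[mask][::-1]
    j = np.searchsorted(ys_asc, level)  # ys_asc[j-1] < level <= ys_asc[j]
    if j == 0 or j == len(ys_asc):
        raise ValueError("level is outside the range of the data")
    x_lo, x_hi = xs_asc[j - 1], xs_asc[j]
    y_lo, y_hi = ys_asc[j - 1], ys_asc[j]
    return x_lo + (level - y_lo) * (x_hi - x_lo) / (y_hi - y_lo)

# Shortened, rounded excerpt of the question's data (illustration only)
x = np.array([8, 10, 12, 50, 52, 54, 56, 58, 60, 62], dtype=float)
y = np.array([0.9897, 0.9616, 0.9933, 0.9831, 0.9608,
              0.8556, 0.6157, 0.3308, 0.1466, 0.0769])

x_half = crossing_by_bisection(x, y)
print("x at y = 0.5 is approximately", x_half)
```

Note that the greedy filter keeps the dip at x = 10 and drops its neighbours instead. That does not matter here because the 50% crossing is far away, but a different filter may suit other data better.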
You might not need to fit any function to your data. Simply find the following two elements:
The smallest x for which y<50%
The largest x for which y>50%
Then interpolate between them to find x*. Below is the code:
import numpy as np
my_x = np.array([4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66])
my_y=np.array([0.99470977497817203, 0.99434995886145172, 0.98974611323163653, 0.961630837657524, 0.99327633558441175, 0.99338952769251909, 0.99428263292577534, 0.98690514212711611, 0.99111667721533181, 0.99149418924880861, 0.99133773062680464, 0.99143506380003499, 0.99151080464011454, 0.99268261743308517, 0.99289757252812316, 0.99100207861144063, 0.99157171773324027, 0.99112571824824358, 0.99031608691035722, 0.98978104266076905, 0.989782674787969, 0.98897835092187614, 0.98517540405423909, 0.98308943666187076, 0.96081810781994603, 0.85563541881892147, 0.61570811548079107, 0.33076276040577052, 0.14655134838124245, 0.076853147122142126, 0.035831324928136087, 0.021344669212790181])
tempInd1 = my_y<.5 # This will only work if the values are monotonic
x1 = my_x[tempInd1][0]
y1 = my_y[tempInd1][0]
x2 = my_x[~tempInd1][-1]
y2 = my_y[~tempInd1][-1]
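# scipy.interp was a deprecated alias of np.interp; call NumPy directly.
# np.interp needs increasing sample points, which holds here because y1 < 0.5 <= y2.
print(np.interp(0.5, [y1, y2], [x1, x2]))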
I am trying to fit a function which looks like log(y)=a*log(b-x)+c, where a, b and c are the parameters that need to be fitted. The relevant bit of code is
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
def logfunc(T, a, b, c):
    v=(a*np.log(b-T))+c
    return v
popt, pcov=curve_fit(logfunc, T, np.log(Energy), check_finite=False, bounds=([0.1, 1.8, 0.1], [1.0, 2.6, 1.0]))
plt.plot(T, logfunc(T, *popt))
plt.show()
Here T and Energy are data that were generated earlier (I use them to plot other things, so the data should be fine). T is between 0.3 and 3.2. I am pretty sure the problem is that there is a point where b=T, because I keep getting the error ValueError: Residuals are not finite in the initial point. but I am not sure how to solve this.
You may find the lmfit package (http://lmfit.github.io/lmfit-py/) useful for this sort of problem. It provides a higher-level approach to curve-fitting problems and a better abstraction of Parameters and Models than the scipy.optimize package or its curve_fit() function.
For the problem here, two important features of lmfit are
the ability to set bounds on variables. curve_fit() can do this as well, but only by working with ordered lists of min/max bounds. With lmfit, the bounds belong to Parameter objects.
having a way to explicitly set a policy for handling NaN values, which could definitely cause problems for your fit.
With lmfit, your script would be written approximately as
import numpy as np
import matplotlib.pyplot as plt
from lmfit import Model
def logfunc(T, a, b, c):
    return (a*np.log(b-T))+c
log_model = Model(logfunc, nan_policy='raise') # raise error on NaNs
params = log_model.make_params(a=0.5, b=2.0, c=0.5) # initial values
params['b'].min = 1.8 # set min/max values
params['b'].max = 2.6
params['c'].min = 0.1 # and so forth
result = log_model.fit(np.log(Energy), params, T=T)
print(result.fit_report())
plt.plot(T, Energy, 'bo', label='data')
plt.plot(T, np.exp(result.best_fit), 'r--', label='fit')
plt.legend()
plt.xlabel('T')
plt.ylabel('Energy')
plt.gca().set_yscale('log', base=10)  # 'basey' was removed in Matplotlib 3.5; use 'base'
plt.show()
This is slightly more verbose than your starting script because it gives a labeled plot and because using Parameter objects instead of scalars gives more flexibility and clarity.
For your fit, you might consider setting the nan_policy to 'omit', which will omit NaNs as they occur -- never a great idea, but sometimes helpful to get you started on finding where log(b-T) is valid. You could also alter your model function to do something like
def logfunc(T, a, b, c):
    arg = b - T
    arg[np.where(arg < 1.e-16)] = 1.e-16  # clamp to a tiny positive value so the log stays finite
    return a*np.log(arg) + c
to explicitly prevent one obvious cause of NaNs.
Residuals are not finite in the initial point
means the initial point is bad, where some logarithms are infinite or undefined. You need a better initial point.
By the nature of the model, b has to be greater than any of the points in T. The bounds on b that you have at present do not guarantee that. Tighten them up.
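A quick check along these lines makes the problem visible. This is only a sketch: T here is a stand-in built with np.linspace over the range quoted in the question, not the real data, and the bounds on b are the ones from the question.

```python
import numpy as np

# Stand-in for the real T values (assumed range 0.3 to 3.2 from the question)
T = np.linspace(0.3, 3.2, 30)
b_lower, b_upper = 1.8, 2.6  # bounds on b used in the question

# Even the largest allowed b is below max(T), so b - T goes negative
print("max(T) =", T.max(), " largest allowed b =", b_upper)

# Suppress the NumPy warnings so we can simply count the bad values
with np.errstate(invalid="ignore", divide="ignore"):
    values = np.log(b_upper - T)

n_bad = np.count_nonzero(~np.isfinite(values))
print(n_bad, "of", T.size, "model values are not finite")
```

Any b inside the current bounds leaves some of the model values undefined, which is why the residuals cannot be finite at the starting point.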
When you do not provide the p0 parameter, SciPy will take a guess within the provided bounds. So if the bounds guarantee finiteness, the error will not occur.
Still, it is generally better to prescribe p0 yourself, because you have better a priori understanding of the problem than SciPy does.
A working example with adjusted bounds:
popt, pcov=curve_fit(logfunc, np.linspace(0.3, 3.2, 6), [8, 7, 6, 5, 4, 3], bounds=([0.1, 3.2, 0.1], [1.0, 3.6, 1.0]))
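Here is a slightly fuller sketch of the same idea that passes p0 explicitly and derives the lower bound on b from the data. The data is synthetic, generated from parameter values I picked arbitrarily (true_params), and the margin eps and the upper bound on b are assumptions, so treat the numbers as placeholders for your own.

```python
import numpy as np
from scipy.optimize import curve_fit

def logfunc(T, a, b, c):
    return a * np.log(b - T) + c

# Synthetic data from arbitrarily chosen parameters (illustration only)
rng = np.random.default_rng(0)
T = np.linspace(0.3, 3.2, 30)
true_params = (0.5, 3.5, 0.4)
log_energy = logfunc(T, *true_params) + rng.normal(scale=0.01, size=T.size)

# Keep b strictly above max(T) so that log(b - T) is always defined
eps = 1e-6
lower = [0.1, T.max() + eps, 0.1]
upper = [1.0, T.max() + 1.0, 1.0]

# Explicit starting point, chosen to lie inside the bounds
p0 = [0.3, T.max() + 0.5, 0.5]

popt, pcov = curve_fit(logfunc, T, log_energy, p0=p0, bounds=(lower, upper))
print("fitted a, b, c:", popt)

def sum_sq(params):
    """Sum of squared residuals for a given parameter set."""
    return np.sum((logfunc(T, *params) - log_energy) ** 2)
```

Tying the lower bound on b to T.max() means the bounds stay valid if the data changes, rather than relying on a hard-coded number.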
Recently I was working on some data for which I was able to obtain a curve using curve_fit. After saving the plot and the fitted values, I returned to the same code later, only to find that it no longer works.
#! python 3.5.2
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.optimize import curve_fit
data= np.array([
[24, 0.176644513],
[27, 0.146382841],
[30, 0.129891534],
[33, 0.105370908],
[38, 0.077820511],
[50, 0.047407538]])
x, y = np.array([]), np.array([])
for val in data:
    x = np.append(x, val[0])
    y = np.append(y, (val[1]/(1-val[1])))
def f(x, a, b):
    return (np.exp(-a*x)**b)
# The original a and b values obtained
a = -0.2 # after rounding
b = -0.32 # after rounding
plt.scatter(x, y)
Xcurve = np.linspace(x[0], x[-1], 500)
plt.plot(Xcurve, f(Xcurve,a,b), ls='--', color='k', lw=1)
plt.show()
# the original code to get the values
a = b = 1
popt, pcov = curve_fit(f, x, y, (a, b))
Whereas curve_fit previously returned the values a, b = -0.2, -0.32, it now returns:
Warning (from warnings module):
File "C:/Users ... line 22
return (np.exp(-a*x)**b)
RuntimeWarning: overflow encountered in exp
The code as far as I am aware did not change. Thanks
Without knowing what changed in the code, it is hard to say what changed between your state of "working" and "not working". It may be that changes in the version of scipy you used give different results: there have been changes to the underlying implementation in curve_fit() over the past few years.
But also: curve_fit() (and the underlying Python and Fortran code it uses) requires reasonably good initial guesses for the parameters for many problems to work at all. With bad guesses for the parameters, many problems will fail.
Exponential decay problems seem to be especially challenging for the Levenberg-Marquardt algorithm (and the implementations used by curve_fit()), and do require reasonable starting points. It's also easy to get into a part of parameter space where the function evaluates to zero, and changes in the parameter values have no effect.
If your problem involves exponential decay, it is helpful to work in log space where possible. That is, model log(f), not f itself. For your problem in particular, your model function is exp(-a*x)**b. Is that really what you mean? a and b will be exactly correlated, because exp(-a*x)**b equals exp(-a*b*x) and only the product a*b affects the result.
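To make both points concrete, here is a small sketch using the data from the question. It first checks that two different (a, b) pairs with the same product give the same curve, then fits the single combined rate in log space. The name k for the product a*b is my own, and the fit goes through the origin because the model as written has no amplitude factor.

```python
import numpy as np

x = np.array([24, 27, 30, 33, 38, 50], dtype=float)
p = np.array([0.176644513, 0.146382841, 0.129891534,
              0.105370908, 0.077820511, 0.047407538])
y = p / (1 - p)  # same transformation as in the question

def f(x, a, b):
    return np.exp(-a * x) ** b

# Two different (a, b) pairs with the same product a*b = 0.064
curve_1 = f(x, -0.2, -0.32)
curve_2 = f(x, 0.1, 0.64)
print("curves identical:", np.allclose(curve_1, curve_2))

# In log space the model is linear: log(y) = -k * x with k = a*b.
# Least squares for a line through the origin has a closed form.
k = -np.dot(x, np.log(y)) / np.dot(x, x)
print("fitted k = a*b:", k)
```

The fitted k should come out close to the product of the values reported in the question (-0.2 times -0.32), which suggests the earlier result was one arbitrary split of that product between a and b.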
In addition, you may find lmfit helpful. It has a Model class for curve-fitting, using similar underlying code, but allows fixing or setting bounds on any of the parameters. An example for your problem would be (approximately):
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.optimize import curve_fit
import lmfit
data= np.array([
[24, 0.176644513],
[27, 0.146382841],
[30, 0.129891534],
[33, 0.105370908],
[38, 0.077820511],
[50, 0.047407538]])
x, y = np.array([]), np.array([])
for val in data:
    x = np.append(x, val[0])
    y = np.append(y, (val[1]/(1-val[1])))
def f(x, a, b):
    print("In f: a, b = " , a, b)
    return (np.exp(-a*x)**b)
fmod = lmfit.Model(f)
params = fmod.make_params(a=-0.2, b=-0.4)
# set bounds on parameters
params['a'].min = -2
params['a'].max = 0
params['b'].vary = False
out = fmod.fit(y, params, x=x)
print(out.fit_report())
plt.plot(x, y)
plt.plot(x, out.best_fit, '--')
plt.show()