Julia: Plot residuals on a fitted model

I'm trying to figure out how to make a residual plot like this to show the deviation from the predicted results:
I found this question, which seems to answer it in Python, but I can't find how to do it in Julia, either in the Plots.jl docs or via Google.
Note: the data is not an issue here, I can handle that; it's the literal plotting API I'm struggling with.

Just to complement aramirezreyes's answer with this Julia discourse answer: you might want to avoid the loop if you have many points, because calling plot! repeatedly can be expensive and creates one series per bar. Instead, you can insert NaNs between the coordinates of each bar so that all bars form a single series. For example,
julia> x2 = repeat(points_x, inner=3);
julia> y2 = reduce(vcat, [yhat, y, NaN] for (yhat, y) in zip(points_x, points_y));  # fitted value, observed value, break
julia> p1 = plot(points_x, points_x)  # plotting the fit
julia> plot!(p1, x2, y2, lab="")  # all residual bars as one series
julia> scatter!(p1, points_x, points_y, c=2)  # plotting the points
will give you the same plot without the loop.

What are you struggling with in the API?
What have you tried so far?
julia> points_y = [x for x in 1:10] .+ rand(10)
10-element Vector{Float64}:
1.1165819028282722
2.986599717814377
3.5743557742882377
4.835499304548991
5.106506715905332
6.483859149461656
7.041461273394912
8.22314808617758
9.128892867863867
10.891435553032899
julia> points_x = 1:10
1:10
julia> p1 = plot(points_x, points_x) #Plotting the fit
julia> scatter!(p1, points_x, points_y) #Plotting the points
julia> for i in 1:length(points_x)
           plot!(p1, [points_x[i], points_x[i]], [points_x[i], points_y[i]], color=:black, legend=false)
       end
julia> p1
Produces this:
In the for loop, which I think is the part that interests you, I am just drawing lines between the point on the fit, (x, f(x)), and the scattered point, (x, y), so the syntax becomes plot!(p, [x, x], [f(x), y]). It's hard to go into more detail because you haven't specified which part of the process you're struggling with.
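Since the question mentions a Python answer, here is a minimal matplotlib sketch of the same idea for comparison; vlines draws all residual segments in a single call, much like the NaN trick in the other answer. The toy data mirrors the Julia example and is an assumption.

import numpy as np
import matplotlib.pyplot as plt

# Toy data mirroring the Julia example: a perfect fit y = x plus noise.
points_x = np.arange(1, 11)
points_y = points_x + np.random.rand(10)

fig, ax = plt.subplots()
ax.plot(points_x, points_x, label="fit")                # the fitted line
ax.vlines(points_x, points_x, points_y, color="black")  # all residual bars in one call
ax.scatter(points_x, points_y, zorder=3, label="data")
ax.legend()
plt.show()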

Related

Why does my fit for a logarithm function look so wrong?

I'm plotting this dataset and making a logarithmic fit, but for some reason the fit seems badly wrong. At some point I got a good enough fit, but then I re-plotted and the bad fit was back. At the very beginning the data contained the point 0.0 0.0076, but I changed it to 0.001 0.0076 to avoid the asymptote.
I'm using this for the fit (not exactly this one for the image above, but I'm now testing with this one and it gives the bad fit as well):
f(x) = a*log(k*x + b)
fit f(x) 'R_B/R_B.txt' via a, k, b
And the output is this
Also, sometimes it reports 7 iterations, as in the screenshot above, other times only 1; and when it produced the "correct" fit it took something like 35 iterations and got a = 32, if I remember correctly.
Edit: here is the good one again; the plot I got is this one. And again, I re-plotted and got that weird fit. Curiously, when the data still contains the 0.0 0.0076 point and the good fit is about to appear, gnuplot says "Undefined value during function evaluation", but that message is not shown when I'm getting the bad one.
Do you know why I keep getting this inconsistency? Thanks for your help.
As I already mentioned in the comments, the method of fitting antiderivatives is much better than fitting derivatives, because numerically computed derivatives are strongly scattered even when the data is only slightly scattered.
The principle of the method of fitting an integral equation (obtained from the original equation to be fitted) is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales . The application to the case of y = a*ln(c*x + b) is shown below.
Numerical calculation:
In order to get an even better result (according to some specified criterion of fitting), one can use the above values of the parameters as initial values for an iterative method of nonlinear regression implemented in convenient software.
NOTE: The integral equation used in the present case is:
NOTE: In the figure above one can compare the result of the method of fitting an integral equation with the result of the method of fitting with derivatives.
Acknowledgements: Alex Sveshnikov did very good work in applying the method of regression with derivatives. This allows an interesting and enlightening comparison. If the goal is only to compute approximate values of the parameters to be used in nonlinear-regression software, both methods are quite equivalent. Nevertheless, the method with the integral equation appears preferable in the case of scattered data.
UPDATE (After Alex Sveshnikov updated his answer)
The figure below was drawn using Alex Sveshnikov's result followed by a further iterative method of fitting.
The two curves are almost indistinguishable. This shows that (in the present case) the method of fitting the integral equation is almost sufficient without further treatment.
Of course, it is not always this satisfactory; here it is thanks to the low scatter of the data.
In ADDITION, an answer to a question raised in the comments by CosmeticMichu:
The problem here is that the fit algorithm starts with "wrong" approximations for the parameters a, k, and b, so during the minimization it finds a local minimum, not the global one. You can improve the result if you provide the algorithm with starting values close to the optimal ones. For example, let's start with the following parameters:
gnuplot> a=47.5087
gnuplot> k=0.226
gnuplot> b=1.0016
gnuplot> f(x)=a*log(k*x+b)
gnuplot> fit f(x) 'R_B.txt' via a,k,b
....
....
....
After 40 iterations the fit converged.
final sum of squares of residuals : 16.2185
rel. change during last iteration : -7.6943e-06
degrees of freedom (FIT_NDF) : 18
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 0.949225
variance of residuals (reduced chisquare) = WSSR/ndf : 0.901027
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 35.0415 +/- 2.302 (6.57%)
k = 0.372381 +/- 0.0461 (12.38%)
b = 1.07012 +/- 0.02016 (1.884%)
correlation matrix of the fit parameters:
a k b
a 1.000
k -0.994 1.000
b 0.467 -0.531 1.000
The resulting plot is
Now the question is how you can find "good" initial approximations for your parameters. Well, you start with
y = a*log(k*x + b)
If you differentiate this equation you get
dy/dx = a*k/(k*x + b)
or
1/(a*k) = (1/(dy/dx)) * 1/(k*x + b)
The left-hand side of this equation is some constant 'C', so the expression on the right-hand side should be equal to this constant as well:
1/(dy/dx) = C*(k*x + b) = (C*k)*x + (C*b)
In other words, the reciprocal of the derivative of your data should be approximated by a linear function. So, from your data x[i], y[i] you can construct the reciprocal derivatives x[i], (x[i+1]-x[i])/(y[i+1]-y[i]) and make a linear fit of these data:
The fit gives the following values:
C*k = 0.0236179
C*b = 0.106268
Now we need to find the values of a and C. Let's say that we want the resulting graph to pass close to the starting and the ending point of our dataset. That means that we want
a*log(k*x1 + b) = y1
a*log(k*xn + b) = yn
Thus,
a*log((C*k*x1 + C*b)/C) = a*log(C*k*x1 + C*b) - a*log(C) = y1
a*log((C*k*xn + C*b)/C) = a*log(C*k*xn + C*b) - a*log(C) = yn
By subtracting the equations we get the value for a:
a = (yn-y1)/log((C*k*xn + C*b)/(C*k*x1 + C*b)) = 47.51
Then,
log(k*x1+b) = y1/a
k*x1+b = exp(y1/a)
C*k*x1+C*b = C*exp(y1/a)
From this we can calculate C:
C = (C*k*x1+C*b)/exp(y1/a)
and finally find the k and b:
k=0.226
b=1.0016
These are the values used above for finding the better fit.
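For readers who want to reproduce this outside gnuplot, here is a minimal numpy sketch of the same initial-guess procedure. The file name 'R_B.txt' and its two-column layout are taken from the question; pairing each difference quotient with x[i] is an assumption that mirrors the script below.

import numpy as np

# Load the data; two whitespace-separated columns (x, y) are assumed.
x, y = np.loadtxt("R_B.txt", unpack=True)

# Reciprocal finite-difference derivatives: dx/dy should be close to C*k*x + C*b.
recip = np.diff(x) / np.diff(y)
Ck, Cb = np.polyfit(x[:-1], recip, 1)  # slope = C*k, intercept = C*b

# Force the curve through the first and last data points to recover a, then C.
x1, y1, xn, yn = x[0], y[0], x[-1], y[-1]
a = (yn - y1) / np.log((Ck * xn + Cb) / (Ck * x1 + Cb))
C = (Ck * x1 + Cb) / np.exp(y1 / a)
k, b = Ck / C, Cb / C
print(a, k, b)  # starting values for the nonlinear fit

These values can then be fed to gnuplot (or any nonlinear-regression routine) as the starting point.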
UPDATE
You can automate the process described above with the following script:
# Name of the file with the data
data='R_B.txt'
# The coordinates of the last data point
xn=NaN
yn=NaN
# The temporary coordinates of a data point used to calculate a derivative
x0=NaN
y0=NaN
linearFit(x)=Ck*x+Cb
fit linearFit(x) data using (xn=$1,dx=$1-x0,x0=$1,$1):(yn=$2,dy=$2-y0,y0=$2,dx/dy) via Ck, Cb
# The coordinates of the first data point
x1=NaN
y1=NaN
plot data using (x1=$1):(y1=$2) every ::0::0
a=(yn-y1)/log((Ck*xn+Cb)/(Ck*x1+Cb))
C=(Ck*x1+Cb)/exp(y1/a)
k=Ck/C
b=Cb/C
f(x)=a*log(k*x+b)
fit f(x) data via a,k,b
plot data, f(x)
pause -1

Is there a logic error with my implementation of odeint?

I am getting the wrong solution and am unsure if odeint is the correct tool for solving this system of ODEs.
I am trying to model a simple first-order chemical reaction by solving a system of ODEs. From a logic standpoint my functions are correct, and I can solve this problem in MATLAB with little issue. I would like to be able to do this work in Python as well. I think odeint is the tool for the job, but I could be wrong. My solution should not converge at independent variable = 10 every time, but it always does, regardless of inputs.
from matplotlib.pyplot import (plot,grid,xlabel,ylabel,show,legend)
import numpy as np
from scipy.integrate import odeint
wght= np.linspace(0,20)
# reaction is A -> B
def PBR(fun, W):
    X, y = fun
    P_0 = 20     # bar
    v_0 = 5      # m^3/min
    y_A0 = 1     # unitless
    k = .005     # m^3/kg/min
    alpha = 0.1  # 1/kg
    epi = .13    # unitless
    R = 8.314    # J/mol/K
    F_A0 = .5    # mol/min
    ra = -k*y*(1 - X)/(1 + epi*X)
    dX = (-ra)/F_A0
    dy = -alpha*(1 + epi*X)/(2*y)
    return [dX, dy]
X0 = 0.0
y0 = 1.0
sol = odeint(PBR, [X0, y0],wght)
plot(wght, sol[:, 0], 'b', label='X')
plot(wght, sol[:, 1], 'g', label='y')
legend(loc='best')
xlabel('W')
grid()
show()
print(sol)
Output graph
Your graphic is not reproducible: in Python 3.7 with SciPy 1.4.1, I get a divergence to practically infinity at around 12.5, and after cleaning the workspace, both graphs move to zero after time 10.
The problem is your division by 2*y in the second equation. In effect that means that y is the square root of some other function, and that type of function behaves badly numerically at small values, as the tangent becomes vertical before the function drops to zero shortly after.
However, one can desingularize the division in a simple way
dy = -alpha*(1+epi*X)*y/(1e-10+2*y*y)
or in a more complicated way that leaves the solution far away from zero really unchanged
dy = -alpha*(1+epi*X)*y/(y*y+max(1e-12,y*y))
resulting in the graph
If that is still not the expected behavior, then something else is different in your ODE system relative to the Matlab version.
That the singular point is at about w=10 depends almost completely on alpha=0.1. As epi is small and X is initially small, the second equation is close to
d(y^2)/dW = -alpha ==> y^2 = 1 - alpha*W
and this hits zero at about W=10, where the solution has to end, since the square of y cannot take negative values.
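For completeness, a minimal runnable version of the question's script with the desingularized right-hand side (constants copied from the question) might look like this:

import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt

def PBR(fun, W):
    X, y = fun
    k, alpha, epi, F_A0 = 0.005, 0.1, 0.13, 0.5  # constants from the question
    ra = -k * y * (1 - X) / (1 + epi * X)
    dX = -ra / F_A0
    # Desingularized form of dy = -alpha*(1 + epi*X)/(2*y).
    dy = -alpha * (1 + epi * X) * y / (1e-10 + 2 * y * y)
    return [dX, dy]

wght = np.linspace(0, 20)
sol = odeint(PBR, [0.0, 1.0], wght)
plt.plot(wght, sol[:, 0], label="X")
plt.plot(wght, sol[:, 1], label="y")
plt.xlabel("W")
plt.legend()
plt.show()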

Finding the centre of multiple lines using least squares approach in Python

I have a series of lines which roughly (but not exactly) intersect at some point.
I need to find the point which minimises the distance between each line in the centre. I have been trying to follow this methodology:
Nearest point to intersecting lines in 2D
When I create my script in Python to perform this function I get the incorrect answer:
Here is my code, I was wondering if anyone could suggest what I am doing wrong? Or an easier way of going about this. Each line is defined by two points x1 and x2.
def directionalv(x1, x2):
    point1 = np.array(x1)  # point1 and point2 define my line
    point2 = np.array(x2)
    ortho = np.array([[0, -1], [1, 0]])  # see wikipedia article
    subtract = point2 - point1
    length = np.linalg.norm(subtract)
    fraction = np.divide(subtract, length)
    n1 = ortho.dot(fraction)
    num1 = n1.dot(n1.transpose())
    num = num1*(point1)
    denom = n1.dot(n1.transpose())
    return [num, denom]
n1l1=directionalv(x1,x2)
n1l2=directionalv(x3,x4)
n1l3=directionalv(x5,x6)
n1l4=directionalv(x7,x8)
n1l5=directionalv(x9,x10)
numerall=n1l1[0]+n1l2[0]+n1l3[0]+n1l4[0]+n1l5[0] #sum of (n.n^t)pi from wikipedia article
denomall=n1l1[1]+n1l2[1]+n1l3[1]+n1l4[1]+n1l5[1] #sum of n.n^t
point=(numerall/denomall)
My points are as follows
Line 1: x1 = [615, 396], x2 = [616, 880]
Line 2: x3 = [799, 449], x4 = [449, 799]
Line 3: x5 = [396, 637], x6 = [880, 636]
Line 4: x7 = [618, 396], x8 = [618, 880]
Line 5: x9 = [483, 456], x10 = [777, 875]
Any help would be really appreciated!
Thank you for your time.
Could it simply be that in Python you should define the matrix as a list of rows, not columns (see: How to define two-dimensional array in python)? You would then define the ortho matrix like this:
ortho= np.array([[0,1],[-1,0]])
Otherwise, what does the following mean?
numerall=n1l1[0]+n1l2[0]+n1l3[0]+n1l4[0]+n1l5[0] #sum of (n.n^t)pi from wikipedia article
denomall=n1l1[1]+n1l2[1]+n1l3[1]+n1l4[1]+n1l5[1] #sum of n.n^t
point=(numerall/denomall)
I do not understand your use of the matrix transpose here, and the inverse of a matrix is not the same as a division.
Use an existing Python library like NumPy to do the computing instead of implementing it yourself. See: https://docs.scipy.org/doc/numpy-1.10.4/reference/generated/numpy.matrix.html
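As a sketch of that suggestion, the least-squares point from the linked Wikipedia method can be computed with numpy as below. Note that for 1-D numpy arrays n.dot(n.transpose()) collapses to a scalar inner product (transpose is a no-op on 1-D arrays), so np.outer is needed to form the 2x2 matrix n*n^T. The endpoints are the ones listed in the question.

import numpy as np

def unit_normal(p1, p2):
    # Unit vector orthogonal to the line through p1 and p2.
    d = np.asarray(p2, dtype=float) - np.asarray(p1, dtype=float)
    d /= np.linalg.norm(d)
    return np.array([-d[1], d[0]])

lines = [([615, 396], [616, 880]),
         ([799, 449], [449, 799]),
         ([396, 637], [880, 636]),
         ([618, 396], [618, 880]),
         ([483, 456], [777, 875])]

A = np.zeros((2, 2))
rhs = np.zeros(2)
for p1, p2 in lines:
    n = unit_normal(p1, p2)
    N = np.outer(n, n)  # the 2x2 projector n*n^T, not a scalar
    A += N
    rhs += N @ np.asarray(p1, dtype=float)

point = np.linalg.solve(A, rhs)  # solves (sum N_i) p = sum N_i p_i
print(point)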

Gnu Octave Quadratic Equation Graphic

I lost two days trying to do this, with no result. How can I plot a quadratic equation's parabola and roots, something like this? I just need to be able to see the parabola and that it crosses the abscissa at the right coordinates.
Here is what I have:
x = linspace(-50,50);
y = 1.*x.*x - 8.*x + 15;
plot(x,y)
hold on;
grid()
rts = roots([1,-8,15]);
plot(rts, zeros(size(rts)), 'o', "color", "r")
And the result is:
As you can see, the turning point of the parabola sits at ordinate 0 instead of below it. I will appreciate your help!
Using a smaller linspace range works fine for me:
x = linspace(1,6);
y = 1.*x.*x - 8.*x + 15;
plot(x,y)
hold on;
grid()
rts = roots([1,-8,15]);
plot(rts, zeros(size(rts)), 'o', "color", "r")
Christoph is right: the plot can be misleading because a large linspace range flattens the part of the curve below the abscissa, and if you don't think about it and make an error calculating the vertex, as I did, you are fried! Here is another solution, which I hope is right:
ezplot(@(x,y) x.^2 - x.*8 - y + 15)
hold on
grid on
rts = roots([1,-8,15]);
plot(rts,zeros(size(rts)),'o',"color","r");
line(xlim,[0 0], 'linestyle', '--')
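If you'd rather derive the plotting window than guess it, the vertex of a*x^2 + b*x + c sits at x = -b/(2*a). As an illustration outside Octave, a minimal numpy/matplotlib sketch could be:

import numpy as np
import matplotlib.pyplot as plt

a, b, c = 1, -8, 15
xv = -b / (2 * a)          # vertex abscissa: 4
rts = np.roots([a, b, c])  # roots: 3 and 5

# Window centred on the vertex and wide enough to show both roots.
x = np.linspace(xv - 3, xv + 3, 200)
plt.plot(x, a * x**2 + b * x + c)
plt.plot(rts, np.zeros_like(rts), "ro")
plt.grid(True)
plt.show()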

Linear Fit does not adjust b independently from a

I'm using the following gnuplot script to plot a linear fit:
#!/usr/bin/gnuplot
set term cairolatex
set output "linear_fit.tex"
c = 299792458.
x(x) = c / x
y(x) = x
h(x) = a * x + b
fit h(x) "linear_fit.dat" u (x($1)):(y($2)) via a,b
plot "linear_fit.dat" u (x($1)):(y($2)) w points title "", \
(h(x)) with lines linecolor rgb "black" title "Linear Fit"
However, after the iterations converge, b is always 1.0: https://dpaste.de/ozReq/
How can I get gnuplot to adjust b as well as a?
Update: Repeating the fit command a few hundred times with alternating via a/via b does give pretty good results, but that just can't be how it's supposed to be done.
Update 2: Here's the data in linear_fit.dat:
# lambda, V
360e-9 1.119
360e-9 1.148
360e-9 1.145
400e-9 0.949
400e-9 0.993
400e-9 0.971
440e-9 0.883
440e-9 0.875
440e-9 0.863
490e-9 0.737
490e-9 0.728
490e-9 0.755
540e-9 0.575
540e-9 0.571
540e-9 0.592
590e-9 0.457
590e-9 0.455
590e-9 0.482
I think your troubles stem from the fact that your x-values are very large (on the order of 10^14).
If you do not provide gnuplot with an initial guess for a and b, it will assume a=1 and b=1 as starting points for the fit. However, this is a poor initial guess:
Please note the log scale on both the x- and y-axis.
From the gnuplot documentation:
fit may, and often will get "lost" if started far from a solution, where SSR is large and changing slowly as the parameters are varied, or it may reach a numerically unstable region (e.g., too large a number causing a floating point overflow) which results in an "undefined value" message or gnuplot halting.
To improve the chances of finding the global optimum, you should set the starting values at least roughly in the vicinity of the solution, e.g., within an order of magnitude, if possible. The closer your starting values are to the solution, the less chance of stopping at another minimum. One way to find starting values is to plot data and the fitting function on the same graph and change parameter values and replot until reasonable similarity is reached. The same plot is also useful to check whether the fit stopped at a minimum with a poor fit.
In your case, such starting values could be:
a = 1e-15
b = -0.5
I obtained these values by eye-balling your range of values.
With those starting values, the linear fit results in:
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 1.97355e-015 +/- 6.237e-017 (3.161%)
b = -0.5 +/- 0.04153 (8.306%)
Which looks like this:
You can play with the control settings of fit (such as setting FIT_LIMIT = 1.e-35) or with the starting values to achieve a better fit than this.
EDIT
While I still have not been able to coax gnuplot into modifying both parameters a, b at the same time, I found an alternate approach using R. I am aware that there are many other (scripting) languages that can perform a linear fit and this question was about gnuplot. However, the required effort with R appeared to be minimal.
Here's an example, which, when saved as linear_fit.R and called with
R CMD BATCH linear_fit.R
will provide the two coefficients of the linear fit, that gnuplot failed to provide.
y <- c(1.119, 1.148, 1.145, 0.949, 0.993, 0.971, 0.883, 0.875, 0.863,
0.737, 0.728, 0.755, 0.575, 0.571, 0.592, 0.457, 0.455, 0.482)
x <- c(3.60E-007, 3.60E-007, 3.60E-007, 4.00E-007, 4.00E-007,
4.00E-007, 4.40E-007, 4.40E-007, 4.40E-007, 4.90E-007,
4.90E-007, 4.90E-007, 5.40E-007, 5.40E-007, 5.40E-007,
5.90E-007, 5.90E-007, 5.90E-007)
c = 299792458.
x <- c/x
lm.out <- lm(y ~ x)
svg("linear_fit.svg")
plot(x,y)
abline(lm.out,col="red")
summary(lm.out)
You will end up with an svg-file that contains the plot and a linear_fit.Rout text file. In there you'll find the following coefficients:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.429e-01 4.012e-02 -13.53 3.55e-10 ***
x 2.037e-15 6.026e-17 33.80 2.61e-16 ***
So, in the terminology of the original question, we obtain:
a = 2.037e-15
b = -5.429e-01
These values are very close to the values you quoted from alternating the fit.
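If R is not at hand, the same sanity check takes only a few lines of numpy (the data is transcribed from linear_fit.dat above):

import numpy as np

lam = np.repeat([360e-9, 400e-9, 440e-9, 490e-9, 540e-9, 590e-9], 3)
V = np.array([1.119, 1.148, 1.145, 0.949, 0.993, 0.971,
              0.883, 0.875, 0.863, 0.737, 0.728, 0.755,
              0.575, 0.571, 0.592, 0.457, 0.455, 0.482])

c = 299792458.0
x = c / lam  # frequencies, on the order of 10^14

a, b = np.polyfit(x, V, 1)  # least-squares line V = a*x + b
print(a, b)  # should land near a ~ 2.04e-15, b ~ -0.54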
In case the comments get purged, these questions were identified as related:
What is gnuplot's internal representation of floating point numbers?
Gnuplot behaves oddly in polynomial fit. Why is that?
