Linear regression library for Go language - statistics

I'm looking for a Go library that implements linear regression with MLE or LSE.
Has anyone seen one?
There is this stats library, but it doesn't seem to have what I need:
https://github.com/grd/statistics
Thanks!

Implementing an LSE (Least Squared Error) linear regression is fairly simple.
Here's an implementation in JavaScript - it should be trivial to port to Go.
Here's an (untested) port:
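package main

import "fmt"

type Point struct {
    X float64
    Y float64
}

// linearRegressionLSE fits y = m*x + b by least squares and returns
// the fitted point for every input X.
func linearRegressionLSE(series []Point) []Point {
    q := len(series)
    if q == 0 {
        return make([]Point, 0)
    }
    n := float64(q)
    sum_x, sum_y, sum_xx, sum_xy := 0.0, 0.0, 0.0, 0.0
    for _, p := range series {
        sum_x += p.X
        sum_y += p.Y
        sum_xx += p.X * p.X
        sum_xy += p.X * p.Y
    }
    // Note: the denominator is zero when all X values are identical.
    m := (n*sum_xy - sum_x*sum_y) / (n*sum_xx - sum_x*sum_x)
    b := (sum_y / n) - (m * sum_x / n)
    r := make([]Point, q)
    for i, p := range series {
        r[i] = Point{p.X, (p.X*m + b)}
    }
    return r
}

func main() {
    // Example call so that the "fmt" import is used and the file compiles.
    fmt.Println(linearRegressionLSE([]Point{{0, 1}, {1, 3}, {2, 5}}))
}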

I have implemented the following using gradient descent. It only returns the coefficients, but it takes any number of explanatory variables and is reasonably accurate:
package main

import "fmt"

func calc_ols_params(y []float64, x [][]float64, n_iterations int, alpha float64) []float64 {
    thetas := make([]float64, len(x))
    for i := 0; i < n_iterations; i++ {
        my_diffs := calc_diff(thetas, y, x)
        my_grad := calc_gradient(my_diffs, x)
        for j := 0; j < len(my_grad); j++ {
            thetas[j] += alpha * my_grad[j]
        }
    }
    return thetas
}

func calc_diff(thetas []float64, y []float64, x [][]float64) []float64 {
    diffs := make([]float64, len(y))
    for i := 0; i < len(y); i++ {
        prediction := 0.0
        for j := 0; j < len(thetas); j++ {
            prediction += thetas[j] * x[j][i]
        }
        diffs[i] = y[i] - prediction
    }
    return diffs
}

func calc_gradient(diffs []float64, x [][]float64) []float64 {
    gradient := make([]float64, len(x))
    for i := 0; i < len(diffs); i++ {
        for j := 0; j < len(x); j++ {
            gradient[j] += diffs[i] * x[j][i]
        }
    }
    for i := 0; i < len(x); i++ {
        gradient[i] = gradient[i] / float64(len(diffs))
    }
    return gradient
}

func main() {
    y := []float64{3, 4, 5, 6, 7}
    x := [][]float64{{1, 1, 1, 1, 1}, {4, 3, 2, 1, 3}}
    thetas := calc_ols_params(y, x, 100000, 0.001)
    fmt.Println("Thetas : ", thetas)
    y_2 := []float64{1, 2, 3, 4, 3, 4, 5, 4, 5, 5, 4, 5, 4, 5, 4, 5, 6, 5, 4, 5, 4, 3, 4}
    x_2 := [][]float64{{1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1},
        {4, 2, 3, 4, 5, 4, 5, 6, 7, 4, 8, 9, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 5},
        {4, 1, 2, 3, 4, 5, 6, 7, 5, 8, 7, 8, 7, 8, 7, 8, 7, 7, 7, 7, 7, 6, 5},
        {4, 1, 2, 5, 6, 7, 8, 9, 7, 8, 7, 8, 7, 7, 7, 7, 7, 7, 6, 6, 4, 4, 4}}
    thetas_2 := calc_ols_params(y_2, x_2, 100000, 0.001)
    fmt.Println("Thetas_2 : ", thetas_2)
}
Result:
Thetas : [6.999959251448524 -0.769216974483968]
Thetas_2 : [1.5694174539341945 -0.06169183063112409 0.2359981255871977 0.2424327101610395]
go playground
I checked my results against pandas in Python, and they were very close:
In [24]: from pandas.stats.api import ols
In [25]: df = pd.DataFrame(np.array(x).T, columns=['x1','x2','x3','y'])
In [26]: from pandas.stats.api import ols
In [27]: x = [
[4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
[4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
[4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
]
In [28]: y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]
In [29]: x.append(y)
In [30]: df = pd.DataFrame(np.array(x).T, columns=['x1','x2','x3','y'])
In [31]: ols(y=df['y'], x=df[['x1', 'x2', 'x3']])
Out[31]:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x1> + <x2> + <x3> + <intercept>
Number of Observations: 23
Number of Degrees of Freedom: 4
R-squared: 0.5348
Adj R-squared: 0.4614
Rmse: 0.8254
F-stat (3, 19): 7.2813, p-value: 0.0019
Degrees of Freedom: model 3, resid 19
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x1 -0.0618 0.1446 -0.43 0.6741 -0.3453 0.2217
x2 0.2360 0.1487 1.59 0.1290 -0.0554 0.5274
x3 0.2424 0.1394 1.74 0.0983 -0.0309 0.5156
intercept 1.5704 0.6331 2.48 0.0226 0.3296 2.8113
---------------------------------End of Summary---------------------------------
and
In [34]: df_1 = pd.DataFrame(np.array([[3,4,5,6,7], [4,3,2,1,3]]).T, columns=['y', 'x'])
In [35]: df_1
Out[35]:
y x
0 3 4
1 4 3
2 5 2
3 6 1
4 7 3
[5 rows x 2 columns]
In [36]: ols(y=df_1['y'], x=df_1['x'])
Out[36]:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 5
Number of Degrees of Freedom: 2
R-squared: 0.3077
Adj R-squared: 0.0769
Rmse: 1.5191
F-stat (1, 3): 1.3333, p-value: 0.3318
Degrees of Freedom: model 1, resid 3
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x -0.7692 0.6662 -1.15 0.3318 -2.0749 0.5365
intercept 7.0000 1.8605 3.76 0.0328 3.3534 10.6466
---------------------------------End of Summary---------------------------------
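The same check can be done without the older pandas release that ships `pandas.stats.api`. The following is only a sketch, not part of the original answer: it solves the normal equations directly in plain Python on the two data sets from the Go program. The function name `ols` and the variable names are illustrative choices, and the input layout (one list per explanatory variable, with an explicit column of ones for the intercept) is assumed to match `x` in the Go code.

```python
def ols(y, columns):
    """Solve the normal equations (X^T X) theta = X^T y by Gauss-Jordan elimination.

    `columns` holds one list per explanatory variable, like `x` in the Go code,
    so a column of ones must be included to obtain an intercept.
    A singular system raises ZeroDivisionError.
    """
    k = len(columns)
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(columns[i], y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting: move the row with the largest entry into place.
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


y_1 = [3, 4, 5, 6, 7]
x_1 = [[1, 1, 1, 1, 1], [4, 3, 2, 1, 3]]
theta_1 = ols(y_1, x_1)

y_2 = [1, 2, 3, 4, 3, 4, 5, 4, 5, 5, 4, 5, 4, 5, 4, 5, 6, 5, 4, 5, 4, 3, 4]
x_2 = [
    [1] * 23,
    [4, 2, 3, 4, 5, 4, 5, 6, 7, 4, 8, 9, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 5],
    [4, 1, 2, 3, 4, 5, 6, 7, 5, 8, 7, 8, 7, 8, 7, 8, 7, 7, 7, 7, 7, 6, 5],
    [4, 1, 2, 5, 6, 7, 8, 9, 7, 8, 7, 8, 7, 7, 7, 7, 7, 7, 6, 6, 4, 4, 4],
]
theta_2 = ols(y_2, x_2)

print("theta_1:", theta_1)  # intercept first, then the slope
print("theta_2:", theta_2)  # intercept first, then x1, x2, x3
```

Because this solves the system exactly rather than iterating, it should agree with the pandas coefficients up to rounding, which makes it a convenient reference when tuning `alpha` and `n_iterations` in the gradient descent version.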

There's a project called gostat which has a bayes package which should be able to do linear regressions.
Unfortunately the documentation is somewhat lacking, so you'll probably have to read the code to learn how to use it. I dabbled with it a bit myself but haven't touched the bayes package.

Related

Calculating the avg point of students from a txt file and return a dict

I'm quite new to Python and have been working on this problem for a week, but I still can't figure it out. Please help.
The txt input file looks like this (the first number in each line is the Student ID; Math, Physics, Chemistry and Biology each have 4 scores, the rest have 5, separated by ';'):
StudentID, Math, Physics, Chemistry, Biology, Literature, Language, History, Geography
1; 5,7,7,8;5,5,6,6;8,6,7,7;4,8,5,7;7,7,6,7,9;7,5,8,6,7;7,8,8,5,9;5,8,6,8,7
2; 8,6,8,6;5,5,8,4;4,9,9,7;4,9,3,4;6,7,7,7,4;8,9,6,7,5;5,7,7,9,6;6,6,4,4,7
3; 5,8,9,8;7,8,8,7;6,6,7,6;5,7,9,7;6,3,5,8,8;5,6,6,6,8;7,7,6,6,7;8,5,3,6,4
4; 7,9,9,8;7,9,7,6;10,7,6,7;7,9,8,7;6,8,8,5,7;8,6,6,4,8;7,5,8,6,7;7,6,8,6,8
5; 9,7,4,6;4,6,5,5;7,5,6,7;6,9,7,6;7,9,7,6,6;6,7,7,8,8;7,9,6,8,6;8,6,8,8,5
6; 6,7,7,7;4,6,9,7;5,5,7,7;7,6,5,7;7,9,7,8,7;8,7,7,8,9;9,9,8,8,9;8,7,9,7,5
Math, Physics, Chemistry and Biology have 4 weights, one for each score: 5%, 10%, 15%, 70%. For example, the avg point of Math for Student 1 = 5x5% + 7x10% + 7x15% + 8x70%
Literature, Language, History and Geography have 5 weights: 5%, 10%, 10%, 15%, 60%
Requirement:
Calculate the avg point of each student and output it to a dict like this:
{'Student 1': {'Math': 9.00, 'Physics': 8.55, …}, 'Student 2': {…, 'History': 9.00, 'Geography': 8.55}}
Thank you.
Assuming that script.py and your text file student.txt are in the same directory:
final_dict = {}
with open("student.txt", "r") as f:
    for idx, l in enumerate(f.readlines()):
        if l != "\n":
            if idx == 0:
                # Header line: drop the leading "StudentID" column name.
                l = l.replace("\n", "")
                header = l.split(", ")[1:]
            else:
                # Take everything before the first ";" so multi-digit IDs also work.
                student = f"Student {l.split(';')[0].strip()}"
                final_dict.update({student: {}})
                marks = l.split("; ")[1].replace("\n", "").split(";")
                for i, mark in enumerate(marks):
                    current_subject_int_marks = tuple(map(int, mark.split(",")))
                    len_marks = len(current_subject_int_marks)
                    if len_marks < 5:
                        avr = (
                            current_subject_int_marks[0] * 0.05
                            + current_subject_int_marks[1] * 0.10
                            + current_subject_int_marks[2] * 0.15
                            + current_subject_int_marks[3] * 0.70
                        )
                    else:
                        avr = (
                            current_subject_int_marks[0] * 0.05
                            + current_subject_int_marks[1] * 0.10
                            + current_subject_int_marks[2] * 0.10
                            + current_subject_int_marks[3] * 0.15
                            + current_subject_int_marks[4] * 0.60
                        )
                    final_dict[student].update({header[i]: avr})
print(final_dict)
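A possible variation, shown only as a sketch: the two weight sets can be kept in a lookup table keyed by the number of scores, which removes the duplicated arithmetic. The names `WEIGHTS` and `weighted_averages` are made up for this example, the function takes the file's lines as a list (for instance from `f.readlines()`), and rounding to two decimals is an assumption based on the sample output in the question.

```python
# Weights from the question, keyed by how many scores a subject has.
WEIGHTS = {
    4: (0.05, 0.10, 0.15, 0.70),
    5: (0.05, 0.10, 0.10, 0.15, 0.60),
}


def weighted_averages(lines):
    """Return {"Student <id>": {subject: weighted average}} for the given lines."""
    rows = [line.strip() for line in lines if line.strip()]
    # First row is the header; skip its "StudentID" column.
    subjects = [name.strip() for name in rows[0].split(",")][1:]
    result = {}
    for row in rows[1:]:
        student_id, *groups = row.split(";")
        averages = {}
        for subject, group in zip(subjects, groups):
            scores = [int(s) for s in group.split(",")]
            weights = WEIGHTS[len(scores)]
            averages[subject] = round(sum(s * w for s, w in zip(scores, weights)), 2)
        result[f"Student {student_id.strip()}"] = averages
    return result


sample = [
    "StudentID, Math, Physics, Chemistry, Biology, Literature, Language, History, Geography",
    "1; 5,7,7,8;5,5,6,6;8,6,7,7;4,8,5,7;7,7,6,7,9;7,5,8,6,7;7,8,8,5,9;5,8,6,8,7",
]
averages = weighted_averages(sample)
print(averages)
```

A subject with a score count other than 4 or 5 raises a `KeyError` here, which may be preferable to silently applying the wrong weights.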

Get a certain combination of numbers in Python

Is there an efficient and convenient solution in Python to do something like this:
Find the largest combination of two numbers x and y that satisfies the following conditions:
0 < x < 1000
0 < y < 2000
x/y = 0.75
x & y are integers
It's easy to do with a simple graphing calculator, but I'm trying to find the best way to do it in Python.
import pulp
My_optimization_prob = pulp.LpProblem('My_Optimization_Problem', pulp.LpMaximize)
# Creating the variables
x = pulp.LpVariable("x", lowBound = 1, cat='Integer')
y = pulp.LpVariable("y", lowBound = 1, cat='Integer')
# Adding the Constraints
My_optimization_prob += x + y #Maximize X and Y
My_optimization_prob += x <= 999 # x < 1000
My_optimization_prob += y <= 1999 # y < 2000
My_optimization_prob += x - 0.75*y == 0 # x/y = 0.75
#Printing the Problem and Constraints
print(My_optimization_prob)
My_optimization_prob.solve()
#printing X Y
print('x = ',pulp.value(x))
print('y = ',pulp.value(y))
Probably just -
z = [(x, y) for x in range(1, 1000) for y in range(1, 2000) if x/y==0.75]
z.sort(key=lambda x: sum(x), reverse=True)
z[0]
#Returns (999, 1332)
This is convenient, not sure if this is the most efficient way.
Another possible relatively efficient solution is -
x_upper_limit = 1000
y_upper_limit = 2000
x = 0
y = 0
temp_variable = 0
ratio = 0.75
# Start at x_upper_limit - 1 because x must be strictly less than the limit.
for i in range(x_upper_limit - 1, 0, -1):
    temp_variable = i / ratio
    if temp_variable.is_integer() and temp_variable < y_upper_limit:
        x = i
        y = int(temp_variable)
        break
print(x, y)
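Since 0.75 is the fraction 3/4, every valid pair has the form (3k, 4k), so the search can also be skipped entirely. The sketch below is one way to write that. The function name `largest_pair` is hypothetical, and passing the ratio as a string is a choice made here so that `Fraction` parses it exactly.

```python
from fractions import Fraction


def largest_pair(ratio, x_limit, y_limit):
    """Largest integers (x, y) with x / y == ratio, 0 < x < x_limit, 0 < y < y_limit."""
    r = Fraction(ratio)  # "0.75" becomes 3/4 in lowest terms
    # Every solution is (k * numerator, k * denominator); take the largest k that fits.
    k = min((x_limit - 1) // r.numerator, (y_limit - 1) // r.denominator)
    if k < 1:
        return None  # no pair satisfies the bounds
    return (k * r.numerator, k * r.denominator)


print(largest_pair("0.75", 1000, 2000))
```

For the bounds in the question this gives the same pair as the list comprehension above, (999, 1332), in constant time.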

tkinter how do i calculate the normal vector and the conservation of the kinetic energy of all particles in python?

I was trying to calculate the normal vector n (image: formula for normal vector) and the tangential vector t (image: tangential vector) of two particles p1 and p2, in order to conserve kinetic energy in a collision (image: conservation of energy),
or here's another way to write the formula: (image: formula 2)
but I don't really know where and how to implement this in the code.
from tkinter import *
from random import *
from math import *

myHeight=250#400
myWidth=400#800
mySpeed=20#100
col= randint(0,255)
radius = randint(0,50)
print (col)
#x= 60
global particules
particules = []

def initialiseParticule(dx,dy,radius,color):
    x, y = randint(0,myWidth), randint(0,myHeight) #100
    radius = randint(0,10)
    #color = randint(0,255)
    #col1=str(color)
    k = myCanvas.create_oval(x-radius,y-radius,\
                             x+radius,y+radius,\
                             width=2,fill=color)
    b = [x, y, dx, dy, radius]
    particules.append(b)
    #print(k)

def updateParticules():
    N = len(particules)
    for i in range(N):
        # update displacement
        particules[i][0] += particules[i][2]
        particules[i][1] += particules[i][3]
        #xi += vxi
        #yi += vyi
        # collision with walls
        if particules[i][0]<particules[i][4]or particules[i][0]>=myWidth-particules[i][4]:
            particules[i][2] *= -1
        if particules[i][1]<particules[i][4] or particules[i][1]>=myHeight-particules[i][4]:
            particules[i][3] *= -1
        # collision with other particles
        for j in range(N):
            if i != j:
                xi, yi = particules[i][0], particules[i][1]
                vxi, vyi = particules[i][2], particules[i][3]
                xj, yj = particules[j][0], particules[j][1]
                vxj, vyj = particules[j][2], particules[j][3]
                dij = sqrt((xi-xj)**2 + (yi-yj)**2)
                # print(dij)
                # # collision !!!
                if dij <= particules[i][4]+particules[j][4]:
                    particules[i][2] *= -1
                    particules[j][2] *= -1
                    particules[i][3] *= -1
                    particules[j][3] *= -1
        r = particules[i][4]
        myCanvas.coords(i+1, particules[i][0]-r, particules[i][1]-r,
                        particules[i][0]+r, particules[i][1]+r)

def animation():
    # was miseAJourBalles(), which is not defined anywhere in this script
    updateParticules()
    myCanvas.after(mySpeed, animation)

mainWindow=Tk()
mainWindow.title('Pong')
#mainWindow.geometry(str(myWidth)+'x'+str(myHeight+100))
myCanvas=Canvas(mainWindow, bg='dark grey', height=myHeight, width=myWidth)
myCanvas.pack(side=TOP)
N = 3
for n in range(N):
    # initialiseParticule( -1, -1, radius,'randint(0,10)')
    # was initialiseParticules(...), which does not match the function defined above
    initialiseParticule( -1, -1, radius,'pink')
animation()
#bou=Button(mainWindow,text="Leave",command=mainWindow.destroy)
#bou.pack()
mainWindow.mainloop()
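The formulas referenced in the question were images and are not reproduced here, so the following is only a sketch of the standard textbook approach under two stated assumptions: both particles have the same mass, and the collision is perfectly elastic. In that case the velocity components along the normal (the line joining the two centres) are exchanged and the tangential components are left unchanged. The function names `collide` and `kinetic_energy` are illustrative.

```python
from math import hypot


def collide(p1, v1, p2, v2):
    """Post-collision velocities of two equal-mass particles (elastic collision).

    p1, p2 are (x, y) centres; v1, v2 are (vx, vy) velocities.
    Assumption: equal masses, so the normal components are simply swapped.
    """
    nx, ny = p2[0] - p1[0], p2[1] - p1[1]
    dist = hypot(nx, ny)
    if dist == 0:
        return v1, v2  # centres coincide: the normal is undefined, keep velocities
    nx, ny = nx / dist, ny / dist  # unit normal along the line of centres
    tx, ty = -ny, nx               # unit tangent, perpendicular to the normal
    # Project each velocity onto the normal and the tangent.
    v1n, v1t = v1[0] * nx + v1[1] * ny, v1[0] * tx + v1[1] * ty
    v2n, v2t = v2[0] * nx + v2[1] * ny, v2[0] * tx + v2[1] * ty
    # Swap the normal components, keep the tangential ones.
    new_v1 = (v2n * nx + v1t * tx, v2n * ny + v1t * ty)
    new_v2 = (v1n * nx + v2t * tx, v1n * ny + v2t * ty)
    return new_v1, new_v2


def kinetic_energy(*velocities):
    """Total kinetic energy for unit masses."""
    return sum(0.5 * (vx * vx + vy * vy) for vx, vy in velocities)


before = ((2.0, 1.0), (-1.0, 0.5))
after = collide((0.0, 0.0), before[0], (3.0, 4.0), before[1])
print(kinetic_energy(*before), kinetic_energy(*after))
```

In the posted script, a call like this could replace the four `*= -1` lines inside the `if dij <= ...` branch, writing the returned values back into `particules[i][2:4]` and `particules[j][2:4]`. Note that the inner loop as posted visits every pair twice per frame (once as i, j and once as j, i), so iterating `j` over `range(i + 1, N)` may be needed to avoid applying the collision twice.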

Geometry Arc Algorithm

I searched all over the internet and didn't find any pseudocode that solves this problem.
I want to find an arc between two points, A and B, using 5 arguments:
Start Point
End Point
Radius (Don't know if this is needed)
Angle
Quality
Example:
StartPoint = The green point on the left is the Start Point set on the arguments
EndPoint = The green point on the right is the End Point set on the arguments
Angle = Angle of the Arc(Semi Circle)
Quality = How many red circles to create
I would like to have a pseudo code to solve this problem
Thanks in advance :D
Let the start point be P0, the end point P1, and the angle Fi. R is not needed.
First find the arc center. Get the middle of the P0-P1 segment:
M = (P0 + P1) / 2
// M.x = (P0.x + P1.x) / 2 , same for y
And the direction vector (half of the chord):
D = (P1 - P0) / 2
Get the length of D:
lenD = Math.Hypot(D.x, D.y) //Vector.Length, sqrt of sum of squares
Get the unit vector:
uD = D / lenD
Get the (left) perpendicular vector:
(P.x, P.y) = (-uD.y, uD.x)
Now the circle center:
if Fi = Pi then
    C.x = M.x
    C.y = M.y
else
    C.x = M.x + P.x * lenD / Tan(Fi/2)
    C.y = M.y + P.y * lenD / Tan(Fi/2)
Vector from the center to the start point:
CP0.x = P0.x - C.x
CP0.y = P0.y - C.y
Then you can calculate the coordinates of the NSeg intermediate points on the arc by rotating the vector CP0 around the center point. For i = 0 .. NSeg + 1 (i = 0 gives P0 and i = NSeg + 1 gives P1):
an = i * Fi / (NSeg + 1);
X[i] = C.x + CP0.x * Cos(an) - CP0.y * Sin(an)
Y[i] = C.y + CP0.x * Sin(an) + CP0.y * Cos(an)
Working Delphi code
procedure ArcByStartEndAngle(P0, P1: TPoint; Angle: Double; NSeg: Integer);
var
  i: Integer;
  len, dx, dy, mx, my, px, py, t, cx, cy, p0x, p0y, an: Double;
  xx, yy: Integer;
begin
  mx := (P0.x + P1.x) / 2;
  my := (P0.y + P1.y) / 2;
  dx := (P1.x - P0.x) / 2;
  dy := (P1.y - P0.y) / 2;
  len := Math.Hypot(dx, dy);
  px := -dy / len;
  py := dx / len;
  if Angle = Pi then
    t := 0
  else
    t := len / Math.Tan(Angle / 2);
  cx := mx + px * t;
  cy := my + py * t;
  p0x := P0.x - cx;
  p0y := P0.y - cy;
  for i := 0 to NSeg + 1 do begin
    an := i * Angle / (NSeg + 1);
    xx := Round(cx + p0x * Cos(an) - p0y * Sin(an));
    yy := Round(cy + p0x * Sin(an) + p0y * Cos(an));
    Canvas.Ellipse(xx - 3, yy - 3, xx + 4, yy + 4);
  end;
end;
Result for (Point(100, 0), Point(0, 100), Pi / 2, 8), with the Y-axis pointing down in the picture.
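For readers who want to check the geometry without a canvas, here is a sketch of the same steps in Python that returns the points instead of drawing them. The function name `arc_points` and the tuple-based point representation are choices made for this example, and it assumes P0 and P1 are distinct (otherwise the chord length is zero and the division fails).

```python
from math import cos, sin, tan, hypot, pi, isclose


def arc_points(p0, p1, angle, n_seg):
    """Return n_seg + 2 points on the arc from p0 to p1 that spans `angle` radians."""
    mx, my = (p0[0] + p1[0]) / 2, (p0[1] + p1[1]) / 2  # chord midpoint M
    dx, dy = (p1[0] - p0[0]) / 2, (p1[1] - p0[1]) / 2  # half-chord vector D
    half_chord = hypot(dx, dy)
    px, py = -dy / half_chord, dx / half_chord         # left unit perpendicular
    # Distance from M to the centre; zero for a half circle.
    t = 0.0 if isclose(angle, pi) else half_chord / tan(angle / 2)
    cx, cy = mx + px * t, my + py * t
    rx, ry = p0[0] - cx, p0[1] - cy                    # vector from centre to p0
    points = []
    for i in range(n_seg + 2):
        an = i * angle / (n_seg + 1)
        # Rotate the centre-to-p0 vector by `an` around the centre.
        points.append((cx + rx * cos(an) - ry * sin(an),
                       cy + rx * sin(an) + ry * cos(an)))
    return points


pts = arc_points((100, 0), (0, 100), pi / 2, 8)
for x, y in pts:
    print(f"{x:8.3f} {y:8.3f}")
```

With the arguments from the example above, the centre works out to (0, 0), so every returned point lies at distance 100 from the origin, starting at P0 and ending at P1.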

Extrapolation -- awk based

I need help in the following: I have a data file (columns separated by "\t" tabular) like this data.dat
# y1 y2 y3 y4
17.1685 21.6875 20.2393 26.3158
These are x values of 4 points for a linear fit. The four y values are constant: 0, 200, 400, 600.
I can create a linear fit of the point pairs (x,y): (x1,y1)=(17.1685,0), (x2,y2)=(21.6875,200), (x3,y3)=(20.2393,400), (x4,y4)=(26.3158,600).
Now I would like to make a linear fit on each subset of three of these point pairs: (x1,y1), (x2,y2), (x3,y3) and (x2,y2), (x3,y3), (x4,y4) and (x1,y1), (x3,y3), (x4,y4) and (x1,y1), (x2,y2), (x4,y4).
For each linear fit through three points, I would like to know the extrapolated x value at the position of the fourth point, the one that was left out of the fit.
I have so far this awk code:
#!/usr/bin/awk -f
BEGIN{
    z[1] = 0;
    z[2] = 200;
    z[3] = 400;
    z[4] = 600;
}
{
    split($0,str,"\t");
    n = 0.0;
    for(i=1; i<=NF; i++)
    {
        centr[i] = str[i];
        n += 1.0;
        # printf("%d\t%f\t%.1f\t",i,centr[i],z[i]);
    }
    # print "";
    if (n > 2)
    {
        lsq(n,z,centr);
    }
}
function lsq(n,x,y)
{
    sx = 0.0
    sy = 0.0
    sxx = 0.0
    syy = 0.0
    sxy = 0.0
    eps = 0.0
    for (i=1;i<=n;i++)
    {
        sx += x[i]
        sy += y[i]
        sxx += x[i]*x[i]
        sxy += x[i]*y[i]
        syy += y[i]*y[i]
    }
    if ( (n==0) || ((n*sxx-sx*sx)==0) )
    {
        next;
    }
    # print "number of data points = " n;
    a = (sxx*sy-sxy*sx)/(n*sxx-sx*sx)
    b = (n*sxy-sx*sy)/(n*sxx-sx*sx)
    for(i=1;i<=n;i++)
    {
        ycalc[i] = a+b*x[i]
        dy[i] = y[i]-ycalc[i]
        eps += dy[i]*dy[i]
    }
    print "# Intercept =\t" a
    print "# Slope =\t" b
    for (i=1;i<=n;i++)
    {
        printf("%8g %8g %8g \n",x[i],y[i],ycalc[i])
    }
} # function lsq()
So, this is the output I would like to get.
If we extrapolate to the position of the 4th point:
0 17.1685 <--(x1,y1)
200 21.6875 <--(x2,y2)
400 20.2393 <--(x3,y3)
600 22.7692 <<< (x4 = 600, y4 = 22.7692)
If we extrapolate to the position of the 3rd point:
0 17.1685 <--(x1,y1)
200 21.6875 <--(x2,y2)
400 23.6867 <<< (x3 = 400, y3 = 23.6867)
600 26.3158 <--(x4,y4)
0 17.1685
200 19.35266 <<<
400 20.2393
600 26.3158
0 18.1192 <<<
200 21.6875
400 20.2393
600 26.3158
My current output is the following:
$> ./prog.awk data.dat
# Intercept = 17.4537
# Slope = 0.0129968
0 17.1685 17.4537
200 21.6875 20.0531
400 20.2393 22.6525
600 26.3158 25.2518
Assuming the core calculation in the lsq function is OK (it looks about right, but I haven't scrutinized it), then that gives you the slope and intercept for the least sum of squares line of best fit for the input data set (parameters x, y, n). I'm not sure I understand the tail end of the function.
For your 'take three points and calculate the fourth' problem, the simplest way is to generate the 4 subsets (logically, by deleting one point from the set of four on each of four calls), and redo the calculation.
You need to call another function that takes the line data (slope, intercept) from lsq and interpolates (extrapolates) the value at another y value. That's a straight-forward calculation (x = m * y + c), but you need to determine which y value is missing from the set of 3 you pass in.
You could 'optimize' (meaning 'complicate') this scheme by dropping one value at a time from the 'sums of squares' and 'sums' and 'sum of products' values, recalculating the slope, intercept, and then calculating the missing point again.
(I'll also observe that normally it would be the x-coordinates with the fixed values 0, 200, 400, 600 and the y-coordinates would be the values read. However, that's just a matter of orientation, so it is not crucial.)
Here's at least plausibly working code. Since awk automatically splits on white space, there's no need for you to split on tabs specifically; the read loop takes this into account.
The code needs serious refactoring; there is a ton of repetition in it - however, I also have a job that I'm supposed to do.
#!/usr/bin/awk -f
BEGIN{
    z[1] = 0;
    z[2] = 200;
    z[3] = 400;
    z[4] = 600;
}
{
    for (i = 1; i <= NF; i++)
    {
        centr[i] = $i
    }
    if (NF > 2)
    {
        lsq(NF, z, centr);
    }
}
function lsq(n, x, y)
{
    if (n == 0) return
    sx = 0.0
    sy = 0.0
    sxx = 0.0
    syy = 0.0
    sxy = 0.0
    for (i = 1; i <= n; i++)
    {
        print "x[" i "] = " x[i] ", y[" i "] = " y[i]
        sx += x[i]
        sy += y[i]
        sxx += x[i]*x[i]
        sxy += x[i]*y[i]
        syy += y[i]*y[i]
    }
    if ((n*sxx - sx*sx) == 0) return
    # print "number of data points = " n;
    a = (sxx*sy-sxy*sx)/(n*sxx-sx*sx)
    b = (n*sxy-sx*sy)/(n*sxx-sx*sx)
    for (i = 1; i <= n; i++)
    {
        ycalc[i] = a+b*x[i]
    }
    print "# Intercept = " a
    print "# Slope = " b
    print "Line: x = " a " + " b " * y"
    for (i = 1; i <= n; i++)
    {
        printf("x = %8g, yo = %8g, yc = %8g\n", x[i], y[i], ycalc[i])
    }
    print ""
    print "Different subsets\n"
    for (drop = 1; drop <= n; drop++)
    {
        print "Subset " drop
        sx = sy = sxx = sxy = syy = 0
        j = 1
        for (i = 1; i <= n; i++)
        {
            if (i == drop) continue
            print "x[" j "] = " x[i] ", y[" j "] = " y[i]
            sx += x[i]
            sy += y[i]
            sxx += x[i]*x[i]
            sxy += x[i]*y[i]
            syy += y[i]*y[i]
            j++
        }
        if (((n-1)*sxx - sx*sx) == 0) continue
        a = (sxx*sy-sxy*sx)/((n-1)*sxx-sx*sx)
        b = ((n-1)*sxy-sx*sy)/((n-1)*sxx-sx*sx)
        print "Line: x = " a " + " b " * y"
        xt = x[drop]
        yt = a + b * xt;
        print "Interpolate: x = " xt ", y = " yt
    }
}
Since awk doesn't provide an easy way to pass back multiple values from a function, nor does it provide structures other than arrays (sometimes associative), it is not perhaps the best language for this task. On the other hand, it can be made to do the job. You might be able to bundle the Least Squares calculation in a function that returns an array containing the slope and intercept, and then use that. Your turn to explore options.
Given the script lsq.awk and the input file lsq.data shown, I get the output shown:
$ cat lsq.data
17.1685 21.6875 20.2393 26.3158
$ awk -f lsq.awk lsq.data
x[1] = 0, y[1] = 17.1685
x[2] = 200, y[2] = 21.6875
x[3] = 400, y[3] = 20.2393
x[4] = 600, y[4] = 26.3158
# Intercept = 17.4537
# Slope = 0.0129968
Line: x = 17.4537 + 0.0129968 * y
x = 0, yo = 17.1685, yc = 17.4537
x = 200, yo = 21.6875, yc = 20.0531
x = 400, yo = 20.2393, yc = 22.6525
x = 600, yo = 26.3158, yc = 25.2518
Different subsets
Subset 1
x[1] = 200, y[1] = 21.6875
x[2] = 400, y[2] = 20.2393
x[3] = 600, y[3] = 26.3158
Line: x = 18.1192 + 0.0115708 * y
Interpolate: x = 0, y = 18.1192
Subset 2
x[1] = 0, y[1] = 17.1685
x[2] = 400, y[2] = 20.2393
x[3] = 600, y[3] = 26.3158
Line: x = 16.5198 + 0.0141643 * y
Interpolate: x = 200, y = 19.3526
Subset 3
x[1] = 0, y[1] = 17.1685
x[2] = 200, y[2] = 21.6875
x[3] = 600, y[3] = 26.3158
Line: x = 17.7985 + 0.0147205 * y
Interpolate: x = 400, y = 23.6867
Subset 4
x[1] = 0, y[1] = 17.1685
x[2] = 200, y[2] = 21.6875
x[3] = 400, y[3] = 20.2393
Line: x = 18.163 + 0.007677 * y
Interpolate: x = 600, y = 22.7692
$
Edit: In the previous version of the answer, the subsets were multiplying by n instead of (n-1). The values in the revised output seem to agree with what you expect. The residual issues are presentational, not computational.
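As a cross-check of the leave-one-out scheme described above, here is a short sketch in Python that separates the fit from the subset loop, which is the refactoring awk makes awkward. The names `fit_line` and `leave_one_out` are invented for this example. It follows the awk script's orientation, where the fixed values 0, 200, 400, 600 play the role of x and the values read from the file play the role of y.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b of y = a + b * x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical")
    return (sxx * sy - sxy * sx) / denom, (n * sxy - sx * sy) / denom


def leave_one_out(xs, ys):
    """For each index, fit the remaining points and predict the dropped one."""
    predictions = []
    for drop in range(len(xs)):
        kept_x = xs[:drop] + xs[drop + 1:]
        kept_y = ys[:drop] + ys[drop + 1:]
        a, b = fit_line(kept_x, kept_y)
        predictions.append(a + b * xs[drop])
    return predictions


positions = [0, 200, 400, 600]
values = [17.1685, 21.6875, 20.2393, 26.3158]
predictions = leave_one_out(positions, values)
for x, y in zip(positions, predictions):
    print(f"{x:4d} {y:.4f}")
```

The four predictions correspond, in order, to the "Interpolate" lines for Subset 1 through Subset 4 in the awk output.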

Resources