How to curve fit data in Excel to a multi variable polynomial? - excel

I have a simple set of data, 10 values that increase.
I want to fit them to a polynomial of the form:
Z = A1 + A2*X + A3*Y + A4*X^2 + A5*X*Y+ A6*Y^2
Where Z the output is the set of data above, A1 - A6 are the coefficients I am looking for,
X is the range of inputs (10 of course), and Y for the moment is a constant value.
How can I curve fit to this polynomial and not the standard 2nd order one that is created using 'trendline'?

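Construct a Vandermonde matrix on your data points, find its inverse with MINVERSE, then apply this to the vector of Z values with MMULT. This works only when the matrix is square, i.e. when the number of data points equals the number of coefficients (for a single-variable polynomial, degree n-1 with n data points).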
Otherwise you could try polynomial regression, which will again use the Vandermonde matrix.
More math than Excel really.
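As a rough illustration of the regression route outside Excel, the sketch below builds one matrix column per term of the polynomial in the question and solves for A1 to A6 by least squares with NumPy. The function name `fit_quadratic_surface` and the sample data are hypothetical, and Y is allowed to vary here so that all six coefficients can be identified.

```python
import numpy as np

def fit_quadratic_surface(x, y, z):
    """Least-squares fit of z = A1 + A2*x + A3*y + A4*x^2 + A5*x*y + A6*y^2."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    # One column per polynomial term (a generalized Vandermonde matrix).
    design = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])
    coeffs, *_ = np.linalg.lstsq(design, z, rcond=None)
    return coeffs

# Hypothetical data: 10 points generated from known coefficients.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 10)
y = rng.uniform(0.0, 5.0, size=10)
true_coeffs = np.array([1.0, 2.0, -0.5, 0.3, 0.1, 0.05])
z = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2]) @ true_coeffs

fitted = fit_quadratic_surface(x, y, z)
print(fitted)  # should be close to true_coeffs
```

If Y really is held constant, as in the question, the columns for 1, Y and Y^2 are proportional to each other (and so are X and X*Y), so those coefficients cannot be separated; `lstsq` then returns a minimum-norm solution rather than a unique one.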

Related

Scikit Learn PolynomialFeatures - what is the use of the include_bias option?

In scikit-learn's PolynomialFeatures preprocessor, there is an include_bias option. This essentially just adds a column of ones to the dataframe. I was wondering what the point of having this is. Of course, you can set it to False. But theoretically, how does having or not having a column of ones along with the generated polynomial features affect regression?
This is the explanation in the documentation, but I can't seem to get anything useful out of it in relation to why it should be used or not.
include_bias : boolean
If True (default), then include a bias column, the feature in which
all polynomial powers are zero (i.e. a column of ones - acts as an
intercept term in a linear model).
Suppose you want to perform the following regression:
y ~ a + b x + c x^2
where x is a generic sample. The best coefficients a, b, c are computed via simple matrix calculus. First, let us denote with X = [1 | x | x^2] a matrix with N rows, where N is the number of samples. The first column is a column of 1s, the second column holds the values x_i for all samples i, and the third column holds the values x_i^2 for all samples i. Let us denote with B the column vector B = [a b c]^T. If y is the column vector of the N target values for all samples i, we can write the regression as
y ~ X B
The i-th row of this equation is y_i ~ [1 x_i x_i^2] [a b c]^T = a + b x_i + c x_i^2.
The goal of training a regression is to find B = [a b c]^T such that X B is as close as possible to y.
If you don't add a column of 1s, you are assuming a priori that a = 0, which might not be correct.
In practice, when you write Python code and use PolynomialFeatures together with sklearn.linear_model.LinearRegression, the latter takes care by default of adding a column of 1s (since in LinearRegression the fit_intercept parameter is True by default), so you don't need to add it in PolynomialFeatures as well. Therefore, in PolynomialFeatures one usually keeps include_bias=False.
The situation is different if you use statsmodels' OLS instead of LinearRegression: OLS does not add an intercept on its own, so the column of 1s has to be present in the design matrix (for example via include_bias=True).
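To make the effect of the column of 1s concrete, here is a small sketch that uses plain NumPy least squares rather than scikit-learn, so the design matrix is built by hand. The data are made up from y = 2 + 3x + 0.5x^2, and all variable names are illustrative.

```python
import numpy as np

# Synthetic data from y = 2 + 3 x + 0.5 x^2 (values chosen for illustration).
x = np.linspace(-2.0, 2.0, 21)
y = 2.0 + 3.0 * x + 0.5 * x**2

# Design matrix with and without the leading column of ones.
X_with_bias = np.column_stack([np.ones_like(x), x, x**2])
X_no_bias = np.column_stack([x, x**2])

coef_with, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
coef_without, *_ = np.linalg.lstsq(X_no_bias, y, rcond=None)

# Sum of squared errors of each fit.
sse_with = np.sum((X_with_bias @ coef_with - y) ** 2)
sse_without = np.sum((X_no_bias @ coef_without - y) ** 2)

print(coef_with)     # recovers a, b, c
print(coef_without)  # only b, c; the fit is forced through a = 0
print(sse_with, sse_without)
```

Without the ones column the model cannot represent the offset a = 2, so its residual error stays well above zero even on noise-free data.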

Inverse CDF of Poisson dist in Excel

Is there a function to calculate the inverse CDF of the Poisson distribution? I want to use the inverse CDF to generate a set of Poisson-distributed random numbers.
A) Inverse CDF of Poisson distribution
The inverse CDF at q is also referred to as the q quantile of a distribution. For a discrete distribution, the inverse CDF at q is the smallest integer x such that CDF[dist, x] ≥ q.
The Poisson distribution is a discrete distribution that models the number of events based on a constant rate of occurrence. It can be used as an approximation to the binomial when the number of independent trials is large and the probability of success is small. A common application of the Poisson distribution is predicting the number of events over a specific time, such as the number of cars arriving at a toll plaza in 1 minute.
Formula
The probability mass function (PMF) is:
$$f(x) = \frac{e^{-\lambda}\,\lambda^{x}}{x!}, \qquad x = 0, 1, 2, \dots$$
mean = λ
variance = λ
Notation
e = base of the natural logarithm
Reference: Methods and Formulas for Inverse Cumulative Distribution Functions
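The definition above (the smallest integer x whose CDF reaches q) can be sketched directly in Python by accumulating the PMF term by term. The function name `poisson_inv` is hypothetical, and the sketch assumes a mean small enough that exp(-λ) does not underflow.

```python
import math

def poisson_inv(p, lam):
    """Smallest integer x such that the Poisson(lam) CDF at x is >= p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    if not 0.0 < lam <= 700.0:
        # Assumption of this sketch: exp(-lam) must not underflow to 0.
        raise ValueError("lam must satisfy 0 < lam <= 700")
    x = 0
    pmf = math.exp(-lam)  # P(X = 0)
    cdf = pmf
    while cdf < p:
        x += 1
        pmf *= lam / x    # P(X = x) from P(X = x - 1)
        cdf += pmf
        if pmf == 0.0 and x > lam:
            break         # guard: rounding kept the sum just below p
    return x

print(poisson_inv(0.5, 1.0))
print(poisson_inv(0.95, 35.0))
```

The recurrence P(X = x) = P(X = x - 1) * λ / x avoids computing large factorials explicitly.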
B) Excel Function: Excel provides the following function for the Poisson distribution:
POISSON(x, μ, cum)
where μ = the mean of the distribution and cum takes the values TRUE and FALSE
POISSON(x, μ, FALSE) = probability mass function value f(x) at the value x for the Poisson distribution with mean μ.
POISSON(x, μ, TRUE)= cumulative probability distribution function F(x) at the value x for the Poisson distribution with mean μ.
Excel 2010/2013/2016 provide the additional function POISSON.DIST which is equivalent to POISSON.
Reference: Office Support POISSON.DIST Function
C) Excel doesn’t provide a worksheet function for the inverse of the Poisson distribution.
Instead you can use the following function provided by the Real Statistics Resource Pack. It is a free download for various versions of Excel.
POISSON_INV(p, μ) = smallest integer x such that POISSON(x, μ, TRUE) ≥ p
Note that the maximum value of x is 1,024,000,000. A value higher than this indicates an error.
Reference: Real Statistics Using Excel
D) A query on the MREXCEL.COM forum, quoted below, appears to be related to your question:
Not sure if anyone can help with this. Basically I'm trying to find out how to apply the reverse of the Poisson function in excel. So as of now I have poisson(x value, mean, true-cumulative) and that lets me get the probability for that occurence. Basically I want to know how I can get the minimum/maximum x value based on a given probability.
So if I have a list of data (700 rows) and I want to find out what the minimum starting value should be given a desired average and the fact that I want the lowest value to be at the 0.05% probability. So 0.05% = (x, 35, True) solve for x. I know I can prob do this with solver, but I am trying to figure out a way to do this formulaicly without having to use the solver (as I may have to use this many times).
The code referred to here covers the inverse of the poisson formula when using True in the excel formula. It does not cover the inverse of the poisson formula when using False in the excel formula.
Re: Reverse Poisson?
Originally Posted by shg
A further mod to accommodate large means:
Code:
Function PoissonInv(Prob As Double, Mean As Double) As Variant
' shg 2011, 2012, 2014, 2015-0415
' For a Poisson process with mean Mean, returns a three-element array:
' o The smallest integer N such that POISSON(N, Mean, True) >= Prob
' o The CDF for N-1 (which is < Prob)
' o The CDF for N (which is >= Prob)
Reference: https://www.mrexcel.com/forum/excel-questions/507508-reverse-poisson-2.html
E) Why doesn't Excel have a POISSON.INV function?
The discussion on the referenced web page points to some formulas for calculating the information the OP is after.
You could use the following.
With the Poisson mean named lambda, enter the following in a newly inserted worksheet.
A1: =IF(ROWS(A$1:A1)<=4*lambda,POISSON(ROWS(A$1:A1)-1,lambda,1))
Fill A1 down into A2:A1000 (4 times as many rows as your most typical lambda value). Name the A1:A1000 range POISSON.CDF. Then use the formula
=MATCH(n,POISSON.CDF)-1
to give the results a POISSON.INV(n,lambda) function would.
If you want this for varying lambda, use the array formula
=MATCH(n,POISSON(ROW($A$1:INDEX($A:$A,4*lambda+1)),lambda,1))-1
Reference Shared Link
Hope That Helps.
=MATCH(RAND(),MMULT((ROW(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+3,1)))=COLUMN(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(1,MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+2))))+0,MMULT((ROW(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+2,1)))=(COLUMN(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(1,MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+1)))+1))+0,POISSON(ROW($A$1:INDEX($A:$A,MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+1))-1,lambda,1)))+(ROW(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+3,1)))=(COLUMN(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(1,1)))+FLOOR(MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+2,1)))+0)-1
It is quite slow for lambda >1000.
This expands on the array formula
=MATCH(C4,POISSON(ROW($A$1:INDEX($A:$A,4*lambda+1)),lambda,1))-1
shared above by skkakkar, by prepending the array with 0 and appending 1, following "Is there a way to concatenate two arrays in Excel without VBA?".
The rest is mostly making the array shorter by replacing 4* lambda with 6* SQRT(lambda).
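For comparison, the same lookup idea (draw a uniform random number, then find its position in a table of cumulative probabilities) might look as follows in Python. This is a sketch, not a translation of the formula above: the function names are hypothetical, and the table length borrows the lambda + 6*SQRT(lambda) cutoff plus a small margin.

```python
import bisect
import math
import random

def poisson_cdf_table(lam):
    """CDF values F(0), F(1), ..., up to about lam + 6*sqrt(lam)."""
    n_max = int(lam + 6 * math.sqrt(lam)) + 10
    table = []
    pmf = math.exp(-lam)  # P(X = 0); assumes lam is small enough not to underflow
    cdf = 0.0
    for k in range(n_max + 1):
        if k > 0:
            pmf *= lam / k
        cdf += pmf
        table.append(cdf)
    return table

def poisson_sample(table, rng=random):
    """Inverse-transform sample: smallest x with F(x) >= u, u uniform on [0, 1)."""
    u = rng.random()
    # bisect_left returns the first index whose CDF value is >= u.
    # If u exceeds the last table entry it returns len(table), which plays
    # the role of the 1 appended to the array in the Excel formula.
    return bisect.bisect_left(table, u)

rng = random.Random(42)
table = poisson_cdf_table(4.0)
samples = [poisson_sample(table, rng) for _ in range(20000)]
print(sum(samples) / len(samples))  # should be near lambda = 4
```

Building the table once and reusing it keeps each draw to a single binary search, which is the main reason to prefer it over recomputing the CDF for every random number.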

Python3 Sine Function to fit Data

I have data which I broke into 28-day slices. Here is a plot of the means of each of these slices.
I have to calculate some values from an equation that fits the data reasonably well, and then find the correlation coefficient between those values and the means for each 28-day slice. Since I have one mean for each slice, I have to find one value from the equation per slice.
I believe the equation is:
y = A sin(Bx - C) + D
But I am struggling to implement the equation for my data and find these values.
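One possible way to approach this, sketched here with NumPy only: for a fixed B, the model A sin(Bx - C) + D is linear in the coefficients of sin(Bx), cos(Bx) and a constant, so B can be scanned over a grid and the rest solved by least squares. This is a swapped-in technique rather than a general nonlinear fit (such as scipy.optimize.curve_fit), and the function name, grid and sample data below are all assumptions.

```python
import numpy as np

def fit_sine(x, y, b_grid):
    """Fit y = A*sin(B*x - C) + D by scanning B and solving the rest linearly."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best = None
    for b in b_grid:
        # A*sin(Bx - C) = (A cos C) sin(Bx) - (A sin C) cos(Bx)
        design = np.column_stack([np.sin(b * x), np.cos(b * x), np.ones_like(x)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = float(np.sum((design @ coef - y) ** 2))
        if best is None or sse < best[0]:
            s, c, d = coef
            best = (sse, np.hypot(s, c), b, np.arctan2(-c, s), d)
    _, A, B, C, D = best
    return A, B, C, D

# Hypothetical slice means: one value per 28-day slice, indexed 0..12.
x = np.arange(13.0)
y_means = 2.0 * np.sin(0.5 * x - 0.3) + 5.0

A, B, C, D = fit_sine(x, y_means, b_grid=np.linspace(0.05, 3.0, 60))
fitted = A * np.sin(B * x - C) + D          # one model value per slice
r = np.corrcoef(fitted, y_means)[0, 1]      # correlation coefficient
print(A, B, C, D, r)
```

The `fitted` array holds one value from the equation per slice, which is what gets correlated against the slice means. A finer `b_grid` trades speed for a better estimate of B.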

Excel gives weird R square calculations?

This is really weird. I calculate R^2 values with Excel in two different ways and the results differ hugely. Why?
1) First I use Excel to do a linear regression via a graph, and use the "Add Trendline..." right mouse button functionality to specify Intercept = 0. The R square value shows -3.253. The regressed equation is Y = -0.1321 * X
2) Then I use Excel to do a linear regression via the LINEST function. I highlight a range of 5 rows by 2 columns and, in the top left cell, type =LINEST([Y vector], [X vector], FALSE, TRUE). The FALSE means the intercept is 0, and the TRUE means Excel should print additional regression statistics. Then I press CTRL + SHIFT + Enter. This shows the additional statistics, such as the R^2 value in the third row of the left column, which turns out to be 0.11166. The regressed equation is Y = -0.1321 * X
My question is; what am I doing wrong in calculating R^2 with the graph? Python and statsmodels.api confirms that R^2 is 0.11166, and the regressed equation is Y = -0.1321 * X.
Y =
0.0291970802919708
0.141801551718973
0.145668034655723
0.0691229530946433
0.0431577486597426
0.133618351873374
X =
-0.35551988
-0.20577599
0.10780785
-0.25028796
-0.42762184
0.02442197
Your calculation is correct. The scatter plot does not return the correct R^2 when the intercept is 0. This is the formula for R^2:
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
where
$$SS_{res} = \sum_i (y_i - \hat{y}_i)^2, \qquad SS_{tot} = \sum_i (y_i - \bar{y})^2$$
If you use a standard regression model, y̅ is the average value of y. But when you assume that the intercept equals 0, you need to set y̅ to zero. If you use the average value of y instead of zero, you get R^2 = -3.252767.
You can see the calculation here. The "SStot wrong" column uses the average value of y as y̅, which gives R^2 = -3.252767. If you use 0 (as I did in the "SStot right" column), you get 0.111.
It is an old bug described by Microsoft here: https://support.microsoft.com/en-us/help/829249/you-will-receive-an-incorrect-r-squared-value-in-the-chart-tool-in-excel-2003
You need to use the LINEST function to get correct R^2 value.
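The two R^2 values can be reproduced from the data in the question with a few lines of NumPy. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
import numpy as np

y = np.array([0.0291970802919708, 0.141801551718973, 0.145668034655723,
              0.0691229530946433, 0.0431577486597426, 0.133618351873374])
x = np.array([-0.35551988, -0.20577599, 0.10780785,
              -0.25028796, -0.42762184, 0.02442197])

# Least-squares slope for a line forced through the origin.
slope = np.sum(x * y) / np.sum(x * x)
ss_res = np.sum((y - slope * x) ** 2)

# Total sum of squares measured around the mean of y, and around zero.
ss_tot_mean = np.sum((y - y.mean()) ** 2)
ss_tot_zero = np.sum(y ** 2)

r2_with_mean = 1 - ss_res / ss_tot_mean   # the chart's value (about -3.25)
r2_with_zero = 1 - ss_res / ss_tot_zero   # LINEST's value (about 0.11)
print(slope, r2_with_mean, r2_with_zero)
```

Only the denominator differs between the two results: the residual sum of squares and the slope are identical in both.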
Me and my fellow engineers just got tangled up in this. Based on this discussion and what we observed, the R^2 is wrong all of the time except when Excel calculates the best y-intercept. Any other y-intercept (either forced through Zero OR user-defined), is wrong.

how to obtain estimation from regression in excel?

I use data in Excel to produce a chart.
Then I run a regression and get an equation. I'd like to know what value the regression gives for a given x (for example, x = 7,6 is the value for which I want an estimate of y).
It is an approximation with a 6th-degree polynomial.
One simple method would be this: I have the equation, so I could use it directly.
However, I wondered if there is a faster method. For example, could I enter 7,6 somewhere and get the result quickly?
If you are looking at a linear regression line (straight line), you could try the FORECAST formula:
=forecast(X, Known Ys, Known Xs)
You could also build your own equation automatically from
=linest(...)
I found the following on a site describing the capabilities of the linest function in excel:
In addition to using LOGEST to calculate statistics for other
regression types, you can use LINEST to calculate a range of other
regression types by entering functions of the x and y variables as the
x and y series for LINEST. For example, the following formula:
=LINEST(yvalues, xvalues^COLUMN($A:$C))
works when you have a single column of y-values and a single column of
x-values to calculate the cubic (polynomial of order 3) approximation
of the form:
y = m1*x + m2*x^2 + m3*x^3 + b
You can adjust this formula to calculate other types of regression,
but in some cases it requires the adjustment of the output values and
other statistics.
or look at:
=trend
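Outside Excel, the same "fit once, then evaluate at any x" workflow can be sketched with NumPy's polynomial class. The data below are placeholders (generated from y = x^2), and degree 6 is taken from the question.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Placeholder data; replace with the chart's x and y columns.
x = np.arange(0.0, 11.0)
y = x ** 2

# Least-squares fit of a 6th-degree polynomial.
poly = Polynomial.fit(x, y, deg=6)

# Evaluate the fitted polynomial at the x of interest.
estimate = poly(7.6)
print(estimate)
```

`Polynomial.fit` rescales x internally before fitting, which tends to behave better numerically for high degrees than working with raw powers of x.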
