I am trying to recreate the formula from a trendline on a graph. Basically, my company is trying to predict the corn yields for next year. All of the actual programmers are out for the week, so they passed it on to me (web developer :D). I've attempted the LINEST formula multiple times with no luck.
Basically, in column B I have the years (1-15, trying to project 16) and in column C I have the actual trend data. I am probably doing this wrong, however:
=LINEST(C16:C30,B16:B30,FALSE,FALSE)
Any help would be appreciated. Just tell me if you need the actual file or more information. Thanks in advance!
The fourth argument, concerning the return of additional regression statistics, is optional and taken as FALSE if omitted, so it seems not to be required for your purposes. The third argument, concerning the intercept with the Y-axis (the value of y when x is 0), is also optional but taken as TRUE if omitted. In your case TRUE seems appropriate, so the third argument also seems not to be required.
With your data spanning 15 years and, if ending with the current year, conveniently 2001-2015, it has no information about the value of y (production) in year 2000 (i.e. when x is 0), but this is unlikely to have been 0, as would be taken to be the case if the third argument is FALSE.
In a simplified example, take production of 50 in 2001, increasing by an (unrealistically!) constant 5 each year. By 2015 this has reached 120, so for 2016, at the same rate of increase, production of 125 should be expected. Your formula returns a slope of 9.35, so it would predict production of 129.35 (120 + 9.35), though we know to expect 125, as given by:
=LINEST(C16:C30,B16:B30)
when its result (the slope, 5) is added to the latest available value (120).
The former is too high a predicted increase because it assumes growth was from 0 to 120 in sixteen years, rather than what I have taken to be from 50 to 120 in fifteen.
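To see the two fits side by side on the simplified data (assuming the same B16:C30 layout as in the question), the slopes can be compared directly. The formula

=INDEX(LINEST(C16:C30,B16:B30),1,1)

returns 5 (intercept fitted), whereas

=INDEX(LINEST(C16:C30,B16:B30,FALSE),1,1)

returns roughly 9.35 (intercept forced to 0).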
As has been mentioned by @Byron Wall, Excel has the TREND function that may be used for linear extrapolation to obtain the next (16th) value like so:
=TREND(C16:C30,B16:B30,16)
This directly returns 125 for the simplified sample data.
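If your Excel version has it (2016 onward, which is an assumption about your setup), FORECAST.LINEAR is an equivalent way to write the same extrapolation:

=FORECAST.LINEAR(16,C16:C30,B16:B30)

and this likewise returns 125 for the simplified sample data.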
HOWEVER, all the above assumes growth is linear. Taking, say, Brazilian corn production (million tons) over the period (offset one year), this has been roughly as charted (based on USDA.gov):
The red line is the linear trend and the green a fourth-order polynomial. They happen to both end up at the same place for one year ahead (the hollow bar) but predict different results from the latest six years.
It may be worth charting the data you have, and adding different trend lines, before deciding whether linear extrapolation seems the most promising for forecasting purposes. ‘Wavy’ (cyclical) progress is evident in many datasets.
I have a column with increasing numbers and I want to use FORECAST.LINEAR to predict the missing values between the previous values and the next known value (G2:G6 and G16).
However, when I run =FORECAST.LINEAR(F14,G2:G13,F2:F13) it outputs 1.60, which is not correct if you consider that it should be something greater than 1.62 and less than 1.89.
UPDATE:
I did this calculation and it seems OK:
=IF(AND(G2=0;G3=0;G4<>0;G1<>0)=TRUE;ROUND((G4-G1)/3;2);FALSE)
The correct linear progression for the values your sample shows for 4/1/21 through 15/1/21 should be 1.60424. The final value, 1.62, happens to be a high outlier that is above the linear best fit for the values given. So the function is working correctly. It would not be uncommon for the first or last points to be above or below the linear progression.
The problem is that the function’s range of known Y values ends with 1.62, so the function you entered knows nothing of the 1.89 value.
When I set the problem up to skip a 13th and 14th x and y, but include a 15th value 1.89, I get 1.61 and 1.74 for values 13 and 14, so even when including the 1.89 value, the 13th value is still less than 1.62. It’s a significantly high variation from the linear.
I’m not sure what the best approach is, but this will not likely be an easy problem to solve using this approach. You end up with a circular reference if the Y value you are trying to forecast is within the known-Ys range of the formula. The normal way of solving this problem is to have separate actual columns and forecast columns, and not mix the two.
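If all that is needed is a straight line between the last known value and the next known value, plain interpolation avoids the circular-reference problem entirely. As a minimal sketch, assuming the last known value sits in G13 (with its date in F13) and the next known value in G16 (date in F16), the missing G14 could be filled with:

=G13+(G16-G13)*(F14-F13)/(F16-F13)

This uses only the two bracketing actuals, so the forecast cells never feed back into their own known-Ys range.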
So, the point is, in my dataset I have to create a variable "Moving Avg. Amt paid per sq. ft.", and the logic I need is to calculate the average of the last five values as per the most recent transactions, i.e. the most recent sales by date. But this average should only return a value where it matches the same building and same area variables.
This is what my data looks like
Area ID has three categories. Building number has 5 categories. Date is sorted in ascending order. Now my moving-average variable should calculate the average of the last 5 sales w.r.t. date, but for the same building in the same area. E.g. there are buildings 1 and 2 in area 102: I need my Mov Avg. variable to calculate using conditions when it matches the criteria of building 1 in area 102 for the past five sales, and when it finds building 2 in the building number variable, it should calculate the average of the last 5 sales of that building in area 102.
So my approach to this issue (which is flawed at the moment) was:
I calculate the average of amount paid per sq. foot w.r.t. area & building based on dates using the formula
=AVERAGEIFS($N$2:$N$6547,$D$2:$D$6547,D14,$C$2:$C$6547,C14,$B$2:$B$6547,B14)
but I cannot make this formula work to calculate a moving average whenever it meets the criteria. I tried to offset the point by 5 as well, but the logic is not right, hence it's not working and returns #VALUE! in the cells. The formula I used to offset the above condition is
=AVERAGEIFS((OFFSET(N13,5,,5)),$D$2:$D$6547,D13,$C$2:$C$6547,C13,$B$2:$B$6547,B13)
(These formulae are used in column Q of my data)
I need support from the community, as I am badly stuck on making this data useful and out of ideas to make this work.
Edit 1: I am not sure how I can attach my Excel file here so you may review the dataset. I have uploaded it to a third-party site, for which the link is shared below, so you can view the file in detail.
https://file.io/hlciAHJOHzWA
The expected result is as the instruction said:
"Create a variable called "mov. avg amt. paid per sq ft". For each row, this variable should calculate average amt paid per sq ft for the most recent past five sales (by date) for the same building in the same area."
And my approach to building a logic or formula to make this variable calculate the moving average w.r.t. date for the same building in the same area doesn't seem to work, because there might be some flaws in it.
In Office 365 you could use:
=LET(f,FILTER($N$1:N13,($B$1:B13=B14)*($C$1:C13=C14),""),
c,COUNTA(f),
s,SEQUENCE(5,,c-4),
IFERROR(IF(c<5,SUM(f)/c,SUM(INDEX(f,s))/5),""))
If there are fewer than 5 matches prior to the current sale, it will calculate the average over however many there are. If there are 5 or more matches, it will calculate the average of the last 5 prior to the current sale.
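As a usage sketch (assuming headers in row 1 and the first sale in row 2), the same formula can be entered in Q2 with its references shifted up accordingly and then filled down, so each row only ever looks at earlier sales:

=LET(f,FILTER($N$1:N1,($B$1:B1=B2)*($C$1:C1=C2),""),c,COUNTA(f),s,SEQUENCE(5,,c-4),IFERROR(IF(c<5,SUM(f)/c,SUM(INDEX(f,s))/5),""))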
I am struggling with the application of the Goal Seek function in Excel. I am forecasting production for an oil well; however, we have a target cumulative production expected after, say, 20 years of production. I have produced table columns of monthly production rate and cumulative production. I would like to play with (create sensitivity scenarios for) my expected cumulative production.
Can I use Goal Seek to change the production forecast profile per month by just changing the cumulative production at the end?
Please also advise alternative functions should Goal Seek not be the right tool for this task.
I appreciate your support.
This is really just an example of what @DanK has already mentioned. Say the Column B figures are actual production (in black) and estimates (in blue). The estimates in this case are computed as the number of days in the month times the factor in D1 ("daily production"). To ramp up production so that the total cumulative production (in the example below, for 1-1/2 years rather than all 20 as in the question), presently estimated to be 115,620 units, is instead 150,000, Goal Seek might be applied like so:
whereupon the D1 value (200) should change to 287 (and the total in B19 to 150,000, and all the blue values change also). The principle should work if, say, June 2015 were calculated as 16*D1 rather than 30*D1 to allow for a planned suspension of production. If that fortnight were instead an intervention to add production from another reservoir, anticipated to be 100 per day, then Goal Seek would not adjust the "100 per day" but would adjust a new daily rate of 1.5*D1.
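As a minimal sketch of that setup (the cell addresses here are assumptions, not taken from an actual file): an estimated month in, say, B10 would be

=DAY(EOMONTH(A10,0))*$D$1

where A10 holds the month's date and DAY(EOMONTH(A10,0)) returns the number of days in that month, and the cumulative total in B19 would be =SUM(B2:B18). Goal Seek is then run from Data > What-If Analysis > Goal Seek with "Set cell" B19, "To value" 150000, "By changing cell" D1.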
I have a data set containing three columns, first column represents number of trials, second column represents experimental values, and the third column represents corresponding standard deviation.
With each experiment there is an increment in my experimental values. To get the incremental values, I hold my first value as the reference value, subtract this reference value from each subsequent value, and use the results to create a fourth column of these incremental values.
My problem begins right from here. How do I create a new set of incremental standard deviations for the incremental experimental values I got? My apology if the problem is not well defined but hopefully someone will eventually be able to help me out. Many thanks!
Below is my data set:
Trial   Mean     SD      Incr Mean   Incr SD
1       45.311   4.668    0
2       56.682   2.234   11.371
3       62.197   2.266   16.886
4       70.550   4.751   25.239
5       80.528   4.412   35.217
6       87.453   4.542   42.142
7       89.979   2.185   44.668
8       96.859   3.476   51.548
To be clear, for other readers: your incremental mean is actually the difference between trial 1 and each of the other trials.
Variances add directly when you subtract (or add) independent normal distributions. So you first want to convert each standard deviation to a variance by squaring it; then you can add the variances; and then you can take the square root to turn the sum back into a standard deviation. Note that when using this kind of Pythagorean combination you are assuming that trial 1 is independent of the other trials, so, for example, you cannot have some sample appear in both trials.
Logically this makes sense: your so-called "incremental SD" will always be greater than the individual SDs, since the uncertainty of both distributions contributes towards the uncertainty of the difference.
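In spreadsheet terms, a minimal sketch (assuming Trial, Mean, and SD sit in columns A, B, and C starting at row 2) would put the incremental SD for trial 2 in E3 as:

=SQRT($C$2^2+C3^2)

and fill down. For trial 2 this gives =SQRT(4.668^2+2.234^2), roughly 5.17, which is already larger than either individual SD, as expected.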
We have an application where users enter prices all day. These prices are recorded in a table with a timestamp and then used for producing charts of how the price has moved... Every now and then the user enters a price wrongly (e.g. puts in one zero too many or too few), which somewhat ruins the chart (you get big spikes). We've even put in an extra confirmation dialogue if the price moves by more than 20%, but this doesn't stop them entering wrong values...
What statistical method can I use to analyse the values before I chart them to exclude any values that are way different from the rest?
EDIT: To add some meat to the bones: say the prices are share prices (they are not, but they behave in the same way). You could see prices moving significantly up or down during the day. On an average day we record about 150 prices, and sometimes one or two are way wrong. Other times they are all good...
Calculate and track the standard deviation for a while. After you have a decent backlog, you can disregard the outliers by seeing how many standard deviations away they are from the mean. Even better, if you've got the time, you could use the info to do some naive Bayesian classification.
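As a minimal sketch of that check in spreadsheet terms (the range and the 3-sigma threshold here are assumptions), with a day's prices in A2:A151 a flag column could be:

=IF(ABS(A2-AVERAGE($A$2:$A$151))>3*STDEV.S($A$2:$A$151),"suspect","ok")

filled down, so anything more than three standard deviations from the day's mean is flagged before charting.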
That's a great question, but it may lead to quite a bit of discussion, as the answers could be very varied. It depends on:
how much effort are you willing to put into this?
could some values genuinely differ by +/-20%, or whatever test you invent? So will there always be a need for some human intervention?
and to invent a relevant test I'd need to know far more about the subject matter.
That being said the following are possible alternatives.
A simple test against the previous value (or the mean/mode of the previous 10 or 20 values) would be straightforward to implement.
The next level of complexity would involve some statistical measurement of all values (or the previous x values, or the values from the last 3 months); a normal or Gaussian distribution would enable you to give each value a degree of certainty as to whether it is a mistake vs. accurate. This degree of certainty would typically be expressed as a percentage.
See http://en.wikipedia.org/wiki/Normal_distribution and http://en.wikipedia.org/wiki/Gaussian_function; there are adequate links from these pages to help in programming these, and depending on the language you're using there are likely to be functions and/or plugins available to help with this.
A more advanced method could be to have some sort of learning algorithm that could take other parameters into account (on top of the last x values). A learning algorithm could take the product type or manufacturer into account, for instance, or even monitor the time of day or the user that has entered the figure. This option seems way over the top for what you need, however; it would require a lot of work to code it and also to train the learning algorithm.
I think the second option is the correct one for you. Using the standard deviation (a lot of languages contain a function for this) may be a simpler alternative; this is simply a measure of how far a value has deviated from the mean of the x previous values. I'd put the standard deviation option somewhere between options 1 and 2.
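A minimal sketch of that middle ground in spreadsheet terms (the column, window size, and threshold are assumptions): with prices in column B, a rolling check against the previous 20 values, placed alongside row 22 and filled down, could be:

=IF(ABS(B22-AVERAGE(B2:B21))>3*STDEV.S(B2:B21),"check","ok")

The window length and the 3-standard-deviation threshold are tuning knobs: a tighter threshold catches more typos but also flags more genuine moves.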
You could measure the standard deviation of your existing population and exclude values that lie more than 1 or 2 standard deviations from the mean?
It's going to depend on what your data looks like to give a more precise answer...
Or graph a moving average of prices instead of the actual prices.
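For example (assuming prices in column B), a simple 10-point moving average placed in C11 and filled down would be:

=AVERAGE(B2:B11)

A single wrong entry then shifts the plotted line by only a tenth of its error.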
Quoting from here:
Statisticians have devised several methods for detecting outliers. All the methods first quantify how far the outlier is from the other values. This can be the difference between the outlier and the mean of all points, the difference between the outlier and the mean of the remaining values, or the difference between the outlier and the next closest value. Next, standardize this value by dividing by some measure of scatter, such as the SD of all values, the SD of the remaining values, or the range of the data. Finally, compute a P value answering this question: If all the values were really sampled from a Gaussian population, what is the chance of randomly obtaining an outlier so far from the other values? If the P value is small, you conclude that the deviation of the outlier from the other values is statistically significant.
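As a minimal sketch of the first two steps of that recipe (quantify the deviation, then standardize by the scatter), with the values in A2:A151 the largest standardized deviation, which is the Grubbs test statistic, is:

=MAX(ABS(A2:A151-AVERAGE(A2:A151)))/STDEV.S(A2:A151)

(entered as an array formula with Ctrl+Shift+Enter in older Excel versions). Comparing it against a critical value for your sample size supplies the P-value step.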
Google is your friend, you know. ;)
For your specific question of plotting, and your specific scenario of an average of 1-2 errors per day out of 150, the simplest thing might be to plot trimmed means, or the range of the middle 95% of values, or something like that. It really depends on what value you want out of the plot.
If you are really concerned with the true max and true min of a day's prices, then you have to deal with the outliers as outliers and properly exclude them, probably using one of the outlier tests previously proposed (a data point is x% more than the next point, or than the last n points, or more than 5 standard deviations away from the daily mean). Another approach is to look at what happens after the outlier: if it is an outlier, it will show a sharp upturn followed by a sharp downturn.
If, however, you care about the overall trend, plotting the daily trimmed mean, the median, and the 5% and 95% percentiles will portray the history well.
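As a minimal sketch of those plots (assuming a day's prices in A2:A151): the trimmed mean, discarding 4% of the points in total, split between the two tails, is

=TRIMMEAN(A2:A151,0.04)

and the 5%/95% band comes from =PERCENTILE.INC(A2:A151,0.05) and =PERCENTILE.INC(A2:A151,0.95). With one or two bad entries in 150, the trim alone usually removes them.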
Choose your display methods, and how much outlier detection you need to do, based on the analysis question. If you care about medians or percentiles, the outliers are probably irrelevant.