How to obtain Incremental standard deviations from a set of standard deviations? - statistics

I have a data set containing three columns: the first column is the trial number, the second column holds the experimental values, and the third column holds the corresponding standard deviations.
With each experiment there is an increment in my experimental values. To get the incremental values, I hold my first value as the reference value, subtract it from each subsequent value, and use the results to create a fourth column of incremental values.
My problem begins right here. How do I create a new set of incremental standard deviations for the incremental experimental values I got? My apologies if the problem is not well defined, but hopefully someone will be able to help me out. Many thanks!
Below is my data set,
Trial  Mean    SD     Incr Mean  Incr SD
1      45.311  4.668   0
2      56.682  2.234  11.371
3      62.197  2.266  16.886
4      70.550  4.751  25.239
5      80.528  4.412  35.217
6      87.453  4.542  42.142
7      89.979  2.185  44.668
8      96.859  3.476  51.548

To be clear, for other readers, your incremental mean is actually the difference between trial 1 and the other trials.
Variances add when you subtract (or add) independent random variables. So first convert each standard deviation to a variance by squaring it, then add the two variances, and finally take the square root to turn the result back into a standard deviation. Note that this kind of Pythagorean combination assumes trial 1 is independent of the other trials, so, for example, the same sample cannot appear in both trials.
It makes sense that your so-called "incremental SD" will always be greater than either individual SD, since the uncertainty of both distributions contributes to the uncertainty of the difference.
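As a minimal sketch of the calculation described above, the following Python snippet applies the square-root-of-summed-variances rule to the table from the question. The function name incremental_sd is made up for illustration, and the result is only valid under the independence assumption. Trial 1 is skipped because it is the reference itself.

```python
import math

# Columns copied from the question's table: trial means and their SDs.
means = [45.311, 56.682, 62.197, 70.550, 80.528, 87.453, 89.979, 96.859]
sds = [4.668, 2.234, 2.266, 4.751, 4.412, 4.542, 2.185, 3.476]


def incremental_sd(sd_ref, sd_trial):
    """SD of (trial - reference), assuming the two are independent."""
    return math.sqrt(sd_ref ** 2 + sd_trial ** 2)


ref_mean, ref_sd = means[0], sds[0]

# Trials 2..8 relative to trial 1.
incr_means = [m - ref_mean for m in means[1:]]
incr_sds = [incremental_sd(ref_sd, s) for s in sds[1:]]

for trial, (d_mean, d_sd) in enumerate(zip(incr_means, incr_sds), start=2):
    print(f"trial {trial}: incr mean = {d_mean:.3f}, incr SD = {d_sd:.3f}")
```

In a spreadsheet the same rule would be a formula of the form =SQRT(sd_ref^2 + sd_trial^2) in the fifth column.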

Related

Create exchanges with bounded random parameters and fixed sum to be used in Montecarlo

I have to run a Monte Carlo where, for some products, certain exchanges are related to each other, in the sense that my process can take any of the products as input in different (bounded) proportions but with a fixed sum.
Example:
My product a takes as inputs a total of 10 kg of x, y, and z altogether. x has a uniform distribution from 0 to 4 kg, y from 1 to 6, and z from 3 to 8, and their sum must equal 10. So in every iteration I need a random number for each of my three exchanges, within its bounds, while making sure that the sum is always 10.
I have seen that in stats_array it is possible to set the bounds of the distributions and thus create values in a specified interval, but this would not ensure that the sum of my random vector equals the fixed sum of 10.
I am wondering whether there is already a (relatively) straightforward way to implement this in bw2.
Otherwise, the only feasible way I see is to create all the uncertainty parameters with ParameterVectorLCA, tweak the values in the array for those products that must meet the aforementioned requirements (e.g. with something like this or this), and then use this array of modified parameters to re-run my MC.
We are working on this in https://github.com/PascalLesage/brightway2-presamples, but it isn't ready yet. I don't know of any way to do this currently without hacking something together by subclassing the MonteCarloLCA.
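Independently of Brightway, the sampling problem itself can be sketched in plain Python. The snippet below is not a bw2 or presamples API; it is a simple rejection sampler under the assumption that x and y may be drawn uniformly and z derived from the fixed total. The function name sample_fixed_sum is hypothetical. Note that the accepted draws are uniform over the feasible region, so the marginal distributions of x, y, and z are no longer exactly uniform on their full ranges.

```python
import random

# Bounds (kg) and fixed total taken from the question's example.
BOUNDS = {"x": (0.0, 4.0), "y": (1.0, 6.0), "z": (3.0, 8.0)}
TOTAL = 10.0


def sample_fixed_sum(rng, bounds=BOUNDS, total=TOTAL, max_tries=10_000):
    """Draw x and y uniformly, derive z from the total, retry if z is out of bounds."""
    x_lo, x_hi = bounds["x"]
    y_lo, y_hi = bounds["y"]
    z_lo, z_hi = bounds["z"]
    for _ in range(max_tries):
        x = rng.uniform(x_lo, x_hi)
        y = rng.uniform(y_lo, y_hi)
        z = total - x - y
        if z_lo <= z <= z_hi:
            return {"x": x, "y": y, "z": z}
    raise RuntimeError("no feasible draw found; check the bounds against the total")


rng = random.Random(42)  # fixed seed so the run is repeatable
draws = [sample_fixed_sum(rng) for _ in range(1000)]
print(draws[0])
```

Each draw could then be written into the parameter array for one Monte Carlo iteration, which is the hand-rolled approach the question describes.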

How to generate random numbers within a normal distribution using Excel

I want to use the RAND() function in Excel to generate a random number between 0 and 1.
However, I would like 80% of the values to fall between 0 and 0.2, 90% of the values to fall between 0 and 0.3, 95% of the values to fall between 0 and 0.5, etc.
This reminds me that I took an applied statistics course once upon a time, but not of what was actually in the course...
What is the best way to achieve this result using an Excel formula? Alternatively, what is this kind of statistical calculation called, or are there any other pointers that I can Google around for?
=================
Use case:
I have a single column of meter readings, which I would like to duplicate 7 times (one column for each new month). Each column has 55 000 rows. While the meter readings need to vary for each month, when taken as a time series, each meter number should have 7 realistic readings.
The aim is to produce realistic data to turn into heat maps (i.e. flag outlying meter readings)
I don't think that there is a formula which would fit exactly to your requirements. I would use a very straightforward solution:
Generate 80% of data using =RANDBETWEEN(0,20)/100
Generate 10% of data using =RANDBETWEEN(20,30)/100
Generate 5% of data using =RANDBETWEEN(30,50)/100
and so on
You can easily change the precision of generated data by modifying the parameters, for example: =RANDBETWEEN(0,2000)/10000 will generate data with up to 4 digits after decimal point.
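The band-by-band idea above can be sketched outside Excel as well. In this Python sketch the first three bands follow the percentages from the question; the last band (0.5 to 1.0 with the remaining 5%) is an assumption, since the original only says "and so on". The names BANDS, sample_banded, and share_below are made up for illustration.

```python
import random

# (lower, upper, probability) for each band. The first three bands follow the
# question's targets; the last band (0.5 to 1.0 with 5%) is an assumption.
BANDS = [(0.0, 0.2, 0.80), (0.2, 0.3, 0.10), (0.3, 0.5, 0.05), (0.5, 1.0, 0.05)]


def sample_banded(rng):
    """Pick a band according to its probability, then draw uniformly inside it."""
    ranges = [(lo, hi) for lo, hi, _ in BANDS]
    weights = [p for _, _, p in BANDS]
    lo, hi = rng.choices(ranges, weights=weights, k=1)[0]
    return rng.uniform(lo, hi)


def share_below(values, threshold):
    """Fraction of values at or below the threshold."""
    return sum(v <= threshold for v in values) / len(values)


rng = random.Random(0)
values = [sample_banded(rng) for _ in range(100_000)]
print(share_below(values, 0.2), share_below(values, 0.3), share_below(values, 0.5))
```

The three printed shares should land close to the cumulative targets of 80%, 90%, and 95%.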
UPDATE
Use a normal distribution for the use case, for example:
=NORMINV(RAND(), 20, 5)
where 20 is a mean value and 5 is a standard deviation.

Trying to either pull or recreate trendline data using LINEST

I am trying to recreate the formula from a trendline on a graph. Basically, my company is trying to predict the corn yields for next year. All of the actual programmers are out for the week, so they passed it on to me (web developer :D). I've attempted the LINEST formula multiple times with no luck.
In column B I have the years (1-15, trying to project 16) and in column C I have the actual trend data. I am probably doing this wrong, however.
EX =LINEST(C16:C30,B16:B30,FALSE,FALSE)
Any help would be appreciated. just tell me if you need the actual file or more information. Thanks in advance!
The fourth argument, concerning the return of additional regression statistics, is optional and is taken as FALSE if omitted, so seems not required for your purposes. The third argument, concerning the intercept with the Y-axis (the value of y when x is 0), is also optional but taken as TRUE if omitted. In your case TRUE seems appropriate so the third parameter seems not required for your purposes.
With your data spanning 15 years, if ending with the current year, it conveniently covers 2001-2015 inclusive and has no information about the value of y (production) in year 2000 (i.e. when x is 0), but this is unlikely to have been 0, as would be assumed if the third argument is FALSE.
In a simplified example, take production of 50 in 2001, increasing by an (unrealistically!) constant 5 each year. By 2015 this has reached 120, so for 2016 at the same rate of increase production of 125 should be expected. Your formula returns 9.35 so would predict production of 129.35, though we know to expect 125, as given by:
=LINEST(C16:C30,B16:B30)
when added to the latest available (120).
The former is too high a predicted increase because it assumes growth was from 0 to 120 in sixteen years, rather than what I have taken to be from 50 to 120 in fifteen.
As has been mentioned by @Byron Wall, Excel has the TREND function that may be used for linear extrapolation to obtain the next (16th) value like so:
=TREND(C16:C30,B16:B30,16)
This directly returns 125 for the simplified sample data.
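To make the difference between the two LINEST calls concrete, here is a small Python sketch on the simplified data above (50 in year 1, rising by 5 per year). It is not how Excel computes these values internally; it just applies the ordinary least-squares formulas with and without an intercept. The function names are made up for illustration.

```python
# Simplified data from the answer: 50 in year 1, rising by 5 per year to 120.
years = list(range(1, 16))
production = [50 + 5 * (x - 1) for x in years]


def fit_with_intercept(xs, ys):
    """Ordinary least squares y = slope * x + intercept (third LINEST argument TRUE)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def slope_through_origin(xs, ys):
    """Least squares y = slope * x with the intercept forced to 0 (third argument FALSE)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


slope, intercept = fit_with_intercept(years, production)
forced_slope = slope_through_origin(years, production)
year_16 = slope * 16 + intercept  # same idea as TREND(..., 16)

print(f"slope with intercept:   {slope:.2f}")         # 5.00
print(f"slope through origin:   {forced_slope:.2f}")  # 9.35
print(f"extrapolated year 16:   {year_16:.2f}")       # 125.00
```

Forcing the line through the origin inflates the slope because the fit has to explain the starting level of 50 as growth.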
HOWEVER, all the above assumes growth is linear. Taking say Brazilian corn production (Million tons) over the period (offset one year) this has been roughly (based on USDA.gov):
The red line is the Linear trend and green a fourth order Polynomial. They happen both to end up at the same place for one year ahead (the hollow bar) but predict different results from the latest six years:
It may be worth charting the data you have, and adding different trend lines, before deciding whether linear extrapolation seems the most promising for forecasting purposes. ‘Wavy’ (cyclical) progress is evident in many datasets.

How do I use a standard distribution to guess where the value falls in the future?

I have a mean value x and I want to model it into the future. I want to output a value of what it could be in 6 months. Assuming the value follows a normal distribution and we have the standard deviation, how do I randomize the value x while following a normal distribution? I'm doing this in Excel, but just understanding it would help too! Basically I want to produce numbers within 1 standard deviation 68% of the time, within 2 standard deviations 95% of the time, etc.
You can use the Excel function NORMINV to convert a random input from RAND() into a normally distributed value.
=NORMINV(RAND(),Mean,Std Dev)
i.e. if you repeat this many times, save and analyze the results, you'll see a bell curve over the input Mean value.
Does that get you started?
The tricky bit comes when you come up with the formula to predict what a value will be in the future using this.
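For anyone who wants to see why the NORMINV(RAND(), Mean, Std Dev) formula works, it is inverse-transform sampling: a uniform random number is pushed through the inverse of the normal cumulative distribution. The Python sketch below mirrors that with the standard library; the mean of 100 and standard deviation of 15 are made-up values, and norminv_rand is a hypothetical helper name.

```python
import random
from statistics import NormalDist

MEAN, SD = 100.0, 15.0  # hypothetical inputs, not from the question


def norminv_rand(rng, mean, sd):
    """Equivalent in spirit to Excel's NORMINV(RAND(), mean, sd)."""
    u = rng.random()
    while u == 0.0:  # inv_cdf needs 0 < u < 1, and random() can return 0.0
        u = rng.random()
    return NormalDist(mean, sd).inv_cdf(u)


rng = random.Random(1)
samples = [norminv_rand(rng, MEAN, SD) for _ in range(50_000)]

within_1 = sum(abs(s - MEAN) <= SD for s in samples) / len(samples)
within_2 = sum(abs(s - MEAN) <= 2 * SD for s in samples) / len(samples)
print(f"within 1 SD: {within_1:.3f}, within 2 SD: {within_2:.3f}")
```

With enough draws the two shares should come out near the 68% and 95% figures from the question.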

Compute statistical significance with Excel

I have 2 columns and multiple rows of data in excel. Each column represents an algorithm and the values in rows are the results of these algorithms with different parameters. I want to make statistical significance test of these two algorithms with excel. Can anyone suggest a function?
As a result, it will be nice to state something like "Algorithm A performs 8% better than Algorithm B with .9 probability (or 95% confidence interval)"
The wikipedia article explains accurately what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task but I failed to find a scientific measurement function.
Any advice over a built-in function of excel or function snippets are appreciated.
Thanks..
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are merely real numbers between 1-100 (they are percentage values). As each row represents a different parameter, values in a row represents an algorithm's result for this parameter. The results do not depend on each other.
When I take average of all values for Algorithm A and Algorithm B, I see that the mean of all results that Algorithm A produced are 10% higher than Algorithm B's. But I don't know if this is statistically significant or not. In other words, maybe for one parameter Algorithm A scored 100 percent higher than Algorithm B and for the rest Algorithm B has higher scores but just because of this one result, the difference in average is 10%.
And I want to do this calculation using just excel.
Thanks for the clarification. In that case you want to do an independent sample T-Test. Meaning you want to compare the means of two independent data sets.
Excel has a function TTEST, that's what you need.
For your example you should probably use two tails and type 2.
The formula outputs a p-value: the probability of seeing a difference in means at least this large if the two datasets actually had the same mean. Concluding that the datasets differ when they do not is the alpha (type I) error, so the lower the p-value, the stronger the evidence that your sets are different.
You should only accept the difference between the two datasets if the value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that the t-test needs at least around 30 values per dataset to be reliable enough, and that the type 2 test assumes equal variances of the two datasets. If the variances are not equal, you should use the type 3 test.
http://depts.alverno.edu/nsmt/stats.htm
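To show what the type 2 and type 3 variants of TTEST compute, here is a rough Python sketch of the two test statistics and their degrees of freedom. It stops short of the p-value, which needs the t distribution (Excel's TTEST returns it directly). The scores below are invented purely for illustration, and the function names are hypothetical.

```python
import math
from statistics import mean, variance


def t_equal_var(a, b):
    """Two-sample t statistic with pooled variance (TTEST type 2)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / se, na + nb - 2


def t_unequal_var(a, b):
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom (TTEST type 3)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Invented percentage scores, one value per parameter setting.
algo_a = [72.0, 65.5, 80.1, 77.3, 69.8, 74.2]
algo_b = [61.4, 70.2, 66.9, 58.7, 63.0, 68.5]

print("type 2:", t_equal_var(algo_a, algo_b))
print("type 3:", t_unequal_var(algo_a, algo_b))
```

A larger absolute t value at a given number of degrees of freedom corresponds to a smaller p-value.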
