My 'y'-axis in my normal distribution curve is over 1. Is this okay? - statistics

I am trying to show the normal distribution of two sets of data. My goal is to see if dataset 1 differs from dataset 2 (the variable is total eroded area in m²). When I make the normal distribution curves, I am trying to fix or at least understand two problems:
Firstly, I am not sure how to interpret the negative values, as total eroded area (my variable) cannot be negative.
Secondly, I am not sure what the y-values greater than 1 mean.
Dataset 1 is 0.180, 0.063, 0.65, 0.43 and Dataset 2 is 0.148, 0.106, 0.39, 0.32, and the resulting normal distribution graph (based on the mean and standard deviation) is shown below.
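On the second point, the y-axis of a normal curve is a probability density, not a probability: the area under the curve is 1, but the height at the mean is 1/(σ√(2π)), which exceeds 1 whenever σ is below about 0.399. A minimal check in Python (assuming scipy is available, using the values above):

import numpy as np
from scipy.stats import norm

d1 = np.array([0.180, 0.063, 0.65, 0.43])
d2 = np.array([0.148, 0.106, 0.39, 0.32])

for name, d in [("Dataset 1", d1), ("Dataset 2", d2)]:
    mu, sd = d.mean(), d.std(ddof=1)       # sample mean and standard deviation
    peak = norm.pdf(mu, loc=mu, scale=sd)  # curve height at the mean
    print(f"{name}: mean={mu:.3f}, sd={sd:.3f}, peak density={peak:.3f}")

Both standard deviations here are well under 0.399, so both fitted curves peak above 1.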

Related

Bollinger bands versus statistics: is 1 standard deviation supposed to be split into two halves by its mean, or do the top and bottom bands each sit a whole standard deviation from the mean?

I have a question about how Bollinger bands are plotted in relation to statistics. In statistics, once a standard deviation is calculated from the mean of a set of numbers, shouldn't interpreting one standard deviation be done by dividing this number in half and plotting each half above and below the mean? By doing so, you can then determine whether or not the data points fall within this one standard deviation.
Then, correct me if I am wrong, but aren't Bollinger bands NOT calculated this way? Instead, they take one standard deviation (if you have set the multiplier to 1) and plot the WHOLE value both above and below the mean (not splitting it in two), thereby doubling the width of the band.
Bollinger bands loosely state that 68% of data falls within the first band, i.e. within 1 standard deviation (loosely, because the empirical rule in statistics requires that the distribution be normal, which stock prices most often are not). However, if this empirical rule comes from statistics, where 1 standard deviation is split in half, then applying a 68% probability to an entire Bollinger band is wrong. Is this correct?
You can modify the standard deviation multiplier to suit your purpose; you can use 0.5, for example.
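For what it's worth, the empirical rule already refers to the whole interval from one standard deviation below the mean to one standard deviation above it, which is exactly how the bands are drawn. A minimal sketch of the band calculation (Python with pandas; the price series is made up):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(0, 1, 250).cumsum())  # made-up random walk

window, k = 20, 1  # k = 1 matches the setting discussed in the question
mid = prices.rolling(window).mean()
sd = prices.rolling(window).std()
upper = mid + k * sd  # the whole k*sd is plotted above the mean...
lower = mid - k * sd  # ...and the whole k*sd below it

Under a normality assumption, roughly 68% of points would fall between lower and upper for k = 1; the 68% applies to the full band, not to half of it.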

Normalisation or Standardisation for detecting outlier?

When should I use min-max scaling (normalisation) and when standardisation (the z-score) for data pre-processing?
I know that normalisation brings the range of a feature down to [0, 1], and the z-score brings it down to roughly [-3, 3], but I am unsure which of the two techniques to use for detecting outliers in the data.
Let us briefly agree on the terms:
The z-score tells us how many standard deviations a given element of a sample is away from the mean.
Min-max scaling is the method of rescaling a range of measurements to the interval [0, 1].
By those definitions, the z-score usually spans an interval much larger than [-3, 3] if your data follows a long-tailed distribution. On the other hand, plain normalisation does indeed limit the range of the possible outcomes, but it will not help you find outliers, since it just bounds the data.
What you need for outlier detection are thresholds above or below which you consider a data point to be an outlier. Many programming languages offer violin plots or box plots, which nicely show your data distribution. The methods behind these plots implement a common choice of thresholds:
A box-and-whisker plot displays the quartiles, and the band inside the box is always the second quartile (the median). The ends of the whiskers, however, can represent several possible alternative values, among them:
the minimum and maximum of all of the data [...]
one standard deviation above and below the mean of the data
the 9th percentile and the 91st percentile
the 2nd percentile and the 98th percentile.
All data points outside the whiskers of the box plot are plotted as points and considered outliers.
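As a concrete illustration, a minimal Python sketch (the numbers are made up) contrasting the two scalings with a threshold rule; the 1.5·IQR whisker rule used here is one common default:

import numpy as np

x = np.array([1.2, 0.8, 1.1, 0.9, 1.0, 5.0])  # 5.0 is an obvious outlier

z = (x - x.mean()) / x.std(ddof=1)              # z-score: distance from mean in sds
x_scaled = (x - x.min()) / (x.max() - x.min())  # min-max: bounded to [0, 1]

# Tukey's rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers)  # [5.]

Note that after min-max scaling the outlier is still there (mapped to 1.0); scaling alone never removes it, which is why you need a threshold rule on top.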

Applying Quadratic Fit to Unknowns

I'm trying to build a spreadsheet to find a quadratic fit for a set of control data, then apply that fit to a set of unknowns to get a calculated concentration.
For my quadratic curve calculation, I have this:
=LINEST(F28:F33,A28:A33^{1,2},TRUE,TRUE)
An example of relevant control data (where 0-40 would be found in the A column, and the 0.001-0.575 in the F column) is:
0 0.001
2 0.030
5 0.076
10 0.156
20 0.310
40 0.575
This is giving me a curve solution that matches the software currently being used to analyze the data (SoftMax 4.7):
A: -5.1E-05
B: 0.016
C: -0.002
Using this formula to apply the curve to data (where E16 represents any individual datapoint I'm solving for and Blank1 is a set of negative controls):
=(-CurveB+SQRT((CurveB^2)-(4*CurveA*(CurveC-(E16-AVERAGE(Blank1))))))/(2*CurveA)
However, when I apply the curve using this formula to a set of data, e.g.:
0.275 0.269 0.266
0.217 0.193 0.194
0.011 0.013 0.011
0.004 0.006 0.003
I get output:
17.835 17.426 17.221
13.922 12.333 12.399
0.796 0.919 0.796
0.369 0.491 0.308
Compared to SoftMax's output:
17.827 17.405 17.215
13.918 12.333 12.393
0.785 0.950 0.797
0.353 0.487 0.298
My problem is, I can't find enough documentation on how SoftMax applies the quadratic fit to the data, so I don't know which set of results is more accurate. I've checked whether it's a rounding error (i.e. SoftMax rounding the displayed results but calculating with unrounded figures, or possibly the other way around). I've also tried running the whole mess through Solver, letting Excel change the curve variables and the blank factor (I also tried removing the blank factor, and adding independent blank factors for each column), minimising the total variance from the SoftMax results, but I cannot find a solution that reproduces the SoftMax output (or even gets closer than about 0.58% average variance from it).
Can anybody tell me whether this is an error in my calculations (I'm specifically skeptical of my formula for applying the curve to the data; is there a more graceful way to apply a quadratic fit to a set of unknowns in Excel?) or an error in the calculations produced by the other program, e.g. solving using approximations or rounded values somewhere?
Summary: I think you're seeing rounding errors.
Details: I used your Excel equations and the data provided and reproduced your curve parameters, so that part seems OK. I then plugged the SoftMax Pro output (17.827, 17.405, 17.215, 13.918, ...) and your output (17.835, 17.426, 17.221, 13.922, ...) into y = Ax² + Bx + C and calculated the y-values. The pair-wise differences were in the 4th decimal place or smaller (the biggest was about 0.0005 in absolute value), which is consistent with a rounding/truncation of the x-data that is hidden from you.
Final comment: I suspect you should not subtract blanks. The standard curve appears to have been created from data that was not blank-subtracted (at zero input the output is non-zero), so it seems you need to treat the samples the same way as the standards. It may not make much difference, though.
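If a cross-check outside Excel helps, here is a minimal Python sketch (numpy only) that fits the same quadratic to the control data from the question and inverts it the same way; the Blank1 average is not given above, so the blank defaults to zero here:

import numpy as np

# control data from the question: concentration (x) vs reading (y)
x = np.array([0, 2, 5, 10, 20, 40], dtype=float)
y = np.array([0.001, 0.030, 0.076, 0.156, 0.310, 0.575])

A, B, C = np.polyfit(x, y, 2)  # coefficients of A*x^2 + B*x + C

def concentration(reading, blank=0.0):
    # invert the fit: solve A*x^2 + B*x + C = reading - blank for x
    yc = reading - blank
    return (-B + np.sqrt(B**2 - 4 * A * (C - yc))) / (2 * A)

print(np.round([A, B, C], 6))  # should match the LINEST coefficients
print(concentration(0.275))    # compare against the outputs above once the
                               # actual blank average is supplied

If the unrounded coefficients here agree with LINEST but the final concentrations still differ from SoftMax in the 3rd decimal place, that again points at hidden rounding on SoftMax's side.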
Hope that helps.

How to use Excel column chart for datasets that have very different scales

There are 2 datasets with values in the interval [0, 1]. I need to visualize these 2 datasets in Excel as a column chart. The problem is that some data points have values like 0.0001 or 0.0002, while other data points have values like 0.8 or 0.9. The difference is huge, and therefore it's impossible to see the data points with small values. What could be the solution? Should I use a logarithmic scale? I'd appreciate any example.
Two possible ways below:
Graph the smaller dataset as a second series against a right-hand Y axis, with the same ratio from min to max as the left-hand series (see the sketch after this list).
Multiply the smaller dataset by 1000 and compare the multiplied dataset to the larger one.
Note that a log scale would put all your values on the negative part of the axis, since you are working with fractions (the log of a number below 1 is negative), so that isn't really an option.
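For the first approach, a minimal non-Excel sketch of the secondary-axis mechanics (Python with matplotlib; the values are made up), in case it helps to see the idea:

import numpy as np
import matplotlib.pyplot as plt

labels = ["a", "b", "c", "d"]
large = [0.8, 0.9, 0.85, 0.75]             # values near 1
small = [0.0001, 0.0002, 0.00015, 0.0003]  # values near 0

pos = np.arange(len(labels))
fig, ax1 = plt.subplots()
ax1.bar(pos - 0.2, large, width=0.4)
ax2 = ax1.twinx()  # secondary y-axis on the right
ax2.bar(pos + 0.2, small, width=0.4, color="tab:orange")
ax1.set_xticks(pos)
ax1.set_xticklabels(labels)
ax1.set_ylabel("large values")
ax2.set_ylabel("small values")
plt.show()

In Excel the equivalent is Format Data Series -> Plot Series On -> Secondary Axis.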

From one histogram, create a new histogram from just a mean or median?

Suppose I have a list of values that I can histogram and calculate descriptive statistics on, such as the mean, median, max, and standard deviation. Perhaps this histogram is bimodal or right-skewed. Let's call this group of data "DataSet1".
Suppose I had just the mean or median of another set of data; let's call that DataSet2. I do not have the raw data for DataSet2, just its median or mean. There is a strong belief that DataSet1 and DataSet2 would show the same variability in values.
If I knew just that single value, either the mean or the median, could I apply the descriptive statistics from DataSet1 to create a new histogram that mirrors the bimodal or right-skewed behaviour of DataSet1?
Thanks
Dan
Alternative intent:
I have 3 years of historical data, and the data definitely has a day-of-week trend to it. I am using a Python API to apply seasonal ARIMA to forecast the next 7 days from the 3 years of historical data. The predicted value is great, but it is only one value. I would like to use that predicted value as the "mean" and create a histogram from the variability of values shown to exist historically for that day of the week.
So, today is Thursday. Let's say I predict tomorrow to have a value of 78.6.
I want to sample potential values for tomorrow based on a mean of 78.6, but with variability similar to that shown to exist on all historical Fridays.
If I look at historical Fridays, perhaps they show left-skewed behaviour.
So when I sample with a mean of 78.6, if I sampled 100 times, the sampled values, plotted as a histogram, would also skew to the left.
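A minimal sketch of that sampling idea (Python with numpy; the Friday values below are made up, in practice you would filter your 3 years of history by weekday): recenter the historical values on the forecast, then bootstrap from them, so the sampled histogram inherits the historical shape:

import numpy as np

rng = np.random.default_rng(42)

# made-up historical Friday values; substitute your real weekday slice
fridays = np.array([70.1, 72.3, 68.5, 74.0, 71.2, 65.0, 73.8, 69.9])

predicted_mean = 78.6  # tomorrow's ARIMA point forecast

# shift the historical distribution so it is centred on the forecast,
# keeping its spread and skew intact
shifted = fridays - fridays.mean() + predicted_mean

# bootstrap 100 samples; histogram them to see the inherited shape
samples = rng.choice(shifted, size=100, replace=True)
print(samples.mean(), np.percentile(samples, [5, 50, 95]))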
Hope that helps.
