Fitting the cumulative distribution function using MATLAB - statistics

How can I get a better fit to the following data when I plot it as a cumulative distribution function?
Here is my code; the data is plotted using cdfplot:
clear all;
close all;
y = [23 23 23 -7.59 23 22.82 22.40 13.54 -3.97 -4.00 8.72 23 23 10.56 12.19 23 9.47 5.01 23 23 23 23 22.85 23 13.61 -0.77 -14.15 23 12.91 23 20.88 -9.42 23 -1.37 1.83 14.35 -8.30 23 15.17 23 5.01 22.28 23 21.91 21.68 -4.76 -13.50 14.35 23];
cdfplot(y)

There is no definite answer to your question; it is too broad and belongs mainly to statistics. Before doing any computation you should answer some questions:
is there a specific distribution type which the data follow?
is there any theoretical justification to select a distribution type and discard others?
do I need a parametric or a non-parametric distribution?
if no specific distribution type can be selected, then what set of distributions should I investigate?
how should I compare the distributions, i.e. which goodness-of-fit measures should I use?
what fitting method should I use, e.g. maximum likelihood, method of moments, Bayesian, etc.?
how should I treat uncertainties?
how and for what do I want to use the results?
etc.
Without answering these questions it is meaningless to talk about fitting a distribution to data.
I will give you an example of how to do the fit in MATLAB using the maximum-likelihood method, just for illustration, but I would strongly discourage you from using it without considering the above points.
Since I have no additional background information about the nature of the data, a normal and a kernel distribution are fitted to illustrate one parametric and one non-parametric distribution.
cdfplot(y)
hold on
xx = -20:40;
%normal distribution
pd_norm = fitdist(y', 'normal');
F_norm = normcdf(xx, pd_norm.mu, pd_norm.sigma);
plot(xx, F_norm, 'r')
%kernel distribution
pd_kernel1 = fitdist(y', 'kernel', 'Kernel', 'normal', 'Width', 6);
F_kernel1 = cdf(pd_kernel1, xx);
plot(xx, F_kernel1, 'g')
%kernel distribution
pd_kernel2 = fitdist(y', 'kernel', 'Kernel', 'normal', 'Width', 2);
F_kernel2 = cdf(pd_kernel2, xx);
plot(xx, F_kernel2, 'black')
legend('ecdf', 'normal', 'kernel1', 'kernel2', 'Location', 'NorthWest')

You can try
h = cdfplot(y)
cftool( get(h,'XData'), get(h,'YData') )

Related

Welford's online variance algorithm, but for Interquartile Range?

Short Version
Welford's online algorithm lets you keep a running value for variance, meaning you don't have to keep all the values (e.g. in a memory-constrained system).
Is there something similar for Interquartile Range (IQR)? An online algorithm that lets me know the middle 50% range without having to keep all historical values?
Long Version
Keeping a running average of data, where you are memory constrained, is pretty easy:
Double sum
Int64 count
And from this you can compute the mean:
mean = sum / count
This allows hours, or years, of observations to quietly be collected, but only take up 16 bytes.
Welford's Algorithm for Variance
Normally when you want the variance (or standard deviation), you have to have all your readings, because you have to compute reading - mean for every previous reading:
Double sumOfSquaredError = 0;
foreach (Double reading in Readings)
    sumOfSquaredError += (reading - mean) * (reading - mean); // squared deviation from the mean
Double variance = sumOfSquaredError / count;
Which is why it was nice when Welford came up with an online algorithm for computing variance of a stream of readings:
It is often useful to be able to compute the variance in a single pass, inspecting each value xi only once; for example, when the data is being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation.
The algorithm for adding a new value to the running variance is:
void addValue(Double newValue) {
    // Guard: on the very first value there is no previous mean yet.
    Double oldMean = (count > 0) ? sum / count : newValue;
    sum += newValue;
    count += 1;
    Double newMean = sum / count;
    if (count > 1)
        variance = ((count - 2) * variance + (newValue - oldMean) * (newValue - newMean)) / (count - 1);
    else
        variance = 0;
}
How about an online algorithm for Interquartile Range (IQR)?
Interquartile range (IQR) is another measure of the spread of data. It tells you how wide the middle 50% of the data is:
And from that people then generally draw an IQR box plot:
Or at the very least, have the values Q1 and Q3.
Is there a way to calculate the Interquartile Range without having to keep all the recorded values?
In other words:
Is there something like Welford's online variance algorithm, but for Interquartile Range?
Knuth, Seminumerical Algorithms
You can find Welford's algorithm explained in Knuth's 2nd volume, Seminumerical Algorithms.
(just in case anyone thought this isn't computer science or programming related)
Research Effort
Stackoverflow: Simple algorithm for online outlier detection of a generic time series
Stats: Simple algorithm for online outlier detection of a generic time series
Online outlier detection for data streams (IDEAS '11: Proceedings of the 15th Symposium on International Database Engineering & Applications, September 2011, Pages 88–96)
Stats: Robust outlier detection in financial timeseries
Stats: Online outlier detection
Distance-based outlier detection in data streams (Proceedings of the VLDB Endowment, Volume 9, Issue 12, August 2016, pp 1089–1100) pdf
Online Outlier Detection Over Data Streams (Hongyin Cui, Masters Thesis, 2005)
There's a useful paper by Ben-Haim and Tom-Tov, published in 2010 in the Journal of Machine Learning Research:
A Streaming Parallel Decision Tree Algorithm
Short PDF: https://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=0667E2F91B9E0E5387F85655AE9BC560?doi=10.1.1.186.7913&rep=rep1&type=pdf
Full paper: https://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
It describes an algorithm to automatically create a histogram from online (streaming) data that does not require unlimited memory:
you add a value to the history
the algorithm dynamically creates buckets
including the bucket sizes
The paper is kind of dense (as all math papers are), but the algorithm is fairly simple.
Let's start with some sample data. For this answer I'll use digits of pi as the source of incoming floating-point numbers:
Value
3.14159
2.65358
9.79323
8.46264
3.38327
9.50288
4.19716
9.39937
5.10582
0.97494
4.59230
7.81640
6.28620
8.99862
8.03482
5.34211
...
I will define that I want 5 bins in my histogram.
We add the first value (3.14159), which causes the first bin to be created:
Bin        Count
3.14159    1
The bins in this histogram don't have any width; they are purely a point:
And then we add the 2nd value (2.65358) to the histogram:
And we continue adding points, until we reach our arbitrary limit of 5 "buckets":
That is all 5 buckets filled.
We add our 6th value (9.50288) to the histogram, except that means we now have 6 buckets; but we decided we only want five:
Now is where the magic starts
In order to do the streaming part, and limit memory usage to less-than-infinity, we need to merge some of the "bins".
Look at each pair of bins, left to right, and see which two are closest together. In our case it is these two buckets:
These two buckets are merged, and given a new x value dependent on their relative heights (i.e. counts):
ynew = yleft + yright = 1 + 1 = 2
xnew = xleft × (yleft/ynew) + xright×(yright/ynew) = 3.14159×(1/2) + 3.38327×(1/2) = 3.26243
And now we repeat.
add a new value
merge the two neighboring buckets that are closest to each other
deciding on the new x position based on their relative heights
Eventually giving you (although I screwed it up as I was doing it manually in Excel for this answer):
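To make that add-and-merge loop concrete, here is a minimal Python sketch of it. The class name and structure are my own illustration under the assumptions of this answer (bins kept as sorted (centroid, count) pairs), not the paper's reference implementation:
import bisect

class StreamingHistogram:
    # Illustrative sketch of the add/merge step described above.
    # Bins are [centroid, count] pairs kept in sorted order.
    def __init__(self, max_bins=5):
        self.max_bins = max_bins
        self.bins = []

    def add(self, value):
        # Each new value starts as its own one-element bin.
        idx = bisect.bisect(self.bins, [value, 0])
        self.bins.insert(idx, [value, 1])
        # If we exceeded the budget, merge the two closest neighbouring bins.
        if len(self.bins) > self.max_bins:
            gaps = [self.bins[i + 1][0] - self.bins[i][0]
                    for i in range(len(self.bins) - 1)]
            i = gaps.index(min(gaps))
            (x1, y1), (x2, y2) = self.bins[i], self.bins[i + 1]
            # The new centroid is the count-weighted average of the two centroids.
            self.bins[i:i + 2] = [[(x1 * y1 + x2 * y2) / (y1 + y2), y1 + y2]]

h = StreamingHistogram(max_bins=5)
for v in [3.14159, 2.65358, 9.79323, 8.46264, 3.38327, 9.50288]:
    h.add(v)
print(h.bins)   # the 3.14159 and 3.38327 bins have merged into 3.26243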
Practical Example
I wanted a histogram of 20 buckets. This allows me to extract some useful statistics. For a histogram of 11 buckets, containing 38,000 data points, it only requires 40 bytes of memory:
With these 20 buckets, I can now compute the probability density function (PDF):
Bin            Count    PDF
2.113262834    3085     5.27630%
6.091181608    3738     6.39313%
10.13907062    4441     7.59548%
14.38268188    5506     9.41696%
18.92107481    6260     10.70653%
23.52148965    6422     10.98360%
28.07685659    5972     10.21396%
32.55801082    5400     9.23566%
36.93292359    4604     7.87426%
41.23715698    3685     6.30249%
45.62006198    3136     5.36353%
50.38765223    2501     4.27748%
55.34957161    1618     2.76728%
60.37095192    989      1.69149%
65.99939004    613      1.04842%
71.73292736    305      0.52164%
78.18427775    140      0.23944%
85.22261376    38       0.06499%
90.13115876    12       0.02052%
96.1987941     4        0.00684%
And with the PDF, you can now calculate the Expected Value (i.e. mean):
Bin            Count    PDF          EV
2.113262834    3085     5.27630%     0.111502092
6.091181608    3738     6.39313%     0.389417244
10.13907062    4441     7.59548%     0.770110873
14.38268188    5506     9.41696%     1.354410824
18.92107481    6260     10.70653%    2.025790219
23.52148965    6422     10.98360%    2.583505901
28.07685659    5972     10.21396%    2.86775877
32.55801082    5400     9.23566%     3.00694827
36.93292359    4604     7.87426%     2.908193747
41.23715698    3685     6.30249%     2.598965665
45.62006198    3136     5.36353%     2.446843872
50.38765223    2501     4.27748%     2.155321935
55.34957161    1618     2.76728%     1.531676732
60.37095192    989      1.69149%     1.021171415
65.99939004    613      1.04842%     0.691950026
71.73292736    305      0.52164%     0.374190474
78.18427775    140      0.23944%     0.187206877
85.22261376    38       0.06499%     0.05538763
90.13115876    12       0.02052%     0.018498245
96.1987941     4        0.00684%     0.006581183
Which gives:
Expected Value: 27.10543
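For reference, the expected-value calculation above is just a count-weighted average of the bin centroids; a tiny Python helper (my own naming, assuming the same (centroid, count) pairs as in the sketch above) would look like:
def histogram_mean(bins):
    # The PDF of each bin is count/total, so the mean is sum(centroid * PDF).
    total = sum(count for _, count in bins)
    return sum(centroid * count for centroid, count in bins) / total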
Cumulative Distribution Function (CDF)
We can now also get the cumulative distribution function (CDF):
Bin            Count    PDF          EV        CDF
2.113262834    3085     5.27630%     0.11150   5.27630%
6.091181608    3738     6.39313%     0.38942   11.66943%
10.13907062    4441     7.59548%     0.77011   19.26491%
14.38268188    5506     9.41696%     1.35441   28.68187%
18.92107481    6260     10.70653%    2.02579   39.38839%
23.52148965    6422     10.98360%    2.58351   50.37199%
28.07685659    5972     10.21396%    2.86776   60.58595%
32.55801082    5400     9.23566%     3.00695   69.82161%
36.93292359    4604     7.87426%     2.90819   77.69587%
41.23715698    3685     6.30249%     2.59897   83.99836%
45.62006198    3136     5.36353%     2.44684   89.36188%
50.38765223    2501     4.27748%     2.15532   93.63936%
55.34957161    1618     2.76728%     1.53168   96.40664%
60.37095192    989      1.69149%     1.02117   98.09814%
65.99939004    613      1.04842%     0.69195   99.14656%
71.73292736    305      0.52164%     0.37419   99.66820%
78.18427775    140      0.23944%     0.18721   99.90764%
85.22261376    38       0.06499%     0.05539   99.97264%
90.13115876    12       0.02052%     0.01850   99.99316%
96.1987941     4        0.00684%     0.00658   100.00000%
And the CDF is where we can start to get the values I want.
The median (50th percentile), where the CDF reaches 50%:
From interpolation of the data, we can find the x value where the CDF is 50%:
Bin            Count    PDF          EV        CDF
18.92107481    6260     10.70653%    2.02579   39.38839%
23.52148965    6422     10.98360%    2.58351   50.37199%
t = (50 - 39.38839) / (50.37199 - 39.38839) = 10.61161 / 10.9836 = 0.96613
xmedian = (1 - t)*18.92107481 + t*23.52148965 = 23.366
So now we know:
Expected Value (mean): 27.10543
Median: 23.366
My original ask was the IQR: the x values that bound the middle 50% of the data, from the 25% point to the 75% point. Once again we can interpolate the CDF:
Bin            Count    PDF          EV        CDF
10.13907062    4441     7.59548%     0.77011   19.26491%
12.7235        -        -            -         25.00000%
14.38268188    5506     9.41696%     1.35441   28.68187%
23.366         -        -            -         50.00000%   (median)
23.52148965    6422     10.98360%    2.58351   50.37199%   (mode)
27.10543       -        -            -         -           (mean)
28.07685659    5972     10.21396%    2.86776   60.58595%
32.55801082    5400     9.23566%     3.00695   69.82161%
35.4351        -        -            -         75.00000%
36.93292359    4604     7.87426%     2.90819   77.69587%
This can be continued to get other useful stats (a small interpolation sketch follows the list below):
Mean (aka average, expected value)
Median (50%)
Middle quintile (middle 20%)
IQR (middle 50% range)
middle 3 quintiles (middle 60%)
1 standard deviation range (middle 68.26%)
middle 80%
middle 90%
middle 95%
2 standard deviations range (middle 95.45%)
middle 99%
3 standard deviations range (middle 99.74%)
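As a rough sketch of how these percentile-based statistics fall out of the CDF, the following Python function linearly interpolates between bin centroids, mirroring the hand calculation done for the median above. It is my own simplified stand-in (again assuming (centroid, count) pairs), not the interpolation procedure from the paper:
def percentile_from_histogram(bins, q):
    # Interpolate the x value at which the histogram's CDF reaches q percent.
    total = sum(count for _, count in bins)
    target = q / 100.0 * total
    prev_x, prev_cum = bins[0][0], 0.0
    cumulative = 0.0
    for x, count in bins:
        cumulative += count
        if cumulative >= target:
            if cumulative == prev_cum:
                return x
            t = (target - prev_cum) / (cumulative - prev_cum)
            return prev_x + t * (x - prev_x)
        prev_x, prev_cum = x, cumulative
    return bins[-1][0]

# Hypothetical usage:
# q1, median, q3 = (percentile_from_histogram(bins, p) for p in (25, 50, 75))
# iqr = q3 - q1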
Short Version
https://github.com/apache/spark/blob/4c7888dd9159dc203628b0d84f0ee2f90ab4bf13/sql/catalyst/src/main/java/org/apache/spark/sql/util/NumericHistogram.java

Why can't I fit a Poisson distribution using a chi-square test? What's wrong in the fitting? [duplicate]

I want to fit a Poisson distribution to my data points and decide, based on a chi-square test, whether I should accept or reject this proposed distribution. I only used 10 observations. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
from scipy.stats import chisquare

#Fitting function:
def Poisson_fit(x, a):
    return a * np.exp(-x)

#Code (x is the raw data, not shown in the question)
hist, bins = np.histogram(x, bins=10, density=True)
print("hist: ", hist)
#hist: [5.62657158e-01, 5.14254073e-01, 2.03161280e-01, 5.84898068e-02,
# 1.35995217e-02, 2.67094169e-03, 4.39345778e-04, 6.59603327e-05,
# 1.01518320e-05, 1.06301906e-06]
XX = np.arange(len(hist))
print("XX: ", XX)
#XX: [0 1 2 3 4 5 6 7 8 9]
plt.scatter(XX, hist, marker='.', color='red')
popt, pcov = optimize.curve_fit(Poisson_fit, XX, hist)
plt.plot(x_data, Poisson_fit(x_data, *popt), linestyle='--', color='red',
         label='Fit')
print("hist: ", hist)
plt.xlabel('s')
plt.ylabel('P(s)')
#Chisquare test:
f_obs = hist
#f_obs: [5.62657158e-01, 5.14254073e-01, 2.03161280e-01, 5.84898068e-02,
# 1.35995217e-02, 2.67094169e-03, 4.39345778e-04, 6.59603327e-05,
# 1.01518320e-05, 1.06301906e-06]
f_exp = Poisson_fit(XX, *popt)
#f_exp: [6.76613820e-01, 2.48912314e-01, 9.15697229e-02, 3.36866185e-02,
# 1.23926144e-02, 4.55898806e-03, 1.67715798e-03, 6.16991940e-04,
# 2.26978650e-04, 8.35007789e-05]
chi, p_value = chisquare(f_obs, f_exp)
print("chi: ", chi)
print("p_value: ", p_value)
#chi: 0.4588956658201067
#p_value: 0.9999789643475111
I am using 10 observations, so the degrees of freedom would be 9. For these degrees of freedom I can't find my p-value and chi value in the chi-square distribution table. Is there anything wrong in my code? Or are my input values too small, so that the test fails? If the p-value is > 0.05 the distribution is accepted. Although the p-value is large (0.999), I can't find the chi-square value 0.4588 in the table. I think there is something wrong in my code. How do I fix this error?
Is the returned chi value the critical value of the tails? How do I check the proposed hypothesis?

Calculating 95 % confidence interval for the mean in python

I need a little help. If I have 30 random samples with a mean of 52 and a variance of 30, how can I calculate the 95% confidence interval for the mean, both with an estimated variance and with a true (known) variance of 30?
Here you can combine the powers of numpy and statsmodels to get you started:
To produce normally distributed floats with a mean of 52 you can use numpy.random.normal with numbers = np.random.normal(loc=52, scale=30, size=30). Note that scale is the standard deviation, not the variance, so for a variance of 30 you would pass scale=np.sqrt(30). The parameters are:
Parameters
----------
loc : float
Mean ("centre") of the distribution.
scale : float
Standard deviation (spread or "width") of the distribution.
size : int or tuple of ints, optional
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn. Default is None, in which case a
single value is returned.
And here's a 95% confidence interval of the mean using DescrStatsW.tconfint_mean:
import statsmodels.stats.api as sms
conf = sms.DescrStatsW(numbers).tconfint_mean()
conf
# output
# (36.27, 56.43)
EDIT - 1
That's not the whole story though... Depending on your sample size, you should use the Z score and not the t score that's used by sms.DescrStatsW(numbers).tconfint_mean() here. And I have a feeling that it's not coincidental that the rule-of-thumb threshold is 30 and that you have 30 observations in your question. Z vs. t also depends on whether or not you know the population standard deviation or have to rely on an estimate from your sample, and those intervals are calculated differently as well. Take a look here. If this is something you'd like me to explain and demonstrate further, I'll gladly take another look at it over the weekend.
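If it helps, here is a minimal sketch of both intervals side by side using scipy. The function name and the idea of passing the known standard deviation separately are my own, and the exact numbers will of course vary with the random sample:
import numpy as np
from scipy import stats

def mean_confidence_intervals(sample, known_sigma=None, confidence=0.95):
    # Returns (t-based CI, z-based CI) for the mean of `sample`.
    # The z interval uses `known_sigma` if the true std dev is known,
    # otherwise it falls back to the sample standard deviation.
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    mean = sample.mean()
    s = sample.std(ddof=1)
    alpha = 1.0 - confidence
    t_half = stats.t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)
    sigma = known_sigma if known_sigma is not None else s
    z_half = stats.norm.ppf(1 - alpha / 2) * sigma / np.sqrt(n)
    return (mean - t_half, mean + t_half), (mean - z_half, mean + z_half)

numbers = np.random.normal(loc=52, scale=np.sqrt(30), size=30)
t_ci, z_ci = mean_confidence_intervals(numbers, known_sigma=np.sqrt(30))
print(t_ci, z_ci)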

get fit data out of gnuplot

I often use Octave to create data that I can plot from my lab results. That data is then fitted with some function in gnuplot:
f1(x) = a * exp(-x*g);
fit f1(x) "c_1.dat" using 1:2:3 via a,g
That creates a fit.log:
*******************************************************************************
Tue May 8 19:13:39 2012
FIT: data read from "e_schwach.dat" using 1:2:3
format = x:z:s
#datapoints = 16
function used for fitting: schwach(x)
fitted parameters initialized with current variable values
Iteration 0
WSSR : 12198.7 delta(WSSR)/WSSR : 0
delta(WSSR) : 0 limit for stopping : 1e-05
lambda : 14.2423
initial set of free parameter values
mu2 = 1
omega2 = 1
Q2 = 1
After 70 iterations the fit converged.
final sum of squares of residuals : 46.0269
rel. change during last iteration : -2.66463e-06
degrees of freedom (FIT_NDF) : 13
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 1.88163
variance of residuals (reduced chisquare) = WSSR/ndf : 3.54053
Final set of parameters Asymptotic Standard Error
======================= ==========================
mu2 = 0.120774 +/- 0.003851 (3.188%)
omega2 = 0.531482 +/- 0.0006112 (0.115%)
Q2 = 17.6593 +/- 0.7416 (4.199%)
correlation matrix of the fit parameters:
mu2 omega2 Q2
mu2 1.000
omega2 -0.139 1.000
Q2 -0.915 0.117 1.000
Is there some way to get the parameters and their error back into Octave? I mean I can write a Python program that parses that, but I hoped to avoid that.
Update
This question is not applicable to me any more, since I now use Python and matplotlib for my lab work, and they can do all of this from a single program. I will leave this question open in case somebody else has the same problem.
I don't know much about the gnuplot-Octave interface, but what can make your (parsing) life easier is that you can do:
set fit errorvariables
fit a*x+g via a,g
set print "fit_parameters.txt"
print a,a_err
print g,g_err
set print
Now your variables and their respective errors are in the file "fit_parameters.txt", with no parsing needed from Python.
From the documentation on fit:
If gnuplot was built with this option, and you activated it using set
fit errorvariables, the error for each fitted parameter will be
stored in a variable named like the parameter, but with _err
appended. Thus the errors can be used as input for further
computations.

Get quadratic equation term of a graph in R

I need to find the quadratic equation term of a graph I have plotted in R.
When I do this in Excel, the term appears in a text box on the chart, but I'm unsure how to move this to a cell for subsequent use (to apply to values requiring calibration), or indeed how to ask for it in R. If it is obtainable in R, can it be saved as an object to do future calculations with?
This seems like it should be a straightforward request in R, but I can't find any similar questions. Many thanks in advance for any help anyone can provide on this.
All the answers provide aspects of what you appear to want to do, but none thus far brings it all together. Let's consider Tom Liptrot's answer as an example:
fit <- lm(speed ~ dist + I(dist^2), cars)
This gives us a fitted linear model with a quadratic in the variable dist. We extract the model coefficients using the coef() extractor function:
> coef(fit)
(Intercept) dist I(dist^2)
5.143960960 0.327454437 -0.001528367
So your fitted equation (subject to rounding because of printing) is:
\hat{speed} = 5.143960960 + (0.327454437 * dist) + (-0.001528367 * dist^2)
(where \hat{speed} is the fitted values of the response, speed).
If you want to apply this fitted equation to some data, then we can write our own function to do it:
myfun <- function(newdist, model) {
coefs <- coef(model)
res <- coefs[1] + (coefs[2] * newdist) + (coefs[3] * newdist^2)
return(res)
}
We can apply this function like this:
> myfun(c(21,3,4,5,78,34,23,54), fit)
[1] 11.346494 6.112569 6.429325 6.743024 21.386822 14.510619 11.866907
[8] 18.369782
for some new values of distance (dist), which is what you appear to want to do from the question. However, in R we don't normally do things like this, because why should the user have to know how to form fitted or predicted values from all the different types of model that can be fitted in R?
In R, we use standard methods and extractor functions. In this case, if you want to apply the "equation", that Excel displays, to all your data to get the fitted values of this regression, in R we would use the fitted() function:
> fitted(fit)
1 2 3 4 5 6 7 8
5.792756 8.265669 6.429325 11.608229 9.991970 8.265669 10.542950 12.624600
9 10 11 12 13 14 15 16
14.510619 10.268988 13.114445 9.428763 11.081703 12.122528 13.114445 12.624600
17 18 19 20 21 22 23 24
14.510619 14.510619 16.972840 12.624600 14.951557 19.289106 21.558767 11.081703
25 26 27 28 29 30 31 32
12.624600 18.369782 14.057455 15.796751 14.057455 15.796751 17.695765 16.201008
33 34 35 36 37 38 39 40
18.688450 21.202650 21.865976 14.951557 16.972840 20.343693 14.057455 17.340416
41 42 43 44 45 46 47 48
18.038887 18.688450 19.840853 20.098387 18.369782 20.576773 22.333670 22.378377
49 50
22.430008 21.93513
If you want to apply your model equation to some new data values not used to fit the model, then we need to get predictions from the model. This is done using the predict() function. Using the distances I plugged into myfun above, this is how we'd do it in a more R-centric fashion:
> newDists <- data.frame(dist = c(21,3,4,5,78,34,23,54))
> newDists
dist
1 21
2 3
3 4
4 5
5 78
6 34
7 23
8 54
> predict(fit, newdata = newDists)
1 2 3 4 5 6 7 8
11.346494 6.112569 6.429325 6.743024 21.386822 14.510619 11.866907 18.369782
First up we create a new data frame with a component named "dist", containing the new distances we want to get predictions for from our model. It is important to note that we include in this data frame a variable that has the same name as the variable used when we created our fitted model. This new data frame must contain all the variables used to fit the model, but in this case we only have one variable, dist. Note also that we don't need to include anything about dist^2. R will handle that for us.
Then we use the predict() function, giving it our fitted model and providing the new data frame just created as argument 'newdata', giving us our new predicted values, which match the ones we did by hand earlier.
Something I glossed over is that predict() and fitted() are really a whole group of functions. There are versions for lm() models, for glm() models etc. They are known as generic functions, with methods (versions if you like) for several different types of object. You the user generally only need to remember to use fitted() or predict() etc whilst R takes care of using the correct method for the type of fitted model you provide it. Here are some of the methods available in base R for the fitted() generic function:
> methods(fitted)
[1] fitted.default* fitted.isoreg* fitted.nls*
[4] fitted.smooth.spline*
Non-visible functions are asterisked
You will possibly get more than this depending on what other packages you have loaded. The * just means you can't refer to those functions directly, you have to use fitted() and R works out which of those to use. Note there isn't a method for lm() objects. This type of object doesn't need a special method and thus the default method will get used and is suitable.
You can add a quadratic term in the formula in lm to get the fit you are after. You need to use an I() around the term you want to square, as in the example below:
plot(speed ~ dist, cars)
fit1 = lm(speed ~ dist, cars) #fits a linear model
abline(fit1) #puts line on plot
fit2 = lm(speed ~ I(dist^2) + dist, cars) #fits a model with a quadratic term
fit2line = predict(fit2, data.frame(dist = -10:130))
lines(-10:130 ,fit2line, col=2) #puts line on plot
To get the coefficients from this use:
coef(fit2)
I don't think it is possible in Excel, as it only provides functions to get coefficients for a linear regression (SLOPE, INTERCEPT, LINEST) or for an exponential one (GROWTH, LOGEST), though you may have more luck using Visual Basic.
As for R you can extract model coefficients using the coef function:
mdl <- lm(y ~ poly(x,2,raw=T))
coef(mdl) # all coefficients
coef(mdl)[3] # only the 2nd order coefficient
I guess you mean that you plot X vs Y values in Excel or R, and in Excel use the "Add trendline" functionality. In R, you can use the lm function to fit a linear function to your data, and this also gives you the "r squared" term (see examples in the linked page).
