Inverse of cumulative probability (CDF) in Julia? - statistics

If I have a value, such as 0.5 for a standard normal distribution, how do I convert that into a random outcome?
E.g. I am looking for the function f such that f(0.5) = 0.0 for a standard normal distribution.

julia> using Distributions
[ Info: Precompiling Distributions [31c24e10-a181-5473-b8eb-7969acd0382f]
julia> quantile(Normal(), 0.5)
0.0
The documentation says:
Evaluate the inverse cumulative distribution function at q.
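The quantile function is indeed the inverse CDF, and composing it with a uniform random draw gives inverse-transform sampling, which answers the "random outcome" part of the question. A minimal sketch using the same Distributions API (the numbers in the comments are approximate):
using Distributions

quantile(Normal(), 0.975)                 # ≈ 1.96: the inverse CDF evaluated at 0.975
cdf(Normal(), quantile(Normal(), 0.975))  # ≈ 0.975: the round trip confirms the inverse relationship
quantile(Normal(), rand())                # inverse-transform sampling: one standard normal draw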

Related

Why do we have different values for skewness and kurtosis in MATLAB and Python?

Following is the code for computing skewness and kurtosis in MATLAB:
clc; clear all
% Generate "N" data points
N = 1:1:2000;
% Set sampling frequency
Fs = 1000;
% Set time step value
dt = 1/Fs;
% Frequency of the signal
f = 5;
% Generate time array
t = N*dt;
% Generate sine wave
y = 10 + 5*sin(2*pi*f*t);
% Skewness
y_skew = skewness(y);
% Kurtosis
y_kurt = kurtosis(y);
The results obtained in MATLAB are:
y_skew = 4.468686410415491e-15
y_kurt = 1.500000000000001 (Value is positive in MATLAB)
Now, below is the equivalent code in Python:
import numpy as np
from scipy.stats import skew
from scipy.stats import kurtosis
# Generate "N" data points
N = np.linspace(1,2000,2000)
# Set sampling frequency
Fs = 1000
# Set time step value
dt = 1/Fs
# Frequency of the signal
f = 5
# Generate time array
t = N*dt
# Generate sine wave
y = 10 + 5*np.sin(2*np.pi*f*t)
# Skewness
y_skew = skew(y)
# Kurtosis
y_kurt = kurtosis(y)
The results obtained in Python are:
y_skew = -1.8521564287013977e-16
y_kurt = -1.5 (Value has turned out to be negative in Python)
Can somebody please explain why we get different answers for skewness and kurtosis in MATLAB and Python?
In particular, for kurtosis the value has changed sign from positive to negative. Can somebody please help me understand this?
This is the difference between the Fisher and Pearson measures of kurtosis.
From the MATLAB docs:
Kurtosis is a measure of how outlier-prone a distribution is. The kurtosis of the normal distribution is 3. Distributions that are more outlier-prone than the normal distribution have kurtosis greater than 3; distributions that are less outlier-prone have kurtosis less than 3. Some definitions of kurtosis subtract 3 from the computed value, so that the normal distribution has kurtosis of 0. The kurtosis function does not use this convention.
From the scipy docs:
Kurtosis is the fourth central moment divided by the square of the variance. If Fisher’s definition is used, then 3.0 is subtracted from the result to give 0.0 for a normal distribution.
Note that Fisher's definition is used by default in scipy:
scipy.stats.kurtosis(a, axis=0, fisher=True, ...)
Your results would be equivalent if you used fisher=False in Python (or manually added 3), or subtracted 3 from your MATLAB result, so that both use the same definition.
So it looks like the sign is being flipped, but that's just by chance since +1.5 - 3 = -1.5.
The difference in skewness appears to be due to numerical precision, since both results are basically 0. Please see Why is 24.0000 not equal to 24.0000 in MATLAB?
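As a quick check of that, switching scipy to the Pearson convention reproduces the MATLAB value. A small sketch reusing the sine wave from the question:
import numpy as np
from scipy.stats import kurtosis

t = np.arange(1, 2001) / 1000           # same time array as in the question
y = 10 + 5 * np.sin(2 * np.pi * 5 * t)

kurtosis(y)                  # Fisher (default): ≈ -1.5
kurtosis(y, fisher=False)    # Pearson: ≈ 1.5, matching MATLAB's kurtosis()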

Statistical tests for two random datasets

I need to compare two data sets that I randomly created in Julia with rand. I want to know if there is a statistical test (that can be performed in Julia/JuMP) that tells me how different the distributions are (making no assumptions about the original distributions).
Why would you want to perform this in JuMP?
This is really a job for the HypothesisTests package:
https://github.com/JuliaStats/HypothesisTests.jl
julia> using HypothesisTests
julia> x, y = rand(100), rand(100);
julia> test = HypothesisTests.ApproximateTwoSampleKSTest(x, y)
Approximate two sample Kolmogorov-Smirnov test
----------------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.11

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.5806

Details:
    number of observations:   [100,100]
    KS-statistic:              0.7778174593052022
julia> pvalue(test)
0.5806177304235198
https://juliastats.org/HypothesisTests.jl/stable/nonparametric/#HypothesisTests.ApproximateTwoSampleKSTest

Median absolute deviation in Julia

I approximated pi using the Monte Carlo method.
I am having trouble calculating the median absolute deviation of my results (see the code below).
My guess is that there is already a function in Julia that does this, but unfortunately I have no idea how to use it in my code.
for i in 1:5
    picircle(1000)
end
3.0964517741129436
3.152423788105947
3.1284357821089457
3.1404297851074463
3.0904547726136933
As MarcMush mentioned in the comments, StatsBase exports the mad function, which, as we can see from the docstring:
help?> mad
search: mad mad! maxad muladd mapreduce meanad ismarked mapfoldr mapfoldl mean_and_var mean_and_std mean_and_cov macroexpand
mad(x; center=median(x), normalize=true)
Compute the median absolute deviation (MAD) of collection x around center (by default, around the median).
If normalize is set to true, the MAD is multiplied by 1 / quantile(Normal(), 3/4) ≈ 1.4826, in order to obtain a consistent
estimator of the standard deviation under the assumption that the data is normally distributed.
so
julia> A = randn(10000);
julia> using StatsBase
julia> mad(A, normalize=false)
0.6701649037518176
or alternatively, if you don't want the StatsBase dependency, then you can just calculate it directly with (e.g.)
julia> using Statistics
julia> median(abs.(A .- median(A)))
0.6701649037518176
which gives an identical result
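To connect this back to the normalize keyword in the docstring: the normalized MAD is just the raw MAD multiplied by the consistency constant 1 / quantile(Normal(), 3/4) ≈ 1.4826. A small sketch (exact values depend on the random draw, so none are shown):
using StatsBase, Distributions

A = randn(10_000)

raw = mad(A, normalize=false)         # plain median absolute deviation
nrm = mad(A, normalize=true)          # scaled to estimate the standard deviation of normal data

nrm ≈ raw / quantile(Normal(), 3/4)   # true: the two differ only by the consistency constant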

How can I use scipy optimization to find the minimum chi-squared for 3 parameters and a list of data points?

I have a histogram of sorted random numbers and a Gaussian overlay. The histogram represents observed values per bin (applying this base case to a much larger dataset) and the Gaussian is an attempt to fit the data. Clearly, this Gaussian does not represent the best fit to the histogram. The code below is the formula for a Gaussian.
normc, mu, sigma = 30.845, 50.5, 7 # normalization constant, avg, stdev
gauss = lambda x: normc * exp( (-1) * (x - mu)**2 / ( 2 * (sigma **2) ) )
I calculated the expectation values per bin (area under the curve) and calculated the number of observed values per bin. There are several methods to find the 'best' fit. I am concerned with the best fit possible by minimizing Chi-Squared. In this formula for Chi-Squared, the expectation value is the area under the curve per bin and the observed value is the number of occurrences of sorted data values per bin.

So I want to fluctuate normc, mu, and sigma near their given values to find the right combination of normc, mu, and sigma that produces the smallest Chi-Squared, as these will be the parameters I can plug into the code above to overlay the best-fit Gaussian on my histogram.

I am trying to use the scipy module to minimize my Chi-Squared as done in this example. Since I need to fluctuate parameters, I will use the function gauss (defined above) to plot the Gaussian overlay, and will define a new function to find the minimum Chi-Squared.
def gaussmin(var, data):
    # var[0] = normc
    # var[1] = mu
    # var[2] = sigma
    # data is the sorted random numbers, represents unbinned observed values
    for index in range(len(data)):
        return var[0] * exp( (-1) * (data[index] - var[1])**2 / ( 2 * (var[2] **2) ) )
    # I realize this will return a new value for each index of data, any guidelines to fix?
After this, I am stuck. How can I fluctuate the parameters to find the normc, mu, sigma that produced the best fit? My last attempt at a solution is below:
var = [normc, mu, sigma]
result = opt.minimize(chi2, [normc,mu,sigma])
# chi2 is the chisquare value obtained via scipy
# chisquare input (a,b)
# where a is number of occurences per bin, b is expected value per bin
# b is dependent upon normc, mu, sigma
print(result)
# data is a list, can I keep it as a constant and only fluctuate parameters in var?
There are plenty of examples online for scalar functions but I cannot find any for variable functions.
PS - I can post my full code so far, but it's a bit lengthy. If you would like to see it, just ask and I can post it here or provide a Google Drive link.
A Gaussian distribution is completely characterized by its mean and variance (or std deviation). Under the hypothesis that your data are normally distributed, the best fit will be obtained by using x-bar as the mean and s-squared as the variance. But before doing so, I'd check whether normality is plausible using, e.g., a q-q plot.
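That said, if you do want the scipy.optimize.minimize route the question asks about, the data can stay fixed while only the three parameters vary by passing the binned counts through args. A rough sketch under those assumptions (the chi2 helper and the synthetic data below are hypothetical, not taken from the question's full code):
import numpy as np
from scipy.optimize import minimize

def chi2(var, counts, edges):
    # var = [normc, mu, sigma]; counts and edges are the fixed binned observations
    normc, mu, sigma = var
    centers = 0.5 * (edges[:-1] + edges[1:])
    widths = np.diff(edges)
    # expected counts per bin: Gaussian value at the bin centre times the bin width
    expected = normc * np.exp(-(centers - mu) ** 2 / (2 * sigma ** 2)) * widths
    return np.sum((counts - expected) ** 2 / expected)

data = np.random.normal(50.5, 7.0, size=1000)    # stand-in for the sorted random numbers
counts, edges = np.histogram(data, bins=30)

# only var is varied; the binned data stay constant via args
result = minimize(chi2, x0=[30.845, 50.5, 7.0], args=(counts, edges), method="Nelder-Mead")
print(result.x)   # fitted [normc, mu, sigma]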

Is there a value with error library in Haskell?

I am looking for a library that provides a 'value with error' type (e.g. x ± y), but searching for "Haskell xyz Error" only gives error-handling libraries.
I would expect such a library to provide common math operations (Num, Floating) where appropriate. The use case would be to get an error estimate from a calculation based on noisy sensor readings.
Update
I did some research and the term "propagation of uncertainty" came up. I found uncertainly-haskell which I'll try out soon. Are there other packages like this?
Have a look at the intervals package.
The Data.Eq.Approximate module seems to be a fit for getting approximate equality.
The purpose of this module is to provide a newtype wrapper that allows one to effectively override the equality operator of a value so that it is approximate rather than exact. For example, the type
type ApproximateDouble = AbsolutelyApproximateValue (Digits Five) Double
defines an alias for a wrapper containing Doubles such that two doubles are equal if they are equal to within five decimals of accuracy; for example, we have that
1 == (1+10^^(-6) :: ApproximateDouble)
evaluates to True. Note that we did not need to wrap the value 1+10^^(-6) since AbsolutelyApproximateValue is an instance of Num. For convenience, Num as well as many of the other numerical classes such as Real and Floating have all been derived for the wrappers defined in this package, so that one can conveniently use the wrapped values in the same way as one would use the values themselves.
Two kinds of wrappers are provided by this package.
The uncertain package seems to provide what you are looking for:
Some highlights from the readme:
Provides tools to manipulate numbers with inherent
experimental/measurement uncertainty, and propagates them through
functions based on principles from statistics.
Manipulate with error propagation
ghci> let x = 1.52 +/- 0.07
ghci> let y = 781.4 +/- 0.3
ghci> let z = 1.53e-1 `withPrecision` 3
ghci> cosh x
2.4 +/- 0.2
ghci> exp x / z * sin (y ** z)
10.9 +/- 0.9
ghci> pi + 3 * logBase x y
52 +/- 5
Create numbers
ghci> 1.52 +/- 0.07
1.52 +/- 7.0e-2
ghci> fromSamples [12.5, 12.7, 12.6, 12.6, 12.5]
12.58 +/- 7.0e-2
Comparisons
Note that this is very different from other libraries with similar
data types (like from intervals and rounding); these do not
attempt to maintain intervals or simply digit precisions; they instead
are intended to model actual experimental and measurement data with
their uncertainties, and apply functions to the data with the
uncertainties and properly propagating the errors with sound
statistical principles.
For a clear example, take
> (52 +/- 6) + (39 +/- 4)
91.0 +/- 7.0
In a library like intervals, this would result in 91 +/- 10
(that is, a lower bound of 46 + 35 and an upper bound of 58 + 43).
However, with experimental data, errors in two independent samples
tend to "cancel out", and result in an overall aggregate uncertainty
in the sum of approximately 7.
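Where the 7.0 comes from: independent 1-sigma errors add in quadrature, which is easy to check by hand (plain arithmetic, not a call into the library):
-- independent errors add in quadrature
combinedError :: Double
combinedError = sqrt (6^2 + 4^2)   -- sqrt 52 ≈ 7.2, which the library displays as 7.0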
