Median absolute deviation in Julia - statistics

I approximated pi using the Monte Carlo method.
I am having trouble calculating the median absolute deviation of my results (see code below).
My guess is that there is already a function in Julia that does this, but unfortunately I have no idea how to use it in my code.
for i in 1:5
    picircle(1000)
end
3.0964517741129436
3.152423788105947
3.1284357821089457
3.1404297851074463
3.0904547726136933

As MarcMush mentioned in the comments, StatsBase exports the mad function, which as we can see from the docstring
help?> mad
search: mad mad! maxad muladd mapreduce meanad ismarked mapfoldr mapfoldl mean_and_var mean_and_std mean_and_cov macroexpand
mad(x; center=median(x), normalize=true)
Compute the median absolute deviation (MAD) of collection x around center (by default, around the median).
If normalize is set to true, the MAD is multiplied by 1 / quantile(Normal(), 3/4) ≈ 1.4826, in order to obtain a consistent
estimator of the standard deviation under the assumption that the data is normally distributed.
so
julia> A = randn(10000);
julia> using StatsBase
julia> mad(A, normalize=false)
0.6701649037518176
or alternatively, if you don't want the StatsBase dependency, then you can just calculate it directly with (e.g.)
julia> using Statistics
julia> median(abs.(A .- median(A)))
0.6701649037518176
which gives an identical result.
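For reference, with the default normalize=true the same call would return roughly 0.6702 × 1.4826 ≈ 0.99, close to the true standard deviation of 1 for standard-normal data.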

Related

Statistical tests for two random datasets

I need to compare two data sets that I randomly created in Julia with rand. I want to know if there's some statistical test (that can be performed in Julia JuMP) that tells me how different the distributions are (making no assumptions about the original distribution).
Why would you want to perform this in JuMP?
This is really a job for the HypothesisTests package:
https://github.com/JuliaStats/HypothesisTests.jl
julia> using HypothesisTests
julia> x, y = rand(100), rand(100);
julia> test = HypothesisTests.ApproximateTwoSampleKSTest(x, y)
Approximate two sample Kolmogorov-Smirnov test
----------------------------------------------
Population details:
parameter of interest: Supremum of CDF differences
value under h_0: 0.0
point estimate: 0.11
Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value: 0.5806
Details:
number of observations: [100,100]
KS-statistic: 0.7778174593052022
julia> pvalue(test)
0.5806177304235198
https://juliastats.org/HypothesisTests.jl/stable/nonparametric/#HypothesisTests.ApproximateTwoSampleKSTest

Word2Vec Subsampling -- Implementation

I am implementing the Skip-gram model, both in PyTorch and TensorFlow 2. I am having doubts about the implementation of subsampling of frequent words. Verbatim from the paper, the probability of subsampling word wi is computed as
P(wi) = 1 - sqrt(t / f(wi))
where t is a custom threshold (usually a small value such as 0.0001) and f(wi) is the frequency of the word in the document. Although the authors implemented it in a different, but almost equivalent way, let's stick with this definition.
When computing P(wi), we can end up with negative values. For example, assume we have 100 words, and one of them appears far more often than the others (as is the case for my dataset).
import numpy as np
import seaborn as sns
np.random.seed(12345)
# generate counts in [1, 20]
counts = np.random.randint(low=1, high=20, size=99)
# add an extremely bigger count
counts = np.insert(counts, 0, 100000)
# compute frequencies
f = counts/counts.sum()
# define threshold as in paper
t = 0.0001
# compute probabilities as in paper
probs = 1 - np.sqrt(t/f)
sns.distplot(probs);
Q: What is the correct way to implement subsampling using this "probability"?
As an additional info, I have seen that in keras the function keras.preprocessing.sequence.make_sampling_table takes a different approach:
def make_sampling_table(size, sampling_factor=1e-5):
    """Generates a word rank-based probabilistic sampling table.

    Used for generating the `sampling_table` argument for `skipgrams`.
    `sampling_table[i]` is the probability of sampling
    the i-th most common word in a dataset
    (more common words should be sampled less frequently, for balance).

    The sampling probabilities are generated according
    to the sampling distribution used in word2vec:

    ```
    p(word) = (min(1, sqrt(word_frequency / sampling_factor) /
        (word_frequency / sampling_factor)))
    ```

    We assume that the word frequencies follow Zipf's law (s=1) to derive
    a numerical approximation of frequency(rank):

    `frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))`
    where `gamma` is the Euler-Mascheroni constant.

    # Arguments
        size: Int, number of possible words to sample.
        sampling_factor: The sampling factor in the word2vec formula.

    # Returns
        A 1D Numpy array of length `size` where the ith entry
        is the probability that a word of rank i should be sampled.
    """
    gamma = 0.577
    rank = np.arange(size)
    rank[0] = 1
    inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1. / (12. * rank)
    f = sampling_factor * inv_fq
    return np.minimum(1., f / np.sqrt(f))
I tend to trust deployed code more than paper write-ups, especially in a case like word2vec, where the word2vec.c code released by the paper's authors has been widely used and served as the template for other implementations. If we look at its subsampling mechanism...
if (sample > 0) {
  real ran = (sqrt(vocab[word].cn / (sample * train_words)) + 1) * (sample * train_words) / vocab[word].cn;
  next_random = next_random * (unsigned long long)25214903917 + 11;
  if (ran < (next_random & 0xFFFF) / (real)65536) continue;
}
...we see that those words with tiny counts (.cn) that could give negative values in the original formula instead produce a keep-threshold ran greater than 1.0 here, and thus can never be less than the masked-and-scaled random value (next_random & 0xFFFF) / (real)65536, which is always below 1.0. So, it seems the authors' intent was for all negative values of the original formula to mean "never discard".
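In the question's notation, with f = vocab[word].cn / train_words and t = sample, that keep-threshold works out to ran = sqrt(t/f) + t/f, so the effective probability of keeping an occurrence is min(1, sqrt(t/f) + t/f), the inverted, saturating counterpart of the paper's discard formula.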
As per the keras make_sampling_table() comment & implementation, they're not consulting the actual word-frequencies at all. Instead, they're assuming a Zipf-like distribution based on word-rank order to synthesize a simulated word-frequency.
If their assumptions were to hold – the related words are from a natural-language corpus with a Zipf-like frequency-distribution – then I'd expect their sampling probabilities to be close to down-sampling probabilities that would have been calculated from true frequency information. And that's probably "close enough" for most purposes.
I'm not sure why they chose this approximation. Perhaps other aspects of their usual processes have not maintained true frequencies through to this step, and they're expecting to always be working with natural-language texts, where the assumed frequencies will be generally true.
(As luck would have it, and because people often want to impute frequencies to public sets of word-vectors which have dropped the true counts but are still sorted from most- to least-frequent, just a few days ago I wrote an answer about simulating a fake-but-plausible distribution using Zipf's law – similar to what this keras code is doing.)
But, if you're working with data that doesn't match their assumptions (as with your synthetic or described datasets), their sampling-probabilities will be quite different than what you would calculate yourself, with any form of the original formula that uses true word frequencies.
In particular, imagine a distribution with one token appearing a million times, then a hundred tokens all appearing just 10 times each. Those hundred tokens' order in the "rank" list is arbitrary – truly, they're all tied in frequency. But the simulation-based approach, by fitting a Zipfian distribution on that ordering, will in fact be sampling each of them very differently. The one 10-occurrence word lucky enough to be in the 2nd rank position will be far more downsampled, as if it were far more frequent. And the 1st-rank "tall head" value, by having its true frequency under-approximated, will be less down-sampled than otherwise. Neither of those effects seems beneficial, or in the spirit of the frequent-word-downsampling option, which should only "thin out" very-frequent words, and in all cases leave words of the same frequency as each other in the original corpus roughly equivalently present to each other in the down-sampled corpus.
So for your case, I would go with the original formula (probability-of-discarding-that-requires-special-handling-of-negative-values), or the word2vec.c practical/inverted implementation (probability-of-keeping-that-saturates-at-1.0), rather than the keras-style approximation.
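For a concrete comparison, here is a small NumPy sketch (reusing the synthetic counts from the question; the variable names and prints are mine, not from any of the implementations above) that computes both the clipped paper-style discard probability and the word2vec.c-style keep probability:
import numpy as np

np.random.seed(12345)
counts = np.insert(np.random.randint(low=1, high=20, size=99), 0, 100000)
f = counts / counts.sum()   # true word frequencies, as in the question
t = 0.0001                  # subsampling threshold

# paper formula, with negative values clipped to 0 ("never discard")
p_discard = np.clip(1 - np.sqrt(t / f), 0.0, 1.0)

# word2vec.c keeps an occurrence when a uniform [0, 1) draw falls below
# sqrt(t/f) + t/f, so the effective keep probability saturates at 1.0
p_keep = np.minimum(1.0, np.sqrt(t / f) + t / f)

# only the "tall head" word is meaningfully down-sampled in either version
print(p_discard[:3])
print(p_keep[:3])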
(As a totally-separate note that nonetheless may be relevant for your dataset/purposes, if you're using negative-sampling: there's another parameter controlling the relative sampling of negative examples, often fixed at 0.75 in early implementations, that one paper has suggested can usefully vary for non-natural-language token distributions & recommendation-related end-uses. This parameter is named ns_exponent in the Python gensim implementation, but simply a fixed power value internal to a sampling-table pre-calculation in the original word2vec.c code.)
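For completeness, that parameter is exposed directly in gensim's Word2Vec constructor; a toy example with illustrative values only (the corpus here is made up):
from gensim.models import Word2Vec

# tiny made-up "corpus" of interaction tokens, purely for illustration
sentences = [["user1", "itemA", "itemB"], ["user2", "itemA", "itemC"]]

# ns_exponent=1.0 samples negatives by raw frequency, 0.0 samples them uniformly;
# the classic word2vec behaviour corresponds to 0.75
model = Word2Vec(sentences, vector_size=16, min_count=1, negative=5, ns_exponent=0.75)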

How do I prevent minimize (via SCIPY) from outputting "optimized" parameters that I have input as guesses?

I am trying to use the minimize function from the scipy module. The full code is too lengthy to post, but the main idea is that there are multiple defined distributions that should be fittable against datasets. The observations per bin are easily calculated from the datasets, whereas the expectations per bin are calculated by a function that uses one argument to specify which distribution should be integrated over bin bounds (where the bin bounds are identical to the histogram bins). There are three functions chisqI where I = 1,2,3 (one for each distribution), each of which inputs specified observations per bin and expectations per bin to output the chi square. Then there are three functions, each of which inputs a chisqI and args to output the minimized function result and optimized parameters. Here, the args are parameters mu and sigma that will be optimized to produce the smallest chi-square. I was able to pass arguments through a chain of functions for one distribution, and am wondering if I need to pass through another arg that specifies which distribution is being dealt with from one function down the chain.
There are different methods that the minimize function can use, like Nelder-Mead or CG. I've been trying to compare results from the different methods to find the one that provides the best fit (where the best fit is defined as the fit that produces both the smallest chi-square and the largest p-value when compared to an actual dataset). Interestingly enough, the Nelder-Mead and Powell methods produce the lowest chi square relative to the other methods, but the plotted fit against the histogram of the actual data looks better with other methods. For the code outputs below, the function value is the negative of the p-value associated with the chi-square value; this is the minimized result. CHISQ_RED is the reduced chi square, computed from CHISQ_TOT and the degrees of freedom, whereas the first and second elements in the x: array are the optimized parameters mu and sigma for a distribution, respectively.
Running the Nelder-Mead minimization method produces the output below.
final_simplex: (array([[ 6.00002802, 0.60020636],
[ 5.99995429, 0.60018798],
[ 6.0000716 , 0.60011127]]), array([ -5.16845821e-21, -5.16838926e-21, -5.16815050e-21]))
fun: -5.1684582072826815e-21
message: 'Optimization terminated successfully.'
nfev: 47
nit: 24
status: 0
success: True
x: array([ 6.00002802, 0.60020636])
CHISQ_TOT = 259.042420419 CHISQ_RED = 3.36418727816
Running the CG minimization method produces the output below.
fun: -4.0964504680695594e-97
jac: array([ 8.72867710e-94, -3.96555507e-93])
message: 'Optimization terminated successfully.'
nfev: 4
nit: 0
njev: 1
status: 0
success: True
x: array([ 6.01921293, 0.54436257])
CHISQ_TOT = 683.781671477 CHISQ_RED = 8.88028144776
Yet, the fit with a higher chi square value looks like a better fit (same dataset in the histogram).
The problem is that every method of minimization outputs my guess parameters (mu and sigma) as the optimized parameters. The Nelder-Mead method (smaller chi-square, worse-looking fit) has 47 function evaluations and 24 iterations, whereas the CG method (larger chi-square, better-looking fit) has 4 function evaluations and 0 iterations. I tried to change this by adding extra args in the minimization function (where chisq3 is the pre-defined function of mu and sigma being minimized, and parameterguess is [mu_guess, sigma_guess]).
minimize( chisq3 , parameterguess , method = 'CG', options={'gtol':1e-50, 'maxiter': 100})
If I change my guess value of mu and sigma by adding 2 to each, then the fits become drastically worse (as the guess value for the optimized parameters is rather decent). I'm not sure if it's relevant, but the data shown in the plots are adapted from a lognormal distribution by taking the logarithm of each value in my dataset to create a "pseudo-" Gaussian shape/distribution (over logarithmic x axes).
I am guessing that the minimize function via scipy is supposed to do many iterations to be truly successful. So I think adding more iterations should decrease the sensitivity of the minimize function to my initial guess of parameters.
Most importantly, is this a common error using the minimize function via scipy? If so, what are some common fixes for this? Also, why would the minimize function do many iterations and function evaluations only to produce the same result as the input?
The problem was that the chi square is calculated as the sum, over all bins, of the squared difference between the expected and observed values, divided by the expected value. Each term was a small number divided by a large number, squared, and these terms were then summed thousands of times, contributing to zero-division problems and round-off errors. By minimizing a simpler function, such as the chi square without the denominator term, the source of the bug is gone, and one can then calculate a chi square from the obtained parameter fit.
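As an illustration of that fix, here is a rough sketch with a made-up Gaussian model and synthetic binned data standing in for the question's chisq3 and histogram (none of this is the original code):
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

np.random.seed(0)
bin_edges = np.linspace(2, 10, 41)
observed = np.histogram(np.random.normal(6.0, 0.6, 5000), bins=bin_edges)[0]

def expected_counts(mu, sigma):
    # expected counts per bin for a Gaussian model, scaled to the sample size
    cdf = norm.cdf(bin_edges, mu, sigma)
    return observed.sum() * np.diff(cdf)

def simple_objective(params):
    # only the sum of squared per-bin differences, with no division by the
    # expectation, which avoids the numerical problems described above
    mu, sigma = params
    return np.sum((observed - expected_counts(mu, sigma)) ** 2)

res = minimize(simple_objective, x0=[5.0, 1.0], method='Nelder-Mead')
mu_fit, sigma_fit = res.x

# chi square evaluated once at the fitted parameters
exp_fit = expected_counts(mu_fit, sigma_fit)
mask = exp_fit > 0
chisq_tot = np.sum((observed[mask] - exp_fit[mask]) ** 2 / exp_fit[mask])
print(res.x, chisq_tot)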

Expectation Maximization algorithm (Gaussian Mixture Model): ValueError: the input matrix must be positive semidefinite

I am trying to implement the Expectation Maximization algorithm (Gaussian Mixture Model) on a data set data=[[x,y],...]. I am using the mv_norm.pdf(data, mean, cov) function to calculate cluster responsibilities. But after recalculating the covariance matrices for 6-7 iterations, a cov matrix becomes singular, i.e. the determinant of cov is 0 (a very small value), and hence it gives the errors
ValueError: the input matrix must be positive semidefinite
and
raise np.linalg.LinAlgError('singular matrix')
Can someone suggest any solution for this?
#E-step: Compute cluster responsibilities, given cluster parameters
def calculate_cluster_responsibility(data, centroids, cov_m):
    pdfmain = [[] for i in range(0, len(data))]
    for i in range(0, len(data)):
        sum1 = 0
        pdfeach = [[] for m in range(0, len(centroids))]
        # equal mixture weights (1/3) times the Gaussian density of each cluster
        pdfeach[0] = 1/3. * mv_norm.pdf(data[i], mean=centroids[0], cov=[[cov_m[0][0][0], cov_m[0][0][1]], [cov_m[0][1][0], cov_m[0][1][1]]])
        pdfeach[1] = 1/3. * mv_norm.pdf(data[i], mean=centroids[1], cov=[[cov_m[1][0][0], cov_m[1][0][1]], [cov_m[1][1][0], cov_m[1][1][1]]])
        pdfeach[2] = 1/3. * mv_norm.pdf(data[i], mean=centroids[2], cov=[[cov_m[2][0][0], cov_m[2][0][1]], [cov_m[2][1][0], cov_m[2][1][1]]])
        sum1 += pdfeach[0] + pdfeach[1] + pdfeach[2]
        # normalize so the responsibilities for each point sum to 1
        pdfeach[:] = [x / sum1 for x in pdfeach]
        pdfmain[i] = pdfeach
    global old_pdfmain
    if old_pdfmain == pdfmain:
        return
    old_pdfmain = copy.deepcopy(pdfmain)
    softcounts = [sum(i) for i in zip(*pdfmain)]
    calculate_cluster_weights(data, centroids, pdfmain, softcounts)
Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.
Can someone suggest any solution for this?
The problem is that your data lies on some manifold of dimension strictly smaller than the dimension of the input data. In other words, your data might, for example, lie on a circle while you have 3-dimensional data. As a consequence, when your method tries to estimate a 3-dimensional ellipsoid (covariance matrix) that fits your data, it fails, since the optimal one is a 2-dimensional ellipse (the third dimension is 0).
How to fix it? You will need some regularization of your covariance estimator. There are many possible solutions, all in the M step, not the E step; the problem is with computing the covariance:
Simple solution: instead of doing something like cov = np.cov(X), add some regularizing term, like cov = np.cov(X) + eps * np.identity(X.shape[1]) with a small eps.
Use a nicer estimator, like the LedoitWolf estimator from scikit-learn.
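For concreteness, a minimal sketch of both options (assuming X holds the samples assigned to one cluster, one sample per row; the helper names are just for illustration):
import numpy as np
from sklearn.covariance import LedoitWolf

def regularized_cov(X, eps=1e-6):
    # empirical covariance plus a small ridge on the diagonal,
    # which keeps the estimate positive definite even for degenerate data
    cov = np.cov(X, rowvar=False)
    return cov + eps * np.identity(X.shape[1])

def ledoit_wolf_cov(X):
    # shrinkage estimator from scikit-learn as a drop-in alternative
    return LedoitWolf().fit(X).covariance_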
Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.
This makes no sense; the covariance matrix values have nothing to do with the number of clusters. You can initialize it with anything more or less reasonable.

unbiased variance in Theano

In numpy we can set ddof=1 to get the unbiased variance; how is it implemented in Theano?
I've looked at this page, and it seems the theano.tensor.var function does not support such an option.
theano.tensor.var returns the biased sample variance. I'm not aware of a builtin function that returns the unbiased sample variance, but you can obtain it as follows:
Given a vector x, use Theano's builtin var(), but change the 1/n divisor to 1/(n-1):
v = x.var() * x.size / (x.size - 1)
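For example, a quick sanity check against NumPy (a minimal sketch; the variable names are arbitrary):
import numpy as np
import theano
import theano.tensor as tt

x = tt.dvector("x")
unbiased_var = x.var() * x.size / (x.size - 1)  # rescale the biased variance by n / (n - 1)
f = theano.function([x], unbiased_var)

data = np.random.randn(100)
print(f(data), np.var(data, ddof=1))  # the two values should agree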
