What does a 'tractable' distribution mean? - statistics

For example, in generative adversarial networks, we often hear that inference is easy because the conditional distribution of x given the latent variable z is 'tractable'.
Also, I read somewhere that Boltzmann machines and variational autoencoders are used where the posterior distribution is not tractable, so some sort of approximation needs to be applied.
Could anyone tell me what 'tractable' means, in a rigorous definition? Or could anyone explain, for any of the examples I gave above, what exactly tractable means in that context?

First of all, let's define what tractable and intractable problems are (Reference: http://www.cs.ucc.ie/~dgb/courses/toc/handout29.pdf).
Tractable Problem: a problem that is solvable by a polynomial-time algorithm. The upper bound is polynomial.
Intractable Problem: a problem that cannot be solved by a polynomial-time algorithm. The lower bound is exponential.
From this perspective, a definition of a tractable distribution is that the probability of the distribution at any given point can be calculated in polynomial time.
If a distribution has a closed-form expression, its probability can certainly be calculated in polynomial time, which, in the academic world, means the distribution is tractable. Intractable distributions take exponential time or more, which usually means that with existing computational resources we can never calculate the probability at a given point in a relatively "short" time (any time longer than polynomial time is long...).
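To make this concrete, here is a rough sketch (my own illustration, not taken from the reference above): a closed-form density such as the Gaussian costs a constant number of operations per point, while an exact probability under even a tiny Boltzmann machine needs a partition function that sums over all 2**n binary states.

```python
# Illustrative sketch: a closed-form (tractable) density vs. an exact Boltzmann
# machine probability whose normalizer Z sums over 2**n states (intractable for large n).
import itertools
import math
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    # Closed form: constant cost to evaluate at any point.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def boltzmann_prob(state, W):
    # Exact probability of one binary state needs the partition function Z,
    # i.e. a sum over all 2**n configurations: exponential in n.
    energy = lambda v: -0.5 * v @ W @ v
    n = len(state)
    Z = sum(math.exp(-energy(np.array(v)))
            for v in itertools.product([0.0, 1.0], repeat=n))
    return math.exp(-energy(np.asarray(state, dtype=float))) / Z

n = 12                                    # 2**12 = 4096 terms; n = 100 would need 2**100
W = np.random.default_rng(0).normal(scale=0.1, size=(n, n))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)                  # symmetric weights, no self-connections

print(gaussian_pdf(1.3))                  # one formula evaluation
print(boltzmann_prob(np.ones(n), W))      # 4096 terms just for the normalizer
```

The Gaussian call is a single formula evaluation, while the toy Boltzmann probability already loops over 4096 configurations for n = 12; at n = 100 the same loop would be hopeless. That is the sense in which such posteriors are called intractable, and why approximations (variational inference, sampling) are used instead.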

Related

Why is Standard Deviation the square of the difference of an observation from the mean?

I am learning statistics, and have some basic yet core questions on SD:
s = the sample (the collection of observations)
n = total number of observations
xi = ith observation
μ = arithmetic mean of all observations
σ = the usual definition of SD, i.e. ((1/(n-1))*sum([(xi-μ)**2 for xi in s]))**(1/2) in Python lingo
f = frequency of an observation value
I do understand that (1/n)*sum([xi-μ for xi in s]) would be useless (= 0), but would not (1/n)*sum([abs(xi-μ) for xi in s]) have been a measure of variation?
Why stop at power of 1 or 2? Would ((1/(n-1))*sum([abs((xi-μ)**3) for xi in s]))**(1/3) or ((1/(n-1))*sum([(xi-μ)**4 for xi in s]))**(1/4) and so on have made any sense?
My notion of squaring is that it 'amplifies' the measure of variation from the arithmetic mean while the simple absolute difference is somewhat a linear scale notionally. Would it not amplify it even more if I cubed it (and made absolute value of course) or quad it?
I do agree computationally cubes and quads would have been more expensive. But with the same argument, the absolute values would have been less expensive... So why squares?
Why is the Normal Distribution like it is, i.e. f = (1/(σ*math.sqrt(2*pi)))*e**((-1/2)*((xi-μ)/σ)**2)?
What impact would it have on the normal distribution formula above if I calculated SD as described in (1) and (2) above?
Is it only a matter of our 'getting used to the squares', it could well have been linear, cubed or quad, and we would have trained our minds likewise?
(I may not have been 100% accurate in my number of opening and closing brackets above, but you will get the idea.)
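For reference, here is the same thing as runnable Python (a small sketch with made-up sample data; only the math module is needed):

```python
# Runnable version of the formulas above, on a made-up sample.
import math

s = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # the sample
n = len(s)                                      # number of observations
mu = sum(s) / n                                 # arithmetic mean

# Standard deviation (n-1 denominator, as written above) and mean absolute deviation.
sd = ((1 / (n - 1)) * sum((xi - mu) ** 2 for xi in s)) ** (1 / 2)
mad = (1 / n) * sum(abs(xi - mu) for xi in s)

def normal_pdf(x, mu, sigma):
    # f = (1/(sigma*sqrt(2*pi))) * e**((-1/2)*((x-mu)/sigma)**2)
    return (1 / (sigma * math.sqrt(2 * math.pi))) * math.e ** ((-1 / 2) * ((x - mu) / sigma) ** 2)

print(sd, mad, normal_pdf(5.0, mu, sd))
```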
So, if you are looking for an index of dispersion, you actually don't have to use the standard deviation. You can indeed report the mean absolute deviation, the summary statistic you suggested. You merely need to be aware of how each summary statistic behaves; for example, the SD assigns more weight to outlying observations. You should also consider how each one can be interpreted. For example, with a normal distribution, we know how much of the distribution lies within ±2 SD of the mean. For some discussion of mean absolute deviation (and other measures of average absolute deviation, such as the median absolute deviation) and their uses, see here.
Beyond its use as a measure of spread, though, the SD is related to the variance, and this touches on some of the other reasons it's popular: the variance has some nice mathematical properties. A mathematician or statistician would be able to provide a more informed answer here, but the squared difference is a smooth function and is differentiable everywhere, allowing one to analytically identify a minimum, which helps when fitting functions to data using least-squares estimation. For more detail and for a comparison with least absolute deviations, see here. Another major area where the variance shines is that it can be easily decomposed and summed, which is useful, for example, in ANOVA and regression models generally. See here for a discussion.
As to your questions about raising to higher powers, they actually do have uses in statistics! In general, the mean (the first moment), the variance (related to the standard deviation), skewness (related to the third power) and kurtosis (related to the fourth power) are all related to the moments of a distribution. Taking differences raised to those powers and standardizing them provides useful information about the shape of a distribution. The video I linked provides some easy intuition.
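As a small illustration of those moment-based quantities (a sketch with synthetic data, using SciPy's default conventions; exact bias corrections vary between definitions):

```python
# Sketch: spread and shape measures built from increasing powers of (x - mean).
import numpy as np
from scipy import stats

x = np.random.default_rng(1).normal(loc=10.0, scale=2.0, size=1000)

mad = np.mean(np.abs(x - x.mean()))   # mean absolute deviation (first power)
sd = x.std(ddof=1)                    # standard deviation (second power)
skew = stats.skew(x)                  # standardized third central moment
kurt = stats.kurtosis(x)              # standardized fourth central moment (excess)

print(mad, sd, skew, kurt)            # for a normal sample: skew ~ 0, excess kurtosis ~ 0
```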
For some other answers and a larger discussion of why the SD is so popular, see here.
Regarding the relationship between sigma and the normal distribution: sigma is simply a parameter that stretches the standard normal distribution, just as the mean shifts its location. This is simply a result of the way the standard normal distribution (a normal distribution with mean = 0 and SD = variance = 1) is mathematically defined, and note that all normal distributions can be derived from the standard normal distribution. This answer illustrates this. Now, you can parameterize a normal distribution in other ways as well, but I believe you do need to provide sigma in some form, whether as the SD, the variance, or the precision. I don't think you can even parametrize a normal distribution using just the mean and the mean absolute difference. Now, a deeper question is why normal distributions are so incredibly useful for representing widely different phenomena and crop up everywhere. I think this is related to the Central Limit Theorem, but I do not understand the proofs of the theorem well enough to comment further.
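A quick numerical check of that stretching/shifting relationship (a small sketch assuming SciPy; any x, mu, sigma would do):

```python
# Sketch: every normal density is a shifted, rescaled standard normal density.
from scipy.stats import norm

x, mu, sigma = 3.7, 2.0, 1.5
lhs = norm.pdf(x, loc=mu, scale=sigma)     # N(mu, sigma^2) density at x
rhs = norm.pdf((x - mu) / sigma) / sigma   # standard normal, shifted and stretched
print(lhs, rhs)                            # equal up to floating-point error
```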

Gaussian approximation of old states

I came across the following sentence referred to the usual Extended Kalman Filter and I'm trying to make sense of it:
States before the current state are approximated with a normal distribution
What does it mean?
The modeled quantity has uncertainty because it is derived from measurements; you can't be sure its value is exactly X. That's why the quantity is represented by a probability density function (or a cumulative distribution function, which is the integral of that).
A probability distribution can look quite arbitrary, but there are many "simple" distributions that approximate the real world well. You've heard of the normal distribution (Gaussian), the uniform distribution (rectangular), ...
The normal distribution (parameters mu and sigma) occurs everywhere in nature, so it's likely that your measurements already fit a normal distribution very well.
Saying "a Gaussian" implies that your distribution isn't a mixture (weighted sum) of Gaussians but a single Gaussian.

Feature scaling and its effect on various algorithms

Despite going through lots of similar questions related to this, I still could not understand why some algorithms are susceptible to it while others are not.
So far I have found that SVM and K-means are susceptible to feature scaling, while Linear Regression and Decision Trees are not. Can somebody please explain why, either in general or in relation to these 4 algorithms?
As I am a beginner, please explain this in layman's terms.
One reason I can think of off-hand is that SVM and K-means, at least in a basic configuration, use an L2 distance metric. An L1 or L2 distance between two points will give different results if you double delta-x or delta-y, for example.
With Linear Regression, you fit a linear transform to best describe the data, effectively transforming the coordinate system before taking a measurement. Since the optimal model is the same no matter the coordinate system of the data, pretty much by definition, the fitted predictions are invariant to any invertible linear transform of the features, including feature scaling (the coefficients simply rescale to compensate).
With Decision Trees, you typically look for rules of the form x < N, where the only detail that matters is how many items pass or fail the given threshold test; this count is what goes into your entropy function. Because this rule format does not depend on the scale of a dimension (there is no continuous distance metric), we again have invariance.
Somewhat different reasons for each, but I hope that helps.
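To make this concrete, here is a small sketch (made-up data; assumes scikit-learn and NumPy) comparing each model's output on raw versus standardized features:

```python
# Sketch: effect of standardizing features with very different scales.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * [1.0, 1000.0]                 # second feature on a much larger scale
y = X[:, 0] + 0.001 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_scaled = StandardScaler().fit_transform(X)

# Linear regression: fitted predictions are (numerically) unchanged by scaling.
lr_raw = LinearRegression().fit(X, y).predict(X)
lr_scl = LinearRegression().fit(X_scaled, y).predict(X_scaled)
print("LinearRegression max prediction change:", np.abs(lr_raw - lr_scl).max())

# Decision tree: split thresholds rescale with the features, predictions unchanged.
dt_raw = DecisionTreeRegressor(random_state=0).fit(X, y).predict(X)
dt_scl = DecisionTreeRegressor(random_state=0).fit(X_scaled, y).predict(X_scaled)
print("DecisionTree max prediction change:", np.abs(dt_raw - dt_scl).max())

# K-means: Euclidean distance is dominated by the large-scale feature, so the
# clustering typically changes after standardization (compared label-free via ARI).
km_raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
km_scl = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print("KMeans clustering similarity (ARI):", adjusted_rand_score(km_raw, km_scl))
```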

Does sklearn.linear_model.LogisticRegression always converge to best solution?

When using this code, I've noticed that it converges unbelievably quickly (small fraction of one second), even when the model and/or the data is very large. I suspect that in some cases I am not getting anything close to the best solution, but this is hard to prove. It would be nice to have the option for some type of global optimizer such as the basin hopping algorithm, even if this consumed 100 to 1,000 times as much CPU. Does anyone have any thoughts on this subject?
This is a very complex question and this answer might be incomplete, but it should give you some hints (as your question also indicates some knowledge gaps):
(1) First, I disagree with the desire for some type of global optimizer such as the basin hopping algorithm, even if this consumed 100 to 1,000 times as much CPU, as this does not help in most cases (in the ML world): the differences are so subtle, and the optimization error will often be negligible compared to the other errors (model power; empirical risk).
Read "Stochastic Gradient Descent Tricks" (Bottou) for an overview (and the error components!).
He even gives a very important reason to use fast approximate algorithms (not necessarily a good fit in your case if 1000x training time is not a problem): approximate optimization can achieve better expected risk because more training examples can be processed during the allowed time.
(2) Basin hopping is one of those highly heuristic global-optimization tools (looking for global minima instead of local minima) without any guarantees at all (touching NP-hardness and co.). It's the last algorithm you would want to use here (see point (3))!
(3) The problem of logistic regression is a convex optimization problem!
A local minimum is always the global minimum, which follows from convexity (I'm ignoring stuff like strict/unique solutions and co.)!
Therefore you should always use something tuned for convex optimization, and never basin hopping!
(4) There are different solvers, and each supports different variants of the problem (different regularizations and co.). We don't know exactly what you are optimizing, but of course these solvers behave differently with regard to convergence:
Take the following comments with a grain of salt:
liblinear: probably uses some CG-based (conjugate gradient) algorithm, which means convergence is highly dependent on the data
whether accurate convergence is achieved depends solely on the exact implementation (liblinear is high quality)
as it's a first-order method, I would call the general accuracy medium
sag/saga: seem to have better convergence theory (I did not check it much), but again: they are dependent on your data, as mentioned in sklearn's docs, and whether the solutions are accurate depends heavily on implementation details
as these are first-order methods: general accuracy medium
newton-cg: an inexact Newton method
in general much more robust in terms of convergence, as line searches replace heuristics or constant learning rates (line searches are costly in first-order optimization)
a second-order method with an inexact core: expected accuracy medium-high
lbfgs: a quasi-Newton method
again, in general much more robust in terms of convergence, like newton-cg
a second-order method: expected accuracy medium-high
Of course second-order methods are hurt more by large-scale data (even complexity-wise), and as mentioned, not all solvers support every logreg optimization problem offered in sklearn.
I hope you get an idea of how complex this question is (because of highly complex solver internals).
Most important things:
LogReg is convex -> use solvers tuned for unconstrained convex optimization
If you want medium-high accuracy: use the second-order methods available and do many iterations (it's a parameter)
If you want high accuracy: use second-order methods that are even more conservative/careful (no Hessian approximation, no inverse-Hessian approximation, no truncation):
e.g. any off-the-shelf solver from convex optimization
Open source: cvxopt, ecos, and co.
Commercial: Mosek
(but you need to formulate the model yourself in their framework, or use some wrapper; there are probably examples of classic logistic regression available)
As expected, some methods will get very slow with a lot of data.
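If you want to sanity-check a particular fit, one cheap approach (a sketch on synthetic data, not a rigorous test) is to refit with a much larger iteration budget and a tighter tolerance and compare the objective values and coefficients:

```python
# Sketch: does sklearn's fast default fit land close to the true optimum?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

def objective(model, X, y):
    # L2-regularized logistic loss that sklearn minimizes (y in {0, 1} mapped to {-1, +1}).
    z = X @ model.coef_.ravel() + model.intercept_
    log_loss = np.sum(np.logaddexp(0.0, -(2 * y - 1) * z))
    return 0.5 * model.coef_.ravel() @ model.coef_.ravel() + model.C * log_loss

fast = LogisticRegression(solver="lbfgs", max_iter=100).fit(X, y)             # default budget
tight = LogisticRegression(solver="lbfgs", max_iter=10000, tol=1e-10).fit(X, y)

print("objective (default):", objective(fast, X, y))
print("objective (tight)  :", objective(tight, X, y))
print("max coefficient difference:", np.abs(fast.coef_ - tight.coef_).max())
```

If the two objectives and coefficient vectors agree to many digits, the fast fit was already essentially optimal; if not, raising max_iter or tightening tol (both plain parameters of LogisticRegression) is the first thing to try before reaching for heavier machinery.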

Bayesian t-test assumptions

Good afternoon,
I know that the traditional independent t-test assumes homoscedasticity (i.e., equal variances across groups) and normality of the residuals.
These are usually checked using Levene's test for homogeneity of variances, and the Shapiro-Wilk test and QQ-plots for the normality assumption.
Which statistical assumptions do I have to check with the Bayesian independent t-test? How may I check them in R with coda and rjags?
For whichever test you want to run, find the formula and plug in the posterior draws of the parameters you have, such as the variance parameter and any regression coefficients the formula requires. Iterating the formula over the posterior draws will give you a range of values for the test statistic, from which you can take the mean to get an average value and the SD to get a standard deviation (an uncertainty estimate).
And boom, you're done.
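For example, with a standardized mean difference as the statistic (a sketch in Python rather than R, with synthetic draws standing in for coda/rjags output; in practice you would take these columns straight from your MCMC samples):

```python
# Sketch: iterate a statistic's formula over posterior draws, then summarize.
import numpy as np

rng = np.random.default_rng(0)
n_draws = 4000

# Pretend these came from an MCMC sampler: posterior draws of each group's mean and SD.
mu1, mu2 = rng.normal(5.0, 0.2, n_draws), rng.normal(4.4, 0.2, n_draws)
sd1, sd2 = np.abs(rng.normal(1.0, 0.1, n_draws)), np.abs(rng.normal(1.1, 0.1, n_draws))

# Standardized mean difference (Cohen's-d style) evaluated at every draw.
pooled_sd = np.sqrt((sd1**2 + sd2**2) / 2)
effect_size = (mu1 - mu2) / pooled_sd

print("posterior mean effect size:", effect_size.mean())
print("posterior sd (uncertainty):", effect_size.std())
print("95% credible interval:", np.percentile(effect_size, [2.5, 97.5]))
```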
There might be non-parametric Bayesian t-tests, but commonly Bayesian t-tests are parametric, and as such they assume equality of the relevant population variances. If you can obtain a t-value from a regular t-test (whatever flavor of t-test you need, from any software package you're comfortable with) and check the variances with Levene's test (don't consider it a wholly dependable test; remember it relies on a p-value), then you can do a Bayesian t-test. But remember that a Bayesian t-test requires a conventional model of the observations (the likelihood) and an appropriate prior for the parameter of interest.
It is highly recommended that t-tests be re-parameterized in terms of effect sizes (especially standardized mean-difference effect sizes). That is, you focus on the Bayesian estimation of the effect size arising from the t-test rather than the other parameters in the t-test. If you opt to estimate the effect size from a t-test, then a very easy-to-use, free, online Bayesian t-test software is THIS ONE HERE (probably one of the most user-friendly packages available; note that this software uses a Cauchy prior for the effect size arising from any type of t-test).
Finally, since you want to do a Bayesian t-test, I would suggest focusing your attention on picking an appropriate/defensible/meaningful prior rather than on Levene's test. No test can really show whether the sample data (in your case) came from two populations with equal variances unless data is plentiful. Note that the question of whether the sample data came from populations with equal variances is itself an inferential (Bayesian or non-Bayesian) question.

Resources