How to sample a multivariate normal distribution in Math.NET? - statistics

Am I missing something obvious here? Math.NET has a wealth of probability distribution sampling classes, but no multivariate normal distribution. It has Normal and MatrixNormal classes; is there an easy way of adapting either of those to sample a multivariate normal distribution defined by a mean vector and a covariance matrix?

As per @robert-dodier's suggestion, the MatrixNormal distribution becomes the multivariate normal at p = 1. This is more verbose than a native multivariate normal distribution class would be, but not by much:
using MathNet.Numerics.Distributions;
using MathNet.Numerics.LinearAlgebra;

Vector<double> Sample(System.Random random, Vector<double> mean, Matrix<double> cov)
{
    // Mean as a p x 1 matrix, among-column covariance = 1 x 1 identity:
    // the matrix normal then reduces to the multivariate normal.
    return MatrixNormal
        .Sample(random, mean.ToColumnMatrix(), cov, Matrix<double>.Build.DenseIdentity(1))
        .Column(0);
}
However, only positive-definite covariance matrices are accepted, since the distribution performs a Cholesky decomposition internally.
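For reference, a short sketch of the standard construction behind this reduction (textbook material, not anything Math.NET-specific): with a Cholesky factor L of the covariance,

x = \mu + L z, \qquad z \sim \mathcal{N}(0, I_p), \qquad L L^\top = \Sigma \;\Longrightarrow\; x \sim \mathcal{N}(\mu, \Sigma),

which is why the covariance has to be positive definite: a full-rank Cholesky factor exists only in that case.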

Related

How to calculate mutual information in PyTorch (differentiable estimator)

I am training a model with PyTorch, where I need to calculate the degree of dependence between two tensors as part of my loss function (say the two tensors each contain values very close to zero or one, e.g. v1 = [0.999, 0.998, 0.001, 0.98] and v2 = [0.97, 0.01, 0.997, 0.999]). I am trying to calculate mutual information, but I can't find any mutual information estimation implementation in PyTorch. Has such a thing been provided anywhere?
Mutual information is defined for distributions, not for individual points. So, I will write the next part assuming v1 and v2 are samples from a distribution p, and that you have n > 1 samples from p.
You want a method to estimate mutual information from samples. There are many ways to do this. One of the simplest would be to use a non-parametric estimator like NPEET (https://github.com/gregversteeg/NPEET). It works with numpy (you can convert from torch to numpy for this). There are more involved parametric models for which you may be able to find implementations in PyTorch (see https://arxiv.org/abs/1905.06922).
If you only have two vectors and want a similarity measure, a dot product similarity would be more suitable than mutual information, since there is no distribution involved.
It is not provided in the official PyTorch code, but here is a PyTorch implementation that uses kernel density estimation for the histogram approximation. Note that this method is fully differentiable.
Alternatively, you can also use the differentiable histogram functions in Kornia to compute the MI metric yourself if you want more control for whatever reason.
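For illustration, here is a minimal sketch of that kernel-density / soft-histogram idea written directly in PyTorch. It is not the linked implementation; the bin count, bandwidth sigma, and the assumption that both tensors take values in [0, 1] are choices made only for this example:

import torch

def soft_histogram_weights(x, bins, sigma):
    # Soft-assign each sample to every bin centre with a Gaussian kernel,
    # normalised per sample; result has shape (n_samples, n_bins).
    return torch.softmax(-((x.unsqueeze(1) - bins.unsqueeze(0)) ** 2) / (2 * sigma ** 2), dim=1)

def mutual_information(x, y, n_bins=16, sigma=0.1, eps=1e-10):
    bins = torch.linspace(0.0, 1.0, n_bins)      # assumes values in [0, 1]
    wx = soft_histogram_weights(x, bins, sigma)
    wy = soft_histogram_weights(y, bins, sigma)
    pxy = wx.t() @ wy                            # soft joint histogram
    pxy = pxy / pxy.sum()
    px = pxy.sum(dim=1, keepdim=True)            # marginal of x
    py = pxy.sum(dim=0, keepdim=True)            # marginal of y
    return (pxy * (torch.log(pxy + eps) - torch.log(px @ py + eps))).sum()

v1 = torch.tensor([0.999, 0.998, 0.001, 0.98], requires_grad=True)
v2 = torch.tensor([0.97, 0.01, 0.997, 0.999])
mi = mutual_information(v1, v2)
mi.backward()                                    # gradients flow back to v1

Every step is a smooth tensor operation, so the estimate can sit inside a loss function and gradients will propagate through it.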

Does GPyTorch use analytic gradients or automatic differentiation for training?

I am confused about how GPyTorch calculates gradients with respect to the parameters of the model. For instance, let's say I am using ExactGP with a Gaussian likelihood, an RBF kernel, and a constant mean, and using MLE (maximum likelihood estimation) to find the parameters of the model (mean, kernel parameters, and noise). One way to compute the gradient with respect to the model parameters is the analytical gradient, which means taking the derivative of the negative log-likelihood with respect to each parameter and working out a closed-form expression for it. Another way is to use the automatic differentiation provided by PyTorch.
The GPyTorch authors state in their paper, "GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration", that they use the analytical gradient, or at least that is what I understood from reading the paper. Am I correct? Also, I couldn't find the code where they implement the analytical gradient.
Could anyone help me understand this better, please?
The "automatic differentiation provided by PyTorch" does compute the analytic gradient (via back-propagation, note that there is no finite differencing or anything like that involved) - it just does so automatically.
https://github.com/cornellius-gp/gpytorch/discussions/1949#discussioncomment-2384471
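As a small, generic illustration of that point (a toy example, not GPyTorch code): applying autograd to a hand-written Gaussian negative log-likelihood reproduces exactly the gradient you would derive on paper, with no finite differencing anywhere:

import math
import torch

x = torch.tensor([0.5, -1.2, 0.3])            # toy observations
s2 = torch.tensor(0.8, requires_grad=True)    # variance parameter

# Negative log-likelihood of a zero-mean Gaussian with variance s2
nll = 0.5 * (x.numel() * torch.log(2 * math.pi * s2) + (x ** 2).sum() / s2)
nll.backward()

# Hand-derived gradient: n / (2 s2) - sum(x^2) / (2 s2^2)
analytic = x.numel() / (2 * s2) - (x ** 2).sum() / (2 * s2 ** 2)
print(s2.grad, analytic)                      # agree up to floating-point error

In that sense back-propagation is just an automated way of evaluating the analytic gradient of whatever marginal log-likelihood the model builds.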

How can we use TFIDF vectors with multinomial naive Bayes?

Say we have used the TFIDF transform to encode documents into continuous-valued features.
How would we now use this as input to a Naive Bayes classifier?
Bernoulli naive Bayes is out, because our features aren't binary anymore.
It seems we can't use multinomial naive Bayes either, because the values are continuous rather than categorical.
As an alternative, would it be appropriate to use Gaussian naive Bayes instead? Are TFIDF vectors likely to hold up well under the Gaussian-distribution assumption?
The scikit-learn documentation for MultinomialNB says the following:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
Isn't it fundamentally impossible to use fractional values for MultinomialNB?
As I understand it, the likelihood function itself assumes that we are dealing with discrete counts (since it involves counting/factorials).
How would TFIDF values even work with this formula?
Technically, you are right. The (traditional) multinomial naive Bayes model considers a document D as a vocabulary-sized feature vector x, where each element x_i is the count of term i in document D. By definition, this vector x then follows a multinomial distribution, leading to the characteristic classification function of MNB.
When using TF-IDF weights instead of term counts, our feature vectors are (most likely) no longer multinomially distributed, so the classification function is no longer theoretically well-founded. However, in practice it turns out that tf-idf weights instead of counts work (much) better.
How would TFIDF values even work with this formula?
In the exact same way, except that the feature vector x is now a vector of tf-idf weights and not counts.
You can also check out the sublinear tf-idf weighting scheme, implemented in sklearn's TfidfVectorizer. In my own research I found it to perform even better: it uses a logarithmic version of the term frequency. The idea is that when a query term occurs 20 times in document A and 1 time in document B, document A should (probably) not be considered 20 times as important, but more likely log(20) times as important.
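A minimal scikit-learn sketch of the whole workflow (the tiny corpus and labels are made up purely for illustration): MultinomialNB consumes the fractional tf-idf features directly, and sublinear_tf=True turns on the logarithmic term-frequency scaling mentioned above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["good movie", "terrible movie", "great film", "awful film"]
labels = [1, 0, 1, 0]

# sublinear_tf=True replaces the raw term frequency tf with 1 + log(tf)
model = make_pipeline(TfidfVectorizer(sublinear_tf=True), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["good film"]))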

For the multivariate normal model, why is Jeffreys' prior distribution not a probability density?

For the multivariate normal model, Jeffreys' rule for generating a prior distribution on (theta, Sigma) gives p_J(theta, Sigma) proportional to |Sigma|^{-(p+2)/2}.
My book notes in a footnote that p_J cannot actually be a probability density for (theta, Sigma). Why is this?
It's "improper", meaning it doesn't integrate to 1 as probability distributions have to do. For example, the marginal density with respect to theta is just a constant, whose integral over the real line is infinite. It's OK to use improper distributions as priors in Bayesian inference, as long as the posterior is a proper probability distribution.

"pre-built" matrices for latent semantic analysis

I want to use Latent Semantic Analysis for a small app I'm building, but I don't want to build up the matrices myself. (Partly because the documents I have wouldn't make a very good training collection, because they're kinda short and heterogeneous, and partly because I just got a new computer and I'm finding it a bitch to install the linear algebra and such libraries I would need.)
Are there any "default"/pre-built LSA implementations available? For example, things I'm looking for include:
Default U,S,V matrices (i.e., if D is a term-document matrix from some training set, then D = U S V^T is the singular value decomposition), so that given any query vector q, I can use these matrices to compute the LSA projection of q myself.
Some black-box LSA algorithm that, given a query vector q, returns the LSA projection of q.
You'd probably be interested in the Gensim framework for Python; notably, it has an example on building the appropriate matrices from English Wikipedia.
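For example, a minimal gensim sketch (toy pre-tokenised corpus and an arbitrary num_topics, just to show the shape of the API) that trains an LSI model and projects a query into the latent space:

from gensim import corpora, models

# Toy, pre-tokenised corpus purely for illustration
texts = [["human", "computer", "interaction"],
         ["graph", "minors", "trees"],
         ["graph", "trees", "computer"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Train the LSI (LSA) model; num_topics is the rank of the truncated SVD
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

# Project a new query into the same latent space
query_bow = dictionary.doc2bow(["computer", "trees"])
print(lsi[query_bow])

Gensim's Wikipedia example follows the same pattern, just with a much larger dictionary and corpus.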
