Acceptance probability for Metropolis-Hastings MCMC on multinomial-Dirichlet model - statistics

As an exercise to learn how to manually code MCMC, I've built a Metropolis-Hastings sampler on top of a multinomial-Dirichlet posterior distribution. Since a closed form solution exists, I can compare results from the MCMC with simulations from the actual posterior distribution.
I'm using a Dirichlet proposal distribution with parameters equal to the latest probabilities in the chain times a scaling constant (~1000), which makes the expected value of the distribution equal to those probabilities, with the scaling constant controlling the variance. Since this distribution certainly isn't symmetric, I tried adding the ratio of the proposal densities to the calculation of the acceptance probability.
Doing this, however, seems to bias the results away from the results given by the closed form solution. The only way I've gotten results from the MCMC to match the results from the closed-form solution is to calculate the acceptance probability from the posterior distribution alone, as you would if the proposal distribution were symmetric. R code here: https://github.com/sivadivel/nps_stats/blob/master/manual_mcmc.R
My question is: why is this the case?
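For reference, here is a minimal Python sketch of the setup described above, with the Hastings correction included. It is not a translation or a diagnosis of the linked R code; the function names, the example counts and the flat prior are all assumptions made for illustration. The acceptance ratio is p(θ')/p(θ) times q(θ|θ')/q(θ'|θ), so the proposal density in the numerator is the one centred on the proposed point and evaluated at the current point. One thing worth checking in any implementation is that this ratio is not inverted.

```python
import math
import random

def log_dirichlet_pdf(x, alpha):
    """Log density of Dirichlet(alpha) at a point x on the simplex."""
    return (math.lgamma(sum(alpha))
            - sum(math.lgamma(a) for a in alpha)
            + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x)))

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising independent Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [gi / total for gi in g]

def mh_dirichlet(counts, prior, scale=1000.0, n_iter=20000, seed=0):
    """Metropolis-Hastings with a Dirichlet(scale * theta) proposal."""
    rng = random.Random(seed)
    # Conjugate posterior parameters; used here as the target density.
    post = [c + a for c, a in zip(counts, prior)]
    theta = [1.0 / len(counts)] * len(counts)
    draws = []
    for _ in range(n_iter):
        fwd = [scale * t for t in theta]   # proposal centred on current point
        prop = sample_dirichlet(fwd, rng)
        rev = [scale * p for p in prop]    # proposal centred on proposed point
        log_ratio = (log_dirichlet_pdf(prop, post) - log_dirichlet_pdf(theta, post)
                     + log_dirichlet_pdf(theta, rev)   # q(current | proposed)
                     - log_dirichlet_pdf(prop, fwd))   # q(proposed | current)
        if math.log(1.0 - rng.random()) < log_ratio:
            theta = prop
        draws.append(theta)
    return draws, post

# Hypothetical data: 100 observations in 3 categories, flat Dirichlet(1,1,1) prior.
draws, post = mh_dirichlet([30, 50, 20], [1.0, 1.0, 1.0])
kept = draws[2000:]  # discard burn-in
means = [sum(d[i] for d in kept) / len(kept) for i in range(3)]
exact = [a / sum(post) for a in post]  # closed-form posterior means
```

Comparing `means` with `exact` gives the same kind of check against the closed-form posterior that the question describes.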

Related

Loss functions in LightFM

I recently came across LightFM while learning to train a recommender system. So far I know that it offers four loss functions: logistic, BPR, WARP and k-OS WARP. I have not gone through the math behind them. What I am confused about is how to know which loss function to use in which situation.
From lightfm model documentation page:
logistic: useful when both positive (1) and negative (-1) interactions are present.
BPR: Bayesian Personalised Ranking [1] pairwise loss. Maximises the prediction difference between a positive example and a randomly chosen negative example. Useful when only positive interactions are present and optimising ROC AUC is desired.
WARP: Weighted Approximate-Rank Pairwise [2] loss. Maximises the rank of positive examples by repeatedly sampling negative examples until a rank-violating one is found. Useful when only positive interactions are present and optimising the top of the recommendation list (precision@k) is desired.
k-OS WARP: k-th order statistic loss [3]. A modification of WARP that uses the k-th positive example for any given user as a basis for pairwise updates.
Everything boils down to how your dataset is structured and what kind of user interactions you're looking at. One obvious approach is to include the loss function in your parameter grid during hyperparameter tuning (at least that's what I did) and compare model accuracy. I find that investigating why a given loss function performed better or worse on a dataset is a good learning exercise.
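To make the BPR and WARP descriptions quoted above a little more concrete, here is a toy sketch in plain Python. It is not LightFM's implementation: the function names, the margin of 1.0 and sampling without replacement are assumptions chosen for illustration.

```python
import math
import random

def bpr_loss(pos_score, neg_score):
    """BPR pairwise loss, -log(sigmoid(pos - neg)).

    Small when the positive item scores well above the negative one.
    """
    return math.log1p(math.exp(-(pos_score - neg_score)))

def warp_sample(scores, pos_item, rng, margin=1.0):
    """Sample negatives until one violates the ranking margin.

    Returns (negative_item, tries), or (None, tries) if no negative
    violates the margin. Few tries suggests the positive item is
    ranked poorly, which WARP turns into a larger update.
    """
    negatives = [i for i in range(len(scores)) if i != pos_item]
    rng.shuffle(negatives)
    for tries, neg in enumerate(negatives, start=1):
        if scores[neg] > scores[pos_item] - margin:
            return neg, tries
    return None, len(negatives)

rng = random.Random(0)
# Item 1 is the positive; item 0 outranks it, item 2 does not.
violator, tries = warp_sample([3.0, 1.0, 0.0], pos_item=1, rng=rng)
# Item 1 is clearly on top, so no violating negative exists.
no_violator, all_tries = warp_sample([0.1, 5.0, 0.2], pos_item=1, rng=rng)
```

BPR compares the positive with a single random negative, so it optimises pairwise ordering on average, while the WARP-style search concentrates updates on positives that are currently ranked too low.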

How does MCMC help bayesian inference?

The literature says that the Metropolis-Hastings algorithm in MCMC is one of the most important algorithms developed in the last century and that it was revolutionary. The literature also says that it was this development in MCMC that gave Bayesian statistics a second birth.
I understand what MCMC does - it provides an efficient way to draw samples from any complicated probability distribution.
I also know what bayesian inference is - it is the process by which the full posterior distribution of parameters is calculated.
I am having a difficult time connecting the dots here:
At which step in the process of Bayesian inference does MCMC come into play? Why is MCMC so important that people say it gave Bayesian statistics a second birth?
You might want to ask a similar question on StatsExchange. However, here is an attempt for a high level "build some intuition" answer (disclaimer: I am a Computer Scientist and not a Statistician. Head over to StatsExchange for a more formal discussion).
Bayesian Inference:
In the most basic sense we follow Bayes rule: p(Θ|y)=p(y|Θ)p(Θ)/p(y). Here p(Θ|y) is called the 'posterior' and this is what you are trying to compute. p(y|Θ) is called the 'data likelihood' and is typically given by your model or your generative description of the data. p(Θ) is called the 'prior' and it captures your belief about the plausible values of the parameters before observing the data. p(y) is called the 'marginal likelihood' and using the law of total probability can be expressed as ∫ p(y|Θ)p(Θ) dΘ. That looks really neat but in reality the p(y) is often intractable to compute analytically and in high dimensions (i.e. when Θ has many dimensions) numerical integration is imprecise and computationally intractable. There are certain cases when the conjugate structure of the problem allows you to compute this analytically, but in many useful models this is simply not possible. Therefore, we turn to approximating the posterior.
There are two ways (that I know of) to approximate the posterior: Monte Carlo and Variational Inference. Since you asked about MCMC, I'll stick to that.
Monte Carlo (and Markov Chain Monte Carlo):
Many problems in Statistics deal with taking expectations of functions under probability distributions. From the Law of Large Numbers, an expectation can be efficiently approximated by a Monte Carlo estimator. Therefore, if we can draw samples from a distribution (even if we don't know the distribution itself) then we can compute a Monte Carlo estimate of the expectation in question. The key is that we don't need to have an expression for the distribution: If we just have samples then we can compute the expectations that we are interested in. But there is a catch... How to draw the samples??
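As a small illustration of the Monte Carlo estimator described above, the following sketch averages a function over random draws. The helper name `mc_expectation` and the choice of test problem (E[X^2] for a standard normal, which equals 1 exactly) are assumptions for the example.

```python
import random

def mc_expectation(f, sampler, n, seed=0):
    """Approximate E[f(X)] by averaging f over n draws from sampler."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n)) / n

# E[X^2] for X ~ N(0, 1); the exact answer is 1.
estimate = mc_expectation(lambda x: x * x,
                          lambda rng: rng.gauss(0.0, 1.0),
                          n=100_000)
```

Note that the estimator only ever touches the samples; it never needs a formula for the density they came from.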
There has been a lot of work on ways of drawing samples from unknown distributions. These include 'rejection', 'importance' and 'slice' sampling. These were all great innovations and were useful in many applications, but they all suffered from scaling poorly to high dimensions. For example, rejection sampling draws samples from a known 'proposal' distribution and then accepts or rejects each sample with a probability that requires evaluating the (unnormalised) target density and the proposal density. This is wonderful in 1 dimension, but as the dimensionality grows, the probability that a given sample gets rejected increases dramatically.
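The scaling problem can be seen in a toy rejection sampler. The sketch below assumes a deliberately simple target (uniform on the unit ball) and proposal (uniform on the enclosing cube), both chosen for illustration; the acceptance rate is then the ratio of the two volumes.

```python
import random

def acceptance_rate(dim, n_proposals=20000, seed=0):
    """Fraction of uniform draws from [-1, 1]^dim that land in the unit ball."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(n_proposals):
        point = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if sum(x * x for x in point) <= 1.0:  # accept: inside the target region
            accepted += 1
    return accepted / n_proposals

rates = {d: acceptance_rate(d) for d in (1, 2, 5, 10)}
```

In theory the acceptance rate is about 0.785 in 2 dimensions and about 0.0025 in 10 dimensions, so almost every proposal is wasted well before the dimension becomes large.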
Markov Chain Monte Carlo was an innovation that has some very nice theoretical guarantees attached to it. The key idea is not to draw each sample afresh from a proposal distribution but rather to start from the current sample (with the hope that it is in an area of high probability mass) and make a small random step drawn from a proposal distribution. Ideally, if the current sample is in an area of high probability mass then the next draw is also likely to be accepted. Therefore, you end up accepting many more samples and you don't waste time drawing samples that are going to be rejected. The amazing thing is that if you run the Markov chain long enough (in the limit, forever) and under specific conditions (for example, the chain must be aperiodic and irreducible) then your samples will be drawn from the true posterior of your model. That's amazing! The MCMC technique draws dependent samples, so it scales to a higher dimensionality than previous methods, but under the right conditions, even though the samples are dependent, averages over them converge to the same expectations you would get from IID draws from the desired distribution (which is the posterior in Bayesian Inference).
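A minimal random-walk Metropolis sampler makes the "small random step from the current sample" idea concrete. This is a sketch under stated assumptions: the target is a one-dimensional density known only up to a constant (here an unnormalised normal centred at 3, picked so the answer can be checked), and the names `metropolis` and `log_target` are illustrative.

```python
import math
import random

def metropolis(log_target, start, step_sd, n_iter, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x = start
    log_p = log_target(x)
    chain = []
    for _ in range(n_iter):
        candidate = x + rng.gauss(0.0, step_sd)  # small step from the current state
        log_p_cand = log_target(candidate)
        # Symmetric proposal, so only the ratio of target densities is needed.
        if math.log(1.0 - rng.random()) < log_p_cand - log_p:
            x, log_p = candidate, log_p_cand
        chain.append(x)  # a rejected step repeats the current state
    return chain

def log_target(x):
    """Unnormalised log density of N(3, 1); the constant is never needed."""
    return -0.5 * (x - 3.0) ** 2

chain = metropolis(log_target, start=0.0, step_sd=1.0, n_iter=50000)
kept = chain[1000:]  # discard burn-in
mean = sum(kept) / len(kept)
```

The sampler only ever evaluates differences of `log_target`, which is why the intractable normalising constant p(y) drops out.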
Tying it together (and hopefully answering your question):
MCMC can be seen as a tool that enables Bayesian Inference (just as analytical calculation from conjugate structure, Variational Inference and Monte Carlo are alternatives). Apart from an analytical solution, all of the other tools are approximating the true posterior. Our goal is then to make the approximation as good as possible and to do this as cheaply as possible (in both computation cost and the cost of working through a bunch of messy algebra). Previous sampling methods did not scale to high dimensions (which are typical of any real world problem) and therefore Bayesian Inference became computationally very expensive and impractical in many instances. However, MCMC opened the door to a new way to efficiently draw samples from a high dimensional posterior, to do this with good theoretical guarantees and to do this (comparatively) easily and computationally cheaply.
It is worth mentioning that Metropolis itself has problems: it struggles with a highly correlated latent parameter space, it requires a user-specified proposal distribution, and the correlation between samples can be high, which leads to a small effective sample size and poor estimates from finite chains. Therefore more modern and sometimes more useful MCMC tools have been proposed to try to combat this. See 'Hamiltonian Monte Carlo' and the 'No U-Turn Sampler' for the state of the art. Nonetheless, Metropolis was a huge innovation that suddenly made real world problems computationally tractable.
A last note: See this discussion by MacKay for a really good overview of these topics.
This post https://stats.stackexchange.com/a/344360/137466 perfectly clears up my question on how MCMC sampling helps with Bayesian inference. In particular, the following part of the post is the key concept that I missed:
The Markov chain has a stationary distribution
which is the distribution that preserves itself if you run it through
the chain. Under certain broad assumptions (e.g., the chain is
irreducible, aperiodic), the stationary distribution will also be the
limiting distribution of the Markov chain, so that regardless of how
you choose the starting value, this will be the distribution that the
outputs converge towards as you run the chain longer and longer. It
turns out that it is possible to design a Markov chain with a
stationary distribution equal to the posterior distribution, even
though we don't know exactly what that distribution is. That is, it
is possible to design a Markov chain that has $\pi( \theta |
\mathbb{x} )$ as its stationary limiting distribution, even if all we
know is that $\pi( \theta | \mathbb{x} ) \propto L_\mathbb{x}(\theta)
\pi(\theta)$. There are various ways to design this kind of Markov
chain, and these various designs constitute available MCMC algorithms
for generating values from the posterior distribution.
Once we have designed an MCMC method like this, we know that we can
feed in any arbitrary starting value $\theta_{(0)}$ and the
distribution of the outputs will converge to the posterior
distribution (since this is the stationary limiting distribution of
the chain). So we can draw (non-independent) samples from the
posterior distribution by starting with an arbitrary starting value,
feeding it into the MCMC algorithm, waiting for the chain to converge
close to its stationary distribution, and then taking the subsequent
outputs as our draws.

Improving linear regression model by taking absolute value of predicted output?

I have a particular regression problem whose predictions I was able to improve using Python's abs() function. I am still somewhat new when it comes to machine learning, and I wanted to know if what I am doing is actually "allowed," so to speak. The following lines describe my method:
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict
lr = linear_model.LinearRegression()
predicted = abs(cross_val_predict(lr, features, labels_postop_IS, cv=10))
I attempted this solution because linear regression can sometimes produce negative predicted values, even though in my particular case these predictions should never be negative, as they are a physical quantity.
Using the abs() function, my predictions produce a better fit for the data.
Is this allowed?
Why would it not be "allowed"? I mean, if you want to make certain statistical statements (like a 95% CI, for example) you need to be careful. However, most ML practitioners do not care too much about underlying statistical assumptions and just want a black-box model that can be evaluated based on accuracy or some other performance metric. So basically everything is allowed in ML, you just have to be careful not to overfit. Maybe a more sensible solution to your problem would be to use a function that truncates at 0 like f(x) = x if x > 0 else 0. This way large negative values don't suddenly become large positive ones.
On a side note, you should probably try some other models as well with more parameters, like an SVR with a non-linear kernel. The thing is that an LR fits a line, and if this line is not parallel to your x-axis (thinking of the single-variable case) it will inevitably lead to negative values at some point on the line. That's one reason why it is often advised not to use LRs for predictions outside the "fitted" data.
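The difference between abs() and truncating at zero is easy to see on a few made-up predictions. The helper name `clip_at_zero` and the numbers are illustrative only.

```python
def clip_at_zero(predictions):
    """Apply f(x) = x if x > 0 else 0 to every prediction."""
    return [p if p > 0 else 0.0 for p in predictions]

raw = [2.5, 0.3, -0.1, -4.0]          # hypothetical model outputs
with_abs = [abs(p) for p in raw]      # [2.5, 0.3, 0.1, 4.0]
with_clip = clip_at_zero(raw)         # [2.5, 0.3, 0.0, 0.0]
```

With abs(), the prediction of -4.0 becomes the second-largest value, whereas truncation maps it to the nearest physically valid value.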
A straight line y=a+bx will predict negative y for some x unless b=0 and a≥0. Using a logarithmic scale seems a natural way to fix this.
In the case of linear regression, there is no restriction on your outputs.
If your data is non-negative (as in your case, where the values are physical quantities and cannot be negative), you could use a generalized linear model (GLM) with a log link function. With a Poisson response this is known as Poisson regression, which is most natural when the response is a discrete non-negative count. The Poisson distribution is parameterized by a single value λ, which is both the expected value and the variance of the distribution.
In effect, you fit a linear model to the log of the expected value of your observations, so the predictions are always positive.
I cannot say your approach is wrong, but the method above is a better way to go.
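To show why modelling on the log scale keeps predictions positive, here is a rough sketch that swaps in ordinary least squares on log(y) for the GLM. This is a simpler technique than a Poisson GLM with a log link and will generally give different coefficients; the function names and the synthetic data are assumptions for the example.

```python
import math

def fit_log_linear(xs, ys):
    """Least-squares fit of log(y) = intercept + slope * x; needs all y > 0."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    slope = sxy / sxx
    intercept = mean_log - slope * mean_x
    return intercept, slope

def predict(intercept, slope, x):
    """Back-transform with exp, so the prediction is always positive."""
    return math.exp(intercept + slope * x)

# Synthetic, noise-free data generated from y = exp(1 - 0.5 * x).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(1.0 - 0.5 * x) for x in xs]
intercept, slope = fit_log_linear(xs, ys)
far_prediction = predict(intercept, slope, 100.0)
```

However far the model is extrapolated, the prediction shrinks towards zero rather than crossing it, which a straight line on the original scale cannot guarantee.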

Spark Naive Bayes Result accuracy (Spark ML 1.6.0) [duplicate]

I am using Spark ML to optimise a Naive Bayes multi-class classifier.
I have about 300 categories and I am classifying text documents.
The training set is balanced enough and there are about 300 training examples for each category.
All looks good and the classifier is working with acceptable precision on unseen documents. But what I am noticing is that when classifying a new document, very often, the classifier assigns a high probability to one of the categories (the prediction probability is almost equal to 1), while the other categories receive very low probabilities (close to zero).
What are the possible reasons for this?
I would like to add that in Spark ML there is something called "raw prediction" and when I look at it, I can see negative numbers of more or less comparable magnitude, so even the category with the high probability has a comparable raw prediction score. I am finding it difficult to interpret these scores.
Let's start with a very informal description of the Naive Bayes classifier. If $C$ is the set of all classes, $d$ is a document and $x_i$ are its features, Naive Bayes returns:
$$\hat{c} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \frac{P(d \mid c) P(c)}{P(d)}$$
Since $P(d)$ is the same for all classes we can simplify this to
$$\hat{c} = \arg\max_{c \in C} P(d \mid c) P(c)$$
where
$$P(d \mid c) = P(x_1, \ldots, x_n \mid c)$$
Since we assume that the features are conditionally independent (that is why it is naive) we can further simplify this (with Laplace correction to avoid zeros) to:
$$\hat{c} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(x_i \mid c)$$
The problem with this expression is that in any non-trivial case it is numerically equal to zero. To avoid this we use the following property:
$$\log(ab) = \log a + \log b$$
and replace the initial condition with:
$$\hat{c} = \arg\max_{c \in C} \left( \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c) \right)$$
These are the values you get as the raw predictions. Since each term is negative (the logarithm of a value in (0, 1]) the whole expression is negative as well. As you discovered yourself, these values are then exponentiated and normalized so that the maximum value is equal to 1, and finally divided by the sum of the normalized values.
It is important to note that while the values you get are not strictly P(c|d), they preserve all the important properties. The order and the ratios are exactly the same (ignoring possible numerical issues). If one class gets a prediction close to one and no other class comes close, it means that, given the evidence, it is a very strong prediction. So it is actually something you want to see.
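The normalisation step explains why raw scores of similar magnitude turn into probabilities near 0 and 1. Below is a minimal sketch of that conversion; the function name `raw_to_probability` and the example scores are made up, and this is an illustration of the idea rather than Spark's own code.

```python
import math

def raw_to_probability(raw):
    """Turn log-scale scores into probabilities that sum to 1."""
    top = max(raw)
    # Subtracting the maximum makes the largest entry exp(0) = 1
    # and avoids underflow when exponentiating.
    shifted = [math.exp(r - top) for r in raw]
    total = sum(shifted)
    return [s / total for s in shifted]

raw = [-810.0, -825.0, -830.0]  # hypothetical raw predictions for 3 classes
probs = raw_to_probability(raw)
naive = math.exp(raw[0])        # exponentiating directly underflows to 0.0
```

The raw scores differ by less than 3% in relative terms, but a gap of 15 on the log scale is a factor of exp(15), roughly 3.3 million, on the probability scale, so the top class ends up with almost all of the mass.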

information criteria for confusion matrices

One can measure the quality of a statistical model using the Akaike Information Criterion (AIC), which accounts for both goodness of fit and the number of parameters used to build the model. AIC involves calculating the maximized value of the likelihood function for that model (L).
How can one compute L, given prediction results of a classification model, represented as a confusion matrix?
It is not possible to calculate the AIC from a confusion matrix since it doesn't contain any information about the likelihood. Depending on the model you are using it may be possible to calculate the likelihood or quasi-likelihood and hence the AIC or QIC.
What is the classification problem that you are working on, and what is your model?
In a classification context often other measures are used to do GoF testing. I'd recommend reading through The Elements of Statistical Learning by Hastie, Tibshirani and Friedman to get a good overview of this kind of methodology.
Hope this helps.
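The point that a confusion matrix carries no likelihood information can be illustrated with two hypothetical binary classifiers that produce the same confusion matrix but different likelihoods, and therefore different AIC values (AIC = 2k - 2 ln L). All names, probabilities and the parameter count of 3 are assumptions for the example.

```python
import math

def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of binary labels given predicted P(y = 1)."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(y_true, p_pred))

def aic(log_lik, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_lik

def confusion_matrix(y_true, p_pred, threshold=0.5):
    """Return (TP, FP, FN, TN) after thresholding the probabilities."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, p_pred):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

y = [1, 1, 0, 0]
confident = [0.9, 0.8, 0.1, 0.2]    # hypothetical model A
hesitant = [0.6, 0.55, 0.45, 0.4]   # hypothetical model B

same_matrix = confusion_matrix(y, confident) == confusion_matrix(y, hesitant)
aic_confident = aic(log_likelihood(y, confident), n_params=3)
aic_hesitant = aic(log_likelihood(y, hesitant), n_params=3)
```

Both models classify every example correctly at a threshold of 0.5, so the confusion matrices are identical, yet the likelihood (and with it the AIC) can only be computed from the predicted probabilities.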
Information-Based Evaluation Criterion for Classifier's Performance by Kononenko and Bratko is exactly what I was looking for:
Classification accuracy is usually used as a measure of classification performance. This measure is, however, known to have several defects. A fair evaluation criterion should exclude the influence of the class probabilities which may enable a completely uninformed classifier to trivially achieve high classification accuracy. In this paper a method for evaluating the information score of a classifier's answers is proposed. It excludes the influence of prior probabilities, deals with various types of imperfect or probabilistic answers and can be used also for comparing the performance in different domains.

Resources