forward-backward algorithm for secondary structure prediction

I want to use an HMM (with the forward-backward algorithm) for protein secondary structure prediction.
Basically, a three-state model is used: States = {H=alpha helix, B=beta sheet, C=coil}
and each state has an emission probability pmf of size 1-by-20 (over the 20 amino acids).
After running the forward-backward algorithm on a "training set" of sequences, expectation maximization converges to an optimal transition matrix (3-by-3 between the three states) and an emission probability pmf for each state.
Does anyone know of a dataset (preferably very small) of sequences for which the "correct" values of the transition matrix and emission probabilities are known? I would like to apply the forward-backward algorithm to that dataset in Excel and check whether I can reproduce those values, to build my confidence.
And then move on to something less primitive than Excel :o)

The best way to do this is probably to produce your own simulated data from distributions you choose. Then you run your program to see whether the parameter estimates converge towards your known parameters.
In your case, this involves writing a Markov chain that changes from state to state with some known, arbitrary probability (for instance, P(Helix to Coil)=0.001) and then emits an amino acid with some known probability (for instance, P(methionine)=0.11). At each step, print out the state and the emission. You can then watch your posterior probabilities approach the true state for each site.
You can make these as arbitrary as you want, because when you run your HMM you should converge on the proper distributions.
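If it helps, here is a minimal numpy sketch of that simulation idea; the transition matrix and emission pmfs below are arbitrary placeholders, not values from any real dataset:

import numpy as np

rng = np.random.default_rng(0)

states = ["H", "B", "C"]                    # helix, sheet, coil
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

# Known (arbitrary) transition matrix; each row is P(next state | current state)
A = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.10, 0.10, 0.80]])

# Known (arbitrary) emission pmfs: one row of length 20 per state
E = rng.dirichlet(np.ones(20), size=3)

def simulate(length, start_state=0):
    # Return (state sequence, emission sequence) sampled from the known HMM.
    seq_states, seq_emissions = [], []
    s = start_state
    for _ in range(length):
        seq_states.append(states[s])
        seq_emissions.append(rng.choice(amino_acids, p=E[s]))
        s = rng.choice(3, p=A[s])
    return seq_states, seq_emissions

true_states, residues = simulate(200)
print("".join(true_states[:40]))
print("".join(residues[:40]))

Running your forward-backward/EM implementation on the emitted residues should then recover estimates close to A and E.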
Good luck!

Related

Normality Assumption - how to check you have not violated it?

I am relatively new to statistics and am struggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data is normally distributed, but there seem to be lots of papers and articles providing conflicting information.
Some articles say that independent variables need to be normally distributed, and that this may require a transformation (log, SQRT, etc.). Others say that in linear modelling there are no assumptions about the distribution of the independent variables.
I am trying to create a multiple regression model to predict highest pain scores on hospital admissions:
DV: numeric pain score (0 = no pain -> 5 = intense pain) (discrete dependent variable).
IVs: age (continuous), weight (continuous), sex (nominal), deprivation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check whether my independent variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, SQRT) be appropriate, and why? Is it best to do this before or after fitting a model? I assume I am trying to get close to a linear relationship between my DV and IVs.
As part of the SPSS output it provides plots of the standardised residuals against predicted values and also normal P-P plots of standardised residuals. Are these plots all that is needed to check the normality assumption after fitting a model?
Many Thanks in advance!

Ensuring that optimization does not find the trivial solution by setting weights to 0

I am trying to train a neural network which takes an input (input_t0) and an initial hidden state (call it s_t0) and produces a new hidden state (s_t1) by transforming the input via a series of transformations (neural network layers). At the next time step, a transformed input (input_t1) and the hidden state from the previous time step (s_t1) are passed to the same model. This process repeats for a number of steps.
The goal of optimization is to ensure, through self-supervision, that the distance between s_t0 and s_t1 is small, as s_t1 is supposed to be a transformed version of s_t0. In other words, I want s_t1 to only carry the new information in the new input. My intuition tells me that taking the norm of the weights and ensuring it does not go to zero (is this even possible?) would be one way to achieve this. However, I'm afraid that won't necessarily be the best thing to do, as it might not encourage the model to update the state vector with new information.
Currently the way I train the model is by taking the absolute distance between s_t0 and s_t1 via loss = torch.abs(s_t1 - s_t0).mean(dim=1). Then I call loss.backward() and optimizer.step(), which updates the weights. Note that the reason I use abs() is that the hidden states are produced after applying ReLU, so they only hold positive values. So what is the best way to achieve this and ensure the weights don't go to 0? Would I be able to somehow use mutual information for this?
However, I noticed that optimization quickly finds the trivial solution by setting the weights to 0. This causes both s_t0 and s_t1 to get smaller and smaller until their difference is 0, which satisfies the constraint but does not yield the behavior I expect. Is there a way to ensure the weights do not go to zero during optimization?
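For reference, here is a rough PyTorch sketch of the setup described above; StepModel is a hypothetical stand-in for the actual network, and the loss is reduced to a scalar (backward() needs one) but otherwise matches the absolute-distance loss described:

import torch
import torch.nn as nn

class StepModel(nn.Module):
    # Hypothetical stand-in for the actual network: maps (input_t, s_prev) -> s_next.
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.fc = nn.Linear(input_dim + hidden_dim, hidden_dim)

    def forward(self, x_t, s_prev):
        # ReLU keeps the hidden state non-negative, as described above.
        return torch.relu(self.fc(torch.cat([x_t, s_prev], dim=1)))

model = StepModel(input_dim=16, hidden_dim=32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

inputs = [torch.randn(8, 16) for _ in range(4)]  # a few time steps, batch of 8
s_prev = torch.zeros(8, 32)                      # s_t0

for x_t in inputs:
    s_next = model(x_t, s_prev)                           # s_t1 from input_t and s_t0
    loss = torch.abs(s_next - s_prev).mean(dim=1).mean()  # absolute distance, reduced to a scalar
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    s_prev = s_next.detach()                              # carry the state to the next step

With nothing else in the loss, driving the weights (and hence both states) towards zero is the easiest way for the optimizer to make this distance vanish, which is exactly the collapse described above.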

Is there a function that computes the probability of a sequence T of observations given a hidden Markov model on hmmlearn?

I have a GaussianHMM that I have fitted using hmmlearn's fit function. I also have a bunch of sequences of observations and I want to compute the probability of each sequence happening given my fitted model. I looked into hmmlearn's documentation and I couldn't quite find a method that does what I want. In this case, do I just have to code the forward-backward algorithm myself? If I do code the forward-backward, I would also need the emission matrix, which is not given by hmmlearn.
Does anyone have any advice regarding this? Thank you!
I also have a bunch of sequences of observations and I want to compute the probability of each sequence happening given my fitted model
What you might be looking for is the score function, which evaluates the log-probability of a sequence under the fitted model (i.e. model.score(X)). Note that it returns a log probability because hmmlearn works in log space to guard against underflow.
In case I code the forward-backward, I would also need the emission matrix, which is not given by hmmlearn.
While the GaussianHMM does not have an emission matrix, you can choose to discretize your emissions and use MultinomialHMM instead, which allows you to specify, and later extract, the emission matrix via model.emissionprob_.
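For example (a small sketch; the model and data below are placeholders), score() gives the log-likelihood, which you can exponentiate if you really need a probability, bearing in mind that exp() may underflow for long sequences:

import numpy as np
from hmmlearn import hmm

model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=100)
X_train = np.random.randn(500, 2)       # stand-in training observations (500 samples, 2 features)
model.fit(X_train)

sequences = [np.random.randn(20, 2), np.random.randn(35, 2)]
for seq in sequences:
    log_prob = model.score(seq)         # log P(sequence | fitted model)
    print(log_prob, np.exp(log_prob))   # exponentiating may underflow for long sequences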

How does MCMC help Bayesian inference?

The literature says that the Metropolis-Hastings algorithm in MCMC is one of the most important algorithms developed last century, and that it is revolutionary. The literature also says that it is this development in MCMC that gave Bayesian statistics a second birth.
I understand what MCMC does - it provides an efficient way to draw samples from any complicated probability distribution.
I also know what Bayesian inference is - it is the process by which the full posterior distribution of the parameters is calculated.
I am having a difficult time connecting the dots here:
At which step in the process of Bayesian inference does MCMC come into play? Why is MCMC so important that people say it is MCMC that gave Bayesian statistics a second birth?
You might want to ask a similar question on StatsExchange. However, here is an attempt at a high-level "build some intuition" answer (disclaimer: I am a computer scientist and not a statistician; head over to StatsExchange for a more formal discussion).
Bayesian Inference:
In the most basic sense we follow Bayes' rule: p(Θ|y) = p(y|Θ)p(Θ)/p(y). Here p(Θ|y) is called the 'posterior' and this is what you are trying to compute. p(y|Θ) is called the 'data likelihood' and is typically given by your model or your generative description of the data. p(Θ) is called the 'prior' and it captures your belief about the plausible values of the parameters before observing the data. p(y) is called the 'marginal likelihood' and, using the law of total probability, can be expressed as ∫ p(y|Θ)p(Θ) dΘ. That looks really neat, but in reality p(y) is often intractable to compute analytically, and in high dimensions (i.e. when Θ has many dimensions) numerical integration is imprecise and computationally intractable. There are certain cases where the conjugate structure of the problem allows you to compute it analytically, but in many useful models this is simply not possible. We therefore turn to approximating the posterior.
There are two ways (that I know of) to approximate the posterior: Monte Carlo and Variational Inference. Since you asked about MCMC, I'll stick to that.
Monte Carlo (and Markov Chain Monte Carlo):
Many problems in statistics deal with taking expectations of functions under probability distributions. By the Law of Large Numbers, such an expectation can be efficiently approximated by a Monte Carlo estimator. Therefore, if we can draw samples from a distribution (even if we don't know the distribution itself), then we can compute a Monte Carlo estimate of the expectation in question. The key point is that we don't need an expression for the distribution: if we just have samples, we can compute the expectations we are interested in. But there is a catch... how do we draw the samples?
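As a toy illustration of that idea (my own example, not part of the original answer): to estimate E[X^2] for X ~ N(0, 1), whose true value is 1, just average x^2 over samples:

import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(0.0, 1.0, size=100_000)  # draws from the distribution
print(np.mean(samples ** 2))                  # Monte Carlo estimate of E[X^2], close to 1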
There has been a lot of work on developing ways of drawing samples from unknown distributions, including 'rejection', 'importance' and 'slice' sampling. These were all great innovations and useful in many applications, but they all suffer from scaling poorly to high dimensions. For example, rejection sampling draws samples from a known 'proposal' distribution and then accepts or rejects each sample based on a probability that requires evaluating the likelihood function and the proposal density. This works wonderfully in one dimension, but as the dimensionality grows, the probability that a given sample gets rejected increases dramatically.
Markov Chain Monte Carlo was an innovation with some very nice theoretical guarantees attached to it. The key idea was not to draw samples from a proposal distribution at random, but rather to start from a known sample (with the hope that it lies in an area of high probability mass) and then take a small random step drawn from a proposal distribution. Ideally, if the first draw was in an area of high probability mass, then the second draw is also likely to be accepted. You therefore end up accepting many more samples and you don't waste time drawing samples that are only going to be rejected. The amazing thing is that if you run the Markov chain long enough (i.e. to infinity) and under specific conditions (the chain must be aperiodic, irreducible and ergodic), your samples will be drawn from the true posterior of your model. That's amazing! Because MCMC draws dependent samples, it scales to higher dimensionality than previous methods, yet under the right conditions the samples, although dependent, are as if drawn IID from the desired distribution (which is the posterior in Bayesian inference).
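To make that concrete, here is a bare-bones Metropolis sketch on a toy 1D target known only up to a normalizing constant (an unnormalized Gaussian chosen purely for illustration):

import numpy as np

rng = np.random.default_rng(0)

def unnormalized_target(theta):
    # Proportional to a N(2, 1) density; the sampler never needs the normalizing constant.
    return np.exp(-0.5 * (theta - 2.0) ** 2)

def metropolis(n_samples, step_size=0.5, theta0=0.0):
    samples = np.empty(n_samples)
    theta = theta0
    for i in range(n_samples):
        proposal = theta + rng.normal(0.0, step_size)  # small random step from the current sample
        accept_prob = min(1.0, unnormalized_target(proposal) / unnormalized_target(theta))
        if rng.random() < accept_prob:
            theta = proposal                           # accept; otherwise keep the current value
        samples[i] = theta
    return samples

draws = metropolis(50_000)
print(draws[10_000:].mean())  # after discarding burn-in, close to the true mean of 2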
Tying it together (and hopefully answering your question):
MCMC can be seen as a tool that enables Bayesian inference (just as analytical calculation from conjugate structure, Variational Inference and Monte Carlo are alternatives). Apart from an analytical solution, all of these tools approximate the true posterior. Our goal is then to make the approximation as good as possible and to do so as cheaply as possible (both in computation and in the cost of working through a bunch of messy algebra). Previous sampling methods did not scale to high dimensions (which are typical of any real-world problem), and therefore Bayesian inference became computationally very expensive and impractical in many instances. MCMC, however, opened the door to an efficient way of drawing samples from a high-dimensional posterior, with good theoretical guarantees, and to doing so (comparatively) easily and cheaply.
It is worth mentioning that Metropolis itself has problems: it struggles with highly correlated latent parameter spaces, it requires a user-specified proposal distribution, and the correlation between samples can be high, leading to biased results. Therefore more modern and sometimes more useful MCMC tools have been proposed to combat this. See 'Hamiltonian Monte Carlo' and the 'No-U-Turn Sampler' for the state of the art. Nonetheless, Metropolis was a huge innovation that suddenly made real-world problems computationally tractable.
A last note: See this discussion by MacKay for a really good overview of these topics.
This post https://stats.stackexchange.com/a/344360/137466 perfectly clears up my question about how MCMC sampling helps solve Bayesian inference. In particular, the following part of the post is the key concept that I was missing:
The Markov chain has a stationary distribution, which is the distribution that preserves itself if you run it through the chain. Under certain broad assumptions (e.g., the chain is irreducible, aperiodic), the stationary distribution will also be the limiting distribution of the Markov chain, so that regardless of how you choose the starting value, this will be the distribution that the outputs converge towards as you run the chain longer and longer. It turns out that it is possible to design a Markov chain with a stationary distribution equal to the posterior distribution, even though we don't know exactly what that distribution is. That is, it is possible to design a Markov chain that has $\pi( \theta | \mathbb{x} )$ as its stationary limiting distribution, even if all we know is that $\pi( \theta | \mathbb{x} ) \propto L_\mathbb{x}(\theta) \pi(\theta)$. There are various ways to design this kind of Markov chain, and these various designs constitute available MCMC algorithms for generating values from the posterior distribution.
Once we have designed an MCMC method like this, we know that we can feed in any arbitrary starting value $\theta_{(0)}$ and the distribution of the outputs will converge to the posterior distribution (since this is the stationary limiting distribution of the chain). So we can draw (non-independent) samples from the posterior distribution by starting with an arbitrary starting value, feeding it into the MCMC algorithm, waiting for the chain to converge close to its stationary distribution, and then taking the subsequent outputs as our draws.

What is the stochastic aspect of Word2Vec?

I'm vectorizing words on a few different corpora with Gensim and am getting results that are making me rethink how Word2Vec functions. My understanding was that Word2Vec was deterministic, and that the position of a word in a vector space would not change from training to training. If "My cat is running" and "your dog can't be running" are the two sentences in the corpus, then the value of "running" (or its stem) seems necessarily fixed.
However, I've found that this value does indeed vary across models, and words keep changing where they sit in the vector space when I retrain the model. The differences are not always hugely meaningful, but they do indicate the existence of some random process. What am I missing here?
This is well-covered in the Gensim FAQ, which I quote here:
Q11: I've trained my Word2Vec/Doc2Vec/etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake? (*2vec training non-determinism)
Answer: The *2vec models (word2vec, fasttext, doc2vec…) begin with random initialization, then most modes use additional randomization during training. (For example, the training windows are randomly truncated as an efficient way of weighting nearer words higher. The negative examples in the default negative-sampling mode are chosen randomly. And the downsampling of highly-frequent words, as controlled by the sample parameter, is driven by random choices. These behaviors were all defined in the original Word2Vec paper's algorithm description.)
Even when all this randomness comes from a pseudorandom-number-generator that's been seeded to give a reproducible stream of random numbers (which gensim does by default), the usual case of multi-threaded training can further change the exact training-order of text examples, and thus the final model state. (Further, in Python 3.x, the hashing of strings is randomized each re-launch of the Python interpreter - changing the iteration ordering of vocabulary dicts from run to run, and thus making even the same string-of-random-number-draws pick different words in different launches.)
So, it is to be expected that models vary from run to run, even trained on the same data. There's no single "right place" for any word-vector or doc-vector to wind up: just positions that are at progressively more-useful distances & directions from other vectors co-trained inside the same model. (In general, only vectors that were trained together in an interleaved session of contrasting uses become comparable in their coordinates.)
Suitable training parameters should yield models that are roughly as useful, from run-to-run, as each other. Testing and evaluation processes should be tolerant of any shifts in vector positions, and of small "jitter" in the overall utility of models, that arises from the inherent algorithm randomness. (If the observed quality from run-to-run varies a lot, there may be other problems: too little data, poorly-tuned parameters, or errors/weaknesses in the evaluation method.)
You can try to force determinism, by using workers=1 to limit training to a single thread – and, if in Python 3.x, using the PYTHONHASHSEED environment variable to disable its usual string hash randomization. But training will be much slower than with more threads. And, you'd be obscuring the inherent randomness/approximateness of the underlying algorithms, in a way that might make results more fragile and dependent on the luck of a particular setup. It's better to tolerate a little jitter, and use excessive jitter as an indicator of problems elsewhere in the data or model setup – rather than impose a superficial determinism.
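As a concrete toy example of the "force determinism" settings mentioned above (and keeping in mind that PYTHONHASHSEED must be set before the interpreter starts, e.g. PYTHONHASHSEED=0 python train.py), something like the following trains single-threaded with a fixed seed; the corpus is a placeholder and vector_size is the gensim 4.x parameter name:

from gensim.models import Word2Vec

sentences = [["my", "cat", "is", "running"],
             ["your", "dog", "cannot", "be", "running"]]  # toy placeholder corpus

model = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of the word vectors
    min_count=1,     # keep every word in this tiny corpus
    workers=1,       # single thread: deterministic ordering of training examples
    seed=1,          # fixed seed for the pseudorandom number generator
)
print(model.wv["running"][:5])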
While I don't know any implementation details of Word2Vec in gensim, I do know that, in general, Word2Vec is trained by a simple neural network with an embedding layer as the first layer. The weight matrix of this embedding layer contains the word vectors that we are interested in.
That being said, it is also quite common in general to initialize the weights of a neural network randomly. So there you have the origin of your randomness.
But how can the results be different, regardless of different (random) starting conditions?
A well-trained model will assign similar vectors to words that have similar meanings. This similarity is measured by the cosine of the angle between the two vectors. Mathematically speaking, if v and w are the vectors of two very similar words, then
np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))  # the cosine of the angle between v and w
will be close to 1.
It will also allow you to do arithmetic like the famous
king - man + woman = queen
For illustration purposes, imagine 2D vectors. Would these arithmetical properties get lost if you, for example, rotated everything by some angle around the origin? With a little mathematical background I can assure you: no, they won't!
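If you want to convince yourself numerically, here is a quick 2D check (my own sketch): rotate both vectors by the same angle and the cosine similarity is unchanged.

import numpy as np

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

theta = np.deg2rad(37.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # rotation matrix

v = np.array([1.0, 2.0])
w = np.array([2.0, 1.5])

print(cosine(v, w))
print(cosine(R @ v, R @ w))  # identical to the line above: rotation preserves angles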
So, your assumption
If "My cat is running" and "your dog can't be running" are the two
sentences in the corpus, then the value of "running" (or its stem)
seems necessarily fixed.
is wrong. The value of "running" is not fixed at all. What is (somehow) fixed, however, is the similarity (cosine) and arithmetical relationship to other words.

Resources