Computational Learning theory based on PAC-learning framework - statistics

Consider a machine learning algorithm that trains on a training set. With the help of the PAC learning model, we get bounds on the training sample size needed so that the probability that the error is limited (by epsilon) is bounded (by delta).
What does the PAC learning model say about computational (time) complexity?
Suppose a learning algorithm is given more time (e.g. more iterations). How do the error and the probability that the error is limited change?
A learning algorithm that takes one hour to train is of no practical use in financial prediction problems. I need to know how performance changes as the time given to the algorithm changes, both in terms of the error bound and the probability that the error is bounded.

The PAC model simply tells you how many pieces of data you need to get a certain level of error with some probability. This can be translated into the impact on the run time by looking at the actual machine learning algorithm you're using.
For example, if your algorithm runs in time O(2^n), and the PAC model says you need 1000 examples to have a 95% chance of having .05 error and 10,000 examples for .005 error, then you know you should expect a HUGE slowdown for the increased accuracy. Whereas the same PAC information for an O(log n) algorithm would probably lead you to go ahead and get the lower error.
On a side note, it sounds like you might be confused about how most supervised learning algorithms work:
In most cases you can't really just give the same algorithm more time and expect better results, unless you change the parameters (e.g. learning rate) or increase the number of examples. Perhaps by 'iterations' you meant examples, in which case the impact of the number of examples on the probability and error rate can be found by manipulating the system of equations used for the PAC learning model; see the wiki article.
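To make the relationship between error, probability and sample size concrete, here is a minimal sketch. It assumes the simplest textbook setting, which the question does not specify: a finite hypothesis class H and a learner that always returns a hypothesis consistent with the training data. In that setting, m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice. The function name and the example values are illustrative only.

```python
import math

def pac_sample_size(hypothesis_count, epsilon, delta):
    """Examples sufficient for error <= epsilon with probability >= 1 - delta.

    Assumes a finite hypothesis class and a consistent learner:
    m >= (ln|H| + ln(1/delta)) / epsilon.
    """
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)

# Tightening the error bound tenfold needs roughly ten times the data.
loose = pac_sample_size(hypothesis_count=1000, epsilon=0.05, delta=0.05)
tight = pac_sample_size(hypothesis_count=1000, epsilon=0.005, delta=0.05)
print(loose, tight)
```

Plugging the resulting sample size into the running time of the specific algorithm (for example O(n log n) or O(2^n) in the number of examples n) then gives the time cost of a given error and confidence level.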


Word2Vec clustering: embed with low dimensionality or with high dimensionality and then reduce?

I am using K-means for topic modelling using Word2Vec and would like to understand the implications of vectorizing up to, let's say, 10 dimensions, against embedding it with 200 dimensions and then using PCA to get down to 10. Does the second approach make sense at all?
Which one worked better for your specific purposes, & your specific data, after trying both & comparing the end-results against each other, either in some ad-hoc ("eyeballing") or rigorous way?
There's no reason to prematurely reject any approach, given how many details about your data & ultimate end-goals are unstated.
It would be atypical to train a word2vec model to have only 10 dimensions. Published work most often shows the use of 100 to 1000 dimensions, often 300 or 400, assuming you've got enough bulk training data to make the algorithm worthwhile.
(Word2vec needs a lot of varied training text, with many contrasting usage examples for every word of interest, to generate good results. You may occasionally see toy-sized demos, on smaller amounts of data, just to quickly show steps, or some major qualities of the results. But good results, in the aspects for which word2vec is most appreciated, depend on plentiful training data.)
Also, whether or not your aims would be helped by the extra step of PCA to reduce the dimensionality of a larger word2vec model seems another separable question, to be determined experimentally by comparing results with and without that step, on your actual data/problem, rather than guessed at from intuitions from other projects that might not be comparable.

Why models often benefit from reducing the learning rate during training

In the official Keras documentation for the ReduceLROnPlateau class,
they mention that
"Models often benefit from reducing the learning rate"
Why is that so?
It's counter-intuitive to me at least, since from what I know, a higher learning rate allows taking larger steps from my current position.
Neither too high nor too low a learning rate should be used for training a NN. A large learning rate can overshoot the global minimum and in extreme cases can cause the model to diverge completely from the optimal solution. On the other hand, a small learning rate can get stuck in a local minimum.
The purpose of ReduceLROnPlateau is to track your model's performance and reduce the learning rate when there is no improvement for x number of epochs. The intuition is that, with the current learning rate, the model has reached a sub-optimal solution and is oscillating around the minimum. Reducing the learning rate enables the model to take smaller steps towards the optimal solution of the cost function.
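The oscillation described above can be reproduced on a toy problem. The following is a minimal sketch, not the Keras implementation: it minimises f(x) = x**2 with plain gradient descent and a hand-written reduce-on-plateau rule. The function name, the `patience` and `factor` parameters and all values are assumptions chosen for illustration.

```python
def gradient_descent(x0, lr, steps, patience=None, factor=0.1):
    """Minimise f(x) = x**2 with plain gradient descent.

    If `patience` is set, the learning rate is multiplied by `factor`
    whenever the loss has not improved for `patience` consecutive steps.
    """
    x, best, wait = x0, float("inf"), 0
    for _ in range(steps):
        x -= lr * 2 * x              # the gradient of x**2 is 2x
        loss = x * x
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if patience is not None and wait >= patience:
                lr *= factor         # plateau detected: take smaller steps
                wait = 0
    return x, lr

# With lr = 1.0 every step jumps from x to -x, so the loss never improves.
fixed_x, _ = gradient_descent(x0=1.0, lr=1.0, steps=50)
# With the plateau rule the learning rate drops and x converges towards 0.
sched_x, sched_lr = gradient_descent(x0=1.0, lr=1.0, steps=50, patience=3)
print(abs(fixed_x), abs(sched_x), sched_lr)
```

With the fixed learning rate the iterate bounces between the two sides of the minimum forever; once the rate is reduced, each step shrinks x instead of flipping its sign.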

How does MCMC help bayesian inference?

The literature says that the Metropolis-Hastings algorithm in MCMC is one of the most important algorithms developed in the last century and is revolutionary. The literature also says that it is this development in MCMC that gave Bayesian statistics a second birth.
I understand what MCMC does - it provides an efficient way to draw samples from any complicated probability distribution.
I also know what bayesian inference is - it is the process by which the full posterior distribution of parameters is calculated.
I am having a difficult time connecting the dots here:
At which step in the process of Bayesian inference does MCMC come into play? Why is MCMC so important that people say it is MCMC that gave Bayesian statistics a second birth?
You might want to ask a similar question on StatsExchange. However, here is an attempt for a high level "build some intuition" answer (disclaimer: I am a Computer Scientist and not a Statistician. Head over to StatsExchange for a more formal discussion).
Bayesian Inference:
In the most basic sense we follow Bayes rule: p(Θ|y)=p(y|Θ)p(Θ)/p(y). Here p(Θ|y) is called the 'posterior' and this is what you are trying to compute. p(y|Θ) is called the 'data likelihood' and is typically given by your model or your generative description of the data. p(Θ) is called the 'prior' and it captures your belief about the plausible values of the parameters before observing the data. p(y) is called the 'marginal likelihood' and using the law of total probability can be expressed as ∫ p(y|Θ)p(Θ) dΘ. That looks really neat but in reality the p(y) is often intractable to compute analytically and in high dimensions (i.e. when Θ has many dimensions) numerical integration is imprecise and computationally intractable. There are certain cases when the conjugate structure of the problem allows you to compute this analytically, but in many useful models this is simply not possible. Therefore, we turn to approximating the posterior.
There are two ways (that I know of) to approximate the posterior: Monte Carlo and Variational Inference. Since you asked about MCMC, I'll stick to that.
Monte Carlo (and Markov Chain Monte Carlo):
Many problems in Statistics deal with taking expectations of functions under probability distributions. From the Law of Large Numbers, an expectation can be efficiently approximated by a Monte Carlo estimator. Therefore, if we can draw samples from a distribution (even if we don't know the distribution itself) then we can compute a Monte Carlo estimate of the expectation in question. The key is that we don't need to have an expression for the distribution: If we just have samples then we can compute the expectations that we are interested in. But there is a catch... How to draw the samples??
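As a minimal sketch of the Monte Carlo estimator described above: the expectation E[x^2] under a standard normal distribution (whose true value is 1) is approximated by averaging x^2 over samples. The choice of distribution, function and sample size is purely illustrative.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Draw samples from the distribution (here we can sample it directly).
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# Monte Carlo estimate of E[x**2]: just the sample average of x**2.
estimate = sum(x * x for x in samples) / len(samples)
print(estimate)  # close to the true value 1.0
```

Note that the estimator only ever touches the samples; it never evaluates the density itself.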
There has been a lot of work on ways of drawing samples from unknown distributions. These include 'rejection', 'importance' and 'slice' sampling. These were all great innovations and were useful in many applications, but they all suffered from scaling poorly to high dimensions. For example, rejection sampling draws a sample from a known 'proposal' distribution and then accepts or rejects that sample with a probability that requires evaluating the target density and the proposal density. This is wonderful in 1 dimension but as the dimensionality grows, the probability that a given sample gets rejected increases dramatically.
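To give a feel for how quickly rejection rates grow with dimension, here is a small sketch under a deliberately simple assumption: the target is the uniform distribution on the unit ball and the proposal is the uniform distribution on the enclosing cube [-1, 1]^d. A proposal is accepted exactly when it falls inside the ball, so the acceptance rate is the ratio of the two volumes. This setup is an illustration and not one taken from the answer.

```python
import math

def acceptance_rate(dim):
    """Fraction of uniform proposals from [-1, 1]^dim that land in the unit ball."""
    ball_volume = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)
    cube_volume = 2.0 ** dim
    return ball_volume / cube_volume

for dim in (1, 2, 5, 10, 20):
    print(dim, acceptance_rate(dim))
```

In one dimension every proposal is accepted, while by ten dimensions well under one percent are, so almost all of the work is wasted on rejected draws.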
Markov Chain Monte Carlo was an innovation that has some very nice theoretical guarantees attached to it. The key idea was not to draw samples independently from a proposal distribution, but rather to start from a known sample (with the hope that the sample is in an area of high probability mass) and then take a small random step drawn from a proposal distribution. Ideally, if the first draw was in an area of high probability mass then the second draw is also likely to be accepted. Therefore, you end up accepting many more samples and you don't waste time drawing samples that are to be rejected. The amazing thing is that if you run the Markov chain long enough (i.e. to infinity) and under specific conditions (the chain must be finite, aperiodic, irreducible and ergodic) then your samples will be drawn from the true posterior of your model. That's amazing! The MCMC technique draws dependent samples, so it scales to a higher dimensionality than previous methods, but under the right conditions, even though the samples are dependent, each one is distributed according to the desired distribution (which is the posterior in Bayesian Inference), so averages over them still converge to the correct expectations.
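The idea can be sketched with a random-walk Metropolis sampler. The example below is a minimal illustration under assumed data, not a production sampler: a coin is flipped 10 times and shows 7 heads, the prior on the bias Θ is uniform, and the sampler only ever evaluates the unnormalised posterior p(y|Θ)p(Θ), never p(y). The data, step size, burn-in length and function names are all assumptions made for this example. This toy posterior is also available in closed form (a Beta(8, 4) distribution with mean 2/3), which gives something to check the samples against.

```python
import math
import random

def unnormalised_log_posterior(theta, heads=7, tails=3):
    """log of p(y | theta) * p(theta) for coin flips with a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf                      # zero prior mass outside (0, 1)
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def metropolis(log_target, start, n_samples, step=0.2, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)   # small random step
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)                     # a rejection repeats the current value
    return samples

chain = metropolis(unnormalised_log_posterior, start=0.5, n_samples=22_000)
draws = chain[2_000:]                               # discard burn-in
posterior_mean = sum(draws) / len(draws)
print(posterior_mean)                               # analytic answer: 8 / 12
```

Because the acceptance rule only uses a ratio of target values, the intractable normalising constant p(y) cancels, which is exactly why MCMC fits Bayesian inference so well.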
Tying it together (and hopefully answering your question):
MCMC can be seen as a tool that enables Bayesian Inference (just as analytical calculation from conjugate structure, Variational Inference and Monte Carlo are alternatives). Apart from an analytical solution, all of the other tools are approximating the true posterior. Our goal is then to make the approximation as good as possible and to do this as cheaply as possible (in both computation cost and the cost of working through a bunch of messy algebra). Previous sampling methods did not scale to high dimensions (which are typical of any real world problem) and therefore Bayesian Inference became computationally very expensive and impractical in many instances. However, MCMC opened the door to a new way to efficiently draw samples from a high dimensional posterior, to do this with good theoretical guarantees and to do this (comparatively) easily and computationally cheaply.
It is worth mentioning that Metropolis itself has problems: it struggles with a highly correlated latent parameter space, it requires a user-specified proposal distribution, and the correlation between samples can be high, leading to biased results. Therefore more modern and sometimes more useful MCMC tools have been proposed to try to combat this. See 'Hamiltonian Monte Carlo' and the 'No U-Turn Sampler' for the state of the art. Nonetheless, Metropolis was a huge innovation that suddenly made real world problems computationally tractable.
A last note: See this discussion by MacKay for a really good overview of these topics.
This post perfectly answers my question about how MCMC sampling helps with Bayesian inference. In particular, the following part from the post is the key concept that I missed:
The Markov chain has a stationary distribution,
which is the distribution that preserves itself if you run it through
the chain. Under certain broad assumptions (e.g., the chain is
irreducible, aperiodic), the stationary distribution will also be the
limiting distribution of the Markov chain, so that regardless of how
you choose the starting value, this will be the distribution that the
outputs converge towards as you run the chain longer and longer. It
turns out that it is possible to design a Markov chain with a
stationary distribution equal to the posterior distribution, even
though we don't know exactly what that distribution is. That is, it
is possible to design a Markov chain that has $\pi( \theta |
\mathbb{x} )$ as its stationary limiting distribution, even if all we
know is that $\pi( \theta | \mathbb{x} ) \propto L_\mathbb{x}(\theta)
\pi(\theta)$. There are various ways to design this kind of Markov
chain, and these various designs constitute available MCMC algorithms
for generating values from the posterior distribution.
Once we have designed an MCMC method like this, we know that we can
feed in any arbitrary starting value $\theta_{(0)}$ and the
distribution of the outputs will converge to the posterior
distribution (since this is the stationary limiting distribution of
the chain). So we can draw (non-independent) samples from the
posterior distribution by starting with an arbitrary starting value,
feeding it into the MCMC algorithm, waiting for the chain to converge
close to its stationary distribution, and then taking the subsequent
outputs as our draws.

Why does removing validation samples from Keras model improve test accuracy so much

I'm doing a programming assignment for Andrew Ng's Deep Learning course on Convolutional Models that involves training and evaluating a model using Keras. What I've observed after a little playing with various knobs is something curious: the test accuracy of the model greatly improves (from about 50% to about 90%) by setting the validation_fraction parameter on the operation to 0. This is surprising to me; I would have thought that eliminating the validation samples would lead to over-fitting of the model, which would, in turn, reduce accuracy on the test set.
Can someone please explain why this is happening?
You're right, there is more training data, but the increase is pretty negligible since I was setting the validation fraction to 0.1, so that would increase the training data by 11.111...%. However, thinking about it some more, I realized that removing the validation step doesn't have any effect on the model, hence no impact on test accuracy. I think that I must have changed some other parameter, too, though I don't remember which.
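The 11.111...% figure follows from simple arithmetic: with a validation fraction f, the training share grows from 1 - f to 1, a relative increase of f / (1 - f). A tiny sketch (the function name is illustrative):

```python
def relative_training_increase(validation_fraction):
    """Relative growth of the training set when the validation split is dropped."""
    return validation_fraction / (1.0 - validation_fraction)

increase = relative_training_increase(0.1)
print(f"{increase:.2%}")  # 11.11%
```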
As Matias says, it means there is more training data to work with.
However, I'd also make sure that the test accuracy is actually increasing from 50 to 90% consistently. Run it a couple of times to make sure. There is a possibility that, because there are very few validation samples, the model got lucky. That's why it is important to have a lot of validation data: to make sure the model isn't just getting lucky, and that there's actually a method to the madness.
I go over some of the "norms" when it comes to training and testing data in my book about stock prediction (another great way in my opinion to learn about Deep Learning). Feel free to check it out and learn more, as it's great for beginners.
Good Luck!

Training set - proportion of pos / neg / neutral sentences

I am hand-tagging Twitter messages as Positive, Negative or Neutral. I am trying to understand whether there is some logic one can use to decide what proportion of the messages in the training set should be positive, negative and neutral.
So, for example, if I am training a Naive Bayes classifier with 1000 Twitter messages, should the proportion of pos : neg : neutral be 33% : 33% : 33%, or should it be 25% : 25% : 50%?
Logically, in my head, it seems that if I train with more neutral samples, the system would be better at identifying neutral sentences than at deciding whether they are positive or negative. Is that true, or am I missing some theory here?
The problem you're referring to is known as the imbalance problem. Many machine learning algorithms perform badly when confronted with imbalanced training data, i.e. when the instances of one class heavily outnumber those of the other class. Read this article to get a good overview of the problem and how to approach it. For techniques like Naive Bayes or decision trees it is always a good idea to balance your data somehow, e.g. by random oversampling (explained in the referenced paper). I disagree with mjv's suggestion to have a training set match the proportions in the real world. This may be appropriate in some cases but I'm quite confident it's not in your setting. For a classification problem like the one you describe, the more the sizes of the class sets differ, the more most ML algorithms will have problems discriminating the classes properly. However, you can always use the information about which class is the largest in reality by taking it as a fallback, such that when the classifier's confidence for a particular instance is low or this instance couldn't be classified at all, you would assign it the largest class.
One further remark: finding the positivity/negativity/neutrality in Twitter messages seems to me to be a question of degree. As such, it may be viewed as a regression rather than a classification problem, i.e. instead of a three-class scheme you may want to calculate a score which tells you how positive/negative the message is.
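Random oversampling, as mentioned above, can be sketched in a few lines: examples from the smaller classes are duplicated at random until every class is as large as the biggest one. This is a minimal illustration with made-up tweets and a hypothetical helper name; it is not the procedure from the referenced paper.

```python
import random
from collections import Counter, defaultdict

def random_oversample(examples, seed=0):
    """Balance (text, label) pairs by randomly duplicating minority-class items."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Draw the missing examples with replacement from the same class.
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced

# Made-up, imbalanced toy data: 6 neutral, 2 positive, 1 negative.
tweets = ([(f"neutral tweet {i}", "neutral") for i in range(6)]
          + [("love it", "positive"), ("great day", "positive")]
          + [("awful service", "negative")])
balanced_tweets = random_oversample(tweets)
print(Counter(label for _, label in balanced_tweets))
```

Oversampling should be applied to the training data only, so that duplicated examples never leak into the set used for evaluation.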
There are many other factors... but an important one (in determining a suitable ratio and volume of training data) is the expected distribution of each message category (Positive, Neutral, Negative) in the real world. Effectively, a good baseline for the training set (and the control set) is one that is
[qualitatively] as representative as possible of the whole "population"
[quantitatively] big enough that measurements made from such sets are statistically significant.
The effect of the [relative] abundance of a certain category of messages in the training set is hard to determine; it is in any case a lesser factor, or rather one that is highly sensitive to other factors. Improvements in the accuracy of the classifier, as a whole or with regard to a particular category, are typically tied more to the specific implementation of the classifier (e.g. is it Bayesian, what are the tokens, are noise tokens eliminated, is proximity a factor, are we using bi-grams etc...) than to purely quantitative characteristics of the training set.
While the above is generally factual, it is only moderately helpful for selecting the training set's size and composition. There are, however, ways of determining, post facto, when an adequate size and composition of training data has been supplied.
One way to achieve this is to introduce a control set, i.e. one that is manually labeled but is not part of the training set, and to measure, over different test runs with various subsets of the training set, the recall and precision obtained for each category (or some similar accuracy measurements) when classifying the control set. When these measurements no longer improve or degrade beyond what's statistically significant, the size and composition of the training [sub-]set is probably the right one (unless it is an over-fitting set :-(, but that's another issue altogether...).
This approach implies that one uses a training set that could be 3 to 5 times the size of the training subset effectively needed, so that one can build, randomly (within each category), many different subsets for the various tests.
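The per-category measurements on the control set can be computed with a small helper like the one below. This is a minimal sketch: the function name, the label strings and the toy predictions are assumptions for illustration, and in practice the predictions would come from a classifier trained on one of the random training subsets.

```python
def per_class_precision_recall(true_labels, predicted_labels):
    """Return {label: (precision, recall)} for every label that occurs."""
    pairs = list(zip(true_labels, predicted_labels))
    report = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        true_positives = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for _, p in pairs)   # items assigned this label
        actual = sum(t == label for t, _ in pairs)      # items that truly have it
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        report[label] = (precision, recall)
    return report

# Toy control set: manual labels versus the classifier's output.
control_truth = ["pos", "pos", "neg", "neu", "neu", "neu"]
control_predicted = ["pos", "neu", "neg", "neu", "neu", "pos"]
report = per_class_precision_recall(control_truth, control_predicted)
for label, (precision, recall) in report.items():
    print(label, round(precision, 2), round(recall, 2))
```

Repeating this for classifiers trained on subsets of increasing size, and watching where the per-category numbers stop changing, is the post-facto check described above.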
