Theory behind state normalization in Reinforcement Learning - dynamic-programming

I know that normalizing the observation state returns better results in reinforcement learning Stable-baselines documentation.
But I could not find any theoretical background to back this theory up. I applied RL to robotics grasping. I receive the raw depth sensor values and input it into a series of convolutional layers, at the end receiving the 512-dimensional output. Without normalizing this output, the agent does not learn a working policy. But by applying normalization, it somehow achieves far better performance. I am not looking for a full mathematical proof. Instead, a logical explanation is enough.

Related

What does "fine-tuning of a BERT model" refer to?

I was not able to understand one thing , when it says "fine-tuning of BERT", what does it actually mean:
Are we retraining the entire model again with new data.
Or are we just training top few transformer layers with new data.
Or we are training the entire model but considering the pretrained weights as initial weight.
Or there is already few layers of ANN on top of transformer layers which is only getting trained keeping transformer weight freeze.
Tried Google but I am getting confused, if someone can help me on this.
Thanks in advance!
I remember reading about a Twitter poll with similar context, and it seems that most people tend to accept your suggestion 3. (or variants thereof) as the standard definition.
However, this obviously does not speak for every single work, but I think it's fairly safe to say that 1. is usually not included when talking about fine-tuning. Unless you have vast amounts of (labeled) task-specific data, this step would be referred to as pre-training a model.
2. and 4. could be considered fine-tuning as well, but from personal/anecdotal experience, allowing all parameters to change during fine-tuning has provided significantly better results. Depending on your use case, this is also fairly simple to experiment with, since freezing layers is trivial in libraries such as Huggingface transformers.
In either case, I would really consider them as variants of 3., since you're implicitly assuming that we start from pre-trained weights in these scenarios (correct me if I'm wrong).
Therefore, trying my best at a concise definition would be:
Fine-tuning refers to the step of training any number of parameters/layers with task-specific and labeled data, from a previous model checkpoint that has generally been trained on large amounts of text data with unsupervised MLM (masked language modeling).

When and Whether should we normalize the ground-truth labels in the multi-task regression models?

I am trying a multi-task regression model. However, the ground-truth labels of different tasks are on different scales. Therefore, I wonder whether it is necessary to normalize the targets. Otherwise, the MSE of some large-scale tasks will be extremely bigger. The figure below is part of my overall targets. You can certainly find that columns like ASA_m2_c have much higher values than some others.
First, I have already tried some weighted loss techniques to balance the concentration of my model when it does gradient backpropagation. The result shows it didn't perform well.
Secondly, I have seen tremendous discussions regarding normalizing the input data, but hardly discovered any particular talking about normalizing the labels. It's partly because most of the people's problems are classification type and a single task. I do know pytorch provides a convenient approach to normalize the vision dataset by transform.normalize, which is still operated on the input rather than the labels.
Similar questions: https://forums.fast.ai/t/normalizing-your-dataset/49799
https://discuss.pytorch.org/t/ground-truth-label-normalization/26981/19
PyTorch - How should you normalize individual instances
Moreover, I think it might be helpful to provide some details of my model architecture. The input is first fed into a feature extractor and then several generators use the shared output representation from that extractor to predict different targets.
I've been working on a Multi-Task Learning problem where one head has an output of ~500 and another between 0 and 1.
I've tried Uncertainty Weighting but in vain. So I'd be grateful if you could give me a little clue about your studies.(If there is any progress)
Thanks.

How does MCMC help bayesian inference?

Literature says that the metropolis-hasting algorithm in MCMC is one of the most important algorithms developed last century and is revolutional. Literature also says that it is such development in MCMC that gave bayesian statistics a second birth.
I understand what MCMC does - it provides an efficient way to draw samples from any complicated probability distribution.
I also know what bayesian inference is - it is the process by which the full posterior distribution of parameters is calculated.
I am having difficult time connecting the dots here:
Which step in the process of bayesian inference does MCMC come into play? Why is MCMC so important that people say it is MCMC that gave bayesian statistics a second birth??
You might want to ask a similar question on StatsExchange. However, here is an attempt for a high level "build some intuition" answer (disclaimer: I am a Computer Scientist and not a Statistician. Head over to StatsExchange for a more formal discussion).
Bayesian Inference:
In the most basic sense we follow Bayes rule: p(Θ|y)=p(y|Θ)p(Θ)/p(y). Here p(Θ|y) is called the 'posterior' and this is what you are trying to compute. p(y|Θ) is called the 'data likelihood' and is typically given by your model or your generative description of the data. p(Θ) is called the 'prior' and it captures your belief about the plausible values of the parameters before observing the data. p(y) is called the 'marginal likelihood' and using the law of total probability can be expressed as ∫ p(y|Θ)p(Θ) dΘ. That looks really neat but in reality the p(y) is often intractable to compute analytically and in high dimensions (i.e. when Θ has many dimensions) numerical integration is imprecise and computationally intractable. There are certain cases when the conjugate structure of the problem allows you to compute this analytically, but in many useful models this is simply not possible. Therefore, we turn to approximating the posterior.
There are two ways (that I know of) to approximate the posterior: Monte Carlo and Variational Inference. Since you asked about MCMC, I'll stick to that.
Monte Carlo (and Markov Chain Monte Carlo):
Many problems in Statistics deal with taking expectations of functions under probability distributions. From the Law of Large Numbers, an expectation can be efficiently approximated by a Monte Carlo estimator. Therefore, if we can draw samples from a distribution (even if we don't know the distribution itself) then we can compute a Monte Carlo estimate of the expectation in question. The key is that we don't need to have an expression for the distribution: If we just have samples then we can compute the expectations that we are interested in. But there is a catch... How to draw the samples??
There has been a lot of work which developed ways of drawing samples from unknown distributions. These include 'rejection', 'importance' and 'slice' sampling. These were all great innovations and were useful in many applications but they all suffered by scaling poorly to high dimensions. For example, rejection sampling draws samples from a known 'proposal' distribution and then accepts or rejects that sample based on a probability that needs to evaluate the likelihood function and the proposal function. This is wonderful in 1 dimension but as the dimensionality grows, the probability mass that a given sample gets rejected increases dramatically.
Markov Chain Monte Carlo was an innovation that has some super nice theoretical guarantees attached to it. The key idea was to not randomly draw samples from a proposal distribution but rather to use a known sample (with the hope that the sample is in an area of high probability mass) and then make a small random step under a draw from a proposal distribution. Ideally, if the first draw was in an area of high probability mass then the second draw is also likely to be accepted. Therefore, you end up accepting many more samples and you don't waste time drawing samples that are to be rejected. The amazing thing is that if you run the Markov Chain long enough (i.e. to infinity) and under specific conditions (the chain must be finite, aperiodic, irreducible and ergodic) then your samples will be drawn from the true posterior of your model. That's amazing! The MCMC technique is to draw dependent samples so it scales to a higher dimensionality than previous methods, but under the right conditions, even though the samples are dependent, they are as if they are drawn IID from the desired distribution (which is the posterior in Bayesian Inference).
Tying it together (and hopefully answering your question):
MCMC can be seen as a tool that enables Bayesian Inference (just as analytical calculation from conjugate structure, Variational Inference and Monte Carlo are alternatives). Apart from an analytical solution, all of the other tools are approximating the true posterior. Our goal is then to make the approximation as good as possible and to do this as cheaply as possible (in both computation cost and the cost of computing a bunch of messy algebra). Pervious sampling methods did not scale to high dimensions (which are typical of any real world problem) and therefore Bayesian Inference became computationally very expensive and impractical in many instances. However, MCMC opened the door to a new way to efficiently draw samples from a high dimensional posterior, to do this with good theoretical guarantees and to do this (comparatively) easily and computationally cheaply.
It is worth mentioning that Metropolis itself has problems: it struggles with highly correlated latent parameter space, it requires a user-specified proposal distribution and the correlation between samples can be high leading to biased results. Therefore more modern and sometimes more useful MCMC tools have been proposed to try combat this. See 'Hamiltonian Monte Carlo' and the 'No U-Turn Sampler' for the state of the art. Nonetheless, Metropolis was a huge innovation that suddenly made real world problems computationally tractable.
A last note: See this discussion by MacKay for a really good overview of these topics.
This post https://stats.stackexchange.com/a/344360/137466 perfectly clears my question on how MCMC sampling helps solving bayesian inference. Especially this following part from the post is the key concept that I missed:
The Markov chain has a stationary
distribution
which is the distribution that preserves itself if you run it through
the chain. Under certain broad assumptions (e.g., the chain is
irreducible, aperiodic), the stationary distribution will also be the
limiting distribution of the Markov chain, so that regardless of how
you choose the starting value, this will be the distribution that the
outputs converge towards as you run the chain longer and longer. It
turns out that it is possible to design a Markov chain with a
stationary distribution equal to the posterior distribution, even
though we don't know exactly what that distribution is. That is, it
is possible to design a Markov chain that has $\pi( \theta |
\mathbb{x} )$ as its stationary limiting distribution, even if all we
know is that $\pi( \theta | \mathbb{x} ) \propto L_\mathbb{x}(\theta)
\pi(\theta)$. There are various ways to design this kind of Markov
chain, and these various designs constitute available MCMC algorithms
for generating values from the posterior distribution.
Once we have designed an MCMC method like this, we know that we can
feed in any arbitrary starting value $\theta_{(0)}$ and the
distribution of the outputs will converge to the posterior
distribution (since this is the stationary limiting distribution of
the chain). So we can draw (non-independent) samples from the
posterior distribution by starting with an arbitrary starting value,
feeding it into the MCMC algorithm, waiting for the chain to converge
close to its stationary distribution, and then taking the subsequent
outputs as our draws.

When and why would you want to use a Probability Density Function?

A wanna-be data-scientist here and am trying to understand as a data scientist, when and why would you use a Probability Density Function (PDF)?
Sharing a scenario and a few pointers to learn about this and other such functions like CDF and PMF would be really helpful. Know of any book that talks about these functions from practice stand-point?
Why?
Probability theory is very important for modern data-science and machine-learning applications, because (in a lot of cases) it allows one to "open up a black box" and shed some light into the model's inner workings, and with luck find necessary ingredients to transform a poor model into a great model. Without it, a data scientist's work is very much restricted in what they are able to do.
A PDF is a fundamental building block of the probability theory, absolutely necessary to do any sort of probability reasoning, along with expectation, variance, prior and posterior, and so on.
Some examples here on StackOverflow, from my own experience, where a practical issue boils down to understanding data distribution:
Which loss-function is better than MSE in temperature prediction?
Binary Image Classification with CNN - best practices for choosing “negative” dataset?
How do neural networks account for outliers?
When?
The questions above provide some examples, here're a few more if you're interested, and the list is by no means complete:
What is the 'fundamental' idea of machine learning for estimating parameters?
Role of Bias in Neural Networks
How to find probability distribution and parameters for real data? (Python 3)
I personally try to find probabilistic interpretation whenever possible (choice of loss function, parameters, regularization, architecture, etc), because this way I can move from blind guessing to making reasonable decisions.
Reading
This is very opinion-based, but at least few books are really worth mentioning: The Elements of Statistical Learning, An Introduction to Statistical Learning: with Applications in R or Pattern Recognition and Machine Learning (if your primary interest is machine learning). That's just a start, there are dozens of books on more specific topics, like computer vision, natural language processing and reinforcement learning.

Where do the input filters come from in conv-neural nets (MNIST Example)

I am a newby to the convolutional neural nets... so this may be an ignorant question.
I have followed many examples and tutorials now on the MNIST example in TensforFlow. In the CNN examples, all authors talk bout using the 'input filters' to run in the CNN. But no one that I can find mentions WHERE they come from. Can anyone answer where these come from? Or are they magically obtained from the input images.
Thanks! Chris
This is an image that one professor uses, be he does not exaplain if he made them or TensorFlow auto-extracts these somehow.
Disclaimer: I am not an expert, more of an enthusiast.
To cut a long story short: filters are the CNN equivalent of weights, and all a neural network essentially does is learning their optimal values.
Which it does by iterating through a training dataset, making predictions, comparing them to the label/value already assigned to each training unit (usually an image in case of a CNN) and adjusting weights to minimize the error function (the difference between the predicted value and the actual value).
Initial values of filters/weights do not matter that much, so although they might affect the speed of convergence to a small degree, I believe they are often assigned random values.
It is the job of the neural network to figure out the optimal weights, not of the person implementing it.

Resources