Monte Carlo policy evaluation confusion

I'm having trouble understanding the Monte Carlo policy evaluation algorithm. What I am reading is that G is the average return after visiting a particular state, let's say s1, for the first time. Does this mean averaging all rewards following that state s1 to the end of the episode and then assigning the resulting value to s1? Or does it mean the immediate reward received for taking an action in s1, averaged over multiple episodes?

The purpose of Monte Carlo policy evaluation is to find the value function for a given policy π. A value function for a policy just tells us the expected cumulative discounted reward that results from being in a state and then following the policy forever, or until the end of the episode. In other words, it tells us the expected return for a state.
So a Monte Carlo approach to estimating this value function is to simply run the policy and keep track of the return from each state: when I reach a state for the first time, how much discounted reward do I accumulate over the rest of the episode? Average all of the returns that you observe (one return per state that you visit, per episode that you run).
To your question: does this mean averaging all rewards following that state s1 to the end of the episode and then assigning the resulting value to s1, or does it mean the immediate reward received for taking an action in s1 averaged over multiple episodes?
Your first interpretation is correct: the value estimate for s1 is the full discounted return accumulated from the first visit to s1 until the end of the episode, averaged over episodes, not the immediate reward.
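To make this concrete, here is a minimal sketch of first-visit Monte Carlo policy evaluation. It assumes a Gym-style environment (env.reset() returning a state, env.step() returning (state, reward, done, info)) and a policy(state) function; those names are placeholders rather than any particular library's API.

from collections import defaultdict

def first_visit_mc_evaluation(env, policy, num_episodes=1000, gamma=0.99):
    returns_sum = defaultdict(float)   # sum of first-visit returns observed per state
    returns_count = defaultdict(int)   # number of first visits observed per state
    for _ in range(num_episodes):
        # Roll out one full episode under the policy.
        episode = []                   # (state, reward) pairs, in order
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, reward))
            state = next_state
        # Walk backwards so G accumulates the discounted return from each time step.
        G = 0.0
        returns_from = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            returns_from[t] = G
        # Record the return only from the first visit to each state in this episode.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += returns_from[t]
                returns_count[s] += 1
    # V(s) is the average of the observed first-visit returns.
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}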

How do we get the expectation of the likelihood function when the parameters have their own distributions?

I am going through a paper by Fader and Hardie titled "'Counting Your Customers' the Easy Way: An Alternative to the Pareto/NBD Model", in which they suggest the BG/NBD model as an alternative to the Pareto/NBD model in customer behavior analysis.
Consider a customer who had x transactions in the period (0, T], with the transactions occurring at t_1, t_2, ..., t_x.
Assuming that the time between transactions follows an exponential distribution with rate lambda, and that the customer is still alive after any given transaction with probability (1 - p), they derive the individual-level likelihood function given in the paper.
They further write: "Taking the expectation of the above function over the distribution of lambda and p (Gamma and Beta, respectively) results in the following expression for the likelihood function for a randomly chosen customer with a given purchase history."
I don't understand how we arrive at that likelihood function from the previous one. Any help would be appreciated.
Link to the paper : http://brucehardie.com/papers/018/fader_et_al_mksc_05.pdf
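For what it's worth, the step they describe is just mixing the individual-level likelihood over the two heterogeneity distributions. A sketch of the structure, assuming the paper's usual notation lambda ~ Gamma(r, alpha) and p ~ Beta(a, b), with lambda and p independent:

\[
L(x, t_x, T \mid r, \alpha, a, b)
  = \mathbb{E}_{\lambda, p}\bigl[ L(\lambda, p \mid x, t_x, T) \bigr]
  = \int_0^{\infty} \int_0^{1} L(\lambda, p \mid x, t_x, T)\, f(p \mid a, b)\, f(\lambda \mid r, \alpha)\, dp\, d\lambda,
\]

where f(lambda | r, alpha) is the Gamma density and f(p | a, b) is the Beta density. Because the individual-level likelihood is a sum of terms of the form p^c (1-p)^d lambda^x e^(-lambda t), the double integral splits into a Beta integral in p and a Gamma integral in lambda for each term; evaluating those standard integrals is what produces the closed-form expression in the paper.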

DDPG (Deep Deterministic Policy Gradients), how is the actor updated?

I'm currently trying to implement DDPG in Keras. I know how to update the critic network (the normal DQN algorithm), but I'm currently stuck on updating the actor network, which uses this equation: dJ/dTheta = dQ/da * da/dTheta. So, in order to reduce the loss of the actor network with respect to its weights Theta, it uses the chain rule to get dQ/da (from the critic network) times da/dTheta (from the actor network).
This looks fine, but I'm having trouble understanding how to derive the gradients from those 2 networks. Could someone perhaps explain this part to me?
So the main intuition is that here, J is something you want to maximize instead of minimize. Therefore, we can call it an objective function instead of a loss function. The equation simplifies down to:
dJ/dTheta = dQ/da * da/dTheta = dQ/dTheta
Meaning you want to change the parameters Theta so as to change Q. Since in RL we want to maximize Q, for this part we want to do gradient ascent instead. To do this, you just perform gradient descent, except you feed in the negated gradients.
To derive the gradients, do the following:
Using the online actor network, send in a batch of states that was sampled from your replay memory. (The same batch used to train the critic)
Calculate the deterministic action for each of those states
Send those states, together with the actions computed in step 2, to the online critic network to map those exact state-action pairs to Q values.
Calculate the gradient of the Q values with respect to the actions calculated in step 2. We can use tf.gradients(q_values, actions) to do this. Now we have dQ/dA.
Send the states to the online actor network again and map them to actions.
Calculate the gradient of those actions with respect to the online actor network's weights, again using tf.gradients(actions, actor_weights). This will give you dA/dTheta.
Multiply dQ/dA by -dA/dTheta; the negation is what turns gradient descent into gradient ascent. We are left with the (negated) gradient of the objective function, i.e., the gradient of J.
Divide every element of the gradient of J by the batch size (i.e., for each j in the gradient, j / batch_size), since the gradients computed by tf.gradients are summed over the batch.
Apply a gradient descent step by first zipping the gradient of J with the actor's network parameters. This can be done with an optimizer's apply_gradients method, e.g. optimizer.apply_gradients(zip(J, network_params)).
And bam, your actor is training its parameters with respect to maximizing Q.
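If it helps, here is a minimal sketch of the same update written in TensorFlow 2 style, where tf.GradientTape applies the chain rule (dQ/da * da/dTheta) automatically when you differentiate -mean(Q) through the actor. The names actor, critic, actor_optimizer, and states are placeholders for your own Keras models and replay batch, not any particular implementation.

import tensorflow as tf

def ddpg_actor_update(actor, critic, actor_optimizer, states):
    # One actor step: gradient ascent on the mean Q of the actor's own actions.
    with tf.GradientTape() as tape:
        actions = actor(states, training=True)               # deterministic actions mu(s) (step 2)
        q_values = critic([states, actions], training=True)  # Q(s, mu(s)) from the online critic (step 3)
        # Maximizing mean Q is the same as minimizing its negative;
        # reduce_mean also takes care of the division by the batch size.
        actor_loss = -tf.reduce_mean(q_values)
    # tape.gradient gives -dQ/da * da/dTheta for the actor's weights via the chain rule.
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_optimizer.apply_gradients(zip(grads, actor.trainable_variables))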
I hope this makes sense! I also had a hard time understanding this concept, and am still a little fuzzy on some parts to be completely honest. Let me know if I can clarify anything!

How to interpret some syntax (n.adapt, update, ...) in JAGS?

I feel very confused by the following syntax in JAGS, for example:
n.iter=100,000
thin=100
n.adapt=100
update(model,1000,progress.bar = "none")
Currently I think
n.adapt=100 means you set the first 100 draws as burn-in,
n.iter=100,000 means the MCMC chain has 100,000 iterations including the burn-in,
I have read explanations of this many times but am still not sure whether my interpretation of n.iter and n.adapt is correct, or how to understand update() and thinning.
Could anyone explain this to me?
This answer is based on the package rjags, which takes an n.adapt argument. First I will discuss the meanings of adaptation, burn-in, and thinning, and then I will discuss the syntax (I sense that you are well aware of the meaning of burn-in and thinning, but not of adaptation; a full explanation may make this answer more useful to future readers).
Burn-in
As you probably understand from introductions to MCMC sampling, some number of iterations from the MCMC chain must be discarded as burn-in. This is because prior to fitting the model, you don't know whether you have initialized the MCMC chain within the characteristic set, the region of reasonable posterior probability. Chains initialized outside this region take a finite (sometimes large) number of iterations to find the region and begin exploring it. MCMC samples from this period of exploration are not random draws from the posterior distribution. Therefore, it is standard to discard the first portion of each MCMC chain as "burn-in". There are several post-hoc techniques to determine how much of the chain must be discarded.
Thinning
A separate problem arises because in all but the simplest models, MCMC sampling algorithms produce chains in which successive draws are substantially autocorrelated. Thus, summarizing the posterior based on all iterations of the MCMC chain (post burn-in) may be inadvisable, as the effective posterior sample size can be much smaller than the analyst realizes (note that Stan's implementation of Hamiltonian Monte Carlo dramatically reduces this problem in some situations). Therefore, it is standard to make inference on "thinned" chains, where only a fraction of the MCMC iterations are used (e.g. only every fifth, tenth, or hundredth iteration, depending on the severity of the autocorrelation).
Adaptation
The MCMC samplers that JAGS uses to sample the posterior are governed by tunable parameters that affect their precise behavior. Proper tuning of these parameters can produce gains in the speed or de-correlation of the sampling. JAGS contains machinery to tune these parameters automatically, and does so as it draws posterior samples. This process is called adaptation, but it is non-Markovian; the resulting samples do not constitute a Markov chain. Therefore, burn-in must be performed separately after adaptation. It is incorrect to substitute the adaptation period for the burn-in. However, sometimes only relatively short burn-in is necessary post-adaptation.
Syntax
Let's look at a highly specific example (the code in the OP doesn't actually show where parameters like n.adapt or thin get used). We'll ask rjags to fit the model in such a way that each step will be clear.
n.chains = 3
n.adapt = 1000
n.burn = 10000
n.iter = 20000
thin = 50
my.model <- jags.model(file = "mymodel.txt", data = X, inits = Y, n.chains = n.chains, n.adapt = n.adapt) # X is a list pointing JAGS to where the data are, Y is a list or function giving initial values
update(my.model, n.burn)
my.samples <- coda.samples(my.model, params, n.iter=n.iter, thin=thin) # params is a list of parameters for which to set trace monitors (i.e. we want posterior inference on these parameters)
jags.model() builds the directed acyclic graph and then performs the adaptation phase for a number of iterations given by n.adapt.
update() performs the burn-in on each chain by running the MCMC for n.burn iterations without saving any of the posterior samples (skip this step if you want to examine the full chains and discard a burn-in period post-hoc).
coda.samples() (from the coda package) runs each MCMC chain for the number of iterations specified by n.iter, but it does not save every iteration. Instead, it saves only every nth iteration, where n is given by thin. Again, if you want to determine your thinning interval post-hoc, there is no need to thin at this stage. One advantage of thinning at this stage is that the coda syntax makes it simple to do; you don't have to understand the structure of the MCMC object returned by coda.samples() and thin it yourself. The bigger advantage to thinning at this stage is realized if n.iter is very large. For example, if autocorrelation is really bad, you might run 2 million iterations and save only every thousandth (thin=1000). If you didn't thin at this stage, you (and your RAM) would need to manipulate an object with three chains of two million numbers each. By thinning as you go, the final object has only 2,000 numbers in each chain.

Generating a Markov transition matrix with known, but stochastic, state times

I have looked for an answer for a while, but with no luck. I am trying to develop a discrete-time Markov model. Presently, I have 5 states, with the 5th state being the absorbing state. I also know the variable durations that the process spends in each state; for example, state 1 might take around 20 years to transition to state 2, and this duration is normally distributed. Is there a way to calculate the transition matrix from this data?
Update: I'm thinking along the lines of Markov chain Monte Carlo simulation, but I'm unsure how to structure the MCMC model.
Have you considered taking into account only the first occurrence of each state in your data? This way you can populate your transition matrix using the obtained Markov chain with no consecutive identical states.
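If it helps, here is a minimal sketch of that suggestion in Python: collapse consecutive repeats of the same state, count transitions, then row-normalise. It assumes sequences is a list of observed state sequences coded as integers 0-4, with 4 the absorbing state; these names are illustrative, not from any library.

import numpy as np

def estimate_transition_matrix(sequences, n_states=5):
    counts = np.zeros((n_states, n_states))
    for seq in sequences:
        # Keep only the first occurrence of each run of identical states.
        collapsed = [s for i, s in enumerate(seq) if i == 0 or s != seq[i - 1]]
        # Count the transitions between successive (distinct) states.
        for a, b in zip(collapsed[:-1], collapsed[1:]):
            counts[a, b] += 1
    # Row-normalise; rows with no observed transitions are left as zeros.
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)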

How do I prove that my derived equation and the Monte Carlo simulation are equivalent?

I have derived and implemented an equation for an expected value. To show that my code is free of errors, I have run the Monte Carlo computation a number of times to show that it converges to the same value as the equation I derived.
Now that I have the data, how can I visualize this?
Is this even the correct test to do?
Can I give a measure of how sure I am that the results are correct?
It's not clear what you mean by visualising the data, but here are some ideas.
If your Monte Carlo simulation is correct, then the Monte Carlo estimator for your quantity is just the mean of the samples. The variance of your estimator (how far the average will typically be from the 'correct' value) scales inversely with the number of samples you take: as long as you take enough, you'll get arbitrarily close to the correct answer. So, use a moderate number of samples (1,000 should suffice if it's univariate) and look at the average. If this doesn't agree with your theoretical expectation, then you have an error somewhere in one of your estimates.
You can also use a histogram of your samples, again if they're one-dimensional. The distribution of samples in the histogram should match the theoretical distribution you're taking the expectation of.
If you know the variance in the same way as you know the expectation, you can also look at the sample variance (the mean squared difference between the sample and the expectation), and check that this matches as well.
EDIT: to put something more 'formal' in the answer!
If M(X) is your Monte Carlo estimator for E[X], then as n -> infinity, |M(X) - E[X]| -> 0. The variance of M(X) is inversely proportional to n, but exactly what it is will depend on what M is an estimator for. You could construct a specific test for this based on the mean and variance of your samples to see that what you've done makes sense. Every 100 iterations, you could compute the mean of your samples and take the difference between this and your theoretical E[X]. If this difference decreases, you're probably error-free. If not, you have an issue either in your theoretical estimate or in your Monte Carlo estimator.
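Here is a minimal sketch of that running check, assuming you have a draw_sample() function for your simulator and a theoretical_value from your derived equation (both names are placeholders):

import numpy as np

def convergence_check(draw_sample, theoretical_value, n_samples=10000, every=100):
    samples = []
    gaps = []  # |running mean - theoretical value|, recorded every `every` samples
    for i in range(1, n_samples + 1):
        samples.append(draw_sample())
        if i % every == 0:
            gaps.append(abs(np.mean(samples) - theoretical_value))
    # Plot gaps against the sample count: they should shrink roughly like 1/sqrt(n).
    return gaps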
Why not just do a simple t-test? From your theoretical equation, you have the true mean mu_0, and your simulator has mean mu_1. Note that we can't calculate mu_1 exactly; we can only estimate it using the sample mean/average. So our hypotheses are:
H_0: mu_0 = mu_1 and H_1: mu_0 does not equal mu_1
The test statistic is the usual one-sample test statistic, i.e.
T = (x_bar - mu_0) / (s / sqrt(n))
where
mu_0 is the value from your equation,
x_bar is the average from your simulator,
s is the sample standard deviation,
n is the number of values used to calculate the average.
In your case, n is going to be large, so this is effectively a Normal (z) test. We reject H_0 when T falls outside (-3, 3), which corresponds to a p-value below 0.01.
A couple of comments:
You can't "prove" that the means are equal.
You mentioned that you want to test a number of values. One possible solution is to implement a Bonferroni-type correction. Basically, you reduce your significance threshold to alpha/N, where N is the number of tests you are running.
Make your sample size as large as possible. Since we don't have any idea about the variability in your Monte Carlo simulation it's impossible to say use n=....
The p-value < 0.01 cutoff corresponding to T falling outside (-3, 3) just comes from the Normal distribution.
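A minimal sketch of this test, assuming samples holds your Monte Carlo outputs and mu_0 is the value from your derived equation (placeholder names):

import numpy as np
from scipy import stats

def compare_to_theory(samples, mu_0, alpha=0.01):
    samples = np.asarray(samples, dtype=float)
    # One-sample t-test of H_0: the simulator's mean equals mu_0.
    t_stat, p_value = stats.ttest_1samp(samples, popmean=mu_0)
    # With large n this is effectively a z-test; reject H_0 when p < alpha.
    return t_stat, p_value, p_value < alpha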
