Represent state space graph for Markov process for car racing example

Could anybody please help me design the state space graph for the Markov decision process (MDP) of the car racing example from Berkeley CS188?
(Image: car racing example)
For example, I can take 100 actions, and I want to run value iteration to find the policy that maximizes my reward.
With only 3 states (Cool, Warm and Overheated), I don't know how to add an "End" state and complete the MDP.
I am thinking about having 100 Cool states and 100 Warm states, so that, for example, from Cool1 you can go to Cool2, Warm2 or Overheated, and so on.
With this design, the values of states close to 0 are higher than the values of states close to 100.
Am I missing something about MDPs?

There should be only 3 possible states. The "cool" and "warm" states are recurrent, and the "overheated" state is absorbing because the probability of leaving it is 0.
You can have two actions, slow or fast, in both the "cool" and "warm" states, as described in the problem statement. The transition probabilities and step rewards can be read directly from the chart. For example, P(go fast, from cool to warm) = 0.5, and R(go fast, from cool to warm) = 2.
Depending on the objective, you can solve it as a finite-horizon or an infinite-horizon MDP.
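A minimal sketch of the finite-horizon variant, in Python. Only the cool/fast numbers are quoted above; the other transition probabilities and rewards are assumed to match the CS188 chart and should be checked against the figure. The names T and value_iteration are made up for this illustration. Note that the 100-action budget is handled by running 100 iterations, so no extra "Cool1 ... Cool100" states are needed.

```python
# Racing MDP with 3 states. T[state][action] is a list of
# (probability, next_state, reward) outcomes.
# ASSUMPTION: all numbers other than cool/fast are taken from the CS188 chart.
T = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},  # absorbing: no actions, value stays 0
}


def value_iteration(horizon, gamma=1.0):
    """Return (V, policy) for an agent with `horizon` actions left."""
    V = {s: 0.0 for s in T}  # with 0 actions left, every state is worth 0
    policy = {}
    for _ in range(horizon):
        new_V = {}
        for s, actions in T.items():
            if not actions:
                new_V[s] = 0.0
                continue
            # Expected return of each action, given the values one step later.
            q = {
                a: sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for a, outcomes in actions.items()
            }
            best = max(q, key=q.get)
            policy[s] = best  # best first action with this many steps left
            new_V[s] = q[best]
        V = new_V
    return V, policy


if __name__ == "__main__":
    values, policy = value_iteration(100)
    print(values, policy)
```

With gamma = 1 the values grow with the number of remaining actions, which is expected in a finite-horizon problem: a state with more steps left can simply collect more reward.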

Related

Tuning Parameters to Optimize Score without CNN

I am trying to create an agent in Rust that uses a scoring function to determine the best move on a 2D uniform-cost grid. The specifics of the game aren't very relevant, other than knowing that each turn you can choose one of 4 moves (up, down, left or right) and you are competing against other AIs playing on the same board. Currently the AI builds "branches" of possible paths it could take into the future using several simple algorithms, such as A* to find enemies or food. Several characteristics are saved as the future simulations run, including the number of enemies we killed on that branch, the amount of food we ate, and how long the branch lasted before we died.
Once we are ready to make our move, we give each predicted branch a score and go in the direction with the highest average score. This score is essentially the sum of each characteristic mentioned previously multiplied by a constant. For example, the score may be 30 * number of food eaten + 100 * number of enemies killed. However, the numbers 30 and 100 were chosen almost at random through experimentation. If the snake died from not eating food, then I increase the score multiplier for eating food, for example. However, there are 10 different characteristics, each with its own weight. Figuring out the relationship between them all manually is time-consuming and doesn't easily converge on the optimal strategy.
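The scoring step described above can be sketched as follows. This is an illustration in Python rather than the Rust the agent is written in, and the names branch_score, best_direction, "food" and "kills" are hypothetical; only the weights 30 and 100 come from the text.

```python
def branch_score(features, weights):
    """Weighted sum of the characteristics recorded for one simulated branch."""
    return sum(weights[name] * value for name, value in features.items())


def best_direction(branches, weights):
    """Pick the move whose simulated branches have the highest average score.

    `branches` maps a direction to the list of feature dicts of its branches.
    Directions with no simulated branches are skipped.
    """
    averages = {
        direction: sum(branch_score(f, weights) for f in feats) / len(feats)
        for direction, feats in branches.items()
        if feats
    }
    return max(averages, key=averages.get)


if __name__ == "__main__":
    weights = {"food": 30, "kills": 100}  # the hand-picked constants from the text
    branches = {
        "up": [{"food": 2, "kills": 0}, {"food": 1, "kills": 0}],
        "left": [{"food": 0, "kills": 1}],
    }
    print(best_direction(branches, weights))
```

Writing the score this way makes the tuning problem explicit: the 10 weights are the only free parameters, so any search or optimization method only has to propose a new weights dict and measure the resulting win rate.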
Herein lies my issue. I would like to find a way to "train" the values for the AI through a process somewhat like Q-learning. There is a very clear terminal condition when you win or lose, which helps. My current idea is to create a table with 100 possible values of each parameter, then play 100 games with each combination and record the win rate. However, with 10 parameters this would take 100^10 * 100 games, or 1E22 games. It seems like there should be a smarter way to eliminate bad combinations using some form of loss minimization. If anybody has suggestions on tuning these parameters without a neural network, it would be greatly appreciated.

Number of training samples for text classification task

Suppose you have a set of transcribed customer service calls between customers and human agents, where on average each call's length is 7 minutes. Customers will mostly call because of issues they have with the product. Let's assume that a human can assign one label per axis per call:
Axis 1: What was the problem from the customer's perspective?
Axis 2: What was the problem from the agent's perspective?
Axis 3: Could the agent resolve the customer's issue?
Based on the manually labeled texts you want to train a text classifier that shall predict a label for each call for each of the three axes. But the labeling of recordings takes time and costs money. On the other hand you need a certain amount of training data to get good prediction results.
Given the above assumptions, how many manually labeled training texts would you start with? And how do you know that you need more labeled training texts?
Maybe you've worked on a similar task before and can give some advice.
UPDATE (2018-01-19): There's no right or wrong answer to my question. Ok, ideally, somebody worked on exactly the same task, but that's very unlikely. I'll leave the question open for one more week and then accept the best answer.

This would be tricky to answer but I will try my best based on my experience.
In the past, I have performed text classification on 3 datasets (sizes in parentheses): restaurant reviews (50k sentences), Reddit comments (250k sentences) and developer comments from issue tracking systems (10k sentences). Each of them had multiple labels as well.
In each of the three cases, including the one with 10k sentences, I achieved an F1 score of more than 80%. I am stressing this dataset specifically because some people told me it was too small.
So, in your case, assuming you have at least 1000 instances (calls that include the conversation between customer and agent) averaging 7 minutes each, this should be a decent start. If the results are not satisfactory, you have the following options:
1) Use different models (MNB, Random Forest, Decision Tree, and so on, in addition to whatever you are using).
2) If point 1 gives more or less similar results, check the ratio of instances across the classes on each of your 3 axes. If the classes are badly imbalanced, get more data, or try balancing techniques if you cannot get more data.
3) Classify at the sentence level rather than at the message or conversation level, which generates more data and gives individual labels for sentences rather than for the whole message or conversation.
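The class-ratio check in point 2 can be sketched with the standard library alone. The function names and the example labels below are made up for illustration; run the check once per axis on your own label lists.

```python
from collections import Counter


def class_ratios(labels):
    """Fraction of instances belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


if __name__ == "__main__":
    # Hypothetical labels for axis 3 ("Could the agent resolve the issue?").
    axis3 = ["yes", "yes", "yes", "no"]
    print(class_ratios(axis3))
    print(imbalance_ratio(axis3))
```

A ratio far from 1 on any axis suggests that more data for the rare classes, or a balancing technique, is worth trying before switching models again.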

How to interpret some syntax (n.adapt, update..) in jags?

I feel very confused with the following syntax in jags, for example,
n.iter=100,000
thin=100
n.adapt=100
update(model,1000,progress.bar = "none")
Currently I think:
n.adapt=100 means the first 100 draws are set as burn-in,
n.iter=100,000 means the MCMC chain has 100,000 iterations including the burn-in.
I have checked the explanation for this many times, but I am still not sure whether my interpretation of n.iter and n.adapt is correct, or how to understand update() and thinning.
Could anyone explain this to me?

This answer is based on the package rjags, which takes an n.adapt argument. First I will discuss the meanings of adaptation, burn-in, and thinning, and then I will discuss the syntax (I sense that you are well aware of the meaning of burn-in and thinning, but not of adaptation; a full explanation may make this answer more useful to future readers).
Burn-in
As you probably understand from introductions to MCMC sampling, some number of iterations from the MCMC chain must be discarded as burn-in. This is because prior to fitting the model, you don't know whether you have initialized the MCMC chain within the characteristic set, the region of reasonable posterior probability. Chains initialized outside this region take a finite (sometimes large) number of iterations to find the region and begin exploring it. MCMC samples from this period of exploration are not random draws from the posterior distribution. Therefore, it is standard to discard the first portion of each MCMC chain as "burn-in". There are several post-hoc techniques to determine how much of the chain must be discarded.
Thinning
A separate problem arises because in all but the simplest models, MCMC sampling algorithms produce chains in which successive draws are substantially autocorrelated. Thus, summarizing the posterior based on all iterations of the MCMC chain (post burn-in) may be inadvisable, as the effective posterior sample size can be much smaller than the analyst realizes (note that Stan's implementation of Hamiltonian Monte Carlo sampling dramatically reduces this problem in some situations). Therefore, it is standard to make inference on "thinned" chains, where only a fraction of the MCMC iterations are used in inference (e.g. only every fifth, tenth, or hundredth iteration, depending on the severity of the autocorrelation).
Adaptation
The MCMC samplers that JAGS uses to sample the posterior are governed by tunable parameters that affect their precise behavior. Proper tuning of these parameters can produce gains in the speed or de-correlation of the sampling. JAGS contains machinery to tune these parameters automatically, and does so as it draws posterior samples. This process is called adaptation, but it is non-Markovian; the resulting samples do not constitute a Markov chain. Therefore, burn-in must be performed separately after adaptation. It is incorrect to substitute the adaptation period for the burn-in. However, sometimes only relatively short burn-in is necessary post-adaptation.
Syntax
Let's look at a highly specific example (the code in the OP doesn't actually show where parameters like n.adapt or thin get used). We'll ask rjags to fit the model in such a way that each step will be clear.
library(rjags)
n.chains <- 3
n.adapt <- 1000
n.burn <- 10000
n.iter <- 20000
thin <- 50
# X is a named list containing the data; Y is a list (or function) giving initial values
my.model <- jags.model("mymodel.txt", data=X, inits=Y, n.chains=n.chains, n.adapt=n.adapt)
# burn-in: run the chains without saving any samples
update(my.model, n.burn)
# params is a character vector naming the parameters to monitor (i.e. those we want posterior inference on)
my.samples <- coda.samples(my.model, params, n.iter=n.iter, thin=thin)
jags.model() builds the directed acyclic graph and then performs the adaptation phase for a number of iterations given by n.adapt.
update() performs the burn-in on each chain by running the MCMC for n.burn iterations without saving any of the posterior samples (skip this step if you want to examine the full chains and discard a burn-in period post-hoc).
coda.samples() (an rjags function that returns the samples as a coda mcmc.list object) runs each MCMC chain for the number of iterations specified by n.iter, but it does not save every iteration. Instead, it saves only every nth iteration, where n is given by thin. Again, if you want to determine your thinning interval post-hoc, there is no need to thin at this stage. One advantage of thinning at this stage is that the coda.samples() syntax makes it simple to do so; you don't have to understand the structure of the MCMC object returned by coda.samples() and thin it yourself. The bigger advantage of thinning at this stage is realized if n.iter is very large. For example, if autocorrelation is really bad, you might run 2 million iterations and save only every thousandth (thin=1000). If you didn't thin at this stage, you (and your RAM) would need to manipulate an object with three chains of two million numbers each. But by thinning as you go, the final object has only 2 thousand numbers in each chain.
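As a quick check of the bookkeeping above, here is a small Python sketch (not rjags code; the function name is made up) that counts how many iterations each chain runs and how many draws are kept. It assumes n.iter is a multiple of thin.

```python
def iterations_summary(n_adapt, n_burn, n_iter, thin):
    """Return (iterations run per chain, draws saved per chain).

    Adaptation and burn-in iterations are run but never saved; of the
    n_iter sampling iterations only every `thin`-th one is kept.
    ASSUMPTION: n_iter is a multiple of thin.
    """
    total_run = n_adapt + n_burn + n_iter
    saved = n_iter // thin
    return total_run, saved


if __name__ == "__main__":
    # Values from the example above.
    print(iterations_summary(n_adapt=1000, n_burn=10000, n_iter=20000, thin=50))
```

So in the example each chain runs 31,000 iterations in total but only 400 draws per chain end up in my.samples, and n.iter counts sampling iterations only, after adaptation and burn-in.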

Generating a Markov transition matrix with known, but stochastic, state times

I have looked for an answer for a while, but with no luck. I am trying to develop a discrete-time Markov model. Presently, I have 5 states, with the 5th state being the absorbing state. I also know the (variable) length of time the process stays in each state. For example, state 1 might take 20 years to transition to state 2, and this duration is normally distributed. Is there a way to calculate the transition matrix from this data?
Update: I'm thinking along the lines of Monte Carlo Markov chain simulation, but I'm unsure how to structure the MCMC model.

Have you considered taking into account only the first occurrence of each state in your data? This way you can populate your transition matrix using the obtained Markov chain with no consecutive identical states.
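A minimal sketch of that suggestion in Python, assuming the data are available as observed state sequences with states numbered from 0; the function names are hypothetical. Consecutive repeats are collapsed first, and the remaining jumps are counted and normalized row by row. Note that this keeps only the order of the states and discards the holding-time information.

```python
from itertools import groupby


def collapse_runs(sequence):
    """Keep only the first occurrence in each run of identical states."""
    return [state for state, _ in groupby(sequence)]


def transition_matrix(sequences, n_states):
    """Estimate jump probabilities from sequences of states 0..n_states-1."""
    counts = [[0] * n_states for _ in range(n_states)]
    for seq in sequences:
        collapsed = collapse_runs(seq)
        for a, b in zip(collapsed, collapsed[1:]):
            counts[a][b] += 1
    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            # No observed exits: treat the state as absorbing (an assumption).
            matrix.append([1.0 if j == i else 0.0 for j in range(n_states)])
        else:
            matrix.append([c / total for c in row])
    return matrix


if __name__ == "__main__":
    data = [[0, 0, 0, 1, 1, 2], [0, 0, 2]]
    for row in transition_matrix(data, 3):
        print(row)
```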

Distance dependent Chinese Restaurant Process maybe

I'm new to machine learning and want to implement the distance dependent Chinese Restaurant process in MATLAB for the clustering of audio tracks.
I'm looking to use the dd-CRP on 26 features. I'm guessing the process might go like this:
1) Read in the 1st feature vector and assign it a "table".
2) Read in the 2nd feature vector and compare it to the 1st "table", maybe using the cosine angle (due to the high dimension) between the two vectors; if it agrees within some defined theta, join that table, else start a new one.
3) Read in the next feature vector and repeat step 2 for it against each existing table.
While this is occurring, I will be keeping track of how many tables there are.
I will be running the algorithm over, say, 16 audio tracks. The audio will be fed into the algorithm as follows: the first feature vector will be from the first frame of audio track 1, the second feature vector from the first frame of track 2, etc., as I'm trying to find out which audio tracks tend to cluster together most, but I don't want to define how many centroids there are. Obviously I'll have to keep track of which audio track is at which "table".
Does this make sense?
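For concreteness, the heuristic described above could be sketched like this (in Python rather than MATLAB; the function names are made up). One assumption is added that the description leaves open: each table is represented by its first member, and a new vector joins the first table whose representative lies within the angle threshold. Vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_tables(vectors, max_angle_deg):
    """Assign vectors to "tables" in order; returns lists of vector indices."""
    tables = []
    for i, v in enumerate(vectors):
        for table in tables:
            representative = vectors[table[0]]  # ASSUMPTION: first member
            # Clamp to [-1, 1] to guard acos against rounding error.
            cos = max(-1.0, min(1.0, cosine_similarity(representative, v)))
            if math.degrees(math.acos(cos)) <= max_angle_deg:
                table.append(i)
                break
        else:
            tables.append([i])  # no table was close enough: open a new one
    return tables


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(greedy_tables(frames, max_angle_deg=15))
```

The number of tables is simply len(tables), and the result depends on the order in which vectors arrive, which is one reason this is a heuristic rather than a probabilistic model.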

This is not a Chinese Restaurant Process. This is a heuristic algorithm which has some similarity to a Chinese Restaurant Process. In a CRP everything is phrased in terms of priors over the assignments of items to clusters (the tables analogy), and these are combined with a likelihood function for each cluster (which formalises the similarity function you described). Inference is then done by Gibbs Sampling, which means non-deterministically sampling which cluster each track is assigned to in turn given all the other assignments. Variational methods for non-parametrics are still in a very preliminary state.
Why do you want to use a CRP? Do you think you'll get something out of it beyond more conventional clustering methods? The bar to entry for the implementation and proper understanding of non-parametrics is pretty high, and they're often of little practical use at the moment because of the constraints on inference I mentioned.

You can use the X-means algorithm, which automatically determines the number of centroids (and hence the number of clusters) based on the Bayesian Information Criterion (BIC). In short, the algorithm looks at how dense each cluster is and how far each cluster is from the others.
