Pacman AI project - suitable combination of step cost and heuristic - search

As part of a project, I am trying to implement A* within the context of a Pacman game (see the UC Berkeley Pacman AI project). There are no ghosts or capsules, only a maze and the 'fruit'. I am having trouble, however, understanding the relationship between my heuristic function and my cost function.
As per the project, when defining the search problem, we need to specify a step cost that derives from:
score = -NbSteps + 10*NbOfEatenDots + 200*NbOfEatenGhosts - 500*isLoss + 500*isWin
This step cost is supposed to always be positive, so for simplicity I have decided to take: step_cost = 1.5 - 0.5*AteAFoodDot. I have ignored ghosts and capsules, since they do not exist, and I have given a preferential (lower) cost to moves that end up eating a dot. I have also ignored steps that result in a loss (since they do not exist) and steps that result in a win state.
Now as far as the A* algorithm itself is concerned, we have to implement a cost function and a heuristic function of our own:
As a cost function I have chosen Cost = sum(step costs to current state), and as a heuristic: h = (Manhattan distance between Pacman and the dot closest to him) + (Manhattan distance from this dot to the dot furthest away from it, as long as such a dot exists), which is an admissible heuristic. I have also implemented this heuristic using real maze distances instead of Manhattan distances, but that proved too time-consuming for mazes with many food dots.
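For concreteness, here is a minimal sketch of that heuristic in Python (the position/food representation is an assumption for illustration, not the project's actual API):

    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def food_heuristic(position, food):
        """position: Pacman's (x, y); food: list of (x, y) dot positions."""
        if not food:
            return 0
        # Distance to the closest remaining dot...
        closest = min(food, key=lambda dot: manhattan(position, dot))
        h = manhattan(position, closest)
        # ...plus the distance from that dot to the dot furthest from it.
        if len(food) > 1:
            h += max(manhattan(closest, dot) for dot in food)
        return h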
Now, if I have understood correctly: if g(n) is my cost function and h(n) my heuristic, I must always have h(n) <= the true cost from n to the goal, so that A* always returns an optimal path, and the closer h(n) is to that true cost, the fewer nodes will be expanded.
In this respect, is it not in my interest to ignore how the score is computed, ignore whether a step results in eating a food dot, and simply take step_cost = 1 for all steps?
This is how I obtain the best results with respect to computation time and nodes expanded, but ignoring the cost function of the game seems wrong.
Could someone clarify this for me? Is it a matter of preference/choice, or is there an objectively correct answer/best approach?

Related

Consistent Heuristic Search - Tiebreaking Policy

So I have the following question in my midterm review and I'm not really sure how to go about it; it's the only one I couldn't figure out. The notation we used in class is f(n) = g(n) + h(n), where:
g(n): Cost from start to current
h(n): Estimated cost from current to goal, given heuristic
C*: Optimal cost from start to goal
I know it has something to do with the fact that the heuristic is consistent and this property:
If an instance of A* uses a consistent heuristic, then A* will expand every node n for which f(n) < C*.
Any help would be greatly appreciated.

What is the difference between uniform-cost search and best-first search methods?

Both methods have a data structure which holds the nodes (with their cost) to expand. Both methods first expand the node with the best cost. So, what is the difference between them?
I was told that uniform-cost search is a blind method and best-first search is not, which confused me even more (don't both of them have information about node costs?).
The difference is in the heuristic function.
Uniform-cost search is uninformed search: it doesn't use any domain knowledge. It expands the least-cost node, and it does so in every direction, because no information about the goal is provided. It can be viewed as the function f(n) = g(n), where g(n) is the path cost ("path cost" itself is a function that assigns a numeric cost to a path with respect to some performance measure, e.g. distance in kilometers, or number of moves, etc.). It is simply the cost to reach node n.
Best-first search is informed search: it uses a heuristic function to estimate how close the current state is to the goal (are we getting close to the goal?). Hence the path cost g(n) is combined with an estimate h(n) of the cost to get from n to the goal (the heuristic function), giving us f(n) = g(n) + h(n). An example of a best-first search algorithm is the A* algorithm.
Yes, both methods keep a list of nodes to expand, but best-first search will try to minimize the number of nodes expanded by using the combined evaluation (path cost + heuristic function).
There is a little misunderstanding here. Uniform-cost search, best-first search, and A* are all different algorithms. Uniform-cost is an uninformed search algorithm, while best-first and A* are informed search algorithms; informed means that the algorithm uses a heuristic function when deciding which node to expand. The difference between best-first search and A* is that best-first uses f(n) = h(n) to choose the node to expand, while A* uses f(n) = g(n) + h(n). h(n) is the heuristic function; g(n) is the actual cost from the start node to node n.
More details can be seen here: https://www.cs.utexas.edu/~mooney/cs343/slide-handouts/heuristic-search.4.pdf
Slight correction to the accepted answer
Best-first search does not estimate how close to the goal the current state is; it estimates how close to the goal each of the next states will be (from the current state), to influence the path selected.
Uniform-cost search expands the least cost node (regardless of heuristic), and best-first search expands the least (cost + heuristic) node.
f(n) is the cost function used to evaluate the potential nodes to expand.
g(n) is the cost of moving to a node n.
h(n) is the estimated cost that it will take to get to the final goal state if we were to go to n.
The f(n) used in uniform-cost search:
f(n) = g(n)
The f(n) used in best-first search (A* is an example of best-first search):
f(n) = h(n)
The f(n) used in A* search:
f(n) = g(n) + h(n)
Note: the h(n) from best-first search above is expanded in A* so that it always includes g(n). It is still basically just a heuristic, but it is a heuristic that includes g(n).
Each of these functions evaluates the potential expansion nodes, not the current node, when traversing the tree looking for an n that is a goal state.
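To make the three variants concrete, here is a rough sketch of a single generic search in which only the priority function f changes (the successors interface and all names are assumptions for illustration, not any particular library's API):

    import heapq
    from itertools import count

    def best_first_search(start, is_goal, successors, f):
        """Generic best-first search. successors(state) yields
        (next_state, step_cost) pairs; f(g, state) ranks the frontier."""
        tiebreak = count()  # avoids comparing states when priorities tie
        frontier = [(f(0, start), next(tiebreak), 0, start, [start])]
        explored = set()
        while frontier:
            _, _, g, state, path = heapq.heappop(frontier)
            if is_goal(state):
                return path
            if state in explored:
                continue
            explored.add(state)
            for nxt, cost in successors(state):
                if nxt not in explored:
                    priority = f(g + cost, nxt)
                    heapq.heappush(frontier, (priority, next(tiebreak),
                                              g + cost, nxt, path + [nxt]))
        return None

    # The variants differ only in f (h is some heuristic estimate):
    # uniform-cost search:      f = lambda g, state: g
    # greedy best-first search: f = lambda g, state: h(state)
    # A* search:                f = lambda g, state: g + h(state)

Note that with the explored set used here, A* needs a consistent heuristic to remain optimal; with a merely admissible heuristic one would have to allow nodes to be re-expanded.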
The differences are given below:
Uniform-cost search (UCS) expands the node with the lowest path cost (i.e. with the lowest g(n)), whereas best-first search (BFS) expands the node estimated to be closest to the goal.
UCS cannot make use of a heuristic function, whereas BFS can.
In UCS, f(n) = g(n), whereas, in BFS, f(n) = g(n) + h(n).
Uniform-cost search picks the unvisited node with the lowest distance, calculates the distance through it to each unvisited neighbor, and updates the neighbor's distance if smaller.
Best-first search is a heuristic-based algorithm that attempts to predict how close the end of a path (i.e. the last node in the path) is to the goal node, so that paths which are judged to be closer to a solution are expanded first.

Directed graph linear algorithm

I would like to know the best way to calculate the length of the shortest path between vertex s and every other vertex of the graph in linear time using dynamic programming.
The graph is a weighted DAG.
What you can hope for is an algorithm linear in the number of edges and vertices, i.e. O(|E| + |V|), which also works correctly in presence of negative weights.
This is done by first computing a topological order and then 'exploring' the graph in the order given by this topological order.
Some notation: let's call d'(s,v) the shortest distance from s to v and d(u,v) the length/weight of the arc from u to v (if it exists).
Then, for a node v that is currently being visited, the shortest distance from s to v is the minimum of d'(s,u) + d(u,v) over all in-neighbours u of v.
In principle, this is very similar to Dijkstra's algorithm except that we already know in which order to traverse the vertices.
The topological sorting ensures that all in-neighbours of v have already been visited and will not be updated again. So, whenever a node v has been visited, the distance it is assigned is the correct shortest distance from s to v. Therefore, you end up with a shortest s-v path for each v.
A full description and implementation can be found here, which links to these lecture notes. I'm not sure where the algorithmic idea for this DAG algorithm was originally published in the literature.
This algorithm works for DAGs, even in the presence of negative weights/distances.
While a typical implementation of this algorithm will most likely not be done using dynamic programming explicitly, it can still be interpreted as such since the problem of finding a shortest path to a node v is computed using the shortest paths to the in-neighbours of v.
For further discussion on if/how this type of algorithm counts as dynamic programming, let me refer you to this question.
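A minimal sketch of this, assuming vertices numbered 0..n-1 and an edge list (my own representation, not the one from the linked notes):

    from collections import defaultdict, deque
    from math import inf

    def dag_shortest_paths(n, edges, s):
        """Single-source shortest distances in a weighted DAG, O(|V| + |E|).
        edges is a list of (u, v, weight) arcs; negative weights are fine."""
        adj = defaultdict(list)
        indeg = [0] * n
        for u, v, w in edges:
            adj[u].append((v, w))
            indeg[v] += 1
        # Kahn's algorithm: build a topological order of the vertices.
        queue = deque(v for v in range(n) if indeg[v] == 0)
        order = []
        while queue:
            u = queue.popleft()
            order.append(u)
            for v, _ in adj[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
        # Relax the out-edges of each vertex in topological order.
        dist = [inf] * n
        dist[s] = 0
        for u in order:
            if dist[u] < inf:
                for v, w in adj[u]:
                    dist[v] = min(dist[v], dist[u] + w)
        return dist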
It's possible that what you're looking for is the Bellman-Ford algorithm, which is O(|V||E|) in terms of time complexity (not really linear).
Not sure if some witty dynamic-programming approach could improve on that though.
As hauron said, Bellman-Ford will give you what you're looking for in time O(|V||E|). This works even if your graph contains negative weighted edges, and Bellman-Ford uses dynamic programming at its core.
However, I must add that if your weights are non-negative, you can do Dijkstra from your vertex s in time O(|E| log |E|).
Initialize d[s] = 0.
For every other vertex v, calculate recursively (memoizing the results):
d[v] = min {d[u] + w(u,v) | (u,v) is an edge}
with d[v] = ∞ if v has no incoming edges.
(The recursion always halts since the graph is acyclic.)
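Read literally as dynamic programming, that recurrence can be memoized top-down; a sketch, assuming in_edges maps each vertex to its list of (in-neighbour, weight) pairs:

    from functools import lru_cache
    from math import inf

    def make_distance_function(in_edges, s):
        """Returns d where d(v) is the shortest distance from s to v."""
        @lru_cache(maxsize=None)
        def d(v):
            if v == s:
                return 0
            # d[v] = min over in-neighbours u of d[u] + w(u,v); inf if none.
            return min((d(u) + w for u, w in in_edges.get(v, [])),
                       default=inf)
        return d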

Pacman pathfinding heuristic

How do I go about implementing an admissible heuristic function for a Pacman game, such that it finds the shortest path from a given location that includes multiple goals (all remaining dots)? Currently I'm using an A* search with Manhattan distances as the heuristic: I take the sum of the Manhattan distances from a node to every remaining dot that has not yet been eaten, and that is my h(n). The algorithm takes extremely long to complete, and I'm not really sure how to tiebreak.
Well, I'm assuming you're taking the edX course in Artificial Intelligence.
Taking the sum of the distances between your current position and each food pellet will not be admissible: a single step can bring you closer to several pellets at once, and eating one pellet may leave you even closer to another, so the sum can overestimate the true cost.
Depending on the size of the grid and how sparse it is, what you can do is run a BFS from your Pacman's current location to find the nearest pellet. You can then use that distance as an admissible heuristic.
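A rough sketch of that idea (the grid representation, with pellets and walls as sets of cells, is an assumption for illustration, not the course's actual API):

    from collections import deque

    def nearest_pellet_distance(start, pellets, walls):
        """True maze distance from start to the closest pellet, via BFS.
        Admissible: Pacman must walk at least this far to eat anything."""
        queue = deque([(start, 0)])
        seen = {start}
        while queue:
            (x, y), dist = queue.popleft()
            if (x, y) in pellets:
                return dist
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nxt not in walls and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return 0  # no pellets remain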

How do I efficiently estimate a probability based on a small amount of evidence?

I've been trying to find an answer to this for months (to be used in a machine learning application). It doesn't seem like it should be a terribly hard problem, but I'm a software engineer, and math was never one of my strengths.
Here is the scenario:
I have a (possibly) unevenly weighted coin and I want to figure out the probability of it coming up heads. I know that coins from the same box that this one came from have an average probability of p, and I also know the standard deviation of these probabilities (call it s).
(If other summary properties of the probabilities of other coins aside from their mean and stddev would be useful, I can probably get them too.)
I toss the coin n times, and it comes up heads h times.
The naive approach is that the probability is just h/n - but if n is small this is unlikely to be accurate.
Is there a computationally efficient way (i.e. one that doesn't involve very, very large or very, very small numbers) to take p and s into consideration to come up with a more accurate probability estimate, even when n is small?
I'd appreciate it if any answers could use pseudocode rather than mathematical notation since I find most mathematical notation to be impenetrable ;-)
Other answers:
There are some other answers on SO that are similar, but the answers provided are unsatisfactory. For example this is not computationally efficient because it quickly involves numbers way smaller than can be represented even in double-precision floats. And this one turned out to be incorrect.
Unfortunately you can't do machine learning without knowing some basic math; it's like asking somebody for help in programming but not wanting to know about "variables", "subroutines" and all that if-then stuff.
The better way to do this is called Bayesian integration, but there is a simpler approximation called "maximum a posteriori" (MAP). It's pretty much like the usual thinking, except you can put in the prior distribution.
Fancy words, but you may ask: well, where did the h/(h+t) formula come from? Of course it's obvious, but it turns out that it is the answer you get when you have "no prior". And the method below is the next level of sophistication up, when you add a prior. Going to Bayesian integration would be the next one, but that's harder and perhaps unnecessary.
As I understand it, the problem is twofold: first you draw a coin from the bag of coins. This coin has a "headsiness" called theta, so that it gives a head in a fraction theta of the flips. And the theta for this coin comes from the master distribution, which I'll assume is Gaussian with mean P and standard deviation S.
What you do next is to write down the total unnormalized probability (called likelihood) of seeing the whole shebang, all the data: (h heads, t tails)
L = (theta)^h * (1-theta)^t * Gaussian(theta; P, S).
Gaussian(theta; P, S) = exp( -(theta-P)^2/(2*S^2) ) / sqrt(2*Pi*S^2)
This is the meaning of "first draw 1 value of theta from the Gaussian" and then draw h heads and t tails from a coin using that theta.
The MAP principle says, if you don't know theta, find the value which maximizes L given the data that you do know. You do that with calculus. The trick to make it easy is that you take logarithms first. Define LL = log(L). Wherever L is maximized, then LL will be too.
so
LL = h*log(theta) + t*log(1-theta) - (theta-P)^2/(2*S^2) - (1/2)*log(2*pi*S^2)
By calculus to look for extrema you find the value of theta such that dLL/dtheta = 0.
Since the last term, the log, has no theta in it, you can ignore it.
dLL/dtheta = h/theta - t/(1-theta) + (P-theta)/S^2 = 0
If you can solve this equation for theta you will get an answer, the MAP estimate for theta given the number of heads h and the number of tails t.
If you want a fast approximation, try doing one step of Newton's method, where you start with your proposed theta at the obvious (called maximum likelihood) estimate of theta = h/(h+t).
And where does that 'obvious' estimate come from? If you do the stuff above but don't put in the Gaussian prior, you get h/theta - t/(1-theta) = 0, and you'll come up with theta = h/(h+t).
If your prior probabilities are really small, as is often the case, instead of near 0.5, then a Gaussian prior on theta is probably inappropriate, as it puts some weight on negative probabilities, which is clearly wrong. More appropriate is a Gaussian prior on log theta (a 'lognormal distribution'). Plug it in the same way and work through the calculus.
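Here is a small sketch of that recipe: the MAP estimate computed by a few Newton steps on the equation above (the clamping away from 0 and 1 is my own guard against division by zero, not part of the derivation):

    def map_estimate(h, t, P, S, steps=3, eps=1e-6):
        """MAP estimate of theta under a Gaussian(P, S) prior, by Newton's
        method on g(theta) = h/theta - t/(1-theta) + (P-theta)/S^2 = 0.
        Starts at the maximum-likelihood estimate h/(h+t)."""
        theta = min(max(h / (h + t), eps), 1 - eps)
        for _ in range(steps):
            g = h / theta - t / (1 - theta) + (P - theta) / S**2
            dg = -h / theta**2 - t / (1 - theta)**2 - 1 / S**2
            theta = min(max(theta - g / dg, eps), 1 - eps)
        return theta

    # e.g. 3 heads out of 4 flips, prior mean 0.5, prior stddev 0.1:
    # map_estimate(3, 1, 0.5, 0.1) stays much closer to 0.5 than 3/4 does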
You can use p as a prior on your estimated probability. This is basically the same as doing pseudocount smoothing. I.e., use
(h + c * p) / (n + c)
as your estimate. When h and n are large, this just becomes h/n. When h and n are small, this is just c*p/c = p. The choice of c is up to you: you can base it on s, but in the end you have to decide how small is too small.
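As a one-line sketch (the function name is illustrative):

    def smoothed_estimate(h, n, p, c):
        """Pseudocount smoothing: behaves like p when n is tiny and like
        h/n when n is large. c is the number of 'virtual flips' with
        bias p that the prior is considered worth."""
        return (h + c * p) / (n + c)

    # e.g. 2 heads in 3 flips, prior p = 0.5 worth c = 10 virtual flips:
    # smoothed_estimate(2, 3, 0.5, 10) -> 7/13, pulled toward 0.5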
You don't have nearly enough info in this question.
How many coins are in the box? If it's two, then in some scenarios (for example, one coin is always heads, the other always tails) knowing p and s would be useful. If it's more than a few, and especially if only some of the coins are weighted, and only slightly, then it is not useful.
What is a small n? 2? 5? 10? 100? What is the probability of a weighted coin coming up heads/tail? 100/0, 60/40, 50.00001/49.99999? How is the weighting distributed? Is every coin one of 2 possible weightings? Do they follow a bell curve? etc.
It boils down to this: the differences between a weighted/unweighted coin, the distribution of weighted coins, and the number of coins in your box will all decide what n has to be for you to solve this with high confidence.
The name for what you're trying to do is a Bernoulli trial. Knowing the name should be helpful in finding better resources.
Response to comment:
If you have differences in p that small, you are going to have to do a lot of trials and there's no getting around it.
Assuming a uniform distribution of bias, p will still be 0.5 and all standard deviation will tell you is that at least some of the coins have a minor bias.
How many tosses, again, will be determined under these circumstances by the weighting of the coins. Even with 500 tosses, you won't get a strong confidence (about 2/3) detecting a .51/.49 split.
In general, what you are looking for is Maximum Likelihood Estimation. Wolfram Demonstration Project has an illustration of estimating the probability of a coin landing head, given a sample of tosses.
Well, I'm no math man, but I think the simple Bayesian approach is intuitive and broadly applicable enough to be worth putting a little thought into. Others above have already suggested this, but perhaps, if you're like me, you would prefer more verbosity.
In this lingo, you have a set of mutually-exclusive hypotheses, H, and some data D, and you want to find the (posterior) probabilities that each hypothesis Hi is correct given the data. Presumably you would choose the hypothesis that had the largest posterior probability (the MAP as noted above), if you had to choose one. As Matt notes above, what distinguishes the Bayesian approach from only maximum likelihood (finding the H that maximizes Pr(D|H)) is that you also have some PRIOR info regarding which hypotheses are most likely, and you want to incorporate these priors.
So you have, from basic probability, Pr(H|D) = Pr(D|H)*Pr(H)/Pr(D). You can estimate these Pr(H|D) numerically by creating a series of discrete probabilities Hi for each hypothesis you wish to test, e.g. [0.0, 0.05, 0.1, ..., 0.95, 1.0], and then determining your prior Pr(Hi) for each Hi. Above it is assumed you have a normal distribution of priors; if that is acceptable, you could use the mean and stdev to get each Pr(Hi), or use another distribution if you prefer.
With coin tosses, Pr(D|H) is of course determined by the binomial, using the observed number of successes with n trials and the particular Hi being tested. The denominator Pr(D) may seem daunting, but we assume that we have covered all the bases with our hypotheses, so Pr(D) is the summation of Pr(D|Hi)*Pr(Hi) over all Hi.
Very simple if you think about it a bit, and maybe not so if you think about it a bit more.
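A small sketch of that discrete computation (the grid size and names are arbitrary choices of mine):

    from math import comb, exp, pi, sqrt

    def posterior_grid(h, n, P, S, num_points=21):
        """Posterior Pr(Hi|D) over a grid of candidate 'headsiness' values.
        Prior: Gaussian(P, S) on the grid; likelihood: binomial(n, h)."""
        thetas = [i / (num_points - 1) for i in range(num_points)]
        def prior(theta):
            return exp(-(theta - P)**2 / (2 * S**2)) / sqrt(2 * pi * S**2)
        # Unnormalized posterior Pr(D|Hi) * Pr(Hi) at each grid point.
        weights = [comb(n, h) * th**h * (1 - th)**(n - h) * prior(th)
                   for th in thetas]
        total = sum(weights)  # plays the role of the denominator Pr(D)
        return {th: w / total for th, w in zip(thetas, weights)}

    # The MAP estimate is the grid point with the largest posterior:
    # post = posterior_grid(h=7, n=10, P=0.5, S=0.1)
    # map_theta = max(post, key=post.get)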
