Resource outlining hierarchy of search algorithms?

I would like to better understand how the various common search algorithms relate to each other. Does anyone know of a resource, such as a hierarchy diagram or concise textual description of this?
A small example of what I mean is:
A* Search
-> Uniform-cost is a variant of A* where the heuristic is a constant function
-> Dijkstra's is a variant of uniform-cost search with no goal
-> Breadth-first search is a variant of A* where the heuristic is constant and all step costs are positive and identical

There is no hierarchy as such, just a bunch of different algorithms with different traits.
E.g. A* can be considered to be based on Dijkstra's, with an added heuristic.
Or it can be considered to be based on a heuristic-based best-first search, with an additional factor of the path cost so far.
Similarly, A* is implemented in much the same way as a typical breadth-first search (i.e. with a queue of nodes, although A* orders that queue by estimated total cost rather than first-in, first-out). Iterative-deepening A* (IDA*) is based on A* in that it uses the same cost and heuristic measurements, but is actually implemented as a depth-first search method.
There's also a big crossover with optimisation algorithms here. Some people think of genetic algorithms as a bunch of complex hill-climbing attempts, but others consider them a form of beam search.
It's common for search and optimisation algorithms to draw properties from more than one source, to mix and match approaches to make them more relevant either to the search domain or the computing requirements, so rather than a hierarchy of methods you'll find a selection of themes that crop up across various approaches.
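One way to see how these themes overlap is a single routine whose behaviour changes with the heuristic it is given. The following is only a minimal sketch: the function name, the toy graph and the estimates are made up for illustration. With the default heuristic (always 0) it orders nodes by path cost alone, which is uniform-cost search (Dijkstra's with an early exit); with a real estimate it becomes A*.

```python
import heapq


def a_star(graph, start, goal, h=lambda node: 0):
    """Return (cost, path) for a cheapest path, or None if goal is unreachable.

    graph maps node -> list of (neighbour, step_cost) pairs.
    h estimates the remaining cost; the default (always 0) gives uniform-cost search.
    """
    frontier = [(h(start), 0, start, [start])]  # entries are (g + h, g, node, path)
    best_g = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found later
        for nxt, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + h(nxt), new_g, nxt, path + [nxt]))
    return None


# Hypothetical example data.
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 1), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
estimate = {"A": 3, "B": 2, "C": 1, "D": 0}  # made-up admissible estimates

ucs_result = a_star(graph, "A", "D")                    # heuristic is constant
astar_result = a_star(graph, "A", "D", h=estimate.get)  # heuristic is informative
```

Both calls find the same cheapest path; the heuristic only changes the order in which nodes are taken off the queue.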

Any-goal bidirectional A* pathfinding reference

(reposted from cs.stackexchange since I got no answers or comments)
I want to solve the problem of finding a shortest path on a directed weighted graph from a certain node to any of a specified set of destination nodes (preferably the closest one, but that's not that important). The standard (I believe) way to do this with the A* algorithm is to use a distance-to-closest-goal heuristic (which is admissible) and exit as soon as any of the goal nodes is reached.
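For reference, the standard single-direction approach described above might look like the sketch below. It is an illustration under assumptions, not a reference implementation: nodes are assumed to have 2D positions, the heuristic is the straight-line distance to the nearest goal, and no edge is assumed to be cheaper than its straight-line length (otherwise the heuristic is not admissible). All names and the sample data are hypothetical.

```python
import heapq
import math


def a_star_any_goal(graph, pos, start, goals):
    """A* that stops at the first goal taken off the queue.

    graph: node -> list of (neighbour, cost); pos: node -> (x, y).
    Returns (cost, path), or None when no goal is reachable from start.
    """
    goals = set(goals)
    if not goals:
        return None

    def h(node):
        # Distance to the closest goal.
        return min(math.dist(pos[node], pos[g]) for g in goals)

    frontier = [(h(start), 0.0, start)]
    best_g = {start: 0.0}
    parent = {start: None}
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if g > best_g[node]:
            continue  # stale queue entry
        if node in goals:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return g, path[::-1]
        for nxt, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(nxt, math.inf):
                best_g[nxt] = new_g
                parent[nxt] = node
                heapq.heappush(frontier, (new_g + h(nxt), new_g, nxt))
    return None  # the whole reachable set was exhausted


# Hypothetical data: g3 exists but cannot be reached from s.
pos = {"s": (0, 0), "a": (1, 0), "b": (0, 2), "g1": (3, 0), "g2": (0, 3), "g3": (9, 9)}
graph = {"s": [("a", 1), ("b", 2)], "a": [("g1", 2)], "b": [("g2", 1.5)]}

nearest = a_star_any_goal(graph, pos, "s", ["g1", "g2", "g3"])
unreachable = a_star_any_goal(graph, pos, "s", ["g3"])
```

Note that the unreachable case only returns after the forward search has visited everything reachable from the start, which is exactly the cost the bidirectional idea below tries to avoid.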
However, in my scenario (which is game AI, if that matters) some (or all) of the goals might be unreachable; furthermore, the set of nodes reachable from such goals is typically quite small (or, at least, I want to optimize in that particular case). For the case of a single goal, bidirectional search sounds promising: the reverse search direction would quickly exhaust all reachable nodes and conclude that no path exists. These slides by Andrew Goldberg et al. describe the bidirectional A* algorithm with proper conditions on the heuristics, as well as stopping conditions.
My question is: is there a way to combine these two approaches, i.e. to perform bidirectional A* to find a path to any of a specified set of goal nodes? I'm not sure what heuristic function to choose for the reverse search direction, what the stopping conditions are, etc. Googling for anything on this topic didn't get me anywhere either.

Data structure for multidimensional coordinates (search,insert)?

Is there a data-structure designed specifically for fast insertion and search of multidimensional coordinates (many more than 2 or 3d, for all practical purposes say less than 1k dimensions and 1M points)? Even better, for arbitrary distance metrics?
I know about kd-trees, which are good for insertion, but as far as I know, balancing them is non-trivial, and search is not very efficient in higher dimensions. Unordered maps / hash tables would be a good solution at first glance, but as far as I know there are issues with hashing and collisions (e.g. converting to a string often truncates the numerical precision, and dealing with collisions of non-neighbouring points can be expensive). Maybe something like a red-black tree on each dimension would be good for insertion, and not too bad for search (recursively filtering along dimensions).
I just don't want to reinvent the wheel and I am sure this is a common need in data sciences these days. Happy to take links to papers / tutorials as an answer. Ideally the answer would have an existing implementation in C / C++ / Python / Java / Matlab.
The data structure you're looking for is an R-tree.
You can find a Java implementation here.

What is the difference between informed and uninformed searches?

What is the difference between informed and uninformed searches? Can you explain this with some examples?
Blind or Uninformed Search
It is a search without "information" about the goal node.
An example is breadth-first search (BFS). In BFS, the search proceeds one layer after the other. In other words, nodes in the same layer are visited before nodes in successive layers. This continues until the node being expanded is the goal node. In this case, no information about the goal node is used to visit, expand or generate nodes.
We can think of a blind or uninformed search as a brute-force search.
Heuristic or Informed Search
It is a search with "information" about the goal.
An example of this type of algorithm is A*. In this algorithm, nodes are visited and expanded using information about the goal node as well. This information is given by a heuristic function (a function that associates information about the goal node with each node of the state space). In the case of A*, the heuristic information associated with each node n is an estimate of the distance from n to the goal node.
We can think of an informed search as an approximately "guided" search.
An uninformed search is a brute-force or "blind" search. It uses no knowledge about the problem beyond its definition, and is hence possibly less efficient than an informed search.
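The difference can be made concrete by counting how many nodes each kind of search expands on the same problem. The sketch below is an illustration only: the open 10x10 grid, the function names and the choice of Manhattan distance as the heuristic are all assumptions made for the example.

```python
import heapq
from collections import deque


def neighbours(cell, size):
    """Yield the 4-connected neighbours of cell inside a size x size grid."""
    x, y = cell
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size:
            yield (nx, ny)


def bfs(start, goal, size):
    """Uninformed: expands cells layer by layer and never uses the goal's position."""
    queue, dist, expanded = deque([start]), {start: 0}, 0
    while queue:
        cell = queue.popleft()
        expanded += 1
        if cell == goal:
            return dist[cell], expanded
        for nxt in neighbours(cell, size):
            if nxt not in dist:
                dist[nxt] = dist[cell] + 1
                queue.append(nxt)
    return None, expanded


def a_star(start, goal, size):
    """Informed: orders cells by steps so far plus Manhattan distance to the goal."""
    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier, dist, done, expanded = [(h(start), 0, start)], {start: 0}, set(), 0
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell in done:
            continue
        done.add(cell)
        expanded += 1
        if cell == goal:
            return g, expanded
        for nxt in neighbours(cell, size):
            if g + 1 < dist.get(nxt, float("inf")):
                dist[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt))
    return None, expanded


bfs_steps, bfs_expanded = bfs((0, 0), (9, 0), 10)
astar_steps, astar_expanded = a_star((0, 0), (9, 0), 10)
```

Both searches return the same path length, but on this grid A* only expands the cells along the straight line to the goal, while BFS works through every closer layer first.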
Examples of uninformed search algorithms are breadth-first search, depth-first search, depth-limited search, uniform-cost search, depth-first iterative deepening search and bidirectional search.
An informed search (also called "heuristic search") uses prior knowledge about the problem ("domain knowledge"), and is hence possibly more efficient than an uninformed search.
Examples of informed search algorithms are best-first search and A*.
The differences between uninformed and informed search are given below:
An uninformed search technique has access only to the problem definition, whereas an informed search technique has access to both the problem definition and a heuristic function.
Uninformed search is generally less efficient, whereas informed search is generally more efficient, provided the heuristic is good.
Uninformed search is also known as blind search, whereas informed search is also known as heuristic search.
Uninformed search typically uses more computation, whereas informed search typically uses less.

What tried and true algorithms for suggesting related articles are out there?

Pretty common situation, I'd wager. You have a blog or news site and you have plenty of articles or blags or whatever you call them, and you want to, at the bottom of each, suggest others that seem to be related.
Let's assume very little metadata about each item. That is, no tags or categories. Treat each item as one big blob of text, including the title and author name.
How do you go about finding the possibly related documents?
I'm rather interested in the actual algorithm, not ready solutions, although I'd be ok with taking a look at something implemented in ruby or python, or relying on mysql or pgsql.
edit: the current answer is pretty good but I'd like to see more. Maybe some really bare example code for a thing or two.
This is a pretty big topic -- in addition to the answers people come up with here, I recommend tracking down the syllabi for a couple of information retrieval classes and checking out the textbooks and papers assigned for them. That said, here's a brief overview from my own grad-school days:
The simplest approach is called a bag of words. Each document is reduced to a sparse vector of {word: wordcount} pairs, and you can throw a NaiveBayes (or some other) classifier at the set of vectors that represents your set of documents, or compute similarity scores between each bag and every other bag (this is called k-nearest-neighbour classification). KNN is fast for lookup, but requires O(n^2) storage for the score matrix; however, for a blog, n isn't very large. For something the size of a large newspaper, KNN rapidly becomes impractical, so an on-the-fly classification algorithm is sometimes better. In that case, you might consider a ranking support vector machine. SVMs are neat because they don't constrain you to linear similarity measures, and are still quite fast.
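As a bare illustration of the bag-of-words idea, the sketch below turns each document into a {word: count} bag and ranks the other documents by cosine similarity. The function names and the four sample sentences are made up, and cosine is just one of several possible similarity scores.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Reduce a document to a sparse {word: count} bag."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bags (0.0 if either bag is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def most_similar(index, docs, k=2):
    """Return the indices of the k documents most similar to docs[index]."""
    bags = [bag_of_words(d) for d in docs]
    scores = [(cosine(bags[index], bag), i) for i, bag in enumerate(bags) if i != index]
    return [i for _, i in sorted(scores, reverse=True)[:k]]


# Hypothetical documents.
docs = [
    "The cat sat on the mat",
    "A cat and a dog sat on the mat",
    "Stock markets fell sharply on Monday",
    "The dog chased the cat",
]
suggestions = most_similar(0, docs)
```

In this toy data the repeated word "the" pulls the fourth sentence ahead of the second, which hints at why common words are usually down-weighted or removed before scoring.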
Stemming is a common preprocessing step for bag-of-words techniques; this involves reducing morphologically related words, such as "cat" and "cats", "Bob" and "Bob's", or "similar" and "similarly", down to their roots before computing the bag of words. There are a bunch of different stemming algorithms out there; the Wikipedia page has links to several implementations.
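To show what the preprocessing step does, here is a deliberately crude suffix stripper. It is not one of the published stemming algorithms (those have many more rules and exceptions); the function name and its three suffix rules are invented for this example.

```python
def crude_stem(word):
    """Strip a few common English suffixes; an illustrative toy, not a real stemmer."""
    word = word.lower()
    for suffix in ("'s", "ly", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [crude_stem(w) for w in ["cat", "cats", "Bob", "Bob's", "similar", "similarly"]]
```

After stemming, each pair of related words maps to the same root, so they land in the same slot of the bag of words.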
If bag-of-words similarity isn't good enough, you can abstract it up a layer to bag-of-N-grams similarity, where you create the vector that represents a document based on pairs or triples of words. (You can use 4-tuples or even larger tuples, but in practice this doesn't help much.) This has the disadvantage of producing much larger vectors, and classification will accordingly take more work, but the matches you get will be much closer syntactically. OTOH, you probably don't need this for semantic similarity; it's better for stuff like plagiarism detection. Chunking, or reducing a document down to lightweight parse trees, can also be used (there are classification algorithms for trees), but this is more useful for things like the authorship problem ("given a document of unknown origin, who wrote it?").
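A small sketch of the N-gram variant, under assumptions: documents are compared as sets of word pairs, and Jaccard overlap is used as the similarity score purely to keep the example short. The sample sentences are made up.

```python
import re


def word_ngrams(text, n=2):
    """Return the set of n-word tuples that occur in text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


original = "the quick brown fox jumps over the lazy dog"
copied = "a quick brown fox jumps over a sleeping dog"
shuffled = "dog lazy the over jumps fox brown quick the"

copied_score = jaccard(word_ngrams(original), word_ngrams(copied))
shuffled_score = jaccard(word_ngrams(original), word_ngrams(shuffled))
shuffled_unigram_score = jaccard(word_ngrams(original, 1), word_ngrams(shuffled, 1))
```

The shuffled sentence uses exactly the same words as the original, so it is a perfect match on single words but shares no word pairs at all, whereas the lightly edited copy keeps a third of them. That sensitivity to word order is what makes N-grams useful for plagiarism-style matching.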
Perhaps more useful for your use case is concept mining, which involves mapping words to concepts (using a thesaurus such as WordNet), then classifying documents based on similarity between concepts used. This often ends up being more efficient than word-based similarity classification, since the mapping from words to concepts is reductive, but the preprocessing step can be rather time-consuming.
Finally, there's discourse parsing, which involves parsing documents for their semantic structure; you can run similarity classifiers on discourse trees the same way you can on chunked documents.
These pretty much all involve generating metadata from unstructured text; doing direct comparisons between raw blocks of text is intractable, so people preprocess documents into metadata first.
You should read the book "Programming Collective Intelligence: Building Smart Web 2.0 Applications" (ISBN 0596529325)!
For some method and code: First ask yourself, whether you want to find direct similarities based on word matches, or whether you want to show similar articles that may not directly relate to the current one, but belong to the same cluster of articles.
See Cluster analysis / Partitional clustering.
A very simple (but theoretical and slow) method for finding direct similarities would be:
Store a flat word list per article (do not remove duplicate words).
"Cross join" the articles: count the number of words in article A that match the same words in article B. You now have a matrix int word_matches[narticles][narticles] (you should not store it like that: the similarity of A->B is the same as B->A, so storing only one triangle of the matrix saves almost half the space).
Normalize the word_matches counts to range 0..1! (find max count, then divide any count by this) - you should store floats there, not ints ;)
Find similar articles:
select the X articles with highest matches from word_matches
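The steps above can be sketched roughly as follows. This is one possible reading, not the only one: the match count is taken as the multiset overlap of the two word lists (so it is symmetric), only one score per pair is stored, and the function name and sample articles are made up.

```python
import re
from collections import Counter
from itertools import combinations


def related_by_word_matches(articles, top=2):
    """Return (normalised pair scores, {article index: related article indices})."""
    bags = [Counter(re.findall(r"[a-z]+", text.lower())) for text in articles]

    # One entry per unordered pair, since the score of A->B equals that of B->A.
    matches = {}
    for a, b in combinations(range(len(bags)), 2):
        # & on Counters keeps the smaller count of every shared word.
        matches[(a, b)] = sum((bags[a] & bags[b]).values())

    # Normalise to the range 0..1 by dividing by the largest count.
    peak = max(matches.values(), default=0) or 1
    scores = {pair: count / peak for pair, count in matches.items()}

    related = {}
    for i in range(len(bags)):
        ranked = sorted(
            ((scores[tuple(sorted((i, j)))], j) for j in range(len(bags)) if j != i),
            reverse=True,
        )
        related[i] = [j for score, j in ranked[:top] if score > 0]
    return scores, related


# Hypothetical articles.
articles = [
    "cookies and sessions in asp",
    "asp sessions expire too early",
    "baking cookies at home",
]
scores, related = related_by_word_matches(articles)
```

As the answer says, this is slow for large collections because every pair of articles is compared.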
This is a typical case of document classification, which is studied in every machine learning class. If you like statistics, mathematics and computer science, I recommend that you have a look at unsupervised methods like k-means++, Bayesian methods and LDA. In particular, Bayesian methods are pretty good at what you are looking for; their only problem is being slow (but unless you run a very large site, that shouldn't bother you much).
For a more practical and less theoretical approach, I recommend that you have a look at this and this other great code examples.
A small vector-space-model search engine in Ruby. The basic idea is that two documents are related if they contain the same words. So we count the occurrence of words in each document and then compute the cosine between these vectors (each term has a fixed index: if it appears there is a 1 at that index, if not a zero). Cosine will be 1.0 if two documents have all terms in common, and 0.0 if they have no common terms. You can directly translate that to % values.
# each new term is assigned the next free vector index
terms = Hash.new { |h, k| h[k] = h.size }
docs = DATA.collect { |line|
  name = line.match(/^\d+/)
  words = line.downcase.scan(/[a-z]+/)
  vector = []
  words.each { |word| vector[terms[word]] = 1 }
  { :name => name, :vector => vector }
}
current = docs.first # or any other
# sort_by returns a new array; negate the score so the most similar come first
ranked = docs.sort_by { |doc|
  # assume we have defined cosine on arrays
  -doc[:vector].cosine(current[:vector])
}
# ranked[0] is expected to be the current document itself
related = ranked[1..5].collect { |doc| doc[:name] }
puts related
__END__
0 Human machine interface for Lab ABC computer applications
1 A survey of user opinion of computer system response time
2 The EPS user interface management system
3 System and human system engineering testing of EPS
4 Relation of user-perceived response time to error measurement
5 The generation of random, binary, unordered trees
6 The intersection graph of paths in trees
7 Graph minors IV: Widths of trees and well-quasi-ordering
8 Graph minors: A survey
The definition of Array#cosine is left as an exercise to the reader (it should deal with nil values and different lengths, but for that we've got Array#zip, right?)
BTW, the example documents are taken from the SVD paper by Deerwester et al. :)
Some time ago I implemented something similar. Maybe this idea is now outdated, but I hope it can help.
I ran an ASP 3.0 website for common programming tasks and started from this principle: a user has a question and will stay on the website as long as he/she can find interesting content on that subject.
When a user arrived, I started an ASP 3.0 Session object and recorded all of the user's navigation, just like a linked list. At the Session.OnEnd event, I took the first link, looked up the next link and incremented a counter column like:
<Article Title="Cookie problem A">
  <NextPage Title="Cookie problem B" Count="5" />
  <NextPage Title="Cookie problem C" Count="2" />
</Article>
So, to check related articles I just had to list top n NextPage entities, ordered by counter column descending.
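The same bookkeeping can be sketched in a few lines, independent of ASP. This is an assumption-laden illustration: each session is taken to be an ordered list of article titles, the counters live in memory rather than in a database column, and the names and sample sessions are invented.

```python
from collections import Counter, defaultdict


def count_transitions(sessions):
    """For every article, count which article was visited immediately after it."""
    next_pages = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            if current != following:  # ignore page reloads
                next_pages[current][following] += 1
    return next_pages


def related_articles(next_pages, title, n=3):
    """Top n next pages for title, ordered by counter descending."""
    return [page for page, _ in next_pages.get(title, Counter()).most_common(n)]


# Hypothetical recorded sessions.
sessions = [
    ["Cookie problem A", "Cookie problem B", "Cookie problem C"],
    ["Cookie problem A", "Cookie problem B"],
    ["Cookie problem A", "Cookie problem C"],
]
next_pages = count_transitions(sessions)
suggested = related_articles(next_pages, "Cookie problem A")
```

Unlike the text-based approaches, this needs no analysis of the article contents, but it only works once enough visitors have been recorded.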

Finding related words (specifically physical objects) to a specific word

I am trying to find words (specifically physical objects) related to a single word. For example:
Tennis: tennis racket, tennis ball, tennis shoe
Snooker: snooker cue, snooker ball, chalk
Chess: chessboard, chess piece
Bookcase: book
I have tried to use WordNet, specifically the meronym semantic relationship; however, this method is not consistent as the results below show:
Tennis: serve, volley, foot-fault, set point, return, advantage
Snooker: nothing
Chess: chess move, checkerboard (whose own meronym relationships shows ‘square’ & 'diagonal')
Bookcase: shelve
Weighting of terms will eventually be required, but that is not really a concern now.
Anyone have any suggestions on how to do this?
Just an update: Ended up using a mixture of both Jeff's and StompChicken's answers.
The quality of information retrieved from Wikipedia is excellent, specifically how (unsurprisingly) there is so much relevant information (in comparison to some corpora where terms such as 'blog' and 'ipod' do not exist).
The range of results from Wikipedia is the best part. The software is able to match terms such as (lists cut for brevity):
golf: [ball, iron, tee, bag, club]
photography: [camera, film, photograph, art, image]
fishing: [fish, net, hook, trap, bait, lure, rod]
The biggest problem is classifying certain words as physical artefacts; default WordNet is not a reliable resource as many terms (such as 'ipod', and even 'trampolining') do not exist in it.
I think what you are asking for is a source of semantic relationships between concepts. For that, I can think of a number of ways to go:
Semantic similarity algorithms. These algorithms usually perform a tree walk over the relationships in WordNet to come up with a real-valued score of how related two terms are. These will be limited by how well WordNet models the concepts that you are interested in. WordNet::Similarity (written in Perl) is pretty good.
Try using OpenCyc as a knowledge base. OpenCyc is an open-source version of Cyc, a very large knowledge base of 'real-world' facts. It should have a much richer set of semantic relationships than WordNet does. However, I have never used OpenCyc so I can't speak to how complete it is, or how easy it is to use.
n-gram frequency analysis, as mentioned by Jeff Moser. This is a data-driven approach that can 'discover' relationships from large amounts of data, but can often produce noisy results.
Latent Semantic Analysis. A data-driven approach similar to n-gram frequency analysis that finds sets of semantically related words.
Judging by what you say you want to do, I think the last two options are more likely to be successful. If the relationships are not in WordNet then semantic similarity won't work, and OpenCyc doesn't seem to know much about snooker other than the fact that it exists.
I think a combination of both n-grams and LSA (or something like it) would be a good idea. N-gram frequencies will find concepts tightly bound to your target concept (e.g. tennis ball) and LSA would find related concepts mentioned in the same sentence/document (e.g. net, serve). Also, if you are only interested in nouns, filtering your output to contain only nouns or noun phrases (by using a part-of-speech tagger) might improve results.
In the first case, you probably are looking for n-grams where n = 2. You can get them from places like Google or create your own from all of Wikipedia.
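A minimal sketch of the n = 2 case, assuming you build the counts yourself from raw text: it collects every word that directly follows the target word and ranks those words by frequency. The function name and the four-sentence corpus are made up; a real run would use something the size of Wikipedia.

```python
import re
from collections import Counter


def following_words(corpus, target, top=3):
    """Return the most frequent words that directly follow target, with counts."""
    counts = Counter()
    for sentence in corpus:
        words = re.findall(r"[a-z]+", sentence.lower())
        for first, second in zip(words, words[1:]):
            if first == target:
                counts[second] += 1
    return counts.most_common(top)


# Hypothetical corpus.
corpus = [
    "She bought a new tennis racket and a tennis ball.",
    "The tennis ball bounced over the net.",
    "He forgot his tennis racket at the tennis court.",
    "A tennis ball is covered in felt.",
]
tennis_terms = following_words(corpus, "tennis")
```

On real data the raw counts will include plenty of non-objects, so some filtering of the results would still be needed.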
For more information, check out this related Stack Overflow question.
