What is the difference between full text search and keyword search?

What is the difference between full text search and keyword search?
The definitions I've found are fine, but can anybody give a brief idea of what the difference is and which one is better?

For information retrieval of the kind you mention here, keyword search would be the better fit, and that holds fairly generally too. Why?
Every person has their own way of phrasing a subject, so a full text query written in your own words will either return very specific hits or no hits at all, simply because you wrote it in your "own" language. Selecting your keywords carefully and keeping them short and consistent will not only give you better results, it also helps you keep notes for future reference and learn the subject more deeply, because the actual terms stick in your head.

Related

Is there a way to get all complete sentences that a search engine (e.g. Google) has indexed that contain two search terms?

as the question says: "Is there a way to get all complete sentences that a search engine (e.g. Google) has indexed that contain two search terms?"
I would like to use the (e.g. Google) search syntax: BMW AND Toyota. (<-- this is just an example)
I would then like to get back all sentences that mention both BMW and Toyota. They must occur within a single (ideally short) sentence, though.
Is that possible?
Many thanks!
PS.: Sorry - I have difficulties finding the right tags for my question... Please feel free to suggest more appropriate ones and I will update the question.
PPS.: Let me rephrase my question: If it is not readily possible with an existing search engine, are there any programmatical ways to do that? Would one have to write a crawler for that purpose?
No, this may not be possible, as Google stores this information keyed on keywords and other algorithms, not as full sentences.
For any given keyword or set of keywords, Google must maintain references to one or many matching (some accurate, some not so accurate) titles.
I do not work for Google, but that could be one way they maintain their search results.
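If you do go the programmatic route from your PPS, the core of it is just fetching pages and filtering their sentences; the crawling, politeness, and proper HTML parsing are the hard parts left out here. A minimal Java sketch, with a placeholder URL and the BMW/Toyota example terms hard-coded:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.text.BreakIterator;
import java.util.Locale;

public class SentenceFilter {
    public static void main(String[] args) throws Exception {
        // Placeholder page; a real crawler would iterate over many URLs.
        String url = "https://example.com/some-article.html";
        String body = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(URI.create(url)).build(),
                      HttpResponse.BodyHandlers.ofString())
                .body();
        // Crude tag stripping; a real implementation would use an HTML parser.
        String text = body.replaceAll("<[^>]*>", " ");
        // Split into sentences and keep those mentioning both terms.
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end);
            String lower = sentence.toLowerCase(Locale.ENGLISH);
            if (lower.contains("bmw") && lower.contains("toyota")) {
                System.out.println(sentence.trim());
            }
        }
    }
}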

Algorithm for immediate string search in a number of text files

I'm writing a help system and I would like to implement the search in a modern way, like it now works in Google search: as soon as the user starts typing query letters, they immediately get results, and each additional letter makes the result set more specific.
What are the algorithms for indexing text and searching it this way?
Wikipedia and/or other links will be fine; no need to waste your time on a detailed explanation.
As a practical approach, you could use a library like Lucene, which will handle most of the work of full text indexing for you.
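If you would rather roll the type-ahead part yourself, the classic structure behind this kind of incremental search is a prefix trie, where each additional letter narrows the candidate set. A minimal sketch, purely for illustration and not tied to any library:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PrefixTrie {
    private final Map<Character, PrefixTrie> children = new HashMap<>();
    private boolean isWord;

    public void insert(String word) {
        PrefixTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new PrefixTrie());
        }
        node.isWord = true;
    }

    // Collect all indexed words starting with the given prefix.
    public List<String> complete(String prefix) {
        PrefixTrie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return List.of(); // no matches for this prefix
        }
        List<String> out = new ArrayList<>();
        node.collect(prefix, out);
        return out;
    }

    private void collect(String path, List<String> out) {
        if (isWord) out.add(path);
        children.forEach((c, child) -> child.collect(path + c, out));
    }

    public static void main(String[] args) {
        PrefixTrie t = new PrefixTrie();
        t.insert("search"); t.insert("searching"); t.insert("seat");
        System.out.println(t.complete("sea")); // search, searching, seat (in some order)
    }
}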

Why doesn't Google offer partial search? Is it because the index would be too large?

Google/GMail/etc. don't offer partial or prefix search (e.g. stuff*), though it could be very useful. Often I can't find a mail in GMail because I don't remember the exact wording.
I know there is stemming and such, but it's not the same, especially if we talk about languages other than English.
Why doesn't Google add such a feature? Is it because the index would explode? But databases offer partial search, so surely there are good algorithms to tackle this problem.
What is the problem here?
Google doesn't actually store the text that it searches. It stores search terms, links to the page, and where in the page each term occurs. That data structure is indexed in the traditional database sense. I'd bet that supporting wildcards would make the index of the index pretty slow and, as Developer Art says, not very useful.
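For a feel of that structure, here is a toy inverted index sketch, a drastic simplification of whatever Google actually stores. An exact-term lookup is a single hash probe, while a wildcard like stuff* would force a scan over every key, which is part of why this is expensive:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndex {
    // term -> (docId -> positions where the term occurs in that doc)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    public void add(int docId, String text) {
        String[] tokens = text.toLowerCase().split("\\W+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Exact-term lookup: one hash probe. There is no cheap way to ask for
    // "all terms starting with stuff" without scanning the whole key set.
    public Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Map.of());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "partial search in Gmail");
        idx.add(2, "full text search");
        System.out.println(idx.lookup("search")); // {1=[1], 2=[2]}
    }
}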
Google does match partial words in web search; Gmail does not, though. Since you ask what the problem is here, my answer is lack of effort. This problem has a solution that supports substring search in time proportional to the query length and in linear space, though it is not very cache friendly: suffix trees. Suffix arrays are another option that is more cache-friendly and still time efficient.
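To make the suffix-array option concrete, here is a naive sketch: sort all suffix start positions once, then answer substring queries by binary search. (Real implementations use O(n log n) or linear-time construction; this one is quadratic and only for illustration.)

import java.util.Arrays;
import java.util.Comparator;

public class SuffixArrayDemo {
    private final String text;
    private final Integer[] sa; // suffix start positions, sorted lexicographically

    public SuffixArrayDemo(String text) {
        this.text = text;
        sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        Arrays.sort(sa, Comparator.comparing(i -> text.substring(i)));
    }

    // Returns true if the pattern occurs anywhere in the text.
    public boolean contains(String pattern) {
        int lo = 0, hi = sa.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            String suffix = text.substring(sa[mid]);
            if (suffix.startsWith(pattern)) return true;
            if (suffix.compareTo(pattern) < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }

    public static void main(String[] args) {
        SuffixArrayDemo d = new SuffixArrayDemo("partial search in gmail");
        System.out.println(d.contains("art")); // true: substring of "partial"
        System.out.println(d.contains("xyz")); // false
    }
}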
It is possible via Google Docs; see this article:
http://www.labnol.org/internet/advanced-gmail-search/21623/
Google Code Search can search based on regular expressions, so they do know how to do it. Of course, the amount of data Code Search has to index is tiny compared to the web search. Using regex or wildcard search in the web search would increase index size and decrease performance to impractical levels.
The secret to finding anything in Google is to enter a combination of search terms (or quoted phrases) that are very likely to be in the content you are looking for, but unlikely to appear together in unrelated content. A wildcard expression does the opposite of this. Just enter the terms you expect the wildcard to match, keeping in mind that Google will do stemming for you. Back in the days when computers ran on steam, Lycos (iirc) had pattern matching, but they turned it off several years ago. I presume it was putting too much load on their servers.
Because you can't sensibly derive what is meant with car*:
Cars?
Carpets?
Carrots?
Google's algorithms compare document text, as well as external inbound links, to determine what a document is about. With wildcards, all of these signals turn to junk.

"Did you mean?" feature in Lucene.net

Can someone please let me know how do I implement "Did you mean" feature in Lucene.net?
Thanks!
You should look into the SpellChecker module in the contrib dir. It's a port of Java Lucene's SpellChecker module, so its documentation should be helpful.
(From the javadocs:)
Example Usage:
import java.io.File;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;

// Create the spell checker over a dedicated spell-index directory:
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
// To index a field of a user index:
spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
// To index a file containing words:
spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
// Ask for the five terms most similar to the (misspelled) input:
String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
AFAIK Lucene supports fuzzy search, meaning that a query like:
field:stirng~0.5
(that is a tilde sign)
will match "string". The float is how "tolerant" the search is, where 1.0 is an exact match and 0.0 matches almost everything.
Different parsers will, however, implement this differently.
A fuzzy search is much slower than a wildcard search (stri*), so use it with caution. In your case, if a regular search finds no matches, you could run a fuzzy search to see what turns up, and base the "did you mean" suggestion on its results somehow.
It might be useful to cache these lookups for very common misspellings, for performance reasons.
Google's "Did you mean?" is (probably; they're secretive, of course) implemented by consulting their query log. Look to see if people who searched for the query you're processing searched for something very similar soon after; if so, it indicates they made a mistake, and realized what they ought to be searching for.
Since you probably don't have a huge query log, you could approximate it. Take the query, split up the terms, and see if there are any similar terms in your index (by edit distance, for example); replace your terms with those nearby terms and rerun the query. If you get more hits, it was probably a better query, so suggest it to the user. (And since you've already got the hits, and most people only look at the top two results, show them those.)
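That fallback is straightforward to sketch in plain Java. Below, indexTerms is a stand-in for however you enumerate the terms in your index, and the rerun-and-compare-hit-counts step is left out for brevity:

import java.util.Collection;
import java.util.List;

public class DidYouMean {
    // Classic dynamic-programming Levenshtein edit distance.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // Replace each query term with the closest indexed term within two edits.
    static String suggest(String query, Collection<String> indexTerms) {
        StringBuilder out = new StringBuilder();
        for (String term : query.toLowerCase().split("\\s+")) {
            String best = term;
            if (!indexTerms.contains(term)) {  // keep terms that already match
                int bestDist = 3;              // accept at most two edits
                for (String candidate : indexTerms) {
                    int dist = editDistance(term, candidate);
                    if (dist < bestDist) { bestDist = dist; best = candidate; }
                }
            }
            out.append(best).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        List<String> terms = List.of("lucene", "spell", "checker");
        System.out.println(suggest("lucen speel", terms)); // prints "lucene spell"
    }
}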
Take a look at the Google Code project called semanticvectors.
There's a decent amount of discussion on the Lucene mailing lists about building the kind of functionality you're after with it; note, however, that it is written in Java.
You will probably have to parse your search logs and apply some machine learning algorithms to them to build a feature like this!

Ways to do "related searches" functionality

I've seen a few sites that list related searches when you perform a search, namely they suggest other search queries you may be interested in.
I'm wondering about the best way to model this on a medium-sized site (not enough traffic to rely on visitor stats to infer relationships). My initial thought is to store the top 10 results for each unique query, then, when a new search is performed, find all the historical searches that share some of those top 10 results, but ideally not all of them (sharing all of them probably indicates an equivalent search, and hence not a useful suggestion).
I imagine some people have built this functionality before and may be able to suggest different approaches. I'm not necessarily looking for one winning idea, since the solution will no doubt vary substantially depending on the size and nature of the site.
Have you considered a matrix with keywords on one axis and documents on the other? Once you find the set of vectors representing the keywords, find the set of keywords present in your initial result set, and then rank the other keywords by how many documents they reference or how many times they intersect the initial result set.
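That matrix idea can be prototyped directly: treat each keyword as the set of documents containing it (one row of the matrix), then rank other keywords by how much they overlap the initial result set. A rough sketch:

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RelatedKeywords {
    // keyword -> set of document ids containing it (one "row" of the matrix)
    private final Map<String, Set<Integer>> docsByKeyword = new HashMap<>();

    public void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            docsByKeyword.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
        }
    }

    // Rank other keywords by how many of the query's result documents they appear in.
    public List<String> related(String queryTerm, int limit) {
        Set<Integer> resultDocs = docsByKeyword.getOrDefault(queryTerm, Set.of());
        Map<String, Integer> overlap = new HashMap<>();
        docsByKeyword.forEach((kw, docs) -> {
            if (kw.equals(queryTerm)) return;
            int count = 0;
            for (int d : docs) if (resultDocs.contains(d)) count++;
            if (count > 0) overlap.put(kw, count);
        });
        return overlap.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(limit)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        RelatedKeywords rk = new RelatedKeywords();
        rk.add(1, "hybrid car engine");
        rk.add(2, "electric car battery");
        rk.add(3, "battery charger");
        System.out.println(rk.related("car", 3)); // keywords co-occurring with "car"
    }
}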
I've tried a number of different approaches to this, with various degrees of success. In the end, I think the best approach is highly dependent on the domain/topics being searched, and how the users form queries.
Your thought about storing previous searches seems reasonable to me. I'd be curious to see how it works in practice (I mean that in the most sincere way -- there are many nuances that can cause these techniques to fail in the "real world", particularly when data is sparse).
Here are some techniques I've used in the past, and seen in the literature:
Thesaurus based approaches: Index into a thesaurus for each term that the user has used, and then use some heuristic to filter the synonyms to show the user as possible search terms.
Stem and search on that: Stem the search terms (e.g. with the Porter stemming algorithm) and then use the stemmed terms instead of the initially provided queries, giving the user the option of searching for exactly the terms they specified. (Or do the opposite: search the exact terms first, and use stemming to find other terms that stem to the same root; a toy sketch follows this list. This second approach takes some pre-processing of a known dictionary, or you can collect terms as your indexer finds them.)
Chaining: Parse the results found by the user's query and extract key terms from the top N results (KEA is one library/algorithm that you can look at for keyword extraction techniques.)
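To give a feel for the stem-and-search option above, here is a toy suffix-stripping stemmer. It is nowhere near the real Porter algorithm, which applies several carefully conditioned rule phases, but it shows how "searches", "searching", and "searched" can collapse to a single index term:

import java.util.List;

public class ToyStemmer {
    // Crude ordered suffix rules; the real Porter algorithm is far more careful.
    private static final List<String> SUFFIXES = List.of("ing", "ed", "es", "s");

    public static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : List.of("searches", "searching", "searched", "search")) {
            System.out.println(w + " -> " + stem(w)); // every word stems to "search"
        }
    }
}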

Resources