Using an "AND" query with Lucene - search

I'm very new to Lucene.net (2.9.4), and I'm attempting to search using a MultiFieldQueryParser. I'm not getting the results back I expect. I've searched for answers to no avail...wonder if someone can assist...
Take the following records (strings) of items that have been indexed:
1. Medical Advisory Board Bios
2. Medical Advisory Board
3. A presentation - Speaker Bios
When I search for:
advisory, I'd expect to get 1 & 2 back, which I do.
When I search for advisory AND bios, I'd expect to get just 1 back, but it seems to be treating the AND as an OR, and I get all three results back...
What am I missing about the AND? The docs seem to say this works straightforwardly out of the box. Thanks for the help...

After spending hours on this yesterday, I just tried it again, and it worked. It seems that boolean operators must be written in capital letters. I was quite positive I had been testing with capitals, but I must not have been. Hope this helps someone else...
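Since the classic Lucene query syntax only treats AND, OR and NOT as operators when they are written in capitals, one possible guard is to normalize user input before it reaches the parser. The sketch below is in Python purely for illustration: `normalize_operators` is a hypothetical helper, not part of Lucene.net, and it assumes users never want the literal words "and", "or" or "not" as search terms outside of quoted phrases.

```python
import re

# Hypothetical helper (not a Lucene API): upper-case bare boolean operators
# so a Lucene-style query parser sees them as operators, not as plain terms.
_OPERATOR = re.compile(r"\b(and|or|not)\b", re.IGNORECASE)

def normalize_operators(query: str) -> str:
    # Split on double-quoted phrases so text inside quotes is left untouched.
    parts = re.split(r'("[^"]*")', query)
    return "".join(
        part if part.startswith('"')
        else _OPERATOR.sub(lambda m: m.group(1).upper(), part)
        for part in parts
    )

print(normalize_operators("advisory and bios"))
```

Whether silently rewriting the query is acceptable depends on the application; showing the rewritten query to the user is a gentler alternative.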

Related

Advice on how to search and return strings

Forgive me, this will be my first ever post to SO, so do let me know how I can improve.
I am currently looking for advice on a problem I am facing. I have a list of one billion unique strings of text. These text strings also have a list of tags associated with them to indicate the content of the string.
Example:
StringText: The cat ate on Sunday
AnimalCode: c001
ActionCode: a001
TimeCode: d001
where
c001 = The cat
a001= ate
d001 = on Sunday
I have loaded all of the strings and their codes as individual documents in an instance of MongoDB
At present, I am trying to devise a method by which I can enter a string and search against the database to return the match. My problem is that the search is taking far too long to return results.
I have created an index on the StringText field, but am guessing that it is too large to hold in memory.
Each string has an equal probability of being searched for so I can't reliably predict which strings have a higher probability of being searched for and pull them out into another collection.
Currently, I am running the DB off a single box with 16GB of RAM and a 4TB HDD.
Does anybody have any advice on how I might accomplish my task more efficiently? Is Mongo the right technology or are there others more adept at doing this kind of search and return?
My goal (forgive me if foolish) would be to try and return a result within 2 seconds or less.
I am very new to this whole arena so any and all advice would be welcome.
Thanks much to all in advance for the help and time.
Sincerely,
Zinga
As discussed in the comments, you could preprocess the input string to find the associated Animal and Action codes and search for StringText based on the indexed codes, which is much faster than text search.
You can't totally avoid text search, so reduce it to the Animal and/or Action collection by tokenizing the input string. See how you can use map/reduce techniques just for queries of this sort.
In your case, if you know that the first word or two will always contain the name of the animal, just use those one or two words to search for the relevant animal. Searching through the Animal/Actions collection shouldn't take long. In case it does, you can keep a periodically updating list of most common animals/actions (based on their frequency) and search against that to make it faster. This is also discussed in the articles on the linked page.
If even after that your search against StringText is slow, you could shard the StringText collection by Animal/Action codes. The official documentation should suffice for this, and there's not much involved in the setup, so you might try this anyway. The basic idea everywhere is to restrict your target space as much as possible. Searching through a billion records for every query is plain overkill. Cache where you can, preprocess where you can, show guesses while you run a slow query.
Good luck!
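To make the "restrict your target space" idea concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: the lookup tables, the helper names, and the dictionary standing in for a MongoDB index on (AnimalCode, ActionCode). It only shows the shape of the approach: derive the codes from the input first, then compare text inside the matching bucket only.

```python
from collections import defaultdict

# Hypothetical lookup tables mapping phrases to codes, modelled on the question.
ANIMALS = {"the cat": "c001", "the dog": "c002"}
ACTIONS = {"ate": "a001", "slept": "a002"}

def find_code(text, table):
    """Return the code of the first known phrase found in text, or None."""
    padded = f" {text.lower()} "          # pad so phrases match on word boundaries
    for phrase, code in table.items():
        if f" {phrase} " in padded:
            return code
    return None

# In-memory stand-in for a database index on (AnimalCode, ActionCode).
index = defaultdict(list)

def add(string_text):
    key = (find_code(string_text, ANIMALS), find_code(string_text, ACTIONS))
    index[key].append(string_text)

def search(query):
    """Narrow to the bucket sharing the query's codes, then compare text there only."""
    key = (find_code(query, ANIMALS), find_code(query, ACTIONS))
    return [s for s in index.get(key, []) if s == query]

for s in ["The cat ate on Sunday", "The cat slept on Monday", "The dog ate on Sunday"]:
    add(s)

print(search("The cat ate on Sunday"))
```

In a real deployment the bucket lookup would be a query on the indexed code fields, and the final text comparison would run only over the documents that query returns.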

Why does an exact match on a name return a useless set of venues?

This doesn't make much sense to me, and I'm hoping someone can shed some light on what's going on here and how I work around it.
If I query like this:
https://api.foursquare.com/v2/venues/search?ll=37.77%2C-122.41&radius=15000&intent=browse&oauth_token=xxx&limit=20&query=pi%20ba
I get a list of about 15 items, including the item I'm searching for (pi bar). However, if I search for the exact match name:
https://api.foursquare.com/v2/venues/search?ll=37.77%2C-122.41&radius=15000&intent=browse&oauth_token=xxx&limit=20&query=pi%20bar
I just get back the blanket list of venues within this area (mostly BART stops, etc.)
Is it expected that I should have to shave the last character off of user entered queries to get results back, or is this just a messed up venue name that I've been debugging with?
I'm not sure if this may help, but I've discovered placing an "and" between words in your query can produce more accurate results:
Searching for Chili's Bar & Grill
The first query has extraneous results:
https://api.foursquare.com/v2/venues/search?ll=34.07527923583984,-84.29469299316406&radius=5000&query=chili's bar grill&oauth_token=xxx&v=20111205
The second is much more accurate (although I've removed the ampersand: &)
https://api.foursquare.com/v2/venues/search?ll=34.07527923583984,-84.29469299316406&radius=5000&query=chili's and bar and grill&oauth_token=xxx&v=20111205
There's a known issue with quality of bigram matches in foursquare venue searches -- your query term includes a very popular word ("bar") which skews the results. The search team is working on quality improvements for these sorts of queries.
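A side note on the query URLs above: they contain raw spaces and an apostrophe, and the ampersand had to be dropped because it would otherwise be read as a parameter separator. The sketch below shows one way to build such a URL with every value percent-encoded. The parameter names are copied from the URLs in this thread; `venue_search_url` is a hypothetical helper, and whether sending the literal "&" improves the results is untested.

```python
from urllib.parse import urlencode, quote

def venue_search_url(query, ll, radius, token="xxx", version="20111205"):
    # Parameter names are taken from the example URLs in this thread.
    params = {"ll": ll, "radius": radius, "query": query,
              "oauth_token": token, "v": version}
    # quote_via=quote encodes spaces as %20 (as in the question) rather than "+".
    return ("https://api.foursquare.com/v2/venues/search?"
            + urlencode(params, quote_via=quote))

url = venue_search_url("chili's bar & grill", "34.07,-84.29", 5000)
print(url)
```

This only addresses transport of the query string; it does nothing about the ranking issue described in the answer above.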

Using GMail as an interface to my database

What if I choose to use GMail's awesome mail archive search capabilities on my database? What if, for every transaction that my database is responsible, I emailed details of that transaction to a GMail address that exists for the sole purpose of searching and retrieving transactions.
Anyone logged into that account could search according to labels, invoice numbers, customer names - whatever using Google's search engine. The results are presented as 'email messages'.
Imagine a user working from the standard (web-based) GMail account searches for an invoice number via GMail's search box - he's returned all instances where the db did anything that included that unique number. Opening any of these 'email messages' would show the static text included at the time of the transactions (historical and tracking gold) but could also carry a Gadget that could transform the 'message' into an editor so as to execute a new transaction on that invoice.
Imagine further that I wasn't the first one to think of this - cuz surely i'm not - and even if i were, i'm not smart enough to execute the idea alone.
Are you aware of efforts similar to this?
thx
[?belongs on superuser instead?]
An interesting idea, however given your search parameters it might be unreliable. Although gmail's search is great, I have found issues when searching for partial terms. Case in point, I had an email whose subject line was "stuffas". When I searched for "stuffa" I got no results, when I searched for "stuffas" I got the email in the search result. Additionally, I had an email with an 8 digit number inside the body. When I searched for 7 digits out of 8, I got no results, but when I put all 8 digits, the email appeared in the results. So, search in gmail may not be as powerful of a solution as you think. Again this is my experience, I'd love to hear if someone is able to partial search numbers in gmail.
I just had the same idea; 4 years after you. It still doesn't look like this has 'been done before' in any production sense. But now in 2014, I really don't see why not. Python packages for interfacing with gmail are already there and dead-simple to use. It does not take a whole lot of abstraction to turn this into a generalized key-value storage.
It's probably not exactly the fastest database, and not the best solution for everything; but as an easy-to-use, easy-to-search, trivial-to-configure, 100% uptime, cloud-stored and backed-up, free-as-in-beer database, it's pretty epic as far as I can see.
Has anyone else seen examples of this having been done before?
Edit: having thought about it some more, there are several answers as to why this is a bad idea:
gmail does not permit random access from different locations; it will block your account. Quite a showstopper.
amazon simpleDB also gives you a simple key-value store with the same characteristics (plus good python support), and isn't THAT big of a pain to set up if you are willing to spend a day wrapping your head around it. It is also effectively free for the kind of traffic that you'd be able to cram into a gmail account.

"Did you mean?" feature in Lucene.net

Can someone please let me know how to implement a "Did you mean" feature in Lucene.net?
Thanks!
You should look into the SpellChecker module in the contrib dir. It's a port of Java lucene's SpellChecker module, so its documentation should be helpful.
(From the javadocs:)
Example Usage:
import org.apache.lucene.search.spell.SpellChecker;
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
// To index a field of a user index:
spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
// To index a file containing words:
spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
AFAIK Lucene supports fuzzy search, meaning that if you use something like:
field:stirng~0.5
(it's a tilde sign)
it will match "string". The float is how "tolerant" the search is, where 1.0 is an exact match and 0.0 matches everything (sort of).
Different parsers will, however, implement this differently.
A fuzzy search is much slower than a prefix search (stri*), so use it with caution. In your case, one would assume that if you find no matches on a regular search, you try a fuzzy search to see what you find, and present "did you mean" based on the result somehow.
It might be useful to cache this sort of lookup for very common misspellings, for performance reasons.
Google's "Did you mean?" is (probably; they're secretive, of course) implemented by consulting their query log. Look to see if people who searched for the query you're processing searched for something very similar soon after; if so, it indicates they made a mistake, and realized what they ought to be searching for.
Since you probably don't have a huge query log, you could approximate it. Take the query, split up the terms, see if there are any similar terms in the database (by edit distance, whatever); replace your terms with those nearby terms, and rerun the query. If you get more hits, that was probably a better query. Suggest it to the user. (And since you've already got the hits, and most people only look at the top 2 results, show them those.)
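The approximation described above (replace unknown terms with nearby known terms, rerun, and suggest the rewrite if it gets more hits) can be sketched in a few lines. This is a toy under stated assumptions: `DOCS` and `run_query` stand in for a real index and search call, and term similarity uses the standard library's `difflib.get_close_matches`, which ranks by a sequence-matching ratio rather than by true edit distance.

```python
import difflib

# Toy "index": document id -> text. A stand-in for a real search backend.
DOCS = {
    1: "medical advisory board bios",
    2: "medical advisory board",
    3: "a presentation speaker bios",
}
VOCAB = sorted({word for text in DOCS.values() for word in text.split()})

def run_query(terms):
    """Return ids of documents that contain every term."""
    return [i for i, text in DOCS.items()
            if all(t in text.split() for t in terms)]

def did_you_mean(query):
    terms = query.lower().split()
    # Replace each unknown term with its closest vocabulary word, if any.
    fixed = [t if t in VOCAB
             else (difflib.get_close_matches(t, VOCAB, n=1) or [t])[0]
             for t in terms]
    # Only suggest the rewrite when it actually finds more documents.
    if fixed != terms and len(run_query(fixed)) > len(run_query(terms)):
        return " ".join(fixed)
    return None

print(did_you_mean("advisry bios"))
```

Because the rewritten query has already been executed, its top hits can be shown alongside the suggestion at no extra cost.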
Take a look at google code project called semanticvectors.
There's a decent amount of discussion on the Lucene mailing lists for doing functionality like what you're after using it - however it is written in java.
You will probably have to parse and use some machine learning algorithms on your search logs to build a feature like this!

How do you implement a "Did you mean"? [duplicate]

Possible Duplicate:
How does the Google “Did you mean?” Algorithm work?
Suppose you have a search system already in your website. How can you implement the "Did you mean:<spell_checked_word>" like Google does in some search queries?
Actually, what Google does is very much non-trivial and at first counter-intuitive. They don't do anything like checking against a dictionary; rather, they make use of statistics to identify "similar" queries that returned more results than your query. The exact algorithm is, of course, not known.
There are different sub-problems to solve here. As a fundamental basis for everything related to statistics in Natural Language Processing, there is one must-have book: Foundations of Statistical Natural Language Processing.
Concretely to solve the problem of word/query similarity I have had good results with using Edit Distance, a mathematical measure of string similarity that works surprisingly well. I used to use Levenshtein but the others may be worth looking into.
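For reference, Levenshtein distance is short to implement. The version below is a standard dynamic-programming sketch that keeps only two rows of the table; the function name is arbitrary, and it counts insertions, deletions and substitutions only (a transposition such as "haert" vs "heart" therefore costs 2).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a                       # keep the inner row as short as possible
    previous = list(range(len(b) + 1))    # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                     # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
```

Comparing a query against every known word this way is fine for small vocabularies but becomes expensive for large ones.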
Soundex - in my experience - is crap.
Actually, efficiently storing and searching a large dictionary of misspelled words with sub-second retrieval is again non-trivial. Your best bet is to make use of existing full-text indexing and retrieval engines (i.e. not your database's one), of which Lucene is currently one of the best and, coincidentally, ported to many, many platforms.
Google's Dr Norvig has outlined how it works; he even gives a 20ish line Python implementation:
http://googlesystem.blogspot.com/2007/04/simplified-version-of-googles-spell.html
http://www.norvig.com/spell-correct.html
Dr Norvig also discusses the "did you mean" in this excellent talk. Dr Norvig is head of research at Google - when asked how "did you mean" is implemented, his answer is authoritative.
So it's spell-checking, presumably with a dynamic dictionary built from other searches or even actual internet phrases and such. But that's still spell checking.
SOUNDEX and other guesses don't get a look in, people!
Check this article on wikipedia about the Levenshtein distance. Make sure you take a good look at Possible improvements.
I was pleasantly surprised that someone has asked how to create a state-of-the-art spelling suggestion system for search engines. I have been working on this subject for more than a year for a search engine company and I can point to information in the public domain on the subject.
As was mentioned in a previous post, Google (and Microsoft and Yahoo!) do not use any predefined dictionary nor do they employ hordes of linguists that ponder over the possible misspellings of queries. That would be impossible due to the scale of the problem but also because it is not clear that people could actually correctly identify when and if a query is misspelled.
Instead there is a simple and rather effective principle that is also valid for all European languages. Get all the unique queries from your search logs and calculate the edit distance between all pairs of queries; among queries that are close to each other, assume that the reference (correct) query is the one that has the highest count.
This simple algorithm will work great for many types of queries. If you want to take it to the next level then I suggest you read the paper by Microsoft Research on that subject. You can find it here
The paper has a great introduction but after that you will need to be knowledgeable with concepts such as the Hidden Markov Model.
I would suggest looking at SOUNDEX to find similar words in your database.
You can also access Google's own dictionary by using the Google API spelling suggestion request.
You may want to look at Peter Norvig's "How to Write a Spelling Corrector" article.
I believe Google logs all queries and identifies when someone makes a spelling correction. This correction may then be suggested when others supply the same first query. This will work for any language, in fact any string of any characters.
http://en.wikipedia.org/wiki/N-gram#Google_use_of_N-gram
I think this depends on how big your website is. On our local Intranet, which is used by about 500 members of staff, I simply look at the search phrases that returned zero results and enter each such search phrase, together with the new suggested search phrase, into a SQL table.
I then call on that table if no search results have been returned. However, this only works if the site is relatively small, and I only do it for the most common search phrases.
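A minimal sketch of such a lookup table, using an in-memory SQLite database so it runs anywhere. The table name, column names and sample row are made up for illustration; the original answer does not say how its table is laid out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one manually curated suggestion per zero-result phrase.
conn.execute(
    "CREATE TABLE suggestions (phrase TEXT PRIMARY KEY, suggestion TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO suggestions VALUES (?, ?)", ("holliday form", "holiday form")
)

def suggest(phrase):
    """Return the curated suggestion for a phrase, or None if there is none."""
    row = conn.execute(
        "SELECT suggestion FROM suggestions WHERE phrase = ?",
        (phrase.strip().lower(),),
    ).fetchone()
    return row[0] if row else None

# Only consulted when the normal search came back empty.
print(suggest("Holliday Form"))
```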
You might also want to look at my answer to a similar question:
"Similar Posts" like functionality using MS SQL Server?
If you have industry-specific translations, you will likely need a thesaurus. For example, I worked in the jewelry industry and there were abbreviations in our descriptions such as kt - karat, rd - round, cwt - carat weight... Endeca (the search engine at that job) has a thesaurus that will translate from common misspellings, but it does require manual intervention.
I do it with Lucene's Spell Checker.
Soundex is good for phonetic matches, but works best with peoples' names (it was originally developed for census data)
Also check out Full-Text-Indexing, the syntax is different from Google logic, but it's very quick and can deal with similar language elements.
Soundex and "Porter stemming" (soundex is trivial, not sure about porter stemming).
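For anyone who wants to try it, here is a sketch of American Soundex. It follows the commonly documented rules (keep the first letter, code the remaining consonants as digits, collapse repeated codes, let h and w be transparent, pad to four characters); the function name is arbitrary and non-ASCII letters are simply treated like vowels.

```python
# Digit assigned to each coded consonant; vowels, h, w and y have no code.
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}

def soundex(word: str) -> str:
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last = _CODES.get(letters[0], "")   # the first letter's code also suppresses repeats
    for c in letters[1:]:
        if c in "hw":
            continue                    # h and w do not separate equal codes
        code = _CODES.get(c, "")
        if code and code != last:
            result += code
        last = code                     # a vowel resets this, so repeats across vowels count
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))
```

Names that sound alike, such as Robert and Rupert, receive the same code, which is why it suits people's names better than general vocabulary.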
There's something called aspell that might help:
http://blog.evanweaver.com/files/doc/fauna/raspell/classes/Aspell.html
There's a ruby gem for it, but I don't know how to talk to it from python
http://blog.evanweaver.com/files/doc/fauna/raspell/files/README.html
Here's a quote from the ruby implementation
Usage
Aspell lets you check words and suggest corrections. For example:
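require "raspell"
# Assumption: an "en_US" aspell dictionary is installed.
speller = Aspell.new("en_US")
string = "my haert wil go on"
string.gsub(/[\w\']+/) do |word|
  if !speller.check(word)
    # word is wrong
    puts "Possible correction for #{word}:"
    puts speller.suggest(word).first
  end
end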
This outputs:
Possible correction for haert:
heart
Possible correction for wil:
Will
Implementing spelling correction for search engines in an effective way is not trivial (you can't just compute the edit/levenshtein distance to every possible word). A solution based on k-gram indexes is described in Introduction to Information Retrieval (full text available online).
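To give a feel for the k-gram idea: index every vocabulary word under its character k-grams, then for a misspelled query look up only the words that share k-grams with it and rank them by set overlap. The sketch below is a simplified illustration, not the book's exact algorithm; the function names, the "$" boundary marker, k=2 and the 0.3 Jaccard threshold are all assumptions.

```python
from collections import defaultdict

def kgrams(word, k=2):
    padded = f"${word}$"                 # "$" marks the word boundaries
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def build_index(vocabulary, k=2):
    """Map each k-gram to the set of vocabulary words containing it."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in kgrams(word, k):
            index[gram].add(word)
    return index

def candidates(query, index, k=2, min_jaccard=0.3):
    """Words sharing k-grams with the query, best Jaccard overlap first."""
    grams = kgrams(query, k)
    seen = set().union(*(index.get(g, set()) for g in grams))
    scored = []
    for word in seen:
        word_grams = kgrams(word, k)
        jaccard = len(grams & word_grams) / len(grams | word_grams)
        if jaccard >= min_jaccard:
            scored.append((jaccard, word))
    return [w for _, w in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]

idx = build_index(["heart", "hear", "health", "will", "wall"])
print(candidates("haert", idx), candidates("wil", idx))
```

Edit distance then only needs to be computed against this short candidate list instead of against the whole vocabulary.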
You could use n-grams for the comparison: http://en.wikipedia.org/wiki/N-gram
Using the Python ngram module: http://packages.python.org/ngram/index.html
import ngram
G2 = ngram.NGram(["iis7 configure ftp 7.5",
                  "ubunto configre 8.5",
                  "mac configure ftp"])
print("Similarity\tString")
# search() returns (string, similarity) pairs
for text, similarity in G2.search("iis7 configurftp 7.5", threshold=0.1):
    print(similarity, "\t", text)
You get:
>>>
Similarity String
0.76 "iis7 configure ftp 7.5"
0.24 "mac configure ftp"
0.19 "ubunto configre 8.5"
Why not use Google's "did you mean" in your code? For how, see here:
http://narenonit.blogspot.com/2012/08/trick-for-using-googles-did-you-mean.html

Resources