Lucene Reverse Phrase Search - search

If I want to search for Keyword "Error Message" , Can lucene be able to lend me results matching "Error Message" and "Message Error". Currenlty i am getting results matching "Error Message" Only. I am using Standard Analyser and Query Parser for searching a Keyword.

Use a PhraseQuery with slop > 0. From the javadoc:
Sets the number of other words permitted between words in query
phrase. If zero, then this is an exact phrase search. For larger
values this works like a WITHIN or NEAR operator. The slop is in fact
an edit-distance, where the units correspond to moves of terms in the
query phrase out of position. For example, to switch the order of two
words requires two moves (the first move places the words atop one
another), so to permit re-orderings of phrases, the slop must be at
least two.
More exact matches are scored higher than sloppier matches, thus
search results are sorted by exactness.
The slop is zero by default, requiring exact matches.

There isn't anything that will do quite that, other than doing it as a search for "Error Message" OR "Message Error".
But if you search for
Title:(Error AND Message)
then you'll get everything where the title matches "Error" and "Message".
One key point, though: if you're programmatically constructing a Lucene query, you really shouldn't be using QueryParser. You should be using a QueryBuilder to construct it structurally. QueryParser is only for human-generated queries that a user might type into your application.

Related

Azure Search - exact match as first or single result

I'm using Azure Search based on the rich Lucene Query Parser syntax. I defined to "~1" as additional parameter to one symbol for distance ). But I faced with problem, that the entity is not ordered even if there is exact match. (For example,"blue~1" would return "blues", "blue", "glue". Or when searching product SKU like "P002", I would get result "P003", "P005", "P004", "P002", "P001", "P006" )
So my question: is there some way to define, that the entity with exact match must be first in list, or be singl search result even then I'm using fuzzy search "~1"?
With Lucene Query syntax you can boost individual subqueries, for example: term^2 | term~1 - this translates to "find documents that match 'term' OR 'term' with edit distance 1, and score the exact matches higher relative to fuzzy matches by a factor of two.
search=blue^2|blue~1&queryType=full
There is no guarantee that the exact match will always be first in the results set as the document score is a function of term frequency and inverse document frequency. If the fuzzy sub-query expands the input term to a term that's very unique in your document corpus you may need to bump the boosting factor (2 in my example). In general, relying on the relevance score for ordering is not a practical idea. Take a look at my answer in the following post for more information: Azure Search scoring
Let me know if this helps

Data structure to index entire document and algorithm for quick search of any size substring

I'm trying to find a data structure (and algorithm) that would allow me to index an entire text document and search for substring of it, no matter the size of the substring. The data structure should be stored in disk, during or at the end of the indexing procedure.
For instance, given the following sentence:
The book is on the table
The algorithm should quickly (O(log(n))) find the occurrences of any subset of the text.
For instance, if the input is book it should find all occurrences of it, but this should also be true for book is and The book is.
Unfortunately, the majority of solutions work by tokenizing the text and making searches using individual tokens. Ordinary databases also index any text without worrying about subset searching (that is why SELECT '%foo%' is done with linear search and takes a lot?).
I could try to develop something from scratch (maybe a variation of reverse index?) but I'd love to discover that somebody did that.
The most similar thing I found is SQLite3 Full-text search.
Thanks!
One approach is to index your document in a suffix tree, and then - each prefix of some suffix - is a substring in the document.
With this approach, all you have to do, is build your suffix tree, and upon querying a substring s, follow nodes in the tree, and if you can follow through the entire query string - it means there is a suffix, which its prefix is the query string - and thus it is also a substring.
If you are querying only complete words, inverted index could be just enough. Inverted index is usually mapping a term (word) to a list of documents it appears in. Instead, for you it will mapping to locations in the document.
Upon query, you need to find for each occurance of word i in the query, its positions (let it be p), and if term i+1 of your query, appears as well in position p+1.
This can be done pretty efficiently, similarly to how inverted index is traditionally doing AND queries, but instead of searching all terms in same document, search terms in increasing positions.

String matching algorithm : (multi token strings)

I have a dictionary which contains a big number of strings. Each string could have a range of 1 to 4 tokens (words). Example :
Dictionary :
The Shawshank Redemption
The Godfather \
Pulp Fiction
The Dark Knight
Fight Club
Now I have a paragraph and I need to figure out how many strings in the para are part of the dictionary.
Example, when the para below :
The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 The Shawshank Redemption has been
battling The Godfather for the top spot.
is run against the dictionary, I should be getting the ones in bold as the ones that are part of the dictionary.
How can I do this with the least dictionary calls.
Thanks
You might be better off using a Trie. A Trie is better suited to finding partial matches (i.e. as you search through the text of a paragraph) that are potentially what you're looking for, as opposed to making a bunch of calls to a dictionary that will mostly fail.
The reason why I think a Trie (or some variation) is appropriate is because it's built to do exactly what you're trying to do:
If you use this (or some modification that has the tokenized words at each node instead of a letter), this would be the most efficient (at least that I know of) in terms of storage and retrieval; Storage because instead of storing the word "The" a couple thousand times in each Dict entry that has that word in the title (as is the case with movie titles), it would be stored once in one of the nodes right under the root. The next word, "Shawshank" would be in a child node, and then "redemption" would be in the next, with a total of 3 lookups; then you would move to the next phrase. If it fails, i.e. the phrase is only "The Shawshank Looper", then you fail after the same 3 lookups, and you move to the failed word, Looper (which as it happens, would also be a child node under the root, and you get a hit. This solution works assuming you're reading a paragraph without mashup movie names).
Using a hash table, you're going to have to split all the words, check the first word, and then while there's no match, keep appending words and checking if THAT phrase is in the dictionary, until you get a hit, or you reach the end of the paragraph. So if you hit a paragraph with no movie titles, you would have as many lookups as there are words in the paragraph.
This is not a complete answer, more like an extended-comment.
In literature it's called "multi-pattern matching problem". Since you mentioned that the set of patterns has millions of elements, Trie based solutions will most probably perform poorly.
As far as I know, in practice traditional string search is used with a lot of heuristics. DNA search, antivirus detection, etc. all of these fields need fast and reliable pattern matching, so there should be decent amount of research done.
I can imagine how Rabin-Karp with rolling-hash functions and some filters (Bloom filter) can be used in order to speed up the process. For example, instead of actually matching the substrings, you could first filter (e.g. with weak-hashes) and then actually verify, thus reducing number of verifications needed. Plus this should reduce the work done with the original dictionary itself, as you would store it's hashes, or other filters.
In Python:
import re
movies={1:'The Shawshank Redemption', 2:'The Godfather', 3:'Pretty Woman', 4:'Pulp Fiction'}
text = 'The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 The Shawshank Redemption has been battling The Godfather for the top spot.'
repl_str ='(?P<title>' + '|'.join(['(?:%s)' %movie for movie in movies.values()]) + ')'
result = re.sub(repl_str, '<b>\g<title></b>',text)
Basically it consists of forming up a big substitution instruction string out of your dict values.
I don't know whether regex and sub have a limitation in the size of the substitution instructions you give them though. You might want to check.
lai

Search with attribute values correspondence in Lucene

Here's a text with ambiguous words:
"A man saw an elephant."
Each word has attributes: lemma, part of speech, and various grammatical attributes depending on its part of speech.
For "saw" it is like:
{lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}
All this attributes come from the 3rd party tools, Lucene itself is not involved in the word disambiguation.
I want to perform a query like "pos=verb & number=singular" and NOT to get "saw" in the result.
I thought of encoding distinct grammatical annotations into strings like "l:see;pos:verb;t:past|l:saw;pos:noun;n:sg" and searching for regexp "pos\:verb[^\|]+n\:sg", but I definitely can't afford regexp queries due to performance issues.
Maybe some hacks with posting list payloads can be applied?
UPD: A draft of my solution
Here are the specifics of my project: there is a fixed maximum of parses a word can have (say, 8).
So, I thought of inserting the parse number in each attribute's payload and use this payload at the posting lists intersectiion stage.
E.g., we have a posting list for 'pos = Verb' like ...|...|1.1234|...|..., and a posting list for 'number = Singular': ...|...|2.1234|...|...
While processing a query like 'pos = Verb AND number = singular' at all stages of posting list processing the 'x.1234' entries would be accepted until the intersection stage where they would be rejected because of non-corresponding parse numbers.
I think this is a pretty compact solution, but how hard would be incorporating it into Lucene?
So... the cheater way of doing this is (indeed) to control how you build the lucene index.
When constructing the lucene index, modify each word before Lucene indexes it so that it includes all the necessary attributes of the word. If you index things this way, you must do a lookup in the same way.
One way:
This means for each type of query you do, you must also build an index in the same way.
Example:
saw becomes noun-saw -- index it as that.
saw also becomes noun-past-see -- index it as that.
saw also becomes noun-past-singular-see -- index it as that.
The other way:
If you want attribute based lookup in a single index, you'd probably have to do something like permutation completion on the word 'saw' so that instead of noun-saw, you'd have all possible permutations of the attributes necessary in a big logic statement.
Not sure if this is a good answer, but that's all I could think of.

Solr proximity ordered vs unordered

In Solr you can perform an ordered proximity search using syntax
"word1 word2"~10
By ordered, I mean word1 will always come before word2 in the document. I would like to know if there is an easy way to perform an unordered proximity search, ie. word1 and word2 occur within 10 words of each other and it doesn't matter which comes first.
One way to do this would be:
"word1 word2"~10 OR "word2 word1"~10
The above will work but I'm looking for something simpler, if possible.
Slop means how many word transpositions can occur. So "a b" is going to be different than "b a" because a different number of transpositions are allowed.
a foo b has positions (a,1), (foo, 2), (b, 3). To match (a,1), (b,2) will require one change: (b,2) => (b,3)
However, to match (b,1), (a,2) you will need (a,2) => (a,1) and (b,1) => (b,3), for a total of three position movements
In general, if "a b"~n matches something, then "b a"~(n+2) will match it too.
EDIT: I guess I never gave an answer. I see two options:
If you want a slop of n, increase it to n+2
Manually disjunctivize your search like you suggested
I think #2 is probably better, unless your slop is very large to begin with.
Are you sure it's already doesn't work like that? There is nothing in documentation saying that it's 'ordered':
A proximity search can be done with a sloppy phrase query. The closer together the two terms appear in the document, the higher the score will be. A sloppy phrase query specifies a maximum "slop", or the number of positions tokens need to be moved to get a match.
This example for the standard request handler will find all documents where "batman" occurs within 100 words of "movie":
http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_search_for_one_term_near_another_term_.28say.2C_.22batman.22_and_.22movie.22.29
Since Solr 4 it is possible with SurroundQueryParser.
E.g. to do ordered search (query where "phrase two" follows "phrase one" not further than 3 words after):
3W(phrase W one, phrase W two)
To do unordered search (query "phrase two" in proximity of 5 words of "phrase one"):
5N(phrase W one, phrase W two)

Resources