Solr proximity search: ordered vs unordered

In Solr you can perform an ordered proximity search using the syntax
"word1 word2"~10
By ordered, I mean word1 will always come before word2 in the document. I would like to know if there is an easy way to perform an unordered proximity search, i.e. word1 and word2 occur within 10 words of each other and it doesn't matter which comes first.
One way to do this would be:
"word1 word2"~10 OR "word2 word1"~10
The above will work but I'm looking for something simpler, if possible.

Slop is the number of position moves allowed when matching the phrase. So "a b" is going to behave differently from "b a", because a different number of moves is required.
a foo b has positions (a,1), (foo,2), (b,3). To match the query "a b", with expected positions (a,1), (b,2), requires one move: (b,2) => (b,3).
However, to match "b a", with expected positions (b,1), (a,2), you will need (a,2) => (a,1) and (b,1) => (b,3), for a total of three position moves.
In general, if "a b"~n matches something, then "b a"~(n+2) will match it too.
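For the two-term case you can compute the required slop directly from the positions. A minimal sketch of the arithmetic above (my own illustration, not Lucene code):

def min_slop(pos_first, pos_second):
    # Minimal slop for the query "first second", given the document
    # positions of the two terms (two-term case only).
    if pos_second > pos_first:
        return pos_second - pos_first - 1   # only `second` has to move
    return pos_first - pos_second + 1       # `second` must also move past `first`

# Document "a foo b": a at position 1, b at position 3.
print(min_slop(1, 3))   # query "a b" -> 1
print(min_slop(3, 1))   # query "b a" -> 3, i.e. the ordered slop plus 2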
EDIT: I guess I never gave an answer. I see two options:
If you want a slop of n, increase it to n+2
Manually disjunctivize your search like you suggested
I think #2 is probably better, unless your slop is very large to begin with.

Are you sure it doesn't already work like that? There is nothing in the documentation saying that it's 'ordered':
A proximity search can be done with a sloppy phrase query. The closer together the two terms appear in the document, the higher the score will be. A sloppy phrase query specifies a maximum "slop", or the number of positions tokens need to be moved to get a match.
This example for the standard request handler will find all documents where "batman" occurs within 100 words of "movie":
http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_search_for_one_term_near_another_term_.28say.2C_.22batman.22_and_.22movie.22.29

Since Solr 4 it is possible with the SurroundQueryParser.
E.g. to do an ordered search (a query where "phrase two" follows "phrase one" no further than 3 words after):
3W(phrase W one, phrase W two)
To do an unordered search (a query for "phrase two" within 5 words of "phrase one", in either order):
5N(phrase W one, phrase W two)
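A note on invocation: in Solr the surround parser is typically selected with local params, along the lines of q={!surround}5N(word1, word2). Treat this as a sketch; the exact invocation (default field, version-specific details) should be checked against your Solr reference guide.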

Related

How to find strings of a list in a text with typos

I'm trying to check if some strings in a list are present in a given text. But the given text can have some typos. For example, let's take this.
text: The brownw focx and the cat are in th eforest.
and my list is: [brown fox, forest, cat]
What I currently do is separate my text into multiple groups, groups of one word and of two words, like so:
[The, brownw, focx, and, the, cat, are, in, th, eforest, The brownw, brownw focx, focx and, and the, the cat, cat are, are in, in th, th eforest]
Then I iterate over each group of words and check with the Levenshtein algorithm how closely the two strings match. If it's more than 90%, I consider them the same.
This approach, however, is very time-consuming, and I wonder if I can find an alternative.
Instead of using the full Levenshtein distance (which is slow to compute), you could do a couple of sanity checks beforehand, to try and exclude candidates which are obviously wrong:
Word length: the will never match brown fox, as it is far too short. Compare word lengths, and exclude all candidates that are more than a few letters shorter or longer.
Letters: just check which letters are in the word. For example, the does not contain a single letter from fox, so you can rule it out straight away. With short words it might not make a big difference in performance, but for longer words it will. An additional optimisation: look for rare characters (x, q, w) first, or simply ignore common ones (e, t, s) which are likely to be present anyway.
Heuristics such as these will of course not give you the right answer, but they can help to filter out those that are definitely not going to match. Then you only need to perform the more expensive full check on a much smaller number of candidate words.
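A minimal sketch of those two checks (the thresholds are illustrative assumptions, and levenshtein_ratio is a hypothetical stand-in for whatever full comparison you already use):

def could_match(candidate, target, max_len_diff=3):
    # Length check: if the lengths differ by more than the edit budget,
    # the Levenshtein distance can never be within that budget.
    if abs(len(candidate) - len(target)) > max_len_diff:
        return False
    # Letter check: require at least one character in common.
    if not set(candidate) & set(target):
        return False
    return True

# Only candidates that pass the cheap filters get the expensive check:
# if could_match(group, entry): score = levenshtein_ratio(group, entry)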

Finding the most similar string among a set of millions of strings

Let's say I have a dictionary (word list) of millions upon millions of words. Given a query word, I want to find the word from that huge list that is most similar.
So let's say my query is elepant, then the result would most likely be elephant.
If my word is fentist, the result will probably be dentist.
Of course assuming both elephant and dentist are present in my initial word list.
What kind of index, data structure or algorithm can I use for this so that the query is fast? Hopefully complexity of O(log N).
What I have: The most naive thing to do is to create a "distance function" (which computes the "distance" between two words, in terms of how different they are) and then in O(n) compare the query with every word in the list, and return the one with the closest distance. But I wouldn't use this because it's slow.
The problem you're describing is a Nearest Neighbor Search (NNS). There are two main methods of solving NNS problems: exact and approximate.
If you need an exact solution, I would recommend a metric tree, such as the M-tree, the MVP-tree, and the BK-tree. These trees take advantage of the triangle inequality to speed up search.
If you're willing to accept an approximate solution, there are much faster algorithms. The current state of the art for approximate methods is Hierarchical Navigable Small World (HNSW). The Non-Metric Space Library (nmslib) provides an efficient implementation of HNSW as well as several other approximate NNS methods.
(You can compute the Levenshtein distance with Hirschberg's algorithm)
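To make the metric-tree idea concrete, here is a minimal BK-tree sketch over Levenshtein distance (a toy illustration, not a production index; the BK-tree is the simplest of the trees named above):

def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})   # each node: (word, children keyed by distance)
        for w in it:
            self.add(w)

    def add(self, word):
        node, children = self.root
        while True:
            d = levenshtein(word, node)
            if d == 0:
                return               # already present
            if d in children:
                node, children = children[d]
            else:
                children[d] = (word, {})
                return

    def search(self, word, max_dist):
        results, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = levenshtein(word, node)
            if d <= max_dist:
                results.append((d, node))
            # Triangle inequality: only subtrees whose edge distance lies in
            # [d - max_dist, d + max_dist] can contain matches; prune the rest.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)

tree = BKTree(["elephant", "dentist", "forest"])
print(tree.search("elepant", 2))   # [(1, 'elephant')]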
I made a similar algorithm some time ago.
The idea is to have an array char[255] of characters, where each value is a list of word hashes (word ids) for the words containing that character.
When you are searching for 'dele....':
search(d) will return an empty list
search(e) will find everything with the character e, including elephant (twice, as it has two 'e's)
search(l) brings you a new list, and you combine this list with the results from the previous step
...
at the end of the input you will have a list
then you can group by word hash and order descending by count
Another interesting thing: if your input is missing one or more characters, you will just receive an empty list in the middle of the search, and it will not affect the idea.
My initial algorithm was without ordering, and for every character I stored the word id, line number and character position.
My main problem was that I wanted to search
with ee to find 'elephant'
with eleant to find 'elephant'
with antph to find 'elephant'
Every word was actually a line from a file, so it was often very long, and the number of files and lines was big. I wanted quick search over directories with more than 1 GB of text files, so even storing the index in memory was a problem. For this idea you need 3 parts:
a function to fill your cache
a function to find by character from the input
a function to filter and maybe order the results (I didn't use ordering, as I tried to fill my cache in the same order as I read the file, so lines containing the input came back in that same order)
I hope it makes sense.
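A rough sketch of that character-index idea (the names and structure are my own guesses at the description above):

from collections import defaultdict, Counter

def build_index(words):
    index = defaultdict(list)             # char -> word ids (repeated per occurrence)
    for word_id, word in enumerate(words):
        for ch in word:
            index[ch].append(word_id)
    return index

def lookup(index, query):
    counts = Counter()
    for ch in query:
        counts.update(index.get(ch, []))  # a missing char contributes nothing
    return [word_id for word_id, _ in counts.most_common()]  # group + order by count

words = ["elephant", "dentist", "forest"]
idx = build_index(words)
print(words[lookup(idx, "elepant")[0]])   # elephant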

Lucene Reverse Phrase Search

If I want to search for the keyword "Error Message", can Lucene give me results matching both "Error Message" and "Message Error"? Currently I am getting results matching "Error Message" only. I am using the StandardAnalyzer and QueryParser for searching a keyword.
Use a PhraseQuery with slop > 0. From the javadoc:
Sets the number of other words permitted between words in the query phrase. If zero, then this is an exact phrase search. For larger values this works like a WITHIN or NEAR operator. The slop is in fact an edit-distance, where the units correspond to moves of terms in the query phrase out of position. For example, to switch the order of two words requires two moves (the first move places the words atop one another), so to permit re-orderings of phrases, the slop must be at least two.
More exact matches are scored higher than sloppier matches, thus search results are sorted by exactness.
The slop is zero by default, requiring exact matches.
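In query parser syntax that would be a sloppy phrase query such as "Error Message"~2: per the javadoc above, a slop of at least two permits the reversed order "Message Error" when the two terms are adjacent.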
There isn't anything that will do quite that, other than doing it as a search for "Error Message" OR "Message Error".
But if you search for
Title:(Error AND Message)
then you'll get everything where the title matches "Error" and "Message".
One key point, though: if you're programmatically constructing a Lucene query, you really shouldn't be using QueryParser. You should be using a QueryBuilder to construct it structurally. QueryParser is only for human-generated queries that a user might type into your application.

String matching algorithm (multi-token strings)

I have a dictionary which contains a large number of strings. Each string can have 1 to 4 tokens (words). Example:
Dictionary :
The Shawshank Redemption
The Godfather
Pulp Fiction
The Dark Knight
Fight Club
Now I have a paragraph and I need to figure out how many strings in the paragraph are part of the dictionary.
For example, when the paragraph below:
The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250. For at least the year or two that I have occasionally been checking in on the IMDB Top 250, The Shawshank Redemption has been battling The Godfather for the top spot.
is run against the dictionary, I should get The Shawshank Redemption and The Godfather as the strings that are part of the dictionary.
How can I do this with the fewest dictionary calls?
Thanks
You might be better off using a Trie. A Trie is better suited to finding partial matches (i.e. as you search through the text of a paragraph) that are potentially what you're looking for, as opposed to making a bunch of calls to a dictionary that will mostly fail.
The reason why I think a Trie (or some variation) is appropriate is because it's built to do exactly what you're trying to do:
If you use this (or some modification that has tokenized words at each node instead of letters), it would be the most efficient approach (at least that I know of) in terms of storage and retrieval. Storage, because instead of storing the word "The" a couple of thousand times (once in each dictionary entry that has that word in its title, as is the case with movie titles), it is stored once in a node right under the root. The next word, "Shawshank", would be in a child node, and then "Redemption" in the next, for a total of 3 lookups; then you move to the next phrase. If it fails, i.e. the phrase is only "The Shawshank Looper", you fail after the same 3 lookups, and you move on to the failed word, Looper (which, as it happens, would also be a child node under the root, and you get a hit). This solution works assuming you're reading a paragraph without mashed-up movie names.
Using a hash table, you're going to have to split all the words, check the first word, and then while there's no match, keep appending words and checking if THAT phrase is in the dictionary, until you get a hit, or you reach the end of the paragraph. So if you hit a paragraph with no movie titles, you would have as many lookups as there are words in the paragraph.
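A minimal sketch of such a word-level trie (a toy that assumes lowercase matching and, per the question, entries of at most 4 tokens):

def build_trie(phrases):
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})   # one node per token, shared prefixes
        node["$end"] = phrase                   # marks a complete dictionary entry
    return root

def find_matches(trie, text):
    tokens = text.lower().replace(".", " ").split()
    matches = []
    for start in range(len(tokens)):
        node = trie
        for token in tokens[start:start + 4]:   # entries have at most 4 tokens
            if token not in node:
                break                           # fail after at most 4 lookups
            node = node[token]
            if "$end" in node:
                matches.append(node["$end"])
    return matches

trie = build_trie(["The Shawshank Redemption", "The Godfather", "Pulp Fiction"])
print(find_matches(trie, "The Shawshank Redemption has been battling The Godfather"))
# ['The Shawshank Redemption', 'The Godfather']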
This is not a complete answer, more like an extended comment.
In the literature it's called the "multi-pattern matching problem". Since you mentioned that the set of patterns has millions of elements, Trie-based solutions will most probably perform poorly.
As far as I know, in practice traditional string search is used with a lot of heuristics. DNA search, antivirus detection, etc. all need fast and reliable pattern matching, so there should be a decent amount of research done.
I can imagine how Rabin-Karp with rolling-hash functions and some filters (Bloom filter) could be used to speed up the process. For example, instead of actually matching the substrings, you could first filter (e.g. with weak hashes) and only then verify, thus reducing the number of verifications needed. Plus this should reduce the work done with the original dictionary itself, as you would store its hashes, or other filters.
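To sketch how that could look (a toy only; real implementations add stronger hashing and Bloom-style prefilters in front of the hash table):

def pattern_hash(s, base=257, mod=(1 << 61) - 1):
    # Polynomial hash of a whole string.
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h

def find_patterns(text, patterns, base=257, mod=(1 << 61) - 1):
    hits = []
    for length in {len(p) for p in patterns}:      # one pass per pattern length
        if len(text) < length:
            continue
        table = {}
        for p in patterns:
            if len(p) == length:
                table.setdefault(pattern_hash(p), []).append(p)
        power = pow(base, length - 1, mod)
        h = pattern_hash(text[:length])
        for i in range(len(text) - length + 1):
            if i > 0:                               # roll: drop one char, add one
                h = ((h - ord(text[i - 1]) * power) * base
                     + ord(text[i + length - 1])) % mod
            for p in table.get(h, []):
                if text[i:i + length] == p:         # verify only on a hash hit
                    hits.append((i, p))
    return hits

print(find_patterns("say pulp fiction twice", ["pulp fiction", "fight club"]))
# [(4, 'pulp fiction')]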
In Python:
import re
movies={1:'The Shawshank Redemption', 2:'The Godfather', 3:'Pretty Woman', 4:'Pulp Fiction'}
text = 'The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 The Shawshank Redemption has been battling The Godfather for the top spot.'
repl_str = '(?P<title>' + '|'.join(['(?:%s)' % re.escape(movie) for movie in movies.values()]) + ')'  # re.escape in case a title contains regex metacharacters
result = re.sub(repl_str, r'<b>\g<title></b>', text)  # raw string so \g reaches re.sub intact
Basically it consists of forming one big substitution pattern out of your dict values.
I don't know whether re.sub has a limit on the size of the pattern you give it, though. You might want to check.

How does a search index work when querying many words?

I'm trying to build my own search engine for experimenting.
I know about inverted indexes, for example when indexing words:
the key is the word, and it maps to a list of ids of the documents containing that word. So when you search for that word, you get the documents right away.
How does it work for multiple words?
Do you get all documents for every word and traverse those documents to see if they contain all the words?
I feel that is not the case.
Does anyone know the real answer for this, without speculating?
An inverted index is very efficient for computing intersections, using a zig-zag algorithm:
Assume your query terms are given as a list T of sorted posting lists (Python):

import bisect

INF = float("inf")

def get_first_after(postings, doc_id):
    # First docId in the sorted postings list greater than doc_id, or INF if none.
    i = bisect.bisect_right(postings, doc_id)
    return postings[i] if i < len(postings) else INF

def zigzag_intersect(T):
    result = []
    last_doc = 0                     # the first doc in the collection
    curr_term = 0                    # the first term in T
    while last_doc != INF:
        if curr_term == len(T):      # we have passed the last term:
            result.append(last_doc)
            curr_term = 0
            last_doc += 1
            continue
        doc_id = get_first_after(T[curr_term], last_doc - 1)
        if doc_id != last_doc:
            last_doc = doc_id        # jump ahead and restart the term scan
            curr_term = 0
        else:
            curr_term += 1           # this term matches last_doc; try the next
    return result
This algorithm assumes an efficient get_first_after(), which gives you the first document that matches the term and whose docId is greater than the specified parameter; it should return infinity if there is none (above, binary search over a sorted posting list provides this in logarithmic time).
The algorithm will be most efficient if the terms are sorted so that the rarest term comes first.
The algorithm ensures at most #docs_matching_first_term * #terms iterations, but in practice it will usually take far fewer.
Note: though this algorithm is efficient, AFAIK Lucene does not use it.
More info can be found in these lecture notes, slides 11-13 [copyright notice on the lecture's first page].
You need to store the position of each word occurrence in the index file.
Your index file structure should be like this:
word id - doc id - no. of hits - positions of hits
Now suppose the query contains 4 words, "w1 w2 w3 w4". Choose the documents containing most of the words. Now calculate the words' relative distances within each document. The document where most of the words occur and where their relative distance is smallest gets high priority in the search results.
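As a sketch of that relative-distance step (a brute-force toy; real engines merge sorted position lists more cleverly):

from itertools import product

def min_span(position_lists):
    # Smallest window covering one hit position from each query word.
    best = float("inf")
    for combo in product(*position_lists):
        best = min(best, max(combo) - min(combo))
    return best

# w1 at positions [3, 40], w2 at [5, 90], w3 at [7]:
print(min_span([[3, 40], [5, 90], [7]]))   # 4, i.e. positions 3..7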
I have developed a complete search engine without using any crawling or indexing tool available on the internet. You can read a detailed description here: Search Engine.
For more info, read this paper by Google's founders: click here.
You find the intersection of document sets as biziclop said, and you can do it in a fairly fast way. See this post and the papers linked therein for a more formal description.
As pointed out by biziclop, for an AND query you need to intersect the match lists (aka inverted lists) for the two query terms.
In typical implementations, the inverted lists are implemented such that they can be searched for any given document id very efficiently (generally, in logarithmic time). One way to achieve this is to keep them sorted (and use binary search), but note that this is not trivial as there is also a need to store them in compressed form.
Given a query A AND B, assume that there are occ(A) matches for A and occ(B) matches for B (i.e. occ(x) := the length of the match list for term x). Assume, without loss of generality, that occ(A) > occ(B), i.e. A occurs more frequently in the documents than B. What you do then is iterate through all matches for B and search for each of them in the list for A. If the lists can indeed be searched in logarithmic time, this means you need
occ(B) * log(occ(A))
computational steps to identify all matches that contain both terms.
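A sketch of that strategy (iterate the shorter list, binary-search the longer one; both lists must be sorted):

import bisect

def intersect(short_list, long_list):
    # occ(B) iterations, each doing a log(occ(A)) binary search.
    out = []
    for doc in short_list:
        i = bisect.bisect_left(long_list, doc)
        if i < len(long_list) and long_list[i] == doc:
            out.append(doc)
    return out

print(intersect([2, 5, 9], [1, 2, 3, 5, 8, 13]))   # [2, 5]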
A great book describing various aspects of the implementation is Managing Gigabytes.
I don't really understand why people are talking about intersection for this.
Lucene supports combination of queries using BooleanQuery, which you can nest indefinitely if you must.
The QueryParser also supports the AND keyword, which would require both words to be in the document.
Example (Lucene.NET, C#):
var outerQuery = new BooleanQuery();
outerQuery.Add(new TermQuery( new Term( "FieldNameToSearch", word1 ) ), BooleanClause.Occur.MUST );
outerQuery.Add(new TermQuery( new Term( "FieldNameToSearch", word2 ) ), BooleanClause.Occur.MUST );
If you want to split the words (your actual search term) using the same analyzer, there are ways to do that too, though a QueryParser might be easier to use.
For an example of how to split the string using the same analyzer that you used for indexing, see this answer:
No hits when searching for "mvc2" with lucene.net
