Proximity Search using phrases in Solr - search

I use Solr's proximity search quite often to search for words within a specified range of each other, like so:
"Government Spending" ~2
Is there a way to perform a proximity search using a phrase and a word, or two phrases? If so, what is the syntax?

This appears to be "somewhat" doable. Consider this text:
This is more about traffic between Solr servers themselves
These queries match it:
"more traffic between solr" ~2
"more about between solr" ~2
Even if you change the order it works:
"more about solr between" ~2
But too far apart and it stops working:
"more about servers themselves" ~2
If that doesn't work, it would probably not be too hard to write a custom request handler that does this. You might need to define a new syntax, perhaps something like ("phrase one" "phrase two") ~2. My guess is that if you are shingling, and you create a Lucene query containing one token for "phrase one" and another for "phrase two" within a certain proximity of each other, it will work. (Of course, you will need to build that query through the Lucene Java API yourself; you can't just hand the query string over. See http://lucene.apache.org/java/2_2_0/api/index.html.)

Out of the box, I have discovered a way to perform a Solr proximity search using more than one word, or phrases; see below.
e.g. with 3 words:
"(word1) (word2) (word3)"~10
e.g. with 2 phrases (note that the inner double quotes need to be escaped):
"(\"phrase1\") (\"phrase2\")"~10

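The escaping in the two examples above is easy to get wrong when the query is assembled in client code. Here is a minimal Python sketch of a helper that builds such a string; the function name `proximity_query` is made up for illustration, and it only handles the double-quote wrapping shown above (it does not escape other special query characters).

```python
def proximity_query(terms, slop):
    """Build a string such as "(word1) (\"phrase two\")"~10.

    Single words are wrapped in parentheses; multi-word phrases are
    additionally wrapped in backslash-escaped double quotes.
    """
    parts = []
    for term in terms:
        term = term.strip()
        if " " in term:
            # Multi-word phrase: escape the inner quotes.
            parts.append('(\\"' + term + '\\")')
        else:
            parts.append("(" + term + ")")
    return '"' + " ".join(parts) + '"~' + str(slop)


print(proximity_query(["word1", "word2", "word3"], 10))
# "(word1) (word2) (word3)"~10
print(proximity_query(["phrase one", "phrase two"], 10))
# "(\"phrase one\") (\"phrase two\")"~10
```

Whether Solr interprets the resulting string as intended still depends on the query parser and field analysis in use, so it is worth checking the parsed query with debugQuery=true.
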
Since Solr 4 it is possible with SurroundQueryParser.
E.g. to query where "phrase two" follows "phrase one" not further than 3 words after:
3W(phrase W one, phrase W two)
To query "phrase two" in proximity of 5 words of "phrase one":
5N(phrase W one, phrase W two)
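To generate expressions like these from plain phrases, a small sketch in Python could look like the following. The helper names `surround_phrase` and `surround_near` are hypothetical; the code only produces the W/N notation shown above and does not talk to Solr.

```python
def surround_phrase(phrase):
    """Turn 'phrase one' into 'phrase W one' (words adjacent and in order)."""
    return " W ".join(phrase.split())


def surround_near(phrase_a, phrase_b, distance, ordered=False):
    """Combine two phrases with an ordered (W) or unordered (N) distance operator."""
    operator = "W" if ordered else "N"
    return "{}{}({}, {})".format(
        distance, operator, surround_phrase(phrase_a), surround_phrase(phrase_b)
    )


print(surround_near("phrase one", "phrase two", 3, ordered=True))
# 3W(phrase W one, phrase W two)
print(surround_near("phrase one", "phrase two", 5))
# 5N(phrase W one, phrase W two)
```
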

Related

The trouble with searching for a single-word sentence

I have a text field for tags. For example some entities:
{"tags": "apple. fruits. eat."}
{"tags": "green apple."}
{"tags": "banana. apple."}
I want to select entities with the tag apple, not green apple or something apple something. Every variant I have tried ends up the same way: it selects any sentence that contains the expression, no matter what else the sentence contains. In this case, though, the rest of the sentence matters.
How can I do this using Lucene syntax or Azure Search tools? Or, more generally, how can I search for an exactly matching sentence?
I presume that the "." is a delimiter between the different tags. There may be a way to express this in Lucene, but you may need to add some custom analyzers to preserve the "."s in tokenization.
A better strategy in this case would be to use a field of type Collection(Edm.String). This preserves the structure of the tag phrases, and you can use a filter to select the specific value "apple". Collection(Edm.String) also allows you to enable faceting on the tags, which is useful.
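As a rough sketch of that strategy in Python: split the existing dot-delimited string into a list before indexing it into the collection field, then filter with an OData `any` expression. The helper names are made up, and it assumes the field is called `tags` and is marked filterable in the index.

```python
def split_tags(raw):
    """Split 'apple. fruits. eat.' into ['apple', 'fruits', 'eat']."""
    return [tag.strip() for tag in raw.split(".") if tag.strip()]


def tag_filter(field, value):
    """Build an OData filter that matches a collection element exactly."""
    escaped = value.replace("'", "''")  # OData escapes a single quote by doubling it
    return "{}/any(t: t eq '{}')".format(field, escaped)


print(split_tags("apple. fruits. eat."))  # ['apple', 'fruits', 'eat']
print(split_tags("green apple."))         # ['green apple']
print(tag_filter("tags", "apple"))        # tags/any(t: t eq 'apple')
```

With the tags stored this way, the filter matches the first and third example entities but not the one tagged "green apple", because the comparison is against whole elements rather than individual words.
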

Solr exact search with a hyphen

I am trying to search for a term in Solr in the Title that contains only the string 1604-04. But the results come back with anything that contains 1604 or 04. What would the syntax be to force solr to search on the exact string of 1604-04?
You can also use the Classic Tokenizer. It behaves like the Standard Tokenizer, with the following exception relevant here:
Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
This means that if someone searches for 1604-04, this tokenizer won't break the search string into two tokens.
If you want exact matches only, use a string field or a text field with a KeywordTokenizer as the tokenizer. These will keep your tokens intact as one single entry, and won't break it up into multiple tokens.
The difference is that if you use a Textfield with a KeywordTokenizer, you can still apply other filters, such as a LowercaseFilter, while a string field will store anything verbatim without any further processing possible.
Your analyzer is splitting "1604-04" into two terms, "1604" and "04". You've already received answers on how to change your analysis to stop it from doing that.
Changing your analysis may not be the best solution (I can't be entirely sure based on what you've written). Using a phrase query would be the usual way to do this. You can make a phrase query by wrapping the term in quotes:
field:"1604-04"
This will still analyze and split it into two terms, but it will look for those terms in sequence. So, that query would match "1604-04" and "1604 04", but not "1604 some other stuff 04".
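The behaviour described above can be imitated with a toy model in Python. This is only an illustration of "terms in sequence"; the `tokenize` function is a stand-in that splits on non-alphanumeric characters, not Solr's actual analyzer.

```python
import re


def tokenize(text):
    """Toy analyzer: lowercase, then split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_match(document, phrase):
    """Return True if the phrase's tokens occur consecutively in the document."""
    doc_tokens = tokenize(document)
    query_tokens = tokenize(phrase)
    if not query_tokens:
        return False
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


print(phrase_match("Report 1604-04 final", "1604-04"))      # True
print(phrase_match("Report 1604 04 final", "1604-04"))      # True
print(phrase_match("1604 some other stuff 04", "1604-04"))  # False
```
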

Solr Search Field Best Practices

I'm using Solr for an enterprise application. So far it works well, as I am using an ngram field to search against. It works correctly for partial queries (matching against the indexed ngrams). But the problem I have is how to enforce exact query matches. For example, the query "Test 1" should match exactly that text when the user enters it in double quotation marks. Currently, because of the tokenizers and filters I use, the double quotation marks get filtered out and there's no difference between the queries "test 1", "tEst 1" and "TEST 1" (that is because of the analyzer chain I use, but it is needed to work with ngrams and partial search).
Currently I'm searching against an ngram query field. What should I do to enforce an exact query match? What is the best practice? What I have in mind is to detect the double quotation marks on the client side and switch the query field to the original field (without ngrams). But I feel there should be a better way of doing this, since my problem is a generic one and Solr is a complete enterprise-level search engine.
You can add another field for this, give it the string fieldType, and index the same content into it.
When you want to perform an exact match, query that field.
When you want to perform a partial search, query the earlier field, which is indexed with ngrams.
Alternatively, here is another approach you can try.
You have defined the current field type using ngrams. In that field type, keep the ngram tokenizer for the index analyzer, and for the query analyzer use only the KeywordTokenizer and a lowercase filter factory.
That way the text is split into ngrams at index time, while the query is kept as a single token at query time.
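To see what the second approach does, here is a toy simulation in Python. It is a sketch under assumptions, not Solr's implementation: the gram sizes (2 to 10) are arbitrary, and it assumes the index analyzer also lowercases.

```python
def index_ngrams(text, min_n=2, max_n=10):
    """Index-time analysis: lowercase, then emit every character n-gram."""
    s = text.lower()
    return {s[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(s) - n + 1)}


def query_token(text):
    """Query-time analysis: keep the whole query as one lowercased token."""
    return text.lower()


def matches(indexed_text, query):
    return query_token(query) in index_ngrams(indexed_text)


print(matches("Test 1 report", "test 1"))  # True
print(matches("Test 1 report", "TEST 1"))  # True, lowercasing removes case differences
print(matches("Test 1 report", "est"))     # True, partial match via n-grams
print(matches("Test 1 report", "test 2"))  # False
```

In this model the query matches only when the whole query string appears as one indexed gram, so a query longer than the largest gram size never matches, and matching remains partial and case-insensitive. If case-sensitive, whole-value matching is required, the separate string field from the first suggestion is the simpler route.
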

Does Solr have an equivalent of "strict order operator" that Sphinx has?

I'm choosing between Solr and Sphinx.
The Sphinx documentation has a section called "5.3. Extended query syntax" which describes the following search operators (among others):
strict order operator (example: aaa << bbb << ccc)
NEAR, generalized proximity operator (example: hello NEAR/3 world NEAR/4 "my test") - search according to distance between words
SENTENCE/PARAGRAPH (example: "Bill Gates" PARAGRAPH "Steve Jobs") - search inside a sentence/paragraph
Does Solr have any similar functionality?
strict order operator: you would need to use SpanQueries for this. To use them from Solr, you could try the SurroundQParser.
NEAR, generalized proximity operator: yes, this is supported; see proximity search.
SENTENCE/PARAGRAPH: not directly. You could try several approaches:
Map sentences or paragraphs to separate documents (and maybe use the Join functionality in 4.0 to link paragraph documents to their parent documents).
Insert information about paragraph boundaries using special tokens or position gaps.

One word phrase search to avoid stemming in Solr

I have stemming enabled in my Solr instance. I had assumed that, in order to perform an exact word search without disabling stemming, it would be as simple as putting the word in quotes. This, however, does not appear to be the case.
Is there a simple way to achieve this?
There is a simple way, using the "required similarity" value of a fuzzy search (see the Lucene Query Syntax documentation).
For example, if I perform this search:
q=field_name:determine
I see results that contain "determine", "determining", "determined", etc. If I then modify the query like so:
q=field_name:determine~1
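I only see results that contain the word "determine". This is because I'm specifying a required similarity of 1, which means "exact match". I can specify this value anywhere from 0 to 1. (This is the older fractional similarity syntax. From Lucene/Solr 4.0 on, the number after ~ on a single term is a maximum edit distance between 0 and 2, so check how your version interprets ~1.)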
Another thing you can do is index the same text without stemming in one field, and with stemming in another. Boost the non-stemmed field & that should prefer exact versions of words to stemmed versions. Of course you could also write your own query parser that directs quoted phrases to the non-stemmed field only.
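A minimal sketch of the two-field idea, written in Python as request parameters for the edismax query parser. The field names `text_exact` and `text_stemmed` and the boost value are assumptions for illustration; it presumes the schema copies the same text into both fields, one analyzed without stemming and one with.

```python
from urllib.parse import urlencode


def stem_aware_params(user_query,
                      exact_field="text_exact",      # hypothetical unstemmed field
                      stemmed_field="text_stemmed",  # hypothetical stemmed field
                      exact_boost=3.0):
    """Search both fields, weighting matches on the unstemmed field higher."""
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": "{}^{} {}".format(exact_field, exact_boost, stemmed_field),
    }


params = stem_aware_params("determine")
print(params["qf"])       # text_exact^3.0 text_stemmed
print(urlencode(params))  # query string to append to the /select URL
```

Documents containing the literal word then score on both fields, while documents that only contain other forms of the word score on the stemmed field alone, so the exact forms tend to rank first.
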
