I am running solr 4.3.0.
I have three simple requirements for my search feature:
Each word is searched on its own, and results are ordered by how many of the words they contain and how close together those words are.
Words between quotation marks ("") are searched together (near each other).
Words preceded by a minus sign (-) must not appear in the results.
I believe this is a very common search definition, so my question is: what is the recommended way to translate a user-generated free-text field into a Solr search HTTP request?
You should look at the edismax query parser, which handles all of the above.
By setting the qf (Query Fields), qs (Query Phrase Slop), pf (Phrase Fields), ps (Phrase Slop), pf2 (Phrase bigram fields), ps2 (Phrase bigram slop), pf3 (Phrase trigram fields) and ps3 (Phrase trigram slop) parameters, you can control which fields are searched and how phrase matches are scored. Usually the words are searched individually on all the fields and then scored according to their proximity.
mm (Minimum Should Match) lets you set how many query terms must match.
Phrase search is supported: quoted words are searched together.
- is treated as a negative operator, so results containing that term are excluded.
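As a rough sketch of how the user's free text could be passed to edismax, the snippet below builds a select URL with the parameters mentioned above. The core URL, the field names (title, body) and the boost and slop values are assumptions for illustration only; they are not taken from the question.

```python
from urllib.parse import urlencode

# Hypothetical Solr core URL; replace with your own host and core name.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_edismax_url(user_text):
    """Build a Solr select URL that hands the raw user text to edismax."""
    params = {
        "defType": "edismax",   # use the edismax query parser
        "q": user_text,         # quotes and leading '-' are interpreted by edismax
        "qf": "title^2 body",   # hypothetical fields searched term by term
        "pf": "title body",     # hypothetical fields used for proximity boosting
        "ps": "3",              # assumed phrase slop for the pf boost
        "mm": "1",              # at least one optional term must match
        "wt": "json",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_edismax_url('solr "query parser" -lucene')
print(url)
```

The user's text goes into q unchanged, so the quoted phrase and the minus sign reach the parser as typed; only URL encoding is applied.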
I currently use Typesense to search in an HTML database. When I search for a term, I would like to retrieve N characters before and N characters after the term found in search.
For example, I search for "query" and this is the sentence that matches:
Let's repeat the query we made earlier with a group_by parameter
I would like to easily retrieve a fixed number of letters (or words) before and after the term, without breaking any words, so that I can show it in the presumably small area where the search results are displayed.
For this particular example, I would be showing:
..repeat the query we made earlier..
Is there a feature like this in Typesense?
I have checked Typesense's documentation, without any luck.
The feature you're referring to is called snippets/highlights, and it's enabled by default. You can control how many words are returned on either side of the matched text using the highlight_affix_num_tokens search parameter, documented in the table here: https://typesense.org/docs/0.23.1/api/search.html#results-parameters
highlight_affix_num_tokens
The number of tokens that should surround the highlighted text on each side. This controls the length of the snippet.
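A minimal sketch of passing this parameter on a search request is shown below. The host, the collection name (pages) and the query_by field (content) are assumptions for illustration; the request would also need your API key in the X-TYPESENSE-API-KEY header.

```python
from urllib.parse import urlencode

# Hypothetical host and collection name; adjust to your own setup.
TYPESENSE_SEARCH_URL = "http://localhost:8108/collections/pages/documents/search"


def build_search_url(term, tokens_each_side):
    """Build a Typesense search URL asking for a given snippet width."""
    params = {
        "q": term,
        "query_by": "content",  # hypothetical field holding the HTML text
        "highlight_affix_num_tokens": tokens_each_side,
    }
    return TYPESENSE_SEARCH_URL + "?" + urlencode(params)


search_url = build_search_url("query", 3)
print(search_url)
```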
The query is: (Profisee)
The indexed field contains exactly the same token as the input query above, but the Solr search returns zero results.
If the query is: (Profisee
then I am able to find the document in the results.
P.S.: I also get the document for queries such as (Pro, (Profi and (Profise.
Here are the attached images.
Exact Query No Result
Inexact Query Got Result
Here is my schema.xml definition for the fieldtype
First, please include the relevant details as text in your question next time. Images are hard to search, make it hard to get an overview of your question, and are hard to read for those who don't have perfect vision.
As for your actual question, the problem is that you have a WhitespaceTokenizer. This will only break words on whitespace, such as a space. The indexed document contains your term as (foo), which means that only (foo) will match (since the tokenizer only breaks on whitespace, and neither ( nor ) is whitespace).
foo (bar) will be indexed as two tokens, foo and (bar). Searching for (bar will match neither.
Use the StandardTokenizer to get the behaviour you want, or use a WordDelimiterGraphFilterFactory to break the word into further tokens.
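The difference can be illustrated with a small stand-alone sketch. The two functions below only approximate the behaviour described above (splitting on whitespace versus splitting on non-word characters); they are not the Lucene tokenizers themselves, and their names are made up for this example.

```python
import re


def whitespace_tokens(text):
    """Approximation of whitespace-only tokenization: punctuation stays attached."""
    return text.split()


def word_tokens(text):
    """Approximation of a tokenizer that also splits on punctuation."""
    return re.findall(r"\w+", text)


indexed = whitespace_tokens("foo (bar)")
print(indexed)                    # parentheses remain part of the second token
print("(bar" in indexed)          # the partial token is not in the index
print(word_tokens("foo (bar)"))   # punctuation is dropped, leaving plain words
```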
I am trying to search for a term in Solr in the Title that contains only the string 1604-04. But the results come back with anything that contains 1604 or 04. What would the syntax be to force solr to search on the exact string of 1604-04?
You can also use the Classic Tokenizer. The Classic Tokenizer preserves the same behavior as the Standard Tokenizer, with the following exception:
Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
This means that if someone searches for 1604-04, this tokenizer won't break the search string into two tokens.
If you want exact matches only, use a string field or a text field with a KeywordTokenizer as the tokenizer. These will keep your tokens intact as one single entry, and won't break it up into multiple tokens.
The difference is that if you use a text field with a KeywordTokenizer, you can still apply other filters, such as a LowercaseFilter, while a string field will store anything verbatim without any further processing possible.
Your analyzer is splitting "1604-04" into two terms, "1604" and "04". You've received answers on how to change your analysis to stop it doing that.
Changing your analysis may not be the best solution (I can't be entirely sure based on what you've written). Using a phrase query would be the usual way to do this. You can use a phrase query by wrapping it in quotes:
field:"1604-04"
This will still analyze and split it into two terms, but it will look for those terms in sequence. So, that query would match "1604-04" and "1604 04", but not "1604 some other stuff 04".
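To make the matching rule concrete, here is a small stand-alone sketch of a phrase match with no slop. The analyze function is a simplified stand-in for the analyzer (it just splits on non-word characters), and both function names are made up for this example; this is not how Lucene implements phrase queries internally.

```python
import re


def analyze(text):
    """Simplified stand-in for the analyzer: split on non-word characters."""
    return re.findall(r"\w+", text)


def phrase_match(document, phrase):
    """Return True if the phrase terms occur consecutively and in order."""
    doc_terms = analyze(document)
    phrase_terms = analyze(phrase)
    n = len(phrase_terms)
    if n == 0:
        return False
    return any(doc_terms[i:i + n] == phrase_terms
               for i in range(len(doc_terms) - n + 1))


print(phrase_match("1604-04", "1604-04"))
print(phrase_match("1604 04", "1604-04"))
print(phrase_match("1604 some other stuff 04", "1604-04"))
```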
I'm using Solr for an enterprise application. So far it works well: I search against an ngram field, and partial queries match correctly against the indexed ngrams. The problem is how to enforce exact query matches. For example, when the user enters the query "Test 1" in double quotation marks, it should match only text that is exactly the same. Currently, because of the tokenizers and filters I use, the double quotation marks are filtered out and there is no difference between the queries "test 1", "tEst 1" and "TEST 1" (this comes from my analyzer chain, which is needed for ngrams and partial search).
Currently I'm searching against an ngram query field. What should I do to enforce an exact query match, and what is the best practice? My current idea is to detect the double quotation marks on the client side and switch the query field to the original field (without ngrams). But I feel there should be a better way, since the problem is generic and Solr is a complete enterprise-level search engine.
You can add another field with string as its fieldType and index the same content into it.
When you want to perform an exact match, query that field.
When you want to perform a partial search, query the earlier field, which is indexed with ngrams.
Alternatively, here is another approach you can try.
Your current field type is defined with ngrams. In that field type, you can use the ngram tokenizer in the index analyzer and only a KeywordTokenizer plus a lowercase filter in the query analyzer.
The text is then split into ngrams at index time, while the query text is not.
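A minimal sketch of the two-field approach, with the routing done on the client side, might look like this. The field names name_exact and name_ngram are hypothetical, and escaping of Solr special characters is left out to keep the example short.

```python
def build_q(user_input, exact_field="name_exact", ngram_field="name_ngram"):
    """Route quoted input to the string field and everything else to the ngram field.

    Field names are hypothetical; special characters are not escaped here.
    """
    text = user_input.strip()
    is_quoted = len(text) >= 2 and text.startswith('"') and text.endswith('"')
    if is_quoted:
        # Keep the quotes so the whole value is matched as one exact string.
        return f"{exact_field}:{text}"
    return f"{ngram_field}:({text})"


print(build_q('"Test 1"'))
print(build_q("Test 1"))
```

Because a string field is not analyzed, the quoted form stays case-sensitive, while the unquoted form still goes through the ngram analyzer chain.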
I have stemming enabled in my Solr instance. I had assumed that, in order to perform an exact word search without disabling stemming, it would be as simple as putting the word in quotes. This, however, does not appear to be the case.
Is there a simple way to achieve this?
There is a simple way, if what you're referring to is the "slop" (required similarity) as part of a fuzzy search (see the Lucene Query Syntax here).
For example, if I perform this search:
q=field_name:determine
I see results that contain "determine", "determining", "determined", etc. If I then modify the query like so:
q=field_name:determine~1
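I only see results that contain the word "determine". This is because I'm specifying a required similarity of 1, which means "exact match". I can specify this value anywhere from 0 to 1. Note that this relies on the older fuzzy syntax: from Lucene/Solr 4.0 onwards, the number after ~ is a maximum edit distance (0, 1 or 2), so there ~1 allows one edit rather than requiring an exact match.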
Another thing you can do is index the same text without stemming in one field and with stemming in another. Boost the non-stemmed field, and exact versions of words should then be preferred over stemmed versions. Of course, you could also write your own query parser that directs quoted phrases to the non-stemmed field only.
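As a sketch of the two-field idea, the request parameters could look like the following when using the edismax parser. The field names text_exact and text_stemmed and the boost of 3 are assumptions for illustration, not values from the answer.

```python
from urllib.parse import urlencode


def build_params(word):
    """Search both fields, weighting the non-stemmed field more heavily."""
    return {
        "defType": "edismax",
        "q": word,
        # Hypothetical fields: the same text indexed without and with stemming.
        "qf": "text_exact^3 text_stemmed",
    }


query_string = urlencode(build_params("determine"))
print(query_string)
```

Documents containing the literal word match in both fields and so tend to score higher, while stemmed variants still match through text_stemmed alone.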