Can someone help me understand Solr search behaviour in this case? - search

The query is: (Profisee)
The indexed field has exactly the same token as the input query above, but the Solr search returns zero results.
If the query is: (Profisee
then I am able to find the document in the results.
P.S.: I also get the document for queries such as (Pro, (Profi and (Profise.
Here are the attached images.
Exact Query No Result
Inexact Query Got Result
Here is my schema.xml definition for the fieldtype

First, please include the relevant details as text in your question next time: images are hard to search, make it hard to get an overview of your question, and are hard to read for those who don't have perfect vision.
For your actual question, the problem is that you have a WhitespaceTokenizer. This will only break words on whitespace, such as a space. The indexed document contains your term as (foo), which means that only (foo) will match (since the tokenizer only breaks on whitespace, and ( and ) aren't whitespace).
foo (bar) will be indexed as two tokens, foo and (bar). Searching for (bar will match neither.
Use the StandardTokenizer to get the behaviour you want, or use a WordDelimiterGraphFilterFactory to break the word into further tokens.
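To make the tokenization difference concrete, here is a minimal Java sketch. It is only a simulation with plain string splitting and does not call Solr or Lucene; the class and method names are made up for illustration, and the second method is a rough stand-in for a tokenizer that drops punctuation, not a reimplementation of StandardTokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified simulation only: this does not call Solr or Lucene.
final class TokenizerSketch {

    // Mimics whitespace-only tokenization: punctuation stays attached to the word.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Rough stand-in for a tokenizer that also drops punctuation:
    // splits on every run of characters that are not letters or digits.
    static List<String> punctuationAwareTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String indexed = "foo (bar)";
        System.out.println(whitespaceTokens(indexed));       // [foo, (bar)]
        System.out.println(punctuationAwareTokens(indexed)); // [foo, bar]
        // An exact term lookup for "(bar" finds nothing among the whitespace tokens.
        System.out.println(whitespaceTokens(indexed).contains("(bar")); // false
    }
}
```

With the whitespace-only split, the parentheses are part of the token, so a query has to reproduce them exactly to match as a plain term.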

Related

Azure-search: How to get documents which exactly contain search term

This question/answer dealt with a pretty similar topic, but I couldn't find the solution I was searching for.
How to practially use a keywordanalyzer in azure-search?
Starting situation:
I created a resource with multiple indexes. One of these indexes contains a Collection(Edm.String) field.
From this field I only want to get documents which exactly contain the search term. For example the field contains documents like these: "Hovercraft zero", "Hovercraft one", "Hovercraft two".
If the search term is "Hover" all three documents should be returned. If the search term is "craft zer" only the document "Hovercraft zero" should be returned. The document shouldn't get a higher score, the desired behaviour is that I only get the "Hovercraft zero" document as result.
Further information:
It is not possible to set the searchMode to all (as was recommended in the question at the top) because I just want this behaviour for this specific field and not for all search queries. It is also not possible to leave it to the user to enter the search term in quotes.
What I have tried so far:
Use the keyword analyzer like it was described in the question on top: no success
Use an indexanalyzer with specific token filters (ngram, lowercase) and a searchanalyzer as a keyword analyzer: no success
Use Charfilters to manipulate the search term and manually set the quotes on the first and last position (craft zer -> "craft zer"). Like Yahnoosh explained in the question on top, the query parser processes the query string before the analyzers are applied. So: no success
Is there any solution for this issue?
Or is there another approach to achieve the desired behaviour?
Hopefully someone can help.
Thanks in advance!
Using your example with three documents: "Hovercraft zero", "Hovercraft one", "Hovercraft two"
Issue a prefix query to find all documents that contain terms that start with "Hover"
search=Hover*
To match the term "craft zer", you need to use the keyword analyzer (or the keyword tokenizer with the lowercase token filter) at indexing time to make sure elements of your string collection are not tokenized. Then at query time you can issue a regex query (note regex queries are much slower than term or prefix queries)
search=/.*craft zer.*/&queryType=full
Also, please use the Analyze API to test your custom analyzer configurations. It will help you make sure the analyzer produces the terms you expect.
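As a rough illustration of why the keyword-style indexing matters for the regex query, here is a small Java sketch. It only simulates the idea with java.util.regex and does not call Azure Search; the class and method names are hypothetical. It assumes that a regex query is tested against each whole indexed term, which is why the pattern needs .* on both sides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Simplified simulation only: this does not call Azure Search.
final class KeywordRegexSketch {

    // Stand-in for keyword tokenizer + lowercase filter: the whole value is one term.
    static String keywordTerm(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Returns the values whose single term is matched in full by the regex,
    // mirroring how a regex query is applied to whole terms.
    static List<String> regexMatches(List<String> values, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String value : values) {
            if (pattern.matcher(keywordTerm(value)).matches()) {
                hits.add(value);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("Hovercraft zero", "Hovercraft one", "Hovercraft two");
        System.out.println(regexMatches(docs, ".*craft zer.*")); // [Hovercraft zero]
        System.out.println(regexMatches(docs, "hover.*").size()); // 3
        // Without the surrounding .* the pattern has to equal the whole term.
        System.out.println(regexMatches(docs, "craft zer"));     // []
    }
}
```

If the field were split into separate words at indexing time, no single term would contain "craft zer", so the pattern could never match.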
Thanks @Yahnoosh for your answer, I found a solution that worked for me.
Short example:
I have an index including three fields (field1, field2, field3). From field3 I want a result where documents exactly contain the search term. From field1 and field2 I want to get a "standard" result.
Solution:
I changed the search query to:
field1:{searchterm} || field2:{searchterm} || field3:"{searchterm}" &queryType=full
Using this search query, field1 and field2 are queried in the "standard" way and field3 is queried with the behaviour I was looking for. Of course there are more efficient and elegant ways out there to solve this issue, but it worked for me.
If anybody has a better solution let me know ;)
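For anyone who wants to build that query string in code, here is a minimal Java sketch. The class and helper names are hypothetical. Two things go beyond the query shown above and are my own assumptions: the term is wrapped in parentheses for field1 and field2 so that a multi-word term stays scoped to the field, and quotes and backslashes are escaped inside the phrase. Other special query characters are not escaped here.

```java
// Hypothetical helper for illustration; it only builds a string and sends nothing.
final class FieldQuerySketch {

    // Escapes backslashes and double quotes so the term is safe inside a quoted phrase.
    static String escapeForPhrase(String term) {
        return term.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds: field1:(term) || field2:(term) || field3:"term"
    // The parentheses are an addition to keep multi-word terms scoped to the field.
    static String buildQuery(String term) {
        String phrase = "\"" + escapeForPhrase(term) + "\"";
        return "field1:(" + term + ") || field2:(" + term + ") || field3:" + phrase;
    }

    public static void main(String[] args) {
        // field1:(craft zer) || field2:(craft zer) || field3:"craft zer"
        System.out.println(buildQuery("craft zer"));
    }
}
```

The resulting string still has to be URL-encoded and sent with queryType=full.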

Azure Search returning inconsistent results

I have a fairly simple index where all 10 or so fields are searchable strings and my searchMode is "all".
For sake of simplicity let's say I issue the following search:
-(x|y|z)
And I get all documents that do not have x, y or z in them.
Let's say I issue the following search:
(i+j)
And I get all docs that contain the terms i and j.
And let's say there is a decent overlap between the docs that are returned by the two searches.
I would have thought that in "all" searchMode if I issue the following:
(i+j) -(x|y|z)
I would receive the subset of i and j that do not contain x, y or z. In other words the results of the combined query would not contain any entries from the results of the individual query -(x|y|z).
But that's not the case.
Either I am misunderstanding the functionality or I am receiving wrong results.
Can someone help explain this to me?
Thanks
Azure Search should give consistent answers for this, if not let us know.
In this case it was an issue with escaping "+" in URLs (see comments). Search text in the URL query string needs to be escaped; for example, + should show up as %2B. It's best to use a library function to escape all of the input search text instead of special-casing any particular character. Most environments have functions for this, and they know which characters need escaping.
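As one possible way to do the escaping in Java, here is a short sketch using java.net.URLEncoder from the standard library (the Charset overload needs Java 10 or later). The class and method names are hypothetical, and the search= and searchMode= parameters in main are only there to show where the encoded text would go.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; it only builds a string and sends nothing.
final class SearchUrlSketch {

    static String encodeSearchText(String searchText) {
        // URLEncoder uses form encoding: a literal '+' becomes %2B and a space becomes '+'.
        // Replace the '+' that stands for a space with %20 so no plus sign remains at all.
        return URLEncoder.encode(searchText, StandardCharsets.UTF_8).replace("+", "%20");
    }

    public static void main(String[] args) {
        String encoded = encodeSearchText("(i+j) -(x|y|z)");
        System.out.println(encoded); // %28i%2Bj%29%20-%28x%7Cy%7Cz%29
        System.out.println("search=" + encoded + "&searchMode=all");
    }
}
```

Without the encoding, the + in (i+j) is typically decoded as a space on the server side, which changes the meaning of the query.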

ElasticSearch incorrectly indexing and querying on non-alphanumeric characters

My ElasticSearch index is not correctly indexing and querying non-alphanumeric characters. Specifically, dots and dashes are causing problems.
If I index a document with the name "O.K. Corral," it should match queries for "OK Corral". Similarly, if I index "Whiskey A Go-Go," I'd like it to match "Whiskey A GoGo" and "Whiskey A Go Go".
Right now, only queries with the correct dots and dashes will return these documents.
I'm hoping the solution will also solve any potential problems with other non-alphanumeric characters, like commas and apostrophes.
It sounds like a job for ElasticSearch token filters, but I haven't been able to find one that does what I'm looking for. Also, I would like to do this within ElasticSearch -- I don't want to write custom string manipulations to normalize data before it gets to my ES index.
Thanks for your help!
You might want to have a look at the Word Delimiter Token Filter. It will at least do what you want with "Whiskey A GoGo" and "Whiskey A Go-Go,". You can check its behaviour in advance using the analyze api.
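To give a feel for what splitting and re-joining word parts buys you, here is a simplified Java sketch. It is a simulation only, not the actual Word Delimiter Token Filter, which has many more rules and options; the class and method names are made up. It assumes lowercasing on both sides and that joined parts are emitted next to the individual parts.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified simulation only: not the real Word Delimiter Token Filter.
final class WordDelimiterSketch {

    // Splits each whitespace-separated word on non-alphanumeric characters and,
    // when a word has several parts, also emits the parts joined together.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            StringBuilder joined = new StringBuilder();
            int parts = 0;
            for (String part : word.split("[^\\p{L}\\p{N}]+")) {
                if (part.isEmpty()) {
                    continue;
                }
                tokens.add(part);
                joined.append(part);
                parts++;
            }
            if (parts > 1) {
                tokens.add(joined.toString());
            }
        }
        return tokens;
    }

    // A query matches when every query token also occurs among the indexed tokens.
    static boolean matches(String indexedText, String queryText) {
        Set<String> indexed = new HashSet<>(analyze(indexedText));
        return indexed.containsAll(analyze(queryText));
    }

    public static void main(String[] args) {
        System.out.println(analyze("Whiskey A Go-Go"));                     // [whiskey, a, go, go, gogo]
        System.out.println(matches("Whiskey A Go-Go", "Whiskey A GoGo"));  // true
        System.out.println(matches("Whiskey A Go-Go", "Whiskey A Go Go")); // true
        System.out.println(matches("O.K. Corral", "OK Corral"));           // true
    }
}
```

Whether the real filter handles a given input the same way depends on its configuration, so it is worth checking with the analyze API as suggested above.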

Solr title search failing

I am indexing the title field for a few products in Solr.
But when I search, I do not get those titles in the response.
For example, I am storing the following title: Baboons Typing Tshirt
But when I search for any of the following, I do not get any results:
1)title:Baboons
2)title:(Baboons Typing Tshirt)
3)title:(Baboons*)
On the other hand, if I search like this, I get a lot of results:
1)title:(Tshirt)
I have indexed many titles containing the word Tshirt, but searching for a specific title fails.
I don't know whether Solr is ignoring the first words or doing something random.
My question is basically: if I have a search title with lots of words, I would like to match it with the title that has the most terms in common with it.
How to do it?
Thanks
Solr works like that by itself. You don't have to change anything.
You have to be careful how you set up your fields in schema.xml, i.e. how analysis is done.
You can use Solr's admin > Analysis interface to see exactly how your title field (when indexing) and your query (when searching) are processed (tokenized, transformed).
Remember that a match requires identical tokens (case and everything) on both sides (index and query).
To open your index and see how Solr has actually indexed your data, use Luke.
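To illustrate the point about identical tokens on both sides, here is a small Java sketch. It is only a simulation with whitespace splitting and optional lowercasing; it does not call Solr, the names are hypothetical, and plain counting of shared terms is a stand-in for Solr's real relevance scoring, which is more involved.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified simulation only: this does not call Solr.
final class TitleMatchSketch {

    // Whitespace tokens of the text, optionally lowercased (as a lowercase filter would do).
    static Set<String> terms(String text, boolean lowercase) {
        Set<String> result = new HashSet<>();
        String input = lowercase ? text.toLowerCase(Locale.ROOT) : text;
        for (String token : input.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Number of distinct query terms that also occur in the title.
    static int sharedTerms(String title, String query, boolean lowercase) {
        Set<String> shared = terms(query, lowercase);
        shared.retainAll(terms(title, lowercase));
        return shared.size();
    }

    // Title with the most terms in common with the query (first one wins on ties).
    static String bestTitle(List<String> titles, String query) {
        String best = null;
        int bestScore = -1;
        for (String title : titles) {
            int score = sharedTerms(title, query, true);
            if (score > bestScore) {
                best = title;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String title = "Baboons Typing Tshirt";
        System.out.println(sharedTerms(title, "baboons typing", false)); // 0
        System.out.println(sharedTerms(title, "baboons typing", true));  // 2
        System.out.println(bestTitle(List.of("Plain Tshirt", title), "baboons typing tshirt")); // Baboons Typing Tshirt
    }
}
```

If the index side and the query side are analyzed differently, for example only one of them is lowercased, the first call shows how every term can silently fail to match.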

WildcardQuery error in Solr

I use Solr to search for documents, and when I try to search using the query "id:*", I get a query parser exception saying that it cannot parse a query with * or ? as the first character.
HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse 'id:*': '*' or '?' not allowed as first character in WildcardQuery
type Status report
message org.apache.lucene.queryParser.ParseException: Cannot parse 'id:*': '*' or '?' not allowed as first character in WildcardQuery
description The request sent by the client was syntactically incorrect (org.apache.lucene.queryParser.ParseException: Cannot parse 'id:*': '*' or '?' not allowed as first character in WildcardQuery).
Is there any patch for getting this to work with just * ? Or is it very costly to do such a query?
If you want all documents, do a query on *:*
If you want all documents with a certain field (e.g. id) try id:[* TO *]
Lucene doesn't allow you to start WildcardQueries with an asterisk by default, because those are incredibly expensive queries and will be very, very, very slow on large indexes.
If you're using the Lucene QueryParser, call setAllowLeadingWildcard(true) on it to enable it.
If you want all of the documents with a certain field set, you are much better off querying or walking the index programmatically than using QueryParser. You should really only use QueryParser to parse user input.
id:[a* TO z*] id:[0* TO 9*] etc.
I just did this in lukeall on my index and it worked, therefore it should work in Solr which uses the standard query parser. I don't actually use Solr.
In base Lucene there is a good reason why you would never query for every document: to query for documents you must first open an IndexReader on the index directory and then apply a query to it. You can skip the query entirely and use the IndexReader methods numDocs() to get a count of all the documents and document(int n) to retrieve any one of them.
If you are just trying to get all documents, Solr does support the *:* query. It's the only time I know of that Solr will let you begin a query with an *. I'm sure you've probably seen this as the default query in the Solr admin page.
If you are trying to do a more specific query with an * as the first character, like say id:*456, then one of the best ways I've seen is to index that field twice: once normally (field name: id), and once with all the characters reversed (field name: reverse_id). Then you could essentially do the query id:*456 by sending the query reverse_id:654* instead. Hope that makes sense.
You can also search the Solr user group mailing list at http://www.mail-archive.com/solr-user@lucene.apache.org/ where questions like this come up quite often.
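The reversed-field idea can be sketched in a few lines of Java. This only shows the string handling, not the Solr indexing itself; the class and method names are hypothetical, and the reverse_ field prefix follows the reverse_id naming used in the answer.

```java
// Hypothetical helper for illustration; it only rewrites strings and does not call Solr.
final class ReverseFieldSketch {

    // Value to store in the reversed field at indexing time, e.g. "123456" -> "654321".
    static String reverse(String value) {
        return new StringBuilder(value).reverse().toString();
    }

    // Rewrites a leading-wildcard pattern such as "*456" into a prefix query
    // on the reversed field, e.g. "reverse_id:654*".
    static String rewriteSuffixQuery(String field, String suffixPattern) {
        if (!suffixPattern.startsWith("*") || suffixPattern.length() < 2) {
            throw new IllegalArgumentException("expected a pattern like *456");
        }
        String suffix = suffixPattern.substring(1);
        return "reverse_" + field + ":" + reverse(suffix) + "*";
    }

    public static void main(String[] args) {
        System.out.println(reverse("123456"));                 // 654321
        System.out.println(rewriteSuffixQuery("id", "*456")); // reverse_id:654*
    }
}
```

The trade-off is a second copy of the field in the index in exchange for turning an expensive leading-wildcard query into an ordinary prefix query.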
The following Solr issue is a request to be able to configure the default lucene query parser.
https://issues.apache.org/jira/browse/SOLR-218
In this issue you can find the following description how to 'patch' Solr. This modification would allow you to start queries with a *.
Jonas Salk: I've basically updated only one Java file: SolrQueryParser.java.
public SolrQueryParser(IndexSchema schema, String defaultField) {
...
setAllowLeadingWildcard(true);
setLowercaseExpandedTerms(true);
...
}
...
public SolrQueryParser(QParser parser, String defaultField, Analyzer analyzer) {
...
setAllowLeadingWildcard(true);
setLowercaseExpandedTerms(true);
...
}
I'm not sure if setLowercaseExpandedTerms is needed...
I'm assuming with id:* you're just trying to match all documents, right?
I've never used solr before, but in my Lucene experience, when ingesting data, we've added a hidden field to every document, then when we need to return every record we do a search for the string constant in that field that's the same for every record.
If you can't add a field like that in your situation, you could use a RegexQuery with a regex that would match anything that could be found in the id field.
Edit: actually answering the question. I've never heard of a patch to get that to work, but I would be surprised if it could even be made to work reasonably well. See this question for a reason why unconstrained PrefixQuery's can cause a problem.
Actually, I have been using a workaround for this. I append a character to the id, eg: A1, A2, etc.
With such values in the field, it is possible to search using the query id:A*
But would love to find whether a true solution exists.
