Distinguishing synonym hits from regular hits in Elasticsearch

We are using Elasticsearch, and as part of a requirement we want to distinguish hits generated by the synonym filter from hits that match without any synonym.
For example, if we had a query such as:
(car AND red) AND (NOT ford)
With synonym: color <-> red
Then we want to know:
[the red car] is a simple hit.
But,
[the color of the car] is a hit caused by the synonym filter.
Our synonym filter is defined as follows:
synonym_filter:
  type: synonym
  synonyms_path: synonyms.txt
  ignore_case: true
  expand: true
  format: solr
Since the synonym filter does its work by modifying the token stream at index time, there might not be a straightforward way to do this. Perhaps an algorithm could be built on top of the highlighting functionality.
I was wondering if anybody has experience with this kind of problem, or if a clever solution exists for this requirement. Thank you in advance.

I believe the best solution would be to search content with synonyms separately from content without.
That is, if you are applying the SynonymFilter at index time, then index the content twice, once without synonyms, and once with synonyms (and possibly any other filters to facilitate a broader search). You could then either run separate queries against the two fields, or you could run a single query with matches against the more direct field significantly boosted.
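As a rough sketch of the two-field idea, the Python below only builds request bodies as plain dictionaries and classifies a returned hit; it does not talk to a cluster. The field names `body` and `body.synonyms`, the analyzer names, and the use of named queries (`_name`, reported back per hit as `matched_queries`) are assumptions made for illustration, and the exact mapping syntax depends on your Elasticsearch version.

```python
# Hypothetical names throughout: "body", "body.synonyms", "plain", "with_synonyms".

def index_body():
    """Index settings: one field analyzed without synonyms, one sub-field with."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Same filter definition as in the question.
                    "synonym_filter": {
                        "type": "synonym",
                        "synonyms_path": "synonyms.txt",
                        "ignore_case": True,
                        "expand": True,
                        "format": "solr",
                    }
                },
                "analyzer": {
                    "plain": {"tokenizer": "standard", "filter": ["lowercase"]},
                    "with_synonyms": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "synonym_filter"],
                    },
                },
            }
        },
        "mappings": {
            "properties": {
                "body": {
                    "type": "text",
                    "analyzer": "plain",
                    "fields": {
                        "synonyms": {"type": "text", "analyzer": "with_synonyms"}
                    },
                }
            }
        },
    }


def build_query(query_text):
    """Run the same query against both fields, boosting the direct field."""
    return {
        "bool": {
            "should": [
                {"query_string": {"query": query_text, "default_field": "body",
                                  "boost": 5.0, "_name": "direct"}},
                {"query_string": {"query": query_text,
                                  "default_field": "body.synonyms",
                                  "_name": "synonym"}},
            ]
        }
    }


def classify_hit(hit):
    """Label a hit by which named query it matched."""
    matched = hit.get("matched_queries", [])
    if "direct" in matched:
        return "direct"
    if "synonym" in matched:
        return "synonym"
    return "unknown"
```

With this layout, [the red car] would be expected to match both clauses and be labelled "direct", while [the color of the car] would match only the synonym sub-field.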

Returning accented as well as normal result set via azure search filters

Does anyone know how to return both the normal and the accented result set via an Azure Search filter? For example, the filter query below returns a record named unicorn when I look for a record with the name unicorn.
var result = searchServiceClient.Documents.SearchAsync<myDto>("*", new SearchParameters
{
    SearchFields = new List<string> { "Name" },
    Filter = "Name eq 'unicorn'"
});
This is all good, but I want to write a filter that returns the record named unicorn as well as the record named únicorn (note the accented first character), provided that both records exist.
This can be achieved when searching for such a name via the search query, using a language analyzer or the standard ASCII folding analyzer as mentioned in this link. What I am struggling to find out is how to implement the same thing with Azure Search filters.
Please let me know if anyone has a solution for this.
Filters are applied on the non-analyzed representation of the data, so I don't think there's any way to do any type of linguistic analysis on filters. One way to work around this is to manually create a field that only does lowercasing and ASCII folding (no tokenization), and then issue Lucene queries that look like this:
"normal search query terms" AND customFilterColumn:"filtérValuèWithÄccents"
Basically the document would both need to match the search terms in any field AND also match the filter term in the “customFilterColumn”. This may not be sufficient for your needs, but at least you understand the art of the possible.
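To illustrate what "lowercasing plus ASCII folding, no tokenization" could look like when you populate such a field yourself, here is a minimal Python sketch. The function name `fold` is hypothetical, and Unicode decomposition is only an approximation of an ASCII folding filter: characters without a decomposition, such as ø or ß, pass through unchanged.

```python
import unicodedata


def fold(value):
    """Lowercase a string and strip combining accents, without tokenizing it."""
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Both spellings collapse to the same stored value.
print(fold("unicorn"))
print(fold("únicorn"))
```

The same function would have to be applied to the value you compare against at query time, so that both sides of the comparison are normalized identically.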
Filters won't work for this unless you specify all the possibilities in advance, for example:
$filter=name eq 'unicorn' or name eq 'únicorn'
You would be better off using a different analyzer that folds accented characters to their base form. As another possibility, you can try fuzzy search:
search=unicorn~&highlight=Name

Can someone help me understand Solr search behaviour in this case?

The query is: (Profisee)
The indexed field has exactly the same token as the query above, but the Solr search returns zero results.
If the query is: (Profisee
then I am able to find the document in the results.
P.S.: I also get the document for queries such as (Pro, (Profi and (Profise.
Here are the attached images.
Exact Query No Result
Inexact Query Got Result
Here is my schema.xml definition for the fieldtype
First, please include the relevant details as text in your question next time. Images are hard to search, make it hard to get an overview of your question, and are hard to read for those who don't have perfect vision.
For your actual question, the problem is that you have a WhitespaceTokenizer. This will only break words on whitespace. The indexed document contains your term as (foo), which means that only (foo) will match (since the tokenizer only breaks on whitespace, and neither ( nor ) is whitespace).
foo (bar) will be indexed as two tokens, foo and (bar). Searching for (bar will match neither.
Use the StandardTokenizer to get the behaviour you want, or use a WordDelimiterGraphFilterFactory to break the word into further tokens.
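The difference can be mimicked outside Solr with a small Python sketch. This is not Solr's implementation: `whitespace_tokens` and `word_tokens` are hypothetical helpers, and the regular expression is only a crude stand-in for what the StandardTokenizer does with punctuation.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation stays attached to the word."""
    return text.split()


def word_tokens(text):
    """Rough approximation of a tokenizer that drops punctuation."""
    return re.findall(r"\w+", text)


indexed = whitespace_tokens("foo (bar)")
print(indexed)                       # ['foo', '(bar)']
print("(bar" in indexed)             # False: "(bar" equals neither token
print(word_tokens("foo (bar)"))      # ['foo', 'bar']
```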

Solr Fuzzy search (max 2 edits)

I am using Solr 6.0.0 with the data-driven configuration. Most of the configuration is standard.
I have a document in Solr with
name:"aquickbrownfox"
Now if I do a fuzzy search like:
name:aquickbrownfo~0.7
OR
name:aquickbrownf~0.7
It lists out the record in the results.
But if I do a search like:
name:aquickbrown~0.7
It does not list the record.
Does this have something to do with maxEdits in solrconfig.xml, which is set to 2?
I tried increasing it, but then I could not create a collection with that configuration. It gave an error:
ERROR: Error CREATEing SolrCore 'my-search': Unable to create core
[my-search] Caused by: Invalid maxEdits
A maximum of 2 edits seems to be a serious limitation. I wonder what the use of passing a fractional value after the ~ operator is.
My use case:
I have a contact database. I am supposed to detect duplicates based on three parameters: Name, Email and Phone, so I rely on Solr for fuzzy search. Email and Phone are relatively easy to handle with simple assumptions. Name seems to be a bit tricky. For each word in the Name, I plan to do a fuzzy search. I expected the optional parameter after ~ to work without the maxEdits distance limitation.
The documentation no longer suggests using a fractional value after the tilde - see http://lucene.apache.org/core/4_6_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Fuzzy_Searches for more information.
However, you are correct that only 2 changes are allowed to be made to the search string in order to carry out a fuzzy search. I would guess this limitation strikes a balance between efficiency and usefulness.
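The two-edit limit explains the behaviour in the question. As a worked check, the sketch below computes the plain Levenshtein distance between the indexed value and each search term. It is an illustration written for this answer, not Lucene's implementation (Lucene can also count a transposition as a single edit, which makes no difference here because every change is a deletion).

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


indexed = "aquickbrownfox"
for term in ("aquickbrownfo", "aquickbrownf", "aquickbrown"):
    print(term, edit_distance(term, indexed))
# aquickbrownfo 1
# aquickbrownf 2
# aquickbrown 3
```

The first two terms are within two edits of the indexed value, so they match; the third needs three edits and therefore does not.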
The maxEdits parameter in solrconfig.xml applies to the DirectSpellChecker configuration, and doesn't affect your searching, unless you're using the spell checker.
For your use case, your best approach may be to index the name field twice, using different field configurations: one using a simple set of analyzers and filters (i.e. StandardTokenizerFactory, StandardFilterFactory, LowerCaseFilterFactory), and the other using a phonetic matcher such as the Beider-Morse filter. You can use the first field to carry out fuzzy searches, and the second field to look for names that may be spelled differently but sound the same as the name being checked.

How can I retrieve meta information about synonym matches for a Solr query?

We are using Solr to provide search functionality for our site, and I have the following requirement which has me stumped:
Given the search term "2011 Bolinger", identify that "Bollinger" (note the different spelling) is a valid value for the Producer facet, and automatically apply facet filtering for this value.
It's the fuzzy matching of the search term which I'm stuck on. Does anyone know of a way to include information in a Solr response about synonym matches which have occurred for a query during the search (i.e. a way for Solr to tell me that it saw the word 'Bollinger' in a document and recognised it as equivalent to 'Bolinger')? From what I've read so far of the Solr documentation I can't see a way to do this, but I may have missed something.

Modify haystack query syntax?

Is it possible to modify or extend how haystack understands a query?
For example, I'm looking at integrating haystack with an OSQA-based site to get SO-style search -- a search where regular keywords search question/answer/comment text, but where syntax like "[tag]" is understood to be restricted to the question's tags field. At some point we might want to add other goodies like "user:eternicode" and "score:0", but for now keywords and tags are the must-haves.
Unfortunately, it's not as simple as regexing the tags out of the query string and using that to filter on the tags field, because we want all the complexity of AND, OR, NOT, and arbitrary grouping to apply.
Is this possible with haystack? Better yet, has anyone done it before?
It seems there is no way to customize how Haystack's auto_query works, so what we ended up doing was preparsing the search query to extract tags and other custom syntaxes, performing the auto_query with the leftovers, and then applying the custom syntaxes as extra filters on the auto_query results.
In order to do this, though, we had to simplify our requirements and drop the OR requirement, so all terms are only ANDed now -- that simplified a lot of things (for example, grouping is now unnecessary).
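A minimal sketch of such a preparsing step is shown below. It is not the code we actually used: the function name `preparse`, the regular expressions, and the choice of `user` and `score` as recognised prefixes are assumptions for illustration, and it makes no Haystack calls. The leftover keywords would be passed to auto_query, and the extracted tags and fields applied as extra filters afterwards.

```python
import re

# "[tag]" syntax and "name:value" syntax for a fixed set of hypothetical fields.
TAG_RE = re.compile(r"\[([^\[\]]+)\]")
FIELD_RE = re.compile(r"\b(user|score):(\S+)")


def preparse(query):
    """Split a raw query into (keywords, tags, fields)."""
    tags = TAG_RE.findall(query)
    query = TAG_RE.sub(" ", query)

    fields = dict(FIELD_RE.findall(query))
    query = FIELD_RE.sub(" ", query)

    keywords = " ".join(query.split())  # collapse leftover whitespace
    return keywords, tags, fields


print(preparse("django [python] user:eternicode slow query [orm]"))
# ('django slow query', ['python', 'orm'], {'user': 'eternicode'})
```

Because every extracted term is simply ANDed with the keyword results, this only works once OR and grouping have been dropped, as described above.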
