I'm trying to improve the ranking of results that come back from an Azure Search index. The search index basically contains a list of band names and members.
Exact matches are important to us, but so are partial matches, and likewise partial words within the query.
Take the example of trying to find a band called Black Flag, where the user has got as far as typing black fl in the input area.
I currently structure the query as: "black fl"|black fl* (exact match on whole phrase and partial match on fl).
This brings back the following results in the following order:
Flourescent Black
Florence Black
Black Flag
At the moment, there is a single text field being searched against, using the Standard Lucene analyzer.
I've looked at Scoring Profiles, but with such a small dataset and so few fields available, these don't appear to be relevant.
I have also explored the full Lucene syntax, adding things like ^10 to the word black to boost its importance, and have restructured my query string in many ways, none of which give the effect I'm after.
I would expect Black Flag to match better, as its word order is closer to the query than that of the results above it.
Is there a way to change the scoring method to handle this? I now imagine that I'm looking at dealing with a custom analyzer (https://learn.microsoft.com/en-gb/azure/search/index-add-custom-analyzers), but I'm not really sure where to start with this or how I would want the analyzer to behave.
Any thoughts or examples on how best to handle this scenario would be greatly appreciated.
EDIT - More Info
The current solution consists of the following, but it involves manipulating the results that come back from the search index.
The index is created as follows:
{
  "fields": [
    {"name": "id", "type": "Edm.String", "key": true, "filterable": false, "searchable": false, "sortable": false, "facetable": false},
    {"name": "entityId", "type": "Edm.Int64", "filterable": false, "searchable": false, "sortable": false, "facetable": false},
    {"name": "entityType", "type": "Edm.Int32", "sortable": false, "facetable": false},
    {"name": "sortableName", "type": "Edm.String", "filterable": false, "facetable": false, "searchable": false},
    {"name": "name", "type": "Edm.String", "filterable": false, "retrievable": false, "sortable": false, "facetable": false, "analyzer": "keyword_analyzer"},
    {"name": "town", "type": "Edm.String", "filterable": false, "retrievable": false, "sortable": false, "facetable": false, "analyzer": "keyword_analyzer"},
    {"name": "tags", "type": "Collection(Edm.String)", "filterable": false, "retrievable": false, "sortable": false, "facetable": false, "analyzer": "keyword_analyzer"}
  ],
  "defaultScoringProfile": "default_score",
  "scoringProfiles": [
    {
      "name": "default_score",
      "text": {
        "weights": {
          "name": 3.5,
          "tags": 2,
          "town": 1
        }
      }
    }
  ],
  "analyzers": [
    {
      "name": "keyword_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "charFilters": [
        "map_dash",
        "map_space"
      ],
      "tokenizer": "keyword_tokenizer",
      "tokenFilters": [
        "asciifolding",
        "lowercase",
        "trim",
        "delimiter_filter"
      ]
    }
  ],
  "charFilters": [
    {
      "name": "map_dash",
      "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
      "mappings": ["-=>_"]
    },
    {
      "name": "map_space",
      "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
      "mappings": ["\\u0020=>_"]
    }
  ],
  "tokenizers": [
    {
      "name": "keyword_tokenizer",
      "@odata.type": "#Microsoft.Azure.Search.KeywordTokenizerV2"
    }
  ],
  "tokenFilters": [
    {
      "name": "stopwords_filter",
      "@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
      "removeTrailing": false
    },
    {
      "name": "delimiter_filter",
      "@odata.type": "#Microsoft.Azure.Search.WordDelimiterTokenFilter",
      "generateWordParts": true,
      "generateNumberParts": true,
      "splitOnCaseChange": false,
      "preserveOriginal": true,
      "splitOnNumerics": false
    }
  ]
}
Before uploading data to the index we need to normalize it - Black Flag becomes black flag. We also have to remove any preceding the, so The Killers becomes killers - and any non-standard characters are replaced to remove accents etc.
When performing a search, in code we now need to remove any preceding the if it exists and perform the same normalization - I can accept doing this.
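For illustration, a document uploaded after this normalization might look like the following (a sketch; the values and the entityType code are made up):
{
  "value": [
    {
      "@search.action": "mergeOrUpload",
      "id": "1",
      "entityId": 42,
      "entityType": 1,
      "sortableName": "Black Flag",
      "name": "black flag",
      "town": "hermosa beach",
      "tags": ["punk", "hardcore"]
    }
  ]
}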
We then build up the query which changes depending upon how many words there are in the initial query.
// queryPhrase has already been normalized as described above.
// SplitToList is our own extension method; EscapeSpecialCharacters escapes Lucene special characters.
var search = new StringBuilder();
List<string> splitQ = queryPhrase.SplitToList(" ");
if (splitQ.Count > 0)
{
    if (splitQ.Count == 1)
    {
        // Single word: exact term OR prefix match.
        search.Append($"(\"{splitQ[0]}\" || {this.EscapeSpecialCharacters(splitQ[0])}*)");
    }
    else
    {
        for (int i = 0; i < splitQ.Count; i++)
        {
            if (i == splitQ.Count - 1)
            {
                // Final word: required prefix match.
                search.Append($"+{this.EscapeSpecialCharacters(splitQ[i])}*");
            }
            else
            {
                // Earlier words: required exact terms.
                search.Append($"+\"{splitQ[i]}\"");
            }
        }
        // Wrap with the exact-phrase alternative.
        search.Insert(0, $"(\"{queryPhrase}\"||(");
        search.Append("))");
    }
}
A single word black would mean the main query is: ("black" || black*)
However, as soon as additional words come in it has to change. black fl becomes: ("black fl"||(+"black"+fl*))
A three word search would be: ("one two three"||(+"one"+"two"+three*))
On top of this we then add any filter options.
The search is sent to the index with the query type set to full.
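Putting it together, the POST body for the two-word example looks something like this (a sketch; the filter value is made up):
{
  "search": "(\"black fl\"||(+\"black\"+fl*))",
  "queryType": "full",
  "filter": "entityType eq 1"
}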
The above has got us as close as we can to having decent and accurate results. However, the scoring is all messed up.
Processing the results...
Firstly we normalize the score given by the Azure Search index. Depending upon the search query the scores range massively, so we express each score as a percentage of the maximum-scoring item.
We then apply our own enhancer to the score based on the tag or name field. An exact match to the query gives an enhancement of 5, and a starts-with match gets an enhancement of 3.
We then provide a score that uses the enhancement to increase the results position in the rankings.
This final section of processing the results seems like something that should be done automatically within the search index system.
Related
I have a field in Azure Cognitive Search that has special characters in it.
They look like this: some_id: 'SOME*STUFF*123'
I'm trying to have a "startsWith" query, but that doesn't return anything as soon as the regex tries to match anything that goes further than the \*.
After a bit of googling I found out it's the analyzer, possibly breaking apart strings at '*'.
So I changed the analyzer to "keyword", as I read multiple times it's the analyzer you are supposed to use for this.
The new config looks like this:
{
  "name": "some_id",
  "type": "Edm.String",
  "facetable": false,
  "filterable": true,
  "key": false,
  "retrievable": true,
  "searchable": true,
  "sortable": true,
  "analyzer": "keyword",
  "indexAnalyzer": null,
  "searchAnalyzer": null,
  "synonymMaps": [],
  "fields": []
},
My request looks like this:
{
  "count": true,
  "skip": 0,
  "top": 5,
  "searchMode": "any",
  "queryType": "full",
  "search": "some_id:/SO(.*)/" // SOME\\*S(.*) also doesn't work
}
I get zero matches.
With the standard analyzer I started getting no matches as soon as I had a \\* in my regex (I escaped them with \\).
Clarification on Requirements:
I cannot change any data; the values (including the \*) cannot be changed. I'm trying to have the whole field matched as a single token that I can run startsWith on.
For example, the regex /SOME\\*ST(.*)/ is supposed to literally return entries that fully match it. No magic with separators or tokens, simply the whole value as a single token that I can run startsWith on.
What I'm trying to say is, take for example JavaScript, I want the exact same results you would get from string.startsWith(value).
I'm guessing there is either something wrong with my config, or with my requests, can anyone help me?
IMHO, you should work with a different separator. For example:
Field1 (FROM) | Field2 (TO)
SOME*STUFF*123 | SOME||STUFF||123
Then use a custom analyzer to break terms at every ||. Additionally, you can work with an n-gram tokenizer and configure it to emit tokens every 3 characters (see the sketch after the samples below).
Samples:
SOM
OME
STU
TUF
UFF
123
Then search using:
SOM*
and it should return the data you're looking for. It would be better if you could provide more details about your content and give us samples, but this answer should point you to the result you're looking for.
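If you go down the n-gram route, a minimal custom analyzer definition might look like this (a sketch; the names are illustrative, and NGramTokenizerV2 is set to 3-character grams to match the samples above):
"analyzers": [
  {
    "name": "trigram_analyzer",
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "tokenizer": "trigram_tokenizer",
    "tokenFilters": ["lowercase"]
  }
],
"tokenizers": [
  {
    "name": "trigram_tokenizer",
    "@odata.type": "#Microsoft.Azure.Search.NGramTokenizerV2",
    "minGram": 3,
    "maxGram": 3
  }
]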
I'm trying to best index contents of e-mail messages, subjects and email addresses. E-mails can contain both text and HTML representation. They can be in any language so I can't use language specific analysers unfortunately.
As I am new to this I have many questions:
First I used the Standard Lucene analyser, but after some testing and checking what each analyser does, I switched to the "simple" analyser. The standard one didn't allow me to search by domain in user@domain.com (it sees user and domain.com as tokens). Is "simple" the best I can use in my case?
How can I handle the HTML contents of e-mails? I thought this should be possible in Azure Search, but right now I think I would need to strip the HTML tags myself.
My users aren't tech savvy, and I assumed the "simple" query type would be enough for them. I expect them to type a word or two and find messages containing this word, or containing words starting with this word. From my tests it looks like I need to append * to their queries to get "starting with" to work?
It would help if you included an example of your data and how you index and query. What happened, and what did you expect?
The standard Lucene analyzer will work with your user@domain.com example. It is correct that it produces the tokens user and domain.com. But the same happens when you query, so you will get back records containing the tokens user and domain.com.
CREATE INDEX
"fields": [
{"name": "Id", "type": "Edm.String", "searchable": false, "filterable": true, "retrievable": true, "sortable": true, "facetable": false, "key": true, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null, "synonymMaps": [] },
{"name": "Email", "type": "Edm.String", "filterable": true, "sortable": true, "facetable": false, "searchable": true, "analyzer": "standard"}
]
UPLOAD
{
  "value": [
    {
      "@search.action": "mergeOrUpload",
      "Id": "1",
      "Email": "user@domain.com"
    },
    {
      "@search.action": "mergeOrUpload",
      "Id": "2",
      "Email": "some.user@some-domain.com"
    },
    {
      "@search.action": "mergeOrUpload",
      "Id": "3",
      "Email": "another@another.com"
    }
  ]
}
QUERY
Query, using queryType=full and searchMode=all.
https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{INDEX_NAME}}/docs?search=user@domain.com&$count=true&$select=Id,Email&searchMode=all&queryType=full&api-version={{API-VERSION}}
Which produces results as expected (all records containing user and domain.com):
{
  "@odata.context": "https://<your-search-env>.search.windows.net/indexes('dg-test-65392234')/$metadata#docs(*)",
  "@odata.count": 2,
  "value": [
    {
      "@search.score": 0.51623213,
      "Id": "1",
      "Email": "user@domain.com"
    },
    {
      "@search.score": 0.25316024,
      "Id": "2",
      "Email": "some.user@some-domain.com"
    }
  ]
}
If your expected result is to only get the record above where the email matches completely, you could instead use a phrase search. I.e. replace the search parameter above with search="user@domain.com" and you would get:
{
  "@search.score": 0.51623213,
  "Id": "1",
  "Email": "user@domain.com"
}
Alternatively, you could use the keyword analyzer.
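Assigning it is just a matter of setting the analyzer on the field, e.g. (a sketch):
{"name": "Email", "type": "Edm.String", "searchable": true, "filterable": true, "analyzer": "keyword"}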
ANALYZE
You can compare the different analyzers directly via REST. Using the keyword analyzer on the Email property will produce a single token.
{
  "text": "some-user@some-domain.com",
  "analyzer": "keyword"
}
Results in the following tokens:
"tokens": [
{
"token": "some-user#some-domain.com",
"startOffset": 0,
"endOffset": 25,
"position": 0
}
]
Compare this to the standard analyzer, which does a decent job for most types of unstructured content.
{
  "text": "some-user@some-domain.com",
  "analyzer": "standard"
}
Which produces reasonable results for cases where the email address was part of some generic text.
"tokens": [
{
"token": "some",
"startOffset": 0,
"endOffset": 4,
"position": 0
},
{
"token": "user",
"startOffset": 5,
"endOffset": 9,
"position": 1
},
{
"token": "some",
"startOffset": 10,
"endOffset": 14,
"position": 2
},
{
"token": "domain.com",
"startOffset": 15,
"endOffset": 25,
"position": 3
}
]
SUMMARY
This is a long answer already, so I won't cover your other two questions in detail. I would suggest splitting them into separate questions so they can benefit others.
HTML content: You can use a built-in HTML analyzer that strips HTML tags. Or you can strip the HTML yourself using custom code. I typically use Beautiful Soup for cases like these or simple regular expressions for simpler cases.
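As a sketch of the built-in route, a custom analyzer using the predefined html_strip char filter could look like this (the analyzer name is illustrative):
"analyzers": [
  {
    "name": "html_text_analyzer",
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "charFilters": ["html_strip"],
    "tokenizer": "standard_v2",
    "tokenFilters": ["lowercase"]
  }
]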
Wildcard search: Usually, users don't expect wildcards to be appended automatically. The only application that does this is the Outlook client, which destroys precision. When I search for "Jan" (a common name), I annoyingly get all emails sent in January(!). And when I search for Dan (again, a name), I also get all emails from Danmark (Denmark).
Everything in search is a trade-off between precision and recall. In your first example with the email address, your expectation was heavily geared toward precision. But, in your last wildcard question, you seem to prefer extreme recall with wildcards on everything. It all comes down to your expectations.
My issue is that when we do a first-name search using a fuzzy search (with a distance of 2 characters on the first name), it doesn't seem to bring back all possibilities.
QueryType is Full
QueryString - "FirstName:gra~2 AND (LastName: \"*****\" OR LastName: /.*\"*****\".*/)"
I'm using an exact match OR a contains on the lastname for this example, this will stay constant across the examples
Results:
If I search FirstName:gre~2 in an Azure Search query string we get back:
Greg
Gary
Gene
If I search FirstName:gra~2 we get back:
Gina
Gary
If I search FirstName:grag~2 we get back:
Greg
Gary
We know that Azure fuzzy search uses the Damerau-Levenshtein distance, and it seems like from "gra" both "gina" and "greg" would be 2 characters' difference, yet only one is showing up. Also, "grag" in theory should return "gina" as well.
I'm wondering if anyone has an explanation for this since it seems inconsistent
I used this to verify the "distance" between the string "gra" and the strings "greg" and "gina":
http://fuzzy-string.com/Compare/
Here's the link to the Azure documentation on Lucene syntax:
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax
These are both of the field definitions
{
  "name": "FirstName",
  "type": "Edm.String",
  "searchable": true,
  "filterable": true,
  "retrievable": true,
  "sortable": true,
  "facetable": false,
  "key": false,
  "indexAnalyzer": null,
  "searchAnalyzer": null,
  "analyzer": "standard.lucene",
  "synonymMaps": []
},
{
  "name": "LastName",
  "type": "Edm.String",
  "searchable": true,
  "filterable": true,
  "retrievable": true,
  "sortable": true,
  "facetable": false,
  "key": false,
  "indexAnalyzer": null,
  "searchAnalyzer": null,
  "analyzer": "standard.lucene",
  "synonymMaps": []
}
Results seem to be the same regardless of whether the LastName clause is used or not.
I would also expect those terms to match your fuzzy query. Just to do a sanity check before we dig deeper, can you confirm what your analyzer settings are (both at query time and indexing time)? I just want to confirm all the terms you mentioned are actually tokenized and indexed exactly the way you expect (and also whether their casing gets normalized the way you would expect). You can use the Analyze API (https://learn.microsoft.com/en-us/rest/api/searchservice/test-analyzer) to confirm how those terms are tokenized. You also mentioned your query includes an AND clause matching on another field (LastName); can you confirm that even without that second clause, the results on FirstName are still not what you expect? I just want to make sure we eliminate all external factors outside of the actual edit distance algorithm.
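For example, an Analyze request along these lines (service, index and api-version are placeholders) shows exactly how a term like Gina gets tokenized:
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=[api-version]
{
  "text": "Gina",
  "analyzer": "standard.lucene"
}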
Update: I tried it on my side using the default analyzers and without the LastName clause. Searching for "gra~2" successfully returns "Greg", "Gary" and "Gina". I get the same results when I search for "gre~2" (as you did). Searching for "grag~2" only returns "Greg" and "Gary". "Gina" is not returned, but to me that seems expected (the edit distance seems to be 3).
A while ago I set up a search index for a web application. One of the requirements was to return partial matches of the search terms. For instance, searching for Joh should find John Doe. The most straightforward way to implement this was to append a * to each search term before posting the query to Azure Search. So if a user types Joh, we actually ask Azure Search to search for Joh*.
One limitation of this approach is that all the matches of Joh* have the same search score. Because of this, sometimes a partial match appears higher in the results than an exact match. This is documented behavior, so I guess there is not much I can do about it. Or can I?
While my current way of returning partial matches seems like a hack, it has worked well enough in practice that I didn't bother finding out how to properly solve the problem. Now I have the time to look into it, and my instinct says there must be a "proper" way to do this. I have read the word "ngrams" here and there, and it seems to be part of the solution. I could probably find a passable solution after some hours of hacking on it, but if there is any "standard way" to achieve what I want, I would rather follow that path instead of using a home-grown hack. Hence this question.
So my question is: is there a standard way to retrieve partial matches in Azure Search, while giving exact matches a higher score? How should I change the code below to make Azure Search return the search results I need?
The code
Index definition, as returned by the Azure API:
{
  "name": "test-index",
  "defaultScoringProfile": null,
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "searchable": false,
      "filterable": true,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": true,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "synonymMaps": []
    },
    {
      "name": "name",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "sortable": true,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "corsOptions": null,
  "suggesters": [],
  "analyzers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": []
}
The documents, as posted to the Azure API:
{
  "value": [
    {
      "@search.action": "mergeOrUpload",
      "id": "1",
      "name": "Joh Doe"
    },
    {
      "@search.action": "mergeOrUpload",
      "id": "2",
      "name": "John Doe"
    }
  ]
}
Search query, as posted to the Azure API:
{
  "search": "Joh*"
}
Results, where the exact match appears second, while we would like it to appear first:
{
  "value": [
    {
      "@search.score": 1,
      "id": "2",
      "name": "John Doe"
    },
    {
      "@search.score": 1,
      "id": "1",
      "name": "Joh Doe"
    }
  ]
}
This is a very good question, and thanks for providing a detailed explanation. The easiest way to achieve that would be to use term boosting on the actual term and combine it with a wildcard query. You can modify the query in your post to:
search=Joh^10 OR Joh*&queryType=full
This will score the documents that match Joh exactly higher. If you have more complicated requirements, you can look at constructing a custom analyzer with n-grams to support partial search; a sketch follows.
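As a sketch of the n-gram approach (names are illustrative): index the field with an edge n-gram analyzer but search it with the standard analyzer, so typed prefixes match indexed grams without a trailing wildcard, and you can still boost the exact term as shown above:
{
  "fields": [
    {"name": "name", "type": "Edm.String", "searchable": true, "indexAnalyzer": "prefix_analyzer", "searchAnalyzer": "standard.lucene"}
  ],
  "analyzers": [
    {
      "name": "prefix_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard_v2",
      "tokenFilters": ["lowercase", "edge_ngram_filter"]
    }
  ],
  "tokenFilters": [
    {
      "name": "edge_ngram_filter",
      "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
      "minGram": 2,
      "maxGram": 20
    }
  ]
}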
I'm trying to cater for the following example with Azure Search.
Given the following index schema:
{
  "name": "mySchema",
  "fields": [
    {
      "name": "Id",
      "type": "Edm.String",
      "key": true,
      "searchable": false,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "retrievable": true,
      "suggestions": false
    },
    {
      "name": "StateId",
      "type": "Edm.Int32",
      "key": false,
      "searchable": false,
      "filterable": true,
      "sortable": false,
      "facetable": false,
      "retrievable": true,
      "suggestions": false
    },
    {
      "name": "Location",
      "type": "Edm.GeographyPoint",
      "key": false,
      "searchable": false,
      "filterable": true,
      "sortable": true,
      "facetable": false,
      "retrievable": true,
      "suggestions": false
    }
  ]
}
I want to be able to order my results firstly on the StateId field, and then by the distance from a given lat/long location.
I realise that I am able to achieve the first part by using a $filter=StateId eq x component when querying. However, I still want to receive results (with a lower score) that are not in the provided StateId but are within a given distance of a provided location.
I also recognise that this looks like something that should be achievable with a custom Scoring Profile. Using a Scoring Profile, I would expect to be able to return something like this:
[
  {
    "@search.score": 100.0,
    "Id": "111",
    "StateId": "123",
    "Location": {"type": "Point details...."}
  },
  {
    "@search.score": 100.0,
    "Id": "222",
    "StateId": "123",
    "Location": {"type": "Point details...."}
  },
  {
    "@search.score": 50.0,
    "Id": "333",
    "StateId": "789",
    "Location": {"type": "Point details...."}
  }
]
However, I am not able to search on the StateId field, as this is an Edm.Int32 value, so I do not believe using a Scoring Profile would be a viable solution.
Anyone come across a similar scenario?
EDIT:
Trying to explain just a bit further - if I were to express this in pseudo-SQL, this is basically the case I'm trying to handle:
ORDER BY (CASE WHEN StateId = #StateId THEN 1 ELSE 0 END) DESC, Location
We don't currently support modeling this scenario with scoring profiles. This has come up multiple times though, so it's something we'd like to add.
In the meantime, one thing you can do as a workaround is to add the StateId value to one of the searchable fields (e.g. just append it at the end of the text). Then during search, include the state id as part of the search string, which should skew results towards documents that match the state id (or that are a very good match without it, which might be good relevance anyway depending on the case).
During display, if you show this text field you'd have to strip out the state id from the end of the string (or use a different field).
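For example, an uploaded document could look like this (a sketch; the SearchText field is hypothetical, since the schema above has no searchable field yet, and it carries the state id appended to the text):
{
  "value": [
    {
      "@search.action": "mergeOrUpload",
      "Id": "111",
      "StateId": 123,
      "SearchText": "main street branch office 123"
    }
  ]
}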