Azure Cognitive Search, how to configure analyzer to support "startsWith"?

I have a field in Azure Cognitive Search whose values contain special characters.
They look like this: some_id: 'SOME*STUFF*123'
I'm trying to run a "startsWith" query, but it returns nothing as soon as the regex tries to match anything past the \*.
After a bit of googling I found out that the analyzer is the likely cause, possibly because it breaks strings apart at '*'.
So I changed the analyzer to "keyword", as I read multiple times that this is the analyzer you are supposed to use for this.
The new config looks like this:
{
"name": "some_id",
"type": "Edm.String",
"facetable": false,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": true,
"analyzer": "keyword",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
My request looks like this:
{
"count": true,
"skip": 0,
"top": 5,
"searchMode": "any",
"queryType": "full",
"search": "some_id:/SO(.*)/" // SOME\\*S(.*) also doesn't work
}
I get zero matches.
With the standard analyzer I started getting no matches as soon as I had a \\* in my regex (I escaped the asterisks with \\).
Clarification on requirements:
I cannot change any data; the values (including the \*) cannot be changed. I want the whole field to be treated as a single token that I can run startsWith on.
For example, the regex /SOME\\*ST(.*)/ is supposed to return exactly the entries that fully match the regex. No magic with separators or tokens, simply the whole value as a single token that I can run startsWith on.
In other words, I want the exact same results that string.startsWith(value) would give in JavaScript.
I'm guessing there is something wrong with either my config or my requests. Can anyone help me?
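To make the suspected cause concrete, here is a minimal Python sketch of why a regex stops matching once it has to cross the '*'. It is only a local simulation under stated assumptions: standard_like_tokens is a rough stand-in for a standard analyzer (split on non-alphanumerics, lowercase), Python's re module stands in for the Lucene regex engine, and none of the function names come from Azure.

```python
import re


def standard_like_tokens(value: str) -> list[str]:
    """Rough stand-in for a standard analyzer: split on non-alphanumerics, lowercase."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", value) if t]


def keyword_tokens(value: str) -> list[str]:
    """Keyword-style analysis: the whole value is kept as one unmodified token."""
    return [value]


def regex_matches_any_token(pattern: str, tokens: list[str]) -> bool:
    """A regex query has to match one entire token, never a span across tokens."""
    return any(re.fullmatch(pattern, t) for t in tokens)


value = "SOME*STUFF*123"
split_tokens = standard_like_tokens(value)  # the '*' is gone, three separate tokens
whole_token = keyword_tokens(value)         # one token, '*' and casing preserved

# Matching within the first token works, matching across the '*' cannot.
print(regex_matches_any_token(r"so.*", split_tokens))
print(regex_matches_any_token(r"some\*st.*", split_tokens))
# Against a single unmodified token the same idea works, with the original casing.
print(regex_matches_any_token(r"SOME\*ST.*", whole_token))
```

The simulation only covers the tokenization side. It does not explain why the keyword configuration above still returns nothing, so casing and escaping of the query are worth checking separately.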

IMHO, you should work with a different separator. For example:
Field1 (FROM) | Field2 (TO)
SOME*STUFF*123 | SOME||STUFF||123
Then use a custom analyzer to break terms at every ||. Additionally, you can configure the tokenizer to emit a token every 3 characters.
Samples:
SOM
OME
STU
TUF
UFF
123
Then search using:
SOM*
and it should return the data you're looking for. It would be better if you could provide more details about your content and give us samples, but this answer should point you to the result you're looking for.
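The sample tokens listed in this answer can be reproduced with a small Python sketch. It assumes the value has already been rewritten with the || separator and that each segment is cut into sliding 3-character windows; segment_trigrams is a hypothetical helper for illustration, not an Azure tokenizer.

```python
def segment_trigrams(value: str, separator: str = "||", size: int = 3) -> list[str]:
    """Split on the separator, then emit sliding windows of `size` characters per segment."""
    grams: list[str] = []
    for segment in value.split(separator):
        if not segment:
            continue
        if len(segment) <= size:
            # Short segments are kept whole.
            grams.append(segment)
            continue
        for start in range(len(segment) - size + 1):
            grams.append(segment[start:start + size])
    return grams


tokens = segment_trigrams("SOME||STUFF||123")
print(tokens)
```

A prefix query such as SOM* would then be matched against these short tokens rather than against the whole field value, which is not the same thing as a strict startsWith on the full string.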

Related

Is there a reason why azure search isn't returning all the possible values during a fuzzy search?

My issue is that when we do a first name search using a fuzzy search (with a distance of 2 characters on the first name), it doesn't seem to bring back all possibilities.
QueryType is Full
QueryString - "FirstName:gra~2 AND (LastName: \"*****\" OR LastName: /.*\"*****\".*/)"
I'm using an exact match OR a contains on the last name for this example. This stays constant across the examples.
Results:
If I search FirstName:gre~2 in an Azure Search query string we get back:
Greg
Gary
Gene
If I search FirstName:gra~2 we get back:
Gina
Gary
If I search FirstName:grag~2 we get back:
Greg
Gary
We know that Azure fuzzy search uses the Damerau-Levenshtein distance, and it seems that both "gina" and "greg" are 2 edits away from "gra", yet only one is showing up. In theory "grag" should return "gina" as well.
I'm wondering if anyone has an explanation for this, since it seems inconsistent.
I used this to verify the "distance" between the string "gra" and the strings "greg" and "gina":
http://fuzzy-string.com/Compare/
Here's the link to the azure documentation on Lucene Syntax
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax
These are both of the field definitions
{
"name": "FirstName",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "standard.lucene",
"synonymMaps": []
},
{
"name": "LastName",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "standard.lucene",
"synonymMaps": []
}
Note: the results seem to be the same whether or not the LastName clause is used.
I would also expect those terms to match your fuzzy query. As a sanity check before we dig deeper, can you confirm your analyzer settings (both at query time and at indexing time)? I want to confirm that all the terms you mentioned are tokenized and indexed exactly the way you expect (including how their casing is normalized). You can use the Analyze API (https://learn.microsoft.com/en-us/rest/api/searchservice/test-analyzer) to see how those terms are tokenized. You also mentioned that your query includes an AND clause matching on another field (LastName). Can you confirm that, even without that second clause, the results on FirstName are still not what you expect? I want to eliminate all factors outside the actual edit distance algorithm.
Update: I tried it on my side using the default analyzers and without the LastName clause. Searching for "gra~2" successfully returns "Greg", "Gary" and "Gina". I get the same results when I search for "gre~2" (as you did). Searching for "grag~2" only returns "Greg" and "Gary". "Gina" is not returned, but that seems expected to me (the edit distance seems to be 3).
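The distances discussed here can be checked locally. Below is a minimal Python sketch of the optimal string alignment variant of Damerau-Levenshtein (insert, delete, substitute, and swap of adjacent characters). It is an assumption that this variant is close enough to what the service uses; the names are lowercased by hand on the assumption that the standard analyzer lowercases indexed terms.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, substitutions and adjacent swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ra" <-> "ar".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for query in ("gra", "grag"):
    for name in ("greg", "gary", "gina"):
        print(query, name, osa_distance(query, name))
```

Under these assumptions "gra" is within 2 edits of all three names, while "grag" is 3 edits away from "gina", which is consistent with the update above.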

Returning partial matches in Azure Search

A while ago I set up a search index for a web application. One of the requirements was to return partial matches of the search terms. For instance, searching for Joh should find John Doe. The most straightforward way to implement this was to append a * to each search term before posting the query to Azure Search. So if a user types Joh, we actually ask Azure Search to search for Joh*.
One limitation of this approach is that all the matches of Joh* have the same search score. Because of this, sometimes a partial match appears higher in the results than an exact match. This is documented behavior, so I guess there is not much I can do about it. Or can I?
While my current way to return partial matches seems like a hack, it has worked well enough in practice that I didn't bother finding out how to properly solve the problem. Now I have the time to look into it and my instinct says there must be a "proper" way to do this. I have read the word "ngrams" here and there, and it seems to be part of the solution. I could probably find a passable solution after some hours of hacking on it, but if there is any "standard way" to achieve what I want, I would rather follow that path instead of using a home-grown hack. Hence this question.
So my question is: is there a standard way to retrieve partial matches in Azure Search, while giving exact matches a higher score? How should I change the code below to make Azure Search return the search results I need?
The code
Index definition, as returned by the Azure API:
{
"name": "test-index",
"defaultScoringProfile": null,
"fields": [
{
"name": "id",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
},
{
"name": "name",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": true,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
}
],
"scoringProfiles": [],
"corsOptions": null,
"suggesters": [],
"analyzers": [],
"tokenizers": [],
"tokenFilters": [],
"charFilters": []
}
The documents, as posted to the Azure API:
{
"value": [
{
"@search.action": "mergeOrUpload",
"id": "1",
"name": "Joh Doe"
},
{
"@search.action": "mergeOrUpload",
"id": "2",
"name": "John Doe"
}
]
}
Search query, as posted to the Azure API:
{
"search": "Joh*"
}
Results, where the exact match appears second, while we would like it to appear first:
{
"value": [
{
"@search.score": 1,
"id": "2",
"name": "John Doe"
},
{
"@search.score": 1,
"id": "1",
"name": "Joh Doe"
}
]
}
This is a very good question and thanks for providing a detailed explanation. The easiest way to achieve that would be to use term boosting on the actual term and combine it with a wildcard query. You can modify the query in your post to -
search=Joh^10 OR Joh*&queryType=full
This will score the documents that match Joh exactly higher. If you have more complicated requirements, you can look at constructing a custom analyzer with ngrams to search on them to support partial search.
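As a minimal sketch of how the suggested query could be generated from user input, the Python helper below builds one boosted-exact-or-prefix clause per typed term. boosted_prefix_query is a hypothetical name, the boost of 10 is taken from the example above, and the helper assumes the input contains no characters that need escaping in the full Lucene syntax.

```python
def boosted_prefix_query(user_input: str, boost: int = 10) -> str:
    """Build '(term^boost OR term*)' for every whitespace-separated term."""
    clauses = [f"({term}^{boost} OR {term}*)" for term in user_input.split()]
    return " ".join(clauses)


# Request body for the search call; queryType must be "full" for ^ boosting to apply.
body = {"search": boosted_prefix_query("Joh"), "queryType": "full"}
print(body["search"])
```

For a single term this is the same query as above, only wrapped in parentheses so that several terms can be combined safely.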

Azure Search - Partial Phrase match

I'm trying to improve the ranking of results that come back from an Azure Search index. The search index basically contains a list of band names and members.
Exact matches are important to us, but so are partial matches, including a partially typed word within the query.
Take the example of trying to find a band called Black Flag. In the user input area, I have got as far as typing black fl.
I currently structure the query as: "black fl"|black fl* (exact match on whole phrase and partial match on fl).
This brings back the following results in following order:
Flourescent Black
Florence Black
Black Flag
At the moment, there is the single text field being searched against, using the Standard - Lucene Analyzer.
I've looked at Scoring Profiles but these don't appear to be relevant to such a small dataset in terms of fields available.
I have also explored the full Lucene search, by adding things like ^10 on the word black to make it more important, and have changed my query string in many ways, none of which seem to give the effect I'm after.
I would expect that Black Flag would match better as the word order is more correct than that of the results that come above it.
Is there a way to change the scoring method to handle this? I now imagine that I'm looking at dealing with a custom analyzer (https://learn.microsoft.com/en-gb/azure/search/index-add-custom-analyzers) but not really sure where to start with this or how I would want the analyzer to behave.
Any thoughts or examples on how best to handle this scenario would be greatly appreciated.
EDIT - More Info
The current solution consists of the following, but it involves having to manipulate the results that come back from the search index.
The index is created as follows:
{
"fields": [
{"name": "id", "type": "Edm.String", "key": true, "filterable": false, "searchable": false, "sortable": false, "facetable": false},
{"name": "entityId", "type": "Edm.Int64", "filterable": false, "searchable": false, "sortable": false, "facetable": false},
{"name": "entityType", "type": "Edm.Int32", "sortable": false, "facetable": false},
{"name": "sortableName", "type": "Edm.String", "filterable": false, "facetable": false, "searchable": false},
{"name": "name", "type": "Edm.String", "filterable": false, "retrievable": false, "sortable": false, "facetable": false, "analyzer":"keyword_analyzer"},
{"name": "town", "type": "Edm.String", "filterable": false, "retrievable": false, "sortable": false, "facetable": false, "analyzer":"keyword_analyzer"},
{"name": "tags", "type": "Collection(Edm.String)", "filterable": false, "retrievable": false, "sortable": false, "facetable": false, "analyzer":"keyword_analyzer"}
],
"defaultScoringProfile": "default_score",
"scoringProfiles": [
{
"name": "default_score",
"text":{
"weights": {
"name": 3.5,
"tags": 2,
"town": 1
}
}
}
],
"analyzers":[
{
"name": "keyword_analyzer",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters":[
"map_dash",
"map_space"
],
"tokenizer":"keyword_tokenizer",
"tokenFilters":[
"asciifolding",
"lowercase",
"trim",
"delimiter_filter"
]
}
],
"charFilters":[
{
"name":"map_dash",
"@odata.type":"#Microsoft.Azure.Search.MappingCharFilter",
"mappings":["-=>_"]
},
{
"name":"map_space",
"@odata.type":"#Microsoft.Azure.Search.MappingCharFilter",
"mappings":["\\u0020=>_"]
}
],
"tokenizers":[
{
"name": "keyword_tokenizer",
"@odata.type":"#Microsoft.Azure.Search.KeywordTokenizerV2"
}
],
"tokenFilters":[
{
"name": "stopwords_filter",
"@odata.type":"#Microsoft.Azure.Search.StopwordsTokenFilter",
"removeTrailing": false
},
{
"name": "delimiter_filter",
"@odata.type":"#Microsoft.Azure.Search.WordDelimiterTokenFilter",
"generateWordParts": true,
"generateNumberParts": true,
"splitOnCaseChange": false,
"preserveOriginal": true,
"splitOnNumerics": false
}
]
}
Before uploading data to the index we need to normalize it: Black Flag becomes black flag. We also have to remove any leading "the", so The Killers becomes killers. Any non-standard characters are replaced as well, to remove accents etc.
When performing a search, we now need to remove any leading "the" in code, if it exists, and perform the same normalization. I can accept doing this.
We then build up the query which changes depending upon how many words there are in the initial query.
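// Split the normalized phrase into words (SplitToList is presumably a project-specific helper).
List<string> splitQ = queryPhrase.SplitToList(" ");
if (splitQ.Count == 1)
{
    // Single word: exact term OR prefix match.
    search.Append($"(\"{splitQ[0]}\" || {this.EscapeSpecialCharacters(splitQ[0])}*)");
}
else if (splitQ.Count > 1)
{
    for (int i = 0; i < splitQ.Count; i++)
    {
        if (i == splitQ.Count - 1)
        {
            // The last word is still being typed, so it becomes a required prefix.
            search.Append($"+{this.EscapeSpecialCharacters(splitQ[i])}*");
        }
        else
        {
            // Earlier words are required exact terms.
            search.Append($"+\"{splitQ[i]}\"");
        }
    }
    // Wrap the result: whole phrase OR (all required terms).
    search.Insert(0, $"(\"{queryPhrase}\"||(");
    search.Append("))");
}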
A single word black would mean the main query is: ("black" || black*)
However, as soon as additional words come in it has to change. black fl becomes: ("black fl"||(+"black"+fl*))
A three word search would be: ("one two three"||(+"one"+"two"+three*))
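For quick experimentation outside the application, the same query shapes can be reproduced with a short Python sketch. This is a port written for illustration, not the code in use: build_search_text is a hypothetical name, and the escape argument is a placeholder for EscapeSpecialCharacters, whose implementation is not shown here.

```python
from typing import Callable


def build_search_text(query_phrase: str, escape: Callable[[str], str] = lambda t: t) -> str:
    """Build the full-syntax query for an already normalized phrase."""
    words = query_phrase.split()
    if not words:
        return ""
    if len(words) == 1:
        # Single word: exact term OR prefix match.
        return f'("{words[0]}" || {escape(words[0])}*)'
    # Earlier words are required exact terms, the last word is a required prefix.
    required = "".join(f'+"{word}"' for word in words[:-1])
    required += f"+{escape(words[-1])}*"
    return f'("{query_phrase}"||({required}))'


for phrase in ("black", "black fl", "one two three"):
    print(build_search_text(phrase))
```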
On top of this we then add any filter options.
Search is sent to the index with the query type set to full
The above has got us as close as we can to having decent and accurate results. However, the scoring is all messed up.
Processing the results...
First, we normalize the score given by the Azure Search index. Depending on the search query the scores vary massively, so we express each score as a percentage of the maximum-scoring item.
We then have to apply our own enhancer to the score based on the tag or name field. An exact match to the query gives an enhancer of 5, and a startswith match gets an enhancer of 3.
We then produce a score that uses the enhancer to raise the result's position in the rankings.
This final section of processing the results seems like something that should be done automatically within the search index system.
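The post-processing described above can be sketched in a few lines of Python. Several things here are assumptions: each result is a dict with a name and a raw score, only the name field is checked (not tags), the enhancer is applied by multiplication, and the sample scores are made up purely for illustration.

```python
def rerank(results: list[dict], query: str,
           exact_boost: float = 5.0, prefix_boost: float = 3.0) -> list[dict]:
    """Normalize scores to a percentage of the top score, then apply the enhancer."""
    if not results:
        return []
    top = max(r["score"] for r in results)
    q = query.strip().lower()
    ranked = []
    for r in results:
        normalized = 100.0 * r["score"] / top if top else 0.0
        name = r["name"].strip().lower()
        if name == q:
            enhancer = exact_boost    # exact match on the whole name
        elif name.startswith(q):
            enhancer = prefix_boost   # name starts with the typed query
        else:
            enhancer = 1.0
        ranked.append({**r, "final": normalized * enhancer})
    return sorted(ranked, key=lambda r: r["final"], reverse=True)


# Hypothetical raw scores, ordered as the index returned them.
sample = [
    {"name": "flourescent black", "score": 2.0},
    {"name": "florence black", "score": 1.8},
    {"name": "black flag", "score": 1.5},
]
reranked = rerank(sample, "black fl")
print([r["name"] for r in reranked])
```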

Azure search not returning correct result with . (dot) in search query

We have stored documents in Azure Search. One of the documents has the following field value:
"Title": "statistics_query.compute_shader_invocations.secondary_inherited fails"
We have defined a custom analyzer on it, as recommended by the MS Azure team, to resolve an issue we were facing because of the _ (underscore).
{
"name": "myindex",
"fields": [
{
"name": "id",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null
},
{
"name": "Title",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "remove_underscore"
}
],
"analyzers": [
{
"name": "remove_underscore",
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters": [
"remove_underscore"
],
"tokenizer": "standard_v2"
}
],
"charFilters": [
{
"name": "remove_underscore",
"@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
"mappings": [
"_=>-"
]
}
]
}
However, when I search with the filters below on my Azure Search index (API version 2016-09-01 Preview), I don't get any results.
$filter=search.ismatch('"compute_shader_invocations*"','Title', 'full', 'any')
$filter=search.ismatch('"compute_shader_invocations"','Title', 'full', 'any')
$filter=search.ismatch('"shader_invocations*"','Title', 'full', 'any')
However, if I include the text with the dot (.) character, the same filter works.
$filter=search.ismatch('"query.compute_shader*"','Title', 'full', 'any')
Based on my tests, if the document has a dot (.) character right before or after the search term used in the filter, the search doesn't return any results.
So the filters below won't work, because in the document there is a dot right before the word "compute" and right after the word "invocations".
$filter=search.ismatch('"compute_shader_invocations*"','Title', 'full', 'any')
$filter=search.ismatch('"compute_shader"','Title', 'full', 'any')
$filter=search.ismatch('"shader_invocations*"','Title', 'full', 'any')
However, the filters below should work, as there is no dot character before the word "query" or after the word "shader" in the Azure Search document.
$filter=search.ismatch('"query.compute_shader*"','Title', 'full', 'any')
$filter=search.ismatch('"shader*"','Title', 'full', 'any')
This is driving me crazy. Any help would be highly appreciated.
tl;dr Wildcard queries don't have custom analysis performed. Non-wildcard queries should return results, so please double-check those.
Detailed answer
So, the dot (.) actually doesn't have anything to do with the behavior you are observing. There are 2 classes of search queries you are issuing:
A wildcard query *
A non wildcard query (such as "compute_shader")
In general, a non-wildcard query undergoes the same analysis as the indexed text, as defined by any custom analyzer in your index. Wildcard queries are not analyzed at all.
Now taking your document text as an example "statistics_query.compute_shader_invocations.secondary_inherited fails", the custom analyzer you defined will break it down into tokens. (FYI: You can use the Analyze API to see the breakdown).
The following wildcard query succeeds
$filter=search.ismatch('"shader*"','Title', 'full', 'any')
because, when you run the analysis on the source document, there are tokens like "shader"
The following wildcard queries don't succeed
$filter=search.ismatch('"compute_shader_invocations*"','Title', 'full', 'any')
$filter=search.ismatch('"shader_invocations*"','Title', 'full', 'any')
because there are no tokens like "compute_shader_invocations" or "shader_invocations" when the source document is analyzed with your custom analyzer.
This one shouldn't succeed either, but interestingly you say that it does:
$filter=search.ismatch('"query.compute_shader*"','Title', 'full', 'any')
Let's focus now on queries without wildcards.
$filter=search.ismatch('"compute_shader_invocations"','Title', 'full', 'any')
$filter=search.ismatch('"compute_shader"','Title', 'full', 'any')
These should technically get tokenized correctly using the custom analyzer and should have matching results.
Could you please verify whether your queries in the last 3 highlighted instances were correct in your original question? When I tried to create a sample index and issued a search request based on your configuration, those were the 3 anomalies I noticed. I would appreciate some clarification around those.
Also, in general the documentation around how full text search in Azure search works is a great place to get in-depth details about some of the things that I mentioned.
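To check the token breakdown mentioned above, a request to the Analyze API can be built roughly as in the Python sketch below. The service name, API key and api-version value are placeholders chosen for illustration, and build_analyze_request is a hypothetical helper; only the index name, analyzer name and text come from this question. The request is constructed but not sent.

```python
import json
import urllib.request


def build_analyze_request(service: str, index: str, api_key: str,
                          text: str, analyzer: str,
                          api_version: str = "2020-06-30") -> urllib.request.Request:
    """Build a POST request that asks the service how `text` is tokenized by `analyzer`."""
    url = (f"https://{service}.search.windows.net/indexes/{index}/analyze"
           f"?api-version={api_version}")
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_analyze_request(
    service="my-service",   # placeholder service name
    index="myindex",
    api_key="<admin-key>",  # placeholder, never hard-code a real key
    text="statistics_query.compute_shader_invocations.secondary_inherited fails",
    analyzer="remove_underscore",
)
print(request.full_url)
# To actually send it: urllib.request.urlopen(request) and read the JSON response.
```

Comparing the returned tokens with the prefix used in each wildcard filter should show directly which filters can match and which cannot.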

Character "@" in regex in Azure Search (Lucene)

When implementing a search in Azure Search, a query for text containing the @ character does not return any results.
Which analyzer are you using for the search field? If you did not specify an analyzer, it defaults to the Lucene standard analyzer, which discards punctuation and symbols, so the email address abc@bcd.gov.co is split into several separate terms. As documented, a regex search query only applies to single tokenized terms. The regex /.bcd.gov.co./ doesn't find the email address because it does not match any of the tokenized terms. You can either use the whitespace analyzer or build a custom one that doesn't discard punctuation or symbols, so that regex matching applies to the entire address.
Hope this helps. Thanks.
Nate
Here is sample code:
{
"name": "Username",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"sortable": false,
"facetable": false,
"analyzer": "email_analyzer"
},
"analyzers": [
{
"name": "email_analyzer",
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer": "uax_url_email"
}
]
