Related
I'm trying to best index contents of e-mail messages, subjects and email addresses. E-mails can contain both text and HTML representation. They can be in any language so I can't use language specific analysers unfortunately.
As I am new to this I have many questions:
First I used Standard Lucene analyser but after some testing and
checking what each analyser does I switched to using "simple"
analyser. Standard one didn't allow me to search by domain in
user#domain.com (It sees user and domain.com as tokens). Is "simple" the best I can use in my case?
How can I handle HTML contents of e-mail? I thought this should be
possible to do it in Azure Search but right now I think I would need
to strip HTML tags myself.
My users aren't tech savvy and I assumed "simple" query type will be
enough for them. I expect them to type word or two and find messages
containing this word/containing words starting with this word. From my tests it looks I need to append * to their queries to get "starting with" to work?
It would help if you included an example of your data and how you index and query. What happened, and what did you expect?
The standard Lucene analyzer will work with your user#domain.com example. It is correct that it produces the tokens user and domain.com. But the same happens when you query, and you will get records with the tokens user and domain.com.
CREATE INDEX
"fields": [
{"name": "Id", "type": "Edm.String", "searchable": false, "filterable": true, "retrievable": true, "sortable": true, "facetable": false, "key": true, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null, "synonymMaps": [] },
{"name": "Email", "type": "Edm.String", "filterable": true, "sortable": true, "facetable": false, "searchable": true, "analyzer": "standard"}
]
UPLOAD
{
"value": [
{
"#search.action": "mergeOrUpload",
"Id": "1",
"Email": "user#domain.com"
},
{
"#search.action": "mergeOrUpload",
"Id": "2",
"Email": "some.user#some-domain.com"
},
{
"#search.action": "mergeOrUpload",
"Id": "3",
"Email": "another#another.com"
}
]
}
QUERY
Query, using full and all.
https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{INDEX_NAME}}/docs?search=user#domain.com&$count=true&$select=Id,Email&searchMode=all&queryType=full&api-version={{API-VERSION}}
Which produces results as expected (all records containing user and domain.com):
{
"#odata.context": "https://<your-search-env>.search.windows.net/indexes('dg-test-65392234')/$metadata#docs(*)",
"#odata.count": 2,
"value": [
{
"#search.score": 0.51623213,
"Id": "1",
"Email": "user#domain.com"
},
{
"#search.score": 0.25316024,
"Id": "2",
"Email": "some.user#some-domain.com"
}
]
}
If your expected result is to only get the record above where the email matches completely, you could instead use a phrase search. I.e. replace the search parameter above with search="user#domain.com" and you would get:
{
"#search.score": 0.51623213,
"Id": "1",
"Email": "user#domain.com"
}
Alternatively, you could use the keyword analyzer.
ANALYZE
You can compare the different analyzers directly via REST. Using the keyword analyzer on the Email property will produce a single token.
{
"text": "some-user#some-domain.com",
"analyzer": "keyword"
}
Results in the following tokens:
"tokens": [
{
"token": "some-user#some-domain.com",
"startOffset": 0,
"endOffset": 25,
"position": 0
}
]
Compared to the standard tokenizer, which does a decent job for most types of unstructured content.
{
"text": "some-user#some-domain.com",
"analyzer": "standard"
}
Which produces reasonable results for cases where the email address was part of some generic text.
"tokens": [
{
"token": "some",
"startOffset": 0,
"endOffset": 4,
"position": 0
},
{
"token": "user",
"startOffset": 5,
"endOffset": 9,
"position": 1
},
{
"token": "some",
"startOffset": 10,
"endOffset": 14,
"position": 2
},
{
"token": "domain.com",
"startOffset": 15,
"endOffset": 25,
"position": 3
}
]
SUMMARY
This is a long answer already, so I won't cover your other two questions in detail. I would suggest splitting them to separate questions so it can benefit others.
HTML content: You can use a built-in HTML analyzer that strips HTML tags. Or you can strip the HTML yourself using custom code. I typically use Beautiful Soup for cases like these or simple regular expressions for simpler cases.
Wildcard search: Usually, users don't expect automatic wildcards appended. The only application that does this is the Outlook client, which destroys precision. When I search for "Jan" (a common name), I annoyingly get all emails sent in January(!). And a search for Dan (again, a name), I also get all emails from Danmark (Denmark).
Everything in search is a trade-off between precision and recall. In your first example with the email address, your expectation was heavily geared toward precision. But, in your last wildcard question, you seem to prefer extreme recall with wildcards on everything. It all comes down to your expectations.
I'm having trouble transcribing email using Google Speech REST API. The best I can get is most of the email address, however Google Speech ignores "dot" and "dot com". For example first.last#gmail.com returns "First Last at gmail". If I say "period" instead of "dot" I at least get "First. Last at gmail." I'm using the following:
{
"config": {
"encoding": "MULAW",
"sampleRateHertz": 8000,
"languageCode": "en-US",
"maxAlternatives": 0,
"profanityFilter": true,
"enableWordTimeOffsets": false,
"model": "phone_call",
"useEnhanced": true
},
"audio": {
"content":"&&NameBase64&&"
}
}
I've tried add "dot" as a speech context with no changes. ".", ".com", "com", and "kom" also didn't change the results.
{
"config": {
"encoding": "MULAW",
"sampleRateHertz": 8000,
"languageCode": "en-US",
"maxAlternatives": 1,
"profanityFilter": true,
"enableWordTimeOffsets": false,
"model": "phone_call",
"useEnhanced": true,
"speechContexts": [{
"phrases": ["dot"],
}],
},
"audio": {
"content":"Base64Recording"
}
}
I've tried adding alphanumberic speech contexts and spelling it out but the results were pretty bad.
Any thoughts on how I can get "." or "dot" and "com" to show up in the transcription would be greatly appreciated.
Have you tried providing a boost value for the phrase? I'm facing the same issue and I noticed that increasing the boost value helped in identifying the word "dot".
Boost values are usually between 0 and 20, but applying anything above 10 helped in recognizing the "dot".
Here's an example:-
"config": {
"encoding": "MULAW",
"sampleRateHertz": 8000,
"languageCode": "en-US",
"maxAlternatives": 1,
"profanityFilter": true,
"enableWordTimeOffsets": false,
"model": "phone_call",
"useEnhanced": true,
"speechContexts": [{
"phrases": ["dot"],
"boots": 15.0
}],
},
"audio": {
"content":"Base64Recording"
}
}
You can also have multiple key value pairs in the context, each with different boost values. For example, this is what I use to detect email addresses:-
[{
phrases: ["$OOV_CLASS_ALPHANUMERIC_SEQUENCE"],
boost: 14.0
},
{
phrases: ["gmail.com","yahoo.com","aol.com","outlook.com"],
boost:5.0
},
{
phrases: ["com",".","c o m",".com","dotcom","dot com","dot","at","at the rate","#"],
boost: 10.0
},
{
phrases: ["org","io","dot org","dot io","gov","dot gov","net","dot net","co","dot co"],
boost:8.0
},
{
phrases: ["$OOV_CLASS_DIGIT_SEQUENCE","8","naught","z","zed","zee","zz","d","aa","ae","ee","oo","ii","ay","eh","ahh","ah","ze","dee",
"1","2","3","4","5","6","7","8","9","0","zero",],
boost: -20.0
}
]
Notice, the phrases with negative boost values will help weed out words that are often misunderstood.
A while ago I set up a search index for a web application. One of the requirements was to return partial matches of the search terms. For instance, searching for Joh should find John Doe. The most straightforward way to implement this was to append a * to each search term before posting the query to Azure Search. So if a user types Joh, we actually ask Azure Search to search for Joh*.
One limitation of this approach is that all the matches of Joh* have the same search score. Because of this, sometimes a partial match appears higher in the results than an exact match. This is documented behavior, so I guess there is not much I can do about it. Or can I?
While my current way to return partial matches seems like a hack, it has worked well enough in practice that I didn't matter finding out how to properly solve the problem. Now I have the time to look into it and my instinct says there must be a "proper" way to do this. I have read the word "ngrams" here and there, and it seems to be part of the solution. I could probably find a passable solution after some of hours of hacking on it, but if there is any "standard way" to achieve what I want, I would rather follow that path instead of using a home-grown hack. Hence this question.
So my question is: is there a standard way to retrieve partial matches in Azure Search, while giving exact matches a higher score? How should I change the code below to make Azure Search return the search results I need?
The code
Index definition, as returned by the Azure API:
{
"name": "test-index",
"defaultScoringProfile": null,
"fields": [
{
"name": "id",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
},
{
"name": "name",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": true,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
}
],
"scoringProfiles": [],
"corsOptions": null,
"suggesters": [],
"analyzers": [],
"tokenizers": [],
"tokenFilters": [],
"charFilters": []
}
The documents, as posted to the Azure API:
{
"value": [
{
"#search.action": "mergeOrUpload",
"id": "1",
"name": "Joh Doe"
},
{
"#search.action": "mergeOrUpload",
"id": "2",
"name": "John Doe"
}
]
}
Search query, as posted to the Azure API:
{
search: "Joh*"
}
Results, where the exact match appears second, while we would like it to appear first:
{
"value": [
{
"#search.score": 1,
"id": "2",
"name": "John Doe"
},
{
"#search.score": 1,
"id": "1",
"name": "Joh Doe"
}
]
}
This is a very good question and thanks for providing a detailed explanation. The easiest way to achieve that would be to use term boosting on the actual term and combine it with a wildcard query. You can modify the query in your post to -
search=Joh^10 OR Joh*&queryType=full
This will score the documents that match Joh exactly higher. If you have more complicated requirements, you can look at constructing a custom analyzer with ngrams to search on them to support partial search.
I'm trying to improve the ranking of results that come back from an Azure Search index. The search index basically contains a list of band names and members.
Exact match is important to us, but also a partial match, but likewise, so is a partial word within the query.
If I use the example of trying to find a band called Black Flag. In user input area, I have got as far as typing black fl.
I currently structure the query as: "black fl"|black fl* (exact match on whole phrase and partial match on fl).
This brings back the following results in following order:
Flourescent Black
Florence Black
Black Flag
At the moment, there is the single text field being searched against, using the Standard - Lucene Analyzer.
I've looked at Scoring Profiles but these don't appear to be relevant to such a small dataset in terms of fields available.
I have also explored the full lucene search, by adding things like ^10 on the word black to make it more important - and have changed my query string in many ways, all of which don't seem to give the effect i'm after.
I would expect that Black Flag would match better as the word order is more correct than that of the results that come above it.
Is there a way to change the scoring method to handle this? I now imagine that I'm looking at dealing with a custom analyzer (https://learn.microsoft.com/en-gb/azure/search/index-add-custom-analyzers) but not really sure where to start with this or how I would want the analyzer to behave.
Any thoughts or examples on how best to handle this scenario would be greatly appreciated.
EDIT - More Info
The current solution consists of the following, but it involves having to manipulate the results that come back from the search index.
The index is created as follows:
{
"fields": [
{"name": "id", "type": "Edm.String", "key": true, "filterable": false, "searchable": false, "sortable": false, "facetable": false},
{"name": "entityId", "type": "Edm.Int64", "filterable": false, "searchable": false, "sortable": false, "facetable": false},
{"name": "entityType", "type": "Edm.Int32", "sortable": false, "facetable": false},
{"name": "sortableName", "type": "Edm.String", "filterable": false, "facetable": false, "searchable": false},
{"name": "name", "type": "Edm.String", "filterable": false, "retrievable": false, "sortable": false, "facetable": false, "analyzer":"keyword_analyzer"},
{"name": "town", "type": "Edm.String", "filterable": false, "retrievable": false, "sortable": false, "facetable": false, "analyzer":"keyword_analyzer"},
{"name": "tags", "type": "Collection(Edm.String)", "filterable": false, "retrievable": false, "sortable": false, "facetable": false, "analyzer":"keyword_analyzer"}
],
"defaultScoringProfile": "default_score",
"scoringProfiles": [
{
"name": "default_score",
"text":{
"weights": {
"name": 3.5,
"tags": 2,
"town": 1
}
}
}
],
"analyzers":[
{
"name": "keyword_analyzer",
"#odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters":[
"map_dash",
"map_space"
],
"tokenizer":"keyword_tokenizer",
"tokenFilters":[
"asciifolding",
"lowercase",
"trim",
"delimiter_filter"
]
}
],
"charFilters":[
{
"name":"map_dash",
"#odata.type":"#Microsoft.Azure.Search.MappingCharFilter",
"mappings":["-=>_"]
},
{
"name":"map_space",
"#odata.type":"#Microsoft.Azure.Search.MappingCharFilter",
"mappings":["\\u0020=>_"]
}
],
"tokenizers":[
{
"name": "keyword_tokenizer",
"#odata.type":"#Microsoft.Azure.Search.KeywordTokenizerV2"
}
],
"tokenFilters":[
{
"name": "stopwords_filter",
"#odata.type":"#Microsoft.Azure.Search.StopwordsTokenFilter",
"removeTrailing": false
},
{
"name": "delimiter_filter",
"#odata.type":"#Microsoft.Azure.Search.WordDelimiterTokenFilter",
"generateWordParts": true,
"generateNumberParts": true,
"splitOnCaseChange": false,
"preserveOriginal": true,
"splitOnNumerics": false
}
]
}
Before uploading data to the index we need to normalize it - Black Flag becomes black flag. We also have to remove any preceeding words of the so this means that The Killers becomes killers - also any non standard characters are replaced to remove accents etc.
When performing a search, in code we need to now remove any preceeding the if it exists, and perform the same normalization - I can accept doing this.
We then build up the query which changes depending upon how many words there are in the initial query.
List<string> splitQ = queryPhrase.SplitToList(" ");
if (splitQ.Count > 0)
{
if (splitQ.Count == 1)
{
search.Append($"(\"{splitQ[0]}\" || {this.EscapeSpecialCharacters(splitQ[0])}*)");
}
else
{
for (int i = 0; i < splitQ.Count; i++)
{
if (i == splitQ.Count - 1)
{
search.Append($"+{this.EscapeSpecialCharacters(splitQ[i])}*");
}
else
search.Append($"+\"{splitQ[i]}\"");
}
search.Insert(0, $"(\"{queryPhrase}\"||(");
search.Append("))");
}
}
A single word black would mean the main query is: ("black" || black*)
However, as soon as additional words come in it has to change. black fl becomes: ("black fl"||(+"black"+fl*))
A three word search would be: ("one two three"||(+"one"+"two"+three*))
On top of this we then add any filter options.
Search is sent to the index with the query type set to full
The above has got us as close as we can to having decent and accurate results. However, the scoring is all messed up.
Processing the results...
Firstly we now normalise the score given by the Azure Search Index, depending upon the search query, the scores range massively, so we normalize this as a percentage based on the maximum scoring item.
We now have to apply our own enhancer to the score based on the tag or name field. An exact match to the query gives an enhancer of 5, and a startswith query gets and enhancement of 3.
We then provide a score that uses the enhancement to increase the results position in the rankings.
This final seciton of processing the results seems as though it is something that should be done automatically within the search index system.
I'm trying to cater for the following example with Azure Search.
Given the following index schema:
{
"name": "mySchema",
"fields": [
{
"name": "Id",
"type": "Edm.String",
"key": true,
"searchable": false,
"filterable": false,
"sortable": false,
"facetable": false,
"retrievable": true,
"suggestions": false
},
{
"name": "StateId",
"type": "Edm.Int32",
"key": false,
"searchable": false,
"filterable": true,
"sortable": false,
"facetable": false,
"retrievable": true,
"suggestions": false
},
{
"name": "Location",
"type": "Edm.GeographyPoint",
"key": false,
"searchable": false,
"filterable": true,
"sortable": true,
"facetable": false,
"retrievable": true,
"suggestions": false
},
],
}
I want to be able to order my results firstly on the StateId field, and then by the distance from a given lat/long location.
I realise that I am able to achieve the first part by using a $filter= StateId eq x component when querying. However, I do want to still receive results (with a lower score) that are not in the provided StateId, but are of a given distance away from a provided location.
I recognise also, that this looks like it should be able to be achieved by a custom Scoring Profile. I would expect by using a Scoring Profile, I'd be able to return something like this:
[
{
"#search.score":100.0,
"Id":"111",
"StateId":"123",
"Location": {"type": "Point details...."},
},
{
"#search.score":100.0,
"Id":"222",
"StateId":"123",
"Location": {"type": "Point details...."},
},
{
"#search.score":50.0,
"Id":"333",
"StateId":"789",
"Location": {"type": "Point details...."},
}
]
However, I am not able to search on the StateId field, as this is an Edm.Int32 value, so I do not believe using a Scoring Profile would be a viable solution.
Anyone come across a similar scenario?
EDIT:
Trying to explain just a bit further - if I were to explain this in Psuedo-SQL, this is basically the case I'm trying to handle
ORDER BY (CASE WHEN StateId = #StateId THEN 1 ELSE 0 END) DESC, Location
We don't currently support modeling this scenario with scoring profiles. This has come up multiple times though, so it's something we'd like to add.
In the meanwhile, one thing you can do as a work-around is to add the StateId value to one of the searchable fields (e.g. just append it at the end of the text). Then during search include the state id as part of the search string, which should skew results towards those documents that match the state id (or that are a very good match without it, which might be good relevance anyway depending on the case).
During display, if you show this text field you'd have to strip out the state id from the end of the string (or use a different field).