Indexing e-mails in Azure Search

I'm trying to work out how best to index the contents of e-mail messages: bodies, subjects and email addresses. E-mails can contain both text and HTML representations. They can be in any language, so unfortunately I can't use language-specific analysers.
As I am new to this I have many questions:
First I used the standard Lucene analyser, but after some testing and checking what each analyser does I switched to the "simple" analyser. The standard one didn't allow me to search by domain in user@domain.com (it sees user and domain.com as separate tokens). Is "simple" the best I can use in my case?
How can I handle the HTML contents of an e-mail? I thought it would be possible to do this in Azure Search, but right now I think I would need to strip the HTML tags myself.
My users aren't tech savvy, and I assumed the "simple" query type would be enough for them. I expect them to type a word or two and find messages containing that word, or containing words starting with it. From my tests it looks like I need to append * to their queries to get "starting with" to work?

It would help if you included an example of your data and how you index and query: what happened, and what did you expect?
The standard Lucene analyzer will work with your user@domain.com example. It is correct that it produces the tokens user and domain.com. But the same analysis happens at query time, so a query for the full address matches records containing the tokens user and domain.com.
CREATE INDEX
"fields": [
{"name": "Id", "type": "Edm.String", "searchable": false, "filterable": true, "retrievable": true, "sortable": true, "facetable": false, "key": true, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null, "synonymMaps": [] },
{"name": "Email", "type": "Edm.String", "filterable": true, "sortable": true, "facetable": false, "searchable": true, "analyzer": "standard"}
]
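(The definition above is PUT to https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{INDEX_NAME}}?api-version={{API-VERSION}}, reusing the same placeholders as the query request further down.)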
UPLOAD
{
"value": [
{
"@search.action": "mergeOrUpload",
"Id": "1",
"Email": "user@domain.com"
},
{
"@search.action": "mergeOrUpload",
"Id": "2",
"Email": "some.user@some-domain.com"
},
{
"@search.action": "mergeOrUpload",
"Id": "3",
"Email": "another@another.com"
}
]
}
QUERY
Query, using queryType=full and searchMode=all.
https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{INDEX_NAME}}/docs?search=user@domain.com&$count=true&$select=Id,Email&searchMode=all&queryType=full&api-version={{API-VERSION}}
Which produces results as expected (all records containing user and domain.com):
{
"@odata.context": "https://<your-search-env>.search.windows.net/indexes('dg-test-65392234')/$metadata#docs(*)",
"@odata.count": 2,
"value": [
{
"@search.score": 0.51623213,
"Id": "1",
"Email": "user@domain.com"
},
{
"@search.score": 0.25316024,
"Id": "2",
"Email": "some.user@some-domain.com"
}
]
}
If your expected result is to get only the record where the email matches completely, you could instead use a phrase search, i.e. replace the search parameter above with search="user@domain.com", and you would get:
{
"#search.score": 0.51623213,
"Id": "1",
"Email": "user#domain.com"
}
Alternatively, you could use the keyword analyzer.
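With the keyword analyzer, the whole address is indexed as a single token, so only complete addresses match. A sketch of the field definition, based on the index above:
{"name": "Email", "type": "Edm.String", "filterable": true, "sortable": true, "facetable": false, "searchable": true, "analyzer": "keyword"}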
ANALYZE
You can compare the different analyzers directly via REST. Using the keyword analyzer on the Email property will produce a single token.
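(The request below is POSTed to the Analyze Text API, e.g. https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{INDEX_NAME}}/analyze?api-version={{API-VERSION}}, with the same placeholders as before.)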
{
"text": "some-user#some-domain.com",
"analyzer": "keyword"
}
Results in the following tokens:
"tokens": [
{
"token": "some-user#some-domain.com",
"startOffset": 0,
"endOffset": 25,
"position": 0
}
]
Compare that to the standard analyzer, which does a decent job for most types of unstructured content.
{
"text": "some-user#some-domain.com",
"analyzer": "standard"
}
Which produces reasonable results for cases where the email address was part of some generic text.
"tokens": [
{
"token": "some",
"startOffset": 0,
"endOffset": 4,
"position": 0
},
{
"token": "user",
"startOffset": 5,
"endOffset": 9,
"position": 1
},
{
"token": "some",
"startOffset": 10,
"endOffset": 14,
"position": 2
},
{
"token": "domain.com",
"startOffset": 15,
"endOffset": 25,
"position": 3
}
]
SUMMARY
This is a long answer already, so I won't cover your other two questions in detail. I would suggest splitting them into separate questions so the answers can benefit others.
HTML content: There is no standalone built-in HTML analyzer, but there is a built-in html_strip char filter you can put into a custom analyzer to strip HTML tags. Or you can strip the HTML yourself in custom code. I typically use Beautiful Soup for cases like these, or simple regular expressions for simpler cases.
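A minimal sketch of such an analyzer definition (the analyzer name is mine; pick the tokenizer and filters to match your other fields):
"analyzers": [
{
"name": "html_content",
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters": [ "html_strip" ],
"tokenizer": "standard_v2",
"tokenFilters": [ "lowercase" ]
}
]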
Wildcard search: Usually, users don't expect wildcards to be appended automatically. The only application I know of that does this is the Outlook client, and it destroys precision. When I search for Jan (a common name), I annoyingly get all emails sent in January(!). And with a search for Dan (again, a name), I also get all emails mentioning Danmark (Danish for Denmark).
Everything in search is a trade-off between precision and recall. In your first example with the email address, your expectation was geared heavily toward precision. But in your last question, about wildcards, you seem to prefer extreme recall, with wildcards on everything. It all comes down to your users' expectations.

Related

Azure Search Fails to Return Expected Result When No OR Multiple SearchFields Are Defined

I have a fairly basic Azure Search index with several fields of searchable string data, for example [abridged]...
"fields": [
{
"name": "Field1",
"type": "Edm.String",
"facetable": false,
"filterable": true,
"key": true,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "Field2",
"type": "Edm.String",
"facetable": false,
"filterable": true,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "en.microsoft",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
]
Field1 is loaded with alphanumeric id data and Field2 is loaded with English language string data, specifically the name/title of the record. searchMode=all is also being used to ensure the accuracy of the results.
Let's say one of the records indexed has the following Field2 data: BA (Hons) in Business, Organisational Behaviour and Coaching. Putting that into the en.microsoft analyzer, this is the result we get out:
"tokens": [
{
"token": "ba",
"startOffset": 0,
"endOffset": 2,
"position": 0
},
{
"token": "hon",
"startOffset": 4,
"endOffset": 8,
"position": 1
},
{
"token": "hons",
"startOffset": 4,
"endOffset": 8,
"position": 1
},
{
"token": "business",
"startOffset": 13,
"endOffset": 21,
"position": 3
},
{
"token": "organizational",
"startOffset": 23,
"endOffset": 37,
"position": 4
},
{
"token": "organisational",
"startOffset": 23,
"endOffset": 37,
"position": 4
},
{
"token": "behavior",
"startOffset": 38,
"endOffset": 47,
"position": 5
},
{
"token": "behaviour",
"startOffset": 38,
"endOffset": 47,
"position": 5
},
{
"token": "coach",
"startOffset": 52,
"endOffset": 60,
"position": 7
},
{
"token": "coaching",
"startOffset": 52,
"endOffset": 60,
"position": 7
}
]
As you can see, the tokens returned are what you'd expect for such a string. However, when it comes to using that same indexed string value as a search term (sadly a valid use case in this instance), the results returned are not as expected unless you explicitly use searchFields=Field2.
Query 1 (Returns 0 results):
?searchMode=all&search=BA%20(Hons)%20in%20Business%2C%20Organisational%20Behaviour%20and%20Coaching
Query 2 (Returns 0 results):
?searchMode=all&searchFields=Field1,Field2&search=BA%20(Hons)%20in%20Business%2C%20Organisational%20Behaviour%20and%20Coaching
Query 3 (Returns 1 result as expected):
?searchMode=all&searchFields=Field2&search=BA%20(Hons)%20in%20Business%2C%20Organisational%20Behaviour%20and%20Coaching
So why does this only return the expected result with searchFields=Field2 and not with no searchFields defined, or with searchFields=Field1,Field2? I would not expect a non-match on Field1 to exclude a result that clearly matches on Field2.
Furthermore, removing the "in" and "and" within the search term seems to correct the issue and return the expected result. For example:
Query 4 (Returns 1 result as expected):
?searchMode=all&search=BA%20(Hons)%20Business%2C%20Organisational%20Behaviour%20Coaching
(This is almost as if one analyzer were tokenizing the indexed data and a completely different analyzer were tokenizing the search term, although that theory makes no sense when you take Query 3 into consideration, as it produces a positive match using the exact same indexed data and search term.)
Is anybody able to shed some light on what's going on here? I'm completely out of ideas and can't find anything more in the documentation.
NB. Please bear in mind that I'm looking to understand why Azure Search behaves this way, not necessarily for a workaround.
The reason you don't get any hits is due to how stopwords are handled when you use searchMode=all. The standard analyzer does not remove stopwords, but the Lucene and Microsoft analyzers for English do. I verified this by creating an index with your property definitions and sample data. If you use the standard analyzer, stopwords are not removed and you get a match even when using searchMode=all. To get a match when using either the Lucene or Microsoft analyzers with the simple query mode, you would have to use a phrase search.
When you test the en.microsoft analyzer in your example, you only see what the first stage of the analyzer does: it splits your query into tokens. In your case, two of those tokens are also stopwords in English (in, and). Stopword removal is part of lexical analysis, which happens later, in stage 2, as explained in the article Anatomy of a search request. Furthermore, lexical analysis is applied only to query types that require complete terms; see Exceptions to lexical analysis for more examples.
There is a previous post here that explains this in more detail: Queries with stopwords and searchMode=all return no results.
I know you did not ask for workarounds, but to better understand what is going on it is useful to list some possible ones:
1. For the English analyzers, use a phrase search by wrapping the query in quotes: search="BA (Hons) in Business, Organisational Behaviour and Coaching"&searchMode=all
2. The standard analyzer works the way you expect: search=BA (Hons) in Business, Organisational Behaviour and Coaching&searchMode=all
3. Disable lexical analysis by defining a custom analyzer (sketched below).
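A minimal sketch of the third option, with hypothetical names (en_keep_stopwords, my_en_tokenizer): it keeps Microsoft-style English stemming but applies no stopword filter, so in and and survive as tokens and searchMode=all can match them.
"analyzers": [
{
"name": "en_keep_stopwords",
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer": "my_en_tokenizer",
"tokenFilters": [ "lowercase" ]
}
],
"tokenizers": [
{
"name": "my_en_tokenizer",
"@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
"language": "english"
}
]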

How to make Microsoft LUIS case sensitive?

I have an Azure LUIS instance for NLP and tried to extract alphanumeric values using a regex expression. It worked well, but the output was in lowercase.
For example:
CASE 1
My Input: " run job for AE0002" RegExCode = [a-zA-Z]{2}\d+
Output:
{
"query": " run job for AE0002",
"topScoringIntent": {
"intent": "Run Job",
"score": 0.7897274
},
"intents": [
{
"intent": "Run Job",
"score": 0.7897274
},
{
"intent": "None",
"score": 0.00434472738
}
],
"entities": [
{
"entity": "ae0002",
"type": "Alpha Number",
"startIndex": 15,
"endIndex": 20
}
]
}
I need to maintain the case of the input.
CASE 2
My Input: "Extract only abreaviations like HP and IBM" RegExCode = [A-Z]{2,}
Output:
{
"query": "extract only abreaviations like hp and ibm", // Query accepted by LUIS test window
"query": "extract only abreaviations like HP and IBM", // Query accepted as an endpoint url
"prediction": {
"normalizedQuery": "extract only abreaviations like hp and ibm",
"topIntent": "None",
"intents": {
"None": {
"score": 0.09844558
}
},
"entities": {
"Abbre": [
"extract",
"only",
"abreaviations",
"like",
"hp",
"and",
"ibm"
],
"$instance": {
"Abbre": [
{
"type": "Abbre",
"text": "extract",
"startIndex": 0,
"length": 7,
"modelTypeId": 8,
"modelType": "Regex Entity Extractor",
"recognitionSources": [
"model"
]
},
{
"type": "Abbre",
"text": "only",
"startIndex": 8,
"length": 4,
"modelTypeId": 8,
"modelType": "Regex Entity Extractor",
"recognitionSources": [
"model"
]
},....
{
"type": "Abbre",
"text": "ibm",
"startIndex": 39,
"length": 3,
"modelTypeId": 8,
"modelType": "Regex Entity Extractor",
"recognitionSources": [
"model"
]
}
]
}
}
}
}
This makes me doubt whether the entire training happens in lowercase. What shocked me was that all the words initially trained to their respective entities were re-labelled as Abbre.
Any input would be of great help :)
Thank you
For Case 1, do you need to preserve the case in order to query the job on your system? As long as the job identifier always has uppercase characters you can just use toUpperCase(), e.g. var jobName = step._info.options.entities.Alpha_Number.toUpperCase() (not sure about the underscore in Alpha Number, I've never had an entity with spaces before).
For Case 2, this is a shortcoming of the LUIS application. You can force case sensitivity in the regex with (?-i) (e.g. /(?-i)[A-Z]{2,}/g). However, LUIS appears to convert everything to lowercase first, so you'll never get any matches with that statement (which is better than matching every word, but that isn't saying much!). I don't know of any way to make LUIS recognize entities in the way you are requesting.
You could create a list entity with all of the abbreviations you are expecting, but depending on the inputs you are expecting, that could be too much to maintain. Plus abbreviations that are also words would be picked up as false positives (e.g. CAT and cat). You could also write a function to do it for you outside of LUIS, basically building your own manual entity detection. There could be some additional solutions based on exactly what you are trying to do after you identify the abbreviations.
You can simply use the character indexes provided in the output to get the values from the input string exactly as they were typed.
{
"query": " run job for AE0002",
...
"entities": [
{
"entity": "ae0002",
"type": "Alpha Number",
"startIndex": 15,
"endIndex": 20
}
]
}
Once you get this reply, use a substring method on your query with startIndex and endIndex (or endIndex - startIndex + 1 if your method wants a length rather than an end index, since endIndex points at the last character) to recover the value you are looking for, in its original casing.
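For example, in JavaScript (a sketch; the variable names are assumptions):
// 'response' holds the parsed LUIS reply shown above
const entity = response.entities[0];
// endIndex points at the last character of the match, so add 1 for substring's exclusive end
const originalText = response.query.substring(entity.startIndex, entity.endIndex + 1);
// originalText contains the entity exactly as typed, original casing included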

Returning partial matches in Azure Search

A while ago I set up a search index for a web application. One of the requirements was to return partial matches of the search terms. For instance, searching for Joh should find John Doe. The most straightforward way to implement this was to append a * to each search term before posting the query to Azure Search. So if a user types Joh, we actually ask Azure Search to search for Joh*.
One limitation of this approach is that all the matches of Joh* have the same search score. Because of this, sometimes a partial match appears higher in the results than an exact match. This is documented behavior, so I guess there is not much I can do about it. Or can I?
While my current way of returning partial matches seems like a hack, it has worked well enough in practice that it didn't matter to find out how to solve the problem properly. Now I have the time to look into it, and my instinct says there must be a "proper" way to do this. I have read the word "ngrams" here and there, and it seems to be part of the solution. I could probably find a passable solution after some hours of hacking on it, but if there is a "standard way" to achieve what I want, I would rather follow that path than use a home-grown hack. Hence this question.
So my question is: is there a standard way to retrieve partial matches in Azure Search, while giving exact matches a higher score? How should I change the code below to make Azure Search return the search results I need?
The code
Index definition, as returned by the Azure API:
{
"name": "test-index",
"defaultScoringProfile": null,
"fields": [
{
"name": "id",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
},
{
"name": "name",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": true,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
}
],
"scoringProfiles": [],
"corsOptions": null,
"suggesters": [],
"analyzers": [],
"tokenizers": [],
"tokenFilters": [],
"charFilters": []
}
The documents, as posted to the Azure API:
{
"value": [
{
"#search.action": "mergeOrUpload",
"id": "1",
"name": "Joh Doe"
},
{
"#search.action": "mergeOrUpload",
"id": "2",
"name": "John Doe"
}
]
}
Search query, as posted to the Azure API:
{
"search": "Joh*"
}
Results, where the exact match appears second, while we would like it to appear first:
{
"value": [
{
"#search.score": 1,
"id": "2",
"name": "John Doe"
},
{
"#search.score": 1,
"id": "1",
"name": "Joh Doe"
}
]
}
This is a very good question, and thanks for providing a detailed explanation. The easiest way to achieve this would be to use term boosting on the actual term and combine it with a wildcard query. You can modify the query in your post to:
search=Joh^10 OR Joh*&queryType=full
This will score documents that match Joh exactly higher. If you have more complicated requirements, you can look at constructing a custom analyzer with n-grams to support partial search.
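For the n-gram route, a rough sketch of what that could look like (the names prefix_analyzer and prefix_front are mine, and the gram sizes are assumptions to tune): index the name field with an edge n-gram analyzer, but keep a plain analyzer at query time so queries themselves are not exploded into n-grams.
"fields": [
{
"name": "name",
"type": "Edm.String",
"searchable": true,
"indexAnalyzer": "prefix_analyzer",
"searchAnalyzer": "standard"
}
],
"analyzers": [
{
"name": "prefix_analyzer",
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer": "standard_v2",
"tokenFilters": [ "lowercase", "prefix_front" ]
}
],
"tokenFilters": [
{
"name": "prefix_front",
"@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
"minGram": 2,
"maxGram": 20,
"side": "front"
}
]
With this, Joh is indexed as a token for John, so a plain query for Joh matches without a wildcard, and the exact term can still be boosted as above.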

Azure Search - phonetic search implementation

I was trying out phonetic search using Azure Search without much luck. My objective is to work out an index configuration that can handle typos and accommodate phonetic search for end users.
With the below configuration and sample data, I was trying to search for intentionally misspelled words like 'softvare' or 'alek'. I got results for 'alek' thanks to the phonetic analyzer, but didn't get any results for 'softvare'.
It looks like phonetic search will not do the trick for this requirement.
The only option I found was to use a synonym map. The major pitfall is that I'm unable to use the phonetic/custom analyzer along with synonyms :(
What are the various strategies that you would recommend for taking care of typos?
Search queries used
?api-version=2017-11-11&search=alec
?api-version=2017-11-11&search=softvare
Here is the index configuration
"name": "phonetichotels",
"fields": [
{"name": "hotelId", "type": "Edm.String", "key":true, "searchable": false},
{"name": "baseRate", "type": "Edm.Double"},
{"name": "description", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "analyzer":"my_standard"},
{"name": "hotelName", "type": "Edm.String", "analyzer":"my_standard"},
{"name": "category", "type": "Edm.String", "analyzer":"my_standard"},
{"name": "tags", "type": "Collection(Edm.String)", "analyzer":"my_standard"},
{"name": "parkingIncluded", "type": "Edm.Boolean"},
{"name": "smokingAllowed", "type": "Edm.Boolean"},
{"name": "lastRenovationDate", "type": "Edm.DateTimeOffset"},
{"name": "rating", "type": "Edm.Int32"},
{"name": "location", "type": "Edm.GeographyPoint"}
],
Analyzer (part of the index creation)
"analyzers":[
{
"name":"my_standard",
"#odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard_v2",
"tokenFilters":[ "lowercase", "asciifolding", "phonetic" ]
}
]
Analyze API Input and Output for 'software'
{
"analyzer":"my_standard",
"text": "software"
}
{
"#odata.context": "https://ctsazuresearchpoc.search.windows.net/$metadata#Microsoft.Azure.Search.V2017_11_11.AnalyzeResult",
"tokens": [
{
"token": "SFTW",
"startOffset": 0,
"endOffset": 8,
"position": 0
}
]
}
Analyze API Input and Output for 'softvare'
{
"analyzer":"my_standard",
"text": "softvare"
}
{
"#odata.context": "https://ctsazuresearchpoc.search.windows.net/$metadata#Microsoft.Azure.Search.V2017_11_11.AnalyzeResult",
"tokens": [
{
"token": "SFTF",
"startOffset": 0,
"endOffset": 8,
"position": 0
}
]
}
Sample data that I loaded
{
"#search.action": "upload",
"hotelId": "5",
"baseRate": 199.0,
"description": "Best hotel in town for software people",
"hotelName": "Fancy Stay",
"category": "Luxury",
"tags": ["pool", "view", "wifi", "concierge"],
"parkingIncluded": false,
"smokingAllowed": false,
"lastRenovationDate": "2010-06-27T00:00:00Z",
"rating": 5,
"location": { "type": "Point", "coordinates": [-122.131577, 47.678581] }
},
{
"#search.action": "upload",
"hotelId": "6",
"baseRate": 79.99,
"description": "Cheapest hotel in town ",
"hotelName": " Alec Baldwin Motel",
"category": "Budget",
"tags": ["motel", "budget"],
"parkingIncluded": true,
"smokingAllowed": true,
"lastRenovationDate": "1982-04-28T00:00:00Z",
"rating": 1,
"location": { "type": "Point", "coordinates": [-122.131577, 49.678581] }
},
With the right configuration, I should have got results even with the misspelled words.
I work on Azure Search. Before I suggest approaches to handle misspelled words, it helps to look at your custom analyzer (my_standard) configuration, because it tells us why the 'softvare' case isn't handled. As a DIY, you can use the Analyze API to see the tokens created with your custom analyzer: for a document to match, the misspelling would have to produce the same token as 'software', but as your output above shows, 'software' becomes SFTW while 'softvare' becomes SFTF.
Now then, here are a few ways that can be used independently or in conjunction to handle misspelled words. The best approach varies depending on the use-case and I strongly suggest you experiment with these to figure out the best one in your case.
You are already familiar with phonetic filters, which are a common approach to handling similarly pronounced terms. If you haven't already, try different encoders for the filter to evaluate which configuration gives you the best results; the documentation lists the available encoders.
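As a sketch of this (the filter name my_phonetic is mine), a variant that uses the doubleMetaphone encoder and sets replace to false, so the original token is kept alongside the encoded one and correctly spelled terms still match exactly:
"tokenFilters": [
{
"name": "my_phonetic",
"@odata.type": "#Microsoft.Azure.Search.PhoneticTokenFilter",
"encoder": "doubleMetaphone",
"replace": false
}
],
"analyzers": [
{
"name": "my_standard",
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer": "standard_v2",
"tokenFilters": [ "lowercase", "asciifolding", "my_phonetic" ]
}
]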
Use fuzzy queries, supported as part of the Lucene query syntax in Azure Search, which return terms that are near the original query term based on a distance metric. The limitation is that it works on a single term. Check the docs for more details. A sample query would look like search=softvare~1. You can also use term boosting to give the original term more weight in cases where it is also a valid term.
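For example, reusing the request style from the question (fuzzy syntax requires the full Lucene parser):
?api-version=2017-11-11&queryType=full&search=softvare~1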
You also alluded to synonyms, which are likewise used to match misspelled queries. This approach gives you the most control over the process of handling typos, but it also requires prior knowledge of the typos that occur for your terms. You can use these docs if you want to experiment with synonyms.
As you can read in my post, my objective was to handle typos.
The only easy option is to use the built-in Lucene functionality, fuzzy search. I'm yet to check the response times, as queryType has to be set to 'full' to use fuzzy search. Otherwise, the results were satisfactory.
Example:
search=softvare~&fuzzy=true&queryType=full
will return all documents with 'software' in them.
For further reading, please go through the documentation.

Filtering Contentful Query on Linked Objects

I'm attempting to utilize Contentful on a current project of mine and I'm trying to understand how to filter my query results based on a field in a linked object.
My top level object contains a Link defined as such:
"name": "Service_Description",
"fields": [
{
"name": "Header",
"id": "header",
"type": "Link",
"linkType": "Entry",
"required": true,
"validations": [
{
"linkContentType": [
"offerGeneral"
]
}
],
"localized": false,
"disabled": false,
"omitted": false
},
This "header" field links to another content type that has this definition:
"fields": [
{
"name": "General",
"id": "general",
"type": "Link",
"linkType": "Entry",
"required": true,
"validations": [
{
"linkContentType": [
"genericGeneral"
]
}
],
"localized": false,
"disabled": false,
"omitted": false
},
which then links to the lowest level:
"fields": [{
"name": "TagList",
"id": "tagList",
"type": "Array",
"items": {
"type": "Link",
"linkType": "Entry",
"validations": [
{
"linkContentType": [
"tag"
]
}
]
},
"validations": []
}
where tagList is an array of tags this piece of content may have.
I want to be able to run a query from the top-level object that says: get me X of these "Service_Description" content entries where the entry has a tag from a supplied list of tags.
In Postman, I've been running this:
https://cdn.contentful.com/spaces/{SPACE_ID}/entries?access_token={ACCESS_TOKEN}&content_type=serviceDescription&include=3
I'm trying to add a filter something like so:
fields.header.fields.general.fields.tagList.sys.id%5Bin%5D={TAG_SYS_ID}
This is clearly incorrect, but I've been struggling with how to walk this relationship to achieve my goal. Perusing the documentation, this seems to have something to do with includes, but I'm unsure how that solves the problem.
Any direction on how to achieve my goal, or on whether this is even possible?
This is now possible; I believe it was solved in the API based on requests for this functionality. You can see the thread here.
The gist of it is that you have to query on the entries that have linked entries, and then include the content type for those linked entries in the query, like so:
contentfulClient.getEntries({
'content_type': 'location',
'fields.market.fields.marketName': 'New York',
'fields.market.sys.contentType.sys.id': 'marketRegion'
})
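The same query expressed as a plain CDA request, in the Postman style used in the question (placeholders as before):
https://cdn.contentful.com/spaces/{SPACE_ID}/entries?access_token={ACCESS_TOKEN}&content_type=location&fields.market.sys.contentType.sys.id=marketRegion&fields.market.fields.marketName=New%20York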
Unfortunately what you are requesting is not currently possible in Contentful.
We were facing a very similar issue with nested/referenced content types, and support said it wasn't possible.
We ended up writing a very complicated system that allowed us to do what you want: essentially doing a full-text search for the referenced content and then querying all of the parent entries. We then matched the relationships by iterating over the parents.
Sorry it couldn't be easier. Hopefully the devs work on something that improves this. We have brought it to their attention.
