Azure Search - phonetic search implementation

Azure Search - phonetic search implementation - azure

I was trying out Phoenetic search using Azure Search without much luck. My objective is to work out an Index configuration that can handle typos and accomodate phonetic search for end users.
With the below configuration and sample data, I was trying to search for intentionally misspelled words like 'softvare' or 'alek'. I got results for 'alek' thanks for Phonetic analyzer; but didn't get any results for 'softvare'.
Looks like for this requirement phonetic search will not do the trick.
Only option that I found was to use synonyms map. The major pitfall is that I'm unable to use the Phonetics / Custom analyzer along with Synonyms :(
What are the various strategies that you would recommend for taking care of typos?
search query used
?api-version=2017-11-11&search=alec
?api-version=2017-11-11&search=softvare
Here is the index configuration
"name": "phonetichotels",
"fields": [
{"name": "hotelId", "type": "Edm.String", "key":true, "searchable": false},
{"name": "baseRate", "type": "Edm.Double"},
{"name": "description", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "analyzer":"my_standard"},
{"name": "hotelName", "type": "Edm.String", "analyzer":"my_standard"},
{"name": "category", "type": "Edm.String", "analyzer":"my_standard"},
{"name": "tags", "type": "Collection(Edm.String)", "analyzer":"my_standard"},
{"name": "parkingIncluded", "type": "Edm.Boolean"},
{"name": "smokingAllowed", "type": "Edm.Boolean"},
{"name": "lastRenovationDate", "type": "Edm.DateTimeOffset"},
{"name": "rating", "type": "Edm.Int32"},
{"name": "location", "type": "Edm.GeographyPoint"}
],
Analyzer (part of the index creation)
"analyzers":[
{
"name":"my_standard",
"#odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard_v2",
"tokenFilters":[ "lowercase", "asciifolding", "phonetic" ]
}
]
Analyze API Input and Output for 'software'
{
"analyzer":"my_standard",
"text": "software"
}
{
"#odata.context": "https://ctsazuresearchpoc.search.windows.net/$metadata#Microsoft.Azure.Search.V2017_11_11.AnalyzeResult",
"tokens": [
{
"token": "SFTW",
"startOffset": 0,
"endOffset": 8,
"position": 0
}
]
}
Analyze API Input and Output for 'softvare'
{
"analyzer":"my_standard",
"text": "softvare"
}
{
"#odata.context": "https://ctsazuresearchpoc.search.windows.net/$metadata#Microsoft.Azure.Search.V2017_11_11.AnalyzeResult",
"tokens": [
{
"token": "SFTF",
"startOffset": 0,
"endOffset": 8,
"position": 0
}
]
}
Sample data that I loaded
{
"#search.action": "upload",
"hotelId": "5",
"baseRate": 199.0,
"description": "Best hotel in town for software people",
"hotelName": "Fancy Stay",
"category": "Luxury",
"tags": ["pool", "view", "wifi", "concierge"],
"parkingIncluded": false,
"smokingAllowed": false,
"lastRenovationDate": "2010-06-27T00:00:00Z",
"rating": 5,
"location": { "type": "Point", "coordinates": [-122.131577, 47.678581] }
},
{
"#search.action": "upload",
"hotelId": "6",
"baseRate": 79.99,
"description": "Cheapest hotel in town ",
"hotelName": " Alec Baldwin Motel",
"category": "Budget",
"tags": ["motel", "budget"],
"parkingIncluded": true,
"smokingAllowed": true,
"lastRenovationDate": "1982-04-28T00:00:00Z",
"rating": 1,
"location": { "type": "Point", "coordinates": [-122.131577, 49.678581] }
},
With the right configuration, I should have got results even with the misspelled words.

I work on Azure Search. Before I suggest approaches to handle misspelled words, it would be helpful to look at your custom analyzer (my_standard) configuration. It might tell us why it's not able to handle the case for 'softvare'. As a DIY, you can use the Analyze API to see the tokens created using your custom analyzer and it should contain 'software' to actually match the docs.
Now then, here are a few ways that can be used independently or in conjunction to handle misspelled words. The best approach varies depending on the use-case and I strongly suggest you experiment with these to figure out the best one in your case.
You are already familiar with phonetic filters which is a common approach to handle similarly pronounced terms. If you haven't already, try different encoders for the filter to evaluate which configuration gives you the best results. Check out the list of encoders here.
Use fuzzy queries supported as part of the Lucene query syntax in Azure Search which returns terms that are near the original query term based on a distance metric. The limitation here is that it works on a single term. Check the docs for more details. Sample query would look like - search=softvare~1 You can also use term boosting to give the original term more boost in cases where the original term is also a valid term.
You also alluded to synonyms which is also used to query with misspelled terms. This approach gives you the most control over the process of handling typos but also require you to have prior knowledge of different typos for terms. You can use these docs if you want to experiment with synonyms.

As you could read in my post; my Objective was to handle the typos.
The only easy option is to use the inbuilt Lucene functionality - Fuzzy Search. I'm yet to check on the response times as the querytype has to be set to 'full' for using fuzzy search. Otherwise, the results were satisfactory.
Example:
search=softvare~&fuzzy=true&querytype=full
will return all documents with the 'Software' in it.
For further reading please go through Documentation

Related

Indexing e-mails in Azure Search

I'm trying to best index contents of e-mail messages, subjects and email addresses. E-mails can contain both text and HTML representation. They can be in any language so I can't use language specific analysers unfortunately.
As I am new to this I have many questions:
First I used Standard Lucene analyser but after some testing and
checking what each analyser does I switched to using "simple"
analyser. Standard one didn't allow me to search by domain in
user#domain.com (It sees user and domain.com as tokens). Is "simple" the best I can use in my case?
How can I handle HTML contents of e-mail? I thought this should be
possible to do it in Azure Search but right now I think I would need
to strip HTML tags myself.
My users aren't tech savvy and I assumed "simple" query type will be
enough for them. I expect them to type word or two and find messages
containing this word/containing words starting with this word. From my tests it looks I need to append * to their queries to get "starting with" to work?

It would help if you included an example of your data and how you index and query. What happened, and what did you expect?
The standard Lucene analyzer will work with your user#domain.com example. It is correct that it produces the tokens user and domain.com. But the same happens when you query, and you will get records with the tokens user and domain.com.
CREATE INDEX
"fields": [
{"name": "Id", "type": "Edm.String", "searchable": false, "filterable": true, "retrievable": true, "sortable": true, "facetable": false, "key": true, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null, "synonymMaps": [] },
{"name": "Email", "type": "Edm.String", "filterable": true, "sortable": true, "facetable": false, "searchable": true, "analyzer": "standard"}
]
UPLOAD
{
"value": [
{
"#search.action": "mergeOrUpload",
"Id": "1",
"Email": "user#domain.com"
},
{
"#search.action": "mergeOrUpload",
"Id": "2",
"Email": "some.user#some-domain.com"
},
{
"#search.action": "mergeOrUpload",
"Id": "3",
"Email": "another#another.com"
}
]
}
QUERY
Query, using full and all.
https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{INDEX_NAME}}/docs?search=user#domain.com&$count=true&$select=Id,Email&searchMode=all&queryType=full&api-version={{API-VERSION}}
Which produces results as expected (all records containing user and domain.com):
{
"#odata.context": "https://<your-search-env>.search.windows.net/indexes('dg-test-65392234')/$metadata#docs(*)",
"#odata.count": 2,
"value": [
{
"#search.score": 0.51623213,
"Id": "1",
"Email": "user#domain.com"
},
{
"#search.score": 0.25316024,
"Id": "2",
"Email": "some.user#some-domain.com"
}
]
}
If your expected result is to only get the record above where the email matches completely, you could instead use a phrase search. I.e. replace the search parameter above with search="user#domain.com" and you would get:
{
"#search.score": 0.51623213,
"Id": "1",
"Email": "user#domain.com"
}
Alternatively, you could use the keyword analyzer.
ANALYZE
You can compare the different analyzers directly via REST. Using the keyword analyzer on the Email property will produce a single token.
{
"text": "some-user#some-domain.com",
"analyzer": "keyword"
}
Results in the following tokens:
"tokens": [
{
"token": "some-user#some-domain.com",
"startOffset": 0,
"endOffset": 25,
"position": 0
}
]
Compared to the standard tokenizer, which does a decent job for most types of unstructured content.
{
"text": "some-user#some-domain.com",
"analyzer": "standard"
}
Which produces reasonable results for cases where the email address was part of some generic text.
"tokens": [
{
"token": "some",
"startOffset": 0,
"endOffset": 4,
"position": 0
},
{
"token": "user",
"startOffset": 5,
"endOffset": 9,
"position": 1
},
{
"token": "some",
"startOffset": 10,
"endOffset": 14,
"position": 2
},
{
"token": "domain.com",
"startOffset": 15,
"endOffset": 25,
"position": 3
}
]
SUMMARY
This is a long answer already, so I won't cover your other two questions in detail. I would suggest splitting them to separate questions so it can benefit others.
HTML content: You can use a built-in HTML analyzer that strips HTML tags. Or you can strip the HTML yourself using custom code. I typically use Beautiful Soup for cases like these or simple regular expressions for simpler cases.
Wildcard search: Usually, users don't expect automatic wildcards appended. The only application that does this is the Outlook client, which destroys precision. When I search for "Jan" (a common name), I annoyingly get all emails sent in January(!). And a search for Dan (again, a name), I also get all emails from Danmark (Denmark).
Everything in search is a trade-off between precision and recall. In your first example with the email address, your expectation was heavily geared toward precision. But, in your last wildcard question, you seem to prefer extreme recall with wildcards on everything. It all comes down to your expectations.

Azure Digital Twins: what does "GetOntologies" response means?

I am trying to understand the provisioning process of Digital Twins and I am reading this doc: https://learn.microsoft.com/en-us/azure/digital-twins/tutorial-facilities-setup
But I can not follow the point in this section
however, I can not understand the response for "dotnet run GetOntologies"
Anyone can help me to understand better what are those values? and how are they related to "models are available"?

In Azure Digital Twins, the Ontology entity contains a set of all types and subtypes that can be used in your application. In your example the "Required" and "Default" ontology are enabled (this is by default). If you use the REST API to see what the "Default" ontology contains you get the following:
{
"id": 2,
"name": "Default",
"loaded": true,
"types": [
{
"id": 17,
"category": "SensorDataType",
"name": "Humidity",
"disabled": false,
"logicalOrder": 0
},
{
"id": 18,
"category": "SensorDataType",
"name": "Temperature",
"disabled": false,
"logicalOrder": 0
},
{
"id": 19,
"category": "SensorDataSubtype",
"name": "RoomHumidity",
"disabled": false,
"logicalOrder": 0,
"friendlyName": "Room Humidity"
}, // etc etc
As you can see in the example above, the ontology has basic definitions for the types of sensors/spaces/data types for things related to Smart Building scenarios. The BACnet and Advanced ontologies just add different and more specific types. When you set an ontology to 'enabled', you can start using those types/subtypes. You can check them out in the REST API with:
https://your-url.your-region.azuresmartspaces.net/management/api/v1.0/ontologies/3?includes=Types

Returning partial matches in Azure Search

A while ago I set up a search index for a web application. One of the requirements was to return partial matches of the search terms. For instance, searching for Joh should find John Doe. The most straightforward way to implement this was to append a * to each search term before posting the query to Azure Search. So if a user types Joh, we actually ask Azure Search to search for Joh*.
One limitation of this approach is that all the matches of Joh* have the same search score. Because of this, sometimes a partial match appears higher in the results than an exact match. This is documented behavior, so I guess there is not much I can do about it. Or can I?
While my current way to return partial matches seems like a hack, it has worked well enough in practice that I didn't matter finding out how to properly solve the problem. Now I have the time to look into it and my instinct says there must be a "proper" way to do this. I have read the word "ngrams" here and there, and it seems to be part of the solution. I could probably find a passable solution after some of hours of hacking on it, but if there is any "standard way" to achieve what I want, I would rather follow that path instead of using a home-grown hack. Hence this question.
So my question is: is there a standard way to retrieve partial matches in Azure Search, while giving exact matches a higher score? How should I change the code below to make Azure Search return the search results I need?
The code
Index definition, as returned by the Azure API:
{
"name": "test-index",
"defaultScoringProfile": null,
"fields": [
{
"name": "id",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
},
{
"name": "name",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": true,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
}
],
"scoringProfiles": [],
"corsOptions": null,
"suggesters": [],
"analyzers": [],
"tokenizers": [],
"tokenFilters": [],
"charFilters": []
}
The documents, as posted to the Azure API:
{
"value": [
{
"#search.action": "mergeOrUpload",
"id": "1",
"name": "Joh Doe"
},
{
"#search.action": "mergeOrUpload",
"id": "2",
"name": "John Doe"
}
]
}
Search query, as posted to the Azure API:
{
search: "Joh*"
}
Results, where the exact match appears second, while we would like it to appear first:
{
"value": [
{
"#search.score": 1,
"id": "2",
"name": "John Doe"
},
{
"#search.score": 1,
"id": "1",
"name": "Joh Doe"
}
]
}

This is a very good question and thanks for providing a detailed explanation. The easiest way to achieve that would be to use term boosting on the actual term and combine it with a wildcard query. You can modify the query in your post to -
search=Joh^10 OR Joh*&queryType=full
This will score the documents that match Joh exactly higher. If you have more complicated requirements, you can look at constructing a custom analyzer with ngrams to search on them to support partial search.

Filtering Contentful Query on Linked Objects

I'm attempting to utilize Contentful on a current project of mine and I'm trying to understand how to filter my query results based on a field in a linked object.
My top level object contains a Link defined as such:
"name": "Service_Description",
"fields": [
{
"name": "Header",
"id": "header",
"type": "Link",
"linkType": "Entry",
"required": true,
"validations": [
{
"linkContentType": [
"offerGeneral"
]
}
],
"localized": false,
"disabled": false,
"omitted": false
},
This "header" field links to another content type that has this definition:
"fields": [
{
"name": "General",
"id": "general",
"type": "Link",
"linkType": "Entry",
"required": true,
"validations": [
{
"linkContentType": [
"genericGeneral"
]
}
],
"localized": false,
"disabled": false,
"omitted": false
},
which then links to the lowest level:
"fields": [{
"name": "TagList",
"id": "tagList",
"type": "Array",
"items": {
"type": "Link",
"linkType": "Entry",
"validations": [
{
"linkContentType": [
"tag"
]
}
]
},
"validations": []
}
where tagList is an array of tags this piece of content may have.
I want to be able to run a query from the top level object that says get me X number of these "Service_Description" content entries where it contains a tag from a supplied list of tags.
In PostMan, I've been running with this:
https://cdn.contentful.com/spaces/{SPACE_ID}/entries?access_token={ACCESS_TOKEN}&content_type=serviceDescription&include=3
I'm trying to add a filter something like so:
fields.header.fields.general.fields.tagList.sys.id%5Bin%5D={TAG_SYS_ID}
This is clearly incorrect, but I've been struggling with how to walk this relationship to achieve my goal. Perusing the documentation this seems to have something to do with includes, but I'm unsure of how to rectify the problem.
Any direction on how to achieve my goal or if this is possible?

This is now possible, something I believe was solved for in the API based on requests for this functionality. You can see the thread here.
This gist of it is that you have to query on the entries that have linked entries and then include the contentType for those linked entries in the query like so:
contentfulClient.getEntries({
'content_type': 'location',
'fields.market.fields.marketName': 'New York',
'fields.market.sys.contentType.sys.id': 'marketRegion'
})

Unfortunately what you are requesting is not currently possible in Contentful.
We were facing a very similar issue with nested/referenced content types and support said it wasn't possible.
We ended up writing a very complicated system that allowed us to do what you want. Essentially doing a full text search for the referenced content and then querying all of the parents entries. We then matched the relationships by iterating over the parents to find the relationship.
Sorry it couldn't be easier. Hopefully the devs work on something that improve this complication. We have brought this to their attention.

Elastic opposite query

I'm using elastic search for classic queries LIKE "search all documents with G4 in name and LG in manfucaturer". This is ok. But what if I have a lot of documents and database with lot of search terms and I need to know which documents match some specific multicolumn terms. For example:
Documents:
[
{
"id": 5787,
"name": "Smartphone G4",
"manufacturer": "LG",
"description": "The revolutionary LG G4 design can only be described as forward thinking—with a classic touch."
},
{
"id": 68779,
"name": "Smartphone S6",
"manufacturer": "Samsung",
"description": "The Samsung Galaxy S6 is powerful to use and beautiful to behold."
}
]
...
Terms:
[
{
"id": "587",
"name": "G4",
"manufacturer": "LG",
"description": "classic touch"
},
{
"id": "364",
"manufacturer": "Samsung",
"description": "galaxy s6"
}
]
...
Result:
{
"587": [5787],
"364": [68779]
}
OR:
{
"5787": [587],
"68779": [364]
}
I need list of documents and list of terms which corresponds them (or oposite). In small amount of terms, it should be possible to apply all rules one by one and save matching documents. But I have milions of documents and thousands of terms. So, it is not possible to aply them one by one. Is it possible in another way?

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html is exactly what I wanted. It can store your queries and execute them against documents.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string