I am trying to enable a phonetic analyzer and synonyms together, and it doesn't seem to work. Is it wrong to use them together?
In the implementation below I would expect the search query to be expanded using synonyms and then the phonetic analyzer to be used to retrieve the results. But my synonyms are totally ignored here.
If I remove the phonetic analyzer from the index creation, the synonyms work fine.
The synonyms also work fine if I use a built-in analyzer like en.microsoft instead of a custom analyzer. Is this a bug?
My Synonym map
{
"name": "mysynonymmap",
"format": "solr",
"synonyms": "SW, Software, Softvare, software, softvare, sft\nHW, Hardware, Hardvare, hardware, hardvare, hdw => hardware\n"
}
Below is how the index is created:
{
"name": "newphonetichotelswithsynonyms",
"fields": [
{"name": "hotelId", "type": "Edm.String", "key":true, "searchable": false},
{"name": "baseRate", "type": "Edm.Double"},
{"name": "description", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "analyzer":"my_standard",
"synonymMaps":[
"mysynonymmap"
]},
{"name": "hotelName", "type": "Edm.String", "analyzer":"my_standard",
"synonymMaps":[
"mysynonymmap"
]},
{"name": "category", "type": "Edm.String", "analyzer":"my_standard",
"synonymMaps":[
"mysynonymmap"
]},
{"name": "tags", "type": "Collection(Edm.String)", "analyzer":"my_standard",
"synonymMaps":[
"mysynonymmap"
]},
{"name": "parkingIncluded", "type": "Edm.Boolean"},
{"name": "smokingAllowed", "type": "Edm.Boolean"},
{"name": "lastRenovationDate", "type": "Edm.DateTimeOffset"},
{"name": "rating", "type": "Edm.Int32"},
{"name": "location", "type": "Edm.GeographyPoint"}
],
"analyzers":[
{
"name":"my_standard",
"#odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard_v2",
"tokenFilters":[ "lowercase", "asciifolding", "phonetic" ]
}
]
}
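To see what my_standard actually produces for a term, it can be run through the Analyze Text API against this index (a sketch; the phonetic token filter encodes each token, so this shows how query terms are transformed):
{
"text": "software",
"analyzer": "my_standard"
}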
Related
I have a search setup for a list of jobs.
Other than the Id field I have just two fields marked as searchable, both using the standard.lucene analyzer.
I'm having difficulty making the search work with terms like C# or C++.
Per the documentation, unsafe characters should be URL-encoded and special characters should be escaped:
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax
But say I want to search all jobs with C# in the title or description: when I search using the encoding %23, I don't seem to get the correct results. It returns other jobs that contain only a C alongside other terms.
Likewise, if I want to get only the C++ jobs, per the Azure docs I should use something like C\+\+. When I tried that, I got other jobs as well.
So I'm not sure if I'm missing something or misunderstood the documentation, but I'm not able to get the exact results I was expecting.
Your results depend on the analyzer you use. Read more about analyzers in the documentation: https://learn.microsoft.com/en-us/azure/search/search-analyzers
You can check how your query is analyzed by calling the analyze API with your query and a specified analyzer. For example:
{
"text": "C# engineer",
"analyzer": "standard"
}
Will result in:
"tokens": [
{
"token": "c",
"startOffset": 0,
"endOffset": 1,
"position": 0
},
{
"token": "engineer",
"startOffset": 3,
"endOffset": 11,
"position": 1
}
]
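For reference, this analysis is a POST to the index's analyze endpoint (the placeholders follow the same convention as the query URL further down):
POST https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{YOUR-INDEX}}/analyze?api-version={{API-VERSION}}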
With the standard analyzer, you effectively lose the ability to separate a C# engineer from a C++ engineer: the analyzer reduces both to the same two tokens, c and engineer.
Instead, use one of the built-in analyzers or configure your own. Here is the same example using the whitespace analyzer.
{
"text": "C# engineer",
"analyzer": "whitespace"
}
Results in
"tokens": [
{
"token": "C#",
"startOffset": 0,
"endOffset": 2,
"position": 0
},
{
"token": "engineer",
"startOffset": 3,
"endOffset": 11,
"position": 1
}
]
CREATE
Create the index like this
"fields": [
{
"name": "Id",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": false,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
},
{
"name": "Title",
"type": "Edm.String",
"facetable": false,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": true,
"analyzer": "whitespace",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
],
UPLOAD
Test by uploading the following simple content.
"value": [
{
"#search.action": "mergeOrUpload",
"Id": "1",
"Title": "C++ Engineer"
},
{
"#search.action": "mergeOrUpload",
"Id": "2",
"Title": "C# Engineer"
}
]
QUERY
Then query
https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{YOUR-INDEX}}/docs?search=C%23&$count=true&searchMode=all&queryType=full&api-version={{API-VERSION}}
Your search for C# now returns the entry with C# in the title.
{
"#odata.context": "https://walter-searchsvc-we-dev.search.windows.net/indexes('so-73954804')/$metadata#docs(*)",
"#odata.count": 1,
"value": [
{
"#search.score": 0.25811607,
"Id": "2",
"Title": "C# Engineer"
}
]
}
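The same approach covers the C++ case: the whitespace analyzer keeps C++ as a single token, so a full Lucene query with the plus signs escaped (C\+\+, URL-encoded below) should match only the C++ document. A sketch, untested:
https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{YOUR-INDEX}}/docs?search=C%5C%2B%5C%2B&$count=true&searchMode=all&queryType=full&api-version={{API-VERSION}}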
Consider the following model, where Address has a nested property City:
{
"HotelId": "1",
"HotelName": "Secret Point Motel",
"Description": "Ideally located on the main commercial artery of the city in the heart of New York.",
"Tags": ["Free wifi", "on-site parking", "indoor pool", "continental breakfast"],
"Address": {
"StreetAddress": "677 5th Ave",
"City": "New York",
"StateProvince": "NY"
},
"Rooms": [
{
"Description": "Budget Room, 1 Queen Bed (Cityside)",
"RoomNumber": 1105,
"BaseRate": 96.99,
},
{
"Description": "Deluxe Room, 2 Double Beds (City View)",
"Type": "Deluxe Room",
"BaseRate": 150.99,
}
. . .
]
}
The model is indexed in Azure Cognitive Search as follows, where Address is defined as Edm.ComplexType:
{
"name": "hotels",
"fields": [
{ "name": "HotelId", "type": "Edm.String", "key": true, "filterable": true },
{ "name": "HotelName", "type": "Edm.String", "searchable": true, "filterable": false },
{ "name": "Description", "type": "Edm.String", "searchable": true, "analyzer": "en.lucene" },
{ "name": "Address", "type": "Edm.ComplexType",
"fields": [
{ "name": "StreetAddress", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "searchable": true },
{ "name": "City", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true, "facetable": true },
{ "name": "StateProvince", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true, "facetable": true }
]
},
{ "name": "Rooms", "type": "Collection(Edm.ComplexType)",
"fields": [
{ "name": "Description", "type": "Edm.String", "searchable": true, "analyzer": "en.lucene" },
{ "name": "Type", "type": "Edm.String", "searchable": true },
{ "name": "BaseRate", "type": "Edm.Double", "filterable": true, "facetable": true }
]
}
]
}
Now I am trying to search the data for City equals 'New York' using the following filter expressions, but none of them works:
city eq 'new york' // returns no results
address/city eq 'new york' // error: The property 'address/city' does not exist
address.city eq 'new york' // error: The property 'address.city' does not exist
So how do I filter on an Edm.ComplexType field in Azure Cognitive Search?
N.B.: I am using the Azure .NET SDK (10.1.0).
The correct syntax is to define the OData expression in the $filter clause. If you were using the REST API, your $filter clause would be:
Address/City eq 'New York'
Your code is failing because the actual field path is Address/City, whereas you are specifying it as address/city; field paths in filter expressions are case-sensitive. Once you use the proper field names, your code should work just fine.
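As a concrete sketch of the REST form (service name and API version are placeholders; the filter expression itself is the documented OData syntax):
POST https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/hotels/docs/search?api-version={{API-VERSION}}
{
"search": "*",
"filter": "Address/City eq 'New York'"
}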
I am indexing data into an Azure Search index that is produced by a custom skill. This custom skill produces complex data which I want to preserve in the Azure Search index.
Source data is coming from blob storage, and I am constrained to using the REST API unless there is a very solid argument for using the .NET SDK.
Current code
The following is a brief rundown of what I currently have. I cannot change the index's fields or the format of the data produced by the endpoint used by the custom skill.
Complex data
The following is an example of complex data produced by the custom skill (in the correct value/recordId/etc. format):
{
"field1": 0.135412,
"field2": 0.123513,
"field3": 0.243655
}
Custom skill
Here is the custom skill which creates said data:
{
"#odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"uri": "https://myfunction.azurewebsites.com/api,
"httpHeaders": {},
"httpMethod": "POST",
"timeout": "PT3M50S",
"batchSize": 1,
"degreeOfParallelism": 5,
"name": "MySkill",
"context": "/document/mycomplex
"inputs": [
{
"name": "text",
"source": "/document/content"
}
],
"outputs": [
{
"name": "field1",
"targetName": "field1"
},
{
"name": "field2",
"targetName": "field2"
},
{
"name": "field3",
"targetName": "field3"
}
]
}
I have attempted several variations, notably using the ShaperSkill with each field as an input and the output "targetName" set to "mycomplex" (with the appropriate context).
Indexer
Here is the indexer's output field mapping for the skill:
{
"sourceFieldName": "/document/mycomplex,
"targetFieldName": "mycomplex"
}
I have tried several variations, such as "sourceFieldName": "/document/mycomplex/*".
Search index
And this is the targeted index field:
{
"name": "mycomplex",
"type": "Edm.ComplexType",
"fields": [
{
"name": "field1",
"type": "Edm.Double",
"retrievable": true,
"filterable": true,
"sortable": true,
"facetable": false,
"searchable": false
},
{
"name": "field2",
"type": "Edm.Double",
"retrievable": true,
"filterable": true,
"sortable": true,
"facetable": false,
"searchable": false
},
{
"name": "field3",
"type": "Edm.Double",
"retrievable": true,
"filterable": true,
"sortable": true,
"facetable": false,
"searchable": false
}
]
}
Result
My result is usually an error similar to: Could not map output field 'mycomplex' to search index. Check your indexer's 'outputFieldMappings' property.
This may be a mistake with the context of your skill. Instead of setting the context to /document/mycomplex, can you try setting it to /document? You can then add a ShaperSkill, also with its context set to /document and its output field named mycomplex, to generate the expected complex type shape.
Example skills:
"skills":
[
{
"#odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"uri": "https://myfunction.azurewebsites.com/api,
"httpHeaders": {},
"httpMethod": "POST",
"timeout": "PT3M50S",
"batchSize": 1,
"degreeOfParallelism": 5,
"name": "MySkill",
"context": "/document"
"inputs": [
{
"name": "text",
"source": "/document/content"
}
],
"outputs": [
{
"name": "field1",
"targetName": "field1"
},
{
"name": "field2",
"targetName": "field2"
},
{
"name": "field3",
"targetName": "field3"
}
]
},
{
"#odata.type": "#Microsoft.Skills.Util.ShaperSkill",
"context": "/document",
"inputs": [
{
"name": "field1",
"source": "/document/field1"
},
{
"name": "field2",
"source": "/document/field2"
},
{
"name": "field3",
"source": "/document/field3"
}
],
"outputs": [
{
"name": "output",
"targetName": "mycomplex"
}
]
}
]
Please refer to the documentation on shaper skill for specifics.
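With the skillset above, the ShaperSkill writes its shaped object to /document/mycomplex, so the indexer's output field mapping from the question should then resolve as written:
{
"sourceFieldName": "/document/mycomplex",
"targetFieldName": "mycomplex"
}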
I have a nested JSON structure as shown below. When I call avro.schema.Parse on it using Python 3, I get this error:
avro.schema.SchemaParseException: Unknown named schema 'record', known names:[data.info]
{"namespace" : "data",
"type": "record",
"name": "info",
"doc": "A list of strings.",
"fields": [
{"name": "DATE", "type": "string"},
{"name": "file", "type": "string"},
{"name": "info", "type": "record", "fields": [
{"name": "START_DATE", "type": "string"},
{"name": "END_DATE", "type": "string"},
{"name": "other", "type": "array", "items":"string"}]}
]
}
The problem was with the nested Avro structure; I could solve it by following Avro nested schemas.
Using avro-json-validator can also help catch such problems as soon as the .avsc files are written: a successful conversion to JSON indicates that avro.schema.Parse will work fine. I validated the further updates I made to the .avsc file this way, and it worked fine.
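For reference, this is one way the schema parses once the nested record and the array are written as inline schema objects rather than bare type names (the nested record also needs its own unique name; info_detail below is an arbitrary choice):
{
"namespace": "data",
"type": "record",
"name": "info",
"doc": "A list of strings.",
"fields": [
{"name": "DATE", "type": "string"},
{"name": "file", "type": "string"},
{"name": "info", "type": {"type": "record", "name": "info_detail", "fields": [
{"name": "START_DATE", "type": "string"},
{"name": "END_DATE", "type": "string"},
{"name": "other", "type": {"type": "array", "items": "string"}}]}}
]
}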
I know blob storage is the only data source (thus far) that supports the indexing of HTML content.
My question is: is it possible to strip the content using a custom analyzer with the 'html_strip' char filter (mentioned in the Azure docs) before adding a document to an index via REST?
Here is my create index payload:
{
"name": "htmlindex",
"fields": [
{"name": "id", "type": "Edm.String", "key": true, "searchable": false},
{"name": "title", "type": "Edm.String", "filterable": true, "sortable": true, "facetable": true},
{"name": "html", "type": "Collection(Edm.String)", "analyzer": "htmlAnalyzer"}
],
"analyzers": [
{
"name": "htmlAnalyzer",
"#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters": [ "html_strip" ],
"tokenizer": "standard_v2"
}
]
}
Here is my add document to index payload:
{
"value": [
{
"id": "1",
"title": "title1",
"html": [
"<p>test1</p>",
"<p>test2</p>"
]
}
]
}
Now when I search the index, I see the HTML content is not being stripped:
{
"#odata.context": "https://deviqfy.search.windows.net/indexes('htmlindex')/$metadata#docs",
"value": [
{
"#search.score": 1,
"id": "1",
"title": "title1",
"html": [
"<p>test1</p>",
"<p>test2</p>"
]
}
]
}
What am I doing wrong? How can I accomplish stripping the HTML from the content before I add it, without a pre-processing step?
The custom analyzers (and the associated character filters) are optional steps applied when text is tokenized; they help facilitate better full-text search.
Azure Search doesn't have a mechanism for modifying the contents of a document to be indexed when you use the REST API to push documents to your index. You will have to do that yourself: the analyzers are only used to extract the searchable terms from documents that are stored in the search index, not to change the stored field values.
More details here if you are interested: https://learn.microsoft.com/en-us/azure/search/search-lucene-query-architecture
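You can see the distinction with the Analyze Text API: run the stored markup through the htmlAnalyzer from the index above (a sketch), and the char filter strips the tags from the tokens even though the retrieved field keeps them:
{
"text": "<p>test1</p>",
"analyzer": "htmlAnalyzer"
}
This should return the single token test1, which is why a search for test1 matches while the html field still comes back with the tags.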