Unexpected search results from Azure Cognitive Search

I recently developed an index on Azure. I have the following index structure:
{"name": "my_index",
"fields":
[
{"name": "id", "type": "Edm.String", "filterable": true, "key": true, "searchable": true, "sortable": true, "facetable": false},
{"name": "metadata_storage_path", "type": "Edm.String", "searchable": false, "filterable": false, "retrievable": true, "sortable": false, "facetable": false},
{"name": "Name", "type": "Edm.String", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"},
{"name": "Description", "type": "Edm.String", "searchable": true, "retrievable": true, "filterable": false, "sortable": false, "facetable": false, "analyzer": "en.microsoft"},
{"name": "Content", "type": "Edm.String", "searchable": true, "retrievable": true, "filterable": false, "sortable": false, "facetable": false, "analyzer": "en.microsoft"}
]
}
When I search for an entire phrase, for example "cloud platform", some of the top results do not mention "cloud platform" at all, which seems strange. The search.score values are also very low: even the top results score around 0.07. Yet I can see the phrase in the documents, and I expect plenty of documents to contain it.
Does anyone know why that might be the case? Is it because I used the wrong analyzer?
Any potential tests I can try would also be hugely appreciated.

Are you querying through the REST API or an SDK? Either way, an example request would help to understand your issue better.
With REST, I would do it like this:
https://<yourserviceName>.search.windows.net/indexes/<yourIndexName>/docs?api-version=2020-06-30&search=*&%24filter=Description%20eq%20'cloud platform'
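Note that I am using filter instead of search to make sure an exact match happens. A filter with eq compares the whole field value, and the field must be marked filterable in the index.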
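A quoted phrase query is another test worth trying alongside the filter. The following is a minimal sketch, not taken from the original post: the helper name build_phrase_query_url, the service name in the usage line and the choice of searchMode=all are assumptions for illustration. It only builds the request URL and does not call the service.

```python
from urllib.parse import urlencode, quote


def build_phrase_query_url(service_name, index_name, phrase, api_version="2020-06-30"):
    """Build a Search Documents GET URL that asks for the phrase as a whole.

    Hypothetical helper: wrapping the text in double quotes requests a phrase
    match in the simple query syntax, and searchMode=all requires every term.
    """
    params = {
        "api-version": api_version,
        "search": f'"{phrase}"',   # quoted, so the terms must appear together
        "searchMode": "all",       # assumption: stricter than the default "any"
        "queryType": "simple",
    }
    base = f"https://{service_name}.search.windows.net/indexes/{index_name}/docs"
    # quote_via=quote encodes spaces as %20 instead of "+"
    return f"{base}?{urlencode(params, quote_via=quote)}"


if __name__ == "__main__":
    print(build_phrase_query_url("my-service", "my_index", "cloud platform"))
```

Without the quotes, the default searchMode=any returns documents that match either term, which could explain top results that never mention the full phrase.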

Related

adding analyzers to Azure Search Index using REST API not saving

I am having trouble getting the analyzers to save or update on the index. When creating it, everything else (the tokenFilters, tokenizers, fields) saves fine, but the analyzers array is always empty.
await client.createOrUpdateIndex(index, { allowIndexDowntime: true });
Creating a new index:
let index = {
name: "test-index",
tokenizers: [{
"odatatype": "#Microsoft.Azure.Search.StandardTokenizerV2",
"name": "test_standard_v2",
"maxTokenLength": 255
}],
fields: [{
"name": "metadata_storage_path",
"type": "Edm.String",
"facetable": false,
"filterable": false,
"key": true,
"retrievable": true,
"searchable": false,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}, {
'name': 'metadata_storage_name',
'type': 'Edm.String',
'facetable': false,
'filterable': false,
'key': false,
'retrievable': true,
'searchable': true,
'sortable': false,
'synonymMaps': [],
'fields': [],
},
{
"name": "partialName",
"type": "Edm.String",
"retrievable": false,
"searchable": true,
"filterable": false,
"sortable": false,
"facetable": false,
"key": false,
"searchAnalyzer": "standardCmAnalyzer",
"indexAnalyzer": "filename_analyzer"
}],
tokenFilters: [{
"name": "nGramCmTokenFilter",
"odatatype": "#Microsoft.Azure.Search.NGramTokenFilterV2",
"minGram": 3,
"maxGram": 20
}],
analyzers: [{
"name": "standardCmAnalyzer",
"odatatype": "#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer": "test_standard_v2",
"tokenFilters": ["lowercase", "asciifolding"]
},
{
"name": "filename_analyzer",
"odatatype": "#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer": "test_standard_v2",
"tokenFilters": [
"nGramCmTokenFilter"
]
}],
};
Then creating it:
await client.createOrUpdateIndex(index, { allowIndexDowntime: true });
No error messages are returned.
EDIT:
I am using the SDK @azure/search-documents ^11.1.0
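One way to confirm the symptom independently of the SDK is to compare the definition that was sent with the definition the service returns. This is a rough sketch only: find_undefined_analyzers is a hypothetical helper that works on plain dictionaries shaped like the index above and calls no Azure API. Predefined analyzers such as en.microsoft would also be reported, because they are never listed in the analyzers array.

```python
def find_undefined_analyzers(index):
    """Return analyzer names that fields reference but the index does not define.

    Hypothetical helper operating on a plain dict. Built-in analyzer names
    (for example "en.microsoft") are reported too, since they are not part of
    the "analyzers" array.
    """
    defined = {analyzer["name"] for analyzer in index.get("analyzers", [])}
    referenced = set()
    for field in index.get("fields", []):
        for key in ("analyzer", "searchAnalyzer", "indexAnalyzer"):
            name = field.get(key)
            if name:  # skips missing keys and explicit nulls
                referenced.add(name)
    return sorted(referenced - defined)


if __name__ == "__main__":
    sent = {
        "fields": [{
            "name": "partialName",
            "searchAnalyzer": "standardCmAnalyzer",
            "indexAnalyzer": "filename_analyzer",
        }],
        "analyzers": [{"name": "standardCmAnalyzer"}, {"name": "filename_analyzer"}],
    }
    returned = dict(sent, analyzers=[])  # what the question describes getting back
    print(find_undefined_analyzers(sent))
    print(find_undefined_analyzers(returned))
```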

Azure Cognitive Search: How to index json custom metadata of a blob

I have a blob with a custom metadata property of jsonmd.
The custom metadata looks something like:
{
"ResourceName": "ipso factum...",
"ResourceVariations": [{
"Description": "ipso factum...",
"Name": "R4.mp4",
"Thumbnail": "R4.jpg",
"URL": ""
},
...
I was able to capture the full JSON in the index by including a field in the index:
{
"name": "jsonmd",
"type": "Edm.String",
"facetable": true,
"filterable": true,
...
I want to capture the Thumbnail property and have added this field to the index:
{
"name": "Thumbnail",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": true,
"analyzer": "standard.lucene",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
I can't figure out how to use the custom metadata (jsonmd) to populate the Thumbnail field of the index.
You can define complex types in your index schema. Below is an example of how I collect metadata from PDF documents that we index. I extract the properties from the PDF using regular C# code, populate a Dictionary and then submit the objects using the Azure Cognitive Search SDK.
For more examples, see Model complex data types in Azure Cognitive Search.
{
"name": "Metadata",
"type": "Edm.ComplexType",
"analyzer": null,
"synonymMaps": [],
"fields": [
{
"name": "Properties",
"type": "Collection(Edm.ComplexType)",
"analyzer": null,
"synonymMaps": [],
"fields": [
{
"name": "Name",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "pattern",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "Values",
"type": "Collection(Edm.String)",
"facetable": true,
"filterable": true,
"retrievable": true,
"searchable": true,
"analyzer": "pattern",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
]
}
]
}
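The answer describes extracting the properties in code and pushing the resulting objects to the index. A minimal sketch of that extraction step for the jsonmd case could look as follows. It rests on assumptions not stated in the question: the key field is called id, only the first thumbnail is kept because Thumbnail is a single Edm.String, and build_document is a hypothetical helper that only builds the payload.

```python
import json


def build_document(doc_id, jsonmd):
    """Build one push-API document with Thumbnail taken from the jsonmd metadata.

    Assumptions: "id" is the index key, and the first non-empty Thumbnail in
    ResourceVariations is the one to store.
    """
    metadata = json.loads(jsonmd)
    variations = metadata.get("ResourceVariations", [])
    thumbnails = [v["Thumbnail"] for v in variations if v.get("Thumbnail")]
    return {
        "@search.action": "mergeOrUpload",
        "id": doc_id,
        "jsonmd": jsonmd,  # keep the raw JSON as before
        "Thumbnail": thumbnails[0] if thumbnails else None,
    }


if __name__ == "__main__":
    sample = json.dumps({
        "ResourceName": "ipso factum...",
        "ResourceVariations": [
            {"Description": "ipso factum...", "Name": "R4.mp4", "Thumbnail": "R4.jpg", "URL": ""},
        ],
    })
    print(build_document("1", sample))
```

The resulting dictionaries would then be sent in the "value" array of an indexing request or passed to whichever SDK is in use.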

How to search on complex fields in Azure Cognitive Search

Consider the following model, where Address has a nested City property:
{
"HotelId": "1",
"HotelName": "Secret Point Motel",
"Description": "Ideally located on the main commercial artery of the city in the heart of New York.",
"Tags": ["Free wifi", "on-site parking", "indoor pool", "continental breakfast"],
"Address": {
"StreetAddress": "677 5th Ave",
"City": "New York",
"StateProvince": "NY"
},
"Rooms": [
{
"Description": "Budget Room, 1 Queen Bed (Cityside)",
"RoomNumber": 1105,
"BaseRate": 96.99,
},
{
"Description": "Deluxe Room, 2 Double Beds (City View)",
"Type": "Deluxe Room",
"BaseRate": 150.99,
}
. . .
]
}
The model is indexed in Azure Cognitive Search as follows, with Address set as Edm.ComplexType:
{
"name": "hotels",
"fields": [
{ "name": "HotelId", "type": "Edm.String", "key": true, "filterable": true },
{ "name": "HotelName", "type": "Edm.String", "searchable": true, "filterable": false },
{ "name": "Description", "type": "Edm.String", "searchable": true, "analyzer": "en.lucene" },
{ "name": "Address", "type": "Edm.ComplexType",
"fields": [
{ "name": "StreetAddress", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "searchable": true },
{ "name": "City", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true, "facetable": true },
{ "name": "StateProvince", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true, "facetable": true }
]
},
{ "name": "Rooms", "type": "Collection(Edm.ComplexType)",
"fields": [
{ "name": "Description", "type": "Edm.String", "searchable": true, "analyzer": "en.lucene" },
{ "name": "Type", "type": "Edm.String", "searchable": true },
{ "name": "BaseRate", "type": "Edm.Double", "filterable": true, "facetable": true }
]
}
]
}
Now I am trying to find documents where City equals New York using the following queries, but none of them work:
city eq 'new york' // return no result
address/city eq 'new york' // return error The property 'address/city' does not exist
address.city eq 'new york' // return error The property 'address.city' does not exist
So how do I search on an Edm.ComplexType field in Azure Cognitive Search?
N.B.: I am using the Azure .NET SDK (10.1.0)
The correct syntax is to define the OData expression in the $filter clause. If you were using the REST API, your $filter clause would be:
Address/City eq 'New York'
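Your code is failing because field names are case-sensitive: the actual field path is Address/City, whereas you are specifying address/city. Filter comparisons on strings are case-sensitive as well, so compare against 'New York' rather than 'new york'. Once you use the proper field names, your code should work just fine.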
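To make the filter syntax concrete, here is a small sketch that builds the expression and a Search Documents POST body. The helper names odata_string_eq and build_search_body are made up for this example, and the code only constructs the payload.

```python
def odata_string_eq(field_path, value):
    """Return an OData 'eq' comparison for a string field.

    field_path uses "/" to reach into a complex type, for example "Address/City".
    Single quotes inside the value are escaped by doubling them.
    """
    escaped = value.replace("'", "''")
    return f"{field_path} eq '{escaped}'"


def build_search_body(filter_expression):
    """Hypothetical helper: match every document, then narrow by the filter."""
    return {"search": "*", "filter": filter_expression}


if __name__ == "__main__":
    print(build_search_body(odata_string_eq("Address/City", "New York")))
```

The field path and the value must both match the indexed casing exactly.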

Azure search: how to create a search index of complex type for blobs

I have a blob storage account with a number of folders, each containing a number of PDF documents. I now want to create an Azure Search index that indexes the data at folder level, but includes a complex type structure (Collection(Edm.ComplexType)) that allows me to include all the documents. So the index looks like this:
{"name": "index",
"fields":
[
{"name": "id", "type": "Edm.String", "filterable": true, "key": true, "searchable": true, "sortable": true, "facetable": false},
{"name": "folderName", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"},
{"name": "documents", "type": "Collection(Edm.ComplexType)",
"fields": [
{"name": "documentName", "type": "Edm.String", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"},
{"name": "content", "type": "Edm.String", "searchable": true, "retrievable": true, "filterable": false, "sortable": false, "facetable": false, "analyzer": "en.microsoft", "synonymMaps": ["synonymsmap"]},
{"name": "documentType", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"},
{"name": "language", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"}
]
}
]
}
Does anyone know how I should approach this? I have been creating and populating indexes using the REST API.
I am thinking maybe I need to create a folder-level index structure and populate the folder-level details from some SQL table before populating the sub-fields with the blobs through a skillset and indexer, etc.?
EDITS:
Maybe my ideas above are completely off-track. What I want to do is search for a term and return folder names based on the aggregate relevancy of the documents within each folder. I am not sure whether this is achievable in search or has to be processed afterwards. Any pointers?
Q: Does anyone know how I should approach this? I have been creating and populating indexes using the REST API.
A: If you want this structure, then you're right: you'll need to create your index and push the data yourself (the REST API is one of the options).
Q: I am thinking maybe I need to create a folder-level index structure and populate the folder-level details from some SQL table before populating the sub-fields with the blobs through a skillset and indexer, etc.?
A: This is not a good idea. When searching for a particular term, you would need to query all the possible indexes and do the ordering yourself.
I personally would create a simple structure, with no complex types:
{
"name": "index",
"fields":
[
{"name": "id", "type": "Edm.String", "filterable": true, "key": true, "searchable": true, "sortable": true, "facetable": false},
{"name": "folderName", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"},
{"name": "documentName", "type": "Edm.String", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"},
{"name": "content", "type": "Edm.String", "searchable": true, "retrievable": true, "filterable": false, "sortable": false, "facetable": false, "analyzer": "en.microsoft", "synonymMaps": ["synonymsmap"]},
{"name": "documentType", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"},
{"name": "language", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"}
]
}
Want to retrieve all documents in a particular folder?
search=*&$filter=folderName eq 'abc'
Want to retrieve a particular document in a particular folder?
search=*&$filter=folderName eq 'abc' and documentName eq 'x.docx'
Want all documents that contain a particular term?
search=mickey mouse&$orderBy=folderName
Simple and effective.
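For the aggregate relevancy by folder that the question asks about, one option with the flat index is to post-process the hits on the client. The sketch below is only one possible approach: it assumes each hit carries @search.score and a folderName list as in the schema above, and it sums the scores per folder, which is an arbitrary choice (a maximum or a mean would work just as well). rank_folders is a hypothetical helper.

```python
from collections import defaultdict


def rank_folders(results):
    """Sum @search.score per folder and return (folder, total) pairs, best first.

    Assumption: summing is the aggregation wanted; swap in max or mean if not.
    """
    totals = defaultdict(float)
    for doc in results:
        score = doc.get("@search.score", 0.0)
        for folder in doc.get("folderName", []):
            totals[folder] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    hits = [
        {"@search.score": 1.5, "folderName": ["contracts"], "documentName": "a.pdf"},
        {"@search.score": 1.0, "folderName": ["invoices"], "documentName": "b.pdf"},
        {"@search.score": 0.75, "folderName": ["contracts"], "documentName": "c.pdf"},
    ]
    for folder, total in rank_folders(hits):
        print(folder, total)
```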

How to match this query in Azure Search

I have this index:
{
"name": "testentities",
"fields": [
{
"name": "id",
"type": "Edm.String",
"key": true,
"retrievable": true,
"filterable": true,
"sortable": true
},
{
"name": "entity_id",
"type": "Edm.String",
"searchable": true,
"sortable": true,
"facetable": false,
"retrievable": true,
"filterable": true,
"searchAnalyzer":"standard",
"indexAnalyzer": "custom_analyzer"
},
{
"name": "description",
"type": "Edm.String",
"searchable": true,
"sortable": false,
"facetable": false,
"retrievable": true,
"filterable": true
},
{
"name": "name",
"type": "Edm.String",
"searchable": true,
"sortable": true,
"facetable": false,
"retrievable": true,
"filterable": true
},
{
"name": "entity_type",
"type": "Edm.String",
"searchable": true,
"sortable": true,
"facetable": true,
"retrievable": true,
"filterable": true
},
{
"name": "ancestors",
"type": "Collection(Edm.String)",
"searchable": false,
"sortable": false,
"facetable": false,
"retrievable": true,
"filterable": true
},
{
"name": "calendar_id",
"type": "Edm.String",
"searchable": false,
"sortable": false,
"facetable": false,
"retrievable": false,
"filterable": false
},
{
"name": "currency",
"type": "Edm.String",
"searchable": false,
"sortable": false,
"facetable": false,
"retrievable": false,
"filterable": false
},
{
"name": "timezone",
"type": "Edm.String",
"searchable": false,
"sortable": false,
"facetable": false,
"retrievable": false,
"filterable": false
},
{
"name": "active",
"type": "Edm.Boolean",
"retrievable": true,
"facetable": true,
"filterable": true
},
{
"name": "kpi_collection",
"type": "Edm.String",
"searchable": false,
"sortable": false,
"facetable": false,
"retrievable": false,
"filterable": false
},
{
"name": "rid",
"type": "Edm.String"
}
],
"scoringProfiles": [
{
"name": "boostEntity",
"text": {
"weights": {
"entity_id": 9,
"name": 8,
"description": 1
}
}
}
],
"analyzers": [
{
"name": "custom_analyzer",
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"token1",
"tokenFilters": [
"lowercase",
"entityID_stopWords",
"entityID_edgeNGram"
]
}
],
"tokenizers":[
{
"name":"token1",
"@odata.type":"#Microsoft.Azure.Search.StandardTokenizerV2"
}
],
"tokenFilters": [
{
"name": "entityID_edgeNGram",
"@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
"minGram": 1,
"maxGram": 6
},
{
"name": "entityID_stopWords",
"@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
"stopwords": [
"store",
"region",
"zone",
"field_org",
":"
]
}
]
}
If I execute this query:
{
"search": "0001",
"filter": "entity_type eq 'store' ",
"select":"name,entity_id,entity_type,description,active,ancestors",
"count": "true"
}
I get this result, which is correct, because it matches on name, which has the highest weight after entity_id:
{
"@odata.count": 1,
"value": [
{
"@search.score": 1.6654625,
"name": "LensCrafters 0001",
"entity_id": "store:1",
"entity_type": "store",
"description": "2130 Mall Road, Florence, 41042, KY, US",
"active": true,
"ancestors": [
"region:1021",
"zone:1123",
"field_org:lenscrafters_na",
"ROOT"
]
}
]
}
But if I run this query:
{
"search": "1",
"filter": "entity_type eq 'store' ",
"select":"name,entity_id,entity_type,description,active,ancestors",
"count": "true"
}
I get this result, which is not what I expect:
{
"@search.score": 1.4522386,
"name": "LensCrafters 1622",
"entity_id": "store:1622",
"entity_type": "store",
"description": "31625 Pacific Hwy S, Spc #E-1, Federal Way, 98003-5645, WA, US",
"active": true,
"ancestors": [
"region:1024",
"zone:1107",
"field_org:lenscrafters_na",
"ROOT"
]
},
{
"@search.score": 1.3403159,
"name": "LensCrafters 1178",
"entity_id": "store:1178",
"entity_type": "store",
"description": "1 W FlatIron Crossing Dr #1104, Broomfield, 80021-8881, CO, US",
"active": true,
"ancestors": [
"region:1019",
"zone:1122",
"field_org:lenscrafters_na",
"ROOT"
]
},
{
...............
Why is the result not the following, given that entity_id has a weight of 9 in the scoring profile?
{
"@odata.count": 1,
"value": [
{
"@search.score": 1.6654625,
"name": "LensCrafters 0001",
"entity_id": "store:1",
"entity_type": "store",
"description": "2130 Mall Road, Florence, 41042, KY, US",
"active": true,
"ancestors": [
"region:1021",
"zone:1123",
"field_org:lenscrafters_na",
"ROOT"
]
}
]
}
Here is the scoring profile:
"scoringProfiles": [
{
"name": "boostEntity",
"text": {
"weights": {
"entity_id": 9,
"name": 8,
"description": 1
}
},
"functions": [],
"functionAggregation": null
}
],.............
You are using a custom analyzer on the entity_id field that produces the following tokens for the text store:1178: 1, 11, 117, 1178 (you can test your analyzer configuration with the Analyze API). This means that the documents LensCrafters 1622 and LensCrafters 1178 match the query just as the document LensCrafters 0001 does: they all have the token 1 in entity_id. However, LensCrafters 1622 and LensCrafters 1178 also match 1 in description, so they score higher than LensCrafters 0001.
To learn more about query processing and custom analyzers in Azure Search, please read: How full text search works in Azure Search.
Do you want to keep the edgeNGram token filter in your analysis chain? Why?
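The token list in the answer can be reproduced offline with a rough approximation of the analysis chain. This is not the real Lucene analyzer: the regular expression merely stands in for the standard tokenizer, and approximate_index_tokens and edge_ngrams are hypothetical names. The Analyze API remains the authoritative way to see the actual tokens.

```python
import re

# Stopwords copied from the entityID_stopWords token filter in the question.
STOPWORDS = {"store", "region", "zone", "field_org", ":"}


def edge_ngrams(token, min_gram=1, max_gram=6):
    """Return the prefixes of token whose length is between min_gram and max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def approximate_index_tokens(text):
    """Approximate custom_analyzer: tokenize, lowercase, drop stopwords, edge n-grams."""
    tokens = re.findall(r"\w+", text.lower())  # crude stand-in for the standard tokenizer
    kept = [t for t in tokens if t not in STOPWORDS]
    return [gram for t in kept for gram in edge_ngrams(t)]


if __name__ == "__main__":
    for entity_id in ("store:1", "store:1178", "store:1622"):
        print(entity_id, approximate_index_tokens(entity_id))
```

Under this approximation, every entity_id that starts with 1 contains the token 1, so the query 1 matches all three stores on that field.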
