Azure Cognitive Search: How to index json custom metadata of a blob - azure

I have a blob with a custom metadata property of jsonmd.
The custom metadata looks something like:
{
"ResourceName": "ipso factum...",
"ResourceVariations": [{
"Description": "ipso factum...",
"Name": "R4.mp4",
"Thumbnail": "R4.jpg",
"URL": ""
},
...
I was able to capture the full json in the index by including a filed in the index:
{
"name": "jsonmd",
"type": "Edm.String",
"facetable": true,
"filterable": true,
...
I want to capture the Thumbnail property and have added this field to the index:
{
"name": "Thumbnail",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": true,
"analyzer": "standard.lucene",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
I can't figure out how to use the custom metadata (jsonmd) to populate the Thumbnail property of the index?

You can define complex types in your index schema. Below is an example of how I collect metadata from PDF documents that we index. I extract the properties from the PDF using regular C# code, populate a Dictionary and then submit the objects using the Azure Cognitive Search SDK.
For more examples, see Model complex data types in Azure Cognitive Search.
{
"name": "Metadata",
"type": "Edm.ComplexType",
"analyzer": null,
"synonymMaps": [],
"fields": [
{
"name": "Properties",
"type": "Collection(Edm.ComplexType)",
"analyzer": null,
"synonymMaps": [],
"fields": [
{
"name": "Name",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "pattern",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "Values",
"type": "Collection(Edm.String)",
"facetable": true,
"filterable": true,
"retrievable": true,
"searchable": true,
"analyzer": "pattern",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
]
}
]
}

Related

adding analyzers to Azure Search Index using REST API not saving

Having trouble getting the analyzers to save / update on the index. When creating, everything else (the tokenFilters, tokenizers, fields) saves fine, but the analyzers array is always empty?
await client.createOrUpdateIndex(index, { allowIndexDowntime: true });
Creating a new index:
let index = {
name: "test-index",
tokenizers: [{
"odatatype": "#Microsoft.Azure.Search.StandardTokenizerV2",
"name": "test_standard_v2",
"maxTokenLength": 255
}],
fields: [{
"name": "metadata_storage_path",
"type": "Edm.String",
"facetable": false,
"filterable": false,
"key": true,
"retrievable": true,
"searchable": false,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}, {
'name': 'metadata_storage_name',
'type': 'Edm.String',
'facetable': false,
'filterable': false,
'key': false,
'retrievable': true,
'searchable': true,
'sortable': false,
'synonymMaps': [],
'fields': [],
},
{
"name": "partialName",
"type": "Edm.String",
"retrievable": false,
"searchable": true,
"filterable": false,
"sortable": false,
"facetable": false,
"key": false,
"searchAnalyzer": "standardCmAnalyzer",
"indexAnalyzer": "filename_analyzer"
}],
tokenFilters: [{
"name": "nGramCmTokenFilter",
"odatatype": "#Microsoft.Azure.Search.NGramTokenFilterV2",
"minGram": 3,
"maxGram": 20
}],
analyzers: [{
"name": "standardCmAnalyzer",
"odatatype": "#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer": "test_standard_v2",
"tokenFilters": ["lowercase", "asciifolding"]
},
{
"name": "filename_analyzer",
"odatatype": "#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer": "test_standard_v2",
"tokenFilters": [
"nGramCmTokenFilter"
]
}],
};
Then creating it:
await client.createOrUpdateIndex(index, { allowIndexDowntime: true });
I noticed no error messages being returned.
EDIT:
Using the sdk #azure/search-documents ^11.1.0

How to search on complex fields in Azure Cognitive Search

Consider the following model, where Address has nested property of City
{
"HotelId": "1",
"HotelName": "Secret Point Motel",
"Description": "Ideally located on the main commercial artery of the city in the heart of New York.",
"Tags": ["Free wifi", "on-site parking", "indoor pool", "continental breakfast"],
"Address": {
"StreetAddress": "677 5th Ave",
"City": "New York",
"StateProvince": "NY"
},
"Rooms": [
{
"Description": "Budget Room, 1 Queen Bed (Cityside)",
"RoomNumber": 1105,
"BaseRate": 96.99,
},
{
"Description": "Deluxe Room, 2 Double Beds (City View)",
"Type": "Deluxe Room",
"BaseRate": 150.99,
}
. . .
]
}
The model is indexed in Azure Cognitive Search as the following, where the Address is set as Edm.ComplexType
{
"name": "hotels",
"fields": [
{ "name": "HotelId", "type": "Edm.String", "key": true, "filterable": true },
{ "name": "HotelName", "type": "Edm.String", "searchable": true, "filterable": false },
{ "name": "Description", "type": "Edm.String", "searchable": true, "analyzer": "en.lucene" },
{ "name": "Address", "type": "Edm.ComplexType",
"fields": [
{ "name": "StreetAddress", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "searchable": true },
{ "name": "City", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true, "facetable": true },
{ "name": "StateProvince", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true, "facetable": true }
]
},
{ "name": "Rooms", "type": "Collection(Edm.ComplexType)",
"fields": [
{ "name": "Description", "type": "Edm.String", "searchable": true, "analyzer": "en.lucene" },
{ "name": "Type", "type": "Edm.String", "searchable": true },
{ "name": "BaseRate", "type": "Edm.Double", "filterable": true, "facetable": true }
]
}
]
}
Now I am trying to search on the data for City equals New York using the following queries, but none of them works
city eq 'new york' // return no result
address/city eq 'new york' // return error The property 'address/city' does not exist
address.city eq 'new york' // return error The property 'address.city' does not exist
So then how to search on Edm.ComplexType filed in Azure Cognitive Search?
N.B: I am using Azure Dotnet SDK (10.1.0)
The correct syntax is to define the OData expression in $filter clause. If you were using REST API, your $filter clause would be:
Address/City eq 'New York'
The reason your code is failing is because the actual field path is Address/City whereas you are specifying it as address/city. Once you use the proper field names, your code should work just fine.

Azure search: how to create a search index of complex type for blobs

I have a blob storage that has a number of folders, each folder has a number of pdf documents. I now want to create an azure search index which indexes the data by folder level, but includes a complex type structure (Collection(edm.ComplexType) that allows me to include all the documents. So the index looks like this:
{"name": "index",
"fields":
[
{"name": "id", "type": "Edm.String", "filterable": true, "key": true, "searchable": true, "sortable": true, "facetable": false},
{"name": "folderName", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"},
{"name": "documents", "type": "Collection(Edm.ComplexType)",
"fields": [
{"name": "documentName", "type": "Edm.String", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"},
{"name": "content", "type": "Edm.String", "searchable": true, "retrievable": true, "filterable": false, "sortable": false, "facetable": false, "analyzer": "en.microsoft", "synonymMaps": ["synonymsmap"]},
{"name": "documentType", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"},
{"name": "language", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"}
]
}
]
}
Does anyone know how I should approach this? I have been creating and populating indexes using rest api.
I am thinking maybe I need to create a folder-level index structure and populate the folder-level details from some sql-table before populating the sub-fields with the blobs through skillset and indexer etc?
EDITS:
Maybe my ideas above are completely off-track. What I want to do is to search a term and return folder names based on the aggregate relevancy of documents within folders. Not sure if this is achievable in search or have to be processed afterwards. Any pointers?
Does anyone know how I should approach this? I have been creating and populating indexes using rest api.
A: If you want this structure, then you're right. You'll need to create your index and push data by yourself (rest api is one of the options)
I am thinking maybe I need to create a folder-level index structure and populate the folder-level details from some sql-table before populating the sub-fields with the blobs through skillset and indexer etc?
A: This is not a good idea, when searching using a particular term, you'll need to query all the possible indexes and do the ordering by yourself.
I personally would create a simple structure, which no complex types:
{
"name": "index",
"fields":
[
{"name": "id", "type": "Edm.String", "filterable": true, "key": true, "searchable": true, "sortable": true, "facetable": false},
{"name": "folderName", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"},
{"name": "documentName", "type": "Edm.String", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"},
{"name": "content", "type": "Edm.String", "searchable": true, "retrievable": true, "filterable": false, "sortable": false, "facetable": false, "analyzer": "en.microsoft", "synonymMaps": ["synonymsmap"]},
{"name": "documentType", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"},
{"name": "language", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true, "filterable": true, "sortable": false, "facetable": true, "analyzer": "en.microsoft"}
]
}
want to retrieve all documents by a particular folder?
search=*&$filter=folderName eq 'abc'
want to retrieve a particular documents in a particular folder?
search=*&$filter=folderName eq 'abc' and documentName eq 'x.docx'
want to all documents that contain a particular term?
search=mickey mouse&$orderBy=folderName
simple and effective

Why the field is not subject to lexical analysis

I'm trying to make my searches ignore word accents
To do this I decided to use the language analyzer: es.microsoft
I was testing the analyzer with the word "Lámpara" in the analyzer API and I got the following results:
{
"token": "lampara",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "lámpara",
"startOffset": 0,
"endOffset": 7,
"position": 0
}
I have only 2 documents in my test index:
{
"#search.score": 1,
"Id": "2",
"Nombre": "Lampara"
},
{
"#search.score": 1,
"Id": "1",
"Nombre": "Lámpara"
}
When searching for the word in the index search=Lámpara I get the following results:
{
"#search.score": 0.30685282,
"Id": "1",
"Nombre": "Lámpara"
}
For what reason the document is only received with Nombre = "Lámpara" and not Nombre = "Lampara" (without accent). I have the impression that the Name field was not sent to the lexical analysis
The definition of my index is as follows
{
"name": "test",
"fields": [
{
"name": "Id",
"type": "Edm.String",
"facetable": false,
"filterable": true,
"key": true,
"retrievable": true,
"searchable": false,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "Nombre",
"type": "Edm.String",
"facetable": false,
"filterable": false,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "es.microsoft",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
],
"suggesters": [],
"scoringProfiles": [],
"defaultScoringProfile": null,
"corsOptions": null,
"analyzers": [],
"charFilters": [],
"tokenFilters": [],
"tokenizers": []
}
I would appreciate any help, and an apology for my bad English
Sorry for the delay in getting your an answer. Indeed, the Microsoft Spanish analyzer currently only fold the accents in documents, so they can be matched by queries that forgo the accents (as you mentioned, search for Lampara, it will match documents that contains Lámpara, but if you explicitly set the accents in the query (for example, searching for Lámpara), it won't match documents that don't have any accents.
If this behavior is important to you, you can instead use the es.lucene analyzer which actually actually do "ascii folding" (removing of accents) both at indexing and at search time.

How to match this query in Azure Search

I have this INDEX
{
"name": "testentities",
"fields": [
{
"name": "id",
"type": "Edm.String",
"key": true,
"retrievable": true,
"filterable": true,
"sortable": true
},
{
"name": "entity_id",
"type": "Edm.String",
"searchable": true,
"sortable": true,
"facetable": false,
"retrievable": true,
"filterable": true,
"searchAnalyzer":"standard",
"indexAnalyzer": "custom_analyzer"
},
{
"name": "description",
"type": "Edm.String",
"searchable": true,
"sortable": false,
"facetable": false,
"retrievable": true,
"filterable": true
},
{
"name": "name",
"type": "Edm.String",
"searchable": true,
"sortable": true,
"facetable": false,
"retrievable": true,
"filterable": true
},
{
"name": "entity_type",
"type": "Edm.String",
"searchable": true,
"sortable": true,
"facetable": true,
"retrievable": true,
"filterable": true
},
{
"name": "ancestors",
"type": "Collection(Edm.String)",
"searchable": false,
"sortable": false,
"facetable": false,
"retrievable": true,
"filterable": true
},
{
"name": "calendar_id",
"type": "Edm.String",
"searchable": false,
"sortable": false,
"facetable": false,
"retrievable": false,
"filterable": false
},
{
"name": "currency",
"type": "Edm.String",
"searchable": false,
"sortable": false,
"facetable": false,
"retrievable": false,
"filterable": false
},
{
"name": "timezone",
"type": "Edm.String",
"searchable": false,
"sortable": false,
"facetable": false,
"retrievable": false,
"filterable": false
},
{
"name": "active",
"type": "Edm.Boolean",
"retrievable": true,
"facetable": true,
"filterable": true
},
{
"name": "kpi_collection",
"type": "Edm.String",
"searchable": false,
"sortable": false,
"facetable": false,
"retrievable": false,
"filterable": false
},
{
"name": "rid",
"type": "Edm.String"
}
],
"scoringProfiles": [
{
"name": "boostEntity",
"text": {
"weights": {
"entity_id": 9,
"name": 8,
"description": 1
}
}
}
],
"analyzers": [
{
"name": "custom_analyzer",
"#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"token1",
"tokenFilters": [
"lowercase",
"entityID_stopWords",
"entityID_edgeNGram"
]
}
],
"tokenizers":[
{
"name":"token1",
"#odata.type":"#Microsoft.Azure.Search.StandardTokenizerV2"
}
],
"tokenFilters": [
{
"name": "entityID_edgeNGram",
"#odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
"minGram": 1,
"maxGram": 6
},
{
"name": "entityID_stopWords",
"#odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
"stopwords": [
"store",
"region",
"zone",
"field_org",
":"
]
}
]
}
and if i execute this query :
{
"search": "0001",
"filter": "entity_type eq 'store' ",
"select":"name,entity_id,entity_type,description,active,ancestors",
"count": "true"
}
i get this result, that is correct , because it matches with name that have hight score after entity id.
"#odata.count": 1,
"value": [
{
"#search.score": 1.6654625,
"name": "LensCrafters 0001",
"entity_id": "store:1",
"entity_type": "store",
"description": "2130 Mall Road, Florence, 41042, KY, US",
"active": true,
"ancestors": [
"region:1021",
"zone:1123",
"field_org:lenscrafters_na",
"ROOT"
]
}
]
}
But if i run this query
{
"search": "1",
"filter": "entity_type eq 'store' ",
"select":"name,entity_id,entity_type,description,active,ancestors",
"count": "true"
}
I got this result that is not correct
{
"#search.score": 1.4522386,
"name": "LensCrafters 1622",
"entity_id": "store:1622",
"entity_type": "store",
"description": "31625 Pacific Hwy S, Spc #E-1, Federal Way, 98003-5645, WA, US",
"active": true,
"ancestors": [
"region:1024",
"zone:1107",
"field_org:lenscrafters_na",
"ROOT"
]
},
{
"#search.score": 1.3403159,
"name": "LensCrafters 1178",
"entity_id": "store:1178",
"entity_type": "store",
"description": "1 W FlatIron Crossing Dr #1104, Broomfield, 80021-8881, CO, US",
"active": true,
"ancestors": [
"region:1019",
"zone:1122",
"field_org:lenscrafters_na",
"ROOT"
]
},
{
...............
Why the resulat is not this despite inside scoring profile entity_is has value 9?
"#odata.count": 1,
"value": [
{
"#search.score": 1.6654625,
"name": "LensCrafters 0001",
"entity_id": "store:1",
"entity_type": "store",
"description": "2130 Mall Road, Florence, 41042, KY, US",
"active": true,
"ancestors": [
"region:1021",
"zone:1123",
"field_org:lenscrafters_na",
"ROOT"
]
}
]
}
Here the scoring profile?
"scoringProfiles": [
{
"name": "boostEntity",
"text": {
"weights": {
"entity_id": 9,
"name": 8,
"description": 1
}
},
"functions": [],
"functionAggregation": null
}
],.............
You are using a custom analyzer on the entity_id field that produces the following tokens for text store:1178: 1, 11, 117, 1178 (you can test your analyzer configuration with the Analyze API). This means, the documents LensCrafters 1622 and LensCrafters 1178 match the query as well as the document LensCrafters 0001 - they all have 1 in entity_id. However, the documents LensCrafters 1622 and LensCrafters 1178 also match 1 in description. Thus, they have a higher score than LensCrafters 0001.
To learn more about query processing and custom analyzers in Azure Search, please read: How full text search works in Azure Search.
Do you want to keep the edgeNGram token filter in your analysis chain? Why?

Resources