I'm trying to make my searches ignore accents in words.
To do this, I decided to use the es.microsoft language analyzer.
I was testing the analyzer with the word "Lámpara" in the Analyze API and got the following results:
{
"token": "lampara",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "lámpara",
"startOffset": 0,
"endOffset": 7,
"position": 0
}
I have only 2 documents in my test index:
{
"#search.score": 1,
"Id": "2",
"Nombre": "Lampara"
},
{
"#search.score": 1,
"Id": "1",
"Nombre": "Lámpara"
}
When searching the index for the word with search=Lámpara, I get the following results:
{
"#search.score": 0.30685282,
"Id": "1",
"Nombre": "Lámpara"
}
Why is only the document with Nombre = "Lámpara" returned, and not the one with Nombre = "Lampara" (without the accent)? I have the impression that the Nombre field was not sent through lexical analysis.
The definition of my index is as follows
{
"name": "test",
"fields": [
{
"name": "Id",
"type": "Edm.String",
"facetable": false,
"filterable": true,
"key": true,
"retrievable": true,
"searchable": false,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "Nombre",
"type": "Edm.String",
"facetable": false,
"filterable": false,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "es.microsoft",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
],
"suggesters": [],
"scoringProfiles": [],
"defaultScoringProfile": null,
"corsOptions": null,
"analyzers": [],
"charFilters": [],
"tokenFilters": [],
"tokenizers": []
}
I would appreciate any help, and apologies for my bad English.
Sorry for the delay in getting you an answer. Indeed, the Microsoft Spanish analyzer currently only folds accents in documents, so they can be matched by queries that omit the accents (as you mentioned, a search for Lampara will match documents that contain Lámpara), but if you explicitly include accents in the query (for example, searching for Lámpara), it won't match documents that don't have any accents.
If this behavior is important to you, you can instead use the es.lucene analyzer, which does "ASCII folding" (removal of accents) both at indexing and at search time.
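For context, the "ASCII folding" that es.lucene applies is essentially Unicode decomposition followed by removal of the combining marks. Here is a minimal Python sketch of the idea (an approximation for illustration, not Lucene's actual ASCIIFoldingFilter):

```python
import unicodedata

def ascii_fold(token: str) -> str:
    """Approximate Lucene-style ASCII folding: decompose accented
    characters (NFD), drop the combining marks, then lowercase."""
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

# Both the accented and unaccented forms fold to the same term,
# so index-time and query-time tokens match either way.
print(ascii_fold("Lámpara"))  # -> lampara
print(ascii_fold("Lampara"))  # -> lampara
```

Because the folding happens on both sides, searching for Lámpara or Lampara produces the same query token and matches both documents.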
Related
I have a search setup for a list of jobs.
Other than the Id field, I have just two fields in the search marked as searchable, with the standard.lucene analyzer, as follows:
And I'm having difficulty making the search work with terms like C# or C++.
Per the documentation, unsafe characters should be encoded and special characters should be escaped.
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax
But let's say I want to find all jobs with C# in the title or description. I search using the encoding %23, but I don't seem to get the correct results: it returns other jobs that contain only C together with other terms.
Likewise, if I want to get only the C++ jobs, per the Azure docs I should use something like C\+\+, but when I tried I got other jobs as well.
So I'm not sure if I'm missing something or misunderstood the documentation, but I can't get the exact results I was expecting.
Your results depend on the analyzer you use. Read more about analyzers in the documentation: https://learn.microsoft.com/en-us/azure/search/search-analyzers
You can check how your query is analyzed by calling the analyze API with your query and a specified analyzer. For example:
{
"text": "C# engineer",
"analyzer": "standard"
}
Will result in:
"tokens": [
{
"token": "c",
"startOffset": 0,
"endOffset": 1,
"position": 0
},
{
"token": "engineer",
"startOffset": 3,
"endOffset": 11,
"position": 1
}
]
With the standard analyzer, you effectively lose the ability to distinguish a C# engineer from a C++ engineer. The analyzer reduces both to the same two tokens: c and engineer.
Instead, use one of the other built-in analyzers or configure your own. Here is the same example using the whitespace analyzer.
{
"text": "C# engineer",
"analyzer": "whitespace"
}
Results in
"tokens": [
{
"token": "C#",
"startOffset": 0,
"endOffset": 2,
"position": 0
},
{
"token": "engineer",
"startOffset": 3,
"endOffset": 11,
"position": 1
}
]
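The difference can also be illustrated locally. This is only a rough approximation of the two tokenizers (the real standard analyzer uses Unicode text segmentation, not a regex), but it shows why the # disappears:

```python
import re

def standard_like(text: str):
    """Rough approximation of the standard analyzer: split on
    non-alphanumeric characters and lowercase. '#' and '+' act as
    delimiters, so they disappear from the tokens."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def whitespace_like(text: str):
    """The whitespace analyzer splits on whitespace only, so 'C#'
    and 'C++' survive as single tokens (case is preserved)."""
    return text.split()

print(standard_like("C# engineer"))    # -> ['c', 'engineer']
print(whitespace_like("C# engineer"))  # -> ['C#', 'engineer']
```

Note that with the whitespace analyzer case is preserved, so queries must match the indexed casing (or you can build a custom analyzer that adds a lowercase token filter).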
CREATE
Create the index like this
"fields": [
{
"name": "Id",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": false,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
},
{
"name": "Title",
"type": "Edm.String",
"facetable": false,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": true,
"analyzer": "whitespace",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
],
UPLOAD
Test by uploading the following simple content.
"value": [
{
"#search.action": "mergeOrUpload",
"Id": "1",
"Title": "C++ Engineer"
},
{
"#search.action": "mergeOrUpload",
"Id": "2",
"Title": "C# Engineer"
}
]
QUERY
Then query
https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{YOUR-INDEX}}/docs?search=C%23&$count=true&searchMode=all&queryType=full&api-version={{API-VERSION}}
Your search for C# now returns the entry with C# in the title.
{
"#odata.context": "https://walter-searchsvc-we-dev.search.windows.net/indexes('so-73954804')/$metadata#docs(*)",
"#odata.count": 1,
"value": [
{
"#search.score": 0.25811607,
"Id": "2",
"Title": "C# Engineer"
}
]
}
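A note on the %23 in the query above: # is a fragment delimiter in URLs, so it must be percent-encoded or everything after it never reaches the service. A small Python sketch of building the query string (the parameter values mirror the query above):

```python
from urllib.parse import quote, urlencode

# '#' is a URL fragment delimiter; unencoded, 'search=C#' would send
# an empty search term and drop the rest of the query string.
print(quote("C#"))  # -> C%23

# urlencode percent-encodes each value, so 'C#' arrives intact.
params = {
    "search": "C#",
    "searchMode": "all",
    "queryType": "full",
    "$count": "true",
}
query_string = urlencode(params)
print(query_string)
```

Most HTTP client libraries do this encoding for you when you pass parameters as a dictionary rather than pasting them into the raw URL.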
I am working on the configuration of an Azure Cognitive Search index that will be queried from websites in different languages. I have created language-specific fields and added the appropriate language analyzers during index creation.
For example:
{
"id": "",
"Description": "some_value",
"Description_es": null,
"Description_fr": null,
"Region": [ "some_value", "some_value" ],
"SpecificationData": [
{
"name": "some_key1",
"value": "some_value1",
"name_es": null,
"value_es": null,
"name_fr": null,
"value_fr": null
},
{
"name": "some_key2",
"value": "some_value2",
"name_pt": null,
"value_pt": null,
"name_fr": null,
"value_fr": null
}
]
}
The fields Description, SpecificationData.name, and SpecificationData.value are in English and come from Cosmos DB. The fields Description_es, SpecificationData.name_es, and SpecificationData.value_es will be queried from the Spanish website and should contain those fields translated into Spanish, and similarly for the French fields.
But since Cosmos DB has the fields only in English, language-specific fields such as Description_es, SpecificationData.name_es, and SpecificationData.value_es are null by default.
I have tried using skillsets and linking the index to the "Azure Cognitive Translate Service", but the skillsets translate only one field at a time.
Is there any way to translate multiple fields and save each translation in its particular field?
Edit: Adding Index, Skillset and Indexer code that I have tried:
Index (snippet):
{
"name": "SpecificationData",
"type": "Collection(Edm.ComplexType)",
"analyzer": null,
"synonymMaps": [],
"fields": [
{
"name": "name",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "standard.lucene",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "value",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "standard.lucene",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "name_fr",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "fr.lucene",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "value_fr",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "fr.lucene",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
]
}
Skillset:
{
"#odata.type": "#Microsoft.Skills.Text.TranslationSkill",
"name": "psd_name_fr",
"description": null,
"context": "/document/SpecificationData",
"defaultFromLanguageCode": null,
"defaultToLanguageCode": "fr",
"suggestedFrom": "en",
"inputs": [
{
"name": "text",
"source": "/*/name"
}
],
"outputs": [
{
"name": "translatedText",
"targetName": "name_fr"
}
]
}
Indexer:
"outputFieldMappings": [
{
"sourceFieldName": "/document/SpecificationData/*/name/name_fr",
"targetFieldName": "/name_fr" //I get an error message as "Output field mapping specifies target field 'name_fr' that doesn't exist in the index". I have tried accessing the full path as /document/SpecificationData/name_fr but it still gives same error. It looks for the specified field inside root structure and gives the error if the field is nested array object.
}
]
You could use a text merge skill first to merge all the fields you want to translate, if you wanted one big merged translation field per language. That probably wouldn't fit your exact scenario, though, since you said you still want separate fields as output. To keep them separate, I think you'll have to translate them one by one, with one translation skill per field and language. There's no problem with having more than one translation skill in a skillset, so that should work fine; it may just be a little tedious to set up.
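Since each (field, language) pair needs its own skill, it may be less tedious to generate the skill definitions programmatically. A hypothetical sketch (the field list and skill names are illustrative, and the /document context shown here only covers top-level fields; nested fields under SpecificationData would need a different context):

```python
FIELDS = ["Description", "name", "value"]  # fields to translate (illustrative)
LANGUAGES = ["es", "fr"]                   # target language codes

def translation_skill(field: str, lang: str) -> dict:
    """Build one TranslationSkill definition per (field, language) pair."""
    return {
        "@odata.type": "#Microsoft.Skills.Text.TranslationSkill",
        "name": f"translate_{field}_{lang}",   # hypothetical naming scheme
        "context": "/document",
        "defaultToLanguageCode": lang,
        "inputs": [{"name": "text", "source": f"/document/{field}"}],
        "outputs": [{"name": "translatedText", "targetName": f"{field}_{lang}"}],
    }

skills = [translation_skill(f, l) for f in FIELDS for l in LANGUAGES]
print(len(skills))  # -> 6 (3 fields x 2 languages)
```

The resulting list can be placed in the skillset's skills array alongside any other skills.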
UPDATE 5/18/22
OK, so since you're not defining a complex SpecificationData index field, but instead top-level "name_fr" and so on, then yes, output field mappings are fine. Output field mappings map a path in the enriched document to an index field, by name. So targetFieldName should be "name_fr" with no leading slash, and sourceFieldName should point to the output of your translation skill: name_fr under the context path /document/SpecificationData, so the full path to your skill's output is /document/SpecificationData/name_fr.
But then there's another issue, which is that you really have an array of values as the output of the skill because of the * in the input path (/*/name). That probably won't work, as the index field is a string and not an array.
It seems like your intent is to get a translation of the name of each SpecificationData entry. For that, your context should probably do the enumeration (/document/SpecificationData/*) and the input path should be /document/SpecificationData/*/name. This way, one name_fr will end up under each item in the SpecificationData array.
Then you'll need to turn those multiple values into a single string for the index, if you keep the index defined that way. The simplest way to do this is with a text merge skill, probably something like this:
{
"#odata.type": "#Microsoft.Skills.Text.MergeSkill",
"context": "/document",
"inputs": [
{
"name": "itemsToInsert",
"source": "/document/SpecificationData/*/name_fr"
}
],
"outputs": [
{
"name": "mergedText",
"targetName" : "name_fr"
}
]
}
And then, since the output of this new skill will be /document/name_fr containing the space-separated concatenation of all the French-translated names, you don't need the output field mapping at all; the value will get automatically mapped to your index field.
Finally, to better understand and debug skillsets, you should take a look at debug sessions.
Having trouble getting the analyzers to save/update on the index. When creating the index, everything else (the tokenFilters, tokenizers, fields) saves fine, but the analyzers array is always empty.
Creating a new index:
let index = {
name: "test-index",
tokenizers: [{
"odatatype": "#Microsoft.Azure.Search.StandardTokenizerV2",
"name": "test_standard_v2",
"maxTokenLength": 255
}],
fields: [{
"name": "metadata_storage_path",
"type": "Edm.String",
"facetable": false,
"filterable": false,
"key": true,
"retrievable": true,
"searchable": false,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}, {
'name': 'metadata_storage_name',
'type': 'Edm.String',
'facetable': false,
'filterable': false,
'key': false,
'retrievable': true,
'searchable': true,
'sortable': false,
'synonymMaps': [],
'fields': [],
},
{
"name": "partialName",
"type": "Edm.String",
"retrievable": false,
"searchable": true,
"filterable": false,
"sortable": false,
"facetable": false,
"key": false,
"searchAnalyzer": "standardCmAnalyzer",
"indexAnalyzer": "filename_analyzer"
}],
tokenFilters: [{
"name": "nGramCmTokenFilter",
"odatatype": "#Microsoft.Azure.Search.NGramTokenFilterV2",
"minGram": 3,
"maxGram": 20
}],
analyzers: [{
"name": "standardCmAnalyzer",
"odatatype": "#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer": "test_standard_v2",
"tokenFilters": ["lowercase", "asciifolding"]
},
{
"name": "filename_analyzer",
"odatatype": "#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer": "test_standard_v2",
"tokenFilters": [
"nGramCmTokenFilter"
]
}],
};
Then creating it:
await client.createOrUpdateIndex(index, { allowIndexDowntime: true });
No error messages are returned.
EDIT:
Using the sdk #azure/search-documents ^11.1.0
I have a blob with a custom metadata property of jsonmd.
The custom metadata looks something like:
{
"ResourceName": "ipso factum...",
"ResourceVariations": [{
"Description": "ipso factum...",
"Name": "R4.mp4",
"Thumbnail": "R4.jpg",
"URL": ""
},
...
I was able to capture the full JSON in the index by including a field in the index:
{
"name": "jsonmd",
"type": "Edm.String",
"facetable": true,
"filterable": true,
...
I want to capture the Thumbnail property and have added this field to the index:
{
"name": "Thumbnail",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": true,
"analyzer": "standard.lucene",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
I can't figure out how to use the custom metadata (jsonmd) to populate the Thumbnail field of the index.
You can define complex types in your index schema. Below is an example of how I collect metadata from the PDF documents we index. I extract the properties from the PDF using regular C# code, populate a Dictionary, and then submit the objects using the Azure Cognitive Search SDK.
For more examples, see Model complex data types in Azure Cognitive Search.
{
"name": "Metadata",
"type": "Edm.ComplexType",
"analyzer": null,
"synonymMaps": [],
"fields": [
{
"name": "Properties",
"type": "Collection(Edm.ComplexType)",
"analyzer": null,
"synonymMaps": [],
"fields": [
{
"name": "Name",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "pattern",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "Values",
"type": "Collection(Edm.String)",
"facetable": true,
"filterable": true,
"retrievable": true,
"searchable": true,
"analyzer": "pattern",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
]
}
]
}
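To sketch the shape of a document that fits this schema, here is a hypothetical helper in Python (the C# code mentioned above does the equivalent with a Dictionary; the property names in the example are illustrative):

```python
def to_metadata(properties: dict) -> dict:
    """Shape a flat {name: [values]} mapping into the Metadata complex
    type defined above: a Properties collection of Name/Values pairs."""
    return {
        "Metadata": {
            "Properties": [
                {"Name": name, "Values": list(values)}
                for name, values in properties.items()
            ]
        }
    }

doc = to_metadata({"Author": ["Jane Doe"], "Keywords": ["azure", "search"]})
print(doc["Metadata"]["Properties"][0])  # -> {'Name': 'Author', 'Values': ['Jane Doe']}
```

The resulting dict can be merged into each document you upload, since complex fields are populated the same way as simple ones: by matching the field names in the payload.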
We have a tags field in the search index just like:
{
"name": "tags",
"type": "Collection(Edm.String)",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
}
and the following tag scoring profile:
{
"name": "tagBoost",
"functionAggregation": "sum",
"text": null,
"functions": [
{
"fieldName": "tags",
"interpolation": "linear",
"type": "tag",
"boost": 15,
"freshness": null,
"magnitude": null,
"distance": null,
"tag": {
"tagsParameter": "doctype"
}
}
]
}
When requesting a search just like https://my-beautiful-products-index.search.windows.net/indexes/products/docs?api-version=2017-11-11&search=karin&scoringParameter=doctype-serial, we get
{
"error": {
"code": "",
"message": "Expected 0 parameter(s) but 1 were supplied.\r\nParameter name: scoringParameter"
}
}
Does anybody know why this is and how to get rid of the error?
We've gone through the (scarce) documentation, and the request seems to be OK; no trace of that error was found either in the docs or on the Internet.
Even if you leave scoringParameter alone (...&scoringParameter), the error is the same; it only goes away if we remove scoringParameter from the query string.
I know this hasn't been answered in a long time, but for others' reference:
the missing part was to add &scoringProfile=<name>. Adding this worked for me.
In this case: https://my-beautiful-products-index.search.windows.net/indexes/products/docs?api-version=2017-11-11&search=karin&scoringProfile=tagBoost&scoringParameter=doctype-serial
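In other words, scoringParameter is only accepted when the request also names the scoring profile that declares it. A small Python sketch of building the corrected query string:

```python
from urllib.parse import urlencode

# Both parameters are needed together: scoringProfile selects the
# profile by name, and scoringParameter supplies the tag values that
# the profile's tag function scores against.
params = {
    "api-version": "2017-11-11",
    "search": "karin",
    "scoringProfile": "tagBoost",
    "scoringParameter": "doctype-serial",
}
qs = urlencode(params)
print(qs)
```

Without scoringProfile, the service has no profile in scope that expects a parameter, hence the "Expected 0 parameter(s) but 1 were supplied" error.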