I have a search set up for a list of jobs.
Besides the Id field, I have just two fields marked as searchable, both using the standard.lucene analyzer, as follows:
And I'm having difficulty making the search work with terms like C# or C++.
Per the documentation, unsafe characters should be URL-encoded and special characters should be escaped:
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax
But let's say I want to find all jobs with C# in the title or description. When I search using the encoding %23, I don't seem to get the correct results: it returns other jobs that contain only C followed by other things.
Likewise, to get only the C++ jobs, per the Azure docs I should use something like C\+\+, but when I tried that I also got other jobs.
So I'm not sure if I'm missing something or misunderstood the documentation, but I'm not able to get the exact results I was expecting.
Your results depend on the analyzer you use. Read more about analyzers in the documentation: https://learn.microsoft.com/en-us/azure/search/search-analyzers
You can check how your query is analyzed by calling the analyze API with your query and a specified analyzer. For example:
{
"text": "C# engineer",
"analyzer": "standard"
}
Will result in:
"tokens": [
{
"token": "c",
"startOffset": 0,
"endOffset": 1,
"position": 0
},
{
"token": "engineer",
"startOffset": 3,
"endOffset": 11,
"position": 1
}
]
With the standard analyzer, you effectively lose the ability to distinguish a C# engineer from a C++ engineer: the analyzer reduces both to the same two tokens, c and engineer.
Instead, use a different built-in analyzer or configure a custom one. Here is the same example using the whitespace analyzer.
{
"text": "C# engineer",
"analyzer": "whitespace"
}
Results in
"tokens": [
{
"token": "C#",
"startOffset": 0,
"endOffset": 2,
"position": 0
},
{
"token": "engineer",
"startOffset": 3,
"endOffset": 11,
"position": 1
}
]
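The contrast between the two analyzers can be sketched locally. This is only a rough approximation (the real Lucene analyzers do far more than a regex), but it shows why the # survives one tokenization and not the other:

```python
import re

def standard_like(text):
    # Rough stand-in for standard.lucene: lowercases and drops
    # punctuation such as '#' and '+' at token boundaries.
    return re.findall(r"\w+", text.lower())

def whitespace_like(text):
    # The whitespace analyzer splits on whitespace only, keeping symbols.
    return text.split()

print(standard_like("C# engineer"))    # ['c', 'engineer']
print(whitespace_like("C# engineer"))  # ['C#', 'engineer']
```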
CREATE
Create the index like this
"fields": [
{
"name": "Id",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": false,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
},
{
"name": "Title",
"type": "Edm.String",
"facetable": false,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": true,
"analyzer": "whitespace",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
],
UPLOAD
Test by uploading the following simple content.
"value": [
{
"@search.action": "mergeOrUpload",
"Id": "1",
"Title": "C++ Engineer"
},
{
"@search.action": "mergeOrUpload",
"Id": "2",
"Title": "C# Engineer"
}
]
QUERY
Then query
https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{YOUR-INDEX}}/docs?search=C%23&$count=true&searchMode=all&queryType=full&api-version={{API-VERSION}}
Your search for C# now returns the entry with C# in the title.
{
"@odata.context": "https://walter-searchsvc-we-dev.search.windows.net/indexes('so-73954804')/$metadata#docs(*)",
"@odata.count": 1,
"value": [
{
"@search.score": 0.25811607,
"Id": "2",
"Title": "C# Engineer"
}
]
}
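For the query itself, the # must be percent-encoded before it reaches the service. A small sketch of building such a request URL (the service name, index name, and API version below are placeholders, not taken from the question):

```python
from urllib.parse import urlencode

# Placeholder service/index names and API version; substitute your own.
base = "https://my-search-svc.search.windows.net/indexes/jobs/docs"
params = {
    "search": "C#",          # urlencode turns '#' into %23
    "$count": "true",
    "searchMode": "all",
    "queryType": "full",
    "api-version": "2023-11-01",
}
# safe="$" keeps the OData $count parameter name literal
url = base + "?" + urlencode(params, safe="$")
print(url)
```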
Related
I am working on the configuration of an Azure Cognitive Search index that will be queried from websites in different languages. I have created language-specific fields and added the appropriate language analyzers during index creation.
For example:
{
"id": "",
"Description": "some_value",
"Description_es": null,
"Description_fr": null,
"Region": [ "some_value", "some_value" ],
"SpecificationData": [
{
"name": "some_key1",
"value": "some_value1",
"name_es": null,
"value_es": null,
"name_fr": null,
"value_fr": null
},
{
"name": "some_key2",
"value": "some_value2",
"name_pt": null,
"value_pt": null,
"name_fr": null,
"value_fr": null
}
]
}
The fields Description, SpecificationData.name and SpecificationData.value are in English and come from Cosmos DB. The fields Description_es, SpecificationData.name_es and SpecificationData.value_es will be queried from the Spanish website and should contain the Spanish translations, and similarly for the French-language fields.
But since Cosmos DB has fields only in English, language-specific fields such as Description_es, SpecificationData.name_es and SpecificationData.value_es are null by default.
I have tried using skillsets and linking the index to the Azure Cognitive Services Translator, but the skillset translates only one field at a time.
Is there any way to translate multiple fields and save each translation in its particular field?
Edit: Adding Index, Skillset and Indexer code that I have tried:
Index (snippet):
{
"name": "SpecificationData",
"type": "Collection(Edm.ComplexType)",
"analyzer": null,
"synonymMaps": [],
"fields": [
{
"name": "name",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "standard.lucene",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "value",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "standard.lucene",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "name_fr",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "fr.lucene",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "value_fr",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "fr.lucene",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
]
}
Skillset:
{
"@odata.type": "#Microsoft.Skills.Text.TranslationSkill",
"name": "psd_name_fr",
"description": null,
"context": "/document/SpecificationData",
"defaultFromLanguageCode": null,
"defaultToLanguageCode": "fr",
"suggestedFrom": "en",
"inputs": [
{
"name": "text",
"source": "/*/name"
}
],
"outputs": [
{
"name": "translatedText",
"targetName": "name_fr"
}
]
}
Indexer:
"outputFieldMappings": [
{
"sourceFieldName": "/document/SpecificationData/*/name/name_fr",
"targetFieldName": "/name_fr" // I get the error "Output field mapping specifies target field 'name_fr' that doesn't exist in the index". I have tried the full path /document/SpecificationData/name_fr, but it gives the same error: the mapping looks for the specified field at the root and fails when the field is a nested array object.
}
]
You could use a text merge skill first to merge all the fields you want to translate, if you wanted one big merged translation field per language. That probably wouldn't fit your exact scenario, though, since you said you still want separate fields as the output. To keep them separate, I think you'll have to translate them one by one, with one translation skill per field and language. There's no problem having more than one translation skill in a skillset, so that should work fine; it just may be a little tedious to set up.
UPDATE 5/18/22
OK, so since you're not defining a complex SpecificationData index field, but instead top-level "name_fr" and so on, then yes, output field mappings are fine. Output field mappings map a path in the enriched document to an index field, by name. So targetFieldName should be "name_fr" with no leading slash. sourceFieldName should point to the output of your translation skill, name_fr under the context path, which is /document/SpecificationData, so the full path to your skill's output is /document/SpecificationData/name_fr.
But then there's another issue: you really have an array of values as the output of the skill, because of the * in the input path (/*/name). That probably won't work, as the index field is a string and not an array.
It seems like your intent is to get a translation for each name of each SpecificationData entry. For that, your context should probably do the enumeration (/document/SpecificationData/*) and have the input path be /document/SpecificationData/*/name. This way, one name_fr will be under each item in the SpecificationData array.
Then you'll need to make those multiple values into a single string for the index, if you keep the index defined that way. The simplest way to do this is by using a text merger skill, probably something like this:
{
"@odata.type": "#Microsoft.Skills.Text.MergeSkill",
"context": "/document",
"inputs": [
{
"name": "itemsToInsert",
"source": "/document/SpecificationData/*/name_fr"
}
],
"outputs": [
{
"name": "mergedText",
"targetName" : "name_fr"
}
]
}
And then, since the output of this new skill will be /document/name_fr with the space-separated concatenation of all French-translated names, you don't need the output field mapping at all; the value will be automatically mapped to your index.
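As a rough sketch of what that merge produces (simplified; the real MergeSkill also supports insertPreTag/insertPostTag and offset-based insertion into a base text), with hypothetical items standing in for the enriched SpecificationData array:

```python
# Hypothetical enriched items after the translation skill has run with
# context /document/SpecificationData/* (values invented for illustration).
specification_data = [
    {"name": "some_key1", "name_fr": "une_cle1"},
    {"name": "some_key2", "name_fr": "une_cle2"},
]

# The merge skill's output is roughly the space-separated concatenation
# of all itemsToInsert values - a simplification of its real behavior.
merged_name_fr = " ".join(item["name_fr"] for item in specification_data)
print(merged_name_fr)  # une_cle1 une_cle2
```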
Finally, to better understand and debug skillsets, you should take a look at debug sessions.
I am indexing data into an Azure Search Index that is produced by a custom skill. This custom skill produces complex data which I want to preserve into the Azure Search Index.
Source data is coming from blob storage and I am constrained to using the REST API without a very solid argument for using the .NET SDK.
Current code
The following is a brief rundown of what I currently have. I cannot change the index's fields or the format of the data produced by the endpoint used by the custom skill.
Complex data
The following is an example of complex data produced by the custom skill (in the correct value/recordId/etc. format):
{
"field1": 0.135412,
"field2": 0.123513,
"field3": 0.243655
}
Custom skill
Here is the custom skill which creates said data:
{
"@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"uri": "https://myfunction.azurewebsites.com/api",
"httpHeaders": {},
"httpMethod": "POST",
"timeout": "PT3M50S",
"batchSize": 1,
"degreeOfParallelism": 5,
"name": "MySkill",
"context": "/document/mycomplex",
"inputs": [
{
"name": "text",
"source": "/document/content"
}
],
"outputs": [
{
"name": "field1",
"targetName": "field1"
},
{
"name": "field2",
"targetName": "field2"
},
{
"name": "field3",
"targetName": "field3"
}
]
}
I have attempted several variations, notably using the ShaperSkill with each field as an input and the output "targetName" set to "mycomplex" (with the appropriate context).
Indexer
Here is the indexer's output field mapping for the skill:
{
"sourceFieldName": "/document/mycomplex",
"targetFieldName": "mycomplex"
}
I have tried several variations such as "sourceFieldName": "/document/mycomplex/*".
Search index
And this is the targeted index field:
{
"name": "mycomplex",
"type": "Edm.ComplexType",
"fields": [
{
"name": "field1",
"type": "Edm.Double",
"retrievable": true,
"filterable": true,
"sortable": true,
"facetable": false,
"searchable": false
},
{
"name": "field2",
"type": "Edm.Double",
"retrievable": true,
"filterable": true,
"sortable": true,
"facetable": false,
"searchable": false
},
{
"name": "field3",
"type": "Edm.Double",
"retrievable": true,
"filterable": true,
"sortable": true,
"facetable": false,
"searchable": false
}
]
}
Result
My result is usually similar to Could not map output field 'mycomplex' to search index. Check your indexer's 'outputFieldMappings' property..
This may be a mistake with the context of your skill. Instead of setting the context to /document/mycomplex, try setting it to /document. You can then add a ShaperSkill, also with the context set to /document and the output field being mycomplex, to generate the expected complex type shape.
Example skills:
"skills":
[
{
"@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"uri": "https://myfunction.azurewebsites.com/api",
"httpHeaders": {},
"httpMethod": "POST",
"timeout": "PT3M50S",
"batchSize": 1,
"degreeOfParallelism": 5,
"name": "MySkill",
"context": "/document",
"inputs": [
{
"name": "text",
"source": "/document/content"
}
],
"outputs": [
{
"name": "field1",
"targetName": "field1"
},
{
"name": "field2",
"targetName": "field2"
},
{
"name": "field3",
"targetName": "field3"
}
]
},
{
"@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
"context": "/document",
"inputs": [
{
"name": "field1",
"source": "/document/field1"
},
{
"name": "field2",
"source": "/document/field2"
},
{
"name": "field3",
"source": "/document/field3"
}
],
"outputs": [
{
"name": "output",
"targetName": "mycomplex"
}
]
}
]
Please refer to the documentation on shaper skill for specifics.
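Conceptually, the shaper gathers sibling values from the enriched document into one object. A simplified Python model of that step (not the service's actual implementation; the field values are the ones from the question):

```python
# Simplified model of the enriched document after the custom skill has
# written its three outputs at the /document level.
enriched_document = {
    "content": "...",
    "field1": 0.135412,
    "field2": 0.123513,
    "field3": 0.243655,
}

# The shaper skill (context /document) gathers its inputs into a single
# complex object surfaced as "mycomplex", which then maps cleanly onto
# the Edm.ComplexType index field.
enriched_document["mycomplex"] = {
    k: enriched_document[k] for k in ("field1", "field2", "field3")
}
print(enriched_document["mycomplex"])
```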
I'm trying to make my searches ignore word accents
To do this I decided to use the language analyzer: es.microsoft
I was testing the analyzer with the word "Lámpara" using the Analyze API, and I got the following results:
{
"token": "lampara",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "lámpara",
"startOffset": 0,
"endOffset": 7,
"position": 0
}
I have only 2 documents in my test index:
{
"@search.score": 1,
"Id": "2",
"Nombre": "Lampara"
},
{
"@search.score": 1,
"Id": "1",
"Nombre": "Lámpara"
}
When searching for the word in the index search=Lámpara I get the following results:
{
"@search.score": 0.30685282,
"Id": "1",
"Nombre": "Lámpara"
}
Why is only the document with Nombre = "Lámpara" returned, and not the one with Nombre = "Lampara" (without the accent)? I have the impression that the Nombre field was not sent through lexical analysis.
The definition of my index is as follows
{
"name": "test",
"fields": [
{
"name": "Id",
"type": "Edm.String",
"facetable": false,
"filterable": true,
"key": true,
"retrievable": true,
"searchable": false,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "Nombre",
"type": "Edm.String",
"facetable": false,
"filterable": false,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "es.microsoft",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
],
"suggesters": [],
"scoringProfiles": [],
"defaultScoringProfile": null,
"corsOptions": null,
"analyzers": [],
"charFilters": [],
"tokenFilters": [],
"tokenizers": []
}
I would appreciate any help, and apologies for my bad English.
Sorry for the delay in getting you an answer. Indeed, the Microsoft Spanish analyzer currently only folds accents in documents, so they can be matched by queries that omit the accents (as you mentioned, a search for Lampara will match documents that contain Lámpara), but if you explicitly include the accents in the query (for example, searching for Lámpara), it won't match documents that don't have any accents.
If this behavior is important to you, you can instead use the es.lucene analyzer, which actually does "ASCII folding" (removal of accents) both at indexing and at search time.
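What es.lucene's ASCII folding does to both the indexed text and the query can be approximated with Unicode normalization. A minimal sketch (not the analyzer's actual code):

```python
import unicodedata

def ascii_fold(text):
    # Decompose accented characters (NFD) and drop the combining marks -
    # a simplified stand-in for the asciifolding done by es.lucene.
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Folding both sides means the accented and unaccented spellings meet
# on the same token, so either query matches either document:
print(ascii_fold("Lámpara"))  # lampara
print(ascii_fold("Lampara"))  # lampara
```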
We have a tags field in the search index just like:
{
"name": "tags",
"type": "Collection(Edm.String)",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
}
and the following tag scoring profile:
{
"name": "tagBoost",
"functionAggregation": "sum",
"text": null,
"functions": [
{
"fieldName": "tags",
"interpolation": "linear",
"type": "tag",
"boost": 15,
"freshness": null,
"magnitude": null,
"distance": null,
"tag": {
"tagsParameter": "doctype"
}
}
]
}
When requesting a search just like https://my-beautiful-products-index.search.windows.net/indexes/products/docs?api-version=2017-11-11&search=karin&scoringParameter=doctype-serial, we get
{
"error": {
"code": "",
"message": "Expected 0 parameter(s) but 1 were supplied.\r\nParameter name: scoringParameter"
}
}
Does anybody know why this is and how to get rid of the error?
We've gone through the (scarce) documentation; the request seems to be OK, and no trace of that error was found in the docs or on the Internet.
Even if you pass scoringParameter alone (...&scoringParameter), the error is the same; it only goes away if we remove scoringParameter from the query string.
I know this hasn't been answered in a long time, but for others' reference:
The missing part was adding &scoringProfile=name-of-your-profile. Adding this worked for me.
In this case: https://my-beautiful-products-index.search.windows.net/indexes/products/docs?api-version=2017-11-11&search=karin&scoringProfile=tagBoost&scoringParameter=doctype-serial
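In other words, a scoringParameter is only validated against a scoring profile named in the same request. A small sketch of assembling the corrected query string (values taken from the question):

```python
from urllib.parse import urlencode

# The scoringParameter is only accepted when the request also names the
# scoring profile that declares it.
params = {
    "api-version": "2017-11-11",
    "search": "karin",
    "scoringProfile": "tagBoost",          # must name the profile...
    "scoringParameter": "doctype-serial",  # ...before its parameter is valid
}
query_string = urlencode(params)
print(query_string)
```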
I am trying to enable a phonetic analyzer and synonyms together, and it doesn't seem to work. Is it wrong to use them together?
In the implementation below, I would expect the search query to be expanded using the synonyms, and then the phonetic analyzer to be used to retrieve the results. But my synonyms are totally ignored here.
If I remove the phonetic analyzer from the index creation, then the synonyms work fine.
Also, the synonyms work fine if I use the built-in analyzers like en.microsoft instead of custom analyzers. Is this a bug?
My Synonym map
{
"name": "mysynonymmap",
"format": "solr",
"synonyms": "SW, Software, Softvare, software, softvare, sft\nHW, Hardware, Hardvare, hardware, hardvare, hdw => hardware"
}
Below is how the index is being created:
{
"name": "newphonetichotelswithsynonyms",
"fields": [
{"name": "hotelId", "type": "Edm.String", "key":true, "searchable": false},
{"name": "baseRate", "type": "Edm.Double"},
{"name": "description", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "analyzer":"my_standard",
"synonymMaps":[
"mysynonymmap"
]},
{"name": "hotelName", "type": "Edm.String", "analyzer":"my_standard",
"synonymMaps":[
"mysynonymmap"
]},
{"name": "category", "type": "Edm.String", "analyzer":"my_standard",
"synonymMaps":[
"mysynonymmap"
]},
{"name": "tags", "type": "Collection(Edm.String)", "analyzer":"my_standard",
"synonymMaps":[
"mysynonymmap"
]},
{"name": "parkingIncluded", "type": "Edm.Boolean"},
{"name": "smokingAllowed", "type": "Edm.Boolean"},
{"name": "lastRenovationDate", "type": "Edm.DateTimeOffset"},
{"name": "rating", "type": "Edm.Int32"},
{"name": "location", "type": "Edm.GeographyPoint"}
],
"analyzers":[
{
"name":"my_standard",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard_v2",
"tokenFilters":[ "lowercase", "asciifolding", "phonetic" ]
}
]
}