Azure Search: indexing HTML content

I know blob storage is the only data source (thus far) that supports indexing of HTML content.
My question is: is it possible to strip the HTML using a custom analyzer with the 'html_strip' char filter (mentioned in the Azure docs) before adding a document to an index via REST?
Here is my create index payload:
{
"name": "htmlindex",
"fields": [
{"name": "id", "type": "Edm.String", "key": true, "searchable": false},
{"name": "title", "type": "Edm.String", "filterable": true, "sortable": true, "facetable": true},
{"name": "html", "type": "Collection(Edm.String)", "analyzer": "htmlAnalyzer"}
],
"analyzers": [
{
"name": "htmlAnalyzer",
"#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters": [ "html_strip" ],
"tokenizer": "standard_v2"
}
]
}
Here is my add document to index payload:
{
"value": [
{
"id": "1",
"title": "title1",
"html": [
"<p>test1</p>",
"<p>test2</p>"
]
}
]
}
Now when I search the index, I see the HTML content is not being stripped:
{
"#odata.context": "https://deviqfy.search.windows.net/indexes('htmlindex')/$metadata#docs",
"value": [
{
"#search.score": 1,
"id": "1",
"title": "title1",
"html": [
"<p>test1</p>",
"<p>test2</p>"
]
}
]
}
What am I doing wrong? How can I strip the HTML from the content before I add it, without a separate pre-processing step?

The custom analyzers (and their associated character filters) are optional steps applied before and during tokenization; they exist to improve full-text search.
Azure Search doesn't have a mechanism for modifying the content of a document being indexed when you use the REST API to push documents to your index. You will have to do that yourself, as the analyzers are only used to extract search terms; the documents themselves are stored in the search index as submitted.
More details here if you are interested: https://learn.microsoft.com/en-us/azure/search/search-lucene-query-architecture
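For what it's worth, here is a minimal sketch of that pre-step in Python: strip the tags yourself, then push the cleaned text through the documents API. The service name, admin key, and api-version are placeholders to substitute with your own; tag stripping uses the standard-library html.parser.
import requests
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    # Collects only the text nodes, dropping all markup.
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

def strip_html(fragment):
    stripper = TagStripper()
    stripper.feed(fragment)
    return "".join(stripper.parts).strip()

# Placeholders: substitute your own service name and admin key.
endpoint = "https://<your-service>.search.windows.net/indexes/htmlindex/docs/index"
payload = {"value": [{
    "@search.action": "mergeOrUpload",
    "id": "1",
    "title": "title1",
    "html": [strip_html(h) for h in ["<p>test1</p>", "<p>test2</p>"]]
}]}
response = requests.post(endpoint,
                         params={"api-version": "2020-06-30"},
                         headers={"api-key": "<admin-key>"},
                         json=payload)
print(response.status_code, response.json())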

Related

Azure search not working with special characters

I have a search setup for a list of jobs.
Other than the Id field, I have just two searchable fields, both using the standard.lucene analyzer.
And I'm having difficulty making the search work with terms like C# or C++.
 
As per the documentation, unsafe characters should be URL-encoded and special characters should be escaped.
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax
 
But say I want to search for all jobs with C# in the title or description. I run the search using the encoding %23 and it does not seem to return the correct results: it returns other jobs that contain only C along with other terms.
 
Likewise, if I want to get only the C++ jobs, per the Azure docs I should use something like C\+\+. When I tried that, I also got other jobs.
So I'm not sure whether I'm missing something or misunderstood the documentation, but I'm not able to get the exact results I was expecting.
Your results depend on the analyzer you use. Read more about analyzers in the documentation: https://learn.microsoft.com/en-us/azure/search/search-analyzers
You can check how your query is analyzed by calling the analyze API with your query and a specified analyzer. For example:
{
"text": "C# engineer",
"analyzer": "standard"
}
Will result in:
"tokens": [
{
"token": "c",
"startOffset": 0,
"endOffset": 1,
"position": 0
},
{
"token": "engineer",
"startOffset": 3,
"endOffset": 11,
"position": 1
}
]
With the standard analyzer, you effectively lose the ability to tell a C# engineer apart from a C++ engineer: the analyzer reduces both to the same two tokens, c and engineer.
Instead, use a different built-in analyzer or configure your own. Here is the same example using the whitespace analyzer.
{
"text": "C# engineer",
"analyzer": "whitespace"
}
Results in
"tokens": [
{
"token": "C#",
"startOffset": 0,
"endOffset": 2,
"position": 0
},
{
"token": "engineer",
"startOffset": 3,
"endOffset": 11,
"position": 1
}
]
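If you want to reproduce this comparison against your own service, here is a small sketch of calling the Analyze Text API from Python. The service name, index name, and admin key are placeholders; standard.lucene is the full name of the default standard analyzer.
import requests

# Placeholders: substitute your own service, index, and admin key.
url = "https://<your-service>.search.windows.net/indexes/<your-index>/analyze"
for analyzer in ("standard.lucene", "whitespace"):
    response = requests.post(url,
                             params={"api-version": "2020-06-30"},
                             headers={"api-key": "<admin-key>"},
                             json={"text": "C# engineer", "analyzer": analyzer})
    # The standard analyzer drops the '#'; the whitespace analyzer keeps 'C#'.
    print(analyzer, [t["token"] for t in response.json()["tokens"]])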
CREATE
Create the index like this
"fields": [
{
"name": "Id",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": false,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
},
{
"name": "Title",
"type": "Edm.String",
"facetable": false,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": true,
"analyzer": "whitespace",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
],
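As a sketch, the same field definitions can be submitted as a create-or-update index request. The index name, service name, and admin key below are placeholders, and only the properties that differ from the defaults are spelled out.
import requests

# Placeholders: substitute your own service name, index name, and admin key.
index_definition = {
    "name": "<your-index>",
    "fields": [
        {"name": "Id", "type": "Edm.String", "key": True,
         "searchable": False, "filterable": True, "sortable": True},
        {"name": "Title", "type": "Edm.String", "searchable": True,
         "filterable": True, "sortable": True, "analyzer": "whitespace"}
    ]
}
response = requests.put("https://<your-service>.search.windows.net/indexes/<your-index>",
                        params={"api-version": "2020-06-30"},
                        headers={"api-key": "<admin-key>"},
                        json=index_definition)
print(response.status_code)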
UPLOAD
Test by uploading the following simple content.
"value": [
{
"#search.action": "mergeOrUpload",
"Id": "1",
"Title": "C++ Engineer"
},
{
"#search.action": "mergeOrUpload",
"Id": "2",
"Title": "C# Engineer"
}
]
QUERY
Then query
https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{YOUR-INDEX}}/docs?search=C%23&$count=true&searchMode=all&queryType=full&api-version={{API-VERSION}}
Your search for C# now returns the entry with C# in the title.
{
"#odata.context": "https://walter-searchsvc-we-dev.search.windows.net/indexes('so-73954804')/$metadata#docs(*)",
"#odata.count": 1,
"value": [
{
"#search.score": 0.25811607,
"Id": "2",
"Title": "C# Engineer"
}
]
}
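The same idea applies to C++: with queryType=full, escape each plus sign. Here is a sketch using the POST search API; the service, index, and query key are placeholders.
import requests

# Placeholders: substitute your own service, index, and query key.
url = "https://<your-service>.search.windows.net/indexes/<your-index>/docs/search"
response = requests.post(url,
                         params={"api-version": "2020-06-30"},
                         headers={"api-key": "<query-key>"},
                         json={"search": "C\\+\\+",  # escaped full Lucene query for C++
                               "queryType": "full",
                               "searchMode": "all",
                               "count": True})
print(response.json()["@odata.count"])  # matches only the C++ document with the whitespace analyzer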

How to index complex types into Edm.ComplexType with Azure Cognitive Search

I am indexing data into an Azure Search Index that is produced by a custom skill. This custom skill produces complex data which I want to preserve into the Azure Search Index.
Source data is coming from blob storage and I am constrained to using the REST API without a very solid argument for using the .NET SDK.
Current code
The following is a brief rundown of what I currently have. I cannot change the index's fields or the format of the data produced by the endpoint used by the custom skill.
Complex data
The following is an example of complex data produced by the custom skill (in the correct value/recordId/etc. format):
{
"field1": 0.135412,
"field2": 0.123513,
"field3": 0.243655
}
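For reference, this object travels inside the standard custom-skill response envelope. Here is a small sketch of how the function's response body is assembled; the numbers are just the example values above, and the recordId comes from the request.
import json

def make_skill_response(record_id, field1, field2, field3):
    # One entry per input record, keyed by recordId, as the WebApiSkill contract expects.
    return {
        "values": [{
            "recordId": record_id,
            "data": {"field1": field1, "field2": field2, "field3": field3},
            "errors": [],
            "warnings": []
        }]
    }

print(json.dumps(make_skill_response("0", 0.135412, 0.123513, 0.243655), indent=2))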
Custom skill
Here is the custom skill which creates said data:
{
"#odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"uri": "https://myfunction.azurewebsites.com/api,
"httpHeaders": {},
"httpMethod": "POST",
"timeout": "PT3M50S",
"batchSize": 1,
"degreeOfParallelism": 5,
"name": "MySkill",
"context": "/document/mycomplex
"inputs": [
{
"name": "text",
"source": "/document/content"
}
],
"outputs": [
{
"name": "field1",
"targetName": "field1"
},
{
"name": "field2",
"targetName": "field2"
},
{
"name": "field3",
"targetName": "field3"
}
]
}
I have attempted several variations, notably using the ShaperSkill with each field as an input and the output "targetName" as "mycomplex" (with the appropriate context).
Indexer
Here is the indexer's output field mapping for the skill:
{
"sourceFieldName": "/document/mycomplex,
"targetFieldName": "mycomplex"
}
I have tried several variations, such as "sourceFieldName": "/document/mycomplex/*".
Search index
And this is the targeted index field:
{
"name": "mycomplex",
"type": "Edm.ComplexType",
"fields": [
{
"name": "field1",
"type": "Edm.Double",
"retrievable": true,
"filterable": true,
"sortable": true,
"facetable": false,
"searchable": false
},
{
"name": "field2",
"type": "Edm.Double",
"retrievable": true,
"filterable": true,
"sortable": true,
"facetable": false,
"searchable": false
},
{
"name": "field3",
"type": "Edm.Double",
"retrievable": true,
"filterable": true,
"sortable": true,
"facetable": false,
"searchable": false
}
]
}
Result
My result is usually an error similar to: "Could not map output field 'mycomplex' to search index. Check your indexer's 'outputFieldMappings' property."
This may be a mistake with the context of your skill. Instead of setting the context to /document/mycomplex, can you try setting it to /document? You can then add a ShaperSkill with the context also set to /document and the output field being mycomplex to generate the expected complex type shape.
Example skills:
"skills":
[
{
"#odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"uri": "https://myfunction.azurewebsites.com/api,
"httpHeaders": {},
"httpMethod": "POST",
"timeout": "PT3M50S",
"batchSize": 1,
"degreeOfParallelism": 5,
"name": "MySkill",
"context": "/document"
"inputs": [
{
"name": "text",
"source": "/document/content"
}
],
"outputs": [
{
"name": "field1",
"targetName": "field1"
},
{
"name": "field2",
"targetName": "field2"
},
{
"name": "field3",
"targetName": "field3"
}
]
},
{
"#odata.type": "#Microsoft.Skills.Util.ShaperSkill",
"context": "/document",
"inputs": [
{
"name": "field1",
"source": "/document/field1"
},
{
"name": "field2",
"source": "/document/field2"
},
{
"name": "field3",
"source": "/document/field3"
}
],
"outputs": [
{
"name": "output",
"targetName": "mycomplex"
}
]
}
]
Please refer to the documentation on shaper skill for specifics.
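With the skillset shaped this way, the indexer's output field mapping can stay a single mapping from the shaped node to the complex field. Here is a sketch of submitting that over REST; the service, indexer, data source, index, and skillset names and the admin key are all placeholders.
import requests

# Placeholders: substitute your own resource names and admin key.
indexer_definition = {
    "name": "<your-indexer>",
    "dataSourceName": "<your-datasource>",
    "targetIndexName": "<your-index>",
    "skillsetName": "<your-skillset>",
    "outputFieldMappings": [
        {"sourceFieldName": "/document/mycomplex", "targetFieldName": "mycomplex"}
    ]
}
response = requests.put("https://<your-service>.search.windows.net/indexers/<your-indexer>",
                        params={"api-version": "2020-06-30"},
                        headers={"api-key": "<admin-key>"},
                        json=indexer_definition)
print(response.status_code)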

limit EntityRecognitionSkill to confidence > 0.5

I'm using Microsoft.Skills.Text.EntityRecognitionSkill in my skillset, which outputs "Person", "Location", and "Organization" entities.
However, I want to only output locations that have a confidence level > 0.5.
Is there a way to do that?
Here is a snippet of my code:
{
"#odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
"categories": [
"Person",
"Location",
"Organization"
],
"context": "/document/finalText/pages/*",
"inputs": [
{
"name": "text",
"source": "/document/finalText/pages/*"
},
{
"name": "languageCode",
"source": "/document/languageCode"
}
],
"outputs": [
{
"name": "persons",
"targetName": "people"
},
{
"name": "locations"
},
{
"name": "namedEntities",
"targetName": "entities"
}
]
},
[Edited based on Mick's comment]
Yes, this should be possible by setting the minimumPrecision parameter of the entity recognition skill to 0.5, which will result in only entities whose confidence is >= 0.5 being returned.
The documentation for entity recognition skill is here: https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-entity-recognition
As Mick points out, the documentation says minimumPrecision is unused; however, that documentation is out of date and I will fix it soon.
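For illustration, here is the skill from the question with that parameter added, expressed as a Python dict that could be dropped into a skillset update; everything except minimumPrecision is unchanged from the snippet above.
# The skill from the question, with minimumPrecision added.
entity_recognition_skill = {
    "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
    "categories": ["Person", "Location", "Organization"],
    "minimumPrecision": 0.5,  # only entities with confidence >= 0.5 are returned
    "context": "/document/finalText/pages/*",
    "inputs": [
        {"name": "text", "source": "/document/finalText/pages/*"},
        {"name": "languageCode", "source": "/document/languageCode"}
    ],
    "outputs": [
        {"name": "persons", "targetName": "people"},
        {"name": "locations"},
        {"name": "namedEntities", "targetName": "entities"}
    ]
}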

How can I turn a cosmosdb list of documents into a hashmap and append values to it

I am currently trying to turn the list of documents I get from a Cosmos DB query into a map, so that I can iterate over the objects' elements without using their ids. I want to remove some elements, and I also want to append some data to other elements. Finally, I want to output a JSON file with this data. How can I do this?
For example:
{
"action": "A",
"id": "138",
"validate": "yes",
"BaseVehicle": {
"id": "105"
},
"Qty": {
"value": "1"
},
"PartType": {
"id": "8852"
},
"BatchNumber": 0,
"_attachments": "attachments/",
"_ts": 1551998460
}
It should look something like this:
{
"type": "App",
"data": {
"attributes": {
"Qty": {
"values": [
{
"source": "internal",
"locale": "en-US",
"value": "1"
}
]
},
"BaseVehicle": {
"values": [
{
"source": "internal",
"locale": "en-US",
"value": "105"
}
]
},
"PartType": {
"values": [
{
"source": "internal",
"locale": "en-US",
"value": "8852"
}
]
}
}
}
}
You could use a Copy Activity in Azure Data Factory to implement your requirements.
1. Write an API to query the data from Cosmos DB and process it into the format you want in code.
2. Output the desired results and configure the HTTP connector as the source of the copy activity. Refer to this link.
3. Configure Azure Blob Storage as the sink of the copy activity. The dataset properties support JSON format. Refer to this link.
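A minimal sketch of the processing in step 1 (Python here; the field names come from the example document above, and which fields to keep or drop is an assumption):
import json

def reshape(doc, keep=("Qty", "BaseVehicle", "PartType")):
    # Build the "attributes" map from the kept fields, appending the
    # source/locale metadata to each value.
    attributes = {}
    for name in keep:
        raw = doc.get(name, {})
        value = raw.get("value") or raw.get("id")  # the example document uses both shapes
        attributes[name] = {
            "values": [{"source": "internal", "locale": "en-US", "value": value}]
        }
    return {"type": "App", "data": {"attributes": attributes}}

source_doc = {
    "action": "A", "id": "138", "validate": "yes",
    "BaseVehicle": {"id": "105"}, "Qty": {"value": "1"}, "PartType": {"id": "8852"},
    "BatchNumber": 0, "_attachments": "attachments/", "_ts": 1551998460
}
print(json.dumps(reshape(source_doc), indent=2))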

Can we use Phonetic token and Synonyms together?

I am trying to enable a phonetic analyzer and synonyms together, and it doesn't seem to work. Is it wrong to use them together?
In the implementation below I would expect the search query to be expanded using the synonyms, and then the phonetic analyzer to be used to retrieve the results. But my synonyms are totally ignored here.
If I remove the phonetic analyzer from the index creation, then the synonyms work fine.
Also, the synonyms work fine if I use the built-in analyzers like en.microsoft instead of a custom analyzer. Is this a bug?
My Synonym map
{
"name":"mysynonymmap",
"format":"solr",
"synonyms": "
SW, Software, Softvare, software, softvare, sft\n
HW, Hardware, Hardvare, hardware, hardvare, hdw => hardware\n"
}
Below is how the index is getting created
"name": "newphonetichotelswithsynonyms",
"fields": [
{"name": "hotelId", "type": "Edm.String", "key":true, "searchable": false},
{"name": "baseRate", "type": "Edm.Double"},
{"name": "description", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "analyzer":"my_standard",
"synonymMaps":[
"mysynonymmap"
]},
{"name": "hotelName", "type": "Edm.String", "analyzer":"my_standard",
"synonymMaps":[
"mysynonymmap"
]},
{"name": "category", "type": "Edm.String", "analyzer":"my_standard",
"synonymMaps":[
"mysynonymmap"
]},
{"name": "tags", "type": "Collection(Edm.String)", "analyzer":"my_standard",
"synonymMaps":[
"mysynonymmap"
]},
{"name": "parkingIncluded", "type": "Edm.Boolean"},
{"name": "smokingAllowed", "type": "Edm.Boolean"},
{"name": "lastRenovationDate", "type": "Edm.DateTimeOffset"},
{"name": "rating", "type": "Edm.Int32"},
{"name": "location", "type": "Edm.GeographyPoint"}
],
"analyzers":[
{
"name":"my_standard",
"#odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard_v2",
"tokenFilters":[ "lowercase", "asciifolding", "phonetic" ]
}
]
}
