Need help with Azure Search when the search term has an asterisk (*)

We are facing an issue with the Azure Search API when the search term ends with an asterisk (*) and also contains special characters.
When we hit our production Azure Search API with the JSON object below, we get no results. Notice the search term "déménage*" with an asterisk (*) at the end.
https://one-adscope-search-fr-prod.search.windows.net/indexes/one-adscope-advancedsearch-fr/docs/search?api-version=2016-09-01
{
"count": "true",
"facets": null,
"orderby": "firstSeenDate desc,creativeIdNumber asc",
"search": "déménage*",
"searchFields": "keywordSignatureLangSearch,keywordSloganLangSearch,keywordTextLangSearch,keywordScriptLangSearch,keywordIncrustTVLangSearch,keywordVisualKeywordsLangSearch,keywordAgencyLangSearch,keywordMusicTitleLangSearch,keywordMusicPerformerLangSearch,keywordMusicAuthorLangSearch,categoryLevel_1_nameLangSearch,categoryLevel_2_nameLangSearch,categoryLevel_3_nameLangSearch,categoryLevel_4_nameLangSearch,categoryLevel_5_nameLangSearch,productLevel_1_nameLangSearch,productLevel_2_nameLangSearch,productLevel_3_nameLangSearch,productLevel_4_nameLangSearch,productLevel_5_nameLangSearch,campaignNamesLangSearch,themeNamesLangSearch,creativeTitleLangSearch,visualLangSearch,keyword_tagsLangSearch,countryNameLangSearch,directorLangSearch,hashtagsLangSearch,illustratorLangSearch,inlayLangSearch,csmediaNameLangSearch,subMediaNameLangSearch,modifVersionLangSearch,photographerLangSearch,productionLangSearch,taglineLangSearch,partnersLangSearch,creativeLabelLangSearch,propertyNameLangSearch,sponsorshipProgramTitleLangSearch",
"searchMode": "any",
"select": "",
"skip": 0,
"top": 250,
"queryType": "full"
}
But when we hit the API with the same JSON and only one change, the search term without the asterisk (*) at the end, i.e. "déménage", we get appropriate results.
Note that all the other fields below, including searchFields, are the same.
{
"count": "true",
"facets": null,
"orderby": "firstSeenDate desc,creativeIdNumber asc",
"search": "déménage",
"searchFields": "keywordSignatureLangSearch,keywordSloganLangSearch,keywordTextLangSearch,keywordScriptLangSearch,keywordIncrustTVLangSearch,keywordVisualKeywordsLangSearch,keywordAgencyLangSearch,keywordMusicTitleLangSearch,keywordMusicPerformerLangSearch,keywordMusicAuthorLangSearch,categoryLevel_1_nameLangSearch,categoryLevel_2_nameLangSearch,categoryLevel_3_nameLangSearch,categoryLevel_4_nameLangSearch,categoryLevel_5_nameLangSearch,productLevel_1_nameLangSearch,productLevel_2_nameLangSearch,productLevel_3_nameLangSearch,productLevel_4_nameLangSearch,productLevel_5_nameLangSearch,campaignNamesLangSearch,themeNamesLangSearch,creativeTitleLangSearch,visualLangSearch,keyword_tagsLangSearch,countryNameLangSearch,directorLangSearch,hashtagsLangSearch,illustratorLangSearch,inlayLangSearch,csmediaNameLangSearch,subMediaNameLangSearch,modifVersionLangSearch,photographerLangSearch,productionLangSearch,taglineLangSearch,partnersLangSearch,creativeLabelLangSearch,propertyNameLangSearch,sponsorshipProgramTitleLangSearch",
"searchMode": "any",
"select": "",
"skip": 0,
"top": 250,
"queryType": "full"
}
Please advise at the earliest.
Thanks,
Bhavik Shah

I suspect the documents returned in the case without the suffix operator '*' are matching because the diacritics were removed from the search term during the lexical analysis process. Please see this post for details: Prefix queries (*) in Azure Search don't return expected results
Consider changing your query to search=déménage* OR déménage
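A minimal sketch of the suggestion above, assuming the request body is assembled in Python. The helper names prefix_or_exact and fold_diacritics are hypothetical, and the folding function only illustrates what removing diacritics looks like; it is not the analyzer's actual implementation.

```python
import unicodedata


def fold_diacritics(text):
    """Illustrative only: strip accents roughly the way a folding step might."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def prefix_or_exact(term):
    """Build the 'term* OR term' query suggested above."""
    return f"{term}* OR {term}"


# Same queryType and searchMode as the original request.
body = {
    "search": prefix_or_exact("déménage"),
    "queryType": "full",
    "searchMode": "any",
}

print(body["search"])
print(fold_diacritics("déménage"))
```

The second print shows the kind of accent-free term that the exact (analyzed) half of the query could end up matching, while the prefix half is matched as typed.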

Related

Azure Cognitive Search, how to configure analyzer to support "startsWith"?

I have a field in Azure Cognitive Search that has special characters in it.
The values look like this: some_id: 'SOME*STUFF*123'
I'm trying to run a "startsWith" query, but it doesn't return anything as soon as the regex tries to match anything beyond the *.
After a bit of googling I found out it's the analyzer, which is possibly breaking strings apart at '*'.
So I changed the analyzer to "keyword", as I read multiple times that it's the analyzer you are supposed to use for this.
The new config looks like this:
{
"name": "some_id",
"type": "Edm.String",
"facetable": false,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": true,
"analyzer": "keyword",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
My request looks like this:
{
"count": true,
"skip": 0,
"top": 5,
"searchMode": "any",
"queryType": "full",
"search": "some_id:/SO(.*)/" // SOME\\*S(.*) also doesnt work
}
I get zero matches.
With the standard analyzer I started getting no matches as soon as I had a \\* in my regex (I escaped the asterisks with \\).
Clarification on Requirements:
I cannot change any data; the values (including the *) cannot be changed. I'm trying to have the whole field matched as a single token that I can run startsWith on.
For example, this regex: /SOME\\*ST(.*)/ is supposed to literally return entries that fully match the regex. No magic with separators or tokens, simply the whole value as a single token that I can run startsWith on.
What I'm trying to say is, take for example JavaScript, I want the exact same results you would get from string.startsWith(value).
I'm guessing there is either something wrong with my config, or with my requests, can anyone help me?
IMHO, you should work with a different separator. For example:
Field1 (FROM) | Field2 (TO)
SOME*STUFF*123 | SOME||STUFF||123
Then use a custom analyzer to break terms at every ||. Additionally, you can configure a tokenizer to produce 3-character tokens.
Samples:
SOM
OME
STU
TUF
UFF
123
Then search using:
SOM*
and it should return the data you're looking for. It would be better if you could provide more details about your content and give us samples, but this answer should point you to the result you're looking for.
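To make the sample tokens above concrete, here is a rough sketch in plain Python of splitting on the replacement separator and then emitting every 3-character window per segment. The function name trigrams is hypothetical, and this only mimics the idea; the actual tokens depend on how the custom analyzer and tokenizer are configured in the index.

```python
def trigrams(value, separator="||", size=3):
    """Split on the separator, then emit each run of `size` characters per segment."""
    tokens = []
    for segment in value.split(separator):
        if not segment:
            continue  # ignore empty pieces from leading or doubled separators
        if len(segment) < size:
            tokens.append(segment)  # keep short segments whole
            continue
        tokens.extend(segment[i:i + size] for i in range(len(segment) - size + 1))
    return tokens


print(trigrams("SOME||STUFF||123"))
```

For the value SOME||STUFF||123 this yields the same six tokens listed in the samples, which is why a query starting with SOM can find the document.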

Indexing e-mails in Azure Search

I'm trying to best index contents of e-mail messages, subjects and email addresses. E-mails can contain both text and HTML representation. They can be in any language so I can't use language specific analysers unfortunately.
As I am new to this I have many questions:
1. First I used the standard Lucene analyser, but after some testing and checking what each analyser does I switched to the "simple" analyser. The standard one didn't allow me to search by domain in user@domain.com (it sees user and domain.com as tokens). Is "simple" the best I can use in my case?
2. How can I handle the HTML contents of e-mails? I thought it should be possible to do this in Azure Search, but right now I think I would need to strip HTML tags myself.
3. My users aren't tech savvy and I assumed the "simple" query type would be enough for them. I expect them to type a word or two and find messages containing this word or words starting with it. From my tests it looks like I need to append * to their queries to get "starting with" to work?
It would help if you included an example of your data and how you index and query. What happened, and what did you expect?
The standard Lucene analyzer will work with your user@domain.com example. It is correct that it produces the tokens user and domain.com. But the same happens when you query, and you will get records with the tokens user and domain.com.
CREATE INDEX
"fields": [
{"name": "Id", "type": "Edm.String", "searchable": false, "filterable": true, "retrievable": true, "sortable": true, "facetable": false, "key": true, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null, "synonymMaps": [] },
{"name": "Email", "type": "Edm.String", "filterable": true, "sortable": true, "facetable": false, "searchable": true, "analyzer": "standard"}
]
UPLOAD
{
"value": [
{
"@search.action": "mergeOrUpload",
"Id": "1",
"Email": "user@domain.com"
},
{
"@search.action": "mergeOrUpload",
"Id": "2",
"Email": "some.user@some-domain.com"
},
{
"@search.action": "mergeOrUpload",
"Id": "3",
"Email": "another@another.com"
}
]
}
QUERY
Query, using queryType=full and searchMode=all.
https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{INDEX_NAME}}/docs?search=user@domain.com&$count=true&$select=Id,Email&searchMode=all&queryType=full&api-version={{API-VERSION}}
Which produces results as expected (all records containing user and domain.com):
{
"@odata.context": "https://<your-search-env>.search.windows.net/indexes('dg-test-65392234')/$metadata#docs(*)",
"@odata.count": 2,
"value": [
{
"@search.score": 0.51623213,
"Id": "1",
"Email": "user@domain.com"
},
{
"@search.score": 0.25316024,
"Id": "2",
"Email": "some.user@some-domain.com"
}
]
}
If your expected result is to only get the record above where the email matches completely, you could instead use a phrase search. I.e. replace the search parameter above with search="user@domain.com" and you would get:
{
"@search.score": 0.51623213,
"Id": "1",
"Email": "user@domain.com"
}
Alternatively, you could use the keyword analyzer.
ANALYZE
You can compare the different analyzers directly via REST. Using the keyword analyzer on the Email property will produce a single token.
{
"text": "some-user@some-domain.com",
"analyzer": "keyword"
}
Results in the following tokens:
"tokens": [
{
"token": "some-user@some-domain.com",
"startOffset": 0,
"endOffset": 25,
"position": 0
}
]
Compare this to the standard analyzer, which does a decent job for most types of unstructured content.
{
"text": "some-user@some-domain.com",
"analyzer": "standard"
}
This produces reasonable results for cases where the email address is part of some generic text.
"tokens": [
{
"token": "some",
"startOffset": 0,
"endOffset": 4,
"position": 0
},
{
"token": "user",
"startOffset": 5,
"endOffset": 9,
"position": 1
},
{
"token": "some",
"startOffset": 10,
"endOffset": 14,
"position": 2
},
{
"token": "domain.com",
"startOffset": 15,
"endOffset": 25,
"position": 3
}
]
SUMMARY
This is a long answer already, so I won't cover your other two questions in detail. I would suggest splitting them to separate questions so it can benefit others.
HTML content: You can use a custom analyzer with the built-in html_strip character filter, which strips HTML tags. Or you can strip the HTML yourself using custom code. I typically use Beautiful Soup for cases like these or simple regular expressions for simpler cases.
Wildcard search: Usually, users don't expect wildcards to be appended automatically. The only application that does this is the Outlook client, which destroys precision. When I search for "Jan" (a common name), I annoyingly get all emails sent in January(!). And when I search for Dan (again, a name), I also get all emails from Danmark (Denmark).
Everything in search is a trade-off between precision and recall. In your first example with the email address, your expectation was heavily geared toward precision. But, in your last wildcard question, you seem to prefer extreme recall with wildcards on everything. It all comes down to your expectations.
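For the "strip the HTML yourself" option mentioned in the summary, here is a minimal sketch that uses only Python's standard library instead of Beautiful Soup. The names strip_html and _TextExtractor are hypothetical, and the sketch assumes you clean the message body before uploading it to the index.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; ignores tags and the contents of script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join with spaces so adjacent blocks do not fuse into one word,
        # then collapse the leftover whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(strip_html("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
```

Joining text nodes with spaces is a deliberate trade-off: it keeps words from neighbouring paragraphs apart, at the cost of splitting a word that has inline markup in the middle of it.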

Azure Maps API - Limit By Type

I have implemented the Azure Maps search at https://learn.microsoft.com/en-gb/rest/api/maps/search/getsearchaddress but I want to get a list of only certain "types".
In the results below, the type is "Street", but I am interested in returning only those where the type matches "MunicipalitySubdivision".
If I do a call to this service, the API returns results in blocks of 10 by default (which can be upped to 200), and gives a TotalResults field as well. It is possible to iterate through (for example) 50,000 results 200 at a time by providing a results offset startIndex parameter in the API, but this doesn't seem like the most efficient way to return just results of one type.
Can anyone suggest anything?
{
"type": "Street",
"id": "GB/STR/p0/1199538",
"score": 5.07232,
"address": {
"streetName": "Hampton Road",
"municipalitySubdivision": "Birmingham, Aston",
"municipality": "Birmingham",
"countrySecondarySubdivision": "West Midlands",
"countrySubdivision": "ENG",
"postalCode": "B6",
"extendedPostalCode": "B6 6AB,B6 6AE,B6 6AN,B6 6AS",
"countryCode": "GB",
"country": "United Kingdom",
"countryCodeISO3": "GBR",
"freeformAddress": "Hampton Road, Birmingham",
"countrySubdivisionName": "England"
},
"position": {
"lat": 52.50665,
"lon": -1.90082
},
"viewport": {
"topLeftPoint": {
"lat": 52.50508,
"lon": -1.90015
},
"btmRightPoint": {
"lat": 52.50804,
"lon": -1.90139
}
}
}
There currently isn't an option to limit the results as you requested other than to scan the results programmatically. If the address information you have is structured (you have the individual pieces) and is not a freeform string, then using the structured geocoding service would allow you to specify the type right in the request when passing in the address parts: https://learn.microsoft.com/en-us/rest/api/maps/search/getsearchaddressstructured
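As a rough illustration of scanning the results programmatically, the sketch below assumes each page of results has already been fetched and parsed into a list of dictionaries shaped like the sample in the question. The helper name filter_by_type is hypothetical, and the second sample entry is made up purely for the example.

```python
def filter_by_type(results, wanted_type):
    """Keep only the results whose "type" field equals wanted_type."""
    return [item for item in results if item.get("type") == wanted_type]


# Trimmed-down sample page; the second entry is invented for illustration.
page = [
    {"type": "Street", "id": "GB/STR/p0/1199538"},
    {"type": "MunicipalitySubdivision", "id": "hypothetical-id"},
]

matches = filter_by_type(page, "MunicipalitySubdivision")
print(matches)
```

The same function can be applied to each page while iterating with the offset parameter, so only the matching entries are kept in memory.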

Azure Search - phonetic search implementation

I was trying out phonetic search using Azure Search without much luck. My objective is to work out an index configuration that can handle typos and accommodate phonetic search for end users.
With the configuration and sample data below, I tried searching for intentionally misspelled words like 'softvare' or 'alek'. I got results for 'alek' thanks to the phonetic analyzer, but didn't get any results for 'softvare'.
It looks like phonetic search will not do the trick for this requirement.
The only option I found was to use a synonym map. The major pitfall is that I'm unable to use the phonetic / custom analyzer along with synonyms :(
What strategies would you recommend for handling typos?
search query used
?api-version=2017-11-11&search=alec
?api-version=2017-11-11&search=softvare
Here is the index configuration
"name": "phonetichotels",
"fields": [
{"name": "hotelId", "type": "Edm.String", "key":true, "searchable": false},
{"name": "baseRate", "type": "Edm.Double"},
{"name": "description", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "analyzer":"my_standard"},
{"name": "hotelName", "type": "Edm.String", "analyzer":"my_standard"},
{"name": "category", "type": "Edm.String", "analyzer":"my_standard"},
{"name": "tags", "type": "Collection(Edm.String)", "analyzer":"my_standard"},
{"name": "parkingIncluded", "type": "Edm.Boolean"},
{"name": "smokingAllowed", "type": "Edm.Boolean"},
{"name": "lastRenovationDate", "type": "Edm.DateTimeOffset"},
{"name": "rating", "type": "Edm.Int32"},
{"name": "location", "type": "Edm.GeographyPoint"}
],
Analyzer (part of the index creation)
"analyzers":[
{
"name":"my_standard",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard_v2",
"tokenFilters":[ "lowercase", "asciifolding", "phonetic" ]
}
]
Analyze API Input and Output for 'software'
{
"analyzer":"my_standard",
"text": "software"
}
{
"@odata.context": "https://ctsazuresearchpoc.search.windows.net/$metadata#Microsoft.Azure.Search.V2017_11_11.AnalyzeResult",
"tokens": [
{
"token": "SFTW",
"startOffset": 0,
"endOffset": 8,
"position": 0
}
]
}
Analyze API Input and Output for 'softvare'
{
"analyzer":"my_standard",
"text": "softvare"
}
{
"@odata.context": "https://ctsazuresearchpoc.search.windows.net/$metadata#Microsoft.Azure.Search.V2017_11_11.AnalyzeResult",
"tokens": [
{
"token": "SFTF",
"startOffset": 0,
"endOffset": 8,
"position": 0
}
]
}
Sample data that I loaded
{
"@search.action": "upload",
"hotelId": "5",
"baseRate": 199.0,
"description": "Best hotel in town for software people",
"hotelName": "Fancy Stay",
"category": "Luxury",
"tags": ["pool", "view", "wifi", "concierge"],
"parkingIncluded": false,
"smokingAllowed": false,
"lastRenovationDate": "2010-06-27T00:00:00Z",
"rating": 5,
"location": { "type": "Point", "coordinates": [-122.131577, 47.678581] }
},
{
"@search.action": "upload",
"hotelId": "6",
"baseRate": 79.99,
"description": "Cheapest hotel in town ",
"hotelName": " Alec Baldwin Motel",
"category": "Budget",
"tags": ["motel", "budget"],
"parkingIncluded": true,
"smokingAllowed": true,
"lastRenovationDate": "1982-04-28T00:00:00Z",
"rating": 1,
"location": { "type": "Point", "coordinates": [-122.131577, 49.678581] }
},
With the right configuration, I should have got results even with the misspelled words.
I work on Azure Search. Before I suggest approaches to handle misspelled words, it would be helpful to look at your custom analyzer (my_standard) configuration. It might tell us why it's not able to handle the case for 'softvare'. As a DIY, you can use the Analyze API to see the tokens created using your custom analyzer and it should contain 'software' to actually match the docs.
Now then, here are a few ways that can be used independently or in conjunction to handle misspelled words. The best approach varies depending on the use-case and I strongly suggest you experiment with these to figure out the best one in your case.
You are already familiar with phonetic filters, which are a common approach to handling similarly pronounced terms. If you haven't already, try different encoders for the filter to evaluate which configuration gives you the best results. Check out the list of encoders here.
Use fuzzy queries, supported as part of the Lucene query syntax in Azure Search, which return terms that are near the original query term based on a distance metric. The limitation here is that it works on a single term. Check the docs for more details. A sample query would look like search=softvare~1. You can also use term boosting to give the original term more boost in cases where the original term is also a valid term.
You also alluded to synonyms, which can also be used to handle queries with misspelled terms. This approach gives you the most control over the process of handling typos but also requires you to have prior knowledge of the different typos for each term. You can use these docs if you want to experiment with synonyms.
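To see why a fuzzy query with a distance of 1 can reach 'software' from 'softvare' while the phonetic tokens (SFTW vs SFTF) do not match, here is a small sketch that computes a plain Levenshtein edit distance. It only illustrates the idea of a distance metric; the exact metric and implementation used by the service may differ, and the helper names edit_distance and fuzzy_term are hypothetical.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def fuzzy_term(term, distance=1):
    """Append the Lucene fuzzy operator, e.g. softvare~1."""
    return f"{term}~{distance}"


print(edit_distance("softvare", "software"))
print(fuzzy_term("softvare"))
```

Both misspellings from the question are a single substitution away from the indexed word, so a distance of 1 is enough for them; larger distances trade precision for recall.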
As you can read in my post, my objective was to handle typos.
The only easy option is to use the built-in Lucene functionality: fuzzy search. I have yet to check the response times, as queryType has to be set to 'full' to use fuzzy search. Otherwise, the results were satisfactory.
Example:
search=softvare~&fuzzy=true&querytype=full
will return all documents containing 'software'.
For further reading, please go through the documentation.

How to index keywords using couchdb-lucene

I'm trying to build a couchdb view using couchdb-lucene to query on keywords. I want lucene to index them without any processing.
I'm using the "index": "not_analyzed" option, but it is still not doing what I expected.
When I query for /works/OL1000010W, couchdb-lucene converts it to lowercase and strips the first / character.
$ curl -s 'http://127.0.0.1:5984/editions_1k/_fti/_design/seeds/by_seed?q=seed:/works/OL1000010W&limit=1'
{
"rows": [],
"total_rows": 0,
"skip": 0,
"search_duration": 1,
"q": "seed:works/ol1000010w",
"fetch_duration": 0,
"etag": "11e4be5bdb5c1598",
"limit": 1
}
Is there any way to make couchdb-lucene index it without processing and stop couchdb-lucene from processing the query?
Here is my design document:
https://gist.github.com/670374
I found that this is due to a bug in couchdb-lucene.
https://github.com/rnewson/couchdb-lucene/issues/#issue/92
The workaround is to write the view like this:
{
"analyzer": "keyword",
"index": "function(doc) {...}"
}

Resources