OpenAI semantic search not working with the file parameter

From what I understand, you can use either the documents parameter or the file parameter to tell OpenAI which labels to search over. I get the expected results with the documents parameter but unsatisfactory results with the file parameter. I would expect them to be the same.
When performing a search using the documents parameter..
response = dict(openai.Engine('davinci').search(
query='sitcom',
#file=file_id,
max_rerank=5,
documents=["white house", "school", "seinfeld"],
return_metadata=False))
..I get the expected results: "seinfeld" (document 2) wins the search with a score of 771.
{'object': 'list', 'data': [<OpenAIObject search_result at 0xb5e8ef48> JSON: {
"document": 0,
"object": "search_result",
"score": 147.98
}, <OpenAIObject search_result at 0xb5ebd148> JSON: {
"document": 1,
"object": "search_result",
"score": 211.021
}, <OpenAIObject search_result at 0xb5ebd030> JSON: {
"document": 2,
"object": "search_result",
"score": 771.348
}], 'model': 'davinci:2020-05-03'}
Now trying with the file parameter I create a temp.jsonl file with contents..
{"text": "white house", "metadata": "metadata here"}
{"text": "school", "metadata": "metadata here"}
{"text": "seinfeld", "metadata": "metadata here"}
I then upload the file to openai server with..
res = openai.File.create(file=open('temp.jsonl'), purpose="search")
where..
file_id = res['id']
I wait until the file is processed by the server then..
response = dict(openai.Engine('davinci').search(
query='sitcom',
file=file_id,
max_rerank=5,
#documents=["white house", "school", "seinfeld"],
return_metadata=False))
But I get the following message when I perform search..
No similar documents were found in file with ID 'file-LzHkASUxbDjTAWBhHxHpIOf4'.Please upload more documents or adjust your query.
I only get results when my query exactly matches a label..
response = dict(openai.Engine('davinci').search(
query='seinfeld',
file=file_id,
max_rerank=5,
#documents=["white house", "school", "seinfeld"],
return_metadata=False))
{'object': 'list', 'data': [<OpenAIObject search_result at 0xb5e74f48> JSON: {
"document": 0,
"object": "search_result",
"score": 668.846,
"text": "seinfeld"
}], 'model': 'davinci:2020-05-03'}
What am I doing wrong? Shouldn't the results be the same using the documents parameter or the file parameter?

Rereading the docs, it seems that when you use the file parameter instead of the documents parameter, the server first performs a basic "keyword" search with the provided query to narrow down the results, and only then reranks those results with a semantic search using the same query.
This is disappointing.
Just to provide a working example..
{"text": "stairway to the basement", "metadata": "metadata here"}
{"text": "school", "metadata": "metadata here"}
{"text": "stairway to heaven", "metadata": "metadata here"}
Now using the query "led zeppelin's most famous song stairway" the server will narrow down the results to document 0 and document 2 finding matches for the "stairway" token. It will then perform a semantic search and score both of them. Document 2 ("stairway to heaven") will have the highest relevancy score.
Using the query "stairway to the underground floor" will give document 0 ("stairway to the basement") the highest relevancy score.
This is disappointing because the query has to be useful for both a keyword search AND the semantic search.
In my original post, the keyword search was not providing any results because the query was only designed for a semantic search. When using the documents parameter, only a semantic search is performed, that is why it worked in that case.
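To make the two-stage behaviour concrete, here is a minimal sketch of the keyword pre-filter. It is an illustration under assumptions, not the server's implementation: the tokenizer and the "share at least one token" rule are guesses, and the names tokenize and keyword_candidates are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_candidates(query, documents):
    """Stage 1 (assumed): keep documents that share at least one token with the query.

    Only these candidates would be passed on to the semantic reranking stage.
    """
    query_tokens = tokenize(query)
    return [i for i, doc in enumerate(documents) if query_tokens & tokenize(doc)]


labels = ["white house", "school", "seinfeld"]
songs = ["stairway to the basement", "school", "stairway to heaven"]

# "sitcom" shares no token with any label, so nothing is left to rerank.
print(keyword_candidates("sitcom", labels))
# An exact label match survives the keyword stage.
print(keyword_candidates("seinfeld", labels))
# "stairway" keeps both stairway documents as candidates for reranking.
print(keyword_candidates("led zeppelin's most famous song stairway", songs))
```

With the documents parameter there is no such pre-filter, so every label is scored semantically.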

Related

Azure Cognitive Search - Retrieve Search Score in Search Result

I am looking for a way to retrieve the search score in the search result (an index field value), similar to the other metadata fields like metadata_storage_name or metadata_storage_path. In the Indexer Definition, I tried retrieving the search score in the following way. Please correct me if I am missing anything or retrieving it the wrong way.
"fieldMappings": [
{
"sourceFieldName": "@search.score",
"targetFieldName": "search_score",
"mappingFunction": null
}
]
The search score is an attribute added to each search result in the response to a search request. Try issuing a simple search request using your favourite REST client or the Azure Portal. Below is an example of a response object; @search.score is what you're looking for.
"value": [
{
"@search.score": 7.3617697,
"HotelId": "21",
"HotelName": "Nova Hotel & Spa",
"Description": "1 Mile from the airport. Free WiFi, Outdoor Pool, Complimentary Airport Shuttle, 6 miles from the beach & 10 miles from downtown.",
"Category": "Resort and Spa",
"Tags": [
"pool",
"continental breakfast",
"free parking"
]
},
{
"@search.score": 2.5560288,
"HotelId": "25",
"HotelName": "Scottish Inn",
"Description": "Newly Redesigned Rooms & airport shuttle. Minutes from the airport, enjoy lakeside amenities, a resort-style pool & stylish new guestrooms with Internet TVs.",
"Category": "Luxury",
"Tags": [
"24-hour front desk service",
"continental breakfast",
"free wifi"
]
}]
Example is from here: https://learn.microsoft.com/en-us/azure/search/search-query-simple-examples#example-1-full-text-search
'@search.score' is not a field in the index; it is a relevance score computed for each search result. If your search criteria match and results are returned, you can read that value from the HTTP response under '@search.score'.
Field mappings, on the other hand, map a field in your data source whose name does not match the name you want to use in the index, so they cannot be used to capture the score.
For more information on the HTTP response of Search Documents REST API and search scoring, please visit:
https://learn.microsoft.com/rest/api/searchservice/search-documents and
https://learn.microsoft.com/azure/search/index-similarity-and-scoring
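As a rough illustration of reading the score on the client side, the sketch below parses a response body and collects the score for each hit. The payload is trimmed from the example response above, the key is written as @search.score as Azure returns it, and the helper name scores_by_hotel is hypothetical.

```python
import json

# Trimmed copy of the example response; a real body would come from the
# Search Documents REST call.
response_body = """
{
  "value": [
    {"@search.score": 7.3617697, "HotelId": "21", "HotelName": "Nova Hotel & Spa"},
    {"@search.score": 2.5560288, "HotelId": "25", "HotelName": "Scottish Inn"}
  ]
}
"""


def scores_by_hotel(body):
    """Map each HotelId to the relevance score attached to that search result."""
    results = json.loads(body)["value"]
    return {hit["HotelId"]: hit["@search.score"] for hit in results}


print(scores_by_hotel(response_body))
```

If the score needs to be stored, it has to be written somewhere by the client after the query, since it only exists per request.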

Azure Maps API - Limit By Type

I have implemented the Azure Maps search at https://learn.microsoft.com/en-gb/rest/api/maps/search/getsearchaddress but I want to get a list of only certain "types".
In the results below, the type is "Street", but I am interested in returning only those where the type matches "MunicipalitySubdivision".
If I do a call to this service, the API returns results in blocks of 10 by default (which can be upped to 200), and gives a TotalResults field as well. It is possible to iterate through (for example) 50,000 results 200 at a time by providing a results offset startIndex parameter in the API, but this doesn't seem like the most efficient way to return just results of one type.
Can anyone suggest anything?
{
"type": "Street",
"id": "GB/STR/p0/1199538",
"score": 5.07232,
"address": {
"streetName": "Hampton Road",
"municipalitySubdivision": "Birmingham, Aston",
"municipality": "Birmingham",
"countrySecondarySubdivision": "West Midlands",
"countrySubdivision": "ENG",
"postalCode": "B6",
"extendedPostalCode": "B6 6AB,B6 6AE,B6 6AN,B6 6AS",
"countryCode": "GB",
"country": "United Kingdom",
"countryCodeISO3": "GBR",
"freeformAddress": "Hampton Road, Birmingham",
"countrySubdivisionName": "England"
},
"position": {
"lat": 52.50665,
"lon": -1.90082
},
"viewport": {
"topLeftPoint": {
"lat": 52.50508,
"lon": -1.90015
},
"btmRightPoint": {
"lat": 52.50804,
"lon": -1.90139
}
}
}
There currently isn't an option to limit the results as you requested other than to scan the results programmatically. If the address information you have is structured (you have the individual pieces) and is not a freeform string, then using the structured geocoding service would allow you to specify the type right in the request when passing in the address parts: https://learn.microsoft.com/en-us/rest/api/maps/search/getsearchaddressstructured
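Scanning the results programmatically could look roughly like the sketch below. It makes no real API call: fetch_page is a hypothetical callable standing in for a request with an offset and a page size, and the sample ids other than the one from the question are placeholders.

```python
def filter_by_type(results, wanted_type):
    """Keep only the results whose "type" field equals wanted_type."""
    return [r for r in results if r.get("type") == wanted_type]


def collect_by_type(fetch_page, wanted_type, page_size=200):
    """Page through all results and keep the ones of the wanted type.

    fetch_page(offset, limit) is a hypothetical callable that returns a list
    of result dicts, and an empty list once the results are exhausted.
    """
    matches, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        matches.extend(filter_by_type(page, wanted_type))
        offset += len(page)
    return matches


# Stand-in data instead of a live service; ids are placeholders.
sample = [
    {"type": "Street", "id": "GB/STR/p0/1199538"},
    {"type": "MunicipalitySubdivision", "id": "example-1"},
    {"type": "Street", "id": "example-2"},
]


def fake_fetch(offset, limit):
    return sample[offset:offset + limit]


print(collect_by_type(fake_fetch, "MunicipalitySubdivision", page_size=2))
```

This still downloads every page, so it does not remove the cost described in the question; it only keeps the filtering logic in one place.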

Azure Search - phonetic search implementation

I was trying out phonetic search using Azure Search without much luck. My objective is to work out an index configuration that can handle typos and accommodate phonetic search for end users.
With the configuration and sample data below, I tried searching for intentionally misspelled words like 'softvare' or 'alek'. I got results for 'alek' thanks to the phonetic analyzer, but didn't get any results for 'softvare'.
It looks like phonetic search will not do the trick for this requirement.
The only option I found was to use a synonym map. The major pitfall is that I'm unable to use the phonetic / custom analyzer along with synonyms :(
What are the various strategies that you would recommend for taking care of typos?
search query used
?api-version=2017-11-11&search=alec
?api-version=2017-11-11&search=softvare
Here is the index configuration
"name": "phonetichotels",
"fields": [
{"name": "hotelId", "type": "Edm.String", "key":true, "searchable": false},
{"name": "baseRate", "type": "Edm.Double"},
{"name": "description", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "analyzer":"my_standard"},
{"name": "hotelName", "type": "Edm.String", "analyzer":"my_standard"},
{"name": "category", "type": "Edm.String", "analyzer":"my_standard"},
{"name": "tags", "type": "Collection(Edm.String)", "analyzer":"my_standard"},
{"name": "parkingIncluded", "type": "Edm.Boolean"},
{"name": "smokingAllowed", "type": "Edm.Boolean"},
{"name": "lastRenovationDate", "type": "Edm.DateTimeOffset"},
{"name": "rating", "type": "Edm.Int32"},
{"name": "location", "type": "Edm.GeographyPoint"}
],
Analyzer (part of the index creation)
"analyzers":[
{
"name":"my_standard",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard_v2",
"tokenFilters":[ "lowercase", "asciifolding", "phonetic" ]
}
]
Analyze API Input and Output for 'software'
{
"analyzer":"my_standard",
"text": "software"
}
{
"@odata.context": "https://ctsazuresearchpoc.search.windows.net/$metadata#Microsoft.Azure.Search.V2017_11_11.AnalyzeResult",
"tokens": [
{
"token": "SFTW",
"startOffset": 0,
"endOffset": 8,
"position": 0
}
]
}
Analyze API Input and Output for 'softvare'
{
"analyzer":"my_standard",
"text": "softvare"
}
{
"@odata.context": "https://ctsazuresearchpoc.search.windows.net/$metadata#Microsoft.Azure.Search.V2017_11_11.AnalyzeResult",
"tokens": [
{
"token": "SFTF",
"startOffset": 0,
"endOffset": 8,
"position": 0
}
]
}
Sample data that I loaded
{
"@search.action": "upload",
"hotelId": "5",
"baseRate": 199.0,
"description": "Best hotel in town for software people",
"hotelName": "Fancy Stay",
"category": "Luxury",
"tags": ["pool", "view", "wifi", "concierge"],
"parkingIncluded": false,
"smokingAllowed": false,
"lastRenovationDate": "2010-06-27T00:00:00Z",
"rating": 5,
"location": { "type": "Point", "coordinates": [-122.131577, 47.678581] }
},
{
"@search.action": "upload",
"hotelId": "6",
"baseRate": 79.99,
"description": "Cheapest hotel in town ",
"hotelName": " Alec Baldwin Motel",
"category": "Budget",
"tags": ["motel", "budget"],
"parkingIncluded": true,
"smokingAllowed": true,
"lastRenovationDate": "1982-04-28T00:00:00Z",
"rating": 1,
"location": { "type": "Point", "coordinates": [-122.131577, 49.678581] }
},
With the right configuration, I should have got results even with the misspelled words.
I work on Azure Search. Before I suggest approaches to handle misspelled words, it would be helpful to look at your custom analyzer (my_standard) configuration. It might tell us why it's not able to handle the case for 'softvare'. As a DIY, you can use the Analyze API to see the tokens created using your custom analyzer and it should contain 'software' to actually match the docs.
Now then, here are a few ways that can be used independently or in conjunction to handle misspelled words. The best approach varies depending on the use-case and I strongly suggest you experiment with these to figure out the best one in your case.
You are already familiar with phonetic filters which is a common approach to handle similarly pronounced terms. If you haven't already, try different encoders for the filter to evaluate which configuration gives you the best results. Check out the list of encoders here.
Use fuzzy queries, supported as part of the Lucene query syntax in Azure Search, which return terms that are near the original query term based on a distance metric. The limitation here is that it works on a single term. Check the docs for more details. A sample query would look like: search=softvare~1. You can also use term boosting to give the original term more weight in cases where the original term is also a valid term.
You also alluded to synonyms, which can also be used to query with misspelled terms. This approach gives you the most control over the process of handling typos but also requires you to have prior knowledge of the different typos for each term. You can use these docs if you want to experiment with synonyms.
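To give a feel for the distance metric behind fuzzy queries, here is a small sketch that computes the edit distance between two terms. It uses plain Levenshtein distance as a simplification (Lucene's fuzzy matching is based on Damerau-Levenshtein, which also counts transpositions), and the function name edit_distance is only illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


# A single substitution (v -> w) separates the typo from the intended term,
# which is why a distance of 1 in softvare~1 is enough here.
print(edit_distance("softvare", "software"))
print(edit_distance("alek", "alec"))
```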
As mentioned in my post, my objective was to handle typos.
The only easy option is to use the built-in Lucene functionality: fuzzy search. I have yet to check the response times, as the querytype has to be set to 'full' to use fuzzy search. Otherwise, the results were satisfactory.
Example:
search=softvare~&fuzzy=true&querytype=full
returns all documents that contain 'software'.
For further reading please go through Documentation

Need help on Azure search with search term having asterisk(*)

We are facing an issue with the Azure Search API when the search term ends with an asterisk (*) and contains special characters.
When we send the JSON body below to our production Azure Search API, we get no results. Note the search term "déménage*" with an asterisk (*) at the end.
https://one-adscope-search-fr-prod.search.windows.net/indexes/one-adscope-advancedsearch-fr/docs/search?api-version=2016-09-01
{
"count": "true",
"facets": null,
"orderby": "firstSeenDate desc,creativeIdNumber asc",
"search": "déménage*",
"searchFields": "keywordSignatureLangSearch,keywordSloganLangSearch,keywordTextLangSearch,keywordScriptLangSearch,keywordIncrustTVLangSearch,keywordVisualKeywordsLangSearch,keywordAgencyLangSearch,keywordMusicTitleLangSearch,keywordMusicPerformerLangSearch,keywordMusicAuthorLangSearch,categoryLevel_1_nameLangSearch,categoryLevel_2_nameLangSearch,categoryLevel_3_nameLangSearch,categoryLevel_4_nameLangSearch,categoryLevel_5_nameLangSearch,productLevel_1_nameLangSearch,productLevel_2_nameLangSearch,productLevel_3_nameLangSearch,productLevel_4_nameLangSearch,productLevel_5_nameLangSearch,campaignNamesLangSearch,themeNamesLangSearch,creativeTitleLangSearch,visualLangSearch,keyword_tagsLangSearch,countryNameLangSearch,directorLangSearch,hashtagsLangSearch,illustratorLangSearch,inlayLangSearch,csmediaNameLangSearch,subMediaNameLangSearch,modifVersionLangSearch,photographerLangSearch,productionLangSearch,taglineLangSearch,partnersLangSearch,creativeLabelLangSearch,propertyNameLangSearch,sponsorshipProgramTitleLangSearch",
"searchMode": "any",
"select": "",
"skip": 0,
"top": 250,
"queryType": "full"
}
But when we send the same JSON with only one change, the search term without the asterisk (*) at the end, i.e. "déménage", we get appropriate results.
Note that all the other fields below, including searchFields, are the same.
{
"count": "true",
"facets": null,
"orderby": "firstSeenDate desc,creativeIdNumber asc",
"search": "déménage",
"searchFields": "keywordSignatureLangSearch,keywordSloganLangSearch,keywordTextLangSearch,keywordScriptLangSearch,keywordIncrustTVLangSearch,keywordVisualKeywordsLangSearch,keywordAgencyLangSearch,keywordMusicTitleLangSearch,keywordMusicPerformerLangSearch,keywordMusicAuthorLangSearch,categoryLevel_1_nameLangSearch,categoryLevel_2_nameLangSearch,categoryLevel_3_nameLangSearch,categoryLevel_4_nameLangSearch,categoryLevel_5_nameLangSearch,productLevel_1_nameLangSearch,productLevel_2_nameLangSearch,productLevel_3_nameLangSearch,productLevel_4_nameLangSearch,productLevel_5_nameLangSearch,campaignNamesLangSearch,themeNamesLangSearch,creativeTitleLangSearch,visualLangSearch,keyword_tagsLangSearch,countryNameLangSearch,directorLangSearch,hashtagsLangSearch,illustratorLangSearch,inlayLangSearch,csmediaNameLangSearch,subMediaNameLangSearch,modifVersionLangSearch,photographerLangSearch,productionLangSearch,taglineLangSearch,partnersLangSearch,creativeLabelLangSearch,propertyNameLangSearch,sponsorshipProgramTitleLangSearch",
"searchMode": "any",
"select": "",
"skip": 0,
"top": 250,
"queryType": "full"
}
Please advise at the earliest.
Thanks,
Bhavik Shah
I suspect the documents returned in the case without the suffix operator '*' match because the diacritics were removed from the search term during lexical analysis, just as they were from the indexed terms. Prefix queries do not go through that analysis, so "déménage*" keeps its accents and finds no matching indexed term. Please see this post for details: Prefix queries (*) in Azure Search don't return expected results
Consider changing your query to search=déménage* OR déménage
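The mismatch can be sketched outside Azure. The example below assumes the field's analyzer strips diacritics at indexing time; real language analyzers may also stem or otherwise change terms, so ascii_fold is only a stand-in for the lexical analysis step, not the analyzer itself.

```python
import unicodedata


def ascii_fold(text):
    """Strip diacritics: decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# What the index is assumed to hold after analysis of "déménage".
indexed_term = ascii_fold("déménage".lower())
print(indexed_term)

# The prefix from "déménage*" is compared as typed, accents included.
prefix = "déménage"
print(indexed_term.startswith(prefix))

# The same prefix after folding does match the indexed term.
print(indexed_term.startswith(ascii_fold(prefix)))
```

The OR query suggested above works on the same principle: the term without the asterisk goes through analysis and therefore lines up with the folded terms in the index.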

latlon format for cloudsearch

I want to do geographic search in CloudSearch. I index documents like this
when uploading them:
[{"type": "add", "id": "kdhrlfh1304532987654321987654321", "fields":{"name": "user1", "latlon":[12.628611, 120.694152] , "phoneverifiedon": "2015-05-04T15:39:03Z", "fbnumfriends": 172}},
{"type": "add", "id": "kdhrlfh1304532987654321987654322", "fields": {"name": "user2", "latlon":[12.628645,20.694178] , "phoneverifiedon": "2015-05-04T15:39:03Z", "fbnumfriends": 172}}]
I got the error below:
Status: error
Adds: 0
Deletes: 0
Errors:
{ ["Field "latlon" must have array type to have multiple values (near operation with index 1; document_id kdhrlfh1304532987654321987654321)","Validation error for field 'latlon': Invalid latlon value 12.628611"] }
I tried multiple formats for the "latlon" field.
Please suggest the correct format for lat/long in CloudSearch.
The correct syntax for document submission is a single string with the two values comma-separated, e.g. "latlon" : "12.628611, 120.694152".
[
{
"type": "add",
"id": "kdhrlfh1304532987654321987654321",
"fields": {
"name": "user1",
"latlon" : "12.628611, 120.694152",
"phoneverifiedon": "2015-05-04T15:39:03Z",
"fbnumfriends": 172
}
}
]
It is definitely confusing that the submission syntax doesn't match the query syntax, which uses an array to represent lat-lon.
https://forums.aws.amazon.com/thread.jspa?threadID=151633
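If the coordinates start out as numbers, a small helper can produce the string form before the batch is serialized. This is a sketch built around the document from the question; latlon_field and add_operation are hypothetical helper names, not part of any CloudSearch SDK.

```python
import json


def latlon_field(lat, lon):
    """Format a coordinate pair as the single "lat, lon" string used on upload."""
    return f"{lat}, {lon}"


def add_operation(doc_id, fields, lat, lon):
    """Build one "add" operation for a document batch, with latlon as a string."""
    return {
        "type": "add",
        "id": doc_id,
        "fields": {**fields, "latlon": latlon_field(lat, lon)},
    }


batch = [
    add_operation(
        "kdhrlfh1304532987654321987654321",
        {"name": "user1", "phoneverifiedon": "2015-05-04T15:39:03Z", "fbnumfriends": 172},
        12.628611,
        120.694152,
    )
]

# json.dumps guarantees valid JSON, which avoids hand-written comma mistakes.
print(json.dumps(batch, indent=2))
```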

Resources