Solr search engine and NLU-based search

I want to build a Solr query based on NLU semantics, which are expressed as entities. How can I benefit from the NLU results when some entities don't exist as fields in the Solr schema? I tried to filter results and got wrong results, because filter queries can only reference fields that exist in the Solr schema.
Example of Solr fields:
{
  "id": "07Ce2HEBB-PtfltbxUlv",
  "nom": ["Microsoft Surface Pro 4 Type Cover with Fingerprint ID"],
  "categorie": ["Accessoires pour ordinateurs"],
  "image": ["No image"],
  "marque": ["Microsoft"],
  "version": 1665642569640443904
}
and this is an example of NLU output:
{
  "intent": {
    "name": "Haut-parleurs",
    "confidence": 0.9998957514762878
  },
  "entities": [
    {
      "entity": "type",
      "start": 13,
      "end": 21,
      "extractor": "DIETClassifier",
      "value": "sans fil"
    },
    {
      "entity": "marque",
      "start": 22,
      "end": 26,
      "extractor": "DIETClassifier",
      "value": "Sony"
    },
    {
      "entity": "model",
      "start": 27,
      "end": 37,
      "extractor": "DIETClassifier",
      "value": "SRSHG1/BLK"
    },
    {
      "entity": "couleur",
      "start": 47,
      "end": 51,
      "extractor": "DIETClassifier",
      "value": "noir",
      "processors": [
        "EntitySynonymMapper"
      ]
    }
  ],
  "intent_ranking": [
    {
      "name": "Haut-parleurs",
      "confidence": 0.9998957514762878
    },
    {
      "name": "greet",
      "confidence": 9.423414303455502e-05
    },
    {
      "name": "Casques Bluetooth",
      "confidence": 9.48187880567275e-06
    },
    {
      "name": "Boitier lecteur multimedia",
      "confidence": 4.859907676291186e-07
    },
    {
      "name": "Bonbons",
      "confidence": 1.837529062242993e-08
    }
  ],
  "response_selector": {
    "default": {
      "response": {
        "name": null,
        "confidence": 0.0
      },
      "ranking": [],
      "full_retrieval_intent": null
    }
  },
  "text": "Haut-parleur sans fil Sony SRSHG1/BLK Hi-Res - Noir anthracite"
}
"text": "Haut-parleur sans fil Sony SRSHG1/BLK Hi-Res - Noir anthracite": represent the client query who search for 'haut-parleur', 'sans fil', 'sony','SRSHG1/BLK', and of color 'noir'.
this is what I am trying:
import glob
import json
import os

import pandas as pd
import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/products')  # my Solr core

folder_path = 'D:/nlu/'
solr_fields = {"marque", "categorie", "nom", "image"}

for filename in glob.glob(os.path.join(folder_path, '*.json')):
    with open(filename, 'r') as f:
        nlu = json.load(f)
    text = nlu['text']
    entities = {e['entity'] for e in nlu['entities']}

    if solr_fields <= entities:
        # every Solr field appears as an entity: filter on each of them
        for e in nlu['entities']:
            if e['entity'] in solr_fields:
                results = solr.search(q="*:*", fq=[e['entity'] + ':' + e['value']])
                print(pd.DataFrame(results.docs))
    elif entities & {"marque", "categorie", "nom"}:
        # only some fields are covered: query each matching field directly
        for e in nlu['entities']:
            if e['entity'] in solr_fields:
                results = solr.search(q=e['entity'] + ':' + e['value'])
            else:
                # entity has no Solr field: fall back to full-text search
                results = solr.search(q='_text_:' + text)
            print(pd.DataFrame(results.docs))
    else:
        # no entity matches a Solr field: full-text search only
        results = solr.search(q='_text_:' + text)
        print(pd.DataFrame(results.docs))
I want to filter the search by fields that are not stored or indexed in the Solr schema, using pysolr.
the output is:
Empty DataFrame
Columns: []
Index: []
id nom \
0 4bCe2HEBB-PtfltbxUlv [Haut-parleur sans fil Sony SRSHG1/BLK Hi-Res ...
1 47Ce2HEBB-PtfltbxUlv [Mini-cassettes vidéo numériques Sony - DVC ...
2 5LCe2HEBB-PtfltbxUlv [Haut-parleur sans fil SRS-ZR7]
3 8LCe2HEBB-PtfltbxUlv [Facade autoradio CD marin Sony MEXM100BT 160W...
4 8bCe2HEBB-PtfltbxUlv [Haut-parleur portable sans fil Sony SRSXB30/B...
5 8rCe2HEBB-PtfltbxUlv [Mini-système LBT-GPX555 de Sony avec Bluet...
6 2rCe2HEBB-Ptfltb3Gnu [Sony Nh-Aa-B4gn Rechargeable Ni-MH Battery]
7 3rCe2HEBB-PtfltbwkXR [Sony HT-GT1 2.1 Home Theatre System]
categorie \
0 [haut-parleurs Bluetooth et sans fil]
1 [Accessoires appareil photo]
2 [haut-parleurs Bluetooth et sans fil]
3 [Accessoires électroniques pour voiture]
4 [haut-parleurs Bluetooth et sans fil]
5 [Accessoires audio et vidéo]
6 [Cameras & Accessories]
7 [Home Entertainment]
image marque \
0 [No image] [Sony]
1 [No image] [Sony]
2 [No image] [Sony]
3 [No image] [Sony]
4 [No image] [Sony]
5 [No image] [Sony]
6 [http://img6a.flixcart.com/image/rechargeable-... [Sony]
7 [http://img6a.flixcart.com/image/home-theatre-... [Sony]
_version_
0 1665642569914122240
1 1665642569924608000
2 1665642569927753728
3 1665642569950822400
4 1665642569950822401
5 1665642569953968128
6 1665642572737937420
7 1665642573811679246
Empty DataFrame
Columns: []
Index: []
Empty DataFrame
Columns: []
Index: []
but this seems wrong, because the client wants exactly "Haut-parleur sans fil Sony SRSHG1/BLK Hi-Res - Noir anthracite", as described in the NLU output.
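One way to use entities that have no matching Solr field is to keep them in the query instead of dropping them: filter on the fields that do exist, and fold the remaining entity values into the catch-all `_text_` search. A minimal sketch; the field set and the `_text_` catch-all field are taken from the schema above, and `build_query` is a hypothetical helper name:

```python
SOLR_FIELDS = {"nom", "categorie", "marque", "image"}

def build_query(nlu):
    """Split NLU entities into Solr filter queries (known fields)
    and free-text terms (entities with no matching field)."""
    filters, free_text = [], []
    for ent in nlu["entities"]:
        if ent["entity"] in SOLR_FIELDS:
            # exact filter on an existing field
            filters.append('%s:"%s"' % (ent["entity"], ent["value"]))
        else:
            # no such field: search the value in the catch-all _text_ field
            free_text.append(ent["value"])
    q = '_text_:(%s)' % " ".join(free_text) if free_text else "*:*"
    return q, filters

nlu = {"entities": [
    {"entity": "marque", "value": "Sony"},
    {"entity": "couleur", "value": "noir"},
]}
q, fq = build_query(nlu)
# q  -> '_text_:(noir)'
# fq -> ['marque:"Sony"']
# then: results = solr.search(q=q, fq=fq)  # with a pysolr.Solr instance
```

This issues a single query, so documents must satisfy the field filters and are ranked by how well they match the remaining terms, instead of one query per entity.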

Related

How to get catalog search results by keywords via the Amazon SP API that match Amazon website search results?

First method "listCatalogItems" produces correct results but limits max 10 ASINs. And now this method is deprecated.
Other method "searchCatalogItems" produces INcorrect random results.
fyi listCatalogItems says it's deprecated but it still works
I am getting correct results when I use searchCatalogItems. Here is my Postman call: https://sellingpartnerapi-na.amazon.com/catalog/2022-04-01/items?marketplaceIds=ATVPDKIKX0DER&keywords=samsung
and part of my results:
{
  "numberOfResults": 54592886,
  "pagination": {
    "nextToken": "9HkIVcuuPmX_bm51o3-igBfN45pxW4Ru7ElIM6GCECYCuXJKzT26f-3Tfs1Ro3IhelNA74VxDMJwt_JvE7qiRh0loZTzTpEBWUbZ8HB0T4ttV8cFw4xYQ4RMUzdY_udbnvAHOHCcZcycn0nW8RotZh1l1vj7KQoFIa7pWiOPHyaYWP7sBE9Fg7cGN2wE0an5ePw96h6ZL7m6olRxFOcqTWNanEVRjipq"
  },
  ...
  "items": [
    {
      "asin": "B09YN4W5C1",
      "summaries": [
        {
          "marketplaceId": "ATVPDKIKX0DER",
          "adultProduct": false,
          "autographed": false,
          "brand": "SAMSUNG",
          "itemClassification": "VARIATION_PARENT",
          "itemName": "SAMSUNG Jet Bot Robot Vacuum Cleaner",
          "manufacturer": "SAMSUNG",
          "memorabilia": false,
          "packageQuantity": 1,
          "tradeInEligible": false,
          "websiteDisplayGroup": "home_display_on_website",
          "websiteDisplayGroupName": "Home"
        }
      ]
    },
    {
      "asin": "B01AQ6OWAG",
      "summaries": [
        {
          "marketplaceId": "ATVPDKIKX0DER",
          "adultProduct": false,
          "autographed": false,
          "brand": "SAMSUNG",
          "browseClassification": {
            "displayName": "Remote Controls",
            "classificationId": "10967581"
          },
          "itemClassification": "BASE_PRODUCT",
          "itemName": "SAMSUNG TV Remote Control BN59-01199F by Samsung",
          "manufacturer": "Samsung",
          "memorabilia": false,
          "modelNumber": "BN59-01199F",
          "packageQuantity": 1,
          "partNumber": "BN59-01199F",
          "tradeInEligible": false,
          "websiteDisplayGroup": "ce_display_on_website",
          "websiteDisplayGroupName": "CE"
        }
      ]
    },
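The call above can be reproduced from code by building the same query string; a sketch that only constructs the request URL (the `x-amz-access-token` header mentioned in the comment is a placeholder, since real SP-API requests must be authorized):

```python
from urllib.parse import urlencode

# Endpoint from the Postman call above.
BASE = "https://sellingpartnerapi-na.amazon.com/catalog/2022-04-01/items"

def search_catalog_url(marketplace_id, keywords):
    """Build the searchCatalogItems request URL for one marketplace."""
    params = {
        "marketplaceIds": marketplace_id,
        "keywords": keywords,
    }
    return BASE + "?" + urlencode(params)

url = search_catalog_url("ATVPDKIKX0DER", "samsung")
# then, with a valid LWA token:
# requests.get(url, headers={"x-amz-access-token": "<token>"})
```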

AzureSearch edgeNGram search matching too many documents

I'm trying to implement a prefix search using a field analyzed with an edge n-gram analyzer.
However, whenever I search, it returns similar matches that do not contain the searched term.
The following query
POST /indexes/resources/docs/search?api-version=2020-06-30
{
  "queryType": "full",
  "searchMode": "all",
  "search": "short_text_prefix:7024032"
}
Returns
{
  "@odata.context": ".../indexes('resources')/$metadata#docs(*)",
  "@search.nextPageParameters": {
    "queryType": "full",
    "searchMode": "all",
    "search": "short_text_prefix:7024032",
    "skip": 50
  },
  "value": [
    {
      "@search.score": 4.669537,
      "short_text_prefix": "7024032 "
    },
    {
      "@search.score": 4.6333756,
      "short_text_prefix": "7024030 "
    },
    {
      "@search.score": 4.6333756,
      "short_text_prefix": "7024034 "
    },
    {
      "@search.score": 4.6333756,
      "short_text_prefix": "7024031 "
    },
    {
      "@search.score": 4.6319494,
      "short_text_prefix": "7024033 "
    },
    ... omitted for brevity ...
  ],
  "@odata.nextLink": ".../indexes('resources')/docs/search.post.search?api-version=2020-06-30"
}
This includes a bunch of documents that only almost match my term, with the "correct" document having the highest score on top.
The custom analyzer tokenizes "7024032 " like this:
"#odata.context": "/$metadata#Microsoft.Azure.Search.V2020_06_30.AnalyzeResult",
"tokens": [
{
"token": "7",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "70",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "702",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "7024",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "70240",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "702403",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "7024032",
"startOffset": 0,
"endOffset": 7,
"position": 0
}
]
}
How do I exclude the documents which did not match the term exactly?
N-grams are not the right approach in this case, as the prefix "702403" appears in all of those documents. You can only use them if you set the minimum gram length to the length of the term you're searching for.
Here's an example:
token length: 3
sample content:
1. 234
2. 1234
3. 2345
4. 3456
5. 001234
6. 99234345
Searching for "234" would return items 1 (234), 2 (1234), 3 (2345), 5 (001234) and 6 (99234345).
Another option, if you're 100% sure the content is stored the way you presented it, is to use a regular expression to retrieve exactly what you want:
/.*7024032\s+/
I figured out the problem:
I had created the field with the "analyzer" property referring to my custom analyzer ("edge_nGram_analyzer"). Setting this property means the string is tokenized both at indexing time and at search time. So searching for "7024032" meant I was searching for all of its tokens, split according to the edge n-gram analyzer: "7", "70", "702", "7024", "70240", "702403", "7024032".
The indexAnalyzer and searchAnalyzer properties can be used instead, to handle index-time tokenization separately from search-time tokenization. When I used them separately:
{ "indexAnalyzer": "edge_nGram_analyzer", "searchAnalyzer": "whitespace" }
everything worked as expected.
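In the index definition (REST shape), the difference is just which analyzer properties the field sets. A sketch of the two field variants as Python dicts, using the field and analyzer names from this question; only the analyzer-related properties matter here:

```python
# Broken variant: one analyzer for BOTH indexing and searching,
# so the query itself gets split into edge n-grams too.
field_broken = {
    "name": "short_text_prefix",
    "type": "Edm.String",
    "searchable": True,
    "analyzer": "edge_nGram_analyzer",
}

# Fixed variant: n-gram only at index time; the query stays a
# single whitespace-delimited token, so only true prefixes match.
field_fixed = {
    "name": "short_text_prefix",
    "type": "Edm.String",
    "searchable": True,
    "indexAnalyzer": "edge_nGram_analyzer",
    "searchAnalyzer": "whitespace",
}
```

Either dict would go into the "fields" array of the Create/Update Index request body.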

How to get only one item from each category in azure cognitive search?

In the hotel example given for Azure Cognitive Search, I need to get only one hotel from each category. What filter parameters do I need to use, and how?
I want a result like the one below, where the categories are Budget, Resort and Spa, Luxury, Boutique, Suite and Extended-Stay:
[
  {
    "@search.score": 1,
    "HotelId": "24",
    "HotelName": "Gacc Capital",
    **"Category": "Budget",**
    "Rating": 3.5
  },
  {
    "@search.score": 1,
    "HotelId": "22",
    "HotelName": "Stone Lion Inn",
    **"Category": "Luxury",**
    "Rating": 3.9
  },
  {
    "@search.score": 1,
    "HotelId": "11",
    "HotelName": "Regal Orb Resort & Spa",
    **"Category": "Extended-Stay",**
    "Rating": 2.5
  },
  {
    "@search.score": 1,
    "HotelId": "13",
    "HotelName": "Historic Lion Resort",
    **"Category": "Boutique",**
    "Rating": 4.1
  },
  {
    "@search.score": 1,
    "HotelId": "29",
    "HotelName": "Thompson House",
    **"Category": "Resort and Spa",**
    "Rating": 2.6
  },
  {
    "@search.score": 1,
    "HotelId": "36",
    "HotelName": "Pelham Hotel",
    **"Category": "Suite",**
    "Rating": 3.5
  }
]
What you are asking for is known as aggregation or collapsing. At this point, Azure Cognitive Search does not support it. You can vote for this functionality to be added to the product here:
https://feedback.azure.com/forums/263029-azure-search/suggestions/8382225-add-aggregations-functionality
https://feedback.azure.com/forums/263029-azure-search/suggestions/9484995-add-support-for-field-collapsing
The only possible workaround is to submit multiple queries: first a query requesting the facet/refiner for Category, then one query per Category entry to retrieve its top result.
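The two-step workaround can be sketched as below; `search` stands in for whatever client you use (REST or SDK), so the exact call shape is an assumption, but the facet-then-filter flow is the one described above:

```python
def one_per_category(search):
    """search(params: dict) -> response dict, shaped like the REST API.

    Step 1: request the Category facet to learn the distinct categories.
    Step 2: for each category, fetch its top-1 document with a $filter.
    """
    facet_resp = search({"search": "*", "facets": ["Category"], "top": 0})
    categories = [f["value"] for f in facet_resp["@search.facets"]["Category"]]

    hotels = []
    for cat in categories:
        # OData string literals use single quotes
        resp = search({
            "search": "*",
            "filter": "Category eq '%s'" % cat,
            "top": 1,
        })
        hotels.extend(resp["value"])
    return hotels
```

This costs one request per category plus the facet request, which is the price of the missing collapse feature.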

Azure Form Recognizer Not Behaving As Expected

I am having an issue with Form Recognizer not behaving as I have seen it should. Here is the dilemma:
I have an invoice that, when run through https://{endpoint}/formrecognizer/v2.0/layout/analyze,
has its table recognized, and the proper JSON with the "tables" node is generated. Here is part of an example:
{
  "rows": 8,
  "columns": 8,
  "cells": [
    {
      "rowIndex": 0,
      "columnIndex": 4,
      "columnSpan": 3,
      "text": "% 123 F STREET Deer Park TX 71536",
      "boundingBox": [
        3.11,
        2.0733
      ],
      "elements": [
        "#/readResults/0/lines/20/words/0",
        "#/readResults/0/lines/20/words/1"
      ]
    }
When I train a model with NO labels file via https://{endpoint}/formrecognizer/v2.0/custom/models, it does not generate a "tables" node; it generates __Tokens__ keys instead. Here is the example above without "tables":
{
  "key": {
    "text": "__Tokens__12",
    "boundingBox": null,
    "elements": null
  },
  "value": {
    "text": "123 F STREET",
    "boundingBox": [
      5.3778,
      2.0625,
      6.8056,
      2.0625,
      6.8056,
      2.2014,
      5.3778,
      2.2014
    ],
    "elements": null
  },
  "confidence": 1.0
}
I am not sure exactly where this is not behaving how intended, but any insight would be appreciated!
If you train a model WITH labeling files and then call Analyze(), the Form Recognizer service will call the Layout service, which returns tables in the "pageResults" section.

How to calculate Heating/Cooling Degree Day using some api in python

I am trying to calculate heating/cooling degree days using the formula (Tbase - Ta), where Tbase is usually 65 °F and Ta = (high_temp + low_temp) / 2.
For example:
high_temp = 96.5 °F, low_temp = 65.21 °F, then
mean = (high_temp + low_temp) / 2
result = mean - 65
where 65 °F is the base (average room) temperature.
If the mean is above the base, the difference counts as cooling degree days (CDD); otherwise it counts as heating degree days (HDD).
I get weather data from two APIs:
weatherbit
darksky
Weatherbit provides both CDD and HDD directly, but with Dark Sky I need to calculate them myself using the formula above (Tbase - Ta).
My problem is that the two APIs show different results. For example, the Dark Sky JSON response for one day:
{
  "latitude": 47.552758,
  "longitude": -122.150589,
  "timezone": "America/Los_Angeles",
  "daily": {
    "data": [
      {
        "time": 1560927600,
        "summary": "Light rain in the morning and overnight.",
        "icon": "rain",
        "sunriseTime": 1560946325,
        "sunsetTime": 1561003835,
        "moonPhase": 0.59,
        "precipIntensity": 0.0057,
        "precipIntensityMax": 0.0506,
        "precipIntensityMaxTime": 1561010400,
        "precipProbability": 0.62,
        "precipType": "rain",
        "temperatureHigh": 62.44,
        "temperatureHighTime": 1560981600,
        "temperatureLow": 48,
        "temperatureLowTime": 1561028400,
        "apparentTemperatureHigh": 62.44,
        "apparentTemperatureHighTime": 1560981600,
        "apparentTemperatureLow": 46.48,
        "apparentTemperatureLowTime": 1561028400,
        "dewPoint": 46.61,
        "humidity": 0.75,
        "pressure": 1021.81,
        "windSpeed": 5.05,
        "windGust": 8.36,
        "windGustTime": 1560988800,
        "windBearing": 149,
        "cloudCover": 0.95,
        "uvIndex": 4,
        "uvIndexTime": 1560978000,
        "visibility": 4.147,
        "ozone": 380.8,
        "temperatureMin": 49.42,
        "temperatureMinTime": 1561010400,
        "temperatureMax": 62.44,
        "temperatureMaxTime": 1560981600,
        "apparentTemperatureMin": 47.5,
        "apparentTemperatureMinTime": 1561014000,
        "apparentTemperatureMax": 62.44,
        "apparentTemperatureMaxTime": 1560981600
      }
    ]
  },
  "offset": -7
}
Python calculation:
response = result.get("daily").get("data")[0]
low_temp = response.get("temperatureMin")
hi_temp = response.get("temperatureMax")
mean = (hi_temp + low_temp)/2
#65 is normal room temp
print(65-mean)
Here the mean is 58.49, so 65 - mean prints 6.509999999999998:
HDD is 6.51 and CDD is 0.
The same date in the Weatherbit JSON response:
{
  "threshold_units": "F",
  "timezone": "America/Los_Angeles",
  "threshold_value": 65,
  "state_code": "WA",
  "country_code": "US",
  "city_name": "Newcastle",
  "data": [
    {
      "rh": 68,
      "wind_spd": 5.6,
      "timestamp_utc": null,
      "t_ghi": 8568.9,
      "max_wind_spd": 11.4,
      "cdd": 0.4,
      "dewpt": 46.9,
      "snow": 0,
      "hdd": 6.7,
      "timestamp_local": null,
      "precip": 0.154,
      "t_dni": 11290.6,
      "temp_wetbulb": 53.1,
      "t_dhi": 1413.9,
      "date": "2019-06-20",
      "temp": 58.6,
      "sun_hours": 7.6,
      "clouds": 58,
      "wind_dir": 186
    }
  ],
  "end_date": "2019-06-21",
  "station_id": "727934-94248",
  "count": 1,
  "start_date": "2019-06-20",
  "city_id": 5804676
}
Here HDD is 6.7 and CDD is 0.4.
Can you explain how they get this result?
You need to use hourly data to calculate the HDD and CDD, and then average them to get the daily value.
More details here: https://www.weatherbit.io/blog/post/heating-and-cooling-degree-days-weather-api-release
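Computing from hourly data, as the answer describes, can be sketched like this: each hour contributes 1/24 of a degree day, heating when the temperature is below the 65 °F base and cooling when it is above. The hourly temperature list here is made up for illustration:

```python
BASE_F = 65.0

def degree_days(hourly_temps_f, base=BASE_F):
    """Daily HDD/CDD from 24 hourly temperatures (°F)."""
    hdd = sum(max(base - t, 0.0) for t in hourly_temps_f) / 24.0
    cdd = sum(max(t - base, 0.0) for t in hourly_temps_f) / 24.0
    return hdd, cdd

# a day that dips below 65 °F at night and rises above it at midday
temps = [55, 54, 53, 52, 52, 53, 55, 58, 61, 64, 67, 70,
         72, 73, 73, 72, 70, 67, 64, 61, 59, 58, 57, 56]
hdd, cdd = degree_days(temps)
# hdd ≈ 5.33, cdd ≈ 1.83
```

Note that, unlike the daily (Tbase - Ta) formula, this can yield non-zero HDD and CDD on the same day, exactly as in the Weatherbit response above (hdd 6.7, cdd 0.4).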
