I'm trying to implement a prefix search using a field analyzed with an edge ngram analyzer.
However, whenever I do a search, it returns similar matches that do not contain the searched term.
The following query
POST /indexes/resources/docs/search?api-version=2020-06-30
{
"queryType": "full",
"searchMode": "all",
"search": "short_text_prefix:7024032"
}
Returns
{
"@odata.context": ".../indexes('resources')/$metadata#docs(*)",
"@search.nextPageParameters": {
"queryType": "full",
"searchMode": "all",
"search": "short_text_prefix:7024032",
"skip": 50
},
"value": [
{
"@search.score": 4.669537,
"short_text_prefix": "7024032 "
},
{
"@search.score": 4.6333756,
"short_text_prefix": "7024030 "
},
{
"@search.score": 4.6333756,
"short_text_prefix": "7024034 "
},
{
"@search.score": 4.6333756,
"short_text_prefix": "7024031 "
},
{
"@search.score": 4.6319494,
"short_text_prefix": "7024033 "
},
... omitted for brevity ...
],
"@odata.nextLink": ".../indexes('resources')/docs/search.post.search?api-version=2020-06-30"
}
This includes a bunch of documents which almost match my term, with the "correct" document (the one with the highest score) on top.
The custom analyzer tokenizes "7024032 " like this:
{
"@odata.context": "/$metadata#Microsoft.Azure.Search.V2020_06_30.AnalyzeResult",
"tokens": [
{
"token": "7",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "70",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "702",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "7024",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "70240",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "702403",
"startOffset": 0,
"endOffset": 7,
"position": 0
},
{
"token": "7024032",
"startOffset": 0,
"endOffset": 7,
"position": 0
}
]
}
How do I exclude the documents which did not match the term exactly?
N-gram is not the right approach in this case, as the prefix '702403' appears in all of those documents. You can only use it if you specify the minimum gram length to be the length of the term you're searching for.
Here's an example:
token length: 3
sample content:
234
1234
2345
3456
001234
99234345
searching for '234'
would return items 1 (234), 2 (1234), 3 (2345), 5 (001234) and 6 (99234345)
Another option, if you're 100% sure the content is stored in the way you presented, is to use a regular expression to retrieve it the way you want:
/.*7024032\s+/
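To make the minimum-gram-length point concrete, here is a small plain-Python sketch (this is not Azure Search itself; the edge n-gram generation is deliberately simplified):

```python
# Sketch: how min/max gram length affects which documents an
# n-gram term query can match.

def edge_ngrams(text, min_gram, max_gram):
    """Generate the edge n-grams of text between min_gram and max_gram chars."""
    return {text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)}

docs = ["7024032", "7024030", "7024034", "7024031", "7024033"]
term = "7024032"

# With a short min_gram, every document produces the gram "702403"
# (plus shorter prefixes), so all of them match a tokenized query.
short_grams = {d: edge_ngrams(d, 1, 7) for d in docs}
matches_short = [d for d in docs if "702403" in short_grams[d]]  # all five

# With min_gram equal to the search term's length, only the exact
# document produces a gram equal to the term.
long_grams = {d: edge_ngrams(d, 7, 7) for d in docs}
matches_long = [d for d in docs if term in long_grams[d]]  # only "7024032"
```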
I figured out the problem:
I had created the field with the "analyzer" property referring to my custom analyzer ("edge_nGram_analyzer"). Setting this property means strings are tokenized both at indexing time and at search time. So searching for "7024032" meant I was searching for all of its tokens, split according to the edge n-gram analyzer: "7", "70", "702", "7024", "70240", "702403", "7024032"
The indexAnalyzer and searchAnalyzer properties can be used instead, to handle index-time tokenization separately from search-time tokenization. When I used them separately:
{ "indexAnalyzer": "edge_nGram_analyzer", "searchAnalyzer": "whitespace" }
everything worked as expected.
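For reference, the working field definition ends up looking roughly like this (a sketch; only the relevant properties are shown):

```json
{
  "name": "short_text_prefix",
  "type": "Edm.String",
  "searchable": true,
  "indexAnalyzer": "edge_nGram_analyzer",
  "searchAnalyzer": "whitespace"
}
```

The index-time analyzer still produces all the prefix grams, but the query "7024032" is now tokenized only on whitespace, so it must match the full-length gram.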
Related
I'm new to JSONPath and want to write a JSONPath-syntax that retrieves the property value only if a certain condition is met. The value I'm after is not part of an array, but I've managed to make filtering work in the following JSONPath tool: https://www.site24x7.com/tools/json-path-evaluator.html
Given the following JSON, I only want to extract the value of column2.dimValue if column2.attributeId equals B0:
{
"batchId": 279,
"companyId": "40",
"period": 202208,
"taxCode": "1",
"taxSystem": "",
"transactionDate": "2022-08-05T00:00:00.000",
"transactionNumber": 222006089,
"transactionType": "IF",
"year": 2022,
"accountingInformation": {
"account": "4010",
"column1": {
"attributeId": "H9",
"dimValue": "76"
},
"column2": {
"attributeId": "B0",
"dimValue": "2170103"
},
"column3": {
"attributeId": "",
"dimValue": ""
},
"column4": {
"attributeId": "BF",
"dimValue": "217010330"
},
"column5": {
"attributeId": "10",
"dimValue": "3101"
},
"column6": {
"attributeId": "06",
"dimValue": ""
},
"column7": {
"attributeId": "19",
"dimValue": "K"
}
},
"categories": {
"cat1": "H9",
"cat2": "B0",
"cat3": "",
"cat4": "BF",
"cat5": "10",
"cat6": "06",
"cat7": "19",
"dim1": "76",
"dim2": "2170103",
"dim3": "",
"dim4": "217010330",
"dim5": "3101",
"dim6": "",
"dim7": "K"
},
"amounts": {
"amount": 48.24,
"amount3": 0.0,
"amount4": 0.0,
"currencyAmount": 48.24,
"currencyCode": "NOK",
"debitCreditFlag": 1
},
"invoice": {
"customerOrSupplierId": "58118",
"description": "",
"externalArchiveReference": "",
"externalReference": "2170103",
"invoiceNumber": "220238522",
"ledgerType": "P"
},
"additionalInformation": {
"number": 0,
"orderLineNumber": 0,
"orderNumber": 0,
"sequenceNumber": 1,
"status": "",
"value": 0.0,
"valueDate": "2022-08-05T00:00:00.000"
},
"lastUpdated": {
"updatedAt": "2022-09-05T10:59:11.633",
"updatedBy": "HELVES"
}
}
I've used this JSONPath-syntax:
$['accountingInformation']['column2'][?(@.attributeId=='B0')].dimValue
This gives the following result:
[
"2170103"
]
I'm using this result in an Azure Data Factory mapping, and it seems it doesn't work because the result is an array.
Can anyone help me with the syntax so that it only returns the actual value? Is that even possible?
I repro'd the same, and below is the approach.
A sample JSON file is taken as the source in a Lookup activity, as in the image below.
An If activity is used to filter on column2 having attributeId='B0'. The expression is given as below:
@equals(activity('Lookup1').output.value[0].accountingInformation.column2.attributeId, 'B0')
In the True case of the If activity, a Set Variable activity is added. A new variable of string type is created and set using the expression below:
@activity('Lookup1').output.value[0].accountingInformation.column2.dimValue
Then a Copy activity is added after the If activity. A dummy dataset is taken as the source, and +New is clicked in Additional columns:
Name: col1
Value: @variables('v2')
In Mapping, Import schemas is clicked, and all columns except the additional column added in the source are deleted.
The pipeline is debugged and the data is copied to the sink without error.
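If you have the option of post-processing in code rather than inside the pipeline, the same conditional extraction returns a plain scalar in a few lines of Python (a sketch, run against a trimmed copy of the sample document):

```python
import json

# Sketch: the same conditional extraction in plain Python, returning a
# scalar instead of the single-element array JSONPath produces.
# (Trimmed to the relevant part of the sample document.)
doc_text = """
{
  "accountingInformation": {
    "column2": { "attributeId": "B0", "dimValue": "2170103" }
  }
}
"""

doc = json.loads(doc_text)
col2 = doc["accountingInformation"]["column2"]
dim_value = col2["dimValue"] if col2["attributeId"] == "B0" else None
print(dim_value)  # -> 2170103
```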
I have a fairly basic Azure Search index with several fields of searchable string data, for example [abridged]...
"fields": [
{
"name": "Field1",
"type": "Edm.String",
"facetable": false,
"filterable": true,
"key": true,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "Field2",
"type": "Edm.String",
"facetable": false,
"filterable": true,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "en.microsoft",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
]
Field1 is loaded with alphanumeric id data and Field2 is loaded with English language string data, specifically the name/title of the record. searchMode=all is also being used to ensure the accuracy of the results.
Let's say one of the records indexed has the following Field2 data: BA (Hons) in Business, Organisational Behaviour and Coaching. Putting that into the en.microsoft analyzer, this is the result we get out:
"tokens": [
{
"token": "ba",
"startOffset": 0,
"endOffset": 2,
"position": 0
},
{
"token": "hon",
"startOffset": 4,
"endOffset": 8,
"position": 1
},
{
"token": "hons",
"startOffset": 4,
"endOffset": 8,
"position": 1
},
{
"token": "business",
"startOffset": 13,
"endOffset": 21,
"position": 3
},
{
"token": "organizational",
"startOffset": 23,
"endOffset": 37,
"position": 4
},
{
"token": "organisational",
"startOffset": 23,
"endOffset": 37,
"position": 4
},
{
"token": "behavior",
"startOffset": 38,
"endOffset": 47,
"position": 5
},
{
"token": "behaviour",
"startOffset": 38,
"endOffset": 47,
"position": 5
},
{
"token": "coach",
"startOffset": 52,
"endOffset": 60,
"position": 7
},
{
"token": "coaching",
"startOffset": 52,
"endOffset": 60,
"position": 7
}
]
As you can see, the tokens returned are what you'd expect for such a string. However, when it comes to using that same indexed string value as a search term (sadly a valid use case in this instance), the results returned are not as expected unless you explicitly use searchFields=Field2.
Query 1 (Returns 0 results):
?searchMode=all&search=BA%20(Hons)%20in%20Business%2C%20Organisational%20Behaviour%20and%20Coaching
Query 2 (Returns 0 results):
?searchMode=all&searchFields=Field1,Field2&search=BA%20(Hons)%20in%20Business%2C%20Organisational%20Behaviour%20and%20Coaching
Query 3 (Returns 1 result as expected):
?searchMode=all&searchFields=Field2&search=BA%20(Hons)%20in%20Business%2C%20Organisational%20Behaviour%20and%20Coaching
So why does this only return the expected result with searchFields=Field2 and not with no searchFields defined or searchFields=Field1,Field2? I would not expect a no match on Field1 to exclude a result that's clearly matching on Field2?
Furthermore, removing the "in" and "and" within the search term seems to correct the issue and return the expected result. For example:
Query 4 (Returns 1 result as expected):
?searchMode=all&search=BA%20(Hons)%20Business%2C%20Organisational%20Behaviour%20Coaching
(This is almost like one analyzer is tokenizing the indexed data and a completely different analyzer is tokenizing the search term, although that theory doesn't make any sense when taking into consideration Query 3, as that provides a positive match using the exact same indexed data/search term.)
Is anybody able to shed some light as to what's going on here as I'm completely out of ideas and I can't find anything more in the documentation?
NB. Please bear in mind that I'm looking to understand why Azure Search is behaving in this way and not necessarily wanting a workaround.
The reason you don't get any hits is due to how stopwords are handled when you use searchMode=all. The standard analyzer does not remove stopwords, but the Lucene and Microsoft analyzers for English do. I verified this by creating an index with your property definitions and sample data. If you use the standard analyzer, stopwords are not removed and you also get a match when using searchMode=all. To get a match when using either the Lucene or Microsoft analyzers with simple query mode, you would have to use a phrase search.
When you test the en.microsoft analyzer in your example, you only get the response from what the first stage of the analyzer does. It splits your query into tokens. In your case, two of the tokens are also stopwords in English (in, and). Stopword removal is part of lexical analysis, which is done later in stage 2 as explained in the article called Anatomy of a search request. Furthermore, lexical analysis is only applied to "query types that require complete terms", like searchMode=all. See Exceptions to lexical analysis for more examples.
There is a previous post here about this that explains in more detail. See Queries with stopwords and searchMode=all return no results
I know you did not ask for workarounds, but to better understand what goes on it could be useful to list some possible workarounds.
For English analyzers, use phrase search by wrapping the query in quotes: search="BA (Hons) in Business, Organisational Behaviour and Coaching"&searchMode=all
The standard analyzer works the way you expect: search=BA (Hons) in Business, Organisational Behaviour and Coaching&searchMode=all
Disable lexical analysis by defining a custom analyzer.
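As a toy model of the failure mode (plain Python, not Azure Search internals; the analyzers and the stopword list are drastically simplified):

```python
# Toy model: under searchMode=all, every term the per-field analyzer
# produces must match somewhere. Field1 uses a standard-like analyzer
# (keeps stopwords); Field2 uses an en.microsoft-like one (removes them).
STOPWORDS = {"in", "and"}

def std_analyze(text):
    """Standard-analyzer-like: lowercase, strip punctuation, keep stopwords."""
    return [t.strip("(),").lower() for t in text.split() if t.strip("(),")]

def en_analyze(text):
    """en.microsoft-like: same, but stopwords removed."""
    return [t for t in std_analyze(text) if t not in STOPWORDS]

field1_index = set(std_analyze("ABC123"))  # id-style data
field2_index = set(en_analyze(
    "BA (Hons) in Business, Organisational Behaviour and Coaching"))

query = "BA (Hons) in Business, Organisational Behaviour and Coaching"

# Searching Field1 and Field2: the query analyzed against Field1 keeps
# "in"/"and", which match nowhere, so the whole query fails.
all_terms = set(std_analyze(query)) | set(en_analyze(query))
matches_both_fields = all(t in field1_index | field2_index for t in all_terms)

# Searching only Field2: the stopwords are removed from the query too,
# so every remaining term matches.
field2_only = all(t in field2_index for t in en_analyze(query))
```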
I have a huge text file with blockchain data that I'd like to parse so that I can get the info from the fields I need. I have tried to convert it to JSON, but it says it's invalid. After giving it some thought, I've realised that's not the best way anyway, since I only want 2 or 3 fields. Can someone help me find the best way of extracting data from the file? There's an example below. I would only want txid, size, and hash.
{
"txid": "254d5cc8d2b1889a2cb45f7e3dca8ed53a3fcfa32e8b9eac5f68c4f09e7af7bd",
"hash": "a8e125eb6d7ab883177d8ab228a3d09c1733d1ca49b7b2dff4b057eeb80ff9be",
"version": 2,
"size": 171,
"vsize": 144,
"weight": 576,
"locktime": 0,
"vin": [
{
"coinbase": "02ee170101",
"sequence": 4294967295
}
],
"vout": [
{
"value": 12.00000000,
"n": 0,
"scriptPubKey": {
"asm": "OP_HASH160 cd5b833dd43bc60b8c28c4065af670f283a203ff OP_EQUAL",
"hex": "a914cd5b833dd43bc60b8c28c4065af670f283a203ff87",
"reqSigs": 1,
"type": "scripthash",
"addresses": [
"2NBy4928yJakYBFQuXxXBwXjsLCRWgzyiGm"
]
}
},
{
"value": 5.00000000,
"n": 1,
"scriptPubKey": {
"asm": "OP_HASH160 cd5b833dd43bc60b8c28c4065af670f283a203ff OP_EQUAL",
"hex": "a914cd5b833dd43bc60b8c28c4065af670f283a203ff87",
"reqSigs": 1,
"type": "scripthash",
"addresses": [
"2NBy4928yJakYBFQuXxXBwXjsLCRWgzyiGm"
]
}
}
],
"hex":
"020000000001010000000000000000000000000000000000000000000000000000000000000000
ffffffff0502ee170101ffffffff02000000000000000017a914cd5b833dd43bc60b8c28c4065af670f283a
203ff870000000000000000266a24aa21a9ede2f61c3f71d1defd3fa999dfa36953755c69068979996
2b48bebd836974e8cf9012000000000000000000000000000000000000000000000000000000000
0000000000000000",
"blockhash": "0f84abb78891a4b9e8bc9637ec5fb8b4962c7fe46092fae99e9d69373bf7812a",
"confirmations": 1,
"time": 1590830080,
"blocktime": 1590830080
}
Thank you
@andrewjames is correct. If you have no control over the JSON file, you can address the error by just removing the newline characters:
import json
parsed = json.loads(jsonText.replace("\n", ""))
Then you can access the fields you want like a normal dictionary:
print(parsed['txid'])
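Putting it together, a minimal sketch (the sample text below is a trimmed version of the document above, with a newline deliberately left inside the "hex" string, which is what makes the original file invalid JSON):

```python
import json

# The newline inside the "hex" string value is an illegal control
# character in JSON; stripping newlines first lets json.loads succeed.
jsonText = '''{
"txid": "254d5cc8d2b1889a2cb45f7e3dca8ed53a3fcfa32e8b9eac5f68c4f09e7af7bd",
"hash": "a8e125eb6d7ab883177d8ab228a3d09c1733d1ca49b7b2dff4b057eeb80ff9be",
"size": 171,
"hex": "020000000001010000000000000000
ffffffff0502ee170101ffffffff"
}'''

parsed = json.loads(jsonText.replace("\n", ""))

# Keep only the fields of interest.
record = {key: parsed[key] for key in ("txid", "size", "hash")}
print(record)
```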
In the elasticsearch module I have built, is it possible to return the "input search term" in the search results ?
For example :
GET /signals/_search
{
"query": {
"match": {
"focused_content": "stock"
}
}
}
This returns
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.057534903,
"hits": [
{
"_index": "signals",
"_type": "signal",
"_id": "13",
"_score": 0.057534903,
"_source": {
"username": "abc#abc.com",
"tags": [
"News"
],
"content_url": "http://www.wallstreetscope.com/morning-stock-highlights-western-digital-corporation-wdc-fibria-celulose-sa-fbr-ametek-inc-ame-cott-corporation-cot-graftech-international-ltd-gti/25375462/",
"source": null,
"focused_content": "Morning Stock Highlights: Western Digital Corporation (WDC), Fibria Celulose SA (FBR), Ametek Inc. (AME), Cott Corporation (COT), GrafTech International Ltd. (GTI) - WallStreet Scope",
"time_stamp": "2015-08-12"
}
}
]
}
Is it possible to have the input search term "stock" along with each of the results (like an additional JSON Key along with "content_url","source","focused_content","time_stamp") to identify which search term had brought that result ?
Thanks in Advance !
All I can think of would be using the highlighting feature. It would bring back an additional key, _highlight, and it would highlight things that matched.
It won't bring back the exact matching terms, though. You'd have to deal with them in your application. You could use the pre/post tags functionality to wrap them up in some special way, so your app could recognize that it was a match.
You can use highlights on all fields, like @Evaldas suggested. This will return the result along with the value in the field which matched, surrounded by customisable tags (default is <em>).
GET /signals/_search
{
"highlight": {
"fields": {
"username": {},
"tags": {},
"source": {},
"focused_content": {},
"time_stamp": {}
}
},
"query": {
"match": {
"focused_content": "stock"
}
}
}
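Once the response comes back, reading the matches out of the highlight section is straightforward. A sketch in Python (the response dict below only mimics the shape Elasticsearch returns when highlighting is requested; real code would get it from the client library):

```python
# Sketch: pulling the highlighted fragments out of a search response.
response = {
    "hits": {
        "hits": [
            {
                "_id": "13",
                "_source": {"focused_content": "Morning Stock Highlights ..."},
                "highlight": {
                    "focused_content": ["Morning <em>Stock</em> Highlights ..."]
                },
            }
        ]
    }
}

# Collect (doc id, field, first fragment) for every highlighted field;
# matched terms are wrapped in the pre/post tags (<em> by default).
matched = []
for hit in response["hits"]["hits"]:
    for field, fragments in hit.get("highlight", {}).items():
        matched.append((hit["_id"], field, fragments[0]))
```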
Hi, I'm on a project and want to use Flickr for my image gallery. I'm using the photosets.* methods, but whenever I make a request I don't get images, I only get info.
Json Result:
{
"photoset": {
"id": "77846574839405047",
"primary": "88575847594",
"owner": "998850450#N03",
"ownername": "mr.barde",
"photo": [
{
"id": "16852316982",
"secret": "857fur848c",
"server": "8568",
"farm": 9,
"title": "wallpaper-lenovo-blue-pc-brand",
"isprimary": "1",
"ispublic": 1,
"isfriend": 0,
"isfamily": 0
},
{
"id": "16665875068",
"secret": "857fur848c",
"server": "7619",
"farm": 8,
"title": "white_horses-1280x720",
"isprimary": "0",
"ispublic": 1,
"isfriend": 0,
"isfamily": 0
}
],
"page": 1,
"per_page": "2",
"perpage": "2",
"pages": 3,
"total": "6",
"title": "My First Album"
},
"stat": "ok"
}
I would like to have actual image URLs returned. How can I do this?
Thanks to the comment by @CBroe,
I found this in the Flickr API doc.
You can construct the source URL to a photo once you know its ID, server ID, farm ID and secret, as returned by many API methods.
https://farm{farm-id}.staticflickr.com/{server-id}/{id}_{secret}.jpg
or
https://farm{farm-id}.staticflickr.com/{server-id}/{id}_{secret}_[mstzb].jpg
or
https://farm{farm-id}.staticflickr.com/{server-id}/{id}_{o-secret}_o.(jpg|gif|png)
The final result would then look something like this.
https://farm1.staticflickr.com/2/1418878_1e92283336_m.jpg
Reference: https://www.flickr.com/services/api/misc.urls.html
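A small Python sketch of that construction, using the two photos from the photoset response above (the "_m" suffix requests the small size; other suffixes like "_b" select other sizes):

```python
# Build static image URLs from the photoset response fields, following
# the URL template documented by Flickr.
photos = [
    {"id": "16852316982", "secret": "857fur848c", "server": "8568", "farm": 9},
    {"id": "16665875068", "secret": "857fur848c", "server": "7619", "farm": 8},
]

def photo_url(p, size=""):
    # size is an optional suffix like "_m" (small) or "_b" (large).
    return ("https://farm{farm}.staticflickr.com/"
            "{server}/{id}_{secret}{size}.jpg").format(size=size, **p)

urls = [photo_url(p, "_m") for p in photos]
# e.g. https://farm9.staticflickr.com/8568/16852316982_857fur848c_m.jpg
```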