I am currently working with Azure Form Recognizer and had a question. I am using
https://<>.cognitiveservices.azure.com/formrecognizer/v2.0-preview/layout/analyzeResults/2e0a2322-65bb-4fd2-a3bf-98f70b36641e
The JSON returned seems to be using basic OCR. I was wondering if it's possible (easily) to take this
{
"boundingBox": [
4.4033,
1.5114,
6.5483,
1.5114,
6.5483,
1.6407,
4.4033,
1.6407
],
"text": "Invoice For: First Up Consultants",
"words": [
{
"boundingBox": [
4.4033,
1.5143,
4.8234,
1.5143,
4.8234,
1.6155,
4.4033,
1.6155
],
"text": "Invoice",
"confidence": 1
},
{
"boundingBox": [
4.8793,
1.5143,
5.1013,
1.5143,
5.1013,
1.6154,
4.8793,
1.6154
],
"text": "For:",
"confidence": 1
},
{
"boundingBox": [
5.2048,
1.5130,
5.4927,
1.5130,
5.4927,
1.6151,
5.2048,
1.6151
],
"text": "First",
"confidence": 1
},
{
"boundingBox": [
5.5427,
1.5130,
5.7120,
1.5130,
5.7120,
1.6407,
5.5427,
1.6407
],
"text": "Up",
"confidence": 1
},
{
"boundingBox": [
5.7621,
1.5114,
6.5483,
1.5114,
6.5483,
1.6151,
5.7621,
1.6151
],
"text": "Consultants",
"confidence": 1
}
]
}
but return it as
"boundingBox": [
4.4033,
1.5114,
6.5483,
1.5114,
6.5483,
1.6407,
4.4033,
1.6407
],
"text": "Invoice For:",
"value": "First Up Consultants"
}
If this is not something that I can do in Azure Form Recognizer, then no worries. I just wanted to see.
Thank you in advance!
Michael
It sounds like you're looking to extract semantic meaning from your document. In that case, you might want to look at using a custom Form Recognizer model.
You can start by training a custom model to extract key-value pairs:
https://learn.microsoft.com/en-us/azure/cognitive-services/form-recognizer/quickstarts/curl-train-extract
Sample key value pair:
{
"key": {
"text": "Address:",
"boundingBox": [ 0.7972, 1.5125, 1.3958, 1.5125, 1.3958, 1.6431, 0.7972, 1.6431 ]
},
"value": {
"text": "1 Redmond way Suite 6000 Redmond, WA 99243",
"boundingBox": [ 0.7972, 1.6764, 2.15, 1.6764, 2.15, 2.2181, 0.7972, 2.2181 ]
},
"confidence": 0.86
}
Or you can train a custom model using labels that you provide:
https://learn.microsoft.com/en-us/azure/cognitive-services/form-recognizer/quickstarts/label-tool
Sample field output:
{
"total": {
"type": "string",
"valueString": "$22,123.24",
"text": "$22,123.24",
"boundingBox": [ 5.29, 3.41, 5.975, 3.41, 5.975, 3.54, 5.29, 3.54 ],
"page": 1,
"confidence": 1
}
}
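To make the flow concrete, here is a rough, untested sketch (not the official quickstart code) of calling a trained custom model through the v2.0 REST API from Python; the endpoint, key, model ID and file name are placeholders you would substitute:

import time
import requests

# Placeholders - substitute your own Form Recognizer resource values.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
API_KEY = "<your-key>"
MODEL_ID = "<your-custom-model-id>"

headers = {"Ocp-Apim-Subscription-Key": API_KEY, "Content-Type": "application/pdf"}

# Kick off analysis of a local invoice against the trained custom model.
with open("invoice.pdf", "rb") as f:
    resp = requests.post(
        f"{ENDPOINT}/formrecognizer/v2.0/custom/models/{MODEL_ID}/analyze",
        headers=headers,
        data=f.read(),
    )
resp.raise_for_status()
result_url = resp.headers["Operation-Location"]

# Poll until the operation finishes, then pull out the key/value pairs.
while True:
    result = requests.get(result_url, headers={"Ocp-Apim-Subscription-Key": API_KEY}).json()
    if result["status"] in ("succeeded", "failed"):
        break
    time.sleep(1)

for page in (result.get("analyzeResult") or {}).get("pageResults", []):
    for kv in page.get("keyValuePairs", []):
        print(kv["key"]["text"], "=>", kv["value"]["text"])

For a model trained with labels, the fields come back under analyzeResult.documentResults rather than pageResults.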
I am trying to understand skillsets in Azure Cognitive Search. I want to build an OCR-powered search and I am trying to understand how it works.
For example, the documentation says that the OCR skill produces this response:
{
"text": "Hello World. -John",
"layoutText":
{
"language" : "en",
"text" : "Hello World. -John",
"lines" : [
{
"boundingBox":
[ {"x":10, "y":10}, {"x":50, "y":10}, {"x":50, "y":30},{"x":10, "y":30}],
"text":"Hello World."
},
{
"boundingBox": [ {"x":110, "y":10}, {"x":150, "y":10}, {"x":150, "y":30},{"x":110, "y":30}],
"text":"-John"
}
],
"words": [
{
"boundingBox": [ {"x":110, "y":10}, {"x":150, "y":10}, {"x":150, "y":30},{"x":110, "y":30}],
"text":"Hello"
},
{
"boundingBox": [ {"x":110, "y":10}, {"x":150, "y":10}, {"x":150, "y":30},{"x":110, "y":30}],
"text":"World."
},
{
"boundingBox": [ {"x":110, "y":10}, {"x":150, "y":10}, {"x":150, "y":30},{"x":110, "y":30}],
"text":"-John"
}
]
}
}
but then, in this paragraph, we see that only the text field from the OCR skill is used, and a newcomer, contentOffset, appears.
Custom skillset definition:
{
"description": "Extract text from images and merge with content text to produce merged_text",
"skills":
[
{
"description": "Extract text (plain and structured) from image.",
"#odata.type": "#Microsoft.Skills.Vision.OcrSkill",
"context": "/document/normalized_images/*",
"defaultLanguageCode": "en",
"detectOrientation": true,
"inputs": [
{
"name": "image",
"source": "/document/normalized_images/*"
}
],
"outputs": [
{
"name": "text"
}
]
},
{
"#odata.type": "#Microsoft.Skills.Text.MergeSkill",
"description": "Create merged_text, which includes all the textual representation of each image inserted at the right location in the content field.",
"context": "/document",
"insertPreTag": " ",
"insertPostTag": " ",
"inputs": [
{
"name":"text",
"source": "/document/content"
},
{
"name": "itemsToInsert",
"source": "/document/normalized_images/*/text"
},
{
"name":"offsets",
"source": "/document/normalized_images/*/contentOffset"
}
],
"outputs": [
{
"name": "mergedText",
"targetName" : "merged_text"
}
]
}
]
}
and the input should look like this:
{
"values": [
{
"recordId": "1",
"data":
{
"text": "The brown fox jumps over the dog",
"itemsToInsert": ["quick", "lazy"],
"offsets": [3, 28]
}
}
]
}
So where does the array of offsets (contentOffset in the skill definition) come from, when neither the OcrSkill response nor the Computer Vision Read method returns it from the API?
contentOffset is populated by default for files that have images embedded in them. Whenever the pipeline recognizes an image in the input document, a contentOffset is emitted for it alongside the normalized image, recording where that image sits in the extracted content.
As for why you get an array of contentOffset values: every input you upload for analysis can contain multiple images. Consider the following documentation for the Read API through REST to follow the JSON operations.
Raw Read response from Azure Computer Vision looks like this:
{
"status": "succeeded",
"createdDateTime": "2021-04-08T21:56:17.6819115+00:00",
"lastUpdatedDateTime": "2021-04-08T21:56:18.4161316+00:00",
"analyzeResult": {
"version": "3.2",
"readResults": [
{
"page": 1,
"angle": 0,
"width": 338,
"height": 479,
"unit": "pixel",
"lines": [
{
"boundingBox": [
25,
14
],
"text": "NOTHING",
"appearance": {
"style": {
"name": "other",
"confidence": 0.971
}
},
"words": [
{
"boundingBox": [
27,
15
],
"text": "NOTHING",
"confidence": 0.994
}
]
}
]
}
]
}
}
Copied from here
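For reference, image extraction (and with it /document/normalized_images/* and each image's contentOffset) is switched on through the indexer's imageAction setting. Below is a rough sketch of creating such an indexer through the REST API from Python; the service URL, key, and the indexer, data source, index and skillset names are all hypothetical placeholders:

import requests

# Hypothetical placeholders - substitute your own search service and resource names.
SEARCH_SERVICE = "https://<your-search-service>.search.windows.net"
API_KEY = "<admin-api-key>"

indexer = {
    "name": "my-ocr-indexer",
    "dataSourceName": "my-datasource",
    "targetIndexName": "my-index",
    "skillsetName": "my-skillset",
    "parameters": {
        "configuration": {
            # Extract text plus embedded images; each extracted image lands in
            # /document/normalized_images/* together with its contentOffset.
            "dataToExtract": "contentAndMetadata",
            "imageAction": "generateNormalizedImages"
        }
    }
}

resp = requests.put(
    f"{SEARCH_SERVICE}/indexers/my-ocr-indexer?api-version=2020-06-30",
    headers={"api-key": API_KEY, "Content-Type": "application/json"},
    json=indexer,
)
resp.raise_for_status()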
I want to create a custom skill in Azure Cognitive Search that does not use VisionSkill but my own Azure Function, which will use the Computer Vision client in code.
The problem is that, to pass input to Text.MergeSkill:
{
"#odata.type": "#Microsoft.Skills.Text.MergeSkill",
"description": "Create merged_text, which includes all the textual representation of each image inserted at the right location in the content field.",
"context": "/document",
"insertPreTag": " ",
"insertPostTag": " ",
"inputs": [
{
"name":"text",
"source": "/document/content"
},
{
"name": "itemsToInsert",
"source": "/document/normalized_images/*/text"
},
{
"name":"offsets",
"source": "/document/normalized_images/*/contentOffset"
}
],
"outputs": [
{
"name": "mergedText",
"targetName" : "merged_text"
}
]
}
I need to convert the Read output into the form that OcrSkill returns, from within my custom skill. That response must look like this:
{
"text": "Hello World. -John",
"layoutText":
{
"language" : "en",
"text" : "Hello World.",
"lines" : [
{
"boundingBox":
[ {"x":10, "y":10}, {"x":50, "y":10}, {"x":50, "y":30},{"x":10, "y":30}],
"text":"Hello World."
},
],
"words": [
{
"boundingBox": [ {"x":110, "y":10}, {"x":150, "y":10}, {"x":150, "y":30},{"x":110, "y":30}],
"text":"Hello"
},
{
"boundingBox": [ {"x":110, "y":10}, {"x":150, "y":10}, {"x":150, "y":30},{"x":110, "y":30}],
"text":"World."
}
]
}
}
And I copied it from here.
My question is: how do I convert the boundingBox parameter from the Computer Vision Read endpoint into the form that Text.MergeSkill accepts? Do we really need to do that, or can we just pass the Read response to Text.MergeSkill differently?
The built-in OCRSkill calls the Cognitive Services Computer Vision Read API for certain languages, and it handles the merging of the text for you via the 'text' output. If at all possible, I would strongly suggest you use this skill instead of writing a custom one.
If you must write a custom skill and merge the output text yourself, per the MergeSkill documentation, the 'text' and 'offsets' inputs are optional. This means you should just be able to pass the text from the individual Read API output objects directly to the MergeSkill via the 'itemsToInsert' input, if all you need is a way to merge those outputs together into one large text. This would make your skillset look something like this (not tested to know for sure), assuming you are still using the built-in Azure Search image extraction and your custom skill outputs the exact payload that the Read API returns, which you shared above.
{
"skills": [
{
"#odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"description": "Custom skill that calls Cognitive Services Computer Vision Read API",
"uri": "<your custom skill uri>",
"batchSize": 1,
"context": "/document/normalized_images/*",
"inputs": [
{
"name": "image",
"source": "/document/normalized_images/*"
}
],
"outputs": [
{
"name": "readAPIOutput"
}
]
},
{
"#odata.type": "#Microsoft.Skills.Text.MergeSkill",
"description": "Create merged_text, which includes all the textual representation of each image inserted at the right location in the content field.",
"context": "/document",
"insertPreTag": "",
"insertPostTag": "\n",
"inputs": [
{
"name": "itemsToInsert",
"source": "/document/normalized_images/*/readAPIOutput/analyzeResult/readResults/*/lines/*/text"
}
],
"outputs": [
{
"name": "mergedText",
"targetName": "merged_text"
}
]
}
]
}
However, if you need to guarantee that the text appears in the correct order based on the bounding boxes, you will likely need to write a custom solution to calculate the positions and recombine the text yourself, hence the suggestion to use the built-in solution in the OCRSkill if at all possible.
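If you do end up writing that ordering logic yourself, a rough, untested sketch of the conversion and sort, assuming the Read payload shape shared above (whose full boundingBox is a flat [x1, y1, x2, y2, x3, y3, x4, y4] array), could look like this:

# Rough sketch (untested): turn a Read API payload into reading-ordered text
# inside the custom skill, before handing it to the MergeSkill.
def read_results_to_text(read_response):
    entries = []
    for page in read_response["analyzeResult"]["readResults"]:
        for line in page["lines"]:
            box = line["boundingBox"]
            # Pair the flat coordinate list up into OcrSkill-style points if needed.
            points = [{"x": box[i], "y": box[i + 1]} for i in range(0, len(box) - 1, 2)]
            top, left = points[0]["y"], points[0]["x"]
            entries.append((page["page"], top, left, line["text"]))
    # Order by page, then top-to-bottom, then left-to-right.
    entries.sort(key=lambda e: (e[0], e[1], e[2]))
    return "\n".join(text for _, _, _, text in entries)

The first point of a Read bounding box is the top-left corner for roughly horizontal text, so sorting on it gives a reasonable reading order for simple layouts; skewed or multi-column pages would need something smarter.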
Application goal: read the receipt image, extract the store/organization name along with the total amount paid, and feed it to a web form for auto-filling & submission.
Post Request - https://*.cognitiveservices.azure.com/vision/v2.0/recognizeText?{params}
Get Request - https://*.cognitiveservices.azure.com/vision/v2.0/textOperations/{operationId}
However, when I get the results back, the line ordering is sometimes confusing (see the picture below; the JSON response shows similar results).
This mix-up results in the total being read as $0.88.
Similar situations are present for 2 out of 9 test receipts.
Q: Why does it work for some similarly and differently structured receipts but, for some reason, isn't consistent across all of them? Also, any ideas how to get around it?
I had a quick look at your case.
OCR Result
As you mentioned, the results are not ordered the way you expected. I had a quick look at the bounding box values and I can't tell how they are ordered. You could try to consolidate fields based on them, but there is a service that already does this for you.
Form Recognizer:
Using Form Recognizer and your image, I got the following results for your receipt.
As you can see below, the understandingResults contains the total with its value ("value": 9.11), the MerchantName ("Chick-fil-a") and other fields.
{
"status": "Succeeded",
"recognitionResults": [
{
"page": 1,
"clockwiseOrientation": 0.17,
"width": 404,
"height": 1226,
"unit": "pixel",
"lines": [
{
"boundingBox": [
108,
55,
297,
56,
296,
71,
107,
70
],
"text": "Welcome to Chick-fil-a",
"words": [
{
"boundingBox": [
108,
56,
169,
56,
169,
71,
108,
71
],
"text": "Welcome",
"confidence": "Low"
},
{
"boundingBox": [
177,
56,
194,
56,
194,
71,
177,
71
],
"text": "to"
},
{
"boundingBox": [
201,
56,
296,
57,
296,
71,
201,
71
],
"text": "Chick-fil-a"
}
]
},
...
OTHER LINES CUT FOR DISPLAY
...
]
}
],
"understandingResults": [
{
"pages": [
1
],
"fields": {
"Subtotal": null,
"Total": {
"valueType": "numberValue",
"value": 9.11,
"text": "$9.11",
"elements": [
{
"$ref": "#/recognitionResults/0/lines/32/words/0"
},
{
"$ref": "#/recognitionResults/0/lines/32/words/1"
}
]
},
"Tax": {
"valueType": "numberValue",
"value": 0.88,
"text": "$0.88",
"elements": [
{
"$ref": "#/recognitionResults/0/lines/31/words/0"
},
{
"$ref": "#/recognitionResults/0/lines/31/words/1"
},
{
"$ref": "#/recognitionResults/0/lines/31/words/2"
}
]
},
"MerchantAddress": null,
"MerchantName": {
"valueType": "stringValue",
"value": "Chick-fil-a",
"text": "Chick-fil-a",
"elements": [
{
"$ref": "#/recognitionResults/0/lines/0/words/2"
}
]
},
"MerchantPhoneNumber": {
"valueType": "stringValue",
"value": "+13092689500",
"text": "309-268-9500",
"elements": [
{
"$ref": "#/recognitionResults/0/lines/4/words/0"
}
]
},
"TransactionDate": {
"valueType": "stringValue",
"value": "2019-06-21",
"text": "6/21/2019",
"elements": [
{
"$ref": "#/recognitionResults/0/lines/6/words/0"
}
]
},
"TransactionTime": {
"valueType": "stringValue",
"value": "13:00:57",
"text": "1:00:57 PM",
"elements": [
{
"$ref": "#/recognitionResults/0/lines/6/words/1"
},
{
"$ref": "#/recognitionResults/0/lines/6/words/2"
}
]
}
}
}
]
}
More details on Form Recognizer: https://azure.microsoft.com/en-us/services/cognitive-services/form-recognizer/
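If it helps, here is a rough, untested sketch of calling the Form Recognizer prebuilt receipt endpoint (v2.0) from Python and pulling out the merchant name and total; the endpoint, key and file name are placeholders, and field/property names can differ slightly between preview and GA versions (the JSON above is from an earlier preview):

import time
import requests

# Placeholders - use your own Form Recognizer resource.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
API_KEY = "<your-key>"

with open("receipt.jpg", "rb") as f:
    resp = requests.post(
        f"{ENDPOINT}/formrecognizer/v2.0/prebuilt/receipt/analyze",
        headers={"Ocp-Apim-Subscription-Key": API_KEY, "Content-Type": "image/jpeg"},
        data=f.read(),
    )
resp.raise_for_status()
result_url = resp.headers["Operation-Location"]

# Poll until the analysis finishes.
while True:
    result = requests.get(result_url, headers={"Ocp-Apim-Subscription-Key": API_KEY}).json()
    if result["status"] in ("succeeded", "failed"):
        break
    time.sleep(1)

# Field names as of the v2.0 receipt model; adjust if your API version differs.
fields = result["analyzeResult"]["documentResults"][0]["fields"]
print("Merchant:", (fields.get("MerchantName") or {}).get("valueString"))
print("Total:", (fields.get("Total") or {}).get("valueNumber"))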
I'm working with LUIS and want to manage and deal with not only the top-scoring intent but also all the others. This specific situation occurs when someone inquires about two things in the same phrase.
For example: "I want to buy apples" ("Buy" intent) and "I want to sell bananas" ("Sell" intent) versus "I want to buy bananas and sell apples" ("buy" and "sell" intents on the same utterance).
The idea is to define a threshold that will accept as "valid" any intent whose score is above this confidence number.
During some tests I found out this can work if we have very few intents in the same utterance.
However, if we increase the number of intents in the same utterance, the results degrade very fast.
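(The threshold filter itself is trivial; a minimal sketch over the intents array shown in the responses below, with an assumed cutoff:)

THRESHOLD = 0.5  # assumed cutoff; tune per model

def valid_intents(luis_response, threshold=THRESHOLD):
    # Keep every intent whose score clears the confidence threshold.
    return [i for i in luis_response["intents"] if i["score"] >= threshold]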
I included some examples to clarify what I mean: The output examples below were generated on a LUIS with 4 intents ("buy", "sell", "none" and "prank") and 1 entity ("fruit")
I want to buy apples ==>
{
"query": "i want to buy apples",
"topScoringIntent": {
"intent": "Buy",
"score": 0.999846
},
"intents": [
{
"intent": "Buy",
"score": 0.999846
},
{
"intent": "None",
"score": 0.2572831
},
{
"intent": "sell",
"score": 2.32163586e-7
},
{
"intent": "prank",
"score": 2.32163146e-7
}
],
"entities": [
{
"entity": "apples",
"type": "Fruit",
"startIndex": 14,
"endIndex": 19,
"resolution": {
"values": [
"apple"
]
}
}
]
}
I want to sell bananas ==>
{
"query": "i want to sell bananas",
"topScoringIntent": {
"intent": "sell",
"score": 0.999886036
},
"intents": [
{
"intent": "sell",
"score": 0.999886036
},
{
"intent": "None",
"score": 0.253938943
},
{
"intent": "Buy",
"score": 2.71893583e-7
},
{
"intent": "prank",
"score": 1.97906232e-7
}
],
"entities": [
{
"entity": "bananas",
"type": "Fruit",
"startIndex": 15,
"endIndex": 21,
"resolution": {
"values": [
"banana"
]
}
}
]
}
I want to eat a pizza ==>
{
"query": "i want to eat a pizza",
"topScoringIntent": {
"intent": "prank",
"score": 0.997353
},
"intents": [
{
"intent": "prank",
"score": 0.997353
},
{
"intent": "None",
"score": 0.378299
},
{
"intent": "sell",
"score": 2.72957237e-7
},
{
"intent": "Buy",
"score": 1.54754474e-7
}
],
"entities": []
}
Now with two intents... the score of each one starts to drop sharply.
I want to buy apples and sell bananas ==>
{
"query": "i want to buy apples and sell bananas",
"topScoringIntent": {
"intent": "sell",
"score": 0.4442593
},
"intents": [
{
"intent": "sell",
"score": 0.4442593
},
{
"intent": "Buy",
"score": 0.263670564
},
{
"intent": "None",
"score": 0.161728472
},
{
"intent": "prank",
"score": 5.190861e-9
}
],
"entities": [
{
"entity": "apples",
"type": "Fruit",
"startIndex": 14,
"endIndex": 19,
"resolution": {
"values": [
"apple"
]
}
},
{
"entity": "bananas",
"type": "Fruit",
"startIndex": 30,
"endIndex": 36,
"resolution": {
"values": [
"banana"
]
}
}
]
}
and if we include a third intent, LUIS seems to collapse:
I want to buy apples, sell bananas and eat a pizza ==>
{
"query": "i want to buy apples, sell bananas and eat a pizza",
"topScoringIntent": {
"intent": "None",
"score": 0.139652014
},
"intents": [
{
"intent": "None",
"score": 0.139652014
},
{
"intent": "Buy",
"score": 0.008631414
},
{
"intent": "sell",
"score": 0.005520768
},
{
"intent": "prank",
"score": 0.0000210663875
}
],
"entities": [
{
"entity": "apples",
"type": "Fruit",
"startIndex": 14,
"endIndex": 19,
"resolution": {
"values": [
"apple"
]
}
},
{
"entity": "bananas",
"type": "Fruit",
"startIndex": 27,
"endIndex": 33,
"resolution": {
"values": [
"banana"
]
}
}
]
}
Do you know/recommend any approach that I should use to train LUIS in order to mitigate this issue? Dealing with multiple intents in the same utterance is key to my case.
Thanks a lot for any help.
You will likely need to do some pre-processing of the input using NLP to chunk the sentences, and then train/submit the chunks one at a time. I doubt that LUIS is sophisticated enough to handle multiple intents in compound sentences.
Here's some sample code for preprocessing using spaCy in Python. I have not tested this on more complicated sentences, but it should work for your example sentence. You can use the segments it prints to feed to LUIS.
Multiple intents are not an easy problem to address, and there may be other ways to handle them.
import spacy

# Note: spaCy 2.x accepts the 'en' shortcut; on spaCy 3+ load 'en_core_web_sm' instead.
model = 'en'
nlp = spacy.load(model)
print("Loaded model '%s'" % model)

doc = nlp("i want to buy apples, sell bananas and eat a pizza")
for word in doc:
    if word.dep_ == 'dobj':  # each direct object anchors one "verb + object" chunk
        subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
        # e.g. "buy apples", "sell bananas", "eat a pizza"
        print(subtree_span.root.head.text + ' ' + subtree_span.text)
        print(subtree_span.text, '|', subtree_span.root.head.text)
        print()
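To then send each segment to LUIS on its own, a minimal sketch against the v2 prediction endpoint (the URL, app ID and key below are placeholders) might look like:

import requests

# Hypothetical placeholders for your LUIS app.
LUIS_URL = "https://<region>.api.cognitive.microsoft.com/luis/v2.0/apps/<app-id>"
LUIS_KEY = "<endpoint-key>"

def query_luis(utterance):
    resp = requests.get(
        LUIS_URL,
        params={"subscription-key": LUIS_KEY, "q": utterance, "verbose": "true"},
    )
    resp.raise_for_status()
    return resp.json()

# Feed each extracted chunk to LUIS separately and keep its top intent.
for chunk in ["buy apples", "sell bananas", "eat a pizza"]:
    prediction = query_luis(chunk)
    print(chunk, "->", prediction["topScoringIntent"])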
If you know the permutations you are expecting you might be able to get the information you need.
I defined a single "buy and sell" intent, in addition to the individual buy and sell intents. I created two entities, "Buy Fruit" and "Sell Fruit", each of which contained the "Fruit" entity from your example. Then in the "buy and sell" intent I used sample utterances like "I want to buy apples and sell bananas", as well as switching the buy/sell around. I marked the fruit as a "Fruit" entity, and the phrases as "Buy Fruit" and "Sell Fruit" respectively.
This is the kind of output I get from "I want to buy a banana and sell an apple":
{
"query": "I want to buy a banana and sell an apple",
"prediction": {
"topIntent": "buy and sell",
"intents": {
"buy and sell": {
"score": 0.899272561
},
"Buy": {
"score": 0.06608531
},
"Sell": {
"score": 0.03477564
},
"None": {
"score": 0.009155964
}
},
"entities": {
"Buy Fruit": [
{}
],
"Sell Fruit": [
{}
],
"Fruit": [
"banana",
"apple"
],
"keyPhrase": [
"banana",
"apple"
],
"$instance": {
"Buy Fruit": [
{
"type": "Buy Fruit",
"text": "buy a banana",
"startIndex": 10,
"length": 12,
"score": 0.95040834,
"modelTypeId": 1,
"modelType": "Entity Extractor",
"recognitionSources": [
"model"
]
}
],
"Sell Fruit": [
{
"type": "Sell Fruit",
"text": "sell an apple",
"startIndex": 27,
"length": 13,
"score": 0.7225706,
"modelTypeId": 1,
"modelType": "Entity Extractor",
"recognitionSources": [
"model"
]
}
],
"Fruit": [
{
"type": "Fruit",
"text": "banana",
"startIndex": 16,
"length": 6,
"score": 0.9982499,
"modelTypeId": 1,
"modelType": "Entity Extractor",
"recognitionSources": [
"model"
]
},
{
"type": "Fruit",
"text": "apple",
"startIndex": 35,
"length": 5,
"score": 0.98748064,
"modelTypeId": 1,
"modelType": "Entity Extractor",
"recognitionSources": [
"model"
]
}
],
"keyPhrase": [
{
"type": "builtin.keyPhrase",
"text": "banana",
"startIndex": 16,
"length": 6,
"modelTypeId": 2,
"modelType": "Prebuilt Entity Extractor",
"recognitionSources": [
"model"
]
},
{
"type": "builtin.keyPhrase",
"text": "apple",
"startIndex": 35,
"length": 5,
"modelTypeId": 2,
"modelType": "Prebuilt Entity Extractor",
"recognitionSources": [
"model"
]
}
]
}
}
}
}
To make this work you would have to cater for all the possible permutations, so this isn't strictly a solution to discerning multiple intents. It's more about defining a composite intent for each permutation of individual intents that you wanted to cater for. In many applications that would not be practical, but in your example it could get you a satisfactory result.
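Once the composite intent fires, the (action, fruit) pairs can be recovered from the $instance block in the prediction above; a rough sketch, assuming the entity names used in this example:

# Rough sketch: pair each "Buy Fruit" / "Sell Fruit" span with the Fruit
# entity that falls inside its character range.
def extract_actions(prediction):
    instance = prediction["prediction"]["entities"]["$instance"]
    fruits = instance.get("Fruit", [])
    pairs = []
    for action in ("Buy Fruit", "Sell Fruit"):
        for span in instance.get(action, []):
            start = span["startIndex"]
            end = start + span["length"]
            for fruit in fruits:
                if start <= fruit["startIndex"] < end:
                    pairs.append((action, fruit["text"]))
    return pairs

# For the response above this yields [("Buy Fruit", "banana"), ("Sell Fruit", "apple")].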
I want to modify scoring in ElasticSearch (v2+) based on the weight of a field in a nested object within an array.
For instance, using this data:
PUT index/test/0
{
"name": "red bell pepper",
"words": [
{"text": "pepper", "weight": 20},
{"text": "bell","weight": 10},
{"text": "red","weight": 5}
]
}
PUT index/test/1
{
"name": "hot red pepper",
"words": [
{"text": "pepper", "weight": 15},
{"text": "hot","weight": 11},
{"text": "red","weight": 5}
]
}
I want a query like {"words.text": "red pepper"} which would rank "red bell pepper" above "hot red pepper".
The way I am thinking about this problem is "first match the 'text' field, then modify scoring based on the 'weight' field". Unfortunately I don't know how to achieve this, if it's even possible, or if I have the right approach for something like this.
If proposing an alternative approach, please try to keep it generalized, since there are tons of different, similar cases (e.g. simply modifying the "red bell pepper" document's score to be higher isn't really a suitable alternative).
The approach you have in mind is feasible. It can be achieved with a function_score query inside a nested query.
An example implementation is shown below:
PUT test
PUT test/test/_mapping
{
"properties": {
"name": {
"type": "string"
},
"words": {
"type": "nested",
"properties": {
"text": {
"type": "string"
},
"weight": {
"type": "long"
}
}
}
}
}
PUT test/test/0
{
"name": "red bell pepper",
"words": [
{"text": "pepper", "weight": 20},
{"text": "bell","weight": 10},
{"text": "red","weight": 5}
]
}
PUT test/test/1
{
"name": "hot red pepper",
"words": [
{"text": "pepper", "weight": 15},
{"text": "hot","weight": 11},
{"text": "red","weight": 5}
]
}
POST test/_search
{
"query": {
"bool": {
"disable_coord": true,
"must": [
{
"match": {
"name": "red pepper"
}
}
],
"should": [
{
"nested": {
"path": "words",
"query": {
"function_score": {
"functions": [
{
"field_value_factor": {
"field" : "words.weight",
"missing": 0
}
}
],
"query": {
"match": {
"words.text": "red pepper"
}
},
"score_mode": "sum",
"boost_mode": "replace"
}
},
"score_mode": "total"
}
}
]
}
}
}
Result:
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "0",
"_score": 26.030865,
"_source": {
"name": "red bell pepper",
"words": [
{
"text": "pepper",
"weight": 20
},
{
"text": "bell",
"weight": 10
},
{
"text": "red",
"weight": 5
}
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 21.030865,
"_source": {
"name": "hot red pepper",
"words": [
{
"text": "pepper",
"weight": 15
},
{
"text": "hot",
"weight": 11
},
{
"text": "red",
"weight": 5
}
]
}
}
]
}
In a nutshell, the query scores a document that satisfies the must clause as follows: sum up the weights of the matched nested documents, then add the score of the must clause.
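You can see this in the scores above: for "red bell pepper", the nested clause matches "pepper" (weight 20) and "red" (weight 5) for a total of 25, and for "hot red pepper" it matches "pepper" (15) and "red" (5) for 20; the remaining ~1.03 in each _score comes from the must clause match on name, which is how the documents end up at 26.03 and 21.03 respectively.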