Unable to match all required fields. invoice2data - python-3.x

I'm trying to create templates and extract data with invoice2data, but it gives me this error:
regexp for field amount didn't match
regexp for field amount_untaxed didn't match
regexp for field date didn't match
Unable to match all required fields. The required fields are: ['date', 'amount', 'invoice_number', 'issuer']. Output contains the following fields: ['vat', 'issuer', 'currency', 'invoice_number'].
I checked my template, template.yaml, and it contains all of these fields.
Here is my template:
# -*- coding: utf-8 -*-
issuer: Online SAS / Scaleway
fields:
  amount: (Total.*)\w+
  amount_untaxed: (Total.*)\w+
  date: (Date.*)\w+
  invoice_number: Facture n\W\s+(\d+)
  static_vat: FR35433115904
keywords:
  - Sosh
  - Orange
  - Votre facture mobile
options:
  currency: EUR
  date_formats:
    - '%B %d, %Y'
  decimal_separator: ','
I tried with a second example "pdf2.pdf" and it gives me
Syntax Error (69926): No current point in closepath
Syntax Error (69950): No current point in closepath
Syntax Error (69970): No current point in closepath
I searched the internet and didn't find an answer.
Can someone explain the problem to me?
Thanks for your answers, and sorry for my English.
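One way to narrow this down is to run each field regex from the template against the text extracted from the PDF. The sketch below is only an illustration: check_fields is a hypothetical helper and sample_text is made up, since the real invoice text is not shown in the question. It uses plain re.findall, not invoice2data itself.

```python
import re

def check_fields(fields, text):
    """Return {field name: list of matches} for every regex in `fields`."""
    return {name: re.findall(pattern, text) for name, pattern in fields.items()}

# Regexes copied from the template in the question.
fields = {
    "amount": r"(Total.*)\w+",
    "date": r"(Date.*)\w+",
    "invoice_number": r"Facture n\W\s+(\d+)",
}

# Hypothetical extracted text; replace it with the real text of your PDF.
sample_text = "Facture n° 12345\nDate : 01/02/2020\nMontant TTC 9,99 €\n"

report = check_fields(fields, sample_text)
for name, matches in report.items():
    print(name, "->", matches if matches else "no match")
```

In this made-up sample, amount finds nothing because the word "Total" never occurs, and date captures the label plus a truncated value rather than a clean date. If the real extracted text behaves similarly, the patterns probably need to follow the actual wording of the invoice and capture only the value.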

Related

Regex to find text & value in large text

As I SSH into CM, run commands and start reading the CLI output, I get the following
back:
# * A lot more output above but been removed *
terminal_output = """
[24;1H [79b[1GCommand: disp sys cust<<[23;0H[0;7m [79b[1G[0m[24;0H [79b[1G[1;0H[0;7m [79b[1G[0m[2;0H [79b[1G[3;1H[0J7[1;1H[0;7mdisplay system-parameters customer-options [0m8[1;65H[0;7mPage 1 of 12[0m[2;33HOPTIONAL FEATURES[4;8HG3 Version: [4;20HV20 [4;50HSoftware Package: [4;68HEnterprise [5;10HLocation: [5;20H2[6;10HPlatform: [6;20H28 [5;51HSystem ID (SID): [5;68H9990093751 [6;51HModule ID (MID): [6;68H1 [8;60HUSED[9;29HPlatform Maximum Ports: [9;53H 81000[9;60H 436[10;35HMaximum Stations: [10;53H 135[10;60H 110[11;27HMaximum XMOBILE Stations: [11;53H 41000[11;60H 0[12;17HMaximum Off-PBX Telephones - EC500: [12;53H 135[12;60H 2[13;17HMaximum Off-PBX Telephones - OPS: [13;53H 135[13;60H 40[14;17HMaximum Off-PBX Telephones - PBFMC: [14;53H 135[14;60H 0[15;17HMaximum Off-PBX Telephones - PVFMC: [15;53H 135[15;60H 0[16;17HMaximum Off-PBX Telephones - SCCAN: [16;53H 0[16;60H 0[17;22HMaximum Survivable Processors: [17;53H 313[17;62H 1[22;9H(NOTE: You must logoff & login to effect the permission changes.)[2;50H[0m
"""
It's full of ANSI escape codes (I think?), which makes the output hard to read. What I'm trying to get back from the text above is the following:
Maximum Stations: 135 110
As I understand it, a regex is required for this.
The regexes I tried, which did not work:
r'Maximum Stations:\s*(\d+)(\d+)'
r'Maximum Stations: \d+'
If anyone knows how to filter out these ANSI character codes so they don't appear in the final output that'd be great too.
Thank you.
You can try the following:
"(Maximum Stations:)\s\[\d*;\d*H\s*(\d*)\[\d*;\d*H\s*(\d*)"gm
It produces three groups: the first holds the "Maximum Stations:" text, and the other two each hold one of the numbers you want to capture. You would have to combine the groups to get your final output.
I don't know whether this will be generic enough for your application, though.
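An alternative is to strip the escape sequences first and then match the cleaned text. The following is a minimal sketch, assuming the sequences all have the CSI shape seen in the sample ("[", digits and semicolons, one letter). The ESC byte is treated as optional because it seems to have been lost in the pasted output; CSI and strip_csi are names chosen here for illustration.

```python
import re

# CSI sequences such as "[10;53H" or "[0;7m"; "\x1b" (ESC) is optional
# because the pasted sample does not show it.
CSI = re.compile(r"\x1b?\[[0-9;]*[A-Za-z]")

def strip_csi(text):
    """Replace every CSI sequence with a single space."""
    return CSI.sub(" ", text)

# Fragment taken from the terminal output in the question.
sample = ("[10;35HMaximum Stations: [10;53H 135[10;60H 110"
          "[11;27HMaximum XMOBILE Stations: [11;53H 41000[11;60H 0")

clean = strip_csi(sample)
match = re.search(r"Maximum Stations:\s*(\d+)\s+(\d+)", clean)
result = "Maximum Stations: {} {}".format(*match.groups()) if match else None
print(result)
```

Note that the first pattern in the question, (\d+)(\d+), has no separator between the two numbers, and both attempts run against text that still contains the cursor-positioning codes between the label and the values. Because the ESC byte is optional here, the pattern would also eat ordinary text that happens to look like "[12a", so tighten it if your real output contains the ESC byte.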

Node.js - Parse raw text to JSON using RegEx

I'm still new to Node.js and currently developing a small app for my kitchen. This app can scan receipts and uses OCR to extract the data. For OCR extraction I'm using the ocr-space web API. Afterwards I need to parse the raw text into a JSON structure and send it to my database. I've also tested this receipt using AWS Textract, which gave me an even poorer result.
Currently I'm struggling with the parsing part using RegEx in Node.js.
Here is my JSON structure which I use to parse the receipt data:
receipt = {
  title: 'title of receipt',
  items: [
    'item1',
    'item2',
    'item3'
  ],
  preparation: 'preparation text'
}
As most receipts have an items part followed by a preparation part, my general approach so far is the following:
Search for keywords like 'items' and 'preparation' in the raw text
Parse the text between these keywords
Do further string processing, such as fixing missing whitespace, trimming, etc.
This approach doesn't work if these keywords are missing. Take, for example, the following receipt, which I'm struggling to parse into my JSON structure. The receipt is in German and there are no corresponding keywords ('items' or 'Zutaten', 'preparation' or 'Zubereitung').
The following information from the raw text is needed:
title: line 1
items: line 2 - 8
preparation: line 9 until end
Do you have any hints or tips how to come closer to the solution? Or do you have any other ideas how to manage such situations accordingly?
Quinoa-Brot
30 g Chiasamen
350 g Quinoa
70 ml Olivenöl
1/2 TL Speisenatron
1 Prise Salz
Saft von 1/2 Zitrone
1 Handvoll Sonnenblumenkerne
30 g Schwarzkümmelsamen
1 Chiasamen mit 100 ml Wasser
verrühren und 30 Minuten quel-
len lassen. Den Ofen auf 200 oc
vorheizen, eine kleine Kastenform
mit Backpapier auslegen.
2 Quinoa mit der dreifachen
Menge Wasser in einen Topf ge-
ben, einmal aufkochen und dann
3 Minuten köcheln lassen - die
Quinoa wird so nur teilweise ge-
gegart. In ein Sieb abgießen, kalt
abschrecken und anschließend
gut abtropfen lassen.
The lines are separated by \n (newline) characters.
The parsed receipt should look like this:
receipt = {
  title: 'Quinoa-Brot',
  items: [
    '30 g Chiasamen',
    '350 g Quinoa',
    '70 ml Olivenöl',
    '1/2 TL Speisenatron',
    '1 Prise Salz',
    'Saft von 1/2 Zitrone',
    '1 Handvoll Sonnenblumenkerne',
    '30 g Schwarzkümmelsamen',
  ],
  preparation: '1 Chiasamen mit 100 ml Wasser verrühren und 30 Minuten quellen lassen. Den Ofen auf 200 oc vorheizen, eine kleine Kastenform mit Backpapier auslegen. 2 Quinoa mit der dreifachen Menge Wasser in einen Topf geben, einmal aufkochen und dann 3 Minuten köcheln lassen - die Quinoa wird so nur teilweise gegegart. In ein Sieb abgießen, kalt abschrecken und anschließend gut abtropfen lassen.'
}
Pattern matching solutions like RegExp don't sound suitable for this sort of a categorization problem. You might want to consider clustering (k-means, etc.) - training a model to differentiate between ingredients and instructions. This can be done by labeling a number of recipes (the more the better), or using unsupervised ML by clustering line by line.
If you need to stick to RegExp for some reason, you could try keeping track of repeated words. It is a weak methodology, but ingredient names (Chiasamen, Quinoa) will be referenced again in the instructions, so you can match across multiple lines to find where the same word is repeated later on:
(?<=\b| )([^ ]+)(?= |$).+(\1)
If you run this in a loop, with some extra logic, you can find ingredient-instruction pairs and work through the document with silhouette information.
You might be able to take advantage of ingredient lines containing numeric data like numbers or words like "piece(s), sticks, leaves" which you might store in a dictionary. That can enrich the word boundary input matches.
I would reconsider using RegExp here at all...
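The dictionary idea above can be sketched without any ML. The example is in Python for illustration only (the logic ports directly to Node.js) and rests on loud assumptions: the first line is the title, UNITS is a hypothetical hand-made unit dictionary, and the preparation starts at the first line that begins with a bare step number followed by a word that is not a unit.

```python
import re

# Hypothetical unit dictionary; extend it for your own recipes.
UNITS = {"g", "kg", "ml", "l", "TL", "EL", "Prise", "Handvoll"}

# A bare integer followed by whitespace and a word, e.g. "1 Chiasamen".
STEP = re.compile(r"^(\d+)\s+(\S+)")

def is_step_start(line):
    """True if the line looks like a numbered step, not a quantity."""
    m = STEP.match(line)
    return bool(m) and m.group(2) not in UNITS

def join_lines(lines):
    """Join lines with spaces and undo end-of-line hyphenation."""
    text = ""
    for line in lines:
        if text.endswith("-") and line[:1].islower():
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text

def parse_recipe(raw):
    lines = [ln.strip() for ln in raw.split("\n") if ln.strip()]
    title, body = lines[0], lines[1:]
    # Index of the first step line; everything before it is an ingredient.
    split_at = next((i for i, ln in enumerate(body) if is_step_start(ln)),
                    len(body))
    return {
        "title": title,
        "items": body[:split_at],
        "preparation": join_lines(body[split_at:]),
    }

raw = ("Quinoa-Brot\n30 g Chiasamen\n1/2 TL Speisenatron\n"
       "Saft von 1/2 Zitrone\n1 Prise Salz\n"
       "1 Chiasamen mit 100 ml Wasser\nverrühren und 30 Minuten quel-\n"
       "len lassen.\n2 Quinoa mit der dreifachen\n"
       "Menge Wasser in einen Topf ge-\nben.")
parsed = parse_recipe(raw)
print(parsed)
```

This heuristic breaks for ingredients counted without a unit (for example "2 Eier" would be taken for a step), so it is a starting point rather than a general solution.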

How to fix my RE to get my expected arguments of group

I'm new to Python and am learning regex at the moment.
I wrote a regex that is supposed to find all phone numbers.
I think I did it right, but my code doesn't seem to be working correctly.
phoneRegex = re.compile(r'''(
(\d{2,3}|\(\d(2,3)\))? # first 2-3 digits
(\s|-|\.)? # -
(\d{3,4}) # second 3-4 digits
(\s|-|\.) # -
(\d{4}) # last 4 digits.
(\s*(ext|x|ext.)\s*(\d{3,4}))? # extension
)''', re.VERBOSE)
phoneRegex.findall('010 1234 5678 ext1234')
I am working through the automatetheboringstuff tutorials and have read the regex chapter three times.
If I have missed something minor that I should have read or considered, sorry for my haste. I have spent roughly 2 hours on this, and I would be glad for any suggested reading material or help.
Thanks in advance.
Expected result:
[('010 1234 5678 ext1234', '010', ' ', '1234', ' ', '5678', ' ext1234', 'ext', '1234')]
Actual result:
[('010 1234 5678 ext1234', '010', '', ' ', '1234', ' ', '5678', ' ext1234', 'ext', '1234')]
What is the third element ('') and where did it come from?
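A likely explanation: in \(\d(2,3)\) the part (2,3) is written with parentheses, so Python reads it as a capturing group that matches the literal text "2,3", not as the repetition count {2,3}. That extra group never takes part in this match, so findall reports it as ''. The sketch below compares the two variants; TEMPLATE and the AREA placeholder are only devices used here to avoid repeating the pattern.

```python
import re

# The pattern from the question, with the parenthesised area code
# alternative left as a placeholder.
TEMPLATE = r'''(
    (\d{2,3}|AREA)?                 # first 2-3 digits
    (\s|-|\.)?                      # separator
    (\d{3,4})                       # second 3-4 digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{3,4}))?  # extension
)'''

# As posted: "(2,3)" is a capturing group matching the text "2,3".
original = re.compile(TEMPLATE.replace("AREA", r"\(\d(2,3)\)"), re.VERBOSE)
# With braces: "{2,3}" means "two or three digits" and adds no group.
fixed = re.compile(TEMPLATE.replace("AREA", r"\(\d{2,3}\)"), re.VERBOSE)

text = "010 1234 5678 ext1234"
print(original.groups, original.findall(text))
print(fixed.groups, fixed.findall(text))
```

The groups attribute shows the difference directly: the posted pattern has ten capturing groups, the variant with braces has nine.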

How do I declare and use a variable in the yaml file that is formatted for pyresttest?

So, a brief description of what I want, what my issue is, and what I have tried.
I want to declare and use a dictionary variable for my tests in pyrest, specifically for the [url, body] section so that I can conduct my POST tests targeting a specific endpoint and with a preformatted body.
Here is how mytest.yml file is structured:
- data:
    - id: 63
    - rate: 25
    ... a sizable set of field for reasons ...
    - type: lab_test_authorization
    - modified_at: ansible_date_time.datetime # Useful way to generate
- test:
    - url: "some-valid-url/{the_url_question}" # data['special_key']
    - method: 'POST'
    - headers : {etc..etc}
    - body: '{ "data": ${the_body_question} }' # data (the content)
The problem starts with my lack of understanding of why pyrest does not (if that is true) support dictionary mappings. I understand YAML supports this feature, but I am not sure whether pyrest can parse it. Knowing how to declare and use a dictionary variable in my url and body tags would be very helpful.
As of right now, if I try to convert my data sequence into a data dictionary, I get an error stating:
yaml.parser.ParserError: while parsing a block mapping
in "<unicode string>", line 4, column 1:
data:
^
expected <block end>, but found '-'
in "<unicode string>", line 36, column 1:
- config:
I'm pretty sure there are gaps in my knowledge regarding how yaml and pyresttest interact with each other, so any insight would be greatly appreciated.

Search in directory of files based on keywords from another file

Perl Newbie here and looking for some help.
I have a directory of files and a "keywords" file which has the attributes to search for and the attribute type.
For example:
Keywords.txt
Attribute1 boolean
Attribute2 boolean
Attribute3 search_and_extract
Attribute4 chunk
For each file in the directory, I have to:
lookup the keywords.txt
search based on Attribute type
something like the below.
IF attribute_type = boolean THEN
    search for attribute;
    set found = Y if attribute found;
ELSIF attribute_type = search_and_extract THEN
    extract string where attribute is Found
ELSIF attribute_type = chunk THEN
    extract the complete chunk of paragraph where attribute is found.
This is what I have so far and I'm sure there is a more efficient way to do this.
I'm hoping someone can guide me in the right direction to do the above.
Thanks & regards,
SiMa
# Reads attributes from config file
# First set boolean attributes. IF keyword is found in text,
# variable flag is set to Y else N
# End Code: For each text file in directory loop.
# Run the below for each document.
use strict;
use warnings;
# open Doc
open(my $doc_fh, '<', 'Final_CLP.txt') or die "Cannot open Final_CLP.txt: $!";
while (my $doc_line = <$doc_fh>) {
    chomp $doc_line;
    # open the file
    open(my $cfg_fh, '<', 'attribute_config.txt') or die "Cannot open attribute_config.txt: $!";
    while (my $cfg_line = <$cfg_fh>) {
        chomp $cfg_line;
        my ($attribute, $attribute_type) = split /\t/, $cfg_line;
        my $is_boolean = ($attribute_type eq "boolean") ? "Y" : "N";
        # For each boolean attribute, check if the keyword exists
        # in the file and return Y or N
        if ($is_boolean eq "Y") {
            print "Yes\n";
            # search for keyword in doc and assign values
        }
        print "Attribute: $attribute\n";
        print "Attribute_Type: $attribute_type\n";
        print "is_boolean: $is_boolean\n";
        print "-----------\n";
    }
    close($cfg_fh);
}
close($doc_fh);
exit;
It is a good idea to start your specs/question with a story ("I have a ..."). But
such a story - whether true or made up, because you can't disclose the truth -
should give
a vivid picture of the situation/problem/task
the reason(s) why all the work must be done
definitions for uncommon(ly used) terms
So I'd start with: I'm working in a prison and have to scan the emails
of the inmates for
names (like "Al Capone") mentioned anywhere in the text; the director
wants to read those mails in toto
order lines (like "weapon: AK 4711 quantity: 14"); the ordnance
officer wants that info to calculate the amount of ammunition and
rack space needed
paragraphs containing 'family'-keywords like "wife", "child", ...;
the parson wants to prepare her sermons efficiently
Taken by itself, each of the terms "keyword" (~running text) and
"attribute" (~structured text) may be 'clear', but if both are applied
to "the X I have to search for", things get mushy. Instead of general ("chunk")
and technical ("string") terms, you should use 'real-world' (line) and
specific (paragraph) words. Samples of your input:
From: Robin Hood
To: Scarface
Hi Scarface,
tell Al Capone to send a car to the prison gate on sunday.
For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Regards
Robin
and your expected output:
--- Robin.txt ----
keywords:
Al Capone: Yes
Billy the Kid: No
Scarface: Yes
order lines:
knife:
knife: Bowie quantity: 8
machine gun:
stinger rocket:
weapon:
weapon: AK 4711 quantity: 14
social relations paragaphs:
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Pseudo code should begin at the top level. If you start with
for each file in folder
    load search list
    process current file('s content) using search list
it's obvious that
load search list
for each file in folder
    process current file using search list
would be much better.
Based on this story, examples, and top level plan, I would try to come
up with proof of concept code for a simplified version of the "process
current file('s content) using search list" task:
given file/text to search in and list of keywords/attributes
print file name
print "keywords:"
for each boolean item
    print boolean item text
    if found anywhere in whole text
        print "Yes"
    else
        print "No"
print "order line:"
for each line item
    print line item text
    if found anywhere in whole text
        print whole line
print "social relations paragaphs:"
for each paragraph
    for each social relation item
        if found
            print paragraph
            no need to check for other items
first implementation attempt:
use Modern::Perl;
#use English qw(-no_match_vars);
use English;
exit step_00();
sub step_00 {
    # given file/text to search in
    # (blank lines separate the paragraphs that split /\n\n/ relies on)
    my $whole_text = <<"EOT";
From: Robin Hood
To: Scarface

Hi Scarface,

tell Al Capone to send a car to the prison gate on sunday.

For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in
Alcatraz.

Regards
Robin
EOT
    # print file name
    say "--- Robin.txt ---";
    # print "keywords:"
    say "keywords:";
    # for each boolean item
    for my $bi ("Al Capone", "Billy the Kid", "Scarface") {
        # print boolean item text
        printf " %s: ", $bi;
        # if found anywhere in whole text
        if ($whole_text =~ /$bi/) {
            # print "Yes"
            say "Yes";
        # else
        } else {
            # print "No"
            say "No";
        }
    }
    # print "order line:"
    say "order lines:";
    # for each line item
    for my $li ("knife", "machine gun", "stinger rocket", "weapon") {
        # print line item text
        # if found anywhere in whole text
        if ($whole_text =~ /^$li.*$/m) {
            # print whole line
            say " ", $MATCH;
        }
    }
    # print "social relations paragaphs:"
    say "social relations paragaphs:";
    # for each paragraph
    for my $para (split /\n\n/, $whole_text) {
        # for each social relation item
        for my $sr ("wife", "son", "husband") {
            # if found
            if ($para =~ /$sr/) {
            ## if ($para =~ /\b$sr\b/) {
                # print paragraph
                say $para;
                # no need to check for other items
                last;
            }
        }
    }
    return 0;
}
output:
perl 16953439.pl
--- Robin.txt ---
keywords:
Al Capone: Yes
Billy the Kid: No
Scarface: Yes
order lines:
knife: Bowie quantity: 8
weapon: AK 4711 quantity: 14
social relations paragaphs:
tell Al Capone to send a car to the prison gate on sunday.
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Such (premature) code helps you to
clarify your specs (Should not-found keywords go into the output?
Is your search list really flat or should it be structured/grouped?)
check your assumptions about how to do things (Should the order line
search be done on the array of lines of the whole text?)
identify topics for further research/rtfm (eg. regex (prison!))
plan your next steps (folder loop, read input file)
(in addition, people in the know will point out all my bad practices,
so you can avoid them from the start)
Good luck!
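For comparison, the same top-level plan (load the search list once, then process each file) can be sketched in Python. This is only an illustration under assumptions: the tab-separated "name<TAB>type" format follows the question, "search_and_extract" is taken to mean "return the matching lines", "chunk" is taken to mean "return the matching paragraphs", and all function names are made up here.

```python
import re
from pathlib import Path

def load_search_list(text):
    """Parse 'name<TAB>type' lines into (name, type) pairs."""
    pairs = []
    for line in text.splitlines():
        if line.strip():
            name, kind = line.split("\t")
            pairs.append((name.strip(), kind.strip()))
    return pairs

def process(text, search_list):
    """Apply every search item to one document."""
    paragraphs = re.split(r"\n\s*\n", text)
    result = {}
    for name, kind in search_list:
        # Word boundaries keep "son" from matching inside "prison".
        word = re.compile(r"\b" + re.escape(name) + r"\b")
        if kind == "boolean":
            result[name] = "Y" if word.search(text) else "N"
        elif kind == "search_and_extract":
            result[name] = [ln for ln in text.splitlines() if word.search(ln)]
        elif kind == "chunk":
            result[name] = [p for p in paragraphs if word.search(p)]
    return result

def process_folder(folder, search_list):
    """Process every .txt file in a folder with one shared search list."""
    return {p.name: process(p.read_text(), search_list)
            for p in sorted(Path(folder).glob("*.txt"))}

KEYWORDS = ("Al Capone\tboolean\nBilly the Kid\tboolean\n"
            "weapon\tsearch_and_extract\nson\tchunk\n")
EMAIL = ("From: Robin Hood\nTo: Scarface\n\nHi Scarface,\n\n"
         "tell Al Capone to send a car to the prison gate on sunday.\n\n"
         "For the riot we need:\nweapon: AK 4711 quantity: 14\n"
         "knife: Bowie quantity: 8\n\n"
         "Tell my wife in Folsom to send some money to my son in\nAlcatraz.\n\n"
         "Regards\nRobin\n")

result = process(EMAIL, load_search_list(KEYWORDS))
print(result)
```

Unlike the first Perl attempt, the word-boundary match does not report the "prison" paragraph for "son".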

Resources