Ideas to extract specific invoice PDF data for different formats and convert to Excel

I am currently working on a digitalisation project that consists of extracting specific information from PDF-formatted electricity invoices. Once the data is extracted, I would like to store it in an Excel spreadsheet.
The objectives are the following:
First of all, the data to be extracted is shown here:
https://i.stack.imgur.com/6RLo2.png
In this case, the data to be extracted is the information outlined in red: the CUPS code, the total amount, and the consumed electricity per period (P1-P6).
Once this is extracted, I would like to display this in an Excel Spreadsheet.
Could you please give me any ideas/tips regarding the extraction of this data? I understand that OCR software would do this best, but I do not know how I could extract this specific information.
Thanks for your help and advice.

If there is no text data in your PDF, then I don't believe there is a clean and consistent way to do this yet. If your invoice templates always have the same format and resolution, then the pixel coordinates of the text positions should be the same.
This means you can create a cropped image containing only the text you're interested in, then use your OCR tool to extract the text from that crop, and you have your data field. You would have to do this for each data field you want to extract.
This only works for invoices that always have the same format and resolution, so scanned invoices won't work, and dynamic tables make things exponentially more complex as well.
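A minimal sketch of that crop-then-OCR idea in Python, assuming pdf2image (which needs Poppler) and pytesseract/Tesseract are installed; the pixel boxes and field names are placeholders you would measure from your own template:
from pdf2image import convert_from_path  # renders PDF pages to PIL images
import pytesseract

# Hypothetical pixel boxes (left, upper, right, lower) for each field.
FIELD_BOXES = {
    "cups": (100, 200, 500, 240),
    "total_amount": (400, 800, 600, 840),
}

def extract_fields(pdf_path):
    page = convert_from_path(pdf_path, dpi=300)[0]   # first page only
    fields = {}
    for name, box in FIELD_BOXES.items():
        crop = page.crop(box)                        # cut out just this region
        fields[name] = pytesseract.image_to_string(crop).strip()
    return fields

print(extract_fields("invoice.pdf"))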

I would check whether it's possible to simply extract the text using pdftotext first, then build my cmd text parsing around that output and loop from file to file.
I don't have your sample to test, so you would need to adjust this to suit your bills:
pdftotext -nopgbrk -layout electric.pdf - |findstr /i "cups factura" & pdftotext -nopgbrk -layout -y 200 -W 300 -H 200 electric.pdf
Personally I would run the two parts as separate cycles. For the first pair, replace the comma with a safe CSV character such as *, then inject a comma at the large gap to make them a two-column CSV (perhaps replacing the mis-encoded euro sequence Γé¼ with ,€ if necessary, since your captured text may already be in euros).
For the second group I would inject commas by numeric position to form the desired columns. I only demo 4 columns by 2 rows, but you want 7 columns by 4 rows, so adjust those values to suit. However, you can use any language you are familiar with, such as VBA, to split the output however you want to import it into Excel.
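The same idea can also be scripted in Python rather than cmd, looping from file to file, calling pdftotext and collapsing the large gaps into columns. This is only a rough sketch; the folder name, keywords and column handling are assumptions you would adapt to your bills:
import csv, glob, re, subprocess

rows = []
for pdf in glob.glob("invoices/*.pdf"):
    # same extraction as the cmd line above: plain text, layout preserved
    text = subprocess.run(["pdftotext", "-nopgbrk", "-layout", pdf, "-"],
                          capture_output=True, text=True, check=True).stdout
    for line in text.splitlines():
        if re.search(r"cups|factura", line, re.I):
            # collapse runs of 2+ spaces into column breaks
            rows.append([pdf] + re.split(r"\s{2,}", line.strip()))

with open("invoices.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)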

In Excel you may want to use Power Query to read the PDF:
https://learn.microsoft.com/en-us/power-query/connectors/pdf
Then you can further process the result within Power Query to extract the data you want.
If you are interested in further data analysis after extraction you may want to consider KNIME as well:
https://hub.knime.com/jyotendra/spaces/Public/latest/Reading%20PDF%20and%20extracting%20information~pNh3GdorF0Z9WGm8
From there export to Excel is also supported.
edit:
After extracting, regex helps to filter for the specific data: e.g. look for key words, for the length and structure of the data item (e.g. the CUPS number), for whether it is a currency with decimals, etc.
edit 2: regex in Excel
How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
e.g. look for a new line starting with CUPS followed by a sequence of 15 characters (if you have more details, you can specify the matching pattern further: e.g. starting with E, or the 5th character being X or 5, etc.)
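For illustration, a couple of such patterns in Python (the sample text and the exact code length are invented, so tighten the expressions to match what your invoices really contain):
import re

text = """CUPS ES0021000001234567AB
TOTAL FACTURA 123,45 €"""

cups = re.search(r"^CUPS\s+([A-Z0-9]{15,22})", text, re.M)   # code after the CUPS keyword
total = re.search(r"TOTAL FACTURA\s+([\d.,]+)", text)        # amount, Spanish decimal comma
print(cups.group(1) if cups else None, total.group(1) if total else None)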

Related

How to grep csv documents to APPEND info and keep the old data intact?

I have huge_database.csv like this:
name,phone,email,check_result,favourite_fruit
sam,64654664,sam#example.com,,
sam2,64654664,sam2#example.com,,
sam3,64654664,sam3#example.com,,
[...]
===============================================
then I have 3 email lists:
good_emails.txt
bad_emails.txt
likes_banana.txt
the contents of which are:
good_emails.txt:
sam#example.com
sam3#example.com
bad_emails.txt:
sam2#example.com
likes_banana.txt:
sam#example.com
sam2#example.com
===============================================
I want to do some grep, so that at the end the output will be like this:
sam,64654664,sam#example.com,y,banana
sam2,64654664,sam2#example.com,n,banana
sam3,64654664,sam3#example.com,y,
I don't mind doing it in multiple steps manually and, perhaps, with some complex procedure such as copy-pasting into multiple files. What matters to me is reliability, and most importantly the ability to process very LARGE csv files with more than 1M lines.
What must also be noted is that the lists I will "grep" to add data to some of the columns will most of the time affect at most 20% of the total csv file rows, meaning the remaining 80% must stay intact and, if possible, not even be displaced from their current order.
I would also like to note that I will be using a program called EmEditor rather than spreadsheet software like Excel, due to its speed and the fact that Excel simply cannot handle large csv files.
How can this be done?
Will appreciate any help.
Thanks.
Googling, trial and error, and grabbing my head in frustration.
Filter all good emails with Filter. Open Advanced Filter. Next to the Add button is Add Linked File. Add the good_emails.txt file, set to the email column, and click Filter. Now only records with good emails are shown.
Select column 4 and type y. Now do the same for bad emails and change the column to n. Follow the same steps and change the last column values to the correct string.
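If the files ever outgrow that workflow, the same logic can be done in one streaming pass with Python's csv module, which keeps row order and leaves non-matching rows untouched; the file names and column positions follow the question, so adjust them if your layout differs:
import csv

def load(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

good, bad, banana = load("good_emails.txt"), load("bad_emails.txt"), load("likes_banana.txt")

with open("huge_database.csv", newline="", encoding="utf-8") as src, \
     open("huge_database_out.csv", "w", newline="", encoding="utf-8") as dst:
    reader, writer = csv.reader(src), csv.writer(dst)
    writer.writerow(next(reader))          # header row passes through unchanged
    for row in reader:
        email = row[2]
        if email in good:
            row[3] = "y"
        elif email in bad:
            row[3] = "n"
        if email in banana:
            row[4] = "banana"
        writer.writerow(row)               # untouched rows keep their order and content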

Extracting text in excel

I have some text which I receive daily that I need to separate. I have hundreds of lines similar to the extract below:
COMMODITY PRICE DIFFERENTIAL: FEB50-FEB40 (APR): COMPANY A OFFERS 1000KB AT $0.40
I need to extract individual snippets from this text, each into a separate cell; the result needs to be the date, month, company, size, and price. In this case, the result would be:
FEB50-40
APR
COMPANY A
100
0.40
The issue I'm struggling with is uniformity. For example, one line might have FEB50-FEB40, another FEB5-FEB40, or FEB50-FEB4. Another example giving me difficulty is that some rows might have 'COMPANY A' and others 'COMPANYA' (one word instead of two).
Any ideas? I've been trying combinations of the formulas below, but I'm not able to get uniform results.
=TRIM(MID(SUBSTITUTE($D7," ",REPT(" ",LEN($D7))), (5)*LEN($D7)+1,LEN($D7)))
=MID($D7,20,21-10)
=TRIM(RIGHT(SUBSTITUTE($D6,"$",REPT("$",2)),4))
Sometimes I get
'FEB40-50(' or 'FEB40-FEB5'
when it should be
'FEB40-FEB50'
Thank you to whoever is able to help.
You might hit the limits of formulas with this scenario, but Power Query can still handle it.
As I see it, you want to apply the following logic to extract text from this string:
COMMODITY PRICE DIFFERENTIAL: FEB50-FEB40 (APR): COMPANY A OFFERS 1000KB AT $0.40
text after the first : and before the first (
text between the brackets
text after the word OFFERS and before AT
text after AT
These can be easily translated into several "Split" scenarios inside Power Query.
split by the custom delimiter ": " - that's colon and space - at each occurrence
remove the first column
split the new first column by " (" - that's space and open bracket - at the left-most occurrence
replace ) with nothing in the second column
split the third column by the delimiter OFFERS
split the new fourth column by the delimiter AT
The screenshot shows the input data and the result in the Power Query editor after renaming the columns and before loading the query into the worksheet.
Once you have loaded the query, you can add / remove data in the input table and simply refresh the query to get your results. No formulas, just clicking ribbon commands.
You can take this further by removing the "KB" from the size column, converting it to a number, and dividing it by 100. Your business processing logic will drive what you want to do. Just take it one step at a time.
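For comparison, the same four extraction rules can be sketched as a single regular expression in Python; the group names are just labels mirroring the columns described above, not anything prescribed:
import re

line = "COMMODITY PRICE DIFFERENTIAL: FEB50-FEB40 (APR): COMPANY A OFFERS 1000KB AT $0.40"
pattern = re.compile(
    r":\s*(?P<diff>[^(]+?)\s*"        # after the first colon, before the bracket
    r"\((?P<month>[^)]+)\):\s*"       # text between the brackets
    r"(?P<company>.+?)\s+OFFERS\s+"   # before OFFERS (handles COMPANYA as one word too)
    r"(?P<size>\S+)\s+AT\s+"          # between OFFERS and AT
    r"\$(?P<price>[\d.]+)"            # after AT
)
m = pattern.search(line)
print(m.groupdict() if m else "no match")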

How to convert csv text to numbers as well as manipulate almost non-machine readable data in Power Bi?

I have a sales datasheet in csv that I load into Power BI. The monetary values on the sheet appear as numbers with decimal places (e.g. 123.0000), but Power BI reads them as text. When I try to convert this to a fixed decimal number ($), it throws an error. How do I convert this safely to ($)? There are also multiple columns with these values in them. How would I convert all of them in the easiest way, given that there are other columns with just normal numbers in between the monetary columns? (1 x SOH column and then 1 x Net column - this repeats.)
On top of this, the datasheet is laid out in a way that makes it difficult to shape the data into a form that is easy for Power BI to read. The header rows begin with the SKU code and description, but then move over to each individual retail store by location, broken up into SOH and Net, per store per column. I've been racking my brain on this for ages and can't seem to find a simple way around it. Any tips would be great.
For the conversion to ($), I went into the csv sheet, altered the format of the numbers and saved it as a .xml, but the issue is that I would have to repeat this tedious step every time I need to pull data, which is a lot.
For the layout of the original spreadsheet, I tried unpivoting the data in Power BI, which does work. However, it is still sectioned off by Net and SOH, which means I have to add a slicer in just to see Net or SOH on its own, instead of having them as separate entries.
I expect the output to give me fixed decimal numbers in the first place, but all I get is an error when trying to convert the numbers to $.
With the unpivoting, I can manipulate the data by store, which is great and helps, but I have to create a separate sheet with the store IDs on it so that I can "filter" them when I want to switch between them (again, a slicer is necessary). I expect to be able to look at each store individually as well as overall, and likewise to look at Net and SOH individually, by store and as a whole. From there I can input my cost sheet and calculate the GP.
I have attached a picture of the data. I can drop a sample sheet somewhere as well if necessary. I just need to know where.
I figured it out. All you have to do is change the regional settings, not on your laptop but within Power BI itself.
Go to File > Options and settings > Options > Regional settings and change the locale to English (United Kingdom) (that's the region that worked for me and fixed everything automatically).
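If you would rather reshape the file before it reaches Power BI at all, a small pandas pass can coerce the monetary columns to numbers and unpivot the per-store columns; the column names used here ("SKU", "Description", and headers ending in "SOH"/"Net") are assumptions standing in for your real headers:
import pandas as pd

df = pd.read_csv("sales.csv")

# Force every SOH/Net column to a number, whatever way it was typed.
value_cols = [c for c in df.columns if c.endswith(("SOH", "Net"))]
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")

# Unpivot so each row is SKU x store x measure, which Power BI models easily.
long = df.melt(id_vars=["SKU", "Description"], value_vars=value_cols,
               var_name="store_measure", value_name="value")
long[["store", "measure"]] = long["store_measure"].str.rsplit(" ", n=1, expand=True)
long.drop(columns="store_measure").to_csv("sales_long.csv", index=False)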

Combining rows of long text from Excel for text analytics use

This is sort of related to Outputting Excel rows to a series of text files
Similar to what he asked previously, I have a slightly more complex dataset on my hands.
Each row in Excel is a paragraph for me, and I have a column for the text, a column that states the document ID, as well as an order ID (smallest to largest).
I would like to concatenate the text for each document ID together, with line breaks in between. However, some documents are so long that the result would exceed the Excel character limit.
I also have a series of variables that describe the document, title, dates etc.
I have tried Google Sheets and OpenOffice, but they do not seem to work either.
How can I perform this "vertical concatenate"? I am ok with either one huge CSV file, a collection of text files, or maybe an Access file?
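One possible route around the cell limit, sketched in Python on the assumption that the sheet is first exported to CSV with columns named doc_id, order and text (hypothetical names), is to group the paragraphs outside Excel and write one text file per document:
import csv
from collections import defaultdict

docs = defaultdict(list)
with open("paragraphs.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        docs[row["doc_id"]].append((int(row["order"]), row["text"]))

for doc_id, parts in docs.items():
    parts.sort()                                    # smallest order ID first
    with open(f"{doc_id}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(text for _, text in parts))   # line breaks between paragraphs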

How Do I separate data written in the same column?

So I've got a bunch of files in Excel that hold measurements of electric current versus voltage, with around 2000 points each. The problem is that everything is saved in the same column. It's supposed to be such that column A has all the voltages and column B all the currents; right now both voltages and currents are in column A.
I really don't want to separate 2000 points across 20 files by hand, so is there a good way to separate them? The good news is that the points are separated by a space, e.g. [1.001 2.002] with 1.001 being a voltage and 2.002 being a current, or by a negative sign if there is one, e.g. [-1.001-2.002], so I feel like a simple program can fix this up. I know how to code in C and MATLAB (the goal is also to make the result MATLAB-readable), but what's the best way to resolve this if it could also be done with an Excel macro?
Are you on a unix/linux?
Save the file as CSV.
Run this from the terminal.
$ sed -i.bak "s/ /,/g" my-file.csv
Your original file will be saved as my-file.csv.bak. The my-file.csv will now be comma-delimited instead of space-delimited. (On GNU sed the backup suffix must be attached to -i, as shown.)
Open the CSV in excel.
One option would be to open the sheet in Power Query and split the column based on a delimiter. Assuming you have something to split on (a space, comma, etc.), this will give you two new columns. In the Query Editor, go to Split Column and then By Delimiter. Source: https://support.office.com/en-au/article/Split-a-column-of-text-Power-Query-5282d425-6dd0-46ca-95bf-8e0da9539662?ui=en-US&rs=en-AU&ad=AU#__toc354843578
Edit: If you don't have Power Query, it's a free add-on for Excel: https://www.microsoft.com/en-gb/download/details.aspx?id=39379
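For the negative-sign case ([-1.001-2.002]) that a plain space-to-comma replacement won't split, a short Python pass that matches signed decimals directly may be easier; the folder and file naming are assumptions, and the output is a two-column CSV that MATLAB can read with readmatrix or csvread:
import csv, glob, re

NUMBER = re.compile(r"-?\d+\.\d+")        # signed decimal, keeps the minus with its value

for path in glob.glob("data/*.csv"):
    with open(path, encoding="utf-8") as src, \
         open(path.replace(".csv", "_split.csv"), "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for line in src:
            nums = NUMBER.findall(line)
            if len(nums) == 2:            # voltage, current
                writer.writerow(nums)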

Resources