Combining rows of long text from Excel for text analytics use - excel

This is sort of related to Outputting Excel rows to a series of text files
Similar to what was asked there, I have a slightly more complex dataset on my hands.
Each row in Excel is a paragraph for me, and I have a column for the text, a column that states the document ID, and an order ID (smallest to largest).
I would like to concatenate the text for each document ID together, with line breaks in between. However, some documents are so long that the result will exceed the Excel character limit per cell.
I also have a series of variables that describe the document: title, dates, etc.
I have tried Google Sheets and OpenOffice, but neither seems to work.
How can I perform this "vertical concatenate"? I am ok with either one huge CSV file, a collection of text files, or maybe an Access file?
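One way around the cell limit is to do the concatenation outside the spreadsheet. The following is a minimal sketch, assuming the sheet has been exported to CSV and that the columns are named `doc_id`, `order_id`, and `text` (hypothetical names, not taken from the question). It groups paragraphs by document, sorts them by order ID, and joins them with line breaks.

```python
import csv
import io
from collections import defaultdict
from pathlib import Path


def concatenate_paragraphs(csv_text: str) -> dict:
    """Group paragraphs by document ID and join them in order_id order."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Column names are assumptions; adjust them to the real export.
        groups[row["doc_id"]].append((int(row["order_id"]), row["text"]))
    # Sort each document's paragraphs by order ID, then join with line breaks.
    return {
        doc_id: "\n".join(text for _, text in sorted(paragraphs, key=lambda p: p[0]))
        for doc_id, paragraphs in groups.items()
    }


def write_documents(documents: dict, out_dir: str) -> None:
    """Write one text file per document ID (not called in this sketch)."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    for doc_id, body in documents.items():
        (target / f"{doc_id}.txt").write_text(body, encoding="utf-8")


# Made-up sample data standing in for the exported sheet.
sample = (
    "doc_id,order_id,text\n"
    "A,2,Second paragraph of A.\n"
    "B,1,Only paragraph of B.\n"
    "A,1,First paragraph of A.\n"
)
documents = concatenate_paragraphs(sample)
```

Writing one text file per document sidesteps any per-cell limit; the descriptive variables (title, dates) could be kept in a separate CSV keyed by the same document ID.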

Related

Ideas to extract specific invoice pdf data for different formats and convert to Excel

I am currently working on a digitalisation project which consists in extracting specific information from pdf-formatted electricity invoices. Once the data is extracted, I would like to store it in an Excel spreadsheet.
The objectives are the following:
First of all, the data to be extracted would be the following:
https://i.stack.imgur.com/6RLo2.png
In this case, the data to be extracted is the information surrounded in red. This would be the CUPS, the total amount and the consumed electricity per period (P1-P6).
Once this is extracted, I would like to display this in an Excel Spreadsheet.
Could you please give me any ideas/tips regarding the extraction of this data? I understand that OCR software would do this best, but do not know how I could extract this specific information.
Thanks for your help and advice.
If there is no text data in your PDF then I don't believe there is a clean and consistent way to do this yet. If your invoice templates are always the same format and resolution, then the pixel coordinates of the text positions should be the same.
This means that you can create a cropped image with only the text you're interested in. Then you can use your OCR tool to extract all the text and you have extracted your data field. You would have to do this for all the data fields that you want to extract.
This would only work for invoices that always have the same format and resolution. So scanned invoices wouldn't work, and dynamic tables make things exponentially more complex as well.
I would first check whether it is possible to simply extract the text with pdftotext, then build the cmd text parsing around that output and loop over the files one by one.
I don't have your sample to test, so you would need to adjust this to suit your bills:
pdftotext -nopgbrk -layout electric.pdf - |findstr /i "cups factura" & pdftotext -nopgbrk -layout -y 200 -W 300 -H 200 electric.pdf
Personally, I would treat the two parts as separate passes. For the first pair, replace the , with a safe CSV character such as *, then inject a , at the large gap to make a 2-column CSV (perhaps replace the Γé¼, which is a mis-decoded € sign, with ,€ if necessary, since your captured text may already be in euros).
For the second group I would probably inject a , by numeric position to form the desired columns. I only demo 4 columns by 2 rows, but you want 7 columns by 4 rows, so adjust those values to suit. However, you can use any language you are familiar with, such as VBA, to split the text however you want before importing into Excel.
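As an alternative to hand-placing commas, the "large gap" idea can be sketched in Python. This is only an illustration under assumptions: the sample lines are made up (they are not real `pdftotext -layout` output from the invoice), and a run of two or more spaces is assumed to mark a column boundary. The `csv` module quotes fields containing a decimal comma, which is a different technique from swapping the comma for `*`.

```python
import csv
import io
import re


def layout_line_to_row(line: str) -> list:
    """Split one line of layout-preserved text at runs of 2+ whitespace characters."""
    return re.split(r"\s{2,}", line.strip())


def rows_to_csv(lines) -> str:
    """Turn non-empty layout lines into CSV text, quoting fields that contain commas."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        if line.strip():
            writer.writerow(layout_line_to_row(line))
    return buffer.getvalue()


# Hypothetical lines; the CUPS value is a placeholder, not a real code.
sample_lines = [
    "CUPS            ES0000000000000000XX",
    "Total factura          123,45 €",
]
csv_text = rows_to_csv(sample_lines)
```

Splitting on gaps is fragile when a value itself contains double spaces, so splitting by fixed character positions may still be needed for the period table.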
In Excel you may want to use PowerQuery to read the pdf:
https://learn.microsoft.com/en-us/power-query/connectors/pdf
Then you can further process to extract the data you want within PowerQuery.
If you are interested in further data analysis after extraction you may want to consider KNIME as well:
https://hub.knime.com/jyotendra/spaces/Public/latest/Reading%20PDF%20and%20extracting%20information~pNh3GdorF0Z9WGm8
From there export to Excel is also supported.
edit:
After extracting, regex helps to filter for the specific data, e.g. by looking for keywords, the length and structure of the data item (e.g. the CUPS number), whether it is a currency amount with decimals, etc.
edit 2: regex in Excel
How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
e.g. look for a new line starting with CUPS followed by a sequence of 15 characters (if you have more details, you can make the matching pattern more specific: e.g. starting with E, or the 5th character being X or 5, etc.)
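The same kind of pattern can be tried outside Excel. Here is a minimal Python sketch, assuming the extracted text has the code on a line that starts with the word CUPS, and treating the length (15 here, as in the example above) as a parameter to adjust. The sample text and the code value in it are made up.

```python
import re


def find_cups(text: str, length: int = 15):
    """Return the first code of exactly `length` characters on a line starting with CUPS."""
    # ^ with re.MULTILINE anchors at the start of each line;
    # \W* skips separators such as ":" or spaces after the keyword;
    # \b rejects runs that are longer than `length`.
    pattern = re.compile(rf"^\s*CUPS\W*([A-Z0-9]{{{length}}})\b", re.MULTILINE)
    match = pattern.search(text)
    return match.group(1) if match else None


# Hypothetical extracted text, not taken from a real invoice.
sample_text = "Factura 001\nCUPS: E234X67890ABCDE\nTotal 10,00\n"
code = find_cups(sample_text)
```

Tightening the character class (for example, requiring a leading E) follows the same idea of making the pattern more specific as more is known about the code.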

Extracting text in excel

I have some text which I receive daily that I need to separate. I have hundreds of lines similar to the extract below:
COMMODITY PRICE DIFFERENTIAL: FEB50-FEB40 (APR): COMPANY A OFFERS 1000KB AT $0.40
I need to extract individual snippets from this text, each in a separate cell. The result needs to be the date, month, company, size, and price. In this case, the result would be:
FEB50-40
APR
COMPANY A
100
0.40
The issue I'm struggling with is uniformity. For example, one line might have FEB50-FEB40, another FEB5-FEB40, or FEB50-FEB4. Another example giving me difficulty is that some rows might have 'COMPANY A' and others 'COMPANYA' (one word instead of two).
Any ideas? I've been trying combinations of the below but I'm not able to have uniform results.
=TRIM(MID(SUBSTITUTE($D7," ",REPT(" ",LEN($D7))), (5)*LEN($D7)+1,LEN($D7)))
=MID($D7,20,21-10)
=TRIM(RIGHT(SUBSTITUTE($D6,"$",REPT("$",2)),4))
Sometimes I get
'FEB40-50(' or 'FEB40-FEB5'
when it should be
'FEB40-FEB50'
Thank you to anyone who is able to help.
You might get to the limits of formulas with this scenario, but with Power Query you can still work.
As I see it, you want to apply the following logic to extract text from this string:
COMMODITY PRICE DIFFERENTIAL: FEB50-FEB40 (APR): COMPANY A OFFERS 1000KB AT $0.40
text after the first : and before the first (
text between the brackets
text after the word OFFERS and before AT
text after AT
These can be easily translated into several "Split" scenarios inside Power Query.
Split by the custom delimiter ": " (that's colon and space) at each occurrence
Remove the first column
Split the new first column by " (" (that's space and bracket) at the leftmost occurrence
Replace ) with nothing in the second column
Split the third column by the delimiter OFFERS
Split the new fourth column by the delimiter AT
The screenshot shows the input data and the result in the Power Query editor after renaming the columns and before loading the query into the worksheet.
Once you have loaded the query, you can add / remove data in the input table and simply refresh the query to get your results. No formulas, just clicking ribbon commands.
You can take this further by removing the "KB" from the column, converting it to a number, and dividing it by 100. Your business processing logic will drive what you want to do. Just take it one step at a time.
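For readers who want to check the split logic outside Power Query, the same sequence of splits can be sketched in Python. This is an illustration under the assumption that every line contains two ": " separators, a " (" before the month, and the words OFFERS and AT surrounded by spaces; the field names are hypothetical.

```python
def split_offer(line: str) -> dict:
    """Apply the split steps described above to one input line."""
    # Split on ": " (at most twice) and drop the leading label.
    _, spread_part, offer_part = line.split(": ", 2)
    # Split at the first " (" and remove the closing bracket.
    spread, _, month = spread_part.partition(" (")
    month = month.replace(")", "")
    # Split on the word OFFERS, then on the word AT.
    company, _, rest = offer_part.partition(" OFFERS ")
    size, _, price = rest.partition(" AT ")
    return {
        "spread": spread,
        "month": month,
        "company": company,
        "size": size,
        "price": price,
    }


line = "COMMODITY PRICE DIFFERENTIAL: FEB50-FEB40 (APR): COMPANY A OFFERS 1000KB AT $0.40"
fields = split_offer(line)
```

Because the splits key on delimiters rather than character positions, variations such as FEB5-FEB40 or COMPANYA pass through unchanged.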

Sorting txt data files while importing in Excel Data Query

I am trying to enter approximately 190 txt datafiles in Excel using the New Query tool (Data->New Query->From File->From Folder). In the Windows explorer the data are properly ordered: the first being 0summary, the second 30summary etc.
However, when entering them through the query tool the files are sorted as shown in the picture (see line 9 for example, you will see that the file is not in the right position):
The files are sorted based on the first digit instead of the value represented. Is there a solution to this issue? I have tried putting a space between the number and the word summary, but that didn't work either. I saw online that Excel doesn't recognize the text within "" or after /, but I am not allowed to save the text files with those symbols in their names in Windows. Even when I removed the word summary, the problem was not fixed. Any suggestions?
If all your names include the word Summary:
You can add a column via "Extract" / "Text Before Delimiter", enter "Summary" as the delimiter, change the column type to a number, and sort on that column.
If the only numbers are those you wish to sort on, you can
add a custom column with just the numbers
Change the data type to whole number
sort on that.
The formula for the custom column:
Text.Select([Name],{"0".."9"})
If the alpha portion varies, and you need to sort on that also, you can do something similar adding another column for the alpha portion, and sorting on that.
If there might be digits after the leading digits upon which you want to sort, then use the following formula for the added column which will extract only the digits at the beginning of the file name:
=Text.Middle([Name],0,Text.PositionOfAny([Name],{"A".."z"}))
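The difference between the two orderings is easy to reproduce outside Power Query. The Python sketch below uses made-up file names modelled on the ones in the question and sorts them once as text and once by their leading digits. Placing names without leading digits first (key -1) is an arbitrary choice for this illustration.

```python
import re


def leading_number(name: str) -> int:
    """Return the digits at the start of a file name as an integer."""
    match = re.match(r"\d+", name)
    # Names without leading digits get -1 so that they sort first.
    return int(match.group()) if match else -1


# Hypothetical file names following the "<number>summary" pattern.
names = ["0summary.txt", "120summary.txt", "30summary.txt", "1000summary.txt", "60summary.txt"]

as_text = sorted(names)                         # character-by-character comparison
by_value = sorted(names, key=leading_number)    # comparison on the numeric prefix
```

Here `as_text` puts 1000summary.txt and 120summary.txt ahead of 30summary.txt, which matches the behaviour seen in the query, while `by_value` gives 0, 30, 60, 120, 1000.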

Excel: Inconsistent sorting criteria from 'Smallest to Largest' to 'A-Z'

Situation:
I am pulling information from a database and exporting it into an Excel 2010 template. The data consists of unique IDs (numeric), dates, and text in their respective columns. When going to sort, Excel usually recognizes the unique IDs as text and gives me the option of 'A-Z' which yields the correct result.
Problem:
Occasionally when sorting the unique IDs, Excel will give me the option to sort from 'Smallest to Largest' and when this happens the report yields a wildly incorrect result.
Pattern:
The sorting criteria is the only common denominator when a report fails, which makes little sense as they are both ascending orders. This issue only occurs ~20% of the time. The other times it sorts correctly from 'A-Z' as it does in the other worksheets within the same template.
-I've tried changing Number Format within the drop down to 'Text' 'General' and 'Numbers'
-I've tried manually sorting the data through filters as opposed to sort hierarchies
-I've tried clearing the table, and re-copying/pasting the data into the template's worksheet. This seems to work, but as the end goal is automation, I'd like to find out what the root cause is.
Expected result: Numeric data copied and pasted into the field to be sorted from 'A-Z', resulting in a successful report.
Actual result: Numeric data copied and pasted into the field typically results in the sort option of 'A-Z', but occasionally sorts from 'Smallest to Largest', resulting in a failed report.
Excel is designed for numbers - and is generally very helpful in coercing text to numbers where appropriate. However, once in Number format the reverse is not easy. As you have discovered, merely choosing Text as format is not enough.
A clue is whether or not (assuming activated) the cells show green triangles.
Other than starting afresh with data entry into a cell already formatted as Text, the conventional solution for conversion with code is to prepend a quote, though appending a space would also serve.
Other than that, the easiest mass conversion approach may be to copy into Word (Keep text only) and copy back to Excel with pasting as Text.
The better solution may be to store IDs as text and prepend 0s to a standard length.
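Why zero-padding helps can be shown with a short Python sketch. The IDs and the padded width of 6 are made up for illustration; the point is that once all IDs are text of equal length, a text sort and a numeric sort agree.

```python
# Hypothetical IDs stored as text.
ids = ["9", "10", "100", "25"]

text_order = sorted(ids)               # compares character by character
numeric_order = sorted(ids, key=int)   # compares the values

# Pad every ID with leading zeros to a fixed width (6 is an arbitrary choice).
padded = [value.zfill(6) for value in ids]
padded_order = sorted(padded)
```

With padding, `padded_order` is 000009, 000010, 000025, 000100, the same sequence as `numeric_order`, so it no longer matters which of the two sort modes is applied.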

Format multiple date entries as strings

I have an Excel file storing a thousand lines of dates. Each date seems to be (auto)formatted as a Date. A (PHP Excel) parser I'm using (I really can't update it or use another one) parses each date into a string containing the number of days since 1900.
Is there a way to format the values in Excel as simple text such as "08.03.1991" so that this file is parsed correctly?
I could add a quote: "'08.03.1991" but I need an (Excel-based) one-action solution for all the thousand lines.
Remark: Since this is a user's file, I can't just write a simple VBA script or the like to handle this, because there will be new files in the future and the user needs to be able to solve this alone.
I admit I am not quite sure what you have and what you want but it may be worth trying: Select column of dates, apply Text to Columns with Tab as delimiter and in step 3 of 3 select Text.
You could use the TEXT function like this:
=TEXT(A1,"dd.mm.yyyy")
For more details have a look here
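For context on what the parser returns: the number is Excel's date serial. If that value ever reaches code you control, it can be converted back to text. The Python sketch below assumes the workbook uses the default 1900 date system and that all dates fall after February 1900, which is why day zero is taken as 1899-12-30. The function name is hypothetical.

```python
from datetime import date, timedelta

# Day zero for serials in the 1900 date system (valid for dates from March 1900 on).
EXCEL_EPOCH = date(1899, 12, 30)


def serial_to_text(serial) -> str:
    """Convert an Excel date serial (number or numeric string) to dd.mm.yyyy text."""
    days = int(float(serial))  # drop any time-of-day fraction
    return (EXCEL_EPOCH + timedelta(days=days)).strftime("%d.%m.%Y")


formatted = serial_to_text("33305")
```

This does not replace the Excel-side fixes above; it only shows how the serial maps back to the date the user typed.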
