extract data using regex?


I am writing code to extract a paragraph from a document using regex in Python. The data contains a lot of similar words, but I need to extract the paragraph that starts at the first occurrence of a particular recurring word.
Example (data.txt):
extract data
useful data is extracted
extract numbers
useful numbers are extracted
extract variable
useful variables are extracted
I need to extract only the following:
"extract numbers
useful numbers are extracted"

You can use re.findall with the pattern ("([a-zA-Z].* *\n.[a-zA-Z .,']*)") to find all paragraphs. It also works for poems.
First, save your data in the poem variable:
poem = """extract data
useful data is extracted
extract numbers
useful numbers are extracted
extract variable
useful variables are extracted"""
Now, we find all paragraphs and store them in par variable:
import re
par = re.findall("([a-zA-Z].* *\n.[a-zA-Z .,']*)",poem)
Now par has three elements, which you can access as par[0], par[1] and par[2].
par[0] is:
'extract data\nuseful data is extracted'
par[1] is:
'extract numbers\nuseful numbers are extracted'
par[2] is:
'extract variable\nuseful variables are extracted'
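If only one paragraph is needed, it may be simpler to anchor the pattern on its first line instead of collecting every paragraph and indexing into the list. The following is a minimal sketch, assuming each paragraph is exactly two lines and that the heading text ("extract numbers") is known in advance; extract_block is a hypothetical helper name, not something from the question.

```python
import re

text = """extract data
useful data is extracted
extract numbers
useful numbers are extracted
extract variable
useful variables are extracted"""


def extract_block(text, heading):
    """Return the heading line plus the line after it, or None if absent."""
    # re.escape protects any regex metacharacters in the heading.
    # With re.MULTILINE, ^ and $ match at the start and end of each line.
    pattern = re.compile(
        r"^" + re.escape(heading) + r"[ \t]*\n.*$",
        re.MULTILINE,
    )
    match = pattern.search(text)  # search() stops at the first occurrence
    return match.group(0) if match else None


block = extract_block(text, "extract numbers")
print(block)
```

Because search() returns the first match only, later lines that happen to repeat the same heading are ignored.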

Related

Ideas to extract specific invoice pdf data for different formats and convert to Excel

I am currently working on a digitalisation project which consists in extracting specific information from pdf-formatted electricity invoices. Once the data is extracted, I would like to store it in an Excel spreadsheet.
The data to be extracted is shown in the following image:
https://i.stack.imgur.com/6RLo2.png
In this case, the data to be extracted is the information outlined in red: the CUPS, the total amount and the electricity consumed per period (P1-P6).
Once this is extracted, I would like to write it to an Excel spreadsheet.
Could you please give me any ideas or tips for extracting this data? I understand that OCR software would do this best, but I do not know how to extract this specific information.
Thanks for your help and advice.

If there is no text data in your PDF then I don't believe there is a clean and consistent way to do this yet. If your invoice templates are always the same format and resolution, then the pixel coordinates of the text positions should be the same.
This means that you can create a cropped image with only the text you're interested in. Then you can use your OCR tool to extract all the text and you have extracted your data field. You would have to do this for all the data fields that you want to extract.
This only works for invoices that always have the same format and resolution. Scanned invoices therefore wouldn't work, and dynamic tables make things considerably more complex as well.

I would first check whether it is possible to simply extract the text with pdftotext, then build my cmd text parsing around that output and loop over the files one by one.
I don't have your sample to test with, so you will need to adjust this to suit your bills:
pdftotext -nopgbrk -layout electric.pdf - |findstr /i "cups factura" & pdftotext -nopgbrk -layout -y 200 -W 300 -H 200 electric.pdf
Personally, I would handle the two parts in separate passes. For the first pair, replace each , with a safe CSV character such as *, then insert a , at the large gap to make a 2-column CSV (and perhaps replace Γé¼, which is how a mis-decoded € typically shows up in the console, with ,€ if necessary, since your captured text may already be in euros).
For the second group, I would probably insert , at fixed numeric positions to form the desired columns. I only demo 4 columns by 2 rows, but you want 7 columns by 4 rows, so adjust those values to suit. Alternatively, you can use any language you are familiar with, such as VBA, to split the text however you want before importing it into Excel.
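As an alternative to injecting commas by hand, the "large gap" idea can be sketched in Python: split each line of layout-preserving text wherever two or more spaces occur and let the csv module handle quoting. This is only a sketch; the sample text below is made up for illustration and is not taken from a real invoice, and layout_to_rows is a hypothetical helper name.

```python
import csv
import io
import re

# Hypothetical text in the style of "pdftotext -layout" output:
# columns are separated by runs of spaces.
layout_text = """\
Periodo      P1        P2        P3
Consumo      120,5     98,0      143,2
"""


def layout_to_rows(text, min_gap=2):
    """Split each non-empty line on runs of at least min_gap spaces."""
    gap = re.compile(r" {%d,}" % min_gap)
    return [gap.split(line.strip()) for line in text.splitlines() if line.strip()]


rows = layout_to_rows(layout_text)

# csv.writer quotes fields that contain commas, so decimal commas
# do not need to be replaced with another character first.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

A single space inside a value (for example a two-word label) is kept, because only gaps of two or more spaces are treated as column boundaries.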

In Excel you may want to use PowerQuery to read the pdf:
https://learn.microsoft.com/en-us/power-query/connectors/pdf
Then you can further process to extract the data you want within PowerQuery.
If you are interested in further data analysis after extraction you may want to consider KNIME as well:
https://hub.knime.com/jyotendra/spaces/Public/latest/Reading%20PDF%20and%20extracting%20information~pNh3GdorF0Z9WGm8
From there export to Excel is also supported.
Edit:
After extracting, regex helps to filter for the specific data, e.g. by looking for keywords, the length and structure of the data item (e.g. the CUPS number), whether it is a currency amount with decimals, etc.
Edit 2: regex in Excel
See: How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
For example, look for a new line starting with CUPS followed by a sequence of 15 characters (if you have more details, you can make the matching pattern more specific: e.g. starting with E, or the 5th character is X or 5, etc.)
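The filtering step described above could look roughly like this in Python. It is a sketch under assumptions: the sample text, the identifier length range (15 to 22 alphanumeric characters) and the comma-decimal amount format are guesses for illustration and should be adapted to the real invoices.

```python
import re

# Hypothetical text as it might come out of a PDF-to-text step.
sample = """Factura de electricidad
CUPS: ES0021000000000001AB
Total factura 123,45 €
"""

# A line starting with CUPS, then an identifier of 15 to 22
# upper-case letters or digits (assumed length range).
cups_re = re.compile(r"^CUPS\W*([A-Z0-9]{15,22})\b", re.MULTILINE)

# An amount with optional thousands separators (.) and a decimal
# comma, followed by the euro sign (assumed format).
amount_re = re.compile(r"(\d{1,3}(?:\.\d{3})*,\d{2})\s*€")

cups_match = cups_re.search(sample)
amount_match = amount_re.search(sample)

cups = cups_match.group(1) if cups_match else None
# Convert "1.234,56" style text to a float.
total = (
    float(amount_match.group(1).replace(".", "").replace(",", "."))
    if amount_match
    else None
)
print(cups, total)
```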

counting most commonly used words across thousands of text files using Octave

I've accumulated a list of over 10,000 text files in Octave. I've got a function which cleans up the contents of each file, normalizing things in various ways (lowercase, reducing repeated whitespace, etc). I'd like to distill from all these files a list of words that appear in at least 100 files. I'm not at all familiar with Octave data structs, cell arrays or sorting functions, and was hoping someone could help me understand how to:
initialize an appropriate data structure (word_appearances) to count how many emails contain a particular word
loop through the unique words that appear in an email string and, for each of those words, increment the count I'm tracking in word_appearances -- ideally we'd ignore words less than two chars in length and also exclude a short list of stop_words.
reduce word_appearances to only contain words that appear in some minimum number of files, e.g., min_appearances=100.
sort the words in word_appearances alphabetically and either export this as a .MAT file or as a CSV file like so:
1 aardvark
2 albatross
etc.
I currently have this code to loop through my files one by one:
for i = 1:total_files
  filename = char(files{i}(1));
  printf("processing %s\n", filename);
  file_contents = jPreProcessFile(readFile(filename))
endfor
Note that the file_contents that comes back is pretty clean -- usually just a bunch of words, some repeated, separated by single spaces like so:
email market if done right is both highli effect and amazingli cost effect ok it can be downright profit if done right imagin send your sale letter to on million two million ten million or more prospect dure the night then wake up to sale order ring phone and an inbox full of freshli qualifi lead we ve been in the email busi for over seven year now our list ar larg current and deliver we have more server and bandwidth than we current need that s why we re send you thi note we d like to help you make your email market program more robust and profit pleas give us permiss to call you with a propos custom tailor to your busi just fill out thi form to get start name email address phone number url or web address i think you ll be delight with the bottom line result we ll deliv to your compani click here emailaddr thank you remov request click here emailaddr licatdcknjhyynfrwgwyyeuwbnrqcbh
Obviously, I need to create the word_appearances data structure such that each element in it specifies a word and how many files have contained that word so far. My primary point of confusion is what sort of data structure word_appearances should be, how I would search this data structure to see if some new word is already in it, and if found, increment its count, otherwise add a new element to word_appearances with count=1.

Octave has containers.Map to hold key-value pairs. Basic usage:
% initialize map
m = containers.Map('KeyType', 'char', 'ValueType', 'int32');
% add a new word with count 1, or increment the count of an existing word
word = 'hello';
if m.isKey(word)
  m(word) += 1;
else
  m(word) = 1;
endif
This is one way to extract most frequent words from a map like the one above:
counts = m.values;
[sorted_counts, indices] = sort(cell2mat(counts));
top10_indices = indices(end:-1:end-9);
top10_words = m.keys(top10_indices);
I must warn you though, Octave may be pretty slow at this task, considering that you have thousands of files. Use it only if running time isn't that important for you.
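If running time in Octave does become a problem, the same counting logic is short in other languages. Here is a minimal Python sketch of the whole pipeline from the question (count files per word, drop short words and stop words, keep words above a threshold, sort alphabetically). The function names and the three tiny sample documents are hypothetical; in practice each string would be the cleaned contents of one file.

```python
from collections import Counter


def document_frequency(documents, stop_words=frozenset(), min_length=2):
    """Count, for each word, how many documents contain it at least once."""
    counts = Counter()
    for text in documents:
        # A set ensures a word is counted once per document,
        # no matter how often it repeats inside that document.
        words = {
            w for w in text.split()
            if len(w) >= min_length and w not in stop_words
        }
        counts.update(words)
    return counts


def frequent_words(counts, min_appearances):
    """Return the words seen in at least min_appearances documents, sorted."""
    return sorted(w for w, n in counts.items() if n >= min_appearances)


# Hypothetical stand-ins for the cleaned file contents.
docs = [
    "email market if done right email",
    "email list ar larg",
    "market email busi",
]
counts = document_frequency(docs, stop_words={"if"})
common = frequent_words(counts, min_appearances=2)

# Numbered output in the "1 aardvark" style from the question.
for number, word in enumerate(common, start=1):
    print(number, word)
```

A dictionary-like counter answers the question's main point of confusion: lookup, insert and increment are all one operation, so there is no need to search the structure manually.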

Excel VBA Textfile to 2d array

I am new to excel vba. I want to read a textfile that contains text like this:
John Smith Engineer Chicago
Bob Alice Doctor New York
Jane Smith Teacher St. Louis
I want to convert this into a 2D array so that print(3,3) returns 'Teacher'.
I am able to read the entire file contents into one string, but am having difficulty converting it to a 2D array like the one above. Please advise on how to proceed. Thanks

Unless the text file has some specific structure to it, you're going to struggle a bit. Things that might make it easier are:
Does the text file contain line breaks at the end of each line?
Are all the names in [FirstName][LastName] format as per your example, or might some have more or fewer words?
Does the Occupation always come directly after the name?
Are there a (very) limited number of Occupations?

As mentioned by NautMeg, you have to make some assumptions about the data based on the provided template.
However, we can assume that:
A space is the delimiter
The final column is City, which can contain a space
There are 4 columns:
First Name
Last Name
Profession
City/Location
Using this information:
While Not EOF(my_file)
    Line Input #my_file, text_line
    ' text_line contains the current line
    i = i + 1
    ' i is the line number
Wend
is how we retrieve each line.
Split ( Expression, [Delimiter], [Limit], [Compare] )
This will give you each item in the line as an array. Indexes < 3 (0-based) are separate columns of data, and you can handle them however you want.
For indexes >= 3, join the items together into one string.
Join( SourceArray, [Delimiter] )
You'll likely want to make the delimiter in this case a single space, since the Split function removes the spaces.
That will allow you to parse the data as is.
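The split-then-join logic above is language independent. Here it is sketched in Python for illustration, under the same assumptions (space-delimited, 4 columns, only the last column may contain spaces); parse_line is a hypothetical helper name. Note that Python lists are 0-based, so table[2][2] corresponds to print(3,3) in the question.

```python
def parse_line(line, n_columns=4):
    """Split a line into n_columns fields; the last field keeps its spaces."""
    parts = line.split()                     # split on whitespace
    head = parts[:n_columns - 1]             # first name, last name, profession
    tail = " ".join(parts[n_columns - 1:])   # re-join the city words
    return head + [tail]


lines = [
    "John Smith Engineer Chicago",
    "Bob Alice Doctor New York",
    "Jane Smith Teacher St. Louis",
]

# A list of rows acts as the 2D array.
table = [parse_line(line) for line in lines]

print(table[2][2])  # third row, third column
print(table[1][3])  # second row, city
```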
However, for future reference if you can control the export of the text file, you should try exporting as a CSV file.
Good luck

SPSS compare string function

I have a file composed of many strings. For each string, I want to create substrings of length 4 and then compare each substring with a dictionary of words from another SPSS file. For example, if I have the string "transport" I want to create a list of 4-letter strings (e.g., 'tran', 'rans', 'ansp', etc.). For each of these 4-letter strings, I want to know if it exists in another file with a long list of words. Here is my syntax in SPSS:
*rawNonword is the name of the string in my first file.
compute chars = char.length(rawNonword).
string holder (A50).
loop #i = 1 to chars-3.
compute holder = char.substr(rawNonword, #i, 4).
*here I would like to compare holder with the strings in another file.
end loop.
execute.
I realize that the merge and match functions are normally used in SPSS, but it seems as if I can't use them inside a loop. I believe this problem is fairly easy in python, but I need to do this task in SPSS. Is there an easy function in SPSS that will return a value of 1 or true if the 4-letter string exists in another file?

Certainly easier to do using the Python plugin with the extendedTransforms.vlookup function, but in traditional syntax, you could create a variable holding all the four-letter fragments, sort both files, and use a TABLE match with MATCH FILES using that variable as the key.
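For reference, the fragment-and-lookup logic itself looks like this in plain Python. This is only a sketch of the idea: it does not use the SPSS Python plugin or extendedTransforms, and the small word set stands in for the dictionary file, which is an assumption for illustration.

```python
def fragments(text, size=4):
    """Return every contiguous substring of the given size, in order."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]


# Hypothetical stand-in for the words loaded from the second file.
dictionary = {"tran", "port", "spot"}

pieces = fragments("transport")

# 1 if the fragment exists in the dictionary, otherwise 0.
flags = {piece: int(piece in dictionary) for piece in pieces}

for piece, flag in flags.items():
    print(piece, flag)
```

A string of length n has n - 3 fragments of length 4, so "transport" (9 characters) yields 6 of them; strings shorter than 4 characters yield none.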

Sybase: get a specific string from a binary column

On Sybase, I have a table containing a binary column.
Using convert(varchar(16384), convert(binary(16384), T1.TEXT)) as Text I can convert the data it contains to a string.
Here is my question: I need to build a new string from this field that contains only specific words. How can I do it?
Let me give an example.
Suppose that in one row the field contains the string "Output of this activity are txt files: the file orange.txt, the file black.txt and eventually the file red.txt". In the output of my query I want the field as "orange.txt, black.txt, red.txt".
Is it possible to do this?
Thanks

You can't do this. This is because neither the BINARY nor the TEXT datatypes under Sybase allow sub-string searching or regular expression processing.
When you are storing character data, VARCHAR or UNIVARCHAR are always the better options. TEXT as a type should only ever be used if your TEXT fields are larger than your Sybase configured page size.
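If the converted string can be fetched into client code, the extraction can be done there instead of in the query. The following is a sketch assuming Python on the client side and assuming that the wanted words are file names ending in .txt; the string is the example from the question, and no database access is shown.

```python
import re

# The example value from the question, as it would look after the
# convert(...) call has turned the binary data into a string.
value = (
    "Output of this activity are txt files: the file orange.txt, "
    "the file black.txt and eventually the file red.txt"
)

# A file name here is assumed to be word characters or hyphens
# followed by the literal extension ".txt".
file_names = re.findall(r"\b[\w-]+\.txt\b", value)

result = ", ".join(file_names)
print(result)
```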
