Using REGEX to grab the information after the match - python-3.x

I ran a PDF through a series of processes to extra the text from it. I was successful in that regard. However, now I want to extract specific text from documents.
The document is set up as a multi lined string (I believe. when I paste it into Word the paragraph character is at the end of each line):
Send Unit: COMPLETE
NOA Selection: 20-0429.07
#for some reason, in this editor, despite the next line having > infront of it, the following line (Pni/Trk) keeps wrapping up to the line above. This doesn't exist in the actual doc.
Pni/Trk: 3 Panel / 3 Track
Panel Stack: STD
Width: 142.0000
The information is want to extract are the numbers following "NOA Selection:".
I know I can do a regex something to the effect of:
pattern = re.compile(r'NOA\sSelection:\s\d*-\d*\.\d*)
but I only want the numbers after the NOA selection, especially because NOA Selection will always be the same but the format of the numbers/letters/./-/etc. can vary pretty wildly. This looked promising but it is in Java and I haven't had much luck recreating it in Python.
I think I need to use (?<=...), but haven't been able to implement it.
Also, several of the examples show the string stored in the python file as a variable, but I'm trying to access it from a .txt file, so I might be going wrong there. This is what I have so far.
with open('export1.txt', 'r') as d:    
contents = d.read()    
p = re.compile('(?<=NOA)')
s = re.search(p, contents)
print(s.group())
Thank you for any help you can provide.

With your shown samples, you could try following too. For sample 20-0429.07 I have kept .07 part optional in regex in case you have values 20-0429 only it should work for those also.
import re
val = """Send Unit: COMPLETE
NOA Selection: 20-0429.07"""
matches = re.findall(r'NOA\s+Selection:\s+(\d+-\d+(?:\.\d+)?)', val)
print(matches)
['20-0429.07']
Explanation: Adding detailed explanation(only for explanation purposes).
NOA\s+Selection:\s+ ##matching NOA spaces(1 or more occurrences) Selection: spaces(1 or more occurrences)
(\d+-\d+(?:\.\d+)?) ##Creating capturing group matching(1 or more occurrences) digits-digits(1 or more occurrences)
##and in a non-capturing group matching dot followed by digits keeping it optional.

Keeping it simple, you could use re.findall here:
inp = """Send Unit: COMPLETE
NOA Selection: 20-0429.07"""
matches = re.findall(r'\bNOA Selection: (\S+)', inp)
print(matches) # ['20-0429.07']

Related

Regex to find numbers that are not in a phrase

If I have e.g. this string:
Beschreibung Menge VK-Preis MwSt% Betrag
Schadenbewertunginkl.Restwertermittlung 1 25,00€ 19 25,00€
Rechnungsbetragexcl.MwSt.: 25,00€
MwSt.(19%): 4,75€
Rechnungsbetragincl.MwSt.: 123.029,75€
I want to extract all the numbers.
My regexes are:
regex_up_to_thousand = r'\b(?:\d{1,3}){1}(?:,{1}\d{2})\b'
and
regex_every_price = r'\b(?:\d{1,3}(\.|,))+(:?\d{3}(\.|,))(?:\d{2})\b'
My idea was to first get the "big" prices, remove them from the text and get the other numbers.
Which works in most cases, until I have a date that looks like this maybe
Gutachtennummer: 1009126 Leistungsdatum: 11.10.2021
I would get the 11.10 with my second regex, and I don't know how to prevent this.
I thought the \b would help, but sadly not.
Any ideas?
It's not the end of the world, since I do a lot of math in the background, but it's a possibility that a date would fit some values and I calculate something wrong in the end.
You could try the following pattern.
\b\d+(?:(?:\.|,)\d{3})*(?:(?:\.|,)\d{2})\b(?!\W\d)
The main thing is (?!\W\d) at the end which ensures that after your amount you will not have a construct of 1 non-word character followed by 1 digit.
Example: https://regex101.com/r/q1ic9S/1

Regular expression match / break

I am doing text analysis on SEC filings (e.g., 10-K), and the documents I have are the complete submission. The complete filing submission includes the 10-K, plus several other documents. Each document resides within the tags ‘<DOCUMENT>’ and ‘</DOCUMENT>’.
What I want: To count the number of words in the 10-K only before the first instance of ‘</DOCUMENT>’
How I want to accomplish it: I want to use a for loop, with a regex (regex_end10k) to indicate where to stop the for loop.
What is happening: No matter where I put my regex match break, the program counts all of the words in the entire document. I have no error, however I cannot get the desired results.
How I know this: I have manually trimmed one filing, while retaining the full document (results below). When I manually remove the undesired documents after the first instance of ‘</DOCUMENT>’, I yield about 750,000 fewer words.
Current output
Note: Apparently I don't have enough SO reputation to embed a screenshot in my post; it defaults to a link.
What I have tried: several variations of where to put the regex match break. No matter what, it almost always counts the entire document. I believe that the two functions may be performed over the entire document. I have tried putting the break statement within get_text_from_html() so that count_words() only performs on the 10-K, but I have had no luck.
The code below is a snippet from a larger function. It's purpose is to (1) strip html tags and (2) count the number of words in the text. If I can provide any additional information, please let me know and I'll update my post.
The remaining code (not shown) extracts firm and report identifiers, (e.g., ‘file’ or ‘cik’) from the header section between tags ‘<SEC-HEADER>’ and ‘</SEC-HEADER>’. Using the same logic, when extracting header information, I use a regex match break logic and it works perfectly. I need help trying to understand why this same logic isn’t working when I try to count the number of words and how to correct my code. Any help is appreciated.
regex_end10k = re.compile(r'</DOCUMENT>', re.IGNORECASE)
for line in f:
def get_text_from_html(html:str):
doc = lxml.html.fromstring(html)
for table in doc.xpath('.//table'): # optional: removes tables from HTML source code
table.getparent().remove(table)
for tag in ["a", "p", "div", "br", "h1", "h2", "h3", "h4", "h5"]:
for element in doc.findall(tag):
if element.text:
element.text = element.text + "\n"
else:
element.text = "\n"
return doc.text_content()
to_clean = f.read()
clean = get_text_from_html(to_clean)
#print(clean[:20000])
def count_words(clean):
words = re.findall(r"\b[a-zA-Z\'\-]+\b",clean)
word_count = len(words)
return word_count
header_vars["words"] = count_words(clean)
match = regex_end10k.search(line) # This should do it, but it doesn't.
if match:
break
You dont need regx, just split your orginal string, and then in the part before count the words, simple example above:
text = 'Text before <DOCUMENT> text after'
splited_text = text.split('<DOCUMENT>')
splited_text_before = splited_text[0]
count_words = len(splited_text_before.split())
print(splited_text_before)
print(count_words)
output
Text before
2

Splitting text file into multiline records using regex

I would like to separate my text into a list of records using the timestamp as the separator. My current code captures the first record, but not the second. How should I modify my code to capture both?
s = """13:45:09 HEY HOW ARE YOU
I AM FINE
13:50:10 OK THEN
Bye"""
import re
text_regex = r'^\d\d:\d\d:\d\d(.*?)(?=\d\d:\d\d:\d\d)'
pattern = re.compile(text_regex,re.DOTALL)
records = []
for match in pattern.findall(s):
match = match.rstrip()
records.append(match)
My current code captures the first record, but not the second. How should I modify my code to capture both?
Your code doesn't capture the second record because due to the trailing (?=\d\d:\d\d:\d\d) lookahead it matches only if a timestamp follows, which isn't so for the last record.
Wiktor Stribiżew's suggestion works if corrected to
records = re.split(r'(?!\A)(?=^\d\d:\d\d:\d\d)', s, flags=re.M)
- otherwise the multiline flag would be taken for the third positional argument maxsplit.
A slightly simplified variant of Wiktor Stribiżew's suggestion which drops extra text before the first timestamp:
records = re.split(r'(?=\d\d:\d\d:\d\d)', s)[1:]
Another variant which (as in your original code) doesn't capture the timestamps:
records = re.split(r'\d\d:\d\d:\d\d', s)[1:]

FuzzyWuzzy for very similar records in Python

I have a dataset with which I want to find the closest string match. For that purpose I'm using FuzzyWuzzy in this way
sol=process.extract(t,dev2,scorer=fuzz.token_sort_ratio)
Where t is the string and dev2 is the list to compare to. My problem is that sometimes it has very similar records and options provided by FuzzyWuzzy seems to be lacking. And I've tested with token_sort, token_set, partial_token sort and set, ratio, partial_ratio, and WRatio.
For example, the string Italy - Serie A gives me the following 2 closest matches.
Token_sort_ratio: (92, 'Italy - Serie D');(86, 'Italian - Serie A')
The one wanted is obviously the second one, but character by character is closer the first one, which is a different league.
This happens as well with teams. If, let's say I have a string Buchtholz I would obtains Buchtholz II before I get TSV Buchtholz.
My main guess now would be to try and weight the presence and absence of several characters more heavily, like single capital letters at the end of the string, so if there is a difference in the letter or an absence it is weighted as less close. Or for () and special characters.
I don't know if there is a way to take this into account or you guys have a better approach to get the string that really matches.
Similarity matches often require knowledge of the data being analysed. i.e. it is not just a blind single round of matching. I recommend that you pass your results through more steps of matching, starting with inclusive/optimistic approaches (like token_set_ratio) with low cut off scores and working toward more exclusive/pessimistic approaches with higher cut off scores until you have a clear winner. If you know more about the text you're analyzing, you can even modify the strings as you progress.
In a case I worked on, I did similarity matches of goods movement descriptions. In the descriptions the numbers sequences were more important than the text. e.g. when looking for a match for "SLURRY VALVE 250MM RAGMAX 2000" the 250 and 2000 part of the string are important, otherwise I get a "SLURRY VALVE 50MM RAGMAX 2000" as the best match instead of "VALVE B/F 250MM,RAGMAX 250RAG2000 RAGON" which is a better result.
I put the similarity match process through two steps: 1. Get a bunch of similar matches using an optimistic matching scorer (token_set_ratio) 2. get the number sequences of these results and pass them through another round of matching with a more strict scorer (token_sort_ratio). Doing this gave me the better result in the example I showed above.
Below is some blocks of code that could be of assistance:
here's a function to get numbers from the sequence. (In your case you might use this to exclude numbers from your string instead?)
def get_numbers_from_string(description):
numbers = ''.join((ch if ch in '0123456789.-' else ' ') for ch in description)
numbers = ' '.join([nr for nr in numbers.split()])
return numbers
and here is a portion of the code I used to put the description match through two rounds:
try:
# get close match from goods move that has material numbers
df_material = pd.DataFrame(process.extract(description,
corpus_material,
scorer=fuzz.token_set_ratio),
columns=['Similar Text','Score']
)
if df_material['Score'][df_material['Score']>=cut_off_accuracy_materials].count()>=1:
similar_text = df_material['Similar Text'].iloc[0]
score = df_material['Score'].iloc[0]
if nr_description_numbers>4:
# if there are multiple matches found, then get best number combination match
df_material = df_material[df_material['Score']>=cut_off_accuracy_materials]
new_corpus = list(df_material['Similar Text'])
new_corpus = np.vectorize(get_numbers_from_string)(new_corpus)
df_material['numbers'] = new_corpus
df_numbers = pd.DataFrame(process.extract(description_numbers,
new_corpus,
scorer=fuzz.token_sort_ratio),
columns=['numbers','Score']
)
similar_text = df_material['Similar Text'][df_material['numbers']==df_numbers['numbers'].iloc[0]].iloc[0]
nr_score = df_numbers['Score'].iloc[0]
hope it helps, and good luck

Searching for Number of Term Appearances in Mathematica

I'm trying to search across a large array of textual files in Mathematica 8 (12k+). So far, I've been able to plot the sheer numbers of times that a word appears (i.e. the word "love" appears 5,000 times across those 12k files). However, I'm running into difficulty determining the number of files in which "love" appears once - which might only be in 1,000 files, with it repeating several times in others.
I'm finding the documentation WRT FindList, streams, RecordSeparators, etc. a bit murky. Is there a way to set it up so it finds an incidence of a term once in a file and then moves onto the next?
Example of filelist:
{"89001.txt", "89002.txt", "89003.txt", "89004.txt", "89005.txt", "89006.txt", "89007.txt", "89008.txt", "89009.txt", "89010.txt", "89011.txt", "89012.txt", "89013.txt", "89014.txt", "89015.txt", "89016.txt", "89017.txt", "89018.txt", "89019.txt", "89020.txt", "89021.txt", "89022.txt", "89023.txt", "89024.txt"}
The following returns all of the lines with love across every file. Is there a way to return only the first incidence of love in each file before moving onto the next one?
FindList[filelist, "love"]
Thanks so much. This is my first post and I'm largely learning Mathematica through peer/supervisory help, online tutorials, and the documentation.
In addition to Daniel's answer, you also seem to be asking for a list of files where the word only occurs once. To do that, I'd continue to run FindList across all the files
res =FindList[filelist, "love"]
Then, reduce the results to single lines only, via
lines = Select[ res, Length[#]==1& ]
But, this doesn't eliminate the cases where there is more than one occurrence in a single line. To do that, you could use StringCount and only accept instances where it is 1, as follows
Select[ lines, StringCount[ #, RegularExpression[ "\\blove\\b" ] ] == 1& ]
The RegularExpression specifies that "love" must be a distinct word using the word boundary marker (\\b), so that words like "lovely" won't be included.
Edit: It appears that FindList when passed a list of files returns a flattened list, so you can't determine which item goes with which file. For instance, if you have 3 files, and they contain the word "love", 0, 1, and 2 times, respectively, you'd get a list that looked like
{, love, love, love }
which is clearly not useful. To overcome this, you'll have to process each file individually, and that is best done via Map (/#), as follows
res = FindList[#, "love"]& /# filelist
and the rest of the above code works as expected.
But, if you want to associate the results with a file name, you have to change it a little.
res = {#, FindList[#, "love"]}& /# filelist
lines = Select[res,
Length[ #[[2]] ] ==1 && (* <-- Note the use of [[2]] *)
StringCount[ #[[2]], RegularExpression[ "\\blove\\b" ] ] == 1&
]
which returns a list of the form
{ {filename, { "string with love in it" },
{filename, { "string with love in it" }, ...}
To extract the file names, you simply type lines[[All, 1]].
Note, in order to Select on the properties you wanted, I used Part ([[ ]]) to specify the second element in each datum, and the same goes for extracting the file names.
Help > Documentation Center > FindList item 4:
"FindList[files,text,n]
includes only the first n lines found."
So you could set n to 1.
Daniel Lichtblau

Resources