This is a project from https://automatetheboringstuff.com/2e/chapter7/
It searches the text on the clipboard for phone numbers and email addresses, then copies the results back to the clipboard.
If I understand it correctly, when the regular expression contains groups, the findall() function returns a list of tuples, with each tuple containing the strings that matched each regex group.
Now here is my problem: as far as I can tell, the regex in phoneRegex contains only 6 groups (numbered in the code comments), so I would expect tuples of length 6.
But when I print the tuples, I get tuples of length 9:
('800-420-7240', '800', '-', '420', '-', '7240', '', '', '')
('415-863-9900', '415', '-', '863', '-', '9900', '', '', '')
('415-863-9950', '415', '-', '863', '-', '9950', '', '', '')
What am I missing?
#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.
import pyperclip, re
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code (first group?) 0
    (\s|-|\.)?                      # separator 1
    (\d{3})                         # first 3 digits 2
    (\s|-|\.)                       # separator 3
    (\d{4})                         # last 4 digits 4
    (\s*(ext|x|ext.)\s*(\d{2,5}))?  # extension 5
    )''', re.VERBOSE)
# Create email regex.
emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+    # username
    @                    # @ symbol
    [a-zA-Z0-9.-]+       # domain name
    (\.[a-zA-Z]{2,4})    # dot-something
    )''', re.VERBOSE)
text = str(pyperclip.paste())

matches = []
for groups in phoneRegex.findall(text):
    print(groups)
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])

# Copy results to the clipboard.
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')
Anything in parentheses becomes a capturing group (and adds one item to each tuple returned by re.findall) unless you specify otherwise. To turn a sub-group into a non-capturing group, add ?: just inside its opening parenthesis:
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(?:ext|x|ext.)\s*(?:\d{2,5}))?  # <---
    )''', re.VERBOSE)
You can see the extension part was adding two additional capturing groups. With this updated version, you will have 7 items in your tuple. There are 7 instead of 6 because the outermost parentheses capture the entire match as well.
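For example, a quick spot check of the non-capturing version (just a sketch; the number below is made up rather than taken from your clipboard) prints seven items per tuple:

import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(?:ext|x|ext.)\s*(?:\d{2,5}))?
    )''', re.VERBOSE)

print(phoneRegex.findall('Call 800-420-7240 ext 123'))
# [('800-420-7240 ext 123', '800', '-', '420', '-', '7240', ' ext 123')]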
The regex could be better, too. This version is cleaner and, with the re.IGNORECASE flag, will match more cases:
phoneRegex = re.compile(r'''(
    (\(?\d{3}\)?)
    ([\s.-])?
    (\d{3})
    ([\s.-])
    (\d{4})
    \s*                             # don't need to capture whitespace
    ((?:ext\.?|x)\s*(?:\d{1,5}))?
    )''', re.VERBOSE | re.IGNORECASE)
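As a rough spot check (again with a made-up number, continuing straight on from the compile above), this version also copes with parentheses and an uppercase extension marker:

print(phoneRegex.findall('(800) 420-7240 EXT 42'))
# [('(800) 420-7240 EXT 42', '(800)', ' ', '420', '-', '7240', 'EXT 42')]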
DELIVERED,machine01,2022-01-20T12:57:06,033,Email [Test1] is delivered by [192.168.0.2]
Above is the content of the text file. I have used the split(",") method, but I have no idea how to make it work as shown below. Can anyone help with this?
'DELIVERED', 'machine01', '2022-01-20T12:57:06', '033', 'Test1', '192.168.0.2'
with open('log_file.log', 'r') as f:
    for line in f.readlines():
        sep = line.split(",")
        print(sep)
text = "DELIVERED,machine01,2022-01-20T12:57:06,033,Email [Test1] is delivered by [192.168.0.2]"
result = []
for part in text.split(','): # loops the parts of text separated by ","
result.append(part) # appends this parts into a list
print(result) # prints this list:
['DELIVERED', 'machine01', '2022-01-20T12:57:06', '033', 'Email [Test1] is delivered by [192.168.0.2]']
# or you can do all the same work in just 1 line of code!
result = [part for part in text.split(',')]
print(result)
['DELIVERED', 'machine01', '2022-01-20T12:57:06', '033', 'Email [Test1] is delivered by [192.168.0.2]']
Once you have split on ",", you then need a regular expression to find the contents of the [] in the final string. Since you are doing this over multiple lines, we collect each list in a variable (fields), then print this list of lists at the end:
import re

fields = []
with open('log_file.log', 'r') as f:
    for line in f.readlines():
        sep = line.split(",")
        # Get the last item in the list
        last = sep.pop()
        # Find the values in [] in last
        extras = re.findall(r'\[(.*?)\]', last)
        # Add these values back onto sep
        sep.extend(extras)
        fields.append(sep)
print(fields)
log_file.log:
DELIVERED,machine01,2022-01-20T12:57:06,033,Email [Test1] is delivered by [192.168.0.2]
DELIVERED,machine02,2022-01-20T12:58:06,034,Email [Test2] is delivered by [192.168.0.3]
Result:
[['DELIVERED', 'machine01', '2022-01-20T12:57:06', '033', 'Test1', '192.168.0.2'], ['DELIVERED', 'machine02', '2022-01-20T12:58:06', '034', 'Test2', '192.168.0.3']]
I am trying to split a target sentence into composite pieces for a later function, using re.split() and the regex
(#?\w+)(\W+)
Ideally, this would split words and non-word characters into a generated list, preserving both as separate list items, with the exception of the "#" symbol, which can precede a word. If there is a # symbol before a word, I want to keep it attached to that word as a single item in the split. My example is below.
My test sentence is as follows:
this is a test of proper nouns #Ryan
So the line of code is:
re.split(r'(#?\w+)(\W+)', "this is a test of proper nouns #Ryan")
The list that I want to generate would include "#Ryan" as a single item but, instead, it looks like this:
['', 'this', ' ', '', 'is', ' ', '', 'a', ' ', '', 'test', ' ', '', 'of', ' ', '', 'proper', ' ', '', 'nouns', ' #', 'Ryan']
Since the first group contains the # symbol, I would have thought that it would be evaluated first, but that is apparently not the case. I have tried using lookaheads and removing # from the \W+ group, to no avail.
https://regex101.com/r/LeezvP/1
With your shown samples, could you please try the following (written and tested in Python 3.8.5), considering that you need to remove the empty/null items from your list. This will give output where # stays together with the word.
## First split the text/line here and save it to a list named li.
li=re.split(r'(#?\w+)(?:\s+)', "this is a test of proper nouns #Ryan")
li
['', 'this', '', 'is', '', 'a', '', 'test', '', 'of', '', 'proper', '', 'nouns', '#Ryan']
##Use filter to remove nulls in list li.
list(filter(None, li))
['this', 'is', 'a', 'test', 'of', 'proper', 'nouns', '#Ryan']
A simple explanation would be: use the split function with one capturing group, which matches an optional # followed by word characters, and one non-capturing group, which matches one or more spaces. This will leave null elements in the list, so remove them with the filter function.
NOTE: As per the OP's comments the nulls/spaces may be required, so in that case one could refer to the following code, which worked for the OP:
li=re.split(r'(#?\w+)(\s+|\W+)', "this is a test of proper nouns #Ryan")
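For reference, a quick run of that pattern (a sketch; the output below is how I read re.split's behaviour here, so treat it as approximate) keeps the whitespace separators as list items and leaves #Ryan intact as the trailing piece:

import re

li = re.split(r'(#?\w+)(\s+|\W+)', "this is a test of proper nouns #Ryan")
print(li)
# ['', 'this', ' ', '', 'is', ' ', '', 'a', ' ', '', 'test', ' ', '', 'of', ' ', '', 'proper', ' ', '', 'nouns', ' ', '#Ryan']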
You could also match using re.findall and use an alternation | matching the desired parts.
(?:[^#\w\s]+|#(?!\w))+|\s+|#?\w+
Explanation
(?: Non capture group
[^#\w\s]+ Match 1+ chars other than #, a word char or a whitespace char
| Or
#(?!\w) Match # when not directly followed by a word char
)+ Close the group and match 1+ times
| Or
\s+ Match 1+ whitespace chars to keep them as a separate match in the result
| Or
#?\w+ Match an optional # directly followed by 1+ word chars
Regex demo
Example
import re
pattern = r"(?:[^#\w\s]+|#(?!\w))+|\s+|#?\w+"
print(re.findall(pattern, "this is a test of proper nouns #Ryan"))
# Output
# ['this', ' ', 'is', ' ', 'a', ' ', 'test', ' ', 'of', ' ', 'proper', ' ', 'nouns', ' ', '#Ryan']
print(re.findall(pattern, "this #Ryan #$#test#123#4343##$%$test#1#$#$###1####"))
# Output
# ['this', ' ', '#Ryan', ' ', '#$', '#test', '#123', '#4343', '##$%$', 'test', '#1', '#$#$##', '#1', '####']
The regex, #?\w+|\b(?!$) should meet your requirement.
Explanation at regex101:
1st Alternative #?\w+
# matches the character # literally (case sensitive)
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
\w matches any word character (equivalent to [a-zA-Z0-9_])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Alternative \b(?!$)
\b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
Negative Lookahead (?!$)
Assert that the Regex below does not match
$ asserts position at the end of a line
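Note that the \b alternative is zero-width, so with re.findall the boundaries come back as empty strings rather than as the spaces themselves. A quick sketch (the output is what I would expect on Python 3.7+, so double-check it on your side):

import re

pattern = r"#?\w+|\b(?!$)"
print(re.findall(pattern, "this is a test of proper nouns #Ryan"))
# ['this', '', 'is', '', 'a', '', 'test', '', 'of', '', 'proper', '', 'nouns', '', '#Ryan']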
Totally new to programming. Working on the comma code example in Automate the Boring Stuff. The following code works (with some extra spaces I have to clean up), and on initialization it prints the list correctly, as in "The list is apples, bananas, tofu, and cats."
When you create a new list, it does this instead: "The list is and, a, b, c, d."
# initializing list
test_list = ['apples', 'bananas', 'tofu', 'cats']

# printing original list
while True:
    print('The list is ', end='')
    for i in range(0, len(test_list)):
        if i != (len(test_list)-1):
            print(str(test_list[i]), ', ', end='')
        if i == (len(test_list)-1):
            print('and', str(test_list[i]), '.')
    print('Write a new list that contains elements separated by a comma then space.')
    test_list = [input()]
Your issue is that input() is taking the full string and adding it as a single element to the test_list list. The string needs to be split.
The only thing that needs to be changed is the last line:
test_list = input().split(", ")
>>>test1, test2, test3
The list is test1 , test2 , and test3 .
As for cleaning up the extra spaces: concatenate the strings with + instead of passing them to print as separate comma-separated arguments, since print inserts a space between its arguments.
Combining all code:
# initializing list
test_list = ['apples', 'bananas', 'tofu', 'cats']

# printing original list
while True:
    print('The list is ', end='')
    for i in range(0, len(test_list)):
        if i != (len(test_list)-1):
            print(str(test_list[i]) + ', ', end='')
        if i == (len(test_list)-1):
            print('and', str(test_list[i]) + '.')
    print('Write a new list that contains elements separated by a comma then space.')
    test_list = input().split(", ")
Results in the output: The list is test1, test2, and test3.
There are several other things I would write differently, but this should solve your problem. Take care!
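(As one illustration of the kind of thing that could be written differently, offered only as a sketch rather than the answerer's own suggestion: the same sentence can be built with str.join, which avoids the index bookkeeping. This assumes the list has at least two items.)

test_list = ['apples', 'bananas', 'tofu', 'cats']
# join all but the last item with ", ", then add "and" before the last one
sentence = ', '.join(test_list[:-1]) + ', and ' + test_list[-1]
print('The list is ' + sentence + '.')
# The list is apples, bananas, tofu, and cats.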
I'm new to pandas and am having a lot of trouble with this and haven't found a solution, despite my searches. Hoping one of you can help me.
I have a pandas dataframe that has a column of emails that I'm trying to clean up. Some examples are:
>>> email['EMAIL']
0              testing@...com
1                         NaN
2           I.am.ME@GAMIL.COM
3    FIRST.LAST.NAME@MAIL.CMO
4    EMAIL+REMOVE@TESTING.COM
Name: EMAIL, dtype: object
There are a number of things I'm trying to do here:
1) replace misspelled endings (e.g. CMO) with correct spellings (e.g. COM)
2) replace misspelled domain names with correct spellings
3) replace multiple periods with just 1 period AFTER the '@' symbol.
4) remove all periods before the '@' sign if they have a gmail account
5) remove all characters after the "+" symbol up to the '@' symbol
So, from the example above, this is what I would want returned:
>>> email['EMAIL']
0                testing@.com
1                         NaN
2             IamME@GMAIL.COM
3    FIRST.LAST.NAME@MAIL.COM
4           EMAIL@TESTING.COM
Name: EMAIL, dtype: object
I've worked on a number of different versions of the code and keep running into errors. Here's one of my best guesses so far, for removing multiple periods after the '@' symbol.
def remove_periods(email):
    email_split = email['EMAIL'].str.split('@')
    ending = email_split.str.get(-1)
    ending = ending.str.replace('\.{2,}', '.')
    emailupdate = email_split.str[:-1]
    emailupdate.append(ending)
    email_split.str.get()
    return '@'.join(emailupdate)

email['EMAIL'].apply(remove_periods)
I could post the multiple other versions too, but they all return errors as well.
Thanks a lot for the help!
import numpy as np
import pandas as pd
pd.options.display.width = 1000

email = pd.DataFrame({'EMAIL': [
    'testing@...com', np.nan, 'I.am.ME@GAMIL.COM', 'FIRST.LAST.NAME@MAIL.CMO',
    'EMAIL+REMOVE@TESTING.COM', 'gamil@bar...com', 'noperiods@localhost']})

email[['NAME', '@', 'ADDR']] = email['EMAIL'].str.rpartition('@')

# 1) replace misspelled endings (e.g. CMO) with correct spellings
email['ADDR'] = email['ADDR'].str.replace(r'(?i)CMO$', 'COM', regex=True)
# 2) replace misspelled domain names with correct spellings
email['ADDR'] = email['ADDR'].str.replace(r'(?i)GAMIL', 'GMAIL', regex=True)
# 3) replace multiple periods with just 1 period AFTER the '@' symbol
email['ADDR'] = email['ADDR'].str.replace(r'[.]{2,}', '.', regex=True)
# 4) remove all periods before the '@' sign if they have a gmail account
mask = email['ADDR'].str.contains(r'(?i)^GMAIL[.]COM$') == True
email.loc[mask, 'NAME'] = email.loc[mask, 'NAME'].str.replace(r'[.]', '', regex=True)
# 5) remove all characters after the "+" symbol up to the '@' symbol
email['NAME'] = email['NAME'].str.replace(r'[+].*', '', regex=True)

# put it back together. You could reassign to email['EMAIL'] if you wish.
email['NEW_EMAIL'] = email['NAME'] + email['@'] + email['ADDR']

# clean up the intermediate columns if you like
# email = email.drop(columns=['NAME', '@', 'ADDR'])

print(email)
yields
                      EMAIL             NAME     @         ADDR                 NEW_EMAIL
0            testing@...com          testing     @         .com              testing@.com
1                       NaN              NaN  None         None                       NaN
2         I.am.ME@GAMIL.COM            IamME     @    GMAIL.COM           IamME@GMAIL.COM
3  FIRST.LAST.NAME@MAIL.CMO  FIRST.LAST.NAME     @     MAIL.COM  FIRST.LAST.NAME@MAIL.COM
4  EMAIL+REMOVE@TESTING.COM            EMAIL     @  TESTING.COM         EMAIL@TESTING.COM
5           gamil@bar...com            gamil     @      bar.com             gamil@bar.com
6       noperiods@localhost        noperiods     @    localhost       noperiods@localhost
The NAME column holds everything before the last @, and the ADDR column holds everything after the last @. I left the NAME and ADDR columns visible (and did not overwrite the original EMAIL column) so it would be easier to understand the intermediate steps.
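(If str.rpartition is unfamiliar: it splits on the last occurrence of the separator and always returns a three-way split, which is why three columns come back. A tiny sketch with the plain built-in rather than the pandas accessor:)

print('I.am.ME@GAMIL.COM'.rpartition('@'))
# ('I.am.ME', '@', 'GAMIL.COM')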
I have this doing what I want (take a file, shuffle the middle letters of each word, and rejoin them), but for some reason the spaces are being removed, even though I'm asking it to split on spaces. Why is that?
import random

File_input = str(input("Enter file name here:"))
text_file = None
try:
    text_file = open(File_input)
except FileNotFoundError:
    print("Please check file name.")

if text_file:
    for line in text_file:
        for word in line.split(' '):
            words = list(word)
            Internal = words[1:-1]
            random.shuffle(Internal)
            words[1:-1] = Internal
            Shuffled = ''.join(words)
            print(Shuffled, end='')
If you want the delimiter as part of the values:
d = " " #delim
line = "This is a test" #string to split, would be `line` for you
words = [e+d for e in line.split(d) if e != ""]
What this does is split the string, but return each split value plus the delimiter used. The result is still a list, in this case ['This ', 'is ', 'a ', 'test '].
If you want the delimiter as part of the resultant list, instead of using the regular str.split(), you can use re.split(). The docs note:
re.split(pattern, string[, maxsplit=0, flags=0])
    Split string by the occurrences of pattern. If capturing parentheses are
    used in pattern, then the text of all groups in the pattern are also
    returned as part of the resulting list.
So, you could use:
import re
re.split("( )", "This is a test")
And the result:
['This', ' ', 'is', ' ', 'a', ' ', 'test']
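To tie that back to your loop, here is a minimal sketch (the sample line is made up) of the same shuffle using re.split: the captured spaces come through as their own items, pass through the shuffle untouched, and so survive in the output.

import random
import re

line = "shuffle the middle letters of each word"
for word in re.split('( )', line):
    letters = list(word)
    middle = letters[1:-1]        # everything except the first and last character
    random.shuffle(middle)
    letters[1:-1] = middle
    print(''.join(letters), end='')
print()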