I am trying to split a target sentence into composite pieces for a later function using re.split() and the regex
(#?\w+)(\W+)
Ideally, this would split words and non-word characters into a generated list, preserving both as separate list items, with the exception of the "#" symbol, which could precede a word. If there is a # symbol before a word, I want to keep it as a cohesive item in the split. My example is below.
My test sentence is as follows:
this is a test of proper nouns #Ryan
So the line of code is:
re.split(r'(#?\w+)(\W+)', "this is a test of proper nouns #Ryan")
The list that I want to generate would include "#Ryan" as a single item but, instead, it looks like this
['', 'this', ' ', '', 'is', ' ', '', 'a', ' ', '', 'test', ' ', '', 'of', ' ', '', 'proper', ' ', '', 'nouns', ' #', 'Ryan']
Since the first container has the # symbol, I would have thought that it would be evaluated first but that is apparently not the case. I have tried using lookaheads or removing # from the \W+ container to no avail.
https://regex101.com/r/LeezvP/1
With your shown samples, please try the following (written and tested in Python 3.8.5), assuming that you need to remove empty items from your list. This gives output where # stays together with its word.
##First split the text/line here and save it to list named li.
li=re.split(r'(#?\w+)(?:\s+)', "this is a test of proper nouns #Ryan")
li
['', 'this', '', 'is', '', 'a', '', 'test', '', 'of', '', 'proper', '', 'nouns', '#Ryan']
##Use filter to remove nulls in list li.
list(filter(None, li))
['this', 'is', 'a', 'test', 'of', 'proper', 'nouns', '#Ryan']
Simple explanation: use re.split with one capturing group, which has an optional # followed by word characters, and one non-capturing group, which matches one or more whitespace characters. This leaves empty strings in the list, so use the filter function to remove them.
NOTE: As per the OP's comments, the whitespace items may be required; in that case one could use the following code, which worked for the OP:
li=re.split(r'(#?\w+)(\s+|\W+)', "this is a test of proper nouns #Ryan")
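For reference, here is a minimal sketch of what that variant produces on the test sentence. The list comprehension that drops the empty strings is an addition for illustration (it is not part of the code that worked for the OP); it keeps the ' ' items because only '' is filtered out.

```python
import re

sentence = "this is a test of proper nouns #Ryan"

# Two capturing groups: the word (with an optional leading #) and the separator.
li = re.split(r'(#?\w+)(\s+|\W+)', sentence)

# Drop only the empty strings; whitespace items such as ' ' are kept.
tokens = [item for item in li if item != '']
print(tokens)
# ['this', ' ', 'is', ' ', 'a', ' ', 'test', ' ', 'of', ' ', 'proper', ' ', 'nouns', ' ', '#Ryan']
```

Because \s+ is tried before \W+, the space after "nouns" is consumed on its own and the # is left attached to "Ryan".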
You could also match using re.findall and use an alternation | matching the desired parts.
(?:[^#\w\s]+|#(?!\w))+|\s+|#?\w+
Explanation
(?: Non capture group
[^#\w\s]+ Match 1+ times any char except # word char or whitespace char
| Or
#(?!\w) Match # when not directly followed by a word char
)+ Close the group and match 1+ times
| Or
\s+ Match 1+ whitespace chars to keep them as a separate match in the result
| Or
#?\w+ Match an optional # directly followed by 1+ word chars
Example
import re
pattern = r"(?:[^#\w\s]+|#(?!\w))+|\s+|#?\w+"
print(re.findall(pattern, "this is a test of proper nouns #Ryan"))
# Output
# ['this', ' ', 'is', ' ', 'a', ' ', 'test', ' ', 'of', ' ', 'proper', ' ', 'nouns', ' ', '#Ryan']
print(re.findall(pattern, "this #Ryan #$#test#123#4343##$%$test#1#$#$###1####"))
# Output
# ['this', ' ', '#Ryan', ' ', '#$', '#test', '#123', '#4343', '##$%$', 'test', '#1', '#$#$##', '#1', '####']
The regex, #?\w+|\b(?!$) should meet your requirement.
Explanation at regex101:
1st Alternative #?\w+
# matches the character # literally (case sensitive)
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
\w matches any word character (equivalent to [a-zA-Z0-9_])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Alternative \b(?!$)
\b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
Negative Lookahead (?!$)
Assert that the Regex below does not match
$ asserts position at the end of a line
Related
I'm trying to split a string into a list of strings. Right now I have to split whenever I see any of these characters: '.', ';', ':', '?', '!', '( )', '[ ]', '{ }' (keep in mind that I have to maintain whatever is inside the brackets).
To solve it I tried to write
print(re.split("\(([^)]*)\)|[.,;:?!]\s*", "Hello world,this is(example)"))
but as output I get:
['Hello world', None, 'this is', 'example', '']
Setting aside the '' at the end, which I'll solve later, how can I remove the None that appears in the middle of the list?
By the way I can't iterate in the list another time because the program shall work with huge files and I have to make it as fast as possible.
Also I don't have to necessarily use re.split so everything that works will be just fine!
I'm still new at this so I'm sorry if something is incorrect.
Not sure if this is fast enough but you could do this:
re.sub(r";|,|:|\(|\)|\[|\]|\?|\.|\{|\}|!", " ", "Hello world,this is(example)").split()
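Note that the one-liner above also splits on whitespace, so it returns ['Hello', 'world', 'this', 'is', 'example']. If the phrases should stay intact, a minimal sketch is to keep the question's own pattern and drop the None and '' entries while the list is being built, so no second pass over the result is needed. The function name split_keep_brackets is hypothetical, and only the round-bracket case from the question's pattern is handled.

```python
import re

# The pattern from the question, precompiled and written as a raw string.
PATTERN = re.compile(r"\(([^)]*)\)|[.,;:?!]\s*")

def split_keep_brackets(text):
    # re.split yields None for a group that did not take part in a match,
    # and '' for empty pieces; both are falsy, so one test removes them.
    return [part for part in PATTERN.split(text) if part]

print(split_keep_brackets("Hello world,this is(example)"))
# ['Hello world', 'this is', 'example']
```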
This is a project found at https://automatetheboringstuff.com/2e/chapter7/
It searches the text on the clipboard for phone numbers and emails, then copies the results to the clipboard again.
If I understood it correctly, when the regular expression contains groups, the findall() function returns a list of tuples. Each tuple would contain strings matching each regex group.
Now this is my problem: the regex in phoneRegex, as far as I can tell, contains only 6 groups (numbered in the code), so I would expect tuples of length 6.
But when I print the tuples I get tuples of length 9:
('800-420-7240', '800', '-', '420', '-', '7240', '', '', '')
('415-863-9900', '415', '-', '863', '-', '9900', '', '', '')
('415-863-9950', '415', '-', '863', '-', '9950', '', '', '')
What am I missing?
#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.
import pyperclip, re
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))? # area code (first group?)0
(\s|-|\.)? # separator 1
(\d{3}) # first 3 digits 2
(\s|-|\.) # separator 3
(\d{4}) # last 4 digits 4
(\s*(ext|x|ext.)\s*(\d{2,5}))? # extension 5
)''', re.VERBOSE)
# Create email regex.
emailRegex = re.compile(r'''(
[a-zA-Z0-9._%+-]+ # username
@ # @ symbol
[a-zA-Z0-9.-]+ # domain name
(\.[a-zA-Z]{2,4}) # dot-something
)''', re.VERBOSE)
text = str(pyperclip.paste())
matches = []
for groups in phoneRegex.findall(text):
    print(groups)
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])
# Copy results to the clipboard.
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')
Anything in parentheses will become a capturing group (and add one to the length of the re.findall tuple) unless you specify otherwise. To turn a sub-group into a non-capturing group, add ?: just inside the parentheses:
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?
(\s|-|\.)?
(\d{3})
(\s|-|\.)
(\d{4})
(\s*(?:ext|x|ext.)\s*(?:\d{2,5}))? # <---
)''', re.VERBOSE)
You can see the extension part was adding two additional capturing groups. With this updated version, you will have 7 items in your tuple. There are 7 instead of 6 because the outermost parentheses capture the entire match as well.
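One way to check the group counts without counting parentheses by hand is the groups attribute of a compiled pattern. The sketch below compares the question's regex with the non-capturing version; the variable names original and non_capturing are only for illustration.

```python
import re

# The regex from the question: 9 pairs of capturing parentheses in total.
original = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext.)\s*(\d{2,5}))?
    )''', re.VERBOSE)

# Same regex with the two inner extension groups made non-capturing.
non_capturing = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(?:ext|x|ext.)\s*(?:\d{2,5}))?
    )''', re.VERBOSE)

print(original.groups)       # 9
print(non_capturing.groups)  # 7

# findall returns one tuple per match, with one item per capturing group.
print(original.findall("800-420-7240"))
# [('800-420-7240', '800', '-', '420', '-', '7240', '', '', '')]
```

Keep in mind that after this change the extension digits are no longer available as groups[8]; the whole extension, including the "ext" or "x" prefix, ends up in groups[6].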
The regex could be better, too. This is cleaner and will match more cases with the re.IGNORECASE flag:
phoneRegex = re.compile(r'''(
(\(?\d{3}\)?)
([\s.-])?
(\d{3})
([\s.-])
(\d{4})
\s* # don't need to capture whitespace
((?:ext\.?|x)\s*(?:\d{1,5}))?
)''', re.VERBOSE | re.IGNORECASE)
I am trying to take the words from a stopwords.txt file and append them as strings to a Python list.
stopwords.txt
a
about
above
after
again
against
all
am
an
and
any
are
aren't
as
at
be
because
been
before
being
My Code :
stopword = open("stopwords.txt", "r")
stopwords = []
for word in stopword:
    stopwords.append(word)
List stopwords output:
['a\n',
'about\n',
'above\n',
'after\n',
'again\n',
'against\n',
'all\n',
'am\n',
'an\n',
'and\n',
'any\n',
'are\n',
"aren't\n",
'as\n',
'at\n',
'be\n',
'because\n',
'been\n',
'before\n',
'being\n']
Desired Output :
['a',
'about',
'above',
'after',
'again',
'against',
'all',
'am',
'an',
'and',
'any',
'are',
"aren't",
'as',
'at',
'be',
'because',
'been',
'before',
'being']
Is there any method to transform stopword so that it eliminates the '\n' character, or any method at all to reach the desired output?
Instead of
stopwords.append(word)
do
stopwords.append(word.strip())
The str.strip() method strips whitespace of any kind (spaces, tabs, newlines, etc.) from the start and end of the string. You can pass an argument to strip a specific set of characters instead, or use lstrip() or rstrip() to strip only the front or back of the string, but for this case plain strip() should suffice.
You can use the .strip() method. It removes leading and trailing occurrences of the characters passed as an argument from a string:
stopword = open("stopwords.txt", "r")
stopwords = []
for word in stopword:
    stopwords.append(word.strip("\n"))
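The same idea can be written more compactly with a with block (so the file is closed automatically) and a list comprehension. This is only a sketch: it writes a small temporary stand-in for stopwords.txt so that it runs on its own, and the four sample words are taken from the question's list.

```python
import os
import tempfile

# Create a small stand-in for stopwords.txt so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stopwords.txt")
    with open(path, "w") as f:
        f.write("a\nabout\naren't\nbeing\n")

    # Read the file and strip surrounding whitespace (including '\n') per line.
    with open(path, "r") as f:
        stopwords = [line.strip() for line in f]

print(stopwords)
# ['a', 'about', "aren't", 'being']
```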
I have this doing what I want it to (Take a file, shuffle the middle letters of the words and rejoin them), but for some reason, the spaces are being removed even though I'm asking it to split on spaces. Why is that?
import random
File_input= str(input("Enter file name here:"))
text_file=None
try:
    text_file = open(File_input)
except FileNotFoundError:
    print ("Please check file name.")
if text_file:
    for line in text_file:
        for word in line.split(' '):
            words=list (word)
            Internal = words[1:-1]
            random.shuffle(Internal)
            words[1:-1]=Internal
            Shuffled=' '.join(words)
            print (Shuffled, end='')
If you want the delimiter as part of the values:
d = " " #delim
line = "This is a test" #string to split, would be `line` for you
words = [e+d for e in line.split(d) if e != ""]
What this does is split the string, but return the split value plus the delimiter used. Result is still a list, in this case ['This ', 'is ', 'a ', 'test '].
If you want the delimiter as part of the resultant list, instead of using the regular str.split(), you can use re.split(). The docs note:
re.split(pattern, string[, maxsplit=0, flags=0])
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
So, you could use:
import re
re.split("( )", "This is a test")
And result:
['This', ' ', 'is', ' ', 'a', ' ', 'test']
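Applied to the shuffling question, a minimal sketch might look like the following. The helper names shuffle_middle and scramble_line are hypothetical, and splitting on r'(\s+)' rather than "( )" is an assumption made here so that tabs and newlines are preserved as well as spaces.

```python
import random
import re

def shuffle_middle(word):
    """Shuffle the letters between the first and last character of word."""
    if len(word) < 4:
        return word  # words of three letters or fewer have nothing to shuffle
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + ''.join(middle) + word[-1]

def scramble_line(line):
    # The capturing group makes re.split return the whitespace runs too,
    # so they can be passed through unchanged and the line rebuilt exactly.
    pieces = re.split(r'(\s+)', line)
    return ''.join(p if p.isspace() else shuffle_middle(p) for p in pieces)

print(scramble_line("This is a test of shuffling"))
```

The pieces are rejoined with ''.join rather than ' '.join, because the whitespace is already present in the list.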
Is it possible to use a regex to obtain the following result?
text = "123abcd56EFG"
listWanted = ["123", "abcd", "56", "EFG"]
The idea is to cut the text each time a digit is followed by a letter, or a letter is followed by a digit.
The solution, thanks to the following answer:
import re
pattern = r'(\d+|\D+)'
text = "123abcd56EFG"
print(re.split(pattern, text))
text = "abcd56EFG"
print(re.split(pattern, text))
This code will give...
['', '123', '', 'abcd', '', '56', '', 'EFG', '']
['', 'abcd', '', '56', '', 'EFG', '']
Use a capturing group in your regex.
>>> import re
>>> text = "123abcd56EFG"
>>> pattern = r'(\d+)'
>>> re.split(pattern, text)
['', '123', 'abcd', '56', 'EFG']
While this will give you empty strings at the start and/or end for lines with digit groups at the start and/or end, those are easy enough to trim off.
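As a sketch of that trimming step, the empty strings can be dropped while the list is built; the variable name parts is only for illustration.

```python
import re

text = "123abcd56EFG"

# Split on runs of digits, keep the digits via the capturing group,
# and discard the empty strings produced at the edges.
parts = [p for p in re.split(r'(\d+)', text) if p]
print(parts)
# ['123', 'abcd', '56', 'EFG']
```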
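You're going to want to match using \d+|\D+ as your regex (in Python, with re.findall rather than re.split: every character belongs to some match, so splitting on this pattern without a capturing group leaves only empty strings).
Note that you need escape sequences to get the \ into an ordinary string, so the actual text entered will be: "\\d+|\\D+"
UNLESS, as noted in the comment below, you use a raw string, in which case it would be r"\d+|\D+" or r'\d+|\D+'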
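In Python, this pattern is most convenient with re.findall, which returns the matches themselves and so produces no empty strings. A minimal sketch follows; the function name split_runs is hypothetical, and note that \D matches any non-digit character, not only letters.

```python
import re

def split_runs(text):
    # Each match is a maximal run of digits or a maximal run of non-digits.
    return re.findall(r'\d+|\D+', text)

print(split_runs("123abcd56EFG"))  # ['123', 'abcd', '56', 'EFG']
print(split_runs("abcd56EFG"))     # ['abcd', '56', 'EFG']
```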