REGEX - Cut a text between number and letter - text

Is it possible to use a regex to obtain the following result?
text = "123abcd56EFG"
listWanted = ["123", "abcd", "56", "EFG"]
The idea is to cut the text each time a digit is followed by a letter, or a letter is followed by a digit.
The solution, thanks to the following answer:
import re
pattern = r'(\d+|\D+)'
text = "123abcd56EFG"
print(re.split(pattern, text))
text = "abcd56EFG"
print(re.split(pattern, text))
This code will give...
['', '123', '', 'abcd', '', '56', '', 'EFG', '']
['', 'abcd', '', '56', '', 'EFG', '']

Use a capturing group in your regex.
>>> import re
>>> text = "123abcd56EFG"
>>> pattern = r'(\d+)'
>>> re.split(pattern, text)
['', '123', 'abcd', '56', 'EFG']
While this will give you empty strings at the start and/or end for lines with digit groups at the start and/or end, those are easy enough to trim off.
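As a minimal sketch of the trimming step mentioned above, the empty strings can be dropped with a list comprehension. The helper name split_digits is hypothetical and only used for illustration.

```python
import re

def split_digits(text):
    """Split text into digit and non-digit runs, dropping empty strings."""
    # The capturing group keeps the digit runs in the result of re.split.
    parts = re.split(r'(\d+)', text)
    # Empty strings appear when the text starts or ends with digits.
    return [part for part in parts if part]

print(split_digits("123abcd56EFG"))  # ['123', 'abcd', '56', 'EFG']
print(split_digits("abcd56EFG"))     # ['abcd', '56', 'EFG']
```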

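The pattern you want is \d+|\D+ (a run of digits or a run of non-digits). Because this pattern matches every character of the text, use it with re.findall, or wrap it in a capturing group if you use re.split; otherwise re.split returns only empty strings.
Note that in an ordinary string literal you need escape sequences for the backslashes, so the actual text entered will be: "\\d+|\\D+"
UNLESS, as noted in the comment below, you use a raw string, in which case it would be r"\d+|\D+" or r'\d+|\D+'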
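A short sketch of how the \d+|\D+ pattern could be applied with re.findall, which returns the matched runs directly and so produces no empty strings. Note that \D+ matches any non-digit characters, not only letters.

```python
import re

text = "123abcd56EFG"
# Each match is either a run of digits or a run of non-digits.
pieces = re.findall(r'\d+|\D+', text)
print(pieces)  # ['123', 'abcd', '56', 'EFG']
```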

Related

Python regex - identifying words with preceding symbol

I am trying to split a target sentence into composite pieces for a later function using re.split() and the regex
(#?\w+)(\W+)
Ideally, this would split words and non-word characters in a generated list, preserving both as separate list items, with the exception of the "#" symbol which could precede a word. If there is an # symbol before a word, I want to keep it as a cohesive item in the split. My example is below.
My test sentence is as follows:
this is a test of proper nouns #Ryan
So the line of code is:
re.split(r'(#?\w+)(\W+)', "this is a test of proper nouns #Ryan")
The list that I want to generate would include "#Ryan" as a single item but, instead, it looks like this
['', 'this', ' ', '', 'is', ' ', '', 'a', ' ', '', 'test', ' ', '', 'of', ' ', '', 'proper', ' ', '', 'nouns', ' #', 'Ryan']
Since the first container has the # symbol, I would have thought that it would be evaluated first but that is apparently not the case. I have tried using lookaheads or removing # from the \W+ container to no avail.
https://regex101.com/r/LeezvP/1
With your shown samples, you could try the following (written and tested in Python 3.8.5), assuming that you want to remove the empty items from the list. This gives output where # stays together with the word.
##First split the text/line here and save it to list named li.
li=re.split(r'(#?\w+)(?:\s+)', "this is a test of proper nouns #Ryan")
li
['', 'this', '', 'is', '', 'a', '', 'test', '', 'of', '', 'proper', '', 'nouns', '#Ryan']
##Use filter to remove nulls in list li.
list(filter(None, li))
['this', 'is', 'a', 'test', 'of', 'proper', 'nouns', '#Ryan']
A simple explanation: call re.split with one capturing group that matches an optional # followed by word characters, and one non-capturing group that matches one or more whitespace characters. This leaves empty strings in the list, so remove them with filter.
NOTE: As per the OP's comments, the spaces may need to be kept; in that case the following code, which worked for the OP, can be used:
li=re.split(r'(#?\w+)(\s+|\W+)', "this is a test of proper nouns #Ryan")
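For illustration, here is a sketch of what that variant produces on the test sentence once the empty strings are filtered out. The variable names are placeholders, not part of the original answer.

```python
import re

sentence = "this is a test of proper nouns #Ryan"
# Two capturing groups: the word (with optional #) and the separator after it.
parts = re.split(r'(#?\w+)(\s+|\W+)', sentence)
# filter(None, ...) drops the empty strings but keeps the ' ' separators.
tokens = list(filter(None, parts))
print(tokens)
# ['this', ' ', 'is', ' ', 'a', ' ', 'test', ' ', 'of', ' ', 'proper', ' ', 'nouns', ' ', '#Ryan']
```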
You could also match using re.findall and use an alternation | matching the desired parts.
(?:[^#\w\s]+|#(?!\w))+|\s+|#?\w+
Explanation
(?: Non capture group
[^#\w\s]+ Match 1+ times any char except # word char or whitespace char
| Or
#(?!\w) Match # when not directly followed by a word char
)+ Close the group and match 1+ times
| Or
\s+ Match 1+ whitespace chars to keep them as a separate match in the result
| Or
#?\w+ Match an optional # directly followed by 1+ word chars
Example
import re
pattern = r"(?:[^#\w\s]+|#(?!\w))+|\s+|#?\w+"
print(re.findall(pattern, "this is a test of proper nouns #Ryan"))
# Output
# ['this', ' ', 'is', ' ', 'a', ' ', 'test', ' ', 'of', ' ', 'proper', ' ', 'nouns', ' ', '#Ryan']
print(re.findall(pattern, "this #Ryan #$#test#123#4343##$%$test#1#$#$###1####"))
# Output
# ['this', ' ', '#Ryan', ' ', '#$', '#test', '#123', '#4343', '##$%$', 'test', '#1', '#$#$##', '#1', '####']
The regex, #?\w+|\b(?!$) should meet your requirement.
Explanation at regex101:
1st Alternative #?\w+
# matches the character # literally (case sensitive)
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
\w matches any word character (equivalent to [a-zA-Z0-9_])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Alternative \b(?!$)
\b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
Negative Lookahead (?!$)
Assert that the Regex below does not match
$ asserts position at the end of a line

Need output in string type

I have an input string s. I want to print s with all occurrences of WUB replaced by a space.
s = input()
print(s.split("WUB"))
Input : WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB
but the output I am getting is like this
: ['', 'WE', 'ARE', '', 'THE', 'CHAMPIONS', 'MY', 'FRIEND', '']
instead I need output in string format, like this : WE ARE THE CHAMPIONS MY FRIEND
You can join the strings in the list produced by split with a space:
print(" ".join(s.split("WUB")))
You can also just use replace instead of split + join:
print(s.replace("WUB", " "))
You can unpack the list directly in the print call like this:
s = input()
print(*s.split("WUB"))
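Note the * before s.split("WUB"): it unpacks the list so that print receives each piece as a separate argument and separates them with spaces. The empty strings in the list still produce a leading space and a doubled space.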
WE ARE THE CHAMPIONS MY FRIEND
Just join all elements from your list. See it below:
print(" ".join("WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB".split("WUB")).strip())
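The approaches above leave a doubled space wherever two WUBs were adjacent. One possible way to collapse those, sketched here with the sample input from the question, is to replace WUB with a space and then let split() with no arguments discard the extra whitespace.

```python
s = "WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB"
# split() with no arguments splits on any run of whitespace and
# ignores leading/trailing whitespace, so empty pieces disappear.
result = " ".join(s.replace("WUB", " ").split())
print(result)  # WE ARE THE CHAMPIONS MY FRIEND
```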

How to eliminate new line character from text file when opened in python?

I am trying to read words from a stopwords.txt file and append them as strings to a Python list.
stopwords.txt
a
about
above
after
again
against
all
am
an
and
any
are
aren't
as
at
be
because
been
before
being
My Code :
stopword = open("stopwords.txt", "r")
stopwords = []
for word in stopword:
    stopwords.append(word)
List stopwords output:
['a\n',
'about\n',
'above\n',
'after\n',
'again\n',
'against\n',
'all\n',
'am\n',
'an\n',
'and\n',
'any\n',
'are\n',
"aren't\n",
'as\n',
'at\n',
'be\n',
'because\n',
'been\n',
'before\n',
'being\n']
Desired Output :
['a',
'about',
'above',
'after',
'again',
'against',
'all',
'am',
'an',
'and',
'any',
'are',
"aren't",
'as',
'at',
'be',
'because',
'been',
'before',
'being']
Is there any method to transform stopword so that it eliminates the '\n' character, or any other method to reach the desired output?
Instead of
stopwords.append(word)
do
stopwords.append(word.strip())
The str.strip() method strips whitespace of any kind (spaces, tabs, newlines, etc.) from the start and end of the string. You can pass an argument to strip a specific set of characters instead (the argument is treated as a set of characters, not as a substring), or use lstrip() or rstrip() to strip only the front or the back of the string, but for this case plain strip() should suffice.
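A small sketch of the differences between these methods; the sample strings are made up for illustration.

```python
word = "  about\n"

stripped = word.strip()            # whitespace removed from both ends
right_only = word.rstrip("\n")     # only trailing newlines removed
left_only = word.lstrip()          # only leading whitespace removed
# The argument is a set of characters, not a substring.
char_set = "xyhixy".strip("yx")

print(repr(stripped))    # 'about'
print(repr(right_only))  # '  about'
print(repr(left_only))   # 'about\n'
print(repr(char_set))    # 'hi'
```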
You can use the .strip() method. It removes the characters passed as an argument from both ends of the string (not from the middle):
stopword = open("stopwords.txt", "r")
stopwords = []
for word in stopword:
    stopwords.append(word.strip("\n"))
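As an alternative sketch, the whole file can be read at once and split with splitlines(), which drops the line endings. The helper name load_stopwords is hypothetical, and the temporary file only stands in for stopwords.txt so the example runs on its own.

```python
import os
import tempfile

def load_stopwords(path):
    """Return the non-empty lines of the file at path, without newlines."""
    with open(path, "r") as handle:
        # splitlines() drops the line endings; the filter skips blank lines.
        return [line for line in handle.read().splitlines() if line]

# Demonstration with a throwaway file standing in for stopwords.txt.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "stopwords.txt")
    with open(sample_path, "w") as handle:
        handle.write("a\nabout\naren't\n")
    stopwords = load_stopwords(sample_path)

print(stopwords)  # ['a', 'about', "aren't"]
```

Using a with block also closes the file automatically, which the open() call on its own does not do.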

Strip Punctuation From String in Python

I'm working with documents, and I need to have the words isolated without punctuation. I know how to use string.split(" ") to make each word just the letters, but the punctuation baffles me.
This is an example using regex, and the result is
['this', 'is', 'a', 'string', 'with', 'punctuation']
s = " ,this ?is a string! with punctuation. "
import re
pattern = re.compile(r'\w+')  # raw string so that \w is not treated as a string escape
result = pattern.findall(s)
print(result)
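A regex is not the only option. As a sketch of a different technique (not the one used in the answer above), the punctuation can be deleted with str.translate and string.punctuation, and the remainder split on whitespace.

```python
import string

s = " ,this ?is a string! with punctuation. "
# str.maketrans with a third argument maps each listed character to None.
table = str.maketrans("", "", string.punctuation)
words = s.translate(table).split()
print(words)  # ['this', 'is', 'a', 'string', 'with', 'punctuation']
```

The two approaches differ on apostrophes: \w+ splits "aren't" into 'aren' and 't', while the translate version yields 'arent'.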

Why is stripping a newline character in Python creating a blank space as a list item?

I have the following code:
# File declaration.
infileS = open("single.dat", 'r')
infileD = open("double.dat", 'r')
infileT = open("triple.dat", 'r')
infileHR = open("homerun.dat", 'r')
infileAB = open("atbat.dat", 'r')
infileP = open("player.dat", 'r')
# Fill up the lists.
single = infileS.read()
double = infileD.read()
triple = infileT.read()
homerun = infileHR.read()
atbat = infileAB.read()
player = infileP.read()
single = [item.rstrip() for item in single]
double = [item.rstrip() for item in double]
triple = [item.rstrip() for item in triple]
homerun = [item.rstrip() for item in homerun]
atbat = [item.rstrip() for item in atbat]
player = [item.rstrip() for item in player]
print (single)
What prints:
['5', '', '3', '', '1', '0', '', '1', '2', '', '6', '', '9', '', '2', '0', '', '4', '', '7']
I don't want the '' items. What have I done wrong and what can I do to fix this?
All the .dat files are simple Notepad lists of numbers. The "single.dat" is a list of numbers with "enter" putting them on different lines (with no lines in between), and looks like: (minus, of course, the spaces between the paragraphs containing those numbers)
5
3
10
12
6
9
20
4
7
The empty strings ('') are what's left over if you strip something that's all whitespace (or possibly they were empty to start with). The easiest way to eliminate these is to use the fact that '' is falsy, so you can remove them right there in your list comprehensions by adding if item.strip().
The problem is that you're iterating over the output of file.read(), which is a single string. Strings in Python are iterable, but this means that when you iterate over them, you iterate over each character. So what you're doing is stripping each individual character and adding it to your list--so all your newlines turn into empty strings, rather than being stripped out like I think you intended.
To fix it, use the fact that file objects are also iterable, and iterate line-by-line. This is the idiomatic way to read a file line-by-line in Python (using a context manager rather than a lone open call):
with open('single.dat') as f:
    for line in f:
        dosomething(line)
So, use that pattern along with some filtering in your list comprehension, and you'll be all set:
with open('single.dat') as f:
    single = [line.strip() for line in f if line.strip()]
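The difference between iterating over characters and iterating over lines can be reproduced without any files. This sketch uses a short in-memory string as a stand-in for the contents of single.dat.

```python
contents = "5\n3\n10\n"

# Iterating over a string yields single characters, so each newline
# becomes an empty string after rstrip().
per_character = [item.rstrip() for item in contents]
print(per_character)  # ['5', '', '3', '', '1', '0', '']

# Splitting into lines first gives one item per line.
per_line = [line.strip() for line in contents.splitlines() if line.strip()]
print(per_line)  # ['5', '3', '10']
```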
It might be easiest to just filter out the ''. For instance:
>>> items = ['', 'cat', 'dog', '']
>>> list(filter(None, items))
['cat', 'dog']
