Strip Punctuation From String in Python - python-3.x

I'm working with documents, and I need to isolate the words without their punctuation. I know how to use string.split(" ") to separate the words, but the punctuation baffles me.

Here is an example using a regex. The result is
['this', 'is', 'a', 'string', 'with', 'punctuation']
s = " ,this ?is a string! with punctuation. "
import re
pattern = re.compile(r'\w+')  # raw string, so \w is not treated as a string escape
result = pattern.findall(s)
print(result)
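As an alternative that avoids regular expressions, here is a minimal sketch that deletes every character listed in string.punctuation with str.translate and then splits on whitespace. The names table and words are illustrative and not part of the original answer, and string.punctuation covers ASCII punctuation only.

```python
import string

s = " ,this ?is a string! with punctuation. "

# Translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)

# split() with no argument splits on runs of whitespace and drops empty strings.
words = s.translate(table).split()
print(words)  # ['this', 'is', 'a', 'string', 'with', 'punctuation']
```

The two approaches treat apostrophes differently: \w+ splits "aren't" into 'aren' and 't', while the translate version yields 'arent'. Which is preferable depends on the documents being processed.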

Related

Python regex - identifying words with preceding symbol

I am trying to split a target sentence into composite pieces for a later function using re.split() and the regex
(#?\w+)(\W+)
Ideally, this would split words and non-word characters in a generated list, preserving both as separate list items, with the exception of the "#" symbol which could precede a word. If there is an # symbol before a word, I want to keep it as a cohesive item in the split. My example is below.
My test sentence is as follows:
this is a test of proper nouns #Ryan
So the line of code is:
re.split(r'(#?\w+)(\W+)', "this is a test of proper nouns #Ryan")
The list that I want to generate would include "#Ryan" as a single item but, instead, it looks like this
['', 'this', ' ', '', 'is', ' ', '', 'a', ' ', '', 'test', ' ', '', 'of', ' ', '', 'proper', ' ', '', 'nouns', ' #', 'Ryan']
Since the first container has the # symbol, I would have thought that it would be evaluated first but that is apparently not the case. I have tried using lookaheads or removing # from the \W+ container to no avail.
https://regex101.com/r/LeezvP/1
With your shown samples, please try the following (written and tested in Python 3.8.5), assuming that you need to remove the empty items from your list. This gives output where # stays together with its word.
##First split the text/line here and save it to list named li.
li=re.split(r'(#?\w+)(?:\s+)', "this is a test of proper nouns #Ryan")
li
['', 'this', '', 'is', '', 'a', '', 'test', '', 'of', '', 'proper', '', 'nouns', '#Ryan']
##Use filter to remove nulls in list li.
list(filter(None, li))
['this', 'is', 'a', 'test', 'of', 'proper', 'nouns', '#Ryan']
In short: use re.split with one capturing group that matches an optional # followed by word characters, and one non-capturing group that matches one or more whitespace characters. This leaves empty strings in the list, so use filter to remove them.
NOTE: As per the OP's comments, the whitespace items may be required. In that case, use the following code, which worked for the OP:
li=re.split(r'(#?\w+)(\s+|\W+)', "this is a test of proper nouns #Ryan")
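A minimal sketch of how that second pattern could be combined with a filter that drops only the empty strings while keeping the whitespace items. The names sentence, parts, and tokens are illustrative.

```python
import re

sentence = "this is a test of proper nouns #Ryan"

# \s+ is tried before \W+, so the run of whitespace is captured as the
# separator and '#' remains available to start the next token.
parts = re.split(r'(#?\w+)(\s+|\W+)', sentence)

# Drop only the empty strings; whitespace items such as ' ' are kept.
tokens = [p for p in parts if p != ""]
print(tokens)
# ['this', ' ', 'is', ' ', 'a', ' ', 'test', ' ', 'of', ' ', 'proper', ' ', 'nouns', ' ', '#Ryan']
```

The final '#Ryan' is not matched by the pattern (nothing follows it), so it comes back as the unsplit remainder of the string.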
You could also match using re.findall and use an alternation | matching the desired parts.
(?:[^#\w\s]+|#(?!\w))+|\s+|#?\w+
Explanation
(?: Non capture group
[^#\w\s]+ Match 1+ times any char except #, a word char, or a whitespace char
| Or
#(?!\w) Match # when not directly followed by a word char
)+ Close the group and match 1+ times
| Or
\s+ Match 1+ whitespace chars to keep them as a separate match in the result
| Or
#?\w+ Match an optional # directly followed by 1+ word chars
Example
import re
pattern = r"(?:[^#\w\s]+|#(?!\w))+|\s+|#?\w+"
print(re.findall(pattern, "this is a test of proper nouns #Ryan"))
# Output
# ['this', ' ', 'is', ' ', 'a', ' ', 'test', ' ', 'of', ' ', 'proper', ' ', 'nouns', ' ', '#Ryan']
print(re.findall(pattern, "this #Ryan #$#test#123#4343##$%$test#1#$#$###1####"))
# Output
# ['this', ' ', '#Ryan', ' ', '#$', '#test', '#123', '#4343', '##$%$', 'test', '#1', '#$#$##', '#1', '####']
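If the whitespace items are not needed, they can be filtered out after matching. The following is a sketch under that assumption; tokenize is a hypothetical helper name, not something from the original answer.

```python
import re

pattern = r"(?:[^#\w\s]+|#(?!\w))+|\s+|#?\w+"

def tokenize(text):
    """Return the matches of pattern, skipping whitespace-only tokens."""
    return [tok for tok in re.findall(pattern, text) if not tok.isspace()]

print(tokenize("this is a test of proper nouns #Ryan"))
# ['this', 'is', 'a', 'test', 'of', 'proper', 'nouns', '#Ryan']
```

Because the pattern contains only non-capturing groups, re.findall returns the full text of each match.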
The regex, #?\w+|\b(?!$) should meet your requirement.
Explanation at regex101:
1st Alternative #?\w+
# matches the character # literally (case sensitive)
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
\w matches any word character (equivalent to [a-zA-Z0-9_])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Alternative \b(?!$)
\b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
Negative Lookahead (?!$)
Assert that the Regex below does not match
$ asserts position at the end of a line

Need output in string type

I have an input string s. I want to print s with every occurrence of WUB replaced by a single space.
s = input()
print(s.split("WUB"))
Input : WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB
but the output I am getting is like this
: ['', 'WE', 'ARE', '', 'THE', 'CHAMPIONS', 'MY', 'FRIEND', '']
instead I need output in string format, like this : WE ARE THE CHAMPIONS MY FRIEND
You can join the strings in the list produced by split with a space:
print(" ".join(s.split("WUB")))
You can also just use replace instead of split + join:
print(s.replace("WUB", " "))
You can also unpack the list directly in the print call, like this:
s = input()
print(*s.split("WUB"))
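Note the * before s.split("WUB"): it unpacks the list into separate arguments, which print separates with spaces. Apart from extra spaces where the list contains empty strings, this gives the desired output:
WE ARE THE CHAMPIONS MY FRIEND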
Just join all elements from your list. See it below:
print(" ".join("WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB".split("WUB")).strip())
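The empty strings that split leaves behind (from a leading, trailing, or doubled WUB) become extra spaces when joined. A small sketch that drops them first, so the words end up separated by exactly one space; the names words, result, and alt are illustrative.

```python
s = "WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB"

# Keep only the non-empty pieces, then join them with single spaces.
words = [w for w in s.split("WUB") if w]
result = " ".join(words)
print(result)  # WE ARE THE CHAMPIONS MY FRIEND

# For this input the same result comes from replacing WUB with a space
# and letting split() collapse the runs of whitespace.
alt = " ".join(s.replace("WUB", " ").split())
print(alt == result)  # True
```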

How to eliminate new line character from text file when opened in python?

I am trying to read the words from a stopwords.txt file and append them as strings to a Python list.
stopwords.txt
a
about
above
after
again
against
all
am
an
and
any
are
aren't
as
at
be
because
been
before
being
My Code :
stopword = open("stopwords.txt", "r")
stopwords = []
for word in stopword:
    stopwords.append(word)
List stopwords output:
['a\n',
'about\n',
'above\n',
'after\n',
'again\n',
'against\n',
'all\n',
'am\n',
'an\n',
'and\n',
'any\n',
'are\n',
"aren't\n",
'as\n',
'at\n',
'be\n',
'because\n',
'been\n',
'before\n',
'being\n']
Desired Output :
['a',
'about',
'above',
'after',
'again',
'against',
'all',
'am',
'an',
'and',
'any',
'are',
"aren't",
'as',
'at',
'be',
'because',
'been',
'before',
'being']
Is there any way to transform stopword so that the '\n' characters are eliminated, or any other method to reach the desired output?
Instead of
stopwords.append(word)
do
stopwords.append(word.strip())
The str.strip() method strips whitespace of any kind (spaces, tabs, newlines, etc.) from the start and end of the string. You can pass an argument to strip a specific set of characters instead, or use lstrip() or rstrip() to strip only the front or back of the string, but for this case plain strip() should suffice.
You can use the .strip() method. It removes leading and trailing occurrences of the characters passed as an argument:
stopword = open("stopwords.txt", "r")
stopwords = []
for word in stopword:
    stopwords.append(word.strip("\n"))
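A more compact variant, sketched under the assumption that the whole file fits in memory: read it inside a with block (so the file is closed automatically) and let str.splitlines() drop the line endings. To keep the example self-contained it first writes a small temporary stopwords.txt; the temporary directory and the three sample words are illustrative only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stopwords.txt")

    # Create a small sample file for the demonstration.
    with open(path, "w") as f:
        f.write("a\nabout\naren't\n")

    # splitlines() splits on line boundaries and discards the newline characters.
    with open(path, "r") as f:
        stopwords = f.read().splitlines()

print(stopwords)  # ['a', 'about', "aren't"]
```

For very large files, iterating over the file object and calling rstrip("\n") on each line avoids loading everything at once.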

How to append a string into a list without its new line "\n" in Python 3?

I have a text file something like this (suppose A and B are persons and below text is a conversation between them):
A: Hello
B: Hello
A: How are you?
B: I am good. Thanks and you?
I added this conversation to a list, which gives the result below:
[['A', 'Hello\n'], ['A', 'How are you?\n'], ['B', 'Hello\n'], ['B', 'I am good. Thanks and you?\n']]
I use these commands in a loop:
new_sentence = line.split(': ', 1)[1]
attendees_and_sentences[index].append(person)
attendees_and_sentences[index].append(new_sentence)
print(attendees_and_sentences) # with this command I get the above result
print(attendees_and_sentences[0][1]) # if I run this one, then I don't get "\n" in the sentence.
The problem is those "\n" characters on my result screen. How can I get rid of them?
Thank you.
You can use Python's str.rstrip method.
For example:
>>> 'my string\n'.rstrip()
'my string'
And if you want to trim the trailing newlines while preserving other whitespace, you can specify the characters to remove, like so:
>>> 'my string \n'.rstrip('\n')
'my string '
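Applied to the conversation in the question, the idea might look like the sketch below. The list lines stands in for the lines read from the text file, and pairs is an illustrative name; the question's own loop and index handling are not reproduced here.

```python
# Stand-in for the lines read from the text file (assumed format "person: sentence").
lines = [
    "A: Hello\n",
    "B: Hello\n",
    "A: How are you?\n",
    "B: I am good. Thanks and you?\n",
]

pairs = []
for line in lines:
    # Remove the trailing newline before splitting off the speaker.
    person, sentence = line.rstrip("\n").split(": ", 1)
    pairs.append([person, sentence])

print(pairs)
# [['A', 'Hello'], ['B', 'Hello'], ['A', 'How are you?'], ['B', 'I am good. Thanks and you?']]
```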

REGEX - Cut a text between number and letter

Is it possible to use a regex to obtain the following result?
text = "123abcd56EFG"
listWanted = ["123", "abcd", "56", "EFG"]
The idea is to cut the text each time a digit is followed by a letter, or a letter is followed by a digit.
The solution, thanks to the following answer:
import re
pattern = r'(\d+|\D+)'
text = "123abcd56EFG"
print(re.split(pattern, text))
text = "abcd56EFG"
print(re.split(pattern, text))
This code will give...
['', '123', '', 'abcd', '', '56', '', 'EFG', '']
['', 'abcd', '', '56', '', 'EFG', '']
Use a capturing group in your regex.
>>> import re
>>> text = "123abcd56EFG"
>>> pattern = r'(\d+)'
>>> re.split(pattern, text)
['', '123', 'abcd', '56', 'EFG']
While this will give you empty strings at the start and/or end for lines with digit groups at the start and/or end, those are easy enough to trim off.
You're going to want to split using \d+|\D+ as your regex.
Note that in an ordinary string literal you need escape sequences to produce the \, so the actual text entered would be "\\d+|\\D+",
unless, as noted in the comment below, you use a raw string, in which case it would be r"\d+|\D+" or r'\d+|\D+'.
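A sketch of one way to get the wanted list without any empty strings: match the runs with re.findall instead of splitting on them. This swaps in findall for the split the answers describe; cut is a hypothetical helper name.

```python
import re

def cut(text):
    """Return the alternating runs of digits and non-digits in text."""
    return re.findall(r'\d+|\D+', text)

print(cut("123abcd56EFG"))  # ['123', 'abcd', '56', 'EFG']
print(cut("abcd56EFG"))     # ['abcd', '56', 'EFG']

# The split-based answers can be cleaned up the same way by dropping empty strings.
parts = [p for p in re.split(r'(\d+)', "123abcd56EFG") if p]
print(parts)                # ['123', 'abcd', '56', 'EFG']
```

Note that \D matches any non-digit character, not only letters, so punctuation and spaces would be grouped with the letter runs.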
