How to split sentence including punctuation - string

If I have the sentence sentence = 'There is light!' and split it with mysentence = sentence.split(), how can I get print(mysentence) to output 'There, is, light, !'? Specifically, I want to split the sentence so that punctuation (either all of it or just a selected list) ends up as separate items. I have some code, but it processes the characters within each word rather than the whole words.
out = "".join(c for c in punct1 if c not in ('!','.',':'))
out2 = "".join(c for c in punct2 if c not in ('!','.',':'))
out3 = "".join(c for c in punct3 if c not in ('!','.',':'))
How can I use this so that it recognizes whole words rather than each character in a word? The output for "Hello how are you?" should be "Hello, how, are, you, ?". Is there any way of doing this?

You may use a \w+|[^\w\s] regex with re.findall to get those chunks:
\w+|[^\w\s]
Pattern details:
\w+ - 1 or more word chars (letters, digits or underscores)
| - or
[^\w\s] - 1 char other than word / whitespace
Python demo:
import re
p = re.compile(r'\w+|[^\w\s]')
s = "There is light!"
print(p.findall(s))
NOTE: If you want to treat an underscore as punctuation, you need to use something like [a-zA-Z0-9]+|[^A-Za-z0-9\s] pattern.
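To see what the NOTE changes in practice, here is a minimal sketch that runs both patterns on an input containing an underscore. The tokenize helper and its underscore_is_punct flag are hypothetical names chosen for illustration; they are not part of the original answer.

```python
import re

# Pattern from the answer: "_" counts as a word character.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")
# Variant from the NOTE: "_" is treated as punctuation (ASCII letters and digits only).
ASCII_WORD_OR_PUNCT = re.compile(r"[a-zA-Z0-9]+|[^A-Za-z0-9\s]")


def tokenize(text, underscore_is_punct=False):
    """Split text into words and single punctuation characters (hypothetical helper)."""
    pattern = ASCII_WORD_OR_PUNCT if underscore_is_punct else WORD_OR_PUNCT
    return pattern.findall(text)


print(tokenize("snake_case works!"))
# ['snake_case', 'works', '!']
print(tokenize("snake_case works!", underscore_is_punct=True))
# ['snake', '_', 'case', 'works', '!']
```

Because the second pattern lists only ASCII letters and digits, accented or other non-ASCII letters would also be split off as punctuation, which may or may not be what you want.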
UPDATE (after comments)
To make sure you match an apostrophe as part of the words, add (?:'\w+)* or (?:'\w+)? to the \w+ in the pattern above:
import re
p = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
s = "There is light!? I'm a human"
print(p.findall(s))
The (?:'\w+)* group matches zero or more occurrences (with ? instead of *, it matches one or zero) of an apostrophe followed by 1+ word characters.
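The difference between the * and ? quantifiers only shows up on words with more than one apostrophe. A small sketch, with a made-up sample sentence and hypothetical constant names:

```python
import re

ZERO_OR_MORE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")  # any number of 'xyz parts
ZERO_OR_ONE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")   # at most one 'xyz part

text = "I'm into rock'n'roll!"

print(ZERO_OR_MORE.findall(text))
# ["I'm", 'into', "rock'n'roll", '!']
print(ZERO_OR_ONE.findall(text))
# ["I'm", 'into', "rock'n", "'", 'roll', '!']
```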

Related

Regex: Match between delimiters (a letter and a special character) in a string to form new sub-strings

I was working on a problem where I have to form new sub-strings from a main string.
For example:
in_string=ste5ts01,s02,s03
The expected output strings are ste5ts01, ste5ts02, ste5ts03
There could be comma(,) or forward-slash (/) as the separator and in this case the delimiters are the letter s and ,
The pattern I have created so far:
pattern = r"([^\s,/]+)(?<num>\d+)([,/])(?<num>\d+)(?:\2(?<num>\d+))*(?!\S)"
The issue is, I am not able to figure out how to give the letter 's' as one of the delimiters.
Any help will be much appreciated!
You might use an approach using the PyPi regex module and named capture groups which are available in the captures:
=(?<prefix>s\w+)(?<num>s\d+)(?:,(?<num>s\d+))+
Explanation
= Match literally
(?<prefix>s\w+) Match s and 1+ word chars in group prefix
(?<num>s\d+) Capture group num match s and 1+ digits
(?:,(?<num>s\d+))+ Repeat 1+ times matching , and capture s followed by 1+ digits in group num
Example
import regex as re
pattern = r"=(?<prefix>s\w+)(?<num>s\d+)(?:,(?<num>s\d+))+"
s="in_string=ste5ts01,s02,s03"
matches = re.finditer(pattern, s)
for m in matches:
    # captures("num") returns every value the repeated "num" group matched
    print(','.join([m.group("prefix") + c for c in m.captures("num")]))
Output
ste5ts01,ste5ts02,ste5ts03
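If installing the third-party regex module is not an option, a similar result can probably be obtained with the standard re module in two steps, since re keeps only the last capture of a repeated group. The sketch below is an alternative technique, not the approach from the answer above. It assumes the value is passed without the in_string= part, that every item starts with s followed by digits, and that at least one separator is present; expand and ITEMS are hypothetical names.

```python
import re

# Lazy prefix, then two or more "s<digits>" items separated by "," or "/".
ITEMS = re.compile(r"(?P<prefix>\w+?)(?P<items>s\d+(?:[,/]s\d+)+)")


def expand(in_string):
    """Return prefix + item for every "s<digits>" item (hypothetical helper)."""
    m = ITEMS.fullmatch(in_string)
    if m is None:
        return []  # input does not have the assumed shape
    items = re.findall(r"s\d+", m.group("items"))
    return [m.group("prefix") + item for item in items]


print(expand("ste5ts01,s02,s03"))
# ['ste5ts01', 'ste5ts02', 'ste5ts03']
```

The lazy \w+? makes the prefix stop at the first s that begins a valid item list, which is what yields ste5t here.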

Removing Characters With Regular Expression in List Comprehension in Python

I am learning python and I am trying to do some text preprocessing and I have been reading and borrowing ideas from Stackoverflow. I was able to come up with the following formulations below, but they don't appear to do what I was expecting, and they don't throw any errors either, so I'm stumped.
First, in a Pandas dataframe column, I am trying to remove the third consecutive character in a word; it's kind of like running a spell check on words that are supposed to have two consecutive characters instead of three
buttter = butter
bettter = better
ladder = ladder
The code I used is below:
import re
docs['Comments'] = [c for c in docs['Comments'] if re.sub(r'(\w)\1{2,}', r'\1', c)]
In the second instance, I just want to replace multiple punctuation characters with the last one.
????? = ?
..... = .
!!!!! = !
---- = -
***** = *
And the code I have for that is:
docs['Comments'] = [i for i in docs['Comments'] if re.sub(r'[\?\.\!\*]+(?=[\?\.\!\*])', '', i)]
It looks like you want to use
docs['Comments'] = (docs['Comments']
    .str.replace(r'(\w)\1{2,}', r'\1\1', regex=True)
    .str.replace(r'([^\w\s]|_)(\1)+', r'\2', regex=True))
The r'(\w)\1{2,}' regex finds three or more repeated word chars, and \1\1 replaces them with two occurrences.
The r'([^\w\s]|_)(\1)+' regex matches repeated punctuation chars and captures the last one into Group 2, so \2 replaces the match with the last punctuation char.
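The two substitutions can be tried on plain strings without pandas. The following sketch uses only the standard re module; squeeze_repeats is a hypothetical helper name and the sample sentence is made up.

```python
import re


def squeeze_repeats(text):
    """Cap repeated word chars at two, then collapse repeated punctuation to one."""
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)
    return re.sub(r"([^\w\s]|_)(\1)+", r"\2", text)


print(squeeze_repeats("buttter is bettter than ladder!!!!! Really????? ----"))
# butter is better than ladder! Really? -
```

One thing to keep in mind: \w also matches digits, so a number such as 1000 would be shortened to 100 by the first substitution. If the comments contain numbers, a letters-only class may be a safer choice.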

Replace characters other than A-Za-z0-9 and decimal values with space using regex

I want to keep alphanumeric characters and also the decimal numbers present in my text string and replace all other characters with space.
For alphanumeric characters, I can use
def clean_up(text):
    return re.sub(r"[^A-Za-z0-9]", " ", text)
But this will replace every ., whether it sits between two digits, is a full stop, or appears at a random location. I just want to keep a . when it comes between two digits.
I thought of [^((A-Za-z0-9)|(\d\.\d))], but it doesn't seem to work.
You can match and capture the patterns you need to keep and just match any char otherwise. Then, using the lambda expression as the replacement argument, you can either replace with the captured substring or a space.
The patterns are:
[+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)? - matches any number
[^\W_] - matches any alphanumeric, Unicode included
. - matches any char (with re.S or re.DOTALL).
The solution looks like
pattern = re.compile(r'([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?|[^\W_])|.', re.DOTALL)
def clean_up(text):
    return pattern.sub(lambda x: x.group(1) or " ", text)
See the online demo:
import re
pattern = re.compile(r'([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?|[^\W_])|.', re.DOTALL)
def clean_up(text):
    return pattern.sub(lambda x: x.group(1) or " ", text)
print( clean_up("+1.2E02 ANT01-TEXT_HERE!") )
Output: +1.2E02 ANT01 TEXT HERE
You can use a negated character class with a negative lookahead:
[^A-Za-z0-9](?!\d)
[...] is a character set, which matches the literal characters inside it, so [^((A-Za-z0-9)|(\d\.\d))] means: do not match [A-Za-z0-9] or the literal characters (, |, . and ).
You may try [^a-zA-Z0-9.]|(?<!\d)\.|\.(?!\d):
[^a-zA-Z0-9.] match all non-alphanumeric characters except .
(?<!\d)\. match . after non-digit
\.(?!\d) match . before non-digit
Test run:
print(re.sub(r"[^a-zA-Z0-9.]|(?<!\d)\.|\.(?!\d)", " ", ".aBc.1e2#3.4F5$6.gHi."))
Output:
aBc 1e2 3.4F5 6 gHi
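As a minimal sketch, the lookaround pattern from this answer can be wrapped in a function and checked on a few more inputs. NOT_KEPT is a hypothetical constant name and the sample strings are made up; repr is used so that leading, trailing and doubled spaces stay visible.

```python
import re

# Matches anything that is not an ASCII letter, digit or dot,
# plus any dot that does not sit between two digits.
NOT_KEPT = re.compile(r"[^a-zA-Z0-9.]|(?<!\d)\.|\.(?!\d)")


def clean_up(text):
    """Replace every character matched by NOT_KEPT with a single space."""
    return NOT_KEPT.sub(" ", text)


print(repr(clean_up("pi=3.14, v1.x")))
# 'pi 3.14  v1 x'
print(repr(clean_up(".5 and 5.")))
# ' 5 and 5 '
```

Each replaced character becomes exactly one space, so runs of punctuation leave runs of spaces, and a leading or trailing dot leaves a leading or trailing space.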

Using re.sub to replace a

I have a text. I want to remove certain words and phrases.
One sentence is: We lived there in the l[b]ate[/b] 1990s.
I search it to find ate. (= words[0])
newline = re.sub('ate', newselectionString, line)
But I only want it to find ate, on its own, not as part of another word.
Is it possible to tell re just to find these 3 letters?
Later in the text is: The best thing was when we ate ice cream.
for line in lines:
    for i in range(0, len(words)):
        if words[i] in line:
            print('Found ' + words[i])
            newselectionString = selectionString.replace('GX', 'G' + str(startInt))
            newline = re.sub(words[i], newselectionString, line)
            newLines.append(newline)
            startInt +=1
Here are two ways to do it:
Regular Expression
The regex you want is \bate\b, meaning that ate must appear between two word boundaries. It will match We ate. and I ate it., but not We're late.
Splitting the String
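This also relies on a regex, but it gives you control over the other words in the sentence.
# r"\b" must be a raw string ("\b" alone is a backspace); empty-match splits need Python 3.7+.
word_fragments = re.split(r"\b", your_string)
# The fragments keep their whitespace and punctuation, so join with '' rather than ' '.
print(''.join([word for word in word_fragments if word != 'ate']))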
Use word boundaries \b with str.format.
Ex:
re.sub(r"\b{}\b".format(words[i]), "Hello World", Text)
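When the words come from a list, as in the question, it may be worth escaping them before building the pattern so that characters like . or + are not treated as regex syntax. A minimal sketch, assuming a non-empty list of plain words; replace_whole_words is a hypothetical helper and "G1" stands in for the replacement string built in the question:

```python
import re


def replace_whole_words(line, words, replacement):
    """Replace each entry of words only where it occurs as a whole word."""
    # re.escape keeps characters such as "." or "+" from acting as regex syntax.
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"\b(?:{})\b".format(alternatives))
    return pattern.sub(replacement, line)


print(replace_whole_words("We lived there in the late 1990s.", ["ate"], "G1"))
# We lived there in the late 1990s.
print(replace_whole_words("The best thing was when we ate ice cream.", ["ate"], "G1"))
# The best thing was when we G1 ice cream.
```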

Lua frontier pattern match (whole word search)

can someone help me with this please:
s_test = "this is a test string this is a test string "
function String.Wholefind(Search_string, Word)
    _, F_result = string.gsub(Search_string, '%f[%a]'..Word..'%f[%A]',"")
    return F_result
end
A_test = String.Wholefind(s_test,"string")
output: A_test = 2
So the frontier pattern finds the whole word no problem and gsub counts the whole words no problem but what if the search string has numbers?
s_test = " 123test 123test 123"
B_test = String.Wholefind(s_test,"123test")
output: B_test = 0
It seems to work if the numbers aren't at the start or end of the search string.
Your pattern doesn't match because you are trying to do the impossible.
After including your variable value, the pattern looks like this: %f[%a]123test%f[%A]. Which means:
%f[%a] - find a transition from a non letter to a letter
123 - find 123 at the position after transition from a non letter to a letter. This itself is a logical impossibility as you can't match a transition to a letter when a non-letter follows it.
Your pattern (as written) will not work for any word that starts or ends with a non-letter.
If you need to search for fragments that include letters and numbers, then your pattern needs to be changed to something like '%f[%S]'..Word..'%f[%s]'.
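For readers more used to Python's re module, the same whitespace-delimited idea can be sketched with lookarounds. This is only an analogue of the suggested frontier pattern, not Lua code, and the two engines may differ at the string edges; count_whole is a hypothetical helper name.

```python
import re


def count_whole(search_string, word):
    """Count occurrences of word delimited by whitespace or the string edges."""
    # (?<!\S): not preceded by a non-space char; (?!\S): not followed by one.
    pattern = r"(?<!\S)" + re.escape(word) + r"(?!\S)"
    return len(re.findall(pattern, search_string))


print(count_whole("this is a test string this is a test string ", "string"))
# 2
print(count_whole(" 123test 123test 123", "123test"))
# 2
print(count_whole(" 123test 123test 123", "123"))
# 1
```

Because the boundary is defined by whitespace rather than by letters, a word may start or end with digits or punctuation and still be counted.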
