Using re.sub to replace a - python-3.x

I have a text. I want to remove certain words and phrases.
One sentence is: We lived there in the late 1990s. (Note the ate inside late.)
I search it for ate (= words[0]):
newline = re.sub('ate', newselectionString, line)
But I only want it to find ate, on its own, not as part of another word.
Is it possible to tell re just to find these 3 letters?
Later in the text is: The best thing was when we ate ice cream.
for line in lines:
    for i in range(0, len(words)):
        if words[i] in line:
            print('Found ' + words[i])
            newselectionString = selectionString.replace('GX', 'G' + str(startInt))
            newline = re.sub(words[i], newselectionString, line)
            newLines.append(newline)
            startInt += 1

Here are two ways to do it:
Regular Expression
The regex you want is \bate\b, meaning that ate must appear between two word boundaries. It will match We ate. and I ate it., but not We're late.
Splitting the String
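This is similar to the plain regex, but it gives you control over the other words in the sentence.
# r"\b" must be a raw string (a plain "\b" is a backspace); splitting on an empty match needs Python 3.7+
word_fragments = re.split(r"\b", your_string)
# the fragments already include the whitespace between words, so join on ''
print(''.join(word for word in word_fragments if word != 'ate'))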

Use word boundaries \b with str.format.
Ex:
re.sub(r"\b{}\b".format(words[i]), "Hello World", Text)
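A minimal sketch of the word-boundary approach, applied to the two sentences from the question. The helper name `replace_whole_word` and the replacement text `G1` are assumptions made for illustration; `re.escape` is added here in case a search word contains regex metacharacters.

```python
import re

def replace_whole_word(word, replacement, line):
    # \b anchors the match at word boundaries; re.escape guards against
    # regex metacharacters in the search word.
    pattern = r"\b" + re.escape(word) + r"\b"
    # A callable replacement is inserted literally (no backslash processing).
    return re.sub(pattern, lambda m: replacement, line)

# "ate" inside "late" is left alone
print(replace_whole_word("ate", "G1", "We lived there in the late 1990s."))
# the standalone word is replaced
print(replace_whole_word("ate", "G1", "The best thing was when we ate ice cream."))
```

Note that \b only behaves as expected when the search word starts and ends with a word character (letter, digit or underscore).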

Related

Python: lower() method generates wrong letter in a string

text = 'ÇEKİM GÜNÜ KALİTESİNİ DÜZENLERLSE'
sentence = text.split(' ')
print(sentence)
if "ÇEKİM" in sentence:
    print("yes-1")
print(" ")
sentence_ = text.lower().split(' ')
print(sentence_)
if "çekim" in sentence_:
    print("yes-2")
>> output:
['ÇEKİM', 'GÜNÜ', 'KALİTESİNİ', 'DÜZENLERLSE']
yes-1
['çeki̇m', 'günü', 'kali̇tesi̇ni̇', 'düzenlerlse']
I have a problem with a string. I have a sentence stored as text. When I check for a specific word in the list made by splitting the sentence, I can find the word "ÇEKİM" (it prints yes-1). However, when I search after lowercasing the sentence, I cannot find the word in the list, because lower() changes the letter "i". What is the reason for this (encoding/decoding)? Why does the lower() method change the string beyond lowercasing it? By the way, it is a Turkish word. Upper: ÇEKİM - Lower: çekim
Turkish i and English i are treated differently. The capital of Turkish i is İ (U+0130), while the capital of English i is I. To keep the two apart, Unicode defines case-mapping rules: lowercasing İ yields a plain i followed by a combining dot above (U+0307). Converting that lowercase form back to upper case also leaves the characters in decomposed form, so a proper comparison needs to normalize both strings to the same standard form first; you can't directly compare a decomposed form to a composed one. Note the differences in the strings below:
#coding:utf8
import unicodedata as ud

def dump_names(s):
    print('string:', s)
    for c in s:
        print(f'U+{ord(c):04X} {ud.name(c)}')

turkish_i = 'İ'
dump_names(turkish_i)
dump_names(turkish_i.lower())
dump_names(turkish_i.lower().upper())
dump_names(ud.normalize('NFC', turkish_i.lower().upper()))
string: İ
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
string: i̇
U+0069 LATIN SMALL LETTER I
U+0307 COMBINING DOT ABOVE
string: İ
U+0049 LATIN CAPITAL LETTER I
U+0307 COMBINING DOT ABOVE
string: İ
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
Some terminals also have display issues. My system displays the lowercased word with the dot over the m, not the i. For example, in the Chrome browser, the following displays correctly:
>>> s = 'ÇEKİM'
>>> s.lower()
'çeki̇m'
But one of my editors renders the combining dot in the wrong place, which appears to be what the OP is seeing. The following comparison will work:
if "çeki\N{COMBINING DOT ABOVE}m" in sentence_:
    print("yes-2")
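If the text is known to be Turkish, another option is to map the two Turkish-specific capitals by hand before calling lower(), so that no combining mark is produced at all. This is a different technique from the comparison above, and the helper name `turkish_lower` is an assumption made for illustration. It should not be used on non-Turkish text, because it also turns a plain I into dotless ı.

```python
DOTTED_CAPITAL_I = "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}"  # U+0130
DOTLESS_SMALL_I = "\N{LATIN SMALL LETTER DOTLESS I}"            # U+0131

def turkish_lower(text):
    # Handle the two Turkish-specific pairs by hand (İ -> i, I -> ı),
    # then let lower() deal with every other letter.
    return text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I).lower()

text = "\u00c7EK\u0130M G\u00dcN\u00dc"  # 'ÇEKİM GÜNÜ', written with escapes

# The default lower() turns U+0130 into two code points: 'i' + U+0307
print(len(DOTTED_CAPITAL_I.lower()))

# With the manual mapping the plain lowercase word is found
print("\u00e7ekim" in turkish_lower(text).split(" "))
```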

How can I replace each letter in the sentence to sentence without breaking it?

Here's my problem.
sentence = "This car is awsome."
and what I want to do is
sentence.replace("a","<emoji:a>")
sentence.replace("b","<emoji:b>")
sentence.replace("c","<emoji:c>")
and so on...
But of course, if I do it that way, the letters in "<emoji:>" will also be replaced as I go along. So how can I do it another way?
As Carlos Gonzalez suggested:
create a mapping dict and apply it to each character in sequence:
sentence = "This car is awsome."
# mapping
up = {"a":"<emoji:a>",
"b":"<emoji:b>",
"c":"<emoji:c>",}
# apply mapping to create a new text (use up[k] if present else default to k)
text = ''.join( (up.get(k,k) for k in sentence) )
print(text)
Output:
This <emoji:c><emoji:a>r is <emoji:a>wsome.
The advantage of the generator expression inside the ''.join( ... generator ...) is that it takes each single character of sentence and either keeps it or replaces it. It only ever touches each char once, so there is no danger of multiple substitutions and it takes only one pass of sentence to convert the whole thing.
Docs: dict.get(key,default) and Why dict.get(key) instead of dict[key]?
If you used
sentence = sentence.replace("a","o")
sentence = sentence.replace("o","k")
you would first turn every a into o and then turn every o (including those that used to be a) into k, and you would have to touch each character twice to do it.
Using
up = { "a":"o", "o":"k" }
text = ''.join( (up.get(k,k) for k in sentence) )
avoids this.
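A related single-pass option for single-character keys is str.translate with a table built by str.maketrans. This is an alternative to the dict.get approach above rather than the method the answer uses; the variable name `table` is just an illustration.

```python
sentence = "This car is awsome."

# str.maketrans accepts a dict whose keys are single characters;
# the values may be strings of any length.
table = str.maketrans({"a": "<emoji:a>", "b": "<emoji:b>", "c": "<emoji:c>"})
text = sentence.translate(table)
print(text)

# Each character is looked up exactly once, so a -> o and o -> k do not chain.
swapped = "ao".translate(str.maketrans({"a": "o", "o": "k"}))
print(swapped)
```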
If you want to replace more than one character at a time, it is easier to do this with a regex. Inspired by Passing a function to re.sub in Python:
import re
sentence = "This car is awsome."
up = {"is":"Yippi",
"ws":"WhatNot",}
# modified it to create the groups using the dicts key
text2 = re.sub( "("+'|'.join(up)+")", lambda x: up[x.group()], sentence)
print(text2)
Output:
ThYippi car Yippi aWhatNotome.
Docs: re.sub(pattern, repl, string, count=0, flags=0)
You would have to take extra care with your keys if they contain characters that have a special meaning in a regex pattern, e.g. .+*?()[]^$ (re.escape can be applied to each key to neutralize them).
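A minimal sketch of the same idea with the keys escaped, so that keys containing regex metacharacters are matched literally. The helper name `replace_many` is an assumption, and sorting the keys longest-first is an extra precaution (not in the answer) for the case where one key is a prefix of another.

```python
import re

def replace_many(mapping, text):
    # Longest keys first, so a longer key wins over a key that is its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape makes every key match literally.
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # The lambda looks up the replacement for whichever key matched.
    return pattern.sub(lambda m: mapping[m.group()], text)

print(replace_many({"is": "Yippi", "ws": "WhatNot"}, "This car is awsome."))
print(replace_many({".": "!", "a+": "plus"}, "a+ b."))
```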

Python 3.5: Is it possible to align punctuation (e.g. £, $) to the left side of a word using regex?

As part of my code, I need to align things like the pound sign to the left of a string. For example my code starts with:
"A price of £ 8 is roughly the same as $ 10.23!"
and needs to end with:
"A price of £8 is roughly the same as $10.23!"
I've created the following function to solve this however I feel that it is very inefficient and was wondering if there was a way to do this with regular expressions in Python?
for i in sentence:
    if i == "(" or i == "{" or i == "[" or i == "£" or i == "$":
        if i != len(sentence):
            corrected_sentence.append(" ")
            corrected_sentence.append(i)
    else:
        corrected_sentence.append(i)
What this does right now is go through the 'sentence' list, in which I have split up all of the words and punctuation, and then re-form it with each item followed by a space, EXCEPT where the listed characters are used, adding the items to another list to be made into a single string again.
I only want to do this with the characters I have listed above (so I need to ignore things like full stops or exclamation marks etc).
Thanks!
I'm not sure what you want to do with the brackets, but from the description you can use a regex to find and replace whitespace preceded by the characters (lookbehind) and followed by a digit (lookahead).
>>> print(re.sub(r"(?<=[\{\[£\$])\s+(?=\d)", "", "A price of £ 8 is roughly the same as $ 10.23!"))
A price of £8 is roughly the same as $10.23!
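The same substitution can be wrapped in a small function with a precompiled pattern. This sketch restricts the lookbehind to the two currency signs from the example; the names `CURRENCY_GAP` and `tighten` are assumptions made for illustration.

```python
import re

# Whitespace that is preceded by £ or $ (lookbehind) and followed by a digit (lookahead)
CURRENCY_GAP = re.compile(r"(?<=[£$])\s+(?=\d)")

def tighten(text):
    # Only the whitespace itself is matched, so only it is removed.
    return CURRENCY_GAP.sub("", text)

print(tighten("A price of £ 8 is roughly the same as $ 10.23!"))
# Whitespace that is not followed by a digit is left untouched
print(tighten("It cost $ nothing."))
```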

How to split sentence including punctuation

If I had the sentence sentence = 'There is light!' and I were to split it with mysentence = sentence.split(), how would I get 'There, is, light, !' as the output of print(mysentence)? What I specifically want to do is split the sentence including all punctuation, or just a list of selected punctuation. I have some code, but the program is recognizing the characters in the word, not the word.
out = "".join(c for c in punct1 if c not in ('!','.',':'))
out2 = "".join(c for c in punct2 if c not in ('!','.',':'))
out3 = "".join(c for c in punct3 if c not in ('!','.',':'))
How would I use this so that it recognizes whole words rather than each character in a word? The output for "Hello how are you?" should become "Hello, how, are, you, ?". Is there any way of doing this?
You may use a \w+|[^\w\s] regex with re.findall to get those chunks:
\w+|[^\w\s]
Pattern details:
\w+ - 1 or more word chars (letters, digits or underscores)
| - or
[^\w\s] - 1 char other than word / whitespace
Python demo:
import re
p = re.compile(r'\w+|[^\w\s]')
s = "There is light!"
print(p.findall(s))
NOTE: If you want to treat an underscore as punctuation, you need to use something like [a-zA-Z0-9]+|[^A-Za-z0-9\s] pattern.
UPDATE (after comments)
To make sure you match an apostrophe as part of the words, add (?:'\w+)* or (?:'\w+)? to the \w+ in the pattern above:
import re
p = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
s = "There is light!? I'm a human"
print(p.findall(s))
The (?:'\w+)* group matches zero or more occurrences of an apostrophe followed by 1+ word characters (with ? instead of *, it matches zero or one occurrence).
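Putting the apostrophe-aware pattern to work on the sentences from the question; the names `TOKEN` and `tokenize` are assumptions made for illustration.

```python
import re

# A word (optionally with 'suffix parts such as I'm) or a single punctuation character
TOKEN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenize(sentence):
    # findall returns the matched chunks in order, skipping whitespace
    return TOKEN.findall(sentence)

print(", ".join(tokenize("Hello how are you?")))
print(tokenize("There is light!? I'm a human"))
```

The first call prints the comma-separated form asked for in the question; the second shows that each punctuation mark becomes its own item while I'm stays in one piece.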

Lua frontier pattern match (whole word search)

can someone help me with this please:
s_test = "this is a test string this is a test string "
function String.Wholefind(Search_string, Word)
    _, F_result = string.gsub(Search_string, '%f[%a]'..Word..'%f[%A]',"")
    return F_result
end
A_test = String.Wholefind(s_test,"string")
output: A_test = 2
So the frontier pattern finds the whole word no problem and gsub counts the whole words no problem but what if the search string has numbers?
s_test = " 123test 123test 123"
B_test = String.Wholefind(s_test,"123test")
output: B_test = 0
It seems to work only if the numbers aren't at the start or end of the word being searched for.
Your pattern doesn't match because you are trying to do the impossible.
After including your variable value, the pattern looks like this: %f[%a]123test%f[%A]. Which means:
%f[%a] - find a transition from a non letter to a letter
123 - find 123 at the position after transition from a non letter to a letter. This itself is a logical impossibility as you can't match a transition to a letter when a non-letter follows it.
Your pattern (as written) will not work for any word that starts or ends with a non-letter.
If you need to search for fragments that include letters and numbers, then your pattern needs to be changed to something like '%f[%S]'..Word..'%f[%s]'.
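For comparison, the same whitespace-delimited idea can be sketched in Python's re module, where \b has a similar limitation for words that start or end with a non-word character. This is not Lua and not part of the answer above; the function name `count_whole` is an assumption made for illustration.

```python
import re

def count_whole(text, word):
    # (?<!\S) : not preceded by a non-space (start of text or whitespace before)
    # (?!\S)  : not followed by a non-space (end of text or whitespace after)
    pattern = r"(?<!\S)" + re.escape(word) + r"(?!\S)"
    return len(re.findall(pattern, text))

print(count_whole("this is a test string this is a test string ", "string"))
print(count_whole(" 123test 123test 123", "123test"))
print(count_whole(" 123test 123test 123", "123"))
```

Because the lookarounds also succeed at the start and end of the text, a word at the very end of the input is still counted.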
