Regex remove both apostrophes if they exist in Python - python-3.x

I’m quite new to regex but have almost finished my text mining script. Only one thing fails: I’m trying to remove the apostrophes around a word if both exist. I’m using re.sub for this.
For instance:
‘Apple’ needs to be Apple
‘apple’ needs to be apple
‘[apple]’ needs to be [apple]
‘(apple)’ needs to be (apple)
However: Apple’s needs to stay Apple’s because there is only one apostrophe.
How do I select both apostrophes when there is a word in between so I can delete them with re.sub? In every try I remove the entire string! Hopefully someone can help.
My code is as follows:
str_o='\'Apple\''
str_o_a = re.sub(r"\'(.*?)\'","", str_o)

I have a simpler idea: split by whitespace, trim leading and trailing apostrophes, join with whitespace. Avoids having to write a regular expression and handles sentences such as "She's 'her' mother's daughter".
text = "She's 'her' mother's daughter"
text = ' '.join([word.strip("'") for word in text.split()])
print(text)
# She's her mother's daughter

The purpose of the parentheses in your regular expression was probably to capture the string you want to keep. The idiom looks like
str_o_a = re.sub(r"'([^']*)'", r"\1", str_o)
You want a raw string around the replacement, too, in order to preserve the backslash in the argument (otherwise you would be replacing with the literal string "\x01").
Notice also the preference for using a negated character class over a non-greedy "match anything" wildcard.
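As a hedged sketch of the backreference idea: the plain pattern works for isolated tokens, but in a full sentence it can pair an apostrophe from one contraction with the next one. The variant below adds lookarounds so that an apostrophe only counts when it is not glued to a word character on its outer side. The lookarounds and the function name strip_paired_apostrophes are assumptions added for illustration, not part of the answers above.

```python
import re

# Hypothetical helper name, used only for illustration.
def strip_paired_apostrophes(text: str) -> str:
    # (?<!\w) : the opening apostrophe must not directly follow a word character
    # ([^']*) : capture everything up to the next apostrophe
    # (?!\w)  : the closing apostrophe must not be directly followed by a word character
    return re.sub(r"(?<!\w)'([^']*)'(?!\w)", r"\1", text)

for sample in ["'Apple'", "'[apple]'", "'(apple)'", "Apple's",
               "She's 'her' mother's daughter"]:
    print(sample, "->", strip_paired_apostrophes(sample))
```

The plain pattern r"'([^']*)'" gives the same result on the single-token inputs; the two only differ on text that mixes quoted words with contractions.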

Related

Dialogflow RE2 Regex

I am new here. I wanted to ask a question about using a regex for an entity in Dialogflow.
I want the entity to accept all text and spaces except for the symbol *.
I have tried [A-Za-z0-9 ][^*], but it is not working. Any advice? Thanks!
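In your regex, [^*] is a negated character class: it matches any single character except an asterisk. (A ^ only means "start of the line" when it appears outside square brackets.) Inside a character class the asterisk is already literal; outside one, you need \* to refer to a literal asterisk.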
If you want to match a line of letters or numbers as in the [A-Za-z0-9] example you give, but only if that string does not include an asterisk, then this expression should work for you:
^[a-zA-Z0-9]+$
This means "match a whole line of text if it only contains one or more of the characters a-z, A-Z, or 0-9".
If you want to match any character or group of characters in a line except for the asterisk, then you could use something like this:
(?!\*)([a-zA-Z0-9]+)(?<!\*)
The first part is called a "negative lookahead," and it looks forward to ensure we're not matching the asterisk. The last part is called a "negative lookbehind," and it looks backwards to make sure we're not matching the asterisk. The middle part is your "capture group," which matches one or more letters or digits. Note that RE2 does not support lookarounds, so this pattern only works in engines that do (for example PCRE).
If this regex gets input like *abc, it will capture abc. If it encounters abc*, it will still capture abc. If it encounters abc*def, it will capture abc and def as two separate matches, because the match breaks at the asterisk.
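The behaviour described above can be checked with a short sketch. It uses Python's re module as a stand-in engine (an assumption, since RE2 has no lookarounds), and it also runs the bare character class for comparison, because a class that contains no asterisk can never start or end a match on one.

```python
import re

# Pattern from the answer, plus the bare character class for comparison.
WITH_LOOKAROUNDS = r"(?!\*)([a-zA-Z0-9]+)(?<!\*)"
PLAIN_CLASS = r"[a-zA-Z0-9]+"

samples = ["*abc", "abc*", "abc*def"]

# findall returns the captured group for the first pattern
# and the whole match for the second.
with_lookarounds = {s: re.findall(WITH_LOOKAROUNDS, s) for s in samples}
plain_class = {s: re.findall(PLAIN_CLASS, s) for s in samples}

for s in samples:
    print(s, with_lookarounds[s], plain_class[s])
```

Both patterns return the same lists here, which suggests the lookarounds are not needed for this particular character class.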
This link explains the concept of lookarounds in Regex. You can also use this Regex tester to get started practicing your Regular Expressions with explanations of what each block of characters does.
EDITED TO ADD If you're just interested in matching single characters rather than groups of characters, you can use [A-Za-z0-9] and match any upper or lowercase letter and any single digit. You don't need to exclude the * character, because the character group is already exclusive.
This is a slight duplicate of the question below, so responses here may also help you. Hope this helps!
How can I exclude asterisk in a regex expression
[A-Za-z0-9 ][^*]
What your regex will do is match 2 consecutive characters. First, it will look for one character from A-Za-z0-9 or a space. Then, it will look at the negated set [^*], which matches ANY character except *.
You can type your regex into https://regexr.com/ to see a breakdown of how it matches and test some strings.
For example, your regex would match these:
Aa
AA
a&
A1
0_
But would not match these:
A*
a*
1*
And WOULD NOT match anything longer than 2 characters. If you really want to match any string with any characters except *, this should work:
[^\*]+
What that will do is match any number of consecutive characters that are not *. (The + means match 1 or more characters in the set.) Escaping * here is optional: inside a character class, * is already a literal character, but because it is a metacharacter elsewhere in a regex, some people escape it for readability. (By the same token, you could use \s instead of the blank space in your original regex, keeping in mind that \s also matches tabs and newlines.)
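A minimal sketch of the difference between the two patterns, assuming whole-string matching and using Python's re module rather than RE2 (these particular constructs are written the same way in both). The helper name fully_matches is hypothetical.

```python
import re

two_chars = re.compile(r"[A-Za-z0-9 ][^*]")   # original attempt: exactly two characters
no_asterisk = re.compile(r"[^*]+")            # one or more characters, none of them '*'

# Hypothetical helper: True when the pattern matches the entire string.
def fully_matches(pattern: re.Pattern, text: str) -> bool:
    return pattern.fullmatch(text) is not None

for text in ["Aa", "a&", "A*", "abc", "hello world 123", "abc*def"]:
    print(repr(text), fully_matches(two_chars, text), fully_matches(no_asterisk, text))
```

Whether an entity matcher applies the pattern to the whole input or searches within it depends on the platform, so the whole-string assumption is worth verifying there.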

Python - Removing parentheses and quotation marks

I've been trying to remove paired parentheses (including the text between them), unbalanced parentheses, and quotation marks from a string.
What I've done so far:
import re
sample_text = '""sads"add"sfsfdsfds()()(0sefdAAAsfs)dasdad(asd'
res = re.sub(r'\([^)]*\)', '', sample_text)
It matches only the ()()(0sefdAAAsfs) part of the text. The unbalanced parenthesis and the quotation marks are left unmatched. What can be done to improve the above regex?
A regular expression isn't really the right tool for this job. Having said that, you can use the following pattern to match an opening paren, then zero or more characters that are not a closing paren, and then a closing paren:
\([^)]*\)
Substitute a " or ' or [ or whatever for the other types of matching components.
But again, this would not work with something like this:
(asdf))))))))
Long story short: it's not a problem that a regular expression is capable of solving. Try testing it out here: https://regex101.com/r/bdiK5W/2.
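For the specific input in the question, a two-step approach may be enough. The sketch below assumes that "unbalanced parentheses" means only the stray paren characters should go (the text after them stays) and that only double quotes need removing. The function name strip_parens_and_quotes is hypothetical.

```python
import re

# Hypothetical helper name, used only for illustration.
def strip_parens_and_quotes(text: str) -> str:
    # Step 1: repeatedly remove innermost "(...)" pairs, so nested pairs
    # disappear too. The loop stops once a pass changes nothing.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\([^()]*\)", "", text)
    # Step 2: any parentheses still present are unbalanced; drop them
    # together with the double quotes.
    return re.sub(r'[()"]', "", text)

sample_text = '""sads"add"sfsfdsfds()()(0sefdAAAsfs)dasdad(asd'
result = strip_parens_and_quotes(sample_text)
print(result)
```

The loop is what a single regular expression cannot express: each pass handles one nesting level, and plain Python handles the repetition.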

Substitute and change case for program variables

I'm changing some notation in a few source code files.
In particular, variable names using the format
m_variable1
m_anothervariable
should be renamed and reformatted to
mVariable1
mAnotherVariable
That is, substitute m_ with m and make the next character uppercase.
I know how to do simple substitutions, like
%s/m_/m/gc
using vim, but I'm not sure how to change a character to uppercase in a substitute command.
You can make the first character after m_ uppercase, but I doubt a built-in command can split a run of lowercase letters into separate words (so anothervariable cannot become AnotherVariable automatically).
I hope following command will help you:
:%s/\vm_(\w+)/m\u\1/g
Explanation
\v enables the 'very magic' mode
\u makes the next character uppercase
\1 references the first captured group
Result
mVariable1
mAnothervariable
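Outside Vim, the same rename can be sketched in Python with a replacement function. This is an illustrative equivalent, not the answer's method; the function name rename_member and the leading \b (so that names such as num_items are left alone) are assumptions.

```python
import re

# Hypothetical helper name, used only for illustration.
def rename_member(source: str) -> str:
    # \bm_   : "m_" at the start of an identifier
    # (\w)   : the character right after it, which gets uppercased
    return re.sub(r"\bm_(\w)", lambda m: "m" + m.group(1).upper(), source)

renamed = rename_member("m_variable1 = m_anothervariable + num_items")
print(renamed)
```

As in the Vim command, only the first character after m_ changes, so word boundaries inside anothervariable are not recovered.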

Appending to the end of a pattern with a word in the middle using vim

I have a file that is out-of-date and needs to be updated. The names have changed somewhat and I would like to clean them all up using a single substitution.
Here's what I'm trying to accomplish:
foo.foo_[single word] -> foo_bar.foo_[single word]_bar
where a single word is a string of n characters. In the file, they are always preceded by an underscore, but it needs to have "_bar" appended. There is always a "." after these instances, so I thought the following might work:
%s/foo\.foo_*\./foo_bar\.foo_*_bar\./g
Sadly, the first part doesn't even match what I want, so I'm back to square one.
I would first change:
foo_[word] -> foo_[word]_bar
and then
foo. -> foo_bar.
i.e.:
%s,\(foo_\w\+\),\1_bar,g|%s,foo\.,foo_bar\.,g
There are many ways to skin a cat, but the following should do the trick:
%s/\vfoo\.foo_(\w+)/foo_bar.foo_\1_bar/gc
which loosely translates to
\v Very magic (:help magic)
foo\.foo_ Search for this exact string (the dot is escaped so it matches a literal ".")
(\w+) Search for a "word" and store it in a backreference
/foo_bar.foo_ Replace the match with this exact string
\1 followed by backreference 1
_bar followed by _bar
or if you don't want to repeat the search in the replace part, you can go a bit nuts with backreferences and use
%s/\v(foo)\.foo_(\w+)/\1_bar.\1_\2_bar/gc
The most important parts you were missing were
using backreferences (:helpgrep backref)
using character classes (:h \w)
using repetition (_* is searching for 0 or more underscores. You probably meant _.*)
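The three points above (a backreference, a character class, and repetition) can be sketched together in Python. The sample text is made up for illustration.

```python
import re

# Made-up sample input; the real file contents are not shown in the question.
old = "foo.foo_alpha. and foo.foo_beta."

# foo\.foo_ : the literal prefix (dot escaped)
# (\w+)     : one or more word characters, captured as group 1
# \1        : reuses the captured word in the replacement
new = re.sub(r"foo\.foo_(\w+)", r"foo_bar.foo_\1_bar", old)
print(new)
```

Because \w also matches underscores, a name such as foo.foo_a_b. would be captured whole and become foo_bar.foo_a_b_bar.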

vim search and replace between number

I have a pattern where there are double-quotes between numbers in a CSV file.
I can search for the pattern with [0-9]\"[0-9], but how do I retain the digits while removing the double quote? The CSV format is like this:
"1234"5678","Text1","Text2"
"987654321","Text3","text4"
"7812891"3","Text5","Text6"
As you may notice there are double quotes between some numbers which I want to remove.
I have tried the following way, which is incorrect:
:%s/[0-9]\"[0-9]/[0-9][0-9]/g
Is it possible to execute a command at every match, for example to move one character forward and delete it? How can "lx" be embedded in a search and replace?
You need to capture groups. Try:
:%s/\(\d\)"\(\d\)/\1\2/g
[A digit can also be denoted by \d.]
I know that this question has been answered already, but here's another approach:
:%s/\d\zs"\ze\d
Explanation:
%s    substitute over the whole buffer
\d    match a digit
\zs   set the start of the match here
"     match a double quote
\ze   set the end of the match here
\d    match a digit
This makes the substitute command match only a double quote surrounded by digits.
Omitting the replacement string just deletes the match.
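Outside Vim, lookarounds play roughly the role of \zs and \ze: they require a digit on each side without including it in the match. A minimal sketch in Python, using the rows from the question:

```python
import re

rows = [
    '"1234"5678","Text1","Text2"',
    '"987654321","Text3","text4"',
    '"7812891"3","Text5","Text6"',
]

# (?<=\d) : a digit must come right before the quote
# "       : the quote itself is the only thing matched (and removed)
# (?=\d)  : a digit must come right after the quote
cleaned = [re.sub(r'(?<=\d)"(?=\d)', "", row) for row in rows]

for row in cleaned:
    print(row)
```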
You need capture groups in the regular expression.
Try this:
:%s/\([0-9]\)"\([0-9]\)/\1\2/g
A somewhat naive solution:
%s/^"/BEGINNING OF LINE QUOTE MARK/g
%s/\",\"/quote comma quote/g
%s/\"$/quota end of line/g
%s/\"//g
%s/quota end of line/"/g
%s/quote comma quote/","/g
%s/BEGINNING OF LINE QUOTE MARK/"/g
A macro can be created quite easily out of it and invoked as many times as needed.
