Find pattern and replace it by "Prestring" + pattern [duplicate] - python-3.x

I found these things in my regex body but I haven't got a clue what I can use them for.
Does somebody have examples so I can try to understand how they work?
(?!) - negative lookahead
(?=) - positive lookahead
(?<=) - positive lookbehind
(?<!) - negative lookbehind
(?>) - atomic group

Examples
Given the string foobarbarfoo:
bar(?=bar) finds the 1st bar ("bar" which has "bar" after it)
bar(?!bar) finds the 2nd bar ("bar" which does not have "bar" after it)
(?<=foo)bar finds the 1st bar ("bar" which has "foo" before it)
(?<!foo)bar finds the 2nd bar ("bar" which does not have "foo" before it)
You can also combine them:
(?<=foo)bar(?=bar) finds the 1st bar ("bar" with "foo" before it and "bar" after it)
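To try these on your own machine, here is a minimal Python sketch (assuming Python's built-in re module; the variable names are only for illustration) that records where each pattern matches in foobarbarfoo. The 1st bar starts at offset 3 and the 2nd at offset 6.

```python
import re

text = "foobarbarfoo"

# The five patterns from the examples above.
patterns = [
    r"bar(?=bar)",         # bar followed by bar
    r"bar(?!bar)",         # bar not followed by bar
    r"(?<=foo)bar",        # bar preceded by foo
    r"(?<!foo)bar",        # bar not preceded by foo
    r"(?<=foo)bar(?=bar)"  # both conditions combined
]

# Map each pattern to the list of offsets where it matches.
starts = {p: [m.start() for m in re.finditer(p, text)] for p in patterns}

for pattern, offsets in starts.items():
    print(pattern, offsets)
```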
Definitions
Look ahead positive (?=)
Find expression A where expression B follows:
A(?=B)
Look ahead negative (?!)
Find expression A where expression B does not follow:
A(?!B)
Look behind positive (?<=)
Find expression A where expression B precedes:
(?<=B)A
Look behind negative (?<!)
Find expression A where expression B does not precede:
(?<!B)A
Atomic groups (?>)
An atomic group exits a group and throws away alternative patterns after the first matched pattern inside the group (backtracking is disabled).
(?>foo|foot)s applied to foots will match its 1st alternative foo, then fail as s does not immediately follow, and stop as backtracking is disabled
A non-atomic group will allow backtracking; if subsequent matching ahead fails, it will backtrack and use alternative patterns until a match for the entire expression is found or all possibilities are exhausted.
(foo|foot)s applied to foots will:
match its 1st alternative foo, then fail as s does not immediately follow in foots, and backtrack to its 2nd alternative;
match its 2nd alternative foot, then succeed as s immediately follows in foots, and stop.
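The difference can be sketched in Python. Newer Python versions (3.11 and later, as far as I know) accept (?>...) directly; to stay portable, the sketch below imitates the atomic group with the common lookahead-plus-backreference idiom (?=(foo|foot))\1, which relies on the engine not backtracking into a lookahead once it has succeeded. Treat it as an illustration of the behaviour, not as the only way to write it.

```python
import re

# Emulated atomic group: the lookahead picks "foo" and cannot be re-entered,
# so the "foot" alternative is never tried.
atomic_like = re.compile(r"(?=(foo|foot))\1s")

# Ordinary capturing group: the engine may backtrack to "foot".
plain = re.compile(r"(foo|foot)s")

atomic_result = atomic_like.search("foots")  # no match
plain_result = plain.search("foots")         # matches the whole string

print(atomic_result)
print(plain_result.group() if plain_result else None)
```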
Some resources
http://www.regular-expressions.info/lookaround.html
http://www.rexegg.com/regex-lookarounds.html
Online testers
https://regex101.com

Lookarounds are zero-width assertions. They check for a regex to the right or left of the current position (ahead or behind), succeed or fail depending on whether a match is found (and on whether the assertion is positive or negative), and then discard the matched portion. They don't consume any characters: matching for the regex that follows them (if any) starts at the same cursor position.
Read regular-expressions.info for more details.
Positive lookahead:
Syntax:
(?=REGEX_1)REGEX_2
Match only if REGEX_1 matches; after matching REGEX_1, the match is discarded and searching for REGEX_2 starts at the same position.
example:
(?=[a-z0-9]{4}$)[a-z]{1,2}[0-9]{2,3}
REGEX_1 is [a-z0-9]{4}$ which matches four alphanumeric chars followed by end of line.
REGEX_2 is [a-z]{1,2}[0-9]{2,3} which matches one or two letters followed by two or three digits.
REGEX_1 makes sure that the length of string is indeed 4, but doesn't consume any characters so that search for REGEX_2 starts at the same location. Now REGEX_2 makes sure that the string matches some other rules. Without look-ahead it would match strings of length three or five.
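A small Python sketch of this example, assuming the re module and a handful of made-up sample strings. It compares the pattern with and without the look-ahead when the whole string has to match.

```python
import re

with_lookahead = re.compile(r"(?=[a-z0-9]{4}$)[a-z]{1,2}[0-9]{2,3}")
without_lookahead = re.compile(r"[a-z]{1,2}[0-9]{2,3}")

# Hypothetical sample inputs of length 4, 4, 3 and 5.
samples = ["ab12", "a123", "a12", "ab123"]

# For each sample: (accepted with look-ahead, accepted without look-ahead)
results = {
    s: (bool(with_lookahead.fullmatch(s)), bool(without_lookahead.fullmatch(s)))
    for s in samples
}

for sample, outcome in results.items():
    print(sample, outcome)
```

Only the four-character strings pass with the look-ahead in place; the three- and five-character ones pass only without it.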
Negative lookahead
Syntax:
(?!REGEX_1)REGEX_2
Match only if REGEX_1 does not match; after checking REGEX_1, the search for REGEX_2 starts at the same position.
example:
(?!.*\bFWORD\b)\w{10,30}$
The look-ahead part checks for the FWORD in the string and fails if it finds it. If it doesn't find FWORD, the look-ahead succeeds and the following part verifies that the string's length is between 10 and 30 and that it contains only word characters a-zA-Z0-9_
Look-behind is similar to look-ahead: it just looks behind the current cursor position. Some regex flavors don't support look-behind assertions (JavaScript did not until ES2018). And most flavors that support it (PHP, Python etc.) require the look-behind portion to have a fixed length.
Atomic groups basically discard/forget the remaining alternatives in the group once one of them matches. Check this page for examples of atomic groups

Grokking lookaround rapidly.
How to distinguish lookahead and lookbehind?
Take a 2-minute tour with me:
(?=) - positive lookahead
(?<=) - positive lookbehind
Suppose
A B C #in a line
Now, we ask B, Where are you?
B has two ways to declare its location:
One, B has A ahead and has C behind
Two, B is ahead (lookahead) of C and behind (lookbehind) A.
As we can see, the behind and ahead are opposite in the two solutions.
Regex is solution Two.

Why - Suppose you are playing Wordle, and you've entered "ant". (Yes, a three-letter word; it's only an example - chill)
The answer comes back as blank, yellow, green, and you have a list of three-letter words that you want to search with a regex. How would you do it?
To start off, you could require a t in the third position:
[a-z]{2}t
We could improve on that by noting that the word doesn't contain an a:
[b-z]{2}t
We could further improve by saying that the search had to have an n in it.
(?=.*n)[b-z]{2}t
or to break it down;
(?=.*n) - Look ahead, and check the match has an n in it, it may have zero or more characters before that n
[b-z]{2} - Two letters other than an 'a' in the first two positions;
t - literally a 't' in the third position
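As a rough sketch, the final regex can be used to filter a word list in Python. The word list below is made up for the example, and fullmatch is used so that only whole three-letter words are accepted.

```python
import re

pattern = re.compile(r"(?=.*n)[b-z]{2}t")

# Hypothetical list of three-letter words to filter.
words = ["ant", "net", "nut", "bit", "not", "hat", "ten"]

# Keep words that contain an n, have no a in the first two
# positions, and end in t.
candidates = [w for w in words if pattern.fullmatch(w)]

print(candidates)
```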

I used a look-behind to find the schema and a negative look-ahead to find tables that are missing with(nolock):
import re

# `sql` is assumed to already hold the SQL text to scan
expression = r"(?<=DB\.dbo\.)\w+\s+\w+\s+(?!with\(nolock\))"
matches = re.findall(expression, sql)
for match in matches:
    print(match)
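For a quick check of that expression, here is a self-contained sketch with a made-up query. The table names, aliases and the sql value are hypothetical and only there to exercise the pattern.

```python
import re

# Hypothetical query: Orders has the hint, Items does not.
sql = (
    "SELECT * FROM DB.dbo.Orders o with(nolock) "
    "JOIN DB.dbo.Items i ON o.id = i.order_id"
)

expression = r"(?<=DB\.dbo\.)\w+\s+\w+\s+(?!with\(nolock\))"

# Each match is "<table> <alias> " for a table without with(nolock).
missing = [m.strip() for m in re.findall(expression, sql)]

print(missing)
```

Note that the pattern expects a table name followed by an alias and a single run of whitespace, so differently formatted SQL may need a looser expression.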

Related

Lua pattern matching: When can anchors be safely omitted?

The reference manual describes pattern & anchors as follows:
A pattern is a sequence of pattern items. A '^' at the beginning of a pattern anchors the match at the beginning of the subject string. A '$' at the end of a pattern anchors the match at the end of the subject string. At other positions, '^' and '$' have no special meaning and represent themselves.
Clearly, if a pattern ends with .* or .+ (no matter whether inside a capture group), a trailing $ anchor may be safely omitted, as the entire remaining sequence will be matched either way by the last greedy quantifier; for .-, the anchor may not be omitted though, as that wouldn't force it to match all characters to the end.
Now for the "beginning" of string anchor, the same seems to hold: ^.* and ^.+ can simply be converted into .* and .+ respectively. However, surprisingly, it seems that this time - perhaps due to the way patterns are implemented - ^.- can indeed be simplified to .-, at least from my testing. Even though the docs state:
a single character class followed by '-', which also matches 0 or more repetitions of characters in the class. Unlike '*', these repetition items will always match the shortest possible sequence;
If it isn't anchored, the pattern matching could start at a later position, thus matching a shorter sequence for .- - yet this isn't happening:
$ lua
Lua 5.3.4 Copyright (C) 1994-2017 Lua.org, PUC-Rio
> ("00000000000000000000000001"):match".-1"
00000000000000000000000001
> ("00000000000000000000000001"):match"^.-1"
00000000000000000000000001
>
Is this somehow guaranteed or specified behavior, or is it just "undefined" behavior and should the anchor ^ still be used to stay on the safe side should the implementation change?
There are two things you need to bear in mind when using Lua patterns (and any patterns in general):
There are pattern strings that are used to match specific texts
There are libraries, methods or functions in programming languages that parse the pattern strings and extract/replace/remove/split the input strings based on the incoming pattern logic.
Thus, please make sure you understand what your pattern does and how a specific function/method uses the pattern.
If you use match and ^.-1, the result will be a substring that matches at the start of string (^), then has any zero or more chars as few as possible up to the leftmost occurrence of 1. The ^ is a pattern part that guarantees that matching starts only at the start of string. However, match only searches for a single match (it is not gmatch) and . in Lua patterns matches any char (including line break chars). Thus, .-1 with match will yield the same match.
Once you use gmatch to find multiple matches, the ^.-1 and .-1 patterns will start to make a difference.
If you use it in a replacing/removing context, the difference will be visible at once, too, since by default, these methods - and string.gsub is not an exception - replace all found matches: "Its basic use is to substitute the replacement string for all occurrences of the pattern inside the subject string" (see 20.1 – Pattern-Matching Functions).
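The same effect can be seen outside Lua. The sketch below is only an analogy in Python's re module, not Lua code: the lazy .*? stands in for Lua's .-, re.search for a single match, and re.findall / re.sub for the functions that process every match. The sample string is made up.

```python
import re

s = "00100011"

# Single match: the search starts at the leftmost position either way,
# so the anchor makes no visible difference.
single_unanchored = re.search(r".*?1", s).group()
single_anchored = re.search(r"^.*?1", s).group()

# All matches: only the unanchored pattern can match again later on.
all_unanchored = re.findall(r".*?1", s)
all_anchored = re.findall(r"^.*?1", s)

# Replacement: every unanchored match is replaced, but only one anchored match.
sub_unanchored = re.sub(r".*?1", "X", s)
sub_anchored = re.sub(r"^.*?1", "X", s)

print(single_unanchored, single_anchored)
print(all_unanchored, all_anchored)
print(sub_unanchored, sub_anchored)
```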

example from ch.16 "learn vimscript the hard way"

I'm trying to complete an exercise from https://learnvimscriptthehardway.stevelosh.com/chapters/16.html
The sample text to be worked on is:
Topic One
=========
This is some text about topic one.
It has multiple paragraphs.
Topic Two
=========
This is some text about topic two. It has only one paragraph.
The mapping to delete the heading of Topic One or Topic Two (depending on which body the cursor is placed in) and enter insert mode is:
:onoremap ih :<c-u>execute "normal! ?^==\\+$\r:nohlsearch\rkvg_"<cr>
Enter 'cih' in the body of either text below the headings and respective heading will be erased and the cursor will be placed there ready to go, in insert mode. Great mapping--but, I'm trying to understand what's happening with \+$.
When I omit \+$ and use this mapping:
:onoremap ih :<c-u>execute "normal! ?^==\r:nohlsearch\rkvg_"<cr>
it works fine, seemingly identically to the other mapping. So what is the use of the \+$?
Here is how Mr. Losh explains it:
The first piece,
?^==\+$
performs a search backwards for any line that consists of two
or more equal signs and nothing else. This will leave our cursor on
the first character of the line of equal signs.
But what does \+$ accomplish? I've tried to enter it manually in command mode but I just get an error sound. It works as intended as part of the full mapping, though. But like I said, when I remove it and run the full command without it, it works fine.
There's something I'm missing about the necessity of that '\+$'... Maybe it has to do with the "two or more equal signs and nothing else"?
The author's command:
?^==\+$
searches backward for a line consisting exclusively of 2 or more equal signs:
^ anchors the pattern to the beginning of the line,
= matches a literal equal sign,
^= thus matches a literal equal sign at the beginning of the line,
= matches a second equal sign,
\+ matches one or more of the preceding atom, as many as possible,
=\+ thus matches one or more equal sign, as many as possible,
$ anchors the pattern to the end of the line,
so the pattern above is going to match any of the following lines:
==
===
=============
etc.
but not lines like:
==foo
==      <- six spaces
etc.
which is exactly the goal of that exercise.
Your command, on the other hand:
?^==
searches backward for a sequence of two equal signs at the beginning of a line:
^ anchors the pattern to the beginning of the line,
== matches two literal equal signs,
so your pattern is going to match the same lines as above:
==
===
=============
etc.
but also lines like:
==foo
==      <- six spaces
etc.
because it is not strict enough.
Your pattern would definitely be good enough if used manually to jump to one of those underlines because it gets the job done with minimal typing. But the goal, here, is to make a mapping. Those things have to be generalised to be reliable, which pretty much requires a level of explicitness and precision your pattern lacks.
In short, Steve's pattern checks all the boxes while yours doesn't: it is explicit and precise while yours is implicit and imprecise.
The \+$ is part of the regular expression matching a line of only equals signs. Without it, your mapping would recognize, for example,
This is not a heading
==This is not an underline
as a heading.
The \+ means "one or more of the preceding character (=)", so ==\+ requires at least two equals signs. The $ means end of line, so there cannot be anything after the equals signs.
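To see the two patterns side by side outside Vim, here is a rough Python translation (Vim's \+ is written + in Python's re syntax; the sample lines are made up). It is a sketch of the matching logic only, not of the mapping itself.

```python
import re

strict = re.compile(r"^==+$")  # Python spelling of Vim's ^==\+$
loose = re.compile(r"^==")     # Python spelling of Vim's ^==

# Hypothetical lines: two real underlines, two false positives, one non-match.
lines = ["==", "=========", "==foo", "==      ", "=x"]

strict_hits = [line for line in lines if strict.search(line)]
loose_hits = [line for line in lines if loose.search(line)]

print(strict_hits)
print(loose_hits)
```

The strict pattern keeps only the lines made up entirely of equals signs, while the loose one also accepts ==foo and the line with trailing spaces.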

Dialogflow RE2 Regex

I am new here. I wanted to ask a question about using a regex for an entity in Dialogflow.
I want the entity to accept all text and spaces except for the symbol *
I have tried to use [A-Za-z0-9 ][^*], but it is not working. Any advice? Thanks!
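In your regex, [^*] is a negated character class: it matches any single character except a literal asterisk. Inside square brackets the asterisk is already literal; outside a character class you need \* to refer to a literal asterisk rather than the "zero or more" quantifier.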
If you want to match a line of letters or numbers as in the [A-Za-z0-9] example you give, but only if that string does not include an asterisk, then this expression should work for you:
^[a-zA-Z0-9]+$
This means "match a whole line of text if it only contains one or more of the characters a-z, A-Z, or 0-9".
If you want to match any character or group of characters in a line except for the asterisk, then you could use something like this:
(?!\*)([a-zA-Z0-9]+)(?<!\*)
The first part is called a "negative lookahead," and it looks forward to ensure the match does not start at an asterisk. The last part is called a "negative lookbehind," and it looks backwards to make sure the match does not end right after an asterisk. The middle part is your "capture group," and confirms that you're matching any letters or numbers in a given string, but excluding the * character. Since [a-zA-Z0-9] can never match an asterisk anyway, both lookarounds are redundant here; also be aware that RE2 does not support lookarounds, so this form only works in engines that do.
If this regex gets input like *abc, it will capture abc. If it encounters abc*, it will still capture abc. If it encounters abc*def, it will match abc and def separately, as two matches, because it will break around the asterisk.
This link explains the concept of lookarounds in Regex. You can also use this Regex tester to get started practicing your Regular Expressions with explanations of what each block of characters does.
EDITED TO ADD If you're just interested in matching single characters rather than groups of characters, you can use [A-Za-z0-9] and match any upper or lowercase letter and any single digit. You don't need to exclude the * character, because the character group is already exclusive.
This is a slight duplicate of the question below, so responses here may also help you. Hope this helps!
How can I exclude asterisk in a regex expression
[A-Za-z0-9 ][^*]
What your regex will do is match 2 consecutive characters. First, it will look for one character from A-Za-z0-9 or a space. Then, it will look at the negated set that contains *, and will match ANY character except *.
You can type your regex into https://regexr.com/ to see a breakdown of how it matches and test some strings.
For example, your regex would match these:
Aa
AA
a&
A1
0_
But would not match these:
A*
a*
1*
And it WOULD NOT match more than 2 characters at a time. If you really want to match any string with any characters except *, this should work:
[^\*]+
What that will do is match any number of consecutive characters that are not *. (The + means match 1 or more characters in the set). It is also a good idea to escape * because it is a reserved character in regex. Even though most regex parsers are smart enough to know that inside a character class you mean the literal char *, it is still a best practice to escape it. (And by that same token, you might want to use \s instead of the blank space in your original regex.)
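A short sketch of both expressions, using Python's re module as a stand-in (Dialogflow itself uses RE2, so this only illustrates the matching logic; the sample strings are made up):

```python
import re

two_chars = re.compile(r"[A-Za-z0-9 ][^*]")  # the original attempt
no_asterisk = re.compile(r"[^*]+")           # one or more non-asterisk chars

# The original attempt accepts exactly two characters, the second not being *.
two_char_ok = [s for s in ["Aa", "a&", "0_", "A*", "abc"] if two_chars.fullmatch(s)]

# The negated class accepts whole strings of any length without an asterisk.
whole_ok = [s for s in ["hello world 123", "abc*def", "*"] if no_asterisk.fullmatch(s)]

# Unanchored, it splits the input around each asterisk.
pieces = no_asterisk.findall("abc*def")

print(two_char_ok)
print(whole_ok)
print(pieces)
```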

Cucumber regex step definition

Can someone explain what is the difference between
@When("some text (.*)")
and
@When("^some text ([^\"]*)$")
?
The former worked when using a straightforward step, but when using a data table it maps only to the first table item.
Here is an explanation of a couple of common regexes:
.* matches anything (or nothing), literally “any character (except a newline) 0 or more times”
.+ matches at least one of anything
[0-9]* or \d* matches a series of digits (or nothing)
[0-9]+ or \d+ matches one or more digits
"[^"]*" matches something (or nothing) in double quotes
an? matches a or an (the question mark makes the n optional)
So, depending on your question, the difference is :
.* will take everything except new lines,
([^\"]*) will take everything except a double quote, new lines included
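The newline difference can be checked with a small sketch in Python's re module, used here only as a stand-in for the regex engine behind Cucumber. The step text is made up, and how Cucumber hands a data table to the step is not modelled.

```python
import re

# Hypothetical step text that continues on a second line.
step = "some text foo\nbar"

# .* stops at the newline.
dot_capture = re.search(r"some text (.*)", step).group(1)

# [^"]* only stops at a double quote, so it runs across the newline.
negated_capture = re.search(r'^some text ([^"]*)$', step).group(1)

print(repr(dot_capture))
print(repr(negated_capture))
```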

How to perform following search and replace in vim?

I have the following string in the code at multiple places,
m_cells->a[ Id ]
and I want to replace it with
c(Id)
where the string Id could be anything including numbers also.
A regular expression replace like below should do:
%s/m_cells->a\[\s\(\w\+\)\s\]/c(\1)/g
If you wish to apply the replacement operation on a number of files you could use the :bufdo command.
Full explanation of @BasBossink's answer (as a separate answer because this won't fit in a comment), because regexes are awesome but non-trivial and definitely worth learning:
In Command mode (ie. type : from Normal mode), s/search_term/replacement/ will replace the first occurrence of 'search_term' with 'replacement' on the current line.
The % before the s tells vim to perform the operation on all lines in the document. Any range specification is valid here, eg. 5,10 for lines 5-10.
The g after the last / performs the operation "globally" - all occurrences of 'search_term' on the line or lines, not just the first occurrence.
The "m_cells->a" part of the search term is a literal match. Then it gets interesting.
Many characters have special meaning in a regex, and if you want to use the character literally, without the special meaning, then you have to "escape" it, by putting a \ in front.
Thus \[ and \] match the literal '[' and ']' characters.
Then we have the opposite case: literal characters that we want to treat as special regex entities.
\s matches white*s*pace (space, tab, etc.).
\w matches "*w*ord" characters (letters, digits, and underscore _).
(. matches any character (except a newline). \d matches digits. There are more...)
If a character is not followed by a quantifier, then exactly one such character matches. Thus, \s will match one space or tab, but not fewer or more.
\+ is a quantifier, and means "one or more". (\? matches 0 or 1; * (with no backslash) matches any number: zero or more. Warning: matching on zero occurrences takes a little getting used to; when you're first learning regexes, you don't always get the results you expected. It's also possible to match on an arbitrary exact number or range of occurrences, but I won't get into that here.)
\( and \) work together to form a "capturing group". This means that we don't just want to match on these characters, we also want to remember them specially so that we can do something with them later. You can have any number of capturing groups, and they can be nested too. You can refer to them later by number, starting at 1 (not 0). Just start counting (escaped) left parentheses from the left to determine the number.
So here, we are matching a space followed by a group (which we will capture) of at least one "word" character followed by a space, within the square brackets.
The section between the second and third / is the replacement text.
The "c" is literal.
\1 means the first captured group, which in this case will be the "Id".
In summary, we are finding text that matches the given description, capturing part of it, and replacing the entire match with the replacement text that we have constructed.
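For comparison, the same search and replace can be sketched with Python's re.sub. In Python's syntax the group parentheses and the + quantifier are written without backslashes; the sample line of code below is made up.

```python
import re

# Python spelling of the Vim pattern m_cells->a\[\s\(\w\+\)\s\]
pattern = re.compile(r"m_cells->a\[\s(\w+)\s\]")

# Hypothetical line containing two occurrences.
source = "x = m_cells->a[ Id ] + m_cells->a[ 42 ];"

# \1 in the replacement refers to the first captured group, as in Vim.
result = pattern.sub(r"c(\1)", source)

print(result)
```

Like the g flag in Vim, re.sub replaces every occurrence unless a count is given.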
Perhaps a final suggestion: c after the final / (doesn't matter whether it comes before or after the 'g') enables *c*onfirmation: vim will highlight the characters to be replaced and will show the replacement text and ask whether you want to go ahead. Great for learning.
Yes, regexes are complicated, but super powerful and well worth learning. Once you have them internalized, they're actually fairly easy. I suggest that, as with learning vim itself, you start with the basics, get fluent in them, and then incrementally add new features to your repertoire.
Good luck and have fun.
