In Node.js, I have the following pattern
/^(>|<|>=|<=|!=|%)?[a-z0-9 ]+?(%)*$/i
to match only alphanumeric strings, with an optional prefix and suffix made of some special characters. It's working just fine.
Now I want to allow the trailing '%' only if the first character is alphanumeric (case insensitive) or is itself a '%'. In that case it is optional; otherwise the string should not match.
Example:
Should match:
>test
!=test
<test
>=test
<=test
%test
test
%test%
test%
Example which should not match:
<test% <-- it's currently matching, which is not correct
<test< <-- it's currently **not** matching, which is correct
Any ideas?
You can add a negative lookahead after ^ like
/^(?![^a-z\d%].*%$)(?:[><]=?|!=|%)?[a-z\d ]+%*$/i
  ^^^^^^^^^^^^^^^^^
Details:
^ - start of string
(?![^a-z\d%].*%$) - fail the match if there is a char other than alphanumeric or % at the start and % at the end
(?:[><]=?|!=|%)? - optionally match <, >, <=, >=, != or %
[a-z\d ]+ - one or more alphanumeric or space chars
%* - zero or more % chars
$ - end of string
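As a quick sanity check, the lookahead pattern can be run against the question's examples. This is a minimal sketch in Python rather than Node.js (the syntax used here behaves the same way in both engines); the names PATTERN, should_match, and should_not_match are illustrative only.

```python
import re

# The pattern from the answer above; re.IGNORECASE stands in for the /i flag.
PATTERN = re.compile(r'^(?![^a-z\d%].*%$)(?:[><]=?|!=|%)?[a-z\d ]+%*$', re.IGNORECASE)

should_match = ['>test', '!=test', '<test', '>=test', '<=test',
                '%test', 'test', '%test%', 'test%']
should_not_match = ['<test%', '<test<']

# Print the verdict for every example from the question.
for s in should_match + should_not_match:
    print(f'{s!r:10} -> {bool(PATTERN.match(s))}')
```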
You might use an alternation | to match either one of the options.
^(?:[a-z0-9%](?:[a-z0-9 ]*%)?|(?:[<>]=?|!=|%)?[a-z0-9 ]+)$
^ Start of string
(?: Non capture group
[a-z0-9%] Match one of the listed in the character class
(?:[a-z0-9 ]*%)? Optionally match repeating 0+ times any of the character class followed by %
| Or
(?:[<>]=?|!=|%)? Optionally match one of the alternatives
[a-z0-9 ]+ Match 1+ times any of the character class
) Close non capture group
$ End of string
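The lookahead pattern and the alternation pattern agree on the question's examples, but not on every input. The sketch below, in Python with illustrative variable names, compares them on a few edge cases; which behaviour is wanted is up to the asker.

```python
import re

lookahead = re.compile(r'^(?![^a-z\d%].*%$)(?:[><]=?|!=|%)?[a-z\d ]+%*$', re.I)
alternation = re.compile(r'^(?:[a-z0-9%](?:[a-z0-9 ]*%)?|(?:[<>]=?|!=|%)?[a-z0-9 ]+)$', re.I)

# Inputs on which the two answers disagree:
#   'test%%'  : the lookahead version allows several trailing %, the alternation only one
#   '%', '%%' : the alternation version accepts a lone % or %% with nothing in between
for s in ['test%', '<test%', 'test%%', '%', '%%']:
    print(f'{s!r:9} lookahead={bool(lookahead.match(s))} '
          f'alternation={bool(alternation.match(s))}')
```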
I have a file that contains segments that form a word, in the format <+segment1 segment2 segment3 segment4+>. I want the output to have all the segments joined to form one word, so basically I want to remove the spaces between the segments and the <+ +> signs surrounding them. For example:
Input:
<+play ing+> <+game s .+>
Output:
playing games.
I first tried detecting the pattern using \<\+(.*?)\+\>, but I can't figure out how to remove the spaces.
Use this Python code:
import re
line = '<+play ing+> <+game s .+>'
line = re.sub(r'<\+\s*(.*?)\s*\+>', lambda z: z.group(1).replace(" ", ""), line)
print(line)
Results: playing games.
The lambda additionally removes the spaces inside each captured segment.
REGEX EXPLANATION
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
\+ '+'
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\+ '+'
--------------------------------------------------------------------------------
> '>'
I assume that spaces can be converted to empty strings except when they are preceded by '>' and are followed by '<'. That is, the space in the string '> <' is not to be replaced by an empty string.
You can replace each match of the following regular expression with an empty string:
<\+|\+>|(?<!>) | (?!<)
This expression can be broken down as follows.
<\+ # Match '<+'
| # or
\+> # Match '+>'
| # or
(?<!>) # Negative lookbehind asserts current location is not preceded by '>'
[ ] # Match a space
| # or
[ ] # Match a space
(?!<) # Negative lookahead asserts current location is not followed by '<'
I've placed each space in a character class above so it is visible.
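The free-spacing breakdown above can be used almost verbatim in Python via re.VERBOSE. A minimal sketch, assuming the input from the question; the name CLEANUP is illustrative.

```python
import re

# Same expression as above, written in free-spacing mode. In re.VERBOSE a bare
# space is ignored, so each literal space is written as [ ].
CLEANUP = re.compile(r"""
    <\+            # match '<+'
  | \+>            # match '+>'
  | (?<!>) [ ]     # a space not preceded by '>'
  | [ ] (?!<)      # a space not followed by '<'
""", re.VERBOSE)

line = '<+play ing+> <+game s .+>'
result = CLEANUP.sub('', line)
print(result)  # playing games.
```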
line A
foo bar  bar foo  bar foo
line B
foo bar  bar foo
In line A, there are multiple occurrences of a double space.
I only want to match lines like line B, which contain exactly one double space.
I tried
^.*\s{2}.*$
but it will match both.
How may I have the desired output? Thank you.
If you wish to match strings that contain no more than one run of two or more spaces between words, you could use the following regular expression.
r'^(?!(?:.*(?<! ) {2,}(?! )){2})'
Note that this expression matches
abc    de fgh
where there are four spaces between 'c' and 'd'.
Python's regex engine performs the following operations.
^ : match start of string
(?! : begin negative lookahead
(?: : begin non-capture group
.* : match 0+ characters other than line terminators
(?<! ) : negative lookbehind asserts match is not preceded by a space
[ ]{2,} : match 2+ spaces
(?! ) : negative lookahead asserts match is not followed by a space
) : end non-capture group
{2} : execute non-capture group twice
) : end negative lookahead
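To see how the expression behaves, here is a small Python sketch. The sample lines are assumptions reconstructed from the question's description (two-space gaps), not the asker's exact data, and the name at_most_one_gap is illustrative.

```python
import re

# Matches at the start of any line that does NOT contain two separate runs of 2+ spaces.
at_most_one_gap = re.compile(r'^(?!(?:.*(?<! ) {2,}(?! )){2})')

samples = {
    'foo bar  bar foo  bar foo': False,  # two double-space gaps
    'foo bar  bar foo': True,            # exactly one gap
    'abc    de fgh': True,               # a single run of four spaces counts once
    'foo bar bar foo': True,             # no gap at all still qualifies ("no more than one")
}

for line, expected in samples.items():
    print(repr(line), bool(at_most_one_gap.match(line)))
```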
You can do:
^(?!.*[ \t]{2,}.*[ \t]{2,})
# Negative look ahead assertion that states 'only start the match
# on this line IF there are NOT 2 (or potentially more) breaks with
# two (or potentially more) of tabs or spaces'.
If you want to require ONE double space in the line but not more:
^(?=.*[ \t]{2,})(?!.*[ \t]{2,}.*[ \t]{2,})
# Positive look ahead that states 'only start this match if there is
# at least one break with two tabs or spaces'
# BUT
# Negative look ahead assertion that states 'only start the match
# on this line IF there are NOT 2 (or potentially more) breaks with
# two (or potentially more) of tabs or spaces'.
If you want to count only spaces (not tabs):
^(?=.*[ ]{2})(?!.*[ ]{2}.*[ ]{2})
# Same as above but with the tabs removed from the assertions. Note that a
# run of exactly three spaces still passes, because it contains only one
# non-overlapping pair of spaces.
Note: In your regex you have \s as the class for a space. That also matches [\r\n\t\f\v ] so both horizontal and vertical space characters.
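A short Python sketch of the three lookahead variants above, run on assumed sample lines (the exact spacing of the question's data is not known). It also shows an edge case worth being aware of: because these lookaheads do not anchor the runs, a single run of four or more spaces can be read as two adjacent breaks.

```python
import re

at_most_one = re.compile(r'^(?!.*[ \t]{2,}.*[ \t]{2,})')
exactly_one = re.compile(r'^(?=.*[ \t]{2,})(?!.*[ \t]{2,}.*[ \t]{2,})')
spaces_only = re.compile(r'^(?=.*[ ]{2})(?!.*[ ]{2}.*[ ]{2})')

lines = [
    'foo bar  bar foo  bar foo',  # two breaks
    'foo bar  bar foo',           # one break
    'foo bar bar foo',            # no break
    'foo    bar',                 # one run of four spaces
]

def verdicts(line):
    # Result of each of the three patterns, in the order defined above.
    return [bool(p.match(line)) for p in (at_most_one, exactly_one, spaces_only)]

for line in lines:
    print(repr(line), verdicts(line))
```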
Note 2:
You can do this without a regex as well (assuming you only want lines that have 1 and only 1 double space in them):
txt='''\
line A
foo bar  bar foo  bar foo
line B
foo bar  bar foo'''
>>> [line for line in txt.splitlines() if len(line.split('  '))==2]
['foo bar  bar foo']
You can get the match without lookarounds by starting the match with 1+ non whitespace chars.
Then optionally repeat a single whitespace char followed by non whitespace chars before and after matching a double whitespace char.
The negated character class [^\S\r\n] will match any whitespace chars except a newline or carriage return. If you want to allow matching newlines as well, you could use \s
^\S+(?:[^\S\r\n]\S+)*[^\S\r\n]{2}(?:\S+[^\S\r\n])*\S+$
Explanation
^ Start of string
\S+ Match 1+ non whitespace chars
(?: Non capture group
[^\S\r\n]\S+ Match a whitespace char without a newline, followed by 1+ non whitespace chars
)* Close group and repeat 0+ times
[^\S\r\n]{2} Match the 2 whitespace chars without a newline
(?: Non capture group
\S+[^\S\r\n] Match 1+ non whitespace chars followed by a whitespace char without a newline
)* Close group and repeat 0+ times
\S+ Match 1+ non whitespace chars
$ End of string
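Because this pattern consumes the whole line, it works directly with re.findall in multiline mode. A minimal Python sketch, with sample text assumed from the question's description and an illustrative variable name:

```python
import re

one_double_space = re.compile(
    r'^\S+(?:[^\S\r\n]\S+)*[^\S\r\n]{2}(?:\S+[^\S\r\n])*\S+$',
    re.MULTILINE,  # let ^ and $ match at every line boundary
)

txt = 'line A\nfoo bar  bar foo  bar foo\nline B\nfoo bar  bar foo'
matches = one_double_space.findall(txt)
print(matches)  # ['foo bar  bar foo']
```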
I have a data file like this:
randomthingsbefore $DATAROOT/randompathwithoutanypattern randomthingsafter
randomthingsbefore $DATAROOT/randompathwithoutanypattern randomthingsafter $DATAROOT/randompathwithoutanypattern randomthingsafter
randomthingsbefore $DATAROOT/randompathwithoutanypattern randomthingsafter
(...)
I want to delete the substring $DATAROOT from each path and add blank spaces after the path to keep the columns where randomthingsafter started. Notice that there could be 2 or more paths with the $DATAROOT substring in the same line. This way, my desired output would look like this:
randomthingsbefore /randompathwithoutanypattern          randomthingsafter
randomthingsbefore /randompathwithoutanypattern          randomthingsafter /randompathwithoutanypattern          randomthingsafter
randomthingsbefore /randompathwithoutanypattern          randomthingsafter
(...)
I've tried:
VAR1=*pathtofile*
VAR2=$(\grep -oP '\$DATAROOT\K[^ ]*' $VAR1)
arr=$(echo $VAR2 | tr " " "\n")
for x in $arr
do
y="${x} "
sed -i "s:$x:$y:" $VAR1
done
sed -i 's/$DATAROOT\///g' $VAR1
but it does not seem to work. Thank you for your help!
I believe the easiest approach is to replace your script with a single sed command:
sed 's/$DATAROOT\([^[:blank:]]*\)/\1         /g' /path/to/file
Note that there are 9 spaces after \1, which is the length of the string $DATAROOT. Here we make use of what is known as a back-reference:
Editing Commands in sed
[2addr]s/BRE/replacement/flags:
Substitute the replacement string for instances of the BRE in the pattern space. Any character other than <backslash> or <newline> can be used instead of a <slash> to delimit the BRE and the replacement. Within the BRE and the replacement, the BRE delimiter itself can be used as a literal character if it is preceded by a <backslash>.
The replacement string shall be scanned from beginning to end. An <ampersand> ( & ) appearing in the replacement shall be replaced by the string matching the BRE. The special meaning of & in this context can be suppressed by preceding it by a <backslash>. The characters \n, where n is a digit, shall be replaced by the text matched by the corresponding back-reference expression. If the corresponding back-reference expression does not match, then the characters \n shall be replaced by the empty string. The special meaning of \n where n is a digit in this context, can be suppressed by preceding it by a <backslash>. For each other <backslash> encountered, the following character shall lose its special meaning (if any).
source: POSIX SED
9.3.6 BREs Matching Multiple Characters
The back-reference expression \n shall match the same (possibly empty) string of characters as was matched by a subexpression enclosed between \( and \) preceding the \n. The character n shall be a digit from 1 through 9, specifying the nth subexpression (the one that begins with the nth \( from the beginning of the pattern and ends with the corresponding paired \) ). The expression is invalid if less than n subexpressions precede the \n. The string matched by a contained subexpression shall be within the string matched by the containing subexpression. If the containing subexpression does not match, or if there is no match for the contained subexpression within the string matched by the containing subexpression, then back-reference expressions corresponding to the contained subexpression shall not match. When a subexpression matches more than one string, a back-reference expression corresponding to the subexpression shall refer to the last matched string. For example, the expression ^\(.*\)\1$ matches strings consisting of two adjacent appearances of the same substring, and the expression \(a\)*\1 fails to match a, the expression \(a\(b\)*\)*\2 fails to match abab, and the expression ^\(ab*\)*\1$ matches ababbabb, but fails to match ababbab.
source: POSIX Basic Regular Expressions
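For comparison, the same capture-and-pad idea can be sketched with Python's re module. This is an illustration under assumptions, not a replacement for the sed command: the sample line is taken from the question, and the names PAD and strip_dataroot are made up for the example.

```python
import re

PAD = ' ' * len('$DATAROOT')  # 9 spaces, one per removed character

def strip_dataroot(line: str) -> str:
    # Group 1 captures the rest of the path (everything up to the next space or tab),
    # like \([^[:blank:]]*\) in the sed command; \g<1> is the back-reference.
    return re.sub(r'\$DATAROOT([^ \t]*)', r'\g<1>' + PAD, line)

before = 'randomthingsbefore $DATAROOT/randompathwithoutanypattern randomthingsafter'
after = strip_dataroot(before)
print(after)
```

Since every removed character is replaced by one space after the path, the line keeps its length and the following column stays where it was.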
I have a file with occurrences of the form _[number].htm, for example _43672151820.htm
How can I remove all occurrences of strings matching this pattern?
Substitute substring
Use this regular expression with the substitution command %s
:%s/_\d\+\.htm//g
Explanation (adapted from regex101.com):
_ matches the character _ (decimal 95, hex 5F, octal 137) literally (case sensitive)
\d matches a digit (equivalent to [0-9])
\+ is Vim's "one or more" quantifier, so \d\+ matches one or more digits (regex101 describes \+ as a literal +, which applies to PCRE, not to Vim's default syntax)
\. matches the character . (decimal 46, hex 2E, octal 56) literally (case sensitive)
htm matches the characters htm literally (case sensitive)
Global pattern flags
g modifier: global. All matches (don't return after first match)
Substitute word
The above regular expression will match, for instance, _123.htm in ab_123.htm. If you want to match only a whole word, use Vim's word boundaries \< and \>:
:%s/\<_\d\+\.htm\>//g
(see In Vim, how do you search for a word boundary character, like the \b in regexp?)
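Outside Vim, the same two substitutions look slightly different. A hedged Python sketch: Python's re spells the quantifier as + and uses \b for both word boundaries; the sample text and variable names are invented for illustration.

```python
import re

text = 'see page_43672151820.htm and _123.htm here'

# Substring version: removes every _<digits>.htm, even inside a longer word.
substring_result = re.sub(r'_\d+\.htm', '', text)
print(substring_result)  # see page and  here

# Whole-word version: \b plays the role of Vim's \< and \>.
word_result = re.sub(r'\b_\d+\.htm\b', '', text)
print(word_result)       # see page_43672151820.htm and  here
```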
Specifically, I'd like to detect lines that have a '+' character either in the first column or in the second position, right after a '*' character.
This pattern?
/^\*\=+
^ matches the start of line
\* matches the star character
\= makes the preceding star optional (matches it 0 or 1 times)
+ matches the plus character.
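The same check can be written for Python's re module, where the optional star is \*? and the plus has to be escaped. A minimal sketch with made-up sample lines; plus_marker is an illustrative name.

```python
import re

# Vim's /^\*\=+ corresponds to ^\*?\+ in Python's syntax.
plus_marker = re.compile(r'^\*?\+')

for line in ['+ item', '*+ item', 'a + b', '**+ item']:
    print(repr(line), bool(plus_marker.match(line)))
```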