Confusion with regex expression for valid IPv4 addresses [duplicate] - linux

I'm trying to write a regex that selects the valid IPv4 addresses from a file containing a mix of valid and invalid addresses.
I have already written a regex for this, but two invalid IPv4 addresses are still printed: 255.255.256.255 and 8.234.88,55
Can anyone help me understand why these two are printed with the regex below?
((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){1,3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
I am using this regex to filter the valid IPv4 addresses out of a file containing the addresses listed below.
12.12.12.12
127.0.0.0
255.255.256.255
344.19.0.1.
12.255.12.255
138.168.5.193
256.123.256.123
195.45.13.0
8.234.88.55
1334.0.1.234
196.83.83.191
133.133.133.133
8.234.88,55
203.26.27.38
88.173.71.66
136.186.20.9
241.92.88.103
I want to know why this regex matches the 255.255.256.255 and 8.234.88,55 addresses.

why this regex expression is matching with 255.255.256.255 and 8.234.88,55 IPv4 addresses.
It doesn't. It matches parts of that string. Most probably you did:
$ echo '255.255.256.255' | grep -E '((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){1,3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
255.255.256.255
Yay, it works. But the pattern doesn't match the whole line; it matches the parts 255.255.25 and 6.255 separately. The {1,3} allows the first group to match only once or twice, not necessarily 3 times. Here it matches twice, like:
((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
255.        255.        25        6.255
                                  ^^^^^ - left over
Because of the {1,3}, the first group does not have to match 3 times. grep searches for the regex anywhere in the line, and because the full regex matched a part of the line, the line is printed.
Similarly, for 8.234.88,55 the part 8.234.88 is matched and ,55 is not. This is easy to see with --color:
$ echo '8.234.88,55' | grep --color -E '(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){1,3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){1}'
8.234.88,55
^^^^^^^^ - is red
To match the whole line, use grep -x or add the anchors ^....$. Most probably you also want to change {1,3} to {3} so that exactly 3 dotted parts precede the last one.
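The partial matches described above can be made visible with grep -o, which prints each matched fragment on its own line. A minimal sketch, assuming GNU grep; the variable names octet and fragments are only for illustration:

```shell
# Octet alternation copied from the question.
octet='(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# -o prints only the matched parts, one per output line.
fragments=$(printf '%s\n' '255.255.256.255' '8.234.88,55' |
  grep -oE "(${octet}\.){1,3}${octet}")
printf '%s\n' "$fragments"
```

No complete address is matched for either input; a plain grep prints the lines only because some fragment of each one matched.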

Your regular expression is not anchored to the beginning and end of the strings. It matches fragments of each line, not the entire line.
Put your regex between ^ and $.
^ matches the beginning of the string; $ matches the end of the string.
If multi-line matching is enabled, ^ matches the beginning of a line, $ matches the end of a line.
Also, the regex is slightly incorrect: it accepts addresses with fewer components than it should. An IPv4 address always has 4 components. Because of {1,3}, your regex allows 2 to 4 components. Combined with the lack of anchors, it finds partial matches in the lines you mentioned.
Take a look at regex101.com.
The regex should be:
^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
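As a quick check of the anchored pattern, here is a sketch that feeds it a few lines from the question's list, valid and invalid mixed. It assumes grep -E; the variable names ipv4 and valid are made up for the example:

```shell
# Anchored pattern from the answer above.
ipv4='^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$'

# A few lines taken from the question's list.
valid=$(printf '%s\n' \
  '12.12.12.12' '255.255.256.255' '344.19.0.1.' '1334.0.1.234' \
  '8.234.88,55' '8.234.88.55' '136.186.20.9' |
  grep -E "$ipv4")
printf '%s\n' "$valid"
```

Only 12.12.12.12, 8.234.88.55 and 136.186.20.9 should come out; the two addresses from the question are no longer printed.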

((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.)
I've tried your expression in C++.
Adding an extra backslash before the dot solved the comma issue for me.
It accepted the comma because a backslash was missing: inside a C++ string literal the regex escape \. has to be written as \\., otherwise the regex engine may end up seeing a bare dot, which it interprets as "match any character but EOL".
Also, your expression allows values to be prefixed by a 0 because of the [01]? part.
Here is a suggestion on how to tackle the expression: if the value has only one digit, how can it be written? Then 2 digits, then 3...
(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])
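The effect of an unescaped dot can also be reproduced outside C++. A small sketch with grep -E, where the shell's single quotes pass the backslash through unchanged, so a single backslash is enough here; the variable names loose and strict are illustrative:

```shell
# Unescaped dot: '.' matches any character, so the comma slips through.
loose=$(printf '%s\n' '88,55' | grep -cxE '[0-9]+.[0-9]+' || true)

# Escaped dot: only a literal '.' is accepted between the numbers.
strict=$(printf '%s\n' '88,55' | grep -cxE '[0-9]+\.[0-9]+' || true)

echo "unescaped: $loose line(s) matched, escaped: $strict line(s) matched"
```

The || true is only there because grep -c exits with a non-zero status when the count is 0.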

Related

Dialogflow RE2 Regex

I am new here. I wanted to ask a question on using REGEX for an entity in DialogFlow
I wanted the entity to accept all text and spaces except for the symbol *
I have tried to use [A-Za-z0-9 ][^*], but it is not working. Any advice. thanks!
In your regex, [^*] is a negated character class: it matches any single character except *. Inside square brackets the caret does not mean "start of the line", and the asterisk is taken literally. Outside of brackets, a literal asterisk has to be written as \*
If you want to match a line of letters or numbers as in the [A-Za-z0-9] example you give, but only if that string does not include an asterisk, then this expression should work for you:
^[a-zA-Z0-9]+$
This means "match a whole line of text if it only contains one or more of the characters a-z, A-Z, or 0-9".
If you want to match any character or group of characters in a line except for the asterisk, then you could use something like this:
(?!\*)([a-zA-Z0-9]+)(?<!\*)
The first part is called a "negative lookahead," and it looks forward to ensure we're not matching the asterisk. The last part is called a "negative lookbehind," and it looks backwards to make sure we're not matching the asterisk. The middle part is your "capture group," and confirms that you're matching any letters or numbers in a given string, but excluding the * character. Be aware that RE2, the engine named in the question, does not support lookaheads or lookbehinds, so this pattern needs an engine such as PCRE.
If this Regex gets input like *abc, it will capture abc. If it encounters abc*, it will still capture abc. If it encounters abc*def, it will capture abc and def as two separate matches, because it will break around the asterisk.
This link explains the concept of lookarounds in Regex. You can also use this Regex tester to get started practicing your Regular Expressions with explanations of what each block of characters does.
EDITED TO ADD If you're just interested in matching single characters rather than groups of characters, you can use [A-Za-z0-9] and match any upper or lowercase letter and any single digit. You don't need to exclude the * character, because the character group is already exclusive.
This is a slight duplicate of the question below, so responses here may also help you. Hope this helps!
How can I exclude asterisk in a regex expression
[A-Za-z0-9 ][^*]
What your regex will do is match 2 consecutive characters. First, it will look for anything in A-Za-z0-9 or a space. Then, it will look at the negated set that includes *, and will match ANY character except *.
You can type your regex into https://regexr.com/ to see a breakdown of how it matches and test some strings.
For example, your regex would match these:
Aa
AA
a&
A1
0_
But would not match these:
A*
a*
1*
And each match is exactly 2 characters long, so it WOULD NOT match a longer string as a whole. If you really want to match any string with any characters except *, this should work:
[^\*]+
What that will do is match any number of consecutive characters that are not *. (The + means match 1 or more characters in the set). It is also a good idea to escape * because it is also a reserved character in regex. Even though most regex parsers are smart enough to know that inside a character class you probably mean the literal char *, it is still a best practice to escape it. (And by that same token, you would want to use \s instead of the blank space in your original regex.)
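To check the behaviour described above without Dialogflow, here is a rough sketch that uses grep -E as a stand-in. grep uses POSIX syntax rather than RE2, so treating it as equivalent is an assumption that only holds for simple patterns like these two. The asterisk is left unescaped inside the brackets because POSIX bracket expressions treat a backslash as a literal character:

```shell
# The original pattern matches exactly two characters at a time.
pairs=$(printf '%s\n' 'A*bc' | grep -oE '[A-Za-z0-9 ][^*]')

# Whole-line match (-x) against "one or more characters that are not *".
accepted=$(printf '%s\n' 'hello world' 'abc*def' 'a&b 12' '*' | grep -xE '[^*]+')

printf 'pairs: %s\n' "$pairs"
printf 'accepted:\n%s\n' "$accepted"
```

In this sketch only bc is found by the two-character pattern, while the whole-line pattern accepts hello world and a&b 12 and rejects both inputs that contain an asterisk.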

Issue in sed command

I am trying to change nw_src in the following string:
cookie=0xb868a1f26498cddd, duration=5327.613s, table=0, n_packets=199, n_bytes=19502, priority=30,icmp,in_port="qvo2495b490-33",nw_src=10.0.0.133,nw_dst=8.8.8.0/24 actions=group:2
command used:
sed -i -e 's/nw_src=(^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$,)/nw_src=1.1.1.1/g' flows.txt
It does not seem to work. However, when I generate data from the regex, all the generated permutations look right.
I am just trying to replace nw_src=<any_ipv4_address> with nw_src=1.1.1.1
Also, nw_src appears only once per line.
What am I missing?
Please help
Let's start with the regular expression that verifies that a text fragment is a one, two or three digit substring, representing a numeric value between 0 and 255:
2(5[0-5]|[0-4][0-9])|1[0-9]{2}|[1-9]?[0-9]
For example, run the following command and inspect its output:
seq 0 999 | grep -E '^(2(5[0-5]|[0-4][0-9])|1[0-9]{2}|[1-9]?[0-9])$'
seq generates all integers from 0 to 999, one per line, and grep will return only the integers from 0 to 255. Note that I had to enclose the regexp in parentheses, so that the anchors ^ and $ apply to the whole thing. In general, you need to have the regexp preceded and followed by non-digits, or by some sort of zero-length assertions to make sure the string is not a substring of a longer sequence of digits.
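A quick way to check the claim above is to count the lines that survive the filter and to try a value with a leading zero. This is only a sanity check, assuming seq and grep -E are available; the variable names are made up:

```shell
octet='^(2(5[0-5]|[0-4][0-9])|1[0-9]{2}|[1-9]?[0-9])$'

# 0..255 should pass, 256..999 should not.
count=$(seq 0 999 | grep -cE "$octet")

# Leading zeros are rejected by this pattern.
padded=$(printf '%s\n' '007' | grep -cE "$octet" || true)

echo "accepted values: $count, accepted '007': $padded"
```

The count should be 256 (the values 0 to 255), and 007 should not be accepted.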
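If you must replace the value of nw_src, but only when it is a valid IPv4 address (and leave it as is if it's not a valid IPv4 address), you could do this with GNU sed. Note that -i and -E must be given separately (-iE would be read as -i with the backup suffix E), and that the octet pattern is repeated rather than referenced with \1, because a backreference matches the same text again, not the same pattern:
sed -i -E 's/\bnw_src=((2(5[0-5]|[0-4][0-9])|1[0-9]{2}|[1-9]?[0-9])\.){3}(2(5[0-5]|[0-4][0-9])|1[0-9]{2}|[1-9]?[0-9])\b/nw_src=1.1.1.1/' flows.txt
If it is guaranteed that the "old" value is always a valid IPv4 address (so you don't need to verify that in this command), or if you need to replace the "old" value regardless of whether it is a valid address or not, the command can be simplified quite a bit; for example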
... 's/\bnw_src=[^,]*/nw_src=1.1.1.1/'
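To see the difference between the validating and the simplified substitution, here is a sketch that runs both on a shortened version of the sample line. It assumes GNU sed (for \b and -E) and writes the octet pattern out in full for every component instead of using a backreference, since \1 would only match a repeat of the same digits. The function names and the 10.0.0.999 input are invented for the illustration:

```shell
octet='(2(5[0-5]|[0-4][0-9])|1[0-9]{2}|[1-9]?[0-9])'
good='priority=30,icmp,nw_src=10.0.0.133,nw_dst=8.8.8.0/24 actions=group:2'
bad='priority=30,icmp,nw_src=10.0.0.999,nw_dst=8.8.8.0/24 actions=group:2'

# Replace only when the old value is a valid IPv4 address.
validate() { sed -E "s/\bnw_src=(${octet}\.){3}${octet}\b/nw_src=1.1.1.1/"; }
# Replace whatever follows nw_src= up to the next comma.
blind() { sed 's/\bnw_src=[^,]*/nw_src=1.1.1.1/'; }

printf '%s\n' "$good" | validate
printf '%s\n' "$bad"  | validate
printf '%s\n' "$bad"  | blind
```

The second line comes back unchanged because 999 is not a valid octet, while the simplified command rewrites it regardless.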
If all you are trying to do is replace nw_src=<any_ipv4_address> with nw_src=1.1.1.1 then the following will suffice (the third group, \3, puts the rest of the line back after the new address)
sed 's/\(.*nw_src=\)\([^,]*\),\(.*\)/\11.1.1.1,\3/' network_file
Output
$ sed 's/\(.*nw_src=\)\([^,]*\),\(.*\)/\11.1.1.1,\3/' network_file
cookie=0xb868a1f26498cddd, duration=5327.613s, table=0, n_packets=199, n_bytes=19502, priority=30,icmp,in_port="qvo2495b490-33",nw_src=1.1.1.1,nw_dst=8.8.8.0/24 actions=group:2

Grep expression filter out lines of the form [alnum][punct][alnum]

Hi all my first post is for what I thought would be simple ...
I haven't been able to find an example of a similar problem/solution.
I have thousands of text files with thousands of lines of content in the form
<word><space><word><space><number>
Example:
example for 1
useful when 1
for. However 1
,boy wonder 1
,hary-horse wondered 2
In the above example I want to exclude line 3 as it contains internal punctuation
I'm trying to use the GNU grep 2.25 however not having luck
my initial attempt was (however this does not allow the "-" internal to the pattern):
grep -v [:alnum:]*[:punct:]*[:alnum:]* filename
so tried this however
grep -v [:alnum:]*[:space:]*[!]*["]*[#]*[$]*[%]*[&]*[']*[(]*[)]*[*]*[+]*[,]*[.]*[/]*[:]*[;]*[<]*[=]*[>]*[?]*[#]*[[]*[\]*[]]*[^]*[_]*[`]*[{]*[|]*[}]*[~]*[.]*[:space:]*[:alnum:]* filename
however I need to factor in spaces and - as these are acceptable internal to the string.
I had been trying with the [:punct:] set, however I now see it contains -, so clearly that will not work
I do currently have a stored procedure in TSQL to process these however would prefer to preprocess prior to loading if possible as the routine takes some seconds per file.
Has someone been able to achieve something similar?
On the face of it, you're looking for the 'word space word space number' schema, assuming 'word' is 'one alphanumeric optionally followed by zero or one occurrences of zero or more alphanumeric or punctuation characters and ending with an alphanumeric', and 'space' is 'one or more spaces' and 'number' is 'one or more digits'.
In terms of grep -E (aka egrep):
grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+'
That contains:
[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?
That detects a word with any punctuation surrounded by alphanumerics, and:
[[:space:]]+
[[:digit:]]+
which look for one or more spaces or digits.
Using a mildly extended data file, this produces:
$ cat data
example for 1
useful when 1
for. However 1
,boy wonder 1
,hary-horse wondered 2
O'Reilly Books 23
Coelecanths, Dodos Etc 19
$ grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+' data
example for 1
useful when 1
,boy wonder 1
,hary-horse wondered 2
O'Reilly Books 23
Coelecanths, Dodos Etc 19
$
It eliminates the for. However 1 line as required.
Your regex contains a long string of ordered optional elements, but that means it will fail if something happens out of order. For example,
[!]*[?]*
will capture !? but not ?! (and of course, a character class containing a single character is just equivalent to that single character, so you might as well say !*?*).
You can instead use a single character class which contains all of the symbols you want to catch. As soon as you see one next to an alphanumeric character, you are done, so you don't need for the regex to match the entire input line.
grep -v '[[:alnum:]][][!"#$%&'"'"'()*+,./:;<=>?@\^_`{|}~]' filename
Also notice how the expression needs to be in single quotes in order for the shell not to interfere with the many metacharacters here. In order for a single-quoted string to include a literal single quote, I temporarily break out into a double-quoted string; see here for an explanation (I call this "seesaw quoting").
In a character class, if the class needs to include ], it needs to be at the beginning of the enumerated list; for symmetry and idiom, I also moved [ next to it.
Moreover, as pointed out by Jonathan Leffler, a POSIX character class name needs to be inside a character class; so to match one character belonging to the [:alnum:] named set, you say [[:alnum:]]. (This means you can combine sets, so [-[:alnum:].] covers alphanumerics plus dash and period.)
If you need to constrain this to match only on the first field, change the [[:alnum:]] to ^[[:alnum:]]\+.
Not realizing that a*b*c* matches anything is a common newbie error. You want to avoid writing an expression where all elements are optional, because it will match every possible string. Focus on what you want to match (the long list of punctuation characters, in your case) and then maybe add optional bits of context around it if you really need to; but the fewer of these you need, the faster it will run, and the easier it will be to see what it does. As a quick rule of thumb, a*bc* selects exactly the same lines as just b: leading or trailing optional expressions might as well not be specified, because they do not affect which lines are going to be matched.
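Both points can be tried out on the sample lines from the question. A small sketch, assuming a reduced punctuation set (, . ; : ! ?) instead of the full list to keep the quoting simple; the variable names are illustrative:

```shell
# A pattern made only of optional parts matches every line, even an empty one.
all_optional=$(printf '%s\n' 'xyz' '' '!!!' | grep -cE 'a*b*c*')

# One bracket expression: drop lines where punctuation follows an alphanumeric.
kept=$(printf '%s\n' 'example for 1' 'for. However 1' ',boy wonder 1' \
  ',hary-horse wondered 2' | grep -v '[[:alnum:]][,.;:!?]')

echo "lines matched by a*b*c*: $all_optional"
printf '%s\n' "$kept"
```

All three test lines match a*b*c*, and only the for. However 1 line is dropped by the bracket expression, since the hyphen is not in the set.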

sed regex not being greedy?

In bash I have a string variable tempvar, which is created thus:
tempvar=`grep -n 'Mesh Tally' ${meshtalfile}`
meshtalfile is a (large) input file which contains some header lines and a number of blocks of data lines, each marked by a beginning line which is searched for in the grep above.
In the case at hand, the variable tempvar contains the following string:
5: Mesh Tally Number 4 977236: Mesh Tally Number 14 1954467: Mesh Tally Number 24 4354479: Mesh Tally Number 34
I now wish to extract the line number relating to a particular mesh tally number, so I define a variable meshnum1 as equal to 24, and run the following sed command:
echo ${tempvar} | sed -r "s/^.*([0-9][0-9]*):\sMesh\sTally\sNumber\s${meshnum1}.*$/\1/"
This is where things go wrong. I expect the output 1954467, but instead I get 7. Trying with number 34 instead returns 9 instead of 4354479. It seems that sed is returning only the last digit of the number - which surely violates the principle of greedy matching? And oddly, when I move the open parenthesis ( left a couple of characters to include .*, it returns the whole line up to and including the single character it was previously returning. Surely it cannot be greedy in one situation and antigreedy in another? Hopefully I have just done something stupid with the syntax...
The problem is that the .* is being greedy too, which means that it will get all numbers too. Since you force it to get at least one digit in the [0-9][0-9]* part, the .* before it will be greedy enough to leave only one digit for the expression after it.
A solution could be:
echo ${tempvar} | sed -r "s/^.*\s([0-9][0-9]*):\sMesh\sTally\sNumber\s${meshnum1}.*$/\1/"
Where now the \s between the .* and the [0-9][0-9]* explicitly forces there to be a space before the digits you want to match.
Hope this helps =)
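The difference is easy to reproduce with the string from the question. A sketch assuming GNU sed (for \s together with -E); greedy and fixed are just illustrative variable names:

```shell
tempvar='5: Mesh Tally Number 4 977236: Mesh Tally Number 14 1954467: Mesh Tally Number 24 4354479: Mesh Tally Number 34'
meshnum1=24

# Leading .* swallows all but the last digit of the line number.
greedy=$(printf '%s\n' "$tempvar" |
  sed -E "s/^.*([0-9][0-9]*):\sMesh\sTally\sNumber\s${meshnum1}.*\$/\1/")

# The extra \s stops .* at the space in front of the number.
fixed=$(printf '%s\n' "$tempvar" |
  sed -E "s/^.*\s([0-9][0-9]*):\sMesh\sTally\sNumber\s${meshnum1}.*\$/\1/")

echo "without \\s: $greedy, with \\s: $fixed"
```

The first command yields 7 and the second 1954467, which matches what the question reports and expects.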
Are the values in $tempvar supposed to be multiple or a single line? Because if it is a single line, ".*$" should match to the end of line, meaning all the other values too, right?
There's no need for sed, here's one way using GNU grep:
echo "$tempvar" | grep -oP "[0-9]+(?=:\sMesh\sTally\sNumber\s${meshnum1}\b)"

grep - removing a line that contains anything other than specified characters

I'm trying to find a way to delete any lines that contain characters other than what I specify. For example if I specify the characters a,e,i,o,u,r,s,t and I have a list of words
rat
tar
set
meow
Then "meow" should be deleted from the list because it contains the letters "m" and "w", which I haven't okayed. Any ideas?
Alternatively you can do this:
$ grep -v '[^aeiourst]' file.txt
rat
tar
set
The pattern matches lines that contain any character not specified in the list. This is clearly explained in the grep manual page:
A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that list; if the first character of the list is the caret ^ then it matches any character not in the list. For example, the regular expression [0123456789] matches any single digit.
In addition to this, since what you want is to remove the lines that match that pattern the -v/--invert-match option is used. This is also well explained in the grep manual page:
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)
This should do it for you. It has the letters you specified in a set, enclosed by []. * denotes that they can occur any number of times. ^ and $ anchor the pattern to the start and the end of the line, so the whole line must consist of those letters.
grep '^[aeiourst]*$' file.txt
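The two answers can be compared side by side. A minimal sketch using the word list from the question; the variable names are only for illustration:

```shell
words=$(printf '%s\n' rat tar set meow)

# Drop every line that contains a character outside the allowed set.
by_inversion=$(printf '%s\n' "$words" | grep -v '[^aeiourst]')

# Keep every line that consists only of allowed characters.
by_whitelist=$(printf '%s\n' "$words" | grep '^[aeiourst]*$')

printf '%s\n' "$by_inversion"
[ "$by_inversion" = "$by_whitelist" ] && echo "both approaches agree"
```

Both commands keep rat, tar and set and drop meow.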
