How to extract lines that start with either this string or that string? [duplicate] - linux

Newbie UNIX user question ...
The input file (location.txt) is this:
WGS_LAT deg 12
WGS_LAT min 30
WGS_LAT sec 05
WGS_LAT hsec 29
WGS_LAT northSouth North
WGS_DLAT decimalDegreesLatitude 12.501469
WGS_LONG deg 07
WGS_LONG min 00
WGS_LONG sec 05
WGS_LONG hsec 61
WGS_LONG eastWest West
WGS_DLONG decimalDegreesLongitude -70.015606
I want to get all lines that start with WGS_LAT or WGS_DLAT.
First, is grep the tool you recommend for this job?
Second, if it is, then how to express the pattern? All of these failed:
grep ^WGS_LAT|^WGS_DLAT location.txt
grep ^(WGS_LAT|WGS_DLAT) location.txt
grep ^WGS_D?LAT location.txt
What is the correct pattern, please?

Grep can handle two types of regular expressions:
Basic regular expressions (BRE) which you call using grep PATTERN file
Extended regular expressions (ERE) which you call using grep -E PATTERN file
So by default grep makes use of BRE.
When reading the man-pages of grep you find
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their
special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
So, in your case the answer is:
$ grep "^\(WGS_LAT\|WGS_DLAT\)" location.txt
$ grep -E "^(WGS_LAT|WGS_DLAT)" location.txt
$ grep "^WGS_D\?LAT" location.txt
$ grep -E "^WGS_D?LAT" location.txt

First, you should always quote your regular expressions to protect them from the shell. For example, | has special meaning in the shell: it is the pipe operator that allows you to pass the output of one program as input to another. So the unquoted grep ^WGS_LAT|^WGS_DLAT location.txt is interpreted as "run grep ^WGS_LAT and pass its output as input to ^WGS_DLAT location.txt".
Next, grep uses Basic Regular Expressions by default. To get | to mean OR, you need to either escape it as \| or use the -E flag to enable extended regular expressions (or, if you are using GNU grep, the -P flag, which enables PCRE). So all of these should work for you:
grep -E '^WGS_LAT|^WGS_DLAT' location.txt
grep -E '^(WGS_LAT|WGS_DLAT)' location.txt
grep '^WGS_LAT\|^WGS_DLAT' location.txt
Or, more simply, grep for lines starting with WGS_ and an optional D followed by LAT:
grep -E '^WGS_D?LAT' location.txt
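To check the variants above side by side, here is a minimal sketch that recreates a shortened copy of location.txt in a temporary file and runs each pattern on it. The variable names are only illustrative, and the `\|` spelling in basic regular expressions is a GNU extension, so the last command (one `-e` per pattern) is the one most likely to work on any grep.

```shell
# Recreate a shortened copy of the question's input in a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
WGS_LAT deg 12
WGS_LAT northSouth North
WGS_DLAT decimalDegreesLatitude 12.501469
WGS_LONG deg 07
WGS_DLONG decimalDegreesLongitude -70.015606
EOF

# ERE: | and ? are operators as written.
ere_alt=$(grep -E '^(WGS_LAT|WGS_DLAT)' "$input")
ere_opt=$(grep -E '^WGS_D?LAT' "$input")

# BRE: the alternation operator needs a backslash (GNU extension).
bre_alt=$(grep '^WGS_LAT\|^WGS_DLAT' "$input")

# No alternation operator at all: one -e option per pattern.
multi=$(grep -e '^WGS_LAT' -e '^WGS_DLAT' "$input")

# Each variable holds the same three WGS_LAT / WGS_DLAT lines.
printf '%s\n' "$ere_alt"
rm -f "$input"
```

All four commands are single-quoted, so the shell passes `|`, `?`, and the parentheses to grep untouched.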

Related

Some help needed on grep

I am trying to find alphanumeric string including these two characters "/+" with at least 30 characters in length.
I have written this code,
grep "[a-zA-Z0-9\/\+]{30,}" tmp.txt
cat tmp.txt
> array('rWmyiJgKT8sFXCmMr639U4nWxcSvVFEur9hNOOvQwF/tpYRqTk9yWV2xPFBAZwAPRVs/s
ddd73ZEjfy+airfy8DtqIqKI9+dd 6hdd7soJ9iG0sGs/ld5f2GHzockoYHfh
+pAzx/t17Crf0T/2+8+reo+MU39lqCr02sAkcC1k/LzyBvSDEtu9N/9NHicr jA3SvDqg5s44DFlaNZ/8BW37fGEf2rk13S/q68OVVyzac7IT7yE7PIL9XZ/6LsmrY
KEsAmN4i/+ym8be3wwn KWGYaIB908+7W98pI6qao3iaZB
3mh7Y/nZm52hyLa37978f+PyOCqUh0Wfx2PL3vglofi0l
QVrOM1pg+mFLEIC88B706UzL4Pss7ouEo+EsrES+/qJq9Y1e/UGvwefOWSL2TJdt
This does not work. Mainly, I wanted the minimum length of the string to be 30.
In the syntax of grep, the repetition braces need to be backslashed.
grep -o '[a-zA-Z0-9/+]\{30,\}' file
If you want to constrain the match to lines containing only matches to this pattern, add line-start and line-ending anchors:
grep '^[a-zA-Z0-9/+]\{30,\}$' file
The -o option in the first command line causes grep to only print the matching part, not the entire matching line.
The repetition operator is not directly supported in Basic Regular Expression syntax. Use grep -E to enable Extended Regular Expression syntax, or backslash the braces.
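As a minimal sketch of the two spellings described above, the following builds a small sample file (the contents are made up for illustration) and checks that the backslashed BRE interval and the unescaped ERE interval pick out the same text. Note that -o is not part of POSIX grep, although GNU and BSD grep both provide it.

```shell
# Hypothetical sample data: only the second line has a run of 30+ allowed characters.
sample=$(mktemp)
printf '%s\n' \
  'short+token/123' \
  'QVrOM1pg+mFLEIC88B706UzL4Pss7ouEo+EsrES+/qJq9Y1e' \
  'two words' > "$sample"

# BRE: the interval braces must be backslashed.
bre_hits=$(grep -o '[a-zA-Z0-9/+]\{30,\}' "$sample")

# ERE: the braces are interval operators as written.
ere_hits=$(grep -E -o '[a-zA-Z0-9/+]{30,}' "$sample")

printf 'BRE: %s\nERE: %s\n' "$bre_hits" "$ere_hits"
rm -f "$sample"
```

Inside a bracket expression neither / nor + needs a backslash, which is why the class is written as [a-zA-Z0-9/+].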
You can use
grep -e "^[a-zA-Z0-9/+]\{30,\}" tmp.txt
+pAzx/t17Crf0T/2+8+reo+MU39lqCr02sAkcC1k/LzyBvSDEtu9N/9NHicr jA3SvDqg5s44DFlaNZ/8BW37fGEf2rk13S/q68OVVyzac7IT7yE7PIL9XZ/6LsmrY
3mh7Y/nZm52hyLa37978f+PyOCqUh0Wfx2PL3vglofi0l
QVrOM1pg+mFLEIC88B706UzL4Pss7ouEo+EsrES+/qJq9Y1e/UGvwefOWSL2TJdt
man grep
Read up about the difference between regular and extended patterns. You need the -E option.

Bash to transform string `3.11.0.17.16` into `3.11.0-17-generic`

I'm trying to transform this 3.11.0.17.16 into 3.11.0-17-generic using only bash and unix tools. The 16 in the original string can be anything. I feel like sed is the answer, but I'm not comfortable with its flavor of regex. How would you do this?
Version using awk instead of sed:
echo "3.11.0.17.16" | awk -F. '{printf "%s.%s.%s-%s-generic\n",$1,$2,$3,$4}'
echo "3.11.0.17.16" | sed 's/\.\([0-9][0-9]*\)\.[0-9][0-9]*$/-\1-generic/'
3.11.0-17-generic
This only accepts digits in the final component. If you want to accept arbitrary characters other than . there (you can't allow . or the match will become ambiguous) then write instead
echo "3.11.0.17.gr#wl1x" | sed 's/\.\([0-9][0-9]*\)\.[^.][^.]*$/-\1-generic/'
In a portable sed invocation you are limited to POSIX basic regular expressions, which most importantly means you cannot use +, ?, or |, and ( ) { } are ordinary characters unless \-escaped. Many sed implementations now accept an -E option that brings their regex syntax in line with egrep, but that is not a feature even of the very latest revision of POSIX so you cannot rely on it.
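The BRE-only substitution above can be wrapped in a small shell function for reuse. This is a sketch: to_generic is a hypothetical helper name, not something from the answer, and it assumes the input always has at least two dot-separated trailing components.

```shell
# to_generic is a hypothetical helper name used only for this illustration.
to_generic() {
  # BRE only: [0-9][0-9]* stands in for [0-9]+, and \( \) marks the capture group.
  # [^.][^.]*$ accepts any non-empty final component that contains no dot.
  printf '%s\n' "$1" | sed 's/\.\([0-9][0-9]*\)\.[^.][^.]*$/-\1-generic/'
}

to_generic 3.11.0.17.16          # prints 3.11.0-17-generic
to_generic '3.11.0.17.gr#wl1x'   # prints 3.11.0-17-generic
```

Because the expression sticks to basic regular expressions, it does not depend on sed accepting -E or -r.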
Substring removal using bash parameter expansion and extended globs
shopt -s extglob
version=3.11.0.17.16
version=${version%.+(!(.))}
printf "%s-%s-generic\n" ${version%.+(!(.))} ${version##*.}
3.11.0-17-generic
If you anchor the regex you are trying to match onto the last 3 sets of digits you would get
echo "3.11.0.17.16" | sed 's!\([0-9]*\)\.\([0-9]*\)\.\([0-9]*\)$!\1-\2-generic!'

What's the difference between "grep -e" and "grep -E" [closed]

I have a file test.txt, in which there are some formatted phone numbers. I'm trying to use grep to find the lines containing a phone number.
It seems that grep -e "[0-9]{3}-[0-9]{3}-[0-9]{4}" test.txt doesn't work and gives no results. But grep -E "[0-9]{3}-[0-9]{3}-[0-9]{4}" test.txt works. So I wonder what the difference between these two options is.
According to man grep:
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e. force
grep to behave as egrep).
-e pattern, --regexp=pattern
Specify a pattern used during the search of the input: an input
line is selected if it matches any of the specified patterns.
This option is most useful when multiple -e options are used to
specify multiple patterns, or when a pattern begins with a dash
(`-').
But I don't quite understand it. What is an extended regex?
As you mentioned, grep -E is for extended regular expressions. The -e option does not choose a regular expression flavor: without -E, grep uses basic regular expressions either way.
EDIT: As Jonathan pointed out below, grep -e "specifies that the following argument is (one of) the regular expression(s) to be matched."
From the man page:
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose
their special meaning; instead use the backslashed versions \?, \+, \{,
\|, \(, and \).
Traditional egrep did not support the { meta-character, and some egrep
implementations support \{ instead, so portable scripts should avoid { in
grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is
not special if it would be the start of an invalid interval specification.
For example, the command grep -E '{1' searches for the two-character
string {1 instead of reporting a syntax error in the regular expression.
POSIX.2 allows this behavior as an extension, but portable scripts should
avoid it.
But man pages are pretty terse, so for further info, check out this link:
http://www.regular-expressions.info/posix.html
The part of the manpage regarding the { meta character though specifically talks about what you are seeing with respect to the difference.
grep -e "[0-9]{3}-[0-9]{3}-[0-9]{4}"
won't work because it is not treating the { character as you expect. Whereas
grep -E "[0-9]{3}-[0-9]{3}-[0-9]{4}"
does, because that is the extended grep version (the egrep version, for example).
Here is a simple test:
$ cat file
apple is a fruit
so is orange
but onion is not
$ grep -e 'but' -e 'fruit' file #Allows you to pass multiple patterns explicitly
apple is a fruit
but onion is not
$ grep -E 'is (a|not)' file #Allows you to use extended regular expressions like ?, +, | etc
apple is a fruit
but onion is not
The -e option to grep simply says that the following argument is the regular expression. Thus:
grep -e 'some.*thing' -r -l .
looks for some followed by thing on a line in all the files in the current directory and all its sub-directories. The same could be achieved by:
grep -r -l 'some.*thing' .
(On Linux, the situation is confused by the behaviour of GNU getopt() which, unless you set POSIXLY_CORRECT in the environment, permutes options, so you could also run:
grep 'some.*thing' -r -l .
and get the same result. Under POSIX and other systems not using GNU getopt(), options need to precede arguments, and the grep would look for a file called -r and another called -l.)
The -E option changes the regular expressions from 'basic' to 'extended'. It can be used with -e:
grep -e "[0-9]{3}-[0-9]{3}-[0-9]{4}" test.txt
grep -E -e "[0-9]{3}-[0-9]{3}-[0-9]{4}" test.txt
The ERE option means the same regular expressions, more or less, as used to be recognized by the egrep command, which is no longer a part of POSIX (having been replaced by grep -E, and fgrep by grep -F).
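To make the distinction concrete, here is a minimal sketch using a made-up two-line file in place of test.txt. It runs the phone-number pattern three ways: with -e alone, with -E -e, and with -e plus backslashed braces. The `|| true` is only there so that grep's "no match" exit status does not abort a script running under `set -e`.

```shell
# Hypothetical stand-in for test.txt.
phones=$(mktemp)
printf '%s\n' 'call 555-123-4567 today' 'no number here' > "$phones"

# -e only marks the next argument as a pattern; the syntax is still BRE,
# where { and } are ordinary characters, so nothing matches.
basic=$(grep -e '[0-9]{3}-[0-9]{3}-[0-9]{4}' "$phones" || true)

# -E switches to ERE; -e can still be used to introduce the pattern.
extended=$(grep -E -e '[0-9]{3}-[0-9]{3}-[0-9]{4}' "$phones" || true)

# Staying in BRE works too, once the braces are backslashed.
escaped=$(grep -e '[0-9]\{3\}-[0-9]\{3\}-[0-9]\{4\}' "$phones" || true)

printf 'basic=[%s]\nextended=[%s]\nescaped=[%s]\n' "$basic" "$extended" "$escaped"
rm -f "$phones"
```

The first result is empty, while the other two both contain the line with the phone number.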

Question about shell commands and grep

Does anyone know why
grep "p\{2\}" textfile
will find "apple" if it's in the file, but
grep p\{2\} textfile
won't?
I'm new to using a command line and regular expressions, and this is puzzling me.
Although this has already been answered, since you are new to all this, here is how to debug it:
-- get the pid of current shell (using ps).
PID TTY TIME CMD
1611 pts/0 00:00:00 su
1619 pts/0 00:00:00 bash
1763 pts/0 00:00:00 ps
-- from some other shell, attach strace (system call tracer) to the required pid (here 1619):
strace -f -o <output_file> -p 1619
-- Run both the commands that you tried
-- open the output file and look for exec family calls for the required process, here: grep
The output on my machine is something like:
1723 execve("/bin/grep", ["grep", "--color=auto", "p{2}", "foo"], [/* 19 vars */]) = 0
1725 execve("/bin/grep", ["grep", "--color=auto", "p\\{2\\}", "foo"], [/* 19 vars */]) = 0
Now you can see the difference how grep was executed in both the cases and can figure out the problem yourself. :)
still the -e flag mystery is yet to be solved....
Without the quotes, the shell will try to expand the argument before grep sees it. In your case the backslashes and the curly brackets '{}' have a special meaning in the shell, much like the asterisk '*', which expands as a wildcard.
With quotes, your complete regex gets passed directly to grep. Without the quotes, grep sees your regex as p{2}.
Edit:
To clarify, without the quotes your backslashes are being removed by the shell before your regex is passed to grep.
Try:
echo grep p\{2\} test.txt
And you'll see your output as...
grep p{2} test.txt
The quotes prevent the shell from removing the backslashes before they get to grep. You could also escape your backslashes and it will work without quotes - grep p\\{2\\} test.txt
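The same check can be done without strace. The sketch below uses printf to show the argument exactly as the shell hands it over, then feeds both forms to grep; the variable names are illustrative only.

```shell
# printf prints each argument exactly as the shell passes it on.
unquoted_arg=$(printf '%s' p\{2\})
quoted_arg=$(printf '%s' "p\{2\}")
printf 'unquoted: %s\nquoted:   %s\n' "$unquoted_arg" "$quoted_arg"

# With the backslashes intact, grep (BRE) reads \{2\} as "exactly two".
quoted_match=$(printf 'apple\n' | grep "p\{2\}" || true)

# Without them, grep searches for the literal text p{2}.
unquoted_match=$(printf 'apple\n' | grep p\{2\} || true)
literal_match=$(printf 'ap{2}le\n' | grep p\{2\} || true)

printf 'quoted=[%s] unquoted=[%s] literal=[%s]\n' \
  "$quoted_match" "$unquoted_match" "$literal_match"
```

Single quotes would work just as well as double quotes here, and they additionally protect $ and backticks from the shell.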
The first one treats the pattern as a regex, so it matches pp:
echo "apple" | grep 'p\{2\}'
The second one matches the pattern literally, so it looks for p{2}:
echo "ap{2}le" | grep p\{2\}
From the grep man page
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
so these two become functionally equivalent
egrep p{2}
and
grep "p\{2\}"
The first uses EREs (Extended Regular Expressions); the second uses BREs (Basic Regular Expressions). In your example you're using grep (which uses BREs when you don't pass the -E switch) and the pattern is enclosed in quotes, so "\{" reaches grep intact and is interpreted as a special BRE character.
Your second instance doesn't work because you're just looking for the literal string p{2}, which doesn't exist in your file.
you can demonstrate that grep is expanding your string as a BRE by trying:
grep "p\{2"
grep will complain
grep: Unmatched \{

how do you specify non-capturing groups in sed?

is it possible to specify non-capturing groups in sed?
if so, how?
Parentheses in sed have two functions, grouping, and capturing.
So I'm asking about using parentheses to do the grouping, but without capturing: one might say non-capturing grouping parentheses (parentheses that are non-capturing and aren't literal), which are called non-capturing groups. I've seen the syntax (?:regex) for non-capturing groups, but it doesn't work in sed.
Linguistic Note- in the UK, the term brackets is used generally, for "round brackets" or "square brackets". In the UK, brackets usually refers to "( )", since "( )" are so common. And in the UK the term parentheses is hardly used. In the USA the term brackets are specifically "[ ]". So to prevent confusion to anybody in the USA, i've not used the words brackets in the question.
Parentheses can be used for grouping alternatives. For example:
sed 's/a\(bc\|de\)f/X/'
says to replace "abcf" or "adef" with "X", but the parentheses also capture. There is not a facility in sed to do such grouping without also capturing. If you have a complex regex that does both alternative grouping and capturing, you will simply have to be careful in selecting the correct capture group in your replacement.
Perhaps you could say more about what it is you're trying to accomplish (what your need for non-capturing groups is) and why you want to avoid capture groups.
Edit:
There is a type of non-capturing brackets ((?:pattern)) that are part of Perl-Compatible Regular Expressions (PCRE). They are not supported in sed (but are when using grep -P).
The answer is that, as of writing, you can't: sed does not support it.
Non-capturing groups have the syntax (?:a) and are a PCRE syntax.
Sed supports BRE (basic regular expressions), aka POSIX BRE, and GNU sed has the -r option, which makes it support ERE (extended regular expressions), aka POSIX ERE, but still not PCRE.
Perl will work, for windows or linux
examples here
https://superuser.com/questions/416419/perl-for-matching-with-regular-expressions-in-terminal
e.g. this from cygwin in windows
$ echo -e 'abcd' | perl -0777 -pe 's/(a)(?:b)(c)(d)/\1/s'
a
$ echo -e 'abcd' | perl -0777 -pe 's/(a)(?:b)(c)(d)/\2/s'
c
There is a program albeit for Windows, which can do search and replace on the command line, and does support PCRE. It's called rxrepl. It's not sed of course, but it does search and replace with PCRE support.
C:\blah\rxrepl>echo abc | rxrepl -s "(a)(b)(c)" -r "\1"
a
C:\blah\rxrepl>echo abc | rxrepl -s "(a)(b)(c)" -r "\3"
c
C:\blah\rxrepl>echo abc | rxrepl -s "(a)(b)(?:c)" -r "\3"
Invalid match group requested.
C:\blah\rxrepl>echo abc | rxrepl -s "(a)(?:b)(c)" -r "\2"
c
C:\blah\rxrepl>
The author(not me), mentioned his program in an answer over here https://superuser.com/questions/339118/regex-replace-from-command-line
It has a really good syntax.
The standard thing to use would be perl, or almost any other programming language that people use.
I'll assume you are speaking of the backreference syntax, which uses parentheses ( ), not brackets [ ].
By default, sed will interpret ( ) literally and not attempt to make a backreference from them. You will need to escape them to make them special, as in \( \). Only when you use the GNU sed -r option is the escaping reversed: with sed -r, unescaped ( ) produce backreferences and escaped \( \) are treated as literal. Examples follow:
POSIX sed
$ echo "foo(###)bar" | sed 's/foo(.*)bar/####/'
####
$ echo "foo(###)bar" | sed 's/foo(.*)bar/\1/'
sed: -e expression #1, char 16: invalid reference \1 on `s' command's RHS
-bash: echo: write error: Broken pipe
$ echo "foo(###)bar" | sed 's/foo\(.*\)bar/\1/'
(###)
GNU sed -r
$ echo "foo(###)bar" | sed -r 's/foo(.*)bar/####/'
####
$ echo "foo(###)bar" | sed -r 's/foo(.*)bar/\1/'
(###)
$ echo "foo(###)bar" | sed -r 's/foo\(.*\)bar/\1/'
sed: -e expression #1, char 18: invalid reference \1 on `s' command's RHS
-bash: echo: write error: Broken pipe
Update
From the comments:
Group-only, non-capturing parentheses ( ), which would let you use something like intervals {n,m} without creating a backreference \1, don't exist. First, the unescaped interval syntax {n,m} is not part of POSIX sed's basic regular expressions (there it is written \{n,m\}); you must use the GNU -r extension to enable it. As soon as you enable -r, any grouping parentheses will also be capturing for backreference use. Examples:
$ echo "123.456.789" | sed -r 's/([0-9]{3}\.){2}/###/'
###789
$ echo "123.456.789" | sed -r 's/([0-9]{3}\.){2}/###\1/'
###456.789
As said, it is not possible to have non-capturing groups in sed.
It may be obvious, but non-capturing groups are not a necessity (unless you run into the backreference limit, e.g. \9).
One can just use the desired capturing ones and ignore the non-desired ones as if they were non-capturing.
So e.g. of the two capturings here \1 and \2 you can ignore the \1 and just use the \2
$ echo blahblahblahc | sed -r "s/(blah){1,10}(.)/\2/"
c
For reference, nested capturing groups are numbered by the position-order of "(".
E.g.,
echo "apple and bananas and monkeys" | sed -r "s/((apple|banana)s?)/\1x/g"
applex and bananasx and monkeys (note: "s" in bananas, first bigger group)
vs
echo "apple and bananas and monkeys" | sed -r "s/((apple|banana)s?)/\2x/g"
applex and bananax and monkeys (note: no "s" in bananas, second smaller group)
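Putting the workaround together: the sketch below groups an alternation purely for scoping and then skips that group in the replacement, and it repeats the nested-group numbering on a single word. It uses sed -E, which current GNU and BSD sed accept as the equivalent of the -r used above; the sample strings are made up for illustration.

```shell
# Group 1 exists only to scope the alternation; the replacement uses group 2.
number=$(printf '%s\n' 'abcf-42' | sed -E 's/a(bc|de)f-([0-9]+)/\2/')

# Nested groups are numbered by the position of their opening parenthesis.
outer=$(printf '%s\n' 'bananas' | sed -E 's/((apple|banana)s?)/[\1]/')
inner=$(printf '%s\n' 'bananas' | sed -E 's/((apple|banana)s?)/[\2]/')

printf '%s %s %s\n' "$number" "$outer" "$inner"   # prints: 42 [bananas] [banana]
```

The unused group costs nothing beyond occupying a backreference number, which only matters once a pattern approaches the \9 limit.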
