how do you specify non-capturing groups in sed? - linux

is it possible to specify non-capturing groups in sed?
if so, how?
Parentheses in sed have two functions, grouping, and capturing.
So i'm asking about using parentheses to do the grouping, but without capturing. One might say non-capturing grouping parentheses. (non-capturing parantheses and that aren't literal). What are called non-capturing groups. Like i've seen the syntax (?:regex) for non-capturing groups, but it doesn't work in sed.
Linguistic Note- in the UK, the term brackets is used generally, for "round brackets" or "square brackets". In the UK, brackets usually refers to "( )", since "( )" are so common. And in the UK the term parentheses is hardly used. In the USA the term brackets are specifically "[ ]". So to prevent confusion to anybody in the USA, i've not used the words brackets in the question.

Parentheses can be used for grouping alternatives. For example:
sed 's/a\(bc\|de\)f/X/'
says to replace "abcf" or "adef" with "X", but the parentheses also capture. There is not a facility in sed to do such grouping without also capturing. If you have a complex regex that does both alternative grouping and capturing, you will simply have to be careful in selecting the correct capture group in your replacement.
Perhaps you could say more about what it is you're trying to accomplish (what your need for non-capturing groups is) and why you want to avoid capture groups.
Edit:
There is a type of non-capturing brackets ((?:pattern)) that are part of Perl-Compatible Regular Expressions (PCRE). They are not supported in sed (but are when using grep -P).

The answer, is that as of writing, you can't - sed does not support it.
Non-capturing groups have the syntax of (?:a) and are a PCRE syntax.
Sed supports BRE(Basic regular expressions), aka POSIX BRE, and if using GNU sed, there is the option -r that makes it support ERE(extended regular expressions) aka POSIX ERE, but still not PCRE)
Perl will work, for windows or linux
examples here
https://superuser.com/questions/416419/perl-for-matching-with-regular-expressions-in-terminal
e.g. this from cygwin in windows
$ echo -e 'abcd' | perl -0777 -pe 's/(a)(?:b)(c)(d)/\1/s'
a
$ echo -e 'abcd' | perl -0777 -pe 's/(a)(?:b)(c)(d)/\2/s'
c
There is a program albeit for Windows, which can do search and replace on the command line, and does support PCRE. It's called rxrepl. It's not sed of course, but it does search and replace with PCRE support.
C:\blah\rxrepl>echo abc | rxrepl -s "(a)(b)(c)" -r "\1"
a
C:\blah\rxrepl>echo abc | rxrepl -s "(a)(b)(c)" -r "\3"
c
C:\blah\rxrepl>echo abc | rxrepl -s "(a)(b)(?:c)" -r "\3"
Invalid match group requested.
C:\blah\rxrepl>echo abc | rxrepl -s "(a)(?:b)(c)" -r "\2"
c
C:\blah\rxrepl>
The author(not me), mentioned his program in an answer over here https://superuser.com/questions/339118/regex-replace-from-command-line
It has a really good syntax.
The standard thing to use would be perl, or almost any other programming language that people use.

I'll assume you are speaking of the backrefence syntax, which are parentheses ( ) not brackets [ ]
By default, sed will interpret ( ) literally and not attempt to make a backrefence from them. You will need to escape them to make them special as in \( \) It is only when you use the GNU sed -r option will the escaping be reversed. With sed -r, non escaped ( ) will produce backrefences and escaped \( \) will be treated as literal. Examples to follow:
POSIX sed
$ echo "foo(###)bar" | sed 's/foo(.*)bar/####/'
####
$ echo "foo(###)bar" | sed 's/foo(.*)bar/\1/'
sed: -e expression #1, char 16: invalid reference \1 on `s' command's RHS
-bash: echo: write error: Broken pipe
$ echo "foo(###)bar" | sed 's/foo\(.*\)bar/\1/'
(###)
GNU sed -r
$ echo "foo(###)bar" | sed -r 's/foo(.*)bar/####/'
####
$ echo "foo(###)bar" | sed -r 's/foo(.*)bar/\1/'
(###)
$ echo "foo(###)bar" | sed -r 's/foo\(.*\)bar/\1/'
sed: -e expression #1, char 18: invalid reference \1 on `s' command's RHS
-bash: echo: write error: Broken pipe
Update
From the comments:
Group-only, non-capturing parentheses ( ) so you can use something like intervals {n,m} without creating a backreference \1 don't exist. First, intervals are not apart of POSIX sed, you must use the GNU -r extension to enable them. As soon as you enable -r any grouping parentheses will also be capturing for backreference use. Examples:
$ echo "123.456.789" | sed -r 's/([0-9]{3}\.){2}/###/'
###789
$ echo "123.456.789" | sed -r 's/([0-9]{3}\.){2}/###\1/'
###456.789

As said, it is not possible to have non-capturing groups in sed.
It could be obvious but non-capturing groups are not a necessity(unless running into the back reference limit (e.g. \9).).
One can just use the desired capturing ones and ignore the non-desired ones as if they were non-capturing.
So e.g. of the two capturings here \1 and \2 you can ignore the \1 and just use the \2
$ echo blahblahblahc | sed -r "s/(blah){1,10}(.)/\2/"
c
For reference, nested capturing groups are numbered by the position-order of "(".
E.g.,
echo "apple and bananas and monkeys" | sed -r "s/((apple|banana)s?)/\1x/g"
applex and bananasx and monkeys (note: "s" in bananas, first bigger group)
vs
echo "apple and bananas and monkeys" | sed -r "s/((apple|banana)s?)/\2x/g"
applex and bananax and monkeys (note: no "s" in bananas, second smaller group)

Related

Grep words with spaces [duplicate]

This question already has answers here:
grep with regexp: whitespace doesn't match unless I add an assertion
(2 answers)
Closed 3 years ago.
I have a text file that contains quotes, comma and spaces.
"'x','a b c'"
"'x','a b c','1','2 3'"
"'x','a b c','22'"
"'x','a b z'"
"'x','s d 2'"
However, when I try using grep to pull the exact match, it doesn't display the results. Below is the command I'm trying to use.
grep -E "\"\'x\'\,\'a\s\+b\s\+c\'\"" test.txt
Expected output: "'x','a b c'"
Am I missing anything? Any help would be really appreciated.
You were close! Couple of notes:
Don't use \s. It is a gnu extension, not available everywhere. It's better to use character classes [[:space:]], or really just match a space.
The \+ may be misleading - in -E mode, it matches a literal +, while without -E the \+ matches one or more preceding characters. The escaping depends on the mode you are using.
You don't need to escape everything! When in " doublequotes, escape doublequotes "\"", don't escape singlequotes and commas in doublequotes, "\'\," is interpreted as just "',".
If you meant only to match spaces with grep -E:
grep -E "\"'x','a +b +c'\""
This is simple enough without -E, just \+ instead of +:
grep "\"'x','a \+b \+c'\""
I like to put things in front of + inside braces, helps me read:
grep "\"'x','a[ ]\+b[ ]\+c'\""
grep -E "\"'x','a[ ]+b[ ]+c'\""
If you want to match spaces and tabs between a and b, you can insert a literal tab character inside [ ] with $'\t':
grep "\"'x','a[ "$'\t'"]\+b[ "$'\t'"]\+c'\""
grep -E "\"'x','a[ "$'\t'"]+b[ "$'\t'"]+c'\""
But with grep -P that would just become:
grep -P "\"'x','a[ \t]+b[ \t]+c'\""
But the best is to forget about \s and use character classes [[:space:]]:
grep "\"'x','a[[:space:]]\+b[[:space:]]\+c'\""
grep -E "\"'x','a[[:space:]]+b[[:space:]]+c'\""

Delete _ and - characters using sed

I am trying to convert 2015-06-03_18-05-30 to 20150603180530 using sed.
I have this:
$ var='2015-06-03_18-05-30'
$ echo $var | sed 's/\-\|\_//g'
$ echo $var | sed 's/-|_//g'
None of these are working. Why is the alternation not working?
As long as your script has a #!/bin/bash (or ksh, or zsh) shebang, don't use sed or tr: Your shell can do this built-in without the (comparatively large) overhead of launching any external tool:
var='2015-06-03_18-05-30'
echo "${var//[-_]/}"
That said, if you really want to use sed, the GNU extension -r enables ERE syntax:
$ sed -r -e 's/-|_//g' <<<'2015-06-03_18-05-30'
20150603180530
See http://www.regular-expressions.info/posix.html for a discussion of differences between BRE (default for sed) and ERE. That page notes, in discussing ERE extensions:
Alternation is supported through the usual vertical bar |.
If you want to work on POSIX platforms -- with /bin/sh rather than bash, and no GNU extensions -- then reformulate your regex to use a character class (and, to avoid platform-dependent compatibility issues with echo[1], use printf instead):
printf '%s\n' "$var" | sed 's/[-_]//g'
[1] - See the "APPLICATION USAGE" section of that link, in particular.
Something like this ought to do.
sed 's/[-_]//g'
This reads as:
s: Search
/[-_]/: for any single character matching - or _
//: replace it with nothing
g: and do that for every character in the line
Sed operates on every line by default, so this covers every instance in the file/string.
I know you asked for a solution using sed, but I offer an alternative in tr:
$ var='2015-06-03_18-05-30'
$ echo $var | tr -d '_-'
20150603180530
tr should be a little faster.
Explained:
tr stands for translate and it can be used to replace certain characters with another ones.
-d option stands for delete and it removes the specified characters instead of replacing them.
'_-' specifies the set of characters to be removed (can also be specified as '\-_' but you need to escape the - there because it's considered another option otherwise).
Easy:
sed 's/[-_]//g'
The character class [-_] matches of the characters from the set.
sed 's/[^[:digit:]]//g' YourFile
Could you tell me what failed on echo $var | sed 's/\-\|\_//g', it works here (even if escapping - and _ are not needed and assuming you use a GNU sed due to \| that only work in this enhanced version of sed)

Bash to transform string `3.11.0.17.16` into `3.11.0-17-generic`

I'm trying to transform this 3.11.0.17.16 into 3.11.0-17-generic using only bash and unix tools. The 16 in the original string can be anything. I feel like sed is the answer, but I'm not comfortable with its flavor of regex. How would you do this?
Version using awk instead of sed:
echo "3.11.0.17.16" | awk -F. '{printf "%s.%s.%s-%s-generic\n",$1,$2,$3,$4}'
echo "3.11.0.17.16" | sed 's/\.\([0-9][0-9]*\)\.[0-9][0-9]*$/-\1-generic/'
3.11.0-17-generic
This only accepts digits in the final component. If you want to accept arbitrary characters other than . there (you can't allow . or the match will become ambiguous) then write instead
echo "3.11.0.17.gr#wl1x" | sed 's/\.\([0-9][0-9]*\)\.[^.][^.]*$/-\1-generic/'
In a portable sed invocation you are limited to POSIX basic regular expressions, which most importantly means you cannot use +, ?, or |, and ( ) { } are ordinary characters unless \-escaped. Many sed implementations now accept an -E option that brings their regex syntax in line with egrep, but that is not a feature even of the very latest revision of POSIX so you cannot rely on it.
Substring removal using bash parameter expansion and extended globs
shopt -s extglob
version=3.11.0.17.16
version=${version%.+(!(.))}
printf "%s-%s-generic\n" ${version%.+(!(.))} ${version##*.}
3.11.0-17-generic
If you anchor the regex you are trying to match onto the last 3 sets of digits you would get
echo "3.11.0.17.16" | sed 's!\([0-9]*\)\.\([0-9]*\)\.\([0-9]*\)$!\1-\2-generic!'

Complex shell wildcard

I want to use echo to display(not content) directories that start with atleast 2 characters but can't begin with "an"
For example if had the following in the directory:
a as an23 an23 blue
I would only get
as blue back
I tried echo ^an* but that returns the directory with 1 charcter too.
Is there any way i can do this in the form of echo globalpattern
You can use the shells extended globbing feature, in bash:
bash$ setsh -s extglob
bash$ echo !(#(?|an*))
The !() construct inverts its internal expression, see this for more.
In zsh:
zsh$ setopt extendedglob
zsh$ print *~(?|an*)
In this case the ~ negates the pattern before the tilde. See the manual for more.
Since you want at least two characters in the names, you can use printf '%s\n' ??* to echo each such name on a separate line. You can then eliminate those names that start with an with grep -v '^an', leading to:
printf '%s\n' ??* | grep -v '^an'
The quotes aren't strictly necessary in the grep command with modern shells. Once upon a quarter of a century or so ago, the Bourne shell had ^ as a synonym for | so I still use quotes around carets.
If you absolutely must use echo instead of printf, then you'll have to map white space to newlines (assuming you don't have any names that contain white space).
I'm trying with just the echo command, no grep either?
What about:
echo [!a]?* a[!n]*
The first term lists all the two-plus character names not beginning with a; the second lists all the two-plus character names where the first is a and the second is not n.
This should do it, but you'd likely be better off with ls or even find:
echo * | tr ' ' '\012' | egrep '..' | egrep -v '^an'
Shell globbing is a form of regex, but it's not as powerful as egrep regex's.

Sed:Replace a series of dots with one underscore

I want to do some simple string replace in Bash with sed. I am Ubuntu 10.10.
Just see the following code, it is self-explanatory:
name="A%20Google.."
echo $name|sed 's/\%20/_/'|sed 's/\.+/_/'
I want to get A_Google_ but I get A_Google..
The sed 's/\.+/_/' part is obviously wrong.
BTW, sed 's/\%20/_/' and sed 's/%20/_/' both work. Which is better?
sed speaks POSIX basic regular expressions, which don't include + as a metacharacter. Portably, rewrite to use *:
sed 's/\.\.*/_/'
or if all you will ever care about is Linux, you can use various GNU-isms:
sed -r 's/\.\.*/_/' # turn on POSIX EREs (use -E instead of -r on OS X)
sed 's/\.\+/_/' # GNU regexes invert behavior when backslash added/removed
That last example answers your other question: a character which is literal when used as is may take on a special meaning when backslashed, and even though at the moment % doesn't have a special meaning when backslashed, future-proofing means not assuming that \% is safe.
Additional note: you don't need two separate sed commands in the pipeline there.
echo $name | sed -e 's/\%20/_/' -e 's/\.+/_/'
(Also, do you only need to do that once per line, or for all occurrences? You may want the /g modifier.)
The sed command doesn't understand + so you'll have to expand it by hand:
sed 's/\.\.*/_/'
Or tell sed that you want to use extended regexes:
sed -r 's/\.+/_/' # GNU
sed -E 's/\.+/_/' # OSX
Which switch, -r or -E, depends on your sed and it might not even support extended regexes so the portable solution is to use \.\.* in place of \.+. But, since you're on Linux, you should have GNU sed so sed -r should do the trick.

Resources