Sed: Extracting regex pattern from lines - linux

I have an input stream of many lines which look like this:
path/to/file: example: 'extract_me.proto'
path/to/other-file: example: 'me_too.proto'
path/to/something/else: example: 'and_me_2.proto'
...
I'd like to just extract the *.proto filenames from these lines, and I have tried:
[INPUT] | sed 's/^.*\([a-zA-Z0-9_]+\.proto\).*$/\1/'
I know that part of my problem is that .* is greedy and I'm going to get things like e.proto and o.proto and 2.proto, but I can't even get that far: it just outputs the same lines as the input. Any help would be greatly appreciated.

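I find it helpful to use extended regex for this purpose (-r; GNU sed also accepts -E), in which case you need not escape the parentheses or the +. In basic regex an unescaped + is a literal plus sign, which is why your command never matched and printed the input unchanged.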
sed -r 's/^.*[^a-zA-Z0-9_]([a-zA-Z0-9_]+\.proto).*$/\1/'
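The added [^a-zA-Z0-9_] does not make .* non-greedy; it forces .* to end on a character that cannot be part of the filename, so the capture group gets the whole name rather than just its last character.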
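As a quick check of the command above, here is a minimal sketch that feeds it the three sample lines from the question. It assumes GNU sed (for -E); the variable names input and result are only for illustration.

```shell
# Sample lines copied from the question.
input="path/to/file: example: 'extract_me.proto'
path/to/other-file: example: 'me_too.proto'
path/to/something/else: example: 'and_me_2.proto'"

# .* still grabs as much as it can, but it has to stop where a
# non-word character is followed by word characters and ".proto".
result=$(printf '%s\n' "$input" |
  sed -E 's/^.*[^a-zA-Z0-9_]([a-zA-Z0-9_]+\.proto).*$/\1/')

printf '%s\n' "$result"
```

Each output line should contain only the filename, without the path prefix or the quotes.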

Since you tagged your question with linux, I'll assume you have GNU grep. Pick one of:
grep -oP '\w+\.proto' file
grep -oE "[^']+\\.proto" file
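A small sketch of the grep approach on the question's sample data, assuming GNU grep. It uses -E so that + means "one or more" (in grep's basic regex a bare + is a literal character); the variable names are illustrative.

```shell
# Sample lines copied from the question.
input="path/to/file: example: 'extract_me.proto'
path/to/other-file: example: 'me_too.proto'
path/to/something/else: example: 'and_me_2.proto'"

# -o prints only the matched part of each line. [^']+ cannot cross a
# single quote, so the match has to start right after the opening quote.
result=$(printf '%s\n' "$input" | grep -oE "[^']+\.proto")

printf '%s\n' "$result"
```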

One way to do it:
sed 's/^.*[^a-zA-Z0-9_]\([a-zA-Z0-9_]\+\.proto\).*$/\1/'
This escapes the + (required in basic regex) and puts a negated word-character class before the group to mark where the leading characters end.
Another way: anchor on the single quotes, since they already delimit the filename:
sed "s/^.*'\([a-zA-Z0-9_]\+\.proto\)'.*\$/\1/"

Use this sed:
sed "s/^.*'\([a-zA-Z0-9_]\+\.proto\).*$/\1/"
+ is an extended-regex operator. In basic regex (sed's default) you need to escape it as \+ (a GNU extension) to give it its special meaning: the preceding item is matched one or more times.
Another way:
sed "s/^.*'\([^']\+\.proto\)'.*$/\1/"

With GNU sed:
sed -E "s/.*'([^']+)'$/\1/"

Related

How to make GNU sed remove certain characters from a line

I have the following line:
�5=?�#A00165:69:HKJ3YDMXX:1:1101:16812:7341 1:N:0:TCTTAAAG
and would like to remove the characters �5=?� in front of #, so that the desired output looks as follows:
#A00165:69:HKJ3YDMXX:1:1101:16812:7341 1:N:0:TCTTAAAG
I used GNU sed (v4.8) with the following command:
sed "s/.*#/#/"
but this did not remove �5=?�, though it worked in the GNU sed live editor.
At this point, I really appreciate any help on this.
My system is 3.10.0-1160.71.1.el7.x86_64
Using sed, remove everything up to the first occurrence of #:
$ sed 's/^[^#]*//' input_file
#A00165:69:HKJ3YDMXX:1:1101:16812:7341 1:N:0:TCTTAAAG
This might work for you (GNU sed):
sed -E 's/(\o357\o277\o275)5=\?\1//g' file
This removes all occurrences of �5=?�.
N.B. To find the octal values, use sed -n l file, which displays the file in an unambiguous form with non-printable bytes shown as octal escapes. Each triplet \357\277\275 can then be matched in the LHS of the substitute command as \o357\o277\o275.
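The sketch below rebuilds the problem line from the octal bytes mentioned above and strips everything before the first #. Running sed under LC_ALL=C is an assumption added here, not part of the answers: it makes sed treat each byte as one character, which may help when the input holds bytes that are not valid in the current locale (one possible reason the original .* did not match). The variable names are illustrative.

```shell
# Rebuild the line; \357\277\275 is the UTF-8 encoding of U+FFFD.
line=$(printf '\357\277\2755=?\357\277\275#A00165:69:HKJ3YDMXX:1:1101:16812:7341 1:N:0:TCTTAAAG')

# In the C locale every byte counts as a character, so [^#]* can
# consume the leading bytes whatever their encoding.
cleaned=$(printf '%s\n' "$line" | LC_ALL=C sed 's/^[^#]*//')

printf '%s\n' "$cleaned"
```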

SED : replace part of syntax to lower case

Original Text file:
.ABC0 (ABC0),
.EFG2 (EFG2),
.ZZZ3 (ZZZ3),
How can I convert the part in parentheses to lower case, giving
.ABC0 (abc0),
.EFG2 (efg2),
.ZZZ3 (zzz3),
easily with a sed command?
I'm having trouble getting it to work:
echo ".ABC(ABC)," | sed -e 's/\(.*\.[A-Z]*\(\)\([A-Z]*\)\)/\1\L\2\E/ p'
You were almost there.
sed -e 's/\(\.[A-Z0-9]*\)\( ([A-Z0-9]*)\)/\1\L\2\E/g'
Remove the .* from the beginning, it matches as much as it can, i.e. it skips to the last occurrence of the pattern.
Include the digits in the character classes.
In basic regex, use ( without a backslash to match a parenthesis literally; \( starts a capture group.
This might work for you (GNU sed):
sed 's/([^)]*)/\L&/g' file
This replaces each parenthesized group, including its contents, with its lowercase equivalent.
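A minimal sketch of the \L approach on the sample text from the question. It assumes GNU sed, since \L in the replacement is a GNU extension; the variable names are illustrative.

```shell
# Sample text copied from the question.
input='.ABC0 (ABC0),
.EFG2 (EFG2),
.ZZZ3 (ZZZ3),'

# In basic regex ( and ) are literal; & is the whole match, and \L
# lowercases it. Text outside the match is left untouched.
result=$(printf '%s\n' "$input" | sed 's/([^)]*)/\L&/g')

printf '%s\n' "$result"
```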

Conditional replace using sed

My question is probably rather simple. I'm trying to replace sequences of strings that are at the beginning of lines in a file. For example, I would like to replace any instance of the pattern "GN" with "N" or "WR" with "R", but only if they are the first 2 characters of that line. For example, if I had a file with the following content:
WRONG
RIGHT
GNOME
I would like to transform this file to give
RONG
RIGHT
NOME
I know I can use the following to replace every instance of those patterns:
sed -i 's/GN/N/g' file.txt
sed -i 's/WR/R/g' file.txt
The issue is that I want this to happen only if the above patterns are the first 2 characters of a line. Possibly an IF statement, although I'm not sure what the condition would look like. Any pointers in the right direction would be much appreciated, thanks.
Just add the circumflex (^) and drop the g suffix (unnecessary, since you want at most one replacement per line). You can also combine both substitutions in one script:
sed -i 's/^GN/N/;s/^WR/R/' file.txt
Use the start-of-string regexp anchor ^:
sed -i 's/^GN/N/' file.txt
sed -i 's/^WR/R/' file.txt
Since sed is line-oriented, start-of-string == start-of-line.
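To see the anchor at work without editing a file in place, here is a small sketch run on the question's sample. The extra line SIGN is a made-up addition to show that GN away from the start of a line is left alone.

```shell
# Sample from the question plus one hypothetical line (SIGN).
input='WRONG
RIGHT
GNOME
SIGN'

# ^ ties each pattern to the start of the line, so only leading
# GN or WR is rewritten.
result=$(printf '%s\n' "$input" | sed 's/^GN/N/;s/^WR/R/')

printf '%s\n' "$result"
```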

How to specify an "or" in sed

I have a file having data in the following form
<A/Here> <A/There>
<B/SomeMoreDate> <C/SomeOtherDate>
Now I want to delete every A/, B/ and C/ from the file in an efficient way. I know I can use sed for one pattern:
sed -i 's/A//g' /path/to/filename
But how do I specify an "or" so that sed deletes all the patterns?
The expected output is:
<Here> <There>
<SomeMoreDate> <SomeOtherDate>
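You can use sed -i 's/[ABC]//g' /path/to/filename. The bracket expression [ABC] matches A, B, or C. Note that this deletes those letters anywhere in the file and leaves the / behind; to get the expected output, include the slash in the pattern: sed -i 's/[ABC]\///g' /path/to/filename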
If you're using GNU sed, you can say:
sed -r 's#(A|B|C)/##g' filename
The following should work otherwise:
sed 's#A/##g;s#B/##g;s#C/##g' filename
Ivaylo Strandjev's answer is correct in that it solves the problem when you want to match single characters. There is also a way to express an "or" when matching longer strings:
s/\(\(stringA\)\|\(stringB\)\|\(stringC\)\)something/something else/
You can try it with something like:
echo stringBsomething | sed -e 's/\(stringA\|stringB\|stringC\)something/something else/'
It is sad that sed requires all these backslashes. Some of this is avoided if you use -r.
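The two spellings of alternation can be compared on the sample data from the question. This is a sketch assuming GNU sed: -E enables extended regex, and \| in basic regex is a GNU extension. The variable names are illustrative.

```shell
# Sample data copied from the question.
input='<A/Here> <A/There>
<B/SomeMoreDate> <C/SomeOtherDate>'

# Extended regex: plain ( | ) for grouping and alternation.
ere=$(printf '%s\n' "$input" | sed -E 's#(A|B|C)/##g')

# Basic regex (GNU): the same operators, each with a backslash.
bre=$(printf '%s\n' "$input" | sed 's#\(A\|B\|C\)/##g')

printf '%s\n' "$ere"
```

Both commands should produce the same output; the # delimiter avoids having to escape the / in the pattern.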
sed "s/<[ABC]\//</g" /path/to/filename
This works because it is the special case where each alternative is a single character, so a bracket expression is enough. It is not a real OR.
If you are limited to POSIX sed, you can use the following workaround.
Sample input for testing:
echo "<Pat1/ is pattern 2> <pat2/ is pattern 2>
<pAt3/ is pattern 3>
<pat4/ is pattern 4> but not avalaible for Pat1/ nor <pat2
" | \
The sed part
sed 's/²/²o/g
t myor
:myor
s/<Pat1\//²p/g;t treat
s/<pat2\//²p/g;t treat
s/<pAt3\//²p/g;t treat
b continu
: treat
s/²p/</g
t myor
: continu
s/²o/²/g
'
This uses a temporary character (²) as a generic marker, and a series of s commands, each followed by a t (test) branch, to provide the OR behaviour.

Replace string within a file from a bash script

I need to replace within a little bash script a string inside a file but... I am getting weird results.
Let's say I want to replace:
<tag><![CDATA[text]]></tag>
With:
<tag><![CDATA[replaced_text]]></tag>
Should I use sed? I think due to / and [ ] I am getting weird results...
What would be the best way of approaching this?
Perl with the -p option works almost like sed, and its regexes have a \Q (quote) escape that turns off the special meaning of the characters that follow:
perl -pe 's{\Q<tag><![CDATA[text]]></tag>}
{<tag><![CDATA[replaced_text]]></tag>}' YOUR_FILE
And in Perl you can use different punctuation to delimit your expressions (s{...}{...} in my example).
Yes, you need to escape the brackets, and either escape slashes or use different delimiters.
sed 's,<tag><!\[CDATA\[text\]\]></tag>,<tag><![CDATA[replaced_text]]></tag>,'
That said, SGML and XML are not actually any better than HTML when it comes to using regexes; don't expect this to generalize.
This should be enough:
$ echo '<tag><![CDATA[text]]></tag>' | sed 's/\[text\]/\[replaced_text\]/'
<tag><![CDATA[replaced_text]]></tag>
You can also change your / separator inside sed to a different character like ,, | or %.
Just use a delimiter other than /, here I use #:
sed -i 's#<tag><!\[CDATA\[text\]\]></tag>#<tag><![CDATA[replaced_text]]></tag>#g' filename
-i to have sed change the file instead of printing it out.
g is for matching more than once (global).
But do you know the exact string you want to match, both the tag and the text?
For instance, if you want to replace the text in every such tag, whatever it currently contains, with your replaced_text:
perl -i -pe 's#(<tag><!\[CDATA\[)(.*?)(\]\]></tag>)#${1}replaced_text${3}#g' filename
Switched to perl because sed doesn't support non-greedy quantifiers (the *?).
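For a quick test without touching a file, the sed variant with a # delimiter can be run on a single line. The second command is an alternative not taken from the answers: it replaces the non-greedy .*? with the negated class [^]]*, which works in sed only as long as the CDATA text contains no ] character. This sketch assumes GNU sed, and the variable names are illustrative.

```shell
# Sample line from the question.
input='<tag><![CDATA[text]]></tag>'

# Exact-string replacement: only the pattern side needs \[ and \].
exact=$(printf '%s\n' "$input" |
  sed 's#<tag><!\[CDATA\[text\]\]></tag>#<tag><![CDATA[replaced_text]]></tag>#')

# Any-text replacement: [^]]* stops at the first ], standing in for .*?
generic=$(printf '%s\n' "$input" |
  sed 's#\(<tag><!\[CDATA\[\)[^]]*\(\]\]></tag>\)#\1replaced_text\2#g')

printf '%s\n' "$exact" "$generic"
```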
