Replace string between words multiple times in a file - linux

I am trying to replace the text between two strings in a file with the command below. There could be any number of such patterns in the file. This is just an example.
sed 's/word1.*word2/word1/' 1.txt
There are two instances where 'word1' followed by 'word2' occurs in the sample source file I'm testing. Content of the 1.txt file
word1---sjdkkdkjdk---word2 I want this text----word1---jhfnkfnsjkdnf----word2 I need this also
Result is as below.
word1 I need this also
Expected Output :
word1 I want this text----word1 I need this also
Can anybody help me with this please?
I looked at other Stack Overflow questions, but they only discuss replacing a single instance of the pattern.

Regular expression quantifiers are greedy by default: they match the longest possible string, so here the match runs from the first 'word1' to the last 'word2'. GNU sed has no non-greedy quantifier, and I am not sure any other version of sed does. Perl does, though, so you could just use that:
perl -pe 's/word1.*?word2/word1/g' 1.txt
should do the trick. That ? changes the meaning of the prior * from 'match as many times as possible as long as the rest of the pattern matches' to 'match as few times as possible as long as the rest of the pattern matches'.

$ sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/word1/{/g; s/word2/}/g; s/{[^{}]*}/word1/g; s/}/word2/g; s/{/word1/g; s/#C/}/g; s/#B/{/g; s/#A/#/g' file
word1 I want this text----word1 I need this also
It's lengthy and looks complicated but it's a technique that is used fairly often and is really just a series of simple steps to robustly convert word1 to { and word2 to } so you're dealing with characters instead of strings in the actual substitution s/{[^{}]*}/word1/g and so can use a negated bracket expression to avoid the greedy regexp taking up too much of the line.
See https://stackoverflow.com/a/35708616/1745001 for more info on the general approach used here to be able to turn strings into characters that cannot be present in the input by the time the real work takes place and then restore them again afterwards.
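Splitting the one-liner into its three phases may make it easier to follow. This is only a sketch of the same substitutions run one phase at a time on the sample line; the variable names (encoded, replaced, decoded) are made up for illustration.

```shell
line='word1---sjdkkdkjdk---word2 I want this text----word1---jhfnkfnsjkdnf----word2 I need this also'

# Phase 1: escape any literal #, { and } already in the input,
# then turn the two words into the now-unused characters { and }
encoded=$(printf '%s\n' "$line" |
  sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/word1/{/g; s/word2/}/g')
printf '%s\n' "$encoded"
# {---sjdkkdkjdk---} I want this text----{---jhfnkfnsjkdnf----} I need this also

# Phase 2: the real work; [^{}]* cannot run past a closing }, so greediness is harmless
replaced=$(printf '%s\n' "$encoded" | sed 's/{[^{}]*}/word1/g')

# Phase 3: map any leftover } and { back to the words, then undo the escaping
decoded=$(printf '%s\n' "$replaced" |
  sed 's/}/word2/g; s/{/word1/g; s/#C/}/g; s/#B/{/g; s/#A/#/g')
printf '%s\n' "$decoded"
# word1 I want this text----word1 I need this also
```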

If you only have two instances of the word1-word2 pattern on a line, this should work:
sed 's/\(word1\).*word2\(.*\)\(word1\).*word2\(.*\)/\1\2\3\4/' 1.txt
I grab the parts we want to keep inside escaped parentheses \( and \), then I can refer to those parts as \1, \2 and so on.

Related

Finding and replacing text within a file

I have a large taxonomy file that I need to edit. There is an issue with the file as "Candida" is listed as both Candida and [Candida]. What I want to do is change every case of [Candida] to Candida within the file.
I have tried doing this several ways but never get the output I am after. This is the first few lines of the taxonomy file:
Penicillium;marneffei;NW_002197112.1
Penicillium;marneffei;NW_002197111.1
Penicillium;marneffei;NW_002197110.1
Penicillium;marneffei;NW_002197109.1
Penicillium;marneffei;NW_002197108.1
Using sed gives me this output:
$ sed -i -e 's/[Candida]/Candida/g' Full_HMS_Taxonomy.txt
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197112.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197111.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197110.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197109.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197108.1
Using awk gives me this output:
$ awk '{gsub(/[Candida]/,"Candida")}1' Full_HMS_Taxonomy.txt
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197112.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197111.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197110.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197109.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197108.1
In both cases it is adding Candida to multiple places and multiple lines, instead of just replacing each instance of [Candida]. Any ideas on what I am doing wrong?
[ and ] are special characters in regular expressions, so you should escape them like this:
's/\[Candida\]/Candida/g'
Brackets are treated specially by regular expression parsers: a bracket expression matches any single character listed inside it. So, [Candida] matches any one of the characters C, a, n, d or i, wherever it occurs. That's why you get a lot of substitutions.
You need to tell those utilities that you want literal brackets by escaping them with backslashes, e.g. with sed:
sed -i 's/\[Candida\]/Candida/g' Full_HMS_Taxonomy.txt
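A quick way to check this without touching the real file is to pipe a single line through sed. In this sketch the replacement X and the line '[Candida];albicans' are made up for illustration and are not taken from the taxonomy file.

```shell
# Unescaped: [Candida] is a bracket expression matching any ONE of C, a, n, d, i
unescaped=$(printf '%s\n' 'Penicillium' | sed 's/[Candida]/X/g')
printf '%s\n' "$unescaped"
# PeXXcXllXum

# Escaped: \[ and \] are literal brackets, so only the text [Candida] is replaced
escaped=$(printf '%s\n' '[Candida];albicans' | sed 's/\[Candida\]/Candida/g')
printf '%s\n' "$escaped"
# Candida;albicans
```

Once the output looks right, the same expression can be used with -i on the real file.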

SED: insert a word/string between two patterns in the SAME LINE

I've searched all over stackoverflow (perhaps I just suck at searching) but I cannot find the answer to my problem. I'm trying to insert a word or a string in between two patterns in the same line using sed.
I know how to insert a word AFTER a searched pattern using
sed -e "s/pattern/& new_word/g"
with an ampersand (&).
But this command inserts 'new_word' in every occurrence of searched pattern so I'm trying to specify it so that it inserts 'new_word' only in between two patterns.
For example,
Some words = [want to insert words here];
How do I insert it between "Some words (multiple whitespaces here) =" and ";"?
What is the syntax for this kind of command? Also, what resources do you guys use to learn sed? Many of the sed tutorials that I've found are very basic and don't go into detail on the different options and flags.
Thank you.
Use capture groups.
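sed -E 's/(pattern1)(pattern2)/\1new_word\2/'
\1 is replaced with whatever matched the first pattern, \2 gets whatever matched the second pattern. The -E option turns on extended regular expressions, where plain parentheses form capture groups; without it, sed expects the groups to be written as \( and \).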
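Applied to the example from the question, this could look like the sketch below. The exact patterns are an assumption about what the real lines look like: "Some words", any amount of whitespace, "=", then ";".

```shell
line='Some words    = ;'

# Group 1 captures everything up to and including "=", group 2 captures ";".
# Whitespace between the groups is matched but not kept, then new_word is put in its place.
result=$(printf '%s\n' "$line" |
  sed -E 's/(Some words[[:space:]]*=)[[:space:]]*(;)/\1 new_word\2/')

printf '%s\n' "$result"
# Some words    = new_word;
```

Because both patterns must match on the same line, occurrences of either one on its own are left untouched.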

extract first instance per line (maybe grep?)

I want to extract the first instance of a string per line in linux. I am currently trying grep but it yields all the instances per line. Below I want the strings (numbers and letters) after "tn="...but only the first set per line. The actual characters could be any combination of numbers or letters. And there is a space after them. There is also a space before the tn=
Given the following file:
hello my name is dog tn=12g3 fun 23k3 hello tn=1d3i9 cheese 234kd dks2 tn=6k4k ksk
1263 chairs are good tn=k38493kd cars run vroom it95958 tn=k22djd fair gold tn=293838 tounge
Desired output:
12g3
k38493
Here's one way you can do it if you have GNU grep, which (mostly) supports Perl Compatible Regular Expressions with -P. Also, the non-standard switch -o is used to only print the part matching the pattern, rather than the whole line:
grep -Po '^.*?tn=\K\S+' file
The pattern matches the start of the line ^, followed by any characters .*?, where the ? makes the match non-greedy. After the first match of tn=, \K "kills" the previous part so you're only left with the bit you're interested in: one or more non-space characters \S+.
As in Ed's answer, you may wish to add a space before tn to avoid accidentally matching something like footn=.... You might also prefer to use something like \w to match "word" characters (equivalent to [[:alnum:]_]).
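If grep -P is not available, a similar result can be sketched with plain sed. This is an alternative technique, not the grep approach above: the first substitution keeps the leftmost " tn=" token and drops the rest of the line, the second strips everything before the token. The function name first_tn is made up for illustration.

```shell
first_tn() {
  # 1st s///: the regex matches at the leftmost " tn=", keeps its token, drops what follows
  # 2nd s///: remove everything up to and including " tn=", print only if that succeeded
  sed -n 's/ tn=\([[:alnum:]_]*\).*/ tn=\1/; s/.* tn=//p' "$@"
}

output=$(printf '%s\n' \
  'hello my name is dog tn=12g3 fun 23k3 hello tn=1d3i9 cheese 234kd dks2 tn=6k4k ksk' \
  '1263 chairs are good tn=k38493kd cars run vroom it95958 tn=k22djd fair gold tn=293838 tounge' |
  first_tn)

printf '%s\n' "$output"
# 12g3
# k38493kd
```

Lines without " tn=" produce no output, because only the second substitution carries the p flag.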
Just split the input using tn= as the field separator and pick the second field. Then split that field on spaces and take the first piece:
$ awk -F"tn=" '{split($2,a, " "); print a[1]}' file
12g3
k38493kd
$ awk 'match($0,/ tn=[[:alnum:]]+/) {print substr($0,RSTART+4,RLENGTH-4)}' file
12g3
k38493kd

Detect repeated characters using grep

I'm trying to write a grep (or egrep) command that will find and print any lines in "words.txt" which contain the same lower-case letter three times in a row. The three occurrences of the letter may appear consecutively (as in "mooo") or separated by one or more spaces (as in "x x x") but not separated by any other characters.
words.txt contains:
The monster said "grrr"!
He lived in an igloo only in the winter.
He looked like an aardvark.
Here's what I think the command should look like:
grep -E '\b[^ ]*[[:alpha:]]{3}[^ ]*\b' 'words.txt'
I know this is wrong, but I don't know enough of the syntax to figure it out. Could someone please help me do this with grep?
Does this work for you?
grep '\([[:lower:]]\) *\1 *\1'
It takes a lowercase character [[:lower:]] and remembers it with \( ... \). It then tries to match any number of spaces (zero included), the remembered character \1, any number of spaces again, and the remembered character once more. And that's it.
You can try running it with --color=auto to see what parts of the input it matched.
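Here is a sketch of that command run against the three sample lines from the question, fed through a pipe instead of words.txt:

```shell
matches=$(printf '%s\n' \
  'The monster said "grrr"!' \
  'He lived in an igloo only in the winter.' \
  'He looked like an aardvark.' |
  grep '\([[:lower:]]\) *\1 *\1')

printf '%s\n' "$matches"
# The monster said "grrr"!
# He lived in an igloo only in the winter.
```

The first line matches on "rrr". The second matches on "oo o" in "igloo only", since spaces between the repeats are allowed. The third has no letter repeated three times.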
Try this. Note that this will not match "mooo", as the word boundary (\b) occurs before the "m".
grep -E '\b([[:alpha:]]) *\1 *\1 *\b' words.txt
[:alpha:] is the name of a character class. To use it in a regex it has to appear inside a bracket expression, hence the extra brackets: [[:alpha:]]. You may have already known this, as it looks like you started to do it, but left the open bracket unclosed.

sed regex with variables to replace numbers in a file

I'm trying to replace numbers in my text file by adding one to them, i.e.
sed 's/3/4/g' path.txt
sed 's/2/3/g' path.txt
sed 's/1/2/g' path.txt
Instead of this, can I automate it, i.e. find a \d and add one to it in the replacement?
Something like
sed 's/\([0-8]\)/\1+1/g' path.txt
I also want to capture more than one digit, i.e. ([0-9])\t([0-9]), and change each one while keeping the tab in between.
Thanks
edited #2
Using the perl example,
I also would like it to work with more digits i.e.
perl -pi~ -e 's/(\d+)\.(\d+)\.(\d+)\.(\d+)/ ($1+1)\.($2+1)\.($3+1)\.($4+1) /ge' output.txt
Any tips on making the above work?
There is no support for arithmetic in sed, but you can easily do this in Perl.
perl -pe 's/(\d+)/ $1+1 /ge'
With the /e option, the replacement expression needs to be valid Perl code. So to handle your final updated example, you need
perl -pi~ -e 's/(\d+)\.(\d+)\.(\d+)\.(\d+)/ ($1+1) . "." . ($2+1) . "." . ($3+1) . "." . ($4+1) /ge'
where strings are properly quoted and adjacent values are joined with ., Perl's string concatenation operator. The parentheses around each sum are needed because + and . have the same precedence and are evaluated left to right. (The numbers are coerced into strings when they are concatenated with a string.)
... Though of course, the first script already does that more elegantly, since with the /g flag it already increments every sequence of digits by one, anywhere in the string.
Triplee's perl solution is the more generic answer, but Michal's sed solution works well for this particular case. However, Michal's sed solution is more easily written:
sed y/12345678/23456789/ path.txt
and is better implemented as
tr 12345678 23456789 < path.txt
This utterly fails to handle 2 digit numbers (as in the edited question).
You can do it with sed but it's not easy, see this thread.
And it's hard with awk too, see this.
I'd rather use perl for this (something like this can be seen in action on ideone):
perl -pe 's/([0-8])/$1+1/e'
(The ideone.com example needs an explicit loop, as ideone does not set -pe by default.)
You can't do addition directly in sed. You could do it in awk by matching numbers using a regex in each line and increasing the value, but it's quite complicated. If you do not need to handle arbitrary numbers but only a limited set, like single-digit numbers from 0 to 8, you can just put several replacement commands on a single sed command line by separating them with semicolons:
sed 's/8/9/g ; s/7/8/g; s/6/7/g; s/5/6/g; s/4/5/g; s/3/4/g; s/2/3/g; s/1/2/g; s/0/1/g' path.txt
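Note that the order of the substitutions matters here: they run one after another on the same line, so going from the highest digit down keeps a freshly written digit from being incremented again. A small sketch on the made-up input 123:

```shell
# Ascending order: the 2 produced by s/1/2/ is picked up again by s/2/3/, and so on
ascending=$(echo 123 | sed 's/1/2/g; s/2/3/g; s/3/4/g')

# Descending order: each digit is incremented exactly once
descending=$(echo 123 | sed 's/3/4/g; s/2/3/g; s/1/2/g')

printf 'ascending:  %s\ndescending: %s\n' "$ascending" "$descending"
# ascending:  444
# descending: 234
```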
This might work for you (GNU sed & Bash):
sed 's/[0-9]/$((&+1))/g;s/.*/echo "&"/e' file
The command above adds one to every individual digit. To increment whole numbers instead:
sed 's/[0-9]\+/$((&+1))/g;s/.*/echo "&"/e' file
N.B. This method is fraught with problems and may cause unexpected results.
