Finding and replacing text within a file - text

I have a large taxonomy file that I need to edit. There is an issue with the file as "Candida" is listed as both Candida and [Candida]. What I want to do is change every case of [Candida] to Candida within the file.
I have tried doing this several ways but never get the output I am after. This is the first few lines of the taxonomy file:
Penicillium;marneffei;NW_002197112.1
Penicillium;marneffei;NW_002197111.1
Penicillium;marneffei;NW_002197110.1
Penicillium;marneffei;NW_002197109.1
Penicillium;marneffei;NW_002197108.1
Using sed gives me this output:
$ sed -i -e 's/[Candida]/Candida/g' Full_HMS_Taxonomy.txt
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197112.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197111.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197110.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197109.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197108.1
Using awk gives me this output:
$ awk '{gsub(/[Candida]/,"Candida")}1' Full_HMS_Taxonomy.txt
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197112.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197111.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197110.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197109.1
PeCandidaCandidacCandidallCandidaum;mCandidarCandidaeffeCandida;NW_002197108.1
In both cases it is adding Candida to multiple places and multiple lines, instead of just replacing each instance of [Candida]. Any ideas on what I am doing wrong?

[ and ] are special characters in regular expressions, so you need to escape them like this:
's/\[Candida\]/Candida/g'

Square brackets are special in regular expressions: they form a bracket expression that matches any single character listed inside them. So [Candida] matches any one of the characters C, a, n, d or i, wherever it occurs. That's why you get so many substitutions.
You need to tell those utilities that you want literal brackets by escaping them with backslashes, e.g. with sed:
sed -i 's/\[Candida\]/Candida/g' Full_HMS_Taxonomy.txt
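A quick way to see the difference is to run both patterns on a couple of made-up lines. The sample records below are hypothetical (they are not from the real taxonomy file), and nothing is edited in place because the input comes from printf:

```shell
# Two made-up taxonomy lines; the accession numbers are placeholders.
sample=$(printf '%s\n' '[Candida];albicans;XX_000001.1' 'Candida;glabrata;XX_000002.1')

# Unescaped: [Candida] is a bracket expression, so every C, a, n, d or i is replaced.
wrong=$(printf '%s\n' "$sample" | sed 's/[Candida]/Candida/g')

# Escaped: \[ and \] match literal brackets, so only the string "[Candida]" is replaced.
fixed=$(printf '%s\n' "$sample" | sed 's/\[Candida\]/Candida/g')

printf '%s\n' "$fixed"
```

Once the output looks right on a sample, the same expression can be used with -i on the real file.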

Related

Wildcard sed search/remove within other text in the same line

I'm trying to remove a matching string with partial wildcards using sed, and the searches I've done for answers on this site either don't seem to apply or I can't convert them to my situation.
Below is the string of text I need to remove:
www.foo.com.cp123.bar.com
It is in a file with other entries on the same line. The line that has my entries always starts with serveralias:, however, as below:
serveralias: www.domain.com mail.domain.com www.foo.com.cp123.bar.com domain.com
I can identify what I need to remove via the 'cp123.bar.com' text as that always stays the same. It's the preceding 'www.foo.com' that changes. It can appear just once or multiple times within the line, but it will always end in 'cp123.bar.com'. I've tried the following two commands based on my research:
sed 's/\ .*cp123.bar.com\ //g' file.txt
sed 's/\ [^:]+$cp123.bar.com\ //g' file.txt
I'm using the spaces between each entry as the start and stop point for the find/replace(delete), but that's a band-aid and not always going to work since the entry I need to delete is occasionally at the end of the line (without a space afterward). If I don't include the spaces, though, everything gets removed since I'm using wildcards, including the www.domain.com, mail.domain.com, etc. text I need to keep there. Running either of the sed commands above doesn't do anything, just prints what's currently in the file.
Any ideas on what I need to change? I'm happy to clarify anything if need be.
GNU sed needs the -r flag (or -E, which BSD sed also accepts) to use extended regular expressions. Without it, + is treated as a literal character rather than as "one or more". Thus,
sed -r 's/ +[^ ]+\.cp123\.bar\.com//g'
will do what you want. It removes the following substrings:
one or more spaces
followed by one or more non-space characters
followed by .cp123.bar.com
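As a rough check, the expression can be tried on the sample line from the question and on a second, hypothetical line where the unwanted entry is the last one. The helper name strip_cp is made up for this sketch, and -E is used in place of -r:

```shell
# Sample line from the question, plus a hypothetical one where the unwanted entry comes last.
line1='serveralias: www.domain.com mail.domain.com www.foo.com.cp123.bar.com domain.com'
line2='serveralias: www.domain.com www.other.org.cp123.bar.com'

# -E enables extended regular expressions (GNU sed treats it the same as -r).
strip_cp() { sed -E 's/ +[^ ]+\.cp123\.bar\.com//g'; }

out1=$(printf '%s\n' "$line1" | strip_cp)
out2=$(printf '%s\n' "$line2" | strip_cp)
printf '%s\n%s\n' "$out1" "$out2"
```

Because the pattern consumes only the leading spaces, it does not depend on a space following the entry, so the end-of-line case is handled too.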

Replace string between words multiple times in a file

I am trying to replace string between two strings in a file with the command below. There could be any number of such patterns in the file. This is just an example.
sed 's/word1.*word2/word1/' 1.txt
There are two instances where 'word1' followed by 'word2' occurs in the sample source file I'm testing. Content of the 1.txt file
word1---sjdkkdkjdk---word2 I want this text----word1---jhfnkfnsjkdnf----word2 I need this also
Result is as below.
word1 I need this also
Expected Output :
word1 I want this text----word1 I need this also
Can anybody help me with this please?
I looked at other stack-overflow questionnaire but they discuss about replacing only one instance of the pattern.
Regular expressions are greedy by default: they match the longest possible string, so here everything from the first 'word1' to the last 'word2'. Neither POSIX nor GNU sed supports non-greedy quantifiers, so you could just use perl, which does:
perl -pe 's/word1.*?word2/word1/g' 1.txt
should do the trick. That ? changes the meaning of the prior * from 'match as many times as possible as long as the rest of the pattern matches' to 'match as few times as possible as long as the rest of the pattern matches'.
$ sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/word1/{/g; s/word2/}/g; s/{[^{}]*}/word1/g; s/}/word2/g; s/{/word1/g; s/#C/}/g; s/#B/{/g; s/#A/#/g' file
word1 I want this text----word1 I need this also
It's lengthy and looks complicated but it's a technique that is used fairly often and is really just a series of simple steps to robustly convert word1 to { and word2 to } so you're dealing with characters instead of strings in the actual substitution s/{[^{}]*}/word1/g and so can use a negated bracket expression to avoid the greedy regexp taking up too much of the line.
See https://stackoverflow.com/a/35708616/1745001 for more info on the general approach used here to be able to turn strings into characters that cannot be present in the input by the time the real work takes place and then restore them again afterwards.
If you only have two instances of the word1-word2 pattern on a line, this should work:
sed 's/\(word1\).*word2\(.*\)\(word1\).*word2\(.*\)/\1\2\3\4/' 1.txt
I grab the parts we want to keep inside escaped parentheses, \( and \), and can then refer to those parts as \1, \2 and so on.

sed is matching passed variable subsets, not exact matches

I'm partially successfully using sed to replace variables in a text file. I'm stuck on an exception.
A script reads input from a list - say the $roll_symbol is C20.
sed replaces C20, GC20, and KC20 (because C20 matches part of the string).
I searched the web and tried the variations I found - no success.
I tried these variations without success:
escape the reserved character $
escape braces
escape both
use double quotes instead of single quotes.
*the best version so far (but only partially):
sed -i 's/'${roll_symbol}'/'${roll_symbol}\,${contract_month}'/g' $OUTPUT_DIRECTORY/$OUTPUT_FILE;
You need to tell sed which characters are legal before the start of your match to limit where it can match. To match only at the start of a word, try \< (a GNU sed extension).
sed -i "s/\<${roll_symbol}/${roll_symbol},${contract_month}/g" "$OUTPUT_DIRECTORY/$OUTPUT_FILE";
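A small sketch of the effect, assuming GNU sed. The value H24 for contract_month and the extra C200 token are made up for the demonstration, and the closing \> anchor is an addition that is not in the command above:

```shell
# Hypothetical values; in the question these come from the script's input list.
roll_symbol='C20'
contract_month='H24'

# \< anchors at the start of a word, so GC20 and KC20 are skipped.
# \> anchors at the end of a word, which also keeps C20 from matching inside C200.
result=$(printf '%s\n' 'C20 GC20 KC20 C200' |
  sed "s/\<${roll_symbol}\>/${roll_symbol},${contract_month}/g")

printf '%s\n' "$result"
```

If a symbol could ever contain characters that are special in regular expressions (such as . or *), it would need escaping before being interpolated into the pattern.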

sed regex with variables to replace numbers in a file

I'm trying to replace numbers in my text file by adding one to them, i.e.
sed 's/3/4/g' path.txt
sed 's/2/3/g' path.txt
sed 's/1/2/g' path.txt
Instead of this, can I automate it, i.e. find a \d and add one to it in the replacement?
Something like
sed 's/\([0-8]\)/\1+1/g' path.txt
I also want to capture more than one digit, i.e. ([0-9])\t([0-9]), and change each one while keeping the tab in between.
Thanks
edited #2
Using the perl example,
I also would like it to work with more digits i.e.
perl -pi~ -e 's/(\d+)\.(\d+)\.(\d+)\.(\d+)/ ($1+1)\.($2+1)\.($3+1)\.($4+1) /ge' output.txt
Any tips on making the above work?
There is no support for arithmetic in sed, but you can easily do this in Perl.
perl -pe 's/(\d+)/ $1+1 /ge'
With the /e option, the replacement expression needs to be valid Perl code. So to handle your final updated example, you need
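perl -pi~ -e 's/(\d+)\.(\d+)\.(\d+)\.(\d+)/ ($1+1) . "." . ($2+1) . "." . ($3+1) . "." . ($4+1) /ge' output.txt
where each addition is parenthesized (+ and . have the same precedence in Perl, so without the parentheses the expression is evaluated left to right and gives the wrong result), strings are properly quoted, and adjacent values are joined with the . Perl string concatenation operator. (The numbers are coerced into strings as well when they are concatenated with a string.)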
... Though of course, the first script already does that more elegantly, since with the /g flag it already increments every sequence of digits with one, anywhere in the string.
Triplee's perl solution is the more generic answer, but Michal's sed solution works well for this particular case. However, Michal's sed solution is more easily written:
sed y/12345678/23456789/ path.txt
and is better implemented as
tr 12345678 23456789 < path.txt
This utterly fails to handle 2 digit numbers (as in the edited question).
You can do it with sed but it's not easy, see this thread.
And it's hard with awk too, see this.
I'd rather use perl for this (something like this can be seen in action at ideone):
perl -pe 's/([0-8])/$1+1/e'
(The ideone.com example needs an explicit loop because ideone does not set -pe by default.)
You can't do addition directly in sed. You could do it in awk by matching numbers with a regex in each line and increasing the value, but it's quite complicated. If you do not need to handle arbitrary numbers but only a limited set, like single-digit numbers from 0 to 8, you can put several replacement commands on a single sed command line by separating them with semicolons:
sed 's/8/9/g ; s/7/8/g; s/6/7/g; s/5/6/g; s/4/5/g; s/3/4/g; s/2/3/g; s/1/2/g; s/0/1/g' path.txt
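For arbitrary multi-digit numbers, the awk approach mentioned above might look roughly like the following sketch. The function name increment_numbers and the sample input are made up here; it uses only match, substr, RSTART and RLENGTH, which are available in POSIX awk:

```shell
# Hypothetical helper: adds one to every run of digits read from standard input.
increment_numbers() {
  awk '{
    out = ""
    # Consume the line from left to right, one number at a time.
    while (match($0, /[0-9]+/)) {
      n = substr($0, RSTART, RLENGTH) + 1          # the matched number, plus one
      out = out substr($0, 1, RSTART - 1) n        # text before the match, then the new number
      $0 = substr($0, RSTART + RLENGTH)            # keep only the unprocessed remainder
    }
    print out $0
  }'
}

sample_out=$(printf '1\t29\nv10.9.99\n' | increment_numbers)
printf '%s\n' "$sample_out"
```

Note that this sketch treats numbers as plain integers, so leading zeros are dropped (007 becomes 8) and very large values may lose precision.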
This might work for you (GNU sed & Bash):
sed 's/[0-9]/$((&+1))/g;s/.*/echo "&"/e' file
The command above adds one to every individual digit. To increment whole numbers instead:
sed 's/[0-9]\+/$((&+1))/g;s/.*/echo "&"/e' file
N.B. This method is fraught with problems and may cause unexpected results.

Replacing comma on specific lines only

I have a dataset that is comma separated. But I have a little problem with its format. I want everything to be in the form x,x,x
Below is a sample of my dataset:
995970,16779453
995971,16828069
995972,
995973,16828069
995974,16827226
As you can see, most of my dataset is in the proper format but I have those commas on single id#'s also (my data is in form id#, connection#). How would I go about removing the commas on those single id#'s? I can't seem to figure it out just using a text editor. Any suggestions?
Edit: can I use some sort of regex expression to only remove it from those ids that have a specified length?
Edit2: Ok I figured it out using some regex, thanks for all the help!
In vi one would do something like
:%s/,$//
This means
: (enter a line mode command)
% (try the command on every line)
s (substitute)
,$ (match a comma at the end of a line)
(empty replacement text)
Sometimes you need something like /, *$/ to match a comma followed by 0 or more trailing spaces. You can get vi on Windows in various ways; one way is to install Cygwin.
You can select regular expression mode in Notepad++ and do a find and replace with the regex ,$ while leaving the replace field blank.
With the sed command:
sed 's/, *$//' < FILE
or in place (requires GNU sed):
sed -i -e 's/, *$//' FILE
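A short sketch on a few rows from the question shows why the end-of-line anchor matters: with $ in the pattern, only a comma (plus any trailing spaces) at the end of a line is removed, and the comma between id and connection is left alone. The variable names are only for the demonstration:

```shell
# Sample rows from the question.
data=$(printf '%s\n' '995970,16779453' '995971,16828069' '995972,' '995973,16828069')

# The $ anchor ties the match to the end of the line.
cleaned=$(printf '%s\n' "$data" | sed 's/, *$//')

printf '%s\n' "$cleaned"
```

Without the anchor, the first comma on every line would be deleted, merging the two columns on the well-formed rows.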

Resources