Removing Parts of String With Sed - linux

I have lines of data that look like this:
sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta
sp_A0A342_ATPB_COFAR_9_-_contigs_full.fasta
sp_A0A373_RK16_COFAR_10_-_contigs_full.fasta
sp_A0A373_RK16_COFAR_8_+_contigs_full.fasta
sp_A0A4W3_SPEA_GEOSL_15_-_contigs_full.fasta
How can I use sed to delete the part of each line after the 4th column (_ separated)?
Finally yielding:
sp_A0A342_ATPB_COFAR
sp_A0A342_ATPB_COFAR
sp_A0A373_RK16_COFAR
sp_A0A373_RK16_COFAR
sp_A0A4W3_SPEA_GEOSL

cut is a better fit.
cut -d_ -f 1-4 old_file
This simply means use _ as delimiter, and keep fields 1-4.
If you insist on sed:
sed 's/\(_[^_]*\)\{4\}$//'
This left hand side matches exactly four repetitions of a group, consisting of an underscore followed by 0 or more non-underscores. After that, we must be at the end of the line. This is all replaced by nothing.
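As a quick check that the two commands agree on the question's data, here is a minimal sketch. The function names keep_four_cut and keep_four_sed are made up for illustration; both read lines on standard input.

```shell
# Hypothetical helpers, not part of the original answer.
keep_four_cut() {
    # Keep the first four underscore-separated fields.
    cut -d_ -f1-4
}

keep_four_sed() {
    # Delete the last four "_field" groups, counted from the end of the line.
    sed 's/\(_[^_]*\)\{4\}$//'
}

sample='sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta
sp_A0A4W3_SPEA_GEOSL_15_-_contigs_full.fasta'

printf '%s\n' "$sample" | keep_four_cut
printf '%s\n' "$sample" | keep_four_sed
# Both pipelines print:
# sp_A0A342_ATPB_COFAR
# sp_A0A4W3_SPEA_GEOSL
```

Note that cut counts fields from the front while this sed expression counts from the back, so the two only agree when every line has exactly eight fields, as in the sample data.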

sed -e 's/\([^_]*\)_\([^_]*\)_\([^_]*\)_\([^_]*\)_.*/\1_\2_\3_\4/' infile > outfile
Match "any number of not '_'", saving what was matched between \( and \), followed by '_'. Do this 4 times, then match anything for the rest of the line (to be ignored). Substitute with each of the matches separated by '_'.

Here's another possibility:
sed -E -e 's|^([^_]+(_[^_]+){3}).*$|\1|'
where -E, like -r in GNU sed, turns on extended regular expressions for readability.
Just because you can do it in sed, though, doesn't mean you should. I like cut much much better for this.

AWK likes to play in the fields:
awk 'BEGIN{FS=OFS="_"}{print $1,$2,$3,$4}' inputfile
or, more generally:
awk -v count=4 'BEGIN{FS="_"}{sep="";for(i=1;i<=count;i++){printf "%s%s",sep,$i;sep=FS};printf "\n"}' inputfile

sed -e 's/_[0-9][0-9]*_[+-]_contigs_full\.fasta$//'
Still the cut answer is probably faster and just generally better.

Yes, cut is way better, and yes matching the back of each is easier.
I finally got a match using the beginning of each line:
sed -r 's/(([^_]*_){3}([^_]*)).*/\1/' oldFile > newFile

Related

Conditional replace using sed

My question is probably rather simple. I'm trying to replace sequences of strings that are at the beginning of lines in a file. For example, I would like to replace any instance of the pattern "GN" with "N" or "WR" with "R", but only if they are the first 2 characters of that line. For example, if I had a file with the following content:
WRONG
RIGHT
GNOME
I would like to transform this file to give
RONG
RIGHT
NOME
I know I can use the following to replace any instance of the above example:
sed -i 's/GN/N/g' file.txt
sed -i 's/WR/R/g' file.txt
The issue is that I want this to happen only if the above patterns are the first 2 characters in any given line. Possibly an IF statement, although I'm not sure what the condition would look like. Any pointers in the right direction would be much appreciated, thanks.
Just add the circumflex (^) and remove the g suffix (unnecessary, since you want at most one replacement per line). You can also combine them in one script.
sed -i 's/^GN/N/;s/^WR/R/' file.txt
Use the start-of-string regexp anchor ^:
sed -i 's/^GN/N/' file.txt
sed -i 's/^WR/R/' file.txt
Since sed is line-oriented, start-of-string == start-of-line.
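To see the anchor at work, here is a small sketch using the question's sample words plus one extra line, SIGN, added here to show that a non-leading GN is left alone. The function name strip_prefixes is hypothetical.

```shell
# strip_prefixes is a made-up name; it reads lines on stdin.
strip_prefixes() {
    # ^ anchors each pattern to the start of the line, so only a leading
    # GN or WR is rewritten; occurrences later in the line are untouched.
    sed 's/^GN/N/;s/^WR/R/'
}

printf '%s\n' WRONG RIGHT GNOME SIGN | strip_prefixes
# prints:
# RONG
# RIGHT
# NOME
# SIGN
```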

A good way to use sed to find and replace characters with 2 delimiters

I am trying to find and replace items using bash. I was able to use sed to grab out some of the characters, but I think I might be using it in the wrong manner.
I am basically trying to remove the characters after ";" and before "," including removing ","
sed -e 's/\(;\).*\(,\)/\1\2/'
That is what I used to replace it with nothing. However, it ends up replacing everything in the middle so my output came out like this:
cmd2="BMC,./socflash_x64 if=B600G3_BMC_V0207.ima;,reboot -f"
This is the original text of what I need to replace
cmd2="BMC,./socflash_x64 if=B600G3_BMC_V0207.ima;X,sleep 120;after_BMC,./run-after-bmc-update.sh;hba_fw,./hba_fw.sh;X,sleep 5;DB,2;X,reboot -f"
Is there any way to make it look like this output?
./socflash_x64 if=B600G3_BMC_V0207.ima;sleep 120;./run-after-bmc-update.sh;./hba_fw.sh;sleep 5;reboot -f
If there is any way to make this happen other than bash I am fine with any type of language.
Non-greedy search can (mostly) be simulated in programs that don't support it by replacing match-any (dot .) with a negated character class.
Your original command is
sed -e 's/\(;\).*\(,\)/\1\2/'
You want to match everything in between the semi-colon and the comma, but not another comma (non-greedy). Replace .* with [^,]*
sed -e 's/\(;\)[^,]*\(,\)/\1\2/'
You may also want to exclude semi-colons themselves, making the expression
sed -e 's/\(;\)[^,;]*\(,\)/\1\2/'
Note this would treat a string like "asdf;zxcv;1234,qwer" differently, since one would match ;zxcv;1234, and the other would match only ;1234,
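The difference can be checked directly on that string. This is only an illustrative sketch; the two function names are invented for the example.

```shell
# Hypothetical wrappers around the two expressions above.
drop_to_comma() {
    # Excludes only commas: the match starts at the first semicolon.
    sed 's/\(;\)[^,]*\(,\)/\1\2/'
}

drop_to_comma_strict() {
    # Excludes commas and semicolons: the match starts at the last
    # semicolon before the comma.
    sed 's/\(;\)[^,;]*\(,\)/\1\2/'
}

s='asdf;zxcv;1234,qwer'
printf '%s\n' "$s" | drop_to_comma         # prints asdf;,qwer
printf '%s\n' "$s" | drop_to_comma_strict  # prints asdf;zxcv;,qwer
```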
In perl:
perl -pe 's/;.*?,/;/g;' -pe 's/^[^,]*,//' foo.txt
will output:
./socflash_x64 if=B600G3_BMC_V0207.ima;sleep 120;./run-after-bmc-update.sh;./hba_fw.sh;sleep 5;2;reboot -f
The .*? is non greedy matching before the comma. The second command is to remove from the beginning to the comma.
Something like:
echo "$cmd2" | tr ';' '\n' | cut -d',' -f2- | tr '\n' ';' ; echo
result is:
./socflash_x64 if=B600G3_BMC_V0207.ima;sleep 120;./run-after-bmc-update.sh;./hba_fw.sh;sleep 5;2;reboot -f;
However, I think your requirements are a bit more complex, because 'DB,2' seems to be a special case. After the "tr" command, insert a "grep" or "grep -v" to include/exclude these cases.
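A sketch of that pipeline with the filter in place might look like this. It assumes that entries labelled DB should be dropped entirely, which is a guess based on the desired output in the question; strip_labels is a hypothetical name, and paste is swapped in for the final tr so that no trailing semicolon is produced.

```shell
cmd2='BMC,./socflash_x64 if=B600G3_BMC_V0207.ima;X,sleep 120;after_BMC,./run-after-bmc-update.sh;hba_fw,./hba_fw.sh;X,sleep 5;DB,2;X,reboot -f'

# strip_labels is a made-up helper; it takes the string as its first argument.
strip_labels() {
    printf '%s\n' "$1" |
        tr ';' '\n' |      # one "label,command" entry per line
        grep -v '^DB,' |   # assumption: DB entries are removed entirely
        cut -d, -f2- |     # keep everything after the first comma
        paste -sd';' -     # join the lines back together with semicolons
}

strip_labels "$cmd2"
# prints:
# ./socflash_x64 if=B600G3_BMC_V0207.ima;sleep 120;./run-after-bmc-update.sh;./hba_fw.sh;sleep 5;reboot -f
```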

How to specify an "or" in sed

I have a file having data in the following form
<A/Here> <A/There>
<B/SomeMoreDate> <C/SomeOtherDate>
Now I want to delete all the A,B,C from the file in an efficient way. I know I can use sed for one pattern
sed -i 's/A//g' /path/to/filename
But how do I specify an "or" so that sed deletes all the patterns?
The expected output is:
<Here> <There>
<SomeMoreDate> <SomeOtherDate>
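You can use sed -i 's/[ABC]//g' /path/to/filename. [ABC] will match either A or B or C. Note that this leaves the slash behind; to remove it as well, as in the expected output, use 's/[ABC]\///g'. You may find this reference useful.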
If you're using GNU sed, you can say:
sed -r 's#(A|B|C)/##g' filename
The following should work otherwise:
sed 's#A/##g;s#B/##g;s#C/##g' filename
Ivaylo Strandjev's answer is correct in that it solves the problem when wanting to match single characters. There is a way though to have or when matching longer strings.
s/\(\(stringA\)\|\(stringB\)\|\(stringC\)\)something/something else/
You can try with something like:
echo stringBsomething | sed -e 's/\(stringA\|stringB\|stringC\)something/something else/'
It is sad that sed requires all these backslashes. Some of this is avoided if you use -r.
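For comparison, here is a sketch of the same alternation written with extended regular expressions. It assumes a sed that accepts -E (GNU sed and BSD sed both do; GNU sed treats it as a synonym for -r). The function names are invented for the example.

```shell
# Hypothetical helpers; both read lines on stdin.
replace_any_string() {
    # With -E, grouping and alternation need no backslashes.
    sed -E 's/(stringA|stringB|stringC)something/something else/'
}

strip_tag_prefix() {
    # Applied to the question's data: drop "A/", "B/" or "C/" after "<".
    sed -E 's#<(A|B|C)/#<#g'
}

echo stringBsomething | replace_any_string
# prints: something else

printf '%s\n' '<A/Here> <A/There>' '<B/SomeMoreDate> <C/SomeOtherDate>' |
    strip_tag_prefix
# prints:
# <Here> <There>
# <SomeMoreDate> <SomeOtherDate>
```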
sed "s/<[ABC]\//</g" /path/to/filename
This works because only a single character varies in the pattern, so a bracket expression is enough. It is not a real OR.
If you are limited to POSIX sed, you can use this workaround.
Sample for test purpose
echo "<Pat1/ is pattern 2> <pat2/ is pattern 2>
<pAt3/ is pattern 3>
<pat4/ is pattern 4> but not avalaible for Pat1/ nor <pat2
" | \
The sed part
sed 's/²/²o/g
t myor
:myor
s/<Pat1\//²p/g;t treat
s/<pat2\//²p/g;t treat
s/<pAt3\//²p/g;t treat
b continu
: treat
s/²p/</g
t myor
: continu
s/²o/²/g
'
This uses a temporary character, "²", as a generic placeholder, and a series of s/ commands, each followed by a test branch (t), to provide the OR functionality.

How to display the first word of each line in my file using the linux commands?

I have a file containing many lines, and I want to display only the first word of each line with the Linux commands.
How can I do that?
You can use awk:
awk '{print $1}' your_file
This will "print" the first column ($1) in your_file.
Try doing this using grep :
grep -Eo '^[^ ]+' file
Try doing this with coreutils cut:
cut -d' ' -f1 file
I see there are already answers. But you can also do this with sed:
sed 's/ .*//' fileName
The above solutions seem to fit your specific case. For a more general application of your question, consider that words are generally defined as being separated by whitespace, but not necessarily space characters specifically. Columns in your file may be tab-separated, for example, or even separated by a mixture of tabs and spaces.
The previous examples are all useful for finding space-separated words, while only the awk example also finds words separated by other whitespace characters (and in fact this turns out to be rather difficult to do uniformly across various sed/grep versions). You may also want to explicitly skip empty lines, by amending the awk statement thus:
awk '{if ($1 !="") print $1}' your_file
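The difference between the whitespace-aware awk and the space-only cut can be seen on a small sample containing a tab-separated line, a blank line, and a line with several spaces. This is a sketch only; the function names are hypothetical, and NF is used here as a shorter way to skip empty lines.

```shell
# Hypothetical helpers; both read lines on stdin.
first_word_awk() {
    # Default field splitting treats any run of spaces or tabs as one
    # separator; the NF condition skips lines that have no fields.
    awk 'NF { print $1 }'
}

first_word_cut() {
    # Splits on single space characters only.
    cut -d' ' -f1
}

sample=$(printf 'alpha beta\nzeta\teta\n\nmu   nu')

printf '%s\n' "$sample" | first_word_awk
# prints:
# alpha
# zeta
# mu

# cut leaves the tab-separated line whole and keeps the empty line.
printf '%s\n' "$sample" | first_word_cut
```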
If you are also concerned about the possibility of empty fields, i.e., lines that begin with whitespace, then a more robust solution would be in order. I'm not adept enough with awk to produce a one-liner for such cases, but a short python script that does the trick might look like:
>>> import re
>>> for line in open('your_file'):
...     words = re.split(r'\s', line)
...     if words and words[0]:
...         print(words[0])
...or on Windows (if you have GnuWin32 grep) :
grep -Eo "^[^ ]+" file

Replacing a line in a csv file?

I have a set of 10 CSV files, which normally have entries of this kind
a,b,c,d
d,e,f,g
Now due to some error entries in this file have become of this kind
a,b,c,d
d,e,f,g
,,,
h,i,j,k
Now I want to remove the line with only commas in all the files. These files are on a Linux filesystem.
Is there a command you would recommend that can replace the erroneous lines in all the files?
It depends on what you mean by replace. If you mean 'remove', then a trivial variant on @wnoise's solution is:
grep -v '^,,,$' old-file.csv > new-file.csv
Note that this deletes just those lines with exactly three commas. If you want to delete mal-formed lines with any number of commas (including zero) - and no other characters on the line, then:
grep -v '^,*$' ...
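The two patterns can be compared on the question's data with an empty line added for illustration. This is a minimal sketch; the function names are made up.

```shell
# Hypothetical helpers; both read lines on stdin.
drop_three_commas() {
    # Removes only lines that are exactly ",,,".
    grep -v '^,,,$'
}

drop_comma_only() {
    # Removes lines made of any number of commas, including empty lines.
    grep -v '^,*$'
}

csv=$(printf 'a,b,c,d\nd,e,f,g\n,,,\n\nh,i,j,k')

printf '%s\n' "$csv" | drop_three_commas   # the empty line survives
printf '%s\n' "$csv" | drop_comma_only     # the empty line is removed too
```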
There are endless other variations on the regex that would deal with other scenarios. Dealing with full CSV data with commas inside quotes starts to need something other than a regex machine. It can be done, within broad limits, especially in more complex regex systems such as PCRE or Perl. But it requires more work.
Check out Mastering Regular Expressions.
sed 's/,,,/replacement/' < old-file.csv > new-file.csv
optionally followed by
mv new-file.csv old-file.csv
Replace or remove, your post is not clear... For replacement see wnoise's answer. For removing, you could use
awk '$0 !~ /,,,/ {print}' <old-file.csv > new-file.csv
What about trying to keep only lines which are matching the desired format instead of handling one exception ?
If the provided input is what you really want to match:
grep -E '[a-z],[a-z],[a-z],[a-z]' < oldfile.csv > newfile.csv
If the input is different, provide it, the regular expression should not be too hard to write.
Do you want to replace them with something, or delete them entirely? Either way, it can be done with sed. To delete:
sed -i -e '/^,\+$/ D' yourfile1.csv yourfile2.csv ...
To replace: well, see wnoise's answer, or if you don't want to create new files with the output,
sed -i -e '/^,\+$/ s//replacement/' yourfile1.csv yourfile2.csv ...
or
sed -i -e '/^,\+$/ c\
replacement' yourfile1.csv yourfile2.csv ...
(that should be entered exactly as is, including the line break). Of course, you can also do this with awk or perl or, if you're only deleting lines, even grep:
egrep -v '^,+$' < oldfile.csv > newfile.csv
I tested these to make sure they work, but I'd advise you to do the same before using them (just in case). You can omit the -i option from sed, in which case it'll print out the results (rather than writing them back to the file), or omit the output redirection >newfile.csv from grep.
EDIT: It was pointed out in a comment that some features of these sed commands only work on GNU sed. As far as I can tell, these are the -i option (which can be replaced with shell redirection, sed ... <infile >outfile ) and the \+ modifier (which can be replaced with \{1,\} ).
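Putting those two substitutions together, a portable version could be sketched as follows. The function name remove_comma_lines is hypothetical, and mktemp is assumed to be available (it is common but not required by POSIX).

```shell
# remove_comma_lines is a made-up helper: it rewrites each named file
# through a temporary file instead of relying on sed -i.
remove_comma_lines() {
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        # \{1,\} is the POSIX spelling of GNU's \+ ; d deletes the line.
        if sed '/^,\{1,\}$/d' "$f" > "$tmp"; then
            # Copy back with cat so the original file keeps its permissions.
            cat "$tmp" > "$f"
        fi
        rm -f "$tmp"
    done
}

# Demo on a throwaway file.
demo=$(mktemp)
printf 'a,b,c,d\nd,e,f,g\n,,,\nh,i,j,k\n' > "$demo"
remove_comma_lines "$demo"
cat "$demo"
rm -f "$demo"
```

As with the commands above, it is worth trying this on copies of the files first.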
Most simply:
$ grep -v ,,, oldfile > newfile
$ mv newfile oldfile
Yes, awk or grep are very good options if you are working on a Linux platform. On other platforms, however, you can use Perl regular expressions with its join and split functions.

Resources