This might exist elsewhere but I could not find it. My goal is to delete the extra numbers from BLAST search output so that I can pull out the sequence data while keeping the numerical sequence id. For example:
Original:
>k141_100041 flag=0 multi=242.9841 len=43238
Sbjct 16375 MSEELTQNSGSNYSASSIQVLEGLEAVRKRPAMYIGDISEKGLHHLVYEVVDNSIDEALA 16196
Sbjct 16195 GYCTHIEVTINEDNSITVQDNGRGIPVDFHEKEKKSALEVVMTVLHAGGKFDKGSYKVSG 16016
Sbjct 16015 GLHGVGVSCVNALSTHMTTNVFRNGKIYQQEYECGKPLYAVKEVGTTDITGTRQTFWPDG 15836
Sbjct 15835 SIFTVTEYKYSILQARMRELAYLNKGITITLTDKRVKEEDGSYKQEKFHSEEGVKEFVRF 15656
Sbjct 15655 LNSNNTPLIDDVIYLNTEKQGIPIECAIMYNTGFRENLHSYVNNINTIEGGTHEAGFRMA 15476
Sbjct 15475 LTRVLKKYAEESKALEKAKVEISGEDFREGLIAVISVKVSEPQFEGQTKTKLGNNEVSGA 15296
Sbjct 15295 VNQAVGEALTYYLEEHPKEAKIIVDKVVLAATARVAARKARESVQRKSPMGGGGLPGKLA 15116
Sbjct 15115 DCSSRVAEECELFLVEGDSAGGSAKQGRSRQFQAILPLRGKILNVEKAMWHKAFESDDVN 14936
Sbjct 14935 NIIQALGVRFGVDGEEDSKKANIDKLRYHKVIIMTDADVDGSHIDTLIMTLFYRYMPEVI 14756
Sbjct 14755 QGGHLYIATPPLYKCSKGKISEYCYTDEARQAFIQKYGEGNEQGIHTQRYKGLGEMNPEQ 14576
Sbjct 14575 LWETTMNPETRILKQVNIENAAEADYIFSMLMGDDVGPRREFIEKNATYANIDA 14414
Goal:
>k141_112817 flag=0 multi=66.5284 len=335023
MSEELTQNSGSNYSASSIQVLEGLEAVRKRPAMYIGDISEKGLHHLVYEVVDNSIDEALA
GYCTHIEVTINEDNSITVQDNGRGIPVDFHEKEKKSALEVVMTVLHAGGKFDKGSYKVSG
GLHGVGVSCVNALSTHMTTNVFRNGKIYQQEYECGKPLYAVKEVGTTDITGTRQTFWPDG
SIFTVTEYKYSILQARMRELAYLNKGITITLTDKRVKEEDGSYKQEKFHSEEGVKEFVRF
LNSNNTPLIDDVIYLNTEKQGIPIECAIMYNTGFRENLHSYVNNINTIEGGTHEAGFRMA
LTRVLKKYAEESKALEKAKVEISGEDFREGLIAVISVKVSEPQFEGQTKTKLGNNEVSGA
VNQAVGEALTYYLEEHPKEAKIIVDKVVLAATARVAARKARESVQRKSPMGGGGLPGKLA
DCSSRVAEECELFLVEGDSAGGSAKQGRSRQFQAILPLRGKILNVEKAMWHKAFESDDVN
NIIQALGVRFGVDGEEDSKKANIDKLRYHKVIIMTDADVDGSHIDTLIMTLFYRYMPEVI
QGGHLYIATPPLYKCSKGKISEYCYTDEARQAFIQKYGEGNEQGIHTQRYKGLGEMNPEQ
LWETTMNPETRILKQVNIENAAEADYIFSMLMGDDVGPRREFIEKNATYANIDA
I can easily remove the 'Sbjct' label and the numbers with sed commands, but I don't know how to exempt the id line (k141_112817...) from those commands. Any help would be appreciated.
I think sed is the wrong tool, since it appears that you want:
awk '/^Sbjct/{$0 = $3}1' input-file
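As a quick check of this one-liner, the sketch below runs it on a shortened, hypothetical sample in the question's format. The sequences are truncated for brevity and the variable names are only illustrative.

```shell
# Hypothetical sample in the question's format (sequences shortened).
blast_sample='>k141_100041 flag=0 multi=242.9841 len=43238
Sbjct 16375 MSEELTQNSG 16196
Sbjct 16195 GYCTHIEVTI 16016'

# On Sbjct lines, replace the whole record with field 3; the trailing 1 prints every line.
cleaned=$(printf '%s\n' "$blast_sample" | awk '/^Sbjct/{$0 = $3}1')
printf '%s\n' "$cleaned"
```

The header line passes through unchanged because the /^Sbjct/ pattern never matches it.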
sed -E '/^>/n;s/\S+\s*//4;s///2;s///1' file
GNU sed, with -E to enable extended regular expressions.
/^>/n preserves the line beginning with >: the n command prints it and reads the next line.
s/\S+\s*//4 removes the 4th word (\S is non-whitespace, \s is whitespace).
s///2 removes the 2nd word (an empty regex reuses the most recently used regex).
s///1 removes the 1st word.
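To see the empty-regex reuse in isolation, here is a small sketch on a made-up four-word line. It swaps GNU's \S and \s for the bracket expression [^ ] and a literal space, on the assumption that this is more widely accepted by seds that support -E. Note that a trailing space is left behind, just as the command above leaves one on each sequence line.

```shell
# Remove the 4th, then the 2nd, then the 1st word; the empty regex // reuses the previous one.
words_left=$(printf 'a b c d\n' | sed -E 's/[^ ]+ *//4;s///2;s///1')
printf '[%s]\n' "$words_left"   # brackets make the trailing space visible
```

If the trailing space matters, one more substitution such as s/ *$// would strip it.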
This might work for you (GNU sed):
sed -E '/^Sbjct/s/.* .* (\S+) .*/\1/' file
When a line starting with Sbjct is encountered, remove the first two fields and the last one (and the intervening spaces).
This is solvable with sed, but in this case I agree with William Pursell and would use awk.
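For a sed without GNU extensions, a minimal sketch of the same idea using only POSIX basic regular expressions might look like the following. It assumes the fields are separated by spaces (not tabs) and that the sequence is always the third field; the sample data is shortened and hypothetical.

```shell
blast_lines='>k141_100041 flag=0 multi=242.9841 len=43238
Sbjct 16375 MSEELTQNSG 16196
Sbjct 16195 GYCTHIEVTI 16016'

# Only lines starting with Sbjct are edited; the header line is never touched.
seq_only=$(printf '%s\n' "$blast_lines" |
    sed '/^Sbjct/s/^[^ ]* *[^ ]* *\([^ ]*\).*/\1/')
printf '%s\n' "$seq_only"
```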
I have a file like this
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq"rst"|uv||xyz
abc|def||ghi|jklm|"nopq"r"st"|uv||xyz
The 6th column may be enclosed in double quotes. I want to replace every double quote inside this field (other than an enclosing pair) with a backslash-double quote (\").
I wish my output to look like
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
I have tried combinations of the commands below, but came up short each time:
sed -i 's/\"/\\\"/2' file.txt (this replaces only the 2nd occurrence)
sed -i 's/\"/\\\"/2g' file.txt (this replaces the 2nd occurrence and all later ones)
My file will have millions of rows, so I need a sed or awk command.
Please help.
You may use this awk solution in any version of awk:
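awk 'BEGIN {FS=OFS="|"} length($6) > 2 {   # shorter fields have no interior characters
    c1 = substr($6, 1, 1)                  # first character, kept as-is
    c2 = substr($6, length($6), 1)         # last character, kept as-is
    s = substr($6, 2, length($6)-2)        # everything in between
    gsub(/"/, "\\\"", s)                   # escape the interior quotes
    $6 = c1 s c2
} 1' file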
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
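If the column number may vary, the same approach could be wrapped in a shell function. This is only a sketch: the name escape_field and the awk variable col are hypothetical, and the length guard is an assumption added here so that one-character fields are left alone.

```shell
# escape_field COL: escape interior double quotes in pipe-delimited column COL
# (hypothetical helper; reads stdin, writes stdout).
escape_field() {
    awk -v col="$1" 'BEGIN {FS=OFS="|"} length($col) > 2 {
        inner = substr($col, 2, length($col)-2)    # text between first and last char
        gsub(/"/, "\\\"", inner)                   # escape only the interior quotes
        $col = substr($col, 1, 1) inner substr($col, length($col), 1)
    } 1'
}

escaped=$(printf '%s\n' 'abc|def||ghi|jklm|"nopq"r"st"|uv||xyz' | escape_field 6)
printf '%s\n' "$escaped"
```

Since awk streams its input, this should cope with millions of rows without holding the file in memory.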
If this isn't all you need then edit your question to provide more truly representative sample input/output including cases that this doesn't work for:
$ sed 's/"/\\"/g; s/|\\"/|"/g; s/\\"|/"|/g' file
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
The above will work in any sed.
This might work for you (GNU sed):
sed -E 's/[^|]*/\n&\n/6 # isolate the 6th field
h # make a copy
s/"/\\"/g # replace " by \"
s/\\(")\n|\n\\(")/\1\n\2/g # repair start and end "s
H # append amended line to copy
g # get copies to current line
s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/' file # swap fields
Surround the 6th field by newlines and make a copy in the hold space.
Replace all "'s by \"'s and remove the \'s at the start and end of the field if the field begins and ends in "'s
Append the amended line to the copy and replace the current line by the doubled line.
Using pattern matching replace copied line 6th field by the amended one.
How to convert an uneven TAB separated input file to CSV or PSV using sed command?
28828082-1 04/08/19 08:48 04/11/19 12:37 04/12/19 16:22 4/15-4/16 04/17/19 2 9 LCO W OIP 04/08/19 08:53 21 1 58.00 9 222 79 FEDX FEDXH SL3 484657064673 0410099900691041119 SMITHFIELD RI 02917 "41.890066 , -71.548680" YES
Above is 1 row, I tried using sed -r 's/^\s+//;s/\s+/|/g' but the result was not as expected.
gawk to the rescue!
$ awk -vFPAT='([^[:space:]]+)|("[^"]+")' -v OFS='|' '$1=$1' file
28828082-1|04/08/19|08:48|04/11/19|12:37|04/12/19|16:22|4/15-4/16|04/17/19|2|9|LCO|W|OIP|04/08/19|08:53|21|1|58.00|9|222|79|FEDX|FEDXH|SL3|484657064673|0410099900691041119|SMITHFIELD|RI|02917|"41.890066 , -71.548680"|YES
This defines a field as either a run of non-space characters or a quoted value that may include spaces (but not escaped quotes), and sets the output field separator to |. The assignment $1=$1 forces the line to be rebuilt with the new separator; because it is used as a pattern, a line is printed only when its first field is non-empty and non-zero.
A more robust version would be ... '{$1=$1; print}', which prints every line.
Of course, if all the field delimiters are tabs and the quoted strings don't include any tabs, it's much simpler.
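The difference between $1=$1 used as a pattern and the explicit '{$1=$1; print}' form can be seen with a small made-up input whose first field is 0. This sketch uses plain whitespace splitting rather than FPAT, so it does not depend on gawk.

```shell
# A first field of 0 makes the pattern $1=$1 evaluate as false, so that line is dropped.
as_pattern=$(printf '0 a\n1 b\n' | awk -v OFS='|' '$1=$1')
# An explicit print in an action block emits every line.
as_action=$(printf '0 a\n1 b\n' | awk -v OFS='|' '{$1=$1; print}')
printf 'pattern:\n%s\naction:\n%s\n' "$as_pattern" "$as_action"
```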
Your question isn't clear but is this what you're trying to do?
$ printf 'now\t"is the winter"\tof\t"our discontent"\n' > file
$ cat file
now "is the winter" of "our discontent"
$ tr '\t' ',' < file
now,"is the winter",of,"our discontent"
$ tr '\t' '|' < file
now|"is the winter"|of|"our discontent"
Your initial attempt was very close:
sed 's/[[:space:]]\+/|/g' input.txt
Explanation:
[[:space:]] matches a single whitespace character such as a space, tab, CR, or newline.
\+ matches one or more of the preceding item (a GNU extension in basic regular expressions).
Update:
If you require two or more whitespace characters:
sed 's/[[:space:]]\{2,\}/|/g' input.txt
\{2,\} matches two or more of the preceding item.
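To compare the two quantifiers side by side, here is a sketch on a made-up line in which one field contains a single space. It uses the POSIX interval \{1,\} in place of \+ so that it should not depend on GNU sed.

```shell
# Fields are separated by two spaces or two tabs; "b c" contains a single space.
line=$(printf 'a  b c\t\td')
one_or_more=$(printf '%s\n' "$line" | sed 's/[[:space:]]\{1,\}/|/g')
two_or_more=$(printf '%s\n' "$line" | sed 's/[[:space:]]\{2,\}/|/g')
printf '%s\n%s\n' "$one_or_more" "$two_or_more"
```

With one-or-more, the space inside "b c" is also converted; with two-or-more it survives.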
I have file like this
TT;12-11-18;text;abc;def;word
AA;12-11-18;tee;abc;def;gih;word
TA;12-11-18;teet abc;def;word
TT;12-11-18;tdd;abc;def;gih;jkl;word
I want output like this
TT;12-11-18;text;abc;def;word
TA;12-11-18;teet abc;def;word
I want to print a line only if word occurs at position 5, counting from the date 12-11-18 as position 1. I do not want lines where it occurs after that position, that is, at the 6th or 7th position.
I tried this command:
cat file.txt|grep "word" -n1
This prints every line in which the pattern word matches. How should I solve my problem?
Try this (GNU awk):
awk -F"[; ]" '/12-11-18/ && $6=="word"' file
Or a sed one:
sed -n '/12-11-18;\([^; ]*[; ]\)\{3\}word/p' file
Or grep with basically the same regex (different escaping):
grep -E "12-11-18;([^; ]*[; ]){3}word" file
[^; ] means any character that is not ; or a space.
* means any number of repetitions of the preceding character or group.
-- [^; ]* means a string of any length that contains neither ; nor a space; the ^ in [^; ] negates the set.
[; ] means exactly one ; or one space.
() groups the above together.
{3} matches three repetitions of the preceding character or group.
As a whole, ([^; ]*[; ]){3} matches three fields separated by ; or space, including their delimiters.
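Putting those pieces together, a quick sketch against the question's four sample lines (held in a shell variable here rather than a file) shows which lines the pattern selects.

```shell
sample='TT;12-11-18;text;abc;def;word
AA;12-11-18;tee;abc;def;gih;word
TA;12-11-18;teet abc;def;word
TT;12-11-18;tdd;abc;def;gih;jkl;word'

# Exactly three delimiter-terminated fields must sit between the date and "word".
matched=$(printf '%s\n' "$sample" | grep -E '12-11-18;([^; ]*[; ]){3}word')
printf '%s\n' "$matched"
```

Because [^; ]* cannot cross a delimiter, the lines with extra fields before word do not match.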
As @kvantour points out, these would give wrong results if several consecutive spaces can occur in one place.
To treat multiple spaces as one separator, use:
awk -F"(;| +)" '/12-11-18/ && $6=="word"'
and
grep -E "12-11-18;([^; ]*(;| +)){3}word"
or GNU sed (posix/bsd/osx sed does not support |):
sed -rn '/12-11-18;([^; ]*(;| +)){3}word/p'
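The effect of the two separators can be checked on a hypothetical line with two spaces between teet and abc. This sketch prints the 6th field under each definition.

```shell
# Hypothetical line with two spaces between "teet" and "abc".
line='TA;12-11-18;teet  abc;def;word'
single=$(printf '%s\n' "$line" | awk -F'[; ]' '{print $6}')
merged=$(printf '%s\n' "$line" | awk -F'(;| +)' '{print $6}')
printf 'single-char separator: %s\nrun-of-spaces separator: %s\n' "$single" "$merged"
```

With [; ] the second space creates an empty field, which shifts word to position 7; with (;| +) the run of spaces counts once.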
I'm trying to write a utility that reverses each line of input. The following just prints the lines as they are, though:
#!/bin/sed -f
#insert newline at the beginning
s/^/\n/
#while the newline hasnt moved to the end of pattern space, rotate
: loop
/\n$/{!s/\(.*\)\(.$\)/\2\1/;!b loop}
#delete the newline
s/\n//
Any ideas on what's wrong?
/\n$/{!s/\(.*\)\(.$\)/\2\1/;!b loop}
The ! normally goes after an address or range, not in front of a command.
The !b (which I take to mean "otherwise go to") should probably be a t (branch if a substitution occurred).
The $ belongs after the last group, not inside it.
So this line becomes:
/\n$/ !{s/\(.*\)\(.\)$/\2\1/;t loop}
Even so, this code ends up doing nothing useful: it adds a newline at the start and moves it to the end by repeatedly moving the last character to the front, which rotates the line without reversing anything.
sed 'G
:loop
s/\(.\)\(\n.*\)/\2\1/
t loop
s/.//' YourFile
This should do the trick.
@TobySpeight further improved the code by removing the need for a first group (code adapted).
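To watch the loop at work, the sketch below adds sed's l command, which prints the pattern space unambiguously (a newline shows as \n and the end of the pattern space as $). It is only a tracing aid layered on the script above; -n suppresses the normal output so that only the trace is shown.

```shell
# Trace the pattern space at the top of every loop iteration for the input line 123.
trace=$(printf '123\n' | sed -n 'G
:loop
l
s/\(.\)\(\n.*\)/\2\1/
t loop')
printf '%s\n' "$trace"
```

Each pass moves the character just before the newline to the end of the pattern space, so the newline walks to the front while the reversed text builds up behind it.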
Solution 1
$ echo -e '123\n456\n789' |sed -nr '/\n/!G;s/(.)(.*\n)/&\2\1/;/^\n/!D;s/\n//p'
321
654
987
The core ideas:
We need a loop over the characters of each line; fortunately, the D command can simulate one.
Looping within ONE line is awkward, because sed normally processes one line per cycle; the s and D commands together can simulate a loop over a single line.
How do we avoid an infinite loop? We need a flag character that marks the end of each line, and \n is the perfect choice.
The D command deletes everything up to and including the first \n in the pattern space,
and then makes sed jump back to its first command without reading new input, which is effectively a loop. So we also need a disposable placeholder in front of the final string for D to delete, and we can simply use the current content up to and including the \n.
Explanation:
/\n/!G: if the pattern space already contains \n, we are in the middle of processing a line and nothing is done; otherwise G appends a \n followed by the (empty) hold space to the pattern space. (sed strips the trailing \n of each line before putting it into the pattern space.) After G, the pattern space holds the original line followed by a \n.
s/(.)(.*\n)/&\2\1/: keep the matched text (&, the first character through the \n) as the placeholder, then append the rest of the match (\2) followed by the first character (\1), which places that character immediately after the \n.
/^\n/!D;s/\n//p: if the pattern space starts with \n, the line is fully reversed, so s/\n//p deletes the flag character \n and prints the result; otherwise D deletes the placeholder and jumps back to the first command to process the next character.
To summarize, the pattern space evolves as follows during the loop:
123\n [(1)(23\n)] =s=> 123\n23\n1 [(123\n)(23\n)(1)] =D=> 23\n1
23\n1 [(2)(3\n)1] =s=> 23\n3\n21 [(23\n)(3\n)(2)1] =D=> 3\n21
3\n21 [(3)(\n)21] =s=> 3\n\n321 [(3\n)(\n)(3)21] =D=> \n321
\n321 [()(\n)321] =s=> \n321 =!D=> \n321 =s-p=> 321
There are some derived solutions:
Solution 2
The placeholder can be any other string ending with a \n:
$ echo -e '123\n456\n789' |sed -nr '/\n/!G;s/(.)(.*\n)/USELESS\n\2\1/;/^\n/!D;s/\n//p'
321
654
987
Solution 3
Use an explicit branch (GNU sed's T command) in place of the /^\n/ test in front of D:
$ echo -e '123\n456\n789' |sed -nr '/\n/!G;s/(.)(.*\n)/&\2\1/;Tend;D;:end;s/\n//p'
321
654
987
Solution 4
Use . to match the leading \n:
$ echo -e '123\n456\n789' |sed -nr '/\n/!G;s/(.)(.*\n)/&\2\1/;/^\n/!D;s/.//p'
321
654
987
Solution 5
$ echo -e '123\n456\n789' |sed -nr ':loop;/\n/!G;s/(.)(.*\n)/\2\1/;tloop;s/.//p'
321
654
987
This solution is much easier to understand. The pattern space evolves as follows:
123\n [(1)(23\n)] =s=> 23\n1 [(23\n)(1)]
23\n1 [(2)(3\n)1] =s=> 3\n21 [(3\n)(2)1]
3\n21 [(3)(\n)21] =s=> \n321 [(\n)(3)21]
\n321 [()(\n)321] =s=> \n321 =s=> 321
The problem is you are using the wrong tool for the job and trying to understand/use constructs that became obsolete in the mid-1970s when awk was invented.
$ cat file
tsuj
esu
na
etaorporppa
loot
$ awk -v FS= '{rev=""; for (i=1; i<=NF; i++) rev = $i rev; print rev}' file
just
use
an
approproate
tool
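Splitting into characters with an empty FS is, as far as I know, a common extension rather than something POSIX specifies. A sketch that avoids relying on it walks the line backwards with substr instead:

```shell
# Walk each line from its last character to its first, appending each one to rev.
reversed=$(printf 'tsuj\nloot\n' |
    awk '{rev = ""; for (i = length($0); i >= 1; i--) rev = rev substr($0, i, 1); print rev}')
printf '%s\n' "$reversed"
```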
Hello, I need a one-liner to insert a character after the Nth occurrence of a delimiter on the 2nd line of a file in Unix. The criteria are:
Find position of nth occurrence of the delimiter.
Insert character after the nth occurrence.
This is on the 2nd line only.
Note: I am doing this in Linux.
With awk :
INPUT FILE
1 foo bar base
2 foo bar base
CODE
awk 'NR==2{$2=$2"X"; print}' file
You can specify a delimiter with -F.
NR selects the line to work on.
$2 is the 2nd field, separated by spaces in this case.
$2=$2"X" is a concatenation.
print alone prints the entire line.
OUTPUT
2 fooX bar base
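As a sketch of the -F remark, the variant below uses a hypothetical comma-separated input. It also sets OFS so the rebuilt line keeps its commas, and it ends with 1 so that the other lines are printed too (an assumption about what is wanted; the command above prints only line 2).

```shell
# Hypothetical comma-separated input; OFS must match FS so the rebuilt line keeps its commas.
csv_out=$(printf '1,foo,bar,base\n2,foo,bar,base\n' |
    awk -F, -v OFS=, 'NR==2{$2 = $2 "X"} 1')
printf '%s\n' "$csv_out"
```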
Suppose we have the input file:
$ cat file
1 foo bar base
2 foo bar base
To insert the character X after the 3rd occurrence of the delimiter space, use:
$ sed -r '2 s/([^ ]* ){3}/&X/' file
1 foo bar base
2 foo bar Xbase
To make the change to the file in place, use sed's -i option:
sed -i -r '2 s/([^ ]* ){3}/&X/' file
How it works
Consider the sed command:
2 s/([^ ]* ){3}/&X/
The initial 2 instructs sed to apply this command only to the second line.
We are using the s or substitute command. This command has the form s/old/new/ where old and new are:
old is the regular expression ([^ ]* ){3}. This matches everything up to and including the third occurrence of space.
new is the replacement text, &X. The ampersand refers to what we matched in old, which is all the line up to and including the third space. The X is the new character that we are inserting.
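Since the question asks for the Nth occurrence in general, the count can be taken from a shell variable. In this sketch n is a hypothetical variable name, the sed script is double-quoted so the shell expands it, and -E is used in place of -r on the assumption that both GNU and BSD sed accept it.

```shell
# n holds which occurrence of the space delimiter to insert after (hypothetical variable).
n=2
nth_out=$(printf '1 foo bar base\n2 foo bar base\n' | sed -E "2 s/([^ ]* ){$n}/&X/")
printf '%s\n' "$nth_out"
```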
This might work for you (GNU sed):
sed '2s/X/&Y/3' file
This inserts Y after the third occurrence of X on the second line only.
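A concrete run of the same form, using a comma as the delimiter on made-up input, might look like this:

```shell
# Insert Y after the 3rd comma on line 2 only; the trailing 3 selects the third match.
flag_out=$(printf 'a,b,c,d\na,b,c,d\n' | sed '2s/,/&Y/3')
printf '%s\n' "$flag_out"
```

The numeric flag on s is standard sed, so this form should not need GNU extensions.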