insert consecutive number after .fa header id - linux

I have a large .fa file that consists of multiple merged fasta files. Each record is separated by a header line that begins with ">".
Here is an example:
>DPB1*04:01:01:01 [most similar sequence] for DPB1 in 3507009462
I would like to modify each header inside the file by adding a consecutive integer after each id. The id is the first sequence of characters after ">" and before the first space.
The modified header would look like this:
>DPB1*04:01:01:011 [most similar sequence] for DPB1 in 3507009462
I found some code that replaces the header with a consecutive number, but I am not sure how to insert the number after the header id instead.
cat youFile.fa | perl -ane 'if(/\>/){$a++;print ">$a\n"}else{print;}' > youFile_new.fa
Thanks for your help

$ perl -wpe 's/\s/++$i . " "/e if /^>/' input.fa
Explanation:
Substitute first occurrence of whitespace with counter variable and single space...
...but only if the line starts with >
Print every line (due to -p switch)
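As a quick sanity check, here is the one-liner run on a tiny throwaway fasta file (the record names are invented for illustration):

```shell
# Build a two-record demo file, then run the counter one-liner on it.
printf '>seqA first header\nACGT\n>seqB second header\nTTGG\n' > demo.fa
perl -wpe 's/\s/++$i . " "/e if /^>/' demo.fa
# headers become ">seqA1 first header" and ">seqB2 second header"
```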

Add pipe delimiter at the end of each row using unix

I am new to unix commands, so please forgive me if I am not using the correct code below.
I have files (xxxx.txt.date) on WinSCP with a header and a footer. I want to add N pipes (|) at the end of each row of every file, from the 2nd line through the second-to-last line (I don't want | in the header or the footer).
I have created a script in which I am using the command below:
sed -e "2,\$s/$/|/" $file | column -t
2,$s/$/|/: adds | at the end of every line from line 2
Below are the issues I am facing.
First, the data doesn't change in the files: I can see the pipe added at the end of each row in Hive, but how can I change the data in the files themselves?
Second, I don't want | in the footer.
Any suggestion or help will be appreciated.
Thanks in advance!
If you need to append just one "|" at the end of each line except the header and footer:
sed -i '1n; $n; s/$/|/' file_name
1n; $n; : print the first and last lines as is, skipping the substitution for them.
-i : make changes to the file in place instead of printing to STDOUT.
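To see the effect before committing to -i, you can run the command without that flag on a throwaway file (the Header/Footer rows here are illustrative):

```shell
# Dry run without -i: print to STDOUT instead of editing in place.
printf 'Header\nrow1\nrow2\nFooter\n' > demo.txt
sed '1n; $n; s/$/|/' demo.txt
# row1 and row2 gain a trailing |; Header and Footer are untouched
```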
If you need to append n pipes at the end of each line except the header and footer, you can use the awk command below; redirect its output to a temporary file and then rename it.
Assumptions:
I am assuming your header and footer are standard and start with some character (e.g., H, F, T) or string (Header, Footer, Trailer, etc.).
I am assuming your original file is delimited with "|". You can specify your actual delimiter in the awk below.
awk -F'|' -v n=7 '{if(/^Header|^Footer/) {print} else {end="";for (i=1;i<=n;i++) end=sprintf("%s%s", end, "|"); rec=sprintf("%s%s", $0, end); print rec}}' file_name
n=number of times you want to repeat | at the end of each line.
^Header|^Footer - if the line starts with "Header" or "Footer", just print the record as it is. You can substitute the header and footer strings used in your file.
for loop - prepares a string "end" which contains "|" repeated n times.
rec - contains the entire record concatenated with the end string.
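For example, with n=3 on a toy file (the Header/Footer and data rows are invented for the demo):

```shell
# Append three pipes to every data row, leaving Header and Footer alone.
printf 'Header\nA|B\nC|D\nFooter\n' > demo.txt
awk -F'|' -v n=3 '{if(/^Header|^Footer/) {print} else {end="";for (i=1;i<=n;i++) end=sprintf("%s%s", end, "|"); rec=sprintf("%s%s", $0, end); print rec}}' demo.txt
# Header
# A|B|||
# C|D|||
# Footer
```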

Using sed to delete specific lines after LAST occurrence of pattern

I have a file that looks like:
this: name
this: age

Remove these lines and space above.
Remove here too and space below

Keep everything below here.
I don't want to hardcode 2, as the number of lines containing "this" can change. How can I delete the 4 lines after the last occurrence of the string? I am trying sed -e '/this: /{n;N;N;N;N;d}' but it is deleting after the first occurrence of the string.
Could you please try the following.
awk '
FNR==NR{
  if($0~/this/){
    line=FNR
  }
  next
}
FNR<=line || FNR>(line+4)
' Input_file Input_file
Output will be as follows for the shown sample.
this: name
this: age
Keep everything below here.
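Reproducing that on the question's sample (note the file is passed twice: the first pass records the last line matching "this", the second prints everything except the 4 lines after it):

```shell
# Recreate the sample file, blank lines included, then run the two-pass awk.
printf 'this: name\nthis: age\n\nRemove these lines and space above.\nRemove here too and space below\n\nKeep everything below here.\n' > demo.txt
awk 'FNR==NR{if($0~/this/){line=FNR}; next} FNR<=line || FNR>(line+4)' demo.txt demo.txt
# this: name
# this: age
# Keep everything below here.
```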
You can also use this minor change to make your original sed command work.
sed '/^this:/ { :k ; n ; // b k ; N ; N ; N ; d }' input_file
It uses a loop which prints the current line and reads the next one (n) while it keeps matching the regex (the empty regex // recalls the latest one evaluated, i.e. /^this:/, and the command b k goes back to the label k on a match). Then you can append the next 3 lines and delete the whole pattern space as you did.
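On the same sample, the looped sed behaves like this (same command, whitespace condensed):

```shell
# Print while /^this:/ keeps matching, then append 3 more lines
# and delete the resulting 4-line pattern space.
printf 'this: name\nthis: age\n\nRemove these lines and space above.\nRemove here too and space below\n\nKeep everything below here.\n' > demo.txt
sed '/^this:/{:k;n;//bk;N;N;N;d}' demo.txt
# this: name
# this: age
# Keep everything below here.
```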
Another, more concise possibility using GNU sed could be this.
sed '/^this:/ b ; /^/,$ { //,+3 d }' input_file
This one prints any line beginning with this: (b without label goes directly to the next line cycle after the default print action).
On the first line not matching this:, two nested ranges are triggered. The outer range is "one-shot". It is triggered right away due to /^/ which matches any line, then it stays triggered up to the last line ($). The inner range is a "toggle" range. It is also triggered right away because // recalls /^/ on this line (and only on this line, hence the one-shot outer range), then it stays triggered for 3 additional lines (the end address +3 is a GNU extension). After that, /^/ is no longer evaluated, so the inner range cannot trigger again because // recalls /^this:/ (which is cut short early).
This might work for you (GNU sed):
sed -E ':a;/this/n;//ba;$!N;$!ba;s/^([^\n]*\n?){4}//;/./!d' file
If the pattern space (PS) contains this, print the PS and fetch the next line.
If the following line contains this repeat.
If the current line is not the last line, append the next line and repeat.
Otherwise, remove the first four lines of the PS and print the remainder.
Unless the PS is empty in which case delete the PS entirely.
N.B. This only reads the file once. Also the OP says
How can I delete 4 lines after the last occurrence of the string
However the example would seem to expect 5 lines to be deleted.

sed - Delete lines only if they contain multiple instances of a string

I have a text file that contains numerous lines that have partially duplicated strings. I would like to remove lines where a string match occurs twice, such that I am left only with lines with a single match (or no match at all).
An example output:
g1: sample1_out|g2039.t1.faa sample1_out|g334.t1.faa sample1_out|g5678.t1.faa sample2_out|g361.t1.faa sample3_out|g1380.t1.faa sample4_out|g597.t1.faa
g2: sample1_out|g2134.t1.faa sample2_out|g1940.t1.faa sample2_out|g45.t1.faa sample4_out|g1246.t1.faa sample3_out|g2594.t1.faa
g3: sample1_out|g2198.t1.faa sample5_out|g1035.t1.faa sample3_out|g1504.t1.faa sample5_out|g441.t1.faa
g4: sample1_out|g2357.t1.faa sample2_out|g686.t1.faa sample3_out|g1251.t1.faa sample4_out|g2021.t1.faa
In this case I would like to remove lines 1, 2, and 3 because sample1 is repeated multiple times on line 1, sample2 appears twice on line 2, and sample5 appears twice on line 3. Line 4 passes because it contains only one instance of each sample.
I am okay repeating this operation multiple times using different 'match' strings (e.g. sample1_out , sample2_out etc in the example above).
Here is one in GNU awk:
$ awk -F"[| ]" '{ # pipe or space is the field separator
delete a # delete previous hash
for(i=2;i<=NF;i+=2) # iterate every other field, ie right side of space
if($i in a) # if it has been seen already
next # skip this record
else # well, else
a[$i] # hash this entry
print # output if you make it this far
}' file
Output:
g4: sample1_out|g2357.t1.faa sample2_out|g686.t1.faa sample3_out|g1251.t1.faa sample4_out|g2021.t1.faa
The following sed command will accomplish what you want.
sed -ne '/.* \(.*\)|.*\1.*/!p' file.txt
grep: grep -vE '(sample[0-9]).*\1' file
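A quick check of the grep variant on a two-line toy input (only the line without a repeated sampleN token survives):

```shell
printf 'g1: sample1_out|a.faa sample1_out|b.faa\ng4: sample1_out|a.faa sample2_out|b.faa\n' |
  grep -vE '(sample[0-9]).*\1'
# g4: sample1_out|a.faa sample2_out|b.faa
```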
Inspired by Glenn's answer: use -i with sed to make the changes directly in the file.
sed -r '/(sample[0-9]).*\1/d' txt_file

How to edit a header in a fasta sequence by cutting some parts of it and keeping the main text of the sequence using a linux command line?

I have a multi-fasta file named fasta1.fasta that contains sequences and their IDs. What I want is to cut each sequence's header down so that it contains only the ID accession number. I used the command grep '>' fasta1.fasta | cut -d " " -f 1 to cut the parts that I want from the headers, but the output I get is only the ID accession numbers, without the rest of the sequences. My sequences look like this:
>tr|Q8IBQ5|Q8IBQ5_PLAF7 40S ribosomal protein S10, putative OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_$
MDKQTLPHHKYSYIPKQNKKLIYEYLFKEGVIVVEKDAKIPRHPHLNVPNLHIMMTLKSL
KSRNYVEEKYNWKHQYFILNNEGIEYLREFLHLPPSIFPATLSKKTVNRAPKMDEDISRD
VRQPMGRGRAFDRRPFE
>tr|Q8IEB1|Q8IEB1_PLAF7 TBC domain protein, putative OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_132020$
MEYKLEFLSYLLIFKKKNERISKFDEQIKTCINIFEKSIINESDLKYLFERNILDMNPGV
RSMCWKLALKHLSLDSNKWNTELIEKKKLYEEYIKSFVINPYYSCVDNKKKEFVKETEKE
PKGKNMKDEYIEYNLDRNKTYYHKDDSLLKLQNDNNTKQMDYLEDEKYSSMDDECSEDNW
The output that I get is:
>tr|Q8IBQ5|Q8IBQ5_PLAF7
>tr|Q8IEB1|Q8IEB1_PLAF7
While the desired output is:
>tr|Q8IBQ5|Q8IBQ5_PLAF7
MDKQTLPHHKYSYIPKQNKKLIYEYLFKEGVIVVEKDAKIPRHPHLNVPNLHIMMTLKSL
KSRNYVEEKYNWKHQYFILNNEGIEYLREFLHLPPSIFPATLSKKTVNRAPKMDEDISRD
VRQPMGRGRAFDRRPFE
>tr|Q8IEB1|Q8IEB1_PLAF7
MEYKLEFLSYLLIFKKKNERISKFDEQIKTCINIFEKSIINESDLKYLFERNILDMNPGV
RSMCWKLALKHLSLDSNKWNTELIEKKKLYEEYIKSFVINPYYSCVDNKKKEFVKETEKE
PKGKNMKDEYIEYNLDRNKTYYHKDDSLLKLQNDNNTKQMDYLEDEKYSSMDDECSEDNW
Any help will be appreciated. Thank you.
Variant 1:
sed '/^>/s/ .*//'
Variant 2:
perl -pe 's/ .*// if /^>/'
That is, in all lines that start with >, remove everything after and including the first space.
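For instance, on a one-record toy file (the header text is invented):

```shell
# Trim everything after the first space, but only on header lines.
printf '>tr|Q1|Q1_X some description here\nMDKQT\n' | sed '/^>/s/ .*//'
# >tr|Q1|Q1_X
# MDKQT
```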

Remove lines with duplicate cells

I need to remove lines with a duplicate value. For example, I need to remove lines 2 and 4 in the block below because they contain "Value04" for a user that already has a "Value03" line - I cannot simply remove all lines containing "Value04", because there are lines with that value that are NOT duplicates and must be kept. I can use any editor: Excel, vim, or any other Linux command-line tool.
In the end there should be no duplicate "UserX" values; User1 should only appear once. If User1 exists twice, I need to remove the entire line containing "Value04" and keep the one with "Value03".
Value01,Value03,User1
Value02,Value04,User1
Value01,Value03,User2
Value02,Value04,User2
Value01,Value03,User3
Value01,Value03,User4
Your ideas and thoughts are greatly appreciated.
The following Awk command removes all but the first occurrence of a value in the third column:
$ awk -F',' '{
if (!seen[$3]) {
seen[$3] = 1
print
}
}' textfile.txt
Output:
Value01,Value03,User1
Value01,Value03,User2
Value01,Value03,User3
Value01,Value03,User4
The same thing in Perl:
perl -F, -nae 'print unless $c{$F[2]}++;' textfile.txt
This uses autosplit mode: -F, with -a splits each line on commas and places the result into the @F array.
