Replace fasta headers using sed command - linux

I have a fasta file which looks like this.
>header1
ATGC....
>header2
ATGC...
My list file looks like this:
organism1
organism2
and contains a list of organisms that I want to replace the headers with.
I tried to use a for loop using sed command which is as follows:
for i in `cat list7b`; do sed "s/^>/$i/g" sequence.fa; done
but it didn't work. Please tell me how I can achieve this task.
The result file should look like this
>organism1
ATGC...
>organism2
ATGC....
that is, >header1 is replaced with >organism1, and so on.
Header lines are distinguished from sequence lines because a header always starts with a > (greater-than) sign, whereas a sequence line does not.
The header lines should be replaced in order of appearance, i.e. the first header is replaced with the first line of the list file, the second header with the second line, and so on.
I would also appreciate an explanation of the logic, if possible.
Thanks in advance.

With awk this is easy to do in one run.
Assuming your fasta file is named sequence.fa and your organisms list file is named list7b as in the question you can use
awk 'NR == FNR { o[n++] = $0; next } /^>/ && i < n { $0 = ">" o[i++] } 1' list7b sequence.fa > output.fa
Explanation:
NR == FNR is a condition for doing something with the first file only. (total number of records is equal to number of records in current file)
{ o[n++] = $0; next } puts the input line into array o, counts the entries and skips further processing of the input line, so o will contain all your organism lines.
The next part is executed for the remaining file(s).
/^>/ && i < n is valid for lines that start with > as long as i is less than the number of elements n that were put into array o.
{ $0 = ">" o[i++] } replaces the current line with > followed by the array element (i.e. a line from the first file) and increments the index i to the next element.
1 is an "always true" condition with the implicit default action { print } to print the current line for every input line.
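To see the command at work, here is a minimal sketch that builds two tiny sample files in a scratch directory and runs the same awk program on them. The sequence lines and the use of mktemp are assumptions made for illustration only. It also shows why the sed loop could not work: each pass of the loop processes the whole file and substitutes the same name for the > of every header.

```shell
# Scratch directory so the sketch does not touch real data (assumption).
tmp=$(mktemp -d)

# Tiny stand-ins for sequence.fa and list7b from the question.
printf '%s\n' '>header1' 'ATGC' '>header2' 'GGCC' > "$tmp/sequence.fa"
printf '%s\n' 'organism1' 'organism2' > "$tmp/list7b"

# First file: store each name in o[]. Second file: replace each header
# line with ">" plus the next stored name. The final 1 prints every line.
awk 'NR == FNR { o[n++] = $0; next }
     /^>/ && i < n { $0 = ">" o[i++] }
     1' "$tmp/list7b" "$tmp/sequence.fa" > "$tmp/output.fa"

cat "$tmp/output.fa"
```

If the list has fewer names than the fasta file has headers, the i < n test leaves the remaining headers unchanged.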

Related

Repeat each line multiple times and add ascending numbers

Would like to have each line in a file repeated a fixed number of times and add ascending numbers, like this:
I have
wwx.domain.com/pageA/?page=1
wwx.domain.com/pageB/?page=1
wwx.domain.com/pageC/?page=1
I want
wwx.domain.com/pageA/?page=1
wwx.domain.com/pageA/?page=2
wwx.domain.com/pageA/?page=3
wwx.domain.com/pageB/?page=1
wwx.domain.com/pageB/?page=2
wwx.domain.com/pageB/?page=3
wwx.domain.com/pageC/?page=1
wwx.domain.com/pageC/?page=2
wwx.domain.com/pageC/?page=3
How can I do this?
awk '{ sub(/.$/,""); for(i=1; i<4; i++) print $0 i }' inputfile > outputfile
Explanation: Remove the last character from the input line, and in a loop print the (modified) input line followed by the loop index.
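A small variation, as a sketch: the repeat count can be passed in with -v instead of being hard-coded in the loop. The variable name n and the two-line sample file are assumptions for illustration. Like the command above, it relies on the page number being a single trailing character.

```shell
# Scratch directory and a two-line sample input (assumptions).
tmp=$(mktemp -d)
printf '%s\n' 'wwx.domain.com/pageA/?page=1' \
              'wwx.domain.com/pageB/?page=1' > "$tmp/inputfile"

# Strip the last character, then print the line n times with 1..n appended.
awk -v n=3 '{ sub(/.$/, ""); for (i = 1; i <= n; i++) print $0 i }' \
    "$tmp/inputfile" > "$tmp/outputfile"

cat "$tmp/outputfile"
```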
This might work for you (GNU sed):
sed -E 'h;:a;s/[^\n]*/&/3;t;x;s/(.*=)(.*)/echo "\1$((\2+1))"/e;x;G;ta' file
While the pattern space contains fewer than n (in this case 3) lines, append the current line, incrementing the last field by 1.
The solution uses the hold space to keep the last incremented line and shell arithmetic to increment the last field of the line.

bash script and awk to sort a file

so I have a project for uni, and I can't get through the first exercise. Here is my problem:
I have a file, and I want to select some data inside of it and 'display' it in another file. But the data I'm looking for is a little bit scattered in the file, so I need several awk commands in my script to get them.
Query= fig|1240086.14.peg.1
Length=76
Score E
Sequences producing significant alignments: (Bits) Value
fig|198628.19.peg.2053 140 3e-42
> fig|198628.19.peg.2053
Length=553
In the sample above, you can see that there are 2 types of 'Length=' lines, and I only want to 'catch' the ones that come just after a "Query=" line.
I have to use awk so I tried this :
awk '{if(/^$/ && $(NR+1)/^Length=/) {split($(NR+1), b, "="); print b[2]}}'
but it doesn't work... does anyone have an idea?
You need to understand how Awk works. It reads a line, evaluates the script, then starts over, reading one line at a time. So there is no way to say "the next line contains this". What you can do is "if this line contains, then remember this until ..."
awk '/Query=/ { q=1; next } /Length/ && q { print } /./ { q=0 }' file
This sets the flag q to 1 (true) when we see Query= and then skips to the next line. If we see Length and we recently saw Query= then q will be 1, and so we print. In other cases, set q back to "not recently seen" on any non-empty line. (I put in the non-empty condition to allow for empty lines anywhere without affecting the overall logic.)
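Here is a minimal sketch of the flag approach on a cut-down version of the sample. The file name blast.txt and the blank lines between records are assumptions (the excerpt in the question may have lost its blank lines).

```shell
# Scratch directory and a reduced copy of the sample (assumptions).
tmp=$(mktemp -d)
cat > "$tmp/blast.txt" <<'EOF'
Query= fig|1240086.14.peg.1

Length=76

> fig|198628.19.peg.2053
Length=553
EOF

# q is set on a Query= line, survives empty lines, and is cleared by any
# other non-empty line, so only the first Length= line is printed.
result=$(awk '/Query=/ { q=1; next } /Length/ && q { print } /./ { q=0 }' "$tmp/blast.txt")
echo "$result"
```

The second Length= line is skipped because the non-empty "> fig|..." line has already reset q to 0.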
awk solution:
awk '/^Length=/ && r~/^Query/{ sub(/^[^=]+=/,""); printf "%s ",$0 }
NF{ r=$0 }END{ print "" }' file
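NF{ r=$0 } - remembers each non-empty line in r, so r holds the previous non-empty line when the next line is processed
/^Length=/ && r~/^Query/ - matches a Length line whose previous non-empty line started with Query (ensured by r~/^Query/); sub() then strips everything up to and including the = sign before printing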
It sounds like this is what you want for the first part of your question:
$ awk -F'=' '!NF{next} f && ($1=="Length"){print $2} {f=($1=="Query")}' file
76
but I don't know what the second part is about, since there are no "data" lines in your input and only 1 valid output from your sample input, as best I can tell.

extracting first line from file command such that

I have a file with almost 5*(10^6) lines of integer numbers. So, my file is big enough.
The question is about extracting specific lines by filtering them with a condition.
For example, I'd like to:
1. Extract the first N lines without reading the entire file.
2. Extract the lines with numbers less than or equal to X (or >=, <=, <, >).
3. Extract the lines that satisfy a condition on the number (a math predicate).
Is there a clever way to perform these tasks (using sed, awk, cat, or head)?
Thanks in advance.
To extract the first $NUMBER lines,
head -n $NUMBER filename
Assuming every line contains just a number (although it will also work if the first token is one), 2 can be solved like this:
awk '$1 >= 1234 && $1 < 5678' filename
And keeping in spirit with that, 3 is just the extension
awk 'condition' filename
It would have helped if you had specified what condition is supposed to be, though. This way, you'll have to read the awk documentation to find out how to code it. Again, the number will be represented by $1.
I don't think I can explain anything about the head call, it's really just what it says on the tin. As for the awk lines: awk, like sed, works linewise. awk fetches lines in a loop and applies your code to each line. This code takes the form
condition1 { action1 }
condition2 { action2 }
# and so forth
For every line awk fetches, the conditions are checked in the order they appear, and the associated action to each condition is performed if the condition is true. It would, for example, have been possible to extract the first $NUMBER lines of a file with awk like this:
awk -v number="$NUMBER" '1 { print } NR == number { exit }' filename
where 1 is synonymous with true (like in C) and NR is the line number. The -v command line option initializes the awk variable number to $NUMBER. If no action is specified, the default action is { print }, which prints the whole line. So
awk 'condition' filename
is shorthand for
awk 'condition { print }' filename
...which prints every line where the condition holds.
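The three tasks can be tried out on a small generated file. This is only a sketch: the file name numbers.txt, the thresholds, and the "multiple of 7" predicate are arbitrary choices for illustration.

```shell
# Scratch directory with the integers 1..20, one per line (assumption).
tmp=$(mktemp -d)
seq 1 20 > "$tmp/numbers.txt"

# Task 1: the first 3 lines; head stops reading once it has them.
first=$(head -n 3 "$tmp/numbers.txt")

# Task 2: lines whose number lies in a range.
range=$(awk '$1 >= 5 && $1 < 8' "$tmp/numbers.txt")

# Task 3: an arbitrary math predicate, here "multiple of 7".
multiples=$(awk '$1 % 7 == 0' "$tmp/numbers.txt")

printf '%s\n' "$first" "$range" "$multiples"
```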

Concatenating 500 files while removing the first line from each file except the first file

I want to create a file which is the result of concatenation of 500 files where the first line of each file except the first file is deleted. I also want the original files unchanged.
I know that cat and sed should be piped but I cannot wrap my mind around it!
For the moment, what I can think of is as follows:
1. Back up the original files.
2. Remove the header from every file using:
for x in *.seg; do sed -i 1d ${x}; done
3. Concatenate the files using cat.
4. Add the header to the result of step 3.
Can you propose a pipe that can do this while keeping the original files intact?
You could use awk to do this:
awk 'NR == FNR || FNR > 1' *.seg > destination
For the first file, the total record number NR will equal the record number of the current file FNR, so all lines will be printed. For other files, only lines after the first will be printed. The output is redirected to a file destination.
As you have 500 files, the FNR > 1 will evaluate to true more often than the NR == FNR, so you may want to switch around the order so that short-circuiting takes place:
awk 'FNR > 1 || NR == FNR' *.seg > destination
When the first part of the || is true, the second part is not evaluated at all. That saves one comparison per line, although for most inputs the difference will be small ;)
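A quick way to check the behaviour is to run it on three small files that share a header line. The file names and contents below are made up for illustration.

```shell
# Scratch directory with three small .seg files sharing a header (assumption).
tmp=$(mktemp -d)
printf '%s\n' 'id,value' '1,a' > "$tmp/a.seg"
printf '%s\n' 'id,value' '2,b' > "$tmp/b.seg"
printf '%s\n' 'id,value' '3,c' > "$tmp/c.seg"

# Keep every line of the first file, and all but line 1 of the others.
# The input files are only read, never modified.
awk 'FNR > 1 || NR == FNR' "$tmp"/*.seg > "$tmp/destination"

cat "$tmp/destination"
```

The shell expands *.seg in sorted order, so the header that survives is the one from the file that sorts first.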

how to remove text block (pattern) from a file with sed/awk

I have thousands of text files that I have imported that contain a piece of text that I would like to remove.
It is not just a block of text but a pattern.
<!--
# Translator(s):
#
# username1 <email1>
# username2 <email2>
# usernameN <emailN>
#
-->
The block, if it appears, lists one or more users with their email addresses.
I have another small awk program that accomplishes the task in very few lines of code. It can be used to remove patterns of text from a file. Both the start and the stop regexp can be set.
# Usage: awk -f remove_email.awk yourfile
#
# The first rule uses a range pattern: it matches every line from one that
# contains '<!--' through the next line that contains '-->', inclusive.
# awk still processes one line at a time, so $0 holds only the current line
# of the range, not the whole block.
#
# 'next' discards the current line and restarts the program with the next
# input line, so lines inside the range never reach the print rule below.
#
# The commented-out if statement is not needed for the task; uncommented, it
# prints a message for every line in the range that contains a '#'.
/<!--/, /-->/ {
    #if ($0 ~ /#/) {
    #    print "Found an email and removed that!"
    #}
    next
}
# Print every line that was not skipped by the rule above.
1 {
    print
}
Save the code in 'remove_email.awk' and run it by:
awk -f remove_email.awk yourfile
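As a sketch, the same range-pattern idea can be tried as a one-liner on a small sample. The file names page.txt and cleaned.txt and the surrounding text lines are assumptions. Note that this removes every <!-- ... --> block, not only the translator blocks.

```shell
# Scratch directory and a sample file with one translator block (assumptions).
tmp=$(mktemp -d)
cat > "$tmp/page.txt" <<'EOF'
Title
<!--
# Translator(s):
#
# username1 <email1>
#
-->
Body text
EOF

# Skip every line from '<!--' through '-->', print everything else.
awk '/<!--/, /-->/ { next } 1' "$tmp/page.txt" > "$tmp/cleaned.txt"

cat "$tmp/cleaned.txt"
```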
This sed solution might work:
sed '/^<!--/,/^-->/{/^<!--/{h;d};H;/^-->/{x;/^<!--\n# Translator(s):\n#\(\n# [^<]*<email[0-9]\+>\)\+\n#\n-->$/!p};d}' file
An alternative (perhaps better solution?):
sed '/^<!--/{:a;N;/^-->/M!ba;/^<!--\n# Translator(s):\n#\(\n# \w\+ <[^>]\+>\)\+\n#\n-->/d}' file
This gathers up the lines from the one that starts with <!-- to the one that starts with -->, then pattern matches on the whole collection: the second line is # Translator(s):, the third line is #, the fourth and possibly further lines have the form # username <email address>, the penultimate line is #, and the last line is -->. If the collection matches, it is deleted entirely; otherwise it is printed as normal.
For this task you need look-ahead, which is normally done with a parser.
Another solution, though not a very efficient one, would be:
sed "s/-->/&\n/;s/<!--/\n&/" file | awk 'BEGIN {RS = "";FS = "\n"}/username/{print}'
HTH Chris
perl -i.orig -00 -pe 's/<!--\s+#\s*Translator.*?\s-->//gs' file1 file2 file3
Here is my solution, if I understood your problem correctly. Save the following to a file called remove_blocks.awk:
# At the start of a block, mark it
/<!--/ {
    state = "block_started"
}
# At the end of the block, if the block does not contain an email address,
# print out the whole block.
/^-->/ {
    if (!block_contains_user_email) {
        for (i = 0; i < count; i++) {
            print saved_line[i]
        }
        print
    }
    count = 0
    block_contains_user_email = 0
    state = ""
    next
}
# Inside a block: save the lines and wait until the end of the block
# to decide if we should print it out
state == "block_started" {
    saved_line[count++] = $0
    # Assumption: a user line looks like "# name <user@example.com>", so the
    # third field of a real address contains an "@".
    if (NF >= 3 && $3 ~ /@/) {
        block_contains_user_email = 1
    }
    next
}
# For everything else, print the line
1
Assume that your text file is in data.txt (or many files, for that matter):
awk -f remove_blocks.awk data.txt
The above command will print out everything in the text file, minus the blocks which contain user email.
