How to remove the first three characters from the FASTA file header - Linux

I have a fasta file like this:
>rna-XM_00001.1
actact
>rna-XM_00002.1
atcatc
How do I remove the 'rna-' so it becomes:
>XM_00001.1
actact
>XM_00002.1
atcatc

What you're showing is the file contents? Then sed should be able to do this:
sed 's/^>rna-/>/' < inputfile > outputfile
Explanation:
The first character of the command-line to sed is s, which tells sed to do substitution
The / are delimiters
The ^ tells sed to look only at the start of a line
The next >rna- is the pattern to match at the start of a line
The next > is the replacement substituted for the pattern
If, instead, you want to always remove the first four characters after a > as long as they end in -, you could use:
sed 's/^>...-/>/' < inputfile > outputfile
Explanation:
This is similar to above, except the pattern to match at the start of a line is >...-. The pattern is a regexp, where a . matches any single character. So this pattern matches any line starting with >, followed by any three characters, followed by -.
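Both commands can be checked against the sample from the question. This is only a minimal sketch: it assumes the records are fed on standard input instead of being read from inputfile, and the variable names sample, literal and wildcard are made up for the demonstration.

```shell
# Sample FASTA records from the question, kept in a variable for the demo.
sample=$(printf '%s\n' '>rna-XM_00001.1' 'actact' '>rna-XM_00002.1' 'atcatc')

# Variant 1: strip the literal prefix "rna-" that follows ">".
literal=$(printf '%s\n' "$sample" | sed 's/^>rna-/>/')

# Variant 2: strip any three characters followed by "-" after ">".
wildcard=$(printf '%s\n' "$sample" | sed 's/^>...-/>/')

printf '%s\n' "$literal"
```

Sequence lines do not start with >, so neither pattern touches them.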

Related

Partial replace with sed command

We have a file with some UTF-16 decimal character codes, and we would like to replace them in the following manner:
Test Line in a file \u343- ? some random words \u1233? 300 \u241? \u208?\cell
The required out put is
Test Line in a file \u343- ? some random words UTF16-1233| 300 UTF16-241| UTF16-208|\cell
The requirement is to change \u[0-9]+? to UTF16-[0-9]+|
Replace the initial \u to UTF16- and the ending ? with a pipe |.
Please note that if there is any non-digit character between \u and ?, the sequence should not be replaced.
Using sed to modify the file in place, you can:
Match \\u([0-9]+)\?:
Match a literal \u, match and capture one or more digits, then match a literal ?.
Replace with UTF16-\1|:
Replace with the string UTF16-, followed by the captured group, followed by a pipe |.
$ sed -i -E 's/\\u([0-9]+)\?/UTF16-\1|/g' file
$ cat file
Test Line in a file \u343- ? some random words UTF16-1233| 300 UTF16-241| UTF16-208|\cell
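To try the substitution without touching a file, the same expression can be run on standard input. This is a small sketch that assumes the sample line from the question; the variable names line and result are only for illustration.

```shell
# The sample line from the question; single quotes keep the backslashes literal.
line='Test Line in a file \u343- ? some random words \u1233? 300 \u241? \u208?\cell'

# Same expression as above, but without -i: read stdin, write stdout.
result=$(printf '%s\n' "$line" | sed -E 's/\\u([0-9]+)\?/UTF16-\1|/g')

printf '%s\n' "$result"
```

The \u343- ? sequence is left alone because a - sits between the digits and the ?.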

Remove new line character by checking the expression, using sed

I have to write a script that updates a file in this way.
raw file:
<?blah blah blah?>
<pen>
<?pineapple?>
<apple>
<pen>
Final file:
<?blah blah blah?><pen>
<?pineapple?><apple><pen>
Wherever a newline character in the file is not followed by
<?
we have to remove that newline, so that the following line is appended to the end of the previous line.
It would also be really helpful if you explained how your sed command works.
Perl solution:
perl -pe 'chomp; substr $_, 0, 0, "\n" if $. > 1 && /^<\?/'
-p reads the input line by line, printing each line after changes
chomp removes the final newline
substr with four arguments modifies the input string; here it prepends a newline if the line is not the first one ($. is the input line number) and starts with <?.
Sed solution:
sed ':a;N;$!ba;s/\n\(<[^?]\)/\1/g' file > newfile
The basic idea is to replace every
\n followed by < not followed by ?
with what you matched except the \n.
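Since the question asks how the sed command works, here is the same script split into one -e per command with comments. It is a sketch run on the sample from the question via standard input; the variable name joined is made up for the demo.

```shell
# :a     define a label named "a"
# N      append the next input line to the pattern space, separated by \n
# $!ba   if this is not the last line, branch back to label "a"
#        (so the whole file ends up in the pattern space)
# s///g  delete every \n that is followed by "<" and a character other than "?"
joined=$(printf '%s\n' '<?blah blah blah?>' '<pen>' '<?pineapple?>' '<apple>' '<pen>' |
    sed -e ':a' -e 'N' -e '$!ba' -e 's/\n\(<[^?]\)/\1/g')

printf '%s\n' "$joined"
```

Because the loop reads the whole file into memory first, this approach is better suited to small files.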
If you are happy with a solution that puts every <? at the start of a line, you can combine tr with sed.
tr -d '\n' < inputfile| sed 's/<?/\n&/g;$s/$/\n/'
Explanation:
I use tr ... < inputfile and not cat inputfile | tr ..., which avoids an additional cat call.
The sed command has 2 parts.
In s/<?/\n&/g it will insert a newline and with & it will insert the matched string (in this case always <?, so it will only save one character).
With $s/$/\n/ a newline is appended at the end of the last line.
EDIT: If you only want newlines before <? where the input already had them,
you can use awk:
awk 'NR > 1 && /^<\?/ {print ""} {printf("%s", $0)} END {print ""}' inputfile
Explanation:
Consider the newline as the start of a line, not the end. Then your question becomes "write a newline when the line starts with <?". You must escape the ? and use ^ for the start of the line. A bare print would output the whole line, so print "" is used to write only the newline, and NR > 1 avoids an empty first line.
awk 'NR > 1 && /^<\?/ {print ""}'
Next, print the line you read without a newline character.
Finally, END {print ""} writes the closing newline at the end.
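The "newline belongs to the start of the line" idea can be checked on the sample from the question. This is a sketch, not necessarily the exact command from the answer: it uses print "" to emit only a newline and skips it on the first line, and the variable name merged is made up for the demo.

```shell
# Emit a newline before each line that starts with "<?" (except line 1),
# print every line without its newline, and finish with one closing newline.
merged=$(printf '%s\n' '<?blah blah blah?>' '<pen>' '<?pineapple?>' '<apple>' '<pen>' |
    awk 'NR > 1 && /^<\?/ {print ""} {printf "%s", $0} END {print ""}')

printf '%s\n' "$merged"
```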

delete a line after a pattern only if it is blank using sed or awk

I want to delete a blank line only if it comes directly after a line matching my pattern, using sed or awk.
For example, if I have
G

O TO P999-ERREUR
END-IF.
and the pattern in this case is G,
I want to have this output:
G
O TO P999-ERREUR
END-IF.
This will do the trick:
$ awk -v n=-2 'NR==n+1 && !NF{next} /G/ {n=NR}1' file
G
O TO P999-ERREUR
END-IF.
Explanation:
-v n=-2 # Set n=-2 before the script runs so that a blank first line is never skipped
NR == n+1 # If the current line number is equal to the matching line + 1
&& !NF # And the line is empty
{next} # Skip the line (don't print it)
/G/ # The regular expression to match
{n = NR} # Save the current line number in the variable n
1 # Truthy value used as shorthand to print every (non-skipped) line
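To see that only the blank line directly after the match is dropped, the command can be run on a slightly extended input. This is a sketch under an assumption: the second blank line (before END-IF.) is not in the question and was added here to show that it survives. The variable name result is made up.

```shell
# Input: a blank line right after "G" (should go) and another blank line
# later on that does not follow the pattern (should stay).
result=$(printf '%s\n' 'G' '' 'O TO P999-ERREUR' '' 'END-IF.' |
    awk -v n=-2 'NR==n+1 && !NF{next} /G/ {n=NR}1')

printf '%s\n' "$result"
```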
Using sed
sed '/GG/{N;s/\n$//}' file
If sed sees GG, it reads the next line and removes the newline between the two lines if the next line is empty.
Note that this only removes one blank line after the match, and the line must be truly empty, i.e. it must not contain spaces or tabs.
This might work for you (GNU sed):
sed -r 'N;s/(G.*)\n\s*$/\1/;P;D' file
Keep a moving window of two lines throughout the length of the file and remove a newline (and any whitespace) if it follows the intended pattern.
Using ex (edit in-place):
ex +'/G/j' -cwq foo.txt
or print to the standard output (from file or stdin):
ex -s +'/GG/j|%p|q!' file_or_/dev/stdin
where:
/GG/j - joins the next line when the pattern is found
%p - prints the buffer
q! - quits
For conditional checking (if there is a blank line), try:
ex -s +'%s/^\(G\)\n/\1/' +'%p|q!' file_or_/dev/stdin

Delete non-AGTC characters in a text file

I have a text file that should contain only the characters A, G, C and T. However, it sometimes has a few unknown characters, which I want to delete; if the character is N, I want to replace it with A instead. I also want to escape the lines that start with a > symbol.
So far I know only how to replace N with A, which I do like this :
sed "s/N/A/g" file1.fa >file2.fasta
But I don't know how to do the first task.
Example :
Initial file
first line
AGCCCMCCCN
Target file should be like this
first line
AGCCCCCCA
Any help will be appreciated. Thanks in advance!
You can add further substitutions to your sed command:
sed -e 's/N/A/g' -e 's/[^AGCT>]//g' -e 's/^>/\\>/' -e 's/[^\]>//g' file1.fa > file2.fasta
Pattern 1
-e 's/N/A/g'
Your pattern, replaces all instances of N with A first of all.
Pattern 2
-e 's/[^AGCT>]//g'
Secondly replace all characters that aren't A, G, C, T or > with nothing.
Pattern 3
-e 's/^>/\\>/'
Then replace a > at the start of a line with \>
Pattern 4
-e 's/[^\]>//g'
Finally, remove every > that is not preceded by a \. Note that [^\]> also matches the character before the >, so that character is deleted as well.
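If "escape" in the question means "leave header lines untouched", an alternative is to restrict both substitutions to lines that do not start with >. This is a sketch under that assumption. The > in front of "first line" is also an assumption, since the example in the question shows the header without it. The variable name cleaned is made up.

```shell
# /^>/! applies the following command only to lines NOT starting with ">".
# First turn N into A, then delete everything that is not A, G, C or T.
cleaned=$(printf '%s\n' '>first line' 'AGCCCMCCCN' |
    sed -e '/^>/!s/N/A/g' -e '/^>/!s/[^AGCT]//g')

printf '%s\n' "$cleaned"
```

With this form the header text survives, whereas a substitution applied to every line would also strip the letters of the header.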

Need oneliner to insert character after Nth occurrence of delimiter on 2nd line in unix

Hello, I need a one-liner to insert a character after the Nth occurrence of a delimiter on the 2nd line in Unix. The criteria are these:
Find position of nth occurrence of the delimiter.
Insert character after the nth occurrence.
This is on the 2nd line only.
Note: I am doing this in Linux.
With awk :
INPUT FILE
1 foo bar base
2 foo bar base
CODE
awk 'NR==2{$2=$2"X"; print}' file
You can specify a delimiter with -F.
NR selects the line we work on.
$2 is the second field, separated by spaces in this case.
$2=$2"X" appends X to that field by concatenation.
print on its own prints the entire line.
OUTPUT
2 fooX bar base
Suppose we have the input file:
$ cat file
1 foo bar base
2 foo bar base
To insert the character X after the 3rd occurrence of the delimiter space, use:
$ sed -r '2 s/([^ ]* ){3}/&X/' file
1 foo bar base
2 foo bar Xbase
To make the change to the file in place, use sed's -i option:
sed -i -r '2 s/([^ ]* ){3}/&X/' file
How it works
Consider the sed command:
2 s/([^ ]* ){3}/&X/
The initial 2 instructs sed to apply this command only to the second line.
We are using the s or substitute command. This command has the form s/old/new/ where old and new are:
old is the regular expression ([^ ]* ){3}. This matches everything up to and including the third occurrence of space.
new is the replacement text, &X. The ampersand refers to what we matched in old, which is all the line up to and including the third space. The X is the new character that we are inserting.
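The count and the inserted character can be turned into parameters. The function below is a hypothetical helper, not part of the answer: the name insert_after_nth_space is made up, it reads from standard input, and it assumes the inserted text contains no /, & or \ characters.

```shell
# Hypothetical helper: insert $2 after the $1-th space on line 2 of stdin.
# Double quotes let the shell expand $1 and $2 inside the sed script.
insert_after_nth_space() {
    sed -E "2 s/([^ ]* ){$1}/&$2/"
}

result=$(printf '%s\n' '1 foo bar base' '2 foo bar base' | insert_after_nth_space 2 X)

printf '%s\n' "$result"
```

Here -E is used in place of -r; GNU sed accepts both for extended regular expressions.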
This might work for you (GNU sed):
sed '2s/X/&Y/3' file
This inserts Y after the third occurrence of X on the second line only. The trailing 3 is a flag of the s command that selects which match on the line to replace.
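Applied to the space-delimited sample used earlier, the numeric flag might look like this. It is a sketch that assumes the delimiter is a single space and the character to insert is X; the variable name result is made up.

```shell
# On line 2 only, replace the 3rd space with itself (&) followed by X.
result=$(printf '%s\n' '1 foo bar base' '2 foo bar base' | sed '2s/ /&X/3')

printf '%s\n' "$result"
```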

Resources