Delete non-AGTC characters in a text file - Linux

I have a text file that should contain only the characters A, G, C and T. However, it sometimes has a few unknown characters, which I want to delete; if the character is N, I want to replace it with A instead. I also want to escape the lines which start with a > symbol.
So far I only know how to replace N with A, which I do like this:
sed "s/N/A/g" file1.fa >file2.fasta
But I don't know how to do the first task.
Example :
Initial file
first line
AGCCCMCCCN
Target file should be like this
first line
AGCCCCCCA
Any help will be appreciated. Thanks in advance!

You can add further substitutions to your sed command:
sed -e 's/N/A/g' -e 's/[^AGCT>]//g' -e 's/^>/\\>/' -e 's/[^\]>//g' file1.fa > file2.fasta
Pattern 1
-e 's/N/A/g'
Your pattern, replaces all instances of N with A first of all.
Pattern 2
-e 's/[^AGCT>]//g'
Secondly replace all characters that aren't A, G, C, T or > with nothing.
Pattern 3
-e 's/^>/\\>/'
Then replace a > at the start of a line with \>
Pattern 4
-e 's/[^\]>//g'
Finally, remove every > that isn't preceded by a \. Note that the character in front of such a > is matched by [^\] and is removed along with it.
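If "escape" in the question is read as "leave the lines starting with > untouched" (an assumption, the question does not say so explicitly), the clean-up can be restricted to the other lines with a negated address. A minimal sketch; the function name clean_fasta is made up for illustration:

```shell
# Hypothetical helper: reads FASTA-like text on stdin.
# Lines starting with ">" are passed through unchanged (assumed meaning of "escape").
# On every other line, N becomes A and anything that is not A, G, C or T is deleted.
clean_fasta() {
  sed '/^>/!{
    s/N/A/g
    s/[^AGCT]//g
  }'
}

printf '>first line\nAGCCCMCCCN\n' | clean_fasta
```

Because the header lines never reach the substitutions, > no longer needs to be listed in the bracket expression.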

Related

How to remove the first three characters from the fasta file header

I have a fasta file like this:
>rna-XM_00001.1
actact
>rna-XM_00002.1
atcatc
How do I remove the 'rna-' so it becomes
>XM_00001.1
actact
>XM_00002.1
atcatc
What you're showing is the file contents? Then sed should be able to do this:
sed 's/^>rna-/>/' < inputfile > outputfile
Explanation:
The first character of the command-line to sed is s, which tells sed to do substitution
The / are delimiters
The ^ tells sed to look only at the start of a line
The next >rna- is the pattern to match at the start of a line
The next > is the replacement substituted for the pattern
If, instead, you want to always remove the first four characters after a > as long as they end in -, you could use:
sed 's/^>...-/>/' < inputfile > outputfile
Explanation:
This is similar to above, except the pattern to match at the start of a line is >...-. The pattern is a regexp, where a . matches any single character. So this pattern matches any line starting with >, followed by any three characters, followed by -.
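To see the difference between the two patterns, here is a small sketch run on made-up headers (the cds- and gene- prefixes and the function names are assumptions added for illustration, they are not in the question):

```shell
# Hypothetical sample: three headers with different prefixes.
sample='>rna-XM_00001.1
actact
>cds-XM_00002.1
atcatc
>gene-XM_00003.1'

# Fixed prefix: only ">rna-" is shortened.
strip_rna() { sed 's/^>rna-/>/'; }

# Any three characters followed by "-": ">rna-" and ">cds-" are shortened,
# ">gene-" is left alone because its fifth character is not "-".
strip_any3() { sed 's/^>...-/>/'; }

printf '%s\n' "$sample" | strip_rna
printf '%s\n' "$sample" | strip_any3
```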

Get Text after word at specific position

I have a file like this
TT;12-11-18;text;abc;def;word
AA;12-11-18;tee;abc;def;gih;word
TA;12-11-18;teet abc;def;word
TT;12-11-18;tdd;abc;def;gih;jkl;word
I want output like this
TT;12-11-18;text;abc;def;word
TA;12-11-18;teet abc;def;word
I want to get word only if it occurs at position 5 after the date 12-11-18. I do not want the line if word is found after this position, that is, at the 6th or 7th position. The position count starts from the date 12-11-18.
I tried this command
cat file.txt|grep "word" -n1
This prints every line in which the pattern word is matched. How should I solve my problem?
Try this (GNU awk):
awk -F"[; ]" '/12-11-18/ && $6=="word"' file
Or a sed one:
sed -n '/12-11-18;\([^; ]*[; ]\)\{3\}word/p' file
Or grep with basically the same regex (different escaping):
grep -E "12-11-18;([^; ]*[; ]){3}word" file
[^; ] means any character that's not ; or a space.
* means match any number of repetitions of the preceding character/group.
-- [^; ]* means a string of any length that doesn't contain ; or a space; the ^ in [^; ] negates the set.
[; ] means ; or a space, exactly one occurrence.
() groups the above together.
{3} matches three repetitions of the preceding character/group.
As a whole, ([^; ]*[; ]){3} means three fields separated by ; or a space, including their delimiters.
As @kvantour points out, these could fail if there are multiple consecutive spaces in one place.
To treat multiple spaces as one separator, use:
awk -F"(;| +)" '/12-11-18/ && $6=="word"' file
and
grep -E "12-11-18;([^; ]*(;| +)){3}word" file
or GNU sed (-r enables extended regular expressions; POSIX basic regular expressions have no | alternation):
sed -rn '/12-11-18;([^; ]*(;| +)){3}word/p' file
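The commands above assume the date is always the second field. As a variation, here is a sketch that first locates the date field and then checks the field a fixed number of positions after it. The function name and its three parameters are hypothetical, and runs of ; or spaces are treated as one separator:

```shell
# Hypothetical helper: word_at_offset DATE TARGET OFFSET
# Prints lines where TARGET is exactly OFFSET fields after the first field equal to DATE.
word_at_offset() {
  awk -F'[; ]+' -v date="$1" -v target="$2" -v offset="$3" '
    {
      for (i = 1; i <= NF; i++) {
        if ($i == date) {
          # Reading a field beyond NF just yields an empty string.
          if ($(i + offset) == target) print
          break
        }
      }
    }'
}

printf '%s\n' \
  'TT;12-11-18;text;abc;def;word' \
  'AA;12-11-18;tee;abc;def;gih;word' \
  'TA;12-11-18;teet abc;def;word' \
  'TT;12-11-18;tdd;abc;def;gih;jkl;word' |
  word_at_offset 12-11-18 word 4
```

With the date counted as position 1, position 5 corresponds to an offset of 4 fields.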

How to remove all data from a file before a line containing a string passed in a variable in Linux

I am trying to remove the data above a certain line in a file. The line is the one containing a string that I pass in through a variable:
varfile=$(cat variable.txt)
echo "$varfile"
if [ -z "$varfile" ]; then
    echo "null"
else
    echo "data"
    sed "1,/$varfile/d" fileee.txt
fi
Here I take a string from variable.txt, look for that text in fileee.txt, and remove all the data above the matching line.
Example: variable.txt contains 3.
I look for 3 in fileee.txt and remove the data above it.
INPUT:
1
2
3
4
OUTPUT:
3
4
I suppose the issue here is that you want to remove all lines before the match, but not the matching line itself?
One way, with GNU sed, is to explicitly add a print for the matching line first:
pattrn=3
seq 1 4 | sed -e "/$pattrn/p;1,/$pattrn/d"
Though this will duplicate any further lines that match the pattern.
Better, invert the sense of the match:
seq 1 4 | sed -ne "/$pattrn/,\$p"
That is, don't print by default (-n), but print (p) everything from a match to the end ($, escaped because of the double-quoted string).
Even better would be to use awk:
pattrn=3
seq 1 4 | awk -vpat="$pattrn" '$0 ~ pat {p=1} p'
This sets a flag when a line ($0) matches the pattern (~ is a regex match), then prints every line once that flag is set.
The awk solution is also better in that special characters in the pattern don't cause issues (at least not as many); in the sed case, if the pattern contains a slash /, it will terminate the regex in the sed code, and cause syntax errors or allow for code injection.
I used seq from GNU coreutils here only to make up the sequence of numbers for input.
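If the string from variable.txt should be treated as literal text rather than as a regex, one option is awk's index() function, with the string handed over through the environment so that neither the shell nor awk interprets any characters in it. A minimal sketch; the function name print_from_match and the variable name needle are made up for illustration:

```shell
# Hypothetical helper: print_from_match STRING
# Prints stdin from the first line containing STRING (as literal text) to the end.
print_from_match() {
  needle="$1" awk 'index($0, ENVIRON["needle"]) { found = 1 } found'
}

# A slash and a dot in the search string are matched literally.
printf 'x\na/bXc\na/b.c\ny\n' | print_from_match 'a/b.c'
```

As in the question's script, an empty string is best rejected before calling this.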

Two pattern matches in the same sed command

I have the following sed command:
sed -n '/^out(/{n;p}' ${filename} | sed -n '/format/ s/.*format=//g; s/),$//gp; s/))$//gp'
I tried to do it in a single command, as in:
sed -n '/^out(/{n;}; /format/ s/.*format=//g; s/),$//gp; s/))$//gp' ${filename}
But that also displays the lines I don't want (those that do not match).
What I have is a file with some strings as in:
entry(variable=value)),
format(variable=value)),
entry(variable=value)))
out(variable=value)),
format(variable=value)),
...
I just want the format lines that come right after an out entry, with the trailing )) or ), removed.
You can use this sed command:
sed -nr '/^out[(]/ {n ; s/.*[(]([^)]+)[)].*/\1/p}' your_file
Once an out line is found, it advances to the next line (n) and uses the s command with the p flag to extract only what is inside the parentheses.
Explanation:
I used [(] instead of \(. Outside brackets, a ( usually means grouping; if you want a literal (, you need to escape it as \( or put it inside brackets. Most RE special characters don't need escaping when put inside brackets.
([^)]+) is a group (the outer parentheses here are RE metacharacters, not literal parentheses) that consists of one or more (+) characters that are not (^) a literal closing parenthesis; the ^ inverts the character class [ ... ].
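The command above prints the contents of whatever line follows an out line. If only format lines should be printed, a guard can be added after the n. This is a sketch under that assumption; the function name and the sample values are invented, and -E is used in place of -r:

```shell
# Hypothetical helper: prints the text inside format(...) only when that line
# directly follows a line starting with "out(".
format_after_out() {
  sed -nE '/^out[(]/{
    n
    /^format[(]/s/^format[(]([^)]+)[)].*/\1/p
  }'
}

printf '%s\n' \
  'entry(a=1)),' \
  'format(b=2)),' \
  'out(c=3)),' \
  'format(d=4)),' \
  'out(e=5)),' \
  'entry(f=6)))' |
  format_after_out
```

Here format(b=2) is skipped because it does not follow an out line, and entry(f=6) is skipped because it is not a format line.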

Need one-liner to insert a character after the Nth occurrence of a delimiter on the 2nd line in Unix

Hello, I need a one-liner to insert a character after the Nth occurrence of a delimiter on the 2nd line. The criteria are:
Find position of nth occurrence of the delimiter.
Insert character after the nth occurrence.
This is on the 2nd line only.
Note: I am doing this in Linux.
With awk:
INPUT FILE
1 foo bar base
2 foo bar base
CODE
awk 'NR==2{$2=$2"X"; print}' file
You can specify a delimiter with -F
NR specifies the line we work on
$2 is the 2nd value separated by space (in this case)
$2=$2"X" is a concatenation
print alone prints the entire line
OUTPUT
2 fooX bar base
Suppose we have the input file:
$ cat file
1 foo bar base
2 foo bar base
To insert the character X after the 3rd occurrence of the delimiter space, use:
$ sed -r '2 s/([^ ]* ){3}/&X/' file
1 foo bar base
2 foo bar Xbase
To make the change to the file in place, use sed's -i option:
sed -i -r '2 s/([^ ]* ){3}/&X/' file
How it works
Consider the sed command:
2 s/([^ ]* ){3}/&X/
The initial 2 instructs sed to apply this command only to the second line.
We are using the s or substitute command. This command has the form s/old/new/ where old and new are:
old is the regular expression ([^ ]* ){3}. This matches everything up to and including the third occurrence of space.
new is the replacement text, &X. The ampersand refers to what we matched in old, which is all the line up to and including the third space. The X is the new character that we are inserting.
This might work for you (GNU sed):
sed '2s/X/&Y/3' file
This inserts Y after the third occurrence of X on the second line only.
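The line number, the occurrence count and the inserted text can also be passed in as parameters. A minimal sketch built on the ([^ ]* ){3} form shown earlier; the function name is hypothetical, the delimiter is assumed to be a single space, and the inserted text is assumed not to contain /, & or \, since those are special in a sed replacement:

```shell
# Hypothetical helper: insert_after_nth LINE COUNT TEXT
# On line LINE, inserts TEXT after the COUNT-th space-terminated field.
insert_after_nth() {
  sed -E "$1 s/([^ ]* ){$2}/&$3/"
}

printf '1 foo bar base\n2 foo bar base\n' | insert_after_nth 2 3 X
```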
