How to grep the complete sequences containing a specific motif in a fasta file? - linux

How can I grep the complete sequences containing a specific motif in a fasta or txt file with one Linux command and write them into another file? I also want to include the lines beginning with ">" before the target sequences.
Example: I have a fasta file of 10,000 sequences.
$cat file.fa
>name1
AEDIA
>name2
ALKME
>name3
AAIII
I want to grep sequences containing KME, so I should get:
>name2
ALKME
Below is the approach I am currently using, based on the answers I got. Others may find it helpful too. Thanks to Pierre Lindenbaum, Philipp Bayer, cpad0112 and batMan.
First, preprocess the fasta file so that each sequence is on a single line (this step is important):
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa > file1.fa
Remove the leading empty line:
tail -n +2 file1.fa > file2.fa
Extract the target sequences containing the substring, along with their names, and save them to another file:
LC_ALL=C grep -B 1 KME file2.fa > result.txt
Note: KME is used here as an example target substring.
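The three steps can also be chained into a single pipeline with no intermediate files (a sketch combining the exact commands above):
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa \
    | tail -n +2 \
    | LC_ALL=C grep -B 1 KME > result.txt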

If you have multiline fasta files, first linearize with awk, then use a second awk to filter the sequences containing the motif. Using grep would be dangerous if a sequence name happened to contain the short motif.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '{if(index($2,"KME")!=0) printf("%s\n%s\n",$1,$2);}'
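Applied to the sample file above, this pipeline prints only the matching record:
>name2
ALKME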

grep -B1 KME file > output_file
-B1 : prints 1 line before the match as well
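One caveat: when there are several non-adjacent matches, GNU grep inserts a -- separator line between context groups. If your grep is GNU grep (check grep --help for the flag), the separator can be suppressed:
grep -B1 --no-group-separator KME file > output_file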

Related

Read in file line by line and search another file for a line with a partial match

I have a file with partial matches to lines in another file. To find them, I set up a while loop that reads each line of the partial-matches file and substitutes it into a grep command that searches a database file, but for some reason I am not getting any output (outputfile.txt is empty).
Here is my current script:
while read -r line; do
grep $line /path/to/databasefile >> /path/to/folder/outputfile.txt
done < "/partial_matches.txt"
The database file has multiple entries, each with a sequence name followed by a DNA sequence:
>transcript_ab
AGTCAGTCATGTC
>transcript_ac
AGTCAGTCATGTC
>transcript_ad
AGTCAGTCATGTC
and the partial matching search file has lines of text:
ab
ac
and I'm looking for a return of:
>transcript_ab
>transcript_ac
any help would be appreciated. Thanks.
If you are using GNU grep, then its -f option is what you are looking for:
grep -f /partial_matches.txt /path/to/databasefile
(if you don't have any regex patterns in partial_matches.txt but only literal strings, add -F so they are matched as fixed strings)
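For example, combining the two flags (a sketch reusing the paths from the question):
grep -Ff /partial_matches.txt /path/to/databasefile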
you can use a for loop instead:
for i in $(cat partial_matches.txt); do
grep "$i" /path/to/databasefile >> /path/to/folder/outputfile.txt
done
Also, check whether the path is a typo:
"/partial_matches.txt" (a file at the filesystem root) -> "./partial_matches.txt" (a file in the current directory)

Test if each line in a file contains one of multiple strings in another file

I have a text file (we'll call it keywords.txt) that contains a number of strings that are separated by newlines (though this isn't set in stone; I can separate them with spaces, commas or whatever is most appropriate). I also have a number of other text files (which I will collectively call input.txt).
What I want to do is iterate through each line in input.txt and test whether that line contains one of the keywords. After that, depending on what input file I'm working on at the time, I would need to either copy matching lines in input.txt into output.txt and ignore non-matching lines or copy non-matching lines and ignore matching.
I searched for a solution but, though I found ways to do parts of what I'm trying to do, I haven't found a way to do everything I'm asking for here. While I could try and combine the various solutions I found, my main concern is that I would end up wondering if what I coded would be the best way of doing this.
This is a snippet of what I currently have in keywords.txt:
google
adword
chromebook.com
cobrasearch.com
feedburner.com
doubleclick
foofle.com
froogle.com
gmail
keyhole.com
madewithcode.com
Here is an example of what can be found in one of my input.txt files:
&expandable_ad_
&forceadv=
&gerf=*&guro=
&gIncludeExternalAds=
&googleadword=
&img2_adv=
&jumpstartadformat=
&largead=
&maxads=
&pltype=adhost^
In this snippet, &googleadword= is the only line that would match the filter. Depending on the scenario, output.txt should contain either only the matching line or every line that doesn't match the keywords.
1. Assuming the content of keywords.txt is separated by newlines:
google
adword
chromebook.com
...
The following will work:
# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -Ff keywords.txt input.txt > output.txt
# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vFf keywords.txt input.txt > output.txt
2. Assuming the content of keywords.txt is separated by vertical bars:
google|adword|chromebook.com|...
The following will work:
# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -Ef keywords.txt input.txt > output.txt
# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vEf keywords.txt input.txt > output.txt
3. Assuming the content of keywords.txt is separated by commas:
google,adword,chromebook.com,...
There are many ways of achieving this, but a simple one is to use tr to replace all commas with vertical bars and then interpret the result with grep's extended regular expressions. Note the double quotes around the command substitution, which prevent word splitting:
# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -E "$(tr ',' '|' < keywords.txt)" input.txt > output.txt
# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vE "$(tr ',' '|' < keywords.txt)" input.txt > output.txt
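If the keywords can contain regex metacharacters (under -E, the dot in chromebook.com matches any character), a fixed-string variant is safer. A sketch using process substitution to turn the commas into newline-separated patterns for grep -F:
# Treat each comma-separated keyword as a literal string
grep -Ff <(tr ',' '\n' < keywords.txt) input.txt > output.txt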
Grep Options
-v, --invert-match
Selected lines are those not matching any of the specified patterns.
-F, --fixed-strings
Interpret each data-matching pattern as a list of fixed strings,
separated by newlines, instead of as a regular expression.
-E, --extended-regexp
Interpret pattern as an extended regular expression
(i.e. force grep to behave as egrep).
-f file, --file=file
Read one or more newline separated patterns from file.
Empty pattern lines match every input line.
Newlines are not considered part of a pattern.
If file is empty, nothing is matched.
Read more in man grep and man tr.

Using Sed or Awk to divide a file into two based on whether a line contains a numeric value

I have used sed and awk for a little while now, but I am having a challenge with the problem below. I am asking for an experienced sed/awk guru to help.
I have a file where some lines have numbers and some lines do not, like:
afjjdjfj.uihuihi
trfg.rtyhd
0rtgfd.tjbghhh
hbvfd4.rtgbvdgf
00fhfg.fdrgf
rtygfd.ijhniuh
etc.
I would like to split this into exactly two files, with every line represented in one of the two (none deleted).
One should contain all lines with any digit 0-9 in them, so given the file above the result would be:
0rtgfd.tjbghhh
hbvfd4.rtgbvdgf
00fhfg.fdrgf
and the other should contain the remaining lines, those without any digit 0-9, which given the file above would be:
afjjdjfj.uihuihi
trfg.rtyhd
rtygfd.ijhniuh
I've tried different strategies in both sed and awk and nothing gives me exactly what I need.
What would be the best sed or awk one-liner to solve this problem?
Thank you for your time,
Tom
Easily with awk. Note that the output file names must be quoted; unquoted, awk would treat file1 and file2 as (empty) variables:
awk '/[0-9]/{print > "file1"; next} {print > "file2"}' inputfile
With a single GNU sed command:
sed -ne '/[0-9]/w with_digits.txt' -e '//!w no_digits.txt' input
Results:
> cat no_digits.txt
afjjdjfj.uihuihi
trfg.rtyhd
rtygfd.ijhniuh
> cat with_digits.txt
0rtgfd.tjbghhh
hbvfd4.rtgbvdgf
00fhfg.fdrgf
w filename: Write the pattern space to filename.
If you don't mind running twice over the input, you can use just grep:
grep '[0-9]' input > with_digits
grep -v '[0-9]' input > without_digits
Or with Perl's File::Slurp (using -n rather than -p so lines are not also echoed to stdout, and re-adding the newline that -l strips):
perl -MFile::Slurp -lne '/\d/ ? append_file("digits.txt","$_\n") : append_file("no_digits.txt","$_\n")' input.txt
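Whichever variant you use, a quick sanity check that every input line ended up in exactly one of the two files (a sketch, using the file names from the sed example):
sort with_digits.txt no_digits.txt > combined.txt
sort inputfile | diff - combined.txt && echo "all lines accounted for"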

Linux Bash: extracting text from file into a variable

I haven't found anything that clearly answers my question, although some answers come very close.
I have a file with a line:
# Skipsdata for serienummer 1158
I want to extract the 4-digit number at the end and put it into a variable. The number changes from file to file, so I can't just search for "1158", but the "# Skipsdata for serienummer" part always remains the same.
I believe that grep, sed or awk may be the answer, but I'm not 100% clear on their usage.
Using awk:
numberRequired=$(awk '/# Skipsdata for serienummer/{print $NF}' file)
printf "%s\n" "$numberRequired"
1158
You can use grep with the -o switch, which prints only the matched part instead of the whole line.
To print all numbers at the end of lines in yourFile:
grep -Po '\d+$' yourFile
To print only four-digit numbers at the end of lines of the form described in your question:
grep -Po '^# Skipsdata for serienummer \K\d{4}$' yourFile
-P enables Perl-style regexes, which support \d and especially \K.
\d matches any digit (0-9).
\d{4} matches exactly four digits.
\K lets grep forget the previously matched part, such that only the part afterwards is printed.
There are multiple ways to find your number. Assuming the input data is in a file called inputfile:
mynumber=$(sed -n 's/# Skipsdata for serienummer //p' <inputfile) will print only the number and ignore all the other lines;
mynumber=$(grep '^# Skipsdata for serienummer' inputfile | cut -d ' ' -f 5) will filter the relevant lines first, then only output the 5th field (the number)
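If you prefer to avoid external tools entirely, bash's built-in regex matching can do the same. A minimal sketch, assuming the marker line looks exactly as shown above:
# capture the trailing 4-digit number into BASH_REMATCH
while IFS= read -r line; do
    if [[ $line =~ ^"# Skipsdata for serienummer "([0-9]{4})$ ]]; then
        mynumber=${BASH_REMATCH[1]}
    fi
done < inputfile
echo "$mynumber"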

Advanced GREP / AWK - Export Characters > [X]

I have a large log file with over 900k entries. I'd like to do the following using grep/awk (if it is even possible):
I'd like to export to a new txt file the entries containing the symbol "~", with one condition:
a line/entry should be included in the new txt file only if it uses the symbol "~" more than 2 times.
Any ideas on how (or if possible) to do this using Grep / AWK?
Thanks in advance!
give this one-liner a try:
awk -F'~' 'NF>3' file > newFile
-F defines the field separator; we set it to ~
a line with more than two ~s has more than three ~-separated fields, hence NF>3
if you also want to keep lines with exactly two ~s, change NF>3 into NF>2
You can do it with grep:
grep -E '~.*~.*~' input > output
or
grep -E '(~.*){3}' input > output
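An equivalent awk formulation counts the tildes explicitly: gsub returns the number of substitutions it performed, so replacing each ~ with itself yields the count (a sketch):
# keep lines containing more than two ~ characters
awk '{ n = gsub(/~/, "~") } n > 2' file > newFile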
