Trim fasta files using BioPython

I have a fasta file with multiple sequences in it. Some of the sequences are trailed with '-' and I'd like to trim them from the final sequences. Is there a clean way to trim them and write a new fasta file without the dashes using Biopython?
I saw the post How to remove all-N sequence entries from fasta file(s) and tried to adapt some of its code, but it didn't work.
My file contains a sequence like this:
>sequence_of_interest
CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCACCAGGCCAGATGAGAGAA---------------------------------------------------------------
from Bio import SeqIO

def dash_removal(file_in, file_out):
    records = SeqIO.parse(file_in, 'fasta')
    filtered = (rec for rec in records if any(ch != '-' for ch in rec.seq))
    SeqIO.write(filtered, file_out, 'fasta')

dash_removal("dash_removal_test.fasta", "dashes_gone?.fasta")
All of the sequences should ultimately be trimmed to look like this:
>sequence_of_interest
CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCACCAGGCCAGATGAGAGAA
Any help would be appreciated!

All the options using sed are great because they are faster, but here is a way to do it in BioPython.
The idea is to use rstrip on the seq attribute of each record. rstrip can be used on the sequence just like on any other string in Python.
from Bio import SeqIO
import io
seq = """>sequence_of_interest
CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCAT
GTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAA
TGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCA
CCAGGCCAGATGAGAGAA--------------------------------------------------------------"""
f = io.StringIO(seq) # replace it with f = open('my_fasta.fa', 'r')
clean_records = []
for record in SeqIO.parse(f, "fasta"):
    record.seq = record.seq.rstrip('-')
    clean_records.append(record)

with open('clean_fasta.fa', 'w') as f:
    SeqIO.write(clean_records, f, 'fasta')
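For larger files, the same idea also works as a generator, so the records stream through without all being held in memory at once. A minimal sketch, assuming the input lives in my_fasta.fa:

from Bio import SeqIO

def trim_dashes(records):
    # strip trailing '-' from each record's sequence as it streams through
    for record in records:
        record.seq = record.seq.rstrip('-')
        yield record

# SeqIO.write accepts any iterator of records
with open('my_fasta.fa') as fin, open('clean_fasta.fa', 'w') as fout:
    SeqIO.write(trim_dashes(SeqIO.parse(fin, 'fasta')), fout, 'fasta')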

Related

I'm looking for a way to extract strings from a text file using specific criteria

I have a text file containing random strings. I want to use specific criteria to extract the strings that match them.
Example text:
B311-SG-1700-ASJND83-ANSDN762
BAKSJD873-JAN-1293
Example criteria:
All the strings that contain characters separated by hyphens this way: XXXX-XX-XXXX
Output: 'B311-SG-1700'
I tried creating a function, but I can't figure out how to define criteria for strings specifically and how to apply them.
Based on your comment, here is a Python script that might do what you want (I'm not that familiar with Python).
import re
p = re.compile(r'\b(.{4}-.{2}-.{4})')
results = p.findall('B111-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293\nB211-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293 B311-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293')
print(results)
Output:
['B111-SG-1700', 'B211-SG-1700', 'B311-SG-1700']
You can read a file as a string like this:
with open("file.txt", "r") as text_file:
    data = text_file.read()
And use findall over that. Depending on the size of the file it might require a bit more work (e.g. reading line by line).
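Putting the two together, a minimal sketch that reads line by line (the file name is illustrative):

import re

p = re.compile(r'\b(.{4}-.{2}-.{4})')

results = []
with open("file.txt", "r") as text_file:
    for line in text_file:
        # collect matches without loading the whole file into memory
        results.extend(p.findall(line))
print(results)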
You can use re module to extract the pattern from text:
import re
text = """\
B311-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293
BAKSJD873-JAN-1293 B312-SG-1700-ASJND83-ANSDN762"""
for m in re.findall(r"\b.{4}-.{2}-.{4}", text):
    print(m)
Prints:
B311-SG-1700
B312-SG-1700

Trying to pull a Twitter handle from a text file

I am trying to extract a set of alphanumeric characters from a text file.
Below are some lines in the file. I want to extract the '#' as well as anything that follows it.
im trying to pull #bob from a file.
this is a #line in the #file
#bob is a wierdo
The code below is what I have so far.
def getAllPeople(fileName):
    # give empty list
    allPeople = []
    # open TweetsFile.txt
    with open(fileName, 'r') as f1:
        lines = f1.readlines()
        # split all words into strings
        for word in lines:
            char = word.split("#")
            print(char)
    # close the file
    f1.close()
What I am trying to get is:
['#bob','#line','#file', '#bob']
If you do not want to use re, take Andrew's suggestion:
mentions = list(filter(lambda x: x.startswith('#'), tweet.split()))
Otherwise, see the marked duplicate. Since you apparently cannot use filter or lambda, the same thing as a list comprehension:
mentions = [w for w in tweet.split() if w.startswith('#')]
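Applied to the original function, a minimal sketch that returns the list instead of printing it (the file name comes from the question's comment):

def getAllPeople(fileName):
    # collect every '#'-prefixed token from the file
    allPeople = []
    with open(fileName, 'r') as f1:
        for line in f1:
            allPeople.extend(w for w in line.split() if w.startswith('#'))
    return allPeople

print(getAllPeople('TweetsFile.txt'))  # ['#bob', '#line', '#file', '#bob']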

Parse multiline fasta file using record.id for filenames but not in headers

My current multiline fasta file is as such:
>chr1|chromosome:Mt4.0v2:1:1:52991155:1
ATGC...
>chr2|chromosome:Mt4.0v2:2:1:45729672:1
ATGC...
...and so on.
I need to parse the fasta file into separate files containing only the record.description in the header (everything after the |) followed by the sequence. However, I need to use the record.ids as the filenames (chr1.fasta, chr2.fasta, etc.). Is there any way to do this?
My current attempt at solving this is below. It produces only the description in the header, but it uses the last sequence's record.id as the filename. I need separate files.
from Bio import SeqIO
def yield_records(in_file):
    for record in SeqIO.parse(in_file, 'fasta'):
        record.description = record.id = record.id.split('|')[1]
        yield record

SeqIO.write(yield_records('/correctedfasta.fasta'), record.id + '.fasta', 'fasta')
Your code has almost everything that is needed. yield can also return more than one value, i.e. you could return both the filename and the record itself, e.g.
yield record.id.split('|')[0], record
but then BioPython would still bite you because the id gets written to the FASTA header. You would therefore need to modify both the id and overwrite the description (it gets concatenated to the id otherwise), or just assign identical values as you did.
A simple solution would be
from Bio import SeqIO
def split_record(record):
    old_id = record.id.split('|')[0]
    record.id = '|'.join(record.id.split('|')[1:])
    record.description = ''
    return old_id, record

filename = 'multiline.fa'
for record in SeqIO.parse(filename, 'fasta'):
    record = split_record(record)
    with open(record[0] + '.fa', 'w') as f:
        SeqIO.write(record[1], f, 'fasta')
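For completeness, here is a sketch of the two-value yield described above (the function name is illustrative):

from Bio import SeqIO

def yield_named_records(in_file):
    for record in SeqIO.parse(in_file, 'fasta'):
        parts = record.id.split('|')
        record.id = '|'.join(parts[1:])
        record.description = ''  # otherwise the old description is concatenated to the id
        yield parts[0], record   # (filename stem, cleaned record)

for name, record in yield_named_records('multiline.fa'):
    with open(name + '.fa', 'w') as f:
        SeqIO.write(record, f, 'fasta')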

Python read file contents into nested list

I have this file that contains something like this:
OOOOOOXOOOO
OOOOOXOOOOO
OOOOXOOOOOO
XXOOXOOOOOO
XXXXOOOOOOO
OOOOOOOOOOO
And I need to read it into a 2D list so it looks like this:
[[O,O,O,O,O,O,X,O,O,O,O],[O,O,O,O,O,X,O,O,O,O,O],[O,O,O,O,X,O,O,O,O,O,O],[X,X,O,O,X,O,O,O,O,O,O],[X,X,X,X,O,O,O,O,O,O,O],[O,O,O,O,O,O,O,O,O,O,O]]
I have this code:
ins = open(filename, "r")
data = []
for line in ins:
    number_strings = line.split()  # Split the line on runs of whitespace
    numbers = [(n) for n in number_strings]
    data.append(numbers)  # Add the "row" to your list.
return data
But it doesn't seem to be working because the O's and X's do not have spaces between them. Any ideas?
Just use data.append(list(line.rstrip())). list() accepts a string as its argument and splits it into a list of single characters.
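A minimal sketch of the complete function with that change (the file name is illustrative):

def read_grid(filename):
    data = []
    with open(filename, "r") as ins:
        for line in ins:
            # list() turns 'OOX' into ['O', 'O', 'X']
            data.append(list(line.rstrip()))
    return data

grid = read_grid("grid.txt")
print(grid[0])  # ['O', 'O', 'O', 'O', 'O', 'O', 'X', 'O', 'O', 'O', 'O']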

Insert spaces next to punctuation when writing to .txt file

I have written a function that uses an nltk tokenizer to preprocess .txt files. Basically, the function takes a .txt file, modifies it so that each sentence appears on a separate line, and overwrites the modified file on the old file.
I would like to modify the function (or maybe to create another function) to also insert spaces before punctuation and sometimes after punctuation, as in the case of a parenthesis. In other words, leaving aside what the function already does, I also would like it to change "I want to write good, clean sentences." into "I want to write good , clean sentences ."
I am a beginner, and I suspect I probably am just missing something pretty simple. A little help would be much appreciated.
My existing code is below:
import nltk.data
def readtowrite(filename):
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    with open(filename, 'r+') as f:
        fout = str(f.read())
        stuff = str('\n'.join(sent_detector.tokenize(fout.strip())))
        f.seek(0)
        f.write(stuff)
Here is the answer I came up with. Basically, I created a separate function to insert spaces before and after the punctuation in a sentence. I then called that function in the readtowrite function.
Code below:
import string
import nltk.data
def strip_punct(sentence):
    wordlist = []
    for word in sentence:
        for char in word:
            cleanword = ""
            if char in string.punctuation:
                char = " " + char + " "
            cleanword += char
            wordlist.append(cleanword)
    return ''.join(wordlist)

def readtowrite(filename):
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    with open(filename, 'r+') as f:
        fout = str(f.read())
        stuff = str('\n'.join(sent_detector.tokenize(fout.strip())))
        morestuff = str(strip_punct(stuff))
        f.seek(0)
        f.write(morestuff)
I think loading nltk.data.load('tokenizers/punkt/english.pickle') is equivalent to calling the sent_tokenize() and word_tokenize() functions in NLTK.
Maybe this script will be more helpful:
from nltk import sent_tokenize, word_tokenize

def readtowrite(infile, outfile):
    with open(outfile, 'w') as fout:
        with open(infile, 'r') as fin:
            output = "\n".join([" ".join(word_tokenize(i)) for i in sent_tokenize(fin.read())])
            fout.write(output)
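For example, after a one-time download of the punkt model (the file paths here are illustrative):

import nltk
nltk.download('punkt')  # fetch the tokenizer models used by sent_tokenize/word_tokenize

readtowrite('input.txt', 'tokenized.txt')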
