Storing the Output to a FASTA file - python-3.x

from Bio import SeqIO
from Bio import SeqRecord
from Bio import SeqFeature
for rec in SeqIO.parse("C:/Users/Siva/Downloads/sequence.gp","genbank"):
    if rec.features:
        for feature in rec.features:
            if feature.type =="Region":
                seq1 = feature.location.extract(rec).seq
                print(seq1)
                SeqIO.write(seq1,"region_AA_output1.fasta","fasta")
I am trying to write the output to a FASTA file, but I am getting an error. Can anybody help me?
This is the error I got:
Traceback (most recent call last):
File "C:\Users\Siva\Desktop\region_AA.py", line 10, in <module>
SeqIO.write(seq1,"region_AA_output1.fasta","fasta")
File "C:\Python34\lib\site-packages\Bio\SeqIO\__init__.py", line 472, in write
count = writer_class(fp).write_file(sequences)
File "C:\Python34\lib\site-packages\Bio\SeqIO\Interfaces.py", line 211, in write_file
count = self.write_records(records)
File "C:\Python34\lib\site-packages\Bio\SeqIO\Interfaces.py", line 196, in write_records
self.write_record(record)
File "C:\Python34\lib\site-packages\Bio\SeqIO\FastaIO.py", line 190, in write_record
id = self.clean(record.id)
AttributeError: 'str' object has no attribute 'id'

First, you're trying to write a plain sequence as a FASTA record. A FASTA record consists of an ID line (prefixed with ">") followed by the sequence. SeqIO.write expects SeqRecord objects, which carry that ID; you passed it a bare sequence, so the writer fails when it looks up record.id. You should either write the whole record, or turn the sequence into a FASTA record by adding an ID yourself.
Second, even if the write succeeded, each call overwrites the same file, so you'd end up with only the last record in it.
A simpler approach is to collect everything in a list and write the whole list once the loop is done. For example:
from Bio import SeqIO

new_fasta = []
for rec in SeqIO.parse("C:/Users/Siva/Downloads/sequence.gp","genbank"):
    if rec.features:
        for feature in rec.features:
            if feature.type =="Region":
                seq1 = feature.location.extract(rec).seq
                # Use an appropriate string for id
                new_fasta.append('>%s\n%s' % (rec.id, seq1))
# Write once, after the loop, so earlier records are not overwritten
with open('region_AA_output1.fasta', 'w') as f:
    f.write('\n'.join(new_fasta))
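The string-building step above can also be pulled into a small helper. This is only a sketch using the standard library: the function name format_fasta, the 60-column default, and the sample IDs and sequences are all made up for illustration and do not come from the question's data. It gives each region its own ID, which helps when one record has several "Region" features.

```python
def format_fasta(records, width=60):
    """Return FASTA text for an iterable of (identifier, sequence) pairs."""
    lines = []
    for identifier, sequence in records:
        lines.append(">" + identifier)
        sequence = str(sequence)
        # Wrap long sequences; the width is a convention, not a requirement.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical data standing in for the extracted regions.
regions = [("rec1_region1", "MKTAYIAKQR"), ("rec1_region2", "QISFVKSHFSRQ")]
fasta_text = format_fasta(regions, width=8)
print(fasta_text)
```

The returned text can be written to a file in a single f.write() call after the loop, as in the answer above.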

Related

Parsing xml file from url to a astropy votable without downloading

From http://svo2.cab.inta-csic.es/theory/fps/ you can get the transmission curves for many filters used in astronomical observations. I would like to get these data by opening the url with the corresponding xml file (for each filter), parse it to astropy's votable that helps to read the table data easily.
I have managed to do this by opening the URL, decoding the response as UTF-8, and saving it locally as an XML file. Opening the local file then works fine, as the following example shows.
However, I do not want to save the file and open it again. When I tried to skip that step by calling votable = parse(xml_file), it raised OSError: File name too long, because the whole file content was taken as a file name.
from urllib.request import urlopen
from astropy.io.votable import parse
fltr = 'http://svo2.cab.inta-csic.es/theory/fps/fps.php?ID=2MASS/2MASS.H'
url = urlopen(fltr).read()
xml_file = url.decode('UTF-8')
with open('tmp.xml','w') as out:
    out.write(xml_file)
votable = parse('tmp.xml')
data = votable.get_first_table().to_table(use_names_over_ids=True)
print(votable)
print(data["Wavelength"])
The output in this case is:
<VOTABLE>... 1 tables ...</VOTABLE>
Wavelength
AA
----------
12890.0
13150.0
...
18930.0
19140.0
Length = 58 rows
Indeed according to the API documentation, votable.parse's first argument is either a filename or a readable file-like object. It doesn't specify this exactly, but apparently the file also has to be seekable meaning that it can be read with random access.
The HTTPResponse object returned by urlopen is indeed a file-like object with a .read() method, so in principle it might be possible to pass directly to parse(), but this is how I found out it has to be seekable:
>>> fltr = 'http://svo2.cab.inta-csic.es/theory/fps/fps.php?ID=2MASS/2MASS.H'
>>> u = urlopen(fltr)
>>> parse(u)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "astropy/io/votable/table.py", line 135, in parse
_debug_python_based_parser=_debug_python_based_parser) as iterator:
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "astropy/utils/xml/iterparser.py", line 157, in get_xml_iterator
with _convert_to_fd_or_read_function(source) as fd:
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "astropy/utils/xml/iterparser.py", line 63, in _convert_to_fd_or_read_function
with data.get_readable_fileobj(fd, encoding='binary') as new_fd:
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "astropy/utils/data.py", line 210, in get_readable_fileobj
fileobj.seek(0)
io.UnsupportedOperation: seek
So you need to wrap the data in a seekable file-like object. Along the lines of what @keflavich wrote, you can use io.BytesIO (io.StringIO won't work, as explained below).
It turns out that there's no reason to explicitly decode the UTF-8 data to unicode. I'll spare the example, but after trying it myself it turns out parse() works on raw bytes (which I find a bit odd, but okay). So you can read the entire contents of the URL into an io.BytesIO which is just an in-memory file-like object that supports random access:
>>> import io
>>> u = urlopen(fltr)
>>> s = io.BytesIO(u.read())
>>> v = parse(s)
WARNING: W42: None:2:0: W42: No XML namespace specified [astropy.io.votable.tree]
>>> v.get_first_table().to_table(use_names_over_ids=True)
<Table masked=True length=58>
Wavelength Transmission
AA
float32 float32
---------- ------------
12890.0 0.0
13150.0 0.0
... ...
18930.0 0.0
19140.0 0.0
This is, in general, the way in Python to do something with some data as though it were a file, without writing an actual file to the filesystem.
Note, however, that this won't work if the entire file can't fit in memory. In that case you might still need to write it out to disk. But if it's just for some temporary processing and you don't want to litter your disk with a tmp.xml like in your example, you can always use the tempfile module to, among other things, create temporary files that are automatically deleted once they're no longer in use.
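As a rough sketch of the tempfile route, the helper below copies any readable binary stream into an anonymous temporary file, which is seekable and is removed when closed. The name spool_to_tempfile and the placeholder payload are invented for illustration; an io.BytesIO stands in for the urlopen() response so the snippet runs without network access. Whether parse() accepts the resulting file object is an assumption based on the seek requirement described above, and has not been tested here.

```python
import io
import shutil
import tempfile


def spool_to_tempfile(stream, chunk_size=64 * 1024):
    """Copy a readable binary stream into a seekable temporary file."""
    tmp = tempfile.TemporaryFile()  # binary "w+b" mode by default
    shutil.copyfileobj(stream, tmp, chunk_size)  # copies in chunks, not all at once
    tmp.seek(0)  # rewind so the consumer starts reading at the beginning
    return tmp


# Stand-in for urlopen(fltr): any object with a binary .read() method works.
fake_response = io.BytesIO(b"<VOTABLE><!-- placeholder payload --></VOTABLE>")

with spool_to_tempfile(fake_response) as tmp:
    first_bytes = tmp.read(9)
    tmp.seek(0)          # random access works, unlike on the HTTP response
    whole = tmp.read()
# The temporary file is deleted automatically once the with block exits.
```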

Pandas ast.literal_eval crashes with non-basic datatypes while reading lists from csv

I have a pandas DataFrame that was saved as a CSV using the command:
file.to_csv(filepath + name + ".csv", encoding='utf-8', sep=";", quoting=csv.QUOTE_NONNUMERIC)
I know that when saving, all of the columns in the DataFrame are converted to strings, but I've managed to convert them back using
raw_table['synset'] = raw_table['synset'].map(ast.literal_eval)
This seems to work fine when the lists in the columns contain numbers or text, but not when they contain a different datatype. The problem arises when I try to read a column "synset" with the following values (each line is a separate row of the column, and empty lists are included):
[Synset('report.n.03')]
[]
[Synset('application.n.04')]
[]
[Synset('legal_profession.n.01')]
[Synset('demeanor.n.01')]
[Synset('demeanor.n.01')]
[Synset('outgrowth.n.01')]
These values come from the nltk package and are Synset objects.
Trying to evaluate them with ast.literal_eval() causes the following crash:
File "/.../site-packages/pandas/core/series.py", line 2158, in map
new_values = map_f(values, arg)
File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer
File "/.../ast.py", line 85, in literal_eval
return _convert(node_or_string)
File "/.../ast.py", line 61, in _convert
return list(map(_convert, node.elts))
File "/.../ast.py", line 84, in _convert
raise ValueError('malformed node or string: ' + repr(node))
ValueError: malformed node or string: <_ast.Call object at 0x7f1918146ac8>
Any help on how I could resolve this problem?
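One possible way around this, offered as a sketch rather than a tested fix: ast.literal_eval only accepts Python literals (strings, numbers, lists, dicts and so on), and Synset('report.n.03') is a function call, which is why the traceback complains about an _ast.Call node. The example below pulls the synset names out of each cell with a regular expression instead. The helper name parse_synset_cell and the sample cells are made up for illustration. If the actual Synset objects are needed, the names could then be passed to nltk's wordnet.synset(); that step is left out so the snippet runs without nltk.

```python
import ast
import re

# Matches the name inside Synset('...'), e.g. report.n.03
SYNSET_NAME = re.compile(r"Synset\('([^']+)'\)")


def parse_synset_cell(cell):
    """Return the synset names found in a cell such as "[Synset('report.n.03')]"."""
    return SYNSET_NAME.findall(cell)


# Sample cells in the same shape as the column shown above.
cells = ["[Synset('report.n.03')]", "[]", "[Synset('demeanor.n.01')]"]
names = [parse_synset_cell(cell) for cell in cells]

# For comparison: literal_eval rejects the same cell because it contains a call.
try:
    ast.literal_eval(cells[0])
    literal_ok = True
except ValueError:
    literal_ok = False
```

With a DataFrame, the same helper could be applied through raw_table['synset'].map(parse_synset_cell) in place of ast.literal_eval.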

PyPDF2, why am I getting an index error? List index out of range

I'm following along in Al Sweigart's book 'Automate the Boring Stuff' and I'm at a loss with an index error I'm getting. I'm working with PyPDF2, trying to open an encrypted PDF document. I know the book is from 2015, so I went to the PyPDF2.PdfFileReader docs to see if I'm missing anything, and everything seems to be the same, at least from what I can tell. So I'm not sure what's wrong here.
My Code
import PyPDF2
reader = PyPDF2.PdfFileReader('encrypted.pdf')
reader.isEncrypted # is True
reader.pages[0]
gives:
Traceback (most recent call last):
File "<pyshell#65>", line 1, in <module>
pdfReader.getPage(0)
File "/home/user67/.local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1176, in getPage
self._flatten()
File "/home/user67/.local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1505, in _flatten
catalog = self.trailer["/Root"].getObject()
File "/home/user67/.local/lib/python3.6/site-packages/PyPDF2/generic.py", line 516, in __getitem__
return dict.__getitem__(self, key).getObject()
File "/home/user67/.local/lib/python3.6/site-packages/PyPDF2/generic.py", line 178, in getObject
return self.pdf.getObject(self).getObject()
File "/home/user67/.local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1617, in getObject
raise utils.PdfReadError("file has not been decrypted")
PyPDF2.utils.PdfReadError: file has not been decrypted
pdfReader.decrypt('rosebud')
1
pageObj = reader.getPage(0)
Traceback (most recent call last):
File "<pyshell#67>", line 1, in <module>
pageObj = pdfReader.getPage(0)
File "/home/user67/.local/lib/python3.6/site-packages/PyPDF2/pdf.py",line 1177, in getPage
return self.flattenedPages[pageNumber]
IndexError: list index out of range
Before asking my question, I did some searching on Google and found this link with a "proposed fix". However, I'm too new at this to see what the fix is. I can't make heads or tails of it.
I figured it out. The issue is caused by running pdfReader.getPage(0) in the IDLE shell before decrypting the file. If you leave that line out, or start over in a fresh session without running it after the error, it works as it should.
I got the same error. I was working in the console and called reader.getPage(0) before decrypting.
Don't call getPage(#) or pages[#] before decrypt().
Use code like this:
import PyPDF2
reader = PyPDF2.PdfFileReader("file.pdf")
# reader.pages[0] # do not use this before decrypt
if reader.isEncrypted:
    reader.decrypt('')
reader.pages[0]

select random item (and not pick this one for the next random pick up) from list created from a file

I'm trying to build a program that picks random items (each one only once) from a list that was imported from a file.
NB: I put only one item per line in my file (no spaces, no commas, nothing but a single word).
My code so far:
file = open('file.txt', 'r')
myList = file.readlines()
myList[0]
rand_item = random.choice(myList)
print (rand_item)
I am just at the beginning of my program, so I'm testing every new step that I make. Here, I'd like to display a random item from my list (itself imported from a file).
I get this message when I try to run my program:
Traceback (most recent call last):
File "C:/Users/Julien/Desktop/test final.py", line 16, in <module>
rand_item = random.choice(listEmployee)
AttributeError: 'builtin_function_or_method' object has no attribute 'choice'
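The import line is not shown in the question, so this is a guess: the message usually appears when the name random refers to a function rather than the module, for example after from random import random. Below is a minimal sketch that imports the module itself and hands out every item exactly once by shuffling a copy of the list. The helper name pick_each_once and the sample lines are invented for illustration; the list of strings stands in for file.readlines().

```python
import random  # import the module itself, not `from random import random`


def pick_each_once(items, rng=random):
    """Yield every item exactly once, in random order."""
    pool = list(items)   # copy so the caller's list is left untouched
    rng.shuffle(pool)
    while pool:
        yield pool.pop()


# Stand-in for file.readlines(); strip() drops each line's trailing newline.
lines = ["alice\n", "bob\n", "carol\n"]
names = [line.strip() for line in lines]

picked = list(pick_each_once(names))
print(picked)
```

random.sample(names, len(names)) would give the same kind of result in one call; the generator form is handy when items are needed one at a time.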

wkb: could not create geometry because of errors while reading input

I am trying to translate EWKB coordinates into the associated longitude and latitude in Python. The EWKB strings are listed in a one-column csv file (named "/home/nick/Documents/Sepi/WKB_coordinates_sing.csv").
I deleted the other columns for the sake of simplicity, but eventually I would like to use the original data set and read just the right column with ewkb.
Moreover, I would like to read and translate one line at a time, because I have files with millions of lines and coordinates to process.
I wrote the following code:
from shapely import wkb
with open ("/home/nick/Documents/Sepi/WKB_coordinates_sing.csv") as f:
    for line in f:
        hexloc=f.readline()
        print(hexloc)
        point=wkb.loads(hexloc,hex=True)
        print(point.x,point.y)
However, when I run it, I get the following:
~$ python /home/nick/Documents/Sepi/ewkb.py
0101000020E610000072604C0D47AA37402C306475ABA85140
ParseException: Premature end of HEX string
Traceback (most recent call last):
File "/home/nick/Documents/Sepi/ewkb.py", line 7, in <module>
point=wkb.loads(hexloc,hex=True)
File "/home/nick/anaconda3/lib/python3.6/site-packages/shapely/wkb.py", line 14, in loads
return reader.read_hex(data)
File "/home/nick/anaconda3/lib/python3.6/site-packages/shapely/geos.py", line 409, in read_hex
"Could not create geometry because of errors "
shapely.errors.WKBReadingError: Could not create geometry because of errors while reading input.
However, I can obtain longitude and latitude if I run the following code with the first hexadecimal string from my csv file as argument of wkb.loads:
Code:
from shapely import wkb
hexloc="0101000020E610000072604C0D47AA37402C306475ABA85140"
print(hexloc)
point=wkb.loads(hexloc,hex=True)
print(point.x,point.y)
Result:
~$ python /home/nick/Documents/Sepi/ewkb.py
0101000020E610000072604C0D47AA37402C306475ABA85140
23.665146666666665 70.63546500000001
Thank you in advance!
There seem to be several possible issues. First, your code snippet mixes iteration with direct "read" methods. With this example:
with open ("/home/nick/Documents/Sepi/WKB_coordinates_sing.csv") as f:
    for line in f:
        hexloc=f.readline()
        #do something with hexloc
the for loop consumes one line and f.readline() consumes the next, so hexloc only ever sees every second line of the input file. You might want to replace this with:
with open ("/home/nick/Documents/Sepi/WKB_coordinates_sing.csv") as f:
    for hexloc in f:
        #do something with hexloc
Moreover, when you read the input lines like this, they retain the trailing newline, which confuses the loads method. I would suggest trying:
with open ("/home/nick/Documents/Sepi/WKB_coordinates_sing.csv") as f:
    for line in f:
        hexloc = line.strip()
        point = wkb.loads(hexloc, hex=True)
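Both effects can be seen without shapely or the real CSV. In this small sketch an io.StringIO with made-up four-character lines stands in for the opened file; the variable names are illustrative only.

```python
import io

# In-memory stand-in for the CSV file, one value per line.
sample = io.StringIO("AAAA\nBBBB\nCCCC\nDDDD\n")

# Mixing iteration and readline(): each pass consumes two lines.
skipped = []
for line in sample:
    skipped.append(sample.readline())

# Plain iteration plus strip(): every line is seen, without the newline.
sample.seek(0)
clean = [line.strip() for line in sample]

print(skipped)
print(clean)
```

Only the second and fourth lines reach skipped, and they still carry the trailing newline, while clean holds all four values ready to be passed to a parser.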
