Using Python 3.x, how can I pass a Tree object from ete3 to DendroPy without writing to file?

I'm using the ete3 package in Python to build phylogenetic trees from data I've generated with a stochastic model, and it works well. I have previously written these trees to newick format and then used another script, with the package DendroPy, to read these trees and do some analysis of them. Both of these scripts work fine.
I am now trying to do a large amount of this sort of data processing and want to write a single script in which I skip the file writing. Both packages have a class called Tree, so I got around the name clash by importing the DendroPy class like:
from dendropy import Tree as DTree
and the ete3 method like:
from ete3 import Tree
which seems to be ok.
The question I have is how to pass the object from one package to the other. I have a loop in which I first build the tree object using the ete3 methods, and I call it 't'. My plan was then to use the Tree.write method in ete3 to pass the tree object to DendroPy using the 'get' method, skipping the actual outfile bit, like this:
treePass = t.write(format=1)
DendroTree = DTree.get(treePass, schema='newick')
but this gives the error:
DendroTree = DTree.get(treePass)
TypeError: get() takes 1 positional argument but 2 were given
Any thoughts are welcome.

DTree.get() takes only self as a positional argument; everything else must be passed as a keyword argument. This means you cannot pass treePass to DTree.get() positionally.
I haven't used either of those libraries, but I found a way to import data into a DendroPy tree here.
tree = DTree.get(data="((A,B),(C,D));", schema="newick")
That means you'd have to get your tree out of ete3 in this format. It doesn't seem that unusual for a tree, and after a bit more looking there does seem to be a supported format in ete3, which you can read about here. I believe it's number 9.
So in the end I'd try this:
from dendropy import Tree as DTree
from ete3 import Tree
#do your Tree generation magic here
DendroTree = DTree.get(data=t.write(format=9), schema='newick')
Edit:
As I read more, I believe any of the ete3 newick formats should be readable, so all you actually have to add to your example is the data keyword: DendroTree = DTree.get(data=treePass, schema='newick')
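Putting the pieces together, a minimal round-trip sketch (the toy tree and the final print are purely illustrative; the key point is that t.write() returns a newick string that DTree.get() accepts through the data keyword):
from ete3 import Tree
from dendropy import Tree as DTree

# stand-in for the stochastically generated tree
t = Tree("((A:1,B:1):1,(C:1,D:1):1);", format=1)

# serialise to a newick string in memory instead of writing a file
newick_str = t.write(format=1)

# hand the string straight to DendroPy
dendro_tree = DTree.get(data=newick_str, schema="newick")
print(dendro_tree.as_ascii_plot())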

Related

Python 3 - Reading mixed data from a file and converting the read values to float

I have the following data stored in a file (tn.csv).
The content is as follows (the original file is very large, but I simplified it):
[file_content.csv: sample contents of tn.csv, not included here]
I read the data using the following source code:
[python_script.py: reading script, not included here]
It nicely produces the following, which is a numpy array of strings:
[sample_output.png: screenshot of the output, not included here]
Ultimately I want to use the read data as a numpy array of floats in my python script.
All the entries should be float values. Hence, I would like to cast the entries from string to float. As some entries are not numbers but strings of text (they are variables in my script), I could not perform type casting directly. Python throws a ValueError.
Note: I define the (vector) variable nv before I read this matrix. Hence, it is well defined. So, the complexity is that some entries of the numpy arrays are variables.
What I would like to have is the following:
If I had typed the contents of tn.csv directly as a numpy array in my Python script, I could have used it without any conversion. But the matrix is very big, so I store it in an external file. Now I want to read the file and store it in a numpy array, and the end result should be the same as if I had typed it directly in my script. Could someone help me tackle this issue or point me to relevant links?
This is my first post on Stack Overflow. If my question is not clear, please let me know and I will rewrite it.
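For what it's worth, a minimal sketch of one way this could be approached, assuming the non-numeric entries are variable names that can be resolved through a name-to-value dict (the dict, the delimiter, and the helper are illustrative assumptions, not part of the original question):
import numpy as np

# hypothetical: the variables appearing in tn.csv and their numeric values
nv = 0.5
name_to_value = {"nv": nv}

def to_float(entry, lookup):
    # plain numbers convert directly; anything else is looked up by name
    try:
        return float(entry)
    except ValueError:
        return float(lookup[entry.strip()])

raw = np.genfromtxt("tn.csv", delimiter=",", dtype=str)           # array of strings
matrix = np.vectorize(lambda s: to_float(s, name_to_value))(raw)  # array of floats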

How can I load Python lambda expressions from YAML files using ruamel.yaml?

I'm trying to serialize and deserialize objects that contain lambda expressions using ruamel.yaml. As shown in the example, this yields a ConstructorError. How can this be done?
import sys
import ruamel.yaml
yaml = ruamel.yaml.YAML(typ='unsafe')
yaml.allow_unicode = True
yaml.default_flow_style = False
foo = lambda x: x * 2
yaml.dump({'foo': foo}, sys.stdout)
# foo: !!python/name:__main__.%3Clambda%3E
yaml.load('foo: !!python/name:__main__.%3Clambda%3E')
# ConstructorError: while constructing a Python object
# cannot find '<lambda>' in the module '__main__'
# in "<unicode string>", line 1, column 6
That is not going to work. ruamel.yaml dumps functions (or methods) by referring to them by name (i.e. it doesn't try to store the actual code).
Your lambda is an anonymous function, so there is no name that can be properly retrieved. In the same way, Python's pickle doesn't support lambdas.
I am not sure whether trying to dump a lambda should be an error, or whether a warning should be issued.
The simple solution is to make your lambda(s) into named functions. Alternatively, you might be able to get at the actual code or AST of the lambda and store and retrieve that, but that is more work and might not be portable, depending on what you store.
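A minimal sketch of the named-function route (assuming the function is defined in __main__ both when dumping and when loading, so the !!python/name tag can be resolved):
import sys
import ruamel.yaml

def double(x):
    # named replacement for the original lambda x: x * 2
    return x * 2

yaml = ruamel.yaml.YAML(typ='unsafe')
yaml.dump({'foo': double}, sys.stdout)
# foo: !!python/name:__main__.double

data = yaml.load('foo: !!python/name:__main__.double')
assert data['foo'](3) == 6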

Broadcast python objects using mpi4py

I have a python object
<GlobalParams.GlobalParams object at 0x7f8efe809080>
which contains various numpy arrays, parameter values, etc. that I use in various functions, for example:
myParams = GlobalParams(input_script) #reads in various parameters from an input script and assigns these to myParams
myParams.data #calls the data array from myParams
I am trying to parallelise my code and would like to broadcast the myParams object so that it is available to the other child processes. I have done this previously for individual numpy arrays, values etc. in the form:
points = comm.bcast(points, root = 0)
However, I don't want to have to do this individually for all the contents of myParams. I would like to broadcast the object in its entirety so that it can be accessed on other cores. I have tried the obvious:
myParams = comm.bcast(myParams, root=0)
but this returns the error:
myParams = comm.bcast(myParams, root=0)
File "MPI/Comm.pyx", line 1276, in mpi4py.MPI.Comm.bcast (src/mpi4py.MPI.c:108819)
File "MPI/msgpickle.pxi", line 612, in mpi4py.MPI.PyMPI_bcast (src/mpi4py.MPI.c:47005)
File "MPI/msgpickle.pxi", line 112, in mpi4py.MPI.Pickle.dump (src/mpi4py.MPI.c:40704)
TypeError: cannot serialize '_io.TextIOWrapper' object
What is the appropriate way to share this object with the other cores? Presumably this is a common requirement in python, but I can't find any documentation on this. Most examples look at broadcasting a single variable/array.
This doesn't look like an MPI problem; it looks like a problem with object serialisation for broadcast, which internally uses the pickle module.
Specifically in this case, it can't serialise a _io.TextIOWrapper - so I suggest hunting down where in your class this is used.
Once you work out which field(s) can't be serialised, you can remove them, broadcast, then reassemble them on each individual rank, using some method that you need to design yourself (recreateUnpicklableThing() in the example below). You could do that by adding these methods to your class for Pickle to call before and after broadcast:
def __getstate__(self):
    members = self.__dict__.copy()
    # remove things that can't be pickled, by name
    del members['someUnpicklableThing']
    return members

def __setstate__(self, members):
    self.__dict__.update(members)
    # On unpickle, manually recreate the things that you couldn't pickle
    # (this method recreates self.someUnpicklableThing using some metadata
    # carefully chosen by you that pickle can serialise).
    self.recreateUnpicklableThing(self.dataForSettingUpSomething)
See here for more on how these methods work: https://docs.python.org/2/library/pickle.html
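To make the mechanics concrete, a small self-contained sketch (the GlobalParams fields and the reopen logic are assumptions, and plain pickle is used here because that is what comm.bcast relies on under the hood):
import pickle

# create a dummy input script so the example runs on its own
with open('input.txt', 'w') as f:
    f.write('example parameters\n')

class GlobalParams:
    def __init__(self, input_script):
        self.input_script = input_script
        self.log = open(input_script)        # an _io.TextIOWrapper: not picklable

    def __getstate__(self):
        members = self.__dict__.copy()
        del members['log']                   # drop the unpicklable handle
        return members

    def __setstate__(self, members):
        self.__dict__.update(members)
        self.log = open(self.input_script)   # recreate it from picklable metadata

params = GlobalParams('input.txt')
clone = pickle.loads(pickle.dumps(params))   # would fail without the two hooks
print(type(clone.log))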

Writing a custom builder that executes an external command and a Python function

I'm looking to write a custom SCons Builder that:
Executes an external command to produce foo.temp
Then executes a python function to manipulate foo.temp and produce the final output file
I've referred to the two following sections, but I'm not sure the correct way to "glue" them together.
18.1. Writing Builders That Execute External Commands
18.4. Builders That Execute Python Functions
I know that Command accepts a list of actions to take. But how do I properly handle that intermediate file? Ideally the intermediate file would be invisible to the user -- the entire Builder would appear to operate atomically.
Here's what I've come up with, which seems to be working; however, the .bin file isn't being deleted automatically.
from SCons.Action import Action
from SCons.Builder import Builder
from SCons.Errors import StopError
from SCons.Util import is_List
from SCons.Script import Delete

_objcopy_builder = Builder(
    action = 'objcopy -O binary $SOURCE $TARGET',
    suffix = '.bin',
    single_source = 1
)

def _add_header(target, source, env):
    source = str(source[0])
    target = str(target[0])
    with open(source, 'rb') as src:
        with open(target, 'wb') as tgt:
            tgt.write(b'MODULE\x00\x00')
            tgt.write(src.read())
    return 0

_addheader_builder = Builder(
    action = _add_header,
    single_source = 1
)

def Elf2Mod(env, target, source, *args, **kw):
    def check_one(x, what):
        if not is_List(x):
            x = [x]
        if len(x) != 1:
            raise StopError('Only one {0} allowed'.format(what))
        return x
    target = check_one(target, 'target')
    source = check_one(source, 'source')
    # objcopy a binary file
    binfile = _objcopy_builder.__call__(env, source=source, **kw)
    # write the module header
    _addheader_builder.__call__(env, target=target, source=binfile, **kw)
    # delete the intermediate binary file
    # TODO: Not working
    Delete(binfile)
    return target

def generate(env):
    """Add Builders and construction variables to the Environment."""
    env.AddMethod(Elf2Mod, 'Elf2Mod')
    print('Added Elf2Mod to env {0}'.format(env))

def exists(env):
    return True
This can indeed be done with the Command builder, by specifying a list of actions, as follows:
Command('foo.temp', 'foo.in',
        ['your_external_action',
         your_python_function])
Notice that foo.in is the source, and you should name it accordingly. But if foo.temp is internal as you mention, then this probably isn't the best approach.
Another way, which I feel is much more flexible, would be to use a custom Builder with a Generator and/or an Emitter.
The Generator is a Python function where you do the actual work, which in your case would be calling the external command and also calling the Python function.
An Emitter gives you fine-tuned control over the sources and targets. I once used a Builder with an Emitter (and Generator) to do C++ and Java code generation from Thrift IDL input files. I had to read and process the Thrift input file to know exactly which files would be generated (the actual targets), and the Emitter is the best/only way to do something like that. If your particular use case isn't so complicated, you can skip the Emitter and just list your sources/targets in the call to the builder. But if you want foo.temp to be transparent to the end user, then you'll need an Emitter.
When using a custom Builder with a Generator and an Emitter, the Emitter is called every time by SCons to calculate the sources and dependencies and decide whether the Generator needs to run. The Generator is only called if one of the targets is considered out of date with respect to the sources.
There are numerous examples showing how to use a Generator and an Emitter in a custom Builder, so I won't list the code here, but let me know if you need help with the syntax, etc.
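For orientation, though, a rough sketch of how such a tool might look. Here a plain action list stands in for the generator, and the emitter registers the intermediate .bin so SCons knows about it; the names and the SideEffect call are illustrative assumptions, not the poster's actual tool:
from SCons.Script import Action, Builder

def _add_header(target, source, env):
    # Python step: prepend a header to the objcopy output
    binfile = str(source[0]) + '.bin'
    with open(binfile, 'rb') as src, open(str(target[0]), 'wb') as tgt:
        tgt.write(b'MODULE\x00\x00')
        tgt.write(src.read())
    return 0

def _elf2mod_emitter(target, source, env):
    # declare the hidden intermediate so SCons tracks it and builds don't race on it
    env.SideEffect(str(source[0]) + '.bin', target)
    return target, source

elf2mod = Builder(
    action=[
        'objcopy -O binary $SOURCE ${SOURCE}.bin',                # external command
        Action(_add_header, 'Adding module header to $TARGET'),   # python function
    ],
    emitter=_elf2mod_emitter,
)

# in an SConstruct (illustrative):
#   env = Environment(BUILDERS={'Elf2Mod': elf2mod})
#   env.Elf2Mod('firmware.mod', 'firmware.elf')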

Unpickling from converted string in python/numpy

I have a ton of numpy ndarrays that are stored pickled as strings. That may have been a poor design choice, but it's what I did, and now the pickled strings seem to have been converted or mangled somewhere along the way. When I try to unpickle, I notice they are of type str and I get the following error:
TypeError: 'str' does not support the buffer interface
when I invoke
numpy.loads(bin_str)
where bin_str is the thing I'm trying to unpickle. If I print out bin_str it looks like
b'\x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x00cnumpy\nndarray\nq\x01K\x00\x85q\x02c_codecs\nencode\nq\x03X\x01\x00\x00\ ...
continuing for some time, so the info seems to be there; I'm just not quite sure how to convert it into whatever string format numpy/pickle need. On a whim I tried
numpy.loads( bytearray(bin_str, encoding='utf-8') )
and
numpy.loads( bin_str.encode() )
which both throw an error _pickle.UnpicklingError: unpickling stack underflow. Any ideas?
PS: I'm on python 3.3.2 and numpy 1.7.1
Edit
I discovered that if I do the following:
open('temp.txt', 'wb').write(...)
return numpy.load( 'temp.txt' )
I get back my array, where ... denotes copying and pasting the output of print(bin_str) from another window. I've tried writing bin_str to a file directly and unpickling that, but it doesn't work; it complains that TypeError: 'str' does not support the buffer interface. A few sane ways of converting bin_str to something that can be written directly to a binary file result in pickle errors when trying to read it back.
Edit 2
So I guess what's happened is that my binary pickle string ended up encoded inside of a normal string, something like:
"b'pickle'"
which is unfortunate, and I haven't figured out how to deal with it, except for this ridiculous and convoluted way to get it back:
open('temp.py', 'w').write('foo = ' + bin_str)
from temp import foo
numpy.loads( foo )
This seems like a very shameful solution to the problem, so please give me a better one!
It sounds like your saved strings are the reprs of the original bytes instances returned by your pickling code. That's a bit unfortunate, but not too bad. repr is intended to return a "machine friendly" representation of an object, and it can often be reversed by using eval:
import numpy as np
import pickle

# this part has already happened
orig_obj = np.array([1, 2, 3])
orig_pickle = pickle.dumps(orig_obj)
saved_str = repr(orig_pickle)  # this was a mistake, but it's already done

# this is what you need to do to get something equivalent to orig_obj back
reconstructed_pickle = eval(saved_str)
reconstructed_obj = pickle.loads(reconstructed_pickle)

# test
if np.all(reconstructed_obj == orig_obj):
    print("It worked!")
Obligatory note that using eval can be dangerous: Be aware that eval can run any Python code it wants, so don't call it with untrusted data. However, pickle data has the same risks (a malicious Pickle string can run arbitrary code upon unpickling), so you're not losing much safety in this situation. I'm guessing that you trust your data in this case anyway.
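If you would rather not call eval at all, ast.literal_eval can rebuild the bytes object from the saved repr without evaluating arbitrary expressions (a small sketch using the same setup as above; the unpickling step itself still carries the usual pickle risks):
import ast
import pickle
import numpy as np

orig_obj = np.array([1, 2, 3])
saved_str = repr(pickle.dumps(orig_obj))   # the string that was (mistakenly) saved

# literal_eval only accepts Python literals, so it parses the bytes literal
# without the code-execution risk of a bare eval
reconstructed_obj = pickle.loads(ast.literal_eval(saved_str))
assert np.array_equal(reconstructed_obj, orig_obj)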