Broadcast python objects using mpi4py - python-3.x

I have a python object
<GlobalParams.GlobalParams object at 0x7f8efe809080>
which contains various numpy arrays, parameter values, etc., and which I use in various functions, for example:
myParams = GlobalParams(input_script)  # reads various parameters from an input script and assigns them to myParams
myParams.data  # access the data array held by myParams
I am trying to parallelise my code and would like to broadcast the myParams object so that it is available to the other child processes. I have done this previously for individual numpy arrays, values etc. in the form:
points = comm.bcast(points, root = 0)
However, I don't want to have to do this individually for all the contents of myParams. I would like to broadcast the object in its entirety so that it can be accessed on other cores. I have tried the obvious:
myParams = comm.bcast(myParams, root=0)
but this returns the error:
myParams = comm.bcast(myParams, root=0)
File "MPI/Comm.pyx", line 1276, in mpi4py.MPI.Comm.bcast (src/mpi4py.MPI.c:108819)
File "MPI/msgpickle.pxi", line 612, in mpi4py.MPI.PyMPI_bcast (src/mpi4py.MPI.c:47005)
File "MPI/msgpickle.pxi", line 112, in mpi4py.MPI.Pickle.dump (src/mpi4py.MPI.c:40704)
TypeError: cannot serialize '_io.TextIOWrapper' object
What is the appropriate way to share this object with the other cores? Presumably this is a common requirement in python, but I can't find any documentation on this. Most examples look at broadcasting a single variable/array.

This doesn't look like an MPI problem; it looks like a problem with object serialisation for the broadcast, which internally uses the pickle module.
Specifically in this case, it can't serialise a _io.TextIOWrapper - so I suggest hunting down where in your class this is used.
Once you work out which field(s) can't be serialised, you can remove them, broadcast, then reassemble them on each individual rank, using some method that you need to design yourself (recreateUnpicklableThing() in the example below). You could do that by adding these methods to your class for Pickle to call before and after broadcast:
def __getstate__(self):
    members = self.__dict__.copy()
    # Remove the things that can't be pickled, by name.
    del members['someUnpicklableThing']
    return members

def __setstate__(self, members):
    self.__dict__.update(members)
    # On unpickle, manually recreate the things that couldn't be pickled
    # (this method recreates self.someUnpicklableThing using some metadata,
    # carefully chosen by you, that pickle can serialise).
    self.recreateUnpicklableThing(self.dataForSettingUpSomething)
See here for more on how these methods work: https://docs.python.org/2/library/pickle.html
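For illustration, here is a minimal sketch of how the pieces can fit together; the GlobalParams layout, its log_file handle, and the input_script.txt path are hypothetical stand-ins for whatever your class actually holds:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

class GlobalParams:
    def __init__(self, input_script):
        self.input_script = input_script
        self.log_file = open(input_script)       # the unpicklable _io.TextIOWrapper

    def __getstate__(self):
        members = self.__dict__.copy()
        del members['log_file']                  # drop the file handle before pickling
        return members

    def __setstate__(self, members):
        self.__dict__.update(members)
        self.log_file = open(self.input_script)  # reopen the handle on each rank

myParams = GlobalParams("input_script.txt") if rank == 0 else None
myParams = comm.bcast(myParams, root=0)          # now picklable, so the broadcast succeeds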

Related

Creating SequenceTaggingDataset from list, not file

I would like to create a SequenceTaggingDataset from two lists that I have created dynamically inside my code - train_sentences and train_tags. I want to write something like this:
train_data = SequenceTaggingDataset(examples=(zip(train_sentences, train_tags)))
However, the constructor must receive a path. And not only that: from the code it looks as though, even if I were to provide the examples, it would override them and initialize examples to an empty list.
For various reasons, I do not want to save the lists I created in a file from which the SequenceTaggingDataset could read. Is there any way around this, save defining my own custom class?
You will need to modify the source code for it (https://pytorch.org/text/_modules/torchtext/datasets/sequence_tagging.html#SequenceTaggingDataset). You can make a local copy and import it as your own module.
path is used in __init__. The important part is that it takes the lines from the file and splits them on the given separator into a list named columns. This columns list is then fed into another class method, together with fields, to construct the examples list. Please read the example provided here to understand fields (note that UDPOS is called there to create a SequenceTaggingDataset).
What you need is columns, which you don't have to read from a file, as you already have all the components. You can feed it in directly by simplifying the class's __init__:
def __init__(self, columns, fields, encoding="utf-8", separator="\t", **kwargs):
    examples = []
    examples.append(data.Example.fromlist(columns, fields))
    super(SequenceTaggingDataset, self).__init__(examples, fields, **kwargs)
columns is a nested list of lists: [[word], [UD_TAG], [PTB_TAG]]. This means you need to feed the following into the modified class:
train = SequenceTaggingDataset([train_sentences, train_tags], fields=...)
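A rough end-to-end sketch of how that might look, assuming an older torchtext (or torchtext.legacy) where data.Field and data.Example.fromlist are available, where SequenceTaggingDataset is the modified class above, and where the field names are placeholders:

from torchtext import data  # or: from torchtext.legacy import data

WORD = data.Field()
TAG = data.Field()
fields = [("word", WORD), ("tag", TAG)]

# train_sentences and train_tags are the dynamically built lists from the question
train = SequenceTaggingDataset([train_sentences, train_tags], fields=fields)
WORD.build_vocab(train)
TAG.build_vocab(train)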

How can I load Python lambda expressions from YAML files using ruamel.yaml?

I'm trying to serialize and deserialize objects that contain lambda expressions using ruamel.yaml. As shown in the example, this yields a ConstructorError. How can this be done?
import sys
import ruamel.yaml
yaml = ruamel.yaml.YAML(typ='unsafe')
yaml.allow_unicode = True
yaml.default_flow_style = False
foo = lambda x: x * 2
yaml.dump({'foo': foo}, sys.stdout)
# foo: !!python/name:__main__.%3Clambda%3E
yaml.load('foo: !!python/name:__main__.%3Clambda%3E')
# ConstructorError: while constructing a Python object
# cannot find '<lambda>' in the module '__main__'
# in "<unicode string>", line 1, column 6
That is not going to work. ruamel.yaml dumps functions (or methods) by referring to their names in the source code (i.e. it doesn't try to store the actual code).
Your lambda is an anonymous function, so there is no name that can be properly retrieved; Python's pickle doesn't support lambdas for the same reason.
I am not sure whether trying to dump a lambda should be an error, or whether a warning should be issued instead.
The simple solution is to turn your lambda(s) into named functions. Alternatively, you might be able to get at the actual code or AST of the lambda and store and retrieve that, but that is more work and might not be portable, depending on what you store.
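A minimal sketch of the named-function approach, using the same unsafe loader as in the question (double is just an example name):

import sys
import ruamel.yaml

yaml = ruamel.yaml.YAML(typ='unsafe')

def double(x):  # named replacement for: lambda x: x * 2
    return x * 2

yaml.dump({'foo': double}, sys.stdout)
# foo: !!python/name:__main__.double

data = yaml.load('foo: !!python/name:__main__.double')
print(data['foo'](21))  # 42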

Inheritance and Pandas

I am trying to create a file writer based on Pandas' ExcelWriter. I proceeded as I usually do with classes in Python (3) with inheritance:
import pandas as pd
class Writer(pd.ExcelWriter):
    def __init__(self, fname, engine='openpyxl'):
        pd.ExcelWriter.__init__(self, fname, engine=engine)
        self.newvar = 0
However, when I try to use it, I cannot access newvar:
test = Writer('test.xlsx')
test.newvar
returns:
AttributeError: '_XlsxWriter' object has no attribute 'newvar'
And when I check the type of test, it returns:
pandas.io.excel._XlsxWriter
I don't understand what I am missing since I used this kind of inheritance in many other cases. Any idea would be appreciated!
This is because pandas.ExcelWriter.__new__ returns an instance of a different class than itself (ExcelWriter is actually built on abc.ABCMeta). The concrete class is chosen based on the extension of the file path and the engine which is used - you could observe that when you checked the type of the newly created instance. That means the __init__ method of whatever class is returned gets called. You can think of ExcelWriter as a kind of proxy for the specific writers for each format and engine (though it also defines the API which such a writer must provide).
In order to make your writer available (for the given engine), you need to register it.
But before you can do that you need to make your class compatible by following the instructions which you'll find via help(pandas.ExcelWriter). For the sake of completeness I cite them here:
# Defining an ExcelWriter implementation (see abstract methods for more...)
# - Mandatory
#   - ``write_cells(self, cells, sheet_name=None, startrow=0, startcol=0)``
#     --> called to write additional DataFrames to disk
#   - ``supported_extensions`` (tuple of supported extensions), used to
#     check that engine supports the given extension.
#   - ``engine`` - string that gives the engine name. Necessary to
#     instantiate class directly and bypass ``ExcelWriterMeta`` engine
#     lookup.
#   - ``save(self)`` --> called to save file to disk
# - Mostly mandatory (i.e. should at least exist)
#   - book, cur_sheet, path
# - Optional:
#   - ``__init__(self, path, engine=None, **kwargs)`` --> always called
#     with path as first argument.
So with that in mind we can extend your class:
class Writer(pd.ExcelWriter):
    engine = 'openpyxl'
    supported_extensions = ('xlsx',)

    def write_cells(self, cells, sheet_name=None, startrow=0, startcol=0):
        # Implement something useful here.
        pass

    def save(self):
        # Implement something useful here.
        pass

    def __init__(self, fname, engine='openpyxl', **kwargs):
        super().__init__(fname, engine=engine, **kwargs)
Now you can use pd.io.excel.register_writer(Writer) to register the writer. But you need to make sure the engine which you've specified matches your version of openpyxl. You can check the process of how a specific writer is chosen here; the writers which are currently registered for each version can be checked via print(pd.io.excel._writers).
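Registration and that check might then look like this (a brief sketch; it assumes the Writer class defined above and a pandas/openpyxl combination where the 'openpyxl' engine name is accepted):

import pandas as pd

pd.io.excel.register_writer(Writer)  # register the custom writer defined above
print(pd.io.excel._writers)          # inspect which writer class is registered per engine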
As a side note: you can also subclass one of the already available specific writers and reuse their write_cells and save methods, for example (however, you'll also need to register your writer in that case; a sketch of this follows the list below):
_Openpyxl1Writer
_Openpyxl20Writer
_Openpyxl22Writer
_XlwtWriter
_XlsxWriter
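A hedged sketch of that side note, assuming a pandas version from the same era as the class names above, where _XlsxWriter can be imported from pandas.io.excel (the exact import path differs between pandas versions), with the question's newvar added on top:

import pandas as pd
from pandas.io.excel import _XlsxWriter  # assumption: valid for this (older) pandas version

class MyWriter(_XlsxWriter):
    # write_cells() and save() are inherited from _XlsxWriter

    def __init__(self, fname, engine=None, **kwargs):
        super().__init__(fname, engine=engine, **kwargs)
        self.newvar = 0  # the custom attribute from the question

pd.io.excel.register_writer(MyWriter)  # still needed, as noted above
test = MyWriter('test.xlsx')           # the subclass is instantiated directly
print(test.newvar)                     # 0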

How do I pickle an object hierarchy, *each object usually its own individual file* so that saving is fast?

I want to use pickle, specifically cPickle, to serialize my objects' data as a folder of files representing modules, projects, module objects, scene objects, etc. Is there an easy way to do this?
Unpickling will thus be a little tricky: at runtime each parent object stores references to its child/sibling objects, but the pickled data of the parent will hold only a file path to each object.
I started with a PathUtil class that all classes inherit from, but have been running into issues. Has anyone solved a similar data-file saving/restoring problem?
The more transparently it works with existing code, the better. For instance, if using a metaclass __call__ keeps the existing constructor syntax the same, that would be a plus: the metaclass __call__ would check for the pickle file first and load it if it exists, and fall back to default construction if it doesn't.
You can override __getstate__ to write to a new pickle file and return its path, and __setstate__ to unpickle the file.
import pickle, os

DIRNAME = 'path/to/my/pickles/'

class AutoPickleable:
    def __getstate__(self):
        state = dict(self.__dict__)
        path = os.path.join(DIRNAME, str(id(self)))
        with open(path, 'wb') as f:
            pickle.dump(state, f)
        return path

    def __setstate__(self, path):
        with open(path, 'rb') as f:
            state = pickle.load(f)
        self.__dict__.update(state)
Now, each type which should have this special auto-pickleable behavior should subclass AutoPickleable.
When you want to dump the files, you can do pickle.dumps(obj) or copy.deepcopy(obj) and ignore the result.
Unpickling works as usual (pickle.load). If you want to restore the objects from a file path (and not from the result of pickle.dumps), it is a bit trickier. Let me know if you want that, and I'll add details. In any case, if you wrap your AutoPickleable object with a "standard" object and do all pickle operations on that, it should all work.
There are several potential problems with this approach, but for a "clean" case such as the one you describe, it might work.
Some notes:
There is no way to "dynamically" specify the directory to write to. It has to be globally accessible, and set before the pickling operation
Probably wouldn't work if several objects keep references to the same AutoPickleable object, or if you have circular references (in general, pickle handles these cases with no problem)
There is no code here to clean the directory / delete the files.
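A minimal usage sketch of the approach above; Scene is a hypothetical stand-in for one of your module/project/scene classes:

import os
import pickle

os.makedirs(DIRNAME, exist_ok=True)  # DIRNAME as defined above

class Scene(AutoPickleable):
    def __init__(self, name):
        self.name = name

root = {'scene': Scene('intro')}  # a "standard" wrapper object
blob = pickle.dumps(root)         # Scene's state goes to its own file; only the
                                  # file path ends up inside blob

restored = pickle.loads(blob)     # reads the per-object file back in
print(restored['scene'].name)     # -> 'intro'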

Using python 3.x how can I pass a Tree object from ete3 to DendroPy without writing to file

I'm using the ete3 package in python to build phylogenetic trees from data I've generated with a stochastic model and it works well. I have previously written these trees to newick format and then used another script, with the package Dendropy, to read these trees and do some analysis of them. Both of these scripts work fine.
I am now trying to do a large amount of this sort of data processing and want to write a single script in which I skip the file writing. Both classes are called Tree, so I got around the name clash by importing the DendroPy class like:
from dendropy import Tree as DTree
and the ete3 method like:
from ete3 import Tree
which seems to be ok.
The question I have is how to pass the object from one package to the other. I have a loop in which I first build the tree object using the ete3 methods, and I call it 't'. My plan was then to use the Tree.write method in ete3 to pass the tree object to DendroPy via its 'get' method, skipping the actual outfile bit, like this:
treePass = t.write(format = 1)
DendroTree = DTree.get(treePass, schema = 'newick')
but this gives the error:
DendroTree = DTree.get(treePass)
TypeError: get() takes 1 positional argument but 2 were given
Any thoughts are welcome.
DTree.get() only takes self as a positional argument; everything else must be given as keyword arguments. This basically means you cannot pass treePass to DTree.get() positionally.
I haven't used either of those libraries, but I have found a way to import data into a DendroPy tree here.
tree = DTree.get(data="((A,B),(C,D));",schema="newick")
This means you'd have to get your tree from ete3 in this (newick) format. That doesn't seem unusual for a tree, and after a bit more looking there seems to be a supported format in ete3, which you can read about here. I believe it's number 9.
So in the end I'd try this:
from dendropy import Tree as DTree
from ete3 import Tree

# do your Tree generation magic here
DendroTree = DTree.get(data=t.write(format=9), schema='newick')
Edit:
As I read more, I believe any newick format should be readable, so basically all you have to add to your example is the data keyword: DendroTree = DTree.get(data=treePass, schema='newick')
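Put together, a minimal round-trip sketch (the example newick string stands in for the tree t that your loop actually builds):

from ete3 import Tree
from dendropy import Tree as DTree

t = Tree("((A:1,B:1):1,(C:1,D:1):1);")  # stand-in for your generated ete3 tree
treePass = t.write(format=1)            # newick string, no file written
DendroTree = DTree.get(data=treePass, schema='newick')
print(DendroTree.as_string(schema='newick'))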
