Inheritance and Pandas - python-3.x

I am trying to create a file writer based on Pandas' ExcelWriter. I proceeded as I usually do with classes in Python (3) with inheritance:
import pandas as pd

class Writer(pd.ExcelWriter):
    def __init__(self, fname, engine='openpyxl'):
        pd.ExcelWriter.__init__(self, fname, engine=engine)
        self.newvar = 0
However, when I try to use it, I cannot access newvar:
test = Writer('test.xlsx')
test.newvar
returns:
AttributeError: '_XlsxWriter' object has no attribute 'newvar'
And when I check the type of test, it returns:
pandas.io.excel._XlsxWriter
I don't understand what I am missing since I used this kind of inheritance in many other cases. Any idea would be appreciated!

This is because pandas.ExcelWriter.__new__ returns a different class than itself (actually it is an abc.ABCMeta). The class is chosen based on the extension of the file path and the engine which is used - you could observe that when you checked the type of the newly created instance. That means the __init__ method of whatever class is returned gets called. You can think of ExcelWriter as some kind of proxy for the specific writers for each format and engine (though it also defines the API which such a writer must provide).
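The mechanism is easy to reproduce in a toy sketch (illustrative only -- Proxy, XlsxProxy and CsvProxy are made-up names, not pandas' actual code):

class Proxy:
    def __new__(cls, path, engine=None):
        # pick a concrete implementation based on the file extension,
        # roughly what ExcelWriter does with the extension and engine
        concrete = XlsxProxy if path.endswith('.xlsx') else CsvProxy
        return super().__new__(concrete)

class XlsxProxy(Proxy):
    pass

class CsvProxy(Proxy):
    pass

print(type(Proxy('test.xlsx')))  # <class '__main__.XlsxProxy'>, not Proxy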
In order to make your writer available (for the given engine), you need to register it.
But before you can do that you need to make your class compatible by following the instructions which you'll find via help(pandas.ExcelWriter). For the sake of completeness I cite them here:
# Defining an ExcelWriter implementation (see abstract methods for more...)
# - Mandatory
#   - ``write_cells(self, cells, sheet_name=None, startrow=0, startcol=0)``
#     --> called to write additional DataFrames to disk
#   - ``supported_extensions`` (tuple of supported extensions), used to
#     check that engine supports the given extension.
#   - ``engine`` - string that gives the engine name. Necessary to
#     instantiate class directly and bypass ``ExcelWriterMeta`` engine
#     lookup.
#   - ``save(self)`` --> called to save file to disk
# - Mostly mandatory (i.e. should at least exist)
#   - book, cur_sheet, path
# - Optional:
#   - ``__init__(self, path, engine=None, **kwargs)`` --> always called
#     with path as first argument.
So with that in mind we can extend your class:
class Writer(pd.ExcelWriter):
    engine = 'openpyxl'
    supported_extensions = ('xlsx',)

    def write_cells(self, cells, sheet_name=None, startrow=0, startcol=0):
        # Implement something useful here.
        pass

    def save(self):
        # Implement something useful here.
        pass

    def __init__(self, fname, engine='openpyxl', **kwargs):
        super().__init__(fname, engine=engine, **kwargs)
Now you can use pd.io.excel.register_writer(Writer) to register the writer. But you need to make sure the engine which you've specified matches your version of openpyxl. You can check the process of how a specific writer is chosen here; the writers which are currently registered for each version can be checked via print(pd.io.excel._writers).
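A short usage sketch (whether the registered class is actually picked for the 'openpyxl' engine depends on your pandas and openpyxl versions, as noted above):

pd.io.excel.register_writer(Writer)  # make the implementation known to pandas
print(pd.io.excel._writers)          # inspect which class each engine maps to

# Because Writer defines its own `engine` attribute, it can also be
# instantiated directly, bypassing the engine lookup that bit us above:
test = Writer('test.xlsx')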
As a side note: you can also subclass one of the already available specific writers and reuse their write_cells and save methods, for example (however, you will need to register your writer in that case as well); a sketch follows the list below:
_Openpyxl1Writer
_Openpyxl20Writer
_Openpyxl22Writer
_XlwtWriter
_XlsxWriter
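For instance, building on _XlsxWriter and only adding an extra attribute might look roughly like this (a sketch; the import path and constructor signature can differ between pandas versions):

from pandas.io.excel import _XlsxWriter

class Writer(_XlsxWriter):
    # engine, supported_extensions, write_cells and save are all inherited

    def __init__(self, fname, **kwargs):
        super().__init__(fname, **kwargs)
        self.newvar = 0

pd.io.excel.register_writer(Writer)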

Overriding file.write in python 3

I'm aware of the SO post How do I override file.write() in Python 3? but after looking it over and trying what's suggested I'm still stuck.
I want to override the file.write method in Python 3 so that I can "REDACT" certain words (usernames, passwords, etc.).
I found a great example of overriding print and the general stdout and stderr: http://code.activestate.com/recipes/119404/
The issue is that it doesn't work for file.write. How can I override the file.write?
My code for redacting when printing is:
def write(self, text):
    for word in self.redacted_list:
        text = text.replace(word, "REDACTED")
    self.origOut.write(text)
    return text
thanks
From the self.origOut.write(text) I assume you are trying to write an in-between-class that pretends to be a file but provides a different .write() method.
I don't see any problems in the code you posted (assuming it's a method of a class you use). Possibly you wrote a class but forgot to create instances of it?
Did you try to write something like this?:
class IAmNotARealFile:
    def __init__(self, real_file):
        self.origOut = real_file

    def __getattr__(self, attr_name):  # provide everything a real file has
        return getattr(self.origOut, attr_name)

    def write(self, text):
        # your redaction logic from above goes here
        return self.origOut.write(text)

with open('test.txt', 'w') as f:
    f = IAmNotARealFile(f)  # did you forget this?
    f.write('some text SECRET blah SECRET')  # calls IAmNotARealFile.write with your extra code

with open('test.txt') as f:
    f = IAmNotARealFile(f)
    print(f.read())  # this "falls through" to the actual file object
you will also probably want to return self.origOut.write() in your own .write(), if you don't have a specific reason not to.
Note that if you rewrite open() to directly return an IAmNotARealFile:
import builtins

def open(*args, **kwargs):
    return IAmNotARealFile(builtins.open(*args, **kwargs))
you will have to manually supply (some) "magic methods" because
This method may still be bypassed when looking up special methods as the result of implicit invocation via language syntax or built-in functions. See Special method lookup.
--docs for .__getattribute__(), but it also applies to .__getattr__()
Why?
Bypassing the __getattribute__() machinery in this fashion provides significant scope for speed optimisations within the interpreter, at the cost of some flexibility in the handling of special methods (the special method must be set on the class object itself in order to be consistently invoked by the interpreter).
-- On special ("magic") method lookup [code style and emphasis mine]
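Concretely, if open() returns an IAmNotARealFile, a plain "with open(...) as f:" will fail even with __getattr__ in place, because __enter__ and __exit__ are looked up on the class itself. A minimal sketch of forwarding them by hand:

class IAmNotARealFile:
    def __init__(self, real_file):
        self.origOut = real_file

    def __getattr__(self, attr_name):
        return getattr(self.origOut, attr_name)

    # Special ("magic") methods are looked up on the type, not the instance,
    # so __getattr__ alone does not make the with-statement work:
    def __enter__(self):
        self.origOut.__enter__()
        return self

    def __exit__(self, *exc_info):
        return self.origOut.__exit__(*exc_info)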

python (3.7) dataclass for self referenced structure [duplicate]

This question already has answers here:
How do I type hint a method with the type of the enclosing class? (7 answers)
Closed 3 years ago.
class Node:
    def append_child(self, node: Node):
        if node != None:
            self.first_child = node
            self.child_nodes += [node]
How do I do node: Node? Because when I run it, it says name 'Node' is not defined.
Should I just remove the : Node and instance check it inside the function?
But then how could I access node's properties (which I would expect to be instance of Node class)?
I don't know how to implement type casting in Python, BTW.
"self" references in type checking are typically done using strings:
class Node:
    def append_child(self, node: 'Node'):
        if node != None:
            self.first_child = node
            self.child_nodes += [node]
This is described in the "Forward references" section of PEP 484.
Please note that this doesn't do any type-checking or casting. This is a type hint which python (normally) disregards completely[1]. However, third party tools (e.g. mypy) use type hints to do static analysis on your code and can generate errors before runtime.
Also, starting with Python 3.7, you can implicitly convert all of your type hints to strings within a module by using from __future__ import annotations (and in Python 4.0, this will be the default).
[1] The hints are introspectable -- so you could use them to build some kind of runtime checker using decorators or the like if you really wanted to, but python doesn't do this by default.
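For example, the string form is resolved back to the real class when the hints are introspected (a small sketch using typing.get_type_hints):

import typing

class Node:
    def append_child(self, node: 'Node'):
        ...

print(typing.get_type_hints(Node.append_child))
# roughly: {'node': <class '__main__.Node'>}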
Python 3.7 and Python 3.10 onwards
PEP 563 introduced postponed evaluations, stored in __annotations__ as strings. A user can enable this through the __future__ directive:
from __future__ import annotations
This makes it possible to write:
class C:
    a: C

    def foo(self, b: C):
        ...
Starting in Python 3.10 (release planned 2021-10-04), this behaviour will be default.
Edit 2020-11-15: Originally it was announced to be mandatory starting in Python 4.0, but now it appears this will already be the default in Python 3.10, which is expected 2021-10-04. This surprises me as it appears to be a violation of the promise in __future__ that this backward compatibility would not be broken until Python 4.0. Maybe the developers consider that 3.10 is 4.0, or maybe they have changed their mind. See also Why did __future__ MandatoryRelease for annotations change between 3.7 and 3.8?.
In Python 3.7+ you can use a dataclass, and you can also annotate the dataclass's fields.
In this particular example Node references itself, and if you run it you will get:
NameError: name 'Node' is not defined
To overcome this error you have to include:
from __future__ import annotations
It must be the first statement in a module. In Python 4.0 and above you won't need to include this import.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Node:
    value: int
    left: Node
    right: Node

    @property
    def is_leaf(self) -> bool:
        """Check if node is a leaf"""
        return not self.left and not self.right
Example:
node5 = Node(5, None, None)
node25 = Node(25, None, None)
node40 = Node(40, None, None)
node10 = Node(10, None, None)
# balanced tree
node30 = Node(30, node25, node40)
root = Node(20, node10, node30)
# unbalanced tree
node30 = Node(30, node5, node40)
root = Node(20, node10, node30)
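A quick check of the is_leaf property on the nodes built above:

print(node25.is_leaf)  # True  -- both children are None
print(root.is_leaf)    # False -- it has a left and a right child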
If you just want an answer to the question, go read mgilson's answer.
mgilson's answer provides a good explanation of how you should work around this limitation of Python. But I think it's also important to have a good understanding of why this doesn't work, so I'm going to provide that explanation.
Python is a little different from other languages. In Python, there's really no such thing as a "declaration." As far as Python is concerned, code is just code. When you import a module, Python creates a new namespace (a place where global variables can live), and then executes each line of the module from top to bottom. def foo(args): code is just a compound statement that bundles a bunch of source code together into a function and binds that function to the name foo. Similarly, class Bar(bases): code creates a class, executes all of the code immediately (inside a separate namespace which holds any class-level variables that might be created by the code, particularly including methods created with def), and then binds that class to the name Bar. It has to execute the code immediately, because all of the methods need to be created immediately. Because the code gets executed before the name has been bound, you can't refer to the class at the top level of the code. It's perfectly fine to refer to the class inside of a method, however, because that code doesn't run until the method gets called.
(You might be wondering why we can't just bind the name first and then execute the code. It turns out that, because of the way Python implements classes, you have to know which methods exist up front, before you can even create the class object. It would be possible to create an empty class and then bind all of the methods to it one at a time with attribute assignment (and indeed, you can manually do this, by writing class Bar: pass and then doing def method1():...; Bar.method1 = method1 and so on), but this would result in a more complicated implementation, and be a little harder to conceptualize, so Python does not do this.)
To summarize in code:
class C:
    C  # NameError: C doesn't exist yet.

    def method(self):
        return C  # This is fine. By the time the method gets called, C will exist.

C  # This is fine; the class has been created by the time we hit this line.

How do I pickle an object hierarchy, with each object usually in its own individual file, so that saving is fast?

I want to use pickle, specifically cPickle to serialize my objects' data as a folder of files representing modules, projects, module objects, scene objects, etc. Is there an easy way to do this?
Thus unpickling will be a little tricky, as each parent object stores a reference to child/sibling objects while running, but the pickle data of the parent will hold a filepath to the object.
I started with a PathUtil class that all classes inherit, but have been running into issues. Has anyone solved a similar problem/feature of data file saving and restoring?
The more transparently it works with existing code the better. For instance, if using a metaclass __call__ keeps the existing constructor syntax the same, that would be a plus. For example, that __call__ would check the pickle file first and load it if it exists, while falling back to default construction if it doesn't.
You can override __getstate__ to write to a new pickle file and return its path, and __setstate__ to unpickle the file.
import os
import pickle

DIRNAME = 'path/to/my/pickles/'

class AutoPickleable:
    def __getstate__(self):
        state = dict(self.__dict__)
        path = os.path.join(DIRNAME, str(id(self)))
        with open(path, 'wb') as f:
            pickle.dump(state, f)
        return path

    def __setstate__(self, path):
        with open(path, 'rb') as f:
            state = pickle.load(f)
        self.__dict__.update(state)
Now, each type which should have this special auto-pickleable behavior, should subclass AutoPickleable.
When you want to dump the files, you can do pickle.dumps(obj) or copy.deepcopy(obj) and ignore the result.
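For illustration, a minimal usage sketch (the Scene class and its field are made up, and DIRNAME must already exist):

class Scene(AutoPickleable):
    def __init__(self, name):
        self.name = name

scene = Scene('intro')
data = pickle.dumps(scene)  # side effect: writes path/to/my/pickles/<id(scene)>;
                            # the returned bytes only carry that file path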
Unpickling works as usual (pickle.load). If you want to restore the objects from a file path (and not from the result of pickle.dumps), it is a bit trickier. Let me know if you want it, and I'll add details. In any case, if you wrap your AutoPickleable object with a "standard" object, and do all pickle operations on that, it should all work.
There are several potential problems with this approach, but for a "clean" case such as the one you describe, it might work.
Some notes:
There is no way to "dynamically" specify the directory to write to. It has to be globally accessible, and set before the pickling operation
Probably wouldn't work if several objects keep references to the same AutoPickleable object, or if you have circular references (in general, pickle handles these cases with no problem)
There is no code here to clean the directory / delete the files.

Broadcast python objects using mpi4py

I have a python object
<GlobalParams.GlobalParams object at 0x7f8efe809080>
which contains various numpy arrays, parameter values, etc., and which I use in various functions, for example:
myParams = GlobalParams(input_script) #reads in various parameters from an input script and assigns these to myParams
myParams.data #calls the data array from myParams
I am trying to parallelise my code and would like to broadcast the myParams object so that it is available to the other child processes. I have done this previously for individual numpy arrays, values etc. in the form:
points = comm.bcast(points, root = 0)
However, I don't want to have to do this individually for all the contents of myParams. I would like to broadcast the object in its entirety so that it can be accessed on other cores. I have tried the obvious:
myParams = comm.bcast(myParams, root=0)
but this returns the error:
myParams = comm.bcast(myParams, root=0)
File "MPI/Comm.pyx", line 1276, in mpi4py.MPI.Comm.bcast (src/mpi4py.MPI.c:108819)
File "MPI/msgpickle.pxi", line 612, in mpi4py.MPI.PyMPI_bcast (src/mpi4py.MPI.c:47005)
File "MPI/msgpickle.pxi", line 112, in mpi4py.MPI.Pickle.dump (src/mpi4py.MPI.c:40704)
TypeError: cannot serialize '_io.TextIOWrapper' object
What is the appropriate way to share this object with the other cores? Presumably this is a common requirement in python, but I can't find any documentation on this. Most examples look at broadcasting a single variable/array.
This doesn't look like an MPI problem; it looks like a problem with object serialisation for broadcast, which internally is using the Pickle module.
Specifically in this case, it can't serialise a _io.TextIOWrapper - so I suggest hunting down where in your class this is used.
Once you work out which field(s) can't be serialised, you can remove them, broadcast, then reassemble them on each individual rank, using some method that you need to design yourself (recreateUnpicklableThing() in the example below). You could do that by adding these methods to your class for Pickle to call before and after broadcast:
def __getstate__(self):
    members = self.__dict__.copy()
    # remove things that can't be pickled, by name
    del members['someUnpicklableThing']
    return members

def __setstate__(self, members):
    self.__dict__.update(members)
    # On unpickle, manually recreate the things that you couldn't pickle
    # (this method recreates self.someUnpicklableThing using some metadata,
    # carefully chosen by you, that Pickle can serialise).
    self.recreateUnpicklableThing(self.dataForSettingUpSomething)
See here for more on how these methods work: https://docs.python.org/2/library/pickle.html
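Putting the pieces together, a minimal sketch of what that could look like for GlobalParams, assuming (purely for illustration) that the unpicklable member is an open log file:

from mpi4py import MPI

class GlobalParams:
    def __init__(self, input_script):
        self.input_script = input_script
        self.log = open('run.log', 'w')   # an _io.TextIOWrapper: not picklable

    def __getstate__(self):
        members = self.__dict__.copy()
        del members['log']                # drop the unpicklable handle
        return members

    def __setstate__(self, members):
        self.__dict__.update(members)
        self.log = open('run.log', 'a')   # recreate it on the receiving rank

comm = MPI.COMM_WORLD
myParams = GlobalParams('input_script.txt') if comm.rank == 0 else None
myParams = comm.bcast(myParams, root=0)   # now pickles and broadcasts cleanly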

Writing a custom builder that executes external command and python function

I'm looking to write a custom SCons Builder that:
Executes an external command to produce foo.temp
Then executes a python function to manipulate foo.temp and produce the final output file
I've referred to the two following sections, but I'm not sure the correct way to "glue" them together.
18.1. Writing Builders That Execute External Commands
18.4. Builders That Execute Python Functions
I know that Command accepts a list of actions to take. But how do I properly handle that intermediate file? Ideally the intermediate file would be invisible to the user -- the entire Builder would appear to operate atomically.
Here's what I've come up with that seems to be working. However the .bin file isn't being deleted automatically.
from SCons.Action import Action
from SCons.Builder import Builder
from SCons.Errors import StopError
from SCons.Util import is_List
from SCons.Script import Delete

_objcopy_builder = Builder(
    action = 'objcopy -O binary $SOURCE $TARGET',
    suffix = '.bin',
    single_source = 1
)

def _add_header(target, source, env):
    source = str(source[0])
    target = str(target[0])
    with open(source, 'rb') as src:
        with open(target, 'wb') as tgt:
            tgt.write(b'MODULE\x00\x00')
            tgt.write(src.read())
    return 0

_addheader_builder = Builder(
    action = _add_header,
    single_source = 1
)

def Elf2Mod(env, target, source, *args, **kw):
    def check_one(x, what):
        if not is_List(x):
            x = [x]
        if len(x) != 1:
            raise StopError('Only one {0} allowed'.format(what))
        return x

    target = check_one(target, 'target')
    source = check_one(source, 'source')
    # objcopy a binary file
    binfile = _objcopy_builder.__call__(env, source=source, **kw)
    # write the module header
    _addheader_builder.__call__(env, target=target, source=binfile, **kw)
    # delete the intermediate binary file
    # TODO: Not working
    Delete(binfile)
    return target

def generate(env):
    """Add Builders and construction variables to the Environment."""
    env.AddMethod(Elf2Mod, 'Elf2Mod')
    print('Added Elf2Mod to env {0}'.format(env))

def exists(env):
    return True
This can indeed be done with the Command builder, by specifying a list of actions, as follows:
Command('foo.temp', 'foo.in',
        ['your_external_action',
         your_python_function])
Notice that foo.in is the source, and you should name it accordingly. But if foo.temp is internal as you mention, then this approach probably isn't the best one.
Another way, which I feel is much more flexible, would be to use Custom Builder with a Generator and/or Emitter.
The Generator is a Python function where you do the actual work, which in your case would be calling the external command, and also call the Python function.
An Emitter allows you to have fine-tuned control over the sources and targets. I used a Builder with an Emitter (and Generator) once to do C++ and Java code generation with Thrift input IDL files. I had to read and process the Thrift input file to know exactly what files would be code-generated (which are the actual targets), and the Emitter is the best/only way to do something like this. If your particular use-case isn't so complicated, you can skip the Emitter and just list your sources/targets in the call to the builder. But if you want foo.temp to be transparent to the end-user, then you'll need an Emitter.
When using a Custom Builder with a Generator and Emitter, the Emitter will be called every time by SCons to calculate the sources and dependencies to know if the Generator needs to be called. The Generator will only be called if one of the targets is considered older with respect to the sources.
There are numerous examples showing how to use a Generator and Emitter in a Custom Builder, so I won't list the code here, but let me know if you need help with the syntax, etc.
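For a rough idea of the shape this can take, here is a minimal sketch using an Emitter together with a list of actions (the Elf2Mod name, suffixes and header bytes come from the question above; everything else is illustrative, not a drop-in solution):

# SConstruct
import os

def _add_header(target, source, env):
    # target[0] is the final module, target[1] the intermediate .bin
    # declared by the emitter below.
    with open(str(target[1]), 'rb') as src, open(str(target[0]), 'wb') as tgt:
        tgt.write(b'MODULE\x00\x00')
        tgt.write(src.read())
    return 0

def _elf2mod_emitter(target, source, env):
    # Declare the intermediate .bin as an additional target so SCons tracks
    # it (dependencies, 'scons -c' clean-up) while the caller only ever
    # names the final .mod file.
    bin_file = os.path.splitext(str(target[0]))[0] + '.bin'
    return target + [bin_file], source

elf2mod = Builder(
    action=[
        'objcopy -O binary $SOURCE ${TARGETS[1]}',  # external command
        Action(_add_header, 'Packaging $TARGET'),   # Python function
    ],
    emitter=_elf2mod_emitter,
    suffix='.mod',
    src_suffix='.elf',
)

env = Environment()
env.Append(BUILDERS={'Elf2Mod': elf2mod})
env.Elf2Mod('module.mod', 'program.elf')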
