Background: I am working on an NLP problem where I need to extract different sets of features from different types of text documents. My current setup has a FeatureExtractor base class that is subclassed once per document type; each subclass calculates its own set of features and returns a pandas DataFrame.
All of these subclasses are called by a wrapper class, FeatureExtractionRunner, which runs every extractor on the documents and returns the output for all document types.
Problem: this pattern leads to a lot of subclasses. I currently have 14 subclasses because I have 14 types of documents, and the number may grow. That is too many classes to maintain. Is there an alternative way of doing this with less subclassing?
Here is some representative sample code for what I described:
from abc import ABCMeta, abstractmethod


class FeatureExtractor(metaclass=ABCMeta):
    # base feature extractor class
    def __init__(self, document):
        self.document = document

    @abstractmethod
    def doc_to_features(self):
        return NotImplemented


class ExtractorTypeA(FeatureExtractor):
    # do some feature calculations.....
    def _calculate_shape_features(self):
        return None

    def _calculate_size_features(self):
        return None

    def doc_to_features(self):
        # calls all the fancy feature calculation methods, which read self.document
        f1 = self._calculate_shape_features()
        f2 = self._calculate_size_features()
        # do some calculations on the document and return a pandas dataframe by merging them (merge f1, f2....etc)
        data = "dataframe-1"
        return data


class ExtractorTypeB(FeatureExtractor):
    # do some feature calculations.....
    def _calculate_some_fancy_features(self):
        return None

    def _calculate_some_more_fancy_features(self):
        return None

    def doc_to_features(self):
        # calls all the fancy feature calculation methods, which read self.document
        f1 = self._calculate_some_fancy_features()
        f2 = self._calculate_some_more_fancy_features()
        # do some calculations on the document and return a pandas dataframe (merge f1, f2 etc)
        data = "dataframe-2"
        return data


class ExtractorTypeC(FeatureExtractor):
    # do some feature calculations.....
    def doc_to_features(self):
        # do some calculations on the document and return a pandas dataframe
        data = "dataframe-3"
        return data


class FeatureExtractionRunner:
    # a class to call all types of feature extractors
    def __init__(self, document, *args, **kwargs):
        self.document = document
        self.type_a = ExtractorTypeA(self.document)
        self.type_b = ExtractorTypeB(self.document)
        self.type_c = ExtractorTypeC(self.document)
        # more of these extractors would be there

    def call_all_type_of_extractors(self):
        type_a_features = self.type_a.doc_to_features()
        type_b_features = self.type_b.doc_to_features()
        type_c_features = self.type_c.doc_to_features()
        # more such extractors would be there....
        return [type_a_features, type_b_features, type_c_features]


all_type_of_features = FeatureExtractionRunner("some document").call_all_type_of_extractors()
Answering the question first: you can avoid subclassing entirely at the cost of writing the __init__ method each time. Or you can get rid of the classes entirely and convert them into a bunch of functions. Or you can even merge all the classes into a single one. Note that none of these options will make the code simpler or more maintainable; they would just change its shape to some extent.
IMHO this situation is a perfect example of inherent problem complexity, by which I mean that the domain (NLP) and the particular use case (document feature extraction) are complex in and of themselves.
For example, featureX and featureY are likely to be totally different things that cannot be calculated together, so you end up with one method for each. Similarly, the procedure to merge these features into a dataframe might differ from the one that merges the fancy features. Having lots of functions/classes in this situation seems totally reasonable to me, and keeping them separate is both logical and good for maintainability.
That said, real code reduction might be possible if you can combine some feature calculation methods into a more generic function, though I can't say for sure whether that is possible.
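As a rough illustration of the "bunch of functions" option, here is a minimal sketch in which each document type is a table entry listing the feature functions to run, rather than a subclass. All names (shape_features, size_features, digit_features, EXTRACTORS) are hypothetical, and plain dicts stand in for the pandas DataFrames of the question. As noted above, this only changes the shape of the code; every feature calculation still has to be written.

```python
from typing import Callable, Dict, List

# Hypothetical feature functions: each takes a document and returns a dict of features.
def shape_features(document: str) -> Dict[str, float]:
    return {"n_lines": float(document.count("\n") + 1)}

def size_features(document: str) -> Dict[str, float]:
    return {"n_chars": float(len(document)), "n_words": float(len(document.split()))}

def digit_features(document: str) -> Dict[str, float]:
    return {"n_digits": float(sum(ch.isdigit() for ch in document))}

FeatureFn = Callable[[str], Dict[str, float]]

# One table entry per document type instead of one subclass per type.
EXTRACTORS: Dict[str, List[FeatureFn]] = {
    "type_a": [shape_features, size_features],
    "type_b": [size_features, digit_features],
}

def doc_to_features(document: str, doc_type: str) -> Dict[str, float]:
    """Run every feature function registered for doc_type and merge the results."""
    features: Dict[str, float] = {}
    for fn in EXTRACTORS[doc_type]:
        features.update(fn(document))
    return features

features = doc_to_features("invoice 42\ntotal 100", "type_b")
print(features)
```

Adding a fifteenth document type then means adding one entry to EXTRACTORS, and feature functions shared between types are written only once.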
Related
I have more of a design question, and I am not sure how to handle it. I have a script, preprocessing.py, that reads a .csv file with a text column that I would like to preprocess by removing punctuation, certain characters, etc.
What I have done so far is write a class with several methods, as follows:
import pandas as pd


class Preprocessing(object):
    def __init__(self, file):
        self.my_data = pd.read_csv(file)

    def remove_punctuation(self):
        self.my_data['text'] = self.my_data['text'].str.replace('#', '')

    def remove_hyphen(self):
        self.my_data['text'] = self.my_data['text'].str.replace('-', '')

    def remove_words(self):
        self.my_data['text'] = self.my_data['text'].str.replace('reference', '')

    def save_data(self):
        self.my_data.to_csv('my_data.csv')


def preprocessing(file_my):
    f = Preprocessing(file_my)
    f.remove_punctuation()
    f.remove_hyphen()
    f.remove_words()
    f.save_data()
    return f


if __name__ == '__main__':
    preprocessing('/path/to/file.csv')
Although it works fine, I would like to be able to extend the code easily and have smaller classes instead of one large class, so I decided to use an abstract class:
import pandas as pd
from abc import ABC, abstractmethod

my_data = pd.read_csv('/Users/kgz/Desktop/german_web_scraping/file.csv')


class Preprocessing(ABC):
    @abstractmethod
    def processor(self):
        pass


class RemovePunctuation(Preprocessing):
    def processor(self):
        return my_data['text'].str.replace('#', '')


class RemoveHyphen(Preprocessing):
    def processor(self):
        return my_data['text'].str.replace('-', '')


class Removewords(Preprocessing):
    def processor(self):
        return my_data['text'].str.replace('reference', '')


final_result = [cls().processor() for cls in Preprocessing.__subclasses__()]
print(final_result)
So now each class is responsible for one task, but there are a few issues I do not know how to handle since I am new to abstract classes. First, I am reading the file outside the classes, and I am not sure whether that is good practice. If not, should I pass the data as an argument to the processor function, or have another class that is responsible for reading it?
Second, having one class with several methods allowed for a flow, so every transformation happened in order (i.e. first punctuation is removed, then hyphens are removed, etc.), but I do not know how to handle this ordering and dependency with abstract classes.
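One possible way to address both concerns, sketched under assumptions: each step receives the data as an argument instead of reading a module-level variable, and the order is made explicit by listing the steps rather than relying on Preprocessing.__subclasses__(). The names Preprocessor, RemoveChars, RemoveWord and run_pipeline are hypothetical, and a plain list of strings stands in for the DataFrame column.

```python
from abc import ABC, abstractmethod
from typing import List


class Preprocessor(ABC):
    """One preprocessing step; the data is passed in, not read from a global."""

    @abstractmethod
    def process(self, texts: List[str]) -> List[str]:
        ...


class RemoveChars(Preprocessor):
    def __init__(self, chars: str) -> None:
        self.chars = chars

    def process(self, texts: List[str]) -> List[str]:
        table = str.maketrans("", "", self.chars)  # delete every char in self.chars
        return [t.translate(table) for t in texts]


class RemoveWord(Preprocessor):
    def __init__(self, word: str) -> None:
        self.word = word

    def process(self, texts: List[str]) -> List[str]:
        return [t.replace(self.word, "") for t in texts]


def run_pipeline(texts: List[str], steps: List[Preprocessor]) -> List[str]:
    """Apply the steps in list order, feeding each step the previous step's output."""
    for step in steps:
        texts = step.process(texts)
    return texts


# The order of this list is the order of the transformations.
steps = [RemoveChars("#"), RemoveChars("-"), RemoveWord("reference")]
cleaned = run_pipeline(["#see-reference 12"], steps)
print(cleaned)
```

Reading the file can then live in whatever code calls run_pipeline, which keeps the steps easy to test on small in-memory inputs.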
I'm running a constrained optimisation with scipy.optimize.minimize(method='COBYLA').
In order to evaluate the cost function, I need to run a relatively expensive simulation to compute a dataset from the input variables, and the cost function is one (cheap to compute) property of that dataset. However, two of my constraints are also dependent on that expensive data.
So far, the only way I have found to constrain the optimisation is to have each of the constraint functions recompute the same dataset that the cost function already has calculated (simplified quasi-code):
import numpy as np
from scipy.optimize import minimize

def costfun(x):
    data = expensive_fun(x)
    return cheap_fun1(data)

def constr1(x):
    data = expensive_fun(x)
    return cheap_fun2(data)

def constr2(x):
    data = expensive_fun(x)
    return cheap_fun3(data)

constraints = [{'type': 'ineq', 'fun': constr1},
               {'type': 'ineq', 'fun': constr2}]

# initial guess
x0 = np.ones((6,))

opt_result = minimize(costfun, x0, method='COBYLA',
                      constraints=constraints)
This is clearly not efficient because expensive_fun(x) is called three times for every x.
I could change this slightly to include a universal "evaluate some cost" function which runs the expensive computation, and then evaluates whatever criterion it has been given. But while that saves me from having to write the "expensive" code several times, it still runs three times for every iteration of the optimizer:
# universal cost function evaluator
def criterion_from_x(x, cfun):
    data = expensive_fun(x)
    return cfun(data)

def costfun(data):
    return cheap_fun1(data)

def constr1(data):
    return cheap_fun2(data)

def constr2(data):
    return cheap_fun3(data)

constraints = [{'type': 'ineq', 'fun': criterion_from_x, 'args': (constr1,)},
               {'type': 'ineq', 'fun': criterion_from_x, 'args': (constr2,)}]

# initial guess
x0 = np.ones((6,))

opt_result = minimize(criterion_from_x, x0, method='COBYLA',
                      args=(costfun,), constraints=constraints)
I have not managed to find any way to set something up where x is used to generate data at each iteration, and data is then passed to both the objective function as well as the constraint functions.
Does something like this exist? I've noticed the callback argument to minimize(), but that is a function which is called after each step. I'd need some kind of preprocessor which is called on x before each step, whose results are then available to the cost function and constraint evaluation. Maybe there's a way to sneak it in somehow? I'd like to avoid writing my own optimizer.
One more traditional way to solve this would be to evaluate the constraints in the cost function (which has all the data it needs for that), have it add a penalty for violated constraints to the main cost function, and run the optimizer without the explicit constraints. I've tried this before, though, and found that the main cost function can become somewhat chaotic where the constraints are violated, so an optimizer might get stuck in some region that violates the constraints and never find its way out again.
Another approach would be to produce some kind of global variable in the cost function and write the constraint evaluation to use that global variable, but that could be very dangerous if multithreading/-processing gets involved, or if the name I choose for the global variable collides with a name used anywhere else in the code:
def costfun(x):
    global data
    data = expensive_fun(x)
    return cheap_fun1(data)

def constr1(x):
    global data
    return cheap_fun2(data)

def constr2(x):
    global data
    return cheap_fun3(data)
I know that some people use file I/O for cases where the cost function involves running a large simulation which produces a bunch of output files. After that, the constraint functions can just access those files -- but my problem is not that big.
I'm currently using Python v3.9 and scipy 1.9.1.
You could write a decorator class in the same vein as scipy's MemoizeJac that caches the return value of the expensive function each time it is called:
import numpy as np

class MemoizeData:
    def __init__(self, obj_fun, exp_fun, constr_fun):
        self.obj_fun = obj_fun
        self.exp_fun = exp_fun
        self.constr_fun = constr_fun
        self._data = None
        self.x = None

    def _compute_if_needed(self, x, *args):
        # recompute only when x differs from the last point that was evaluated
        if not np.all(x == self.x) or self._data is None:
            self.x = np.asarray(x).copy()
            self._data = self.exp_fun(x)

    def __call__(self, x, *args):
        self._compute_if_needed(x, *args)
        return self.obj_fun(self._data)

    def constraint(self, x, *args):
        self._compute_if_needed(x, *args)
        return self.constr_fun(self._data)
This way, the expensive function is evaluated only once for each new x. Then, after writing all your constraints into one constraint function, you could use it like this:
from scipy.optimize import minimize

def all_constrs(data):
    return np.hstack((cheap_fun2(data), cheap_fun3(data)))

obj = MemoizeData(cheap_fun1, expensive_fun, all_constrs)
constr = {'type': 'ineq', 'fun': obj.constraint}

x0 = np.ones(6)
opt_result = minimize(obj, x0, method="COBYLA", constraints=constr)
While Joni was writing their answer, I found another one, which is admittedly more hacky. I prefer theirs, but for the sake of completeness, I wanted to post this one, too.
It's derived from the material at https://mdobook.github.io/ and the accompanying video tutorials from the BYU FLOW Lab, in particular this video:
The trick is to use non-local variables (here, module-level globals) to keep a cache of the last evaluation of the expensive function:
import numpy as np

# module-level cache of the most recent evaluation
last_x = None
last_data = None

def compute_data(x):
    data = expensive_fun(x)
    return data

def get_last_data(x):
    # At module level these names must be declared `global`;
    # `nonlocal` is only valid inside a nested function.
    global last_x, last_data
    if not np.array_equal(x, last_x):
        last_data = compute_data(x)
        last_x = np.copy(x)  # copy, in case the optimizer reuses the array
    return last_data

def costfun(x):
    data = get_last_data(x)
    return cheap_fun1(data)

def constr1(x):
    data = get_last_data(x)
    return cheap_fun2(data)

def constr2(x):
    data = get_last_data(x)
    return cheap_fun3(data)
...and then everything can progress as in my original code in the question.
Reasons why I prefer Joni's class-based version:
variable scopes are clearer than with nonlocal
If some of the functions allow calculation of their Jacobian, or there are other things worth buffering, the added complexity is held in check better than with
Having a class instance do all the work also allows you to do other interesting things, like keeping a record of all past evaluations and the path taken by the optimizer, without having to use a separate callback function. Very useful for debugging/tweaking convergence if the optimizer won't converge or takes too long, but also to visualize or otherwise investigate the objective function or similar.
The same ability might actually be really cool for things like constructing a response surface model from the results of previous function evaluations. That could be used to establish a starting guess in case the expensive function is some numerical method that benefits from a good starting point.
Both approaches allow the use of "cheap" constraints which don't require the expensive function to be evaluated, by simply providing them as separate functions. Not sure whether that would help much with compute times, though. I suppose that would depend on the algorithm used by the optimizer.
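To make the record-keeping idea concrete, here is a minimal sketch of a cache object that also logs every distinct point it evaluates. It is an illustration only, not the code from either answer: RecordingCache is a hypothetical name, the expensive_fun below is a trivial stand-in, and points are compared as tuples of floats so the example runs without NumPy or SciPy.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple


class RecordingCache:
    """Cache the most recent expensive evaluation and log all of them."""

    def __init__(self, expensive_fun: Callable[[Sequence[float]], Any]) -> None:
        self.expensive_fun = expensive_fun
        self.history: List[Tuple[Tuple[float, ...], Any]] = []
        self._last_key: Optional[Tuple[float, ...]] = None
        self._last_data: Any = None

    def data(self, x: Sequence[float]) -> Any:
        key = tuple(float(v) for v in x)  # immutable snapshot of x
        if key != self._last_key:
            self._last_data = self.expensive_fun(x)
            self._last_key = key
            self.history.append((key, self._last_data))
        return self._last_data


# Stand-in for the expensive simulation (assumption for this sketch).
def expensive_fun(x: Sequence[float]) -> float:
    return sum(v * v for v in x)


cache = RecordingCache(expensive_fun)

def costfun(x):
    return cache.data(x)

def constr1(x):
    return 10.0 - cache.data(x)

# Simulate an optimizer calling objective and constraint at the same points.
for x in ([1.0, 2.0], [0.5, 0.5]):
    costfun(x)
    constr1(x)

print(cache.history)
```

After a run, cache.history holds the path of evaluated points with their data, which could be plotted or inspected without registering a separate callback.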
So I am writing a GAN in tensorflow, and need the discriminator and generator to be objects. Now I am having problems with creating the training dataset for the discriminator.
Currently the relevant part of my code looks like this:
self.dataset=tf.data.Dataset.from_tensor_slices((self.y_,self.x_)) #creates dataset
self.fake_dataset=tf.data.Dataset.from_tensor_slices((self.x_fake_)) #creates dataset
self.dataset=self.dataset.shuffle(buffer_size=BUFFER_SIZE) #shuffles
self.fake_dataset=self.fake_dataset.shuffle(buffer_size=BUFFER_SIZE) #shuffles
self.dataset=self.dataset.repeat().batch(self.batch_size) #batches
self.fake_dataset=self.fake_dataset.repeat().batch(self.batch_size) #batches
self.iterator=tf.data.Iterator.from_structure(self.dataset.output_types,self.dataset.output_shapes) #creates iterators
self.fake_iterator=tf.data.Iterator.from_structure(self.fake_dataset.output_types,self.fake_dataset.output_shapes) #creates iterators
self.x=self.iterator.get_next()
self.x_fake=self.fake_iterator.get_next()
self.dataset_init_op = self.iterator.make_initializer(self.dataset,name=self.name+'_dataset_init')
self.fake_dataset_init_op=self.fake_iterator.make_initializer(self.fake_dataset,name=self.name+'_dataset_init')
What I need is for the function to alternately give one batch of self.x, followed by one batch of self.x_fake.
Is there an easy way to do this, or will I have to resort to a counter and an if statement?
I'm not sure I understand exactly what you need, but if you want a single call to alternate between the two iterators, note that the choice of iterator is made at graph construction time, so you can use plain Python logic to pick the one you need. For example:
def __init__(self):
    # Make graph and iterators...
    self._use_fake_batch = False

def next_batch(self):
    iterator = self.fake_iterator if self._use_fake_batch else self.iterator
    self._use_fake_batch = not self._use_fake_batch
    return iterator.get_next()
Or without an additional variable, using itertools:
from itertools import chain, repeat

def __init__(self):
    # Make graph and iterators...
    self._iterators = chain.from_iterable(repeat((self.iterator, self.fake_iterator)))

def next_batch(self):
    return next(self._iterators).get_next()
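The alternation itself can be checked without TensorFlow. The sketch below uses plain Python iterators over strings as stand-ins for the two dataset iterators (an assumption for illustration) and itertools.cycle, which yields the same endless real, fake, real, fake sequence as chain.from_iterable(repeat(...)).

```python
from itertools import cycle

# Stand-ins for the real and fake dataset iterators.
real_batches = iter(["real-0", "real-1", "real-2"])
fake_batches = iter(["fake-0", "fake-1", "fake-2"])

# cycle() endlessly yields the two sources in turn.
sources = cycle((real_batches, fake_batches))

def next_batch():
    """Pick the next source in the rotation and draw one batch from it."""
    return next(next(sources))

drawn = [next_batch() for _ in range(4)]
print(drawn)
```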
I'm having some doubts about the design of multiple inheritance in some Python classes.
The thing is that I wanted to extend the ttk button. This was my initial proposal (I'm omitting all the source code in methods for shortening, except init methods):
import tkinter as tk
import tkinter.ttk as ttk


class ImgButton(ttk.Button):
    """
    This has all the behaviour for a button which has an image
    """
    def __init__(self, master=None, **kw):
        super().__init__(master, **kw)
        self._img = kw.get('image')

    def change_color(self, __=None):
        """
        Changes the color of this widget randomly
        :param __: the event, which is not needed
        """
        pass

    def get_style_name(self):
        """
        Returns the specific style name applied for this widget
        :return: the style name as a string
        """
        pass

    def set_background_color(self, color):
        """
        Sets this widget's background color to that received as parameter
        :param color: the color to be set
        """
        pass

    def get_background_color(self):
        """
        Returns a string representing the background color of the widget
        :return: the color of the widget
        """
        pass

    def change_highlight_style(self, __=None):
        """
        Applies the highlight style for a color
        :param __: the event, which is not needed
        """
        pass
But I realized later that I wanted also a subclass of this ImgButton as follows:
import os
import tkinter as tk
import tkinter.ttk as ttk


class MyButton(ImgButton):
    """
    ImgButton with a specific purpose
    """
    IMG_NAME = 'filename{}.jpg'
    IMAGES_DIR = os.path.sep + os.path.sep.join(['home', 'user', 'myProjects', 'myProject', 'resources', 'images'])
    UNKNOWN_IMG = os.path.sep.join([IMAGES_DIR, IMG_NAME.format(0)])
    IMAGES = (lambda IMAGES_DIR=IMAGES_DIR, IMG_NAME=IMG_NAME: [os.path.sep.join([IMAGES_DIR, IMG_NAME.format(face)]) for face in [1,2,3,4,5] ])()

    def change_image(self, __=None):
        """
        Changes randomly the image in this MyButton
        :param __: the event, which is not needed
        """
        pass

    def __init__(self, master=None, value=None, **kw):
        # Default image when hidden or without value
        current_img = tk.PhotoImage(file=MyButton.UNKNOWN_IMG)
        super().__init__(master, image=current_img, **kw)
        if not value:
            pass
        elif not isinstance(value, (int, Die)):
            pass
        elif isinstance(value, MyValue):
            self.myValue = value
        elif isinstance(value, int):
            self.myValue = MyValue(value)
        else:
            raise ValueError()
        self.set_background_color('green')
        self.bind('<Button-1>', self.change_image, add=True)

    def select(self):
        """
        Highlights this button as selected and changes its internal state
        """
        pass

    def toggleImage(self):
        """
        Changes the image in this specific button for the next allowed for MyButton
        """
        pass
The inheritance feels natural right up to this point. The problem came when I noticed that most methods in ImgButton would also be reusable for any widget I may create in the future.
So I'm thinking about making a:
class MyWidget(ttk.Widget):
for putting in it all methods which help with color for widgets and then I need ImgButton to inherit both from MyWidget and ttk.Button:
class ImgButton(ttk.Button, MyWidget): ???
or
class ImgButton(MyWidget, ttk.Button): ???
Edited: Also I want my objects to be loggable, so I did this class:
import logging


class Loggable(object):
    def __init__(self) -> None:
        super().__init__()
        self.__logger = None
        self.__logger = self.get_logger()
        self.debug = self.get_logger().debug
        self.error = self.get_logger().error
        self.critical = self.get_logger().critical
        self.info = self.get_logger().info
        self.warn = self.get_logger().warning

    def get_logger(self):
        if not self.__logger:
            self.__logger = logging.getLogger(self.get_class())
        return self.__logger

    def get_class(self):
        return self.__class__.__name__
So now:
class ImgButton(Loggable, ttk.Button, MyWidget): ???
or
class ImgButton(Loggable, MyWidget, ttk.Button): ???
or
class ImgButton(MyWidget, Loggable, ttk.Button): ???
# ... this could go on ...
I come from Java and I don't know best practices for multiple inheritance. I don't know how I should sort the parents in the best order or any other thing useful for designing this multiple inheritance.
I have searched about the topic and found a lot of resources explaining the MRO but nothing about how to correctly design a multiple inheritance. I don't know if even my design is wrongly made, but I thought it was feeling pretty natural.
I would be grateful for some advice, and for some links or resources on this topic as well.
Thank you very much.
I've been reading about multiple inheritance these days and I've learnt quite a lot of things. I have linked my sources, resources and references at the end.
My main and most detailed source has been the book "Fluent Python", which I found available for free reading online.
It describes the method resolution order and design scenarios with multiple inheritance, along with the steps for doing it well:
Identify and separate code for interfaces: the classes that define methods but not necessarily their implementations (those should be overridden). These are usually ABCs (Abstract Base Classes). They define a type for the child class, creating an "IS-A" relationship.
Identify and separate code for mixins. A mixin is a class that brings a bundle of related method implementations for use in the child but does not define a proper type. An ABC could be a mixin by this definition, but not the reverse. A mixin defines neither an interface nor a type.
When it comes to inheriting from ABCs, classes and mixins, you should inherit from only one concrete superclass, plus several ABCs or mixins:
Example:
class MyClass(MySuperClass, MyABC, MyMixin1, MyMixin2):
In my case:
class ImgButton(ttk.Button, MyWidget):
If some combination of classes is particularly useful or frequent, you should join them under a class definition with a descriptive name:
Example:
class Widget(BaseWidget, Pack, Grid, Place):
pass
I think Loggable would be a Mixin, because it gathers convenient implementations for a functionality, but does not define a real type. So:
class MyWidget(ttk.Widget, Loggable): # May be renamed to LoggableMixin
Favor object composition over inheritance: If you can think of any way of using a class by holding it in an attribute instead of extending it or inheriting from it, you should avoid inheritance.
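The "one concrete superclass plus mixins" guideline above can be sketched without tkinter. In this illustration Button is a hypothetical stand-in for a concrete widget class such as ttk.Button, and ColorMixin and LoggableMixin are hypothetical mixins. The mixins deliberately have no __init__, so they cannot interfere with the concrete class's constructor. The base-class order follows the example in this answer; other orderings are possible.

```python
import logging


class Button:
    """Hypothetical stand-in for a concrete widget class such as ttk.Button."""

    def __init__(self, text=""):
        self.text = text


class LoggableMixin:
    """Adds a per-class logger; defines no type of its own and has no __init__."""

    @property
    def logger(self):
        return logging.getLogger(type(self).__name__)


class ColorMixin:
    """Adds colour helpers; state is created lazily, so no __init__ is needed."""

    def set_background_color(self, color):
        self._background = color

    def get_background_color(self):
        return getattr(self, "_background", None)


# One concrete superclass first, then the mixins.
class ImgButton(Button, ColorMixin, LoggableMixin):
    pass


button = ImgButton(text="ok")
button.set_background_color("green")
button.logger.debug("background is %s", button.get_background_color())

# The MRO shows the order in which Python searches for methods.
mro_names = [cls.__name__ for cls in ImgButton.__mro__]
print(mro_names)
```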
"Fluent python" - (Chapter 12) in Google books
Super is super
Super is harmful
Other problems with super
Weird super behaviour
In principle, multiple inheritance increases complexity, so unless I were certain it was needed, I would avoid it. From your post, you already seem aware of super() and the MRO.
A common recommendation is to use composition instead of multiple inheritance when possible.
Another is to subclass from only one instantiable parent class and use abstract classes as the other parents. That is, they add methods to the subclass but never get instantiated themselves, much like interfaces in Java. Those abstract classes are also called mixins, but their use (or abuse) is also debatable. See Mixins considered harmful.
As for your tkinter code, apart from the indentation of the logger code, I don't see a problem. Maybe widgets could have a logger instead of inheriting from one. I think the danger with tkinter is accidentally overriding one of the hundreds of available methods.
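A minimal sketch of the "have a logger instead of inheriting from it" suggestion, under assumptions: this ImgButton is a plain hypothetical class (a real one would subclass ttk.Button), and the logger is simply stored as an attribute.

```python
import logging


class ImgButton:
    """Hypothetical widget-like class that owns a logger (composition)."""

    def __init__(self, name, logger=None):
        self.name = name
        # The logger is held as an attribute rather than inherited from a Loggable parent.
        self.logger = logger or logging.getLogger(type(self).__name__)
        self._background = None

    def set_background_color(self, color):
        self.logger.debug("%s: background -> %s", self.name, color)
        self._background = color


button = ImgButton("ok")
button.set_background_color("green")
```

Because only one attribute name (logger) is added, the risk of accidentally shadowing an existing widget method is smaller than with a parent class that contributes debug, info, warn and so on.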
I'm still learning and like to build things that I will eventually be doing on a regular basis in the future, to give me a better understanding on how x does this or y does that.
I haven't learned much about how classes work entirely yet, but I set up a call that will go through multiple classes.
getattr(monster, monster_class.str().lower())(1)
Which calls this:
class monster:
    def vampire(x):
        monster_loot = {'Gold':75, 'Sword':50.3, 'Good Sword':40.5, 'Blood':100.0, 'Ore':.05}
        if x == 1:
            loot_table.all_loot(monster_loot)
Which in turn calls this...
class loot_table:
    def all_loot(monster_loot):
        loot = ['Gold', 'Sword', 'Good Sword', 'Ore']
        loot_dropped = {}
        for i in monster_loot:
            if i in loot:
                loot_dropped[i] = monster_loot[i]
        drop_chance.chance(loot_dropped)
And then, finally, gets to the last class.
class drop_chance:
    def chance(loot_list):
        loot_gained = []
        for i in loot_list:
            x = random.uniform(0.0,100.0)
            if loot_list[i] >= x:
                loot_gained.append(i)
        return loot_gained
And it all works, except it's not returning loot_gained. I'm assuming it's just being returned to the loot_table class and I have no idea how to bypass it all the way back down to the first line posted. Could I get some insight?
Keep using return.
def foo():
    return bar()

def bar():
    return baz()

def baz():
    return 42

print(foo())
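Applied to the loot code from the question, the same idea looks roughly like the sketch below: every function in the chain returns what the next one gives back. Plain functions are used instead of the three classes for brevity, which is a simplification of this sketch and not something the question requires.

```python
import random


def chance(loot_table):
    """Roll once per item; loot_table maps item name -> drop chance in percent."""
    loot_gained = []
    for item, percent in loot_table.items():
        if percent >= random.uniform(0.0, 100.0):
            loot_gained.append(item)
    return loot_gained


def all_loot(monster_loot):
    allowed = ['Gold', 'Sword', 'Good Sword', 'Ore']
    droppable = {item: pct for item, pct in monster_loot.items() if item in allowed}
    return chance(droppable)  # pass the result back up the chain


def vampire():
    monster_loot = {'Gold': 75, 'Sword': 50.3, 'Good Sword': 40.5, 'Blood': 100.0, 'Ore': .05}
    return all_loot(monster_loot)  # and again, so the first caller receives it


loot_gained = vampire()
print(loot_gained)
```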
I haven't learned much about how classes work entirely yet...
Rather informally, a class definition is a description of the objects of that class (a.k.a. instances of the class) that will be created in the future. The class definition contains the code (the definitions of the methods). The object (the class instance) basically contains the data. A method is a kind of function that can take arguments and that is able to manipulate the object's data.
This way, classes should represent the behaviour of real-world objects, and class instances simulate the existence of those real-world objects. The methods represent actions that the objects perform on themselves.
From that point of view, a class identifier should be a noun that describes the category of objects of the class. A class instance identifier should also be a noun that names the object. A method identifier is usually a verb that describes the action.
In your case, the class drop_chance is suspicious, if only because of its name.
If you want to print something reasonable about the object--say using the print(monster)--then define the __str__() method of the class -- see the doc.
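A small sketch of these naming conventions together with __str__. The Monster class, its hit_points attribute and the take_damage method are hypothetical and only serve as illustration.

```python
class Monster:
    """Noun for the class; the instance holds the data."""

    def __init__(self, name, hit_points):
        self.name = name
        self.hit_points = hit_points

    def take_damage(self, amount):
        """Verb for the method; it acts on the object's own data."""
        self.hit_points = max(0, self.hit_points - amount)

    def __str__(self):
        # Used by print() and str() to describe the object.
        return f"{self.name} ({self.hit_points} HP)"


vampire = Monster("Vampire", 30)
vampire.take_damage(12)
description = str(vampire)
print(vampire)
```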