Change units for Time Since Last Primitive in Featuretools - featuretools

When using the time_since_last primitive, how do I change the units from seconds (the default) to days?
I see in the documentation that the TimeSinceLast object accepts a unit parameter, but I can’t see an easy way to pass it when using dfs or calculate_feature_matrix.

To do this, you have to import the primitive in a slightly different way. Instead of selecting the primitive with its string shortcut, import the primitive class, instantiate it with the arguments you want, and pass that instance to dfs or calculate_feature_matrix:
# Shortcut method: primitives selected by name
import featuretools as ft

feature_matrix, feature_defs = ft.dfs(
    entityset=es,  # es is an existing EntitySet
    target_entity="customers",
    agg_primitives=["time_since_last", "std", "skew"],
    trans_primitives=[])

# Method that allows initialization of primitive arguments
from featuretools.primitives import TimeSinceLast

time_since_last = TimeSinceLast(unit="hours")
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="customers",
    agg_primitives=[time_since_last, "std", "skew"],
    trans_primitives=[])
The key points are:
import the specific primitive whose behavior you want to customize
instantiate it with the arguments you want, and put that instance in the list of primitives you are passing (it can sit alongside the plain string names)
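Since the question asks for days specifically, the same pattern should work with the unit set accordingly. A minimal sketch, reusing ft and es from the snippet above and assuming "days" is one of the unit strings TimeSinceLast accepts:

from featuretools.primitives import TimeSinceLast

# assumption: "days" is an accepted value for the unit parameter
time_since_last_days = TimeSinceLast(unit="days")
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="customers",
    agg_primitives=[time_since_last_days],
    trans_primitives=[])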

Related

Simplifying Init Method Python

Is there a better way of doing this?
def __init__(self, **kwargs):
    self.ServiceNo = kwargs["ServiceNo"]
    self.Operator = kwargs["Operator"]
    self.NextBus = kwargs["NextBus"]
    self.NextBus2 = kwargs["NextBus2"]
    self.NextBus3 = kwargs["NextBus3"]
The attributes (ServiceNo, Operator, ...) always exist.
That depends on what you mean by "simpler".
For example, is what you wrote simpler than what I would write, namely
def __init__(self, ServiceNo, Operator, NextBus, NextBus2, NextBus3):
    self.ServiceNo = ServiceNo
    self.Operator = Operator
    self.NextBus = NextBus
    self.NextBus2 = NextBus2
    self.NextBus3 = NextBus3
True, I've repeated each attribute name an additional time, but I've made it much clearer which arguments are legal for __init__. The caller is not free to add any additional keyword argument they like, only to see it silently ignored.
Of course, there's a lot of boilerplate here; that's something a dataclass can address:
from dataclasses import dataclass

@dataclass
class Foo:
    ServiceNo: int
    Operator: str
    NextBus: Bus
    NextBus2: Bus
    NextBus3: Bus
(Adjust the types as necessary.)
Now each attribute is mentioned once, and you get the __init__ method shown above for free.
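As a quick usage sketch (the Bus values bus1/bus2/bus3 and the response_dict name are placeholders, not defined here), the generated __init__ also lets you unpack the keyword dict you were previously passing around:

# construct from explicit keywords...
foo = Foo(ServiceNo=36, Operator="SBST", NextBus=bus1, NextBus2=bus2, NextBus3=bus3)
# ...or unpack the dict you were previously feeding into **kwargs
foo = Foo(**response_dict)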
Better how? You don’t really describe what problem you’re trying to solve.
If it’s error handling, you can use the dictionary's .get() method for the case where a key doesn’t exist.
If you just want a more succinct way of initializing attributes, you could remove the ** and store the dictionary itself as an attribute, then use it elsewhere in your code, but that depends on what your other methods are doing.
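A minimal sketch of both ideas (the argument name info is made up for illustration):

def __init__(self, info):
    # keep the whole dict around instead of unpacking it field by field
    self.info = info
    # .get() returns None (or a supplied default) instead of raising KeyError
    self.ServiceNo = info.get("ServiceNo")
    self.Operator = info.get("Operator")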
A hacky option, available because the attribute names and the argument names match exactly, is to copy the kwargs dict directly into the instance's dict, then check that you got all the keys you expected, e.g.:
def __init__(self, **kwargs):
    vars(self).update(kwargs)
    if vars(self).keys() != {"ServiceNo", "Operator", "NextBus", "NextBus2", "NextBus3"}:
        raise TypeError(f"{type(self).__name__} missing required arguments")
I don't recommend this; chepner's options are all superior to this sort of hackery, and they're more reliable (for example, this solution fails if you use __slots__ to prevent autovivification of attributes, as the instance won't have a backing dict you can pull with vars).

Pythonic way of reducing the subclasses

Background: I am working on an NLP problem where I need to extract different types of features from different types of text documents. I currently have a setup with a FeatureExtractor base class, which is subclassed multiple times depending on the type of document; each subclass calculates a different set of features and returns a pandas data frame as output.
All these subclasses are called by one wrapper class, FeatureExtractionRunner, which runs all of them, calculates the features on all docs, and returns the output for all document types.
Problem: this pattern of calculating features leads to lots of subclasses. I currently have 14 subclasses, since I have 14 types of docs, and it might expand further. That is too many classes to maintain. Is there an alternative way of doing this, with less subclassing?
Here is some representative sample code of what I explained:
from abc import ABCMeta, abstractmethod


class FeatureExtractor(metaclass=ABCMeta):
    # base feature extractor class
    def __init__(self, document):
        self.document = document

    @abstractmethod
    def doc_to_features(self):
        return NotImplemented


class ExtractorTypeA(FeatureExtractor):
    # do some feature calculations.....
    def _calculate_shape_features(self):
        return None

    def _calculate_size_features(self):
        return None

    def doc_to_features(self):
        # calls all the fancy feature calculation methods like
        f1 = self._calculate_shape_features()
        f2 = self._calculate_size_features()
        # do some calculations on the document and return a pandas dataframe by merging them (merge f1, f2, etc.)
        data = "dataframe-1"
        return data


class ExtractorTypeB(FeatureExtractor):
    # do some feature calculations.....
    def _calculate_some_fancy_features(self):
        return None

    def _calculate_some_more_fancy_features(self):
        return None

    def doc_to_features(self):
        # calls all the fancy feature calculation methods
        f1 = self._calculate_some_fancy_features()
        f2 = self._calculate_some_more_fancy_features()
        # do some calculations on the document and return a pandas dataframe (merge f1, f2, etc.)
        data = "dataframe-2"
        return data


class ExtractorTypeC(FeatureExtractor):
    # do some feature calculations.....
    def doc_to_features(self):
        # do some calculations on the document and return a pandas dataframe
        data = "dataframe-3"
        return data


class FeatureExtractionRunner:
    # a class to call all types of feature extractors
    def __init__(self, document, *args, **kwargs):
        self.document = document
        self.type_a = ExtractorTypeA(self.document)
        self.type_b = ExtractorTypeB(self.document)
        self.type_c = ExtractorTypeC(self.document)
        # more of these extractors would be there

    def call_all_type_of_extractors(self):
        type_a_features = self.type_a.doc_to_features()
        type_b_features = self.type_b.doc_to_features()
        type_c_features = self.type_c.doc_to_features()
        # more such extractors would be there....
        return [type_a_features, type_b_features, type_c_features]


all_type_of_features = FeatureExtractionRunner("some document").call_all_type_of_extractors()
Answering the question first: you may avoid subclassing entirely at the cost of writing the __init__ method each time. Or you may get rid of the classes entirely and convert them into a bunch of functions. Or you may even merge all the classes into a single one. Note that none of these options will make the code simpler or more maintainable; they would just change its shape to some extent.
IMHO this situation is a perfect example of inherent problem complexity, by which I mean that the domain (NLP) and the particular use case (document feature extraction) are complex in and of themselves.
For example, featureX and featureY are likely to be totally different things that cannot be calculated together, so you end up with one method each. Similarly, the procedure to merge these features into a dataframe might differ from the one used to merge the fancy features. Having lots of functions/classes in this situation seems totally reasonable to me, and keeping them separate is logical and good for maintainability.
That said, real code reduction might be possible if you can combine some feature calculation methods into a more generic function, though I can't say for sure whether that is possible here.
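For illustration, a minimal sketch of the "bunch of functions" option mentioned above (the per-type extractor functions are hypothetical stand-ins for the real feature calculations):

# one plain function per document type instead of one subclass per type
def extract_type_a(document):
    # shape/size features, merged into a dataframe
    return "dataframe-1"

def extract_type_b(document):
    # fancy features, merged into a dataframe
    return "dataframe-2"

# the runner becomes a simple loop over a registry of functions
EXTRACTORS = [extract_type_a, extract_type_b]

def run_all_extractors(document):
    return [extract(document) for extract in EXTRACTORS]

all_type_of_features = run_all_extractors("some document")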

Sphinx autodoc does not display all types or circular import error

I am trying to auto-document types with sphinx autodoc, napoleon and sphinx-autodoc-typehints, but I am having problems: it does not work with most of my types. I am using the deap package for a genetic optimization algorithm, which means I have some very specific types that I guess Sphinx cannot handle.
My conf.py file looks like this:
import os
import sys
sys.path.insert(0, os.path.abspath('../python'))

extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.viewcode',
    'sphinx.ext.napoleon',
    'sphinx_autodoc_typehints'
]

set_type_checking_flag = False
always_document_param_types = False
I have an Algo.rst file with:
.. automodule:: python.algo.algo
   :members: crossover_worker, test
and my python.algo.algo module looks like this (I've added a dummy test function to show it works whenever I have no special types specified):
# Type hinting imports
from config.config import Config
from typing import List, Set, Dict, NamedTuple, Union, Tuple
from types import ModuleType
from numpy import ndarray
from numpy import float64
from multiprocessing.pool import MapResult
from deap.tools.support import Logbook, ParetoFront
from deap.base import Toolbox
from pandas.core.frame import DataFrame
from deap import creator

...

def crossover_worker(sindices: List[creator.Individual, creator.Individual]) -> Tuple[creator.Individual, creator.Individual]:
    """
    Uniform crossover using fixed threshold

    Args:
        sindices: list of two individuals on which we want to perform crossover

    Returns:
        tuple of the two individuals with crossover applied
    """
    ind1, ind2 = sindices
    size = len(ind1)
    for i in range(size):
        if random.random() < 0.4:
            ind1[i], ind2[i] = ind2[i], ind1[i]
    return ind1, ind2


def test(a: DataFrame, b: List[int]) -> float:
    """
    test function

    Args:
        a: something
        b: something

    Returns:
        something
    """
    return b
With the settings in conf.py as above I get no error; the types for my test function are correct, but the types for my crossover_worker function are missing.
However, when I set set_type_checking_flag = True to force using all types, I get a circular import error:
reading sources... [100%] index
WARNING: autodoc: failed to import module 'algo' from module 'python.algo'; the following exception was raised:
cannot import name 'ArrayLike' from partially initialized module 'pandas._typing' (most likely due to a circular import) (/usr/local/lib/python3.8/site-packages/pandas/_typing.py)
looking for now-outdated files... none found
I never import ArrayLike, so I don't understand where it comes from or how to solve it.
And how can I force Sphinx to also pick up the creator.Individual types that appear everywhere in my code?
My sphinx versions:
sphinx==3.0.1
sphinx-autodoc-typehints==1.10.3
After some searching, it turned out there were some flaws in my approach:
Firstly, a "list is a homogeneous structure containing values of one type. As such, List only takes a single type, and every element of that list has to have that type." (source). Consequently, I cannot write something like List[creator.Individual, creator.Individual]; it should be List[creator.Individual], or, if the list can hold multiple types, a Union such as List[Union[int, float]].
Secondly, the type creator.Individual is not recognized by Sphinx as a valid type. Instead, I should define it using TypeVar, like so:
from typing import TypeVar, List
CreatorIndividual = TypeVar("CreatorIndividual", bound=List[int])
So by transforming my crossover_worker function to this, it all worked:
def crossover_worker(sindices: List[CreatorIndividual]) -> Tuple[CreatorIndividual, CreatorIndividual]:
Note: "By contrast, a tuple is an example of a product type, a type consisting of a fixed set of types, and whose values are a collection of values, one from each type in the product type. Tuple[int,int,int], Tuple[str,int] and Tuple[int,str] are all distinct types, distinguished both by the number of types in the product and the order in which they appear."(source)

Creating custom component in SpaCy

I am trying to create a spaCy pipeline component that returns Spans of meaningful text (my corpus comprises PDF documents that contain a lot of garbage I am not interested in: tables, headers, etc.).
More specifically I am trying to create a function that:
takes a Doc object as an argument
iterates over the doc's tokens
yields a Span object when certain rules are met
Note I would also be happy with returning a list([span_obj1, span_obj2])
What is the best way to do something like this? I am a bit confused about the difference between a pipeline component and an extension attribute.
So far I have tried:
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()
Doc.set_extension('chunks', method=iQ_chunker)
####
raw_text = get_test_doc()
doc = nlp(raw_text)
print(type(doc._.chunks))
>>> <class 'functools.partial'>
iQ_chunker is a method that does what I explained above, and it returns a list of Span objects.
This is not the result I expect, as the function I pass in as method returns a list.
I imagine you're getting a functools partial back because you are accessing chunks as an attribute, despite having passed it in as an argument for method. If you want spaCy to intervene and call the method for you when you access something as an attribute, it needs to be
Doc.set_extension('chunks', getter=iQ_chunker)
Please see the Doc documentation for more details.
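Alternatively, if you keep method=iQ_chunker, the extension is exposed as a callable rather than a plain value, so you would invoke it with parentheses (a minimal sketch reusing the names from the question):

Doc.set_extension('chunks', method=iQ_chunker)
doc = nlp(raw_text)
chunks = doc._.chunks()  # calling it runs iQ_chunker(doc) and returns the list of Spans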
However, if you are planning to compute this attribute for every single document, I think you should make it part of your pipeline instead. Here is some simple sample code that does it both ways.
import spacy
from spacy.tokens import Doc

def chunk_getter(doc):
    # the getter is called when we access _.extension_1,
    # so the computation is done at access time
    # also, because this is a getter,
    # we need to return the actual result of the computation
    first_half = doc[0:len(doc)//2]
    second_half = doc[len(doc)//2:len(doc)]
    return [first_half, second_half]

def write_chunks(doc):
    # this pipeline component is called as part of the spacy pipeline,
    # so the computation is done at parse time
    # because this is a pipeline component,
    # we need to set our attribute value on the doc (which must be registered)
    # and then return the doc itself
    first_half = doc[0:len(doc)//2]
    second_half = doc[len(doc)//2:len(doc)]
    doc._.extension_2 = [first_half, second_half]
    return doc

nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
Doc.set_extension("extension_1", getter=chunk_getter)
Doc.set_extension("extension_2", default=[])
nlp.add_pipe(write_chunks)

test_doc = nlp('I love spaCy')
print(test_doc._.extension_1)
print(test_doc._.extension_2)
This just prints [I, love spaCy] twice because it's two methods of doing the same thing, but I think making it part of your pipeline with nlp.add_pipe is the better way to do it if you expect to need this output on every document you parse.

scipy.optimize.minimize() constraints depend on cost function

I'm running a constrained optimisation with scipy.optimize.minimize(method='COBYLA').
In order to evaluate the cost function, I need to run a relatively expensive simulation to compute a dataset from the input variables, and the cost function is one (cheap to compute) property of that dataset. However, two of my constraints are also dependent on that expensive data.
So far, the only way I have found to constrain the optimisation is to have each of the constraint functions recompute the same dataset that the cost function has already calculated (simplified quasi-code):
def costfun(x):
    data = expensive_fun(x)
    return cheap_fun1(data)

def constr1(x):
    data = expensive_fun(x)
    return cheap_fun2(data)

def constr2(x):
    data = expensive_fun(x)
    return cheap_fun3(data)

constraints = [{'type': 'ineq', 'fun': constr1},
               {'type': 'ineq', 'fun': constr2}]

# initial guess
x0 = np.ones((6,))

opt_result = minimize(costfun, x0, method='COBYLA',
                      constraints=constraints)
This is clearly not efficient because expensive_fun(x) is called three times for every x.
I could change this slightly to include a universal "evaluate some cost" function which runs the expensive computation, and then evaluates whatever criterion it has been given. But while that saves me from having to write the "expensive" code several times, it still runs three times for every iteration of the optimizer:
# universal cost function evaluator
def criterion_from_x(x, cfun):
    data = expensive_fun(x)
    return cfun(data)

def costfun(data):
    return cheap_fun1(data)

def constr1(data):
    return cheap_fun2(data)

def constr2(data):
    return cheap_fun3(data)

constraints = [{'type': 'ineq', 'fun': criterion_from_x, 'args': (constr1,)},
               {'type': 'ineq', 'fun': criterion_from_x, 'args': (constr2,)}]

# initial guess
x0 = np.ones((6,))

opt_result = minimize(criterion_from_x, x0, method='COBYLA',
                      args=(costfun,), constraints=constraints)
I have not managed to find any way to set something up where x is used to generate data at each iteration, and data is then passed to both the objective function as well as the constraint functions.
Does something like this exist? I've noticed the callback argument to minimize(), but that is a function which is called after each step. I'd need some kind of preprocessor which is called on x before each step, whose results are then available to the cost function and constraint evaluation. Maybe there's a way to sneak it in somehow? I'd like to avoid writing my own optimizer.
One, more traditional, way to solve this would be to evaluate the constraints in the cost function (which has all the data it needs for that), have it add a penalty for violated constraints to the main cost function, and run the optimizer without the explicit constraints. But I've tried this before and found that the main cost function can become somewhat chaotic in cases where the constraints are violated, so the optimizer might get stuck in a place that violates the constraints and never find its way out again.
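For reference, a rough sketch of that penalty idea (recall that for scipy's 'ineq' constraints a violation means the constraint function goes negative; the weight factor is an arbitrary tuning parameter):

def penalized_costfun(x, weight=1e3):
    data = expensive_fun(x)
    # penalize each constraint by how far it dips below zero
    penalty = max(0.0, -cheap_fun2(data)) + max(0.0, -cheap_fun3(data))
    return cheap_fun1(data) + weight * penalty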
Another approach would be to produce some kind of global variable in the cost function and write the constraint evaluation to use that global variable, but that could be very dangerous if multithreading/-processing gets involved, or if the name I choose for the global variable collides with a name used anywhere else in the code:
def costfun(x):
    global data
    data = expensive_fun(x)
    return cheap_fun1(data)

def constr1(x):
    global data
    return cheap_fun2(data)

def constr2(x):
    global data
    return cheap_fun3(data)
I know that some people use file I/O for cases where the cost function involves running a large simulation which produces a bunch of output files. After that, the constraint functions can just access those files -- but my problem is not that big.
I'm currently using Python v3.9 and scipy 1.9.1.
You could write a decorator class in the same vein as scipy's MemoizeJac that caches the return values of the expensive function each time it is called:
import numpy as np

class MemoizeData:
    def __init__(self, obj_fun, exp_fun, constr_fun):
        self.obj_fun = obj_fun
        self.exp_fun = exp_fun
        self.constr_fun = constr_fun
        self._data = None
        self.x = None

    def _compute_if_needed(self, x, *args):
        # only rerun the expensive function when x has changed since the last call
        if not np.all(x == self.x) or self._data is None:
            self.x = np.asarray(x).copy()
            self._data = self.exp_fun(x)

    def __call__(self, x, *args):
        # objective: evaluate the cheap objective on the cached data
        self._compute_if_needed(x, *args)
        return self.obj_fun(self._data)

    def constraint(self, x, *args):
        # constraints: evaluate the cheap constraint function on the same cached data
        self._compute_if_needed(x, *args)
        return self.constr_fun(self._data)
This way, the expensive function is only evaluated once per iteration. Then, after writing all your constraints into one constraint function, you could use it like this:
from scipy.optimize import minimize

def all_constrs(data):
    # stack all cheap constraint values so they can be passed as a single 'ineq' entry
    return np.hstack((cheap_fun2(data), cheap_fun3(data)))

obj = MemoizeData(cheap_fun1, expensive_fun, all_constrs)
constr = {'type': 'ineq', 'fun': obj.constraint}
x0 = np.ones(6)

opt_result = minimize(obj, x0, method="COBYLA", constraints=constr)
While Joni was writing their answer, I found another one, which is admittedly more hacky. I prefer theirs, but for the sake of completeness, I wanted to post this one, too.
It's derived from the material at https://mdobook.github.io/ and the accompanying video tutorials from the BYU FLOW Lab, in particular this video:
The trick is to use module-level variables (outside the functions) to keep a cache of the last evaluation of the expensive function:
import numpy as np

last_x = None
last_data = None

def compute_data(x):
    data = expensive_fun(x)
    return data

def get_last_data(x):
    # last_x and last_data live at module level, so declare them global here
    global last_x, last_data
    if not np.array_equal(x, last_x):
        last_data = compute_data(x)
        last_x = x
    return last_data

def costfun(x):
    data = get_last_data(x)
    return cheap_fun1(data)

def constr1(x):
    data = get_last_data(x)
    return cheap_fun2(data)

def constr2(x):
    data = get_last_data(x)
    return cheap_fun3(data)
...and then everything can progress as in my original code in the question.
Reasons why I prefer Joni's class-based version:
variable scopes are clearer than with module-level globals
If some of the functions allow calculation of their Jacobian, or there are other things worth buffering, the added complexity is held in check better than with a growing collection of global variables
Having a class instance do all the work also allows you to do other interesting things, like keeping a record of all past evaluations and the path taken by the optimizer, without having to use a separate callback function. Very useful for debugging/tweaking convergence if the optimizer won't converge or takes too long, but also to visualize or otherwise investigate the objective function or similar.
The same ability might actually be really cool for things like constructing a response surface model from the results of previous function evaluations. That could be used to establish a starting guess in case the expensive function is some numerical method that benefits from a good starting point.
Both approaches allow the use of "cheap" constraints which don't require the expensive function to be evaluated, by simply providing them as separate functions. Not sure whether that would help much with compute times, though. I suppose that would depend on the algorithm used by the optimizer.
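A hypothetical sketch of that last point, using the MemoizeData setup from Joni's answer: a cheap constraint that doesn't need the expensive data can simply be passed as its own constraint entry (cheap_bound_constraint and the bound of 10 are made up for illustration):

def cheap_bound_constraint(x):
    # example: require the sum of the inputs to stay below 10
    return 10.0 - np.sum(x)

constraints = [{'type': 'ineq', 'fun': obj.constraint},          # needs the expensive data (memoized)
               {'type': 'ineq', 'fun': cheap_bound_constraint}]  # evaluated directly on x

opt_result = minimize(obj, x0, method="COBYLA", constraints=constraints)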
