How to create a complex data structure with Python Hypothesis - python-hypothesis

I'm trying to use Hypothesis to generate a text strategy with a complex format, and I'm not sure how to build up this kind of data structure.
I've tried to build the various elements as composites and then use those as strategies for other composites. However, the elements argument of the lists strategy requires a SearchStrategy rather than a composite, as I had hoped. Looking through the docs I couldn't work out whether builds, map or flatmap would help in this case.
My (simplified) attempt is below.
@st.composite
def composite_coords(draw):
    lat = draw(st.decimals(min_value=-10, max_value=-1, allow_nan=False, places=16))
    long = draw(st.decimals(min_value=50, max_value=90, allow_nan=False, places=16))
    return [float(long), float(lat)]

@st.composite
def composite_polygon_coords(draw):
    polygon_coords = draw(st.lists(
        elements=composite_coords, min_size=3
    ))
    return polygon_coords.append(polygon_coords[0])

@st.composite
def composite_polygons(draw):
    polygons = draw(st.lists(
        elements=composite_polygon_coords, min_size=1
    ))
    polygon = {
        'type': 'Polygon',
        'coordinates': polygons
    }
    return poly.dumps(polygon)

@given(composite_polygons())
def test_valid_polygon(polygon):
    result = validate(polygon)
    assert result == polygon

The @st.composite decorator gives you a function which returns a strategy - you just need to call them and you'll be good to go.
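A minimal sketch of that change, reusing the names from the question (note that list.append returns None, so the ring has to be closed before returning the list):

from hypothesis import given, strategies as st

@st.composite
def composite_polygon_coords(draw):
    # call composite_coords() so that lists() receives a strategy, not the bare function
    coords = draw(st.lists(composite_coords(), min_size=3))
    coords.append(coords[0])  # close the ring, then return the list itself
    return coords

@given(composite_polygons())  # likewise, call the decorated function here
def test_valid_polygon(polygon):
    ...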

Related

How can I condense this function into returning a value from a list comprehension?

I understand it may not be best practice or conventional, but this is more of a personal challenge.
def initialize_dataset(source):
    all_features = []
    targets = []
    for (sent, label) in source:
        feature_list = []
        feature_list.append(avg_number_chars(sent))
        feature_list.append(number_words(sent))
        all_features.append(feature_list)
        targets.append(0) if label == "austen" else targets.append(1)
    return all_features, targets
Here's an example of the shape I'm looking for. I understand that it might not be possible to get it down to one single list comprehension and/or value, but something close to it would be fine. I'd like to expand my thinking on writing list comprehensions.
def sample_function(data):
    return [i for i in data]
Success! My gaaaaaaaaaaaaaaawd it's ugly! 🤣
def initialize_dataset(source):
    all_features, targets = [], []; [(all_features.append([avg_number_chars(sent), number_words(sent)]), targets.append(0) if label == "austen" else targets.append(1)) for (sent, label) in source]; return all_features, targets
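For comparison, a somewhat tamer variant built from two comprehensions (assuming source is a list or other sequence that can be iterated twice, and the same avg_number_chars / number_words helpers) might look like this:

def initialize_dataset(source):
    # one comprehension per output list; source is walked twice
    all_features = [[avg_number_chars(sent), number_words(sent)] for sent, _ in source]
    targets = [0 if label == "austen" else 1 for _, label in source]
    return all_features, targets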

Pythonic way of reducing the subclasses

Background: I am working on an NLP problem where I need to extract different types of features from different types of text documents. I currently have a setup with a FeatureExtractor base class, which is subclassed multiple times depending on the type of doc; each subclass calculates a different set of features and returns a pandas data frame as output.
All these subclasses are then called by one wrapper class, FeatureExtractionRunner, which runs all the subclasses, calculates the features on all docs and returns the output for all types of docs.
Problem: this pattern of calculating features leads to lots of subclasses. I currently have 14 subclasses, since I have 14 types of docs, and it might expand further. That is too many classes to maintain. Is there an alternative way of doing this, with less subclassing?
Here is some sample representative code of what I explained:
from abc import ABCMeta, abstractmethod

class FeatureExtractor(metaclass=ABCMeta):
    # base feature extractor class
    def __init__(self, document):
        self.document = document

    @abstractmethod
    def doc_to_features(self):
        return NotImplemented

class ExtractorTypeA(FeatureExtractor):
    # do some feature calculations.....
    def _calculate_shape_features(self, document):
        return None

    def _calculate_size_features(self, document):
        return None

    def doc_to_features(self):
        # calls all the fancy feature calculation methods like
        f1 = self._calculate_shape_features(self.document)
        f2 = self._calculate_size_features(self.document)
        # do some calculations on the document and return a pandas dataframe by merging them (merge f1, f2....etc)
        data = "dataframe-1"
        return data

class ExtractorTypeB(FeatureExtractor):
    # do some feature calculations.....
    def _calculate_some_fancy_features(self, document):
        return None

    def _calculate_some_more_fancy_features(self, document):
        return None

    def doc_to_features(self):
        # calls all the fancy feature calculation methods
        f1 = self._calculate_some_fancy_features(self.document)
        f2 = self._calculate_some_more_fancy_features(self.document)
        # do some calculations on the document and return a pandas dataframe (merge f1, f2 etc)
        data = "dataframe-2"
        return data

class ExtractorTypeC(FeatureExtractor):
    # do some feature calculations.....
    def doc_to_features(self):
        # do some calculations on the document and return a pandas dataframe
        data = "dataframe-3"
        return data

class FeatureExtractionRunner:
    # a class to call all types of feature extractors
    def __init__(self, document, *args, **kwargs):
        self.document = document
        self.type_a = ExtractorTypeA(self.document)
        self.type_b = ExtractorTypeB(self.document)
        self.type_c = ExtractorTypeC(self.document)
        # more of these extractors would be there

    def call_all_type_of_extractors(self):
        type_a_features = self.type_a.doc_to_features()
        type_b_features = self.type_b.doc_to_features()
        type_c_features = self.type_c.doc_to_features()
        # more such extractors would be there....
        return [type_a_features, type_b_features, type_c_features]

all_type_of_features = FeatureExtractionRunner("some document").call_all_type_of_extractors()
Answering the question first: you may avoid subclassing entirely at the cost of writing the __init__ method each time. Or you may get rid of the classes entirely and convert them to a bunch of functions. Or you may even join all the classes into a single one. Note that none of these options will make the code simpler or more maintainable; they would just change its shape to some extent.
IMHO this situation is a good example of inherent problem complexity, by which I mean that the domain (NLP) and the particular use case (doc feature extraction) are complex in and of themselves.
For example, featureX and featureY are likely to be totally different things that cannot be calculated together, so you end up with one method each. Similarly, the procedure to merge these features into a dataframe might be different from the one used to merge the fancy features. Having lots of functions/classes in this situation seems totally reasonable to me, and keeping them separate is both logical and easier to maintain.
That said, real code reduction might be possible if you can combine some feature calculation methods into a more generic function, though I can't say for sure whether that is possible here.
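As a rough illustration of the "bunch of functions" option (all names here are made up for the sketch), a registry of plain functions could replace the subclass hierarchy:

def extract_type_a_features(document):
    # shape/size feature calculations merged into one dataframe-like result
    return "dataframe-1"

def extract_type_b_features(document):
    return "dataframe-2"

EXTRACTORS = [extract_type_a_features, extract_type_b_features]  # append new doc types here

def run_all_extractors(document):
    # call every registered extractor on the same document
    return [extract(document) for extract in EXTRACTORS]

all_type_of_features = run_all_extractors("some document")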

How to return values from a function that has a @given decorator

def fixed_given(self):
    return given(
        test_df=data_frames(
            columns=columns(
                ["float_col1"],
                dtype=float,
            ),
            rows=tuples(
                floats(allow_nan=True, allow_infinity=True)),
        )
    )(self)

@pytest.fixture()
@fixed_given
def its_a_fixture(test_df):
    obj = its_an_class(test_df)
    return obj

@pytest.fixture()
@fixed_given
def test_1(test_df):
    ...  # use returned object from my fixture here

@pytest.fixture()
@fixed_given
def test_2(test_df):
    ...  # use returned object from my fixture here
Here, I am creating my test dataframe in a separate function so I can use it across all test functions,
and then creating a pytest fixture to instantiate a class by passing in the test dataframe generated by the fixed_given function.
I am looking for a way to get a return value from this fixture.
But the problem is that I am using the @given decorator, and it doesn't allow return values.
Is there a way to return a value even after using the @given decorator?
It's not clear what you're trying to achieve here, but reusing inputs generated by Hypothesis gives up most of the power of the framework (including minimal examples, replaying failures, settings options, etc.).
Instead, you can define a global variable for your strategy - or write a function that returns a strategy with @st.composite - and use that in each of your tests, e.g.
MY_STRATEGY = data_frames(columns=[
    column(name="float_col1", elements=floats(allow_nan=True, allow_infinity=True))
])

@given(MY_STRATEGY)
def test_foo(df): ...

@given(MY_STRATEGY)
def test_bar(df): ...
Specifically to answer the question you asked: you cannot get a return value from a function decorated with @given.
Instead of using fixtures to instantiate your class, try using the .map method of a strategy (in this case data_frames(...).map(its_an_class)), or the builds() strategy (i.e. builds(my_class, data_frames(...), ...)).
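A rough sketch of both options, reusing MY_STRATEGY from above and assuming its_an_class takes the dataframe as its only constructor argument:

from hypothesis import given, strategies as st

# option 1: map the dataframe strategy straight to instances
@given(MY_STRATEGY.map(its_an_class))
def test_with_map(obj):
    ...  # obj is already an its_an_class instance

# option 2: builds() constructs the instance from the given strategies
@given(st.builds(its_an_class, MY_STRATEGY))
def test_with_builds(obj):
    ...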

Multiprocessing a function that tests a given dataset against a list of distributions, returning function values from each iteration through the list

I am working on processing a dataset that includes dense GPS data. My goal is to use parallel processing to test my dataset against all possible distributions and return the best one with the parameters generated for said distribution.
Currently, I have code that does this in serial thanks to this answer https://stackoverflow.com/a/37616966. Of course, it is going to take entirely too long to process my full dataset. I have been playing around with multiprocessing, but can't seem to get it to work right. I want it to test multiple distributions in parallel, keeping track of the sum of squared errors (SSE). Then I want to select the distribution with the lowest SSE and return its name along with the parameters generated for it.
def fit_dist(distribution, data=data, bins=200, ax=None):
    # Block of code that tests the distribution and generates params
    return (distribution.name, best_params, sse)

if __name__ == '__main__':
    p = Pool()
    result = p.map(fit_dist, DISTRIBUTIONS)
    p.close()
    p.join()
I need some help with how to actually make use of the return values from each iteration of the multiprocessing so I can compare those values. I'm really new to Python, especially multiprocessing, so please be patient with me and explain as much as possible.
The problem I'm having is that it gives me an "UnboundLocalError" on the variables that I'm trying to return from my fit_dist function. The DISTRIBUTIONS list is 89 objects. Could this be related to the parallel processing, or is it something to do with the definition of fit_dist?
With the help of Tomerikoo's comment and some further struggling, I got the code working the way I wanted. The UnboundLocalError was due to me not putting the return statement in the correct block of code within my fit_dist function. To answer the question, I did the following.
from multiprocessing import Pool

def fit_dist(distribution, data=data, bins=200, ax=None):
    # put this return under the right section of this method
    return [distribution.name, params, sse]

if __name__ == '__main__':
    p = Pool()
    result = p.map(fit_dist, DISTRIBUTIONS)
    p.close()
    p.join()

    '''filter out the None object results. Due to the nature of the distribution fitting,
    some distributions are so far off that they result in None objects'''
    res = list(filter(None, result))

    # iterates over the nested list, storing the lowest sum of squared errors in best_sse
    best_sse = float('inf')  # start larger than any real SSE
    for dist in res:
        if best_sse > dist[2] > 0:
            best_sse = dist[2]
        else:
            continue

    '''iterates over the list pulling out the sublist of the distribution with the best sse.
    The sublists are made up of a string, a tuple with parameters,
    and a float value for sse, so that's why sse is always index 2.'''
    for dist in res:
        if dist[2] == best_sse:
            best_dist_list = dist
        else:
            continue
The rest of the code simply consists of me using that list to construct charts and plots with that best distribution overtop of a histogram of my raw data.
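As a side note, the two loops above that hunt for the lowest SSE could probably be collapsed into a single min() call, assuming at least one surviving fit stores a positive SSE at index 2:

# pick the sublist with the smallest positive SSE in one pass
best_dist_list = min((d for d in res if d[2] > 0), key=lambda d: d[2])
best_name, best_params, best_sse = best_dist_list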

SQLAthanor: serialize to json only specific fields

Is there a way to serialize a SQLAlchemy model including only specific fields using SQLAthanor? The documentation doesn't mention it, so the only way that I figured out is to filter the outcome manually.
So, this line with sqlathanor
return jsonify([user.to_dict() for user in users for k, v in user.to_dict().items()
                if k in ['username', 'name', 'surname', 'email']])
is equivalent to this one using Marshmallow
return jsonify(SchemaUser(only=('username', 'name', 'surname', 'email')).dump(users, many=True))
Once again, is there a built-in method in SQLAthanor to do this?
Adapting my answer from the related GitHub issue:
The only way that you can change the list of serialized fields without adjusting the instance's configuration is to manually adjust the results of to_<FORMAT>(). Your code snippet is one way to do that, although for JSON and YAML you can also supply a custom serialize_function which accepts the dict, processes it, and serializes to JSON or YAML as appropriate:
import simplejson as json

def my_custom_serializer(value, **kwargs):
    filtered_dict = {}
    filtered_dict['username'] = value['username']
    # repeat pattern for other fields
    return json.dumps(filtered_dict)

json_result = user.to_json(serialize_function = my_custom_serializer)
Both approaches are effectively the same, but the serialize_function approach gives you more flexibility for more complex adjustments to your serialized output and (I think) easier-to-read/maintain code (though if all you're doing is adjusting the fields included, your snippet is already quite readable).
You can generalize the serialize_function as well. So if you want to give it a list of fields to include, just include them as a keyword argument in to_json():
def my_custom_serializer(value, **kwargs):
    filter_fields = kwargs.pop("filter_fields", None)
    result = {}
    for field in filter_fields:
        result[field] = value.get(field, None)
    return json.dumps(result)

result = [x.to_json(serialize_function = my_custom_serializer,
                    filter_fields = ['username', 'name', 'surname', 'email'])
          for x in users]
