Python avoiding large array allocation multiple times - python-3.x

I have to compute a function many many times.
To compute this function the elements of an array must be computed.
The array is quite large.
How can I avoid allocating the array in every function call?
The code I have tried goes something like this:
class FunctionCalculator(object):
    def __init__(self, data):
        """
        Get the data and do some small handling of it.
        Let's say that we do
        self.data = data
        """
    def function(self, point):
        return numpy.sum(numpy.array([somecomputations(item) for item in self.data]))
Well, maybe my concern is unfounded, so first I have this question.
Question: Is it true that the array [somecomputations(item) for item in data] is being allocated and deallocated for every call to function?
Thinking that this is the case, I have tried the following:
class FunctionCalculator(object):
    def __init__(self, data):
        """
        Get the data and do some small handling of it.
        Let's say that we do
        self.data = data
        """
        self.number_of_data = range(0, len(data))
        self.my_array = numpy.zeros(len(data))

    def function(self, point):
        for i in self.number_of_data:
            self.my_array[i] = somecomputations(self.data[i])
        return numpy.sum(self.my_array)
This is slower than the previous version. I assume that the list comprehension in the first version can be run almost entirely in C, while in the second version only smaller parts of the loop can be translated into optimized C code.
I have very little idea of how Python works inside.
Question: Is there a good way to skip the array allocation in every function call and at the same time take advantage of a well optimized loop on the array?
I am using Python 3.5.

Looping over the array in Python is unnecessary and crosses from Python to C many times, hence the slowdown. The beauty of NumPy arrays is that vectorized functions work on them element by element in compiled code. I think the fastest would be:
return numpy.sum(somecomputations(self.data))
somecomputations may need a bit of modification so that it accepts the whole array, but often it will work right off the bat. Also note that you're not using point, among other things.
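For illustration, a vectorized version of the class could look roughly like this. The body of the computation below (a square root plus a scaled copy) is just a stand-in for whatever somecomputations actually does, so treat it as a sketch:
import numpy

class FunctionCalculator(object):
    def __init__(self, data):
        # convert once, so every call works on the same ndarray
        self.data = numpy.asarray(data, dtype=float)

    def function(self, point):
        # stand-in for a ufunc-based somecomputations: the loop runs in C
        # and no per-call Python list is built (NumPy still creates small
        # temporary arrays internally)
        return numpy.sum(numpy.sqrt(self.data) + 2.0 * self.data)
If even the internal temporaries matter, most ufuncs accept an out= argument, so a scratch buffer allocated once in __init__ can be reused on every call.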

Related

Pythonic way of reducing the subclasses

Background: I am working on an NLP problem where I need to extract different types of features from different types of text documents. I currently have a setup with a FeatureExtractor base class, which is subclassed multiple times depending on the type of doc; each subclass calculates a different set of features and returns a pandas data frame as output.
All these subclasses are then called by one wrapper class, FeatureExtractionRunner, which invokes all the subclasses, calculates the features on all docs, and returns the output for all types of docs.
Problem: this pattern of calculating features leads to lots of subclasses. Currently I have 14 subclasses, since I have 14 types of docs, and it might expand further. That is too many classes to maintain. Is there an alternative way of doing this, with less subclassing?
Here is some representative sample code of what I explained:
from abc import ABCMeta, abstractmethod

class FeatureExtractor(metaclass=ABCMeta):
    # base feature extractor class
    def __init__(self, document):
        self.document = document

    @abstractmethod
    def doc_to_features(self):
        return NotImplemented

class ExtractorTypeA(FeatureExtractor):
    # do some feature calculations.....
    def _calculate_shape_features(self, document):
        return None

    def _calculate_size_features(self, document):
        return None

    def doc_to_features(self):
        # calls all the fancy feature calculation methods like
        f1 = self._calculate_shape_features(self.document)
        f2 = self._calculate_size_features(self.document)
        # do some calculations on the document and return a pandas dataframe by merging them (merge f1, f2....etc)
        data = "dataframe-1"
        return data

class ExtractorTypeB(FeatureExtractor):
    # do some feature calculations.....
    def _calculate_some_fancy_features(self, document):
        return None

    def _calculate_some_more_fancy_features(self, document):
        return None

    def doc_to_features(self):
        # calls all the fancy feature calculation methods
        f1 = self._calculate_some_fancy_features(self.document)
        f2 = self._calculate_some_more_fancy_features(self.document)
        # do some calculations on the document and return a pandas dataframe (merge f1, f2 etc)
        data = "dataframe-2"
        return data

class ExtractorTypeC(FeatureExtractor):
    # do some feature calculations.....
    def doc_to_features(self):
        # do some calculations on the document and return a pandas dataframe
        data = "dataframe-3"
        return data

class FeatureExtractionRunner:
    # a class to call all types of feature extractors
    def __init__(self, document, *args, **kwargs):
        self.document = document
        self.type_a = ExtractorTypeA(self.document)
        self.type_b = ExtractorTypeB(self.document)
        self.type_c = ExtractorTypeC(self.document)
        # more of these extractors would be there

    def call_all_type_of_extractors(self):
        type_a_features = self.type_a.doc_to_features()
        type_b_features = self.type_b.doc_to_features()
        type_c_features = self.type_c.doc_to_features()
        # more such extractors would be there....
        return [type_a_features, type_b_features, type_c_features]

all_type_of_features = FeatureExtractionRunner("some document").call_all_type_of_extractors()
To answer the question first: you may avoid subclassing entirely at the cost of writing the __init__ method each time. Or you may get rid of the classes entirely and convert them to a bunch of functions. You could even join all the classes into a single one. Note that none of these approaches will make the code simpler or more maintainable; they would just change its shape to some extent.
IMHO this situation is a perfect example of inherent problem complexity, by which I mean that the domain (NLP) and the particular use case (doc feature extraction) are complex in and of themselves.
For example, featureX and featureY are likely to be totally different things that cannot be calculated together, so you end up with one method each. Similarly, the procedure to merge these features into a dataframe might differ from the one used to merge the fancy features. Having lots of functions/classes in this situation seems totally reasonable to me, and keeping them separate is logical and maintenance-wise sensible.
That said, real code reduction might be possible if you can combine some feature calculation methods into a more generic function, though I can't say for sure whether that would be possible.
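If the "bunch of functions" route mentioned above appeals to you, a minimal sketch could look like the following. The function names, the EXTRACTORS registry and extract_all_features are made-up stand-ins for illustration, not part of the original code:
import pandas as pd

def extract_type_a_features(document):
    # shape/size feature calculations for type-A docs would go here
    return pd.DataFrame({"doc_type": ["a"]})

def extract_type_b_features(document):
    # fancy feature calculations for type-B docs would go here
    return pd.DataFrame({"doc_type": ["b"]})

# one registry entry per doc type instead of one subclass per doc type
EXTRACTORS = {
    "type_a": extract_type_a_features,
    "type_b": extract_type_b_features,
}

def extract_all_features(document):
    # plays the role of FeatureExtractionRunner.call_all_type_of_extractors
    return [extract(document) for extract in EXTRACTORS.values()]

all_type_of_features = extract_all_features("some document")
Whether this is actually easier to maintain than 14 small subclasses is debatable, for the reasons given above.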

scipy.optimize.minimize() constraints depend on cost function

I'm running a constrained optimisation with scipy.optimize.minimize(method='COBYLA').
In order to evaluate the cost function, I need to run a relatively expensive simulation to compute a dataset from the input variables, and the cost function is one (cheap to compute) property of that dataset. However, two of my constraints are also dependent on that expensive data.
So far, the only way I have found to constrain the optimisation is to have each of the constraint functions recompute the same dataset that the cost function has already calculated (simplified quasi-code):
def costfun(x):
    data = expensive_fun(x)
    return cheap_fun1(data)

def constr1(x):
    data = expensive_fun(x)
    return cheap_fun2(data)

def constr2(x):
    data = expensive_fun(x)
    return cheap_fun3(data)

constraints = [{'type': 'ineq', 'fun': constr1},
               {'type': 'ineq', 'fun': constr2}]

# initial guess
x0 = np.ones((6,))

opt_result = minimize(costfun, x0, method='COBYLA',
                      constraints=constraints)
This is clearly not efficient because expensive_fun(x) is called three times for every x.
I could change this slightly to include a universal "evaluate some cost" function which runs the expensive computation, and then evaluates whatever criterion it has been given. But while that saves me from having to write the "expensive" code several times, it still runs three times for every iteration of the optimizer:
# universal cost function evaluator
def criterion_from_x(x, cfun):
    data = expensive_fun(x)
    return cfun(data)

def costfun(data):
    return cheap_fun1(data)

def constr1(data):
    return cheap_fun2(data)

def constr2(data):
    return cheap_fun3(data)

constraints = [{'type': 'ineq', 'fun': criterion_from_x, 'args': (constr1,)},
               {'type': 'ineq', 'fun': criterion_from_x, 'args': (constr2,)}]

# initial guess
x0 = np.ones((6,))

opt_result = minimize(criterion_from_x, x0, method='COBYLA',
                      args=(costfun,), constraints=constraints)
I have not managed to find any way to set something up where x is used to generate data at each iteration, and data is then passed to both the objective function as well as the constraint functions.
Does something like this exist? I've noticed the callback argument to minimize(), but that is a function which is called after each step. I'd need some kind of preprocessor which is called on x before each step, whose results are then available to the cost function and constraint evaluation. Maybe there's a way to sneak it in somehow? I'd like to avoid writing my own optimizer.
One more traditional way to solve this would be to evaluate the constraints in the cost function (which has all the data it needs for that), have it add a penalty for violated constraints to the main cost value, and run the optimizer without the explicit constraints. But I've tried this before and found that the main cost function can become somewhat chaotic in cases where the constraints are violated, so the optimizer might get stuck in a region that violates the constraints and not find its way out again.
Another approach would be to produce some kind of global variable in the cost function and write the constraint evaluation to use that global variable, but that could be very dangerous if multithreading/-processing gets involved, or if the name I choose for the global variable collides with a name used anywhere else in the code:
def costfun(x):
    global data
    data = expensive_fun(x)
    return cheap_fun1(data)

def constr1(x):
    global data
    return cheap_fun2(data)

def constr2(x):
    global data
    return cheap_fun3(data)
I know that some people use file I/O for cases where the cost function involves running a large simulation which produces a bunch of output files. After that, the constraint functions can just access those files -- but my problem is not that big.
I'm currently using Python v3.9 and scipy 1.9.1.
You could write a decorator class in the same vein as SciPy's MemoizeJac that caches the return values of the expensive function each time it is called:
import numpy as np

class MemoizeData:
    def __init__(self, obj_fun, exp_fun, constr_fun):
        self.obj_fun = obj_fun
        self.exp_fun = exp_fun
        self.constr_fun = constr_fun
        self._data = None
        self.x = None

    def _compute_if_needed(self, x, *args):
        if not np.all(x == self.x) or self._data is None:
            self.x = np.asarray(x).copy()
            self._data = self.exp_fun(x)

    def __call__(self, x, *args):
        self._compute_if_needed(x, *args)
        return self.obj_fun(self._data)

    def constraint(self, x, *args):
        self._compute_if_needed(x, *args)
        return self.constr_fun(self._data)
This way, the expensive function is only evaluated once per iteration. Then, after combining all your constraints into one constraint function, you could use it like this:
from scipy.optimize import minimize

def all_constrs(data):
    return np.hstack((cheap_fun2(data), cheap_fun3(data)))

obj = MemoizeData(cheap_fun1, expensive_fun, all_constrs)
constr = {'type': 'ineq', 'fun': obj.constraint}
x0 = np.ones(6)

opt_result = minimize(obj, x0, method="COBYLA", constraints=constr)
While Joni was writing their answer, I found another one, which is admittedly more hacky. I prefer theirs, but for the sake of completeness I wanted to post this one, too.
It's derived from the material at https://mdobook.github.io/ and the accompanying video tutorials from the BYU FLOW Lab, in particular this video:
The trick is to use non-local variables inside an enclosing function to keep a cache of the last evaluation of the expensive function:
import numpy as np

def make_cached_functions():
    # non-local cache of the last evaluation of the expensive function
    last_x = None
    last_data = None

    def compute_data(x):
        data = expensive_fun(x)
        return data

    def get_last_data(x):
        nonlocal last_x, last_data
        if not np.array_equal(x, last_x):
            last_data = compute_data(x)
            last_x = np.copy(x)  # copy in case the optimizer reuses the array
        return last_data

    def costfun(x):
        data = get_last_data(x)
        return cheap_fun1(data)

    def constr1(x):
        data = get_last_data(x)
        return cheap_fun2(data)

    def constr2(x):
        data = get_last_data(x)
        return cheap_fun3(data)

    return costfun, constr1, constr2

costfun, constr1, constr2 = make_cached_functions()
...and then everything can progress as in my original code in the question.
Reasons why I prefer Joni's class-based version:
Variable scopes are clearer than with nonlocal.
If some of the functions allow calculation of their Jacobian, or there are other things worth buffering, the added complexity is held in check better than with a growing collection of nonlocal variables.
Having a class instance do all the work also allows you to do other interesting things, like keeping a record of all past evaluations and the path taken by the optimizer, without having to use a separate callback function. Very useful for debugging/tweaking convergence if the optimizer won't converge or takes too long, but also to visualize or otherwise investigate the objective function or similar.
The same ability might actually be really cool for things like constructing a response surface model from the results of previous function evaluations. That could be used to establish a starting guess in case the expensive function is some numerical method that benefits from a good starting point.
Both approaches allow the use of "cheap" constraints which don't require the expensive function to be evaluated, by simply providing them as separate functions. Not sure whether that would help much with compute times, though. I suppose that would depend on the algorithm used by the optimizer.
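For example, with the class-based version above, a purely x-based constraint can simply sit next to the memoized one in the constraints list; bound_below here is a made-up placeholder for such a cheap constraint:
# hypothetical cheap constraint that only inspects x and never needs the
# expensive data
def bound_below(x):
    return x - 0.1  # require every component of x to stay above 0.1

constraints = [
    {'type': 'ineq', 'fun': obj.constraint},  # uses the cached expensive data
    {'type': 'ineq', 'fun': bound_below},     # evaluated directly on x
]

opt_result = minimize(obj, x0, method="COBYLA", constraints=constraints)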

Write a recursive function to list all paths of parts.txt

Write a function list_files_recursive that returns a list of the paths of all the parts.txt files without using the os module's walk generator. Instead, the function should use recursion. The input will be a directory name.
Here is the code I have so far. I think it's basically right, but the problem is that the output is not one whole list:
def list_files_recursive(top_dir):
    rec_list_files = []
    list_dir = os.listdir(top_dir)
    for item in list_dir:
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            list_files_recursive(item_path)
        else:
            if os.path.basename(item_path) == 'parts.txt':
                rec_list_files.append(os.path.join(item_path))
    print(rec_list_files)
    return rec_list_files
This is part of the output I'm getting (from the print statement):
['CarItems/Honda/Accord/1996/parts.txt']
[]
['CarItems/Honda/Odyssey/2000/parts.txt']
['CarItems/Honda/Odyssey/2002/parts.txt']
[]
So the problem is that it's not one list, and there are empty lists in there. I don't quite know why this isn't working and have tried everything to work through it. Any help is much appreciated!
This is very close, but the issue is that list_files_recursive's child calls don't pass results back to the parent. One way to do this is to concatenate all of the lists together from each child call, or to pass a reference to a single list all the way through the call chain.
Note that in rec_list_files.append(os.path.join(item_path)), there's no point in calling os.path.join with only a single argument. print(rec_list_files) should be omitted as a side effect that makes the output confusing to interpret; only print in the caller. Additionally,
else:
    if ... :
can be written more clearly as elif ... : since the two forms are logically equivalent. It's always a good idea to reduce nesting of conditionals whenever possible.
Here's the approach that works by extending the parent list:
import os

def list_files_recursive(top_dir):
    files = []
    for item in os.listdir(top_dir):
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            files.extend(list_files_recursive(item_path))
            #     ^^^^^^ add child results to parent
        elif os.path.basename(item_path) == "parts.txt":
            files.append(item_path)
    return files

if __name__ == "__main__":
    print(list_files_recursive("foo"))
Or by passing a result list through the call tree:
import os

def list_files_recursive(top_dir, files=None):
    if files is None:  # avoid a mutable default list shared between top-level calls
        files = []
    for item in os.listdir(top_dir):
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            list_files_recursive(item_path, files)
            #                               ^^^^^ pass our result list recursively
        elif os.path.basename(item_path) == "parts.txt":
            files.append(item_path)
    return files

if __name__ == "__main__":
    print(list_files_recursive("foo"))
A major problem with these functions is that they only find files named precisely parts.txt, since that string literal was hardcoded. That makes them pretty much useless for anything but the immediate purpose. We should add a parameter that lets the caller specify the target file name to search for, making the function general-purpose.
Another problem is that the function doesn't do what its name claims: list_files_recursive should really be called find_file_recursive or, given the hardcoded string, find_parts_txt_recursive.
Beyond that, the function is a strong candidate for turning into a generator function, which is a common Python idiom for traversal, particularly for situations where the subdirectories may contain huge amounts of data that would be expensive to keep in memory all at once. Generators also allow the flexibility of using the function to cancel the search after the first match, further enhancing its (re)usability.
The yield keyword also makes the function code itself very clean--we can avoid the problem of keeping a result data structure entirely and just fire off result items on demand.
Here's how I'd write it:
import os

def find_file_recursive(top_dir, target):
    for item in os.listdir(top_dir):
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            yield from find_file_recursive(item_path, target)
        elif os.path.basename(item_path) == target:
            yield item_path

if __name__ == "__main__":
    print(list(find_file_recursive("foo", "parts.txt")))

Multiprocessing a function that tests a given dataset against a list of distributions. Returning function values from each iteration through list

I am working on processing a dataset that includes dense GPS data. My goal is to use parallel processing to test my dataset against all possible distributions and return the best one with the parameters generated for said distribution.
Currently, I have code that does this in serial thanks to this answer https://stackoverflow.com/a/37616966. Of course, it is going to take entirely too long to process my full dataset. I have been playing around with multiprocessing, but can't seem to get it to work right. I want it to test multiple distributions in parallel, keeping track of sum of square error. Then I want to select the distribution with the lowest SSE and return its name along with the parameters generated for it.
def fit_dist(distribution, data=data, bins=200, ax=None):
    # Block of code that tests the distribution and generates params
    return (distribution.name, best_params, sse)

if __name__ == '__main__':
    p = Pool()
    result = p.map(fit_dist, DISTRIBUTIONS)
    p.close()
    p.join()
I need some help with how to actually make use of the return values from each iteration of the multiprocessing so that I can compare them. I'm really new to Python, especially multiprocessing, so please be patient with me and explain as much as possible.
The problem I'm having is that it gives me an "UnboundLocalError" on the variables that I'm trying to return from my fit_dist function. The DISTRIBUTIONS list contains 89 objects. Could this be related to the parallel processing, or is it something to do with the definition of fit_dist?
With the help of Tomerikoo's comment and some further struggling, I got the code working the way I wanted. The UnboundLocalError was due to me not putting the return statement in the correct block of code within my fit_dist function. To answer the question, I did the following:
from multiprocessing import Pool

def fit_dist(distribution, data=data, bins=200, ax=None):
    # ...distribution testing and parameter fitting, with the return
    # statement placed in the correct block of this function
    return [distribution.name, params, sse]

if __name__ == '__main__':
    p = Pool()
    result = p.map(fit_dist, DISTRIBUTIONS)
    p.close()
    p.join()

    # Filter out the None results. Due to the nature of the distribution
    # fitting, some distributions are so far off that they produce None.
    res = list(filter(None, result))

    # Iterate over the nested list, storing the lowest sum of squared
    # errors in best_sse (initialized to infinity so the first
    # comparison succeeds).
    best_sse = float('inf')
    for dist in res:
        if best_sse > dist[2] > 0:
            best_sse = dist[2]

    # Iterate over the list, pulling out the sublist of the distribution
    # with the best SSE. Each sublist holds a name string, a tuple of
    # parameters and a float SSE, which is why sse is always index 2.
    for dist in res:
        if dist[2] == best_sse:
            best_dist_list = dist
The rest of the code simply consists of me using that list to construct charts and plots with that best distribution overtop of a histogram of my raw data.
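As a side note, once the None results are filtered out, the two selection loops above could probably be collapsed into a single min() call keyed on the SSE at index 2. A small sketch, assuming res keeps the same [name, params, sse] structure as above:
# keep only fits with a meaningful (positive) SSE, then take the smallest
candidates = [dist for dist in res if dist[2] > 0]
best_dist_list = min(candidates, key=lambda dist: dist[2])
best_name, best_params, best_sse = best_dist_list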

Round robin on iterable

Consider the following, simple round-robin implementation:
from itertools import chain, repeat

class RoundRobin:
    def __init__(self, iterable):
        self._iterable = set(iterable)

    def __iter__(self):
        for value in chain.from_iterable(repeat(self._iterable)):
            yield value
Example usage:
machines = ['test1', 'test2',
            'test3', 'test4']

rr_machines = RoundRobin(machines)

for machine in rr_machines:
    # Do something
    pass
While this works, I was wondering if there was a way to modify the iterable in the RoundRobin class that would also impact existing iterators.
E.g. suppose that while I'm consuming values from the iterator, one of the machines from the set has become unavailable, and I want to prevent it from being returned.
The only solution I could think of was to implement a separate Iterator class. Of course, that still leaves the question of what to do when all machines have become unavailable and no more values can be returned (raise a StopIteration exception?).
With chain.from_iterable(repeat(self._iterable)), every pass iterates over the very same set object, and a set cannot safely be modified while it is being iterated over.
It is a matter of creating another implementation of repeat which re-creates a copy of the container at each pass over the whole set. That is possible because in this case we know the thing to be repeated is a container, while itertools.repeat has to work with arbitrary objects and simply hands out the same one every time:
def mutable_repeat(container):
    while True:
        for item in container.copy():
            yield item
Just using mutable_repeat(self._iterable) in place of chain.from_iterable(repeat(self._iterable)) in __iter__ allows you to make "on the fly" changes to your self._iterable set, and new values can be added to or removed from that set. (Although a removed value will most likely be issued one last time, from the copy of the pass that is already underway.)
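For instance, swapped into the original class it could look like this (a sketch; mutable_repeat is the generator defined just above, and chain is no longer needed):
class RoundRobin:
    def __init__(self, iterable):
        self._iterable = set(iterable)

    def __iter__(self):
        # every full pass iterates over a fresh copy made by mutable_repeat,
        # so the set can be modified while the iterator is being consumed
        yield from mutable_repeat(self._iterable)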
If you need to guard against issuing a removed value even once, you can easily do so by adding some more logic to the whole thing.
Instead of interacting with self._iterable directly from outside your class, you could do:
class RoundRobin:
    def __init__(self, iterable):
        self._iterable = set(iterable)
        self._removed = set()

    def __iter__(self):
        yield from self.repeat()

    def remove(self, item):
        self._removed.add(item)
        self._iterable.remove(item)

    def add(self, item):
        self._iterable.add(item)

    def repeat(self):
        while True:
            for item in self._iterable.copy():
                if item not in self._removed:
                    yield item
            # reset after a full pass, so an item that gets re-added
            # later is not blocked forever
            self._removed = set()
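A short usage sketch of that version, reusing the machine names from the question (the point at which 'test2' becomes unavailable is of course made up):
rr_machines = RoundRobin(['test1', 'test2', 'test3', 'test4'])

for i, machine in enumerate(rr_machines):
    print(machine)
    if machine == 'test2':
        # 'test2' is gone for good: it is filtered out even from the
        # snapshot of the pass that is currently being consumed
        rr_machines.remove('test2')
    if i >= 7:
        break  # stop the otherwise endless round robin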
