scipy.optimize.minimize() constraints depend on cost function - python-3.x

I'm running a constrained optimisation with scipy.optimize.minimize(method='COBYLA').
In order to evaluate the cost function, I need to run a relatively expensive simulation to compute a dataset from the input variables, and the cost function is one (cheap to compute) property of that dataset. However, two of my constraints are also dependent on that expensive data.
So far, the only way I have found to constrain the optimisation is to have each of the constraint functions recompute the same dataset that the cost function already has calculated (simplified quasi-code):
def costfun(x):
data = expensive_fun(x)
return(cheap_fun1(data))
def constr1(x):
data = expensive_fun(x)
return(cheap_fun2(data))
def constr2(x):
data = expensive_fun(x)
return(cheap_fun3(data))
constraints = [{'type':'ineq', 'fun':constr1},
{'type':'ineq', 'fun':constr2}]
# initial guess
x0 = np.ones((6,))
opt_result = minimize(costfun, x0, method='COBYLA',
constraints=constraints)
This is clearly not efficient because expensive_fun(x) is called three times for every x.
I could change this slightly to include a universal "evaluate some cost" function which runs the expensive computation, and then evaluates whatever criterion it has been given. But while that saves me from having to write the "expensive" code several times, it still runs three times for every iteration of the optimizer:
# universal cost function evaluator
def criterion_from_x(x, cfun):
data = expensive_fun(x)
return(cfun(data))
def costfun(data):
return(cheap_fun1(data))
def constr1(data):
return(cheap_fun2(data))
def constr2(data):
return(cheap_fun3(data))
constraints = [{'type':'ineq', 'fun':criterion_from_x, 'args':(constr1,)},
{'type':'ineq', 'fun':criterion_from_x, 'args':(constr2,)}
# initial guess
x0 = np.ones((6,))
opt_result = minimize(criterion_from_x, x0, method='COBYLA',
args=(costfun,), constraints=constraints)
I have not managed to find any way to set something up where x is used to generate data at each iteration, and data is then passed to both the objective function as well as the constraint functions.
Does something like this exist? I've noticed the callback argument to minimize(), but that is a function which is called after each step. I'd need some kind of preprocessor which is called on x before each step, whose results are then available to the cost function and constraint evaluation. Maybe there's a way to sneak it in somehow? I'd like to avoid writing my own optimizer.
One, more traditional, way to solve this would be to evaluate the constraints in the cost function (which has all the data it needs for that, have it add a penalty for violated constraints to the main cost function, and run the optimizer without the explicit constraints, but I've tried this before and found that the main cost function can become somewhat chaotic in cases where the constraints are violated, so an optimizer might get stuck in some place which violates the constraints and not find out again.
Another approach would be to produce some kind of global variable in the cost function and write the constraint evaluation to use that global variable, but that could be very dangerous if multithreading/-processing gets involved, or if the name I choose for the global variable collides with a name used anywhere else in the code:
'''
def costfun(x):
global data
data = expensive_fun(x)
return(cheap_fun1(data))
def constr1(x):
global data
return(cheap_fun2(data))
def constr2(x):
global data
return(cheap_fun3(data))
'''
I know that some people use file I/O for cases where the cost function involves running a large simulation which produces a bunch of output files. After that, the constraint functions can just access those files -- but my problem is not that big.
I'm currently using Python v3.9 and scipy 1.9.1.

You could write a decorator class in the same vein to scipy's MemoizeJac that caches the return values of the expensive function each time it is called:
import numpy as np
class MemoizeData:
def __init__(self, obj_fun, exp_fun, constr_fun):
self.obj_fun = obj_fun
self.exp_fun = exp_fun
self.constr_fun = constr_fun
self._data = None
self.x = None
def _compute_if_needed(self, x, *args):
if not np.all(x == self.x) or self._data is None:
self.x = np.asarray(x).copy()
self._data = self.exp_fun(x)
def __call__(self, x, *args):
self._compute_if_needed(x, *args)
return self.obj_fun(self._data)
def constraint(self, x, *args):
self._compute_if_needed(x, *args)
return self.constr_fun(self._data)
Followingly, the expensive function is only evaluated once for each iteration. Then, after writing all your constraints into one constraint function, you could use it like this:
from scipy.optimize import minimize
def all_constrs(data):
return np.hstack((cheap_fun2(data), cheap_fun3(data)))
obj = MemoizeData(cheap_fun1, expensive_fun, all_constrs)
constr = {'type': 'ineq', 'fun': obj.constraint}
x0 = np.ones(6)
opt_result = minimize(obj, x0, method="COBYLA", constraints=constr)

While Joni was writing their answer, I found another one, which is admittedly more hacky. I prefer theirs, but for the sake of completeness, I wanted to post this one, too.
It's derived from the material from https://mdobook.github.io/ and the accompanying video tutorials from BYU FLow Lab, in particular this video:
The trick is to use non-local variables to keep a cache of the last evaluation of the expensive function:
import numpy as np
last_x = None
last_data = None
def compute_data(x):
data = expensive_fun(x)
return(data)
def get_last_data(x):
nonlocal last_x, last_data
if not np.array_equal(x, last_x):
last_data = compute_data(x)
last_x = x
return(last_data)
def costfun(x):
data = get_last_data(x)
return(cheap_fun1(data)
def constr1(x):
data = get_last_data(x)
return(cheap_fun2(data)
def constr2(x):
data = get_last_data(x)
return(cheap_fun3(data)
...and then everything can progress as in my original code in the question.
Reasons why I prefer Joni's class-based version:
variable scopes are clearer than with nonlocal
If some of the functions allow calculation of their Jacobian, or there are other things worth buffering, the added complexity is held in check better than with
Having a class instance do all the work also allows you to do other interesting things, like keeping a record of all past evaluations and the path taken by the optimizer, without having to use a separate callback function. Very useful for debugging/tweaking convergence if the optimizer won't converge or takes too long, but also to visualize or otherwise investigate the objective function or similar.
The same ability might actually be really cool for things like constructing a response surface model from the results of previous function evaluations. That could be used to establish a starting guess in case the expensive function is some numerical method that benefits from a good starting point.
Both approaches allow the use of "cheap" constraints which don't require the expensive function to be evaluated, by simply providing them as separate functions. Not sure whether that would help much with compute times, though. I suppose that would depend on the algorithm used by the optimizer.

Related

Python multiprocess: run several instances of a class, keep all child processes in memory

First, I'd like to thank the StackOverflow community for the tremendous help it provided me over the years, without me having to ask a single question.
I could not find anything that I can relate to my problem, though it is probably due to my lack of understanding of the subject, rather than the absence of a response on the website. My apologies in advance if this is a duplicate.
I am relatively new to multiprocess; some time ago I succeeded in using multiprocessing.pools in a very simple way, where I didn't need any feedback between the child processes.
Now I am facing a much more complicated problem, and I am just lost in the documentation about multiprocessing. I hence ask for you help, your kindness and your patience.
I am trying to build a parallel tempering monte-carlo algorithm, from a class.
The basic class very roughly goes as follows:
import numpy as np
class monte_carlo:
def __init__(self):
self.x=np.ones((1000,3))
self.E=np.mean(self.x)
self.Elist=[]
def simulation(self,temperature):
self.T=temperature
for i in range(3000):
self.MC_step()
if i%10==0:
self.Elist.append(self.E)
return
def MC_step(self):
x=self.x.copy()
k = np.random.randint(1000)
x[k] = (x[k] + np.random.uniform(-1,1,3))
temp_E=np.mean(self.x)
if np.random.random()<np.exp((self.E-temp_E)/self.T):
self.E=temp_E
self.x=x
return
Obviously, I simplified a great deal (actual class is 500 lines long!), and built fake functions for simplicity: __init__ takes a bunch of parameters as arguments, there are many more lists of measurement else than self.Elist, and also many arrays derived from self.X that I use to compute them. The key point is that each instance of the class contains a lot of informations that I want to keep in memory, and that I don't want to copy over and over again, to avoid dramatic slowing down. Else I would just use the multiprocessing.pool module.
Now, the parallelization I want to do, in pseudo-code:
def proba(dE,pT):
return np.exp(-dE/pT)
Tlist=[1.1,1.2,1.3]
N=len(Tlist)
G=[]
for _ in range(N):
G.append(monte_carlo())
for _ in range(5):
for i in range(N): # this loop should be ran in multiprocess
G[i].simulation(Tlist[i])
for i in range(N//2):
dE=G[i].E-G[i+1].E
pT=G[i].T + G[i+1].T
p=proba(dE,pT) # (proba is a function, giving a probability depending on dE)
if np.random.random() < p:
T_temp = G[i].T
G[i].T = G[i+1].T
G[i+1].T = T_temp
Synthesis: I want to run several instances of my monte-carlo class in parallel child processes, with different values for a parameter T, then periodically pause everything to change the different T's, and run again the child processes/class instances, from where they paused.
Doing this, I want each class-instance/child-process to stay independent from one another, save its current state with all internal variables while it is paused, and do as few copies as possible. This last point is critical, as the arrays inside the class are quite big (some are 1000x1000), and a copy will therefore very quickly become quite time-costly.
Thanks in advance, and sorry if I am not clear...
Edit:
I am using a distant machine with many (64) CPUs, running on Debian GNU/Linux 10 (buster).
Edit2:
I made a mistake in my original post: in the end, the temperatures must be exchanged between the class-instances, and not inside the global Tlist.
Edit3: Charchit answer works perfectly for the test code, on both my personal machine and the distant machine I am usually using for running my codes. I hence check this as the accepted answer.
However, I want to report here that, inserting the actual, more complicated code, instead of the oversimplified monte_carlo class, the distant machine gives me some strange errors:
Unable to init server: Could not connect: Connection refused
(CMC_temper_all.py:55509): Gtk-WARNING **: ##:##:##:###: Locale not supported by C library.
Using the fallback 'C' locale.
Unable to init server: Could not connect: Connection refused
(CMC_temper_all.py:55509): Gdk-CRITICAL **: ##:##:##:###:
gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
(CMC_temper_all.py:55509): Gdk-CRITICAL **: ##:##:##:###: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
The "##:##:##:###" are (or seems like) IP adresses.
Without the call to set_start_method('spawn') this error shows only once, in the very beginning, while when I use this method, it seems to show at every occurrence of result.get()...
The strangest thing is that the code seems otherwise to work fine, does not crash, produces the datafiles I then ask it to, etc...
I think this would deserve to publish a new question, but I put it here nonetheless in case someone has a quick answer.
If not, I will resort to add one by one the variables, methods, etc... that are present in my actual code but not in the test example, to try and find the origin of the bug. My best guess for now is that the memory space required by each child-process with the actual code, is too large for the distant machine to accept it, due to some restrictions implemented by the admin.
What you are looking for is sharing state between processes. As per the documentation, you can either create shared memory, which is restrictive about the data it can store and is not thread-safe, but offers better speed and performance; or you can use server processes through managers. The latter is what we are going to use since you want to share whole objects of user-defined datatypes. Keep in mind that using managers will impact speed of your code depending on the complexity of the arguments that you pass and receive, to and from the managed objects.
Managers, proxies and pickling
As mentioned, managers create server processes to store objects, and allow access to them through proxies. I have answered a question with better details on how they work, and how to create a suitable proxy here. We are going to use the same proxy defined in the linked answer, with some variations. Namely, I have replaced the factory functions inside the __getattr__ to something that can be pickled using pickle. This means that you can run instance methods of managed objects created with this proxy without resorting to using multiprocess. The result is this modified proxy:
from multiprocessing.managers import NamespaceProxy, BaseManager
import types
import numpy as np
class A:
def __init__(self, name, method):
self.name = name
self.method = method
def get(self, *args, **kwargs):
return self.method(self.name, args, kwargs)
class ObjProxy(NamespaceProxy):
"""Returns a proxy instance for any user defined data-type. The proxy instance will have the namespace and
functions of the data-type (except private/protected callables/attributes). Furthermore, the proxy will be
pickable and can its state can be shared among different processes. """
def __getattr__(self, name):
result = super().__getattr__(name)
if isinstance(result, types.MethodType):
return A(name, self._callmethod).get
return result
Solution
Now we only need to make sure that when we are creating objects of monte_carlo, we do so using managers and the above proxy. For that, we create a class constructor called create. All objects for monte_carlo should be created with this function. With that, the final code looks like this:
from multiprocessing import Pool
from multiprocessing.managers import NamespaceProxy, BaseManager
import types
import numpy as np
class A:
def __init__(self, name, method):
self.name = name
self.method = method
def get(self, *args, **kwargs):
return self.method(self.name, args, kwargs)
class ObjProxy(NamespaceProxy):
"""Returns a proxy instance for any user defined data-type. The proxy instance will have the namespace and
functions of the data-type (except private/protected callables/attributes). Furthermore, the proxy will be
pickable and can its state can be shared among different processes. """
def __getattr__(self, name):
result = super().__getattr__(name)
if isinstance(result, types.MethodType):
return A(name, self._callmethod).get
return result
class monte_carlo:
def __init__(self, ):
self.x = np.ones((1000, 3))
self.E = np.mean(self.x)
self.Elist = []
self.T = None
def simulation(self, temperature):
self.T = temperature
for i in range(3000):
self.MC_step()
if i % 10 == 0:
self.Elist.append(self.E)
return
def MC_step(self):
x = self.x.copy()
k = np.random.randint(1000)
x[k] = (x[k] + np.random.uniform(-1, 1, 3))
temp_E = np.mean(self.x)
if np.random.random() < np.exp((self.E - temp_E) / self.T):
self.E = temp_E
self.x = x
return
#classmethod
def create(cls, *args, **kwargs):
# Register class
class_str = cls.__name__
BaseManager.register(class_str, cls, ObjProxy, exposed=tuple(dir(cls)))
# Start a manager process
manager = BaseManager()
manager.start()
# Create and return this proxy instance. Using this proxy allows sharing of state between processes.
inst = eval("manager.{}(*args, **kwargs)".format(class_str))
return inst
def proba(dE,pT):
return np.exp(-dE/pT)
if __name__ == "__main__":
Tlist = [1.1, 1.2, 1.3]
N = len(Tlist)
G = []
# Create our managed instances
for _ in range(N):
G.append(monte_carlo.create())
for _ in range(5):
# Run simulations in the manager server
results = []
with Pool(8) as pool:
for i in range(N): # this loop should be ran in multiprocess
results.append(pool.apply_async(G[i].simulation, (Tlist[i], )))
# Wait for the simulations to complete
for result in results:
result.get()
for i in range(N // 2):
dE = G[i].E - G[i + 1].E
pT = G[i].T + G[i + 1].T
p = proba(dE, pT) # (proba is a function, giving a probability depending on dE)
if np.random.random() < p:
T_temp = Tlist[i]
Tlist[i] = Tlist[i + 1]
Tlist[i + 1] = T_temp
print(Tlist)
This meets the criteria you wanted. It does not create any copies at all, rather, all arguments to the simulation method call are serialized inside the pool and sent to the manager server where the object is actually stored. It gets executed there, and the results (if any) are serialized and returned in the main process. All of this, with only using the builtins!
Output
[1.2, 1.1, 1.3]
Edit
Since you are using Linux, I encourage you to use multiprocessing.set_start_method inside the if __name__ ... clause to set the start method to "spawn". Doing this will ensure that the child processes do not have access to variables defined inside the clause.

Pythonic way of reducing the subclasses

background: so, I am working on an NLP problem. where I need to extract different types of features based on different types of text documents. and I currently have a setup where there is a FeatureExtractor base class, which is subclassed multiple times depending on the different types of docs and all of them calculate a different set of features and return a pandas data frame as output.
all these subclasses are further called by one wrapper type class called FeatureExtractionRunner which calls all the subclasses and calculates the features on all docs and returns the output for all types of docs.
Problem: this pattern of calculating features leads to lots of subclasses. currently, I have like 14 subclasses, since I have 14 types of docs.it might expand further. and this is too many classes to maintain. Is there an alternative way of doing this? with less subclassing
here is some sample representative code of what i explained:
from abc import ABCMeta, abstractmethod
class FeatureExtractor(metaclass=ABCMeta):
#base feature extractor class
def __init__(self, document):
self.document = document
#abstractmethod
def doc_to_features(self):
return NotImplemented
class ExtractorTypeA(FeatureExtractor):
#do some feature calculations.....
def _calculate_shape_features(self):
return None
def _calculate_size_features(self):
return None
def doc_to_features(self):
#calls all the fancy feature calculation methods like
f1 = self._calculate_shape_features(self.document)
f2 = self._calculate_size_features(self.document)
#do some calculations on the document and return a pandas dataframe by merging them (merge f1, f2....etc)
data = "dataframe-1"
return data
class ExtractorTypeB(FeatureExtractor):
#do some feature calculations.....
def _calculate_some_fancy_features(self):
return None
def _calculate_some_more_fancy_features(self):
return None
def doc_to_features(self):
#calls all the fancy feature calculation methods
f1 = self._calculate_some_fancy_features(self.document)
f2 = self._calculate_some_more_fancy_features(self.document)
#do some calculations on the document and return a pandas dataframe (merge f1, f2 etc)
data = "dataframe-2"
return data
class ExtractorTypeC(FeatureExtractor):
#do some feature calculations.....
def doc_to_features(self):
#do some calculations on the document and return a pandas dataframe
data = "dataframe-3"
return data
class FeatureExtractionRunner:
#a class to call all types of feature extractors
def __init__(self, document, *args, **kwargs):
self.document = document
self.type_a = ExtractorTypeA(self.document)
self.type_b = ExtractorTypeB(self.document)
self.type_c = ExtractorTypeC(self.document)
#more of these extractors would be there
def call_all_type_of_extractors(self):
type_a_features = self.type_a.doc_to_features()
type_b_features = self.type_b.doc_to_features()
type_c_features = self.type_c.doc_to_features()
#more such extractors would be there....
return [type_a_features, type_b_features, type_c_features]
all_type_of_features = FeatureExtractionRunner("some document").call_all_type_of_extractors()
Answering the question first, you may avoid subclassing entirely at the cost of writing the __init__ method each time. Or you may get rid off the classes entirely and convert them to a bunch of functions. Or even you may join all the classes in a single one. Note that none of these methods will make the code simpler or more maintainable, indeed they would just change it's shape to some extent.
IMHO this situation is a perfect example of inherent problem complexity by which I mean that the domain (NLP) and particular use case (doc feature extraction) are complex in and out themselves.
For example, featureX and featureY are likely to be totally different things that cannot be calculated altogether, thus you end up with one method each. Similarly, the procedure to merge these features in a dataframe might be different than the one to merge the fancy features. Having lots of functions/classes in this situation seems totally reasonable to me, also having them separate is logical and maintainable wise.
That said real code reduction might be possible if you can combine some feature calculation methods into a more generic function, tough I can't say for sure if it would be possible.

Restarting an optimisation with Pymoo

I'm trying to restart an optimisation in pymoo.
I have a problem defined as:
class myOptProb(Problem):
"""my body goes here"""
algorithm = NSGA2(pop_size=24)
problem = myOptProblem(opt_obj=dp_ptr,
nvars=7,
nobj=4,
nconstr=0,
lb=0.3 * np.ones(7),
ub=0.7 * np.ones(7),
parallelization=('threads', cpu_count(),))
res = minimize(problem,
algorithm,
('n_gen', 100),
seed=1,
verbose=True)
During the optimisation I write the design vectors and results to a .csv file. An example of design_vectors.csv is:
5.000000000000000000e+00, 4.079711567060104183e-01, 6.583544872784267143e-01, 4.712364759485179189e-01, 6.859360188593541796e-01, 5.653765991273791425e-01, 5.486782880836487131e-01, 5.275405748345924906e-01,
7.000000000000000000e+00, 5.211287914743063521e-01, 6.368123569438421949e-01, 3.496693260479644128e-01, 4.116734716044557763e-01, 5.343037085833151068e-01, 6.878382993278697732e-01, 5.244120877022839800e-01,
9.000000000000000000e+00, 5.425317846613321171e-01, 5.275405748345924906e-01, 4.269449637288642574e-01, 6.954464617649794844e-01, 5.318980876983187001e-01, 4.520564690494201510e-01, 5.203792876471586837e-01,
1.100000000000000000e+01, 4.579502451694219545e-01, 6.853050113762846340e-01, 3.695822666721857441e-01, 3.505318077758549089e-01, 3.540316632186925050e-01, 5.022648662707586142e-01, 3.086099221096791911e-01,
3.000000000000000000e+00, 4.121775968257620493e-01, 6.157117313805953174e-01, 3.412904026310568106e-01, 4.791574104703620329e-01, 6.634382012372381787e-01, 4.174456593494717538e-01, 4.151101354345394512e-01,
The results.csv is:
5.000000000000000000e+00, 1.000000000000000000e+05, 1.000000000000000000e+05, 1.000000000000000000e+05, 1.000000000000000000e+05,
7.000000000000000000e+00, 1.041682833582066703e+00, 3.481167125962069189e-03, -5.235115318709097909e-02, 4.634480813876099177e-03,
9.000000000000000000e+00, 1.067730307802263967e+00, 2.194702810002167534e-02, -3.195892023664552717e-01, 1.841232582360878426e-03,
1.100000000000000000e+01, 8.986880344052742275e-01, 2.969022150977750681e-03, -4.346692726475211849e-02, 4.995468429444801205e-03,
3.000000000000000000e+00, 9.638770499257821589e-01, 1.859596479928402393e-02, -2.723230073142696162e-01, 1.600910928983005632e-03,
The first column is the index of the design vector - because I thread asynchronously, I specify the indices.
I see that it should be possible to restart the optimisation via the sampling parameter for pymoo.algorithms.nsga2.NSGA2 but I couldn't find a working example. The documentation for both population and individuals is also not clear. So how can I restart a simulation with the previous results?
Yes, you can initialize the algorithm object with a population instead of doing it randomly.
I have written a small tutorial for a biased initialization:
https://pymoo.org/customization/initialization.html
Because in your case the data already exists, in a CSV or in-memory file, you might want to create a dummy problem (I have called it Constant in my example) to set the attributes in the Population object. (In the population X, F, G, CV and feasible needs to be set). Another way would be setting the attributes directly...
The biased initialization with a dummy problem is shown below. If you already use pymoo to store the csv files, you can also just np.save the Population object directly and load it. Then all intermediate steps are not necessary.
I am planning to improve checkpoint implementation in the future. So if you have some more feedback and use case which are not possible yet please let me know.
import numpy as np
from pymoo.algorithms.nsga2 import NSGA2
from pymoo.algorithms.so_genetic_algorithm import GA
from pymoo.factory import get_problem, G1, Problem
from pymoo.model.evaluator import Evaluator
from pymoo.model.population import Population
from pymoo.optimize import minimize
class YourProblem(Problem):
def __init__(self, n_var=10):
super().__init__(n_var=n_var, n_obj=1, n_constr=0, xl=-0, xu=1, type_var=np.double)
def _evaluate(self, x, out, *args, **kwargs):
out["F"] = np.sum(np.square(x - 0.5), axis=1)
problem = YourProblem()
# create initial data and set to the population object - for your this is your file
N = 300
X = np.random.random((N, problem.n_var))
F = np.random.random((N, problem.n_obj))
G = np.random.random((N, problem.n_constr))
class Constant(YourProblem):
def _evaluate(self, x, out, *args, **kwargs):
out["F"] = F
out["G"] = G
pop = Population().new("X", X)
Evaluator().eval(Constant(), pop)
algorithm = GA(pop_size=100, sampling=pop)
minimize(problem,
algorithm,
('n_gen', 10),
seed=1,
verbose=True)

Multiprocessing a function that tests a given dataset against a list of distributions. Returning function values from each iteration through list

I am working on processing a dataset that includes dense GPS data. My goal is to use parallel processing to test my dataset against all possible distributions and return the best one with the parameters generated for said distribution.
Currently, I have code that does this in serial thanks to this answer https://stackoverflow.com/a/37616966. Of course, it is going to take entirely too long to process my full dataset. I have been playing around with multiprocessing, but can't seem to get it to work right. I want it to test multiple distributions in parallel, keeping track of sum of square error. Then I want to select the distribution with the lowest SSE and return its name along with the parameters generated for it.
def fit_dist(distribution, data=data, bins=200, ax=None):
#Block of code that tests the distribution and generates params
return(distribution.name, best_params, sse)
if __name__ == '__main__':
p = Pool()
result = p.map(fit_dist, DISTRIBUTIONS)
p.close()
p.join()
I need some help with how to actually make use of the return values on each of the iterations in the multiprocessing to compare those values. I'm really new to python especially multiprocessing so please be patient with me and explain as much as possible.
The problem I'm having is it's giving me an "UnboundLocalError" on the variables that I'm trying to return from my fit_dist function. The DISTRIBUTIONS list is 89 objects. Could this be related to the parallel processing, or is it something to do with the definition of fit_dist?
With the help of Tomerikoo's comment and some further struggling, I got the code working the way I wanted it to. The UnboundLocalError was due to me not putting the return statement in the correct block of code within my fit_dist function. To answer the question I did the following.
from multiprocessing import Pool
def fit_dist:
#put this return under the right section of this method
return[distribution.name, params, sse]
if __name__ == '__main__':
p = Pool()
result = p.map(fit_dist, DISTRIBUTIONS)
p.close()
p.join()
'''filter out the None object results. Due to the nature of the distribution fitting,
some distributions are so far off that they result in None objects'''
res = list(filter(None, result))
#iterates over nested list storing the lowest sum of squared errors in best_sse
for dist in res:
if best_sse > dist[2] > 0:
best_sse = dis[2]
else:
continue
'''iterates over list pulling out sublist of distribution with best sse.
The sublists are made up of a string, tuple with parameters,
and float value for sse so that's why sse is always index 2.'''
for dist in res:
if dist[2]==best_sse:
best_dist_list = dist
else:
continue
The rest of the code simply consists of me using that list to construct charts and plots with that best distribution overtop of a histogram of my raw data.

Python avoiding large array allocation multiple times

I have to compute a function many many times.
To compute this function the elements of an array must be computed.
The array is quite large.
How can I avoid the allocation of the array in every function call.
The code I have tried goes something like this:
class FunctionCalculator(object):
def __init__(self, data):
"""
Get the data and do some small handling of it
Let's say that we do
self.data = data
"""
def function(self, point):
return numpy.sum(numpy.array([somecomputations(item) for item in self.data]))
Well, maybe my concern is unfounded, so I have first this question.
Question: Is it true that the array [somecomputations(item) for item in data] is being allocated and deallocated for every call to function?
Thinking that that is the case I have tried
class FunctionCalculator(object):
def __init__(self, data):
"""
Get the data and do some small handling of it
Let's say that we do
self.data = data
"""
self.number_of_data = range(0, len(data))
self.my_array = numpy.zeros(len(data))
def function(self, point):
for i in self.number_of_data:
self.my_array[i] = somecomputations(self.data[i])
return numpy.sum(self.my_array)
This is slower than the previous version. I assume that the list comprehension in the first version can be ran in C entirely, while in the second version smaller parts of the script can be translated into optimized C code.
I have very little idea of how Python works inside.
Question: Is there a good way to skip the array allocation in every function call and at the same time take advantage of a well optimized loop on the array?
I am using Python3.5
Looping over the array is unnecessary and access python to c many times, hence the slow down. The beauty of numpy arrays that functions work on them cell by cell. I think the fastest would be:
return numpy.sum(somecomputations(self.data))
Somecomputations may need a bit of a modification, but often it will work off the bat. Also, you're not using point, and other stuff.

Resources