Multiprocessing getting stuck with ARMAX while refitting - python-3.x

I am trying to train multiple time series models using the below code in Jupyter Notebook.
import statsmodels.api as sm
import multiprocessing
import tqdm
train_dict = dict()  # A dictionary of dataframes
test_dict = dict()  # A dictionary of dataframes
def train_arma(key):
    endog = list(train_dict[key].endog)
    exog = list(train_dict[key].exog)
    fut_endog = list(train_dict[key].endog)
    fut_exog = list(test_dict[key].exog)
    model = sm.tsa.arima.ARIMA(endog, order=(2, 0, 2), exog=exog,
                               enforce_stationarity=False,
                               enforce_invertibility=False).fit()
    predictions = list()
    yhat = model.forecast(exog=[fut_exog[0]])[0]
    predictions.append(yhat)
    for i in tqdm.tqdm_notebook(range(len(fut_endog))[:-1]):
        model = model.append([fut_endog[i]], exog=[fut_exog[i]], refit=True)  # code gets stuck here
        predictions.append(model.forecast(exog=[fut_exog[i+1]]))
    return predictions
secs = list(train_dict.keys())
p = multiprocessing.Pool(10)
output = p.map(train_arma, secs)
p.terminate()
When len(endog) == 1006, the code keeps getting stuck on the 17th iteration of the for loop. If I shorten endog by 20 observations, it gets stuck on the 37th iteration instead.
There are some other things I have tried already:
Passing the dataframes directly instead of letting the function access train_dict and test_dict from the outer scope.
Reducing the maximum number of processes in multiprocessing.
Shuffling my input list.
Defining a new class instance in the for loop after appending the values from the fut_endog and fut_exog lists to the endog and exog lists respectively.
I ran top in my Linux terminal and observed the CPU usage while the processes were being created and executed. Initially, when the processes spawn, they use CPU, but once they get stuck their %CPU drops to 0.
There are some instances when the code does work:
When I call the function directly, without multiprocessing, it works. But using multiprocessing even with processes = 1 makes the code stop.
When I don't pass any exogenous variable and train a simple ARMA model it works.
I am using statsmodels v0.12.1 and python version is 3.7.3. Thanks

This issue is most likely caused by using tqdm alongside multiprocessing.
https://github.com/tqdm/tqdm/issues/461 describes the same problem.
I resolved it by using
from tqdm import tqdm
tqdm.get_lock().locks = []
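As an alternative to resetting the lock, progress can be reported from the parent process so that no progress bar runs inside the workers at all. This is a minimal sketch of that different technique, not the fix described in the tqdm issue. The names fit_one and run_all are hypothetical, and fit_one only stands in for train_arma.

```python
import multiprocessing as mp


def fit_one(key):
    """Hypothetical stand-in for train_arma: does some work, shows no progress bar."""
    return key, sum(i * i for i in range(1000))


def run_all(keys, processes=2):
    """Run fit_one over keys in a pool and report progress from the parent only."""
    results = {}
    with mp.Pool(processes) as pool:
        # imap_unordered yields each result as soon as a worker finishes it,
        # so the parent can count completions without any lock shared with workers.
        for done, (key, value) in enumerate(pool.imap_unordered(fit_one, keys), start=1):
            print(f"{done}/{len(keys)} finished: {key}")
            results[key] = value
    return results
```

Calling run_all(secs) would then replace the p.map call. The progress is coarser (one tick per series rather than per refit), which is the trade-off for keeping the workers free of any progress-bar state.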

Related

RandomSearchCV is too slow when working on pipe

I'm testing feature selection combined with hyperparameter tuning.
I'm running the SequentialFeatureSelector feature-selection algorithm inside the RandomizedSearchCV hyperparameter search, with an XGBoost model.
I run the following code:
from xgboost import XGBClassifier
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
import pandas as pd
def main():
    df = pd.read_csv("input.csv")
    x = df[['f1','f2','f3', 'f4', 'f5', 'f6','f7','f8']]
    y = df[['y']]
    model = XGBClassifier(n_jobs=-1)
    sfs = SequentialFeatureSelector(model, k_features="best", forward=True, floating=False, scoring="accuracy", cv=2, n_jobs=-1)
    params = {'xgboost__max_depth': [2, 4], 'sfs__k_features': [1, 4]}
    pipe = Pipeline([('sfs', sfs), ('xgboost', model)])
    randomized = RandomizedSearchCV(estimator=pipe, param_distributions=params,n_iter=2,cv=2,random_state=40,scoring='accuracy',refit=True,n_jobs=-1)
    res = randomized.fit(x.values,y.values)
if __name__=='__main__':
    main()
The file input.csv has only 39 rows of data (not including the header):
f1,f2,f3,f4,f5,f6,f7,f8,y
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
7,100,0,0,0,30.0,0.484,32,1
0,118,84,47,230,45.8,0.551,31,1
7,107,74,0,0,29.6,0.254,31,1
1,103,30,38,83,43.3,0.183,33,0
1,115,70,30,96,34.6,0.529,32,1
3,126,88,41,235,39.3,0.704,27,0
8,99,84,0,0,35.4,0.388,50,0
7,196,90,0,0,39.8,0.451,41,1
9,119,80,35,0,29.0,0.263,29,1
11,143,94,33,146,36.6,0.254,51,1
10,125,70,26,115,31.1,0.205,41,1
7,147,76,0,0,39.4,0.257,43,1
1,97,66,15,140,23.2,0.487,22,0
13,145,82,19,110,22.2,0.245,57,0
5,117,92,0,0,34.1,0.337,38,0
5,109,75,26,0,36.0,0.546,60,0
3,158,76,36,245,31.6,0.851,28,1
3,88,58,11,54,24.8,0.267,22,0
6,92,92,0,0,19.9,0.188,28,0
10,122,78,31,0,27.6,0.512,45,0
4,103,60,33,192,24.0,0.966,33,0
11,138,76,0,0,33.2,0.420,35,0
9,102,76,37,0,32.9,0.665,46,1
2,90,68,42,0,38.2,0.503,27,1
As you can see, the dataset is very small, and there are only a few parameters to optimize.
I checked the number of cpus:
lscpu
and I got:
CPU(s): 12
so 12 threads can be created and run in parallel
I checked this post:
RandomSearchCV super slow - troubleshooting performance enhancement
But I already use n_jobs = -1
So why does it run so slowly (more than 15 minutes)?

Parallelizing fastText.get_sentence_vector with dask gives pickling error

I was trying to get fastText sentence embeddings for 80 Million English tweets using the parallelizing mechanism using dask as described in this answer: How do you parallelize apply() on Pandas Dataframes making use of all cores on one machine?
Here is my full code:
import dask.dataframe as dd
from dask.multiprocessing import get
import fasttext
import fasttext.util
import pandas as pd
import numpy as np  # needed by get_fasttext_sentence_embedding below
print('starting language: ' + 'en')
lang_output = pd.DataFrame()
lang_input = full_input.loc[full_input.name == 'en'] # 80 Million English tweets
ddata = dd.from_pandas(lang_input, npartitions = 96)
print('number of lines to compute: ' + str(len(lang_input)))
fasttext.util.download_model('en', if_exists='ignore') # English
ft = fasttext.load_model('cc.'+'en'+'.300.bin')
fasttext.util.reduce_model(ft, 20)
lang_output['sentence_embedding'] = ddata.map_partitions(lambda lang_input: lang_input.apply((lambda x: get_fasttext_sentence_embedding(x.tweet_text, ft)), axis = 1)).compute(scheduler='processes')
print('finished en')
This is the get_fasttext_sentence_embedding function:
def get_fasttext_sentence_embedding(row, ft):
    if pd.isna(row):
        return np.zeros(20)
    return ft.get_sentence_vector(row)
But, I get a pickling error on this line:
lang_output['sentence_embedding'] = ddata.map_partitions(lambda lang_input: lang_input.apply((lambda x: get_fasttext_sentence_embedding(x.tweet_text, ft)), axis = 1)).compute(scheduler='processes')
This is the error I get:
TypeError: can't pickle fasttext_pybind.fasttext objects
Is there a way to parallelize fastText's get_sentence_vector with dask (or anything else)? I need to parallelize because getting sentence embeddings for 80 million tweets takes too much time, and each row of my data frame is completely independent of the others.
The problem here is that fasttext objects apparently can't be pickled, and Dask doesn't know how to serialize and deserialize this data structure without pickling.
The simplest way to use Dask here (but likely not the most efficient), would be to have each process define the ft model itself, which would avoid the need to transfer it (and thus avoid the attempted pickling). Something like the following would work. Notice that ft is defined inside the function being mapped across partitions.
First, some example data.
import dask.dataframe as dd
import fasttext
import pandas as pd
import dask
import numpy as np
df = pd.DataFrame({"text":['this is a test sentence', None, 'this is another one.', 'one more']})
ddf = dd.from_pandas(df, npartitions=2)
ddf
Dask DataFrame Structure:
                 text
npartitions=2
0              object
2                 ...
3                 ...
Dask Name: from_pandas, 2 tasks
Next, we can tweak your functions to define ft within each process. This duplicates effort, but avoids the need to transfer the model. With that, we can smoothly run it via map_partitions.
def get_embeddings(sent, model):
    return model.get_sentence_vector(sent) if not pd.isna(sent) else np.zeros(10)
def func(df):
    ft = fasttext.load_model("amazon_review_polarity.bin")  # arbitrary model
    res = df['text'].apply(lambda x: get_embeddings(x, model=ft))
    return res
ddf['sentence_vector'] = ddf.map_partitions(func)
ddf.compute(scheduler='processes')
text sentence_vector
0 this is a test sentence [-0.01934033, 0.03729743, -0.04679677, -0.0603...
1 None [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2 this is another one. [-0.0025579212, 0.0353713, -0.027139299, -0.05...
3 one more [-0.014522496, 0.10396308, -0.13107553, -0.198...
Note that this nested data structure (a list in a column) is probably not the optimal way to handle these vectors, but that will depend on your use case. Also, there is probably a way to do this computation in batches using fastText rather than one row at a time (in Python), but I'm not well versed in the nuances of fastText.
I had the same problem, but I found a solution using Multiprocessing - Python's Standard Library.
First step - wrap
model = fasttext.load_model(file_name_model)
def get_vec(txt):
    '''
    First tried to put model.get_sentence_vector into map (look below), but it resulted in pickle error.
    This works, lol.
    '''
    return model.get_sentence_vector(txt)
Then, I'm doing this:
from multiprocessing import Pool
text = ["How to sell drugs (fast)", "House of Cards", "The Crown"]
with Pool(40) as p:  # I have 40 cores
    result = p.map(get_vec, text)
With 40 cores processing 10M short texts took me ~80s.
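A variation on the wrapper above is to load the model once per worker through the Pool initializer, so that nothing unpicklable ever has to cross the process boundary. The sketch below is only an illustration under assumptions: FakeModel is a hypothetical stand-in that cannot be pickled (it holds a lock), and in real code init_worker would call fasttext.load_model instead.

```python
import threading
from multiprocessing import Pool


class FakeModel:
    """Hypothetical stand-in for a fastText model; the lock makes it unpicklable."""

    def __init__(self, dim=4):
        self.dim = dim
        self._lock = threading.Lock()

    def get_sentence_vector(self, text):
        # Toy "embedding": the text length repeated dim times.
        return [float(len(text))] * self.dim


_model = None  # each worker process gets its own copy


def init_worker():
    """Runs once in every worker; real code would load the fastText model here."""
    global _model
    _model = FakeModel()


def get_vec(text):
    """Only the text and the resulting list are pickled, never the model."""
    return _model.get_sentence_vector(text)


def embed_all(texts, processes=2):
    with Pool(processes, initializer=init_worker) as pool:
        return pool.map(get_vec, texts)
```

Because get_vec is a module-level function, it is pickled by name, and each worker resolves the name against its own _model. When run from a script, embed_all should be called under an if __name__ == '__main__' guard.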

multiprocessing.pool on windows/jupyter [duplicate]

Jupyter Notebook
I am using the multiprocessing module and am still learning its capabilities. I am following the book by Dusty Phillips, and this code comes from it.
import multiprocessing
import random
from multiprocessing.pool import Pool
def prime_factor(value):
    factors = []
    for divisor in range(2, value-1):
        quotient, remainder = divmod(value, divisor)
        if not remainder:
            factors.extend(prime_factor(divisor))
            factors.extend(prime_factor(quotient))
            break
    else:
        factors = [value]
    return factors
if __name__ == '__main__':
    pool = Pool()
    to_factor = [random.randint(100000, 50000000) for i in range(20)]
    results = pool.map(prime_factor, to_factor)
    for value, factors in zip(to_factor, results):
        print("The factors of {} are {}".format(value, factors))
On the Windows PowerShell (not on jupyter notebook) I see the following
Process SpawnPoolWorker-5:
Process SpawnPoolWorker-1:
AttributeError: Can't get attribute 'prime_factor' on <module '__main__' (built-in)>
I do not know why the cell never finishes running.
It seems that this behavior in Jupyter Notebook, as in some other IDEs, is a design feature. On Windows, worker processes are spawned and must import the target function from a module, and a function defined in a notebook cell lives in __main__, which the workers cannot import. Therefore, we have to write the function (prime_factor) in a separate file and import it as a module, and adjust the calling code accordingly. For example, in my case, I put the function in a file named defs.py
def prime_factor(value):
    factors = []
    for divisor in range(2, value-1):
        quotient, remainder = divmod(value, divisor)
        if not remainder:
            factors.extend(prime_factor(divisor))
            factors.extend(prime_factor(quotient))
            break
    else:
        factors = [value]
    return factors
Then in the jupyter notebook I wrote the following lines
import multiprocessing
import random
from multiprocessing import Pool
import defs
if __name__ == '__main__':
    pool = Pool()
    to_factor = [random.randint(100000, 50000000) for i in range(20)]
    results = pool.map(defs.prime_factor, to_factor)
    for value, factors in zip(to_factor, results):
        print("The factors of {} are {}".format(value, factors))
This solved my problem
To execute a function without having to write it into a separated file manually:
We can dynamically write the task to process into a temporary file, import it and execute the function.
from multiprocessing import Pool
from functools import partial
import inspect
def parallel_task(func, iterable, *params):
    # Write the function's source to a module so that worker processes can import it
    with open('./tmp_func.py', 'w') as file:
        file.write(inspect.getsource(func).replace(func.__name__, "task"))
    from tmp_func import task
    if __name__ == '__main__':
        func = partial(task, params)
        pool = Pool(processes=8)
        res = pool.map(func, iterable)
        pool.close()
        return res
    else:
        # Raising a plain string is a TypeError in Python 3, so raise an exception object
        raise RuntimeError("Not in Jupyter Notebook")
We can then simply call it in a notebook cell like this:
def long_running_task(params, id):
    # Heavy job here
    return params, id
data_list = range(8)
for res in parallel_task(long_running_task, data_list, "a", 1, "b"):
    print(*res)  # unpack (params, id) so each part prints separately
Output:
('a', 1, 'b') 0
('a', 1, 'b') 1
('a', 1, 'b') 2
('a', 1, 'b') 3
('a', 1, 'b') 4
('a', 1, 'b') 5
('a', 1, 'b') 6
('a', 1, 'b') 7
Note: If you're using Anaconda and if you want to see the progress of the heavy task, you can use print() inside long_running_task(). The content of the print will be displayed in the Anaconda Prompt console.
Strictly speaking, Python multiprocessing isn't supported in Jupyter Notebook on Windows, even if the if __name__ == "__main__" guard is added.
One workaround on Windows 10 is to connect the Windows browser to a Jupyter server running in WSL.
You then get the same experience as on Linux.
You can set it up manually or refer to the script at https://github.com/mszhanyi/gemini
Another option: use dask, which plays nicely with Jupyter. Even if you don't need any of dask special data structures, you can use it simply to control multiple processes.
To handle the many quirks of getting multiprocess to play nice in Jupyter session, I've created a library mpify which allows one-time, multiprocess function executions, and passing things from the notebook to the subprocess with a simple API.
The Jupyter shell process itself can participate as a worker process. User can choose to gather results from all workers, or just one of them.
Here it is:
https://github.com/philtrade/mpify
Under the hood, it uses multiprocess -- an actively supported fork from the standard python multiprocessing library -- to allow locally defined variables/functions in the notebook, to be accessible in the subprocesses. It also uses the spawn start method, which is necessary if the subprocesses are to use multiple GPUs, an increasingly common use case. It uses Process() not Pool(), from the multiprocess API.
User can supply a custom context manager to acquire resources, setup/tear down execution environment surrounding the function execution. I've provided a sample context manager to support PyTorch's distributed data parallel (DDP) set up, and many more examples of how to train fastai v2 in Jupyter on multiple GPUs using DDP.
Bug reports, PRs, use cases to share are all welcome.
By no means a fancy/powerful library, mpify only intends to support single-host/multiprocess kind of distributed setup, and simply spawn-execute-terminate. Nor does it support persistent pool of processes and fancy task scheduling -- ipyparallel or dask already does it.
I hope it can be useful to folks who are struggling with Jupyter + multiprocessing, and possibly with multiple GPUs as well. Thanks.

Dask: How would I parallelize my code with dask delayed?

This is my first venture into parallel processing and I have been looking into Dask but I am having trouble actually coding it.
I have had a look at their examples and documentation and I think dask.delayed will work best. I attempted to wrap my functions with delayed(function_name), or add a @delayed decorator, but I can't seem to get it working properly. I preferred Dask over other methods since it is written in Python and for its (supposed) simplicity. I know dask doesn't work on the for loop itself, but they say it can work inside a loop.
My code passes files through a function that contains inputs to other functions and looks like this:
from dask import delayed
filenames = ['1.csv', '2.csv', '3.csv', ...]  # etc.
for count, name in enumerate(filenames):
    name = name.split('.')[0]
    ....
then do some pre-processing, e.g.:
    pre_result1, pre_result2 = delayed(read_files_and_do_some_stuff)(name)
then I call a constructor and pass the pre-processing results into the function calls:
    fc = FunctionCalls()
    Daily = delayed(fc.function_runs)(filename=name, stringinput='Daily',
                                      input_data=pre_result1, model1=pre_result2)
What I do here is pass each file into the for loop, do some pre-processing, and then pass the file into two models.
Thoughts or tips on how to parallelize this? I began getting odd errors and I had no idea how to fix the code. The code does work as is. I use a bunch of pandas dataframes, series, and numpy arrays, and I would prefer not to go back and change everything to work with dask.dataframes etc.
The code in my comment may be difficult to read. Here it is in a more formatted way.
In the code below, when I type print(mean_squared_error) I just get: Delayed('mean_squared_error-3009ec00-7ff5-4865-8338-1fec3f9ed138')
from dask import delayed
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = ['file1.csv']
for count, name in enumerate(filenames):
    file1 = pd.read_csv(name)
    df = pd.DataFrame(file1)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = delayed(mse)(observed, prediction)
You need to call dask.compute to eventually compute the result. See dask.delayed documentation.
Sequential code
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
results = []
for count, name in enumerate(filenames):
    file1 = pd.read_csv(name)
    df = pd.DataFrame(file1)  # isn't this already a dataframe?
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = mse(observed, prediction)
    results.append(mean_squared_error)
Parallel code
import dask
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
delayed_results = []
for count, name in enumerate(filenames):
    df = dask.delayed(pd.read_csv)(name)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = dask.delayed(mse)(observed, prediction)
    delayed_results.append(mean_squared_error)
results = dask.compute(*delayed_results)
A much clearer solution, IMO, than the accepted answer is this snippet.
from dask import compute, delayed
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
def compute_mse(file_name):
    df = pd.read_csv(file_name)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    return mse(observed, prediction)
delayed_results = [delayed(compute_mse)(file_name) for file_name in filenames]
mean_squared_errors = compute(*delayed_results, scheduler="processes")
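To see why printing the result in the question shows a Delayed object rather than a number, here is a small self-contained sketch of the same pattern. It assumes in-memory lists in place of the CSV files and a plain-Python MSE in place of scikit-learn; series_by_file and lagged_mse are hypothetical names used only for illustration.

```python
from dask import compute, delayed

# Hypothetical stand-in for the 'Close' column of each CSV file.
series_by_file = {
    "file1.csv": [1.0, 2.0, 4.0],
    "file2.csv": [3.0, 3.0, 3.0],
}


def lagged_mse(closes):
    """MSE between each value and the value before it, as in the question."""
    prediction = closes[:-1]
    observed = closes[1:]
    return sum((o - p) ** 2 for o, p in zip(observed, prediction)) / len(observed)


# Wrapping the call only records a task; nothing has been computed yet.
lazy = [delayed(lagged_mse)(closes) for closes in series_by_file.values()]
print(lazy[0])  # a Delayed placeholder, like the one printed in the question

# compute() runs all recorded tasks and returns their results as a tuple.
results = compute(*lazy)
print(results)  # (2.5, 0.0)
```

Collecting all the Delayed objects first and calling compute once lets Dask schedule the tasks together, which is where any parallelism comes from.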
