How do you cache results processed by a celery application - python-3.x

I want to find out whether it is possible to cache results sent to a celery application. For example, suppose I have a basic celery application which looks like:
tasks.py
import os
from celery import Celery
import time
redis_url = 'redis://localhost:6379'
app = Celery('tasks', backend=redis_url, broker=redis_url)
@app.task
def add(x, y):
    time.sleep(10)  # Added to see whether caching occurs.
    return x + y
If I call the function add twice with the same arguments, like this:
from tasks import add
add.delay(10, 10)
add.delay(10, 10)
Will the second call return faster, perhaps because the input/output pair ((10, 10), 20) is already available in the result backend? If not, how do I achieve this behaviour?
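For reference, Celery's result backend stores results keyed by task id rather than by task arguments, so the second call above is not expected to return faster out of the box. Below is a minimal sketch of one way to add argument-keyed caching inside the task, added to the tasks.py snippet above and assuming the same local Redis instance is reachable at redis_url (the key format and the cached_add name are illustrative, not part of Celery):
import json
import redis
cache = redis.Redis.from_url(redis_url)
@app.task
def cached_add(x, y):
    key = 'add:' + json.dumps([x, y])
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)          # cache hit: skip the slow computation
    time.sleep(10)                      # simulate the expensive work
    result = x + y
    cache.set(key, json.dumps(result))  # remember the result for identical calls
    return result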

Related

How do I suppress the output from log_step while running a Jupyter notebook?

On this page, there is a log_step function that is used to record what each step in a pandas pipeline is doing. The exact function is:
from functools import wraps
import datetime as dt
def log_step(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        tic = dt.datetime.now()
        result = func(*args, **kwargs)
        time_taken = str(dt.datetime.now() - tic)
        print(f"just ran step {func.__name__} shape={result.shape} took {time_taken}s")
        return result
    return wrapper
and it is used in the following fashion:
import pandas as pd
df = pd.read_csv('https://calmcode.io/datasets/bigmac.csv')
@log_step
def start_pipeline(dataf):
    return dataf.copy()
@log_step
def set_dtypes(dataf):
    return (dataf
            .assign(date=lambda d: pd.to_datetime(d['date']))
            .sort_values(['currency_code', 'date']))
My question is: how do I keep @log_step in front of my functions so that I can use it at will, while making the output of @log_step suppressed by default when I run my Jupyter notebook? I suspect the answer comes down to something more general about using decorators, but I don't really know what to look for. Thanks!
You can indeed remove the print statement or, if you would rather not alter the decorator, redirect sys.stdout so the prints are not shown, as explained here.
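For example, a minimal sketch of the redirection approach using contextlib, applied to the pipeline from the question (the result variable name is illustrative):
import io
from contextlib import redirect_stdout

# Run the decorated steps with their printed log lines swallowed;
# remove the redirect_stdout context to see the @log_step output again.
with redirect_stdout(io.StringIO()):
    clean_df = (df
                .pipe(start_pipeline)
                .pipe(set_dtypes))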

Multiprocessing getting stuck with ARMAX while refitting

I am trying to train multiple time series models using the below code in Jupyter Notebook.
import statsmodels.api as sm
import multiprocessing
import tqdm
train_dict = dict() # A dictionary of dataframes
test_dict = dict() # A dictionary of dataframes
def train_arma(key):
    endog = list(train_dict[key].endog)
    exog = list(train_dict[key].exog)
    fut_endog = list(train_dict[key].endog)
    fut_exog = list(test_dict[key].exog)
    model = sm.tsa.arima.ARIMA(endog, order=(2, 0, 2), exog=exog,
                               enforce_stationarity=False,
                               enforce_invertibility=False).fit()
    predictions = list()
    yhat = model.forecast(exog=[fut_exog[0]])[0]
    predictions.append(yhat)
    for i in tqdm.tqdm_notebook(range(len(fut_endog))[:-1]):
        model = model.append([fut_endog[i]], exog=[fut_exog[i]], refit=True)  # code gets stuck here
        predictions.append(model.forecast(exog=[fut_exog[i + 1]]))
    return predictions
secs = list(train_dict.keys())
p = multiprocessing.Pool(10)
output = p.map(train_arma, secs)
p.terminate()
When len(endog) == 1006, the code keeps getting stuck on the 17th iteration of the for loop. If I shorten endog by 20 values, it gets stuck on the 37th iteration instead.
There are some other things I have tried already:
Passing dataframes directly instead of letting the function access train_dict and test_dict from the outer scope.
Reducing the number of maximum processes in multiprocessing.
Shuffling my input list.
Defining a new class instance in the for loop while appending the values from fut_endog and fut_exog lists in endog and exog lists respectively.
I ran top in my Linux terminal and observed the CPU usage while the processes were being created and executed. Initially, when the processes spawn, they use CPU; when they get stuck, their %CPU drops to 0.
There are some instances when the code does work:
When I call the function directly, without multiprocessing, it works. But using multiprocessing, even with processes=1, makes the code hang.
When I don't pass any exogenous variable and train a simple ARMA model it works.
I am using statsmodels v0.12.1 and Python 3.7.3. Thanks.
This issue appears to be due to using tqdm alongside multiprocessing.
https://github.com/tqdm/tqdm/issues/461 addresses this issue.
I resolved it by using
from tqdm import tqdm
tqdm.get_lock().locks = []
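For context, here is a sketch of how the workaround can sit next to the pool from the question. The exact placement is an assumption; the point is that tqdm's inherited write locks are cleared before the worker processes are started:
from tqdm import tqdm
import multiprocessing

tqdm.get_lock().locks = []  # drop tqdm's inherited write locks (workaround from tqdm issue 461)

if __name__ == '__main__':
    p = multiprocessing.Pool(10)
    output = p.map(train_arma, list(train_dict.keys()))
    p.close()
    p.join()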

Why does my Flask app work when executing using `python app.py` but not when using `heroku local web` or `flask run`?

I wrote a Flask-based web app that takes text from users and returns the probability that it is of a given classification (full script below). The app loads some of the trained models needed to make predictions before any requests are made. I am currently trying to deploy it on Heroku and experiencing some problems.
I am able to run it locally when I execute python ml_app.py. But when I use the Heroku CLI command heroku local web to try to run it locally to test before deployment, I get the following error
AttributeError: module '__main__' has no attribute 'tokenize'
This error is associated with loading a TF-IDF text vectorizer in the line
tfidf_model = joblib.load('models/tfidf_vectorizer_train.pkl')
I have imported the required function at the top of the script (from utils import tokenize) to ensure that it is loaded properly, and this works given that I can run the app with python ml_app.py. But for reasons I do not know, it doesn't load when I use heroku local web. It also doesn't work when I try to run it locally with the Flask CLI command flask run. Any idea why?
I admit that I do not have a good understanding of what is going on under the hood here (with respect to the web dev./deployment aspect of the code) so any explanation helps.
from flask import Flask, request, render_template
from sklearn.externals import joblib
from utils import tokenize # custom tokenizer required for tfidf model loaded in load_tfidf_model()
app = Flask(__name__)
models_directory = 'models'
@app.before_first_request
def nbsvm_models():
    global tfidf_model
    global logistic_identity_hate_model
    global logistic_insult_model
    global logistic_obscene_model
    global logistic_severe_toxic_model
    global logistic_threat_model
    global logistic_toxic_model
    tfidf_model = joblib.load('models/tfidf_vectorizer_train.pkl')
    logistic_identity_hate_model = joblib.load('models/logistic_identity_hate.pkl')
    logistic_insult_model = joblib.load('models/logistic_insult.pkl')
    logistic_obscene_model = joblib.load('models/logistic_obscene.pkl')
    logistic_severe_toxic_model = joblib.load('models/logistic_severe_toxic.pkl')
    logistic_threat_model = joblib.load('models/logistic_threat.pkl')
    logistic_toxic_model = joblib.load('models/logistic_toxic.pkl')
@app.route('/')
def my_form():
    return render_template('main.html')
@app.route('/', methods=['POST'])
def my_form_post():
    """
    Takes the comment submitted by the user, apply TFIDF trained vectorizer to it, predict using trained models
    """
    text = request.form['text']
    comment_term_doc = tfidf_model.transform([text])
    dict_preds = {}
    dict_preds['pred_identity_hate'] = logistic_identity_hate_model.predict_proba(comment_term_doc)[:, 1][0]
    dict_preds['pred_insult'] = logistic_insult_model.predict_proba(comment_term_doc)[:, 1][0]
    dict_preds['pred_obscene'] = logistic_obscene_model.predict_proba(comment_term_doc)[:, 1][0]
    dict_preds['pred_severe_toxic'] = logistic_severe_toxic_model.predict_proba(comment_term_doc)[:, 1][0]
    dict_preds['pred_threat'] = logistic_threat_model.predict_proba(comment_term_doc)[:, 1][0]
    dict_preds['pred_toxic'] = logistic_toxic_model.predict_proba(comment_term_doc)[:, 1][0]
    for k in dict_preds:
        perc = dict_preds[k] * 100
        dict_preds[k] = "{0:.2f}%".format(perc)
    return render_template('main.html', text=text,
                           pred_identity_hate=dict_preds['pred_identity_hate'],
                           pred_insult=dict_preds['pred_insult'],
                           pred_obscene=dict_preds['pred_obscene'],
                           pred_severe_toxic=dict_preds['pred_severe_toxic'],
                           pred_threat=dict_preds['pred_threat'],
                           pred_toxic=dict_preds['pred_toxic'])
if __name__ == '__main__':
    app.run(debug=True)
Fixed it. It was due to the way I pickled the class instance stored in tfidf_vectorizer_train.pkl. The model was created in an IPython notebook, and one of its attributes depended on a tokenizer function that I defined interactively in the notebook. I soon learned that pickling does not save the full definition of everything attached to an instance, which means tfidf_vectorizer_train.pkl does not contain the tokenizer function I defined in the notebook.
To fix this, I moved the tokenizer function to a separate utilities python file and imported the function in both the file where I trained and subsequently pickled the model and in the file where I unpickled it.
In code, I did
from utils import tokenize
...
tfidfvectorizer = TfidfVectorizer(ngram_range=(1, 2), tokenizer=tokenize,
                                  min_df=3, max_df=0.9, strip_accents='unicode',
                                  use_idf=1, smooth_idf=True, sublinear_tf=1)
train_term_doc = tfidfvectorizer.fit_transform(train[COMMENT])
joblib.dump(tfidfvectorizer, 'models/tfidf_vectorizer_train.pkl')
...
in the file where I trained the model and
from utils import tokenize
...
@app.before_first_request
def load_models():
    # from utils import tokenize
    global tfidf_model
    tfidf_model = joblib.load('{}/tfidf_vectorizer_train.pkl'.format(models_directory))
...
in the file containing the web app code.
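A short illustration of why moving the tokenizer helps: pickle stores only a reference to a function (its module and qualified name), not its source, so whatever process unpickles the vectorizer must be able to import that reference. A minimal sketch, independent of Flask or Heroku:
import pickle
from utils import tokenize  # module-level function in utils.py

payload = pickle.dumps(tokenize)   # stores the reference 'utils.tokenize'
restored = pickle.loads(payload)   # re-imports utils and looks the name up
print(restored is tokenize)        # True
# Had tokenize been defined interactively in a notebook, its module would be
# '__main__', and a server process started by gunicorn or `flask run` could
# not resolve that reference, raising the AttributeError seen above.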

multiprocessing.pool on windows/jupyter [duplicate]

Jupyter Notebook
I am basically using the multiprocessing module, and I am still learning its capabilities. I am using the book by Dusty Phillips, and this code comes from it.
import multiprocessing
import random
from multiprocessing.pool import Pool
def prime_factor(value):
    factors = []
    for divisor in range(2, value - 1):
        quotient, remainder = divmod(value, divisor)
        if not remainder:
            factors.extend(prime_factor(divisor))
            factors.extend(prime_factor(quotient))
            break
    else:
        factors = [value]
    return factors
if __name__ == '__main__':
    pool = Pool()
    to_factor = [random.randint(100000, 50000000) for i in range(20)]
    results = pool.map(prime_factor, to_factor)
    for value, factors in zip(to_factor, results):
        print("The factors of {} are {}".format(value, factors))
In Windows PowerShell (not in the Jupyter notebook) I see the following:
Process SpawnPoolWorker-5:
Process SpawnPoolWorker-1:
AttributeError: Can't get attribute 'prime_factor' on <module '__main__' (built-in)>
I do not understand why the cell never finishes running.
It seems the problem in Jupyter notebook, as in some other IDEs, is a design feature: we have to write the function (prime_factor) in a separate file and import that module. Furthermore, we have to take care of the adjustments. For example, in my case, I have put the function into a file called defs.py:
def prime_factor(value):
    factors = []
    for divisor in range(2, value - 1):
        quotient, remainder = divmod(value, divisor)
        if not remainder:
            factors.extend(prime_factor(divisor))
            factors.extend(prime_factor(quotient))
            break
    else:
        factors = [value]
    return factors
Then, in the Jupyter notebook, I wrote the following lines:
import multiprocessing
import random
from multiprocessing import Pool
import defs
if __name__ == '__main__':
    pool = Pool()
    to_factor = [random.randint(100000, 50000000) for i in range(20)]
    results = pool.map(defs.prime_factor, to_factor)
    for value, factors in zip(to_factor, results):
        print("The factors of {} are {}".format(value, factors))
This solved my problem
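As a small convenience, if you would rather not maintain defs.py by hand, Jupyter's %%writefile cell magic can generate it from a notebook cell (this is an IPython feature, shown here only as a sketch):
%%writefile defs.py
# This cell is written to defs.py instead of being executed, so the pool's
# worker processes can import prime_factor from a real module.
def prime_factor(value):
    factors = []
    for divisor in range(2, value - 1):
        quotient, remainder = divmod(value, divisor)
        if not remainder:
            factors.extend(prime_factor(divisor))
            factors.extend(prime_factor(quotient))
            break
    else:
        factors = [value]
    return factors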
To execute a function without having to write it to a separate file manually: we can dynamically write the task to a temporary file, import it, and execute the function.
from multiprocessing import Pool
from functools import partial
import inspect
def parallel_task(func, iterable, *params):
    with open('./tmp_func.py', 'w') as file:
        file.write(inspect.getsource(func).replace(func.__name__, "task"))
    from tmp_func import task
    if __name__ == '__main__':
        func = partial(task, params)
        pool = Pool(processes=8)
        res = pool.map(func, iterable)
        pool.close()
        return res
    else:
        raise RuntimeError("Not in Jupyter Notebook")
We can then simply call it in a notebook cell like this:
def long_running_task(params, id):
    # Heavy job here
    return params, id
data_list = range(8)
for res in parallel_task(long_running_task, data_list, "a", 1, "b"):
    print(res)
Output:
('a', 1, 'b') 0
('a', 1, 'b') 1
('a', 1, 'b') 2
('a', 1, 'b') 3
('a', 1, 'b') 4
('a', 1, 'b') 5
('a', 1, 'b') 6
('a', 1, 'b') 7
Note: If you're using Anaconda and if you want to see the progress of the heavy task, you can use print() inside long_running_task(). The content of the print will be displayed in the Anaconda Prompt console.
Strictly speaking, Python multiprocessing isn't supported in Jupyter Notebook on Windows, even if the __name__ == "__main__" guard is added.
One workaround in Windows 10 is to connect a Windows browser to a Jupyter server running in WSL, which gives you the same experience as on Linux.
You can set it up manually or use the script at https://github.com/mszhanyi/gemini
Another option: use dask, which plays nicely with Jupyter. Even if you don't need any of dask's special data structures, you can use it simply to control multiple processes.
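A minimal sketch of that idea, assuming dask.distributed is installed; dask serializes notebook-defined functions with cloudpickle, which is why this works where multiprocessing.Pool does not:
from dask.distributed import Client

client = Client(n_workers=4)                    # starts local worker processes
futures = client.map(prime_factor, to_factor)   # prime_factor can live in the notebook
results = client.gather(futures)
client.close()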
To handle the many quirks of getting multiprocess to play nicely in a Jupyter session, I've created a library, mpify, which allows one-time multiprocess function executions and lets you pass things from the notebook to the subprocesses with a simple API.
The Jupyter shell process itself can participate as a worker process. Users can choose to gather results from all workers, or from just one of them.
Here it is:
https://github.com/philtrade/mpify
Under the hood, it uses multiprocess -- an actively supported fork of the standard Python multiprocessing library -- to allow locally defined variables and functions in the notebook to be accessible in the subprocesses. It also uses the spawn start method, which is necessary if the subprocesses are to use multiple GPUs, an increasingly common use case. It uses Process(), not Pool(), from the multiprocess API.
Users can supply a custom context manager to acquire resources and to set up and tear down the execution environment around the function call. I've provided a sample context manager to support PyTorch's distributed data parallel (DDP) setup, along with many more examples of how to train fastai v2 in Jupyter on multiple GPUs using DDP.
Bug reports, PRs, use cases to share are all welcome.
By no means a fancy or powerful library, mpify only intends to support single-host, multiprocess distributed setups, and simply spawn-execute-terminate. It does not support a persistent pool of processes or fancy task scheduling -- ipyparallel and dask already do that.
I hope it can be useful to folks who are struggling with Jupyter + multiprocessing, and possibly with multiple GPUs as well. Thanks.

