Jupyter notebook never finishes processing using multiprocessing (Python 3) - python-3.x

Jupyter Notebook
I am using the multiprocessing module and am still learning its capabilities. I am working from the book by Dusty Phillips, and this code comes from it.
import multiprocessing
import random
from multiprocessing.pool import Pool
def prime_factor(value):
    factors = []
    for divisor in range(2, value-1):
        quotient, remainder = divmod(value, divisor)
        if not remainder:
            factors.extend(prime_factor(divisor))
            factors.extend(prime_factor(quotient))
            break
    else:
        factors = [value]
    return factors
if __name__ == '__main__':
    pool = Pool()
    to_factor = [ random.randint(100000, 50000000) for i in range(20)]
    results = pool.map(prime_factor, to_factor)
    for value, factors in zip(to_factor, results):
        print("The factors of {} are {}".format(value, factors))
In Windows PowerShell (not in the Jupyter notebook) I see the following:
Process SpawnPoolWorker-5:
Process SpawnPoolWorker-1:
AttributeError: Can't get attribute 'prime_factor' on <module '__main__' (built-in)>
Why does the cell never finish running?

It seems that this behavior is by design, in Jupyter Notebook as in some other IDEs: the worker processes cannot look up a function that exists only in the interactive __main__. Therefore, we have to write the function (prime_factor) into a separate file and import it as a module, then adjust the notebook code to call it through that module. For example, in my case, I put the function into a file named defs.py:
def prime_factor(value):
    factors = []
    for divisor in range(2, value-1):
        quotient, remainder = divmod(value, divisor)
        if not remainder:
            factors.extend(prime_factor(divisor))
            factors.extend(prime_factor(quotient))
            break
    else:
        factors = [value]
    return factors
Then in the Jupyter notebook I wrote the following lines:
import multiprocessing
import random
from multiprocessing import Pool
import defs
if __name__ == '__main__':
    pool = Pool()
    to_factor = [ random.randint(100000, 50000000) for i in range(20)]
    results = pool.map(defs.prime_factor, to_factor)
    for value, factors in zip(to_factor, results):
        print("The factors of {} are {}".format(value, factors))
This solved my problem.
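The traceback in the question hints at why moving the function helps: pickle sends a function to the workers by reference (its module and name), not by value, and a spawned worker cannot find a notebook-defined function in its own __main__. Here is a minimal sketch of that by-reference behavior. It uses the standard library's json.dumps as a stand-in for any importable function such as defs.prime_factor, and the helper name describe_pickled_function is made up for illustration:

```python
import json
import pickle


def describe_pickled_function(func):
    """Return the pickle payload for func plus the names it must contain."""
    payload = pickle.dumps(func)
    return payload, func.__module__, func.__qualname__


payload, module_name, func_name = describe_pickled_function(json.dumps)

# The payload stores only where to find the function, not its code.
print(module_name.encode() in payload)   # module name is embedded
print(func_name.encode() in payload)     # function name is embedded

# Unpickling re-imports the module and looks the name up again,
# which is the step that fails for functions living only in a notebook.
restored = pickle.loads(payload)
print(restored is json.dumps)
```

Because only the reference travels, the receiving process must be able to import the same module under the same name, which is what defs.py provides.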

To execute a function without having to write it into a separate file manually:
We can dynamically write the task to a temporary file, import it, and execute the function.
from multiprocessing import Pool
from functools import partial
import inspect
def parallel_task(func, iterable, *params):
    # Write the function's source to a module that the worker processes can import
    with open('./tmp_func.py', 'w') as file:
        file.write(inspect.getsource(func).replace(func.__name__, "task"))
    from tmp_func import task
    if __name__ == '__main__':
        func = partial(task, params)
        pool = Pool(processes=8)
        res = pool.map(func, iterable)
        pool.close()
        return res
    else:
        # Raising a plain string is not allowed in Python 3
        raise RuntimeError("Not in Jupyter Notebook")
We can then simply call it in a notebook cell like this:
def long_running_task(params, id):
    # Heavy job here
    return params, id
data_list = range(8)
for res in parallel_task(long_running_task, data_list, "a", 1, "b"):
    print(*res)  # unpack the (params, id) tuple so it prints as in the output
Output:
('a', 1, 'b') 0
('a', 1, 'b') 1
('a', 1, 'b') 2
('a', 1, 'b') 3
('a', 1, 'b') 4
('a', 1, 'b') 5
('a', 1, 'b') 6
('a', 1, 'b') 7
Note: If you're using Anaconda and want to see the progress of the heavy task, you can call print() inside long_running_task(). The printed output is displayed in the Anaconda Prompt console.
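One detail of parallel_task is worth spelling out: partial(task, params) binds the whole params tuple as the first positional argument, and pool.map then supplies each item of the iterable as the second. Here is a small sketch of that binding without any worker processes (item_id replaces the original id parameter only to avoid shadowing the built-in):

```python
from functools import partial


def long_running_task(params, item_id):
    # Stand-in for the heavy job: echo back what was received
    return params, item_id


# Same binding as partial(task, params) inside parallel_task
bound = partial(long_running_task, ("a", 1, "b"))

# pool.map(bound, range(3)) would make these same calls in worker processes
results = [bound(item_id) for item_id in range(3)]
for params, item_id in results:
    print(params, item_id)
```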

Strictly speaking, Python multiprocessing isn't supported in Jupyter Notebook on Windows, even if the if __name__ == "__main__" guard is added.
One workaround on Windows 10 is to connect the Windows browser to a Jupyter server running in WSL.
You then get the same experience as on Linux.
You can set it up manually or refer to the script at https://github.com/mszhanyi/gemini
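The difference between Windows and Linux/WSL here comes from the process start method. Windows only offers spawn, which starts a fresh interpreter that must re-import everything it needs. Linux also offers fork, which copies the notebook process, including functions defined in cells. A quick way to check what the current platform provides (the exact default depends on the OS and the Python version):

```python
import multiprocessing as mp
import sys

# All start methods this platform supports, e.g. only 'spawn' on Windows
available = mp.get_all_start_methods()

# The method a plain Pool() would use here
default = mp.get_start_method()

print("platform:", sys.platform)
print("available start methods:", available)
print("default start method:", default)
```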

Another option: use dask, which plays nicely with Jupyter. Even if you don't need any of dask's special data structures, you can use it simply to control multiple processes.

To handle the many quirks of getting multiprocessing to play nicely in a Jupyter session, I've created a library, mpify, which allows one-time, multiprocess function executions and passing things from the notebook to the subprocess with a simple API.
The Jupyter shell process itself can participate as a worker process. Users can choose to gather results from all workers, or from just one of them.
Here it is:
https://github.com/philtrade/mpify
Under the hood, it uses multiprocess -- an actively supported fork of the standard Python multiprocessing library -- to allow locally defined variables/functions in the notebook to be accessible in the subprocesses. It also uses the spawn start method, which is necessary if the subprocesses are to use multiple GPUs, an increasingly common use case. It uses Process(), not Pool(), from the multiprocess API.
Users can supply a custom context manager to acquire resources and to set up/tear down the execution environment surrounding the function execution. I've provided a sample context manager to support PyTorch's distributed data parallel (DDP) setup, and many more examples of how to train fastai v2 in Jupyter on multiple GPUs using DDP.
Bug reports, PRs, use cases to share are all welcome.
By no means a fancy/powerful library, mpify only intends to support a single-host/multiprocess kind of distributed setup, and simply spawn-execute-terminate. It does not support a persistent pool of processes or fancy task scheduling -- ipyparallel and dask already do that.
I hope it can be useful to folks who are struggling with Jupyter + multiprocessing, and possibly with multiple GPUs as well. Thanks.

Related

Multiprocessing getting stuck with ARMAX while refitting

I am trying to train multiple time series models using the below code in Jupyter Notebook.
import statsmodels.api as sm
import multiprocessing
import tqdm
train_dict = dict() # A dictionary of dataframes
test_dict = dict() # A dictionary of dataframes
def train_arma(key):
    endog = list(train_dict[key].endog)
    exog = list(train_dict[key].exog)
    fut_endog = list(train_dict[key].endog)
    fut_exog = list(test_dict[key].exog)
    model = sm.tsa.arima.ARIMA(endog, order=(2, 0, 2), exog=exog,
                               enforce_stationarity=False,
                               enforce_invertibility=False).fit()
    predictions = list()
    yhat = model.forecast(exog=[fut_exog[0]])[0]
    predictions.append(yhat)
    for i in tqdm.tqdm_notebook(range(len(fut_endog))[:-1]):
        model = model.append([fut_endog[i]], exog=[fut_exog[i]], refit=True) #code gets stuck here
        predictions.append(model.forecast(exog=[fut_exog[i+1]]))
    return predictions
secs = list(train_dict.keys())
p = multiprocessing.Pool(10)
output = p.map(train_arma, secs)
p.terminate()
When len(endog) == 1006, the code keeps getting stuck on the 17th iteration in the for loop. If I decrease the endog by 20, then it gets stuck on 37th iteration.
There are some other things I have tried already:
Passing dataframes directly instead of letting the function access train_dict and test_dict from the outer scope.
Reducing the maximum number of processes in multiprocessing.
Shuffling my input list.
Defining a new class instance in the for loop while appending the values from the fut_endog and fut_exog lists to the endog and exog lists respectively.
I ran top in my Linux terminal and observed the CPU usage while the processes were being created and executed. Initially, when the processes spawn, they use CPU, and when the processes get stuck, their %CPU drops to 0.
There are some instances when the code does work:
When I call the function directly, without multiprocessing, it works. But using multiprocessing even with processes = 1 makes the code stop.
When I don't pass any exogenous variable and train a simple ARMA model it works.
I am using statsmodels v0.12.1 and python version is 3.7.3. Thanks
This issue is most likely due to the use of tqdm alongside multiprocessing.
https://github.com/tqdm/tqdm/issues/461 addresses this issue.
I resolved it by using:
from tqdm import tqdm
tqdm.get_lock().locks = []

Why can't ipdb be imported when using multiprocessing

I am not trying to use ipdb with multiprocessing. I had it imported before I started adding multiprocessing features, and I couldn't figure out why the code wouldn't run. Here is a minimal example:
from ipdb import set_trace as st
import multiprocessing
def worker(instructions):
    return "good boi"
pool = multiprocessing.Pool(4)
results = [pool.apply(worker, args=("woof", )) for _ in range(3)]
pool.close()
If you comment out the first line it runs, otherwise it prints a cryptic error message about failing to pickle worker. I don't need ipdb, but why does this happen?

Parsing parameters via FLAGS to Tensorflow app to launch parallel jobs

I have a runnable main.py file built with TensorFlow flags for parsing args. It works just fine and resembles common TensorFlow tutorials:
import tensorflow as tf
flags = tf.flags
FLAGS = flags.FLAGS
flags.DEFINE_string('foo1', "", 'doc foo1')
flags.DEFINE_integer('foo2', 0, 'doc foo2')
flags.DEFINE_integer('foo3', 0, 'doc foo3')
def main(_):
    pass  # do something using FLAGS
if __name__ == '__main__':
    tf.app.run(main)
Now I would like to launch this main in parallel, passing different parameters as flags. I tried creating a different script, main_all.py, which is supposed to launch main with different parameters, but I can't figure out the proper way to pass the parameters. This is what I've tried:
from functools import partial
from multiprocessing import Pool
import tensorflow as tf
from main import main
app_run = partial(tf.app.run, main=main, foo1="foo1")
foo2 = [1, 2]
foo3 = [3, 4]
if __name__ == '__main__':
    with Pool(2) as pool:
        pool.map(app_run, zip(foo2,foo3,))
This doesn't work, presumably because FLAGS are created only when main is imported and not when tf.app.run is called, so the additional args are never parsed into flags.
Is there a clean workaround, or is it better to use a bash script?
Any ideas on how to solve it?
thanks in advance
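One way to sidestep the FLAGS problem, sketched here as an assumption and not an established answer, is to start each run as its own Python process and pass the values on the command line in --name=value form, so that every process parses its own flags. The sketch relies only on the standard library. The script name main.py and the flag names come from the question, while build_commands and launch_all are hypothetical helpers:

```python
import subprocess
import sys


def build_commands(script, foo1, foo2_values, foo3_values):
    """Build one command line per (foo2, foo3) pair."""
    return [
        [sys.executable, script, f"--foo1={foo1}", f"--foo2={foo2}", f"--foo3={foo3}"]
        for foo2, foo3 in zip(foo2_values, foo3_values)
    ]


def launch_all(commands):
    """Start every command at once, then wait for all of them to finish."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]


commands = build_commands("main.py", "foo1", [1, 2], [3, 4])
for cmd in commands:
    print(" ".join(cmd))

# Uncomment where main.py exists; it returns one exit code per run:
# exit_codes = launch_all(commands)
```

This is the same idea as a bash script that loops over the parameter pairs, kept in Python so that the parameter lists stay next to the launching code.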

How to distribute classes with PySpark and Jupyter

I have an annoying problem using jupyter notebook with spark.
I need to define a custom class inside the notebook and use it to perform some map operations
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark import SQLContext
conf = SparkConf().setMaster("spark://192.168.10.11:7077")\
    .setAppName("app_jupyter/")\
    .set("spark.cores.max", "10")
sc = SparkContext(conf=conf)
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
class demo(object):
    def __init__(self, value):
        self.test = value + 10
        pass
distData.map(lambda x : demo(x)).collect()
It gives the following error:
PicklingError: Can't pickle <class '__main__.demo'>: attribute lookup __main__.demo failed
I know what this error is about, but I couldn't figure out a solution.
I have tried:
Defining a demo.py Python file outside the notebook. It works, but it is such an ugly solution...
Creating a dynamic module like this, and then importing it afterwards. This gives the same error.
What would be a solution? I want everything to work in the same notebook.
Is it possible to change something in:
The way spark works, maybe some pickle configuration
Something in the code... Use some static magic approach
There is no reliable and elegant workaround here, and this behavior is not particularly related to Spark. This is all about the fundamental design of Python's pickle:
pickle can save and restore class instances transparently, however the class definition must be importable and live in the same module as when the object was stored.
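The quoted constraint can be reproduced without Spark. In this sketch the class reports a module that cannot be imported (the name not_an_importable_module is made up), which is roughly comparable to what happens for a class that exists only in the notebook's __main__:

```python
import pickle

# Build a class whose __module__ points at a module that does not exist
Demo = type("Demo", (object,), {"__module__": "not_an_importable_module"})

try:
    pickle.dumps(Demo())
    outcome = "pickled"
except pickle.PicklingError as exc:
    # pickle stores the class by reference and cannot resolve that reference
    outcome = "PicklingError"
    print(exc)

print(outcome)
```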
Theoretically you could define a custom cell magic which would:
Write the content of a cell to a module.
Import it.
Call SparkContext.addPyFile to distribute the module.
from IPython.core.magic import register_cell_magic
import importlib
@register_cell_magic
def spark_class(line, cell):
    module = line.strip()
    f = "{0}.py".format(module)
    with open(f, "w") as fw:
        fw.write(cell)
    globals()[module] = importlib.import_module(module)
    sc.addPyFile(f)
In [2]: %%spark_class foo
   ...: class Foo(object):
   ...:     def __init__(self, x):
   ...:         self.x = x
   ...:     def __repr__(self):
   ...:         return "Foo({0})".format(self.x)
   ...:
In [3]: sc.parallelize([1, 2, 3]).map(lambda x: foo.Foo(x)).collect()
Out[3]: [Foo(1), Foo(2), Foo(3)]
But it is a one-time deal. Once a file is marked for distribution, it cannot be changed or redistributed. Moreover, there is the problem of reloading imports on remote hosts. I can think of some more elaborate schemes, but this is simply more trouble than it is worth.
The answer from zero323 is solid: there's no one "right" way to solve this problem. You could indeed use Jupyter magic, as proposed. One other way is to use Jupyter's %%writefile to have your code inline in a Jupyter cell but to then write it to disk as a python file. Then you can both import the file to your Jupyter kernel session as well as ship it with your PySpark job (via addPyFile() as noted in the other answer). Note that if you make changes to the code but don't restart your PySpark session, you'll have to get the updated code to PySpark somehow.
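Outside of notebook magics, the write-then-import step described above amounts to the following plain-Python sketch. It writes source text to a module file, imports it, and checks that an instance survives a pickle round trip because its class is now importable. The helper write_and_import, the module name zebra_demo_module, and the temporary directory are illustrative assumptions. Shipping the file to Spark executors (addPyFile) is not shown:

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

SOURCE = '''\
class Zebra(object):
    def __init__(self, name):
        self.name = name
'''


def write_and_import(module_name, source, directory):
    """Write source to <directory>/<module_name>.py and import it."""
    Path(directory, f"{module_name}.py").write_text(source)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    importlib.invalidate_caches()  # make sure the new file is noticed
    return importlib.import_module(module_name)


workdir = tempfile.mkdtemp()
zebra_module = write_and_import("zebra_demo_module", SOURCE, workdir)

# The class now lives in a real module, so pickle can find it again on load
original = zebra_module.Zebra(name="marty")
restored = pickle.loads(pickle.dumps(original))
print(type(restored).__module__, restored.name)
```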
Can we make this easier? I wrote a blogpost about this topic as well as a PySpark Session wrapper (oarphpy.spark.NBSpark) to help automate a lot of the tricky stuff. See the Jupyter Notebook embedded in that post for a working example. The overall pattern looks like this:
import os
import sys
CUSTOM_LIB_SRC_DIR = '/tmp/'
os.chdir(CUSTOM_LIB_SRC_DIR)
!mkdir -p mymodule
!touch mymodule/__init__.py
%%writefile mymodule/foo.py
class Zebra(object):
    def __init__(self, name):
        self.name = name
sys.path.append(CUSTOM_LIB_SRC_DIR)
from mymodule.foo import Zebra
# Create Zebra() instances in the notebook
herd = [Zebra(name=str(i)) for i in range(10)]
# Now send those instances to PySpark!
from oarphpy.spark import NBSpark
NBSpark.SRC_ROOT = os.path.join(CUSTOM_LIB_SRC_DIR, 'mymodule')
spark = NBSpark.getOrCreate()
rdd = spark.sparkContext.parallelize(herd)
def get_name(z):
    return z.name
names = rdd.map(get_name).collect()
Additionally, if you make any changes to the mymodule files on disk (via %%writefile or otherwise), then NBSpark will automatically ship those changes to the active PySpark session.

Resources