For some reason spark repartition is assigning the exact same yarn container to the every element of the rdd. I do not know what could be the possible reason. The intriguing part is if I run the same code twice without restarting the session it is now able to partition the data properly and I see distribution over all the containers. Could you please help me understand the behavior?
I am using the following session:
import socket
import os
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
config("spark.dynamicAllocation.enabled",False).\
config("spark.executor.cores","3").\
config("spark.executor.instances","5").\
config("spark.executor.memory","6g").\
config("spark.sql.adaptive.enabled", False).\
getOrCreate()
And, the following code:
df = spark.sparkContext.parallelize(range(240000)).repartition(4)
def f(x):
return os.getenv("CONTAINER_ID"), socket.gethostname()
df = df.map(f)
[set(i) for i in df.glom().collect()]
output:
[{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')}]
Use the exact same code again without restarting the session:
df = spark.sparkContext.parallelize(range(2400000)).repartition(4)
def f(x):
return os.getenv("CONTAINER_ID"), socket.gethostname()
df = df.map(f)
[set(i) for i in df.glom().collect()]
output:
[{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000004', 'monsoon-spark-w-0')},
{('container_1676564785882_0047_01_000005', 'monsoon-spark-sw-ppqw')},
{('container_1676564785882_0047_01_000001', 'monsoon-spark-sw-m2t7')}]
Here is a snapshot for the same
On this page, there is this log_step function which is being used to record what each step in a pandas pipeline is doing. The exact function is:
from functools import wraps
import datetime as dt
def log_step(func):
#wraps(func)
def wrapper(*args, **kwargs):
tic = dt.datetime.now()
result = func(*args, **kwargs)
time_taken = str(dt.datetime.now() - tic)
print(f"just ran step {func.__name__} shape={result.shape} took {time_taken}s")
return result
return wrapper
and it is used in the following fashion:
import pandas as pd
df = pd.read_csv('https://calmcode.io/datasets/bigmac.csv')
#log_step
def start_pipeline(dataf):
return dataf.copy()
#log_step
def set_dtypes(dataf):
return (dataf
.assign(date=lambda d: pd.to_datetime(d['date']))
.sort_values(['currency_code', 'date']))
My question is: how do I keep the #log_step in front of my functions and be able to use them at will, while setting the results of #log_step by default, to not be outputed when I run my Jupyter notebook? I suspect the answer comes down to something more general about using decorators but I don't really know what to look for. Thanks!
You can indeed remove the print statement or, if you want to not alter the decorating function, you can as well redirect the sys to avoid seeing the prints, as explained here.
I am doing a seldon deployment. I have created custom pipelines using sklearn and it is in the directory MyPipelines/CustomPipelines.py. The main code ie. my_prediction.py is the file which seldon will execute by default (based on my configuration). In this file I am importing the custom pipelines. If I execute my_prediction.py in my local (PyCharm) it executes fine. But If I deploy it using Seldon I get the error : Attribute Error: Can't get Attribute 'MyEncoder'
It is unable to load modules in CustomPipelines.py. I tried all the solutions from Unable to load files using pickle and multiple modules None of them worked.
MyPipelines/CustomPipelines.py
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
class MyEncoder(BaseEstimator, TransformerMixin):
def __init__(self):
super().__init__()
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
df = X
vars_cat = [var for var in df.columns if df[var].dtypes == 'O']
cat_with_na = [var for var in vars_cat if df[var].isnull().sum() > 0]
df[cat_with_na] = df[cat_with_na].fillna('Missing')
return df
my_prediction.py
import pickle
import pandas as pd
import dill
from MyPipelines.CustomPipelines import MyEncoder
from MyPipelines.CustomPipelines import *
import MyPipelines.CustomPipelines
class my_prediction:
def __init__(self):
file_name = 'model.sav'
with open(file_name, 'rb') as model_file:
self.model = pickle.load(model_file)
def predict(self, request):
data = request.get('ndarray')
columns = request.get('names')
X = pd.DataFrame(data, columns = columns)
predictions = self.model.predict(X)
return predictions
Error:
File microservice/my_prediction.py in __init__
self.model = pickle.load(model_file)
Attribute Error: Can't get Attribute 'MyEncoder' on <module '__main__' from 'opt/conda/bin/seldon-core-microservice'
One of the constrains of the pickle module is that it expects that the same classes (under the same module) are available in the environment where the artifact gets unpickled. In this case, it seems like your my_prediction class is trying to unpickle a MyEncoder artifact, but that class is not available on that environment.
As a quick workaround, you could try to make your MyEncoder class available on the environment where my_prediction runs on (i.e. having the same folders / files present there as well). Otherwise, you could look at alternatives to pickle, like cloudpickle or dill, which can serialise your custom code as well (although these also come with their own set of caveats).
I am trying to put together a unittest to test whether my function that reads in big data files, produces the correct result in shape of an numpy array. However, these files and arrays are huge and can not be typed in. I believe I need to save input and output files and test using them. This is how my testModule looks like:
import numpy as np
from myFunctions import fun1
import unittest
class TestMyFunctions(unittest.TestCase):
def setUp(self):
self.inputFile1 = "input1.txt"
self.inputFile2 = "input2.txt"
self.outputFile = "output.txt"
def test_fun1(self):
m1 = np.genfromtxt(self.inputFile1)
m2 = np.genfromtxt(self.inputFile2)
R = np.genfromtxt(self.outputFile)
self.assertEqual(fun1(m1,m2),R)
if __name__ =='__main__':
unittest.main(exit=False)
I'm not sure if there is a better/neater way of testing huge results.
Edit:
Also getting an attribute error now:
AttributeError: TestMyFunctions object has no attribute '_testMethodName'
Update - AttributeError Solved - 'def init()' is not allowed. Changed with def setUp()!
Jupyter Notebook
I am using multiprocessing module basically, I am still learning the capabilities of multiprocessing. I am using the book by Dusty Phillips and this code belongs to it.
import multiprocessing
import random
from multiprocessing.pool import Pool
def prime_factor(value):
factors = []
for divisor in range(2, value-1):
quotient, remainder = divmod(value, divisor)
if not remainder:
factors.extend(prime_factor(divisor))
factors.extend(prime_factor(quotient))
break
else:
factors = [value]
return factors
if __name__ == '__main__':
pool = Pool()
to_factor = [ random.randint(100000, 50000000) for i in range(20)]
results = pool.map(prime_factor, to_factor)
for value, factors in zip(to_factor, results):
print("The factors of {} are {}".format(value, factors))
On the Windows PowerShell (not on jupyter notebook) I see the following
Process SpawnPoolWorker-5:
Process SpawnPoolWorker-1:
AttributeError: Can't get attribute 'prime_factor' on <module '__main__' (built-in)>
I do not know why the cell never ends running?
It seems that the problem in Jupyter notebook as in different ide is the design feature. Therefore, we have to write the function (prime_factor) into a different file and import the module. Furthermore, we have to take care of the adjustments. For example, in my case, I have coded the function into a file known as defs.py
def prime_factor(value):
factors = []
for divisor in range(2, value-1):
quotient, remainder = divmod(value, divisor)
if not remainder:
factors.extend(prime_factor(divisor))
factors.extend(prime_factor(quotient))
break
else:
factors = [value]
return factors
Then in the jupyter notebook I wrote the following lines
import multiprocessing
import random
from multiprocessing import Pool
import defs
if __name__ == '__main__':
pool = Pool()
to_factor = [ random.randint(100000, 50000000) for i in range(20)]
results = pool.map(defs.prime_factor, to_factor)
for value, factors in zip(to_factor, results):
print("The factors of {} are {}".format(value, factors))
This solved my problem
To execute a function without having to write it into a separated file manually:
We can dynamically write the task to process into a temporary file, import it and execute the function.
from multiprocessing import Pool
from functools import partial
import inspect
def parallel_task(func, iterable, *params):
with open(f'./tmp_func.py', 'w') as file:
file.write(inspect.getsource(func).replace(func.__name__, "task"))
from tmp_func import task
if __name__ == '__main__':
func = partial(task, params)
pool = Pool(processes=8)
res = pool.map(func, iterable)
pool.close()
return res
else:
raise "Not in Jupyter Notebook"
We can then simply call it in a notebook cell like this:
def long_running_task(params, id):
# Heavy job here
return params, id
data_list = range(8)
for res in parallel_task(long_running_task, data_list, "a", 1, "b"):
print(res)
Ouput:
('a', 1, 'b') 0
('a', 1, 'b') 1
('a', 1, 'b') 2
('a', 1, 'b') 3
('a', 1, 'b') 4
('a', 1, 'b') 5
('a', 1, 'b') 6
('a', 1, 'b') 7
Note: If you're using Anaconda and if you want to see the progress of the heavy task, you can use print() inside long_running_task(). The content of the print will be displayed in the Anaconda Prompt console.
Strictly, Python multiprocessing isn't supported on Windows Jupyter Notebook even if __name__="__main__" is added.
One workaround in Windows 10 is to connect windows browser with Jupyter server in WSL.
You could get the same experience as Linux.
You can set it manually or refer the script in https://github.com/mszhanyi/gemini
Another option: use dask, which plays nicely with Jupyter. Even if you don't need any of dask special data structures, you can use it simply to control multiple processes.
To handle the many quirks of getting multiprocess to play nice in Jupyter session, I've created a library mpify which allows one-time, multiprocess function executions, and passing things from the notebook to the subprocess with a simple API.
The Jupyter shell process itself can participate as a worker process. User can choose to gather results from all workers, or just one of them.
Here it is:
https://github.com/philtrade/mpify
Under the hood, it uses multiprocess -- an actively supported fork from the standard python multiprocessing library -- to allow locally defined variables/functions in the notebook, to be accessible in the subprocesses. It also uses the spawn start method, which is necessary if the subprocesses are to use multiple GPUs, an increasingly common use case. It uses Process() not Pool(), from the multiprocess API.
User can supply a custom context manager to acquire resources, setup/tear down execution environment surrounding the function execution. I've provided a sample context manager to support PyTorch's distributed data parallel (DDP) set up, and many more examples of how to train fastai v2 in Jupyter on multiple GPUs using DDP.
Bug reports, PRs, use cases to share are all welcome.
By no means a fancy/powerful library, mpify only intends to support single-host/multiprocess kind of distributed setup, and simply spawn-execute-terminate. Nor does it support persistent pool of processes and fancy task scheduling -- ipyparallel or dask already does it.
I hope it can be useful to folks who're struggling with Jupyter + multiprocessing, and possible with multi-GPUs as well. Thanks.