AttributeError: pickle.load() Seldon Deployment - python-3.x

I am doing a Seldon deployment. I have created custom pipelines using sklearn; they live in MyPipelines/CustomPipelines.py. The main code, my_prediction.py, is the file Seldon executes by default (based on my configuration), and in it I import the custom pipelines. If I run my_prediction.py locally (PyCharm) it executes fine, but if I deploy it using Seldon I get the error: AttributeError: Can't get attribute 'MyEncoder'
It is unable to load the classes defined in CustomPipelines.py. I tried all the solutions from "Unable to load files using pickle and multiple modules", but none of them worked.
MyPipelines/CustomPipelines.py
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class MyEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        df = X
        vars_cat = [var for var in df.columns if df[var].dtypes == 'O']
        cat_with_na = [var for var in vars_cat if df[var].isnull().sum() > 0]
        df[cat_with_na] = df[cat_with_na].fillna('Missing')
        return df
my_prediction.py
import pickle
import pandas as pd
import dill
from MyPipelines.CustomPipelines import MyEncoder
from MyPipelines.CustomPipelines import *
import MyPipelines.CustomPipelines


class my_prediction:
    def __init__(self):
        file_name = 'model.sav'
        with open(file_name, 'rb') as model_file:
            self.model = pickle.load(model_file)

    def predict(self, request):
        data = request.get('ndarray')
        columns = request.get('names')
        X = pd.DataFrame(data, columns=columns)
        predictions = self.model.predict(X)
        return predictions
Error:
File "microservice/my_prediction.py", in __init__
    self.model = pickle.load(model_file)
AttributeError: Can't get attribute 'MyEncoder' on <module '__main__' from '/opt/conda/bin/seldon-core-microservice'>

One of the constraints of the pickle module is that it expects the same classes (under the same module path) to be available in the environment where the artifact gets unpickled. In this case, it seems like your my_prediction class is trying to unpickle a MyEncoder artifact, but that class is not available in that environment.
As a quick workaround, you could try to make your MyEncoder class available in the environment where my_prediction runs (i.e. having the same folders / files present there as well). Otherwise, you could look at alternatives to pickle, like cloudpickle or dill, which can serialise your custom code as well (although these also come with their own set of caveats).
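For example, one rough sketch of the cloudpickle route (assuming cloudpickle >= 2.0 is available both where you train and inside the Seldon image, and with fitted_pipeline standing in as a placeholder for your trained sklearn Pipeline) would be to embed the custom module's code in the artifact itself:

import cloudpickle
import MyPipelines.CustomPipelines

# Embed the code of MyPipelines.CustomPipelines in the pickle itself instead of
# storing only a reference to its import path (requires cloudpickle >= 2.0).
cloudpickle.register_pickle_by_value(MyPipelines.CustomPipelines)

with open('model.sav', 'wb') as model_file:
    cloudpickle.dump(fitted_pipeline, model_file)  # fitted_pipeline: your trained Pipeline

A file written this way can still be opened with pickle.load() inside my_prediction, as long as the cloudpickle package itself is installed in that image.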

Related

Cannot pickle dateparser using cloudpickle

I'm using the dateparser library to parse some strings and return potential dates. I need to use cloudpickle for distributed use but am receiving an error:
import cloudpickle
import dateparser


class DateParser:
    def __init__(self,
                 threshold: float = 0.5,
                 pos_label: str = 'date'):
        self.threshold = threshold
        self.pos_label = pos_label

    def __call__(self):
        dateparser.parse('20/12/2022')


date_parser = DateParser()
with open('/path/parser.cloudpickle', 'wb+') as fout:
    cloudpickle.dump(date_parser, fout, protocol=4)
TypeError: can't pickle _thread.lock objects
However, when I use plain pickle it works just fine:
import pickle

with open('/path/parser.pickle', 'wb+') as fout:
    pickle.dump(date_parser, fout, protocol=4)

# also loads just fine:
with open('/path/parser.pickle', 'rb+') as fin:
    pickle.load(fin)
I can get around this issue by importing dateparser in the __init__ of DateParser, but I'm not sure why this should be the fix.
class DateParser:
    def __init__(self,
                 threshold: float = 0.5,
                 pos_label: str = 'date'):
        import dateparser
        self.threshold = threshold
        self.pos_label = pos_label
I looked online and it seems this thread-lock complaint is most commonly associated with multiprocessing calls, but as far as I can tell that doesn't happen in the underlying dateparser library. And shouldn't this have broken plain pickling anyway?
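For what it's worth, the difference can be reproduced without dateparser at all. Below is a minimal sketch, meant to be run as a standalone script so the class lives in __main__, with a bare threading.Lock standing in for dateparser's internal state:

import pickle
import threading

import cloudpickle

# Stand-in for the unpicklable state that the dateparser module holds internally.
UNPICKLABLE = threading.Lock()


class ByValueExample:
    def __call__(self):
        # __call__ refers to a module-level global, so serialising the class
        # "by value" also tries to serialise UNPICKLABLE.
        return UNPICKLABLE


obj = ByValueExample()

pickle.dumps(obj)  # fine: plain pickle stores the class by reference only

try:
    cloudpickle.dumps(obj)  # fails: classes defined in __main__ are pickled by value
except TypeError as exc:
    print(exc)  # e.g. "cannot pickle '_thread.lock' object"

The likely explanation is that cloudpickle serialises classes defined in __main__ by value, capturing the globals their methods reference, while plain pickle stores only the reference __main__.DateParser; moving the import inside __init__ keeps the dateparser module out of those captured globals.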

Python: I'm trying to import an instance from module 2 and run it through a class in module 1

I have tried several "solutions" on this site and others, and I must be missing something. Why does the code pictured give a NameError?
I've tried from cars2 import *, but that didn't work, as well as a few others.
I'm out of ideas. What am I missing?
https://i.stack.imgur.com/EHuay.jpg
You are calling the class cars before defining it.
You should do the following:
In the file cars1.py:
class cars:
    def __init__(self, model):
        self.model = model
In the file cars2.py:
from cars1 import cars

firstCar = cars("Honda")
print(firstCar.model)
And while running the code, you should run cars2.py and not cars1.py.
So you should run it as python cars2.py from the command line, in the folder where cars2.py is saved.
You can also run cars1.py by updating it as follows:
class cars:
    def __init__(self, model):
        self.model = model


if __name__ == "__main__":
    from cars2 import firstCar
    print(firstCar.model)

Why does my Flask app work when executing using `python app.py` but not when using `heroku local web` or `flask run`?

I wrote a Flask-based web app that takes text from users and returns the probability that it is of a given classification (full script below). The app loads some of the trained models needed to make predictions before any requests are made. I am currently trying to deploy it on Heroku and experiencing some problems.
I am able to run it locally when I execute python ml_app.py. But when I use the Heroku CLI command heroku local web to run it locally for testing before deployment, I get the following error:
AttributeError: module '__main__' has no attribute 'tokenize'
This error is associated with loading a TF-IDF text vectorizer in the line:
tfidf_model = joblib.load('models/tfidf_vectorizer_train.pkl')
I have imported the required function at the top of the script to ensure that it is loaded properly (from utils import tokenize). This works, given that I can run the app with python ml_app.py. But for reasons I do not know, it doesn't load when I use heroku local web. It also doesn't work when I use the Flask CLI command flask run to run it locally. Any idea why?
I admit that I do not have a good understanding of what is going on under the hood here (with respect to the web dev./deployment aspect of the code) so any explanation helps.
from flask import Flask, request, render_template
from sklearn.externals import joblib
from utils import tokenize  # custom tokenizer required for tfidf model loaded in load_tfidf_model()

app = Flask(__name__)

models_directory = 'models'


@app.before_first_request
def nbsvm_models():
    global tfidf_model
    global logistic_identity_hate_model
    global logistic_insult_model
    global logistic_obscene_model
    global logistic_severe_toxic_model
    global logistic_threat_model
    global logistic_toxic_model

    tfidf_model = joblib.load('models/tfidf_vectorizer_train.pkl')
    logistic_identity_hate_model = joblib.load('models/logistic_identity_hate.pkl')
    logistic_insult_model = joblib.load('models/logistic_insult.pkl')
    logistic_obscene_model = joblib.load('models/logistic_obscene.pkl')
    logistic_severe_toxic_model = joblib.load('models/logistic_severe_toxic.pkl')
    logistic_threat_model = joblib.load('models/logistic_threat.pkl')
    logistic_toxic_model = joblib.load('models/logistic_toxic.pkl')


@app.route('/')
def my_form():
    return render_template('main.html')


@app.route('/', methods=['POST'])
def my_form_post():
    """
    Takes the comment submitted by the user, apply TFIDF trained vectorizer to it, predict using trained models
    """
    text = request.form['text']
    comment_term_doc = tfidf_model.transform([text])

    dict_preds = {}

    dict_preds['pred_identity_hate'] = logistic_identity_hate_model.predict_proba(comment_term_doc)[:, 1][0]
    dict_preds['pred_insult'] = logistic_insult_model.predict_proba(comment_term_doc)[:, 1][0]
    dict_preds['pred_obscene'] = logistic_obscene_model.predict_proba(comment_term_doc)[:, 1][0]
    dict_preds['pred_severe_toxic'] = logistic_severe_toxic_model.predict_proba(comment_term_doc)[:, 1][0]
    dict_preds['pred_threat'] = logistic_threat_model.predict_proba(comment_term_doc)[:, 1][0]
    dict_preds['pred_toxic'] = logistic_toxic_model.predict_proba(comment_term_doc)[:, 1][0]

    for k in dict_preds:
        perc = dict_preds[k] * 100
        dict_preds[k] = "{0:.2f}%".format(perc)

    return render_template('main.html', text=text,
                           pred_identity_hate=dict_preds['pred_identity_hate'],
                           pred_insult=dict_preds['pred_insult'],
                           pred_obscene=dict_preds['pred_obscene'],
                           pred_severe_toxic=dict_preds['pred_severe_toxic'],
                           pred_threat=dict_preds['pred_threat'],
                           pred_toxic=dict_preds['pred_toxic'])


if __name__ == '__main__':
    app.run(debug=True)
Fixed it. It was due to the way I pickled the class instance stored in tfidf_vectorizer_train.pkl. The model was created in an IPython notebook, where one of its attributes depended on a tokenizer function that I defined interactively in the notebook. I soon learned that pickling does not save the code that defines a class or function, only a reference to it, which means tfidf_vectorizer_train.pkl does not contain the tokenizer function I defined in the notebook.
To fix this, I moved the tokenizer function to a separate utilities python file and imported the function both in the file where I trained and subsequently pickled the model and in the file where I unpickled it.
In code, I did
from utils import tokenize
...
tfidfvectorizer = TfidfVectorizer(ngram_range=(1, 2), tokenizer=tokenize,
                                  min_df=3, max_df=0.9, strip_accents='unicode',
                                  use_idf=1, smooth_idf=True, sublinear_tf=1)
train_term_doc = tfidfvectorizer.fit_transform(train[COMMENT])
joblib.dump(tfidfvectorizer, 'models/tfidf_vectorizer_train.pkl')
...
in the file where I trained the model and
from utils import tokenize
...
@app.before_first_request
def load_models():
    # from utils import tokenize
    global tfidf_model
    tfidf_model = joblib.load('{}/tfidf_vectorizer_train.pkl'.format(models_directory))
...
in the file containing the web app code.
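For completeness, the separate utilities file might look roughly like the following; the question does not show the real tokenize(), so the body here is only an illustrative placeholder:

# utils.py -- hypothetical sketch; the actual tokenizer must be the exact one
# that was used when the TF-IDF vectorizer was trained.
import re


def tokenize(text):
    # Placeholder implementation: lowercase the text and split on word characters.
    return re.findall(r"\w+", text.lower())

The essential point is that the same function is importable from a real module both where the model is trained and pickled and where it is unpickled.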

Python Unittest for big arrays

I am trying to put together a unit test to check whether my function, which reads in big data files, produces the correct result in the shape of a numpy array. However, these files and arrays are huge and cannot be typed in. I believe I need to save input and output files and test using them. This is what my test module looks like:
import numpy as np
from myFunctions import fun1
import unittest


class TestMyFunctions(unittest.TestCase):
    def setUp(self):
        self.inputFile1 = "input1.txt"
        self.inputFile2 = "input2.txt"
        self.outputFile = "output.txt"

    def test_fun1(self):
        m1 = np.genfromtxt(self.inputFile1)
        m2 = np.genfromtxt(self.inputFile2)
        R = np.genfromtxt(self.outputFile)
        self.assertEqual(fun1(m1, m2), R)


if __name__ == '__main__':
    unittest.main(exit=False)
I'm not sure if there is a better/neater way of testing huge results.
Edit:
Also getting an attribute error now:
AttributeError: TestMyFunctions object has no attribute '_testMethodName'
Update - AttributeError solved - 'def __init__()' is not allowed; changed it to def setUp()!
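As a side note on the original question about testing huge arrays, a common pattern (a sketch assuming the same files and fun1 as above) is to use numpy's own testing helpers instead of assertEqual, since comparing whole arrays with assertEqual raises an ambiguous-truth-value error and gives no useful diff:

import unittest

import numpy as np

from myFunctions import fun1  # assumed to exist, as in the question


class TestMyFunctions(unittest.TestCase):
    def setUp(self):
        self.inputFile1 = "input1.txt"
        self.inputFile2 = "input2.txt"
        self.outputFile = "output.txt"

    def test_fun1(self):
        m1 = np.genfromtxt(self.inputFile1)
        m2 = np.genfromtxt(self.inputFile2)
        expected = np.genfromtxt(self.outputFile)
        # Element-wise comparison with a tolerance and a readable report of
        # any mismatching positions.
        np.testing.assert_allclose(fun1(m1, m2), expected, rtol=1e-7)


if __name__ == '__main__':
    unittest.main(exit=False)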

How to distribute classes with PySpark and Jupyter

I have an annoying problem using Jupyter notebook with Spark.
I need to define a custom class inside the notebook and use it to perform some map operations:
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark import SQLContext

conf = SparkConf().setMaster("spark://192.168.10.11:7077")\
    .setAppName("app_jupyter/")\
    .set("spark.cores.max", "10")

sc = SparkContext(conf=conf)
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)


class demo(object):
    def __init__(self, value):
        self.test = value + 10
        pass


distData.map(lambda x: demo(x)).collect()
It gives the following error:
PicklingError: Can't pickle <class '__main__.demo'>: attribute lookup __main__.demo failed
I know what this error is about, but I couldn't figure out a solution.
I have tried:
Defining a demo.py python file outside the notebook. It works, but it is such an ugly solution...
Creating a dynamic module like this, and then importing it afterwards... This gives the same error.
What would be a solution? I want everything to work in the same notebook.
Is it possible to change something in:
The way Spark works, maybe some pickle configuration?
Something in the code... use some static magic approach?
There is no reliable and elegant workaround here, and this behavior is not particularly related to Spark. This is all about the fundamental design of the Python pickle:
pickle can save and restore class instances transparently; however, the class definition must be importable and live in the same module as when the object was stored.
Theoretically you could define a custom cell magic which would:
Write the content of a cell to a module.
Import it.
Call SparkContext.addPyFile to distribute the module.
from IPython.core.magic import register_cell_magic
import importlib


@register_cell_magic
def spark_class(line, cell):
    module = line.strip()
    f = "{0}.py".format(module)
    with open(f, "w") as fw:
        fw.write(cell)
    globals()[module] = importlib.import_module(module)
    sc.addPyFile(f)
In [2]: %%spark_class foo
   ...: class Foo(object):
   ...:     def __init__(self, x):
   ...:         self.x = x
   ...:     def __repr__(self):
   ...:         return "Foo({0})".format(self.x)
   ...:

In [3]: sc.parallelize([1, 2, 3]).map(lambda x: foo.Foo(x)).collect()
Out[3]: [Foo(1), Foo(2), Foo(3)]
but it is a one-time deal. Once a file is marked for distribution it cannot be changed or redistributed. Moreover, there is the problem of reloading imports on remote hosts. I can think of some more elaborate schemes, but this is simply more trouble than it is worth.
The answer from zero323 is solid: there's no one "right" way to solve this problem. You could indeed use Jupyter magic, as proposed. One other way is to use Jupyter's %%writefile to have your code inline in a Jupyter cell but to then write it to disk as a python file. Then you can both import the file to your Jupyter kernel session as well as ship it with your PySpark job (via addPyFile() as noted in the other answer). Note that if you make changes to the code but don't restart your PySpark session, you'll have to get the updated code to PySpark somehow.
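For illustration, the plain %%writefile route described above (without any wrapper library) might look roughly like the following two notebook cells; mymodule.py, the Foo class and the sc SparkContext from the question are just assumed names here. First cell:

%%writefile mymodule.py
class Foo(object):
    def __init__(self, x):
        self.x = x

and then, in a second cell:

import mymodule

# Ship the module file to the executors so the lambda below can recreate Foo there.
sc.addPyFile('mymodule.py')
print(sc.parallelize([1, 2, 3]).map(lambda x: mymodule.Foo(x).x).collect())

As with the magic-based approach, re-running %%writefile after the file has been shipped will not update the copy on the executors without restarting the session.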
Can we make this easier? I wrote a blogpost about this topic as well as a PySpark Session wrapper (oarphpy.spark.NBSpark) to help automate a lot of the tricky stuff. See the Jupyter Notebook embedded in that post for a working example. The overall pattern looks like this:
import os
import sys

CUSTOM_LIB_SRC_DIR = '/tmp/'
os.chdir(CUSTOM_LIB_SRC_DIR)
!mkdir -p mymodule
!touch mymodule/__init__.py

%%writefile mymodule/foo.py
class Zebra(object):
    def __init__(self, name):
        self.name = name

sys.path.append(CUSTOM_LIB_SRC_DIR)
from mymodule.foo import Zebra

# Create Zebra() instances in the notebook
herd = [Zebra(name=str(i)) for i in range(10)]

# Now send those instances to PySpark!
from oarphpy.spark import NBSpark

NBSpark.SRC_ROOT = os.path.join(CUSTOM_LIB_SRC_DIR, 'mymodule')
spark = NBSpark.getOrCreate()
rdd = spark.sparkContext.parallelize(herd)

def get_name(z):
    return z.name

names = rdd.map(get_name).collect()
Additionally, if you make any changes to the mymodule files on disk (via %%writefile or otherwise), then NBSpark will automatically ship those changes to the active PySpark session.
