Load Pyspark.ml model from S3 using Pipeline - apache-spark

I am trying to save a trained model to S3 storage and then load it and predict with it via the Pipeline API from pyspark.ml.
Here's an example of how I am saving my model.
# stage_1 to stage_4 are some basic transformations on the data (one-hot encoding, etc.)
# define stage 5: logistic regression model
stage_5 = LogisticRegression(featuresCol='features', labelCol='label')
# set up the pipeline
regression_pipeline = Pipeline(stages=[stage_1, stage_2, stage_3, stage_4, stage_5])
# fit the pipeline to the training data
model = regression_pipeline.fit(dataFrame1)
model_path = "s3://s3-dummy_path-orch/dummy models/pipeline_testing_1.model"
model.save(model_path)
I am able to save the model successfully, and two folders get created at the above model path:
stages
metadata
However, when I try to load the model, it gives me the below error.
Traceback (most recent call last):
File "/tmp/pythonScript_85ff2462_e087_4805_9f50_0c75fc4302e2958379757178872310.py", line 75, in <module>
pipelineModel = Pipeline.load(model_path)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 362, in load
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 207, in load
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 300, in load
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.Pipeline but found class name org.apache.spark.ml.PipelineModel'
I am trying to load the model as below:
from pyspark.ml import Pipeline
# same path used for model.save in the code snippet above
model_path = "s3://s3-dummy_path-orch/dummy models/pipeline_testing_1.model"
pipelineModel = Pipeline.load(model_path)
How could I go about rectifying this?

If you saved a pipeline model, you should load it as a pipeline model, not as a pipeline. The difference is that a pipeline model (PipelineModel) is the result of fitting a pipeline to a DataFrame, while a Pipeline is the unfitted estimator; since fit() returned a PipelineModel, that is what save() wrote to S3.
from pyspark.ml import PipelineModel
pipelineModel = PipelineModel.load(model_path)
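For completeness, a minimal sketch of the full round trip, assuming the stages and dataFrame1 from the question are already defined:
from pyspark.ml import Pipeline, PipelineModel

# Pipeline.fit() returns a PipelineModel, so model.save() writes a PipelineModel
pipeline = Pipeline(stages=[stage_1, stage_2, stage_3, stage_4, stage_5])
fitted_model = pipeline.fit(dataFrame1)
fitted_model.save(model_path)

# load with the class that matches what was saved, then predict
loaded_model = PipelineModel.load(model_path)
predictions = loaded_model.transform(dataFrame1)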

Related

RuntimeError: Error(s) in loading state_dict for DynamicUnet: Missing key(s) in state_dict: "layers.0.4.0.conv3.weight" | size mismatch for layers

Goal:
Pickled model and exported weights come from a separate training environment. Here, I aim to load the model and weights to run inference with new datasets.
Versions:
torch==1.7.1
fastai==2.7.7
fastcore==1.5.6
torchvision==0.8.2
Code:
from fastai.vision.all import *
learn = load_learner('export.pkl', cpu=True)
learn.load('model_3C_34_CELW_V_1.1')
Traceback:
(venv) me@ubuntu-pcs:~/PycharmProjects/project$ python3 model/Run_model.py
Traceback (most recent call last):
File "/home/me/PycharmProjects/project/model/Run_model.py", line 4, in <module>
learn.load('model_3C_34_CELW_V_1.1')
File "/home/me/miniconda3/envs/venv/lib/python3.9/site-packages/fastai/learner.py", line 387, in load
load_model(file, self.model, self.opt, device=device, **kwargs)
File "/home/me/miniconda3/envs/venv/lib/python3.9/site-packages/fastai/learner.py", line 54, in load_model
get_model(model).load_state_dict(model_state, strict=strict)
File "/home/me/miniconda3/envs/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DynamicUnet:
Missing key(s) in state_dict: "layers.0.4.0.conv3.weight", "layers.0.4.0.bn3.weight", "layers.0.4.0.bn3.bias", "layers.0.4.0.bn3.running_mean",
size mismatch for layers.12.0.weight: copying a param with shape torch.Size([3, 99, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 291, 1, 1]).
You get this error when you change the number of classes, for example by retraining your U-Net model with a different number of classes.
Note the size mismatch in the traceback: the checkpoint you are trying to load was saved from a model with 99 channels at that layer, whilst your current model instance has 291. I suggest you create the new model instance with a number of classes that matches the checkpoint.
Yes; I needed an updated .pkl file that worked with the weights .pth file. Thanks.
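For anyone hitting the same mismatch, a small diagnostic sketch (the path and key handling are assumptions; fastai checkpoints saved via learn.save() typically nest the weights under a 'model' key):
import torch

# compare checkpoint tensor shapes against the current model's state_dict
ckpt = torch.load('models/model_3C_34_CELW_V_1.1.pth', map_location='cpu')
ckpt_sd = ckpt.get('model', ckpt)  # unwrap fastai's {'model': ..., 'opt': ...} layout if present
cur_sd = learn.model.state_dict()
for name, tensor in ckpt_sd.items():
    if name not in cur_sd:
        print('missing in current model:', name)
    elif cur_sd[name].shape != tensor.shape:
        print('shape mismatch:', name, tuple(tensor.shape), '->', tuple(cur_sd[name].shape))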

Azure Synapse Predict Model with Synapse ML predict

I followed the official tutorial from Microsoft: https://learn.microsoft.com/en-us/azure/synapse-analytics/machine-learning/tutorial-score-model-predict-spark-pool
But when I execute:
# Bind model within Spark session
model = pcontext.bind_model(
    return_types=RETURN_TYPES,
    runtime=RUNTIME,
    model_alias="Sales",      # this alias will be used in the PREDICT call to refer to this model
    model_uri=AML_MODEL_URI,  # in case of AML, it will be AML_MODEL_URI
    aml_workspace=ws          # only for AML; in case of ADLS, this parameter can be removed
).register()
I've got:
NotADirectoryError: [Errno 20] Not a directory: '/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1648328086462_0002/spark-3d802a7e-15b7-4eb6-88c5-f0e01f8cdb35/userFiles-fbe23a43-67d3-4e65-a879-4a497e804b40/68603955220f5f8646700d809b71be9949011a2476a34965a3d5c0f3d14de79b.pkl/MLmodel'
Traceback (most recent call last):
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/synapse/ml/predict/core/_context.py", line 47, in bind_model
udf = _create_udf(
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/synapse/ml/predict/core/_udf.py", line 104, in _create_udf
model_runtime = runtime_gen._create_runtime()
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/synapse/ml/predict/core/_runtime.py", line 103, in _create_runtime
if self._check_model_runtime_compatibility(model_runtime):
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/synapse/ml/predict/core/_runtime.py", line 166, in _check_model_runtime_compatibility
model_wrapper = self._load()
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/synapse/ml/predict/core/_runtime.py", line 78, in _load
return SynapsePredictModelCache._get_or_load(
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/synapse/ml/predict/core/_cache.py", line 172, in _get_or_load
model = load_model(runtime, model_uri, functions)
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/synapse/ml/predict/utils/_model_loader.py", line 257, in load_model
model = loader.load(model_uri, functions)
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/synapse/ml/predict/utils/_model_loader.py", line 122, in load
model = self._load(model_uri)
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/synapse/ml/predict/utils/_model_loader.py", line 215, in _load
return self._load_mlflow(model_uri)
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/synapse/ml/predict/utils/_model_loader.py", line 59, in _load_mlflow
model = mlflow.pyfunc.load_model(model_uri)
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/mlflow/pyfunc/init.py", line 640, in load_model
model_meta = Model.load(os.path.join(local_path, MLMODEL_FILE_NAME))
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/mlflow/models/model.py", line 124, in load
with open(path) as f:
NotADirectoryError: [Errno 20] Not a directory: '/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1648328086462_0002/spark-3d802a7e-15b7-4eb6-88c5-f0e01f8cdb35/userFiles-fbe23a43-67d3-4e65-a879-4a497e804b40/68603955220f5f8646700d809b71be9949011a2476a34965a3d5c0f3d14de79b.pkl/MLmodel'
How can I fix that error?
(UPDATE 29/3/2022): You will experience this error message if your model folder does not contain all the files required for an MLflow model.
As per the repro, I had created two ML models:
sklearn_regression_model: contains only the sklearn_regression_model.pkl file. When I run PREDICT against this MLflow-packaged model, I get the same error as shown above.
linear_regression: contains the complete set of MLflow model files. When I run PREDICT against this model, it works as expected.
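For reference, a complete MLflow model folder for a scikit-learn model typically contains at least the following; the loader above fails precisely because it cannot find the MLmodel file inside the uploaded path:
MLmodel
conda.yaml
model.pkl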
The AML_MODEL_URI should include the model name and version, e.g. Rossman_Sales:2, where ":x" signifies the model version.
Before running this script, update it with the URI of the ADLS Gen2 data file, the model output return data type, and the ADLS/AML URI for the model file.
# Set model URI
# Set AML URI, if trained model is registered in AML
AML_MODEL_URI = "<aml model uri>"  # in the URI, ":x" signifies the model version in AML; you can choose which version to run, and if ":x" is not provided the latest version is picked by default
# Set ADLS URI, if trained model is uploaded to ADLS
ADLS_MODEL_URI = "abfss://<filesystemname>@<account name>.dfs.core.windows.net/<model mlflow folder path>"
Model URI from AML Workspace:
DATA_FILE = "abfss://data@cheprasynapse.dfs.core.windows.net/AML/LengthOfStay_cooked_small.csv"
AML_MODEL_URI_SKLEARN = "aml://mlflow_sklearn:1"  # ":1" signifies the model version in AML; if omitted, the latest version is picked by default
RETURN_TYPES = "INT"
RUNTIME = "mlflow"
Model URI uploaded to ADLS Gen2:
DATA_FILE = "abfss://data@cheprasynapse.dfs.core.windows.net/AML/LengthOfStay_cooked_small.csv"
ADLS_MODEL_URI_SKLEARN = "abfss://data@cheprasynapse.dfs.core.windows.net/linear_regression/linear_regression"  # path to the MLflow model folder in ADLS Gen2
RETURN_TYPES = "INT"
RUNTIME = "mlflow"
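Under those settings, a minimal sketch of the bind call for the ADLS case would look like this (mirroring the question's snippet; the alias is illustrative):
# bind the MLflow-format model folder (not a bare .pkl) from ADLS Gen2
model = pcontext.bind_model(
    return_types=RETURN_TYPES,
    runtime=RUNTIME,
    model_alias="Sales",
    model_uri=ADLS_MODEL_URI_SKLEARN,  # folder that contains the MLmodel file
).register()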

Loading a saved Tensorflow model from its .meta file

I am trying to load a TensorFlow meta graph from a saved checkpoint using TensorFlow version 1.15, to convert it to a SavedModel for TensorFlow Serving. It is a speech recognition model with local attention and a unidirectional LSTM, implemented using the Returnn toolkit with the TensorFlow backend. I am using the following code.
import tensorflow as tf
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
import sys

if len(sys.argv) != 2:
    print("Usage: " + sys.argv[0] + " save_dir")
    exit(1)

export_dir = sys.argv[1]
builder = tf.compat.v1.saved_model.builder.SavedModelBuilder(export_dir)
sigs = {}

with tf.Session(graph=tf.Graph()) as sess:
    new_saver = tf.train.import_meta_graph("./serv_test/model.238.meta")
    new_saver.restore(sess, tf.train.latest_checkpoint("./serv_test"))
    graph = tf.get_default_graph()
    input_audio = graph.get_tensor_by_name('inference/default/wav:0')
    output_hyps = graph.get_tensor_by_name('inference/default/Reshape_7:0')
    sigs[signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY] = \
        tf.saved_model.signature_def_utils.predict_signature_def(
            {"in": input_audio}, {"out": output_hyps})
    builder.add_meta_graph_and_variables(sess, [tag_constants.SERVING],
                                         signature_def_map=sigs)
builder.save()
But I am getting the following error in the import_meta_graph line:
Traceback (most recent call last):
File "xport.py", line 16, in <module>
new_saver=tf.train.import_meta_graph("./serv_test/model.238.meta")
File "/home/ubuntu/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1453, in import_meta_graph
**kwargs)[0]
File "/home/ubuntu/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1477, in _import_meta_graph_with_return_elements
**kwargs))
File "/home/ubuntu/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/meta_graph.py", line 809, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "/home/ubuntu/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/ubuntu/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 405, in import_graph_def
producer_op_list=producer_op_list)
File "/home/ubuntu/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 501, in _import_graph_def_internal
graph._c_graph, serialized, options) # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered
'NativeLstm2' in binary running on ip-10-1-21-241. Make sure the Op and Kernel
are registered in the binary running in this process. Note that if you are loading a
saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler`
should be done before importing the graph, as contrib ops are lazily registered when
the module is first accessed.
Is there any way to get around this error? Is it because of the custom-built layers used in Returnn? Is there any way to make a Returnn model TensorFlow-servable?
Thanks.
You should remove the graph=tf.Graph(); otherwise your import_meta_graph will import the meta graph into the wrong graph, not the one your session uses.
Just see some official TF examples of how to use import_meta_graph.
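A minimal sketch of the suggested change, keeping the rest of the question's script the same:
# create the session on the default graph instead of a fresh tf.Graph(),
# so the restored meta graph lands in the graph the session actually uses
with tf.Session() as sess:
    new_saver = tf.train.import_meta_graph("./serv_test/model.238.meta")
    new_saver.restore(sess, tf.train.latest_checkpoint("./serv_test"))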

ValueError: negative dimensions are not allowed when loading .pkl file

Although there are many question threads for the error ValueError: negative dimensions are not allowed, I couldn't find the answer to my problem.
After training a machine learning model using SGDClassifier:
clf = linear_model.SGDClassifier(loss='log', random_state=20000, verbose=1, class_weight='balanced')
model = clf.fit(X, Y)
The dimension of X is (1651880, 246177).
The below code works, i.e. saving the model object and using the model for prediction:
joblib.dump(model, 'trainedmodel.pkl', compress=3)
prediction_result = model.predict(x_test)
but I get an error when loading the saved model:
model = joblib.load('trainedmodel.pkl')
Below is the error message; please help me out to resolve it.
File "C:\Users\Taxonomy\AppData\Roaming\Python\Python36\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 598, in load
obj = _unpickle(fobj, filename, mmap_mode)
File "C:\Users\Taxonomy\AppData\Roaming\Python\Python36\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 526, in _unpickle
obj = unpickler.load()
File "C:\Users\Taxonomy\Anaconda3\lib\pickle.py", line 1050, in load
dispatch[key[0]](self)
File "C:\Users\Taxonomy\AppData\Roaming\Python\Python36\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 352, in load_build
self.stack.append(array_wrapper.read(self))
File "C:\Users\Taxonomy\AppData\Roaming\Python\Python36\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 195, in read
array = self.read_array(unpickler)
File "C:\Users\Taxonomy\AppData\Roaming\Python\Python36\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 141, in read_array
array = unpickler.np.empty(count, dtype=self.dtype)
ValueError: negative dimensions are not allowed
Try dumping the model with pickle protocol 4.
From Python's pickle docs:
Protocol version 4 was added in Python 3.4. It adds support for very
large objects, pickling more kinds of objects, and some data format
optimizations. Refer to PEP 3154 for information about improvements
brought by protocol 4.
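With joblib this is a one-line change, since joblib.dump accepts a protocol argument that it forwards to the pickler:
import joblib

# protocol 4 (Python 3.4+) supports objects larger than 4 GB, which matters
# for a model trained on a (1651880, 246177) feature matrix
joblib.dump(model, 'trainedmodel.pkl', compress=3, protocol=4)
model = joblib.load('trainedmodel.pkl')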

Variable_scope runtime error when creating keras custom layer using tensorflow hub models and tensorflow 2.0 as backend

I'm trying to use the pretrained tf-hub elmo model by integrating it into a keras layer.
Keras Layer:
class ElmoEmbeddingLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(ElmoEmbeddingLayer, self).__init__(**kwargs)
        self.dimensions = 1024
        self.trainable = True
        self.elmo = None

    def build(self, input_shape):
        url = 'https://tfhub.dev/google/elmo/2'
        self.elmo = hub.Module(url)
        self._trainable_weights += trainable_variables(
            scope="^{}_module/.*".format(self.name))
        super(ElmoEmbeddingLayer, self).build(input_shape)

    def call(self, x, mask=None):
        result = self.elmo(
            x,
            signature="default",
            as_dict=True)["elmo"]
        return result

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.dimensions
When I run the code I get the following error:
Traceback (most recent call last):
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 170, in <module>
validation_steps=validation_dataset.size())
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 79, in train_gpu
model = build_model(self.config, self.embeddings, self.sequence_len, self.out_classes, summary=True)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 8, in build_model
return my_model(embeddings, config, sequence_length, out_classes, summary)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 66, in my_model
inputs, embedding = resolve_inputs(embeddings, sequence_length, model_config, input_type)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 19, in resolve_inputs
return elmo_input(model_conf)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 58, in elmo_input
embedding = ElmoEmbeddingLayer()(input_text)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 616, in __call__
self._maybe_build(inputs)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1966, in _maybe_build
self.build(input_shapes)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\custom_layers.py", line 21, in build
self.elmo = hub.Module(url)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow_hub\module.py", line 156, in __init__
abs_state_scope = _try_get_state_scope(name, mark_name_scope_used=False)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow_hub\module.py", line 389, in _try_get_state_scope
"name_scope was already taken." % abs_state_scope)
RuntimeError: variable_scope module/ was unused but the corresponding name_scope was already taken.
It seems to be due to the eager execution behaviour. If I disable eager execution, I have to surround the model.fit call with a TensorFlow session and initialize the variables using sess.run(global_variables_initializer()) to avoid the next error:
Traceback (most recent call last):
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 168, in <module>
validation_steps=validation_dataset.size().eval(session=Session()))
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 90, in train_gpu
class_weight=weighted)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\training.py", line 643, in fit
use_multiprocessing=use_multiprocessing)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\training_arrays.py", line 664, in fit
steps_name='steps_per_epoch')
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\training_arrays.py", line 294, in model_iteration
batch_outs = f(actual_inputs)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\backend.py", line 3353, in __call__
run_metadata=self.run_metadata)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\client\session.py", line 1458, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
(0) Failed precondition: Error while reading resource variable module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/class tensorflow::Var does not exist.
[[{{node elmo_embedding_layer/module_apply_default/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/Read/ReadVariableOp}}]]
(1) Failed precondition: Error while reading resource variable module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/class tensorflow::Var does not exist.
[[{{node elmo_embedding_layer/module_apply_default/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/Read/ReadVariableOp}}]]
[[metrics/f1_micro/Identity/_223]]
0 successful operations.
0 derived errors ignored.
My solution:
with Session() as sess:
    sess.run(global_variables_initializer())
    history = model.fit(self.train_data.repeat(),
                        epochs=self.config['epochs'],
                        validation_data=self.validation_data.repeat(),
                        steps_per_epoch=steps_per_epoch,
                        validation_steps=validation_steps,
                        callbacks=self.__callbacks(monitor_metric),
                        class_weight=weighted)
The main question is whether there is another way to use the ELMo tf-hub module in a Keras custom layer and train my model. Another question is whether my current solution affects training performance or causes the GPU OOM error (I get the OOM error after a few epochs with a higher batch size, which I've found to be related to sessions not being closed, or to memory leaks).
If you wrap your model in a Session() block, you will also have to wrap all other code that uses your model in that Session() block, which takes a lot of time and effort. There is another way to deal with it:
First, create an ELMo module and add a session to Keras:
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import backend as K

elmo_model = hub.Module("https://tfhub.dev/google/elmo/3", trainable=True,
                        name='elmo_module')
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())
K.set_session(sess)
Then, instead of creating the ELMo module directly in your ElmoEmbeddingLayer:
self.elmo = hub.Module(url)
self._trainable_weights += trainable_variables(
    scope="^{}_module/.*".format(self.name))
you can do the following, which I think works normally:
self.elmo = elmo_model
self._trainable_weights += trainable_variables(
    scope="^elmo_module/.*")
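Putting the two pieces together, the layer's build method would then look roughly like this (a sketch assuming elmo_model was created once at startup as above, and using tf.trainable_variables for the scope filter):
def build(self, input_shape):
    # reuse the module created at startup instead of instantiating a new
    # hub.Module here, which avoids the variable_scope/name_scope collision
    self.elmo = elmo_model
    self._trainable_weights += tf.trainable_variables(
        scope="^elmo_module/.*")
    super(ElmoEmbeddingLayer, self).build(input_shape)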
Here is a simple solution that I used in my case:
This happened to me while I was using a separate Python script to create the module.
To solve it, I passed the tf.Session() from the main script to tf.keras.backend in the other script, via an entry point called before the layer's __init__.
Example:
Main file:
import tensorflow.compat.v1 as tf
from ModuleFile import ModuleLayer

def __main__():
    init_args = [...]
    input = ...
    sess = tf.keras.backend.get_session()
    ModuleLayer.__init_session__(sess)
    module_layer = ModuleLayer(init_args)(input)
Module file:
import tensorflow.compat.v1 as tf

class ModuleLayer(tf.keras.layers.Layer):
    @staticmethod
    def __init_session__(session):
        tf.keras.backend.set_session(session)

    def __init__(self, *args):
        ...
Hope that helps :)
