How to train your model in deeppavlov (NER) Python 3 - python-3.x

First of all, sorry for any newbie mistakes that I've made. But I couldn't figure out and couldn't find a source specifically for deeppavlov (NER) library. I'm trying to train ner_ontonotes_bert_mult as described here. I guess it can be trained from its checkpoint to make it recognize some specific patterns like;
"Round 23/22; 24,9 x 12,2 x 12,3"
as
[[['Round', '23/22', ';', '24,9 x 12,2 x 12,3']], [['B-PRODUCT', 'I-PRODUCT', 'B-QUANTITY']]]
My questions are (before I dig into details):
Is it possible? And I realized I can't use samples like " Round 23/22; 24,9 x 12,2 x 12,3 ". I need them to be in full sentences.
Where can I find more info about it specifically related to deeppavlov's model(s)?
How can I train pre-trained deeppavlov model to recognize my custom patterns?
I don't even understand if it is possible but I've decided to give it go and prepared 3 .txt files as "train.txt", "test.txt" and "validation.txt" as described in deeppovlov web page. And I put them under the folder '~/.deeppavlov/downloads/ontonotes/ner_ontonotes_bert_mult'. My dataset looks like this:
Round B-PRODUCT
23/22 I-PRODUCT
24,9 x 12,2 x 12,3 B-QUANTITY
Ring B-PRODUCT
HDFAA I-PRODUCT
12,7 x 10 B-QUANTITY
and so on... This is the code I am trying to train it:
import os
# Force tensorflow to use CPU instead of GPU.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
from deeppavlov import configs, train_model
from deeppavlov.core.commands.utils import parse_config
config_dict = parse_config(configs.ner.ner_ontonotes_bert_mult)
print(config_dict['dataset_reader']['data_path'])
from deeppavlov import configs, train_model
ner_model = train_model(configs.ner.ner_ontonotes_bert_mult)
But I am getting this error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [3] rhs shape= [37]
[[{{node save/Assign_280}}]]
Full traceback:
2019-09-26 15:50:27.63 ERROR in 'deeppavlov.core.common.params'['params'] at line 110: Exception in <class 'deeppavlov.models.bert.bert_ner.BertNerModel'>
Traceback (most recent call last):
File "/home/custom_user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/home/custom_user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/custom_user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [3] rhs shape= [37]
[[{{node save/Assign_280}}]]
UPDATE 2:
And I realized I can't use samples like " Round 23/22; 24,9 x 12,2 x 12,3 ". I need them to be in full sentences.
UPDATE:
It seems like this is happening due to my dataset. My custom dataset only has 3 tags (B-PRODUCT, I-PRODUCT and B-QUANTITY) but the pre-trained model has 37 of them. All available tags can be found here under the sentence of "The list of available tags and their descriptions are presented below.". 18 main tags(with B and I 36 tags), and O tag (“O” means the absence of entity.)). Total of all of the 37 tags needs to be present in the dataset. I was able to pass that error by adding dummy sentences by tagging them all with the missing tags. This is a terrible workaround since I'm willingly disrupting my own data-set. I'm still looking for a 'logical' way to train...
PS: Now I am getting this error.
Traceback (most recent call last):
File "/home/custom_user/.PyCharm2019.2/config/scratches/scratch_9.py", line 13, in <module>
ner_model = train_model(configs.ner.ner_ontonotes_bert_mult)
File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/__init__.py", line 31, in train_model
train_evaluate_model_from_config(config, download=download, recursive=recursive)
File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/commands/train.py", line 121, in train_evaluate_model_from_config
trainer.train(iterator)
File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 294, in train
self.train_on_batches(iterator)
File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 234, in train_on_batches
self._validate(iterator)
File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 150, in _validate
metrics = list(report['metrics'].items())
AttributeError: 'NoneType' object has no attribute 'items'

There are at least two problems here:
1. instead of validation.txt there should be a valid.txt file;
2. you are trying to retrain a model that was pretrained on a different dataset with a different set of tags, it's not necessary.
To train your model from scratch you can do something like:
import json
from deeppavlov import configs, build_model, train_model
with configs.ner.ner_ontonotes_bert_mult.open(encoding='utf8') as f:
ner_config = json.load(f)
ner_config['dataset_reader']['data_path'] = '~/my_data_dir/' # directory with train.txt, valid.txt and test.txt files
ner_config['metadata']['variables']['NER_PATH'] = '~/where_to_save_the_model/'
ner_config['metadata']['download'] = [ner_config['metadata']['download'][-1]] # do not download the pretrained ontonotes model
ner_model = train_model(ner_config, download=True)
The other thing that could go wrong is tokenization: "Round 23/22; 24,9 x 12,2 x 12,3" will be split by the model to ['Round', '23', '/', '22', ';', '24', ',', '9', 'x', '12', ',', '2', 'x', '12', ',', '3'] and not ['Round', '23/22', ';', '24,9 x 12,2 x 12,3'].
But you can tokenize your texts beforehand:
ner_model([['Round', '23/22', ';', '24,9 x 12,2 x 12,3']])

I tried deeppavlov training, and successfully trained the 'ner' model
I also got the same error at first while training, then I overcome by researching more about it
things to know before training -
-> you can find the 'ner_ontonotes_bert_multi.json' config file link in deeppavlov doc, which gives the dataset path, pretrained model path , dataset_reader and chain pipe to train
-> there is a pretrained model in the directory mentioned in the 'config' ,by default it is inside 'C:/users/{user_name}/.deeppavlov/' is the root directory and pretrained models are gonna store in 'models' subdirectory
-> when you started training the already trained model is gonna be modified which means, training just try to improve the pre-trained model
so to train and build your own model (by scratch), simply delete the 'models' subdirectory from the '.deeppavlov' path and execute the training

Related

caffe2 inference a onnx model , happend IndexError: Input 475 is undefined

Error:
when I use caffe2 for pretraind model. the model is from https://github.com/onnx/models/blob/master/vision/classification/vgg/model/vgg16-7.onnx
the model I use is pretrained model,
I do not change the model, and not use https://github.com/onnx/optimizer
the code is:
import caffe2
model=onnx.load(vgg16-7.onnx)
prepared_backend=caffe2.python.onnx.backend.prepare(model)
then an error happend:
WARNING:root:This caffe2 python run failed to load cuda module:No module named 'caffe2.python.caffe2_pybind11_state_gpu',and AMD hip module:No module named 'caffe2.python.caffe2_pybind11_state_hip'.Will run in CPU only mode.
WARNING: ONNX Optimizer has been moved to https://github.com/onnx/optimizer.
All further enhancements and fixes to optimizers will be done in this new repo.
The optimizer code in onnx/onnx repo will be removed in 1.9 release.
Traceback (most recent call last):
File "test.py", line 20, in
init_net, predict_net = c2.onnx_graph_to_caffe2_net(onnx_model_proto)
File "/home/eeodev/.local/lib/python3.6/site-packages/caffe2/python/onnx/backend.py", line 921, in onnx_graph_to_caffe2_net
return cls._onnx_model_to_caffe2_net(model, device=device, opset_version=opset_version, include_initializers=True)
File "/home/eeodev/.local/lib/python3.6/site-packages/caffe2/python/onnx/backend.py", line 876, in _onnx_model_to_caffe2_net
onnx_model = onnx.utils.polish_model(onnx_model)
File "/usr/local/lib64/python3.6/site-packages/onnx/utils.py", line 24, in polish_model
model = onnx.optimizer.optimize(model)
File "/usr/local/lib64/python3.6/site-packages/onnx/optimizer.py", line 55, in optimize
optimized_model_str = C.optimize(model_str, passes)
IndexError: Input 475 is undefined!
who can tell the solution?
another,if it is a pytorch model,when conver to onnx model ,we can use torch.onnx.export(model, input, 'model.onnx', verbose=True, keep_initializers_as_inputs=True), by keep_initializers_as_inputs=True , use caffe2 load model will not happen than error. but the model I used is pretrained model ,how to use this method?
I believe it is related to IR gap issue: https://github.com/onnx/onnx/issues/2902.
Currently the deprecated ONNX optimizer in ONNX repo cannot deal with ONNX model which IR_VERSION >=4 if the initializer are not included in model's input.
The workaround is to use the following script to let your model include input from initializer (contributed by #TMVector in GitHub):
def add_value_info_for_constants(model : onnx.ModelProto):
"""
Currently onnx.shape_inference doesn't use the shape of initializers, so add
that info explicitly as ValueInfoProtos.
Mutates the model.
Args:
model: The ModelProto to update.
"""
# All (top-level) constants will have ValueInfos before IRv4 as they are all inputs
if model.ir_version < 4:
return
def add_const_value_infos_to_graph(graph : onnx.GraphProto):
inputs = {i.name for i in graph.input}
existing_info = {vi.name: vi for vi in graph.value_info}
for init in graph.initializer:
# Check it really is a constant, not an input
if init.name in inputs:
continue
# The details we want to add
elem_type = init.data_type
shape = init.dims
# Get existing or create new value info for this constant
vi = existing_info.get(init.name)
if vi is None:
vi = graph.value_info.add()
vi.name = init.name
# Even though it would be weird, we will not overwrite info even if it doesn't match
tt = vi.type.tensor_type
if tt.elem_type == onnx.TensorProto.UNDEFINED:
tt.elem_type = elem_type
if not tt.HasField("shape"):
# Ensure we set an empty list if the const is scalar (zero dims)
tt.shape.dim.extend([])
for dim in shape:
tt.shape.dim.add().dim_value = dim
# Handle subgraphs
for node in graph.node:
for attr in node.attribute:
# Ref attrs refer to other attrs, so we don't need to do anything
if attr.ref_attr_name != "":
continue
if attr.type == onnx.AttributeProto.GRAPH:
add_const_value_infos_to_graph(attr.g)
if attr.type == onnx.AttributeProto.GRAPHS:
for g in attr.graphs:
add_const_value_infos_to_graph(g)
return add_const_value_infos_to_graph(model.graph)

Bugs when fitting Multi label text classification models

I am now trying to fit a classification model for a Multi label text classification problem.
I have a train set X_train that contains list of cleaned text, like
["I am constructing Markov chains with to states and inferring
transition probabilities empirically by simply counting how many
times I saw each transition in my raw data",
"I know the chips only of the players of my table and mine obviously I
also know the total number of chips the max and min amount chips the
players have and the average stackIs it possible to make an
approximation of my probability of winningI have,
...]
and a train multiple tags set y corresponding to each text in X_train, like
[['hypothesis-testing', 'statistical-significance', 'markov-process'],
['probability', 'normal-distribution', 'games'],
...]
Now I want to fit a model that could predict the tags in a text set X_test that has same format as X_train.
I have used the MultiLabelBinarizer to convert the tags and used TfidfVectorizer to convert the cleaned text in train set.
multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(y)
Y = multilabel_binarizer.transform(y)
vectorizer = TfidfVectorizer(stop_words = stopWordList)
vectorizer.fit(X_train)
x_train = vectorizer.transform(X_train)
But when I try to fit the model I always get bugs.I have tried OneVsRestClassifier and LogisticRegression.
When I fit a OneVsRestClassifier model I got bugs like
Traceback (most recent call last):
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
self.process_request(request, client_address)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 348, in process_request
self.finish_request(request, client_address)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 361, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 696, in __init__
self.handle()
File "/usr/local/spark/python/pyspark/accumulators.py", line 268, in handle
poll(accum_updates)
File "/usr/local/spark/python/pyspark/accumulators.py", line 241, in poll
if func():
File "/usr/local/spark/python/pyspark/accumulators.py", line 245, in accum_updates
num_updates = read_int(self.rfile)
File "/usr/local/spark/python/pyspark/serializers.py", line 714, in read_int
raise EOFError
EOFError
When I fit a LogisticRegression model I got bugs like
/opt/conda/envs/data3/lib/python3.6/site-packages/sklearn/linear_model/sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
"the coef_ did not converge", ConvergenceWarning)
Anyone knows where the problem is and how to solve this? Many thanks.
OneVsRestClassifier fits one classifier per class. You need to tell it which type of classifier you want (for example Losgistic regression).
The following code works for me:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(x_train, Y)
X_test= ["I play with Markov chains"]
x_test = vectorizer.transform(X_test)
classifier.predict(x_test)
output: array([[0, 1, 1, 0, 0, 1]])

stacked GRU model in keras

I am willing to create a GRU model of 3 layers where each layer will have 32,16,8 units respectively. The model would take analog calue as input and produce analog value as output.
I have written the following code:
def getAModelGRU(neuron=(10), look_back=1, numInputs = 1, numOutputs = 1):
model = Sequential()
if len(neuron) > 1:
model.add(GRU(units=neuron[0], input_shape=(look_back,numInputs)))
for i in range(1,len(neuron)-1):
model.add(GRU(units=neuron[i]))
model.add(GRU(units=neuron[-1], input_shape=(look_back,numInputs)))
else:
model.add(GRU(units=neuron, input_shape=(look_back,numInputs)))
model.add(Dense(numOutputs))
model.compile(loss='mean_squared_error', optimizer='adam')
return model
And, I will call this function as:
chkEKF = getAModelGRU(neuron=(32,16,8), look_back=1, numInputs=10, numOutputs=6)
And, I obtained the following:
Traceback (most recent call last):
File "/home/momtaz/Dropbox/QuadCopter/quad_simHierErrorCorrectionEstimator.py", line 695, in <module>
Single_Point2Point()
File "/home/momtaz/Dropbox/QuadCopter/quad_simHierErrorCorrectionEstimator.py", line 74, in Single_Point2Point
chkEKF = getAModelGRU(neuron=(32,16,8), look_back=1, numInputs=10, numOutputs=6)
File "/home/momtaz/Dropbox/QuadCopter/rnnUtilQuad.py", line 72, in getAModelGRU
model.add(GRU(units=neuron[i]))
File "/home/momtaz/PycharmProjects/venv/lib/python3.6/site-packages/keras/engine/sequential.py", line 181, in add
output_tensor = layer(self.outputs[0])
File "/home/momtaz/PycharmProjects/venv/lib/python3.6/site-packages/keras/layers/recurrent.py", line 532, in __call__
return super(RNN, self).__call__(inputs, **kwargs)
File "/home/momtaz/PycharmProjects/venv/lib/python3.6/site-packages/keras/engine/base_layer.py", line 414, in __call__
self.assert_input_compatibility(inputs)
File "/home/momtaz/PycharmProjects/venv/lib/python3.6/site-packages/keras/engine/base_layer.py", line 311, in assert_input_compatibility
str(K.ndim(x)))
ValueError: Input 0 is incompatible with layer gru_2: expected ndim=3, found ndim=2
I tried online but did not find any solution for 'ndim' related issue.
Please let me know which I am doing wrong here.
You need to ensure input_shape parameter is being defined in the first layer exclusively, and every layer to have return_sequences=True except potentially the last one (depending on your model).
The code below serves for the common case where you want to stack several layers and only the number of units in each layer changes.
model = tf.keras.Sequential()
gru_options = [dict(units = units,
time_major=False,
kernel_regularizer=0.01,
# ... potentially more options
return_sequences=True) for units in [32,16,8]]
gru_options[0]['input_shape'] = (n_timesteps, n_inputs)
gru_options[-1]['return_sequences']=False # optionally disable sequences in the last layer.
# If you want to return sequences in your last
# layer delete this line, however it is necessary
# if you want to connect this to a dense layer
# for example.
for opts in gru_options:
model.add(tf.keras.layers.GRU(**opts))
model.add(tf.keras.Dense(6))
By the way there is a bug in your code, as no indentation is done after the else clause. Also notice that Python classes that implement the Iterable protocol (e.g. lists and tuples) can be iterated over by using for-in syntax, you don't have to do C-like iteration (it's more idiomatic or pythonic to use the aforementioned syntax).

OCR code written without custom loss function

I am working on OCR model. my final goal is to convert OCR code into coreML and deploy it into ios.
I have looked and run a couple of the github source codes namely:
here
here
as you have a look on them they all implemented loss as a custom layer with lambda layer.
the problem start when I want to convert this to coreML.
my piece of the code to convert to CoreMl:
import coremltools
def convert_lambda(layer):
# Only convert this Lambda layer if it is for our swish function.
if layer.function == ctc_lambda_func:
params = NeuralNetwork_pb2.CustomLayerParams()
# The name of the Swift or Obj-C class that implements this layer.
params.className = "x"
# The desciption is shown in Xcode's mlmodel viewer.
params.description = "A fancy new loss"
return params
else:
return None
print("\nConverting the model:")
# Convert the model to Core ML.
coreml_model = coremltools.converters.keras.convert(
model,
# 'weightswithoutstnlrchangedbackend.best.hdf5',
input_names="image",
image_input_names="image",
output_names="output",
add_custom_layers=True,
custom_conversion_functions={"Lambda": convert_lambda},
)
but it raises error
Converting the model:
Traceback (most recent call last):
File "/home/sgnbx/Downloads/projects/CRNN-with-STN-master/CRNN_with_STN.py", line 201, in <module>
custom_conversion_functions={"Lambda": convert_lambda},
File "/home/sgnbx/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/coremltools/converters/keras/_keras_converter.py", line 760, in convert
custom_conversion_functions=custom_conversion_functions)
File "/home/sgnbx/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/coremltools/converters/keras/_keras_converter.py", line 556, in convertToSpec
custom_objects=custom_objects)
File "/home/sgnbx/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/coremltools/converters/keras/_keras2_converter.py", line 255, in _convert
if input_names[idx] in input_name_shape_dict:
IndexError: list index out of range
Input name length mismatch
I am kind of not sure I can resolve this as I did not find anything relevant to this error to resolve.
In other hand most codes for OCR have Custom Loss function which probably again I face with the same problem.
So in the end I have two question:
Do you know how to resolve this error
my main question do you know any source code for OCR which is in KERAS (As i have to convert it to coreMl) and do not have custom loss function so it will be ok converting to CoreMl without problem?
Thanks in advance:)
just to make my question thorough:
this is the custom loss function in the source I am working:
def ctc_lambda_func(args):
iy_pred, ilabels, iinput_length, ilabel_length = args
# the 2 is critical here since the first couple outputs of the RNN
# tend to be garbage:
iy_pred = iy_pred[:, 2:, :] # no such influence
return backend.ctc_batch_cost(ilabels, iy_pred, iinput_length, ilabel_length)
loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')
([fc_2, labels, input_length, label_length])
and then use it in compile:
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer=sgd)
CoreML doesn't allow you train model, so it's not important to have a loss function or not. If you only want to use CRNN as predictor on iOS , you should just convert base_model in second link.

How to handle NaNs returned from 'roc_curve' before passing to 'auc'?

I am using 'roc_curve' from the metrics model in scikit-learn. The example shows that 'roc_curve' should be called before 'auc' similar to:
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
and then:
metrics.auc(fpr, tpr)
However the following error is returned:
Traceback (most recent call last): File "analysis.py", line 207, in <module>
r = metrics.auc(fpr, tpr) File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/metrics/metrics.py", line 66, in auc
x, y = check_arrays(x, y) File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 215, in check_arrays
_assert_all_finite(array) File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 18, in _assert_all_finite
raise ValueError("Array contains NaN or infinity.") ValueError: Array contains NaN or infinity.
What does it mean in terms or results/is there a way to overcome this?
Are you trying to us roc_curve to evaluate a multiclass classifier? In other words, if you are using roc_curve on a classification problem that is not binary, then this won't work correctly. There is math out there for multidimensional ROC analysis, but the current ROC methods in python don't implement them.
To evaluate multiclass problems trying using methods like: confusion_matrix and classification_report from sklearn, and kappa() from skll.
You state this line:
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
which leads to the conclusion that you may have copied the sklearn example which also uses "pos_label=2".
However, in most cases you want the "pos_label" to be 1. So if your code outputs probabilities and they are between 0 and 1, then your pos_label should be 1.

Resources