When I use MirroredStrategy to train my model in Keras, I get an error that I do not receive when not using MirroredStrategy. Here is some sample code:
# Create a MirroredStrategy.
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# Open a strategy scope.
with strategy.scope():
    # Everything that creates variables should be under the strategy scope.
    # In general this is only model construction & `compile()`.
    model = Model(...)
    model.compile(optimizer=opt,
                  loss=['mean_absolute_error', 'mean_absolute_error'],
                  loss_weights=[l1, l2])

# Train the model on all available devices.
model.fit(train_dataset, validation_data=val_dataset, ...)

# Test the model on all available devices.
model.evaluate(test_dataset)
The error that I receive is TypeError: Input 'y' of 'Equal' Op has type variant that does not match type float32 of argument 'x'.
I believe this error has to do with the loss function. It is important to note that my model has 1 input and 2 outputs.
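For context, with two outputs Keras matches the entries of loss=[...] to the outputs positionally; with named output layers the same wiring can be written explicitly as dicts (the output names 'out_a' and 'out_b' here are hypothetical):

model.compile(optimizer=opt,
              loss={'out_a': 'mean_absolute_error', 'out_b': 'mean_absolute_error'},
              loss_weights={'out_a': l1, 'out_b': l2})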
It seems that upgrading to TensorFlow 2.0 fixed the issue. I am currently using the latest release.
I want to export a roberta-base language model to ONNX format. The model uses RoBERTa embeddings and performs a text classification task.
import os
from typing import List

import torch
import torch.onnx
from torch import nn
import transformers
import onnx
import onnxruntime
from onnxruntime import InferenceSession
From the logs:
pytorch: 1.10.2+cu113
CUDA: False
device: cpu
onnxruntime: 1.10.0
onnx: 1.11.0
PyTorch export
batch_size = 3
model_input = {
    'input_ids': torch.empty(batch_size, 256, dtype=torch.int).random_(32000),
    'attention_mask': torch.empty(batch_size, 256, dtype=torch.int).random_(2),
    'seq_len': torch.empty(batch_size, 1, dtype=torch.int).random_(256)
}
model_file_path = os.path.join("checkpoints", 'model.onnx')

torch.onnx.export(da_inference.model,        # model being run
                  model_input,               # model input (or a tuple for multiple inputs)
                  model_file_path,           # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=11,          # the ONNX version to export the model to
                  operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK,
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names=['input_ids', 'attention_mask', 'seq_len'],  # the model's input names
                  output_names=['output'],   # the model's output names
                  dynamic_axes={'input_ids': {0: 'batch_size'},
                                'attention_mask': {0: 'batch_size'},
                                'seq_len': {0: 'batch_size'},
                                'output': {0: 'batch_size'}},
                  verbose=True)
I know there may be problems converting some ATen (A Tensor Library for C++11) operators if they are included in the model architecture; see PyTorch Model Export to ONNX Failed Due to ATen.
The export succeeds if I set the parameter operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK, which means 'leave ATen operators as-is if they are not supported in ONNX'.
The PyTorch export function gives me the following warnings:
Warning: Unsupported operator ATen. No schema registered for this operator.
Warning: Shape inference does not support models with experimental operators: ATen
It looks like the only ATen operators in the model that are not converted to ONNX are the ones involving the LayerNorm.weight and LayerNorm.bias parameters (I have several layers like that):
%1266 : Float(3, 256, 768, strides=[196608, 768, 1], requires_grad=0, device=cpu) =
onnx::ATen[cudnn_enable=1, eps=1.0000000000000001e-05, normalized_shape=[768], operator="layer_norm"]
(%1265, %model.utterance_rnn.base.encoder.layer.11.output.LayerNorm.weight,
%model.utterance_rnn.base.encoder.layer.11.output.LayerNorm.bias)
# /opt/conda/lib/python3.9/site-packages/torch/nn/functional.py:2347:0
The model check then passes OK:
model = onnx.load(model_file_path)
# Check that the model is well formed
onnx.checker.check_model(model)
# Print a human readable representation of the graph
print(onnx.helper.printable_graph(model.graph))
I can also visualize the computation graph using Netron.
But when I try to perform inference using the exported ONNX model, it stalls with no logs or stdout. This code hangs the system:
model_file_path = os.path.join("checkpoints", "model.onnx")
use_gpu = False  # CUDA is unavailable in this environment (see the logs above)

sess_options = onnxruntime.SessionOptions()
sess_options.log_severity_level = 0

ort_providers: List[str] = ["CUDAExecutionProvider"] if use_gpu else ["CPUExecutionProvider"]
session = InferenceSession(model_file_path, providers=ort_providers, sess_options=sess_options)
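For reference, once a session is created successfully, inference would look something like this (the input names follow the export call above; the int32 dtype matches the torch.int tensors used at export, and the dummy data is an assumption):

import numpy as np

ort_inputs = {
    'input_ids': np.random.randint(0, 32000, size=(3, 256)).astype(np.int32),
    'attention_mask': np.random.randint(0, 2, size=(3, 256)).astype(np.int32),
    'seq_len': np.random.randint(0, 256, size=(3, 1)).astype(np.int32),
}
outputs = session.run(['output'], ort_inputs)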
Are there any suggestions to overcome this problem? From the official documentation, I see that torch.onnx models exported this way are probably runnable only by Caffe2.
These layers are not inside the base frozen RoBERTa model; they are additional layers that I added myself. Is it possible to substitute the offending layers with similar ones and retrain the model?
Or is Caffe2 the best choice here, and onnxruntime will not do the inference?
Update: I retrained the model on the basis of cased BERT embeddings, but the problem persists. The same ATen operators are not converted to ONNX.
It looks like the LayerNorm.weight and LayerNorm.bias parameters appear only in the layers above BERT. So what are your suggestions for changing these layers to enable ONNX export?
Have you tried to export after defining the operator for ONNX? Something along the lines of the following code by Huawei.
On another note, when loading a model you can technically override anything you want: setting a specific layer equal to your own class that inherits from the original keeps the same behavior (input and output) while letting you modify its execution.
You can use this to save the model with the problematic operators changed, convert it to ONNX, and fine-tune it in that form (or even in PyTorch). A minimal sketch of the idea is below.
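This sketch assumes the offending module is a LayerNorm and that the attribute path (taken from the graph dump above) is correct; everything else is standard PyTorch:

import torch
from torch import nn

class OnnxFriendlyLayerNorm(nn.LayerNorm):
    # Inherits the original layer (same parameters, same input/output
    # contract) but re-implements forward with primitive ops that all
    # have standard ONNX mappings, so no ATen fallback is needed.
    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = (x - mean).pow(2).mean(dim=-1, keepdim=True)
        return (x - mean) / torch.sqrt(var + self.eps) * self.weight + self.bias

# Swap the trained layer in place, keeping its weights.
old = model.utterance_rnn.base.encoder.layer[11].output.LayerNorm  # assumed path
new = OnnxFriendlyLayerNorm(old.normalized_shape, eps=old.eps)
new.load_state_dict(old.state_dict())
model.utterance_rnn.base.encoder.layer[11].output.LayerNorm = new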
This generally seems best solved by the ONNX team, so a long-term solution might be to post a request for that specific operator on the GitHub issues page (but that is probably slow).
The best way to go is to rewrite the place in the model that uses these operators in a way that will convert; look at this for reference.
If, for example, the issue is layer norm, then you can write it yourself. Another thing that sometimes helps is not setting the axes as dynamic, since some ops don't support that yet, as in the export sketch below.
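A hedged sketch of that second suggestion: re-running the export from the question with fixed shapes by dropping dynamic_axes (the trade-off is that batch size 3 becomes baked into the graph):

torch.onnx.export(da_inference.model, model_input, model_file_path,
                  export_params=True,
                  opset_version=11,
                  do_constant_folding=True,
                  input_names=['input_ids', 'attention_mask', 'seq_len'],
                  output_names=['output'])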
I have been getting an error when trying to load a model that I trained:
model_path = r'I:\\ECGMODELCP\\0.467-0.840-010-0.408-0.860.reg.hdf5'
model = keras.models.load_model(model_path)
ValueError: Unknown regularizer: l2_cond
I've tried:
model = keras.models.load_model(model_path, custom_objects={'l2_cond': l2_cond(weight_matrix)})
But I get an error that weight_matrix is not defined. l2_cond is a custom kernel regularizer that I defined; it depends on the weight matrix of the last layer of my model. Any help is appreciated.
I figured it out already: I just loaded the model's weights into an identical model architecture, as sketched below.
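For reference, that workaround looks something like this (build_model is a hypothetical helper that reconstructs exactly the same architecture, custom regularizer included):

# Rebuild the same architecture, then load only the weights, so Keras
# never has to deserialize the custom regularizer from the HDF5 file.
model = build_model()
model.load_weights(model_path)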
I am working on a CNN that deals with super-resolution. It is required that I extract patches from the image and then train on these small patches (i.e. 41x41).
However, when it comes to predicting, the image is larger than the patches, and Keras doesn't allow me to predict on an image larger than the training images.
I have read Can Keras deal with input images with different size?. I have tried putting None in my network's input shape and then loading the weights. However, when it comes to this line: c1 = PReLU()(c1), I get the error: int() argument must be a string, a bytes-like object or a number, not 'NoneType'. The code is attached below.
How can I fix this problem? I am using Keras with the TensorFlow backend. I have no fully connected layers; all layers are Conv2D with ReLU, except for c1 in the snippet below, which uses PReLU.
Thanks.
input_shape = (None, None, 1)
x = Input(shape=input_shape)
c1 = Conv2D(64, (3, 3), kernel_initializer='he_normal', padding='same', name='Conv1')(x)
c1 = PReLU()(c1)
# ............................
output_img = keras.layers.add([x, finalconv])
model = Model(x, output_img)
"Keras doesn't allow me to predict an image of larger size than the training images"
This is wrong; Keras does allow it when your network is designed properly.
"However, when it comes to this line: c1 = PReLU()(c1), I get the error: int() argument must be a string, a bytes-like object or a number, not 'NoneType'."
This error is expected because your input shape contains None. If you had set shared_axes=[1,2] for PReLU (the default is shared_axes=None), you would not see this error.
The real issue is that PReLU's parameters were created for a fixed 41x41 input but are now being asked to work for an arbitrary input size.
The best solution is to train a new model with input_shape = (None, None, 1) directly, as sketched below.
If you don't care about possible degradation, you can instead load all layer weights of your pretrained model except for the PReLU layer, manually compute PReLU parameters that can be shared across shared_axes=[1,2] (for example by averaging the pretrained per-pixel slopes), and use them as the new PReLU parameters.
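A minimal sketch of the retrained variant, assuming the network is fully convolutional and single-channel as in the question:

from keras.layers import Input, Conv2D, PReLU

input_shape = (None, None, 1)
x = Input(shape=input_shape)
c1 = Conv2D(64, (3, 3), kernel_initializer='he_normal', padding='same', name='Conv1')(x)
# shared_axes=[1, 2] learns one slope per channel instead of one per pixel,
# so the parameter shapes no longer depend on the spatial size.
c1 = PReLU(shared_axes=[1, 2])(c1)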
I have just started with TensorFlow. I was checking out the MusicGenerator available at https://github.com/Conchylicultor/MusicGenerator and I am getting this error:
ValueError: Trying to share variable
rnn_decoder/KeyboardCell/Decoder/multi_rnn_cell/cell_0/basic_lstm_cell/kernel,
but specified shape (1024, 2048) and found shape (525, 2048).
I think this might be due to a variable shared between the encoder and the decoder of the cell. The main code was written for TensorFlow 0.10.0, but I am trying to run it on TensorFlow 1.3.
def __call__(self, prev_keyboard, prev_state, scope=None):
    """ Run the cell at step t
    Args:
        prev_keyboard: keyboard configuration for the step t-1 (Ground truth or previous step)
        prev_state: a tuple (prev_state_enco, prev_state_deco)
        scope: TensorFlow scope
    Return:
        Tuple: the keyboard configuration and the enco and deco states
    """
    # First time only (we do the initialisation here to be on the global rnn loop scope)
    if not self.is_init:
        with tf.variable_scope('weights_keyboard_cell'):
            # TODO: With self.args, see which network we have chosen (create map 'network name':class)
            self.encoder.build()
            self.decoder.build()
            prev_state = self.encoder.init_state(), self.decoder.init_state()
            self.is_init = True

    # TODO: If encoder act as VAE, we should sample here, from the previous state

    # Encoder/decoder network
    with tf.variable_scope(scope or type(self).__name__):
        with tf.variable_scope('Encoder'):
            # TODO: Should be enco_output, enco_state
            next_state_enco = self.encoder.get_cell(prev_keyboard, prev_state)
        with tf.variable_scope('Decoder'):  # Reset gate and update gate.
            next_keyboard, next_state_deco = self.decoder.get_cell(prev_keyboard, (next_state_enco, prev_state[1]))
    return next_keyboard, (next_state_enco, next_state_deco)
I am completely new to RNNs and CNNs. I have been reading about them and understand at a high level how they work, and I understand how some parts of the code work during training and modelling, but not well enough to debug this, especially because I am still a bit confused by the TensorFlow API.
It would be great to know why this might be happening and what I can do to fix it. I would also appreciate pointers to some books on CNNs, RNNs, backpropagation, and how to use TensorFlow effectively to build things.
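For context, this class of error can be reproduced in TF 1.x whenever two cells try to build a kernel under the same variable scope with different input widths; a minimal, hypothetical sketch (not the MusicGenerator code):

import tensorflow as tf  # TensorFlow 1.x

num_units = 512
cell_a = tf.nn.rnn_cell.BasicLSTMCell(num_units)
cell_b = tf.nn.rnn_cell.BasicLSTMCell(num_units)

with tf.variable_scope('demo'):
    # Builds a kernel of shape (13 + 512, 4 * 512) = (525, 2048).
    cell_a(tf.zeros([1, 13]), cell_a.zero_state(1, tf.float32))
with tf.variable_scope('demo', reuse=True):
    # Requests a kernel of shape (512 + 512, 2048) = (1024, 2048) under the
    # same scope, raising: Trying to share variable demo/basic_lstm_cell/kernel,
    # but specified shape (1024, 2048) and found shape (525, 2048).
    cell_b(tf.zeros([1, 512]), cell_b.zero_state(1, tf.float32))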
When making a DNN regressor and predicting the values with
print(list(estimator.predict({"p": np.array([[0.,0.],[1.,0.],[0.,1.],[1.,1.]])})))
this is the output of the console:
WARNING:tensorflow:From "...\tensorflow\contrib\learn\python\learn\estimators\dnn.py":692: calling BaseEstimator.predict (from tensorflow.contrib.learn.python.learn.estimators.estimator) with x is deprecated and will be removed after 2016-12-01.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
est = Estimator(...) -> est = SKCompat(Estimator(...))
So I went to line 692 of dnn.py, and this is what I found:
preds = super(DNNRegressor, self).predict(
    x=x,
    input_fn=input_fn,
    batch_size=batch_size,
    outputs=[key],
    as_iterable=as_iterable)
So, following the advice from the warning, and assuming that super(DNNRegressor, self) is an Estimator, I did:
preds = estimator.SKCompat(super(DNNRegressor, self)).predict(...)
But doing that I get:
TypeError: predict() got an unexpected keyword argument 'input_fn'
which looks like it's not a TensorFlow error.
The problem is that I don't know how to get rid of the warning (it's a warning, not an error).
This portion of the GitHub tree is under active development. I expect this warning message to go away once the Estimator class is moved into tf.core, which is scheduled for version r1.1. I found the 2017 TensorFlow Dev Summit video by Martin Wicke very informative on the future plans for high-level TensorFlow.
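For what it's worth, until that lands there are two warning-free routes, sketched here under the assumption that estimator is the contrib DNNRegressor and the feature key is "p" as in the question:

import numpy as np
import tensorflow as tf

x_pred = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]], dtype=np.float32)

# Option 1: wrap the estimator instance itself (not super()) in SKCompat
# to keep the x/batch_size interface.
preds = tf.contrib.learn.SKCompat(estimator).predict({'p': x_pred})

# Option 2: switch to input_fn, the interface Estimator will keep.
def predict_input_fn():
    return {'p': tf.constant(x_pred)}, None

preds = list(estimator.predict(input_fn=predict_input_fn))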