create_feature_extractor, nn.Parameter, and DataParallel are not compatible together


Here is the code I run:
import torch
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.models import resnet50, vit_b_16
from torch.nn import DataParallel
return_nodes = {"heads": "0"}
# return_nodes = {"fc": "0"}
device = torch.device("cuda")
backbone = vit_b_16()
backbone = create_feature_extractor(backbone, return_nodes=return_nodes)
model1 = DataParallel(backbone).to(device)
x = torch.rand(2, 3, 256, 256).to(device)
out = model1(x)
and the error message is
File "/home/ubuntu/models/test_param_error.py", line 60, in <module>
out = model1(x)
File "/home/ubuntu/anaconda3/envs/torchenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/torchenv/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/home/ubuntu/anaconda3/envs/torchenv/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 172, in replicate
return replicate(module, device_ids, not torch.is_grad_enabled())
File "/home/ubuntu/anaconda3/envs/torchenv/lib/python3.9/site-packages/torch/nn/parallel/replicate.py", line 148, in replicate
setattr(replica, key, param)
File "/home/ubuntu/anaconda3/envs/torchenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1206, in __setattr__
raise TypeError("cannot assign '{}' as parameter '{}' "
TypeError: cannot assign 'torch.cuda.FloatTensor' as parameter 'class_token' (torch.nn.Parameter or None expected)
My torch and torchvision version:
pytorch 1.11.0 py3.9_cuda11.5_cudnn8.3.2_0
torchvision 0.12.0 py39_cu115
Here is what I found in a multi-GPU setting.
If the model is resnet50 (and the return node is some resnet50 layer), this code runs without error. ViT models, however, define some parameters directly with nn.Parameter (self.class_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))), and that is where the error occurs.
create_feature_extractor, nn.Parameter, DataParallel: any two of them work together, but not all three.
The error message shows that the program tries to assign a plain tensor to the parameter class_token. Without create_feature_extractor, this works. Why does create_feature_extractor change the behavior?
Thanks for your help.
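The check that raises in the traceback can be reproduced on its own, without DataParallel or create_feature_extractor. The sketch below is only an illustration of that nn.Module.__setattr__ rule, not an explanation of the root cause: the class Tiny and its tensor shape are made up for the example. It shows that assigning a plain tensor to a name still registered in _parameters raises the same TypeError, and that the assignment succeeds once the name is no longer registered. My assumption is that the replica built during replication still had class_token registered as a parameter when the traced module was used.

```python
import torch
from torch import nn


class Tiny(nn.Module):
    """Hypothetical module with one directly registered parameter, like ViT's class_token."""

    def __init__(self):
        super().__init__()
        self.class_token = nn.Parameter(torch.zeros(1, 1, 4))


m = Tiny()

raised = False
message = ""
try:
    # A plain tensor (not nn.Parameter) assigned to a registered parameter name.
    m.class_token = torch.zeros(1, 1, 4)
except TypeError as exc:
    raised = True
    message = str(exc)

print(raised, message)

# Once the name is no longer registered as a parameter, the same assignment
# is accepted and the tensor is stored as an ordinary attribute.
del m._parameters["class_token"]
m.class_token = torch.zeros(1, 1, 4)
print("class_token" in m._parameters)
```

If that assumption holds, it would explain why only modules with directly registered parameters (such as class_token) hit the error, while parameters owned by child layers do not.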

Related

cuDNN_STATUS_ALLOC_FAILED when trying to run a tutorial CNN with tensorflow

I am trying to run a simple python script with a convolutional neural network (CNN). Every time I run the script I come across the following error message
2021-03-10 19:47:03.832061: E tensorflow/stream_executor/cuda/cuda_dnn.cc:328] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
Traceback (most recent call last):
File "CNN_trial.py", line 17, in <module>
outputs = tf.nn.conv2d(images,filters,strides = 1,padding = "SAME")
File "D:\miniconda3\envs\tensorflow\lib\site-packages\tensorflow\python\util\dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "D:\miniconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 2158, in conv2d_v2
return conv2d(input, # pylint: disable=redefined-builtin
File "D:\miniconda3\envs\tensorflow\lib\site-packages\tensorflow\python\util\dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "D:\miniconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 2264, in conv2d
return gen_nn_ops.conv2d(
File "D:\miniconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 942, in conv2d
return conv2d_eager_fallback(
File "D:\miniconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 1031, in conv2d_eager_fallback
_result = _execute.execute(b"Conv2D", 1, inputs=_inputs_flat, attrs=_attrs,
File "D:\miniconda3\envs\tensorflow\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]
My system is as follows
Windows 10
AMD Ryzen 7 3700x
16GB RAM
Nvidia RTX 2060
Python 3.8.5
Tensorflow 2.4.1
my full code:
from sklearn.datasets import load_sample_image
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
china = load_sample_image("china.jpg")/255
flower = load_sample_image("flower.jpg")/255
images = np.array([china,flower])
batch_size, height,width,channels = images.shape
filters = np.zeros(shape=(7,7,channels,2),dtype=np.float32)
filters[:,3,:,0] = 1
filters[3,:,:,1] = 1
outputs = tf.nn.conv2d(images,filters,strides = 1,padding = "SAME")
plt.imshow(outputs[0,:,:,1],cmap = "gray")
plt.show()

It seems that I need to enable memory growth. After adding the following two lines to the beginning of the script, I got it to at least run.
devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0],True)
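A slightly more defensive variant of the same idea, as a sketch: it enables memory growth on every visible GPU rather than only devices[0], and does nothing on a machine without a GPU. The helper name enable_memory_growth is hypothetical. It assumes TensorFlow 2.1 or later and must be called before any op touches the GPU.

```python
import tensorflow as tf


def enable_memory_growth():
    """Enable on-demand GPU memory allocation for every visible GPU.

    Returns the number of GPUs that were configured (0 on CPU-only machines).
    Must run before TensorFlow initializes the GPUs.
    """
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


configured = enable_memory_growth()
print("GPUs configured:", configured)
```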

Variable_scope runtime error when creating keras custom layer using tensorflow hub models and tensorflow 2.0 as backend

I'm trying to use the pretrained tf-hub elmo model by integrating it into a keras layer.
Keras Layer:
class ElmoEmbeddingLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(ElmoEmbeddingLayer, self).__init__(**kwargs)
        self.dimensions = 1024
        self.trainable = True
        self.elmo = None

    def build(self, input_shape):
        url = 'https://tfhub.dev/google/elmo/2'
        self.elmo = hub.Module(url)
        self._trainable_weights += trainable_variables(
            scope="^{}_module/.*".format(self.name))
        super(ElmoEmbeddingLayer, self).build(input_shape)

    def call(self, x, mask=None):
        result = self.elmo(
            x,
            signature="default",
            as_dict=True)["elmo"]
        return result

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.dimensions
When I run the code I get the following error:
Traceback (most recent call last):
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 170, in <module>
validation_steps=validation_dataset.size())
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 79, in train_gpu
model = build_model(self.config, self.embeddings, self.sequence_len, self.out_classes, summary=True)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 8, in build_model
return my_model(embeddings, config, sequence_length, out_classes, summary)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 66, in my_model
inputs, embedding = resolve_inputs(embeddings, sequence_length, model_config, input_type)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 19, in resolve_inputs
return elmo_input(model_conf)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 58, in elmo_input
embedding = ElmoEmbeddingLayer()(input_text)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 616, in __call__
self._maybe_build(inputs)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1966, in _maybe_build
self.build(input_shapes)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\custom_layers.py", line 21, in build
self.elmo = hub.Module(url)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow_hub\module.py", line 156, in __init__
abs_state_scope = _try_get_state_scope(name, mark_name_scope_used=False)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow_hub\module.py", line 389, in _try_get_state_scope
"name_scope was already taken." % abs_state_scope)
RuntimeError: variable_scope module/ was unused but the corresponding name_scope was already taken.
It seems to be due to the eager execution behaviour. If I disable eager execution I have to surround the model.fit function within a tensorflow session and initialize the variables by using sess.run(global_variables_initializer()) to avoid the next error:
Traceback (most recent call last):
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 168, in <module>
validation_steps=validation_dataset.size().eval(session=Session()))
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 90, in train_gpu
class_weight=weighted)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\training.py", line 643, in fit
use_multiprocessing=use_multiprocessing)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\training_arrays.py", line 664, in fit
steps_name='steps_per_epoch')
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\training_arrays.py", line 294, in model_iteration
batch_outs = f(actual_inputs)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\backend.py", line 3353, in __call__
run_metadata=self.run_metadata)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\client\session.py", line 1458, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
(0) Failed precondition: Error while reading resource variable module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/class tensorflow::Var does not exist.
[[{{node elmo_embedding_layer/module_apply_default/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/Read/ReadVariableOp}}]]
(1) Failed precondition: Error while reading resource variable module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/class tensorflow::Var does not exist.
[[{{node elmo_embedding_layer/module_apply_default/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/Read/ReadVariableOp}}]]
[[metrics/f1_micro/Identity/_223]]
0 successful operations.
0 derived errors ignored.
My solution:
with Session() as sess:
    sess.run(global_variables_initializer())
    history = model.fit(self.train_data.repeat(),
                        epochs=self.config['epochs'],
                        validation_data=self.validation_data.repeat(),
                        steps_per_epoch=steps_per_epoch,
                        validation_steps=validation_steps,
                        callbacks=self.__callbacks(monitor_metric),
                        class_weight=weighted)
The main question is whether there is another way to use the elmo tf-hub module in a keras custom layer and train my model. A second question is whether my current solution hurts training performance or causes the OOM GPU error (I get the OOM error after a few epochs with a higher batch size, which I've found to be related to sessions not being closed or to memory leaks).

If you wrap your model in a Session() block, you will also have to wrap all other code that uses the model in a Session() block, which takes a lot of time and effort. I deal with it another way.
First, create the elmo module and register a session with keras:
elmo_model = hub.Module("https://tfhub.dev/google/elmo/3", trainable=True,
                        name='elmo_module')
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())
K.set_session(sess)
Then, instead of creating the elmo module directly in your ElmoEmbeddingLayer:
self.elmo = hub.Module(url)
self._trainable_weights += trainable_variables(
    scope="^{}_module/.*".format(self.name))
you can do the following, which I think works normally:
self.elmo = elmo_model
self._trainable_weights += trainable_variables(
    scope="^elmo_module/.*")

Here is a simple solution that worked in my case.
This happened to me when I used a separate Python script to create the module.
To solve it, I passed the tf.Session() from the main script to tf.keras.backend in the other script, by adding an entry point that receives the session before the layer's __init__ is called.
Example:
Main file:
import tensorflow.compat.v1 as tf
from ModuleFile import ModuleLayer
def __main__():
    init_args = [...]
    input = ...
    sess = tf.keras.backend.get_session()
    ModuleLayer.__init_session__(sess)
    module_layer = ModuleLayer(init_args)(input)
Module file:
import tensorflow.compat.v1 as tf
class ModuleLayer(tf.keras.layers.Layer):
    @staticmethod
    def __init_session__(session):
        tf.keras.backend.set_session(session)
    def __init__(self, *args):
        ...
Hope that helps :)

TensorFlow AlphaDropout: rank undefined

I am trying to set-up a neural network using TensorFlow's tf.contrib.nn.alpha_dropout (as implemented in TensorFlow version 1.12.0). Please consider the following example:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected
from tensorflow.contrib.nn import alpha_dropout
import numpy as np
N_data = 100
x_in = tf.placeholder(tf.float32, shape=[None, N_data], name="x_in")
keep_prob = tf.placeholder(tf.float32)
fc = fully_connected(inputs=x_in, num_outputs=N_data)
drop = alpha_dropout(fc, keep_prob=keep_prob)
x_out = fully_connected(inputs=drop, num_outputs=N_data)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    fd = {
        x_in: np.random.rand(2, N_data),
        keep_prob: 0.5,
    }
    output = x_out.eval(feed_dict=fd)
When evaluating the output of the dropout layer, everything seems normal, but when the output from the dropout layer is linked to a second dense layer, I get the following error message:
Traceback (most recent call last):
File "/***/problem_alpha_dropout.py", line 14, in <module>
x_out = fully_connected(inputs=drop, num_outputs=N_data)
File "/***/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "/***/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1854, in fully_connected
outputs = layer.apply(inputs)
File "/***/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 817, in apply
return self.__call__(inputs, *args, **kwargs)
File "/***/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 374, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/***/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 730, in __call__
self._assert_input_compatibility(inputs)
File "/***/anaconda3/envs/TensorFlow/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1465, in _assert_input_compatibility
self.name + ' is incompatible with the layer: '
ValueError: Input 0 of layer fully_connected_1 is incompatible with the layer: its rank is undefined, but the layer requires a defined rank.
This behaviour does not emerge when tf.contrib.nn.alpha_dropout is replaced by tf.nn.dropout (same usage).
Additional information:
TensorFlow version: 1.12.0 (GPU)
Python version: 3.6 (through Anaconda)
OS: Linux Mint

Just specify the shape of the keep_prob placeholder:
keep_prob = tf.placeholder(tf.float32, shape=())

Exception in Keras when trying to use XCeption model as layer in Keras

I am getting an exception in Keras when I try to use a model as a layer. My code looks as follows:
from keras import layers
from keras import applications
from keras import Input
from keras.models import Model
xception_base = applications.Xception(weights=None,
include_top=False)
left_input = Input(shape=(250, 250, 3))
right_input = Input(shape=(250, 250, 3))
left_features = xception_base(left_input)
right_input = xception_base(right_input)
merged_features = layers.concatenate([left_features, right_input], axis=-1)
model = Model([left_input, right_input], merged_features)
Here is the exception I am getting. It is not clear to me from the exception what is going wrong:
Traceback (most recent call last):
File "/home/asattar/workspace/projects/keras-examples/chapter7/MergeTwoModels.py", line 18, in <module>
model = Model([left_input, right_input], merged_features)
File "/usr/local/lib/python2.7/dist-packages/Keras-2.2.4-py2.7.egg/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/Keras-2.2.4-py2.7.egg/keras/engine/network.py", line 93, in __init__
self._init_graph_network(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/Keras-2.2.4-py2.7.egg/keras/engine/network.py", line 224, in _init_graph_network
assert node_index == 0
AssertionError
Can anyone help me with what might be going wrong?
Also there is no error when I do this
model = Model(left_input, left_features)

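Actually, never mind. I realized that I reused a variable name. The line
right_input = xception_base(right_input)
overwrites the right_input placeholder with the output of xception_base, so the tensor I later pass to Model as an input is no longer an Input tensor, making my graph circular. Assigning the result to a separate name (for example right_features) and concatenating that instead fixes it.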

Keras Activation with lambda issue when load_model

I'm trying to perform softmax using the parameter 'axis', and the only way I found was by means of the function lambda. Here is my code, containing an Activation layer with lambda for the softmax:
from keras.models import Model
from keras.layers import Input,Dense,Reshape,Activation
from keras.layers.merge import Multiply,Concatenate
from keras.layers.core import Lambda
from keras.activations import softmax
from keras import backend as K
import numpy as np
N = 6
M = 6
T = 1000
H = 5
# Toy input creation
input = np.concatenate([np.random.normal(np.random.rand(1)[0],1.,(1,N,M)) for t in range(T)],axis=0)
input2 = np.random.rand(T,N,M)
input3 = np.random.rand(T,N,M)
input4 = np.random.rand(T,N,M)
a = np.mean(np.reshape(input,(T,N*M)),axis=1)
a = np.maximum(0.,np.minimum(a,0.9999))
a = np.floor(a*3).astype(int)
a = np.stack([a for i in range(M)],axis=1)
a = np.stack([a for i in range(N)],axis=2)
mix1 = np.concatenate((input2[:,:2,:],input3[:,2:4,:],input4[:,4:,:]),axis=1)
mix2 = np.concatenate((input3[:,:2,:],input4[:,2:4,:],input2[:,4:,:]),axis=1)
mix3 = np.concatenate((input4[:,:2,:],input2[:,2:4,:],input3[:,4:,:]),axis=1)
output = np.choose(a,[mix1,mix2,mix3])
images = np.stack((input2,input3,input4),axis=3)
# models definition
# one general model to be trained and
# one mask model to be used later for testing
input_layer = Input(shape=(N,M))
images_input = Input(shape=(N,M,3))
x = Reshape((N*M,))(input_layer)
x = Dense(H, kernel_initializer='uniform', activation='relu')(x)
x = Dense(N*N*3, kernel_initializer='uniform')(x)
x = Reshape((N,N,3))(x)
masks = Activation(activation=lambda y:softmax(y,axis=3))(x)
output_layer = Multiply()([masks,images_input])
output_layer = Lambda(lambda x:K.sum(x,axis=3))(output_layer)
model = Model(inputs=[input_layer,images_input],outputs=output_layer)
mask_model = Model(inputs=input_layer,outputs=masks)
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')
# Fit the model
history = model.fit([input,images], output, epochs=200, batch_size=50)
#save models
model.save('test.h5')
mask_model.save('mask_test.h5')
It works fine during training, but when I try to load the file, it fails:
from keras.models import load_model
mask_model = load_model('mask_test.h5')
I get the error:
Traceback (most recent call last):
File "/home/kresch/general2.py", line 3, in <module>
mask_model = load_model('mask_test.h5')
File "/opt/anaconda3/envs/tensorflow/lib/python3.5/site-packages/keras/models.py", line 246, in load_model
model = model_from_config(model_config, custom_objects=custom_objects)
File "/opt/anaconda3/envs/tensorflow/lib/python3.5/site-packages/keras/models.py", line 314, in model_from_config
return layer_module.deserialize(config, custom_objects=custom_objects)
File "/opt/anaconda3/envs/tensorflow/lib/python3.5/site-packages/keras/layers/__init__.py", line 54, in deserialize
printable_module_name='layer')
File "/opt/anaconda3/envs/tensorflow/lib/python3.5/site-packages/keras/utils/generic_utils.py", line 140, in deserialize_keras_object
list(custom_objects.items())))
File "/opt/anaconda3/envs/tensorflow/lib/python3.5/site-packages/keras/engine/topology.py", line 2450, in from_config
process_layer(layer_data)
File "/opt/anaconda3/envs/tensorflow/lib/python3.5/site-packages/keras/engine/topology.py", line 2419, in process_layer
custom_objects=custom_objects)
File "/opt/anaconda3/envs/tensorflow/lib/python3.5/site-packages/keras/layers/__init__.py", line 54, in deserialize
printable_module_name='layer')
File "/opt/anaconda3/envs/tensorflow/lib/python3.5/site-packages/keras/utils/generic_utils.py", line 142, in deserialize_keras_object
return cls.from_config(config['config'])
File "/opt/anaconda3/envs/tensorflow/lib/python3.5/site-packages/keras/engine/topology.py", line 1242, in from_config
return cls(**config)
File "/opt/anaconda3/envs/tensorflow/lib/python3.5/site-packages/keras/layers/core.py", line 287, in __init__
self.activation = activations.get(activation)
File "/opt/anaconda3/envs/tensorflow/lib/python3.5/site-packages/keras/activations.py", line 81, in get
return deserialize(identifier)
File "/opt/anaconda3/envs/tensorflow/lib/python3.5/site-packages/keras/activations.py", line 73, in deserialize
printable_module_name='activation function')
File "/opt/anaconda3/envs/tensorflow/lib/python3.5/site-packages/keras/utils/generic_utils.py", line 160, in deserialize_keras_object
':' + function_name)
ValueError: Unknown activation function:<lambda>
Process finished with exit code 1
The same happens for:
model = load_model('test.h5')
Am I using the lambda function wrong? Or (better) is there a way I can avoid using the lambda function?

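Try a custom activation defined as a named function instead of a lambda, then pass it when loading the model. custom_objects expects a dict mapping the saved name to the object (softmax_axis3 here is a placeholder for whatever you name your function):
load_model('test.h5', custom_objects={'softmax_axis3': softmax_axis3})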
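For reference, this is what the lambda in the question computes, written as a small NumPy sketch: a softmax taken along one chosen axis, so that the three mask values at each (row, column) position sum to 1. The function name and the random input are illustrative only. Keras also ships a Softmax layer that accepts an axis argument, which may let you drop the lambda entirely; check that your installed version provides it.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)  # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


N = 6
# Same layout as the tensor after Reshape((N, N, 3)): (batch, N, N, 3).
logits = np.random.rand(2, N, N, 3)
masks = softmax(logits, axis=3)

print(masks.shape)
print(np.allclose(masks.sum(axis=3), 1.0))
```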
