I have just started with TensorFlow. I was checking out the MusicGenerator available at https://github.com/Conchylicultor/MusicGenerator
I am getting error:
ValueError: Trying to share variable
rnn_decoder/KeyboardCell/Decoder/multi_rnn_cell/cell_0/basic_lstm_cell/kernel,
but specified shape (1024, 2048) and found shape (525, 2048).
I think this might be due to some variable shared between the encoder and decoder of the cell. The main code was written for tensorflow 0.10.0 but I am trying to run on Tensorflow 1.3
def __call__(self, prev_keyboard, prev_state, scope=None):
""" Run the cell at step t
Args:
prev_keyboard: keyboard configuration for the step t-1 (Ground truth or previous step)
prev_state: a tuple (prev_state_enco, prev_state_deco)
scope: TensorFlow scope
Return:
Tuple: the keyboard configuration and the enco and deco states
"""
# First time only (we do the initialisation here to be on the global rnn loop scope)
if not self.is_init:
with tf.variable_scope('weights_keyboard_cell'):
# TODO: With self.args, see which network we have chosen (create map 'network name':class)
self.encoder.build()
self.decoder.build()
prev_state = self.encoder.init_state(), self.decoder.init_state()
self.is_init = True
# TODO: If encoder act as VAE, we should sample here, from the previous state
# Encoder/decoder network
with tf.variable_scope(scope or type(self).__name__):
with tf.variable_scope('Encoder'):
# TODO: Should be enco_output, enco_state
next_state_enco = self.encoder.get_cell(prev_keyboard, prev_state)
with tf.variable_scope('Decoder'): # Reset gate and update gate.
next_keyboard, next_state_deco = self.decoder.get_cell(prev_keyboard, (next_state_enco, prev_state[1]))
return next_keyboard, (next_state_enco, next_state_deco)
I am completely new to RNNs and CNNs . I have been reading a bit about it as well, understood in a high level way on how these work. And understood how some parts of the code is actually working in training, modelling. But I don't think enough to debug this. And especially because I am a bit confused with the tensorflow API as well.
It would be great why this might be happenning, what I can do to fix it. And also if you can point me to some books for CNN, RNN, Back propogation and how to use tensor flow effectively to build things.
Related
Today I suddenly started getting this error for no apparent reason, while I was running model.fit(). This used to work before, I am using TF 2.3.0, more specifically its Keras module.
The function is called on validation inside a generator, which is fed into model.predict().
Basically, I load a checkpoint, I resume training the network, and I make a prediction on validation.
The error keeps occurring even when training a model from scratch, and erasing all the related data. It's like if something has been hardcoded, somewhere, as I was able to run model.fit() up until a few hours ago.
I saw several solutions like THIS, but none of these variations really work for me, as they lead to more tricky error messages.
I even tried installing a different version of TF, thinking that this was due to some old version, but the error still occurs.
I will answer my own question, as this one was particularly tricky and none of the solutions I found on the internet has worked for me, probably because outdated.
I'll write down just the relevant part to add in the code, feel free to add more technical explanations.
I like using args for passing variables, but it can work without:
from tensorflow.python.keras.backend import set_session
from tensorflow.keras.models import load_model
import generator # custom generator
def main(args):
# open new session and define TF graph
args.sess = tf.compat.v1.Session()
args.graph = tf.compat.v1.get_default_graph()
set_session(args.sess)
# define training generator
train_generator = generator(args.train_data)
# load model
args.model = load_model(args.model_path)
args.model.fit(train_generator)
Then, in the model prediction function:
# In my specific case, the predict_output() function is
# called inside the generator function
def predict_output(args, x):
with args.graph.as_default():
set_session(args.sess)
y = model.predict(x)
return y
What I have done
I'm using the DQN Algorithm in Stable Baselines 3 for a two players board type game. In this game, 40 moves are available, but once one is made, it can't be done again.
I trained my first model with an opponent which would choose randomly its move. If an invalid move is made by the model, I give a negative reward equal to the max score one can obtain and stop the game.
The issue
Once it's was done, I trained a new model against the one I obtained with the first run. Unfortunately, ultimately, the training process gets blocked as the opponent seems to loop an invalid move. Which means that, with all I've tried in the first training, the first model still predicts invalid moves. Here's the code for the "dumb" opponent :
while(self.dumb_turn):
#The opponent chooses a move
chosen_line, _states = model2.predict(self.state, deterministic=True)
#We check if the move is valid or not
while(line_exist(chosen_line, self.state)):
chosen_line, _states = model2.predict(self.state, deterministic=True)
#Once a good move is made, we registered it as a move and add it to the space state
self.state[chosen_line]=1
What I would like to do but don't know how
A solution would be to set manually the Q-values to -inf for the invalid moves so that the opponent avoid those moves, and the training algorithm does not get stuck. I've been told how to access to these values :
import torch as th
from stable_baselines3 import DQN
model = DQN("MlpPolicy", "CartPole-v1")
env = model.get_env()
obs = env.reset()
with th.no_grad():
obs_tensor, _ = model.q_net.obs_to_tensor(obs)
q_values = model.q_net(obs_tensor)
But I don't know how to set them to -infinity.
If somebody could help me, I would be very grateful.
I recently had a similar problem in which I needed to directly alter the q-values produced by the RL model during training in order to influence its actions.
To do this I overwritten some methods of the library:
# Imports
from stable_baselines3.dqn.policies import QNetwork, DQNPolicy
# Override some methods of the class QNetwork used by the DQN model in order to set to a negative value the q-values of
# some actions
# Two possibile methods to override:
# Override _predict ---> alter q-values only during predictions but not during training
# Override forward ---> alter q-values also during training (Attention: here we are working with batches of q-values)
class QNetwork_modified(QNetwork):
def forward(self, obs: th.Tensor) -> th.Tensor:
"""
Predict the q-values.
:param obs: Observation
:return: The estimated Q-Value for each action.
"""
# Compute the q-values using the QNetwork
q_values = self.q_net(self.extract_features(obs))
# For each observation in the training batch:
for i in range(obs.shape[0]):
# Here you can alter q_values[i]
return q_values
# Override the make_q_net method of the DQN policy used by the DQN model to make it use the new DQN network
class DQNPolicy_modified(DQNPolicy):
def make_q_net(self) -> DQNPolicy:
# Make sure we always have separate networks for features extractors etc
net_args = self._update_features_extractor(self.net_args, features_extractor=None)
return QNetwork_modified(**net_args).to(self.device)
model = DQN(DQNPolicy_modified, env, verbose=1)
Personally I don’t like too much this approach, and I would suggest you to try first some “more natural” alternatives, like for examples giving in input to your model also some kind of history of what actions have been already selected, in order to help the model learn that pre-selected actions should be avoided.
For example you could enrich the input for the RL model with an additional binary mask where the moves already chosen have their corresponding bit set to 1. (In this case you should modify the gym environment).
When I use MirroredStrategy to train my model in Keras I get an error which I do not receive when not using MirroredStrategy. Here is some sample code
# Create a MirroredStrategy.
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
# Open a strategy scope.
with strategy.scope():
# Everything that creates variables should be under the strategy scope.
# In general this is only model construction & `compile()`.
model = Model(...)
model.compile(optimizer=opt, loss=['mean_absolute_error', 'mean_absolute_error'], loss_weights = [l1,l2])
# Train the model on all available devices.
model.fit(train_dataset, validation_data=val_dataset, ...)
# Test the model on all available devices.
model.evaluate(test_dataset)
The error that I receive is TypeError: Input 'y' of 'Equal' Op has type variant that does not match type float32 of argument 'x'.
I believe this error has to do with the loss function. It is important to note that I have 1 input and 2 outputs for my model.
Seems that upgrading tensorflow 2.0 fixed the issue. Currently using the latest release.
I have an existing model where I load some pre-trained weights and then do prediction (one image at a time) in pytorch. I am trying to basically convert it to a pytorch lightning module and am confused about a few things.
So currently, my __init__ method for the model looks like this:
self._load_config_file(cfg_file)
# just creates the pytorch network
self.create_network()
self.load_weights(weights_file)
self.cuda(device=0) # assumes GPU and uses one. This is probably suboptimal
self.eval() # prediction mode
What I can gather from the lightning docs, I can pretty much do the same, except not to do the cuda() call. So something like:
self.create_network()
self.load_weights(weights_file)
self.freeze() # prediction mode
So, my first question is whether this is the correct way to use lightning? How would lightning know if it needs to use the GPU? I am guessing this needs to be specified somewhere.
Now, for the prediction, I have the following setup:
def infer(frame):
img = transform(frame) # apply some transformation to the input
img = torch.from_numpy(img).float().unsqueeze(0).cuda(device=0)
with torch.no_grad():
output = self.__call__(Variable(img)).data.cpu().numpy()
return output
This is the bit that has me confused. Which functions do I need to override to make a lightning compatible prediction?
Also, at the moment, the input comes as a numpy array. Is that something that would be possible from the lightning module or do things always have to use some sort of a dataloader?
At some point, I want to extend this model implementation to do training as well, so want to make sure I do it right but while most examples focus on training models, a simple example of just doing prediction at production time on a single image/data point might be useful.
I am using 0.7.5 with pytorch 1.4.0 on GPU with cuda 10.1
LightningModule is a subclass of torch.nn.Module so the same model class will work for both inference and training. For that reason, you should probably call the cuda() and eval() methods outside of __init__.
Since it's just a nn.Module under the hood, once you've loaded your weights you don't need to override any methods to perform inference, simply call the model instance. Here's a toy example you can use:
import torchvision.models as models
from pytorch_lightning.core import LightningModule
class MyModel(LightningModule):
def __init__(self):
super().__init__()
self.resnet = models.resnet18(pretrained=True, progress=False)
def forward(self, x):
return self.resnet(x)
model = MyModel().eval().cuda(device=0)
And then to actually run inference you don't need a method, just do something like:
for frame in video:
img = transform(frame)
img = torch.from_numpy(img).float().unsqueeze(0).cuda(0)
output = model(img).data.cpu().numpy()
# Do something with the output
The main benefit of PyTorchLighting is that you can also use the same class for training by implementing training_step(), configure_optimizers() and train_dataloader() on that class. You can find a simple example of that in the PyTorchLightning docs.
Even though above answer suffices, if one takes note of following line
img = torch.from_numpy(img).float().unsqueeze(0).cuda(0)
One has to put both the model as well as image to the right GPU. On multi-gpu inference machine, this becomes a hassle.
To solve this, .predict was also recently produced, see more at https://pytorch-lightning.readthedocs.io/en/stable/deploy/production_basic.html
I am considering to move my code base to tf.estimator.Estimator, but I cannot find an example on how to use it in combination with tensorboard summaries.
MWE:
import numpy as np
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.INFO)
# Declare list of features, we only have one real-valued feature
def model(features, labels, mode):
# Build a linear model and predict values
W = tf.get_variable("W", [1], dtype=tf.float64)
b = tf.get_variable("b", [1], dtype=tf.float64)
y = W*features['x'] + b
loss = tf.reduce_sum(tf.square(y - labels))
# Summaries to display for TRAINING and TESTING
tf.summary.scalar("loss", loss)
tf.summary.image("X", tf.reshape(tf.random_normal([10, 10]), [-1, 10, 10, 1])) # dummy, my inputs are images
# Training sub-graph
global_step = tf.train.get_global_step()
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = tf.group(optimizer.minimize(loss), tf.assign_add(global_step, 1))
return tf.estimator.EstimatorSpec(mode=mode, predictions=y,loss= loss,train_op=train)
estimator = tf.estimator.Estimator(model_fn=model, model_dir='/tmp/tf')
# define our data set
x=np.array([1., 2., 3., 4.])
y=np.array([0., -1., -2., -3.])
input_fn = tf.contrib.learn.io.numpy_input_fn({"x": x}, y, 4, num_epochs=1000)
for epoch in range(10):
# train
estimator.train(input_fn=input_fn, steps=100)
# evaluate our model
estimator.evaluate(input_fn=input_fn, steps=10)
How can I display my two summaries in tensorboard? Do I have to register a hook in which I use a tf.summary.FileWriter or something else?
EDIT:
Upon testing (in v1.1.0, and probably in later versions as well), it is apparent that tf.estimator.Estimator will automatically write summaries for you. I confirmed this using OP's code and tensorboard.
(Some poking around r1.4 leads me to conclude that this automatic summary writing occurs due to tf.train.MonitoredTrainingSession.)
Ultimately, the automatic summarizing is accomplished with the use of hooks, so if you wanted to customize the Estimator's default summarizing, you could do so using hooks. Below are the (edited) details from the original answer.
You'll want to use hooks, formerly known as monitors. (Linked is a conceptual/quickstart guide; the short of it is that the notion of hooking into / monitoring training is built into the Estimator API. A bit confusingly, though, it doesn't seem like the deprecation of monitors for hooks is really documented except in a deprecation annotation in the actual source code...)
Based on your usage, it looks like r1.2's SummarySaverHook fits your bill.
summary_hook = tf.train.SummarySaverHook(
SAVE_EVERY_N_STEPS,
output_dir='/tmp/tf',
summary_op=tf.summary.merge_all())
You may want to customize the hook's initialization parameters, as by providing an explicity SummaryWriter or writing every N seconds instead of N steps.
If you pass this into the EstimatorSpec, you'll get your customized Summary behavior:
return tf.estimator.EstimatorSpec(mode=mode, predictions=y,loss=loss,
train_op=train,
training_hooks=[summary_hook])
EDIT NOTE:
A previous version of this answer suggested passing the summary_hook into estimator.train(input_fn=input_fn, steps=5, hooks=[summary_hook]). This does not work because tf.summary.merge_all() has to be called in the same context as your model graph.
For me this worked without adding any hooks or merge_all calls. I just added some tf.summary.image(...) in my model_fn and when I train the model they magically appear in tensorboard. Not sure what the exact mechanism is, however. I'm using TensorFlow 1.4.
estimator = tf.estimator.Estimator(model_fn=model, model_dir='/tmp/tf')
Code model_dir='/tmp/tf' means estimator write all logs to /tmp/tf, then run tensorboard --log.dir=/tmp/tf, open you browser with url: http://localhost"6006 ,you can see the graphic
You can create a SummarySaverHook with tf.summary.merger_all() as the summary_op in the model_fn itself. Pass this hook to the training_hooks param of the EstimatorSpec constructor in your model_fn.
I don't think what #jagthebeetle said is exactly applicable here. As the hooks that you transfer to the estimator.train method cannot be run for the summaries that you define in your model_fn, since they won't be added to the merge_all op as they remain bounded by the scope of model_fn