CUDA out of memory in loop on second forward pass? - pytorch

I have the following problem:
model = MyModel()
model.load_state_dict(checkpoint['weights'])
model.train()
data, label = get_data()  # just take one training example
data = data.cuda()
for i in range(10):  # let's predict data 10 times
    output = model(data)
    print(i)
I can do one forward pass (i prints 0), but then I get a CUDA OOM error at output = model(data).
It works if I use with torch.no_grad(), but I want to train in this forward loop, so I need the gradients later.
Do I somehow have to clear the earlier output or data so it doesn't take up my VRAM?
output = model(data.clone())  # this doesn't fix the problem

for i in range(10):
    model.zero_grad()  # doesn't work either
    output = model(data)

OK, you have to manually remove the output tensor.
It is kept alive even after the loop starts over, so the old computation graph is still in memory while the next forward pass builds a new one.
This is fixable by:
for i in range(10):
    model.zero_grad()
    output = model(data)
    del output
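For reference: in an actual training loop the del is usually unnecessary, because calling backward() frees the graph's intermediate buffers on each iteration. A minimal sketch, assuming a hypothetical criterion and optimizer (neither is in the original question):

for i in range(10):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, label)  # criterion and optimizer are assumed here
    loss.backward()                  # frees this iteration's graph buffers
    optimizer.step()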

Related

Trying to accumulate gradients in Pytorch, but getting RuntimeError when calling loss.backward

I'm trying to train a model in Pytorch, and I'd like to have a batch size of 8, but due to memory limitations, I can only have a batch size of at most 4. I've looked all around and read a lot about accumulating gradients, and it seems like the solution to my problem.
However, I seem to have trouble implementing it. Every time I run the code I get RuntimeError: Trying to backward through the graph a second time. I don't understand why since my code looks like all these other examples I've seen (unless I'm just missing something major):
https://stackoverflow.com/a/62076913/1227353
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/20
One caveat is that the labels for my images are all different sizes, so I can't send the output batch and the label batch into the loss function together; I have to iterate over them in pairs. This is what an epoch looks like (it's been pared down for the sake of brevity):
# labels_batch contains labels of different sizes
for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
    outputs_batch = model(inputs_batch)
    # have to do this because labels can't be stacked into a tensor
    for output, label in zip(outputs_batch, labels_batch):
        output_scaled = interpolate(...)  # make output match label size
        loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
        loss.backward()
    if batch_idx % 2 == 1:
        optimizer.step()
        optimizer.zero_grad()
Is there something I'm missing? If I do the following I also get an error:
# labels_batch contains labels of different sizes
for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
    outputs_batch = model(inputs_batch)
    # CHANGE: we're gonna accumulate losses manually
    batch_loss = 0
    # have to do this because labels can't be stacked into a tensor
    for output, label in zip(outputs_batch, labels_batch):
        output_scaled = interpolate(...)  # make output match label size
        loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
        batch_loss += loss  # CHANGE: accumulate!
    # CHANGE: do backprop outside for loop
    batch_loss.backward()
    if batch_idx % 2 == 1:
        optimizer.step()
        optimizer.zero_grad()
The error I get in this case is RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. This happens when the next epoch starts though... (INCORRECT, SEE EDIT BELOW)
How can I train my model with gradient accumulation? Or am I doomed to train with a batch size of 4 or less?
Oh and as a side question, does the location of where I put loss.backward() affect what I need to normalize the loss by? Or is it always normalized by BATCH_SIZE * 2?
EDIT:
The second code segment was getting an error because I was doing torch.set_grad_enabled(phase == 'train') but had forgotten to wrap the call to batch_loss.backward() in an if phase == 'train'... my bad.
So now the second segment of code seems to work and do gradient accumulation, but why doesn't the first bit of code work? It feels equivalent to setting BATCH_SIZE to 1. Furthermore, I'm creating a new loss object each time, so shouldn't the calls to backward() operate on entirely different graphs?
It seems you have two issues here: you said you couldn't have batch_size=8 because of memory limitations, but you later state that your labels are not of the same size. The latter seems much more important than the former. Anyway, I will try to answer your questions as best I can.
How can I train my model with gradient accumulation? Or am I doomed to train with a batch size of 4 or less?
You want to call .backward() on every loop cycle, otherwise the batch will have no effect on the training. You can then call step() and zero_grad() only when batch_idx % 2 is truthy (i.e. on every other batch).
Here's an example which accumulates the gradient, not the loss:
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

model = nn.Linear(10, 3)
optim = torch.optim.SGD(model.parameters(), lr=0.1)

ds = TensorDataset(torch.rand(100, 10), torch.rand(100, 3))
dl = DataLoader(ds, batch_size=4)

for i, (x, y) in enumerate(dl):
    y_hat = model(x)
    loss = F.l1_loss(y_hat, y) / 2
    loss.backward()
    if i % 2:
        optim.step()
        optim.zero_grad()
Note this approach is different from accumulating the loss and back-propagating only once all batches (or part of them) have gone through the network. In the example above we backpropagate every 4 datapoints and update the model every 8 datapoints.
Oh and as a side question, does the location of where I put loss.backward() affect what I need to normalize the loss by? Or is it always normalized by BATCH_SIZE * 2?
Usually torch's built-in losses have reduction='mean' set by default. This means the loss gets averaged over all batch elements that contributed to it, so this will depend on your loss implementation.
However, if you are using gradient accumulation, then yes, you will need to divide your loss by the number of accumulation steps (here loss = F.l1_loss(y_hat, y) / 2), since your gradients will be accumulated twice.
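For example, here is a quick sanity check (a sketch, not part of the original answer) that accumulating over two half-batches, each loss divided by 2, reproduces the full-batch gradient under reduction='mean':

import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 3)
x, y = torch.rand(8, 10), torch.rand(8, 3)

# Gradient from one full batch of 8
F.l1_loss(model(x), y).backward()
full_grad = model.weight.grad.clone()
model.zero_grad()

# Gradients accumulated over two batches of 4, each loss divided by 2
for xb, yb in ((x[:4], y[:4]), (x[4:], y[4:])):
    (F.l1_loss(model(xb), yb) / 2).backward()

print(torch.allclose(full_grad, model.weight.grad))  # True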
To read more about this, I recommend taking a look at this other SO post.

Truncated backpropagation in PyTorch (code check)

I am trying to implement truncated backpropagation through time in PyTorch, for the simple case where K1=K2. I have an implementation below that produces reasonable output, but I just want to make sure it is correct. When I look online for PyTorch examples of TBPTT, they do inconsistent things around detaching the hidden state and zeroing out the gradient, and the ordering of these operations. Please let me know if I have made a mistake.
In the code below, H maintains the current hidden state, and model(weights, H, x) outputs the prediction and the new hidden state.
i = 0
error = 0
while i < NUM_STEPS:
    # Grab x, y for ith datapoint
    x = data[i]
    target = true_output[i]

    # Run model
    output, new_hidden = model(weights, H, x)
    H = new_hidden

    # Update running error
    error += (output - target)**2

    if (i+1) % K == 0:
        # Backpropagate
        error.backward()
        opt.step()
        opt.zero_grad()
        error = 0
        H = H.detach()

    i += 1
So the idea of your code is to isolate the last variables after each Kth step. Yes, your implementation is absolutely correct, and this answer confirms that.
# truncated to the last K timesteps
i = 0
while i < NUM_STEPS:
    out = model(out)
    if (i+1) % K == 0:
        out.backward()
        out = out.detach()  # detach() is not in-place, so reassign
    i += 1
You can also follow this example for your reference.
import torch
from ignite.engine import Engine, EventEnum, _prepare_batch
from ignite.utils import apply_to_tensor


class Tbptt_Events(EventEnum):
    """Additional tbptt events.

    Additional events for truncated backpropagation through time dedicated
    trainer.
    """

    TIME_ITERATION_STARTED = "time_iteration_started"
    TIME_ITERATION_COMPLETED = "time_iteration_completed"


def _detach_hidden(hidden):
    """Cut backpropagation graph.

    Auxiliary function to cut the backpropagation graph by detaching the hidden
    vector.
    """
    return apply_to_tensor(hidden, torch.Tensor.detach)


def create_supervised_tbptt_trainer(
    model, optimizer, loss_fn, tbtt_step, dim=0, device=None, non_blocking=False, prepare_batch=_prepare_batch
):
    """Create a trainer for truncated backprop through time supervised models.

    Training a recurrent model on long sequences is computationally intensive,
    as it requires processing the whole sequence before getting a gradient.
    However, when the training loss is computed over many outputs
    (`X to many <https://karpathy.github.io/2015/05/21/rnn-effectiveness/>`_),
    there is an opportunity to compute a gradient over a subsequence. This is
    known as
    `truncated backpropagation through time <https://machinelearningmastery.com/
    gentle-introduction-backpropagation-time/>`_.
    This supervised trainer applies a gradient optimization step every
    `tbtt_step` time steps of the sequence, while backpropagating through the
    same `tbtt_step` time steps.

    Args:
        model (`torch.nn.Module`): the model to train.
        optimizer (`torch.optim.Optimizer`): the optimizer to use.
        loss_fn (torch.nn loss function): the loss function to use.
        tbtt_step (int): the length of time chunks (last one may be smaller).
        dim (int): axis representing the time dimension.
        device (str, optional): device type specification (default: None).
            Applies to batches.
        non_blocking (bool, optional): if True and this copy is between CPU and GPU,
            the copy may occur asynchronously with respect to the host. For other
            cases, this argument has no effect.
        prepare_batch (callable, optional): function that receives `batch`, `device`,
            `non_blocking` and outputs tuple of tensors `(batch_x, batch_y)`.

    .. warning::
        The internal use of `device` has changed.
        `device` will now *only* be used to move the input data to the correct device.
        The `model` should be moved by the user before creating an optimizer.
        For more information see:

        * `PyTorch Documentation <https://pytorch.org/docs/stable/optim.html#constructing-it>`_
        * `PyTorch's Explanation <https://github.com/pytorch/pytorch/issues/7844#issuecomment-503713840>`_

    Returns:
        Engine: a trainer engine with supervised update function.
    """

    def _update(engine, batch):
        loss_list = []
        hidden = None

        x, y = batch
        for batch_t in zip(x.split(tbtt_step, dim=dim), y.split(tbtt_step, dim=dim)):
            x_t, y_t = prepare_batch(batch_t, device=device, non_blocking=non_blocking)
            # Fire event for start of iteration
            engine.fire_event(Tbptt_Events.TIME_ITERATION_STARTED)
            # Forward, backward and optimize
            model.train()
            optimizer.zero_grad()
            if hidden is None:
                y_pred_t, hidden = model(x_t)
            else:
                hidden = _detach_hidden(hidden)
                y_pred_t, hidden = model(x_t, hidden)
            loss_t = loss_fn(y_pred_t, y_t)
            loss_t.backward()
            optimizer.step()

            # Setting state of engine for consistent behaviour
            engine.state.output = loss_t.item()
            loss_list.append(loss_t.item())

            # Fire event for end of iteration
            engine.fire_event(Tbptt_Events.TIME_ITERATION_COMPLETED)

        # return average loss over the time splits
        return sum(loss_list) / len(loss_list)

    engine = Engine(_update)
    engine.register_events(*Tbptt_Events)
    return engine
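For reference, a minimal usage sketch of the trainer above (the model, data and hyperparameters here are assumptions, not part of the original answer):

import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.RNN(input_size=8, hidden_size=3, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.functional.mse_loss

# 32 sequences of 20 time steps; targets match the RNN's output shape
ds = TensorDataset(torch.rand(32, 20, 8), torch.rand(32, 20, 3))
dl = DataLoader(ds, batch_size=4)

trainer = create_supervised_tbptt_trainer(model, optimizer, loss_fn, tbtt_step=4, dim=1)
trainer.run(dl, max_epochs=2)  # backprops every 4 steps, detaching the hidden state between chunks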

How to fix a 'OutOfRangeError: End of sequence' error when training a CNN with tensorflow?

I am trying to train a CNN using my own dataset. I've been using tfrecord files and the tf.data.TFRecordDataset API to handle my dataset. It works fine for my training dataset. But when I tried to batch my validation dataset, the error 'OutOfRangeError: End of sequence' was raised. After browsing the Internet, I thought the problem was caused by the batch size of the validation set, which I set to 32 in the first place. But after I changed it to 2, the code ran for about 9 epochs and the error was raised again.
I used an input function to handle the dataset, the code goes below:
def input_fn(is_training, filenames, batch_size, num_epochs=1, num_parallel_reads=1):
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=num_parallel_reads)
    if is_training:
        dataset = dataset.shuffle(buffer_size=1500)
    dataset = dataset.map(parse_record)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels
For the training set, "batch_size" is set to 128 and "num_epochs" is set to None, which means keep repeating indefinitely. For the validation set, "batch_size" is set to 32 (later set to 2, which still didn't work) and "num_epochs" is set to 1, since I only want to go through the validation set once.
I can assure you that the validation set contains enough data for the epochs, because I've tried the code below and it didn't raise any errors:
with tf.Session() as sess:
    features, labels = input_fn(False, valid_list, 32, 1, 1)
    for i in range(450):
        sess.run([features, labels])
        print(labels.shape)
In the code above, when I changed the number 450 to 500 or anything larger, it would raise the 'OutOfRangeError'. That confirms that my validation dataset contains enough data for 450 iterations with a batch size of 32.
I've tried using a smaller batch size (i.e., 2) for the validation set, but I still get the same error.
I can get the code running with "num_epochs" set to None in the input_fn for the validation set, but that does not seem to be how validation works. Any help, please?
This behaviour is normal. From the Tensorflow documentation:
If the iterator reaches the end of the dataset, executing the Iterator.get_next() operation will raise a tf.errors.OutOfRangeError. After this point the iterator will be in an unusable state, and you must initialize it again if you want to use it further.
The reason the error is not raised when you set dataset.repeat(None) is that the dataset is never exhausted, since it is repeated indefinitely.
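A tiny repro of this behaviour (a sketch, not from the original answer): a one-shot iterator over three elements raises on the fourth get_next() call.

ds = tf.data.Dataset.range(3)
nxt = ds.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    for _ in range(3):
        print(sess.run(nxt))  # 0, 1, 2
    sess.run(nxt)             # raises tf.errors.OutOfRangeError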
To solve your issue, you should change your code to this:
n_steps = 450
...
with tf.Session() as sess:
    # Training
    features, labels = input_fn(True, training_list, 32, 1, 1)
    for step in range(n_steps):
        sess.run([features, labels])
        ...
    ...
    # Validation
    features, labels = input_fn(False, valid_list, 32, 1, 1)
    try:
        sess.run([features, labels])
        ...
    except tf.errors.OutOfRangeError:
        print("End of dataset")  # ==> "End of dataset"
You can also make a few changes to your input_fn to run the evaluation at every epoch:
def input_fn(is_training, filenames, batch_size, num_epochs=1, num_parallel_reads=1):
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=num_parallel_reads)
    if is_training:
        dataset = dataset.shuffle(buffer_size=1500)
    dataset = dataset.map(parse_record)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_initializable_iterator()
    return iterator

n_epochs = 10
freq_eval = 1

training_iterator = input_fn(True, training_list, 32, 1, 1)
training_features, training_labels = training_iterator.get_next()

val_iterator = input_fn(False, valid_list, 32, 1, 1)
val_features, val_labels = val_iterator.get_next()

with tf.Session() as sess:
    # Training
    sess.run(training_iterator.initializer)
    for epoch in range(n_epochs):
        try:
            sess.run([training_features, training_labels])
        except tf.errors.OutOfRangeError:
            pass
        # Validation
        if (epoch+1) % freq_eval == 0:
            sess.run(val_iterator.initializer)
            try:
                sess.run([val_features, val_labels])
            except tf.errors.OutOfRangeError:
                pass
I advise you to have a close look at this official guide if you want a better understanding of what is happening under the hood.
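As an alternative sketch (TF 1.x; train_ds and val_ds are assumed to be tf.data datasets built like those in input_fn above, n_epochs as before), a reinitializable iterator shares a single get_next() op between both datasets, so you can switch between them without rebuilding the graph:

iterator = tf.data.Iterator.from_structure(train_ds.output_types, train_ds.output_shapes)
features, labels = iterator.get_next()
train_init_op = iterator.make_initializer(train_ds)
val_init_op = iterator.make_initializer(val_ds)

with tf.Session() as sess:
    for epoch in range(n_epochs):
        sess.run(train_init_op)
        try:
            while True:
                sess.run([features, labels])
        except tf.errors.OutOfRangeError:
            pass  # training set exhausted for this epoch
        sess.run(val_init_op)
        try:
            while True:
                sess.run([features, labels])
        except tf.errors.OutOfRangeError:
            pass  # validation set exhausted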

How to combine FCNN and RNN in Tensorflow?

I want to make a neural network that has recurrent (for example, LSTM) layers at some depths and normal (FC) connections at others.
I cannot find a way to do it in Tensorflow.
It works if I have only FC layers, but I don't see how to add just one recurrent layer properly.
I create the network in the following way:
with tf.variable_scope("autoencoder_variables", reuse=None) as scope:
    for i in xrange(self.__num_hidden_layers + 1):
        # Train weights
        name_w = self._weights_str.format(i + 1)
        w_shape = (self.__shape[i], self.__shape[i + 1])
        a = tf.multiply(4.0, tf.sqrt(6.0 / (w_shape[0] + w_shape[1])))
        w_init = tf.random_uniform(w_shape, -1 * a, a)
        self[name_w] = tf.Variable(w_init, name=name_w, trainable=True)

        # Train biases
        name_b = self._biases_str.format(i + 1)
        b_shape = (self.__shape[i + 1],)
        b_init = tf.zeros(b_shape)
        self[name_b] = tf.Variable(b_init, trainable=True, name=name_b)

        if i + 1 == self.__recurrent_layer:
            # Create an LSTM cell
            lstm_size = self.__shape[self.__recurrent_layer]
            self['lstm'] = tf.contrib.rnn.BasicLSTMCell(lstm_size)
It should process the batches in sequential order. I have a function for processing just one time step, which will later be called by a function that processes the whole sequence:
def single_run(self, input_pl, state, just_middle=False):
    """Get the output of the autoencoder for a single batch

    Args:
        input_pl: tf placeholder for ae input data of size [batch_size, DoF]
        state: current state of LSTM memory units
        just_middle: will indicate if we want to extract only the middle layer of the network
    Returns:
        Tensor of output
    """
    last_output = input_pl

    # Pass through the network
    for i in xrange(self.num_hidden_layers + 1):
        if i != self.__recurrent_layer:
            w = self._w(i + 1)
            b = self._b(i + 1)
            last_output = self._activate(last_output, w, b)
        else:
            last_output, state = self['lstm'](last_output, state)

    return last_output
The following function should take a sequence of batches as input and produce a sequence of batches as output:
def process_sequences(self, input_seq_pl, dropout, just_middle=False):
    """Get the output of the autoencoder

    Args:
        input_seq_pl: input data of size [batch_size, sequence_length, DoF]
        dropout: dropout rate
        just_middle: indicate if we want to extract only the middle layer of the network
    Returns:
        Tensor of output
    """
    if not just_middle:  # if not middle layer (the original `if(~just_middle)` was a bitwise not, which is always truthy)
        numb_layers = self.__num_hidden_layers + 1
    else:
        numb_layers = FLAGS.middle_layer

    with tf.variable_scope("process_sequence", reuse=None) as scope:
        # Initial state of the LSTM memory.
        state = initial_state = self['lstm'].zero_state(FLAGS.batch_size, tf.float32)
        tf.get_variable_scope().reuse_variables()  # THIS IS IMPORTANT LINE

        # First - Apply Dropout
        the_whole_sequences = tf.nn.dropout(input_seq_pl, dropout)

        # Take batches for every time step and run them through the network
        # Stack all their outputs
        with tf.control_dependencies([tf.convert_to_tensor(state, name='state')]):  # do not let the loop be parallelized
            stacked_outputs = tf.stack([self.single_run(the_whole_sequences[:, time_st, :], state, just_middle)
                                        for time_st in range(self.sequence_length)])

    # Transpose output from the shape [sequence_length, batch_size, DoF] into [batch_size, sequence_length, DoF]
    output = tf.transpose(stacked_outputs, perm=[1, 0, 2])

    return output
The issue is with variable scopes and their "reuse" property.
If I run this code as it is, I get the following error:
'Variable Train/process_sequence/basic_lstm_cell/weights does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?'
If I comment out the line which tells it to reuse variables (tf.get_variable_scope().reuse_variables()), I get the following error:
'Variable Train/process_sequence/basic_lstm_cell/weights already exists, disallowed. Did you mean to set reuse=True in VarScope?'
It seems that we need "reuse=None" for the weights of the LSTM cell to be initialized, and "reuse=True" in order to call the LSTM cell afterwards.
Please help me figure out the way to do this properly.
I think the problem is that you're creating variables with tf.Variable. Please use tf.get_variable instead -- does this solve your issue?
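For illustration, here is a minimal sketch of that pattern (TF 1.x; fc_layer and the shapes are hypothetical, not from the question): tf.get_variable looks variables up by name, so the same code either creates them or, under reuse=True, returns the existing ones.

import tensorflow as tf

def fc_layer(x, out_dim):
    # Creates "w" and "b" on the first call; under reuse=True,
    # returns the already-created variables with the same names
    w = tf.get_variable("w", shape=[x.get_shape()[1], out_dim],
                        initializer=tf.glorot_uniform_initializer())
    b = tf.get_variable("b", shape=[out_dim],
                        initializer=tf.zeros_initializer())
    return tf.nn.xw_plus_b(x, w, b)

x = tf.placeholder(tf.float32, [None, 10])
with tf.variable_scope("model"):
    y1 = fc_layer(x, 5)   # creates model/w and model/b
with tf.variable_scope("model", reuse=True):
    y2 = fc_layer(x, 5)   # reuses the same variables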
It seems that I have solved this issue using the hack from the official Tensorflow RNN example (https://www.tensorflow.org/tutorials/recurrent) with the following code:
with tf.variable_scope("RNN"):
    for time_step in range(num_steps):
        if time_step > 0:
            tf.get_variable_scope().reuse_variables()
        (cell_output, state) = cell(inputs[:, time_step, :], state)
        outputs.append(cell_output)
The hack is that the first time we run the LSTM, tf.get_variable_scope().reuse is set to False, so that a new LSTM cell is created. On every subsequent run we set tf.get_variable_scope().reuse to True, so that we use the LSTM cell which was already created.

Input dimension mismatch binary crossentropy Lasagne and Theano

I have read all the posts on the net addressing the issue where people forgot to change the target vector to a matrix, and since the problem remains after this change, I decided to ask my question here. Workarounds are mentioned below, but new problems show up, and I am thankful for suggestions!
Using a convolution network setup and binary crossentropy with a sigmoid activation function, I get a dimension mismatch problem, but not on the training data, only during validation / test data evaluation. For some strange reason, one of my validation set vectors gets its dimensions switched and I have no idea why. Training, as mentioned above, works fine. Code follows below; thanks a lot for the help (and sorry for hijacking the thread, but I saw no reason to create a new one). Most of it is copied from the lasagne tutorial example.
Workarounds and new problems:
Removing "axis=1" in the valAcc definition helps, but validation accuracy remains zero and test classification always returns the same result, no matter how many nodes, layers, filters etc. I have. Even changing training set size (I have around 350 samples for each class with 48x64 grayscale images) does not change this. So something seems off
Network creation:
def build_cnn(imgSet, input_var=None):
    # As a third model, we'll create a CNN of two convolution + pooling stages
    # and a fully-connected hidden layer in front of the output layer.

    # Input layer using shape information from training
    network = lasagne.layers.InputLayer(shape=(None, imgSet.shape[1], imgSet.shape[2], imgSet.shape[3]),
                                        input_var=input_var)

    # This time we do not apply input dropout, as it tends to work less well
    # for convolutional layers.

    # Convolutional layer with 32 kernels of size 5x5. Strided and padded
    # convolutions are supported as well; see the docstring.
    network = lasagne.layers.Conv2DLayer(
        network, num_filters=32, filter_size=(5, 5),
        nonlinearity=lasagne.nonlinearities.rectify,
        W=lasagne.init.GlorotUniform())

    # Max-pooling layer of factor 2 in both dimensions:
    network = lasagne.layers.MaxPool2DLayer(network, pool_size=(2, 2))

    # Another convolution with 16 5x5 kernels, and another 2x2 pooling:
    network = lasagne.layers.Conv2DLayer(
        network, num_filters=16, filter_size=(5, 5),
        nonlinearity=lasagne.nonlinearities.rectify)
    network = lasagne.layers.MaxPool2DLayer(network, pool_size=(2, 2))

    # A fully-connected layer of 64 units with 25% dropout on its inputs:
    network = lasagne.layers.DenseLayer(
        lasagne.layers.dropout(network, p=.25),
        num_units=64,
        nonlinearity=lasagne.nonlinearities.rectify)

    # And, finally, the 1-unit output layer with 50% dropout on its inputs:
    network = lasagne.layers.DenseLayer(
        lasagne.layers.dropout(network, p=.5),
        num_units=1,
        nonlinearity=lasagne.nonlinearities.sigmoid)

    return network
Target matrices for all sets are created like this (training target vector as an example)
targetsTrain = np.vstack( (targetsTrain, [[targetClass], ]*numTr) );
...and the Theano variables as such:
inputVar = T.tensor4('inputs')
targetVar = T.imatrix('targets')
network = build_cnn(trainset, inputVar)
predictions = lasagne.layers.get_output(network)
loss = lasagne.objectives.binary_crossentropy(predictions, targetVar)
loss = loss.mean()
params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.nesterov_momentum(loss, params, learning_rate=0.01, momentum=0.9)
valPrediction = lasagne.layers.get_output(network, deterministic=True)
valLoss = lasagne.objectives.binary_crossentropy(valPrediction, targetVar)
valLoss = valLoss.mean()
valAcc = T.mean(T.eq(T.argmax(valPrediction, axis=1), targetVar), dtype=theano.config.floatX)
train_fn = function([inputVar, targetVar], loss, updates=updates, allow_input_downcast=True)
val_fn = function([inputVar, targetVar], [valLoss, valAcc])
Finally, here are the two loops, training and validation. The first is fine; the second throws the error, excerpts below:
# -- Neural network training itself -- #
numIts = 100
for itNr in range(0, numIts):
    train_err = 0
    train_batches = 0
    for batch in iterate_minibatches(trainset.astype('float32'), targetsTrain.astype('int8'),
                                     len(trainset)//4, shuffle=True):
        inputs, targets = batch
        print(inputs.shape)
        print(targets.shape)
        train_err += train_fn(inputs, targets)
        train_batches += 1

    # And a full pass over the validation data:
    val_err = 0
    val_acc = 0
    val_batches = 0
    for batch in iterate_minibatches(valset.astype('float32'), targetsVal.astype('int8'),
                                     len(valset)//3, shuffle=False):
        [inputs, targets] = batch
        [err, acc] = val_fn(inputs, targets)
        val_err += err
        val_acc += acc
        val_batches += 1
Error (excerpts):
Exception "unhandled ValueError"
Input dimension mis-match. (input[0].shape[1] = 52, input[1].shape[1] = 1)
Apply node that caused the error: Elemwise{eq,no_inplace}(DimShuffle{x,0}.0, targets)
Toposort index: 36
Inputs types: [TensorType(int64, row), TensorType(int32, matrix)]
Inputs shapes: [(1, 52), (52, 1)]
Inputs strides: [(416, 8), (4, 4)]
Inputs values: ['not shown', 'not shown']
Again, thanks for help!
So it seems the error is in the evaluation of the validation accuracy.
When you remove the "axis=1" from your calculation, the argmax goes over everything, returning only a single number.
Then broadcasting steps in, and this is why you would see the same value for the whole set.
But from the error you have posted, the "T.eq" op throws the error because it has to compare a 52 x 1 with a 1 x 52 vector (a matrix for theano/numpy).
So, I suggest you try to replace the line with:
valAcc = T.mean(T.eq(T.argmax(valPrediction, axis=1), targetVar.T))
I hope this fixes the error, but I haven't tested it myself.
EDIT:
The error lies in the argmax op that is called.
Normally, the argmax is there to determine which of the output units is activated the most.
However, in your setting you only have one output neuron, which means that the argmax over all output neurons will always return 0 (the index of the first and only unit).
This is why you have the impression that your network always gives you 0 as output.
By replacing:
valAcc = T.mean(T.eq(T.argmax(valPrediction, axis=1), targetVar.T))
with:
binaryPrediction = valPrediction > .5
valAcc = T.mean(T.eq(binaryPrediction, targetVar.T))
you should get the desired result.
I'm just not sure if the transpose is still necessary.
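As a quick shape check (a numpy sketch, not from the original answer): the thresholded prediction keeps the (52, 1) shape of the network output, which already matches the target matrix, so the transpose should not be needed in that case.

import numpy as np

val_prediction = np.random.rand(52, 1)        # sigmoid outputs, one per sample
targets = np.random.randint(0, 2, (52, 1))    # binary label matrix
binary_prediction = val_prediction > 0.5      # still shape (52, 1)
acc = np.mean(binary_prediction == targets)   # elementwise comparison, no transpose
print(acc)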
