My tremendously stripped-down code looks like:
#!/usr/bin/python3
from keras.layers import Input
from keras.layers.core import Dense
from keras.models import Model
import numpy as np
inp = Input(shape=[1])
out = Dense(units=1, activation='linear')(inp)
model = Model(inputs=inp, outputs=out)
model.compile(loss='mean_absolute_error',
              optimizer='rmsprop')
x=np.array([[0]])
y=np.array([[42]])
model.fit(x,y,epochs=1000, verbose=False)
prediction = model.predict(x)
print(prediction)
It outputs [[1.0091327]]
The model has exactly two parameters: a weight and bias for its 1-dimensional output. And the weight doesn't matter because x is always 0. This should be pretty easy to train.
If instead of 42 I use 0.42 or -0.42 for y it works fine (4.2 and -42 do not). So I figure there must be some sort of normalization somewhere softly compressing either outputs or biases toward [-1,1].
Does anyone know what this normalization is and how to turn it off?
(Before anyone tells me I shouldn't use neural nets for something this silly, my real code does a lot more. I wrote this stripped version for clarity and debugging.)
No, there is no built-in normalization; that is the user's job.
What you are seeing is the reason we use normalization: without it, the optimization problem is much harder. When I run your example and watch the training loss, it never gets anywhere close to zero and stays around 41.
If you make some changes, such as using a mean squared error loss and running the example for 50K epochs, it converges to zero loss and outputs 42 as expected.
A common beginner's mistake is to look at the prediction without first looking at the training loss; if the loss is high, the predictions will be wrong.
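A minimal sketch of those changes applied to the stripped-down example above (mean squared error loss and far more epochs; the exact numbers are just what happened to work, not a precise recipe):
#!/usr/bin/python3
from keras.layers import Input
from keras.layers.core import Dense
from keras.models import Model
import numpy as np
inp = Input(shape=[1])
out = Dense(units=1, activation='linear')(inp)
model = Model(inputs=inp, outputs=out)
# switch the loss to mean squared error so large errors produce large gradients
model.compile(loss='mean_squared_error',
              optimizer='rmsprop')
x = np.array([[0]])
y = np.array([[42]])
# train much longer than the original 1000 epochs
model.fit(x, y, epochs=50000, verbose=False)
print(model.predict(x))  # should now be close to [[42.]]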
I am following this keras tutorial to create an autoencoder using the MNIST dataset. Here is the tutorial: https://blog.keras.io/building-autoencoders-in-keras.html.
However, I am confused by the choice of activation and loss for the simple one-layer autoencoder (the first example in the link). Is there a specific reason a sigmoid activation was used for the decoder part as opposed to something such as relu? I am trying to understand whether this is a choice I can play around with, or whether it should indeed be sigmoid, and if so why. Similarly, I understand the loss is computed by comparing the original and predicted digits pixel by pixel, but I am unsure why the loss is binary_crossentropy as opposed to something like mean squared error.
I would love clarification on this to help me move forward! Thank you!
MNIST images are generally normalized in the range [0, 1], so the autoencoder should output images in the same range, for easier learning. This is why a sigmoid activation is used at the output.
The mean squared error loss applies a non-linear penalty: big errors are penalized far more than small ones, which tends to make the model converge toward the mean of the targets rather than a more accurate solution. Binary cross-entropy does not have this problem and is therefore preferred. It works here because both the model output and the labels are in the [0, 1] range, and the loss is applied to every pixel.
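As a compact sketch of the pattern being discussed (the sizes follow the usual 784-pixel MNIST setup; treat them as placeholders rather than the tutorial's exact code):
from keras.layers import Input, Dense
from keras.models import Model
input_img = Input(shape=(784,))
encoded = Dense(32, activation='relu')(input_img)
# sigmoid keeps the reconstruction in [0, 1], matching the normalized pixels
decoded = Dense(784, activation='sigmoid')(encoded)
autoencoder = Model(input_img, decoded)
# per-pixel binary cross-entropy works because predictions and targets both
# lie in [0, 1]; mean squared error would also train, but tends toward
# blurrier, mean-like reconstructions
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')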
I have set up a ResNet50 network for an optical application. With two input images, the network gives an estimate of 65 values (regression), and it works pretty well. However, the two input images belong to a time series, and the images in the series are somewhat correlated over a span of 10-15 time steps, so I expect an additional RNN could improve the estimates. I have tried to set up the network shown in the figure, using mostly frozen ResNet50 parameter values found by separate training, wrapped in "TimeDistributed" ResNet50s. However, the RNN training does not give useful accuracy.
Full LSTM network
I have now spent 2-3 weeks trying to debug my code (in particular the generator) but I have not found any coding errors. In frustration, I tried to set up the simplest RNN I could think of: a complete ResNet50 followed by either one or two SimpleRNNs with linear activation. However, they do not come even close to the accuracy of the ResNet50 alone, in spite of the correlated time series.
SimpleRNN network
So my question is: Is it correct to assume that a single SimpleRNN with linear activation should provide the same accuracy as the ResNet50 alone?
This is a bit speculative, but it might suggest an approach to debug the RNN and answer your question. Here is an extremely simple network with a SimpleRNN and a test input of 2 samples, each with a single time step and single feature: i.e. shape=(2,1,1)
from keras.models import Sequential
from keras.layers import SimpleRNN
import numpy as np
x_train = np.array([[[0.1]],
                    [[0.2]]])
y_train=np.array([[1],[0]])
print(x_train.shape)
print(x_train)
print(y_train.shape)
print(y_train)
#simple network
model = Sequential()
model.add(SimpleRNN(1,activation=None, use_bias=False, input_shape=(1,1)))
model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
model.fit(x_train, y_train, epochs=10, batch_size=2)
wgt=model.get_weights()
print(wgt)
print('model.predict(x_train)')
print(model.predict(x_train))
Based on running the above, two weights come out of the RNN. The first appears to be a simple scaling of the input, and the second I suspect is the weight of the recurrent loop, which is not actually used when there is only a single time step, as in this example. Since the activation is linear, the result is just the input times the first weight, which matches the output of model.predict.
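To make that concrete, here is a quick check you could append to the script above (with one time step the recurrent weight never enters the computation, so the prediction should just be the input scaled by the first weight):
kernel, recurrent_kernel = wgt
# manual prediction: input times the input-to-hidden weight, no recurrence
manual = x_train[:, 0, :] @ kernel
print(manual)  # should match model.predict(x_train) above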
You may be able to extend this approach to reason about the performance with the ResNet and potentially answer your question. I hope this helps.
I am training built-in PyTorch RNN modules (e.g. torch.nn.LSTM) and would like to add fixed-per-minibatch dropout between time steps (Gal dropout, if I understand correctly).
Most simply, I could unroll the network and compute my forward computation on a single batch something like this:
dropout = get_fixed_dropout()
for sequence in batch:
    state = initial_state
    for token in sequence:
        state, output = rnn(token, state)
        state, output = dropout(state, output)
        outputs.append(output)
    loss += loss(outputs, sequence)
loss.backward()
optimiser.step()
However, assuming that looping in python is pretty slow, I would much rather take advantage of pytorch's ability to fully process a batch of several sequences, all in one call (as in rnn(batch,state) for regular forward computations).
i.e., I would prefer something that looks like this:
rnn_with_dropout = drop_neurons(rnn)
outputs, states = rnn_with_dropout(batch,initial_state)
loss = loss(outputs,batch)
loss.backward()
optimiser.step()
rnn = return_dropped(rnn_with_dropout)
(note: Pytorch's rnns do have a dropout parameter, but it is for dropout between layers and not between time steps, and so not what I want).
My questions are:
Am I even understanding Gal dropout correctly?
Is an interface like the one I want available?
If not, would simply implementing some drop_neurons(rnn) and return_dropped(rnn) functions that zero random rows in the rnn's weight matrices and bias vectors, and then restore their previous values after the update step, be equivalent? (This imposes the same dropout between the layers as between the steps, i.e. it completely removes some neurons for the whole minibatch, and I'm not sure that doing this is 'correct'.)
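For what it's worth, here is a rough sketch of that last idea, written purely as an illustration (not a verified recipe), for a single-layer torch.nn.RNN where every parameter's first dimension equals hidden_size; for torch.nn.LSTM or GRU the mask would have to be repeated across the gate blocks:
import torch
def drop_neurons(rnn, p=0.5):
    # pick a random set of hidden units, zero the corresponding rows of the
    # weight matrices and the matching bias entries, and remember the old values
    mask = torch.rand(rnn.hidden_size) < p
    saved = {}
    with torch.no_grad():
        for name, param in rnn.named_parameters():
            saved[name] = param[mask].clone()
            param[mask] = 0.0
    return mask, saved
def return_dropped(rnn, mask, saved):
    # restore the previous values of the dropped rows after optimiser.step()
    with torch.no_grad():
        for name, param in rnn.named_parameters():
            param[mask] = saved[name]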
I have a neural network in keras. The network doesn't distinguish between two samples if they are 0.001 units apart from each other in the normalized feature space. It is extremely important for the network to be able to distinguish that because this difference isn't insignificant in the original (non-normalized) feature space.
Is there any way I can improve the resolution of my neural network? If so, what factors should I be changing?
Edit: Let me give you the code I'm using for my ANN.
Edit: Description of the dataset: I'm using a 2-dimensional dataset with x in [-1, 1] and y in [-1, 1], with a step size of 0.001 between consecutive points in both dimensions. The class labels are such that anything inside or on the circle of radius 0.5 centred at (0, 0) is class 1, and anything outside the circle is class 0. After training, I'm using the same training set as a test set. As it stands, points on the circle boundary and in a small neighbourhood just inside and outside it are being given outputs between 0.3 and 0.7. Only points well within the circle come out as 1, and points well outside come out as 0. I recognise that this is the behaviour of a sigmoid activation function at the output layer. I need the network to distinguish between a point on the boundary, a point lying just outside the boundary 0.001 units away, and a point lying just inside the boundary 0.001 units away.
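For reference, a minimal sketch (mine, not code from the question) of how a grid like the one described above could be generated:
import numpy as np
step = 0.001
axis = np.arange(-1, 1 + step, step)
xx, yy = np.meshgrid(axis, axis)
features = np.stack([xx.ravel(), yy.ravel()], axis=1)
# class 1 inside or on the circle of radius 0.5 centred at the origin, else 0
labels = (features[:, 0]**2 + features[:, 1]**2 <= 0.25).astype(int)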
from keras.models import Sequential
from keras.layers import Dense
from sklearn.cross_validation import train_test_split as tts
from keras.callbacks import EarlyStopping as es
from keras import optimizers as op
"""Creating the model"""
model=Sequential()
model.add(Dense(12,input_dim=c,activation='relu')) #input layer (c is the dimensionality of my dataset)
for i in range(0, hidden_layer_size):
    model.add(Dense(12, activation='relu')) #hidden layers
model.add(Dense(1,activation='sigmoid')) #output layer
"""Compiling the model"""
adam=op.Adam(lr=0.0007)
model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
"""Fit the model"""
early=es(monitor='acc',min_delta=0.0005,patience=2)
model.fit(features_train,labels_train, epochs=epochs, batch_size=30,callbacks=[early],verbose=2)
"""Evaluation"""
scores=model.evaluate(features_test,labels_test)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
There is not really a concept of "resolution" in a neural network. If your model misclassifies some samples, then you either need a better model, or more data, or even additional regularization.
It is hard to predict what is wrong without testing the model extensively, so it is something you will have to do.
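As one hedged illustration of the "better model" direction (my own guess at a starting point, not a prescription; the names c, features_train and labels_train come from the question's code):
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers as op
model = Sequential()
model.add(Dense(64, input_dim=c, activation='relu'))  # wider layers
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer=op.Adam(lr=0.0003),
              metrics=['accuracy'])
# train longer, without aggressive early stopping, and watch the loss
model.fit(features_train, labels_train, epochs=200, batch_size=256, verbose=2)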
I am reading an article that explains how to trick neural networks into predicting any image you want. I am using the mnist dataset.
The article provides a relatively detailed walk through but the person who wrote it is using Caffe.
Anyway, my first step was to create a logistic regression model using TensorFlow, trained on the mnist dataset. So, once I restore the logistic regression model, I can use it to predict any image. For example, I feed the number 7 to the following model...
with tf.Session() as sess:
    saver.restore(sess, "/tmp/model.ckpt")
    # number 7
    x_in = np.expand_dims(mnist.test.images[0], axis=0)
    classification = sess.run(tf.argmax(pred, 1), feed_dict={x: x_in})
    print(classification)
>>>[7]
This prints out the number [7] which is correct.
Now the article explains that in order to break a neural network we need to calculate the gradient of the neural network. This is the derivative of the neural network.
The article states that to calculate the gradient, we first need to pick an intended outcome to move towards, and set the output probability list to be 0 everywhere, and 1 for the intended outcome. Backpropagation is an algorithm for calculating the gradient.
Then there's code provided in Caffe as to how to calculate the gradient...
def compute_gradient(image, intended_outcome):
    # Put the image into the network and make the prediction
    predict(image)
    # Get an empty set of probabilities
    probs = np.zeros_like(net.blobs['prob'].data)
    # Set the probability for our intended outcome to 1
    probs[0][intended_outcome] = 1
    # Do backpropagation to calculate the gradient for that outcome
    # and the image we put in
    gradient = net.backward(prob=probs)
    return gradient['data'].copy()
Now, my issue is that I'm having a hard time understanding how this function is able to get the gradient just by feeding the image and the probabilities to it. Because I do not fully understand this code, I am having a hard time translating this logic to TensorFlow.
I think I am confused as to how the Caffe framework works because I've never seen/used it before. If someone could explain how this logic works step-by-step that would be great.
I already know the basics of Backpropagation so you may assume I already know how it works.
Here is a link to the article itself...https://codewords.recurse.com/issues/five/why-do-neural-networks-think-a-panda-is-a-vulture
I'm going to show you how to do the basics of generating an adversarial image in TF; to apply this to an already trained model you might need some adaptations.
The code blocks work well as cells in a Jupyter notebook if you want to try this out interactively. If you don't use a notebook, you'll need to add plt.show() calls for the plots to show and remove the %matplotlib inline statement. The code is basically the simple MNIST tutorial from the TF documentation; I'll point out the important differences.
First block is just setup, nothing special ...
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# if you're not using jupyter notebooks then comment this out
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
Get MNIST data (it is down from time to time so you might need to download it from web.archive.org manually and put it into that directory). We're not using one hot encoding like in the tutorial because by now TF has nicer functions to calculate the loss that don't need the one hot encoding anymore.
mnist = input_data.read_data_sets('/tmp/tensorflow/mnist/input_data')
In the next block we are doing something "special". The input image tensor is defined as a variable because later we want to optimize with regard to the input image. Usually you would have a placeholder here. It does limit us a bit here because we need a definite shape so we only feed in one example at a time. Not something you want to do in production, but for teaching purposes it's fine (and you can get around it with a little more code). Labels are placeholders like normal.
input_images = tf.get_variable("input_image", shape=[1,784], dtype=tf.float32)
input_labels = tf.placeholder(shape=[1], name='input_label', dtype=tf.int32)
Our model is a standard logistic regression model like in the tutorial. We only use the softmax for visualization of the results; the loss function takes plain logits.
W = tf.get_variable("weights", shape=[784, 10], dtype=tf.float32, initializer=tf.random_normal_initializer())
b = tf.get_variable("biases", shape=[1, 10], dtype=tf.float32, initializer=tf.zeros_initializer())
logits = tf.matmul(input_images, W) + b
softmax = tf.nn.softmax(logits)
The loss is standard cross entropy. What's to note in the training step is that there is an explicit list of variables passed in - we have defined the input image as a training variable but we don't want to try optimizing the image while training the logistic regression, just weights and biases - so we explicitly state that.
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,labels=input_labels,name='xentropy')
mean_loss = tf.reduce_mean(loss)
train_step = tf.train.AdamOptimizer(learning_rate=0.1).minimize(mean_loss, var_list=[W,b])
Start the session ...
sess = tf.Session()
sess.run(tf.global_variables_initializer())
Training is slower than it should be because of batch size 1. Like I said, not something you want to do in production, but this is just for teaching the basics ...
for step in range(10000):
    batch_xs, batch_ys = mnist.train.next_batch(1)
    loss_v, _ = sess.run([mean_loss, train_step], feed_dict={input_images: batch_xs, input_labels: batch_ys})
At this point we should have a model that is good enough to demonstrate how to generate an adversarial image. First, we get an image that has label '2' because these are easy so even our suboptimal classifier should get them right (if it doesn't, run this cell again ;) this step is random so I can't guarantee that it'll work).
We're setting our input image variable to that example.
sample_label = -1
while sample_label != 2:
    sample_image, sample_label = mnist.test.next_batch(1)
sample_label
plt.imshow(sample_image.reshape(28, 28),cmap='gray')
# assign image to var
sess.run(tf.assign(input_images, sample_image));
sess.run(softmax) # now using the variable as input, no feed dict
# should show something like
# array([[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)
# With the third entry being the highest by far.
Now we are going to "break" the classification. We want to change the image to make it look more like another number, in the eyes of the network, without changing the network itself. To do that, the code looks basically identical to what we had before. We define a "fake" label, the same loss as before (cross entropy) and we get an optimizer to minimize the fake loss, but this time with a var_list consisting of only the input image - so we won't change the logistic regression weights:
fake_label = tf.placeholder(tf.int32, shape=[1])
fake_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,labels=fake_label)
adversarial_step = tf.train.GradientDescentOptimizer(learning_rate=1e-3).minimize(fake_loss, var_list=[input_images])
The next block is intended to be run interactively multiple times, while you see the image and the scores changing (here moving towards a label of 8):
sess.run(adversarial_step, feed_dict={fake_label:np.array([8])})
plt.imshow(sess.run(input_images).reshape(28,28),cmap='gray')
sess.run(softmax)
The first time you run this block, the scores will probably still heavily point towards 2, but it will change over time and after a couple runs you should see something like the following image - note that the image still looks like a 2 with some noise in the background, but the score for "2" is at around 3% while the score for "8" is at over 96%.
Note that we never actually computed the gradient explicitly - we don't need to, the TF optimizer takes care of computing gradients and applying updates to the variables. If you want to get the gradient, you can do so by using tf.gradients(fake_loss, input_images).
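For example, a minimal sketch of pulling the raw gradient out instead of letting the optimizer apply it (this plays roughly the same role as the net.backward() call in the Caffe snippet from the question):
# gradient of the fake loss with respect to the input image variable
image_gradient = tf.gradients(fake_loss, input_images)[0]
grad_value = sess.run(image_gradient, feed_dict={fake_label: np.array([8])})
print(grad_value.shape)  # (1, 784)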
The same pattern works for more complicated models, but what you'll want to do is train your model as normal - using placeholders with bigger batches, or using a pipeline with TF readers - and when you want to make the adversarial image, recreate the network with the input image variable as an input. As long as all the variable names remain the same (which they should if you use the same functions to build the network), you can restore from your network checkpoint and then apply the steps from this post to get to an adversarial image. You might need to play around with learning rates and such.
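A sketch of that restore step, assuming the checkpoint was written by the same model-building code ("/tmp/model.ckpt" is just a placeholder path):
# restore only the trained weights; input_images is a fresh variable that
# gets its value from tf.assign as in the blocks above
saver = tf.train.Saver(var_list=[W, b])
sess.run(tf.global_variables_initializer())
saver.restore(sess, "/tmp/model.ckpt")
# from here, run the adversarial_step updates exactly as before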