Tensorflow RNN tutorial - python-3.x

I am following the TensorFlow RNN tutorial.
I am having trouble understanding the function ptb_producer in reader.py, specifically the following part of the script:
with tf.control_dependencies([assertion]):
    epoch_size = tf.identity(epoch_size, name="epoch_size")
i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
x = tf.strided_slice(data, [0, i * num_steps], [batch_size, (i + 1) * num_steps])
x.set_shape([batch_size, num_steps])
y = tf.strided_slice(data, [0, i * num_steps + 1], [batch_size, (i + 1) * num_steps + 1])
y.set_shape([batch_size, num_steps])
return x, y
Can anyone explain what tf.train.range_input_producer is doing?

I have been trying to understand the same tutorial for weeks now. In my opinion, what makes it so difficult is the fact that all the functions one calls from TensorFlow are not executed immediately, but rather add their corresponding operation nodes to the graph.
According to the official documentation, a Range Input Producer 'generates integers from 0 to limit - 1 in a queue'. So, the way I see it, the code in question, i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue(), creates a node which acts as a counter: each time it is executed, it produces the next number in the sequence 0, 1, ..., epoch_size - 1.
This is used to get the next batch from the input data. The raw data is split into batch_size rows, so that in every run batch_size batches are given to the training function. In every batch (row), a sliding window of size num_steps moves forward. The counter i allows the window to move forward by num_steps in every call.
Both x and y are of shape [batch_size, num_steps], since they contain batch_size batches of num_steps steps each. Variable x is the input and y is the expected output for the given input. It is produced by moving the window one item to the right, so that if x = data[i:(i + num_steps)] then y = data[(i + 1):(i + num_steps + 1)].
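To make the windowing concrete, here is a minimal pure-Python sketch of the same slicing without the TensorFlow queue. The name ptb_windows and the toy data are assumptions made for illustration, and the batch_len and epoch_size formulas reflect how I understand reader.py, so treat this as a sketch rather than the tutorial's implementation.

```python
def ptb_windows(raw_data, batch_size, num_steps):
    """Yield (x, y) pairs the way the counter i walks over the data."""
    batch_len = len(raw_data) // batch_size
    # Split the raw data into batch_size rows of batch_len items each.
    data = [raw_data[r * batch_len:(r + 1) * batch_len] for r in range(batch_size)]
    # Number of windows per row; one item is kept back for the shifted target.
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):  # stands in for range_input_producer(...).dequeue()
        start = i * num_steps
        x = [row[start:start + num_steps] for row in data]
        y = [row[start + 1:start + num_steps + 1] for row in data]
        yield x, y


batches = list(ptb_windows(list(range(20)), batch_size=2, num_steps=3))
first_x, first_y = batches[0]
print(first_x)  # [[0, 1, 2], [10, 11, 12]]
print(first_y)  # [[1, 2, 3], [11, 12, 13]]
```

Each row advances by num_steps per step, and y is always x shifted by one position within the same row.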
It has been a nightmare for me, but I hope this post helps people in the future.

Related

Why must the embed dimension be divisible by the number of heads in MultiheadAttention?

I am learning about the Transformer. Here is the PyTorch documentation for MultiheadAttention. In their implementation, I saw there is a constraint:
assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
Why require the constraint that embed_dim must be divisible by num_heads? If we go back to the equation
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Assume:
Q, K, V are n x embed_dim matrices; all the weight matrices W are embed_dim x head_dim,
Then, the concat [head_1, ..., head_h] will be an n x (num_heads*head_dim) matrix;
W^O has size (num_heads*head_dim) x embed_dim
[head_1, ..., head_h] * W^O will become an n x embed_dim output
I don't know why we require that embed_dim be divisible by num_heads.
Let's say we have num_heads=10000; the results are the same, since the matrix-matrix product will absorb this information.
From what I understood, it is a simplification they have added to keep things simple. Theoretically, we can implement the model like you proposed (similar to the original paper).
In the PyTorch documentation, they briefly mention it:
Note that `embed_dim` will be split across `num_heads` (i.e. each head will have dimension `embed_dim` // `num_heads`)
Also, if you look at the PyTorch implementation, you can see it is a bit different (optimised, in my view) compared to the originally proposed model. For example, they use MatMul instead of Linear, and the Concat layer is omitted. Refer to the graph below, which shows the first encoder (with batch size 32, 10 words, 512 features).
P.s:
If you need to see the model params (like the above image), this is the code I used.
import torch
transformer_model = torch.nn.Transformer(d_model=512, nhead=8, num_encoder_layers=1,num_decoder_layers=1,dim_feedforward=11) # change params as necessary
tgt = torch.rand((20, 32, 512))
src = torch.rand((11, 32, 512))
torch.onnx.export(transformer_model, (src, tgt), "transformer_model.onnx")
When you have a sequence of shape seq_len x emb_dim (e.g. 20 x 8) and you want to use num_heads=2, the sequence will be split along the emb_dim dimension. Therefore you get two 20 x 4 sequences. You want every head to have the same shape, and if emb_dim isn't divisible by num_heads this won't work. Take for example a 20 x 9 sequence and again num_heads=2. Then you would get 20 x 4 and 20 x 5, which are not the same dimension.
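The split described above can be sketched in a few lines of plain Python. The helper split_heads is hypothetical (it is not part of PyTorch); it only mimics the idea of slicing the embedding dimension into equal parts, using nested lists instead of tensors.

```python
def split_heads(sequence, num_heads):
    """Split a seq_len x embed_dim nested list into num_heads equal slices."""
    embed_dim = len(sequence[0])
    if embed_dim % num_heads != 0:
        # Same condition as the assert quoted in the question.
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads
    return [
        [row[h * head_dim:(h + 1) * head_dim] for row in sequence]
        for h in range(num_heads)
    ]


seq = [[float(j) for j in range(8)] for _ in range(20)]  # 20 x 8
heads = split_heads(seq, num_heads=2)
print(len(heads), len(heads[0]), len(heads[0][0]))  # 2 20 4
```

Calling split_heads on a 20 x 9 sequence with num_heads=2 raises ValueError, which is the situation the divisibility check guards against.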

How to remove entries of a numpy array based on conditions

I'm currently working on an optimization problem and I'm stuck at some point.
Here is a simple case:
I have a set of p parameters in c combinations and try to optimize them on i inputs. This leads to a numpy array of shape (p, c, i). For each of them, I calculate an error leading to a second array of shape (c, i). Now I have to get rid of some combinations before running the next iteration.
For the case i = 1, I use the following code which works fine (arrays are just (p, c) and (c, )):
ensemble = np.vstack((ensemble, error))
# Remove some functions that do not fit
ensemble = ensemble[:, ensemble[p, :] < thres]
# Delete error column
function_ensemble = np.delete(function_ensemble, p, axis=0)
Now I try to generalize this for i > 1:
error[error >= 0.015] = 0
error = error[np.newaxis, :, :]
# Maybe keep them separate
ensemble = np.concatenate((ensemble, error))
And that's where I'm stuck. What I'm currently thinking of is sorting my ensemble by the error and removing all entries where the error is 0 for all i, so that the ensemble gets smaller. However, as far as I know, np.sort does not work here. Maybe I could use structured arrays, but I'm not sure if this would break code in other places.
Does anyone have an idea for my issue?
Edit
For case i=1 a running example with p=3 and c=100 could just be:
error = np.random.rand(100)
ensemble = np.ones([3, 100])
ensemble = np.vstack((ensemble, error))
ensemble = ensemble[:, ensemble[3, :] < 0.5]
ensemble = np.delete(ensemble, 3, axis=0)
The result in this case is a subset of my ensemble where the error is smaller than 0.5.
For case i=2 (multiple ensembles) an example with p=3 and c=100 could be:
error = np.random.rand(100, 2)
ensemble = np.ones([3, 100, 2])
error[error >= 0.015] = 0
error = error[np.newaxis, :, :]
ensemble = np.concatenate((ensemble, error))
Again, I want a subset of my ensembles containing all sets of parameters with an error < 0.5. This will require some padding, since each ensemble i will have a different size afterwards.
However, the ultimate goal is to iterate and adjust the parameters until only one set remains, ending up with an array of size (3,1,2). (The examples above do not show the iteration.)
Best, Julz
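One possible way to approach the i > 1 case, sketched under assumptions: keep the error array separate, drop every combination whose error is at or above the threshold for all inputs, and pad the remaining per-input failures with NaN. The function name prune_ensemble and the choice of NaN as padding are mine, not something the question specifies.

```python
import numpy as np


def prune_ensemble(ensemble, error, thres):
    """ensemble has shape (p, c, i); error has shape (c, i)."""
    keep = error < thres          # (c, i) mask of acceptable entries
    alive = keep.any(axis=1)      # combinations that pass for at least one input
    pruned = ensemble[:, alive, :].astype(float)
    # Pad entries that fail for a particular input with NaN.
    pruned[:, ~keep[alive]] = np.nan
    return pruned, error[alive]


ensemble = np.ones((3, 4, 2))
error = np.array([[0.1, 0.9],
                  [0.9, 0.9],
                  [0.2, 0.3],
                  [0.9, 0.1]])
pruned, pruned_error = prune_ensemble(ensemble, error, thres=0.5)
print(pruned.shape)  # (3, 3, 2)
```

Because the error stays in its own array, there is no extra row to delete afterwards, and NaN-aware functions such as np.nanmean can skip the padded entries.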

LSTM getting caught up in loop

I recently implemented a name-generating RNN "from scratch" which was doing OK but far from perfect. So I thought about trying my luck with PyTorch's LSTM class to see if it makes a difference. Indeed it does, and the output looks way better for the first 7 to 8 characters. But then the network gets caught in a loop and outputs things like "laulaulaulau" or "rourourourou" (it is supposed to generate French names).
Is this a frequently occurring problem? If so, do you know a way to fix it? I'm concerned about the fact that the network doesn't produce EOS tokens...
This issue has already been asked about here: Why does my keras LSTM model get stuck in an infinite loop?
but it was not really answered, hence my post.
Here is the model:
import torch
import torch.nn as nn

class pytorchLSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(pytorchLSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size)
        self.output_layer = nn.Linear(hidden_size, input_size)
        self.tanh = nn.Tanh()
        self.softmax = nn.LogSoftmax(dim=2)

    # hidden defaults to None because the calls below pass only the input
    def forward(self, input, hidden=None):
        out, hidden = self.lstm(input, hidden)
        out = self.tanh(out)
        out = self.output_layer(out)
        out = self.softmax(out)
        return out, hidden
The input and target are two sequences of one-hot encoded vectors, with a start-of-sequence vector at the start and an end-of-sequence vector at the end, respectively. They represent the characters inside a name taken from the name list (database).
I use a start-of-sequence token and an end-of-sequence token on each name from the database. Here are the functions I use:
def inputTensor(line):
    # tensor starts with <start of sequence> token.
    tensor = torch.zeros(len(line)+1, 1, n_letters)
    tensor[0][0][n_letters - 2] = 1
    for li in range(len(line)):
        letter = line[li]
        tensor[li+1][0][all_letters.find(letter)] = 1
    return tensor

# LongTensor of second letter to end (EOS) for target
def targetTensor(line):
    letter_indexes = [all_letters.find(line[li]) for li in range(len(line))]
    letter_indexes.append(n_letters - 1) # EOS
    return torch.LongTensor(letter_indexes)
Training loop:
def train_lstm(model):
    start = time.time()
    criterion = nn.NLLLoss()
    optimizer = torch.optim.Adam(model.parameters())
    n_iters = 20000
    print_every = 1000
    plot_every = 500
    all_losses = []
    total_loss = 0
    for iter in range(1,n_iters+1):
        line = randomChoice(category_line)
        input_line_tensor = inputTensor(line)
        target_line_tensor = targetTensor(line).unsqueeze(-1)
        optimizer.zero_grad()
        loss = 0
        output, hidden = model(input_line_tensor)
        for i in range(input_line_tensor.size(0)):
            l = criterion(output[i], target_line_tensor[i])
            loss += l
        loss.backward()
        optimizer.step()
The sampling function:
def sample():
    max_length = 20
    input = torch.zeros(1,1,n_letters)
    input[0][0][n_letters - 2] = 1
    output_name = ""
    hidden = (torch.zeros(2,1,lstm.hidden_size),torch.zeros(2,1,lstm.hidden_size))
    for i in range(max_length):
        output, hidden = lstm(input)
        output = output[-1][:][:]
        l = torch.multinomial(torch.exp(output[0]),num_samples = 1).item()
        if l == n_letters - 1:
            break
        else:
            letter = all_letters[l]
            output_name += letter
        input = inputTensor(letter)
    return output_name
The typical sampled output looks something like this:
Laurayeerauerararauo
Leayealouododauodouo
Courouauurourourodau
Do you know how I can improve that?
I found the explanation:
When using instances of the LSTM class as part of a RNN, the default input dimensions are (seq_length,batch_dim,input_size). To be able to interpret the output of the lstm as a probability (over the set of inputs) I needed to pass it to a Linear layer before the Softmax call, which is where the problem happens : Linear instances expects the input to be in the format (batch_dim,seq_length,input_size).
To fix this, one needs to pass batch_first = True as an argument to the LSTM upon creation, and then feed the RNN with an input of the form (batch_dim, seq_length, input_size).
Some tips to improve the network in the order of importance (and ease of implementing):
1. Training data
If you want your generated samples to look real, you have to give some real data to the network. Find a set of names, split those into letters and transform into indices. This step alone would give way more realistic names.
2. Separate start and end tokens.
I would go with <SON> (Start Of Name) and <EON> (End Of Name). In this configuration neural network can learn combinations of letters leading to <EON> and combinations of letters coming after <SON>. ATM it's trying to fit two different concepts into this one custom token.
3. Unsupervised Pretraining
You may want to give your letters some semantic meaning instead of one-hot encoded vectors, check word2vec for basic approach.
Basically, each letter would be represented by N-dimensional vector (say 50 dimensions) and would be closer in space if the letter occurs more often next to another letter (a closer to k than x).
Simple way to implement that would be taking some text dataset and trying to predict next letter at each timestep. Each letter would be represented by random vector at the beginning, through backpropagation letter representations would be updated to reflect their similarity.
Check pytorch embedding tutorial for more info.
4. Different architecture
You may want to check Andrej Karpathy's idea for generating baby names. It is simply described here.
Essentially, after training, you feed your model with random letters (say 10) and tell it to predict the next letter.
You remove the last letter from the random seed and put the predicted one in its place. Iterate until <EON> is output.
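The role of a dedicated end token in sampling can be sketched without any neural network. In the sketch below, the trained model is replaced by a hypothetical lookup table that maps the previous letter to next-letter probabilities; sample_name, toy_model and looping_model are illustrative names, and the loop conditions only on the previous letter rather than on a longer seed.

```python
import random

SON, EON = "<SON>", "<EON>"


def sample_name(next_letter_probs, max_length=20, rng=None):
    """Sample letters one at a time until <EON> or max_length is reached."""
    rng = rng or random.Random()
    prev, name = SON, []
    for _ in range(max_length):
        probs = next_letter_probs(prev)  # stand-in for a trained model
        letters = list(probs)
        weights = [probs[letter] for letter in letters]
        nxt = rng.choices(letters, weights=weights, k=1)[0]
        if nxt == EON:
            break
        name.append(nxt)
        prev = nxt
    return "".join(name)


# A toy "model" that has learned to end the name after "u".
toy_model = {SON: {"l": 1.0}, "l": {"a": 1.0}, "a": {"u": 1.0}, "u": {EON: 1.0}}
print(sample_name(toy_model.get))  # lau

# A toy "model" that never predicts <EON> loops until max_length.
looping_model = {SON: {"l": 1.0}, "l": {"a": 1.0}, "a": {"u": 1.0}, "u": {"l": 1.0}}
print(sample_name(looping_model.get, max_length=12))  # laulaulaulau
```

The second table reproduces the kind of repetition described in the question: if the end token is never predicted, only the length cap stops generation.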

How to add a confusion matrix to Theano examples?

I want to make use of Theano's logistic regression classifier, but I would like to make an apples-to-apples comparison with previous studies I've done to see how deep learning stacks up. I recognize this is probably a fairly simple task if I was more proficient in Theano, but this is what I have so far. From the tutorials on the website, I have the following code:
def errors(self, y):
    # check if y has same dimension of y_pred
    if y.ndim != self.y_pred.ndim:
        raise TypeError(
            'y should have the same shape as self.y_pred',
            ('y', y.type, 'y_pred', self.y_pred.type)
        )
    # check if y is of the correct datatype
    if y.dtype.startswith('int'):
        # the T.neq operator returns a vector of 0s and 1s, where 1
        # represents a mistake in prediction
        return T.mean(T.neq(self.y_pred, y))
I'm pretty sure this is where I need to add the functionality, but I'm not certain how to go about it. What I need is either access to y_pred and y for each and every run (to update my confusion matrix in python) or to have the C++ code handle the confusion matrix and return it at some point along the way. I don't think I can do the former, and I'm unsure how to do the latter. I've done some messing around with an update function along the lines of:
def confuMat(self, y):
    x=T.vector('x')
    classes = T.scalar('n_classes')
    onehot = T.eq(x.dimshuffle(0,'x'),T.arange(classes).dimshuffle('x',0))
    oneHot = theano.function([x,classes],onehot)
    yMat = T.matrix('y')
    yPredMat = T.matrix('y_pred')
    confMat = T.dot(yMat.T,yPredMat)
    confusionMatrix = theano.function(inputs=[yMat,yPredMat],outputs=confMat)
    def confusion_matrix(x,y,n_class):
        return confusionMatrix(oneHot(x,n_class),oneHot(y,n_class))
    t = np.asarray(confusion_matrix(y,self.y_pred,self.n_out))
    print (t)
But I'm not completely clear on how to get this to interface with the function in question and give me a numpy array I can work with.
I'm quite new to Theano, so hopefully this is an easy fix for one of you. I'd like to use this classifier as my output layer in a number of configurations, so I could use the confusion matrix with other architectures.
I suggest a brute-force approach. You first need an output for the predictions, so create a function for it:
prediction = theano.function(
    inputs = [index],
    outputs = MLPlayers.predicts,
    givens={
        x: test_set_x[index * batch_size: (index + 1) * batch_size]})
In your test loop, gather the predictions...
labels = labels + test_set_y.eval().tolist()
for mini_batch in xrange(n_test_batches):
    wrong = wrong + int(test_model(mini_batch))
    predictions = predictions + prediction(mini_batch).tolist()
Now create the confusion matrix this way:
correct = 0
confusion = numpy.zeros((outs,outs), dtype = int)
for index in xrange(len(predictions)):
    # compare values with ==; "is" tests object identity and is unreliable here
    if labels[index] == predictions[index]:
        correct = correct + 1
    confusion[int(predictions[index]),int(labels[index])] += 1
You can find this kind of an implementation in this repository.
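As a self-contained illustration of the counting step, here is a plain-Python sketch that does not depend on Theano. It assumes the labels and predictions have already been gathered into flat lists of integer class indices; the function name and the toy data are made up for the example. Rows are indexed by the prediction and columns by the true label, matching the indexing used above.

```python
def confusion_matrix(labels, predictions, n_classes):
    """Count how often each (predicted, true) class pair occurs."""
    confusion = [[0] * n_classes for _ in range(n_classes)]
    for label, pred in zip(labels, predictions):
        confusion[pred][label] += 1
    return confusion


labels = [0, 1, 2, 2, 1]
predictions = [0, 2, 2, 2, 1]
confusion = confusion_matrix(labels, predictions, n_classes=3)
for row in confusion:
    print(row)

# Correct predictions sit on the diagonal.
correct = sum(confusion[k][k] for k in range(3))
print(correct / len(labels))  # 0.8
```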

matrices are not aligned Error: Python SciPy fmin_bfgs

Problem Synopsis:
When attempting to use the scipy.optimize.fmin_bfgs minimization (optimization) function, the function throws a
derphi0 = np.dot(gfk, pk)
ValueError: matrices are not aligned
error. According to my error checking this occurs at the very end of the first iteration through fmin_bfgs--just before any values are returned or any calls to callback.
Configuration:
Windows Vista
Python 3.2.2
SciPy 0.10
IDE = Eclipse with PyDev
Detailed Description:
I am using the scipy.optimize.fmin_bfgs to minimize the cost of a simple logistic regression implementation (converting from Octave to Python/SciPy). Basically, the cost function is named cost_arr function and the gradient descent is in gradient_descent_arr function.
I have manually tested and fully verified that *cost_arr* and *gradient_descent_arr* work properly and return all values properly. I also tested to verify that the proper parameters are passed to the *fmin_bfgs* function. Nevertheless, when run, I get the ValueError: matrices are not aligned. According to the source review, the exact error occurs in the
def line_search_wolfe1
function in # Minpack's Wolfe line and scalar searches as supplied by the scipy packages.
Notably, if I use scipy.optimize.fmin instead, the fmin function runs to completion.
Exact Error:
File
"D:\Users\Shannon\Programming\Eclipse\workspace\SBML\sbml\LogisticRegression.py",
line 395, in fminunc_opt
optcost = scipy.optimize.fmin_bfgs(self.cost_arr, initialtheta, fprime=self.gradient_descent_arr, args=myargs, maxiter=maxnumit, callback=self.callback_fmin_bfgs, retall=True)
File
"C:\Python32x32\lib\site-packages\scipy\optimize\optimize.py", line
533, in fmin_bfgs old_fval,old_old_fval)
File "C:\Python32x32\lib\site-packages\scipy\optimize\linesearch.py", line
76, in line_search_wolfe1
derphi0 = np.dot(gfk, pk)
ValueError: matrices are not aligned
I call the optimization function with:
optcost = scipy.optimize.fmin_bfgs(self.cost_arr, initialtheta, fprime=self.gradient_descent_arr, args=myargs, maxiter=maxnumit, callback=self.callback_fmin_bfgs, retall=True)
I have spent a few days trying to fix this and cannot seem to determine what is causing the matrices are not aligned error.
ADDENDUM: 2012-01-08
I worked with this a lot more and seem to have narrowed the issues (but am baffled on how to fix them). First, fmin (using just fmin) works using these functions--cost, gradient. Second, the cost and the gradient functions both accurately return expected values when tested in a single iteration in a manual implementation (NOT using fmin_bfgs). Third, I added error code to optimize.linesearch and the error seems to be thrown at def line_search_wolfe1 in line: derphi0 = np.dot(gfk, pk).
Here, according to my tests, in scipy.optimize.optimize:
pk = [[ 12.00921659] [ 11.26284221]]
gfk = [[-12.00921659] [-11.26284221]]
Note: according to my tests, the error is thrown on the very first iteration through fmin_bfgs (i.e., fmin_bfgs never even completes a single iteration or update).
I appreciate ANY guidance or insights.
My Code Below (logging, documentation removed):
Assume theta = 2x1 ndarray (Actual: theta Info Size=(2, 1))
Assume X = 100x2 ndarray (Actual: X Info Size=(2, 100))
Assume y = 100x1 ndarray (Actual: y Info Size=(100, 1))
def cost_arr(self, theta, X, y):
    theta = scipy.resize(theta,(2,1))
    m = scipy.shape(X)
    m = 1 / m[1] # Use m[1] because this is the length of X
    logging.info(__name__ + "cost_arr reports m = " + str(m))
    z = scipy.dot(theta.T, X) # Must transpose the vector theta
    hypthetax = self.sigmoid(z)
    yones = scipy.ones(scipy.shape(y))
    hypthetaxones = scipy.ones(scipy.shape(hypthetax))
    costright = scipy.dot((yones - y).T, ((scipy.log(hypthetaxones - hypthetax)).T))
    costleft = scipy.dot((-1 * y).T, ((scipy.log(hypthetax)).T))

def gradient_descent_arr(self, theta, X, y):
    theta = scipy.resize(theta,(2,1))
    m = scipy.shape(X)
    m = 1 / m[1] # Use m[1] because this is the length of X
    x = scipy.dot(theta.T, X) # Must transpose the vector theta
    sig = self.sigmoid(x)
    sig = sig.T - y
    grad = scipy.dot(X,sig)
    grad = m * grad
    return grad

def fminunc_opt_bfgs(self, initialtheta, X, y, maxnumit):
    myargs= (X,y)
    optcost = scipy.optimize.fmin_bfgs(self.cost_arr, initialtheta, fprime=self.gradient_descent_arr, args=myargs, maxiter=maxnumit, retall=True, full_output=True)
    return optcost
In case anyone else encounters this problem ....
1) ERROR 1: As noted in the comments, I incorrectly returned the value from my gradient as a multidimensional array, (m,n) or (m,1). fmin_bfgs seems to require a 1d array output from the gradient (that is, you must return a (m,) array and NOT a (m,1) array). Use scipy.shape(myarray) to check the dimensions if you are unsure of the return value.
The fix involved adding:
grad = numpy.ndarray.flatten(grad)
just before returning the gradient from your gradient function. This "flattens" the array from (m,1) to (m,). fmin_bfgs can take this as input.
2) ERROR 2: Remember, the fmin_bfgs seems to work with NONlinear functions. In my case, the sample that I was initially working with was a LINEAR function. This appears to explain some of the anomalous results even after the flatten fix mentioned above. For LINEAR functions, fmin, rather than fmin_bfgs, may work better.
QED
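The shape mismatch behind ERROR 1 can be reproduced with NumPy alone, using the gfk and pk values reported in the question. This is only a sketch of the failing dot product, not of what fmin_bfgs does internally; the helper directional_derivative is a made-up name wrapping the np.dot call from the traceback.

```python
import numpy as np

# Gradient returned as a (2, 1) column, as in the question.
gfk = np.array([[-12.00921659], [-11.26284221]])
pk = -gfk  # search direction with the same (2, 1) shape


def directional_derivative(gfk, pk):
    # Same expression as the failing line in line_search_wolfe1.
    return np.dot(gfk, pk)


try:
    directional_derivative(gfk, pk)
    aligned = True
except ValueError:
    # (2, 1) times (2, 1) is not a valid matrix product.
    aligned = False

# Flattening both to shape (2,) turns the call into an ordinary inner product.
derphi0 = directional_derivative(gfk.flatten(), pk.flatten())
print(aligned, gfk.flatten().shape, derphi0 < 0)  # False (2,) True
```

With 1d inputs np.dot returns a scalar, which is what the line search needs for the slope along the search direction.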
With current SciPy versions you do not need to pass the fprime argument: if it is omitted, the gradient is approximated numerically. You can also use the minimize function with method='BFGS', again without providing the gradient as an argument.

Resources