I am trying to use PyTorch to learn RNNs (specifically the GRU).
I already understand the forward pass.
But I am confused about how to test on real data after I finish the training process.
Suppose I have a mini-batch whose dimensions are
(40, 6, 15) = (seq_len, batch_size, word_vec_size).
Suppose I define a single-direction GRU layer with hidden state size 7 and a single layer (no stacking).
To do the forward pass, I need to initialize a hidden state.
My question is: when I test on real data, how do I determine the initial hidden state?
If it is determined randomly, the same input data will probably give different outputs.
I don't know whether my understanding is right or not.
Thank you.
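To make the setup concrete, here is a minimal PyTorch sketch of what I mean (the zero initial hidden state below is just PyTorch's default when none is passed; whether that is the right choice at test time is exactly what I'm asking):

```python
import torch
import torch.nn as nn

# Shapes from above: seq_len=40, batch_size=6, word_vec_size=15, hidden_size=7.
gru = nn.GRU(input_size=15, hidden_size=7, num_layers=1)

x = torch.randn(40, 6, 15)   # (seq_len, batch, input_size)
h0 = torch.zeros(1, 6, 7)    # (num_layers * num_directions, batch, hidden_size)

gru.eval()
with torch.no_grad():
    output, h_n = gru(x, h0)  # omitting h0 gives the same result: it defaults to zeros

print(output.shape, h_n.shape)  # torch.Size([40, 6, 7]) torch.Size([1, 6, 7])
```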
I am trying to train a neural network which takes an input (input_t0) and an initial hidden state (call it s_t0) and produces a new hidden state (s_t1) by transforming the input via a series of transformations (neural network layers). At the next time step, a transformed input (input_t1) and the hidden state from the previous time step (s_t1) are passed to the same model. This process keeps repeating for a couple of steps.
The goal of optimization is to ensure the distance between s_t0 and s_t1 is small through self-supervision, since s_t1 is supposed to be a transformed version of s_t0. In other words, I want s_t1 to carry only the new information in the new input. My intuition says that taking the norm of the weights and ensuring it does not go to zero (is this even possible?) would be one way to achieve this. However, I'm afraid that won't necessarily be the best thing to do, as it might not encourage the model to update the state vector with new information.
Currently I train the model by taking the absolute distance between s_t0 and s_t1 via loss = torch.abs(s_t1 - s_t0).mean(dim=1), then calling loss.backward() and optimizer.step(), which updates the weights. (The reason I use abs() is that the hidden states are produced after a ReLU, so they only hold positive values.)
However, I noticed that optimization quickly finds the trivial solution by setting the weights to 0. This causes both s_t0 and s_t1 to get smaller and smaller until their difference is 0, which satisfies the constraint but does not yield the behavior I expect.
So what is the best way to achieve this while ensuring the weights don't go to zero during optimization? Would I be able to somehow use mutual information for this?
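A rough, self-contained sketch of the kind of setup I mean (the module, layer sizes, and optimizer here are illustrative placeholders, not my actual network):

```python
import torch
import torch.nn as nn

# Placeholder transition model: concatenate input and state, project, ReLU.
class StateTransition(nn.Module):
    def __init__(self, input_dim=16, state_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim + state_dim, state_dim),
            nn.ReLU(),  # states are non-negative, as described above
        )

    def forward(self, x_t, s_t):
        return self.net(torch.cat([x_t, s_t], dim=1))

model = StateTransition()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x_t0 = torch.randn(8, 16)   # dummy batch of inputs
s_t0 = torch.zeros(8, 32)   # initial hidden state
s_t1 = model(x_t0, s_t0)

# Per-sample absolute distance, reduced to a scalar so backward() can be called.
loss = torch.abs(s_t1 - s_t0).mean(dim=1).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```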
I'm using Keras for timeseries prediction and I want to create a model based on the self-attention mechanism that will not use any RNNs. For each sample we look at the last x timesteps to predict the next sample.
In other words, I want to feed the network (num_batches, num_samples, timesteps, features) and get (num_batches, predictions).
There is one problem with this.
There is a lot of unnecessary duplication of data: sample n has essentially the same timesteps and features as sample n+1, only shifted one step to the left.
How would you handle this, assuming your dataset is very large?
I am not very familiar with this, but if your issue is "I have too much replicated data", I think you can solve it by devising a generator for your data and then passing the generator to the Keras/TensorFlow fit function (the TensorFlow API documentation states that fit accepts generators as input).
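For example, something along these lines (a rough sketch only: the series, window size, and model are made-up placeholders, I'm assuming the tf.keras fit API that accepts Python generators together with steps_per_epoch, and I've simplified the input to a single 2D series):

```python
import numpy as np
import tensorflow as tf

# Placeholder data: 10,000 timesteps with 4 features, windows of 32 steps.
series = np.random.rand(10_000, 4).astype("float32")
window, batch_size = 32, 64

def window_generator():
    """Yield (windows, targets) batches built on the fly, so the shifted,
    overlapping windows are never materialised all at once."""
    while True:
        starts = np.random.randint(0, len(series) - window, size=batch_size)
        x = np.stack([series[s:s + window] for s in starts])            # (batch, window, features)
        y = np.stack([series[s + window, 0] for s in starts])[:, None]  # next value of feature 0
        yield x, y

# Placeholder model, just to show that fit() consumes the generator directly.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(window, series.shape[1])),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(window_generator(), steps_per_epoch=100, epochs=2)
```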
If your question is about the logic behind the model, I don't see an issue. It is as if you have a sliding window: for each window you predict one value, and then you move the window by a certain amount (in your case, one). Could you elaborate a little more on your concern?
I keep seeing examples floating around the internet where the input and/or output layer has either no activation function, a linear activation function, or None. What I'm confused about is when to use each, and how to know which one I should use. I'm also confused about what the number of nodes should be for the input layer.
Right now I have a regression problem: I'm trying to predict a real value from an array of about 54 inputs. Should I be using ReLU as the activation function for the input layer? Should I use a linear output activation? My data is linearly scaled from 0 to 1 for each feature independently, since the features have different units. I was also unsure about the number of nodes to use for my input layer, as some examples pick an arbitrary number unrelated to their input shape, while others say to set it specifically to the number of inputs, or to the number of inputs plus one for a bias. None of the examples so far have explained the reasoning behind their choices.
Since my model isn't performing very well, I thought asking what the architecture should be could help me fine-tune it.
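For concreteness, this is roughly the kind of model I'm experimenting with, sketched here in Keras (the hidden-layer sizes are just my guesses, which is part of what I'm asking about):

```python
import tensorflow as tf

n_features = 54  # number of input features, each scaled to [0, 1]

model = tf.keras.Sequential([
    # The first layer's unit count (64 here) is a guess; only input_shape
    # has to match the number of features.
    tf.keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),  # no activation, i.e. linear output for a real-valued target
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```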
I'm currently working through a Keras tutorial on recurrent network training and I'm having trouble understanding the stateful LSTM concept. To keep things as simple as possible, all sequences have the same length seq_length. As far as I understand it, the input data is of shape (n_samples, seq_length, n_features) and we then train our LSTM on n_samples/M batches of size M as follows:
For each batch:
Feed in the 2D-tensors (seq_length, n_features) and for each input 2D-tensor compute the gradient
Sum these gradients to get the total gradient on the batch
Backpropagate the gradient and update weights
In the tutorial's example, feeding in the 2D-tensors is feeding in a sequence of size seq_length of letters encoded as vectors of length n_features. However, the tutorial says that in the Keras implementation of LSTMs, the hidden state is not reset after a whole sequence (2D-tensor) is fed in, but after a batch of sequences is fed in to use more context.
Why does keeping the hidden state of the previous sequence and using it as initial hidden state for our current sequence improve the learning and the predictions on our test set, since that "previously learned" initial hidden state won't be available when making predictions? Moreover, Keras' default behaviour is to shuffle input samples at the beginning of each epoch so the batch context is changed at each epoch. This behaviour seems contradictory to keeping the hidden state through a batch since batch context is random.
LSTMs in Keras aren't stateful by default - each sequence starts with newly-reset states. By setting stateful=True in your recurrent layer, successive inputs in a batch don't reset the network state. This assumes that the sequences are actually successive, and it means that in a (very informal) sense, you're training on sequences of length batch_size * seq_length.
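For reference, here is a minimal sketch of what that looks like (assuming the tf.keras 2.x API; shapes, layer sizes, and the dummy data are arbitrary):

```python
import numpy as np
import tensorflow as tf

batch_size, seq_length, n_features = 32, 10, 8

model = tf.keras.Sequential([
    # stateful=True requires a fixed batch size, hence batch_input_shape.
    tf.keras.layers.LSTM(
        64,
        stateful=True,
        batch_input_shape=(batch_size, seq_length, n_features),
    ),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Dummy data: the sample count must be a multiple of batch_size.
x = np.random.rand(batch_size * 4, seq_length, n_features).astype("float32")
y = np.random.rand(batch_size * 4, 1).astype("float32")

for epoch in range(3):
    # shuffle=False keeps successive batches genuinely successive, so the
    # carried-over state makes sense; states are reset manually per epoch.
    model.fit(x, y, batch_size=batch_size, shuffle=False, epochs=1, verbose=0)
    model.reset_states()
```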
Why does keeping the hidden state of the previous sequence and using it as initial hidden state for our current sequence improve the learning and the predictions on our test set, since that "previously learned" initial hidden state won't be available when making predictions?
In theory, it improves learning because a longer context can teach the network things about the distribution that are still relevant when testing on the individually shorter sequences. If the network is learning some probability distribution, that distribution should hold over different sequence lengths.
Moreover, Keras's default behaviour is to shuffle input samples at the beginning of each epoch so the batch context is changed at each epoch. This behaviour seems contradictory to keeping the hidden state through a batch since batch context is random.
I haven't checked, but I assume that when stateful=True, only batches are shuffled - not the sequences within them.
In general, when we give the network some initial state, we don't mean for that to be a universally better starting point. It just means that the network can take the information from previous sequences into account when training.
I'm trying to follow Kalchbrenner et al. 2014 (http://nal.co/papers/Kalchbrenner_DCNN_ACL14) (and basically most of the papers in the last two years that have applied CNNs to NLP tasks) and implement the CNN model they describe. Unfortunately, although I'm getting the forward pass right, it seems I have a problem with the gradients.
What I'm doing is a full convolution of the input with W per row, per kernel, per input in the forward pass (not rotated, so it's actually a correlation).
Then, for the gradients wrt W, a valid convolution of the inputs with the previous delta per row, per kernel, per input (again, not rotated).
And finally, for the gradients wrt x, another valid convolution of the previous delta with W, again per row, per kernel, per input (no rotation).
This returns the correct size and dimensionality, but the gradient checking is really off when connecting layers. When testing a single conv layer the results are correct; connecting two conv layers is also correct; but when adding MLP, pooling, etc. it starts looking bad. All other layer types were also tested separately and they are correct too, so I'd assume the problem starts with the calculation of the gradient wrt W_conv.
Does anyone have an idea or a useful link to a similar implementation?
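To make the problem easier to reproduce, here is a tiny 1D version of the operation with explicit index bookkeeping and a finite-difference check (illustrative only; my real code follows the per-row, per-kernel layout described above):

```python
import numpy as np

def forward_full_corr(x, w):
    """Full correlation (no kernel flip) of 1D input x with kernel w."""
    N, K = len(x), len(w)
    x_pad = np.pad(x, K - 1)  # zero-pad K-1 entries on each side
    return np.array([np.dot(x_pad[n:n + K], w) for n in range(N + K - 1)])

def backward_full_corr(x, w, delta):
    """Gradients of sum(delta * forward_full_corr(x, w)) wrt w and x."""
    N, K = len(x), len(w)
    x_pad = np.pad(x, K - 1)
    # dL/dw[k] = sum_n delta[n] * x_pad[n + k]
    dw = np.array([np.dot(delta, x_pad[k:k + N + K - 1]) for k in range(K)])
    # dL/dx_pad[n + k] accumulates delta[n] * w[k]
    dx_pad = np.zeros_like(x_pad)
    for n in range(N + K - 1):
        dx_pad[n:n + K] += delta[n] * w
    return dw, dx_pad[K - 1:K - 1 + N]

def num_grad(f, v, eps=1e-6):
    """Central finite differences of a scalar function f wrt vector v."""
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = eps
        g[i] = (f(v + e) - f(v - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x, w = rng.normal(size=6), rng.normal(size=3)
delta = rng.normal(size=len(x) + len(w) - 1)  # stand-in for the upstream delta

dw, dx = backward_full_corr(x, w, delta)
dw_num = num_grad(lambda w_: np.sum(delta * forward_full_corr(x, w_)), w)
dx_num = num_grad(lambda x_: np.sum(delta * forward_full_corr(x_, w)), x)
print("max |dw - dw_num| =", np.max(np.abs(dw - dw_num)))
print("max |dx - dx_num| =", np.max(np.abs(dx - dx_num)))
```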