This is a general question that applies to any framework, and to both RNNs and LSTMs.
When we use a vanilla (plain) network, a single layer such as
current_layer = torch.nn.Linear(100, 125) means that there are 125 neurons,
each with its own weight vector over the 100 inputs, so the layer transforms the 100 incoming features into 125 outgoing units.
Similarly, current_layer = torch.nn.Linear(125, 100) says that the 125 incoming inputs will be converted to 100 outgoing units.
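A quick shape check of this (a minimal sketch, just to illustrate the in/out feature counts):

import torch

linear = torch.nn.Linear(100, 125)
x = torch.randn(8, 100)        # a batch of 8 input vectors with 100 features each
print(linear(x).shape)         # torch.Size([8, 125])
print(linear.weight.shape)     # torch.Size([125, 100]): one 100-dim weight vector per neuron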
Now my question: if I have
previous_layer = torch.nn.Embedding(10000,100)
current_layer = torch.nn.RNN(100,125)
what does this mean? What do the 100 and the 125 represent in the case of an RNN? Will there be 100 timesteps as input, one value per timestep? And what does the 125 (the hidden_size) mean in this context?
Is it a weight vector of 125 units which gets multiplied by each single float (each of the 100 incoming inputs, for a total of 100 times) to produce a hidden state?
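For reference, a minimal shape check of what those two arguments mean for torch.nn.RNN: 100 is the per-timestep input feature size (here, the embedding dimension), 125 is the hidden-state size, and the number of timesteps is not fixed by the constructor at all; it comes from the data:

import torch

rnn = torch.nn.RNN(input_size=100, hidden_size=125)
seq_len, batch = 7, 3                    # timesteps come from the input, not the constructor
x = torch.randn(seq_len, batch, 100)     # e.g. 7 embedded tokens of 100 dims each
output, h_n = rnn(x)
print(output.shape)                      # torch.Size([7, 3, 125]): hidden state at every timestep
print(h_n.shape)                         # torch.Size([1, 3, 125]): final hidden state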
Related
Suppose I provide a list of sentences:
['I like python',
'I am learning python', # longest sentence of length 4 tokens
'Python is simple']
BERT will produce an output of shape (3, 4+2, 768),
because there were 3 sentences, a maximum of 4 tokens (plus [CLS] and [SEP]), and 768 hidden dimensions.
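A minimal sketch of that shape check with the Hugging Face transformers library (assuming bert-base-uncased, and that each word maps to a single WordPiece token):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

sentences = ['I like python', 'I am learning python', 'Python is simple']
inputs = tokenizer(sentences, padding=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([3, 6, 768]): 3 sentences, 4+2 tokens, 768 dims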
Suppose I provide another list of sentences:
['I like python',
'I am learning python',
'Python is simple',
'Python is fun to learn' # 5 tokens
]
The new embedding output would have shape (4, 5+2, 768).
I understand that dim[0] becomes 4 because there are now 4 sentences; this is achieved by increasing the rows of the tensor (the batch size) during computation.
I also understand that dim[1] becomes 5+2 because the maximum number of tokens is now 5 and there are [CLS] and [SEP] tokens at the start and end.
I also understand that there is a padding mechanism that accepts up to max_position_embeddings=512 for the BERT model.
What I want to ask is:
During computation, does BERT pad all the values after the 5th element with zeros and run the computation on an input of shape (4, 512) (4 sentences, 512 max tokens),
and then, after computing an output of shape (4, 512, 768), trim the tensor down to (4, 5+2, 768)?
If the above assumption is true, isn't that a huge waste of resources, since the majority of the 512 token positions require no attention?
I have read about the attention_mask matrix that tells the model which tokens are needed for computation, but I don't understand how attention_mask achieves this: when the model's architecture is initialised with N-dimensional inputs, how does attention_mask help during computation to ignore/avoid computing over the masked elements?
Which part of the BERT model explicitly restricts the output to (4, 5+2, 768)?
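On the attention_mask question, my understanding (a sketch, not the exact Hugging Face code) is that the batch is only padded to the longest sequence in the batch (here 5+2 = 7, not 512), and that the mask is turned into a large negative additive bias on the attention scores, so the softmax assigns the padded positions essentially zero weight:

import torch

scores = torch.randn(4, 7, 7)                  # raw attention scores: (batch, query, key)
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 0, 0]]).repeat(4, 1).float()

# This mirrors what BERT does internally: masked key positions receive a huge
# negative bias, so softmax gives them (numerically) zero attention weight.
bias = (1.0 - attention_mask)[:, None, :] * -10000.0
weights = torch.softmax(scores + bias, dim=-1)
print(weights[0, 0])                           # the last two entries are ~0

Note that the padded positions are still computed; the mask just prevents them from influencing the real tokens. Padding only to the batch maximum, rather than to 512, is what keeps the output at (4, 5+2, 768).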
I'm training a transformer model for text generation.
let's assume:
vocab size = 100
embedding size = 50
max sequence length = 30
batch size = 32
loss = cross entropy loss
the last layer in the model is a fully connected layer,
mapping from shape [30, 32, 50] to [30, 32, 100].
the idea is that for each of the 30 positions in the first dimension, I have a target vector I want to compute the loss against.
The issue is that, based on the docs, this loss only accepts a 2D prediction and a 1D target, so how can I fit my 3D prediction (and 2D target) into it?
Use torch.nn.BCELoss() instead (binary cross-entropy). This expects input and target to be the same size, but they can be any size, and should fall within the range [0, 1]. It computes cross-entropy loss element-wise.
EDIT: if you expect only one element from the vocab to be output, then you should use CrossEntropyLoss and instead encode your labels as a 1D vector of class indices rather than a 2D one-hot matrix (i.e. undo the one-hot encoding). BCE treats each element in the output for a single example as independent of the others, which is not a valid assumption for a multi-class problem. I originally misread and thought the final output was an embedding rather than an element from the vocabulary, hence my original suggestion.
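For completeness, a minimal sketch of that CrossEntropyLoss approach with the shapes from the question: flatten the sequence and batch dimensions into one, so the prediction is 2D and the target is 1D, exactly as the docs require:

import torch

logits = torch.randn(30, 32, 100)             # (seq_len, batch, vocab_size)
targets = torch.randint(0, 100, (30, 32))     # (seq_len, batch), integer class indices

loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(logits.view(-1, 100), targets.view(-1))  # 2D prediction, 1D target
print(loss)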
How to apply the SMOTE algorithm before the word-embedding layer in an LSTM?
I have a text binary classification problem (Good (9500) vs. Bad (500) reviews, 10000 training samples in total, so the training set is unbalanced). I am using an LSTM with pre-trained word embeddings (a 100-dimensional vector for each word), so each training input is a sequence of word-dictionary IDs (50 IDs in total, zero-padded when the description has fewer than 50 words and truncated to 50 when it has more).
Below is my general flow,
Input: 1000 (batch) × 50 (sequence length)
Word embedding: 200 (unique vocabulary words) × 100 (word representation)
After the word-embedding layer (new input for the LSTM): 1000 (batch) × 50 (sequence) × 100 (features)
Final state from the LSTM: 1000 (batch) × 100 (units)
Apply the final layer: 1000 (batch) × 100 (units), multiplied by a [100 (units) × 2 (output classes)] weight matrix
All I want is to generate more data for the Bad reviews with the help of SMOTE.
I faced the same issue.
Found this post on Stack Exchange, which proposes adjusting the class-distribution weights instead of oversampling. Apparently that is the standard way to deal with class imbalance in LSTMs/RNNs.
https://stats.stackexchange.com/questions/342170/how-to-train-an-lstm-when-the-sequence-has-imbalanced-classes
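For the 9500/500 split from the question, that would look something like the following Keras sketch (model, x_train, and y_train are assumed to be your existing LSTM model and data; weighting Bad by 9500 / 500 = 19 balances the two classes):

# Weight each class inversely to its frequency: Bad reviews count 19x as much.
class_weight = {0: 1.0,    # Good: 9500 samples
                1: 19.0}   # Bad:   500 samples (9500 / 500 = 19)

model.fit(x_train, y_train,
          batch_size=1000,
          epochs=10,
          class_weight=class_weight)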
I am trying to refactor my Keras code to use 'Batch Hard' sampling for the triplets, as proposed in https://arxiv.org/pdf/1703.07737.pdf:
" the core idea is to form batches by randomly sampling P classes
(person identities), and then randomly sampling K images of each class
(person), thus resulting in a batch of PK images. Now, for each
sample a in the batch, we can select the hardest positive and the
hardest negative samples within the batch when forming the triplets
for computing the loss, which we call Batch Hard"
So at the moment I have a Python generator (for use with model.fit_generator in Keras) which produces batches on the CPU. Then the actual forward and backward passes through the model could be done on the GPU.
However, how can this be made to fit the 'Batch Hard' method? The generator samples 64 images, from which 64 triplets should be formed. First, a forward pass is required to obtain the 64 embeddings with the current model:
embedding_model = Model(inputs = input_image, outputs = embedding)
But then the hardest positive and the hardest negative have to be selected from the 64 embeddings to form the triplets. Then the loss can be computed:
anchor = Input(input_shape, name='anchor')
positive = Input(input_shape, name='positive')
negative = Input(input_shape, name='negative')
f_anchor = embedding_model(anchor)
f_pos = embedding_model(positive)
f_neg = embedding_model(negative)
triplet_model = Model(inputs = [anchor, positive, negative], outputs=[f_anchor, f_pos, f_neg])
And this triplet_model can be trained by defining a triplet loss function. However, is it possible with Keras to use fit_generator together with the 'Batch Hard' method? Or how can I get access to the embeddings of the other samples in the batch?
Edit: with keras.layers.Lambda I can define my own layer that creates triplets, with input (batch_size, height, width, 3) and output (batch_size, 3, height, width, 3), but I also need access to the IDs somewhere. Is this possible within the layer?
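One way around the triplet inputs entirely (a sketch, under the assumption that y_true carries the integer identity labels and the model outputs the embeddings directly) is to do the hardest positive/negative selection inside a custom loss function, so it runs on the GPU over the whole batch:

import tensorflow as tf

def batch_hard_triplet_loss(labels, embeddings, margin=0.2):
    # Pairwise squared Euclidean distances between all embeddings in the batch.
    dot = tf.matmul(embeddings, embeddings, transpose_b=True)
    sq_norms = tf.linalg.diag_part(dot)
    dists = tf.maximum(sq_norms[:, tf.newaxis] - 2.0 * dot + sq_norms[tf.newaxis, :], 0.0)

    labels = tf.reshape(labels, [-1])
    same = tf.cast(tf.equal(labels[:, tf.newaxis], labels[tf.newaxis, :]), tf.float32)

    # Hardest positive: the most distant sample sharing the anchor's identity.
    hardest_pos = tf.reduce_max(dists * same, axis=1)
    # Hardest negative: the closest sample with a different identity
    # (same-identity pairs are pushed to +inf so they are never selected).
    hardest_neg = tf.reduce_min(dists + same * 1e9, axis=1)

    return tf.reduce_mean(tf.maximum(hardest_pos - hardest_neg + margin, 0.0))

With this, the generator only needs to yield (images, identity_labels) batches of PK images; no triplet branches or Lambda layers are required.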
I built a convolutional neural network in Keras.
model.add(Convolution1D(nb_filter=111, filter_length=5, border_mode='valid', activation="relu", subsample_length=1))
According to the CS231n lecture, a convolution operation creates a feature map (i.e. activation map) for each filter, and these maps are then stacked together. In my case the convolutional layer has a 300-dimensional input. Hence, I expect the following computation:
Each filter has a window size of 5. Consequently, each filter produces 300 - 5 + 1 = 296 convolutions.
As there are 111 filters, the convolutional layer should produce a 111 × 296 output.
However, the actual output shapes look different:
convolutional_layer = model.layers[1]
conv_weights, conv_biases = convolutional_layer.get_weights()
print(conv_weights.shape) # (5, 1, 300, 111)
print(conv_biases.shape) # (111,)
The shape of the bias values makes sense, because there is one bias value per filter. However, I do not understand the shape of the weights. Apparently, the first dimension depends on the filter size. The third dimension is the number of input neurons, which should have been reduced by the convolution. The last dimension probably refers to the number of filters. This does not make sense, because how would I easily get the feature map for a specific filter?
Keras uses either Theano or TensorFlow as a backend. According to their documentation, the output of a convolution operation is a 4D tensor (batch_size, output_channels, output_rows, output_columns).
Can somebody explain the output shape to me in accordance with the CS231n lecture?
Your weight dimensions have to be [filter_height, filter_width, in_channels, out_channels].
In your example, the input channel (the depth of the input) is 300, and you want the output channel to be 111.
The total number of filters is 111, not 300 * 111.
As you said yourself, there is one bias per filter, so 111 biases for 111 filters.
Each of the 111 filters produces one convolution over the input.
The weight shape in your case means that you are using a kernel patch of shape 5 * 1.
The third dimension means that the depth of the input feature map is 300.
The fourth dimension means that the depth of the output feature map is 111.
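To make the last point concrete, here is a small sketch (assuming the (5, 1, 300, 111) layout from the question) of how to pull out the weights of one specific filter:

import numpy as np

conv_weights = np.random.randn(5, 1, 300, 111)  # (filter_len, 1, in_channels, n_filters)

filter_7 = conv_weights[:, :, :, 7]   # all weights of filter #7: shape (5, 1, 300)
print(filter_7.shape)

# The feature map of filter #7 is not stored in the weights; it is output
# channel 7 of the layer once you run an input through it.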
Actually it makes very good sense. You learn the weights of the filters. Each filter in turn produces an output (i.e. an activation map with respect to your input data).
The first two axes of your conv_weights.shape are the dimensions of the filter that is being learned (as you already mentioned). Your filter_length is 5 × 1. Your input has 300 dimensions and you want 111 output filters, so you end up with 300 × 111 filter slices of 5 × 1 weights each.
I assume that the weight slice of filter #0 for input dimension #0 is something like your_weights[:, :, 0, 0].