How to pad sequences using Hugging Face for BERT training - PyTorch

I'm using this link to train a Hugging Face BERT model, but I see that different batches have different sequence lengths at training time. I want to keep the same sequence length for all batches. How can I do that? And how does Hugging Face handle different sequence lengths across batches?
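One way to keep every batch at the same length is to pad (and truncate) to a fixed max_length in the tokenizer, instead of letting the data collator pad each batch dynamically. A minimal sketch, assuming bert-base-uncased and a placeholder list of texts:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

texts = ["a short sentence", "a somewhat longer training sentence"]  # placeholder data
batch = tokenizer(
    texts,
    padding="max_length",  # pad every example to the same fixed length
    truncation=True,       # cut anything longer than max_length
    max_length=128,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([2, 128]), the same for every batch
```

As for how Hugging Face normally handles it: the default dynamic padding (e.g. DataCollatorWithPadding) pads each batch only up to its longest example, which is why the sequence length differs from batch to batch; the attention mask makes the model ignore the padding either way.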

Related

BERT pre-training from scratch on custom text data

I want to pre-train BERT from scratch using the Hugging Face library. Originally, BERT was pre-trained on two tasks: MLM and NSP. I have been successful in training it for MLM, but I have been running into this issue for weeks now:
Token indices sequence length is longer than the specified maximum sequence length for this model (544 > 512). Running this sequence through the model will result in indexing errors
Can anyone help me pre-train BERT on those two tasks?
I have tried the run_mlm.py script from Hugging Face, which trains only on MLM. I have also tried the original BERT code, but it is not very intuitive to me, so I am sticking with the Hugging Face library.
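For reference, the warning quoted above appears when an encoded example exceeds BERT's 512-token positional-embedding limit, and it typically goes away once truncation is enabled in the tokenizer. A minimal sketch, assuming bert-base-uncased and a placeholder corpus:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

documents = ["a long training document ..."]  # placeholder corpus
encoded = tokenizer(
    documents,
    truncation=True,  # drop tokens beyond max_length instead of overflowing
    max_length=512,   # BERT's positional-embedding limit
)
```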

Extracting hidden representations for each token - PyTorch LSTM

I am currently working on an NLP project involving recurrent neural networks. I implemented an LSTM with PyTorch, following the tutorial here.
For my project, I need to extract the hidden representation for every token of an input text. I thought the easiest way would be to test with a batch size and sequence length of 1, but when I do that the loss becomes orders of magnitude larger than in the training phase (during training I used a batch size of 64 and a sequence length of 35).
Is there any other way I can easily access these word-level hidden representations? Thank you.
Yes, that is possible with nn.LSTM, as long as it is a single-layer LSTM. If you check the documentation (here), you can see that an LSTM outputs a tensor and a tuple of tensors. The output tensor contains the hidden state for every step of the sequence, while the tuple contains the hidden and cell states for the last step only. What each dimension of the output means depends on how you initialized your network: either the first or the second dimension is the batch dimension, and the rest is the sequence of hidden states you want.
If you use a packed sequence as input, it is a bit of a different story.
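A minimal sketch of reading per-token hidden states off that first output tensor (batch size and sequence length are the ones from the question, the rest is arbitrary):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=1, batch_first=True)

x = torch.randn(64, 35, 100)   # (batch, seq_len, embedding_dim), as in training
output, (h_n, c_n) = lstm(x)

# output holds the hidden state for EVERY time step: (batch, seq_len, hidden_size)
per_token_hidden = output[0]   # (35, 256): one vector per token of the first example
# h_n and c_n only hold the hidden/cell states of the final time step
```

This way you can keep the training batch size and sequence length and still read off a representation for each token, rather than rerunning the model with a batch size of 1.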

Recommended deep learning model for sequence completion

I am trying to solve the problem of sequence completion. Let's suppose we have the ground-truth sequence (1, 2, 4, 7, 6, 8, 10, 12, 18, 20).
The input to our model is an incomplete sequence, i.e. (1, 2, 4, _, _, _, 10, 12, 18, 20). From this incomplete sequence, we want to predict the original (ground-truth) sequence. Which deep learning models can be used to solve this problem?
Is this a problem for an encoder-decoder LSTM architecture?
Note: we have thousands of complete sequences to train and test the model.
Any help is appreciated.
This is not exactly a sequence-to-sequence problem; it is a sequence labeling problem. I would suggest either stacking bidirectional LSTM layers followed by a classifier, or Transformer layers followed by a classifier.
An encoder-decoder architecture requires plenty of data to train properly and is particularly useful when the target sequence can be of arbitrary length, only vaguely depending on the source sequence length. It would eventually learn to do the job with enough data, but sequence labeling is a more straightforward problem.
With sequence labeling, you can set a custom mask over the output, so the model only predicts the missing numbers (see the sketch below). An encoder-decoder model would first need to learn to copy most of the input.
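A minimal sketch of that setup in PyTorch, assuming a small integer vocabulary with a reserved token 0 marking the blanks (all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class SequenceLabeler(nn.Module):
    """Bidirectional LSTM encoder with a per-position classifier."""
    def __init__(self, vocab_size, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, vocab_size)

    def forward(self, x):
        out, _ = self.lstm(self.embed(x))
        return self.classifier(out)  # (batch, seq_len, vocab_size)

model = SequenceLabeler(vocab_size=32)
inputs = torch.tensor([[1, 2, 4, 0, 0, 0, 10, 12, 18, 20]])   # 0 = blank token
targets = torch.tensor([[1, 2, 4, 7, 6, 8, 10, 12, 18, 20]])  # ground truth
mask = inputs.eq(0)  # True only at the missing positions

logits = model(inputs)
loss = nn.functional.cross_entropy(logits[mask], targets[mask])  # loss on blanks only
```

Only the masked positions contribute to the loss, so the model is never penalized for "predicting" the values it was already given.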
In your sequence completion task, are you trying to predict the next items in a sequence, or only to learn the missing values?
Training a neural network with missing data is an issue in its own right.
If you're using Keras and an LSTM-type network to solve your problem, you should consider masking; you can refer to this Stack Overflow thread for more details: Multivariate LSTM with missing values
Regarding predicting the missing values, why not try auto-encoders?

How does the input word2vec embedding get fine-tuned when training a CNN

When I read the paper "Convolutional Neural Networks for Sentence Classification" (Yoon Kim, New York University), I noticed that it implements the "CNN-non-static" model: a model initialized with pre-trained word2vec vectors (unknown words are randomly initialized), in which the pre-trained vectors are fine-tuned for each task.
I just do not understand how the pre-trained vectors are fine-tuned for each task. As far as I know, the input vectors, which are converted from strings by the pre-trained word2vec.bin, are just like an image matrix, which cannot change while training the CNN. So, if they can, HOW? Please help me out, thanks a lot in advance!
The word embeddings are weights of the neural network, and can therefore be updated during backpropagation.
E.g. http://sebastianruder.com/word-embeddings-1/ :
Naturally, every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors into a lower dimensional space, which it then fine-tunes through back-propagation, necessarily yields word embeddings as the weights of the first layer, which is usually referred to as Embedding Layer.
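To make this concrete, a minimal PyTorch sketch of an embedding layer that is initialized from pre-trained vectors but left trainable (the random matrix below is a stand-in for vectors loaded from the word2vec binary):

```python
import torch
import torch.nn as nn

# Stand-in for vectors loaded from the pre-trained word2vec .bin file
pretrained_vectors = torch.randn(10000, 300)  # (vocab_size, embedding_dim)

# freeze=False keeps the embedding weights trainable, so backpropagation
# fine-tunes the pre-trained vectors along with the rest of the CNN
embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)

token_ids = torch.tensor([[3, 17, 256]])  # a toy "sentence" of word indices
vectors = embedding(token_ids)            # (1, 3, 300), updated by autograd
```

Setting freeze=True instead would correspond to the paper's "CNN-static" variant, where the pre-trained vectors are used as fixed features.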

Training a CNN with pre-trained word embeddings is very slow (TensorFlow)

I'm using TensorFlow (0.6) to train a CNN on text data. I'm using a method similar to the second option specified in this SO thread (with the exception that the embeddings are trainable). My dataset is pretty small and the vocabulary is around 12,000 words. When I train using random word embeddings everything works nicely. However, when I switch to the pre-trained embeddings from the word2vec site, the vocabulary grows to over 3,000,000 words and training iterations become over 100 times slower. I'm also seeing this warning:
UserWarning: Converting sparse IndexedSlices to a dense Tensor with 900482700 elements
I saw the discussion on this TensorFlow issue, but I'm still not sure if the slowdown I'm experiencing is expected or if it's a bug. I'm using the Adam optimizer but it's pretty much the same thing with Adagrad.
One workaround I guess I could try is to train using a minimal embedding matrix with only the ~12,000 words in my dataset, serialize the resulting embeddings, and merge them at runtime with the remaining words from the pre-trained embeddings. I think this should work, but it sounds hacky.
Is that currently the best solution or am I missing something?
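For what it's worth, a minimal sketch of that workaround, i.e. slicing the huge pre-trained matrix down to the dataset vocabulary before training (the function and its inputs are hypothetical):

```python
import numpy as np

def build_small_embedding(full_vectors, dataset_vocab, dim=300):
    """Keep only the pre-trained rows for words that occur in the dataset.

    full_vectors: dict-like mapping word -> dim-dimensional vector (e.g. from gensim)
    dataset_vocab: the ~12,000-word vocabulary of the training data
    """
    word_to_id = {w: i for i, w in enumerate(dataset_vocab)}
    matrix = np.random.uniform(-0.25, 0.25, (len(dataset_vocab), dim))
    for word, idx in word_to_id.items():
        if word in full_vectors:
            matrix[idx] = full_vectors[word]  # copy the pre-trained row
    return word_to_id, matrix.astype(np.float32)
```

The trainable embedding variable then has ~12,000 rows instead of 3,000,000, so the optimizer's updates stay small.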
So there were two issues here:
As mrry pointed out in his comment on the question, the warning was not the result of a conversion during the updates. Rather, I was calculating summary statistics (sparsity and histogram) on the embedding gradients, and that caused the conversion.
Interestingly, removing the summaries made the message go away, but the code remained slow. Per the TensorFlow issue referenced in the question, I also had to replace the AdamOptimizer with the AdagradOptimizer; once I did that, the runtime was back on par with the one obtained from a small vocabulary.
