I just finished reading the Transformer paper and the BERT paper, but I couldn't figure out why the Transformer is uni-directional and BERT is bi-directional, as mentioned in the BERT paper. Since they don't use recurrent networks, it's not so straightforward to interpret the directions. Can anyone give me a clue? Thanks.
To clarify, the original Transformer model from Vaswani et al. is an encoder-decoder architecture. Therefore the statement "Transformer is uni-directional" is misleading.
In fact, the transformer encoder is bi-directional, which means that the self-attention can attend to tokens both on the left and right. In contrast, the decoder is uni-directional, since while generating text one token at a time, you cannot allow the decoder to attend to the right of the current token. The transformer decoder constrains the self-attention by masking the tokens to the right.
BERT uses the transformer encoder architecture and can therefore attend both to the left and right, resulting in "bi-directionality".
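The difference between the two attention patterns can be illustrated with a small NumPy sketch. This is only a toy example: the function name `attention_weights` and the all-zero score matrix are assumptions made for illustration, not code from either paper.

```python
import numpy as np

def attention_weights(scores, causal):
    """Row-wise softmax over raw attention scores; optionally hide tokens to the right."""
    scores = scores.astype(float).copy()
    if causal:
        n = scores.shape[0]
        # Entries strictly above the diagonal (j > i) lie to the right of token i.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores[future] = -np.inf
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

scores = np.zeros((4, 4))  # toy scores: every token pair is equally compatible
encoder_style = attention_weights(scores, causal=False)  # bi-directional
decoder_style = attention_weights(scores, causal=True)   # left-context only

print(encoder_style[0])  # token 0 spreads its attention over all four positions
print(decoder_style[0])  # token 0 can only attend to itself
```

In the masked case, row i of the weight matrix is zero for every column j > i, which is what makes the decoder "uni-directional".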
From the BERT paper itself:
We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation.
Recommended reading: this article.
I have recently read about BERT and want to use BertForMaskedLM for the fill_mask task. I know about the BERT architecture. Also, as far as I know, BertForMaskedLM is built from BERT with a language modeling head on top, but I have no idea what "language modeling head" means here. Can anyone give me a brief explanation?
BertForMaskedLM, as you have understood correctly, uses a Language Modeling (LM) head.
Generally, as well as in this case, the LM head is a linear layer whose input dimension is the hidden state size (768 for BERT-base) and whose output dimension is the vocabulary size. Thus, it maps the hidden state output of the BERT model to a score for every token in the vocabulary. The loss is calculated from the scores the model assigns at a given position with respect to the target token.
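As a rough sketch of that description (not the library's actual code, which may apply further layers before the projection), the head can be written as a single matrix multiplication followed by a cross-entropy loss. The random weights, the toy vocabulary size of 1000, the masked position, and the target id are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, vocab_size, seq_len = 768, 1000, 5  # vocab_size is a toy value

# Stand-in for the encoder output: one hidden vector per token position.
hidden_states = rng.normal(size=(seq_len, hidden_size))

# LM head as described above: one linear layer, hidden_size -> vocab_size.
W = rng.normal(scale=0.02, size=(hidden_size, vocab_size))
b = np.zeros(vocab_size)
logits = hidden_states @ W + b  # shape (seq_len, vocab_size)

def cross_entropy(logits_row, target_id):
    """Negative log-probability of the target token under softmax(logits_row)."""
    shifted = logits_row - logits_row.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target_id]

masked_position, target_id = 2, 42  # hypothetical masked slot and its true token id
loss = cross_entropy(logits[masked_position], target_id)
predicted_id = int(logits[masked_position].argmax())  # highest-scoring vocabulary entry
```

Each row of `logits` holds one score per vocabulary entry, so filling a mask amounts to taking the highest-scoring entries of the row at the masked position.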
In addition to @Ashwin Geet D'Sa's answer:
Here is Hugging Face's definition of a model head:
The model head refers to the last layer of a neural network that
accepts the raw hidden states and projects them onto a different
dimension.
You can find Hugging Face's definitions of other terms on this page: https://huggingface.co/docs/transformers/glossary
I am experimenting with the use of transformer embeddings in sentence classification tasks without fine-tuning them. I have used BERT embeddings, and those experiments gave me very good results. Now I want to use GPT-2 embeddings (without fine-tuning). So I have the following questions:
Can I use GPT-2 embeddings like that (because I know GPT-2 is trained left to right)?
Are there any example uses of GPT-2 in classification tasks other than generation tasks?
If I can use GPT-2 embeddings, how should I do it?
I basically solved the problem. Here I used embeddings extracted from GPT-2.
So yes, we can use the final token of the GPT-2 embedding sequence as the class token. Because the self-attention runs left to right, the final token is the only position that attends to the whole sequence, so it can represent the sequential information.
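A minimal sketch of this "last token as class token" idea, assuming the per-token embeddings have already been extracted into an array and that the batch is right-padded. The function name `last_token_embedding` and the toy arrays are hypothetical, not taken from the linked implementation.

```python
import numpy as np

def last_token_embedding(hidden_states, attention_mask):
    """Pick the hidden state of the last real (non-padding) token in each sequence.

    hidden_states: (batch, seq_len, hidden) array of per-token embeddings.
    attention_mask: (batch, seq_len) array, 1 for real tokens, 0 for right padding.
    """
    last_index = attention_mask.sum(axis=1) - 1  # position of the final real token
    batch_index = np.arange(hidden_states.shape[0])
    return hidden_states[batch_index, last_index]

# Toy batch: 2 sequences, 4 positions, hidden size 3.
hidden_states = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
attention_mask = np.array([[1, 1, 1, 1],
                           [1, 1, 0, 0]])  # second sequence has two padding slots
sentence_vectors = last_token_embedding(hidden_states, attention_mask)
```

The resulting `sentence_vectors` (one row per sentence) can then be fed to any ordinary classifier. Taking the last real token rather than the last position matters, because a padding slot carries no information about the sentence.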
Please check the following GitHub issue for an implementation that uses GPT-2 embeddings. github issue
I conducted experiments comparing GPT-2 embeddings with RoBERTa embeddings. I got better results with the RoBERTa embeddings than with the GPT-2 embeddings.
I am confused about what exactly a linear chain CRF implementation is. Some people say that "The Linear Chain CRF restricts the features to depend on only the current (i) and previous label (i-1), rather than arbitrary labels throughout the sentence", while others say that it restricts the features to depend on the current (i) and future label (i+1).
I am trying to understand the implementation that goes behind the Stanford NER Model. Can someone please explain what exactly the linear chain CRF Model is?
Both models would be linear chain CRF models. The important part about the "linear chain" is that the features depend only on the current label and one direct neighbour in the sequence. Usually this would be the previous label (because that corresponds with reading order), but it could also be the future label. Such a model would basically process the sentence backwards; I have never seen this in the literature, but it would still be a linear chain CRF.
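To make the "current label plus one direct neighbour" restriction concrete, here is a small sketch in which a label sequence is scored by per-position emission scores plus previous-to-current transition scores, and Viterbi decoding finds the best sequence. The score tables are toy values and the function names are illustrative; this is not the Stanford NER implementation.

```python
import numpy as np

def sequence_score(emissions, transitions, labels):
    """Unnormalised score: emission at each position plus transition from the previous label."""
    score = emissions[0, labels[0]]
    for i in range(1, len(labels)):
        score += transitions[labels[i - 1], labels[i]] + emissions[i, labels[i]]
    return score

def viterbi(emissions, transitions):
    """Best label sequence under the first-order (linear-chain) factorisation."""
    n, k = emissions.shape
    best = emissions[0].copy()
    backpointers = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        # candidate[p, c]: best score ending in label p at i-1, then label c at i
        candidate = best[:, None] + transitions + emissions[i][None, :]
        backpointers[i] = candidate.argmax(axis=0)
        best = candidate.max(axis=0)
    path = [int(best.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(backpointers[i, path[-1]]))
    return path[::-1], float(best.max())

emissions = np.array([[2.0, 0.0], [0.0, 1.0], [1.5, 0.0]])  # toy: 3 tokens x 2 labels
transitions = np.array([[0.5, -1.0], [-1.0, 0.5]])          # toy: previous x current
path, score = viterbi(emissions, transitions)
```

Because every term involves at most two adjacent labels, the dynamic programme only has to remember the best score per label at the previous position. Reversing the sentence and reusing the same code would give the "current and future label" variant.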
As far as I know, the Stanford NER model is based on a model that uses the current and the previous label, but it also uses an extension that can also look to labels further back. It is therefore not a strict linear-chain model, but uses an extension described in this paper:
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370. http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf
How do I create word vectors? I used one-hot encoding to create word vectors, but they are huge and do not generalize to semantically similar words. I have heard about word vectors learned with a neural network that capture word similarity. So I wanted to know how to generate such vectors (the algorithm), or good material to get started with creating word vectors.
Word vectors, or so-called distributed representations, have a long history by now, starting perhaps from the work of Y. Bengio (Bengio, Y., Ducharme, R., & Vincent, P. (2001). A neural probabilistic language model. NIPS.), where word vectors were obtained as a by-product of training a neural-net language model.
A lot of research has demonstrated that these vectors do capture semantic relationships between words (see for example http://research.microsoft.com/pubs/206777/338_Paper.pdf). Also, this important paper (http://arxiv.org/abs/1103.0398) by Collobert et al. is a good starting point for understanding word vectors, and how they are obtained and used.
Besides word2vec, there are many methods to obtain them. Examples include SENNA embeddings by Collobert et al. (http://ronan.collobert.com/senna/), RNN embeddings by T. Mikolov that can be computed using RNNToolkit (http://www.fit.vutbr.cz/~imikolov/rnnlm/), and many more. For English, ready-made embeddings can be downloaded from these websites. Note that word2vec uses the skip-gram model, a shallow log-linear model rather than a deep neural network. Another fast implementation for computing word representations is GloVe (http://www-nlp.stanford.edu/projects/glove/). It is an open question whether deep neural networks are essential for obtaining good embeddings.
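To give a feel for the skip-gram model mentioned above, the sketch below shows only its data-preparation step: turning a sentence into (center word, context word) pairs. The actual training, which learns vectors that predict the context word from the center word, is omitted. The function name `skipgram_pairs` and the example sentence are made up for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

Words that occur in similar contexts produce similar sets of pairs, which is why the learned vectors for such words end up close to each other, unlike one-hot vectors.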
Depending on your application, you may prefer different types of word vectors, so it's a good idea to try several popular algorithms and see what works best for you.
I think the thing you mean is Word2Vec (https://code.google.com/p/word2vec/). It trains N-dimensional word vectors on a given corpus. In my understanding of word2vec, the neural network is just used to aggregate the dimensions of the vectors while capturing some relationships between words. It should be mentioned, though, that these relationships are not really semantic; they just reflect the structural relationships in your training corpus.
If you want to capture semantic relatedness, have a look at WordNet-based measures, for instance as implemented in these libraries:
Java: https://code.google.com/p/ws4j/
Perl: http://wn-similarity.sourceforge.net/
To get started with word2vec you can use their pretrained vectors. You should find all information about this at https://code.google.com/p/word2vec/.
If you are looking for a Java implementation, this is a good starting point: http://deeplearning4j.org/word2vec.html
I hope this helps
Best wishes
Recently, I read "Discriminative Reranking for Natural Language Processing" by Collins.
I'm confused about what the reranking actually does.
Does it add more global features to the reranking model, or something else?
If you mean this paper, then what is done is the following:
1. train a parser using a generative model, i.e. one where you compute P(term | tree) and use Bayes' rule to reverse that and get P(tree | term),
2. apply that to get an initial k-best ranking of trees from the model,
3. train a second model on features of the desired trees,
4. apply that to re-rank the output from 2.
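The final re-ranking step can be sketched as follows. This is only an illustration under stated assumptions: the second model is reduced to a fixed linear weight vector, and the candidate trees, their features, and the weights are all invented. The paper's actual training procedure for the second model is not reproduced here.

```python
import numpy as np

def rerank(candidates, weights):
    """Sort k-best candidates by a linear score over their feature vectors (highest first)."""
    scored = [(float(np.dot(weights, c["features"])), c["tree"]) for c in candidates]
    return [tree for _, tree in sorted(scored, key=lambda pair: pair[0], reverse=True)]

# Hypothetical 3-best list from the base parser, in the base parser's order.
# Feature 0 is the base model's log-probability; features 1 and 2 are made-up
# global features of the whole tree that the generative model cannot easily use.
candidates = [
    {"tree": "tree_A", "features": np.array([-1.0, 0.0, 1.0])},
    {"tree": "tree_B", "features": np.array([-1.2, 1.0, 0.0])},
    {"tree": "tree_C", "features": np.array([-2.0, 0.0, 0.0])},
]
weights = np.array([1.0, 0.8, -0.5])  # would be learned in practice; fixed here
reranked = rerank(candidates, weights)
print(reranked)  # ['tree_B', 'tree_A', 'tree_C']
```

Here the base parser prefers tree_A, but the extra features move tree_B to the top. The re-ranker only has to choose among k candidates, so it can use arbitrary features of complete trees without searching the full space of parses.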
The reason why the second model is useful is that in generative models (such as naïve Bayes, HMMs, PCFGs), it can be hard to add features other than word identity, because the model would try to predict the probability of the exact feature vector instead of the separate features, which might not have occurred in the training data and will have P(vector|tree) = 0 and therefore P(tree|vector) = 0 (+ smoothing, but the problem remains). This is the eternal NLP problem of data sparsity: you can't build a training corpus that contains every single utterance that you'll want to handle.
Discriminative models such as MaxEnt are much better at handling feature vectors, but take longer to fit and can be more complicated to handle (although CRFs and neural nets have been used to construct parsers as discriminative models). Collins et al. try to find a middle ground between the fully generative and fully discriminative approaches.