Passing multiple sentences to BERT? - nlp

I have a dataset with paragraphs that I need to classify into two classes. These paragraphs are usually 3-5 sentences long. The overwhelming majority of them are less than 500 words long. I would like to make use of BERT to tackle this problem.
I am wondering how I should use BERT to generate vector representations of these paragraphs and especially, whether it is fine to just pass the whole paragraph into BERT?
There have been informative discussions of related problems here and here. These discussions focus on how to use BERT for representing whole documents. In my case the paragraphs are not that long, and indeed could be passed to BERT without exceeding its maximum length of 512 tokens. However, BERT was trained on sentences. Sentences are relatively self-contained units of meaning. I wonder whether feeding multiple sentences into BERT conflicts fundamentally with what the model was designed to do (although this appears to be done regularly).

I think your question is based on a misconception. Even though the BERT paper uses the term sentence quite often, it is not referring to a linguistic sentence. The paper defines a sentence as
an arbitrary span of contiguous text, rather than an actual linguistic sentence.
It is therefore completely fine to pass whole paragraphs to BERT, and this definition is one reason why the model can handle them.
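As a minimal sketch of what "passing the whole paragraph" means in practice, the snippet below packs a multi-sentence paragraph into a single BERT-style segment of the form [CLS] tokens [SEP], with no separator between sentences. The helper names build_single_segment and toy_tokenize are hypothetical, and the whitespace tokenizer is only a stand-in: a real WordPiece tokenizer usually produces more tokens than there are words, so the 512-token limit should be checked with the tokenizer you actually use.

```python
from typing import Callable, List

MAX_LEN = 512  # BERT's maximum input length in tokens, including [CLS] and [SEP]


def build_single_segment(
    paragraph: str,
    tokenize: Callable[[str], List[str]],
    max_len: int = MAX_LEN,
) -> List[str]:
    """Pack a whole paragraph into one segment: [CLS] tokens [SEP]."""
    tokens = tokenize(paragraph)
    budget = max_len - 2  # reserve room for [CLS] and [SEP]
    if len(tokens) > budget:
        # Simple head truncation; this is an assumption, other strategies exist.
        tokens = tokens[:budget]
    return ["[CLS]"] + tokens + ["[SEP]"]


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a sub-word tokenizer: lowercase whitespace split."""
    return text.lower().split()


paragraph = (
    "BERT treats its input as a span of text. "
    "The span may contain several sentences. "
    "No separator is needed between them."
)
packed = build_single_segment(paragraph, toy_tokenize)
print(packed[:6], len(packed))
```

All sentences end up in the same segment, which matches the paper's definition of a "sentence" as an arbitrary span of contiguous text.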

Related

How to find similar sentence from a corpus on word2vec?

I have implemented word2vec on my corpus using the TensorFlow tutorial: https://www.tensorflow.org/tutorials/text/word2vec#next_steps
Now I want to give a sentence as input and find a similar sentence in the corpus.
Any leads on how I can perform this?
A simple word2vec model is not capable of such a task, as it only relates the semantics of words to each other, not the semantics of whole sentences. Inherently, such a model has no generative function; it only serves as a look-up table.
Word2vec models map word strings to vectors in the embedding space. To find similar words for a given sample word, one can simply go through all vectors in the vocabulary and find the ones that are closest (in terms of the 2-norm) to the sample word vector. For further information you could go here or here.
However, this does not work for sentences, as it would require a whole vocabulary of sentences from which to pick similar ones, which is not feasible.
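The word-level lookup described above can be sketched as follows. This is only an illustration: the embedding table and its 3-dimensional vectors are made up, and nearest_words is a hypothetical helper, not part of any word2vec library.

```python
import math
from typing import Dict, List, Tuple

# Toy embedding table; the words and vectors are invented for illustration.
embeddings: Dict[str, List[float]] = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
    "pear": [0.2, 0.1, 0.8],
}


def nearest_words(
    word: str, table: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k vocabulary words closest to `word` by Euclidean (2-norm) distance."""
    query = table[word]
    scored = [
        (other, math.dist(query, vec))
        for other, vec in table.items()
        if other != word  # skip the query word itself
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


neighbors = nearest_words("king", embeddings, k=2)
print(neighbors)
```

The search runs over the fixed vocabulary of the table, which is exactly why it does not carry over to arbitrary sentences.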
Edit: This seems to be a duplicate of this question.

How does gensim word2vec word embedding extract training word pair for 1 word sentence?

Refer to the image below (the process by which word2vec skip-gram extracts its training data, the word pairs, from the input sentences).
E.G. "I love you." ==> [(I,love), (I, you)]
May I ask what is the word pair when the sentence contains only one word?
Is it "Happy!" ==> [(happy,happy)] ?
I tested the word2vec algorithm in gensim. When there is just one word in a training sentence (and this word is not included in other sentences), the word2vec algorithm can still construct an embedding vector for this specific word. I am not sure how the algorithm is able to do so.
===============UPDATE===============================
As the answer below explains, I think the word embedding vector created for the word in the 1-word sentence is just the random initialization of the neural network weights.
No word2vec training is possible from a 1-word sentence, because there are no neighboring words to use as input to predict a center/target word. Essentially, that sentence is skipped.
If that was the only appearance of the word in the corpus, and you're seeing a vector for that word, it's just the starting random-initialization of the word, with no further training. (And, you should probably use a higher min_count, as keeping such rare words is usually a mistake in word2vec: they won't get good vectors, and other nearby words' vectors will improve if the 'noise' from all such insufficiently model-able rare words is removed.)
If that 1-word sentence actually appeared next to other real sentences in your corpus, it could make sense to combine it with the surrounding texts. There's nothing magic about actual sentences for this kind of word-from-surroundings modeling: the algorithm is just working on 'neighbors', and it's common to use multi-sentence chunks as the texts for training, and sometimes even punctuation (like sentence-ending periods) is retained as 'words'. Then words from an actually separate sentence, but still related by having appeared in the same document, will appear in each other's contexts.
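To make the "no neighbors, no training pairs" point concrete, here is a simplified sketch of skip-gram pair extraction. The function skipgram_pairs is a hypothetical helper and not gensim's implementation; real implementations add details such as frequent-word subsampling and randomly shrunk windows, which are omitted here.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Enumerate (center, context) pairs within `window` positions of each center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((center, tokens[j]))
    return pairs


three_word = skipgram_pairs(["i", "love", "you"])
one_word = skipgram_pairs(["happy"])
print(three_word)
print(one_word)  # empty: a 1-word sentence yields no training pairs
```

Concatenating the 1-word sentence with its neighboring sentences before calling the function would give the word some context, which is the suggestion made above.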

BERT training with character embeddings

Does it make sense to change the tokenization paradigm in the BERT model, to something else? Maybe just a simple word tokenization or character level tokenization?
That is one motivation behind the paper "CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters" where BERT's wordpiece system is discarded and replaced with a CharacterCNN (just like in ELMo). This way, a word-level tokenization can be used without any OOV issues (since the model attends to each token's characters) and the model produces a single embedding for any arbitrary input token.
Performance-wise, the paper shows that CharacterBERT is generally at least as good as BERT while at the same time being more robust to noisy texts.
It depends on what your goal is. Using standard word tokens would certainly work, but many words would end up out of vocabulary, which would result in the model performing poorly.
Working entirely at the character level might be interesting from a research perspective: seeing how the model learns to segment the text on its own and how such a segmentation compares to standard tokenization. I am not sure, though, whether it would have benefits for practical use. Character sequences are much longer than sub-word sequences, and because BERT requires memory quadratic in the sequence length, it would just unnecessarily slow down both training and inference.
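A rough back-of-the-envelope sketch of the quadratic cost argument: self-attention builds an n x n score matrix per head, so the number of entries grows with the square of the sequence length. In the snippet below, whitespace words stand in for sub-word tokens (an assumption; a real WordPiece tokenizer would usually yield somewhat more tokens), and attention_entries is a hypothetical helper.

```python
def attention_entries(seq_len: int) -> int:
    """Number of entries in one self-attention score matrix (seq_len x seq_len)."""
    return seq_len * seq_len


text = "Character sequences are much longer than sub-word sequences"

n_chars = len(text)          # sequence length with character-level tokenization
n_words = len(text.split())  # crude proxy for the sub-word sequence length

ratio = attention_entries(n_chars) / attention_entries(n_words)
print(f"{n_chars} characters vs {n_words} word-level tokens")
print(f"attention matrix is about {ratio:.0f}x larger at the character level")
```

Even for this short sentence the character-level attention matrix is far larger, and the gap widens for full paragraphs.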

Should I train embeddings using data from the training, validation and test corpora?

I am in a situation where I don't have any pre-trained word embeddings for my domain (Vietnamese food reviews), so I thought of training embeddings from a general corpus and a domain-specific corpus.
The question is: can I use the training, test and validation datasets (after preprocessing) as a source for creating my own word embeddings? If not, I hope you can share your experience.
Based on my intuition and some experiments, a wide corpus appears to be better, but I'd like to know if there's relevant research or other relevant results.
can I use the dataset of training, test and validating (did preprocess) as a source for creating my own word embeddings
Sure, embeddings are not your features for your machine learning model. They are the "computational representation" of your data. In short, they are made of words represented in a vector space. With embeddings, your data is less sparse. Using word embeddings could be considered part of the pre-processing step of NLP.
Usually (I mean, using the most used technique, word2vec), the representation of a word in the vector space is defined by its surroundings (the words that it commonly goes along with).
Therefore, to create embeddings, the larger the corpus, the better, since it can better place a word vector in the vector space (and hence compare it to other similar words).

word2vec lemmatization of corpus before training

Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing for many semantic similarity tasks. I was wondering if anybody had experience in lemmatizing the corpus before training word2vec and if this is a useful preprocessing step to do.
I think it really depends on the task you want to solve with this.
Essentially, lemmatization makes the input space sparser (fewer distinct word forms), which can help if you don't have enough training data.
But since Word2Vec is fairly big, if you have big enough training data, lemmatization shouldn't gain you much.
Something more interesting is how to tokenize with respect to the existing dictionary of word vectors inside the W2V model (or any other model). For example, "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York.']. Then you can replace each token with its vector from W2V. The challenge is that some tokenizers may tokenize "New York" as ['New', 'York'], which doesn't make much sense. (For example, NLTK makes this mistake: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html) This is a problem when you have many multi-word phrases.
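One possible way to tokenize "with respect to" an existing vocabulary is to re-merge adjacent tokens whenever the joined phrase is a known entry. The sketch below uses a greedy longest-match pass; merge_known_phrases and the small vocab set are hypothetical and only illustrate the idea, they are not taken from the W2V tooling.

```python
from typing import List, Set


def merge_known_phrases(tokens: List[str], vocab: Set[str], max_len: int = 3) -> List[str]:
    """Greedily merge up to `max_len` adjacent tokens when the joined phrase is in `vocab`."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, falling back to the single token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in vocab:
                merged.append(candidate)
                i += n
                break
    return merged


# Hypothetical vocabulary of an embedding model that contains a multi-word entry.
vocab = {"New York", "Good", "muffins", "cost", "in"}
tokens = ["Good", "muffins", "cost", "$", "3.88", "in", "New", "York", "."]

merged_tokens = merge_known_phrases(tokens, vocab)
print(merged_tokens)
```

Tokens that are not in the vocabulary are passed through unchanged, so how to handle them (skip, map to an unknown vector, and so on) remains a separate decision.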
The current project I am working on involves identifying gene names within biology paper abstracts using the vector space created by Word2Vec. When we run the algorithm without lemmatizing the corpus, mainly two problems arise:
The vocabulary gets way too big, since you have words in different forms which in the end have the same meaning.
As noted above, the space gets less sparse, since each "meaning" has more representatives, but at the same time some of these meanings might get split among their representatives. Let me clarify with an example:
We are currently interested in a gene recognized by the acronym BAD. At the same time, "bad" is an English word which has different forms (badly, worst, ...). Since Word2vec builds its vectors based on the probability of the context (the surrounding words), when you don't lemmatize some of these forms, you might end up losing the relationship between some of these words. This way, in the BAD case, you might end up with a word closer to gene names instead of adjectives in the vector space.
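As an illustration only, the sketch below shows one way a corpus could be lemmatized before word2vec training while keeping a gene symbol such as BAD apart from the adjective "bad". The lemma table, the protected set, and the example sentence are all hypothetical; a real pipeline would use a proper lemmatizer and a curated list of gene symbols.

```python
from typing import Dict, List, Set

# Hypothetical lemma lookup, following the forms mentioned above.
LEMMAS: Dict[str, str] = {"badly": "bad", "worst": "bad", "worse": "bad"}


def lemmatize(tokens: List[str], protected: Set[str]) -> List[str]:
    """Lowercase and lemmatize tokens, leaving protected symbols (e.g. gene names) untouched."""
    out = []
    for tok in tokens:
        if tok in protected:
            out.append(tok)  # keep the gene symbol exactly as written
        else:
            low = tok.lower()
            out.append(LEMMAS.get(low, low))
    return out


tokens = ["BAD", "expression", "was", "badly", "measured", "in", "the", "worst", "samples"]
lemmatized = lemmatize(tokens, protected={"BAD"})

print(lemmatized)
print(len(set(tokens)), "distinct forms before,", len(set(lemmatized)), "after")
```

The vocabulary shrinks because "badly" and "worst" collapse onto one lemma, while the case-sensitive protected set keeps the gene token from being merged with the adjective.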
