BERT training with character embeddings

Does it make sense to change the tokenization paradigm in the BERT model to something else, maybe simple word tokenization or character-level tokenization?

That is one motivation behind the paper "CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters" where BERT's wordpiece system is discarded and replaced with a CharacterCNN (just like in ELMo). This way, a word-level tokenization can be used without any OOV issues (since the model attends to each token's characters) and the model produces a single embedding for any arbitrary input token.
Performance-wise, the paper shows that CharacterBERT is generally at least as good as BERT, while at the same time being more robust to noisy texts.

It depends on what your goal is. Using standard word tokens would certainly work, but many words would end up out of vocabulary, which would make the model perform poorly.
Working entirely at the character level might be interesting from a research perspective: seeing how the model learns to segment the text on its own, and how such a segmentation compares to standard tokenization. I am not sure, though, whether it would have benefits for practical use. Character sequences are much longer than sub-word sequences, and since BERT requires memory quadratic in the sequence length, it would just unnecessarily slow down both training and inference.
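As a rough illustration of the length blow-up, here is a small sketch that compares wordpiece and character-level sequence lengths for the same text. It assumes the HuggingFace transformers package and the bert-base-uncased checkpoint, which are my own illustrative choices, not something prescribed above.

```python
# Rough sketch: compare wordpiece vs. character-level sequence lengths.
# Assumes the HuggingFace "transformers" package and "bert-base-uncased";
# these are illustrative choices, not part of the answer above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "CharacterBERT replaces BERT's wordpiece vocabulary with a CharacterCNN."

wordpiece_tokens = tokenizer.tokenize(text)   # sub-word units
character_tokens = list(text)                 # naive character-level split

print(len(wordpiece_tokens), len(character_tokens))
# Self-attention cost grows roughly with the square of the sequence length,
# so the character-level input is about (len_char / len_wp) ** 2 times more expensive.
```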

Related

Passing multiple sentences to BERT?

I have a dataset with paragraphs that I need to classify into two classes. These paragraphs are usually 3-5 sentences long. The overwhelming majority of them are less than 500 words long. I would like to make use of BERT to tackle this problem.
I am wondering how I should use BERT to generate vector representations of these paragraphs and, especially, whether it is fine to just pass the whole paragraph into BERT.
There have been informative discussions of related problems here and here. Those discussions focus on how to use BERT for representing whole documents. In my case the paragraphs are not that long and could indeed be passed to BERT without exceeding its maximum input length of 512 tokens. However, BERT was trained on sentences, and sentences are relatively self-contained units of meaning. I wonder whether feeding multiple sentences into BERT conflicts fundamentally with what the model was designed to do (although this appears to be done regularly).
I think your question is based on a misconception. Even though the BERT paper uses the term sentence quite often, it is not referring to a linguistic sentence. The paper defines a sentence as
an arbitrary span of contiguous text, rather than an actual linguistic sentence.
It is therefore completely fine to pass whole paragraphs to BERT; this is one reason why it can handle them.
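For what it's worth, a minimal sketch of doing exactly that, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint (both my own illustrative choices): pass the whole paragraph through BERT and take the [CLS] vector as the paragraph representation.

```python
# Minimal sketch: encode a whole paragraph with BERT and take the [CLS] vector.
# Assumes the HuggingFace "transformers" package and "bert-base-uncased";
# these are illustrative choices, not prescribed by the answer above.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

paragraph = (
    "First sentence of the paragraph. Second sentence with more detail. "
    "A third sentence that wraps up the point."
)

# truncation guards against the 512-token limit; 3-5 sentence paragraphs fit easily
inputs = tokenizer(paragraph, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape (1, 768) for bert-base
print(inputs["input_ids"].shape[1], cls_embedding.shape)
```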

Should I train embeddings using data from the training, validation, and testing corpora?

I don't have any pre-trained word embeddings for my domain (Vietnamese food reviews), so I thought of training embeddings from both a general and a domain-specific corpus.
The point here is: can I use the (already preprocessed) training, validation, and test datasets as a source for creating my own word embeddings? If not, I hope you can share your experience.
Based on my intuition and some experiments, a wider corpus appears to be better, but I'd like to know if there is relevant research or other relevant results.
Can I use the (already preprocessed) training, validation, and test datasets as a source for creating my own word embeddings?
Sure. Embeddings are not features learned by your machine-learning model; they are the "computational representation" of your data. In short, they represent words as points in a vector space. With embeddings, your data is less sparse. Using word embeddings can be considered part of the NLP pre-processing step.
Usually (that is, with the most widely used technique, word2vec), the representation of a word in the vector space is defined by its surroundings (the words it commonly co-occurs with).
Therefore, to create embeddings, the larger the corpus the better, since a larger corpus lets the model place a word vector more accurately in the vector space (and hence compare it to other similar words).
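As a concrete sketch (gensim 4.x assumed; the tiny toy corpora below are placeholders for your actual splits), training word2vec on the union of the training, validation, and test texts would look something like this:

```python
# Minimal sketch (gensim 4.x assumed): train word2vec on the union of the
# training, validation, and test texts. The toy sentences below are placeholders.
from gensim.models import Word2Vec

train_texts = [["the", "pho", "was", "delicious"], ["service", "was", "slow"]]
valid_texts = [["great", "banh", "mi"]]
test_texts = [["noodles", "were", "bland"]]

corpus = train_texts + valid_texts + test_texts  # the embeddings may see all splits

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the word vectors
    window=5,
    min_count=1,      # keep rare words only because this toy corpus is tiny
    workers=4,
)
print(model.wv["pho"].shape)  # (100,)
```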

Why are word embeddings with linguistic features (e.g. Sense2Vec) not used?

Given that embedding systems such as Sense2Vec incorporate linguistic features such as part-of-speech, why are these embeddings not more commonly used?
Across popular work in NLP today, Word2Vec and GloVe are the most commonly used word embedding systems, despite the fact that they only incorporate word identity and do not use linguistic features of the words.
For example, in sentiment analysis, text classification, or machine translation tasks, it makes logical sense that performance could be improved if the input incorporated linguistic features as well, particularly when disambiguating words such as "duck" the verb and "duck" the noun.
Is this thinking flawed? Or is there some other practical reason why these embeddings are not more widely used?
It's a fairly subjective question. One reason is the POS tagger itself: the tagger is a probabilistic model, which can add to the overall error/confusion.
For example, say you have dense representations for duck-NP and duck-VB, but at inference time your POS tagger tags "duck" as something else; then you won't even find the vector. Moreover, splitting by tag effectively reduces the total number of times your system sees the word "duck", so one could argue that the resulting representations would be weaker.
To top it off, the main problem that sense2vec was addressing, the contextualisation of word representations, has since been addressed by contextual representations such as BERT and ELMo, without introducing any of the above problems.
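For illustration, here is a rough sketch of the sense2vec-style idea of keying vectors by word plus POS tag, assuming spaCy with the en_core_web_sm model and gensim 4.x (both my own illustrative choices). Note how a tagging error at inference time would simply make the lookup key miss.

```python
# Illustrative sketch of the sense2vec idea: key vectors by "word|POS".
# Assumes spaCy with the "en_core_web_sm" model installed and gensim 4.x;
# both are illustrative choices, not what the answer above prescribes.
import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm")

texts = [
    "I saw a duck on the lake.",
    "You should duck before the branch hits you.",
]

# Tag each token and build "word|POS" keys; if the tagger errs, the key is wrong.
tagged_sentences = [
    [f"{tok.text.lower()}|{tok.pos_}" for tok in nlp(text) if not tok.is_punct]
    for text in texts
]
# e.g. [..., 'duck|NOUN', ...] for the first sentence and [..., 'duck|VERB', ...]
# for the second, assuming the tagger gets both right.

model = Word2Vec(sentences=tagged_sentences, vector_size=50, min_count=1)
print("duck|NOUN" in model.wv.key_to_index, "duck|VERB" in model.wv.key_to_index)
```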

missing word in word embedding

I have a word2vec model and I use it to embed all the words in the train and test sets, but there are some proper words that the word2vec model does not contain. Can I use a random vector as the embedding for all of these proper words?
If so, please give me some tips and some paper references.
Thank you.
It's not clear what you're asking; in particular what do you mean by "proper words"?
But, if after training, words that you expect to be in the model aren't in the model, that is usually caused by either:
(1) Problems with how you preprocessed/tokenized your corpus, so that the words you thought were provided were not. So double check what data you're passing to training.
(2) A mismatch of parameters and expectations. For example, if performing training with a min_count of 5 (the default in some word2vec libraries), any words occurring fewer than 5 times will be ignored, and thus not receive word-vectors. (This is usually a good thing for overall word-vector quality, as low-frequency words can't get good word-vectors for themselves, yet by being interleaved with other words can still mildly interfere with those other words' training.)
Usually double-checking inputs, enabling logging and watching for any suspicious indicators of problems, and carefully examining the post-training model for what it does contain can help deduce what went wrong.
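If the goal is simply to handle the missing words at embedding time (as the question asks), a minimal sketch (gensim 4.x assumed, with a toy corpus standing in for real data) is to check the model's vocabulary and fall back to one shared random vector:

```python
# Minimal sketch (gensim 4.x assumed): look words up in a trained model and fall
# back to one shared random vector for anything the model does not contain
# (for example, words dropped because of min_count). Toy corpus only.
import numpy as np
from gensim.models import Word2Vec

sentences = [["gene", "expression", "analysis"], ["gene", "expression", "data"]]
model = Word2Vec(sentences=sentences, vector_size=50, min_count=1)

rng = np.random.default_rng(0)
unknown_vector = rng.normal(scale=0.1, size=model.wv.vector_size)

def embed(word):
    # model.wv.key_to_index holds every word that survived training
    if word in model.wv.key_to_index:
        return model.wv[word]
    return unknown_vector

print(embed("gene").shape, embed("some_unseen_word").shape)
```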

word2vec lemmatization of corpus before training

Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing for many semantic similarity tasks. I was wondering if anybody had experience in lemmatizing the corpus before training word2vec and if this is a useful preprocessing step to do.
I think it really depends on what task you want to solve.
Essentially, lemmatization collapses inflected forms into a single lemma, which makes the vocabulary smaller and the data less sparse; this can help if you don't have enough training data.
But word2vec training corpora are usually fairly big, and if you have enough training data, lemmatization shouldn't gain you much.
Something more interesting is how to do tokenization with respect to the existing dictionary of word vectors inside the W2V model (or anything else). For example, "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York.'] so that each token can be replaced with its vector from W2V. The challenge is that some tokenizers may tokenize "New York" as ['New', 'York'], which doesn't make much sense. (For example, NLTK makes this mistake: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html) This is a problem when you have many multi-word phrases. A sketch of both steps follows.
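A rough sketch of both ideas, lemmatizing the corpus and merging frequent multi-word phrases before training, assuming spaCy's en_core_web_sm model and gensim 4.x (illustrative choices on my part):

```python
# Rough sketch: lemmatize the corpus with spaCy and merge frequent multi-word
# phrases with gensim's Phrases before word2vec training. Assumes the
# "en_core_web_sm" spaCy model and gensim 4.x; thresholds are set low only
# because the toy corpus is tiny.
import spacy
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

nlp = spacy.load("en_core_web_sm")

raw_texts = [
    "Good muffins cost $3.88 in New York.",
    "New York muffins are costing more every year.",
]

# Lemmatize and lowercase, dropping punctuation.
lemmatized = [
    [tok.lemma_.lower() for tok in nlp(text) if not tok.is_punct]
    for text in raw_texts
]

# Learn frequent bigrams; "new" + "york" should become the single token "new_york".
bigram = Phrases(lemmatized, min_count=1, threshold=0.1)
phrased = [bigram[sentence] for sentence in lemmatized]

model = Word2Vec(sentences=phrased, vector_size=50, min_count=1)
print(phrased[0])
```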
The current project I am working on involves identifying gene names within biology paper abstracts using the vector space created by Word2Vec. When we run the algorithm without lemmatizing the corpus, mainly two problems arise:
(1) The vocabulary gets way too big, since you have words in different forms which in the end have the same meaning.
(2) As noted above, the space gets more densely populated, since you get more representatives of a certain "meaning"; but at the same time, some of these meanings might get split among their representatives. Let me clarify with an example.
We are currently interested in a gene known by the acronym BAD. At the same time, "bad" is an English word with different forms (badly, worst, ...). Since Word2vec builds its vectors based on the context (the surrounding words), when you don't lemmatize some of these forms, you might end up losing the relationship between some of these words. In the BAD case, you might end up with a vector closer to gene names than to adjectives in the vector space.
