When doing pre-training of a transformer model, how can I add words to the vocabulary? - pytorch

Given a DistilBERT trained language model for a given language, taken from the Huggingface hub, I want to pre-train the model on a specific domain, and I want to add new words that are:
definitely not present in the original training set
and impossible to handle via word-piece tokenization - basically, you can think of these words as "codes" that are a normalized form of a named entity
Consider that:
I would like to avoid learning a new tokenizer: I am fine with adding the new words and then letting the model learn their embeddings via pre-training
the number of "words" is way larger than the number of "unused" tokens in the "stock" vocabulary
The only advice that I have found is the one reported here:
Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
Do you think this is the only way to achieve my goal?
If yes, I do not have any idea how to write this "script": does someone have some hints on how to proceed (sample code, documentation, etc.)?

As per my comment, I'm assuming that you go with a pre-trained checkpoint, if only to "avoid [learning] a new tokenizer."
Also, the solution works with PyTorch, which might be more suitable for such changes. I haven't checked TensorFlow (which is mentioned in one of your quotes), so no guarantees that this works across platforms.
To solve your problem, let us divide this into two sub-problems:
Adding the new tokens to the tokenizer, and
Re-sizing the token embedding matrix of the model accordingly.
The first can actually be achieved quite simply by using .add_tokens(). I'm referencing the slow tokenizer's implementation of it (because it's in Python), but from what I can see, this also exists for the faster Rust-based tokenizers.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Will return an integer corresponding to the number of added tokens
# The input could also be a list of strings instead of a single string
num_new_tokens = tokenizer.add_tokens("dennlinger")
You can quickly verify that this worked by looking at the encoded input ids:
print(tokenizer("This is dennlinger."))
# 'input_ids': [101, 2023, 2003, 30522, 1012, 102]
The index 30522 now corresponds to the new token with my username, so we can check the first part. However, if we look at the function docstring of .add_tokens(), it also says:
Note, when adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the model so that its embedding matrix matches the tokenizer.
In order to do that, please use the PreTrainedModel.resize_token_embeddings method.
Looking at this particular function, the description is a bit confusing, but we can get a correctly resized matrix (with randomly initialized weights for the new tokens) by simply passing the previous vocabulary size plus the number of new tokens:
from transformers import AutoModel
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.resize_token_embeddings(model.config.vocab_size + num_new_tokens)
# Test that everything worked correctly
model(**tokenizer("This is dennlinger", return_tensors="pt"))
EDIT: Notably, .resize_token_embeddings() also takes care of any associated weights; this means, if you are pre-training, it will also adjust the size of the language modeling head (which should have the same number of tokens), or fix tied weights that would be affected by an increased number of tokens.
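For instance, here is a quick sketch to verify this with a model that actually has a language modeling head (the printed shapes assume distilbert-base-uncased plus one added token):
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
num_new_tokens = tokenizer.add_tokens("dennlinger")

model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
model.resize_token_embeddings(model.config.vocab_size + num_new_tokens)

# Both the input embeddings and the LM head should now report the enlarged vocabulary
print(model.get_input_embeddings().weight.shape)   # torch.Size([30523, 768])
print(model.get_output_embeddings().weight.shape)  # torch.Size([30523, 768])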

Related

Transformers PreTrainedTokenizer add_tokens Functionality

Referring to the documentation of the awesome Transformers library from Huggingface, I came across the add_tokens functions.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
model.resize_token_embeddings(len(tokenizer))
I tried the above by adding previously absent words to the default vocabulary. However, keeping all else constant, I noticed a decrease in accuracy of the fine-tuned classifier that uses this updated tokenizer. I was able to replicate similar behavior even when just 10% of the previously absent words were added.
My questions
Am I missing something?
Instead of whole words, is the add_tokens function expecting masked tokens, for example: '##ah', '##red', '##ik', '##si', etc.? If yes, is there a procedure to generate such masked tokens?
Any help would be appreciated.
Thanks in advance.
If you add tokens to the tokenizer, you indeed make the tokenizer tokenize the text differently, but this is not the tokenization BERT was trained with, so you are basically adding noise to the input. The word embeddings are not trained and the rest of the network never saw them in context. You would need a lot of data to teach BERT to deal with the newly added words.
There are also some ways to compute a single word embedding such that adding it does not hurt BERT, like in this paper, but it seems pretty complicated and should not make much difference.
BERT uses a word-piece-based vocabulary, so it should not really matter if the words are present in the vocabulary as a single token or get split into multiple wordpieces. The model probably saw the split word during pre-training and will know what to do with it.
Regarding the ##-prefixed tokens, those are tokens that can only appear as the continuation of another wordpiece (i.e., as a suffix within a word). E.g., walrus gets split into ['wal', '##rus'], and you need both of these wordpieces to be in the vocabulary, but not ##wal or rus.
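For instance, you can inspect how a word gets split into wordpieces (the exact split depends on the model's vocabulary):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("walrus"))
# e.g. ['wal', '##rus'] - both pieces must be in the vocabulary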

Is there an easy way with built-in functions to automatically retrain a keras NLP model?

I have a natural language processing model built with Keras using keras.preprocessing.text.Tokenizer. I know that I can retrain the old model by calling its .fit(...) after importing it, but I need to update my tokenizer as well. The tokenizer does several things: it tokenizes a string by spaces, eliminates symbols, converts to lowercase, keeps only the most used tokens after building its dictionary, hashes the tokens, and pads with 0s if the sentence is too short.
Ex:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(df_train['message'][0:100].values)
x_train = tokenizer.texts_to_sequences(df_train['message'][0:100].values)
x_train = pad_sequences(x_train, padding='post', maxlen=maxlen)
This process is needed to be able to feed the sequences into an NLP network. The problem appears when I try to automate this retraining. Every time I retrain, the tokenizer must be updated. If I try to fit it on new text, all the values in the dictionary that the Tokenizer class uses (i.e., the integer encoding of each word) change.
Ex:
If I update like this: tokenizer.fit_on_texts(df_train['message'][100:200].values),
then the
x_train = tokenizer.texts_to_sequences(df_train['message'][0:100].values)
will output different encodings for the sentences. I need the same encodings. The official documentation says that fit_on_texts(self, texts) "Updates internal vocabulary based on a list of texts." It does update the vocabulary, but it also changes the existing values of the old keys, not just adding new ones.
Is there an official method to keep the old values of the words and generate new values only for the new words?
So according to the source for Tokenizer, all the words are stored in the Tokenizer class. Specifically in:
self.word_counts = OrderedDict()
self.word_docs = defaultdict(int)
So if you wanted to drop the words that the tokenizer has already seen from your new input, you could filter new_input down to the items that do not appear in word_counts.keys().
However, there's one more catch:
When you run .fit_on_texts it expects a list of texts, or a single string. Depending on how you're doing things you might also be incrementing your document counter. If that's intentional, then you do not need to do anything here. Otherwise, you will need to handle decrementing self.document_count at a minimum.
(You might also need to manipulate:
self.word_index
self.index_word
if you don't like the document counter moving)
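Building on the parenthetical above, here is a rough sketch (not an official Keras API, and it skips the word_docs/IDF bookkeeping) that assigns fresh indices only to unseen words so the existing encodings stay stable; if num_words is set, you may also need to raise it so the new indices aren't filtered out by texts_to_sequences:
from keras.preprocessing.text import Tokenizer, text_to_word_sequence

def add_new_words(tokenizer, texts):
    """Assign the next free indices to unseen words, leaving existing indices untouched."""
    next_index = max(tokenizer.word_index.values(), default=0) + 1
    for text in texts:
        tokenizer.document_count += 1  # mirror what fit_on_texts would do
        # Apply the same preprocessing the Tokenizer uses internally
        for word in text_to_word_sequence(text,
                                          filters=tokenizer.filters,
                                          lower=tokenizer.lower,
                                          split=tokenizer.split):
            tokenizer.word_counts[word] = tokenizer.word_counts.get(word, 0) + 1
            if word not in tokenizer.word_index:
                tokenizer.word_index[word] = next_index
                if hasattr(tokenizer, "index_word"):
                    tokenizer.index_word[next_index] = word
                next_index += 1

# Old encodings stay stable, new words simply get fresh indices:
add_new_words(tokenizer, df_train['message'][100:200].values)
x_train_old = tokenizer.texts_to_sequences(df_train['message'][0:100].values)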

How to find number of tokens in gensim model

This is the code for my model using Gensim. I ran it and it returned a tuple. I want to know which element is the number of tokens.
import gensim

model = gensim.models.Word2Vec(mylist5, size=100, sg=0, window=5, alpha=0.05, min_count=5, workers=12, iter=20, cbow_mean=1, hs=0, negative=15)
model.train(mylist5, total_examples=len(mylist5), epochs=10)
The value that was returned by model.train() is the following; I need to know what it means:
(167131589, 208757070)
I want to know: which of these is the number of tokens?
Since you already passed in your mylist5 corpus when you instantiated the model, it will have automatically done all the steps to train the model with that data.
(You don't need to, and almost certainly should not, call .train() again. Typically .train() should only be called if you didn't provide any corpus at instantiation, and in that case you'd then call both .build_vocab() and .train().)
As noted by other answerers, the numbers reported by .train() are two tallies of the total tokens seen by the training process. (Most users won't actually need this info.)
If you want to know the number of unique tokens for which the model learned word-vectors, len(model.wv) is one way. (Before Gensim 4.0, len(model.wv.vocab) would have worked.)
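For instance (assuming the model has been built as in the question):
# Unique tokens that received a word-vector, i.e. the model's vocabulary size
print(len(model.wv))           # Gensim 4.x
# print(len(model.wv.vocab))   # Gensim 3.x

# The tuple (167131589, 208757070) returned by .train() is
# (trained_word_count, raw_word_count): running token tallies, not the vocabulary size.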
Gensim Code
The Gensim GitHub source (line 573) shows that model.train returns two values: trained_word_count, raw_word_count.
"raw_word_count" is the number of words used in training.
"trained_word_count" is the number of raw words left after ignoring unknown words and trimming the sentence length.

How Does the Hashing Trick in Machine Learning Work?

I have a large categorical dataset and a feedforward ANN that I am using for classification purposes. I programmed the machine learning model using Excel VBA (the only programming language I currently have access to).
I have 150 categories in my dataset that I need to process. I have tried using Binary Encoding and One-Hot Encoding, however because of the number of categories I need to process, these vectors are often too large for VBA to handle and I end up with a memory error.
I’d like to give the Hashing trick a go, and see if it works any better. I don't understand how to do this with Excel however.
I have reviewed the following links to try and understand it:
https://learn.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f
https://en.wikipedia.org/wiki/Vowpal_Wabbit
I still don't completely understand it. Here is what I have done so far. I used the following code example to create a hash sequence for my categorical data:
Generate short hash string based using VBA
Using the code above, I have been able to produce collision-free numerical hash sequences. However, what do I do now? Does the hash sequence need to be converted into a binary vector? This is where I get lost.
I provided a small example of my data thus far. Would somebody be able to show me step by step how the hashing trick works (preferably for Excel)?
'CATEGORY     'HASH SEQUENCE
STEEL          37152
PLASTIC        31081
ALUMINUM       2310
BRONZE         9364
So what the hashing trick does is prevent "fake" words from taking up extra memory. In a regular bag-of-words (BOW) model, you have one dimension per word in the vocabulary. This means that a misspelled word and the regular word can each take up a separate dimension - if you have the misspelled word in the model at all. If the misspelled word is not in the model, then (depending on your model) you might ignore it completely. This adds up over time. And by "misspelled word" I just mean any word not in the vocabulary you used to create the vectors your model was trained on. In other words, any model trained this way cannot adapt to new vocabulary without being trained all over again.
The hashing method allows you to incorporate out-of-vocabulary words, at the cost of some potential accuracy loss, and it ensures that you can bound your memory. Essentially, the hashing method starts by defining a hash function that takes some input (typically a word) and maps it to an output value within an already determined range. You might choose your hash function to output values between, say, 0 and 2^16, so you know your output vectors will always be capped at size 2^16 (an arbitrary value, really), which prevents memory issues. Further, hash functions have "collisions": hash(a) might equal hash(b) - very rarely with an appropriate output range, but it is possible. This means you lose some accuracy, but since the hash function can theoretically take any input string, it can map out-of-vocabulary words to a new vector of the same size as the original vectors used to train the model. Since your new data vector is the same size as those used to train the model previously, you can use it to refine your model instead of being forced to train a new one.
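To make the steps concrete, here is a minimal Python sketch of the idea (Python only for illustration; the same steps can be mirrored in VBA with your existing hash function, and the bucket count is an arbitrary choice):
import hashlib

N_BUCKETS = 2 ** 8  # fixed output dimensionality; pick whatever your memory allows

def hash_bucket(category):
    """Map any category string to a bucket index in [0, N_BUCKETS)."""
    digest = hashlib.md5(category.strip().upper().encode("utf-8")).hexdigest()
    return int(digest, 16) % N_BUCKETS

def hashed_one_hot(category):
    """Fixed-size vector with a 1 at the hashed bucket - this replaces plain one-hot encoding."""
    vec = [0] * N_BUCKETS
    vec[hash_bucket(category)] = 1
    return vec

for cat in ["STEEL", "PLASTIC", "ALUMINUM", "BRONZE", "STEELX"]:  # last one is "out of vocabulary"
    print(cat, hash_bucket(cat))
The key point is that the vector length never grows with the number of categories; new or misspelled categories simply land in some bucket, possibly colliding with an existing one.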

Google's BERT for NLP: replace foreign characters in vocab.txt to add words?

I am fine-tuning the BERT model but need to add a few thousand words. I know that one can replace the ~1000 [unused#] lines at the top of vocab.txt, but I also notice there are thousands of single foreign (Unicode) characters in the file, which I will never use. For fine-tuning, is it possible to replace those with my words, fine-tune, and have the model still work correctly?
The weights of the unused tokens are essentially randomly initialized, since they have never been used. If you just replace them with your own words but don't pre-train further on your domain-specific corpus, those embeddings will essentially remain random. So IMO there won't be much benefit if you just replace them and continue with fine-tuning.
Let me point you to this github issue. According to the author of the paper:
My recommendation would be to just use the existing wordpiece vocab and run pre-training for more steps on the in-domain text, and it should learn the compositionality "for free". Keep in mind that with a wordpiece vocabulary there are basically no out-of-vocabulary words, and you don't really know which words were seen in the pre-training and not. Just because a word was split up by word pieces doesn't mean it's rare, in fact many words which were split into wordpieces were seen 5,000+ times in the pre-training data.
But if you want to add more vocab you can either: (a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized. (b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
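If you do go with option (a), a rough sketch of editing the vocabulary file might look like this (the file path and word list are placeholders; keep a backup of the original vocab.txt):
# Replace "[unusedX]" placeholder lines in vocab.txt with your own domain words
new_words = ["mycode123", "mycode456"]  # hypothetical domain-specific tokens

with open("vocab.txt", encoding="utf-8") as f:
    vocab = f.read().splitlines()

replacements = iter(new_words)
for i, token in enumerate(vocab):
    if token.startswith("[unused"):
        try:
            vocab[i] = next(replacements)
        except StopIteration:
            break

with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")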
Hope this helps!

Resources