BART Tokenizer tokenises the same word differently?

I have noticed that if I tokenise a full text with many sentences, I sometimes get a different number of tokens than if I tokenise each sentence individually and add up the tokens. I have done some debugging and have this small reproducible example to show the issue:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')
print(tokenizer.tokenize("Thames is a river"))
print(tokenizer.tokenize("We are in London. Thames is a river"))
I get the following output
['Th', 'ames', 'Ġis', 'Ġa', 'Ġriver']
['We', 'Ġare', 'Ġin', 'ĠLondon', '.', 'ĠThames', 'Ġis', 'Ġa', 'Ġriver']
I would like to understand why the word Thames has been split into two tokens when it's at the start of the sequence, whereas it's a single token when it's not at the start. I have noticed this behaviour is very frequent and, assuming it's not a bug, I would like to understand why the BART tokeniser behaves like this.

According to https://github.com/huggingface/transformers/blob/main/src/transformers/models/bart/tokenization_bart.py:
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not. You can get around that behavior by passing add_prefix_space=True when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
Trying
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn', add_prefix_space=True)
print(tokenizer.tokenize("Thames is a river"))
print(tokenizer.tokenize("We are in London. Thames is a river"))
yields the 'correct' result to me.
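As a quick check of the docstring's explanation, one can compare the same word with and without a leading space; this is my own small sketch, and the expected outputs in the comments assume the same facebook/bart-large-cnn vocabulary as above.
from transformers import AutoTokenizer
# The byte-level BPE treats a leading space as part of the token, so " Thames"
# (with a space) can hit the single vocabulary entry 'ĠThames', while "Thames"
# (no space) has to be pieced together from smaller units.
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')
print(tokenizer.tokenize("Thames"))   # expected: ['Th', 'ames']
print(tokenizer.tokenize(" Thames"))  # expected: ['ĠThames']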

Related

How to use Keras Tokenizer for Characters?

For the sequence labeling task, my training data and labels look like following :
train_data = [['p','l','a','y','s']]
train_labels = [['<p>','<l>','<a>','<y*>','<s*>']]
How can I use a tokenizer to generate a representation for each sequence in my data? The traditional tokenizer ignores labels such as <p>; it only creates a vocabulary of standard characters.
If I got your question correctly, this should do the trick. If I'm mistaken, let me know so I can edit the answer accordingly.
from keras.preprocessing.text import Tokenizer
tk = Tokenizer(num_words=None, char_level=True)
tk.fit_on_texts(texts)
Here, texts is the list containing your actual texts.
You can check the vocabulary using
tk.word_index
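If the label strings such as <p> also need integer ids, one possible extension (my own sketch, not part of the original answer) is a second Tokenizer with the default filters disabled, since those filters would otherwise strip the < and > characters.
from keras.preprocessing.text import Tokenizer
train_data = [['p','l','a','y','s']]
train_labels = [['<p>','<l>','<a>','<y*>','<s*>']]
# Character-level tokenizer for the input sequences
char_tk = Tokenizer(num_words=None, char_level=True)
char_tk.fit_on_texts([''.join(seq) for seq in train_data])
# Word-level tokenizer for the labels; filters='' keeps markers like '<p>' intact
label_tk = Tokenizer(filters='', lower=False)
label_tk.fit_on_texts([' '.join(seq) for seq in train_labels])
print(char_tk.word_index)
print(label_tk.word_index)
print(char_tk.texts_to_sequences([''.join(seq) for seq in train_data]))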

Transformers PreTrainedTokenizer add_tokens Functionality

Referring to the documentation of the awesome Transformers library from Huggingface, I came across the add_tokens function.
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
model.resize_token_embeddings(len(tokenizer))
I tried the above by adding previously absent words to the default vocabulary. However, keeping all else constant, I noticed a decrease in accuracy of the fine-tuned classifier making use of this updated tokenizer. I was able to replicate similar behavior even when just 10% of the previously absent words were added.
My questions
Am I missing something?
Instead of whole words, is the add_tokens function expecting masked tokens, for example '##ah', '##red', '##ik', '##si', etc.? If yes, is there a procedure to generate such masked tokens?
Any help would be appreciated.
Thanks in advance.
If you add tokens to the tokenizer, you indeed make the tokenizer tokenize the text differently, but this is not the tokenization BERT was trained with, so you are basically adding noise to the input. The word embeddings are not trained and the rest of the network never saw them in context. You would need a lot of data to teach BERT to deal with the newly added words.
There are also ways to compute an embedding for a single new word so that it does not hurt BERT, as in this paper, but it seems pretty complicated and should not make much difference.
BERT uses a word-piece-based vocabulary, so it should not really matter if the words are present in the vocabulary as a single token or get split into multiple wordpieces. The model probably saw the split word during pre-training and will know what to do with it.
Regarding the ##-prefixed tokens: those are tokens that can only appear as the continuation of another wordpiece. E.g., walrus gets split into ['wal', '##rus'], and you need both of these wordpieces to be in the vocabulary, but not ##wal or rus.
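To see this in practice, here is a small sketch; the exact split depends on the vocabulary of the checkpoint you load, so treat the commented outputs as illustrative.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# A word absent from the vocabulary is split into known wordpieces, where the
# '##' prefix marks a continuation of the previous piece.
print(tokenizer.tokenize('walrus'))                       # e.g. ['wal', '##rus']
print(tokenizer.convert_tokens_to_ids(['wal', '##rus']))  # ids of both pieces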

Google's BERT for NLP: replace foreign characters in vocab.txt to add words?

I am fine-tuning the BERT model but need to add a few thousand words. I know that one can replace the ~1000 [unused#] lines at the top of the vocab.txt, but I also notice there are thousands of single foreign characters (unicode) in the file, which I will never use. For fine-tuning, is it possible to replace those with my words, fine-tune, and have the model still work correctly?
The weights of the unused tokens are essentially randomly initialized, since they have never been used. If you just replace them with your own words but don't pretrain further on your domain-specific corpus, they will essentially remain random. So there won't be much benefit, IMO, if you replace them and just continue with fine-tuning.
Let me point you to this GitHub issue. According to the author of the paper:
My recommendation would be to just use the existing wordpiece vocab and run pre-training for more steps on the in-domain text, and it should learn the compositionality "for free". Keep in mind that with a wordpiece vocabulary there are basically no out-of-vocabulary words, and you don't really know which words were seen in the pre-training and not. Just because a word was split up by word pieces doesn't mean it's rare; in fact many words which were split into wordpieces were seen 5,000+ times in the pre-training data.
But if you want to add more vocab you can either: (a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized. (b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
Hope this helps!
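For a rough picture of what option (b) involves, here is a minimal sketch; it assumes the embedding matrix has already been read from the checkpoint, and the variable names, shapes and stand-in tensor are illustrative only.
import tensorflow as tf
hidden_size, num_new = 768, 1000
# Stand-in for the word embedding variable read from the pre-trained checkpoint
old_embeddings = tf.random.normal([30522, hidden_size])
# New rows initialised the way the quote describes (truncated normal, stddev=0.02)
new_rows = tf.random.truncated_normal([num_new, hidden_size], stddev=0.02)
# Concatenate and wrap in a variable to be written into the new, bigger checkpoint
resized_embeddings = tf.Variable(tf.concat([old_embeddings, new_rows], axis=0))
print(resized_embeddings.shape)  # (31522, 768)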

Internal implementation of nltk pos tagger

I am new to NLP and trying to use the nltk POS tagger, and I have a doubt about its usage.
It accepts either a single word or a complete sentence and gives the POS tags of the input. Why does it work both ways?
I got this doubt because I tried removing stop words and then used spaCy's POS tagging, and my colleague said I shouldn't do it that way because the results change, since the tagger also checks the positioning of words.
Will it be the same for the nltk POS tagger as well? If yes, then why does it accept single words if positioning is considered?
sample usage found here for both use cases in nltk: https://github.com/acrosson/nlp/blob/master/subject_extraction/subject_extraction.py#L61
https://github.com/acrosson/nlp/blob/master/subject_extraction/subject_extraction.py#L44
A sentence of one word is still a sentence, so from a software engineering point of view, I would expect a tagger module to work the same regardless of the length of the sentence. From a linguistic point of view, that's not the case.
The word positioning is what seems to be confusing you. Many PoS taggers are based on sequence models, such as HMMs or CRFs*. These use context features, e.g. what the previous/next words in the sentence are. I think that's what your colleague meant. If you only consider the previous word as context, then it doesn't matter how long the sentence is. The first word in any sentence has no previous word, so the tagger has to learn to deal with that. However, adding context can change the decision of the tagger. Let's look at an example using nltk:
In [4]: import nltk
In [5]: nltk.pos_tag(['fly'])
Out[5]: [('fly', 'NN')]
In [6]: nltk.pos_tag(['I', 'fly'])
Out[6]: [('I', 'PRP'), ('fly', 'VBP')]
In [7]: nltk.pos_tag(['Large', 'fly'])
Out[7]: [('Large', 'JJ'), ('fly', 'NN')]
As you can see, changing the first word affects the tagger's output for the second word. As a consequence, you should not be removing stopwords before feeding your text into a PoS tagger.
* Although that's not always true: NLTK 3.3's PoS tagger is an averaged perceptron, and spaCy 2.0 uses a neural model. The argument about context still holds, though.
The nltk.pos_tag() function takes a list of tokens as input. This list can contain an arbitrary number of tokens, including, of course, 1. There is more info in the API documentation.
So in the first example you cite, nltk.pos_tag([w]), w is supposedly a single word string and [w] places it into a list, as required by the function.
In the second case, nltk.pos_tag(sent), the sent variable in the list comprehension is a sentence that has already been tokenised into a list of tokens (see line 41 in the code you cite - sentences = tokenize_sentences(document)), which is also the format required by pos_tag().
I'm not sure why your colleague advised against using spaCy. It depends on what you want to do. Contrary to NLTK, spaCy stores a rich set of features on each token, including the token's index (position) in the document and character offset in the original text. As far as I know, NLTK does not store token index and character offsets by default, so you would have to try and retrieve this yourself (something like this perhaps).
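For illustration, here is a short sketch of the positional information spaCy keeps on each token; it assumes the en_core_web_sm model is installed.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("Thames is a river in London.")
for token in doc:
    # token.i is the token's index in the document, token.idx its character offset
    print(token.text, token.pos_, token.i, token.idx)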

How to construct a clean vocabulary from training text?

I am following the neural machine translation tutorial here and notice that the datasets they use provide a clean vocab file. But when I come across a dataset (e.g. Europarl v8) that does not provide a vocab file, I need to construct a vocabulary myself using the following function.
import nltk

def construct_vocab_from_file(file, vocab_file):
    # Read the file, tokenize it, lowercase and sort the unique tokens
    with open(file, 'r') as f:
        raw_data = f.read()
        tokens = nltk.wordpunct_tokenize(raw_data)
        words = [w.lower() for w in tokens]
        vocab = sorted(set(words))
    # Write the vocab to file, one token per line
    with open(vocab_file, 'w') as f:
        for w in vocab:
            f.write(w + "\n")
However, the vocabulary constructed this way looks a little bit messy.
The left one is from the clean vocab file, while the right one with the black background (the numbers are line numbers) is from the vocabulary I constructed. This does not make me feel comfortable, especially since more than half of the vocabulary consists of these kinds of special characters or numbers (e.g. 0, 00, 000, 0000, 0000003).
So my questions are:
1) Is this problematic?
2) Should I process it further, and if so, how?
This depends on the tokenization procedure you are using. Since you are using the wordpunct tokenizer, which basically treats anything matching \w+|[^\w\s]+ (http://www.nltk.org/api/nltk.tokenize.html) as a token, this is what you get.
Having these kinds of entries populate more than half of your vocab sounds like a lot, but obviously depends on your input data.
You could consider using a more sophisticated tokenizer, but considering that these kinds of entries are likely to have a very low frequency (i.e. most of them will be occurring only once in your data, I guess), I wouldn't worry about it too much.
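If you do want to prune such entries, a simple frequency cutoff along these lines would work (my own sketch; the threshold and file name are arbitrary placeholders).
from collections import Counter
import nltk
with open('europarl.txt', 'r') as f:  # placeholder file name
    tokens = nltk.wordpunct_tokenize(f.read())
counts = Counter(w.lower() for w in tokens)
# Keep only tokens that occur at least twice; most of the odd entries fall below this
vocab = sorted(w for w, c in counts.items() if c >= 2)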
Since you are using the Europarl data: there's also a tokenizer script (in Perl) included, which you could use to produce tokenized text, so that when you read it in Python, splitting on whitespace amounts to tokenizing. I'm not sure whether the Moses/Europarl tokenizer is more or less sophisticated than NLTK's wordpunct one, though.
