I am working on a text classification project using the Huggingface transformers module. The encode_plus function provides a convenient way of generating the input ids, attention masks, token type ids, etc. For instance:
from transformers import BertTokenizer
pretrained_model_name = 'bert-base-cased'
bert_base_tokenizer = BertTokenizer.from_pretrained(pretrained_model_name)
cleaned_tweet = 'Bamboo poles, installation by an unknown building constructor'
hashtag_string = '#discoverhongkong #hongkonginsta'
encoding = bert_base_tokenizer.encode_plus(
cleaned_tweet, hashtag_string,
max_length=70,
add_special_tokens=True, # Add '[CLS]' and '[SEP]'
return_token_type_ids=True,
pad_to_max_length=True,
return_attention_mask=True,
return_tensors='pt', # Return PyTorch tensors
)
print('*'*20)
print(encoding['input_ids'])
print(encoding['attention_mask'])
print(encoding['token_type_ids'])
print('*'*20)
However, my current project requires me to generate customized ids for a given text. For instance, for a list of words [HK, US, UK], I want to generate ids for these words and set the ids of all other words, which are not in this list, to zero. These ids are used to look up embeddings in a separate, customized embedding matrix, not in the pretrained BERT module.
How can I achieve this kind of customized encoder? Any suggestions and solutions are welcome! Thanks~
I think you can use the [unusedX] tokens in the BERT vocab and add your custom tokens there. That way you can refer to them with a valid token ID.
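Alternatively, if all you need are ids into your own embedding matrix (as described in the question), a minimal sketch of such a custom encoder could look like the following; the vocabulary, dimensions, and variable names here are just placeholders:
import torch

# Hypothetical custom vocabulary: words of interest get ids 1..N,
# every other word maps to 0
custom_vocab = {'HK': 1, 'US': 2, 'UK': 3}

def custom_encode(text, vocab, max_length=70):
    # Map each whitespace token to its custom id; unknown words get 0
    ids = [vocab.get(tok, 0) for tok in text.split()]
    # Pad/truncate so the ids line up with BERT inputs of the same length
    ids = ids[:max_length] + [0] * max(0, max_length - len(ids))
    return torch.tensor(ids).unsqueeze(0)

custom_ids = custom_encode('HK and US announce new policy', custom_vocab)

# Separate embedding matrix for these ids; index 0 acts as "not in the list"
custom_embedding = torch.nn.Embedding(num_embeddings=len(custom_vocab) + 1,
                                      embedding_dim=16, padding_idx=0)
custom_vectors = custom_embedding(custom_ids)  # shape: (1, 70, 16)
Note that a plain whitespace split will not line up one-to-one with BERT's wordpiece tokens, so in practice you would need to map these ids onto the wordpiece positions.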
I am trying to mask named entities in text, using a RoBERTa-based model.
The suggested way to use the model is via the Huggingface pipeline, but I find that it is rather slow to use it that way. Using a pipeline on text data also prevents me from using my GPU for computation, as the text cannot be put onto the GPU.
Because of this, I decided to put the model on the GPU, tokenize the text myself (using the same tokenizer I pass to the pipeline), put the tokens on the GPU and pass them to the model afterwards. This works, but the outputs of the model used directly like this, rather than via the pipeline, differ significantly.
I can't find a reason for this nor a way to fix it.
I tried reading through the token classification pipeline source code but couldn't find a difference between my usage and what the pipeline does.
Examples of code which produce different results:
Suggested usage in the model card:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

ner_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
classifier = pipeline("ner", model=model, tokenizer=ner_tokenizer, framework='pt')
out = classifier(dataset['text'])
'out' is now a list of lists of dictionaries, each holding information on a named entity found in the corresponding string of 'dataset['text']'.
My custom usage:
text_batch = dataset['text']
encodings_batch = ner_tokenizer(text_batch, padding="max_length", truncation=True, max_length=128, return_tensors="pt")
input_ids = encodings_batch['input_ids']
input_ids = input_ids.to(TORCH_DEVICE)
outputs = model(input_ids)[0]
outputs = outputs.to('cpu')
label_ner_ids = outputs.argmax(dim=2).to('cpu')
'label_ner_ids' is now a 2-dimensional tensor whose elements represent the labels for each token in a given line of text, so label_ner_ids[i, j] is the label of the j-th token in the i-th string of 'text_batch'. The token labels here differ from the outputs of the pipeline usage.
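(A minimal sketch, continuing from the snippet above, of the same forward pass with the attention mask forwarded as well; padding to max_length without a mask lets the model attend to the pad tokens, which is one plausible source of such a discrepancy.)
import torch

# Continuing from the question's variables: model, encodings_batch, TORCH_DEVICE
input_ids = encodings_batch['input_ids'].to(TORCH_DEVICE)
attention_mask = encodings_batch['attention_mask'].to(TORCH_DEVICE)

with torch.no_grad():
    # Pass the mask so padded positions are ignored by self-attention
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)[0]

label_ner_ids = outputs.argmax(dim=2).to('cpu')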
I have a dataset of utterances and corresponding sentiment labels. I want to use an embedding of the sentiment label as an additional input to BERT (to simplify things, you can say that I want to initialize the embeddings for some tokens in my BERT model). There are 6-7 unique labels. I planned to use static embeddings like GloVe to map each label to an embedding, but this will not be compatible with BERT, which expects the input embedding to be of size 768. How can I generate static embeddings of my labels?
You can try sbert to generate embeddings of a given dimension for both your sentences and labels.
Here is the library - https://www.sbert.net/
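A minimal sketch of that idea; the model name below is just one example of a sentence-transformers model whose output dimension is 768, matching BERT-base:
from sentence_transformers import SentenceTransformer

# 'all-mpnet-base-v2' is an example model with 768-dimensional output,
# which matches BERT-base's hidden size
sbert_model = SentenceTransformer('all-mpnet-base-v2')

labels = ['anger', 'joy', 'sadness']  # placeholder sentiment labels
label_embeddings = sbert_model.encode(labels)  # numpy array of shape (3, 768)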
I'm trying to figure out what BERT preprocessing does, and how it is done, but I can't find a good explanation. I would appreciate a link to a better and more deeply explained solution, if somebody knows one.
If someone, on the other hand, wants to explain it here, I would also be extremely thankful!
My question is: how does BERT mathematically convert a string input into a vector of numbers of fixed size? What are the logical steps it follows?
BERT provides its own tokenizer. Because BERT is a pretrained model that expects input data in a specific format, the following are required:
A special token, [SEP], to mark the end of a sentence, or the separation between two sentences
A special token, [CLS], at the beginning of our text. This token is used for classification tasks, but BERT expects it no matter what your application is.
Tokens that conform with the fixed vocabulary used in BERT
The token IDs for the tokens, from BERT's tokenizer
Mask IDs to indicate which elements in the sequence are tokens and which are padding elements
Segment IDs used to distinguish different sentences
Positional embeddings used to show token position within the sequence
from transformers import BertTokenizer
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# An example sentence
text = "Sentence to embed"
# Add the special tokens.
marked_text = "[CLS] " + text + " [SEP]"
# Split the sentence into tokens.
tokenized_text = tokenizer.tokenize(marked_text)
# Map the token strings to their vocabulary indices.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
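The mask IDs and segment IDs from the list above can be obtained in a single call; a minimal sketch, continuing with the same tokenizer and text (max_length here is just an example value):
encoding = tokenizer.encode_plus(
    text,
    add_special_tokens=True,     # adds [CLS] and [SEP] automatically
    max_length=16,               # example length
    pad_to_max_length=True,
    return_attention_mask=True,
    return_token_type_ids=True,
)
print(encoding['input_ids'])       # token IDs, padded with 0
print(encoding['attention_mask'])  # 1 for real tokens, 0 for padding
print(encoding['token_type_ids'])  # segment IDs (all 0 for a single sentence)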
Have a look at this excellent tutorial for more details.
From multiple searches and the PyTorch documentation itself, I could figure out that inside the embedding layer there is a lookup table where the embedding vectors are stored. What I am not able to understand is:
What exactly happens during training in this layer?
What are the weights, and how are the gradients of those weights computed?
My intuition is that at least there should be a function with some parameters that produces the keys for the lookup table. If so, what is that function?
Any help in this will be appreciated. Thanks.
That is a really good question! The embedding layer of PyTorch (the same goes for TensorFlow) serves as a lookup table just to retrieve the embeddings for each of the inputs, which are indices. Consider the following case: you have a sentence where each word is tokenized, so each word in your sentence is represented by a unique integer (index). If the list of indices (words) is [1, 5, 9] and you want to encode each of the words with a 50-dimensional vector (embedding), you can do the following:
import torch

# The list of tokens
tokens = torch.tensor([1, 5, 9], dtype=torch.long)
# Define an embedding layer, where you know upfront that in total you
# have 10 distinct words, and you want each word to be encoded with
# a 50 dimensional vector
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=50)
# Obtain the embeddings for each of the words in the sentence
embedded_words = embedding(tokens)
Now, to answer your questions:
During the forward pass, the values for each of the tokens in your sentence are going to be obtained in a similar way to how NumPy's indexing works. Because under the hood this is a differentiable operation, during the backward pass (training) PyTorch is going to compute the gradients for each of the embeddings and readjust them accordingly.
The weights are the embeddings themselves. The word embedding matrix is actually a weight matrix that will be learned during training.
There is no actual function per se. As we defined above, the sentence is already tokenized (each word is represented with a unique integer), and we can just obtain the embeddings for each of the tokens in the sentence.
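As a small illustration of that point (continuing with the layer and tokens defined above), only the rows of the weight matrix that were actually indexed receive non-zero gradients:
# Backpropagate through a lookup and inspect the gradients of the weight matrix
output = embedding(tokens)
output.sum().backward()

# Only rows 1, 5 and 9 of the 10x50 weight matrix have non-zero gradients;
# all other rows stay at zero because they were not used in the forward pass
print(embedding.weight.grad.abs().sum(dim=1))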
Finally, since I mentioned the indexing example several times, let us try it out.
import numpy as np

# Let us assume that we have a pre-trained embedding matrix
pretrained_embeddings = torch.rand(10, 50)
# We can initialize our embedding module from the embedding matrix
embedding = torch.nn.Embedding.from_pretrained(pretrained_embeddings)
# Some tokens
tokens = torch.tensor([1,5,9], dtype=torch.long)
# Token embeddings from the lookup table
lookup_embeddings = embedding(tokens)
# Token embeddings obtained with indexing
indexing_embeddings = pretrained_embeddings[tokens]
# Voila! They are the same
np.testing.assert_array_equal(lookup_embeddings.numpy(), indexing_embeddings.numpy())
The nn.Embedding layer can serve as a lookup table. That means that if you have a dictionary of n elements, you can retrieve each element by its id once you create the embedding.
In that case the size of the dictionary would be num_embeddings and embedding_dim would be 1.
You don't have anything to learn in this scenario. You just indexed the elements of the dict, or encoded them, you might say. So a forward-pass analysis is not needed in this case.
You may have used this if you used word embeddings like Word2vec.
On the other hand, you may use an embedding layer for categorical variables (features in the general case). There you would set num_embeddings to the number of categories you have and embedding_dim to the size of the vectors you want to learn, as sketched below.
In that case you start with a randomly initialized embedding layer and learn the category representations during training.
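A minimal sketch of that second case; the number of categories and the embedding size below are just placeholders:
import torch

# Hypothetical categorical feature with 7 categories (e.g. sentiment labels)
num_categories = 7
category_embedding = torch.nn.Embedding(num_embeddings=num_categories, embedding_dim=16)

# A batch of category ids; the 7x16 weight matrix is learned during training
category_ids = torch.tensor([0, 3, 6], dtype=torch.long)
category_vectors = category_embedding(category_ids)  # shape: (3, 16)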
I am a bit new to gensim, and right now I am trying to solve a problem which involves using doc2vec embeddings in Keras. I wasn't able to find an existing implementation of doc2vec in Keras - as far as I can see, in all the examples I have found so far, everyone just uses gensim to get the document embeddings.
Once I have trained my doc2vec model in gensim, I need to somehow export the embedding weights from gensim into Keras, and it is not really clear how to do that. I see that
model.syn0
This supposedly gives the word2vec embedding weights (according to this). But it is unclear how to do the same export for document embeddings. Any advice?
I know that in general I can just get the embeddings for each document directly from the gensim model, but I want to fine-tune the embedding layer in Keras later on, since the doc embeddings will be used as part of a larger task and hence might be fine-tuned a bit.
I figured this out.
Assuming you already trained the gensim model and used string tags as document ids:
#get vector of doc
model.docvecs['2017-06-24AEON']
#raw docvectors (all of them)
model.docvecs.doctag_syn0
#docvector names in model
model.docvecs.offset2doctag
You can export these doc vectors into a Keras embedding layer as shown below, assuming your DataFrame df holds all of the documents. Notice that the embedding layer accepts only integers as inputs, so I use the row number in the DataFrame as the id of the doc. Also notice that the embedding layer requires index 0 to be left untouched - it is reserved for masking - so when I pass the doc id as input to my network I need to ensure it is > 0.
import numpy as np
from keras.layers import Input, Embedding

#creating embedding matrix
embedding_matrix = np.zeros((len(df)+1, text_encode_dim))
for i, row in df.iterrows():
embedding = modelDoc2Vec.docvecs[row['docCode']]
embedding_matrix[i+1] = embedding
#input with id of document
doc_input = Input(shape=(1,), dtype='int16', name='doc_input')
#embedding layer initialized with the matrix created earlier
embedded_doc_input = Embedding(output_dim=text_encode_dim, input_dim=len(df)+1, weights=[embedding_matrix], input_length=1, trainable=False)(doc_input)
UPDATE
After late 2017, with the introduction of the Keras 2.0 API, the very last line should be changed to the following (note that Constant needs to be imported from keras.initializers):
from keras.initializers import Constant
embedded_doc_input = Embedding(output_dim=text_encode_dim, input_dim=len(df)+1, embeddings_initializer=Constant(embedding_matrix), input_length=1, trainable=False)(doc_input)
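Since the question mentions fine-tuning the layer later, here is a rough sketch of plugging this input into a small Keras model (the head below is just a placeholder); setting trainable=True on the Embedding would let the doc vectors be fine-tuned:
from keras.layers import Flatten, Dense
from keras.models import Model

# Placeholder downstream head on top of the document embedding
x = Flatten()(embedded_doc_input)
output = Dense(1, activation='sigmoid')(x)

doc_model = Model(inputs=doc_input, outputs=output)
doc_model.compile(optimizer='adam', loss='binary_crossentropy')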