Should BERT embeddings be made on tokens or sentences? - python-3.x

I am making a sentence classification model and using BERT word embeddings in it. Because the dataset is very large, I combined all the sentences into one string and generated embeddings on the tokens produced from it.
s = " ".join(text_list)
len(s)
Here s is the string and text_list contains the sentences on which I want to make my word embeddings.
I then tokenize the string
stokens = tokenizer.tokenize(s)
My question is: will BERT perform better if it is given whole sentences one at a time, or is it also fine to generate embeddings on tokens taken from the whole concatenated string?
Here is the code for my embedding generator
pool = []
all_embeddings = []  # the original name `all` shadows the Python built-in
i = 0
while i != 600000:
    # Slice a fresh 500-token chunk instead of overwriting stokens,
    # which would discard the remaining tokens after the first pass.
    chunk = ["[CLS]"] + stokens[i:i+500] + ["[SEP]"]
    input_ids = get_ids(chunk, tokenizer, max_seq_length)
    input_masks = get_masks(chunk, max_seq_length)
    input_segments = get_segments(chunk, max_seq_length)
    a, b = embedd(input_ids, input_masks, input_segments)
    pool.append(a)
    all_embeddings.append(b)
    print(i)
    i += 500
What I am essentially doing here is: I have a token sequence of length 600000, and I take 500 tokens at a time, generate embeddings for them, and append them to a list called pool.

For classification, you don't have to concatenate the sentences. By concatenating, you are merging sentences of different classes.
If it is BERT fine-tuning, by default a logistic regression layer is learnt on top of the [CLS] token for the classification task. Since it is an attention-based transformer model, every token has attended to all the other tokens and has captured the context, so the [CLS] token is sufficient.
However, if you want to use the embeddings, you can learn a classifier on a single vector, i.e., the [CLS] token embedding or the averaged embeddings of all the tokens. Alternatively, you can take the embedding of each token and learn on the resulting sequence with other classifiers such as CNNs or RNNs.
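For instance, here is a minimal sketch of the embeddings-plus-classifier route using the Hugging Face transformers library (a different stack from the TF Hub helpers in the question; labels is an assumed array of per-sentence class labels):
import torch
from transformers import BertTokenizer, BertModel
from sklearn.linear_model import LogisticRegression

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    # One vector per sentence: the [CLS] token's final hidden state.
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    return out.last_hidden_state[:, 0].numpy()

# text_list holds the per-sentence data from the question; labels is assumed.
clf = LogisticRegression(max_iter=1000).fit(embed(text_list), labels)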

Related

The inputs to BERT are token IDs. How do I get the corresponding input token VECTORs inside BERT?

I am new and learning about transformers.
In a lot of BERT tutorials, I see that the input is just the token ID of each word. But surely we need to convert each token ID to a vector representation (it could be a one-hot encoding, or any initial vector representation per token ID) so that it can be used by the model.
My question is: where can I find this initial vector representation for each token?
In BERT, the input is a string itself. Then BERT converts it into tokens and creates their vectors. Let's see an example:
import tensorflow_hub as hub
import tensorflow_text  # registers the ops the preprocessing model needs

prep_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
enc_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4'

bert_preprocess = hub.KerasLayer(prep_url)
bert_encoder = hub.KerasLayer(enc_url)

text = ["Hello I'm new to stack overflow"]

# First, preprocess the raw strings. This returns a dict with keys such as
# input_word_ids (the token IDs produced by BERT's own tokenizer).
preprocessed_text = bert_preprocess(text)

# This returns another dict; encoded['pooled_output'] is the (1, 768)
# vector summarizing the whole input text.
encoded = bert_encoder(preprocessed_text)

# You can explore both dicts by printing their keys().
I recommend going to both links above and doing a little research. To recap: BERT takes strings as input and then tokenizes them (with its own tokenizer!). If you want to tokenize to the same IDs, you need the same vocab file, but for a fresh start like yours this should be enough.
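If you specifically want the initial vector that each token ID maps to, that lookup table is exposed directly in the Hugging Face transformers library; a minimal sketch (an alternative to the TF Hub layers above):
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# The learned lookup table mapping each token ID to its initial vector.
embedding_layer = model.get_input_embeddings()  # an nn.Embedding(vocab_size, 768)

ids = tokenizer("hello world", return_tensors="pt")["input_ids"]
initial_vectors = embedding_layer(ids)  # shape: (1, seq_len, 768)
# Inside the model, these vectors are then combined with position and
# segment embeddings before the first transformer layer.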

How to generate sentence embeddings using the Longformer model

I am using the Hugging Face mrm8488/longformer-base-4096-finetuned-squadv2 pre-trained model:
https://huggingface.co/mrm8488/longformer-base-4096-finetuned-squadv2
I want to generate sentence-level embeddings. I have a dataframe with a text column.
I am using this code:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(ckpt)
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this." # I will pas text-column here from my data-frame
#question = "What has Huggingface done ?"
encoding = tokenizer(question, text, return_tensors="pt")
# I don't want to use it for Question-Answer use-case. I just need the sentence embeddings
input_ids = encoding["input_ids"]
# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
attention_mask = encoding["attention_mask"]
How can I modify the above code to generate embeddings for these sentences?
I have the following examples:
Text
i've added notes to the claim and it's been escalated for final review
after submitting the request you'll receive an email confirming the open request.
hello my name is person and i'll be assisting you
this is sam and i'll be assisting you for date.
I'll return the amount as asap.
ill return it to you.
The Longformer uses a local attention mechanism and you need to pass a global attention mask to let one token attend to all tokens of your sequence.
import torch
from transformers import LongformerTokenizer, LongformerModel
ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = LongformerTokenizer.from_pretrained(ckpt)
model = LongformerModel.from_pretrained(ckpt)
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this." # I will pas text-column here from my data-frame
#question = "What has Huggingface done ?"
encoding = tokenizer(text, return_tensors="pt")
global_attention_mask = [1].extend([0]*encoding["input_ids"].shape[-1])
encoding["global_attention_mask"] = global_attention_mask
# I don't want to use it for Question-Answer use-case. I just need the sentence embeddings
# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
o = model(**encoding)
sentence_embedding = o.last_hidden_state[:,0]
You should keep in mind that mrm8488/longformer-base-4096-finetuned-squadv2 was not pre-trained to produce meaningful sentence embeddings, so it faces the same issues as MLM-pre-trained BERT models with regard to sentence embeddings.
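A common mitigation is to mean-pool the token embeddings instead of taking only the first token. A minimal sketch, reusing o and encoding from the snippet above:
# Mean-pool the last hidden states, ignoring padded positions.
mask = encoding["attention_mask"].unsqueeze(-1)             # (1, seq_len, 1)
summed = (o.last_hidden_state * mask).sum(dim=1)            # (1, 768)
sentence_embedding = summed / mask.sum(dim=1).clamp(min=1)  # (1, 768)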

Text classification: value error couldn't convert str to float

What should the input be for a trained random forest text classification model?
I am not able to work out what input the trained model expects after loading it from the pickle file.
import pickle

with open('text_classifier', 'rb') as training_model:
    model = pickle.load(training_model)

predictions = []  # the original name `list` shadows the Python built-in
for message in text:
    message1 = [str(message)]
    pred = model.predict(message1)
    predictions.append(pred)
return predictions
Expected output: Non political
Actual output:
ValueError: could not convert string to float: 'RT #ScotNational The witness admitted that not all damage inflicted on police cars was caused
You need to encode the text as numbers; no machine learning algorithm can process raw text directly.
More precisely, you need the same text vectorization that was used to train the model. Common choices are TF-IDF and word embeddings such as Word2vec.
I suggest playing with sklearn.feature_extraction.text.CountVectorizer and sklearn.feature_extraction.text.TfidfTransformer to familiarize yourself with the concept.
However, if you do not use the same vectorization as the one used to train the loaded model, there is no way you will obtain good results.
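A minimal sketch of keeping the vectorizer and the classifier together, so that the pickled model accepts raw strings at prediction time (train_texts and train_labels are illustrative names):
import pickle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# The Pipeline stores the fitted vectorizer next to the classifier,
# so predict() can take raw strings.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier()),
])
pipeline.fit(train_texts, train_labels)  # train_texts: list of raw strings

with open("text_classifier", "wb") as f:
    pickle.dump(pipeline, f)

with open("text_classifier", "rb") as f:
    model = pickle.load(f)
print(model.predict(["some new message to classify"]))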

Wrong length for Gensim Word2Vec's vocabulary

I am trying to train the Gensim Word2Vec model by:
X = train['text']
model_word2vec = models.Word2Vec(X.values, size=150)
model_word2vec.train(X.values, total_examples=len(X.values), epochs=10)
After the training, I get a small vocabulary (model_word2vec.wv.vocab) of length 74, containing only single characters (the alphabet's letters).
How could I get the right vocabulary?
Update
I tried this before:
tokenizer = Tokenizer(lower=True)
tokenizer.fit_on_texts(X)  # fit_on_texts works in place and returns None
sequence = tokenizer.texts_to_sequences(X)
model_word2vec.train(sequence, total_examples=len(X.values), epochs=10)
but I got the same wrong vocabulary size.
Supply the model with the kind of corpus it needs: a sequence of texts, where each text is a list-of-string-tokens. If you supply it with non-tokenized strings instead, it will think each single character is a token, giving the results you're seeing.
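A minimal sketch, assuming gensim 4 (where the size parameter was renamed vector_size; in gensim 3 use size and inspect model.wv.vocab instead):
from gensim.models import Word2Vec

# Each text becomes a list of string tokens; simple whitespace
# tokenization shown here, but any proper tokenizer works too.
sentences = [text.lower().split() for text in train['text']]

model_word2vec = Word2Vec(sentences, vector_size=150, epochs=10)
print(len(model_word2vec.wv))  # vocabulary size now counts words, not characters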

Word level sentence generation using keras

I am new to Keras. I am trying to build a word-level sentence generation module using Keras.
I use a vocabulary size of 8000 and a one-hot-vector representation of the words in my corpus.
This is my model
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation
from keras.optimizers import RMSprop

model = Sequential()
model.add(LSTM(200, return_sequences=False, unroll=True, stateful=False, input_shape=(1, len(itw))))
model.add(Activation('tanh'))
model.add(Dense(len(itw)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

a, b = batch1()
model.fit(a, b, batch_size=45, epochs=5, verbose=1)
Here the input is 3-dimensional (the Keras LSTM expects 3D input), with shape (batch_size, 1, 8000). I made the second dimension 1 because I want to feed the model word by word.
batch_size is 45 because that is the average length of a sentence in the corpus.
All sentences are prepended with a START_TOKEN and appended with an END_TOKEN.
len(itw) returns the length of the vocabulary, which in my case is 8000.
Now, after training, I want to loop the generated words back in as input until a stop token is encountered, to generate a sentence.
But it seems that the model only considers the current input; I hoped it would also take the earlier inputs into account, not just the current one.
So how should I change my model?
Also, is Keras unfit for NLP applications involving variable-length time series?
How should I change the model so that it will generate the next word given n words, where n is any natural number?
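One common pattern for the loop-back generation described above is a stateful LSTM fed one word per step, so the hidden state carries the earlier words. A minimal sketch, assuming the model has been rebuilt with stateful=True and batch_input_shape=(1, 1, len(itw)), that itw maps indices to words, and that wti is a hypothetical inverse mapping words to indices:
import numpy as np

def generate_sentence(model, wti, itw, max_len=45):
    model.reset_states()  # clear the hidden state before each sentence
    word = "START_TOKEN"
    sentence = []
    for _ in range(max_len):
        x = np.zeros((1, 1, len(itw)))
        x[0, 0, wti[word]] = 1.0  # one-hot encode the current word
        probs = model.predict(x, verbose=0)[0]
        word = itw[int(np.argmax(probs))]  # greedy choice; sampling also works
        if word == "END_TOKEN":
            break
        sentence.append(word)
    return " ".join(sentence)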
