Confusion in pre-processing text for Roberta model

I want to apply the Roberta model to text similarity. Given a pair of sentences, the input should be in the format <s> A </s></s> B </s>. I figured out two possible ways to generate the input IDs, namely:
a)
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands')
list2 = tokenizer.encode('Numbness of upper limb')
sequence = list1+[2]+list2[1:]
In this case, sequence is [0, 12178, 3814, 2400, 11, 1420, 2, 2, 234, 4179, 1825, 9, 2853, 29654, 2]
b)
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands', add_special_tokens=False)
list2 = tokenizer.encode('Numbness of upper limb', add_special_tokens=False)
sequence = [0]+list1+[2,2]+list2+[2]
In this case, sequence is [0, 25101, 3814, 2400, 11, 1420, 2, 2, 487, 4179, 1825, 9, 2853, 29654, 2]
Here 0 represents the <s> token and 2 represents the </s> token. I'm not sure which is the correct way to encode the two sentences for calculating sentence similarity with the Roberta model.

The easiest way is probably to use the functionality provided by HuggingFace's tokenizers directly, namely the text_pair argument of the encode function (see here). This allows you to feed in the two sentences together, which gives you the desired output:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
sequence = tokenizer.encode(text='Very severe pain in hands',
                            text_pair='Numbness of upper limb',
                            add_special_tokens=True)
This is especially convenient if you are dealing with very long sequences, as the encode function can automatically truncate your inputs according to the truncation_strategy argument. You obviously don't have to worry about this if you only have short sequences.
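For instance, a minimal sketch of pair encoding with truncation enabled (the exact keyword names, e.g. truncation and max_length rather than truncation_strategy, depend on your transformers version):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
# truncate the encoded pair to at most 32 tokens, special tokens included
sequence = tokenizer.encode(text='Very severe pain in hands',
                            text_pair='Numbness of upper limb',
                            add_special_tokens=True,
                            truncation=True,
                            max_length=32)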
Alternatively, you can also make use of the more explicit build_inputs_with_special_tokens() function of the RobertaTokenizer specifically, which can be applied to your example like so:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands', add_special_tokens=False)
list2 = tokenizer.encode('Numbness of upper limb', add_special_tokens=False)
sequence = tokenizer.build_inputs_with_special_tokens(list1, list2)
Note that in this case you still have to generate the sequences list1 and list2 without any special tokens, as you have already done correctly.

Related

Custom tokenizer not working in CountVectorizer (sklearn)

I am trying to build a CountVectorizer with a custom tokenizer function, and I am facing a weird problem with it. In the code below, temp_tok is a list of 5 values which is used as the vocabulary later.
temp_tok = ["or", "Normal sinus rhythm", "sinus", "anuj","Normal sinus"]
def tokenize(text):
    return [temp_tok[0], temp_tok[1], "sinus", "Normal sinus"]
def tokenize2(text):
    return [i for i in temp_tok if i in text]
text = "Normal sinus rhythm"
The output of both functions for text is the same, which is
tokenize(text)
output = ['or', 'Normal sinus rhythm', 'sinus', 'Normal sinus']
But when I build vectorizers with these tokenizers, I get unexpected output for tokenize2. My vocabulary is temp_tok for both. I experimented with ngram_range but it is not helping.
vectorizer = CountVectorizer(vocabulary=temp_tok,tokenizer = tokenize)
vectorizer2 = CountVectorizer(vocabulary=temp_tok,tokenizer = tokenize2)
While vectorizer.transform([text]) gives the expected output, vectorizer2.transform([text]) gives 1 only for "or" and "sinus":
vectorizer.transform(["Normal sinus rhythm"]).toarray()
array([[1, 1, 1, 0, 1]])
vectorizer.transform(["Normal sinus rhythm"]).toarray()
array([[1, 0, 1, 0, 0]])
I also tried passing a dictionary instead of the list temp_tok as the vocabulary to CountVectorizer, but it doesn't help. Is this an sklearn problem or am I doing something wrong?
CountVectorizer lower-cases the text before passing it to the tokenizer (lowercase=True by default). That is why tokenize2 is not working while tokenize works well.
This can be seen by adding a print call to tokenize2:
def tokenize2(text):
    print(text)
    return [i for i in temp_tok if i in text]
A good solution would be to change the elements in temp_tok to lower case; otherwise, any technique for handling the lower-case/upper-case mismatch would work.
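For example, a minimal sketch of the lower-casing fix, building on the code above:
from sklearn.feature_extraction.text import CountVectorizer
temp_tok = ["or", "normal sinus rhythm", "sinus", "anuj", "normal sinus"]  # lower-cased vocabulary
def tokenize2(text):
    # text arrives lower-cased from CountVectorizer, so compare against lower-cased phrases
    return [i for i in temp_tok if i in text]
vectorizer2 = CountVectorizer(vocabulary=temp_tok, tokenizer=tokenize2)
print(vectorizer2.transform(["Normal sinus rhythm"]).toarray())
# expected: [[1 1 1 0 1]]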

Getting similarity score with spacy and a transformer model

I've been using spacy's en_core_web_lg and wanted to try out en_core_web_trf (the transformer model), but I'm having some trouble wrapping my head around the difference in model/pipeline usage.
My use case looks like the following:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_trf")
s1 = nlp("Running for president is probably hard.")
s2 = nlp("Space aliens lurk in the night time.")
s1.similarity(s2)
Output:
The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements.
(0.0, Space aliens lurk in the night time.)
Looking at this post, the transformer model does not have word vectors in the same way en_core_web_lg does, but you can get the embedding via s1._.trf_data.tensors, which looks like:
s1._.trf_data.tensors[0].shape
(1, 9, 768)
s1._.trf_data.tensors[1].shape
(1, 768)
So I tried to manually take the cosine similarity (using this post as ref):
from scipy.spatial.distance import cosine  # assumed import, following the referenced post
def similarity(obj1, obj2):
    (v1, t1), (v2, t2) = obj1._.trf_data.tensors, obj2._.trf_data.tensors
    try:
        return ((1 - cosine(v1, v2)) + (1 - cosine(t1, t2))) / 2
    except:
        return 0.0
But this does not work.
As @polm23 mentioned, using sentence-transformers is a better approach to get sentence similarity.
First install the package: pip install sentence-transformers
Then use this code:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["Running for president is probably hard.","Space aliens lurk in the night time."]
embedded_list = model.encode(sentences)
similarity = cos_sim(embedded_list[0],embedded_list[1])
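As a small usage note, cos_sim returns a 1x1 torch tensor here, so the scalar score can be read with, for example:
print(similarity.item())  # cosine similarity, roughly in [-1, 1]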
But if you are determined to use spacy for sentence similarity, be aware that the reason your code does not work is that v1 and v2 don't have the same shape, as you can see:
s1._.trf_data.tensors[0].shape --> (1, 9, 768)
s2._.trf_data.tensors[0].shape --> (1, 11, 768)
So it's not possible to compute a similarity between these two arrays directly.
s1._.trf_data.tensors is a tuple consisting of two arrays:
s1._.trf_data.tensors[0] gives an array of shape (1, 9, 768), which contains one 768-dimensional vector for each of the 9 tokens.
s1._.trf_data.tensors[1] gives an array of shape (1, 768) for the whole sentence.
So you can get similarity as follows:
similarity = cosine(s1._.trf_data.tensors[1], s2._.trf_data.tensors[1])
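If cosine here is scipy.spatial.distance.cosine, note that it returns a distance and expects 1-D vectors, so a small sketch under that assumption would flatten the (1, 768) arrays and subtract the distance from 1:
from scipy.spatial.distance import cosine
v1 = s1._.trf_data.tensors[1].ravel()  # pooled sentence vector, shape (768,)
v2 = s2._.trf_data.tensors[1].ravel()
similarity = 1 - cosine(v1, v2)  # cosine() is a distance, so 1 - distance gives similarity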

keras pre-processing of text using one_hot class

I came across this code while learning keras online.
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence
text = 'One hot encoding in Keras'
tokens = text_to_word_sequence(text)
length = len(tokens)
one_hot(text, length)
This returns integers like this:
[3, 1, 1, 2, 3]
I do not understand why and how unique words get duplicate numbers, e.g. 3 and 1 are repeated even though the words in the text are unique.
The documentation of one_hot describes it as a wrapper around hashing_trick:
This is a wrapper to the hashing_trick function using hash as the hashing function; unicity of word to index mapping non-guaranteed.
From the documentation of hashing_trick:
Two or more words may be assigned to the same index, due to possible collisions by the hashing function. The probability of a collision is in relation to the dimension of the hashing space and the number of distinct objects.
Since hashing is used, there is a chance that different words will be hashed to the same index. The smaller the hashing space you select relative to the number of distinct words, the more likely such collisions become.
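A minimal sketch of the underlying idea (not the exact formula keras uses, and note that Python's built-in hash for strings is randomized per interpreter run unless PYTHONHASHSEED is set):
words = ['one', 'hot', 'encoding', 'in', 'keras']
n = len(words)
# with only n buckets available, distinct words can easily land on the same index
print([hash(w) % n + 1 for w in words])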
Jason Brownlee suggests using a vocabulary size about 25% larger than the number of distinct words to increase the uniqueness of the hashes.
Following Jason Brownlee's suggestion in your case results in:
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import text_to_word_sequence
import math
text = 'One hot encoding in Keras'
tokens = text_to_word_sequence(text)
length = len(tokens)
# hash into a space ~25% larger than the number of words to reduce collisions
print(one_hot(text, math.ceil(length*1.25)))
which, on one run, returned the integers
[3, 4, 5, 1, 6]
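If you also want indices that are stable across runs (the default hash here is Python's built-in hash, which is randomized per interpreter session), a small sketch using the underlying hashing_trick function with md5 instead, under the same assumptions as above:
from tensorflow.keras.preprocessing.text import hashing_trick, text_to_word_sequence
import math
text = 'One hot encoding in Keras'
length = len(text_to_word_sequence(text))
# md5 produces the same indices on every run, unlike Python's built-in hash
print(hashing_trick(text, math.ceil(length*1.25), hash_function='md5'))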

Building vocabulary using document vector

I am not able to build the vocabulary and am getting an error:
TypeError: 'int' object is not iterable
Here is my code, which is based on this Medium article:
https://towardsdatascience.com/implementing-multi-class-text-classification-with-doc2vec-df7c3812824d
I tried providing a pandas Series and a list to the build_vocab function.
import pandas as pd
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split
import multiprocessing
import nltk
from nltk.corpus import stopwords
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens
df = pd.read_csv("https://raw.githubusercontent.com/RaRe-Technologies/movie-plots-by-genre/master/data/tagged_plots_movielens.csv")
tags_index = {
    "sci-fi": 1,
    "action": 2,
    "comedy": 3,
    "fantasy": 4,
    "animation": 5,
    "romance": 6,
}
df["tindex"] = df.tag.replace(tags_index)
df = df[["plot", "tindex"]]
mylist = list()
for i, q in df.iterrows():
    mylist.append(
        TaggedDocument(tokenize_text(str(q["plot"])), tags=q["tindex"])
    )
df["tdoc"] = mylist
X = df[["tdoc"]]
y = df["tindex"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
cores = multiprocessing.cpu_count()
model_doc2vec = Doc2Vec(
    dm=1,
    vector_size=300,
    negative=5,
    hs=0,
    min_count=2,
    sample=0,
    workers=cores,
)
model_doc2vec.build_vocab([x for x in X_train["tdoc"]])
The documentation is very confusing for this method.
Doc2Vec needs an iterable sequence of TaggedDocument-like objects for its corpus (as is fed to build_vocab() or train()).
When showing an error, you should also show the full stack that accompanied it, so that it is clear what line-of-code, and surrounding call-frames, are involved.
But, it's unclear if what you've fed into the dataframe, then out via dataframe-bracket-access, then through the train_test_split(), is actually that.
So I'd suggest assigning things to descriptive interim variables, and verifying that they contain the right sorts of things at each step.
Is X_train["tdoc"][0] a proper TaggedDocument, with a words property that is a list-of-strings, and tags property a list-of-tags? (And, where each tag is probably a string, but could perhaps be a plain-int, counting upward from 0.)
Is mylist[0] a proper TaggedDocument?
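For reference, a minimal sketch of what a proper TaggedDocument could look like in this setup, building on the question's df and tokenize_text (note that tags is a list here, whereas the code above passes a bare int):
from gensim.models.doc2vec import TaggedDocument
mylist = []
for i, q in df.iterrows():
    mylist.append(
        TaggedDocument(words=tokenize_text(str(q["plot"])),
                       tags=[str(q["tindex"])])  # tags should be a list, not a bare int
    )
print(mylist[0])  # should show words=[...] and tags=['...']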
Separately: many online examples of Doc2Vec use have egregious errors, and the Medium article you link is no exception. Its practice of calling train() multiple times in a loop is usually unneeded, and very error-prone, and in fact in that article results in severe learning-rate alpha mismanagement. (For example, deducting 0.002 from the starting-default alpha of 0.025 30 times results in a negative effective alpha, which is never justified and means the model is making itself worse with every example. This may be a factor contributing to the awful reported classifier accuracy.)
I would disregard that article entirely and seek better examples elsewhere.

Using predict on new text with kmeans (sklearn)?

I have a very small list of short strings which I want to (1) cluster and (2) use that model to predict which cluster a new string belongs to.
Running the first part works fine; getting a prediction for the new string does not.
First Part
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
# List of short documents (dictionary definitions)
documents_lst = ['a small, narrow river',
'a continuous flow of liquid, air, or gas',
'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
'a group in which schoolchildren of the same age and ability are taught',
'(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
'put (schoolchildren) in groups of the same age and ability to be taught together',
'a natural body of running water flowing on or under the earth']
# 1. Vectorize the text
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
# 2. Get the number of clusters to make .. (find a better way than random)
num_clusters = 3
# 3. Cluster the definitions
km = KMeans(n_clusters=num_clusters, init='k-means++').fit(tfidf_matrix)
clusters = km.labels_.tolist()
print(clusters)
Which returns:
tfidf_matrix.shape: (8, 39)
[0, 1, 0, 2, 1, 0, 2, 0]
Second Part
The failing part:
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(predict_doc)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
km.predict(tfidf_matrix)
The error:
ValueError: Incorrect number of features. Got 7 features, expected 39
FWIW: I somewhat understand that the training data and the prediction data end up with a different number of features after vectorizing ...
I am open to any solution including changing from kmeans to an algorithm more suitable for short text clustering.
Thanks in advance
For completeness, I will answer my own question with an answer from here, which doesn't answer that question but does answer mine:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
list1 = ["My name is xyz", "My name is pqr", "I work in abc"]
list2 = ["My name is xyz", "I work in abc"]
vectorizer = TfidfVectorizer(max_df=0.5, stop_words="english", ngram_range=(1, 3))
vec = vectorizer.fit(list1)  # train vec using list1
vectorized = vec.transform(list1)  # transform list1 using vec
km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=1000, tol=0.0001, verbose=0, random_state=None)
km.fit(vectorized)
list2Vec = vec.transform(list2)  # transform list2 using vec
km.predict(list2Vec)
The credit goes to @IrshadBhat
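As a side note, the fitted vectorizer and the clusterer can also be kept together in an sklearn Pipeline, so that predict() accepts raw strings directly; a minimal sketch along the lines of the code above:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
list1 = ["My name is xyz", "My name is pqr", "I work in abc"]
list2 = ["My name is xyz", "I work in abc"]
# fit the vectorizer and KMeans together; the pipeline then vectorizes new text for you
model = make_pipeline(TfidfVectorizer(stop_words="english"), KMeans(n_clusters=2, n_init=10, random_state=0))
model.fit(list1)
print(model.predict(list2))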
