How does TfidfVectorizer compute scores on test data - scikit-learn

In scikit-learn TfidfVectorizer allows us to fit over training data, and later use the same vectorizer to transform over our test data.
The output of the transformation over the train data is a matrix that represents a tf-idf score for each word for a given document.
However, how does the fitted vectorizer compute the score for new inputs? I have guessed that either:
The score of a word in a new document is computed by some aggregation of the scores of the same word over documents in the training set.
The new document is 'added' to the existing corpus and new scores are calculated.
I have tried deducing the operation from scikit-learn's source code but could not quite figure it out. Is it one of the options I've previously mentioned or something else entirely?
Please assist.

It is definitely the former: each word's idf (inverse document frequency) is calculated from the training documents only. This makes sense because these values are precisely the ones that are calculated when you call fit on your vectorizer. If the second option you describe were true, we would essentially refit the vectorizer each time, and we would also cause an information leak, as idf values from the test set would be used during model evaluation.
Beyond these purely conceptual explanations, you can also run the following code to convince yourself:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
x_train = ["We love apples", "We really love bananas"]
vect.fit(x_train)
print(vect.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0
>>> ['apples' 'bananas' 'love' 'really' 'we']
x_test = ["We really love pears"]
vectorized = vect.transform(x_test)
print(vectorized.toarray())
>>> array([[0. , 0. , 0.50154891, 0.70490949, 0.50154891]])
Following the reasoning of how the fit methodology works, you can recalculate these tfidf values yourself:
"apples" and "bananas" obviously have a tfidf score of 0 because they do not appear in x_test. "pears", on the other hand, does not exist in x_train and so will not even appear in the vectorization. Hence, only "love", "really" and "we" will have a tfidf score.
Scikit-learn implements tfidf as (log((1+n)/(1+df)) + 1) * f, where n is the number of documents in the training set (2 for us), df is the number of training documents in which the word appears, and f is the frequency count of the word in the test document. Hence:
import numpy as np
tfidf_love = (np.log((1+2)/(1+2))+1)*1
tfidf_really = (np.log((1+2)/(1+1))+1)*1
tfidf_we = (np.log((1+2)/(1+2))+1)*1
You then need to scale these tfidf scores by the L2 norm of the document's score vector:
tfidf_non_scaled = np.array([tfidf_love,tfidf_really,tfidf_we])
tfidf_list = tfidf_non_scaled/sum(tfidf_non_scaled**2)**0.5
print(tfidf_list)
>>> [0.50154891 0.70490949 0.50154891]
You can see that indeed, we are getting the same values, which confirms the way scikit-learn implemented this methodology.
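The same point can be checked directly on the fitted vectorizer. The sketch below assumes the default TfidfVectorizer settings (smooth_idf=True, norm="l2") and a scikit-learn version that provides idf_ and vocabulary_; the document frequencies are typed in by hand from x_train. It compares the stored idf values before and after transform and against the formula above:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

x_train = ["We love apples", "We really love bananas"]
x_test = ["We really love pears"]

vect = TfidfVectorizer()
vect.fit(x_train)

# Snapshot the learned idf values, transform the test data, then snapshot again
idf_before = vect.idf_.copy()
vectorized = vect.transform(x_test).toarray()
idf_after = vect.idf_.copy()

# Recompute idf by hand from training document frequencies only.
# Vocabulary order is alphabetical: apples, bananas, love, really, we
n_docs = len(x_train)
doc_freq = np.array([1, 1, 2, 1, 2])
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print("idf unchanged by transform:", np.allclose(idf_before, idf_after))
print("idf matches the formula:", np.allclose(idf_before, manual_idf))
print("'pears' in vocabulary:", "pears" in vect.vocabulary_)
```

If the test document were being added to the corpus, the idf values would shift after transform and "pears" would show up in the vocabulary.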

Related

NLP: Get opinionated terms that correspond to aspect terms

I want to extract the sentiment expression that goes along with an aspect term in a sentence. I have the following code:
import spacy
nlp = spacy.load("en_core_web_lg")
def find_sentiment(doc):
    # find roots of all entities in the text
    ner_heads = {ent.root.idx: ent for ent in doc.ents}
    rule3_pairs = []
    for token in doc:
        children = token.children
        A = "999999"
        M = "999999"
        add_neg_pfx = False
        for child in children:
            if(child.dep_ in ["nsubj"] and not child.is_stop): # nsubj is nominal subject
                if child.idx in ner_heads:
                    A = ner_heads[child.idx].text
                else:
                    A = child.text
            if(child.dep_ in ["acomp", "advcl"] and not child.is_stop): # acomp is adjectival complement
                M = child.text
            # example - 'this could have been better' -> (this, not better)
            if(child.dep_ == "aux" and child.tag_ == "MD"): # MD is modal auxiliary
                neg_prefix = "not"
                add_neg_pfx = True
            if(child.dep_ == "neg"): # neg is negation
                neg_prefix = child.text
                add_neg_pfx = True
            # print(child, child.dep_)
        if (add_neg_pfx and M != "999999"):
            M = neg_prefix + " " + M
        if(A != "999999" and M != "999999"):
            rule3_pairs.append((A, M))
    return rule3_pairs
print(find_sentiment(nlp('NEW DELHI Refined soya oil remained weak for the second day and prices shed 0.56 per cent to Rs 682.50 per 10 kg in futures market today as speculators reduced positions following sluggish demand in the spot market against adequate stocks position.')))
Which gets me the output: [('oil', 'weak'), ('prices', 'reduced')]
But this captures too little of the content of the text.
I want to know if it is possible to get an output like: [('oil', 'weak'), ('prices', 'shed 0.56 percent'), ('demand', 'sluggish')]
Is there any approach you recommend trying?
I tried the code given above. I also tried another library, stanza, which only got similar results.
Unfortunately, if your task is to extract every sentiment-bearing word from a text, that is not reliably possible with the current state of the art. Language is highly variable, and the same word can change its sentiment and meaning from sentence to sentence. Words like "awful" are easy to classify as negative, but "demand" in your text is far less obvious. There are also edge cases where a seemingly positive word such as "incredible" acts as an intensifier: "incredibly stupid" should be classified as very negative, yet a word-level classifier normally assigns each word one fixed polarity.
This is why, for sentiment analysis, the only reliable way is to build a machine learning model that classifies a text as a whole, which means you should adapt your software to accept that overall verdict and process it from there.
Naive Bayes Classifier
The simplest way to classify text by sentiment is the Naive Bayes classifier (which is not limited to sentiment classification). It is implemented in NLTK:
from nltk import NaiveBayesClassifier, classify
#dataset is a list of (features, label) pairs, where features is a dict such as {word: True}.
train_data = dataset[:7000]
test_data = dataset[7000:]
#Train method returns the trained model.
classifier = NaiveBayesClassifier.train(train_data)
#To get accuracy, use classify.accuracy method:
print("Accuracy is:", classify.accuracy(classifier, test_data))
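The snippet above leaves dataset undefined. As a minimal sketch of the shape NLTK expects, the toy example below builds a tiny hand-written dataset; the sentences, the labels, and the to_features helper are all made up for illustration and are far too small to produce a useful model:

```python
from nltk import NaiveBayesClassifier, classify


def to_features(words):
    # NLTK feature sets are plain dicts; here every word present maps to True
    return {word: True for word in words}


# Hypothetical toy data: (features, label) pairs
dataset = [
    (to_features(["great", "movie"]), "positive"),
    (to_features(["wonderful", "acting"]), "positive"),
    (to_features(["great", "story"]), "positive"),
    (to_features(["terrible", "movie"]), "negative"),
    (to_features(["awful", "acting"]), "negative"),
    (to_features(["terrible", "story"]), "negative"),
    (to_features(["wonderful", "story"]), "positive"),
    (to_features(["awful", "movie"]), "negative"),
]

# Same split idea as above, scaled down to the toy data
train_data = dataset[:6]
test_data = dataset[6:]

classifier = NaiveBayesClassifier.train(train_data)
toy_accuracy = classify.accuracy(classifier, test_data)
prediction = classifier.classify(to_features(["terrible", "day"]))

print("Accuracy is:", toy_accuracy)
print("Predicted label:", prediction)
```

Words the classifier never saw during training (such as "day" here) are ignored at prediction time, so the quality of the result depends entirely on how well the training vocabulary covers the input.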
In order to make a prediction, we need to pass the words of the text as a feature dictionary. It's preferable to remove words that carry no sentiment, such as stop words and punctuation, so that they don't distract the model:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
#Requires the NLTK stopwords corpus and punkt tokenizer data (see nltk.download)
def clearLexemes(words):
    #Keep only tokens that are neither stop words nor pure punctuation
    stop_words = set(stopwords.words("english"))
    return [word for word in words
            if word.lower() not in stop_words
            and not all(char in string.punctuation for char in word)]
text = "What a terrible day!"
tokens = clearLexemes(word_tokenize(text))
print("Text sentiment is " + str(classifier.classify(dict([token, True] for token in tokens))))
The output will be the sentiment of the text.
Some important notes on this approach:
it requires few parameters and trains relatively fast;
it is highly efficient for working with natural languages (it is also used for gender identification and named entity recognition);
it is unlikely to properly classify edge cases where words shift their sentiment in creatively styled or rare utterances. For example, "Sweetheart, I wish all of your fears would come true and you will be happy to live in such world!" This sentence is negative and uses irony to mask a negative attitude behind positive expressions, and the model may not be able to detect this.
Logistic Regression
Another related method is to use a logistic regression classifier from your favourite machine learning framework. In this notebook I used the Amazon food review dataset to measure how fast model accuracy increases as you feed it more and more data. The data you need to feed the model is the raw text and its score label (which in your case could be sentiment).
import numpy as np #For converting values to strings
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, classification_report
#Preparing the data (reviews is the DataFrame holding the review dataset)
ys: pd.DataFrame = reviews.head(170536) #30% of the dataframe is test data
xs: pd.DataFrame = reviews[170536:] #70% of the dataframe is training data
#Training the model
lr = LogisticRegression(max_iter=1000)
cv = CountVectorizer(token_pattern=r'\b\w+\b')
train = cv.fit_transform(xs["Summary"].apply(lambda x: np.str_(x)))
test = cv.transform(ys["Summary"].apply(lambda x: np.str_(x)))
lr.fit(train, xs["Score"])
#Measuring per-class precision (classification_report expects y_true first, then y_pred):
predictions = lr.predict(test)
labels = ["x1", "x2", "x3", "x4", "x5"]
report = classification_report(ys["Score"], predictions,
                               target_names=labels, output_dict=True)
precisions = [report[label]["precision"] for label in labels]
print(precisions)
Conclusion
Sentiment analysis is a worthwhile area of academic and industrial research that relies heavily on machine learning and is bound by its limitations. It is a powerful topic that belongs in the classical NLP toolkit. Unfortunately, understanding language well enough to extract situational meaning is currently a feat close to inventing Artificial General Intelligence, although the technology is moving rapidly in that direction.

Are TF-IDF scores for a single term combined?

I am reading up about TF-IDF so that I can filter out common words from my corpus. It appears to me that you get a TF-IDF score for each word, document pair.
Which score do you pay attention to? Do you combine the scores across all documents for a word?
TFIDF ex:
doc1 = "This is doc1"
doc2 = "This is a different document"
corpus = [doc1, doc2]
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
X.toarray()
return: array([[0. , 0.70490949, 0. , 0.50154891, 0.50154891],
[0.57615236, 0. , 0.57615236, 0.40993715, 0.40993715]])
vec.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
So you have one row (a 1-D array) for each document in the corpus, and each row has one entry per term in the corpus vocabulary, so the matrix can get quite sparse. Which score you pay attention to depends on what you're doing. To find the most important word in a document, look for the highest TF-IDF value in that document's row. To find the most important word in the corpus, look across the entire array. If you're trying to identify stop words, you could take the X words with the lowest TF-IDF scores. However, I wouldn't really recommend using TF-IDF to find stop words in the first place: it lowers the weight of stop words, but they still occur frequently, which can offset the lower weight. You'd probably be better off finding the most common words and filtering those out. Either way, you'd want to review the resulting set manually.

How to Select Top 1000 words using TF-IDF Vector?

I have a document set with 5000 reviews, stored in sample_data. I am applying a tf-idf vectorizer to sample_data with an n-gram range of one. Now I want to get the top 1000 words from sample_data that have the highest tf-idf values. Could anyone tell me how to get the top words?
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer(ngram_range=(1,1))
tf_idf_vect.fit(sample_data)
final_tf_idf = tf_idf_vect.transform(sample_data)
TF-IDF values depend on individual documents. You can get top 1000 terms based on their count (Tf) by using the max_features parameter of TfidfVectorizer:
max_features : int or None, default=None
If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.
Just do:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,1), max_features=1000)
You can even get the 'idf' (global term weights) from the tf_idf_vect after fitting (learning) of documents by using idf_ attribute:
idf_ : array, shape = [n_features], or None
The learned idf vector (global term weights) when use_idf is set to True,
Do this after calling tf_idf_vect.fit(sample_data):
idf = tf_idf_vect.idf_
And then select the top 1000 from them and re-fit the data based on those selected features.
But you cannot get the top 1000 by "tf-idf" directly, because a tf-idf value is the product of a term's tf in a single document and the term's global idf. A word that appears twice in one document gets twice the tf-idf (before normalization) of the same word appearing once in another document. The same term therefore has different values in different documents, so there is no single value to rank it by. Hope this makes it clear.

Default value in Svm prediction Scikitlearn

I am using scikitlearn for svm classification.
I need a classifier that returns default value when a given test item doesn't match any of the training-set items, i.e. when the distance is very high. Is that possible?
For Example
Let's say my training-set is
X= [[0.5,0.5,2],[4, 4,16],[16, 16,64]]
and labels
y=[0,1,2]
then I run training
from sklearn import svm
clf = svm.SVC()
clf.fit(X, y)
then I run prediction
clf.predict([[-100,-100,-200]])
Now, as we can see, the test item [-100,-100,-200] is too far away from any of the training items. In this case the prediction yields [2], which corresponds to the item [16, 16,64]. Is there any way to make it return something else (a value not from the training set)?
I think you can create an extra label for those far-away values and add it to your training set:
X= [[0.5,0.5,2],[4, 4,16],[16, 16,64],[-100,-100,200]]
Y=[0,1,2,100]
and give it a try.
SVM is a supervised learning method, which means the possible outputs have to be specified up front. If you are not certain about the outputs, do some unsupervised clustering (k-means, for example) to get a rough idea of how many distinct outputs to expect.
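A different option, not part of the answer above, is to leave the classifier alone and check the distance to the training data before trusting its prediction. The sketch below is only one way to do that: the predict_with_default helper, the DEFAULT_LABEL value, and the MAX_DISTANCE threshold are all hypothetical and would need tuning for real data.

```python
import numpy as np
from sklearn import svm

X = np.array([[0.5, 0.5, 2], [4, 4, 16], [16, 16, 64]])
y = np.array([0, 1, 2])

clf = svm.SVC()
clf.fit(X, y)

DEFAULT_LABEL = -1    # hypothetical "no match" value
MAX_DISTANCE = 50.0   # hypothetical threshold, chosen by eye for this toy data


def predict_with_default(samples):
    """Predict with the SVM, but return DEFAULT_LABEL for far-away samples."""
    samples = np.atleast_2d(np.asarray(samples, dtype=float))
    # Euclidean distance from each sample to its nearest training point
    diffs = samples[:, None, :] - X[None, :, :]
    nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
    labels = clf.predict(samples)
    return np.where(nearest > MAX_DISTANCE, DEFAULT_LABEL, labels)


far_result = predict_with_default([-100, -100, -200])
near_result = predict_with_default([4, 4, 16])
print(far_result, near_result)
```

This keeps the label set unchanged and makes the "too far" rule explicit, at the cost of an extra distance computation per prediction.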

how to get the most representative features in the following tfidf model?

Hello I have the following list:
listComments = ["comment1","comment2","comment3",...,"commentN"]
I created a tfidf vectorizer to get a model from my comments as follows:
tfidf_vectorizer = TfidfVectorizer(min_df=10,ngram_range=(1,3),analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(listComments)
Now, in order to understand more about my model, I would like to get the most representative features. I tried:
print("these are the features :",tfidf_vectorizer.get_feature_names_out())
print("the vocabulary :",tfidf_vectorizer.vocabulary_)
and this is giving me a list of words that I think that my model is using for the vectorization:
these are the features : ['10', '10 days', 'red', 'car',...]
the vocabulary : {'edge': 86, 'local': 96, 'machine': 2,...}
However, I would like to find a way to get the 30 most representative features, that is, the words that achieve the highest values in my tfidf model (the words with the highest inverse document frequency). I was reading the documentation but could not find a method for this. I would really appreciate help with this issue. Thanks in advance.
If you want to get a list of the vocabulary with respect to idf scores you can use the idf_ attribute and argsort it.
import numpy as np
# create an array of feature names (get_feature_names() in scikit-learn < 1.0)
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
# get order
idf_order = tfidf_vectorizer.idf_.argsort()[::-1]
# produce sorted idf word
feature_names[idf_order]
If you would like to get a sorted list of tfidf scores for each document, you would do a similar thing, sorting within each row.
# get order for all documents based on tfidf scores (descending within each row)
tfidf_order = tfidf.toarray().argsort(axis=1)[:, ::-1]
# produce words, one row of sorted terms per document
feature_names[tfidf_order]

Resources