TfidfVectorizer does not eliminate words that occur in more than one document - scikit-learn

I have a dataset that I'm trying to cluster. Although I set min_df and max_df in the TfidfVectorizer, the output MiniBatchKMeans returns contains words that, according to my reading of the documentation, the vectorizer should eliminate because they are present in at least one other document (max_df=1.).
The tfidf settings:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

min_df = 5
max_df = 1.

# Corpus is in English
vectorizer = TfidfVectorizer(stop_words='english', min_df=min_df,
                             max_df=max_df, max_features=100000)
c_vectorizer = CountVectorizer(stop_words='english', min_df=min_df,
                               max_df=max_df, max_features=100000)

X = vectorizer.fit_transform(dataset)
C_X = c_vectorizer.fit_transform(dataset)
The output of MiniBatchKMeans:
Topic0: information book history read good great lot author write
useful use recommend need time make know provide like easy
excellent just learn look work want help reference buy guide
interested
Topic1: book read good great use make write buy time work like
just recommend know look year need author want think help new life
way love people really excellent easy say
Topic2: story novel character book life read love time write make
like reader great end woman world good man work plot way people
just family know come young author think year
As you can see, "book" appears in all 3 topics. With max_df=1., shouldn't it be removed?

From the TfidfVectorizer documentation:
max_df: float or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float in range [0.0, 1.0], the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
So the max_df in the question is set to the default value.
You probably want something like: "Remove words that occur in more than 99% of documents":
from sklearn.feature_extraction.text import TfidfVectorizer

raw_data = [
    "books cats coffee",
    "books cats",
    "books and coffee and coffee",
    "books and words and coffee",
]

tfidf = TfidfVectorizer(stop_words="english", max_df=0.99)
X = tfidf.fit_transform(raw_data)
print(tfidf.get_feature_names_out())
print(X.todense())
['cats' 'coffee' 'words']
[[0.77722116 0.62922751 0. ]
[1. 0. 0. ]
[0. 1. 0. ]
[0. 0.53802897 0.84292635]]
In the output above, "books" is gone because it appears in 100% of the documents, which is more than 99%. If you really do want to remove every word that is present in at least one other document, use an integer max_df=1, as in this CountVectorizer example:
from sklearn.feature_extraction.text import CountVectorizer

raw_data = [
    "unique books cats coffee",
    "case books cats",
    "for books and words coffee and coffee",
    "each books and words and coffee",
]

cv = CountVectorizer(max_df=1)  # integer max_df: keep only terms that appear in a single document
X = cv.fit_transform(raw_data)
print(cv.get_feature_names_out())
print(X.todense())
['case' 'each' 'for' 'unique']
[[0 0 0 1]
[1 0 0 0]
[0 0 1 0]
[0 1 0 0]]
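Back in the question's setup, you can confirm why "book" survives the float max_df=1. by inspecting each term's document frequency (a small sketch reusing C_X and c_vectorizer from the question):
import numpy as np

# Document frequency of every term kept by the question's CountVectorizer
df_counts = np.asarray(C_X.astype(bool).sum(axis=0)).ravel()
terms = c_vectorizer.get_feature_names_out()

# Most widespread terms first; "book" will be near the top, but its document
# frequency is still <= 100%, so the float max_df=1. keeps it in the vocabulary
for count, term in sorted(zip(df_counts, terms), reverse=True)[:10]:
    print(term, count, count / C_X.shape[0])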

Related

Why are there rows with all values 0 in the embedding matrix?

I created word embedding vectors for sentiment analysis, but I'm not sure about the code I wrote. If you see any mistakes in how I create the Word2Vec model or the embedding matrix, please let me know.
import os
import gensim
import numpy as np

EMBEDDING_DIM = 100

review_lines = [sub.split() for sub in reviews]

# gensim 3.x API (in gensim 4.x, `size` is `vector_size` and `model.wv.vocab` is `model.wv.key_to_index`)
model = gensim.models.Word2Vec(sentences=review_lines, size=EMBEDDING_DIM,
                               window=6, workers=6, min_count=3, sg=1)
print('Words close to the given word:', model.wv.most_similar('film'))

words = list(model.wv.vocab)
print('Words:', words)

file_name = 'embedding_word2vec.txt'
model.wv.save_word2vec_format(file_name, binary=False)

# Parse the saved text-format vectors back into a dict
embeddings_index = {}
f = open(os.path.join('', 'embedding_word2vec.txt'), encoding="utf-8")
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print("Number of word vectors found:", len(embeddings_index))

# Rows stay all-zero for any word that has no vector in embeddings_index
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
OUTPUT:
array([[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ],
[ 0.1029947 , 0.07595579, -0.06583303, ..., 0.10382118,
-0.56950015, -0.17402627],
[ 0.13758609, 0.05489254, 0.0969701 , ..., 0.18532865,
-0.49845088, -0.23407038],
...,
[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ]])
It's likely the zero rows are there because you initialized the embedding_matrix with all zeros, but then your loop didn't replace those zeros for every row.
If any of the words in word_index aren't in the embeddings_index dict you've built (or in the model before that), that would be the expected result.
Note that while the saved word-vector format isn't very complicated, you still don't need to write your own code to parse it back in. The KeyedVectors.load_word2vec_format() method will work for that, giving you an object that allows dict-like access to each vector by its word key. (And the vectors are stored in a dense array, so it's a bit more memory-efficient than a true dict with a separate ndarray vector as each value.)
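For example, a minimal sketch (reusing word_index and EMBEDDING_DIM from the question) that loads the saved vectors with KeyedVectors and only fills rows for words the model actually learned:
import numpy as np
from gensim.models import KeyedVectors

# Load the text-format vectors written by model.wv.save_word2vec_format(...)
kv = KeyedVectors.load_word2vec_format('embedding_word2vec.txt', binary=False)

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
missing = []
for word, i in word_index.items():
    if word in kv:                       # only words the Word2Vec model kept
        embedding_matrix[i] = kv[word]
    else:
        missing.append(word)             # these rows stay all zeros

print("Words without a trained vector:", len(missing))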
There would still be the issue of your word_index listing words that weren't trained by the model. Perhaps they weren't in your training texts, or didn't appear at least min_count (default: 5) times, as required for the model to take notice of them. (You could consider lowering min_count, but note that it's usually a good idea to discard such very-rare words - they wouldn't have created very good vectors from few examples, and even including such thinly-represented words can worsen surrounding words' vectors.)
If you absolutely need vectors for words not in your training data, the FastText variant of the word2vec algorithm can, in languages where similar words often share similar character-runs, offer synthesized vectors for unknown words that are somewhat better than random/null vectors for most downstream applications. But you really should prefer to have adequate real examples of each interesting word's usage in varying contexts.
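A rough sketch of the FastText route (argument names here assume gensim 4.x, where size became vector_size; review_lines is the tokenized corpus from the question):
from gensim.models import FastText

ft = FastText(sentences=review_lines, vector_size=EMBEDDING_DIM,
              window=6, min_count=3, sg=1, workers=6)

# Unlike plain Word2Vec, FastText can synthesize a vector for an
# out-of-vocabulary word from its character n-grams:
vec = ft.wv['filmmaking']   # works even if 'filmmaking' never appeared in training
print(vec.shape)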

sklearn Vectorizer (NLP task) : Generating Custom NGrams which are capable of scaling up for n >= 3

I would like to build a vectorizer in sklearn which can scale up to higher values of n. Here n is the number of different words considered as a single vocabulary element.
My idea is that for n = 1 and n = 2 my custom vectorizer remains the same as the sklearn vectorizers, but for n >= 3 I would like to replace "I am good" and "Harry will play" with "I x good" and "Harry x play".
Example: let's say I want to build a vectorizer which scales up to n = 4, and take the example sentence "Harry will play tomorrow".
Then "Harry will play tomorrow" breaks down into:
all vocab words of length 1 and 2, plus "Harry x play", "will x tomorrow" and "Harry x x tomorrow".
Since this vocabulary is of the same order of size as the one for n = 2, and terms of the form "A x B" will not be any rarer than "A B", I believe this model may scale better and give performance benefits.
I searched the net for a method to do this, and while there are many tutorials on building custom vectorizers, all of them end up using the pre-implemented n-gram method.
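One possible approach (a sketch, not from the thread): pass a custom callable as the analyzer to CountVectorizer or TfidfVectorizer, emitting ordinary unigrams and bigrams plus "gapped" features such as "harry x play" for n >= 3:
from sklearn.feature_extraction.text import CountVectorizer

def gapped_ngrams(doc, max_n=4):
    """Unigrams, bigrams, and for n >= 3 only the two end words with 'x' placeholders."""
    tokens = doc.lower().split()
    feats = list(tokens)                                      # n = 1
    feats += [' '.join(p) for p in zip(tokens, tokens[1:])]   # n = 2
    for n in range(3, max_n + 1):                             # n >= 3
        for i in range(len(tokens) - n + 1):
            feats.append(tokens[i] + ' ' + 'x ' * (n - 2) + tokens[i + n - 1])
    return feats

vectorizer = CountVectorizer(analyzer=gapped_ngrams)
X = vectorizer.fit_transform(["Harry will play tomorrow"])
print(vectorizer.get_feature_names_out())
# ['harry', 'harry will', 'harry x play', 'harry x x tomorrow', 'play', ...]
Because the gapped features only keep the two end words, the vocabulary for n >= 3 grows like the bigram vocabulary rather than like full n-grams.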

The sum of all bigrams that start with a particular word must be equal to the unigram count for that word?

In this chapter (at the end of page 4) of a Stanford book on natural language processing it says:
The sum of all bigrams that start with a particular word must be equal to the unigram count for that word.
The book leaves why this is the case as an exercise to the reader but I don't understand why this would be true.
For example in the following corpus:
A ball is red.
All balls are red.
As far as I understand, the word "red" would occur twice as a unigram, but would not appear at all as the first element of a bigram?
Am I missing something, such as the end of sentence being used as a token? (But then doesn't the problem reappear for trigrams?)
If you read note 2 at the end of page 4 of the cited document, it states that:
We need the end-symbol to make the bigram grammar a true probability distribution.
Reading the page in question, it also appears clear that the author is grouping into n-grams while taking into account start and end symbols for each sentence, written <s> and </s> respectively.
If you compute the unigram and bigram distributions for the two sentences you provided as an example, you should thus first add the start and end symbols, then group into unigrams and bigrams, and then verify whether len(bigram[red]) == len(unigram[red]).
If you tokenize with the regex \w+ and add the start and end symbols the author suggests, the two example sentences are tokenized as follows:
'<s>', 'A', 'ball', 'is', 'red', '</s>'
'<s>', 'All', 'balls', 'are', 'red', '</s>'
The bigrams that start with 'red' are ('red', '</s>') in sentence 1 and again ('red', '</s>') in sentence 2, for two bigrams in total. The unigrams that contain 'red' are ('red') in sentence 1 and again ('red') in sentence 2, for two unigrams in total.
The total number of unigrams containing 'red' thus coincides with the number of bigrams whose first element is 'red'.
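A quick way to verify this numerically (a small sketch, adding the start/end symbols by hand):
from collections import Counter

sentences = [['<s>', 'a', 'ball', 'is', 'red', '</s>'],
             ['<s>', 'all', 'balls', 'are', 'red', '</s>']]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(b for s in sentences for b in zip(s, s[1:]))

# Total count of bigrams whose first element is 'red'
red_bigrams = sum(c for (w1, _), c in bigrams.items() if w1 == 'red')
print(red_bigrams, unigrams['red'])   # prints: 2 2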
You are right. The original claim is not exactly right unless you include bigrams like <'red', '.'> or <'red', '_EOS_'>.

Understanding Data Leakage and getting perfect score by exploiting test data

I have read an article on data leakage. In a hackathon there are two sets of data: train data on which participants train their algorithm, and a test set on which performance is measured.
Data leakage makes it possible to get a perfect score on the test data, without even looking at the train data, by exploiting the leak.
I have read the article, but I am missing the crux of how the leakage is exploited.
The steps shown in the article are the following:
Let's load the test data.
Note that we don't have any training data here, just test data. Moreover, we will not even use any features of the test objects. All we need to solve this task is the file with the indices of the pairs that we need to compare.
Let's load the data with test indices.
import numpy as np
import pandas as pd

test = pd.read_csv('../test_pairs.csv')
test.head(10)
pairId FirstId SecondId
0 0 1427 8053
1 1 17044 7681
2 2 19237 20966
3 3 8005 20765
4 4 16837 599
5 5 3657 12504
6 6 2836 7582
7 7 6136 6111
8 8 23295 9817
9 9 6621 7672
test.shape[0]
368550
For example, we can think that there is a test dataset of images, and each image is assigned a unique Id from 0 to N−1 (N -- is the number of images). In the dataframe from above FirstId and SecondId point to these Id's and define pairs, that we should compare: e.g. do both images in the pair belong to the same class or not. So, for example for the first row: if images with Id=1427 and Id=8053 belong to the same class, we should predict 1, and 0 otherwise.
But in our case we don't really care about the images, and how exactly we compare the images (as long as comparator is binary).
print(test['FirstId'].nunique())
print(test['SecondId'].nunique())
26325
26310
So the number of pairs we are given to classify is very very small compared to the total number of pairs.
To exploit the leak we need to assume (or prove) that the total number of positive pairs is small compared to the total number of pairs. For example: think about an image dataset with 1000 classes and N images per class. Then if the task was to tell whether a pair of images belongs to the same class or not, we would have 1000*N*(N−1)/2 positive pairs, while the total number of pairs is 1000*N*(1000*N−1)/2.
Another example: in the Quora competition the task was to classify whether a pair of questions are duplicates of each other or not. Of course, the total number of question pairs is huge, while the number of duplicates (positive pairs) is much smaller.
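To make that concrete, a quick back-of-the-envelope check (hypothetical numbers, say N = 100 images per class):
# Fraction of positive pairs when sampling pairs uniformly at random
C, N = 1000, 100                       # 1000 classes, 100 images per class
positives = C * N * (N - 1) / 2        # same-class pairs
total = C * N * (C * N - 1) / 2        # all possible pairs
print(positives / total)               # ~0.00099, i.e. about 0.1% positives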
Finally, let's get the fraction of pairs of class 1. We just need to submit a constant "all ones" prediction and check the returned accuracy. Create a dataframe with columns pairId and Prediction, fill it, export it to a .csv file, and submit it.
test['Prediction'] = np.ones(test.shape[0])
sub=pd.DataFrame(test[['pairId','Prediction']])
sub.to_csv('sub.csv',index=False)
The all-ones submission gets an accuracy score of 0.500000.
So, we assumed the total number of pairs is much higher than the number of positive pairs, but it is not the case for the test set. It means that the test set is constructed not by sampling random pairs, but with a specific sampling algorithm. Pairs of class 1 are oversampled.
Now think: how can we exploit this fact? What is the leak here? If you get it now, you may try to get to the final answer yourself, otherwise you can follow the instructions below.
Building a magic feature
In this section we will build a magic feature, that will solve the problem almost perfectly. The instructions will lead you to the correct solution, but please, try to explain the purpose of the steps we do to yourself -- it is very important.
Incidence matrix
First, we need to build an incidence matrix. You can think of pairs (FirstId, SecondId) as of edges in an undirected graph.
The incidence matrix is a matrix of size (maxId + 1, maxId + 1), where each row (column) i corresponds to the i-th Id. In this matrix we put the value 1 at position [i, j] if and only if the pair (i, j) or (j, i) is present in the given set of pairs (FirstId, SecondId). All other elements of the incidence matrix are zeros.
Important! Incidence matrices are typically very sparse (a small number of non-zero values). At the same time they are usually huge in terms of total number of elements, so it is impossible to store them in memory in dense format. But due to their sparsity they can easily be represented as sparse matrices. If you are not familiar with sparse matrices, please see the wiki and the scipy.sparse reference. Please use any of the scipy.sparse constructors to build the incidence matrix.
For example, you can use this constructor: scipy.sparse.coo_matrix((data, (i, j))). We highly recommend learning to use the different scipy.sparse constructors and matrix types, but if you feel you don't want to use them, you can always build this matrix with a simple for loop. You will first need to create a matrix using scipy.sparse.coo_matrix((M, N), [dtype]) with an appropriate shape (M, N), and then iterate through the (FirstId, SecondId) pairs and fill the corresponding elements of the matrix with ones.
Note, that the matrix should be symmetric and consist only of zeros and ones. It is a way to check yourself.
import networkx as nx
import numpy as np
import pandas as pd
import scipy.sparse
import matplotlib.pyplot as plt
test = pd.read_csv('../test_pairs.csv')
x = test[['FirstId','SecondId']].rename(columns={'FirstId':'col1', 'SecondId':'col2'})
y = test[['SecondId','FirstId']].rename(columns={'SecondId':'col1', 'FirstId':'col2'})
comb = pd.concat([x,y],ignore_index=True).drop_duplicates(keep='first')
comb.head()
col1 col2
0 1427 8053
1 17044 7681
2 19237 20966
3 8005 20765
4 16837 599
data = np.ones(comb.col1.shape, dtype=int)
inc_mat = scipy.sparse.coo_matrix((data, (comb.col1, comb.col2)),
                                  shape=(comb.col1.max() + 1, comb.col1.max() + 1))
inc_mat = inc_mat.tocsr()  # COO matrices do not support row indexing; CSR does

rows_FirstId = inc_mat[test.FirstId.values, :]
rows_SecondId = inc_mat[test.SecondId.values, :]

f = rows_FirstId.multiply(rows_SecondId)
f = np.asarray(f.sum(axis=1))
f.shape
(368550, 1)
f = f.sum(axis=1)
f = np.squeeze(np.asarray(f))
print(f.shape)
Now build the magic feature
Why did we build the incidence matrix? We can think of the rows of this matrix as representations of the objects: the i-th row is the representation of the object with Id = i. Then, to measure similarity between two objects, we can measure similarity between their representations. And we will see that such representations are very good.
Now select the rows from the incidence matrix, that correspond to test.FirstId's, and test.SecondId's.
So do not forget to convert pd.series to np.array
These lines should normally run very quickly
rows_FirstId = inc_mat[test.FirstId.values,:]
rows_SecondId = inc_mat[test.SecondId.values,:]
Our magic feature will be the dot product between representations of a pair of objects. Dot product can be regarded as similarity measure -- for our non-negative representations the dot product is close to 0 when the representations are different, and is huge, when representations are similar.
Now compute dot product between corresponding rows in rows_FirstId and rows_SecondId matrices.
From magic feature to binary predictions
But how do we convert this feature into binary predictions? We do not have a train set to learn a model, but we have one piece of information about the test set: the baseline accuracy score that you got when submitting a constant. And we also have very strong considerations about the data generating process, so probably we will be fine even without a training set.
We may try to choose a threshold, and set the predictions to 1 if the feature value f is higher than the threshold, and 0 otherwise. What threshold would you choose?
How do we find a right threshold? Let's first examine this feature: print frequencies (or counts) of each value in the feature f.
For example, use the np.unique function and check its flags (e.g. return_counts=True).
Function to count frequency of each element
from scipy.stats import itemfreq
itemfreq(f)
array([[ 14, 183279],
[ 15, 852],
[ 19, 546],
[ 20, 183799],
[ 21, 6],
[ 28, 54],
[ 35, 14]])
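Note that scipy.stats.itemfreq is deprecated and has been removed in recent SciPy releases; np.unique gives the same value/count table:
import numpy as np

values, counts = np.unique(f, return_counts=True)
print(np.column_stack([values, counts]))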
Do you see how this feature clusters the pairs? Maybe you can guess a good threshold by looking at the values?
In fact, in other situations it can be not that obvious, but in general to pick a threshold you only need to remember the score of your baseline submission and use this information.
Choose a threshold below:
pred = f > 14 # SET THRESHOLD HERE
pred
array([ True, False, True, ..., False, False, False], dtype=bool)
submission = test.loc[:,['pairId']]
submission['Prediction'] = pred.astype(int)
submission.to_csv('submission.csv', index=False)
I want to understand the idea behind this: how are we exploiting the leak using the test data only?
There's a hint in the article. The number of positive pairs should be 1000*N*(N−1)/2, while the number of all pairs is 1000*N*(1000*N−1)/2. Of course, the number of all pairs would be much, much larger if the test set had been sampled at random.
As the author mentions, after you evaluate your constant prediction of 1s on the test set, you can tell that the sampling was not done at random. The accuracy you obtain is 50%. Had the sampling been done correctly, this value should've been much lower.
Thus, they construct the incidence matrix and calculate the dot product (the measure of similarity) between the representations of our ID features. They then reuse the information about the accuracy obtained with constant predictions (at 50%) to obtain the corresponding threshold (f > 14). It's set to be greater than 14 because that constitutes roughly half of our test set, which in turn maps back to the 50% accuracy.
The "magic" value didn't have to be greater than 14. It could have been equal to 14. You could have adjusted this value after some leader board probing (as long as you're capturing half of the test set).
It was observed that the test data was not sampled properly; same-class pairs were oversampled. Thus each pair in the test set has a much higher probability of having target=1 than any random pair. This led to the belief that one could construct a similarity measure based only on the pairs that are present in the test set, i.e. whether a pair made it into the test set is itself a strong indicator of similarity.
Using this insight, one can calculate an incidence matrix and represent each id j as a binary array (its i-th element indicating the presence of the pair (i, j) in the test set, and thus a high probability of similarity between them). This is a pretty accurate measure, allowing one to find the "similarity" between two rows just by taking their dot product.
The cutoff is arrived at purely from knowledge of the target distribution found by leaderboard probing.

Use features based on tf-idf score for text classification using Naive Bayes (sklearn)

I am learning to implement text classification (into two classes) using tfidf and naive bayes by referring to this blog and sklearn tfidf
below is the code snippet:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

kf = StratifiedKFold(n_splits=5)
totalNB = 0
totalMatNB = np.zeros((2, 2))

for train_index, test_index in kf.split(documents, labels):
    X_train = [documents[i] for i in train_index]
    X_test = [documents[i] for i in test_index]
    y_train, y_test = labels[train_index], labels[test_index]

    vectorizer = TfidfVectorizer(min_df=2, max_df=0.2, use_idf=True, stop_words=stop_words)
    train_corpus_tf_idf = vectorizer.fit_transform(X_train)
    test_corpus_tf_idf = vectorizer.transform(X_test)

    model2 = MultinomialNB()
    model2.fit(train_corpus_tf_idf, y_train)
    result2 = model2.predict(test_corpus_tf_idf)

    totalMatNB = totalMatNB + confusion_matrix(y_test, result2)
    totalNB = totalNB + sum(y_test == result2)
The above code is working as expected.
I have read the documentation, but I am still confused about min_df and max_df.
How can I use features for classification based on their tf-idf score, i.e. filter the features based on tf-idf score?
eg.
use the features whose tf-idf score is greater than x [ score(features) >x]
use the features whose tf-idf score between x and y [ y> score(features)>x ] or [ y>= score(features)>=x ]
When training the vectorizer, setting specific values for min_df and max_df is supposed to help you tweak the eventual tf-idf representation to best suit your needs by limiting the vocabulary. It also helps with reducing the dimension of the vector representation, which is usually a good thing since these representations tend to be huge.
Setting a high min_df value will remove relatively infrequent terms from the representation. If your eventual model is not supposed to care much about very rare terms, this is a good thing.
Setting a low max_df will remove relatively frequent terms from the representation. If your eventual model doesn't care about words that are used in many contexts (e.g. "the", "or", "and"), this is also a good thing. Note that "low" here can mean either a small integer (an absolute document count) or a float close to 0 (a proportion of documents).
Important note: your suggestion of filtering features after the fact based on their tf-idf weight is a totally different thing. Setting min_df and max_df when fitting the vectorizer limits the eventual vocabulary based on document frequency across the entire training sample, whereas the tf-idf weight in a given vector is a document-specific value (since it is also affected by the term frequency in that specific document).
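If you do want to filter columns by tf-idf weight after fitting, one rough way (a sketch, not from the original answer; x is a hypothetical threshold) is to keep only the columns whose maximum tf-idf weight over the training set exceeds x:
import numpy as np

x = 0.1  # hypothetical threshold on tf-idf weight

# train_corpus_tf_idf / test_corpus_tf_idf are the sparse matrices from the loop above
max_tfidf = np.asarray(train_corpus_tf_idf.max(axis=0).todense()).ravel()
keep = np.where(max_tfidf > x)[0]

train_filtered = train_corpus_tf_idf[:, keep]
test_filtered = test_corpus_tf_idf[:, keep]
You would then fit the MultinomialNB on train_filtered and evaluate on test_filtered exactly as before.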
Hope this helps!
