What is the use of 'max_features' in TfidfVectorizer - scikit-learn

What I have understood from it is that if max_features = n, it selects the top n features on the basis of their tf-idf values. I went through the documentation of TfidfVectorizer on scikit-learn but didn't understand it properly.

If you want the row-wise words with the highest tf-idf values, you need to access the transformed tf-idf matrix from the vectorizer, go through it row by row (doc by doc) and then sort the values to get them.
Something like this:
import numpy as np

# tfidf_vectorizer is your TfidfVectorizer() and text_data is your list of documents
# TfidfVectorizer will by default output a sparse matrix
tfidf_data = tfidf_vectorizer.fit_transform(text_data).tocsr()
vocab = np.array(tfidf_vectorizer.get_feature_names())
# Replace this with the number of top words you want to get in each row
top_n_words = 5
# Loop over all the docs present
for i in range(tfidf_data.shape[0]):
    doc = tfidf_data.getrow(i).toarray().ravel()
    sorted_index = np.argsort(doc)[::-1][:top_n_words]
    print(sorted_index)
    for word, tfidf in zip(vocab[sorted_index], doc[sorted_index]):
        print("%s - %f" % (word, tfidf))
If you can use pandas, then the logic becomes simpler:
import pandas as pd

for i in range(tfidf_data.shape[0]):
    doc_data = pd.DataFrame({'Tfidf': tfidf_data.getrow(i).toarray().ravel(),
                             'Word': vocab})
    doc_data.sort_values(by='Tfidf', ascending=False, inplace=True)
    print(doc_data.iloc[:top_n_words])
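As a side note on the original max_features question: the scikit-learn documentation describes max_features as keeping only the top n terms ordered by term frequency across the corpus, not by tf-idf value. Below is a minimal sketch to see which terms survive; the toy corpus and variable names are made up for illustration, and on older scikit-learn versions get_feature_names() replaces get_feature_names_out().

from sklearn.feature_extraction.text import TfidfVectorizer

text_data = ["the cat sat on the mat",      # hypothetical toy corpus
             "the dog sat on the log",
             "cats and dogs"]

limited = TfidfVectorizer(max_features=3).fit(text_data)
print(limited.get_feature_names_out())      # the 3 most frequent terms across the corpus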

Related

How to calculate TF-IDF values of noun documents excluding spaCy stop words?

I have a data frame df with text, cleaned_text, and nouns as column names. text and cleaned_text contain string documents; nouns is a list of nouns extracted from the cleaned_text column. df.shape = (1927, 3).
I am trying to calculate TF-IDF values for all documents within df for nouns only, excluding spaCy stop words.
What have I tried?
import spacy
from spacy.lang.en import English

nlp = spacy.load('en_core_web_sm')

# subclass to modify stop word lists, recommended from spaCy version 3.0 onwards
excluded_stop_words = {'down'}
included_stop_words = {'dear', 'regards'}

class CustomEnglishDefaults(English.Defaults):
    stop_words = English.Defaults.stop_words.copy()
    stop_words -= excluded_stop_words
    stop_words |= included_stop_words

class CustomEnglish(English):
    Defaults = CustomEnglishDefaults

# function to extract nouns from the cleaned_text column, excluding spaCy stopwords
nlp = CustomEnglish()

def nouns(text):
    doc = nlp(text)
    return [t for t in doc if t.pos_ in ['NOUN'] and not t.is_stop and not t.is_punct]

# calculate TF-IDF values for nouns, excluding spaCy stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

documents = df.cleaned_text
tfidf = TfidfVectorizer(stop_words=CustomEnglish)
X = tfidf.fit_transform(documents)
What am I expecting?
I am expecting the output to be a list of tuples ranked in descending order:
nouns = [('noun_1', tf-idf_1), ('noun_2', tf-idf_2), ...]. All nouns in nouns should match those in df.nouns (this is to check whether I am on the right track).
What is my issue?
I got confused about how to apply TfidfVectorizer so that it calculates TF-IDF values only for the nouns extracted from cleaned_text. I am also not sure whether scikit-learn's TfidfVectorizer can calculate TF-IDF the way I am expecting.
Not sure if you're still looking for a solution. Here is an option that you might want to go ahead with.
First of all, by default TF-IDF takes into account the entire set of words, not just nouns. Hence, you would need to implement a custom TF-IDF function to apply the results only to nouns. The following is a good reference on how TF-IDF works internally: https://www.askpython.com/python/examples/tf-idf-model-from-scratch
Instead of running the tf_idf function (as applied in the above URL) for all words of a sentence/document, you can just run it on the list of nouns you've extracted, i.e., just change the code from:
def tf_idf(sentence):
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = termfreq(sentence, word)
        idf = inverse_doc_freq(word)
        value = tf * idf
        tf_idf_vec[index_dict[word]] = value
    return tf_idf_vec
to:
def tf_idf(sentence, nouns):
    values = []
    for word in nouns:
        tf = termfreq(sentence, word)
        idf = inverse_doc_freq(word)
        value = tf * idf
        values.append(value)
    return values
You now have a "values" list corresponding to the list of "nouns" for each sentence. Hope this makes sense.
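Another option, not covered in the answer above, is to let scikit-learn's TfidfVectorizer do the tf-idf computation and restrict it to nouns by passing a custom analyzer (the analyzer parameter accepts a callable). A minimal sketch, assuming a noun-extraction function along the lines of the one in the question:

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load('en_core_web_sm')

def noun_analyzer(text):
    # keep only lowercased noun tokens, skipping stop words and punctuation
    return [t.text.lower() for t in nlp(text)
            if t.pos_ == 'NOUN' and not t.is_stop and not t.is_punct]

tfidf = TfidfVectorizer(analyzer=noun_analyzer)
X = tfidf.fit_transform(df.cleaned_text)    # df as defined in the question

The per-document (noun, tf-idf) pairs can then be read off the rows of X together with the vectorizer's feature names.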

Get total count of a word in corpus using CountVectorizer

I have a corpus of the following format:
corpus = ['text_1', 'text_2', ..., 'text_4280']
In total there are 90141 unique words.
For each word, I want to calculate the total number of times it appears in the corpus.
To do so, I used:
vectorizer = CountVectorizer(corpus)
Currently, the only way I am aware of doing this is by:
vectorizer.fit_transform()
However, this will create a (sparse) matrix with shape (4280, 90141). Does CountVectorizer have a more memory-efficient approach to get all the column sums of the document-term matrix?
You could use
vectorizer.fit_transform(corpus).toarray().sum(axis=0)
EDIT
My bad, you should just remove .toarray() from the above statement. I didn't realise that you could call .sum() on a sparse matrix.
vectorizer.fit_transform(corpus).sum(axis=0)
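If you also want the totals keyed by word, one way is to zip the column sums with the vocabulary. A minimal sketch (get_feature_names_out() requires a recent scikit-learn; older versions use get_feature_names()):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)            # sparse (n_docs, n_terms) matrix
totals = np.asarray(X.sum(axis=0)).ravel()      # column sums as a 1-D array
word_counts = dict(zip(vectorizer.get_feature_names_out(), totals))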

How can I easily change tf-idf similarity dataframe using apply

I'm using Python 3.
I am computing TF-IDF similarities and keeping results above 0.8.
But the for loop is too slow, because the similarity matrix has shape 51,336 x 51,336.
How can I create the dataframe faster without using a for statement?
It's taking 50 minutes now.
I want to make a dataframe like this.
[column_0],[column_1],[similarity]
index[0], column[0], value
index[0], column[1], value
index[0], column[2], value
....
index[100], column[51334], value
index[100], column[51335], value
index[100], column[51336], value
...
index[51336], column[51335], value
index[51336], column[51336], value
tfidf_matrix = tf.fit_transform(df['text'])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
indices = pd.Series(df.index, index=df['index_name'])
similarity = pd.DataFrame(columns=['column_0', 'column_1', 'similarity'])
for n in range(len(cosine_sim)):
    for i in list(enumerate(cosine_sim[n])):
        if i[1] > 0.8 and i[1] < 0.99:
            similarity = similarity.append({'column_0': indices.index[n],
                                            'column_1': indices.index[i[0]],
                                            'similarity': i[1]}, ignore_index=True)
If you are thinking of parallelizing the job, unfortunately there is no way to parallelize/distribute access to the vocabulary that these vectorizers need.
Hence you can choose the alternative hack for that: the HashingVectorizer.
The scikit-learn docs provide an example of using this vectorizer to train a classifier in batches:
https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
Hope this will help you.
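For what it's worth, here is a minimal sketch of the HashingVectorizer swap this answer refers to (df['text'] follows the question's code; note that hashing alone drops the idf weighting, so a TfidfTransformer can be chained after it if that matters):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Stateless: no fit step and no vocabulary kept in memory
hv = HashingVectorizer(n_features=2**18)       # rows are L2-normalised by default
hashed_matrix = hv.transform(df['text'])
cosine_sim = linear_kernel(hashed_matrix, hashed_matrix)   # cosine similarity, as before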

Include keywords dictionary in document classification using PySpark code

I am trying to perform a document classification using PySpark.
I am using the below steps for that:
# Tokenizer
tokenizer = Tokenizer().setInputCol("text").setOutputCol("words")

# Stop word removal
updated_stopwords_list = list(set(StopWordsRemover().getStopWords() + custom_stopword_list))
remover_custom = StopWordsRemover(inputCol="words", outputCol="filtered",
                                  stopWords=updated_stopwords_list)

# HashingTF
hashingTF = HashingTF().setNumFeatures(1000).setInputCol("filtered").setOutputCol("rawFeatures")

# IDF
idf = IDF().setInputCol("rawFeatures").setOutputCol("features").setMinDocFreq(0)

pipeline = Pipeline(stages=[tokenizer, remover_custom, hashingTF, idf])
And I am using it in a pipeline.
Now, after removing the stop words, I want to include a keyword dictionary (data dictionary) so that it selects from the array (the output of the stop word remover is an array of words) only the words which are present in that dictionary.
Can anyone please guide me on how to do this? I am reading the keyword dictionary from a CSV file.
If you're not required to use HashingTF, here is one option using CountVectorizer and forcing the vocabulary to be your keywords list:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from pyspark.ml.feature import CountVectorizer

# Prepare keyword list to go into CountVectorizer. Can also use Tokenizer if your keywords are only single words
str_to_arr_udf = udf(lambda s: [s], ArrayType(StringType()))

# Fit CountVectorizer to the keywords list so vocabulary = keywords
keyword_df = spark.read.format("csv").load(csv_file)
keyword_df = keyword_df.withColumn("filtered", str_to_arr_udf("keyword"))
cv = CountVectorizer(inputCol="filtered", outputCol="filtered_only_keywords", binary=True)
cvm = cv.fit(keyword_df)

# Transform the actual dataframe with the fitted model
cvm.transform(df_output_from_stopwords)
Otherwise the udf route is always an option. Something like:
keyword_list = [x.word for x in spark.read.load(file).collect()]
keep_words_udf = udf(lambda word_list: [word for word in word_list if word in keyword_list],
                     ArrayType(StringType()))
Assuming this keyword list does not contain any words in the StopWordsRemover list, the StopWordsRemover step is actually unnecessary.
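A usage sketch for the udf route, applying it to the output of the stop word remover (df_output_from_stopwords and the column names are assumptions carried over from the snippets above):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

keyword_list = [row.word for row in spark.read.load(file).collect()]
keep_words_udf = udf(lambda words: [w for w in words if w in keyword_list],
                     ArrayType(StringType()))

# keep only dictionary words in the array produced by StopWordsRemover
df_keywords = df_output_from_stopwords.withColumn("filtered", keep_words_udf("filtered"))
# HashingTF / IDF from the original pipeline can then run on "filtered" unchanged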

How to use the Scikit learn CountVectorizer?

I have a set of words for which I have to check whether they are present in the documents.
WordList = [w1, w2, ..., wn]
Another set has a list of documents in which I have to check whether these words are present or not.
How do I use scikit-learn's CountVectorizer so that the features of the term-document matrix are only the words from WordList, and each row represents a particular document with the number of times each word from the given list appears in its respective column?
Ok. I get it.
The code is given below:
from sklearn.feature_extraction.text import CountVectorizer

# Counting the number of times each word (unigram) appears in a document
vectorizer = CountVectorizer(input='content', binary=False, ngram_range=(1, 1))
# First set the vocab
vectorizer = vectorizer.fit(WordList)
# Now transform the text contained in each document (Document_List is a list of texts)
tfMatrix = vectorizer.transform(Document_List).toarray()
This will output the term-document matrix with features from WordList only.
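A toy usage sketch (the word list and documents here are hypothetical); note that CountVectorizer can also take the word list directly through its vocabulary parameter, which skips the separate fit step:

from sklearn.feature_extraction.text import CountVectorizer

WordList = ['cat', 'dog', 'garden']                 # hypothetical word list
Document_List = ['the cat sat in the garden',
                 'the dog chased the cat and the dog barked']

vectorizer = CountVectorizer(vocabulary=WordList)
tfMatrix = vectorizer.fit_transform(Document_List).toarray()
print(tfMatrix)
# columns follow WordList order: cat, dog, garden
# [[1 0 1]
#  [1 2 0]]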
For custom documents, you can use the CountVectorizer approach:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  # make an object of CountVectorizer
corpus = [
    'This is a cat.',
    'It likes to roam in the garden',
    'It is black in color',
    'The cat does not like the dog.',
]
X = vectorizer.fit_transform(corpus)
# print(X) to see the count given to each word
vectorizer.get_feature_names() == (
    ['black', 'cat', 'color', 'does', 'dog', 'garden', 'in', 'is',
     'it', 'like', 'likes', 'not', 'roam', 'the', 'this', 'to'])
X.toarray()
# used to convert X into a numpy array
vectorizer.transform(['A new cat.']).toarray()
# Checking it for a new document
Other vectorizers can also be used, like TfidfVectorizer. The tf-idf vectorizer is a better approach as it not only provides the number of occurrences of a word in a particular document but also tells about the importance of the word.
It is calculated by finding TF (term frequency) and IDF (inverse document frequency).
Term frequency is the number of times a word appears in a particular document, and IDF is calculated from how many documents in the corpus contain the word.
For example, if the documents are related to football, then the word "the" would not give any insight but the word "messi" would tell about the context of the document.
IDF is calculated by taking the log of the ratio of the total number of documents to the number of documents containing the word.
E.g., with 10 documents:
tf("the") = 10
tf("messi") = 5
"the" appears in all 10 documents, so idf("the") = log(10/10) = 0
"messi" appears in, say, 3 documents, so idf("messi") = log(10/3) ≈ 0.52
tfidf("the") = tf("the") * idf("the") = 10 * 0 = 0
tfidf("messi") = 5 * 0.52 = 2.6
These weights help the algorithm to identify the important words in the documents, which later helps to derive semantics from the doc.
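A minimal TfidfVectorizer sketch reusing the toy corpus above (scikit-learn applies a smoothed idf and L2 normalisation by default, so the exact weights will differ from the hand calculation):

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)    # the same 4-document corpus as above
print(tfidf_vectorizer.get_feature_names_out())
print(X_tfidf.toarray().round(2))                   # higher weight = rarer, more informative word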
