How to use the scikit-learn CountVectorizer? - python-3.x

I have a set of words for which I have to check whether they are present in the documents.
WordList = [w1, w2, ..., wn]
I also have a list of documents in which I have to check whether these words are present or not.
How do I use scikit-learn's CountVectorizer so that the features of the term-document matrix are only the words from WordList, and each row represents a particular document, with the number of times each word from the given list appears in its respective column?

OK, I figured it out.
The code is given below:
from sklearn.feature_extraction.text import CountVectorizer

# Count the number of times each word (unigram) appears in a document.
vectorizer = CountVectorizer(input='content', binary=False, ngram_range=(1, 1))
# First fit the vocabulary on the word list.
vectorizer = vectorizer.fit(WordList)
# Now transform the text contained in each document (Document_List is a list of strings).
tfMatrix = vectorizer.transform(Document_List).toarray()
This outputs a term-document matrix whose features are only the words from WordList.
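Alternatively, a minimal sketch (reusing the WordList and Document_List names above, with hypothetical values): CountVectorizer's vocabulary parameter pins the feature set directly, so words outside WordList are ignored and every word in WordList gets a column even if it never occurs in any document.
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical word list and documents, for illustration only.
WordList = ['cat', 'dog', 'garden']
Document_List = ['The cat sat in the garden.',
                 'The dog chased the cat around the garden.']

# Passing the word list as a fixed vocabulary guarantees the columns are
# exactly WordList, in that order.
vectorizer = CountVectorizer(vocabulary=WordList)
tfMatrix = vectorizer.fit_transform(Document_List).toarray()
print(tfMatrix)  # one row per document, one column per word in WordList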

For custom documents, you can use the CountVectorizer approach:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  # make a CountVectorizer object
corpus = [
    'This is a cat.',
    'It likes to roam in the garden',
    'It is black in color',
    'The cat does not like the dog.',
]
X = vectorizer.fit_transform(corpus)
# print(X) to see the count given to each word

# The learned features are lowercased and sorted alphabetically
# (single-character tokens such as 'a' are dropped by the default tokenizer):
vectorizer.get_feature_names() == (
    ['black', 'cat', 'color', 'does', 'dog', 'garden', 'in', 'is',
     'it', 'like', 'likes', 'not', 'roam', 'the', 'this', 'to'])
# Convert the sparse matrix X into a dense numpy array
X.toarray()
# Check the vectorizer on a new document
vectorizer.transform(['A new cat.']).toarray()
Other vectorizers, such as TfidfVectorizer, can also be used. The tf-idf vectorizer is often a better choice, because it not only counts how many times a word occurs in a particular document, it also weights each word by how informative it is.
It is calculated from TF (term frequency) and IDF (inverse document frequency).
Term frequency is the number of times a word appears in a particular document, and IDF depends on how many documents in the corpus contain the word.
For example, if the documents are about football, the word "the" gives no insight, but the word "messi" tells you about the context of the document.
IDF is calculated by taking the log of the ratio of the total number of documents to the number of documents containing the word. For example, with a corpus of 10 documents (using a base-10 log):
tf("the") = 10
tf("messi") = 5
idf("the") = log(10 / 10) = 0        # "the" appears in all 10 documents
idf("messi") = log(10 / 3) ≈ 0.52    # if "messi" appears in 3 of the 10 documents
tfidf("the") = tf("the") * idf("the") = 10 * 0 = 0
tfidf("messi") = tf("messi") * idf("messi") = 5 * 0.52 = 2.6
These weights help the algorithm identify the important words in each document, which later helps derive the semantics of the document.
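To make this concrete, here is a minimal sketch (reusing the toy corpus above) with scikit-learn's TfidfVectorizer. Note that scikit-learn uses a smoothed variant of IDF, so its exact weights will differ from the hand calculation above.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is a cat.',
    'It likes to roam in the garden',
    'It is black in color',
    'The cat does not like the dog.',
]

tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

# Words that occur in many documents (e.g. "is", "it") get lower idf
# weights than words that occur in only one document.
print(tfidf_vectorizer.get_feature_names())  # use get_feature_names_out() on newer scikit-learn
print(X_tfidf.toarray().round(2))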

Related

How to calculate TF-IDF values of noun documents excluding spaCy stop words?

I have a data frame df with text, cleaned_text, and nouns as column names. text and cleaned_text contain string documents, and nouns is a list of nouns extracted from the cleaned_text column. df.shape = (1927, 3).
I am trying to calculate TF-IDF values for all documents within df only for nouns, excluding spaCy stopwords.
What have I tried?
import spacy
from spacy.lang.en import English

nlp = spacy.load('en_core_web_sm')

# Subclassing to modify stop word lists is recommended from spaCy version 3.0 onwards.
excluded_stop_words = {'down'}
included_stop_words = {'dear', 'regards'}

class CustomEnglishDefaults(English.Defaults):
    stop_words = English.Defaults.stop_words.copy()
    stop_words -= excluded_stop_words
    stop_words |= included_stop_words

class CustomEnglish(English):
    Defaults = CustomEnglishDefaults

# Function to extract nouns from the cleaned_text column, excluding spaCy stopwords.
nlp = CustomEnglish()

def nouns(text):
    doc = nlp(text)
    return [t for t in doc if t.pos_ in ['NOUN'] and not t.is_stop and not t.is_punct]

# Calculate TF-IDF values for nouns, excluding spaCy stopwords.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = df.cleaned_text
tfidf = TfidfVectorizer(stop_words=CustomEnglish)
X = tfidf.fit_transform(documents)
What am I expecting?
I am expecting an output as a list of tuples ranked in descending order:
nouns = [('noun_1', tf-idf_1), ('noun_2', tf-idf_2), ...]. All nouns in nouns should match those of df.nouns (this is to check whether I am on the right track).
What is my issue?
I am confused about how to apply TfidfVectorizer so that it calculates TF-IDF values only for the nouns extracted from cleaned_text. I am also not sure whether scikit-learn's TfidfVectorizer can calculate TF-IDF the way I am expecting.
Not sure if you're still looking for a solution. Here is an option that you might want to go with.
First of all, by default TF-IDF takes into account the entire set of words, not just nouns. Hence, you would need to implement a custom TF-IDF function to restrict the results to nouns. The following is a good reference on how TF-IDF works internally: https://www.askpython.com/python/examples/tf-idf-model-from-scratch
Instead of running the tf_idf function (as applied in the above URL) for all words of a sentence/document, you can just run it on the list of nouns you've extracted, i.e., change the code from:
def tf_idf(sentence):
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = termfreq(sentence, word)
        idf = inverse_doc_freq(word)
        value = tf * idf
        tf_idf_vec[index_dict[word]] = value
    return tf_idf_vec
to:
def tf_idf(sentence, nouns):
    values = []
    for word in nouns:
        tf = termfreq(sentence, word)
        idf = inverse_doc_freq(word)
        value = tf * idf
        values.append(value)
    return values
You now have a "values" list corresponding to the list of "nouns" for each sentence. Hope this makes sense.
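A different sketch, not from the answer above: scikit-learn's TfidfVectorizer also accepts a vocabulary parameter, so if df.nouns already holds the nouns per document (as described in the question) you can restrict the TF-IDF features to those nouns directly. The column/row handling below is my own illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumes df.nouns is a list of noun strings per document and
# df.cleaned_text holds the corresponding document text.
noun_vocabulary = sorted({noun.lower()
                          for noun_list in df.nouns
                          for noun in noun_list})

# Restrict the TF-IDF features to the noun vocabulary.
tfidf = TfidfVectorizer(vocabulary=noun_vocabulary)
X = tfidf.fit_transform(df.cleaned_text)

# Rank the nouns of the first document by TF-IDF, descending.
row = X.getrow(0).toarray().ravel()
ranked = sorted(zip(noun_vocabulary, row), key=lambda pair: -pair[1])
print(ranked[:10])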

Find the number of bigrams after filtering out stop words

Case study
Task 1
Import text corpus brown
Extract the list of words associated with text collections belonging to the news genre. Store the result in the variable news_words.
Convert each word of the list news_words into lower case, and store the result in lc_news_words.
Compute the length of each word present in the list lc_news_words, and store the result in the list len_news_words.
Compute bigrams of the list len_news_words. Store the result in the variable news_len_bigrams.
Compute the conditional frequency of news_len_bigrams, where condition and event refers to the length of the words. Store the result in cfd_news.
Determine the frequency of 6-letter words appearing next to a 4-letter word.
Task 2
Compute bigrams of the list lc_news_words, and store it in the variable lc_news_bigrams.
From lc_news_bigrams, filter bigrams where both words contain only alphabet characters. Store the result in lc_news_alpha_bigrams.
Extract the list of words associated with the corpus stopwords. Store the result in stop_words.
Convert each word of the list stop_words into lower case, and store the result in lc_stop_words.
Filter only the bigrams from lc_news_alpha_bigrams where the words are not part of lc_stop_words. Store the result in lc_news_alpha_nonstop_bigrams.
Print the total number of filtered bigrams.
Task 1 passed, but task 2 is failing. Please help me find where I am going wrong.
import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords
news_words = brown.words(categories = 'news')
lc_news_words = [word.lower() for word in news_words]
len_news_words = [len(word) for word in lc_news_words]
news_len_bigrams = nltk.bigrams(len_news_words)
cfd_news = nltk.ConditionalFreqDist(news_len_bigrams)
print(cfd_news[4][6])
lc_news_bigrams = nltk.bigrams(lc_news_words)
lc_news_alpha_bigrams = [ (w1, w2) for w1, w2 in lc_news_bigrams if w1.isalpha() and w2.isalpha()]
stop_words = stopwords.words('english')
lc_stop_words = [word.lower() for word in stop_words]
lc_news_alpha_nonstop_bigrams = [(n1, n2) for n1, n2 in lc_news_alpha_bigrams if n1 not in lc_stop_words and n2 not in lc_stop_words]
print(len(lc_news_alpha_nonstop_bigrams))
Results
With 'english' in the code (stop_words = stopwords.words('english')):
1084
17704
Without 'english' in the code (stop_words = stopwords.words()):
1084
16876
stop_words = set(stopwords.words())
Everything else was fine; just use the unique set built from the list of stopwords. Also, removing the 'english' argument increases the number of stop words (stopwords.words() with no argument returns the stopwords for all languages), and that is the actual set of stopwords to be considered here.
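A minimal sketch of the fix, assuming the rest of the code from the question (including lc_news_alpha_bigrams) stays as it is:
from nltk.corpus import stopwords

# Take the stopwords of every language NLTK ships (no 'english' argument)
# and deduplicate them as a set for fast membership tests.
stop_words = set(stopwords.words())
lc_stop_words = {word.lower() for word in stop_words}

lc_news_alpha_nonstop_bigrams = [(n1, n2) for n1, n2 in lc_news_alpha_bigrams
                                 if n1 not in lc_stop_words and n2 not in lc_stop_words]
print(len(lc_news_alpha_nonstop_bigrams))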

Some diverging issues of Word2Vec in Gensim using high alpha values

I am implementing word2vec in Gensim on a corpus of nested lists (a collection of tokenized sentences) with 408226 sentences (lists) and a total of 3150546 words or tokens.
I am getting meaningful results (in terms of the similarity between two words using model.wv.similarity) with the chosen values of size=200, window=15, min_count=5, iter=10 and alpha=0.5. All words are lemmatized, and they are all input to the model, which has a vocabulary of 32716.
The results obtained with the default alpha, size, and window are meaningless to me for this data when computing similarity values. However, a higher alpha of 0.5 gives me meaningful similarity scores between two words. Yet when I compute the top-n similar words, the results are again meaningless. Do I need to change all of the parameters used in the initial training process?
I am still unable to find the exact reason why the model behaves well with such a high alpha value when computing the similarity between two words of the corpus, whereas it is meaningless when computing the top-n similar words with scores for an input word. Why is this the case?
Is it diverging away from the optimal solution? How can I check this?
Any idea why this is the case would be deeply appreciated.
Note: I'm using Python 3.7 on a Windows machine with the Anaconda prompt, and the model input is read from a file.
This is what I have tried.
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.models import Word2Vec
import ast

path = "F:/Folder/"

def load_data():
    global Sentences
    Sentences = []
    for file in ['data_d1.txt', 'data_d2.txt']:
        with open(path + file, 'r', encoding='utf-8') as f1:
            Sentences.extend(ast.literal_eval(*f1.readlines()))

load_data()

def initialize_word_embedding():
    model = Word2Vec(Sentences, size=200, window=15, min_count=5, iter=10, workers=4)
    print(model)
    print(len(model.wv.vocab))
    print(model.wv.similarity(w1='structure', w2='_structure_'))
    similarities = model.wv.most_similar('system')
    for word, score in similarities:
        print(word, score)

initialize_word_embedding()
The example of Sentences list is as follows:
[['scientist', 'time', 'comet', 'activity', 'sublimation', 'carbon', 'dioxide', 'nears', 'ice', 'system'], ['inconsistent', 'age', 'system', 'year', 'size', 'collision'], ['intelligence', 'system'], ['example', 'application', 'filter', 'image', 'motion', 'channel', 'estimation', 'equalization', 'example', 'application', 'filter', 'system']]
data_d1.txt and data_d2.txt each contain a nested list (a list of lists of lemmatized, tokenized words). I preprocessed the raw data, saved it in these files, and now give it as input. For lemmatizing the tokens, I used the popular WordNet lemmatizer.
I need the word-embedding model to calculate the similarity between two words and to compute the most similar words of a given input word. I get meaningful scores from the model.wv.similarity() method, whereas when calculating the most_similar() words of a word (say, system, as shown above), I do not get the desired results.
I am guessing the model is diverging from the global minimum because of the high alpha value.
I am confused about what the dimension size and window should be to get meaningful results, as there are no firm rules on how to choose size and window.
Any suggestion is appreciated. The total number of sentences and words is specified above in the question.
Results I am getting without setting alpha = 0.5:
Edit in response to a recent comment:
Results:
Word2Vec(vocab=32716, size=200, alpha=0.025)
The similarity between set and _set_ is: 0.000269373188960656
This is meaningless to me as it is very low, but I get 71% by setting alpha to 0.5, which seems meaningful to me since the word set is the same in both domains.
Explanation: the word set should be the same for both domains (as I am comparing the data of two domains using the same word). Don't be confused by the word _set_: it is the same word as set; I injected an underscore character at the start and end to distinguish it between the two domains.
The top 10 words along with scores of _set_ are:
_niche_ 0.6891741752624512
_intermediate_ 0.6883598566055298
_interpretation_ 0.6813371181488037
_printer_ 0.675414502620697
_finer_ 0.6625382900238037
_pertinent_ 0.6620787382125854
_respective_ 0.6619025468826294
_converse_ 0.6610435247421265
_developed_ 0.659270167350769
_tent_ 0.6588765382766724
Whereas, the top 10 words for set are:
cardinality 0.633270263671875
typereduction 0.6233855485916138
zdzisław 0.619156002998352
crisp 0.6165326833724976
equivalenceclass 0.605925977230072
pawlak 0.6058803200721741
straight 0.6045454740524292
culik 0.6040038466453552
rin 0.6038737297058105
multisets 0.6035065650939941
Why is the cosine similarity value 0.00 for the word set between the two different datasets?

What is the use of 'max_features' in TfidfVectorizer

What I have understood is that if max_features = n, it selects the top n features on the basis of TF-IDF value. I went through the documentation of TfidfVectorizer on scikit-learn but didn't understand it properly.
If you want the words with the highest tf-idf values per row, then you need to take the transformed tf-idf matrix from the vectorizer, go through it row by row (document by document), and sort the values.
Something like this:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# text_data is your iterable of documents
tfidf_vectorizer = TfidfVectorizer()

# TfidfVectorizer will by default output a sparse matrix
tfidf_data = tfidf_vectorizer.fit_transform(text_data).tocsr()
vocab = np.array(tfidf_vectorizer.get_feature_names())

# Replace this with the number of top words you want to get in each row
top_n_words = 5

# Loop over all the docs present
for i in range(tfidf_data.shape[0]):
    doc = tfidf_data.getrow(i).toarray().ravel()
    sorted_index = np.argsort(doc)[::-1][:top_n_words]
    print(sorted_index)
    for word, tfidf in zip(vocab[sorted_index], doc[sorted_index]):
        print("%s - %f" % (word, tfidf))
If you can use pandas, then the logic becomes simpler:
import pandas as pd

for i in range(tfidf_data.shape[0]):
    doc_data = pd.DataFrame({'Tfidf': tfidf_data.getrow(i).toarray().ravel(),
                             'Word': vocab})
    doc_data.sort_values(by='Tfidf', ascending=False, inplace=True)
    print(doc_data.iloc[:top_n_words])
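Coming back to max_features itself, here is a minimal sketch (my own illustration with a hypothetical toy corpus): max_features=n keeps only the top n terms ordered by term frequency across the corpus, not by tf-idf weight.
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus for illustration.
text_data = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'the bird flew over the garden',
]

# Keep only the 3 most frequent terms across the whole corpus;
# every other term is dropped from the vocabulary.
limited = TfidfVectorizer(max_features=3)
X_limited = limited.fit_transform(text_data)

print(limited.get_feature_names())  # 'the' and 'cat' are kept; the third slot is a tie among the rest
print(X_limited.shape)              # (3 documents, 3 features)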

Semantic Similarity between Phrases Using GenSim

Background
I am trying to judge whether a phrase is semantically related to other words found in a corpus using Gensim. For example, here is the corpus document pre-tokenized:
Corpus
Car Insurance
Car Insurance Coverage
Auto Insurance
Best Insurance
How much is car insurance
Best auto coverage
Auto policy
Car Policy Insurance
My code (based on this gensim tutorial) judges the semantic relatedness of a phrase using cosine similarity against all strings in the corpus.
Problem
It seems that if a query contains ANY of the terms found within my dictionary, that phrase is judged as being semantically similar to the corpus (e.g. "Giraffe Poop Car Murderer" has a cosine similarity of 1 but SHOULD be semantically unrelated). I am not sure how to solve this issue.
Code
# Tokenize the corpus and filter out anything that is a stop word or appears only once
texts = [[word for word in document if word not in stoplist]
         for document in documents]

from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word, converts the word
# to its integer word id and returns the result as a sparse vector
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

# convert the query to LSI space
vec_lsi = lsi[vec_bow]

index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
First of all, you are not directly comparing the cosine similarity of bag-of-words vectors; you first reduce the dimensionality of your document vectors by applying latent semantic analysis (https://en.wikipedia.org/wiki/Latent_semantic_analysis). This is fine, but I just wanted to emphasise it. It is often assumed that the underlying semantic space of a corpus is of a lower dimensionality than the number of unique tokens. Therefore, LSA applies principal component analysis to your vector space and only keeps the directions that contain the most variance (i.e. those directions in the space that change most rapidly, and thus are assumed to contain more information). This is influenced by the num_topics parameter you pass to the LsiModel constructor.
Secondly, I cleaned up your code a little bit and embedded the corpus:
# Tokenize the corpus and filter out anything that is a
# stop word or appears only once
from gensim import corpora, models, similarities
from collections import defaultdict

documents = [
    'Car Insurance',              # doc_id 0
    'Car Insurance Coverage',     # doc_id 1
    'Auto Insurance',             # doc_id 2
    'Best Insurance',             # doc_id 3
    'How much is car insurance',  # doc_id 4
    'Best auto coverage',         # doc_id 5
    'Auto policy',                # doc_id 6
    'Car Policy Insurance',       # doc_id 7
]

stoplist = set(['is', 'how'])

texts = [[word.lower() for word in document.split()
          if word.lower() not in stoplist]
         for document in documents]

print(texts)

frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word,
# converts the word to its integer word id and returns the result
# as a sparse vector
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

# convert the query to LSI space
vec_lsi = lsi[vec_bow]

index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

print(sims)
If I run the above I get the following output:
[(0, 0.97798139), (4, 0.97798139), (7, 0.94720691), (1, 0.89220524), (3, 0.61052465), (2, 0.42138112), (6, -0.1468758), (5, -0.22077486)]
where every entry in that list corresponds to (doc_id, cosine_similarity) ordered by cosine similarity in descending order.
Since the only word in your query document that is actually part of your vocabulary (constructed from your corpus) is car, all other tokens are dropped. Therefore, the query to your model consists of the singleton document car. Consequently, all documents which contain car are reported as very similar to your input query.
The reason why document #3 (Best Insurance) is ranked highly as well is that the token insurance often co-occurs with car (your query). This is exactly the reasoning behind distributional semantics, i.e. "a word is characterized by the company it keeps" (Firth, J. R. 1957).
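To see this filtering directly, here is a small sketch (assuming the dictionary built in the code above) that checks which query tokens actually exist in the gensim Dictionary before querying:
# Show which tokens of the query survive the dictionary lookup.
doc = "giraffe poop car murderer"
query_tokens = doc.lower().split()
known = [t for t in query_tokens if t in dictionary.token2id]
unknown = [t for t in query_tokens if t not in dictionary.token2id]
print("in vocabulary:", known)        # only 'car' for this corpus
print("out of vocabulary:", unknown)  # 'giraffe', 'poop', 'murderer'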
