I have a corpus (hotel reviews) and I want to do some NLP processing on it, including TF-IDF. My problem is that when I apply TF-IDF and print the 100 features, they appear not as single words but as entire sentences.
Here is my code:
Note: clean_doc is a function that returns my corpus cleaned of stop words, with stemming applied, etc.
vectorizer = TfidfVectorizer(analyzer='word', tokenizer=clean_doc,
                             max_features=100, lowercase=False,
                             ngram_range=(1, 3), min_df=1)
vz = vectorizer.fit_transform(list(data['Review']))
feature_names = vectorizer.get_feature_names()
for feature in feature_names:
    print(feature)
It returns something like this:
love view good room
food amazing recommended
bad services location far
-----
Any idea why? Thanks in advance!
There is most likely an error in your clean_doc function. The tokenizer argument should be a function that takes a string as input and returns a list of tokens. If clean_doc instead returns whole sentences (for example, a list of cleaned sentences rather than a list of words), each sentence is treated as a single token, which is exactly why entire sentences show up as features.
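For reference, here is a minimal sketch of a compatible tokenizer (assuming NLTK for stop-word removal and stemming, since the original clean_doc isn't shown):
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_doc(doc):
    # must return a list of tokens, not a string or a list of sentences
    tokens = word_tokenize(doc)
    return [stemmer.stem(t) for t in tokens if t.lower() not in stop_words]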
I am using spaCy to process sentences of a doc. Given one sentence, I'd like to get the previous and following sentence.
I can easily iterate over the sentences of the doc as follows:
nlp_content = nlp(content)
sentences = nlp_content.sents
for idx, sent in enumerate(sentences):
But I can't get sentence #idx-1 or #idx+1 from sentence #idx.
Is there any function or property that could be useful there?
Thanks!
Nick
There isn't a built-in sentence index. You would need to iterate over the sentences once to create your own list of sentence spans to access them this way.
sentence_spans = tuple(doc.sents) # alternately: list(doc.sents)
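For example (a small sketch, assuming the en_core_web_sm pipeline):
import spacy

nlp = spacy.load('en_core_web_sm')
nlp_content = nlp(content)

# materialize the generator once, then index freely
sentence_spans = list(nlp_content.sents)
for idx, sent in enumerate(sentence_spans):
    prev_sent = sentence_spans[idx - 1] if idx > 0 else None
    next_sent = sentence_spans[idx + 1] if idx + 1 < len(sentence_spans) else None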
I have read a bunch of papers that talk about predicting missing words in a sentence. What I really want is to create a model that suggests a word for an incomplete sentence.
Example:
Incomplete Sentence :
I bought an ___________ because its rainy.
Suggested Words:
umbrella
soup
jacket
In the papers I have read, they utilized the Microsoft Sentence Completion Dataset for predicting missing words in a sentence.
Example :
Incomplete Sentence :
Im sad because you are __________
Missing Word Options:
a) crying
b) happy
c) pretty
d) sad
e) bad
I don't want to predict a missing word from a list of options. I want to suggest a list of words for an incomplete sentence. Is it feasible? Please enlighten me, because I'm really confused. What is the state-of-the-art model I can use for suggesting a list of (semantically coherent) words for an incomplete sentence?
Is it necessary that the suggested output words be included in the training dataset?
This is exactly how the BERT model was trained: mask some random words in the sentence, and make the network predict these words. So yes, it is feasible. And no, it is not necessary to have the list of suggested words as a training input. However, these suggested words do have to be part of the overall vocabulary with which this BERT model was trained.
I adapted this answer to show how the completion function may work.
# install this package to obtain the pretrained model
# ! pip install -U pytorch-pretrained-bert
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval(); # turning off the dropout
def fill_the_gaps(text):
    # wrap the input in BERT's special tokens
    text = '[CLS] ' + text + ' [SEP]'
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    # single-sentence input, so every segment id is 0
    segments_ids = [0] * len(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    with torch.no_grad():
        predictions = model(tokens_tensor, segments_tensors)
    # take the highest-scoring prediction at each [MASK] position
    results = []
    for i, t in enumerate(tokenized_text):
        if t == '[MASK]':
            predicted_index = torch.argmax(predictions[0, i]).item()
            predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
            results.append(predicted_token)
    return results
print(fill_the_gaps(text = 'I bought an [MASK] because its rainy .'))
print(fill_the_gaps(text = 'Im sad because you are [MASK] .'))
print(fill_the_gaps(text = 'Im worried because you are [MASK] .'))
print(fill_the_gaps(text = 'Im [MASK] because you are [MASK] .'))
The [MASK] symbol indicates the missing words (there can be any number of them). [CLS] and [SEP] are BERT-specific special tokens. The outputs for these particular prints are
['umbrella']
['here']
['worried']
['here', 'here']
The duplication is not surprising: transformer networks are generally good at copying words. And from a semantic point of view, these symmetric continuations do indeed look very likely.
Moreover, if it is not a random word that is missing but exactly the last word (or the last several words), you can utilize any language model (e.g. another famous SOTA language model, GPT-2) to complete the sentence.
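For illustration, a small sketch of that last approach (using the Hugging Face transformers package, which is a separate library from the pytorch-pretrained-bert code above):
from transformers import pipeline

# a left-to-right language model can only extend a prefix,
# so this fits the 'missing last word(s)' case
generator = pipeline('text-generation', model='gpt2')
print(generator('Im sad because you are', max_length=12,
                do_sample=True, num_return_sequences=3))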
I am reading up on TF-IDF so that I can filter out common words from my corpus. It appears to me that you get a TF-IDF score for each (word, document) pair.
Which score do you pay attention to? Do you combine the scores across all documents for a word?
TF-IDF example:
doc1 = "This is doc1"
doc2 = "This is a different document"
corpus = [doc1, doc2]
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
X.toarray()
returns:
array([[0.        , 0.70490949, 0.        , 0.50154891, 0.50154891],
       [0.57615236, 0.        , 0.57615236, 0.40993715, 0.40993715]])
vec.get_feature_names()
So you have one row (a 1-D array) per doc in the corpus, and each array has length equal to the total vocabulary of your corpus (it can get quite sparse). Which score you pay attention to depends on what you're doing: to find the most important word in a doc, look for the highest TF-IDF score in that doc's row; for the most important word in the corpus, look across the entire array. If you're trying to identify stop words, you could look for the set of N words with the lowest TF-IDF scores. However, I wouldn't really recommend using TF-IDF to find stop words in the first place: it lowers the weight of stop words, but they still occur so frequently that this can offset the down-weighting. You'd probably be better off finding the most common words and then filtering them out. Either way, you'd want to review the resulting set manually.
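For instance, continuing the snippet above, here is a sketch of how you could inspect the scores with numpy:
import numpy as np

terms = np.array(vec.get_feature_names())
scores = X.toarray()

# highest-TF-IDF term in each document
for i, row in enumerate(scores):
    print('doc%d: %s' % (i + 1, terms[row.argmax()]))

# terms with the lowest summed scores are candidate stop words,
# though as noted above this is not a reliable way to find them
totals = scores.sum(axis=0)
print(terms[np.argsort(totals)[:3]])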
I have a list of tokenized documents containing both unigrams and bigrams, and I would like to perform sklearn LDA on it. I have tried the following code:
my_data = [['low-rank matrix', 'detection method', 'problem finding'], ['probabilistic inference', 'problem finding', 'statistical learning', 'solution'], ['detection method', 'probabilistic inference', 'population', 'language'], ...]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tf_vectorizer = CountVectorizer(min_df=2, max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(my_data)
lda = LatentDirichletAllocation(n_topics=3, max_iter=5, random_state=10)
But when I print the output I get something like this:
topic 0:
detection, finding, solution, method, problem
topic 1:
language, statistical, problem, learning, finding
and so on...
The bigrams are broken apart and separated from one another. I have 10,000 documents and have already tokenized them; the method for finding the bigrams is not NLTK-based, so that step is already done.
Is there any way to improve this without changing the input?
I am very new to sklearn, so apologies in advance if I am making some obvious mistake.
CountVectorizer has an ngram_range param which decides whether the vocabulary will contain unigrams, bigrams, trigrams, etc.:
ngram_range : tuple (min_n, max_n)
    The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
For example:
ngram_range=(1,1) => Will include only unigrams
ngram_range=(1,2) => Will include unigrams and bigrams
ngram_range=(2,2) => Will include only bigrams
and so on...
You have not set it, so the default ngram_range=(1,1) applies, and hence only unigrams are used here.
tf_vectorizer = CountVectorizer(min_df=2,
                                max_features=n_features,
                                stop_words='english',
                                ngram_range=(2, 2))  # You need this
tf = tf_vectorizer.fit_transform(my_data)
Secondly, you say that you have already tokenized the data and you show lists of lists (my_data) in your code. That doesn't work with CountVectorizer. You need to pass a simple list of strings, and CountVectorizer will automatically tokenize them. So you will need to pass your own preprocessing steps to it instead; see the other params 'preprocessor', 'tokenizer' and 'analyzer' in the linked documentation.
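Alternatively, a sketch of a workaround (my suggestion, not from the linked documentation): if you want to keep your pre-built phrases such as 'low-rank matrix' intact, pass a callable as analyzer so that each list element is used as a term as-is. Note that stop_words and ngram_range are ignored when analyzer is a callable.
from sklearn.feature_extraction.text import CountVectorizer

# identity analyzer: each document is already a list of terms,
# so skip tokenization and use the terms as given
tf_vectorizer = CountVectorizer(analyzer=lambda doc: doc, min_df=2)
tf = tf_vectorizer.fit_transform(my_data)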
I am using TfidfVectorizer to score terms from many different corpora.
Here is my code:
tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english', min_df=0.5)

for corpus in all_corpus:
    tfidf.fit_transform(corpus)
The number of documents in each corpus varies, so when building the vocabulary some corpora end up with no terms and an error is raised:
after pruning, no terms remain. Try a lower min_df or higher max_df
I don't want to change the min or max df. What I need is for the transforming process to be skipped when no terms remain. So I made a conditional filter like the one below:
for corpus in all_corpus:
    tfidf.fit_transform(corpus)
    if tfidf.shape[0] > 0:
        # execute some code here
However, the condition doesn't work. Is there a way to fix this?
All answers and comments are really appreciated. Thanks!
First, a minimal working example for your problem is, I believe, the following:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(ngram_range=(1,1), stop_words = 'english', min_df = 0.5)
tfidf.fit_transform(['not I you'])
I could not replicate an error message containing the part you share, but this gives me a ValueError because all the words in my document are English stop words. (The code runs if one removes stop_words='english' in the snippet above.)
One way of handling the error in the case of a for-loop is to use a try/except block.
for corpus in all_corpus:
    try:
        tfidf.fit_transform(corpus)
    except ValueError:
        print('Transforming process skipped')
        # Here you can do more stuff
        continue  # go back to the top of the for-loop and start the next iteration
    # Here goes the rest of the code for a corpus for which the transform worked
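As a side note (my observation, not part of the original answer): the condition in your question cannot work because fit_transform raises the ValueError before returning anything, and TfidfVectorizer itself has no shape attribute; the document-term matrix is the return value of fit_transform. A sketch combining the try/except with keeping that returned matrix:
results = []
for corpus in all_corpus:
    try:
        X = tfidf.fit_transform(corpus)  # keep the returned document-term matrix
    except ValueError:
        print('Transforming process skipped')
        continue
    results.append(X)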