I am using TfidfVectorizer to score terms from many different corpora.
Here is my code:
tfidf = TfidfVectorizer(ngram_range=(1,1), stop_words='english', min_df=0.5)
for corpus in all_corpus:
    tfidf.fit_transform(corpus)
The number of documents in each corpus varies, so for some corpora no terms remain after pruning and an error is raised:
after pruning, no terms remain. Try a lower min_df or higher max_df
I don't want to change the min or max DF. What I need is for the transforming step to be skipped when no terms remain, so I wrote a conditional filter like the one below:
for corpus in all_corpus:
    tfidf.fit_transform(corpus)
    if tfidf.shape[0] > 0:
        # execute some code here
However, the condition doesn't work. Is there a way to fix this?
All answers and comments are really appreciated. Thanks.
First, a minimal working example for your problem is, I believe, the following:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(ngram_range=(1,1), stop_words = 'english', min_df = 0.5)
tfidf.fit_transform(['not I you'])
I could not replicate an error message containing the text you share, but this gives me a ValueError because all the words in my document are English stop words. (The code runs if one removes stop_words = 'english' in the snippet above.)
One way of handling the error in the case of a for-loop is to use a try/except block.
for corpus in all_corpus:
    try:
        tfidf.fit_transform(corpus)
    except ValueError:
        print('Transforming process skipped')
        # Here you can do more stuff
        continue  # go back to the beginning of the for-loop and start the next iteration
    # Here goes the rest of the code for the corpora for which the transform worked
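For completeness, a self-contained sketch of this pattern (the all_corpus data here is made up purely for illustration; the second corpus contains only stop words, so fit_transform raises a ValueError for it):
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data: each corpus is a list of documents (strings)
all_corpus = [
    ['the cat sat on the mat', 'the dog sat on the rug'],
    ['not I you'],  # only stop words, so the vocabulary ends up empty
]

tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english', min_df=0.5)

for corpus in all_corpus:
    try:
        matrix = tfidf.fit_transform(corpus)
    except ValueError:
        print('Transforming process skipped')
        continue
    # Only reached when the vocabulary was not empty
    print(matrix.shape)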
I am trying to use a pretrained word2vec model to create word embeddings, but I am getting the following error when I try to create the weight matrix from the word2vec gensim model:
Code:
import gensim
import numpy as np

w2v_model = gensim.models.KeyedVectors.load_word2vec_format("/content/drive/My Drive/GoogleNews-vectors-negative300.bin.gz", binary=True)
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
EMBEDDING_DIM = 300

# Function to create weight matrix from word2vec gensim model
def get_weight_matrix(model, vocab):
    # total vocabulary size plus 0 for unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        weight_matrix[i] = model[word]
    return weight_matrix

embedding_vectors = get_weight_matrix(w2v_model, tokenizer.word_index)
I'm getting the following error:
[the error was posted as a screenshot; per the answer below, it is a KeyError: word 'didnt' not in vocabulary]
As a note: it's better to paste a full error as formatted text than as an image of text. (See "Why not upload images of code/errors when asking a question?" for a full list of the reasons why.)
But regarding your question:
If you get a KeyError: word 'didnt' not in vocabulary error, you can trust that the word you've requested is not in the set-of-word-vectors you've requested it from. (In this case, the GoogleNews vectors that Google trained & released back around 2012.)
You could check before looking it up – 'didnt' in w2v_model, which would return False – and then do something else. Or you could use a Python try: ... except: ... block to let the lookup happen, but then do something else when the KeyError occurs.
But it's up to you what your code should do if the model you've provided doesn't have the word-vectors you were hoping for.
(Note: the GoogleNews vectors do include a vector for "didn't", the contraction with its internal apostrophe. So in this one case, the issue may be that your tokenization is stripping such internal-punctuation-marks from contractions, but Google chose not to when making those vectors. But your code should be ready for handling missing words in any case, unless you're sure through other steps that can never happen.)
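As an illustration only (reusing the names from the question's code, not a definitive fix), a variant of get_weight_matrix that skips words the model does not know could look like this:
import numpy as np

def get_weight_matrix(model, vocab, embedding_dim=300):
    # total vocabulary size plus 0 for unknown words
    weight_matrix = np.zeros((len(vocab) + 1, embedding_dim))
    for word, i in vocab.items():
        if word in model:  # membership check, as described above
            weight_matrix[i] = model[word]
        # words missing from the model keep their all-zeros row
    return weight_matrix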
I am still a beginner with neural networks and NLP.
In this code I'm training cleaned text (some tweets) with skip-gram.
But I do not know if I do it correctly.
Can anyone inform me about the correctness of this skip-gram text training?
Any help is appreciated.
This is my code:
import numpy as np
import pandas as pd
from nltk import word_tokenize
from gensim.models.phrases import Phrases, Phraser

sent = [row.split() for row in X['clean_text']]
phrases = Phrases(sent, max_vocab_size=50, progress_per=10000)
bigram = Phraser(phrases)
sentences = bigram[sent]

from gensim.models import Word2Vec
w2v_model = Word2Vec(window=5, size=300, sg=1)
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=25)
del sentences  # to reduce memory usage
def get_mat(model, corpus, size):
    vecs = np.zeros((len(corpus), size))
    n = 0
    for i in corpus.index:
        vecs[i] = np.zeros(size).reshape((1, size))
        for word in str(corpus.iloc[i, 0]).split():
            try:
                vecs[i] += model[word]
                # n += 1
            except KeyError:
                continue
    return vecs

X_sg = get_mat(w2v_model, X, 300)
del X
X_sg = pd.DataFrame(X_sg)
X_sg.head()

from sklearn import preprocessing
scale = preprocessing.normalize
X_sg = scale(X_sg)

for i in range(len(X_sg)):
    X_sg[i] += 1  # I did this because some weights were negative, so I could not apply LSTM on them later
You haven't mentioned whether you've received any errors or unsatisfactory results, so it's hard to know what kind of help you might need.
Your specific lines of code involving the Word2Vec model are roughly correct: plausibly-useful parameters (if you have a dataset large enough to train 300-dimensional vectors), and the proper steps. So the real proof would be whether your results are acceptable.
Regarding your attempted use of Phrases bigram-creation beforehand:
You should get things generally working and with promising results before adding this extra pre-processing complexity.
The parameter max_vocab_size=50 is seriously misguided and may make the phrases-step pointless. The max_vocab_size is a hard cap on how many words/bigrams are tallied by the class, as a way to cap its memory usage. (Whenever the number of known words/bigrams hits this cap, many lower-frequency words/bigrams are pruned – in practice, a majority of all words/bigrams at each pruning, giving up a lot of accuracy in return for capped memory usage.) The max_vocab_size default in gensim is 40,000,000, but the default in the Google word2phrase.c source on which gensim's method is based was 500,000,000. With a cap of just 50, it's not really going to learn anything useful beyond whatever 50 words/bigrams survive the many prunings.
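As a rough sketch only (the min_count and threshold values below are placeholders to tune, not recommendations), the Phrases step with the vocabulary cap left at its large default could look like:
from gensim.models.phrases import Phrases, Phraser

# keep max_vocab_size at its default and steer bigram promotion
# with min_count / threshold instead
sent = [row.split() for row in X['clean_text']]
phrases = Phrases(sent, min_count=5, threshold=10.0, progress_per=10000)
bigram = Phraser(phrases)
sentences = bigram[sent]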
Regarding your get_mat() function & later DataFrame code, I have no idea what you're trying to do with it, so I can't offer any opinion on it.
I have a list of tokenized documents containing both unigrams and bigrams, and I would like to perform sklearn LDA on it. I have tried the following code:
my_data = [['low-rank matrix', 'detection method', 'problem finding'], ['probabilistic inference', 'problem finding', 'statistical learning', 'solution'], ['detection method', 'probabilistic inference', 'population', 'language'], ...]
tf_vectorizer = CountVectorizer(min_df=2, max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(my_data)
lda = LatentDirichletAllocation(n_topics=3, max_iter=5, random_state=10)
But when I print the output, I get something like this:
topic 0:
detection,finding, solution ,method,problem
topic 1:
language, statistical , problem, learning,finding
and so on..
The bigrams are broken up and separated from one another. I have 10,000 documents and have already tokenized them; the method for finding the bigrams is not NLTK-based, so that step is already done.
Is there any method to improve this without changing the input?
I am very new to using sklearn, so apologies in advance if I am making some obvious mistake.
CountVectorizer has an ngram_range param which is used for deciding whether the vocabulary will contain unigrams, bigrams, trigrams, etc.:
ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the
range of n-values for different n-grams to be extracted. All values of
n such that min_n <= n <= max_n will be used.
For example:
ngram_range=(1,1) => Will include only unigrams
ngram_range=(1,2) => Will include unigrams and bigrams
ngram_range=(2,2) => Will include only bigrams
and so on...
You have not defined that, so the default ngram_range=(1,1) applies and hence only unigrams are used here.
tf_vectorizer = CountVectorizer(min_df=2,
                                max_features=n_features,
                                stop_words='english',
                                ngram_range=(2,2))  # You need this
tf = tf_vectorizer.fit_transform(my_data)
Secondly, you say that you have already tokenized the data and you show a list of lists (my_data) in your code. That doesn't work with CountVectorizer: it expects a simple list of strings and will apply tokenization to them automatically, so you need to plug your own preprocessing steps into it. See the 'preprocessor', 'tokenizer' and 'analyzer' params in the linked documentation.
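For example, one way to keep your pre-tokenized input unchanged is to pass a callable as the analyzer, so CountVectorizer treats each token list as the final set of features (a sketch reusing the question's my_data; min_df still applies):
from sklearn.feature_extraction.text import CountVectorizer

# my_data is already a list of token lists, so let a callable analyzer
# return each document's tokens unchanged instead of re-tokenizing strings
tf_vectorizer = CountVectorizer(min_df=2, analyzer=lambda tokens: tokens)
tf = tf_vectorizer.fit_transform(my_data)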
I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
I modified the code in line 10 to determine the best matching document for a given query, and every time I run it I get a completely different result set. My new code in line 10 of the notebook is:
inferred_vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
rank = [docid for docid, sim in sims]
print(rank)
Every time I run this piece of code, I get a different set of documents matching the query "only you can prevent forest fires". The differences are stark and the results just do not seem to match.
Is Doc2Vec not a suitable match for querying and information extraction? Or are there bugs?
Looking into the code: in infer_vector you are using parts of the algorithm that are non-deterministic. Initialization of the word vectors is deterministic – see the code of seeded_vector – but when we look further, i.e. at the random sampling of words and negative sampling (updating only a sample of word vectors per iteration), these can cause non-deterministic output (thanks @gojomo).
def seeded_vector(self, seed_string):
    """Create one 'random' vector (but deterministic by seed_string)"""
    # Note: built-in hash() may vary by Python version or even (in Py3.x) per launch
    once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
    return (once.rand(self.vector_size) - 0.5) / self.vector_size
Set negative=0 to avoid randomization:
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [list('asdf'), list('asfasf')]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]
model = Doc2Vec(documents, vector_size=20, window=5, min_count=1, negative=0, workers=6, epochs=10)

a = list('test sample')
b = list('testtesttest')

for s in (a, b):
    v1 = model.infer_vector(s)
    for i in range(100):
        v2 = model.infer_vector(s)
        assert np.all(v1 == v2), "Failed on %s" % ''.join(s)
I have a corpus (hotel reviews) and I want to apply some NLP processing, including TF-IDF. My problem is that when I apply TF-IDF and print 100 features, they don't appear as single words but as entire sentences.
Here is my code:
Note: clean_doc is a function that returns my corpus cleaned of stopwords, with stemming, etc.
vectorizer = TfidfVectorizer(analyzer='word', tokenizer=clean_doc,
                             max_features=100, lowercase=False, ngram_range=(1,3), min_df=1)
vz = vectorizer.fit_transform(list(data['Review']))
feature_names = vectorizer.get_feature_names()
for feature in feature_names:
    print(feature)
It returns something like this:
love view good room
food amazing recommended
bad services location far
-----
Any idea why? Thanks in advance.
There is most likely an error in your clean_doc function. The 'tokenizer' argument should be a function that takes a string as input and returns a list of tokens.
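For reference, a minimal sketch of what such a function could look like (assuming NLTK with its 'punkt' and 'stopwords' data downloaded; the exact cleaning steps are placeholders for whatever your clean_doc does):
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def clean_doc(doc):
    # receives ONE review as a string and returns a list of tokens
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(doc.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]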