I am now using NMF to generate topics. My code is shown below. However, I do not know how to get the frequency of each topic. Can anyone help me? Thank you!
def fit_tfidf(documents):
    tfidf = TfidfVectorizer(input='content', stop_words='english',
                            use_idf=True, ngram_range=NGRAM_RANGE, lowercase=True,
                            max_features=MAX_FEATURES, min_df=1)
    tfidf_matrix = tfidf.fit_transform(documents.values).toarray()
    tfidf_feature_names = np.array(tfidf.get_feature_names())
    tfidf_reverse_lookup = {word: idx for idx, word in enumerate(tfidf_feature_names)}
    return tfidf_matrix, tfidf_reverse_lookup, tfidf_feature_names
def vectorization(documents):
    if VECTORIZER == 'tfidf':
        vec_matrix, vec_reverse_lookup, vec_feature_names = fit_tfidf(documents)
    if VECTORIZER == 'bow':
        vec_matrix, vec_reverse_lookup, vec_feature_names = fit_bow(documents)
    return vec_matrix, vec_reverse_lookup, vec_feature_names
def nmf_model(vec_matrix, vec_reverse_lookup, vec_feature_names, NUM_TOPICS):
    topic_words = []
    nmf = NMF(n_components=NUM_TOPICS, random_state=3).fit(vec_matrix)
    for topic in nmf.components_:
        word_idx = np.argsort(topic)[::-1][0:N_TOPIC_WORDS]
        topic_words.append([vec_feature_names[i] for i in word_idx])
    return topic_words
If you mean the weight of each topic within each document, then:
W = nmf.fit_transform(vec_matrix)
W (the document–topic matrix) has shape (n_documents, n_topics). Each row is a document vector in topic space; its entries are the weights of each topic in that document, which you can read as the topic's importance there.
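If instead you want a corpus-level "frequency" for each topic, here is a small follow-up sketch (not part of the original answer; variable names are illustrative and it builds on the W above). You can either sum the per-document weights for each topic or count how many documents have that topic as their strongest one:
import numpy as np

# total weight of each topic across all documents
topic_weight_totals = W.sum(axis=0)
# number of documents whose dominant topic is each topic
topic_counts = np.bincount(W.argmax(axis=1), minlength=W.shape[1])
for k, (weight, count) in enumerate(zip(topic_weight_totals, topic_counts)):
    print("Topic {}: total weight {:.2f}, dominant in {} documents".format(k, weight, count))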
I am using the Hugging Face transformers library to check whether a sentence is well-formed or not, with a masked language model (XLM-R). I first tokenize my sentence, then mask each word of the sentence one by one, then process the masked sentences and find the probability that the predicted masked word is right.
import copy
import math

import torch
from tqdm import tqdm


def calculate_scores(sent, model, tokenizer, device, print_pred=False, maskval=False):
    dic = {}
    ls = tokenizer.batch_encode_plus(sent)
    input_list = ls.input_ids
    with torch.no_grad():
        for i in tqdm(range(len(input_list))):
            item = input_list[i]
            real_input = item
            attmask = [1] * len(item)
            seg = [0] * len(item)
            seglist = [seg]
            masked_list = [real_input]
            attlist = [attmask]
            # mask every token (except the special tokens at the ends), one at a time
            for j in range(1, len(item) - 1):
                masked_input = copy.deepcopy(real_input)
                masked_input[j] = tokenizer.mask_token_id  # instead of a hard-coded id
                masked_list.append(masked_input)
                attlist.append(attmask)
                seglist.append(seg)
            inid = torch.tensor(masked_list).to(device)
            segtensor = torch.tensor(seglist).to(device)
            atttensor = torch.tensor(attlist).to(device)
            output = model(inid, attention_mask=atttensor, token_type_ids=segtensor)
            predictions_logits = output.logits
            predictions = torch.softmax(predictions_logits, dim=2)
            ppscore = 0
            # row j of the batch is the sentence masked at position j
            for j in range(1, len(item) - 1):
                ppscore = ppscore + math.log(predictions[j, j, item[j]], 2)
            try:
                score = math.pow(2, (-1 / (len(item) - 2)) * ppscore)
                dic[sent[i]] = score
            except:
                print(sent[i])
                dic[sent[i]] = 10000000
    return dic
A quick explanation of the code: calculate_scores takes sent, a list of sentences, as input. I first batch-encode this list of sentences. Then, for each encoded sentence, I generate masked copies in which exactly one token is masked and the rest are left intact. I feed these masked sentences to the model, read off the probability of the original token at each masked position, and compute the perplexity from those probabilities.
But this is not a very good way of utilizing the GPU. I want to process multiple sentences at once, while still getting a separate perplexity score for each sentence. How would I go about doing this?
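One possible direction, sketched under the assumption that sent, model, tokenizer and device are the same objects as in the question (and that the tokenizer right-pads, which is the default): build the masked variants for several sentences up front, pad them into one batch, and keep an index of which row belongs to which sentence and which position, so per-sentence log-probabilities can still be accumulated afterwards.
# sketch: batch masked variants from many sentences together
enc = tokenizer(sent)                      # lists of token ids, one per sentence
rows, owner, position = [], [], []
for s_idx, ids in enumerate(enc["input_ids"]):
    for j in range(1, len(ids) - 1):
        masked = list(ids)
        masked[j] = tokenizer.mask_token_id
        rows.append(masked)
        owner.append(s_idx)
        position.append(j)

batch = tokenizer.pad({"input_ids": rows}, return_tensors="pt").to(device)
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)

# accumulate log2-probabilities of the true tokens per original sentence
logprob = [0.0] * len(sent)
for r, (s_idx, j) in enumerate(zip(owner, position)):
    true_id = enc["input_ids"][s_idx][j]
    logprob[s_idx] += math.log(probs[r, j, true_id].item(), 2)
In practice you would also chunk rows into fixed-size mini-batches so everything still fits in GPU memory.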
I'm trying to build an NMF model for topic extraction. To re-train the model, I have to pass a parameter to the NMF function, and for that I need the first co-ordinate of a tuple that the algorithm returns. Here is the code for reference:
no_features = 1000
no_topics = 9
print ('Old number of topics: ', no_topics)
tfidf_vectorizer = TfidfVectorizer(max_df = 0.95, min_df = 2, max_features = no_features, stop_words = 'english')
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
no_topics = tfidf.shape
print('New number of topics :', no_topics)
# nmf = NMF(n_components = no_topics, random_state = 1, alpha = .1, l1_ratio = .5, init = 'nndsvd').fit(tfidf)
On the third-to-last line, tfidf.shape assigns the tuple (3, 1000) to the variable 'no_topics'; however, I want that variable to hold only the first co-ordinate, i.e. 3.
How can I extract just that co-ordinate from the tuple?
You can select the first value with no_topics[0]:
print('New number of topics : {}'.format(no_topics[0]))
Alternatively, you can slice the tfidf matrix itself with
topics = tfidf[0,:]
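Since tfidf.shape is just a Python tuple, another common idiom (a small illustrative snippet, not from the original answers) is to unpack it directly so each part has a clear name:
# shape is (n_documents, n_features)
n_documents, n_features = tfidf.shape
no_topics = n_documents
print('New number of topics:', no_topics)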
I'm building a chatbot with keras. I saw a guide on how to build a contextual chatbot, with a bag of words model.
The model is trained on a set of unique words, and learns to identify categories linked to certain words.
The documents list is a list of tuples with the words and their respective category or 'tag'.
Here is the link: https://medium.com/@ferrygunawan/build-facebook-messenger-contextual-chatbot-with-tensorflow-and-keras-4f8cc79438cf
After the model is built, I have a small script, where I receive user input (sentence) and that sentence is classified.
The classify function receives the input of the user (sentence) and the list of tokenized words, which is the same list of words that the model was trained on.
My question is: do I need to feed the model with its training data before I load it? And if so, how can I avoid that and load the model and classify a sentence without loading all of its training data again?
def classify(sentence, words):
    # generate probabilities for the model
    p = bow(sentence, words)
    print(p)
    d = len(p)
    f = len(documents) - 2
    a = np.zeros([f, d])
    tot = np.vstack((p, a))
    results = model.predict(tot)[0]
    return results
def bow(sentence, words):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words
    print(words)
    bag = [0] * len(words)
    for s in sentence_words:
        for i, w in enumerate(words):
            if w == s:
                bag[i] = 1
    return np.array(bag)
def clean_up_sentence(sentence):
    sentence_words = nltk.word_tokenize(sentence)
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
    return sentence_words
# This is how I load the data; it is loaded the same way in the script where the model is built
with open('teste.csv', 'r', encoding='utf-8-sig', newline='') as csv_file:
    csvReader = csv.DictReader(csv_file, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    data = list(csvReader)

intents = {}
intents['intents'] = data

words = []
classes = []
documents = []
ignore_words = ['?']

for intent in intents['intents']:
    pattern = intent['patterns']
    w = nltk.word_tokenize(pattern)
    words.extend(w)
    documents.append((w, intent['tag']))
    if intent['tag'] not in classes:
        classes.append(intent['tag'])

words = [stemmer.stem(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))
# remove duplicates from the list
classes = sorted(list(set(classes)))

model = keras.models.load_model("model_test.h5")
sentence = input()
classify(sentence, words)
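To avoid re-reading and re-tokenizing teste.csv every time you only want to classify a sentence, one common pattern (a sketch, not something from the linked guide; the pickle filename is made up) is to save the vocabulary and class list to disk when you train, and load them back alongside the .h5 model:
import pickle

# in the training script, right after `words`, `classes` and `documents` are built
with open('chatbot_meta.pkl', 'wb') as f:
    pickle.dump({'words': words,
                 'classes': classes,
                 'n_documents': len(documents)}, f)

# in the inference script, instead of reloading teste.csv
with open('chatbot_meta.pkl', 'rb') as f:
    meta = pickle.load(f)
words, classes = meta['words'], meta['classes']

model = keras.models.load_model("model_test.h5")
print(classify(input(), words))
Note that classify as written above also reaches for len(documents) to build its zero padding, so either store that number (n_documents here) or rework the padding so the model is fed a single-row input.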
I am trying to print out the metrics for my naïve Bayes classifier model, but the code keeps returning None for all of the print lines. I use the following code to print my metrics but can't figure out why it doesn't return the metric values I need; any help is appreciated!
import collections
from nltk.metrics.scores import (precision, recall, f_measure)
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(train_set):
    refsets[label].add(i)
    observed = nb_classifier.classify(feats)
    testsets[observed].add(i)
print('pos precision:', precision(refsets['pos'], testsets['pos']))
print('pos recall:', recall(refsets['pos'], testsets['pos']))
print('pos F-measure:', f_measure(refsets['pos'], testsets['pos']))
print('neg precision:', precision(refsets['neg'], testsets['neg']))
print('neg recall:', recall(refsets['neg'], testsets['neg']))
print('neg F-measure:', f_measure(refsets['neg'], testsets['neg']))
TL;DR
import random
from collections import Counter, defaultdict

from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.metrics.scores import precision, recall, f_measure
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
all_words = Counter(all_words)
def find_features(document, top_n=3000):
    word_features = list(all_words.keys())[:top_n]
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features
def train_test_split(documents, random_seed=0, split_on=0.95, top_n=3000):
    custom_random = random.Random(random_seed)
    custom_random.shuffle(documents)
    featuresets = [(find_features(rev, top_n), category) for (rev, category) in documents]
    split_on_int = int(len(featuresets) * split_on)
    training_set = featuresets[:split_on_int]
    testing_set = featuresets[split_on_int:]
    return training_set, testing_set
training_set, testing_set = train_test_split(documents)
The actual classifier training and evaluation:
nb = NaiveBayesClassifier.train(training_set)

predictions, gold_labels = defaultdict(set), defaultdict(set)
for i, (features, label) in enumerate(testing_set):
    predictions[nb.classify(features)].add(i)
    gold_labels[label].add(i)

for label in predictions:
    print(label, 'Precision:', precision(gold_labels[label], predictions[label]))
    print(label, 'Recall:', recall(gold_labels[label], predictions[label]))
    print(label, 'F1-Score:', f_measure(gold_labels[label], predictions[label]))
    print()
[out]:
neg Precision: 0.803921568627451
neg Recall: 0.9534883720930233
neg F1-Score: 0.8723404255319148
pos Precision: 0.9591836734693877
pos Recall: 0.8245614035087719
pos F1-Score: 0.8867924528301887
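If you also want an overall score next to the per-class metrics, NLTK ships an accuracy helper that can be run on the same testing_set (a small add-on to the answer above, not part of the original):
from nltk.classify import accuracy

# fraction of test documents whose predicted label matches the gold label
print('Accuracy:', accuracy(nb, testing_set))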
import gensim
from gensim.models.doc2vec import TaggedDocument

taggeddocs = []
tag2tweetmap = {}
for index, i in enumerate(cleaned_tweets):
    if len(i) > 2:  # non-empty tweets
        tag = u'SENT_{:d}'.format(index)
        sentence = TaggedDocument(words=gensim.utils.to_unicode(i).split(), tags=[tag])
        tag2tweetmap[tag] = i
        taggeddocs.append(sentence)

model = gensim.models.Doc2Vec(taggeddocs, dm=0, alpha=0.025, size=20, min_alpha=0.025, min_count=0)
for epoch in range(60):
    if epoch % 20 == 0:
        print('Now training epoch %s' % epoch)
    model.train(taggeddocs, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.002
    model.min_alpha = model.alpha
from sklearn.cluster import KMeans

dataSet = model.syn0
kmeansClustering = KMeans(n_clusters=6)
centroidIndx = kmeansClustering.fit_predict(dataSet)

topic2wordsmap = {}
for i, val in enumerate(dataSet):
    tag = model.docvecs.index_to_doctag(i)
    topic = centroidIndx[i]
    # collect the words of every tweet assigned to this cluster
    if topic not in topic2wordsmap:
        topic2wordsmap[topic] = []
    for w in tag2tweetmap[tag].split():
        topic2wordsmap[topic].append(w)

for i in topic2wordsmap:
    words = topic2wordsmap[i]
    print("Topic {} has words {}".format(i, words[:5]))
So I was trying to find the most commonly used words and a list of topics using the doc2vec method.
I get an AttributeError saying that "Doc2Vec has no attribute syn0", and I don't know what to do about it.
I found this doc2vec tutorial; maybe it will give you some clue about your problem.
https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5
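As for the AttributeError itself, a hedged suggestion, since the right attribute depends on your gensim version: the raw vector arrays moved off the model object in newer gensim releases, and for clustering tweets you most likely want the document vectors rather than the word vectors anyway. A sketch of the substitution:
# document vectors (one per tagged tweet), depending on the gensim version
try:
    dataSet = model.dv.vectors            # gensim >= 4.0
except AttributeError:
    dataSet = model.docvecs.vectors_docs  # gensim 3.x
# (the old word-vector array model.syn0 is now model.wv.vectors)

kmeansClustering = KMeans(n_clusters=6)
centroidIndx = kmeansClustering.fit_predict(dataSet)
If you move to gensim 4.x, model.docvecs.index_to_doctag(i) also changes; the doctag for row i is then model.dv.index_to_key[i].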