Adding documents to gensim model - python-3.x

I have a class wrapping the various objects required for calculating LSI similarity:
class SimilarityFiles:
    def __init__(self, file_name, tokenized_corpus, stoplist=None):
        if stoplist is None:
            self.filtered_corpus = tokenized_corpus
        else:
            self.filtered_corpus = []
            for convo in tokenized_corpus:
                self.filtered_corpus.append([token for token in convo if token not in stoplist])
        self.dictionary = corpora.Dictionary(self.filtered_corpus)
        self.corpus = [self.dictionary.doc2bow(text) for text in self.filtered_corpus]
        self.lsi = models.LsiModel(self.corpus, id2word=self.dictionary, num_topics=100)
        self.index = similarities.MatrixSimilarity(self.lsi[self.corpus])
I now want to add a function to the class to allow adding documents to the corpus and updating the model accordingly.
I've found dictionary.add_documents, and model.add_documents, but there are two things that aren't clear to me:
When you originally create the LSI model, one of the parameters the function receives is id2word=dictionary. When updating the model, how do you tell it to use the updated dictionary? Is it actually unnecessary, or will it make a difference?
How do I update the index? From the documentation it looks like if I use the Similarity class, rather than the MatrixSimilarity class, I can add documents to the index, but I don't see such functionality for MatrixSimilarity. If I understood correctly, MatrixSimilarity is better if my input corpus contains dense vectors (which it does, because I'm using the LSI model). Do I have to switch to Similarity just so that I can update the index? Or, conversely, what's the complexity of creating this index? If it's insignificant, should I just create a new index with my updated corpus, as follows:
Code:
self.dictionary.add_documents(new_docs)  # new_docs is already after filtering stop words
new_corpus = [self.dictionary.doc2bow(text) for text in new_docs]
self.lsi.add_documents(new_corpus)
self.corpus += new_corpus  # keep the stored corpus in sync so the rebuilt index covers the new docs
self.index = similarities.MatrixSimilarity(self.lsi[self.corpus])
Thanks. :)

Well, it seems that it doesn't update the dictionary; it just adds new documents, not new features, so you should take a different approach.
I had the same problem, and I found this issue on the gensim GitHub helpful.
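For reference, a minimal sketch of one such approach: retrain from the extended corpus so that new vocabulary is actually picked up. It reuses the attribute layout of the SimilarityFiles class above; the update_with method name is just for illustration, not the exact fix from the linked issue.
def update_with(self, new_docs):
    # new_docs: already tokenized and stop-word filtered, like filtered_corpus
    self.filtered_corpus.extend(new_docs)
    self.dictionary = corpora.Dictionary(self.filtered_corpus)
    self.corpus = [self.dictionary.doc2bow(text) for text in self.filtered_corpus]
    self.lsi = models.LsiModel(self.corpus, id2word=self.dictionary, num_topics=100)
    self.index = similarities.MatrixSimilarity(self.lsi[self.corpus])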

Related

Whitelist tokens for text generation (XLNet, GPT-2) in huggingface-transformers

In the documentation on text generation (https://huggingface.co/transformers/main_classes/model.html#generative-models) there is the option to put
bad_words_ids (List[int], optional) – List of token ids that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use tokenizer.encode(bad_word, add_prefix_space=True).
Is there also the option to put something along the lines of "allowed_words_ids"? The idea would be to restrict the language of the generated texts.
I'd also suggest doing what Sahar Mills said. You can do it in the following way.
You get the whole vocab of the model you are using, e.g.
from transformers import AutoTokenizer
# Load tokenizer
checkpoint = "CenIA/distillbert-base-spanish-uncased" #Example model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
vocab = tokenizer.get_vocab()
list(vocab.keys())[:100] # to see the first 100 words
Define the words you do want the model to generate (everything else will end up on the bad-words list).
words_to_delete = ['forzado', 'vendieron', 'verticales'] # or load them from somewhere else
Define a function to create the bad_words_ids, that is, the whole model vocab minus the words you want to keep:
def create_bad_words_ids(bad_words_ids, words_to_delete):
    for word in words_to_delete:
        if word in bad_words_ids:
            bad_words_ids.remove(word)
    # generate() expects token ids (one list per banned token), not strings
    return [[vocab[word]] for word in bad_words_ids]

bad_words_ids = create_bad_words_ids(bad_words_ids=list(vocab.keys()), words_to_delete=words_to_delete)
print(bad_words_ids)
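To actually apply the list, pass it to generate. A minimal sketch, assuming a generative checkpoint (the DistilBERT above only illustrates getting a vocab; gpt2 stands in here, and the bad_words_ids would then need to be built from that model's own tokenizer):
from transformers import AutoTokenizer, AutoModelForCausalLM

gen_checkpoint = "gpt2"  # any generative model
gen_tokenizer = AutoTokenizer.from_pretrained(gen_checkpoint)
gen_model = AutoModelForCausalLM.from_pretrained(gen_checkpoint)

inputs = gen_tokenizer("The weather today", return_tensors="pt")
# bad_words_ids built as above, but from gen_tokenizer.get_vocab()
output_ids = gen_model.generate(**inputs, max_length=20, bad_words_ids=bad_words_ids)
print(gen_tokenizer.decode(output_ids[0], skip_special_tokens=True))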
Hope it helps,
cheers

Is it possible to modify a cp_sat model after construction?

I have a model for finding a particular class of integer numbers (the "Keith numbers"). It works well but is quite slow, as it requires constructing a new model many times. Is there a way to update a model, in particular to change the coefficients in a constraint? In other words, can I change the model to match a different mat without reconstructing the whole thing?
def _construct_model(self, mat):
    model = cp_model.CpModel()
    digit = [model.NewIntVar(0, 9, f'digit[{i}]') for i in range(self.k)]
    # Creates the constraint.
    model.Add(sum([mat[i] * digit[i] for i in range(self.k)]) == 0)
    model.Add(digit[0] != 0)
    return model, digit
Yes, but you are on your own.
You can access the underlying cp_model_proto protobuf from the model, and modify it directly.
We have no plan currently to add a modification API on top of the cp_model API.
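For illustration, a minimal sketch of that direct route, along the lines of the _construct_model above. The proto field names (constraints[i].linear.vars / coeffs) come from the CP-SAT model proto and are worth checking against your OR-Tools version.
from ortools.sat.python import cp_model

model = cp_model.CpModel()
digit = [model.NewIntVar(0, 9, f'digit[{i}]') for i in range(3)]
model.Add(sum(2 * digit[i] for i in range(3)) == 0)  # constraint 0: all coefficients are 2
model.Add(digit[0] != 0)

proto = model.Proto()                 # underlying CpModelProto, editable in place
linear = proto.constraints[0].linear  # the linear equality added first
print(list(linear.vars), list(linear.coeffs))
linear.coeffs[0] = 7                  # change one coefficient without rebuilding the model
# then re-solve with cp_model.CpSolver().Solve(model) as usual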

Gensim-python: Is there a simple way to get the number of times a given token arises in all documents?

My gensim model is like this:
class MyCorpus(object):
    parametersList = []

    def __init__(self, dictionary):
        self.dictionary = dictionary

    def __iter__(self):
        # for line in open('mycorpus.txt'):
        for line in texts:
            # assume there's one document per line, tokens separated by whitespace
            yield self.dictionary.doc2bow(line[0].lower().split())

if __name__ == "__main__":
    texts = [['human human interface computer'],
             ['survey user user computer system system system response time'],
             ['eps user interface system'],
             ['system human system eps'],
             ['user response time'],
             ['trees'],
             ['graph trees'],
             ['graph minors trees'],
             ['graph minors minors survey survey survey']]
    dictionary = corpora.Dictionary(line[0].lower().split() for line in texts)
    corpus = MyCorpus(dictionary)
The frequency of each token in each document is automatically evaluated.
I also can define the tf-idf model and access the tf-idf statistic for each token in each document.
model = TfidfModel(corpus)
However, I have no clue how to count, in a memory-friendly way, the number of documents in which a given word appears. How can I do that? [Sure, I could work it out from the tf-idf values and the document frequencies, but I would like to get it directly from some counting process.]
For instance, for the first document, I would like to get something like
[('human',2), ('interface',2), ('computer',2)]
since each of those tokens appears in two documents of the corpus.
For the second:
[('survey',2), ('user',3), ('computer',2), ('system',3), ('response',2), ('time',2)]
How about this?
from collections import Counter

documents = [...]  # one string per document
count_dict = [Counter(document.lower().split()) for document in documents]
total = sum(count_dict, Counter())  # total occurrences of each token across all documents
I assumed that each of your strings is a different document/file; you can make related changes. I also adjusted the code accordingly.
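If what you want is the number of documents each token appears in (which is what the examples above show), count each token at most once per document. A sketch, assuming the texts list from the question; gensim's dictionary.dfs also holds the same counts, keyed by token id.
from collections import Counter

documents = [line[0] for line in texts]  # unwrap the one-element lists
doc_freq = sum((Counter(set(document.lower().split())) for document in documents), Counter())

first_tokens = list(dict.fromkeys(documents[0].lower().split()))  # unique tokens, original order
print([(token, doc_freq[token]) for token in first_tokens])
# [('human', 2), ('interface', 2), ('computer', 2)]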

Feeding list of tf.truncated_normal() or list of dictionaries into a Tensorflow Model

I am new to tensorflow and I am trying to learn how to use the tool efficiently.
I expand on the question below but here is the tldr:
I am wondering what is the best way to feed the following weights and biases into my model with feed_dict:
def generate_initial_population(my_population_size):
    my_weights = []
    my_biases = []
    for _ in range(my_population_size):
        my_weights.append({
            'h1': tf.Variable(tf.truncated_normal([n_inputs, n_hidden_1])),
            'h2': tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2])),
            'out': tf.Variable(tf.truncated_normal([n_hidden_2, n_class]))
        })
        my_biases.append({
            'b1': tf.Variable(tf.truncated_normal([n_hidden_1])),
            'b2': tf.Variable(tf.truncated_normal([n_hidden_2])),
            'out': tf.Variable(tf.truncated_normal([n_class]))
        })
    return my_weights, my_biases

weights, biases = generate_initial_population(population_size)
I cannot simply use feed_dict={weights_ph: weights} because it generates errors, and I do not know how to deal with this problem efficiently.
Examining the code at the end might help with understanding what I am talking about.
I am wondering if there is any way I could feed a list containing tf.truncated_normals to my model.
I get ValueError: setting an array element with a sequence., I believe because it is trying to convert the list to an np.array but has issues with the dimensions.
I have found an easy workaround where I evaluate all the tensors first with session.run and then feed the resulting values into my model.
I am just not sure whether this is the right solution, since I would expect it to be slower because you have to run the session twice.
This solution also doesn't work, however, if my original list does not have a regular shape, like [[1, [1, 2]]], or when my truncated_normals do not all have the same shape.
I was thinking I would just feed my oddly shaped list into my model and then use tf.gather to get the specific indices I want to work on.
Since I cannot do that, is my solution the proper way to deal with this: simply evaluate the truncated_normals first, feed those values into the model, and then reshape the list inside the model if needed?
I am also having a very similar problem because I want to feed a list of dictionaries into the model as well. Is the proper way of dealing with that to extract the data from the dictionaries and then feed in each value from each key separately?
I am trying to learn, and I couldn't find this information elsewhere.
Here is a code snippet I designed to fail, to explain what I mean (a sketch of the session-run workaround follows it):
import tensorflow as tf

list_ph = tf.placeholder(dtype=tf.float32)
index_ph = tf.placeholder(dtype=tf.int32)

def model(my_list, index):
    value = tf.gather(my_list, index, axis=0)
    return value

my_model = model(list_ph, index_ph)

with tf.Session() as sess:
    var_list = []
    truncated_normal = tf.Variable(tf.truncated_normal(shape=[5, 3]))
    for i in range(4):
        var_list.append(truncated_normal)
    # for i in range(4):
    #     var_list.append({i: i*2})

    sess.run(tf.global_variables_initializer())

    # will work but will not work for dictionaries
    val = sess.run(var_list)

    # will not work, but will work if you feed val
    var = sess.run(my_model, feed_dict={list_ph: var_list, index_ph: 1})
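For completeness, a minimal sketch of the session-run workaround described above: materialize the values first, then feed the resulting numpy arrays. It reuses list_ph, index_ph, my_model, and var_list from the snippet and belongs inside the same session block.
# inside the `with tf.Session() as sess:` block above
values = sess.run(var_list)                                    # four (5, 3) numpy arrays
out = sess.run(my_model, feed_dict={list_ph: values, index_ph: 1})
print(out.shape)                                               # (5, 3): element 1 of the stacked list

# sess.run also accepts nested structures (e.g. a list of dicts of tensors), so weight
# dictionaries can be materialized the same way and each key's array fed to its own placeholder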

Doc2Vec.infer_vector keeps giving different results every time on a particular trained model

I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
I modified the code in line 10 to determine the best-matching document for a given query, and every time I run it, I get a completely different result set. My new code in line 10 of the notebook is:
inferred_vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
rank = [docid for docid, sim in sims]
print(rank)
Every time I run this piece of code, I get a different set of documents matching the query "only you can prevent forest fires". The difference is stark, and the results just do not seem to match.
Is Doc2Vec not a suitable match for querying and information extraction? Or are there bugs?
Look into the code: in infer_vector you are using parts of the algorithm that are non-deterministic. Initialization of the word vectors is deterministic - see the code of seeded_vector below - but when we look further, random sampling of words and negative sampling (updating only a sample of word vectors per iteration) can cause non-deterministic output (thanks @gojomo).
def seeded_vector(self, seed_string):
    """Create one 'random' vector (but deterministic by seed_string)"""
    # Note: built-in hash() may vary by Python version or even (in Py3.x) per launch
    once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
    return (once.rand(self.vector_size) - 0.5) / self.vector_size
Set negative=0 to avoid randomization:
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [list('asdf'), list('asfasf')]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]
model = Doc2Vec(documents, vector_size=20, window=5, min_count=1, negative=0, workers=6, epochs=10)

a = list('test sample')
b = list('testtesttest')
for s in (a, b):
    v1 = model.infer_vector(s)
    for i in range(100):
        v2 = model.infer_vector(s)
        assert np.all(v1 == v2), "Failed on %s" % (''.join(s))
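For contrast, a quick check of the non-determinism itself: the same setup but with negative left at its default. Expect the inferred vectors to differ slightly between calls, which is exactly the behaviour described in the question.
model_default = Doc2Vec(documents, vector_size=20, window=5, min_count=1, workers=1, epochs=10)
v1 = model_default.infer_vector(a)
v2 = model_default.infer_vector(a)
print(np.allclose(v1, v2))  # usually False: negative sampling makes inference stochastic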
