The goal:
To generate text in an author's style.
Input: an author's work to train on, and a seed for prediction
Output: generated text from that seed
Question about the embedding layer in Keras:
I have raw text: a flat text file containing a few thousand lines. I want to feed this into an embedding layer in Keras to vectorize the data. Here is what the text looks like:
--SNIP
The Wild West\n Ha ha, ride\n All you see is the sun reflectin\' off of the
--SNIP
and I call it input_text:
num_words = 2000  # keep the 2000 most frequent words
tok = Tokenizer(num_words)  # tokenize the words
tok.fit_on_texts(input_text)  # takes in a list of texts to train on
# put all words from the text into a words array
# this is essentially enumerating them
words = []
for iter in range(num_words):
    words += [key for key, value in tok.word_index.items() if value == iter + 1]
# words[:10]
# Class for vectorizing texts, and/or turning texts into sequences
# (= list of word indexes, where the word of rank i in the dataset (starting at 1) has index i).
X_train = tok.texts_to_sequences(input_text)  # turns text into sequences, stating which word comes in what place
X_train = sequence.pad_sequences(X_train, maxlen=100)  # pad/truncate each sequence to length 100 (0's are added at the start by default)
y_train = words
The problem:
It seems that my code takes in the sequence, and then when I apply padding it only keeps 100 elements of each sequence. How should I break the text apart?
Should I take the entire sequence, use the first 100 words as input (X) and the next word as the target (Y), and then slide forward with some skips along the way?
I want the output to be the probability of the next word, so I have a softmax layer at the end. Essentially I want to generate text from a seed. Is this the correct way of going about it, or is there a better approach?
I don't think you will find a better answer anywhere than on this page; the code is also available on GitHub. Dive in, or ask more questions.
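For reference, a minimal sketch of the sliding-window setup described in the question might look like the following. It is only an illustration: seq_len, the layer sizes, and the greedy generation loop are assumptions for the example, not part of the original code.

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.utils import to_categorical

num_words = 2000
seq_len = 100

tok = Tokenizer(num_words=num_words)
tok.fit_on_texts(input_text)  # input_text as in the question
# flatten all lines into one stream of word indexes
tokens = [i for line in tok.texts_to_sequences(input_text) for i in line]

# slide a window over the stream: 100 words in (X), the following word out (y)
X, y = [], []
for i in range(len(tokens) - seq_len):
    X.append(tokens[i:i + seq_len])
    y.append(tokens[i + seq_len])
X = np.array(X)
y = to_categorical(y, num_classes=num_words + 1)  # word indexes start at 1

model = Sequential([
    Embedding(num_words + 1, 64, input_length=seq_len),
    LSTM(128),
    Dense(num_words + 1, activation='softmax'),  # probability of the next word
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, epochs=10, batch_size=128)

# generate text from a seed by repeatedly predicting and appending the next word
def generate(seed_indices, n_words=20):
    index_word = {v: k for k, v in tok.word_index.items()}
    out = list(seed_indices)
    for _ in range(n_words):
        window = sequence.pad_sequences([out[-seq_len:]], maxlen=seq_len)
        next_idx = int(np.argmax(model.predict(window)[0]))
        out.append(next_idx)
    return ' '.join(index_word.get(i, '?') for i in out)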
Related
I need to change the embedding matrix in word2vec after training it. Here is an example:
import numpy as np
from gensim.models import Word2Vec

w2v = Word2Vec(sentences, size=100, window=1, min_count=1, negative=15, iter=3)
w2v.save("word2vec.model")

# Getting the embedding matrix
embedding_matrix = w2v.wv.vectors

for p in ("mujer", "hombre"):
    result = w2v.wv.similar_by_word(p)
    print("Similar words from '", p, "': ", result[:3])

# Trying to set the weights matrix
w2v.wv.vectors = np.random.rand(w2v.wv.vectors.shape[0], w2v.wv.vectors.shape[1])

print()
for p in ("mujer", "hombre"):
    result = w2v.wv.similar_by_word(p)
    print("Similar words from '", p, "': ", result[:3])
And here is the output:
Similar words from ' mujer ': [('honra', 0.9999152421951294), ('muerte', 0.9998959302902222), ('contento', 0.999891459941864)]
Similar words from ' hombre ': [('valor', 0.9999064207077026), ('nombre', 0.9998984336853027), ('llegar', 0.9998887181282043)]
Similar words from ' mujer ': [('honra', 0.9999152421951294), ('muerte', 0.9998959302902222), ('contento', 0.999891459941864)]
Similar words from ' hombre ': [('valor', 0.9999064207077026), ('nombre', 0.9998984336853027), ('llegar', 0.9998887181282043)]
As you can see, I get the same predictions despite having replaced the embedding matrix with random numbers.
I can't find any method in the documentation for making this change.
Is it possible?
In Python, you can generally tamper with an object's internal properties, because there's no "protection" making them invisible or unmodifiable from outside code.
Try assigning to w2v.wv.vectors (no underscore _vectors as in your original question).
Or, directly to an individual vector: w2v.wv['mujer'] = np.random.rand(100).
These should change the model. However, if you've already called most_similar() (or related functions), the model caches interim, unit-length-normalized versions of the vectors, and your direct tampering won't update that cache. You can clear the cache with:
del w2v.wv.vectors_norm
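Putting this together, a minimal sketch (assuming the trained w2v from the question and gensim 3.x attribute names) could look like this:

import numpy as np

# overwrite the raw embedding matrix with random values
w2v.wv.vectors = np.random.rand(*w2v.wv.vectors.shape)

# most_similar()/similar_by_word() cache unit-normalized copies of the vectors;
# drop that cache so the next query is computed from the new matrix
if getattr(w2v.wv, 'vectors_norm', None) is not None:
    del w2v.wv.vectors_norm

print(w2v.wv.similar_by_word('mujer')[:3])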
I already found the solution. Just use the init_sims() function after setting the array.
import numpy as np
from gensim.models import Word2Vec

w2v = Word2Vec(sentences, size=100, window=1, min_count=1, negative=15, iter=3)
w2v.save("word2vec.model")

# Getting the embedding matrix
embedding_matrix = w2v.wv.vectors

for p in ("mujer", "hombre"):
    result = w2v.wv.similar_by_word(p)
    print("Similar words from '", p, "': ", result[:3])

# Setting new values on the weights matrix
w2v.wv.vectors = np.random.rand(w2v.wv.vectors.shape[0], w2v.wv.vectors.shape[1])

# Reset the cached L2-normalized copy of the embedding matrix and rebuild it,
# so that similarity queries use the new vectors
w2v.wv.vectors_norm = None
w2v.wv.init_sims(replace=False)

print()
for p in ("mujer", "hombre"):
    result = w2v.wv.similar_by_word(p)
    print("Similar words from '", p, "': ", result[:3])
I have read a number of papers that talk about predicting missing words in a sentence. What I really want is to create a model that suggests words to complete an incomplete sentence.
Example:
Incomplete Sentence :
I bought an ___________ because its rainy.
Suggested Words:
umbrella
soup
jacket
In the papers I have read, they use the Microsoft Sentence Completion Dataset to predict missing words in a sentence.
Example :
Incomplete Sentence :
Im sad because you are __________
Missing Word Options:
a) crying
b) happy
c) pretty
d) sad
e) bad
I don't want to predict a missing word from a list of options. I want to suggest a list of words given an incomplete sentence. Is it feasible? Please enlighten me, because I'm really confused. What is the state-of-the-art model I can use for suggesting a list of (semantically coherent) words for an incomplete sentence?
Is it necessary that the suggested output words be included in the training dataset?
This is exactly how the BERT model was trained: mask some random words in the sentence and make your network predict these words. So yes, it is feasible. And no, it is not necessary to have the list of suggested words as a training input. However, these suggested words must be part of the overall vocabulary with which this BERT model was trained.
I adapted this answer to show how the completion function may work.
# install this package to obtain the pretrained model
# ! pip install -U pytorch-pretrained-bert
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval(); # turning off the dropout
def fill_the_gaps(text):
    text = '[CLS] ' + text + ' [SEP]'
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [0] * len(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    with torch.no_grad():
        predictions = model(tokens_tensor, segments_tensors)
    results = []
    for i, t in enumerate(tokenized_text):
        if t == '[MASK]':
            predicted_index = torch.argmax(predictions[0, i]).item()
            predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
            results.append(predicted_token)
    return results
print(fill_the_gaps(text = 'I bought an [MASK] because its rainy .'))
print(fill_the_gaps(text = 'Im sad because you are [MASK] .'))
print(fill_the_gaps(text = 'Im worried because you are [MASK] .'))
print(fill_the_gaps(text = 'Im [MASK] because you are [MASK] .'))
The [MASK] symbol indicates the missing words (there can be any number of them). [CLS] and [SEP] are BERT-specific special tokens. The outputs for these particular prints are
['umbrella']
['here']
['worried']
['here', 'here']
The duplication is not surprising: transformer networks are generally good at copying words, and from a semantic point of view these symmetric continuations do indeed look very likely.
Moreover, if it is not a random word that is missing but exactly the last word (or the last several words), you can use any language model (e.g. another famous SOTA language model, GPT-2) to complete the sentence.
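To illustrate that last point, here is a hedged sketch of end-of-sentence completion with GPT-2. It uses the Hugging Face transformers package (not the pytorch-pretrained-bert package used above), and the function name and greedy decoding choice are assumptions for the example.

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2')
gpt2_model.eval()  # turning off the dropout

def complete_sentence(prefix, n_tokens=3):
    input_ids = gpt2_tokenizer.encode(prefix, return_tensors='pt')
    with torch.no_grad():
        # greedy decoding: repeatedly append the most probable next token
        output_ids = gpt2_model.generate(input_ids,
                                         max_length=input_ids.shape[1] + n_tokens,
                                         do_sample=False)
    return gpt2_tokenizer.decode(output_ids[0])

print(complete_sentence('Im sad because you are'))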
I am implementing word2vec in gensim on a corpus of nested lists (a collection of tokenized sentences, each sentence a list of words) with 408,226 sentences (lists) and a total of 3,150,546 words or tokens.
I am getting meaningful results (in terms of the similarity between two words using model.wv.similarity) with the chosen values of size = 200, window = 15, min_count = 5, iter = 10 and alpha = 0.5. All words are lemmatized, and they are fed to the model, which has a vocabulary of 32,716.
The results obtained with the default alpha, size and window are meaningless for my data when computing the similarity values. However, the higher alpha value of 0.5 gives me meaningful similarity scores between two words. Yet when I compute the top-n most similar words, the results are again meaningless. Do I need to change all of the parameters used in the initial training process?
I am still unable to determine the exact reason why the model behaves well with such a high alpha value when computing the similarity between two words of the corpus, whereas it is meaningless when computing the top-n most similar words (with scores) for an input word. Why is this the case?
Is the model diverging from the optimal solution? How can I check this?
Any idea why this is the case would be deeply appreciated.
Note: I'm using Python 3.7 on a Windows machine with the Anaconda prompt, and I give input to the model from a file.
This is what I have tried.
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from gensim.models import Word2Vec
import ast

path = "F:/Folder/"

def load_data():
    global Sentences
    Sentences = []
    for file in ['data_d1.txt', 'data_d2.txt']:
        with open(path + file, 'r', encoding='utf-8') as f1:
            Sentences.extend(ast.literal_eval(*f1.readlines()))

load_data()

def initialize_word_embedding():
    model = Word2Vec(Sentences, size=200, window=15, min_count=5, iter=10, workers=4)
    print(model)
    print(len(model.wv.vocab))
    print(model.wv.similarity(w1='structure', w2='_structure_'))
    similarities = model.wv.most_similar('system')
    for word, score in similarities:
        print(word, score)

initialize_word_embedding()
An example of the Sentences list is as follows:
[['scientist', 'time', 'comet', 'activity', 'sublimation', 'carbon', 'dioxide', 'nears', 'ice', 'system'], ['inconsistent', 'age', 'system', 'year', 'size', 'collision'], ['intelligence', 'system'], ['example', 'application', 'filter', 'image', 'motion', 'channel', 'estimation', 'equalization', 'example', 'application', 'filter', 'system']]
data_d1.txt and data_d2.txt each contain a nested list (a list of lists of lemmatized, tokenized words). I preprocessed the raw data, saved it to these files, and now use them as input. For lemmatizing the tokens, I used the popular WordNet lemmatizer.
I need the word-embedding model to calculate the similarity between two words and to compute the most_similar words for a given input word. I am getting meaningful scores from the model.wv.similarity() method, but when calculating the most_similar() words of a word (say, system, as shown above), I am not getting the desired results.
I am guessing the model is diverging from the global minimum because of the high alpha value.
I am confused about what the dimension size and window should be to produce meaningful results, as there are no rules for how to choose the size and window.
Any suggestion is appreciated. The total number of sentences and words is specified above in the question.
Edit in response to a recent comment: the results below are what I get without setting alpha = 0.5.
Results:
Word2Vec(vocab=32716, size=200, alpha=0.025)
The similarity between set and _set_ is : 0.000269373188960656
which is meaningless for me, as it is far too low. However, I get 71% when setting alpha to 0.5, which seems meaningful to me because the word set is the same in both domains.
Explanation: the word set should be the same for both domains (I am comparing the data of two domains via the same word). Don't be confused by the word _set_: it is the same word as set, with an underscore injected at the start and end to distinguish the two different domains.
The top 10 words along with scores of _set_ are:
_niche_ 0.6891741752624512
_intermediate_ 0.6883598566055298
_interpretation_ 0.6813371181488037
_printer_ 0.675414502620697
_finer_ 0.6625382900238037
_pertinent_ 0.6620787382125854
_respective_ 0.6619025468826294
_converse_ 0.6610435247421265
_developed_ 0.659270167350769
_tent_ 0.6588765382766724
Whereas, the top 10 words for set are:
cardinality 0.633270263671875
typereduction 0.6233855485916138
zdzisław 0.619156002998352
crisp 0.6165326833724976
equivalenceclass 0.605925977230072
pawlak 0.6058803200721741
straight 0.6045454740524292
culik 0.6040038466453552
rin 0.6038737297058105
multisets 0.6035065650939941
Why is the cosine similarity value essentially 0.0 for the word set across the two different datasets?
I have a list of tokenized documents containing both unigrams and bigrams, and I would like to perform sklearn LDA on it. I have tried the following code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

my_data = [['low-rank matrix', 'detection method', 'problem finding'],
           ['probabilistic inference', 'problem finding', 'statistical learning', 'solution'],
           ['detection method', 'probabilistic inference', 'population', 'language'], ...]

tf_vectorizer = CountVectorizer(min_df=2, max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(my_data)

lda = LatentDirichletAllocation(n_topics=3, max_iter=5, random_state=10)
But when I print the output, I get something like this:
topic 0:
detection,finding, solution ,method,problem
topic 1:
language, statistical , problem, learning,finding
and so on..
The bigrams get broken up and separated from one another. I have 10,000 documents and have already tokenized them; the method for finding the bigrams is not NLTK-based, so that step is already done.
Is there any way to improve this without changing the input?
I am very new to using sklearn, so apologies in advance if I am making some obvious mistake.
CountVectorizer has an ngram_range param which decides whether the vocabulary will contain unigrams, bigrams, trigrams, etc.:
ngram_range : tuple (min_n, max_n)
    The lower and upper boundary of the range of n-values for different n-grams
    to be extracted. All values of n such that min_n <= n <= max_n will be used.
For example:
ngram_range=(1,1) => Will include only unigrams
ngram_range=(1,2) => Will include unigrams and bigrams
ngram_range=(2,2) => Will include only bigrams
and so on...
You have not set it, so the default ngram_range=(1,1) applies, and hence only unigrams are used here.
tf_vectorizer = CountVectorizer(min_df=2,
max_features=n_features,
stop_words='english',
ngram_range = (2,2)) # You need this
tf = tf_vectorizer.fit_transform(my_data)
Secondly, you say that you have already tokenized the data, and you show a list of lists (my_data) in your code. That doesn't work with CountVectorizer: it expects a simple list of strings and applies tokenization to them automatically. So you will need to plug your own preprocessing steps into it. See the 'preprocessor', 'tokenizer' and 'analyzer' params in the linked documentation.
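If you prefer to keep the pre-tokenized phrases exactly as they are, one option (a sketch, not the only way) is to bypass CountVectorizer's own tokenization by passing a pass-through callable as the analyzer, so each phrase stays a single vocabulary entry:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

my_data = [['low-rank matrix', 'detection method', 'problem finding'],
           ['probabilistic inference', 'problem finding', 'statistical learning', 'solution'],
           ['detection method', 'probabilistic inference', 'population', 'language']]

# the callable receives one document and must return its final tokens;
# returning the list unchanged keeps the multi-word phrases intact
tf_vectorizer = CountVectorizer(analyzer=lambda doc: doc, min_df=1)
tf = tf_vectorizer.fit_transform(my_data)

# n_topics was renamed n_components in newer sklearn versions
lda = LatentDirichletAllocation(n_topics=3, max_iter=5, random_state=10)
lda.fit(tf)

terms = tf_vectorizer.get_feature_names()
for topic_idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[:-6:-1]]
    print("topic %d: %s" % (topic_idx, ", ".join(top_terms)))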
I am trying to train the Gensim Word2Vec model by:
X = train['text']
model_word2vec = models.Word2Vec(X.values, size=150)
model_word2vec.train(X.values, total_examples=len(X.values), epochs=10)
After training, I get a small vocabulary (model_word2vec.wv.vocab) of length 74, containing only the letters of the alphabet.
How could I get the right vocabulary?
Update
I tried this before:
tokenizer = Tokenizer(lower=True)
tokenized_text = tokenizer.fit_on_texts(X)
sequence = tokenizer.texts_to_sequences(X)
model_word2vec.train(sequence, total_examples=len(X.values), epochs=10)
but I got the same wrong vocabulary size.
Supply the model with the kind of corpus it needs: a sequence of texts, where each text is a list-of-string-tokens. If you supply it with non-tokenized strings instead, it will think each single character is a token, giving the results you're seeing.
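For example, a minimal sketch, assuming train['text'] holds one raw string per row and using gensim's simple_preprocess purely for illustration:

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# turn each raw string into a list of string tokens
sentences = [simple_preprocess(doc) for doc in X.values]

# passing the tokenized corpus to the constructor builds the vocab and trains in one go
model_word2vec = Word2Vec(sentences, size=150, min_count=5, workers=4)
print(len(model_word2vec.wv.vocab))  # now a word-level vocabulary, not 74 characters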