Word-level sentence generation using Keras - python-3.x

I am new to Keras. I am trying to build a word-level sentence generation module using Keras.
I use a vocabulary size of 8000 and a one-hot-vector representation of the words in my corpus.
This is my model:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation
from keras.optimizers import RMSprop

model = Sequential()
model.add(LSTM(200, return_sequences=False, unroll=True, stateful=False, input_shape=(1, len(itw))))
model.add(Activation('tanh'))
model.add(Dense(len(itw)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

a, b = batch1()
model.fit(a, b, batch_size=45, nb_epoch=5, verbose=1)
Here the input is 3-dimensional (as a Keras LSTM expects 3D input), with shape (batch_size, 1, 8000). I made the second dimension 1 because I want to feed the words one by one.
batch_size is 45 because that is the average length of a sentence in the corpus.
All sentences are prepended with a "START_TOKEN" and appended with an "END_TOKEN".
len(itw) returns the length of the vocabulary, which in my case is 8000.
Now, after training, I want to loop the generated words back as input until a stop token is encountered, in order to generate a sentence.
But it seems that the model only considers the current input.
I expected the model to also take the previous inputs into account, not just the current one.
So how should I change my model?
Also, is Keras unfit for NLP applications involving varying-length time series?
How should I change the model so that it will generate the next word given n words, where n is any natural number?
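For reference, this is roughly the feedback loop I have in mind (a minimal sketch; it assumes itw maps indices to words and wti is my inverse word-to-index table):
import numpy as np

# minimal sketch of the intended loop: start from START_TOKEN and feed each
# predicted word back in until END_TOKEN (or a length cap) is produced
def generate_sentence(model, max_len=45):
    words = []
    current = wti["START_TOKEN"]
    for _ in range(max_len):
        x = np.zeros((1, 1, len(itw)))      # one sample, one time step, one-hot word
        x[0, 0, current] = 1.0
        probs = model.predict(x)[0]
        current = int(np.argmax(probs))     # greedy pick of the next word
        if itw[current] == "END_TOKEN":
            break
        words.append(itw[current])
    return " ".join(words)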

Related

Embedding layer in neural machine translation with attention

I am trying to understand how to implement a seq-to-seq model with attention from this website.
My question: does nn.Embedding just return some ID for each word, so that the embedding for each word stays the same during the whole training? Or do the embeddings change during training?
My second question: I am confused about whether, after training, the output of nn.Embedding is something like word2vec word embeddings or not.
Thanks in advance.
According to the PyTorch docs:
A simple lookup table that stores embeddings of a fixed dictionary and size.
This module is often used to store word embeddings and retrieve them using indices. The input to the module is a list of indices, and the output is the corresponding word embeddings.
In short, nn.Embedding embeds a sequence of vocabulary indices into a new embedding space. You can indeed roughly understand this as a word2vec style mechanism.
As a dummy example, let's create an embedding layer with a vocabulary of 10 tokens (i.e. the input data contains only 10 unique tokens), which returns embedded word vectors living in a 5-dimensional space. In other words, each word is represented as a 5-dimensional vector. The dummy data is a sequence of 3 words with indices 1, 2, and 3, in that order.
>>> embedding = nn.Embedding(10, 5)
>>> embedding(torch.tensor([1, 2, 3]))
tensor([[-0.7077, -1.0708, -0.9729,  0.5726,  1.0309],
        [ 0.2056, -1.3278,  0.6368, -1.9261,  1.0972],
        [ 0.8409, -0.5524, -0.1357,  0.6838,  3.0991]],
       grad_fn=<EmbeddingBackward>)
You can see that each of the three words is now represented as a 5-dimensional vector. We also see that there is a grad_fn, which means that the weights of this layer will be adjusted through backprop. This answers your question of whether embedding layers are trainable: the answer is yes. And indeed this is the whole point of embeddings: we expect the embedding layer to learn meaningful representations, the famous king - man + woman ≈ queen relation being the classic example of what these embedding layers can learn.
Edit
The embedding layer is, as the documentation states, a simple lookup table from a matrix. You can see this by doing
>>> embedding.weight
Parameter containing:
tensor([[-1.1728, -0.1023,  0.2489, -1.6098,  1.0426],
        [-0.7077, -1.0708, -0.9729,  0.5726,  1.0309],
        [ 0.2056, -1.3278,  0.6368, -1.9261,  1.0972],
        [ 0.8409, -0.5524, -0.1357,  0.6838,  3.0991],
        [-0.4569, -1.9014, -0.0758, -0.6069, -1.2985],
        [ 0.4545,  0.3246, -0.7277,  0.7236, -0.8096],
        [ 1.2569,  1.2437, -1.0229, -0.2101, -0.2963],
        [-0.3394, -0.8099,  1.4016, -0.8018,  0.0156],
        [ 0.3253, -0.1863,  0.5746, -0.0672,  0.7865],
        [ 0.0176,  0.7090, -0.7630, -0.6564,  1.5690]], requires_grad=True)
You will see that the rows at indices 1, 2, and 3 of this matrix (its second, third, and fourth rows) correspond to the result that was returned in the example above. In other words, for a token whose index is n, the embedding layer will simply "look up" the nth row in its weight matrix and return that row vector; hence the lookup table.
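As a small sanity check you can run yourself (reusing the embedding layer from above), indexing the weight matrix directly gives the same vector as calling the layer:
>>> torch.equal(embedding(torch.tensor([3])).squeeze(0), embedding.weight[3])
True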

Should BERT embeddings be made on tokens or sentences?

I am making a sentence classification model and using BERT word embeddings in it. Due to the very large dataset, I combined all the sentences into one string and made embeddings on the tokens generated from that string.
s = " ".join(text_list)
len(s)
Here s is the string and text_list contains the sentences on which I want to make my word embeddings.
I then tokenize the string
stokens = tokenizer.tokenize(s)
My question is: will BERT perform better if it is given one whole sentence at a time, or is making embeddings on the tokens of the whole concatenated string also fine?
Here is the code for my embedding generator
pool = []
all = []
i = 0
while i != 600000:
    chunk = stokens[i:i+500]    # slice into a new name so stokens itself is not overwritten
    chunk = ["[CLS]"] + chunk + ["[SEP]"]
    input_ids = get_ids(chunk, tokenizer, max_seq_length)
    input_masks = get_masks(chunk, max_seq_length)
    input_segments = get_segments(chunk, max_seq_length)
    a, b = embedd(input_ids, input_masks, input_segments)
    pool.append(a)
    all.append(b)
    print(i)
    i += 500
What I am essentially doing here is: the token sequence has a length of 600000, and I take 500 tokens at a time, generate embeddings for them, and append the result to a list called pool.
For classification, you don't have to concatenate the sentences. By concatenating, you are merging sentences of different classes.
If it is BERT fine-tuning, by default a logistic regression layer is learnt on top of the [CLS] token for the classification task. Since it is an attention-based transformer model, each token has attended to the other tokens and has captured the context, so the [CLS] token is sufficient.
However, if you want to use the embeddings directly, you can learn a classifier on a single vector, i.e. the [CLS] token embedding or the averaged embeddings of all the tokens. Or you can get the embeddings for each token, keep them as a sequence, and learn it using other classifiers such as a CNN or an RNN.
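For illustration, here is a minimal sketch of per-sentence embeddings using the Hugging Face transformers API (a different interface from the get_ids/get_masks helpers in your code, so treat it as a sketch of the idea rather than a drop-in replacement):
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# encode each sentence separately instead of one concatenated string
enc = tokenizer(text_list, padding=True, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

cls_embeddings = out.last_hidden_state[:, 0, :]       # one [CLS] vector per sentence
mean_embeddings = out.last_hidden_state.mean(dim=1)   # or a simple average over all tokens (padding included here)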

How to learn the embeddings in Pytorch and retrieve it later

I am building a recommendation system where I predict the best item for each user given their purchase history of items. I have userIDs and itemIDs and how much of each itemID was purchased by each userID. I have millions of users and thousands of products. Not all products have been purchased (there are some products that no one has bought yet). Since the numbers of users and items are large, I don't want to use one-hot vectors. I am using PyTorch and I want to create and train the embeddings so that I can make the predictions for each user-item pair. I followed this tutorial https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html. If it's an accurate assumption that the embedding layer is being trained, then do I retrieve the learned weights through the model.parameters() method, or should I use the embedding.data.weight option?
model.parameters() returns all the parameters of your model, including the embeddings.
So all these parameters of your model are handed over to the optimizer (line below) and will be trained later when calling optimizer.step() - so yes, your embeddings are trained along with all the other parameters of the network. (You could also freeze certain layers by setting e.g. embedding.weight.requires_grad = False, but that is not the case here.)
# summing it up:
# this line specifies which parameters are trained with the optimizer
# model.parameters() just returns all parameters
# embedding class weights are also parameters and will thus be trained
optimizer = optim.SGD(model.parameters(), lr=0.001)
You can see that your embedding weights are also of type Parameter by doing so:
import torch
embedding_matrix = torch.nn.Embedding(10, 10)
print(type(embedding_matrix.weight))
This will output the type of the weights, which is Parameter:
<class 'torch.nn.parameter.Parameter'>
I'm not entirely sure what you mean by retrieve. Do you mean getting a single vector, do you want the whole matrix so you can save it, or something else?
embedding_matrix = torch.nn.Embedding(5, 5)
# this will get you a single embedding vector
print('Getting a single vector:\n', embedding_matrix(torch.LongTensor([0])))
# of course you can do the same for a sequence
print('Getting vectors for a sequence:\n', embedding_matrix(torch.LongTensor([1, 2, 3])))
# this will give you the whole embedding matrix
print('Getting weights:\n', embedding_matrix.weight.data)
Output:
Getting a single vector:
tensor([[-0.0144, -0.6245, 1.3611, -1.0753, 0.5020]], grad_fn=<EmbeddingBackward>)
Getting vectors for a sequence:
tensor([[ 0.9277, -0.1879, -1.4999,  0.2895,  0.8367],
        [-0.1167, -2.2139,  1.6918, -0.3483,  0.3508],
        [ 2.3763, -1.3408, -0.9531,  2.2081, -1.5502]],
       grad_fn=<EmbeddingBackward>)
Getting weights:
tensor([[-0.0144, -0.6245,  1.3611, -1.0753,  0.5020],
        [ 0.9277, -0.1879, -1.4999,  0.2895,  0.8367],
        [-0.1167, -2.2139,  1.6918, -0.3483,  0.3508],
        [ 2.3763, -1.3408, -0.9531,  2.2081, -1.5502],
        [-0.5829, -0.1918, -0.8079,  0.6922, -0.2627]])
I hope this answers your question. You can also take a look at the documentation; there you can find some useful examples as well.
https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
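If by retrieve you mean persisting the learned matrix for later use, here is a minimal sketch (the layer size and file name are just placeholders for your own setup):
import torch

embedding_matrix = torch.nn.Embedding(5, 5)

# copy the learned matrix out of the layer (rows = indices, columns = embedding dims)
learned = embedding_matrix.weight.data.clone()

# persist it and load it back later
torch.save(learned, "embeddings.pt")
restored = torch.load("embeddings.pt")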

Wrong length for Gensim Word2Vec's vocabulary

I am trying to train the Gensim Word2Vec model by:
X = train['text']
model_word2vec = models.Word2Vec(X.values, size=150)
model_word2vec.train(X.values, total_examples=len(X.values), epochs=10)
after the training, I get a small vocabulary (model_word2vec.wv.vocab) of length 74 containing only the alphabet's letters.
How could I get the right vocabulary?
Update
I tried this before:
tokenizer = Tokenizer(lower=True)
tokenized_text = tokenizer.fit_on_texts(X)
sequence = tokenizer.texts_to_sequences(X)
model_word2vec.train(sequence, total_examples=len(X.values), epochs=10)
but I got the same wrong vocabulary size.
Supply the model with the kind of corpus it needs: a sequence of texts, where each text is a list-of-string-tokens. If you supply it with non-tokenized strings instead, it will think each single character is a token, giving the results you're seeing.
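For example, a minimal sketch with a simple whitespace tokenizer (parameter names follow the gensim 3.x API used in your size=150 call; substitute your own tokenization as needed):
from gensim import models

# each text becomes a list of string tokens
tokenized = [text.split() for text in train['text']]

model_word2vec = models.Word2Vec(tokenized, size=150, min_count=1)
print(len(model_word2vec.wv.vocab))   # now a word-level vocabulary, not single characters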

Keras: Embedding and LSTM layers for POS-Tagging task

I have a list of tagged sentences. I transformed each of them in the following way:
For each word, get the corresponding one-hot encoding (a vector of dimension input_dim);
Insert a pre-padding as explained in the example below;
Split each sentence into len(sentence) sub-sentences, using a window of size time_steps (to get the context for the prediction of the next word).
For example, using time_steps=2, a single sentence ["this", "is", "an", "example"] is transformed into:
[
[one_hot_enc("empty_word"), one_hot_enc("empty_word")],
[one_hot_enc("empty_word"), one_hot_enc("this")],
[one_hot_enc("this"), one_hot_enc("is")],
[one_hot_enc("is"), one_hot_enc("an")],
]
At the end, considering all the sub-sentences as a single list, the shape of the training data X_train is (num_samples, time_steps, input_dim), where:
input_dim: the size of my vocabulary;
time_steps: the length of the sequence fed into the LSTM;
num_samples: the number of samples (sub-sentences);
Now, I want to use an Embedding layer, in order to map each word to a smaller and continuous dimensional space, and an LSTM in which I use the contexts built as above.
I tried something like this:
model = Sequential()
model.add(InputLayer(input_shape=(time_steps, input_dim)))
model.add(Embedding(input_dim, embedding_size, input_length=time_steps))
model.add(LSTM(32))
model.add(Dense(output_dim))
model.add(Activation('softmax'))
But it gives me the following error:
ValueError: Input 0 is incompatible with layer lstm_1: expected ndim=3, found ndim=4
What am I missing? Is there some logical error in what I'm trying to do?
You can simply remove the InputLayer and let the Embedding layer be the first layer of the network.
Or you can construct the InputLayer without the input_dim dimension, like this:
model.add(InputLayer(input_shape=(time_steps,)))
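A minimal sketch of the first option; note the assumption that the inputs are integer word indices of shape (num_samples, time_steps) rather than one-hot vectors, since that is what an Embedding layer expects:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Activation

model = Sequential()
# Embedding as the first layer: input is (num_samples, time_steps) integer indices
model.add(Embedding(input_dim, embedding_size, input_length=time_steps))
model.add(LSTM(32))
model.add(Dense(output_dim))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')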
