LSTM autoencoder for variable length text input in Keras

Here padded_docs.shape=(736,50). Since this is an autoencoder, the input and the target are the same array. The output of the last LSTM layer is 3-dimensional, but padded_docs, which is used as the target, is 2-dimensional. How can I fix this?
df1=pd.read_csv('snapdeal_data.csv')
df1=df1.head(1000)
df2=df1['Review_Text']
labels=df1['B_Helpfulness']
# encode full sentence into vector
encoded_docs=[one_hot(d,vocab_size) for d in X_train]
print encoded_docs
#####Padding encoded sequence of words
max_length=50
padded_docs = sequence.pad_sequences(encoded_docs, maxlen=max_length, padding='pre')
print padded_docs
model = Sequential()
timesteps = padded_docs.shape[1]
input_dim = max_length
#inputs = Input(shape=(input_dim,))
model.add(Embedding(vocab_size+1, 100,weights=[embedding_matrix],input_length=max_length,trainable=False))
model.add(LSTM(200,return_sequences = True))
model.add(LSTM(100,return_sequences = True))
model.add(LSTM(50))
model.add(RepeatVector(timesteps))
model.add(LSTM(100,return_sequences = True))
model.add(LSTM(200,return_sequences = True))
model.add(LSTM(input_dim,return_sequences = True))
model.compile(loss='mean_squared_error', optimizer='Adam')
model.summary()
model.fit(padded_docs,padded_docs,epochs=100,batch_size=1,shuffle=True, verbose=2)
ValueError: Error when checking target: expected lstm_6 to have 3 dimensions, but got array with shape (736, 50)

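I think return_sequences=True should not be kept in the last LSTM layer of this model. With it, the last layer emits one vector per timestep, so the output has shape (samples, 50, 50), while the target padded_docs has shape (736, 50). Without it, the last layer returns only its final state, of shape (samples, 50), which has the same number of dimensions as the target.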
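The shape mismatch behind this error can be checked by hand. Below is a minimal sketch in plain Python that traces shapes through the layer stack from the question. The helper `trace_shapes` and its layer tuples are hypothetical and exist only for illustration; they are not part of Keras, and the sketch assumes the Embedding layer produces a (736, 50, 100) tensor as configured in the question.

```python
def trace_shapes(input_shape, layers):
    """Follow a (batch, timesteps, features) shape through a layer stack.

    Hypothetical helper, not a Keras API. Each layer is either
    ("lstm", units, return_sequences) or ("repeat", n).
    """
    shape = tuple(input_shape)
    for layer in layers:
        if layer[0] == "lstm":
            _, units, return_sequences = layer
            batch, timesteps = shape[0], shape[1]
            # return_sequences=True keeps the time axis, False drops it
            shape = (batch, timesteps, units) if return_sequences else (batch, units)
        elif layer[0] == "repeat":
            # RepeatVector(n): (batch, features) -> (batch, n, features)
            batch, features = shape
            shape = (batch, layer[1], features)
        else:
            raise ValueError("unknown layer kind: %r" % (layer[0],))
    return shape


# Embedding output in the question: 736 documents, 50 timesteps, 100 features
embedded = (736, 50, 100)
encoder = [("lstm", 200, True), ("lstm", 100, True), ("lstm", 50, False), ("repeat", 50)]
decoder = [("lstm", 100, True), ("lstm", 200, True)]

as_posted = trace_shapes(embedded, encoder + decoder + [("lstm", 50, True)])
without_sequences = trace_shapes(embedded, encoder + decoder + [("lstm", 50, False)])

print(as_posted)          # 3-dimensional, does not match padded_docs
print(without_sequences)  # 2-dimensional, same rank as padded_docs
```

This only settles the dimension check that raised the ValueError. Whether regressing raw word indices with mean squared error is a good training target is a separate question.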

Related

How do I decode the output of my seq-to-seq model if I'm using an embedding layer?

I have a seq-to-seq model trained on some Cleverbot data:
just_phrases_X is a list of sentences and just_phrases_y is a list of responses to those sentences.
maxlen = 62
#low is a list of all the unique words.
def Convert_To_Encoding(just_phrases):
    encodings = []
    for sentence in just_phrases:
        onehotencoded = one_hot(sentence, len(low))
        encodings.append(np.array(onehotencoded))
    encodings_padded = pad_sequences(encodings, maxlen=maxlen, padding='post', value = 0.0)
    return encodings_padded
encodings_X_padded = Convert_To_Encoding(just_phrases_X)
encodings_y_padded = Convert_To_Encoding(just_phrases_y)
model = Sequential()
embedding_layer = Embedding(len(low), output_dim=8, input_length=maxlen)
model.add(embedding_layer)
model.add(GRU(128)) # input_shape=(None, 496)
model.add(RepeatVector(numberofwordsoutput)) #number of characters?
model.add(GRU(128, return_sequences = True))
model.add(Flatten())
model.add(Dense(62, activation = 'softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer= 'adam', metrics=['accuracy'])
model.summary()
model.fit(encodings_X_padded, encodings_y_padded, batch_size = 1, epochs=1) #, validation_data = (testX, testy)
model.save("cleverbottheseq-uel.h5")
When I use this model for prediction, the output will be between 0 and 1 because of my use of softmax. However as I have around 3000 unique words, each with a separate integer assigned to it, how do I essentially repeat what the model did during training and convert the output back to an integer which has a word assigned to it?
I don't think it is possible to create a seq2seq model with the Sequential API. Try creating the encoder and decoder separately with the Functional API. You need two inputs: the first for the encoder, the second for the decoder.
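On the decoding question itself: a softmax output is a probability distribution over classes, so the usual way back to an integer is an argmax over the class axis, followed by a lookup in an index-to-word table. The sketch below is plain Python and rests on two assumptions that differ from the posted code. First, it assumes the model emits one distribution over the vocabulary per output position, whereas the posted model ends in Flatten and Dense(62), which gives a single 62-way distribution. Second, it assumes the word index was built explicitly; Keras `one_hot` hashes words to integers, so different words can collide and the mapping cannot be reliably inverted. The names `build_index`, `argmax` and `decode` are hypothetical.

```python
def build_index(words):
    """Map each unique word to an integer; 0 is reserved for padding."""
    word_to_index = {w: i + 1 for i, w in enumerate(words)}
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word


def argmax(values):
    """Position of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def decode(prob_rows, index_to_word, pad_index=0):
    """Turn one probability row per output position into a sentence."""
    words = []
    for row in prob_rows:
        idx = argmax(row)
        if idx == pad_index:
            continue  # skip padding positions
        words.append(index_to_word.get(idx, "<unk>"))
    return " ".join(words)


# Toy vocabulary standing in for `low`, the list of unique words
low = ["hello", "how", "are", "you"]
word_to_index, index_to_word = build_index(low)

# Made-up model output: three positions, five classes (padding + 4 words)
predictions = [
    [0.10, 0.70, 0.10, 0.05, 0.05],
    [0.00, 0.10, 0.10, 0.10, 0.70],
    [0.90, 0.025, 0.025, 0.025, 0.025],
]
decoded = decode(predictions, index_to_word)
print(decoded)
```

The probabilities above are invented for the example and are not real model output.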

Change label format for training

Normally, if you train with Keras, model.fit expects the training data to have a shape of (samples, timesteps, input) and the labels a shape of (samples, outputs). Is there a way to change the matching label to (samples*timesteps, output) or (samples, timesteps, input), so that one sample is matched with one label per timestep and not with only one label?
Yes. You can have whatever shape you want as the output layer. For instance auto-encoders will have the same output shape as input shape.
A toy example:
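from keras.layers import Input, LSTM, RepeatVector, Dense
from keras.models import Model

sequence_length = 20
n_features = 4
def make_model():
    inp = Input(shape=(sequence_length, n_features,))
    encoder = LSTM(16, return_sequences=True)(inp)
    # the last encoder LSTM returns only its final state: the bottleneck vector
    vector = LSTM(32)(encoder)
    decoder_in = RepeatVector(sequence_length)(vector)
    decoder = LSTM(16, return_sequences=True)(decoder_in)
    # Dense is applied to every timestep, giving (sequence_length, n_features)
    out = Dense(n_features)(decoder)
    model = Model(inp, out)
    model.compile('adam', 'mse')
    return model
model = make_model()
model.summary()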
In this case the vector layer has shape (32,) (i.e. there is a dimensionality reduction compared to the input), and the output layer has the same shape as the input, (sequence_length, n_features).
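If the data currently has one label per sample and the goal is one label per timestep, the label array has to be expanded to (samples, timesteps, outputs) before training. A minimal sketch in plain Python follows, assuming the model ends in a per-timestep output like the toy example above. The helper names `repeat_labels` and `shape_of` are hypothetical.

```python
def repeat_labels(labels, timesteps):
    """Expand (samples, outputs) labels to (samples, timesteps, outputs)."""
    return [[list(label) for _ in range(timesteps)] for label in labels]


def shape_of(nested):
    """Shape of a rectangular nested list, for inspection only."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


# Three samples, each with a single 2-dimensional label
labels = [[0, 1], [1, 0], [1, 1]]
per_step = repeat_labels(labels, timesteps=20)
print(shape_of(labels), shape_of(per_step))
```

With NumPy arrays the same expansion can be written as `np.repeat(labels[:, None, :], timesteps, axis=1)`.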

Creating Variable Length Output for RNN in Keras

I'm trying to convert a sequence of length N to a sequence of roughly length N^2 using a pseudo seq2seq type model, but I'm not sure how to implement the variable input length in my Keras model.
def LSTMModel():
    input = Input(shape = (None,num_channels))
    lstm_one = LSTM(75, return_sequences = True)
    lstm_one_output = lstm_one(input)
    BiLSTM = Bidirectional(LSTM(units = 100, return_sequences=True, recurrent_dropout = 0.1))
    LSTM_outputs = BiLSTM(lstm_one_output)
    output = LSTM(2, return_sequences = False)(LSTM_outputs)
    return Model(input, output)
This code would produce a (None, 2) output, but I really want it to be a (None, None^2) output. Is there any way to somehow store the shape within the model and do some operations with it with keras layers, perhaps with a lambda function?

Keras: setting an array element with a sequence error

I am trying to build a very simple LSTM to classify text.
def encoded(texts):
    res = [one_hot(text, 100000, filters='!"#$%&()*+,-./:;<=>?#[\]^_`{|}~', split=' ') for text in texts]
    return res
def train(X, y, X_t, y_t):
    X = encoded(X)
    X_t = encoded(X_t)
    model = Sequential()
    model.add(Embedding(100000,100))
    model.add(Bidirectional(LSTM(20,return_sequences = True),merge_mode='ave'))
    model.add(TimeDistributed(Dense(1, activation='sigmoid')))
    model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    model.fit(np.array(X), np.array(y), batch_size=16, epochs=8)
    score = model.evaluate(np.array(X_t), np.array(y_t), batch_size = 16)
    print(score)
However, I got this error:
ValueError: setting an array element with a sequence.
It seems like the embedding layer didn't create a vector of the right dimension, or something is wrong with the format of the input X (X_t).
Any idea?
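One likely cause, assuming the texts contain different numbers of words: one_hot returns one list of integers per text, and these lists have different lengths, so the result cannot be converted into a rectangular numeric array. Padding every sequence to a common length, which is what pad_sequences does in the first question on this page, removes the raggedness. The sketch below imitates that in plain Python; `pad_pre` is a hypothetical helper, not the Keras function, and it assumes maxlen is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        trimmed = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded


# Ragged integer encodings, like the output of encoded(texts)
ragged = [[5, 9], [3], [7, 1, 4, 8, 2]]
rectangular = pad_pre(ragged, maxlen=4)
print(rectangular)
```

After padding, every row has the same length, so the list converts cleanly to a 2-dimensional array of shape (samples, maxlen).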

Classify a sequence using LSTM in keras

I am working on a binary classification problem, where the network takes two inputs and outputs the label of this input pair.
Basically, I first use an encoder layer to embed each input and then concatenate the two embeddings. Next, I want to use an RNN structure to classify the concatenated result, but I can't figure out a proper way to write the code. My code is below.
input_size = n_feature # the number of features
encoder_size = 2000 # output dim for each encoder
dropout_rate = 0.5
X1 = Input(shape=(input_size, ), name='input_1')
X2 = Input(shape=(input_size, ), name='input_2')
encoder = Sequential()
encoder.add(Dropout(dropout_rate, input_shape=(input_size, )))
encoder.add(Dense(encoder_size, activation='relu'))
encoded_1 = encoder(X1)
encoded_2 = encoder(X2)
merged = concatenate([encoded_1, encoded_2])
#----------Need Help---------------#
comparer = Sequential()
comparer.add(LSTM(512, input_shape=(encoder_size*2, ), return_sequences=True))
comparer.add(Dropout(dropout_rate))
comparer.add(TimeDistributed(Dense(1)))
comparer.add(Activation('sigmoid'))
#----------Need Help---------------#
Y = comparer(merged)
model = Model(inputs=[X1, X2], outputs=Y)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
It seems for the LSTM layer, the input should be (None, encoder_size*2). I tried to use Y = comparer(K.transpose(merged)) to reshape the input for the LSTM layer but I failed. BTW, for this network, the input shape is (input_size,) and output shape is (1,).
If the idea is to transform the input vector into a time series, you can simply reshape it:
comparer = Sequential()
#reshape the vector into a time series form: (None, timeSteps, features)
comparer.add(Reshape((2 * encoder_size, 1), input_shape=(2 * encoder_size,)))
#don't return sequences, you don't want a sequence as result:
comparer.add(LSTM(512, return_sequences=False))
comparer.add(Dropout(dropout_rate))
#Don't use a TimeDistributed, you're not dealing with a series anymore
comparer.add(Dense(1))
comparer.add(Activation('sigmoid'))
