Keras Embedding Layer not accepting input - python-3.x

I am trying to create a forecasting model which takes the lag features along with embeddings to predict next 10 days cumulative value. Embedding layer is trained by using the order basket with gensim.
Below is my network:
from keras.layers import Embedding, Flatten, Input, Dense, Dropout, Flatten, Activation
inp = Input(shape=(1, )) #ucode length will be 1
x = Embedding(len(model.wv.vocab), WV_DIM,
weights=[model.wv.vectors],
trainable=False)(inp)
x = Flatten()(x)
x = Dense(32, activation='relu', name='Embedding_out')(x)
features_input = Input(shape=(122,)) ##lag Features
concat = concatenate([features_input, x],name="ConcatenatedwFeatures")
output = Dense(256, activation="relu",name="L1_Relu")(concat)
output = Dense(128, activation="relu",name="L2_Relu")(output)
output = Dense(1)(output)
EmbeddingModel = Model(inputs=[inp,features_input], outputs=output)
EmbeddingModel.summary()
adam = optimizers.adam(clipvalue=1.,lr=3e-4)
EmbeddingModel.compile(loss='mse',
optimizer=adam,
metrics = ['mae', 'mse'])
hist = EmbeddingModel.fit([ucode_array[20:25],X_train[20:25]], [y_train[20:25]], validation_split=0.05,
epochs=10, batch_size=32)
Error:
ValueError: could not convert string to float: 'I33946'
Input Values:
ucode_array=sales_train_grid['ucode']
ucode_array[20:25]
15683 I33946
15685 I33946
15687 I33946
15688 126310
15689 126310
Name: ucode, dtype: object
Testing if value is present in embedding layer:
test1=model.wv.most_similar(positive=['I00731'], topn=10)
display(test1)
[x[0] for x in test1]
Returns 10 similar objects. Returns none if i had pasted any random values.
Following things tried:
1. ucode_array[20:25].values
2. ucode_array[20:25].values.tolist()
gensim version: 3.4.0
TensorFlow version: 1.12.0

Usually, we have to feed numerical values to the training process.
Making sure converting all the object and strings to embeddings will solve the problem.
putting down basic preprocessing operation which before we actually, for others to find this helpful.
sample code.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(list(model.wv.vocab.keys())
encoded_ucode = tokenizer.texts_to_sequences(ucode_array)

Check this,
use float to convert a string into a float. I guess this will solve your problem.

Related

is it possible to do continuous training in keras for multi-class classification problem?

I was tried to continues training in keras.
because I was build keras multiclass classification model after I have new labels and values. so I want to build a new model without retraining. that is why I tried continuous train in keras.
model.add(Dense(10, activation='sigmoid'))
model.compile(optimizer='rmsprop',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.fit(training_data, labels, epochs=20, batch_size=1)
model.save("keras_model.h5")
after completing save the model , i want to do continues training. so i tried,
model1 = load_model("keras_model.h5")
model1.fit(new_input, new_label, epochs=20, batch_size=1)
model1.save("keras_model.h5")
I tried this. but it was thrown an error. like previously 10 classes. but now we add new class means an error occurred.
so what is my question is, is it possible for continues training in keras for multiclass classification for a new class?
tensorflow.python.framework.errors_impl.InvalidArgumentError: Received
a label value of 10 which is outside the valid range of [0, 9). Label
values: 10 [[{{node
loss/dense_7_loss/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits}}]]
The typical approach for this type of situations is to define a common model that contains most of the inner layers and is reusable; and then a second model that defines the output layer and thus the number of classes. The inner model can be reused in subsequent outer models.
Untested example:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
def make_inner_model():
""" An example model that takes 42 features and outputs a
transformation vector.
"""
inp = Input(shape=(42,), name='in')
h1 = Dense(80, activation='relu')(inp)
h2 = Dense(40)(h1)
h3 = Dense(60, activation='relu')(h2)
out = Dense(32)(h3)
return Model(inp, out)
def make_outer_model(inner_model, n_classes):
inp = Input(shape=(42,), name='inp')
hidden = inner_model(inp)
out = Dense(n_classes, activation='softmax')(hidden)
model = Model(inp, out)
model.compile('adam', 'categorical_crossentropy')
return model
inner_model = make_inner_model()
inner_model.save('inner_model_untrained.h5')
model1 = make_outer_model(inner_model, 10)
model1.summary()
# model1.fit()
# inner_model.save_weights('inner_model_weights_1.h5')
model2 = make_outer_model(inner_model, 12)
# model2.fit()
# inner_model.save_weights('inner_model_weights_2.h5')

model.predict in keras using universal sentence encoder giving shape error

I am using keras model.predict to predict sentiments. I am using universal sentence embeddings. While predicting, I am getting the error described below.
Please provide your valuable insights.
Regards.
I have run the code for two sets of inputs. For say, input1, the prediction is obtained. While its not working for input 2.
Input 1 is the form : {(a1,[sents1]),....}
Input 2:{((a1,a2),[sents11])),...}
The input for predicting is the [sents1], [sents11] etc. extracted from this.
I could see the related question in (Keras model.predict function giving input shape error). But I don't know whether its resolved. Further, input1 is working.
import tensorflow as tf
import keras.backend as K
from keras import layers
from keras.models import Model
import numpy as np
def UniversalEmbedding(x):
return embed(tf.squeeze(tf.cast(x, tf.string)), signature="default", as_dict=True)["default"]
input_text = layers.Input(shape=(1,), dtype=tf.string)
embedding = layers.Lambda(UniversalEmbedding, output_shape=(embed_size,))(input_text)
dense = layers.Dense(256, activation='relu')(embedding)
pred = layers.Dense(category_counts, activation='softmax')(dense)
model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
sents1=list(input2.items())
with tf.Session() as session:
K.set_session(session)
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer())
# model.load_weights(.//)
for i,ch in enumerate(sents1):
new_text=ch[1]
if len(new_text)>1:
new_text = np.array(new_text, dtype=object)[:, np.newaxis]
predicts = model.predict(new_text, batch_size=32)
InvalidArgumentError: input must be a vector, got shape: [] [[{{node
lambda_2/module_1_apply_default/tokenize/StringSplit}} =
StringSplit[skip_empty=true,
_device="/job:localhost/replica:0/task:0/device:CPU:0"](lambda_2/module_1_apply_default/RegexReplace_1,
lambda_2/module_1_apply_default/tokenize/Const)]]
Try removing trailing blanks at the start of the sentence.
new_text.strip()
USE preprocessed sentences by splitting on space, creating some empty lists from trailing spaces, which cannot be embedded.
(Hope this answer is not too late)
Also could be some missing values in sentences, without text. Need to exclude these.

LSTM Model in Keras with Auxiliary Inputs

I have a dataset with 2 columns - Each column contains a set of documents. I have to match the document in Col A with documents provided in Col B. This is a supervised classification problem. So my training data contains a label column indicating whether the documents match or not.
To solve the problem, I have a created a set of features, say f1-f25 (by comparing the 2 documents) and then trained a binary classifier on these features. This approach works reasonably well, but now I would like to evaluate Deep Learning models on this problem (specifically, LSTM models).
I am using the keras library in Python. After going through the keras documentation and other tutorials available online, I have managed to do the following:
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model
# Each document contains a series of 200 words
# The necessary text pre-processing steps have been completed to transform
each doc to a fixed length seq
main_input1 = Input(shape=(200,), dtype='int32', name='main_input1')
main_input2 = Input(shape=(200,), dtype='int32', name='main_input2')
# Next I add a word embedding layer (embed_matrix is separately created
for each word in my vocabulary by reading from a pre-trained embedding model)
x = Embedding(output_dim=300, input_dim=20000,
input_length=200, weights = [embed_matrix])(main_input1)
y = Embedding(output_dim=300, input_dim=20000,
input_length=200, weights = [embed_matrix])(main_input2)
# Next separately pass each layer thru a lstm layer to transform seq of
vectors into a single sequence
lstm_out_x1 = LSTM(32)(x)
lstm_out_x2 = LSTM(32)(y)
# concatenate the 2 layers and stack a dense layer on top
x = keras.layers.concatenate([lstm_out_x1, lstm_out_x2])
x = Dense(64, activation='relu')(x)
# generate intermediate output
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(x)
# add auxiliary input - auxiliary inputs contains 25 features for each document pair
auxiliary_input = Input(shape=(25,), name='aux_input')
# merge aux output with aux input and stack dense layer on top
main_input = keras.layers.concatenate([auxiliary_output, auxiliary_input])
x = Dense(64, activation='relu')(main_input)
x = Dense(64, activation='relu')(x)
# finally add the main output layer
main_output = Dense(1, activation='sigmoid', name='main_output')(x)
model = Model(inputs=[main_input1, main_input2, auxiliary_input], outputs= main_output)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit([x1, x2,aux_input], y,
epochs=3, batch_size=32)
However, when I score this on the training data, I get the same prob. score for all cases. The issue seems to be with the way auxiliary input is fed in (because it generates meaningful output when I remove the aux. input).
I also tried inserting the auxiliary input at different places in the network. But somehow I couldnt get this to work.
Any pointers?
Well, this is open for several months and people are voting it up.
I did something very similar recently using this dataset that can be used to forecast credit card defaults and it contains categorical data of customers (gender, education level, marriage status etc.) as well as payment history as time series. So I had to merge time series with non-series data. My solution was very similar to yours by combining LSTM with a dense, I try to adopt the approach to your problem. What worked for me is dense layer(s) on the auxiliary input.
Furthermore in your case a shared layer would make sense so the same weights are used to "read" both documents. My proposal for testing on your data:
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model
# Each document contains a series of 200 words
# The necessary text pre-processing steps have been completed to transform
each doc to a fixed length seq
main_input1 = Input(shape=(200,), dtype='int32', name='main_input1')
main_input2 = Input(shape=(200,), dtype='int32', name='main_input2')
# Next I add a word embedding layer (embed_matrix is separately created
for each word in my vocabulary by reading from a pre-trained embedding model)
x1 = Embedding(output_dim=300, input_dim=20000,
input_length=200, weights = [embed_matrix])(main_input1)
x2 = Embedding(output_dim=300, input_dim=20000,
input_length=200, weights = [embed_matrix])(main_input2)
# Next separately pass each layer thru a lstm layer to transform seq of
vectors into a single sequence
# Comment Manngo: Here I changed to shared layer
# Also renamed y as input as it was confusing
# Now x and y are x1 and x2
lstm_reader = LSTM(32)
lstm_out_x1 = lstm_reader(x1)
lstm_out_x2 = lstm_reader(x2)
# concatenate the 2 layers and stack a dense layer on top
x = keras.layers.concatenate([lstm_out_x1, lstm_out_x2])
x = Dense(64, activation='relu')(x)
x = Dense(32, activation='relu')(x)
# generate intermediate output
# Comment Manngo: This is created as a dead-end
# It will not be used as an input of any layers below
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(x)
# add auxiliary input - auxiliary inputs contains 25 features for each document pair
# Comment Manngo: Dense branch on the comparison features
auxiliary_input = Input(shape=(25,), name='aux_input')
auxiliary_input = Dense(64, activation='relu')(auxiliary_input)
auxiliary_input = Dense(32, activation='relu')(auxiliary_input)
# OLD: merge aux output with aux input and stack dense layer on top
# Comment Manngo: actually this is merging the aux output preparation dense with the aux input processing dense
main_input = keras.layers.concatenate([x, auxiliary_input])
main = Dense(64, activation='relu')(main_input)
main = Dense(64, activation='relu')(main)
# finally add the main output layer
main_output = Dense(1, activation='sigmoid', name='main_output')(main)
# Compile
# Comment Manngo: also define weighting of outputs, main as 1, auxiliary as 0.5
model.compile(optimizer=adam,
loss={'main_output': 'w_binary_crossentropy', 'aux_output': 'binary_crossentropy'},
loss_weights={'main_output': 1.,'auxiliary_output': 0.5},
metrics=['accuracy'])
# Train model on main_output and on auxiliary_output as a support
# Comment Manngo: Unknown information marked with placeholders ____
# We have 3 inputs: x1 and x2: the 2 strings
# aux_in: the 25 features
# We have 2 outputs: main and auxiliary; both have the same targets -> (binary)y
model.fit({'main_input1': __x1__, 'main_input2': __x2__, 'auxiliary_input' : __aux_in__}, {'main_output': __y__, 'auxiliary_output': __y__},
epochs=1000,
batch_size=__,
validation_split=0.1,
callbacks=[____])
I don't know how much this can help since I don't have your data so I can't try. Nevertheless this is my best shot.
I didn't run the above code for obvious reasons.
I found answers from https://datascience.stackexchange.com/questions/17099/adding-features-to-time-series-model-lstm Mr.Philippe Remy wrote a library to condition on auxiliary inputs. I used his library and it's very helpful.
# 10 stations
# 365 days
# 3 continuous variables A and B => C is target.
# 2 conditions dim=5 and dim=1. First cond is one-hot. Second is continuous.
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from cond_rnn import ConditionalRNN
stations = 10 # 10 stations.
time_steps = 365 # 365 days.
continuous_variables_per_station = 3 # A,B,C where C is the target.
condition_variables_per_station = 2 # 2 variables of dim 5 and 1.
condition_dim_1 = 5
condition_dim_2 = 1
np.random.seed(123)
continuous_data = np.random.uniform(size=(stations, time_steps, continuous_variables_per_station))
condition_data_1 = np.zeros(shape=(stations, condition_dim_1))
condition_data_1[:, 0] = 1 # dummy.
condition_data_2 = np.random.uniform(size=(stations, condition_dim_2))
window = 50 # we split series in 50 days (look-back window)
x, y, c1, c2 = [], [], [], []
for i in range(window, continuous_data.shape[1]):
x.append(continuous_data[:, i - window:i])
y.append(continuous_data[:, i])
c1.append(condition_data_1) # just replicate.
c2.append(condition_data_2) # just replicate.
# now we have (batch_dim, station_dim, time_steps, input_dim).
x = np.array(x)
y = np.array(y)
c1 = np.array(c1)
c2 = np.array(c2)
print(x.shape, y.shape, c1.shape, c2.shape)
# let's collapse the station_dim in the batch_dim.
x = np.reshape(x, [-1, window, x.shape[-1]])
y = np.reshape(y, [-1, y.shape[-1]])
c1 = np.reshape(c1, [-1, c1.shape[-1]])
c2 = np.reshape(c2, [-1, c2.shape[-1]])
print(x.shape, y.shape, c1.shape, c2.shape)
model = Sequential(layers=[
ConditionalRNN(10, cell='GRU'), # num_cells = 10
Dense(units=1, activation='linear') # regression problem.
])
model.compile(optimizer='adam', loss='mse')
model.fit(x=[x, c1, c2], y=y, epochs=2, validation_split=0.2)

How to implement word2vec CBOW in keras with shared Embedding layer and negative sampling?

I want to create a word embedding pretraining network which adds something on top of word2vec CBOW. Therefore, I'm trying to implement word2vec CBOW first. Since I'm very new to keras, I'm unable to figure out how to implement CBOW in it.
Initialization:
I have calculated the vocabulary and have the mapping of word to integers.
Input to the (yet to be implemented) network:
A list of 2*k + 1 integers (representing the central word and 2*k words in context)
Network Specification
A shared Embedding layer should take this list of integers and give their corresponding vector outputs. Further a mean of 2*k context vector is to be taken (I believe this can be done using add_node(layer, name, inputs=[2*k vectors], merge_mode='ave')).
It will be very helpful if anyone can share a small code-snippet of this.
P.S.: I was looking at word2veckeras, but couldn't follow its code because it also uses a gensim.
UPDATE 1:
I want to share the embedding layer in the network. The embedding layer should be able to take context words (2*k) and the current word as well. I can do this by taking all 2*k + 1 word indices in the input and write a custom lambda function which will do the needful. But, after that I also want to add negative sampling network for which I'll have to take embedding of more words and dot product with the context vector. Can someone provide with an example where Embedding layer is a shared node in the Graph() network
Graph() has been deprecated from keras
Any arbitrary network can be created by using keras functional API.
Following is the demo code which created a word2vec cbow model with negative sampling tested on randomized inputs
from keras import backend as K
import numpy as np
from keras.utils.np_utils import accuracy
from keras.models import Sequential, Model
from keras.layers import Input, Lambda, Dense, merge
from keras.layers.embeddings import Embedding
k = 3 # context windows size
context_size = 2*k
neg = 5 # number of negative samples
# generate weight matrix for embeddings
embedding = []
for i in range(10):
embedding.append(np.full(100, i))
embedding = np.array(embedding)
print embedding
# Creating CBOW model
word_index = Input(shape=(1,))
context = Input(shape=(context_size,))
negative_samples = Input(shape=(neg,))
shared_embedding_layer = Embedding(input_dim=10, output_dim=100, weights=[embedding])
word_embedding = shared_embedding_layer(word_index)
context_embeddings = shared_embedding_layer(context)
negative_words_embedding = shared_embedding_layer(negative_samples)
cbow = Lambda(lambda x: K.mean(x, axis=1), output_shape=(100,))(context_embeddings)
word_context_product = merge([word_embedding, cbow], mode='dot')
negative_context_product = merge([negative_words_embedding, cbow], mode='dot', concat_axis=-1)
model = Model(input=[word_index, context, negative_samples], output=[word_context_product, negative_context_product])
model.compile(optimizer='rmsprop', loss='mse', metrics=['accuracy'])
input_context = np.random.randint(10, size=(1, context_size))
input_word = np.random.randint(10, size=(1,))
input_negative = np.random.randint(10, size=(1, neg))
print "word, context, negative samples"
print input_word.shape, input_word
print input_context.shape, input_context
print input_negative.shape, input_negative
output_dot_product, output_negative_product = model.predict([input_word, input_context, input_negative])
print "word cbow dot product"
print output_dot_product.shape, output_dot_product
print "cbow negative dot product"
print output_negative_product.shape, output_negative_product
Hope it helps!
UPDATE 1:
I've completed the code and uploaded it here
You could try something like this. Here I've initialized the embedding matrix to a fixed value. For an input array of shape (1, 6) you'll get the output of shape (1, 100) where the 100 is the average of the 6 input embedding.
model = Sequential()
k = 3 # context windows size
context_size = 2*k
# generate weight matrix for embeddings
embedding = []
for i in range(10):
embedding.append(np.full(100, i))
embedding = np.array(embedding)
print embedding
model.add(Embedding(input_dim=10, output_dim=100, input_length=context_size, weights=[embedding]))
model.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(100,)))
model.compile('rmsprop', 'mse')
input_array = np.random.randint(10, size=(1, context_size))
print input_array.shape
output_array = model.predict(input_array)
print output_array.shape
print output_array[0]

Keras - Text Classification - LSTM - How to input text?

Im trying to understand how to use LSTM to classify a certain dataset that i have.
I researched and found this example of keras and imdb :
https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py
However, im confused about how the data set must be processed to input.
I know keras has pre-processing text methods, but im not sure which to use.
The x contain n lines with texts and the y classify the text by happiness/sadness. Basically, 1.0 means 100% happy and 0.0 means totally sad. the numbers may vary, for example 0.25~~ and so on.
So my question is, How i input x and y properly? Do i have to use bag of words?
Any tip is appreciated!
I coded this below but i keep getting the same error:
#('Bad input argument to theano function with name ... at index 1(0-based)',
'could not convert string to float: negative')
import keras.preprocessing.text
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
print('Loading data...')
import pandas
thedata = pandas.read_csv("dataset/text.csv", sep=', ', delimiter=',', header='infer', names=None)
x = thedata['text']
y = thedata['sentiment']
x = x.iloc[:].values
y = y.iloc[:].values
###################################
tk = keras.preprocessing.text.Tokenizer(nb_words=2000, filters=keras.preprocessing.text.base_filter(), lower=True, split=" ")
tk.fit_on_texts(x)
x = tk.texts_to_sequences(x)
###################################
max_len = 80
print "max_len ", max_len
print('Pad sequences (samples x time)')
x = sequence.pad_sequences(x, maxlen=max_len)
#########################
max_features = 20000
model = Sequential()
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128, input_length=max_len, dropout=0.2))
model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(x, y=y, batch_size=200, nb_epoch=1, verbose=1, validation_split=0.2, show_accuracy=True, shuffle=True)
# at index 1(0-based)', 'could not convert string to float: negative')
Review how you are using your CSV parser to read the text in. Ensure that the fields are in the format Text, Sentiment if you want to to make use of the parser as you've written it in your code.

Resources