Keras imdb sentiment model - how to predict sentiment of new sentences? - keras

I'm working my way through the Deep Learning with Python book, where there is an example of learning word embeddings for sentiment:
from keras.datasets import imdb
from keras import preprocessing
max_features = 10000
maxlen = 20
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

model = Sequential()
model.add(Embedding(10000, 8, input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)
I would like to pass in a sentence and make a prediction on the sentiment. My first thought was to pass in an array of indices (because that's how the words are represented in the model, if I have understood correctly), such as:
import numpy as np
# does this reflect a really bad review?
model.predict(np.array([[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,]]))
[out] array([[ 0.0066505]], dtype=float32)
# does this reflect a really good review?
model.predict(np.array([[9999,9999,9999,9999,9999,9999,9999,9999,9999,9999,9999,9999,9999,9999,9999,9999,9999,9999,9999,9999 ]]))
[out] array([[ 0.64767915]], dtype=float32)
How can I pass in a list of words instead of indices? I.e. how can I retrieve a list of word indices for my new sentence?
Update - I tried to tokenize some words:
from keras.preprocessing.text import text_to_word_sequence

word_index = imdb.get_word_index()  # assumed source of word_index (not shown in the original snippet)

def index(word):
    if word in word_index:
        return word_index[word]
    else:
        return "0"

def sequences(words):
    words = text_to_word_sequence(words)
    seqs = [[index(word) for word in words if word != "0"]]
    return preprocessing.sequence.pad_sequences(seqs, maxlen=maxlen)
bad_seq = sequences("Rubbish terrible awful dreadful hate stinks")
good_seq = sequences("Awesome recommended brilliant best")
print("bad movie: " + str(model.predict(bad_seq))) # 0.00759153
print("good movie: " + str(model.predict(good_seq))) # 0.00423771
The predicted sentiment is very similar in both cases, which suggests that this tokenizing approach does not work.
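One plausible explanation, offered as a sketch rather than a confirmed fix: imdb.load_data offsets every word index by 3 (index_from=3 by default), reserving 0 for padding, 1 for the start marker and 2 for unknown words, whereas the raw dictionary from imdb.get_word_index() is not offset. An encoding helper that applies the same offset might look like this (the encode function name is illustrative, not from the original post):
from keras.datasets import imdb
from keras.preprocessing.text import text_to_word_sequence

word_index = imdb.get_word_index()

def encode(sentence, maxlen=20, max_features=10000):
    words = text_to_word_sequence(sentence)
    # shift by the default index_from=3; missing or too-rare words map to 2 ("unknown")
    idxs = [word_index[w] + 3 if w in word_index and word_index[w] + 3 < max_features else 2
            for w in words]
    return preprocessing.sequence.pad_sequences([idxs], maxlen=maxlen)

print(model.predict(encode("rubbish terrible awful dreadful hate stinks")))
print(model.predict(encode("awesome recommended brilliant best")))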

Related

Is it possible to extend python imageai pretrained model for more classes?

I am working on a project which uses imageai with YOLOv3, which works fast and accurately for my purpose. However, this model can detect only 80 classes; I want some of those classes, but I also want to add some more classes of my own.
I referred to https://imageai.readthedocs.io/en/latest/customdetection/index.html to train my own custom model with 3 more classes. However, I am unable to detect the 80 classes that were provided by YOLOv3. Is there a way to generate a model that extends the existing YOLOv3 and can detect all 80 classes + extra classes that I want?
P.S. I am new to tensorflow and imageai so I don't know too much. Please bear with me.
I have not yet found a way to extend an existing model, but I can assure you that training your own model is far more efficient than carrying all the classes no one needs.
If it still interests you, this person had a similar question: Loading a trained Keras model and continue training
This is his finished code example:
"""
Model by: http://machinelearningmastery.com/
"""
import numpy
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import load_model
numpy.random.seed(7)
def baseline_model():
model = Sequential()
model.add(Dense(num_pixels, input_dim=num_pixels, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
if __name__ == '__main__':
# load data
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# flatten 28*28 images to a 784 vector for each image
num_pixels = X_train.shape[1] * X_train.shape[2]
X_train = X_train.reshape(X_train.shape[0], num_pixels).astype('float32')
X_test = X_test.reshape(X_test.shape[0], num_pixels).astype('float32')
# normalize inputs from 0-255 to 0-1
X_train = X_train / 255
X_test = X_test / 255
# one hot encode outputs
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
num_classes = y_test.shape[1]
# build the model
model = baseline_model()
#Partly train model
dataset1_x = X_train[:3000]
dataset1_y = y_train[:3000]
model.fit(dataset1_x, dataset1_y, nb_epoch=10, batch_size=200, verbose=2)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Baseline Error: %.2f%%" % (100-scores[1]*100))
#Save partly trained model
model.save('partly_trained.h5')
del model
#Reload model
model = load_model('partly_trained.h5')
#Continue training
dataset2_x = X_train[3000:]
dataset2_y = y_train[3000:]
model.fit(dataset2_x, dataset2_y, nb_epoch=10, batch_size=200, verbose=2)
scores = model.evaluate(X_test, y_test, verbose=0)
print("Baseline Error: %.2f%%" % (100-scores[1]*100))
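Since the original question is about adding classes on top of an already trained model, the closest general Keras pattern is to load the saved model, freeze its layers, and attach a new, wider output head. This is only a sketch of that idea applied to the toy classifier above; it assumes training data labelled for all classes is available, and it is not specific to imageai or YOLOv3:
from tensorflow.keras.models import load_model, Model
from tensorflow.keras.layers import Dense

base = load_model('partly_trained.h5')
for layer in base.layers:
    layer.trainable = False                              # keep the previously learned weights fixed
features = base.layers[-2].output                        # reuse everything up to the old output layer
new_output = Dense(13, activation='softmax')(features)   # e.g. 10 original classes + 3 new ones
extended = Model(inputs=base.input, outputs=new_output)
extended.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# extended.fit(...) then requires one-hot labels covering all 13 classes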

Keras: Multiclass CNN Text Classifier predicts the same class for all input data

I am trying to predict one of 8 classes for some short text data using word embeddings and a CNN. The problem is that the CNN classifier predicts everything as one class (the largest one in the training data, with a distribution of ~40%). The text data should be preprocessed quite well, because classification works fine with other classifiers like SVM or NB.
# stemmed_text, total_texts, labels, w2v and vocab come from earlier preprocessing (not shown)
import os
import numpy as np

# first, build index mapping words in the embeddings set to their embedding vector
embeddings_index = {}
f = open(os.path.join('', 'embedding_word2vec.txt'), encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

# vectorize text samples in a 2D integer tensor
from tensorflow.python.keras.preprocessing.text import Tokenizer
tokenizer_obj = Tokenizer()
tokenizer_obj.fit_on_texts(stemmed_text)
sequences = tokenizer_obj.texts_to_sequences(stemmed_text)

# pad sequences
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
max_length = max([len(s.split(',')) for s in total_texts])
print('MaxLength :', max_length)
word_index = tokenizer_obj.word_index
print('Found %s unique Tokens.' % len(word_index))
text_pad = pad_sequences(sequences)
labels = to_categorical(np.asarray(labels))
print('Shape of label tensor:', labels.shape)
print('Shape of Text Tensor: ', text_pad.shape)

embedding_dim = 100
num_words = len(word_index) + 1
vocab_size = num_words  # vocabulary size used for the embedding matrix and the Embedding layer

# prepare embedding matrix
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for index, word in enumerate(vocab):
    embedding_vector = w2v.wv[word]
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

# train test split
validation_split = 0.2
indices = np.arange(text_pad.shape[0])
np.random.shuffle(indices)
text_pad = text_pad[indices]
labels = labels[indices]
num_validation_samples = int(validation_split * text_pad.shape[0])
x_train = text_pad[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_test = text_pad[-num_validation_samples:]
y_test = labels[-num_validation_samples:]

from keras.models import Sequential
from keras.layers.core import Dense, Flatten, Dropout
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.layers.embeddings import Embedding

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=max_length,
                    # embeddings_initializer=Constant(embedding_matrix),
                    weights=[embedding_matrix],
                    trainable=False))
model.add(Dropout(0.2))
model.add(Conv1D(filters=64, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=4))
# model.add(Conv1D(filters=64, kernel_size=5, activation='relu'))
# model.add(MaxPooling1D(pool_size=4))
model.add(Flatten())
model.add(Dense(8, activation='sigmoid'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, y_train,
                    epochs=25,
                    validation_data=(x_test, y_test))
print(model.summary())
I am grateful for every hint.
Thank you.
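One thing worth checking, offered as a sketch rather than a verified fix: with 8 mutually exclusive classes and a categorical_crossentropy loss, the output layer is normally softmax rather than sigmoid, e.g.:
model.add(Dense(8, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])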

Training a Neural Network for Word Embedding

Attached is the link to the file of entities. I want to train a neural network to represent each entity as a vector. Attached is my code for training:
import pandas as pd
import numpy as np
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.models import Model
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Input
from keras.layers.embeddings import Embedding
from sklearn.model_selection import train_test_split
file_path = '/content/drive/My Drive/Colab Notebooks/Deep Learning/NLP/Data/entities.txt'
df = pd.read_csv(file_path, delimiter = '\t', engine='python', quoting = 3, header = None)
df.columns = ['Entity']
Entity = df['Entity']
X_train, X_test = train_test_split(Entity, test_size = 0.10)
print('Total Entities: {}'.format(len(Entity)))
print('Training Entities: {}'.format(len(X_train)))
print('Test Entities: {}'.format(len(X_test)))
vocab_size = len(Entity)
X_train_encode = [one_hot(d, vocab_size,lower=True, split=' ') for d in X_train]
X_test_encode = [one_hot(d, vocab_size,lower=True, split=' ') for d in X_test]
model = Sequential()
model.add(Embedding(input_length=1,input_dim=vocab_size, output_dim=100))
model.add(Flatten())
model.add(Dense(vocab_size, activation='softmax'))
model.compile(optimizer='adam', loss='mse', metrics=['acc'])
print(model.summary())
model.fit(X_train_encode, X_train_encode, epochs=20, batch_size=1000, verbose=1)
The following error is encountered when I try to execute the code:
Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 array(s), but instead got the following list of 34826 arrays:
You are passing a list of Numpy arrays to model.fit. The following code produces lists of arrays for X_train_encode and X_test_encode:
X_train_encode = [one_hot(d, vocab_size,lower=True, split=' ') for d in X_train]
X_test_encode = [one_hot(d, vocab_size,lower=True, split=' ') for d in X_test]
Convert these lists into Numpy arrays when passing them to the model.fit method:
X_train_encode = np.array(X_train_encode)
X_test_encode = np.array(X_test_encode)
Also, I don't see the need to one_hot encode X_train and X_test: the embedding layer expects integers (in your case word indexes), not one-hot encoded values of the words' indexes. So if X_train and X_test are arrays of word indexes, you can feed them directly into the model.fit method.
EDIT:
Currently the 'mse' loss is being used. Since the last layer is a softmax layer, a cross-entropy loss is more applicable here. And since the targets are integer class values (word indexes), sparse categorical cross-entropy should be used as the loss:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
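Putting the two changes together, a minimal sketch, assuming each encoded entity reduces to a single integer id (the model above uses input_length=1):
import numpy as np

X_train_encode = np.array(X_train_encode)   # shape (num_entities, 1) under the single-id assumption
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
model.fit(X_train_encode, X_train_encode, epochs=20, batch_size=1000, verbose=1)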

How do I implement a multilabel classification neural network with Keras

I am attempting to implement a neural network using Keras for a problem that involves multilabel classification. I understand that one way to tackle the problem is to transform it into several binary classification problems. I have implemented one of these, but am not sure how to proceed with the others, and in particular how to combine them. My data set has 5 input variables and 5 labels. Generally a single sample of data has 1-2 labels; it is rare to have more than two labels.
Here is my code (thanks to machinelearningmastery.com):
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataframe = pandas.read_csv("Realdata.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:5].astype(float)
Y = dataset[:,5]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# baseline model
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(5, input_dim=5, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    scores = model.evaluate(X, encoded_Y)
    print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    # make predictions....change the model.predict argument to whatever you want instead of X
    predictions = model.predict(X)
    # round predictions
    rounded = [round(x[0]) for x in predictions]
    print(rounded)
    return model
# evaluate model with standardized dataset
estimator = KerasClassifier(build_fn=create_baseline, epochs=100, batch_size=5, verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
print("Results: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
The approach you are referring to is the one-versus-all or the one-versus-one strategy for multi-label classification. However, when using a neural network, the easiest solution for a multi-label classification problem with 5 labels is to use a single model with 5 output nodes. With keras:
model = Sequential()
model.add(Dense(5, input_dim=5, kernel_initializer='normal', activation='relu'))
model.add(Dense(5, kernel_initializer='normal', activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd')
You can provide the training labels as binary-encoded vectors of length 5. For instance, an example that corresponds to classes 2 and 3 would have the label [0 1 1 0 0].
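For completeness, a small sketch of building such multi-hot label vectors; the multi_hot helper and the sample label lists below are illustrative, not from the original post:
import numpy as np

def multi_hot(label_lists, num_classes=5):
    # turn lists of class numbers (1-based, as in the example above) into length-5 binary vectors
    y = np.zeros((len(label_lists), num_classes), dtype='float32')
    for row, labels in enumerate(label_lists):
        for label in labels:
            y[row, label - 1] = 1.0
    return y

y_train = multi_hot([[2, 3], [1], [4, 5]])   # first row becomes [0, 1, 1, 0, 0]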

Keras Embedding layer output dimensionality

I am confused by the output dimensionality specified for the embedding layer in this code snippet:
from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.layers import Dense
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN
max_features = 10000
maxlen = 500
batch_size = 32
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
print(input_train)
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
Since the max_features is 10000, shouldn't the Embedding have an output dimensionality of 10000?
max_features is the number of words, not the dimensionality. In your embedding layer you have 10000 words that are each represented as an embedding with dimension 32.
The output dimensionality of the embedding is the dimension of the vector you use to represent each word. In your case, you use a 32-dimensional vector to represent each of the 10k words you might get in your dataset.
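A quick way to see this is to inspect the shapes directly; a sketch using the variables defined above:
from keras.models import Sequential
from keras.layers import Embedding

# each word id in a padded sequence of length 500 maps to a 32-dimensional vector,
# so the embedding output has shape (batch_size, 500, 32)
probe = Sequential([Embedding(max_features, 32)])
print(probe.predict(input_train[:2]).shape)   # expected: (2, 500, 32)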
